# DAGs - a generalization of the alignment problem
## Specification
- Name: problem15 in Rosalind
- Name your notebook as: DAG.ipynb
- options: none
- input: filename passed as first parameter to main
- output: a text file. Using print(..., file=someFileObject) is a handy way to do this after you have opened someFileObject as a text file. I find it handy to name these files by concatenating the string named infile with ".out", giving rosalind5.txt.out (for example).
- Rosalind Problem Name: Find the longest path in a DAG
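
The output convention above can be sketched as follows. This is a minimal illustration, not a required interface: the function name `write_result` and its parameters are hypothetical, and the two-line layout (path length, then the nodes joined by "->") is an assumption about the format Rosalind expects for this problem.

```python
def write_result(infile, length, path):
    """Write the path length and the path to '<infile>.out' (hypothetical helper)."""
    outfile = infile + ".out"  # e.g. rosalind5.txt -> rosalind5.txt.out
    with open(outfile, "w") as out:
        # print(..., file=...) sends text to the open file instead of the screen
        print(length, file=out)
        print("->".join(str(node) for node in path), file=out)
    return outfile
```

Returning the output file name makes it easy to report where the result was written.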

As always, include an Inspection Intro Markdown that describes your specific algorithm at the beginning of the notebook, and another Inspection Results markdown at the end of the notebook that documents: your inspection team, the findings of the team, and your resolution of those findings.

Please submit your notebook, an example of one of the Rosalind files that you ran and passed, and the output that your program generated as a text file.

## Description
As a prelude to an upcoming assignment on profile alignments, please consider the related problem of generalized weighted DAGs (directed acyclic graphs). Rosalind has a very nice example that includes weighted edges connecting nodes. As applied to a simple pairwise alignment, nodes represent the assignment of a character in one of the sequences to either an initial insert/delete (indel), an extension of that indel, or a match/mismatch.

Unlike classical pairwise alignment, the order in which scores in the graph are resolved is established by the topology of the graph (topological ordering). This ordering avoids recursion in developing the final score. This is another, more generalized way of using dynamic programming (pp. 247-52 in Compeau and the detour on pp. 287-88).
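
The idea of scoring nodes in topological order can be sketched as below. This is one possible approach under stated assumptions, not the textbook's pseudocode or a reference solution: edges are assumed to be a list of (source, target, weight) tuples with integer nodes, the function names are hypothetical, and the whole edge list is assumed to be acyclic. The ordering counts unresolved inputs per node rather than deleting edges, so the graph stays intact for scoring and backtracking.

```python
from collections import defaultdict, deque


def topological_order(edges):
    """Order nodes so every edge points forward, without removing any edge."""
    edges = list(edges)
    successors = defaultdict(list)
    pending = defaultdict(int)  # number of incoming edges not yet resolved
    nodes = set()
    for src, dst, _ in edges:
        successors[src].append(dst)
        pending[dst] += 1
        nodes.update((src, dst))

    ready = deque(sorted(n for n in nodes if pending[n] == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in successors[node]:
            pending[nxt] -= 1  # one more input of nxt is now resolved
            if pending[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")
    return order


def longest_path(edges, begin, end):
    """Return (score, path) for the highest-scoring path from begin to end."""
    edges = list(edges)
    incoming = defaultdict(list)
    for src, dst, weight in edges:
        incoming[dst].append((src, weight))

    score = {begin: 0}  # only nodes reachable from begin ever receive a score
    back = {}           # best predecessor of each scored node
    for node in topological_order(edges):
        if node == begin:
            continue  # inputs to the start node are ignored
        candidates = [(score[src] + w, src)
                      for src, w in incoming[node] if src in score]
        if candidates:
            score[node], back[node] = max(candidates)
    if end not in score:
        raise ValueError("end is not reachable from begin")

    path = [end]
    while path[-1] != begin:
        path.append(back[path[-1]])
    return score[end], path[::-1]


sample = [(0, 1, 7), (0, 2, 4), (2, 3, 2), (1, 4, 1), (3, 4, 3)]
print(longest_path(sample, 0, 4))  # (9, [0, 2, 3, 4])
```

Because the sketch raises an error on any cycle, an input whose cycles lie outside the begin-to-end region would first need to be restricted to the relevant subgraph.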

## Hints
1) scoring of nodes that have no inputs (or possibly incomplete inputs) must be done carefully. Choose these wisely. In a pairwise alignment, these nodes would be considered boundary nodes.

2) The problem is quite clear on the starting/ending node to consider. Make sure to evaluate these properly.

3) each node can have many inputs or none. In the generalized DAG, there is no limit on the number of inputs. Consider a data structure here that allows you to represent the incoming edge along with a score that is computed by adding the weight of that incoming edge to the score of its preceding node.

4) As you write these classes, consider that these may be usable in the future.

5) sets and dictionaries require keys that are hashable, which in practice means immutable. Those keys can be integers, or floats, or tuples (as long as everything inside the tuple is hashable too). Think carefully about using tuples as containers-that-can-be-keys.

6) python has a container called a namedtuple. These are available in the collections module. They are quite handy.

7) python has two functions called any() and all(), each taking a single iterable; any() evaluates to True if at least one element of the iterable is true; all() evaluates to True if every element of the iterable is true. None, for example, evaluates to False.

8) consider implementing a helper method called _argMax_. If you happen to have a dictionary of values, it returns the key whose paired-value is maximal among all values in the dictionary.

9) Carefully consider the implementations that are provided in the text. Specifically, the topological ordering pseudocode removes edges from the constructed DAG. This is a bit annoying, considering that we need that DAG for scoring and backtracking. What purpose does removing an edge serve? Look over the Wikipedia article on "Dataflow" for an idea as to how you might avoid destroying your graph.

10) My implementation has specific objects that are nodes, edges, and the DAG itself. It is implemented in about 120 lines including whitespace.

11) Rosalind will provide test cases where the prescribed beginning and ending nodes have inputs/outputs that may induce cycles. In these problems, remember that we are considering the DAG that is described between (inclusively) the start and end node, so the inputs to start don't matter; the outputs from end don't matter either.
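
Hints 3 and 5 through 8 can be illustrated together in a few lines. This is only a sketch: the `Incoming` record, its field names, and the `arg_max` helper are hypothetical choices made for illustration, not a prescribed design.

```python
from collections import namedtuple

# Hypothetical record for one incoming edge: its source node, its weight, and
# the running score (score of the source node plus the edge weight).
Incoming = namedtuple("Incoming", ["source", "weight", "score"])


def arg_max(scores):
    """Return the key whose paired value is largest in the dict `scores`."""
    return max(scores, key=scores.get)


# Incoming edges of some node, keyed by the node each edge comes from.
inputs = {
    1: Incoming(source=1, weight=1, score=8),
    3: Incoming(source=3, weight=3, score=9),
}
best = arg_max({src: edge.score for src, edge in inputs.items()})  # 3

# namedtuples are tuples, so they are hashable and can live in sets or be keys.
seen = {inputs[1], inputs[3]}

# any() and all() each take ONE iterable, here a generator expression.
has_scored_input = any(edge.score is not None for edge in inputs.values())
all_scored = all(edge.score is not None for edge in inputs.values())
```

Keeping the running score on the edge record means the best predecessor for backtracking is simply the source of the maximal incoming record.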

 

Here is a template to use.

## Inspection Intro

In [1]:
class DAG:
    def __init__(self, begin, end):
        pass
    
    def generate_kmer_comp(self, length, sequence):
        kmer_list = []
        if length >= 2:
            for x in range(len(sequence)):
                if x+length <= len(sequence):
                    kmer_list.append(sequence[x:x+length])
        else:
            raise ValueError("Invalid length received")
        return kmer_list
    
    def reconstruct_string(self, kmer_list):
        string = kmer_list[0]
        for kmer in kmer_list[1:]:
            string += kmer[len(kmer)-1:]
            pass
        return string
    
    def overlap(self, list):
        adjacency_list = {x:0 for x in list}
        for key, value in adjacency_list.items():
            for y in list:
                if key[1:] == y[:-1]:
                    adjacency_list[key] = y

        return adjacency_list
        
    


def main(inFile = None):
    '''
    Do the main thing
    '''
    with open(inFile) as fh:
        begin = int(fh.readline().rstrip())
        end = int(fh.readline().rstrip())
        thisDAG = DAG(begin, end)
        
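        # loop over the remaining lines of the file (one edge per line);
        # iterating over fh.readline() would walk the characters of a single line
        for line in fh:
            pass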
        
    
    
    
    
if __name__ == "__main__":
    main(inFile = 'rosalind_ba5d.txt') 

FileNotFoundError: [Errno 2] No such file or directory: 'rosalind_ba5d.txt'
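
A possible way to fill in the file-reading loop of the template is sketched here. It assumes each edge line has the form source->target:weight (for example 0->1:7), which is the layout used by the Rosalind data sets for this problem as far as I know; the function names `parse_edge` and `read_dag_file` are hypothetical.

```python
def parse_edge(line):
    """Split a line such as '0->1:7' into (source, target, weight) integers."""
    nodes, weight = line.strip().split(":")
    source, target = nodes.split("->")
    return int(source), int(target), int(weight)


def read_dag_file(path):
    """Read the begin node, the end node, and the edge list from a text file."""
    with open(path) as fh:
        begin = int(fh.readline().strip())
        end = int(fh.readline().strip())
        # iterating over the file object yields the remaining lines one by one
        edges = [parse_edge(line) for line in fh if line.strip()]
    return begin, end, edges
```

Returning plain tuples keeps the parser independent of whatever node and edge classes the DAG ends up using.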

In [77]:
def generate_kmer_comp(length, sequence):
    kmer_list = []
    if length >= 2:
        for x in range(len(sequence)):
            if x+length <= len(sequence):
                kmer_list.append(sequence[x:x+length])
    else:
        raise ValueError("Invalid length received")
    return kmer_list
    
def reconstruct_string(kmer_list):
    string = kmer_list[0]
    for kmer in kmer_list[1:]:
        # each successive k-mer contributes only its last character
        string += kmer[-1:]
    return string

def overlap(kmers):
    # map each k-mer to the set of k-mers whose prefix matches its suffix
    adjacency_list = {x: set() for x in kmers}
    for key in adjacency_list:
        for y in kmers:
            if key[1:] == y[:-1]:
                adjacency_list[key].add(y)

    return adjacency_list

def generate_bruijn(k, sample_seq):
    nodes = generate_kmer_comp(k-1, sample_seq)
    final_dict = overlap(nodes)
    
    return final_dict

In [102]:
sample_kmers = ["ACCGA",
"CCGAA",
"CGAAG",
"GAAGC",
"AAGCT"]

sample_output = "ACCGAAGCT"

#for key, value in overlapGraph(sample_kmers).items():
 #   if value != 0:
  #      print("{} -> {}".format(key, value))
       

sample_seq = "AAGATTCTCTAC"
k = 4


lex = generate_bruijn(k, sample_seq)

for x, y in lex.items():
    if y:
        print("{} -> {}".format(x, ",".join(str(e) for e in y)))

AAG -> AGA
AGA -> GAT
GAT -> ATT
ATT -> TTC
TTC -> TCT
TCT -> CTA,CTC
CTC -> TCT
CTA -> TAC


## Inspection Results