## data classes

Here I import the main data classes for pangraph, which are defined in the `pangraph_classes` module.
The module provides the following classes:
- `Insertion`
- `Deletion`
- `Substitution`
- `Edits`: a collection of insertions, deletions and substitutions. Its `apply` method applies the edits to a reference sequence to obtain the query.
- `Node`: a node in a pangraph. It identifies a particular occurrence of a block (i.e. one row in the block alignment).
- `Block`: a block in a pangraph. It represents an alignment.
- `Path`: a path in a pangraph. It represents a genome as a sequence of blocks.
- `Pangraph`: a set of genomes represented as a collection of paths and blocks.

In [1]:
from pangraph_classes import Insertion, Deletion, Substitution, Edits
from pangraph_classes import Node, Block, Path
from pangraph_classes import Pangraph
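
The `pangraph_classes` module itself is not shown in this section. As a rough guide, the edit classes could be sketched with dataclasses as below. The field names (`pos`, `ins`, `length`, `alt`) and the order of the `ins`, `dels` and `subs` lists are inferred from how the classes are used and printed in the cells that follow; everything else is an assumption and not the actual module.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Insertion:
    pos: int  # 0-based reference position that the insertion precedes
    ins: str  # inserted sequence


@dataclass
class Deletion:
    pos: int     # 0-based reference position of the first deleted base
    length: int  # number of deleted bases


@dataclass
class Substitution:
    pos: int  # 0-based reference position
    alt: str  # base found in the query


@dataclass
class Edits:
    # positional order matches calls such as Edits([I], [D], [S])
    ins: List[Insertion] = field(default_factory=list)
    dels: List[Deletion] = field(default_factory=list)
    subs: List[Substitution] = field(default_factory=list)


example = Edits([Insertion(1, "G")], [Deletion(6, 1)], [Substitution(8, "A")])
print(example)
```

Using `field(default_factory=list)` gives every `Edits` instance its own empty lists, so `Edits()` can represent a sequence identical to the reference.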

Here is an example of a utility function that we might want: it applies a set of edits to a reference sequence to obtain the query.

In [2]:

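def apply_edits_to_ref(edits: Edits, ref: str) -> str:
    """
    Apply the edits to the reference to obtain the query sequence.
    Positions are 0-based reference coordinates; an insertion at `pos`
    is placed right before reference position `pos`.
    """
    qry = list(ref)
    # substitutions replace single characters in place
    for S in edits.subs:
        qry[S.pos] = S.alt
    # deletions are applied before insertions: an inserted string is attached
    # to a neighbouring slot and would be lost if that slot were blanked later
    for D in edits.dels:
        for l in range(D.length):
            qry[D.pos + l] = ""
    for I in edits.ins:
        if I.pos > 0:
            qry[I.pos - 1] += I.ins
        elif I.pos == 0:
            qry[0] = I.ins + qry[0]
    return "".join(qry)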


#      0123456789
ref = "ACCTGGCTTT"
qry = "AGCCTGGTAT"

I = Insertion(1, "G")
D = Deletion(6, 1)
S = Substitution(8, "A")

E = Edits([I], [D], [S])

print(f"ref = {ref}")
print(f"qry = {qry}")
print(f"edits = {E}")


qry_reconstructed = apply_edits_to_ref(E, ref)
print(f"              qry = {qry}")
print(f"reconstructed qry = {qry_reconstructed}")

ref = ACCTGGCTTT
qry = AGCCTGGTAT
edits = Edits(
Ins -> Insertion(pos=1, ins=G)	
Del -> Deletion(pos=6, length=1)	
Sub -> Substitution(pos=8, alt=A)	
)
              qry = AGCCTGGTAT
reconstructed qry = AGCCTGGTAT


### from ref-qry sequences to set of variations

When adding a sequence to a block, the first step is to express the query as the reference plus a set of edits.

In [3]:
# We need a function:
def map_variations(ref: str, qry: str) -> Edits:
    """
    Placeholder: express the query as a set of edits on the reference.
    """
    # use nextalign to align ref to query
    # return the set of edits
    pass


# example alignment
#        0            1        
#        012   3456789012345678
# ref = "ACT---TTGCGTATTTACTATA"
# qry = "ACTAGATTGAGTATCT---ATA"
# sub =           x    x
# ins =     xxx
# del =                  xxx

ref = "ACTTTGCGTATTTACTATA"
qry = "ACTAGATTGAGTATCTATA"


E = map_variations(ref, qry)

E_expected = Edits(
    [Insertion(3, "AGA")],
    [Deletion(13, 3)],
    [Substitution(6, "A"), Substitution(11, "C")],
    )

print(qry)
print(apply_edits_to_ref(E_expected, ref))


ACTAGATTGAGTATCTATA
ACTAGATTGAGTATCTATA
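
Once an aligner has produced a pairwise alignment, turning it into edits is mostly bookkeeping. The sketch below assumes the alignment is already available as two gapped strings of equal length with `-` as the gap character, as in the comment of the cell above. `edits_from_alignment` is a hypothetical helper, not part of pangraph or nextalign, and it returns plain tuples rather than the `Edits` classes so that it runs on its own.

```python
from typing import List, Tuple


def edits_from_alignment(ref_aln: str, qry_aln: str):
    """Return (insertions, deletions, substitutions) in reference coordinates.

    insertions:    list of (pos, inserted_sequence)
    deletions:     list of (pos, length)
    substitutions: list of (pos, alt)
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have the same length")
    ins: List[Tuple[int, str]] = []
    dels: List[Tuple[int, int]] = []
    subs: List[Tuple[int, str]] = []
    ref_pos = 0      # index of the next reference base
    prev = "match"   # type of the previous informative column
    for r, q in zip(ref_aln, qry_aln):
        if r == "-" and q == "-":
            continue  # a column with gaps in both rows carries no information
        if r == "-":
            # gap in the reference: query bases inserted before ref_pos
            if prev == "ins":
                pos, seq = ins[-1]
                ins[-1] = (pos, seq + q)
            else:
                ins.append((ref_pos, q))
            prev = "ins"
        elif q == "-":
            # gap in the query: reference bases deleted
            if prev == "del":
                pos, length = dels[-1]
                dels[-1] = (pos, length + 1)
            else:
                dels.append((ref_pos, 1))
            prev = "del"
            ref_pos += 1
        else:
            if r != q:
                subs.append((ref_pos, q))
            prev = "match"
            ref_pos += 1
    return ins, dels, subs


ins, dels, subs = edits_from_alignment(
    "ACT---TTGCGTATTTACTATA",
    "ACTAGATTGAGTATCT---ATA",
)
print(ins)   # [(3, 'AGA')]
print(dels)  # [(13, 3)]
print(subs)  # [(6, 'A'), (11, 'C')]
```

The tuples match the arguments of `E_expected` above, so wrapping them in `Insertion`, `Deletion` and `Substitution` would give the same set of edits.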


### adding a sequence to a block

We want a function to append a sequence to a block. This function should take the query sequence, align it to the block consensus, create a new node and add it to the block.

In [4]:
# Block
#              0         1
#              0123456789012345   678
# consensus = "ACTTTGCGTATTTACT---ATA"
# seq_0     = "AC---GCGTATTTACT---ATA"
# seq_1     = "ACTTTGCGTATTTACTCCCATA"
# seq_2     = "ACTTTGCGTATCTACT---ATA"
# sequence to append
# seq_new   = "ACTTTGCGGATTTACT---ATA"


cons = "ACTTTGCGTATTTACTATA"
E_0 = Edits(
    [],
    [Deletion(2, 3)],
    [],
    )
E_1 = Edits(
    [Insertion(16, "CCC")],
    [],
    [],
    )
E_2 = Edits(
    [],
    [],
    [Substitution(11, "C")],
    )

aln = {
    "node_0" : E_0,
    "node_1" : E_1,
    "node_2" : E_2,
}

block = Block("bl_0", cons, aln)

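def append_sequence(
        block: Block,
        new_seq: str,
        node_id: str,
    ) -> None:
    """
    Append a sequence to a block: express it as edits on the block
    consensus and store them in the alignment under `node_id`.
    """
    # find the set of edits
    E_new = map_variations(block.consensus, new_seq)
    block.alignment[node_id] = E_new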


# append the sequence to the block    
new_seq = "ACTTTGCGGATTTACTATA"
append_sequence(block, new_seq, "node_new")
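
As long as `map_variations` is a stub, `append_sequence` stores `None` for the new node. To try the flow end to end, one option is a stand-in that uses Python's `difflib.SequenceMatcher` in place of nextalign. This is only an illustration: `difflib` is not a biological aligner, and its choice of edits can differ from what nextalign would report. The names `map_variations_difflib`, `apply_edits` and `append_sequence_sketch`, and the dict/tuple representation of the edits and of the block alignment, are assumptions made to keep the example self-contained.

```python
from difflib import SequenceMatcher


def map_variations_difflib(ref: str, qry: str) -> dict:
    """Stand-in for a real aligner: derive edits from difflib opcodes."""
    edits = {"ins": [], "dels": [], "subs": []}
    matcher = SequenceMatcher(None, ref, qry, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            # same length on both sides: one substitution per position
            for k in range(i2 - i1):
                edits["subs"].append((i1 + k, qry[j1 + k]))
            continue
        # "delete", "insert", or an uneven "replace" (deletion plus insertion)
        if i2 > i1:
            edits["dels"].append((i1, i2 - i1))
        if j2 > j1:
            edits["ins"].append((i1, qry[j1:j2]))
    return edits


def apply_edits(edits: dict, ref: str) -> str:
    """Rebuild the query from the reference and the edits."""
    # one slot per reference base, plus one so an insertion at len(ref) fits
    slots = list(ref) + [""]
    for pos, alt in edits["subs"]:
        slots[pos] = alt
    for pos, length in edits["dels"]:
        for k in range(length):
            slots[pos + k] = ""
    # insertions go last and are placed right before reference position pos
    for pos, seq in edits["ins"]:
        slots[pos] = seq + slots[pos]
    return "".join(slots)


def append_sequence_sketch(alignment: dict, consensus: str, new_seq: str, node_id: str) -> None:
    """Store the edits of new_seq relative to the consensus under node_id."""
    alignment[node_id] = map_variations_difflib(consensus, new_seq)


cons = "ACTTTGCGTATTTACTATA"
new_seq = "ACTTTGCGGATTTACTATA"
block_alignment = {}  # node id -> edits, mirroring block.alignment

append_sequence_sketch(block_alignment, cons, new_seq, "node_new")
reconstructed = apply_edits(block_alignment["node_new"], cons)
print(block_alignment["node_new"])
print(reconstructed == new_seq)
```

Whatever edits the stand-in picks, applying them to the consensus should give back the appended sequence, which is a useful sanity check for any real implementation of `map_variations` as well.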