## data classes

Here I define and import the main data classes for pangraph.
The module `pangraph_classes` contains the following classes:
- `Insertion`
- `Deletion`
- `Substitution`
- `Edits`: a collection of insertions, deletions and substitutions. It has a method `apply` that applies the edits to a reference sequence to obtain the query.
- `Node`: a node in a pangraph. It identifies a particular occurrence of a block (i.e. one line in the block alignment)
- `Block`: a block in a pangraph. It represents an alignment.
- `Path`: a path in a pangraph. It represents a genome as a sequence of blocks.
- `Pangraph`: a set of genomes represented as a collection of paths and blocks.

In [1]:
from pangraph_classes import Insertion, Deletion, Substitution, Edits
from pangraph_classes import Node, Block, Path
from pangraph_classes import Pangraph
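
The definitions themselves live in `pangraph_classes` and are not shown in this notebook. As a rough orientation, a minimal stand-in consistent with how the classes are used in the cells below could look like the following sketch. The field names are taken from the printed representations and keyword arguments that appear later; everything else (defaults, type hints, the use of dataclasses) is an assumption, and the real classes may differ. `Node`, `Path` and `Pangraph` are left out because their fields are not shown here.

```python
from dataclasses import dataclass, field

# Stand-in definitions for illustration only, NOT the real pangraph_classes.

@dataclass
class Insertion:
    pos: int  # the insertion sits just before reference position `pos` (0-based)
    ins: str  # inserted sequence

@dataclass
class Deletion:
    pos: int     # first deleted reference position (0-based)
    length: int  # number of deleted positions

@dataclass
class Substitution:
    pos: int  # substituted reference position (0-based)
    alt: str  # base found in the query

@dataclass
class Edits:
    ins: list = field(default_factory=list)
    dels: list = field(default_factory=list)
    subs: list = field(default_factory=list)

@dataclass
class Block:
    id: str
    consensus: str
    alignment: dict = field(default_factory=dict)  # node_id -> Edits

E = Edits([Insertion(1, "G")], [Deletion(6, 1)], [Substitution(8, "A")])
print(E)
```

Using dataclasses gives field-wise equality for free, which is what comparisons such as `block.alignment["node_new"] == E_new_expected` rely on.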

Here is an example of a utility function that we might want: applying a set of edits to a reference sequence to obtain a query.

In [2]:

def apply_edits_to_ref(edits: Edits, ref: str) -> str:
    """
    Apply the edits to the reference to obtain the query sequence
    """
    qry = list(ref)
    for S in edits.subs:
        qry[S.pos] = S.alt
    # deletions go before insertions: an insertion is stored on the preceding
    # base, so it would be erased if that base were deleted afterwards
    for D in edits.dels:
        for k in range(D.length):
            qry[D.pos + k] = ""
    for I in edits.ins:
        if I.pos > 0:
            qry[I.pos - 1] += I.ins
        elif I.pos == 0:
            qry[0] = I.ins + qry[0]
    return "".join(qry)


#                1
#      01234567890
ref = "ACCTGGCTTT"
qry = "AGCCTGGTAT"

I = Insertion(1, "G")
D = Deletion(6, 1)
S = Substitution(8, "A")

E = Edits([I], [D], [S])

print(f"ref = {ref}")
print(f"qry = {qry}")
print(f"edits = {E}")


qry_reconstructed = apply_edits_to_ref(E, ref)
print(f"              qry = {qry}")
print(f"reconstructed qry = {qry_reconstructed}")
assert qry_reconstructed == qry

ref = ACCTGGCTTT
qry = AGCCTGGTAT
edits = Edits(
Ins -> Insertion(pos=1, ins='G')	
Del -> Deletion(pos=6, length=1)	
Sub -> Substitution(pos=8, alt='A')	
)
              qry = AGCCTGGTAT
reconstructed qry = AGCCTGGTAT


### from ref-qry sequences to set of variations

When adding a sequence to a block, the first operation we need to do is to express the query as a function of the reference plus a set of edits.

In [3]:
# We need a function:
def map_variations(ref:str, qry:str) -> Edits:
    """Use nextalign to align single ref sequence to
    query sequence. Return the set of edits."""
    pass


# example alignment
#        0            1        
#        012   3456789012345678
# ref = "ACT---TTGCGTATTTACTATA"
# qry = "ACTAGATTGAGTATCT---ATA"
# sub =           x    x
# ins =     xxx
# del =                  xxx

ref = "ACTTTGCGTATTTACTATA"
qry = "ACTAGATTGAGTATCTATA"


E = map_variations(ref, qry)

E_expected = Edits(
    ins=[Insertion(3, "AGA")],
    dels=[Deletion(13, 3)],
    subs=[Substitution(6, "A"), Substitution(11, "C")],
    )

print(qry)
print(apply_edits_to_ref(E_expected, ref))


ACTAGATTGAGTATCTATA
ACTAGATTGAGTATCTATA
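
`map_variations` is left as a stub above because it depends on the aligner. The step that follows the alignment, turning a gapped pairwise alignment into an `Edits` object, can be sketched independently of it. The helper name `edits_from_alignment` and the stand-in dataclasses below are assumptions made for illustration and are not part of `pangraph_classes`; the sketch also assumes that no alignment column has a gap in both rows.

```python
from dataclasses import dataclass, field

# Stand-in classes (assumed shape, mirroring how they are used in this notebook).
@dataclass
class Insertion:
    pos: int
    ins: str

@dataclass
class Deletion:
    pos: int
    length: int

@dataclass
class Substitution:
    pos: int
    alt: str

@dataclass
class Edits:
    ins: list = field(default_factory=list)
    dels: list = field(default_factory=list)
    subs: list = field(default_factory=list)

def edits_from_alignment(ref_aln: str, qry_aln: str) -> Edits:
    """Convert two gapped rows of equal length into edits on the ungapped reference."""
    assert len(ref_aln) == len(qry_aln)
    edits = Edits()
    ref_pos = 0  # current position on the ungapped reference
    col, n = 0, len(ref_aln)
    while col < n:
        if ref_aln[col] == "-":
            # gap in the reference: collect the whole inserted run
            start = col
            while col < n and ref_aln[col] == "-":
                col += 1
            edits.ins.append(Insertion(ref_pos, qry_aln[start:col]))
        elif qry_aln[col] == "-":
            # gap in the query: collect the whole deleted run
            start = col
            while col < n and qry_aln[col] == "-" and ref_aln[col] != "-":
                col += 1
            edits.dels.append(Deletion(ref_pos, col - start))
            ref_pos += col - start
        else:
            if ref_aln[col] != qry_aln[col]:
                edits.subs.append(Substitution(ref_pos, qry_aln[col]))
            ref_pos += 1
            col += 1
    return edits

# the example alignment from the cell above
E = edits_from_alignment("ACT---TTGCGTATTTACTATA", "ACTAGATTGAGTATCT---ATA")
print(E)
```

With a helper like this, `map_variations` would reduce to running the aligner and passing its two gapped rows on.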


### adding a sequence to a block

We want a function to append a sequence to a block. This function should take the query sequence, align it to the block consensus, create a new node and add it to the block.
It should use the nextclade aligner.

In [4]:
# Block
#              0         1
#              0123456789012345   678
# consensus = "ACTTTGCGTATTTACT---ATA"
# seq_0     = "AC---GCGTATTTACT---ATA"
# seq_1     = "ACTTTGCGTATTTACTCCCATA"
# seq_2     = "ACTTTGCGTATCTACT---ATA"
# sequence to append
# seq_new   = "ACTTTGCGGATTTACT---ATA"


consensus = "ACTTTGCGTATTTACTATA"
E_0 = Edits(dels=[Deletion(2, 3)])
E_1 = Edits(ins=[Insertion(16, "CCC")])
E_2 = Edits(subs=[Substitution(11, "C")])

aln = {
    "node_0" : E_0,
    "node_1" : E_1,
    "node_2" : E_2,
}

block = Block("bl_0", consensus, aln)

def append_sequence(
        block:Block,
        new_seq:str,
        new_node_id:str,
    ):
    """
    Append a sequence to a block, and assign the given node_id
    """
    # find the set of edits
    E_new = map_variations(block.consensus, new_seq)
    # append the edits to the block alignment dictionary
    block.alignment[new_node_id] = E_new


# append the sequence to the block    
new_seq = "ACTTTGCGGATTTACTATA"
append_sequence(block, new_seq, "node_new")

E_new = block.alignment["node_new"]
E_new_expected = Edits(subs=[Substitution(8, "G")])
# block.alignment["node_new"] == E_new_expected

### merging two blocks

The next step is to combine these pieces to merge two blocks. The merge would:
- take block A and block B as input.
- add the sequences of the shallow block (B) to the deep block (A), one by one.
- do this with the previous function: each sequence is aligned to the consensus of A with the nextclade aligner, and the resulting edits are stored in the alignment of A under the node_id the sequence had in B.

Let's consider the following example:

**Block A**
```txt
            0         1         2         3         
            0123456789012345678901234567890123456789
consensus = GACTAAACCTGTCCGCTGAAACTGAGCGGGGTACTGCAGC

                                         GAT
    seq_0 = GACTAAA---GTCCGCTGAAACTGAGCGGGGTATTGCAGC
                           ACA
    seq_1 = GACTAAACCTGTCCGCTGAAATTGAGCGGG---CTGCAGC
                          AAT
    seq_2 = ---TAAACCTGTCCGCTGAAACTGAGCGGGGTACTGCAGC
```

**Block B**
```txt
            0         1         2         3         
            01234567890123456789012345678901234567
consensus = GACCAAACCTGTCCGCTGAAACTGCGGGGTACTGCAGC

                x
    seq_3 = GACCTAACCTGTC---TGAAACTGCGGGGTACTGCAGC
               x                      AAA
    seq_4 = GACTAAACCTGTCCGCTGAAACTGCGGGGTACTGCAGC
```

The alignment of the two consensus sequences:
```txt
            x
cons_A = GACTAAACCTGTCCGCTGAAACTGAGCGGGGTACTGCAGC
cons_B = GACCAAACCTGTCCGCTGAAACTG--CGGGGTACTGCAGC
```

We expect the **merged block** to be:

```txt
            0         1         2         3         
            0123456789012345678901234567890123456789
consensus = GACTAAACCTGTCCGCTGAAACTGAGCGGGGTACTGCAGC

                                         GAT
    seq_0 = GACTAAA---GTCCGCTGAAACTGAGCGGGGTATTGCAGC
                           ACA
    seq_1 = GACTAAACCTGTCCGCTGAAATTGAGCGGG---CTGCAGC
                          AAT
    seq_2 = ---TAAACCTGTCCGCTGAAACTGAGCGGGGTACTGCAGC
              
    seq_3 = GACCTAACCTGTC---TGAAACT--GCGGGGTACTGCAGC
                                        AAA
    seq_4 = GACTAAACCTGTCCGCTGAAACT--GCGGGGTACTGCAGC
```


In [6]:
cons_A = "GACTAAACCTGTCCGCTGAAACTGAGCGGGGTACTGCAGC"
aln_A = {
    "seq_0" : Edits(ins=[Insertion(29, "GAT")], dels=[Deletion(7, 3)], subs=[Substitution(33, "T")]),
    "seq_1" : Edits(ins=[Insertion(15, "ACA")], dels=[Deletion(30, 3)], subs=[Substitution(21, "T")]),
    "seq_2" : Edits(ins=[Insertion(14, "AAT")], dels=[Deletion(0, 3)], subs=[Substitution(0, "C")]),
    }


cons_B = "GACCAAACCTGTCCGCTGAAACTGCGGGGTACTGCAGC"
aln_B = {
    "seq_3" : Edits(dels=[Deletion(13, 3)], subs=[Substitution(4, "T")]),
    "seq_4" : Edits(ins=[Insertion(26, "AAA")], subs=[Substitution(3, "T")]),
    }

block_A = Block("block_A", cons_A, aln_A)
block_B = Block("block_B", cons_B, aln_B)

def merge_blocks(block_A:Block, block_B:Block, new_block_id:str) -> Block:
    """
    Merge two blocks into a single block. This function modifies block A.
    """
    
    # append new sequences to block A
    for node_id, E in block_B.alignment.items():
        seq = apply_edits_to_ref(E, block_B.consensus)
        append_sequence(block_A, seq, node_id)

    block_A.id = new_block_id
    return block_A

block_new = merge_blocks(block_A, block_B, "block_new")


new_cons = "GACTAAACCTGTCCGCTGAAACTGAGCGGGGTACTGCAGC"
new_aln = {
    "seq_0" : Edits(ins=[Insertion(29, "GAT")], dels=[Deletion(7, 3)], subs=[Substitution(33, "T")]),
    "seq_1" : Edits(ins=[Insertion(15, "ACA")], dels=[Deletion(30, 3)], subs=[Substitution(21, "T")]),
    "seq_2" : Edits(ins=[Insertion(14, "AAT")], dels=[Deletion(0, 3)], subs=[Substitution(0, "C")]),
    # relative to cons_A, seq_3 differs at position 3 (C, inherited from cons_B) and at position 4 (T)
    "seq_3" : Edits(dels=[Deletion(13, 3), Deletion(23, 2)], subs=[Substitution(3, "C"), Substitution(4, "T")]),
    "seq_4" : Edits(dels=[Deletion(23,2)], ins=[Insertion(28, "AAA")]),
    }
expected_new_block = Block("block_new", new_cons, new_aln)
# block_new == expected_new_block
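
A cheap sanity check on the expected result, which does not need the aligner: every node moved from block B must spell the same sequence whether it is reconstructed from `cons_B` with its old edits or from `cons_A` with its new ones. The sketch below uses stand-in dataclasses and a local `apply_edits` helper that mirrors `apply_edits_to_ref`; both are assumptions for illustration. Relative to `cons_A`, `seq_3` differs at position 3 as well as position 4, because the two consensus sequences themselves differ at position 3.

```python
from dataclasses import dataclass, field

# Stand-in classes (assumed shape, mirroring how they are used in this notebook).
@dataclass
class Insertion:
    pos: int
    ins: str

@dataclass
class Deletion:
    pos: int
    length: int

@dataclass
class Substitution:
    pos: int
    alt: str

@dataclass
class Edits:
    ins: list = field(default_factory=list)
    dels: list = field(default_factory=list)
    subs: list = field(default_factory=list)

def apply_edits(edits: Edits, ref: str) -> str:
    """Local helper: substitutions, then deletions, then insertions."""
    qry = list(ref)
    for s in edits.subs:
        qry[s.pos] = s.alt
    for d in edits.dels:
        for k in range(d.pos, d.pos + d.length):
            qry[k] = ""
    for i in edits.ins:
        if i.pos > 0:
            qry[i.pos - 1] += i.ins
        else:
            qry[0] = i.ins + qry[0]
    return "".join(qry)

cons_A = "GACTAAACCTGTCCGCTGAAACTGAGCGGGGTACTGCAGC"
cons_B = "GACCAAACCTGTCCGCTGAAACTGCGGGGTACTGCAGC"

# edits of the two nodes relative to the consensus of block B ...
edits_on_B = {
    "seq_3": Edits(dels=[Deletion(13, 3)], subs=[Substitution(4, "T")]),
    "seq_4": Edits(ins=[Insertion(26, "AAA")], subs=[Substitution(3, "T")]),
}
# ... and relative to the consensus of block A after the merge
edits_on_A = {
    "seq_3": Edits(dels=[Deletion(13, 3), Deletion(23, 2)],
                   subs=[Substitution(3, "C"), Substitution(4, "T")]),
    "seq_4": Edits(dels=[Deletion(23, 2)], ins=[Insertion(28, "AAA")]),
}

for node_id in edits_on_B:
    seq_from_B = apply_edits(edits_on_B[node_id], cons_B)
    seq_from_A = apply_edits(edits_on_A[node_id], cons_A)
    assert seq_from_B == seq_from_A, node_id
    print(node_id, seq_from_A)
```

The same loop could be run over every node of `block_B` after a real `merge_blocks` call, as a regression test that merging never changes the underlying sequences.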

### splitting a block

When merging, one of the first operations we need to perform is splitting a block at a given set of cut-points. This function should split the consensus and the underlying alignment.

The first utility function needed for this is a function that splits the alignment on a given set of positions.

Let's consider for example this alignment:
```txt
            0         1         2         3         
            0123456789012345678901234567890123456789
consensus = ATCAGTGTATGCTTCTTTGAAACTTGAGTTTGGCGATTCA

                GGT
    seq_0 = ATCAGTGTATGCTTCTTTG---CTTGAGTTTcGCGATTCA
                            AGA
    seq_1 = ATCA---TATGCTTCTTTGAAACTTGtGTTTGGCGATTCA
                           CTT
    seq_2 = ATCAGTGT---CTTCcTTGAAACTTGAGTTTGGCGATTCA
```

We want to split between positions 9 and 10:

```txt
0           |  0         1         2         
0123456789  |  012345678901234567890123456789
ATCAGTGTAT  |  GCTTCTTTGAAACTTGAGTTTGGCGATTCA
            |                                
    GGT     |         
ATCAGTGTAT  |  GCTTCTTTG---CTTGAGTTTcGCGATTCA
            |        AGA
ATCA---TAT  |  GCTTCTTTGAAACTTGtGTTTGGCGATTCA
            |       CTT
ATCAGTGT--  |  -CTTCcTTGAAACTTGAGTTTGGCGATTCA
```

- Insertions and substitutions are easy to split: each one simply goes to the side of the cut on which it falls.
- Deletions are a bit more tricky, because a deletion that spans the cut must be split into two deletions, one on each side, as in the case of the third sequence.

For **insertions that happen on the cut** we might want to pick, by convention, the side to which they are appended, e.g. the left side.

In [5]:
aln = {
    "seq_0" : Edits([Insertion(4, "GGT")], [Deletion(19, 3)], [Substitution(31, "C")]),
    "seq_1" : Edits([Insertion(16, "AGA")], [Deletion(4, 3)], [Substitution(26, "T")]),
    "seq_2" : Edits([Insertion(15, "CTT")], [Deletion(8, 3)], [Substitution(15, "C")]),
}

def split_alignment_line(edits: Edits, pos:int):
    """function that splits a single alignment line _before_ position `pos`.
    `pos` must be between 1 and L (in 0-based indexing), where L is the length of the consensus sequence."""
    pass

E0_split = split_alignment_line(aln["seq_0"], 10)
E0_split_expected = [
    Edits(ins=[Insertion(4, "GGT")]),
    Edits(dels=[Deletion(9, 3)], subs=[Substitution(21, "C")]),
]

# splitting a deletion on the cut
E2_split = split_alignment_line(aln["seq_2"], 10)
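E2_split_expected = [
    # left part: the deletion covers the last two positions (8 and 9)
    Edits(dels=[Deletion(8, 2)]),
    # right part: the rest of the deletion, plus the insertion and the substitution
    Edits(ins=[Insertion(5, "CTT")], dels=[Deletion(0, 1)], subs=[Substitution(5, "C")]),
]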
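
One possible way to fill in `split_alignment_line` is sketched below, under the conventions described above: edits on the right side are shifted by `pos`, an insertion exactly on the cut goes to the left part, and a deletion spanning the cut is divided in two. The stand-in dataclasses are assumptions for illustration, not the real `pangraph_classes`. Following the drawing, for `seq_2` the left part keeps `Deletion(8, 2)`, while the right part receives the insertion, `Deletion(0, 1)` and the substitution.

```python
from dataclasses import dataclass, field

# Stand-in classes (assumed shape, mirroring how they are used in this notebook).
@dataclass
class Insertion:
    pos: int
    ins: str

@dataclass
class Deletion:
    pos: int
    length: int

@dataclass
class Substitution:
    pos: int
    alt: str

@dataclass
class Edits:
    ins: list = field(default_factory=list)
    dels: list = field(default_factory=list)
    subs: list = field(default_factory=list)

def split_alignment_line(edits: Edits, pos: int) -> list:
    """Split one alignment line before position `pos`; returns [left, right]."""
    left, right = Edits(), Edits()
    for s in edits.subs:
        if s.pos < pos:
            left.subs.append(s)
        else:
            right.subs.append(Substitution(s.pos - pos, s.alt))
    for i in edits.ins:
        # convention: an insertion exactly on the cut stays on the left
        if i.pos <= pos:
            left.ins.append(i)
        else:
            right.ins.append(Insertion(i.pos - pos, i.ins))
    for d in edits.dels:
        start, end = d.pos, d.pos + d.length  # half-open interval [start, end)
        if end <= pos:
            left.dels.append(d)
        elif start >= pos:
            right.dels.append(Deletion(start - pos, d.length))
        else:
            # the deletion spans the cut: one piece per side
            left.dels.append(Deletion(start, pos - start))
            right.dels.append(Deletion(0, end - pos))
    return [left, right]

aln = {
    "seq_0": Edits([Insertion(4, "GGT")], [Deletion(19, 3)], [Substitution(31, "C")]),
    "seq_2": Edits([Insertion(15, "CTT")], [Deletion(8, 3)], [Substitution(15, "C")]),
}
left_0, right_0 = split_alignment_line(aln["seq_0"], 10)
left_2, right_2 = split_alignment_line(aln["seq_2"], 10)
print(left_2)
print(right_2)
```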

Rather than a single position, the function should accept a list of positions. It should split the alignment on each of the positions, and return a list of alignments.

Finally, the function should work not just on alignments but on full blocks, splitting also the consensus sequences and updating the node dictionaries.
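
A sketch of how both extensions could fit together is given below: a hypothetical `split_block` that cuts the consensus at a sorted list of positions and assigns every edit to the segment it falls in. The function name, the `<block id>_<k>` naming of the new blocks, and the choice to reuse the same node ids in every part are assumptions made for illustration; the real implementation would presumably create new nodes. The stand-in dataclasses are likewise not the real `pangraph_classes`.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass, field

# Stand-in classes (assumed shape, mirroring how they are used in this notebook).
@dataclass
class Insertion:
    pos: int
    ins: str

@dataclass
class Deletion:
    pos: int
    length: int

@dataclass
class Substitution:
    pos: int
    alt: str

@dataclass
class Edits:
    ins: list = field(default_factory=list)
    dels: list = field(default_factory=list)
    subs: list = field(default_factory=list)

@dataclass
class Block:
    id: str
    consensus: str
    alignment: dict = field(default_factory=dict)

def split_block(block: Block, cuts: list) -> list:
    """Split a block before each position in `cuts`; returns the list of parts."""
    cuts = sorted(cuts)
    bounds = [0] + cuts + [len(block.consensus)]
    segments = list(zip(bounds[:-1], bounds[1:]))  # half-open (lo, hi) intervals
    parts = [Block(f"{block.id}_{k}", block.consensus[lo:hi])
             for k, (lo, hi) in enumerate(segments)]
    for node_id, edits in block.alignment.items():
        pieces = [Edits() for _ in segments]
        for s in edits.subs:
            k = bisect_right(cuts, s.pos)  # a position equal to a cut goes right
            pieces[k].subs.append(Substitution(s.pos - segments[k][0], s.alt))
        for i in edits.ins:
            k = bisect_left(cuts, i.pos)   # an insertion on a cut stays left
            pieces[k].ins.append(Insertion(i.pos - segments[k][0], i.ins))
        for d in edits.dels:
            for k, (lo, hi) in enumerate(segments):
                start, end = max(d.pos, lo), min(d.pos + d.length, hi)
                if start < end:            # the deletion overlaps this segment
                    pieces[k].dels.append(Deletion(start - lo, end - start))
        for part, piece in zip(parts, pieces):
            part.alignment[node_id] = piece
    return parts

consensus = "ATCAGTGTATGCTTCTTTGAAACTTGAGTTTGGCGATTCA"
aln = {
    "seq_0": Edits([Insertion(4, "GGT")], [Deletion(19, 3)], [Substitution(31, "C")]),
    "seq_1": Edits([Insertion(16, "AGA")], [Deletion(4, 3)], [Substitution(26, "T")]),
    "seq_2": Edits([Insertion(15, "CTT")], [Deletion(8, 3)], [Substitution(15, "C")]),
}
parts = split_block(Block("bl_0", consensus, aln), [10, 25])
for part in parts:
    print(part.id, part.consensus)
```

Assigning each edit directly to its segment avoids repeated pairwise splitting and keeps the cost linear in the number of edits for substitutions and insertions.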