# IMGT - PARTIS annotation comparison

In this notebook we look at summary statistics of the IMGT output provided by Tatsuya and the equivalent output from partis.

In [234]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from plotnine import *

In [235]:
def translate(seq):
    """super simple translation function for sanity checks - doesn't care if len(seq)%3!=0"""
    table = {
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',                
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
        'TAC':'Y', 'TAT':'Y', 'TAA':'*', 'TAG':'*',
        'TGC':'C', 'TGT':'C', 'TGA':'*', 'TGG':'W',
    }
    protein = ""
    for i in range(0, len(seq), 3):
        codon = seq[i:i + 3]
        if "N" in codon:
            # ambiguous base: stop translating and return what we have so far
            print("FOUND A NNNNN!")
            break
        if len(codon) == 3:  # silently skip a trailing partial codon
            protein += table[codon]
    return protein

In [236]:
seq = "GAGGTGC"
for i in range(0, len(seq), 3):
    print(seq[i:i + 3])

GAG
GTG
C


# IMGT

Data from Tatsuya, who did the demultiplexing and counting "by hand" and then annotated the sequences with the IMGT/HighV-QUEST tool. We'll follow the original steps from the R Markdown for 1.6.

### Database generation

* The raw file is output from IMGT/HighV-QUEST. Make sure to export the AIRR format as well; it simplifies processing because there are fewer files to import.

* Process both IgH and IgK at the same time, as per the PCR strategy.

### Read HighV-QUEST files (AIRR, 7, 8, 9)


```{r read data}
IMGT_OUT <- "/home/jared/MatsenGroup/Projects/gc/gcreplay/nextflow/sandbox/2022-01-16-tatsuya-bioinformatics/PR1_6_IMGT_OUT/"
airr <- read_tsv(paste(IMGT_OUT, "vquest_airr.tsv", sep=""))
change <- read_tsv(paste(IMGT_OUT ,"7_V-REGION-mutation-and-AA-change-table.txt", sep=""))
ntmut <- read_tsv(paste(IMGT_OUT, "8_V-REGION-nt-mutation-statistics.txt", sep=""))
aamut <- read_tsv(paste(IMGT_OUT, "9_V-REGION-AA-change-statistics.txt", sep=""))
print("done")
```

In [237]:
imgt_path = "../2022-01-16-tatsuya-bioinformatics/PR1_6_IMGT_OUT/"
# airr 
imgt_airr = pd.read_csv(imgt_path+"vquest_airr_patch.tsv", sep="\t")
imgt_change = pd.read_csv(imgt_path+"7_V-REGION-mutation-and-AA-change-table.txt", sep="\t")
imgt_ntmut = pd.read_csv(imgt_path+"8_V-REGION-nt-mutation-statistics.txt", sep="\t")
imgt_aamut = pd.read_csv(imgt_path+"9_V-REGION-AA-change-statistics.txt", sep="\t")



In [238]:
print(f"There are {imgt_airr.shape[0]} total sequences identified by tatsuya + imgt annotation")

There are 5287 total sequences identified by tatsuya + imgt annotation


In [239]:
imgt_airr.head()

Unnamed: 0,sequence_id,sequence,sequence_aa,rev_comp,productive,complete_vdj,vj_in_frame,stop_codon,locus,v_call,...,rearrangement_id,repertoire_id,rearrangement_set_id,sequence_analysis_category,d_number,5prime_trimmed_n_nb,3prime_trimmed_n_nb,insertions,deletions,junction_decryption
0,211210P01A01H_1-1,gaggtgcagcttcaggagtcaggacctagcctcgtgaaaccttctc...,,F,T,T,T,F,IGH,Musmus IGHV3-8*02 F,...,,,,2 (indelcorr),0.0,0.0,0.0,,,(8)-3{2}-8(14)
1,211210P01A01H_2-1,gaggtgcagcttcaggagtcaggacctagcctcgtgaaaccttctc...,,F,T,T,T,F,IGH,Musmus IGHV3-8*02 F,...,,,,2 (indelcorr),0.0,0.0,0.0,,,(8)-3{2}-8(14)
2,211210P01A02H_1-1,gaggtgcagcttcaggagtcaggacctagcctcgtgaaaccttctc...,,F,T,T,T,F,IGH,Musmus IGHV3-8*02 F,...,,,,2 (indelcorr),0.0,0.0,0.0,,,(8)-3{2}-8(14)
3,211210P01A02H_2-1,gaggtgcagcttcaggagtcaggacctagcctcgtgaaaccttctc...,,F,T,T,T,F,IGH,Musmus IGHV3-8*02 F,...,,,,2 (indelcorr),0.0,0.0,0.0,,,(8)-3{2}-8(14)
4,211210P01A02K_1-1,acctgcgacgggagttcacagactgcaaccggtgtacattccgagg...,,F,T,T,T,F,IGH,Musmus IGHV3-8*02 F,...,,,,2 (indelcorr),0.0,0.0,0.0,,,(8)-3{2}-8(14)


### Assemble V(D)J nt/aa seq in AIRR

```{r airr preprocessing}
# assemble V(D)J nt and aa sequences in airr from components for gctree
airr <- airr %>% 
  mutate(seq_nt = paste(fwr1, cdr1, fwr2, cdr2, fwr3, cdr3, fwr4, sep = "")) %>% 
  mutate(seq_aa = paste(fwr1_aa, cdr1_aa, fwr2_aa, cdr2_aa,
                        fwr3_aa, cdr3_aa, fwr4_aa, sep = ""))
print("done")
```

In [240]:
imgt_airr[['fwr1', 'cdr1', 'fwr2', 'cdr2', 'fwr3', 'cdr3', 'fwr4']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5287 entries, 0 to 5286
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   fwr1    4850 non-null   object
 1   cdr1    4828 non-null   object
 2   fwr2    4803 non-null   object
 3   cdr2    4803 non-null   object
 4   fwr3    4853 non-null   object
 5   cdr3    4756 non-null   object
 6   fwr4    4703 non-null   object
dtypes: object(7)
memory usage: 289.3+ KB


In [241]:
imgt_airr['seq_nt'] = imgt_airr[['fwr1', 'cdr1', 'fwr2', 'cdr2', 'fwr3', 'cdr3', 'fwr4']].fillna("").agg(''.join, axis=1)
imgt_airr['seq_aa'] = imgt_airr[['fwr1_aa', 'cdr1_aa', 'fwr2_aa', 'cdr2_aa', 'fwr3_aa', 'cdr3_aa', 'fwr4_aa']].fillna("").agg(''.join, axis=1)
imgt_airr["seq_nt_lens"] = [len(seq) for seq in imgt_airr["seq_nt"].values]
imgt_airr["seq_aa_lens"] = [len(seq) for seq in imgt_airr["seq_aa"].values]

In [320]:
imgt_airr["sequence_id"]

0              211210P01A01H_1-1
1              211210P01A01H_2-1
2              211210P01A02H_1-1
3              211210P01A02H_2-1
4              211210P01A02K_1-1
                  ...           
5282     211210P16unmatchedH_2-4
5283     211210P16unmatchedH_3-4
5284    211210P16unmatchedK_1-33
5285    211210P16unmatchedK_2-25
5286    211210P16unmatchedK_3-19
Name: sequence_id, Length: 5287, dtype: object

**After this, the following steps are performed in R to get all the necessary data organized into one place**

### Compile/rename all data into one tibble

```{r data compilation, echo=TRUE}
vquest <- airr %>% 
  select(ID = "sequence_id", 
         locus = "locus",
         V = "v_call",
         D = "d_call",
         J = "j_call",
         Productive = "productive",
         AAjunction = "junction_aa",
         seq_nt,
         seq_aa,
         seq_input = sequence,
         "gapped seq_nt" = "sequence_alignment",
         "gapped seq_aa" = "sequence_alignment_aa",
         v_cigar,
         d_cigar,
         j_cigar) %>% 
  mutate(num = 1:nrow(airr), .before = ID) %>% 
  mutate(seq_nt_length = str_length(seq_nt), .after = seq_nt) %>% 
  mutate(seq_aa_length = str_length(seq_aa), .after = seq_aa) %>% 
  mutate(nt_mut = ntmut$`V-REGION Nb of mutations`, .after = seq_nt_length) %>% 
  mutate(nt_mut_silent = ntmut$`V-REGION Nb of silent mutations`, .after = nt_mut) %>% 
  mutate(nt_mut_replacement = 
           ntmut$`V-REGION Nb of nonsilent mutations`, .after = nt_mut_silent) %>% 
  mutate(aa_replacement = aamut$`V-REGION Nb of AA changes`, .after = nt_mut_replacement) %>% 
  mutate(Vchanges = change$`V-REGION`, .after = aa_replacement) 

print("done") #if evaluated
```

According to this, the columns we ultimately need before using Tatsuya's R code are the following:

### The resulting file

In [313]:
# TODO make map
required_columns_map = {}
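
The R `select()` call above already spells out the renaming, so the map left as a TODO could be sketched as follows. Keys are AIRR column names and values are the names used in the `vquest` tibble; `missing_columns` is a hypothetical helper (not part of the original notebook) for checking which of those columns a given table lacks.

```python
# AIRR column name -> name used in the vquest tibble, transcribed from the
# R select() call above. The mutation-count columns (nt_mut, etc.) come from
# IMGT files 7-9 rather than the AIRR table, so they are not listed here.
required_columns_map = {
    "sequence_id": "ID",
    "locus": "locus",
    "v_call": "V",
    "d_call": "D",
    "j_call": "J",
    "productive": "Productive",
    "junction_aa": "AAjunction",
    "seq_nt": "seq_nt",
    "seq_aa": "seq_aa",
    "sequence": "seq_input",
    "sequence_alignment": "gapped seq_nt",
    "sequence_alignment_aa": "gapped seq_aa",
    "v_cigar": "v_cigar",
    "d_cigar": "d_cigar",
    "j_cigar": "j_cigar",
}


def missing_columns(available, column_map=required_columns_map):
    """Hypothetical helper: AIRR columns the map needs but a table lacks."""
    available = set(available)
    return [name for name in column_map if name not in available]
```

With pandas, `df[list(required_columns_map)].rename(columns=required_columns_map)` would then select and rename in one step, provided `missing_columns(df.columns)` comes back empty.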

In [243]:
all_data = pd.read_csv("../partis_test/vquest_org.csv")
print("\n".join(all_data.columns.values))

num
ID
locus
V
D
J
Productive
AAjunction
seq_nt
seq_nt_length
nt_mut
nt_mut_silent
nt_mut_replacement
aa_replacement
Vchanges
seq_aa
seq_aa_length
seq_input
gapped seq_nt
gapped seq_aa
v_cigar
d_cigar
j_cigar


In [314]:
all_data["nt_mut"]

0         7 (9)
1         7 (9)
2         3 (5)
3         3 (5)
4         3 (5)
         ...   
5282    13 (14)
5283      5 (7)
5284    13 (14)
5285         11
5286         13
Name: nt_mut, Length: 5287, dtype: object
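
The `nt_mut` column is read as `object` because some entries carry a second number in parentheses. A minimal sketch for splitting those strings into integers before comparing against partis is below. `parse_mut_count` is a hypothetical helper, and it deliberately does not interpret the parenthesized number; its meaning should be checked against the IMGT documentation.

```python
import re

# Matches "7 (9)" as well as a bare "11".
_MUT_COUNT = re.compile(r"^\s*(\d+)(?:\s*\((\d+)\))?\s*$")


def parse_mut_count(value):
    """Return (count, count_in_parentheses) for an IMGT count string.

    The second element is None when there are no parentheses, and both
    elements are None when the value does not look like a count at all
    (for example a missing value).
    """
    match = _MUT_COUNT.match(str(value))
    if match is None:
        return None, None
    first, second = match.groups()
    return int(first), (int(second) if second is not None else None)
```

For example, `all_data["nt_mut"].map(parse_mut_count)` would give one tuple per row.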

In [319]:
all_data["gapped seq_nt"]

0       gaggtgcagcttcaggagtcaggacct...agcctcgtgaaacctt...
1       gaggtgcagcttcaggagtcaggacct...agcctcgtgaaacctt...
2       gaggtgcagcttcaggagtcaggacct...agcctcgtgaaacctt...
3       gaggtgcagcttcaggagtcaggacct...agcctcgtgaaacctt...
4       gaggtgcagcttcaggagtcaggacct...agcctcgtgaaacctt...
                              ...                        
5282    gacattgtgatgactcagtctcaaaaattcatgtccacatcagtgg...
5283    gaggtgcagcttcaggagtcaggacct...agcctcgtgaaacctt...
5284    gacattgtgatgactcagtctcaaaaattcatgtccacatcagtag...
5285    gacattgtgatgactcagtcgcaaaaaatcatgtccacatcagtag...
5286    gacattgtgatgactcagtctcaaaaattcatgtccacttcagtag...
Name: gapped seq_nt, Length: 5287, dtype: object

### SANITY CHECK #1

Are the amino acid columns simply the translation of the nt columns? I assume so, but we had better check, because we will manually translate the partis nucleotide output.

In [244]:
count = 0
for idx, row in imgt_airr.iterrows():
    # translate once per row and reuse the result below
    translated = translate(row.seq_nt.upper())
    if row.seq_aa.upper() != translated:
        print(f"index:\n{idx}")
        print(f"seq nt:\n{row.seq_nt}")
        print(f"seq nt upper:\n{row.seq_nt.upper()}")
        print(f"seq amino acid:\n{row.seq_aa.upper()}")
        print(f"seq amino acid, translated:\n{translated}")
        print()
        count += 1
    # only show the first 10 mismatches
    if count >= 10:
        break

index:
34
seq nt:
gaggtgcaccttcggagtcaggacctagcctcgtgaaaccttctcagactctgtccctcacctgttctgtcactggcgactccatcaccagtggttactggaactggatccggaaattcccagggaataaacttgagtacatggggtacataagctacagtggtagcacttactacaatccatctctcaacagtcgaatctccatccctcgagacacatccaagaaccagtactacctacagttgaattctgtgacttctgaggacacagccacatattactgtgcaagggacttcgatgtctggggcgcagggaccacggtcaccgtctcctcag
seq nt upper:
GAGGTGCACCTTCGGAGTCAGGACCTAGCCTCGTGAAACCTTCTCAGACTCTGTCCCTCACCTGTTCTGTCACTGGCGACTCCATCACCAGTGGTTACTGGAACTGGATCCGGAAATTCCCAGGGAATAAACTTGAGTACATGGGGTACATAAGCTACAGTGGTAGCACTTACTACAATCCATCTCTCAACAGTCGAATCTCCATCCCTCGAGACACATCCAAGAACCAGTACTACCTACAGTTGAATTCTGTGACTTCTGAGGACACAGCCACATATTACTGTGCAAGGGACTTCGATGTCTGGGGCGCAGGGACCACGGTCACCGTCTCCTCAG
seq amino acid:
EVHLRSQDLAS*NLLRLCPSPVLSGDSITSGYWNWIRKFPGNKLEYMGYISYSGSTYYNPSLNSRISIPRDTSKNQYYLQLNSVTSEDTATYYCARDFDVWGAGTTVTVSS
seq amino acid, translated:
EVHLRSQDLAS*NLLRLCPSPVLSLATPSPVVTGTGSGNSQGINLSTWGT*ATVVALTTIHLSTVESPSLETHPRTSTTYS*IL*LLRTQPHITVQGTSMSGAQGPRSPSPQ

index:
40
seq nt:
aagtcg

In [245]:
len("DIVMTQSQKFMSTSVGDRVSVTCKASQNVGTNVAWYQQKPGQSPKALF") * 3

144

In [274]:
seq = "GACATTGTGATGACTCAGTCTCAAAAATTCATGTCCACATCAGTAGGAGACAGGGTCAGCGTCACCTGCAAGGCCAGTCAGAATGTGGGTACTAATGTAGCCTGGTATCAACAGAAACCAGGGCAATCTCCTAAAGCACTTTTTACTCGGCATCCTACAGGTACAGTGGAGTCCCTGATCGCTTCACAGGCAGTGGATCTGGGACAGATTTCACTCTCACCATCAGCAATGTGCAGTCTGAAGACTTGGCAGAGTATTTCTGTCAGCAATATAACAGCTATCCTCTCACGTTCGGCTCGGGGACTAAGCTAGAAATAAAAC"
seq[140:]

'TTTTACTCGGCATCCTACAGGTACAGTGGAGTCCCTGATCGCTTCACAGGCAGTGGATCTGGGACAGATTTCACTCTCACCATCAGCAATGTGCAGTCTGAAGACTTGGCAGAGTATTTCTGTCAGCAATATAACAGCTATCCTCTCACGTTCGGCTCGGGGACTAAGCTAGAAATAAAAC'
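
When the IMGT amino acid string and our translation disagree, it helps to know where they first diverge. For the index 34 row printed above, the two strings agree for the first 24 residues and differ from there on, which is what a frame shift in the concatenated nucleotides would look like. Whether IMGT translates each region separately is an assumption worth checking. `first_mismatch` below is a hypothetical helper for locating that point.

```python
def first_mismatch(a, b):
    """Index of the first differing character, or None if a and b are identical.

    If one string is a strict prefix of the other, the index returned is the
    length of the shorter string.
    """
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))
```

Multiplying the returned residue index by 3 gives the approximate nucleotide position at which to start looking for an indel.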

# Partis Data

Data that came from the pipeline and was then annotated with the partis tool.

### Database generation

There is a vast amount of information in the partis annotation output. 


In [248]:
partis_output_dir = "../partis_test/PR-1-6/"
partis_igk = pd.read_csv(partis_output_dir+"engrd/single-chain/partition-igk.tsv", sep="\t")
partis_igh = pd.read_csv(partis_output_dir+"engrd/single-chain/partition-igh.tsv", sep="\t")

In [249]:
print(f"There are {partis_igh.shape[0]+partis_igk.shape[0]} total sequences identified by partis annotation")

There are 5161 total sequences identified by partis annotation


In [250]:
partis_igh.head()

Unnamed: 0,sequence_id,sequence,rev_comp,productive,v_call,d_call,j_call,sequence_alignment,germline_alignment,junction,...,d_sequence_start,d_sequence_end,d_germline_start,d_germline_end,j_support,j_identity,j_sequence_start,j_sequence_end,j_germline_start,j_germline_end
0,PR-1-6.211203.unmatched.H04.H.R.1-1,GAGGTGCAGCTTCAGGAGTCAGGACCTAGCCTCGTGAAACCTTCTC...,False,True,IGHV3-8*chg,IGHD1-1*chg,IGHJ1*chg,GAGGTGCAGCTTCAGGAGTCAGGACCTAGCCTCGTGAAACCTTCTC...,GAGGTGCAGCTTCAGGAGTCAGGACCTAGCCTCGTGAAACCTTCTC...,TGTGGAAGGGACTTCGATGTCTGG,...,290.0,294.0,1.0,5.0,1.0,1.0,295.0,337.0,1.0,43.0
1,PR-1-6.211203.unmatched.H03.H.R.2-1,GAGGTGCAGCTTCAGGAGTCAGGACCCAGCCTCGTGAAACCTTCTC...,False,True,IGHV3-8*chg,IGHD1-1*chg,IGHJ1*chg,GAGGTGCAGCTTCAGGAGTCAGGACCCAGCCTCGTGAAACCTTCTC...,GAGGTGCAGCTTCAGGAGTCAGGACCTAGCCTCGTGAAACCTTCTC...,TGTGCAAGGGACTTCGATGTCTGG,...,290.0,294.0,1.0,5.0,1.0,1.0,295.0,337.0,1.0,43.0
2,PR-1-6.211203.P01.B12.H.R.3-1,GAGGTGCAGCTTCAGGAGTCAGGACCTAGCCTCGTGAAACCTTCTC...,False,True,IGHV3-8*chg,IGHD1-1*chg,IGHJ1*chg,GAGGTGCAGCTTCAGGAGTCAGGACCTAGCCTCGTGAAACCTTCTC...,GAGGTGCAGCTTCAGGAGTCAGGACCTAGCCTCGTGAAACCTTCTC...,TGTGCAAGGGACTTCGATGTCTGG,...,290.0,294.0,1.0,5.0,1.0,1.0,295.0,337.0,1.0,43.0
3,PR-1-6.211203.P01.D04.H.R.1-1,GAGGTGCAGCTTCAGGAGTCAGGACCTAGCCTCGTGAAACCTTCTC...,False,True,IGHV3-8*chg,IGHD1-1*chg,IGHJ1*chg,GAGGTGCAGCTTCAGGAGTCAGGACCTAGCCTCGTGAAACCTTCTC...,GAGGTGCAGCTTCAGGAGTCAGGACCTAGCCTCGTGAAACCTTCTC...,TGTGCAAGGGACTTCGATGTCTGG,...,290.0,294.0,1.0,5.0,1.0,1.0,295.0,337.0,1.0,43.0
4,PR-1-6.211203.P01.F06.H.R.2-1,GAGGTGCAGCTTCAGGAGTCAGGACCTAGCCTCGTGAAACCTTCTC...,False,True,IGHV3-8*chg,IGHD1-1*chg,IGHJ1*chg,GAGGTGCAGCTTCAGGAGTCAGGACCTAGCCTCGTGAAACCTTCTC...,GAGGTGCAGCTTCAGGAGTCAGGACCTAGCCTCGTGAAACCTTCTC...,TGTGCAAGGGACTTCGATGTCTGG,...,290.0,294.0,1.0,5.0,1.0,0.976744,295.0,337.0,1.0,43.0


In [251]:
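# DataFrame.append was removed in pandas 2.0; pd.concat keeps the original
# per-file indices, just as append did
partis_airr = pd.concat([partis_igk, partis_igh])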

In [252]:
print(partis_airr.shape)
partis_airr.head()

(5161, 42)


Unnamed: 0,sequence_id,sequence,rev_comp,productive,v_call,d_call,j_call,sequence_alignment,germline_alignment,junction,...,d_sequence_start,d_sequence_end,d_germline_start,d_germline_end,j_support,j_identity,j_sequence_start,j_sequence_end,j_germline_start,j_germline_end
0,PR-1-6.211203.P14.A06.K.R.1-335,GACATTGTAATGACTCAGTCTCAAAAATTCATGTCCACATCAGAAG...,False,True,IGKV6-15*chg,IGKDx-x*x,IGKJ4*chg,GACATTGTAATGACTCAGTCTCAAAAATTCATGTCCACATCAGAAG...,GACATTGTGATGACTCAGTCTCAAAAATTCATGTCCACATCAGTAG...,TGTCACCAATATAAAAGTTATCCTCTCACGTTC,...,288.0,287.0,2.0,1.0,1.0,0.970588,288.0,321.0,1.0,34.0
1,PR-1-6.211203.unmatched.G05.K.R.1-39,GACATTGTGATGACTCAGTCTCAAAAATTCATGTCCACATCAGTAG...,False,True,IGKV6-15*chg,IGKDx-x*x,IGKJ4*chg,GACATTGTGATGACTCAGTCTCAAAAATTCATGTCCACATCAGTAG...,GACATTGTGATGACTCAGTCTCAAAAATTCATGTCCACATCAGTAG...,TGTCAGCAATATAACAGCTATCCTCTCACGTTC,...,288.0,287.0,2.0,1.0,1.0,1.0,288.0,321.0,1.0,34.0
2,PR-1-6.211203.P10.A04.K.R.1-675,GACATTGTGATGACTCAGTCTCAAAAATTCATGTCCACATCAGTAG...,False,True,IGKV6-15*chg,IGKDx-x*x,IGKJ4*chg,GACATTGTGATGACTCAGTCTCAAAAATTCATGTCCACATCAGTAG...,GACATTGTGATGACTCAGTCTCAAAAATTCATGTCCACATCAGTAG...,TGTCAGCAATATAACAGCTATCCTCTCACGTTC,...,288.0,287.0,2.0,1.0,1.0,1.0,288.0,321.0,1.0,34.0
3,PR-1-6.211203.P10.unmatched.H.R.1-7,GACATTGTGATGACTCAGTCTCAAAAATTCATGTCCACATCAGTAG...,False,True,IGKV6-15*chg,IGKDx-x*x,IGKJ4*chg,GACATTGTGATGACTCAGTCTCAAAAATTCATGTCCACATCAGTAG...,GACATTGTGATGACTCAGTCTCAAAAATTCATGTCCACATCAGTAG...,TGTCAGCAATATAACAGCTATCCTCTCACGTTC,...,288.0,287.0,2.0,1.0,1.0,1.0,288.0,321.0,1.0,34.0
4,PR-1-6.211203.P05.A07.K.R.3-5,GACATTGTGATGACTCAGTCTCAAAAATTCATGTCCACATCAGTAG...,False,True,IGKV6-15*chg,IGKDx-x*x,IGKJ4*chg,GACATTGTGATGACTCAGTCTCAAAAATTCATGTCCACATCAGTAG...,GACATTGTGATGACTCAGTCTCAAAAATTCATGTCCACATCAGTAG...,TGTCAGCAATATAACAGCTATCCTCTCACGTTC,...,288.0,287.0,2.0,1.0,1.0,1.0,288.0,321.0,1.0,34.0


In [253]:
# partis_airr["sequence"]

In [254]:
# convert back to expected sequence_id like tatsuya does
# TODO, we'll include the date in the manifest for automation
partis_airr["nf_sequence_id"] = partis_airr["sequence_id"]
partis_airr[
    ['nf_PR','date', 'nf_plate', 'nf_well', 'nf_chain','_', 'nf_rank_string']
] = partis_airr['sequence_id'].str.split('.', expand=True)
partis_airr["nf_rank"] = [int(r.split("-")[0]) for r in partis_airr["nf_rank_string"]]
partis_airr["nf_count"] = [int(r.split("-")[1]) for r in partis_airr["nf_rank_string"]]
partis_airr.drop(["nf_rank_string", "_"], axis=1, inplace=True)
partis_airr["sequence_id"] = [
    f"211210{r.nf_plate}{r.nf_well}{r.nf_chain}_{r.nf_rank}-{r.nf_count}"
    for i, r in partis_airr.iterrows()
]
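
The same conversion can be written as a small pure-Python function, which makes it easy to test on single IDs. This is a sketch that mirrors the cell above: `to_imgt_style_id` is a hypothetical name, and the default `date` is the same hard-coded `211210` prefix used there.

```python
def to_imgt_style_id(nf_sequence_id, date="211210"):
    """Convert e.g. 'PR-1-6.211203.P01.B12.H.R.3-1' to '211210P01B12H_3-1'.

    The pipeline ID is expected to have seven dot-separated fields:
    PR, date, plate, well, chain, an unused field, and 'rank-count'.
    """
    parts = nf_sequence_id.split(".")
    if len(parts) != 7:
        raise ValueError(
            f"expected 7 dot-separated fields, got {len(parts)}: {nf_sequence_id!r}"
        )
    _pr, _date, plate, well, chain, _unused, rank_string = parts
    rank, count = (int(x) for x in rank_string.split("-"))
    return f"{date}{plate}{well}{chain}_{rank}-{count}"
```

Applied to a column, this would be `partis_airr["nf_sequence_id"].map(to_imgt_style_id)`.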

In [255]:
partis_airr["seq_nt"] = [seq.lower() for seq in partis_airr["sequence"]]
# partis_airr["seq_aa"] = [translate(seq.upper()).lower() for seq in partis_airr["seq_nt"]]

In [256]:
partis_airr["germline_alignment"]

0       GACATTGTGATGACTCAGTCTCAAAAATTCATGTCCACATCAGTAG...
1       GACATTGTGATGACTCAGTCTCAAAAATTCATGTCCACATCAGTAG...
2       GACATTGTGATGACTCAGTCTCAAAAATTCATGTCCACATCAGTAG...
3       GACATTGTGATGACTCAGTCTCAAAAATTCATGTCCACATCAGTAG...
4       GACATTGTGATGACTCAGTCTCAAAAATTCATGTCCACATCAGTAG...
                              ...                        
2061                                                  NaN
2062                                                  NaN
2063                                                  NaN
2064                                                  NaN
2065                                                  NaN
Name: germline_alignment, Length: 5161, dtype: object

In [257]:
sum(partis_airr["germline_alignment"].isna().values)

95

In [258]:
partis_airr["seq_nt_lens"] = [len(seq) for seq in partis_airr["seq_nt"].fillna("").values]

In [259]:
# partis_airr["seq_nt_lens"].value_counts()

In [260]:
5161 - (3012 + 2050)

99

## Sanity Check 2 - Single Cell Comparison

Let's look at the ranked BCRs from the same cell in both methods.

In [261]:
partis_airr["sequence_id_minus_rank"] = [v.split("_")[0] for v in partis_airr["sequence_id"].values]
imgt_airr["sequence_id_minus_rank"] = [v.split("_")[0] for v in imgt_airr["sequence_id"].values]
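
Before comparing individual cells, it may be useful to know which rank-stripped IDs appear in both tables and which appear in only one. A minimal sketch using plain sets is below; `compare_id_sets` is a hypothetical helper.

```python
def compare_id_sets(ids_a, ids_b):
    """Split two ID collections into shared, only-in-a and only-in-b sets."""
    a, b = set(ids_a), set(ids_b)
    return {"both": a & b, "only_a": a - b, "only_b": b - a}
```

For example, `compare_id_sets(imgt_airr["sequence_id_minus_rank"], partis_airr["sequence_id_minus_rank"])` would return the three sets, and their sizes give a quick overview of how well the two ID schemes line up.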


In [266]:
print("\n".join(partis_airr.columns))

sequence_id
sequence
rev_comp
productive
v_call
d_call
j_call
sequence_alignment
germline_alignment
junction
junction_aa
v_cigar
d_cigar
j_cigar
clone_id
vj_in_frame
stop_codon
locus
np1
np2
duplicate_count
cdr3_start
cdr3_end
cell_id
v_support
v_identity
v_sequence_start
v_sequence_end
v_germline_start
v_germline_end
d_support
d_identity
d_sequence_start
d_sequence_end
d_germline_start
d_germline_end
j_support
j_identity
j_sequence_start
j_sequence_end
j_germline_start
j_germline_end
nf_sequence_id
nf_PR
date
nf_plate
nf_well
nf_chain
nf_rank
nf_count
seq_nt
seq_nt_lens
sequence_id_minus_rank


In [315]:
columns_of_interest = [
    "sequence_id",
    "locus",
#     "v_call", 
#     "d_call", 
#     "j_call",
    "productive",
    "junction_aa",
    "sequence",
    "seq_nt",
    "seq_nt_lens",
    "v_cigar"
#     "seq_aa"
]

In [316]:
imgt_coi = imgt_airr.loc[imgt_airr["sequence_id_minus_rank"]=="211210P10A01K", columns_of_interest ]
imgt_coi

Unnamed: 0,sequence_id,locus,productive,junction_aa,sequence,seq_nt,seq_nt_lens,v_cigar
2454,211210P10A01K_1-400,IGK,T,CQQYHSYPLTF,gtgttgatggagacattgtgatgactcagtctcaaaaattcatgtc...,gacattgtgatgactcagtctcaaaaattcatgtccacatcagtag...,322,11S14=1X144=1X55=1X57=1X13=74S
2455,211210P10A01K_2-3,IGK,T,CQQYHSYPLTF,gtgttgatggagacattgtgatgactcagtctcaaaaattcatgtc...,gacattgtgatgactcagtctcaaaaattcatgtccacatcagtag...,322,11S14=1X144=1X55=1X57=1X13=74S
2456,211210P10A01K_3-3,IGK,T,CQQYHSYPLTF,gtgttgatggagacattgtgatgactcagtctcaaaaattcatgtc...,gacattgtgatgactcagtctcaaaaattcatgtccacatcagtag...,322,11S14=1X144=1X55=1X57=1X13=73S


In [318]:
partis_coi = partis_airr.loc[partis_airr["sequence_id_minus_rank"]=="211210P10A01K", columns_of_interest + ["germline_alignment"]]
partis_coi

Unnamed: 0,sequence_id,locus,productive,junction_aa,sequence,seq_nt,seq_nt_lens,v_cigar,germline_alignment
2116,211210P10A01K_1-389,IGK,True,CQQYHSYPLTF,GACATTGTGATGACTCAGTCTCAAAAATTCATGTCCACATCAGTAG...,gacattgtgatgactcagtctcaaaaattcatgtccacatcagtag...,321,287M34I,GACATTGTGATGACTCAGTCTCAAAAATTCATGTCCACATCAGTAG...
2170,211210P10A01K_2-3,IGK,False,CQQYHSYPLTF,GACATTGTGATGACTCAGTCTCAAAAAATTCATGTCCACATCAGTA...,gacattgtgatgactcagtctcaaaaaattcatgtccacatcagta...,322,22M1I265M34I,GACATTGTGATGACTCAGTCTC.AAAAATTCATGTCCACATCAGTA...
2223,211210P10A01K_3-3,IGK,True,CQQYHSYPLTF,GACATTGTGATGACTCAGTCTCAAAAATTCACGTCCACATCAGTAG...,gacattgtgatgactcagtctcaaaaattcacgtccacatcagtag...,321,287M34I,GACATTGTGATGACTCAGTCTCAAAAATTCATGTCCACATCAGTAG...
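
The two tools write `v_cigar` in different dialects: IMGT uses `=`, `X` and `S`, while partis uses `M` and `I`. A small parser makes them easier to compare. The sketch below assumes the standard SAM operation letters, in which `M`, `I`, `S`, `=` and `X` consume bases of the query sequence; `parse_cigar` and `query_length` are hypothetical helper names.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that consume bases of the query sequence (SAM specification).
_QUERY_OPS = frozenset("MIS=X")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) tuples."""
    return [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]


def query_length(cigar):
    """Number of query bases covered by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in _QUERY_OPS)
```

For the three partis rows above, `query_length` gives 321, 322 and 321, matching `seq_nt_lens`. The IMGT strings include soft-clipped bases, so their query length (372 for the first row) refers to the full input read rather than to `seq_nt`.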


## Locus

In [283]:
imgt_airr.locus.value_counts()

IGK    3030
IGH    1823
Name: locus, dtype: int64

In [284]:
partis_airr.locus.value_counts()

IGK    3016
IGH    2050
Name: locus, dtype: int64

## V Call

In [288]:
imgt_airr.v_call.value_counts()

Musmus IGKV6-15*01 F                                                  2914
Musmus IGHV3-8*02 F                                                   1788
Musmus IGKV6-15-1*01 P                                                  97
Musmus IGKV4-72*01 F                                                     8
Musmus IGHV3-8*01 F                                                      7
Musmus IGKV6-14*01 F                                                     7
Musmus IGHV3-8*02 F, or Musmus IGHV3S1*02 F                              6
Musmus IGHV3-8*02 F, or Musmus IGHV3S1*01 F                              6
Musmus IGKV6-23*01 F                                                     3
Musmus IGHV3S1*01 F, or Musmus IGHV3S1*02 F                              3
Musmus IGHV3-3*03 F                                                      2
Musmus IGHV3-8*01 F, or Musmus IGHV3-8*02 F or Musmus IGHV3S1*01 F       2
Musmus IGHV1-74*01 F                                                     2
Musmus IGHV2-5*01 F      

In [293]:
sum(imgt_airr.v_call.isna())

434

In [286]:
partis_airr.v_call.value_counts()

IGKV6-15*chg    3016
IGHV3-8*chg     2050
Name: v_call, dtype: int64

In [294]:
sum(partis_airr.v_call.isna())

95
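
The IMGT calls carry a species prefix, an allele, a functionality flag and sometimes several alternatives, while the partis calls use a `*chg` allele. To compare them at the gene level, one option is to strip everything but the gene names. The sketch below is an assumption about what a useful normalization looks like, not something taken from either tool; `gene_names` is a hypothetical helper.

```python
import re

# Gene name = IG + locus letter + segment letter + everything up to the "*"
# that introduces the allele.
_GENE = re.compile(r"(IG[HKL][VDJ][A-Za-z0-9-]*)\*")


def gene_names(call):
    """Set of gene names mentioned in a V/D/J call; empty for missing values."""
    if not isinstance(call, str):
        return set()
    return set(_GENE.findall(call))
```

Two calls can then be treated as compatible when their gene sets intersect, for example `gene_names(imgt_call) & gene_names(partis_call)`.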

## D Call

In [298]:
imgt_airr.d_call.value_counts()

Musmus IGHD3-3*01     32
Musmus IGHD2-4*01      8
Musmus IGHD2-13*01     3
Musmus IGHD3-1*01      2
Musmus IGHD1-2*01      2
Musmus IGHD1-1*02      2
Musmus IGHD6-1*01      2
Musmus IGHD2-14*01     1
Musmus IGHD1-1*01      1
Musmus IGHD2-2*01      1
Name: d_call, dtype: int64

In [299]:
sum(imgt_airr.d_call.isna())

5233

In [300]:
partis_airr.d_call.value_counts()

IGKDx-x*x      3016
IGHD1-1*chg    2050
Name: d_call, dtype: int64

In [301]:
sum(partis_airr.d_call.isna())

95

## J Call

In [302]:
imgt_airr.j_call.value_counts()

Musmus IGKJ4*01 F                                               2467
Musmus IGHJ1*01 F                                               1768
Musmus IGKJ4*01 F, or Musmus IGKJ4*02 F                          215
Musmus IGKJ2*02 F, or Musmus IGKJ4*01 F                          120
Musmus IGKJ2*03 F, or Musmus IGKJ4*01 F                           40
Musmus IGKJ1*02 F                                                 15
Musmus IGHJ1*03 F                                                 15
Musmus IGKJ2*02 F, or Musmus IGKJ4*01 F or Musmus IGKJ4*02 F      14
Musmus IGKJ2*02 F                                                  9
Musmus IGHJ1*01 F, or Musmus IGHJ1*03 F                            9
Musmus IGKJ5*01 F                                                  8
Musmus IGKJ2*01 F                                                  8
Musmus IGKJ2*01 F, or Musmus IGKJ4*01 F                            6
Musmus IGKJ4*02 F                                                  6
Musmus IGHJ3*01 F, or Musmus IGHJ3

In [303]:
sum(imgt_airr.j_call.isna())

574

In [304]:
partis_airr.j_call.value_counts()

IGKJ4*chg    3016
IGHJ1*chg    2050
Name: j_call, dtype: int64

In [305]:
sum(partis_airr.j_call.isna())

95

## Nucleotide Sequence

In [279]:
imgt_coi.seq_nt.values

array(['gacattgtgatgactcagtctcaaaaattcatgtccacatcagtaggagacagggtcagcgtcacctgcaaggccagtcagaatgtgggtactaatgtagcctggtatcaacagaaaccagggcaatctcctaaagcactgatttactcggcatcctacaggtacagtggagtccctgatcgcttcacaggcagtggatctgggacagatttcacgctcaccatcagcaatgtgcagtctgaagacttggcagagtatttctgtcagcaatatcacagctatcctctcacgttcggctcggggactaagctagaaataaaac',
       'gacattgtgatgactcagtctcaaaaattcatgtccacatcagtaggagacagggtcagcgtcacctgcaaggccagtcagaatgtgggtactaatgtagcctggtatcaacagaaaccagggcaatctcctaaagcactgatttactcggcatcctacaggtacagtggagtccctgatcgcttcacaggcagtggatctgggacagatttcacgctcaccatcagcaatgtgcagtctgaagacttggcagagtatttctgtcagcaatatcacagctatcctctcacgttcggctcggggactaagctagaaataaaac',
       'gacattgtgatgactcagtctcaaaaattcatgtccacatcagtaggagacagggtcagcgtcacctgcaaggccagtcagaatgtgggtactaatgtagcctggtatcaacagaaaccagggcaatctcctaaagcactgatttactcggcatcctacaggtacagtggagtccctgatcgcttcacaggcagtggatctgggacagatttcacgctcaccatcagcaatgtgcagtctgaagacttggcagagtatttctgtcagcaatatcacagctatcctctcacgttcggctcggggactaagctagaaataaaac'],


In [281]:
partis_coi.seq_nt.values

array(['gacattgtgatgactcagtctcaaaaattcatgtccacatcagtaggagacagggtcagcgtcacctgcaaggccagtcagaatgtgggtactaatgtagcctggtatcaacagaaaccagggcaatctcctaaagcactgatttactcggcatcctacaggtacagtggagtccctgatcgcttcacaggcagtggatctgggacagatttcacgctcaccatcagcaatgtgcagtctgaagacttggcagagtatttctgtcagcaatatcacagctatcctctcacgttcggctcggggactaagctagaaataaaa',
       'gacattgtgatgactcagtctcaaaaaattcatgtccacatcagtaggagacagggtcagcgtcacctgcaaggccagtcagaatgtgggtactaatgtagcctggtatcaacagaaaccagggcaatctcctaaagcactgatttactcggcatcctacaggtacagtggagtccctgatcgcttcacaggcagtggatctgggacagatttcacgctcaccatcagcaatgtgcagtctgaagacttggcagagtatttctgtcagcaatatcacagctatcctctcacgttcggctcggggactaagctagaaataaaa',
       'gacattgtgatgactcagtctcaaaaattcacgtccacatcagtaggagacagggtcagcgtcacctgcaaggccagtcagaatgtgggtactaatgtagcctggtatcaacagaaaccagggcaatctcctaaagcactgatttactcggcatcctacaggtacagtggagtccctgatcgcttcacaggcagtggatctgggacagatttcacgctcaccatcagcaatgtgcagtctgaagacttggcagagtatttctgtcagcaatatcacagctatcctctcacgttcggctcggggactaagctagaaataaaa'],
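
Judging from the printed rows, the IMGT sequences end in `...aaaac` while the partis ones end in `...aaaa`, the second partis read carries an extra `a` in the `caaaaaatt` run, and the third has a `t` to `c` substitution (`ttcacg` instead of `ttcatg`). Rather than comparing by eye, the edits between two sequences can be listed with the standard library's `difflib`. `sequence_edits` is a hypothetical helper, and `difflib` is a generic text differ rather than a biological aligner, so within homopolymer runs the reported position of an indel is somewhat arbitrary.

```python
from difflib import SequenceMatcher


def sequence_edits(a, b):
    """List (tag, piece_of_a, piece_of_b) for every non-matching stretch.

    tag is one of 'replace', 'delete' or 'insert', describing how to turn
    a into b.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

For example, `sequence_edits(imgt_coi.seq_nt.values[0], partis_coi.seq_nt.values[0])` would list the differences for the top-ranked read.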
  

## Productive

In [308]:
imgt_airr.productive.value_counts()

T    4612
F     241
Name: productive, dtype: int64

In [307]:
partis_airr.productive.value_counts()

True     4523
False     543
Name: productive, dtype: int64
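
The two tables encode `productive` differently (`T`/`F` for IMGT, `True`/`False` for partis), so a direct row-by-row comparison needs a common representation first. A minimal sketch is below; `to_bool` is a hypothetical helper that maps anything it does not recognize, including missing values, to `None`.

```python
def to_bool(value):
    """Map 'T'/'F'/'True'/'False' (any case) or booleans to bool, else None."""
    text = str(value).strip().upper()
    if text in ("T", "TRUE"):
        return True
    if text in ("F", "FALSE"):
        return False
    return None
```

Both columns can then be normalized with `.map(to_bool)` before merging on `sequence_id`.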

## Junction

In [310]:
imgt_airr.junction.value_counts()

tgtcagcaatataacagctatcctctcacgttc     1388
tgtgcaagggacttcgatgtctgg               843
tgtggaagggacttcgatgtctgg               623
tgtcaccaatataaaagctatcctatcacgttc      139
tgtcagcaatatagcagctatcctctcacgttc       64
                                      ... 
tgtcagcaatataagaactaccctatcacgttc        1
tgtggaagagacttcgatttctgg                 1
tgtcaccaatataaaaactatcctctcacgtttc       1
tatcaccaatataaaagttatcctctcacgttc        1
tgtcagcaatataacagctatcctctcacgtgc        1
Name: junction, Length: 320, dtype: int64

In [311]:
partis_airr.junction.value_counts()

TGTCAGCAATATAACAGCTATCCTCTCACGTTC    1587
TGTGCAAGGGACTTCGATGTCTGG              932
TGTGGAAGGGACTTCGATGTCTGG              705
TGTCACCAATATAAAAGCTATCCTATCACGTTC     136
TGTCAGCAATATAGCAGCTATCCTCTCACGTTC      61
                                     ... 
TGGGCAAGGGACGTCGATGTCTGG                1
TGTTCAAGGGACTTCACCGTCTGG                1
TGTGGAGGGGACTTCGATGTCTGG                1
TGTCAGCTATATAACAGCTATCCTCTCACGTTC       1
TGGGGAAGGGACTTCGATTTCTGG                1
Name: junction, Length: 335, dtype: int64
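
IMGT reports junctions in lower case and partis in upper case, so the two `value_counts()` tables cannot be joined directly. The sketch below counts junctions case-insensitively and lists those whose counts differ between the two methods. `junction_counts` and `count_differences` are hypothetical helpers.

```python
from collections import Counter


def junction_counts(junctions):
    """Count junction sequences case-insensitively, skipping missing values."""
    return Counter(j.upper() for j in junctions if isinstance(j, str))


def count_differences(counts_a, counts_b):
    """Junction -> (count in a, count in b) for junctions whose counts differ."""
    keys = set(counts_a) | set(counts_b)
    return {
        k: (counts_a[k], counts_b[k])
        for k in keys
        if counts_a[k] != counts_b[k]
    }
```

For example, `count_differences(junction_counts(imgt_airr.junction), junction_counts(partis_airr.junction))` would return one entry per junction that the two methods count differently, with a count of 0 for junctions seen by only one of them.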

## Sequence Length

In [107]:
imgt_airr["seq_nt_lens"].value_counts().head()

322    2769
337    1757
0       434
264      87
321      53
Name: seq_nt_lens, dtype: int64

In [108]:
partis_airr["sequence_lens"].value_counts().head()

321    2644
337    1998
322     283
320      56
372      39
Name: sequence_lens, dtype: int64

## Productive