## PacBio analysis pipeline
I used the following tutorial to do alignments and filtering: https://jbloomlab.github.io/alignparse/lasv_pilot.html

Pipeline:
- Filter with dms_tools2; this loses <10% of reads (mostly reads whose barcodes don't align)
- Make two dataframes, one with a row for each read and one with a row for each unique barcode
- Make a consensus mutation for each barcode from all reads of that barcode
    -  This depends on there being enough reads of an individual barcode
    -  If a mutation appears in no more than X% of the reads of a given barcode, discard it (X is `CONSENSUS_CUTOFF`, 90% here)
- Check whether what remains is an allowed (programmed) mutation; if so, keep it
- Look for mutations in the backbone, filter barcodes with such mutations
- Only keep barcodes with at least Y reads (Y is `MINIMUM_READS`, 4 here)
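
The consensus step above can be sketched on toy data. This is a minimal standard-library illustration rather than the notebook's `mutConsenses`; the helper name `consensus_mutations`, the example mutation strings, and the toy reads are assumptions made for the example. Unlike `mutConsenses`, it counts a mutation at most once per read.

```python
from collections import Counter

def consensus_mutations(reads, cutoff=0.9):
    """Return the mutations present in more than `cutoff` of one barcode's reads.

    `reads` is a list of per-read mutation lists; an empty list means the
    read matched the template. Hypothetical helper for illustration only.
    """
    n_reads = len(reads)
    # Count each mutation at most once per read.
    counts = Counter(mut for read in reads for mut in set(read))
    return sorted(mut for mut, c in counts.items() if c / n_reads > cutoff)

# Ten reads of one barcode: every read carries A10G, and one read also has
# a likely sequencing error (C55T) that falls below the cutoff.
reads = [["A10G"]] * 9 + [["A10G", "C55T"]]
print(consensus_mutations(reads))  # ['A10G']
```

A mutation seen in exactly 90% of reads is not kept, because the comparison is strict (`>`), as it is in the notebook.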

### Defined Functions
**mutConsenses:**
    - Makes a list of all mutants and a list of just unique mutations
    - figures out what fraction of reads each mutation is in
    - Makes final single member list for easy conversion
**enumerateMuts:**
    - Takes a barcode and makes a list of all the barcodes that are one substitution away from it
**oneCodonOnlyCheck:**
    - Lists the codons affected by the consensus mutations, so we can check whether they all fall in the same codon
**translate:**
    - Used to make the amino acid notation mutation name
**passedAllFilters:**
    - Generate final barcode list with paired mutants after all filters
**filterReads:**
    - Filters we use to generate independent datasets
**getBCfreqs:**
    - Filters out no mutation reads
    - Gets frequency of each barcode
**getMutPos:**
    - Extract all mutated nucleotide positions in a given consensus
**getCodonSeqs:**
    - Figure out what the original and the new codons are
**OBOcheck:**
    - Figure out which barcodes are one mutation away from another barcode
    - Return the similar barcode.  Keep the barcode with more reads
**getAAmut:**
    - Return the mutation and the new amino acid for barcodes with just one mutated codon
**mutFilters:**
    - Make columns indicating if there is a consensus mutation with no indels
**BBmutFilters:**
    - Choose barcodes where the backbone has either no mutation or no consensus mutation
**getConsensus:**
    - Build the table of unique barcodes with their consensus gene and backbone mutations
**shiftAAone:**
    - Shift the amino acid number in the mutation name by one

**Preassigned Variable List:** <br>
    - filterLabels <br>
    - PROGRAMMED_MUTATION_CODONS <br>
    - Template sequence
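
The off-by-one barcode check described for `enumerateMuts` and `OBOcheck` can be sketched with the standard library alone. The helper names `one_off_neighbours` and `drop_off_by_one` and the toy read counts are assumptions for illustration; the notebook itself works on a dataframe sorted by barcode frequency.

```python
def one_off_neighbours(barcode, alphabet="ACGT"):
    """All sequences that differ from `barcode` by exactly one substitution."""
    return [
        barcode[:i] + base + barcode[i + 1:]
        for i, original in enumerate(barcode)
        for base in alphabet
        if base != original
    ]

def drop_off_by_one(read_counts):
    """Keep the better-supported barcode of each one-substitution pair.

    `read_counts` maps barcode -> number of reads (hypothetical toy input).
    Barcodes are visited from fewest to most reads; a barcode is dropped
    when one of its neighbours is still in the lookup set.
    """
    remaining = set(read_counts)
    for barcode in sorted(read_counts, key=read_counts.get):
        if any(n in remaining for n in one_off_neighbours(barcode)):
            remaining.discard(barcode)
    return remaining

counts = {"ACGT": 50, "ACGA": 2, "TTTT": 7}
print(sorted(drop_off_by_one(counts)))  # ['ACGT', 'TTTT']
```

Using a set for the lookup keeps each membership test at constant time on average, which matters because every barcode of length L has 3L neighbours to check.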


In [1]:
import sys
print(sys.version)

3.8.8 (default, Apr 13 2021, 12:59:45) 
[Clang 10.0.0 ]


In [2]:
# Import packages, set global parameters
import numpy as np
import itertools
import math
import pandas as pd
import time
import Bio
from Bio import SeqIO
from Bio.Seq import Seq
import matplotlib
import matplotlib.pyplot as plt
from ast import literal_eval
import csv

# This is the degree of consensus we will be looking for later
CONSENSUS_CUTOFF = 0.9

# This is the minimum number of reads we will accept for the final lookup table
MINIMUM_READS = 4

# Total number of mutations we are looking for
TOTAL_MUTS = 8778

FILE_NAME = 'NP_11_21_1_BB_AB'


In [3]:
# Function definitions 

#---------------------------------------------------
def mutConsenses(BCmutListList):
    readsOfBC = len(BCmutListList)
    consensusMutDNA = []
    compiledMutList = []
    consensusPercentBCs = []
    compiledMutListUnique = []
    
    for mutList in BCmutListList:
        if type(mutList) != list:
            mutList = ['noMutation']
        for mut in mutList:
            compiledMutList.append(mut)
            if mut not in compiledMutListUnique:
                compiledMutListUnique.append(mut)
    for uniqueMut in compiledMutListUnique:
        percentBCs = compiledMutList.count(uniqueMut)/readsOfBC
        if percentBCs > CONSENSUS_CUTOFF:
            consensusMutDNA.append(uniqueMut)
            consensusPercentBCs.append(percentBCs)
    if consensusMutDNA == []:
        consensusMutDNA = ['NoConsensus']
        consensusPercentBCs = ['NoConsensus']
    return consensusMutDNA, consensusPercentBCs
#---------------------------------------------------
def enumerateMuts(barcode):
    offByOnes = []
    for i, letter in enumerate(barcode):
        for base in ['A','T','G','C']:
            if letter != base:
                offByOnes.append(barcode[:i] + base + barcode[i + 1:])
    return offByOnes
#----------------------------------------------------
def oneCodonOnlyCheck(listOfPositions):
    affectedCodons = []
    for position in listOfPositions:
        # position-1 gets the position with a 0 index
        # dividing by 3 and taking the floor gets the codon position
        # The codon position already has a 0 index
        mutPosition = math.floor((position-1)/3)
        if mutPosition not in affectedCodons:
            affectedCodons.append(mutPosition)
    return affectedCodons
#-----------------------------------------------------
def translate(seq): 
    table = { 
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M', 
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T', 
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K', 
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',                  
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L', 
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P', 
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q', 
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R', 
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V', 
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A', 
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E', 
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G', 
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S', 
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L', 
        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_', 
        'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W', 
    } 
    protein ="" 
    if len(seq)%3 == 0: 
        for i in range(0, len(seq), 3): 
            codon = seq[i:i + 3] 
            protein += table[codon] 
    return protein 
#-------------------------------------------------------
def passedAllFilters (row):
    if row['OBO_found'] == True or \
        row['No_Mutations'] == True or \
        row['No_Consensus'] == True or \
        row['Indels'] == True or \
        row['HowManyAffectedCodons'] != 1 or \
        row['Codon_Allowed'] == False or \
        row['mutantAA'] == 'Silent mutation' or \
        row['mutantAA'] == 'Nonsense mutation' or \
        row['Enough_reads'] == False or \
        row['No_BB_Mutations'] == False:
        return False
    else:
        return True
#----------------------------------------------------
def filterReads(df):
    filteredDFs = {}
    noFilter = df['BCfrequency'] > 0
    OBOCheck = df['OBO_found'] == False
    noMutCheck = df['No_Mutations'] == False
    consensusCheck = df['No_Consensus'] == False
    indelCheck = df['Indels'] == False
    OCOcheck = df['HowManyAffectedCodons'] == 1
    allowedCheck = df['Codon_Allowed'] == True
    silentCheck = df['AAMutation'] != 'Silent mutation'
    nonsenseCheck = df['AAMutation'] != 'Nonsense mutation'
    enoughReadsCheck = df['Enough_reads'] == True
    BBMutCheck = df['No_BB_Mutations'] == True
    allFilters = OBOCheck & noMutCheck & consensusCheck & indelCheck & OCOcheck & allowedCheck & \
                silentCheck & nonsenseCheck & enoughReadsCheck & BBMutCheck
    filterSet = [noFilter, OBOCheck, noMutCheck, consensusCheck, indelCheck, OCOcheck, allowedCheck, \
                 silentCheck, nonsenseCheck, enoughReadsCheck, BBMutCheck, allFilters]
    for i, aFilter in enumerate(filterSet):
        filteredDFs[filterLabels[i]] = df[aFilter]
    return filteredDFs
#-----------------------------------------------------
def getConsensus(uniqueBCsIn): # This also checks for consensus
    consensusMutList = []
    consensusLevelList = []
    consensusBBMutList = []
    appearanceList = []
    for x, BC in enumerate(uniqueBCsIn):
        consensusMut = 'XXX'
        consensusLevel = 0
        BCfrequency = 0
        uniqueBCGroup = BCGroups.get_group(BC)
        BCmutListList = uniqueBCGroup['splitMuts'].array
        BCBBmutListList = uniqueBCGroup['splitBBMuts'].array
#         appearanceList.append(uniqueBCGroup['BCfrequency'].array[0])
        consensusMut, consensusLevel = mutConsenses(BCmutListList)
        consensusMutList.append(consensusMut)
        consensusLevelList.append(consensusLevel)
        consensusBBMutList.append(mutConsenses(BCBBmutListList)[0])
        if x in [0,500,1000,5000,10000,50000,100000,200000]:
            print(str(100*x/len(uniqueBCsIn))[:4] + ' percent done')
    consensusDict = {
                      # Use the barcodes passed in, not the global uniqueBCs
                      "Barcode_sequence": uniqueBCsIn,
                      "Consensus_mutations": consensusMutList,
#                       "BCFrequency": appearanceList,
                      "Backbone_consensus_muts": consensusBBMutList
                    }
    newDF = pd.DataFrame.from_dict(consensusDict)
    return newDF
#-----------------------------------------------------
def getMutPos(row):
    positionList=[]
    if (row['No_Mutations'] == False and row['No_Consensus'] == False and row['Indels'] == False):
        for mut in row['Consensus_mutations']:
            positionList.append(int(mut[1:-1]))
    else:
        positionList.append(-1)
    return positionList
#-----------------------------------------------------
def getCodonSeqs (row):
    originalCodon = 'XXX'
    mutantCodon = 'YYY'
    mutantCodonList = []
    if row['HowManyAffectedCodons'] == 1 and row['Indels'] == False:
        codonPosition = row['AffectedCodonInt']
        DNAposition = codonPosition * 3
        originalCodon = TEMPLATE_STRING[DNAposition:DNAposition + 3]
        mutCodonSeqList = list(originalCodon)
        for mutationNumber, naPosition in enumerate(row['Mut_positions']):
            withinCodonPosition = (naPosition - 1)%3
            mutCodonSeqList[withinCodonPosition] = row['Consensus_mutations'][mutationNumber][-1:]
            mutantCodon = ''.join(mutCodonSeqList)
    return originalCodon, mutantCodon
#-----------------------------------------------------
def OBOcheck(row): 
    uniqueBC = row['Barcode_sequence']
    BCtoAdd = None
    hammingDFound = False
    for barcodeWithOneMistake in enumerateMuts(uniqueBC):
        if barcodeWithOneMistake in uniqueBCLookup:
            hammingDFound = True
            BCtoAdd = barcodeWithOneMistake
            # Since we are running from less to more frequent barcodes
            # This removes the less frequent barcode from consideration
            uniqueBCLookup.remove(uniqueBC)
            break
#     if row.name%50000 == 0:
#         print(str(100*row.name/len(uniqueBCs))[:4] + ' percent done')
    return hammingDFound, BCtoAdd
#-----------------------------------------------------
def getAAmut(row):
    aaMut = 'X0Y'
    mutantAA = 'X'
    if (row['Original_codon'] != 'XXX') and (row['Mutant_codon'] != 'YYY'):
        originalAA = translate(row['Original_codon'])
        AAposition = row['AffectedCodonInt']
        mutantAA = translate(row['Mutant_codon'])
        if originalAA == mutantAA:
            aaMut = mutantAA = 'Silent mutation'
        elif mutantAA == '_':
            aaMut = mutantAA = 'Nonsense mutation'
        else:
            aaMut = (originalAA + str(AAposition) + mutantAA)
    return aaMut, mutantAA
#-----------------------------------------------------
def mutFilters(row):
    noMut = noCon = indel = False
    if row.Consensus_mutations == ['noMutation']:
        noMut = True
    if row.Consensus_mutations == ['NoConsensus']:
        noCon = True
    if ('ins' in str(row.Consensus_mutations)) or ('del' in str(row.Consensus_mutations)):
        indel = True
    return noMut, noCon, indel
#-----------------------------------------------------
def BBmutFilters(row):
    noMut = False
    if (row.Backbone_consensus_muts == ['noMutation']) or (row.Backbone_consensus_muts == ['NoConsensus']):
        noMut = True
    return noMut
#-----------------------------------------------------
def shiftAAone (row):
    corrected = 'X0Y'
    if row['AAMutation'][1:-1].isnumeric():
        position = int(row['AAMutation'][1:-1])
        corrected = row['AAMutation'][0] + str(position + 1) + row['AAMutation'][-1]
    return corrected
#============================================================================
#============================================================================
filterLabels = ['All data', 'Off-by-one barcodes', 'No mutation found', \
                    'No consensus found', 'Indel found', 'More than one codon mutated', \
                    'Unallowed mutation', 'Silent mutation', 'Nonsense mutation', \
                    'Not enough reads', 'Backbone Mutation', 'Fully filtered']
#-----------------------------------------------------
PROGRAMMED_MUTATION_CODONS = ['CGT','CAT','AAA','GAT','GAA',
                              'AGC','ACC','AAT','CAG','TGC',
                              'GGC','CCG','GCG','GTG','ATT',
                              'CTG','ATG','TTT','TAT','TGG']
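
The position arithmetic used by `oneCodonOnlyCheck` and `getCodonSeqs` (1-based nucleotide positions mapped to 0-based codon indices) can be illustrated on a toy template. This is a sketch only: the helper names `codon_index`, `offset_in_codon` and `apply_mutations`, and the nine-base template, are assumptions made for the example.

```python
def codon_index(nt_position):
    """0-based codon index for a 1-based nucleotide position."""
    return (nt_position - 1) // 3

def offset_in_codon(nt_position):
    """0-based offset (0, 1 or 2) of a 1-based position within its codon."""
    return (nt_position - 1) % 3

def apply_mutations(template, mutations):
    """Return (original_codon, mutant_codon) for mutations within one codon.

    `mutations` are strings such as 'A4G': reference base, 1-based position,
    new base.
    """
    positions = [int(m[1:-1]) for m in mutations]
    codons = {codon_index(p) for p in positions}
    if len(codons) != 1:
        raise ValueError("mutations span more than one codon")
    start = codons.pop() * 3
    original = template[start:start + 3]
    mutant = list(original)
    for m, p in zip(mutations, positions):
        mutant[offset_in_codon(p)] = m[-1]
    return original, "".join(mutant)

template = "ATGAAACCC"  # toy template, three codons
print(codon_index(4), offset_in_codon(6))        # 1 2
print(apply_mutations(template, ["A4G", "A5C"]))  # ('AAA', 'GCA')
```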

In [None]:
# Import previously generated dataframe, add BC freq 
# This dataframe is PacBio data after using alignParse, it has barcodes, gene mutations and backbone mutations
readsDF = pd.read_csv('NP_11_21_1_backboneAware.csv') 

# Add a new column that displays how many times each barcode appears in the list
readsDF['BCfrequency'] = readsDF['Barcode_sequence'].map(readsDF['Barcode_sequence'].value_counts())
readsDF[['query_name', 'gene_mutations', 'Barcode_sequence', 'BCfrequency']]



In [None]:
# Convert values in mutation column into usable values 
# First split them by the spaces
readsDF['splitMuts'] = readsDF['gene_mutations'].str.split()
readsDF['splitBBMuts'] = readsDF['backbone_mutations'].str.split()

# This does a groupby and also makes a list of unique barcodes
BCGroups = readsDF.groupby('Barcode_sequence')
uniqueBCs = readsDF.Barcode_sequence.unique()

# uniqueBCsAndFreq = BCGroups.apply(lambda x: x['BCfrequency'].unique())
print('There are ' + str(readsDF.Barcode_sequence.nunique()) + ' unique barcodes in the table')


In [None]:
# Define the template for alignment - should be a fasta file with only one sequence in it 
TEMPLATE_FILE = 'rbcL_codonopt_barcoded.txt'
templateParsed = SeqIO.parse(TEMPLATE_FILE, 'fasta')
TEMPLATE_STRING = None
for record in templateParsed:
    assert TEMPLATE_STRING is None, 'template file must contain exactly one sequence'
    TEMPLATE_STRING = str(record.seq)
    RCtemplateString = record.seq.reverse_complement()
    TEMPLATE_TRANSLATION = str(record.seq.translate(to_stop=True))

In [7]:
# Generate dataframe of unique barcodes, perform consensus functions, save as a .pkl file
uniqueBCDF = getConsensus(uniqueBCs)
uniqueBCDF = pd.merge(uniqueBCDF, readsDF[['Barcode_sequence', 'BCfrequency']], 
                      how='left', on = 'Barcode_sequence')
uniqueBCDF.drop_duplicates(subset = ['Barcode_sequence'], inplace = True)
uniqueBCDF.to_pickle(FILE_NAME + '_consensusTable.pkl')


0.0 percent done
0.17 percent done
0.34 percent done
1.71 percent done
3.43 percent done
17.1 percent done
34.3 percent done
68.6 percent done


In [6]:
# Open the saved .pkl file of consensus mutations
uniqueBCDF = pd.read_pickle(FILE_NAME + '_consensusTable.pkl')


In [11]:
# Apply all filters and save final lookup tables 
# Sort from less to more common barcodes so that the off-by-one check removes the less common barcode of each pair
uniqueBCDF = uniqueBCDF.sort_values(by=['BCfrequency'], ascending = True).reset_index()
# Off-by-one check function

# Make a list and then a set to look through, this is important (hash reasons) 
uniqueBCs = uniqueBCDF['Barcode_sequence'].tolist()
uniqueBCLookup = set(uniqueBCs)
uniqueBCDF['OBO_found'], uniqueBCDF['Similar_barcode'] = zip(*uniqueBCDF.apply(OBOcheck, axis = 1))

# Apply filters (edit this to run through dataframe once somehow)
uniqueBCDF['No_Mutations'], uniqueBCDF['No_Consensus'], uniqueBCDF['Indels'] = \
        zip(*uniqueBCDF.apply(mutFilters, axis = 1))
uniqueBCDF['No_BB_Mutations'] = uniqueBCDF.apply(lambda row: BBmutFilters(row), axis = 1)
uniqueBCDF['Mut_positions'] = uniqueBCDF.apply(lambda row: getMutPos(row), axis = 1)
uniqueBCDF['AffectedCodons'] = uniqueBCDF.apply(lambda row: oneCodonOnlyCheck(row.Mut_positions), axis = 1)
uniqueBCDF['HowManyAffectedCodons'] = uniqueBCDF.apply(lambda row: len(row.AffectedCodons)
                                                            if row.AffectedCodons[0]>-1 else (-1), axis = 1)
uniqueBCDF['AffectedCodonInt'] = uniqueBCDF.apply(lambda row: row.AffectedCodons[0]
                                                    if (len(row.AffectedCodons)) == 1 else (-2), axis = 1)
uniqueBCDF['Original_codon'], uniqueBCDF['Mutant_codon'] = zip(*uniqueBCDF.apply(getCodonSeqs, axis = 1))
uniqueBCDF['Codon_Allowed'] = uniqueBCDF.apply(lambda row: 
                                               True if row.Mutant_codon in PROGRAMMED_MUTATION_CODONS
                                               else False, axis = 1)
uniqueBCDF['Enough_reads'] = uniqueBCDF.apply(lambda row: True if (row['BCfrequency'] >= MINIMUM_READS)
                                               else False, axis = 1)
uniqueBCDF['AAMutation'], uniqueBCDF['mutantAA'] = zip(*uniqueBCDF.apply(getAAmut, axis = 1))
uniqueBCDF['Passed_all_filters'] = uniqueBCDF.apply(lambda row: passedAllFilters(row), axis=1)

# Shift aminoacid numbers by one
uniqueBCDF['correctedAAmut'] = uniqueBCDF.apply(lambda row: shiftAAone(row), axis=1)

# Count barcodes per mutant
uniqueBCDF['BCsForMut'] = uniqueBCDF['correctedAAmut'].map(uniqueBCDF['correctedAAmut'].value_counts())

# Save lookup table
uniqueBCDF[uniqueBCDF['Passed_all_filters'] == True][['Barcode_sequence', 
                                                'correctedAAmut']].to_csv(FILE_NAME + '_lookupTable.csv')

# Save lookup tables for barcodes with backbone mutations and for barcodes with indels, filter with read cutoff
uniqueBCDF[(uniqueBCDF['Enough_reads'] == True) & 
           (uniqueBCDF['No_BB_Mutations'] == False)][['Barcode_sequence', 
            'correctedAAmut']].to_csv(FILE_NAME + '_BBmuts_lookupTable.csv')
uniqueBCDF[(uniqueBCDF['Enough_reads'] == True) & 
           (uniqueBCDF['Indels'] == True)][['Barcode_sequence', 
            'correctedAAmut']].to_csv(FILE_NAME + '_indels_lookupTable.csv')


In [None]:
# Add columns to reads dataframe

readsDF = pd.merge(readsDF, uniqueBCDF[['Barcode_sequence', \
                                        'Consensus_mutations', \
                                        'OBO_found', \
                                        'No_Mutations', \
                                        'No_Consensus', \
                                        'Indels', \
                                        'HowManyAffectedCodons', \
                                        'Codon_Allowed', \
                                        'AAMutation', \
                                        'mutantAA', \
                                        'Enough_reads', \
                                        'No_BB_Mutations', \
                                        'Passed_all_filters']], \
                                        how='left', on = 'Barcode_sequence')
readsDF.head(2)

filteredReadDFs = filterReads(readsDF)

In [None]:
# Rarefaction curves data

rarefactionData = {}

for label in filterLabels:
    rarefactionData[label] = []

n = [10, 30, 100, 300, 1000, 3000, 10000, 30000]
howOftenToSample = 100000

readsDF['index'] = np.arange(len(readsDF))

for read in readsDF['index']:
    num=read%howOftenToSample
    if (num == 0) or (read in n):
        for label in filterLabels:
            filteredBCsUpToRead = filteredReadDFs[label]['Barcode_sequence'][:read]
            rarefactionData[label].append([filteredBCsUpToRead.count(), filteredBCsUpToRead.nunique()])
            
rarefactionDataTransposed = {}
# Transpose values
for label in filterLabels:
    BCcounts, unique_BCs = list(map(list, zip(*rarefactionData[label])))
    rarefactionDataTransposed[label] = BCcounts, unique_BCs        
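
The rarefaction data (reads seen versus unique barcodes seen) can also be collected in a single pass over the reads. The sketch below is an alternative illustration using only the standard library, not the loop used in this notebook; `rarefaction_points` and the toy read list are hypothetical.

```python
def rarefaction_points(barcodes, checkpoints):
    """Return (reads_seen, unique_barcodes) at each checkpoint in one pass."""
    wanted = set(checkpoints)
    seen = set()
    points = []
    for n_reads, barcode in enumerate(barcodes, start=1):
        seen.add(barcode)
        if n_reads in wanted:
            points.append((n_reads, len(seen)))
    return points

reads = ["AAA", "AAC", "AAA", "GGG", "AAC", "TTT"]
print(rarefaction_points(reads, [2, 4, 6]))  # [(2, 2), (4, 3), (6, 4)]
```

Because the set of seen barcodes is carried forward, the cost is linear in the number of reads regardless of how many checkpoints are sampled.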
        

In [None]:
# Plot rarefaction curves
for label in filterLabels:
    plt.scatter(rarefactionDataTransposed[label][0], rarefactionDataTransposed[label][1])

plt.legend(filterLabels, bbox_to_anchor=(1.05, 1), loc='upper left')

plt.title('Barcode subsampling')
plt.xlabel('Barcodes in subsample')
plt.ylabel('Unique barcodes in subsample')

# Fit to negative binomial distribution curve

from scipy.optimize import curve_fit

def negativeBinomial(x, a, b):
    return a*(1-np.power(b, x/a))  # a*(1-b^(x/a))

xdata = rarefactionDataTransposed['Fully filtered'][0]
ydata = rarefactionDataTransposed['Fully filtered'][1]

popt, pcov = curve_fit(f = negativeBinomial, \
                      xdata = xdata, \
                      ydata = ydata, \
                      p0=[10000, 0.5], \
                      bounds=(0, np.inf))

estText = "Negative binomial \ncomplexity estimate: \n" + str(int(popt[0])) + ' barcodes'
print("Negative binomial skew estimate: " + str(popt[1])[:4])

# Plot the curve above
plt.plot(xdata, negativeBinomial(xdata, *popt), color = 'black')
plt.tight_layout()
plt.rcParams["figure.figsize"] = (7,7)
plt.text(2500000, 50000, estText, fontsize=12, horizontalalignment='center', verticalalignment='bottom')

plt.savefig(FILE_NAME + '_barcodeSubsamplingByFilter.pdf')

    

In [None]:
# Make histograms for the same data

binSet = range(0,100,1)
counts = []
bins = []
bars = []

for i, aFilter in enumerate(filterLabels):
    readHistData = filteredReadDFs[aFilter]['BCfrequency']
    histtype = 'step'
    if aFilter == 'Fully filtered':
        histtype = 'stepfilled'
    x = plt.hist(readHistData, alpha = 0.3, bins = binSet, histtype = histtype, label = aFilter)
    counts.append(x[0])
    bins.append(x[1])
    bars.append(x[2])

plt.rcParams["figure.figsize"] = (7,7)
plt.legend()
plt.title('Reads per barcode')
plt.ylabel('Barcodes')
plt.xlabel('Reads per barcode')
plt.savefig(FILE_NAME + '_preliminary_readsHist.pdf')



In [None]:
# Downsampled data histograms
m = [50000, 100000, 200000, 500000, 1000000]

BCList = filteredReadDFs['Fully filtered']['Barcode_sequence']

for i, barcode in enumerate(BCList):
    if i in m:
        plt.hist(filteredReadDFs['Fully filtered'][:i].groupby('AAMutation')['Barcode_sequence'].nunique(), \
                 bins = range(0,50,1), \
                 histtype = 'step', \
                 alpha = 0.7, density = True, label = 'First ' + str(i) + ' reads')
    if i == len(BCList) - 1:
        # No slice here: [:i] would leave out the final read
        plt.hist(filteredReadDFs['Fully filtered'].groupby('AAMutation')['Barcode_sequence'].nunique(), \
                 bins = range(0,50,1), \
                 histtype = 'step', \
                 edgecolor = 'black', density = True, label = 'All filtered reads (' + str(len(BCList)) + ')')
plt.legend()
plt.title('Barcodes per mutation subsamples')
plt.ylabel('Mutations')
plt.xlabel('Barcodes per Mutation')

plt.savefig(FILE_NAME + '_barcodesPerMutHistSubsampled.pdf')



In [None]:
# Histogram of barcodes per mutation

BCsPerMut = filteredReadDFs['Fully filtered'].groupby('AAMutation')['Barcode_sequence'].nunique()
missedMuts = TOTAL_MUTS - len(BCsPerMut)
# pd.concat replaces Series.append, which was removed in pandas 2.0
BCsPerMut = pd.concat([BCsPerMut, pd.Series([0]*missedMuts)])

binSet = range(0,50,1)

histData = plt.hist(BCsPerMut, bins = binSet, color = 'lightgray')
plt.title('Barcodes per mutation')
plt.ylabel('Mutations')
plt.xlabel('Barcodes per Mutation')

plt.savefig(FILE_NAME + '_barcodesPerMutHist.pdf')

# Fit poisson or gamma curve
# Print out median or whatever



In [None]:
# Make a graph showing how many mutants I have at each position of the protein
# Ideally I would have 19 mutants at each position

def getMutPosition(mutation):
    return int(mutation[1:-1])

def getMutAA(mutation):
    return mutation[-1:]

foundMutPos = []
foundMutAA = []

for mutant in filteredReadDFs['Fully filtered']['AAMutation']:
    foundMutPos.append(getMutPosition(mutant))
    foundMutAA.append(getMutAA(mutant))
        
mutationPositionData = {'Position' : foundMutPos , 'Amino acid' : foundMutAA}
dfPositionAA = pd.DataFrame(mutationPositionData)
mutationsByPosition = dfPositionAA.groupby('Position')['Amino acid'].nunique()
dfPositionAAanalyzed = pd.DataFrame(mutationsByPosition)
dfPositionAAanalyzed.to_csv(FILE_NAME + '_mutationsByPosition.csv')
mutationBCsByPosition = dfPositionAA.groupby('Position')['Amino acid'].count()

fig,ax = plt.subplots()
plt.gcf().set_size_inches((20, 5))    

ax.scatter(x = dfPositionAAanalyzed.index, \
            y = dfPositionAAanalyzed['Amino acid'], \
            s = 5, \
            color = 'black', \
            marker = "_")
ax.set_xlabel("Amino acid position",fontsize=14)

ax.set_ylabel("Mutations at that position",fontsize=14)
ax.set_ylim([0, 20])
ax.set_xlim([-1, 465])
ax.set_yticks([0, 2, 5, 10, 15, 18, 19])

# twin object for two different y-axes on the same plot
ax2=ax.twinx()
# make a plot with different y-axis using second axis object
ax2.scatter(mutationBCsByPosition.index, mutationBCsByPosition, s = 10, color = 'green')
ax2.set_ylabel("Barcodes of mutations at that position",fontsize=14)

plt.savefig(FILE_NAME + '_coverageByPosition.pdf')

plt.show()


In [None]:
# Calculate mutant coverage
totalAALength = 466
DMSAALength = 462
totalPossibleMuts = 19* totalAALength
DMSMuts = 19 * DMSAALength

mutationsAfterFilters = mutationsByPosition.sum()

# Calculate percent of possible mutations that we have
percentMuts = mutationsAfterFilters/totalPossibleMuts
percentDMSMuts = mutationsAfterFilters/DMSMuts
mutsW3BCsorMore = (BCsPerMut > 2).value_counts()[True]


print('We captured ' + str(100*percentMuts)[:4] + '% of possible single mutants (' + \
      str(mutationsAfterFilters) + ' of ' + str(totalPossibleMuts) + ')')
print('We captured ' + str(100*percentDMSMuts)[:4] + '% of possible single mutants we ordered (' + \
      str(mutationsAfterFilters) + ' of ' + str(DMSMuts) + ')')
print('We captured ' + str(100*mutsW3BCsorMore/DMSMuts)[:4] + '% of possible single mutants we ordered (' + \
      str(mutsW3BCsorMore) + ' of ' + str(DMSMuts) + ') with 3 barcodes or more')
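
As a quick check of the denominators used above: 19 substitutions at each of the 462 ordered positions gives 8778, which matches `TOTAL_MUTS`. The sketch below works through the arithmetic; the helper names are assumptions, and the observed count of 4389 is a made-up number used only to show the calculation.

```python
SUBSTITUTIONS_PER_POSITION = 19  # every amino acid except the wild type

def possible_single_mutants(n_positions):
    """Number of single amino acid substitutions over `n_positions` sites."""
    return SUBSTITUTIONS_PER_POSITION * n_positions

def percent_coverage(n_observed, n_positions):
    """Percent of possible single mutants that were observed."""
    return 100 * n_observed / possible_single_mutants(n_positions)

print(possible_single_mutants(466))  # 8854, full-length protein
print(possible_single_mutants(462))  # 8778, positions in the ordered library
# Hypothetical observed count, for illustration only.
print(round(percent_coverage(4389, 462), 1))  # 50.0
```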


In [None]:
# Heatmap showing each individual mutation barcode frequency
# This will only give the barcode heatmap for now
# The positions here are indexed to 0 so they're sort of off by 3

positionAAFreq = uniqueBCDF[uniqueBCDF['Passed_all_filters'] == True].groupby(['AffectedCodonInt', 
                                                                               'mutantAA']).size()
dfPositionAAFreq = positionAAFreq.to_frame(name = 'Barcodes')
pivotBCs = dfPositionAAFreq.pivot_table(index = 'mutantAA', columns = 'AffectedCodonInt', values = 'Barcodes')

fig, (ax1) = plt.subplots(1, tight_layout=True, figsize=(10,5), dpi = 100)
pcmBCs = ax1.pcolor(pivotBCs, cmap = 'viridis', norm=matplotlib.colors.LogNorm())

ax1.set_title('Barcodes per mutant', size = 15)
ax1.set_xlabel('Amino acid position in rbcL', size = 15)
ax1.set_ylabel('Amino acid', size = 15)
# This sets the ticks in the middle
ax1.set_yticks(np.arange(0.5, 20.5))
ax1.set_yticklabels(pivotBCs.index)
fig.colorbar(pcmBCs, ax = ax1)

plt.savefig(FILE_NAME + '_mutationBCHist.pdf')

plt.show()


In [None]:
# Run cumulative version of filters to make pie chart

startingBCs = len(filteredReadDFs['All data'])

cumulativeDF = filteredReadDFs['All data']

filteredReadsList = []

for i, aFilter in enumerate(filteredReadDFs.keys()):
    readsRemaining = len(cumulativeDF)
    cumulativeDF = cumulativeDF[['query_name', 
                'Barcode_sequence']].merge(filteredReadDFs[aFilter][['query_name']], 
                                           on = 'query_name', how = 'inner')
    filteredReads = readsRemaining - len(cumulativeDF)
    filteredReadsList.append(filteredReads)
#     print(aFilter)
#     print('This filter removed ' + str(filteredReads) + ' reads')
#     print(cumulativeDF['Barcode_sequence'].count())
#     print(cumulativeDF['Barcode_sequence'].nunique())

# This gives the remaining reads as the final slice
filteredReadsList[-1] = readsRemaining
    
readSummary = {filterLabels[i]: filteredReadsList[i] for i in range(len(filterLabels))} 
print(readSummary)
    
dfReadSummary = pd.DataFrame(readSummary.items(), columns=['Read type', FILE_NAME])
dfReadSummary.to_csv(FILE_NAME + '_readSummary.csv')

pieChart = dfReadSummary.plot.pie(y = FILE_NAME, \
                             labels = dfReadSummary['Read type'], \
                             figsize=(7, 7), \
                             autopct='%1.1f%%', \
                             cmap = 'Set3_r', \
                             startangle = 0)

pieChart.get_legend().remove()

plt.savefig(FILE_NAME + '_readPieChart.pdf')
    