### Organize the read counts
And at long last we have arrived at the final step.  
Here we will take our deduped BAM file and organize the read counts by gene.  

We will go through the reads and count how many are assigned to each gene.

In [1]:
import glob
import pysam
import pandas as pd
import numpy as np

In [2]:
# We will make use of `defaultdict`, a dictionary class from the `collections` module in the Python standard library.
from collections import defaultdict

In [3]:
# For example if we have a list of genes like this
temp_list = ['geneA', 'geneA', 'geneB', 'geneC', 'geneC', 'geneC']

In [4]:
# We can loop through it and add 1 each time a gene shows up
temp_freq = defaultdict(lambda: 0)
for gene in temp_list:
    temp_freq[gene]+=1

In [5]:
temp_freq

defaultdict(<function __main__.<lambda>()>,
            {'geneA': 2, 'geneB': 1, 'geneC': 3})

### Reading the GFF file
We will take the gene name and gene type from the GFF file.  
If you don't want the GFF file taking up space, you could pre-make a .csv file of geneID:geneName pairs and reference that file instead of the GFF file.  
We won't do that today.
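
As a rough illustration of the pre-made lookup table mentioned above, here is a minimal sketch. The helper names `gff_gene_names` and `write_gene_table` are hypothetical, not part of the workshop scripts, and the sketch assumes GENCODE-style GFF3 attributes (`gene_id=...;gene_name=...`) in the ninth column. The demo lines are hand-made so the block runs without the real annotation file.

```python
import csv
import io

def gff_gene_names(gff_lines):
    """Yield (gene_id, gene_name) for every 'gene' feature in GFF3 lines."""
    for line in gff_lines:
        if line.startswith('#'):
            continue  # skip header and comment lines
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 9 or fields[2] != 'gene':
            continue  # keep only gene-level records
        # Column 9 holds semicolon-separated key=value attributes
        attrs = dict(item.split('=', 1)
                     for item in fields[8].split(';') if '=' in item)
        yield attrs['gene_id'], attrs['gene_name']

def write_gene_table(gff_lines, out_handle):
    """Write a two-column CSV: gene_id, gene_name."""
    writer = csv.writer(out_handle)
    writer.writerow(['gene_id', 'gene_name'])
    writer.writerows(gff_gene_names(gff_lines))

# Hand-made demo input (the transcript ID 'tx1' is made up for illustration)
demo_lines = [
    '##gff-version 3\n',
    'chr1\tHAVANA\tgene\t3143476\t3144545\t.\t+\t.\t'
    'ID=ENSMUSG00000102693.2;gene_id=ENSMUSG00000102693.2;'
    'gene_type=TEC;gene_name=4933401J01Rik;level=2\n',
    'chr1\tHAVANA\ttranscript\t3143476\t3144545\t.\t+\t.\t'
    'ID=tx1;Parent=ENSMUSG00000102693.2;'
    'gene_id=ENSMUSG00000102693.2;gene_name=4933401J01Rik\n',
]

buffer = io.StringIO()
write_gene_table(demo_lines, buffer)
rows = buffer.getvalue().splitlines()
print(rows)
```

With the real annotation you would pass an open file handle for the GFF file instead of `demo_lines`, and an open output file instead of the in-memory buffer.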

In [6]:
GFF_FILE = 'gtf_file/gencode.vM27.annotation.gff3'

def readGFF():
    genes = {}
    
    print('Reading GFF file...')
    with open(GFF_FILE) as gffHandle:
        for line in gffHandle:
            if line[0] != '#' and line.split('\t')[2] == 'gene':
                genes[line.split('gene_id=')[1].split(';')[0]] = line
    
    return genes

In [7]:
genes = readGFF()

Reading GFF file...


In [8]:
# Let's take a look at the first 2 key/value pairs
first2pairs = {k: genes[k] for k in list(genes)[:2]}
first2pairs

{'ENSMUSG00000102693.2': 'chr1\tHAVANA\tgene\t3143476\t3144545\t.\t+\t.\tID=ENSMUSG00000102693.2;gene_id=ENSMUSG00000102693.2;gene_type=TEC;gene_name=4933401J01Rik;level=2;mgi_id=MGI:1918292;havana_gene=OTTMUSG00000049935.1\n',
 'ENSMUSG00000064842.3': 'chr1\tENSEMBL\tgene\t3172239\t3172348\t.\t+\t.\tID=ENSMUSG00000064842.3;gene_id=ENSMUSG00000064842.3;gene_type=snRNA;gene_name=Gm26206;level=3;mgi_id=MGI:5455983\n'}

### LS-Data
For Light-Seq, we have to parse out both the barcode sequence and the gene that the read is mapped to.  
If you recall, the barcode sequence was added to the name (header) of each aligned read.  
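
To make the read-name layout concrete, here is a small sketch that pulls the barcode out of a read name. The helper `parse_read_name` is hypothetical. It assumes exactly two underscore-separated fields were appended to the original name, as in the read printed later in this notebook; what the trailing field represents is not covered in this section, so it is simply called `extra` here.

```python
def parse_read_name(query_name):
    """Split a read name of the form <original name>_<barcode>_<extra>."""
    # rsplit from the right so underscores in the original name are left alone
    original, barcode, extra = query_name.rsplit('_', 2)
    return original, barcode, extra

# Read name taken from the last read shown in this notebook
name = 'M01675:146:000000000-JNRMM:1:2115:11772:2527_AGGGTA_ATAGTCTAAAAA'
original, barcode, extra = parse_read_name(name)
print(barcode)  # AGGGTA
```

The notebook itself uses `read.query_name.split('_')[1]`, which gives the same barcode as long as the original read name contains no underscore.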

In [9]:
# Let's create a dictionary with the gene counts.
# We use a nested defaultdict because each count is indexed by two keys: the gene ID and the condition/barcode
frequencies = defaultdict(lambda: defaultdict(int)) # default to 0 count
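
A toy version of the nested `defaultdict` may help show how the two-level counting behaves. The gene names and observations below are invented for illustration only.

```python
from collections import defaultdict

# Outer key: gene, inner key: condition/barcode, value: count (defaults to 0)
toy = defaultdict(lambda: defaultdict(int))

# Invented (gene, condition/barcode) observations
observations = [
    ('geneA', 'LS23A/AGGGTA'),
    ('geneA', 'LS23A/AGGGTA'),
    ('geneA', 'LS23A/GTTAGG'),
    ('geneB', 'LS23A/GTTAGG'),
]
for gene, cond_bar in observations:
    toy[gene][cond_bar] += 1

print(dict(toy['geneA']))

# Looking up a barcode that was never seen for a gene returns 0
# (and, as a side effect, stores that 0 in the inner dictionary)
missing = toy['geneB']['LS23A/AGGGTA']
print(missing)
```

That zero default is what later lets us ask for every gene/barcode combination without checking whether it exists first.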

In [10]:
DEDUP_FILES = sorted(glob.glob('outFiles/*_Dedup.bam')) #We only have one for this workshop
OUT_FILE_FREQ = 'TranscriptFrequencies.csv'
OUT_FILE = 'ReorderedLightSeq.csv'

allGenes = []
allConditionBarcodes = []
# Create dictionary of gene frequencies
for geneFile in DEDUP_FILES:  
    print('Reading %s...' % geneFile)

    condition = geneFile.split('/')[-1].split('_Dedup')[0]
    samFile = pysam.AlignmentFile(geneFile, "rb")
    samIter = samFile.fetch(until_eof=True)

    for read in samIter:
        barcodeSeq = read.query_name.split('_')[1]
        geneName = read.get_tag('XT')

        condBar = '%s/%s' % (condition, barcodeSeq)

        frequencies[geneName][condBar] += 1

        if condBar not in allConditionBarcodes:
            allConditionBarcodes.append(condBar)

# Now, create CSV
allData = defaultdict(list)

allData['Gene'] = frequencies.keys()

Reading outFiles/LS23A_Dedup.bam...


[E::idx_find_and_load] Could not retrieve index file for 'outFiles/LS23A_Dedup.bam'


We have looped through an entire BAM file and organized the gene counts from it.  
Let's look at the last read and the items we parsed out of it.

In [11]:
#this is the last read in our for loop
print(read)

M01675:146:000000000-JNRMM:1:2115:11772:2527_AGGGTA_ATAGTCTAAAAA	16	#21	15244	255	40M	*	0	0	CAATCTCAGGAATTATCGAAGACAAAATACTAAAATTATA	array('B', [34, 11, 36, 26, 34, 24, 38, 38, 38, 38, 38, 38, 35, 38, 37, 38, 38, 38, 38, 37, 32, 38, 38, 38, 38, 38, 38, 38, 37, 37, 37, 32, 36, 38, 37, 37, 34, 38, 37, 37])	[('NH', 1), ('HI', 1), ('AS', 39), ('nM', 0), ('XS', 'Assigned'), ('XN', 1), ('XT', 'ENSMUSG00000064370.1')]


In [12]:
# For each read we parsed out the gene ID, the barcode sequence, and the condition (taken from the file name).
# We used pysam to extract this information.
print(barcodeSeq)
print(geneName)
print(condBar)

AGGGTA
ENSMUSG00000064370.1
LS23A/AGGGTA


In [13]:
# The frequencies dictionary has one entry per gene ID that at least one read was assigned to
len(frequencies)

13161

In [14]:
# Look up a specific gene; here we use the gene assigned to the last read.
# The result shows how many reads were counted for each condition/barcode.
frequencies.get('ENSMUSG00000064370.1')

defaultdict(int, {'LS23A/AGGGTA': 1657, 'LS23A/GTTAGG': 592})

In [15]:
# Next, look up each gene ID in the `genes` dictionary built from the GFF file to get its gene name.
# For each gene we also append its count for every condition/barcode; the result is then converted to a dataframe.
allData = defaultdict(list)
allData['Gene'] = frequencies.keys()

print('Compiling data...')
for gene in allData['Gene']:
    allData['Gene name'].append(genes[gene].split('gene_name=')[1] \
                                                            .split(';')[0])
    for condBar in allConditionBarcodes:
        allData[condBar].append(frequencies[gene][condBar])

lightFreqs = pd.DataFrame.from_dict(allData)
lightFreqs.to_csv(OUT_FILE_FREQ, index=False)

Compiling data...


In [16]:
lightFreqs

| | Gene | Gene name | LS23A/AGGGTA | LS23A/GTTAGG |
|---|---|---|---|---|
| 0 | ENSMUSG00000051951.6 | Xkr4 | 9 | 3 |
| 1 | ENSMUSG00000103377.2 | Gm37180 | 4 | 0 |
| 2 | ENSMUSG00000104017.2 | Gm37363 | 2 | 0 |
| 3 | ENSMUSG00000103025.2 | Gm37686 | 1 | 0 |
| 4 | ENSMUSG00000103201.2 | Gm37329 | 2 | 0 |
| ... | ... | ... | ... | ... |
| 13156 | ENSMUSG00000064357.1 | mt-Atp6 | 1 | 0 |
| 13157 | ENSMUSG00000064363.1 | mt-Nd4 | 860 | 308 |
| 13158 | ENSMUSG00000064367.1 | mt-Nd5 | 1150 | 459 |
| 13159 | ENSMUSG00000064368.1 | mt-Nd6 | 470 | 226 |


In [17]:
# Now, re-order the data according to layer and replicate information.
# Right now this part has to be entered manually.
# I have updated the scripts so that we don't have to change the Python script every time we do a new experiment,
# but for this workshop we will use exactly what is written in the GitHub repo.

del lightFreqs['Gene']
lightFreqs.rename(columns={'Gene name': 'Gene',
                         'LS23A/GTTAGG': 'Th_1',
                         'LS23A/AGGGTA': 'Am_1'},
                  inplace=True)

lightFreqs = lightFreqs[['Gene', 'Th_1', 'Am_1']]

lightFreqs.set_index('Gene', inplace=True)
# Add together counts with duplicate indices
lightFreqs = lightFreqs.groupby(lightFreqs.index).sum()

lightFreqs.to_csv(OUT_FILE)
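
The `groupby(...).sum()` step matters because, once the gene ID column is dropped, two different gene IDs that share a gene name would otherwise leave duplicate rows in the index. A toy sketch with invented counts shows the effect:

```python
import pandas as pd

# Invented counts: 'geneA' appears twice, as if two gene IDs shared one name
toy = pd.DataFrame({
    'Gene': ['geneA', 'geneB', 'geneA'],
    'Th_1': [1, 2, 3],
    'Am_1': [4, 5, 6],
})
toy = toy.set_index('Gene')

# Rows with the same index label are summed into a single row
summed = toy.groupby(toy.index).sum()
print(summed)
```

After grouping, each gene name appears once and the index is sorted by name, which is why the final table below is in alphabetical order.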


In [18]:
lightFreqs

| Gene | Th_1 | Am_1 |
|---|---|---|
| 0610009B22Rik | 0 | 2 |
| 0610010F05Rik | 2 | 5 |
| 0610012G03Rik | 0 | 1 |
| 0610030E20Rik | 0 | 1 |
| 0710001A04Rik | 0 | 3 |
| ... | ... | ... |
| mt-Tf | 2 | 0 |
| mt-Tl1 | 3 | 4 |
| mt-Tm | 11 | 31 |
| mt-Tq | 0 | 1 |
