### Deduplicate the UMIs
Now we must deduplicate the UMIs. Before this step, there are a couple of issues we must address.

In [1]:
import pysam
import pandas as pd

In [2]:
# Let's take a look at the gene-assigned counts, which are in the last column of this file
gene_assigned = pd.read_csv('outFiles/LS23A_GeneAssigned', sep='\t', header=1)
gene_assigned

Unnamed: 0,Geneid,Chr,Start,End,Strand,Length,outFiles/TLS23A_R1_MouseAligned.sortedByCoord.out.bam
0,ENSMUSG00000102693.2,chr1,3143476,3144545,+,1070,0
1,ENSMUSG00000064842.3,chr1,3172239,3172348,+,110,0
2,ENSMUSG00000051951.6,chr1;chr1;chr1;chr1;chr1;chr1;chr1,3276124;3276746;3283662;3283832;3284705;349192...,3277540;3277540;3285855;3286567;3287191;349212...,-;-;-;-;-;-;-,6094,19
3,ENSMUSG00000102851.2,chr1,3322980,3323459,+,480,0
4,ENSMUSG00000103377.2,chr1,3435954,3438772,-,2819,9
...,...,...,...,...,...,...,...
55354,ENSMUSG00000064368.1,chrM,13552,14070,-,519,1191
55355,ENSMUSG00000064369.1,chrM,14071,14139,-,69,0
55356,ENSMUSG00000064370.1,chrM,14145,15288,+,1144,4359
55357,ENSMUSG00000064371.1,chrM,15289,15355,+,67,0


In [3]:
# Sort by highest number of counts
gene_assigned.sort_values('outFiles/TLS23A_R1_MouseAligned.sortedByCoord.out.bam', ascending=False)

Unnamed: 0,Geneid,Chr,Start,End,Strand,Length,outFiles/TLS23A_R1_MouseAligned.sortedByCoord.out.bam
47428,ENSMUSG00000119584.1,chr17,40157244,40159092,+,1849,331191
55323,ENSMUSG00000064337.1,chrM,70,1024,+,955,25327
55325,ENSMUSG00000064339.1,chrM,1094,2675,+,1582,14313
29423,ENSMUSG00000035202.9,chr9;chr9;chr9;chr9;chr9;chr9;chr9;chr9;chr9;c...,123195992;123195995;123196012;123196019;123196...,123196172;123197592;123196172;123196172;123196...,+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;...,11089,9371
55337,ENSMUSG00000064351.1,chrM,5328,6872,+,1545,5186
...,...,...,...,...,...,...,...
21099,ENSMUSG00000090341.2,chr7,41222701,41223169,-,469,0
21100,ENSMUSG00000108482.2,chr7;chr7;chr7,41244316;41244715;41245945,41244550;41244814;41246155,+;+;+,546,0
21103,ENSMUSG00000091401.2,chr7,41315941,41317542,-,1602,0
21105,ENSMUSG00000091004.3,chr7,41318032,41318731,-,700,0


### Top two genes
Note that the top two genes account for a very large share of the assigned reads. This can change depending on the GTF file you use.  
For example, we did not have this issue with an older GTF file.  
These two genes correspond to Rn18s and an mt-rRNA. This also seems to be specific to the primer we use, in addition to these being highly abundant sequences.  
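
To put a number on how much of the library the top genes take up, a minimal sketch along these lines may help. The helper `top_gene_share` and the toy table are hypothetical and not part of the original pipeline; on the real data you would pass `gene_assigned` and the name of its last column instead.

```python
import pandas as pd

def top_gene_share(counts, count_col, n=2):
    """Return the n highest-count genes and their fraction of all assigned reads."""
    total = counts[count_col].sum()
    top = counts.nlargest(n, count_col)[["Geneid", count_col]].copy()
    top["fraction"] = top[count_col] / total
    return top

# Toy table with made-up counts, only to illustrate the call
toy = pd.DataFrame({
    "Geneid": ["geneA", "geneB", "geneC", "geneD"],
    "n_reads": [60, 25, 10, 5],
})
share = top_gene_share(toy, "n_reads")
print(share)
```

A gene whose fraction stands far above the rest is a candidate for the filtering step described next.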

For this sequencing run, the counts are not too bad, but at higher sequencing depths these two genes can take up over a million reads or so.  
This is a problem for our UMI library, as we use a 3-letter, 12-base UMI code.  
As you can imagine, this allows for at most $3^{12} = 531441$ distinct UMIs. The effective number is actually a little lower, because UMI-tools also tolerates a 1-base difference between UMIs when it is deduplicating.  
In our experience, when we tried to deduplicate a gene with this many counts, the deduplication pipeline stalled indefinitely. In fact, it even stopped the O2 job because it was requesting an inordinate amount of memory (>64 GB).  
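
The arithmetic above can be checked with a short sketch. The constants come from the text; the helper `expected_distinct` is hypothetical and assumes reads draw UMIs uniformly and independently, which real libraries only approximate.

```python
ALPHABET_SIZE = 3   # letters available at each UMI position (from the text)
UMI_LENGTH = 12     # bases per UMI (from the text)

# Total number of distinct UMIs the design can encode
umi_space = ALPHABET_SIZE ** UMI_LENGTH

# Number of UMIs that differ from a given UMI at exactly one position
neighbors_per_umi = UMI_LENGTH * (ALPHABET_SIZE - 1)

def expected_distinct(n_reads, space):
    """Expected number of distinct UMIs seen among n_reads (uniform-sampling assumption)."""
    return space * (1 - (1 - 1 / space) ** n_reads)

top_gene_reads = 331191  # count of the top gene in the sorted table above

print(umi_space)                   # 531441
print(neighbors_per_umi)           # 24
print(top_gene_reads / umi_space)  # reads for this gene relative to the UMI space
print(expected_distinct(10**6, umi_space) / umi_space)  # occupancy at a million reads
```

Under these assumptions, a single gene with around a million reads would occupy most of the UMI space, so almost every UMI has observed one-base neighbors.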

To solve this problem, we removed the reads assigned to these two genes and created a new BAM file prior to deduplication.  
For your experimental system, this may or may not be a problem.

In [4]:
# This is the part of the script that removes the reads assigned to the two genes
from collections import defaultdict
GENES_TO_FILTER = ['ENSMUSG00000119584.1', 'ENSMUSG00000064337.1']

def createSmallerGeneFile(filePrefix, geneFile):
    outGeneFile = '%s.bam' % filePrefix

    gF = pysam.AlignmentFile(geneFile)
    # 'wb' writes compressed BAM; plain 'w' would write SAM text into the .bam file
    outGF = pysam.AlignmentFile(outGeneFile, 'wb', template=gF)
    
    filteredCounts = defaultdict(int)

    for read in gF.fetch(until_eof=True):
        if read.has_tag('XT'):
            xtTag = read.get_tag('XT')

            if xtTag in GENES_TO_FILTER:
                filteredCounts[xtTag] += 1
            else:
                outGF.write(read)

    gF.close()
    outGF.close()

    print('Filtered out %s' % ', '.join(['%d of %s' % \
                                    (filteredCounts[gene], gene) \
                                    for gene in GENES_TO_FILTER]))

    return outGeneFile

In [5]:
filePrefix = 'outFiles/LS23A_filtered'
countedGeneFile = 'outFiles/TLS23A_R1_MouseAligned.sortedByCoord.out.bam.featureCounts.bam'
filteredCountsFile = createSmallerGeneFile(filePrefix, countedGeneFile)

[E::idx_find_and_load] Could not retrieve index file for 'outFiles/TLS23A_R1_MouseAligned.sortedByCoord.out.bam.featureCounts.bam'


Filtered out 331191 of ENSMUSG00000119584.1, 25327 of ENSMUSG00000064337.1


In [6]:
filteredCountsFile

'outFiles/LS23A_filtered.bam'

In [7]:
# Now we sort and index the file. UMI-tools requires this before running dedup.
!samtools sort outFiles/LS23A_filtered.bam -o outFiles/LS23A_Sorted.bam
!samtools index outFiles/LS23A_Sorted.bam

### UMI-tools deduplication
For LS, we use the `--per-gene` deduplication option.  
This means that if you have two reads with the same UMI, and one read maps to bases 1-20 of Gene A while the second read maps to bases 2-21,  
UMI-tools will still consider these to be in the same UMI group for the purposes of deduplication.  
This option is recommended if there is a fragmentation step in your library prep, which there usually is.  

For LS we do perform a fragmentation step, so we have this option on.  
Previously we deduplicated without this option, which deduplicates by mapped position instead of by gene.  
It ultimately did not make a huge difference for us, but it's better to keep this option on.  
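
The difference between grouping by gene and grouping by mapped position can be illustrated with a toy sketch. This is a simplification, not how UMI-tools is implemented: it matches UMIs exactly and ignores strand and the 1-base error tolerance. The reads and the helper `count_unique` are hypothetical.

```python
from collections import defaultdict

# Hypothetical reads: (gene, mapped start position, UMI)
reads = [
    ("GeneA", 1, "ACGACGACGACG"),
    ("GeneA", 2, "ACGACGACGACG"),  # same UMI, fragment shifted by one base
    ("GeneA", 2, "GGCAGGCAGGCA"),
]

def count_unique(reads, per_gene):
    """Count reads kept after exact-UMI deduplication within each group."""
    groups = defaultdict(set)
    for gene, pos, umi in reads:
        # Per-gene grouping ignores the position; otherwise the position is part of the key
        key = gene if per_gene else (gene, pos)
        groups[key].add(umi)
    return sum(len(umis) for umis in groups.values())

print(count_unique(reads, per_gene=True))   # 2
print(count_unique(reads, per_gene=False))  # 3
```

With per-gene grouping the first two reads collapse into one molecule, whereas position-based grouping counts them separately because fragmentation shifted their start sites.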

Note that we must pass the featureCounts tag names, `XT` and `XS`, when using the `--per-gene` option.  
If you use another package to assign gene counts, you must change these tags to the ones that package uses.
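
If the tag names need to change, one way to keep them in a single place is to build the command from parameters. This is only a sketch: `build_dedup_command` is a hypothetical helper, and it uses only the `umi_tools dedup` options that appear in this notebook.

```python
import shlex

def build_dedup_command(in_bam, out_bam, log_file, gene_tag="XT", status_tag="XS"):
    """Assemble the umi_tools dedup call used in this notebook as an argument list."""
    return [
        "umi_tools", "dedup", "--per-gene",
        f"--gene-tag={gene_tag}",                # tag holding the assigned gene
        f"--assigned-status-tag={status_tag}",   # tag holding the assignment status
        "-I", in_bam,
        "-S", out_bam,
        "-L", log_file,
    ]

cmd = build_dedup_command(
    "outFiles/LS23A_Sorted.bam",
    "outFiles/LS23A_Dedup.bam",
    "outFiles/LS23A_dedup.log",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` if you prefer running the step from Python rather than with a `!` shell escape.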

In [10]:
%%time
# And finally deduplicate
!umi_tools dedup --per-gene --gene-tag=XT --assigned-status-tag=XS -I outFiles/LS23A_Sorted.bam -S outFiles/LS23A_Dedup.bam -L outFiles/LS23A_dedup.log 

CPU times: user 101 ms, sys: 34.8 ms, total: 136 ms
Wall time: 9.59 s
