# Problem

PCR duplicates arise during the library amplification step. Some sequences are preferentially amplified, which leaves them over-represented in the sequencing data. This is a particular problem for RNA-seq, where read density is used as a measure of gene expression. In genome sequencing, duplicates can also make a heterozygous locus appear homozygous. The problem is exacerbated if the starting DNA fragments are too small or if too many PCR cycles are run (for example, when too little DNA is present).

More generally, a duplicate can be defined as any sequence for which an identical sequence originating from the same material has already been seen. Library prep consists of random shearing and adapter ligation, followed by PCR and loading the sample onto the flow cell. Ideally, because the DNA is sheared at random positions and there are so many fragments, even though every fragment has PCR duplicates, statistically at most one copy of each (and for most fragments none) lands on the flow cell to be bridge amplified. The fact that most fragments are never sequenced is not a problem: random shearing gives different start positions, so every region should still be covered by the 100 bp reads.

When we do see the exact same sequence twice, it can be hard to tell whether that sequence occurred twice naturally or is simply a PCR duplicate. Several methods help with this, such as paired-end sequencing, indexing, or in this case UMIs (unique molecular identifiers). UMIs are added to the sample before ligation and are random sequences present at either end of our reads. If two sequences seem to match but have different UMIs, it is safe to say they did not come from the same original molecule. Conversely, if they share a UMI they are likely to come from the same molecule, since 96 different UMIs are used (in our experiment).
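The rule above can be sketched as a simple key comparison. This is only an illustration under stated assumptions: the UMI is taken to be the last colon-separated part of the read name (as in `read1:UMI1`), chromosome and start position are assumed to be known already, and `ReadKey`, `umi_from_qname` and `same_origin` are hypothetical names, not part of any existing tool.

```python
from typing import NamedTuple


class ReadKey(NamedTuple):
    """What must match for two reads to be called PCR duplicates."""
    chrom: str
    start: int
    umi: str


def umi_from_qname(qname: str) -> str:
    # Assumption: the UMI is the last ":"-separated part of the read name.
    return qname.rsplit(":", 1)[-1]


def same_origin(a: ReadKey, b: ReadKey) -> bool:
    # Same chromosome, same start and same UMI: treat as a PCR duplicate.
    return a == b


read1 = ReadKey("chr1", 1, umi_from_qname("read1:UMI1"))
read2 = ReadKey("chr1", 1, umi_from_qname("read2:UMI1"))
read3 = ReadKey("chr1", 50, umi_from_qname("read3:UMI2"))
read4 = ReadKey("chr1", 50, umi_from_qname("read4:UMI3"))

print(same_origin(read1, read2))  # True: same position and same UMI
print(same_origin(read3, read4))  # False: same position, different UMI
```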

# Example Input
```
@SQ    SN:chr1    LN:50
QNAME         FLAG  RNAME   POS  MAPQ   CIGAR  RNEXT  PNEXT  TLEN  SEQ           QUAL

#Clear duplicate reads, share position and UMIs. Remove one
read1:UMI1    16    chr1    1    255    10M    *    0    10    AATTTAATGC    QualityScore
read2:UMI1    16    chr1    1    255    10M    *    0    10    AATTTAATGC    QualityScore

#Share position, but different UMI. Keep both
read3:UMI2    16    chr1    50   255    10M    *    0    10    AATTTAATGC    QualityScore
read4:UMI3    16    chr1    50   255    10M    *    0    10    AATTTAATGC    QualityScore

#Position off by two, but first two bases are soft clipped. Matching UMIs. Remove one.
read5:UMI4    16    chr1    52   255    2S10M  *    0    10    GGTTTAATGC    QualityScore
read6:UMI4    16    chr1    50   255    10M    *    0    10    AATTTAATGC    QualityScore
```

# Example Output
```
@SQ    SN:chr1    LN:50
QNAME         FLAG  RNAME   POS  MAPQ   CIGAR  RNEXT  PNEXT  TLEN  SEQ           QUAL

#Clear duplicate reads, share position and UMIs. Remove one
read1:UMI1    16    chr1    1    255    10M    *    0    10    AATTTAATGC    QualityScore

#Share position, but different UMI. Keep both
read3:UMI2    16    chr1    50   255    10M    *    0    10    AATTTAATGC    QualityScore
read4:UMI3    16    chr1    50   255    10M    *    0    10    AATTTAATGC    QualityScore

#Position off by two, but first two bases are soft clipped. Matching UMIs. Remove one.
read6:UMI4    16    chr1    50   255    10M    *    0    10    AATTTAATGC    QualityScore
```

# Pseudocode

```
# First sort the file, then read in the UMIs and build a dictionary from them
- samtools sort the file by chromosome/position
UMIs = read(my_UMIs)
for umi in UMIs:
    Umi_Dictionary[umi] = empty set              # Positions already seen for this UMI

Outputfile = open("FilteredFile.sam", "w")       # New file that receives every unique entry
unique = duplicate = wrongUMI = 0

# Read in the SAM file and parse it one line at a time
MyFile = open(SamFile)
for line in MyFile:
    if line starts with "@":                     # Header lines are copied through unchanged
        Outputfile.write(line)
        continue
    isolate headerUMI, Chromosome, CigarString, StartPosition
    # Fields come from line.split("\t"); the UMI is the last ":"-separated part of QNAME
    if Check_Softclip(CigarString) == true:      # Check for soft clipping and, if present, fix position
        StartPosition = Fix_Position(StartPosition, CigarString)
    if headerUMI in Umi_Dictionary:              # Only reads with a known UMI are considered
        if (Chromosome, StartPosition) not in Umi_Dictionary[headerUMI]:
            add (Chromosome, StartPosition) to Umi_Dictionary[headerUMI]
            Outputfile.write(line)               # First time seen: add the line to our filtered file
            unique += 1                          # Keeps count of how many entries are written out
        else:
            duplicate += 1                       # Keeps count of proper PCR duplicates
    else:
        wrongUMI += 1                            # Keeps count of mis-sequenced UMIs; the read is skipped

# Function to check for soft clipping
def Check_Softclip(Cigar):
    if the CIGAR string contains "S", return true, otherwise return false

# Function to fix POS if soft clipping is found
def Fix_Position(StartPosition, Cigar):
    count the soft-clipped bases and return StartPosition +/- that count
```
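
The pseudocode above could be turned into Python roughly as follows. This is a hedged sketch, not the final tool: it works on an iterable of SAM lines instead of real files, assumes the UMI is the last colon-separated part of QNAME, and only subtracts leading soft clips from POS. It ignores the strand in FLAG, so reverse-strand reads would need extra handling. The names `fix_position`, `dedupe` and `sam` are hypothetical.

```python
import re
from typing import Dict, Iterable, List, Set, Tuple

LEADING_CLIP = re.compile(r"^(\d+)S")  # e.g. the "2S" in "2S10M"


def fix_position(pos: int, cigar: str) -> int:
    """Subtract leading soft-clipped bases from POS (strand is ignored here)."""
    match = LEADING_CLIP.match(cigar)
    return pos - int(match.group(1)) if match else pos


def dedupe(sam_lines: Iterable[str],
           known_umis: Iterable[str]) -> Tuple[List[str], Dict[str, int]]:
    """Keep the first read seen for each (UMI, chromosome, corrected start)."""
    seen: Dict[str, Set[Tuple[str, int]]] = {umi: set() for umi in known_umis}
    counts = {"unique": 0, "duplicate": 0, "wrong_umi": 0}
    kept: List[str] = []
    for line in sam_lines:
        if not line.strip():
            continue                      # skip blank lines
        if line.startswith("@"):
            kept.append(line)             # header lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        umi = fields[0].rsplit(":", 1)[-1]
        chrom, pos, cigar = fields[2], int(fields[3]), fields[5]
        if umi not in seen:
            counts["wrong_umi"] += 1      # unknown (likely mis-sequenced) UMI
            continue
        key = (chrom, fix_position(pos, cigar))
        if key in seen[umi]:
            counts["duplicate"] += 1
        else:
            seen[umi].add(key)
            kept.append(line)
            counts["unique"] += 1
    return kept, counts


def sam(qname: str, pos: int, cigar: str, seq: str = "AATTTAATGC") -> str:
    # Build a tab-separated toy record modelled on the example input.
    return "\t".join([qname, "16", "chr1", str(pos), "255", cigar,
                      "*", "0", "10", seq, "*"])


records = [
    "@SQ\tSN:chr1\tLN:50",
    sam("r1:UMI1", 1, "10M"),
    sam("r2:UMI1", 1, "10M"),
    sam("r3:UMI2", 50, "10M"),
    sam("r4:UMI3", 50, "10M"),
    sam("r5:UMI4", 50, "10M"),
    sam("r6:UMI4", 52, "2S10M", "GGTTTAATGC"),
]
kept, counts = dedupe(records, ["UMI1", "UMI2", "UMI3", "UMI4"])
print(counts)  # {'unique': 4, 'duplicate': 2, 'wrong_umi': 0}
```

Keeping a set of `(chromosome, position)` pairs per UMI mirrors the dictionary in the pseudocode. Sorting the input first only determines which copy of a duplicate is kept, since the first one encountered is the one written out.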