# Tiled Primer Design: Finding k-mers

This Jupyter notebook identifies k-mers for tiled PCR amplification across random sites in a single genome.



### Credit

Written by Julie Chen and Gowtham Thakku

## Set up 

In [4]:
# The necessary Python packages are imported here. If you receive an import error, you may have to install the missing package.

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.Blast import NCBIXML
import numpy as np
import pandas as pd
import itertools
import re
import xlrd

In [5]:
# declare some modifiable parameters here:

date = '20220713'
kmerSize = 5
genomeFileName = '../NC_000962.fasta' #Path here specifies the genome fasta file to identify k-mers in
record = SeqIO.read(genomeFileName, 'fasta')
fastaSeq = record.seq

# Note: a few more manual inputs are required later (see Doc for more instructions)

## Identifying compatible k-mer pairs

We will use k-mers as a semi-random, unbiased method to identify potential sites across the genome for tiled PCR amplification. \
The final chosen k-mer(s) will serve as the 3'-most end of the primer design.
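
To make the role of the k-mer concrete, the sketch below locates every forward-strand occurrence of a k-mer and extracts a candidate primer whose 3' end is that k-mer. The function name `primersEndingInKmer`, the `primerLength` parameter, and the toy sequence are assumptions for illustration only; they are not part of the notebook's workflow.

```python
import re

def primersEndingInKmer(seq, kmer, primerLength=20):
    """Hypothetical helper: forward-strand primers whose 3' end is `kmer`."""
    primers = []
    # the lookahead (?=...) also finds occurrences that overlap each other
    for match in re.finditer('(?=' + re.escape(kmer) + ')', seq):
        end = match.start() + len(kmer)   # index just past the k-mer
        start = end - primerLength
        if start >= 0:                    # skip sites too close to the sequence start
            primers.append((start, seq[start:end]))
    return primers

toySeq = 'GATTACAGGCTTACAGTCCATGACAGTTGA'  # toy sequence, not from the genome
examplePrimers = primersEndingInKmer(toySeq, 'ACAGT', primerLength=12)
print(examplePrimers)
```

Each returned tuple holds the 0-based start position and the primer sequence, so every primer in the list ends with the chosen k-mer.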

### Generating a list of k-mers and their corresponding frequencies

In [6]:
def allKmers(k):
    '''
    Returns a list of all possible nucleotide k-mers for a given k.

    Adapted from:
    http://saradoesbioinformatics.blogspot.com/2016/08/k-mer-composition.html
    '''
    nt = 'ACGT'
    # itertools.product yields every length-k tuple of nucleotides (4**k in total);
    # join each tuple into a string
    return [''.join(p) for p in itertools.product(nt, repeat = k)]

def kmerFrequency(kmers, genomeSeq):
    '''
    Returns a list of the corresponding frequencies in a genome for each index of a list of k-mers.

    r'' is raw string notation.
    The lookahead (?=...) matches at every start position, so overlapping k-mers are counted.
    str.count is not suitable here because it only counts non-overlapping occurrences.
    Note that .findall returns each matched substring itself, rather than the index positions.
    '''
    genomeStr = str(genomeSeq)  # convert the Seq object once rather than once per k-mer

    frequency = []
    for k in kmers:
        pattern = re.compile(r'(?=(' + k + '))')
        # one captured substring is returned per match position
        frequency.append(len(pattern.findall(genomeStr)))

    return frequency
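
A quick check on a toy string shows why the lookahead pattern is used instead of `str.count`. The toy sequence is an assumption chosen to make the overlap visible.

```python
import re

toySeq = 'AAAAC'   # toy sequence, not from the genome
kmer = 'AA'

# str.count resumes searching after the end of each match
nonOverlapping = toySeq.count(kmer)
# the lookahead consumes no characters, so every start position is tested
overlapping = len(re.findall('(?=(' + kmer + '))', toySeq))

print(nonOverlapping, overlapping)  # prints: 2 3
```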

In [7]:
allKmersList = allKmers(kmerSize)

### Filtering: low heterogeneity in k-mer sequence

In [8]:
#PARAMETERS CAN POTENTIALLY BE CHANGED
def max2ConsecutiveNt(seqList):
    '''
    Returns a list of k-mers filtered so that no nucleotide appears more than
    twice in a row, which removes k-mers with low heterogeneity.

    Removing every k-mer that contains a run of 3 identical nucleotides (e.g. 'AAA')
    also eliminates runs of 4, 5, 6, ... because each longer run contains a run of 3.
    '''
    consecutive = ['AAA', 'CCC', 'GGG', 'TTT']
    noTripleNuc = []

    for seq in seqList:
        # keep the k-mer only if it contains none of the triple runs
        if not any(triple in seq for triple in consecutive):
            noTripleNuc.append(seq)

    return noTripleNuc
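
If the allowed run length needs to change, one option is a backreference regex instead of a hard-coded list of triples. The sketch below is an alternative, not the notebook's method; `maxRunFilter` and its `maxRun` parameter are hypothetical names.

```python
import itertools
import re

def maxRunFilter(seqList, maxRun=2):
    """Hypothetical variant: keep sequences with no run longer than maxRun."""
    # (.) captures one character; \1{maxRun,} requires at least maxRun more copies,
    # i.e. a run of maxRun + 1 or more identical characters
    tooLong = re.compile(r'(.)\1{' + str(maxRun) + r',}')
    return [s for s in seqList if not tooLong.search(s)]

fiveMers = [''.join(p) for p in itertools.product('ACGT', repeat=5)]
kept = maxRunFilter(fiveMers, maxRun=2)
print(len(fiveMers), len(kept))  # prints: 1024 864
```

With `maxRun=2` on all 5-mers, this keeps 864 of the 1024 sequences, the same count the notebook reports for `max2ConsecutiveNt`.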

In [9]:
noHomopolymericKmers = max2ConsecutiveNt(allKmersList)
print(len(noHomopolymericKmers))

864


In [10]:
# generate frequency list for the updated k-mers list (takes a few min to run)

kmerFreqList = kmerFrequency(noHomopolymericKmers, fastaSeq)
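
Because the regex approach rescans the whole genome once per k-mer, a single sliding-window pass may be considerably faster. The sketch below is an alternative under stated assumptions: `kmerFrequencySinglePass` is a hypothetical name, all k-mers are assumed to share one length, and the sequence is assumed to be an uppercase string.

```python
from collections import Counter

def kmerFrequencySinglePass(kmers, genomeStr):
    """Hypothetical alternative: count all overlapping k-mers in one pass."""
    k = len(kmers[0])  # assumes every k-mer in the list has the same length
    # slide a window of width k over the sequence, one start position at a time
    counts = Counter(genomeStr[i:i + k] for i in range(len(genomeStr) - k + 1))
    # a Counter returns 0 for k-mers that never occur
    return [counts[kmer] for kmer in kmers]

toySeq = 'ACGTACGTAC'  # toy sequence, not from the genome
toyCounts = kmerFrequencySinglePass(['ACG', 'GTA', 'TTT'], toySeq)
print(toyCounts)  # prints: [2, 2, 0]
```

In the notebook, a call along the lines of `kmerFrequencySinglePass(noHomopolymericKmers, str(fastaSeq).upper())` should give the same list as `kmerFrequency`, since both count overlapping occurrences.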

In [11]:
kmerStarterDict = {'k-mer sequence': noHomopolymericKmers, 'num. of occurrences': kmerFreqList}
df_kmer = pd.DataFrame(kmerStarterDict)
df_kmer.head()
df_kmer.to_csv('kmers.csv')

In [12]:
#PARAMETERS CAN POTENTIALLY BE CHANGED
def scoreSeqDiversity(df, kmerSeqColumn):
    '''
    Returns a list of diversity scores for a given k-mers list from a pandas dataframe.

    Each distinct nucleotide present in a k-mer adds +1.
    Max score = 4, reached when the k-mer contains all four nucleotides.
    '''
    allDiversityScores = []
    nt = ['A', 'C', 'G', 'T']

    for i in df.index:
        kmerSeq = df.loc[i, kmerSeqColumn]
        # count how many of the four nucleotides occur at least once
        diversityScore = sum(1 for j in nt if j in kmerSeq)
        allDiversityScores.append(diversityScore)

    return allDiversityScores
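
The score is simply the number of distinct nucleotides in the k-mer, so it can also be computed with a set. The sketch below uses a hypothetical helper name, `diversityScore`, and toy k-mers.

```python
def diversityScore(kmer, alphabet='ACGT'):
    """Hypothetical helper: number of distinct nucleotides present in kmer."""
    # the set intersection ignores any character outside the alphabet (e.g. 'N')
    return len(set(kmer) & set(alphabet))

toyKmers = ['AACAA', 'AACAG', 'ACAGT']  # toy k-mers
scores = [diversityScore(kmer) for kmer in toyKmers]
print(scores)  # prints: [2, 3, 4]
```

The same function could be passed to `Series.apply` on the k-mer column to avoid the explicit loop over `df.index`.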

In [13]:
df_kmer['k-mer Seq. Diversity'] = scoreSeqDiversity(df_kmer, 'k-mer sequence')
df_kmer.head()

  k-mer sequence  num. of occurrences  k-mer Seq. Diversity
0          AACAA                 2156                     2
1          AACAC                 3731                     2
2          AACAG                 3153                     3
3          AACAT                 2030                     3
4          AACCA                 3114                     2


In [14]:
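'''
To identify k-mers with high sequence diversity,
shortlist k-mers by their diversity score.

The threshold below (> 1) keeps k-mers that contain at least two different nucleotides.
Raise it to be stricter: e.g. > 3 keeps only k-mers that contain all four
nucleotides (sequences like ACAGT).

'''
#PARAMETERS CAN POTENTIALLY BE CHANGED
df_kmerDiverse = df_kmer[df_kmer['k-mer Seq. Diversity'] > 1]
print(len(df_kmerDiverse))
df = df_kmerDiverse.copy()  # copy so that adding columns later does not trigger SettingWithCopyWarning

df_kmerDiverse.to_csv('kmers_diverse.csv')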

864


In [162]:
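# Manual input: drop every k-mer that contains the subsequence below
kmer = 'TTG'

# str.find returns -1 when the subsequence is absent, so keep only those rows
df['remove'] = df['k-mer sequence'].str.find(kmer)
df = df[df.remove < 0]
df.shape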

(21, 4)
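
The same exclusion can be written without a helper column by using a boolean mask. This is a sketch on a toy dataframe; the rows and counts are made up for illustration.

```python
import pandas as pd

toyDf = pd.DataFrame({
    'k-mer sequence': ['ACTTG', 'TTGCA', 'ACAGT', 'GATCA'],  # toy rows
    'num. of occurrences': [10, 20, 30, 40],                 # made-up counts
})
kmer = 'TTG'

# regex=False treats kmer as a literal substring; ~ inverts the mask to keep non-matches
filtered = toyDf[~toyDf['k-mer sequence'].str.contains(kmer, regex=False)]
remaining = filtered['k-mer sequence'].tolist()
print(remaining)  # prints: ['ACAGT', 'GATCA']
```

Filtering this way leaves the column layout unchanged, so no `remove` column ends up in the exported CSV files.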

In [15]:
df.to_csv('kmers_remaining.csv')

In [16]:
df_kmerDiverseRanked = df.sort_values('num. of occurrences', ascending = False)
df_kmerDiverseRanked.head(21)
df_kmerDiverseRanked.to_csv('kmers_diverse_ranked.csv')