## TF-KOMPAS: kmer alignment to binding model
Author: Zachery Mielko

Date: 11/08/19


Aligns all kmers from high-throughput experiments to a PWM.  

Input: 

1. PWM
2. non-gapped kmer output from Seed-and-wobble

Output:

1. kPosition results
2. Log file (optional)
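
To make the "kPosition" in the output concrete, here is a minimal pure-Python sketch of sliding one kmer across a padded PWM. The matrix values, `toy_pwm`, `pad_pwm`, and `best_position` are hypothetical names and numbers chosen for illustration; they are not the notebook's own code or data.

```python
# Toy position weight matrix: one row per base, one column per model position.
# The values are made up for illustration only.
toy_pwm = {
    "A": [0.7, 0.1, 0.1],
    "C": [0.1, 0.1, 0.7],
    "G": [0.1, 0.7, 0.1],
    "T": [0.1, 0.1, 0.1],
}

def pad_pwm(pwm, k):
    """Flank every row with k uniform (0.25) columns on each side."""
    flank = [0.25] * k
    return {base: flank + row + flank for base, row in pwm.items()}

def best_position(kmer, padded):
    """Return (best_score, k_position) for the highest-scoring 0-based offset."""
    width = len(padded["A"])
    best_score, best_pos = float("-inf"), 0
    for pos in range(width - len(kmer) + 1):
        # Sum the matrix entries matched by each letter at this offset
        score = sum(padded[base][pos + i] for i, base in enumerate(kmer))
        # Strict '>' keeps the left-most offset when scores tie
        if score > best_score:
            best_score, best_pos = score, pos
    return best_score, best_pos

padded = pad_pwm(toy_pwm, k=3)
score, k_position = best_position("AGC", padded)
print(k_position, round(score, 2))
```

With three padding columns on each side, the toy model occupies columns 3 to 5 of the padded matrix, so a kmer that matches the model exactly is reported at k-position 3.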

The only required parameter is the alignment mode (`alignMode`). With 'palindrome', the aligner does not classify orientation: it aligns both the kmer and its reverse complement by position and reports both. With 'nonPalindrome', it assigns each kmer to whichever orientation scores higher against the PWM. If both orientations score the same, both orientations of the kmer are reported.
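
The two modes can be sketched as a small decision function. This is an illustrative sketch only: `assign_orientations` is a hypothetical helper, and raising an error on an unknown mode is an assumption added here, not behavior taken from the notebook.

```python
def assign_orientations(fwd_score, rev_score, align_mode):
    """Return the orientation labels reported for one kmer.

    fwd_score / rev_score are the best model scores of the kmer and of its
    reverse complement. Hypothetical helper for illustration.
    """
    if align_mode == "palindrome":
        # No classification: both orientations are always reported
        return ["F", "R"]
    if align_mode == "nonPalindrome":
        labels = []
        if fwd_score >= rev_score:
            labels.append("F")
        if fwd_score <= rev_score:
            labels.append("R")
        # A tie satisfies both tests, so both labels are returned
        return labels
    # Assumption: reject unknown modes instead of silently returning nothing
    raise ValueError(f"unknown alignment mode: {align_mode!r}")

print(assign_orientations(2.1, 0.8, "nonPalindrome"))
print(assign_orientations(1.5, 1.5, "nonPalindrome"))
```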

-----
Update (11/08/19)

1. Fixed a visual alignment error: k-positions are 0-based, but the visual report for the first half was 1-based.
2. PWM and kmer inputs are now read as whitespace-delimited by default.
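
As a rough illustration of the 0-based convention in the visual report, the sketch below draws a kmer as a dash-padded row whose number of leading dashes equals its k-position. `draw_alignment` and the example values are hypothetical and only mirror the idea.

```python
def draw_alignment(kmer, k_position, padded_width):
    """Render a kmer at a 0-based k-position inside a row of dashes."""
    left = "-" * k_position
    right = "-" * (padded_width - k_position - len(kmer))
    return left + kmer + right

# A 3-mer at k-position 3 in a 9-column padded model
row = draw_alignment("AGC", 3, 9)
print(row)
```

Because the position is 0-based, the index of the first letter in the drawn row is the k-position itself.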

-----
Update (10/27/19)

1. Added more detail to log file, such as Model k-positions. 
2. Set up for obligate dimeric palindromes.
3. Made variable names camelCase

-----
To-do:

1. Make the input flexible enough for SELEX-seq affinity tables
2. Adapt TFFMs from JASPAR (first-order HMMs).
    * Requires an XML parser and some experimentation.


In [1]:
# Parameters
PWMFile = '/Users/ZMielko/Desktop/In_Vivo_Project/Ets1_KOMPAS_Mutations/Ets1_PWMadj.txt'
kmerFile = '/Users/ZMielko/Desktop/In_Vivo_Project/Ets1_KOMPAS_Mutations/Ets1_7mers_1111111.txt'
saveLocation = '/Users/ZMielko/Desktop/In_Vivo_Project/Ets1_KOMPAS_Mutations/'
saveFile = 'Ets1_7mer_align.txt'
alignMode = 'nonPalindrome'
# Optional Parameters
visualAlignment = True
logFile = True
# Imports
import pandas as pd
import numpy as np
from Bio.Seq import reverse_complement

# Get kmer data and length
kmer_data = pd.read_csv(kmerFile, sep=r'\s+')
kmer_F = kmer_data.iloc[:,0]
k = len(kmer_F[0])

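# Read PWM (no header; first column holds the base labels)
PWM = pd.read_csv(PWMFile, sep=r'\s+', header = None)
# Sort rows by label so they are ordered A, C, G, T
PWM = PWM.sort_values(by=0).to_numpy()
# Drop the label column and keep the numeric matrix
PWM = np.delete(PWM,0,1).astype('float')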

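# Pad both sides with k columns of uniform probability (0.25 per base)
# so that kmers can overhang either end of the model
equiProb = np.array(([0.25],[0.25],[0.25],[0.25]))
padding = np.repeat(equiProb, k, axis=1)
PWM = np.concatenate((padding,PWM,padding), axis=1)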

# Convert to dictionary
PWM_dict = {'A':PWM[0],'C':PWM[1],'G':PWM[2],'T':PWM[3]}

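# Alignment for one kmer
def alignment(kmer):
    """Slide the kmer across the padded PWM; return (best score, 0-based k-position)."""
    # Starting at 0 assumes the PWM entries are non-negative
    best_score = 0
    kposition = 0
    for PWM_pos in range(PWM.shape[1] - k + 1):
        # Sum the PWM entries matched by each letter at this offset
        score = sum(PWM_dict[letter][letter_pos + PWM_pos]
                    for letter_pos, letter in enumerate(kmer))
        # Strict '>' keeps the left-most position when scores tie
        if score > best_score:
            best_score = score
            kposition = PWM_pos
    return (best_score, kposition)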

# Align for all kmers
oriented_kmer = []
kposition = []
orientation = []
score = []
modelScore = []
for kmerScore in zip(kmer_F,kmer_data.iloc[:,2]):
    fwd = alignment(kmerScore[0])
    rev = alignment(reverse_complement(kmerScore[0]))
    if alignMode == 'palindrome':
        oriented_kmer.append(kmerScore[0])
        kposition.append(fwd[1])
        orientation.append("F")
        score.append(kmerScore[1])
        modelScore.append(fwd[0])
        oriented_kmer.append(reverse_complement(kmerScore[0]))
        kposition.append(rev[1])
        orientation.append("R")
        score.append(kmerScore[1])
        modelScore.append(rev[0])
    elif alignMode == 'nonPalindrome':
        if fwd[0] >= rev[0]:
            oriented_kmer.append(kmerScore[0])
            kposition.append(fwd[1])
            orientation.append("F")
            score.append(kmerScore[1])
            modelScore.append(fwd[0])
        if fwd[0] <= rev[0]:
            oriented_kmer.append(reverse_complement(kmerScore[0]))
            kposition.append(rev[1])
            orientation.append("R")
            score.append(kmerScore[1])
            modelScore.append(rev[0])
        
# If visual alignment is True, then draw the alignment as a column
if visualAlignment:
    visual = []
    for row in zip(oriented_kmer,kposition):
        visual.append(('-' * (row[1]))+ row[0]+ ('-' * (PWM.shape[1] - row[1] - k)))
    results = pd.DataFrame({'visual':visual,'kmer':oriented_kmer,'kposition':kposition,'orientation':orientation,'Escore':score})
else:    
    results = pd.DataFrame({'kmer':oriented_kmer,'kposition':kposition,'orientation':orientation,'Escore':score})
results.to_csv(saveLocation + '/' + saveFile, sep = '\t', index = False)

# If log file is True, append the run parameters to the log
if logFile:
    ##################
    # Log the output #
    ##################
    with open(saveLocation + "/AlignmentLog.txt", "a") as f:
        f.write("##### Parameters ##### \n")
        f.write(f"PWM file: {PWMFile}\n")
        f.write(f"kmerFile: {kmerFile}\n")
        f.write(f"Alignment Mode: {alignMode}\n")
        f.write(f"Visual alignment: {visualAlignment}\n")
        f.write(f"Model kPositions: {(k, PWM.shape[1] - k)}\n")