## Bring in Alignment for mapping

This program maps TFBS using Biopython's `motifs` package.

**Inputs**:
1. Raw sequences, before alignment (FASTA)
2. Sequences after alignment (FASTA)
3. TFBS position frequency matrix (PFM)

After working with this script for a while, it looks like I will eventually need to keep each motif in its own file and write a script that loops through the contents of a folder and outputs the position of each enhancer motif. These positions will be appended to the output file.

Each species ID will have to be appended to each dataframe. 

Also, I believe I should generate the raw (ungapped) sequences from the alignment file.
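
The folder-looping idea above could start from something like the sketch below. It only collects the motif files and derives a label for each one; reading and searching each motif would follow the same steps as the single-motif cells in this notebook. The function names and the `.fm` extension are assumptions for illustration, taken from the `bcd_FlyReg.fm` file name used later.

```python
import glob
import os


def list_motif_files(folder, extension=".fm"):
    """Return the sorted paths of all motif files in `folder`.

    The default extension is an assumption based on bcd_FlyReg.fm.
    """
    pattern = os.path.join(folder, "*" + extension)
    return sorted(glob.glob(pattern))


def motif_name(path):
    """Derive a motif label from a file path, e.g. 'bcd_FlyReg.fm' -> 'bcd_FlyReg'."""
    return os.path.splitext(os.path.basename(path))[0]
```

Each path returned by `list_motif_files` could then be passed to `motifs.read`, with `motif_name(path)` stored alongside the hits so that results from different motifs can be told apart in the combined output.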

In [32]:
from Bio import motifs
from Bio import SeqIO 
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC, generic_dna, generic_protein
import re
import pandas as pd
import numpy as np
import os, sys

In [21]:
## Alignment Input

# read in alignment as a list of sequences
alignment = list(SeqIO.parse("../data/fasta/output_ludwig_eve-striped-2.fa", "fasta"))

# Check
print("Found %i records in alignment file" % len(alignment))

## Turn the sequences into a list of strings
## They are no longer Bio.Seq.Seq objects though
alignment_string_list = []
for seq in alignment:
    alignment_string_list.append(str(seq.seq))

## Get just a list of the seq IDs
## Maybe I can add them back later
alignment_id = []
for seq in alignment:
    alignment_id.append(str(seq.id))

## This turns them into a dataframe
alignment_df = pd.DataFrame(
    {'id': alignment_id,
     'align_seq': alignment_string_list
    })


## Check
print(list(alignment_df))
print(type(alignment_df))

##############################
# [x] I feel weird going ahead without seq IDs. Maybe at this point turn into a dictionary?
# [x] No, what I really need is to turn this into a dataframe
# [ ] Now I need to make sure I can use the sequences to continue on, since I turned them into a dataframe.


Found 9 records in alignment file
['align_seq', 'id']
<class 'pandas.core.frame.DataFrame'>


In [34]:
# Make the raw sequence file from alignment file
## Use "list comprehensions" to remove gap sign. 
## alignment_ungapped = [record.replace('-', '') for record in alignment_string_list]
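
A minimal sketch of the ungapping step, using toy strings in place of `alignment_string_list`. The helper name `ungap` is hypothetical, and treating `.` as a gap character in addition to `-` is an assumption; the commented-out line above only removes `-`.

```python
def ungap(aligned_seq, gap_chars="-."):
    """Return the sequence with alignment gap characters removed."""
    return "".join(ch for ch in aligned_seq if ch not in gap_chars)


# Toy aligned sequences standing in for alignment_string_list.
toy_alignment = ["AC--GT", "A-CGGT"]
toy_ungapped = [ungap(seq) for seq in toy_alignment]
print(toy_ungapped)
```

Deriving the raw sequences this way would keep them in the same order as the alignment records, so IDs and sequences stay paired by index.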

In [49]:
## Raw Sequences Input

raw_sequences = list(SeqIO.parse("../data/fasta/ludwig_eve-striped-2.fasta", "fasta"))
print("Found %i records in raw sequence file" % len(raw_sequences))

# make all IUPAC.IUPACUnambiguousDNA()
raw_sequences_2 = []

for seq in raw_sequences:
    raw_sequences_2.append(Seq(str(seq.seq), IUPAC.IUPACUnambiguousDNA()))

# Check

#print raw_sequences_2
#print type(raw_sequences_2)

# Check
#for seq in raw_sequences_2:
    #print(seq.alphabet)
    #print(type(seq))

Found 9 records in raw sequence file


In [50]:
## Motif Input

with open("../data/PWM/transpose_fm/bcd_FlyReg.fm") as handle:
    bcd = motifs.read(handle, "pfm")
print(bcd.counts)
pwm = bcd.counts.normalize(pseudocounts=0.0)
pssm = pwm.log_odds()

# Check
print(pssm.alphabet)
print(type(raw_sequences_2))

        0      1      2      3      4      5      6      7
A:   0.19   0.17   0.88   0.92   0.04   0.04   0.06   0.12
C:   0.37   0.08   0.04   0.02   0.02   0.87   0.52   0.25
G:   0.08   0.04   0.04   0.04   0.33   0.02   0.08   0.37
T:   0.37   0.71   0.04   0.02   0.62   0.08   0.35   0.27

IUPACUnambiguousDNA()
<type 'list'>
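
To make the normalize and log-odds steps concrete, here is a rough pure-Python sketch of what happens to a single matrix column. It is not Biopython's implementation: the function name is hypothetical, and the uniform 0.25 background and base-2 logarithm are assumptions about the defaults in use.

```python
import math


def log_odds_column(counts, background=None):
    """Convert one matrix column of counts into log2-odds scores.

    `counts` maps each nucleotide to its count (or frequency) in the column.
    `background` maps each nucleotide to its expected frequency; a uniform
    0.25 background is assumed when it is omitted.
    """
    if background is None:
        background = {base: 0.25 for base in "ACGT"}
    total = float(sum(counts.values()))
    scores = {}
    for base, count in counts.items():
        freq = count / total
        if freq > 0:
            scores[base] = math.log(freq / background[base], 2)
        else:
            # With no pseudocounts, a zero count has no finite score.
            scores[base] = float("-inf")
    return scores


# Column 0 of the bcd matrix printed above.
column_0 = {"A": 0.19, "C": 0.37, "G": 0.08, "T": 0.37}
print(log_odds_column(column_0))
```

Bases that are more frequent than the background get positive scores and rarer ones get negative scores. The zero-count branch shows why `pseudocounts=0.0` can be risky for matrices that contain exact zeros.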


In [51]:
## Searching the Sequences
pssm_list = [ ]
for seq in raw_sequences_2:
    pssm_list.append(pssm.calculate(seq))

## Check
#print(pssm_list)
#for seq in pssm_list:
    #print("Background: %f" % bcd.pssm.mean(bcd.background))
    
################################
# [ ] It's the same background for all the sequences? That's weird, right?
################################

# Patser Threshold

distribution = pssm.distribution(background=bcd.background, precision=10**4)
threshold = distribution.threshold_patser()

print("Patser Threshold %5.3f" % threshold)  # automatically calculated Patser threshold

Patser Threshold 3.262
nothing
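
As a rough illustration of how a threshold is applied, the sketch below scores every forward-strand window of a sequence against a log-odds matrix and keeps the windows that meet a cutoff. The function names and the toy matrix are hypothetical, and the code ignores the reverse strand, so it is a simplification rather than a reimplementation of the Biopython calls above.

```python
def score_windows(sequence, pssm_columns):
    """Score every window of `sequence` against a log-odds matrix.

    `pssm_columns` is a list with one dict per motif position, mapping each
    nucleotide to its log-odds score. Returns one summed score per window.
    """
    width = len(pssm_columns)
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        scores.append(sum(col[base] for col, base in zip(pssm_columns, window)))
    return scores


def hits_above(scores, threshold):
    """Return (position, score) pairs whose score meets the threshold."""
    return [(pos, s) for pos, s in enumerate(scores) if s >= threshold]


# Toy two-column matrix (hypothetical values, not the bcd matrix).
toy_pssm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
]
toy_scores = score_windows("GACA", toy_pssm)
print(toy_scores)
print(hits_above(toy_scores, threshold=2.0))
```

Raising the threshold shrinks the hit list, which is the trade-off to keep in mind when choosing between a computed Patser threshold and a fixed cutoff.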


In [54]:
position_list = []
score_list = []

###################################
# [ ] Need to iterate over all of raw_sequences_2
# [ ] When iterating over raw_sequences_2, attach id
#################################
    
for position, score in pssm.search(raw_sequences_2[0], threshold=6):
    position_list.append(position)
    score_list.append(score)

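# Change position to positive
# Negative positions are reverse-strand hits, so add the sequence length
# (previously hard-coded as 905) to get the forward-strand coordinate.
seq_len = len(raw_sequences_2[0])
position_list_pos = []
for x in position_list:
    if x < 0:
        position_list_pos.append(seq_len + x)
    else:
        position_list_pos.append(x)
#print(position_list_pos)

strand = []
for x in position_list:
    if x < 0:
        strand.append("negative")
    else:
        strand.append("positive")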
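
The two open to-do items (iterate over every sequence and attach its ID) could be approached with a small helper like the one sketched here. The function names and the `species_A` label are hypothetical; the sequence length and the two example hits are the values seen in this notebook.

```python
def normalize_hit(position, seq_len):
    """Return (positive_position, strand) for one search hit.

    Assumes negative positions encode reverse-strand hits as
    position - seq_len, as in the cell above.
    """
    if position < 0:
        return seq_len + position, "negative"
    return position, "positive"


def annotate_hits(seq_id, seq_len, hits):
    """Attach the sequence ID and strand to each (position, score) hit."""
    rows = []
    for position, score in hits:
        pos, strand = normalize_hit(position, seq_len)
        rows.append({"species": seq_id,
                     "raw_position": position,
                     "raw_position_pos_only": pos,
                     "strand_direction": strand,
                     "score": score})
    return rows


# Hypothetical ID; length and hits are values from this notebook.
example_rows = annotate_hits("species_A", 905,
                             [(157, 10.457056), (-751, 8.946094)])
for row in example_rows:
    print(row)
```

Calling `annotate_hits` once per sequence and concatenating the returned rows would give one list of records that can be passed straight to `pd.DataFrame`.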

In [59]:
## get alignment position using `alignment_string_list`

remap_dict = {}
nuc_list = ['A', 'a', 'G', 'g', 'C', 'c', 'T', 't', 'N', 'n']
counter = 0

#######################
# [ ] Iterate through all species?
# [ ] maybe create a list of dictionaries?
#######################

for xInd, x in enumerate(alignment_string_list[1]):    
    if x in nuc_list:
        remap_dict[counter] = xInd
        counter += 1
# Check
# print(remap_dict)

# Now look up the value from the key: find the alignment position from the raw position

align_pos = [remap_dict[x] for x in position_list_pos]

# check
print(align_pos)
print(alignment_id[1])
print(type(alignment_id[1]))

[224, 221, 403, 596, 712, 881, 869, 903, 1114]
ludwig_eve-striped-2||MEMB002A|+
<type 'str'>
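
One possible way to extend the remapping to every species is to build one lookup per sequence and key the lookups by sequence ID, which is a variation on the "list of dictionaries" idea. This is a sketch with hypothetical names and a toy alignment. Unlike the cell above, it treats every non-gap character as a residue instead of checking against `nuc_list`, which is an assumption.

```python
GAP_CHARS = "-."


def build_remap(aligned_seq):
    """Return a list where index i holds the alignment column of raw position i."""
    return [col for col, ch in enumerate(aligned_seq) if ch not in GAP_CHARS]


def remap_all(ids, aligned_seqs):
    """Build one raw-to-alignment lookup per sequence, keyed by sequence ID."""
    return {seq_id: build_remap(seq) for seq_id, seq in zip(ids, aligned_seqs)}


# Toy alignment with hypothetical IDs.
toy_remaps = remap_all(["sp1", "sp2"], ["AC--GT", "-ACG-T"])
print(toy_remaps["sp1"])
print([toy_remaps["sp2"][pos] for pos in (0, 3)])
```

A list works as the lookup here because raw positions are consecutive integers starting at 0, so `remap[raw_position]` behaves like the `remap_dict` lookup above.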


In [62]:
# Make dataframe that has everything
# alignment_position uses the remapped values in align_pos
pos_df = pd.DataFrame(
    {'raw_position': position_list,
     'raw_position_pos_only': position_list_pos,
     'alignment_position': align_pos,
     'strand_direction': strand,
     'score': score_list,
     'species': alignment_id[1]
    })

#############
# [ ] this needs to be a value for species!!
#     like pos_df = alignment_id[1]
######

#pos_df['species'] = alignment_id[1]
print(pos_df)

   alignment_position  raw_position  raw_position_pos_only      score  \
0                 224           157                    157  10.457056   
1                 221          -751                    154   8.946094   
2                 403          -598                    307   6.417715   
3                 596          -450                    455   9.909568   
4                 712          -338                    567   9.909568   
5                 881           675                    675  10.016483   
6                 869          -242                    663   8.959556   
7                 903           690                    690   6.349059   
8                1114           -59                    846   6.380240   

                            species strand_direction  
0  ludwig_eve-striped-2||MEMB002A|+         positive  
1  ludwig_eve-striped-2||MEMB002A|+         negative  
2  ludwig_eve-striped-2||MEMB002A|+         negative  
3  ludwig_eve-striped-2||MEMB002A|+         negat