# Mapping TFBS to alignments

## Dictionary positions with an alignment

My conversation with Bronski on how to do this:

**Me**: Hey! I want to map nucleotide sequence positions onto an alignment. I know you have done this before, so I would rather not reinvent the wheel. You used a dictionary in Python, but how? Can I see your script? If this feature is embedded in a larger program, it might be easier to just explain your strategy.

**Bronski**: So the strategy is to loop through an aligned sequence and create a dictionary where the keys are the original indices and the values are the indices in the alignment.

Here’s a simple example:

In [71]:
aligned_seq = 'AGC---TTCATCA'
remap_dict = {}
nuc_list = ['A', 'a', 'G', 'g', 'C', 'c', 'T', 't', 'N', 'n']
counter = 0
for xInd, x in enumerate(aligned_seq):    
    if x in nuc_list:
        remap_dict[counter] = xInd
        counter += 1
        
print(remap_dict)

{0: 0, 1: 1, 2: 2, 3: 6, 4: 7, 5: 8, 6: 9, 7: 10, 8: 11, 9: 12}
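The same idea can be wrapped in a small function. This is only a minimal sketch, not Bronski's actual script: the name `build_remap` and the `gap_chars` parameter are hypothetical, and instead of checking against a nucleotide list it treats every character that is not a gap as a residue, so ambiguity codes are counted too.

```python
def build_remap(aligned_seq, gap_chars="-."):
    """Map ungapped (original) indices to alignment column indices.

    Hypothetical helper: any character not in gap_chars counts as a residue.
    """
    remap = {}
    ungapped_index = 0
    for column, char in enumerate(aligned_seq):
        if char not in gap_chars:
            remap[ungapped_index] = column
            ungapped_index += 1
    return remap


remap_dict = build_remap("AGC---TTCATCA")
print(remap_dict[3])  # the first base after the gap sits in alignment column 6
```

Returning the dictionary from a function makes it easy to build one per sequence in the alignment later on.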


## My Attempt with ES2



**Breakdown of what I have to do**:

1. Read in alignment file.
2. Separate the alignment into its individual sequences
    - make dictionary for each sequence
    - print out sequence?
        - run TFBS finder for each sequence
3. Make a vector for each sequence that records the presence or absence of a site at each position.
4. Figure out a way to visualize this.
   



### Read in Alignment File

Use `Bio.AlignIO.read()`. From the Biopython Tutorial:
- The first argument is a handle to read the data from, typically an open file (see Section 24.1 of the tutorial), or a filename.
- The second argument is a lowercase string specifying the alignment format. As in `Bio.SeqIO`, the file format is not guessed for you! See [http://biopython.org/wiki/AlignIO](http://biopython.org/wiki/AlignIO) for a full listing of supported formats.

In [72]:
from Bio import AlignIO
alignment = AlignIO.read("../data/fasta/output_ludwig_eve-striped-2.fa", "fasta")
print(alignment)

SingleLetterAlphabet() alignment with 9 rows and 1136 columns
ATATAACCCAATAATTTTAACTAACTCGCAGGA---GCAAGAAG...C-- ludwig_eve-striped-2||MEMB002F|+
ATATAACCCAATAATTTGAACTAACTCGCAGGA---GCAAGAAG...CTG ludwig_eve-striped-2||MEMB002A|+
ATATAACCCAATAATTTTAACTAACTCGCAGGA---GCAAGAAG...CTG ludwig_eve-striped-2||MEMB003C|-
ATATAACCCAATAATTTTAACTAACTCGCAGGA---GCAAGAAG...CTG ludwig_eve-striped-2||MEMB002C|+
ATATAACCCAATAATTTTAACTAACTCGCAGGAGCGGCAAGAAG...C-- ludwig_eve-striped-2||MEMB003B|+
ATATAACCCAATAATTTTAACTAACTCGCAGGAGCGGCAAGAAG...C-- ludwig_eve-striped-2||MEMB003F|+
ATATAACCCAATAATTTTAACTAACTCGCAGGAGCGGCCAGAAG...C-- ludwig_eve-striped-2||MEMB003D|-
ATATAACCCAATAATTTGAACTAACTCGCAGGAGCGGCAAGAAG...CTA ludwig_eve-striped-2||MEMB002D|+
ATATAACCCAATAATTTTAACTAACTCGCAGGAGCGGCAAGAAG...CTA ludwig_eve-striped-2||MEMB002E|-


In [73]:
for record in alignment:
    print(record.id)

ludwig_eve-striped-2||MEMB002F|+
ludwig_eve-striped-2||MEMB002A|+
ludwig_eve-striped-2||MEMB003C|-
ludwig_eve-striped-2||MEMB002C|+
ludwig_eve-striped-2||MEMB003B|+
ludwig_eve-striped-2||MEMB003F|+
ludwig_eve-striped-2||MEMB003D|-
ludwig_eve-striped-2||MEMB002D|+
ludwig_eve-striped-2||MEMB002E|-


But we don't really need the alignment as an alignment per se, although it will be important for viewing and testing later. We need each sequence separately, so I am going to use `SeqIO.parse`.

In [74]:
from Bio import SeqIO

# read in alignment as a list of sequences
records = list(SeqIO.parse("../data/fasta/output_ludwig_eve-striped-2.fa", "fasta")) 


In [75]:
# Testing with the first sequence
seqTest = records[0]
#print(seqTest.seq)
print(type(seqTest))


<class 'Bio.SeqRecord.SeqRecord'>


In [76]:
# Convert just the sequence to a plain string instead of a Seq object
aligned_seq = str(seqTest.seq)
print(type(aligned_seq)) # check

<type 'str'>


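**Notes on loop**

- `enumerate()` yields `(index, item)` pairs, counting up from 0.
- `xInd` is the index of the character in the aligned sequence (the alignment column).
- `x` is the character at that position: a nucleotide or a gap.
- `counter` only advances on nucleotides, so it tracks the index in the original, ungapped sequence.
- `remap_dict[counter] = xInd` then stores the mapping from original index to alignment index.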
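To see what `enumerate()` actually produces, here is a short illustration on a made-up five-character string (the string is an assumption for demonstration only, not part of the alignment).

```python
toy_seq = "AC-GT"  # made-up example, not from the alignment

# enumerate() yields (index, item) pairs; it does not print anything itself
pairs = list(enumerate(toy_seq))
print(pairs)
# [(0, 'A'), (1, 'C'), (2, '-'), (3, 'G'), (4, 'T')]
```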

In [83]:
remap_dict = {}
nuc_list = ['A', 'a', 'G', 'g', 'C', 'c', 'T', 't', 'N', 'n']
counter = 0

for xInd, x in enumerate(aligned_seq):    
    if x in nuc_list:
        remap_dict[counter] = xInd
        counter += 1

#checking dictionary created
print(len(remap_dict)) # should be the length of the ungapped sequence, not of the alignment
print(remap_dict[40]) #should print the value of the number key
print(type(remap_dict[40])) #Check data type

{0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 21, 22: 22, 23: 23, 24: 24, 25: 25, 26: 26, 27: 27, 28: 28, 29: 29, 30: 30, 31: 31, 32: 32, 33: 36, 34: 37, 35: 38, 36: 39, 37: 40, 38: 41, 39: 42, 40: 43, 41: 44, 42: 45, 43: 46, 44: 47, 45: 48, 46: 49, 47: 59, 48: 60, 49: 61, 50: 62, 51: 63, 52: 64, 53: 65, 54: 66, 55: 67, 56: 68, 57: 69, 58: 70, 59: 71, 60: 72, 61: 73, 62: 74, 63: 75, 64: 76, 65: 77, 66: 78, 67: 79, 68: 80, 69: 81, 70: 82, 71: 83, 72: 84, 73: 85, 74: 86, 75: 87, 76: 88, 77: 89, 78: 90, 79: 91, 80: 92, 81: 93, 82: 94, 83: 97, 84: 98, 85: 99, 86: 100, 87: 101, 88: 102, 89: 103, 90: 104, 91: 105, 92: 106, 93: 139, 94: 140, 95: 141, 96: 142, 97: 143, 98: 144, 99: 145, 100: 146, 101: 147, 102: 148, 103: 149, 104: 150, 105: 151, 106: 152, 107: 153, 108: 154, 109: 160, 110: 161, 111: 162, 112: 163, 113: 164, 114: 165, 115: 166, 116: 167, 117: 168, 118: 169, 119: 170, 120: 

We need two versions of each sequence: the aligned one, with gaps, and the original one, without gaps.
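One way to get the ungapped version is to strip the gap characters from the aligned string. The following is a minimal sketch on the toy sequence from Bronski's example, assuming `-` is the only gap character. It also checks that the remap dictionary points each original base at the same base in the alignment.

```python
aligned_seq = "AGC---TTCATCA"  # toy alignment row from Bronski's example
ungapped_seq = aligned_seq.replace("-", "")  # assumes '-' is the only gap symbol

# Rebuild the original-index -> alignment-index dictionary
remap_dict = {}
counter = 0
for xInd, x in enumerate(aligned_seq):
    if x != "-":
        remap_dict[counter] = xInd
        counter += 1

# Every original position should map to the same base in the alignment
consistent = all(ungapped_seq[i] == aligned_seq[remap_dict[i]] for i in remap_dict)
print(ungapped_seq, len(ungapped_seq), consistent)
```

For a Biopython `Seq` object, the ungap method linked under Resources may be an alternative to the string replace used here.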

## Putting the Remap together with TFBS 

The last part is to create a vector that spans the entire alignment and holds 1 at each position that is part of a bicoid site and 0 otherwise.

In [78]:
## Attempt at vector

bcdSites = [0] * len(aligned_seq)

#from loctaingTFB.ipy
TFBS = [10, 102, 137, -741, -680, -595, 309, -497, -485, 429, 453, 459, 465, -376, -347, -339, -308, 593, 600, -289, 613, 623, -240, 679, -128, -77, 825, 826, 886]

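# Need to make the positions positive, but taking abs() is not what I need.
#TFBS_pos = [abs(k) for k in TFBS]

print(TFBS)
m = 7  # motif length

# Print the m-long window starting at each position.
# Note: this slices the aligned (gapped) sequence, so windows can contain gaps.
for pos in TFBS:
    print(aligned_seq[pos:pos+m])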


[10, 102, 137, -741, -680, -595, 309, -497, -485, 429, 453, 459, 465, -376, -347, -339, -308, 593, 600, -289, 613, 623, -240, 679, -128, -77, 825, 826, 886]
ATAATTT
CCTCG--
--AACTG
--CTGTG
ACAG---
TCCATCC
-------
-------
-------
GAACGGT
GAGACAG
G------
-------
AACAGGC
AGTTGGG
TA-----
-TAATCC
GGGATTA
GCCGAGG
CACCTCA
CTTGGTA
CGATCC-
TGCGCCA
AAAGTCA
AATAAAT
GTC-CCA
CCC-TAA
CC-TAAT
------T


Now I need to make a new vector that says whether a bicoid site is present or absent at each alignment position. Can the negative positions be used to query the dictionary? Likely not, since its keys run from 0 up to the sequence length minus 1.
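A quick check on a toy dictionary (made up for illustration) confirms this: a negative key raises `KeyError`. Adding the ungapped sequence length converts it to a usable key, under the assumption that negative positions follow Python's count-from-the-end convention.

```python
toy_remap = {0: 0, 1: 1, 2: 2, 3: 6, 4: 7}  # made-up mapping for a 5-base sequence

try:
    toy_remap[-2]
    lookup_failed = False
except KeyError:
    lookup_failed = True  # negative keys are simply not in the dictionary

# Convert assuming Python-style negative indexing: -2 means index len - 2
converted = len(toy_remap) + (-2)
print(lookup_failed, converted, toy_remap[converted])
```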

In [118]:
print(type(TFBS))
print(type(remap_dict))
print(TFBS)

# Okay, the problem is the negative numbers
another_key = [82, 85, 98]
print(len(remap_dict))

# So I need to convert the negative number first.
print([remap_dict[x] for x in another_key])


<type 'list'>
<type 'dict'>
[10, 102, 137, -741, -680, -595, 309, -497, -485, 429, 453, 459, 465, -376, -347, -339, -308, 593, 600, -289, 613, 623, -240, 679, -128, -77, 825, 826, 886]
905
[94, 99, 144]


In [128]:
# Convert negative TFBS positions to non-negative indices.
seq_len = len(remap_dict)  # 905, the length of the ungapped sequence
TFBS_2 = []
for x in TFBS:
    if x < 0:
        TFBS_2.append(seq_len + x)
    else:
        TFBS_2.append(x)

print(TFBS_2)



[10, 102, 137, 164, 225, 310, 309, 408, 420, 429, 453, 459, 465, 529, 558, 566, 597, 593, 600, 616, 613, 623, 665, 679, 777, 828, 825, 826, 886]
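With non-negative positions available, the presence/absence vector can be filled by sending each site position through the remap dictionary. This is only a sketch on toy data: the function name `mark_sites`, the motif length of 3, and the site positions are made up, and it assumes a hit covers `m` consecutive bases of the ungapped sequence starting at the reported position.

```python
def mark_sites(aligned_seq, sites, m):
    """Return a 0/1 list over alignment columns; 1 marks a base inside a site.

    Hypothetical helper. `sites` holds start positions in the ungapped
    sequence; negative values are read as Python-style indices from the end.
    """
    # original index -> alignment column
    remap = {}
    counter = 0
    for column, char in enumerate(aligned_seq):
        if char != "-":
            remap[counter] = column
            counter += 1

    vector = [0] * len(aligned_seq)
    for start in sites:
        if start < 0:
            start += len(remap)  # convert a negative position
        for i in range(start, min(start + m, len(remap))):
            vector[remap[i]] = 1
    return vector


# Toy data: a site at original position 2 and one at -3 (i.e. position 7)
print(mark_sites("AGC---TTCATCA", [2, -3], m=3))
# [0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1]
```

Gap columns that fall inside a site stay 0 in this sketch. Whether they should be marked as part of the site is a design choice that depends on how the vector will be visualized.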


## Resources

1. [Get a list of values for a list of keys](https://stackoverflow.com/questions/18453566/python-dictionary-get-list-of-values-for-list-of-keys)
2. [Ungap sequences](http://biopython.org/DIST/docs/api/Bio.Seq.Seq-class.html#ungap)