## Putting it together - lists

The [```range```](https://docs.python.org/3/library/functions.html#func-range) function takes a number (n) and returns a [range object](https://docs.python.org/3/library/stdtypes.html#typesseq-range). Converting this with ```list``` gives a list of the numbers from 0 to n-1; note that n itself is not included.

```
rg = range(5)
# convert to list to print
print(list(rg))
# prints [0, 1, 2, 3, 4]
```

Using this function, assign a list of the numbers 0 to 10 to a variable. Use subsetting with a step of 2 to extract the even numbers. Check that the 4th element of this new list modulo 2 is zero. Remember that Python counts the first object in a list as object zero. Finally, sort the new list of even numbers in reverse order (you might want to recap the difference between the ```sort()``` method and the ```sorted``` function).

In [3]:
num_rnge = range(11) # set up range, returns a range object
print(list(num_rnge)) # convert range object to list to print
num_rnge = list(num_rnge) # convert range object to list to manipulate

evens = num_rnge[::2] # get the evens, subset with step value
print(evens[3]%2) # does fourth value divide cleanly by 2? i.e. no remainder

print(evens)

print(sorted(evens, reverse=True)) # reversed, sorted creates list on the fly
# i.e. does not change 'evens' list, only shows you reversed list

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
0
[0, 2, 4, 6, 8, 10]
[10, 8, 6, 4, 2, 0]
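The difference between the ```sorted``` function and the ```sort()``` method mentioned in the exercise can be sketched as follows. The variable names ```descending``` and ```result``` are just for illustration:

```python
evens = [0, 2, 4, 6, 8, 10]

# sorted() returns a new list and leaves the original untouched
descending = sorted(evens, reverse=True)
print(descending)  # [10, 8, 6, 4, 2, 0]
print(evens)       # [0, 2, 4, 6, 8, 10]

# the sort() method reorders the list in place and returns None
result = evens.sort(reverse=True)
print(result)      # None
print(evens)       # [10, 8, 6, 4, 2, 0]
```

Because ```sort()``` returns ```None```, writing ```evens = evens.sort()``` would throw the list away, which is a common slip.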


Bonus - Create a string variable. What do you get if you use the ```list``` function on that variable? Look up the [```join()```](http://www.tutorialspoint.com/python/string_join.htm) method for strings and use that to put your string back together. For this last bit remember the list is the *sequence* you want to join together! 

I will warn you now: the ```join``` method looks completely unintuitive! Think of it as a sequence of instructions read from left to right: 'use this character to *join* the stuff in this object together into a ```string```'.

In [4]:
s = 'iain' # create variable
lst = list(s) # list func on a string returns a list of the string characters
print(lst)
print(''.join(lst)) # join the list of chars back together with nothing between the list elements (i.e. letters)

# print('+'.join(lst)) would give you i+a+i+n

['i', 'a', 'i', 'n']
iain
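One thing worth knowing is that ```join()``` only works on sequences of strings. A small sketch, using made-up values, of converting numbers first and of ```split()``` as the reverse operation:

```python
letters = list('iain')
hyphenated = '-'.join(letters)
print(hyphenated)               # i-a-i-n

# join() only accepts strings, so numbers must be converted first
numbers = [8, 3, 85]
as_text = [str(n) for n in numbers]
date_text = '-'.join(as_text)
print(date_text)                # 8-3-85

# split() goes the other way: string -> list
parts = date_text.split('-')
print(parts)                    # ['8', '3', '85']
```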


## Putting it together - dicts

A date of the form 8-MAR-85 includes the name of the month, which must be translated to a number. Write a script that will translate from the name of a month to the number using a ```dict```.

Hint: In your script create a ```dict``` suitable for decoding month names to numbers (e.g. MAR would be 3). Use the string ```split()``` method to create a ```list``` from your ```string``` and use the ```dict``` to look up the number for the month in the input data using list indexing. Finally use string formatting to print an informative message relating the month name and number.

In [5]:
dt = '8-MAR-85' # date string
dts = dt.split('-') # split the string on the '-' character, get a list back, ['8', 'MAR', '85']
print(dts)

# dict for translation, month name is key, month number is value
dct = {'JAN':1, 'FEB':2, 'MAR':3, 'APR':4, 'MAY':5, 'JUN':6, 'JUL':7, 'AUG':8, 'SEP':9, 'OCT':10, 'NOV':11, 'DEC':12}

# look up month name we have in our string
month_num = dct[dts[1]]
print('The month number for %s is %d.' % (dts[1], month_num)) # print message with string formatting

['8', 'MAR', '85']
The month number for MAR is 3.
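The lookup above raises a ```KeyError``` if the month name is not in the dict. One possible variation, offered only as a sketch, builds the dict from a list and uses the ```get()``` method so that unknown names return ```None```. The function name ```month_to_number``` and the extra test dates are hypothetical:

```python
months = ['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN',
          'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC']

# build the lookup dict from the list; enumerate is told to start counting at 1
month_numbers = {name: number for number, name in enumerate(months, start=1)}

def month_to_number(date_string):
    """Return the month number for a date like '8-MAR-85', or None if unknown."""
    month_name = date_string.split('-')[1].upper()  # tolerate 'mar' or 'Mar'
    return month_numbers.get(month_name)            # get() avoids a KeyError

print(month_to_number('8-MAR-85'))   # 3
print(month_to_number('25-dec-99'))  # 12
print(month_to_number('1-XYZ-00'))   # None
```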


## Putting it together

The following text shows the triplet codons and the amino acids they code for. Paste these into two lists and use the ```zip()``` function to create a ```dict``` that will translate from DNA to AA. Use this to find the AA for the triplets TCC, ATG, ATC, CTC and GAG. Print these out.

codons: 'TTT','TTC','TTA','TTG','CTT','CTC','CTA','CTG','ATT','ATC','ATA','ATG','GTT','GTC','GTA','GTG','TCT','TCC','TCA','TCG','CCT','CCC','CCA','CCG','ACT','ACC','ACA','ACG','GCT','GCC','GCA','GCG','TAT','TAC','TAA','TAG','CAT','CAC','CAA','CAG','AAT','AAC','AAA','AAG','GAT','GAC','GAA','GAG','TGT','TGC','TGA','TGG','CGT','CGC','CGA','CGG','AGT','AGC','AGA','AGG','GGT','GGC','GGA','GGG'

amino acids: "F","F","L","L","L","L","L","L","I","I","I","M","V","V","V","V","S","S","S","S","P","P","P","P","T","T","T","T","A","A","A","A","Y","Y","STOP","STOP","H","H","Q","Q","N","N","K","K","D","D","E","E","C","C","STOP","W","R","R","R","R","S","S","R","R","G","G","G","G"

In [7]:
codons = ['TTT','TTC','TTA','TTG','CTT','CTC','CTA','CTG','ATT','ATC','ATA','ATG','GTT','GTC','GTA','GTG','TCT','TCC','TCA','TCG','CCT','CCC','CCA','CCG','ACT','ACC','ACA','ACG','GCT','GCC','GCA','GCG','TAT','TAC','TAA','TAG','CAT','CAC','CAA','CAG','AAT','AAC','AAA','AAG','GAT','GAC','GAA','GAG','TGT','TGC','TGA','TGG','CGT','CGC','CGA','CGG','AGT','AGC','AGA','AGG','GGT','GGC','GGA','GGG']
aas = ["F","F","L","L","L","L","L","L","I","I","I","M","V","V","V","V","S","S","S","S","P","P","P","P","T","T","T","T","A","A","A","A","Y","Y","STOP","STOP","H","H","Q","Q","N","N","K","K","D","D","E","E","C","C","STOP","W","R","R","R","R","S","S","R","R","G","G","G","G"]
trans_dct = dict(zip(codons, aas))

print('The amino acid for the codon TCC is %s.' % trans_dct['TCC'])
print('The amino acid for the codon ATG is %s.' % trans_dct['ATG'])
print('The amino acid for the codon ATC is %s.' % trans_dct['ATC'])
print('The amino acid for the codon CTC is %s.' % trans_dct['CTC'])
print('The amino acid for the codon GAG is %s.' % trans_dct['GAG'])

# hah - classic acrostic fun (https://en.wikipedia.org/wiki/Acrostic)

The amino acid for the codon TCC is S.
The amino acid for the codon ATG is M.
The amino acid for the codon ATC is I.
The amino acid for the codon CTC is L.
The amino acid for the codon GAG is E.
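Repeating the same ```print``` line for each codon invites copy-and-paste slips. A minimal sketch of the same lookups done in a ```for``` loop is below. It assumes loops are familiar and, to stay short, uses only the five codons from the exercise rather than the full table:

```python
# a small subset of the full codon table, enough for this illustration
codons = ['TCC', 'ATG', 'ATC', 'CTC', 'GAG']
aas = ['S', 'M', 'I', 'L', 'E']
trans_dct = dict(zip(codons, aas))  # zip pairs items by position

peptide = ''
for codon in ['TCC', 'ATG', 'ATC', 'CTC', 'GAG']:
    aa = trans_dct[codon]
    print('The amino acid for the codon %s is %s.' % (codon, aa))
    peptide += aa  # build up the one-letter amino acid string

print(peptide)  # SMILE
```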


## Homework

This week's homework is similar to last week's. We again have the sequence for human PPARG in FASTA format. As you did last week, you'll need to create a string variable from this. We also have the DNA triplets and the amino acids they code for. You can paste these into two lists and create a translation dictionary from them that will allow you to look up an amino acid given the DNA triplet that codes for that amino acid.

Using these data, write a script that will look for the start codon ('ATG') and one of the stop codons ('TAG'). Print an informative message with the position of these two codons. Using these positions, extract the proposed actual coding sequence (be careful with slicing positions here). Use the modulo operator to check whether this proposed coding sequence is in frame, i.e. is it formed of complete 3-base codons only?

Print out the first 3 and last 3 codons. Finally use your newly created translation dictionary to look up the first and last amino acids. Again be careful with the indices for slicing.

The output from your script should be something like:

```The coding sequence begins at xxx and ends at xxx.```

```The first 3 codons are xxxxxxxxx.```

```The last 3 codons are xxxxxxxxx.```

```The first amino acid is x.```

```The last amino acid is x.```

```
>gi|116284367|ref|NM_005037.5| Homo sapiens peroxisome proliferator-activated receptor gamma (PPARG), transcript variant 4, mRNA
GGCGCCCGCGCCCGCCCCCGCGCCGGGCCCGGCTCGGCCCGACCCGGCTCCGCCGCGGGCAGGCGGGGCC
CAGCGCACTCGGAGCCCGAGCCCGAGCCGCAGCCGCCGCCTGGGGCGCTTGGGTCGGCCTCGAGGACACC
GGAGAGGGGCGCCACGCCGCCGTGGCCGCAGAAATGACCATGGTTGACACAGAGATGCCATTCTGGCCCA
CCAACTTTGGGATCAGCTCCGTGGATCTCTCCGTAATGGAAGACCACTCCCACTCCTTTGATATCAAGCC
CTTCACTACTGTTGACTTCTCCAGCATTTCTACTCCACATTACGAAGACATTCCATTCACAAGAACAGAT
CCAGTGGTTGCAGATTACAAGTATGACCTGAAACTTCAAGAGTACCAAAGTGCAATCAAAGTGGAGCCTG
CATCTCCACCTTATTATTCTGAGAAGACTCAGCTCTACAATAAGCCTCATGAAGAGCCTTCCAACTCCCT
CATGGCAATTGAATGTCGTGTCTGTGGAGATAAAGCTTCTGGATTTCACTATGGAGTTCATGCTTGTGAA
GGATGCAAGGGTTTCTTCCGGAGAACAATCAGATTGAAGCTTATCTATGACAGATGTGATCTTAACTGTC
GGATCCACAAAAAAAGTAGAAATAAATGTCAGTACTGTCGGTTTCAGAAATGCCTTGCAGTGGGGATGTC
TCATAATGCCATCAGGTTTGGGCGGATGCCACAGGCCGAGAAGGAGAAGCTGTTGGCGGAGATCTCCAGT
GATATCGACCAGCTGAATCCAGAGTCCGCTGACCTCCGGGCCCTGGCAAAACATTTGTATGACTCATACA
TAAAGTCCTTCCCGCTGACCAAAGCAAAGGCGAGGGCGATCTTGACAGGAAAGACAACAGACAAATCACC
ATTCGTTATCTATGACATGAATTCCTTAATGATGGGAGAAGATAAAATCAAGTTCAAACACATCACCCCC
CTGCAGGAGCAGAGCAAAGAGGTGGCCATCCGCATCTTTCAGGGCTGCCAGTTTCGCTCCGTGGAGGCTG
TGCAGGAGATCACAGAGTATGCCAAAAGCATTCCTGGTTTTGTAAATCTTGACTTGAACGACCAAGTAAC
TCTCCTCAAATATGGAGTCCACGAGATCATTTACACAATGCTGGCCTCCTTGATGAATAAAGATGGGGTT
CTCATATCCGAGGGCCAAGGCTTCATGACAAGGGAGTTTCTAAAGAGCCTGCGAAAGCCTTTTGGTGACT
TTATGGAGCCCAAGTTTGAGTTTGCTGTGAAGTTCAATGCACTGGAATTAGATGACAGCGACTTGGCAAT
ATTTATTGCTGTCATTATTCTCAGTGGAGACCGCCCAGGTTTGCTGAATGTGAAGCCCATTGAAGACATT
CAAGACAACCTGCTACAAGCCCTGGAGCTCCAGCTGAAGCTGAACCACCCTGAGTCCTCACAGCTGTTTG
CCAAGCTGCTCCAGAAAATGACAGACCTCAGACAGATTGTCACGGAACACGTGCAGCTACTGCAGGTGAT
CAAGAAGACGGAGACAGACATGAGTCTTCACCCGCTCCTGCAGGAGATCTACAAGGACTTGTACTAGCAG
AGAGTCCTGAGCCACTGCCAACATTTCCCTTCTTCCAGTTGCACTATTCTGAGGGAAAATCTGACACCTA
AGAAATTTACTGTGAAAAAGCATTTTAAAAAGAAAAGGTTTTAGAATATGATCTATTTTATGCATATTGT
TTATAAAGACACATTTACAATTTACTTTTAATATTAAAAATTACCATATTATGAAATTGCTGATAGTA
```
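When a FASTA record is pasted into a multiline string, the header line and the newline at the end of each line become part of the string. A minimal sketch of one way to strip them is below, using a shortened, made-up record rather than the PPARG sequence:

```python
# a shortened, made-up FASTA record used only for illustration
fasta = '''>example_id hypothetical record
GGCGCC
ATGACC
TAGCAG'''

lines = fasta.split('\n')      # one list element per line
sequence = ''.join(lines[1:])  # drop the header line, join the rest with nothing between

print(sequence)          # GGCGCCATGACCTAGCAG
print(len(sequence))     # 18
print('\n' in sequence)  # False
```

With the newlines removed, positions returned by ```index()``` and lengths used with the modulo operator count bases only.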

```
codons: 'TTT','TTC','TTA','TTG','CTT','CTC','CTA','CTG','ATT','ATC','ATA','ATG','GTT','GTC','GTA','GTG','TCT','TCC','TCA','TCG','CCT','CCC','CCA','CCG','ACT','ACC','ACA','ACG','GCT','GCC','GCA','GCG','TAT','TAC','TAA','TAG','CAT','CAC','CAA','CAG','AAT','AAC','AAA','AAG','GAT','GAC','GAA','GAG','TGT','TGC','TGA','TGG','CGT','CGC','CGA','CGG','AGT','AGC','AGA','AGG','GGT','GGC','GGA','GGG'
```

```
amino acids: "F","F","L","L","L","L","L","L","I","I","I","M","V","V","V","V","S","S","S","S","P","P","P","P","T","T","T","T","A","A","A","A","Y","Y","STOP","STOP","H","H","Q","Q","N","N","K","K","D","D","E","E","C","C","STOP","W","R","R","R","R","S","S","R","R","G","G","G","G"
```

In [8]:
# sequence pasted from above into multiline string

seq = '''GGCGCCCGCGCCCGCCCCCGCGCCGGGCCCGGCTCGGCCCGACCCGGCTCCGCCGCGGGCAGGCGGGGCC
CAGCGCACTCGGAGCCCGAGCCCGAGCCGCAGCCGCCGCCTGGGGCGCTTGGGTCGGCCTCGAGGACACC
GGAGAGGGGCGCCACGCCGCCGTGGCCGCAGAAATGACCATGGTTGACACAGAGATGCCATTCTGGCCCA
CCAACTTTGGGATCAGCTCCGTGGATCTCTCCGTAATGGAAGACCACTCCCACTCCTTTGATATCAAGCC
CTTCACTACTGTTGACTTCTCCAGCATTTCTACTCCACATTACGAAGACATTCCATTCACAAGAACAGAT
CCAGTGGTTGCAGATTACAAGTATGACCTGAAACTTCAAGAGTACCAAAGTGCAATCAAAGTGGAGCCTG
CATCTCCACCTTATTATTCTGAGAAGACTCAGCTCTACAATAAGCCTCATGAAGAGCCTTCCAACTCCCT
CATGGCAATTGAATGTCGTGTCTGTGGAGATAAAGCTTCTGGATTTCACTATGGAGTTCATGCTTGTGAA
GGATGCAAGGGTTTCTTCCGGAGAACAATCAGATTGAAGCTTATCTATGACAGATGTGATCTTAACTGTC
GGATCCACAAAAAAAGTAGAAATAAATGTCAGTACTGTCGGTTTCAGAAATGCCTTGCAGTGGGGATGTC
TCATAATGCCATCAGGTTTGGGCGGATGCCACAGGCCGAGAAGGAGAAGCTGTTGGCGGAGATCTCCAGT
GATATCGACCAGCTGAATCCAGAGTCCGCTGACCTCCGGGCCCTGGCAAAACATTTGTATGACTCATACA
TAAAGTCCTTCCCGCTGACCAAAGCAAAGGCGAGGGCGATCTTGACAGGAAAGACAACAGACAAATCACC
ATTCGTTATCTATGACATGAATTCCTTAATGATGGGAGAAGATAAAATCAAGTTCAAACACATCACCCCC
CTGCAGGAGCAGAGCAAAGAGGTGGCCATCCGCATCTTTCAGGGCTGCCAGTTTCGCTCCGTGGAGGCTG
TGCAGGAGATCACAGAGTATGCCAAAAGCATTCCTGGTTTTGTAAATCTTGACTTGAACGACCAAGTAAC
TCTCCTCAAATATGGAGTCCACGAGATCATTTACACAATGCTGGCCTCCTTGATGAATAAAGATGGGGTT
CTCATATCCGAGGGCCAAGGCTTCATGACAAGGGAGTTTCTAAAGAGCCTGCGAAAGCCTTTTGGTGACT
TTATGGAGCCCAAGTTTGAGTTTGCTGTGAAGTTCAATGCACTGGAATTAGATGACAGCGACTTGGCAAT
ATTTATTGCTGTCATTATTCTCAGTGGAGACCGCCCAGGTTTGCTGAATGTGAAGCCCATTGAAGACATT
CAAGACAACCTGCTACAAGCCCTGGAGCTCCAGCTGAAGCTGAACCACCCTGAGTCCTCACAGCTGTTTG
CCAAGCTGCTCCAGAAAATGACAGACCTCAGACAGATTGTCACGGAACACGTGCAGCTACTGCAGGTGAT
CAAGAAGACGGAGACAGACATGAGTCTTCACCCGCTCCTGCAGGAGATCTACAAGGACTTGTACTAGCAG
AGAGTCCTGAGCCACTGCCAACATTTCCCTTCTTCCAGTTGCACTATTCTGAGGGAAAATCTGACACCTA
AGAAATTTACTGTGAAAAAGCATTTTAAAAAGAAAAGGTTTTAGAATATGATCTATTTTATGCATATTGT
TTATAAAGACACATTTACAATTTACTTTTAATATTAAAAATTACCATATTATGAAATTGCTGATAGTA'''

# lists created by pasting in values above
codons = ['TTT','TTC','TTA','TTG','CTT','CTC','CTA','CTG','ATT','ATC','ATA','ATG','GTT','GTC','GTA','GTG','TCT','TCC','TCA','TCG','CCT','CCC','CCA','CCG','ACT','ACC','ACA','ACG','GCT','GCC','GCA','GCG','TAT','TAC','TAA','TAG','CAT','CAC','CAA','CAG','AAT','AAC','AAA','AAG','GAT','GAC','GAA','GAG','TGT','TGC','TGA','TGG','CGT','CGC','CGA','CGG','AGT','AGC','AGA','AGG','GGT','GGC','GGA','GGG']
aa = ["F","F","L","L","L","L","L","L","I","I","I","M","V","V","V","V","S","S","S","S","P","P","P","P","T","T","T","T","A","A","A","A","Y","Y","STOP","STOP","H","H","Q","Q","N","N","K","K","D","D","E","E","C","C","STOP","W","R","R","R","R","S","S","R","R","G","G","G","G"]

# zip lists to make translation dict, keys are codons, values are amino acids
translation_dct = dict(zip(codons, aa))

# the multiline string has a newline at the end of each line; remove them
# so that positions and lengths refer to bases only
seq = seq.replace('\n', '')

# index method on string to get the positions of the first 'ATG' and the first 'TAG'
start = seq.index('ATG')
tag_pos = seq.index('TAG')

# string formatting for reporting
print('The coding sequence begins at %d and ends at %d.' % (start, tag_pos))

# slicing excludes the end position, so add 3 to include the whole stop codon
stop = tag_pos + 3

# extract coding sequence
coding_seq = seq[start:stop]
print(len(coding_seq) % 3) # 0 means in frame; any other remainder means this 'TAG' is not in frame with the 'ATG'

# subsetting to get first and last 3 codons
print('The first 3 codons are %s.' % coding_seq[:9])
print('The last 3 codons are %s.' % coding_seq[-9:]) # negative index, go 9 values back from end

# subset and dict lookup to get first and last amino acids
print('The first amino acid is %s.' % translation_dct[coding_seq[:3]])
print('The last amino acid is %s.' % translation_dct[coding_seq[-3:]])

The coding sequence begins at 173 and ends at 646.
2
The first 3 codons are ATGACCATG.
The last 3 codons are AAAAAGTAG.
The first amino acid is M.
The last amino acid is STOP.
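The ```index()``` method returns the first match regardless of reading frame, so the first 'TAG' it finds need not line up with the start codon. One possible way to look for a stop codon that is in frame is to step through the string three bases at a time from the start codon. The sketch below assumes loops are familiar and uses a made-up toy sequence, not the PPARG data:

```python
# toy sequence, not real data: an out-of-frame TAG sits before the in-frame TAA
toy = 'CCATGGCTAGCTAAGG'
stop_codons = ['TAA', 'TAG', 'TGA']

start = toy.index('ATG')
first_tag = toy.index('TAG')
print(first_tag)                # 7
print((first_tag - start) % 3)  # 2, so this TAG is not in frame with the ATG

# step through the sequence one codon (3 bases) at a time from the start codon
in_frame_stop = None
for pos in range(start, len(toy) - 2, 3):
    if toy[pos:pos + 3] in stop_codons:
        in_frame_stop = pos
        break

# this toy sequence does contain an in-frame stop, so in_frame_stop is a number here
coding = toy[start:in_frame_stop + 3]
print(in_frame_stop)    # 11
print(coding)           # ATGGCTAGCTAA
print(len(coding) % 3)  # 0
```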
