# PLM4 - Lists, advanced loops, reading files

## A collection of DNA strings

We can store several sequences in a single string, separated by commas

In [10]:
dna_str = 'CAGCTATCG,ACTATTAGGAT,GCTGATCCTATCA'
print(dna_str)

CAGCTATCG,ACTATTAGGAT,GCTGATCCTATCA


But a list is a much better option. If the string has a separator character, we can get a list from it using the **.split()** method.

In [11]:
dna_str.split(',')

['CAGCTATCG', 'ACTATTAGGAT', 'GCTGATCCTATCA']
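The choice of separator matters. The sketch below uses made-up strings (not part of the course data) to show two cases: calling *.split()* without arguments splits on any run of whitespace, while an explicit separator splits at every occurrence, so two consecutive separators produce an empty string.

```python
# Hypothetical example strings, invented for illustration.
spaced = 'CAGCTATCG  ACTATTAGGAT\tGCTGATCCTATCA'
by_whitespace = spaced.split()       # no argument: split on any run of whitespace
print(by_whitespace)   # ['CAGCTATCG', 'ACTATTAGGAT', 'GCTGATCCTATCA']

double_comma = 'CAGCTATCG,,ACTATTAGGAT'
by_comma = double_comma.split(',')   # explicit separator: empty fields are kept
print(by_comma)        # ['CAGCTATCG', '', 'ACTATTAGGAT']
```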

In [12]:
dnas = dna_str.split(',')
print(dnas)
print(type(dnas))

['CAGCTATCG', 'ACTATTAGGAT', 'GCTGATCCTATCA']
<class 'list'>


Another way to get a list from a string is with the **list()** function. In this case the elements of the list are the individual characters of the string.

In [13]:
bases = list('ACGTGTATGCGATCTA')
print(bases)

['A', 'C', 'G', 'T', 'G', 'T', 'A', 'T', 'G', 'C', 'G', 'A', 'T', 'C', 'T', 'A']


Or we can initialize the list ourselves

In [15]:
dnas = []  # an empty list
print(dnas)
dnas = ['ATG', 'AATG', 'CGGT', 'ATAATG']
print(dnas)

[]
['ATG', 'AATG', 'CGGT', 'ATAATG']


Remember membership operators

In [16]:
print('ATG' in dnas)
print('ATG' not in dnas)
print('GGT' in dnas)
print('GGT' in dnas[2])

True
False
False
True


Remember that lists (also strings) can be *indexed* and *sliced*

In [17]:
print(dnas[1])
print(dnas[-1])
print(dnas[1:2])
print(dnas[0:-1])

AATG
ATAATG
['AATG']
['ATG', 'AATG', 'CGGT']


But **lists** are **mutable** objects. This means that we can change the elements

In [19]:
dnas[3] = 'GTCGTATATA'  # a str inside the list is immutable, so we cannot change one of its characters, but we can replace the whole element
print(dnas)

['ATG', 'AATG', 'CGGT', 'GTCGTATATA']


In [20]:
dnas[1:3] = ['AA', 'GC']  # replaces the elements at indexes 1 and 2; index 3 (the end of the slice) is excluded
print(dnas)

['ATG', 'AA', 'GC', 'GTCGTATATA']
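To make the contrast between mutable lists and immutable strings concrete, here is a minimal sketch. It uses a separate list named `seqs` (an assumption, so that *dnas* is left untouched) and shows that assigning to one character fails, that replacing the whole element works, and that slice assignment can also change the length of the list.

```python
seqs = ['ATG', 'AATG', 'CGGT', 'ATAATG']

# Strings are immutable: assigning to a single character raises TypeError.
try:
    seqs[0][0] = 'C'
except TypeError as err:
    print('TypeError:', err)

# Instead, build a new string and replace the whole list element.
seqs[0] = 'C' + seqs[0][1:]
print(seqs)   # ['CTG', 'AATG', 'CGGT', 'ATAATG']

# Slice assignment may replace two elements with one, shortening the list.
seqs[1:3] = ['AA']
print(seqs)   # ['CTG', 'AA', 'ATAATG']
```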


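We use the *.append()* method of **lists** to add one element at the end. Because *.append()* modifies the list in place, running the cell again appends the element again. The output below apparently comes from such a re-run session, which would explain why 'GATTCTA' appears twice and 'AA' is already missing.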

In [23]:
aaa = dnas.append('GATTCTA')  # add one extra element at the end of the list
print(dnas)
print(aaa)  # .append() returns None: it modifies the original list in place

['ATG', 'GC', 'GTCGTATATA', 'GATTCTA', 'GATTCTA']
None
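Since *.append()* returns `None`, writing something like `seqs = seqs.append(...)` would lose the list. The sketch below, with a hypothetical list `seqs`, contrasts the in-place method with the `+` operator, which builds a new list and leaves the original unchanged.

```python
seqs = ['ATG', 'GC']

# In place: the list itself grows and the return value is None.
result = seqs.append('GATTCTA')
print(seqs)     # ['ATG', 'GC', 'GATTCTA']
print(result)   # None

# Concatenation with + creates a new list; seqs is not modified.
longer = seqs + ['TTT']
print(longer)   # ['ATG', 'GC', 'GATTCTA', 'TTT']
print(seqs)     # ['ATG', 'GC', 'GATTCTA']
```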


The *del* keyword deletes one element with a known index

In [24]:
del dnas[1] #to delete the element 1 (the second) in the list
print(dnas)

['ATG', 'GTCGTATATA', 'GATTCTA', 'GATTCTA']


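Alternatively we can remove an element by its value with the *.remove()* method. If the value is not in the list, Python raises a *ValueError*. That is what happens here, because 'GC' was already deleted by the *del* statement above.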

In [25]:
dnas.remove('GC')
print(dnas)

ValueError: list.remove(x): x not in list
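There are two common ways to avoid this error. The sketch below works on a separate, hypothetical list `seqs`: it either checks membership first or catches the exception. It also shows that *.remove()* deletes only the first matching element.

```python
seqs = ['ATG', 'GTCGTATATA', 'GATTCTA', 'GATTCTA']

# Option 1: check membership before removing.
if 'GC' in seqs:
    seqs.remove('GC')

# Option 2: attempt the removal and handle the error.
try:
    seqs.remove('GC')
except ValueError:
    print("'GC' is not in the list")

# .remove() deletes only the first occurrence of the value.
seqs.remove('GATTCTA')
print(seqs)   # ['ATG', 'GTCGTATATA', 'GATTCTA']
```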

If we need to know the index of an element we have the *.index()* method, which returns the position of the first match

In [26]:
dnas.index('GATTCTA')

2

The *.sort()* method orders the list

In [27]:
dnas.sort()  # sort alphabetically; like .append(), it returns None and modifies the original list in place
print(dnas)

['ATG', 'GATTCTA', 'GATTCTA', 'GTCGTATATA']
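When we want to keep the original order, the built-in *sorted()* function returns a new sorted list instead of modifying the list in place. Both *sorted()* and *.sort()* accept the `key` and `reverse` arguments. The list `seqs` below is a made-up example.

```python
seqs = ['GTCGTATATA', 'TTG', 'ATAATG', 'CGGT']

by_alphabet = sorted(seqs)                        # new list, alphabetical order
by_length = sorted(seqs, key=len)                 # new list, shortest first
longest_first = sorted(seqs, key=len, reverse=True)

print(by_alphabet)    # ['ATAATG', 'CGGT', 'GTCGTATATA', 'TTG']
print(by_length)      # ['TTG', 'CGGT', 'ATAATG', 'GTCGTATATA']
print(longest_first)  # ['GTCGTATATA', 'ATAATG', 'CGGT', 'TTG']
print(seqs)           # the original list is unchanged
```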


Or if we need the list in the opposite order we can use *.reverse()*

In [28]:
dnas.reverse() #to invert the order
print(dnas)

['GTCGTATATA', 'GATTCTA', 'GATTCTA', 'ATG']


Whenever we need to convert back to a *str* we can use the **.join()** method. But notice that *join* is not a *list* method, but a *str* method: we call it on the separator and pass the list as the argument

In [29]:
'-'.join(dnas) #we join the elements of the list with the object we specify, in this case, a dash ("-")

'GTCGTATATA-GATTCTA-GATTCTA-ATG'

In [30]:
long_seq = ''.join(dnas)  # joining with an empty string (no separator) concatenates all elements into one sequence
print(long_seq)

GTCGTATATAGATTCTAGATTCTAATG
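As a quick check that *.split()* and *.join()* are inverse operations when the same separator is used, here is a small sketch based on the string from the beginning of this session. The tab-separated variant is only an illustration of choosing a different separator.

```python
dna_str = 'CAGCTATCG,ACTATTAGGAT,GCTGATCCTATCA'

parts = dna_str.split(',')      # str -> list
rebuilt = ','.join(parts)       # list -> str, same separator
print(rebuilt == dna_str)       # True

tab_separated = '\t'.join(parts)   # same elements, different separator
print(tab_separated.split('\t') == parts)   # True
```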


## Accumulators

In the previous session we used the following code snippet:

In [31]:
print(dnas)
counts = 0
for dna in dnas:
    counts += dna.count('A')
print('number of A:', counts)

['GTCGTATATA', 'GATTCTA', 'GATTCTA', 'ATG']
number of A: 8


In the code above, the *variable* **counts** is a special one. It is called an **accumulator**. An accumulator is nothing more than a regular variable that is used to store information across the iterations of a loop. An accumulator does not necessarily need to be an *integer*. It can be a variable of any type. For instance, a *string*:

In [32]:
print(dnas)
concat_dna = ''
for dna in dnas:
    concat_dna += dna
print(concat_dna)

['GTCGTATATA', 'GATTCTA', 'GATTCTA', 'ATG']
GTCGTATATAGATTCTAGATTCTAATG


Or a *list* of booleans:

In [33]:
contains_TA = []
for dna in dnas:
    if 'TA' in dna:
        contains_TA.append(True)
    else:
        contains_TA.append(False)
print(contains_TA)

[True, True, True, False]


By the way... we can use the **sum** function to add *booleans* as if they were 0s and 1s

In [34]:
sum(contains_TA)

3

Or a list of *strings*:

In [35]:
seqs_with_TA = []
for dna in dnas:
    if 'TA' in dna:
        seqs_with_TA.append(dna)
print(seqs_with_TA)

['GTCGTATATA', 'GATTCTA', 'GATTCTA']


In this particular case, though, we would prefer the expression below. This is called a **list comprehension** and is a concise way to create a list

In [37]:
seq_with_TA = [dna 
               for dna in dnas
               if 'TA' in dna]
print(seq_with_TA)  # same result as the accumulator loop above, written more concisely

['GTCGTATATA', 'GATTCTA', 'GATTCTA']
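The other accumulator loops in this section can be written in the same style. The following sketch rebuilds the list of booleans, adds a list of lengths as an extra illustration, and counts the matches directly by passing a comprehension without brackets (a generator expression) to *sum*.

```python
dnas = ['GTCGTATATA', 'GATTCTA', 'GATTCTA', 'ATG']

contains_TA = ['TA' in dna for dna in dnas]    # list of booleans
lengths = [len(dna) for dna in dnas]           # transform each element
n_with_TA = sum('TA' in dna for dna in dnas)   # count the True values directly

print(contains_TA)  # [True, True, True, False]
print(lengths)      # [10, 7, 7, 3]
print(n_with_TA)    # 3
```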


## Advanced for loops

In Python, the most common way to loop over **a single list** is this one:

In [39]:
print(dnas)
for dna in dnas:
    print(dna)

['GTCGTATATA', 'GATTCTA', 'GATTCTA', 'ATG']
GTCGTATATA
GATTCTA
GATTCTA
ATG


When we need to loop over **multiple lists** at the same time the function *zip()* comes in handy

In [40]:
dnas2 = ['ACTGA', 'AGTTATAT', 'TTT']
print(dnas2)
zip(dnas, dnas2)  # a zip object does not display its contents, but we can convert it into a list

['ACTGA', 'AGTTATAT', 'TTT']


<zip at 0x23dd24edf88>

The *zip()* function returns an **iterator**, an object that yields its items one at a time when we loop over it. Unlike a list, it cannot be indexed and does not show its contents when displayed, but we can convert an **iterator** to a **list** to understand what's in the output of *zip*.

In [41]:
list(zip(dnas, dnas2))  # now converted to a list of tuples (tuples are like lists, but immutable)

[('GTCGTATATA', 'ACTGA'), ('GATTCTA', 'AGTTATAT'), ('GATTCTA', 'TTT')]

We created a **list of tuples**. Tuples are similar to lists except that they are an *immutable* type. Many Python functions return tuples instead of lists.

In [42]:
paired_list = list(zip(dnas, dnas2))
print(paired_list)
print(paired_list[0]) #to get the first element
print(paired_list[0][1])  # the second element (index 1) of the first tuple (index 0)

[('GTCGTATATA', 'ACTGA'), ('GATTCTA', 'AGTTATAT'), ('GATTCTA', 'TTT')]
('GTCGTATATA', 'ACTGA')
ACTGA
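A short sketch of what immutability means for tuples, using a hypothetical variable `pair`: we can read the elements by index or unpack them into separate variables, but we cannot assign to them.

```python
pair = ('GTCGTATATA', 'ACTGA')

# Unpacking: one variable per element of the tuple.
first, second = pair
print(first)    # GTCGTATATA
print(second)   # ACTGA

# Tuples are immutable: item assignment raises TypeError.
try:
    pair[0] = 'AAA'
except TypeError as err:
    print('TypeError:', err)
```

This unpacking is the mechanism behind loops that use two loop variables with *zip*.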


In [43]:
print(dnas)
print(dnas2)
for dna, dna2 in zip(dnas, dnas2):
    print(dna, dna2)

['GTCGTATATA', 'GATTCTA', 'GATTCTA', 'ATG']
['ACTGA', 'AGTTATAT', 'TTT']
GTCGTATATA ACTGA
GATTCTA AGTTATAT
GATTCTA TTT
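Notice that 'ATG', the last element of *dnas*, never appears in the output above: *zip()* stops as soon as the shortest list is exhausted. If we need every element, one option from the standard library is `itertools.zip_longest`, sketched below with an empty string as the fill value (the fill value is our choice for this example).

```python
from itertools import zip_longest

dnas = ['GTCGTATATA', 'GATTCTA', 'GATTCTA', 'ATG']
dnas2 = ['ACTGA', 'AGTTATAT', 'TTT']

pairs = list(zip(dnas, dnas2))   # stops at the shortest list
print(len(pairs))                # 3

# zip_longest keeps going and fills the missing values.
padded = list(zip_longest(dnas, dnas2, fillvalue=''))
print(len(padded))               # 4
print(padded[-1])                # ('ATG', '')
```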


In many cases we need to work with element indexes. For this we use the function *enumerate*

In [44]:
for num, dna in enumerate(dnas):
    print(num, dna)  # enumerate yields two values per iteration: num (the index) and dna (the element)

0 GTCGTATATA
1 GATTCTA
2 GATTCTA
3 ATG
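Two common uses of *enumerate* are sketched below with a made-up list `seqs`: starting the numbering at 1 with the `start` argument, and using the index to replace elements of the list while looping over it.

```python
seqs = ['gtcg', 'gatt', 'atg']

# Number the elements starting from 1 instead of 0.
for position, seq in enumerate(seqs, start=1):
    print(position, seq)

# Use the index to replace each element with its uppercase version.
for i, seq in enumerate(seqs):
    seqs[i] = seq.upper()
print(seqs)   # ['GTCG', 'GATT', 'ATG']
```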


## Read and write files

The function **open** is used to open files. By default it opens a file in *read* mode.
The method **read** can subsequently read a file object into a string.

In [51]:
f = open('brca1_cds.dna')
content = f.read()

Notice that *content* contains end-of-line characters

In [52]:
content

'atggatttatctgctcttcgcgttgaagaagtacaaaatgtcattaatgctatgcagaaaatcttagagtgtcccatctg\ntctggagttgatcaaggaacctgtctccacaaagtgtgaccacatattttgcaa\nattttgcatgctgaaacttctcaaccagaagaaagggccttcacagtgtcctttatgtaagaatgatataaccaaaag\ngagcctacaagaaagtacgagatttagtcaacttgttgaagagctattgaaaatcatttgtgcttttcagcttgacacaggtttggagt\natgcaaacagctataattttgcaaaaaaggaaaataactctcctgaacatctaaaagatgaagtttctatcatccaaagtatgggctacagaaaccgtgccaaaagacttctacagagtgaacccgaaaatccttccttg\ncaggaaaccagtctcagtgtccaactctctaaccttggaactgtgagaactctgaggacaaagcagcggatacaacctcaaaagacgtctgtctacattgaattgg\ngatctgattcttctgaagataccgttaataaggcaacttattgcag\ntgtgggagatcaagaattgttacaaatcacccctcaaggaaccagggatgaaatcagtttggattctgcaaaaaagg\nctgcttgtgaattttctgagacggatgtaacaaatactgaacatcatcaacccagtaataatgatttgaacaccactgagaagcgtgcagctgagaggcatccagaaaagtatcagggtagttctgtttcaaacttgcatgtggagccatgtggcacaaatactcatgccagctcattacagcatgagaacagcagtttattactcactaaagacagaatgaatgtagaaaaggctgaattctgtaataaaagcaaacagcctggcttagcaaggagccaacataacagatgggctggaagtaaggaaacatg

But these are not shown as such when printing. They produce the actual line breaks

In [53]:
print(content)

atggatttatctgctcttcgcgttgaagaagtacaaaatgtcattaatgctatgcagaaaatcttagagtgtcccatctg
tctggagttgatcaaggaacctgtctccacaaagtgtgaccacatattttgcaa
attttgcatgctgaaacttctcaaccagaagaaagggccttcacagtgtcctttatgtaagaatgatataaccaaaag
gagcctacaagaaagtacgagatttagtcaacttgttgaagagctattgaaaatcatttgtgcttttcagcttgacacaggtttggagt
atgcaaacagctataattttgcaaaaaaggaaaataactctcctgaacatctaaaagatgaagtttctatcatccaaagtatgggctacagaaaccgtgccaaaagacttctacagagtgaacccgaaaatccttccttg
caggaaaccagtctcagtgtccaactctctaaccttggaactgtgagaactctgaggacaaagcagcggatacaacctcaaaagacgtctgtctacattgaattgg
gatctgattcttctgaagataccgttaataaggcaacttattgcag
tgtgggagatcaagaattgttacaaatcacccctcaaggaaccagggatgaaatcagtttggattctgcaaaaaagg
ctgcttgtgaattttctgagacggatgtaacaaatactgaacatcatcaacccagtaataatgatttgaacaccactgagaagcgtgcagctgagaggcatccagaaaagtatcagggtagttctgtttcaaacttgcatgtggagccatgtggcacaaatactcatgccagctcattacagcatgagaacagcagtttattactcactaaagacagaatgaatgtagaaaaggctgaattctgtaataaaagcaaacagcctggcttagcaaggagccaacataacagatgggctggaagtaaggaaacatgtaatgatag

The method **.read()** *"consumes"* the file object. If we call it again we get an empty string unless we re-open the file.

In [54]:
f.read()

''
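The reason is that the file object keeps track of a reading position, which is at the end of the file after the first *.read()*. As an alternative to re-opening, the *.seek()* method can move the position back to the start. The sketch below creates a small throwaway file (the name `example.dna` and its content are made up) so that it does not depend on the course data.

```python
import os
import tempfile

# Hypothetical throwaway file, so the sketch does not depend on brca1_cds.dna.
path = os.path.join(tempfile.mkdtemp(), 'example.dna')
f_out = open(path, 'w')           # 'w' opens the file for writing
f_out.write('atggat\nttatct\n')
f_out.close()

f = open(path)
first = f.read()    # reads everything; the position is now at the end
second = f.read()   # nothing left to read
f.seek(0)           # move the position back to the start of the file
third = f.read()    # the full content again
f.close()

print(repr(first))   # 'atggat\nttatct\n'
print(repr(second))  # ''
print(repr(third))   # 'atggat\nttatct\n'
```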

After using a file it is good practice to close it.

In [56]:
f.close()

Since the file object is an *iterable* we can use it in a for loop and iterate through the lines directly.

In [57]:
f = open('brca1_cds.dna')
for line in f:
    print(line)
f.close()

atggatttatctgctcttcgcgttgaagaagtacaaaatgtcattaatgctatgcagaaaatcttagagtgtcccatctg

tctggagttgatcaaggaacctgtctccacaaagtgtgaccacatattttgcaa

attttgcatgctgaaacttctcaaccagaagaaagggccttcacagtgtcctttatgtaagaatgatataaccaaaag

gagcctacaagaaagtacgagatttagtcaacttgttgaagagctattgaaaatcatttgtgcttttcagcttgacacaggtttggagt

atgcaaacagctataattttgcaaaaaaggaaaataactctcctgaacatctaaaagatgaagtttctatcatccaaagtatgggctacagaaaccgtgccaaaagacttctacagagtgaacccgaaaatccttccttg

caggaaaccagtctcagtgtccaactctctaaccttggaactgtgagaactctgaggacaaagcagcggatacaacctcaaaagacgtctgtctacattgaattgg

gatctgattcttctgaagataccgttaataaggcaacttattgcag

tgtgggagatcaagaattgttacaaatcacccctcaaggaaccagggatgaaatcagtttggattctgcaaaaaagg

ctgcttgtgaattttctgagacggatgtaacaaatactgaacatcatcaacccagtaataatgatttgaacaccactgagaagcgtgcagctgagaggcatccagaaaagtatcagggtagttctgtttcaaacttgcatgtggagccatgtggcacaaatactcatgccagctcattacagcatgagaacagcagtttattactcactaaagacagaatgaatgtagaaaaggctgaattctgtaataaaagcaaacagcctggcttagcaaggagccaacataacagatgggctggaagtaaggaaacatgt

Notice that every call to *print* adds an end-of-line character. Since each **line** variable already ends with an end-of-line character, we get an empty line after each one, so we often want to remove it. The **.strip()** method (without any argument) removes whitespace, including spaces and end-of-line characters, at the beginning and end of the string

In [58]:
f = open('brca1_cds.dna')
for line in f:  # the file object is an iterable!
    print(line.strip())
f.close()

atggatttatctgctcttcgcgttgaagaagtacaaaatgtcattaatgctatgcagaaaatcttagagtgtcccatctg
tctggagttgatcaaggaacctgtctccacaaagtgtgaccacatattttgcaa
attttgcatgctgaaacttctcaaccagaagaaagggccttcacagtgtcctttatgtaagaatgatataaccaaaag
gagcctacaagaaagtacgagatttagtcaacttgttgaagagctattgaaaatcatttgtgcttttcagcttgacacaggtttggagt
atgcaaacagctataattttgcaaaaaaggaaaataactctcctgaacatctaaaagatgaagtttctatcatccaaagtatgggctacagaaaccgtgccaaaagacttctacagagtgaacccgaaaatccttccttg
caggaaaccagtctcagtgtccaactctctaaccttggaactgtgagaactctgaggacaaagcagcggatacaacctcaaaagacgtctgtctacattgaattgg
gatctgattcttctgaagataccgttaataaggcaacttattgcag
tgtgggagatcaagaattgttacaaatcacccctcaaggaaccagggatgaaatcagtttggattctgcaaaaaagg
ctgcttgtgaattttctgagacggatgtaacaaatactgaacatcatcaacccagtaataatgatttgaacaccactgagaagcgtgcagctgagaggcatccagaaaagtatcagggtagttctgtttcaaacttgcatgtggagccatgtggcacaaatactcatgccagctcattacagcatgagaacagcagtttattactcactaaagacagaatgaatgtagaaaaggctgaattctgtaataaaagcaaacagcctggcttagcaaggagccaacataacagatgggctggaagtaaggaaacatgtaatgatag
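If we want to remove less than *.strip()* does, the related *.rstrip()* method only works on the end of the string, and it accepts the characters to remove as an argument. The sketch below uses a made-up line with extra spaces to show the difference.

```python
line = '  atggat ttatct \n'   # made-up line with extra spaces

both = line.strip()                # whitespace removed at both ends
right = line.rstrip()              # whitespace removed at the end only
newline_only = line.rstrip('\n')   # only the end-of-line character removed

print(repr(both))          # 'atggat ttatct'
print(repr(right))         # '  atggat ttatct'
print(repr(newline_only))  # '  atggat ttatct '
```

None of these methods touches the space in the middle of the string.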

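There is no need to define a variable for the **file object**. In this case we cannot call *.close()* ourselves: the file is closed when Python discards the unreferenced file object, which in the standard interpreter (CPython) usually happens right after the loop ends, although the language does not guarantee when.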

In [66]:
for line in open('brca1_cds.dna'):
    print(line.strip())

atggatttatctgctcttcgcgttgaagaagtacaaaatgtcattaatgctatgcagaaaatcttagagtgtcccatctg
tctggagttgatcaaggaacctgtctccacaaagtgtgaccacatattttgcaa
attttgcatgctgaaacttctcaaccagaagaaagggccttcacagtgtcctttatgtaagaatgatataaccaaaag
gagcctacaagaaagtacgagatttagtcaacttgttgaagagctattgaaaatcatttgtgcttttcagcttgacacaggtttggagt
atgcaaacagctataattttgcaaaaaaggaaaataactctcctgaacatctaaaagatgaagtttctatcatccaaagtatgggctacagaaaccgtgccaaaagacttctacagagtgaacccgaaaatccttccttg
caggaaaccagtctcagtgtccaactctctaaccttggaactgtgagaactctgaggacaaagcagcggatacaacctcaaaagacgtctgtctacattgaattgg
gatctgattcttctgaagataccgttaataaggcaacttattgcag
tgtgggagatcaagaattgttacaaatcacccctcaaggaaccagggatgaaatcagtttggattctgcaaaaaagg
ctgcttgtgaattttctgagacggatgtaacaaatactgaacatcatcaacccagtaataatgatttgaacaccactgagaagcgtgcagctgagaggcatccagaaaagtatcagggtagttctgtttcaaacttgcatgtggagccatgtggcacaaatactcatgccagctcattacagcatgagaacagcagtttattactcactaaagacagaatgaatgtagaaaaggctgaattctgtaataaaagcaaacagcctggcttagcaaggagccaacataacagatgggctggaagtaaggaaacatgtaatgatag

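The previous snippet is OK for a simple case such as the one used here. But the most idiomatic way of opening a file in Python is using a **with statement**, which closes the file automatically when the indented block ends, even if an error occurs inside it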

In [67]:
with open('brca1_cds.dna') as f:
    for line in f:
        print(line.strip())

atggatttatctgctcttcgcgttgaagaagtacaaaatgtcattaatgctatgcagaaaatcttagagtgtcccatctg
tctggagttgatcaaggaacctgtctccacaaagtgtgaccacatattttgcaa
attttgcatgctgaaacttctcaaccagaagaaagggccttcacagtgtcctttatgtaagaatgatataaccaaaag
gagcctacaagaaagtacgagatttagtcaacttgttgaagagctattgaaaatcatttgtgcttttcagcttgacacaggtttggagt
atgcaaacagctataattttgcaaaaaaggaaaataactctcctgaacatctaaaagatgaagtttctatcatccaaagtatgggctacagaaaccgtgccaaaagacttctacagagtgaacccgaaaatccttccttg
caggaaaccagtctcagtgtccaactctctaaccttggaactgtgagaactctgaggacaaagcagcggatacaacctcaaaagacgtctgtctacattgaattgg
gatctgattcttctgaagataccgttaataaggcaacttattgcag
tgtgggagatcaagaattgttacaaatcacccctcaaggaaccagggatgaaatcagtttggattctgcaaaaaagg
ctgcttgtgaattttctgagacggatgtaacaaatactgaacatcatcaacccagtaataatgatttgaacaccactgagaagcgtgcagctgagaggcatccagaaaagtatcagggtagttctgtttcaaacttgcatgtggagccatgtggcacaaatactcatgccagctcattacagcatgagaacagcagtttattactcactaaagacagaatgaatgtagaaaaggctgaattctgtaataaaagcaaacagcctggcttagcaaggagccaacataacagatgggctggaagtaaggaaacatgtaatgatag
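We can verify that the file is closed by checking the `closed` attribute of the file object after the **with** block. The sketch below uses a small throwaway file (hypothetical name and content) and also simulates an error inside the block to show that the file is closed in that case too.

```python
import os
import tempfile

# Hypothetical throwaway file, so the sketch does not depend on brca1_cds.dna.
path = os.path.join(tempfile.mkdtemp(), 'example.dna')
with open(path, 'w') as f_out:
    f_out.write('atggat\nttatct\n')

n_lines = 0
with open(path) as f:
    for line in f:
        n_lines += 1
print(n_lines)    # 2
print(f.closed)   # True: closed when the with block ended

# The file is also closed when an error interrupts the block.
try:
    with open(path) as f_err:
        raise RuntimeError('simulated problem while processing')
except RuntimeError:
    pass
print(f_err.closed)   # True
```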

Remember that even when the loop has finished the **loop variable** persists:

In [61]:
line

'caattgggcagatgtgtgaggcacctgtggtgacccgagagtgggtgttggacagtgtagcactctaccagtgccaggagctggacacctacctgataccccagatcccccacagccactactga\n'

This is sometimes useful to test code interactively. For instance, how should we convert the *str* in each line from the DNA alphabet to the RNA alphabet?

In [62]:
line.replace('t', 'u')

'caauugggcagaugugugaggcaccuguggugacccgagaguggguguuggacaguguagcacucuaccagugccaggagcuggacaccuaccugauaccccagaucccccacagccacuacuga\n'

To save this line to a file we need to open the new file in *write* mode

In [64]:
with open('brca1_cds.rna', 'w') as f_rna:
    f_rna.write(line.replace('t', 'u'))

If we want to store all lines we can use a for loop.

In [65]:
with open('brca1_cds.dna') as f_dna, open('brca1_cds.rna', 'w') as f_rna:
    for line in f_dna:
        f_rna.write(line.replace('t', 'u'))
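The pieces of this session can be combined. The sketch below is one possible way (not the only one) to read a multi-line sequence file, accumulate the stripped lines in a list, join them into a single sequence and write the RNA version on one line. The file names and the short sequence are made up so that the example runs on its own.

```python
import os
import tempfile

# Hypothetical file names in a temporary folder.
workdir = tempfile.mkdtemp()
dna_path = os.path.join(workdir, 'example.dna')
rna_path = os.path.join(workdir, 'example.rna')

# Create a small multi-line input file.
with open(dna_path, 'w') as f_dna:
    f_dna.write('atggat\nttatct\ngctctt\n')

# Accumulate the stripped lines in a list, then join them into one sequence.
chunks = []
with open(dna_path) as f_dna:
    for line in f_dna:
        chunks.append(line.strip())
sequence = ''.join(chunks)
print(sequence)   # atggatttatctgctctt

# Write the transcribed sequence as a single line.
with open(rna_path, 'w') as f_rna:
    f_rna.write(sequence.replace('t', 'u') + '\n')

# Read the result back to check it.
with open(rna_path) as f_rna:
    rna = f_rna.read()
print(rna.strip())   # auggauuuaucugcucuu
```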