# Lists
## Concept
Often in our code, we have collections of objects that "belong together". It can be a list of molecules, of pKas, of measurement results, etc. Those are all things we could store in a spreadsheet or a table, or plot in a graph.

We could imagine creating a variable for each of these objects, but it is tedious and we may not know in advance how many there will be.

Instead, Python has several data types meant to handle collections of items. The simplest of them is the list.

In [1]:
mylist = ["this", "is", "a", "list"]
print(mylist)

['this', 'is', 'a', 'list']


A list is defined by square brackets, and it contains any number of elements separated by commas.

You can store any data type you want in a list: floats, ints, strings (text), booleans, and even other lists. You can also mix them, meaning that the same list can hold elements of different types.

In [2]:
empty_list = []
chemical_list = ["C2H4", "benzene","oxygen"]
pKa_list = [6.4, 3.2, 8.5, 7.2]
list_of_lists = [["Oxalic acid", "Lactic acid", "Acetic acid"],[1.23, 3.86, 4.75]]
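
The cell above only shows lists whose elements share one type. As a small sketch of the mixing mentioned earlier, the list below (a hypothetical `record`, not used elsewhere in this notebook) holds a string, a float, an int, a boolean and another list:

```python
# Hypothetical record mixing several data types in a single list
record = ["acetic acid", 4.75, 2, True, ["C", "H", "O"]]

# Print each element together with the name of its type
for item in record:
    print(item, "is of type", type(item).__name__)
```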

You can also create a list from range:

In [3]:
list_from_range = list(range(5))
print(list_from_range)

[0, 1, 2, 3, 4]
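
As a brief sketch, `range` also accepts a starting point, an end point (excluded) and a step, so other regular sequences can be turned into lists the same way. The variable names below are only illustrative:

```python
# range(start, stop, step): stop is excluded
evens = list(range(2, 10, 2))
countdown = list(range(5, 0, -1))

print(evens)      # [2, 4, 6, 8]
print(countdown)  # [5, 4, 3, 2, 1]
```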


## Accessing and modifying list items

You can get the length of a list (i.e. the number of elements) using the function `len`.

In [4]:
print("length of pKa_list =",len(pKa_list))
print("length of list_of_lists =",len(list_of_lists))

length of pKa_list = 4
length of list_of_lists = 2


In the list of lists, len counts only 2 elements, since in the primary list, there are indeed only 2 elements (which are both themselves lists of 3 elements).
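
To make the distinction concrete, here is a minimal sketch that counts the outer elements, the elements of one inner list, and the total number of innermost elements. The list is redefined so the block runs on its own:

```python
list_of_lists = [["Oxalic acid", "Lactic acid", "Acetic acid"], [1.23, 3.86, 4.75]]

n_outer = len(list_of_lists)      # number of inner lists
n_inner = len(list_of_lists[0])   # number of elements in the first inner list

# Sum the lengths of all inner lists to count every innermost element
total = 0
for inner in list_of_lists:
    total += len(inner)

print(n_outer, n_inner, total)  # 2 3 6
```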

Once a list `A` is formed, you can access its i-th element as `A[i]`. Note that Python starts indexing at 0, so the first element is actually `A[0]`.

In [5]:
print("First chemical is",chemical_list[0])
print("Second chemical is",chemical_list[1])
print("The first list in the list of lists is",list_of_lists[0])
print("The first acid in that list is",list_of_lists[0][0])

First chemical is C2H4
Second chemical is benzene
The first list in the list of lists is ['Oxalic acid', 'Lactic acid', 'Acetic acid']
The first acid in that list is Oxalic acid


By accessing the specific object in a list, you can also modify it:

In [6]:
print(chemical_list)
chemical_list[0] = "ethene"
print(chemical_list)

['C2H4', 'benzene', 'oxygen']
['ethene', 'benzene', 'oxygen']


**Exercise**
Loop over the elements of chemical_list and print them:

In [7]:
### BEGIN SOLUTION
for i in range(len(chemical_list)):
    print(chemical_list[i])
### END SOLUTION

ethene
benzene
oxygen


There are other (some would argue nicer) ways to loop over elements in a list. The simplest is:

In [8]:
for chemical in chemical_list:
    print(chemical)

ethene
benzene
oxygen


And if you need both the index and the element, you can use "enumerate":

In [9]:
for i, chemical in enumerate(chemical_list):
    print("Chemical",i+1,"is",chemical)

Chemical 1 is ethene
Chemical 2 is benzene
Chemical 3 is oxygen
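
If the numbering should start at 1, `enumerate` accepts an optional `start` argument, which avoids writing `i+1` by hand. A minimal sketch, with the list redefined so the block is self-contained:

```python
chemical_list = ["ethene", "benzene", "oxygen"]

# start=1 makes the counter begin at 1 instead of 0
for number, chemical in enumerate(chemical_list, start=1):
    print("Chemical", number, "is", chemical)
```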


**Exercise:** Here is a list of acids and a corresponding list of pKa values. Sort both lists by increasing pKa (without using pre-built Python sorting functions).

In [10]:
acids_list = ["Lactic", "Oxalic", "Acetic", "Carbonic", "Benzoic", "Citric", "Formic"]
pKa_list = [3.86, 1.23, 4.75, 6.37, 4.19, 3.08, 3.75]
### BEGIN SOLUTION
n_acids= len(pKa_list)
for i in range(n_acids):
    for j in range(i+1,n_acids):
        if pKa_list[i] > pKa_list[j]:
            pKa_list[i], pKa_list[j] = pKa_list[j], pKa_list[i]
            acids_list[i], acids_list[j] = acids_list[j], acids_list[i]
### END SOLUTION

In [11]:
assert acids_list == ['Oxalic', 'Citric', 'Formic', 'Lactic', 'Benzoic', 'Acetic', 'Carbonic']
assert pKa_list == [1.23, 3.08, 3.75, 3.86, 4.19, 4.75, 6.37]
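
For comparison only (the exercise deliberately forbids it), the same result can be obtained with the built-in `sorted` and `zip` functions. This is a sketch of one possible approach, using shorter variable names (`acids`, `pkas`) to avoid overwriting the lists above:

```python
acids = ["Lactic", "Oxalic", "Acetic", "Carbonic", "Benzoic", "Citric", "Formic"]
pkas = [3.86, 1.23, 4.75, 6.37, 4.19, 3.08, 3.75]

# zip pairs each pKa with its acid; sorted orders the pairs by pKa first
pairs = sorted(zip(pkas, acids))

# Split the sorted pairs back into two lists
sorted_pkas = [pka for pka, acid in pairs]
sorted_acids = [acid for pka, acid in pairs]

print(sorted_acids)
print(sorted_pkas)
```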

## Expanding and combining lists

You can add an element at the end of a list using `append`:

In [12]:
chemical_list.append("sulfate")
print(chemical_list)

['ethene', 'benzene', 'oxygen', 'sulfate']


**Exercise** Create the same list as `list(range(5))`, but by starting from an empty list and using a loop and `append`.

In [13]:
### BEGIN SOLUTION
mylist = []
for i in range(5):
    mylist.append(i)
print(mylist)
### END SOLUTION

[0, 1, 2, 3, 4]


You can also combine and multiply lists using the arithmetic operators.

In [14]:
big_list = chemical_list + ["phosphate", "propane"]
print(big_list)
print(["phosphate", "propane"]*5)

['ethene', 'benzene', 'oxygen', 'sulfate', 'phosphate', 'propane']
['phosphate', 'propane', 'phosphate', 'propane', 'phosphate', 'propane', 'phosphate', 'propane', 'phosphate', 'propane']


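Multiplication can be useful to initialize a list, for example to a given number of zeros. Be careful when the repeated element is itself a list: `[[]]*7` does not create 7 independent empty lists, but 7 references to the same one, so appending to one of them changes all of them.

In [15]:
important_list = [0]*10
print(important_list)
even_more_important = [[]]*7 # Careful: the same empty list repeated 7 times
print(even_more_important)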

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[[], [], [], [], [], [], []]
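
One caveat with the list-of-lists case: multiplication repeats references to the same inner list rather than creating independent copies. The sketch below (with illustrative names `shared` and `independent`) shows the difference with a loop-based construction:

```python
# Multiplication: all 3 entries refer to the same inner list
shared = [[]] * 3
shared[0].append("H2O")
print(shared)  # [['H2O'], ['H2O'], ['H2O']]

# Loop and append: each entry is a new, independent list
independent = []
for _ in range(3):
    independent.append([])
independent[0].append("H2O")
print(independent)  # [['H2O'], [], []]
```

This issue does not arise with `[0]*10`, because numbers cannot be modified in place.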


## Slicing

While it is possible to access list items one by one, there are also ways to extract "slices" of the list. The general structure is similar to range: we provide the starting point, the end point (excluded), and the "index jump", but they are separated by ":" instead of commas:

In [16]:
# Reminder: acids_list is now ['Oxalic', 'Citric', 'Formic', 'Lactic', 'Benzoic', 'Acetic', 'Carbonic']
print(acids_list[1:3:1]) # From 1 (included) to 3 (excluded), going one by one
print(acids_list[1:3]) # If the index jump is omitted, 1 is assumed
print(acids_list[4:1:-1]) # From 4 (included) down to 1 (excluded), going in reverse

['Citric', 'Formic']
['Citric', 'Formic']
['Benzoic', 'Lactic', 'Formic']
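
To make the analogy with range explicit, the sketch below builds the same result twice: once with a slice, and once by looping over `range` with the same three numbers. The variable names are illustrative:

```python
acids = ['Oxalic', 'Citric', 'Formic', 'Lactic', 'Benzoic', 'Acetic', 'Carbonic']
start, stop, step = 4, 1, -1

# Slice notation
by_slice = acids[start:stop:step]

# Equivalent loop over the indices produced by range
by_range = []
for i in range(start, stop, step):
    by_range.append(acids[i])

print(by_slice)
print(by_slice == by_range)  # True
```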


It is possible to leave the slicing indices empty, which by default will mean the first or last (depending on context):

In [17]:
print(acids_list[3:]) # From 3 to last
print(acids_list[:3]) # First 3
print(acids_list[:]) #Everything
print(acids_list[::-1]) #Everything reversed

['Lactic', 'Benzoic', 'Acetic', 'Carbonic']
['Oxalic', 'Citric', 'Formic']
['Oxalic', 'Citric', 'Formic', 'Lactic', 'Benzoic', 'Acetic', 'Carbonic']
['Carbonic', 'Acetic', 'Benzoic', 'Lactic', 'Formic', 'Citric', 'Oxalic']


More surprising perhaps, if you use negative numbers, this corresponds to counting from the end:

In [18]:
print(acids_list[-1]) # The last item
print(acids_list[-3:]) # The last 3 items
print(acids_list[2:-2]) # Leave 2 at beginning and end

Carbonic
['Benzoic', 'Acetic', 'Carbonic']
['Formic', 'Lactic', 'Benzoic']
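
A negative index `-k` can be read as `len(A) - k`. The following sketch checks that equivalence on the three examples above, with the list redefined so the block runs on its own:

```python
acids = ['Oxalic', 'Citric', 'Formic', 'Lactic', 'Benzoic', 'Acetic', 'Carbonic']
n = len(acids)

# Each negative index is compared with its explicit "count from the start" form
print(acids[-1] == acids[n - 1])        # True
print(acids[-3:] == acids[n - 3:])      # True
print(acids[2:-2] == acids[2:n - 2])    # True
```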


## Unpacking

If you want to assign elements of a list back to individual variables, you can use a shortcut called "unpacking":

In [19]:
mylist = [5.5, 7.8]
a, b = mylist
print("a =", a)
print("b =", b)

a = 5.5
b = 7.8


This packing/unpacking is used implicitly when a function returns several values at once, and can also be used to swap the values of 2 variables (as we did in the sorting exercise):

In [20]:
a, b = b, a
print("a =", a)
print("b =", b)

a = 7.8
b = 5.5
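
As a sketch of the function case, the hypothetical helper below (`lowest_and_highest`, not part of this notebook) returns two values separated by a comma, which the caller unpacks into two variables:

```python
def lowest_and_highest(pka_values):
    """Return the lowest and highest value of a non-empty list (hypothetical helper)."""
    lowest = pka_values[0]
    highest = pka_values[0]
    for pka in pka_values:
        if pka < lowest:
            lowest = pka
        if pka > highest:
            highest = pka
    return lowest, highest  # the two values are packed together

# The returned values are unpacked into two variables
strongest, weakest = lowest_and_highest([3.86, 1.23, 4.75, 6.37])
print("lowest pKa =", strongest)
print("highest pKa =", weakest)
```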


# Tuples

Tuples are in many ways similar to lists, but the main difference is that they are immutable, meaning they cannot be modified after construction. They are created using () instead of []. This is useful for collections that should not be modified (for example the list of atom symbols).

In [21]:
Atoms = ('0', 'H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne', 'Na', 'Mg', #Added 0 to make sure Atoms[1] = 'H'
        'Al', 'Si', 'P', 'S', 'Cl', 'Ar', 'K', 'Ca', 'Sc', 'Ti', 'V', 'Cr',
        'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn', 'Ga', 'Ge', 'As', 'Se', 'Br', 'Kr',
        'Rb', 'Sr', 'Y', 'Zr', 'Nb', 'Mo', 'Tc', 'Ru', 'Rh', 'Pd', 'Ag', 'Cd',
        'In', 'Sn', 'Sb', 'Te', 'I', 'Xe', 'Cs', 'Ba', 'La', 'Ce', 'Pr', 'Nd',
        'Pm', 'Sm', 'Eu', 'Gd', 'Tb', 'Dy', 'Ho', 'Er', 'Tm', 'Yb', 'Lu', 'Hf',
        'Ta', 'W', 'Re', 'Os', 'Ir', 'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'Po',
        'At', 'Rn')
print(Atoms[6])
print(Atoms[82])

C
Pb


If you try to change one of its elements, you will see it fails.
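
A minimal sketch of that failure, using a short hypothetical tuple rather than the full `Atoms` one. The `try`/`except` construct is only there so that the error is printed instead of stopping the program:

```python
atoms = ('H', 'He', 'Li')

try:
    atoms[0] = 'D'  # item assignment is not allowed on a tuple
except TypeError as error:
    print("Modification failed:", error)

print(atoms)  # the tuple is unchanged
```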

Since tuples are only a small variation on lists, we will not go further than mentioning their existence.

# Sets

Sets are another variation on the list, but with somewhat more practical use. The main differences with a list are that sets cannot contain duplicates, and that their order is undefined (it may differ from the order you typed, because of the way sets are implemented internally).

They are created using {} instead of [].

In [22]:
set_of_molecule = {"C2H4", "benzene","oxygen"}
print(set_of_molecule) # Note that the order may be different from your input
empty_set = set() # using {} will create an empty dictionary instead, see later sections

{'benzene', 'C2H4', 'oxygen'}


Sets again behave similarly to lists. However, being unordered, they do not let you access elements using `A[i]` (but you can loop over all elements). Similarly, instead of `append` (which means add at the end), there is only a method `add`.

In [23]:
set_of_molecule.add("sulfate")
print(set_of_molecule)

{'sulfate', 'benzene', 'C2H4', 'oxygen'}


But if you try to add an already existing molecule, nothing will change.

In [24]:
print(len(set_of_molecule))
set_of_molecule.add("benzene")
print(len(set_of_molecule))

4
4


As with lists, you can combine sets, but this is done using the "|" symbol (or equivalently the `union` method). Unlike lists, sets also support subtraction!

In [25]:
print(set_of_molecule | {"HCl", "oxygen"})
print(set_of_molecule.union({"HCl", "oxygen"}))
print(set_of_molecule - {"C2H4","sulfate"})

{'benzene', 'C2H4', 'oxygen', 'HCl', 'sulfate'}
{'benzene', 'C2H4', 'oxygen', 'HCl', 'sulfate'}
{'benzene', 'oxygen'}
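
Alongside union and subtraction, sets also provide an intersection, written "&" (or the `intersection` method), which keeps only the elements present in both sets. A short sketch with two hypothetical sets:

```python
needed = {"HCl", "NaBH4", "CO2"}
in_lab = {"HCl", "NaBH4", "NaCl"}

# Elements present in both sets
common = needed & in_lab
print(common == needed.intersection(in_lab))  # True
print(len(common))  # 2
```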


You can transform a list to a set and vice versa:

In [26]:
set_from_list = set(["this", "was", "a", "list"])
print(set_from_list) # as pronounced by Yoda
list_from_set = list(set_from_list)
print(list_from_set) # the order is now frozen to whatever the set had

{'list', 'this', 'a', 'was'}
['list', 'this', 'a', 'was']
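
Because sets cannot hold duplicates, converting a list to a set is a quick way to keep only its distinct elements. A minimal sketch with a hypothetical list of repeated entries:

```python
observed = ["benzene", "oxygen", "benzene", "C2H4", "oxygen"]

# Duplicates disappear when the list is converted to a set
unique = set(observed)
print(len(observed), len(unique))  # 5 3
```

Keep in mind that the original order of the list is lost in the process.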


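Sets are also the most efficient of these collections for checking whether an item is present, which can matter for large-scale applications: a membership test on a set takes roughly the same time whatever its size, whereas a list has to be scanned element by element.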

In [27]:
for molecule in ("benzene", "sulfate"):
    if molecule in set_of_molecule:
        print(molecule, "was here")
    else:
        print(molecule, "was not here")

benzene was here
sulfate was here
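
To get a feeling for the difference in speed, the sketch below times the same membership test on a list and on a set using the standard `timeit` module. The sizes are arbitrary and the timings depend on your machine, so no expected values are shown:

```python
import timeit

n = 50_000
as_list = list(range(n))
as_set = set(as_list)
target = n - 1  # last element: the list has to be scanned to its end

# Run each membership test 100 times and measure the total time in seconds
time_list = timeit.timeit(lambda: target in as_list, number=100)
time_set = timeit.timeit(lambda: target in as_set, number=100)

print("list:", time_list, "s")
print("set: ", time_set, "s")
```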


**Exercise** Your lab maintains a list of the chemicals currently available. Your new project needs a set of molecules to be able to perform the synthesis of ibuprofen.

Using only basic set arithmetic, create the set of molecules that will need to be bought (i.e. that are not already in the lab) and then create the total set of molecules available once these molecules are bought.

In [28]:
solvents_in_lab = ["acetone", "toluene", "methanol", "ethanol", "water"]
molecules_in_lab = ["HCl", "sulfuric acid", "magnesium", "LiAlH4", "NaBH4", "AlCl3", "NaCl"]
ibuprofen_chemicals = ["isobutylbenzene" , "acetic anhydride", "AlCl3", "NaBH4", "methanol", "HCl", "magnesium", "CO2"]

### BEGIN SOLUTION
to_buy = set(ibuprofen_chemicals) - set(solvents_in_lab) - set(molecules_in_lab)
total_available = set(ibuprofen_chemicals + solvents_in_lab + molecules_in_lab)
### END SOLUTION
assert to_buy == {'isobutylbenzene', 'acetic anhydride', 'CO2'}
assert total_available == {'acetone', 'methanol', 'AlCl3', 'toluene', 'water', 'NaCl', 'NaBH4', 'magnesium', 'CO2', 'acetic anhydride', 'HCl', 'LiAlH4', 'sulfuric acid', 'isobutylbenzene', 'ethanol'}

# Strings

We have been manipulating strings for a little while now, but have not looked into their properties. Actually a string is also a collection, a collection of individual characters. This means they can be manipulated in very much the same way as lists.

In [29]:
molecule = "isobutylbenzene"
print(len(molecule))
print(molecule[0])
print(molecule[-7:])
for character in molecule:
    print(character)
print(molecule + " is a molecule") # notice the + (and the following space) instead of the usual comma 

15
i
benzene
i
s
o
b
u
t
y
l
b
e
n
z
e
n
e
isobutylbenzene is a molecule
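
One difference with lists is worth noting: strings, like tuples, are immutable, so a character cannot be replaced in place. The sketch below shows the failure and one possible workaround, which builds a new string from slices (the name `capitalized` is illustrative):

```python
molecule = "isobutylbenzene"

try:
    molecule[0] = "I"  # item assignment is not allowed on a string
except TypeError as error:
    print("Strings cannot be modified in place:", error)

# Build a new string instead of modifying the old one
capitalized = molecule[0].upper() + molecule[1:]
print(capitalized)  # Isobutylbenzene
```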


Strings also have a lot of handy methods to help manipulate them. The list below is not exhaustive.

In [30]:
# Query
print("C is a letter:","C".isalpha())
print("C is a number:","C".isnumeric())
print("C is upper case:","C".isupper())
print("C is lower case:","C".islower())
print()

# Transform
string = "Butadiene"
print("Butadiene in lower case is:", string.lower())
print("Butadiene in upper case is:", string.upper())
string = "C6H12O6"
print("There are",int(string[3:5]),"hydrogens.")
print()

# Look for specific symbols
reaction = " A + B = C "
print("Splitting a string:", reaction.split("=")) # Split into a list around the chosen symbol
print("+ is in the reaction: ","+" in reaction, "in position", reaction.index("+"))
print("Now this is chemistry", reaction.replace("=","->"))
print()

C is a letter: True
C is a number: False
C is upper case: True
C is lower case: False

Butadiene in lower case is: butadiene
Butadiene in upper case is: BUTADIENE
There are 12 hydrogens.

Splitting a string: [' A + B ', ' C ']
+ is in the reaction:  True in position 3
Now this is chemistry  A + B -> C 
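
These methods combine well. As a sketch, the block below splits the same reaction string into lists of reactants and products, using `split` twice and `strip` (which removes the spaces around each piece). The variable names are only illustrative:

```python
reaction = " A + B = C "

# Separate the two sides of the reaction, then unpack them
left, right = reaction.split("=")

# Split each side around "+" and remove the surrounding spaces
reactants = []
for species in left.split("+"):
    reactants.append(species.strip())

products = []
for species in right.split("+"):
    products.append(species.strip())

print(reactants)  # ['A', 'B']
print(products)   # ['C']
```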



To get a more exhaustive list, do not forget you can ask Python for help (the output below is truncated):

In [31]:
help(str)

Help on class str in module builtins:

class str(object)
 |  str(object='') -> str
 |  str(bytes_or_buffer[, encoding[, errors]]) -> str
 |  
 |  Create a new string object from the given object. If encoding or
 |  errors is specified, then the object must expose a data buffer
 |  that will be decoded using the given encoding and error handler.
 |  Otherwise, returns the result of object.__str__() (if defined)
 |  or repr(object).
 |  encoding defaults to sys.getdefaultencoding().
 |  errors defaults to 'strict'.
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __format__(self, format_spec, /)
 |      Return a formatted version of the string as described by format_spec.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  

**Exercise** Create a function that returns the set of all atoms present in a string. Since atom symbols have either 1 or 2 letters (we will ignore those with 3), when you find an upper case letter, check whether the next character is lower case. If so, the two characters form a single atom; otherwise the upper case letter is an atom on its own.

In [32]:
### BEGIN SOLUTION
def unique_atom_list(string):
    total_set = set()
    n_chars = len(string)
    for i, atom in enumerate(string):
        if atom.isupper():
            if i+1 < n_chars and string[i+1].islower():
                atom += string[i+1]
            total_set.add(atom)
    return total_set
### END SOLUTION

In [33]:
assert unique_atom_list("NaCl") == {"Na", "Cl"}
assert unique_atom_list("H2O") == {"H", "O"}
assert unique_atom_list("C6H12O6") == {"C", "H", "O"}
assert unique_atom_list("Fe(CN)6") == {"Fe", "C", "N"}
assert unique_atom_list("6CO2 + 6H2O → C6H12O6 + 6O2") == {"C", "O", "H"}

# Dictionaries

The last type of collection is the dictionary. Unlike the previous ones, dictionaries are associative: each entry has a "word" and its "definition". In Python language, we refer to the "word" as the "key" and the "definition" as the "value". It can be useful whenever you want to associate 2 variables. We did it implicitly when we sorted the pKa values by keeping 2 separate lists and making sure matching entries kept the same index, but it can be more convenient to associate both variables directly in one structure.

Dictionaries are defined using {}, like sets, and like sets they do not accept duplicates, i.e. 2 identical keys. Unlike sets, they remember the order in which entries were inserted (this is guaranteed since Python 3.7). The key and value are associated with ":".

In [34]:
pka_dict = {"Lactic acid":3.86, "Oxalic acid": 1.23, "Acetic acid": 4.75, "Carbonic acid":6.37, "Benzoic acid":4.19}
print(pka_dict)
empty_dict = {}

{'Lactic acid': 3.86, 'Oxalic acid': 1.23, 'Acetic acid': 4.75, 'Carbonic acid': 6.37, 'Benzoic acid': 4.19}


The elements are then accessed by using the key in place of the usual index. In other words `value = dict[key]`:

In [35]:
print("Lactic acid has a pKa of",pka_dict["Lactic acid"])

Lactic acid has a pKa of 3.86
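
Asking for a key that is not in the dictionary raises a `KeyError`. As a sketch of two common ways to avoid that, you can test the key with `in` first, or use the `get` method, which returns a default value instead of failing. The dictionary is redefined in shortened form so the block runs on its own:

```python
pka_dict = {"Lactic acid": 3.86, "Oxalic acid": 1.23}

# Test whether a key exists before using it
print("Formic acid" in pka_dict)  # False

# get returns None (or the given default) when the key is missing
print(pka_dict.get("Formic acid"))             # None
print(pka_dict.get("Formic acid", "unknown"))  # unknown
print(pka_dict.get("Lactic acid", "unknown"))  # 3.86
```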


Looping directly over a dictionary goes through its keys. To get the key and the value together, loop over the key-value pairs returned by `items()`.

In [36]:
for acid in pka_dict:
    print(acid)

for acid, pka in pka_dict.items():
    print(acid,"has a pKa of", pka)

Lactic acid
Oxalic acid
Acetic acid
Carbonic acid
Benzoic acid
Lactic acid has a pKa of 3.86
Oxalic acid has a pKa of 1.23
Acetic acid has a pKa of 4.75
Carbonic acid has a pKa of 6.37
Benzoic acid has a pKa of 4.19


You can also access the keys and the values separately, and convert each to a list:

In [37]:
print(list(pka_dict.keys()))
print(list(pka_dict.values()))

['Lactic acid', 'Oxalic acid', 'Acetic acid', 'Carbonic acid', 'Benzoic acid']
[3.86, 1.23, 4.75, 6.37, 4.19]
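
The reverse operation is also possible. As a sketch, two lists whose entries match by index (like the ones from the sorting exercise) can be combined into a dictionary with the built-in `zip` function. The lists below are shortened and redefined so the block runs on its own:

```python
acids = ["Lactic", "Oxalic", "Acetic"]
pkas = [3.86, 1.23, 4.75]

# zip pairs the i-th acid with the i-th pKa; dict turns the pairs into entries
pka_by_acid = dict(zip(acids, pkas))
print(pka_by_acid["Oxalic"])  # 1.23
```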


To add a new entry, it is enough to assign a value to a new key:

In [38]:
print(len(pka_dict))
pka_dict["Citric acid"] = 3.08
print(len(pka_dict))

5
6
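
Assigning to a key that already exists does not add an entry: it replaces the stored value. A minimal sketch, using a hypothetical dictionary and made-up values purely for illustration:

```python
# Hypothetical entries; the numbers are placeholders, not measured values
example_dict = {"acid X": 1.0, "acid Y": 2.0}

example_dict["acid X"] = 1.5   # existing key: the value is replaced
print(len(example_dict))       # 2
print(example_dict["acid X"])  # 1.5
```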


**Exercise** Using a translation dictionary, create a function that converts a DNA sequence into its complementary sequence (pairing each base with its partner) and returns the resulting string. Reminder: the base pairs are A-T and C-G, so for example the sequence ATCG would be converted into the sequence TAGC.

In [39]:
### BEGIN SOLUTION
def base_pair(DNA_string):
    pair_dict = {'A':'T', 'T':'A', 'C':'G', 'G':'C'}
    result = ""
    for c in DNA_string:
        result += pair_dict[c]
    return result
### END SOLUTION

In [40]:
assert base_pair('ATGGCGATTA') == 'TACCGCTAAT'

Write a similar function that now translates the DNA sequence into an amino acid sequence using the translation table given below. Note that 3 nucleobases code for a single amino acid. The print statement should tell you if you have succeeded...

In [41]:
def DNA_translation(DNA_string):
    translation_table = {
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',                
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
        'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',
    }
    ### BEGIN SOLUTION
    peptide = ""
    for i in range(len(DNA_string)//3): # integer division: number of complete codons
        codon = DNA_string[3*i:3*i+3]
        peptide += translation_table[codon]
    return peptide
    ### END SOLUTION

In [42]:
print(DNA_translation('TGGGAGCTTCTATAACCTCTTGCGTATGAAGAC'))

WELL_PLAYED
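
The key step in this exercise is cutting the string into groups of 3 letters. As a sketch of an alternative to multiplying the index by 3, `range` can step through the string 3 characters at a time. The short sequence below is hypothetical and deliberately not a multiple of 3, to show that leftover bases are ignored:

```python
dna = "TGGGAGCT"  # 8 letters: 2 complete codons and 2 leftover bases

# Start positions 0, 3, ... chosen so that a full codon always fits
codons = []
for start in range(0, len(dna) - 2, 3):
    codons.append(dna[start:start + 3])

leftover = len(dna) % 3  # number of bases that do not form a complete codon

print(codons)    # ['TGG', 'GAG']
print(leftover)  # 2
```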
