# Data types, lists and dictionaries in `python`
Written by Jason A. Hendry

---
# I. Data Types

- In order to understand data structures, it's helpful first to review data *types*
- We can think of any data we have as being composed of a collection of individual units or pieces
- Each unit has a "type"

The main types are the following:

1. Booleans.
    - Abbrv: `bool`
    - What: True or False
    - Examples: True, False
    

In [None]:
interesting = True
interesting

In [None]:
type(interesting)

2. Integers.
    - Abbrv: `int`
    - What: Whole numbers, which may be positive, negative, or zero.
    - Examples: 0, 1, -5, 123898248394923

In [None]:
genomes = 300
genomes

In [None]:
genomes - 2

In [None]:
type(genomes)

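3. Characters and strings.
    - Abbrv: `str`
    - What: Text, i.e. a sequence of characters: letters, digits, punctuation (any Unicode character, not just the ASCII table). `python` has no separate `char` type; a single character is a `str` of length one.
    - Examples: "A", "b", "?", "sars-cov-2"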

In [None]:
sample_one = "sars-cov-2"
sample_one

In [None]:
type(sample_one)

In [None]:
sample_two = "67"
sample_two

In [None]:
sample_three = 68

In [None]:
type(sample_two)
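The difference between `"67"` (a `str`) and `68` (an `int`) matters as soon as we operate on them. A minimal sketch, using illustrative variable names that do not appear elsewhere in this notebook:

```python
# With strings, + joins text; with integers, + does arithmetic
as_text = "67" + "1"
as_number = int("67") + 1  # int() converts a numeric string to an integer

print(as_text)    # 671
print(as_number)  # 68
```

Converting with `int()` or `str()` is the usual way to move between the two.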

4. Floating point numbers.
    - Abbrv: `float`
    - What: Real numbers, i.e. numbers on the continuous number line, written with a decimal point
    - Examples: 5.5, 1.23402349, -2.829
    - NB: Floats are stored with finite precision, so most real numbers are only approximated.

In [None]:
mean_coverage = 87.52
mean_coverage

In [None]:
type(mean_coverage)
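The approximate encoding of floats can be seen directly. The sketch below is an illustration only; `math.isclose` from the standard library is one common way to compare floats safely:

```python
import math

total = 0.1 + 0.2

print(total == 0.3)              # False
print(math.isclose(total, 0.3))  # True
```

For this reason it is usually safer to compare floats with a tolerance than with `==`.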

---
# II. Data structures

- When we combine many units of data together, we are creating data structures. 
- In `python`, there are a few 'native' data structures
- The most commonly used are **lists** and **dictionaries**

## II.a. Lists

- A list is just an ordered collection of items
- Items are separated by commas: `"a", "b", "c"`
- The whole thing is enclosed in square brackets: `["a", "b", "c"]`

### Creation

In [None]:
empty_list = []
empty_list

In [None]:
type(empty_list)

In [None]:
another_empty_list = list()
another_empty_list

In [None]:
list_with_stuff = [0, 1, "hello", "car", 23.42, True]
list_with_stuff

An important feature of lists is that they can hold *a mix of data types*.

In [None]:
list_with_a_list = ["frog", list_with_stuff]
list_with_a_list

They can also contain *other lists*.

In [None]:
len(empty_list)

`len()` is a very important function in python. It tells you the length of objects.
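As a small illustration (the variable `nested` is made up for this example), `len()` counts only the top-level items of a list, and it also works on strings:

```python
nested = ["frog", [0, 1, 2]]

print(len(nested))        # 2: the inner list counts as one item
print(len(nested[1]))     # 3: the length of the inner list itself
print(len("sars-cov-2"))  # 10: strings have a length too
```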

### Addition

In [None]:
list_with_stuff.append([3, 4, 5])

In [None]:
list_with_stuff

In [None]:
len(list_with_stuff)

In [None]:
list_with_stuff

In [None]:
list_with_stuff + 3000  # nope: raises TypeError, a list can only be added to another list

In [None]:
list4 = list_with_stuff + [3000, 4000, 5000]  # now we are adding two lists

In [None]:
list4

In [None]:
len(list_with_stuff)

We are creating a new, combined list, but we need to assign it to another variable name to save it.
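The contrast between `+` and the in-place methods can be sketched as follows. The names `base` and `combined` are illustrative; `.extend()` is a standard list method not otherwise used in this notebook:

```python
base = [1, 2]

combined = base + [3, 4]  # builds a new list; base is untouched
print(base)               # [1, 2]
print(combined)           # [1, 2, 3, 4]

base.append([3, 4])       # in place: adds the whole list as ONE item
base.extend([5, 6])       # in place: adds each item separately
print(base)               # [1, 2, [3, 4], 5, 6]
```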

### Subtraction

In [None]:
list_with_stuff

In [None]:
list_with_stuff.pop(1)

In [None]:
list_with_stuff

In [None]:
list_with_stuff.remove("car")  # removes the first item equal to "car"; a missing value raises ValueError

In [None]:
list_with_stuff

In [None]:
list_with_stuff - [2000]  # nope: raises TypeError, lists do not support subtraction
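To summarise the removal options in one place, here is a minimal sketch with made-up data. `.pop()` removes by position and returns the item, `.remove()` removes the first matching value, and a list comprehension (covered later in this notebook) can drop every match:

```python
items = ["A", "T", "C", "T"]

removed = items.pop(0)  # remove by position; the removed item is returned
print(removed)          # A

items.remove("T")       # remove by value; only the FIRST "T" goes
print(items)            # ['C', 'T']

# To drop every "T", build a new list without them
without_t = [x for x in ["A", "T", "C", "T"] if x != "T"]
print(without_t)        # ['A', 'C']
```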

### Using `range()`

There is a trick to building lists of integers

In [None]:
list(range(10))

Important point(s):
- The number given to `range()` is the stop value, and it is *excluded*: `range(10)` ends at 9
- We start counting at zero
- If nothing else is given, the stop value is also the length of the list


In [None]:
len(list(range(10)))

In [None]:
list(range(20))

In [None]:
list(range(5, 10))

In [None]:
list(range(1, 20, 4))
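The three forms of `range()` used above are `range(stop)`, `range(start, stop)` and `range(start, stop, step)`. A short sketch with illustrative variable names, including a negative step for counting down:

```python
up_to_four = list(range(4))         # stop only
two_to_five = list(range(2, 6))     # start and stop
every_third = list(range(0, 10, 3)) # start, stop and step
countdown = list(range(5, 0, -1))   # a negative step counts down

print(up_to_four)   # [0, 1, 2, 3]
print(two_to_five)  # [2, 3, 4, 5]
print(every_third)  # [0, 3, 6, 9]
print(countdown)    # [5, 4, 3, 2, 1]
```

In every case the stop value itself is left out.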

### Indexes and slicing

- Items in a list are 'indexed'
- This means they have an *order*
- And it means you can access them by referring to their position
- NB: we start counting at zero

In [None]:
nts = ["A", "T", "C", "G"]

In [None]:
nts[0]

In [None]:
nts[1]

We can access more than one item, and this is called *slicing*.

In [None]:
numbers = list(range(20))
numbers

In [None]:
numbers[3:8]
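A few more slicing patterns, sketched on the same kind of list. The general form is `[start:stop:step]`; any part can be left out, and negative numbers count from the end:

```python
numbers = list(range(20))

print(numbers[-1])        # 19: negative indexes count from the end
print(numbers[:3])        # [0, 1, 2]: from the beginning up to (not including) index 3
print(numbers[-2:])       # [18, 19]: the last two items
print(numbers[::5])       # [0, 5, 10, 15]: every fifth item
print(numbers[::-1][:3])  # [19, 18, 17]: reversed, then the first three
```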


An important advantage of `python`: strings are **also indexed**!

In [None]:
import string as st

In [None]:
letters = st.ascii_lowercase
letters

In [None]:
letters[0]

In [None]:
letters[10:16]
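String slicing is handy for sequence data. As a hedged illustration (the short sequence below is made up), a DNA string can be cut into consecutive triplets by combining `range()` with a slice:

```python
seq = "ATGGCCTAA"

codons = []
for start in range(0, len(seq), 3):      # start positions 0, 3, 6
    codons.append(seq[start:start + 3])  # slice out three letters

print(codons)  # ['ATG', 'GCC', 'TAA']
```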

### Cool tricks: list *comprehensions*

In [None]:
for value in [1, "dan", 2, 3, 5, [1, 2, 3]]:
    if isinstance(value, str):
        print(value + "dan")
    elif isinstance(value, int):
        print(value + 200)
    elif isinstance(value, list):
        print(value[0])
    else:
        print(value)

In [None]:
l = []
for i in range(10):
    l.append(i * 3)
l

In [None]:
l_comprehension = [i * 3 for i in range(10)]
l_comprehension

In [None]:
for i in range(10):
    if i * 3 % 2 == 0:
        print(i * 3)

In [None]:
l_comprehension = [i * 3 for i in range(10) if i * 3 % 2 == 0]
l_comprehension

Let's build codons

In [None]:
nts = "ATCG"
# Loop over all four nucleotides at each position: 4 * 4 * 4 = 64 codons
codons = [nts[i] + nts[j] + nts[k]
          for i in range(len(nts))
          for j in range(len(nts))
          for k in range(len(nts))]
codons
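An alternative sketch, assuming we want every ordered triplet of the four nucleotides (4**3 = 64 in total), uses `itertools.product` from the standard library instead of three nested loops. The name `all_codons` is illustrative:

```python
from itertools import product

nts = "ATCG"

# product(nts, repeat=3) yields tuples such as ('A', 'A', 'T'); join them into strings
all_codons = ["".join(triplet) for triplet in product(nts, repeat=3)]

print(len(all_codons))  # 64
print(all_codons[:4])   # ['AAA', 'AAT', 'AAC', 'AAG']
```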

## Review:

- A list is an ordered (and indexed) collection of items
- Holds:
    - Items of any data type, other lists
- Creation:
    - `[]` or `list()`
- Addition:
    - `.append()` or `+ []`
- Removal:
    - `.pop()` or `.remove()`
- You can slice a list
- Advantages:
    - Grow and shrink easily
- Disadvantages:
    - Use more memory than fixed-type arrays (e.g. `numpy` arrays), because every item is stored as a separate object

## Example 1: the Moran model

The Moran model is one of the classical models of genetic drift. It is relatively easy to implement in python using lists.

In [None]:
import random

In [None]:
# Initialise
N = 10**2
init_freqA = 0.3

population = ["A"] * int(N * init_freqA)
population += ["B"] * int(N - len(population))

nsteps = 10**3

# Storage
freqA = []

In [None]:
for _ in range(nsteps):
    i = random.randrange(N)  # index of the individual that dies
    j = random.randrange(N)  # index of the individual that reproduces
    
    population.append(population[j])  # reproduce
    population.pop(i)  # die
    
    f = population.count("A") / N  # frequency of allele A
    freqA.append(f)

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
fig, ax = plt.subplots(1, 1)

ax.plot(freqA)
ax.set_ylim((0, 1))
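For reuse, the simulation step can be wrapped in a function. The sketch below is a variant, not the code above: it overwrites the dying individual in place rather than appending and popping, and it uses a seeded `random.Random` generator so that runs are repeatable. The function name `moran_step` and the seed value are illustrative choices.

```python
import random


def moran_step(population, rng):
    """One Moran step: copy a random individual over another random individual."""
    n = len(population)
    parent = rng.randrange(n)  # individual that reproduces
    dead = rng.randrange(n)    # individual that dies
    population[dead] = population[parent]  # population size stays constant


rng = random.Random(42)  # seeded generator (assumption: any seed will do)
population = ["A"] * 30 + ["B"] * 70

freqA = []
for _ in range(1000):
    moran_step(population, rng)
    freqA.append(population.count("A") / len(population))

print(len(population))  # 100
```

The resulting `freqA` list can be plotted exactly as before.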

## Example 2: SIR model

In [None]:
# Initialisation
N = 10**3
init_I = 10
init_S = N - init_I
S = [1] * init_S
I = [1] * init_I
R = []

beta = 0.15
gamma = 0.05

# Storage
t = [0]
tS = [len(S)]
tI = [len(I)]
tR = [len(R)]

In [None]:
while len(I) > 0:
    
    # Calculate transition rates
    rate_infect = len(I) * beta * (len(S) / N)
    rate_clear = len(I) * gamma
    rate_total = rate_infect + rate_clear
    
    # Move forward in time
    t_delta = random.expovariate(rate_total)
    t.append(t[-1] + t_delta)
    
    # Simulate an event
    u = random.random()
    if u <= (rate_infect / rate_total):
        I.append(S.pop())
    else:
        R.append(I.pop())
        
    # Calculate population state
    tS.append(len(S))
    tI.append(len(I))
    tR.append(len(R))

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(8, 5))

# Plot
ax.fill_between(x=t, y1=[0]*len(t), y2=tI, label="Infected", zorder=3, alpha=1.0)
ax.fill_between(x=t, y1=[0]*len(t), y2=tS, label="Susceptible", zorder=2, alpha=0.5)
ax.fill_between(x=t, y1=[0]*len(t), y2=tR, label="Recovered", zorder=1, alpha=0.5)

# Labels
ax.set_xlabel("Days")

# Limits
ax.set_ylim((0, N))
ax.set_xlim((0, t[-1]))

# Legend
ax.legend()
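The simulation above only ever uses the lengths of `S`, `I` and `R`, so the same event-by-event scheme can also be sketched with plain integer counters. This is a hedged variant, not the notebook's code: the function name `simulate_sir`, the seeded generator and the smaller population are illustrative, and the comparison uses `<` so that no infection can happen once nobody is susceptible.

```python
import random


def simulate_sir(n, init_i, beta, gamma, rng):
    """Event-driven SIR simulation that tracks counts instead of lists."""
    s, i, r = n - init_i, init_i, 0
    t = 0.0
    history = [(t, s, i, r)]
    while i > 0:
        # Transition rates, as in the list-based version
        rate_infect = i * beta * (s / n)
        rate_clear = i * gamma
        rate_total = rate_infect + rate_clear

        # Waiting time until the next event
        t += rng.expovariate(rate_total)

        # Choose which event happens
        if rng.random() < rate_infect / rate_total:
            s, i = s - 1, i + 1  # infection
        else:
            i, r = i - 1, r + 1  # recovery
        history.append((t, s, i, r))
    return history


history = simulate_sir(n=200, init_i=10, beta=0.15, gamma=0.05, rng=random.Random(1))
final_t, final_s, final_i, final_r = history[-1]
print(final_i)  # 0: the loop stops when nobody is infected
```

Each entry of `history` is a `(time, S, I, R)` tuple, so the columns can be pulled out with a list comprehension for plotting.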

---

## II.b. Dictionaries

Another very common data structure in python is the dictionary. You can think of it as analogous to a physical dictionary: there are words (called keys in python) and definitions (called values).

- A dictionary is a collection of 'key-value' pairs
- The keys and values are separated by colons `"name": "Jason"`
- The different entries are separated by commas `"name": "Jason", "age": 30`
- The entire thing is enclosed in curly braces `{"name": "Jason", "age": 30}`

### Creation

In [None]:
empty_dictionary = {}
empty_dictionary

In [None]:
type(empty_dictionary)

In [None]:
another_empty_dictionary = dict()
another_empty_dictionary

In [None]:
dt_with_stuff = {"genus": "Plasmodium", 
                 "species": "falciparum", 
                 "chromosomes": 14}
dt_with_stuff

*Values* can be of any type; *keys* can be of any hashable type (e.g. strings, numbers or tuples, but not lists)

In [None]:
dt_with_more_stuff = {"genus": ["Plasmodium", "Plasmodium"],
                      "species": ["falciparum", "vivax"],
                      "chromosomes": [14, 14]}
dt_with_more_stuff
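One caveat worth knowing: values can be anything, but keys need to be hashable, which in practice means immutable objects such as strings, numbers and tuples. A minimal sketch with made-up data; the names `coverage` and `list_key_failed` are illustrative:

```python
coverage = {}
coverage["sample1"] = 87.5            # string key
coverage[3] = "an integer key"        # integer key
coverage[("sample1", "chr1")] = 91.2  # tuple key

list_key_failed = False
try:
    coverage[["sample1", "chr1"]] = 90.1  # a list cannot be used as a key
except TypeError:
    list_key_failed = True

print(list_key_failed)  # True
print(len(coverage))    # 3
```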

### Addition
Items are added by assigning values to a dictionary key

In [None]:
dt = {}
dt["name"] = "Jason"
dt

In [None]:
dt["age"] = 30
dt

We can also "update" dictionaries with other dictionaries

In [None]:
old_dt = {"name": "Jason", "age": 23, "location": "Toronto"}
old_dt

In [None]:
new_dt = {"name": "Jason", "age": 30, "location": "Oxford", "code_in_python": True}

In [None]:
old_dt.update(new_dt)

In [None]:
old_dt
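The behaviour of `.update()` can be summarised with a small sketch (the dictionaries below are made up for illustration): keys that already exist are overwritten, and new keys are added.

```python
defaults = {"threads": 1, "verbose": False}
overrides = {"threads": 8, "output": "results/"}

defaults.update(overrides)  # modifies defaults in place

print(defaults)  # {'threads': 8, 'verbose': False, 'output': 'results/'}
```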

### Deletion

In [None]:
old_dt

In [None]:
old_dt.pop("age")  # NB: pop takes a key, not an index

In [None]:
old_dt

In [None]:
del old_dt['code_in_python']

In [None]:
old_dt

### Getting items from a dictionary

- Dictionaries keep their items in insertion order (since `python` 3.7), but items are looked up *by key*, not by position.
- They are *not* indexed by position, and so slicing is *not* possible

In [None]:
dt = {'samples': 10, 'run': 3, 'mean_coverage': 74.9}

In [None]:
dt[0]  # nope: raises KeyError, because 0 is not a key

In [None]:
dt['samples']

In [None]:
dt.keys()

In [None]:
dt.values()
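A few other standard ways of getting at dictionary contents, sketched with an illustrative dictionary: `in` tests whether a key exists, `.get()` returns a fallback instead of raising `KeyError`, and `.items()` gives key-value pairs for looping.

```python
run_info = {"samples": 10, "run": 3, "mean_coverage": 74.9}

print("samples" in run_info)               # True
print("barcode" in run_info)               # False
print(run_info.get("barcode", "unknown"))  # unknown

# Loop over keys and values together
for key, value in run_info.items():
    print(key, value)

pairs = list(run_info.items())
```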

## Example 1: Reverse complementing in `python`

In [None]:
seq = "TTTCTGTTGGTGCTGATATTGCGGAAAACTAACAATAAAGGA"

In [None]:
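# Maps each nucleotide to its complement
complement_map = {
    "A": "T",
    "T": "A",
    "C": "G",
    "G": "C"
}

In [None]:
# Approach 1: complement each base, then reverse at the end
rc_seq = []
for s in seq:
    rc_seq.append(complement_map[s])

In [None]:
"".join(rc_seq)[::-1]

In [None]:
# Approach 2: prepend each complemented base, reversing as we go
rc_seq = ""
for s in seq:
    rc_seq = complement_map[s] + rc_seq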

In [None]:
rc_seq
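The same idea can be written more compactly. The sketch below shows two alternatives on a short made-up sequence: a generator expression over the reversed string, and the built-in `str.translate` with a table from `str.maketrans`. The names `complement`, `rc_join` and `rc_translate` are illustrative.

```python
short_seq = "ATGC"
complement = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Option 1: walk the sequence backwards, complementing each base
rc_join = "".join(complement[base] for base in reversed(short_seq))

# Option 2: translate every base in one call, then reverse the string
table = str.maketrans("ATCG", "TAGC")
rc_translate = short_seq.translate(table)[::-1]

print(rc_join)       # GCAT
print(rc_translate)  # GCAT
```

Both options assume the sequence contains only the four upper-case letters A, T, C and G.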

## Example 2: Loading a `.fastq` file

In [None]:
fastq_fn = "data/example.fastq"

In [None]:
dt = {}
with open(fastq_fn, "r") as fastq:
    for line in fastq:
        # Extract read identifier
        name = line.rstrip()
        
        # Extract sequence
        seq = fastq.readline().rstrip()
        
        # Skip "+" and quality scores
        fastq.readline()
        fastq.readline()
        
        # Store in dictionary
        dt[name] = seq

In [None]:
list(dt.keys())[:5]

In [None]:
list(dt.values())[:5]
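To try the parsing logic without a file on disk, the sketch below reads a tiny made-up FASTQ record set from memory using `io.StringIO`. It assumes every record spans exactly four lines (identifier, sequence, `+`, qualities), and it strips the leading `@` from the identifier, which is a choice made for this example.

```python
import io

# Two made-up reads in four-line FASTQ layout
fastq_text = (
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nTTGA\n+\nIIII\n"
)

reads = {}
with io.StringIO(fastq_text) as handle:
    while True:
        name = handle.readline().rstrip()
        if not name:  # an empty string means end of input
            break
        seq = handle.readline().rstrip()
        handle.readline()  # skip the "+" line
        handle.readline()  # skip the quality scores
        reads[name[1:]] = seq  # drop the leading "@"

print(reads)  # {'read1': 'ACGT', 'read2': 'TTGA'}
```

Swapping `io.StringIO(fastq_text)` for `open(fastq_fn, "r")` would apply the same loop to a real file.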