# Introduction to Python Programming for Bioinformatics

## About this notebook

This notebook was originally written by [Marc Cohen](https://github.com/mco-gh), an engineer at Google. The original source can be found on [Marc's short link service](https://mco.fyi/), starting with [Python lesson 0](https://mco.fyi/py0). I encourage you to work through that notebook if you find some details missing here.

Rob Edwards edited the notebook, adapted it for bioinformatics using some simple genetics examples, condensed it into a single notebook, and rearranged some of the lessons, so if some of it does not make sense, it is Rob's fault!

It is intended as a hands-on companion to an in-person course, and if you would like Rob to teach this course (or one of the other courses) don't hesitate to get in touch with him.

## Using this notebook

You can download the original version of this notebook from [GitHub](https://linsalrob.github.io/ComputationalGenomicsManual/Python/python_bioinformatics_intro.ipynb) and from [Rob's Google Drive]()

**You should make your own copy of this notebook by selecting File->Save a copy in Drive from the menu bar above, and then you can edit the code and run it as your own**

There are several lessons, and you can do them in any order. I've tried to organise them in the order I think most appropriate, but you may disagree!


In [1]:
print("hello")

hello


# Lesson Links

* [Lesson 1 - Variables and Types](#Lesson-1---Variables-and-Types)
  * [Variables](#Variables)
  * [Naming variables](#Naming-variables)
  * [Types of data](#Types-of-data)
  * [Numeric Types](#Numeric-Types)
  * [String Types](#String-Types)
  * [Using Variables in Python](#Using-Variables-in-Python)
  * [Built in Python functions](#Built-in-Python-functions)
* [Lesson 2 - Expressions](#Lesson-2---Expressions)
  * [Constants vs. Variables](#Constants-vs.-Variables)
  * [Data Types](#Data-Types)
  * [The Boolean (bool) Type](#The-Boolean-(bool)-Type)
  * [The None Type](#The-None-Type)
  * [Comparison Operators](#Comparison-Opererators)
  * [Boolean Operators - and, or, and not](#Boolean-Operators---and,-or,-and-not)
  * [Order of Evaluation](#Order-of-Evaluation)
  * [Python Precedence Rules](#Python-Precedence-Rules)
  * [F strings](#F-strings)
* [Lesson 3 Lists](#Lesson-3-Lists)
  * [Lists](#Lists)
  * [Creating Lists](#Creating-Lists)
  * [Sets](#Sets)
  * [List Operations](#List-Operations)
* [Lesson 4 - Dictionaries](#Lession-4---Dictionaries)
  * [Dictionary Operations](#Dictionary-Operations)
  * [Rule of thumb for truth value of lists, and dictionaries](#Rule-of-thumb-for-truth-value-of-lists,-and-dictionaries)
* [Lesson 5 - Conditionals](#Lesson-5---Conditionals)
  * [Controlling Program Flow](#Controlling-Program-Flow)
  * [if Statements](#if-Statements)
  * [if Block Structure](#if-Block-Structure)
  * [Python's use of whitespace](#Python's-use-of-whitespace)
  * [else Statements](#else-Statements)
  * [elif Statements](#elif-Statements)
  * [For loops](#For-loops)
* [Lesson 6 - Functions](#Lesson-6---Functions)
  * [Defining Functions](#Defining-Functions)
  * [Docstrings](#Docstrings)
* [Lesson 7 - Modules](#Lesson-7---Modules)
  * [The from Statement](#The-from-Statement)
  * [When to use import vs. from](#When-to-use-import-vs.-from)

In [2]:
print("hello")

hello


# Lesson 1 - Variables and Types

Things you'll learn in this lesson:
- The basic data types you can work with in Python
- How to create and assign values to variables in Python
- How to call a function

## Variables

Any Python interpreter can be used as a calculator. To run this cell, either click on the triangle, or put your cursor in the cell and press Shift and Enter at the same time.


In [None]:
print(3 + 5 * 4)


This is great but not very interesting.
To do anything useful with data, we need to assign its value to a _variable_.
In Python, we can assign a value to a variable using the equals sign `=`.

If a variable doesn’t already exist, when you assign to it, Python creates it on the fly. If you assign to a variable that already exists, Python replaces its current value with a new value.

Examples

    instructor = "Rob"          # string value
    instructor = "Stevie"       # same name, different string value
    instructor = 42             # same name, integer value
    todays_high_temp = 18.2     # different name, floating point value

We can track the length of a bacterial genome by assigning its length in basepairs to a variable. For example, if the length is 4,500,000 bp, we 
could assign that to a variable called `genome_size`:

In [None]:
genome_size = 4500000
print(genome_size)

From now on, whenever we use `genome_size`, Python will substitute the value we assigned to it. In simpler terms, **a variable is a reference to a value**.

In Python, variable names:

 - can include letters, digits, and underscores
 - cannot start with a digit
 - are case-sensitive

This means that, for example:
 - `genome0` is a valid variable name, whereas `0genome` is not
 - `genome` and `Genome` are different variables, and this will sometimes trip you up. It is usual practice to use lower case letters for variable names, and if you want to use two words, like `genome size`, to join them with an underscore (i.e. `genome_size`). Sometimes people will use capitals for words, like `GenomeSize`; that still works, but it goes against the usual Python convention for variable names.

## Naming variables

It is a good idea (and good practice!) to name variables something meaningful. Remember that when you come back to your code in 6 months or a year, it is going to look like gibberish, so try to reduce the amount of gibberish as much as possible!

For example, while you are writing the code it might be obvious to use `r` to mean `RNA sequence coverage averaged by gene length`; however, in 6 months you'll be wondering if this was really averaged, or just the raw counts. Using a variable like `rnaseq_averaged` is more meaningful. You should avoid short names (one or two letters), but you might also want to avoid using `rna_sequence_coverage_averaged_by_gene_length`: even though it is valid, it will be a pain to type every time (and make your code look ugly!).


### Reserved Words

The following words have special meaning in Python. We call them keywords or reserved words and you may not use these names for your program variables.

> ```and, as, assert, async, await, break, class, continue, def, del, elif, else, except, False, finally, for, from, global, if, import, in, is, lambda, nonlocal, None, not, or, pass, raise, return, True, try, while, with, yield```
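
If you are not sure whether a name is allowed, you can ask Python. This is a small sketch using the standard library `keyword` module and the string method `isidentifier()`; neither is used elsewhere in this notebook, and the exact list of reserved words depends on the version of Python you are running.

```python
import keyword

# keyword.kwlist holds the reserved words for the Python version you are running
print(keyword.kwlist)

# keyword.iskeyword() checks a single name
print(keyword.iskeyword("for"))          # True: reserved, so not allowed as a variable name
print(keyword.iskeyword("genome_size"))  # False: not reserved

# str.isidentifier() checks the naming rules (letters, digits, underscores, no leading digit)
print("genome0".isidentifier())          # True
print("0genome".isidentifier())          # False
```

Note that `isidentifier()` only checks the spelling rules, so a reserved word like `for` still passes that test; you need both checks to be sure.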

<details>
<summary>
Python also has a lot of built-in functions. We'll use some of these as we go through, but here is a table of all of them. Although you don't need to worry about them right now, you should avoid using the names of these functions as the name of a variable. Click the triangle at the start of this line to see the complete table.
</summary>


Function | Description
--- | ---
abs() | Returns the absolute value of a number
all() | Returns True if all items in an iterable object are true
any() | Returns True if any item in an iterable object is true
ascii() | Returns a readable version of an object. Replaces non-ASCII characters with escape characters
bin() | Returns the binary version of a number
bool() | Returns the boolean value of the specified object
bytearray() | Returns an array of bytes
bytes() | Returns a bytes object
callable() | Returns True if the specified object is callable, otherwise False
chr() | Returns a character from the specified Unicode code.
classmethod() | Converts a method into a class method
compile() | Returns the specified source as an object, ready to be executed
complex() | Returns a complex number
delattr() | Deletes the specified attribute (property or method) from the specified object
dict() | Returns a dictionary
dir() | Returns a list of the specified object's properties and methods
divmod() | Returns the quotient and the remainder when argument1 is divided by argument2
enumerate() | Takes a collection (e.g. a tuple) and returns it as an enumerate object
eval() | Evaluates and executes an expression
exec() | Executes the specified code (or object)
filter() | Use a filter function to exclude items in an iterable object
float() | Returns a floating point number
format() | Formats a specified value
frozenset() | Returns a frozenset object
getattr() | Returns the value of the specified attribute (property or method)
globals() | Returns the current global symbol table as a dictionary
hasattr() | Returns True if the specified object has the specified attribute (property/method)
hash() | Returns the hash value of a specified object
help() | Executes the built-in help system
hex() | Converts a number into a hexadecimal value
id() | Returns the id of an object
input() | Reads a line of user input and returns it as a string
int() | Returns an integer number
isinstance() | Returns True if a specified object is an instance of a specified class
issubclass() | Returns True if a specified class is a subclass of another specified class
iter() | Returns an iterator object
len() | Returns the length of an object
list() | Returns a list
locals() | Returns an updated dictionary of the current local symbol table
map() | Returns the specified iterator with the specified function applied to each item
max() | Returns the largest item in an iterable
memoryview() | Returns a memory view object
min() | Returns the smallest item in an iterable
next() | Returns the next item from an iterator
object() | Returns a new object
oct() | Converts a number into an octal
open() | Opens a file and returns a file object
ord() | Returns the integer Unicode code point of the specified character
pow() | Returns the value of x to the power of y
print() | Prints to the standard output device
property() | Gets, sets, deletes a property
range() | Returns a sequence of numbers, starting from 0 and increments by 1 (by default)
repr() | Returns a readable version of an object
reversed() | Returns a reversed iterator
round() | Rounds a number
set() | Returns a new set object
setattr() | Sets an attribute (property/method) of an object
slice() | Returns a slice object
sorted() | Returns a sorted list
staticmethod() | Converts a method into a static method
str() | Returns a string object
sum() | Sums the items of an iterable
super() | Returns an object that represents the parent class
tuple() | Returns a tuple
type() | Returns the type of an object
vars() | Returns the `__dict__` property of an object
zip() | Returns an iterator, from two or more iterators

</details>


## Types of data
Python knows about several types of data. Three common ones are:

* integer numbers
* floating point numbers
* character strings

In the example above, variable `genome_size` was assigned an integer value of `4500000`. If we want to store a fraction, like the %GC of the genome, 
we can use a floating point value by executing:

In [None]:
percent_gc = 0.45
print(percent_gc)

To create a string, we add single or double quotes around some text.
To identify and track a bacterium throughout our study, we can assign it a unique identifier by storing it in a string:

In [None]:
bacteria_id = "001"
print(bacteria_id)

## Numeric Types

Python supports two main types of numbers:
* int, arbitrary size signed integers, like these:
  * `2011`
  * `-999999999999`
* float, double precision floating point numbers (64 bits on almost every platform, so roughly 15 to 17 significant digits), like these:
  * `3.1415926539`
  * `3.8 * 10**6`

For the most part, you don't need to worry about which type of number to use - Python will take care of that for you. The decimal point tells Python which to use.

Mixing floats and ints results in a float so, for example, `2011 * 3.14` results in a floating point number.
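
Here is a small sketch of that rule. It uses the built-in `type` function (introduced later in this lesson) to show what kind of number each calculation produces; the variable names are made up for this example.

```python
int_product = 2011 * 2       # int * int stays an int
mixed_product = 2011 * 3.14  # int * float becomes a float
mixed_sum = 2011 + 0.0       # even adding 0.0 turns the result into a float

print(type(int_product))
print(type(mixed_product))
print(type(mixed_sum))
```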

Try entering these expressions in the following cell:

```
print(5 - 6)  
print(8 * 9)
print(6 / 2)
print(5.0 / 2)
print(5 % 2)  
print(2 * 10 + 3)  
print(2 * (10 + 3))  
print(2 ** 4)
```
Were there any outputs you didn't expect?

In [None]:
# talk about the print statements with your neighbour, and then copy and paste them here!
# press shift-enter to execute the code after you have pasted it.

## String Types 

Strings behave a lot like a [list](#Lesson-3-Lists) of characters, and we can do a few interesting things with them. Note that we will talk more about Lists later, so some of this will become clearer when we cover that material.

A string is a collection of individual characters, and you can access each of them separately. 

In [1]:
sequence = "ACGT"
print(sequence[0])
print(sequence[1])
print(sequence[2])
print(sequence[3])

A
C
G
T


We can also access a *slice* of a string by giving a start and a stop position. The slice runs from the start position up to, but not including, the stop position:

In [2]:
print(sequence[0:2])

AC
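
A few more slices may make the start and stop positions clearer. This is just a sketch; the variable `dna` and its sequence are made up for the example, and `len` is one of the built-in functions from the table above.

```python
dna = "ACGTTGCA"

print(len(dna))    # number of characters: 8
print(dna[0:3])    # positions 0, 1 and 2: ACG
print(dna[:3])     # leaving out the start means "from the beginning": ACG
print(dna[3:])     # leaving out the stop means "to the end": TTGCA
print(dna[-1])     # negative positions count back from the end: A
```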


## Using Variables in Python

Once we have data stored with variable names, we can make use of those variables in our calculations. We call these combinations of variables and values  **expressions**. When evaluating an expression, Python internally replaces the variable names with the values to which they refer.

We may want to store our genome size in kilobase pairs as well as base pairs.



In [None]:
genome_kb = genome_size / 1000
print(genome_kb)

We might also decide to add a bacterial genus and species to our bacterial id

In [None]:
bacteria_id = "E coli: " + bacteria_id
print(bacteria_id)

## Built-in Python functions

To carry out common tasks with data and variables in Python,
the language provides us with several built-in functions.
To display information to the screen, we use the `print` function:

In [None]:
print(genome_size)
print(bacteria_id)

When we want to make use of a function, what computer scientists refer to as **calling the function**, we follow its name by parentheses. The parentheses are important: if you leave them off, the function doesn't actually run!

Sometimes you will include values or variables inside the parentheses for the function to use. In the case of `print`, we use the parentheses to tell the function which value we want to display. We will learn more about how functions work and how to create our own in later lessons.
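
As a quick illustration of why the parentheses matter, this sketch uses the built-in `len` function (it appears in the table of built-in functions above). The variable name `sequence_length` is made up for this example.

```python
sequence_length = len("ACGTACGT")   # the parentheses call the function with one value
print(sequence_length)              # 8

# without parentheses nothing is counted: we just get the function itself
print(len)
```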

We can display multiple things at once using only one `print` function call:

In [None]:
print(bacteria_id, " genome size in kb: ", genome_kb)

We can also call a function inside of another function call. For example, Python has a built-in function called `type` that tells you a value's data type:

In [None]:
print(type(60.3))
print(type(bacteria_id))
print(type(genome_size))
print(type(genome_kb))

We can also do arithmetic with variables right inside the `print` function:

In [None]:
print("genome size in Mbp:", genome_size / 1000000)

Note that the above function call did not change the value of `genome_size`:

In [None]:
print(genome_size)

To change the value of the `genome_size` variable, we have to
**assign** a new value to `genome_size` using the equals `=` sign:

In [None]:
genome_size = 3100000000
print("genome size is now:", genome_size)

What values do the variables `rrna` and `protein` have after each of the following statements?

Guess before executing the lines below...

In [None]:
rrna = 400
print("There are ", rrna, " rRNAs encoded in the human genome")
print("There are ", protein, " proteins encoded in the human genome")

In [None]:
protein = 19126
print("There are ", rrna, " rRNAs encoded in the human genome")
print("There are ", protein, " proteins encoded in the human genome")

In [None]:
rrna = rrna * 2.0
print("There are ", rrna, " rRNAs encoded in the human genome")
print("There are ", protein, " proteins encoded in the human genome")

In [None]:
protein = protein - 126.0
print("There are ", rrna, " rRNAs encoded in the human genome")
print("There are ", protein, " proteins encoded in the human genome")

## Multiple definitions at once!

Python allows you to assign multiple values to multiple variables in one line by separating the variables and values with commas. What does the following program print out?

In [None]:
a, b = "E. coli", "Salmonella"
print(a, b)

In [None]:
first, second = "crAssphage", "phiX174"
third, fourth = second, first
print(third, fourth)
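
One common use of this feature is swapping two values without needing a temporary variable. The sketch below uses made-up variable names and sequences; Python works out everything on the right-hand side first, and only then assigns the results to the names on the left.

```python
forward, reverse = "ATGC", "GCAT"
forward, reverse = reverse, forward   # swap the two values in one line
print(forward, reverse)               # GCAT ATGC
```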

# Lesson 2 - Expressions


Things you'll learn in this lesson:
- More about types in Python
- Boolean operators
- How to combine constants, variables, and operators into arithmetic, boolean, and comparison expressions
- Operator precedence
- The magical f-string

Link to the original version of this notebook on [Marc's short link service](https://mco.fyi/py2)


# Constants vs. Variables

* Literal values (like `"Rob"` and `2010`) are called constants because they don't change.

* A constant's value is fixed, unlike a variable, whose associated value may change (or *vary*) over time.

* The data a variable refers to may be simple, e.g. a number or a string, or it may be complex, e.g. a list or an object (we'll learn about those later).

# Data Types

In Python, values have a *type*. We already saw three data types in the previous lesson. We'll take a look at a few other types.

# The Boolean (`bool`) Type

Python supports a special type called booleans, written `bool` in Python, which are used to indicate whether something is true or false. Booleans have one of two possible values:

* `True`
* `False`

When evaluating a **number** as a boolean, the following rules apply:

* 0 is `False`
* 0.0 is `False`
* all other numerical values are `True`

When evaluating a **string** as a boolean, the following rules apply:

* the empty string (`""` or `''`) is `False`
* all other strings are `True`

If it's something, it's `True`. Otherwise it's _NOT_. 
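
You can check these rules yourself with the built-in `bool` function, which converts any value to `True` or `False`. The values below are just examples chosen to match the rules above.

```python
print(bool(0))        # False
print(bool(0.0))      # False
print(bool(42))       # True
print(bool(-1.5))     # True: any non-zero number, even a negative one
print(bool(""))       # False
print(bool("ACGT"))   # True
print(bool(" "))      # True: a space is still a character, so the string is not empty
```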


# The None Type

Python has a special value called `None` (its type is `NoneType`) and it means *no value*.

It's a good choice when you want to initialize a variable without an obvious choice for the initial value, like this:

```bacteria = None```

None always evaluates to False in boolean expressions.
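
Here is a short sketch of how you might check for `None`. The usual way is the `is` operator (one of the reserved words listed in Lesson 1) rather than `==`.

```python
bacteria = None
print(bacteria is None)   # True: nothing has been assigned yet
print(bool(bacteria))     # False

bacteria = "E. coli"
print(bacteria is None)   # False: the variable now refers to a string
```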


# Comparison Opererators

As their name suggests, comparison operators allow us to compare values and result in a boolean type indicating whether the comparison is `True` or `False`.

The following table summarizes the most commonly used operators in Python, along with their definition when applied to numbers and strings.

|operator|operation on numbers|operation on strings|
|--------|--------------------|--------------------|
|==|equal to|equal to|
|!=|not equal to|not equal to|
|>|greater than|lexicographically greater than|
|>=|greater than or equal to|lexicographically greater than or equal to|
|<|less than|lexicographically less than|
|<=|less than or equal to|lexicographically less than or equal to|

(Remember the crocodile!)
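
"Lexicographically" means the strings are compared one character at a time, a bit like alphabetical order, and that can give surprising answers. The strings below are made-up examples to sketch the idea:

```python
print("ACGT" == "ACGT")              # True: identical strings
print("ACGT" == "acgt")              # False: comparisons are case-sensitive
print("Escherichia" < "Salmonella")  # True: "E" comes before "S"
print("Z" < "a")                     # True: upper case letters sort before lower case letters
print("Chr10" < "Chr2")              # True: "1" comes before "2", so the rest is never looked at
print(10 < 2)                        # False: numbers are compared by value
```

The `"Chr10" < "Chr2"` result is worth remembering when you sort chromosome names stored as strings.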


## Challenge

Which boolean value (`True` or `False`) do each of these expressions evaluate to?

* `123 == 10`
* `10 == 123`
* `123 == 123`
* `123 != 321`
* `123 != 123`
* `age == 65`
* `age != min_age`

* `"E. coli" == "Salmonella"`
* `"E. coli" == "E. coli"`
* `"E. coli" == "E.coli"`
* `"E. coli" != "e. coli"`

### `>` and `>=`
* `123 > 10`
* `10 > 123`
* `123 > 123`
* `123 >= 123`


# Boolean Operators - `and`, `or`, and `not`

Boolean operators are special operators in Python that let you combine boolean values in logical ways corresponding to how we combine truth values in the real world. An example of a boolean **and** expression would be "I'll buy a new phone if I like the features **and** the price is low". There are three main boolean operators: `and`, `or`, and `not`. We'll look at examples of each in the next cells.

## Boolean `and`

* `A and B`

is `True` only when both A and B are `True`, otherwise it's `False`.

Example:

* I ride my bike only when it's both sunny and warm.
* In other words, if `is_sunny` and `is_warm` are both `True` then `is_sunny and is_warm` is `True` so I **will** ride my bike.

In Python...
```
if is_sunny and is_warm:
    print("I will ride my bike")
```

We haven't learned about `if` statements yet, so don't worry if the previous construct looks unfamiliar. It's a simple way of checking the value of a boolean expression, but we'll dive deeper into `if` statements shortly.


In [None]:
is_sunny = False
is_warm = False
print(is_sunny and is_warm)

### Truth Table for boolean `and`
|`var1`|`var2`|`var1 and var2`|
|------|------|---------------|
|`False`|`False`|`False`|
|`True`|`False`|`False`|
|`False`|`True`|`False`|
|`True`|`True`|`True`|

## Boolean `or`

* `A or B`

is `True` when either A or B is `True`, or when both are `True`, otherwise it's `False`.

Example:

* I ride my bike when it's sunny, or warm, or both.
* In other words, if `is_sunny` or `is_warm` (or both) are `True` then `is_sunny or is_warm` is `True` so I **will** ride my bike.

In Python...
```
if is_sunny or is_warm:
  print("I will ride my bike")
```

### Truth Table for boolean `or`

|`var1`|`var2`|`var1 or var2`|
|------|------|---------------|
|`False`|`False`|`False`|
|`True`|`False`|`True`|
|`False`|`True`|`True`|
|`True`|`True`|`True`|

## Logical Not

* `not A`

is `True` when A is `False`, and `False` when A is `True`.

### Truth Table for boolean `not`

|`var1`|`not var1`|
|------|---------|
|`False`|`True`|
|`True`|`False`|
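
Here are the three operators side by side. The variable names `has_plasmid` and `is_pathogen` are hypothetical, chosen only for this sketch.

```python
has_plasmid = True
is_pathogen = False

print(has_plasmid and is_pathogen)   # False: both must be True
print(has_plasmid or is_pathogen)    # True: one True is enough
print(not has_plasmid)               # False: not flips the value
print(not is_pathogen)               # True
```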

# Expressions Revisited

* Python lets us combine variables, constants and operators into larger units called expressions.
* Expressions appear in many places
  * assignment statements. Here we are assigning the new number to the same variable
    * `age = age + 1    # we do this every birthday`
    * we could write this in two steps:
       * `tmp = age + 1`
       * `age = tmp`
    * but using one assignment is simpler, easier, and cleaner
  * function calls
    * `print(age * 365) # approximate number of days alive`
* As we learn more, we'll see expressions popping up all over the place

In [None]:
age = 22
print(age)
age = age + 1
print(age)
age += 1
print(age)

In [None]:
age = 22
days_per_year = 365
days_old = age * days_per_year
print(f"I was {days_old} days old on my last birthday!")

## Types of Expressions

* arithmetic expressions

`genome_size = chromosome_1_size + chromosome_2_size`

* comparison expressions

`genome_size == 0`

* boolean expressions

`chromosome_1 and plasmid`

* combinations of the above

`plasmid and (chromosome_1_size + chromosome_2_size + plasmid_size) < genome_size`


## Order of Evaluation

How does Python know the correct order to evaluate a complex expression?

Example: `4 + 1 * 5`

Is that `(4 + 1) * 5`, which is `25`?  
Or is it `4 + (1 * 5)`, which is `9`?

Another example: `True or False and False`

Is that `(True or False) and False`, which is `False`?  
Or is it `True or (False and False)`, which is `True`?

Python uses operator precedence rules to avoid this ambiguity and evaluate expressions in a predictable way.

## Python Precedence Rules

This is a subset of the complete rules (in order of highest to lowest precedence):

* parentheses (innermost to outermost, left to right)
* exponentiation (right to left)
* multiplication, division, modulus (left to right)
* addition, subtraction (left to right)
* comparisons (left to right)
* boolean not
* boolean and
* boolean or

[The Official Rules](https://docs.python.org/3/reference/expressions.html#operator-precedence)
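
You can check the two examples from the previous section for yourself. This sketch prints each expression as Python evaluates it, next to the version with explicit parentheses:

```python
print(4 + 1 * 5)                  # 9: multiplication happens before addition
print((4 + 1) * 5)                # 25: parentheses are evaluated first

print(True or False and False)    # True: `and` is evaluated before `or`
print((True or False) and False)  # False: the parentheses change the order

print(not 1 + 1 == 2)             # False: arithmetic first, then the comparison, then `not`
```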

## Practical Advice

**When in doubt, use parentheses.**

Coders make liberal use of parentheses because:
* You don't need to remember the precedence rules.
* You don't have to worry about surprises.
* It makes code more readable.
* It eliminates any ambiguity

For example, we could write this expression, which evaluates `A and B` first, then `C and D`, and finally takes the boolean `or` of the two preceding results:

`A and B or C and D`

but we much prefer to make the grouping explicit, like this, so we don't have to _think_ (about precedence rules) every time we look at this code:

`(A and B) or (C and D)`

## F-strings

We often need to combine variables, values, and strings. For example, if we have the following variables:

- `bacteria_name`
- `genome_size`

we might want to print a report, where each line summarizes the values above. We could do that like this:

```
print("bacteria: ", bacteria_name, ", genome size: ", genome_size, " bp")
```
which produces this output:
```
bacteria:  E. coli , genome size:  4500000  bp
```

This sort of construct gets a bit tedious. Plus, `print` puts a space between each of its arguments, so the doubled spaces and the space between the bacteria name and the following comma are unintended and undesirable.

Python has a simple approach, called f-strings, that offers a more readable solution to this problem. If you prefix a string with the character `f`, it gives the string magic powers. Specifically, the string has the ability to **interpolate** variables inside curly braces. Here's how we could express the previous `print` statement using an f-string:

```
print(f"bacteria: {bacteria_name}, genome size: {genome_size} bp")
```

This is shorter, less tedious, easier to read and write, and solves the formatting issue related to the comma between the two fields.

Note that you can put any Python expression inside the curly braces, so this technique is very powerful. Once you get going with Python, you'll find all sorts of wonderful ways to use f-strings.
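
For example, here is a sketch with a calculation, a method call, and a format code inside the braces. The `percent_gc` value comes from Lesson 1; the `.upper()` string method and the `:.1f` format code (one digit after the decimal point) are not covered elsewhere in this notebook.

```python
bacteria_name = "E. coli"
genome_size = 4500000
percent_gc = 0.45

size_line = f"{bacteria_name} genome size: {genome_size / 1000000} Mbp"
name_line = f"upper case name: {bacteria_name.upper()}"
gc_line = f"GC content: {percent_gc * 100:.1f}%"

print(size_line)   # E. coli genome size: 4.5 Mbp
print(name_line)   # upper case name: E. COLI
print(gc_line)     # GC content: 45.0%
```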

There is also a neat trick that can help with large numbers! Adding a `:,` after a number means `automatically insert thousands separators`:

```
print(f"bacteria: {bacteria_name}, genome size: {genome_size:,} bp")
```


In [None]:
bacteria_name = "E. coli"
genome_size = 4500000
print(f"bacteria: {bacteria_name}, genome size: {genome_size:,} bp")

# Lesson 3 Lists

**Lists and Dictionaries**


# Lists

* A list is a list of things, like a shopping list.
* Lists are ordered sequences.
* All the sequence operations you learned about with strings, like `len`, indexing, slicing, looping, `in`, etc. apply to lists as well.

Lists are defined inside square brackets, with list elements separated by commas, for example...
```
['a', 'b', 'c', 1, 2, 3]
```

Note that you can have both strings and numbers inside lists.

[List documentation](https://docs.python.org/3/library/stdtypes.html#list)

## Creating Lists

In [None]:
# Create an empty list (lists use square brackets)
li = []
print(f"empty list: {li}")

In [None]:
# Create and initialize a list with some data
li = ['Bacteria', 4500000, 3.14, True]
print(f"non-empty list: {li}")

In [None]:
# the same value can occur multiple times in a list
li = ['a', 'a', 'a']
print(li)

# Sets

Sets are like lists except for two key things:
* Sets are not ordered! The order that you get things back is not necessarily the same as the order that you put them in.
* Everything in a set is unique.

In [None]:
example_set = set()
print(f"empty set: {example_set}")
example_set.add('a')
example_set.add('a')
example_set.add('a')
print(f"non-empty set: {example_set}")
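
Because everything in a set is unique, one handy use is removing duplicates from a list. This is a small sketch with a made-up list of codons; `sorted` is one of the built-in functions from Lesson 1 and gives the results back as a list in a predictable order.

```python
codons = ["AUG", "UUU", "AUG", "CGA", "UUU", "AUG"]

unique_codons = set(codons)     # duplicates are discarded
print(len(codons))              # 6
print(len(unique_codons))       # 3
print("AUG" in unique_codons)   # True
print(sorted(unique_codons))    # ['AUG', 'CGA', 'UUU']
```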

## List Operations

You can change and edit lists, and add things to them. (We call this mutable, but don't worry about that).

In [3]:
# The len() function gives us the size of a list.
li = ["Chr1", "Chr2", "Chr3"]
# get the size of a list
list_size = len(li)
print(list_size)

3


In [4]:
li = ["Chr1", "Chr2", "Chr3", "Chr4", "Chr5", "Chr6", "Chr7", "Chr8", "Chr9", "Chr10", "Chr11", "Chr12", "Chr13", "Chr14", "Chr15", "Chr16", "Chr17", "Chr18", "Chr19", "Chr20", "Chr21", "Chr22", "Chr23"]
# iterate (loop) over the elements in a list
for i in range(len(li)):
  print(li[i])

Chr1
Chr2
Chr3
Chr4
Chr5
Chr6
Chr7
Chr8
Chr9
Chr10
Chr11
Chr12
Chr13
Chr14
Chr15
Chr16
Chr17
Chr18
Chr19
Chr20
Chr21
Chr22
Chr23


In [5]:
# A better way to iterate over the elements in a list
li = ["Chr1", "Chr2", "Chr3", "Chr4", "Chr5", "Chr6", "Chr7", "Chr8", "Chr9", "Chr10", "Chr11", "Chr12", "Chr13", "Chr14", "Chr15", "Chr16", "Chr17", "Chr18", "Chr19", "Chr20", "Chr21", "Chr22", "Chr23"]
for i in li:
  print(i)

Chr1
Chr2
Chr3
Chr4
Chr5
Chr6
Chr7
Chr8
Chr9
Chr10
Chr11
Chr12
Chr13
Chr14
Chr15
Chr16
Chr17
Chr18
Chr19
Chr20
Chr21
Chr22
Chr23


In [6]:
li = ["Chr1", "Chr2", "Chr3", "Chr4", "Chr5", "Chr6", "Chr7", "Chr8", "Chr9", "Chr10", "Chr11", "Chr12", "Chr13", "Chr14", "Chr15", "Chr16", "Chr17", "Chr18", "Chr19", "Chr20", "Chr21", "Chr22", "Chr23"]
# test membership in a list

x = "ChrX"
if x not in li:
    print(f"{x} is not in the list")
else:
    print(f"{x} is in the list")

ChrX is not in the list


In [7]:
# indexing (list indexes start with zero!)
li = ["Chr1", "Chr2", "Chr3", "Chr4", "Chr5", "Chr6", "Chr7", "Chr8", "Chr9", "Chr10", "Chr11", "Chr12", "Chr13", "Chr14", "Chr15", "Chr16", "Chr17", "Chr18", "Chr19", "Chr20", "Chr21", "Chr22", "Chr23"]
print(li[2])

Chr3


In [8]:
# indexing out of bounds raises a runtime error
li = ["Chr1", "Chr2", "Chr3", "Chr4", "Chr5", "Chr6", "Chr7", "Chr8", "Chr9", "Chr10", "Chr11", "Chr12", "Chr13", "Chr14", "Chr15", "Chr16", "Chr17", "Chr18", "Chr19", "Chr20", "Chr21", "Chr22", "Chr23"]
print(li[99])


IndexError: list index out of range

In [9]:
li = ["Chr1", "Chr2", "Chr3", "Chr4", "Chr5", "Chr6", "Chr7", "Chr8", "Chr9", "Chr10", "Chr11", "Chr12", "Chr13", "Chr14", "Chr15", "Chr16", "Chr17", "Chr18", "Chr19", "Chr20", "Chr21", "Chr22", "Chr23"]
# slicing
print(li[1:3])

['Chr2', 'Chr3']


In [10]:
# concatenating lists
li1 = ["Chr1", "Chr2", "Chr3"]
li2 = ["Chr4", "Chr5", "Chr6"]
li3 = ["Chr7", "Chr8", "Chr9"]
li4 = li1 + li2 + li3
print(li4)

['Chr1', 'Chr2', 'Chr3', 'Chr4', 'Chr5', 'Chr6', 'Chr7', 'Chr8', 'Chr9']


In [11]:
# add an element
li = ["Chr1", "Chr2", "Chr3", "Chr4"]
print(li)
li.append("ChrX")
print(li)

['Chr1', 'Chr2', 'Chr3', 'Chr4']
['Chr1', 'Chr2', 'Chr3', 'Chr4', 'ChrX']


In [12]:
print(li)
# replace an element by index
li[4] = 'ChrY' # overwrites value at index 4
print(li)

['Chr1', 'Chr2', 'Chr3', 'Chr4', 'ChrX']
['Chr1', 'Chr2', 'Chr3', 'Chr4', 'ChrY']


In [13]:
li = ["Chr1", "Chr2", "Chr3", "Chr4", "Chr1", "Chr2", "Chr3", "Chr1", "Chr2", "Chr1"]
# get the number of occurrences of a particular value
count = li.count("Chr1")
print(count)

4


In [14]:
li = ["Chr1", "Chr2", "Chr3", "Chr4", "Chr1", "Chr2", "Chr3", "Chr1", "Chr2", "Chr1"]
# get the (first) index of a particular value
index = li.index("Chr1")
print(index)

0


In [15]:
li = ["Chr1", "Chr2", "Chr3", "Chr4"]
# reverse a list
li.reverse()
print(li)
li.reverse()
print(li)

['Chr4', 'Chr3', 'Chr2', 'Chr1']
['Chr1', 'Chr2', 'Chr3', 'Chr4']


## Nested Lists

We can have lists of lists, and lists of lists of lists.

- list of lists: ```[[1, 2], [3, 4]]```


Later we will look at spreadsheets that are two-dimensional, and you can imagine them being held as lists of lists. The outer list holds the rows of the spreadsheet, and each inner list holds the values in the columns of one row, so every cell can be found from its row position and its column position.
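
As a rough sketch of that idea, here is a tiny "spreadsheet" held as a list of lists. The gene names and positions are made up for this example.

```python
# each inner list is one row: [gene name, start, stop]
genes = [
    ["geneA", 100, 950],
    ["geneB", 1200, 2100],
    ["geneC", 2500, 3700],
]

print(genes[0])        # the first row: ['geneA', 100, 950]
print(genes[1][0])     # second row, first column: geneB
print(genes[2][2])     # third row, third column: 3700
print(len(genes))      # number of rows: 3
print(len(genes[0]))   # number of columns in the first row: 3
```

The first index picks the row and the second index picks the column within that row.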

If you want to explore nested lists further, have a look at [Marc's Python lesson 5](https://mco.fyi/py5), which covers these concepts in more detail.


# Lession 4 - Dictionaries

A dictionary is an organized collection of key/value pairs.
The data is organized for quick access via the key, somewhat like a real dictionary, where words are the keys and their definitions are the associated values.

Dictionaries are defined using curly braces with key:value pairs separated by commas, like this:
```
websites = {
  'google': 'https://google.com',
  'youtube': 'https://youtube.com',
  'FAME': 'https://fame.flinders.edu.au'
}
```

This data type is known by various names in other languages:
- map (C++)
- hash map (Java)
- associative array (generic term)

This object type is extremely powerful for representing indexed data. The keys in a dictionary are arranged to facilitate fast lookup by key value.
- They are optimized for direct, not sequential, access
- Keys are not sorted; since Python 3.7 a dictionary simply keeps its keys in the order they were added
- You can't index a dictionary by position
- But you can index dictionaries very quickly by key value, as we’ll see
- You can't take slices of a dictionary
- Dictionaries, like lists, can grow, shrink, or change over time

- Dictionary keys must be something that can't change (e.g., string, number, tuple) because changing keys on the fly would confuse the dictionary.
- Dictionary values can have any type
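
Here is a quick sketch of a dictionary growing, changing, and shrinking. It uses a two-codon piece of the genetic code as an example; the `del` statement is one of the reserved words from Lesson 1.

```python
genetic_code = {"UUU": "Phe"}

genetic_code["UUA"] = "Leu"   # grow: add a new key/value pair
genetic_code["UUU"] = "F"     # change: replace the value stored for an existing key
del genetic_code["UUA"]       # shrink: remove a key and its value

print(genetic_code)           # {'UUU': 'F'}
print(len(genetic_code))      # 1
```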

[Dictionary documentation](https://docs.python.org/3/library/stdtypes.html#mapping-types-dict)

[Rob's introduction to dictionaries](https://youtu.be/uW8-HkmNq4Q?si=zhzlUxGut6ARE26K) (but because it's Java they are called hashes).


## Dictionary Operations


In [None]:
# Create an empty dictionary (use curly braces instead of parens or square brackets)
genetic_code = {}
print(genetic_code)

In [None]:
# Create and initialize a dictionary
genetic_code = { 'UUU' : 'Phe', 'UUA': 'Leu' }
print(genetic_code)

In [None]:
# If the same key occurs multiple times, Python only keeps the last value
x = { 'UUU' : 'Leu', 'UUU' : 'Phe' }
print(x)


In [None]:
# but the same value may appear any number of times.
x = { 'CGA' : 'Arg', 'CGC' : 'Arg', 'CGG' : 'Arg', 'CGU' : 'Arg' }
print(x)

In [None]:
# Get the size of a dictionary (returns number of key/value pairs)
genetic_code = { 'UUU' : 'Phe', 'UUA': 'Leu', 'CGA' : 'Arg', 'CGC' : 'Arg', 'CGG' : 'Arg', 'CGU' : 'Arg' }
print(len(genetic_code))

In [None]:
# Retrieve the value associated with a given key
genetic_code = { 'UUU' : 'Phe', 'UUA': 'Leu', 'CGA' : 'Arg', 'CGC' : 'Arg', 'CGG' : 'Arg', 'CGU' : 'Arg' }
codon = 'UUU'
amino_acid = genetic_code[codon]
print(f"The translation of {codon} is {amino_acid}")

In [None]:
# The value inside the square brackets may be a literal, a variable or any
# arbitrary expression. Similar syntax to list indexing but key based,
# not positional.
# Attempting to retrieve a non-existent key causes an error
genetic_code = { 'UUU' : 'Phe', 'UUA': 'Leu', 'CGA' : 'Arg', 'CGC' : 'Arg', 'CGG' : 'Arg', 'CGU' : 'Arg' }
codon = 'AAA'
genetic_code[codon]

In [None]:
# play it safe by testing for key existence before access
genetic_code = { 'UUU' : 'Phe', 'UUA': 'Leu', 'CGA' : 'Arg', 'CGC' : 'Arg', 'CGG' : 'Arg', 'CGU' : 'Arg' }
codon = 'AAA'
if codon in genetic_code:
    amino_acid = genetic_code[codon]
    print(f"The translation of {codon} is {amino_acid}")
else:
    print(f"We didn't find a translation for {codon}")

codon = 'UUU'
if codon in genetic_code:
    amino_acid = genetic_code[codon]
    print(f"The translation of {codon} is {amino_acid}")
else:
    print(f"We didn't find a translation for {codon}")
    
# When used with dictionaries, the in operator only checks the existence
# of keys, not values. You can also use “not in” to test for non-existence
# of a key.
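
Another way to play it safe, sketched here with a shortened genetic code, is the dictionary's `get()` method, which returns a default value instead of raising an error when the key is missing. The default `'unknown'` is just a placeholder chosen for this example:

```python
genetic_code = {'UUU': 'Phe', 'UUA': 'Leu', 'CGA': 'Arg'}

# get() returns the value if the key exists, otherwise the default we supply
print(genetic_code.get('UUU', 'unknown'))   # Phe
print(genetic_code.get('AAA', 'unknown'))   # unknown

# Without a default, get() returns None for a missing key
print(genetic_code.get('AAA'))              # None

# "not in" tests for the absence of a key
if 'AAA' not in genetic_code:
    print("AAA is not in our (incomplete) genetic code")
```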

In [None]:
genetic_code = { 'UUU' : 'Phe', 'UUA': 'Leu', 'CGA' : 'Arg', 'CGC' : 'Arg', 'CGG' : 'Arg', 'CGU' : 'Arg' }
# loop through a dictionary (this iterates over the dictionary keys)
for codon in genetic_code:
    amino_acid = genetic_code[codon]
    print(f"The translation of {codon} is {amino_acid}")
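
As an alternative sketch of the same loop, the `items()` method hands you each key and its value together, so you don't need the second lookup. The `values()` and `keys()` methods give you just one side of each pair:

```python
genetic_code = {'UUU': 'Phe', 'UUA': 'Leu', 'CGA': 'Arg'}

# items() yields (key, value) pairs, so no second lookup is needed
for codon, amino_acid in genetic_code.items():
    print(f"The translation of {codon} is {amino_acid}")

# values() gives just the values; keys() gives just the keys
amino_acids = list(genetic_code.values())
codons = list(genetic_code.keys())
print(amino_acids)   # ['Phe', 'Leu', 'Arg']
print(codons)        # ['UUU', 'UUA', 'CGA']
```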

## Nested Dictionaries
Just like we can have nested if statements, nested loops, and nested lists, we can also have nested dictionaries.
- dictionary of lists:  ```{'key1': [1,2], 'key2': [3,4]}```
- dictionary of dictionaries:
```
{
    'key1' : {
      'key1' : [1, 2],
      'key2' : [3, 4]
    },
    'key2' : {
      'key1' : [1, 2],
      'key2' : [3, 4]  
    }
}
```
This can get arbitrarily complex (dictionaries of lists of dictionaries of...).

If you want to explore nested dictionaries in more detail, have a look at [Marc's Python lesson 5](https://mco.fyi/py5), which covers these concepts in depth.
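
To make the idea concrete, here is a minimal sketch of a dictionary of dictionaries. The layout (a `name` and a list of `codons` for each amino acid) is just one way you might choose to organise the data, and the codon lists are deliberately incomplete:

```python
# A small, incomplete dictionary of dictionaries: amino acid -> properties
amino_acids = {
    'Phe': {'name': 'Phenylalanine', 'codons': ['UUU', 'UUC']},
    'Leu': {'name': 'Leucine', 'codons': ['UUA', 'UUG']},
}

# Chain the keys (and list indexes) from the outside in
print(amino_acids['Phe']['name'])        # Phenylalanine
print(amino_acids['Leu']['codons'][0])   # UUA

# Add a new inner dictionary under a new outer key
amino_acids['Arg'] = {'name': 'Arginine', 'codons': ['CGA', 'CGC']}
print(len(amino_acids))                  # 3
```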



## Rule of thumb for truth value of lists, and dictionaries

All of these objects may be used as boolean values. The rules for converting a list or dictionary into a boolean value are as follows:
- if the object is empty, it evaluates to False
- if the object is non-empty, it evaluates to True
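
A quick sketch of this rule, using the built-in `bool()` function to show the truth value Python would use:

```python
empty_list = []
chromosomes = ['Chr1', 'Chr2']
empty_dict = {}
genetic_code = {'UUU': 'Phe'}

# bool() shows the truth value Python would use in an if statement
print(bool(empty_list))     # False
print(bool(chromosomes))    # True
print(bool(empty_dict))     # False
print(bool(genetic_code))   # True

# So we can ask "is there anything in here?" directly
if not empty_dict:
    print("The dictionary is empty")
```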

# Lesson 5 - Conditionals


# Controlling Program Flow

Question:

Does a stretch of DNA 2329 bp long encode a gene?

* Up to now, we've looked at very simple programs, involving a sequence of statements (A, then B, then C…)
* But what you really want to do is probably a lot more complex than adding numbers or simple statements.

For our question, we are going to assume that we are talking about simple phage and bacterial genes that don't have introns (after all, you have more phage and bacterial genes than human genes).

Genes start with the codon `ATG` and end with one of the codons `TAA`, `TGA`, or `TAG`, and the codons need to be in-frame. 

<details>
<summary>More assumptions!</summary>
Of course, we are assuming that this is a phage or bacterial gene that doesn't have an intron, that it doesn't start with `GTG` or `TTG`, and that the bacterium doesn't carry suppressor mutations that allow it to insert amino acids at the standard stop codons. In real life, we need to consider all those cases, but for now, we'll keep it simple!
</details>
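
Under those simple assumptions, the three rules above can be sketched as three checks. The sequence here is made up for illustration, and this sketch does not look for stop codons in the middle of the sequence, which a real gene finder would need to do:

```python
# A short, made-up sequence used only for illustration
dna = "ATGAAACCCGGGTAA"
stop_codons = ["TAA", "TGA", "TAG"]

starts_ok = dna.startswith("ATG")          # first codon is ATG
ends_ok = dna[-3:] in stop_codons          # last codon is a stop codon
in_frame = len(dna) % 3 == 0               # a whole number of codons

if starts_ok and ends_ok and in_frame:
    print("This could be a gene (under our simple assumptions)")
else:
    print("This does not look like a simple gene")

# The length alone can rule a sequence out: 2329 is not a multiple of 3
print(2329 % 3)    # 1
```

So a stretch of exactly 2329 bp cannot itself be one complete start-to-stop reading frame, although it might still contain one.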


# `if` Statements

* The `if` statement is how we express conditional logic in Python.
* Virtually every programming language has this concept.
* `if` statements define a condition and a sequence of statements to execute if the condition is `True`.

Prototype...

```
if some_expression:    
  do_this()
  do_that()
```

If the condition is true, the indented statements are executed.
Otherwise, the indented statements are skipped and program execution continues after the `if` statement.


In [None]:
bases = "AAAAATGCCCCC"
start = "ATG"
if start in bases:
    print(f"The sequence {bases} has a start in it!")
else:
    print(f"Sorry, no start in {bases}")

## Challenge

In Python, we use indentation to associate a block of statements with a condition, for example...

```
print("1")
if some_condition:    
  print("2")    
  print("3")
print("4") # this line is NOT part of the if block
```

What does the output look like...
* when `some_condition` is True?
* when `some_condition` is False?

Here’s a slightly different example...
```
print("1")
if some_condition:    
  print("2")    
print("3")  # this line is NOT part of if block
print("4")  # this line is NOT part of if block
```
What’s different?
What does the output look like...
* when `some_condition` is True?
* when `some_condition` is False?


## `if` Block Structure

* In Python, `if` statement blocks are defined by indentation.
* This idea of using indentation to delineate program structure is pervasive in Python and unusual among mainstream programming languages.
* For now, we're focusing on if statements but later we'll see how indentation is used to define other kinds of statement blocks.

### Block Structure in Other Languages

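In other languages, explicit delimiters are used. For example, in Java, C and C++ we would write something like this (simplified pseudocode, since `contains` is not a real operator in those languages):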

```
if (bases contains "ATG") {
    has_start = true;
}
```

whereas, in Python we write:

```
if "ATG" in bases:
    has_start = True
```
Indentation in Java/C/C++ is a helpful practice for program readability but it does not affect program functionality.
In Python, indentation is not just a good idea - it affects program logic!


## Python's use of whitespace

* Many people have strong opinions about this aspect of Python.
* Don’t get hung up on this feature. Try it and see what you think after you've written a few Python programs.
* Pitfalls:
  * watch out for mismatched indentation within a block
  * avoid mixing tabs and spaces in your code
  * I prefer spaces because it's more explicit, and most programs will automatically insert spaces even if you press Tab

**Pick _either_ tabs _or_ spaces _but_ be consistent.**

## `else` Statements

Sometimes we want to specify an alternative to the `if` condition, which we do with an `else` statement, for example...

```
if <condition>:
    <block1>
else:
    <block2>
```

* If the condition is true, block1 is executed.
* If the condition is false, block2 is executed.

The `else` clause is Python's way of saying "otherwise..."



Just as `if` blocks are defined by indentation, `else` blocks are also defined by indentation.

For example, this:

```
if <condition>:
    <statement1>
else:
    <statement2>
    <statement3>
```
is different from this:
```
if <condition>:
    <statement1>
else:
    <statement2>
<statement3>
```
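
Here is a runnable sketch of that difference, using a made-up sequence with no `ATG` in it. Try changing `bases` so that it contains `ATG` and see which lines still print:

```python
bases = "AAAAACCCCC"   # made-up sequence with no ATG

# Version 1: both print statements belong to the else block
if "ATG" in bases:
    print("found a start")
else:
    print("no start found")
    print("this line runs only when there is no start")

# Version 2: the last print is outside the if/else, so it always runs
if "ATG" in bases:
    print("found a start")
else:
    print("no start found")
print("this line runs whether or not there is a start")
```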


## `elif` Statements

Sometimes we need one or more intermediate conditions between the if and else parts, for example...

`if A then do X, else if B then do Y, otherwise do Z`

We use the `elif` statement to express this in Python...
```
if condition1:
    do_thing_1()
elif condition2:
    do_thing_2()
else:
    do_thing_3()
```
* If `condition1` is true, `do_thing_1()` is executed.
* Otherwise, if `condition2` is true, `do_thing_2()` is executed.
* Otherwise, `do_thing_3()` is executed.


* `elif` blocks are defined the same way as `if` and `else` blocks - using indentation.

* It's good to have an if/elif for every condition of interest and not lump errors together with cases of interest.

For example, if you care about the start codons `ATG` and `TTG` and everything else is considered an error, this code:

```
if "ATG" in bases:      # deal with ATG here
  starts_with_atg()
elif "TTG" in bases:    # deal with TTG here
  starts_with_ttg()
else:                   # deal with errors here
  no_start_codon()
```
is better than this:
```
if "ATG" in bases:      # deal with ATG here
  starts_with_atg()
else:    # it must be TTG then, right? not necessarily!
  starts_with_ttg()
```
The latter code hides errors by combining a valid case with error cases.
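
Here is a runnable sketch of the same pattern. It is not identical to the snippets above: I've assumed we only care about the very first codon, so it uses the string method `startswith()` rather than `in`, and it stores the answer in a variable instead of calling functions. The sequence is made up:

```python
bases = "TTGAAACCC"   # made-up sequence for illustration

if bases.startswith("ATG"):
    start_type = "ATG"
elif bases.startswith("TTG"):
    start_type = "TTG"
else:
    # anything else is reported explicitly rather than guessed at
    start_type = "none"

print(f"Start codon found: {start_type}")   # Start codon found: TTG
```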


# For loops

We have already actually used `for` loops when we were looking at dictionaries and lists, but just to reiterate ... if you want to iterate over a series of things, you can do so with a `for` loop. 


In [None]:
genetic_code = { 'UUU' : 'Phe', 'UUA': 'Leu', 'CGA' : 'Arg', 'CGC' : 'Arg', 'CGG' : 'Arg', 'CGU' : 'Arg' }
# loop through a dictionary (this iterates over the dictionary keys)
for codon in genetic_code:
    amino_acid = genetic_code[codon]
    print(f"The translation of {codon} is {amino_acid}")


<details>
<summary>Advanced for loops</summary>

Python also has a compact way to build a list from a loop that you will often see people use. It applies an expression to every item you iterate over and collects the results in a new list, and is called a `list comprehension`.

```
genetic_code = { 'UUU' : 'Phe', 'UUA': 'Leu', 'CGA' : 'Arg', 'CGC' : 'Arg', 'CGG' : 'Arg', 'CGU' : 'Arg' }
amino_acids = [genetic_code[codon] for codon in genetic_code]
print(f"All the amino acids are {amino_acids}")
```
</details>
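
As a short sketch, a `for` loop works the same way over lists, strings, and ranges of numbers. The last loop uses `range()` with a step of 3 to walk through a made-up sequence one codon at a time:

```python
# Loop over the items of a list
chromosomes = ['Chr1', 'Chr2', 'Chr3']
for chromosome in chromosomes:
    print(chromosome)

# Loop over the characters of a string
sequence = "ATGC"
for base in sequence:
    print(base)

# Loop over a range of numbers to step through a sequence in threes
dna = "ATGAAATAA"
codons = []
for i in range(0, len(dna), 3):
    codons.append(dna[i:i + 3])
print(codons)   # ['ATG', 'AAA', 'TAA']
```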


# Lesson 6 - Functions

## Functions are flexible software building blocks

* So far, we’ve been writing small programs.
* Things get much more complicated when we write large programs, especially with multiple authors.
* Ideally, we'd like to build software like snapping Lego pieces together.
* What would that buy us?
  * abstraction
  * reuse
  * modularity
  * maintainability


### Abstraction - You don't need to do everything

* When building a house, you don't do everything yourself.
  * You hire an architect, a carpenter, an electrician, a roofer, a plumber, a mason.
  * You might hire a contractor to hire and manage all those people.
* In our programs we delegate tasks to certain functions, like `print()`, so that we don't have to worry about all the details.
  * It's a bit like hiring an electrician so that we don't have to worry about the details of electrical wiring in our house (or getting blown up!)


### Reuse - Don't Reinvent the Wheel

* it's OK to reuse other people's work
  * it's not stealing
  * it makes you more efficient and more productive
* Very few people build a house from scratch
  * so don't try to build programs from scratch
* Most of the software we produce is called `open source software` which means
  * you can look at the source
  * you can change the source
  * you can do more complex things with it


#### Reuse Example
You can count the number of bases in a DNA sequence the hard way:

In [None]:
sequence = "ATGCATAGCTAGCATCAGACTGATGCATCGACTGATCGACTGT"
bases = 0
valid_bases = ["A", "T", "G", "C"]
for i in sequence:
    if i in valid_bases:
        bases += 1

print(f"There are {bases} bases in {sequence}")

Or the easy way, by calling a built-in function...

In [None]:
sequence = "ATGCATAGCTAGCATCAGACTGATGCATCGACTGATCGACTGT"
print(f"There are {len(sequence)} bases in {sequence}")

Which would you rather use? Which is more reliable?
* The first approach is great for learning.
* The second approach is great for getting real work done.


### Modularity - Divide and Conquer
* So far, our programs have been monoliths - one continuous sequence of Python statements.
* Real programs are often much bigger than the ones we've written.
  * Google's software repository has billions of lines of source code ([Why Google Stores Billions of Lines of Code in a Single Repository](https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext))
  * No one person can write a program that big.
  * Large programs are built by teams.
  * In order to build large, complex programs, we need the ability to divide program logic into manageable pieces.
* We call this modularity - dividing software into pieces or modules.


### Maintainability - Keeping your code DRY (Don't Repeat Yourself)
* Imagine that you need to do roughly the same thing in ten different places so you copy the code to those ten locations.
  * What happens when you find a bug or want to improve that piece of code?
  * You need to make the change ten times.
  * Will you remember to do that?
  * If you do remember, will you catch all ten locations?
  * Copying code is a **bad thing** - it leads to bugs.


### Functions solve all of these Problems

* Functions give us the ability to:
  * Hide low level details (abstraction)
  * Share and reuse pieces of functionality (reuse)
  * Split programs into manageable pieces (modularity)
  * Write one copy of an algorithm and use it anywhere (maintainability)
* We've already used several functions
  * `print()`, `input()`, `int()`, `len()`, `range()`, etc.
* Now let's see how to define our own functions.


## Defining Functions

* example:
```
def function_name(arg1, arg2):
    '''This is a docstring.'''  # optional but a good idea
    statement1
    statement2
    ...
```
* Not surprisingly, we define the scope of the function body using indentation (just like how we define blocks for if statements, for loops, etc.).
* This is a bit like an assignment statement in that it assigns a block of code (the function body) to the function name.
  * Function names have the same rules as variable names.
  * This only defines a function - it doesn't execute it.


In [None]:
def next_base():
  '''
     This function generates a DNA sequence base.
     It's how Illumina sequencing works.
  '''
  bases = ["A", "G", "T", "C"]
  print(bases[0])

next_base()

### Docstrings
* string defined immediately after the def line
* usually triple quoted since it may be multi-line
* not required but a good way to document your functions
* IDEs use the docstring to make your life easier
* used to generate the output of `help(function)`

### Example Function

In [None]:
# Reverse complement a DNA sequence
# Here's an example function definition...
def rc(dna):
    """
    Reverse complement a DNA sequence
    :param dna: The DNA sequence
    :type dna: str
    :return: The reverse complement of the DNA sequence
    :rtype: str
    """
    complements = str.maketrans('acgtrymkbdhvACGTRYMKBDHV', 'tgcayrkmvhdbTGCAYRKMVHDB')
    rcseq = dna.translate(complements)[::-1]
    return rcseq

In [None]:
# Get help about this function...
help(rc)

In [None]:
# And here's how we would call this function...
sequence = "ATCGATCGCATAGCTACGACTAGCTACGACTGACT"
rc(sequence)

### Passing Values to a Function

* The variables we define in a function to take on the values passed by the caller are called parameters.
* In this code, `a`, `b` and `c` are parameters:
```
def sum(a, b, c):
    return a + b + c
```
* The values supplied by the caller when calling a function are called arguments.
* In this code, `1`, `2`, and `3` are arguments:
```
sum(1, 2, 3)
```

So in our example above, the function `rc()` has one parameter (`dna`) and the code to run it passes one argument (`sequence`).

Don't get hung up on this; most people use argument and parameter interchangeably.


### Passing Arguments
* Functions can define any number of parameters, including zero.
* Multiple parameters are separated by commas, like this...
```
def product(a, b, c):
    return a * b * c
```
* If you pass the wrong number of arguments, you'll hear about it:
```
product(1, 2)
...
TypeError: product() missing 1 required positional argument: 'c'
...
```


### Return Values

Instead of printing the result, we can also have the function return a result to the caller so that the caller can print it or use it in a calculation.

For example, our `rc()` function returns the reverse complement of the sequence. We can store that in a new variable and do things with it.

```
sequence = "ATGCATCGCATCGATCAGCTACGACTCGACTCGAT"
reverse_complement = rc(sequence)
# do something with reverse_complement
```


* Functions return a value to the caller via the `return` statement.
* The `return` statement causes two things to happen...
  * the function ends and control is returned to the caller
  * the returned value is passed back to the caller
* You can have as many return statements as you like (including zero, in which case the function returns `None`).
* If the caller wants to do something with a returned value, it needs to save it or use it in an expression...
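
Here is a minimal sketch of those points. The function names `is_stop_codon()` and `announce()` are made up for this example. The first has two `return` statements, and the second has none:

```python
def is_stop_codon(codon):
    """Return True if codon is one of the three standard DNA stop codons."""
    if codon in ["TAA", "TGA", "TAG"]:
        return True    # the function ends here for stop codons
    return False       # reached only if the if block did not return


def announce(codon):
    """Print a message; there is no return statement, so None is returned."""
    print(f"Looking at {codon}")


# Save the returned value in a variable...
result = is_stop_codon("TAA")
print(result)                      # True

# ...or use it directly in an expression
if not is_stop_codon("ATG"):
    print("ATG is not a stop codon")

print(announce("ATG"))             # prints the message, then None
```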


# Lesson 7 - Modules

`import` is how you use someone else's code.

Let's say we want to pick a random base. We can use the Python `random` module to generate a random position in our list of bases, like this...

In [None]:
import random

def next_base():
  '''
     This function generates a DNA sequence base.
     It's how Illumina sequencing works.
  '''
  bases = ["A", "G", "T", "C"]
  return bases[random.randint(0, 3)]


print("Here is a new DNA sequence for you:")
for i in range(150):
    print(next_base(), end="")
print()

## The `from` Statement

* You can also import code using this syntax...
```
from module-name import *
```
* This says load all the names (`*`) from the designated module.
* With this kind of import, the module's names get loaded into the global namespace, which means you don't need to qualify your accesses with the `module-name.` prefix.
* For example, you could do this...
```
from random import *
rand_val = randint(0, 3)
# I didn't need to use random.randint(0, 3)
```

* You can also import selected names from a module
```
from random import randint, choice
```
* This says load only those names explicitly listed (`choice` and `randint`, in this case) from the designated module into the global namespace.
* As in the previous example, after this import the names are loaded into the common global namespace so there is no need to qualify them...
```
from random import randint, choice

bases = ["A", "G", "T", "C"]
print(bases[randint(0, 3)])
print(choice(bases))
```



## When to use `import` vs. `from`
* Generally, it's better to use `import` because...
  * less risk of name clashes and other surprises
  * makes your code more explicit and clear
* Occasionally, you may find that you use a module’s functions so frequently that it pays to import it directly into the global namespace with `from`.
* That’s fine but do so carefully and watch out for name conflicts.
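
Here is a small sketch of the kind of name clash to watch out for. The first `choice()` function is a made-up one of our own, and the `from` import quietly replaces it:

```python
def choice(options):
    """Our own (hypothetical) function: always pick the first option."""
    return options[0]

bases = ["A", "G", "T", "C"]
print(choice(bases))          # A, from our own function

# This import silently replaces our choice() with random.choice()
from random import choice
print(choice(bases))          # now a random base

# With a plain import, the module prefix makes it clear which one we mean
import random
print(random.choice(bases))   # clearly the random module's version
```

Python doesn't warn you when a name is replaced like this, which is why the plain `import` is usually the safer choice.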
