# An introduction to solving biological problems with Python: 2

## Learning objectives

- **Recall** what we've learned so far on variables, common data types and collections
- **Propose and create** solutions using these concepts in an exercise
- **Use** conditions to execute specific code blocks
- **Employ** loops to repeat code blocks
- **Practice** reading and writing files with Python
- **Solve** more complex exercises

# Recap

- Simple data types, Collections
- Functions used so far...

## Simple data types

In [1]:
## Integer
i = 1
print('Integer:', i)
## Float
x = 3.14
print('Float', x)
## Boolean
print(True)

Integer: 1
Float 3.14
True


In [2]:
## String
s0 = '' # empty string
s1 = 'ATGTCGTCTACAACACT' # single quotes
s2 = "spam's" # double quotes
print(s1 + s2) # concatenate
print(s1, s2) # print

ATGTCGTCTACAACACTspam's
ATGTCGTCTACAACACT spam's


## Collections

In [3]:
## Tuple - immutable
my_tuple = (2, 3, 4, 5)
print('A tuple:', my_tuple)
print('First element of tuple:', my_tuple[0])

A tuple: (2, 3, 4, 5)
First element of tuple: 2


In [4]:
## List
my_list = [2, 3, 4, 5]
print('A list:', my_list)
print('First element of list:', my_list[0])
my_list.append(12)
print('Appended list:', my_list)
my_list[0] = 45
print('Modified list:', my_list)

A list: [2, 3, 4, 5]
First element of list: 2
Appended list: [2, 3, 4, 5, 12]
Modified list: [45, 3, 4, 5, 12]


In [5]:
## String - immutable sequence of characters
text = "ATGTCATTT"
print('Here is a string:', text)
print('First character:', text[0])
print('Slice text[1:3]:', text[1:3])
print('Number of characters in text', len(text))

Here is a string: ATGTCATTT
First character: A
Slice text[1:3]: TG
Number of characters in text 9


In [6]:
## Set - unique unordered elements
my_set = set([1,2,2,2,2,4,5,6,6,6])
print('A set:', my_set)

A set: {1, 2, 4, 5, 6}


In [7]:
## Dictionary
my_dictionary = {"A": "Adenine", 
                 "C": "Cytosine", 
                 "G": "Guanine", 
                 "T": "Thymine"}
print('A dictionary:', my_dictionary)
print('Value associated to key C:', my_dictionary['C'])

A dictionary: {'A': 'Adenine', 'C': 'Cytosine', 'G': 'Guanine', 'T': 'Thymine'}
Value associated to key C: Cytosine


## Functions used so far...

In [8]:
my_list = ['A', 'C', 'A', 'T', 'G']
print('There are', len(my_list), 'elements in the list', my_list)
print('There are', my_list.count('A'), 'letter A in the list', my_list)
print("ATG TCA CCG GGC".split())

There are 5 elements in the list ['A', 'C', 'A', 'T', 'G']
There are 2 letter A in the list ['A', 'C', 'A', 'T', 'G']
['ATG', 'TCA', 'CCG', 'GGC']


# Part 2.1: Conditional execution

-----

 - Code blocks
 - Conditional execution


## Program control and logic

A program will normally run by executing the stated commands one after the other, in sequential order. Frequently, however, you will need the program to deviate from this. There are several ways of diverting from the line-by-line paradigm:

- With conditional statements. Here you can check if some statement or expression is true, and if it is then you continue on with the following block of code, otherwise you might skip it or execute a different bit of code.

- By performing repetitive loops through the same block of code, where each time through the loop different values may be used for the variables.

- Through the use of functions (subroutines) where the program’s execution jumps from a particular line of code to an entirely different spot, even in a different file or module, to do a task before (usually) jumping back again. Functions are covered in the next session, so we will not discuss them yet.

- By checking if an error or exception occurs, i.e. something illegal has happened, and executing different blocks of code accordingly

## Code blocks

With all of the means by which Python code execution can jump about we naturally need to be aware of the boundaries of the block of code we jump into, so that it is clear at what point the job is done, and program execution can jump back again. In essence it is required that the end of a function, loop or conditional statement be defined, so that we know the bounds of their respective code blocks.

Python uses indentation to show which statements are in a block of code, whereas other languages use specific `begin` and `end` statements or curly braces `{}`. It doesn't matter how much indentation you use, but the whole block must be consistent, i.e. if the first statement is indented by four spaces, the rest of the block must be indented by the same amount. The Python style guide recommends 4-space indentation. Use spaces rather than tabs, since different editors display tab characters with different widths.

The use of indentation to delineate code blocks is illustrated in an abstract manner in the following scheme: 

Statement 1:

    Command A – in the block of statement 1
    Command B – in the block of statement 1
  
    Statement 2:
        Command C – in the block of statement 2
        Command D – in the block of statement 2
  
    Command E – back in the block of statement 1

Command F – outside all statement blocks
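As a concrete counterpart to the scheme above, here is a minimal sketch using a `for` loop and an `if` statement (both are introduced later in this session). The variable names and data are made up for illustration.

```python
bases = ['A', 'T', 'G']             # hypothetical example data
g_count = 0

for base in bases:                  # statement 1
    print("Checking", base)         # in the block of statement 1
    if base == 'G':                 # statement 2
        g_count += 1                # in the block of statement 2
    print("Done with", base)        # back in the block of statement 1

print("G bases found:", g_count)    # outside all statement blocks
```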


## Conditional execution

### The <tt>if</tt> statement

A conditional <tt>if</tt> statement is used to specify that some block of code should only be executed if an associated test passes, i.e. its conditional expression evaluates to <tt>True</tt>. This might also involve subsidiary checks using the <tt>elif</tt> statement to use an alternative block if the previous expression turns out to be <tt>False</tt>. There can even be a final <tt>else</tt> statement to do something if none of the checks are passed.

The following uses statements that test whether a number is less than zero, greater than zero or otherwise equal to zero and will print out a different message in each case:

In [None]:
x = -3

if x > 0:
    print("Value is positive")

elif x < 0:
    print("Value is negative")

else:
    print("Value is zero")

The general form of writing out such combined conditional statements is as follows:

<pre>
if conditionalExpression1:
    # codeBlock1

elif conditionalExpression2:
    # codeBlock2

elif conditionalExpressionN:
    # codeBlockN
    # ...any number of additional elif statements, then finally:

else:
    # codeBlockE
</pre>


The <tt>elif</tt> block is optional, and we can use as many as we like. The <tt>else</tt> block is also optional, so a conditional may consist of only the <tt>if</tt> statement, which is a fairly common situation. It is often good practice to include <tt>else</tt> where possible though, so that you always catch the cases that pass none of the checks; otherwise such values might go unnoticed, which might not be the desired behaviour.

Placeholders are needed for “empty” code blocks:

In [None]:
gene = "BRCA2"
geneExpression = -1.2

if geneExpression < 0:
    print(gene, "is downregulated")
        
elif geneExpression > 0:
    print(gene, "is upregulated")
        
else:
    pass

For very simple conditional checks, you can write the `if` statement on a single line as a single expression, and the result will be the expression before the `if` if the condition is true or the expression after the `else` otherwise.



In [None]:
x = 11

if x < 10:
    s = "Yes"
else:
    s = "No"
print(s)

# Could also be written onto one line
s = "Yes" if x < 10 else "No"
print(s)

### Comparisons and truth

With conditional execution the question naturally arises as to which expressions are deemed to be true and which false. For the Python boolean values <tt>True</tt> and <tt>False</tt> the answer is (hopefully) obvious. The logical states of truth and falsehood that result from conditional checks like “Is x greater than 5?” or “Is y in this list?” are also clear. For comparing values, Python has the standard comparison (or relational) operators, some of which we have already seen:

| Operator | Description | Example |
|----------|-------------|---------|
| `==` | equal to | 1 == 2 # False |
| `!=` | not equal to | 1 != 2 # True |
| `<` | less than | 1 < 2 # True |
| `<=` | less than or equal to | 2 <= 2 # True |
| `>` | greater than | 1 > 2 # False |
| `>=` | greater than or equal to | 1 >= 1 # True |

It is notable that comparison operations can be combined, for example to check if a value is within a range.

In [None]:
x = -5

if x > 0 and x < 10:
    print("In range A")
    
elif x < 0 or x > 10:
    print("In range B")
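Python also allows comparisons to be chained, which is often a more readable way to write a range check. The short sketch below, with a value of `x` chosen just for illustration, shows that the chained form gives the same answer as the version using `and`.

```python
x = 5   # example value

# Chained form: equivalent to (0 < x) and (x < 10)
in_range_chained = 0 < x < 10
in_range_and = x > 0 and x < 10

print(in_range_chained, in_range_and)
```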

Python has two additional comparison operators <tt>is</tt> and <tt>is not</tt>.

These compare whether two objects are the same object, whereas <tt>==</tt> and <tt>!=</tt> compare whether values are the same.

There is a simple rule of thumb to tell when to use `==` or `is`:
- `==` is for **value** equality. Use it to check if two objects have the same value.
- `is` is for **reference** equality. Use it to check if two references refer to the same object.

*Something to note*: you will get unexpected and inconsistent results if you mistakenly use `is` to compare the values of integers:

In [None]:
a = 500
b = 500
print(a == b) # True
print(a is b) # may be True or False, depending on how Python stores the two integers

Another example with lists `x`, `y`, and `z`:
- `y` being a copy of `x`
- and `z` being another reference to the same object as `x`

In [None]:
x = [123, 54, 92, 87, 33]
y = x[:] # y is a copy of x
z = x
print(x)
print(y)
print(z)
print("Are values of y and x the same?", y == x)
print("Are objects y and x the same?", y is x)
print("Are values of z and x the same?", z == x)
print("Are objects z and x the same?", z is x)

In [None]:
# Let's change x
x[1] = 23
print(x)
print(y)
print(z)
print("Are values of y and x the same?", y == x)
print("Are objects y and x the same?", y is x)
print("Are values of z and x the same?", z == x)
print("Are objects z and x the same?", z is x)
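One place where `is` is the usual choice is checking against `None`, since there is only ever one `None` object. A minimal sketch, where `best_hit` is a made-up variable name:

```python
best_hit = None   # hypothetical: no result has been found yet

if best_hit is None:
    message = "No hit found"
else:
    message = "Best hit: " + best_hit

print(message)
```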

In Python even expressions that do not involve an obvious boolean value can be assigned a status of "truthfulness";  the value of an item itself can be forced to be considered as either True or False inside an if statement. For the Python built-in types discussed in this chapter the following are deemed to be False in such a context:

| False value | Description |
|-------------|-------------|
| `None` | the null object (no value) |
| `False` | False boolean |
| `0` | 0 integer |
| `0.0` | 0.0 floating point |
| `""` | empty string |
| `()` | empty tuple |
| `[]` | empty list |
| `{}` | empty dictionary |
| `set()` | empty set |

And everything else is deemed to be True in a conditional context.

In [None]:
x = ''      # An empty string
y = ['a']   # A list with one item

if x:
    print("x is true")
else: 
    print("x is false")     

if y:
    print("y is true")
else:
    print("y is false")
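If you are unsure how a value will be treated, the built-in `bool()` function returns its truth value directly. The sketch below checks the values from the table above alongside a few non-empty counterparts chosen for illustration. Note that `[0]` counts as true, because the list itself is not empty.

```python
# Values from the table: all considered false
empty_results = [bool(None), bool(0), bool(0.0), bool(""),
                 bool(()), bool([]), bool({}), bool(set())]
print(empty_results)

# Non-zero numbers and non-empty collections: all considered true
non_empty_results = [bool("ATG"), bool([0]), bool(-1.5)]
print(non_empty_results)
```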

## Exercises 2.1.1

Create an `if..elif..else` block that will compare a variable containing your age to another variable containing another person's age and print a statement which says if you are younger, older or the same age as that person.

## Exercises 2.1.2

Use an `if` statement to check if a variable containing a DNA sequence contains a stop codon (e.g. `dna = "ATGGCGGTCGAATAG"`). First just check for one possible stop codon, then extend your code to look for any of the 3 stop codons (`TAG`, `TAA`, `TGA`). Hint: recall that the `in` operator lets you check if a string contains some substring, and returns `True` or `False` accordingly.

## Next session

Go to our next notebook: [python_basic_2_2](python_basic_2_2.ipynb)

# Part 2.2: Loops

- The <tt>for</tt> loop
- The <tt>while</tt> loop
- Skipping and breaking loops
- More looping using `range()` and `enumerate()`
- Filtering in loops


## Loops

When an operation needs to be repeated multiple times, for example on all of the items in a list, we 
avoid having to type (or copy and paste) repetitive code by creating a loop. There are two ways of creating loops in Python, the <tt>for</tt> loop and the <tt>while</tt> loop.

## The <tt>for</tt> loop

The <tt>for</tt> loop in Python iterates over each item in a sequence (such as a list or tuple) in the order that they appear in the sequence. What this means is that a variable (<tt>code</tt> in the example below) is set to each item from the sequence of values in turn, and each time this happens the indented block of code is executed again.

In [None]:
codeList = ['NA06984', 'NA06985', 'NA06986', 'NA06989', 'NA06991']

for code in codeList:
    print(code)

A <tt>for</tt> loop can iterate over the individual characters in a string:

In [None]:
dnaSequence = 'ATGGTGTTGCC'

for base in dnaSequence:
    print(base)

And also over the keys of a dictionary: 

In [None]:
rnaMassDict = {"G":345.21, "C":305.18, "A":329.21, "U":302.16}

for x in rnaMassDict:
    print(x, rnaMassDict[x])
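When both the key and the value are needed, a common alternative is the dictionary's `.items()` method, which has not been covered above. It yields each key together with its value, so no separate lookup is required. A minimal sketch using the same dictionary:

```python
rnaMassDict = {"G": 345.21, "C": 305.18, "A": 329.21, "U": 302.16}

# .items() yields (key, value) pairs, unpacked here into two variables
for base, mass in rnaMassDict.items():
    print(base, mass)
```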

Any variables that are defined before the loop can be accessed from inside the loop. So for example to calculate the summation of the items in a list of values we could define the total initially to be zero and add each value to the total in the loop:

In [None]:
total = 0
values = [1, 2, 4, 8, 16]

for v in values:
    total = total + v
    # equivalent shorthand: total += v
    print(total)

print(total)

Naturally we can combine a <tt>for</tt> loop with an <tt>if</tt> statement, noting that we need two indentation levels, one for the outer loop and another for the conditional blocks:

In [None]:
geneExpression = {
    'Beta-Catenin': 2.5, 
    'Beta-Actin': 1.7, 
    'Pax6': 0, 
    'HoxA2': -3.2
}

for gene in geneExpression:
    if geneExpression[gene] < 0:
        print(gene, "is downregulated")
        
    elif geneExpression[gene] > 0:
        print(gene, "is upregulated")
        
    else:
        print("No change in expression of", gene)

## Exercises 2.2.1

1. Create a sequence where each element is an individual base of DNA. Make the sequence 15 bases long.
2. Print the length of the sequence.
3. Create a `for` loop to output every base of the sequence on a new line.

## The <tt>while</tt> loop

In addition to the <tt>for</tt> loop that operates on a collection of items, there is a <tt>while</tt> loop that simply repeats while some expression evaluates to True and stops when it is False. Note that if the tested expression never evaluates to False then you have an “infinite loop”, which never finishes on its own and has to be interrupted.

In this example we generate a series of numbers by doubling a value after each iteration, until a limit is reached: 

In [None]:
value = 0.25
while value < 8:
    value = value * 2
    print(value)

print("final value:", value)

What's going on here is that the value is doubled in each iteration, and once it gets to 8 the <tt>while</tt> test fails (8 is not less than 8) and that last value is preserved. Note that if the test were instead `value <= 8` then we would get one more doubling and the value would reach 16.
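To see the effect of the test on the final value, here is the same loop with `<=` in place of `<`. The only change from the cell above is the comparison operator.

```python
value = 0.25
while value <= 8:      # note <= rather than <
    value = value * 2
    print(value)

print("final value:", value)
```

Because 8 passes the `<=` test, the block runs one more time and the loop ends with a final value of 16.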

## Exercises 2.2.2

1. Reuse the 15-base sequence created in the previous exercise, where each element is an individual base of DNA.
2. Create a <tt>while</tt> loop similar to the one above that starts at the third base in the sequence and outputs every third base until the 12th.

## Skipping and breaking loops

Python has two ways of affecting the flow of the <tt>for</tt> or <tt>while</tt> loop inside the block. The <tt>continue</tt> statement means that the rest of the code in the block is skipped for this particular item in the collection, i.e. jump to the next iteration. In this example negative numbers are left out of a summation:

In [None]:
values = [10, -5, 3, -1, 7]

total = 0
for v in values:
    if v < 0:
        continue # Skip this iteration   
    total += v

print(total)

The other way of affecting a loop is with the <tt>break</tt> statement. In contrast to the <tt>continue</tt> statement, this immediately causes the enclosing loop to finish, and execution is resumed at the next statement _after_ the loop.

In [None]:
geneticCode = {'TAT': 'Tyrosine',  'TAC': 'Tyrosine',
               'CAA': 'Glutamine', 'CAG': 'Glutamine',
               'TAG': 'STOP'}

sequence = ['CAG','TAC','CAA','TAG','TAC','CAG','CAA']

for codon in sequence:
    if geneticCode[codon] == 'STOP':
        break            # Quit looping at this point
    else:
        print(geneticCode[codon])

## Looping gotchas

An internal counter is used to keep track of which item is used next, and this is incremented on each iteration. When this counter has reached the length of the sequence the loop terminates. This means that if you delete the current item from the sequence, the next item will be skipped (since it gets the index of the current item which has already been treated). Likewise, if you insert an item in a sequence before the current item, the current item will be treated again the next time through the loop. This can lead to nasty bugs that can be avoided by making a temporary copy using a slice of the whole sequence.

<div class="alert-warning">
**When looping, never modify the collection!** Always create a copy of it first.
</div>
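The skipping behaviour described above can be sketched as follows. The lists are made-up example data: the first loop removes items from the list it is iterating over, and the second iterates over a slice copy instead.

```python
# Problematic: removing items from the list being looped over
skipped = [1, -2, -3, 4]
for v in skipped:
    if v < 0:
        skipped.remove(v)
print(skipped)   # -3 was never visited, so it is still in the list

# Safer: loop over a slice copy and modify the original
cleaned = [1, -2, -3, 4]
for v in cleaned[:]:
    if v < 0:
        cleaned.remove(v)
print(cleaned)
```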

## More looping

### Using `range()`

If you would like to iterate over a numeric sequence then this is possible by combining the `range()` function and a for loop.

In [None]:
print(list(range(10)))

print(list(range(5, 10)))

print(list(range(0, 10, 3)))

print(list(range(7, 2, -2)))

Looping through ranges:

In [None]:
for x in range(8):
    print(x*x)

In [None]:
squares = []
for x in range(8):
    s = x*x
    squares.append(s)
    
print(squares)
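The step argument of `range()` is handy for walking through a sequence in fixed-size chunks. As an illustrative sketch (the sequence is the example one from exercise 2.1.2), the loop below uses a step of 3 together with slicing to split a DNA string into codons.

```python
dna = "ATGGCGGTCGAATAG"   # example sequence

codons = []
for start in range(0, len(dna), 3):   # start takes the values 0, 3, 6, 9, 12
    codons.append(dna[start:start + 3])

print(codons)
```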

### Using `enumerate()`

Given a sequence, `enumerate()` allows you to iterate over the sequence generating a tuple containing each value along with a corresponding index.

In [None]:
letters = ['A','C','G','T']
for index, letter in enumerate(letters):
    print(index, letter)

In [None]:
numbered_letters = list(enumerate(letters))
print(numbered_letters)

## Filtering in loops

In [None]:
city_pops = {
    'London': 8200000,
    'Cambridge': 130000,
    'Edinburgh': 420000,
    'Glasgow': 1200000
}

big_cities = []
for city in city_pops:
    if city_pops[city] >= 1000000:
        big_cities.append(city)

print(big_cities)

In [None]:
total = 0
for city in city_pops:
    total += city_pops[city]
print("total population:", total)

In [None]:
pops = list(city_pops.values())
print("total population:", sum(pops))

## Formatting strings

Constructing more complex strings from a mix of variables of different types can be cumbersome, and sometimes you want more control over how values are interpolated into a string. Python provides a powerful mechanism for formatting strings: the built-in string method `.format()`, which uses "replacement fields" surrounded by curly braces `{}`. Each field starts with an optional field name, followed by a colon `:` and a format specification.

There are lots of these specifiers, but here are 3 useful ones:

    d: decimal integer
    f: floating point number
    s: string

You can specify the number of decimal places to use for a floating point number with, e.g., `.2f` for 2 decimal places, or `+.2f` for 2 decimal places with the sign always shown.

In [None]:
print('{:.2f}'.format(0.4567))

In [None]:
geneExpression = {
    'Beta-Catenin': 2.5, 
    'Beta-Actin': 1.7, 
    'Pax6': 0, 
    'HoxA2': -3.2
}

for gene in geneExpression:
    print('{:s}\t{:+.2f}'.format(gene, geneExpression[gene])) # s is optional
    # could also be written using variable names
    #print('{gene:s}\t{exp:+.2f}'.format(gene=gene, exp=geneExpression[gene]))
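The three specifiers listed above can be mixed in a single template. The following sketch uses made-up values to combine `s`, `d` and `f` in one string, and then shows the named-field form with an integer formatted as a signed float.

```python
gene = "BRCA2"      # example values, made up for illustration
reads = 43234
fraction = 12.5

# Positional fields: string, decimal integer, float with 1 decimal place
line = '{:s} has {:d} reads ({:.1f}%)'.format(gene, reads, fraction)
print(line)

# Named fields; the integer 0 is shown as a signed float
named = '{name:s}\t{change:+.2f}'.format(name='Pax6', change=0)
print(named)
```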

## Exercises 2.2.3

1. Let's calculate the GC content of a DNA sequence. Use the 15-base sequence you created for the exercises above. Create a variable, `gc`, which we will use to count the number of Gs or Cs in our sequence.
2. Output every base of the sequence alongside its index on a new line.
3. Create a loop to iterate over the bases in your sequence. If the base is a G or the base is a C, add one to your `gc` variable.
4. When the loop is done, divide the number of GC bases by the length of the sequence and multiply by 100 to get the GC percentage. Format the result to only display 2 decimal places.

# Part 2.3: Files

-----

- Using files
- Reading from files
- Writing to files


# Data input and output (I/O)

So far, all the data we have been working with has been written by us into our scripts, and the results of our computations have just been displayed in the terminal output. In the real world, data will be supplied by the user of our programs (who may be you!) by some means, and we will often want to save the results of some analysis somewhere more permanent than just printing them to the screen. In this session we cover reading data into our programs from files on disk, and we also discuss writing data out to files.

There are, of course, many other ways of accessing data, such as querying a database or retrieving data from a network such as the internet. We don't cover these here, but Python has excellent support for interacting with databases and networks, either in the standard library or using external modules.

## Using files

Frequently the data we want to operate on or analyse will be stored in files, so in our programs we need to be able to open files, read through them (perhaps all at once, perhaps not), and then close them. 

We will also frequently want to be able to print output to files rather than always printing out results to the terminal.

Python supports all of these modes of operations on files, and provides a number of useful functions and syntax to make dealing with files straightforward.

### Opening files

To open a file, Python provides the `open` function, which takes a filename as its first argument and returns a _file object_, which is Python's internal representation of the file.

In [None]:
path = "data/datafile.txt"
fileObj = open( path )

`open` takes an optional second argument specifying the _mode_ in which the file is opened, either for reading, writing or appending.

It defaults to `'r'` which means open for reading in text mode. Other common values are `'w'` for writing (truncating the file if it already exists) and `'a'` for appending.

In [None]:
open( "data/myfile.txt", "r" ) # open for reading, default

In [None]:
open( "data/myfile.txt", "w" ) # open for writing (existing files will be overwritten)

In [None]:
open( "data/myfile.txt", "a" ) # open for appending

### Closing files

To close a file once you have finished with it, you can call the `.close` method on the file object.

In [None]:
fileObj.close()

### Mode modifiers

These mode strings can include some extra modifier characters to deal with issues with files across multiple platforms.

`'b'`: binary mode, e.g. `'rb'`. The data is read and written as raw bytes, with no translation of end-of-line characters to the platform-specific convention.

|Character | Meaning |
|----------|---------|
|`'r'` |	open for reading (default) |
|`'w'` |	open for writing, truncating the file first |
|`'x'` |	open for exclusive creation, failing if the file already exists |
|`'a'` |	open for writing, appending to the end of the file if it exists |
|`'b'` |	binary mode |
|`'t'` |	text mode (default) |
|`'+'` |	open a disk file for updating (reading and writing) |

## Reading from files

Once we have opened a file for reading, file objects provide a number of methods for accessing the data in a file. The simplest of these is the `.read` method that reads the entire contents of the file into a string variable.



In [None]:
fileObj = open( "data/datafile.txt" )
print(fileObj.read()) # everything
fileObj.close()

Note that this means the entire file will be read into memory. If you are operating on a large file and don't actually need all the data at the same time this is rather inefficient.

Frequently, we just need to operate on individual lines of the file, and you can use the `.readline` method to read a line from a file and return it as a Python string.

File objects internally keep track of your current location in a file, so to get following lines from the file you can call this method multiple times.

It is important to note that the string representing each line will have a trailing newline `"\n"` character, which you may want to remove with the `.rstrip` string method.

Once the end of the file is reached, `.readline` will return an empty string `''`. This is different from an apparently empty line in a file, as even an empty line will contain a newline character. Recall that the empty string is considered as `False` in Python, so you can readily check for this condition with an `if` statement or a `while` loop.

In [None]:
# one line at a time
fileObj = open( "data/datafile.txt" )
print("1st line:", fileObj.readline())
print("2nd line:", fileObj.readline())
print("3rd line:", fileObj.readline())
print("4th line:", fileObj.readline())
fileObj.close()
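The end-of-file behaviour of `.readline` can be combined with a `while` loop to read a file of unknown length. The sketch below is self-contained: it first writes a small throwaway file (its name and contents are made up for illustration) and then reads it back line by line until the empty string is returned.

```python
import os
import tempfile

# Create a small throwaway file so the sketch is self-contained
path = os.path.join(tempfile.mkdtemp(), "example.txt")
fileObj = open(path, "w")
fileObj.write("ATGC\n\nGGCC\n")    # the middle line is empty
fileObj.close()

fileObj = open(path)
lines = []
line = fileObj.readline()
while line:                        # '' (end of file) is false, '\n' is true
    lines.append(line.rstrip())
    line = fileObj.readline()
fileObj.close()

print(lines)
```

The empty middle line is read as `'\n'`, which is a non-empty string, so the loop does not stop there.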

To read in all lines from a file as a list of strings containing the data from each line, use the `.readlines` method (though note that this will again read all data into memory).

In [None]:
# all lines
fileObj = open( "data/datafile.txt" )

lines = fileObj.readlines()

print("The file has", len(lines), "lines")

fileObj.close()

Looping over the lines in a file is a very common operation, and Python lets you iterate over a file using a `for` loop just as if it were a list of strings. This does not read all data into memory at once, and so is much more efficient than reading the file with `.readlines` and then looping over the resulting list.

In [None]:
# as an iterable
fileObj = open( "data/datafile.txt" )

for line in fileObj:
    print(line.rstrip().upper())

fileObj.close()

### The with statement

It is important that files are closed when they are no longer required, but writing ``fileObj.close()`` is tedious (and more importantly, easy to forget). An alternative syntax is to open the files within a ``with`` statement, in which case the file will automatically be closed at the end of the `with` block.

In [None]:
# fileObj will be closed when leaving the block
with open( "data/datafile.txt" ) as fileObj:
    for ( i, line ) in enumerate( fileObj, start = 1 ):
        print( i, line.strip() )
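One way to convince yourself that the file really is closed is to inspect the `.closed` attribute of the file object. The sketch below writes to a throwaway file (the path is created just for this example) and records the attribute inside and after the `with` block.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example.txt")

with open(path, "w") as fileObj:
    fileObj.write("ATGC\n")
    inside = fileObj.closed    # False: still open inside the block

outside = fileObj.closed       # True: closed on leaving the block
print(inside, outside)
```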

## Exercises 2.3.1

Write a script that reads a file containing many lines of nucleotide sequence. For each line in the file, print out the line number, the length of the sequence and the sequence (there is an example file <a href="http://www.ebi.ac.uk/~grsr/perl/dna.txt">here</a>, or in `data/dna.txt` in the course materials).

## Writing to files

Once a file has been opened for writing, you can use the `.write()` method on a file object to write data to the file.

The argument to the `.write()` method must be a string, so if you want to write out numerical data to a file you will have to convert it to a string somehow beforehand.

<div class="alert-warning">**Remember** to include a newline character `\n` to separate lines of your output: unlike the `print()` function, `.write()` does not add one by default.</div>

In [None]:
read_counts = {
    'BRCA2': 43234,
    'FOXP2': 3245,
    'SORT1': 343792
}

with open( "out.txt", "w" ) as output:
    output.write("GENE\tREAD_COUNT\n")

    for gene in read_counts:
        line = "\t".join( [ gene, str(read_counts[gene]) ] )
        output.write(line + "\n")

To view the output file, open a terminal window, go to the directory where the file has been written, and print the content of the file using the `cat` command, or open it using your favourite editor:

```bash
cat out.txt
```

Be cautious when opening a file for writing, as Python will happily let you overwrite any existing data in the file.
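If overwriting would be a problem, the `'x'` mode from the table of modes is one option: it raises a `FileExistsError` instead of truncating an existing file. The sketch below uses a throwaway path and a `try`/`except` block, which is not covered in this session, to show that the original content survives.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "out.txt")

with open(path, "w") as output:
    output.write("first version\n")

try:
    with open(path, "x") as output:   # 'x' refuses to open an existing file
        output.write("second version\n")
    overwritten = True
except FileExistsError:
    overwritten = False
    print("File already exists, not overwriting")

with open(path) as fileObj:
    content = fileObj.read()
print(content)
```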

## Exercises 2.3.2

Create a script that writes the values of a list of numbers to a file, with each number on a separate line.

# Part 2.4: Delimited files

------

- Data formats


## Data formats

Bioinformaticians love creating endless new file formats for their data, but there is one very common format, the delimited text file, that it is good to get used to parsing.

Delimited file example:
```
X 169008682 1 111267453 1.0976
2 8265484 5 69763543 4.9825
MT 10924 MT 81934 7.2357
3 127 8 10908776 1.2509
```

### Reading delimited files

We can use the various string manipulation techniques covered earlier to process delimited files in a fairly straightforward way. Here we loop through a file with columns delimited by spaces, reading the data for each row into a list, and storing each of these lists into a main results list.

To view an example of a delimited file, open a terminal window, go to the course directory, and print the content of the file using the `cat` command, or open it using your favourite editor:

```bash
cat data/mydata.txt
```

```
Index Organism Score
1 Human 1.076
2 Mouse 1.202
3 Frog 2.2362
4 Fly 0.9853
```

In [None]:
results = []

with open("data/mydata.txt", "r") as data:
    header = data.readline()
    for line in data:
        results.append(line.split())
        
        
print(results)

Here we show a slightly more complicated example where we are reading the results into a more convenient data structure, a list of dictionaries with the dictionary keys corresponding to the column headers and the values to the values from each line. We also convert the columns to an appropriate type as we go.

In [None]:
results = []

with open("data/mydata.txt", "r") as data:
    header = data.readline()
    for line in data:
        idx, org, score = line.split()
        row = {'Index': int(idx), 'Organism': org, 'Score': float(score)}
        results.append(row)
        
print(results)
print('Score of first row:', results[0]['Score'])
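In the example above the header line is read but not used, and the column names are typed in by hand. As an alternative sketch, the header itself can supply the dictionary keys by pairing it with each row using the built-in `zip()` function. Here `io.StringIO` stands in for an open file so that the example is self-contained, and the values are left as strings rather than converted.

```python
import io

# io.StringIO stands in for an open file so the sketch is self-contained
data = io.StringIO("Index Organism Score\n1 Human 1.076\n2 Mouse 1.202\n")

columns = data.readline().split()            # ['Index', 'Organism', 'Score']
results = []
for line in data:
    row = dict(zip(columns, line.split()))   # pair each header with its value
    results.append(row)

print(results)
```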

### Writing delimited files

Writing out a delimited file is also straightforward using the `join` method. Here, as an example we will recreate our original file from above, but this time we will delimit the columns with a comma.

In [None]:
mydata = [{'Organism': 'Human', 'Index': 1, 'Score': 1.076}, 
          {'Organism': 'Mouse', 'Index': 2, 'Score': 1.202}, 
          {'Organism': 'Frog', 'Index': 3, 'Score': 2.2362}, 
          {'Organism': 'Fly', 'Index': 4, 'Score': 0.9853}]

with open('data/mydata.csv', 'w') as output:
    # write a header
    header = ",".join(['Index', 'Organism', 'Score'])
    output.write(header + "\n")
    for row in mydata:
        line = ",".join([str(row['Index']), row['Organism'], str(row['Score'])])
        output.write(line + "\n")

To view the output file, open a terminal window, go to the course directory, and print the content of the file using the `cat` command, or open it using your favourite editor:

```bash
cat data/mydata.csv
```

```
Index,Organism,Score
1,Human,1.076
2,Mouse,1.202
3,Frog,2.2362
4,Fly,0.9853
```

## Last but not least

### A big thank you!

### Remember...
- Our course webpage: http://pycam.github.io
- The Python website: https://www.python.org/ 
- To fill the course survey ;-)
- To come to our next course 'Working with Python: functions and modules' and register at https://training.csx.cam.ac.uk/

## Exercises 2.4.1

Write a script that reads a tab delimited file which has 4 columns (gene, chromosome, start and end coordinates), computes each gene's length and stores it in a dictionary, and writes the results into a new tab separated file. You can find a data file in `data/genes.txt` in the course materials.

## Exercises 2.4.2 

Read the lyrics of Imagine by John Lennon, 1971 from the file in `data/imagine.txt`. Split the text into words. Print the total number of words, and the number of distinct words. Calculate the frequency of each distinct word and store the result into a dictionary. Print each distinct word along with its frequency. Find the most frequent word longer than 3 characters in the song, print it with its frequency.

<center><img src="https://upload.wikimedia.org/wikipedia/en/1/1d/John_Lennon_-_Imagine_John_Lennon.jpg"/></center>

## Exercises 2.4.3 
#### Real life example

You have a tab separated file, `data/yeast_genes.txt`, which contains information about all the yeast (*S.cerevisiae*) genes:

`Systematic_name	Standard_name	Chromosome	Start	    End
YBR127C             VMA2             chrII         491269      492822
YBR128C             ATG14            chrII         493081      494115
...
`

For every gene, its location and coordinates are recorded. 
You should read through the file and store the data into an appropriate structure.
Then answer these questions:

- How many genes are there in *S.cerevisiae*?
- Which is the longest and which is the shortest gene?
- How many genes per chromosome? Print the number of genes per chromosome.
- For each chromosome, what is the longest and what is the shortest gene?
- For each chromosome, how many genes on the Watson strand and how many genes on the Crick strand?

**bonus** 

- What is the chromosome with the highest gene density? You can calculate the length of each chromosome assuming that they start at 1 and they end at the end (if on the Watson strand) or at the start (if on the Crick strand) of their last gene. Then you can calculate the length of all the genes on each chromosome and the ratio between coding vs. noncoding regions.