# Hello Python!

Here, we'll take some first steps with the Python programming language. This tutorial starts from scratch so, if you've never written a line of code in your life, you've come to the right place!

### What we're aiming to do

We often find that knowing just a little bit of programming can get you a long way. We're aiming to teach you enough that you know how to:

1. Identify problems that you can solve with a bit of programming
2. Write and execute simple scripts
3. Find help when you get stuck
    
### What we're not trying to do

This is not a comprehensive programming course! You're not going to be Mr. Robot or Neo. You won't learn every trick in the book, but hopefully just enough to get started.

### Aren't there lots of these tutorials on the internet?

Absolutely! Go check them out! Some recommendations are below:

- Resource 1.
- Resource 2.
- Resource 3.

# Getting started

If you've never written any Python at all before, the first thing we'll have to do is install Python itself. Yes, programming languages have to be installed just like anything else. 

To do this, go to: https://www.python.org/

Hover over the 'Download' menu. The web site should detect the operating system you're using and offer up the correct package. You're looking for Python 3.10.

Download the package and install using the installation wizard.

<img src="img/download.PNG" width="75%"/>

Once Python is installed, go to your start menu. In the 'Recently added' section, you should see **IDLE (Python 3.10 64-bit)**. Start this application.

<img src="img/run_idle.PNG" width="50%"/>

We should see something like this:

<img src="img/idle_start.PNG" width="50%"/>

This is what we call a Python 'interpreter'. It 'interprets' the code we write in Python and converts it to something that computers understand (0's and 1's).

Let's write some code!
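As a first experiment, you could type the two lines below at the `>>>` prompt, pressing Enter after each one. This is only a small sketch to check that everything works; the variable name `greeting` is our own choice for illustration.

```python
# store some text in a variable (the name 'greeting' is arbitrary)
greeting = 'Hello Python!'

# print() displays the value on the screen
print(greeting)
```

If you see the text appear on the next line, Python is installed and working.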

## Python basic data types

We'll start with three of Python's basic data types:

- Integer (referred to as int): whole numbers, either positive or negative or zero
- Float (referred to as float): decimal numbers, either positive or negative or zero
- String (referred to as str): text (can include numbers)

We'll mostly be working with strings and integers today so let's have a look at these.
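If you're ever unsure which type a value has, the built-in `type()` function will tell you. The short sketch below uses made-up variable names and values purely for illustration.

```python
# type() reports the data type of any value
whole_number = 42         # an integer (int)
decimal_number = 3.14     # a float
some_text = 'ACGT 123'    # a string (str), which may contain digits

print(type(whole_number))     # <class 'int'>
print(type(decimal_number))   # <class 'float'>
print(type(some_text))        # <class 'str'>

# int() and str() convert between types where that makes sense
seven_plus_one = int('7') + 1    # the text '7' becomes the number 7
seven_and_one = str(7) + '1'     # the number 7 becomes the text '7'
print(seven_plus_one)            # 8
print(seven_and_one)             # 71
```

Notice that adding two strings joins them together, while adding two integers does arithmetic.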

## Strings

A string in Python is a sequence of characters. These can include letters, numbers, spaces or other special characters. Strings are denoted with either single ```''``` or double ```""``` quotes.

Below are some properties of strings:

Strings can be added together:
```
>>> x = "hello"
>>> x += ' everyone'
>>> x
'hello everyone'
```

Strings can be tested to see whether they contain other strings
```
>>> 'hell' in x
True
```

Strings can be sliced using index positions (counting starts at 0, and the end position is not included)
```
>>> x[0:10]
'hello ever'
```
Strings can be split on a given character (here, on a space ```' '```)
```
>>> x.split(' ')
['hello', 'everyone']
```
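Indexing trips up most beginners at some point, so here is a slightly longer sketch of how positions work. It reuses the same `'hello everyone'` string; the extra variable names are just for illustration.

```python
x = 'hello everyone'

first = x[0]        # the first character is at position 0
last = x[-1]        # negative positions count back from the end
word = x[0:5]       # positions 0 to 4; the end position is excluded
rest = x[6:]        # from position 6 to the end of the string

print(first)    # h
print(last)     # e
print(word)     # hello
print(rest)     # everyone
print(len(x))   # 14
```

The negative index trick (`[-1]` for "the last item") also works on lists, which will come in handy later on.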

## Integers

Integers are straightforward: we can add, subtract, multiply and divide them.
```
>>> y = 100
>>> y += 10
>>> y
110
```
```
>>> z = 100
>>> z -= 65
>>> z
35
```
```
>>> a = 4
>>> b = 2
>>> a * b
8
```
```
>>> a / b
2.0
```
Note that, for division, the result is a float (decimal number). This behaviour is different between Python 2 and Python 3 (but we'll stick with 3 so don't worry!)
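If you do want a whole-number answer in Python 3, there are two related operators worth knowing about. The sketch below uses made-up values to show them side by side.

```python
a = 7
b = 2

true_division = a / b      # always gives a float
floor_division = a // b    # rounds down to a whole number
remainder = a % b          # what is left over after dividing

print(true_division)    # 3.5
print(floor_division)   # 3
print(remainder)        # 1
print(4 / 2)            # 2.0 (still a float, even when the answer is whole)
```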

## Python basic data structures

Data structures refer to different ways that we can organise and access our data (strings, integers, floats etc). Python has many different data structures but you can go a long way with programming by understanding just a few:

- Lists
- Tuples
- Dictionaries
- Sets
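To give a flavour of each one, here is a minimal sketch. The variable names and values are invented for illustration; the one-letter codes are amino acids like the ones we'll meet in the protein sequences later.

```python
# list: an ordered collection that can be changed
residues = ['M', 'E', 'E', 'P']
residues.append('Q')

# tuple: an ordered collection that cannot be changed once created
position_and_residue = (273, 'R')

# dictionary: looks up a value by its key
full_names = {'R': 'arginine', 'H': 'histidine'}

# set: an unordered collection with no duplicates
unique_residues = set(residues)

print(residues)                  # ['M', 'E', 'E', 'P', 'Q']
print(position_and_residue[0])   # 273
print(full_names['H'])           # histidine
print(len(unique_residues))      # 4
```

Lists are the structure we'll lean on most in the rest of this tutorial.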



## Scripts

Above, we were typing code into the interpreter line by line, which quickly becomes tedious. We would be better off writing all our code into a file (a 'script') and then telling Python to run all the code at once.
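As a rough sketch, a script is nothing more than a text file containing Python code. In IDLE, you can create one with 'File' > 'New File', save it (the name `hello.py` here is just an example), and run it with 'Run' > 'Run Module'.

```python
# contents of a script file, here hypothetically saved as hello.py
name = 'everyone'
message = 'hello ' + name

# unlike the interpreter, a script only shows what we explicitly print
print(message)
```

Note the last comment: in the interpreter, typing a variable name displays its value, but in a script we need `print()` to see anything.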



## Working with files

We frequently need to use Python to open a file on our computer and work with the contents. Let's look at an example:

```
# open the FASTA file in 'read' mode
fasta_file = open('fasta/small.fasta','r').readlines()

# iterate through lines
for line in fasta_file:
    print(line.strip())
```

```
>sp|Q9ULX7|CAH14_HUMAN Carbonic anhydrase 14 OS=Homo sapiens OX=9606 GN=CA14 PE=1 SV=1
MLFSALLLEVIWILAADGGQHWTYEGPHGQDHWPASYPECGNNAQSPIDIQTDSVTFDPD
LPALQPHGYDQPGTEPLDLHNNGHTVQLSLPSTLYLGGLPRKYVAAQLHLHWGQKGSPGG
SEHQINSEATFAELHIVHYDSDSYDSLSEAAERPQGLAVLGILIEVGETKNIAYEHILSH
LHEVRHKDQKTSVPPFNLRELLPKQLGQYFRYNGSLTTPPCYQSVLWTVFYRRSQISMEQ
LEKLQGTLFSTEEEPSKLLVQNYRALQPLNQRMVFASFIQAGSSYTTGEMLSLGVGILVG
CLCLLLAVYFIARKIRKKRLENRKSVVFTSAQATTEA
```


Let's explain the above:

 - On line 2, we use ```open``` to access our file
     - the first argument ```'fasta/small.fasta'``` is the path to the file we want to access
     - the second argument ```'r'``` tells Python to open the file in 'read' mode
     - the ```.readlines()``` call tells Python to read the file into a list of individual lines
 - On line 5, we iterate through the lines of the file
 - On line 6, we print the line
     - The ```.strip()``` call removes whitespace (spaces, tabs and the invisible newline character) from the start and end of the text

Let's add a filter in our ```for``` loop so we print only the FASTA header line

```
# open the FASTA file in 'read' mode
fasta_file = open('fasta/small.fasta','r').readlines()

# iterate through lines
for line in fasta_file:
    if '>' in line:
        print(line.strip())
```

```
>sp|Q9ULX7|CAH14_HUMAN Carbonic anhydrase 14 OS=Homo sapiens OX=9606 GN=CA14 PE=1 SV=1
```


What if we wanted to print the sequence and not the header lines?

```
# open the FASTA file in 'read' mode
fasta_file = open('fasta/small.fasta','r').readlines()

# iterate through lines
for line in fasta_file:
    if '>' not in line:
        print(line.strip())
```

```
MLFSALLLEVIWILAADGGQHWTYEGPHGQDHWPASYPECGNNAQSPIDIQTDSVTFDPD
LPALQPHGYDQPGTEPLDLHNNGHTVQLSLPSTLYLGGLPRKYVAAQLHLHWGQKGSPGG
SEHQINSEATFAELHIVHYDSDSYDSLSEAAERPQGLAVLGILIEVGETKNIAYEHILSH
LHEVRHKDQKTSVPPFNLRELLPKQLGQYFRYNGSLTTPPCYQSVLWTVFYRRSQISMEQ
LEKLQGTLFSTEEEPSKLLVQNYRALQPLNQRMVFASFIQAGSSYTTGEMLSLGVGILVG
CLCLLLAVYFIARKIRKKRLENRKSVVFTSAQATTEA
```


We've got our sequence above but there's a small problem - it's split across 6 lines. Ideally, we'd like our sequence to be one continuous string so we can easily compare it to other sequences.

```
# open the FASTA file in 'read' mode
fasta_file = open('fasta/small.fasta','r').readlines()

# create an 'empty' string using the '' or "" quote marks
this_protein = ''

# iterate through lines
for line in fasta_file:
    if '>' not in line:

        # here, we add the new line to our empty string
        this_protein += line.strip()

# print the result
print(this_protein)
```

```
MLFSALLLEVIWILAADGGQHWTYEGPHGQDHWPASYPECGNNAQSPIDIQTDSVTFDPDLPALQPHGYDQPGTEPLDLHNNGHTVQLSLPSTLYLGGLPRKYVAAQLHLHWGQKGSPGGSEHQINSEATFAELHIVHYDSDSYDSLSEAAERPQGLAVLGILIEVGETKNIAYEHILSHLHEVRHKDQKTSVPPFNLRELLPKQLGQYFRYNGSLTTPPCYQSVLWTVFYRRSQISMEQLEKLQGTLFSTEEEPSKLLVQNYRALQPLNQRMVFASFIQAGSSYTTGEMLSLGVGILVGCLCLLLAVYFIARKIRKKRLENRKSVVFTSAQATTEA
```
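As an aside, there is another common way to glue pieces of text together: collect them in a list and combine them with `''.join()`. The sketch below uses a short made-up list of lines in place of a real file so that it runs on its own; the variable names are our own.

```python
# a few lines standing in for the contents of a FASTA file
# (made-up data so the example runs without any files)
fasta_lines = ['>example header\n', 'MLFSAL\n', 'LLEVIW\n', 'ILAADG\n']

# collect the cleaned-up sequence lines in a list
sequence_parts = []
for line in fasta_lines:
    if '>' not in line:
        sequence_parts.append(line.strip())

# ''.join() glues the pieces together with nothing in between
this_protein = ''.join(sequence_parts)
print(this_protein)    # MLFSALLLEVIWILAADG
```

Both approaches give the same result, so use whichever you find easier to read.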


OK, so we know how to assemble the whole protein sequence into one string. So far, this works fine for our small fasta file (which only has one protein), but what if the file had more?

Let's write some code that uses the ```medium.fasta``` file as input.

First, we'll count the number of proteins:

```
# open the FASTA file in 'read' mode
fasta_file = open('fasta/medium.fasta','r').readlines()

counter = 0

# iterate through lines
for line in fasta_file:

    # add 1 to our counter for every header line we see
    if '>' in line:
        counter += 1

# print the result
print(counter)
```

```
2
```


Next, we'll read the sequence of each protein into a list, with one entry per protein:

```
# open the FASTA file in 'read' mode
fasta_file = open('fasta/medium.fasta','r').readlines()

# create an 'empty' list to store the sequences
sequences = []

# iterate through lines
for line in fasta_file:

    if '>' in line:
        # for every header line we see, add an empty string to the list
        # -- this empty string will later be populated with the sequence
        sequences.append('')

    if '>' not in line:
        # here, we add the new line to the most recent string in the list
        # note that sequences[-1] refers to the last item in the list
        sequences[-1] += line.strip()

# print the result
for sequence in sequences:
    print(sequence)
```

```
MLFSALLLEVIWILAADGGQHWTYEGPHGQDHWPASYPECGNNAQSPIDIQTDSVTFDPDLPALQPHGYDQPGTEPLDLHNNGHTVQLSLPSTLYLGGLPRKYVAAQLHLHWGQKGSPGGSEHQINSEATFAELHIVHYDSDSYDSLSEAAERPQGLAVLGILIEVGETKNIAYEHILSHLHEVRHKDQKTSVPPFNLRELLPKQLGQYFRYNGSLTTPPCYQSVLWTVFYRRSQISMEQLEKLQGTLFSTEEEPSKLLVQNYRALQPLNQRMVFASFIQAGSSYTTGEMLSLGVGILVGCLCLLLAVYFIARKIRKKRLENRKSVVFTSAQATTEA
MPPSISAFQAAYIGIEVLIALVSVPGNVLVIWAVKVNQALRDATFCFIVSLAVADVAVGALVIPLAILINIGPQTYFHTCLMVACPVLILTQSSILALLAIAVDRYLRVKIPLRYKMVVTPRRAAVAIAGCWILSFVVGLTPMFGWNNLSAVERAWAANGSMGEPVIKCEFEKVISMEYMVYFNFFVWVLPPLLLMVLIYLEVFYLIRKQLNKKVSASSGDPQKYYGKELKIAKSLALILFLFALSWLPLHILNCITLFCPSCHKPSILTYIAIFLTHGNSAMNPIVYAFRIQKFRVTFLKIWNDHFRCQPAPPIDEDLPEERPDD
```
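A list keeps the sequences in order, but it loses track of which header each one belongs to. One possible alternative, sketched below, is to store them in a dictionary keyed by the header line. The lines here are short made-up stand-ins for a real file, and the names `sequences_by_header` and `current_header` are our own.

```python
# made-up lines standing in for a FASTA file with two proteins
fasta_lines = [
    '>protein_A example\n', 'MLFS\n', 'ALLL\n',
    '>protein_B example\n', 'MPPS\n', 'ISAF\n',
]

sequences_by_header = {}
current_header = None

for line in fasta_lines:
    line = line.strip()
    if line.startswith('>'):
        # a new protein begins: remember its header and start an empty sequence
        current_header = line
        sequences_by_header[current_header] = ''
    elif current_header is not None:
        # a sequence line: add it to the protein we are currently reading
        sequences_by_header[current_header] += line

for header, sequence in sequences_by_header.items():
    print(header, sequence)
```

With this layout, `sequences_by_header['>protein_A example']` returns `'MLFSALLL'` directly, without needing to know where the protein sits in the file.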


## Finding mutations in protein sequences between two subjects

Ok, let's move on to something a bit more involved. Let's say that we've got a list of protein sequences found in two different people and we suspect that one person has a detrimental mutation in at least one protein. Our task is to read the files containing the protein sequences for each person, iterate through them, and compare their sequences to find any differences.

The protein sequences are in the 'fasta' directory and are labelled **'subject_1.fasta'** and **'subject_2.fasta'**.

Our first job is to look at the data files and see what we're dealing with. Open one of the subject FASTA files in a text editor (Notepad or similar).

<img src="img/fasta_files.PNG" width="95%"/>

The data is in the [FASTA file format](https://en.wikipedia.org/wiki/FASTA_format), which is common in proteomics. Here, each protein entry starts with a 'header' line beginning with '>', followed by some text that can provide database identifiers, protein names and descriptions, taxonomy information and so on.

The following lines provide the sequence of the protein.
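Because the header is just a string, we can pull it apart with the string tools from earlier. The sketch below splits the header from our small FASTA file into its parts. It assumes the header follows the `>database|accession|name description` pattern seen in these files; headers from other sources may be laid out differently, and the variable names are our own.

```python
header = '>sp|Q9ULX7|CAH14_HUMAN Carbonic anhydrase 14 OS=Homo sapiens OX=9606 GN=CA14 PE=1 SV=1'

# drop the leading '>' and split the identifier part on the '|' character
database, accession, rest = header[1:].split('|')

# the entry name is everything up to the first space; the description follows
entry_name, description = rest.split(' ', 1)

print(database)      # sp
print(accession)     # Q9ULX7
print(entry_name)    # CAH14_HUMAN
print(description)   # Carbonic anhydrase 14 OS=Homo sapiens OX=9606 GN=CA14 PE=1 SV=1
```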

#### Exploring the data

Let's check how many proteins we have in each file

```
def count_proteins(file_name):

    # create an empty list
    # here, we'll store the FASTA header lines as we read them from the files
    headers = []

    # open the FASTA file in 'read' mode
    fasta_file = open(file_name,'r').readlines()

    # iterate through lines
    for line in fasta_file:

        # we're interested in the header lines, which start with '>'
        if '>' in line:

            # we've found a header line
            # - append to the list
            headers.append(line)

    # return the list from the function
    return headers

# read the subject_1 FASTA file
subject_1_headers = count_proteins('fasta/subject_1.fasta')

# read the subject_2 FASTA file
subject_2_headers = count_proteins('fasta/subject_2.fasta')

print(len(subject_1_headers))
print(len(subject_2_headers))
```

```
3684
3684
```


Both files contain the same number of proteins. Next, we'll wrap our sequence-reading code from earlier in a function so that we can use it on both files:

```
def read_protein_sequences(file_name):

    # open the FASTA file in 'read' mode
    fasta_file = open(file_name,'r').readlines()

    # create an 'empty' list to store the sequences
    sequences = []

    # iterate through lines
    for line in fasta_file:

        if '>' in line:
            # for every header line we see, add an empty string to the list
            # -- this empty string will later be populated with the sequence
            sequences.append('')

        if '>' not in line:
            # here, we add the new line to the most recent string in the list
            # note that sequences[-1] refers to the last item in the list
            sequences[-1] += line.strip()

    return sequences

# read the subject_1 FASTA file
subject_1_sequences = read_protein_sequences('fasta/subject_1.fasta')

# read the subject_2 FASTA file
subject_2_sequences = read_protein_sequences('fasta/subject_2.fasta')
```

Now we've got our protein sequences from both subjects. Let's see if there are any differences:

```
print(subject_1_sequences == subject_2_sequences)
```

```
False
```


There are some differences between our subjects! Right now, though, we don't know how many proteins or amino acids differ. Let's find them:

```
for i in range(len(subject_1_sequences)):

    # is the i'th sequence from subject 1 different to the i'th sequence from subject 2?
    # -- print the sequences if they are different
    if subject_1_sequences[i] != subject_2_sequences[i]:
        print(subject_1_sequences[i])
        print(subject_2_sequences[i])
```

```
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVHVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD
```
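If we only wanted to know how many proteins differ, rather than see them, we could count the mismatches instead of printing them. The sketch below uses two short made-up lists in place of the real subject data, and uses Python's built-in `zip()` to walk through both lists side by side; the names `example_1`, `example_2` and `differing` are our own.

```python
# short made-up lists standing in for the two subjects' sequences
example_1 = ['MEEPQ', 'MLFSA', 'MPPSI']
example_2 = ['MEEPQ', 'MLFSA', 'MPHSI']

differing = 0

# zip() pairs up the first items, then the second items, and so on
for first, second in zip(example_1, example_2):
    if first != second:
        differing += 1

print(differing)    # 1
```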


Can you spot the amino acid that's different? It's a bit hard. Let's write some code to find it for us:

```
for i in range(len(subject_1_sequences)):

    # is the i'th sequence from subject 1 different to the i'th sequence from subject 2?
    if subject_1_sequences[i] != subject_2_sequences[i]:

        # search through the sequences of both proteins
        # (this assumes both sequences are the same length)
        for j in range(len(subject_1_sequences[i])):

            # is residue j of protein 1 different to residue j of protein 2?
            if subject_1_sequences[i][j] != subject_2_sequences[i][j]:

                # if so, print the result
                # (we add 1 to j because Python counts from 0 but residues are numbered from 1)
                print('%s: %s --> %s' % (
                    j+1, subject_1_sequences[i][j], subject_2_sequences[i][j])
                )

                # print the FASTA header line as well
                print(subject_1_headers[i])
```

```
273: R --> H
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
```



Residue arginine 273 of the p53 protein has been mutated to histidine.
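If we wanted to reuse this comparison elsewhere, one option is to wrap it in a function. The sketch below is one possible way to do that, not the only one: the function name `find_mutations` is our own, and the two short sequences are made-up fragments rather than the full p53 protein.

```python
def find_mutations(sequence_1, sequence_2):
    """Return a list of (position, residue_1, residue_2) for each mismatch."""
    mutations = []

    # zip() pairs up the residues and stops at the end of the shorter sequence,
    # so unequal lengths do not cause an error (extra residues are not compared)
    # enumerate(..., start=1) numbers the pairs from 1 rather than 0
    for position, (res_1, res_2) in enumerate(zip(sequence_1, sequence_2), start=1):
        if res_1 != res_2:
            mutations.append((position, res_1, res_2))

    return mutations

# two short made-up fragments that differ at the fourth residue
for position, old, new in find_mutations('FEVRVCA', 'FEVHVCA'):
    print('%s: %s --> %s' % (position, old, new))    # 4: R --> H
```

Returning a list rather than printing inside the function means the results can be counted, saved to a file, or printed later, whichever we need.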