## Building and Tuning a Machine Learning Model to Predict Protein Subcellular Localization from Amino Acid Sequence

by MAS, 06/2019

### Introduction
We now have access to an unprecedented amount of genomic information. However, harnessing the full potential of that information still requires a lot of slow and expensive experiments. In an ideal scenario, we would employ supervised machine learning to make predictions about biology using the data that's already been collected and readily available genomic information. **Here, I build and tune a machine learning model to predict the subcellular localization of proteins based on amino acid sequences.** The end result is a support vector machine (SVM) model that classifies proteins as soluble or membrane with an accuracy of about 85%.

### Data Preprocessing
To build the model, we need protein amino acid sequences and data describing protein subcellular localization. Fortunately, the bacterium *E. coli* (strain K12) has been studied to death, so the subcellular localization of many of its proteins is known. Additionally, its whole genome has been sequenced, so all of its proteins' amino acid sequences are known.

**For simplicity, we will only consider two classes for subcellular localization:**  

   * In a membrane ("membrane")
   * Not in a membrane ("soluble")

The world's repository of protein data is [**UniProt**](https://www.uniprot.org/help/proteomes_manual). Not only can we access every known protein sequence, we can also use a protein's unique UniProt ID to look up whether its subcellular localization has been characterized.

I wrote a ```Python``` [script ```scrape_uniprot.py```](https://github.com/mas16/SubcellularLocalizationPrediction/blob/master/scrape_uniprot.py) (click link for code) to:

> 1. Access Uniprot  
> 2. Fetch the *E. coli* protein sequences in [FASTA](https://en.wikipedia.org/wiki/FASTA_format) format  
> 3. Extract the UniProt ID and amino acid sequence using regex  
> 4. Query UniProt using the UniProt ID  
> 5. Scrape the subcellular localization if documented  
> 6. Put everything together in a ```pandas``` dataframe and write to a ```.csv``` file
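
Step 3 can be sketched as follows. This is a minimal illustration, not the code in ```scrape_uniprot.py```: the function name ```parse_fasta``` is hypothetical, and the entry names and descriptions in the sample headers are placeholders (only the IDs and sequences come from the data shown in this write-up). It assumes the standard UniProt header layout, ```>db|ID|ENTRY_NAME description```, where ```db``` is ```sp``` or ```tr```.

```python
import re

# UniProt FASTA headers look like ">db|ACCESSION|ENTRY_NAME description",
# where db is "sp" (Swiss-Prot) or "tr" (TrEMBL).
HEADER_RE = re.compile(r"^>(?:sp|tr)\|([^|]+)\|")


def parse_fasta(text):
    """Return a list of (accession, sequence) tuples from FASTA text.

    Records whose header does not match HEADER_RE are skipped.
    """
    records = []
    accession, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if accession is not None:
                records.append((accession, "".join(chunks)))
            match = HEADER_RE.match(line)
            accession = match.group(1) if match else None
            chunks = []
        elif accession is not None:
            # Sequences may be wrapped over several lines
            chunks.append(line)
    if accession is not None:
        records.append((accession, "".join(chunks)))
    return records


# Placeholder headers; IDs and sequences are taken from the tables in this write-up
sample = """>sp|A5A615|PLACEHOLDER1_ECOLI Placeholder description
MNVSSRTVVLINFFAAV
GLFTLISMRFGWFI
>sp|A5A621|PLACEHOLDER2_ECOLI Placeholder description
MIERELGNWKDFIEVMLRK
"""

records = parse_fasta(sample)
for uniprot_id, sequence in records:
    print(uniprot_id, sequence)
```

The resulting list of tuples can be passed straight to ```pd.DataFrame(records, columns=["ID", "Sequence"])``` to start the table that the script writes out.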

Let's take a look at the output:

In [8]:
# Preview the output from scrape_uniprot.py
import pandas as pd

# Path to output data
datapath = "/Users/matthewstetz/Documents/Projects/SubcellularLocalizationPrediction/"

df = pd.read_csv(datapath + "ecoli_proteome.csv")
print(df.head(5))

           ID                                           Sequence  Local
0  A0A385XJ53  MASVSISCPSCSATDGVVRNGKSTAGHQRYLCSHCRKTWQLQFTYT...    NaN
1  A0A385XJE6  MFVIWSHRTGFIMSHQLTFADSEFSSKRRQTRKEIFLSRMEQILPW...    NaN
2  A0A385XJK5    MTLLQVHNFVDNSGRKKWLSRTLGQTRCPGKSMGREKFVKNNCSAIS    NaN
3  A0A385XJL2   MLSTESWDNCEKPPLLFPFTALTCDETPVFSGSVLNLVAHSVDKYGIG    NaN
4  A0A385XJL4  MPGNSPHYGRWPQHDFTSLKKLRPQSVTSRIQPGSDVIVCAEMDEQ...    NaN


But as you can see, a lot of annotations are missing. The script defines the classes to be:
   * Membrane = 1
   * Soluble = 0  

So let's see how many of each class we have:

In [11]:
print("Membrane Count: ", df[df["Local"]==1].shape)
print("Soluble Count: ", df[df["Local"]==0].shape)

Membrane Count:  (1210, 3)
Soluble Count:  (874, 3)


So there is a little bit of a class imbalance (1,210 membrane versus 874 soluble, roughly 58% to 42%), but we will deal with that when we build the model.
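
To put numbers on the imbalance, here is a small sketch that summarizes the labeled classes and counts the unannotated rows. The helper name ```class_balance``` and the toy labels are made up for illustration. The ```balanced_weight``` column uses the common "balanced" heuristic, ```n_samples / (n_classes * class_count)```, which is one possible way to handle imbalance and not necessarily what the final model uses.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Summarize counts, fractions, and 'balanced' weights per labeled class."""
    labeled = labels.dropna().astype(int)
    counts = labeled.value_counts().sort_index()
    return pd.DataFrame({
        "count": counts,
        "fraction": counts / counts.sum(),
        # "balanced" heuristic: n_samples / (n_classes * class_count)
        "balanced_weight": counts.sum() / (len(counts) * counts),
    })


# Toy labels (1 = membrane, 0 = soluble, None = no annotation)
toy_labels = pd.Series([1, 1, 1, 0, 0, None, None, None])

summary = class_balance(toy_labels)
n_missing = int(toy_labels.isna().sum())
print(summary)
print("Unannotated rows:", n_missing)
```

On the real data this would be called as ```class_balance(df["Local"])```.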

### Feature Engineering
Ok, now that we have our data in a tidy format, let's start engineering some features. Since the goal of this project is to only use genomic information, let's start by just counting the number of each amino acid. Of course, we need to normalize by the total number of amino acids since proteins can have very different lengths.
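
A minimal sketch of the length-normalized count is shown here. The function name ```aa_composition``` is hypothetical and this is not the code in the feature script. It assumes the 20 standard amino acids and divides by the full sequence length, so any non-standard letters still count toward the length.

```python
from collections import Counter

# One-letter codes for the 20 standard amino acids
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(sequence):
    """Return the fraction of each standard amino acid in a sequence."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    # Counter returns 0 for residues that never occur
    return {aa: counts[aa] / len(sequence) for aa in AMINO_ACIDS}


# A 19-residue sequence that contains isoleucine ("I") twice
comp = aa_composition("MIERELGNWKDFIEVMLRK")
print(round(comp["I"], 6))  # 2 / 19
```

Because every count is divided by the sequence length, a 19-residue peptide and a 500-residue protein end up on the same scale.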

We can also include some information derived from previous studies that relate amino acid composition to specific chemical and physical properties. Specifically, let's use published per-residue scales for [hydrophobicity](https://web.expasy.org/protscale/pscale/Hphob.Doolittle.html) and [secondary structure](https://web.expasy.org/protscale/pscale/alpha-helixLevitt.html) propensity as starting points since these are very well calibrated experimentally.
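
As an illustration of how a per-residue scale becomes a per-protein feature, the sketch below averages Kyte-Doolittle hydropathy values over a sequence. The function name ```mean_hydrophobicity``` is hypothetical, and it is an assumption that the feature script computes its average in exactly this way (for example, in how it treats non-standard residues, which are simply skipped here).

```python
# Kyte-Doolittle hydropathy values (higher = more hydrophobic)
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydrophobicity(sequence):
    """Average the hydropathy value over all standard residues."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    if not values:
        raise ValueError("sequence contains no standard amino acids")
    return sum(values) / len(values)


# Two short sequences with opposite localization labels
membrane_like = mean_hydrophobicity("MNVSSRTVVLINFFAAVGLFTLISMRFGWFI")
soluble_like = mean_hydrophobicity("MIERELGNWKDFIEVMLRK")
print(membrane_like > soluble_like)
```

The same pattern works for a secondary structure scale: swap in a different dictionary of per-residue values.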

I wrote a [script ```feature_extract.py```](https://github.com/mas16/SubcellularLocalizationPrediction/blob/master/feature_extract.py) which:
> Reads dataframe with amino acid sequences  
> Drops data without documented subcellular localization  
> Calculates normalized amino acid counts  
> Calculates average hydrophobicity  
> Calculates average secondary structure propensity  
> Writes features out to a tidy ```.csv``` output file
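
The first three steps can be sketched with ```pandas``` as follows. The function name ```build_feature_table``` is hypothetical and the column order differs from the real output file. The column names ```ID```, ```Sequence```, and ```Local``` are the ones shown in the preview above.

```python
import pandas as pd

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_feature_table(df):
    """Drop unannotated proteins and append one fraction column per amino acid."""
    # Keep only rows with a documented localization
    labeled = df.dropna(subset=["Local"]).reset_index(drop=True)
    # One row of length-normalized counts per sequence
    fractions = pd.DataFrame(
        [{aa: seq.count(aa) / len(seq) for aa in AMINO_ACIDS}
         for seq in labeled["Sequence"]]
    )
    return pd.concat([labeled, fractions], axis=1)


# Three rows taken from the previews in this write-up; the middle one has no annotation
toy = pd.DataFrame({
    "ID": ["A5A621", "A0A385XJK5", "A5A615"],
    "Sequence": [
        "MIERELGNWKDFIEVMLRK",
        "MTLLQVHNFVDNSGRKKWLSRTLGQTRCPGKSMGREKFVKNNCSAIS",
        "MNVSSRTVVLINFFAAVGLFTLISMRFGWFI",
    ],
    "Local": [0.0, None, 1.0],
})

features = build_feature_table(toy)
print(features[["ID", "Local", "I", "V", "L"]])
```

The averaged scale features would be added as two more columns before writing the table out with ```to_csv```.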

Let's take a look at the output:

In [12]:
df = pd.read_csv(datapath + "ecoli_proteome_features.csv")
print(df.head(5))

       ID                                           Sequence  Local         I  \
0  A5A605  MRLHVKLKEFLSMFFMAILFFPAFNASLFFTGVKPLYSIIKCSTEI...    1.0  0.132075   
1  A5A615                    MNVSSRTVVLINFFAAVGLFTLISMRFGWFI    1.0  0.096774   
2  A5A616                    MLGNMNVFMAVLGIILFSGFLAAYFSHKWDD    1.0  0.064516   
3  A5A618                      MSTDLKFSLVTTIIVLGLIVAVGLTAALH    1.0  0.103448   
4  A5A621                                MIERELGNWKDFIEVMLRK    0.0  0.105263   

          V         L         F         C         M         A    ...     \
0  0.056604  0.106918  0.113208  0.025157  0.037736  0.056604    ...      
1  0.129032  0.096774  0.161290  0.000000  0.064516  0.064516    ...      
2  0.064516  0.129032  0.129032  0.000000  0.096774  0.096774    ...      
3  0.137931  0.206897  0.034483  0.000000  0.034483  0.103448    ...      
4  0.052632  0.105263  0.052632  0.000000  0.105263  0.000000    ...      

          E         Q         D         N         K         R 

Now let's evaluate the features. We want to make sure the features provide some information that can discriminate between soluble and membrane proteins without being collinear with other features.

I wrote a [script ```evaluate_features.py```](https://github.com/mas16/SubcellularLocalizationPrediction/blob/master/evaluate_features.py) that does the following:

> Reads dataframe of features  
> Generates boxplots by classification for each feature   
> Generates correlation matrix for all features  
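
Here is a numeric sketch of both checks, using made-up toy values rather than the real feature table. The helper names ```quartiles_by_class``` and ```correlated_pairs``` and the 0.8 threshold are assumptions for illustration. The quartiles are the numbers a boxplot grouped by class would draw, and the pair list is a compact way to read a correlation matrix.

```python
import pandas as pd


def quartiles_by_class(df, feature, label="Local"):
    """Per-class quartiles: the box edges and median of a grouped boxplot."""
    return df.groupby(label)[feature].quantile([0.25, 0.5, 0.75]).unstack()


def correlated_pairs(df, columns, threshold=0.8):
    """List feature pairs whose absolute Pearson correlation meets the threshold."""
    corr = df[columns].corr()  # Pearson correlation by default
    pairs = []
    for i, first in enumerate(columns):
        for second in columns[i + 1:]:
            r = float(corr.loc[first, second])
            if abs(r) >= threshold:
                pairs.append((first, second, r))
    return pairs


# Made-up toy values for illustration only
toy = pd.DataFrame({
    "Local": [1, 1, 1, 0, 0, 0],
    "A": [0.08, 0.10, 0.09, 0.10, 0.08, 0.09],
    "D": [0.00, 0.01, 0.02, 0.06, 0.07, 0.08],
    "hydro_mean": [1.2, 1.1, 1.0, -0.2, -0.3, -0.4],
})

quartiles = quartiles_by_class(toy, "D")
pairs = correlated_pairs(toy, ["A", "D", "hydro_mean"])
print(quartiles)
print(pairs)
```

To draw the plot itself, ```toy.boxplot(column="D", by="Local")``` produces one box per class when ```matplotlib``` is installed.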

Let's see some output:

The amino acid alanine, "A", is roughly equally represented in soluble and membrane proteins:

<img src="plots/A.png" width="600">

The amino acid aspartic acid, "D", is more represented in soluble proteins:

<img src="plots/D.png" width="600">

The correlation matrix shows that the two features we calculated from the amino acid counts, hydrophobicity (```hydro_mean```) and secondary structure propensity (```ss_mean```), are collinear with the amino acid counts themselves, so they are not going to be very useful in our model.

<img src="plots/correlation_matrix.png" width="800">
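
The collinearity is not an accident of this dataset. Assuming ```hydro_mean``` is the plain per-residue average of a scale value $h_a$ (an assumption about the script, with symbols introduced here for illustration), a protein of length $N$ with residues $a_1, \dots, a_N$, counts $n_a$, and normalized counts $f_a = n_a / N$ gives:

$$
\bar{h} \;=\; \frac{1}{N}\sum_{i=1}^{N} h_{a_i} \;=\; \sum_{a} \frac{n_a}{N}\, h_a \;=\; \sum_{a} f_a\, h_a
$$

Under that assumption the average is an exact linear combination of the 20 normalized counts, and the same argument applies to ```ss_mean```, so neither feature adds information that a linear model could not already get from the counts.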