# Capstone Check-In Two

## Proposal
My capstone project is a predictive model of protein binding based on primary structure.  Originally, I hoped to build a regressor that predicts the binding affinity for an input pair of proteins, but I have narrowed the scope to binary classification: interacts or doesn't interact.  Such a model could be used to determine whether newly sequenced viral proteins, like those on the surface of the novel coronavirus, bind to human membrane receptors.  Identifying protein-protein interactions without laboratory processes is an important application of machine learning.

## Problem Statement
SARS-CoV-2 has infected at least 30 million people worldwide.  The global effort to find a vaccine, treat COVID-19, and understand the virus can only be described as the largest scientific project ever undertaken.  One branch of this project is mapping interactions between SARS-CoV-2 surface proteins and human membrane receptors.  The goal of this project is to train a binary classifier that identifies whether two input proteins interact.  The model will be applied to recently sequenced SARS-CoV-2 surface proteins to determine their likely human targets.

## Methods and Models
The model will be trained on the primary structures (amino acid sequences) of thousands of protein profiles scraped from the UniProt protein database.  We are not interested in inference (e.g. "the presence of cysteines in both molecules increases the chance of interaction"); we are interested in predictive accuracy.  Given the complexity of protein data, I plan to move directly to deep learning to extract as much predictive power from the amino acid strings as possible.  I anticipate tuning a Convolutional Neural Network.
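As a minimal sketch of how an amino acid string could be turned into CNN input, the snippet below one-hot encodes a sequence over the 20 standard residues and pads or truncates it to a fixed length. The function name `one_hot_encode`, the `max_len` value, and the choice to leave non-standard residues as all-zero rows are illustrative assumptions, not decisions already made for this project.

```python
import numpy as np

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_len=1000):
    """Return a (max_len, 20) float array for one sequence.

    Sequences longer than max_len are truncated; shorter ones are
    zero-padded. Residues outside the 20 standard letters (e.g. X, U)
    are left as all-zero rows.
    """
    encoded = np.zeros((max_len, len(AMINO_ACIDS)), dtype=np.float32)
    for position, residue in enumerate(sequence[:max_len]):
        column = AA_INDEX.get(residue)
        if column is not None:
            encoded[position, column] = 1.0
    return encoded


# First ten residues of the AKAP6_HUMAN sequence, padded to length 12.
example = one_hot_encode("MLTMSVTLSP", max_len=12)
print(example.shape)  # (12, 20)
```

A fixed `max_len` keeps every input the same shape, which a convolutional layer needs; the right value would depend on the length distribution in the training set.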

## Risks and Assumptions
Predicting chemical forces between macromolecules is a vastly complex task that often involves computationally expensive 3D modeling.  Limiting our model training to a 1D amino acid string avoids this problem, but also lowers our expectations for the model's performance.  However, we still expect the model to intuit enough chemistry from amino acids alone to predict binding.  This is a huge assumption that relies on the strength of deep learning and the size of our training data.  A potential risk of the project is that the model entirely fails to discern meaningful information from primary structure.  In that case, secondary structure data (presence of alpha helices, beta sheets, etc.) must be added to the training set.

## Data Source
The commented code below includes saved functions that query the RCSB Protein Data Bank. I originally scraped from this site, but ended up switching to UniProt. I've kept the code below in case I want to gather corroborating data later on.

I customized advanced queries on UniProt to select only human proteins with at least one interactor.  I chose to include protein mass, length, sequence, and interactors as my columns.

https://www.uniprot.org/

## EDA


In [9]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import requests


In [25]:
# # This function is adapted from Carlos Oliver
# # https://www.rcsb.org/pages/webservices/python3_search_nmr

# def experimental_query(method):
#     url = 'http://www.rcsb.org/pdb/rest/search'

#     query_text = """
# <?xml version="1.0" encoding="UTF-8"?>

# <orgPdbQuery>

# <version>B0907</version>

# <queryType>org.pdb.query.simple.ExpTypeQuery</queryType>

# <description>Experimental Method Search: Experimental Method="""+method+"""</description>

# <mvStructure.expMethod.value>"""+method+"""</mvStructure.expMethod.value>

# </orgPdbQuery>

# """
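#     query_helper(url, query_text)

# def query_helper(url, query_text):
#     # url is passed in explicitly; it was previously only defined
#     # inside experimental_query, so this function raised a NameError.
#     print("Query: %s" % query_text)
#     print("Querying RCSB PDB REST API...")

#     header = {'Content-Type': 'application/x-www-form-urlencoded'}

#     response = requests.post(url, data=query_text, headers=header)

#     if response.status_code == 200:
#         # Count returned IDs (assumed one per line), not characters.
#         pdb_ids = response.text.split()
#         print("Found %d PDB entries matching query." % len(pdb_ids))
#         print("Matches: \n%s" % response.text)
#     else:
#         print("Failed to retrieve results")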

In [26]:
# solid_state = "SOLID-STATE NMR"
# experimental_query(solid_state)

In [7]:
train = pd.read_excel("./train_set.xlsx")
test = pd.read_excel("./test_set.xlsx")

In [10]:
train.head()

|  | Entry | Entry name | Protein names | Interacts with | Sequence | Mass | Length |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Q13023 | AKAP6_HUMAN | A-kinase anchor protein 6 (AKAP-6) (A-kinase a... | Q96CV9 | MLTMSVTLSPLRSQDLDPMATDASPMAINMTPTVEQGEGEEAMKDM... | 256720 | 2319 |
| 1 | Q8WTP8 | AEN_HUMAN | Apoptosis-enhancing nuclease (EC 3.1.-.-) (Int... | C9JG97; P50402; Q96B26; Q3T906; Q8IX15-3; Q134... | MVPREAPESAQCLCPSLTIPNAKDVLRKRHKRRSRQHQRFMARKAL... | 36350 | 325 |
| 2 | Q9NP61 | ARFG3_HUMAN | ADP-ribosylation factor GTPase-activating prot... | Q9H400; Q8WWV3; Q16563; Q8IUH5 | MGDPSKQDILTIFKRLRSVPTNKVCFDCGAKNPSWASITYGVFLCI... | 56928 | 516 |
| 3 | P36404 | ARL2_HUMAN | ADP-ribosylation factor-like protein 2 | Q9Y2Y0; Q969G2; O43924; Q9BZK7; Q13432 | MGLLTILKKMKQKERELRLLMLGLDNAGKTTILKKFNGEDIDTISP... | 20878 | 184 |
| 4 | Q9NQ90 | ANO2_HUMAN | Anoctamin-2 (Transmembrane protein 16B) | Q12959 | MATPGPRDIPLLPGSPRRLSPQAGSRGGQGPKHGQQCLKMPGPRAP... | 113969 | 1003 |


In [13]:
# Let's look at the training dataset
train.shape

(11935, 7)

There are 11935 proteins in our training dataset.

In [14]:
train.set_index("Entry", inplace=True)
train.head()

| Entry | Entry name | Protein names | Interacts with | Sequence | Mass | Length |
| --- | --- | --- | --- | --- | --- | --- |
| Q13023 | AKAP6_HUMAN | A-kinase anchor protein 6 (AKAP-6) (A-kinase a... | Q96CV9 | MLTMSVTLSPLRSQDLDPMATDASPMAINMTPTVEQGEGEEAMKDM... | 256720 | 2319 |
| Q8WTP8 | AEN_HUMAN | Apoptosis-enhancing nuclease (EC 3.1.-.-) (Int... | C9JG97; P50402; Q96B26; Q3T906; Q8IX15-3; Q134... | MVPREAPESAQCLCPSLTIPNAKDVLRKRHKRRSRQHQRFMARKAL... | 36350 | 325 |
| Q9NP61 | ARFG3_HUMAN | ADP-ribosylation factor GTPase-activating prot... | Q9H400; Q8WWV3; Q16563; Q8IUH5 | MGDPSKQDILTIFKRLRSVPTNKVCFDCGAKNPSWASITYGVFLCI... | 56928 | 516 |
| P36404 | ARL2_HUMAN | ADP-ribosylation factor-like protein 2 | Q9Y2Y0; Q969G2; O43924; Q9BZK7; Q13432 | MGLLTILKKMKQKERELRLLMLGLDNAGKTTILKKFNGEDIDTISP... | 20878 | 184 |
| Q9NQ90 | ANO2_HUMAN | Anoctamin-2 (Transmembrane protein 16B) | Q12959 | MATPGPRDIPLLPGSPRRLSPQAGSRGGQGPKHGQQCLKMPGPRAP... | 113969 | 1003 |


In [15]:
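# Split on ";" and strip the space after each separator so accessions compare cleanly.
train["Interacts with"] = train["Interacts with"].apply(lambda a: [p.strip() for p in a.split(";")])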
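A quick way to sanity-check the split is to count the interactors per protein. The sketch below does this on a three-row toy frame that mimics the `Interacts with` column; the frame `toy` and the column name `n_interactors` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Toy stand-in for the real training frame (accessions taken from the head above).
toy = pd.DataFrame(
    {"Interacts with": ["Q96CV9", "C9JG97; P50402; Q96B26", "Q9H400; Q8WWV3"]},
    index=pd.Index(["Q13023", "Q8WTP8", "Q9NP61"], name="Entry"),
)

# Split on ";" and strip the space that follows each separator.
toy["Interacts with"] = toy["Interacts with"].apply(
    lambda cell: [acc.strip() for acc in cell.split(";")]
)

# Series.str.len() also works on list-valued cells.
toy["n_interactors"] = toy["Interacts with"].str.len()
print(toy["n_interactors"].tolist())  # [1, 3, 2]
```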

In [16]:
train.head()

| Entry | Entry name | Protein names | Interacts with | Sequence | Mass | Length |
| --- | --- | --- | --- | --- | --- | --- |
| Q13023 | AKAP6_HUMAN | A-kinase anchor protein 6 (AKAP-6) (A-kinase a... | [Q96CV9] | MLTMSVTLSPLRSQDLDPMATDASPMAINMTPTVEQGEGEEAMKDM... | 256720 | 2319 |
| Q8WTP8 | AEN_HUMAN | Apoptosis-enhancing nuclease (EC 3.1.-.-) (Int... | [C9JG97, P50402, Q96B26, Q3T906, Q8IX15-3,... | MVPREAPESAQCLCPSLTIPNAKDVLRKRHKRRSRQHQRFMARKAL... | 36350 | 325 |
| Q9NP61 | ARFG3_HUMAN | ADP-ribosylation factor GTPase-activating prot... | [Q9H400, Q8WWV3, Q16563, Q8IUH5] | MGDPSKQDILTIFKRLRSVPTNKVCFDCGAKNPSWASITYGVFLCI... | 56928 | 516 |
| P36404 | ARL2_HUMAN | ADP-ribosylation factor-like protein 2 | [Q9Y2Y0, Q969G2, O43924, Q9BZK7, Q13432] | MGLLTILKKMKQKERELRLLMLGLDNAGKTTILKKFNGEDIDTISP... | 20878 | 184 |
| Q9NQ90 | ANO2_HUMAN | Anoctamin-2 (Transmembrane protein 16B) | [Q12959] | MATPGPRDIPLLPGSPRRLSPQAGSRGGQGPKHGQQCLKMPGPRAP... | 113969 | 1003 |
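Because the classifier needs labelled pairs rather than per-protein rows, one possible way to turn the `Interacts with` lists into training examples is sketched below. It runs on a tiny made-up frame with the same column name. The helper `build_pairs`, the random negative sampling, and the assumption that an unlisted pair is a non-interaction are all illustrative; absence from UniProt does not prove that two proteins never bind.

```python
import random

import pandas as pd


def build_pairs(df, n_negative_per_positive=1, seed=0):
    """Build (protein_a, protein_b, label) rows from interactor lists.

    Positive pairs come from the 'Interacts with' lists, restricted to
    interactors that are themselves in df.index (so a sequence exists).
    Self-interactions are skipped to keep every pair two distinct proteins.
    Negative pairs are sampled at random from pairs not listed as
    interacting; this is an assumption, not a verified non-interaction.
    """
    rng = random.Random(seed)
    known = set(df.index)

    positives = set()
    for entry, interactors in df["Interacts with"].items():
        for other in interactors:
            if other in known and other != entry:
                positives.add(frozenset((entry, other)))

    entries = sorted(known)
    # Never ask for more negatives than there are unlisted pairs.
    max_negatives = len(entries) * (len(entries) - 1) // 2 - len(positives)
    n_negatives = min(len(positives) * n_negative_per_positive, max_negatives)

    negatives = set()
    while len(negatives) < n_negatives:
        pair = frozenset(rng.sample(entries, 2))
        if pair not in positives:
            negatives.add(pair)

    rows = [(*sorted(p), 1) for p in positives]
    rows += [(*sorted(n), 0) for n in negatives]
    return pd.DataFrame(rows, columns=["protein_a", "protein_b", "label"])


# Hypothetical accessions; "X9" has no row of its own and is dropped.
toy = pd.DataFrame(
    {"Interacts with": [["P2", "X9"], ["P1"], ["P4"], ["P3"]]},
    index=pd.Index(["P1", "P2", "P3", "P4"], name="Entry"),
)
pairs = build_pairs(toy)
print(pairs)
```

Storing each pair as a `frozenset` means A-B and B-A count as the same interaction, so a pair listed from both sides is not duplicated.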


In [17]:
total_proteins = train.index

In [22]:
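# # Collect every protein that appears as an entry or as an interactor.
# # pandas Index.append returns a new Index rather than mutating in place,
# # so build a Python set instead.
# total_proteins = set(train.index)
# for interactors in train['Interacts with']:
#     total_proteins.update(interactors)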
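The collection step in the commented cell above can also be written without an explicit loop. The sketch below assumes `Interacts with` already holds lists of accession strings. It gathers every accession seen as an entry or as an interactor, then reports which interactors have no row (and therefore no sequence) in the frame. The toy frame and the name `missing` are hypothetical stand-ins for the real training data.

```python
import pandas as pd

# Toy stand-in for the training frame, indexed by UniProt accession.
toy = pd.DataFrame(
    {"Interacts with": [["Q96CV9"], ["P36404", "Q8IX15-3"], ["Q13023"]]},
    index=pd.Index(["Q13023", "Q8WTP8", "Q96CV9"], name="Entry"),
)

# explode() gives one row per (entry, interactor); unique() deduplicates.
interactors = set(toy["Interacts with"].explode().unique())
total_proteins = set(toy.index) | interactors

# Interactors without their own row have no sequence to train on.
missing = interactors - set(toy.index)

print(len(total_proteins))  # 5
print(sorted(missing))      # ['P36404', 'Q8IX15-3']
```

Interactors such as isoform accessions (for example `Q8IX15-3`) will not match an index of canonical entries, so a check like this shows how many pairs would lack a sequence for one partner.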