<a href="https://colab.research.google.com/github/gokceuludogan/protein-ml-crash-course/blob/wip/Chapter_5_Protein_Function_Prediction_with_Gene_Ontology_(GO).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Overview

Protein function prediction is a crucial task in bioinformatics, where we aim to determine the biological role of a protein based on its sequence or structure. A popular framework for describing protein functions is the **Gene Ontology (GO)**. GO provides a structured vocabulary for annotating proteins, organized into three main categories:

- **Biological Process (BP)**: Pathways and processes a protein is involved in (e.g., cell cycle).
- **Molecular Function (MF)**: Specific activities performed by the protein (e.g., binding).
- **Cellular Component (CC)**: Locations within the cell where the protein operates (e.g., nucleus).

In this chapter, we will:

- Introduce the Gene Ontology.
- Explain how to prepare a dataset for GO-based function prediction.
- Build machine learning models for predicting GO terms using sequence or structural features.
- Evaluate the performance of these models.

## 1. Introduction to Gene Ontology (GO)

### What is Gene Ontology?

The **Gene Ontology** is a controlled vocabulary for describing what proteins (and other gene products) do. Each GO term describes a specific aspect of a protein's role in the cell, and the terms are organized in a directed acyclic graph (DAG) in which a term may have more than one parent. This links general terms (e.g., "metabolic process") to more specific ones (e.g., "glycolytic process"), so an annotation with a specific term also implies its more general ancestors.

### Key GO Categories:

- **Biological Process (BP)**: Pathways and larger processes (e.g., "DNA repair").
- **Molecular Function (MF)**: Specific biochemical activities (e.g., "ATP binding").
- **Cellular Component (CC)**: Where in the cell the protein is located (e.g., "mitochondrion").

Each protein can be annotated with multiple GO terms across these categories.
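Because GO is a DAG, an annotation with a specific term implies all of its ancestors. The sketch below illustrates that propagation on a toy graph. The `PARENTS` map and both helper functions are hypothetical, and the edges are a simplified excerpt rather than the real ontology (which would normally be loaded from `go-basic.obo`).

```python
# Toy child -> parents map. The IDs are real GO terms, but the edges are a
# simplified, hypothetical excerpt and not the actual ontology structure.
PARENTS = {
    "GO:0006096": {"GO:0008152", "GO:0009056"},  # glycolytic process (two parents)
    "GO:0009056": {"GO:0008152"},                # catabolic process
    "GO:0008152": {"GO:0008150"},                # metabolic process
    "GO:0008150": set(),                         # biological_process (root)
}

def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def propagate(terms, parents=PARENTS):
    """Add all ancestors to a set of directly annotated terms."""
    expanded = set(terms)
    for term in terms:
        expanded |= ancestors(term, parents)
    return expanded

print(sorted(propagate({"GO:0006096"})))
```

Propagating labels this way before training keeps the label matrix consistent with the hierarchy: a protein labeled with a specific term is never marked negative for one of that term's ancestors.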

---

## 2. Preparing a Dataset for Protein Function Prediction

### Example Dataset:

We will assume we have a dataset where each protein sequence is associated with one or more GO terms. This dataset may include:

- Protein sequences
- Known GO annotations (from BP, MF, or CC)

| Sequence | GO Terms |
| --- | --- |
| MAGWELV | GO:0004674, GO:0005524, ... |
| GGQVNLL | GO:0000166, GO:0004674, ... |

The task is to predict the correct GO terms for new protein sequences.

### Retrieving and Annotating Data with GO Terms
We'll first retrieve reviewed human protein sequences from UniProtKB/Swiss-Prot together with their GO annotations. We will then keep only annotations backed by experimental or curator-reviewed evidence, using the evidence codes recorded in UniProt-GOA (EXP, IDA, IMP, IGI, IEP, TAS, IC).
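An evidence code is attached to each individual protein-term annotation, not to the GO term itself. The sketch below shows the filtering idea on a few made-up annotation rows. The accessions, the rows, and the `filter_by_evidence` helper are all hypothetical. Real rows would come from a UniProt-GOA annotation file.

```python
from collections import defaultdict

# Evidence codes accepted in this chapter
ALLOWED_EVIDENCE = {"EXP", "IDA", "IMP", "IGI", "IEP", "TAS", "IC"}

# Hypothetical (accession, GO ID, evidence code) rows for illustration only
annotations = [
    ("PROT_A", "GO:0004674", "IDA"),  # experimental: kept
    ("PROT_A", "GO:0005524", "IEA"),  # electronic inference: dropped
    ("PROT_B", "GO:0000166", "TAS"),  # traceable author statement: kept
    ("PROT_C", "GO:0004674", "IEA"),  # dropped, so PROT_C ends up with no terms
]

def filter_by_evidence(rows, allowed=ALLOWED_EVIDENCE):
    """Group GO IDs by accession, keeping only rows with an allowed evidence code."""
    kept = defaultdict(set)
    for accession, go_id, evidence in rows:
        if evidence in allowed:
            kept[accession].add(go_id)
    return dict(kept)

filtered = filter_by_evidence(annotations)
print(filtered)
```

Proteins whose annotations are all filtered out (like `PROT_C` here) disappear from the result, which mirrors dropping proteins with no remaining GO terms.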

In [76]:
import requests
import pandas as pd
import numpy as np

# Example query to get SwissProt proteins
query = "reviewed:true AND organism_id:9606"
columns = "accession,sequence,go_p,go_f,go_c"
base_url = "https://rest.uniprot.org/uniprotkb/search"

# Parameters for the API call
params = {
    "query": query,
    "format": "tsv",  # We want to retrieve tab-separated values
    "fields": columns,
    "size": 500  # Results per page; the search endpoint returns only 25 by default
}

# Make the request
response = requests.get(base_url, params=params)

# Check if the request was successful
if response.status_code == 200:
    # Convert to DataFrame
    data = response.text
    df = pd.DataFrame([x.split('\t') for x in data.split('\n')[1:] if x], columns=['Entry', 'Sequence', 'BP_GO', 'MF_GO', 'CC_GO'])

    go_cols = ['BP_GO', 'MF_GO', 'CC_GO']

    # Convert empty GO annotations to nan 
    df[go_cols] = df[go_cols].replace("", np.nan)

    # Drop rows without GO annotations
    df = df.dropna(subset=go_cols)

    # Format columns related to GO into a comma separated GO:XXXXXXX value list
    for col in go_cols:
        df[col] = df[col].str.extractall(r'\[(GO:\d+)\]')[0].groupby(level=0).agg(','.join)

    # Combine the BP, MF, and CC terms into one column (evidence-code filtering happens in a later step)
    df['All_GO'] = df[go_cols].apply(lambda x: ','.join(x.dropna()), axis=1)

    print(df.head())  # Display first few rows of the dataframe
else:
    print(f"Failed to retrieve data: {response.status_code}")


        Entry                                           Sequence  \
0  A0A0C5B5G6                                   MRWQEMGYIFYPRKLR   
1  A0A1B0GTW7  MLLLLLLLLLLPPLVLRVAASRCLHDETQKSVSLLRPPFSQLPSKS...   
2      A0JNW5  MAGIIKKQILKHLSRFTKNLSPDKINLSTLKGEGELKNLELDEEVL...   
5      A1A4S6  MGLQPLEFSDCYLDSPWFRERIRAHEAELERTNKFIKELIKDGKNL...   
6      A1A519  MKRRQKRKHLENEESQETAEKGGGMSKSQEDALQPGSTRVAKGWSQ...   

                                               BP_GO  \
0  GO:0032147,GO:2001145,GO:0001649,GO:0033687,GO...   
1                   GO:0007155,GO:0061966,GO:0006508   
2                              GO:0034498,GO:0120009   
5        GO:0007010,GO:0043066,GO:0051056,GO:0007165   
6                   GO:0009566,GO:0045893,GO:0006366   

                              MF_GO  \
0             GO:0003677,GO:0140297   
1  GO:0046872,GO:0004222,GO:0008233   
2  GO:0062069,GO:0120013,GO:0042803   
5                        GO:0005096   
6             GO:0003677,GO:0046872   

                   

In [86]:
!pip install goatools
!wget http://current.geneontology.org/ontology/go-basic.obo


In [87]:
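from goatools import obo_parser

# Load GO terms from the ontology file (term IDs, names, and DAG structure)
go = obo_parser.GODag("go-basic.obo")

# Evidence codes are a property of annotations, not of ontology terms, so a
# GOTerm has no `evidence` attribute. Read them from a UniProt-GOA GAF file.
# Assumption: goa_human.gaf.gz has been downloaded to the working directory.
# GAF columns (1-based): 2 = accession, 5 = GO ID, 7 = evidence code.
EVIDENCE_CODES = {'EXP', 'IDA', 'IMP', 'IGI', 'IEP', 'TAS', 'IC'}
gaf = pd.read_csv("goa_human.gaf.gz", sep='\t', comment='!', header=None, dtype=str,
                  usecols=[1, 4, 6], names=['Entry', 'GO_ID', 'Evidence'])
gaf = gaf[gaf['Evidence'].isin(EVIDENCE_CODES)]

# Map each accession to the set of GO terms supported by an accepted evidence code
trusted = gaf.groupby('Entry')['GO_ID'].agg(set).to_dict()

# Keep a protein's GO term only if it is supported by accepted evidence
# and is still present in the loaded ontology
def filter_go_annotations(entry, go_terms):
    keep = trusted.get(entry, set())
    return [go_term for go_term in go_terms if go_term in keep and go_term in go]

# Apply the filter to retain only high-confidence GO terms
df['Filtered_GO'] = df.apply(lambda row: filter_go_annotations(row['Entry'], row['All_GO'].split(',')), axis=1)

# Remove proteins with no filtered GO terms
df = df[df['Filtered_GO'].map(len) > 0]


go-basic.obo: fmt(1.2) rel(2024-09-08) 44,296 Terms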

In [None]:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Load dataset
df = pd.read_csv('protein_go_dataset.csv')

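# Split into sequences and GO terms
# (the CSV is assumed to follow the example table above: 'Sequence' and 'GO Terms' columns)
X = df['Sequence']
# Split GO terms into lists, stripping the spaces that follow the commas
y = df['GO Terms'].apply(lambda x: [term.strip() for term in x.split(',')])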

# Encode GO terms using MultiLabelBinarizer
mlb = MultiLabelBinarizer()
y_encoded = mlb.fit_transform(y)

# Display encoded GO terms
print("GO Terms Shape:", y_encoded.shape)
print("Example of GO Terms for a Protein:", y_encoded[0])


## 3. Traditional Machine Learning for GO Prediction

For this task, we can use traditional ML models like **Logistic Regression**, **Random Forest**, or **SVM** in a multi-label classification setup. Since each protein may have multiple GO annotations, we will use **multi-label classification**, where the model predicts multiple classes (GO terms) for each protein.
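These models need a fixed-length numeric vector per protein, and the next cell relies on a `one_hot_encode` helper from previous chapters that is not defined in this one. The sketch below is one plausible version, not necessarily the earlier chapters' implementation: the `max_len` parameter, zero-padding, and truncation are assumptions made here so that every sequence maps to a vector of the same length.

```python
# The 20 standard amino acids, in a fixed order
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(seq, max_len=100):
    """Return a flat one-hot vector of length max_len * 20.

    Sequences longer than max_len are truncated; shorter ones are padded
    with zeros. Residues outside the 20 standard amino acids stay all-zero.
    """
    n = len(AMINO_ACIDS)
    vec = [0.0] * (max_len * n)
    for pos, aa in enumerate(seq[:max_len]):
        idx = AA_INDEX.get(aa)
        if idx is not None:
            vec[pos * n + idx] = 1.0
    return vec

encoded = one_hot_encode("MAGWELV")
print(len(encoded), sum(encoded))
```

Flattening position by position keeps the encoding simple, at the cost of discarding everything past `max_len` and treating each position independently.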

In [None]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)

# One-hot encode protein sequences (one_hot_encode comes from previous chapters;
# it must return equal-length vectors so that np.array builds a 2-D matrix)
X_train_encoded = np.array([one_hot_encode(seq) for seq in X_train])
X_test_encoded = np.array([one_hot_encode(seq) for seq in X_test])

# LogisticRegression on its own rejects a 2-D label matrix, so wrap it in
# OneVsRestClassifier to train one binary classifier per GO term
logreg = OneVsRestClassifier(LogisticRegression(max_iter=1000))
logreg.fit(X_train_encoded, y_train)

# Predict GO terms on the test set (a 0/1 matrix, one column per GO term)
y_pred = logreg.predict(X_test_encoded)

# Evaluate model (for multi-label targets this is exact-match accuracy)
accuracy = accuracy_score(y_test, y_pred)
print("Logistic Regression Accuracy:", accuracy)


## 5. Model Evaluation

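Since this is a **multi-label classification** task, accuracy alone is a poor measure of performance: for multi-label targets, scikit-learn's `accuracy_score` counts a protein as correct only if every one of its GO terms is predicted exactly. Metrics such as **precision**, **recall**, and **F1-score**, computed for each GO term and then averaged, give a more informative picture.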
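To make the difference between these metrics concrete, here is a small worked example in plain Python on a made-up label matrix (rows are proteins, columns are GO terms). The helper names `micro_prf` and `subset_accuracy` and the toy data are illustrative assumptions. In practice the scikit-learn metrics would be used instead.

```python
def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 over all (protein, term) pairs."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def subset_accuracy(y_true, y_pred):
    """Fraction of proteins whose full set of terms is predicted exactly."""
    exact = sum(true_row == pred_row for true_row, pred_row in zip(y_true, y_pred))
    return exact / len(y_true)

# Toy data: 3 proteins x 3 GO terms
y_true = [[1, 1, 0], [0, 1, 1], [1, 0, 0]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 1, 0]]

precision, recall, f1 = micro_prf(y_true, y_pred)
print("precision/recall/F1:", precision, recall, f1)
print("subset accuracy:", subset_accuracy(y_true, y_pred))
```

Here four of the five true annotations are recovered with one false positive, so precision, recall, and F1 are all 0.8, yet only one of the three proteins is predicted exactly, giving a subset accuracy of 1/3.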

In [None]:
from sklearn.metrics import classification_report

# Predict on the test set (using logistic regression for demonstration)
y_pred = logreg.predict(X_test_encoded)

# predict() already returns 0/1 labels; the cast just guarantees an integer indicator matrix
y_pred_binary = (y_pred > 0.5).astype(int)

# Generate per-term precision, recall, and F1 (zero_division=0 silences warnings for GO terms never predicted)
print("Classification Report:\n", classification_report(y_test, y_pred_binary, target_names=mlb.classes_, zero_division=0))

## 6. Conclusion

In this chapter, we introduced the **Gene Ontology (GO)** framework for describing protein function, prepared a GO-annotated dataset, and built a traditional machine learning model to predict GO terms from protein sequences. We also showed how to handle multi-label classification and why it should be evaluated with per-term precision, recall, and F1-score rather than accuracy alone.