# Machine Learning for Protein Function Classification

Author: Jude Mike  
Domain: Bioinformatics / Machine Learning  
Tools: Python, Pandas, Scikit-learn  
Platform: Google Colab  

---

## Abstract
This project applies supervised machine learning to classify proteins by function, using numerical features derived from their amino acid sequences. We train a Random Forest classifier on labeled protein data and evaluate it on a held-out test set using accuracy, precision, recall, and F1-score. The work demonstrates how machine learning can assist biological research by accelerating protein function prediction.

## 1. Biological Background

Proteins are sequences of amino acids that fold into complex structures and perform essential biological functions.
Predicting protein function is a key problem in bioinformatics. Function has traditionally been determined through laboratory experiments, which are slow and costly.

Machine learning offers a computational alternative by learning patterns from known protein data and predicting the function of unknown proteins.

## 2. Machine Learning Approach

We formulate protein function prediction as a supervised classification problem:

- Input: numerical protein features
- Output: functional class label

Steps:
1. Load and clean dataset
2. Split into stratified training (80%) and test (20%) sets
3. Train a Random Forest classifier
4. Evaluate performance using accuracy, precision, recall, F1-score, and a confusion matrix

In [3]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt

In [None]:
data = pd.read_csv("/content/protein_data.csv")
data.head()

In [None]:
print("Dataset shape:", data.shape)
print("\nClass distribution:")
print(data["class"].value_counts())

## 3. Feature Description

Each protein is represented by numerical features such as:
- Amino acid composition
- Hydrophobicity
- Molecular weight indicators

Because each protein is reduced to a fixed-length numeric vector, machine learning models can capture biological patterns directly from these features, without aligning the raw sequences.
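
The dataset's exact feature columns are not listed here, so the following is only a minimal sketch of how features of this kind could be computed from a raw sequence. The function name `protein_features`, the output keys, and the 110 Da average residue mass (a common rule of thumb, not an exact value) are assumptions for illustration; the hydropathy values are the standard Kyte-Doolittle scale.

```python
from collections import Counter

# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Assumption: rough average mass of one residue, used as a crude weight proxy.
AVG_RESIDUE_MASS_DA = 110.0


def protein_features(sequence: str) -> dict:
    """Hypothetical helper: turn one sequence into a fixed-length feature dict."""
    seq = sequence.upper()
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if not seq or unknown:
        raise ValueError(f"Empty sequence or unsupported residues: {sorted(unknown)}")

    counts = Counter(seq)
    n = len(seq)

    # Amino acid composition: fraction of each of the 20 residues.
    features = {f"frac_{aa}": counts.get(aa, 0) / n for aa in sorted(KYTE_DOOLITTLE)}
    # Hydrophobicity: mean hydropathy over the whole sequence.
    features["mean_hydropathy"] = sum(KYTE_DOOLITTLE[aa] for aa in seq) / n
    # Molecular weight indicator: length times the assumed average residue mass.
    features["approx_mass_da"] = n * AVG_RESIDUE_MASS_DA
    return features


example = protein_features("AAAK")
print(example["frac_A"], example["mean_hydropathy"], example["approx_mass_da"])
```

Every sequence maps to the same 22 keys regardless of its length, which is what lets a tabular classifier such as a Random Forest consume the result.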

In [None]:
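# Separate the feature columns from the target label.
X = data.drop("class", axis=1)
y = data["class"]

# Hold out 20% of the proteins for testing. stratify=y keeps the class
# proportions the same in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)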
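
To illustrate what `stratify` does in the split above, here is a small sketch on made-up data. The class names `enzyme` and `transporter`, the feature names, and the 150/50 imbalance are invented for the example and are not taken from the protein dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 200 samples with a 3:1 class imbalance.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
toy_y = pd.Series(["enzyme"] * 150 + ["transporter"] * 50, name="class")

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y,
    test_size=0.2,
    random_state=42,
    stratify=toy_y,  # preserve the 3:1 ratio in both subsets
)

# Compare class proportions in the two subsets side by side.
train_props = y_tr.value_counts(normalize=True)
test_props = y_te.value_counts(normalize=True)
print(pd.DataFrame({"train": train_props, "test": test_props}))
```

Without stratification, a rare class can end up under-represented in the test set purely by chance, which makes its per-class precision and recall unreliable.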

In [None]:
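model = RandomForestClassifier(
    n_estimators=200,   # number of trees in the ensemble
    max_depth=None,     # no depth limit on individual trees
    random_state=42     # fixed seed so results are reproducible
)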

model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)

print("Classification Report:")
print(classification_report(y_test, y_pred))

In [None]:
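# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=model.classes_)

plt.figure(figsize=(5, 5))
plt.imshow(cm, cmap="Blues")
plt.title("Confusion Matrix")
plt.colorbar()
# Label the ticks with class names so each cell can be read directly.
ticks = np.arange(len(model.classes_))
plt.xticks(ticks, model.classes_, rotation=45)
plt.yticks(ticks, model.classes_)
plt.xlabel("Predicted")
plt.ylabel("True")
plt.tight_layout()
plt.show()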
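
The per-class precision, recall, and F1-score in the classification report can be derived from the confusion matrix itself. The sketch below shows the arithmetic on a made-up 2x2 matrix; `per_class_metrics` is a hypothetical helper and the counts are not results from this project.

```python
import numpy as np


def per_class_metrics(cm):
    """Compute precision, recall, and F1 per class from a confusion matrix.

    Assumes scikit-learn's layout: rows are true classes, columns are predictions.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                 # correct predictions for each class
    predicted = cm.sum(axis=0)       # column sums: times each class was predicted
    actual = cm.sum(axis=1)          # row sums: times each class truly occurred

    # Guard against division by zero for classes never predicted or never present.
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1


# Made-up counts for two classes, for illustration only.
toy_cm = [[8, 2],
          [1, 9]]
precision, recall, f1 = per_class_metrics(toy_cm)
print("precision:", precision)
print("recall:   ", recall)
print("f1:       ", f1)
```

For the first class in this toy matrix, recall is 8 / (8 + 2) because two of its ten true members were assigned to the other class, while precision is 8 / (8 + 1) because one of the nine predictions for that class was wrong.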

## 4. Results and Interpretation

The model achieves strong classification performance on the held-out test set, as summarized by the classification report and confusion matrix above. This indicates that the numerical protein features carry meaningful biological information.

Random Forests are well suited to this task because they can model nonlinear feature interactions, which are common in biological systems.
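
As a rough illustration of the point about nonlinear interactions, the sketch below uses synthetic data, not the protein dataset. The label depends only on the sign of the product of two features, a pure interaction that a linear model cannot represent. Logistic regression is included only as a linear baseline for contrast.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic XOR-like data: the class is 1 when both features share a sign.
rng = np.random.default_rng(0)
X_toy = rng.uniform(-1, 1, size=(1000, 2))
y_toy = (X_toy[:, 0] * X_toy[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Tree ensembles can carve the plane into quadrants.
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)
# A linear decision boundary cannot separate opposite quadrants.
lr = LogisticRegression().fit(X_tr, y_tr)

rf_acc = rf.score(X_te, y_te)
lr_acc = lr.score(X_te, y_te)
print(f"Random Forest accuracy:       {rf_acc:.2f}")
print(f"Logistic regression accuracy: {lr_acc:.2f}")
```

Neither feature is informative on its own in this toy problem, so the gap between the two models comes entirely from the interaction term.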

## 5. Limitations and Future Work

Limitations:
- Feature-based approach ignores protein structure
- Dataset size limits generalization
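
One way to gauge how far a single train/test split can be trusted on a limited dataset is stratified k-fold cross-validation, sketched below. This is a suggestion rather than part of the original analysis: the data comes from `make_classification` as a stand-in for the protein features, and the fold count and macro-F1 scoring are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in data (assumption): 300 samples, 10 features, 3 classes.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Five stratified folds: each sample is used for testing exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=42),
    X_toy, y_toy,
    cv=cv,
    scoring="f1_macro",  # unweighted mean of per-class F1
)

print("Macro-F1 per fold:", scores.round(3))
print(f"Mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```

A large spread across folds suggests that the score from any single split depends heavily on which proteins happened to land in the test set.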

Future work:
- Deep learning on raw sequences
- Integration with structural biology data