# Snake venom classification exercise

In this exercise, we are going to take a look at snake venoms :snake:. Because this is an educational course, we will first
examine how visualizations and basic machine learning libraries, such as scikit-learn, can be used to classify snake venom toxins.
Later, we will use ProteusAI to greatly simplify the workflow and add further capabilities.

We will use the UniProt database to fetch snake venom toxin sequences. We will use a custom query to select reviewed snake toxin entries.

**UniProt Query**: `taxonomy_id:8570 AND ( cc_tissue_specificity:venom OR cc_scl_term:SL-0177) AND reviewed:true`

We filtered the data, removing toxins that are hard to group into categories, and assigned Function Classes based on the 'Function [CC]' column
using OpenAI's GPT o1 model. This gives us simple functional labels that are easier to work with than the lengthy function descriptions.

We import pandas to load the data; Matplotlib is used later to visualize it. Take a look at the data columns.

_The exercises have been created by Jonathan Funk and Valentas Brasas_


In [None]:
import pandas as pd

df = pd.read_csv('Snake_Toxins_with_Function_Classes.csv')
df

### Exercise 1: Exploratory Data Analysis (EDA)

Objective: Familiarize yourself with the dataset and understand the distribution of the toxin functional classes, so that you can identify class imbalances that might affect model performance.


1. Count the number of instances per class.
2. Check for missing, duplicate, or inconsistent data to ensure data quality for model training.
3. Analyze the distribution of protein sequence lengths to understand variability and inform preprocessing steps like padding or truncation.
4. Examine the frequency and distribution of each amino acid in the protein sequences to identify patterns or biases that could inform feature engineering.
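
One possible starting point for the four EDA steps above is sketched below. It runs on a small made-up table rather than the real CSV: the column names `Sequence` and `Function Class` match the ones used later in this notebook, but the sequences and class names are invented for illustration.

```python
from collections import Counter

import pandas as pd

# Toy stand-in for the real table (invented sequences and class names)
toy = pd.DataFrame({
    "Sequence": ["MKTLLA", "MKTLLA", "CCGGA", None, "AAAC"],
    "Function Class": ["Neurotoxin", "Neurotoxin", "Cytotoxin", "Cytotoxin", "Neurotoxin"],
})

# 1. Number of instances per class
class_counts = toy["Function Class"].value_counts()

# 2. Missing values and duplicated sequences
n_missing = int(toy["Sequence"].isna().sum())
n_duplicates = int(toy["Sequence"].dropna().duplicated().sum())

# 3. Sequence lengths, computed after dropping missing and duplicated rows
clean = toy.dropna(subset=["Sequence"]).drop_duplicates(subset=["Sequence"])
lengths = clean["Sequence"].str.len()

# 4. Amino acid frequencies over all remaining sequences
aa_counts = Counter("".join(clean["Sequence"]))

print(class_counts)
print(n_missing, n_duplicates)
print(lengths.describe())
print(aa_counts.most_common(5))
```

On the real data, replacing `toy` with `df` should be enough; a histogram of `lengths` is a natural next step.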

## Visualizing protein sequences

In machine learning, data is represented numerically, frequently in the form of vectors. There are countless choices to be made
when representing proteins as vectors, which will be covered in detail in later lectures. Here we start simply by encoding
sequences as one-hot encoded vectors. This means that we assign each residue in the sequence a vector that is
0 in every position except for a single 1. The position of the 1 indicates which amino acid we are dealing with.

For example, the amino acid alanine (Ala, A) can be represented by a 1 in the first position, while the amino acid
cysteine (Cys, C) can be represented by a 1 in the second position. In this way, we communicate that the two amino acids
are different entities. This method is known as one-hot encoding, or OHE for short, and is commonly used to represent discrete
sequences.

Note that with this encoding, the only information available to the machine learning algorithm is that
the amino acids are different from one another.
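
To make this point concrete, the short sketch below builds the one-hot vectors for all 20 standard amino acids and compares a few of them. The alphabetical ordering of the amino acids is an assumption carried over from the encoder used later in this notebook; any fixed ordering would behave the same way.

```python
import numpy as np

amino_acids = "ACDEFGHIKLMNPQRSTVWY"

# Each row of the identity matrix is the one-hot vector of one amino acid
one_hot_vectors = np.eye(len(amino_acids), dtype=int)

ala = one_hot_vectors[amino_acids.index("A")]
cys = one_hot_vectors[amino_acids.index("C")]
trp = one_hot_vectors[amino_acids.index("W")]

# Euclidean distances between pairs of different amino acids
dist_ala_cys = np.linalg.norm(ala - cys)
dist_ala_trp = np.linalg.norm(ala - trp)

print(ala)
print(dist_ala_cys, dist_ala_trp)
```

Both distances are identical, so the encoding carries no notion of some amino acids being more similar to each other than others.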

### Exercise 2: What could be a problem when representing amino acids as:

    1. Integers: A=1, C=2, D=3, ..., Y=20?
    
    2. One-hot encoded vectors (OHE)?

We will use NumPy to encode the protein sequences as vectors.

In [None]:
import numpy as np

def one_hot_encode(seq):
    # Dictionary of standard amino acids
    amino_acids = 'ACDEFGHIKLMNPQRSTVWY'
    aa_to_int = {aa: i for i, aa in enumerate(amino_acids)}

    # Initialize the one-hot encoded matrix
    one_hot = np.zeros((len(seq), len(amino_acids)), dtype=int)

    # Fill the matrix
    for i, aa in enumerate(seq):
        if aa in aa_to_int:
            one_hot[i, aa_to_int[aa]] = 1

    return one_hot

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

def visualize_one_hot(encoded_seq, seq):
    # Create a heatmap from the one-hot encoded matrix
    plt.figure(figsize=(10, len(encoded_seq) / 2))
    sns.heatmap(encoded_seq, annot=True, cbar=False, cmap='rocket',
                xticklabels=list('ACDEFGHIKLMNPQRSTVWY'), yticklabels=list(seq))
    plt.xlabel('Amino Acids')
    plt.ylabel('Sequence Position')
    plt.title('One-Hot Encoding of Amino Acid Sequence')
    plt.savefig('one_hot_encoding.png')
    plt.show()

# Example usage
seq = "MLQVLLVTICLAVF"
encoded_seq = one_hot_encode(seq)
visualize_one_hot(encoded_seq, seq)

### Exercise 3

1. What are different ways to encode protein sequences for machine learning algorithms? List at least four, including one-hot encoding, and fill in the table below.

| Encoding Method | Description | Advantages | Disadvantages |
|-----------------|-----------------|-----------------|-----------------|
| _One hot encoding_   | _Description 1_   | _List the advantages for Encoding 1 ._     |_List the disadvantages for Encoding 1 ._     |
| _Method 2_    | _Description 2_    | _List the advantages for Encoding 2 ._    |_List the disadvantages for Encoding 2 ._      |
| _Method 3_    | _Description 3_    | _List the advantages for Encoding 3 ._     |_List the disadvantages for Encoding 3 ._    |
| _Method 4_    | _Description 4_    | _List the advantages for Encoding 4 ._    |_List the disadvantages for Encoding 4 ._   |

2. Write functions to encode protein sequences using the methods you listed.
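
As one example of an alternative encoding function, the sketch below computes normalized k-mer frequencies. The function name `kmer_encode` and the choice of `k=2` are illustrative assumptions, not part of the course material. Unlike one-hot encoding, the result has the same length for every sequence.

```python
from collections import Counter
from itertools import product

import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def kmer_encode(seq, k=2):
    """Return a fixed-length vector of k-mer frequencies for one sequence."""
    # All possible k-mers over the 20 standard amino acids, in a fixed order
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}

    # Count overlapping k-mers in the sequence
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

    vec = np.zeros(len(kmers))
    for kmer, n in counts.items():
        if kmer in index:  # skip k-mers containing non-standard residues
            vec[index[kmer]] = n

    # Normalize to frequencies so that sequence length does not dominate
    total = vec.sum()
    return vec / total if total > 0 else vec

x = kmer_encode("MLQVLLVTICLAVF", k=2)
print(x.shape)
```

Note that the vector length grows as 20^k, so small values of k are usually the practical choice.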

### Now let's encode the entire df

In machine learning, it is common to call the input x. We therefore encode the sequences and store them in a column named x.

### Exercise 4:

Create a column 'x' in the dataframe to encode all protein sequences using one-hot encoding.
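
A minimal sketch of this step on a two-row toy dataframe is shown below. It re-defines the one-hot encoder so that the block is self-contained, and it assumes the sequence column is called `Sequence`, as in the CSV used in this notebook.

```python
import numpy as np
import pandas as pd

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INT = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(seq):
    # Same idea as the encoder defined earlier in this notebook
    one_hot = np.zeros((len(seq), len(AMINO_ACIDS)), dtype=int)
    for i, aa in enumerate(seq):
        if aa in AA_TO_INT:
            one_hot[i, AA_TO_INT[aa]] = 1
    return one_hot

# Toy dataframe with invented sequences
toy = pd.DataFrame({"Sequence": ["MLQV", "ACD"]})

# Each cell of 'x' holds one (sequence length x 20) matrix
toy["x"] = toy["Sequence"].apply(one_hot_encode)

print(toy["x"].iloc[0].shape, toy["x"].iloc[1].shape)
```

The matrices have different numbers of rows because the sequences differ in length, which becomes relevant when building the feature matrix for training.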

## Encoding class labels

Next, we also need to encode the class labels. These will ultimately be one-hot encoded as well; however, the library we are going to use to train the classification model
expects simple numbers as class labels and does the OHE under the hood. Because of this, we are going to encode the classes as numbers. The output variable (the class, in this case)
is often called y, so we call the encoded class labels y.

### Exercise 5:

Create a column 'y' in the dataframe, assigning a unique integer value to each function class.
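
One way to do this is with `pandas.factorize`, sketched below on a toy dataframe. The class names are invented for illustration; the column name `Function Class` follows the CSV used in this notebook.

```python
import pandas as pd

# Toy dataframe with invented class names
toy = pd.DataFrame(
    {"Function Class": ["Neurotoxin", "Cytotoxin", "Neurotoxin", "Hemotoxin"]}
)

# factorize assigns integers in order of first appearance
codes, classes = pd.factorize(toy["Function Class"])
toy["y"] = codes

# Keep the mapping so predictions can be translated back to class names
int_to_class = dict(enumerate(classes))

print(toy)
print(int_to_class)
```

scikit-learn's `LabelEncoder` is a common alternative; it assigns integers in sorted order of the class names instead.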

## Training the classification model using scikit-learn

Next, we are going to train machine learning models using scikit-learn to classify snake toxins based on their encoded sequences.

### Exercise 6:

    6.1 Load the dataset into variables X (features) and y (labels).
    6.2 Explore the structure of X. Are all sequences in X the same size? If not, figure out how to handle this.
    6.3 Divide the dataset into training and testing sets.
    6.4 Pick a classifier from the following options:
        a. Logistic Regression
        b. Random Forest
        c. Support Vector Machine
    6.5 Make predictions
    6.6 Evaluate the model with appropriate metrics and interpret the results
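
The steps of this exercise can be sketched end to end as follows. This is only one possible approach, shown on synthetic data: the sequences are randomly generated with an artificial class signal, zero-padding to the longest sequence is just one way to handle unequal lengths, and the helper names `one_hot_padded` and `random_seq` are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INT = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_padded(seq, max_len):
    """One-hot encode, zero-pad (or truncate) to max_len, and flatten to 1D."""
    mat = np.zeros((max_len, len(AMINO_ACIDS)), dtype=int)
    for i, aa in enumerate(seq[:max_len]):
        if aa in AA_TO_INT:
            mat[i, AA_TO_INT[aa]] = 1
    return mat.ravel()

# Synthetic toy data: class 0 is cysteine-rich, class 1 is lysine-rich
rng = np.random.default_rng(0)

def random_seq(enriched):
    length = rng.integers(10, 20)
    probs = np.full(20, 0.5 / 19)
    probs[AA_TO_INT[enriched]] = 0.5
    return "".join(rng.choice(list(AMINO_ACIDS), size=length, p=probs))

seqs = [random_seq("C") for _ in range(50)] + [random_seq("K") for _ in range(50)]
y = np.array([0] * 50 + [1] * 50)

# Sequences differ in length, so pad all of them to the longest one
max_len = max(len(s) for s in seqs)
X = np.stack([one_hot_padded(s, max_len) for s in seqs])

# Stratified split keeps the class proportions in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
print(acc)
print(classification_report(y_test, y_pred))
```

Swapping `LogisticRegression` for `RandomForestClassifier` or `SVC` only changes the line that creates `clf`; the rest of the workflow stays the same.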
    

## Visualizing results

Now that we have trained the model, we would like to visualize the results. This is an important step for communicating the capabilities of your model, and it also lets you quickly see where the model performs better or worse. Visualize your results using a confusion matrix.

### Exercise 7

    7.1 Visualize the classification results using a confusion matrix.
    7.2 Discuss the results. What are the differences between the classes?
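
A minimal sketch of the confusion matrix step using scikit-learn is shown below. The labels and predictions are hard-coded toy values, and the class names are placeholders; in the exercise you would pass your own `y_test`, `y_pred`, and function class names instead.

```python
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Toy labels and predictions for three classes (placeholders)
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]
class_names = ["class A", "class B", "class C"]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Draw the matrix as a labeled heatmap
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_names)
disp.plot(cmap="Blues")
```

Off-diagonal entries show which classes are confused with each other, which is a useful starting point for the discussion in the second part of the exercise.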

### Exercise 8

1. Compare Metrics: Write code to analyze and compare how different classification metrics (precision, recall, F1-score) correlate with each other.

2. Add a New Metric: Implement an additional metric (e.g., ROC-AUC, Matthews Correlation Coefficient) and use it to further compare the models.

3. Model Comparison: Re-run three classification models—Logistic Regression, Random Forest, and Support Vector Machine (SVM). Compare their performance using the metrics above and determine which model performs best.

4. Optimize Model Performance: Explore and apply new strategies to improve classification results. Potential strategies to try:
   * Experiment with different sequence encoding methods (e.g., k-mer encoding, physicochemical properties).
   * Address class imbalance using techniques like oversampling, undersampling, or adjusting class weights.
   * Implement an ensemble method, such as a voting classifier or boosting, to combine the strengths of multiple models.
   * Perform hyperparameter tuning using GridSearchCV, RandomizedSearchCV, or Bayesian optimization.
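
As a rough template for the model comparison, the sketch below trains the three classifiers on a synthetic, imbalanced dataset and reports macro F1 and the Matthews correlation coefficient. The data come from scikit-learn's `make_classification`, not from the toxin table, and `class_weight="balanced"` is just one of the imbalance strategies listed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic three-class dataset with imbalanced classes (stand-in for real features)
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8,
    n_classes=3, weights=[0.6, 0.3, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "Random Forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
    "SVM": SVC(class_weight="balanced", random_state=0),
}

# Fit each model and collect the same metrics for a side-by-side comparison
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "macro_f1": f1_score(y_test, y_pred, average="macro"),
        "mcc": matthews_corrcoef(y_test, y_pred),
    }

for name, scores in results.items():
    print(f"{name}: macro F1 = {scores['macro_f1']:.3f}, MCC = {scores['mcc']:.3f}")
```

A single train/test split gives noisy estimates; cross-validation (for example with `cross_val_score`) would make the comparison more reliable.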



# ProteusAI

Now let's do the same thing with the library `ProteusAI`, which automates a lot of the tedious steps and gets you from training to results in only a few lines of code. ProteusAI will be used in later exercises.

In [None]:
import proteusAI as pai

In [None]:
lib = pai.Library(
    source="Snake_Toxins_with_Function_Classes.csv",
    names_col="Entry Name",
    seqs_col="Sequence",
    y_col="Function Class",
    y_type="class"
)

In [None]:
lib.compute("esm2_8M", device="cpu", batch_size=100)

In [None]:
fig, ax, df = lib.plot(method="tsne", rep="esm2_8M")

plt.legend(loc='upper right', bbox_to_anchor=(1.2, 1.0))
plt.savefig("tsne.png")

In [None]:
fig, ax, df = lib.plot(method="tsne", rep="esm2_8M")

plt.savefig("tsne.png")
plt.show()

In [None]:
model = pai.Model(
    model_type="rf",
    library=lib,
    x="esm2_8M",
    seed=42,
)

In [None]:
model.train()

In [None]:
# Plot confusion matrix (implement in ProteusAI)