In [None]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd

# Splice site prediction

Gene splicing is a post-transcriptional modification in which a single gene can code for multiple proteins. Gene splicing takes place in eukaryotes, prior to mRNA translation, through the differential inclusion or exclusion of regions of pre-mRNA. Gene splicing is an important source of protein diversity.

The vast majority of splice sites are characterized by the presence of specific dimers on the intronic side of the splice site: "GT" for donor and "AG" for acceptor sites. In this project you will fit a classification model for acceptor splice site prediction in DNA sequences.

This model will consider each AG in the DNA as a candidate acceptor site, extract a local context surrounding the candidate acceptor site, represent the candidate site as a feature vector and then predict the class ('acceptor site' or 'not acceptor site') by applying the model to the constructed feature vector.
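
The candidate extraction step described above can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `extract_candidates`, the parameter `flank` and the toy sequence are made up for this example and are not part of the course data or code.

```python
def extract_candidates(dna, flank=10):
    """Return (position, context) pairs for every "AG" that has a full flank on both sides."""
    candidates = []
    start = dna.find("AG")
    while start != -1:
        left = start - flank
        right = start + 2 + flank
        # Skip candidates too close to either end to have a complete local context.
        if left >= 0 and right <= len(dna):
            candidates.append((start, dna[left:right]))
        start = dna.find("AG", start + 1)
    return candidates


# Made-up toy sequence with a single "AG" at 0-based index 10.
toy_dna = "TTTCTCTTCCAGGTGCTAACGTCC"
candidates = extract_candidates(toy_dna)

for position, context in candidates:
    # In the 22-character context, "AG" sits at 0-based indices 10 and 11.
    print(position, context, context[10:12])
```

Each extracted context has length 22: 10 nucleotides, the "AG" itself, and 10 more nucleotides.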

This is what the training data looks like:

In [None]:
data_train = pd.read_csv("https://raw.githubusercontent.com/sdgroeve/Machine_Learning_course_UGent_D012554_data/master/practicum/Classification/acceptor_sites_dataset_train.csv")

In [None]:
data_train.head(5)

There are just two columns. 

The column "sequence" contains the local context DNA sequence. We can see that nucleotide positions 11 and 12 in the sequence are always "A" and "G". These are the candidate acceptor sites with a local context that consists of 10 nucleotides upstream and 10 nucleotides downstream of the AG. 

The column "label" contains the class of the candidate acceptor site: 1 for "acceptor site" and -1 for "not acceptor site". 

*How many sequences does the dataset contain for each class?*

In [None]:
###Start code here

###End code here

Next, we load the test data:

In [None]:
data_test = pd.read_csv("https://raw.githubusercontent.com/sdgroeve/Machine_Learning_course_UGent_D012554_data/master/practicum/Classification/acceptor_sites_dataset_test.csv")

In [None]:
data_test

To compute features from the `sequence` column we first concatenate the training and testing data into one DataFrame. In this manner the training and testing data are processed in exactly the same way. We can later reconstruct the training and testing DataFrames.
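
The concatenate-then-split idea can be illustrated on two tiny DataFrames. The frames `toy_train` and `toy_test` and their contents are invented for this sketch; only `pd.concat()` and `.iloc` are real Pandas functionality.

```python
import pandas as pd

# Invented toy data with the same two columns as the real dataset.
toy_train = pd.DataFrame({"sequence": ["ACGT", "TTGA"], "label": [1, -1]})
toy_test = pd.DataFrame({"sequence": ["GGCA"], "label": [-1]})

# Stack the rows; ignore_index=True renumbers the rows 0..n-1.
toy_all = pd.concat([toy_train, toy_test], ignore_index=True)

# Because the training rows come first, the split point is the training set size.
n_train = len(toy_train)
toy_train_again = toy_all.iloc[:n_train]
toy_test_again = toy_all.iloc[n_train:]

print(toy_all)
print(len(toy_train_again), len(toy_test_again))
```

Any processing applied to the combined frame in between is then guaranteed to treat both parts identically.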

*Use the Pandas function `concat()` to concatenate the training and testing data into a DataFrame called `data`. The training data should be the first rows, with the testing data beneath those rows:*

In [None]:
###Start code here
data = 
###End code here

data

Pop the `label` column from the `data` DataFrame and assign it to variable `y`:

In [None]:
###Start code here
y = 
###End code here

y

We need to represent the local context DNA sequence as a feature vector suitable for model fitting. This process is known as **feature engineering**. 

The "AG" dinucleotide in the middle of each local context sequence is the same for both classes, i.e. it does not provide any discriminative information. So, there is no rationale for computing features from this part of the local context sequence.

*Use the Pandas DataFrame `.apply()` method to remove the middle "AG" dinucleotides in the DNA sequences (don't create a new column):*

In [None]:
print(data.head())

###Start code here
data["sequence"] = 
###End code here

print(data.head())

First, we create a feature for each of the nucleotide positions in the local context DNA sequence.

The [pandas.Series.str.split](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html) function splits a string in a column (pandas.Series) from the beginning, at the specified delimiter string.
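
As a small illustration of how `str.split` behaves, the sketch below splits made-up strings at a hyphen. The hyphen delimiter and the toy values are assumptions for this example only; the real sequences contain no delimiter. The `expand=True` argument returns the pieces as separate columns instead of a list per row.

```python
import pandas as pd

# Made-up strings with an explicit delimiter, purely to show the behaviour.
toy_series = pd.Series(["A-C-G", "T-T-A"])

# expand=True spreads the pieces over columns named 0, 1, 2, ...
toy_parts = toy_series.str.split("-", expand=True)

print(toy_parts)
```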

*Use this function to split the `sequence` column into one column for each nucleotide position. Put the resulting columns in a DataFrame called `data_features`:*

In [None]:
###Start code here
data_features = 
###End code here

data_features

In a Pandas DataFrame, the `.columns` attribute contains a list with the column names.

*Rename the columns to the relative position of the nucleotide position in the local context (from -10 to 10):*

In [None]:
###Start code here
data_features.columns = 
###End code here

data_features

Next we apply `sklearn.preprocessing.LabelEncoder` to replace each nucleotide by a number.
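
To see what `LabelEncoder` does, here is a minimal sketch on a made-up list of nucleotides (the list is an assumption, not course data). The encoder sorts the distinct values and maps each one to its index in that sorted order.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up nucleotides for illustration.
toy_nucleotides = ["A", "C", "G", "T", "A"]

toy_encoder = LabelEncoder()
toy_encoded = toy_encoder.fit_transform(toy_nucleotides)

# classes_ holds the sorted distinct values; the position in it is the code.
print(list(toy_encoder.classes_))
print(list(toy_encoded))
```

Note that the resulting integers imply an ordering (A < C < G < T) that has no biological meaning.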

*Create a Pandas DataFrame `data_features_int_encoding` by applying the `LabelEncoder` on each feature in `data_features`:*

In [None]:
from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()

data_features_int_encoding = pd.DataFrame()
for col in data_features.columns:
    ###Start code here
    data_features_int_encoding[col] = 
    ###End code here
    
data_features_int_encoding.head()

Finally, we reconstruct the training and testing DataFrames based on the number of data points in the training data:

In [None]:
data_features_int_encoding_train = data_features_int_encoding.iloc[:len(data_train),:]
data_features_int_encoding_test = data_features_int_encoding.iloc[len(data_train):,:]

y_train = y.iloc[:len(data_train)]
y_test = y.iloc[len(data_train):]

Now we evaluate the generalization performance of a logistic regression model with hyperparameter $C=0.1$ on the training data in `data_features_int_encoding_train` using 10-fold cross-validation. 

*Apply the `cross_val_score()` function to compute an accuracy score for each fold in the CV:*

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import numpy as np

###Start code here
model = 
scores = 
###End code here

print(np.mean(scores))

*Fit a logistic regression model on the train set.*

In [None]:
###Start code here

###End code here

*Make predictions for the test set.*

In [None]:
###Start code here
predictions = 
###End code here

Scikit-learn offers many metrics to evaluate model predictions. These functions are contained in the `metrics` module of `sklearn`. 

*Can you find how to compute the accuracy of these predictions (use the `metrics` module)?*

In [None]:
from sklearn import metrics

###Start code here
score_acc = 
###End code here

print(score_acc)

An accuracy above 90% seems like a good score. But is it? Let's consider a model that predicts class "-1" for all test points.

In [None]:
predictions_zero = [-1]*len(data_test.label)

*What is the accuracy of these predictions?*

In [None]:
###Start code here
score_acc = 
###End code here

print(score_acc)

So this should be a good score as well, even though we did not learn anything.

For classification tasks where the classes are highly imbalanced, accuracy is not a good metric to evaluate the generalization performance. In fact, if there are 0.1% "AG" dinucleotides in a genome that are true acceptor sites then a model that predicts class "-1" for each "AG" would have an accuracy of 99.9%.
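
The arithmetic behind the 99.9% figure can be checked with a few lines of NumPy. The array of 1000 candidates with a single true acceptor site is a made-up example chosen to match the 0.1% rate mentioned above.

```python
import numpy as np

n_candidates = 1000

# One true acceptor site among 1000 candidates: 0.1% positives.
y_true = np.full(n_candidates, -1)
y_true[0] = 1

# A "model" that predicts -1 for every candidate.
y_pred = np.full(n_candidates, -1)

# Fraction of candidates labelled correctly.
accuracy = np.mean(y_true == y_pred)

# Fraction of true acceptor sites that were found (true positive rate).
recall = np.sum((y_true == 1) & (y_pred == 1)) / np.sum(y_true == 1)

print(accuracy, recall)
```

The accuracy is very high even though not a single acceptor site is detected.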

We have seen how a ROC curve plots the true positive rate against the false positive rate. Both these metrics focus on the positive class, in our case the true acceptor sites. These metrics are much more suitable for evaluating the performance of models on tasks with highly imbalanced classes. To summarize a ROC curve in a single metric we can use the area under the curve (AUC). 

*What is the AUC score of the predictions computed by the logistic regression model we fitted?*

In [None]:
###Start code here
score_auc = 
###End code here

print(score_auc)

You should see a much lower value than the accuracy. 

To compute the AUC, we need the predictions to be scores (a continuous value) rather than class labels (discrete values).

For logistic regression these scores are the class probabilities predicted by the model (a value between 0 and 1). 
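
The difference between scoring class labels and scoring continuous values can be seen on a handful of made-up numbers. The labels and scores below are invented for illustration; `roc_auc_score` is the scikit-learn function for the AUC. Thresholding the scores at 0.5 throws away the ranking information inside each predicted class, which changes the AUC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Invented true labels and predicted positive-class scores.
toy_true = np.array([-1, -1, -1, 1, 1])
toy_scores = np.array([0.1, 0.2, 0.6, 0.4, 0.9])

# Hard class labels obtained by thresholding the scores at 0.5.
toy_labels = np.where(toy_scores >= 0.5, 1, -1)

auc_from_scores = roc_auc_score(toy_true, toy_scores)
auc_from_labels = roc_auc_score(toy_true, toy_labels)

print(auc_from_scores, auc_from_labels)
```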

We can obtain these scores with the `predict_proba()` method of the `LogisticRegression` class as follows:

In [None]:
predictions = model.predict_proba(data_features_int_encoding_test)

print(predictions)

The first and second column contain the predicted probabilities for class '-1' and '1' respectively. To compute the AUC we need to use the positive class probabilities. 

*What is the AUC now?*

In [None]:
###Start code here
score_auc = 
###End code here

print(score_auc)

Is this good generalization performance?

Transforming categorical features into ordered integers is maybe not a good idea as the nucleotides don't have any ordering (the columns are not ordinal features). 

It is better to transform a categorical feature into one binary feature for each category (known as *one-hot* encoding). 
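
A minimal sketch of one-hot encoding on a made-up two-position table follows. The column names `"-1"` and `"1"` and the values are assumptions for this example. Each new column is named `<original column>_<category>` and marks whether that category occurs at that position.

```python
import pandas as pd

# Made-up nucleotides at two relative positions.
toy_features = pd.DataFrame({"-1": ["A", "C"], "1": ["G", "A"]})

toy_onehot = pd.get_dummies(toy_features)

print(toy_onehot)
```

Every row has exactly one active indicator per original column, so no artificial ordering between nucleotides is introduced.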

*Use the Pandas function `get_dummies()` to compute one-hot encoded features (put them in a DataFrame called `data_features_onehot_encoding`):*

In [None]:
###Start code here
data_features_onehot_encoding = 
###End code here

data_features_onehot_encoding

Evaluate the generalization performance of a logistic regression model with hyperparameter $C=0.1$ on the training data in `data_features_onehot_encoding` using 10-fold cross-validation. 

The `cross_val_score()` function has a parameter called `scoring` that allows you to set different scoring metrics.
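
The effect of the `scoring` parameter can be sketched on synthetic data. Everything about the data here is an assumption: it is generated with `make_classification` and has nothing to do with the splice site sequences, and 5 folds are used only to keep the example small. The strings `"roc_auc"` and `"accuracy"` are standard scikit-learn scorer names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced toy data (roughly 80% / 20%).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

toy_model = LogisticRegression(C=0.1, max_iter=1000)

# Same model and folds, two different metrics.
auc_per_fold = cross_val_score(toy_model, X_toy, y_toy, cv=5, scoring="roc_auc")
acc_per_fold = cross_val_score(toy_model, X_toy, y_toy, cv=5, scoring="accuracy")

print("mean AUC:", auc_per_fold.mean())
print("mean accuracy:", acc_per_fold.mean())
```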

*Use the `cross_val_score()` function to compute the mean AUC of the CV-scores.* 


In [None]:
model = LogisticRegression(C=0.1)

###Start code here
data_features_onehot_encoding_train =
data_features_onehot_encoding_test = 

score_auc = 
###End code here

print(score_auc)

*What is the AUC on `data_test`?*

In [None]:
###Start code here

###End code here

score_auc

Is this close to what your CV is telling you?

We have used hyperparameter $C=0.1$ for the logistic regression model. 

*Is there a better value for this regularization parameter (use `GridSearchCV`)?*

In [None]:
from sklearn.model_selection import GridSearchCV

search_space = [0.001,0.01,0.1,1,10,100]
params = dict(C=search_space)
print(params)

###Start code here

###End code here

print(grid_search.best_estimator_)
print(grid_search.best_score_)

*What is the 10-CV AUC performance with this value for $C$?*

In [None]:
###Start code here

###End code here

score_auc

*What is the AUC performance on the test set for this value of $C$?*

In [None]:
###Start code here

###End code here

score_auc

Is this closer to the AUC you computed using 10-CV?

In scikit-learn a fitted logistic regression model has the fitted model parameter values stored in `.coef_[0]`:

In [None]:
print(model.coef_[0])

For logistic regression this is one model parameter for each feature (plus the intercept, which is not in `.coef_[0]` but in `.intercept_`). 

Recall that for logistic regression a prediction is made by multiplying each fitted model parameter with the corresponding feature, summing them (together with the intercept) and then squeezing this sum between 0 and 1 with the logistic function. 
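
This computation can be reproduced by hand and compared with `predict_proba()`. The sketch below uses randomly generated binary features and labels (an assumption made for illustration, unrelated to the splice site data) and the default binary `LogisticRegression` settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Randomly generated 0/1 features and labels in {-1, 1}.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(200, 4))
noise = rng.normal(0.0, 0.5, size=200)
y_toy = np.where(X_toy[:, 0] + X_toy[:, 1] + noise > 1.0, 1, -1)

toy_model = LogisticRegression(C=0.1).fit(X_toy, y_toy)

# Weighted sum of the features plus the intercept ...
weighted_sum = X_toy @ toy_model.coef_[0] + toy_model.intercept_[0]

# ... squeezed between 0 and 1 by the logistic function.
manual_proba = 1.0 / (1.0 + np.exp(-weighted_sum))

# Probability of the positive class as reported by scikit-learn.
model_proba = toy_model.predict_proba(X_toy)[:, 1]

print(np.allclose(manual_proba, model_proba))
```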

Since all features have values 0 or 1, the model parameter values indicate the contribution (importance) of a feature during prediction.

First we put the feature names and model parameter values in a new DataFrame:

In [None]:
F_importances = []
for feature_name, modelparameter in zip(data_features_onehot_encoding.columns,model.coef_[0]):
    F_importances.append([feature_name,modelparameter])
F_importances = pd.DataFrame(F_importances,columns=["feature_name","importance"])
F_importances.head()    

*Use the Seaborn `.barplot()` method to create a plot like this:*


![plot](https://raw.githubusercontent.com/sdgroeve/Machine_Learning_course_UGent_D012554_data/master/practicum/Classification/AG_plot.png)

In [None]:
import seaborn as sns

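def get_nuc(x):
    # Feature names look like "<position>_<nucleotide>"; return the nucleotide part.
    return x.split("_")[1]

def get_position(x):
    # Return the relative position encoded in the feature name as an integer.
    position = x.split("_")[0]
    if position in ("A", "G"):
        return 0
    return int(position)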

###Start code here

###End code here