In [1]:
import imp
compomics_import = imp.load_source('compomics_import', 'compomics_import.py')
from IPython.core.display import HTML
css_file = 'my.css'
HTML(open(css_file, "r").read())

# Classification

In this section we will build a classification model for gene splice site prediction. This problem arises in computational gene finding and concerns the recognition of splice sites, which mark the boundaries between exons and introns in eukaryotes. Introns are spliced from precursor mRNAs (pre-mRNAs) after transcription. The vast majority of splice sites are characterized by the presence of specific dimers on the intronic side of the splice site: GT for donor and AG for acceptor sites. Yet only about 0.1-1% of all GT and AG occurrences in the genome represent true splice sites.

Load the acceptor site training data set *acceptor_sites_dataset_train.csv* in a Pandas DataFrame called `data`.

Show the first 5 rows in the DataFrame.
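
As a minimal sketch of this loading step, the snippet below reads a small in-memory stand-in for the CSV file with `pd.read_csv()` and shows its first rows. The column names (`sequence`, `label`) and the three rows are assumptions made up for illustration; with the real file you would pass its path instead of the `StringIO` object.

```python
import io
import pandas as pd

# Hypothetical stand-in for acceptor_sites_dataset_train.csv: the column
# names and the rows below are invented for illustration only.
toy_csv = io.StringIO(
    "sequence,label\n"
    "TTTCTTTCCCAGGTGAAGCTGA,1\n"
    "ACGTACGTACAGACGTACGTAC,-1\n"
    "GGCATTCCTTAGCATGCATGCA,-1\n"
)

# With the real data set, pass the file path instead of the StringIO object.
toy_data = pd.read_csv(toy_csv)

# head() returns the first 5 rows by default (here there are only 3).
print(toy_data.head())
```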

There are only two columns. The column "sequence" contains a DNA sequence of length 22. The nucleotides at positions 11 and 12 in the sequence are always "A" and "G" respectively, so these positions are candidate gene acceptor sites. The column "label" indicates the class: 1 for "is acceptor site" and -1 for "is not acceptor site". The goal is to predict this label from the local context sequence of the candidate acceptor site. Let's see how many data points belong to each class:

In [2]:
data['label'].value_counts()


DNA sequences cannot be used directly as input for the Machine Learning algorithms we will apply. We need to compute features from the DNA sequences that allow us to detect true acceptor sites, a process known as **feature engineering**.

In the first practicum we saw the Pandas `apply()` function, which allows us to process the values in a DataFrame column to create a new column. We will apply this function to compute feature vectors from the `sequence` column in the `data` DataFrame.

We don't need to compute features from the middle AG dinucleotide in the local context sequence. Why?

Let's remove it:

In [None]:
def remove_AG(x):
    return x[0:10]+x[12:22]

print(data.head())

data["sequence"] = data["sequence"].apply(remove_AG)

print(data.head())

What does the following function do?

In [None]:
def DNA_int_encoding(x):
    encoding = []
    for nuc in x:
        if nuc == 'A':
            encoding.append(0)
        elif nuc == 'C':
            encoding.append(1)
        elif nuc == 'G':
            encoding.append(2)
        elif nuc == 'T':
            encoding.append(3)
        else:
            print("Found non-nucleotide in %s"%x)
    return encoding

Use this function on the `sequence` column in the `data` DataFrame to create a Pandas `Series` called `data_features_int_encoding` with feature vectors:
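
A minimal sketch of how `apply()` turns a column of strings into a `Series` of feature vectors. The `encode()` helper and the two toy sequences are hypothetical; the helper uses the same A/C/G/T to 0/1/2/3 mapping but, unlike `DNA_int_encoding()`, raises a `KeyError` on any other character.

```python
import pandas as pd

# Hypothetical compact encoder with the same A/C/G/T -> 0/1/2/3 mapping.
NUC_TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(seq):
    # One integer per nucleotide, in sequence order.
    return [NUC_TO_INT[nuc] for nuc in seq]

# Toy sequences, much shorter than the real 20-nucleotide contexts.
toy = pd.DataFrame({"sequence": ["ACGT", "TTGA"]})

# apply() calls encode() on every value and returns a Series of lists.
toy_features = toy["sequence"].apply(encode)
print(toy_features)
```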

Next we put these feature vectors back in a Pandas DataFrame as follows:

In [None]:
data_features_int_encoding = pd.DataFrame(data_features_int_encoding.tolist())

data_features_int_encoding.head()

What does `data_features_int_encoding` contain?

Evaluate the generalization performance of a Logistic Regression model with hyperparameter $C=0.1$ on the data set `data_features_int_encoding` using 10-fold cross-validation. Use the `cross_val_score()` function to compute the CV-scores and `np.mean()` to compute their mean accuracy.
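
The call pattern can be sketched on synthetic data as follows. The arrays `X_toy` and `y_toy` are made up for illustration and are not the acceptor site data; `max_iter=1000` is an extra setting chosen here only to avoid convergence warnings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 200 integer-encoded "sequences" of length 20.
rng = np.random.RandomState(0)
y_toy = np.array([1, -1] * 100)
X_toy = rng.randint(0, 4, size=(200, 20))
X_toy[y_toy == 1, 9] = 3  # make one position informative for class 1

model = LogisticRegression(C=0.1, max_iter=1000)

# cv=10 requests 10-fold cross-validation; for a classifier the default
# score is accuracy.
scores = cross_val_score(model, X_toy, y_toy, cv=10)
mean_accuracy = np.mean(scores)
print(scores)
print(mean_accuracy)
```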

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import numpy as np


What does this number mean?

Load the acceptor site test data set *acceptor_sites_dataset_test.csv* in a Pandas DataFrame called `data_test`.

Compute the same feature vectors for the test set (create a DataFrame `data_test_features_int_encoding` that contains the feature vectors).

Now fit a Logistic Regression model ($C=0.1$) on the train set.

Use the `predict()` function of the `LogisticRegression()` object to predict class labels for the instances in the test set (put the predictions in a variable called `predictions`).

Scikit-learn offers many metrics to evaluate model predictions. These functions are contained in the `metrics` module of `sklearn`. Can you find how to compute the accuracy of these predictions?
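
As a small illustration of the `metrics` module, the sketch below computes the accuracy of four made-up predictions with `metrics.accuracy_score()`. The label lists are hypothetical and stand in for the true test labels and your `predictions`.

```python
from sklearn import metrics

# Hypothetical true labels and predicted labels for four instances.
y_true = [1, -1, -1, 1]
y_pred = [1, -1, 1, 1]

# Fraction of instances for which the prediction equals the true label.
accuracy = metrics.accuracy_score(y_true, y_pred)
print(accuracy)
```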

In [None]:
from sklearn import metrics


An accuracy above 90% seems like a good score. But is it? Let's consider a model that predicts class "-1" for all test points.

In [None]:
predictions_zero = [-1]*len(data_test.label)

What is the accuracy of these predictions?

So this trivial model achieves a seemingly good score as well, even though it did not learn anything.

For classification tasks where the classes are highly imbalanced, accuracy is not a good metric to evaluate the generalization performance. In fact, if only 0.1% of the AG dinucleotides in a genome are true acceptor sites, then a model that predicts class "-1" for each AG would have an accuracy of 99.9%.
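
The arithmetic behind this claim can be checked with a few lines of plain Python. The counts below (1 true site among 1000 candidates) are an assumed example matching the 0.1% figure, not numbers taken from the data set.

```python
# Assumed example: 1 true acceptor site among 1000 AG candidates (0.1%).
y_true = [1] * 1 + [-1] * 999

# A model that always predicts "not an acceptor site".
y_pred = [-1] * len(y_true)

# Accuracy: fraction of candidates that are labelled correctly.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Fraction of true acceptor sites that the model actually finds.
found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
true_positive_rate = found / y_true.count(1)

print(accuracy, true_positive_rate)
```

The accuracy is high although not a single true acceptor site is recovered.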

We have seen how a ROC curve plots the true positive rate against the false positive rate. Both these metrics focus on the positive class, in our case the true acceptor sites. These metrics are much more suitable to evaluate the performance of models on tasks with highly imbalanced classes. To summarize a ROC curve in a single metric we can use the area under the curve (AUC).
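
A minimal sketch of both quantities, using `roc_curve()` and `roc_auc_score()` from `sklearn.metrics` on four made-up instances. The labels and scores are hypothetical; in the exercises they would come from the test set and the fitted model.

```python
from sklearn import metrics

# Hypothetical true labels and continuous scores for four instances.
y_true = [-1, -1, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# One (false positive rate, true positive rate) point per score threshold.
fpr, tpr, thresholds = metrics.roc_curve(y_true, scores)
print(fpr)
print(tpr)

# Area under that curve. Here 3 of the 4 (positive, negative) pairs are
# ranked correctly, which gives an AUC of 0.75.
auc = metrics.roc_auc_score(y_true, scores)
print(auc)
```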

What is the AUC score of the predictions computed by the logistic regression model we fitted?

You should see a poor value (the AUC lies between 0 and 1, with 0.5 corresponding to random guessing). This is because to compute the ROC curve we need the predictions to be scores (continuous values) rather than class labels (discrete values).

For logistic regression these scores are the class probabilities predicted by the model. We can obtain them using the `predict_proba()` function of the `LogisticRegression()` object as follows:
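
A minimal sketch of this step on synthetic data. `X_toy` and `y_toy` are made up for illustration, and the model is scored on the data it was fitted on only to keep the example short; in the exercise you would call `predict_proba()` on `data_test_features_int_encoding`.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with labels -1 and 1.
rng = np.random.RandomState(0)
y_toy = np.array([1, -1] * 50)
X_toy = rng.randint(0, 4, size=(100, 5))
X_toy[y_toy == 1, 2] = 3  # make one position informative for class 1

model = LogisticRegression(C=0.1, max_iter=1000).fit(X_toy, y_toy)

# One row per instance, one column per class, in the order of classes_.
predictions = model.predict_proba(X_toy)
print(model.classes_)
print(predictions[:5])

# The AUC is computed from the probabilities of the positive class.
positive_scores = predictions[:, 1]
auc = metrics.roc_auc_score(y_toy, positive_scores)
print(auc)
```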

What does variable `predictions` contain?

The first and second columns contain the predicted probabilities for class '-1' and '1' respectively. To compute the AUC we need to use the positive class probabilities. What is the AUC now?

Is this good generalization performance?

Transforming categorical features into ordered integers is probably not a good idea, because the nucleotides have no natural ordering. It is better to transform a categorical feature into one binary feature for each category (known as *one-hot* encoding).

We can do this with the following function that again computes feature vectors:

In [None]:
def DNA_onehot_encoding(x):
    encoding = []
    for nuc in x:
        if nuc == 'A':
            encoding.extend([1,0,0,0])
        elif nuc == 'C':
            encoding.extend([0,1,0,0])
        elif nuc == 'G':
            encoding.extend([0,0,1,0])
        elif nuc == 'T':
            encoding.extend([0,0,0,1])
        else:
            encoding.extend([0,0,0,0])
    return encoding

Create a Pandas DataFrame called `data_features_onehot_encoding` that contains the *one-hot* encoded features.
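
A minimal sketch of this construction on two toy sequences. The `onehot()` helper is a hypothetical dictionary-based equivalent of `DNA_onehot_encoding()`, and the sequences are made up. Each nucleotide contributes four columns, so a sequence of length $n$ yields $4n$ features.

```python
import pandas as pd

# Hypothetical lookup table: one binary feature per nucleotide.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}

def onehot(seq):
    encoding = []
    for nuc in seq:
        # Any other character is encoded as all zeros.
        encoding.extend(ONE_HOT.get(nuc, [0, 0, 0, 0]))
    return encoding

toy = pd.DataFrame({"sequence": ["ACG", "TTN"]})

# Series of lists -> list of lists -> DataFrame with one column per feature.
toy_onehot = pd.DataFrame(toy["sequence"].apply(onehot).tolist())
print(toy_onehot)
```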

Show the first five rows.

Evaluate the generalization performance of a logistic regression model with hyperparameter $C=0.1$ on the data set `data_features_onehot_encoding` using 10-fold cross-validation. Use the `cross_val_score()` function to compute the mean AUC of the CV-scores. The `cross_val_score()` function has a parameter called `scoring` that you need to set in order to replace the *accuracy* metric with the *AUC* metric.
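
The `scoring` parameter can be sketched on synthetic data as follows. `"roc_auc"` is the scikit-learn scoring name for the AUC; `X_toy` and `y_toy` are made-up binary features and labels, not the one-hot encoded acceptor site data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for one-hot features: 200 instances, 8 binary columns.
rng = np.random.RandomState(0)
y_toy = np.array([1, -1] * 100)
X_toy = rng.randint(0, 2, size=(200, 8))
X_toy[y_toy == 1, 3] = 1  # make one column informative for class 1

model = LogisticRegression(C=0.1, max_iter=1000)

# scoring="roc_auc" replaces the default accuracy with the AUC per fold.
auc_scores = cross_val_score(model, X_toy, y_toy, cv=10, scoring="roc_auc")
mean_auc = np.mean(auc_scores)
print(mean_auc)
```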

What is the AUC on `data_test`?

Is this close to what your CV is telling you?

We have used hyperparameter $C=0.1$ for the logistic regression model. Is there a better value for this regularization parameter (use `GridSearchCV`)? 
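
A minimal sketch of a grid search over $C$ on synthetic data, assuming the same candidate values as in the exercise. `X_toy` and `y_toy` are made up for illustration; `best_params_` and `best_score_` hold the selected value and its mean cross-validated AUC.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: 200 instances, 8 binary features.
rng = np.random.RandomState(0)
y_toy = np.array([1, -1] * 100)
X_toy = rng.randint(0, 2, size=(200, 8))
X_toy[y_toy == 1, 3] = 1  # make one column informative for class 1

search_space = [0.001, 0.01, 0.1, 1, 10, 100]

# Try every value of C with 10-fold cross-validation, scored by AUC.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": search_space},
    scoring="roc_auc",
    cv=10,
)
grid.fit(X_toy, y_toy)

print(grid.best_params_)
print(grid.best_score_)
```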

In [None]:
from sklearn.model_selection import GridSearchCV

search_space = [0.001,0.01,0.1,1,10,100]


What is the 10-CV AUC performance with this value for $C$?

What is the AUC performance on the test set for this value of $C$?

Is this closer to the AUC you computed using 10-CV?

Given this analysis, is the test set still *unseen data*?

The Logistic Regression algorithm fits a model parameter for each feature. We can use the value of these model parameters to investigate the relevance or importance of a feature in the model. 

First we give each feature a recognizable name:

In [None]:
columns = []

for i in range(-10,0,1):
    for nuc in ['A','C','G','T']:
        columns.append("%i_%s"%(i,nuc))
for i in range(1,11,1):
    for nuc in ['A','C','G','T']:
        columns.append("%i_%s"%(i,nuc))
        
data_features_onehot_encoding.columns = columns
data_features_onehot_encoding.head()

Now fit a Logistic Regression model ($C=1$) on the data set `data_features_onehot_encoding`.

The fitted model parameters can be found in the `coef_` attribute of the model (use `.coef_[0]` for our two-class problem). Print the model parameters.
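
One way to inspect the coefficients, sketched on synthetic data: pair each value in `coef_[0]` with its column name and look for the largest absolute value. The feature names `f0` to `f3` and the data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with four named binary features.
rng = np.random.RandomState(0)
y_toy = np.array([1, -1] * 50)
X_toy = pd.DataFrame(
    rng.randint(0, 2, size=(100, 4)), columns=["f0", "f1", "f2", "f3"]
)
X_toy.loc[y_toy == 1, "f2"] = 1  # make f2 informative for class 1

model = LogisticRegression(C=1, max_iter=1000).fit(X_toy, y_toy)

# coef_ has shape (1, n_features) for a two-class problem.
coefficients = pd.Series(model.coef_[0], index=X_toy.columns)
print(coefficients)

# Name of the feature with the largest absolute coefficient.
print(coefficients.abs().idxmax())
```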

Which is the most important feature in your model?

The following code plots the feature importances as a clear and informative graph. Can you understand the code (make it work for your model)? What does the plot show?

In [None]:
"""
F_importances = []
for feature_name,lr_coefficient in zip(data_features_onehot_encoding.columns,model.coef_[0]):
    F_importances.append([feature_name,lr_coefficient])
    
F_importances = pd.DataFrame(F_importances,columns=["feature_name","importance"])
print(F_importances.head())    

def get_nuc(x):
    return(x.split("_")[1])

def get_position(x):
    if x.split("_")[0] == "A": return 0
    if x.split("_")[0] == "G": return 0
    return(int(x.split("_")[0]))

F_importances["nuc"] = F_importances["feature_name"].apply(get_nuc)
F_importances["position"] = F_importances["feature_name"].apply(get_position)

print(F_importances.head())

import seaborn as sns

sns.catplot(x="position", y="importance", hue="nuc", data=F_importances, kind="point", aspect=2)
"""