## Importing the Python packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
from sklearn import model_selection
from sklearn import feature_extraction
from sklearn import linear_model
from sklearn import svm
from sklearn import neighbors
from sklearn import metrics

In [None]:
from warnings import filterwarnings
filterwarnings('ignore')

## Loading dataset

We load the pre-prepared dataset of disordered and ordered protein regions using the <code>pandas</code> package. The <code>read_csv</code> function reads a file in CSV (Comma-Separated Values) format and stores the data in a <code>DataFrame</code> object. A <code>DataFrame</code> is a 2-dimensional <code>pandas</code> data structure for storing tabular data, i.e. data organized in a table with rows and columns.

Setting column <code>'ID'</code> as the index of the <code>DataFrame</code> object.
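A minimal sketch of this loading step. The file name and the column names other than <code>'ID'</code> are not given here, so the example reads a small in-memory CSV with hypothetical <code>sequence</code> and <code>label</code> columns instead of the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file: the column names
# 'sequence' and 'label' are assumptions; only 'ID' comes from the text.
csv_text = """ID,sequence,label
P1,MKVLAAGIV,1
P2,SSEEKKPQS,0
P3,LLVVAAFFI,1
"""

# read_csv accepts a file path or any file-like object
data = pd.read_csv(io.StringIO(csv_text))

# Use the 'ID' column as the row index
data = data.set_index('ID')
print(data)
```

With the real dataset, the <code>io.StringIO</code> object would be replaced by the path to the CSV file.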

## Dataset analysis

Dimensions of the <code>DataFrame</code> object, i.e. the size of our dataset.

Class distribution:
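Both quantities can be inspected along the following lines. This is only a sketch on a made-up four-row <code>DataFrame</code>; the <code>sequence</code> and <code>label</code> column names are assumptions.

```python
import pandas as pd

# Toy DataFrame; the 'sequence' and 'label' columns are hypothetical
data = pd.DataFrame(
    {
        'sequence': ['MKVLAAGIV', 'SSEEKKPQS', 'LLVVAAFFI', 'PQSEEKKSS'],
        'label': [1, 0, 1, 0],
    },
    index=pd.Index(['P1', 'P2', 'P3', 'P4'], name='ID'),
)

# shape is a (number of rows, number of columns) tuple
n_rows, n_columns = data.shape
print(n_rows, n_columns)

# Number of instances in each class
class_counts = data['label'].value_counts()
print(class_counts)
```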

## Splitting the dataset into training and test subsets

The dataset is split into training and test subsets in a 2:1 ratio. The split is stratified (parameter <code>stratify</code>), which preserves the class proportions observed in the original dataset in both subsets. The value of the parameter <code>random_state</code> is fixed to make the train-test split reproducible.
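A sketch of such a split using <code>train_test_split</code> from the <code>model_selection</code> module. The toy sequences, the 0/1 label encoding and the seed value are assumptions made for illustration.

```python
import random

from sklearn import model_selection

# Toy stand-in for the real dataset (assumption): 60 labelled sequences
rng = random.Random(0)
sequences = ["".join(rng.choices("LIVFSEKP", k=20)) for _ in range(60)]
labels = [1] * 30 + [0] * 30

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    sequences,
    labels,
    test_size=1/3,      # 2:1 ratio of training to test instances
    stratify=labels,    # keep the class proportions in both subsets
    random_state=7,     # arbitrary fixed seed, chosen for this sketch
)

print(len(X_train), len(X_test))
# Fraction of positive instances in each subset
print(sum(y_train) / len(y_train), sum(y_test) / len(y_test))
```

Because of the stratification, both subsets keep the 50:50 class balance of the toy data.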

# N-gram sequence representation

Since certain amino acids and their combinations are more frequent in ordered than in disordered protein regions<a name="cite_ref-1"></a>[<sup>[1]</sup>](#cite_note-1), an n-gram representation of protein sequences is a natural choice. Based on the n-gram composition, different representations of protein sequences, suitable for machine learning models, can be constructed. Two basic methods, inherited from the field of Natural Language Processing, are:
- *Bag of Words* or *Bag of n-grams* method
- *TF-IDF* (Term Frequency - Inverse Document Frequency) method 

We can easily generate these two types of n-gram-based sequence representations using the <code>feature_extraction</code> module (<code>text</code> submodule) from the <code>sklearn</code> package.
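To make the idea of an n-gram composition concrete, the following sketch counts character n-grams by hand for one made-up sequence. The helper <code>char_ngrams</code> is hypothetical and not part of <code>sklearn</code>; it only illustrates what a bag of n-grams contains.

```python
from collections import Counter


def char_ngrams(sequence, n_min=1, n_max=3):
    """Return all contiguous n-grams of a sequence for n_min <= n <= n_max."""
    return [
        sequence[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(sequence) - n + 1)
    ]


# Bag of n-grams (unigrams and bigrams) for a short, made-up sequence
counts = Counter(char_ngrams("MKKL", n_min=1, n_max=2))
print(counts)
```

Here the unigram <code>K</code> is counted twice, while every other n-gram occurs once.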

<a name="cite_note-1"></a>[<sup>[1]</sup>](#cite_ref-1) Hydrophobic amino acids, or more precisely clusters of hydrophobic amino acids, are characteristic of ordered regions of proteins, while hydrophilic amino acids are more prevalent in disordered regions.

## Bag of Words representations

First, we will try Bag of Words representations and train several types of classification models. The scikit-learn library supports n-gram sequence representation using the *Bag of Words* method through the <code>CountVectorizer</code> class. Its constructor has various settings, mainly related to the preprocessing phase:
- <code>analyzer</code> - whether to tokenize text by words or by individual characters
- <code>ngram_range</code> - the lower and upper boundary of the range of n-values for the n-grams to be extracted (word or character n-grams, depending on <code>analyzer</code>); e.g. an n-gram range of (1, 3) means that only unigrams, bigrams and trigrams will be extracted
- <code>min_df</code> - lower cutoff for the document frequency, used for excluding tokens that occur in fewer documents than the given threshold (an absolute count if an integer, a proportion of documents if a float)
- <code>max_df</code> - upper cutoff for the document frequency, used for excluding tokens that occur in more documents than the given threshold
- <code>lowercase</code> - converting all characters to lowercase before tokenizing
- <code>stop_words</code> - excluding common words or words that are considered irrelevant for the specific problem
- <code>token_pattern</code> - regular expression that defines what constitutes a token; only used if <code>analyzer</code> is set to <code>'word'</code>
- <code>tokenizer</code> and <code>preprocessor</code> - passing a custom function to perform tokenization/preprocessing
- <code>vocabulary</code> - assigning a pre-prepared vocabulary

We will set only the parameters <code>analyzer</code>, <code>ngram_range</code> and <code>min_df</code>, and keep the default values for the other parameters. We want to split the text (protein sequences) into individual characters and extract only unigrams, bigrams and trigrams. We also want to exclude n-grams that appear in fewer than 5 sequences.
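A sketch of a <code>CountVectorizer</code> configured this way, fitted on six made-up sequences so that the effect of <code>min_df=5</code> is easy to follow.

```python
from sklearn import feature_extraction

# Made-up sequences: 'M', 'K' and 'MK' occur in 5 of the 6 sequences
toy_sequences = ["MKV", "MKL", "MKA", "MKG", "MKT", "AAA"]

vectorizer = feature_extraction.text.CountVectorizer(
    analyzer='char',      # tokenize by individual characters
    ngram_range=(1, 3),   # unigrams, bigrams and trigrams
    min_df=5,             # keep n-grams found in at least 5 sequences
)
vectorizer.fit(toy_sequences)

# lowercase=True is the default, so the n-grams are stored in lowercase
print(sorted(vectorizer.vocabulary_))
# → ['k', 'm', 'mk']
```

All other n-grams, such as <code>a</code> (present in only two sequences), fall below the threshold and are dropped.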

The vocabulary<a name="cite_ref-2"></a>[<sup>[2]</sup>](#cite_note-2) is, as a rule, built on the training set. Later, when vector representations are computed for test set instances, all unknown tokens (n-grams that are not present in the vocabulary) are ignored.

<a name="cite_note-2"></a>[<sup>[2]</sup>](#cite_ref-2) The set of features (words or, in our case, n-grams) that will be used for sequence representations.
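The following sketch illustrates how unknown n-grams are ignored. The sequences are made up, and <code>min_df</code> is left at its default so that the tiny example keeps all training n-grams.

```python
from sklearn import feature_extraction

train_sequences = ["AAK", "AKA"]   # made-up training sequences
test_sequences = ["AKW"]           # 'w' and 'kw' never occur in training

vectorizer = feature_extraction.text.CountVectorizer(
    analyzer='char', ngram_range=(1, 2))
vectorizer.fit(train_sequences)            # vocabulary comes from training data only

features = sorted(vectorizer.vocabulary_)  # columns follow sorted n-gram order
X_test = vectorizer.transform(test_sequences).toarray()

print(features)
# → ['a', 'aa', 'ak', 'k', 'ka']
print(X_test)
# → [[1 0 1 1 0]]
```

The test sequence contributes counts only for <code>a</code>, <code>ak</code> and <code>k</code>; the unseen n-grams <code>w</code> and <code>kw</code> have no column and are silently dropped.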

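A list of the extracted features (n-grams) can be obtained with the <code>get_feature_names()</code> method (renamed to <code>get_feature_names_out()</code> in scikit-learn 1.0; the old name was removed in version 1.2).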

Number of extracted features (n-grams):

Vectorization of training and test set sequences:

### Model 1 - logistic regression

We construct a simple logistic regression model and train it on the vectorized training set. The scikit-learn library provides various linear models, including logistic regression, through the <code>linear_model</code> module.

Accuracy of the model on the training and test set:
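A sketch of training and scoring the logistic regression model. The toy sequences stand in for the real dataset, and default hyperparameters are assumed because none are stated in the text.

```python
import random

from sklearn import feature_extraction, linear_model, metrics, model_selection

# Toy stand-in for the real dataset (assumption)
rng = random.Random(0)
alphabet, weights = "LIVFSEKP", [4, 4, 4, 4, 1, 1, 1, 1]
ordered = ["".join(rng.choices(alphabet, weights, k=30)) for _ in range(30)]
disordered = ["".join(rng.choices(alphabet, weights[::-1], k=30)) for _ in range(30)]
sequences, labels = ordered + disordered, [1] * 30 + [0] * 30

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    sequences, labels, test_size=1/3, stratify=labels, random_state=7)

vectorizer = feature_extraction.text.CountVectorizer(
    analyzer='char', ngram_range=(1, 3), min_df=5)
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Logistic regression with default settings (assumption)
model = linear_model.LogisticRegression()
model.fit(X_train_vectorized, y_train)

train_accuracy = metrics.accuracy_score(y_train, model.predict(X_train_vectorized))
test_accuracy = metrics.accuracy_score(y_test, model.predict(X_test_vectorized))
print(train_accuracy, test_accuracy)
```

A large gap between the two numbers would indicate overfitting to the training set.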

Confusion matrix:
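The layout of the confusion matrix returned by <code>metrics.confusion_matrix</code> can be checked on a few hand-written labels. The labels below are made up, and the encoding 0 = disorder, 1 = order is an assumption.

```python
from sklearn import metrics

# Made-up labels: 0 = disorder, 1 = order (assumed encoding)
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)
# → [[1 1]
#    [1 2]]
```

The diagonal holds the correctly classified instances; the off-diagonal entries are the false positives (top right) and false negatives (bottom left).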

### Model 2 - linear SVM

We construct a linear SVM (Support Vector Machine) model and train it on the vectorized training set. Support for the linear SVM model is provided through the <code>LinearSVC</code> class from the <code>svm</code> module of the <code>sklearn</code> package.

Accuracy of the model on the training and test set:
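A sketch of the same procedure with <code>LinearSVC</code>, again on toy data and with default hyperparameters (both are assumptions).

```python
import random

from sklearn import feature_extraction, metrics, model_selection, svm

# Toy stand-in for the real dataset (assumption)
rng = random.Random(0)
alphabet, weights = "LIVFSEKP", [4, 4, 4, 4, 1, 1, 1, 1]
ordered = ["".join(rng.choices(alphabet, weights, k=30)) for _ in range(30)]
disordered = ["".join(rng.choices(alphabet, weights[::-1], k=30)) for _ in range(30)]
sequences, labels = ordered + disordered, [1] * 30 + [0] * 30

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    sequences, labels, test_size=1/3, stratify=labels, random_state=7)

vectorizer = feature_extraction.text.CountVectorizer(
    analyzer='char', ngram_range=(1, 3), min_df=5)
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Linear SVM with default settings (assumption)
model = svm.LinearSVC()
model.fit(X_train_vectorized, y_train)

train_accuracy = metrics.accuracy_score(y_train, model.predict(X_train_vectorized))
test_accuracy = metrics.accuracy_score(y_test, model.predict(X_test_vectorized))
print(train_accuracy, test_accuracy)
```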

Confusion matrix:

### Model 3 - k nearest neighbors

We construct a *k* nearest neighbors classifier (observing the 4 nearest neighbors) and train it on the vectorized training set. Support for the KNN model is provided through the <code>KNeighborsClassifier</code> class from the <code>neighbors</code> module of the <code>sklearn</code> package.

Accuracy of the model on the training and test set:
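A sketch with <code>KNeighborsClassifier</code> and <code>n_neighbors=4</code>, as described above. The toy data and the remaining default settings are assumptions.

```python
import random

from sklearn import feature_extraction, metrics, model_selection, neighbors

# Toy stand-in for the real dataset (assumption)
rng = random.Random(0)
alphabet, weights = "LIVFSEKP", [4, 4, 4, 4, 1, 1, 1, 1]
ordered = ["".join(rng.choices(alphabet, weights, k=30)) for _ in range(30)]
disordered = ["".join(rng.choices(alphabet, weights[::-1], k=30)) for _ in range(30)]
sequences, labels = ordered + disordered, [1] * 30 + [0] * 30

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    sequences, labels, test_size=1/3, stratify=labels, random_state=7)

vectorizer = feature_extraction.text.CountVectorizer(
    analyzer='char', ngram_range=(1, 3), min_df=5)
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Each prediction is a vote among the 4 closest training sequences
model = neighbors.KNeighborsClassifier(n_neighbors=4)
model.fit(X_train_vectorized, y_train)

train_accuracy = metrics.accuracy_score(y_train, model.predict(X_train_vectorized))
test_accuracy = metrics.accuracy_score(y_test, model.predict(X_test_vectorized))
print(train_accuracy, test_accuracy)
```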

Confusion matrix:

## TF-IDF representations

Now we will try TF-IDF representations and train the same types of classification models in order to compare them with the previous ones. To obtain the TF-IDF representations of sequences we will use the <code>TfidfVectorizer</code> class from the <code>sklearn.feature_extraction.text</code> module. Its constructor accepts the same parameters as the <code>CountVectorizer</code> class, plus a few TF-IDF-specific ones such as <code>use_idf</code>.

Via the parameters <code>analyzer</code> and <code>ngram_range</code> we set the text (protein sequences) to be tokenized into characters (instead of words, which is the default) and the range of n-grams to be extracted. We also exclude tokens (n-grams) that appear in fewer than 5 sequences (<code>min_df</code> parameter) and choose to use only the "TF part" of the TF-IDF metric by setting the parameter <code>use_idf</code> to <code>False</code>.
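A sketch of what this configuration computes. With <code>use_idf=False</code> and the default <code>norm='l2'</code>, each row produced by <code>TfidfVectorizer</code> is the n-gram count vector scaled to unit Euclidean length, which the example checks against <code>CountVectorizer</code> on toy sequences (an assumption standing in for the real data).

```python
import random

import numpy as np
from sklearn import feature_extraction

# Toy sequences (assumption)
rng = random.Random(0)
sequences = ["".join(rng.choices("LIVFSEKP", k=30)) for _ in range(40)]

params = dict(analyzer='char', ngram_range=(1, 3), min_df=5)
tf_vectorizer = feature_extraction.text.TfidfVectorizer(use_idf=False, **params)
count_vectorizer = feature_extraction.text.CountVectorizer(**params)

X_tf = tf_vectorizer.fit_transform(sequences).toarray()
X_counts = count_vectorizer.fit_transform(sequences).toarray()

# Divide each count vector by its Euclidean norm and compare
row_norms = np.linalg.norm(X_counts, axis=1, keepdims=True)
same = np.allclose(X_tf, X_counts / row_norms)
print(same)
```

The normalization removes the effect of sequence length, so long and short sequences with the same n-gram proportions get the same representation.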

Building vocabulary based on instances of the training set:

List of extracted features (n-grams) that make up the vocabulary:

Number of extracted features (n-grams):

Vectorization of training and test set sequences:

### Model 1 - logistic regression

We construct a simple logistic regression model and train it on the now TF-IDF vectorized training set. 

Accuracy of the model on the training and test set:
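A sketch of the logistic regression model on the TF representation. Only the vectorizer changes compared with the Bag of Words variant; the toy data and default hyperparameters are assumptions.

```python
import random

from sklearn import feature_extraction, linear_model, metrics, model_selection

# Toy stand-in for the real dataset (assumption)
rng = random.Random(0)
alphabet, weights = "LIVFSEKP", [4, 4, 4, 4, 1, 1, 1, 1]
ordered = ["".join(rng.choices(alphabet, weights, k=30)) for _ in range(30)]
disordered = ["".join(rng.choices(alphabet, weights[::-1], k=30)) for _ in range(30)]
sequences, labels = ordered + disordered, [1] * 30 + [0] * 30

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    sequences, labels, test_size=1/3, stratify=labels, random_state=7)

# Term-frequency features only (use_idf=False), as described in the text
vectorizer = feature_extraction.text.TfidfVectorizer(
    analyzer='char', ngram_range=(1, 3), min_df=5, use_idf=False)
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

model = linear_model.LogisticRegression()
model.fit(X_train_vectorized, y_train)

train_accuracy = metrics.accuracy_score(y_train, model.predict(X_train_vectorized))
test_accuracy = metrics.accuracy_score(y_test, model.predict(X_test_vectorized))
print(train_accuracy, test_accuracy)
```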

Confusion matrix:

### Model 2 - linear SVM

We construct a linear SVM (Support Vector Machine) model and train it on the now TF-IDF vectorized training set. 

Accuracy of the model on the training and test set:

Confusion matrix:

### Model 3 - k nearest neighbors

We construct a k nearest neighbors classifier (observing the 4 nearest neighbors) and train it on the now TF-IDF vectorized training set.

Accuracy of the model on the training and test set:

Confusion matrix:

# Analysis of constructed models

### Detection of the most relevant n-grams for disorder-order classification (i.e. model interpretation)

The following function visualizes the coefficients of the model <code>classifier</code> together with the corresponding features (n-grams) <code>feature_names</code>, showing only the <code>n_top_features</code> features (n-grams) that are most relevant for predicting the negative (disorder) class and the positive (order) class. The graph is given the title <code>title</code>.

In [None]:
def visualize_coefficients(title, classifier, feature_names, n_top_features=25):
    # Flatten the (1, n_features) coefficient matrix of a binary linear model
    coefs = classifier.coef_.ravel()
    
    # Indices of the most negative and the most positive coefficients
    negative_coefs_indices = np.argsort(coefs)[:n_top_features]
    positive_coefs_indices = np.argsort(coefs)[-n_top_features:]
    
    most_decisive_coefs_indices = np.hstack([negative_coefs_indices, positive_coefs_indices])
    most_decisive_coefs = coefs[most_decisive_coefs_indices]
    
    plt.figure(figsize=(15, 5))
    plt.title(title)
    # Orange bars for negative coefficients, blue bars for positive ones
    colors = ['orange' if c < 0 else 'cadetblue' for c in most_decisive_coefs]
    plt.bar(np.arange(2*n_top_features), most_decisive_coefs, color=colors)
    
    # Label each bar with the n-gram it belongs to
    most_decisive_feature_names = np.array(feature_names)[most_decisive_coefs_indices]
    plt.xticks(np.arange(2*n_top_features), most_decisive_feature_names, rotation='vertical')

The k nearest neighbors model does not have an explicit form, that is, it is not determined by a set of coefficients, so we cannot identify the most relevant n-grams in this way.

This analysis can also be performed for models trained on Bag of Words representations of protein sequences.

## Comparing the classification accuracy of all 6 models

<img src="assets/comparing_models.png" width=750 align="left">
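A comparison table of this kind could be assembled roughly as follows. This is a sketch on toy data with default hyperparameters (apart from <code>n_neighbors=4</code>), so its numbers say nothing about the results shown in the figure.

```python
import random

import pandas as pd
from sklearn import (feature_extraction, linear_model, metrics,
                     model_selection, neighbors, svm)

# Toy stand-in for the real dataset (assumption)
rng = random.Random(0)
alphabet, weights = "LIVFSEKP", [4, 4, 4, 4, 1, 1, 1, 1]
ordered = ["".join(rng.choices(alphabet, weights, k=30)) for _ in range(30)]
disordered = ["".join(rng.choices(alphabet, weights[::-1], k=30)) for _ in range(30)]
sequences, labels = ordered + disordered, [1] * 30 + [0] * 30

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    sequences, labels, test_size=1/3, stratify=labels, random_state=7)

params = dict(analyzer='char', ngram_range=(1, 3), min_df=5)
vectorizers = {
    'Bag of Words': feature_extraction.text.CountVectorizer(**params),
    'TF': feature_extraction.text.TfidfVectorizer(use_idf=False, **params),
}


def make_models():
    """Fresh, untrained instances of the three model types."""
    return {
        'logistic regression': linear_model.LogisticRegression(),
        'linear SVM': svm.LinearSVC(),
        'k nearest neighbors': neighbors.KNeighborsClassifier(n_neighbors=4),
    }


rows = []
for representation, vectorizer in vectorizers.items():
    X_train_vectorized = vectorizer.fit_transform(X_train)
    X_test_vectorized = vectorizer.transform(X_test)
    for model_name, model in make_models().items():
        model.fit(X_train_vectorized, y_train)
        accuracy = metrics.accuracy_score(y_test, model.predict(X_test_vectorized))
        rows.append({'representation': representation,
                     'model': model_name,
                     'test accuracy': accuracy})

comparison = pd.DataFrame(rows)
print(comparison)
```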