# Comp 551 Tutorial 2 :: scikit-learn

Feb 2nd, 2018

### Things we'll cover today

1) **What is scikit-learn and why should I care about it**

2) **How to go about doing ML (with scikit learn)**
  - ML as a single pipeline
    - Most common data preprocessing steps
      - train-test split
      - vectorization of textual features (only for textual features)
      - TF-IDF (only for textual features)
      - normalization
      - one-hot encoding of labels (for classification problems only)
    - Training
      - fit (closed form or gradient descent)
      - predict
    - Evaluation
      - metrics : measure accuracy - precision / recall / f-score
      - Cross validation
      - Grid Search


1) **What is scikit-learn and why should I care about it**

- Scikit-learn is a python-based free ML library that provides well-implemented off-the-shelf implementations for almost all ML operations.
- Implementing ML from scratch that is scalable, efficient, and error-free is really very very hard.

*Disclaimer* : Availability of off-the-shelf implementations of ML doesn't invalidates the need to know the algorithms details.

2) **How to go about doing ML (with scikit learn)**


2.1) **ML as a single pipeline**
- Almost all *scalable* ML models follows the style of development in a pipeline. It eases the pain of thinking through the complex ML processes.
- Keep this pipeline in mind while developing any ML model.

![alt text](http://cs.mcgill.ca/~sthaku3/other_media/COMP551_tutorial_2/Steps.png)


- P.S. :: Closed-form solution seeking ML don't follow this suit.

From here on, we'll explain the concepts behind each of these steps with a real dataset called News20group as hosted [here](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html). So, let's import it now.

In [0]:
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups(subset='all')

This dataset contains sentences from news and the topic they belong to.

In [0]:
# This is how the features look like
list(newsgroups.data)[0:3]

[u"From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>\nSubject: Pens fans reactions\nOrganization: Post Office, Carnegie Mellon, Pittsburgh, PA\nLines: 12\nNNTP-Posting-Host: po4.andrew.cmu.edu\n\n\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pens massacre of the Devils. Actually,\nI am  bit puzzled too and a bit relieved. However, I am going to put an end\nto non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\nare killing those Devils worse than I thought. Jagr just showed you why\nhe is much better than his regular season stats. He is also a lot\nfo fun to watch in the playoffs. Bowman should let JAgr have a lot of\nfun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\nregular season game.          PENS RULE!!!\n\n",
 u'From: mblawson@midway.ecn.uoknor.edu (Matthew B Lawson)\nSubject: Which

In [0]:
# This is how the target looks like
newsgroups.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

**2.1.1) Most common data-proprocessing steps**
![alt text](http://cs.mcgill.ca/~sthaku3/other_media/COMP551_tutorial_2/Preprocessing.png)

**2.1.1.1) Train-test split **

In [0]:
## Simple way to do split would be to use scikit-learn's `train_test_split` method
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(newsgroups.data, newsgroups.target, train_size=0.8, test_size=0.2)

**2.1.1.2) Vectorization of textual features (applicable only for textual features)**

A very simple approach to represent textual features such as documents or sentences as numerical value is to use each word as an atomic type and as a basis for a vector space. For example imagine a world where there exist only 3 words: “Apple”, “Orange”, and “Banana” and every sentence or document is made of them. They become the basis of a 3 dimensional vector space:

```
Apple  ==>> [1,0,0]
Banana ==>> [0,1,0]
Orange ==>> [0,0,1]
```

This representation is called **one_hot** as it is always a vector of zeros with 1 on the position of the word.

Then a “sentence” or a “document” is simply the linear combination of these vectors where the number of the counts of appearance of the words is the coefficient along that dimension. For example:

```
d3 = "Apple Orange Orange Apple" ==>> [2,0,2]
d4 = "Apple Banana Apple Banana" ==>> [2,2,0]
d1 = "Banana Apple Banana Banana Banana Apple" ==>> [2,4,0]
d2 = "Banana Orange Banana Banana Orange Banana" ==>> [0,4,2]
d5 = "Banana Apple Banana Banana Orange Banana" ==>> [1,4,1]
```

This will be covered in detail in lecture *Feature Construction and Selection* on 14th and 19th February.

Since, our toy dataset also has textual features, we'll have to vectorize them. But let's do train-test split first

## Now we transform text into vectors

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

In [0]:
vectors = vectorizer.fit_transform(X_train)

# and we do the same for training data. remember to use the same vectorizer, only transform (why?)
vectors_test = vectorizer.transform(X_test)

**2.1.1.3) TF-IDF (only for textual features)**
TF-IDF stands for Term-Frequency Inverse Document Frequency is one of the most popular approaches to mitigate the effects of more frequent but less informative words such as ''*the*''.
This will also be covered in detail in lecture *Feature Construction and Selection* on 14th and 19th February.

In [0]:
## You could have also used Tf-IDF vectorizer to convert the words into vectors
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(X_train)
vectors_test = vectorizer.transform(X_test)

In [0]:
## How does the vectors look like
## number of vocabs
vectors.shape


(15076, 160971)

In [0]:
vectors_test.shape

(3770, 151628)

In [0]:
## See the row is mainly sparse, because it contains counts of the words in one line
vectors_test.toarray()[0]

array([0, 0, 0, ..., 0, 0, 0])

**2.1.1.4) normalization**::
While not mandatory, normalization usually improves the performance of the learner significantly. It ensures that the learner learns from the data on similar scales across features. There are many ways of normalizing the data.
This will also be covered in detail in lecture *Feature Construction and Selection* on 14th and 19th February.
For now we'll stick to the default *l2* provided by scikit-learn.

In [0]:
from sklearn.preprocessing import normalize

vectors = normalize(vectors)
vectors_test = normalize(vectors_test)

**2.1.1.5) one-hot encoding of labels (for classification problems only)**
The integer nature of the labels is not amenable for classification tasks. However, scikit-learn above internally handles the integer nature of our labels. In most other cases, the programmers need to represent them in a format that allows handling them explicitly. One such popular format is one-hot encoding. This one-hot encoding works similar to as explained in section 2.1.1.2. So, we are only refering to a function [here](http://scikit-learn.org/stable/modules/preprocessing_targets.html#) that does that for you but for labels in the context of classification.

**2.1.2) Training**

![alt text](http://cs.mcgill.ca/~sthaku3/other_media/COMP551_tutorial_2/training.png)

In [0]:
## Now we instantiate the classifier. Remember this can be any classifier, even the one you make.
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB(alpha=.01)

**2.1.2.1) fit**

In [0]:
## Scikit Learn API is very simple and straightforward, which contains the basic commands:
## `fit` to learn your parameters
clf.fit(vectors, y_train)

MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)

**2.1.2.2) predict**

In [0]:
## `predict` to generate a prediction on Test data.

In [0]:
## We get the predicted class
y_pred = clf.predict(vectors_test)
## So now we see we have a set of predictions. 
y_pred

array([16, 18, 14, ...,  6,  1,  6])

array([16, 18, 14, ...,  6,  1,  6])

### Evaluation

To see how good or bad your classifier did, you should check the predictions with the gold standard dataset. The cool thing about scikit-learn is that it gives you several metrics to do so. You can even use your own classifier and use the list of predicted classes and gold standard classes to compare.

![alt text](http://cs.mcgill.ca/~sthaku3/other_media/COMP551_tutorial_2/metrics.png)

In [0]:
from sklearn import metrics

The `metrics` class provides a set of useful metrics you can use for your needs. For any classification task, you need to report mainly these metrics:

- Accuracy : (TP + TN) / (TP + TN + FP + FN)
- Precision : TP / (TP + FP)
- F1 Score : 2TP / (2TP + FP + FN)
- Recall Score : TP / (TP + FN)

Remember, when we do multi-class classification, we use it in `macro` average mode, where we calculate metrics for each label, and find their unweighted mean. You can also instead use other averaging modes such as `micro`, `weighted`, `samples`

In [0]:
metrics.accuracy_score(y_test, y_pred)

0.9169761273209549

In [0]:
metrics.precision_score(y_test, y_pred, average='macro')

0.9162959250625471

In [0]:
metrics.f1_score(y_test, y_pred, average='macro')

0.9138486202640822

In [0]:
metrics.recall_score(y_test, y_pred, average='macro')

0.9130048306298564

In [0]:
## You can show all of this in a single call
print(metrics.classification_report(y_test, y_pred))

             precision    recall  f1-score   support

          0       0.89      0.92      0.91       156
          1       0.84      0.86      0.85       188
          2       0.85      0.86      0.86       192
          3       0.86      0.83      0.85       206
          4       0.85      0.92      0.88       171
          5       0.90      0.89      0.90       202
          6       0.93      0.86      0.89       203
          7       0.92      0.91      0.92       216
          8       0.96      0.98      0.97       185
          9       0.97      0.96      0.97       193
         10       0.96      0.98      0.97       200
         11       0.95      0.94      0.94       209
         12       0.93      0.92      0.92       189
         13       0.95      0.94      0.94       198
         14       0.96      0.96      0.96       206
         15       0.93      0.95      0.94       206
         16       0.90      0.96      0.93       187
         17       0.97      0.99      0.98   

In [0]:
### Cross Validation

from sklearn.model_selection import cross_val_score
clf = MultinomialNB(alpha=.01)
scores = cross_val_score(clf, X_train, y_train, cv=5)

In [0]:
scores

array([0.86956522, 0.84782609, 0.89010989, 0.92222222, 0.94444444])

## Grid Search

When you are first searching for the best hyperparamters, its a good first strategy to run a grid search with the choice of hyperparameters to see which ones work the best for your dataset. 

In [0]:
from sklearn.model_selection import GridSearchCV

tuned_parameters = [{'alpha': [1, 0.5, 0.2, 0.1]}]

n_folds = 5

clf = MultinomialNB()
clf = GridSearchCV(clf, tuned_parameters, cv=n_folds, refit=False)
clf.fit(X_train, y_train)
scores = clf.cv_results_['mean_test_score']
scores_std = clf.cv_results_['std_test_score']

In [0]:
scores

array([0.89450549, 0.89450549, 0.89450549, 0.89450549])

In [0]:
scores_std

array([0.0348636, 0.0348636, 0.0348636, 0.0348636])

### Choice of Classifier

![Choosing the right estimator](http://scikit-learn.org/stable/_static/ml_map.png)

# References

1. [Scikit Learn](http://scikit-learn.org/)

## Useful Links

1. ROC Curve: http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
2. https://people.duke.edu/~ccc14/sta-663/BlackBoxOptimization.html