# scikit-learn Introductory Tutorial

[scikit-learn](http://scikit-learn.org/stable/index.html) is an open-source library of simple and efficient tools for data mining, data analysis and machine learning in Python. It is built on NumPy, SciPy and matplotlib. It ships with built-in classification, regression, and clustering models, as well as useful features like dimensionality reduction, evaluation and preprocessing.

This tutorial is specifically tailored for NLP, i.e. working with text data. It will cover the following topics: loading data, preprocessing, feature extraction, training, evaluation, grid search, building a pipeline, creating custom transformers, etc.

For this tutorial, we will use the [20 Newsgroups data set](http://qwone.com/~jason/20Newsgroups/) and perform topic classification. For the sake of time, I converted all the data into a CSV file. 

Note: apparently Jupyter disables spell-checker, so I'm only partially responsible for the typos in this tutorial. 

### 1. Loading Dataset

For the sake of convenience, we will use pandas to read the CSV file. (You may do so with NumPy as well; there is a `loadtxt()` function, but you might encounter encoding issues when using it.) The dataset is compressed into a tarball, so in the terminal run ``tar xvfz 20news-18828.csv.tar.gz`` to uncompress it.

In [1]:
import pandas as pd

dataset = pd.read_csv('20news-18828.csv', header=None, delimiter=',', names=['label', 'text'])
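
To see what the `header=None` and `names=` arguments do without the real file, here is a minimal sketch that reads a tiny in-memory CSV. The two sample rows are made up for illustration; only the column layout (label first, text second) is taken from the call above.

```python
import io

import pandas as pd

# Two made-up rows in the same layout as the tutorial's CSV: label, then text.
# The double quotes let the text field contain commas.
sample_csv = io.StringIO(
    'rec.autos,"My car won\'t start, any ideas?"\n'
    'sci.space,"The launch was delayed again."\n'
)

# header=None: the file has no header row, so the first line is data.
# names=[...]: supply the column names ourselves.
sample = pd.read_csv(sample_csv, header=None, names=['label', 'text'])

print(sample.shape)
print(sample.label.tolist())
```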

Sanity check on the dataset.

In [4]:
print("There are 20 categories: %s" % (len(dataset.label.unique()) == 20))
print("There are 18828 records: %s" % (len(dataset) == 18828))

There are 20 categories: True
There are 18828 records: True


Now we need to split it into a training set and a test set. To do so, we can use the `train_test_split()` function. In scikit-learn's convention, X indicates data (yeah, uppercase X), and y indicates truths (and yeah, lowercase y).

In [2]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(dataset.text, dataset.label, train_size=0.8)
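
The split above is shuffled differently on every run. If you want a reproducible split that also keeps the label proportions the same in both parts, `train_test_split()` accepts `random_state` and `stratify`. The sketch below uses made-up toy data (the names `texts` and `labels` are only for illustration), not the newsgroups dataset.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 documents, two labels, 5 of each.
texts = ['doc %d' % i for i in range(10)]
labels = ['spam'] * 5 + ['ham'] * 5

# random_state fixes the shuffle so the split is reproducible;
# stratify keeps the label proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, train_size=0.8, random_state=0, stratify=labels)

print(len(X_tr), len(X_te))
print(Counter(y_tr), Counter(y_te))
```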

### 2. A Simple Example

Before going too much into preprocessing, feature extraction and other more complicated tasks, we will do a relatively simple but complete example. In this example, we will use bag-of-words as features, and Naive Bayes as the classifier to establish our baseline.

There are some built-in vectorizers, `CountVectorizer` and `TfidfVectorizer`, that we can use to vectorize our raw data and perform preprocessing and feature extraction on it. First, we will experiment with `CountVectorizer`, which basically makes each token/n-gram a feature and stores its count in the corresponding column of the feature matrix. The `fit_transform()` function is the combination of `fit()` and `transform()`. `fit()` learns the vocabulary/features from the documents, and `transform()` transforms the dataset into a (sparse) matrix.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
# initialize a CountVectorizer
cv = CountVectorizer()
# fit the vectorizer on the raw data and transform it into a sparse count matrix
X_train_counts = cv.fit_transform(X_train)
X_train_counts.shape

(15062, 181850)

A similar thing needs to be done for the test set, but here we only use the `transform()` function, so that the test data is mapped onto the vocabulary already learned from the training data.

In [5]:
X_test_counts = cv.transform(X_test)
X_test_counts.shape

(3766, 181850)
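
With 181850 columns it is hard to see what the matrix actually contains, so here is a small sketch of the same `fit_transform()` / `transform()` steps on made-up toy documents. It also shows what happens to a test-set word that was never seen during fitting.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy documents, only for illustration.
train_docs = ['the cat sat', 'the cat ate the fish']
test_docs = ['the dog sat']

toy_cv = CountVectorizer()
# fit() learns the vocabulary; transform() maps documents to count rows.
train_counts = toy_cv.fit_transform(train_docs)

# vocabulary_ maps each token to its column index (columns are in sorted order).
print(sorted(toy_cv.vocabulary_))
print(train_counts.toarray())

# "dog" was never seen during fit(), so it is silently dropped here.
test_counts = toy_cv.transform(test_docs)
print(test_counts.toarray())
```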

Then, we fit a Naive Bayes classifier on our features and labels, which basically trains a model (if you fit the data more than once, it overwrites the parameters the model learned previously). After training, we can use it to perform prediction.

In [6]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_counts, y_train)
predicted = clf.predict(X_test_counts)

# sample some of the predictions against the ground truths 
for prediction, truth in zip(predicted[:10], y_test[:10]):
    print(prediction, truth)

sci.electronics sci.electronics
comp.windows.x comp.windows.x
rec.motorcycles rec.motorcycles
talk.politics.mideast talk.politics.mideast
comp.windows.x comp.windows.x
rec.sport.hockey rec.sport.baseball
sci.crypt sci.crypt
rec.autos rec.autos
rec.sport.hockey rec.sport.hockey
sci.med sci.med
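
The remark about fitting more than once can be checked on a tiny example. This is only a sketch with made-up count features and labels: after the second `fit()`, the model knows nothing about the labels from the first call.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up count features: 2 documents x 3 vocabulary entries.
X_a = np.array([[2, 0, 1], [0, 3, 0]])
X_b = np.array([[1, 1, 0], [0, 0, 4]])

nb = MultinomialNB()
nb.fit(X_a, ['autos', 'space'])
print(nb.classes_)

# A second fit() starts from scratch: nothing from the first call survives.
nb.fit(X_b, ['crypt', 'med'])
print(nb.classes_)
```

If you do want to keep learning from new batches, `MultinomialNB` also provides `partial_fit()`.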


Let's do some legit evaluation. The `classification_report()` function gives you precision, recall and F1 scores for each label, and their average. If you want to calculate overall macro-averaged, micro-averaged or weighted performance, you can use `precision_recall_fscore_support()`. Finally, `confusion_matrix()` can show you which labels are confusing to the model, but unfortunately, its output does not include the labels.

In [7]:
from sklearn import metrics

print(metrics.classification_report(y_test, predicted, labels=dataset.label.unique()))

p, r, f1, _ = metrics.precision_recall_fscore_support(y_test, predicted, labels=dataset.label.unique(), average='micro')

print("Micro-averaged Performance:\nPrecision: {0}, Recall: {1}, F1: {2}".format(p, r, f1))

print(metrics.confusion_matrix(y_test, predicted, labels=dataset.label.unique()))

                          precision    recall  f1-score   support

             alt.atheism       0.91      0.87      0.89       164
           comp.graphics       0.54      0.86      0.67       185
 comp.os.ms-windows.misc       0.86      0.10      0.18       184
comp.sys.ibm.pc.hardware       0.71      0.77      0.74       222
   comp.sys.mac.hardware       0.87      0.82      0.85       171
          comp.windows.x       0.73      0.86      0.79       192
            misc.forsale       0.96      0.62      0.76       213
               rec.autos       0.87      0.92      0.89       201
         rec.motorcycles       0.97      0.96      0.96       183
      rec.sport.baseball       0.95      0.93      0.94       196
        rec.sport.hockey       0.97      0.97      0.97       213
               sci.crypt       0.83      0.97      0.89       202
         sci.electronics       0.89      0.82      0.85       206
                 sci.med       0.94      0.92      0.93       203
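
Since the raw confusion matrix comes without labels, one way to make it readable is to wrap it in a pandas `DataFrame`. The sketch below uses made-up truths and predictions for three hypothetical labels, and also prints the three averaging modes side by side so you can see how they differ.

```python
import pandas as pd
from sklearn import metrics

# Made-up ground truths and predictions for three hypothetical labels.
y_true = ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c']
y_pred = ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c']
label_order = ['a', 'b', 'c']

# Rows are true labels, columns are predicted labels.
cm = metrics.confusion_matrix(y_true, y_pred, labels=label_order)
cm_df = pd.DataFrame(cm, index=label_order, columns=label_order)
print(cm_df)

# Micro-averaging pools all decisions; macro-averaging averages per-label scores;
# weighted averaging weights per-label scores by their support.
for avg in ('micro', 'macro', 'weighted'):
    p, r, f1, _ = metrics.precision_recall_fscore_support(
        y_true, y_pred, labels=label_order, average=avg)
    print('%-8s P=%.3f R=%.3f F1=%.3f' % (avg, p, r, f1))
```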
         

### 3. Preprocessing & Feature Extraction



One may ask, "how do I remove stop words, tokenize the texts differently, or use bigrams/trigrams as features?" 
The answer is you can do all that when you initialize a [`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) object, i.e. you can pass various arguments to the constructor. 

Here are some of them: `ngram_range` takes in a tuple (n_min, n_max). For example, `(2,2)` means only use bigrams, and `(1,3)` means use unigrams, bigrams, and trigrams. `stop_words` takes in a list of stopwords that you'd like to remove. If you want to use the default stopword list in scikit-learn, pass in the string `'english'`. `tokenizer` is a function that takes in a string and returns a list of tokens; inside that function, you can define how to tokenize your text. By default, scikit-learn's tokenization pattern is `u'(?u)\b\w\w+\b'`, i.e. runs of two or more word characters. Finally, `preprocessor` takes in a function whose argument is a string and whose output is a string. You can use it to perform more customized preprocessing. For more detail, please check out the documentation for `CountVectorizer` or `TfidfVectorizer`.
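
To get a feel for these arguments, here is a small sketch on made-up strings. `str.split` is only a stand-in for a real tokenizer, and `token_pattern=None` is passed alongside it simply because the pattern is not used once a tokenizer is supplied.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer

text = "I'm A-OK"

# The default token pattern keeps runs of two or more word characters,
# so one-letter tokens and punctuation disappear.
default_tokens = re.findall(r'(?u)\b\w\w+\b', text.lower())
print(default_tokens)

# A custom tokenizer takes a string and returns a list of tokens.
ws_cv = CountVectorizer(tokenizer=str.split, token_pattern=None)
ws_tokens = ws_cv.build_analyzer()(text)
print(ws_tokens)

# Unigrams + bigrams with the built-in English stop word list.
ngram_cv = CountVectorizer(ngram_range=(1, 2), stop_words='english')
ngram_cv.fit(['the quick brown fox'])
print(sorted(ngram_cv.vocabulary_))
```

Note that stop words are removed before the n-grams are built, so "the" does not show up inside any bigram either.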

Let's start with defining a preprocessor to normalize all the numeric values, i.e. replacing numbers with the string `NUM`. Then, we construct a new `CountVectorizer`, and then use unigrams, bigrams, and trigrams as features, and remove stop words.  

In [8]:
import re
def normalize_numbers(s):
    return re.sub(r'\b\d+\b', 'NUM', s)

cv = CountVectorizer(preprocessor=normalize_numbers, ngram_range=(1,3), stop_words='english')
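
One thing worth knowing: according to the scikit-learn documentation, a custom `preprocessor` overrides the built-in preprocessing stage (accent stripping and lowercasing), so with `normalize_numbers` alone the tokens keep their original case. The sketch below compares it with a hypothetical variant, `normalize_numbers_lower`, that lowercases first; that variant is my own illustration, not part of the tutorial's pipeline.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer


def normalize_numbers(s):
    # Same function as above: replace standalone digit runs with NUM.
    return re.sub(r'\b\d+\b', 'NUM', s)


def normalize_numbers_lower(s):
    # Hypothetical variant that restores the lowercasing step.
    return normalize_numbers(s.lower())


# "10am" is left alone because the digits are not a standalone token.
print(normalize_numbers('Call 911 at 10am'))

doc = ['Call 911 NOW']
vocabs = {}
for prep in (normalize_numbers, normalize_numbers_lower):
    toy_cv = CountVectorizer(preprocessor=prep)
    toy_cv.fit(doc)
    vocabs[prep.__name__] = sorted(toy_cv.vocabulary_)
    print(prep.__name__, vocabs[prep.__name__])
```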

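Let's fit and transform the training data and transform the test data. The speed of preprocessing and feature extraction depends on the running time of each step. For example, a naive stopword removal that checks every token against a list runs in O(N * M), where N is the number of tokens in the document and M is the stopword list size. Adding bigrams and trigrams also makes the feature space much larger, so expect this cell to take noticeably longer than the unigram version.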

In [9]:
# fit the vectorizer on the raw data and transform it into a sparse count matrix
X_train_counts = cv.fit_transform(X_train)
X_test_counts = cv.transform(X_test)


Let's use the Naive Bayes classifier to train a new model and see if it works better. In the last section, without preprocessing or feature engineering, our precision, recall and F1 were in the mid 80s, but now we get 90 for each score.

In [24]:
clf = MultinomialNB().fit(X_train_counts, y_train)
predicted = clf.predict(X_test_counts)
print(metrics.classification_report(y_test, predicted, labels=dataset.label.unique()))

                          precision    recall  f1-score   support

             alt.atheism       0.89      0.94      0.91       152
           comp.graphics       0.72      0.90      0.80       188
 comp.os.ms-windows.misc       0.87      0.80      0.83       205
comp.sys.ibm.pc.hardware       0.82      0.87      0.84       189
   comp.sys.mac.hardware       0.98      0.82      0.89       199
          comp.windows.x       0.89      0.90      0.89       202
            misc.forsale       0.89      0.78      0.83       193
               rec.autos       0.96      0.92      0.94       194
         rec.motorcycles       0.97      0.95      0.96       180
      rec.sport.baseball       0.97      0.92      0.95       209
        rec.sport.hockey       0.83      0.98      0.90       186
               sci.crypt       0.92      0.96      0.94       193
         sci.electronics       0.96      0.86      0.90       208
                 sci.med       0.96      0.94      0.95       202
         

Do you remember there are other vectorizers that you can use? Voilà, one of them is `TfidfVectorizer`. LOL, what is tf-idf? It's on [Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf). It basically reflects how important a word/phrase is to a document in a corpus. The constructor of `TfidfVectorizer` takes in the same parameters as that of `CountVectorizer`, so you can perform the same preprocessing/feature extraction. Try to run the following block of code and see if using tf-idf will help improve the performance. There are some other parameters in the constructor that you can tweak when initializing the object, and they could affect the performance as well.

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer(preprocessor=normalize_numbers, ngram_range=(1,3), stop_words='english')
X_train_tf = tv.fit_transform(X_train)
X_test_tf = tv.transform(X_test)
clf2 = MultinomialNB().fit(X_train_tf, y_train)
predicted = clf2.predict(X_test_tf)
print(metrics.classification_report(y_test, predicted, labels=dataset.label.unique()))

                          precision    recall  f1-score   support

             alt.atheism       0.93      0.84      0.88       164
           comp.graphics       0.80      0.75      0.77       185
 comp.os.ms-windows.misc       0.76      0.88      0.81       184
comp.sys.ibm.pc.hardware       0.84      0.77      0.80       222
   comp.sys.mac.hardware       0.91      0.83      0.87       171
          comp.windows.x       0.91      0.84      0.87       192
            misc.forsale       0.93      0.72      0.81       213
               rec.autos       0.87      0.93      0.90       201
         rec.motorcycles       0.94      0.96      0.95       183
      rec.sport.baseball       0.94      0.91      0.93       196
        rec.sport.hockey       0.96      0.99      0.97       213
               sci.crypt       0.88      0.97      0.92       202
         sci.electronics       0.92      0.82      0.87       206
                 sci.med       0.95      0.84      0.89       203
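
To make the "how important a word is" idea concrete, here is a small sketch on a made-up corpus in which one word appears in every document and the others appear once each. With the default settings, the rare word ends up with the larger weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "apple" appears in every document, the other words in one each.
docs = ['apple banana', 'apple cherry', 'apple durian']

toy_tv = TfidfVectorizer()
weights = toy_tv.fit_transform(docs)

# Look at the first document: the rare word should outweigh the common one.
row = weights[0].toarray().ravel()
apple_w = row[toy_tv.vocabulary_['apple']]
banana_w = row[toy_tv.vocabulary_['banana']]
print('apple: %.3f, banana: %.3f' % (apple_w, banana_w))
```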
         

Alternatively, if you like typing a longer block of code, you can use the `TfidfTransformer` to transform a word count matrix created by `CountVectorizer` into a tf-idf matrix.

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
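
As a quick check that the two routes really are interchangeable, the sketch below runs both on a made-up toy corpus with default parameters and compares the resulting matrices.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Made-up toy corpus, only for illustration.
docs = ['the cat sat', 'the cat ate the fish', 'a dog ate the bone']

# Route 1: counts first, then tf-idf weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# Route 2: TfidfVectorizer does both steps at once.
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

The two-step route is mostly useful when you already have a count matrix around, as we do here, and want to avoid tokenizing the text a second time.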

### 4. Model Selection

Depending on the size of your data and the nature of your task, some classifiers might perform better than others. These days, Maximum Entropy is a very popular classifier for many machine learning tasks. So let's try the [logistic regression classifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) in scikit-learn (Maxent and logistic regression are virtually the same thing). Please note that this classifier is a lot slower than Naive Bayes.

In [15]:
from sklearn.linear_model import LogisticRegression

# for the sake of speed, we will just use the default values of the CountVectorizer constructor
cv = CountVectorizer()
X_train_counts = cv.fit_transform(X_train)
X_test_counts = cv.transform(X_test)

# note: per the scikit-learn docs, n_jobs is ignored when solver='liblinear'
clf = LogisticRegression(solver='liblinear', max_iter=500, n_jobs=4)
clf.fit(X_train_counts, y_train)
predicted = clf.predict(X_test_counts)
print(metrics.classification_report(y_test, predicted, labels=dataset.label.unique()))

                          precision    recall  f1-score   support

             alt.atheism       0.92      0.93      0.92       164
           comp.graphics       0.76      0.78      0.77       185
 comp.os.ms-windows.misc       0.83      0.83      0.83       184
comp.sys.ibm.pc.hardware       0.78      0.77      0.78       222
   comp.sys.mac.hardware       0.81      0.85      0.83       171
          comp.windows.x       0.85      0.79      0.82       192
            misc.forsale       0.86      0.86      0.86       213
               rec.autos       0.85      0.94      0.89       201
         rec.motorcycles       0.97      0.94      0.95       183
      rec.sport.baseball       0.94      0.95      0.95       196
        rec.sport.hockey       0.97      0.98      0.97       213
               sci.crypt       0.95      0.95      0.95       202
         sci.electronics       0.86      0.84      0.85       206
                 sci.med       0.92      0.94      0.93       203
         

There are many other supervised and semi-supervised classifiers, as well as clustering models, and many of them work almost the same way: (1) initialize a classifier, (2) fit it on the feature matrix and truths of the training set, and (3) pass in the feature matrix of the test set to perform prediction.
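
The three steps can be sketched as a loop over a few classifiers. The toy texts and the two labels below are made up for illustration, and `LinearSVC` is included only as one more example of a model with the same interface; it is not used elsewhere in this tutorial.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Made-up toy data with two hypothetical labels.
train_texts = ['goal scored in the match', 'the team won the game',
               'rocket launch to orbit', 'the satellite reached orbit']
train_labels = ['sport', 'sport', 'space', 'space']
test_texts = ['the team scored', 'orbit of the rocket']

vec = CountVectorizer()
X_tr = vec.fit_transform(train_texts)
X_te = vec.transform(test_texts)

# (1) initialize, (2) fit, (3) predict: same calls for every classifier.
results = {}
for model in (MultinomialNB(), LogisticRegression(), LinearSVC()):
    model.fit(X_tr, train_labels)
    results[type(model).__name__] = list(model.predict(X_te))
    print(type(model).__name__, results[type(model).__name__])
```

Because the interface is shared, swapping one classifier for another usually means changing a single line.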