# scikit-learn Introductory Tutorial

scikit-learn is an open-source library of simple and efficient tools for data mining, data analysis, and machine learning in Python. It is built on NumPy, SciPy, and matplotlib. It provides built-in classification, regression, and clustering models, as well as utilities for dimensionality reduction, model evaluation, and preprocessing.

This tutorial is specifically tailored for NLP, i.e. working with text data. It will cover the following topics: loading data, preprocessing, feature extraction, training, evaluation, grid search, building a pipeline, creating custom transformers, etc.

For this tutorial, we will use the 20 Newsgroups data set <http://qwone.com/~jason/20Newsgroups/> and perform topic classification. For the sake of time, I converted all the data into a CSV file. 

Note: apparently Jupyter disables spell-checker, so I'm only partially responsible for the typos in this tutorial. 

### 1. Loading Dataset

For convenience, we will use pandas to read the CSV file. (You can also do this with NumPy, which has a `loadtxt()` function, but you might run into encoding issues with it.)

In [16]:
import pandas as pd

dataset = pd.read_csv('20news-18828.csv', header=None, delimiter=',', names=['label', 'text'])
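
To see what `header=None` and `names` actually do, here is a minimal sketch that reads a two-row in-memory sample instead of the real file. The sample rows and the names `sample_csv` and `sample` are made up for illustration, and the sketch assumes that any text containing commas is quoted in the CSV.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the real CSV file.
# The second text field contains a comma, so it is wrapped in quotes.
sample_csv = io.StringIO(
    'sci.med,Take two tablets daily\n'
    'rec.autos,"Great mileage, low maintenance"\n'
)

# header=None: the first row is data, not column names.
# names=[...]: we supply the column names ourselves.
sample = pd.read_csv(sample_csv, header=None, names=['label', 'text'])

print(sample.shape)
print(sample.text[1])
```

If the real file raises a decoding error, `read_csv()` also accepts an `encoding` argument that is worth trying.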

Sanity check on the dataset.

In [17]:
print("There are 20 categories: %s" % (len(dataset.label.unique()) == 20))
print("There are 18828 records: %s" % (len(dataset) == 18828))

There are 20 categories: True
There are 18828 records: True


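Now we need to split the dataset into a training set and a test set. To do so, we can use the `train_test_split()` function. By scikit-learn convention, `X` denotes the data (yeah, uppercase X), and `y` denotes the ground-truth labels (and yeah, lowercase y). Because no `random_state` is passed here, each run produces a different split, so the exact numbers in the outputs below may vary.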

In [18]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(dataset.text, dataset.label, train_size=0.8)
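
If you want a split that is reproducible and keeps the class proportions the same in both halves, `train_test_split()` takes `random_state` and `stratify` arguments. The sketch below uses made-up toy data (`toy_texts`, `toy_labels`) rather than the newsgroups set, so treat it as an illustration, not as what the cell above does.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 documents, 10 per class.
toy_texts = ["doc %d" % i for i in range(20)]
toy_labels = ["a"] * 10 + ["b"] * 10

# random_state fixes the shuffle so the split is repeatable;
# stratify keeps the class proportions equal in both splits.
tr_X, te_X, tr_y, te_y = train_test_split(
    toy_texts, toy_labels, train_size=0.8, random_state=0, stratify=toy_labels
)

print(len(tr_X), len(te_X))
print(Counter(te_y))
```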

### 2. A Simple Example

Before diving into preprocessing, feature extraction, and other more complicated tasks, we will work through a relatively simple but complete example. In this example, we will use bag-of-words features and a Naive Bayes classifier to establish our baseline.

scikit-learn has some built-in vectorizers, such as `CountVectorizer` and `TfidfVectorizer`, that we can use to vectorize our raw data and perform preprocessing and feature extraction on it. First, we will experiment with `CountVectorizer`, which basically treats each token or n-gram as a feature and stores its count in the corresponding column.

In [21]:
from sklearn.feature_extraction.text import CountVectorizer
# initialize a CountVectorizer
cv = CountVectorizer()
# fit the vectorizer on the raw training text and transform it into a sparse document-term matrix
X_train_counts = cv.fit_transform(X_train)
X_train_counts.shape

(15062, 167363)
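
With 167363 columns it is hard to see what the vectorizer did, so here is a small sketch on two made-up sentences. The names `toy_docs`, `toy_cv` and friends are hypothetical; the sketch relies on the default settings, which lowercase the text and keep tokens of two or more characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus.
toy_docs = ["the cat sat", "the cat sat on the mat"]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_docs)

# vocabulary_ maps each token to its column index;
# sorting tokens by that index recovers the column order.
toy_features = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(toy_features)
print(toy_counts.toarray())  # one row per document, one column per token

# ngram_range=(1, 2) adds bigrams on top of the single tokens.
toy_bigram_cv = CountVectorizer(ngram_range=(1, 2))
toy_bigram_counts = toy_bigram_cv.fit_transform(toy_docs)
print(toy_bigram_counts.shape)
```

Calling `.toarray()` is fine for a toy matrix like this one, but it is best avoided on the real data, where the dense version would be very large.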

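The test set needs the same treatment, but here we only call `transform()`, so the test documents are mapped onto the vocabulary learned from the training set. That is why both matrices end up with the same number of columns.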

In [22]:
X_test_counts = cv.transform(X_test)
X_test_counts.shape

(3766, 167363)
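
One consequence of calling only `transform()` is that words the vectorizer never saw during fitting are silently dropped. A minimal sketch on made-up data (`toy_train`, `toy_test` and `toy_vec` are hypothetical names):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical data: "durian" appears only in the test document.
toy_train = ["apple banana", "banana cherry"]
toy_test = ["apple durian"]

toy_vec = CountVectorizer()
toy_vec.fit(toy_train)  # vocabulary is learned from the training text only

toy_test_counts = toy_vec.transform(toy_test)
print(toy_test_counts.shape)      # same number of columns as the training vocabulary
print(toy_test_counts.toarray())  # "durian" has no column, so it is ignored
```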

Then, we fit a Naive Bayes classifier on our training features and labels, which basically trains a model. After training, we can use it to predict labels for the test features.

In [27]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_counts, y_train)
predicted = clf.predict(X_test_counts)

# sample some of the predictions against the ground truths 
for prediction, truth in zip(predicted[:10], y_test[:10]):
    print(prediction, truth)

comp.sys.ibm.pc.hardware comp.sys.ibm.pc.hardware
alt.atheism alt.atheism
comp.graphics comp.graphics
rec.sport.hockey rec.sport.hockey
sci.med sci.med
alt.atheism alt.atheism
talk.politics.guns talk.politics.guns
rec.motorcycles rec.motorcycles
rec.autos rec.autos
rec.autos sci.electronics
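
A single overall number is a handy first check. The sketch below computes accuracy with `metrics.accuracy_score()` on just the ten pairs printed above (nine of them agree); `shown_predictions` and `shown_truths` are names introduced here for illustration. Ten samples are far too few to say anything about the model, so on the full test set you would pass `y_test` and `predicted` instead.

```python
from sklearn import metrics

# The ten prediction/truth pairs printed above, copied by hand.
shown_predictions = [
    "comp.sys.ibm.pc.hardware", "alt.atheism", "comp.graphics",
    "rec.sport.hockey", "sci.med", "alt.atheism", "talk.politics.guns",
    "rec.motorcycles", "rec.autos", "rec.autos",
]
shown_truths = [
    "comp.sys.ibm.pc.hardware", "alt.atheism", "comp.graphics",
    "rec.sport.hockey", "sci.med", "alt.atheism", "talk.politics.guns",
    "rec.motorcycles", "rec.autos", "sci.electronics",
]

# accuracy = fraction of predictions that match the ground truth
sample_accuracy = metrics.accuracy_score(shown_truths, shown_predictions)
print(sample_accuracy)
```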


Let's do some legit evaluation. The `classification_report()` function gives you precision, recall and F1 scores for each label, along with their averages. If you want to calculate overall macro-averaged, micro-averaged or weighted performance directly, you can use `precision_recall_fscore_support()` with the `average` argument. Finally, `confusion_matrix()` can show you which labels the model confuses with one another, but unfortunately its output is a bare array without row or column labels; the rows and columns follow the order passed in `labels`.

In [52]:
from sklearn import metrics

print(metrics.classification_report(y_test, predicted, labels=dataset.label.unique()))

p, r, f1, _ = metrics.precision_recall_fscore_support(y_test, predicted, labels=dataset.label.unique(), average='micro')

print("Micro-averaged Performance:\nPrecision: {0}, Recall: {1}, F1: {2}".format(p, r, f1))

print(metrics.confusion_matrix(y_test, predicted, labels=dataset.label.unique()))

                          precision    recall  f1-score   support

             alt.atheism       0.86      0.91      0.88       171
           comp.graphics       0.60      0.92      0.73       189
 comp.os.ms-windows.misc       0.98      0.21      0.34       191
comp.sys.ibm.pc.hardware       0.64      0.82      0.72       194
   comp.sys.mac.hardware       0.92      0.81      0.86       189
          comp.windows.x       0.76      0.86      0.81       195
            misc.forsale       0.94      0.68      0.79       195
               rec.autos       0.89      0.94      0.91       191
         rec.motorcycles       0.96      0.96      0.96       182
      rec.sport.baseball       0.98      0.92      0.95       201
        rec.sport.hockey       0.95      0.98      0.97       186
               sci.crypt       0.86      0.93      0.89       210
         sci.electronics       0.94      0.79      0.86       204
                 sci.med       0.94      0.92      0.93       202
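
Since `confusion_matrix()` returns a bare array, one way to make it readable is to wrap it in a pandas DataFrame that reuses the same label list for rows and columns. The sketch below does that on made-up toy labels and also compares the `average` modes of `precision_recall_fscore_support()`. All names (`toy_truth`, `toy_pred`, `toy_names`, `cm_df`, `toy_scores`) are hypothetical, and the numbers have nothing to do with the newsgroups results.

```python
import pandas as pd
from sklearn import metrics

# Hypothetical toy labels: three classes, two samples each.
toy_truth = ["a", "a", "b", "b", "c", "c"]
toy_pred = ["a", "b", "b", "b", "c", "a"]
toy_names = ["a", "b", "c"]

# Rows are true labels, columns are predicted labels, in toy_names order.
cm = metrics.confusion_matrix(toy_truth, toy_pred, labels=toy_names)
cm_df = pd.DataFrame(cm, index=toy_names, columns=toy_names)
cm_df.index.name = "true"
cm_df.columns.name = "predicted"
print(cm_df)

# micro pools all decisions; macro averages per-label scores equally;
# weighted averages per-label scores by each label's support.
toy_scores = {}
for avg in ("micro", "macro", "weighted"):
    p_avg, r_avg, f_avg, _ = metrics.precision_recall_fscore_support(
        toy_truth, toy_pred, labels=toy_names, average=avg
    )
    toy_scores[avg] = (p_avg, r_avg, f_avg)
    print(avg, round(p_avg, 3), round(r_avg, 3), round(f_avg, 3))
```

On the real data, the same wrapping should work with `dataset.label.unique()` passed as `labels`, `index` and `columns`.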
         

### 3. Preprocessing & Feature Extraction




In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
