# Classification in scikit-learn

In this unit, we'll explore how to use `scikit-learn` for text classification. We'll be using short texts collected from the [Universal Periodic Review](https://en.wikipedia.org/wiki/Universal_Periodic_Review), an international human rights mechanism. Each of these texts has an attached *label* or *labels* identifying the human rights issue or issues that the text addresses.

From these texts, we're going to estimate a supervised model that tries to guess the label(s) from the text data. Note that this is a *multilabel* classification problem, because each text may have more than one label, or no label.

In [None]:
import os
import re
import csv
import sys
import random
from pandas import DataFrame
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC
from sklearn import metrics, tree
# splitting, cross-validation, and grid search utilities
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import coverage_error

## 1) Load and PreProcess

We'll first load in a csv file that contains our texts and their corresponding label(s).

In [None]:
# read in full csv
recs = []
with open('data/upr.csv', 'r', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        recs.append(row)
print(len(recs))

In [None]:
recs[:5]

In [None]:
# turn labels into a list
for i in recs:
    issues = i['Issue'].split(',')
    i['Issue'] = [x for x in issues if x != 'Other' and x != 'General']       

In [None]:
# remove texts with no label
rec_sub = [i for i in recs if i['Issue']]
print("Number of recs:", len(rec_sub))

In [None]:
# turn into a dataframe
data = DataFrame(rec_sub)
print(data.shape)

In [None]:
# extract text and label data
text = data['Text'].values
labels = data['Issue'].values

We now have to "binarize" the labels, meaning that we transform each list of labels into a row of binary indicators with one column per possible label: a 1 in a column means the sample has that label, and a 0 means it does not. For instance, an array such as `np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0]])` represents label 0 in the first sample, labels 1 and 2 in the second sample, and no labels in the third sample.

The `MultiLabelBinarizer` transformer can be used to convert between a collection of collections of labels and the indicator format.

In [None]:
# binarize labels
mlb = MultiLabelBinarizer()
labels_binary = mlb.fit_transform(labels)
print(labels_binary)
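
As a toy illustration of the indicator format, the sketch below binarizes three small label lists and then maps the indicator rows back to label tuples with `inverse_transform`. The label names and the `toy_` variables are made up for the example; they are not taken from the UPR data.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# hypothetical label lists: one list per text, possibly empty
toy_labels = [["torture"], ["detention", "women"], []]

toy_mlb = MultiLabelBinarizer()
toy_binary = toy_mlb.fit_transform(toy_labels)

# columns follow the sorted label names stored in classes_
print(list(toy_mlb.classes_))
print(toy_binary)

# inverse_transform recovers the label sets from the indicator rows
print(toy_mlb.inverse_transform(toy_binary))
```

The column order comes from `classes_`, which is why we read the label names from `mlb.classes_` rather than from the raw data.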

We're now ready to extract our `X_train, X_test, y_train, y_test`:

In [None]:
# get training + test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    text, labels_binary, test_size=0.2, random_state=40)
print("Number of training data observations:", len(X_train))

In [None]:
# get target (label) names
label_names = list(mlb.classes_)
print(label_names)

## 2) Pipelines

Machine learning often involves a fixed sequence of steps for processing the data, for example feature selection, normalization and classification. 

Scikit-learn includes a [Pipeline](http://scikit-learn.org/stable/modules/pipeline.html) structure to help with this. Pipelines serve two purposes:

1. **Convenience:** You only have to call `fit` and `predict` once on your data to fit a whole sequence of estimators.
2. **Joint parameter selection:** You can grid search over parameters of all estimators in the pipeline at once.
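
Both points can be sketched on toy data. The example below fits the same three steps once by hand and once as a pipeline, then shows how step parameters are named for joint selection. The documents, labels, and variable names are invented for illustration, and a plain `LinearSVC` on single-label targets stands in for the multilabel classifier used in this unit.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# toy documents and binary labels, invented for illustration
docs = ["free the press", "press freedom now", "clean water access",
        "access to safe water", "protect the press", "water for all"]
y_toy = [0, 0, 1, 1, 0, 1]
new_docs = ["press access", "safe water"]

# manual version: fit and apply each step by hand
vect = CountVectorizer()
tfidf = TfidfTransformer()
svm = LinearSVC(random_state=0)
counts = vect.fit_transform(docs)
weights = tfidf.fit_transform(counts)
svm.fit(weights, y_toy)
manual_pred = svm.predict(tfidf.transform(vect.transform(new_docs)))

# pipeline version: one fit call and one predict call
pipe = Pipeline([("vect", CountVectorizer()),
                 ("tfidf", TfidfTransformer()),
                 ("clf", LinearSVC(random_state=0))])
pipe_pred = pipe.fit(docs, y_toy).predict(new_docs)

print(manual_pred, pipe_pred)

# step parameters are addressed as <step name>__<parameter name>
print("vect__ngram_range" in pipe.get_params())
```

The `<step name>__<parameter name>` convention is what lets a grid search reach into any step of the pipeline.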

In [None]:
# build a pipeline with Support Vector Machines
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1, 2))),
                    ('tfidf', TfidfTransformer()),
                    ('clf', OneVsRestClassifier(LinearSVC(random_state=0)))
                     ])

In [None]:
# fit using pipeline
clf = text_clf.fit(X_train, y_train)

## 3) Predicting

Now we can use our trained model to predict the held-out "test" set. Better yet, there's no need to explicitly extract features or preprocess the data, since prediction runs the test texts through the same pipeline steps that were fit during training.

In [None]:
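# predict
predicted = clf.predict(X_test)
# for multilabel targets, score() is subset accuracy: the share of
# texts whose full set of labels is predicted exactly
clf.score(X_test, y_test)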

In [None]:
# mean per-label agreement: share of individual 0/1 indicators predicted correctly
np.mean(predicted == y_test)
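
The two numbers above measure different things, which a small made-up example may make clearer. For multilabel indicator targets, `score` reports subset accuracy (the whole row must match), while the element-wise mean counts every label indicator separately. The arrays below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# made-up indicator arrays: 4 samples, 3 labels
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 1]])

# subset accuracy: a row counts only if every label in it matches
subset_acc = accuracy_score(y_true, y_pred)

# per-label agreement: every individual 0/1 entry counts separately
label_agreement = np.mean(y_pred == y_true)

print(subset_acc)        # 0.5 (rows 0 and 3 match exactly)
print(label_agreement)   # 10 of 12 entries match
print(hamming_loss(y_true, y_pred))  # equals 1 - label_agreement
```

Subset accuracy is the stricter of the two, so it will usually be the lower number when there are many labels.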

In [None]:
for doc, label in zip(list(X_test[:50]), predicted[:50]):
    print('%r => %s' % (doc, ", ".join(list(np.array(label_names)[label==1]))))

In [None]:
# print metrics
print(metrics.classification_report(y_test, predicted,
    target_names=label_names)) 
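
The report lists precision, recall, and F1 for each label and, depending on the installed version, several averaged rows. As a rough sketch of how two common averages differ, the example below computes per-label, macro, and micro F1 on made-up indicator arrays (not the UPR predictions).

```python
import numpy as np
from sklearn.metrics import f1_score

# made-up indicator arrays: 4 samples, 3 labels
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 1]])

# one F1 score per label (column)
per_label = f1_score(y_true, y_pred, average=None)

# macro: unweighted mean of the per-label scores
macro = f1_score(y_true, y_pred, average="macro")

# micro: pool true/false positives and negatives over all labels first
micro = f1_score(y_true, y_pred, average="micro")

print(per_label, macro, micro)
```

Macro averaging treats rare and frequent labels equally, while micro averaging is dominated by the frequent ones.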

## 4) Cross Validation and Grid Search

How you use `scikit-learn` depends on the task. You can fit a single model with hand-picked settings, as we did above, or use its cross-validation and grid-search tools to evaluate and tune a model more systematically.

In [None]:
## cross validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(text_clf, text, labels_binary, cv=5)
scores
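
With `cv=5`, the data is divided into five folds, and each fold serves once as the held-out set while the model is fit on the other four. Below is a minimal sketch of the splitting itself, using plain `KFold` on ten placeholder samples. The splitter that `cross_val_score` picks internally may differ depending on the target type, so treat this as an illustration of the idea rather than of the exact folds used above.

```python
import numpy as np
from sklearn.model_selection import KFold

# ten placeholder samples, invented for illustration
X_toy = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5)
test_folds = []
for train_idx, test_idx in kf.split(X_toy):
    # every sample lands in exactly one test fold
    test_folds.append(test_idx.tolist())
    print("train:", train_idx.tolist(), "test:", test_idx.tolist())
```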

In [None]:
## grid search
from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
              'tfidf__use_idf': (True, False),
}
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(X_train, y_train)
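
A grid search tries every combination of the listed values, so the number of candidate models is the product of the list lengths. As a small sketch, `ParameterGrid` can enumerate the combinations for a grid like the one above without fitting anything. The name `param_grid` is used here only to keep the example separate from the cell above.

```python
from sklearn.model_selection import ParameterGrid

# same settings as the grid above, written out for a standalone example
param_grid = {'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
              'tfidf__use_idf': [True, False]}

grid = ParameterGrid(param_grid)
print(len(grid))   # 3 n-gram ranges x 2 idf settings
for candidate in grid:
    print(candidate)
```

Each candidate is then cross-validated, so the total number of fits is the number of candidates times the number of folds.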

In [None]:
## What are the best parameters?
best_parameters = gs_clf.best_params_
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, best_parameters[param_name]))

In [None]:
gs_clf.cv_results_

# AutoML on top of sklearn

`scikit-learn` can search over parameter values that you list yourself, as the grid search above does, but it does not pick the model or decide where to search for you. We will discuss two libraries, `auto-sklearn` and `TPOT`, which automate that search.

AutoML packages still require preparing and formatting the data as shown in the preprocessing steps above. You will hand off the prepped `X_train, y_train, X_test, y_test` arrays to the AutoML package, which will optimize a model and its parameters. Most arguments for AutoML have to do with the size of the desired model, the time to search for the best model, and where the model should be saved.

## 5) [auto-sklearn](http://automl.github.io/auto-sklearn/stable/) (Bayesian optimization)

In [None]:
from autosklearn.classification import AutoSklearnClassifier
import sklearn.model_selection
import sklearn.metrics
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer

X = text
y = labels_binary
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1)

# fit the vectorizer on the training texts only, so that no
# information from the test set leaks into the features
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(X_train)
X_test = tfidf.transform(X_test)

automl_cl = AutoSklearnClassifier()  # time_left_for_this_task=100
automl_cl.fit(X_train, y_train)
y_hat = automl_cl.predict(X_test)
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, y_hat))
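
The vectorizing step that prepares features for the AutoML fit can be sketched on its own with toy texts. In this sketch the vectorizer is fit on the training texts only, which is one common way to keep test-set vocabulary and document frequencies out of the features. The texts and the `toy_`/`train_`/`test_` names are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# toy splits, invented for illustration
train_texts = ["freedom of the press", "access to clean water"]
test_texts = ["press freedom and internet access"]

toy_tfidf = TfidfVectorizer()
train_features = toy_tfidf.fit_transform(train_texts)  # learns vocabulary and idf weights
test_features = toy_tfidf.transform(test_texts)        # reuses them unchanged

# both matrices share the training vocabulary as columns;
# words that appear only in the test text are ignored
print(train_features.shape, test_features.shape)
print("internet" in toy_tfidf.vocabulary_)
```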

In [None]:
automl_cl.show_models()

In [None]:
automl_cl.grid_scores_

In [None]:
automl_cl.cv_results_

## 6) [TPOT](https://github.com/rhiever/tpot) (genetic algorithms)

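NB: TPOT does not yet support multi-label classification, but it is adding features quickly. The example below therefore uses the single-label digits dataset that ships with `scikit-learn`.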

In [None]:
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_mnist_pipeline.py')