# Introduction to sklearn and the experimental setup

Scikit-learn (http://scikit-learn.org), or sklearn for short, is a machine learning library for Python. It offers a wide range of tools for data mining and data analysis.

![alt text][logo]
[logo]: http://scikit-learn.org/stable/_static/ml_map.png "Sklearn Flowchart"

Overview of the tools provided by sklearn:
1. Supervised Learning
2. Unsupervised Learning
3. Model Selection and Evaluation
4. Dataset Transformations
5. Dataset loading utilities

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import nltk
import pprint

pp = pprint.PrettyPrinter(indent=4)

## Step 1: Import Data
How you load your data depends heavily on the type of data you are working with. Libraries such as sklearn and nltk provide toy datasets as well as helper functions for loading data. For example, sklearn provides functions for loading datasets in the svmlight format, where each line describes one sample:

    <label> <feature-id>:<feature-value> <feature-id>:<feature-value> ...
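
To make the format concrete, here is a minimal sketch that parses a single svmlight-style line in plain Python. The function name parse_svmlight_line and the sample line are made up for illustration; for real files, the sklearn loader mentioned in the list further down is the better choice.

```python
def parse_svmlight_line(line):
    """Parse one '<label> <feature-id>:<feature-value> ...' line."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        feature_id, value = item.split(":")
        # Only the features listed on the line are stored (sparse representation).
        features[int(feature_id)] = float(value)
    return label, features


# Hypothetical sample line: label 1 with three non-zero features.
label, features = parse_svmlight_line("1 3:0.5 7:2.0 12:1.0")
print(label, features)  # 1.0 {3: 0.5, 7: 2.0, 12: 1.0}
```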
 
Here are some recommended ways to load standard columnar data into a format usable by scikit-learn:

- pandas.io provides tools to read data from common formats including CSV, Excel, JSON and SQL. DataFrames may also be constructed from lists of tuples or dicts. Pandas handles heterogeneous data smoothly and provides tools for manipulation and conversion into a numeric array suitable for scikit-learn.
- scipy.io specializes in binary formats often used in scientific computing contexts, such as .mat and .arff
- NumPy's I/O routines (for example numpy.loadtxt) for standard loading of columnar data into numpy arrays
- scikit-learn’s datasets.load_svmlight_file for the svmlight or libSVM sparse format
- scikit-learn’s datasets.load_files for directories of text files, where each subdirectory name is the name of a category and each file inside a subdirectory is one sample from that category

In [3]:
#import data (the NLTK names corpus must be available, e.g. via nltk.download('names'))
name_data = [('male', name) for name in nltk.corpus.names.words('male.txt')]
name_data.extend(('female', name) for name in nltk.corpus.names.words('female.txt'))

#print the data
pp.pprint(name_data[:5])
pp.pprint(name_data[-5:])

[   ('male', 'Aamir'),
    ('male', 'Aaron'),
    ('male', 'Abbey'),
    ('male', 'Abbie'),
    ('male', 'Abbot')]
[   ('female', 'Zorine'),
    ('female', 'Zsa Zsa'),
    ('female', 'Zsazsa'),
    ('female', 'Zulema'),
    ('female', 'Zuzana')]


## Step 2: Data Preprocessing

In this step we transform the raw data. For example, when working with tweets it is common to replace each username with a generic '@name' token. Other preprocessing steps include lowercasing the text, tokenizing, stemming, and chunking. NLTK (http://www.nltk.org/) provides a wide range of tools for this step.
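
As a rough illustration of the tweet example, the sketch below replaces usernames, lowercases the text, and tokenizes it with regular expressions. The function name preprocess_tweet, the sample tweet, and the deliberately crude tokenization rule are assumptions for this example; NLTK's tokenizers handle far more cases.

```python
import re


def preprocess_tweet(text):
    """Replace usernames, lowercase, and split into simple tokens."""
    # Replace mentions such as @Alice_92 with the generic '@name' token.
    text = re.sub(r"@\w+", "@name", text)
    text = text.lower()
    # Crude tokenizer: '@name'-style tokens, runs of word characters,
    # or single punctuation characters.
    return re.findall(r"@?\w+|[^\w\s]", text)


tokens = preprocess_tweet("Thanks @Alice_92, see you at NLTK-Con!")
print(tokens)
# ['thanks', '@name', ',', 'see', 'you', 'at', 'nltk', '-', 'con', '!']
```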

## Step 3: Feature Extraction

Here we extract the features for our data. Features should reflect properties of a data point that help to distinguish the different classes. In NLP tasks in particular, we use feature extraction to transform the textual data into numerical features. These features are what we use as input to the machine learning procedures. Sklearn provides the sklearn.feature_extraction module, which implements common feature extraction methods.
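
To show what a bag-of-characters representation looks like, here is a small pure-Python sketch: it builds a sorted character vocabulary and counts each character per name. This only approximates what a character-level count vectorizer computes; bag_of_chars and the two sample names are hypothetical.

```python
from collections import Counter


def bag_of_chars(names):
    """Return a char -> column index vocabulary and one count row per name."""
    # Sort the distinct characters so that column indices are reproducible.
    vocabulary = {ch: i for i, ch in enumerate(sorted(set("".join(names))))}
    rows = []
    for name in names:
        counts = Counter(name)
        # One column per vocabulary entry, in index order.
        rows.append([counts.get(ch, 0) for ch in vocabulary])
    return vocabulary, rows


vocabulary, rows = bag_of_chars(["Anna", "Bob"])
print(vocabulary)  # {'A': 0, 'B': 1, 'a': 2, 'b': 3, 'n': 4, 'o': 5}
print(rows)        # [[1, 0, 1, 0, 2, 0], [0, 1, 0, 1, 0, 1]]
```

Because case is preserved, 'A' and 'a' end up in different columns, which matches the lowercase=False setting used in this notebook.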

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
#bag of characters (character n-grams would additionally need the ngram_range parameter)
cv = CountVectorizer(analyzer='char', preprocessor=None, lowercase=False)

corpus = [d[1] for d in name_data]

x_bow = cv.fit_transform(corpus)
pp.pprint(sorted(cv.vocabulary_.items(),key=lambda x: x[0]))

[   (' ', 0),
    ("'", 1),
    ('-', 2),
    ('A', 3),
    ('B', 4),
    ('C', 5),
    ('D', 6),
    ('E', 7),
    ('F', 8),
    ('G', 9),
    ('H', 10),
    ('I', 11),
    ('J', 12),
    ('K', 13),
    ('L', 14),
    ('M', 15),
    ('N', 16),
    ('O', 17),
    ('P', 18),
    ('Q', 19),
    ('R', 20),
    ('S', 21),
    ('T', 22),
    ('U', 23),
    ('V', 24),
    ('W', 25),
    ('X', 26),
    ('Y', 27),
    ('Z', 28),
    ('a', 29),
    ('b', 30),
    ('c', 31),
    ('d', 32),
    ('e', 33),
    ('f', 34),
    ('g', 35),
    ('h', 36),
    ('i', 37),
    ('j', 38),
    ('k', 39),
    ('l', 40),
    ('m', 41),
    ('n', 42),
    ('o', 43),
    ('p', 44),
    ('q', 45),
    ('r', 46),
    ('s', 47),
    ('t', 48),
    ('u', 49),
    ('v', 50),
    ('w', 51),
    ('x', 52),
    ('y', 53),
    ('z', 54)]


In [5]:
print(type(x_bow)) #sparse matrix
x_bow = x_bow.toarray() #make it dense

<class 'scipy.sparse.csr.csr_matrix'>


In [6]:
#generate new features: length of the name
x_nlengths = np.array([len(name) for name in corpus])
print(x_nlengths.shape)
#produces vector of shape (7944,) -> cannot be concatenated, transform into (7944,1) vector
x_nlengths = np.expand_dims(x_nlengths, axis=-1)
print(x_nlengths.shape)

(7944,)
(7944, 1)


In [7]:
#generate new features: whether the name ends with the letter 'a'
x_a = np.array([int(name[-1] == 'a') for name in corpus])
print(x_a.shape)
#produces vector of shape (7944,) -> cannot be concatenated, transform into (7944,1) vector
x_a = np.expand_dims(x_a, axis=-1)
print(x_a.shape)

(7944,)
(7944, 1)


In [8]:
#combine features
X = np.concatenate([x_bow,x_nlengths,x_a], axis=1)
X.shape

(7944, 57)

In [9]:
numeric_label_voc = {
    'male': 0,
    'female': 1
}
Y = np.array([numeric_label_voc.get(d[0]) for d in name_data])
Y.shape

(7944,)

## Step 4: Data Analysis



In [18]:
import collections
import random

def get_sample(data, n):
    return random.sample(data, n)

labels = [d[0] for d in name_data]
is_last_a = [int(name[-1] == 'a') for name in corpus]

print('P(C = female) =', collections.Counter(labels)['female']/len(labels))

print(collections.Counter(labels))
print(collections.Counter(is_last_a))
print(collections.Counter(zip(is_last_a, labels)))
print(get_sample(name_data, 20))

P(C = female) = 0.6295317220543807
Counter({'female': 5001, 'male': 2943})
Counter({0: 6142, 1: 1802})
Counter({(0, 'female'): 3228, (0, 'male'): 2914, (1, 'female'): 1773, (1, 'male'): 29})
[('female', 'Bamby'), ('female', 'Brunhilde'), ('female', 'Corinne'), ('male', 'Pascal'), ('female', 'Kaia'), ('male', 'Warde'), ('female', 'Aphrodite'), ('female', 'Sophronia'), ('female', 'Perrine'), ('female', 'Elberta'), ('female', 'Katina'), ('female', 'Shirline'), ('female', 'Ciara'), ('female', 'Justine'), ('male', 'Nestor'), ('female', 'Carleen'), ('female', 'Corrine'), ('female', 'Daloris'), ('female', 'Gracia'), ('male', 'Piggy')]
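
The joint counts printed above can be turned into conditional probabilities, which shows how informative the "ends in 'a'" feature is. The sketch below copies the counts from that output; the names joint and p_female_given are introduced here for illustration only.

```python
from collections import Counter

# Joint counts copied from the output above: (ends_in_a, label) -> count.
joint = Counter({
    (0, 'female'): 3228,
    (0, 'male'): 2914,
    (1, 'female'): 1773,
    (1, 'male'): 29,
})


def p_female_given(ends_in_a):
    """Estimate P(C = female | last letter is 'a' == ends_in_a)."""
    total = sum(count for (flag, _), count in joint.items() if flag == ends_in_a)
    return joint[(ends_in_a, 'female')] / total


print('P(C = female | ends in a)     =', p_female_given(1))
print('P(C = female | not ends in a) =', p_female_given(0))
```

In this corpus, names ending in 'a' are female in more than 98% of cases, while the remaining names are split much more evenly.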


## Step 5: Choose a Classifier

Go to http://scikit-learn.org/stable/supervised_learning.html and choose a classifier. Use the flowchart above to guide your decision, and check whether your results agree with the flowchart's recommendation.

Just selecting a model and fitting it on the data is not enough. There are a couple of problems we need to address:
1) We need to know how well the trained model performs on unseen data.
2) We need to select the hyperparameters that yield the best model.

To assess how well the trained model performs, we apply K-fold cross-validation (CV). In K-fold CV we partition the data into K equally sized subsamples (folds). We keep one of these K subsamples as validation data and fit the model on the other K - 1 subsamples. This process is repeated K times, so that each subsample is used as validation data exactly once, and the K validation scores are then averaged.
CV also helps to detect overfitting, i.e. when the model fits the data it was trained on too closely. An overfitted model generalizes poorly to unseen data.
Finally, we need to select a suitable metric to evaluate our model. There are many metrics to choose from: accuracy, precision, recall, F1-score, mean squared error, ...
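
The partitioning step of K-fold CV can be sketched in a few lines of plain Python. This is a simplified illustration with contiguous, unshuffled folds, not the sklearn implementation; k_fold_indices is a hypothetical helper name.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print('train:', train, 'validation:', validation)
```

Each sample index appears in exactly one validation fold, and in the training set of all other folds.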

To select the hyperparameters we use a technique called grid search. Grid search tries every combination of the candidate hyperparameter values and reports the CV score for each combination. For each hyperparameter you need to specify which values the grid search should test.
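
To see what "every combination" means, the sketch below enumerates a parameter grid with itertools.product. The grid is the one used for the random forest later in this notebook; the variable names param_names and candidates are illustrative.

```python
from itertools import product

# Candidate values per hyperparameter (same grid as used later in this notebook).
param_grid = {
    'n_estimators': [1, 2, 3, 4, 5, 10, 20, 50],
    'max_features': [None, 'log2', 'sqrt'],
    'max_depth': [None, 1, 2, 25],
    'bootstrap': [True],
}

param_names = list(param_grid)
# Cartesian product of all value lists: one dict per candidate setting.
candidates = [dict(zip(param_names, values)) for values in product(*param_grid.values())]

print(len(candidates))  # 8 * 3 * 4 * 1 = 96
print(candidates[0])
```

The number of candidates grows multiplicatively with every added hyperparameter, and each candidate is fitted once per CV fold.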

In [10]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

clf = RandomForestClassifier()
scores = cross_val_score(clf, X, Y, cv=10, scoring='accuracy')
print(np.mean(scores), np.std(scores))

0.5229582811002241 0.13201772094230707
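
The mean accuracy above (about 0.52) is lower than the roughly 0.63 that always predicting 'female' would achieve on this data. One possible contributor, stated here as an assumption rather than a verified diagnosis, is that name_data is ordered by label and alphabetically, and that passing an integer cv to cross_val_score does not shuffle the samples before splitting. The sketch below shows how a shuffled StratifiedKFold can be passed instead. X_demo, Y_demo, clf_demo and scores_demo are hypothetical names, and the data is synthetic so that the block runs on its own.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption: NOT the names dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
Y_demo = (X_demo[:, 0] > 0).astype(int)

# Shuffle before splitting so that each fold mixes samples from the whole dataset.
cv_shuffled = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
clf_demo = RandomForestClassifier(n_estimators=20, random_state=0)

scores_demo = cross_val_score(clf_demo, X_demo, Y_demo, cv=cv_shuffled, scoring='accuracy')
print(np.mean(scores_demo), np.std(scores_demo))
```

In this notebook, the same cv_shuffled object could be passed together with X and Y to check whether the score changes.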


In [13]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [1, 2, 3, 4, 5, 10, 20, 50],
    'max_features': [None, 'log2', 'sqrt'],
    'max_depth': [None, 1, 2, 25],
    'bootstrap': [True]
}

gs = GridSearchCV(clf, param_grid, n_jobs=-1, verbose=1, cv=5, scoring='accuracy')
gs.fit(X,Y)
gs.best_params_

Fitting 5 folds for each of 96 candidates, totalling 480 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    2.6s
[Parallel(n_jobs=-1)]: Done 250 tasks      | elapsed:    6.4s
[Parallel(n_jobs=-1)]: Done 480 out of 480 | elapsed:   11.8s finished


{'bootstrap': True, 'max_depth': 1, 'max_features': 'log2', 'n_estimators': 3}

In [12]:
gs.best_score_

0.6340634441087614