# Sheet 4: Evaluation and Unsupervised Learning

# Evaluation and Cross Validation

We're going to evaluate the language prediction algorithms we built last week.

## Setting Up Swadesh NB and SVM Classifiers

In [1]:
from nltk.corpus import swadesh
import numpy as np

en = swadesh.words('en')
de = swadesh.words('de')

In [2]:
# set up the data
import string
def vecRepr(word):
    # represent a word as 26 counts, one per letter a-z (other characters are ignored)
    return [word.count(l) for l in string.ascii_lowercase]

Xwords_all = np.array([w.lower() for w in en] + [w.lower() for w in de])
Ywords_all = np.array([0 for _ in en] + [1 for _ in de])
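
It can help to see what `vecRepr` actually produces before feeding it to a classifier. The snippet below is a small illustration only: it repeats the function so it runs on its own, and the word "hello" is an arbitrary example, not part of the lab data.

```python
import string

def vecRepr(word):
    # one count per letter a-z, in alphabetical order
    return [word.count(l) for l in string.ascii_lowercase]

vec = vecRepr("hello")
# pair each non-zero count with its letter for readability
nonzero = {l: c for l, c in zip(string.ascii_lowercase, vec) if c}
print(nonzero)

# characters outside a-z contribute nothing, so the umlaut below is not counted
umlaut_word = "gewürztraminer"
print(len(umlaut_word), sum(vecRepr(umlaut_word)))
```

The second print shows that the vector accounts for one fewer character than the word contains, which is worth keeping in mind for German words.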

In [6]:
# from Lab 3 Q4
# an NB classifier class based on last week's example
# it's good to know how to build classes in Python, but it is not required for the assignments or exam!
from sklearn import base
class nbLanguagePredictor(base.BaseEstimator):
    def __init__(self):
        # set internal variables to dummy values
        self.langLetters = None
        self.numLangLetters = None
        
    def fit(self, langwords, langlabels):
        # fit model using supplied word lists and language labels
        lang0words = [w for w,lang in zip(langwords, langlabels) if lang==0]
        lang1words = [w for w,lang in zip(langwords, langlabels) if lang==1]
        self.langLetters = tuple(''.join(words).lower() for words in (lang0words, lang1words))
        self.numLangLetters = tuple(map(len, self.langLetters))
        
    def calculate_probability(self, word, language):
        # calculate prior (0.5) times the likelihood of the word's letters under the given language
        # (note: the `language` parameter should be 0 or 1)
        return 0.5 * np.prod([self.langLetters[language].count(letter)/self.numLangLetters[language] 
                              for letter in word])

    def naive_bayes(self, word):
        first_prob = self.calculate_probability(word, 0)
        second_prob = self.calculate_probability(word, 1)
        if first_prob > second_prob:
            return 0
        else:
            return 1
        
    def predict(self, words):
        if not self.langLetters:
            raise ValueError("Model not trained!")
        return np.array([self.naive_bayes(word) for word in words])
    
    def score(self, words, langs):
        return sum(self.predict(words)==langs)/len(langs)

nb_alldata = nbLanguagePredictor()
nb_alldata.fit(Xwords_all,Ywords_all)
nb_alldata.predict(['gewürztraminer','discombublate'])

array([1, 0])

In [7]:
# from lab 3 Q5
from sklearn import svm

svc_all = svm.SVC()
svc_all.fit(list(map(vecRepr,Xwords_all)), Ywords_all)
svc_all.predict([vecRepr("gewürztraminer"), vecRepr("discombublate")])

array([1, 0])

## Selecting Test and Training Sets

Selecting test and training sets is not always as simple as it may seem. You can do a simple random sample, but it's often better to make a *stratified sample* where the proportions of each class are the same in the test and training data. Scikit-learn has a handy function for doing that.

Here's an example using the iris data set you saw last week.

In [8]:
from sklearn import datasets
iris = datasets.load_iris()

In [9]:
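# `sklearn` has a handy function for creating training and test subsets of your data.
# (the `cross_validation` module was removed in scikit-learn 0.20;
#  newer versions provide the same function in `sklearn.model_selection`)
from sklearn import cross_validation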

# note: random_state=0 makes the random numbers the same each time we run this.
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

print("data :", iris.data.shape, iris.target.shape)
print("train:", X_train.shape,   y_train.shape)
print("test :", X_test.shape,    y_test.shape)

data : (150, 4) (150,)
train: (90, 4) (90,)
test : (60, 4) (60,)
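
The split above is a plain random sample. As a rough sketch of the stratified variant mentioned earlier, the example below passes the `stratify` argument to `train_test_split`. It assumes a newer scikit-learn (0.18 or later), where the function lives in `sklearn.model_selection`; the variable names are chosen here so they do not clash with the ones above.

```python
from collections import Counter
from sklearn import datasets
from sklearn.model_selection import train_test_split  # assumes scikit-learn >= 0.18

iris = datasets.load_iris()

# stratify=iris.target asks for the same class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0, stratify=iris.target)

train_counts = Counter(y_tr.tolist())
test_counts = Counter(y_te.tolist())
print("train class counts:", dict(train_counts))
print("test  class counts:", dict(test_counts))
```

Because iris has 50 examples of each of its three classes, a stratified 60%/40% split should put equal numbers of each class in both subsets.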


In [10]:
# We can get accuracy scores for the model
from sklearn import svm
svm_iris = svm.SVC().fit(X_train, y_train)
svm_iris.score(X_test, y_test)

0.94999999999999996

## Precision, Recall and F-measure

<img src="https://upload.wikimedia.org/wikipedia/commons/2/26/Precisionrecall.svg" style="width: 250px; float: right"/>

Recall the confusion matrix for a binary classifier:

|              | predict T | predict F |
| --------     | --------- | --------- |
| actual **T** |    TP     |    FN     |
| actual **F** |    FP     |    TN     |

and formulas for **precision**, **recall** and **$F_1$ measure**:

> $P=\frac{TP}{TP+FP}$ 

> $R=\frac{TP}{TP+FN}$ 

> $F_1 = \frac{2PR}{P+R}$

(image from Wikipedia)
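
As a quick worked example of the three formulas, the sketch below plugs made-up counts into a hypothetical helper, `precision_recall_f1`. Returning 0.0 when a denominator is empty is an assumption made here for convenience, not something the formulas define.

```python
def precision_recall_f1(tp, fp, fn):
    # P = TP / (TP + FP); assumed to be 0.0 when nothing was predicted positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # R = TP / (TP + FN); assumed to be 0.0 when there are no actual positives
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 = 2PR / (P + R), the harmonic mean of precision and recall
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# illustrative counts only: 8 true positives, 2 false positives, 4 false negatives
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print("precision:", p, "recall:", r, "F1:", f1)
```

Note that the true negatives do not appear in any of the three formulas.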

# Q1.


Generate training and test sets from the Swadesh data with an 80%/20% split, and train SVM and NB classifiers on the training data.

Calculate the **confusion matrix**, **precision**, **recall** and $F_1$ measure for your classifiers.

In [11]:
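# note: test_size=0.4 holds out 40% of the words (a 60%/40% split), and the results
# printed below come from that setting; use test_size=0.2 for the 80%/20% split asked for above
w_train, w_test, l_train, l_test = cross_validation.train_test_split(
        Xwords_all, Ywords_all, test_size=0.4, random_state=0)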

nbTrained = nbLanguagePredictor()
nbTrained.fit(w_train, l_train)

svcTrained = svm.SVC()
svcTrained.fit(list(map(vecRepr,w_train)), l_train)

nbPredicts = nbTrained.predict(w_test)
svcPredicts = svcTrained.predict(list(map(vecRepr,w_test)))

nbTP = sum(1 for lp,l in zip(nbPredicts, l_test) if lp and l)
nbFP = sum(1 for lp,l in zip(nbPredicts, l_test) if lp and not l)
nbTN = sum(1 for lp,l in zip(nbPredicts, l_test) if not lp and not l)
nbFN = sum(1 for lp,l in zip(nbPredicts, l_test) if not lp and l)

nbPrecision = nbTP/(nbTP+nbFP)
nbRecall    = nbTP/(nbTP+nbFN)
nbF1        = 2*nbPrecision*nbRecall/(nbPrecision+nbRecall)
print("NB - precision: ", nbPrecision, " recall: ", nbRecall, "F1: ", nbF1)

svcTP = sum(1 for lp,l in zip(svcPredicts, l_test) if lp and l)
svcFP = sum(1 for lp,l in zip(svcPredicts, l_test) if lp and not l)
svcTN = sum(1 for lp,l in zip(svcPredicts, l_test) if not lp and not l)
svcFN = sum(1 for lp,l in zip(svcPredicts, l_test) if not lp and l)

svcPrecision = svcTP/(svcTP+svcFP)
svcRecall    = svcTP/(svcTP+svcFN)
svcF1        = 2*svcPrecision*svcRecall/(svcPrecision+svcRecall)
print("SV - precision: ", svcPrecision, " recall: ", svcRecall, "F1: ", svcF1)

NB - precision:  0.6333333333333333  recall:  0.7125 F1:  0.6705882352941177
SV - precision:  0.6986301369863014  recall:  0.6375 F1:  0.6666666666666666
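
The four counts can also be arranged into the confusion matrix the question asks for. The following is a minimal sketch using a hypothetical helper, `confusion_matrix_2x2`, and made-up labels rather than the classifier outputs above; the row and column order follows the table in the previous section.

```python
def confusion_matrix_2x2(actual, predicted):
    # count each of the four outcomes for a binary (0/1) classifier
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    # rows: actual T, actual F; columns: predict T, predict F
    return [[tp, fn],
            [fp, tn]]

# illustrative labels only
actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

matrix = confusion_matrix_2x2(actual, predicted)
print("             predict T  predict F")
print("actual T    ", matrix[0])
print("actual F    ", matrix[1])
```

The same helper could be called with `l_test` and either `nbPredicts` or `svcPredicts` from the cell above.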


## Cross-Validation

Cross-validation means dividing your data into $n$ subsets, training the model $n$ times with a different subset left out each time, then testing each trained model on the subset that was left out of its training.
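
To make the procedure concrete, here is a hand-rolled sketch of how the index sets for each fold could be generated. The helper `k_fold_indexes` is hypothetical and deliberately simple: it neither shuffles nor stratifies, and it hands any leftover items to the first folds.

```python
def k_fold_indexes(n_items, n_folds):
    """Yield (train_indexes, test_indexes) pairs for plain k-fold cross-validation."""
    indexes = list(range(n_items))
    # fold sizes differ by at most one; the first `extra` folds get one more item
    base, extra = divmod(n_items, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = indexes[start:start + size]
        train = indexes[:start] + indexes[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indexes(10, 3))
for i, (train, test) in enumerate(folds):
    print("fold", i, "train", train, "test", test)
```

Every item ends up in exactly one test fold, so each data point is used for testing once and for training in all the other rounds.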

`sklearn` also has a handy helper function for cross-validation.

In [12]:
from sklearn import cross_validation
scores = cross_validation.cross_val_score(svm_iris, iris.data, iris.target, cv=5)
print("Accuracies for each fold:",scores)

Accuracies for each fold: [ 0.96666667  1.          0.96666667  0.96666667  1.        ]


# Q2.

Run 10-fold cross-validation on the two Swadesh language detection models, reporting the accuracy for each fold and the average accuracy.

In [13]:
svcScores = cross_validation.cross_val_score(svm.SVC(), list(map(vecRepr,Xwords_all)), Ywords_all, cv=10)
nbScores = cross_validation.cross_val_score(nbLanguagePredictor(), Xwords_all, Ywords_all, cv=10)
print("Accuracies for each svc fold:",svcScores)
print("Accuracies for each nb  fold:",nbScores)
print("Average accuracy for svc:",np.average(svcScores))
print("Average accuracy for nb :",np.average(nbScores))

Accuracies for each svc fold: [ 0.57142857  0.64285714  0.52380952  0.54761905  0.83333333  0.85714286
  0.85714286  0.65        0.55        0.725     ]
Accuracies for each nb  fold: [ 0.66666667  0.61904762  0.54761905  0.61904762  0.70731707  0.68292683
  0.46341463  0.90243902  0.82926829  0.68292683]
Average accuracy for svc: 0.675833333333
Average accuracy for nb : 0.672067363531


## Numpy Indexing and More Cross Validation

`cross_val_score` only reports a single score for each fold, so it does not give you the per-fold true and false positive counts directly. There are, however, handy tools for setting up cross-validation manually with a little extra work.

With numpy arrays, you can use a list of indexes to select elements of the array. The scikit-learn `StratifiedKFold` class provides an iterator (which you can use in `for` loops, etc.) that yields arrays of training and test indexes into your data.

In [14]:
X = np.array([[10, 20], [30, 40], [15, 25], [35, 45], [11, 21], [31, 41], [12, 22], [32, 42], [13, 23], [33, 43]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

cv = cross_validation.StratifiedKFold(y, n_folds=5, shuffle=True, random_state=0)
fold = 0
for train_indexes, test_indexes in cv:
    print("fold",fold)
    print("train", train_indexes)
    print("test ", test_indexes)
    print()
    fold += 1

fold 0
train [0 1 3 4 6 7 8 9]
test  [2 5]

fold 1
train [1 2 3 4 5 6 8 9]
test  [0 7]

fold 2
train [0 2 3 4 5 7 8 9]
test  [1 6]

fold 3
train [0 1 2 4 5 6 7 8]
test  [3 9]

fold 4
train [0 1 2 3 5 6 7 9]
test  [4 8]
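
For readers on a newer scikit-learn, where the `cross_validation` module no longer exists, a roughly equivalent loop could look like the sketch below. It assumes version 0.18 or later, in which `StratifiedKFold` lives in `sklearn.model_selection`, takes `n_splits` instead of `n_folds`, and receives the data through `split()`. The fold contents may differ from the output shown above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold  # assumes scikit-learn >= 0.18

y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
# placeholder features: only the labels influence a stratified split
X = np.arange(20).reshape(10, 2)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
splits = list(skf.split(X, y))
for fold, (train_indexes, test_indexes) in enumerate(splits):
    print("fold", fold)
    print("train", train_indexes)
    print("test ", test_indexes)
    print()
```

With five examples of each class and five folds, every test fold should contain one example of each class.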



In [15]:
fold = 0
for train_indexes, test_indexes in cv:
    print("fold",fold)
    X_train, X_test = X[train_indexes], X[test_indexes]
    y_train, y_test = y[train_indexes], y[test_indexes]
    print("training x:", X_train)
    print("training y:", y_train)
    print("testing  x:", X_test)
    print("testing  y:", y_test)
    fold += 1
    break

fold 0
training x: [[10 20]
 [30 40]
 [35 45]
 [11 21]
 [12 22]
 [32 42]
 [13 23]
 [33 43]]
training y: [0 0 0 0 1 1 1 1]
testing  x: [[15 25]
 [31 41]]
testing  y: [0 1]


# Q3.

The best way to calculate F-measures for cross-validation is to add up the true positives, false positives, etc. for each fold, then use the combined counts to calculate the F-measure.
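
To illustrate why the counts are combined first, the sketch below compares the pooled $F_1$ with the average of the per-fold $F_1$ scores. The per-fold counts are made up for illustration, and `f1_from_counts` is a hypothetical helper that uses the equivalent form $F_1 = \frac{2TP}{2TP+FP+FN}$.

```python
def f1_from_counts(tp, fp, fn):
    # algebraically the same as 2PR/(P+R); assumed to be 0.0 if there are no positives at all
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# illustrative per-fold counts only
fold_counts = [
    {"tp": 5, "fp": 1, "fn": 0},
    {"tp": 2, "fp": 3, "fn": 4},
    {"tp": 4, "fp": 0, "fn": 2},
]

# pooled: add up the counts over all folds, then compute F1 once
total = {k: sum(fold[k] for fold in fold_counts) for k in ("tp", "fp", "fn")}
pooled_f1 = f1_from_counts(**total)

# alternative: compute F1 per fold, then average the scores
mean_f1 = sum(f1_from_counts(**fold) for fold in fold_counts) / len(fold_counts)

print("pooled F1:", pooled_f1)
print("mean of per-fold F1:", mean_f1)
```

The two numbers generally differ, because averaging per-fold scores gives every fold equal weight regardless of how many positives it contains.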

Do a 10-fold cross-validation on the NB and SVM Swadesh language classifiers using `StratifiedKFold`.

Calculate the overall $F_1$ measure for each of the models.

In [16]:
cv = cross_validation.StratifiedKFold(Ywords_all, n_folds=10, shuffle=True, random_state=0)
fold = 0
models = (nbLanguagePredictor(), svm.SVC())
predictions = ([],[])
realvals = ([],[])
for train_indexes, test_indexes in cv:
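    # use a tuple for the names: a set has no guaranteed order, so it could pair "SVM" with the NB model
    for model,pred,real,name in zip(models, predictions, realvals,("NB","SVM")):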
        trainWords, testWords = Xwords_all[train_indexes], Xwords_all[test_indexes]
        if name=="SVM":
            trainWords, testWords = list(map(vecRepr,trainWords)), list(map(vecRepr,testWords))
        model.fit(trainWords, Ywords_all[train_indexes])
        thisPred = model.predict(testWords)
        pred.extend(thisPred)
        real.extend(Ywords_all[test_indexes])
    fold += 1
TP = [sum(1 for lp,l in zip(pred, real) if lp and l) for pred,real in zip(predictions, realvals)]
FP = [sum(1 for lp,l in zip(pred, real) if lp and not l) for pred,real in zip(predictions, realvals)]
TN = [sum(1 for lp,l in zip(pred, real) if not lp and not l) for pred,real in zip(predictions, realvals)]
FN = [sum(1 for lp,l in zip(pred, real) if not lp and l) for pred,real in zip(predictions, realvals)]

Precision = [tp/(tp+fp) for tp,fp in zip(TP,FP)]
Recall    = [tp/(tp+fn) for tp,fn in zip(TP,FN)]
F1        = [2*P*R/(P+R) for P,R in zip(Precision, Recall)]
print("NB  - precision: ", Precision[0], " recall: ", Recall[0], "F1: ", F1[0])
print("SVM - precision: ", Precision[1], " recall: ", Recall[1], "F1: ", F1[1])

NB  - precision:  0.6769911504424779  recall:  0.7391304347826086 F1:  0.7066974595842956
SVM - precision:  0.7615894039735099  recall:  0.5555555555555556 F1:  0.6424581005586593
