In [None]:
# Training the model

At this point, you should understand how to read, preprocess and vectorize your corpus. Completing these steps allows you to finally train a text classification model.

In [None]:
## Creating a train and test set

In [None]:
As mentioned during the introduction, supervised learning consists of training and testing a model. We build a model with training data and subsequently evaluate how well it performs on unseen examples.

Therefore, we first split our data into a train and test set. Luckily, we've done most of the work for you, by adding the `split` column.

The code below creates two variables (of the pandas.DataFrame type) containing the train and test sentences with their labels.
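
As a minimal sketch of what this filtering does, the toy DataFrame below (invented for illustration, not the workshop data) is split on a `split` column with a boolean mask. The column names mirror the ones used in this notebook, but the rows are made up.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Sentence": ["the machine thinks", "the engine broke",
                 "the worker rests", "the wheel turns"],
    "Animacy": [1, 0, 1, 0],
    "split": ["train", "train", "train", "test"],
})

# A boolean mask keeps only the rows where the condition holds
toy_train = toy[toy["split"] == "train"]
toy_test = toy[toy["split"] == "test"]

print(toy_train.shape, toy_test.shape)
# Proportion of rows that ended up in the training set
print(len(toy_train) / len(toy))
```

Both results are ordinary DataFrames, so everything you learned about selecting columns still applies to them.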

In [None]:
train = df[df.split=='train'] 
test = df[df.split=='test']

In [None]:
print(train.shape,test.shape)
print(train.shape[0]/df.shape[0])

In [None]:
When we vectorize the data with `.fit_transform()`, we only look at the training examples.  

Please remember that the model is not allowed to see examples from the test set (otherwise you are cheating!). We won't touch the test examples until the very end of the classification process. 

To transform the training sentences, we create an instance of the `CountVectorizer` class and define how we'd like to transform the text by specifying arguments such as `min_df` and `ngram_range`.

Feel free to change these settings later on and see what happens.
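
To get a feeling for what these settings do, here is a small sketch on a made-up corpus of three sentences. The values `min_df=2` and the sentences themselves are assumptions chosen to keep the output readable; they are not the settings used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only
toy_docs = [
    "the machine works",
    "the machine rests",
    "the engine works",
]

# min_df=2 keeps only terms that occur in at least two documents;
# ngram_range=(1, 2) extracts single tokens and pairs of adjacent tokens
toy_vectorizer = CountVectorizer(min_df=2, ngram_range=(1, 2), token_pattern=r"\S+")
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# One column per retained term, in alphabetical order
print(sorted(toy_vectorizer.vocabulary_))
print(toy_matrix.toarray())
```

Terms such as "engine" or "machine rests" occur in only one sentence and are therefore dropped. Lowering `min_df` to 1 would bring them back.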

In [None]:
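vectorizer = CountVectorizer(min_df=5, # discard terms that appear in fewer than 5 sentences
                             max_df=0.9, # discard terms that appear in more than 90% of the sentences
                             ngram_range=(1,2), # use single tokens and pairs of adjacent tokens
                             token_pattern=r"\S+") # treat every whitespace-separated string as a token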

In [None]:
Below, we apply `.fit_transform()` on the processed sentences in our DataFrame. This returns a document-term matrix, which we store in `X_train`.

`y_train` contains the correct or actual label for each sentence (row) in `X_train`. These labels were obtained via human annotation.

In [None]:
X_train = vectorizer.fit_transform(train.SentenceProcessed)
y_train = train.Animacy

In [None]:
print(X_train.shape,y_train.shape)

In [None]:
We are almost there. Nearly all ingredients are in place, except perhaps the most important one: the **learning algorithm**.

We have to select the algorithm that will allow us to learn the relation between features and labels.

For this example, we selected a Naive Bayes classifier. Even though it is rather old, it is still often used in the Digital Humanities and provides a competitive baseline.

We won't have time to discuss the algorithm in detail. For those who are interested, the Naive Bayes algorithm adheres to the following formula:

![Naive Bayes Algorithm](https://wikimedia.org/api/rest_v1/media/math/render/svg/52bd0ca5938da89d7f9bf388dc7edcbd546c118e)

![Expansion of Naive Bayes Algorithm](https://wikimedia.org/api/rest_v1/media/math/render/svg/6150f41afac2076bad6e326ebbdb96fa9ee4ca82)

This may look complicated, but the math is rather straightforward. We compute the probability of a label `C_k` given the words `x` in a text (`P(C_k|x)`). By applying Bayes' rule and assuming that words occur independently of each other (the "naive" part), this probability is proportional to the probability of `C_k` (how often does the label occur in the training set) multiplied by the probabilities of seeing each word `x_i` in documents with label `C_k` (`P(x_i|C_k)`).
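
The calculation described above can be sketched by hand. All numbers below are hypothetical and invented for illustration; a real classifier estimates them from the training data.

```python
import math

# Hypothetical numbers, invented for illustration only
priors = {"animate": 0.4, "inanimate": 0.6}            # P(C_k)
likelihoods = {                                        # P(x_i | C_k)
    "animate":   {"machine": 0.10, "speak": 0.30},
    "inanimate": {"machine": 0.40, "speak": 0.02},
}
words = ["machine", "speak"]                           # the words x in a text

# Unnormalised score per class: P(C_k) times the product of P(x_i | C_k)
scores = {
    label: priors[label] * math.prod(likelihoods[label][w] for w in words)
    for label in priors
}

# Dividing by the sum of the scores turns them into posteriors P(C_k | x)
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}

print(posteriors)
# The predicted label is the one with the highest posterior
print(max(posteriors, key=posteriors.get))
```

Although "machine" on its own is more typical of the inanimate class in this toy setup, the word "speak" tips the balance towards animate.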

For more information consult the [Wikipedia page](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) or the [NLTK handbook](https://www.nltk.org/book/ch06.html).

There are of course more complicated models, but it's good to give the Naive Bayes classifier a try. It often yields good results and is more transparent than other models (less of a black box).

In [None]:
# import the MultinomialNB class
from sklearn.naive_bayes import MultinomialNB

In [None]:
After instantiating the model, we call the `.fit()` method. This computes the class probabilities (prior) and conditional probabilities of the words (likelihood). 
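
To make this more tangible, the sketch below fits a `MultinomialNB` on a tiny made-up document-term matrix and recomputes the smoothed likelihoods of one class by hand. The matrix and labels are assumptions for illustration only.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy document-term matrix (3 documents, 2 terms), invented for illustration
X_toy = np.array([[2, 0],
                  [1, 1],
                  [0, 3]])
y_toy = np.array([0, 0, 1])

toy_clf = MultinomialNB(alpha=1)
toy_clf.fit(X_toy, y_toy)

# Priors: the share of training documents per class (2/3 and 1/3 here)
print(np.exp(toy_clf.class_log_prior_))

# Likelihoods with add-one smoothing, computed by hand for class 0:
# (count of term in class + alpha) / (total count in class + alpha * n_terms)
counts_class0 = X_toy[y_toy == 0].sum(axis=0)           # [3, 1]
manual = (counts_class0 + 1) / (counts_class0.sum() + 1 * X_toy.shape[1])
print(manual, np.exp(toy_clf.feature_log_prob_[0]))
```

The `alpha` argument controls the smoothing: it prevents a term that never occurred with a class from receiving a probability of zero.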

In [None]:
clf = MultinomialNB(alpha=1)
clf.fit(X_train,y_train)

In [None]:
You can inspect these probabilities, which are stored as logarithms in the `.feature_log_prob_` attribute of the variable `clf`.

The shape of this matrix is (2,982) as there are two classes and 982 different features. 

In [None]:
clf.feature_log_prob_.shape

In [None]:
X_train.shape

In [None]:
Below we retrieve the conditional log probabilities `log P(x_i | C_k)` for the noun "labour", and see that it slightly favours the not-animate class.

In [None]:
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
vectorizer.get_feature_names_out()[500]

In [None]:
clf.feature_log_prob_[:,500]

In [None]:
# Evaluating the model




In [None]:
## Out of sample accuracy

We have trained the model and inspected some of its inner workings. But the most important question remains unanswered: how well does it perform in recognizing animacy in text? 

To answer this question, we gauge the model's accuracy on **examples that it hasn't seen yet** (these examples were not observed during training, i.e. when computing the label priors and conditional probabilities).

Before we do this, however, we have to convert the test sentences (which we set aside earlier) in exactly the **same way as we processed the training examples**. In other words: we have to create a new document-term matrix for the test set, using the same procedure for vectorization.

Luckily, this is easy with Python's scikit-learn library. We can simply reuse the vectorizer we fitted earlier: instead of `.fit_transform()`, we apply `.transform()` to the sentences in the `SentenceProcessed` column.

We also create a new array in which we store the actual labels.
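
The difference between `.fit_transform()` and `.transform()` can be sketched on a few made-up sentences (invented for illustration, not taken from the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy sentences, invented for illustration only
toy_train_docs = ["the machine works", "the worker rests"]
toy_test_docs = ["the robot works"]

toy_vec = CountVectorizer(token_pattern=r"\S+")
toy_X_train = toy_vec.fit_transform(toy_train_docs)   # learns the vocabulary
toy_X_test = toy_vec.transform(toy_test_docs)         # reuses it unchanged

# Both matrices have the same number of columns
print(toy_X_train.shape, toy_X_test.shape)

# "robot" never occurred in the training data, so it has no column
print("robot" in toy_vec.vocabulary_)
print(toy_X_test.toarray())
```

Words that only occur in the test sentences are silently ignored, which is exactly what we want: the model may not learn anything from the test set.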

In [None]:
# transform processed sentences to a document term matrix
X_test = vectorizer.transform(test.SentenceProcessed)
# create an array with all the labels of the test examples
y_test = test.Animacy

In [None]:
Next, we apply the model (which we fitted during training) to the test set. The `.predict()` method is all you need! It returns an array with the predictions for each sentence (which we save in the `pred` variable).

In [None]:
pred = clf.predict(X_test)

In [None]:
Below we print the first ten predictions and compare them with the actual labels.

In [None]:
print('Predictions=',pred[:10])
print('Actual labels=',y_test[:10].values)

In [None]:
Not bad! The model was only wrong once: for the second sentence it predicted animate, while in fact the sentence was annotated as inanimate (in the literature this is called a False Positive).

Just looking at these predictions doesn't get us far. Luckily, there are established metrics that estimate the performance of the model. The most common measure is **accuracy**, which is simply the number of correct predictions divided by the total number of predictions. 

You may also encounter the **error rate**, which is simply 1 - accuracy.
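
Both measures are easy to compute by hand. The sketch below uses ten hypothetical labels and predictions (invented for illustration, not the output of our model) and compares the result with scikit-learn's `accuracy_score` helper.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions, invented for illustration only
toy_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
toy_pred = np.array([1, 1, 1, 1, 0, 0, 1, 0, 1, 0])

# Accuracy: correct predictions divided by all predictions
accuracy = (toy_true == toy_pred).mean()
error_rate = 1 - accuracy

print(accuracy, error_rate)
# scikit-learn's helper should give the same accuracy
print(accuracy_score(toy_true, toy_pred))
```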

Other commonly used metrics are precision, recall and f1-score. We won't discuss them here, but please inspect their Wikipedia pages.
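
For readers who would like a concrete picture anyway, here is a minimal sketch that derives these three metrics from the counts of true and false positives and negatives. The labels and predictions are hypothetical and invented for illustration.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical labels and predictions, invented for illustration only
toy_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]

# Rows are actual labels, columns are predicted labels
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

precision = tp / (tp + fp)   # how many predicted positives are correct
recall = tp / (tp + fn)      # how many actual positives were found
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)

# The scikit-learn helpers take the actual labels first, then the predictions
print(precision_score(toy_true, toy_pred),
      recall_score(toy_true, toy_pred),
      f1_score(toy_true, toy_pred))
```

Note that the order of the arguments matters: swapping the actual labels and the predictions swaps precision and recall.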

Sklearn provides us with a convenient function, `classification_report`, that returns a summary of the output with all these metrics. It expects the actual labels as its first argument and the predictions as its second.

Below we print the classification report and observe that we obtain close to 80% accuracy!

Not bad, right? Can you do better? Scroll down if you want to play with other models.

In [None]:
from sklearn.metrics import classification_report
# actual labels first, then predictions; swapping them would swap precision and recall
print(classification_report(y_test,pred))

In [None]:
## Classifying other examples

After training, you can deploy the classifier and apply it to any sentence. The only condition is that all texts should be processed and vectorized in the same way as the training data.

Fortunately, given that we already wrote all these functions and trained the model, this is a rather easy task.

If you want to experiment yourself, you can easily change the string assigned to `sentence_new`.

In [None]:
sentence_new = 'The machine was a very smart, it wrote many books and spoke like a philosopher.'
# process sentence
sentence_new_proc = refined_preprocess(sentence_new)
print(sentence_new_proc)


In [None]:
After preprocessing the example sentence (each token is now a lemma_part-of-speech pair), we can vectorize it using the `.transform()` method of the `vectorizer` fitted on the training data. This method expects a list of documents, which is why we put the sentence between square brackets.

You'll observe that the new document-term matrix has exactly the same number of columns as `X_train`. If these dimensions are different, you've done something wrong and the following steps will raise an error.

In [None]:
X_new = vectorizer.transform([sentence_new_proc])
print(X_new.shape)

In [None]:
Now we apply `.predict()` to the vectorized sentence, and, wow, it's correct! The classifier did its work properly.

Of course, this model is far from perfect. Experiment with other examples and try to understand in which scenarios it works and when it fails.
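
When experimenting, it can help to look at how confident the classifier is, not only at the label it returns. The sketch below trains a tiny model on made-up sentences (an assumption for illustration, not our dataset) and calls `.predict_proba()`, which Naive Bayes classifiers in scikit-learn provide.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data, invented for illustration only (1 = animate, 0 = not animate)
toy_docs = ["the machine speaks", "the machine thinks",
            "the machine broke", "the engine broke"]
toy_labels = [1, 1, 0, 0]

toy_vec = CountVectorizer(token_pattern=r"\S+")
toy_clf = MultinomialNB(alpha=1).fit(toy_vec.fit_transform(toy_docs), toy_labels)

# predict_proba returns one row per sentence and one column per class,
# ordered as in toy_clf.classes_
toy_new = toy_vec.transform(["the machine speaks", "the engine"])
toy_proba = toy_clf.predict_proba(toy_new)
print(toy_clf.classes_)
print(toy_proba.round(3))
```

A probability close to 0.5 signals that the model is unsure, which is often a good starting point for finding sentences where it fails.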

In [None]:
clf.predict(X_new)[0]

In [None]:
Lastly, we can interrogate the model itself more systematically, something we already touched on when inspecting the conditional probabilities. Don't worry if the code below is hard to follow; you can still run it.

It finds and prints the features with the highest probabilities for each of the two classes. In other words: it returns the expressions that the model considers most likely within each class.
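
The key step is ranking the features by their (log) probability. As a small sketch with a hypothetical four-word vocabulary and made-up probabilities:

```python
import numpy as np

# Hypothetical vocabulary and probabilities for one class, invented for illustration
toy_features = np.array(["engine", "machine", "speak", "think"])
toy_log_prob = np.log(np.array([0.05, 0.25, 0.40, 0.30]))

# argsort() orders indices from the lowest to the highest value,
# so the order is reversed to put the most probable features first
top_idx = toy_log_prob.argsort()[::-1][:2]
print(toy_features[top_idx])
```

Without the reversal, the first positions would hold the least probable features instead.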

In [None]:
import numpy as np

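# argsort() sorts from low to high, so reverse to rank the most probable features first
neg_class_prob_sorted = clf.feature_log_prob_[0, :].argsort()[::-1]
pos_class_prob_sorted = clf.feature_log_prob_[1, :].argsort()[::-1]

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
feature_names = vectorizer.get_feature_names_out()
print(np.take(feature_names, neg_class_prob_sorted[:20]))
print(np.take(feature_names, pos_class_prob_sorted[:20]))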

In [None]:
## Experimenting with other models

In [None]:
from sklearn.svm import SVC
clf = SVC(C=1,kernel='rbf',class_weight='balanced')
clf.fit(X_train,y_train)
pred = clf.predict(X_test)
print(classification_report(y_test,pred)) # actual labels first, then predictions

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_train,y_train)
pred = clf.predict(X_test)
print(classification_report(y_test,pred)) # actual labels first, then predictions

In [None]:
# Putting everything together

The code cells below put each step together into one pipeline.
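
As an aside, scikit-learn also offers a `Pipeline` object that bundles the vectorizer and the classifier into a single model. The cells below do not use it; the following is only a minimal sketch on made-up sentences, assuming the same `CountVectorizer` and `MultinomialNB` building blocks.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration only (1 = animate, 0 = not animate)
toy_train_docs = ["the machine speaks", "the machine thinks",
                  "the machine broke", "the engine broke"]
toy_train_labels = [1, 1, 0, 0]

# The pipeline vectorizes and classifies in one object:
# fit() calls fit_transform() on the vectorizer, predict() calls transform()
toy_model = make_pipeline(
    CountVectorizer(token_pattern=r"\S+"),
    MultinomialNB(alpha=1),
)
toy_model.fit(toy_train_docs, toy_train_labels)

toy_predictions = toy_model.predict(["the machine speaks", "the engine broke"])
print(toy_predictions)
```

The advantage of this design is that you cannot accidentally refit the vectorizer on the test sentences.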

In [None]:
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

In [None]:
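# lemmatize and lowercase each token; assumes `nlp` (the language pipeline loaded earlier in the notebook) is available
ps = lambda x: ' '.join([t.lemma_.lower() for t in nlp(x)])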

In [None]:
df = pd.read_csv('playing_animacy_data.tsv',sep='\t',index_col=False)
df['SentenceProcessed'] = df.TextSnippet.apply(ps)

In [None]:
train = df[df.split=='train']
test = df[df.split=='test']

In [None]:
vectorizer = CountVectorizer(min_df=5, 
                             max_df=0.9,
                             ngram_range=(1,3),
                             token_pattern=r"\S+")

X_train = vectorizer.fit_transform(train.SentenceProcessed)
y_train = train.Animacy

X_test = vectorizer.transform(test.SentenceProcessed)
y_test = test.Animacy

In [None]:
print(X_train.shape,X_test.shape)

In [None]:
clf = MultinomialNB(alpha=1)
clf.fit(X_train,y_train)

In [None]:
pred = clf.predict(X_test)
print(classification_report(y_test,pred)) # actual labels first, then predictions