[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/kasparvonbeelen/ghi_python/4-tables?labpath=13_-_Supervised_Learning.ipynb)


# Lecture 13: Supervised Learning

## Data Science for Historians (with Python)

## A Gentle Introduction to Working with Data in Python

### Created by Kaspar Beelen and Luke Blaxill

### For the German Historical Institute, London

<img align="left" src="https://www.ghil.ac.uk/typo3conf/ext/wacon_ghil/Resources/Public/Images/institute_icon_small.png">






This notebook provides a concise introduction to supervised learning in Python.

You'll learn:
- the basic components of a supervised classification pipeline
- how to load and preprocess your data
- how to vectorize data
- how to train a text classification model
- how to assess if your model works

Supervised learning is probably the most common type of machine learning. In this scenario, **we want to "teach" a machine to learn from labelled examples**.

When applied to textual data, we want a computer to learn classification, based on a set of labelled examples.

Imagine your corpus looks like this:

```python
train_corpus = [["I am happy happy !", "Pos"],
          ["I am sad sad", "Neg"],
          ["he is not happy", "Neg"],
          ["that is not bad", "Pos"],
          ["urrrrrggghh :'(", "Neg"],
          ['this is AWESOME ! ! !',"Pos"]]
```

With these data—a small set of examples, with very short exclamations, and labels ('Neg' and 'Pos')—we want to teach the computer to recognize emotion in texts (a task commonly referred to as **emotion mining**). 

To build an emotion classifier, we apply an algorithm that learns the relation between the content of a document (think of words) and the label. This step is called **training** or **fitting** the model. 

The goal of training is to detect patterns in the documents that "betray" the label. Such a pattern is commonly referred to as the **signal** (words like "happy" and "sad"); other words, which don't convey emotions, are **noise** ("he", "I"). 

Machine learning models are engineered to efficiently distinguish signal from noise. In the above example, the model will learn that the word "happy" corresponds with the `'Pos'` label, while "sad" is associated with `'Neg'`.

Training on labelled data creates a text classification model. This model is able to **predict the label** of a document given a text. The variable `clf` refers to a classification model. 

```python
clf.predict(['Haha , he is happy !'])
```

Hopefully, it returns the label 'Pos' (if the model is properly fitted).
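
The snippet above assumes that `clf` already exists. As a minimal sketch of where such a model could come from, the cell below fits a scikit-learn pipeline on the toy corpus. The combination of `make_pipeline`, `CountVectorizer` and `MultinomialNB` is an assumption made for illustration (the rest of this notebook builds these steps one at a time). With only six training examples, the prediction may well be wrong.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_corpus = [["I am happy happy !", "Pos"],
                ["I am sad sad", "Neg"],
                ["he is not happy", "Neg"],
                ["that is not bad", "Pos"],
                ["urrrrrggghh :'(", "Neg"],
                ['this is AWESOME ! ! !', "Pos"]]

# separate the texts from their labels
texts = [text for text, label in train_corpus]
labels = [label for text, label in train_corpus]

# a pipeline that first counts words and then fits a Naive Bayes classifier
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)

# predict the label of a sentence the model has not seen before
pred = clf.predict(['Haha , he is happy !'])
print(pred)
```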

To establish if our model works, we set aside a small sample of our labelled data for testing purposes: this sample is called the **test set**. We want to know how well our model performs on examples it hasn't seen during training, to determine its **out of sample accuracy**.

```python
test_corpus_text = ['he is not sad',
                    'the dog is happy',
                    'the puppy is sad']

test_corpus_labels = ['Pos','Pos','Neg']

# predict a label for each test sentence
pred = clf.predict(test_corpus_text)

```

The variable `pred` contains the predicted labels:

```python
['Neg','Pos','Neg']
```

You can see that the model got the first sentence wrong: it saw 'sad' and probably missed that it was preceded by a negation ('not'). So confusing!

Now we can compute the out of sample **accuracy** on the test set by comparing the actual labels (`test_corpus_labels`) with the predictions (`pred`). The model was correct in two of the three cases, meaning it has an accuracy of 2/3 = 66.7%.



In [None]:
round(2/3*100,1)

66.7
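
Instead of typing in the fraction by hand, the same number can be derived from the two lists of labels. The cell below is a small sketch in plain Python that reuses the labels and predictions from the example above.

```python
test_corpus_labels = ['Pos', 'Pos', 'Neg']   # actual labels
pred = ['Neg', 'Pos', 'Neg']                 # predicted labels

# count how often the prediction equals the actual label
correct = sum(actual == predicted
              for actual, predicted in zip(test_corpus_labels, pred))

# accuracy is the share of correct predictions
accuracy = correct / len(test_corpus_labels)
print(round(accuracy * 100, 1))  # 66.7
```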

### When to use supervised classification

- You already know the categories of interest.
- You want to organize a large corpus of text according to these categories.
- You want to detect specific phenomena in text.

# Task definition

These steps are the basic elements of the supervised learning pipeline. But enough theory, let's work on a practical and more realistic example!

![alt text](https://media.giphy.com/media/l41lLs970IkkBi6f6/giphy.gif)

We want to train a model that predicts whether the sentence contains an animated machine (or not). Put more simply: **is the machine alive?**



This example consists of the following steps:
  - Loading data
  - Preprocessing
  - Vectorization
  - Training
  - Evaluation
  - Application and Inspection of the model

# Loading data

In this part of the tutorial we continue working with the Living Machines dataset.
- at the left-hand-side of the screen, you should see a **folder** icon.
- click on the folder icon, a blade opens with a folder `sample_data` in it.
- drag the `playing_animacy_data.tsv` to the **empty space under** this folder. 
- you may get a message telling you that data will be removed after recycling the Runtime, just click `OK`.

This should work! Run the code below to check. It should return `True`.



In [None]:
import pandas as pd
df = pd.read_csv('playing_animacy_data.tsv',sep='\t',index_col=False, )
isinstance(df,pd.DataFrame)

True

## Alternative ways to load data

### Upload

In [None]:
from google.colab import files
uploaded = files.upload()

### Import from Google Drive 

In [None]:
#import packages and authorize connection to Google account:
import gspread
from gspread_dataframe import get_as_dataframe, set_with_dataframe
from google.colab import auth
auth.authenticate_user()
from oauth2client.client import GoogleCredentials
gc = gspread.authorize(GoogleCredentials.get_application_default())

In [None]:
# We can load the data like this (where the URL is the address of your dataset):
spreadsheet = gc.open_by_url("https://docs.google.com/spreadsheets/d/1VLw3x4mrg2IIFHliyLSAYMAHCpMYBkW_HDMQOPj4seQ/edit#gid=698458315").get_worksheet(0)

# We convert the records of the worksheet to a pandas DataFrame:
df = pd.DataFrame(spreadsheet.get_all_records())

## Loading CSV as a Pandas DataFrame

After uploading the data to Colab we can open it as a Pandas DataFrame.

> What is Pandas?
>
> What is a DataFrame?
 


In [None]:
# load pandas using the abbreviation pd
import pandas as pd

In [None]:
# help(pd)

To load the data we use the function `read_csv`, which here takes three arguments:
- a positional argument (the path) that indicates where the data is stored
- a named argument `sep` that indicates how the columns are separated (in this case a tab or `\t` symbol, but often this is a comma)
- a named argument `index_col`; setting it to `False` tells pandas not to use the first column as the index
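
To see these arguments at work without needing the file, here is a minimal sketch that reads a tiny tab-separated table from a string. The two rows and the name `toy_df` are made up for illustration; `io.StringIO` simply makes the string behave like a file.

```python
import io
import pandas as pd

# a hypothetical two-row table, with columns separated by tabs
tsv_text = "TextSnippet\tAnimacy\nthe engine groaned\t1\nthe machine was sold\t0\n"

# sep='\t' splits on tabs; index_col=False keeps all columns as data
toy_df = pd.read_csv(io.StringIO(tsv_text), sep='\t', index_col=False)

print(toy_df.shape)          # (2, 2)
print(list(toy_df.columns))  # ['TextSnippet', 'Animacy']
```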

In [None]:
# we store the animacy data in a variable with the name df
# the data type of the variable is a pandas DataFrame
df = pd.read_csv('playing_animacy_data.tsv',sep='\t',index_col=False, )

Check the type of the `df` variable.

In [None]:
isinstance(df,pd.DataFrame)

True

The `.head()` method provides a way to inspect the first n rows of the `pandas.DataFrame` (five by default).

In [None]:
df.head()

Unnamed: 0,TextSnippet,MachineType,Date,Category,Humanness,Animacy,split
0,". and had almost resolved to go Wffik, when h...",locomotive,1890,machine as a human,1,1,train
1,I once made an experiment of this kind on a ch...,machine,1887,machine is inanimate object without agency,0,0,train
2,"hot-house, the forced labour of the beast iu t...",machine,1836,other,0,0,train
3,"The next fifteen or twenty years may, therefor...",machines,1892,machine is inanimate object without agency,0,0,train
4,THE LAST OF THE BARONS. 28 ray ; for Coniers ...,machinery,1895,human as a machine,0,1,train


The `.shape` attribute shows the number of rows and columns in the DataFrame.

In [None]:
df.shape

(393, 7)

# Preprocessing text data 

### Why preprocessing?

In supervised text classification, we want a model to find textual **patterns** that are predictive of a document's label, i.e., we want the algorithm to learn how tokens in a document correspond with the label.

While machine learning models are mostly strong at recognizing such patterns in your data, **they cannot do all the work for you**.

The way you "feed" the data to the model has an impact on how well it will perform (i.e. predict the correct label given the (transformed) content of a document).

Each task is different, and you need to adjust the preprocessing steps to the concepts you want to detect in your data.

Please ask yourself what aspects of the text help distinguish the target categories:

- **Capitals**: if names are an important feature, you don't want to lowercase your text. However, for emotion detection, the difference between "Hamburger" and "hamburger" is less relevant.
- **Parts-of-Speech**: emotion often resides in adjectives and adverbs, nouns are indicative of topic. You could discard all words of a particular part-of-speech to remove "noise".

In the end, what works well is often an empirical question, but you can't examine all possible scenarios. Working on the basis of **intuitions** and assumptions is valid, as long as you are explicit about them.

Machine learning is always influenced and manipulated by **human intervention**.

But without further ado, let's go ahead with preprocessing our data.

### Preprocessing texts with Pandas

In what follows we use the `.apply()` method (attached to the columns of the DataFrame object) to preprocess our sentences. This section builds on the previous presentation on spaCy.

`.apply()` applies (what's in a name!) a function (entered as an argument between the parentheses) to each cell in a column.

For example, we can convert all strings to lowercase using the `str.lower()` method.

Normally `.lower()` is applied to a string object as in the following code cell:

In [None]:
"CONVERT mE tO LoWERcAsE, PLEASE!".lower()

'convert me to lowercase, please!'

Which is equivalent to:

In [None]:
str.lower("CONVERT mE tO LoWERcAsE, PLEASE!")

'convert me to lowercase, please!'

To lowercase text in the `TextSnippet` column, simply apply `str.lower` (without the parentheses).

In [None]:
df.TextSnippet.apply(str.lower)

0      .  and had almost resolved to go wffik, when h...
1      i once made an experiment of this kind on a ch...
2      hot-house, the forced labour of the beast iu t...
3      the next fifteen or twenty years may, therefor...
4      the last of the barons.  28 ray ; for coniers ...
                             ...                        
388    in spite of myt avish, i learned fishing tho r...
389    for the best locomotive that could be made.  i...
390    he chooses those modes of fighting in avhich t...
391    others mimic the cries of barnyard fowl with m...
392    well, good-bye till dinner-time," responded le...
Name: TextSnippet, Length: 393, dtype: object

Lowercasing is also available as a built-in method attached to the columns of the DataFrame (which are of type `pandas.Series`), via the `.str` accessor.

In [None]:
df.TextSnippet.str.lower()

0      .  and had almost resolved to go wffik, when h...
1      i once made an experiment of this kind on a ch...
2      hot-house, the forced labour of the beast iu t...
3      the next fifteen or twenty years may, therefor...
4      the last of the barons.  28 ray ; for coniers ...
                             ...                        
388    in spite of myt avish, i learned fishing tho r...
389    for the best locomotive that could be made.  i...
390    he chooses those modes of fighting in avhich t...
391    others mimic the cries of barnyard fowl with m...
392    well, good-bye till dinner-time," responded le...
Name: TextSnippet, Length: 393, dtype: object
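
`.apply()` is not limited to built-in functions such as `str.lower`: any function that takes one cell as input will do. The sketch below uses a made-up two-row DataFrame and a hypothetical helper `count_tokens` to show both routes side by side.

```python
import pandas as pd

# a hypothetical miniature version of the TextSnippet column
toy_df = pd.DataFrame({'TextSnippet': ['The ENGINE groaned .', 'A quiet machine']})

def count_tokens(sentence: str) -> int:
    """Return the number of whitespace-separated tokens in a sentence."""
    return len(sentence.split())

# .apply() calls count_tokens once for every cell in the column
toy_df['n_tokens'] = toy_df['TextSnippet'].apply(count_tokens)
# the .str accessor offers common string methods directly
toy_df['lowercased'] = toy_df['TextSnippet'].str.lower()

print(toy_df)
```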

You can use more string methods. To see which ones are available, inspect the various help and documentation functions provided by Python.

In [None]:
?df.TextSnippet.str

In [None]:
dir(df.TextSnippet.str)

In [None]:
help(df.TextSnippet.str)

Lowercasing text isn't enough. We'd like to use more of the NLP candy Mariona shared with us in a previous Notebook. This can be easily done by building a preprocessing function that combines various steps. 

Below we build the skeleton for such a function. It doesn't do anything yet, but shows how to document your code by using:
- [`typing`](https://docs.python.org/3/library/typing.html): type hints in the creation of the function 
- Docstring: a summary of what the function does, what it expects as input and returns as output

In [None]:
def preprocess(sentence: str = ''):
  """preprocessing function that takes a string as input
  perform steps X, Y, Z, and returns the converted sentence as a string.
  Arguments:
    sentence (str): input sentence
  Returns:
    a converted sentence as a str object
  """
  return sentence

Please note that these ornaments are not required by the Python syntax. However, they convey that you take code seriously and would like others to understand what you are doing (this "other" could be you, a few months later).

OK, let's add some more spaCy functionality. We want to:
- lowercase the sentence
- split it into tokens
- get the lemma of each token
- return the tokenized and lemmatized text as a string.

In [None]:
# load the spaCy library
import spacy
# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load("en_core_web_sm")

In [None]:
def preprocess(sentence: str = ''):
  """preprocessing function that takes a string as input
  perform lowercasing, tokenization and lemmatization
  and returns the converted sentence as a string.
  Arguments:
    sentence (str): input sentence
  Returns:
    a converted sentence as a str object
  """
  # the nlp function performs tokenization under the hood
  sentence_nlp = nlp(sentence)
  # use a list comprehension to collect all lemmas in a list 
  # lowercase the lemma
  sentence = [token.lemma_.lower() for token in sentence_nlp]
  # convert the list of lemmas to string
  sentence = ' '.join(sentence)
  # return the converted string
  return sentence

Nice! To inspect the magic performed by this simple function, let's see how it handles the first sentence of our DataFrame.

In [None]:
sentence = str(df.iloc[0].TextSnippet)
print(sentence)

.  and had almost resolved to go Wffik, when he heard the faint, dis tant scream of a locomotive and the sound recalled him to himself.  He saw that there Was a man who Was carrying the lantern, and within a minutes more he had run up, almost out of breath, aud was brokenly telling his story of what he had seen ahead, to which ha trackman listened in silence.


In [None]:
print(preprocess(sentence))

.   and have almost resolve to go wffik , when -pron- hear the faint , dis tant scream of a locomotive and the sound recall -pron- to -pron- .   -pron- see that there be a man who be carry the lantern , and within a minute more -pron- have run up , almost out of breath , aud be brokenly tell -pron- story of what -pron- have see ahead , to which ha trackman listen in silence .


### Taking a closer look

If you are not familiar with Python, you may wonder why there is no `for` loop, as was the case in many of the earlier examples. We used a **list comprehension**, which has a more concise syntax and is usually faster. If the function is difficult to comprehend, the "extended edition" below writes out each step in more detail.


In [None]:
def preprocess_extended(sentence: str = ''):
  """preprocessing function that takes a string as input
  perform lowercasing, tokenization and lemmatization
  and returns the converted sentence as a string.
  Arguments:
    sentence (str): input sentence
  Returns:
    a converted sentence as a str object
  """
  # the nlp function performs tokenization under the hood
  sentence_nlp = nlp(sentence)
  
  # create an empty list in which we save the lowercased lemmas
  sentence_out = []
  # iterate over all tokens
  for token in sentence_nlp:
    lemma = token.lemma_
    lemma_lower = lemma.lower()
    sentence_out.append(lemma_lower)
  
  # convert the list of lemmas to string
  sentence_out = ' '.join(sentence_out)
  # return the converted string
  return sentence_out

You could make the code even more concise with a `lambda` function.

In [None]:
preprocess_short = lambda x: ' '.join([token.lemma_.lower() for token in nlp(x)])

In [None]:
preprocess_short(sentence)

'.   and have almost resolve to go wffik , when -pron- hear the faint , dis tant scream of a locomotive and the sound recall -pron- to -pron- .   -pron- see that there be a man who be carry the lantern , and within a minute more -pron- have run up , almost out of breath , aud be brokenly tell -pron- story of what -pron- have see ahead , to which ha trackman listen in silence .'

We can refine the preprocessing by attaching a part-of-speech tag to each lemma. Below I show the "long version", to make clear what is going on at each step, but you could rewrite the whole function as a one-line `lambda`!

In [None]:
def refined_preprocess(sentence: str = ''):
  """preprocessing function that takes a string as input
  perform lowercasing, tokenization, lemmatization, and p-o-s tagging
  and returns the converted sentence (in which each lemma is associated 
  with the p-o-s tag) as a string.
  Arguments:
    sentence (str): input sentence
  Returns:
    a converted sentence as a str object
  """
  # the nlp function performs tokenization under the hood
  sentence_nlp = nlp(sentence)
  # create an empty list in which you save processed tokens
  sentence_out = []
  # iterate over all tokens in the sentence_nlp object
  for token in sentence_nlp:
    # get the lemma and part-of-speech tag as tuple
    lemma_pos = (token.lemma_,token.pos_)
    # convert tuple to a string
    lemma_pos_str = '_'.join(lemma_pos)
    # lowercase the string
    lemma_pos_lower = lemma_pos_str.lower()
    # add lowercased string to sentence out list
    sentence_out.append(lemma_pos_lower)
  # convert the list of lemmas to string
  sentence_out = ' '.join(sentence_out)
  # return the converted string
  return sentence_out

In [None]:
refined_preprocess(sentence)

'._punct  _space and_cconj have_aux almost_adv resolve_verb to_part go_verb wffik_propn ,_punct when_adv -pron-_pron hear_verb the_det faint_adj ,_punct dis_propn tant_adj scream_noun of_adp a_det locomotive_adj and_cconj the_det sound_noun recall_verb -pron-_pron to_adp -pron-_pron ._punct  _space -pron-_pron see_verb that_sconj there_pron be_aux a_det man_noun who_pron be_aux carry_verb the_det lantern_noun ,_punct and_cconj within_adp a_det minute_noun more_adv -pron-_pron have_aux run_verb up_adp ,_punct almost_adv out_sconj of_adp breath_noun ,_punct aud_propn be_aux brokenly_adv tell_verb -pron-_det story_noun of_adp what_pron -pron-_pron have_aux see_verb ahead_adv ,_punct to_part which_det ha_propn trackman_propn listen_verb in_adp silence_noun ._punct'

Or


In [None]:
# the short version of the refined preprocess function
# it is concise, but is it still readable?
rfs = lambda x: ' '.join(['_'.join((t.lemma_,t.pos_)).lower() for t in nlp(x)])
rfs(sentence)

'._punct  _space and_cconj have_aux almost_adv resolve_verb to_part go_verb wffik_propn ,_punct when_adv -pron-_pron hear_verb the_det faint_adj ,_punct dis_propn tant_adj scream_noun of_adp a_det locomotive_adj and_cconj the_det sound_noun recall_verb -pron-_pron to_adp -pron-_pron ._punct  _space -pron-_pron see_verb that_sconj there_pron be_aux a_det man_noun who_pron be_aux carry_verb the_det lantern_noun ,_punct and_cconj within_adp a_det minute_noun more_adv -pron-_pron have_aux run_verb up_adp ,_punct almost_adv out_sconj of_adp breath_noun ,_punct aud_propn be_aux brokenly_adv tell_verb -pron-_det story_noun of_adp what_pron -pron-_pron have_aux see_verb ahead_adv ,_punct to_part which_det ha_propn trackman_propn listen_verb in_adp silence_noun ._punct'
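
The part-of-speech suffixes also make it possible to act on the earlier suggestion of discarding words of a particular part-of-speech. The cell below is a minimal sketch in plain Python: the helper `keep_pos` is hypothetical (it is not part of spaCy), and it assumes the `lemma_pos` format produced by `refined_preprocess`.

```python
def keep_pos(processed: str, keep=('adj', 'adv')) -> str:
    """Keep only the lemma_pos tokens whose part-of-speech tag is in `keep`.
    Arguments:
      processed (str): a sentence in which each token has the form lemma_pos
      keep (tuple): lowercased part-of-speech tags to retain
    Returns:
      the filtered sentence as a str object
    """
    kept = []
    for token in processed.split():
        # split on the last underscore: everything after it is the tag
        lemma, _, pos = token.rpartition('_')
        if pos in keep:
            kept.append(token)
    return ' '.join(kept)

example = 'the_det faint_adj ,_punct dis_propn tant_adj scream_noun of_adp a_det locomotive_adj'
print(keep_pos(example))  # faint_adj tant_adj locomotive_adj
```

Whether such a filter helps depends on the task, so it is best treated as one more setting to experiment with.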

Once the preprocessing steps are defined, we can simply use `.apply()` (remember?) to convert all sentences in the `TextSnippet` column. 

For sure we want to save the output. We, therefore, create a new column 'SentenceProcessed' in which we store the result of our text transformation.

You don't have to worry about the order, Pandas makes sure all sentences end up in the correct row and columns. Simply run the code below (it can take a few seconds, don't worry).

In [None]:
df['SentenceProcessed'] = df["TextSnippet"].apply(refined_preprocess)

Use the `.head()` method and, voila, there you have a new column with your processed data.

In [None]:
df.head()

Unnamed: 0,TextSnippet,MachineType,Date,Category,Humanness,Animacy,split,SentenceProcessed
0,". and had almost resolved to go Wffik, when h...",locomotive,1890,machine as a human,1,1,train,._punct _space and_cconj have_aux almost_adv ...
1,I once made an experiment of this kind on a ch...,machine,1887,machine is inanimate object without agency,0,0,train,-pron-_pron once_adv make_verb an_det experime...
2,"hot-house, the forced labour of the beast iu t...",machine,1836,other,0,0,train,"hot_adj -_punct house_noun ,_punct the_det for..."
3,"The next fifteen or twenty years may, therefor...",machines,1892,machine is inanimate object without agency,0,0,train,the_det next_adj fifteen_num or_cconj twenty_n...
4,THE LAST OF THE BARONS. 28 ray ; for Coniers ...,machinery,1895,human as a machine,0,1,train,the_det last_noun of_adp the_det barons_propn ...


# From text to matrices

## Documents as vectors
Unfortunately, computers find it hard to read texts. They like numbers more. We can't just feed them the tokens but have to transform each sentence to a **vector**.

A vector is just a list of numbers, such as [0, 10, 1, 15]. 

How to convert a text to a series of numbers is much debated. Below we show you the easiest and most common scenario: the **bag-of-words** approach.

This approach assumes that a document can be adequately represented by simply counting the words it contains. We represent the document numerically by collecting the **token frequencies**. For example, the code below converts a sentence to a vector of term frequencies.


In [None]:
from collections import Counter
fw = Counter(preprocess(sentence).split())
print(fw)

Counter({'-pron-': 7, ',': 6, '.': 3, 'and': 3, 'have': 3, 'to': 3, 'the': 3, 'of': 3, 'a': 3, 'be': 3, 'almost': 2, 'see': 2, 'resolve': 1, 'go': 1, 'wffik': 1, 'when': 1, 'hear': 1, 'faint': 1, 'dis': 1, 'tant': 1, 'scream': 1, 'locomotive': 1, 'sound': 1, 'recall': 1, 'that': 1, 'there': 1, 'man': 1, 'who': 1, 'carry': 1, 'lantern': 1, 'within': 1, 'minute': 1, 'more': 1, 'run': 1, 'up': 1, 'out': 1, 'breath': 1, 'aud': 1, 'brokenly': 1, 'tell': 1, 'story': 1, 'what': 1, 'ahead': 1, 'which': 1, 'ha': 1, 'trackman': 1, 'listen': 1, 'in': 1, 'silence': 1})


In [None]:
print(list(fw.values()))

[3, 3, 3, 2, 1, 3, 1, 1, 6, 1, 7, 1, 3, 1, 1, 1, 1, 3, 3, 1, 1, 1, 2, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


We can vectorize all documents, and construct a **document-term matrix**. A matrix is nothing more than a collection of individual vectors, stacked as rows on top of each other. 

Imagine our corpus consists of just two sentences: "I like food" and "Cats like like food".

Using the bag-of-words approach we can convert this corpus to the following document-term matrix.

In [None]:
pd.DataFrame([[1,0,1,1],[0,1,1,2]],
              columns=["i","cats","food","like"], 
              index=['i like food','cats like like food'])


Unnamed: 0,i,cats,food,like
i like food,1,0,1,1
cats like like food,0,1,1,2
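
The matrix above was typed in by hand. As a sketch of how it could be derived from the two sentences, the cell below counts the tokens with `Counter` and stacks the counts into rows. The variable names are made up, and the columns come out in alphabetical order, which differs from the hand-written table.

```python
from collections import Counter

corpus = ["i like food", "cats like like food"]

# count the tokens of each document separately
counts = [Counter(doc.split()) for doc in corpus]

# the vocabulary holds every distinct token, one column per token
vocabulary = sorted(set(token for c in counts for token in c))

# one row per document, one count per vocabulary entry (0 if absent)
matrix = [[c[token] for token in vocabulary] for c in counts]

print(vocabulary)  # ['cats', 'food', 'i', 'like']
print(matrix)      # [[0, 1, 1, 1], [1, 1, 0, 2]]
```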


  We can do the same for the sentences we stored in the `SentenceProcessed` column. And the good news is that you don't have to write much of the code, because `sklearn` has provided you with many tools that simplify this task a lot.

  The cells below show how to vectorize your documents and generate a document-term matrix from your corpus. 

  The `CountVectorizer` class will convert each document to a vector of its token frequencies, just as in the previous example. Load the `CountVectorizer` by running the cell below.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# inspect the documentation
?CountVectorizer

As you noticed, the `CountVectorizer` has many arguments. Later on, you can adjust them and see how changing these settings improves or harms the performance of the classifier.

We suggest having a closer look at:
- `min_df` and `max_df`: discard words based on their document frequency. Words that occur only once or twice probably won't be important for predicting the label of a document. Discarding more frequent words is trickier and depends on the task at hand (sometimes function words convey important information!)
- `ngram_range`: n-grams are chunks of n consecutive words. The bag-of-words approach largely ignores the order in which words appear. However, we retain some information on order by counting bigrams (or trigrams). For example,  a bigram model will contain the phrase "not sad" whereas a unigram model won't capture this negation (it counts "not" and "sad" separately).
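
To make these two settings more tangible, here is a minimal sketch in plain Python of what they amount to. The helper `ngrams` and the three toy sentences are made up for illustration; `CountVectorizer` performs comparable steps internally.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all chunks of n consecutive tokens, joined by a space."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

docs = ["he is not sad", "she is sad", "he is not happy"]

# collect the distinct unigrams and bigrams of each document
features_per_doc = []
for doc in docs:
    tokens = doc.split()
    features_per_doc.append(set(ngrams(tokens, 1) + ngrams(tokens, 2)))

# document frequency: in how many documents does a feature occur?
document_frequency = Counter(f for feats in features_per_doc for f in feats)

# keep only features that occur in at least min_df documents
min_df = 2
kept = sorted(f for f, n_docs in document_frequency.items() if n_docs >= min_df)

print(ngrams("he is not sad".split(), 2))  # ['he is', 'is not', 'not sad']
print(kept)  # ['he', 'he is', 'is', 'is not', 'not', 'sad']
```

Note that the bigram "not sad" is created but then discarded again, because it occurs in only one document. The two settings interact, which is why it pays to try different values.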

The code below converts our processed documents into a document-term matrix (more specifically, a sparse matrix, which only stores the non-zero counts).

We first create `vectorizer`, an instance of the `CountVectorizer` class, for which we specify several arguments.

In [None]:
vectorizer = CountVectorizer(min_df=5, 
                             max_df=0.9,
                             ngram_range=(1,2),
                             token_pattern=r"\S+")

What about the `token_pattern` argument, you might wonder? Well, since we already tokenized the data, the whitespaces effectively indicate word boundaries. A token is everything between two whitespaces (or the start and end of the string). This pattern is matched by the regular expression `\S+` (sequences of everything except whitespace).

You can check it for yourself, running the code below:

In [None]:
import re
pattern = re.compile(r"\S+")
print(pattern.findall(df.iloc[0].SentenceProcessed)[:10])

['._punct', '_space', 'and_cconj', 'have_aux', 'almost_adv', 'resolve_verb', 'to_part', 'go_verb', 'wffik_propn', ',_punct']


We can convert, as an example, the first hundred sentences using the `.fit_transform()` method. 

In [None]:
dtm = vectorizer.fit_transform(df.iloc[:100].SentenceProcessed)

Now, what does the `dtm` variable (an abbreviation for "document-term matrix") contain?

In [None]:
dtm.shape

(100, 325)

The `.shape` attribute returns the dimensions of the matrix. It has 100 rows (because we selected the first 100 sentences) and 325 columns. 

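Each column represents one feature. To inspect the features, use `.get_feature_names()` attached to the `CountVectorizer`. (In scikit-learn 1.0 and later this method is called `.get_feature_names_out()`; the older name was removed in version 1.2.)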

You see that the number of features corresponds to the number of columns in `dtm`.

In [None]:
len(vectorizer.get_feature_names())

325

The features are n-grams (of length 1 and 2) consisting of lemma_part-of-speech pairs.

In [None]:
print(vectorizer.get_feature_names()[100:110])

['be_aux so_adv', 'be_aux the_det', 'be_aux to_part', 'become_verb', 'before_adp', 'bell_noun', 'both_det', 'bring_verb', 'but_cconj', 'but_cconj -pron-_pron']


To inspect a document in vectorized form, we can select its row from `dtm`. The result is a sparse matrix, which stores only the non-zero counts.

In [None]:
dtm[0]

<1x325 sparse matrix of type '<class 'numpy.int64'>'
	with 38 stored elements in Compressed Sparse Row format>

The vector below is the numerical representation of the first sentence in our DataFrame, converted to a dense `numpy.array` with `.toarray()`. These numbers are what we feed to the training algorithm.

In [None]:
dtm[0].toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 6, 0, 1, 0, 0, 0, 0, 0,
        2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
        3, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 3, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 2, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 
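
Most entries in such a vector are zero, which is why a sparse format pays off. The cell below is a rough sketch of the idea using a made-up vector of ten counts. The dictionary is only illustrative; SciPy uses a more efficient array-based layout.

```python
# a hypothetical dense vector: one count per feature, mostly zeros
dense_vector = [0, 0, 1, 0, 0, 6, 0, 0, 2, 0]

# a sparse representation remembers only the positions of non-zero counts
sparse_vector = {index: count
                 for index, count in enumerate(dense_vector)
                 if count != 0}

print(sparse_vector)  # {2: 1, 5: 6, 8: 2}
print(len(sparse_vector), 'stored elements out of', len(dense_vector))
```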

# Training the model

At this point, you should understand how to read, preprocess and vectorize your corpus. Completing these steps allows you to finally train a text classification model.

## Creating a train and test set

As mentioned during the introduction, supervised learning consists of training and testing a model. We build a model with training data and subsequently evaluate how well it performs on unseen examples. 

Therefore, we first split our data into a train and test set. Luckily, we've done most of the work for you, by adding the `split` column.

The code below creates two variables (of the pandas.DataFrame type) containing the train and test sentences with their labels.

In [None]:
train = df[df.split=='train'] 
test = df[df.split=='test']

In total, we use 75% for training and 25% for testing.

In [None]:
print(train.shape,test.shape)
print(train.shape[0]/df.shape[0])

(295, 8) (98, 8)
0.7506361323155216
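
If your own data does not come with a `split` column, you have to create one yourself. The cell below sketches one possible way to do this in plain Python, by shuffling the row positions and assigning the first 75% to training. It is an assumption for illustration, not necessarily how the split in this dataset was made.

```python
import random

# hypothetical stand-ins for eight labelled sentences
sentences = [f"sentence {i}" for i in range(8)]

# shuffle the row positions with a fixed seed, so the split is reproducible
rng = random.Random(42)
indices = list(range(len(sentences)))
rng.shuffle(indices)

# the first 75% of the shuffled positions go to the training set
n_train = int(len(sentences) * 0.75)
train_idx = set(indices[:n_train])
split = ['train' if i in train_idx else 'test' for i in range(len(sentences))]

print(split.count('train'), split.count('test'))  # 6 2
```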


When we vectorize the data with `.fit_transform()`, we only look at the training examples.  

Please remember that the model is not allowed to see examples from the test set (otherwise you are cheating!). We won't touch the test examples until the very end of the classification process. 

To transform the training sentences, we create an instance of the `CountVectorizer` class and define how we'd like to transform the text by specifying arguments such as `min_df` and `ngram_range`.

Feel free to change these settings later on and see what happens.

In [None]:
vectorizer = CountVectorizer(min_df=5, # discard terms that occur in fewer than 5 documents
                             max_df=0.9,
                             ngram_range=(1,2),
                             token_pattern=r"\S+")

Below, we apply `.fit_transform()` on the processed sentences in our DataFrame. This returns a document-term matrix, which we store in `X_train`.

`y_train` contains the correct or actual label for each sentence (row) in `X_train`. These labels were obtained via human annotation.

In [None]:
X_train = vectorizer.fit_transform(train.SentenceProcessed)
y_train = train.Animacy

In [None]:
print(X_train.shape,y_train.shape)

(295, 982) (295,)


## Selecting an Algorithm 

We are almost there. Almost all ingredients are in place, except, probably, the most important one: the **learning algorithm**. 

We have to select the algorithm, that will allow us to learn the relation between features and labels. 

For this example, we selected a Naive Bayes classifier. Even though it is rather old, it is still often used in the Digital Humanities and provides a competitive baseline.

We won't have time to discuss the algorithm in detail. For those who are interested, the Naive Bayes algorithm adheres to the following formula:

![Naive Bayes Algorithm](https://wikimedia.org/api/rest_v1/media/math/render/svg/52bd0ca5938da89d7f9bf388dc7edcbd546c118e)

![Expansion of Naive Bayes Algorithm](https://wikimedia.org/api/rest_v1/media/math/render/svg/6150f41afac2076bad6e326ebbdb96fa9ee4ca82)

This may look complicated, but the math is rather straightforward. We compute the probability of a label `C_k` given the words `x` in a text (`P(C_k|x)`). By slightly manipulating Bayes' rule, this probability is proportional to the probability of `C_k` (how often the label occurs in the training set) multiplied by the product, over all words `x_i` in the text, of the probability of seeing `x_i` in documents with label `C_k` (`P(x_i|C_k)`).

For more information consult the [Wikipedia page](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) or the [NLTK handbook](https://www.nltk.org/book/ch06.html).
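
To make the formula concrete, here is a minimal sketch that computes these quantities by hand for four toy sentences. It uses add-one smoothing (`alpha = 1`), skips words that were never seen during training, and sums logarithms instead of multiplying small probabilities. All names are made up for illustration, and this is not how scikit-learn implements the classifier internally.

```python
import math
from collections import Counter, defaultdict

toy_train = [("i am happy happy", "Pos"),
             ("i am sad sad", "Neg"),
             ("he is not happy", "Neg"),
             ("that is not bad", "Pos")]
alpha = 1  # smoothing: pretend we saw every word once more in each class

# prior: how many documents carry each label
class_docs = Counter(label for _, label in toy_train)

# likelihood: how often does each word occur within each class
word_counts = defaultdict(Counter)
for text, label in toy_train:
    word_counts[label].update(text.split())
vocabulary = {word for c in word_counts.values() for word in c}

def log_posterior(text, label):
    """Return log P(C_k) + sum of log P(x_i | C_k), up to a constant."""
    score = math.log(class_docs[label] / len(toy_train))
    total = sum(word_counts[label].values())
    for word in text.split():
        if word not in vocabulary:
            continue  # ignore words never seen during training
        score += math.log((word_counts[label][word] + alpha)
                          / (total + alpha * len(vocabulary)))
    return score

def predict(text):
    """Return the label with the highest score."""
    return max(class_docs, key=lambda label: log_posterior(text, label))

print(predict("happy happy"), predict("sad"))  # Pos Neg
```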

There are of course more complicated models, but it's good to give the Naive Bayes classifier a try. It often yields good results and is more transparent than other models (less of a black box).

In [None]:
# import the MultinomialNB class
from sklearn.naive_bayes import MultinomialNB

After instantiating the model, we call the `.fit()` method. This computes the class probabilities (prior) and conditional probabilities of the words (likelihood). 

In [None]:
clf = MultinomialNB(alpha=1)
clf.fit(X_train,y_train)

MultinomialNB(alpha=1, class_prior=None, fit_prior=True)

You can inspect the conditional probabilities, which are stored (as logarithms) in the `.feature_log_prob_` attribute of the variable `clf`.

The shape of this matrix is (2,982) as there are two classes and 982 different features. 

In [None]:
clf.feature_log_prob_.shape

(2, 982)

In [None]:
X_train.shape

(295, 982)

Below we retrieve the log conditional probabilities `log P(x_i | C_k)` for the noun "labour", and see that it slightly favours the inanimate class (label 0), which has the higher of the two values.

In [None]:
# from scikit-learn 1.0 onwards this method is called get_feature_names_out()
vectorizer.get_feature_names()[500]

'labour_noun'

In [None]:
clf.feature_log_prob_[:,500]

array([-7.59931787, -7.72753511])
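Log-probabilities are hard to read at a glance. As a small sketch, we can exponentiate the two values printed above to get ordinary probabilities and their ratio. The numbers are copied by hand from the output; the variable names are made up for this example.

```python
import math

# values copied from the output above: [class 0, class 1]
log_probs = [-7.59931787, -7.72753511]

# exponentiate to get back to ordinary probabilities
probs = [math.exp(lp) for lp in log_probs]

# how much more likely is the feature under class 0 than under class 1?
ratio = math.exp(log_probs[0] - log_probs[1])

print(probs)
print(round(ratio, 2))
```

The ratio is only a little above 1, which is what "slightly favours" means here: seeing this feature barely shifts the prediction.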

# Evaluating the model




## Out of sample accuracy

We have trained the model and inspected some of its inner workings. But the most important question remains unanswered: how well does it perform in recognizing animacy in text? 

To answer this question, we gauge the model's accuracy on **examples that it hasn't seen yet** (these examples were not observed during training, i.e. when computing the label priors and conditional probabilities).

Before we do this, however, we have to convert the test sentences (which we've set aside earlier) in exactly the **same way as we processed the training examples**. In other words: we have to create a new document-term matrix for the test set, using the same procedure for vectorization.

Luckily, this is easy with Python's scikit-learn library. We can just reuse the vectorizer we fitted earlier. Instead of `.fit_transform()` we just apply `.transform()` to the sentences in the `SentenceProcessed` column.

We also create a new array in which we store the actual labels.

In [None]:
# transform processed sentences to a document term matrix
X_test = vectorizer.transform(test.SentenceProcessed)
# create an array with all the labels of the test examples
y_test = test.Animacy
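The difference between `.fit_transform()` and `.transform()` is easiest to see on a toy example. The sketch below uses three invented sentences and a default `CountVectorizer` (not the settings used in this notebook).

```python
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical toy documents
toy_train_docs = ["the machine works", "the worker speaks"]
toy_test_docs = ["the engine speaks"]

toy_vectorizer = CountVectorizer()
# fit_transform learns the vocabulary AND builds the matrix
toy_X_train = toy_vectorizer.fit_transform(toy_train_docs)
# transform only builds the matrix, reusing the learned vocabulary
toy_X_test = toy_vectorizer.transform(toy_test_docs)

print(sorted(toy_vectorizer.vocabulary_))
print(toy_X_train.shape, toy_X_test.shape)
print(toy_X_test.toarray())
```

Both matrices have the same number of columns. The word "engine" was never seen during fitting, so it is silently dropped from the test document.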

Next, we apply the model (which we fitted during training) to the test set. The `.predict()` method is all you need! It returns an array with the predictions for each sentence (which we save in the `pred` variable). 

In [None]:
pred = clf.predict(X_test)

Below we print the first ten predictions, and compare them with the actual labels.

In [None]:
print('Predictions=',pred[:10])
print('Actual labels=',y_test[:10].values)

Predictions= [0 1 0 0 0 0 0 0 0 0]
Actual labels= [0 0 0 0 0 0 0 0 0 0]


Not bad! The model was only wrong once, on the second sentence, where it predicted animate, while in fact the sentence was annotated as inanimate (in the literature this is called a false positive). 

Just looking at these predictions doesn't get us far. Luckily, there are established metrics that estimate the performance of the model. The most common measure is **accuracy**, which is simply the number of correct predictions divided by the total number of predictions. 

You may also encounter the **error rate**, which is simply 1 - accuracy.
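As a quick sketch, both numbers can be computed by hand for the ten predictions printed above (copied here as plain lists; the variable names are made up for this example).

```python
# the ten predictions and actual labels printed above
pred_10 = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
true_10 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# count the positions where prediction and label agree
n_correct = sum(p == t for p, t in zip(pred_10, true_10))

accuracy = n_correct / len(true_10)
error_rate = 1 - accuracy
print(accuracy, round(error_rate, 2))
```

On the full test set, `accuracy_score` from `sklearn.metrics` performs the same computation.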

Other commonly used metrics are precision, recall and f1-score. We won't discuss them here, but please inspect their Wikipedia pages.
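For the curious, the sketch below computes the three metrics by hand for the animate class. The eight labels and predictions are invented for illustration.

```python
# hypothetical labels for eight sentences: 1 = animate, 0 = inanimate
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 1, 1, 0, 0, 0]

pairs = list(zip(y_true_toy, y_pred_toy))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # animate, predicted animate
fp = sum(t == 0 and p == 1 for t, p in pairs)  # inanimate, predicted animate
fn = sum(t == 1 and p == 0 for t, p in pairs)  # animate, predicted inanimate

# precision: which share of the predicted animate sentences is really animate?
precision = tp / (tp + fp)
# recall: which share of the animate sentences did the model find?
recall = tp / (tp + fn)
# f1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```

Precision and recall answer different questions, which is why it matters which of the two arrays holds the actual labels.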

Sklearn provides us with a convenient function, `classification_report`, that returns a summary of the output with all these metrics. It only expects the predictions and actual labels as arguments.

Below we print the classification report, and observe that we obtain close to 80% accuracy!

Not bad? Can you do better? Please, scroll down if you want to play with other models.

In [None]:
from sklearn.metrics import classification_report
# the actual labels come first, then the predictions
print(classification_report(y_test,pred))

              precision    recall  f1-score   support

           0       0.78      0.83      0.80        52
           1       0.79      0.74      0.76        46

    accuracy                           0.79        98
   macro avg       0.79      0.78      0.78        98
weighted avg       0.79      0.79      0.79        98



## Classifying other examples

After training you can deploy the classifier and apply it to any sentence. The only condition is that all texts should be processed and vectorized in the same way as the training data. 

Fortunately, given that we already wrote all these functions and trained all the models, this is a rather easy task.

If you want to experiment yourself, you can easily change the string assigned to `sentence_new`.

In [None]:
sentence_new = 'The machine was a very smart, it wrote many books and spoke like a philosopher.'
# process sentence
sentence_new_proc = refined_preprocess(sentence_new)
print(sentence_new_proc)


the_det machine_noun be_aux a_det very_adv smart_adj ,_punct -pron-_pron write_verb many_adj book_noun and_cconj speak_verb like_sconj a_det philosopher_noun ._punct


After preprocessing the example sentence (each token is now a lemma_part-of-speech pair), we can vectorize it using the `transform()` method attached to the `vectorizer` fitted on the training data. This method expects a list of documents, which is why we put the sentence between square brackets.

You'll observe that the new document-term matrix has exactly the same number of columns as `X_train`. If these dimensions are different, you've done something wrong and the following steps will raise an error.

In [None]:
X_new = vectorizer.transform([sentence_new_proc])
print(X_new.shape)

(1, 982)


Now we apply `.predict()` to the vectorized sentence, and, wow, it's correct! The classifier did its work properly.

For sure, this model is far from perfect. Experiment with other examples and try to understand in which scenario it works, and when it fails.

In [None]:
clf.predict(X_new)[0]

1
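`.predict()` only returns the winning label. If you also want to know how confident the model is, `MultinomialNB` offers `.predict_proba()`. The sketch below shows this on an invented four-sentence corpus with a default `CountVectorizer`; it does not reproduce the preprocessing used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# hypothetical toy data: 1 = animate, 0 = inanimate
toy_docs = ["the worker speaks", "the soldier obeys",
            "the water boils", "the coal burns"]
toy_labels = [1, 1, 0, 0]

toy_vec = CountVectorizer()
toy_clf = MultinomialNB(alpha=1)
toy_clf.fit(toy_vec.fit_transform(toy_docs), toy_labels)

# vectorize the new sentence with the fitted vectorizer, then ask for probabilities
toy_X_new = toy_vec.transform(["the soldier speaks"])
proba = toy_clf.predict_proba(toy_X_new)[0]

# classes_ gives the label that belongs to each probability
for label, p in zip(toy_clf.classes_, proba):
    print(label, round(p, 3))
```

A probability close to 0.5 is a hint that the model is guessing, which is useful to know when you explore where it fails.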

## Inspecting the model

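Lastly, we can interrogate the model itself more systematically, something we already touched on when inspecting the conditional probabilities. Don't worry if the code below is hard to follow; you can still run it.

It finds and prints, for each of the two classes, the 20 features with the lowest conditional probability. In other words: it returns the expressions the model least expects to see in that class, and which therefore count as evidence for the other class. This is why the first list (for the inanimate class) contains words such as `talk_verb` and `teach_verb`, while the second (for the animate class) contains `water_noun` and `coal_noun`.

In [None]:
import numpy as np

# argsort orders the feature indices from lowest to highest log-probability
neg_class_prob_sorted = clf.feature_log_prob_[0, :].argsort()
pos_class_prob_sorted = clf.feature_log_prob_[1, :].argsort()

# print the 20 lowest-ranked features for class 0 (inanimate), then for class 1 (animate)
# (from scikit-learn 1.0 onwards, use get_feature_names_out() instead)
print(np.take(vectorizer.get_feature_names(), neg_class_prob_sorted[:20]))
print(np.take(vectorizer.get_feature_names(), pos_class_prob_sorted[:20]))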

['work_verb with_adp' 'again_adv ,_punct' 'yet_cconj' 'direct_verb'
 'teach_verb' 'money_noun' 'not_part have_aux' 'real_adj' 'thin_adj'
 'cold_adj' 'obey_verb' 'really_adv' ';_punct that_sconj'
 'know_verb ,_punct' 'talk_verb' '-pron-_pron make_verb'
 ',_punct not_part' '-pron-_pron as_sconj' 'a_det mere_adj'
 'soldier_noun ,_punct']
['supply_verb' 'the_det fire_noun' 'extent_noun ,_punct' 'extent_noun'
 'prove_verb' 'water_noun' 'difficulty_noun' 'cost_noun' 'space_noun'
 'expense_noun' 'manufacture_verb' 'several_adj' 'the_det water_noun'
 'of_adp machinery_noun' 'apparatus_noun ,_punct' 'apparatus_noun'
 'supply_noun' 'surface_noun' 'work_noun ,_punct' 'coal_noun']
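Another way to look at the same information, sketched here under our own assumptions, is to subtract the two rows of `feature_log_prob_`. A large positive difference marks a feature that is much more likely in animate sentences, a large negative one a feature typical of inanimate sentences. The toy corpus and variable names are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# hypothetical toy data: 1 = animate, 0 = inanimate
toy_docs = ["the worker speaks", "the soldier obeys",
            "the water boils", "the coal burns"]
toy_labels = [1, 1, 0, 0]

toy_vec = CountVectorizer()
toy_clf = MultinomialNB(alpha=1).fit(toy_vec.fit_transform(toy_docs), toy_labels)

# feature names ordered by their column index in the document-term matrix
features = np.array(sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get))

# log P(x_i | animate) - log P(x_i | inanimate)
log_ratio = toy_clf.feature_log_prob_[1] - toy_clf.feature_log_prob_[0]
order = log_ratio.argsort()

print("most inanimate:", features[order[:3]])
print("most animate:", features[order[::-1][:3]])
```

A word such as "the", which is equally frequent in both classes, gets a difference of zero and never shows up at either end of the ranking.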


## Experimenting with other models

In [None]:
from sklearn.svm import SVC
# a support vector machine with class weights balanced by label frequency
clf = SVC(C=1,kernel='rbf',class_weight='balanced')
clf.fit(X_train,y_train)
pred = clf.predict(X_test)
# the actual labels come first, then the predictions
print(classification_report(y_test,pred))

              precision    recall  f1-score   support

           0       0.76      0.73      0.75        52
           1       0.71      0.74      0.72        46

    accuracy                           0.73        98
   macro avg       0.73      0.73      0.73        98
weighted avg       0.74      0.73      0.73        98



In [None]:
from sklearn.ensemble import RandomForestClassifier
# no random_state is set, so the scores vary slightly between runs
clf = RandomForestClassifier()
clf.fit(X_train,y_train)
pred = clf.predict(X_test)
# the actual labels come first, then the predictions
print(classification_report(y_test,pred))

              precision    recall  f1-score   support

           0       0.76      0.81      0.79        52
           1       0.77      0.72      0.74        46

    accuracy                           0.77        98
   macro avg       0.77      0.76      0.76        98
weighted avg       0.77      0.77      0.76        98
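If you want to try several models in one go, a simple loop saves some typing. The sketch below is one possible way to do this; it uses an invented toy corpus and a default `CountVectorizer`, so the scores say nothing about the animacy data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# hypothetical toy data: 1 = animate, 0 = inanimate
toy_train_docs = ["the worker speaks", "the soldier obeys",
                  "the water boils", "the coal burns"]
toy_y_train = [1, 1, 0, 0]
toy_test_docs = ["the soldier speaks", "the water burns"]
toy_y_test = [1, 0]

toy_vec = CountVectorizer()
toy_X_train = toy_vec.fit_transform(toy_train_docs)
toy_X_test = toy_vec.transform(toy_test_docs)

# candidate models, keyed by a readable name
models = {
    "naive_bayes": MultinomialNB(alpha=1),
    "svm": SVC(C=1, kernel="rbf", class_weight="balanced"),
    "random_forest": RandomForestClassifier(random_state=0),
}

# fit every model on the same training matrix and score it on the same test matrix
scores = {}
for name, model in models.items():
    model.fit(toy_X_train, toy_y_train)
    scores[name] = accuracy_score(toy_y_test, model.predict(toy_X_test))

for name, score in scores.items():
    print(name, score)
```

Setting `random_state` for the random forest makes its result repeatable between runs.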



# Putting everything together

The code cells below put each step together into one pipeline.

In [None]:
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

In [None]:
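# `nlp` is assumed to be the spaCy pipeline loaded earlier in this notebook;
# ps lemmatizes and lowercases every token of a text
ps = lambda x: ' '.join([t.lemma_.lower() for t in nlp(x)])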

In [None]:
df = pd.read_csv('playing_animacy_data.tsv',sep='\t',index_col=False)
df['SentenceProcessed'] = df.TextSnippet.apply(ps)

In [None]:
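# caution: the names are crossed here, so `train` holds the rows marked 'test'
# (98 sentences) and `test` holds the rows marked 'train' (295 sentences)
train = df[df.split=='test']
test = df[df.split=='train']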

In [None]:
vectorizer = CountVectorizer(min_df=5, 
                             max_df=0.9,
                             ngram_range=(1,3),
                             token_pattern=r"\S+")

X_train = vectorizer.fit_transform(train.SentenceProcessed)
y_train = train.Animacy

X_test = vectorizer.transform(test.SentenceProcessed)
y_test = test.Animacy

In [None]:
print(X_train.shape,X_test.shape)

(98, 321) (295, 321)


In [None]:
clf = MultinomialNB(alpha=1)
clf.fit(X_train,y_train)

MultinomialNB(alpha=1, class_prior=None, fit_prior=True)

In [None]:
pred = clf.predict(X_test)
# the actual labels come first, then the predictions
print(classification_report(y_test,pred))

              precision    recall  f1-score   support

           0       0.72      0.73      0.73       158
           1       0.69      0.68      0.68       137

    accuracy                           0.71       295
   macro avg       0.71      0.71      0.71       295
weighted avg       0.71      0.71      0.71       295
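scikit-learn can also chain the vectorizer and the classifier into a single object with `make_pipeline`. The sketch below shows the idea on an invented toy corpus; it is an alternative way of organising the steps, not the setup used in the cells above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# hypothetical toy data: 1 = animate, 0 = inanimate
toy_docs = ["the worker speaks", "the soldier obeys",
            "the water boils", "the coal burns"]
toy_labels = [1, 1, 0, 0]

# the pipeline vectorizes the texts and passes the matrix on to the classifier
toy_pipeline = make_pipeline(
    CountVectorizer(token_pattern=r"\S+"),
    MultinomialNB(alpha=1),
)

# fit and predict now work directly on raw (already preprocessed) strings
toy_pipeline.fit(toy_docs, toy_labels)
toy_prediction = toy_pipeline.predict(["the soldier speaks"])
print(toy_prediction)
```

Because the fitted vectorizer travels with the model, you cannot forget to call `.transform()` on new sentences.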



# Fin.