In [None]:
# Supervised Learning


This notebook provides a concise introduction to supervised learning in Python.

You'll learn:
- the basic components of a supervised classification pipeline
- how to load and preprocess your data
- how to vectorize data

Supervised learning is probably the most common type of machine learning. In this scenario, **we want to "teach" a machine to learn from labelled examples**.

When applied to textual data, we want a computer to learn classification, based on a set of labelled examples.

Imagine your corpus looks like this:

```python
train_corpus = [["I am happy happy !", "Pos"],
                ["I am sad sad", "Neg"],
                ["he is not happy", "Neg"],
                ["that is not bad", "Pos"],
                ["urrrrrggghh :'(", "Neg"],
                ["this is AWESOME ! ! !", "Pos"]]
```

In [None]:
With these data—a small set of examples, with very short exclamations, and labels ('Neg' and 'Pos')—we want to teach the computer to recognize emotion in texts (a task commonly referred to as **emotion mining**). 

To build an emotion classifier, we apply an algorithm that learns the relation between the content of a document (think of words) and the label. This step is called **training** or **fitting** the model. 

The goal of training is to detect patterns in the documents that "betray" the label. Such a pattern is commonly referred to as the **signal** (words like "happy" and "sad"). Other words, which don't convey emotion, are **noise** ("he", "I").

Machine learning models are engineered to efficiently distinguish signal from noise. In the above example, the model will learn that the word "happy" corresponds with the `'Pos'` label, while "sad" is associated with `'Neg'`.
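
To make the idea of signal and noise concrete, one can count how often each word occurs under each label. The sketch below does this for the toy corpus with Python's `collections.Counter`. The whitespace tokenization and the name `word_counts` are simplifying assumptions for illustration; a real classifier does considerably more than count.

```python
from collections import Counter, defaultdict

train_corpus = [["I am happy happy !", "Pos"],
                ["I am sad sad", "Neg"],
                ["he is not happy", "Neg"],
                ["that is not bad", "Pos"],
                ["urrrrrggghh :'(", "Neg"],
                ["this is AWESOME ! ! !", "Pos"]]

# one Counter per label, filled with lowercased, whitespace-split tokens
word_counts = defaultdict(Counter)
for text, label in train_corpus:
    word_counts[label].update(text.lower().split())

# compare a few words across the two labels
for word in ["happy", "sad", "i"]:
    print(word, "Pos:", word_counts["Pos"][word], "Neg:", word_counts["Neg"][word])
```

"sad" only occurs under `'Neg'`, while "i" occurs equally often under both labels and therefore carries no information. Note that "happy" also shows up once under `'Neg'` (in "he is not happy"), which hints at why negation is hard for simple word-based models.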

Training on labelled data creates a text classification model. This model is able to **predict the label** of a document given a text. The variable `clf` refers to a classification model. 

```python
clf.predict(['Haha , he is happy !'])
```

Hopefully, it returns the label 'Pos' (if the model is properly fitted).
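
How `clf` comes into existence is not shown at this point. As a minimal sketch, assuming scikit-learn is installed, a bag-of-words vectorizer combined with a Naive Bayes classifier could be fitted on the toy corpus as follows. The choice of `CountVectorizer` and `MultinomialNB` is an assumption made for illustration, not necessarily the model used later in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_corpus = [["I am happy happy !", "Pos"],
                ["I am sad sad", "Neg"],
                ["he is not happy", "Neg"],
                ["that is not bad", "Pos"],
                ["urrrrrggghh :'(", "Neg"],
                ["this is AWESOME ! ! !", "Pos"]]

# separate the texts from their labels
texts = [text for text, label in train_corpus]
labels = [label for text, label in train_corpus]

# the pipeline first turns each text into word counts, then fits the classifier
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)

# predict() expects a list (or other iterable) of documents
prediction = clf.predict(["Haha , he is happy !"])
print(prediction)
```

With only six training sentences the prediction is far from guaranteed to be `'Pos'`, which is exactly why a model needs to be evaluated on unseen examples.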

To establish whether our model works, we set aside a small sample of our labelled data for testing purposes: this sample is called the **test set**. We want to know how well our model performs on examples it hasn't seen during training, to determine its **out-of-sample accuracy**.

```python
test_corpus_text = ['he is not sad',
                    'the dog is happy',
                    'the puppy is sad']

test_corpus_labels = ['Pos', 'Pos', 'Neg']

# predict a label for every sentence in the test set
pred = clf.predict(test_corpus_text)

```

The variable `pred` contains the predicted labels:

```python
['Neg','Pos','Neg']
```

You can see that the model got the first sentence wrong: it saw 'sad' and probably missed that it was preceded by a negation ('not'). So confusing!

Now we can compute the out-of-sample **accuracy** on the test set by comparing the actual labels (`test_corpus_labels`) with the predictions (`pred`). The model was correct in two of the three cases, meaning it has an accuracy of 2/3 ≈ 66.7%.
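
Rather than counting by hand, the comparison can be written out in a few lines of plain Python. This is a small sketch using the labels and predictions listed above; the names `n_correct` and `accuracy` are introduced here for illustration.

```python
test_corpus_labels = ["Pos", "Pos", "Neg"]
pred = ["Neg", "Pos", "Neg"]

# count the positions where the predicted label equals the actual label
n_correct = sum(actual == predicted
                for actual, predicted in zip(test_corpus_labels, pred))

# accuracy is the share of correct predictions
accuracy = n_correct / len(test_corpus_labels)
print(round(accuracy * 100, 1))  # 66.7
```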



In [None]:
round(2/3*100,1)

In [None]:
# Opening data




In [None]:
# Preprocessing text data 

### Why preprocessing?

In supervised text classification, we want a model to find textual **patterns** that are predictive of a document's label, i.e., we want the algorithm to learn how the tokens in a document correspond with the label.

While machine learning models are generally good at recognizing such patterns in your data, **they cannot do all the work for you**.

The way you "feed" the data to the model has an impact on how well it will perform (i.e. predict the correct label given the (transformed) content of a document).

Each task is different, and you need to adjust the preprocessing steps to the concepts you want to detect in your data.

Ask yourself which aspects of the text help distinguish the target categories:

- **Capitals**: if names are an important feature, you don't want to lowercase your text. For emotion detection, however, the difference between "Hamburger" and "hamburger" is less relevant.
- **Parts-of-Speech**: emotion often resides in adjectives and adverbs, whereas nouns are indicative of topic. You could discard all words of a particular part-of-speech to remove "noise".
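
Both choices can be sketched in a few lines. The example below works on a sentence whose part-of-speech tags were assigned by hand (an assumption made to keep the example self-contained; in practice a tagger such as spaCy would supply them). The function name `keep_emotion_words` and the set `KEEP_POS` are hypothetical.

```python
# (token, part-of-speech) pairs, tagged by hand for illustration
tagged_sentence = [("The", "DET"), ("puppy", "NOUN"), ("is", "AUX"),
                   ("VERY", "ADV"), ("sad", "ADJ")]

# parts-of-speech assumed to carry emotion
KEEP_POS = {"ADJ", "ADV"}

def keep_emotion_words(tagged_tokens, keep_pos=KEEP_POS, lowercase=True):
    """Keep only tokens with a part-of-speech in keep_pos, optionally lowercased."""
    kept = [token for token, pos in tagged_tokens if pos in keep_pos]
    if lowercase:
        kept = [token.lower() for token in kept]
    return kept

print(keep_emotion_words(tagged_sentence))                   # ['very', 'sad']
print(keep_emotion_words(tagged_sentence, lowercase=False))  # ['VERY', 'sad']
```

Whether dropping nouns or capitals actually helps depends on the task, so it is worth treating each of these switches as an explicit, documented decision.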

In the end, what works well is often an empirical question, but you can't examine all possible scenarios. Working on the basis of **intuitions** and assumptions is valid, as long as you are explicit about them.

Machine learning is always influenced and manipulated by **human intervention**.

But without further ado, let's go ahead with preprocessing our data.

In [None]:
### Preprocessing texts with Pandas

In what follows we use the `.apply()` method (attached to the DataFrame object) to preprocess our sentences. This section builds on the previous presentation on spaCy.

`.apply()` applies (what's in a name!) a function (entered as an argument between the parentheses) to each cell in a column.

For example, we can convert all strings to lowercase using the `str.lower()` method.

Normally `.lower()` is applied to a string object as in the following code cell:

In [None]:
"CONVERT mE tO LoWERcAsE, PLEASE!".lower()

In [None]:
str.lower("CONVERT mE tO LoWERcAsE, PLEASE!")

In [None]:
df.TextSnippet.apply(str.lower)

In [None]:
Lowercasing is also available as a built-in method on the columns of the DataFrame (which are of type `pandas.Series`), via the `.str` accessor.

In [None]:
df.TextSnippet.str.lower()

In [None]:
You can use many more string methods. To see which ones are available, inspect the help and documentation functions provided by Python.
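
As a brief illustration, here are a few other methods of the `.str` accessor applied to a small, made-up `Series` (the example sentences and the variable name `snippets` are assumptions, not the notebook's data). The last lines list the public method names, which also works outside a notebook.

```python
import pandas as pd

# a toy Series standing in for a text column
snippets = pd.Series(["  I am HAPPY  ", "He is not sad", "This is AWESOME"])

lowered = snippets.str.lower()          # lowercase every string
stripped = snippets.str.strip()         # remove leading/trailing whitespace
has_not = snippets.str.contains("not")  # boolean Series: does the text contain "not"?
n_chars = snippets.str.len()            # number of characters per string

# list the public string methods available on the accessor
str_methods = [name for name in dir(snippets.str) if not name.startswith("_")]
print(str_methods[:10])
```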

In [None]:
?df.TextSnippet.str

In [None]:
Lowercasing text isn't enough. We'd like to use more of the NLP candy Mariona shared with us in a previous Notebook. This can be easily done by building a preprocessing function that combines various steps. 

Below we build the skeleton for such a function. It doesn't do anything yet, but shows how to document your code by using:
- [`typing`](https://docs.python.org/3/library/typing.html): type hints in the creation of the function 
- Docstring: a summary of what the function does, what it expects as input and returns as output
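
A skeleton along these lines might look as follows. This is a sketch: the name `preprocess_skeleton` is an assumption, and the body simply returns its input unchanged.

```python
def preprocess_skeleton(sentence: str = "") -> str:
    """Preprocess a sentence (placeholder: no steps implemented yet).

    Arguments:
      sentence (str): input sentence
    Returns:
      the sentence as a str object, for now unchanged
    """
    # preprocessing steps will be added here
    return sentence

print(preprocess_skeleton("Nothing HAPPENS to me yet."))
```

The type hints (`sentence: str`, `-> str`) tell the reader what goes in and what comes out, and `help(preprocess_skeleton)` displays the docstring.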

In [None]:
OK, let's add some more spaCy functionality. We want to:
- lowercase the sentence
- split it into tokens
- get the lemma of each token
- return the tokenized and lemmatized text as a string.

In [None]:
# load the spaCy library
import spacy
# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load("en_core_web_sm")

In [None]:
def preprocess(sentence: str = '') -> str:
  """Preprocessing function that takes a string as input,
  performs lowercasing, tokenization and lemmatization,
  and returns the converted sentence as a string.
  Arguments:
    sentence (str): input sentence
  Returns:
    a converted sentence as a str object
  """
  # calling nlp() performs tokenization under the hood
  sentence_nlp = nlp(sentence)
  # use a list comprehension to collect all lemmas in a list 
  # lowercase the lemma
  sentence = [token.lemma_.lower() for token in sentence_nlp]
  # convert the list of lemmas to string
  sentence = ' '.join(sentence)
  # return the converted string
  return sentence

In [None]:
Nice! To inspect the magic performed by this simple function, let's see how it handles the first sentence of our DataFrame.

In [None]:
sentence = str(df.iloc[0].TextSnippet)
print(sentence)

In [None]:
print(preprocess(sentence))

In [None]:
### Taking a closer look

If you are not familiar with Python, you may wonder why there is no `for` loop, as was the case in many of the earlier examples. We used a **list comprehension**, which has a more concise syntax and is usually faster. If the function is difficult to follow, the "extended edition" below writes out each step in more detail.


In [None]:
def preprocess_extended(sentence: str = '') -> str:
  """Preprocessing function that takes a string as input,
  performs lowercasing, tokenization and lemmatization,
  and returns the converted sentence as a string.
  Arguments:
    sentence (str): input sentence
  Returns:
    a converted sentence as a str object
  """
  # calling nlp() performs tokenization under the hood
  sentence_nlp = nlp(sentence)

  # create an empty list in which we save the lowercased lemmas
  sentence_out = []
  # iterate over all tokens
  for token in sentence_nlp:
    lemma = token.lemma_
    lemma_lower = lemma.lower()
    sentence_out.append(lemma_lower)
  
  # convert the list of lemmas to string
  sentence_out = ' '.join(sentence_out)
  # return the converted string
  return sentence_out

In [None]:
Conversely, you could make the code even more concise with a `lambda` function.

In [None]:
preprocess_short = lambda x: ' '.join([token.lemma_.lower() for token in nlp(x)])

In [None]:
preprocess_short(sentence)

In [None]:
We can refine the preprocessing by attaching a part-of-speech tag to each lemma. Below I show the "long version", to make clear what is going on at each step, but you could rewrite the whole function as a one-line `lambda`!

In [None]:
def refined_preprocess(sentence: str = '') -> str:
  """Preprocessing function that takes a string as input,
  performs lowercasing, tokenization, lemmatization, and p-o-s tagging,
  and returns the converted sentence (in which each lemma is associated 
  with the p-o-s tag) as a string.
  Arguments:
    sentence (str): input sentence
  Returns:
    a converted sentence as a str object
  """
  # calling nlp() performs tokenization and tagging under the hood
  sentence_nlp = nlp(sentence)
  # create an empty list in which you save processed tokens
  sentence_out = []
  # iterate over all tokens in the sentence_nlp object
  for token in sentence_nlp:
    # get the lemma and part-of-speech tag as tuple
    lemma_pos = (token.lemma_,token.pos_)
    # convert tuple to a string
    lemma_pos_str = '_'.join(lemma_pos)
    # lowercase the string
    lemma_pos_lower = lemma_pos_str.lower()
    # add lowercased string to sentence out list
    sentence_out.append(lemma_pos_lower)
  # convert the list of lemmas to string
  sentence_out = ' '.join(sentence_out)
  # return the converted string
  return sentence_out

In [None]:
refined_preprocess(sentence)

In [None]:
# the short version of the refined preprocess function
# it is concise, but is it still readable?
rfs = lambda x: ' '.join(['_'.join((t.lemma_,t.pos_)).lower() for t in nlp(x)])
rfs(sentence)

In [None]:
df['SentenceProcessed'] = df["TextSnippet"].apply(refined_preprocess)

In [None]:
df.head()