<a href="https://colab.research.google.com/github/nicolaiberk/llm_ws/blob/main/notebooks/03_sml.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install spacy
!pip install nltk
!pip install eli5
import spacy
nlp = spacy.load('en_core_web_sm')


# An Introduction to Supervised Learning with `scikit-learn`

Classifications can take many forms. Today, we will train a simple binary sentiment classifier, using a subset of 10,000 [Amazon Reviews provided as part of a Kaggle competition](https://www.kaggle.com/datasets/bittlingmayer/amazonreviews?resource=download). We will use the amazing `scikit-learn` package to transform the data and train our classifier.

## A Minimal Example

We first load some basic packages and the data:

In [None]:
# fundamental packages
import numpy as np
import pandas as pd

# load some data to train our classifier on
reviews = pd.read_csv("https://www.dropbox.com/scl/fi/y1fzhtdkw8m3swkxb9gif/sub_sample.csv?rlkey=ssaut1n6dua1cihgwww9bxnrm&dl=1")
reviews["bin_label"] = reviews.label == "good"


In [None]:
reviews.shape

In [None]:
reviews.head()

The data has a simple structure, with 10,000 observations and two variables/columns, "label" and "text". The label is either "good" or "bad". We added a binary version of the label as a third variable. Our task is now to train a classifier that can tell these two labels apart, based on the text of the review. For that, we need some tools!

### A Quick Intro to `scikit-learn`

In [None]:
import IPython
url = 'https://scikit-learn.org/stable/'
iframe = '<iframe src=' + url + ' width=1600 height=350></iframe>'
IPython.display.HTML(iframe)

`scikit-learn` is an amazing package, catering to pretty much every need of a data scientist. **In order to train a classifier, we need a model that we can train and a vectorizer to transform our data**; that's pretty much it. `scikit-learn` offers much more (please go check it out already!), like a function to split our data into training and test sets, a pipeline to bind the steps together, and functions to compute our metrics. **We load all of this below**:

In [None]:
# load relevant tools

## A model (choose from API)
from sklearn.linear_model import LogisticRegression as LogReg

## A vectorizer to transform our text into numbers
from sklearn.feature_extraction.text import TfidfVectorizer

## A function to split our data into train and test set
from sklearn.model_selection import train_test_split

## A pipeline to put it all together, and a few functions to compute how well our classifier performs
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score

Let's split our data into train and test set:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    reviews.text, reviews.bin_label, test_size=0.33, random_state=42)

Now we need literally two lines of code to train the classifier.

In [None]:
pipe = Pipeline([('Tfidf', TfidfVectorizer()), ('LogReg', LogReg())])
pipe.fit(X_train, y_train)

![](https://media.giphy.com/media/zXMRfbsHOAire/giphy.gif)

Don't believe me? Check for yourself:

In [None]:
pipe.predict(["This is a great movie",
              "Never hated something as much as this movie"])

It does predict our examples well, but how good is the accuracy?

In [None]:
y_pred = pipe.predict(X_test)
pd.crosstab(y_test, y_pred)

In [None]:
## define a custom function to report metrics
def accuracy_report(y_test, y_pred):
  print("Accuracy: ",  round(accuracy_score(y_test, y_pred), 3))
  print("Recall: ",    round(recall_score(y_test, y_pred), 3))
  print("Precision: ", round(precision_score(y_test, y_pred), 3))
  print("F1: ",        round(f1_score(y_test, y_pred), 3))

accuracy_report(y_test, y_pred)

Pretty good, huh? Let's see how this works in more detail!
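
Before opening up the classifier, it helps to see where the four numbers in the report come from. The sketch below recomputes them by hand from the cells of a confusion matrix. The labels `y_true` and `y_hat` are made up for illustration and are not taken from the review data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# made-up labels for illustration only (True = "good" review)
y_true = np.array([True, True, True, True, False, False, False, False, False, False])
y_hat  = np.array([True, True, True, False, True, False, False, False, False, False])

tp = np.sum(y_true & y_hat)     # good reviews predicted as good
fp = np.sum(~y_true & y_hat)    # bad reviews predicted as good
fn = np.sum(y_true & ~y_hat)    # good reviews predicted as bad
tn = np.sum(~y_true & ~y_hat)   # bad reviews predicted as bad

accuracy  = (tp + tn) / len(y_true)   # share of all predictions that are correct
precision = tp / (tp + fp)            # share of predicted "good" that really are good
recall    = tp / (tp + fn)            # share of real "good" that we found
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("by hand:", accuracy, precision, recall, f1)
print("sklearn:", accuracy_score(y_true, y_hat), precision_score(y_true, y_hat),
      recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```

Precision and recall are computed for the positive class (`True`) here, which is also what `precision_score()` and `recall_score()` do by default for binary labels.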

## Under the Hood

Let's show this based on a very simple example. We generate a set of example texts that are positive or negative reviews and check what the classifier does:

In [None]:
example_revs = ["This is a great, great movie",
                "This is a horrible movie",
                "Waste of time",
                "Beautiful"]

example_y = [True, False, False, True]

### Vectorization

We choose a vectorizer for our text [from `scikit-learn`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text) and assign it to an object so we can fit it:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vec  = CountVectorizer()

Then, we fit it to our example reviews and transform the text into numbers:

In [None]:
sparse_mtrx = vec.fit_transform(example_revs)
print(vec.get_feature_names_out(), "\n", sparse_mtrx.toarray())

We can see that the vectorizer simply counts the occurrences of each word in each text. By default, it converts all words to lowercase and ignores punctuation and single-character tokens (which is why 'a' does not get a column). Now we can use the `transform()` function to transform new texts into the same format. This is particularly important when we need to transform texts in the test set into a matrix based on the training set.

In [None]:
vec.transform(["This movie is not good."]).toarray()

We can see that some words ('not' and 'good') from this new text are not encoded: they did not occur in the texts the vectorizer was fitted on, so the document-term matrix has no column for them.

Vectorizers have many more features that can be used to preprocess the text. Below is an example.

In [None]:
vec = CountVectorizer(stop_words=["this", "is", "of"])
sparse_mtrx = vec.fit_transform(example_revs)

## use the command from above to print your transformed matrix
print(vec.get_feature_names_out(), "\n", sparse_mtrx.toarray())

Look up the arguments of the [`CountVectorizer()`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) and test it yourself.

### Fitting the Model

Now that we know how to convert text into numbers, we can fit a classifier to the data in order to predict observations in the test set (we use our initial data again).

In [None]:
## train-test-split
X_train, X_test, y_train, y_test = train_test_split(
  reviews.text, reviews.bin_label, test_size=0.33, random_state=42)

## vectorize data
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(ngram_range=(1,2))

X_train = vec.fit_transform(X_train)

## load a classifier of your choosing
## (with its default hinge loss, SGDClassifier fits a linear SVM via stochastic gradient descent)
from sklearn.linear_model import SGDClassifier as SVM
clsfr = SVM()

## fit
clsfr.fit(X_train, y_train)

Now we can assess performance the same way as before:

In [None]:
X_test  = vec.transform(X_test)
y_pred = clsfr.predict(X_test)

pd.crosstab(y_test, y_pred)

In [None]:
accuracy_report(y_test, y_pred)

## Improving your Model

### Preprocessing with [`spaCy`](https://spacy.io/)

Sometimes, you might want to pre-select features based on your classification problem. For example, when you are interested in the topic of a text, it might be sufficient to assess the nouns which are used, whereas other words might introduce mostly noise. Other tasks might require you to identify the object in a sentence or the organisation mentioned in a text. `spaCy` can identify these words through **part-of-speech tagging**, **dependency parsing**, and **named entity recognition**. However, `spaCy` can do much more. Their [website](https://course.spacy.io/en/) provides an entire course, from finding words to training a neural network.

In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm

#### Part-of-speech tagging

In [None]:
# Process a text
doc = nlp("Not the hero we deserve, but the hero we need.")

# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)

# what's PRON? get an explanation:
spacy.explain("PRON")

In [None]:
# retain only the nouns of a set of texts:
docs = ["May the force be with you.",
        "You're gonna need a bigger boat!",
        "Fly, you fools!",
        "And I will strike down upon thee with great vengeance and furious anger!",
        "You can't handle the truth!",
        "You take the blue pill, the story ends; you wake up in your bed and believe whatever you want to believe.",
        "I love the smell of napalm in the morning."]


for doc in nlp.pipe(docs):
  print([token.text for token in doc if token.pos_ == 'NOUN'])


Another package commonly used in text analysis is [`nltk`](https://www.nltk.org/). It offers functionality similar to `spaCy` (e.g. part-of-speech tagging) but with a slightly different implementation. Below, we show how to remove stopwords and stem with `nltk`.

#### Stopword removal

Many words that are constantly used in everyday language are usually not very informative about the content of text (see [Pennebaker 2011](http://secretlifeofpronouns.com/) for a contrasting perspective). These words are called 'stopwords' in NLP and usually considered clutter that could and should be removed.

Note however that preprocessing can heavily affect model results ([Denny and Spirling 2018](https://www.cambridge.org/core/journals/political-analysis/article/text-preprocessing-for-unsupervised-learning-why-it-matters-when-it-misleads-and-what-to-do-about-it/AA7D4DE0AA6AB208502515AE3EC6989E)). How to preprocess text in general is a decision that should be made based on careful consideration of the problem at hand ([Grimmer and Stewart 2013](https://www.cambridge.org/core/services/aop-cambridge-core/content/view/F7AAC8B2909441603FEB25C156448F20/S1047198700013401a.pdf)).

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
print(stopwords.words('english'))

In [None]:
from nltk.tokenize import word_tokenize
nltk.download('punkt_tab')
stop_words = set(stopwords.words('english'))

filtered_docs = []

for doc in docs:
  filtered_doc = " ".join([w for w in word_tokenize(doc) if w.lower() not in stop_words])
  filtered_docs.append(filtered_doc)

filtered_docs

#### Stemming

In [None]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')

stemmed_docs = []

for doc in docs:
  stemmed_doc = " ".join([stemmer.stem(w) for w in word_tokenize(doc)])
  stemmed_docs.append(stemmed_doc)

stemmed_docs

### Optimizing Model Fit and Avoiding Overfit with Cross-Validation

**Overfitting** describes the problem that a classifier might fit our data a little too well, in that it **does not describe general patterns** anymore, but instead incorporates some **idiosyncratic noise** from the training data. The visualisation below from the [`scikit-learn` guide on overfitting](https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html?highlight=crossvalidation#underfitting-vs-overfitting) tries to describe this problem.

![](https://scikit-learn.org/stable/_images/sphx_glr_plot_underfitting_overfitting_001.png)


Often, we might want to test models with different parameters and choose the one that fits best to solve our problem. This will generally improve the fit to our data but can also lead to overfitting. A particularly good way to deal with this is called **cross-validation**. In this process, the training data is divided into equally sized subsets, and several classifiers are trained, each predicting one subset after being fitted on all the others. This way, the influence of single observations is reduced, because each observation is in the held-out subset exactly once. More on this [here](https://scikit-learn.org/stable/modules/cross_validation.html) and [here](https://cssbook.net/chapter08.html#8_5_3).
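
The splitting itself can be sketched with `KFold` on ten made-up observations. The counter `times_validated` is only there to check that every observation is held out exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_X = np.arange(10).reshape(-1, 1)   # ten made-up observations
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

times_validated = np.zeros(len(toy_X), dtype=int)
for fold, (train_idx, val_idx) in enumerate(kfold.split(toy_X)):
    print(f"fold {fold}: train on {train_idx}, validate on {val_idx}")
    times_validated[val_idx] += 1

# each observation was part of the held-out subset exactly once
print(times_validated)
```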

`scikit-learn` contains many models like `LogisticRegressionCV()`, which implement this by default. However, using `GridSearchCV()`, we can optimise the parameters of any model. In order to optimise the parameters of our model, we first need to check which parameters *can* be tuned:

In [None]:
?LogReg

In `LogisticRegression()`, we can change the regularization technique with `penalty` (regularisation is a [process](https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c) decreasing the importance of unimportant features, which avoids overfitting when classifying with many features). We can also change the degree of regularization using `C`.
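
To get a feeling for these two parameters, the sketch below fits the same model with different settings on synthetic data from `make_classification` (not the review data). In `scikit-learn`, `C` is the inverse of the regularization strength, so smaller values shrink the coefficients more.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic data for illustration only
toy_X, toy_y = make_classification(n_samples=200, n_features=20, random_state=42)

norms = []
for C in [0.01, 1, 100]:
    model = LogisticRegression(C=C, solver="liblinear").fit(toy_X, toy_y)
    norms.append(np.linalg.norm(model.coef_))
    print(f"C={C}: length of the coefficient vector = {norms[-1]:.2f}")

# the l1 penalty additionally sets some coefficients to exactly zero
l1_model = LogisticRegression(C=0.1, penalty="l1", solver="liblinear").fit(toy_X, toy_y)
n_nonzero = int(np.sum(l1_model.coef_ != 0))
print("non-zero coefficients with l1:", n_nonzero, "of", toy_X.shape[1])
```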

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

# Set the parameters by cross-validation
tuned_parameters = [
    {"penalty": ["l1", "l2"], "C": [1, 10, 100, 1000]}
]

score = "f1"

clf = GridSearchCV(LogReg(solver = 'liblinear'), tuned_parameters, scoring="%s_macro" % score)
clf.fit(X_train, y_train)

print("Best parameters set found on development set:")
print(clf.best_params_)

Let's see if a classifier with these parameters outperforms the standard model:

In [None]:
clsfr_std = LogReg(solver = 'liblinear')
clsfr_new = LogReg(solver = 'liblinear', **clf.best_params_)  # use the parameters found by the grid search

clsfr_std.fit(X_train, y_train)
clsfr_new.fit(X_train, y_train)

y_pred_std = clsfr_std.predict(X_test)
y_pred_new = clsfr_new.predict(X_test)

accuracy_report(y_test, y_pred_std)

In [None]:
accuracy_report(y_test, y_pred_new)

We managed to improve our classifier!

![](https://media.giphy.com/media/a0h7sAqON67nO/giphy.gif)
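
One caveat about the grid search above: the texts were vectorized before cross-validation, so the vocabulary and the idf weights were learned from all training texts, including those held out in each fold. A possible alternative, sketched here on a made-up mini corpus, is to hand a whole `Pipeline` to `GridSearchCV()`. The vectorizer is then refitted within each fold, and its arguments can be tuned as well. The step names `Tfidf` and `LogReg` follow the pipeline from the minimal example, everything else is illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# made-up mini corpus, for illustration only
toy_texts = ["great movie", "loved it", "wonderful film",
             "really great", "great fun", "loved this film",
             "horrible movie", "hated it", "boring film",
             "really bad", "bad acting", "hated this film"]
toy_labels = [True] * 6 + [False] * 6

toy_pipe = Pipeline([("Tfidf", TfidfVectorizer()),
                     ("LogReg", LogisticRegression(solver="liblinear"))])

# parameters of a pipeline step are addressed as <step name>__<parameter>
toy_grid = {"Tfidf__ngram_range": [(1, 1), (1, 2)],
            "LogReg__C": [1, 10, 100]}

toy_search = GridSearchCV(toy_pipe, toy_grid, scoring="f1_macro", cv=3)
toy_search.fit(toy_texts, toy_labels)

print(toy_search.best_params_)
toy_preds = toy_search.predict(["great film", "boring movie"])
print(toy_preds)
```

With a corpus this small the selected parameters mean very little; the point is only the mechanics of tuning several steps at once.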

### BONUS: Feature Assessment with [`eli5`](https://eli5.readthedocs.io/en/latest/index.html)

`eli5` is a great package to understand how our classifier makes decisions. It has two main functions: `show_weights()` tells us which features are most predictive for the classification, and `show_prediction()` explains how each feature affects the prediction for a single example. This can be particularly useful for iterative feature selection, the exclusion of stopwords, etc.

In [None]:
import eli5
eli5.show_weights(clsfr, vec=vec, feature_names=vec.get_feature_names_out())

In [None]:
eli5.show_prediction(clsfr, reviews.text[400], vec=vec, feature_names=vec.get_feature_names_out())

# Now, it's your turn!

Together with your neighbor, you have 15 minutes to **design an algorithm that can detect happiness online**. The training data contains tweets with different emotions; your challenge is to find the happy ones. Apply what you have learned in the past hour (feel free to copy the code from the other script). The last two cells contain code to load the test data and assess your classifier's performance. The link to the test set will be sent to you after the fifteen minutes have passed.

## **The best F1 score wins!!!**

In [None]:
tweets = pd.read_csv("https://www.dropbox.com/scl/fi/q4knjtpx0cw15v55q61vr/train_balanced.csv?rlkey=87djtsvsfx5mb1rgvpy0jsjz8&dl=1")

# Assessment

In [None]:
from sklearn.metrics import f1_score, recall_score, precision_score, accuracy_score
def performance_report(y_test, y_pred):
  print("Accuracy: ",  round(accuracy_score(y_test, y_pred), 3))
  print("Recall: ",    round(recall_score(y_test, y_pred), 3))
  print("Precision: ", round(precision_score(y_test, y_pred), 3))
  print("F1: ",        round(f1_score(y_test, y_pred), 3))

In [None]:
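test = pd.read_csv("")  # paste the link to the test set here
X_test = vec.transform(test.content)  # `vec`: your fitted vectorizer
y_test = test.happiness
performance_report(y_test, classifier.predict(X_test))  # `classifier`: your fitted classifier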