<a href="https://colab.research.google.com/github/nicolaiberk/Imbalanced/blob/master/01_IntroSML_Solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
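!pip install spacy
!pip install nltk
!pip install eli5
# download the English model so that spacy.load() below can find it
!python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load('en_core_web_sm')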


# An Introduction to Supervised Learning with scikit-learn

Classification tasks can take many forms. Today, we will train a simple binary sentiment classifier, using a subset of 10,000 [Amazon Reviews provided as part of a Kaggle competition](https://www.kaggle.com/datasets/bittlingmayer/amazonreviews?resource=download). We will use the amazing `scikit-learn` package to transform the data and train our classifier.

## A Minimal Example

We first load some basic packages and the data:

In [None]:
# fundamental packages
import numpy as np
import pandas as pd

# load some data to train our classifier on
reviews = pd.read_csv("https://www.dropbox.com/scl/fi/y1fzhtdkw8m3swkxb9gif/sub_sample.csv?rlkey=ssaut1n6dua1cihgwww9bxnrm&dl=1")
reviews["bin_label"] = reviews.label == "good"


In [4]:
reviews.shape

(10000, 3)

In [5]:
reviews.head()

| | label | text | bin_label |
|---|---|---|---|
| 0 | good | District 9 Blu-Ray Edition: This Blu-Ray disc ... | True |
| 1 | bad | Great for it's age I guess...: This book smell... | False |
| 2 | bad | OOps !.. too much religious stuff here: I gues... | False |
| 3 | good | A Great Surprise!: When I first heard about th... | True |
| 4 | bad | Useless: Worthless for med school admission. H... | False |


The data has a simple structure, with 10,000 observations and two variables/columns, "label" and "text". The label is either "good" or "bad". We added a binary version of the label as a third variable. Our task is now to train a classifier that can tell these two labels apart, based on the text of the review. For that, we need some tools!
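Before training, it can help to check how the two labels are distributed, because accuracy is hard to interpret when one class dominates. The following is a minimal sketch on a small hypothetical stand-in DataFrame (`toy_reviews` is made up for illustration and only mirrors the column names of `reviews`):

```python
import pandas as pd

# Hypothetical stand-in for the real `reviews` data (same column names)
toy_reviews = pd.DataFrame({
    "label": ["good", "bad", "bad", "good", "bad"],
    "text": ["Loved it", "Broke after a day", "Not worth it",
             "Great value", "Useless"],
})
toy_reviews["bin_label"] = toy_reviews.label == "good"

# Share of each class; values close to 0.5 indicate a balanced data set
class_shares = toy_reviews.label.value_counts(normalize=True)
print(class_shares)
```

Running the same `value_counts(normalize=True)` call on `reviews.label` should show the class shares of the actual data.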

### A Quick Intro to `scikit-learn`

In [6]:
import IPython
url = 'https://scikit-learn.org/stable/'
iframe = '<iframe src=' + url + ' width=1600 height=350></iframe>'
IPython.display.HTML(iframe)



`scikit-learn` is an amazing package, catering to pretty much every need of a data scientist. **In order to train a classifier, we need a model that we can train and a vectorizer to transform our data**; that's pretty much it. `scikit-learn` offers much more (please go check it out already!), like a function to split our data into training and test sets, a pipeline to bind vectorizer and model together, and functions to compute our metrics. **We load all of this below**:

In [7]:
# load relevant tools

## A model (choose from API)
from sklearn.linear_model import LogisticRegression as LogReg

## A vectorizer to transform our text into numbers
from sklearn.feature_extraction.text import TfidfVectorizer

## A function to split our data into train and test set
from sklearn.model_selection import train_test_split

## A pipeline to put it all together, and a few functions to compute how well our classifier performs
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score

Let's split our data into a training set and a test set:

In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    reviews.text, reviews.bin_label, test_size=0.33, random_state=42)
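
To see what `test_size=0.33` does, here is a small sketch on hypothetical toy data (`toy_texts` and `toy_labels` are invented for illustration). It also passes `stratify`, which is not used in the cell above, to keep the class shares similar in both splits:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 texts with perfectly balanced labels
toy_texts = [f"review {i}" for i in range(100)]
toy_labels = [i % 2 == 0 for i in range(100)]

# Same arguments as above, plus `stratify` to preserve the class shares
txt_train, txt_test, lab_train, lab_test = train_test_split(
    toy_texts, toy_labels, test_size=0.33, random_state=42,
    stratify=toy_labels)

print(len(txt_train), len(txt_test))  # 67 33
```

Applied to the 10,000 reviews, the same setting yields 3,300 test observations, which is the total of the crosstab further down.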

Now we need literally two lines of code to train the classifier.

In [9]:
pipe = Pipeline([('Tfidf', TfidfVectorizer()), ('LogReg', LogReg())])
pipe.fit(X_train, y_train)

Pipeline(steps=[('Tfidf', TfidfVectorizer()), ('LogReg', LogisticRegression())])


![](https://media.giphy.com/media/zXMRfbsHOAire/giphy.gif)

Don't believe me? Check for yourself:

In [10]:
pipe.predict(["This is a great movie",
              "Never hated something as much as this movie"])

array([ True, False])

It does predict our examples well, but how good is the accuracy?

In [11]:
y_pred = pipe.predict(X_test)
pd.crosstab(y_test, y_pred)

| bin_label (rows) / prediction (columns) | False | True |
|---|---|---|
| False | 1434 | 208 |
| True | 234 | 1424 |


In [12]:
## define a custom function to report metrics
def accuracy_report(y_test, y_pred):
  print("Accuracy: ",  round(accuracy_score(y_test, y_pred), 3))
  print("Recall: ",    round(recall_score(y_test, y_pred), 3))
  print("Precision: ", round(precision_score(y_test, y_pred), 3))
  print("F1: ",        round(f1_score(y_test, y_pred), 3))

accuracy_report(y_test, y_pred)

Accuracy:  0.866
Recall:  0.859
Precision:  0.873
F1:  0.866
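
The four metrics follow directly from the crosstab of true labels against predictions. As a worked check, this sketch recomputes them by hand from the four cell counts reported above (the variable names `tn`, `fp`, `fn`, `tp` are introduced here for illustration):

```python
# Cell counts copied from the crosstab (rows: true label, columns: prediction)
tn, fp = 1434, 208   # true label False
fn, tp = 234, 1424   # true label True

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
recall    = tp / (tp + fn)                    # share of true positives found
precision = tp / (tp + fp)                    # share of positive predictions that are right
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(accuracy, 3), round(recall, 3), round(precision, 3), round(f1, 3))
# 0.866 0.859 0.873 0.866
```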


Pretty good, huh? Let's see how this works in more detail!

## Under the Hood

Let's show this based on a very simple example. We generate a set of example texts that are positive or negative reviews and check what the classifier does:

In [13]:
example_revs = ["This is a great, great movie",
                "This is a horrible movie",
                "Waste of time",
                "Beautiful"]

example_y = [True, False, False, True]

### Vectorization

We choose a vectorizer for our text [from `scikit-learn`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text) and assign it to an object so we can fit it:

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
vec  = CountVectorizer()

Then, we fit it to our example reviews and transform the text into numbers:

In [15]:
sparse_mtrx = vec.fit_transform(example_revs)
print(vec.get_feature_names_out(), "\n", sparse_mtrx.toarray())

['beautiful' 'great' 'horrible' 'is' 'movie' 'of' 'this' 'time' 'waste'] 
 [[0 2 0 1 1 0 1 0 0]
 [0 0 1 1 1 0 1 0 0]
 [0 0 0 0 0 1 0 1 1]
 [1 0 0 0 0 0 0 0 0]]


We can see that the vectorizer simply counts the occurrences of each word in each text. By default, it converts all words to lowercase and drops punctuation and single-character tokens (which is why 'a' is missing from the vocabulary); accents are only stripped if you set `strip_accents`. Now we can use the `transform()` function to transform new texts into the same format. This is particularly important when we need to transform texts in the test set into a matrix based on the vocabulary of the training set.

In [16]:
vec.transform(["This movie is not good."]).toarray()

array([[0, 0, 0, 1, 1, 0, 1, 0, 0]])

We can see that some features ('not' and 'good') from this new text are not encoded, as the fitted vocabulary has no column for them in the document-term matrix.
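
To find out which tokens of a new text are dropped, one option is to compare its tokens against the fitted vocabulary. The sketch below refits a vectorizer on the example reviews under a separate, hypothetical name (`oov_vec`) so it does not interfere with `vec`:

```python
from sklearn.feature_extraction.text import CountVectorizer

example_revs = ["This is a great, great movie",
                "This is a horrible movie",
                "Waste of time",
                "Beautiful"]

oov_vec = CountVectorizer().fit(example_revs)

new_text = "This movie is not good."
# build_analyzer() returns the preprocessing + tokenization the vectorizer applies
tokens = oov_vec.build_analyzer()(new_text)
# tokens without a column in the fitted vocabulary are silently ignored by transform()
unseen = [t for t in tokens if t not in oov_vec.vocabulary_]
print(unseen)  # ['not', 'good']
```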

Vectorizers have many more features that can be used to preprocess the text. Below is an example.

In [None]:
vec = CountVectorizer(stop_words=["this", "is", "of"])
sparse_mtrx = vec.fit_transform(example_revs)

## use the command from above to print your transformed matrix
print(vec.get_feature_names_out(), "\n", sparse_mtrx.toarray())

['beautiful' 'great' 'horrible' 'movie' 'time' 'waste'] 
 [[0 2 0 1 0 0]
 [0 0 1 1 0 0]
 [0 0 0 0 1 1]
 [1 0 0 0 0 0]]


Look up the arguments of the [`CountVectorizer()`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) and test it yourself.

### Fitting the Model

Now that we know how to convert text into numbers, we can fit a classifier to the data in order to predict observations in the test set (we use our initial data again).

In [18]:
## train-test-split
X_train, X_test, y_train, y_test = train_test_split(
  reviews.text, reviews.bin_label, test_size=0.33, random_state=42)

## vectorize data
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(ngram_range=(1,2))

X_train = vec.fit_transform(X_train)

## load a classifier of your choosing
## (SGDClassifier with its default hinge loss fits a linear SVM via stochastic gradient descent)
from sklearn.linear_model import SGDClassifier as SVM
clsfr = SVM()

## fit
clsfr.fit(X_train, y_train)

SGDClassifier()


Now we can assess performance the same way as before:

In [19]:
X_test  = vec.transform(X_test)
y_pred = clsfr.predict(X_test)

pd.crosstab(y_test, y_pred)

| bin_label (rows) / prediction (columns) | False | True |
|---|---|---|
| False | 1467 | 175 |
| True | 206 | 1452 |


In [20]:
accuracy_report(y_test, y_pred)

Accuracy:  0.885
Recall:  0.876
Precision:  0.892
F1:  0.884
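
The cells above use `TfidfVectorizer` instead of `CountVectorizer`. As a rough sketch of what that changes, the code below rebuilds the tf-idf weights by hand from raw counts, assuming the vectorizer's default settings (smoothed idf, L2-normalised rows), and compares them with the `scikit-learn` result. The names `manual_tfidf` and `sk_tfidf` are introduced here for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

example_revs = ["This is a great, great movie",
                "This is a horrible movie",
                "Waste of time",
                "Beautiful"]

counts = CountVectorizer().fit_transform(example_revs).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # number of documents containing each term

idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed inverse document frequency
weighted = counts * idf                          # rare terms get larger weights
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalise rows

sk_tfidf = TfidfVectorizer().fit_transform(example_revs).toarray()
print(np.allclose(manual_tfidf, sk_tfidf))  # True
```

Terms that occur in many documents (such as 'this' or 'is') are down-weighted relative to terms that occur in few.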


## Improving your Model

### Preprocessing with [`spaCy`](https://spacy.io/)

Sometimes, you might want to pre-select features based on your classification problem. For example, when you are interested in the topic of a text, it might be sufficient to look at the nouns it uses, whereas other words might mostly introduce noise. Other tasks might require you to identify the object of a sentence or the organisations mentioned in a text. `spaCy` can identify these words through **part-of-speech tagging**, **dependency parsing**, and **named entity recognition**. `spaCy` can do much more, however: its [website](https://course.spacy.io/en/) provides an entire course, from finding words to training a neural network.

In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm

#### Part-of-speech tagging

In [23]:
# Process a text
doc = nlp("Not the hero we deserve, but the hero we need.")

# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)

# what's PRON? get an explanation:
spacy.explain("PRON")

Not PART
the DET
hero NOUN
we PRON
deserve VERB
, PUNCT
but CCONJ
the DET
hero NOUN
we PRON
need VERB
. PUNCT


'pronoun'

In [25]:
# retain only the nouns of a set of texts:
docs = ["May the force be with you.",
        "You're gonna need a bigger boat!",
        "Fly, you fools!",
        "And I will strike down upon thee with great vengeance and furious anger!",
        "You can't handle the truth!",
        "You take the blue pill, the story ends; you wake up in your bed and believe whatever you want to believe.",
        "I love the smell of napalm in the morning."]


for doc in nlp.pipe(docs):
  print([token.text for token in doc if token.pos_ == 'NOUN'])


['force']
['boat']
['fools']
['vengeance', 'anger']
['truth']
['pill', 'story', 'bed']
['smell', 'napalm', 'morning']


Another package commonly used in text analysis is [`nltk`](https://www.nltk.org/). It offers functionality similar to `spaCy` (e.g. part-of-speech tagging), but with a slightly different implementation. Below, we show how to remove stopwords with `nltk`.

#### Stopword removal

Many words that are constantly used in everyday language are usually not very informative about the content of a text (see [Pennebaker 2011](http://secretlifeofpronouns.com/) for a contrasting perspective). These words are called 'stopwords' in NLP and are usually considered clutter that could and should be removed.

Note however that preprocessing can heavily affect model results ([Denny and Spirling 2018](https://www.cambridge.org/core/journals/political-analysis/article/text-preprocessing-for-unsupervised-learning-why-it-matters-when-it-misleads-and-what-to-do-about-it/AA7D4DE0AA6AB208502515AE3EC6989E)). How to preprocess text in general is a decision that should be made based on careful consideration of the problem at hand ([Grimmer and Stewart 2013](https://www.cambridge.org/core/services/aop-cambridge-core/content/view/F7AAC8B2909441603FEB25C156448F20/S1047198700013401a.pdf)).

In [29]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she



In [31]:
from nltk.tokenize import word_tokenize
nltk.download('punkt_tab')
stop_words = set(stopwords.words('english'))

filtered_docs = []

for doc in docs:
  filtered_doc = " ".join([w for w in word_tokenize(doc) if w.lower() not in stop_words])
  filtered_docs.append(filtered_doc)

filtered_docs



['May force .',
 "'re gon na need bigger boat !",
 'Fly , fools !',
 'strike upon thee great vengeance furious anger !',
 "ca n't handle truth !",
 'take blue pill , story ends ; wake bed believe whatever want believe .',
 'love smell napalm morning .']
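
Stemming is another common preprocessing step that `nltk` supports. The sketch below applies the `PorterStemmer` to one of the filtered quotes; the input string is retyped by hand without punctuation, and a plain whitespace split stands in for `word_tokenize` to keep the example free of extra downloads:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Example input: one of the stopword-filtered quotes, with punctuation dropped
filtered = "take blue pill story ends wake bed believe whatever want believe"

# Reduce every word to its stem, e.g. 'ends' becomes 'end'
stemmed = " ".join(stemmer.stem(w) for w in filtered.split())
print(stemmed)
```

Stems are not always dictionary words, so the result is meant as model input rather than as readable text.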

### BONUS: Feature Assessment with [`eli5`](https://eli5.readthedocs.io/en/latest/index.html)

`eli5` is a great package to understand how our classifier makes decisions. It has two main functions: `show_weights()` tells us which features are most predictive for the classification, and `show_prediction()` explains how each feature affects the prediction for a single example. This can be particularly useful for iterative feature selection, the exclusion of stopwords, etc.

In [41]:
import eli5
eli5.show_weights(clsfr, vec=vec, feature_names=vec.get_feature_names_out())

| Weight? | Feature |
|---|---|
| +5.947 | great |
| +3.513 | excellent |
| +3.421 | best |
| +2.953 | love |
| +2.567 | perfect |
| +2.556 | good |
| +2.525 | wonderful |
| +2.399 | amazing |
| +2.361 | fun |
| … 111708 more positive … | |


In [42]:
eli5.show_prediction(clsfr, reviews.text[400], vec=vec, feature_names=vec.get_feature_names_out())

| Contribution? | Feature |
|---|---|
| 1.384 | Highlighted in text (sum) |
| -0.082 | <BIAS> |


## Exercise

Try to improve the model fit by changing the preprocessing of the texts, choosing a different vectorizer, or fitting a different model. How good is your model?
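
One way to compare several settings systematically is a cross-validated grid search over a pipeline. The following is only a sketch on hypothetical toy data (`toy_texts` and `toy_y` are invented, and the parameter grid is an arbitrary example); for the exercise, the same pattern could be applied to the training data from above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data
toy_texts = ["great movie", "really great book", "love this album",
             "wonderful, would buy again", "horrible movie", "really bad book",
             "hate this album", "awful, would not buy again"]
toy_y = [True] * 4 + [False] * 4

search_pipe = Pipeline([("tfidf", TfidfVectorizer()),
                        ("clf", LogisticRegression())])

# Step name + double underscore + argument name addresses a pipeline parameter
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# cv=2 only because the toy data is tiny; larger data allows more folds
search = GridSearchCV(search_pipe, param_grid, scoring="accuracy", cv=2)
search.fit(toy_texts, toy_y)

print(search.best_params_, round(search.best_score_, 3))
```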