<a href="https://colab.research.google.com/github/nicolaiberk/Imbalanced/blob/master/IntroSML_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# An Introduction to Supervised Learning with scikit-learn



## A Minimal Example

Today, we will train a sentiment classifier, using a subset of 10,000 [Amazon Reviews provided as part of a Kaggle competition](https://www.kaggle.com/datasets/bittlingmayer/amazonreviews?resource=download). We first load some basic packages and the data:

In [None]:
# fundamental packages
import numpy as np
import pandas as pd

# load some data to train our classifier on
reviews = pd.read_csv("https://www.dropbox.com/s/zup4rcr8j5jz0wr/sub_sample.csv?dl=1")
reviews["bin_label"] = reviews.label == "good"


In [None]:
reviews.shape

(10000, 3)

In [None]:
reviews.head()

Unnamed: 0,label,text,bin_label
0,good,District 9 Blu-Ray Edition: This Blu-Ray disc ...,True
1,bad,Great for it's age I guess...: This book smell...,False
2,bad,OOps !.. too much religious stuff here: I gues...,False
3,good,A Great Surprise!: When I first heard about th...,True
4,bad,Useless: Worthless for med school admission. H...,False


The data has a simple structure, with 10,000 observations and two variables/columns, "label" and "text". The label is either "good" or "bad". We added a binary version of the label, "bin_label", as a third variable. Our task is now to train a classifier that can tell these two labels apart, based on the text of the review. For that, we need some tools!
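To illustrate how the binary label is derived, the sketch below repeats the comparison on a toy data frame. The four rows and the name `toy` are made up for illustration and are not taken from the Amazon data.

```python
import pandas as pd

# Hypothetical toy rows; the real data set has 10,000 reviews
toy = pd.DataFrame({
    "label": ["good", "bad", "bad", "good"],
    "text": ["Loved it", "Broke after a day", "Not worth the money", "Works great"],
})

# Comparing a column with a string yields a boolean Series
toy["bin_label"] = toy["label"] == "good"

# The mean of a boolean column is the share of True values (class balance)
share_good = toy["bin_label"].mean()

print(toy)
print("Share of good reviews:", share_good)
```

Checking the share of positive labels early is a cheap way to spot imbalanced classes before training.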

### A Quick Intro to `scikit-learn`

In [None]:
import IPython
url = 'https://scikit-learn.org/stable/'
iframe = '<iframe src=' + url + ' width=1600 height=350></iframe>'
IPython.display.HTML(iframe)

`scikit-learn` is an amazing package, catering to pretty much every need of a data scientist. **In order to train a classifier, we need a model that we can train and a vectorizer to transform our data**; that's pretty much it. `scikit-learn` offers much more (please go check it out already!), like a function to split our data into training and test sets, a pipeline to bind the steps together, and functions to compute our metrics. **We load all of this below**:

In [None]:
# load relevant tools

## A model (choose from API)
from sklearn.linear_model import 

## A vectorizer to transform our text into numbers
from sklearn.feature_extraction.text import 

## A function to split our data into train and test set
from sklearn.model_selection import train_test_split

## A pipeline to put it all together, and a few functions to compute how well our classifier performs
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score

Let's split our data into train and test set:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
     , , test_size=0.33, random_state=42)
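To see what the split does, here is a minimal sketch on ten made-up items (the `toy_*` names are placeholders, not part of the tutorial data). With `test_size=0.33`, scikit-learn rounds the test share up, so ten items yield four test and six training items. The same setting gives 3,300 test reviews for our 10,000 reviews.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the review texts and their labels
toy_texts = [f"review {i}" for i in range(10)]
toy_labels = [i % 2 == 0 for i in range(10)]

# random_state fixes the shuffling, so the split is reproducible
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.33, random_state=42)

print(len(toy_X_train), "training items,", len(toy_X_test), "test items")
```

Texts and labels are shuffled together, so each text keeps its own label on both sides of the split.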

Now we need literally two lines of code to train the classifier.

In [None]:
pipe = Pipeline([(), ()])


Pipeline(steps=[('Tfidf', TfidfVectorizer()), ('LogReg', LogisticRegression())])
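The printed pipeline above suggests a `TfidfVectorizer` followed by a `LogisticRegression`. A minimal sketch of such a pipeline on a handful of made-up reviews could look like this (the `toy_*` names and texts are illustrative assumptions, not the tutorial data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical training examples
toy_texts = ["I love this product", "Great value, works well",
             "Terrible quality", "Broke after one day, awful"]
toy_labels = [True, True, False, False]

# Line 1: define the steps as (name, estimator) tuples
toy_pipe = Pipeline([("Tfidf", TfidfVectorizer()), ("LogReg", LogisticRegression())])
# Line 2: fit the vectorizer and the model in one go
toy_pipe.fit(toy_texts, toy_labels)

# The fitted pipeline accepts raw text and vectorizes it internally
toy_preds = toy_pipe.predict(["works great", "awful quality"])
print(toy_preds)
```

Because the vectorizer lives inside the pipeline, new texts are transformed with the vocabulary learned during `fit()`, with no separate call needed.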

![](https://media.giphy.com/media/zXMRfbsHOAire/giphy.gif)

Don't believe me? Check for yourself:

In [None]:
pipe.predict(["",
              ""])

array([ True, False])

It does predict our examples well, but how well does it perform on the test set?

In [None]:
y_pred = 
pd.crosstab(, y_pred)

col_0,False,True
bin_label,Unnamed: 1_level_1,Unnamed: 2_level_1
False,1434,208
True,235,1423


In [None]:
## define a custom function to report metrics
def accuracy_report(y_test, y_pred):
  print("Accuracy: ",  round(accuracy_score(y_test, y_pred), 3))
  print("Recall: ",    round(recall_score(y_test, y_pred), 3))
  print("Precision: ", round(precision_score(y_test, y_pred), 3))
  print("F1: ",        round(f1_score(y_test, y_pred), 3))

accuracy_report(y_test, y_pred)

Accuracy:  0.866
Recall:  0.858
Precision:  0.872
F1:  0.865
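The four metrics follow directly from the cells of the confusion matrix. As a sanity check, this sketch recomputes them by hand from the counts in the cross-table above, using the standard definitions (the variable names are our own):

```python
# Counts read off the cross-table above (rows: true label, columns: prediction)
tn, fp = 1434, 208   # true label False: correctly rejected, falsely flagged
fn, tp = 235, 1423   # true label True: missed, correctly found

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
recall = tp / (tp + fn)                      # share of true positives that were found
precision = tp / (tp + fp)                   # share of positive predictions that are right
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

for name, value in [("Accuracy", accuracy), ("Recall", recall),
                    ("Precision", precision), ("F1", f1)]:
    print(f"{name}: {round(value, 3)}")
```

The rounded values match the report above: 0.866, 0.858, 0.872 and 0.865.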


Pretty good, huh? Let's see how this works in more detail!

## Under the Hood

Let's show this based on a very simple example. We generate a set of example texts that are positive or negative reviews and check what the classifier does:

In [None]:
example_revs = ["",
                "",
                "",
                ""]

example_y = [, , , ]

### Vectorization

We choose a vectorizer for our text [from `scikit-learn`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text) and assign it to an object so we can fit it:

In [None]:
from sklearn.feature_extraction.text import 
vec  = vectorizer()

Then, we fit it to our example reviews and transform the text into numbers:

In [None]:

print(vec.get_feature_names_out(), "\n", sparse_mtrx.toarray())

['beautiful' 'great' 'horrible' 'is' 'movie' 'of' 'this' 'time' 'waste'] 
 [[0 2 0 1 1 0 1 0 0]
 [0 0 1 1 1 0 1 0 0]
 [0 0 0 0 0 1 0 1 1]
 [1 0 0 0 0 0 0 0 0]]


We can see that the vectorizer simply counts the occurrences of each word in each text. By default, it converts all words to lowercase and ignores punctuation and single-character tokens. Now we can use the `transform()` method to transform new texts into the same format. This is particularly important when we need to transform texts in the test set into a matrix based on the vocabulary learned from the training set.

In [None]:
vec.transform([""]).toarray()

array([[0, 0, 0, 1, 1, 0, 1, 0, 0]])

We can see that some words ('not' and 'good') from this new text are not encoded, as the vocabulary learned during fitting has no column for them in the document-term matrix.
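To make the counting logic concrete, here is a simplified pure-Python sketch of what a count vectorizer does: lowercase the text, keep tokens of at least two word characters, sort the vocabulary alphabetically, and count. It is a teaching aid, not scikit-learn's actual implementation. The function names and the texts are hypothetical, chosen so that the counts match the matrix shown above.

```python
import re

TOKEN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN.findall(text.lower())

def fit_vocabulary(texts):
    """Learn the alphabetically sorted vocabulary from the training texts."""
    return sorted({tok for text in texts for tok in tokenize(text)})

def transform(texts, vocabulary):
    """Count vocabulary words per text; words outside the vocabulary are skipped."""
    index = {tok: i for i, tok in enumerate(vocabulary)}
    rows = []
    for text in texts:
        row = [0] * len(vocabulary)
        for tok in tokenize(text):
            if tok in index:
                row[index[tok]] += 1
        rows.append(row)
    return rows

# Hypothetical example texts
toy_revs = ["Great, this movie is great!", "This movie is horrible.",
            "Waste of time", "Beautiful!"]

vocab = fit_vocabulary(toy_revs)
print(vocab)
print(transform(toy_revs, vocab))

# 'not' and 'good' were never seen during fitting, so they are dropped
new_row = transform(["This movie is not good"], vocab)
print(new_row)
```

Keeping `fit_vocabulary` and `transform` separate mirrors the fit/transform split: the vocabulary is learned once and then reused unchanged for any new text.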

Vectorizers have many more options that can be used to preprocess the text. Look up the arguments of [`CountVectorizer()`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) and test them yourself:

In [None]:
?CountVectorizer

In [None]:
vec = CountVectorizer( ... )
sparse_mtrx = vec.fit_transform(example_revs)

## use the command from above to print your transformed matrix
print(vec.get_feature_names_out(), "\n", sparse_mtrx.toarray())

['beautiful' 'great' 'horrible' 'movie' 'time' 'waste'] 
 [[0 2 0 1 0 0]
 [0 0 1 1 0 0]
 [0 0 0 0 1 1]
 [1 0 0 0 0 0]]


### Fitting the Model

Now that we know how to convert text into numbers, we can fit a classifier to the data in order to predict observations in the test set (we use our initial data again).

In [None]:
## vectorize data
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(ngram_range=(1,2))

X = 
y = 

## train-test-split
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.33, random_state=42)

## load a classifier of your choosing
from sklearn.linear_model import 
clsfr = 

## fit
clsfr.fit(X_train, y_train)

SGDClassifier()

Now we can assess performance the same way as before:

In [None]:
y_pred = 

pd.crosstab(y_test, y_pred)

col_0,False,True
bin_label,Unnamed: 1_level_1,Unnamed: 2_level_1
False,1461,181
True,196,1462


In [None]:
accuracy_report(y_test, y_pred)

Accuracy:  0.886
Recall:  0.882
Precision:  0.89
F1:  0.886


Solutions and additional material can be found [here](https://colab.research.google.com/github/nicolaiberk/Imbalanced/blob/master/IntroSML_standalone.ipynb).