# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>


# <font color="#003660">Session 1: Introduction to Natural Language Processing</font>

# <font color="#003660">Notebook 4: Multi-class Classification</font>

<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/dag.png"/><br></center>

<p>
<center>
<div>
    <font color="#085986"><b>By the end of this lesson, you will be able to...</b><br><br>
        ... transform raw text into a term-document matrix, <br>
        ... train a multi-class classifier on the term-document matrix, and <br>
        ... compete in a Kaggle competition.
    </font>
</div>
</center>
</p>

# Import packages

As always, we first need to load a number of required Python packages:
- `pandas` provides high-performance, easy-to-use data structures and data analysis tools.
- `SQLAlchemy`, together with `pymysql`, allows us to communicate with SQL databases.
- `getpass` provides functions for entering passwords safely.
- `spacy` offers industrial-strength natural language processing.
- `sklearn` is the de-facto standard machine learning package in Python.

In [2]:
import pandas as pd
import spacy
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

# Load documents

Load the wine reviews (Source: https://www.kaggle.com/datasets/zynicide/wine-reviews) from a CSV file.

In [3]:
corpus = pd.read_csv('winemag-data-130k-v2.csv')

In [4]:
# rename Unnamed: 0 into index
corpus.rename(columns = {'Unnamed: 0':'index'}, inplace = True)

In [5]:
top_countries = ["US","France","Italy","Spain","Portugal","Chile","Argentina","Austria","Australia","Germany"]
corpus = corpus[corpus["country"].isin(top_countries)]

In [6]:
corpus.head()

| | index | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | | Sicily & Sardinia | Etna | | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
| 1 | 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | | | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
| 2 | 2 | US | Tart and snappy, the flavors of lime flesh and... | | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
| 3 | 3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | | Alexander Peartree | | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
| 4 | 4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |


In [7]:
corpus.shape

(124584, 14)

# Preprocess documents

Split data into training, validation, and test set.

In [8]:
training = corpus.iloc[:80000]
validation = corpus.iloc[80000:100000]
test = corpus.iloc[100000:]

In [9]:
print(training.shape)
print(validation.shape)
print(test.shape)

(80000, 14)
(20000, 14)
(24584, 14)
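
The slices above split the corpus by row order. If the file happened to be sorted (for example by country or score), the three sets could end up with different class mixes. A possible alternative, which is not what this notebook does, is a shuffled and stratified split with scikit-learn's `train_test_split`. The sketch below uses a hypothetical toy DataFrame; the 64/16/20 proportions and the `random_state` are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the wine corpus (hypothetical data: 100 rows, 2 countries)
toy = pd.DataFrame({
    "description": [f"review {i}" for i in range(100)],
    "country": ["US", "France"] * 50,
})

# First carve out the test set, then split the remainder into training and validation.
# stratify keeps the country proportions the same in every split.
rest, toy_test = train_test_split(
    toy, test_size=0.2, stratify=toy["country"], random_state=42)
toy_training, toy_validation = train_test_split(
    rest, test_size=0.2, stratify=rest["country"], random_state=42)

print(len(toy_training), len(toy_validation), len(toy_test))
print(toy_test["country"].value_counts())
```

Whether shuffling is appropriate depends on how the data were collected; for a Kaggle-style setup with a fixed test file, the split is usually given.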


Perform standard NLP preprocessing steps on the training set using spaCy. To speed things up, we disable the components of spaCy's standard NLP pipeline that we do not need here (named entity recognition and dependency parsing).

In [10]:
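nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

def spacy_prep_df(df):
    """Return a new DataFrame with an added, preprocessed 'description_prep' column."""
    records = df.to_dict("records")
    # nlp.pipe processes the texts in batches and yields the docs in input order
    docs = nlp.pipe(entry["description"] for entry in records)
    for entry, doc in zip(records, docs):
        # keep alphabetic tokens that are not stop words, as lowercased lemmas
        tokens_to_keep = [token.lemma_.lower() for token in doc
                          if token.is_alpha and not token.is_stop]
        entry["description_prep"] = " ".join(tokens_to_keep)
    return pd.DataFrame(records)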

In [11]:
training = spacy_prep_df(training)

Display the first couple of lines of the preprocessed descriptions.

In [12]:
training["description_prep"].head()

0    aromas include tropical fruit broom brimstone ...
1    ripe fruity wine smooth structure firm tannin ...
2    tart snappy flavor lime flesh rind dominate gr...
3    pineapple rind lemon pith orange blossom start...
4    like regular bottling come rough tannic rustic...
Name: description_prep, dtype: object

# Vectorize documents

Use a `CountVectorizer` to vectorize the documents.

In [13]:
count_vect = CountVectorizer(min_df=10)

Apply the vectorizer to the review texts of the training set.

In [14]:
X_training = count_vect.fit_transform(training["description_prep"].tolist())

Store the labels that we want to predict in a separate variable.

In [15]:
y_training = training["country"]
y_training.value_counts()

country
US           34704
France       14291
Italy        12751
Spain         4300
Portugal      3748
Chile         2835
Argentina     2429
Austria       2073
Australia     1504
Germany       1365
Name: count, dtype: int64

A simple way to extend binary classification algorithms to the multi-class case is the so-called **one-vs-rest scheme**: we learn one binary classifier per class. To do so, we convert the multi-class labels into multiple binary labels, one per class, each indicating whether an observation belongs to that class or not.

In [16]:
label_bin = LabelBinarizer().fit(y_training)

In [17]:
y_training_bin = label_bin.transform(y_training)
y_training_bin

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 1],
       ...,
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0]])
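
The binarization is easier to follow on a handful of labels. The sketch below uses hypothetical labels: each column stands for one class (in alphabetical order), and each row contains exactly one 1.

```python
from sklearn.preprocessing import LabelBinarizer

toy_labels = ["US", "France", "Italy", "US"]  # hypothetical labels

toy_bin = LabelBinarizer().fit(toy_labels)
toy_y = toy_bin.transform(toy_labels)

print(toy_bin.classes_)                  # column order (classes sorted alphabetically)
print(toy_y)                             # one row per observation, one column per class
print(toy_bin.inverse_transform(toy_y))  # back to the original labels
```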

# Train classifier on training set

Use the `OneVsRestClassifier` wrapper to fit one logistic regression classifier per class. The term-document matrix holds the features and the binarized country of origin (i.e., `country` variable) represents the labels.

In [18]:
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_training, y_training_bin)
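
To make the one-vs-rest idea concrete, the following sketch fits the binary models by hand and compares them with what the wrapper produces. It uses a synthetic three-class dataset (an assumption for illustration) and passes the raw 1-D labels to the wrapper, which then binarizes them itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 3-class problem (illustrative only)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=10, n_informative=5,
    n_classes=3, random_state=0)

# One-vs-rest by hand: one binary "class c vs. everything else" model per class
manual_models = []
for c in np.unique(y_toy):
    is_class_c = (y_toy == c).astype(int)
    manual_models.append(LogisticRegression(max_iter=1000).fit(X_toy, is_class_c))

# The wrapper does the same bookkeeping internally
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_toy, y_toy)

# The hand-made and wrapped models should have (numerically) the same coefficients
for manual, wrapped in zip(manual_models, ovr.estimators_):
    print(np.allclose(manual.coef_, wrapped.coef_, atol=1e-6))
```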

Test whether the classifier works by predicting the country of origin of a short fake review. We apply the same NLP preprocessing steps and reuse the fitted `count_vect` object so that the features are generated in the same way as for the training set.

In [19]:
doc_new = {'index': [1],
           'description': ['This wine is nice and easy.']}

doc_new_df = pd.DataFrame.from_dict(doc_new)

In [20]:
doc_new_df_prep = spacy_prep_df(doc_new_df)
doc_new_df_prep

| | index | description | description_prep |
|---|---|---|---|
| 0 | 1 | This wine is nice and easy. | wine nice easy |


Predict class membership.

In [21]:
X_new = count_vect.transform(doc_new_df_prep["description_prep"])
predicted = clf.predict(X_new)
predicted

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])

In [22]:
label_bin.classes_

array(['Argentina', 'Australia', 'Austria', 'Chile', 'France', 'Germany',
       'Italy', 'Portugal', 'Spain', 'US'], dtype='<U9')

In [23]:
label_bin.inverse_transform(predicted)

array(['US'], dtype='<U9')
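
One caveat when decoding predictions this way: because `clf` was fitted on an indicator matrix, each binary model decides on its own, so a predicted row may contain no 1 at all. As far as I can tell, `LabelBinarizer.inverse_transform` decodes multi-class rows by taking the position of the largest entry, so an all-zero row is silently mapped to the first class. The sketch below illustrates this with hypothetical labels and scores, and shows decoding via the highest per-class score as one possible alternative.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

toy_bin = LabelBinarizer().fit(["France", "Italy", "US"])  # hypothetical labels

# Second row: no binary model fired, so the row contains only zeros
toy_pred = np.array([[0, 0, 1],
                     [0, 0, 0]])
decoded = toy_bin.inverse_transform(toy_pred)
print(decoded)

# Alternative decoding: pick the class with the highest score in each row
toy_scores = np.array([[0.1, 0.2, 0.9],
                       [0.1, 0.4, 0.3]])  # hypothetical per-class probabilities
decoded_by_score = toy_bin.classes_[toy_scores.argmax(axis=1)]
print(decoded_by_score)
```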

Instead of predicting hard membership, we can also predict the probabilities of the classes.

In [24]:
predicted_prob = clf.predict_proba(X_new)
print(clf.classes_)
print(predicted_prob)

[0 1 2 3 4 5 6 7 8 9]
[[0.00562176 0.01735571 0.00913316 0.00225515 0.1219898  0.00181919
  0.07715524 0.03882784 0.00520636 0.76810407]]
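
The ten probabilities printed above add up to roughly 1.05 rather than 1. Each value comes from a separate binary model, and since `clf` was fitted on an indicator matrix, the wrapper treats the task as multi-label and does not rescale them. The sketch below, on a synthetic dataset (an assumption for illustration), contrasts this with fitting on the raw 1-D labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelBinarizer

# Synthetic 3-class problem (illustrative only)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=10, n_informative=5,
    n_classes=3, random_state=0)
y_toy_bin = LabelBinarizer().fit_transform(y_toy)

# Fitted on the indicator matrix, as in this notebook
ovr_bin = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_toy, y_toy_bin)
# Fitted on the raw 1-D labels
ovr_1d = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_toy, y_toy)

print(ovr_bin.multilabel_, ovr_1d.multilabel_)
print(ovr_bin.predict_proba(X_toy[:3]).sum(axis=1))  # independent per-class probabilities
print(ovr_1d.predict_proba(X_toy[:3]).sum(axis=1))   # rescaled so that each row sums to 1
print(ovr_1d.predict(X_toy[:3]))                     # class labels instead of indicator rows
```

Both variants train the same binary models; they differ only in how the outputs are combined at prediction time.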


# Evaluate accuracy on validation set

Let's evaluate the predictive accuracy of our model on the validation set.

In [25]:
validation = spacy_prep_df(validation)

In [26]:
X_validation = count_vect.transform(validation["description_prep"])
y_validation = validation["country"]
y_validation_bin = label_bin.transform(y_validation)

Call the predict function of our model with the validation data and calculate precision, recall and F1-score.

In [27]:
predictions_validation = clf.predict(X_validation)
print(metrics.classification_report(y_validation_bin, predictions_validation))

              precision    recall  f1-score   support

           0       0.64      0.41      0.50       582
           1       0.84      0.50      0.63       386
           2       0.85      0.61      0.71       650
           3       0.65      0.42      0.51       723
           4       0.82      0.81      0.82      3461
           5       0.78      0.61      0.68       362
           6       0.97      0.95      0.96      3074
           7       0.83      0.52      0.64       880
           8       0.76      0.65      0.70      1038
           9       0.92      0.92      0.92      8844

   micro avg       0.88      0.82      0.85     20000
   macro avg       0.81      0.64      0.71     20000
weighted avg       0.88      0.82      0.84     20000
 samples avg       0.80      0.82      0.81     20000
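
The row labels 0 to 9 in the report are positions in `label_bin.classes_` (0 is Argentina, 9 is US). Because the inputs are indicator matrices, the report treats the task as multi-label, which is why a `samples avg` row appears. A minimal sketch with hypothetical labels and predictions shows how `target_names` makes the report easier to read and how `zero_division` handles classes that are never predicted.

```python
import numpy as np
from sklearn import metrics
from sklearn.preprocessing import LabelBinarizer

toy_bin = LabelBinarizer().fit(["France", "Italy", "US"])  # hypothetical labels

toy_true = toy_bin.transform(["US", "Italy", "France", "US"])
# Hypothetical predictions: the last row is all zeros (no class predicted)
toy_pred = np.array([[0, 0, 1],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 0, 0]])

report = metrics.classification_report(
    toy_true, toy_pred,
    target_names=list(toy_bin.classes_),  # label the rows with country names
    zero_division=0)                      # use 0 instead of warning for undefined metrics
print(report)
```

In the notebook, the equivalent call would pass `target_names=list(label_bin.classes_)`.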





# Interpret model

Interpreting a one-vs-rest logistic regression classifier is a bit more complex than usual, as we have to inspect the coefficients of many models (i.e., one per class). Here we look at the model at index 6, which corresponds to `Italy` in `label_bin.classes_`.

In [28]:
coeffs = clf.estimators_[6].coef_.tolist()[0]

In [29]:
words = count_vect.get_feature_names_out()
words_with_coeffs = pd.DataFrame(coeffs, words, columns=["coeff"])

In [30]:
words_with_coeffs.sort_values("coeff", ascending=True).head(10)

| | coeff |
|---|---|
| malbec | -2.88094 |
| grenache | -2.459753 |
| tempranillo | -2.260137 |
| fruitiness | -2.195985 |
| zin | -2.152913 |
| chocolaty | -2.130974 |
| riesling | -2.10633 |
| noir | -2.00826 |
| zinfandel | -1.976678 |
| oaky | -1.943537 |


In [31]:
words_with_coeffs.sort_values("coeff", ascending=False).head(10)

| | coeff |
|---|---|
| brunello | 4.962957 |
| nero | 4.024776 |
| bianco | 3.952564 |
| barolo | 3.646842 |
| amarone | 3.614226 |
| soave | 3.467451 |
| italy | 3.20389 |
| prosecco | 3.190809 |
| dole | 3.151008 |
| riserva | 3.106787 |
