# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>


# <font color="#003660">Week 1: Introduction to Natural Language Processing</font>

# <font color="#003660">Notebook 4: Multi-class Classification</font>

<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/dag.png"/><br></center>

<p>
<center>
<div>
    <font color="#085986"><b>By the end of this lesson, you will be able to...</b><br><br>
        ... transform raw text into a term-document matrix, <br>
        ... train a multi-class classifier on the term-document matrix, and <br>
        ... compete in a Kaggle competition.
    </font>
</div>
</center>
</p>

# Import packages

As always, we first need to load a number of required Python packages:
- `pandas` provides high-performance, easy-to-use data structures and data analysis tools.
- `SQLAlchemy`, together with `pymysql`, lets us communicate with SQL databases.
- `getpass` provides a function for entering passwords without echoing them to the screen.
- `spacy` offers industrial-strength natural language processing.
- `sklearn` is the de-facto standard machine learning package in Python.

In [None]:
# Install packages
!pip install pymysql



In [None]:
import pandas as pd
from sqlalchemy import create_engine
import getpass
import spacy
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

# Load documents

We load our data from a MySQL database. For security reasons, we don't store the database credentials here; please have a look at Panda to get them.

In [None]:
# Get credentials
user = input("Username: ")
passwd = getpass.getpass("Password: ")
server = input("Server: ")
db = input("Database: ")

# Create an engine instance (SQLAlchemy)
engine = create_engine("mysql+pymysql://{}:{}@{}/{}".format(user, passwd, server, db))

# Define SQL query
sql_query = "SELECT * FROM WineDataset"

# Query dataset (pandas)
corpus = pd.read_sql(sql=sql_query, con=engine)

Username: student
Password: ··········
Server: manila.uni-paderborn.de
Database: aml4ta
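
One caveat when building the connection string with `format`: a password that contains characters such as `@` or `/` would be misread as part of the URL. The following is a minimal sketch of one way to guard against this, using only the standard library. All credentials shown (`demo_user`, `p@ss/word`, `db.example.org`, `demo_db`) are hypothetical placeholders, not the course credentials.

```python
from urllib.parse import quote_plus

# Hypothetical credentials, for illustration only
user = "demo_user"
passwd = "p@ss/word"
server = "db.example.org"
db = "demo_db"

# Percent-encode the password so '@' and '/' are not read as URL delimiters
safe_passwd = quote_plus(passwd)
url = "mysql+pymysql://{}:{}@{}/{}".format(user, safe_passwd, server, db)
print(url)
```

The resulting string can be passed to `create_engine` in the same way as before.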


In [None]:
top_countries = ["US","France","Italy","Spain","Portugal","Chile","Argentina","Austria","Australia","Germany"]
corpus = corpus[corpus["country"].isin(top_countries)]

In [None]:
corpus.head()

Unnamed: 0,index,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,testset,verygood
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,0,0
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,0,0
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,0,0
3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,0,0
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,0,0


In [None]:
corpus.shape

(124584, 16)

# Preprocess documents

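Split the data into training, validation, and test sets. The first 80,000 rows with `testset == 0` form the training set, the remaining rows with `testset == 0` form the validation set, and all rows with `testset == 1` form the test set.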

In [None]:
training = corpus[corpus["testset"] == 0]
# Take the validation rows first, before `training` is truncated
validation = training.iloc[80000:100000]
training = training.iloc[0:80000]
test = corpus[corpus["testset"] == 1]

In [None]:
print(training.shape)
print(validation.shape)
print(test.shape)

(80000, 16)
(15881, 16)
(28703, 16)
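
The validation set has 15,881 rows rather than 20,000 because positional slices in pandas stop silently at the end of the frame instead of raising an error. A toy sketch of this behavior (the 10-row frame `toy` is made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"x": range(10)})

first = toy.iloc[0:8]    # rows 0..7
rest = toy.iloc[8:100]   # asks for rows 8..99, but only rows 8 and 9 exist

print(first.shape, rest.shape)
```

Checking the shapes after every split, as done above, is a cheap way to catch a slice that returned fewer rows than expected.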


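Perform standard NLP preprocessing steps on the training set using spaCy: keep only alphabetic tokens that are not stop words, and replace each token with its lowercased lemma. To speed things up, we disable the components of spaCy's standard NLP pipeline that we do not need (named entity recognition and dependency parsing).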

In [None]:
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

def spacy_prep_df(corpus):
    """Add a `description_prep` column holding the preprocessed description."""
    records = corpus.to_dict("records")
    for entry in records:
        doc = nlp(entry["description"])
        # Keep alphabetic tokens that are not stop words, as lowercased lemmas
        tokens_to_keep = [
            token.lemma_.lower()
            for token in doc
            if token.is_alpha and not token.is_stop
        ]
        entry["description_prep"] = " ".join(tokens_to_keep)
    # Rebuilding the DataFrame from records resets the index to 0..n-1
    return pd.DataFrame(records)

In [None]:
training = spacy_prep_df(training)

Display the first couple of lines of the preprocessed descriptions.

In [None]:
training["description_prep"].head()

0    aromas include tropical fruit broom brimstone ...
1    ripe fruity wine smooth structure firm tannin ...
2    tart snappy flavor lime flesh rind dominate gr...
3    pineapple rind lemon pith orange blossom start...
4    like regular bottling come rough tannic rustic...
Name: description_prep, dtype: object

# Vectorize documents

Use a `CountVectorizer` to vectorize the documents.

In [None]:
count_vect = CountVectorizer(min_df=10)

Apply the vectorizer to the review texts of the training set.

In [None]:
X_training = count_vect.fit_transform(training["description_prep"].tolist())

Store the labels that we want to predict in a separate variable.

In [None]:
y_training = training["country"]
y_training.value_counts()

US           34704
France       14291
Italy        12751
Spain         4300
Portugal      3748
Chile         2835
Argentina     2429
Austria       2073
Australia     1504
Germany       1365
Name: country, dtype: int64

A simple way to extend binary classification algorithms to the multi-class case is the so-called **one-vs-rest scheme**. The idea is to learn one binary classifier per class, each of which separates that class from all other classes. To do so, we convert the multi-class labels into multiple binary labels (i.e., one column per class that indicates whether an observation belongs to that class or not).

In [None]:
label_bin = LabelBinarizer().fit(y_training)

In [None]:
y_training_bin = label_bin.transform(y_training)
y_training_bin

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 1],
       ...,
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0]])
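
The full matrix is hard to read, so here is a small sketch of what `LabelBinarizer` does, using four made-up labels (`toy_labels` is an assumption for illustration). Each class becomes one column, in sorted order, and `inverse_transform` maps the rows back to the original labels.

```python
from sklearn.preprocessing import LabelBinarizer

toy_labels = ["US", "France", "Italy", "US"]
toy_bin = LabelBinarizer().fit(toy_labels)

print(toy_bin.classes_)              # column order of the binary matrix
Y_toy = toy_bin.transform(toy_labels)
print(Y_toy)                         # one row per label, a single 1 per row
print(toy_bin.inverse_transform(Y_toy))
```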

# Train classifier on training set

Use the `OneVsRestClassifier` wrapper to fit one logistic regression classifier per class. The term-document matrix holds the features and the binarized country of origin (i.e., `country` variable) represents the labels.

In [None]:
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_training, y_training_bin)
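
As a rough illustration of what the wrapper builds internally, the sketch below fits the same kind of model on six made-up two-word documents (`toy_docs` and `toy_countries` are assumptions, not taken from the wine data) and then inspects the fitted object. It should contain one binary logistic regression per label column.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelBinarizer

# Made-up documents and labels, two per country
toy_docs = ["barolo nebbiolo", "brunello sangiovese", "bordeaux merlot",
            "champagne brut", "napa zinfandel", "oregon pinot"]
toy_countries = ["Italy", "Italy", "France", "France", "US", "US"]

X_toy = CountVectorizer().fit_transform(toy_docs)
Y_toy = LabelBinarizer().fit_transform(toy_countries)

toy_clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_toy, Y_toy)

# One fitted binary LogisticRegression per class (i.e., per column of Y_toy)
print(len(toy_clf.estimators_))
# One probability column per class
print(toy_clf.predict_proba(X_toy).shape)
```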

Test whether the classifier is working by predicting the country of origin for a short fake review. We apply the same NLP preprocessing steps and reuse the `count_vect` object to generate features in the same way as we did for the training set.

In [None]:
doc_new = {'index': [1],
           'description': ['This wine is nice and easy.']}

doc_new_df = pd.DataFrame.from_dict(doc_new)

In [None]:
doc_new_df_prep = spacy_prep_df(doc_new_df)
doc_new_df_prep

Unnamed: 0,index,description,description_prep
0,1,This wine is nice and easy.,wine nice easy


Predict class membership.

In [None]:
X_new = count_vect.transform(doc_new_df_prep["description_prep"])
predicted = clf.predict(X_new)
predicted

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])

In [None]:
label_bin.classes_

array(['Argentina', 'Australia', 'Austria', 'Chile', 'France', 'Germany',
       'Italy', 'Portugal', 'Spain', 'US'], dtype='<U9')

In [None]:
label_bin.inverse_transform(predicted)

array(['US'], dtype='<U9')

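Instead of predicting hard class membership, we can also predict the class probabilities. The columns follow the order of `label_bin.classes_`. Because each column comes from its own binary classifier, the values in a row do not have to sum to 1.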

In [None]:
predicted_prob = clf.predict_proba(X_new)
print(clf.classes_)
print(predicted_prob)

[0 1 2 3 4 5 6 7 8 9]
[[0.00541626 0.01247073 0.01484196 0.00231026 0.11451257 0.00190995
  0.0838899  0.03592114 0.00487273 0.7865163 ]]
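
The integer class indices are not very readable. The sketch below pairs the probabilities printed above with the country names from `label_bin.classes_` (both copied by hand into the arrays `classes` and `probs`) and ranks the countries from most to least probable.

```python
import numpy as np

# Country names in the column order reported by label_bin.classes_
classes = np.array(['Argentina', 'Australia', 'Austria', 'Chile', 'France',
                    'Germany', 'Italy', 'Portugal', 'Spain', 'US'])
# Probabilities copied from the output above
probs = np.array([0.00541626, 0.01247073, 0.01484196, 0.00231026, 0.11451257,
                  0.00190995, 0.0838899, 0.03592114, 0.00487273, 0.7865163])

# Row total of the per-class probabilities
print(probs.sum())

# Rank countries from most to least probable and show the top three
order = np.argsort(probs)[::-1]
for name, p in zip(classes[order][:3], probs[order][:3]):
    print(name, round(float(p), 3))
```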


# Evaluate accuracy on validation set

Let's evaluate the predictive accuracy of our model on the validation set.

In [None]:
validation = spacy_prep_df(validation)

In [None]:
X_validation = count_vect.transform(validation["description_prep"])
y_validation = validation["country"]
y_validation_bin = label_bin.transform(y_validation)

Call the predict function of our model with the validation data and calculate precision, recall and F1-score.

In [None]:
predictions_validation = clf.predict(X_validation)
print(metrics.classification_report(y_validation_bin, predictions_validation))

              precision    recall  f1-score   support

           0       0.64      0.41      0.50       456
           1       0.82      0.52      0.64       302
           2       0.85      0.60      0.70       499
           3       0.63      0.39      0.48       581
           4       0.83      0.81      0.82      2772
           5       0.80      0.61      0.69       299
           6       0.97      0.95      0.96      2473
           7       0.83      0.53      0.65       713
           8       0.78      0.64      0.70       836
           9       0.92      0.92      0.92      6950

   micro avg       0.88      0.82      0.85     15881
   macro avg       0.81      0.64      0.71     15881
weighted avg       0.87      0.82      0.84     15881
 samples avg       0.80      0.82      0.81     15881
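
The report shows several averages, and the gap between micro and macro recall is easier to see on a toy case. The sketch below uses made-up indicator matrices (`y_true` and `y_pred` are assumptions for illustration) in which one rare class is never predicted: micro averaging pools all decisions, while macro averaging gives every class the same weight.

```python
import numpy as np
from sklearn.metrics import recall_score

# Made-up labels: class 0 is frequent, classes 1 and 2 are rare
y_true = np.array([[1, 0, 0],
                   [1, 0, 0],
                   [1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]])
# The fourth row receives no label at all, which can happen when
# every binary classifier answers "no" for an observation
y_pred = np.array([[1, 0, 0],
                   [1, 0, 0],
                   [1, 0, 0],
                   [0, 0, 0],
                   [0, 0, 1]])

micro = recall_score(y_true, y_pred, average="micro")  # 4 of 5 true labels found
macro = recall_score(y_true, y_pred, average="macro")  # mean of per-class recalls 1, 0, 1
print(micro, macro)
```

A macro average well below the micro average, as in the report above, suggests that the smaller classes are recognized less reliably than the large ones.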





# Interpret model

Interpreting a one-vs-rest logistic regression classifier is a bit more complex than usual, as we have to inspect the coefficients of many models (i.e., one per class). Here we look at the model with index 6, which corresponds to `Italy` in `label_bin.classes_`.

In [None]:
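# Coefficients of the binary model for class index 6 ('Italy').
# `estimators_` is used because recent scikit-learn versions no longer
# expose `coef_` on the OneVsRestClassifier wrapper itself.
coeffs = clf.estimators_[6].coef_[0].tolist()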



In [None]:
words = count_vect.get_feature_names_out()
words_with_coeffs = pd.DataFrame(coeffs, words, columns=["coeff"])

In [None]:
words_with_coeffs.sort_values("coeff", ascending=True).head(10)

Unnamed: 0,coeff
malbec,-2.694579
grenache,-2.300518
tempranillo,-2.208961
zin,-2.14314
riesling,-2.099057
chocolaty,-1.99577
fruitiness,-1.995422
minerality,-1.969245
oaky,-1.885095
garrigue,-1.880648


In [None]:
words_with_coeffs.sort_values("coeff", ascending=False).head(10)

Unnamed: 0,coeff
brunello,4.603661
bianco,4.176919
soave,3.738989
nero,3.729463
barolo,3.487626
amarone,3.38064
grigio,3.292125
prosecco,3.142726
barbaresco,3.117223
italy,3.065624
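
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives a more tangible reading. The sketch below does this for two of the coefficients listed above (the values are copied by hand into `coef_brunello` and `coef_malbec`). It is only a rough interpretation aid, since it assumes all other word counts stay fixed.

```python
import math

coef_brunello = 4.603661   # coefficient listed above for "brunello"
coef_malbec = -2.694579    # coefficient listed above for "malbec"

# exp(coefficient) is the factor by which the odds of "Italy" versus
# "not Italy" change when the word's count increases by one
odds_brunello = math.exp(coef_brunello)
odds_malbec = math.exp(coef_malbec)

print(odds_brunello)  # far above 1: the word strongly favors Italy
print(odds_malbec)    # far below 1: the word strongly speaks against Italy
```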
