<a href="https://colab.research.google.com/github/olivermueller/aml4ta-2021/blob/main/Session_02/2_03_Multi_class_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>


In [None]:
# Set up Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
# Install packages
!pip install pymysql

# <font color="#003660">Week 2: Predicting with Bags of Words</font>

# <font color="#003660">Notebook 3: Multi-class Classification</font>

<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/dag.png"/><br></center>

<p>
<center>
<div>
    <font color="#085986"><b>By the end of this lesson, you will be able to...</b><br><br>
        ... transform raw text into a term-document matrix, <br>
        ... train a multi-class classifier on the term-document matrix, <br> ... and compete in a Kaggle competition.
    </font>
</div>
</center>
</p>

# Import packages

As always, we first need to load a number of required Python packages:
- `pandas` provides high-performance, easy-to-use data structures and data analysis tools.
- `SQLAlchemy`, together with `pymysql`, lets us communicate with SQL databases.
- `getpass` provides a function for entering passwords without echoing them to the screen.
- `spacy` offers industrial-strength natural language processing.
- `sklearn` is the de-facto standard machine learning package in Python.

In [None]:
import pandas as pd
from sqlalchemy import create_engine
import getpass
import spacy
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

# Load documents

We load our data from a MySQL database. For security reasons, we don't store the database credentials here; please have a look at Panda to get them.

In [None]:
# Get credentials
user = input("Username: ")
passwd = getpass.getpass("Password: ")
server = input("Server: ")
db = input("Database: ")

# Create an engine instance (SQLAlchemy)
engine = create_engine("mysql+pymysql://{}:{}@{}/{}".format(user, passwd, server, db))

# Define SQL query
sql_query = "SELECT * FROM WineDataset"

# Query dataset (pandas)
corpus = pd.read_sql(sql=sql_query, con=engine)
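
One practical caveat with the connection string above: a password containing characters such as `@` or `/` can break the URL. The following is a minimal sketch of one way to guard against that, using `quote_plus` from the standard library. The helper name `build_mysql_url` and the credentials are hypothetical and serve only as an illustration.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, passwd, server, db):
    """Return a MySQL connection URL with percent-encoded credentials (hypothetical helper)."""
    return "mysql+pymysql://{}:{}@{}/{}".format(
        quote_plus(user), quote_plus(passwd), server, db
    )


# Made-up credentials, for illustration only
url = build_mysql_url("student", "p@ss/word", "db.example.org", "wine")
print(url)  # mysql+pymysql://student:p%40ss%2Fword@db.example.org/wine
```

The resulting string can be passed to `create_engine` in the same way as the unescaped one.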

In [None]:
top_countries = ["US","France","Italy","Spain","Portugal","Chile","Argentina","Austria","Australia","Germany"]
corpus = corpus[corpus["country"].isin(top_countries)]

In [None]:
corpus.head()

In [None]:
corpus.shape

# Preprocess documents

Split data into training, validation, and test set.

In [None]:
training = corpus[corpus["testset"] == 0]
validation = training.iloc[80000:100000,]
training = training.iloc[0:80000,]
test = corpus[corpus["testset"] == 1]

In [None]:
print(training.shape)
print(validation.shape)
print(test.shape)

Perform standard NLP preprocessing steps on the training set using spaCy. To speed things up, we disable some components of spaCy's standard NLP pipeline.

In [None]:
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser', 'tagger'])

def spacy_prep(dataset):
    """Add a 'description_prep' column holding the preprocessed descriptions."""
    records = dataset.to_dict("records")
    for entry in records:
        doc = nlp(entry["description"])
        # Keep lowercased lemmas of alphabetic tokens that are not stop words
        tokens_to_keep = [
            token.lemma_.lower()
            for token in doc
            if token.is_alpha and not token.is_stop
        ]
        entry["description_prep"] = " ".join(tokens_to_keep)
    return pd.DataFrame(records)

In [None]:
training = spacy_prep(training)

Display the first couple of lines of the preprocessed descriptions.

In [None]:
training["description_prep"].head()
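
To make the filtering logic of `spacy_prep` easier to see, here is a simplified sketch that works without spaCy. It is not equivalent to the function above: it splits on whitespace, uses a tiny hypothetical stop-word list, and does no lemmatization. The name `simple_prep` is made up for this illustration.

```python
# Tiny hypothetical stop-word list; spaCy's own list is much longer
STOP_WORDS = {"this", "is", "a", "the", "and", "of", "with"}


def simple_prep(text):
    """Keep lowercased alphabetic tokens that are not stop words."""
    kept = [
        token.lower()
        for token in text.split()
        if token.isalpha() and token.lower() not in STOP_WORDS
    ]
    return " ".join(kept)


print(simple_prep("This is a good wine with 12 months of oak"))
# good wine months oak
```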

# Vectorize documents

Use a `CountVectorizer` to vectorize the documents.

In [None]:
count_vect = CountVectorizer(min_df=10)

Apply the vectorizer to the review texts of the training set.

In [None]:
X_training = count_vect.fit_transform(training["description_prep"].tolist())
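
The effect of `min_df` can be illustrated with a small pure-Python sketch. It assumes whitespace tokenization and a toy corpus, so it only approximates what `CountVectorizer` does: terms that occur in fewer than `min_df` documents are dropped from the vocabulary, and each document becomes a row of term counts.

```python
from collections import Counter

# Toy corpus, for illustration only
docs = ["ripe cherry fruit", "cherry oak", "fresh citrus fruit", "cherry tannin"]
min_df = 2  # keep terms that occur in at least 2 documents

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc.split()))
vocabulary = sorted(term for term, df in doc_freq.items() if df >= min_df)

# One row per document, one column per vocabulary term
matrix = [[doc.split().count(term) for term in vocabulary] for doc in docs]

print(vocabulary)  # ['cherry', 'fruit']
print(matrix)      # [[1, 1], [1, 0], [0, 1], [1, 0]]
```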

Store the labels that we want to predict in a separate variable.

In [None]:
y_training = training["country"]
y_training.value_counts()

A simple way to extend binary classification algorithms to the multi-class case is the so-called **one-vs-rest scheme**. The idea is to learn one binary classifier per class. To do so, we need to convert the multi-class labels into multiple binary labels (a document either belongs or does not belong to a given class).

In [None]:
label_bin = LabelBinarizer().fit(y_training)

In [None]:
y_training_bin = label_bin.transform(y_training)
y_training_bin
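
What the binarization produces can be sketched in a few lines of plain Python. The labels below are a toy example; the columns follow the sorted class names, which is also the order reported by `label_bin.classes_`.

```python
# Toy labels, for illustration only
labels = ["US", "France", "Italy", "US"]

# One column per class, in sorted order
classes = sorted(set(labels))

# One row per document: 1 in the column of its class, 0 elsewhere
binary = [[int(label == c) for c in classes] for label in labels]

print(classes)  # ['France', 'Italy', 'US']
print(binary)   # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Each column is the target of one binary "this class vs. all others" classifier.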

# Train classifier on training set

Use the `OneVsRestClassifier` wrapper to fit one logistic regression classifier per class. The term-document matrix holds the features and the binarized country of origin (i.e., `country` variable) represents the labels.

In [None]:
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_training, y_training_bin)
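
As a point of comparison, `OneVsRestClassifier` also accepts the string labels directly, and the vectorizer and classifier can be chained with the `Pipeline` class imported at the top. The sketch below uses a made-up toy corpus and is not the setup used in the rest of this notebook. With string labels, `predict` returns exactly one label per document and the predicted probabilities of a document sum to one.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# Toy corpus and labels, for illustration only
texts = [
    "ripe cherry and soft tannins",
    "cherry and leather",
    "crisp citrus and green apple",
    "citrus and mineral notes",
    "dark plum and spice",
    "plum and chocolate",
]
labels = ["Italy", "Italy", "Germany", "Germany", "Argentina", "Argentina"]

# Chain vectorization and one-vs-rest logistic regression
pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", OneVsRestClassifier(LogisticRegression(max_iter=1000))),
])
pipe.fit(texts, labels)

new_docs = ["bright cherry fruit"]
pred = pipe.predict(new_docs)         # one label per document
proba = pipe.predict_proba(new_docs)  # one row per document, one column per class
print(pipe.named_steps["clf"].classes_)
print(pred, proba)
```

A pipeline like this can be passed to `GridSearchCV` as a single estimator, which keeps the vectorizer from being fitted on validation data.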

Test whether the classifier works by predicting the country of origin of a short fake review. We apply the same NLP preprocessing steps and reuse the `count_vect` object to generate features in the same way as we did for the training set.

In [None]:
doc_new = {'index': [1], 
           'description': ['This is a good wine']}

doc_new_df = pd.DataFrame.from_dict(doc_new)

In [None]:
doc_new_df_prep = spacy_prep(doc_new_df)
doc_new_df_prep

Predict class membership. 

In [None]:
X_new = count_vect.transform(doc_new_df_prep["description_prep"])
predicted = clf.predict(X_new)
predicted

In [None]:
label_bin.classes_

In [None]:
label_bin.inverse_transform(predicted)

Instead of predicting hard membership, we can also predict the probabilities of the classes.

In [None]:
predicted_prob = clf.predict_proba(X_new)
print(label_bin.classes_)  # class names, in the column order of predicted_prob
print(predicted_prob)

In [None]:
label_bin.inverse_transform(predicted_prob)
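
Because the classifier was trained on binarized labels, each column is scored by its own binary model. The sketch below uses hypothetical probabilities to show the consequence: thresholding every class at 0.5 can leave a document without any label, whereas picking the class with the highest probability always yields exactly one.

```python
classes = ["France", "Italy", "US"]

# Hypothetical per-class probabilities for one document; they need not sum to 1
probs = [0.40, 0.15, 0.30]

# Hard membership: each binary classifier decides on its own
hard = [int(p > 0.5) for p in probs]

# Most probable class: index of the largest probability
best = classes[max(range(len(probs)), key=probs.__getitem__)]

print(hard)  # [0, 0, 0]
print(best)  # France
```

Note that an all-zero row of hard labels has no unique maximum, so an argmax-based decoding of such a row would simply fall back to the first class. Decoding the probabilities instead avoids this.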

# Evaluate accuracy on validation set

Let's evaluate the predictive accuracy of our model on the validation set.

In [None]:
validation = spacy_prep(validation)

In [None]:
X_validation = count_vect.transform(validation["description_prep"])
y_validation = validation["country"]
y_validation_bin = label_bin.transform(y_validation)

Call the predict function of our model with the validation data and calculate precision, recall and F1-score.

In [None]:
predictions_validation = clf.predict(X_validation)
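# target_names labels each row of the report with the country name
print(metrics.classification_report(y_validation_bin, predictions_validation, target_names=label_bin.classes_))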
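
For reference, the per-class numbers in such a report can be reproduced by hand. The following sketch computes precision, recall, and F1-score for a single class from toy label lists. The helper name `prf` and the data are made up for this illustration.

```python
def prf(y_true, y_pred, cls):
    """Return (precision, recall, F1) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)  # true positives
    fp = sum(t != cls and p == cls for t, p in pairs)  # false positives
    fn = sum(t == cls and p != cls for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels, for illustration only
y_true = ["US", "US", "France", "Italy", "US", "France"]
y_pred = ["US", "France", "France", "US", "US", "France"]

for cls in sorted(set(y_true)):
    print(cls, prf(y_true, y_pred, cls))
```

In this toy example, France reaches a recall of 1.0 (both French wines are found) but a precision of only 2/3, because one US wine is mislabeled as French.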

# Interpret model

Interpreting a one-vs-rest logistic regression classifier is a bit more complex than usual, as we have to inspect the coefficients of many models (i.e., one per class).

In [None]:
# Coefficients of the binary classifier for the class label_bin.classes_[6]
coeffs = clf.estimators_[6].coef_[0].tolist()

In [None]:
# Use get_feature_names() instead on scikit-learn versions older than 1.0
words = count_vect.get_feature_names_out()
words_with_coeffs = pd.DataFrame(coeffs, index=words, columns=["coeff"])

In [None]:
words_with_coeffs.sort_values("coeff", ascending=True).head(10)

In [None]:
words_with_coeffs.sort_values("coeff", ascending=False).head(10)
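
The ranking logic can also be sketched without pandas. The words and coefficients below are hypothetical. A positive coefficient raises the log-odds that a document belongs to the class, a negative one lowers them, and `exp(coeff)` gives the corresponding multiplicative change in the odds per additional occurrence of the word.

```python
import math

# Hypothetical vocabulary and coefficients of one binary classifier
words = ["cherry", "oak", "tannin", "citrus"]
coeffs = [1.2, -0.3, 0.8, -1.5]

# Sort word/coefficient pairs from most positive to most negative
ranked = sorted(zip(words, coeffs), key=lambda wc: wc[1], reverse=True)

top_positive = ranked[:2]        # strongest evidence for the class
top_negative = ranked[::-1][:2]  # strongest evidence against the class

print(top_positive)  # [('cherry', 1.2), ('tannin', 0.8)]
print(top_negative)  # [('citrus', -1.5), ('oak', -0.3)]

# Odds multiplier for one additional occurrence of each top word
odds = {word: math.exp(coeff) for word, coeff in top_positive}
print(odds)
```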