<a href="https://colab.research.google.com/github/olivermueller/aml4ta-2021/blob/main/Session_03/3_03_Classification_with_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>


In [None]:
# Set up Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
# Install packages
!pip install pymysql

# <font color="#003660">Week 3: Word Embeddings</font>

# <font color="#003660">Notebook 3: Classification with Word Embeddings</font>

<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/dag.png"/><br></center>

<p>
<center>
<div>
    <font color="#085986"><b>By the end of this lesson, you will be able to...</b><br><br>
        ... train a classifier with mean word embeddings as features.
    </font>
</div>
</center>
</p>

# Import packages

As always, we first need to load a number of required Python packages:
- `pandas` provides high-performance, easy-to-use data structures and data analysis tools.
- `numpy` is a library adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
- `SQLAlchemy`, together with `pymysql`, allows us to communicate with SQL databases.
- `getpass` provides a function for entering passwords without echoing them to the screen.
- `spacy` offers industrial-strength natural language processing.
- `en_core_web_md` is a pre-trained spaCy model that includes word embeddings.
- `sklearn` is the de-facto standard machine learning package in Python.

In [None]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import getpass
import spacy
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

# Load documents

We load our data from a MySQL database. For security reasons, we don't store the database credentials here; please have a look at Panda to get them.

In [None]:
# Get credentials
user = input("Username: ")
passwd = getpass.getpass("Password: ")
server = input("Server: ")
db = input("Database: ")

# Create an engine instance (SQLAlchemy)
engine = create_engine("mysql+pymysql://{}:{}@{}/{}".format(user, passwd, server, db))

# Define SQL query
sql_query = "SELECT * FROM WineDataset"

# Query dataset (pandas)
corpus = pd.read_sql(sql=sql_query, con=engine)
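
A password that contains characters such as `@` or `/` can break a connection string that is assembled with plain string formatting. The following is a minimal sketch of one way to guard against this, using `quote_plus` from the standard library to escape the user name and password. All credentials and the server name below are hypothetical and only serve as illustration.

```python
from urllib.parse import quote_plus

# Hypothetical credentials, for illustration only
example_user = "student"
example_passwd = "p@ss/word"
example_server = "db.example.org"
example_db = "wines"

# Escape special characters in user name and password before
# inserting them into the connection string
connection_url = "mysql+pymysql://{}:{}@{}/{}".format(
    quote_plus(example_user),
    quote_plus(example_passwd),
    example_server,
    example_db,
)
print(connection_url)
# mysql+pymysql://student:p%40ss%2Fword@db.example.org/wines
```

The resulting string could be passed to `create_engine` in the same way as above.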

In [None]:
corpus.shape

# Preprocess documents

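Split the data into training, validation, and test sets. Of the rows not flagged as test data, the first 80,000 form the training set and the next 20,000 form the validation set.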

In [None]:
training = corpus[corpus["testset"] == 0]
validation = training.iloc[80000:100000]
training = training.iloc[0:80000]
test = corpus[corpus["testset"] == 1]

In [None]:
print(training.shape)
print(validation.shape)
print(test.shape)

# Vectorize documents

Instead of using a BoW model, we will vectorize each document by averaging the word embeddings of all words in the document. For this, we will use the word embeddings we have trained on the wine dataset.
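
To make the idea of a mean word embedding concrete, here is a small sketch with NumPy. The three-dimensional vectors and the helper `mean_embedding` are made up for illustration; the real embeddings in this notebook have 300 dimensions. Treating unknown words as zero vectors is an assumption of this sketch.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings for a toy vocabulary (made-up numbers)
toy_embeddings = {
    "crisp": np.array([1.0, 0.0, 2.0]),
    "dry": np.array([0.0, 2.0, 2.0]),
    "finish": np.array([2.0, 1.0, 2.0]),
}

def mean_embedding(tokens, embeddings, dim=3):
    """Average the vectors of all tokens; unknown tokens count as zero vectors."""
    vectors = [embeddings.get(tok, np.zeros(dim)) for tok in tokens]
    if not vectors:
        # An empty document is mapped to the zero vector
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

doc_vector = mean_embedding(["crisp", "dry", "finish"], toy_embeddings)
print(doc_vector)  # [1. 1. 2.]
```

Every document is mapped to a vector of the same length, regardless of how many words it contains, which is what allows us to feed the result directly into a standard classifier.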

Convert the trained word embeddings to a format that spaCy understands.

In [None]:
!python -m spacy init-model en /content/gdrive/MyDrive/Colab_Notebooks/AMLTA2021/Session_03/data --vectors-loc /content/gdrive/MyDrive/Colab_Notebooks/AMLTA2021/Session_03/data/wine_300dim_10minwords_4context

Load the custom embeddings and extract the mean word embeddings for each document.

In [None]:
nlp = spacy.load("/content/gdrive/MyDrive/Colab_Notebooks/AMLTA2021/Session_03/data")

def spacy_prep(dataset):
    """Return one mean word embedding per document (row) in the dataset."""
    vectors = []
    for description in dataset["description"]:
        # doc.vector is the average of the document's token vectors
        doc = nlp(description)
        vectors.append(doc.vector)
    return np.array(vectors)

In [None]:
X_training = spacy_prep(training)

In [None]:
X_training.shape

Store the labels that we want to predict in a separate variable.

In [None]:
y_training = training["verygood"]
y_training.describe()

# Train classifier on training set

Fit a logistic regression classifier with the mean word embeddings as features and wine quality (i.e., the `verygood` variable) as the label.

In [None]:
clf = LogisticRegression(max_iter=1000).fit(X_training, y_training)

# Evaluate accuracy on validation set

Before predicting the labels for the official test set (on Kaggle), we evaluate the predictive accuracy of our model on the validation set.

In [None]:
X_validation = spacy_prep(validation)

In [None]:
y_validation = validation["verygood"]

Call the predict function of our model with the validation data and calculate precision, recall and F1-score.

In [None]:
predictions_validation = clf.predict(X_validation)
print(metrics.classification_report(y_validation, predictions_validation))
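
To see what the columns of the classification report mean, the following sketch computes precision, recall, and F1-score for the positive class by hand. The labels and predictions are made-up toy values, and `precision_recall_f1` is a hypothetical helper written for this illustration, not part of `sklearn`.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1-score for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels and predictions for eight wines
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 1, 0, 0, 1]

precision, recall, f1 = precision_recall_f1(toy_true, toy_pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

Here three of the four predicted positives are correct (precision 0.75), and three of the four actual positives are found (recall 0.75).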