# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>


# <font color="#003660">Session 1: Introduction to Natural Language Processing</font>

# <font color="#003660">Notebook 2: Binary Classification</font>

<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/dag.png"/><br></center>

<p>
<center>
<div>
    <font color="#085986"><b>By the end of this lesson, you will be able to...</b><br><br>
        ... transform raw text into a term-document matrix, <br>
        ... train a binary classifier on the term-document matrix, and <br> ... compete in a Kaggle competition.
    </font>
</div>
</center>
</p>

# Import packages

As always, we first need to load a number of required Python packages:
- `pandas` provides high-performance, easy-to-use data structures and data analysis tools.
- `SQLAlchemy`, together with `pymysql`, allows us to communicate with SQL databases.
- `getpass` provides functions for entering passwords safely.
- `spacy` offers industrial-strength natural language processing.
- `sklearn` is the de-facto standard machine learning package in Python.

In [22]:
import pandas as pd
import spacy
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

# Load documents

Load the wine reviews (Source: https://www.kaggle.com/datasets/zynicide/wine-reviews) from a CSV file.

In [23]:
corpus = pd.read_csv('winemag-data-130k-v2.csv')

In [24]:
# Rename the 'Unnamed: 0' column to 'index'
corpus.rename(columns = {'Unnamed: 0':'index'}, inplace = True)

In [25]:
corpus.head()

| | index | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | | Sicily & Sardinia | Etna | | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
| 1 | 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | | | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
| 2 | 2 | US | Tart and snappy, the flavors of lime flesh and... | | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
| 3 | 3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | | Alexander Peartree | | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
| 4 | 4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |


In [26]:
corpus.shape

(129971, 14)

# Preprocess documents

Create the binary response variable `verygood`: it is 1 if a wine scored more than 90 points and 0 otherwise.

In [32]:
corpus["verygood"] = 0
corpus.loc[corpus['points'] > 90, 'verygood'] = 1
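
The two-step assignment above can also be written as a single vectorized comparison. The sketch below uses a small made-up DataFrame (`toy` and its point values are illustrative and not taken from the wine data) to check that both forms give the same labels, and that a score of exactly 90 is labeled 0.

```python
import pandas as pd

# Hypothetical mini-corpus; the point values are made up for illustration
toy = pd.DataFrame({"points": [87, 90, 91, 95]})

# Two-step version, as in the cell above
toy["verygood"] = 0
toy.loc[toy["points"] > 90, "verygood"] = 1

# Equivalent one-liner: the boolean comparison is cast to 0/1
toy["verygood_alt"] = (toy["points"] > 90).astype(int)

print(toy)
print(toy["verygood"].mean())  # share of positive labels
```

Looking at the mean of the label is a quick way to see how balanced the two classes are before training a classifier.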

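Split the data by row position into a training set (the first 80,000 rows), a validation set (the next 20,000 rows), and a test set (the remaining rows).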

In [33]:
training = corpus.iloc[0:80000,]
validation = corpus.iloc[80000:100000,]
test = corpus.iloc[100000:,]

In [34]:
print(training.shape)
print(validation.shape)
print(test.shape)

(80000, 15)
(20000, 15)
(29971, 15)
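
Because the split is purely positional, it can be worth checking that the three parts are disjoint, cover every row, and have a similar share of positive labels. The following is a minimal sketch on a made-up 10-row DataFrame; `toy`, its labels, and the cut points 6 and 8 are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the corpus: 10 rows with a binary label
toy = pd.DataFrame({"index": range(10),
                    "verygood": [0, 1, 0, 0, 1, 1, 0, 0, 1, 0]})

# Same pattern as above, with smaller cut points (6 / 2 / 2 rows)
toy_training = toy.iloc[0:6]
toy_validation = toy.iloc[6:8]
toy_test = toy.iloc[8:]

# The three parts should be disjoint and together cover every row
all_ids = pd.concat([toy_training, toy_validation, toy_test])["index"]
print(all_ids.is_unique, len(all_ids) == len(toy))

# Share of positive labels per split; large differences would suggest
# that the file order is not random
for name, part in [("training", toy_training),
                   ("validation", toy_validation),
                   ("test", toy_test)]:
    print(name, part.shape, part["verygood"].mean())
```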


In [35]:
# Save the indices and true labels of the test set to a CSV file
test[["index", "verygood"]].to_csv('solution.csv', index=False)

Perform standard NLP preprocessing steps on the training set using spaCy. To speed things up, we disable some components of spaCy's standard NLP pipeline.

In [20]:
# Load the small English model; NER and the dependency parser are not needed here
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

def spacy_prep_df(corpus):
    """Return a new DataFrame with an added `description_prep` column."""
    corpus = corpus.to_dict("records")
    for entry in corpus:
        doc = nlp(entry["description"])
        # Keep the lowercased lemmas of alphabetic tokens that are not stop words
        tokens_to_keep = [token.lemma_.lower() for token in doc
                          if token.is_alpha and not token.is_stop]
        entry["description_prep"] = " ".join(tokens_to_keep)
    return pd.DataFrame(corpus)

In [21]:
training = spacy_prep_df(training)


Display the first few rows of the preprocessed descriptions.

In [None]:
training["description_prep"].head()

# Vectorize documents

Vectorization is the process of turning a collection of text documents into numerical feature vectors.

We will use the **Bag of Words (BoW)** model for vectorization. In the BoW model, a corpus of documents is represented by a matrix with one row per document and one column per word occurring in the corpus. The cell values will either be simple frequency counts (How often does a word appear in a document?), or the term frequency (tf) times the inverse document frequency (idf) of a term. The idea of tf-idf is to scale down the impact of words that occur very frequently in a given corpus and that are therefore less informative than features that occur only in a small fraction of the corpus. Note that the BoW model completely ignores information about the position and sequences of the words in the document.
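
As a reference point, one common textbook form of the tf-idf weight is sketched below. The symbols $N$ (the number of documents in the corpus) and $\mathrm{df}(t)$ (the number of documents that contain term $t$) are notation introduced here. Libraries differ in details such as smoothing and normalization, so the values computed by `sklearn` will generally not match this formula exactly.

$$
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)}
$$

Under this definition, a term that appears in every document has $\mathrm{df}(t) = N$ and therefore a weight of $\log 1 = 0$, while a term that appears in only a few documents keeps a large weight.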

In `sklearn`, the [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) creates a term-document matrix with raw term counts, and the [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer) creates a term-document matrix with tf-idf weighting.

In [12]:
count_vect = CountVectorizer(min_df=10)

Apply the CountVectorizer object to the review texts of the training set.

In [13]:
X_training = count_vect.fit_transform(training["description_prep"].tolist())

Display the shape of the generated term-document matrix and an extract of it.

In [None]:
X_training.shape

In [None]:
X_training[0:20,0:20].todense()
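
The vectorizer returns a SciPy sparse matrix, which stores only the non-zero cells. The following sketch, on a made-up corpus with the illustrative names `docs` and `X_demo`, shows how to check the shape and the share of non-zero cells before converting a small slice to a dense array.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

docs = ["cherry oak", "plum spice", "cherry vanilla"]
X_demo = CountVectorizer().fit_transform(docs)

print(sparse.issparse(X_demo))   # stored in a SciPy sparse format
print(X_demo.shape)              # (documents, vocabulary terms)
print(X_demo.nnz)                # number of stored non-zero cells

# Share of non-zero cells; real term-document matrices are usually far sparser
density = X_demo.nnz / (X_demo.shape[0] * X_demo.shape[1])
print(round(density, 2))

# Converting a small slice to a dense array is fine for inspection
print(X_demo[0:2, 0:3].toarray())
```

Converting the full training matrix to a dense array would allocate one cell per document and term, so it is usually better to densify only small slices.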

Store the labels that we want to predict in a separate variable.

In [None]:
y_training = training["verygood"]
y_training.describe()

# Train classifier on training set

Fit a logistic regression classifier with the term-document matrix as the features and the wine quality (i.e., the `verygood` variable) as the label.

In [17]:
clf = LogisticRegression(max_iter=1000).fit(X_training, y_training)

Test whether the classifier is working by predicting the quality of a short fake review. We apply the same NLP preprocessing steps and reuse the `count_vect` object to generate features in the same way as we did for the training set.

In [18]:
doc_new = {'index': [1],
           'description': ['This is a spectacular, magnificent, and majestic wine. Awesome!']}

doc_new_df = pd.DataFrame.from_dict(doc_new)

In [None]:
doc_new_df_prep = spacy_prep_df(doc_new_df)
doc_new_df_prep

In [None]:
X_new = count_vect.transform(doc_new_df_prep["description_prep"])
predicted = clf.predict(X_new)
predicted

Instead of predicting binary labels, we can also predict probabilities of the classes.

In [None]:
predicted_prob = clf.predict_proba(X_new)
print(clf.classes_)
print(predicted_prob)
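
The relationship between `predict` and `predict_proba` can be checked on a toy model. The sketch below trains on four made-up texts (`texts`, `labels`, `vect_demo`, and `clf_demo` are illustrative names, not objects from this notebook) and confirms that each probability row sums to 1 and that the predicted label is the class with the highest probability.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training texts and labels (1 = very good, 0 = not)
texts = ["great rich wine", "great long finish",
         "flat thin wine", "flat short finish"]
labels = [1, 1, 0, 0]

vect_demo = CountVectorizer()
clf_demo = LogisticRegression(max_iter=1000).fit(vect_demo.fit_transform(texts), labels)

X_demo = vect_demo.transform(["great wine", "flat wine"])
proba = clf_demo.predict_proba(X_demo)

print(clf_demo.classes_)   # column order of the probability matrix
print(proba.round(2))
print(proba.sum(axis=1))   # each row sums to 1

# predict() returns the class with the highest probability in each row
labels_from_proba = clf_demo.classes_[np.argmax(proba, axis=1)]
print(labels_from_proba, clf_demo.predict(X_demo))
```

Working with probabilities instead of labels makes it possible to choose a decision threshold other than 0.5, for example to trade recall for precision.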

# Evaluate accuracy on validation set

Before trying to predict the labels for the official test set, we evaluate the predictive accuracy of our model on the validation set. Again, we apply the same NLP preprocessing steps, reuse the `count_vect` object, and store `X` and `y` in separate data structures.

In [22]:
validation = spacy_prep_df(validation)

In [23]:
X_validation = count_vect.transform(validation["description_prep"])
y_validation = validation["verygood"]

Call the predict function of our model with the validation data and calculate precision, recall and F1-score.

In [None]:
predictions_validation = clf.predict(X_validation)
print(metrics.classification_report(y_validation, predictions_validation))
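
To see where the numbers in the report come from, the sketch below computes precision, recall, and F1-score for the positive class by hand and compares them with `sklearn`. The label vectors `y_true` and `y_pred` are made up for illustration.

```python
from sklearn import metrics

# Made-up true labels and predictions
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# Count the confusion-matrix cells for the positive class (label 1)
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)

# The same numbers from sklearn
print(metrics.precision_score(y_true, y_pred),
      metrics.recall_score(y_true, y_pred),
      metrics.f1_score(y_true, y_pred))
```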

# Interpret model

Logistic regression is typically not the most accurate classification model, but one big advantage is that it can be interpreted by looking at the coefficients of the input features.

In [25]:
coeffs = clf.coef_[0].tolist()
words = count_vect.get_feature_names_out()
words_with_coeffs = pd.DataFrame(coeffs, words, columns=["coeff"])

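These are the words with the most *negative* impact, i.e., the most negative coefficients. Their presence lowers the predicted probability of a very good wine.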

In [None]:
words_with_coeffs.sort_values("coeff", ascending=True).head(10)

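And these are the words with the most *positive* impact, i.e., the largest positive coefficients. Their presence raises the predicted probability of a very good wine.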

In [None]:
words_with_coeffs.sort_values("coeff", ascending=False).head(10)

# Make predictions on test set

Preprocess and vectorize the review texts of the test set.

In [28]:
test = spacy_prep_df(test)

In [29]:
X_test = count_vect.transform(test["description_prep"])
predictions_test = clf.predict(X_test)

Create a dataframe with the indices and predictions and save it as a CSV file (which we can upload to Kaggle).

In [30]:
my_submission = pd.DataFrame({'index': test["index"],
                              'verygood': predictions_test})

In [None]:
my_submission.head()

In [32]:
my_submission.to_csv("my_submission.csv", index=False)

# Define a pipeline and tune the model

Typically, we want to try out different preprocessing strategies and/or different classification algorithms. The concept of a **pipeline** in `sklearn` is very useful to streamline this process.

The purpose of a pipeline is to bundle several steps that can be cross-validated together while setting different parameters. To make this possible, a parameter of a step is addressed by the step name and the parameter name, joined by a double underscore (`__`), as in the example below.

In [33]:
from sklearn.ensemble import RandomForestClassifier

clf_pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', RandomForestClassifier()),
])

In [34]:
parameters = {
    'vect__min_df': (10,100)
}
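
The same `<step name>__<parameter name>` convention also works outside of a grid search. The sketch below sets two parameters directly with `set_params`. Note that it uses a `LogisticRegression` as the classifier step instead of the random forest above, and that `pipe_demo` and the chosen values are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe_demo = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression()),
])

# <step name>__<parameter name> addresses a parameter of one step
pipe_demo.set_params(vect__min_df=2, clf__C=0.5)

print(pipe_demo.named_steps['vect'].min_df)
print(pipe_demo.named_steps['clf'].C)

# All addressable names can be listed with get_params()
print(sorted(k for k in pipe_demo.get_params() if k.startswith('vect__'))[:5])
```

Listing the keys of `get_params()` is a convenient way to find the exact names that can be used in a parameter grid.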

With the pipeline and its parameters, it is possible to run an exhaustive search of the best parameters on a grid of possible values and evaluate their effects on the predictive accuracy using k-fold cross validation.

In [35]:
clf_pipe_gs = GridSearchCV(clf_pipe, parameters, cv=3, scoring="f1_macro", n_jobs=-1)

In [36]:
clf_pipe_gs = clf_pipe_gs.fit(training["description_prep"], training["verygood"])

In [None]:
pd.DataFrame(clf_pipe_gs.cv_results_)
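
The full `cv_results_` table has many columns. The sketch below shows how to pull out the best setting and a compact summary. It runs on a tiny made-up dataset and, as an assumption made for speed and reproducibility, swaps the random forest for a `LogisticRegression`; the names `texts`, `labels`, `pipe_demo`, and `grid_demo` are illustrative.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up, perfectly separable data: 4 positive and 4 negative texts
texts = ["great rich wine", "flat thin wine"] * 4
labels = [1, 0] * 4

pipe_demo = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression(max_iter=1000)),
])
grid_demo = GridSearchCV(pipe_demo, {'vect__min_df': (1, 2)},
                         cv=2, scoring="f1_macro")
grid_demo.fit(texts, labels)

print(grid_demo.best_params_)   # parameter setting with the best mean score
print(grid_demo.best_score_)    # mean cross-validated score of that setting

# One row per parameter setting, reduced to the most useful columns
results = pd.DataFrame(grid_demo.cv_results_)
print(results[["param_vect__min_df", "mean_test_score", "rank_test_score"]])
```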

After the grid search has finished, `GridSearchCV` by default refits the pipeline with the best parameter values on the full training data. We can therefore use the fitted search object just like a normal model (e.g., call its `predict` method with new data).

In [None]:
predictions_validation = clf_pipe_gs.predict(validation["description_prep"])
print(metrics.classification_report(validation["verygood"], predictions_validation))

In [39]:
predictions_test = clf_pipe_gs.predict(test["description_prep"])
my_submission = pd.DataFrame({'index': test["index"],
                              'verygood':predictions_test})
my_submission.to_csv("my_submission.csv", index=False)

For more tips and tricks on parameter tuning using grid search for text data, see: [https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)