# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>


# <font color="#003660">Session 1: Introduction to Natural Language Processing</font>

# <font color="#003660">Notebook 1: Document Classification</font>

<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/dag.png"/><br></center>

<p>
<center>
<div>
    <font color="#085986"><b>By the end of this lesson, you will be able to...</b><br><br>
        ... transform raw text into a term-document matrix, <br>
        ... train a binary classifier on the term-document matrix, and <br>
        ... compete in a Kaggle competition.
    </font>
</div>
</center>
</p>

# Import packages

As always, we first need to load a number of required Python packages:
- `pandas` provides high-performance, easy-to-use data structures and data analysis tools.
- `spacy` offers industrial-strength natural language processing.
- `sklearn` is the de-facto standard machine learning package in Python.

In [None]:
import pandas as pd
import spacy
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

# Load documents

Load the wine reviews (source: https://www.kaggle.com/datasets/zynicide/wine-reviews) from a CSV file.

In [None]:
corpus = pd.read_csv("https://raw.githubusercontent.com/olivermueller/amlta-2025/main/Session_01/winemag-data-130k-v2.csv")
corpus.rename(columns = {'Unnamed: 0':'index'}, inplace = True)

In [None]:
corpus.head()

In [None]:
corpus.shape

# Preprocess documents

Create the binary response variable `verygood`: it is 1 for wines rated above 90 points and 0 otherwise.

In [None]:
corpus["verygood"] = 0
corpus.loc[corpus['points'] > 90, 'verygood'] = 1
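
The same label can also be derived in a single vectorized step. The following is a minimal sketch on a made-up `toy` frame (the scores are invented for illustration); it shows that a wine with exactly 90 points is not labeled very good because the comparison is strict.

```python
import pandas as pd

# Hypothetical scores, for illustration only (not taken from the dataset).
toy = pd.DataFrame({"points": [88, 90, 91, 95]})

# Strict comparison: 90 points stays 0, 91 points and above becomes 1.
toy["verygood"] = (toy["points"] > 90).astype(int)

print(toy["verygood"].tolist())  # [0, 0, 1, 1]
```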

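Split the data by row position into training, validation, and test sets: the first 80,000 rows are used for training, the next 20,000 for validation, and the remaining rows for testing.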

In [None]:
training = corpus.iloc[0:80000,]
validation = corpus.iloc[80000:100000,]
test = corpus.iloc[100000:,]

In [None]:
print(training.shape)
print(validation.shape)
print(test.shape)

Perform standard NLP preprocessing steps on the training set using spaCy. spaCy is an open-source library for Natural Language Processing (NLP) in Python. It helps you build NLP applications that process and understand large volumes of unstructured text. One of the main features of spaCy is its linguistic annotations, which give you insights into a text’s grammatical structure (e.g., word order, types of words, parts of speech, grammatical roles and relations).

At the center of spaCy is the processing pipeline, an object which is usually called `nlp`. The pipeline is built on top of a language-specific machine learning model and a set of handcrafted rules.

The pipeline contains different components, each specialized for a specific NLP task.

[More...](https://spacy.io/usage/spacy-101#whats-spacy)

<center><br><img src="https://d33wubrfki0l68.cloudfront.net/3ad0582d97663a1272ffc4ccf09f1c5b335b17e9/7f49c/pipeline-fde48da9b43661abcdf62ab70a546d71.svg"/><br></center>
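
To get a feeling for token-level annotations, here is a minimal sketch that assumes only that the `spacy` package is installed. It uses `spacy.blank("en")`, a tokenizer-only English pipeline that needs no model download, so it is not the pipeline used in this notebook and it has no lemmatizer. The sentence and the names `nlp_blank` and `kept` are made up for illustration.

```python
import spacy

# spacy.blank("en") builds a tokenizer-only English pipeline; unlike
# "en_core_web_sm" it has no tagger, parser, or lemmatizer components.
nlp_blank = spacy.blank("en")
print(nlp_blank.pipe_names)  # empty list: no components after the tokenizer

doc = nlp_blank("This is a spectacular wine, rated 95 points!")

# Keep alphabetic tokens that are not stop words, lower-cased.
# Punctuation and the number 95 fail is_alpha; "this", "is", "a" are stop words.
kept = [t.lower_ for t in doc if t.is_alpha and not t.is_stop]
print(kept)
```

The filter on `is_alpha` and `is_stop` is the same one applied in the preprocessing function of this notebook; only the lemmatization step is missing here.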

In [None]:
nlp = spacy.load("en_core_web_sm", exclude=["ner", "parser", "textcat"])

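def spacy_prep_df(df, text_col="description", batch_size=1000, n_process=4):
    """Return a copy of df with an added <text_col>_prep column of cleaned text."""
    # Replace missing or empty values with "" so that nlp.pipe only receives strings.
    texts = (str(x) if pd.notna(x) and x else "" for x in df[text_col].values)
    results = []
    # nlp.pipe processes the texts in batches, spread over n_process worker processes.
    for doc in nlp.pipe(texts, batch_size=batch_size, n_process=n_process):
        # Keep the lower-cased lemmas of alphabetic tokens that are not stop words;
        # an empty document simply yields an empty string.
        toks = [t.lemma_.lower() for t in doc if t.is_alpha and not t.is_stop]
        results.append(" ".join(toks))
    out = df.copy()
    out[text_col + "_prep"] = results
    return out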


In [None]:
training = spacy_prep_df(training)

Display the first couple of lines of the preprocessed descriptions.

In [None]:
training["description_prep"].head()

# Vectorize documents

Vectorization is the process of turning a collection of text documents into numerical feature vectors.

We will use the **Bag of Words (BoW)** model for vectorization. In the BoW model, a corpus of documents is represented by a matrix with one row per document and one column per word occurring in the corpus. The cell values will either be simple frequency counts (How often does a word appear in a document?), or the term frequency (tf) times the inverse document frequency (idf) of a term. The idea of tf-idf is to scale down the impact of words that occur very frequently in a given corpus and that are therefore less informative than features that occur only in a small fraction of the corpus. Note that the BoW model completely ignores information about the position and sequences of the words in the document.

In `sklearn`, the [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) creates a term-document matrix with raw term counts, and the [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer) creates a term-document matrix with tf-idf weighting.

In [None]:
count_vect = CountVectorizer(min_df=10)

Apply the CountVectorizer object to the review texts of the training set.

In [None]:
X_training = count_vect.fit_transform(training["description_prep"])

Display the shape and an extract of the generated term-document matrix.

In [None]:
X_training.shape

In [None]:
X_training[0:20,0:20].todense()

Store the labels that we want to predict in a separate variable.

In [None]:
y_training = training["verygood"]
y_training.describe()

# Train classifier on training set

Fit a logistic regression classifier with the term-document matrix as the features and the wine quality (i.e., the `verygood` variable) as the label.

In [None]:
clf = LogisticRegression(max_iter=1000).fit(X_training, y_training)

Test whether the classifier is working by predicting the quality of a short fake review. We apply the same NLP preprocessing steps and reuse the `count_vect` object to generate features in the same way as we did for the training set.

In [None]:
doc_new = {'description': ['This is a spectacular, magnificent, and majestic wine. Awesome!']}
doc_new_df = pd.DataFrame.from_dict(doc_new)

In [None]:
doc_new_df_prep = spacy_prep_df(doc_new_df)
doc_new_df_prep

In [None]:
X_new = count_vect.transform(doc_new_df_prep["description_prep"])
predicted = clf.predict(X_new)
predicted

Instead of predicting binary labels, we can also predict probabilities of the classes.

In [None]:
predicted_prob = clf.predict_proba(X_new)
print(clf.classes_)
print(predicted_prob)
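
The relation between the two kinds of output can be checked on a toy problem. The sketch below fits a logistic regression to made-up one-dimensional data (all names and values are illustrative) and confirms that `predict` returns the class with the larger predicted probability.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up one-dimensional data: larger x tends to mean class 1.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf_toy = LogisticRegression().fit(X_toy, y_toy)
proba = clf_toy.predict_proba(X_toy)  # one column per class, ordered as clf_toy.classes_

# Each row sums to 1; the predicted label is the class with the larger probability.
labels_from_proba = clf_toy.classes_[proba.argmax(axis=1)]
print(np.array_equal(labels_from_proba, clf_toy.predict(X_toy)))  # True
```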

# Evaluate accuracy on validation set

Before trying to predict the labels for the official test set, we evaluate the predictive accuracy of our model on the validation set. Again, we apply the same NLP preprocessing steps, reuse the `count_vect` object, and store `X` and `y` in separate data structures.

In [None]:
validation = spacy_prep_df(validation)

In [None]:
X_validation = count_vect.transform(validation["description_prep"])
y_validation = validation["verygood"]

Call the predict function of our model with the validation data and calculate precision, recall and F1-score.

In [None]:
predictions_validation = clf.predict(X_validation)
print(metrics.classification_report(y_validation, predictions_validation))
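
To make the numbers in the report concrete, the following sketch computes precision, recall, and F1-score for the positive class by hand on made-up labels and compares the results with `sklearn.metrics`. The label vectors are invented for illustration.

```python
from sklearn import metrics

# Made-up labels: 1 = very good, 0 = not very good.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of both

print(precision, recall, f1)  # 0.75 0.75 0.75
print(metrics.precision_score(y_true, y_pred),
      metrics.recall_score(y_true, y_pred),
      metrics.f1_score(y_true, y_pred))
```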

# Interpret model

Logistic regression is typically not the most accurate classification model, but one big advantage is that it can be interpreted by looking at the coefficients of the input features.

In [None]:
coeffs = clf.coef_[0].tolist()
words = count_vect.get_feature_names_out()
words_with_coeffs = pd.DataFrame(coeffs, words, columns=["coeff"])

These are the words with the most *negative* impact, i.e., the words whose presence most strongly lowers the predicted probability of a very good wine.

In [None]:
words_with_coeffs.sort_values("coeff", ascending=True).head(10)

And these are the words with the most *positive* impact, i.e., the words whose presence most strongly raises the predicted probability of a very good wine.

In [None]:
words_with_coeffs.sort_values("coeff", ascending=False).head(10)
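
A coefficient in a logistic regression is the change in the log-odds of the positive class for one additional occurrence of the word, holding all other features fixed. Exponentiating it gives an odds ratio, which is often easier to read. The sketch below uses made-up words and coefficient values for illustration; the real values would come from `clf.coef_[0]`.

```python
import numpy as np
import pandas as pd

# Made-up coefficients for illustration only.
toy_coeffs = pd.DataFrame({"coeff": [0.7, 0.0, -0.7]},
                          index=["gorgeous", "wine", "simple"])

# exp(coefficient) is the multiplicative change in the odds of the positive
# class for one additional occurrence of the word.
toy_coeffs["odds_ratio"] = np.exp(toy_coeffs["coeff"])
print(toy_coeffs.round(2))
```

An odds ratio above 1 raises the odds of a very good wine, a value below 1 lowers them, and a coefficient of 0 (odds ratio 1) leaves them unchanged.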

# Make predictions on test set

Preprocess and vectorize the review texts of the test set.

In [None]:
test = spacy_prep_df(test)

In [None]:
X_test = count_vect.transform(test["description_prep"])
predictions_test = clf.predict(X_test)

Create a dataframe with the indices and predictions and save it as a CSV file (which we can upload to Kaggle).

In [None]:
my_submission = pd.DataFrame({'index': test["index"],
                              'verygood': predictions_test})

In [None]:
my_submission.head()

In [None]:
my_submission.to_csv("my_submission.csv", index=False)