# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>


# <font color="#003660">Session 1: Introduction to Natural Language Processing</font>

# <font color="#003660">Notebook 3: Regression</font>

<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/dag.png"/><br></center>

<p>
<center>
<div>
    <font color="#085986"><b>By the end of this lesson, you will be able to...</b><br><br>
        ... transform raw text into a term-document matrix, <br>
        ... train a regression model on the term-document matrix, and <br>
        ... compete in a Kaggle competition.
    </font>
</div>
</center>
</p>

# Import packages

As always, we first need to load a number of required Python packages:
- `pandas` provides high-performance, easy-to-use data structures and data analysis tools.
- `spacy` offers industrial-strength natural language processing.
- `sklearn` (scikit-learn) is the de facto standard machine learning package in Python.

In [None]:
import pandas as pd
import spacy
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

# Load documents

Load the wine reviews (Source: https://www.kaggle.com/datasets/zynicide/wine-reviews) from a CSV file.

In [None]:
corpus = pd.read_csv("https://raw.githubusercontent.com/olivermueller/amlta-2025/main/Session_01/winemag-data-130k-v2.csv")
corpus.rename(columns = {'Unnamed: 0':'index'}, inplace = True)

In [None]:
corpus.head()

In [None]:
corpus.shape

# Preprocess documents

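Split the data into training, validation, and test sets by row position: the first 80,000 reviews for training, the next 20,000 for validation, and the remaining rows for testing.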

In [None]:
training = corpus.iloc[0:80000,]
validation = corpus.iloc[80000:100000,]
test = corpus.iloc[100000:,]

In [None]:
print(training.shape)
print(validation.shape)
print(test.shape)
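
The split above is purely positional, so it depends on the row order of the CSV file. As a minimal sketch on a hypothetical toy frame (the names `toy`, `toy_train`, `toy_val`, and `toy_test` are illustrative and not part of this notebook), this is how `iloc` slices partition the rows, with an optional shuffle beforehand in case the row order is suspected to matter:

```python
import pandas as pd

# Hypothetical toy frame standing in for the wine reviews
toy = pd.DataFrame({
    "description": [f"review {i}" for i in range(10)],
    "points": list(range(80, 90)),
})

# Optional step (an assumption, not done in this notebook):
# shuffle the rows reproducibly before splitting
toy_shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)

# Positional slices: first 6 rows, next 2 rows, remaining rows
toy_train = toy_shuffled.iloc[0:6]
toy_val = toy_shuffled.iloc[6:8]
toy_test = toy_shuffled.iloc[8:]

print(len(toy_train), len(toy_val), len(toy_test))
```

Every row lands in exactly one of the three sets, because the slices are contiguous and do not overlap.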

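Perform standard NLP preprocessing steps on the training set using spaCy: tokenize each review, drop non-alphabetic tokens and stop words, and keep the lowercased lemmas of the remaining tokens. To speed things up, we disable the named-entity recognizer and the dependency parser of spaCy's standard NLP pipeline, since neither is needed for these steps.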

In [None]:
def spacy_prep_df(corpus):
    # Load the small English model without the components we do not need
    nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])
    docs = corpus.to_dict("records")
    for entry in docs:
        text = entry["description"]
        # Treat missing (NaN) and empty descriptions as empty documents
        if pd.notna(text) and text:
            doc = nlp(str(text))
            # Keep lowercased lemmas of alphabetic, non-stop-word tokens
            tokens_prep = [token.lemma_.lower() for token in doc
                           if token.is_alpha and not token.is_stop]
            entry["description_prep"] = " ".join(tokens_prep)
        else:
            entry["description_prep"] = ""
    return pd.DataFrame(docs)


In [None]:
training = spacy_prep_df(training)

Display the first few rows of the preprocessed descriptions.

In [None]:
training["description_prep"].head()

# Vectorize documents

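Vectorize using a simple `CountVectorizer`, which counts how often each term occurs in each document. Setting `min_df=10` drops terms that appear in fewer than 10 documents, which keeps the vocabulary manageable.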

In [None]:
count_vect = CountVectorizer(min_df=10)

Apply the CountVectorizer object to the review texts of the training set.

In [None]:
X_training = count_vect.fit_transform(training["description_prep"].tolist())

Store the labels that we want to predict in a separate variable.

In [None]:
y_training = training["points"]
y_training.describe()

# Train regressor on training set

Fit a linear regression model with the term-document matrix as the features and the numeric wine quality (i.e., the `points` variable) as the label.

In [None]:
reg = LinearRegression().fit(X_training, y_training)

Test whether the model works by predicting the quality of a short, made-up review.

In [None]:
doc_new = {'description': ['This is a good wine']}
doc_new_df = pd.DataFrame.from_dict(doc_new)

In [None]:
doc_new_df_prep = spacy_prep_df(doc_new_df)
doc_new_df_prep

In [None]:
X_new = count_vect.transform(doc_new_df_prep["description_prep"])
predicted = reg.predict(X_new)
predicted

# Evaluate accuracy on test set

In [None]:
test = spacy_prep_df(test)

In [None]:
X_test = count_vect.transform(test["description_prep"])
y_test = test["points"]

Before calculating the predictions of our model, let's first create a simple benchmark (i.e., always predicting the mean points of the training set).

In [None]:
print(metrics.mean_absolute_error(y_test, [y_training.mean()]*len(y_test)))
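
The mean absolute error (MAE) is the average of the absolute differences between true and predicted values. The sketch below recomputes the mean benchmark on hypothetical toy numbers in two ways: by hand with NumPy, and with scikit-learn's `DummyRegressor`, which is a swapped-in alternative to the list-multiplication approach used above. All `toy_*` names and values are illustrative:

```python
import numpy as np
from sklearn import metrics
from sklearn.dummy import DummyRegressor

# Hypothetical toy labels
toy_y_train = np.array([85.0, 90.0, 95.0])
toy_y_test = np.array([88.0, 92.0])

# Manual benchmark: always predict the training mean
mae_manual = np.mean(np.abs(toy_y_test - toy_y_train.mean()))

# Same benchmark via DummyRegressor; it ignores the features,
# so placeholder feature matrices of zeros are sufficient
baseline = DummyRegressor(strategy="mean").fit(np.zeros((3, 1)), toy_y_train)
baseline_pred = baseline.predict(np.zeros((2, 1)))
mae_sklearn = metrics.mean_absolute_error(toy_y_test, baseline_pred)

print(mae_manual, mae_sklearn)
```

Both computations give an MAE of 2.0 here, since the training mean is 90 and each test label is 2 points away from it.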

Call the `predict` function of our model with the test data and calculate the MAE.

In [None]:
predictions_test = reg.predict(X_test)
print(metrics.mean_absolute_error(y_test, predictions_test))
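
The `Pipeline` class imported at the top is not used in the steps above. As a hedged sketch of how vectorization and regression could be bundled into a single estimator, the following example fits a pipeline on hypothetical toy data. All `toy_*` names are illustrative, and the spaCy preprocessing step is left out for brevity:

```python
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data (chosen so that a linear fit is exact)
toy_train_docs = ["elegant", "thin", "elegant elegant", "thin thin"]
toy_train_points = [92.0, 84.0, 96.0, 80.0]
toy_test_docs = ["elegant thin", "thin"]
toy_test_points = [88.0, 84.0]

# Chain vectorizer and regressor; fit() learns the vocabulary and the
# coefficients, predict() applies both steps to new documents
toy_pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("reg", LinearRegression()),
])
toy_pipe.fit(toy_train_docs, toy_train_points)

toy_predictions = toy_pipe.predict(toy_test_docs)
toy_mae = metrics.mean_absolute_error(toy_test_points, toy_predictions)
print(toy_mae)
```

One advantage of this design is that the vocabulary is always learned from the data passed to `fit` only, which makes it harder to leak information from the validation or test set into the features by accident.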