# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>


# <font color="#003660">Session 1: Introduction to Natural Language Processing</font>

# <font color="#003660">Notebook 3: Regression</font>

<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/dag.png"/><br></center>

<p>
<center>
<div>
    <font color="#085986"><b>By the end of this lesson, you will be able to...</b><br><br>
        ... transform raw text into a term-document matrix, <br>
        ... train a regression model on the term-document matrix, and <br>
        ... compete in a Kaggle competition.
    </font>
</div>
</center>
</p>

# Import packages

As always, we first need to load a number of required Python packages:
- `pandas` provides high-performance, easy-to-use data structures and data analysis tools.
- `spacy` offers industrial-strength natural language processing.
- `sklearn` (scikit-learn) is the de facto standard machine learning package in Python.

In [1]:
import pandas as pd
import spacy
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

# Load documents

Load the wine reviews (source: https://www.kaggle.com/datasets/zynicide/wine-reviews) from a CSV file.

In [2]:
corpus = pd.read_csv("https://raw.githubusercontent.com/olivermueller/amlta-2024/main/Session_01/winemag-data-130k-v2.csv")

In [3]:
# Rename the unnamed first column (the original row number) to 'index'
corpus.rename(columns = {'Unnamed: 0':'index'}, inplace = True)

In [4]:
corpus.head()

| | index | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | | Sicily & Sardinia | Etna | | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
| 1 | 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | | | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
| 2 | 2 | US | Tart and snappy, the flavors of lime flesh and... | | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
| 3 | 3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | | Alexander Peartree | | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
| 4 | 4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |


In [5]:
corpus.shape

(129971, 14)

# Preprocess documents

Split data into training, validation, and test set.

In [6]:
training = corpus.iloc[0:80000,]
validation = corpus.iloc[80000:100000,]
test = corpus.iloc[100000:,]

In [7]:
print(training.shape)
print(validation.shape)
print(test.shape)

(80000, 14)
(20000, 14)
(29971, 14)
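
The split above slices the rows by position, so it implicitly relies on the file order being unrelated to the score. As a hedged alternative, the sketch below shows the same 80,000 / 20,000 / remainder split after a seeded shuffle. It works on plain row indices rather than on the `corpus` DataFrame, and the helper name `shuffled_split` and the seed are illustrative assumptions, not part of the original notebook.

```python
import random

def shuffled_split(n_rows, n_train, n_val, seed=42):
    """Return three disjoint lists of row indices (train, validation, test)."""
    indices = list(range(n_rows))
    # Seeded shuffle so that the split is reproducible
    random.Random(seed).shuffle(indices)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = shuffled_split(129971, 80000, 20000)
print(len(train_idx), len(val_idx), len(test_idx))  # 80000 20000 29971
```

With pandas, each index list could then be passed to `corpus.iloc[...]` to select the corresponding rows.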


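Perform standard NLP preprocessing steps on the training set using spaCy: keep only alphabetic tokens that are not stop words, and replace each of them with its lowercased lemma. To speed things up, we disable the components of spaCy's standard NLP pipeline that are not needed here (named entity recognition and dependency parsing).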

In [8]:
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

def spacy_prep_df(corpus):
    """Return a new DataFrame with an added, preprocessed `description_prep` column."""
    records = corpus.to_dict("records")
    for entry in records:
        doc = nlp(entry["description"])
        # Keep alphabetic, non-stop-word tokens as lowercased lemmas
        tokens_to_keep = [
            token.lemma_.lower()
            for token in doc
            if token.is_alpha and not token.is_stop
        ]
        entry["description_prep"] = " ".join(tokens_to_keep)
    return pd.DataFrame(records)

In [9]:
training = spacy_prep_df(training)

Display the first couple of lines of the preprocessed descriptions.

In [10]:
training["description_prep"].head()

0    aromas include tropical fruit broom brimstone ...
1    ripe fruity wine smooth structure firm tannin ...
2    tart snappy flavor lime flesh rind dominate gr...
3    pineapple rind lemon pith orange blossom start...
4    like regular bottling come rough tannic rustic...
Name: description_prep, dtype: object
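
To make the token filter easier to follow, here is a simplified stand-in written in plain Python. It mimics only the "alphabetic and not a stop word" rule and the lowercasing. It does not lemmatize, it uses a regular expression instead of spaCy's tokenizer, and the tiny `STOP_WORDS` set and the function name `simple_prep` are assumptions made for this illustration.

```python
import re

# Tiny illustrative stop-word list (assumption); spaCy ships a much larger one
STOP_WORDS = {"this", "is", "a", "the", "and", "of", "that"}

def simple_prep(text):
    """Keep purely alphabetic, non-stop-word tokens and lowercase them.

    Unlike the spaCy version, this sketch does not lemmatize.
    """
    # Split into runs of word characters and single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text)
    kept = [t.lower() for t in tokens
            if t.isalpha() and t.lower() not in STOP_WORDS]
    return " ".join(kept)

print(simple_prep("This is a good wine"))             # good wine
print(simple_prep("Tart and snappy, 2013 vintage."))  # tart snappy vintage
```

Numbers and punctuation fail the `isalpha()` check, which is why `2013` and the comma disappear in the second example.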

# Vectorize documents

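Vectorize the preprocessed descriptions using a simple `CountVectorizer`. With `min_df=10`, terms that occur in fewer than 10 documents are dropped from the vocabulary.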

In [11]:
count_vect = CountVectorizer(min_df=10)

Apply the CountVectorizer object to the review texts of the training set.

In [12]:
X_training = count_vect.fit_transform(training["description_prep"].tolist())
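
The following plain-Python sketch illustrates what a term-document matrix with a document-frequency cutoff looks like on three toy documents. It is a hand-rolled approximation, not scikit-learn's implementation: the documents, the cutoff `MIN_DF = 2`, and all variable names are made up for this example, and tokenization is a simple whitespace split.

```python
from collections import Counter

docs = [
    "ripe fruity wine",
    "tart snappy wine",
    "ripe tannic wine wine",
]
MIN_DF = 2  # illustrative cutoff; the notebook uses min_df=10

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in docs for term in set(doc.split()))

# Vocabulary: terms that reach the cutoff, in alphabetical order
vocabulary = sorted(term for term, df in doc_freq.items() if df >= MIN_DF)

# Term-document matrix: one row per document, one column per vocabulary term
matrix = [[doc.split().count(term) for term in vocabulary] for doc in docs]

print(vocabulary)  # ['ripe', 'wine']
print(matrix)      # [[1, 1], [0, 1], [1, 2]]
```

Terms such as `fruity` or `tart` appear in only one document and therefore get no column, which keeps the matrix small and removes very rare words.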

Store the labels that we want to predict in a separate variable.

In [13]:
y_training = training["points"]
y_training.describe()

count    80000.000000
mean        88.436312
std          3.007500
min         80.000000
25%         86.000000
50%         88.000000
75%         91.000000
max        100.000000
Name: points, dtype: float64

# Train regressor on training set

Fit a linear regression model with the term-document matrix as the features and the numeric wine quality (i.e., `points` variable) as the label.

In [14]:
reg = LinearRegression().fit(X_training, y_training)

Test whether the model works by predicting the quality of a short, made-up review.

In [15]:
doc_new = {'index': [1],
           'description': ['This is a good wine']}

doc_new_df = pd.DataFrame.from_dict(doc_new)

In [16]:
doc_new_df_prep = spacy_prep_df(doc_new_df)
doc_new_df_prep

| | index | description | description_prep |
|---|---|---|---|
| 0 | 1 | This is a good wine | good wine |


In [17]:
X_new = count_vect.transform(doc_new_df_prep["description_prep"])
predicted = reg.predict(X_new)
predicted

array([84.79559323])
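
A fitted linear regression predicts a score as the intercept plus the sum of each term count multiplied by its coefficient, and `CountVectorizer.transform` ignores terms that are not in the fitted vocabulary. The sketch below illustrates this arithmetic in plain Python. The intercept and weights are invented for the example and are NOT the values learned by `reg`; `predict_points` is a hypothetical helper.

```python
# Hypothetical intercept and per-term weights (not the fitted values from the notebook)
intercept = 88.0
weights = {"good": -0.5, "wine": -0.25, "elegant": 1.0}

def predict_points(prepared_text):
    """Intercept plus the weight of every known term occurrence.

    Terms without a weight (outside the vocabulary) contribute nothing.
    """
    return intercept + sum(weights.get(term, 0.0) for term in prepared_text.split())

print(predict_points("good wine"))             # 87.25
print(predict_points("elegant wine unicorn"))  # 88.75
```

In the second call, `unicorn` has no weight and is skipped, just as an out-of-vocabulary word would be when a new review is transformed.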

# Evaluate prediction error on validation set

In [18]:
validation = spacy_prep_df(validation)

In [19]:
X_validation = count_vect.transform(validation["description_prep"])
y_validation = validation["points"]

Before calculating the predictions of our model, let's first create a simple benchmark: always predict the mean `points` of the training set (about 88.43) and measure the mean absolute error of this constant prediction on the validation set.

In [20]:
print(metrics.mean_absolute_error(y_validation, [88.43]*len(y_validation)))

2.4627130000000004
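
For readers who want to see the metric spelled out, here is a minimal plain-Python sketch of the mean absolute error of a constant mean baseline. The toy scores and the function name `mae` are assumptions for illustration; the notebook itself uses `metrics.mean_absolute_error`.

```python
from statistics import mean

def mae(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy scores (made up for illustration)
y_train = [86, 88, 90, 92]
y_val = [85, 89, 93]

# The baseline always predicts the training mean, here 89
baseline = mean(y_train)
baseline_predictions = [baseline] * len(y_val)

# Absolute errors are 4, 0 and 4, so the MAE is 8/3
print(mae(y_val, baseline_predictions))
```

Any model worth keeping should achieve a lower MAE than this constant prediction.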


Call the `predict` function of our model on the validation data and calculate the mean absolute error (MAE).

In [21]:
predictions_validation = reg.predict(X_validation)
print(metrics.mean_absolute_error(y_validation, predictions_validation))

1.371165640857064
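
One way to put the two numbers side by side is to express the model's error as a reduction relative to the baseline. The short sketch below does this with the two MAE values printed above; the variable names are illustrative assumptions.

```python
# MAE values copied from the outputs above
mae_baseline = 2.4627130000000004
mae_model = 1.371165640857064

# Share of the baseline error that the regression model removes
relative_reduction = 1 - mae_model / mae_baseline
print(f"{relative_reduction:.1%}")  # roughly 44%
```

In other words, the bag-of-words regression is off by about 1.4 points on average, compared with about 2.5 points when always predicting the training mean.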
