<a href="https://colab.research.google.com/github/olivermueller/aml4ta-2021/blob/main/Session_02/2_02_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>


In [None]:
# Set up Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
# Install packages
!pip install pymysql

Collecting pymysql
  Downloading PyMySQL-1.0.2-py3-none-any.whl (43 kB)
Installing collected packages: pymysql
Successfully installed pymysql-1.0.2


# <font color="#003660">Week 2: Predicting with Bags of Words</font>

# <font color="#003660">Notebook 2: Regression</font>

<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/dag.png"/><br></center>

<p>
<center>
<div>
    <font color="#085986"><b>By the end of this lesson, you will be able to...</b><br><br>
        ... transform raw text into a term-document matrix, <br>
        ... train a regression model on the term-document matrix, and <br>
        ... compete in a Kaggle competition.
    </font>
</div>
</center>
</p>

# Import packages

As always, we first need to load a number of required Python packages:
- `pandas` provides high-performance, easy-to-use data structures and data analysis tools.
- `SQLAlchemy`, together with `pymysql`, lets us communicate with SQL databases.
- `getpass` provides a function for entering passwords without echoing them on screen.
- `spacy` offers industrial-strength natural language processing.
- `sklearn` is the de-facto standard machine learning package in Python.

In [None]:
import pandas as pd
from sqlalchemy import create_engine
import getpass
import spacy
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

# Load documents

We load our data from a MySQL database. For security reasons, the database credentials are not stored in this notebook; please have a look at Panda to get them.

In [None]:
# Get credentials
user = input("Username: ")
passwd = getpass.getpass("Password: ")
server = input("Server: ")
db = input("Database: ")

# Create an engine instance (SQLAlchemy)
engine = create_engine("mysql+pymysql://{}:{}@{}/{}".format(user, passwd, server, db))

# Define SQL query
sql_query = "SELECT * FROM WineDataset"

# Query dataset (pandas)
corpus = pd.read_sql(sql=sql_query, con=engine)

Username: student
Password: ··········
Server: manila.uni-paderborn.de
Database: aml4ta
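
One caveat when building the connection string with `format`: a password containing characters such as `@` or `/` would be misread as part of the URL. Below is a minimal sketch of one way to guard against this, assuming percent-encoding with `urllib.parse.quote_plus` from the standard library. The helper name `build_mysql_url` and the sample credentials are hypothetical and not part of the original notebook.

```python
from urllib.parse import quote_plus

def build_mysql_url(user, passwd, server, db):
    """Build a mysql+pymysql connection URL with percent-encoded credentials."""
    # quote_plus escapes characters like '@' and '/' that would otherwise
    # be interpreted as URL delimiters
    return "mysql+pymysql://{}:{}@{}/{}".format(
        quote_plus(user), quote_plus(passwd), server, db
    )

# Hypothetical credentials, for illustration only
url = build_mysql_url("student", "p@ss/word", "example.org", "aml4ta")
print(url)
```

The resulting string could then be passed to `create_engine` in place of the unescaped one.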


In [None]:
corpus.head()

Unnamed: 0,index,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,testset,verygood
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,0,0
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,0,0
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,0,0
3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,0,0
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,0,0


In [None]:
corpus.shape

(129971, 16)

# Preprocess documents

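Split the data into training, validation, and test sets. Rows flagged with `testset == 1` form the test set; of the remaining rows, the first 80,000 are used for training and the next 20,000 for validation.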

In [None]:
training = corpus[corpus["testset"] == 0]
validation = training.iloc[80000:100000]
training = training.iloc[0:80000]
test = corpus[corpus["testset"] == 1]

In [None]:
print(training.shape)
print(validation.shape)
print(test.shape)

(80000, 16)
(20000, 16)
(29970, 16)


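Perform standard NLP preprocessing steps on the training set using spaCy: keep only alphabetic tokens that are not stop words, lemmatize them, and lowercase them. To speed things up, we disable the components of spaCy's standard NLP pipeline that we don't need (`ner`, `parser`, and `tagger`).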

In [None]:
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser', 'tagger'])

def spacy_prep(dataset):
    # Convert the DataFrame into a list of row dictionaries
    records = dataset.to_dict("records")
    for entry in records:
        doc = nlp(entry['description'])
        # Keep the lowercased lemmas of alphabetic, non-stop-word tokens
        tokens_to_keep = []
        for token in doc:
            if token.is_alpha and not token.is_stop:
                tokens_to_keep.append(token.lemma_.lower())
        entry['description_prep'] = " ".join(tokens_to_keep)
    # Return a new DataFrame with the added description_prep column
    return pd.DataFrame(records)

In [None]:
training = spacy_prep(training)

Display the first couple of lines of the preprocessed descriptions.

In [None]:
training["description_prep"].head()

0    aromas include tropical fruit broom brimstone ...
1    ripe fruity wine smooth structure firm tannin ...
2    tart snappy flavor lime flesh rind dominate gr...
3    pineapple rind lemon pith orange blossom start...
4    like regular bottle come rough tannic rustic e...
Name: description_prep, dtype: object
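
To make the filtering logic in `spacy_prep` concrete without loading a spaCy model, here is a simplified, pure-Python sketch. It is only a stand-in under stated assumptions: the tiny `STOP_WORDS` set, the function name `simple_prep`, and the whitespace tokenization are hypothetical, and the sketch skips lemmatization, so its output will differ from spaCy's on real reviews.

```python
# Hypothetical, tiny stop-word list; spaCy's real list is much longer
STOP_WORDS = {"this", "is", "a", "the", "and", "of"}

def simple_prep(text):
    """Keep lowercased alphabetic tokens that are not stop words."""
    tokens_to_keep = []
    for token in text.split():
        # Strip surrounding punctuation so "wine." is treated like "wine"
        token = token.strip(".,;:!?\"'").lower()
        if token.isalpha() and token not in STOP_WORDS:
            tokens_to_keep.append(token)
    return " ".join(tokens_to_keep)

print(simple_prep("This is a good wine."))
```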

# Vectorize documents

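Vectorize using a simple `CountVectorizer`. Setting `min_df=10` drops all terms that appear in fewer than 10 documents, which keeps very rare words out of the vocabulary.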

In [None]:
count_vect = CountVectorizer(min_df=10)

Apply the CountVectorizer object to the review texts of the training set.

In [None]:
X_training = count_vect.fit_transform(training["description_prep"].tolist())
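
What counting with a document-frequency cutoff does can be sketched in a few lines of plain Python. This is a simplified illustration, not scikit-learn's implementation: the function name `count_vectorize` is hypothetical, tokens are split on whitespace only, and the result is a dense list of lists rather than a sparse matrix.

```python
from collections import Counter

def count_vectorize(docs, min_df=1):
    """Return (vocabulary, count matrix) for whitespace-tokenized documents."""
    tokenized = [doc.split() for doc in docs]
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Keep terms that occur in at least min_df documents, sorted alphabetically
    vocabulary = sorted(term for term, df in doc_freq.items() if df >= min_df)
    index = {term: j for j, term in enumerate(vocabulary)}
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for term in tokens:
            if term in index:
                row[index[term]] += 1
        matrix.append(row)
    return vocabulary, matrix

# Made-up mini corpus, for illustration only
docs = ["ripe fruity wine", "tart wine wine", "ripe tannin"]
vocabulary, matrix = count_vectorize(docs, min_df=2)
print(vocabulary)
print(matrix)
```

Each row of the matrix corresponds to one document and each column to one retained term; terms below the cutoff (here `fruity`, `tart`, and `tannin`) get no column at all.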

Store the labels that we want to predict in a separate variable.

In [None]:
y_training = training["points"]
y_training.describe()

count    80000.000000
mean        88.436312
std          3.007500
min         80.000000
25%         86.000000
50%         88.000000
75%         91.000000
max        100.000000
Name: points, dtype: float64

# Train regressor on training set

Fit a linear regression model with the term-document matrix as the features and the numeric wine quality (i.e., `points` variable) as the label.

In [None]:
reg = LinearRegression().fit(X_training, y_training)
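
As a reminder of what is being fitted here (standard ordinary least squares, written in notation assumed for this sketch rather than taken from the notebook): let $x_{ij}$ be the count of term $j$ in review $i$, $y_i$ the points of review $i$, $p$ the vocabulary size, and $n$ the number of training reviews. The model predicts

$$\hat{y}_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij},$$

and the coefficients are chosen to minimize the residual sum of squares

$$\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 .$$

Under this model, each $\beta_j$ can be read as the estimated change in points for one additional occurrence of term $j$, holding the other counts fixed.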

Test whether the regressor is working by predicting the quality of a short, made-up review.

In [None]:
doc_new = {'index': [1], 
           'description': ['This is a good wine']}

doc_new_df = pd.DataFrame.from_dict(doc_new)

In [None]:
doc_new_df_prep = spacy_prep(doc_new_df)
doc_new_df_prep

Unnamed: 0,index,description,description_prep
0,1,This is a good wine,good wine


In [None]:
X_new = count_vect.transform(doc_new_df_prep["description_prep"])
predicted = reg.predict(X_new)
predicted

array([84.70620391])

# Evaluate accuracy on validation set

Let's evaluate the predictive accuracy of our model on the validation set.

In [None]:
validation = spacy_prep(validation)

In [None]:
X_validation = count_vect.transform(validation["description_prep"])
y_validation = validation["points"]

Call the `predict` function of our model with the validation data and calculate the mean absolute error (MAE), i.e., the average absolute difference between predicted and actual points.

In [None]:
predictions_validation = reg.predict(X_validation)
print(metrics.mean_absolute_error(y_validation, predictions_validation))

1.368539957600655
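
An MAE of about 1.37 means that, on average, the predictions are off by roughly 1.37 points on the 80 to 100 scale. The metric itself is simple enough to compute by hand; the following sketch re-implements it in plain Python on made-up numbers to show what `metrics.mean_absolute_error` measures. The values are hypothetical and unrelated to the wine data.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up scores, for illustration only
y_true = [88, 90, 85]
y_pred = [87.0, 92.0, 85.0]
mae = mean_absolute_error(y_true, y_pred)
print(mae)  # (1 + 2 + 0) / 3 = 1.0
```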
