In this task, we predict wine quality from review texts and other properties of the wine.
The dataset is available here:
https://www.kaggle.com/zynicide/wine-reviews

### Task 1 Bag of Words and Simple Features

#### 1.1 Create a baseline model for predicting wine quality using only non-text features.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.sparse as sp
from category_encoders import TargetEncoder
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import (ElasticNet, LinearRegression, Ridge, Lasso,
                                  LogisticRegression, LogisticRegressionCV)
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures

In [2]:
df = pd.read_csv("winemag-data-130k-v2.csv")

In [3]:
# Select non-text features. Column "Unnamed: 0" is not considered since it is just the row index.
# The "taster_twitter_handle" column is also removed since it does not provide any useful information.
unnecessary_cols = ['Unnamed: 0', 'taster_twitter_handle']
for col in unnecessary_cols:
    print(col)
    print(df[col].describe())
    print("")
    
# Features with strings
text_features = [ 'description', 'title', 'winery']
for col in text_features:
    print(col)
    print(df[col].describe())
    print("")

Unnamed: 0
count    129971.000000
mean      64985.000000
std       37519.540256
min           0.000000
25%       32492.500000
50%       64985.000000
75%       97477.500000
max      129970.000000
Name: Unnamed: 0, dtype: float64

taster_twitter_handle
count          98758
unique            15
top       @vossroger
freq           25514
Name: taster_twitter_handle, dtype: object

description
count                                                129971
unique                                               119955
top       Cigar box, café au lait, and dried tobacco aro...
freq                                                      3
Name: description, dtype: object

title
count                                                129971
unique                                               118840
top       Gloria Ferrer NV Sonoma Brut Sparkling (Sonoma...
freq                                                     11
Name: title, dtype: object

winery
count                 129971
unique                 16

Since the columns shown above carry no useful information for prediction, 'Unnamed: 0' and 'taster_twitter_handle' are dropped. The columns 'description', 'title' and 'winery' are treated as text features, since each of them consists of many distinct string values.

In [4]:
# Initial cleaning
df = pd.read_csv("winemag-data-130k-v2.csv")
df = df.drop(unnecessary_cols, axis=1) # drop the unnecessary columns (text features are kept for later tasks)
df = df.loc[df.country=="US"] # keep only US wines
df = df.drop('country', axis =1)

In [5]:
# For task 1.1 
text_features = ['description','title','winery'] # i.e. text-related features
df_non_text= df.drop(text_features, axis=1)

# Consider wine points as output 
target_sample = df_non_text.points
X_sample = df_non_text.drop(['points'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X_sample, target_sample,
                                                    stratify=target_sample, random_state=1)

# Baseline model: Ridge regression with 'minimal' preprocessing
cat_preprocessing = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='NA'),
    OneHotEncoder(handle_unknown='ignore'))
cont_preprocessing = make_pipeline(
    SimpleImputer(),
    StandardScaler())

# Target-encoding alternative for the categorical columns (defined here but not used in the baseline below)
cat_TE_preprocessing = make_pipeline(
    TargetEncoder())

preprocess = make_column_transformer(
    (cat_preprocessing, make_column_selector(dtype_include='object')),
    remainder=cont_preprocessing)

model = make_pipeline(preprocess, Ridge())

scores = cross_val_score(model, X_train, y_train)
print("baseline mean score: ", np.mean(scores))

baseline mean score:  0.3816509571920229
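
Note that `cross_val_score` on a regressor such as `Ridge` reports R² by default, so the value above is a coefficient of determination rather than an accuracy. As a minimal sketch on synthetic data (the `make_regression` toy problem and all variable names here are illustrative assumptions, not part of the wine data), a different metric can be requested through the `scoring` parameter:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the wine features (assumption: for illustration only)
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# With no `scoring` argument, a regressor is scored with R^2
r2_scores = cross_val_score(Ridge(), X_toy, y_toy, cv=5)

# Root mean squared error; scikit-learn negates errors so that higher is always better
neg_rmse_scores = cross_val_score(Ridge(), X_toy, y_toy, cv=5,
                                  scoring="neg_root_mean_squared_error")

print("mean R^2: ", np.mean(r2_scores))
print("mean RMSE:", -np.mean(neg_rmse_scores))
```

An error in the units of the target (wine points) may be easier to interpret than R² when comparing models.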


#### 1.2 Create a simple text-based model using a bag-of-words approach and a linear model.


In [6]:
# Take a sample of the whole dataset
sample_size = int(len(df["description"])*0.7)  # i.e. 70% of the original size
df_sample = df.sample(sample_size, random_state=1)

# 1st method - a column transformer with a separate CountVectorizer for each text column
text_features = [ 'description', 'title', 'winery']
text_target_samp = df_sample.points
text_cols_samp = df_sample[text_features]

X_train, X_test, y_train, y_test = train_test_split(text_cols_samp , text_target_samp, 
                                            stratify= text_target_samp, random_state=1)

preprocess = make_column_transformer(
    (CountVectorizer(), 'description'),
    (CountVectorizer(), 'title'),
    (CountVectorizer(), 'winery')  )

ridge = make_pipeline(preprocess, Ridge())
ridge_score = cross_val_score(ridge, X_train, y_train)
print ("Avg CV score using basic ridge: ", np.mean(ridge_score))


#--------------------------------------------------
# 2nd method - concatenate all text features so that they share one vocabulary and their counts accumulate.
# This is done by simply concatenating the string columns (note that no separator is inserted between them).

text_target_samp = df_sample.points
text_cols_samp = df_sample['description']+ df_sample['title'] + df_sample['winery']

X_train, X_test, y_train, y_test = train_test_split(text_cols_samp , text_target_samp, 
                                                    stratify= text_target_samp, random_state=1)
print("")
print("2nd method - concatenating all text features")

# plain text-based model using a bag-of-words
vect = CountVectorizer()
X_train_transf = vect.fit_transform(X_train)
ridge_score = cross_val_score(Ridge(), X_train_transf, y_train)
print("V2: Avg CV score of a plain count-vectorizer without any tuning: ", np.mean(ridge_score))

# bag-of-words with min_df, English stop words and a custom token pattern
vect = CountVectorizer(min_df = 2, stop_words = 'english', token_pattern=r"\b\w[\w’]+\b")
X_train_transf = vect.fit_transform(X_train)
ridge_score = cross_val_score(Ridge(), X_train_transf, y_train)
print("V2: Avg CV score of a plain count-vectorizer with pre-defined tuning: ", np.mean(ridge_score))

Avg CV score using basic ridge:  0.6998183049018358

2nd method - concatenating all text features
V2: Avg CV score of a plain count-vectorizer without any tuning:  0.6832176751531114
V2: Avg CV score of a plain count-vectorizer with pre-defined tuning:  0.671852060736962


#### 1.3 Try using n-grams, characters, tf-idf rescaling and possibly other ways to tune the BoW
model. Be aware that you might need to adjust the (regularization of the) linear model for
different feature sets.


In [7]:
# Trying unigrams + 2-grams with a linear model
vect = CountVectorizer(ngram_range=(1,2), min_df = 2, stop_words = 'english', token_pattern=r"\b\w[\w’]+\b")
X_train_transf = vect.fit_transform(X_train)
ridge_score = cross_val_score(Ridge(), X_train_transf, y_train, cv=3)
print("Avg CV score w/ 2-gram and pre-defined tuning: ",  np.mean(ridge_score))

# Unigrams up to 3-grams
vect = CountVectorizer(ngram_range=(1,3), min_df = 2, stop_words = 'english', token_pattern=r"\b\w[\w’]+\b")
X_train_transf = vect.fit_transform(X_train)
ridge_score = cross_val_score(Ridge(), X_train_transf, y_train, cv=3)
print("Avg CV score w/ 3-gram and pre-defined tuning: ",  np.mean(ridge_score))

# Unigrams up to 3-grams without any tuning
vect = CountVectorizer(ngram_range=(1,3))
X_train_transf = vect.fit_transform(X_train)
ridge_score = cross_val_score(Ridge(), X_train_transf,y_train, cv=3)
print("Avg CV score w/ 3-gram and w/o pre-defined tuning: ",  np.mean(ridge_score))

Avg CV score w/ 2-gram and pre-defined tuning:  0.7088785381327964
Avg CV score w/ 3-gram and pre-defined tuning:  0.7192283509481117
Avg CV score w/ 3-gram and w/o pre-defined tuning:  0.7467945916542112


In [8]:
# Character n-grams: slow to fit and scores worse than the word-level models.
# stop_words and token_pattern only apply when analyzer='word', so they are omitted here.
vect = CountVectorizer(ngram_range=(2,3), analyzer = "char_wb", min_df = 2)
X_train_transf = vect.fit_transform(X_train)
ridge_score = cross_val_score(Ridge(), X_train_transf,y_train, cv=3)
print("Avg CV score w/ 2-3 character gram and pre-defined tuning: ",  np.mean(ridge_score))

Avg CV score w/ 2-3 character gram and pre-defined tuning:  0.6292857811480754
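
As a small illustration of what the `char_wb` analyzer produces, the sketch below builds the analyzer on its own and applies it to a single made-up word. With `char_wb`, n-grams are taken only inside word boundaries, and each word is padded with a space on both sides.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Build only the analyzer (no fitting needed) for 2- and 3-character n-grams
char_analyzer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 3)).build_analyzer()

# Hypothetical input word; the padded form is " oak "
char_ngrams = char_analyzer("oak")
print(char_ngrams)
```

Features of this kind capture word fragments rather than whole descriptors, which may explain why they are less informative here than word-level n-grams.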


In [9]:
# Using tf-idf rescaling
vect = TfidfVectorizer(min_df = 2, stop_words = 'english', token_pattern=r"\b\w[\w’]+\b")
X_train_transf = vect.fit_transform(X_train)
ridge_score = cross_val_score(Ridge(), X_train_transf,y_train, cv=3)
print("Avg CV score w/ tfidf and pre-defined tuning: ",  np.mean(ridge_score))

vect = TfidfVectorizer()
X_train_transf = vect.fit_transform(X_train)
ridge_score = cross_val_score(Ridge(), X_train_transf,y_train, cv=3)
print("Avg CV score w/ tfidf and w/o pre-defined tuning: ",  np.mean(ridge_score))

Avg CV score w/ tfidf and pre-defined tuning:  0.7096549701749862
Avg CV score w/ tfidf and w/o pre-defined tuning:  0.7204656940024137
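
All experiments in this task use `Ridge()` with its default `alpha`, although the task statement suggests adjusting the regularization for different feature sets. One possible way to do that, sketched here on a made-up mini corpus (the reviews, scores and the `alpha` grid are assumptions chosen for illustration), is a grid search over `alpha` with the vectorizer inside the pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Hypothetical mini corpus standing in for the review texts
toy_reviews = pd.Series([
    "ripe cherry and soft tannins",
    "thin and sour with a short finish",
    "elegant structure with ripe fruit",
    "harsh tannins and a bitter finish",
    "balanced acidity and ripe plum",
    "flat and sour with little fruit",
    "elegant and balanced with a long finish",
    "thin fruit and harsh acidity",
    "ripe plum and elegant tannins",
    "bitter and flat with a short finish",
    "soft fruit and balanced structure",
    "sour cherry and thin structure",
])
toy_points = pd.Series([90, 83, 92, 82, 91, 81, 94, 84, 93, 80, 89, 85])

# make_pipeline names the steps after their classes, hence "ridge__alpha"
tfidf_ridge = make_pipeline(TfidfVectorizer(), Ridge())
param_grid = {"ridge__alpha": np.logspace(-2, 2, 5)}

grid = GridSearchCV(tfidf_ridge, param_grid, cv=3)
grid.fit(toy_reviews, toy_points)
print(grid.best_params_, grid.best_score_)
```

The same grid could also include vectorizer settings such as `tfidfvectorizer__ngram_range`, so that the feature set and the regularization are tuned together.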


#### 1.4 Combine the non-text features and the text features. How does adding those features improve upon just using bag-of-words?

In [10]:
text_features = [ 'description', 'title', 'winery']
cont = ['price']
cat = ['designation','province', 'region_1', 'region_2', 'taster_name', 'variety']
text_col = ['desc,title,winery']

sample_size = int(len(df["description"])*0.7)  # i.e. 70% of the original size
df_sample = df.sample(sample_size, random_state=1)

# concatenate the text columns into a single text column called "desc,title,winery"
df_sample["desc,title,winery"] =  df_sample['description']+ df_sample['title'] + df_sample['winery']
df_sample = df_sample.drop(text_features, axis=1)

target_sample = df_sample.points
X_sample = df_sample.drop(['points'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X_sample, target_sample,
                                                    stratify = target_sample,  random_state=0)

In [11]:
#preprocessing
cat_preprocessing = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='NA'),
    OneHotEncoder(handle_unknown='ignore'))

cont_preprocessing = make_pipeline(
    SimpleImputer(),
    StandardScaler())

text_preprocessing = make_pipeline(
    CountVectorizer(ngram_range=(1,3))
    )

preprocess = make_column_transformer(
    (cont_preprocessing, cont),
    (cat_preprocessing, cat),
    (text_preprocessing, 'desc,title,winery'))

model = make_pipeline(preprocess, Ridge())
model_score = cross_val_score(model, X_train, y_train, cv=3)
print("For 70% sample:")
print ("Avg CV score with both text/non-text features: ", np.mean(model_score))

#------------------------------------------
# To see what the whole dataset would give
df_whole = df.copy()  # copy so that the original df is not modified

# concatenate the text columns into a single text column called "desc,title,winery"
df_whole["desc,title,winery"] =  df_whole['description']+ df_whole['title'] + df_whole['winery']
df_whole = df_whole.drop(text_features, axis=1)

target_sample = df_whole.points
X_sample = df_whole.drop(['points'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X_sample, target_sample,
                                                    stratify = target_sample,  random_state=0)

model = make_pipeline(preprocess, Ridge())
model_score = cross_val_score(model, X_train, y_train, cv=3)
print("For the whole dataset:")
print ("Avg CV score  with both text/non-text features: ", np.mean(model_score))

For 70% sample:
Avg CV score with both text/non-text features:  0.7572476667689682
For the whole dataset:
Avg CV score  with both text/non-text features:  0.7748946742649401


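Yes. Compared with the models that use only bag-of-words text features, the average CV score (R², the default score for `Ridge`) increases when text and non-text features are combined: 0.757 on the 70% sample and 0.775 on the whole dataset, against 0.747 for the best text-only model (1- to 3-grams without further tuning) and 0.382 for the non-text baseline. The gain is largest relative to the plain concatenated bag-of-words model without any tuning (0.683).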
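
The text-only scores in 1.2 and 1.3 and the combined scores in 1.4 come from different train/test splits (`random_state=1` versus `random_state=0`), so the comparison is approximate. A sketch of a like-for-like comparison, scoring both models on the same folds, could look as follows. The synthetic data, the word effects and all names are assumptions made for illustration, not the wine dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the target depends on three words per review and on the price
rng = np.random.RandomState(0)
n = 200
words = np.array(["ripe", "thin", "elegant", "harsh", "balanced", "flat"])
word_effect = np.array([1.0, -1.0, 2.0, -2.0, 1.5, -1.5])
picks = rng.randint(0, len(words), size=(n, 3))
toy = pd.DataFrame({
    "text": [" ".join(words[row]) for row in picks],
    "price": rng.uniform(10, 100, size=n),
})
points = (88 + word_effect[picks].sum(axis=1)
          + 0.05 * toy["price"].to_numpy() + rng.normal(0, 0.5, size=n))

# Text-only model versus text + non-text model
text_only = make_pipeline(
    make_column_transformer((CountVectorizer(), "text")), Ridge())
combined = make_pipeline(
    make_column_transformer((CountVectorizer(), "text"),
                            (StandardScaler(), ["price"])), Ridge())

# One shared splitter, so both models are scored on identical folds
cv = KFold(n_splits=3, shuffle=True, random_state=0)
text_scores = cross_val_score(text_only, toy, points, cv=cv)
combined_scores = cross_val_score(combined, toy, points, cv=cv)
print("text only:", text_scores.mean())
print("combined: ", combined_scores.mean())
```

Passing the same `KFold` object to both calls removes split-to-split variation from the comparison, so the difference in mean score reflects the added features alone.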