# Wine Points Prediction

## Gil LAIFER (028482636) - TCDS17 6.5.2022

In [1]:
%load_ext autoreload
%autoreload 2
import sys; sys.path.append('../')

Here we will try to predict the points a wine will get based on known characteristics (i.e. features, in ML terminology). The main point at this stage is to establish a simple, ideally very cost-effective, baseline to which models with increased complexity and resource demands will be compared.

In the real world there is a tradeoff between complexity and performance, and part of the data scientist's job is to present a tradeoff table of what performance is achievable at each complexity level. Complexity should then be translated into cost. For example:
 * Compute cost 
 * Maintenance cost
 * Serving costs (i.e. is new platform needed?) 
 

## Loading the data

In [2]:
import pandas as pd
import cufflinks as cf; cf.go_offline()

In [3]:
wine_reviews = pd.read_csv("data/winemag-data-130k-v2.csv")
wine_reviews.shape

(129971, 14)

In [4]:
wine_reviews.sample(5)

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
83500,83500,Italy,This is a creamy and layered wine (made with t...,Terre di Giumara,86,11.0,Sicily & Sardinia,Sicilia,,,,Caruso & Minini 2007 Terre di Giumara Grecanic...,Grecanico,Caruso & Minini
115820,115820,Italy,"Made entirely in stainless steel, this is only...",,87,15.0,Southern Italy,Salento,,,,Masseria Altemura 2006 Fiano (Salento),Fiano,Masseria Altemura
17578,17578,France,"An earthy, subdued nose leads to a clean palat...",,87,23.0,Alsace,Alsace,,Anne Krebiehl MW,@AnneInVino,Domaine Charles Baur 2015 Pinot Gris (Alsace),Pinot Gris,Domaine Charles Baur
74143,74143,Austria,A wine that shows all the typicity of cool-cli...,,90,20.0,Weinviertel,,,Roger Voss,@vossroger,Zull 2011 Grüner Veltliner (Weinviertel),Grüner Veltliner,Zull
9671,9671,Portugal,"Despite its richness, this is a stylish wine t...",Maria Mora Enamorada,90,100.0,Alentejano,,,Roger Voss,@vossroger,Magnum Vinhos 2012 Maria Mora Enamorada Red (A...,Portuguese Red,Magnum Vinhos


In [5]:
wine_reviews = wine_reviews.drop(columns=['Unnamed: 0'])

In [6]:
wine_reviews = wine_reviews.drop_duplicates()
wine_reviews.shape

(119988, 13)

## Points prediction

Points is a discrete-valued target (integer scores). Because the values are numeric and ordered, we treat this as a regression problem rather than a classification problem. Regression solutions can be measured with a few metrics:

* MSE - [Mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error)
* R2 - [R squared](https://en.wikipedia.org/wiki/Coefficient_of_determination)
* MAE - [Mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error)

Read more [here](https://towardsdatascience.com/what-are-the-best-metrics-to-evaluate-your-regression-model-418ca481755b)
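
As a minimal sketch of what these three metrics measure, the snippet below computes them by hand with NumPy. The four scores are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Toy values, invented for illustration only.
y_true = np.array([87.0, 90.0, 85.0, 92.0])
y_pred = np.array([88.0, 89.0, 86.0, 90.0])

errors = y_true - y_pred

mse = np.mean(errors ** 2)       # penalises large misses quadratically
mae = np.mean(np.abs(errors))    # average miss, in points

# R2 compares the model's squared error with that of always predicting the mean.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mse, mae, round(r2, 4))
```

MSE and MAE are expressed in (squared) points, so lower is better, while R2 is unitless and higher is better.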

### Train and test set split

To properly report results, let's split to train and test datasets:

In [7]:
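train_data = wine_reviews.sample(frac = 0.8)
# .copy() lets us add prediction columns to test_data later without a SettingWithCopyWarning
test_data = wine_reviews[~wine_reviews.index.isin(train_data.index)].copy()
assert(len(train_data) + len(test_data) == len(wine_reviews))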

In [8]:
len(test_data), len(train_data)

(23998, 95990)
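
The split above is drawn at random on every run, so the reported numbers will shift slightly between runs. A possible variation, sketched here on a toy frame, fixes the seed to make the split repeatable. The frame `toy` and the value `random_state=42` are assumptions for illustration; the outputs shown in this notebook come from the unseeded split.

```python
import pandas as pd

# Toy frame standing in for wine_reviews (values are made up).
toy = pd.DataFrame({"points": range(80, 90)}, index=range(100, 110))

# A fixed random_state makes the 80/20 split repeatable across runs.
train = toy.sample(frac=0.8, random_state=42)
test = toy.drop(train.index)

print(len(train), len(test))
```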

### Baselines

In [9]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [10]:
def calc_prediction_quality(df, pred_score_col, true_score_col):
    return pd.Series({'MSE': mean_squared_error(df[true_score_col], df[pred_score_col]),
                      'MAE': mean_absolute_error(df[true_score_col], df[pred_score_col]),
                      'R2': r2_score(df[true_score_col], df[pred_score_col])})

#### Baseline 1

The most basic baseline is simply the average points. The implementation is as simple as:

In [11]:
test_data['baseline_1_predicted_points'] = train_data.points.mean()
b1_stats = calc_prediction_quality(test_data, 'baseline_1_predicted_points', 'points')
b1_stats

MSE    9.550246
MAE    2.534278
R2    -0.000037
dtype: float64
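
The R2 value is almost exactly zero, and very slightly negative. A small sketch, on made-up numbers, of why that happens: a constant prediction scores exactly 0 when it equals the mean of the evaluated data, and drops below 0 as soon as it differs from it (as the train mean does from the test mean).

```python
import numpy as np

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy test scores, invented for illustration; their mean is 89.0.
test_points = np.array([85.0, 88.0, 90.0, 93.0])

r2_test_mean = r2(test_points, np.full(4, 89.0))    # constant = test mean
r2_train_mean = r2(test_points, np.full(4, 88.5))   # constant = a slightly different mean

print(r2_test_mean, r2_train_mean)
```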

#### Baseline 2

We can probably improve by predicting the average score based on the origin country:

In [12]:
avg_points_by_country = train_data.groupby('country').points.mean()
avg_points_by_country.head()

country
Argentina                 86.649579
Armenia                   88.000000
Australia                 88.563542
Austria                   90.152801
Bosnia and Herzegovina    86.500000
Name: points, dtype: float64

In [13]:
test_data['baseline_2_predicted_points'] = test_data.country.map(avg_points_by_country).fillna(train_data.points.mean())
b2_stats = calc_prediction_quality(test_data, 'baseline_2_predicted_points', 'points')
b2_stats

MSE    9.029053
MAE    2.459092
R2     0.054539
dtype: float64

#### Baseline 3

Adding more breakdowns increases our granularity but can result in overfitting. Still, it is worth trying a breakdown by country and province:

In [14]:
avg_points_by_country_and_region = train_data.groupby(['country','province']).points.mean().rename('baseline_3_predicted_points')
avg_points_by_country_and_region.head()

country    province        
Argentina  Mendoza Province    86.758663
           Other               85.972152
Armenia    Armenia             88.000000
Australia  Australia Other     85.481081
           New South Wales     87.681159
Name: baseline_3_predicted_points, dtype: float64

In [15]:
test_data_with_baseline_3 = test_data.merge(avg_points_by_country_and_region, on = ['country','province'], how='left')
test_data_with_baseline_3.baseline_3_predicted_points = test_data_with_baseline_3.baseline_3_predicted_points.fillna(test_data_with_baseline_3.baseline_2_predicted_points).fillna(test_data.baseline_1_predicted_points)
test_data_with_baseline_3.shape, test_data.shape

((23998, 16), (23998, 15))

In [16]:
b3_stats = calc_prediction_quality(test_data_with_baseline_3, 'baseline_3_predicted_points', 'points')
b3_stats

MSE    8.522805
MAE    2.367587
R2     0.107550
dtype: float64
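
The fallback chain used for baseline 3 (province mean, then country mean, then global mean) can be sketched on a toy frame. All values and the column name `province_mean` below are invented for illustration.

```python
import pandas as pd

train = pd.DataFrame({
    "country": ["Italy", "Italy", "France", "France"],
    "province": ["Tuscany", "Sicily", "Alsace", "Alsace"],
    "points": [91, 86, 88, 92],
})
test = pd.DataFrame({
    "country": ["Italy", "France", "Chile"],
    "province": ["Tuscany", "Bordeaux", "Maipo"],
})

global_mean = train["points"].mean()
by_country = train.groupby("country")["points"].mean()
by_province = (train.groupby(["country", "province"])["points"].mean()
               .rename("province_mean").reset_index())

# A left merge keeps every test row; unseen (country, province) pairs get NaN.
pred = test.merge(by_province, on=["country", "province"], how="left")
pred["prediction"] = (pred["province_mean"]
                      .fillna(pred["country"].map(by_country))  # fall back to country
                      .fillna(global_mean))                     # then to the global mean

print(pred["prediction"].tolist())
```

The first row is served by its province mean, the second by its country mean (the province is unseen), and the third by the global mean (the country is unseen).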

### Baselines summary

In [17]:
baseline_summary = pd.DataFrame([b1_stats, b2_stats, b3_stats], index=['baseline_1', 'baseline_2','baseline_3'])
baseline_summary

| model | MSE | MAE | R2 |
|---|---|---|---|
| baseline_1 | 9.550246 | 2.534278 | -0.000037 |
| baseline_2 | 9.029053 | 2.459092 | 0.054539 |
| baseline_3 | 8.522805 | 2.367587 | 0.107550 |


In [18]:
baseline_summary.to_csv('data/baselines_summary.csv', index_label='model')  # keep the model names stored in the index

## Training a boosted-trees regressor

In [19]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

#### Preparing data - Label encoding categorical features

In [20]:
categorical_features = ['country','province','region_1','region_2','taster_name','variety','winery']
numerical_features = ['price']
features = categorical_features + numerical_features

In [21]:
encoded_features = wine_reviews[categorical_features].apply(lambda col: le.fit_transform(col.fillna('NA')))
encoded_features['price'] = wine_reviews.price.fillna(-1)
encoded_features['points'] = wine_reviews.points
encoded_features.head()

Unnamed: 0,country,province,region_1,region_2,taster_name,variety,winery,price,points
0,22,332,424,6,9,691,11608,-1.0,87
1,32,108,738,6,16,451,12956,15.0,87
2,41,269,1218,17,15,437,13018,14.0,87
3,41,218,549,6,0,480,14390,13.0,87
4,41,269,1218,17,15,441,14621,65.0,87
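
To make the encoding step concrete: label encoding numbers the sorted unique labels from 0. The sketch below shows the same idea with pandas categoricals rather than scikit-learn's `LabelEncoder` (a swapped-in technique, used here only to keep the example small). The toy column is invented.

```python
import pandas as pd

country = pd.Series(["Italy", None, "France", "Italy"])

# Missing values get their own "NA" label, as in the cell above.
filled = country.fillna("NA")

# Categories are sorted alphabetically, then numbered from 0.
codes = filled.astype("category").cat.codes

print(dict(zip(filled, codes)))
```

The resulting integers carry no real ordering; tree-based models can usually cope with that, while linear models and neural networks may read an ordering into them.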


#### Re-splitting to train and test

In [22]:
train_encoded_features = encoded_features[encoded_features.index.isin(train_data.index)]
test_encoded_features = encoded_features[encoded_features.index.isin(test_data.index)]
assert(len(train_encoded_features) + len(test_encoded_features) == len(wine_reviews))

#### Fitting a tree-regressor

In [23]:
from src.models import i_feel_lucky_xgboost_training

In [24]:
train_encoded_features.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 95990 entries, 0 to 129970
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   country      95990 non-null  int32  
 1   province     95990 non-null  int32  
 2   region_1     95990 non-null  int32  
 3   region_2     95990 non-null  int32  
 4   taster_name  95990 non-null  int32  
 5   variety      95990 non-null  int32  
 6   winery       95990 non-null  int32  
 7   price        95990 non-null  float64
 8   points       95990 non-null  int64  
dtypes: float64(1), int32(7), int64(1)
memory usage: 4.8 MB


In [25]:
xgb_clf, clf_name = i_feel_lucky_xgboost_training(train_encoded_features, test_encoded_features, features, 'points', name='xgb_clf_points_prediction')

Let's look at the function output - specifically the **xgb_clf_points_prediction** column:

In [26]:
test_encoded_features.head()

Unnamed: 0,country,province,region_1,region_2,taster_name,variety,winery,price,points,xgb_clf_points_prediction
2,41,269,1218,17,15,437,13018,14.0,87,87
5,38,263,758,6,12,591,14706,15.0,87,87
10,41,51,747,7,19,80,9307,19.0,87,88
19,41,399,1201,6,0,325,12968,32.0,87,84
21,41,269,788,11,15,441,135,20.0,87,88


In [27]:
xgb_stats = calc_prediction_quality(test_encoded_features, 'xgb_clf_points_prediction','points')
xgb_stats

MSE    6.411826
MAE    1.910201
R2     0.328597
dtype: float64

In [None]:
all_compared = pd.DataFrame([b1_stats, b2_stats, b3_stats, xgb_stats], index=['baseline_1', 'baseline_2','baseline_3','regression_by_xgb'])
all_compared

In [None]:
all_compared.to_csv('data/all_models_compared.csv', index_label='model')  # keep the model names stored in the index

## Classical NLP approaches

### Using only the text from the "description" column

#### Text tokenization

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

In [None]:
from pandas import * 
import cufflinks as cf; cf.go_offline()

In [None]:
set_option('display.max_colwidth',200)

In [None]:
wine_reviews.head(1)

In [None]:
!pip install ttp

In [None]:
!pip install emoji

In [None]:
import string
import emoji
import re as regex

In [None]:
specialChars = ''.join([",", ":", "\"", "=", "&", ";", "%", "$","@", "%", "^", "*", "(", ")", "{", "}",'–','“', '”'
                      "[", "]", "|", "/", "\\", ">", "<", "-","!", "?", ".", "'","--", "---", "#", '‘', '’', '…'])  
space_chars = ['.',',',';', '&', '?','!']
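def remove_by_regex(description, regexp):
    # `description` is a pandas Series; .str.replace applies the pattern inside each string
    return description.str.replace(regexp, "", regex=True)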

def remove_urls(description):
    return remove_by_regex(description, regex.compile(r"http\S+"))

def remove_special_chars(description): 
    return description.apply(lambda desc: ''.join([c for c in desc if c not in specialChars]))

def remove_usernames(description):
    return remove_by_regex(description, regex.compile(r"@[^\s]+[\s]?"))

def remove_numbers(description):
    return remove_by_regex(description, regex.compile(r"\s?[0-9]+\.?[0-9]*"))

def remove_emojis(description):
    # emoji.UNICODE_EMOJI was removed in emoji 2.0; replace_emoji (emoji >= 2.0) strips emoji directly
    return description.apply(lambda desc: emoji.replace_emoji(desc, replace=''))

def add_spaces(descriptions):
    def add_spaces_int(description):
        for char in space_chars:
            description = description.replace(char, char + ' ')
        return description
    return descriptions.apply(lambda desc: add_spaces_int(desc))

def leave_language_only(descriptions):
    for f in [remove_urls, remove_emojis, add_spaces, remove_numbers, remove_usernames, remove_special_chars]:
        descriptions = f(descriptions)
    return descriptions
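
For a single string, the cleaning pipeline above boils down to a few regular-expression substitutions. This is a simplified sketch: it reuses the three patterns from the cell above, but replaces the explicit special-character list with the generic class `[^\w\s]`, which is an assumption and may strip a slightly different set of characters. The function name `clean_text` and the sample sentence are invented.

```python
import re

URL_RE = re.compile(r"http\S+")
USER_RE = re.compile(r"@[^\s]+[\s]?")
NUM_RE = re.compile(r"\s?[0-9]+\.?[0-9]*")

def clean_text(text: str) -> str:
    text = URL_RE.sub("", text.lower())   # drop URLs
    text = USER_RE.sub("", text)          # drop @usernames
    text = NUM_RE.sub("", text)           # drop numbers
    text = re.sub(r"[^\w\s]", "", text)   # drop punctuation (simplified)
    return " ".join(text.split())         # normalise whitespace

cleaned = clean_text("Rated 90 by @vossroger: ripe, juicy fruit! See http://example.com")
print(cleaned)
```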

Generating df - a DataFrame with the original 'description' and 'points' variables and a new 'pureTextDescription' variable which will be used for Tokenization. 

In [None]:
df = DataFrame(wine_reviews['description'])
df['y'] = wine_reviews['points']

In [None]:
df['pureTextDescription'] = leave_language_only(df.description.str.lower())
df.info()

In [None]:
stopwords=nltk.corpus.stopwords.words("english") + nltk.corpus.stopwords.words("italian") + nltk.corpus.stopwords.words("spanish")
stopwords[:5]

In [None]:
nltk.word_tokenize(df.pureTextDescription.iloc[0])

Tokenizing the dataset text using the pureTextDescription feature:

In [None]:
all_words = [word for desc in df.pureTextDescription for word in nltk.word_tokenize(desc) if word.lower() not in stopwords] # Words without stop words
words_df = DataFrame(data = all_words, columns = ['word']).word.value_counts().reset_index()
words_df.columns = ['word','wordCount']
words_df['wordImportance'] = len(words_df) / words_df.wordCount / words_df.wordCount.max()
words_df.head()

In [None]:
words_df.set_index('word').wordCount.head(20).iplot(kind = 'bar', title = 'Most frequent words in Corpus', yTitle = 'Count', xTitle = 'Word')

In [None]:
print("Total of {} words, {} unique words".format(len(all_words), len(words_df)))

To reduce the vocabulary size further, we can drop words that appear too rarely. Let's drop any word that has fewer than 5 appearances:

In [None]:
print ("Using words with 5 or more appearances will reduce the vocabulary size to: {}".format(sum(words_df.wordCount >= 5)))

In [None]:
words_df = words_df[words_df.wordCount >= 5]
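
The same frequency-based pruning can be sketched with the standard library alone. The three tiny documents, the stop-word set and the threshold of 2 are invented for illustration (the notebook uses a threshold of 5 on the real corpus).

```python
from collections import Counter

docs = [
    "ripe cherry fruit",
    "cherry and oak",
    "oak cherry tannins",
]
stop_words = {"and"}
min_count = 2   # the notebook uses 5; 2 suits this tiny corpus

# Count every non-stop-word token across all documents.
counts = Counter(
    word for doc in docs for word in doc.split() if word not in stop_words
)

# Keep only words that appear at least min_count times.
vocab = sorted(word for word, n in counts.items() if n >= min_count)
print(vocab)
```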

#### Bag of words (One-hot-encoding)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
clean_vocab = set(words_df.word)
count_vect = CountVectorizer(vocabulary = clean_vocab, tokenizer=nltk.word_tokenize)
clean_bow_counts = count_vect.fit_transform(df.pureTextDescription)
clean_bow_counts.shape

In [None]:
df.iloc[1].pureTextDescription

In [None]:
print(clean_bow_counts[1])

In [None]:
rev_dict = {v:k for k,v in count_vect.vocabulary_.items()}
print(rev_dict[76])
print(rev_dict[271])
print(rev_dict[280])
print(rev_dict[941])

In [None]:
clean_bow_counts.sum()
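
What the count matrix holds can be shown by hand: one row per document, one column per vocabulary word, each cell the number of occurrences. The vocabulary and documents below are made up; note that these are counts rather than strict one-hot values (`CountVectorizer(binary=True)` would give presence/absence instead).

```python
vocab = ["cherry", "oak", "ripe"]
index = {word: i for i, word in enumerate(vocab)}

def bag_of_words(text):
    row = [0] * len(vocab)
    for token in text.split():
        if token in index:          # out-of-vocabulary tokens are ignored
            row[index[token]] += 1
    return row

matrix = [bag_of_words(doc) for doc in ["ripe ripe cherry", "oak and smoke"]]
print(matrix)
```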

### Training and testing on the entire dataset (no split to test/train)
(1) Cross-validation for searching the optimal regularization level<br>
(2) Predicting 'points' using optimal regularization level on the entire dataset and evaluating prediction quality

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import roc_auc_score, precision_score, recall_score, accuracy_score

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge

In [None]:
%%capture
search_grid = np.logspace(-2, 4, num=50, endpoint=True, base=10.0)
mse_by_alpha = []
for alpha in search_grid:
    model = Ridge(alpha = alpha, tol=0.0001, max_iter=10000)
    avg_score = cross_val_score(model, clean_bow_counts, y = df.y, cv = 10, scoring = 'neg_mean_squared_error').mean()
    mse_by_alpha.append((alpha,abs(avg_score)))

In [None]:
cv_results = DataFrame(mse_by_alpha, columns = ['alpha', 'mean_squared_error'])
cv_results.set_index('alpha').mean_squared_error.iplot(title = 'BOW Counts - mean_squared_error as a function of the regularization strength (alpha)', xTitle = 'alpha', yTitle = 'mean_squared_error', width = 3, hline=(0,0))

In [None]:
opt_alpha, min_mean_squared_error = cv_results.loc[cv_results.mean_squared_error.idxmin()]
print(opt_alpha, min_mean_squared_error)
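
To build some intuition for what `alpha` controls, here is a minimal NumPy sketch of the closed-form ridge solution on synthetic data. It omits the intercept that scikit-learn's `Ridge` fits by default, and the data, seed and helper name `ridge_fit` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=50)

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution without an intercept term."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

grid = np.logspace(-2, 4, num=50)          # same grid as the search above
norms = [np.linalg.norm(ridge_fit(X, y, a)) for a in grid]

print(grid[0], grid[-1], norms[0], norms[-1])
```

The coefficient norm shrinks as `alpha` grows: a small `alpha` behaves almost like ordinary least squares, while a very large one pushes all coefficients towards zero. Cross-validation looks for the value in between that generalises best.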

#### Predicting 'points' with the optimal model and evaluating prediction quality

In [None]:
model = Ridge(alpha = opt_alpha, tol=0.0001, max_iter=10000)
model.fit(clean_bow_counts, df.y)

In [None]:
df['predicted_score'] = model.predict(clean_bow_counts)

In [None]:
NLP_desc_stats = calc_prediction_quality(df, 'predicted_score','y')
NLP_desc_stats

In [None]:
all_compared = pd.DataFrame([b1_stats, b2_stats, b3_stats, xgb_stats, NLP_desc_stats], index=['baseline_1', 'baseline_2','baseline_3','regression_by_xgb', 'NLP_desc_stats'])
all_compared

In [None]:
all_compared.to_csv('data/all_models_compared.csv', index_label='model')  # keep the model names stored in the index

### Using both the text and other features
#### Training and testing on the entire dataset (no split to test/train)

In [None]:
other_features = encoded_features[['country', 'province', 'region_1', 'region_2', 'taster_name', 'variety', 'winery', 'price']]
other_features

In [None]:
from scipy.sparse import coo_matrix, hstack
#other_features_sparse_matrix = coo_matrix(other_features) # no need to apply coo_matrix as hstack converts to a sparse matrix automatically
train_united_features = hstack((clean_bow_counts ,other_features))

In [None]:
%%capture
search_grid = np.logspace(-2, 4, num=50, endpoint=True, base=10.0)
mse_by_alpha = []
for alpha in search_grid:
    model = Ridge(alpha = alpha, tol=0.0001, max_iter=10000)
    avg_score = cross_val_score(model, train_united_features, y = df.y, cv = 10, scoring = 'neg_mean_squared_error').mean()
    mse_by_alpha.append((alpha,abs(avg_score)))

In [None]:
cv_results = DataFrame(mse_by_alpha, columns = ['alpha', 'mean_squared_error'])
cv_results.set_index('alpha').mean_squared_error.iplot(title = 'BOW Counts + other features - mean_squared_error as a function of the regularization strength (alpha)', xTitle = 'alpha', yTitle = 'mean_squared_error', width = 3, hline=(0,0))

In [None]:
opt_alpha, min_mean_squared_error = cv_results.loc[cv_results.mean_squared_error.idxmin()]
print(opt_alpha, min_mean_squared_error)

In [None]:
model = Ridge(alpha = opt_alpha, tol=0.0001, max_iter=10000)
model.fit(train_united_features, df.y)

In [None]:
df['predicted_score_extended_NLP'] = model.predict(train_united_features)

In [None]:
extended_NLP_stats = calc_prediction_quality(df, 'predicted_score_extended_NLP','y')
extended_NLP_stats

In [None]:
all_compared = pd.DataFrame([b1_stats, b2_stats, b3_stats, xgb_stats, NLP_desc_stats, extended_NLP_stats], index=['baseline_1', 'baseline_2','baseline_3','regression_by_xgb', 'NLP_desc_stats', 'extended_NLP_stats'])
all_compared

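Adding the other features to the text improved the MSE (as well as the other metrics). This is expected, since the model has more information to work with, although the improvement is small. Note that both NLP models are fit and evaluated on the entire dataset, so their metrics are in-sample and not directly comparable to the test-set metrics of the baselines and the XGBoost model.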

In [None]:
all_compared.to_csv('data/all_models_compared.csv', index_label='model')  # keep the model names stored in the index

## Deep Learning approaches

### Fully connected network on the 'description' text feature only

#### (1) Internal embedding layer + average pooling

##### Tokenization and vectorization:

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization, Embedding, Dense, GlobalAveragePooling1D, Dropout
from tensorflow.keras.callbacks import EarlyStopping

What is a good sequence length? Let's look at the upper quantiles of the description length (in words):

In [None]:
wine_reviews.description.apply(lambda x: len(x.split(' '))).quantile([0.95, 0.99])

In [None]:
vocab_size = 32000
sequence_length = 72

# Use the text vectorization layer to normalize, split, and map strings to integers. 
# Set maximum_sequence length as all samples are not of the same length.
vectorize_layer = TextVectorization(
    #standardize=lambda text: tf.strings.lower(text), # You can use your own normalization function here
    standardize='lower_and_strip_punctuation', # Or you can use a pre-made normalization function
    max_tokens=vocab_size,    
    split='whitespace',
    output_mode='int',
    name = 'Text_processing',
    output_sequence_length=sequence_length)

Computing the vocabulary of the TextVectorization layer based on the 'description' variable:


In [None]:
vectorize_layer.adapt(train_data['description'])

In [None]:
sample_description = train_data['description'].sample().iloc[0]
print(sample_description)

In [None]:
vectorize_layer(sample_description)

In [None]:
vectorize_layer(sample_description).numpy()[:20]

In [None]:
for token in vectorize_layer(sample_description).numpy()[:20]:
    print(f"{token} ---> ",vectorize_layer.get_vocabulary()[token])
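
Roughly, the vectorization layer lowercases the text, strips punctuation, splits on whitespace, maps each token to an integer, and pads or truncates to a fixed length. The pure-Python sketch below mimics that behaviour under simplifying assumptions: the vocabulary is given explicitly rather than learned, `string.punctuation` stands in for the layer's own punctuation rule, and the function name `vectorize` is invented.

```python
import string

def vectorize(text, vocab, sequence_length):
    """Mimic lower_and_strip_punctuation + whitespace split + int lookup."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    lookup = {word: i for i, word in enumerate(vocab)}
    ids = [lookup.get(token, 1) for token in cleaned.split()]  # 1 = unknown
    ids = ids[:sequence_length]                                # truncate
    return ids + [0] * (sequence_length - len(ids))            # 0 = padding

# Index 0 is reserved for padding and index 1 for unknown tokens,
# mirroring the layout returned by get_vocabulary().
toy_vocab = ["", "[UNK]", "and", "cherry", "oak"]
ids = vectorize("Cherry, oak and SMOKE!", toy_vocab, sequence_length=6)
print(ids)
```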

##### Modeling (Sequential API):

Total model parameters: 514,953

In [None]:
embedding_dim=16

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,), dtype=tf.string),
    vectorize_layer,
    Embedding(vocab_size, embedding_dim, name="embedding"),
    GlobalAveragePooling1D(),
    Dense(164, activation='tanh', name='hidden_layer'),
    Dropout(0.2),
    Dense(1, name = 'output_layer')
])
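
What the embedding and average-pooling layers do to the shapes can be sketched in NumPy. The sizes and token ids below are toy values, and the random table stands in for learned embedding weights.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim, sequence_length = 10, 4, 6   # toy sizes

embedding_table = rng.normal(size=(vocab_size, embedding_dim))
token_ids = np.array([[3, 4, 2, 1, 0, 0],
                      [5, 5, 7, 0, 0, 0]])              # (batch, sequence)

# Embedding is a row lookup: one vector per token.
embedded = embedding_table[token_ids]      # (batch, sequence, embedding_dim)

# Average pooling collapses the sequence axis into a single vector per text.
pooled = embedded.mean(axis=1)             # (batch, embedding_dim)

print(embedded.shape, pooled.shape)
```

The pooled vector has a fixed size regardless of the text length, which keeps the following dense layer small. In this plain sketch the padding positions (id 0) are averaged in as well.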

In [None]:
model.summary()

In [None]:
tf.keras.utils.plot_model(model, show_dtype=False, show_shapes=True, show_layer_names=True)

In [None]:
model.compile(
    optimizer=tf.optimizers.Adam(), loss='mean_absolute_error', metrics=['mean_squared_error','mean_absolute_error'])

In [None]:
train_data.shape

In [None]:
%%time
text_col, target_col = 'description', 'points'

early_stopping_monitor = EarlyStopping(
    monitor='val_mean_squared_error',
    min_delta=0,
    patience=2,
    verbose=0,
    restore_best_weights=True
)

history = model.fit(
    train_data[text_col],
    train_data[target_col],
    epochs=20,
    batch_size=128,
    verbose=1,    
    callbacks=[early_stopping_monitor],
    validation_data = (test_data[text_col], test_data[target_col]))
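
The patience rule used by the early-stopping callback can be illustrated in a few lines. This is a simplified sketch of the idea (with `min_delta=0`), not the Keras implementation; the function name and the loss values are invented.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stopped_epoch, best_epoch) for a list of per-epoch losses."""
    best, best_epoch, wait = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0   # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:                      # too many epochs without improvement
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

stopped, best = early_stopping_epoch([5.0, 4.0, 4.2, 4.1, 3.9], patience=2)
print(stopped, best)
```

Training stops at epoch 3 and, with `restore_best_weights=True`, the weights from epoch 1 are restored; the better loss at epoch 4 is never seen. Because the notebook passes the test set as `validation_data`, the stopping epoch is chosen on the same data that is later used for reporting, which can make the reported metrics slightly optimistic.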

##### Model evaluation:

In [None]:
test_data['predicted_score_fully_connected_NN'] = model.predict(test_data[text_col])

In [None]:
fully_connected_NN_stats = calc_prediction_quality(test_data, 'predicted_score_fully_connected_NN', target_col)
fully_connected_NN_stats

In [None]:
all_compared = pd.DataFrame([b1_stats, b2_stats, b3_stats, xgb_stats, NLP_desc_stats, extended_NLP_stats, fully_connected_NN_stats], index=['baseline_1', 'baseline_2','baseline_3','regression_by_xgb', 'NLP_desc_stats', 'extended_NLP_stats', 'fully_connected_NN'])
all_compared

In [None]:
all_compared.to_csv('data/all_models_compared.csv', index_label='model')  # keep the model names stored in the index

#### (2) Fully connected NN with internal embedding and concatenation (instead of average pooling)

The concatenation is performed by reshaping the outputs of the embedding layer into a 1D vector

##### Modeling (Sequential API):

Total model parameters: 701,257

In [None]:
from tensorflow.keras.layers import Reshape, Dense, Dropout
from tensorflow.keras import Sequential

In [None]:
embedding_dim=16

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,), dtype=tf.string),
    vectorize_layer,
    Embedding(vocab_size, embedding_dim, name="embedding"),
    Reshape((embedding_dim * sequence_length, ), name='concat_words'),
    Dense(164, activation='tanh', name='hidden_layer'),
    Dropout(0.7),
    Dense(1, activation='linear', name = 'output_layer')
])

In [None]:
model.summary()

In [None]:
model.compile(
    optimizer=tf.optimizers.Adam(), loss='mean_absolute_error', metrics=['mean_squared_error','mean_absolute_error'])

In [None]:
tf.keras.utils.plot_model(model, show_dtype=True, show_shapes=True, show_layer_names=True)

In [None]:
%%time
text_col, target_col = 'description', 'points'

early_stopping_monitor = EarlyStopping(
    monitor='val_mean_squared_error',
    min_delta=0,
    patience=2,
    verbose=0,
    restore_best_weights=True
)

history = model.fit(
    train_data[text_col],
    train_data[target_col],
    epochs=20,
    batch_size=128,
    verbose=1,    
    callbacks=[early_stopping_monitor],
    validation_data = (test_data[text_col], test_data[target_col]))

##### Model evaluation:

In [None]:
test_data['predicted_score_FC_NN_concatinated_words'] = model.predict(test_data[text_col])

In [None]:
fully_connected_NN_concatinated_words_stats = calc_prediction_quality(test_data, 'predicted_score_FC_NN_concatinated_words', target_col)
fully_connected_NN_concatinated_words_stats

In [None]:
all_compared = pd.DataFrame([b1_stats, b2_stats, b3_stats, xgb_stats, NLP_desc_stats, extended_NLP_stats, fully_connected_NN_stats, fully_connected_NN_concatinated_words_stats], index=['baseline_1', 'baseline_2','baseline_3','regression_by_xgb', 'NLP_desc_stats', 'extended_NLP_stats', 'fully_connected_NN', 'fully_connected_NN_concatinated_words'])
all_compared

By concatenating the embedding output vectors, instead of average pooling, we increased the number of parameters from 514,953 to 701,257, which increases the risk of overfitting. This may explain the degradation we see across the evaluation metrics.
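
The two parameter totals can be reproduced with simple arithmetic from the layer sizes used in the two models (vocabulary 32,000, embedding dimension 16, sequence length 72, hidden layer of 164 units). The helper name `dense_params` is introduced here for illustration.

```python
vocab_size, embedding_dim, sequence_length, hidden = 32000, 16, 72, 164

def dense_params(n_in, n_out):
    return n_in * n_out + n_out            # weights + biases

embedding = vocab_size * embedding_dim     # identical in both models
output = dense_params(hidden, 1)

# Average pooling feeds 16 values into the hidden layer ...
pooled_total = embedding + dense_params(embedding_dim, hidden) + output
# ... while concatenation feeds 16 * 72 = 1152 values into it.
concat_total = (embedding
                + dense_params(embedding_dim * sequence_length, hidden)
                + output)

print(pooled_total, concat_total)
```

All of the extra parameters sit in the hidden layer, whose input grows from 16 to 1,152 values.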

In [None]:
all_compared.to_csv('data/all_models_compared.csv', index_label='model')  # keep the model names stored in the index

#### (3) Fully connected network, using the external GloVe embedding

In [None]:
import os

##### Creating a dictionary with the pre-trained GloVe word embeddings:

In [None]:
filename = "glove.6B.50d.txt"
# os.path.join keeps the path portable across operating systems
path_to_glove_file = os.path.join(os.getcwd(), "data", filename)
path_to_glove_file

embeddings_index = {}   # the dictionary storing the GloVe words and their respective embedding vectors
# GloVe files are UTF-8 encoded; the `with` block closes the file automatically
with open(path_to_glove_file, encoding="utf8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))

In [None]:
embeddings_index.get('drinking')

Creating a word embedding matrix with a word embedding vector for each word of the wine_reviews vocabulary:

In [None]:
embedding_matrix = np.zeros((vocab_size, 50))

In [None]:
out_of_glove_vocub = []
# Rows of words that are not found in the GloVe index stay all-zero.
for i, word in enumerate(vectorize_layer.get_vocabulary()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
    else:
        out_of_glove_vocub.append((i, word))     # record the words that do not have an embedding

In [None]:
out_of_glove_vocub[:10]
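
The whole procedure (parse the GloVe text format, fill the matrix, track the missing words) can be tried on a tiny made-up example. The two "GloVe" lines, the 3-dimensional vectors and the five-word vocabulary are all invented for illustration.

```python
import numpy as np

# Two made-up lines in the GloVe text format: "<word> <v1> <v2> ...".
glove_lines = ["wine 0.1 0.2 0.3", "oak -0.4 0.5 0.6"]

embeddings_index = {}
for line in glove_lines:
    word, coefs = line.split(maxsplit=1)
    embeddings_index[word] = np.array(coefs.split(), dtype="float32")

vocabulary = ["", "[UNK]", "wine", "tannins", "oak"]
embedding_matrix = np.zeros((len(vocabulary), 3))
missing = []
for i, word in enumerate(vocabulary):
    vector = embeddings_index.get(word)
    if vector is None:
        missing.append(word)       # row stays all-zero
    else:
        embedding_matrix[i] = vector

# Share of the vocabulary that received a pre-trained vector.
coverage = 1 - len(missing) / len(vocabulary)
print(missing, coverage)
```

Computing the coverage on the real vocabulary is a quick way to judge how much of the wine-specific language the pre-trained vectors actually cover.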

##### Creating the embedding layer:

In [None]:
embedding_layer = Embedding(input_dim=vocab_size,
                            output_dim=50,
                            weights=[embedding_matrix],
                            input_length=sequence_length,
                            trainable=False)

##### Modeling (Sequential API)

In [None]:
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,), dtype=tf.string),
    vectorize_layer,
    embedding_layer,
    GlobalAveragePooling1D(),
    Dense(164, activation='tanh', name='hidden_layer'),
    Dropout(0.2),
    Dense(1, name = 'output_layer')
])

In [None]:
model.summary()

In [None]:
model.compile(
    optimizer=tf.optimizers.Adam(), loss='mean_absolute_error', metrics=['mean_squared_error','mean_absolute_error'])

In [None]:
%%time
text_col, target_col = 'description', 'points'

early_stopping_monitor = EarlyStopping(
    monitor='val_mean_squared_error',
    min_delta=0,
    patience=2,
    verbose=0,
    restore_best_weights=True
)

history = model.fit(
    train_data[text_col],
    train_data[target_col],
    epochs=20,
    batch_size=128,
    verbose=1,    
    callbacks=[early_stopping_monitor],
    validation_data = (test_data[text_col], test_data[target_col]))

##### Model evaluation:

In [None]:
test_data['predicted_score_DNN_external_embedding_stats'] = model.predict(test_data[text_col])

In [None]:
DNN_external_embedding_stats = calc_prediction_quality(test_data, 'predicted_score_DNN_external_embedding_stats', target_col)
DNN_external_embedding_stats

In [None]:
all_compared = pd.DataFrame([b1_stats, b2_stats, b3_stats, xgb_stats, NLP_desc_stats, extended_NLP_stats, fully_connected_NN_stats, fully_connected_NN_concatinated_words_stats, DNN_external_embedding_stats], index=['baseline_1', 'baseline_2','baseline_3','regression_by_xgb', 'NLP_desc_stats', 'extended_NLP_stats', 'fully_connected_NN', 'fully_connected_NN_concatinated_words', 'DNN_external_embedding_stats'])
all_compared

We can see that using the external GloVe embeddings yielded poorer performance across the evaluation metrics. This is to be expected, as the GloVe vocabulary lacks many of the wine-domain-specific words (out-of-vocabulary) and therefore does not provide effective embeddings for the wine-review texts.

In [None]:
all_compared.to_csv('data/all_models_compared.csv', index_label='model')  # keep the model names stored in the index

#### (4) Fully connected network with LSTM layer

##### Defining the LSTM layer (with 164 units):

In [None]:
LSTM_layer = tf.keras.layers.LSTM(
    164,
    activation='tanh',
    recurrent_activation='sigmoid',
    use_bias=True,
    kernel_initializer='glorot_uniform',
    recurrent_initializer='orthogonal',
    bias_initializer='zeros',
    unit_forget_bias=True,
    kernel_regularizer=None,
    recurrent_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    recurrent_constraint=None,
    bias_constraint=None,
    dropout=0.0,
    recurrent_dropout=0.0,
    return_sequences=False,
    return_state=False,
    go_backwards=False,
    stateful=False,
    time_major=False,
    unroll=False,
)
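
As a rough guide to the size of this layer, the number of trainable parameters of a standard LSTM follows from its four gates. The sketch below assumes the layer receives the 16-dimensional embeddings used elsewhere in this notebook; the helper name `lstm_params` is introduced here for illustration.

```python
def lstm_params(units, input_dim):
    # Four gates (input, forget, cell, output), each with an input kernel,
    # a recurrent kernel and a bias vector.
    return 4 * (units * (input_dim + units) + units)

n_params = lstm_params(units=164, input_dim=16)
print(n_params)
```

Under that assumption, the figure can be cross-checked against the LSTM row of `model.summary()`.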

##### Modeling (Sequential API):

In [None]:
embedding_dim=16

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,), dtype=tf.string),
    vectorize_layer,
    Embedding(vocab_size, embedding_dim, name="embedding"),
    LSTM_layer,
    Dense(164, activation='relu', name='hidden_layer'),
    #Dropout(0.7),
    Dense(1, activation='linear', name = 'output_layer')
])

In [None]:
model.summary()

In [None]:
model.compile(
    optimizer=tf.optimizers.Adam(), loss='mean_absolute_error', metrics=['mean_squared_error','mean_absolute_error'])

In [None]:
tf.keras.utils.plot_model(model, show_dtype=True, show_shapes=True, show_layer_names=True)

In [None]:
%%time
text_col, target_col = 'description', 'points'

early_stopping_monitor = EarlyStopping(
    monitor='val_mean_squared_error',
    min_delta=0,
    patience=2,
    verbose=0,
    restore_best_weights=True
)

history = model.fit(
    train_data[text_col],
    train_data[target_col],
    epochs=20,
    batch_size=128,
    verbose=1,    
    callbacks=[early_stopping_monitor],
    validation_data = (test_data[text_col], test_data[target_col]))

##### Model evaluation:

In [None]:
test_data['predicted_score_LSTM'] = model.predict(test_data[text_col])

In [None]:
LSTM_stats = calc_prediction_quality(test_data, 'predicted_score_LSTM', target_col)
LSTM_stats

In [None]:
all_compared = pd.DataFrame([b1_stats, b2_stats, b3_stats, xgb_stats, NLP_desc_stats, extended_NLP_stats, fully_connected_NN_stats, fully_connected_NN_concatinated_words_stats, DNN_external_embedding_stats, LSTM_stats], index=['baseline_1', 'baseline_2','baseline_3','regression_by_xgb', 'NLP_desc_stats', 'extended_NLP_stats', 'fully_connected_NN', 'fully_connected_NN_concatinated_words', 'DNN_external_embedding_stats', 'LSTM'])
all_compared

In [None]:
all_compared.to_csv('data/all_models_compared.csv', index_label='model')  # keep the model names stored in the index

### Bonus task: Using all features applying the Keras Functional API

In [None]:
from tensorflow.keras import Model
from tensorflow.keras.layers import Input, concatenate

In [None]:
# Define two sets of inputs: inputA is the 'description' text feature, inputB holds the remaining features, which were label-encoded earlier.
# Layer names use underscores, since names containing spaces may be rejected.
inputA = Input(shape=(1,), name="text_input_layer", dtype=tf.string)
inputB = Input(shape=(8,), name="other_features_input_layer")

# The first branch operates on InputA: 
x = vectorize_layer(inputA)
x = Embedding(vocab_size, embedding_dim, name="embedding")(x)
x = GlobalAveragePooling1D()(x)
x = Dense(164, activation='relu')(x)
x = Dropout(0.2)(x)
x = Dense(164, activation='relu')(x)
x = Model(inputs=inputA, outputs=x)

# The second branch operates on InputB:
y = Dense(164, activation="relu")(inputB)
y = Dropout(0.2)(y)
y = Dense(164, activation="relu")(y)
y = Model(inputs=inputB, outputs=y)

# Combine the output of the two branches
combined = concatenate([x.output, y.output])

# Apply a fully-connected layer and then a regression prediction on the combined outputs
z = Dense(164, activation="relu")(combined)
z = Dense(1, activation="linear")(z)

# Define a model that will accept the inputs of the two branches and then output a single value
model = Model(inputs=[x.input, y.input], outputs=z)

In [None]:
model.compile(
    optimizer=tf.optimizers.Adam(), loss='mean_absolute_error', metrics=['mean_squared_error','mean_absolute_error'])

In [None]:
tf.keras.utils.plot_model(model, show_dtype=True, show_shapes=True, show_layer_names=True)

Preparing the Train and Test datasets:

In [None]:
# organizing the Text train and test datasets
trainTextX = train_data[text_col]
testTextX = test_data[text_col]

In [None]:
# organizing the train and test datasets of the remaining features, which were already encoded earlier
trainAttrX = train_encoded_features.loc[:,train_encoded_features.columns != 'points']
testAttrX = test_encoded_features.loc[:,test_encoded_features.columns != 'points']
testAttrX = testAttrX.drop(['xgb_clf_points_prediction'], axis=1)

In [None]:
%%time
text_col, target_col = 'description', 'points'

early_stopping_monitor = EarlyStopping(
    monitor='val_mean_squared_error',
    min_delta=0,
    patience=2,
    verbose=0,
    restore_best_weights=True
)

history = model.fit(
    x= [trainTextX, trainAttrX],
    y = train_data[target_col],
    epochs=20,
    batch_size=128,
    verbose=1,    
    callbacks=[early_stopping_monitor],
    validation_data = ([testTextX, testAttrX], test_data[target_col]))

##### Model evaluation:

In [None]:
test_data['Multiple_Inputs_Mixed_Data_NN_Functional_API'] = model.predict([testTextX, testAttrX])

In [None]:
Multiple_Inputs_Mixed_Data_NN_Functional_API_stats = calc_prediction_quality(test_data, 'Multiple_Inputs_Mixed_Data_NN_Functional_API', target_col)
Multiple_Inputs_Mixed_Data_NN_Functional_API_stats

In [None]:
all_compared = pd.DataFrame([b1_stats, b2_stats, b3_stats, xgb_stats, NLP_desc_stats, extended_NLP_stats, fully_connected_NN_stats, fully_connected_NN_concatinated_words_stats, DNN_external_embedding_stats, LSTM_stats, Multiple_Inputs_Mixed_Data_NN_Functional_API_stats], index=['baseline_1', 'baseline_2','baseline_3','regression_by_xgb', 'NLP_desc_stats', 'extended_NLP_stats', 'fully_connected_NN', 'fully_connected_NN_concatinated_words', 'DNN_external_embedding_stats', 'LSTM', 'Multiple_Inputs_Mixed_Data_NN_Functional_API'])
all_compared

The network with multiple inputs and mixed data (text and other features) yielded a pretty good result: similar to, yet slightly worse than, the classical NLP regression model that also uses the other features.

In [None]:
all_compared.to_csv('data/all_models_compared.csv', index_label='model')  # keep the model names stored in the index

## Visualization

In [None]:
import plotly.express as px  # the standalone plotly_express package is now part of plotly

In [None]:
px.bar(all_compared, x=all_compared.index, y='MSE', 
        title="MSE of the different models")

In [None]:
px.bar(all_compared, x=all_compared.index, y='MAE', 
        title="MAE of the different models")