# Wine Points Prediction

Submission Date : 3.6.2023
Task: Predict the wine score given the inputs
Instructions:
 * Use logistic regression as benchmark model
 * Use sklearn pipelines + cv + grid search with sklearn models (e.g. KNNs, RandomForest, etc.)
 * Compare all models on proper metric (your choice)

For DNN course project:
* Use sklearn pipelines with tensorflow models (with/without embeddings, LSTMs, RNNs, Transformers etc.)
* Compare all models on proper metric (your choice)
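
As a rough illustration of the "pipeline + cv + grid search" pattern named in the instructions, the sketch below chains a scaler and a KNN regressor and tunes it with `GridSearchCV`. The synthetic price-like feature, the target formula, the choice of `KNeighborsRegressor`, and the parameter grid are all assumptions made for the example; they are not the notebook's data or its chosen models.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): one numeric feature, noisy target.
rng = np.random.default_rng(0)
X = rng.uniform(5, 100, size=(200, 1))
y = 80 + 3 * np.log(X[:, 0]) + rng.normal(0, 1, size=200)

# Preprocessing and model live in one pipeline so that scaling is
# re-fit inside every cross-validation fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsRegressor()),
])

# Grid keys use the "<step name>__<parameter>" convention.
search = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": [3, 5, 11]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)

print(search.best_params_)
print("CV MSE:", -search.best_score_)
```

Other sklearn estimators can be swapped into the `"knn"` step and compared on the same scoring metric.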

In [2]:
%load_ext autoreload
%autoreload 2
import sys; sys.path.append('../')

Here we try to predict the points a wine will receive from its known characteristics (i.e. features, in ML terminology). The main point at this stage is to establish a simple, ideally very cost-effective, baseline against which models with increased complexity and resource demands will be compared.

In the real world there is a tradeoff between complexity and performance, and part of the data scientist's job is to present a tradeoff table showing what performance is achievable at each complexity level. Complexity should then be translated into cost. For example:
 * Compute cost 
 * Maintenance cost
 * Serving costs (i.e. is new platform needed?) 
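
One way such a tradeoff table could be assembled is sketched below: each candidate is fitted, timed, and scored, and the result becomes one row. The helper name `tradeoff_row`, the toy numbers, and the use of fit time as a stand-in for compute cost are assumptions for illustration; the only candidate shown is the cheapest possible baseline, which always predicts the mean training score.

```python
import time
from statistics import mean


def tradeoff_row(name, fit, predict, X_train, y_train, X_test, y_test):
    """Fit one candidate and return one row of a performance/cost table."""
    start = time.perf_counter()
    fitted = fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start
    preds = predict(fitted, X_test)
    test_mse = mean((p - t) ** 2 for p, t in zip(preds, y_test))
    return {"model": name, "test_mse": test_mse, "fit_seconds": fit_seconds}


# Toy values for illustration only (not drawn from the wine dataset).
X_train, y_train = [[10.0], [20.0], [30.0]], [85, 88, 91]
X_test, y_test = [[15.0], [25.0]], [86, 90]

# Baseline: ignore the features and always predict the mean training score.
rows = [
    tradeoff_row(
        "mean baseline",
        fit=lambda X, y: mean(y),
        predict=lambda fitted, X: [fitted] * len(X),
        X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test,
    )
]

for row in rows:
    print(row)
```

More complex models would be appended to `rows` with the same call, so that every candidate is measured in the same way.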
 

## Loading the data

In [3]:
import pandas as pd
import cufflinks as cf; cf.go_offline()

In [44]:
wine_reviews = pd.read_csv("clean_wine_reviews_data.csv")
summary_df = pd.read_csv("summary_df.csv")
wine_reviews.shape

wine_reviews['desc_len'] = wine_reviews.description.str.len()

In [45]:
wine_reviews.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,year,wine_category,desc_len
0,0,Italy,Aromas include tropical fruit broom brimstone ...,Vulkà Bianco,87,39.928286,Sicily & Sardinia,Etna,Unknown,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,2013.0,White,152
1,1,Portugal,This ripe fruity wine smooth still structured ...,Avidagos,87,15.0,Douro,Unknown,Unknown,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,2011.0,Red,160
2,2,US,Tart snappy flavors lime flesh rind dominate S...,Unknown,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,2013.0,White,149
3,3,US,Pineapple rind lemon pith orange blossom start...,Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,Unknown,Alexander Peartree,Unknown,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,2013.0,White,155
4,4,US,Much like regular bottling 2012 comes across r...,Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,2012.0,Red,185


# Embedding

In [54]:

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Input, Dense, Embedding, Concatenate, Flatten, TextVectorization, Dropout
from tensorflow.keras.optimizers import Adam
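from tensorflow.keras.losses import MeanSquaredError
# The metric class has the same name as the loss, so it is aliased to avoid shadowing
from tensorflow.keras.metrics import MeanSquaredError as MeanSquaredErrorMetric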
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical, plot_model, pad_sequences
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras import activations

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder


In [55]:
wr_work = wine_reviews[['description','country','price','province','region_1','variety','winery','year','points']]
wr_work.head()

categorical_cols = ['country', 'province', 'region_1', 'variety', 'winery', 'year']
numerical_cols = ['price']

max_desc_len = max(wine_reviews.desc_len)
max_desc_len

tokenizer_1000 = Tokenizer(num_words=1000)
tokenizer_1000.fit_on_texts(wr_work.description)
desc_1000 = tokenizer_1000.texts_to_sequences(wr_work.description)
desc_1000_max = pad_sequences(desc_1000, maxlen=max_desc_len)
desc_1000_60 = pad_sequences(desc_1000, maxlen=60)
desc_1000_60

tokenizer_5000 = Tokenizer(num_words=5000)
tokenizer_5000.fit_on_texts(wr_work.description)
desc_5000 = tokenizer_5000.texts_to_sequences(wr_work.description)

desc_5000_max = pad_sequences(desc_5000, maxlen=max_desc_len)
desc_5000_60 = pad_sequences(desc_5000, maxlen=60)
desc_5000_60

x_train, x_test, y_train, y_test = train_test_split(
    wr_work[categorical_cols + numerical_cols], wr_work.points,
    test_size=0.25, shuffle=True, random_state=78)


desc_1000_max_train, desc_1000_max_test = train_test_split(desc_1000_max, test_size = 0.25, shuffle = True, random_state = 78)
desc_1000_60_train, desc_1000_60_test = train_test_split(desc_1000_60, test_size = 0.25, shuffle = True, random_state = 78)
desc_5000_max_train, desc_5000_max_test = train_test_split(desc_5000_max, test_size = 0.25, shuffle = True, random_state = 78)
desc_5000_60_train, desc_5000_60_test = train_test_split(desc_5000_60, test_size = 0.25, shuffle = True, random_state = 78)

desc_words = [1000, 5000]
desc_len = [max_desc_len, 60]
dense_activations = ['relu', 'sigmoid']
dense_units_1 =  [8, 16, 32, 64, 128]
dense_units_2 = [4, 8, 16, 32, 64]
model_1_results_df = pd.DataFrame(columns = ['parameters', 'train_MSE', 'test_MSE'])

from itertools import product

# Pre-split padded descriptions, keyed by (vocabulary size, sequence length)
desc_splits = {
    (1000, max_desc_len): (desc_1000_max_train, desc_1000_max_test),
    (1000, 60): (desc_1000_60_train, desc_1000_60_test),
    (5000, max_desc_len): (desc_5000_max_train, desc_5000_max_test),
    (5000, 60): (desc_5000_60_train, desc_5000_60_test),
}

for a, b, c, d, e in product(desc_words, desc_len, dense_activations, dense_units_1, dense_units_2):
    params = {'desc_words': a, 'desc_len': b, 'activation': c, 'units layer 1': d, 'units layer 2': e}
    # Skip configurations whose second dense layer is wider than the first
    if e > d:
        print(f'Passing Parameters: {params}')
        continue

    # Text-only model: embedding -> flatten -> two dense layers -> linear output
    input_1 = Input(shape=(b,))
    embedding_1 = Embedding(input_dim=a, output_dim=10)(input_1)
    flatten_1 = Flatten()(embedding_1)
    dense_1a = Dense(units=d, activation=c)(flatten_1)
    drop_1 = Dropout(0.5)(dense_1a)
    dense_1b = Dense(units=e, activation=c)(drop_1)
    output_1 = Dense(units=1, activation='linear')(dense_1b)
    model_1 = Model(inputs=[input_1], outputs=output_1)

    model_1.compile(optimizer='adam', loss='mean_squared_error')

    # Pick the padded sequences that match this vocabulary size and length
    x_train_1, x_test_1 = desc_splits[(a, b)]

    print(f'Fitting Model 1, Parameters: {params}')
    model_1.fit(x_train_1, y_train,
                batch_size=32,
                epochs=10,
                callbacks=[EarlyStopping(monitor='val_loss', patience=3)],
                workers=8,
                verbose=0,
                validation_data=(x_test_1, y_test))
    print(f'Evaluating Model 1, Parameters: {params}')
    train_MSE = model_1.evaluate(x_train_1, y_train, verbose=2)
    test_MSE = model_1.evaluate(x_test_1, y_test, verbose=2)

    model_1_results_df.loc[len(model_1_results_df.index)] = [params, train_MSE, test_MSE]

Fitting Model 1, Parameters: {'desc_words': 1000, 'desc_len': 626, 'activation': 'relu', 'units layer 1': 8, 'units layer 2': 4}
Evaluating Model 1, Parameters: {'desc_words': 1000, 'desc_len': 626, 'activation': 'relu', 'units layer 1': 8, 'units layer 2': 4}
2624/2624 - 4s - loss: 3920.1030 - 4s/epoch - 2ms/step
875/875 - 1s - loss: 3920.4141 - 1s/epoch - 2ms/step
Fitting Model 1, Parameters: {'desc_words': 1000, 'desc_len': 626, 'activation': 'relu', 'units layer 1': 8, 'units layer 2': 8}
Evaluating Model 1, Parameters: {'desc_words': 1000, 'desc_len': 626, 'activation': 'relu', 'units layer 1': 8, 'units layer 2': 8}
2624/2624 - 4s - loss: 1434.7241 - 4s/epoch - 2ms/step
875/875 - 1s - loss: 1435.3904 - 1s/epoch - 2ms/step
Passing Parameters: {'desc_words': 1000, 'desc_len': 626, 'activation': 'relu', 'units layer 1': 8, 'units layer 2': 16}
Passing Parameters: {'desc_words': 1000, 'desc_len': 626, 'activation': 'relu', 'units layer 1': 8, 'units layer 2': 32}
Passing Parameters: 
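
The loop stores every configuration in `model_1_results_df`, but the cell never ranks them. A minimal sketch of how the results could be inspected is shown below. It rebuilds a two-row frame from the two evaluations logged above, reading the first logged loss of each pair as train MSE and the second as test MSE, which is the order of the `evaluate` calls. It then expands the `parameters` dicts into columns and adds an RMSE column so the error reads in points. The names `results`, `expanded`, and `ranked` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Two rows copied from the logged output above; a full run would have more.
results = pd.DataFrame({
    "parameters": [
        {"desc_words": 1000, "desc_len": 626, "activation": "relu",
         "units layer 1": 8, "units layer 2": 4},
        {"desc_words": 1000, "desc_len": 626, "activation": "relu",
         "units layer 1": 8, "units layer 2": 8},
    ],
    "train_MSE": [3920.1030, 1434.7241],
    "test_MSE": [3920.4141, 1435.3904],
})

# Expand the parameter dicts into one column per hyperparameter.
expanded = pd.concat(
    [pd.DataFrame(results["parameters"].tolist()),
     results[["train_MSE", "test_MSE"]]],
    axis=1,
)
expanded["test_RMSE"] = np.sqrt(expanded["test_MSE"])

# Best configuration first.
ranked = expanded.sort_values("test_MSE").reset_index(drop=True)
print(ranked)
```

On these two rows the better test RMSE is still roughly 38 points, which is large next to the scores of 87 visible in `wine_reviews.head()`, so these configurations have likely not converged within 10 epochs. Because early stopping in the loop monitors the same split that is reported as `test_MSE`, those figures are also not a fully held-out estimate.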

In [40]:
max_words = [1000, 2000, 3000]
max_len = [10, 20, 30, 40, 50, 60]
activation_methods = ['relu', 'sigmoid','tanh']
layer_1_dim =  [4, 8, 16, 32, 64]
layer_2_dim = [2, 4, 8, 16, 32]
output_dim = 16

max_length=max(len(seq) for seq in wine_reviews['description']) 

y = pd.DataFrame(wine_reviews, columns=['points'])
x = pd.DataFrame(wine_reviews, columns=['description'])

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=29)                     
                     
text_input_1 = Input(shape=(max_length,1))

print(text_input_1)


                     
for m_words in max_words:
    for m_len in max_len:
        for l1_dim in layer_1_dim:
            for l2_dim in layer_2_dim:
                for act in activation_methods:  
                    embedding_1 = Embedding(input_dim=m_len, output_dim=626)(text_input_1)
                    tokenizer = Tokenizer(num_words=m_words)
                    tokenizer.fit_on_texts(wine_reviews['description'])
                    X_text1 = tokenizer.texts_to_sequences(wine_reviews['description'])
                    X_text1 = pad_sequences(X_text1, maxlen=max_length)

                    flatten_1 = Flatten()(embedding_1)
                    
                    x = Dense(l1_dim, activation=act)(flatten_1)
                    x = Dense(l2_dim, activation=act)(x)
                    output = Dense(1, activation=act)(x)

                    model = Model(inputs=[text_input_1], outputs=output)
                    
                    plot_model(model, show_dtype=True, show_shapes=True, show_layer_names=True, to_file='model_1.png')

                    model.compile(optimizer=Adam(learning_rate=0.001),
                                  loss=BinaryCrossentropy(),
                                  metrics=[Accuracy()])
                    

                    # Train the model
                    model.fit(X_train, y_train,
                              batch_size=32,
                              epochs=10,
                              validation_data=(X_test, y_test))
                    max_length= math.sqrt(max_length)

# Evaluate the model
loss, accuracy = model.evaluate([X_test], y_test)
print('Test Loss:', loss)
print('Test Accuracy:', accuracy)


A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().



KerasTensor(type_spec=TensorSpec(shape=(None, 626, 1), dtype=tf.float32, name='input_27'), name='input_27', description="created by layer 'input_27'")
You must install pydot (`pip install pydot`) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) for plot_model to work.
Epoch 1/10


ValueError: in user code:

    File "C:\Users\lior\anaconda3\lib\site-packages\keras\engine\training.py", line 1284, in train_function  *
        return step_function(self, iterator)
    File "C:\Users\lior\anaconda3\lib\site-packages\keras\engine\training.py", line 1268, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "C:\Users\lior\anaconda3\lib\site-packages\keras\engine\training.py", line 1249, in run_step  **
        outputs = model.train_step(data)
    File "C:\Users\lior\anaconda3\lib\site-packages\keras\engine\training.py", line 1050, in train_step
        y_pred = self(x, training=True)
    File "C:\Users\lior\anaconda3\lib\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
        raise e.with_traceback(filtered_tb) from None
    File "C:\Users\lior\anaconda3\lib\site-packages\keras\engine\input_spec.py", line 280, in assert_input_compatibility
        raise ValueError(

    ValueError: Exception encountered when calling layer 'model_19' (type Functional).
    
    Input 0 of layer "dense_58" is incompatible with the layer: expected axis -1 of input shape to have value 391876, but received input with shape (None, 626)
    
    Call arguments received by layer 'model_19' (type Functional):
      • inputs=tf.Tensor(shape=(None, 1), dtype=string)
      • training=True
      • mask=None
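
The traceback can be read from the shapes. `text_input_1` is declared with shape `(max_length, 1)`, i.e. `(626, 1)`, and `Embedding(..., output_dim=626)` attaches a 626-wide vector to every position, so `Flatten` yields 626 × 1 × 626 = 391876 features, which is the width `dense_58` expects. Meanwhile `model.fit` is given `X_train`, the raw `description` strings (`shape=(None, 1), dtype=string`), rather than the padded sequences in `X_text1`. A minimal corrected sketch under stated assumptions follows. It uses `TextVectorization` in place of `Tokenizer` + `pad_sequences`, treats points as a regression target (linear output, MSE loss) instead of label-encoding them, and sets `input_dim` to the vocabulary size. The toy texts are shortened descriptions from the table above; `max_words`, `max_len`, `embed_dim`, and the layer sizes are illustrative picks, not tuned values.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Toy inputs (assumption): shortened descriptions and their points.
texts = [
    "Aromas include tropical fruit broom brimstone",
    "This ripe fruity wine smooth still structured",
    "Tart snappy flavors lime flesh rind dominate",
    "Pineapple rind lemon pith orange blossom start",
]
points = np.array([87.0, 87.0, 87.0, 87.0], dtype="float32")

max_words, max_len, embed_dim = 1000, 10, 16  # illustrative values

# Map raw strings to fixed-length integer sequences before the model sees them.
vectorizer = layers.TextVectorization(
    max_tokens=max_words, output_sequence_length=max_len
)
vectorizer.adapt(texts)
X = vectorizer(tf.constant(texts)).numpy()  # shape (n_samples, max_len)

model = models.Sequential([
    layers.Input(shape=(max_len,)),            # one token id per position
    layers.Embedding(input_dim=max_words,      # vocabulary size
                     output_dim=embed_dim),    # embedding width
    layers.Flatten(),
    layers.Dense(16, activation="relu"),
    layers.Dense(8, activation="relu"),
    layers.Dense(1, activation="linear"),      # regression output
])
model.compile(optimizer="adam", loss="mean_squared_error")
model.fit(X, points, batch_size=2, epochs=1, verbose=0)

preds = model.predict(X, verbose=0)
print(X.shape, preds.shape)
```

With this layout `Flatten` produces `max_len * embed_dim` features, so the first dense layer's input width follows directly from the two hyperparameters being searched.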
