Introduction to notebook

First objective: find a simpler way to read in the data than loading all of it into memory.
At roughly 50MB the file is probably small enough to load whole, so let's proceed for now without using readers.
I am going to follow the general structure of https://towardsdatascience.com/perfume-recommendations-using-natural-language-processing-ad3e6736074c
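
If the file ever does outgrow memory, a streaming reader is one way to go. The sketch below is a minimal illustration using the standard-library `csv` module; the `count_varieties` helper and the in-memory sample are hypothetical stand-ins, not part of this notebook.

```python
import csv
import io
from collections import Counter

def count_varieties(lines):
    """Stream rows one at a time and tally the 'variety' column."""
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row["variety"]] += 1
    return counts

# Hypothetical in-memory sample standing in for the real CSV file.
sample = io.StringIO(
    "description,variety\n"
    "Tart and snappy,Pinot Gris\n"
    "Ripe and fruity,Portuguese Red\n"
    "Crisp lime notes,Pinot Gris\n"
)
variety_counts = count_varieties(sample)
print(variety_counts.most_common(1))
```

With a real file, the same function could be given an open file handle, so only one row is held in memory at a time.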


## Loading dependencies

In [2]:
import pandas as pd
import pickle
import os
from IPython.display import display, HTML
from nltk.stem import SnowballStemmer
from gensim.parsing.preprocessing import remove_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [8]:
WINE_SRC_FILENAME = os.path.join(
    "data", "wine-reviews", "1k_train.csv")

In [59]:
col_list=['description', 'variety']
df = pd.read_csv(WINE_SRC_FILENAME, usecols=col_list)

## Cleaning the data

In [10]:
print(df.head())

                                         description         variety
0  Aromas include tropical fruit, broom, brimston...     White Blend
1  This is ripe and fruity, a wine that is smooth...  Portuguese Red
2  Tart and snappy, the flavors of lime flesh and...      Pinot Gris
3  Pineapple rind, lemon pith and orange blossom ...        Riesling
4  Much like the regular bottling from 2012, this...      Pinot Noir


In [24]:
# TODO: consider also adding wine-specific stopwords
print(df.description[0])
df['reviews_new'] = df.description.str.lower().apply(remove_stopwords)
print(df.reviews_new[0])

Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.
aromas include tropical fruit, broom, brimstone dried herb. palate isn't overly expressive, offering unripened apple, citrus dried sage alongside brisk acidity.
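
The note about wine-specific stopwords could be sketched roughly as follows. The word list is a hypothetical guess at terms that appear in most reviews, and the whitespace tokenization is a simplification. It would run after the generic stopword pass above.

```python
# Hypothetical domain stopwords; not taken from the notebook.
WINE_STOPWORDS = {"wine", "flavors", "aromas", "palate", "drink"}

def remove_wine_stopwords(text, stopwords=WINE_STOPWORDS):
    """Drop tokens whose punctuation-stripped form is in `stopwords`."""
    kept = [
        token for token in text.split()
        if token.strip(".,;:!?") not in stopwords
    ]
    return " ".join(kept)

cleaned = remove_wine_stopwords(
    "aromas include tropical fruit, broom, brimstone dried herb."
)
print(cleaned)
```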


In [7]:
# Fit TF-IDF
# Learn the vocabulary and IDF weights from all wine reviews.
tf = TfidfVectorizer(analyzer='word',
                     min_df=10,
                     ngram_range=(1, 1))
tf.fit(df['reviews_new'])

# Transform the reviews into a document-term matrix.
tfidf_matrix = tf.transform(df['reviews_new'])
with open("models/tfidf_model.pkl", "wb") as f:
    pickle.dump(tf, f)

print(tfidf_matrix.shape)

KeyboardInterrupt: 
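
To make the effect of `min_df` concrete, here is a small pure-Python sketch of the document-frequency filter. It assumes plain whitespace tokenization, which is simpler than what `TfidfVectorizer` does, and the `vocabulary_with_min_df` helper is hypothetical.

```python
from collections import Counter

def vocabulary_with_min_df(documents, min_df):
    """Keep terms that occur in at least `min_df` distinct documents."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

docs = ["crisp citrus acidity", "ripe citrus fruit", "citrus fruit fruit"]
vocab = vocabulary_with_min_df(docs, min_df=2)
print(vocab)
```

Repeated uses of a term inside one review do not help it pass the threshold; only the number of distinct reviews matters.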

In [17]:
# Lower the dimensionality of the matrix
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=500)
latent_matrix = svd.fit_transform(tfidf_matrix)
with open("models/svd_model.pkl", "wb") as f:
    pickle.dump(svd, f)

In [19]:
n = 25  # number of components to keep
# Use an elbow plot and a cumulative explained-variance plot to pick the number of components.
# A high amount of variance should be explained.
# Note: df.title requires the 'title' column, which col_list above does not load.
doc_labels = df.title
svd_feature_matrix = pd.DataFrame(latent_matrix[:, 0:n], index=doc_labels)
print(svd_feature_matrix.shape)
svd_feature_matrix.head()

with open("models/lsa_embeddings.pkl", "wb") as f:
    pickle.dump(svd_feature_matrix, f)

(129971, 25)
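
The cumulative-variance idea in the comments above can be sketched without plotting. The ratios below are hypothetical; in practice they would come from something like the fitted model's `explained_variance_ratio_` attribute, and the `components_for_variance` helper is an illustration only.

```python
from itertools import accumulate

def components_for_variance(ratios, target):
    """Return the smallest k whose first k ratios sum to at least `target`."""
    for k, total in enumerate(accumulate(ratios), start=1):
        if total >= target:
            return k
    return None  # target not reachable with the components given

# Hypothetical ratios, sorted from largest to smallest.
ratios = [0.30, 0.20, 0.15, 0.10, 0.05]
k = components_for_variance(ratios, target=0.60)
print(k)
```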


This is where we try running the model to see what interesting results we get.

In [18]:
with open("models/tfidf_model.pkl", "rb") as f:
    tf = pickle.load(f)
with open("models/svd_model.pkl", "rb") as f:
    svd = pickle.load(f)
with open("models/lsa_embeddings.pkl", "rb") as f:
    svd_feature_matrix = pickle.load(f)

def get_similarity_scores(message_array, embeddings):
    cosine_sim_matrix = pd.DataFrame(cosine_similarity(X=embeddings,
                                                       Y=message_array,
                                                       dense_output=True))
    cosine_sim_matrix.set_index(embeddings.index, inplace=True)
    cosine_sim_matrix.columns = ["cosine_similarity"]
    return cosine_sim_matrix

def get_similarity(message):
    message_array = tf.transform([message]).toarray()
    message_array = svd.transform(message_array)
    message_array = message_array[:,0:25].reshape(1, -1)
    
    bow_similarity = get_similarity_scores(message_array, svd_feature_matrix)
    return bow_similarity

def query_wines(message):
    similar_wines = get_similarity(message)

    # Note: head(10) returns the first 10 rows in index order, not the 10 highest scores.
    return similar_wines.head(10)
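
Since a recommender usually wants the closest matches first, here is a minimal pure-Python sketch of ranking by cosine similarity. The `cosine` and `top_k` helpers and the two-dimensional embeddings are hypothetical and only illustrate the idea.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query, embeddings, k):
    """Return the k (label, score) pairs with the highest similarity to `query`."""
    scored = [(label, cosine(query, vec)) for label, vec in embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical 2-d embeddings keyed by wine title.
embeddings = {"Wine A": [1.0, 0.0], "Wine B": [0.0, 1.0], "Wine C": [1.0, 1.0]}
best = top_k([1.0, 0.1], embeddings, k=2)
print(best)
```

In the notebook's pandas setting, one way to get the same effect would be `similar_wines.sort_values("cosine_similarity", ascending=False).head(10)`.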

In [19]:
from IPython.display import display, clear_output
from ipywidgets import widgets

print("Describe the wine you are looking for. You can be as detailed as you like! ")
text = widgets.Text()
display(text)
button = widgets.Button(description="Restart!")
display(button)

def on_button_clicked(b):
    clear_output()
    print("Describe the wine you are looking for. You can be as detailed as you like! ")
    display(text)
    display(button)

def handle_submit(sender):
    print("Got it! Hold tight while I find your recommendations!")
    message = text.value
    recs = query_wines(message)
    print(recs)

text.on_submit(handle_submit)
button.on_click(on_button_clicked)


Describe the wine you are looking for. You can be as detailed as you like! 


Text(value='citrus, white, green, bright')

Button(description='Restart!', style=ButtonStyle())

Got it! Hold tight while I find your recommendations!
                                                    cosine_similarity
title                                                                
Nicosia 2013 Vulkà Bianco  (Etna)                            0.338419
Quinta dos Avidagos 2011 Avidagos Red (Douro)               -0.020524
Rainstorm 2013 Pinot Gris (Willamette Valley)                0.418326
St. Julian 2013 Reserve Late Harvest Riesling (...           0.029545
Sweet Cheeks 2012 Vintner's Reserve Wild Child ...           0.128369
Tandem 2011 Ars In Vitro Tempranillo-Merlot (Na...           0.299210
Terre di Giurfo 2013 Belsito Frappato (Vittoria)             0.486592
Trimbach 2012 Gewurztraminer (Alsace)                        0.027913
Heinz Eifel 2013 Shine Gewürztraminer (Rheinhes...           0.103685
Jean-Baptiste Adam 2012 Les Natures Pinot Gris ...           0.323748

Totally new approach here: we are going to run a simple test based on the contextual reps notebook and see how a normal classifier does, following along with its sample code.

In [1]:
import os
import sst
from torch_shallow_neural_classifier import TorchShallowNeuralClassifier
from torch_rnn_classifier import TorchRNNClassifier, TorchRNNClassifierModel
from sklearn.metrics import classification_report
import utils

In [4]:
# BERT related imports
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

In [5]:
hf_weights_name = 'bert-base-cased'

In [6]:
hf_tokenizer = BertTokenizer.from_pretrained(hf_weights_name)

In [7]:
hf_model = BertModel.from_pretrained(hf_weights_name)

In [11]:
hf_example_ids = hf_tokenizer.batch_encode_plus(
    df.description, 
    add_special_tokens=True,
    return_attention_mask=True,
    pad_to_max_length=True)

In [17]:
print(len(hf_example_ids['input_ids'][0]))  # 163: every sequence is padded to the token count of the longest description
hf_tokenizer.convert_ids_to_tokens(hf_example_ids['input_ids'][0])

163


['[CLS]',
 'A',
 '##roma',
 '##s',
 'include',
 'tropical',
 'fruit',
 ',',
 'br',
 '##oom',
 ',',
 'br',
 '##ims',
 '##tone',
 'and',
 'dried',
 'herb',
 '.',
 'The',
 'p',
 '##ala',
 '##te',
 'isn',
 "'",
 't',
 'overly',
 'expressive',
 ',',
 'offering',
 'un',
 '##rip',
 '##ened',
 'apple',
 ',',
 'c',
 '##itrus',
 'and',
 'dried',
 'sage',
 'alongside',
 'br',
 '##isk',
 'acid',
 '##ity',
 '.',
 '[SEP]',
 '[PAD]',
 ... (remaining entries, up to index 162, are all '[PAD]')
 '[PAD]']
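
The long run of `[PAD]` tokens comes from padding every description to a common length. Below is a minimal sketch of what padding plus an attention mask amounts to, written without the tokenizer. The `pad_batch` helper is hypothetical, and the ids 101, 102 and 0 for `[CLS]`, `[SEP]` and `[PAD]` are assumptions for illustration.

```python
def pad_batch(batch, pad_id=0):
    """Right-pad each id list to the longest length and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return input_ids, attention_mask

# Hypothetical token ids; 101 and 102 are assumed [CLS]/[SEP] ids.
input_ids, attention_mask = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(input_ids)
print(attention_mask)
```

The mask marks real tokens with 1 and padding with 0, so a model can ignore the padded positions.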

In [18]:
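def hugging_face_bert_phi(description):
    # Encode one description, adding the [CLS] and [SEP] special tokens.
    input_ids = hf_tokenizer.encode(description, add_special_tokens=True)
    X = torch.tensor([input_ids])
    with torch.no_grad():
        # Index 0 of the model output holds the final hidden states; integer
        # indexing works for both tuple and ModelOutput return types.
        final_hidden_states = hf_model(X)[0]
    # Drop the batch dimension: (1, seq_len, hidden) -> (seq_len, hidden).
    return final_hidden_states.squeeze(0).numpy()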

In [20]:
def hugging_face_bert_classifier_phi(description):
    reps = hugging_face_bert_phi(description)
    # return reps.mean(axis=0)  # Another good, easy option: mean over all tokens.
    return reps[0]  # representation at the [CLS] position
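
The two pooling options mentioned in the cell above differ only in how the per-token rows are collapsed into one vector. Here is a toy sketch on plain lists; the values are hypothetical, and `cls_pool` and `mean_pool` are illustrative names that do not appear elsewhere in this notebook.

```python
def cls_pool(reps):
    """Use the first row, i.e. the vector at the [CLS] position."""
    return reps[0]

def mean_pool(reps):
    """Average every column across all token rows."""
    n_tokens = len(reps)
    return [sum(column) / n_tokens for column in zip(*reps)]

# Hypothetical 3-token, 2-dimensional output; real BERT-base vectors have 768 dims.
reps = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(cls_pool(reps))
print(mean_pool(reps))
```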

In [60]:
y = df.pop('variety')
X = df

In [66]:
# Let's use the first 4000 examples for train and the rest for dev
training_size = 4000
y_hf_train = y[:training_size]
y_hf_dev = y[training_size:]
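
The slice above splits the rows in file order. If the file happens to be ordered in some way, a shuffled split may be preferable. The sketch below is one possible approach using the standard library; the `shuffled_split` helper and the sizes are hypothetical.

```python
import random

def shuffled_split(n_rows, train_size, seed=0):
    """Return (train_idx, dev_idx) after a reproducible shuffle of range(n_rows)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return indices[:train_size], indices[train_size:]

train_idx, dev_idx = shuffled_split(n_rows=10, train_size=7)
print(len(train_idx), len(dev_idx))
```

The same index lists would need to be applied to both the features and the labels (for example with `.iloc`) so that the two stay aligned.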

In [52]:
# Do not run these! Very expensive
%time X_hf_train = [hugging_face_bert_classifier_phi(description) for description in X[:training_size].description]

CPU times: user 21min 27s, sys: 14.8 s, total: 21min 41s
Wall time: 5min 31s


In [53]:
# Do not run these! Very expensive
%time X_hf_dev = [hugging_face_bert_classifier_phi(description) for description in X[training_size:].description]

CPU times: user 2min 58s, sys: 1.83 s, total: 3min
Wall time: 45.4 s
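
Because these feature-extraction cells are expensive, caching the results on disk is one way to avoid re-running them. A minimal sketch, in the same pickle style used earlier; the `load_or_compute` helper, the `fake_features` stand-in and the cache path are all hypothetical.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return the pickled object at `path`, computing and saving it if absent."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

# Demonstration with a cheap stand-in for the expensive feature extraction.
calls = []
def fake_features():
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]

cache_path = os.path.join(tempfile.mkdtemp(), "X_hf_train.pkl")
first = load_or_compute(cache_path, fake_features)
second = load_or_compute(cache_path, fake_features)  # read from disk
print(first == second, len(calls))
```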


In [54]:
hf_mod = TorchShallowNeuralClassifier(max_iter=100, hidden_dim=300)

In [67]:
%time _ = hf_mod.fit(X_hf_train, y_hf_train)

Finished epoch 100 of 100; error is 3.5912646055221558

CPU times: user 1min 17s, sys: 2.7 s, total: 1min 20s
Wall time: 12.1 s


In [68]:
hf_preds = hf_mod.predict(X_hf_dev)

In [69]:
print(classification_report(y_hf_dev, hf_preds, digits=3))

                               precision    recall  f1-score   support

                    Aglianico      0.000     0.000     0.000         3
                       Albana      0.000     0.000     0.000         1
                     Albariño      0.000     0.000     0.000         0
                      Altesse      0.000     0.000     0.000         1
                       Arinto      0.000     0.000     0.000         1
                    Assyrtico      0.000     0.000     0.000         1
                    Auxerrois      0.000     0.000     0.000         1
                      Barbera      0.000     0.000     0.000         3
                Blanc du Bois      0.000     0.000     0.000         1
                      Bonarda      0.000     0.000     0.000         1
     Bordeaux-style Red Blend      0.228     0.481     0.310        27
   Bordeaux-style White Blend      0.000     0.000     0.000         3
               Cabernet Blend      0.000     0.000     0.000         1
     

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
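
The `_warn_prf` lines are most likely scikit-learn warning that some metrics are undefined, which fits the many all-zero rows in the report: a variety that is never predicted has no defined precision. A pure-Python sketch of that edge case follows; the `precision` helper and the labels are hypothetical.

```python
def precision(y_true, y_pred, label, zero_division=0.0):
    """Precision for `label`; returns `zero_division` if it was never predicted."""
    predicted = sum(1 for p in y_pred if p == label)
    if predicted == 0:
        return zero_division
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
    return correct / predicted

y_true = ["Riesling", "Albana", "Riesling"]
y_pred = ["Riesling", "Riesling", "Riesling"]
print(precision(y_true, y_pred, "Riesling"))
print(precision(y_true, y_pred, "Albana"))
```

Recent scikit-learn releases accept a `zero_division` argument to `classification_report` that controls the value reported in this situation.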
