Introduction to notebook

First objective: find a simpler way to read in the data without loading all of it into memory.
The file is only about 50MB, so loading it whole is probably not so bad. Let's proceed for now without using readers.
I am going to follow the general structure of https://towardsdatascience.com/perfume-recommendations-using-natural-language-processing-ad3e6736074c
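
If the file ever outgrows memory, pandas can stream it in pieces through the `chunksize` argument of `read_csv`. The following is a minimal sketch of that idea, not something this notebook uses: `count_rows_in_chunks` is a hypothetical helper, and the small in-memory CSV is an assumed stand-in for the real file.

```python
import io

import pandas as pd


def count_rows_in_chunks(source, usecols, chunksize=10_000):
    """Stream a CSV in fixed-size chunks and return the total row count."""
    total = 0
    # With chunksize set, read_csv returns an iterator of DataFrames
    # instead of one large DataFrame.
    for chunk in pd.read_csv(source, usecols=usecols, chunksize=chunksize):
        total += len(chunk)  # replace with real per-chunk processing
    return total


# Hypothetical sample data standing in for the wine reviews file.
sample_csv = io.StringIO(
    "country,description,points\n"
    "Italy,Aromas include tropical fruit,87\n"
    "Portugal,This is ripe and fruity,87\n"
    "US,Tart and snappy,87\n"
)
n_rows = count_rows_in_chunks(
    sample_csv, usecols=["description", "points"], chunksize=2
)
print(n_rows)
```

The same function should accept a file path in place of the `StringIO` buffer, since `read_csv` takes either.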


## Loading dependencies

In [13]:
import pandas as pd
import pickle
import os
from IPython.display import display, HTML
from nltk.stem import SnowballStemmer
from gensim.parsing.preprocessing import remove_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [5]:
WINE_SRC_FILENAME = os.path.join(
    "data", "wine-reviews", "winemag-data-130k-v2.csv")

In [6]:
col_list = ['description', 'country', 'points', 'price', 'title']
df = pd.read_csv(WINE_SRC_FILENAME, usecols=col_list)

## Cleaning the data

In [9]:
print(df.head())

    country                                        description  points  price  \
0     Italy  Aromas include tropical fruit, broom, brimston...      87    NaN   
1  Portugal  This is ripe and fruity, a wine that is smooth...      87   15.0   
2        US  Tart and snappy, the flavors of lime flesh and...      87   14.0   
3        US  Pineapple rind, lemon pith and orange blossom ...      87   13.0   
4        US  Much like the regular bottling from 2012, this...      87   65.0   

                                               title  
0                  Nicosia 2013 Vulkà Bianco  (Etna)  
1      Quinta dos Avidagos 2011 Avidagos Red (Douro)  
2      Rainstorm 2013 Pinot Gris (Willamette Valley)  
3  St. Julian 2013 Reserve Late Harvest Riesling ...  
4  Sweet Cheeks 2012 Vintner's Reserve Wild Child...  


In [10]:
# TODO: consider also adding wine-specific stopwords
df['reviews_new'] = df.description.str.lower().apply(remove_stopwords)
print(df.reviews_new.head())

0    aromas include tropical fruit, broom, brimston...
1    ripe fruity, wine smooth structured. firm tann...
2    tart snappy, flavors lime flesh rind dominate....
3    pineapple rind, lemon pith orange blossom star...
4    like regular bottling 2012, comes rough tannic...
Name: reviews_new, dtype: object
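
One way the wine-specific stopword idea could look is a second pass that drops domain words after the generic stopwords are gone. This is only a sketch: the word list is a hypothetical illustration and has not been tuned on the data, and `remove_wine_stopwords` is an assumed helper name, not part of gensim.

```python
# Hypothetical wine-domain stopwords; chosen for illustration only.
WINE_STOPWORDS = {"wine", "wines", "drink", "flavors", "aromas"}


def remove_wine_stopwords(text, extra=WINE_STOPWORDS):
    """Drop whitespace-separated tokens whose punctuation-stripped form is in `extra`."""
    kept = [tok for tok in text.split() if tok.strip(".,;:!?") not in extra]
    return " ".join(kept)


example = remove_wine_stopwords("ripe fruity, wine smooth structured.")
print(example)
```

It could then be chained onto the existing column with `df['reviews_new'].apply(remove_wine_stopwords)`. Another option would be to pass a word list to the `stop_words` parameter of `TfidfVectorizer`.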


In [15]:
# Fit TF-IDF
# Learn the vocabulary and IDF weights from all wine reviews.
tf = TfidfVectorizer(analyzer='word',
                     min_df=10,
                     ngram_range=(1, 1))
tf.fit(df['reviews_new'])

# Transform the reviews into a document-term matrix.
tfidf_matrix = tf.transform(df['reviews_new'])
# Make sure the output directory exists, and close the file after writing.
os.makedirs("models", exist_ok=True)
with open("models/tfidf_model.pkl", "wb") as f:
    pickle.dump(tf, f)

print(tfidf_matrix.shape)

(129971, 8832)


In [17]:
# Lower the dimensionality of the matrix
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=500)
latent_matrix = svd.fit_transform(tfidf_matrix)
# Use a context manager so the file handle is closed after writing.
with open("models/svd_model.pkl", "wb") as f:
    pickle.dump(svd, f)

In [19]:
n = 25  # number of SVD components to keep
# Use an elbow plot and a cumulative explained-variance plot to pick the number of components.
# A high amount of explained variance is needed.
doc_labels = df.title
svd_feature_matrix = pd.DataFrame(latent_matrix[:, 0:n], index=doc_labels)
print(svd_feature_matrix.shape)
svd_feature_matrix.head()

with open("models/lsa_embeddings.pkl", "wb") as f:
    pickle.dump(svd_feature_matrix, f)

(129971, 25)
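
The cumulative-variance check mentioned in the comments could be sketched roughly as follows. This assumes the per-component ratios come from the fitted model's `explained_variance_ratio_` attribute. `components_for_variance`, the 0.8 target, and the toy ratios are hypothetical and are not results from this data.

```python
import numpy as np


def components_for_variance(explained_variance_ratio, target=0.8):
    """Return the smallest component count whose cumulative explained
    variance ratio reaches `target`, or None if it is never reached."""
    cumulative = np.cumsum(explained_variance_ratio)
    # Index of the first cumulative value that is >= target.
    idx = int(np.searchsorted(cumulative, target))
    return idx + 1 if idx < len(cumulative) else None


# Toy ratios for illustration only.
ratios = np.array([0.5, 0.25, 0.125, 0.0625])
n_for_80 = components_for_variance(ratios, target=0.8)
n_for_99 = components_for_variance(ratios, target=0.99)
print(n_for_80, n_for_99)
```

With the fitted model this would be called as `components_for_variance(svd.explained_variance_ratio_)`. Plotting the array returned by `np.cumsum(svd.explained_variance_ratio_)` gives the cumulative curve to inspect for an elbow. A `None` result means the fitted components do not reach the target at all.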
