## Describe your favorite wine and get an *expert's* suggestions.
This notebook selects wines based on a user's description. Using Latent Dirichlet Allocation (LDA) and<br/>
130,000 descriptions that experts wrote about wine varieties, it builds<br/>
a distribution over n wine types (here I use 20 Topics/types, based on the Topics found by HdpModel).<br/>
The HdpModel infers the number of Topics from the corpus instead of taking it as a parameter.<br/>
I use the LdaModel here because it is faster than the HdpModel.

Much of this code is adapted from this great kernel [about Tweets from Elon Musk](https://www.kaggle.com/errearanhas/topic-modelling-lda-on-elon-tweets).

**Notebook includes these steps:**
* Produce an LDA model that groups wine descriptions into Types.
* Use inference on the LDA model to assign each description to its most likely Type.
* Determine sentiment in each wine description and use it to rate each wine. *not finished*
* List the wine varieties along with wine Type. 
* Build a TextBox to intake user Input: "Describe the wine you want." Determine most likely Types. 
* Build a recommendation engine to show the best varieties in the most likely Type. 

In [None]:

import pandas as pd
import numpy as np
from operator import itemgetter
from gensim import corpora, models, similarities
import gensim
from collections import OrderedDict
from nltk.corpus import stopwords
from string import punctuation

pdir = '../input/'

# load the wine reviews
df = pd.read_csv(pdir + "winemag-data-130k-v2.csv")
df.head(3)

In [None]:
# put all the wine descriptions into a list called corpus
corpus = df['description'].tolist()
corpus[0:2]

In [None]:
# remove common words and tokenize

# a set makes the "word not in stoplist" checks below much faster than a list
stoplist = set(stopwords.words('english') + list(punctuation))

texts = [[word for word in str(document).lower().split() if word not in stoplist] for document in corpus]
dictionary = corpora.Dictionary(texts)

corpus = [dictionary.doc2bow(text) for text in texts]
#corpora.MmCorpus.serialize(os.path.join(TEMP_FOLDER, 'wine.mm'), corpus)  # store to disk, for later use
#corpora.MmCorpus.serialize(pdir + 'wine.mm', corpus) # store to disk, for later use or other notebooks

tfidf = models.TfidfModel(corpus) # step 1 -- initialize a model
corpus_tfidf = tfidf[corpus]  # step 2 -- use the model to transform vectors
total_topics = 20
# now using the vectorized corpus learn a LDA model
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=total_topics)
#lda = models.HdpModel(corpus, id2word=dictionary)
# wrap the TF-IDF corpus with the LDA model: bow->tfidf->lda
# (corpus_lda is not used below; note the model itself was trained on the raw bag-of-words corpus)
corpus_lda = lda[corpus_tfidf]
#total_topics = len(lda.show_topics())
# Show the 5 most important words for 3 of the topics:
print(total_topics)
lda.show_topics(3,5)

In [None]:
#lda.save(pdir + "LDAModel-wine")  # save the model if you like

In [None]:
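# collect the top words of every Topic; the second argument of show_topic is the number of words to return
data_lda = {i: OrderedDict(lda.show_topic(i,total_topics)) for i in range(total_topics)}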
#data_lda
df_lda = pd.DataFrame(data_lda)
print(df_lda.shape)
df_lda = df_lda.fillna(0).T
print(df_lda.shape)
df_lda.head(5)

In [None]:
row = 10007
print(df.loc[row,'variety'])
dt = lda[corpus[row]]
dt

Above, the sommelier's review at row 10007 is of a Chardonnay.<br/>
It fits several of the discovered Topics. Each pair in the output is (Topic, p(Topic)).<br/>
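
As a small illustration of reading that output, the sketch below picks the most likely Topic from a list of (Topic, probability) pairs, using the same `max(..., key=itemgetter(1))` idiom that the next cell applies to every review. The pairs in `example_dt` are made-up values, not output from the model.

```python
from operator import itemgetter

# Made-up (Topic, probability) pairs in the shape that lda[bow] returns.
example_dt = [(2, 0.12), (7, 0.55), (13, 0.30)]

# itemgetter(1) compares the pairs by probability; [0] keeps the Topic id.
most_likely_topic = max(example_dt, key=itemgetter(1))[0]

# Rank all Topics from most to least likely.
ranked = sorted(example_dt, key=itemgetter(1), reverse=True)

print(most_likely_topic)
print(ranked)
```

Keeping the full ranking, rather than only the top Topic, would make it possible to fall back to the second most likely Topic when the first has few wines within budget.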


In [None]:

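# assign each review to its most probable Topic
dv = pd.DataFrame()
dv['Variety'] = df['variety']
# building the whole column at once is much faster than writing one cell per row with dv.loc
dv['Likely_Topic'] = [max(lda[bow], key=itemgetter(1))[0] for bow in corpus]
dv.head(5)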

In [None]:
dv['Price'] = df['price']
dv['Title'] = df['title']
dv['Points'] = df['points']
dv['Vineyard'] = df['winery']
dv['Description'] = df['description']
dv.head(5)

**Now add the Sentiment of each wine review, using NLTK for sentiment.**

Wine Score = 10 x Sentiment + Points<br/>
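
The formula above can be sketched as a small function. This is only an illustration: it assumes Sentiment is a single number in [-1, 1] (for example the `compound` value from NLTK's VADER analyzer, which is my assumption about the intended tool), and the names `wine_score`, `sentiment`, `points`, and `weight` are hypothetical. The example values are made up.

```python
def wine_score(sentiment, points, weight=10):
    """Wine Score = weight x Sentiment + Points."""
    return weight * sentiment + points

# Made-up reviews: (points, sentiment), with sentiment assumed to lie in [-1, 1].
reviews = [(90, 0.8), (90, -0.2), (87, 0.5)]

scores = [wine_score(sentiment, points) for points, sentiment in reviews]
print(scores)
```

With a weight of 10, sentiment can shift a score by at most 10 points in either direction, so two wines with equal Points are separated by how warmly they are described.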


## Make a Wine Suggestion text entry box

Work in progress: still learning about widgets.

In [None]:
description = 'Heavy fruit but not sweet. Somewhat complex, \
delivering fruit up front, followed by acidity and tannins. \
Moderate tannins. A great standalone bottle of wine that needs no pairing.'
texts = [word for word in description.lower().split() if word not in stoplist]
desc_v = dictionary.doc2bow(texts)
suggestion_types = lda[desc_v]
print(description)

first_choice = max(suggestion_types,key = itemgetter(1))[0]
budget = 75
print(budget)
print(first_choice)
t = dv[(dv.Likely_Topic == first_choice) & (dv.Price < budget)]
t = t.sort_values(by = 'Points', ascending = False)
t[['Title', 'Variety', 'Price', 'Description']].head(5)

## Try it now.

Try it yourself by changing the `description` text and the `budget` value in the cell above, then re-run it.<br/>
I am working on giving each Topic a label, which may be the most frequent variety per Topic.<br/>

Work in progress: still learning Widgets for the budget and textbox UI.
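
One way the Topic labels mentioned above could be computed is sketched below: count the varieties assigned to each Topic and keep the most frequent one. This is a pure-Python illustration on made-up (Topic, variety) pairs. `label_topics` is a hypothetical helper, and in the notebook the pairs would come from the `Likely_Topic` and `Variety` columns of `dv`.

```python
from collections import Counter, defaultdict

def label_topics(assignments):
    """Map each Topic id to its most frequent variety."""
    counts = defaultdict(Counter)
    for topic, variety in assignments:
        counts[topic][variety] += 1
    # most_common(1) returns a one-element list of (variety, count).
    return {topic: c.most_common(1)[0][0] for topic, c in counts.items()}

# Made-up assignments, not results from the model.
assignments = [
    (0, 'Chardonnay'), (0, 'Chardonnay'), (0, 'Riesling'),
    (1, 'Pinot Noir'), (1, 'Merlot'), (1, 'Pinot Noir'),
]

labels = label_topics(assignments)
print(labels)
```

A label built this way only names the dominant variety; a Topic that mixes many varieties evenly would get a label that describes it poorly.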

In [None]:
description = 'I like german wine that is not too sweet.'
texts = [word for word in description.lower().split() if word not in stoplist]
desc_v = dictionary.doc2bow(texts)
suggestion_types = lda[desc_v]
print(description)

first_choice = max(suggestion_types,key = itemgetter(1))[0]
budget = 100
print(budget)
print(first_choice)
t = dv[(dv.Likely_Topic == first_choice) & (dv.Price < budget)]
t = t.sort_values(by = 'Points', ascending = False)
t[['Title', 'Variety', 'Price', 'Description']].head(5)

In [None]:
#del lda ## If you want to re-run this notebook first delete the current model by running this line.