## Wine reviews: topic modelling wine descriptions

In this kernel, we will look into [topic modelling](https://en.wikipedia.org/wiki/Topic_model) the descriptions in the wine review dataset. We will use the [Non-Negative Matrix Factorization (NMF)](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html) decomposition model from the sklearn library, and analyze the topics it generates to see how similar they are in terms of top key words and grape variety.

Let's begin with reading in the dataset:

In [1]:
import numpy as np
import re
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os
wdat1 = pd.read_csv('../input/winemag-data_first150k.csv',delimiter=',',index_col=0,quotechar='"')
wdat2 = pd.read_csv('../input/winemag-data-130k-v2.csv',delimiter=',',index_col=0,quotechar='"')

In [2]:
print(wdat1.shape)
wdat1.head()

In [3]:
print(wdat2.shape)
wdat2.head()

### Data preprocessing

While working on this kernel, I've seen sklearn functions throwing errors due to special characters in the text, so we will encode the text as unicode first:

In [4]:
# encode as unicode
wdat1['country']=wdat1['country'].values.astype('U')
wdat1['description']=wdat1['description'].values.astype('U')
wdat1['variety']=wdat1['variety'].values.astype('U')
wdat1['price']=wdat1['price'].fillna(0.0)
# dataset 2
wdat2['country']=wdat2['country'].values.astype('U')
wdat2['description']=wdat2['description'].values.astype('U')
wdat2['variety']=wdat2['variety'].values.astype('U')
wdat2['taster_name']=wdat2['taster_name'].values.astype('U')
wdat2['price']=wdat2['price'].fillna(0.0)

Now that we have (hopefully) got rid of the encoding issues, the next step will be to merge the two sets of data and deal with duplicate entries in the dataset. Right now, we will work only with wine description, country of origin, grape variety, taster's name, points and price. So in the next step, we will create a dictionary of dictionaries. The keys for the first dictionary will be the wine descriptions, and the value will be another dictionary, which follows the format:
``` python
innerDict = {'country':'wine_country','points':'numeric_wine_points','variety':'grape_variety','price':'numeric_wine_price','taster':'name_of_taster'}
```

In [5]:
# convert to dict
desMap = {}
for idx,row in wdat1.iterrows():
    desMap[row['description']] = { 'country':row['country'],'points':row['points'],'variety':row['variety'],'price':row['price'],'taster':'nan'}
for idx,row in wdat2.iterrows():
    try:
        desMap[row['description']]['taster'] = row['taster_name']
        if desMap[row['description']]['price'] == 0 and row['price']>0:
            desMap[row['description']]['price'] = row['price']
    except KeyError:
        tasterName = 'nan' if len(str(row['taster_name']))<=4 else row['taster_name']
        desMap[row['description']] = { 'country':row['country'],'points':row['points'],'variety':row['variety'],'price':row['price'],'taster':tasterName }
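To make the merge rule above easier to follow, here is a minimal sketch on made-up records (the descriptions, names and prices are invented for illustration and are not rows from the dataset). It mirrors the logic of the cell above: a description seen in both files gets its taster filled in and its missing price backfilled, while a new description becomes a fresh entry.

```python
# Toy records standing in for rows of the two CSV files (made-up values).
first = [
    {"description": "Ripe black cherry.", "country": "US", "points": 90,
     "variety": "Merlot", "price": 0.0},
]
second = [
    # Same description as above: fills in the taster and the missing price.
    {"description": "Ripe black cherry.", "country": "US", "points": 90,
     "variety": "Merlot", "price": 25.0, "taster_name": "Jane Doe"},
    # New description: added as a fresh entry.
    {"description": "Crisp green apple.", "country": "France", "points": 87,
     "variety": "Chardonnay", "price": 18.0, "taster_name": "nan"},
]

merged = {}
for row in first:
    merged[row["description"]] = {
        "country": row["country"], "points": row["points"],
        "variety": row["variety"], "price": row["price"], "taster": "nan",
    }
for row in second:
    entry = merged.get(row["description"])
    if entry is not None:
        # Duplicate description: keep the first entry, enrich it.
        entry["taster"] = row["taster_name"]
        if entry["price"] == 0 and row["price"] > 0:
            entry["price"] = row["price"]
    else:
        # Very short names are treated as a missing taster, as in the cell above.
        taster = "nan" if len(str(row["taster_name"])) <= 4 else row["taster_name"]
        merged[row["description"]] = {
            "country": row["country"], "points": row["points"],
            "variety": row["variety"], "price": row["price"], "taster": taster,
        }

print(len(merged))
print(merged["Ripe black cherry."])
```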

Total unique wine descriptions:

In [6]:
print(len(desMap))

Now that we have the unique data in a (sort of) convenient format, let's take a look at the number of wine tasters, countries of wine origin etc....

I have purposely left grape variety out of the analysis below. My previous experience is that this is a long list...

In [7]:
from collections import Counter

In [8]:
tasters = list()
country = list()
points  = list()
for des,dat in desMap.items():
    tasters.append(dat['taster'])
    country.append(dat['country'])
    points.append(dat['points'])

In [9]:
Counter(tasters)

Looks like taster `nan` contributes significantly :)

In [10]:
Counter(country)

It is a bit surprising to see a long list of countries, but wiki gives a long list of [wine producing countries](https://en.wikipedia.org/wiki/List_of_wine-producing_countries)....

In [11]:
Counter(points)

So, it looks like the majority of the wines score between 84 and 93 on this wine scoring scale. It is also not immediately obvious how this scoring works, and there are no wines scoring below 80...

Before jumping to the next step, see the first few descriptions below. This field in the dataset is defined as:    
"*Description: a few sentences from a sommelier describing the wine's taste, smell, look, feel, etc.*"

In [12]:
wdes = list(desMap.keys())
wdes[:5]

### Tokenizing and lemmatization

In this step, we will deal with tokenizing and lemmatizing the text. I opted for lemmatization instead of stemming, as I find stemmed words hard to read. But it should be noted that, other than the improved 'readability', [the benefits of lemmatization over stemming will be very modest](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html).

In [13]:
import nltk
from nltk.stem import WordNetLemmatizer

NLTK WordNetLemmatizer needs the [POS tags for proper lemmatization](https://stackoverflow.com/questions/25534214/nltk-wordnet-lemmatizer-shouldnt-it-lemmatize-all-inflections-of-a-word), let's do that first:

In [14]:
wtokens = list()
for i,d in enumerate(wdes):
    wtokens.append(nltk.pos_tag(nltk.tokenize.word_tokenize(d)))
    if i%10000==0:
        print(i)

Let's take a peek at these POS tagged data:

In [15]:
wtokens[:5]

A token must be tagged as a noun, adjective, verb or adverb to be lemmatized. The code snippets below create a set of POS tags for each of these categories. Tokens with any other tag are kept unchanged (lowercased) only if the tag itself starts with at least two word characters, which in practice drops punctuation tokens.

In [16]:
lemmatizer = WordNetLemmatizer()
vset = set(['VB','VBD','VBG','VBN','VBP','VBZ'])
nset = set(['NN','NNS','NNP','NNPS'])
advset = set(['RB','RBR','RBS'])
adjset = set(['JJ','JJR','JJS'])
wmatch = re.compile(r'^\w{2,}.*$', re.IGNORECASE)  # raw string avoids an invalid-escape warning for \w

In [17]:
wlemmas = list()
for wt in wtokens:
    wtlist = list()
    for tk in wt:
        if tk[1] in vset:
            wtlist.append(lemmatizer.lemmatize(tk[0].lower(), 'v'))
        elif tk[1] in nset:
            wtlist.append(lemmatizer.lemmatize(tk[0].lower(), 'n'))
        elif tk[1] in advset:
            wtlist.append(lemmatizer.lemmatize(tk[0].lower(), 'r'))
        elif tk[1] in adjset:
            wtlist.append(lemmatizer.lemmatize(tk[0].lower(), 'a'))
        elif re.match(wmatch,string=tk[1]):
            wtlist.append(tk[0].lower())
    wlemmas.append(" ".join(wtlist))
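The four tag sets used above all share a two-letter prefix (`VB`, `NN`, `RB`, `JJ`), so the branching can also be read as a simple lookup. The sketch below is only an illustration of that mapping: `penn_to_wordnet` and `PENN_PREFIX_TO_WORDNET` are hypothetical names, not part of NLTK, and the equivalence assumes the standard Penn Treebank tag set.

```python
# Hypothetical helper: map a Penn Treebank tag to the single-letter POS
# code passed to the lemmatizer above ('v', 'n', 'r', 'a').
PENN_PREFIX_TO_WORDNET = {"VB": "v", "NN": "n", "RB": "r", "JJ": "a"}


def penn_to_wordnet(tag):
    """Return the WordNet POS letter for a Penn tag, or None if unmapped."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:2])


# Verb, noun, adverb and adjective tags map to a letter; others give None.
for tag in ["VBD", "NNS", "RBR", "JJ", "DT", "."]:
    print(tag, penn_to_wordnet(tag))
```

A `None` result corresponds to the final `elif` branch in the loop above, where the token is either kept as-is or dropped.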

Let's see the lemmatized descriptions again....

In [18]:
wlemmas[:5]

### Feature extraction

Using a tf-idf document $\times$ term matrix:

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [20]:
tf1 = TfidfVectorizer(analyzer='word',stop_words='english',min_df=0.002,max_df=0.95,ngram_range=(1,5),lowercase=True)
desTfIdf = tf1.fit_transform(wlemmas)
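To give a feel for what the numbers in this matrix mean, here is a small hand-rolled sketch of tf-idf weighting on three made-up descriptions. It assumes raw term counts, the smoothed idf $\ln\frac{1+n}{1+\mathrm{df}(t)}+1$ and L2 row normalization, which is how scikit-learn documents its default settings. It deliberately ignores stop words, `min_df`, `max_df` and n-grams, so it is an illustration and not a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

# Made-up, already tokenizable descriptions (illustration only).
docs = [
    "black cherry tannin",
    "black fruit oak",
    "green apple citrus",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_row(tokens):
    """L2-normalized tf-idf weights for one document."""
    tf = Counter(tokens)
    weights = {
        t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


row0 = tfidf_row(tokenized[0])
for term, weight in sorted(row0.items()):
    print(term, round(weight, 3))
```

In the first toy document, `cherry` and `tannin` occur in only one description, so they receive a higher weight than `black`, which occurs in two.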

In [21]:
print(desTfIdf.shape)
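# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
print(tf1.get_feature_names_out())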

The N-gram range we have given is 1-5, but let's examine the actual N-gram distribution in the feature matrix:

In [22]:
ntokens = [ len(s.split(' ')) for s in tf1.get_feature_names_out() ]
print(Counter(ntokens))

So, it appears that 4-grams and 5-grams do not appear that frequently in the dataset. Of course, the value given to the `min_df` parameter plays a role here.
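The effect of a document-frequency threshold on longer N-grams can be seen on a toy corpus. The sketch below uses four made-up descriptions and a deliberately large threshold of 0.5 (the notebook uses `min_df=0.002` on the full dataset). It counts in how many documents each N-gram occurs and keeps those that reach the threshold. It is plain Python for illustration and does not reproduce the vectorizer's tokenization or stop-word handling.

```python
from collections import Counter

# Made-up descriptions (illustration only).
docs = [
    "ripe black cherry flavor",
    "ripe black cherry aroma",
    "ripe black fruit flavor",
    "crisp green apple",
]


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


min_df = 0.5  # keep n-grams present in at least half of the documents
doc_freq = Counter()
for doc in docs:
    tokens = doc.split()
    seen = set()
    for n in range(1, 5):
        seen.update(ngrams(tokens, n))
    doc_freq.update(seen)  # count each n-gram once per document

kept = [g for g, c in doc_freq.items() if c / len(docs) >= min_df]
by_length = Counter(len(g.split()) for g in kept)
print(sorted(kept))
print(by_length)
```

Longer N-grams are repeated across documents less often than their component words, so fewer of them survive the threshold.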


### Topic modelling using NMF

In [23]:
from sklearn.decomposition import NMF

The functions in the code snippets below are not really needed for basic topic modelling. One of the functions below returns top scoring keywords (features) and documents (descriptions) associated with each topic and the next function returns the values associated with each of the descriptions from the `desMap` dictionary created earlier.

In [24]:
def get_topics(H,feature_names,W,docs,kw='NMF',n_topWords=25,n_topDocs=25):
    '''
    Given an H matrix, words mapped, a W matrix, mapped documents, number of top words and number of 
    top documents, return a dictionary with n_topWords and n_topDocs
    source: https://towardsdatascience.com/improving-the-interpretation-of-topic-models-87fd2ee3847d 
    '''
    topicMap = {}
    for topicInd,topic in enumerate(H):
        keywords = [feature_names[i] for i in topic.argsort()[:-n_topWords-1:-1] ]
        descriptions = list()
        topDocInd = np.argsort( W[:,topicInd] )[::-1][0:n_topDocs]
        for docInd in topDocInd:
            descriptions.append(docs[docInd])
        topicMap[kw+'.'+str(topicInd)] = {'kw':keywords,'desc':descriptions}
    return(topicMap)

def get_info(desMap,deslist):
    infoMap = {}
    variety = list()
    points = list()
    country = list()
    taster = list()
    price = list()
    for des in deslist:
        if des in desMap:
            variety.append(desMap[des]['variety'])
            points.append(desMap[des]['points'])
            price.append(desMap[des]['price'])
            country.append(desMap[des]['country'])
            taster.append(desMap[des]['taster'])
    infoMap['variety']= Counter(variety)
    infoMap['points']= points
    infoMap['price']= price
    infoMap['country']= Counter(country)
    infoMap['taster']= Counter(taster)
    return(infoMap)
        

Now let's actually use Non-Negative Matrix Factorization for topic modeling:

In [25]:
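# Note: `alpha` was deprecated in scikit-learn 1.0 and removed in 1.2 in favour of `alpha_W`/`alpha_H`;
# on recent versions the regularization has to be passed that way (its scaling also differs).
nmf1 = NMF(max_iter=750,n_components=25, random_state=42, alpha=.05, l1_ratio=0.5,solver='cd',init='nndsvda',shuffle=True)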
nmf1.fit(desTfIdf)

In [26]:
nmfW1 = nmf1.transform(desTfIdf)

Examine the matrices before proceeding further:   
$H$ matrix  `nmf1.components_`   and $W$ matrix `nmfW1`

In [27]:
print(nmfW1.shape)
print (nmf1.components_.shape)
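As a reminder of what the two shapes mean: NMF approximates the documents $\times$ terms matrix $V$ by the product $WH$, where $W$ (here `nmfW1`) is documents $\times$ topics and $H$ (here `nmf1.components_`) is topics $\times$ terms. The sketch below uses small made-up matrices, not output from the model above, to show the reconstruction and how top key words and top documents per topic can be read off $H$ and $W$. The helper `top_indices` is a hypothetical name used only here.

```python
# Toy factor matrices (made-up numbers), not output from the fitted model.
terms = ["black", "cherry", "apple", "citrus"]
H = [  # topics x terms
    [0.9, 0.7, 0.0, 0.0],  # topic 0 leans towards red-fruit words
    [0.0, 0.1, 0.8, 0.6],  # topic 1 leans towards white-wine words
]
W = [  # documents x topics
    [1.0, 0.0],
    [0.2, 0.9],
    [0.0, 1.1],
]

# Reconstruction V_hat = W @ H, written out with plain loops.
V_hat = [
    [sum(W[d][k] * H[k][t] for k in range(len(H))) for t in range(len(terms))]
    for d in range(len(W))
]


def top_indices(values, n):
    """Indices of the n largest values, largest first."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:n]


top_words = {}
top_docs = {}
for k, topic in enumerate(H):
    top_words[k] = [terms[i] for i in top_indices(topic, 2)]  # rows of H
    top_docs[k] = top_indices([row[k] for row in W], 2)       # columns of W
    print(k, top_words[k], top_docs[k])
```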

Use the functions written above to get a dictionary with topic ids as keys and the top key words and descriptions as values. Varieties, country(ies) of origin, points and prices for those descriptions can then be looked up with `get_info`:

In [28]:
descTopicsNmf = get_topics(nmf1.components_,tf1.get_feature_names_out(),nmfW1,wdes,'NMF',100,100)

Let's take a look at the first topic (by numbering/naming)

In [29]:
print('**** Keywords ******')
print(descTopicsNmf['NMF.0']['kw'][:25])
print('**** Descriptions ******')
print(descTopicsNmf['NMF.0']['desc'][:5])
topic0Info = get_info(desMap,descTopicsNmf['NMF.0']['desc'])
print('**** Stats ******')
print('   variety   ')
print(topic0Info['variety'])
print('   country   ')
print(topic0Info['country'])
print('   tasters   ')
print(topic0Info['taster'])
print('   points   ')
print(pd.Series(topic0Info['points']).describe())
print('   price   ')
print(pd.Series(topic0Info['price']).describe())

It would be an understatement to say that 'black' appears to be the prominent keyword in the topic above. From the top keywords and the list of varieties, it appears that sommeliers quite often use keywords like black, black fruit, black cherry etc. to describe red wine flavors.

So, we will look at the next topic....

In [30]:
print('**** Keywords ******')
print(descTopicsNmf['NMF.1']['kw'][:25])
print('**** Descriptions ******')
print(descTopicsNmf['NMF.1']['desc'][:5])
topic1Info = get_info(desMap,descTopicsNmf['NMF.1']['desc'])
print('**** Stats ******')
print('   variety   ')
print(topic1Info['variety'])
print('   country   ')
print(topic1Info['country'])
print('   tasters   ')
print(topic1Info['taster'])
print('   points   ')
print(pd.Series(topic1Info['points']).describe())
print('   price   ')
print(pd.Series(topic1Info['price']).describe())

Instead of looking at all the topics one by one, let's compare the topics by:

* the number of shared top key words and
* the number of grape varieties linked to the wine descriptions in these topics

We will do this by computing the [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index) between two topics using the top N key words or grape varieties.

`get_topic_similarity` takes a topic dictionary as input and returns a pandas.DataFrame with the Jaccard indices between the topics.

`get_variety_similarity` takes a topic dictionary and the description-to-metadata mapping (`desMap`, defined above) as inputs and returns a pandas.DataFrame with Jaccard indices as grape variety similarity between topics.
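As a quick illustration of the Jaccard index itself, the size of the intersection divided by the size of the union, here is a minimal sketch on two made-up keyword sets. The function name `jaccard` and the choice to return 1.0 for two empty sets are assumptions made for this example only.

```python
def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two collections of items."""
    a, b = set(a), set(b)
    union = a | b
    # Assumption for this sketch: two empty sets count as identical.
    return len(a & b) / len(union) if union else 1.0


# Made-up top key words for two topics.
topic_a = ["black", "cherry", "tannin", "oak"]
topic_b = ["black", "cherry", "plum"]

score = jaccard(topic_a, topic_b)
print(score)  # 2 shared words out of 5 distinct words
```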

In [78]:
def get_topic_similarity(topicDict,ntopics=100):
    '''
    Calculate Jaccard index based on top n key words in wine description topics
    '''
    simMat = np.ones((len(topicDict),len(topicDict)),dtype=float)  # np.float was removed in NumPy 1.24
    for i,k in enumerate(topicDict.keys()):
        kwords = set(topicDict[k]['kw'][:ntopics])
        for j,l in enumerate(topicDict.keys()):
            if j>i: # compute each pair once; the matrix is symmetric
                lwords = set(topicDict[l]['kw'][:ntopics])
                JI = len(kwords & lwords)/float(len(kwords | lwords))
                simMat[j,i] = JI
                simMat[i,j] = JI
    return(pd.DataFrame(simMat,index=list(topicDict.keys()),columns=list(topicDict.keys())))

def get_variety_similarity(topicDict,desmap,nvariety=0):
    '''
    Calculate Jaccard index based on top N (in terms of count) 
    wine varieties in wine description topics
    '''
    varietySim = np.ones((len(topicDict),len(topicDict)),dtype=float)  # np.float was removed in NumPy 1.24
    for i,k in enumerate(topicDict.keys()):
        variety1 = list()
        for desc in topicDict[k]['desc']:
            variety1.append(desmap[desc]['variety'])
        if nvariety == 0:
            varTopN1 = set(variety1)
        else:
            varTup1 = sorted(list(Counter(variety1).items()),key=lambda vit: vit[1],reverse=True)
            varTopN1 = set([v[0] for v in varTup1[:nvariety]])
        for j,l in  enumerate(topicDict.keys()):
            if j>i:
                variety2 = list()
                for desc in topicDict[l]['desc']:
                    variety2.append(desmap[desc]['variety'])
                if nvariety == 0:
                    varTopN2 = set(variety2)
                else:
                    varTup2 = sorted(list(Counter(variety2).items()),key=lambda vit: vit[1],reverse=True)
                    varTopN2 = set([v[0] for v in varTup2[:nvariety]])
                JI = len(varTopN1 & varTopN2)/float(len(varTopN1 | varTopN2))
                varietySim[i,j] = JI
                varietySim[j,i] = JI
    return((pd.DataFrame(varietySim,index=list(topicDict.keys()),columns=list(topicDict.keys()))))

Use the functions defined above to assess the similarity between topics


### Topic similarity using key words 


We will begin with topic similarity using top keywords shared:

In [102]:
topicSimilarityNmf = get_topic_similarity(descTopicsNmf)

Now that we have topic similarities, let's plot these values:

In [33]:
sns.clustermap(topicSimilarityNmf,square=True,figsize=(15,15),cmap=sns.light_palette((0.231,0.349,0.596),n_colors=11))

It doesn't look like the topics share a lot of common key words. Let's find the maximum similarity for each topic:

In [101]:
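# Sort each row; the last column is the self-similarity (1.0), so take the second to last.
# Sorting the underlying array avoids relying on how DataFrame.apply expands array results.
print(pd.DataFrame(np.sort(topicSimilarityNmf.values,axis=1),index=topicSimilarityNmf.index).iloc[:,-2])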
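The reason for taking position `-2` rather than `-1` after sorting is that every topic has a similarity of 1.0 with itself, which always sorts last. A minimal sketch on a made-up 3 $\times$ 3 similarity matrix (the values are invented for illustration):

```python
# Made-up symmetric similarity matrix with 1.0 on the diagonal.
labels = ["NMF.0", "NMF.1", "NMF.2"]
sim = [
    [1.00, 0.10, 0.25],
    [0.10, 1.00, 0.05],
    [0.25, 0.05, 1.00],
]

closest = {}
for name, row in zip(labels, sim):
    order = sorted(range(len(row)), key=lambda j: row[j])  # ascending
    best = order[-2]  # order[-1] is the topic itself (similarity 1.0)
    closest[name] = (labels[best], row[best])
    print(name, closest[name])
```

This relies on no other topic reaching a similarity of exactly 1.0; with such a tie, the second to last position could be the topic itself.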

Now that we have topic similarity in terms of the shared keywords, let's analyze the topic similarity with respect to the number of common grape varieties. Instead of ranking the grape varieties in each topic by count, we will use all grape varieties mapped to a topic, irrespective of whether that variety is listed just one time or ten times in a topic.


### Topic similarity using grape varieties

In [70]:
varietySimilarityNmf = get_variety_similarity(descTopicsNmf,desMap,0)

In [71]:
sns.clustermap(varietySimilarityNmf,square=True,figsize=(15,15),cmap=sns.light_palette((0.447,0.184,0.2156),n_colors=11))

It is clear from the cluster map above that topics show more similarity in terms of shared grape varieties, so let's find the maximum similarity as before....

In [103]:
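# Second-largest value per row (the largest is the self-similarity of 1.0)
print(pd.DataFrame(np.sort(varietySimilarityNmf.values,axis=1),index=varietySimilarityNmf.index).iloc[:,-2])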

Since we have topic similarity based on top key words and on grape varieties, let's make a table that lists, for each topic, its most similar topic according to key word similarity and according to grape variety similarity:

In [117]:
similarTopics = pd.DataFrame({'topicSimilarity':topicSimilarityNmf.columns[topicSimilarityNmf.values.argsort()[:,-2]],'varietySimilarity':varietySimilarityNmf.columns[varietySimilarityNmf.values.argsort()[:,-2]]},index=topicSimilarityNmf.index)
similarTopics

So, the table above shows that there are a few topics, such as `NMF.0`, `NMF.7`, `NMF.14` and `NMF.15`, for which the most similar topic is the same under both key word similarity and variety similarity. However, the grape variety similarity numbers were generated using all grape varieties mapping to the wine descriptions in a topic. What if we use only the top N (say N=10) varieties instead of all of them? Would the results remain similar? We will see that in the next step:

In [118]:
varietySimilarityNmf10 = get_variety_similarity(descTopicsNmf,desMap,10)
similarTopics10 = pd.DataFrame({'topicSimilarity':topicSimilarityNmf.columns[topicSimilarityNmf.values.argsort()[:,-2]],'varietySimilarity10':varietySimilarityNmf10.columns[varietySimilarityNmf10.values.argsort()[:,-2]]},index=topicSimilarityNmf.index)
similarTopics10

The observations made for topics `NMF.0`, `NMF.7` and `NMF.15` don't hold true anymore. So, we can assume that topic similarity based on shared top key words doesn't in fact translate to topic similarity based on grape variety, or vice versa....

### Summing up...

This notebook explored topic modelling of wine descriptions using N-grams and NMF, and finding similar topics according to the number of shared key words (N-grams) and grape varieties. What we saw is that key word similarity between topics doesn't exactly reflect their similarity in terms of grape varieties; in other words, roughly the same set of words can be used to describe the flavor characteristics of a range of grape varieties.