# Significant Words

* Date: 18 August 2019
* Author: Rafael C. Alvarado
* Subject: Textual OEnolytics 

In this notebook we focus on the "unstructured data" contained in the wine reviews themselves. We explore some frequency-based measures for estimating the significance of words and bigrams in the corpus.


# Setup

## Import libraries

In [1]:
import pandas as pd
import numpy as np
import sqlite3
import matplotlib
import matplotlib.pyplot as plt
import plotly.express as px
from IPython.display import display, HTML

## Pragmas

In [2]:
%matplotlib inline
matplotlib.style.use('ggplot')

## Modify HTML styles

In [3]:
%%html
<style>
p {margin: rem; }
.large {font-size: 16pt;}
.readerly {font-size: 14pt; font-family: serif;}
div.title {font-weight:bold; margin-bottom:.5rem;}
</style>

In [4]:
def show(text, el='DIV', css_class=''):
    # Render text as an HTML element with an optional CSS class
    display(HTML("<{0} class='{1}'>{2}</{0}>".format(el, css_class, text)))

## Import data

In [None]:
dbdir = '/Users/rca2t/CODE/polo2-test/PUB/winereviews/'
db_file = dbdir + 'winereviews-corpus.db'  # keep the path separate from the connection
with sqlite3.connect(db_file) as db:
    doc = pd.read_sql('select * from doc', db, index_col='doc_id')
    doctoken = pd.read_sql('select * from doctoken', db, index_col=['doc_id','sentence_id','token_ord'])
    vocab = pd.read_sql('select * from token', db, index_col='token_str')
    bigrams = pd.read_sql('select * from ngrambi', db, index_col='ngram')
    docbigrams = pd.read_sql('select * from ngrambidoc', db, index_col=['ngram','doc_id'])

# Vocabulary

## Significant Words (Unigrams)

We use TFIDF to measure the significance of words in the corpus. TFIDF combines the frequency of a word in a document with the number of documents in the corpus that contain it. A word that appears in nearly every document is considered insignificant.

```
TF: Frequency of a word in a document  
DF: Number of documents in the corpus in which the word appears  
N: Number of documents in the corpus  
IDF = log10(N / DF)  
TFIDF = TF * IDF
```

We sum the TFIDF of each word across the corpus to find the most significant words in the vocabulary.

Note that TFIDF is roughly equivalent to term frequency divided by term entropy.
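
As a rough illustration of the summed TFIDF score, here is a minimal sketch on a toy corpus. It assumes the common definition IDF = log10(N / DF), with N the number of documents, and TF taken as a word's share of its document. The `toy` table and its column `n` are hypothetical and are not part of the Polo database, whose `tfidf_sum` column may be computed differently.

```python
import numpy as np
import pandas as pd

# Hypothetical toy corpus: one row per (document, word) with its raw count.
toy = pd.DataFrame({
    'doc_id':    [1, 1, 2, 2, 3],
    'token_str': ['wine', 'cherry', 'wine', 'tannins', 'wine'],
    'n':         [2, 1, 1, 3, 1],
})

n_docs = toy['doc_id'].nunique()                                    # N
tf = toy['n'] / toy.groupby('doc_id')['n'].transform('sum')         # TF per document
doc_freq = toy.groupby('token_str')['doc_id'].transform('nunique')  # DF per word
toy['tfidf'] = tf * np.log10(n_docs / doc_freq)

# Sum over documents to rank words across the whole corpus.
tfidf_sum = toy.groupby('token_str')['tfidf'].sum().sort_values(ascending=False)
```

In this toy case "wine" appears in every document, so its IDF is log10(1) = 0 and it drops to the bottom of the ranking no matter how often it is used.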

In [None]:
vocab.sort_values('tfidf_sum', ascending=False).head(20)[['tfidf_sum']].style.bar()

# Bigrams

We look at word pairs.
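
For readers who want to see how word pairs can be derived from a token sequence, here is a small sketch. The `tokens` series is hypothetical. The `ngrambi` table used below was built upstream, and its exact construction (for example, whether pairs cross sentence boundaries) is not shown in this notebook.

```python
import pandas as pd

# Hypothetical token sequence for a single review.
tokens = pd.Series(['firm', 'tannins', 'and', 'black', 'cherry',
                    'and', 'black', 'pepper'])

# Pair each token with its successor; the last token has no successor,
# so its pair is missing and gets dropped.
pairs = (tokens + ' ' + tokens.shift(-1)).dropna()

# Count how often each pair occurs.
bigram_freq = pairs.value_counts()
```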

## Bigram Frequency

In [None]:
bigrams.sort_values('freq', ascending=False).freq.to_frame().head(20).style.bar()

## Bigram Entropy

In [None]:
bigrams.sort_values('entropy', ascending=False).entropy.to_frame().head(20).style.bar()

## Bigram TFIDF

NOTE TO SELF: This should be computed by Polo.

In [None]:
# docbigrams['tp'] = docbigrams['tf'] / docbigrams.groupby('doc_id')['tf'].count()
# docbigrams['tfidf'] = docbigrams['tp'] * bigrams['idf']
docbigrams['tfidf'] = (docbigrams['tf'] / docbigrams.groupby('doc_id')['tf'].count()) * bigrams['idf']
bigrams['tfidf_sum'] = docbigrams.groupby('ngram').tfidf.sum()

In [None]:
bigrams.tfidf_sum.sort_values(ascending=False).head(20).to_frame().style.bar()

## Bigram TFTR (experimental)

A new measure, Term Frequency Term Redundancy (TFTR), which weights term frequency by redundancy `R`, where `H` is entropy:

`R = 1 - (H / Hmax)`
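
To make the formula concrete, here is a sketch of the textbook form of redundancy, computed over a whole distribution of counts with `Hmax = log2(n)`. The helper name `redundancy` is hypothetical. The cell below instead applies the formula per bigram to a precomputed `entropy` column, so its values should not be expected to match this sketch.

```python
import numpy as np

def redundancy(counts):
    """Return R = 1 - H / Hmax for non-negative counts (two or more outcomes)."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()  # drop zeros: 0 * log(0) is taken as 0
    h = -(p * np.log2(p)).sum()            # Shannon entropy in bits
    h_max = np.log2(len(counts))           # entropy of the uniform distribution
    return 1 - h / h_max

uniform = redundancy([5, 5, 5, 5])  # evenly spread counts
skewed = redundancy([97, 1, 1, 1])  # counts concentrated on one outcome
```

An evenly spread distribution has no redundancy, while one dominated by a single outcome approaches a redundancy of 1.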

In [None]:
max_entropy = (1 / bigrams.shape[0]) * np.log2(bigrams.shape[0])
bigrams['redundancy'] = 1 - (bigrams['entropy'] / max_entropy)
docbigrams['tfidh'] = (docbigrams['tf'] / docbigrams.groupby('doc_id')['tf'].count()) * bigrams['redundancy']
bigrams['tfidh_sum'] = docbigrams.groupby('ngram').tfidh.sum() * -1

In [None]:
bigrams['tfidh_sum'].sort_values(ascending=False).to_frame().head(20).style.bar()

In [None]:
# px.scatter(bigrams, 'tfidf_sum', 'tfidh_sum', trendline='lowess')

# Words associated with Good and Bad wines

Select the reviews with high and low ratings, using the 80th and 20th percentiles of the points as the respective cutoffs.
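
The percentile cutoffs can be tried out on a handful of scores first. This is only a sketch: the `points` values are made up, and `np.select` is used here as a compact alternative to the `.loc` assignments in the cells below.

```python
import numpy as np
import pandas as pd

# Hypothetical scores standing in for doc.doc_points.
points = pd.Series([80, 82, 84, 85, 86, 87, 88, 90, 93, 96], name='doc_points')

# 20th and 80th percentiles serve as the B and A cutoffs.
low, high = points.quantile([.2, .8])

# Below the low cutoff is B, above the high cutoff is A, the rest is N.
grade = pd.Series(
    np.select([points < low, points > high], ['B', 'A'], default='N'),
    index=points.index, name='grade')
```

Because the comparisons are strict, a review that sits exactly on a cutoff falls into the neutral group `N`.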

## Convert points into grades (A and B) 

In [None]:
doc.loc[doc.doc_points < doc.doc_points.quantile(.2), 'grade'] = 'B'
doc.loc[doc.doc_points > doc.doc_points.quantile(.8), 'grade'] = 'A'
doc['grade'] = doc['grade'].fillna('N')

In [None]:
Amin =  doc.loc[doc.grade == 'A'].doc_points.min()
Bmax = doc.loc[doc.grade == 'B'].doc_points.max()

In [None]:
show("Grade A >= {}<br/>Grade B <= {}".format(Amin, Bmax), css_class='large')

In [None]:
doc.grade.value_counts().plot(kind='pie', figsize=(5,5))

## Apply doc grades to words

In [None]:
dtg = doctoken.join(doc['grade'], on='doc_id', how='inner')
dtg.index.names  = ['doc_id', 'sentence_id', 'token_ord'] # Because doc_id gets lost for some reason

In [None]:
dtg.head()

## Find words strongly associated with each grade

In [None]:
G = dtg.groupby(['grade', 'token_str']).count().unstack().fillna(0).T
G.index = G.index.droplevel(0)
G = G / G.sum()

In [None]:
G['A'].sort_values(ascending=False).head(10)

In [None]:
G['B'].sort_values(ascending=False).head(10)

In [None]:
G['x'] = G.A - G.B
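
The score computed above is the difference between a word's share of all tokens in A reviews and its share in B reviews. Here is a toy sketch of the same idea. The `toy` frame is hypothetical, and `pd.crosstab` is used as a shortcut for the `groupby`/`unstack` steps in the cells above.

```python
import pandas as pd

# Hypothetical tokens, each tagged with the grade of the review it came from.
toy = pd.DataFrame({
    'grade':     ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'token_str': ['ripe', 'ripe', 'finish', 'wine', 'thin', 'thin', 'wine', 'wine'],
})

counts = pd.crosstab(toy['token_str'], toy['grade'])  # rows: words, columns: grades
share = counts / counts.sum()                         # each column sums to 1
diff = (share['A'] - share['B']).sort_values(ascending=False)
```

Positive values mark words used relatively more often in A reviews and negative values mark words used more in B reviews. Since both columns sum to 1, the differences sum to zero.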

### Top words associated with A wines

In [None]:
G.x.sort_values(ascending=False).head(10).to_frame().style.bar()

### Top words associated with B wines

In [None]:
G.x.sort_values(ascending=True).head(10).to_frame().style.bar()

## Do the same for bigrams

In [None]:
docbigrams = docbigrams.reset_index().set_index(['doc_id', 'ngram']).sort_index()

In [None]:
dbg = docbigrams.join(doc['grade'], on='doc_id', how='inner')
dbg.index.names  = ['doc_id', 'ngram'] # Because doc_id gets lost

In [None]:
dbg.head()

In [None]:
G1 = dbg.groupby(['grade', 'ngram']).tfidh.count().unstack().fillna(0).T
G1 = G1 / G1.sum()

In [None]:
G1.A.sort_values(ascending=False).to_frame().head(10)

In [None]:
G1.B.sort_values(ascending=False).to_frame().head(10)

In [None]:
G1['x'] = G1.A - G1.B

### Top bigrams associated with A wines

In [None]:
G1.x.sort_values(ascending=False).to_frame().head(30).style.bar()

### Top bigrams associated with B wines

In [None]:
G1.x.sort_values(ascending=True).to_frame().head(20).style.bar()

In [None]:
G1['A_z'] = (G1.A - G1.A.mean()) / G1.A.std()

In [None]:
G1['B_z'] = (G1.B - G1.B.mean()) / G1.B.std()

In [None]:
G1['x_z'] = (G1.x - G1.x.mean()) / G1.x.std()

In [None]:
n = 20
Au = G.x.sort_values(ascending=False).head(n).to_frame().reset_index().rename(columns={'x':'A', 'token_str':'unigram'})
Ab = G1.x.sort_values(ascending=False).head(n).to_frame().reset_index().rename(columns={'x':'A', 'ngram':'bigram'})
Bu = G.x.sort_values(ascending=True).head(n).to_frame().reset_index().rename(columns={'x':'B',  'token_str':'unigram'})
Bb = G1.x.sort_values(ascending=True).head(n).to_frame().reset_index().rename(columns={'x':'B', 'ngram':'bigram'})

In [None]:
E = pd.concat([Au, Ab, Bu, Bb], axis=1, sort=False)  # axis must be passed by keyword in recent pandas

In [None]:
E

In [None]:
pd.concat([G.x.sort_values(ascending=True).head(20), 
          G.x.sort_values(ascending=True).tail(20)]).plot(kind='barh', figsize=(5,20))

In [None]:
pd.concat([G1.x.sort_values(ascending=True).head(20), 
          G1.x.sort_values(ascending=True).tail(20)]).plot(kind='barh', figsize=(5,20))

In [None]:
G1.x.sort_values(ascending=False).head(20).plot(kind='barh', figsize=(5,10))

# Conclusions

**Words and phrases to use when tasting wines**

* finish, acidity, aromas, tannins, ripeness
* black cherry, crisp acidity, firm tannins, black pepper, white pepper

In [None]:
G.to_csv('winereview-vocab.csv')
G1.to_csv('winereview-bigrams.csv')

In [None]:
# END

# Postscript: Assessing the Voss Effect

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

In [None]:
# corpus = pd.read_csv('winereviews.csv')

In [None]:
# corpus.head()

In [None]:
# vectorizer = TfidfVectorizer(use_idf=True)
# X = vectorizer.fit_transform(corpus.description)
# df = pd.DataFrame(X.todense(), index=corpus.index)