# Module 5: BOW, TFIDF, and Vector Spaces

* DS 5001
* Raf Alvarado

# Overview

In this notebook, we explore Luhn's concept of term significance in light of Zipf's Law, TFIDF, and vector space models of text. 

Recall Luhn's (1958) representation of the problem:

<img src="https://keep.google.com/u/0/media/v2/1sejm7ApXHSqRKIyNj3gwdAmzSP1hWW_Zo_vMNeisIeNyoxJJije0g2fCAGej/1AaLvSK1xIoctnS3XUh0SEd0Cx2RPlTFLqn0hHrFynWgoKi802glZOKMo5S74zg?accept=image/gif,image/jpeg,image/jpg,image/png,image/webp,audio/aac&sz=695">

In this notebook, we look at ways to approximate the significance curve using the ideas we learned in this module.

# Set Up

## Config

In [None]:
data_dir = '../2020-02-06/' # Or wherever you put your previous lab

In [None]:
count_method = 'n' # 'c' or 'n' # n = n tokens, c = distinct token (term) count
tf_method = 'sum' # sum, max, log, double_norm, raw, binary
tf_norm_k = .5 # only used for double_norm
idf_method = 'standard' # standard, max, smooth
gradient_cmap = 'YlGnBu' # YlGn, GnBu, YlGnBu; For tables; see https://matplotlib.org/3.1.0/tutorials/colors/colormaps.html 

In [None]:
OHCO = ['book_id', 'chap_num', 'para_num', 'sent_num', 'token_num']
SENTS = OHCO[:4]
PARAS = OHCO[:3]
CHAPS = OHCO[:2]
BOOKS = OHCO[:1]

In [None]:
bag = CHAPS

## Import

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px # plotly_express is now bundled with plotly as plotly.express

In [None]:
pd.__version__

In [None]:
sns.set()
%matplotlib inline

# Prepare the data

## Import tables

Bring in the tables we created last time.

In [None]:
%%time
LIB = pd.read_csv(data_dir + "LIB.csv").set_index(BOOKS)
TOKEN = pd.read_csv(data_dir + 'TOKEN.csv').set_index(OHCO)
VOCAB = pd.read_csv(data_dir + 'VOCAB.csv').set_index('term_id')
# DOC = pd.read_csv(data_dir + "DOC.csv").set_index(PARAS)

In [None]:
LIB.head()

In [None]:
VOCAB.head()

In [None]:
VOCAB = VOCAB[~VOCAB.term_str.isna()]

In [None]:
VOCAB.head()

In [None]:
TOKEN.head()

In [None]:
TOKEN = TOKEN[~TOKEN.term_str.isna()]

In [None]:
TOKEN.head()

In [None]:
# DOC.head()

## Add term_id to TOKEN table

We need to do this to combine the VOCAB and TOKEN tables more efficiently. Note, we could have done this in the previous lab.

We use `.map()` because TOKEN and VOCAB do not share an index at this time.
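
A minimal sketch of the lookup on toy data. The names `toy_vocab`, `toy_token`, and `term_to_id` are hypothetical stand-ins, and the values are made up for illustration:

```python
import pandas as pd

# Toy stand-ins for VOCAB and TOKEN (illustrative values only)
toy_vocab = pd.DataFrame(
    {'term_str': ['cat', 'dog', 'the']},
    index=pd.Index([0, 1, 2], name='term_id'),
)
toy_token = pd.DataFrame({'term_str': ['the', 'cat', 'the', 'dog']})

# Invert the vocabulary into a Series keyed by term_str, valued by term_id
term_to_id = toy_vocab.reset_index().set_index('term_str').term_id

# .map() looks up each token's term_str in that Series
toy_token['term_id'] = toy_token.term_str.map(term_to_id)
```

Any `term_str` missing from the lookup Series would map to `NaN`, which is one reason to drop null terms from both tables first.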

In [None]:
TOKEN['term_id'] = TOKEN.term_str.map(VOCAB.reset_index().set_index('term_str').term_id)

In [None]:
TOKEN.head()

## Add Max POS to VOCAB

Just in case it's not there. It's easy now that we have a shared feature -- `term_id` -- between VOCAB and TOKEN.

Regarding collisions when using `.idxmax()`, the documentation says "If multiple values equal the maximum, the first row label with that value is returned."
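
To see what that means in practice, here is a small sketch with a deliberate tie. The table `toy` and its tags are invented for illustration:

```python
import pandas as pd

# Toy token table: term 7 is tagged NN twice and VB twice (a tie)
toy = pd.DataFrame({
    'term_id': [7, 7, 7, 7, 8, 8, 8],
    'pos':     ['VB', 'NN', 'VB', 'NN', 'DT', 'DT', 'NN'],
})

# Rows are terms, columns are POS tags; groupby sorts the tags alphabetically
pos_counts = toy.groupby(['term_id', 'pos']).pos.count().unstack()

# idxmax(1) returns the first column label that holds each row's maximum
pos_max = pos_counts.idxmax(1)
```

Here term 7 resolves to `NN` only because `NN` sorts before `VB`, so ties are broken by column order rather than by anything linguistic.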

In [None]:
# Demo
# TOKEN.groupby(['term_id', 'pos']).pos.count()
# TOKEN.groupby(['term_id', 'pos']).pos.count().unstack()
# TOKEN.groupby(['term_id', 'pos']).pos.count().unstack().idxmax(1)

In [None]:
VOCAB['pos_max'] = TOKEN.groupby(['term_id', 'pos']).pos.count().unstack().idxmax(1)

In [None]:
VOCAB.sample(5)

## Compare POS Stats in TOKEN and VOCAB

Pause and look at the distribution of POS tags. The POS table could become part of your data model (analytical edition) if you were interested in studying POS tags.

In [None]:
POS = TOKEN.pos.value_counts().to_frame('n') # Name the column directly; value_counts() names differ across pandas versions
POS.index.name = 'pos_id'

In [None]:
POS.sort_values('n').plot.bar(y='n', figsize=(15,5), rot=45);

# Zipf's Law

$f \propto \frac{1}{r} $

$k =  fr$
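
As a quick worked example, if frequency is exactly inversely proportional to rank, then the product of the two is the same constant for every term. The counts below are idealized, made-up numbers, not drawn from the corpus:

```python
import pandas as pd

# Idealized counts that follow f = 1200 / r exactly
zipf_toy = pd.DataFrame({'n': [1200, 600, 400, 300]})
zipf_toy['rank'] = range(1, len(zipf_toy) + 1)

# k = f * r should be constant under a perfect Zipf distribution
zipf_toy['k'] = zipf_toy['n'] * zipf_toy['rank']
```

Real corpora only approximate this, which is why the notebook inspects the spread of `zipf_k` rather than expecting a single value.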

## Add Term Rank to VOCAB

In [None]:
if 'term_rank' not in VOCAB.columns:
    VOCAB = VOCAB.sort_values('n', ascending=False).reset_index()
    VOCAB.index.name = 'term_rank'
    VOCAB = VOCAB.reset_index()
    VOCAB = VOCAB.set_index('term_id')
    VOCAB['term_rank'] = VOCAB['term_rank'] + 1

In [None]:
VOCAB.head()

## Alternate Rank

The `term_rank` as defined above assigns different ranks to words with the same frequency, which happens throughout the long tail, e.g. with words that appear only once.
The alternate rank instead assigns one rank per distinct term count, so words with equal counts share a rank.
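
The difference between the two rankings can be sketched with `Series.rank()`. Note that this is a swapped-in technique for illustration, not the method used in the cells that follow, and `toy_n` is a made-up series:

```python
import pandas as pd

toy_n = pd.Series([50, 20, 20, 1, 1, 1], index=list('abcdef'), name='n')

# Ordinal rank: every term gets its own rank; ties are broken by position
ordinal = toy_n.rank(method='first', ascending=False).astype(int)

# Dense rank: terms with equal counts share a rank, with no gaps between ranks
dense = toy_n.rank(method='dense', ascending=False).astype(int)
```

With dense ranking, all the words that appear once collapse into a single rank, so the maximum rank equals the number of distinct counts.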

In [None]:
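# One row per distinct count, highest count first. Built without relying on the
# column names that reset_index() generates, which differ across pandas versions.
new_rank = VOCAB.n.value_counts().sort_index(ascending=False)\
    .rename_axis('n').to_frame('nn')
new_rank.insert(0, 'term_rank2', range(len(new_rank)))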

In [None]:
new_rank.head()

In [None]:
VOCAB['term_rank2'] = VOCAB.n.map(new_rank.term_rank2) + 1

In [None]:
VOCAB.head()

In [None]:
VOCAB['p'] = VOCAB.n / VOCAB.shape[0] # n scaled by vocabulary size; not a probability (see p2 below for n / n.sum())

## Compute Zipf's K

In [None]:
VOCAB['zipf_k'] = VOCAB.n * VOCAB.term_rank
VOCAB['zipf_k2'] = VOCAB.n * VOCAB.term_rank2
VOCAB['zipf_k3'] = VOCAB.p * VOCAB.term_rank2

In [None]:
VOCAB.describe().T

### Words with low k

In [None]:
VOCAB[VOCAB.zipf_k <= VOCAB.zipf_k.quantile(.1)].sort_values('zipf_k3', ascending=True).head()

### Words with high k

In [None]:
VOCAB[VOCAB.zipf_k >= VOCAB.zipf_k.quantile(.9)].sort_values('zipf_k3', ascending=False).head()

## Visualize

### Histogram of Zipf K

In [None]:
# px.histogram(VOCAB, 'zipf_k', marginal='box')

In [None]:
# px.histogram(VOCAB, 'zipf_k2', marginal='box')

In [None]:
# px.histogram(VOCAB, 'zipf_k3', marginal='box')

###  Rank and N

In [None]:
VSAMP1 = VOCAB[['n','term_rank','zipf_k','term_str','pos_max']]
# VSAMP2 = VOCAB[['n','term_rank2','zipf_k3']].drop_duplicates()

In [None]:
px.scatter(VSAMP1, x='term_rank', y='n', log_y=False, log_x=False, hover_name='term_str', color='pos_max')

In [None]:
# px.scatter(VSAMP2, x='term_rank2', y='n', log_y=False, log_x=False)

In [None]:
px.scatter(VSAMP1, x='term_rank', y='n', log_y=True, log_x=True, hover_name='term_str', color='pos_max')

In [None]:
# px.scatter(VSAMP2, x='term_rank2', y='n', log_y=True, log_x=True)

## Demo Rank Index

In [None]:
rank_index = [1, 2, 3, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000]

In [None]:
demo = VOCAB.loc[VOCAB.term_rank.isin(rank_index), ['term_str', 'term_rank', 'n', 'zipf_k', 'pos_max']]

In [None]:
demo.style.background_gradient(cmap=gradient_cmap, high=.5)

In [None]:
# rank_index = [1, 2, 3, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800]
# demo = VOCAB.loc[VOCAB.term_rank2.isin(rank_index), ['term_str', 'term_rank2', 'n', 'zipf_k2', 'pos_max']]
# demo.style.background_gradient(cmap=gradient_cmap, high=.5)

# VOCAB Entropy

## Compute P of VOCAB

This is the prior, or marginal, probability of a term.
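
A small sketch of the prior and of the entropy that is computed from it in the next section. The counts in `toy_counts` are invented so that the arithmetic is easy to check by hand:

```python
import numpy as np
import pandas as pd

toy_counts = pd.Series([4, 2, 1, 1], index=['the', 'cat', 'sat', 'mat'])

p = toy_counts / toy_counts.sum()   # prior probability of each term
h = p * np.log2(1 / p)              # each term's contribution to the entropy
H = h.sum()                         # entropy of the distribution, in bits
H_max = np.log2(len(toy_counts))    # entropy if all terms were equally likely
redundancy = 1 - H / H_max          # share of the maximum that is "used up"
```

With probabilities of 1/2, 1/4, 1/8, and 1/8, the entropy is 1.75 bits against a maximum of 2 bits, giving a redundancy of 12.5%.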

In [None]:
%%time
VOCAB['p2'] = VOCAB.n / VOCAB.n.sum()

## Compute Entropy of VOCAB

In [None]:
VOCAB['h'] = VOCAB.p2 * np.log2(1/VOCAB.p2) # Self entropy of each word 
H = VOCAB.h.sum()
N_v = VOCAB.shape[0]
H_max = np.log2(N_v)
R = round((1 - H / H_max) * 100) # Redundancy as a whole percent; scale before rounding to avoid float truncation in int(R)

In [None]:
print("H \t= {}\nH_max \t= {}\nR \t= {}%".format(H, H_max, int(R)))

# BOW

In [None]:
BOW = TOKEN.groupby(bag+['term_id']).term_id.count()\
    .to_frame().rename(columns={'term_id':'n'})

In [None]:
BOW['c'] = BOW.n.astype('bool').astype('int')

In [None]:
BOW.head(10)

# Document-Term Matrix

We create a document-term count matrix. Note that we can create a matrix for any of the features in BOW. Also, see how the OHCO helps us distinguish between features and observation identity.

Note, these operations are slower than using `groupby()`.
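
A minimal sketch of the pivot on a toy table. The names `toy_bow` and `toy_dtcm` and the index level `doc_id` are assumptions made for illustration:

```python
import pandas as pd

# Toy BOW: one row per (document, term) pair that actually occurs
toy_bow = pd.DataFrame(
    {'n': [2, 1, 3]},
    index=pd.MultiIndex.from_tuples(
        [(1, 10), (1, 11), (2, 10)], names=['doc_id', 'term_id']
    ),
)

# unstack() pivots the innermost index level (term_id) into columns;
# pairs that never occur come out as NaN, which we fill with 0
toy_dtcm = toy_bow.n.unstack().fillna(0).astype('int')
```

The narrow table stores only observed pairs, while the wide matrix stores a cell for every document and term, which is why it is larger and slower to work with.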

## Create Count Matrix

In [None]:
%%time
DTCM = BOW[count_method].unstack().fillna(0).astype('int')

In [None]:
DTCM.head()

## Compute TF

We could also compute that using `BOW.groupby()`.
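
As a sketch of that alternative, here is the `sum` method computed both ways on a toy table. The names and values are made up, and only the `sum` variant is shown:

```python
import pandas as pd

toy_bow = pd.DataFrame(
    {'n': [2, 1, 3, 1]},
    index=pd.MultiIndex.from_tuples(
        [(1, 10), (1, 11), (2, 10), (2, 12)], names=['doc_id', 'term_id']
    ),
)

# Narrow table: divide each count by the total count of its document
toy_bow['tf'] = toy_bow.n / toy_bow.groupby('doc_id').n.transform('sum')

# Wide matrix: transpose so that documents are columns, divide, transpose back
toy_dtcm = toy_bow.n.unstack().fillna(0)
toy_tf = (toy_dtcm.T / toy_dtcm.T.sum()).T
```

Both routes give the same values for observed pairs; the matrix route also produces explicit zeros for terms that do not occur in a document.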

In [None]:
%%time
print('TF method:', tf_method)

if tf_method == 'sum':
    TF = DTCM.T / DTCM.T.sum()

elif tf_method == 'max':
    TF = DTCM.T / DTCM.T.max()

elif tf_method == 'log':
    TF = np.log10(1 + DTCM.T)
    
elif tf_method == 'raw':
    TF = DTCM.T

elif tf_method == 'double_norm':
    TF = DTCM.T / DTCM.T.max()
    TF = tf_norm_k + (1 - tf_norm_k) * TF[TF > 0] # Zero counts are masked to NaN so absent terms do not receive the baseline k

elif tf_method == 'binary':
    TF = DTCM.T.astype('bool').astype('int')
    
TF = TF.T

In [None]:
TF.head()

## Compute DF

In [None]:
%%time
DF = DTCM[DTCM > 0].count()

In [None]:
DF.head()

## Compute IDF

In [None]:
N = DTCM.shape[0]

In [None]:
print('IDF method:', idf_method)

if idf_method == 'standard':
    IDF = np.log10(N / DF)

elif idf_method == 'max':
    IDF = np.log10(DF.max() / DF) 

elif idf_method == 'smooth':
    IDF = np.log10((1 + N) / (1 + DF)) + 1 # Same form as scikit-learn's smooth_idf, but with log base 10 instead of the natural log

## Compute TFIDF

In [None]:
TFIDF = TF * IDF

In [None]:
TFIDF.head()

## Move things to their places

In [None]:
VOCAB['df'] = DF
VOCAB['idf'] = IDF

In [None]:
VOCAB.head()

In [None]:
%%time
BOW['tf'] = TF.stack()
BOW['tfidf'] = TFIDF.stack()

In [None]:
BOW.head()

## Apply TFIDF sum to VOCAB

In [None]:
VOCAB['tfidf_sum'] = TFIDF.sum()

## Observe results

In [None]:
VOCAB.sort_values('tfidf_sum', ascending=False).head(20).style.background_gradient(cmap=gradient_cmap, high=1)

In [None]:
VOCAB[['term_rank','term_str','pos_max','tfidf_sum']]\
    .sort_values('tfidf_sum', ascending=False).head(50)\
    .style.background_gradient(cmap=gradient_cmap, high=1)

In [None]:
VOCAB.loc[VOCAB.pos_max != 'NNP', ['term_rank','term_str','pos_max','tfidf_sum']]\
    .sort_values('tfidf_sum', ascending=False)\
    .head(50).style.background_gradient(cmap=gradient_cmap, high=1)

In [None]:
BOW = BOW.join(VOCAB[['term_str','pos_max']], on='term_id')

In [None]:
BOW.sort_values('tfidf', ascending=False).head(20)\
    .style.background_gradient(cmap=gradient_cmap, high=1)

## Visualize

### Rank and TFIDF Sum

In [None]:
px.scatter(VOCAB, x='term_rank', y='tfidf_sum', hover_name='term_str', hover_data=['n'], color='pos_max')

In [None]:
# px.scatter(VOCAB, x='term_rank2', y='tfidf_sum', hover_name='term_str', hover_data=['n'], color='pos_max')

### Log Rank and Log TFIDF Sum

In [None]:
px.scatter(VOCAB, x='term_rank', y='tfidf_sum', hover_name='term_str', hover_data=['n'], color='pos_max', 
           log_x=True, log_y=True)

In [None]:
# px.scatter(VOCAB, x='term_rank2', y='tfidf_sum', hover_name='term_str', hover_data=['n'], color='pos_max', 
#            log_x=True, log_y=True)

### Show Demo Table with TFIDF

In [None]:
demo2 = VOCAB.loc[VOCAB.term_rank.isin(rank_index), ['term_str', 'pos_max', 'term_rank', 'n', 'zipf_k', 'tfidf_sum']]

In [None]:
demo2.style.background_gradient(cmap=gradient_cmap, high=1)

In [None]:
px.scatter(demo2, x='term_rank', y='tfidf_sum', log_x=True, log_y=True, text='term_str', color='pos_max', size='n')

# Word-Context Matrix Entropy

In [None]:
WCM = DTCM / DTCM.sum()

In [None]:
WCM.sum().head()

In [None]:
WCMh = WCM * np.log2(1/WCM)

In [None]:
VOCAB['h2'] = WCMh.sum()

In [None]:
VOCAB['h2'].hist();

# X Factor

In [None]:
# VOCAB['x_factor'] = np.log(VOCAB.term_rank) * VOCAB.h2

In [None]:
# px.scatter(VOCAB, x='term_rank', y='x_factor', hover_name='term_str', color='pos_max', hover_data=['n'])

In [None]:
# VOCAB['x_factor2'] = VOCAB.term_rank2 * VOCAB.h2

In [None]:
VOCAB['x_factor2'] = np.log(VOCAB.term_rank2) * VOCAB.h2

In [None]:
px.scatter(VOCAB, x='term_rank2', y='x_factor2', hover_name='term_str', color='pos_max', hover_data=['n'])

In [None]:
# px.scatter(VOCAB, x='term_rank', y='x_factor', log_x=True, log_y=True, hover_name='term_str', color='pos_max', hover_data=['n'])

In [None]:
# px.scatter(VOCAB, x='term_rank2', y='x_factor2', log_x=True, log_y=True, hover_name='term_str', color='pos_max', hover_data=['n'])

## Demo Table

In [None]:
# 'x_factor' is omitted because the cell that creates it is commented out above
demo3 = VOCAB.loc[VOCAB.term_rank.isin(rank_index), ['term_str', 'pos_max', 'n', 'term_rank', 'zipf_k', 'tfidf_sum', 'h2', 'term_rank2', 'x_factor2']]

In [None]:
demo3.style.background_gradient(cmap=gradient_cmap)

In [None]:
# px.scatter(demo3, x='term_rank', y='x_factor', log_x=True, log_y=True, text='term_str', color='pos_max', size='n')

In [None]:
px.scatter(demo3, x='term_rank2', y='x_factor2', log_x=False, log_y=False, text='term_str', color='pos_max', size='n')

# Reduce VOCAB

## Select Significant Terms based on X Factor

We want to take the upper and middle segment of our graph.
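
The selection combines a score threshold with a rank floor. A toy sketch, in which the table, the column `score`, and both cutoffs are hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({
    'term_str':  ['the', 'whale', 'ship', 'sea', 'harpoon'],
    'term_rank': [1, 2, 3, 4, 5],
    'score':     [90, 80, 10, 70, 20],
})

score_min = toy.score.quantile(.5)  # keep terms at or above the median score
rank_floor = 2                      # drop the single most frequent term

# Keep terms that score highly but are not among the most frequent
sigs = toy[(toy.score >= score_min) & (toy.term_rank >= rank_floor)]\
    .sort_values('score', ascending=False)
```

The rank floor removes very frequent words that score highly for the wrong reasons, while the quantile cutoff removes the low-scoring tail.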

In [None]:
# key_col = 'tfidf_sum'
key_col = 'x_factor2'
key_min = VOCAB[key_col].quantile(.9)
rank_min = 200

In [None]:
SIGS = VOCAB.loc[(VOCAB[key_col] >= key_min) & (VOCAB.term_rank >= rank_min)].sort_values(key_col, ascending=False)

In [None]:
SIGS.shape[0]

In [None]:
SIGS[['pos_max', 'term_str', 'n', 'term_rank', 'zipf_k', 'df', 'idf', 'tfidf_sum','x_factor2']].head(100).style.background_gradient(cmap=gradient_cmap, high=1)

# Save Work

In [None]:
VOCAB.to_csv('VOCAB2.csv')
TOKEN.to_csv('TOKEN2.csv')
BOW.to_csv('DOC2.csv')
DTCM.to_csv('DTCM.csv')
TFIDF.to_csv('TFIDF.csv')
SIGS.to_csv('SIGS.csv')
WCM.to_csv('WCM.csv')
# BOW.to_csv('BOW.csv')