# Metadata

```yaml
Course:    DS 5001
Module:    M05 Lab
Topic:     Variant TFIDFs and Document Significance
Author:    R.C. Alvarado
Date:      12 February 2023
```

# Exposition

* Three kinds of significance:
  * __Local__: `TF-IDF` (significance of a term in a document; related to $p(w|d,C)$).
  * __Global__: Aggregate `TF-IDF` by term (significance of a term in the corpus; related to $p(w|C)$).
  * __Document__: Aggregate `TF-IDF` by document (significance of a document in the corpus; related to $p(d|W_d,C)$).
* `TF-IDF` is essentially local frequency balanced by global frequency: a term's count in a document is discounted by how many documents contain the term.
* `DF-IDF` = $\Sigma$ `TF-IDF` over documents when counts are boolean.
* `DF-IDF` is global boolean term entropy: with $p = DF/N$, it equals $N \cdot p \log_2(1/p)$.
* Boolean counts are bad for computing local significance, but good for global significance.
* Max normalization is good for local significance.
* Document significance should be computed from a good measure of local significance.
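
The two `DF-IDF` claims in the list above can be checked on a toy matrix. The following is a minimal sketch, assuming a hypothetical 4-document, 3-term count table (`TF_toy` and the other `_toy` names are illustrative and not part of the lab data).

```python
import numpy as np
import pandas as pd

# Hypothetical document-term counts: 4 documents (rows), 3 terms (columns).
TF_toy = pd.DataFrame(
    [[3, 0, 1],
     [1, 0, 2],
     [0, 2, 1],
     [0, 0, 5]],
    columns=['whale', 'ball', 'the'],
)

N_toy = len(TF_toy)                   # number of documents
DF_toy = TF_toy.astype(bool).sum()    # documents containing each term
IDF_toy = np.log2(N_toy / DF_toy)

# Aggregate boolean TF-IDF: 0/1 presence times IDF, summed over documents.
tfidf_bool_sum = (TF_toy.astype(bool).astype(int) * IDF_toy).sum()

# DF-IDF computed directly.
dfidf = DF_toy * IDF_toy

# Entropy-like form: N * p * log2(1/p), with p = DF / N.
p = DF_toy / N_toy
entropy_form = N_toy * p * np.log2(1 / p)

print(pd.DataFrame({'bool_sum': tfidf_bool_sum, 'dfidf': dfidf, 'entropy': entropy_form}))
```

All three columns agree in this example. A term found in every document (`the`) scores zero, and the score peaks for terms found in a moderate share of documents.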

# Set Up

In [1]:
data_in = '../data/output'
data_out = '../data/output'
data_prefix = 'austen-melville'

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import re

In [3]:
sns.set()

# Get Data

In [4]:
# Lib
LIB = pd.read_csv(f"{data_in}/{data_prefix}-LIB.csv").set_index('book_id')
# Extract the author first: trimming the title beforehand would discard the "by ..." part
LIB['author'] = LIB.apply(lambda x: re.split(r',? by', x.title)[-1], axis=1)
LIB['title'] = LIB.title.str.split(r',? by').apply(lambda x: x[0])
for idx in [15859, 13720, 53861, 13721]:
    LIB.loc[idx, 'author'] = 'Herman Melville'
LIB = LIB[['title','author']]

In [5]:
# Tokens
TOKEN = pd.read_csv(f'{data_in}/{data_prefix}-TOKEN.csv')
OHCO = TOKEN.columns.to_list()[:5] 
TOKEN = TOKEN.set_index(OHCO)

In [None]:
# Vocab
VOCAB = pd.read_csv(f'{data_in}/{data_prefix}-VOCAB.csv').set_index('term_str')
# VOCAB = VOCAB.drop(columns='term_id') # We will forego using numeric term_ids and just use the term_str
VOCAB['max_pos'] = TOKEN.groupby(['term_str','pos']).pos.count().unstack().idxmax(axis=1) # Most frequent POS tag per term
VOCAB['pos_group'] = VOCAB.max_pos.str[:2]
VOCAB['term_code'] = VOCAB.apply(lambda x: str(x.name) + '/' + x.max_pos, axis=1)
VOCAB['term_len'] = VOCAB.index.str.len()
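
The most-frequent-POS step above chains a group count, a pivot, and a row-wise `idxmax`. Here is a small sketch of that pattern on a hypothetical token table (the data and the `tokens_toy` name are made up for illustration). It uses `size()` in place of `count()`, which gives the same numbers when no values are missing.

```python
import pandas as pd

# Hypothetical token table: one row per token occurrence.
tokens_toy = pd.DataFrame({
    'term_str': ['run', 'run', 'run', 'sea', 'sea', 'sea'],
    'pos':      ['VB',  'VB',  'NN',  'NNS', 'NNS', 'NN'],
})

# Count each (term, POS) pair and pivot POS tags into columns.
pos_counts = tokens_toy.groupby(['term_str', 'pos']).size().unstack(fill_value=0)

# For each term, take the POS column with the largest count.
max_pos_toy = pos_counts.idxmax(axis=1)

# Collapse fine-grained tags to their two-letter group (e.g. NNS -> NN).
pos_group_toy = max_pos_toy.str[:2]

print(pos_counts)
print(max_pos_toy)
print(pos_group_toy)
```

When two tags tie for a term, `idxmax` returns the first matching column, so ties resolve by column order and not by any linguistic preference.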

# Recreate BOW and TFIDF

In [None]:
# DOC = OHCO[:1] # Book
DOC = OHCO[:2] # Chapter
# DOC = OHCO[:3] # Paragraph

In [None]:
BOW = TOKEN.groupby(DOC+['term_str']).term_str.count().to_frame('n')
BOW['bool'] = 1 

In [None]:
BOW
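
To make the shape of `BOW` concrete, the sketch below builds a bag of words from a hypothetical five-token table indexed by book and chapter, then pivots it into a document-term matrix. The data and the `_toy` names are assumptions for illustration. As a substitution, `size()` stands in for the `term_str.count()` call used above.

```python
import pandas as pd

# Hypothetical token table indexed by a two-level OHCO (book, chapter).
tokens_toy = pd.DataFrame({
    'book_id':  [1, 1, 1, 1, 1],
    'chap_num': [1, 1, 1, 2, 2],
    'term_str': ['the', 'sea', 'the', 'the', 'ship'],
}).set_index(['book_id', 'chap_num'])

doc_toy = ['book_id', 'chap_num']   # a chapter is the document unit

# One row per (document, term) pair, with its count.
bow_toy = tokens_toy.groupby(doc_toy + ['term_str']).size().to_frame('n')
bow_toy['bool'] = 1                 # presence flag: every listed pair occurs

# Pivot terms into columns; pairs that never occur become 0.
tf_toy = bow_toy.n.unstack(fill_value=0)

print(bow_toy)
print(tf_toy)
```

The long `bow_toy` table stores only pairs that occur, which is why `bool` can be set to 1 everywhere. The zeros appear only after unstacking into the wide matrix.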

## TFIDF

### Traditional

In [None]:
TF = BOW.n.unstack(fill_value=0)    
DF = TF.astype('bool').sum() 
N = len(TF)
IDF = np.log2(N/DF)      
TFIDF = TF * IDF
TFIDF_agg = TFIDF.sum()

In [None]:
TFIDF_agg.sort_values(ascending=False).head(20)
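
In symbols, the traditional computation above can be written as follows. The notation is introduced here for convenience and is not the notebook's own: $\mathrm{TF}(w,d)$ is the raw count of term $w$ in document $d$, and $N$ is the number of documents.

$$
\mathrm{DF}(w) = \left|\{\, d : \mathrm{TF}(w,d) > 0 \,\}\right|, \qquad
\mathrm{IDF}(w) = \log_2 \frac{N}{\mathrm{DF}(w)}, \qquad
\mathrm{TFIDF}(w,d) = \mathrm{TF}(w,d) \cdot \mathrm{IDF}(w)
$$

The aggregate score ranked in the cell above is then $\sum_d \mathrm{TFIDF}(w,d)$, one value per term.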

### Variants

In [None]:
def rel_tf(tf):
    return (tf.T / tf.T.sum()).T

def max_tf(tf, alpha=.4):
    return alpha + (1 - alpha) * (tf.T / tf.T.max()).T

def bool_tf(tf):
    return tf.astype('bool').astype('int')

def log_tf(tf):
    return np.log2(1 + tf)

# def sub_tf(tf): # Sublinear
#     return 1 + np.log2(tf)
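
To see how the variants differ, the sketch below applies the same four transformations to a hypothetical two-document count table. The data and variable names are made up for illustration, and the formulas are written with `div(..., axis=0)`, which is equivalent to the transpose idiom used above.

```python
import numpy as np
import pandas as pd

# Hypothetical counts: rows are documents, columns are terms.
tf_toy = pd.DataFrame([[4, 1, 0],
                       [2, 2, 6]],
                      index=['doc_a', 'doc_b'],
                      columns=['x', 'y', 'z'])

alpha = 0.4  # same default as max_tf above

rel = tf_toy.div(tf_toy.sum(axis=1), axis=0)                         # share of the document's tokens
mx = alpha + (1 - alpha) * tf_toy.div(tf_toy.max(axis=1), axis=0)    # scaled by the document's top count
bl = tf_toy.astype(bool).astype(int)                                 # presence/absence
lg = np.log2(1 + tf_toy)                                             # damped counts

print(pd.concat([rel, mx, bl, lg], axis=1, keys=['rel', 'max', 'bool', 'log']).round(3))
```

Note that, with this formula, the max variant assigns the floor value `alpha` even to a term that is absent from a document (`z` in `doc_a`), whereas the other three variants leave absent terms at zero.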

In [None]:
TFIDF_rel = rel_tf(TF) * IDF
TFIDF_bool = bool_tf(TF) * IDF
TFIDF_max = max_tf(TF) * IDF
TFIDF_log = log_tf(TF) * IDF

In [None]:
TFIDF_agg_rel = TFIDF_rel.sum().to_frame('sum_val')
TFIDF_agg_bool = TFIDF_bool.sum().to_frame('sum_val')
TFIDF_agg_max = TFIDF_max.sum().to_frame('sum_val')
TFIDF_agg_log = TFIDF_log.sum().to_frame('sum_val')

In [None]:
DFIDF = (DF * IDF).to_frame('val')

In [None]:
pd.concat([
    DFIDF.sort_values('val', ascending=False).head(20).reset_index(),
    TFIDF_agg_bool.sort_values('sum_val', ascending=False).head(20).reset_index(), 
    TFIDF_agg_max.sort_values('sum_val', ascending=False).head(20).reset_index(),
    TFIDF_agg_rel.sort_values('sum_val', ascending=False).head(20).reset_index(),
    TFIDF_agg_log.sort_values('sum_val', ascending=False).head(20).reset_index()
    ], 
    axis=1, keys=['dfidf', 'bool','max', 'rel', 'log'])\
    .style.background_gradient('YlGnBu')

# Document Significance

In [None]:
# DS = TFIDF.T.mean() # Document Significance
# DS_sum = TFIDF.T.sum() # Document Significance
# DS_bool = TFIDF_bool.T.mean() # Document Significance
# DS_max = TFIDF_max.T.mean() # Document Significance

In [None]:
DOCS = BOW.groupby(DOC).n.sum().to_frame('n')
DOCS['doc_sig_raw'] = TFIDF.T.mean()
DOCS['doc_sig_bool'] = TFIDF_bool.T.mean()
DOCS['doc_sig_max'] = TFIDF_max.T.mean()
DOCS['doc_sig_rel'] = TFIDF_rel.T.mean()
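
Document significance here is the mean `TF-IDF` weight across the whole vocabulary for each document. Below is a minimal sketch on a hypothetical three-chapter matrix (the data and the `_toy` names are assumptions, not lab data).

```python
import numpy as np
import pandas as pd

# Hypothetical chapter-term counts for one book.
tf_toy = pd.DataFrame([[5, 1, 0],
                       [1, 1, 0],
                       [0, 1, 4]],
                      index=pd.Index([1, 2, 3], name='chap_num'),
                      columns=['a', 'b', 'c'])

n_docs = len(tf_toy)
idf_toy = np.log2(n_docs / tf_toy.astype(bool).sum())
tfidf_toy = tf_toy * idf_toy

# Mean over terms for each document; .T.mean() and .mean(axis=1) agree.
doc_sig_toy = tfidf_toy.T.mean()

print(doc_sig_toy.round(3))
```

Because the mean runs over every vocabulary column, zeros included, it equals the row sum divided by a constant, so mean and sum rank documents identically. In this toy case chapter 3 scores highest because it concentrates a term (`c`) that appears nowhere else.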

In [None]:
def plot_sig_docs(book_id, sig_type='raw', type='scatter'):
    d = DOCS.loc[book_id]
    sig = f'doc_sig_{sig_type}'
    title = LIB.loc[book_id].title + " / " + sig_type
    if type == 'scatter':
        return px.scatter(d.reset_index(), 'chap_num', sig, title=title, height=500, size='n', text=d.index)
    elif type == 'line':
        return px.line(d.reset_index(), 'chap_num', sig, title=title, height=500)        

In [None]:
plot_sig_docs(105, sig_type='bool')

> **Chapter 12 signals a climax in the novel's narrative.** Persuasion is a linear narrative that is organized chronologically. The original edition of this novel was published in two volumes, the first volume ending at the close of Chapter 12. Louisa's fall is the greatest dramatic occurrence which has happened so far. By inserting the fall here, Austen creates a cliffhanger and encourages her readers to buy the second volume of her novel. In these chapters, the reader is shown the negative effects of what can happen when one is too stubborn. Louisa would not be persuaded to keep from jumping off the wall. Her firmness of mind means serious injury for her and significant guilt for Captain Wentworth. He is encouraged to rethink his initial judgment of the benefit of a "strong character." ([Sparknotes](https://www.sparknotes.com/lit/persuasion/section6/page/2/))

In [None]:
plot_sig_docs(1342, sig_type='bool')

In [None]:
plot_sig_docs(2701, sig_type='bool')

# Doc Sig by Book

In [None]:
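def get_chap_sigs(bow):
    # Compute chapter significance within a single book's BOW
    tf = bow['bool'].unstack(fill_value=0)
    tf = (tf.T / tf.T.sum()).T # Normalize bool by length (number of distinct terms per chapter)
    df = tf.sum() # Note: sum of normalized weights per term, not a raw document count
    idf = np.log2(len(tf)/df)
    tfidf = tf * idf
    ds = tfidf.T.mean() # Mean weight over the book's vocabulary, one value per chapter
    return ds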

In [None]:
def plot_sig_docs2(book_id, type='scatter'):

    global BOW
    bow = BOW.loc[book_id]
    d = bow.groupby('chap_num').n.sum().to_frame('n')
    d['doc_sig'] = get_chap_sigs(bow)
    title = LIB.loc[book_id].title
    d = d.reset_index()
    d['p'] = (d.n / d.n.sum()) * 700

    fig = go.Figure()
    fig.add_trace(go.Scatter(x=d.chap_num, y=d.doc_sig, text=d.chap_num, 
                             mode = 'lines+markers+text',
                             marker = dict(size=d.p, color='#BBB'),
                             line = dict(color='#DADADA'),
                             textfont = dict(color="black")
                            ))
                  
    fig.update_layout(
        font = dict(color="#000", size=14),
        title=title,
        xaxis_title="Chapter",
        yaxis_title="Significance",
        height=800
    )
    fig.show()

In [None]:
plot_sig_docs2(105)

In [None]:
plot_sig_docs2(1342)

In [None]:
plot_sig_docs2(2701)

In [None]:
BOW