# Metadata

```
Course:   DS 5001
Module:   05 HW
Topic:    Create and Apply a TF-IDF Function
Author:   R.C. Alvarado
```

# Instructions

Using the notebook from this module (`M05_BOW_TFIDF.ipynb`) and the `LIB` and `CORPUS` tables generated from the collection of texts (Austen and Melville) in Module 4, create a notebook to perform the following tasks:

Write a function to generate a bag-of-words (`BOW`) representation of the `CORPUS` table (or some subset of it) that takes the following arguments:
* A tokens dataframe which can be a filtered version of the dataframe you import. This will be the `CORPUS` table or some subset of it.
* A choice of bag, i.e. `OHCO` level, such as book, chapter, or paragraph.
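
As a rough illustration of the two arguments above, the sketch below builds a BOW from a tiny, made-up tokens table. The names `toy_tokens`, `toy_ohco`, and `toy_bow` are hypothetical and the data is invented; the index is a shortened OHCO so the example stays small.

```python
import pandas as pd

# Hypothetical toy tokens table indexed by a shortened OHCO (illustrative data only)
toy_ohco = ['book_id', 'chap_id', 'token_num']
toy_tokens = pd.DataFrame(
    [
        (1, 1, 0, 'the'), (1, 1, 1, 'whale'), (1, 1, 2, 'the'),
        (1, 2, 0, 'sea'), (1, 2, 1, 'whale'),
        (2, 1, 0, 'the'), (2, 1, 1, 'ball'),
    ],
    columns=toy_ohco + ['term_str'],
).set_index(toy_ohco)

def toy_bow(tokens, bag):
    """Count each term within each bag; `bag` is a list of index level names."""
    return tokens.groupby(bag + ['term_str'])['term_str'].count().to_frame('n')

bow_by_book = toy_bow(toy_tokens, ['book_id'])             # one bag per book
bow_by_chap = toy_bow(toy_tokens, ['book_id', 'chap_id'])  # one bag per chapter
print(bow_by_book)
```

Passing a shorter or longer slice of the OHCO list is all it takes to change the bag, which is why the bag is an argument rather than being hard-coded.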

Write a function that returns the `TFIDF` values for a given `BOW`, with the following arguments:
* The `BOW` table.
* The type of `TF` measure to use.

To compute `IDF`, use the formula $\log_2(\frac{N}{DF})$, where $N$ is the number of documents (aka 'bags') in your `BOW` and $DF$ is the number of documents in which a term appears.
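
To make the formula concrete, here is a minimal sketch on an invented doc-term count matrix with four bags. The names `toy_dtcm`, `n_docs`, `doc_freq`, and `tf_max` are assumptions for illustration, and absent terms are represented as `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical doc-term count matrix: rows are bags, NaN means the term is absent
toy_dtcm = pd.DataFrame(
    {
        'the':   [4, 2, 6, 1],
        'whale': [2, np.nan, 3, np.nan],
        'ball':  [np.nan, 1, np.nan, np.nan],
    },
    index=pd.Index(['d1', 'd2', 'd3', 'd4'], name='doc_id'),
)

n_docs = len(toy_dtcm)            # N = 4
doc_freq = toy_dtcm.count()       # DF: number of bags containing each term
idf = np.log2(n_docs / doc_freq)  # the: log2(4/4), whale: log2(4/2), ball: log2(4/1)

# 'max' TF: divide each bag's counts by that bag's largest count
tf_max = toy_dtcm.div(toy_dtcm.max(axis=1), axis=0)
tfidf = tf_max * idf              # the Series aligns with the columns (terms)
print(tfidf.mean())
```

A term found in every bag gets an IDF of zero, so its TFIDF is zero no matter how often it occurs.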
  
Use these functions to get the appropriate `TFIDF` to answer the questions below.

## Hints
* Update the course GitHub repository to make sure you are working with the latest files.
* Remember that the `CORPUS` table is a `TOKENS` table; it's just the combination of several such tables into one.
* You will need to generate your own `VOCAB` table from `CORPUS` and compute `max_pos`. 
* When generating your own `VOCAB` table from `CORPUS`, be sure to name your index `term_str`. 
* Remember that the mean `TFIDF` is an aggregate statistic computed from the `TFIDF` results and shares the same domain as the `VOCAB` table.
* `OHCO = ['book_id', 'chap_id', 'para_num', 'sent_num', 'token_num']`
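
The hints above can be sketched roughly as follows. The token rows in `toy` and the values in `mean_tfidf` are invented for illustration; only the `OHCO` list comes from the assignment.

```python
import pandas as pd

OHCO = ['book_id', 'chap_id', 'para_num', 'sent_num', 'token_num']
PARAS = OHCO[:3]   # ['book_id', 'chap_id', 'para_num']
CHAPS = OHCO[:2]   # ['book_id', 'chap_id']
BOOKS = OHCO[:1]   # ['book_id']

# Hypothetical token rows; only term_str and pos matter for this sketch
toy = pd.DataFrame({
    'term_str': ['set', 'set', 'set', 'whale', 'whale'],
    'pos':      ['VB',  'NN',  'VB',  'NN',    'NN'],
})

vocab = toy.term_str.value_counts().to_frame('n')
vocab.index.name = 'term_str'
# Most frequent POS tag for each term
vocab['max_pos'] = (
    toy.groupby(['term_str', 'pos']).size().unstack(fill_value=0).idxmax(axis=1)
)

# Mean TFIDF is indexed by term, so it joins directly onto the vocabulary (made-up values)
mean_tfidf = pd.Series({'whale': 0.5, 'set': 0.1}, name='mean_tfidf')
print(mean_tfidf.to_frame().join(vocab.max_pos))
```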

## Questions

## Q1 

Paste your functions here.

**Answer**: `PASTED FUNCTIONS`

## Q2

What are the top 20 words in the corpus by TFIDF mean using the `max` count method and `book` as the bag?

**Answer**:

```
elinor	0.631162	NNP
vernon	0.484550	NNP
darcy	0.360007	NNP
reginald	0.344776	NNP
frederica	0.335458	NNP
crawford	0.331042	NNP
elliot	0.318066	NNP
weston	0.309443	NNP
pierre	0.286605	NNP
knightley	0.283192	NNP
tilney	0.257662	NNP
elton	0.254555	NNP
bingley	0.247385	NNP
wentworth	0.239176	NNP
courcy	0.237616	NNP
woodhouse	0.221144	NNP
churchhill	0.214320	NNP
marianne	0.197925	NNP
babbalanja	0.169229	NNP
mainwaring	0.167729	NNP 
```

## Q3

What are the top 20 words in the corpus by TFIDF mean if you use the `max` count method and `paragraph` as the bag? Note that, because of the greater number of bags, this will take longer to compute.

**<span style="color:red;">NOTE</span>:** These results would be improved by using the `sum` TF count method, since the mean values would not all be the same.
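
One way to see why the values tie: under `max` TF, a term that occurs in a single bag with a count equal to that bag's largest count has TF = 1 and IDF = $\log_2(N)$, so its mean TFIDF is exactly $\log_2(N)$. This assumes the mean is taken only over bags where the term occurs (absent terms stored as `NaN`). The sketch below uses invented paragraph counts to show the tie, and how dividing by the bag total instead can separate the terms.

```python
import numpy as np
import pandas as pd

# Hypothetical paragraph-term counts; NaN marks an absent term
counts = pd.DataFrame(
    {
        'the':   [1, 3, 2, 1],
        'niter': [1, np.nan, np.nan, np.nan],
        'oloo':  [np.nan, np.nan, np.nan, 1],
        'sea':   [np.nan, 1, 2, 1],
    },
    index=pd.Index(['p1', 'p2', 'p3', 'p4'], name='para'),
)
idf = np.log2(len(counts) / counts.count())

tf_max = counts.div(counts.max(axis=1), axis=0)  # count / largest count in the bag
tf_sum = counts.div(counts.sum(axis=1), axis=0)  # count / total tokens in the bag

mean_max = (tf_max * idf).mean()  # mean() skips NaN, i.e. bags without the term
mean_sum = (tf_sum * idf).mean()

print(mean_max[['niter', 'oloo']])  # both equal log2(4)
print(mean_sum[['niter', 'oloo']])  # differ because the paragraphs differ in length
```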

**Answer**:

```
0	14.841663	CD
obverse	14.841663	NN
neeva	14.841663	NN
nestor	14.841663	NNP
nevermore	14.841663	RB
newlanded	14.841663	VBN
adjurative	14.841663	NNP
nightshade	14.841663	VBP
adjourn	14.841663	VBP
niter	14.841663	NN
nocturnally	14.841663	RB
noder	14.841663	RB
nomine	14.841663	JJ
nov	14.841663	NNP
nullifies	14.841663	VBZ
ogdoads	14.841663	NNP
nb	14.841663	NN
ohonoose	14.841663	NNP
oloo	14.841663	IN
optative	14.841663	NNP
```

## Q4

Characterize the general difference between the words in Question 3 and those in Question 2 in terms of part-of-speech.

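**Answer**: With books as bags, the top words by mean TFIDF are all proper nouns (`NNP`), mostly character names. With paragraphs as bags, the top words span many parts of speech (common nouns, verbs, adverbs, adjectives, numbers) alongside proper nouns.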

## Q5

Compute mean `TFIDF` for vocabularies conditioned on individual author, using *chapter* as the bag and `max` as the `TF` count method. Among the two authors, whose work has the most significant adjective?

**Answer**: Melville. /ugh/

# Solution

## Setup

In [1]:
data_home = "../labs-repo/data"
data_prefix = 'austen-melville'

In [2]:
OHCO = ['book_id', 'chap_id', 'para_num', 'sent_num', 'token_num']

In [3]:
SENTS = OHCO[:4]
PARAS = OHCO[:3]
CHAPS = OHCO[:2]
BOOKS = OHCO[:1]

### Import

In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px

In [5]:
sns.set()

## Prepare the data

In [6]:
LIB = pd.read_csv(f"{data_home}/output/{data_prefix}-LIB.csv").set_index('book_id')
CORPUS = pd.read_csv(f"{data_home}/output/{data_prefix}-CORPUS.csv").set_index(OHCO)

In [7]:
VOCAB = CORPUS.term_str.value_counts().to_frame('n')
VOCAB.index.name = 'term_str'
VOCAB['p'] = VOCAB.n / VOCAB.n.sum()
VOCAB['i'] = np.log2(1/VOCAB.p)
VOCAB['max_pos'] = CORPUS.reset_index().value_counts(['term_str','pos']).sort_index().unstack().idxmax(1)

In [8]:
VOCAB

| term_str | n | p | i | max_pos |
|---|---|---|---|---|
| the | 104874 | 5.286723e-02 | 4.241482 | DT |
| of | 63153 | 3.183558e-02 | 4.973216 | IN |
| and | 61104 | 3.080267e-02 | 5.020801 | CC |
| to | 54397 | 2.742166e-02 | 5.188540 | TO |
| a | 42556 | 2.145258e-02 | 5.542705 | DT |
| ... | ... | ... | ... | ... |
| tailoress | 1 | 5.041024e-07 | 20.919780 | NN |
| mollolla | 1 | 5.041024e-07 | 20.919780 | NNP |
| praya | 1 | 5.041024e-07 | 20.919780 | NNP |
| effluvia | 1 | 5.041024e-07 | 20.919780 | NN |


## Define Functions

In [9]:
def create_bow(CORPUS, bag, item_type='term_str'):
    """Count each item (default: term_str) within each bag; `bag` is a list of OHCO levels."""
    BOW = CORPUS.groupby(bag+[item_type])[item_type].count().to_frame('n')
    return BOW

In [10]:
def get_tfidf(BOW, tf_method='max', df_method='standard', item_type='term_str'):
            
    DTCM = BOW.n.unstack() # Create Doc-Term Count Matrix
    
    if tf_method == 'sum':
        TF = (DTCM.T / DTCM.T.sum()).T
    elif tf_method == 'max':
        TF = (DTCM.T / DTCM.T.max()).T
    elif tf_method == 'log':
        TF = (np.log2(1 + DTCM.T)).T
    elif tf_method == 'raw':
        TF = DTCM
    elif tf_method == 'bool':
        # NaN casts to True under astype('bool'), so set observed counts to 1 and keep NaN elsewhere
        TF = DTCM.where(DTCM.isna(), 1)
    else:
        raise ValueError(f"TF method {tf_method} not found.")

    DF = DTCM.count()
    N_docs = len(DTCM)
    
    if df_method == 'standard':
        IDF = np.log2(N_docs/DF) # This is what the students were asked to use
    elif df_method == 'textbook':
        IDF = np.log2(N_docs/(DF + 1))
    elif df_method == 'sklearn':
        IDF = np.log2(N_docs/DF) + 1
    elif df_method == 'sklearn_smooth':
        IDF = np.log2((N_docs + 1)/(DF + 1)) + 1
    else:
        raise ValueError(f"DF method {df_method} not found.")
    
    TFIDF = TF * IDF

    return TFIDF
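
To compare the four `df_method` branches side by side, the sketch below evaluates each formula for a made-up corpus of 8 bags. The term labels and the names `n_docs`, `doc_freq`, and `idf_variants` are hypothetical. The branch names are this notebook's labels; note that scikit-learn itself uses the natural logarithm rather than $\log_2$.

```python
import numpy as np
import pandas as pd

n_docs = 8  # assumed number of bags
doc_freq = pd.Series(
    [1, 2, 4, 8],
    index=['rare', 'uncommon', 'common', 'everywhere'],
    name='df',
)

# Same formulas as the df_method branches in get_tfidf
idf_variants = pd.DataFrame({
    'standard':       np.log2(n_docs / doc_freq),
    'textbook':       np.log2(n_docs / (doc_freq + 1)),
    'sklearn':        np.log2(n_docs / doc_freq) + 1,
    'sklearn_smooth': np.log2((n_docs + 1) / (doc_freq + 1)) + 1,
})
print(idf_variants.round(3))
```

In this toy case the `textbook` variant goes negative for a term found in every bag, while the two `sklearn`-style variants never drop below 1, so ubiquitous terms keep some weight.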

## Get Top Words by Bag

## Q2

In [11]:
BOW_books = create_bow(CORPUS, bag=BOOKS)

In [12]:
TFIDF_books = get_tfidf(BOW_books, tf_method='max', df_method='standard')

In [13]:
TFIDF_books.mean().sort_values(ascending=False)\
    .head(20).to_frame('mean_tfidf').join(VOCAB.max_pos)

| term_str | mean_tfidf | max_pos |
|---|---|---|
| elinor | 0.631162 | NNP |
| vernon | 0.48455 | NNP |
| darcy | 0.360007 | NNP |
| reginald | 0.344776 | NNP |
| frederica | 0.335458 | NNP |
| crawford | 0.331042 | NNP |
| elliot | 0.318066 | NNP |
| weston | 0.309443 | NNP |
| pierre | 0.286605 | NNP |
| knightley | 0.283192 | NNP |


## Q3

In [14]:
BOW_paras = create_bow(CORPUS, bag=PARAS)

In [15]:
TFIDF_paras_max = get_tfidf(BOW_paras, tf_method='max')

In [16]:
TFIDF_paras_max.mean().sort_values(ascending=False)\
    .head(20).to_frame('mean_tfidf').join(VOCAB.max_pos)

| term_str | mean_tfidf | max_pos |
|---|---|---|
| 0 | 14.841663 | CD |
| obverse | 14.841663 | NN |
| neeva | 14.841663 | NN |
| nestor | 14.841663 | NNP |
| nevermore | 14.841663 | RB |
| newlanded | 14.841663 | VBN |
| adjurative | 14.841663 | NNP |
| nightshade | 14.841663 | VBP |
| adjourn | 14.841663 | VBP |
| niter | 14.841663 | NN |


In [17]:
TFIDF_paras_raw = get_tfidf(BOW_paras, tf_method='raw')

In [18]:
TFIDF_paras_raw.mean().sort_values(ascending=False)\
    .head(20).to_frame('mean_tfidf').join(VOCAB.max_pos)

| term_str | mean_tfidf | max_pos |
|---|---|---|
| sneezes | 118.733301 | JJ |
| whitherward | 103.891638 | NN |
| lbs | 103.891638 | NN |
| rea | 103.891638 | NNP |
| caw | 89.049976 | NN |
| willi | 89.049976 | NNP |
| tammaro | 74.208313 | NNP |
| euroclydon | 74.208313 | NNP |
| catnip | 59.36665 | NN |
| shirr | 59.36665 | NN |


## Q5

In [19]:
AUS_IDX = LIB[LIB.author.str.contains('AUS')].index
MEL_IDX = LIB[LIB.author.str.contains('MEL')].index

In [20]:
aus_chap_bow = create_bow(CORPUS.loc[AUS_IDX], bag=CHAPS)
mel_chap_bow = create_bow(CORPUS.loc[MEL_IDX], bag=CHAPS)

In [21]:
TFIDF_aus = get_tfidf(aus_chap_bow, tf_method='max')
TFIDF_mel = get_tfidf(mel_chap_bow, tf_method='max')

### Method 1

In [22]:
A = TFIDF_aus.mean().sort_values(ascending=False).to_frame('mean_tfidf').join(VOCAB.max_pos)

In [23]:
A[A.max_pos == 'JJ'].head(20)

| term_str | mean_tfidf | max_pos |
|---|---|---|
| undismayed | 2.095926 | JJ |
| precarious | 0.762155 | JJ |
| dreary | 0.698642 | JJ |
| perverted | 0.598836 | JJ |
| eoconomical | 0.493159 | JJ |
| unmajestic | 0.493159 | JJ |
| filial | 0.459114 | JJ |
| assaulted | 0.419185 | JJ |
| unoffending | 0.419185 | JJ |
| indissoluble | 0.335348 | JJ |


In [24]:
M = TFIDF_mel.mean().sort_values(ascending=False).to_frame('mean_tfidf').join(VOCAB.max_pos)

In [25]:
M[M.max_pos == 'JJ'].head(20)

| term_str | mean_tfidf | max_pos |
|---|---|---|
| ugh | 4.461857 | JJ |
| um | 2.8857 | JJ |
| manchineels | 1.497126 | JJ |
| sneezes | 1.342251 | JJ |
| adorable | 1.226006 | JJ |
| chronometrical | 1.024349 | JJ |
| upas | 0.757241 | JJ |
| afire | 0.748563 | JJ |
| forereaching | 0.748563 | JJ |
| quoggy | 0.748563 | JJ |


### Method 2

In [26]:
A[A.max_pos == 'JJ'].mean_tfidf.idxmax(), A[A.max_pos == 'JJ'].mean_tfidf.max()

('undismayed', 2.095926073118513)

In [27]:
M[M.max_pos == 'JJ'].mean_tfidf.idxmax(), M[M.max_pos == 'JJ'].mean_tfidf.max()

('ugh', 4.461856968598885)