The following code was cut from section 3.3. It relies on the DTM `lex_comp_dtm` created there, so the code below does not run on its own.

# Important Words: TF-IDF
The analysis above does not take into account the significance of the matching words. Words such as igi\[eye\]n or šag\[heart\]n are used in many kinds of expressions and are found in all literary compositions except the most fragmentary ones. Those words are matched by the list Ugumu (a list of body parts), but such matches are much less meaningful than a match on a rarer word or expression like nir\[stone\]n babbardil\[~stone\]n (a type of agate with a single white stripe), which is fairly common in the lexical tradition but appears in only one place in the literary corpus (UET 6/3 561), and with some frequency in administrative documents.

In order to distinguish between more and less significant matches we will compute the tf-idf value for each word or expression in each literary document. Tf-idf is short for term frequency-inverse document frequency and is a widely used measure (or rather a class of measures) for weighting the importance of a word in a document. Term frequency refers to the frequency of a term in a document; a term that is used more frequently is probably important in that document. Document frequency refers to the number of documents in the corpus in which the term is found. A term that is found in the great majority of documents has little distinguishing value; a high document frequency, therefore, lowers the relevance of the term. Thus the tf-idf value of words like igi\[eye\]n or šag\[heart\]n is expected to be low in any document, even where they appear very frequently, because they appear in almost every document. On the other hand, the tf-idf value of nir\[stone\]n babbardil\[~stone\]n in UET 6/3 561 is going to be fairly low, too, because the word appears only once in that text. A high value results from a high frequency within a document combined with occurrence in only a few documents.
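
To make the interplay of term frequency and document frequency concrete, here is a minimal sketch on a toy corpus. The documents and their contents are hypothetical, and the formula is the plain textbook variant (raw count times log(N / df)); `TfidfVectorizer()` in `sklearn` by default uses a smoothed idf and normalizes each document vector, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of lemmas.
docs = {
    "doc_a": ["igi[eye]n", "igi[eye]n", "lugal[king]n"],
    "doc_b": ["igi[eye]n", "nir[stone]n", "nir[stone]n"],
    "doc_c": ["igi[eye]n", "e[house]n"],
}
n_docs = len(docs)

# Document frequency: the number of documents in which each term occurs.
df = Counter(term for tokens in docs.values() for term in set(tokens))

def tfidf(term, tokens):
    """Textbook tf-idf: raw term count times log(N / df)."""
    tf = tokens.count(term)
    idf = math.log(n_docs / df[term])
    return tf * idf

# One dictionary of term weights per document.
scores = {
    doc_id: {term: round(tfidf(term, tokens), 3) for term in sorted(set(tokens))}
    for doc_id, tokens in docs.items()
}
for doc_id, weights in scores.items():
    print(doc_id, weights)
```

In this sketch the term that occurs in every document receives a weight of zero however often it is used, while a term that is repeated within a single document receives the highest weight.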

We can compute tf-idf values for the literary corpus with `TfidfVectorizer()` from the `sklearn` package. `TfidfVectorizer()` is a close relative of `CountVectorizer()` and works in essentially the same way.

> Note:
This is probably not a good idea. It is too complex, with computing mean tf-idf scores, adding up means, and taking the mean of means: not a good way to introduce tf-idf. It is probably better to explore in more detail the kinds of words that are matched.

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning) # this suppresses a warning about pandas from tqdm
import pandas as pd
from ipywidgets import interact
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from tqdm.auto import tqdm
tqdm.pandas() # initiate pandas support in tqdm, allowing progress_apply() and progress_map()
import zipfile
import json

In [None]:
with open('output/lit_lex_vocab.txt', 'r', encoding = 'utf8') as l:
    lit_lex_vocab = l.read().splitlines()
lit_lex_vocab = [v.replace('_', ' ') for v in lit_lex_vocab]

In [None]:
lit_lines = pd.read_pickle('output/litlines.p')
lit_comp = lit_lines.groupby(['id_text']).agg({'lemma' : ' '.join}).reset_index()
lit_comp['id_text'] = lit_comp['id_text'].str[-7:]
tv = TfidfVectorizer(tokenizer = lambda x: x.split(), preprocessor = lambda x: x, ngram_range = (1,5), vocabulary =lit_lex_vocab)
dtm = tv.fit_transform(lit_comp['lemma'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
lit_df = pd.DataFrame(dtm.toarray(), columns= tv.get_feature_names_out(), index=lit_comp["id_text"])
lit_df

### Mean tf-idf weight
We can compute the mean tf-idf value of each word by dividing the sum of each column (the total tf-idf weight for a word or expression) by the number of rows in the DTM. This results in an array with one value for each column in the DTM.

In [None]:
mean = lit_df[lit_lex_vocab].values.mean(axis=0)
mean

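The array `mean` represents the mean weight of each word or expression in the literary corpus. We can convert the lexical DTM to boolean values (match or no match) and multiply it by that array, so that every matched entry carries its mean literary tf-idf weight.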
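
The weighting steps in this section can be sketched on toy data without `pandas`. All names and numbers below are hypothetical; the sketch only mirrors the logic of taking column means, reducing the lexical counts to match or no match, and summing the mean weights of the matched entries per lexical document.

```python
# Hypothetical tf-idf matrix for a literary corpus:
# rows are documents, columns follow the order of `vocab`.
vocab = ["igi[eye]n", "lugal[king]n", "nir[stone]n"]
lit_tfidf = [
    [0.10, 0.20, 0.00],
    [0.30, 0.00, 0.00],
    [0.20, 0.10, 0.90],
]

# Mean weight per term: column sum divided by the number of rows.
n_rows = len(lit_tfidf)
mean_weight = [sum(column) / n_rows for column in zip(*lit_tfidf)]

# Hypothetical lexical DTM with raw counts per term, same column order.
lex_counts = {
    "lex_1": [3, 0, 1],
    "lex_2": [1, 2, 0],
}

# Any non-zero count is treated as a match; a match contributes the
# mean literary weight of its term, and the contributions are summed.
weighted = {
    doc_id: sum(w for count, w in zip(counts, mean_weight) if count > 0)
    for doc_id, counts in lex_counts.items()
}
print(weighted)
```

Because the counts are reduced to match or no match, a lexical document gains nothing from repeating a term; only the set of matched terms determines the `weighted` score.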

In [None]:
lit_lex_tfidf = lex_comp_dtm.copy() # work on a copy so that lex_comp_dtm is not modified in place
lit_lex_tfidf[lit_lex_vocab] = lit_lex_tfidf[lit_lex_vocab].astype(bool).mul(mean, axis = 1)

In [None]:
lit_lex_tfidf['weighted'] = lit_lex_tfidf[lit_lex_vocab].sum(axis=1)

In [None]:
lit_lex_tfidf

In [None]:
lex2 = pd.merge(cat_df, lit_lex_tfidf[['id_text', 'weighted', 'length', 'n_matches']], on = 'id_text', how = 'inner')
#lex2 = pd.merge(lex2, lex[['length', 'n_matches', 'id_text']], on = 'id_text', how = 'inner')

Instead of dividing by length, look at the mean value of `weighted`:
```python
lex2['norm'] = lex2['weighted'] / lex2.astype(bool).sum(axis = 1)
```

In [None]:
#lex2['norm'] = lex2['weighted'] / lex2['n_matches']
lex2['tfidfnorm'] = lex2['weighted'] / lex2['length']
lex2.loc[lex2.length >= 250].sort_values(by = 'tfidfnorm', ascending = False)

### Explore
Explore the results in an interactive table.

In [None]:
anchor = '<a href="http://oracc.org/dcclt/{}" target="_blank">{}</a>'
lex3 = lex2.copy()
lex3['id_text'] = [anchor.format(val,val) for val in lex2['id_text']]
lex3['PQ'] = ['Composite' if i[0] == 'Q' else 'Exemplar' for i in lex2['id_text']]

In [None]:
@interact(sort_by = lex3.columns, rows = (1, len(lex2), 1), min_length = (0,500,5), show = ["Exemplars", "Composites", "All"])
def sort_df(sort_by = "tfidfnorm", ascending = False, rows = 25, min_length = 250, show = 'All'):
    if show != 'All':
        l = lex3.loc[lex3['PQ'] == show[:-1]]
    else:
        l = lex3
    l = l.drop('PQ', axis = 1)
    l = l.loc[l.length >= min_length].sort_values(by = sort_by, ascending = ascending).reset_index(drop=True)[:rows].style
    return l