
# 3.3 Looking at the Lexical Vocabulary from the Perspective of the Literary Material

In section 3.2 we asked whether we can see differences between Old Babylonian literary compositions in their usage of vocabulary (lemmas and MWEs) attested in the lexical corpus. In this notebook we will change perspective and ask: are there particular lexical texts (or groups of lexical texts) that show a greater engagement with literary vocabulary than others?

In large part, this notebook uses the same techniques and the same code as section 3.2, and the reader is referred there for further explanation. In some respects, however, the process is different. In particular, we will use various options of `CountVectorizer()` and the related class `TfidfVectorizer()` to understand the relationship between the lexical and the literary vocabulary in more detail.

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning) # this suppresses a warning about pandas from tqdm
import pandas as pd
from ipywidgets import interact
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from tqdm.auto import tqdm
tqdm.pandas() # initiate pandas support in tqdm, allowing progress_apply() and progress_map()
import zipfile
import json
import numpy as np

Open the file `lexlines.p` which was produced in [3_1_Lit_Lex_Vocab.ipynb](./3_1_Lit_Lex_Vocab.ipynb). The file contains the pickled version of the DataFrame `lex_lines` in which the lexical ([dcclt](http://oracc.org/dcclt)) corpus is represented in line-by-line format.

The field `id_text` is represented as `dcclt/P227743` or `dcclt/signlists/Q000057`. In practice, we only need the last seven characters: the P, Q, or X number.

In [None]:
lex_lines = pd.read_pickle('output/lexlines.p')
lex_lines['id_text'] = lex_lines['id_text'].str[-7:]

### Special Case: OB Nippur Ura 6
The sixth chapter of the Old Babylonian Nippur version of the thematic list Ura deals with foodstuffs and drinks. This chapter was not standardized (each exemplar has its own order of items and sections) and therefore no composite text has been created in [DCCLT](http://oracc.org/dcclt). Instead, the "composite" of [OB Nippur Ura 6](http://oracc.org/dcclt/Q000043) consists of the concatenation of all known Nippur exemplars of the list of foodstuffs. In our current dataframe, therefore, there are no lines where the field `id_text` equals "Q000043".

We create a "composite" by changing the field `id_text` in all exemplars of [OB Nippur Ura 6](http://oracc.org/dcclt/Q000043) to "Q000043". 

In [None]:
Ura6 = ["P227657",
"P227743",
"P227791",
"P227799",
"P227925",
"P227927",
"P227958",
"P227967",
"P227979",
"P228005",
"P228008",
"P228200",
"P228359",
"P228368",
"P228488",
"P228553",
"P228562",
"P228663",
"P228726",
"P228831",
"P228928",
"P229015",
"P229093",
"P229119",
"P229304",
"P229332",
"P229350",
"P229351",
"P229352",
"P229353",
"P229354",
"P229356",
"P229357",
"P229358",
"P229359",
"P229360",
"P229361",
"P229362",
"P229365",
"P229366",
"P229367",
"P229890",
"P229925",
"P230066",
"P230208",
"P230230",
"P230530",
"P230586",
"P231095",
"P231128",
"P231424",
"P231446",
"P231453",
"P231458",
"P231742",
"P266520"]
lex_lines.loc[lex_lines["id_text"].isin(Ura6), "id_text"] = "Q000043"

# Computing text length
In order to evaluate the number of matches between a lexical text and the literary corpus we need a measure of text length. Text length is defined here as the number of lemmatized words in a text.

First the lines of `lex_lines` are aggregated to lexical compositions in the DataFrame `lex_comp`.

In [None]:
lex_comp = lex_lines.groupby(
    [lex_lines["id_text"]]).aggregate(
    {"lemma": ' '.join}).reset_index()

The function `lex_length()` computes the number of lemmas in each composition by first splitting the field `lemma` into individual lemmas. A list comprehension removes all unlemmatized words (marked `[na]na`), and the length of the resulting list is returned.

In [None]:
def lex_length(lemmas):
    lemmas = lemmas.split()
    lemmas = [lemma for lemma in lemmas if '[na]na' not in lemma] # remove unlemmatized words
    length = len(lemmas)
    return length
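A quick sanity check of `lex_length()` on a made-up lemma string may look as follows. The lemmas below are illustrative and not taken from the corpus; the function is repeated so that the sketch runs on its own.

```python
def lex_length(lemmas):
    lemmas = lemmas.split()
    lemmas = [lemma for lemma in lemmas if '[na]na' not in lemma]  # drop unlemmatized words
    return len(lemmas)

# Hypothetical line: two lemmatized words and one unlemmatized word (x[na]na).
toy_line = "lugal[king]n x[na]na e[house]n"
toy_length = lex_length(toy_line)

# A line consisting only of unlemmatized words has length 0.
empty_length = lex_length("x[na]na x[na]na")
print(toy_length, empty_length)
```

Texts for which the function returns 0 are the ones that are dropped from `lex_comp`.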

First add the new field `length` by calling the function `lex_length()` for every row.

The DataFrame `lex_comp` has data from all Old Babylonian lexical texts currently in [dcclt](http://oracc.org/dcclt). Not all of these texts are lemmatized. In particular, documents that have been linked to a composite text are usually not lemmatized. Such documents have no lemmatized contents and therefore have length 0. These documents are removed from `lex_comp`.

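Finally, the rows are sorted by length and duplicate values of `id_text` are dropped, so that only the longest version of each text is kept.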

In [None]:
lex_comp['length'] = lex_comp['lemma'].progress_map(lex_length)
lex_comp = lex_comp.loc[lex_comp['length'] > 0] # remove compositions that have no lemmatized content
lex_comp = lex_comp.sort_values(by = 'length', ascending=False)
lex_comp = lex_comp.drop_duplicates(subset = 'id_text', keep = 'first')

# Open list of Vocabulary Intersection

In [None]:
with open('output/lit_lex_vocab.txt', 'r', encoding = 'utf8') as l:
    lit_lex_vocab = l.read().splitlines()
lit_lex_vocab = [v.replace('_', ' ') for v in lit_lex_vocab]
lit_lex_vocab[:25]

# DTM
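We return to the line-by-line DataFrame `lex_lines` so that n-grams do not cross line boundaries: each line is vectorized separately, and the counts are summed per text afterwards.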

In [None]:
cv = CountVectorizer(preprocessor = lambda x: x, tokenizer = lambda x: x.split(), vocabulary = lit_lex_vocab, ngram_range=(1, 5))
dtm = cv.fit_transform(lex_lines['lemma'])
lex_lines_dtm = pd.DataFrame(dtm.toarray(), columns= cv.get_feature_names_out(), index=lex_lines["id_text"]) # get_feature_names() was removed in scikit-learn 1.2
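The reason for vectorizing line by line can be illustrated with a minimal sketch. The two toy lines and the toy vocabulary are invented for illustration; only the `CountVectorizer()` settings mirror the cell above (with `ngram_range` reduced to bigrams).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical vocabulary: one single lemma and one two-word expression.
toy_vocab = ["lugal[king]n", "lugal[king]n e[house]n"]
# Hypothetical lines: the first ends in lugal, the second begins with e.
toy_lines = ["gal[big]v/i lugal[king]n", "e[house]n gibil[new]v/i"]

def make_cv():
    # Same settings as in the notebook, with a smaller n-gram range.
    return CountVectorizer(preprocessor=lambda x: x,
                           tokenizer=lambda x: x.split(),
                           vocabulary=toy_vocab,
                           ngram_range=(1, 2))

# Vectorize each line separately, then sum the counts.
by_line = make_cv().fit_transform(toy_lines).toarray().sum(axis=0)

# Vectorize the concatenated text: a bigram now spans the line break.
joined = make_cv().fit_transform([" ".join(toy_lines)]).toarray().sum(axis=0)

print(by_line, joined)
```

In the concatenated version the expression `lugal[king]n e[house]n` is counted once, although it does not occur within any single line. Working from `lex_lines` avoids such spurious matches.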

In [None]:
lex_lines_dtm

In [None]:
lex_comp_dtm = lex_lines_dtm.groupby('id_text').agg(sum).reset_index()
vocab = lex_comp_dtm.columns[1:]

# Remove duplicates and empty rows

In [None]:
#lex_comp_dtm['id_text'] = [i[-7:] for i in lex_comp_dtm['id_text']]
#lex_comp_dtm['length'] = lex_comp_dtm.sum(axis=1)
#lex_comp_dtm = lex_comp_dtm.sort_values(by = 'length', ascending = False)
#lex_comp_dtm = lex_comp_dtm.drop_duplicates(subset = 'id_text', keep = 'first')
#lex_comp_dtm = lex_comp_dtm[lex_comp_dtm.length > 0]

In [None]:
lex_comp_dtm["n_matches"] = lex_comp_dtm[vocab].astype(bool).sum(axis = 1, numeric_only=True)
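Note that `n_matches` counts the number of distinct vocabulary items that are attested in a text, not the total number of occurrences. A small sketch with an invented count matrix shows the difference; the column and index names are placeholders.

```python
import pandas as pd

# Hypothetical counts: rows are texts, columns are vocabulary items.
toy = pd.DataFrame({"a": [3, 0], "b": [1, 0], "c": [0, 2]},
                   index=["T1", "T2"])

# Converting to bool turns every non-zero count into True,
# so the row sum counts distinct items.
distinct = toy.astype(bool).sum(axis=1)

# The plain row sum counts every occurrence.
total = toy.sum(axis=1)

print(distinct.tolist(), total.tolist())
```

Text `T1` has four occurrences but only two distinct matches; it is the latter number that is stored in `n_matches`.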

In [None]:
# Get the metadata. 
cat = {}
for proj in ['dcclt', 'dcclt/signlists', 'dcclt/nineveh', 'dcclt/ebla']:
    f = proj.replace('/', '-')
    file = f"jsonzip/{f}.zip" # The ZIP file was downloaded in notebook 3_1
    z = zipfile.ZipFile(file) 
    st = z.read(f"{proj}/catalogue.json").decode("utf-8")
    j = (json.loads(st))
    cat.update(j["members"])
cat_df = pd.DataFrame(cat).T
cat_df["id_text"] = cat_df["id_text"].fillna(cat_df["id_composite"])
cat_df = cat_df.fillna('')
cat_df = cat_df[["id_text", "designation", "subgenre"]]

In [None]:
lex = pd.merge(cat_df, lex_comp_dtm[['n_matches', 'id_text']], on = 'id_text', how = 'inner')
lex = pd.merge(lex, lex_comp[['length', 'id_text']], on = 'id_text', how = 'inner')

In [None]:
lex['norm'] = lex['n_matches'] / lex['length']
lex = lex.sort_values(by = 'norm', ascending = False)
lex.loc[lex.length > 250]

In [None]:
anchor = '<a href="http://oracc.org/dcclt/{}" target="_blank">{}</a>' # HTML attributes are separated by spaces, not commas
lex2 = lex.copy()
lex2['id_text'] = [anchor.format(val,val) for val in lex['id_text']]

In [None]:
@interact(sort_by = lex2.columns, rows = (1, len(lex2), 1), min_length = (1,500,5))
def sort_df(sort_by = "norm", ascending = False, rows = 25, min_length = 250):
    return lex2.loc[lex2.length >= min_length].sort_values(by = sort_by, ascending = ascending).reset_index(drop=True)[:rows].style

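Next step: use TF-IDF to weight the shared vocabulary by its importance in the literary corpus.

Note: as above, n-grams are built first; `TfidfVectorizer()` is then restricted to the shared vocabulary through its `vocabulary` parameter.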

In [None]:
lit_lines = pd.read_pickle('output/litlines.p')
lit_comp2 = lit_lines.groupby(['id_text']).agg({'lemma' : ' '.join}).reset_index()
lit_comp2['id_text'] = [i[-7:] for i in lit_comp2['id_text']]
tv = TfidfVectorizer(token_pattern = r'[^ ]+', ngram_range = (1,5), vocabulary = vocab)
dtm = tv.fit_transform(lit_comp2['lemma'])
lit_df = pd.DataFrame(dtm.toarray(), columns= tv.get_feature_names_out(), index=lit_comp2["id_text"]) # get_feature_names() was removed in scikit-learn 1.2
lit_df
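What `TfidfVectorizer()` does with a fixed vocabulary can be seen in a minimal sketch. The three toy documents and the vocabulary are invented; the vectorizer settings follow the cell above, with `ngram_range` reduced to bigrams and the default scikit-learn weighting (smoothed IDF, L2 normalization).

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical vocabulary and documents (already lowercased, as in the notebook).
toy_vocab = ["lugal[king]n", "e[house]n", "lugal[king]n e[house]n"]
toy_docs = ["lugal[king]n e[house]n",
            "lugal[king]n gal[big]v/i",
            "lugal[king]n kur[mountain]n"]

tv_toy = TfidfVectorizer(token_pattern=r'[^ ]+', ngram_range=(1, 2),
                         vocabulary=toy_vocab)
weights = pd.DataFrame(tv_toy.fit_transform(toy_docs).toarray(),
                       columns=toy_vocab)
print(weights.round(3))
```

In the first document the lemma that occurs in every document (`lugal[king]n`) receives a lower weight than the lemma that occurs only there (`e[house]n`). N-grams outside the vocabulary, such as `gal[big]v/i`, are ignored.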

In [None]:
mean = lit_df[vocab].sum(axis=0) / lit_df[vocab].astype(bool).sum(axis=0) #total weights by total hits
mean = mean.array

In [None]:
lit_lex_tfidf = lex_comp_dtm.copy()
lit_lex_tfidf[vocab] = lit_lex_tfidf[vocab].mul(mean, axis = 1)

In [None]:
lit_lex_tfidf

In [None]:
lit_lex_tfidf['weighted'] = lit_lex_tfidf[vocab].sum(axis=1, numeric_only = True)
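The weighting in the preceding cells can be traced on a toy example. All numbers below are invented. The sketch assumes, as in the cells above, that each vocabulary item receives the mean of its non-zero TF-IDF weights in the literary corpus, and that this mean is multiplied by the item's count in each lexical text.

```python
import pandas as pd

# Hypothetical TF-IDF weights: rows are literary texts, columns are vocabulary items.
lit_toy = pd.DataFrame({"a": [0.2, 0.4, 0.0], "b": [0.0, 0.0, 0.9]})

# Mean weight per item, counting only the texts in which the item occurs.
mean_toy = (lit_toy.sum(axis=0) / lit_toy.astype(bool).sum(axis=0)).to_numpy()

# Hypothetical counts of the same items in two lexical texts.
lex_toy = pd.DataFrame({"a": [2, 0], "b": [1, 1]}, index=["L1", "L2"])

# Multiply each column by its mean weight and sum per text.
weighted = lex_toy.mul(mean_toy, axis=1).sum(axis=1)
print(weighted)
```

Here `L1` scores 2 × 0.3 + 1 × 0.9 = 1.5 and `L2` scores 0.9. Because the counts are multiplied, a lexical text that repeats an item many times accumulates weight for each repetition.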

In [None]:
lit_lex_tfidf

In [None]:
lit_lex_tfidf.shape

In [None]:
lex2 = pd.merge(cat_df, lit_lex_tfidf[['weighted', 'id_text']], on = 'id_text', how = 'inner')
lex2 = pd.merge(lex2, lex[['length', 'n_matches', 'id_text']], on = 'id_text', how = 'inner')

An alternative to dividing by length is to look at the mean weight per matching vocabulary item:
```python
lex2['norm'] = lex2['weighted'] / lex2['n_matches']
```

In [None]:
#lex2['norm'] = lex2['weighted'] / lex2['n_matches']
lex2['norm'] = lex2['weighted'] / lex2['length']
lex2.sort_values(by = 'norm', ascending = False)

In [None]:
anchor = '<a href="http://oracc.org/dcclt/{}" target="_blank">{}</a>' # HTML attributes are separated by spaces, not commas
lex3 = lex2.copy()
lex3['id_text'] = [anchor.format(val,val) for val in lex2['id_text']]

In [None]:
@interact(sort_by = lex3.columns, rows = (1, len(lex3), 1), min_length = (1,500,5))
def sort_df(sort_by = "weighted", ascending = False, rows = 25, min_length = 200):
    return lex3.loc[lex3.length >= min_length].sort_values(by = sort_by, ascending = ascending).reset_index(drop=True)[:rows].style