# 3.2 Overlap in Lexical and Literary Vocabulary: Digging Deeper

In order to research the relationship between lexical and literary material in more detail we first organize the [ETCSL](http://etcsl.orinst.ox.ac.uk) corpus in a Document Term Matrix. A Document Term Matrix is a table in which each row is a document (in our case: a literary composition) and each column represents a lemma. Each cell indicates how many times the lemma appears in this particular document.
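As a minimal sketch of what such a table looks like, the following builds a tiny Document Term Matrix by hand, without `scikit-learn`. The two documents and their lemmas are invented for illustration only; the real matrix is built further down from the ETCSL data.

```python
from collections import Counter

# Toy corpus: each document is already a list of lemmas (invented for illustration).
docs = {
    "doc_a": ["lugal[king]n", "e[house]n", "lugal[king]n"],
    "doc_b": ["e[house]n", "udu[sheep]n"],
}

# The columns are the sorted set of all lemmas in the corpus.
vocab = sorted({lemma for lemmas in docs.values() for lemma in lemmas})

# One row per document, one count per vocabulary item.
dtm = {}
for name, lemmas in docs.items():
    counts = Counter(lemmas)
    dtm[name] = [counts[lemma] for lemma in vocab]

print(vocab)
for name, row in dtm.items():
    print(name, row)
```

Each row has one cell per column, and most cells in a realistic matrix are zero.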

### 3.2.0 Preparation

First import the necessary libraries. If you are running this notebook in Jupyter Lab you will need to install the Jupyter Lab ipywidgets extension (see Introduction, section 1.2.2.1). 

The [LexicalRichness](https://pypi.org/project/lexicalrichness/) package by Lucas Shen has been adapted for the present purposes in order to circumvent preprocessing and tokenization. The adapted version, named `lexicalrichness_v`, can be imported from the `utils` directory. The usage information on the [LexicalRichness](https://pypi.org/project/lexicalrichness/) website is valid for `lexicalrichness_v` with the following exceptions:
- the option `use_TextBlob` in `LexicalRichness()` is removed
- the option `use_tokenizer` in `LexicalRichness()` is added; the default is `use_tokenizer = False`.

If `use_tokenizer = False` (default) the function expects a list as input; no tokenizing or preprocessing is performed. If `use_tokenizer = True` the function expects a string, which is preprocessed and tokenized (default behaviour in the original package).

In [None]:
import pandas as pd
from ipywidgets import interact
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import MWETokenizer
from IPython.display import Markdown, display
from tqdm.auto import tqdm
tqdm.pandas() # initiate pandas support in tqdm, allowing progress_apply() and progress_map()
import os
import sys
util_dir = os.path.abspath('../utils')
sys.path.append(util_dir)
from lexicalrichness_v import LexicalRichness as lr

Open the files `etcsllines.p` and `lexlines.p` which were produced in the previous notebook. These files contain the pickled versions of the DataFrames `etcsl_lines` and `lex_lines` in which the literary ([ETCSL](http://etcsl.orinst.ox.ac.uk)) and lexical corpora ([DCCLT](http://oracc.org/dcclt)) are represented in line-by-line format.

In [None]:
etcsl_lines = pd.read_pickle('output/etcsllines.p')
lex_lines = pd.read_pickle('output/lexlines.p')

Remove unlemmatized words from the column `lemma_mwe`. Each entry in `lemma_mwe` is a list of lemmatized words and expressions.

In [None]:
etcsl_lines['lemma_mwe'] = etcsl_lines.progress_apply(lambda x: [lemma for lemma in x['lemma_mwe'] 
                                if not '[na]na' in lemma],
                                axis = 1)

#### 3.2.0.1 Literary: By Composition
For the literary corpus we can take the line-by-line representation that was prepared in the previous notebook and transform that into a composition-by-composition representation. The DataFrame `etcsl_lines` includes the column `lemma_mwe` in which each line is represented as a list of lemmas and/or Multiple Word Expressions (lemmas connected by underscores). The `pandas` `groupby()` function is used to group on `id_text` and `text_name`. The aggregate function for the `lemma_mwe` column in this case is simply `sum`: all the lists (representing lines) are added up to form one long list of lemmas representing one composition.

In [None]:
etcsl_comp = etcsl_lines.groupby(
    [etcsl_lines["id_text"], etcsl_lines["text_name"]]).aggregate(
    {"lemma_mwe": sum}).reset_index()
etcsl_comp[25:35]

The result is a DataFrame with three columns: `id_text`, `text_name`, and `lemma_mwe`. Each row represents a literary composition from the [ETCSL](http://etcsl.orinst.ox.ac.uk) corpus. Each cell in the column `lemma_mwe` contains a list with all the lemmas of one composition (with MWEs connected by underscores).

#### 3.2.0.2 Lexical: Extract Vocabulary
The column `lemma` in the DataFrame `lex_lines`, which was created in the previous notebook, represents each line in each Old Babylonian lexical text as a tuple of lemmas. Thus, the line **udu niga** in the [Old Babylonian list of Animals](http://oracc.org/dcclt/Q000001) is represented as `(udu[sheep]n, niga[fattened]v/i)`. In order to get the fullest representation of the lexical vocabulary, we will create the entry `udu[sheep]n_niga[fattened]v/i` as well as the entries `udu[sheep]n` and `niga[fattened]v/i`. First, add a column `lemma_mwe` to the `lex_lines` DataFrame, connecting all lemmas in a lexical entry by underscores, and extract all unique lexical entries into a set. Second, flatten the list of tuples `lex_lines['lemma']` and extract all unique lemmas into a second set. The union of these two sets contains all individual lemmas as well as all Multiple Word Entries. This set is turned into a sorted list for use in `CountVectorizer()`. Entries and lemmas that contain `[na]na` are either broken or unlemmatized for some other reason and are not admitted to the list.

In [None]:
lex_lines["lemma_mwe"] = ["_".join(entry) for entry in lex_lines["lemma"]]
lex_vocab_a = {lemma for lemma in lex_lines["lemma_mwe"] if not '[na]na' in lemma}
lex_vocab_b = {item for t in lex_lines['lemma'] for item in t if not '[na]na' in item} 
lex_vocab = lex_vocab_a.union(lex_vocab_b)
lex_vocab = list(lex_vocab) # CountVectorizer needs the vocabulary as a list
lex_vocab.sort()
lex_vocab[:10]
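The effect of the two sets and their union can be checked on a toy input. The sketch below uses the **udu niga** entry discussed above plus one invented entry containing an unlemmatized word, to show that such items are filtered out; the variable names are for illustration only.

```python
# Two toy lexical entries, each a tuple of lemmas.
# The second entry is invented to show how unlemmatized words are filtered.
entries = [
    ("udu[sheep]n", "niga[fattened]v/i"),
    ("udu[sheep]n", "x[na]na"),
]

# Set 1: whole entries joined by underscores, skipping entries with an unlemmatized word.
joined = {"_".join(entry) for entry in entries}
vocab_entries = {entry for entry in joined if "[na]na" not in entry}

# Set 2: the individual lemmas, again skipping unlemmatized words.
vocab_lemmas = {lemma for entry in entries for lemma in entry if "[na]na" not in lemma}

# The union holds single lemmas as well as multiple word entries.
vocab = sorted(vocab_entries | vocab_lemmas)
print(vocab)
```

The result contains `niga[fattened]v/i`, `udu[sheep]n`, and `udu[sheep]n_niga[fattened]v/i`, but nothing derived from the unlemmatized word.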

# Some thoughts

* Step 1. Measure length of lemma_mwe in etcsl_comp and remove rows with len < 200.
* Step 2. Create DTM (see below) of etcsl_comp, binary = True and vocabulary = lemma_mwe from lex (use lex_lines)
* Step 3. Order compositions by highest match
* Step 4. Normalize for text length (from Step 1)
* Step 5. Same process for individual lex texts (which has highest match for Ura 4?)
* Step 6. TF-IDF

In future iteration: do *not* select among lexical texts - let the script figure out which lex compositions are most relevant.

Perhaps: make DTM first - show that DTM.shape gives same numbers for lex vocabulary as second Venn diagram above. Remove all columns where sum == 0. Show that DTM.shape now gives total of overlap as in Venn diagram above. Then remove rows <= minimum. Tricky!

### 3.2.1 Basic Statistics of the [ETCSL](http://etcsl.orinst.ox.ac.uk) Corpus
In computing the relationship between lexical and literary vocabulary, text length plays a big role. A long text will likely have more overlap with the lexical vocabulary than a very short one. The [ETCSL](http://etcsl.orinst.ox.ac.uk) corpus includes compositions that are known only from a fragmentary *incipit*, as well as compositions that are more than a thousand lines long.

In order to meaningfully compare these compositions we will first eliminate all texts that have fewer than 200 lemmas and/or MWEs. Second, we will collect data on text length and lexical variation (how many unique lemmas are used in this text?). Dividing lexical variation by text length provides the "Type to Token Ratio" or TTR. 

TTR is generally considered to be a poor measurement for lexical richness. Short texts have higher TTR values than long texts, because longer texts will by necessity use the same words over and over again, and function words such as "the" or "in" will be repeated many times whatever the lexical ingenuity of the author. A better measurement is called MTLD or Measure of Textual Lexical Diversity ([McCarthy and Jarvis 2010](https://link.springer.com/article/10.3758/BRM.42.2.381)). The MTLD value is calculated as the mean number of words in a text that will bring TTR from 1 (at the first word in the text) down to a threshold value (default is 0.720). In practice that means that a text is cut into many small units, each with approximately the same TTR, which eliminates the effect of text length. This is a promising approach that may well work for Sumerian, and a Python module that includes MTLD is available ([lexicalrichness](https://pypi.org/project/lexicalrichness/)). Its usage here, however, is experimental and preliminary.

The threshold value is based on the observation that, when going through a text sequentially, the TTR will drop drastically as soon as the first repeated words are encountered. At some point in the text the TTR will stabilize and drop only very gradually later on. That point is approximated by the default threshold value of 0.720. It seems likely, however, that a valid threshold value is language dependent and that a language with very few function words, such as the literary register of Sumerian, might need a lower value.
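The description of MTLD above can be made concrete with a small sketch. This is a simplified reading of the procedure (count a "factor" each time the running TTR falls to the threshold, add a partial factor for the remainder, and average a forward and a backward pass); it is not the code of the `lexicalrichness` package and may differ from it in detail. The function names and the toy token lists are assumptions made for illustration.

```python
def ttr(tokens):
    """Type-token ratio: unique tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens)


def mtld_one_pass(tokens, threshold=0.72):
    """One directional pass: text length divided by the number of factors."""
    factors = 0.0
    types, count = set(), 0
    for token in tokens:
        count += 1
        types.add(token)
        if len(types) / count <= threshold:
            # The running TTR has dropped to the threshold: close this factor.
            factors += 1
            types, count = set(), 0
    if count > 0:
        # Partial factor for the leftover segment, scaled by how far its TTR fell.
        factors += (1 - len(types) / count) / (1 - threshold)
    if factors == 0:
        # No repetition at all: the measure is undefined; this sketch reports infinity.
        return float("inf")
    return len(tokens) / factors


def mtld(tokens, threshold=0.72):
    """Mean of a forward and a backward pass over the token list."""
    forward = mtld_one_pass(tokens, threshold)
    backward = mtld_one_pass(tokens[::-1], threshold)
    return (forward + backward) / 2


repetitive = ["a", "a", "a", "a"]
varied = ["a", "b", "c", "d", "a", "b", "c", "d"]
print(ttr(repetitive), mtld(repetitive))
print(ttr(varied), mtld(varied))
```

Lowering `threshold` makes each factor longer, which is the adjustment one might experiment with for a language with few function words.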

In [None]:
minimum = 200
etcsl_comp["length"] = [len(lemmas) for lemmas in etcsl_comp["lemma_mwe"]]
etcsl_comp["lex_var"] = [len(set(lemmas)) for lemmas in etcsl_comp["lemma_mwe"]]
etcsl_comp["ttr"] = [len(set(lemmas))/len(lemmas) for lemmas in etcsl_comp["lemma_mwe"]]
etcsl_comp['mtld'] = etcsl_comp['lemma_mwe'].progress_apply(lambda x: lr(x).mtld())
etcsl_comp = etcsl_comp.loc[etcsl_comp.length >= minimum]
etcsl_comp[25:35]

### 3.2.2 Document Term Matrix

The corpus is transformed into a Document Term Matrix (or DTM), a table in which each column represents a word (or expression) that appears in a lexical text and each row represents a Sumerian composition. Each cell is a number, 0 or 1, indicating whether or not that word appears in a particular composition. This is a binary DTM; non-binary DTMs give the number of times a word appears in a composition.

Since DTMs are very commonly used in computational text analysis, it is worth spending a bit more time on various ways in which they can be created for cuneiform data. The function `CountVectorizer()` (from the `sklearn` package) is a very flexible tool with many possible parameters. How `CountVectorizer()` and its counterpart `TfidfVectorizer()` are used depends on the structure of the input data. The most common use case is a corpus of raw documents (probably in English), each of them consisting of a text string that needs to be pre-processed and tokenized before anything else can be done. Default pre-processing includes, for instance, lowercasing the entire text. Default tokenizers assume that the text is in a modern (western) language and take spaces and punctuation marks as word dividers. Cuneiform data, whether in transliteration, lemmatization, or normalization, is much simpler than most modern language texts, because the only type of word boundary is a space (or a sequence of spaces). When using `CountVectorizer()` on transliterated, lemmatized, or normalized text we can use the parameter `token_pattern = r'[^ ]+'`, meaning "any sequence of characters, except space."
```python
cv = CountVectorizer(token_pattern= r'[^ ]+')
```
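To see why the pattern matters, the two regular expressions can be compared directly with the standard `re` module. The first pattern is the default `token_pattern` of `CountVectorizer()`; the sample line reuses the lemmas of the **udu niga** entry from section 3.2.0.2.

```python
import re

line = "udu[sheep]n niga[fattened]v/i"

# Default token_pattern of CountVectorizer: runs of two or more word characters.
default_tokens = re.findall(r"(?u)\b\w\w+\b", line)

# Space-delimited pattern: any run of characters that are not spaces.
space_tokens = re.findall(r"[^ ]+", line)

print(default_tokens)  # ['udu', 'sheep', 'niga', 'fattened']
print(space_tokens)    # ['udu[sheep]n', 'niga[fattened]v/i']
```

The default pattern breaks each lemma apart at the brackets and drops the one-letter part-of-speech tags, whereas the space-delimited pattern keeps each lemma intact.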
A second situation is where we want to use data that is already in a list format (that is, already preprocessed and tokenized). This is uncommon in general, but very common for cuneiform data: all the [ORACC](http://oracc.org) and [ETCSL](http://etcsl.orinst.ox.ac.uk) data fall into that category. Rather than transforming the tokenized text back into raw strings and then tokenizing those strings again, we can use the parameters `tokenizer` and `preprocessor` to handle this situation. Each of these parameters takes a function as its value; the tokenizer function should return a list of tokens. If our input already is a list of tokens we can pass a dummy function, one that simply returns the list it receives.
```python
def dummy(l):
    return(l)
cv = CountVectorizer(tokenizer=dummy, preprocessor=dummy)
```
This will prevent `CountVectorizer()` from using a default tokenizer and preprocessor (which do not accept list input) and it saves the trouble of untokenizing and then tokenizing again (see the [blog post](http://www.davidsbatista.net/blog/2018/02/28/TfidfVectorizer/) on this subject by David Batista). Instead of defining a `dummy()` function we can achieve the same effect with a lambda function (see the code below).

Finally, we can choose to use the `MWETokenizer()` discussed above (section ###). The `MWETokenizer()` expects a tokenized text (a list) and re-tokenizes that text by using a list of pre-defined Multiple Word Expressions, returning a new list. In case we use the original [ETCSL](http://etcsl.orinst.ox.ac.uk) data, in which the MWEs have not yet been marked, we can do the CountVectorizing and marking the MWEs in one go, as follows:
```python
def dummy(l):
    return(l)
tokenizer = MWETokenizer(lex_mwe) # initialize the tokenizer with the lexical MWEs
cv = CountVectorizer(tokenizer=tokenizer.tokenize, preprocessor=dummy)
```
For our current purposes the best approach is to use a dummy tokenizer and preprocessor. The disadvantage of using the `MWETokenizer()` on entire texts is that it will not honor line boundaries. See, for instance, Gilgameš and Huwawa 50-51 (text and translation: [ETCSL](http://etcsl.orinst.ox.ac.uk/cgi-bin/etcsl.cgi?text=t.1.8.1.5&display=Crit&charenc=gcirc#)):

> ama tuku ama-a-ni-še₃
> nitah saŋ-dili ŋe₂₆-e-gin₇ ak a₂-ŋu₁₀-še₃ hu-mu-un-ak
> "Let him who has a mother go to his mother! 
> Let bachelor males, types like me, join me at my side!"

This will result in the Multiple Word Expression `ama[mother]n_nita[male]n`, an expression found in the list of human beings Lu ([OB Nippur Lu](http://oracc.org/dcclt/Q000047.351)), which is clearly not applicable here. The number of such errors is fairly small (about 6 for a corpus of almost 400 texts). For other types of texts, where line boundaries are less significant, this method may well be an efficient way of doing things.
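The line-boundary problem can be reproduced with a small hand-written stand-in for the tokenizer. The helper `mark_mwes()` below is hypothetical and handles only two-word expressions; it is not the NLTK `MWETokenizer()`. The two lemma lists are a heavily simplified rendering of the lines quoted above, used only to show the mechanism.

```python
def mark_mwes(tokens, mwes):
    """Greedy left-to-right merge of two-word expressions (illustrative helper)."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in mwes:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


mwes = {("ama[mother]n", "nita[male]n")}

# Simplified lemma lists: the first line ends with "mother", the next begins with "male".
lines = [
    ["ama[mother]n", "tuku[acquire]v/t", "ama[mother]n"],
    ["nita[male]n", "ak[do]v/t"],
]

# Marking line by line respects the boundary; marking the whole text does not.
per_line = [token for line in lines for token in mark_mwes(line, mwes)]
whole_text = mark_mwes([token for line in lines for token in line], mwes)

print(per_line)
print(whole_text)
```

Only the whole-text version contains the spurious `ama[mother]n_nita[male]n`, which is why the MWEs in this notebook were marked line by line in the previous notebook.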

The CountVectorizer is now applied to the corpus and the result is transformed into a new Pandas DataFrame.

In [None]:
cv = CountVectorizer(tokenizer=lambda x: x, preprocessor=lambda x: x, vocabulary=lex_vocab, binary=True)

dtm = cv.fit_transform(etcsl_comp['lemma_mwe'])
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
etcsl_df = pd.DataFrame(dtm.toarray(), columns=cv.get_feature_names_out(), index=etcsl_comp["id_text"])
etcsl_df

The resulting DataFrame `etcsl_df` has a row for each *literary* composition (excluding those with fewer than 200 lemmas) and a column for every lemma/expression in the *lexical* corpus. As we have seen in the previous notebook, many of these words/expressions do not appear in the [ETCSL](http://etcsl.orinst.ox.ac.uk) corpus, and thus all cells in that column are 0. Conversely, there are many words in the literary corpus that do not appear in lexical texts, and those words are not represented at all in this DTM. This DTM, therefore, should only be used to research *overlap* between the two (literary and lexical) vocabularies.

# Number of Lexical/Literary Matches per Literary Composition
Since the DTM was built with the option `binary = True` the sum of each row equals the number of unique words/expressions that the composition shares with the lexical corpus. The code in the cell below may be simplified as:
```python
etcsl_df["n_matches"] = etcsl_df.sum(axis=1)
```
which will yield exactly the same result. The extra elements in the code are added for two reasons. First, if we add additional columns to the DataFrame, for instance composition names, the code will fail unless we add the option `numeric_only = True`. Second, if the (simplified) code is run twice, even with the option `numeric_only=True` the column `n_matches` will become part of the summation and the result in the new `n_matches` column will be twice the correct outcome. By explicitly stating that only the columns named after the lemmas in `lex_vocab` should be used such accidents are avoided.

In [None]:
etcsl_df["n_matches"] = etcsl_df[lex_vocab].sum(axis=1, numeric_only=True)
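The row sum of a binary DTM is equivalent to the size of the intersection between the set of lemmas in a composition and the lexical vocabulary. The sketch below shows that equivalence on invented data, using plain Python sets; the names `lex_vocab_toy` and `compositions` are assumptions for illustration only.

```python
# Toy lexical vocabulary and two toy compositions (invented for illustration).
lex_vocab_toy = {"udu[sheep]n", "e[house]n", "udu[sheep]n_niga[fattened]v/i"}
compositions = {
    "comp_1": ["udu[sheep]n", "udu[sheep]n", "lugal[king]n", "e[house]n"],
    "comp_2": ["lugal[king]n", "lugal[king]n"],
}

# Repeated lemmas count once, as in a DTM built with binary=True.
n_matches = {
    name: len(set(lemmas) & lex_vocab_toy)
    for name, lemmas in compositions.items()
}
print(n_matches)
```

Here `comp_1` shares two unique lemmas with the vocabulary, even though `udu[sheep]n` occurs twice, and `comp_2` shares none.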

Add columns from `etcsl_comp` by using merge on `id_text`. The merge method is "inner," which means that only those rows that exist in both DataFrames will end up in the new DataFrame. Because the short compositions were already removed from `etcsl_comp` before the DTM was built, both DataFrames cover the same compositions here; the inner merge simply guarantees that no unmatched row can slip into the result.

In [None]:
etcsl_df2 = pd.merge(etcsl_comp[["id_text", "text_name", "length", "mtld", "ttr", "lex_var"]], etcsl_df["n_matches"], on="id_text", how="inner")

In [None]:
etcsl_df2 = etcsl_df2.sort_values(by = "n_matches", na_position="first", ascending=False)
etcsl_df2.head()

# Discussion
The Gudea Cylinders and Lugal-e (or Ninurta's Exploits) have the highest number of matches (707 and 677) with the Old Babylonian lexical corpus in [DCCLT](http://oracc.org/dcclt). But those are also the two longest compositions in the corpus. We can normalize by dividing the total number of matches by the number of unique lemmas in the text (`norm`).

In [None]:
etcsl_df2["norm"] = etcsl_df2["n_matches"] / etcsl_df2["lex_var"]
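A small worked example shows how this normalization can change the ranking. The two rows and all their numbers below are invented for illustration; they are not taken from the ETCSL results.

```python
# Invented figures: a long text with many matches and a short text with fewer.
rows = [
    {"text": "long_text", "n_matches": 700, "lex_var": 2000},
    {"text": "short_text", "n_matches": 150, "lex_var": 300},
]

# Share of each text's unique lemmas that also occur in the lexical vocabulary.
for row in rows:
    row["norm"] = row["n_matches"] / row["lex_var"]

by_matches = sorted(rows, key=lambda r: r["n_matches"], reverse=True)
by_norm = sorted(rows, key=lambda r: r["norm"], reverse=True)

print([r["text"] for r in by_matches])
print([(r["text"], r["norm"]) for r in by_norm])
```

The long text leads on raw matches, but the short text ranks first once the count is divided by lexical variation (0.5 against 0.35).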

# Exploring the Results
The following code displays the result in an interactive table that may be sorted (ascending or descending) in different ways for further exploration. The column `id_text` provides links to the editions in [ETCSL](http://etcsl.orinst.ox.ac.uk).

In [None]:
anchor = '<a href="http://etcsl.orinst.ox.ac.uk/cgi-bin/etcsl.cgi?text={}&display=Crit&charenc=gcirc#" target="_blank">{}</a>'
etcsl = etcsl_df2.copy()
etcsl['id_text'] = [anchor.format(val,val) for val in etcsl['id_text']]

In [None]:
@interact(col = etcsl.columns, rows = (1, len(etcsl), 1))
def sort_df(col = "norm", ascending = False, rows = 10):
    return etcsl.sort_values(by = col, ascending = ascending).reset_index(drop=True)[:rows].style

# Creating Some Viz
Provisional. Mainly as example. Save the figures by opening an Output View (right click on output) and then right click on that Output View, select Save As.

In [None]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
counts, bins, patches = ax.hist(etcsl.mtld, bins=3, edgecolor='k')
ax.set_xticks(bins)
plt.ylabel('No. of Compositions')
plt.xlabel('MTLD')
plt.show()

Alternative. Much simpler - but does not return the bins and the counts.

In [None]:
etcsl.mtld.hist(bins = 3);

In [None]:
etcsl.plot.scatter(x = 'length', y = 'ttr', figsize = (10, 5));

# For Creating Output Only
The following code is used to create MarkDown tables from Pandas DataFrames. The tables can be included in the Compass Markdown files.

In [None]:
from tabulate import tabulate

In [None]:
etcsl_tab = etcsl_df2.copy()
markdown = "[{}](http://etcsl.orinst.ox.ac.uk/cgi-bin/etcsl.cgi?text={}&display=Crit&charenc=gcirc#)"
etcsl_tab['id_text'] = [markdown.format(val,val) for val in etcsl_df2['id_text']]
etcsl_tab = etcsl_tab.round({'ttr' : 3, 'norm': 3, 'mtld' : 3})

In [None]:
rows = 10 # number of rows to be exported
col = 'mtld' # column by which to sort
asc = True
tab = tabulate(etcsl_tab.sort_values(by=col, ascending=asc)[:rows],
         headers= etcsl_tab.columns , tablefmt="github", showindex=False)
with open('output/etcsl_tab.txt', 'w', encoding='utf8') as w:
    w.write(tab)

# Testing

In [None]:
plt.figure(figsize=(20,10))
for id in etcsl_comp['id_text']:
    c = etcsl_comp.loc[etcsl_comp['id_text'] == id, 'lemma_mwe']
    c = c.iloc[0]

    ttr_l = []
    enum = range(1, len(c))
    for ind in enum:
        t = c[:ind]
        ttr = lr(t).ttr
        ttr_l.append(ttr)
    plt.plot(enum, ttr_l)
plt.show()

In [None]:
import matplotlib.pyplot as plt

In [None]:
etcsl = etcsl.sort_values(by = 'ttr', ascending = False)
plt.figure(figsize=(20,10))
plt.scatter(range(len(etcsl)), etcsl['ttr'])

In [None]:
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(10)

plt.plot(x, x)
plt.plot(x, 2 * x)
plt.plot(x, 3 * x)
plt.plot(x, 4 * x)
plt.show()

In [None]:
etcsl.mtld.median()

In [None]:
etcsl.loc[(etcsl.length > 300) & (etcsl.length < 400)].sort_values(by = 'ttr')

In [None]:
tetrad = ['c.2.5.8.1', 'c.2.5.5.2', 'c.2.5.3.2', 'c.4.16.1']
#etcsl_lines.loc[etcsl_lines['id_text'].isin(tetrad), 'id_text'] = 'c.tetrad'
#etcsl_lines.loc[etcsl_lines['id_text'] == 'c.tetrad', 'text_name'] = 'tetrad' 

In [None]:
etcsl_df2.loc[etcsl_df2.id_text.isin(tetrad)]

In [None]:
alltext

In [None]:
len(alltext)

In [None]:
for i in tetrad[1: ]: 
    etcsl_comp.loc[etcsl_comp.id_text == 'c.2.5.8.1', 'lemma_mwe'] = etcsl_comp.loc[etcsl_comp.id_text == 'c.2.5.8.1', 'lemma_mwe'] + etcsl_comp.loc[etcsl_comp.id_text == i, 'lemma_mwe']

In [None]:
etcsl_comp.loc[etcsl_comp.id_text == 'c.2.5.8.1', 'lemma_mwe']

In [None]:
P_sz = etcsl_comp.loc[etcsl_comp.id_text == 'c.3.1.19', 'lemma_mwe']

In [None]:
P_sz

In [None]:
P_sz = list(P_sz)

In [None]:
P_sz = P_sz[0]

In [None]:
P_sz_n = [word for word in P_sz if not word in lex_vocab]

In [None]:
P_sz_n

In [None]:
etcsl

In [None]:
etcsl.plot.scatter(x = 'length', y= 'ttr');

In [None]:
etcsl.loc[etcsl.text_name.str.contains('Nanše')]

In [None]:
etcsl.ttr.describe()

In [None]:
import numpy as np    
hist = np.histogram([1, 1, 2, 2, 2, 2, 3], bins = 3)

In [None]:
hist, b_e = np.histogram(etcsl['mtld'], bins = 3)

In [None]:
hist, b_e

In [None]:
etcsl.mtld.hist(bins = 3)

In [None]:
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import norm  
df = etcsl_df2

df.mtld.plot(kind='hist', density=True)

# Use an explicit array of x values; the builtin `range` cannot be passed to norm.pdf()
xs = np.arange(0, 330, 1)
plt.plot(xs, norm.pdf(xs, 0, 1))

In [None]:
Proper_N = ['AN', 'CN', 'DN', 'EN', 'FN', 'GN', 
            'LN', 'MN', 'ON', 'PN', 'QN', 'RN', 'SN', 'TN', 'WN', 'YN']

In [None]:
file = "../2_2_Data_Acquisition_ETCSL/Output/alltexts.csv"
etcsl_words = pd.read_csv(file, keep_default_na=False)
etcsl_words = etcsl_words.loc[etcsl_words["lang"].str.contains("sux")]  # throw out non-Sumerian words

In [None]:
etcsl_words = etcsl_words.loc[~etcsl_words.pos.isin(Proper_N)]

In [None]:
etcsl_words["lemma"] = etcsl_words.progress_apply(lambda r: (r["cf"] + '[' + r["gw"] + ']' + r["pos"]) 
                            if r["cf"] != '' else r['form'] + '[NA]NA', axis=1)
etcsl_words['lemma'] = [lemma if not lemma == '[NA]NA' else '' for lemma in etcsl_words['lemma'] ] 
# kick out empty forms
etcsl_words["lemma"] = etcsl_words["lemma"].str.lower()