# 3.2 Overlap in Lexical and Literary Vocabulary: Digging Deeper

> This alternative to section 3.2 uses CountVectorizer() with ngram_range. It does not work, because the LexicalRichness package does not use ngrams, with the result that the number of matches (between lexical and literary vocab) as determined by CountVectorizer() may be larger than the number of unique words (types) as determined by LexicalRichness. For instance, the word sequence amar\[young\]n ga\[milk\]n gu\[eat\]v/t will result in 5 matches in the ngram approach, but there are only 3 unique words. This is problematic because it prevents proper normalizing.

In order to research the relationship between lexical and literary vocabularies in more detail we will look at individual literary texts. Which compositions have more and which have less overlap with the lexical vocabulary?

In 3.1 we used Multiple Word Expressions, connecting words that are found in a lexical entry by underscores (using `MWETokenizer()` from the nltk module). The lemmas and MWEs were visualized in Venn diagrams to illustrate the intersection between lexical and literary vocabulary.

In this notebook we will use the ngram option of the `CountVectorizer()` function in order to find sequences of lemmas that are shared between lexical and literary texts. An ngram is a contiguous sequence of *n* words (or lemmas).

Longer texts will have more vocabulary items (and Multiple Word Expressions) in common with the lexical corpus than shorter texts, but that does not mean much. For that reason we will look at measures of *lexical richness* and ask: are compositions that use a richer lexicon more likely to utilize lemmas found in the lexical corpus than compositions with a lower lexical richness rank?

### 3.2.0 Preparation

This notebook uses some files that were downloaded or produced in [3_1_Lit_Lex_Vocab.ipynb](./3_1_Lit_Lex_Vocab.ipynb). Run that notebook first, before this one. 

First import the necessary libraries. If you are running this notebook in Jupyter Lab you will need to install the Jupyter Lab ipywidgets extension (see [Introduction](../1_Preliminaries/1_Introduction.md), section 1.2.2.1). 

> The [Lexicalrichness](https://pypi.org/project/lexicalrichness/) package by Lucas Shen has been slightly adapted for the present purposes. The package expects a data set in the English language in a raw text format that must be pre-processed (removal of punctuation, digits, etc.) and tokenized (cut up into individual words). These steps do not work well for the present data set. The adapted version, named `lexicalrichness_v`, is imported from the `utils` directory. The usage information on the [Lexicalrichness](https://pypi.org/project/lexicalrichness/) website is valid for `lexicalrichness_v` with the following exceptions:
> - the option use_TextBlob in LexicalRichness() is removed
> - the option use_tokenizer in LexicalRichness is added; default is use_tokenizer = False.

> If `use_tokenizer = False` (default) the main function expects a list as input; no tokenizing or preprocessing is performed. If `use_tokenizer = True` the function expects a string, which is preprocessed and tokenized (default behaviour in the original package).

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning) # this suppresses a warning about pandas from tqdm
import pandas as pd
from ipywidgets import interact
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter
import zipfile
import json
from tqdm.auto import tqdm
tqdm.pandas() # initiate pandas support in tqdm, allowing progress_apply() and progress_map()
import os
import sys
util_dir = os.path.abspath('../utils')
sys.path.append(util_dir)
from lexicalrichness_v import LexicalRichness as lr

Open the file `litlines.p` which was produced in [3_1_Lit_Lex_Vocab.ipynb](./3_1_Lit_Lex_Vocab.ipynb). The file contains the pickled version of the DataFrame `lit_lines` in which the literary ([epsd2/literary](http://oracc.org/epsd2/literary)) corpus is represented in line-by-line format.

In [None]:
lit_lines = pd.read_pickle('output/litlines.p')

#### 3.2.0.1 Literary: By Composition
The line-by-line representation that was prepared in the previous notebook will be transformed into a composition-by-composition representation. The DataFrame `lit_lines` includes the column `lemma` in which each line is represented as a sequence of lemmas. The `pandas` `groupby()` function is used here to group on `id_text`. The aggregate function for the `lemma` column in this case is simply `' '.join`: all the entries (representing lines) are concatenated to form one long sequence of lemmas in a single string representing one composition.

The field `id_text` in the resulting DataFrame has the form 'epsd2/literary/P254863'. In fact, we only need the last 7 characters of that string (the P, Q, or X number of the text), because all texts derive from the same project. We can simplify the `id_text` string with a list comprehension.

In [None]:
lit_lines['id_text'] = [text_id[-7:] for text_id in lit_lines["id_text"]] # keep only the P, Q, or X number
lit_comp = lit_lines.groupby("id_text").aggregate(
    {"lemma": ' '.join}).reset_index()
lit_comp[25:35]

The result is a DataFrame with two columns: `id_text`, and `lemma`. Each row represents a literary composition from the [epsd2/literary](http://oracc.org/epsd2/literary) corpus. Each cell in the column `lemma` contains a sequence of lemmas of one composition.
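The grouping step can be tried out on a small scale. The following is a minimal sketch on a toy DataFrame; the text IDs and lemma strings are invented for illustration and do not come from the corpus.

```python
import pandas as pd

# Toy stand-in for `lit_lines`; IDs and lemmas are invented for illustration.
toy_lines = pd.DataFrame({
    "id_text": ["epsd2/literary/P000001",
                "epsd2/literary/P000001",
                "epsd2/literary/Q000002"],
    "lemma": ["lugal[king]n e[house]n",
              "du[build]v/t",
              "amar[young]n ga[milk]n"],
})

# Keep only the last 7 characters (the P, Q, or X number).
toy_lines["id_text"] = toy_lines["id_text"].str[-7:]

# Concatenate all lines of one composition into a single string of lemmas.
toy_comp = toy_lines.groupby("id_text").aggregate({"lemma": " ".join}).reset_index()
print(toy_comp)
```

The two lines of the first toy text end up in a single row, in their original order, separated by a space.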

#### 3.2.0.2 Lexical Richness Statistics for the Literary Corpus

In the following we will compute the number of vocabulary matches between each text in the literary corpus and the full lexical corpus. In order to interpret the number of matches properly, we will first compute a number of basic text measures, such as text length, type-token ratio, etc. In all the measures discussed below only the words that are properly lemmatized are counted.

| Measurement       | Definition                                  |
|-------------------|---------------------------------------------|
| Text Length       | Number of lemmatized words                  |
| Lexical Variation | Number of unique lemmas                     |
| Type Token Ratio  | Lexical Variation divided by Text Length    |
| MTLD              | (see below)                                 |

Lexical richness measures the variation in vocabulary usage. Texts that use a relatively low number of unique lexemes (repeating the same words all over the place) receive a low lexical richness score. Texts that use the lexicon more ingeniously, using synonyms or circumlocutions to refer to the same concept, receive a higher lexical richness score. Lexical richness is used, among other things, to identify texts written by language learners, to assess aphasia, or to measure the difficulty of a text. For Sumerian literature, we may expect that compositions with high levels of repetition, such as certain hymns and narratives, end up with a low lexical richness score, whereas disputation texts or other compositions that actively explore the lexicon may have a higher score.

The most straightforward lexical richness score is the Type Token Ratio (or TTR), which simply divides the number of unique lexemes by the total number of lexemes. This is a fine measurement for comparing texts of (approximately) equal length, but it does not work well for a corpus with texts of very different lengths, as is the case here. Short texts have higher TTR values than long texts, because longer texts will by necessity use the same words over and over again and (in English) function words such as "the" or "in" will be repeated many times whatever the lexical ingenuity of the author.

A better measurement is called MTLD or Measure of Textual Lexical Diversity ([McCarthy and Jarvis 2010](https://doi.org/10.3758/BRM.42.2.381)). The MTLD value is calculated as the mean number of words in a text that will bring TTR from 1 (at the first word in the text) down to a threshold value (default is 0.720). In practice that means that a text is cut into many small units, each with approximately the same TTR, eliminating the effect of text length. This is a promising approach that may well work for Sumerian, and a Python module that includes MTLD is available ([lexicalrichness](https://pypi.org/project/lexicalrichness/)). Its usage here, however, is experimental and preliminary.

The threshold value is based on the observation that, when going through a text sequentially, the TTR will drop drastically as soon as the first repeated words are encountered. At some point in the text the TTR will stabilize and drop only very gradually later on. That point is approximated by the default TTR threshold value of 0.720. It seems likely, however, that a valid threshold value is language dependent and that a language with very few function words, such as the literary register of Sumerian, might need a higher value (function words drag the TTR down). On the other hand, a corpus of texts with very substantial repetition (occasionally repetition of entire lengthy passages) may well require a lower threshold.
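To make the two measures more concrete, here is a simplified sketch of TTR and MTLD in plain Python. It follows the general procedure described above (count how often the running TTR falls to the threshold, add a partial factor for the remainder, average a forward and a backward pass), but it is an illustration only and not the code of the `lexicalrichness` package. The function names and the toy token lists are assumptions made for this example, and empty input is not handled.

```python
def ttr(tokens):
    """Type Token Ratio: number of unique tokens divided by number of tokens."""
    return len(set(tokens)) / len(tokens)

def mtld_one_pass(tokens, threshold=0.72):
    """Mean stretch length needed to bring the running TTR down to `threshold`."""
    factors = 0.0
    types, count = set(), 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count <= threshold:
            factors += 1              # one full factor completed
            types, count = set(), 0   # start a new stretch
    if count > 0:
        # partial factor for the leftover stretch at the end of the text
        factors += (1 - len(types) / count) / (1 - threshold)
    if factors == 0:
        # TTR never left 1 (no repetition at all): treat the text as one factor
        factors = 1
    return len(tokens) / factors

def mtld(tokens, threshold=0.72):
    """Average of a forward and a backward pass."""
    return (mtld_one_pass(tokens, threshold)
            + mtld_one_pass(tokens[::-1], threshold)) / 2

# Invented toy "texts" of equal length (20 tokens each).
repetitive = ["lugal", "e"] * 10
varied = ["w{}".format(i) for i in range(20)]

print(ttr(repetitive), mtld(repetitive))
print(ttr(varied), mtld(varied))
```

The repetitive toy text reaches the threshold after every third word and therefore receives a much lower MTLD than the varied one. Lowering or raising `threshold` changes where the stretches are cut, which is why the choice of threshold matters for a corpus like the present one.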

In [None]:
def lit_stats(lemmas):
    lemmas = lemmas.split()
    lemmas = [lemma for lemma in lemmas if '[na]na' not in lemma] # remove unlemmatized words
    lex = lr(lemmas) # `lemmas` is a list, so no tokenizing or preprocessing takes place
    length = lex.words # number of lemmatized words
    lex_var = lex.terms # number of unique lemmas
    if length > 0:  # prevent division by 0
        ttr = lex.ttr
        mtld = lex.mtld()
    else:
        ttr = 0
        mtld = 0
    return ' '.join(lemmas), length, lex_var, ttr, mtld

In [None]:
lit_comp['lemma'], lit_comp['length'], lit_comp['lex_var'], lit_comp['ttr'], lit_comp['mtld'] = \
    zip(*lit_comp['lemma'].progress_map(lit_stats))
lit_comp = lit_comp.loc[lit_comp['length'] > 0] # remove compositions that have no lemmatized content

We may get a first glimpse of the results by inspecting the basic descriptive statistics. For this, we ignore texts shorter than 50 lemmas, because measures like TTR and MTLD become rather meaningless for very short compositions. It appears that MTLD varies from 9.0 all the way up to 538.1, with a mean value of 67.6. That means that there is a text that, on average, needs only 9 words (two or three lines) to push the TTR under 0.720 - meaning a lot of repeated words (or repeated phrases) all over the place.

In [None]:
lit_comp.loc[lit_comp.length > 50].describe()

### 3.2.1 Document Term Matrix

The literary corpus is transformed into a Document Term Matrix (or DTM), a table in which each column represents a lemma and each row represents a Sumerian composition. Each cell contains a number indicating the frequency of that word in a particular composition.

Since we are interested in the usage of lexical vocabulary in literary texts, we may skip all words that are not available in the lexical corpus. We can do that by defining a vocabulary, derived from the data produced in the previous notebook ([3_1_Lit_Lex_Vocab.ipynb](./3_1_Lit_Lex_Vocab.ipynb)).

The file `output/lex_vocab.txt` includes all lemmas and all lexical entries available in the lexical corpus (excluding entries that contain unlemmatized words). The (multiple word) lexical entries are represented as Multiple Word Expressions, connected by underscores. In this notebook, however, we will not work with MWEs (which are made into a unit) but with ngrams (sequences of words). For that reason we need to replace all underscores by spaces. We sort the vocabulary so that the columns of the DTM will appear in alphabetical order. Finally, we determine the longest lexical entry (in number of words) in order to establish the maximum *n* when dividing the literary corpus into ngrams.

In [None]:
with open('output/lex_vocab.txt', 'r', encoding = 'utf8') as r:
    lex_vocab = r.read().splitlines()
lex_vocab = [v.replace('_', ' ') for v in lex_vocab]
lex_vocab.sort()
length = [len(v.split()) for v in lex_vocab]
m = max(length)

The function `CountVectorizer()` (from the `sklearn` package) is a very flexible tool with many possible parameters. The most common use case is a corpus of raw documents (probably in English), each of them consisting of a text string that needs to be pre-processed and tokenized (turned into a list of words or lemmas) before anything else can be done. Default pre-processing includes, for instance, lowercasing the entire text (so that thursday, Thursday, and THURSDAY will all be recognized as the same lemma) and removal of punctuation and numbers. Default tokenizers assume that the text is in a modern (western) language, taking spaces, hyphens, and punctuation marks as word dividers. The structure of the [ORACC](http://oracc.org) data is much simpler than that. Pre-processing is unnecessary, and tokenization should split the string *only* at blank spaces.

This can be achieved by defining custom tokenizer and preprocessor functions and telling `CountVectorizer()` to use these. The custom tokenizer consists of the standard Python function `split()`; the preprocessor function does nothing at all.
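The difference between the default and the custom settings can be inspected with `build_analyzer()`, which returns the function that `CountVectorizer()` applies to each document. The sketch below uses one lemmatized line as input; the variable names are chosen for this example only, and `token_pattern=None` is added merely to silence the warning that the default pattern is unused when a custom tokenizer is supplied.

```python
from sklearn.feature_extraction.text import CountVectorizer

line = "Inana[1]dn lugal[king]n e[house]n du[build]v/t"

# Default settings: lowercasing plus a regular-expression tokenizer
# designed for modern languages; brackets and slashes act as dividers.
default_analyzer = CountVectorizer().build_analyzer()
print(default_analyzer(line))

# Custom settings: no preprocessing, split at blank spaces only.
custom_analyzer = CountVectorizer(
    tokenizer=lambda x: x.split(),
    preprocessor=lambda x: x,
    token_pattern=None,
).build_analyzer()
print(custom_analyzer(line))
```

With the default settings the lemmas are broken up at the brackets (separating, for instance, the citation form from the guide word), whereas the custom settings leave each lemma intact.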

The parameter `ngram_range` is set to `(1, m)` where `m` represents the maximum length of a lexical entry, computed above. `CountVectorizer()` will create a column for each word (unigram; n=1), but also for each sequence of two words (bigram; n=2), or three words (trigram; n=3), etc. The sentence **inana-ra lugal-e e₂-a-ni mu-un-du₃** ("The king built her temple for Inana") lemmatized as Inana\[1\]dn lugal\[king\]n e\[house\]n du\[build\]v/t, will be represented as:

| type             | representation  |
|------------------|-----------------|
| unigram        | Inana\[1\]dn |
|                    | lugal\[king\]n |
|                    | e\[house\]n |
|                    | du\[build\]v/t |
| bigram             | Inana\[1\]dn lugal\[king\]n |
|                    | lugal\[king\]n e\[house\]n |
|                    | e\[house\]n du\[build\]v/t |
| trigram | Inana\[1\]dn lugal\[king\]n e\[house\]n |
|                     | lugal\[king\]n e\[house\]n du\[build\]v/t |
| 4-gram            | Inana\[1\]dn lugal\[king\]n e\[house\]n du\[build\]v/t |

This four-word sentence thus results in 10 columns in the Document Term Matrix. We will use `CountVectorizer()` on the representation of literary texts in *lines* so that the ngrams do not extend beyond the end of a line. Afterwards, lines are combined into compositions.
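The count of 10 columns can be checked with a small experiment. This is a sketch that runs `CountVectorizer()` on the single example sentence, without a predefined vocabulary; the names `cv_demo` and `dtm_demo` are used for this illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

line = "Inana[1]dn lugal[king]n e[house]n du[build]v/t"

cv_demo = CountVectorizer(
    tokenizer=lambda x: x.split(),   # split at blank spaces only
    preprocessor=lambda x: x,        # no preprocessing
    token_pattern=None,              # unused because a tokenizer is given
    ngram_range=(1, 4),              # unigrams up to 4-grams
)
dtm_demo = cv_demo.fit_transform([line])

# `vocabulary_` maps each ngram to its column number; list shortest ngrams first.
for ngram in sorted(cv_demo.vocabulary_, key=lambda g: (len(g.split()), g)):
    print(ngram)
print(dtm_demo.shape)
```

Four unigrams, three bigrams, two trigrams, and one 4-gram add up to the 10 columns listed in the table.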

The parameter `vocabulary` is set to the variable `lex_vocab`, which includes all lemmas and lexical entries in the lexical corpus. `CountVectorizer()` will skip all words and ngrams that do not appear in the lexical texts.

`CountVectorizer()` stores the results in a sparse matrix which notes the position and value of non-zero entries. In transforming the output to a Pandas DataFrame, we need to use the `toarray()` method, which will transform the sparse matrix into a regular matrix.
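The effect of the `vocabulary` parameter and the conversion of the sparse matrix can be illustrated with a toy example. The two lines and the four-item vocabulary below are invented for this sketch. When the vocabulary is passed as a list, the columns of the matrix follow the order of that list, so the same list can serve as column names.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy data: two "lines" and a small "lexical vocabulary".
toy_lines = ["amar[young]n ga[milk]n gu[eat]v/t", "lugal[king]n e[house]n"]
toy_vocab = sorted(["amar[young]n", "amar[young]n ga[milk]n",
                    "e[house]n", "ur[dog]n"])

cv_toy = CountVectorizer(
    tokenizer=lambda x: x.split(),
    preprocessor=lambda x: x,
    token_pattern=None,
    ngram_range=(1, 2),
    vocabulary=toy_vocab,        # only these items become columns
)
dtm_toy = cv_toy.fit_transform(toy_lines)   # a scipy sparse matrix
df_toy = pd.DataFrame(dtm_toy.toarray(), columns=toy_vocab)
print(df_toy)
```

Words that are absent from the vocabulary (here `lugal[king]n`) receive no column, while a vocabulary item without a match in the lines (here `ur[dog]n`) yields a column of zeros.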

In [None]:
cv = CountVectorizer(tokenizer=lambda x: x.split(), preprocessor=lambda x: x,
                     ngram_range=(1, m), vocabulary=lex_vocab)
dtm = cv.fit_transform(lit_lines['lemma'])
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
df = pd.DataFrame(dtm.toarray(), columns=cv.get_feature_names_out(), index=lit_lines["id_text"])
df

> #### Note
> It may be more efficient to remove zero-columns from the sparse matrix before converting it to a DataFrame. This is possible but tricky.
>
> 1. Remove columns with only zeros.
> ```python
> import numpy as np
> nonzeros = np.unique(dtm.nonzero()[1])
> dtm = dtm[:, nonzeros]
> ```
> 2. Select the right column names.
> ```python
> col = cv.get_feature_names_out()
> col = [col[i] for i in nonzeros]
> ```
> 3. Create the DataFrame.
> ```python
> df = pd.DataFrame(dtm.toarray(), columns=col, index=lit_lines["id_text"])
> ```
> The `nonzero()` method returns the coordinates (row, column) of all non-zero elements in the sparse matrix. `dtm.nonzero()[1]`, therefore, returns the column numbers of all non-zero entries in the sparse matrix. Since there may well be more than one non-zero element in a column, the `unique()` function is used. The variable `nonzeros` now holds the index numbers of all columns that include at least one non-zero element. That list is used, first, to select the non-zero columns from the dtm, and, second, to select the relevant column names with a list comprehension.

#### Remove Empty Columns
Lemmas and entries in `lex_vocab` (the lexical vocabulary) that have no match in the literary corpus are represented in the DataFrame by a column filled with zeros; such columns are removed.

In [None]:
df = df.loc[: , df.sum(axis=0) != 0].copy()
df.shape

#### Combine Lines into Compositions

In [None]:
df = df.groupby('id_text').sum() # add up the line-by-line counts for each composition
df.shape

One may notice that the shape of the DataFrame suggests that the outcome here is slightly different from what we saw in 3.1. The number of columns represents the number of shared entries (lemmas, ngrams) between the lexical and the literary corpus. The intersection in the Venn diagram in 3.1 had a slightly lower number.

The difference is a difference in approach: Multiple Word Expressions vs. ngrams. In 3.1 (and 3.2) we used MWEs, connecting words with underscores. In the MWE approach the literary line 'amar\[young\]n_ga\[milk\]n_gu\[eat\]v/t' matches the lexical entry 'amar\[young\]n_ga\[milk\]n_gu\[eat\]v/t' but not 'amar\[young\]n_ga\[milk\]n'. In the ngram approach this will yield matches for unigrams, bigrams and a trigram, namely: 'amar\[young\]n', 'ga\[milk\]n', 'gu\[eat\]v/t', 'amar\[young\]n ga\[milk\]n', and 'amar\[young\]n ga\[milk\]n gu\[eat\]v/t' (the bigram 'ga\[milk\]n gu\[eat\]v/t' is not a lexical entry and will therefore not result in a match).

While this may seem like a big difference, in outcome the differences between these two approaches are minor because most literary ngrams do not result in a match with the lexical vocabulary.
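The ngram matching described above can be reproduced in a few lines of plain Python. This is a sketch under the assumption that the lexical vocabulary contains exactly the five items named in the example; the helper `ngrams()` is written for this illustration and is not part of any package used in this notebook.

```python
def ngrams(tokens, max_n):
    """Return all contiguous sequences of 1 to max_n tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

# Assumed lexical vocabulary, taken from the example in the text.
lexical = {
    "amar[young]n", "ga[milk]n", "gu[eat]v/t",
    "amar[young]n ga[milk]n",
    "amar[young]n ga[milk]n gu[eat]v/t",
}

line = "amar[young]n ga[milk]n gu[eat]v/t".split()
ngram_matches = [g for g in ngrams(line, 3) if g in lexical]
unique_words = set(line)

print(len(ngram_matches), "matches;", len(unique_words), "unique words")
```

The line yields 5 matches but contains only 3 unique words. A single line like this one therefore contributes more matches than types, which matters when the number of matches is later divided by the number of unique lemmas.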

In [None]:
vocab = df.columns # `vocab` holds all the vocabulary items (column names) currently in `df`
with open('output/lit_lex_vocab.txt', 'w', encoding = 'utf8') as w:
    w.write('\n'.join(vocab)) # save for use in 3.3
df

### 3.2.2 Number of Lexical/Literary Matches per Literary Composition
The sum of each row of the DTM equals the number of words/expressions that the composition shares with the lexical corpus. Instead of adding up the frequencies, however, it makes more sense to count the number of non-zero entries. We can do so with `astype(bool)`, which will yield 0 for each zero entry and 1 for each non-zero entry. 
```python
df["n_matches"] = df.astype(bool).sum(axis=1)
```
The extra elements in the code below are added for two reasons. First, once we add additional columns to the DataFrame, for instance composition names, the code will fail unless we add the option `numeric_only = True`. Second, if the (simplified) code is run twice, even with the option `numeric_only=True` the column `n_matches` will become part of the summation and the result in the new `n_matches` column will be twice the correct outcome. By explicitly stating that only the columns named after the lemmas in `vocab` should be used such accidents are avoided.
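The difference between adding up frequencies and counting non-zero entries, and the reason for restricting the count to the vocabulary columns, can be seen in a toy example. The DataFrame below is invented for illustration; `toy_vocab` plays the role of `vocab`.

```python
import pandas as pd

# Invented toy DTM: two compositions, three vocabulary items.
toy = pd.DataFrame(
    {"amar[young]n": [3, 0], "ga[milk]n": [1, 0], "e[house]n": [0, 2]},
    index=["P000001", "Q000002"],
)
toy_vocab = toy.columns   # captured before any extra column is added

toy["total_freq"] = toy[toy_vocab].sum(axis=1)               # adds up frequencies
toy["n_matches"] = toy[toy_vocab].astype(bool).sum(axis=1)   # counts non-zero entries

# Running the same line again does not change the result, because only
# the columns listed in `toy_vocab` take part in the count.
toy["n_matches"] = toy[toy_vocab].astype(bool).sum(axis=1)
print(toy)
```

The first toy text has a total frequency of 4 but only 2 distinct matches. The columns `total_freq` and `n_matches` are themselves left out of the count, so re-running the cell is harmless.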

In [None]:
df["n_matches"] = df[vocab].astype(bool).sum(axis=1, \
                                                 numeric_only=True)

In [None]:
df

#### 3.2.2.1 Adding Metadata and Lexical Richness Statistics
Above, we computed various statistics for each of the literary compositions. The catalog file for [epsd2/literary](http://oracc.org/epsd2/literary) contains further information (such as the composition name). 

In [None]:
# First get the metadata. 
file = "jsonzip/epsd2-literary.zip" # The ZIP file was downloaded in the previous notebook
z = zipfile.ZipFile(file) 
st = z.read("epsd2/literary/catalogue.json").decode("utf-8")
j = json.loads(st)
cat_df = pd.DataFrame(j["members"]).T
#The important information, giving the title of the literary text is sometimes found in 
# `designation` and sometimes in `subgenre`. Merge those two fields.
cat_df.loc[cat_df.designation.str[:13] == "CDLI Literary", "designation"] = cat_df.subgenre
# Exemplars have a P number (`id_text`), composite texts have a Q number (`id_composite`).
# Merge those two in `id_text`.
cat_df["id_text"] = cat_df["id_text"].fillna(cat_df["id_composite"])
# Keep only `id_text` and `designation`.
cat_df = cat_df[["id_text", "designation"]]

In [None]:
lit_df = pd.merge(lit_comp[["id_text", "length", "mtld", "ttr", "lex_var"]], df["n_matches"], on="id_text", how="inner")
lit_df = pd.merge(cat_df, lit_df, on = 'id_text', how = 'inner')
lit_df

Sort by the number of lexical matches.

In [None]:
lit_df = lit_df.sort_values(by = "n_matches", na_position="first", ascending=False)
lit_df.head()

#### 3.2.2.2 Normalizing
Lugal-e (or [Ninurta's Exploits](http://etcsl.orinst.ox.ac.uk/cgi-bin/etcsl.cgi?text=c.1.6.2&display=Crit&charenc=gcirc#)) has the highest number of matches (more than 700) with the Old Babylonian lexical corpus in [DCCLT](http://oracc.org/dcclt). But this is also the longest composition in the corpus. We can normalize by dividing the number of matches (`n_matches`) by the number of unique lemmas (`lex_var`) in the text; the result is stored in the column `norm`. Such numbers mean little for very short texts with just a few (lemmatized) words. In the next section we will add the possibility of excluding texts that fall under a certain minimum length.

In [None]:
lit_df["norm"] = lit_df["n_matches"] / lit_df["lex_var"]
lit_df.sort_values(by = "norm", na_position="first", ascending=False)

### 3.2.3 Exploring the Results
The following code displays the results in an interactive table that may be sorted (ascending or descending) in different ways for further exploration. By default, texts shorter than 50 lemmatized words are excluded and only the first 10 rows are displayed. One may change these numbers by moving the sliders. The column `id_text` provides links to the editions in [epsd2/literary](http://oracc.org/epsd2/literary).

In [None]:
anchor = '<a href="http://oracc.org/epsd2/literary/{}" target="_blank">{}</a>'
lit = lit_df.copy()
lit['id_text'] = [anchor.format(val,val) for val in lit['id_text']]

In [None]:
@interact(sort_by = lit.columns, rows = (1, len(lit), 1), min_length = (1,500,5))
def sort_df(sort_by = "norm", ascending = False, rows = 10, min_length = 50):
    return lit.loc[lit.length >= min_length].sort_values(by = sort_by, ascending = ascending).reset_index(drop=True)[:rows].style

### 3.2.4 Discussion
We may now come back to our initial question: is there a correlation between lexical richness of a composition and the size of the intersection of the vocabulary of that composition with the lexical vocabulary? In other words: did Old Babylonian scribes and scholars utilize the lexical corpus when they wished to broaden their vocabulary?

We will measure lexical richness with `mtld`, and for the intersection we use the `norm` variable (the number of matches divided by the number of unique lemmas).

As it turns out, a correlation between these two variables exists, but is weak:

In [None]:
min_length = 50
lit = lit.loc[lit.length > min_length]
lit['norm'].corr(lit['mtld'])

#### 3.2.4.1 Scatter Plot
We can further explore this correlation by inspecting a scatter plot of mtld and norm.

In [None]:
lit.plot.scatter(x = 'mtld', y = 'norm');

The scatter plot shows that, indeed, the text with (by far) the lowest `norm` value ([The Sumerian King List](http://etcsl.orinst.ox.ac.uk/cgi-bin/etcsl.cgi?text=c.2.1.1&display=Crit&charenc=gcirc#)), also receives one of the lowest scores on `mtld` and that texts with the highest `mtld` also score high on `norm`. What the figure also shows is that, with a few exceptions, variation in `norm` is fairly small. The great majority of compositions fall in the 0.85 - 0.95 range.

We can further illustrate that with descriptive statistics of the `norm` variable.

In [None]:
lit['norm'].describe()

The table shows that the 25%, 50%, and 75% points are all very close to each other, around 0.92. In other words: in the great majority of literary compositions, more than 90% of the lemmas and Multiple Word Expressions are attested in the lexical corpus.

#### 3.2.4.2 Histogram
The histogram of `norm` is another way to visualize the (very) skewed distribution of its values.

In [None]:
nbins = 5
column = 'norm'
fig, ax = plt.subplots()
counts, bins, patches = ax.hist(lit[column], bins = nbins)
ax.xaxis.set_major_formatter(FormatStrFormatter('%.2f')) # tick labels with two decimals
ax.set_xticks(bins)
for i in range(nbins):
    plt.text(bins[i],counts[i]/2,str(counts[i]), fontsize = 16)
plt.ylabel('No. of Compositions')
plt.xlabel(column)

### Issue
Lexical Variation (`lex_var`) is based on individual words, but `n_matches` is based on ngrams. As a result, `n_matches` can be larger than `lex_var`, in which case `norm` exceeds 1 and the normalization is no longer meaningful.