# 3 Overlap in Lexical and Literary Vocabulary

Comparing the vocabulary of Old Babylonian lexical texts and the vocabulary of the Sumerian literary corpus as represented in [ETCSL](http://etcsl.orinst.ox.ac.uk/).

## 3.1 Counting Words and Expressions

In this notebook we will simply count words and expressions in lexical texts and in the [ETCSL](http://etcsl.orinst.ox.ac.uk/) corpus and compute the amount of overlap, to be visualized in Venn diagrams.

### 3.1.0 Preparation 
In order to run this notebook, first parse the [ETCSL](http://etcsl.orinst.ox.ac.uk) data with the ETCSL Parser (2.2). Second, the code below uses the package `matplotlib_venn`, which is currently not part of the set of packages installed with Anaconda. Open the Anaconda prompt (Windows) or Terminal (Mac OS X), and type the following command:

```bash
conda install -c conda-forge matplotlib-venn
```

Note that the package is imported as matplotlib_venn but must be installed as matplotlib-venn.
Installation of a package can take quite some time, but it needs to be done only once. 

In [None]:
%matplotlib inline  
# %matplotlib inline Enables drawing of visualizations in the Notebook
import pandas as pd
import os
import sys
from tqdm.auto import tqdm
tqdm.pandas() # initiate pandas support in tqdm, allowing progress_apply() and progress_map()
from matplotlib import pyplot as plt
from matplotlib_venn import venn2
from nltk.tokenize import MWETokenizer
import zipfile
import json
util_dir = os.path.abspath('../utils')
sys.path.append(util_dir)
from utils import *

#### 3.1.0.1 Read ETCSL Data Files
Open the file `alltexts.csv` which contains all of [ETCSL](http://etcsl.orinst.ox.ac.uk) and read the data into a `Pandas` DataFrame. Each row is a word from [ETCSL](http://etcsl.orinst.ox.ac.uk/) in lemmatized format, according to [ePSD2](http://build-oracc.museum.upenn.edu/epsd2) standards. Only Sumerian words are kept; Akkadian glosses, for instance, are removed.

In [None]:
file = "../2_2_Data_Acquisition_ETCSL/Output/alltexts.csv"
etcsl_words = pd.read_csv(file, keep_default_na=False)
etcsl_words = etcsl_words.loc[etcsl_words["lang"].str.contains("sux")]  # throw out non-Sumerian words

#### 3.1.0.2 Lemmas
Create a `lemma` column and lowercase all lemmas.

The `lemma` column is created by combining Citation Form (`cf`), Guide Word (`gw`) and Part of Speech (`pos`). The Pandas `apply()` function applies a function to every row (`axis = 1`) or column (`axis = 0`) of a dataframe; the cell below uses `progress_apply()`, the `tqdm` variant that adds a progress bar. The function is defined here as a so-called `lambda` function (an anonymous, single-use function). It concatenates the strings of the `cf`, `gw`, and `pos` columns (with `[` and `]` as separators), so that a single lemma looks like `lugal[king]N`. The `lambda` function has one condition: if there is no Citation Form (column `cf` equals the empty string), the contents of the column `form` are taken, followed by `[NA]NA`. The absence of a Citation Form implies that the word was not lemmatized (perhaps an unknown or a broken word). The field `form` contains the raw transliteration, so the result may be `x-ra-bi[NA]NA`.

If the field `form` is also empty (which happens, for instance, where a horizontal ruling in the text is noted), this results in the `lemma` entry `[NA]NA`. In those cases the value of `lemma` is turned into the empty string with a conditional list comprehension.

For the current analysis we will use *lemmatized* forms for the comparison between literary and lexical vocabulary. The unlemmatized forms, therefore, are of little importance here. We need to keep them, for now, because we will also compare *sequences* of lemmas. Premature removal of unlemmatized forms would result in false positives, because two lemmas that are separated by an unlemmatized word would then appear to be adjacent.

In [None]:
etcsl_words["lemma"] = etcsl_words.progress_apply(lambda r: (r["cf"] + '[' + r["gw"] + ']' + r["pos"]) 
                            if r["cf"] != '' else r['form'] + '[NA]NA', axis=1)
etcsl_words['lemma'] = [lemma if not lemma == '[NA]NA' else '' for lemma in etcsl_words['lemma'] ] 
# kick out empty forms
etcsl_words["lemma"] = etcsl_words["lemma"].str.lower()
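
The three cases described above (lemmatized word, unlemmatized word, empty form) can be checked on a small toy DataFrame. This is only a sketch: the rows are invented for illustration, the helper `make_lemma()` is a hypothetical name that spells out the `lambda` function as a regular function, and plain `apply()` is used instead of `progress_apply()`.

```python
import pandas as pd

# Toy rows: the column names follow the notebook, the values are invented.
toy = pd.DataFrame({
    "cf":   ["lugal", "", ""],
    "gw":   ["king", "", ""],
    "pos":  ["N", "", ""],
    "form": ["lugal-e", "x-ra-bi", ""],
})

def make_lemma(row):
    """Return cf[gw]pos, or form[NA]NA when there is no Citation Form."""
    if row["cf"] != "":
        return row["cf"] + "[" + row["gw"] + "]" + row["pos"]
    return row["form"] + "[NA]NA"

toy["lemma"] = toy.apply(make_lemma, axis=1)
# An empty form yields the bare string '[NA]NA'; replace it by the empty string.
toy["lemma"] = ["" if lemma == "[NA]NA" else lemma for lemma in toy["lemma"]]
toy["lemma"] = toy["lemma"].str.lower()
print(toy["lemma"].tolist())
```

The unlemmatized form `x-ra-bi` is kept (as `x-ra-bi[na]na`), whereas the row without any form ends up with an empty lemma.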

#### 3.1.0.3 Read Lexical Data
The module `utils` in the `utils` directory of Compass includes the function `get_data()` which essentially runs the same code as the Extended ORACC Parser (2.1.3; see there for explanation of the code). Its only parameter is a string with [ORACC](http://oracc.org) project names, separated by commas. It returns a Pandas DataFrame in which each word is represented by a row.

In [None]:
projects = "dcclt, dcclt/nineveh, dcclt/signlists, dcclt/ebla"
lex_words = get_data(projects)

In [None]:
lex_words = lex_words.loc[lex_words["lang"].str.contains("sux")] # remove Akkadian glosses 

In [None]:
lex_words["lemma"] = lex_words.progress_apply(lambda r: (r["cf"] + '[' + r["gw"] + ']' + r["pos"]) 
                            if r["cf"] != '' else r['form'] + '[NA]NA', axis=1)
lex_words['lemma'] = [lemma if not lemma == '[NA]NA' else '' for lemma in lex_words['lemma'] ] 
# kick out empty forms
lex_words["lemma"] = lex_words["lemma"].str.lower()

Sign lists (which belong to the broader category of lexical lists) list cuneiform signs with pronunciation glosses and sometimes with Akkadian translation, sign name, and other information. For the current purposes we *only* need the Sumerian word that is represented by the entry. We remove entries that derive from the pronunciation glosses and the signs themselves. Sign names and Akkadian translations have already been removed, because we selected only Sumerian words in the previous cell.

The Pandas function `isin()` compares the contents of a field with a list and returns a boolean (`True` or `False`) for each row. In this case the column `field` (which is primarily used for sign lists) is compared to the list `["sg", "pr"]`. If `field` equals one of these terms `isin()` returns `True`. The `~` before the entire expression changes `True` into `False` and vice versa. As a result the dataframe `lex_words` now omits all rows that have either "sg" or "pr" in the column `field`.

In [None]:
# remove lemmas that derive from the fields "sign" 
# or "pronunciation" in sign lists.
lex_words = lex_words[~lex_words["field"].isin(["sg", "pr"])] 
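
As a minimal illustration of the `isin()` and `~` combination, the following sketch filters a toy DataFrame. The lemmas are invented, and the `field` value `"other"` is a placeholder for any value that is not "sg" or "pr".

```python
import pandas as pd

# Invented rows; "other" is a placeholder for any field value except "sg"/"pr".
toy = pd.DataFrame({
    "lemma": ["a[water]n", "a[na]na", "a[na]na", "udu[sheep]n"],
    "field": ["other", "pr", "sg", ""],
})

mask = toy["field"].isin(["sg", "pr"])   # True for sign and pronunciation rows
kept = toy[~mask]                        # ~ inverts the mask: keep all other rows
print(kept["lemma"].tolist())
```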

#### 3.1.0.4 Select Old Babylonian Lexical Texts
The great majority of texts in [ETCSL](http://etcsl.orinst.ox.ac.uk) is from the Old Babylonian period. We will use the [DCCLT](http://oracc.org/dcclt) catalog to select only those lexical texts that come from that same period.

The catalog is included as a separate `json` file in `dcclt.zip`. Since we parsed the [DCCLT](http://oracc.org/dcclt) text editions in preparation for this script, the file `dcclt.zip` should still be in the `jsonzip` directory.

The file `catalogue.json` is much shallower in structure than the text files, so there is no need to parse it. We open the zip file with the `zipfile` module and read `catalogue.json` with the `read()` command (from the `zipfile` library) as a string into the variable `st`. We can then use the `loads()` (load string) command from the `json` package to turn that string into a Python data structure. Once loaded, the data can be read immediately into a Pandas dataframe. In order to get the dataframe properly oriented (each row representing a text) the dataframe needs to be transposed, by adding `.T` to the end of the command.

Finally the dataframe is reduced to just two columns: `id_text` and `period` so that we can select the ones that have "Old Babylonian" in the `period` column.

In [None]:
file = "jsonzip/dcclt.zip"
z = zipfile.ZipFile(file) 
st = z.read("dcclt/catalogue.json").decode("utf-8")
j = json.loads(st)
cat_df = pd.DataFrame(j["members"]).T
cat_df["id_text"] = cat_df["id_text"].fillna(cat_df["id_composite"])
cat_df = cat_df[["id_text", "period"]]
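
The effect of the transposition and of `fillna()` can be seen in a small sketch. The JSON string below is an invented stand-in for `catalogue.json` (three made-up entries, assuming that the `members` key maps text IDs to metadata, as in the cell above); it is not actual catalog data.

```python
import json
import pandas as pd

# Invented stand-in for catalogue.json: "members" maps text IDs to metadata.
st = '''{"members": {
    "P000001": {"id_text": "P000001", "period": "Old Babylonian"},
    "Q000002": {"id_composite": "Q000002", "period": "Old Babylonian"},
    "P000003": {"id_text": "P000003", "period": "Neo-Assyrian"}
}}'''

j = json.loads(st)
# Without .T each text would be a column; after transposing each text is a row.
cat = pd.DataFrame(j["members"]).T
# Composite texts have no id_text; fill the gap from id_composite.
cat["id_text"] = cat["id_text"].fillna(cat["id_composite"])
ob_ids = cat[cat["period"] == "Old Babylonian"].index.tolist()
print(ob_ids)
```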

In [None]:
ob = cat_df[cat_df["period"] == "Old Babylonian"]
ob[:10]

The index of the resulting dataframe `ob` is identical to the column `id_text` (the P, Q, or X number of each text). We can retrieve the index with the Pandas attribute `index.values`, which returns a NumPy array. These are the P/Q/X numbers that we want to keep.

In the dataframe `lex_words` all text IDs are preceded by `dcclt/`, `dcclt/signlists`, etc. We will compare the last seven characters of `id_text` (the P, Q, or X number), to see if that number appears in `keep`. This will select the Old Babylonian entries.

In [None]:
keep = ob.index.values
lex_words = lex_words.loc[lex_words["id_text"].str[-7:].isin(keep)]
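
The selection on the last seven characters can be tried out on a few invented text IDs. In this sketch `keep` is a plain list rather than the index of `ob`; `isin()` accepts both.

```python
import pandas as pd

# Invented IDs with different project prefixes.
toy = pd.DataFrame({
    "id_text": ["dcclt/P000001", "dcclt/signlists/P000003", "dcclt/Q000002"],
})
keep = ["P000001", "Q000002"]            # stand-in for ob.index.values

# .str[-7:] slices the last seven characters (the P/Q/X number) of each ID.
selected = toy.loc[toy["id_text"].str[-7:].isin(keep)]
print(selected["id_text"].tolist())
```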

### 3.1.1 First Approximation
Now we have two dataframes: `etcsl_words` and `lex_words`. In both the field `lemma` contains the lemmatization data of a single word. We can extract the unique lemmas with the `set()` command (a set is an unordered collection of unique elements). We remove the non-lemmatized words (those have `na` as Guide Word and `na` as POS) with a set comprehension. Now we can compare the two sets in a [Venn diagram](https://en.wikipedia.org/wiki/Venn_diagram).

In [None]:
etcsl_words_s = set(etcsl_words["lemma"])
lexical_words_s = set(lex_words["lemma"])
etcsl_words_s = {lemma for lemma in etcsl_words_s if not '[na]na' in lemma}
lexical_words_s = {lemma for lemma in lexical_words_s if not '[na]na' in lemma}
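
The overlap that a Venn diagram visualizes can also be computed directly with set operations. The sketch below uses two tiny invented vocabularies; the variable names `lit` and `lex_toy` are used only here.

```python
# Two tiny invented vocabularies (already lowercased).
lit = {"lugal[king]n", "udu[sheep]n", "e[house]n", "x-ra-bi[na]na"}
lex_toy = {"udu[sheep]n", "niga[fattened]v/i", "e[house]n", "gud[ox]n", "bad[na]na"}

# Drop unlemmatized words with a set comprehension.
lit = {lemma for lemma in lit if "[na]na" not in lemma}
lex_toy = {lemma for lemma in lex_toy if "[na]na" not in lemma}

shared = lit & lex_toy                      # intersection of the two sets
lit_per = "{:.0%}".format(len(shared) / len(lit))
lex_per = "{:.0%}".format(len(shared) / len(lex_toy))
print(sorted(shared), lit_per, lex_per)
```

Note that the two percentages differ because each is computed relative to the size of its own set.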

The `venn2` command from the `matplotlib_venn` library creates a Venn diagram of two sets. Each set is represented by a circle; the area of the circle is proportional to the number of elements in the set. The overlap between the circles represents elements that are contained in both sets.

In its most basic form the `venn2()` command simply takes a list that contains the two sets.

In [None]:
venn2([etcsl_words_s, lexical_words_s]);

This basic plot is not too informative because it does not include the size of each set, nor its name. We can customize colors, size of the plot, and the legends. This customization is put in a function so it can be reused later on.

In [None]:
def plot_venn(lit_vocab, lex_vocab):
    """The function takes two sets as arguments and draws a Venn diagram that shows the intersection between the two sets.
    The legend includes the size of each set and the percentage of the intersection with the other set.
    """
    plt.figure(figsize=(8,8))
    lit_abs = len(lit_vocab)
    lex_abs = len(lex_vocab)
    inter_abs = len(lit_vocab.intersection(lex_vocab))
    lit_per = "{:.0%}".format(inter_abs/lit_abs)
    lex_per = "{:.0%}".format(inter_abs/lex_abs)
    lit_legend = "literary (" + str(lit_abs) + ') ' + lit_per + " overlap"
    lex_legend = "lexical (" + str(lex_abs) + ') ' + lex_per + " overlap"
    c = venn2([lit_vocab, lex_vocab], (lit_legend, lex_legend))
    c.get_patch_by_id('10').set_color("#fdb515")
    c.get_patch_by_id('01').set_color("#003262")
    c.get_patch_by_id('11').set_color("#bc9b6a")
    plt.show()
    return

In [None]:
plot_venn(etcsl_words_s, lexical_words_s)

### 3.1.2 Second Approach: Multiple Word Expressions

Instead of looking at individual words (or lexemes), we may also look at lexical *entries* and their presence (or absence) in literary texts. The list of domestic animals, for instance, includes the entry `udu diŋir-e gu₇-a` ('sheep eaten by a god'), lemmatized as `udu[sheep]n diŋir[god]n gu[eat]v/t`. Unsurprisingly, all these very common lemmas appear in the literary corpus, and thus in our previous analysis this item results in three hits. But does the expression as a whole ever appear in the literary corpus? 

In order to perform the comparison on the lexical entry level we first need to represent our data (lexical and literary) as lines, rather than as individual words. Lines in lexical texts will become our multiple word expressions. Lines in literary texts will serve as boundaries, since we do not expect our multiple word expressions to continue from one line to the next.

We will use the Multiple Word Expressions (MWE) Tokenizer from the Natural Language Toolkit (`nltk`) to identify and mark the lexical expressions in the literary corpus. Essentially MWETokenizer processes a text that is already tokenized, combining tokens that belong together in a Multiple Word Expression according to a list of such expressions provided by the user. The corpus to be tokenized with MWETokenizer is expected to be a list of lists, where each list represents a sentence. The data format for the Multiple Word Expressions is a list of tuples, where each tuple represents a sequence of words that belong together. In order to use MWETokenizer for our purposes we thus need to transform the [ETCSL](http://etcsl.orinst.ox.ac.uk) data into a list of lists and the lexical data into a list of tuples.

#### 3.1.2.1 Line by Line

The dataframe `lex_words` that was produced in section 3.1.0.4 contains the lemmatizations of all Old Babylonian lexical texts in a word-by-word (or rather lemma-by-lemma) arrangement. In order to work with lexical *entries* we need to reconstruct lines. That is, we collect the words (lemmas) that belong to the same line of the same lexical text. The dataframe `lex_words` includes the fields `id_text` and `id_line` that allow us to do so. 

| id_text | id_line | lemma|
|:-------|:------|:------|
| dcclt/Q000001 |	1 | udu\[sheep\]n |
| dcclt/Q000001|	1 | niga\[fattened\]v/i|
| dcclt/Q000001|	2 |	udu\[sheep\]n|
| dcclt/Q000001|	2 |	niga\[fattened\]v/i|
| dcclt/Q000001|	2 |	sag\[rare\]v/i|

We need to change the above representation into two entries (representing two lines in a lexical text) like this:

| id_text | id_line | lemma|
|:-------|:------|:------|
| dcclt/Q000001 |	1 | (udu\[sheep\]n, niga\[fattened\]v/i) |
| dcclt/Q000001|	2 | (udu\[sheep\]n, niga\[fattened\]v/i, sag\[rare\]v/i) |

The round brackets in the `lemma` column indicate that the data format is a tuple: an immutable list. The Multiple Word Expression (MWE) tokenizer from the `nltk` package (see below) uses tuples to define MWEs, so we can directly feed the new `lemma` column into the tokenizer.

In order to do this we use the Pandas functions `groupby()` and `agg()` (for aggregate). The `groupby()` function takes as argument a list of fields on which the grouping should be performed -- in this case the fields `id_text` and `id_line`. The `groupby()` function returns a so-called "GroupBy object" which preserves all the information of the original dataframe. The GroupBy object can be further manipulated with the `agg()` function.

The `agg()` function works on a GroupBy object and computes summary statistics (such as mean, sum, or count) for each group. In our case each group is a line in a lexical text and the "summary statistic" that we want is a tuple that contains all the lemmas of a single lexical line. The `agg()` function takes as argument a dictionary with field names as keys and functions as values. The function `tuple` aggregates the grouped entries in the `lemma` column and places them in a tuple. A second field that is aggregated is `extent`. This field indicates (among other things) the number of broken or illegible lines between two lines of text. We will use that data in a later phase of the analysis. The aggregation function here is `''.join`, which will simply concatenate the strings.

The `agg()` function returns a new dataframe with a composite index. The Pandas function `reset_index()` will create a new (flat) index that starts counting from 0. The `to_pickle` function from the `pandas` package saves the resulting DataFrame for use in the next notebook.

In [None]:
lex_lines = lex_words.groupby([lex_words['id_text'], lex_words['id_line']]).agg({
        'lemma': tuple,
        'extent': ''.join
    }).reset_index()
lex_lines.to_pickle('output/lexlines.p')

In [None]:
lex_lines[:10]
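
The following sketch applies the same `groupby()` and `agg()` pattern to the five example rows from the table above. The `extent` column is filled with empty strings here, which is an assumption made only to keep the example small.

```python
import pandas as pd

# The five example rows from the table above; extent values are placeholders.
toy = pd.DataFrame({
    "id_text": ["dcclt/Q000001"] * 5,
    "id_line": [1, 1, 2, 2, 2],
    "lemma": ["udu[sheep]n", "niga[fattened]v/i",
              "udu[sheep]n", "niga[fattened]v/i", "sag[rare]v/i"],
    "extent": ["", "", "", "", ""],
})

toy_lines = toy.groupby(["id_text", "id_line"]).agg({
    "lemma": tuple,       # collect the lemmas of one line in a tuple
    "extent": "".join,    # concatenate the extent strings of one line
}).reset_index()
print(toy_lines["lemma"].tolist())
```

The five word-level rows are reduced to two line-level rows, each holding a tuple of lemmas.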

Now we do essentially the same for the `etcsl_words` dataframe, reconstructing lines in literary compositions. In this case, however, we want to aggregate the `lemma` column in a list, because that is the format MWETokenizer expects for the text to be tokenized. 

In [None]:
etcsl_lines = etcsl_words.groupby([etcsl_words['id_text'], etcsl_words['id_line'], etcsl_words['text_name']]).agg({
        'lemma': list,
        'extent': ''.join
    }).reset_index()

In [None]:
etcsl_lines[1000:1010]

#### 3.1.2.2 Extract lexical entries 
Each row in the resulting DataFrame `lex_lines` now consists of a text ID (`id_text`), a line number (`id_line`), and a tuple with the lemmas that represent a lexical *entry* (e.g. `(udu[sheep]n, diŋir[god]n, gu[eat]v/t)`). We extract the `lemma` column, remove duplicate lexical entries with the `set()` function and create a `list` (a list of tuples). 

Any lexical line that contains an unlemmatized word (characterized by "na" as Guide Word and "na" as Part of Speech) is useless for the comparison and is deleted from the list. 

In [None]:
lex = list(set(lex_lines["lemma"]))
lex = [l for l in lex if not '[na]na' in ' '.join(l)]
lex[0:10]
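
A small sketch of this step, with invented entries: because tuples are hashable, `set()` can remove duplicate entries, and joining the lemmas of an entry into one string makes it easy to test for the `[na]na` marker. The result is sorted here only to make the output predictable.

```python
# Invented lexical entries; the second is a duplicate, the third is unlemmatized.
entries = [
    ("udu[sheep]n", "niga[fattened]v/i"),
    ("udu[sheep]n", "niga[fattened]v/i"),
    ("udu[sheep]n", "x[na]na"),
    ("udu[sheep]n",),
]

unique = set(entries)                                   # duplicates collapse
clean = sorted(e for e in unique if "[na]na" not in " ".join(e))
print(clean)
```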

#### 3.1.2.3 Mark lexical entries in literary texts
The list `lex` now contains all uniquely lemmatized entries in the Old Babylonian lexical corpus as edited in [DCCLT](http://oracc.org/dcclt). This is the vocabulary that we wish to find in the literary corpus as edited in [ETCSL](http://etcsl.orinst.ox.ac.uk/).

In order to do so we must re-tokenize the literary corpus, using the Multiple Word Expressions Tokenizer from `nltk`. This tokenizer is initialized with a list of tuples, where each tuple represents a Multiple Word Expression. By default, the words that constitute a MWE are connected by underscores.

Before initializing the tokenizer we remove the single-word entries (tuples with length 1) from `lex`. The resulting list is called `lex_mwe`. The tokenizer is then initialized with `lex_mwe` as its sole argument.

In [None]:
lex_mwe = [item for item in lex if len(item) > 1]
tokenizer = MWETokenizer(lex_mwe)

To illustrate how the MWETokenizer works we may try it on a single line of text, line 148 of the composition [Iddin-Dagan A](http://etcsl.orinst.ox.ac.uk/cgi-bin/etcsl.cgi?text=c.2.5.3.1&display=Crit&charenc=gcirc):

> 148. {udu}a-lum udu zulumḫi udu niga ŋiš mu-ni-ib-tag-ge
>
> "They sacrifice *aslum* sheep, long-haired sheep, and fattened sheep for her."

In the `lemma` column of the `etcsl_lines` DataFrame the line is represented as
> [aslum[sheep]n, udu[sheep]n, zulumhi[sheep]n, udu[sheep]n, niga[fattened]v/i, ŋeš[tree]n, tag[touch]v/t]

We can run this list of lemmas through the tokenizer to see what happens.

In [None]:
lemm_line = ["aslum[sheep]n", "udu[sheep]n", "zulumhi[sheep]n", "udu[sheep]n", "niga[fattened]v/i", "ŋeš[tree]n", "tag[touch]v/t"]
tokenizer.tokenize(lemm_line)

The tokenizer thus found three Multiple Word Expressions in this single line and connected the lemmas of the MWEs by underscores. The line also illustrates a limitation of this approach. The [ETCSL](http://etcsl.orinst.ox.ac.uk) edition of [Iddin-Dagan A](http://etcsl.orinst.ox.ac.uk/cgi-bin/etcsl.cgi?text=c.2.5.3.1&display=Crit&charenc=gcirc) represents the first word of line 148 as {udu}a-lum, taking "udu" as a determinative (or semantic classifier). The edition of the list of animals in [OB Ura 3](http://oracc.org/dcclt/Q000001) in [DCCLT](http://oracc.org/dcclt), however, treats this same sign sequence as a sequence of two words: udu a-lum, lemmatized as udu\[sheep\]N aslum\[sheep\]N (line 8). Although aslum\[sheep\]N will result in a match, it will seem that the combination udu\[sheep\]N aslum\[sheep\]N does not appear in the literary corpus. Matches are only found if the words are represented in exactly the same way, so small inconsistencies in transliteration or lemmatization will cause genuine matches to be missed.

We can now apply the MWE tokenizer on the entire data set, by re-tokenizing each list of lemmas in the `lemma` column of the `etcsl_lines` DataFrame. The function `tokenize_sents()` (for "tokenize sentences") can be used to tokenize a list of lists where each second-order list represents a sentence (or, in our case, a line) in one go. The result of this function is again a list of lists, which is added as a new column (`lemma_mwe`) to the `etcsl_lines` DataFrame.

The `lemma_mwe` column of the `etcsl_lines` dataframe will now represent the [ETCSL](http://etcsl.orinst.ox.ac.uk/) data in a line-by-line presentation of lemmatizations, with underscores connecting lemmas if a corresponding sequence of lemmas exists as an Old Babylonian lexical entry. This version of the DataFrame `etcsl_lines` is pickled for use in the next notebook.

In [None]:
etcsl_lines["lemma_mwe"] = tokenizer.tokenize_sents(etcsl_lines["lemma"])
etcsl_lines.to_pickle('output/etcsllines.p')
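
The sketch below runs `tokenize_sents()` on two invented lines with a single invented Multiple Word Expression. It also illustrates why unlemmatized words were kept up to this point: if the broken word in the second line is removed before tokenizing, the two surrounding lemmas become adjacent and produce a false match.

```python
from nltk.tokenize import MWETokenizer

# One invented multiple-word entry; the default separator is "_".
toy_tokenizer = MWETokenizer([("udu[sheep]n", "niga[fattened]v/i")])

toy_text = [
    ["udu[sheep]n", "niga[fattened]v/i", "ŋeš[tree]n"],
    ["udu[sheep]n", "x[na]na", "niga[fattened]v/i"],   # broken word in between
]
marked = toy_tokenizer.tokenize_sents(toy_text)

# Removing unlemmatized words first makes udu and niga adjacent in line 2.
stripped = [[w for w in line if "[na]na" not in w] for line in toy_text]
false_positive = toy_tokenizer.tokenize_sents(stripped)

print(marked)
print(false_positive)
```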

Now join all the tuples in the list `lex` with underscores, so that the multiple-word entries are represented in the same way as they are in the literary corpus.

In [None]:
lex_vocab = ["_".join(entry) for entry in lex]
lex_vocab.sort()

We can extract the column `lemma_mwe` from the `etcsl_lines` DataFrame as a list of lists and flatten that list with a list comprehension. This returns a list that contains all lemmatizations of the entire [ETCSL](http://etcsl.orinst.ox.ac.uk) data set. After turning this list into a set (to remove duplicate lemmas) we can remove all the non-lemmatized words from the [ETCSL](http://etcsl.orinst.ox.ac.uk) data set with a set comprehension.

In [None]:
etcsl_words2 = [item for sublist in etcsl_lines["lemma_mwe"] for item in sublist] #flatten list of lists to list
etcsl_words_s2 = set(etcsl_words2)
etcsl_words_s2 = {lemma for lemma in etcsl_words_s2 if not '[na]na' in lemma}
lexical_words_s2 = set(lex_vocab)
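
Joining the tuples and flattening the list of lists can be illustrated with a few invented entries and lines. Here the flattening and the removal of unlemmatized words are combined in a single set comprehension, a small shortcut compared to the separate steps in the cell above.

```python
# Invented lexical entries (tuples) and re-tokenized literary lines (lists).
toy_lex = [("udu[sheep]n", "niga[fattened]v/i"), ("gud[ox]n",)]
toy_lines_mwe = [
    ["udu[sheep]n_niga[fattened]v/i", "ŋeš[tree]n"],
    ["x[na]na", "gud[ox]n"],
]

# Join each tuple with underscores, matching the tokenizer's separator.
toy_lex_vocab = sorted("_".join(entry) for entry in toy_lex)

# Flatten the list of lists and drop unlemmatized words in one comprehension.
toy_lit_vocab = {item for line in toy_lines_mwe for item in line
                 if "[na]na" not in item}

print(toy_lex_vocab)
print(toy_lit_vocab & set(toy_lex_vocab))
```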

We can now reuse the function `plot_venn()` that was created above.

In [None]:
plot_venn(etcsl_words_s2, lexical_words_s2)

### 3.1.3 Add them Up
By creating the union of the two sets (the set with individual words and the set with the lexical entries) we get the most complete comparison of the two corpora. Here `gud[oxen]N_an[heaven]N`, `gud[oxen]N` and `an[heaven]N` are all counted as entries, whether or not `gud` and `an` actually appear as such in the lexical corpus.

In [None]:
etcsl_words_s3 = etcsl_words_s | etcsl_words_s2
lexical_words_s3 = lexical_words_s | lexical_words_s2

In [None]:
plot_venn(etcsl_words_s3, lexical_words_s3)

#### 3.1.3.1 Discussion
Whereas the change from individual *words* to *lexical expressions* made a big difference in the plot, adding the two up changes the picture only slightly. Many words (lemmas) are part of lexical entries, but are also lexical entries in and of themselves. These individual words are already included in the set of lexical entries and are taken into account in the previous plot. Nevertheless, several hundred words are added on the lexical side - these are lexemes that *only* appear in multiple-word lexical entries and are not attested in the lexical corpus as separate words. 

On the literary side the number of additional entries is much smaller (counted in the tens, rather than in the hundreds). These are words that appear *only* in fixed expressions (connected by underscore), but not as separate words. We can see which words those are by subtracting the set `etcsl_words_s2` from `etcsl_words_s3`.

In [None]:
etcsl_words_s3 - etcsl_words_s2
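
The logic of the union and the subsequent subtraction can be followed in a toy example with invented sets. Here `niga[fattened]v/i` occurs only inside an expression, so it is present in the word set but absent from the expression set.

```python
# Invented sets: individual words vs. vocabulary after MWE tokenization.
toy_words = {"udu[sheep]n", "niga[fattened]v/i"}
toy_mwe = {"udu[sheep]n_niga[fattened]v/i", "udu[sheep]n"}

toy_union = toy_words | toy_mwe          # union of both sets
only_in_expressions = toy_union - toy_mwe

print(only_in_expressions)
```

The difference contains exactly those words that never occur outside a multiple-word expression.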

The word `ašgar[kid]n`, for instance, appears only once in the ETCSL corpus, in the Gudea cylinders. 

Note that the DataFrame `etcsl_words` is the original one, in which each row represents a single lemma (not a line).

In [None]:
etcsl_words.loc[etcsl_words['lemma']=="ašgar[kid]n"]

The context is a ritual where Gudea uses (the hide of) a virgin female kid (`ašgar ŋeš nu-zu` [Gudea Cylinder A203](http://etcsl.orinst.ox.ac.uk/cgi-bin/etcsl.cgi?text=c.2.1.7&display=Crit&charenc=gcirc&lineid=c217.203#c217.203)). This expression appears as a whole in the [Old Babylonian list of domestic animals](http://oracc.org/dcclt/Q000001.173#Q000001.168) - as a result the word `ašgar` alone was not represented in the set `etcsl_words_s2`.

For the big picture, however, the last two graphs are very similar to each other and show that on the *literary* side, a little more than half of the vocabulary is attested in (contemporary) lexical texts. On the *lexical* side, however, there appears to be a large group of words and expressions (around 70%) that were taught to students but were not used anywhere in the literary tradition.

# Testing
Repeat the word-level comparison of section 3.1.1, but without Proper Nouns and Number words in the ETCSL data.

In [None]:
Proper_N = ['AN', 'CN', 'DN', 'EN', 'FN', 'GN', 
            'LN', 'MN', 'ON', 'PN', 'QN', 'RN', 'SN', 'TN', 'WN', 'YN', 'NU']

In [None]:
etcsl_words = etcsl_words.loc[~etcsl_words.pos.isin(Proper_N)]

In [None]:
etcsl_words_s = set(etcsl_words["lemma"])
lexical_words_s = set(lex_words["lemma"])
etcsl_words_s = {lemma for lemma in etcsl_words_s if not '[na]na' in lemma}
lexical_words_s = {lemma for lemma in lexical_words_s if not '[na]na' in lemma}

In [None]:
plot_venn(etcsl_words_s, lexical_words_s)

In [None]:
nonlex = etcsl_words_s - lexical_words_s
nonlex