# 3.2 Overlap in Lexical and Literary Vocabulary: Digging Deeper

To examine the relationship between lexical and literary material in more detail, we first organize the [ETCSL](http://etcsl.orinst.ox.ac.uk) corpus in a Document Term Matrix. A Document Term Matrix is a table in which each row represents a document (in our case, a literary composition) and each column represents a lemma. Each cell indicates how many times that lemma appears in that document.
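As a minimal sketch of what such a table looks like, the following toy example builds a count-based Document Term Matrix by hand. The two "compositions" and their lemma lists are invented for illustration only; they are not taken from the corpus.

```python
from collections import Counter

# Two invented toy "compositions", each a list of lemmas (not real corpus data)
docs = {
    "comp_a": ["lugal[king]n", "e[house]n", "lugal[king]n"],
    "comp_b": ["e[house]n", "ama[mother]n"],
}

# Columns: every lemma that occurs anywhere, in sorted order
vocab = sorted({lemma for lemmas in docs.values() for lemma in lemmas})

# Rows: one list of counts per composition, aligned with vocab
dtm = {name: [Counter(lemmas)[term] for term in vocab]
       for name, lemmas in docs.items()}

print(vocab)
for name, row in dtm.items():
    print(name, row)
```

Each row has one cell per lemma in `vocab`; a cell is 0 when the lemma does not occur in that composition.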

### 3.2.0 Preparation

First import the necessary libraries. If you are running this notebook in Jupyter Lab you will need to install the Jupyter Lab ipywidgets extension (see Introduction, section 1.2.2.1). Second, open the files `etcsllines.p` and `lexlines.p` which were produced in the previous notebook.

In [1]:
import pandas as pd
from ipywidgets import interact
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import MWETokenizer

In [2]:
etcsl_lines = pd.read_pickle('output/etcsllines.p')
lex_lines = pd.read_pickle('output/lexlines.p')

#### 3.2.0.1 Literary: By Composition
For the literary corpus we can take the line-by-line representation that was prepared in the previous notebook and transform that into a composition-by-composition representation. The DataFrame `etcsl_lines` includes the column `lemma_mwe` in which each line is represented as a list of lemmas and/or Multiple Word Expressions (lemmas connected by underscores). The `pandas` `groupby()` function is used to group on `id_text` and `text_name`. The aggregate function for the `lemma_mwe` column in this case is simply `sum`: all the lists (representing lines) are added up to form one long list of lemmas representing one composition.

In [3]:
etcsl_comp = etcsl_lines.groupby(
    [etcsl_lines["id_text"], etcsl_lines["text_name"]]).aggregate(
    {"lemma_mwe": sum}).reset_index()
etcsl_comp[25:35]

| | id_text | text_name | lemma_mwe |
|---|---|---|---|
| 25 | c.1.4.1.1 | Dumuzid and Ŋeštin-ana | `[galla[policeman]n_tur[small]v/i, kag[mouth]n_...` |
| 26 | c.1.4.1.3 | Dumuzid and his sisters | `[x[na]na, x[na]na, ŋar[na]na, x[na]na, igi[eye...` |
| 27 | c.1.4.3 | Dumuzid's dream | `[šag[heart]n, er[tears]n, si[fill]v/t, eden[ba...` |
| 28 | c.1.4.4 | Inana and Bilulu: an ulila to Inana | `[eden[back]n, dumuzid[1]dn, ilu[song]n, zaʾe[y...` |
| 29 | c.1.5.1 | Nanna-Suen's journey to Nibru | `[ursaŋ[hero]n, iri[city]n, ama[mother]n, nanna...` |
| 30 | c.1.6.1 | Ninurta's return to Nibru: a šir-gida to Ninurta | `[an[1]dn, dim[create]v/t, dumu[child]n, enlil[...` |
| 31 | c.1.6.2 | Ninurta's exploits: a šir-sud (?) to Ninurta | `[an[1]dn, lugal[king]n, diŋir[deity]n, nirŋal[...` |
| 32 | c.1.6.3 | Ninurta and the turtle | `[x[na]na, x[na]na, amar[young]n, anzud[1]dn, x...` |
| 33 | c.1.7.1 | The marriage of Martu | `[inab[1]sn, me[be]v/i, kiritaba[1]sn, me[be]v/...` |
| 34 | c.1.7.3 | Ninŋišzida's journey to the nether world | `[zig[rise]v/t, u[ride]v/t, zig[rise]v/t, u[rid...` |


The result is a DataFrame with three columns: `id_text`, `text_name`, and `lemma_mwe`. Each row represents a literary composition from the [ETCSL](http://etcsl.orinst.ox.ac.uk) corpus. Each cell in the column `lemma_mwe` contains a list with all the lemmas of one composition (with MWEs connected by underscores).
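The grouping step can be tried out on a toy DataFrame. The sketch below uses invented lines and names (`toy_lines`, `toy_comp`) and, instead of `sum`, concatenates the lists with `itertools.chain`. That is a swapped-in alternative that produces the same kind of result while avoiding repeated list copying on long texts; it is not the method used in the cell above.

```python
from itertools import chain

import pandas as pd

# Invented line-by-line data: three lines belonging to two toy compositions
toy_lines = pd.DataFrame({
    "id_text": ["c.toy.1", "c.toy.1", "c.toy.2"],
    "text_name": ["Toy A", "Toy A", "Toy B"],
    "lemma_mwe": [["lugal[king]n", "e[house]n"], ["ama[mother]n"], ["e[house]n"]],
})

# Concatenate all line-level lists of a composition into one long list
toy_comp = (
    toy_lines.groupby(["id_text", "text_name"])["lemma_mwe"]
    .agg(lambda lines: list(chain.from_iterable(lines)))
    .reset_index()
)
print(toy_comp)
```

The two lines of `c.toy.1` end up as one list of three lemmas, in their original order.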

#### 3.2.0.2 Lexical: Extract Vocabulary
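The `lex_lines` DataFrame, created in the previous notebook, represents each lexical entry as a list of lemmas in the column `lemma`. We add a column `lemma_mwe` to `lex_lines` by connecting all lemmas in a lexical entry with underscores, and then extract a sorted list of all unique lexical lemmas (including MWEs) for use as the vocabulary of the Document Term Matrix. Entries that contain an unlemmatized word (marked `[na]na`) are skipped.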

In [4]:
lex_lines["lemma_mwe"] = ["_".join(entry) for entry in lex_lines["lemma"]]
lex_vocab = {lemma for lemma in lex_lines["lemma_mwe"] if '[na]na' not in lemma}
lex_vocab = sorted(lex_vocab) # sorted list; passed to CountVectorizer as its vocabulary
lex_vocab[:10]

['a[arm]n',
 'a[arm]n_ak[do]v/t',
 'a[arm]n_apin[plow]n',
 'a[arm]n_bad[open]v/t',
 'a[arm]n_bad[wall]n',
 'a[arm]n_badsi[parapet]n',
 'a[arm]n_be[diminish]v/t',
 'a[arm]n_da[line]n',
 'a[arm]n_dabašin[object]n',
 'a[arm]n_daluš[sling]n']

# Some thoughts

Probably better *not* to concatenate lex_comp and etcsl_comp.

* Step 1. Measure length of lemma_mwe in etcsl_comp and remove rows with len < 50.
* Step 2. Create DTM (see below) of etcsl_comp, binary = True and vocabulary = lemma_mwe from lex (use lex_lines)
* Step 3. Order compositions by highest match
* Step 4. Normalize for text length (from Step 1)
* Step 5. Same process for individual lex texts (which has highest match for Ura 4?)
* Step 6. TF-IDF

In future iteration: do *not* select among lexical texts - let the script figure out which lex compositions are most relevant.

Perhaps: make DTM first - show that DTM.shape gives same numbers for lex vocabulary as second Venn diagram above. Remove all columns where sum == 0. Show that DTM.shape now gives total of overlap as in Venn diagram above. Then remove rows <= minimum. Tricky!

### 3.2.1 Basic Statistics of the [ETCSL](http://etcsl.orinst.ox.ac.uk) Corpus
In computing the relationship between lexical and literary vocabulary, text length plays a big role. A long text will likely have more overlap with lexical vocabulary than a very short one. The [ETCSL](http://etcsl.orinst.ox.ac.uk) includes compositions that are known only from a fragmentary *incipit*, as well as compositions that are more than a thousand lines long.

In order to meaningfully compare these compositions we will first eliminate all texts that have fewer than 200 lemmas and/or MWEs. Second, we will collect data on text length and lexical variation (how many unique lemmas are used in this text?). Dividing lexical variation by text length provides the "Type to Token Ratio" or ttr. 

In [None]:
minimum = 200
etcsl_comp["length"] = [len(lemmas) for lemmas in etcsl_comp["lemma_mwe"]]
etcsl_comp["lex_var"] = [len(set(lemmas)) for lemmas in etcsl_comp["lemma_mwe"]]
etcsl_comp["ttr"] = [len(set(lemmas))/len(lemmas) for lemmas in etcsl_comp["lemma_mwe"]]
etcsl_comp = etcsl_comp.loc[etcsl_comp.length >= minimum]
etcsl_comp[25:35]
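To make the three measures concrete, here is a small worked example in plain Python. The lemma list is invented for illustration.

```python
# An invented toy composition of four tokens
lemmas = ["lugal[king]n", "e[house]n", "lugal[king]n", "diŋir[deity]n"]

length = len(lemmas)         # number of tokens
lex_var = len(set(lemmas))   # number of unique lemmas (types)
ttr = lex_var / length       # Type to Token Ratio

print(length, lex_var, ttr)
```

Because `lugal[king]n` occurs twice, the text has four tokens but only three types, so the ratio is 3/4. Longer texts tend to repeat more lemmas and therefore tend to have a lower ratio.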

### 3.2.2 Document Term Matrix

**adjust text to new DTM**

The corpus is transformed into a Document Term Matrix (or DTM), a table in which each column represents a word (or expression) that appears in a lexical text and each row represents a Sumerian composition. Each cell is a number, 0 or 1, indicating whether or not that word appears in a particular composition. This is a binary DTM; non-binary DTMs give the number of times a word appears in a composition.

Since DTMs are very commonly used in computational text analysis, it is worth spending a bit more time on the various ways in which they can be created for cuneiform data. The function `CountVectorizer()` (from the `sklearn` package) is a very flexible tool with many possible parameters. How `CountVectorizer()` and its counterpart `TfidfVectorizer()` are used depends on the structure of the input data. The most common use case is a corpus of raw documents (probably in English), each of them consisting of a text string that needs to be pre-processed and tokenized before anything else can be done. Default pre-processing includes, for instance, lowercasing the entire text. Default tokenizers assume that the text is in a modern (western) language and take spaces and punctuation marks as word dividers.

Cuneiform data, whether in transliteration, lemmatization, or normalization, is much simpler than most modern language texts, because the only type of word boundary is a space (or a sequence of spaces). When using `CountVectorizer()` on transliterated, lemmatized, or normalized text we can use the parameter `token_pattern = r'[^ ]+'`, meaning "any sequence of characters, except space."
```python
cv = CountVectorizer(token_pattern= r'[^ ]+')
```
A second situation is where we want to use data that is already in a list format (that is, already preprocessed and tokenized). All the [ORACC](http://oracc.org) and [ETCSL](http://etcsl.orinst.ox.ac.uk) data fall into that category. Rather than transforming the tokenized text back into raw strings and then tokenizing those strings again, we can use the parameters `tokenizer` and `preprocessor` to take care of that situation. Each of these parameters takes a function as its value; the tokenizer function should return a list of tokens. If our input already is a list of tokens we can pass a dummy function, that is, a function that simply returns the list it receives.
```python
def dummy(tokens):
    return tokens
cv = CountVectorizer(tokenizer=dummy, preprocessor=dummy)
```
This prevents `CountVectorizer()` from using its default tokenizer and preprocessor (which do not accept list input) and saves the trouble of untokenizing and then tokenizing again (see the [blog post](http://www.davidsbatista.net/blog/2018/02/28/TfidfVectorizer/) on this subject by David Batista). Instead of defining a `dummy()` function we can achieve the same effect with a lambda function (see the code below).

Finally, we can choose to use the `MWETokenizer()` discussed above (section ###). The `MWETokenizer()` expects a tokenized text (a list) and re-tokenizes that text by using a list of pre-defined Multiple Word Expressions, returning a new list. In case we use the original [ETCSL](http://etcsl.orinst.ox.ac.uk) data, in which the MWEs have not yet been marked, we can do the CountVectorizing and marking the MWEs in one go, as follows:
```python
def dummy(tokens):
    return tokens
tokenizer = MWETokenizer(lex_mwe) # initialize the tokenizer with the lexical MWEs
cv = CountVectorizer(tokenizer=tokenizer.tokenize, preprocessor=dummy)
```
For our current purposes the best approach is to use a dummy tokenizer and preprocessor. The disadvantage of using the `MWETokenizer` on entire texts is that it does not honor line boundaries. See, for instance, Gilgameš and Huwawa 50-51 (text and translation: [ETCSL](http://etcsl.orinst.ox.ac.uk/cgi-bin/etcsl.cgi?text=t.1.8.1.5&display=Crit&charenc=gcirc#)):

> ama tuku ama-a-ni-še₃
> nitah saŋ-dili ŋe₂₆-e-gin₇ ak a₂-ŋu₁₀-še₃ hu-mu-un-ak
> "Let him who has a mother go to his mother! 
> Let bachelor males, types like me, join me at my side!"

Because line 50 ends with *ama* and line 51 begins with *nitah*, tokenizing the composition as a whole will result in the Multiple Word Expression ama\[mother\]n_nita\[male\]N, an expression found in the list of human beings Lu ([OB Nippur Lu](http://oracc.org/dcclt/Q000047.351)), which is clearly not applicable here. The number of such errors is fairly small (about 6 for a corpus of almost 400 texts). For other types of texts, where line boundaries are less significant, this method may well be an efficient way of doing things.
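The line-boundary problem can be reproduced in a few lines. The lemma lists below are simplified, illustrative stand-ins for lines 50 and 51, not the actual ETCSL lemmatization, and the part-of-speech tag of the second lemma is written in lower case as an assumption.

```python
from nltk.tokenize import MWETokenizer

# Simplified, illustrative lemma lists for two consecutive lines
line_50 = ["ama[mother]n", "tuku[have]v/t", "ama[mother]n"]
line_51 = ["nita[male]n", "ak[do]v/t", "a[arm]n", "ak[do]v/t"]

# A tokenizer that knows a single MWE (joined with an underscore)
tokenizer = MWETokenizer([("ama[mother]n", "nita[male]n")], separator="_")

# Line by line: the MWE is not found, because it would straddle the line break
per_line = [tokenizer.tokenize(line) for line in (line_50, line_51)]

# Whole text at once: the end of line 50 and the start of line 51 are joined
whole_text = tokenizer.tokenize(line_50 + line_51)

print(per_line)
print(whole_text)
```

Only the second call produces the spurious expression, which is why marking MWEs line by line (as in the previous notebook) is the safer choice for this corpus.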

The CountVectorizer is now applied to the corpus and the result is transformed into a new Pandas DataFrame.

In [None]:
cv = CountVectorizer(tokenizer=lambda x: x, preprocessor=lambda x: x, vocabulary=lex_vocab, binary=True)

dtm = cv.fit_transform(etcsl_comp['lemma_mwe'])
# get_feature_names_out() requires scikit-learn >= 1.0
etcsl_df = pd.DataFrame(dtm.toarray(), columns=cv.get_feature_names_out(), index=etcsl_comp["id_text"])
etcsl_df

### Remove empty columns
The DataFrame etcsl_df has a column for every lemma/expression in the *lexical* corpus. As we have seen in the previous notebook, many of these words/expressions do not appear in the [ETCSL](http://etcsl.orinst.ox.ac.uk) corpus, and thus all cells in that column are 0, taking up unused space.

# Note:
This step has been removed because the empty cols do not hinder anything. Do discuss the fact that there are many columns empty and that words that are not represented in lex are not represented in the DTM

In [None]:
# etcsl_df = etcsl_df.loc[: , etcsl_df.sum(axis=0) > 0].copy()

Next we compute the number of lexical/literary matches per literary composition. Since the DTM was built with the option `binary = True`, the sum of each row equals the number of unique words/expressions that the composition shares with the lexical corpus. The sum is restricted to the columns named in `lex_vocab`, which are the original column headers of the DTM. This keeps the new column `n_matches` itself out of the calculation: if the cell is executed a second time, summing over all columns would add the existing `n_matches` value to the total.

In [None]:
etcsl_df["n_matches"] = etcsl_df[lex_vocab].sum(axis=1, numeric_only=True)
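The following toy example sketches both points: with a fixed vocabulary and `binary=True` the row sum counts shared types, and restricting the sum to the vocabulary columns makes the cell safe to re-run. All names and data (`toy_vocab`, `toy_comps`, `toy_df`) are invented, and `token_pattern=None` is an added assumption.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

toy_vocab = ["ama[mother]n", "e[house]n", "gud[ox]n"]   # stand-in for lex_vocab
toy_comps = {
    "toy.1": ["e[house]n", "e[house]n", "lugal[king]n"],  # lugal is not in the vocabulary
    "toy.2": ["ama[mother]n", "e[house]n"],
}

cv = CountVectorizer(tokenizer=lambda x: x, preprocessor=lambda x: x,
                     token_pattern=None, vocabulary=toy_vocab, binary=True)
dtm = cv.fit_transform(list(toy_comps.values()))
toy_df = pd.DataFrame(dtm.toarray(), columns=toy_vocab, index=list(toy_comps))

# Executing the assignment twice gives the same result both times
for _ in range(2):
    toy_df["n_matches"] = toy_df[toy_vocab].sum(axis=1)

# Summing over *all* columns would now include n_matches itself
naive_total = toy_df.sum(axis=1)

print(toy_df)
print(naive_total)
```

The word `lugal[king]n` has no column at all, because words that are absent from the lexical vocabulary are not represented in this DTM.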

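Add the metadata columns from `etcsl_comp` by using `merge`. The method is "inner", so only compositions that are present in both DataFrames are kept. Because the short compositions were already removed from `etcsl_comp` in section 3.2.1, they do not come back.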

In [None]:
etcsl_df2 = pd.merge(etcsl_comp[["id_text", "text_name", "length", "ttr", "lex_var"]], etcsl_df["n_matches"], on="id_text", how="inner")
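A small sketch of how an inner merge behaves, with invented data. The Series `toy_matches` is indexed by `id_text`, like `etcsl_df["n_matches"]`, and deliberately lacks one of the compositions listed in `toy_meta`.

```python
import pandas as pd

# Invented metadata for three toy compositions
toy_meta = pd.DataFrame({
    "id_text": ["toy.1", "toy.2", "toy.3"],
    "length": [250, 900, 40],
})

# Match counts for only two of them, indexed by id_text
toy_matches = pd.Series(
    [30, 120],
    index=pd.Index(["toy.1", "toy.2"], name="id_text"),
    name="n_matches",
)

# Inner merge: keep only the compositions found on both sides
toy_merged = pd.merge(toy_meta, toy_matches, on="id_text", how="inner")
print(toy_merged)
```

The row for `toy.3` is dropped because it has no counterpart in `toy_matches`; with `how="left"` it would be kept with a missing value in `n_matches`.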

In [None]:
etcsl_df2 = etcsl_df2.sort_values(by = "n_matches", na_position="first", ascending=False)
etcsl_df2.head()

# Discussion
The Gudea Cylinders and Lugal-e (or Ninurta's Exploits) have the highest number of matches (669 and 624) with the lexical texts chosen. But those are also the two longest compositions in the corpus. We can normalize by dividing the total number of matches by text length, and then order again on the normalized match (`norm1`). Since, however, we are working with a *binary* model of our data (presence/absence of a lemma), it may make more sense to divide the number of matches by the number of unique lemmas in the text (`norm2`).

In [None]:
etcsl_df2["norm1"] = etcsl_df2["n_matches"] / etcsl_df2["length"]
etcsl_df2["norm2"] = etcsl_df2["n_matches"] / etcsl_df2["lex_var"]
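How the choice of normalization can change the ranking is easiest to see with two invented texts. The numbers below are made up for illustration and do not come from the corpus.

```python
import pandas as pd

# Invented figures for one long and one short toy text
toy = pd.DataFrame({
    "text_name": ["long text", "short text"],
    "length": [4000, 400],      # number of tokens
    "lex_var": [800, 200],      # number of unique lemmas
    "n_matches": [600, 120],    # unique lemmas shared with the lexical corpus
})

toy["norm1"] = toy["n_matches"] / toy["length"]    # matches per token
toy["norm2"] = toy["n_matches"] / toy["lex_var"]   # share of types that match

top_raw = toy.sort_values("n_matches", ascending=False)["text_name"].iloc[0]
top_norm1 = toy.sort_values("norm1", ascending=False)["text_name"].iloc[0]
top_norm2 = toy.sort_values("norm2", ascending=False)["text_name"].iloc[0]

print(toy)
print(top_raw, top_norm1, top_norm2, sep=" | ")
```

In this toy case the long text leads on raw matches and on `norm2`, while the short text leads on `norm1`, because dividing by token count penalizes texts that repeat the same lemmas many times.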

In [None]:
anchor = '<a href="http://etcsl.orinst.ox.ac.uk/cgi-bin/etcsl.cgi?text={}&display=Crit&charenc=gcirc#" target="_blank">{}</a>'
etcsl = etcsl_df2.copy() # work on a copy so that etcsl_df2 keeps the plain text IDs
etcsl['id_text'] = [anchor.format(val,val) for val in etcsl['id_text']]

In [None]:
@interact(col = etcsl.columns)
def sort_df(col = "norm2", ascending = False):
    # Styler.hide(axis="index") requires pandas >= 1.4; it replaces hide_index()
    return etcsl.sort_values(by = col, ascending = ascending).style.hide(axis="index")

In [None]:
interact(sort_df, col = etcsl_df2.columns)