# Digging Deeper
To study the relationship between lexical and literary material in more detail, we first organize both corpora in a Document Term Matrix (DTM). A DTM is a table in which each row is a document (in our case, a lexical or literary text) and each column represents a lemma. Each cell indicates how many times that lemma appears in that document.
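
As a minimal sketch of what such a table looks like, the following builds a DTM from two made-up documents. The document names, the lemmas, and the helper `identity` are illustrative assumptions, not part of the corpus or of the notebook's own code.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical, already lemmatized "documents" (lists of lemmas).
docs = {
    "doc_a": ["lugal[king]n", "e[house]n", "lugal[king]n"],
    "doc_b": ["e[house]n", "dumu[child]n"],
}

def identity(tokens):
    # The input is already tokenized, so return it unchanged.
    return tokens

cv = CountVectorizer(tokenizer=identity, preprocessor=identity, token_pattern=None)
matrix = cv.fit_transform(list(docs.values()))

# vocabulary_ maps each lemma to its column index; sort lemmas by that index.
columns = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
toy_dtm = pd.DataFrame(matrix.toarray(), index=list(docs), columns=columns)
print(toy_dtm)
```

Each row is one document, each column one lemma, and the cell for `doc_a` and `lugal[king]n` holds the count 2.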

Both corpora are currently organized by line. The `aggregate` function assembles the lines that belong to a single composition. The resulting dataframe has 394 entries for ETCSL, one for each composition. For the lexical material we select the most important lexical compositions from Nippur in their standard format (composite texts).

In [None]:
import pandas as pd
from ipywidgets import interact
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import MWETokenizer

# Literary: By Composition
For the literary corpus we can take the line-by-line representation that was prepared in the previous section in the DataFrame `etcsl_lines`. This DataFrame includes the column "lemma_mwe" in which each line is represented as a list of lemmas and/or Multiple Word Expressions (lemmas connected by underscores). The `pandas` `groupby()` function is used to group on "id_text" and "text_name". The aggregate function for the "lemma_mwe" column in this case is simply `sum`: all the lists (representing lines) are added up to form one long list of lemmas representing one composition.
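
The effect of grouping lines into compositions can be illustrated on a toy DataFrame. The identifiers and lemmas below are made up, and the sketch flattens the lists with a list comprehension instead of `sum`; that is an alternative formulation that should give the same concatenated list per composition, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical line-level data: two lines for text "t1", one line for "t2".
lines = pd.DataFrame({
    "id_text": ["t1", "t1", "t2"],
    "text_name": ["Text One", "Text One", "Text Two"],
    "lemma_mwe": [["a", "b"], ["c"], ["d", "e"]],
})

# Concatenate the per-line lists of each group into one list per composition.
comp = (
    lines.groupby(["id_text", "text_name"])["lemma_mwe"]
    .agg(lambda lists: [lemma for line in lists for lemma in line])
    .reset_index()
)
print(comp)
```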

In [None]:
etcsl_lines = pd.read_pickle('output/etcsllines.p')
lex_lines = pd.read_pickle('output/lexlines.p')

In [None]:
etcsl_comp = etcsl_lines.groupby(
    [etcsl_lines["id_text"], etcsl_lines["text_name"]]).aggregate(
    {"lemma_mwe": sum}).reset_index()
etcsl_comp[25:35]

The result is a DataFrame with three columns: `id_text`, `text_name`, and `lemma_mwe`. Each row represents a literary composition from the ETCSL collection. Each cell in the column `lemma_mwe` contains a list with all the lemmas of one composition (with MWEs connected by underscores).

### Special Case: OB Nippur Ura 6
The sixth chapter of the Old Babylonian Nippur version of the thematic list Ura deals with foodstuffs and drinks. This chapter was not standardized (each exemplar has its own order of items and sections) and therefore no composite text has been created in [DCCLT](http://oracc.org/dcclt). Instead, the "composite" of [OB Nippur Ura 6](http://oracc.org/dcclt/Q000043) consists of the concatenation of all known Nippur exemplars of the list of foodstuffs. In our current dataframe, therefore, there are no lines where the field `id_text` equals "dcclt/Q000043".

We create a "composite" by changing the field `id_text` in all exemplars of [OB Nippur Ura 6](http://oracc.org/dcclt/Q000043) to "dcclt/Q000043". 

In [None]:
Ura6 = ["dcclt/P227657",
"dcclt/P227743",
"dcclt/P227791",
"dcclt/P227799",
"dcclt/P227925",
"dcclt/P227927",
"dcclt/P227958",
"dcclt/P227967",
"dcclt/P227979",
"dcclt/P228005",
"dcclt/P228008",
"dcclt/P228200",
"dcclt/P228359",
"dcclt/P228368",
"dcclt/P228488",
"dcclt/P228553",
"dcclt/P228562",
"dcclt/P228663",
"dcclt/P228726",
"dcclt/P228831",
"dcclt/P228928",
"dcclt/P229015",
"dcclt/P229093",
"dcclt/P229119",
"dcclt/P229304",
"dcclt/P229332",
"dcclt/P229350",
"dcclt/P229351",
"dcclt/P229352",
"dcclt/P229353",
"dcclt/P229354",
"dcclt/P229356",
"dcclt/P229357",
"dcclt/P229358",
"dcclt/P229359",
"dcclt/P229360",
"dcclt/P229361",
"dcclt/P229362",
"dcclt/P229365",
"dcclt/P229366",
"dcclt/P229367",
"dcclt/P229890",
"dcclt/P229925",
"dcclt/P230066",
"dcclt/P230208",
"dcclt/P230230",
"dcclt/P230530",
"dcclt/P230586",
"dcclt/P231095",
"dcclt/P231128",
"dcclt/P231424",
"dcclt/P231446",
"dcclt/P231453",
"dcclt/P231458",
"dcclt/P231742",
"dcclt/P266520"]
lex_lines.loc[lex_lines["id_text"].isin(Ura6), "id_text"] = "dcclt/Q000043"

### 2.1 Select Lexical Compositions
Select the following compositions: 
* Ura 1 dcclt/Q000039
* Ura 2 dcclt/Q000040
* Ura 3 dcclt/Q000001
* Ura 4 dcclt/Q000041
* Ura 5 dcclt/Q000042
* Ura 6 dcclt/Q000043
* Lu₂-Azlag₂ B/C dcclt/Q000302
* Ugumu dcclt/Q002268
* Diri dcclt/Q000057
* Nigga dcclt/Q000052
* Izi dcclt/Q000050
* Kagal dcclt/Q000048
* Lu dcclt/Q000047
* Ea dcclt/Q000055

In [None]:
keep = {"dcclt/Q000039" : "OB Ura 1", 
    "dcclt/Q000040" : "OB Ura 2",
    "dcclt/Q000001" : "OB Ura 3",
    "dcclt/Q000041" : "OB Ura 4",
    "dcclt/Q000042" : "OB Ura 5",
    "dcclt/Q000043" : "OB Ura 6",
    "dcclt/Q000302" : "Lu-azlag",
    "dcclt/Q002268" : "Ugumu",
    "dcclt/Q000057" : "OB Nippur Diri",
    "dcclt/Q000052" : "Nigga",
    "dcclt/Q000050" : "OB Nippur Izi",
    "dcclt/Q000048" : "OB Nippur Kagal",
    "dcclt/Q000047" : "OB Nippur Lu", 
    "dcclt/Q000055" : "OB Nippur Ea"}
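# .copy() makes the selection an independent DataFrame, so that adding
# columns to it later does not trigger a SettingWithCopyWarning.
lex_lines = lex_lines.loc[lex_lines["id_text"].isin(keep)].copy()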

Add a column `lemma_mwe` to the `lex_lines` DataFrame, connecting all lemmas in a lexical entry by an underscore.

# NOTE: Something is wrong with def of lex_vocab

In [None]:
lex_lines["lemma_mwe"] = ["_".join(entry) for entry in lex_lines["lemma"]]
# Collect the distinct lexical entries, skipping those that contain '[na]na'.
lex_vocab = {lemma for lemma in lex_lines["lemma_mwe"] if '[na]na' not in lemma}
lex_vocab = list(lex_vocab) # lex_vocab is needed for CountVectorizer
lex_lines[:10]

Group the lexical data by lexical composition.

In [None]:
lex_comp = lex_lines.groupby(
    [lex_lines["id_text"]]).aggregate(
    {"lemma_mwe": list}).reset_index()

Add names of lexical compositions, by using the dictionary `keep`.

In [None]:
lex_comp["text_name"] = [keep[t_id] for t_id in lex_comp["id_text"]]
lex_comp

Probably better *not* to concatenate lex_comp and etcsl_comp.

* Step 1. Measure length of lemma_mwe in etcsl_comp and remove rows with len < 50.
* Step 2. Create DTM (see below) of etcsl_comp, binary = True and vocabulary = lemma_mwe from lex (use lex_lines)
* Step 3. Order compositions by highest match
* Step 4. Normalize for text length (from Step 1)
* Step 5. Same process for individual lex texts (which has highest match for Ura 4?)
* Step 6. TF-IDF
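
Step 6 is not implemented in this notebook. One possible sketch, assuming the same dummy tokenizer/preprocessor approach and a fixed vocabulary, uses `TfidfVectorizer` from `sklearn`. The toy texts, the toy vocabulary, and the helper `identity` are hypothetical stand-ins for `etcsl_comp["lemma_mwe"]` and `lex_vocab`.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def identity(tokens):
    # Input is already a list of lemmas; pass it through unchanged.
    return tokens

# Hypothetical stand-ins for the literary compositions and the lexical vocabulary.
toy_texts = {
    "text_a": ["lugal[king]n", "e[house]n", "lugal[king]n", "ud[sun]n"],
    "text_b": ["e[house]n", "dumu[child]n", "e[house]n"],
}
toy_vocab = sorted({"lugal[king]n", "e[house]n", "dumu[child]n"})

# With a list as vocabulary, the columns follow the order of that list.
tv = TfidfVectorizer(tokenizer=identity, preprocessor=identity,
                     token_pattern=None, vocabulary=toy_vocab)
weights = tv.fit_transform(list(toy_texts.values()))
tfidf = pd.DataFrame(weights.toarray(), index=list(toy_texts), columns=toy_vocab)
print(tfidf.round(2))
```

Lemmas outside the vocabulary (here `ud[sun]n`) are ignored, and a lemma that occurs in every text (`e[house]n`) receives a lower weight than one that is specific to a single text.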

In a future iteration: do *not* select among lexical texts - let the script figure out which lex compositions are most relevant.

Perhaps: make DTM first - show that DTM.shape gives same numbers for lex vocabulary as second Venn diagram above. Remove all columns where sum == 0. Show that DTM.shape now gives total of overlap as in Venn diagram above. Then remove rows <= minimum. Tricky!

In [None]:
minimum = 200
etcsl_comp["length"] = [len(lemmas) for lemmas in etcsl_comp["lemma_mwe"]]
etcsl_comp["lex_var"] = [len(set(lemmas)) for lemmas in etcsl_comp["lemma_mwe"]]
etcsl_comp["ttr"] = [len(set(lemmas))/len(lemmas) for lemmas in etcsl_comp["lemma_mwe"]]
etcsl_comp = etcsl_comp.loc[etcsl_comp.length >= minimum]
etcsl_comp[25:35]

# Document Term Matrix

**adjust text to new DTM**

The corpus is transformed into a Document Term Matrix (or DTM) in which each word (or expression) is a column and each row a Sumerian composition. Each cell is a number that indicates how often a particular word appears in a particular composition. A matrix can be thought of as a collection of vectors. Each row represents a vector that characterizes a composition through its vocabulary; each column is a vector that characterizes a word through its usage in different compositions. The process is therefore also referred to as "vectorizing" a corpus of texts. Once vectorized we can use all of matrix and vector mathematics to analyze the data - for instance by computing the cosine similarity between two compositions.
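
As an illustration of the vector view, cosine similarity can be computed directly on the rows of a DTM. The matrix below is a made-up example with three "compositions" and four "lemmas"; it only sketches the idea and uses none of the corpus data.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical DTM: 3 compositions (rows) x 4 lemmas (columns).
toy_dtm = np.array([
    [2, 1, 0, 0],
    [4, 2, 0, 0],  # same proportions as row 0, but twice as long
    [0, 0, 3, 1],  # shares no vocabulary with rows 0 and 1
])

# sim[i, j] is the cosine of the angle between row i and row j.
sim = cosine_similarity(toy_dtm)
print(sim.round(2))
```

Rows 0 and 1 point in the same direction, so their similarity is 1 regardless of text length; rows 0 and 2 share no vocabulary, so their similarity is 0.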

Since DTMs are very commonly used in computational text analysis, it is worth spending a bit more time on the various ways in which they can be created for cuneiform data. The function `CountVectorizer()` (from the `sklearn` package) is a very flexible tool with many possible parameters. How `CountVectorizer()` and its counterpart `TfidfVectorizer()` are used depends on the structure of the input data. The most common use case is a corpus of raw documents (probably in English), each of them consisting of a text string that needs to be pre-processed and tokenized before anything else can be done. Default pre-processing includes, for instance, lowercasing the entire text. Default tokenizers assume that the text is in a modern (western) language and take spaces and punctuation marks as word dividers. Cuneiform data, whether in transliteration, lemmatization, or normalization, is much simpler than most modern language texts, because the only type of word boundary is a space (or a sequence of spaces). When using `CountVectorizer()` on transliterated, lemmatized, or normalized text we can use the parameter `token_pattern = r'[^ ]+'`, meaning "any sequence of characters, except space."
```python
cv = CountVectorizer(token_pattern= r'[^ ]+')
```
A second situation is where the data is already in list format (already preprocessed and tokenized). All the [ORACC](http://oracc.org) and [ETCSL](http://etcsl.orinst.ox.ac.uk) data fall into that category. Rather than transforming the tokenized text back into raw strings and then tokenizing those strings again, we can use the parameters `tokenizer` and `preprocessor`. Each of these parameters takes a function as its value; the tokenizer function should return a list of tokens. If our input already is a list of tokens we can pass a dummy function, one that simply returns the list it receives, for both parameters.
```python
def dummy(tokens):
    return tokens
cv = CountVectorizer(tokenizer=dummy, preprocessor=dummy)
```
This prevents `CountVectorizer()` from using its default tokenizer and preprocessor (which do not accept list input) and saves the trouble of untokenizing and then tokenizing again (see the [blog post](http://www.davidsbatista.net/blog/2018/02/28/TfidfVectorizer/) on this subject by David Batista).

Finally, we can choose to use the `MWETokenizer()` discussed above (section ###). The `MWETokenizer()` expects a tokenized text (a list) and re-tokenizes that text by using a list of pre-defined Multiple Word Expressions, returning a new list. If we use the original [ETCSL](http://etcsl.orinst.ox.ac.uk) data, in which the MWEs have not yet been marked, we can vectorize the text and mark the MWEs in one go, as follows:
```python
def dummy(tokens):
    return tokens
tokenizer = MWETokenizer(lex_mwe) # initialize the tokenizer with the lexical MWEs
cv = CountVectorizer(tokenizer=tokenizer.tokenize, preprocessor=dummy)
```
For our current purposes the best approach is to use a dummy tokenizer and preprocessor. The disadvantage of using the MWETokenizer on entire texts is that it will not honor line boundaries. See, for instance, Gilgameš and Huwawa 50-51 (text and translation in [ETCSL](http://etcsl.orinst.ox.ac.uk/cgi-bin/etcsl.cgi?text=t.1.8.1.5&display=Crit&charenc=gcirc#)):

> ama tuku ama-a-ni-še₃
> nitah saŋ-dili ŋe₂₆-e-gin₇ ak a₂-ŋu₁₀-še₃ hu-mu-un-ak
> "Let him who has a mother go to his mother! 
> Let bachelor males, types like me, join me at my side!"

This will result in the Multiple Word Expression ama\[mother\]n_nita\[male\]N, an expression found in the list of human beings Lu ([OB Nippur Lu](http://oracc.org/dcclt/Q000047.351)), which is clearly not applicable here. The number of such errors is fairly small (about 6 for a corpus of almost 400 texts). For other types of texts, where line boundaries are less significant, this method may well be an efficient way of doing things.
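
The line-boundary problem can be reproduced on a small scale. The sketch below uses a simplified, hypothetical lemmatization of the two lines (the lemma strings are illustrative, not the actual ETCSL lemmatization) and an MWE list with a single entry.

```python
from nltk.tokenize import MWETokenizer

# Hypothetical MWE list with a single two-lemma lexical entry.
tokenizer = MWETokenizer([("ama[mother]n", "nita[male]n")])

# Simplified, hypothetical lemmatization of two consecutive lines.
line_1 = ["ama[mother]n", "tuku[have]v", "ama[mother]n"]
line_2 = ["nita[male]n", "dili[single]n"]

# Tokenizing per line: the two lemmas stay separate, because they
# belong to different lines.
by_line = tokenizer.tokenize(line_1) + tokenizer.tokenize(line_2)

# Tokenizing the whole text: the line boundary is invisible, so the last
# lemma of line 1 and the first lemma of line 2 are joined into an MWE.
whole = tokenizer.tokenize(line_1 + line_2)

print(by_line)
print(whole)
```

Tokenizing line by line keeps five separate lemmas, whereas tokenizing the concatenated text yields four tokens, one of them the spurious expression `ama[mother]n_nita[male]n`.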

The CountVectorizer is now applied to the corpus and the result is transformed into a new Pandas DataFrame.

In [None]:
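def dummy(tokens):
    return tokens

# list.sort() sorts in place and returns None, so vocabulary=lex_vocab.sort()
# would make CountVectorizer ignore lex_vocab; sorted() returns the sorted list.
cv = CountVectorizer(tokenizer=dummy, preprocessor=dummy, vocabulary=sorted(lex_vocab), binary=True)

dtm = cv.fit_transform(etcsl_comp['lemma_mwe'])
# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is available from version 1.0.
etcsl_df = pd.DataFrame(dtm.toarray(), columns=cv.get_feature_names_out(), index=etcsl_comp["id_text"])
etcsl_df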

In [None]:
etcsl_df = etcsl_df.loc[: , etcsl_df.sum(axis=0) > 0]

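Number of lexical/literary matches per literary composition. Note that because `binary = True`, each lexical entry is counted at most once per composition, so `n_matches` is the number of *distinct* lexical entries attested in a composition, not the number of their occurrences.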
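
The difference that the `binary` parameter makes can be seen in a minimal sketch with one made-up text and a made-up three-entry vocabulary (the names `toy_text`, `toy_vocab`, `identity`, and `row_total` are illustrative only).

```python
from sklearn.feature_extraction.text import CountVectorizer

def identity(tokens):
    # Input is already a list of lemmas; pass it through unchanged.
    return tokens

# One hypothetical composition and a hypothetical lexical vocabulary.
toy_text = [["e[house]n", "e[house]n", "lugal[king]n", "ud[sun]n"]]
toy_vocab = ["e[house]n", "lugal[king]n", "dumu[child]n"]

def row_total(binary):
    # Sum of the single row of the DTM, with or without binary counting.
    cv = CountVectorizer(tokenizer=identity, preprocessor=identity,
                         token_pattern=None, vocabulary=toy_vocab, binary=binary)
    return int(cv.fit_transform(toy_text).sum())

counts_total = row_total(binary=False)  # occurrences of vocabulary items
binary_total = row_total(binary=True)   # distinct vocabulary items attested
print(counts_total, binary_total)
```

With raw counts the row sums to 3 (two occurrences of `e[house]n` plus one of `lugal[king]n`); with `binary=True` it sums to 2, the number of distinct vocabulary items found in the text.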

In [None]:
etcsl_df["n_matches"] = etcsl_df.sum(axis=1, numeric_only=True)

Add columns from etcsl_comp by using merge. Method is "inner" so that the short compositions (which are in etcsl_comp but not in etcsl_df) do not come back.

In [None]:
etcsl_df2 = pd.merge(etcsl_comp[["id_text", "text_name", "length", "ttr", "lex_var"]], etcsl_df["n_matches"], on="id_text", how="inner")

In [None]:
etcsl_df2 = etcsl_df2.sort_values(by = "n_matches", na_position="first", ascending=False)
etcsl_df2

# Discussion
The Gudea Cylinders and Lugal-e (or Ninurta's Exploits) have the highest number of matches (669 and 624) with the lexical texts chosen. But those are also the two longest compositions in the corpus. We can normalize by dividing the total number of matches by text length, and then order again on the normalized match.
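
To see why normalization can change the ranking, consider a toy table with invented numbers (they are not taken from the corpus): a long text with many matches and a short text with fewer matches.

```python
import pandas as pd

# Invented figures for illustration only.
toy = pd.DataFrame({
    "text_name": ["long text", "short text"],
    "n_matches": [600, 150],
    "length": [6000, 500],   # number of lemmas in the text
    "lex_var": [1200, 250],  # number of distinct lemmas in the text
})

# Matches per lemma token and matches per distinct lemma.
toy["norm1"] = toy["n_matches"] / toy["length"]
toy["norm2"] = toy["n_matches"] / toy["lex_var"]

ranked = toy.sort_values(by="norm2", ascending=False)
print(ranked)
```

In absolute terms the long text has four times as many matches, but after either normalization the short text ranks higher.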

In [None]:
etcsl_df2["norm1"] = etcsl_df2["n_matches"] / etcsl_df2["length"]
etcsl_df2["norm2"] = etcsl_df2["n_matches"] / etcsl_df2["lex_var"]

In [None]:
@interact(col = etcsl_df2.columns)
def sort_df(col = "norm2"):
    return etcsl_df2.sort_values(by = col)

In [None]:
interact(sort_df, col = etcsl_df2.columns)