# 3.3 Looking at the Lexical Vocabulary from the Perspective of the Literary Material

In section 3.2 we asked whether we can see differences between Old Babylonian literary compositions in their usage of vocabulary (lemmas and MWEs) attested in the lexical corpus. In this notebook we will change perspective and ask: are there particular lexical texts (or groups of lexical texts) that show a greater affinity with literary vocabulary than others?

In 3.1 and 3.2 we used Multiple Word Expressions (MWEs), connecting the words found in a lexical entry with underscores (using `MWETokenizer()` from the nltk module). The lemmas and MWEs were fed into `CountVectorizer()` to create a Document Term Matrix.

In this notebook we will use the ngram option of the `CountVectorizer()` function in order to find sequences of lemmas that are shared between lexical and literary texts. An ngram is a contiguous sequence of *n* words (or lemmas). Any sequence of *n* words forms an ngram, but only sequences that are more or less standardized will appear with some frequency.

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning) # this suppresses a warning about pandas from tqdm
import pandas as pd
from ipywidgets import interact
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter
import zipfile
import json
from tqdm.auto import tqdm
tqdm.pandas() # initiate pandas support in tqdm, allowing progress_apply() and progress_map()
import os
import sys
util_dir = os.path.abspath('../utils')
sys.path.append(util_dir)
from lexicalrichness_v import LexicalRichness as lr

Open the files `litlines.p` and `lexlines.p` which were produced in [3_1_Lit_Lex_Vocab.ipynb](./3_1_Lit_Lex_Vocab.ipynb). The files contain the pickled versions of the DataFrames `lit_lines` and `lex_lines` in which the literary ([epsd2/literary](http://oracc.org/epsd2/literary)) and lexical ([dcclt](http://oracc.org/dcclt)) corpora are represented in line-by-line format.

In [None]:
lit_lines = pd.read_pickle('output/litlines.p')
lex_lines = pd.read_pickle('output/lexlines.p')

Get a list of all lexical vocabulary items, including both lexical *entries* and all the individual lemmas.

In [None]:
vocab = list(lex_lines['lemma'])
vocab2 = [v.split() for v in vocab] # this creates a list of lists
vocab2 = [item for sublist in vocab2 for item in sublist] # flatten the list of lists
vocab_s = set(vocab) | set(vocab2) # join the lexical entries (vocab) and the individual words (vocab2)
vocab_l = list(vocab_s) 
vocab_l = [v for v in vocab_l if '[na]na' not in v] # remove unlemmatized words and entries that contain
                                                    # unlemmatized words.
vocab_l.sort()
length = [len(v.split()) for v in vocab_l] # determine the length (in words) of lexical entries
m = max(length) # determine the maximum length of a lexical entry
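
The steps in the cell above can be followed on a handful of entries. The sketch below uses made-up toy data (the names `toy_entries`, `toy_words`, `toy_vocab` and `toy_m` are hypothetical and not part of the notebook) and assumes the same lemma format as `lex_lines['lemma']`.

```python
# Toy lexical entries (hypothetical data, same lemma format as above)
toy_entries = [
    "amar[young]n ga[milk]n gu[eat]v/t",
    "amar[young]n ga[milk]n",
    "lugal[king]n",
    "x[na]na lugal[king]n",   # entry containing an unlemmatized word
]

# individual lemmas: split every entry and flatten the result
toy_words = [word for entry in toy_entries for word in entry.split()]

# union of full entries and individual lemmas, minus anything unlemmatized
toy_vocab = sorted(v for v in set(toy_entries) | set(toy_words)
                   if "[na]na" not in v)

# the longest entry (in words) gives the upper bound for the ngram range
toy_m = max(len(v.split()) for v in toy_vocab)

print(toy_vocab)
print(toy_m)
```

The entry with an unlemmatized word is dropped as a whole, while its lemmatized word survives because it is also attested on its own.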

# CountVectorizer with Ngrams
We will use `CountVectorizer()` slightly differently from how it was used in [3_2_Lit_Lex.ipynb](./3_2_Lit_Lex.ipynb). The main difference is the inclusion of the argument `ngram_range = (1, m)`, where `m` represents the maximum length of a lexical entry, computed above. `CountVectorizer()` will count the number of times each word appears (unigram; n=1), but also the number of times each sequence of two words (or rather two lemmas) appears (bigram; n=2), etc. The sentence **inana-ra lugal-e e₂-a-ni mu-un-du₃** ("The king built her temple for Inana") lemmatized as Inana\[1\]dn lugal\[king\]n e\[house\]n du\[build\]v/t, will be represented as:

| type             | representation  |
|------------------|-----------------|
| unigram        | Inana\[1\]dn |
|                    | lugal\[king\]n |
|                    | e\[house\]n |
|                    | du\[build\]v/t |
| bigram             | Inana\[1\]dn lugal\[king\]n |
|                    | lugal\[king\]n e\[house\]n |
|                    | e\[house\]n du\[build\]v/t |
| trigram | Inana\[1\]dn lugal\[king\]n e\[house\]n |
|                     | lugal\[king\]n e\[house\]n du\[build\]v/t |
| 4-gram            | Inana\[1\]dn lugal\[king\]n e\[house\]n du\[build\]v/t |

This four-word sentence thus results in 10 columns in the Document Term Matrix for `m=4`. We will use `CountVectorizer()` on the representation of literary texts in *lines* so that the ngrams do not extend beyond the end of a line. Afterwards, lines are combined into compositions. 
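
The count of 10 can be checked with a few lines of plain Python. This is only an illustrative sketch: the helper `ngrams()` is hypothetical and is not how `CountVectorizer()` is implemented internally.

```python
def ngrams(line, max_n):
    """Return all ngrams (n = 1..max_n) of a whitespace-separated line."""
    words = line.split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]

sentence = "Inana[1]dn lugal[king]n e[house]n du[build]v/t"
grams = ngrams(sentence, 4)
print(len(grams))   # 4 unigrams + 3 bigrams + 2 trigrams + 1 four-gram

# Processing line by line means that no ngram spans a line break
two_lines = ["Inana[1]dn lugal[king]n", "e[house]n du[build]v/t"]
per_line = [g for line in two_lines for g in ngrams(line, 4)]
print("lugal[king]n e[house]n" in per_line)
```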

`CountVectorizer()` uses the vocabulary `vocab_l`, which contains all lexical lemmas and lexical entries (produced above). Lemmas and ngrams not found in `vocab_l` are skipped.
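
The effect of a fixed vocabulary can be sketched in plain Python as follows. The function `count_vocab_ngrams()` and the three-item `toy_vocab` are hypothetical; the sketch only mimics the counting behaviour described here and assumes whitespace tokenization, as in the cell below.

```python
from collections import Counter

def count_vocab_ngrams(line, vocab, max_n):
    """Count the ngrams (n = 1..max_n) of a line, keeping only those in vocab."""
    words = line.split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            gram = " ".join(words[i:i + n])
            if gram in vocab:          # ngrams outside the vocabulary are skipped
                counts[gram] += 1
    return counts

# hypothetical mini-vocabulary of lexical lemmas and entries
toy_vocab = {"lugal[king]n", "e[house]n", "e[house]n du[build]v/t"}
line = "Inana[1]dn lugal[king]n e[house]n du[build]v/t"
row = count_vocab_ngrams(line, toy_vocab, 2)
print(row)
```

Each such `row` corresponds to one row of the Document Term Matrix, with one column per vocabulary item.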

In [None]:
cv = CountVectorizer(tokenizer=lambda x: x.split(), preprocessor=lambda x: x,
                     ngram_range=(1, m), vocabulary=vocab_l)
dtm = cv.fit_transform(lit_lines['lemma'])
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
df = pd.DataFrame(dtm.toarray(), columns=cv.get_feature_names_out(), index=lit_lines["id_text"])
df

> # Note
It may be more efficient to remove zero-columns from the sparse matrix before converting it to a DataFrame. This is possible but tricky, because the column names must be filtered in the same way.
1. remove the zero columns
```python
import numpy as np
nonzeros = np.unique(dtm.nonzero()[1])  # indices of columns with at least one nonzero value
dtm = dtm[:, nonzeros]
```
2. keep only the matching column names
```python
col = cv.get_feature_names_out()
col = [col[i] for i in nonzeros]
```


# Combine Lines into Compositions
Combine lines into compositions with groupby and aggregate. Note that the number of columns is identical to the number of lexical items in the Venn diagram in 3.1.

In [None]:
df = df.groupby(['id_text']).sum()
df.shape
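
What the aggregation does can be pictured in plain Python: the counts of all lines that share an `id_text` are added up. The text IDs and counts below are hypothetical toy data, and the sketch is only an analogue of the pandas `groupby` call, not a replacement for it.

```python
from collections import Counter, defaultdict

# hypothetical per-line counts: (id_text, counts for that line)
line_counts = [
    ("text_A", Counter({"lugal[king]n": 1})),
    ("text_A", Counter({"lugal[king]n": 2, "e[house]n": 1})),
    ("text_B", Counter({"e[house]n": 1})),
]

# add up the line-level counts for every composition
comp_counts = defaultdict(Counter)
for id_text, counts in line_counts:
    comp_counts[id_text] += counts

for id_text, counts in comp_counts.items():
    print(id_text, dict(counts))
```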

# Remove Empty Columns
Lemmas and entries in `vocab_l` (the lexical vocabulary) that have no match in the literary corpus are represented in the DataFrame with a column filled with zeros; such columns are removed.

In [None]:
df = df.loc[: , df.sum(axis=0) != 0].copy()
df.shape

One may notice that the shape of the DataFrame suggests that the outcome here is slightly different from what we saw in 3.1. The number of columns represents the number of shared entries (lemmas, ngrams) between the lexical and the literary corpus. The intersection in the Venn diagram in 3.1 had a slightly lower number.

The difference is a difference in approach: Multiple Word Expressions vs. ngrams. In 3.1 (and 3.2) we used MWEs, connecting words with underscores. In the MWE approach the literary line 'amar\[young\]n_ga\[milk\]n_gu\[eat\]v/t' matches the lexical entry 'amar\[young\]n_ga\[milk\]n_gu\[eat\]v/t' but not 'amar\[young\]n_ga\[milk\]n'. In the ngram approach this will yield matches for unigrams, bigrams and a trigram, namely: 'amar\[young\]n', 'ga\[milk\]n', 'gu\[eat\]v/t', 'amar\[young\]n ga\[milk\]n', and 'amar\[young\]n ga\[milk\]n gu\[eat\]v/t' (the bigram 'ga\[milk\]n gu\[eat\]v/t' is not a lexical entry and will therefore not result in a match).
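
The contrast can be made concrete with the example just discussed. In the sketch below, `mwe_matches()` is a simplified greedy longest-match stand-in for the MWE tokenization used in 3.1 (an assumption for illustration, not nltk's implementation), and `ngram_matches()` mimics the ngram approach of this notebook. Both function names and the toy set of entries are hypothetical.

```python
def ngram_matches(words, entries, max_n):
    """All ngrams of the line (n = 1..max_n) that are lexical entries."""
    return [tuple(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)
            if tuple(words[i:i + n]) in entries]

def mwe_matches(words, entries, max_n):
    """Greedy longest-match tokenization: each word ends up in exactly one token."""
    tokens, i = [], 0
    while i < len(words):
        for n in range(min(max_n, len(words) - i), 0, -1):
            # fall back to a single word when no longer entry matches
            if n == 1 or tuple(words[i:i + n]) in entries:
                tokens.append(tuple(words[i:i + n]))
                i += n
                break
    return [t for t in tokens if t in entries]

# toy lexical entries, stored as tuples of lemmas
entries = {
    ("amar[young]n",), ("ga[milk]n",), ("gu[eat]v/t",),
    ("amar[young]n", "ga[milk]n"),
    ("amar[young]n", "ga[milk]n", "gu[eat]v/t"),
}
line = "amar[young]n ga[milk]n gu[eat]v/t".split()

mwe = mwe_matches(line, entries, 3)
ngr = ngram_matches(line, entries, 3)
print(len(mwe), len(ngr))
```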

While this may seem like a big difference, in outcome the differences between these two approaches are minor because most literary ngrams do not result in a match with the lexical vocabulary.

# Read Lexical Corpus

In [None]:
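# regex=False matches the literal string '[na]na'; .copy() avoids a SettingWithCopyWarning when id_text is changed below
lex_lines = lex_lines.loc[~lex_lines.lemma.str.contains('[na]na', regex=False)].copy()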
lex_lines

### Special Case: OB Nippur Ura 6
The sixth chapter of the Old Babylonian Nippur version of the thematic list Ura deals with foodstuffs and drinks. This chapter was not standardized (each exemplar has its own order of items and sections) and therefore no composite text has been created in [DCCLT](http://oracc.org/dcclt). Instead, the "composite" of [OB Nippur Ura 6](http://oracc.org/dcclt/Q000043) consists of the concatenation of all known Nippur exemplars of the list of foodstuffs. In our current dataframe, therefore, there are no lines where the field `id_text` equals "dcclt/Q000043".

We create a "composite" by changing the field `id_text` in all exemplars of [OB Nippur Ura 6](http://oracc.org/dcclt/Q000043) to "dcclt/Q000043". 

In [None]:
Ura6 = ["dcclt/P227657",
"dcclt/P227743",
"dcclt/P227791",
"dcclt/P227799",
"dcclt/P227925",
"dcclt/P227927",
"dcclt/P227958",
"dcclt/P227967",
"dcclt/P227979",
"dcclt/P228005",
"dcclt/P228008",
"dcclt/P228200",
"dcclt/P228359",
"dcclt/P228368",
"dcclt/P228488",
"dcclt/P228553",
"dcclt/P228562",
"dcclt/P228663",
"dcclt/P228726",
"dcclt/P228831",
"dcclt/P228928",
"dcclt/P229015",
"dcclt/P229093",
"dcclt/P229119",
"dcclt/P229304",
"dcclt/P229332",
"dcclt/P229350",
"dcclt/P229351",
"dcclt/P229352",
"dcclt/P229353",
"dcclt/P229354",
"dcclt/P229356",
"dcclt/P229357",
"dcclt/P229358",
"dcclt/P229359",
"dcclt/P229360",
"dcclt/P229361",
"dcclt/P229362",
"dcclt/P229365",
"dcclt/P229366",
"dcclt/P229367",
"dcclt/P229890",
"dcclt/P229925",
"dcclt/P230066",
"dcclt/P230208",
"dcclt/P230230",
"dcclt/P230530",
"dcclt/P230586",
"dcclt/P231095",
"dcclt/P231128",
"dcclt/P231424",
"dcclt/P231446",
"dcclt/P231453",
"dcclt/P231458",
"dcclt/P231742",
"dcclt/P266520"]
lex_lines.loc[lex_lines["id_text"].isin(Ura6), "id_text"] = "dcclt/Q000043"

# Lexical Compositions

In [None]:
lex_comp = lex_lines.groupby(['id_text']).agg({'lemma': ' '.join}).reset_index()
lex_comp
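
The relabelling of exemplars and the joining of lines into compositions can be pictured with a small plain-Python analogue. All IDs and lines below are hypothetical toy data (they are not real DCCLT identifiers), and the sketch only mirrors the logic of the pandas calls above.

```python
from collections import defaultdict

# hypothetical line-by-line data: (id_text, lemmas of one line)
toy_lines = [
    ("dcclt/P000001", "ga[milk]n"),
    ("dcclt/P000002", "amar[young]n ga[milk]n"),
    ("dcclt/Q000001", "lugal[king]n"),
    ("dcclt/Q000001", "e[house]n"),
]
exemplars = {"dcclt/P000001", "dcclt/P000002"}   # hypothetical exemplar IDs
composite_id = "dcclt/Q999999"                    # hypothetical composite ID

# step 1: give every exemplar line the ID of the "composite"
relabelled = [(composite_id if id_text in exemplars else id_text, lemma)
              for id_text, lemma in toy_lines]

# step 2: join all lines with the same ID into one long string of lemmas
grouped = defaultdict(list)
for id_text, lemma in relabelled:
    grouped[id_text].append(lemma)
toy_comp = {id_text: " ".join(lemmas) for id_text, lemmas in grouped.items()}

print(toy_comp)
```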

Since the data are drawn from multiple (sub)projects, it is possible that there are duplicates. We take the version with the largest number of (lemmatized) words.

In [None]:
lex_comp['id_text'] = [i[-7:] for i in lex_comp['id_text']]
lex_comp['length'] = [len(lem.split()) for lem in lex_comp['lemma']]
lex_comp = lex_comp.sort_values(by = 'length', ascending = False)
lex_comp = lex_comp.drop_duplicates(subset = 'id_text', keep = 'first')
lex_comp
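
The deduplication logic can be checked on toy data. The sketch below is a plain-Python analogue under the same assumption as the cell above, namely that the last seven characters of `id_text` identify the text; the IDs, the subproject name and the lemma strings are hypothetical.

```python
# hypothetical compositions: (id_text with project prefix, lemma string)
toy_comp = [
    ("dcclt/subproject/Q000001", "amar[young]n ga[milk]n"),
    ("dcclt/Q000001", "amar[young]n ga[milk]n gu[eat]v/t"),
    ("dcclt/P000002", "lugal[king]n"),
]

best = {}
for id_text, lemma in toy_comp:
    short_id = id_text[-7:]                      # strip the (sub)project prefix
    length = len(lemma.split())                  # number of lemmatized words
    # keep a version only if it is the first or longer than the one stored
    if short_id not in best or length > len(best[short_id].split()):
        best[short_id] = lemma

print(best)
```

When two versions have the same length, this sketch keeps the one encountered first; which one the pandas version keeps depends on the sort order.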