# Code used to select OB Nippur lex compositions.

In [None]:
import pandas as pd

In [None]:
lex_lines = pd.read_pickle('output/lexlines.p')

### Special Case: OB Nippur Ura 6
The sixth chapter of the Old Babylonian Nippur version of the thematic list Ura deals with foodstuffs and drinks. This chapter was not standardized (each exemplar has its own order of items and sections) and therefore no composite text has been created in [DCCLT](http://oracc.org/dcclt). Instead, the "composite" of [OB Nippur Ura 6](http://oracc.org/dcclt/Q000043) consists of the concatenation of all known Nippur exemplars of the list of foodstuffs. In our current dataframe, therefore, there are no lines where the field `id_text` equals "dcclt/Q000043".

We create a "composite" by changing the field `id_text` in all exemplars of [OB Nippur Ura 6](http://oracc.org/dcclt/Q000043) to "dcclt/Q000043". 

In [None]:
Ura6 = ["dcclt/P227657",
"dcclt/P227743",
"dcclt/P227791",
"dcclt/P227799",
"dcclt/P227925",
"dcclt/P227927",
"dcclt/P227958",
"dcclt/P227967",
"dcclt/P227979",
"dcclt/P228005",
"dcclt/P228008",
"dcclt/P228200",
"dcclt/P228359",
"dcclt/P228368",
"dcclt/P228488",
"dcclt/P228553",
"dcclt/P228562",
"dcclt/P228663",
"dcclt/P228726",
"dcclt/P228831",
"dcclt/P228928",
"dcclt/P229015",
"dcclt/P229093",
"dcclt/P229119",
"dcclt/P229304",
"dcclt/P229332",
"dcclt/P229350",
"dcclt/P229351",
"dcclt/P229352",
"dcclt/P229353",
"dcclt/P229354",
"dcclt/P229356",
"dcclt/P229357",
"dcclt/P229358",
"dcclt/P229359",
"dcclt/P229360",
"dcclt/P229361",
"dcclt/P229362",
"dcclt/P229365",
"dcclt/P229366",
"dcclt/P229367",
"dcclt/P229890",
"dcclt/P229925",
"dcclt/P230066",
"dcclt/P230208",
"dcclt/P230230",
"dcclt/P230530",
"dcclt/P230586",
"dcclt/P231095",
"dcclt/P231128",
"dcclt/P231424",
"dcclt/P231446",
"dcclt/P231453",
"dcclt/P231458",
"dcclt/P231742",
"dcclt/P266520"]
lex_lines.loc[lex_lines["id_text"].isin(Ura6), "id_text"] = "dcclt/Q000043"
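
The relabeling step can be tried out on a small scale. The following is a minimal sketch, assuming a dataframe shaped like `lex_lines`; the names `toy_lines` and `toy_exemplars` and the P-numbers in them are invented for illustration and do not come from DCCLT.

```python
import pandas as pd

# Toy stand-in for `lex_lines`; IDs and lemmas are invented for illustration.
toy_lines = pd.DataFrame({
    "id_text": ["dcclt/P000001", "dcclt/P000002", "dcclt/Q000039", "dcclt/P000001"],
    "lemma": [["ninda[bread]N"], ["kash[beer]N"], ["gesh[tree]N"], ["zid[flour]N"]],
})
toy_exemplars = ["dcclt/P000001", "dcclt/P000002"]  # hypothetical exemplar IDs

# Row selection by boolean mask, column selection by label: the assignment
# overwrites `id_text` only in the rows that belong to an exemplar.
toy_lines.loc[toy_lines["id_text"].isin(toy_exemplars), "id_text"] = "dcclt/Q000043"

# Count how many lines now belong to each text.
counts = toy_lines["id_text"].value_counts()
print(counts)
```

After the assignment, none of the exemplar IDs remain in `id_text`, so the composite can be selected like any other composition.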

### 2.1 Select Lexical Compositions
Select the following compositions: 
* Ura 1 dcclt/Q000039
* Ura 2 dcclt/Q000040
* Ura 3 dcclt/Q000001
* Ura 4 dcclt/Q000041
* Ura 5 dcclt/Q000042
* Ura 6 dcclt/Q000043
* Lu₂-Azlag₂ B/C dcclt/Q000302
* Ugumu dcclt/Q002268
* Diri dcclt/Q000057
* Nigga dcclt/Q000052
* Izi dcclt/Q000050
* Kagal dcclt/Q000048
* Lu dcclt/Q000047
* Ea dcclt/Q000055

In [None]:
keep = {"dcclt/Q000039" : "OB Ura 1", 
    "dcclt/Q000040" : "OB Ura 2",
    "dcclt/Q000001" : "OB Ura 3",
    "dcclt/Q000041" : "OB Ura 4",
    "dcclt/Q000042" : "OB Ura 5",
    "dcclt/Q000043" : "OB Ura 6",
    "dcclt/Q000302" : "Lu-azlag",
    "dcclt/Q002268" : "Ugumu",
    "dcclt/Q000057" : "OB Nippur Diri",
    "dcclt/Q000052" : "Nigga",
    "dcclt/Q000050" : "OB Nippur Izi",
    "dcclt/Q000048" : "OB Nippur Kagal",
    "dcclt/Q000047" : "OB Nippur Lu", 
    "dcclt/Q000055" : "OB Nippur Ea"}
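# .copy() makes the selection an independent dataframe, so that adding a
# column in the next cell does not trigger a SettingWithCopyWarning
lex_lines = lex_lines.loc[lex_lines["id_text"].isin(keep)].copy()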

In [None]:
lex_lines["lemma_mwe"] = ["_".join(entry) for entry in lex_lines["lemma"]]

In [None]:
lex_comp = lex_lines.groupby("id_text").agg(
    {"lemma_mwe": list}).reset_index()

In [None]:
lex_comp["text_name"] = [keep[t_id] for t_id in lex_comp["id_text"]]
lex_comp

# More Old Code (may not work)

In [None]:
words = dict(etcsl_df.iloc[1])

In [None]:
words = {word : words[word] for word in words if not words[word] == 0}

In [None]:
words

In [None]:
lit_df.shape

In [None]:
lex = [text_id for text_id in corpus_df.index.values if text_id.startswith("dcclt")]
lex_df = corpus_df.loc[lex, : ]
lex_df = lex_df.loc[ : , lex_df.sum(axis=0) != 0]
lex_df.shape

In [None]:
lex_words = lex_df.columns
lit_words = lit_df.columns

In [None]:
lit_in_lex = [word for word in lit_words if word in lex_words]

In [None]:
len(lit_in_lex)

In [None]:
lit_in_lex_df = lit_df[lit_in_lex]
lit_in_lex_df

Rare words (words that appear only once or twice) may be a strong indicator of a connection (either way) between the literary and the lexical corpus. We can reduce the dataframe to select only those rare words.

In [None]:
rare_n = 2  # a word is "rare" if it occurs at most this many times
rare = lit_in_lex_df.loc[ : , lit_in_lex_df.sum(axis=0) <= rare_n]
rare.shape
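
As an illustration of the rare-word filter, here is a minimal sketch on an invented document-term matrix (`toy_dtm`). The extra condition `totals > 0` is an assumption added here to exclude words that never occur; the cell above does not need it if every column of `lit_in_lex_df` has at least one hit.

```python
import pandas as pd

# Toy document-term matrix: rows are compositions, columns are words (all invented).
toy_dtm = pd.DataFrame(
    {"lugal": [3, 2, 4], "ninda": [1, 0, 0], "kug": [0, 1, 1], "amar": [0, 0, 0]},
    index=["text_a", "text_b", "text_c"],
)
rare_n = 2  # a word counts as rare if it occurs at most this often in the corpus

# Column totals give the corpus-wide frequency of each word.
totals = toy_dtm.sum(axis=0)

# Keep words that occur at least once and at most `rare_n` times.
toy_rare = toy_dtm.loc[:, (totals > 0) & (totals <= rare_n)]
print(list(toy_rare.columns))
```

Changing `rare_n` to `1` would restrict the selection to words that occur exactly once in the whole corpus.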

# Which literary texts share many rare words with the lexical corpus?

In [None]:
idx = rare.sum(axis=1).sort_values(ascending=False).index
rare.loc[idx, : ]

# Retrieve composition names
Composition names are available in the original `etcsl` dataframe. Retrieve `id_text` and `text_name` from that dataframe and merge this with the dataframe `rare` by using `id_text` as index.

In [None]:
id_name = etcsl_comp[["id_text", "text_name"]].drop_duplicates().set_index("id_text")

In [None]:
merged = pd.merge(rare, id_name, left_index=True, right_index=True, how='inner')
merged.loc[idx]

This shows that Ninurta's Exploits shares the largest number of such rare words with the lexical texts.

In [None]:
idx = merged.sum(axis=1, numeric_only=True).sort_values(ascending = False).index
merged.loc[idx]
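
The merge-and-rank pattern used here can be sketched on toy data. The IDs, words and titles below are invented; only the pandas calls mirror the cells above.

```python
import pandas as pd

# Toy rare-word matrix indexed by text ID (IDs and words are invented).
toy_rare = pd.DataFrame(
    {"ninda": [1, 0, 0], "kug": [0, 1, 1], "zid": [0, 0, 1]},
    index=["c.1", "c.2", "c.3"],
)
# Toy name table with a duplicate row, as a line-by-line dataframe would have.
toy_names = pd.DataFrame(
    {"id_text": ["c.1", "c.2", "c.3", "c.3"],
     "text_name": ["Text A", "Text B", "Text C", "Text C"]}
)

# One row per text ID, with the ID as index so that it lines up with `toy_rare`.
id_name = toy_names.drop_duplicates().set_index("id_text")
merged = pd.merge(toy_rare, id_name, left_index=True, right_index=True, how="inner")

# numeric_only=True skips the string column `text_name` when summing each row.
order = merged.sum(axis=1, numeric_only=True).sort_values(ascending=False).index
ranked = merged.loc[order]
print(ranked[["text_name"]])
```

The text with the highest row total ends up first; texts with equal totals may appear in either order.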

In [None]:
m

# Words in Lexical Texts not in ETCSL
If a word or expression in the lexical corpus is never used in the literary texts from [ETCSL](http://etcsl.orinst.ox.ac.uk/), the sum of its column will be `0`.

Give the number of columns (the number of unique words and expressions in the lexical texts), the number of words/expressions never used in the ETCSL corpus, and the second number as a percentage of the first.

In [None]:
lex_not_in_etcsl = etcsl_df.loc[:, etcsl_df.sum()==0]
len(etcsl_df.columns), len(lex_not_in_etcsl.columns), str(len(lex_not_in_etcsl.columns)/len(etcsl_df.columns)*100) + "%"
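
The same calculation can be followed on a toy matrix. This is a sketch only: `toy_counts` and its values are invented, and the f-string formatting is one possible way to print the percentage.

```python
import pandas as pd

# Toy matrix: rows are literary texts, columns are lexical entries (all invented).
toy_counts = pd.DataFrame(
    {"lugal": [2, 1], "ninda": [0, 0], "kug": [0, 3], "zid": [0, 0]},
    index=["text_a", "text_b"],
)

# A column that sums to zero is a lexical entry never attested in the literary texts.
unused = toy_counts.loc[:, toy_counts.sum() == 0]
n_total = len(toy_counts.columns)
n_unused = len(unused.columns)
share = n_unused / n_total

# The .1% format multiplies by 100 and appends the percent sign.
print(f"{n_total} entries, {n_unused} unused ({share:.1%})")
```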

# Simplify
The above may be an overly complex way of doing it.
Alternative: build a full document-term matrix (DTM) of ETCSL without a vocabulary constraint, then turn the ETCSL vocabulary and the lexical vocabulary into sets that can be subtracted from each other.

In [None]:
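from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(analyzer='word', token_pattern=r'[^ ]+', binary = False)
etcsl2_dtm = cv.fit_transform(corpus['text'])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
etcsl2_df = pd.DataFrame(etcsl2_dtm.toarray(), columns= cv.get_feature_names_out(), index=corpus["etcsl_no"])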
etcsl_vocab_s = set(etcsl2_df.columns)
lex_vocab_s = set(lex_vocab)
diff_e_l = list(etcsl_vocab_s - lex_vocab_s)
diff_l_e = list(lex_vocab_s - etcsl_vocab_s)

In [None]:
print("number of words/expressions in ETCSL " + str(len(etcsl_vocab_s)))
print("number of words/expressions in lexical texts " + str(len(lex_vocab_s)))
print("number of words/expressions in ETCSL not in lexical " + str(len(diff_e_l)))
print("number of words/expressions in lexical not in ETCSL " + str(len(diff_l_e)))
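
The set arithmetic behind these counts is sketched below with two invented vocabularies (`toy_lit_vocab`, `toy_lex_vocab`). It uses plain Python sets only and does not depend on the dataframes above.

```python
# Toy vocabularies; the entries are invented for illustration.
toy_lit_vocab = {"lugal", "kug", "e2", "dumu"}
toy_lex_vocab = {"lugal", "kug", "ninda", "zid", "amar"}

only_lit = toy_lit_vocab - toy_lex_vocab   # in literary texts, not in lexical lists
only_lex = toy_lex_vocab - toy_lit_vocab   # in lexical lists, not in literary texts
shared = toy_lit_vocab & toy_lex_vocab     # attested in both vocabularies

print(sorted(only_lit), sorted(only_lex), sorted(shared))
```

Each vocabulary is the union of its "only" part and the shared part, which gives a quick consistency check on the four numbers printed above.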

In [None]:
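import matplotlib.pyplot as plt
from matplotlib_venn import venn2  # requires the matplotlib-venn package

plt.figure(figsize=(4,4))
venn2([etcsl_vocab_s, lex_vocab_s], ("literary", "lexical"))
plt.show()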

# Rare Words Shared by Lex and Lit
Which words appear in Lex and in Lit but appear only once in Lit? In which composition do we find such words; which words are those?

First create a dataframe (`rare`) that only has the columns that add up to `1` (the word or expression appears only once in the corpus). The row totals of this dataframe indicate, per composition (= row), how many such rare words it contains. These row totals are added as a separate column. The composition names are extracted from the `corpus` dataframe created above. Finally, the dataframe is sorted by the row totals.

The dataframe `rare` includes columns for each of the words that appear only once. We are showing only the columns that identify the composition and the row totals.

In [None]:
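rare = etcsl_df.loc[:, etcsl_df.sum()==1].reset_index()
# numeric_only=True keeps the non-numeric `etcsl_no` column out of the row totals
rare["no. of unique lexical correspondences"] = rare.sum(axis=1, numeric_only=True)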
rare["text_name"] = corpus["text_name"]
rare = rare.sort_values('no. of unique lexical correspondences', ascending = False)
rare.loc[:,["etcsl_no", "no. of unique lexical correspondences", "text_name"]]

# Which Words?
Which are the rare words that define this list of compositions? We first extract the full list of words from the column names of the dataframe `rare`. The variable `words` is a NumPy array that contains strings.

In [None]:
words = rare.columns.values
len(words)

# The rare words in the top-ten
The first ten compositions in our list are the ones that have the most rare words shared with lexical texts. Each row, representing a composition, has columns that represent individual words. We create a `mask` (a sequence of boolean values `True` or `False`) that indicates whether or not the value in the column is 1. If the boolean is `True`, the word is printed.

In [None]:
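for i in range(10):
    # Boolean mask: True for every column whose value in this row equals 1
    mask = (rare.iloc[i] == 1).values
    print(rare.iloc[i, -1])  # composition name (last column)
    print(words[mask])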
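
The masking step can be sketched in isolation. The dataframe `toy_rare` and its words are invented, and collecting the result in a dictionary (`rare_by_text`) is an assumption made here for readability; the cell above prints the words instead.

```python
import pandas as pd

# Toy rare-word matrix: rows are compositions, columns are words (all invented).
toy_rare = pd.DataFrame(
    {"ninda": [1, 0], "kug": [0, 1], "zid": [1, 0]},
    index=["Text A", "Text B"],
)
words = toy_rare.columns.to_numpy()  # NumPy array of column names

rare_by_text = {}
for name, row in toy_rare.iterrows():
    # Boolean mask: True where this composition contains the word exactly once.
    mask = (row == 1).to_numpy()
    # Indexing the array of words with the mask keeps only the matching words.
    rare_by_text[name] = list(words[mask])

print(rare_by_text)
```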

In [None]:
lexical["text"] = lexical["text"].str.replace(" ", "*")
lexical

In [None]:
lexical_corpus = lexical.groupby([lexical["id_text"], 
                                  lexical["text_name"]]).aggregate({"text": " ".join}).reset_index()
lexical_corpus

In [None]:
lexical_temp = lexical[["id_text", "id_line", "lemma"]]

In [None]:
lexical_temp[lexical_temp["id_text"]=="dcclt/Q000001"]

In [None]:
lexical = lexical.groupby([lexical['id_text'], lexical['id_line']]).agg({
        'lemma': ' '.join,
        'extent': ''.join
    })

In [None]:
a = [1, 2, 3, [4]]
b = [10, 11, 12, [13]]
c = [1, 4, 5, [7]]

In [None]:
df = pd.DataFrame([a, b, c], columns = ["a", "b", "c", "d"])
df

In [None]:
df.groupby(["a"]).apply(lambda x: append(x))

In [None]:
len(set(etcsl.id_text))

In [None]:
test1 = etcsl2.groupby(
    [etcsl2["id_text"]]).aggregate(
    {"lemma_mwe": sum}).reset_index()
test2 = etcsl2.groupby(
    [etcsl2["id_text"]]).aggregate(
    {"lemma": sum}).reset_index()

In [None]:
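from sklearn.feature_extraction.text import CountVectorizer

def dummy(tokens):
    # Identity function: the input is already a list of tokens
    return tokens

cv1 = CountVectorizer(tokenizer=dummy, preprocessor=dummy)
# `tokenizer` is not defined in this notebook; it must be created beforehand
cv2 = CountVectorizer(tokenizer=tokenizer.tokenize, preprocessor=dummy)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
dtm1 = cv1.fit_transform(test1['lemma_mwe'])
corpus_df1 = pd.DataFrame(dtm1.toarray(), columns= cv1.get_feature_names_out() , index=test1["id_text"])
dtm2 = cv2.fit_transform(test2['lemma'])
corpus_df2 = pd.DataFrame(dtm2.toarray(), columns= cv2.get_feature_names_out() , index=test2["id_text"])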

In [None]:
etcsl_comp

In [None]:
lex_vocab