# Distance Measurements: Sumerian Literature

This is a work-in-progress Notebook.


In [12]:
import pandas as pd
import glob
import re
from sklearn.feature_extraction.text import CountVectorizer

First, read all files in the directory of cleaned ETCSL texts. These files contain lemmatization in ORACC (ePSD2) style, with one line of text per row.

In [2]:
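path = r'../Scrape-etcsl/cleaned'  # use your path (no trailing slash)
allFiles = glob.glob(path + "/*.txt")
list_ = []
# Read each cleaned file (comma-separated, with a header row) into its own DataFrame.
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=0)
    list_.append(df)
# Stack the per-file DataFrames into one.
etcsl_data = pd.concat(list_)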
etcsl_data.head()

Unnamed: 0,etcsl_no,text_name,version,l_no,text
0,c.0.1.1,Ur III catalogue from Nibru (N1),,1,sux:dubsaŋ[first]AJ
1,c.0.1.1,Ur III catalogue from Nibru (N1),,2,sux:Enki[1]DN sux:unu[dwelling]N sux:gal[big]V...
2,c.0.1.1,Ur III catalogue from Nibru (N1),,3,sux:anzag[horizon]N
3,c.0.1.1,Ur III catalogue from Nibru (N1),,4,sux:anŋi[eclipse]N sux:zu[know]V/t sux:ama[mot...
4,c.0.1.1,Ur III catalogue from Nibru (N1),,5,sux:gi[thicket]N sux:tuku[rock]V/t


To transform this DataFrame into a proper Document Term Matrix, we need to discard the columns `version` and `l_no` and concatenate all the text that belongs to a single composition. Some lines have no content in the `text` column; these lines need to be dropped.

First select the relevant columns and drop the rows that have no text content.

In [3]:
etcsl_data = etcsl_data[['etcsl_no', 'text_name', 'text']]
etcsl_data = etcsl_data.dropna()
etcsl_data.head()

Unnamed: 0,etcsl_no,text_name,text
0,c.0.1.1,Ur III catalogue from Nibru (N1),sux:dubsaŋ[first]AJ
1,c.0.1.1,Ur III catalogue from Nibru (N1),sux:Enki[1]DN sux:unu[dwelling]N sux:gal[big]V...
2,c.0.1.1,Ur III catalogue from Nibru (N1),sux:anzag[horizon]N
3,c.0.1.1,Ur III catalogue from Nibru (N1),sux:anŋi[eclipse]N sux:zu[know]V/t sux:ama[mot...
4,c.0.1.1,Ur III catalogue from Nibru (N1),sux:gi[thicket]N sux:tuku[rock]V/t


Group the rows by `etcsl_no` and apply the `join` function to the `text` column. Transform the aggregated data into a new DataFrame.

In [4]:
etcsl_bytext = etcsl_data['text'].groupby(etcsl_data['etcsl_no']).apply(' '.join)
etcsl_bytext_df = pd.DataFrame(etcsl_bytext)
etcsl_bytext_df.head()

etcsl_no,text
c.0.1.1,sux:dubsaŋ[first]AJ sux:Enki[1]DN sux:unu[dwel...
c.0.1.2,sux:diŋir[deity]N sux:šembizida[kohl]N sux:dar...
c.0.2.01,sux:lugal[king]N sux:šag[heart]N sux:lugal[kin...
c.0.2.02,sux:Enlil[1]DN sux:sud[distant]V/i sux:nam[lor...
c.0.2.03,sux:lugal[king]N sux:mu[name]N sux:niŋul[everl...
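
To see what the group-and-join step does, here is a minimal sketch on a toy DataFrame. The rows and the names `toy` and `toy_bytext` are invented for illustration and are not taken from ETCSL.

```python
import pandas as pd

# Toy data (hypothetical): two compositions, one lemmatized line per row.
toy = pd.DataFrame({
    'etcsl_no': ['c.1', 'c.1', 'c.2'],
    'text': ['sux:lugal[king]N',
             'sux:e[house]N sux:gal[big]V/i',
             'sux:diŋir[deity]N'],
})

# Group the lines by composition and join them with a single space.
# Row order within each group is preserved, so lines stay in sequence.
toy_bytext = toy.groupby('etcsl_no')['text'].apply(' '.join)
print(toy_bytext['c.1'])
# prints: sux:lugal[king]N sux:e[house]N sux:gal[big]V/i
```

The result is a Series indexed by `etcsl_no`, with one long string per composition.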


Create a DataFrame of `etcsl_no` and `text_name` equivalencies, with `etcsl_no` set as index (row names). Then merge this DataFrame with `etcsl_bytext_df` using the indexes.

In [5]:
etcsl_numbers_names = etcsl_data[['etcsl_no', 'text_name']].drop_duplicates().set_index('etcsl_no')
etcsl_data_df = pd.merge(etcsl_numbers_names, etcsl_bytext_df, right_index=True, left_index=True)
etcsl_data_df.head()

etcsl_no,text_name,text
c.0.1.1,Ur III catalogue from Nibru (N1),sux:dubsaŋ[first]AJ sux:Enki[1]DN sux:unu[dwel...
c.0.1.2,Ur III catalogue at Yale (Y1),sux:diŋir[deity]N sux:šembizida[kohl]N sux:dar...
c.0.2.01,OB catalogue from Nibru (N2),sux:lugal[king]N sux:šag[heart]N sux:lugal[kin...
c.0.2.02,OB catalogue in the Louvre (L),sux:Enlil[1]DN sux:sud[distant]V/i sux:nam[lor...
c.0.2.03,OB catalogue from Urim (U1),sux:lugal[king]N sux:mu[name]N sux:niŋul[everl...
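
The merge relies on each `etcsl_no` having exactly one `text_name`. If a number occurred with two different spellings of its name, `drop_duplicates()` would keep both rows and the merged DataFrame would list that composition twice. The following is a minimal sketch of that check on toy data; all rows and variable names are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): two compositions with consistent names.
toy = pd.DataFrame({
    'etcsl_no': ['c.1', 'c.1', 'c.2'],
    'text_name': ['Text A', 'Text A', 'Text B'],
    'text': ['sux:lugal[king]N', 'sux:gal[big]V/i', 'sux:diŋir[deity]N'],
})

# One row per (number, name) pair, indexed by number.
names = toy[['etcsl_no', 'text_name']].drop_duplicates().set_index('etcsl_no')
# One joined text per number, as a one-column DataFrame.
bytext = toy.groupby('etcsl_no')['text'].apply(' '.join).to_frame()

merged = pd.merge(names, bytext, left_index=True, right_index=True)

# True when every composition number occurs exactly once in the index.
print(merged.index.is_unique)
```

Running the same `index.is_unique` test on `etcsl_data_df` would show whether the real data satisfies this assumption.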


Transform the DataFrame into a Document Term Matrix (DTM) using `CountVectorizer`. This class uses a regular expression (`token_pattern`) to decide where each word (or token) begins and ends. In lemmatized Sumerian, a space marks the boundary between two lemmas. The expression `r'[^ ]+'` means: one or more consecutive characters other than a space.

The output of the CountVectorizer (`etcsl_dtm`) is not in a human-readable format. It is transformed into another DataFrame, with the ETCSL numbers as index.

The length of each composition in ETCSL may be computed by adding up all the numbers in a row. The text length will be used at various places in further computations.

In [6]:
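# Treat every run of non-space characters as one token (lemma).
# Note: CountVectorizer lowercases tokens by default (lowercase=True).
cv = CountVectorizer(analyzer='word', token_pattern=r'[^ ]+')
etcsl_dtm = cv.fit_transform(etcsl_data_df['text'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() exists since 1.0.
etcsl_dtm_df = pd.DataFrame(etcsl_dtm.toarray(), columns=cv.get_feature_names_out(), index=etcsl_data_df.index.values)
# Row sums give the number of tokens (text length) per composition.
etcsl_text_length = etcsl_dtm_df.sum(axis=1)
etcsl_dtm_df.head()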

Unnamed: 0,sux-x-emesal:am-mu-uc[3]nu,sux-x-emesal:aŋ[do]v/t,sux-x-emesal:aŋ[sky]n,sux-x-emesal:aŋ[thing]n,sux-x-emesal:aŋba[gift]n,sux-x-emesal:aŋdara[rag]n,sux-x-emesal:aŋgig[bad-thing]n,sux-x-emesal:aŋhulu[evil]n,sux-x-emesal:aŋlam[garment]n,sux-x-emesal:aŋsub[polishing]n,...,sux:šusuen[1]rn,sux:šutubur[mixture]n,sux:šutug[reed-hut]n,sux:šutum[storehouse]n,sux:šutur[garment]n,sux:šuziana[1]dn,sux:šuš[cover]v/t,sux:šuʾi[barber]n,sux:šuʾu[stone]n,sux:šuʾura[goose]n
c.0.1.1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
c.0.1.2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
c.0.2.01,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
c.0.2.02,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
c.0.2.03,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
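
The effect of the token pattern can be sketched with the standard library alone. The snippet below applies the same regular expression with `re.findall` and counts the tokens with `collections.Counter`. It is an illustration of the tokenization, not a replacement for `CountVectorizer`. The input string is made up, and the explicit `.lower()` imitates the vectorizer's default `lowercase=True`, which is why the column names in the output above appear in lower case (`v/t`, `dn`).

```python
import re
from collections import Counter

# Hypothetical lemmatized line; lemmas are separated by single spaces.
line = 'sux:lugal[king]N sux:e[house]N sux:lugal[king]N sux:zu[know]V/t'

# Same pattern as token_pattern: one or more non-space characters.
tokens = re.findall(r'[^ ]+', line.lower())
counts = Counter(tokens)

# Text length is the sum of all counts, like a row sum in the DTM.
text_length = sum(counts.values())
print(counts['sux:lugal[king]n'], text_length)
# prints: 2 4
```

Because the pattern excludes only the space character, brackets, colons, and slashes all stay inside the token, so a lemma such as `sux:zu[know]V/t` is counted as a single unit.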


In [7]:
words_l = list(etcsl_dtm_df.columns.values)

In [14]:
words_l_error = [word for word in words_l if '-' in word]
words_l_error

['sux-x-emesal:am-mu-uc[3]nu',
 'sux-x-emesal:aŋ[do]v/t',
 'sux-x-emesal:aŋ[sky]n',
 'sux-x-emesal:aŋ[thing]n',
 'sux-x-emesal:aŋba[gift]n',
 'sux-x-emesal:aŋdara[rag]n',
 'sux-x-emesal:aŋgig[bad-thing]n',
 'sux-x-emesal:aŋhulu[evil]n',
 'sux-x-emesal:aŋlam[garment]n',
 'sux-x-emesal:aŋsub[polishing]n',
 'sux-x-emesal:ašer[lament]n',
 'sux-x-emesal:ašte[chair]n',
 'sux-x-emesal:damal[wide]v/i',
 'sux-x-emesal:di[go]v/i',
 'sux-x-emesal:di[one]nu',
 'sux-x-emesal:dimmer[deity]n',
 'sux-x-emesal:dumu[child]n',
 'sux-x-emesal:dumuzid[1]dn',
 'sux-x-emesal:dumuzidabzuk[1]dn',
 'sux-x-emesal:duʾuš[bird]n',
 'sux-x-emesal:elum[bison]n',
 'sux-x-emesal:eneŋ[word]n',
 'sux-x-emesal:enki[1]dn',
 'sux-x-emesal:enlil[1]dn',
 'sux-x-emesal:ere[slave]n',
 'sux-x-emesal:ereškigalak[1]dn',
 'sux-x-emesal:eridug[1]sn',
 'sux-x-emesal:eze[sheep]n',
 'sux-x-emesal:ga[bring]v/t',
 'sux-x-emesal:gašan[lady]n',
 'sux-x-emesal:gelleŋ[forsake]v/i',
 'sux-x-emesal:gi[thicket]n',
 'sux-x-emesal:gin[female-work