This Notebook attempts to isolate names (personal names, place names, royal names, divine names) in a collection of texts downloaded from [BDTNS](http://bdtns.filol.csic.es/). The names are formatted to correspond to current [ORACC](http://oracc.org) conventions. The resulting list (name instances with normalized names) will need hand editing.

The original Notebook took all names from the entire [BDTNS](http://bdtns.filol.csic.es/) database. The current version selects names from texts that are cataloged (in [BDTNS](http://bdtns.filol.csic.es/)) as coming from Puzriš-Dagan.

In [1]:
import pandas as pd
import re

Create list of Puzriš-Dagan texts only

In [2]:
#with open('data/query_cat_17_09_6-194802.txt') as f:
#    puzd_df = pd.read_csv(f, sep="\t", header=None, names = ["BDTNS", "CDLI"])
#puzd_df
#puzd_l = list(puzd_df["BDTNS"])
#puzd_l = [str(no).zfill(6) for no in puzd_l]

In [3]:
with open('data/query_text_19_05_15-020951.txt', mode = 'r', encoding = 'utf8') as f:
    lines = f.read().splitlines()

The code below (currently commented out) sets a flag to distinguish texts from Puzriš-Dagan from all other texts. If the 6-digit code at the beginning of a header line (the [BDTNS](http://bdtns.filol.csic.es/) text number) appears in the list `puzd_l`, the flag is set to `True`; otherwise it is set to `False`. Only lines read while the flag is `True` are kept.

In [4]:
#flag = False
#puzd_lines = []
#for line in linelist:
#    if line[:6].isdigit():
#        if line[:line.find("\t")] in puzd_l:
#            flag = True
#        else:
#            flag = False
#    if flag:
#        puzd_lines.append(line)

In [5]:
#len(puzd_lines)

In [6]:
#linelist = puzd_lines

# Clean
Remove all flags and similar markers, including brackets and half brackets.

In [7]:
def clean(data, oldlist, new=""):
    # Replace every string in oldlist by new (default: delete it)
    for old in oldlist:
        data = data.replace(old, new)
    return data

In [8]:
remlist = ["`", "´", "/", "]", "[", "!", "?", "<", ">", "(", ")", "⸢", "⸣", "⌈", "⌉"]
lines = [clean(line, remlist) for line in lines]
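
As a quick check, the helper can be tried on a single line. The sketch below repeats `clean()` so that it runs on its own; the sample line is invented for illustration and is not taken from the BDTNS file.

```python
def clean(data, oldlist, new=""):
    # Replace every string in oldlist by new (default: delete it)
    for old in oldlist:
        data = data.replace(old, new)
    return data

remlist = ["`", "´", "/", "]", "[", "!", "?", "<", ">", "(", ")", "⸢", "⸣", "⌈", "⌉"]

# Hypothetical sample line with damage brackets and a query mark
sample = "o. 3 [1] udu ⸢niga⸣ ki Ab-ba-sa6-ga?-ta"
cleaned = clean(sample, remlist)
print(cleaned)
```

Note that round brackets disappear as well, which is why x-value specifiers such as `(=DU)` have to be reconstructed in a later step.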

# Remove Editorial Lines and Headers
Eliminate empty lines, header lines, and editorial remarks. Header lines begin with a digit. Editorial remarks begin with a line indicator (like data lines), followed by one or more TABs, followed by `#`. In the current version of the [BDTNS](http://bdtns.filol.csic.es/) output TABs no longer seem to be used this way, but `#` still marks the beginning of an editorial comment. Everything after `#` is discarded.

In [9]:
lines = [line for line in lines if len(line) > 0]
lines = [line for line in lines if not line[0].isdigit()]
lines = [line for line in lines if not "\t#" in line]
lines = [line for line in lines if not "=====" in line]
lines = [line.split("#")[0] for line in lines]  # remove everything after #
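
The effect of these filters can be seen on a handful of lines. The following is a minimal sketch: `strip_editorial()` is a hypothetical wrapper around the same five steps, and the sample lines (including the header) are invented and only imitate the BDTNS layout.

```python
def strip_editorial(raw_lines):
    # Hypothetical wrapper around the five filtering steps of the cell above
    kept = [line for line in raw_lines if len(line) > 0]      # empty lines
    kept = [line for line in kept if not line[0].isdigit()]   # header lines
    kept = [line for line in kept if "\t#" not in line]       # TAB + # remarks
    kept = [line for line in kept if "=====" not in line]     # separator lines
    return [line.split("#")[0] for line in kept]              # trailing comments

# Invented sample data
sample = [
    "000001\tsample header",
    "",
    "o. 1 1 udu niga",
    "o. 2\t# line broken",
    "r. 1 ki Ab-ba-sa6-ga-ta # seal impression",
    "=====",
]
filtered = strip_editorial(sample)
print(filtered)
```

Only the two data lines survive; the second one keeps the space that preceded `#`, which is harmless because lines are later split on white space.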

# Resolve x-values
First, use a regular expression to replace capital X by subscript ₓ. This will skip instances like KA×X, because × (the "times" character) is not in the list of possible preceding characters.

Second, use a regular expression to resolve instances such as 
> mu-kuₓ lugal-la (=DU)

and change those into
> mu-kuₓ(DU) lugal-la

Round brackets, "(" and ")", were removed in a preceding cell; they are put back for such ₓ-value specifiers. The regex in the second line ends with a non-capturing group, `(?:\s|$)`, which requires that the specifier be followed either by white space (`\s`) or by the end of the string (`$`). Running regex replacements on tens of thousands of lines takes a little while.

In [10]:
# Raw strings avoid invalid-escape warnings for \s in recent Python versions
lines = [re.sub(r'([abdegŋhḫijklmnpqrstuwzšṣṭABDEGŊHḪIJKLMNPQRSTUWZŠṢṬ])(X)', r'\1ₓ', line) for line in lines]
lines = [re.sub(r'([ₓ])(.*?)(=)(.*?)(?:\s|$)', r'\1(\4)\2', line) for line in lines]

In [11]:
# Wrap compound signs (containing . or ×) in |...|
lines = [re.sub(r'(ₓ\()(.*?)([\.×])(.*?)(\))', r'\1|\2\3\4|\5', line) for line in lines]

Inspect results. Note that this code fails for a line like

> 18 gin2 nagga mu-kuX gibil (=AN.NA) (=DU)

which will yield

> 18 gin2 nagga mu-kuₓ(|AN.NA|) gibil =DU

The notation (=AN.NA) in this case refers to nagga, whereas (=DU) refers to kuX - but there is no way the script can detect that.
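
Both the regular case and the failing case can be reproduced on single lines. The sketch below bundles the three substitutions into a hypothetical function `resolve_x_values()`; the inputs are given as they look after cleaning, that is, without round brackets.

```python
import re

def resolve_x_values(line):
    # Step 1: capital X after a letter becomes subscript ₓ (KA×X is skipped)
    line = re.sub(r'([abdegŋhḫijklmnpqrstuwzšṣṭABDEGŊHḪIJKLMNPQRSTUWZŠṢṬ])(X)', r'\1ₓ', line)
    # Step 2: move the first following "=SIGN" specifier directly behind ₓ
    line = re.sub(r'([ₓ])(.*?)(=)(.*?)(?:\s|$)', r'\1(\4)\2', line)
    # Step 3: wrap compound signs (containing . or ×) in |...|
    line = re.sub(r'(ₓ\()(.*?)([\.×])(.*?)(\))', r'\1|\2\3\4|\5', line)
    return line

good = resolve_x_values("mu-kuX lugal-la =DU")
bad = resolve_x_values("18 gin2 nagga mu-kuX gibil =AN.NA =DU")
print(repr(good))
print(repr(bad))
```

In the second line the first specifier after `ₓ` is picked up regardless of which word it belongs to, which is exactly the limitation described above.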

In [12]:
pd.set_option('display.max_colwidth', None)  # None = do not truncate (-1 is deprecated since pandas 1.0)
lines_df = pd.DataFrame(lines)
lines_df[lines_df[0].str.contains('ₓ')]

Unnamed: 0,0
87,o. 2 gi ziₓ(SIG7)-a 12 sar-ta
98,o. 3 4 guruš gi ziₓ(SIG7)-a 12 sar-ta
115,o. 6 la2-i3 su-ga Lugal-ušurₓ(|LAL2.TUG2|) sipa
195,r. 2 1 Ur-nigarₓ{gar}
222,o. 4 1 Lu2-Urubₓ(|URU×KAR2|){ki}
427,r. 5 Umma{ki}-a mu-kuₓ(DU)
481,o. 5 kišib Ur-nigarₓ{gar} x
487,s. 1 Ur-nigarₓ{gar}
506,r. 4 ša3 Tum-alₓ(TUR3){ki}
548,r. 14 mu-kuₓ(DU)


# Extract Names
In BDTNS names start with a capital or with `{d}`. The list comprehension iterates through the list of lines, splits each line into words, and keeps every word that begins with a capital or with `{d}`. The result is a list of name instances.

In [13]:
names = [word for line in lines for word in line.split() if word[0].isupper() or word.startswith('{d}')]
len(names)

474815
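
The selection rule can be illustrated on a few lines. In this sketch `extract_names()` is a hypothetical wrapper around the list comprehension; the first two lines come from the inspection output above, the third is invented.

```python
def extract_names(lines):
    # A word counts as a name if it starts with a capital or with {d}
    return [word for line in lines for word in line.split()
            if word[0].isupper() or word.startswith('{d}')]

sample = [
    "o. 6 la2-i3 su-ga Lugal-ušurₓ(|LAL2.TUG2|) sipa",
    "r. 5 Umma{ki}-a mu-kuₓ(DU)",
    "o. 1 ki {d}Šara2-ta",   # invented line
]
found = extract_names(sample)
print(found)
```

Line indicators such as `o.` and numbers are skipped because their first character is not an uppercase letter.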

# Remove All Upper Case
Remove words that are entirely in upper case (for instance `LU2.SU`).

In [14]:
names =[word for word in names if not word.isupper()]
len(names)

448664

# Remove Incomplete Entries
Remove names that have ellipsis (damage) or illegible signs.

Note that the x in KA×A is not an x, but the "times" sign. The X in x-values (such as NigarX) has been replaced in a previous cell by index x (Nigarₓ). The X in KA×X, however, is a real X, and names including that sign will be eliminated as partially illegible.

In [15]:
names = [word for word in names if '...' not in word]
# x next to a sign separator or determinative marks an illegible sign
illegible = ['-x', 'x-', '.x', 'x.', '}x', 'x{']
names = [word for word in names
         if not any(mark in word.lower() for mark in illegible)]
len(names)

438642
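
The two filters (all upper case, incomplete entries) rely on details of Python string handling that are easy to overlook: `str.isupper()` ignores digits and punctuation, and neither the subscript `ₓ` nor the "times" sign `×` is equal to the letter x. A small sketch, with a hypothetical helper `is_incomplete()` and partly invented candidates:

```python
def is_incomplete(word):
    # True if the word contains an ellipsis or an illegible sign (x)
    low = word.lower()
    marks = ['-x', 'x-', '.x', 'x.', '}x', 'x{']
    return '...' in word or any(mark in low for mark in marks)

candidates = [
    "LU2.SU",                      # all upper case: dropped
    "Ur-nigarₓ{gar}",              # subscript ₓ is not x: kept
    "Lu2-Urubₓ(|URU×KAR2|){ki}",   # × is the times sign: kept
    "Ur-x-mu",                     # invented, illegible sign: dropped
    "KA×X-ba",                     # invented, real X: dropped
    "Lugal-...",                   # invented, ellipsis: dropped
]
kept = [w for w in candidates if not w.isupper() and not is_incomplete(w)]
print(kept)
```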

# Remove duplicates
Reduce the list to a unique set.

In [16]:
total_names = len(names)
#keep the total number of names for statistics
names = set(names)
len(names)

32318

# Sort Alphabetically

In [17]:
names = sorted(names)
names

['A-AMA-a2-a',
 'A-AN-ba-az',
 'A-Ab-ba-ge-na-ta',
 'A-Ad-da',
 'A-Ad-da-kal-la',
 'A-Ad-da-mu',
 'A-An-na-hi-li-bi',
 'A-DU-a',
 'A-DU-ba',
 'A-DU-ba-bi',
 'A-DU-gam-ma',
 'A-DU-la-URU',
 'A-DU-mu',
 'A-DU-mu-ta',
 'A-DU-ra-mu',
 'A-DU-ta',
 'A-DU.DU-e',
 'A-DU.DU-ke4',
 'A-DU.DU-ta',
 'A-DU.DU-še3',
 'A-Eš4-tar2',
 'A-GU4-na-mu',
 'A-KA-a',
 'A-KU-um',
 'A-KU.KU-lum',
 'A-KU.KU-ta',
 'A-LUM-ma',
 'A-MUNU4-da',
 'A-NE-i3-zi',
 'A-NE-ku-bi',
 'A-NI-NI-še3',
 'A-NI-ma',
 'A-NI-nu',
 'A-NI-sig5',
 'A-NI-ta',
 'A-NI.NI-ki-ma-še3',
 'A-PA4-u2-a',
 'A-TU5-še3',
 'A-U.E2-nu-tuku',
 'A-a',
 'A-a-',
 'A-a-Kal-la',
 'A-a-NI',
 'A-a-UN-e-ba-ab-du7',
 'A-a-a',
 'A-a-ab-ba',
 'A-a-ar',
 'A-a-ba',
 'A-a-ba-ni',
 'A-a-ba-ta',
 'A-a-bad3',
 'A-a-bad3-da-ri2-a',
 'A-a-bad3-mu',
 'A-a-bar-ra',
 'A-a-bi',
 'A-a-bi-ta',
 'A-a-bi2-du10',
 'A-a-da',
 'A-a-dingir',
 'A-a-dingir-mu',
 'A-a-dingir-mu-ta',
 'A-a-dingir-mu-še3',
 'A-a-dingir-ta',
 'A-a-du10-ga',
 'A-a-e',
 'A-a-ga',
 'A-a-ga-mu',
 'A-a-ga-ta',


# Create DataFrame
The DataFrame has two columns: column 1 has the original transliteration (as in BDTNS); column 2 starts out with the same data, which is then transformed step by step into an ORACC-compatible lemmatization.

In [40]:
df = pd.DataFrame(names)

In [41]:
df.columns = ['Transliteration']

In [42]:
df['Normalized'] = df['Transliteration']

# Shin, Emphatic T, and Emphatic S
Replace c by š, C by Š, ty by ṭ, etc.

This is no longer necessary, but it does no harm either. Kept in case someone uses an old BDTNS output file.

In [43]:
signs = {'c': 'š',
        'C': 'Š',
        'ty': 'ṭ',
        'TY': 'Ṭ',
        'sy': 'ṣ',
        'SY': 'Ṣ'}
for key in signs:
    df['Normalized'] = [word.replace(key, signs[key]) for word in df['Normalized']]

# Sign Reading Substitution
The conventions for reading signs in BDTNS differ from those of ORACC: BDTNS does not distinguish between G and nasal G (ŋ), and BDTNS uses short readings (`ku3`) where ORACC uses long readings (`kug`).

The following function (`signreplace()`) replaces a BDTNS reading with an ORACC reading. The regular expression uses `\\b` (before and after the sign) to indicate word boundaries, so that replacing `sag` by `saŋ` does not find `sag2` etc. For the `re` module a word boundary is any transition between a word character (letter, digit, or underscore) and a non-word character, so `-`, `.`, `{`, and `}` all count as boundaries.

Since names are capitalized (as in `Dingir-nu-me-a`) each sign-replacement is run twice: once in lower case (`dingir`, replaced by `diŋir`) and once capitalized (`Dingir`, replaced by `Diŋir`).  

In [44]:
def signreplace(old, new, data):
    old_cap = old.capitalize()
    new_cap = new.capitalize()
    data = re.sub('\\b'+old_cap+'\\b', new_cap, data)
    data = re.sub('\\b'+old+'\\b', new, data)
    return data
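
A few calls show how the word boundaries behave. The sketch repeats `signreplace()` (with raw strings) so that it runs on its own; the input names are chosen for illustration.

```python
import re

def signreplace(old, new, data):
    # Replace the capitalized reading first, then the lower-case one
    old_cap = old.capitalize()
    new_cap = new.capitalize()
    data = re.sub(r'\b' + old_cap + r'\b', new_cap, data)
    data = re.sub(r'\b' + old + r'\b', new, data)
    return data

r1 = signreplace('dingir', 'diŋir', 'Dingir-nu-me-a')   # capitalized sign
r2 = signreplace('sag', 'saŋ', 'Lugal-sag2')            # sag2 is a different sign
r3 = signreplace('ku3', 'kug', 'Ku3-{d}Nin-gir2-su')    # index number is part of the sign
print(r1, r2, r3)
```

Because a digit is a word character, there is no boundary between `sag` and `2`, so `sag2` is left alone.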

# Dictionary of signs with canonical ORACC reading
Preliminary list of "short" vs. "long" sign readings (`du11` vs. `dug4`) and sign readings with nasal G (ŋ). The list does *not* include `mu` : `ŋu10`, because that is valid *only* at the end of a word. It is necessary to first remove morphological suffixes such as -ta, -še3, etc., which happens in the next phase.

In [45]:
bdtns_oracc = {'ag2': 'aŋ2',
               'balag': 'balaŋ', 
               'dagal': 'daŋal', 
               'dingir': 'diŋir',
               'eridu': 'eridug',
               'ga2': 'ŋa2', 
               'gar': 'ŋar',
               'geštin': 'ŋeštin',
               'gir2': 'ŋir2',
               'gir3': 'ŋiri3',
               'giri3': 'ŋiri3',
               'giš': 'ŋeš',
               'giškim': 'ŋiškim',
               'gišnimbar' : 'ŋešnimbar',
               'hun': 'huŋ',
               'kin': 'kiŋ2', 
               'nig2': 'niŋ2',
               'nigin': 'niŋin',
               'nigin2': 'niŋin2',
               'pisan': 'bisaŋ',
               'pirig': 'piriŋ',
               'sag': 'saŋ',
               'sanga': 'saŋŋa',
               'šeg3': 'šeŋ3', 
               'šeg6': 'šeŋ6',
               'umbisag': 'umbisaŋ',
               'uri2': 'urim2',
               'uri5': 'urim5',
               'uru': 'iri',
               
               'bara2' : 'barag',
               'du10' : 'dug3',
               'du11' : 'dug4',
               'gu4' : 'gud',
               'kala' : 'kalag',
               'ku3' : 'kug',
               'ku5': 'kud',
              # 'lu5' : 'lul',
               'sa6' : 'sag9',
               'ša6' : 'sag9',
               'za3' : 'zag'
           } 

In [46]:
for key in bdtns_oracc:
    df['Normalized'] = [signreplace(key, bdtns_oracc[key], word) for word in df['Normalized']]
#normalized = [signreplace(key, bdtns_oracc[key], word) for key in bdtns_oracc for word in df['Normalized']]


# Capitalize god names
God names (as part of personal names) are not consistently capitalized (as in `{d}utu-ki-ag2`). The first character after `{d}` must be a capital. The `regex` for doing so was found [here](http://stackoverflow.com/questions/8934477/making-letters-uppercase-using-re-sub-in-python).

In [47]:
df['Normalized'] = [re.sub('{d}([a-zšŋ])', lambda match: '{d}'+'{}'.format(match.group(1).upper()), word) 
                    for word in df['Normalized']]
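
The substitution can be tried on single names. In this sketch `capitalize_god()` is a hypothetical wrapper; the braces are escaped in the pattern, which matches the same text as the unescaped form used above. The second input is invented.

```python
import re

def capitalize_god(word):
    # Upper-case the first letter after the determinative {d}
    return re.sub(r'\{d\}([a-zšŋ])', lambda m: '{d}' + m.group(1).upper(), word)

a = capitalize_god('{d}utu-ki-ag2')
b = capitalize_god('Ur-{d}šara2')   # invented example
c = capitalize_god('{d}Nanna')      # already capitalized: unchanged
print(a, b, c)
```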

In [48]:
df

Unnamed: 0,Transliteration,Normalized
0,A-AMA-a2-a,A-AMA-a2-a
1,A-AN-ba-az,A-AN-ba-az
2,A-Ab-ba-ge-na-ta,A-Ab-ba-ge-na-ta
3,A-Ad-da,A-Ad-da
4,A-Ad-da-kal-la,A-Ad-da-kal-la
5,A-Ad-da-mu,A-Ad-da-mu
6,A-An-na-hi-li-bi,A-An-na-hi-li-bi
7,A-DU-a,A-DU-a
8,A-DU-ba,A-DU-ba
9,A-DU-ba-bi,A-DU-ba-bi


# Remove Morphology
The following lines remove morphology that can (almost) unambiguously be identified, namely `-ta` (ablative), `-ke4` (genitive + ergative), and `-še3` (terminative). The genitive element of `-ke4` (`-k`) is kept when it immediately follows a vowel because in such cases it usually belongs to the name. Note that `-ra` (dative) is ambiguous, since it may be part of a name ending in /r/ like `Šul-gi-ra` (in the genitive). After removing these morphemes, word-final `-mu` is replaced by `-ŋu10`.

In [49]:
df['Normalized'] = [word[:-3] if word.endswith('-ta') else word for word in df['Normalized']]
df['Normalized'] = [word[:-4] if word.endswith('-še3') else word for word in df['Normalized']]
df['Normalized'] = [word[:-2] if word.endswith(('a-ke4', 'e-ke4', 'i-ke4', 'u-ke4')) else word for word in df['Normalized']]
df['Normalized'] = [word[:-4] if word.endswith('-ke4') else word for word in df['Normalized']]
df['Normalized'] = [word[:-2]+'ŋu10' if word.endswith('-mu') else word for word in df['Normalized']]
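
The order of these steps matters, which is easiest to see on single names. The sketch below applies the same five rules in the same order inside a hypothetical function `strip_morphology()`; the inputs are taken from the name list above.

```python
def strip_morphology(word):
    if word.endswith('-ta'):
        word = word[:-3]
    if word.endswith('-še3'):
        word = word[:-4]
    if word.endswith(('a-ke4', 'e-ke4', 'i-ke4', 'u-ke4')):
        word = word[:-2]              # drop e4, keep the genitive -k
    if word.endswith('-ke4'):
        word = word[:-4]              # no preceding vowel: drop -ke4 entirely
    if word.endswith('-mu'):
        word = word[:-2] + 'ŋu10'     # word-final mu becomes ŋu10
    return word

for name in ['A-kal-la-ta', 'A-kal-la-še3', 'Nin-ŋir2-su-ke4',
             'A-DU.DU-ke4', 'A-DU-mu-ta', 'A-kal-la-ra']:
    print(name, '->', strip_morphology(name))
```

`A-DU-mu-ta` first loses `-ta` and only then has its final `-mu` rewritten, while `A-kal-la-ra` is left untouched because `-ra` is ambiguous.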

In [50]:
df[1000:1050]

Unnamed: 0,Transliteration,Normalized
1000,A-kal-la-me,A-kal-la-me
1001,A-kal-la-mu,A-kal-la-ŋu10
1002,A-kal-la-ra,A-kal-la-ra
1003,A-kal-la-ta,A-kal-la
1004,A-kal-la-ti,A-kal-la-ti
1005,A-kal-la-še3,A-kal-la
1006,A-kal-le,A-kal-le
1007,A-kal-ta,A-kal
1008,A-kap-še-en,A-kap-še-en
1009,A-kar-ni-iš,A-kar-ni-iš


# Names with Genitive -k
Names such as `Nin-ŋir2-su` contain a (hidden) genitive morpheme `-(a)k` that only appears when followed by a vowel, as in `Nin-ŋir2-su-ke4`. In the preceding step `Nin-ŋir2-su-ke4` was shortened to `Nin-ŋir2-su-k` (removing the ergative morpheme). The presence of `Nin-ŋir2-su-k` in the list proves that `Nin-ŋir2-su` has a hidden genitive and should be normalized `Ninŋirsuk`. The following cell tests, for each name, whether the same name with `-k` exists in the list; if so, the `-k` is added to the normalized form of the name.

In [53]:
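# A set makes each membership test fast; the result is the same as testing against the column values
existing = set(df['Normalized'])
df['Normalized'] = [word + '-k' if word + '-k' in existing else word for word in df['Normalized']]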
df.loc[df.Transliteration.str.contains("Nin-gir2-su")]

Unnamed: 0,Transliteration,Normalized
385,A-ba-{d}Nin-gir2-su-gen7,A-ba-{d}Nin-ŋir2-su-gen7
7822,GIŠ.NI-{d}Nin-gir2-su,GIŠ.NI-{d}Nin-ŋir2-su
8484,Geme2-{d}Nin-gir2-su-ka,Geme2-{d}Nin-ŋir2-su-ka-k
8485,Geme2-{d}Nin-gir2-su-ka-ke4,Geme2-{d}Nin-ŋir2-su-ka-k
11878,Inim-{d}Nin-gir2-su,Inim-{d}Nin-ŋir2-su
11879,Inim-{d}Nin-gir2-su-ka-ib2-ta-e3,Inim-{d}Nin-ŋir2-su-ka-ib2-ta-e3
13316,Ku3-{d}Nin-gir2-su,Kug-{d}Nin-ŋir2-su
13317,Ku3-{d}Nin-gir2-su-ka,Kug-{d}Nin-ŋir2-su-ka-k
13318,Ku3-{d}Nin-gir2-su-ka-ke4,Kug-{d}Nin-ŋir2-su-ka-k
13319,Ku3-{d}Nin-gir2-su-ka-ra,Kug-{d}Nin-ŋir2-su-ka-ra
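
The test for a hidden genitive can be illustrated without a DataFrame. In this sketch `add_hidden_genitive()` is a hypothetical function working on a plain list, and the three forms are chosen for illustration.

```python
def add_hidden_genitive(words):
    # A form gets -k if the same form with -k is attested in the list
    existing = set(words)
    return [w + '-k' if w + '-k' in existing else w for w in words]

forms = ['Nin-ŋir2-su', 'Nin-ŋir2-su-k', 'Ur-{d}Utu']
result = add_hidden_genitive(forms)
print(result)
```

As the output above shows, the same mechanism also appends `-k` to forms in `-ka` when a corresponding `-ka-ke4` form is attested, so the results need checking by hand.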


# Replace a-a with aya
Replace a-a and A-a with aya and Aya, but only between word boundaries. For this we can use the function `signreplace()` defined above.


In [54]:
df['Normalized'] = [signreplace('a-a', 'aya', word) for word in df['Normalized']]
df[:100]

Unnamed: 0,Transliteration,Normalized
0,A-AMA-a2-a,A-AMA-a2-a
1,A-AN-ba-az,A-AN-ba-az
2,A-Ab-ba-ge-na-ta,A-Ab-ba-ge-na
3,A-Ad-da,A-Ad-da
4,A-Ad-da-kal-la,A-Ad-da-kal-la
5,A-Ad-da-mu,A-Ad-da-ŋu10
6,A-An-na-hi-li-bi,A-An-na-hi-li-bi
7,A-DU-a,A-DU-a
8,A-DU-ba,A-DU-ba
9,A-DU-ba-bi,A-DU-ba-bi


# Remove dashes and sign index numbers
In order to produce a normalized form of the name, sign separators (dashes) and sign index numbers are removed. In the first instance this is done *only* if there are no uppercase characters further on in the name (uppercase may indicate a logogram). Subsequent cells consider instances where the uppercase letter is due to a god name (as in `A-bu-um-{d}Dumu-zi`).

In [55]:
remove = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '-', ':']
df['Normalized'] = [clean(word, remove) if word[1:].islower() 
                    or (word.startswith('{d}') and word[4:].islower()) else word for word in df['Normalized']]
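
The condition depends on `str.islower()`, which ignores digits and dashes and looks only at cased characters. A minimal sketch, in which `strip_separators()` and `normalize_simple()` are hypothetical stand-ins for `clean(word, remove)` and the list comprehension:

```python
REMOVE = set('0123456789-:')

def strip_separators(word):
    # Same effect as clean(word, remove) for single-character entries
    return ''.join(ch for ch in word if ch not in REMOVE)

def normalize_simple(word):
    # Only names without capitals after the first letter (or after {d} + capital)
    if word[1:].islower() or (word.startswith('{d}') and word[4:].islower()):
        return strip_separators(word)
    return word

for name in ['A-kal-la', 'U2-ta2-mi-šar-ra-am', 'A-DU-ba', '{d}Nin-ŋir2-su']:
    print(name, '->', normalize_simple(name))
```

`A-DU-ba` is left as it is because of the logogram `DU`; the doubled letters in `Utamišarraam` are dealt with in a later step.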


# Theophoric Names
Theophoric names, if the god name is not at the beginning of the word, will have a capital in the middle of the word (as in `A-ba-{d}Dumu-zi-gen7`) and are thus ignored by the previous cell (no removal of dashes etc.). The following cell tests for the presence of `{d}` in the middle of the word (not at position 0) and splits that name into two. Each half may start with a capital, but should be lower case otherwise. If that is the case the dashes and index numbers are removed.

In [56]:
df['Normalized'] = [clean(word, remove) if ('{d}' in word[1:] 
              and word.split('{d}')[0][1:].islower() and word.split('{d}')[1][1:].islower()) else word 
              for word in df['Normalized']]
df[df['Normalized'].str.contains('{d}Dumu')]

Unnamed: 0,Transliteration,Normalized
370,A-ba-{d}Dumu-zi-gen7,Aba{d}Dumuzigen
556,A-bu-um-{d}Dumu-zi,Abuum{d}Dumuzi
3127,Arad2-{d}Dumu-zi,Arad{d}Dumuzi
6359,E2-duru5-Ur-{d}Dumu-zi,E2-duru5-Ur-{d}Dumu-zi
8398,Geme2-{d}Dumu-zi,Geme{d}Dumuzi
8399,Geme2-{d}Dumu-zi-da,Geme{d}Dumuzida
8400,Geme2-{d}Dumu-zi-de2,Geme{d}Dumuzide
8401,Geme2-{d}Dumu-zi-me,Geme{d}Dumuzime
8577,Geme2{d}Dumu-zi,Geme{d}Dumuzi
11850,Inim-{d}Dumu-zi,Inim{d}Dumuzi
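
The split at `{d}` can be followed on individual names from the table above. In this sketch `normalize_theophoric()` is a hypothetical function that applies the same test to one word; `strip_separators()` stands in for `clean(word, remove)`.

```python
REMOVE = set('0123456789-:')

def strip_separators(word):
    return ''.join(ch for ch in word if ch not in REMOVE)

def normalize_theophoric(word):
    if '{d}' in word[1:]:
        parts = word.split('{d}')
        # Each half may start with a capital but must be lower case otherwise
        if parts[0][1:].islower() and parts[1][1:].islower():
            return strip_separators(word)
    return word

for name in ['A-ba-{d}Dumu-zi-gen7', 'Geme2-{d}Dumu-zi', 'E2-duru5-Ur-{d}Dumu-zi']:
    print(name, '->', normalize_theophoric(name))
```

The last name stays unnormalized because its first half contains a second capital (`Ur`).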


# Theophoric Names with {d} twice
Names of the pattern `{d}Amar-{d}Suen` are still not normalized, because of the capital in position 3 (in `Amar`).

In [57]:
df['Normalized'] = [clean(word, remove) if ('{d}' in word[1:] and word.startswith('{d}')
              and word.split('{d}')[1][1:].islower() and word.split('{d}')[2][1:].islower()) else word 
              for word in df['Normalized']]
df[df['Transliteration'].str.contains('Suen')]

Unnamed: 0,Transliteration,Normalized
497,A-bi2-{d}Suen,Abi{d}Suen
1112,A-li2-id-{d}Suen,Aliid{d}Suen
1163,A-ma-an-{d}Suen,Amaan{d}Suen
1218,A-mur-{d}Suen,Amur{d}Suen
1219,A-mur-{d}Suen-še3,Amur{d}Suen
1272,A-na-{d}Suen-tak2-la-ku,Ana{d}Suentaklaku
1389,A-ra-zu-{d}I-bi2-{d}Suen-ka-še3-pa3-da,Arazu{d}Ibi{d}Suenkašepada
1390,A-ra-zu-{d}I-bi2-{d}Suen-na-pa3-da,Arazu{d}Ibi{d}Suennapada
1867,A2-{d}Suen,A2-{d}Suen
1938,AK-{d}Suen,AK-{d}Suen


# Theophoric Names without {d}
God names such as `I-šum` and `E2-a` (and others?) are usually not preceded by `{d}`. We can submit those to a similar test, splitting the name at a dash when followed by `I-šum` and `E2-a`, followed by a word boundary `\\b` (using a positive lookahead regex). If the two halves of the split are both lower case after the initial character, remove dashes and index numbers. If other gods are found that are usually not preceded by `{d}` they can be added to the list `gods_no_d`.

In [58]:
gods_no_d = ['I-šum', 'E2-a', 'Er3-ra', 'A-šur5']
for god in gods_no_d:
    df['Normalized'] = [clean(word, remove) if (god in word[2:] 
              and re.split('-(?=' + god + '\\b)', word)[0][1:].islower() 
              and re.split('-(?=' + god + '\\b)', word)[1][1:].islower()) else word 
              for word in df['Normalized']]
df[df['Transliteration'].str.contains('E2-a')]

Unnamed: 0,Transliteration,Normalized
1204,A-mur-E2-a,AmurEa
1205,A-mur-E2-a-ta,AmurEa
1234,A-na-at-E2-a,AnaatEa
1543,A-zar3-E2-a,AzarEa
2139,Ab-ba-E2-a,AbbaEa
4048,Be-li2-um-E2-a,BeliumEa
4491,DE2-a,DE2-a
4492,DE2-am3,DE2-am3
6097,E2-a,Ea
6098,E2-a-DINGIR,E2-a-DINGIR
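
The lookahead split can be examined on single names. The following sketch uses a hypothetical function `normalize_with_god()`; unlike the cell above it escapes the god name with `re.escape()` and checks that the split actually produced two parts, which is an added safeguard rather than part of the original logic.

```python
import re

REMOVE = set('0123456789-:')

def strip_separators(word):
    return ''.join(ch for ch in word if ch not in REMOVE)

def normalize_with_god(word, god):
    if god not in word[2:]:
        return word
    # Split at the dash that precedes the god name; the name itself is not consumed
    parts = re.split('-(?=' + re.escape(god) + r'\b)', word)
    if len(parts) > 1 and parts[0][1:].islower() and parts[1][1:].islower():
        return strip_separators(word)
    return word

for name in ['A-mur-E2-a', 'Be-li2-um-E2-a', 'DE2-a', 'E2-a-DINGIR']:
    print(name, '->', normalize_with_god(name, 'E2-a'))
```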


# Replace uu by u, etc.
In the normalization replace double vowels and double consonants with single ones (replace `Abuum` with `Abum` and `Abasagga` with `Abasaga`), but not at the beginning or the end of a word.


In [59]:
df['Normalized'] = [re.sub('\\B([a-z])\\1\\B', '\\1', word) for word in df['Normalized']]
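
The role of `\B` (a position that is *not* a word boundary) can be checked on a few forms. In this sketch `collapse_doubles()` is a hypothetical wrapper; the last two inputs are invented.

```python
import re

def collapse_doubles(word):
    # A doubled letter is reduced only when it has word characters on both sides
    return re.sub(r'\B([a-z])\1\B', r'\1', word)

for name in ['Abuum', 'Abasagga', 'Utamišarraam', 'Abee', 'Aššur']:
    print(name, '->', collapse_doubles(name))
```

A double letter at the end of the word is kept, and since the character class `[a-z]` does not include `š` or `ŋ`, doubled `šš` is not reduced either.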

# Alef between Vowels
Put alef between lowercase vowels. Note: for some reason the Alef is represented as an Ayin (`ʿ`) in the code cell. It does, however, correctly produce an Alef (`ʾ`) in the output.

In [60]:
vowel_combis = {"ae":"aʾe",
          "ai": "aʾi",
          "au": "aʾu",
          "ea": "eʾa",
          "ei": "eʾi",
          "eu": "eʾu",
          "ia": "iʾa",
          "ie": "iʾe",
          "iu": "iʾu",
          "ua": "uʾa",
          "ue": "uʾe",
          "ui": "uʾi"
         }



In [61]:
for key in vowel_combis:
    df['Normalized'] = [word.replace(key, vowel_combis[key]) for word in df['Normalized']]
df

Unnamed: 0,Transliteration,Normalized
0,A-AMA-a2-a,A-AMA-a2-a
1,A-AN-ba-az,A-AN-ba-az
2,A-Ab-ba-ge-na-ta,A-Ab-ba-ge-na
3,A-Ad-da,A-Ad-da
4,A-Ad-da-kal-la,A-Ad-da-kal-la
5,A-Ad-da-mu,A-Ad-da-ŋu10
6,A-An-na-hi-li-bi,A-An-na-hi-li-bi
7,A-DU-a,A-DU-a
8,A-DU-ba,A-DU-ba
9,A-DU-ba-bi,A-DU-ba-bi
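
The first rows of the table contain no normalized names, so the effect is not visible there. A small sketch shows it on single forms; here the twelve vowel pairs are generated rather than typed out, and `insert_alef()` is a hypothetical helper.

```python
VOWELS = 'aeiu'
# All pairs of two different vowels, in the same order as the dictionary above
vowel_combis = {a + b: a + 'ʾ' + b for a in VOWELS for b in VOWELS if a != b}

def insert_alef(word):
    for pair, replacement in vowel_combis.items():
        word = word.replace(pair, replacement)
    return word

print(insert_alef('Abi{d}Suen'))
print(insert_alef('AmurEa'))   # capital E: no alef inserted
```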


# Assign Proper Noun Classes
Proper noun classes include:

- `RN`: Royal Name
- `DN`: Divine Name
- `PN`: Personal Name
- `SN`: Settlement Name
- `GN`: Geographical Name (larger geographical units such as states)
- `TN`: Temple Name
- `ON`: Object Name (such as divine vessels and chariots)
- `FN`: Field Name

The class follows a pair of square brackets after the name (as in `Utu[]DN`).

There are only 5 royal names: `UrNammak`, `Šulgir`, `AmarSuʾen`, `ŠuSuʾen`, and `IbbiSuʾen`.

Settlement names are followed by the determinative {ki}.

Divine names are preceded by the determinative {d}.

The rest are considered to be Personal Names. Field names, Temple names, Geographical names, etc. usually cannot be recognized unambiguously.

The first line of the code in the cell below adds `[]PN` to each entry in the Normalization, turning every entry into a Personal Name. Subsequent lines replace `PN` with `SN`, `DN`, or `RN` where appropriate.

A certain amount of error is unavoidable. Personal names are often preceded by `{d}` because they contain a divine name. Similarly, god names may contain place names (as in `{d}Nin-Urim5{ki}`: Lady of Ur).

In [62]:
df['Normalized'] = [word+'[]PN' for word in df['Normalized']]
# Settlement names
df['Normalized'] = [word[:-2]+'SN' if '{ki}' in word else word for word in df['Normalized']]

In [63]:
# Divine names
df['Normalized'] = [word[:-2]+'DN' if word.startswith('{d}') and not '{d}' in word[4:] 
                    else word for word in df['Normalized']]
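
The order in which the classes are assigned decides the outcome for names that carry more than one determinative. The sketch below applies the same three steps to single words in a hypothetical function `classify()`; the inputs are illustrative normalized forms.

```python
def classify(word):
    word = word + '[]PN'                      # default: personal name
    if '{ki}' in word:
        word = word[:-2] + 'SN'               # settlement name
    if word.startswith('{d}') and '{d}' not in word[4:]:
        word = word[:-2] + 'DN'               # divine name
    return word

for name in ['Umma{ki}', '{d}Utu', 'Ur{d}Utu', '{d}Amar{d}Suʾen', '{d}NinUrim{ki}']:
    print(classify(name))
```

A name with both `{d}` at the start and `{ki}` ends up as `DN`, because the divine-name test runs last.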

In [64]:
# Royal names
Shulgi = ['Šulgira', 'Šulgi', '{d}Šulgira', '{d}Šulgi']
AmarSuen = ['Amar{d}Suʾenra', 'Amar{d}Suʾen', 'Amar{d}Suʾenka', '{d}Amar{d}Suʾenra', 
            '{d}Amar{d}Suʾen', '{d}Amar{d}Suʾenka']
ShuSuen = ['Šu{d}Suʾen', 'Šu{d}Suʾenra', 'Šu{d}Suʾenka',
          '{d}Šu{d}Suʾen', '{d}Šu{d}Suʾenra', '{d}Šu{d}Suʾenka']
IbbiSuen = ['Ibi{d}Suʾen', 'Ibi{d}Suʾenra', 'Ibi{d}Suʾenka',
          '{d}Ibi{d}Suʾen', '{d}Ibi{d}Suʾenra', '{d}Ibi{d}Suʾenka']
df['Normalized'] = ['Šulgir[]RN' if word[:-4] in Shulgi else word for word in df['Normalized']]
df['Normalized'] = ['AmarSuʾen[]RN' if word[:-4] in AmarSuen else word for word in df['Normalized']]
df['Normalized'] = ['ŠuSuʾen[]RN' if word[:-4] in ShuSuen else word for word in df['Normalized']]
df['Normalized'] = ['IbbiSuʾen[]RN' if word[:-4] in IbbiSuen else word for word in df['Normalized']]
df['Normalized'] = ['UrNammak[]RN' if 'Ur{d}Nama' in word else word for word in df['Normalized']]

In [65]:
df[df['Normalized'].str.contains('[]RN', regex=False)]

Unnamed: 0,Transliteration,Normalized
2714,Amar-{d}Suen,AmarSuʾen[]RN
2718,Amar-{d}Suen-ka-še3,AmarSuʾen[]RN
2719,Amar-{d}Suen-ke4,AmarSuʾen[]RN
2723,Amar-{d}Suen-ra,AmarSuʾen[]RN
10196,I-bi-{d}Suen,IbbiSuʾen[]RN
10220,I-bi2-{d}Suen,IbbiSuʾen[]RN
10221,I-bi2-{d}Suen-ka,IbbiSuʾen[]RN
25391,Ur-{d}Namma,UrNammak[]RN
25392,Ur-{d}Namma-ka,UrNammak[]RN
25393,Ur-{d}Namma-ka-še3,UrNammak[]RN


# Remove Determinatives
Remove determinatives, but only from those names that have been successfully normalized (do not contain `-` or `.` anymore).

In [66]:
df['Normalized'] = [re.sub('{.+?}', '', word) if not '-' in word and not '.' in word 
                    else word for word in df['Normalized']]
df

Unnamed: 0,Transliteration,Normalized
0,A-AMA-a2-a,A-AMA-a2-a[]PN
1,A-AN-ba-az,A-AN-ba-az[]PN
2,A-Ab-ba-ge-na-ta,A-Ab-ba-ge-na[]PN
3,A-Ad-da,A-Ad-da[]PN
4,A-Ad-da-kal-la,A-Ad-da-kal-la[]PN
5,A-Ad-da-mu,A-Ad-da-ŋu10[]PN
6,A-An-na-hi-li-bi,A-An-na-hi-li-bi[]PN
7,A-DU-a,A-DU-a[]PN
8,A-DU-ba,A-DU-ba[]PN
9,A-DU-ba-bi,A-DU-ba-bi[]PN
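
Since the first rows of the table are all unnormalized, the removal is not visible there. A short sketch on single forms, with a hypothetical function `drop_determinatives()` and braces escaped in the pattern:

```python
import re

def drop_determinatives(word):
    # Only touch names that no longer contain sign separators
    if '-' not in word and '.' not in word:
        return re.sub(r'\{.+?\}', '', word)
    return word

for name in ['Aba{d}Dumuzigen[]PN', 'Umma{ki}[]SN', 'E2-duru5-Ur-{d}Dumu-zi[]PN']:
    print(drop_determinatives(name))
```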


# Replace numbers by Index numbers
In names that could not be normalized automatically, replace numbers by index numbers and dashes (`-`) by dots (`.`).

In [67]:
numbers_index = {'0':'₀',
               '1': '₁',
               '2':'₂',
               '3':'₃',
               '4':'₄',
               '5':'₅',
               '6':'₆',
               '7':'₇',
               '8':'₈',
               '9':'₉',
                '-': '.'}

In [68]:
for key in numbers_index:
    df['Normalized'] = [word.replace(key, numbers_index[key]) for word in df['Normalized']]
df

Unnamed: 0,Transliteration,Normalized
0,A-AMA-a2-a,A.AMA.a₂.a[]PN
1,A-AN-ba-az,A.AN.ba.az[]PN
2,A-Ab-ba-ge-na-ta,A.Ab.ba.ge.na[]PN
3,A-Ad-da,A.Ad.da[]PN
4,A-Ad-da-kal-la,A.Ad.da.kal.la[]PN
5,A-Ad-da-mu,A.Ad.da.ŋu₁₀[]PN
6,A-An-na-hi-li-bi,A.An.na.hi.li.bi[]PN
7,A-DU-a,A.DU.a[]PN
8,A-DU-ba,A.DU.ba[]PN
9,A-DU-ba-bi,A.DU.ba.bi[]PN
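
The same mapping can also be expressed as a single translation table. This is an alternative sketch using `str.maketrans()` and `str.translate()`, not the approach taken in the cell above; `to_index()` is a hypothetical name.

```python
# Digits become subscript index numbers, dashes become dots
TO_INDEX = str.maketrans('0123456789-', '₀₁₂₃₄₅₆₇₈₉.')

def to_index(word):
    return word.translate(TO_INDEX)

print(to_index('A-AMA-a2-a[]PN'))
print(to_index('A-Ad-da-ŋu10[]PN'))
```

A translation table makes one pass over each string instead of eleven successive replacements.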


# Statistics
In the current set, how many name forms and how many names? How many name forms could not be normalized?

In [69]:
not_norm_df = df[df['Normalized'].str.contains('.', regex=False)]
norm_df = df[~df['Normalized'].str.contains('.', regex=False)]
not_normalized = len(not_norm_df)
norm = len(norm_df)
norm_set = len(set(norm_df['Normalized']))
names_forms = len(df)
print('Name Instances ' + str(total_names))
print('Name forms: ' + str(names_forms))
print('Name forms Normalized: ' + str(norm) + "; representing " + str(norm_set) + " different names.")
print('Name forms not normalized: ' + str(not_normalized))

Name Instances 438642
Name forms: 32318
Name forms Normalized: 24913; representing 18328 different names.
Name forms not normalized: 7405


# Utamišaram
The Ur III name `Utamišaram` appears in many different spellings and name forms. How did our script do with him?


In [71]:
miša = df[df['Transliteration'].str.contains('mi-ša')]
Utamišaram = miša[miša['Transliteration'].str[0]=='U']
Utamišaram

Unnamed: 0,Transliteration,Normalized
22612,U2-ta-mi-šar-ra-am,Utamišaram[]PN
22613,U2-ta-mi-šar-ra-am-ta,Utamišaram[]PN
22614,U2-ta-mi-šar-ru-um,Utamišarum[]PN
22621,U2-ta2-mi-ša-ra-am,Utamišaram[]PN
22622,U2-ta2-mi-šar-MI-ra-am,U₂.ta₂.mi.šar.MI.ra.am[]PN
22623,U2-ta2-mi-šar-am,Utamišaram[]PN
22624,U2-ta2-mi-šar-am-ta,Utamišaram[]PN
22625,U2-ta2-mi-šar-ra-am,Utamišaram[]PN
22626,U2-ta2-mi-šar-ra-am-ta,Utamišaram[]PN
22627,U2-ta2-mi-šar-ru-um,Utamišarum[]PN


It turns out that 10 different spellings/forms are correctly identified with `Utamišaram`; others are normalized as `Udamišaram` or `Utamišarum` or, in one case, as `U₂.ta₂.mi.šar.MI.ra.am` (apparently an ancient spelling mistake). Such entries need to be corrected by hand. Further research shows that in one case `U2-ta-mi-car-um-ta` is a modern mistake (= `Utamišaram`); the other cases of `Utamišarum` are correctly transliterated and represent a variant form of the same name, referring to the same person (an official in the Drehem administration) as `Utamišaram`.

# Save to File
Separator is a `TAB` (instead of comma).

In [72]:
df.to_csv('output/UrIII-Names.csv', sep = '\t', encoding='utf-8')