This Notebook attempts to isolate names (personal names, place names, royal names, divine names) in a collection of texts downloaded from [BDTNS](http://bdtns.filol.csic.es/). The names are formatted to correspond to current [ORACC](http://oracc.org) conventions. The resulting list (name instances with normalized names) will need hand editing.

The original Notebook took all names from the entire [BDTNS](http://bdtns.filol.csic.es/) database. The current version selects names from texts that are cataloged (in [BDTNS](http://bdtns.filol.csic.es/)) as coming from Puzriš-Dagan.

In [1]:
import pandas as pd
import re

Create list of Puzriš-Dagan texts only

In [2]:
with open('data/query_cat_17_09_6-194802.txt') as f:
    puzd_df = pd.read_csv(f, sep="\t", header=None, names = ["BDTNS", "CDLI"])
puzd_df
puzd_l = list(puzd_df["BDTNS"])
puzd_l = [str(no).zfill(6) for no in puzd_l]

In [3]:
with open('data/query_text_16_12_29-052645.txt', mode = 'r', encoding = 'utf8') as f:
    linelist = f.read().splitlines()

The function `select_puzd()` sets a flag to distinguish between texts from Puzriš-Dagan and other texts. If the 6-digit code at the beginning of the header (the [BDTNS](http://bdtns.filol.csic.es/) text number) corresponds to a number in the list `puzd_l` the flag is set to `True`, otherwise to `False`.

In [4]:
flag = False
puzd_lines = []
for line in linelist:
    if line[:6].isdigit():
        if line[:line.find("\t")] in puzd_l:
            flag = True
        else:
            flag = False
    if flag:
        puzd_lines.append(line)

In [5]:
len(puzd_lines)

414390

In [6]:
linelist = puzd_lines

# Clean
first remove all flags etc., including brackets and half brackets


In [7]:
def clean(data, oldlist, new=""):
    for i in range(len(oldlist)):
        data = data.replace(oldlist[i],new)
    return data

In [8]:
remlist = ["`", "´", "/", "]", "[", "!", "?", "<", ">", "(", ")"]
lines = [clean(line, remlist) for line in linelist]

# Remove Editorial Lines and Headers
Header lines begin with a digit. Editorial remarks begin with a line indicator (like data lines), followed by one or more TABs, followed by #.

In [9]:
lines = [line for line in lines if not line[0].isdigit()]
lines = [line for line in lines if not "\t#" in line]

# Extract Names
In BDTNS names start with a capital or with {d}. The list comprehension iterates through the list of lines. It iterates through each line by splitting the line into words, testing whether the word begins with a capital or with {d}. The result is a list.

In [10]:
names = [word for line in lines for word in line.split() if word[0].isupper() or word.startswith('{d}')]
len(names)

165425

# Remove All Upper Case
Remove words that are entirely in upper case (for instance LU2.SU)

In [11]:
names =[word for word in names if not word.isupper()]
len(names)

160536

# Remove Incomplete Entries
Remove names that have ellipsis (damage) or illegible signs.

In [13]:
names = [word for word in names if not '...' in word]
names = [word for word in names if not '-x' in word.lower() 
         and not 'x-' in word.lower() 
         and not '.x' in word.lower() 
         and not 'x.' in word.lower()
        and not '}x' in word.lower()
        and not 'x{' in word.lower()]
len(names)

156112

# Remove duplicates
Reduce the list to a unique set.

In [14]:
total_names = len(names)
#keep the total number of names for statistics
names = set(names)
len(names)

6077

# Sort Alphabetically

In [15]:
names = sorted(names)
names

['A-AN-ba-az',
 'A-KU-um',
 'A-KU.KU-ta',
 'A-NI-ta',
 'A-U.E2-nu-tuku',
 'A-a-',
 'A-a-bad3',
 'A-a-ce3',
 'A-a-dingir',
 'A-a-dingir-mu',
 'A-a-dingir-mu-ta',
 'A-a-dingir-ta',
 'A-a-i3-li2',
 'A-a-i3-li2-cu',
 'A-a-kal-la',
 'A-a-kal-la-mu',
 'A-a-kal-la-ta',
 'A-a-ma',
 'A-a-mu',
 'A-a-mu-ce3',
 'A-a-mu-dah',
 'A-a-mu-ta',
 'A-a-ni',
 'A-a-ri2-mu-ce3',
 'A-a-ti',
 'A-a-u4-su3-ce3',
 'A-a-ur-sag',
 'A-a-{d}Nanna-ar-kal-la',
 'A-ab-ba',
 'A-ab-ba-a',
 'A-ab-ba-a-ta',
 'A-ab-ba-bi',
 'A-ab-ba-bi-ta',
 'A-ab-ba-mmu',
 'A-ab-ba-mu',
 'A-ab-ba-ni',
 'A-ab-ba-sig5',
 'A-ab-ba-ta',
 'A-ab-mi-ra-din',
 'A-ad-da',
 'A-ad-da-a',
 'A-ad-da-mu',
 'A-ad-da-ni-ra-am-an-ni',
 'A-al-la',
 'A-al-la-mu',
 'A-al-la-mu-ta',
 'A-al-mu',
 'A-al-mu-ta',
 'A-am-ma',
 'A-an-na',
 'A-an-na-ti',
 'A-ar-ba-tal',
 'A-ba',
 'A-ba-al-la-ta',
 'A-ba-al{ki}',
 'A-ba-ar-da',
 'A-ba-ar-du-uk',
 'A-ba-ar-ni-um{ki}',
 'A-ba-ba',
 'A-ba-cec-mu-gen7',
 'A-ba-dingir-gen7',
 'A-ba-dingir-mu-gen7',
 'A-ba-e-ne-gen7',
 'A-ba

# Create DataFrame
The DataFrame has two columns: column 1 has the original transliteration (as in BDTNS); column 2 starts out with the same data, but this data is transformed into ORACC compatible lemmatization

In [16]:
df = pd.DataFrame(names)

In [17]:
df.columns = ['Transliteration']

In [18]:
df['Normalized'] = df['Transliteration']

# Shin, Emphatic T, and Emphatic S
Replace c by š, C by Š, ty by ṭ, etc.

In [19]:
signs = {'c': 'š',
        'C': 'Š',
        'ty': 'ṭ',
        'TY': 'Ṭ',
        'sy': 'ṣ',
        'SY': 'Ṣ'}
for key in signs:
    df['Normalized'] = [word.replace(key, signs[key]) for word in df['Normalized']]

# Sign Reading Substitution
The conventions for reading signs in BDTNS differs from ORACC: BDTNS does not distinguish between G and nasal G (ŋ), and BDTNS uses short readings (`ku3`) where ORACC uses long readings (`kug`).

The following function (`signreplace()`) replaces a BDTNS reading with an ORACC reading. The regular expression uses `\\b` (before and after the sign) to indicate word boundaries, so that replacing `sag` by `saŋ` does not find `sag2` etc. Word boundaries (as defined by the `regex` module) include `-`, `.`, `{`, and `}`.

Since names are capitalized (as in `Dingir-nu-me-a`) each sign-replacement is run twice: once in lower case (`dingir`, replaced by `diŋir`) and once capitalized (`Dingir`, replaced by `Diŋir`).  

In [20]:
def signreplace(old, new, data):
    old_cap = old.capitalize()
    new_cap = new.capitalize()
    data = re.sub('\\b'+old_cap+'\\b', new_cap, data)
    data = re.sub('\\b'+old+'\\b', new, data)
    return data

# Dictionary of signs with canonical ORACC reading
Preliminary list of "short" vs. "long" sign readings (`du11` vs. `dug4`) and sign readings with nasal G (ŋ). The list does *not* include `mu` : `ŋu10`, because that is valid *only* at the end of a word. It is necessary to first remove morphological suffixes such as -ta, -še3, etc., which happens in the next phase.

In [21]:
bdtns_oracc = {'ag2': 'aŋ2',
               'balag': 'balaŋ', 
               'dagal': 'daŋal', 
               'dingir': 'diŋir',
               'eridu': 'eridug',
               'ga2': 'ŋa2', 
               'gar': 'ŋar',
               'geštin': 'ŋeštin',
               'gir2': 'ŋir2',
               'gir3': 'ŋiri3',
               'giri3': 'ŋiri3',
               'giš': 'ŋeš',
               'giškim': 'ŋiškim',
               'gišnimbar' : 'ŋešnimbar',
               'hun': 'huŋ',
               'kin': 'kiŋ2', 
               'nig2': 'niŋ2',
               'nigin': 'niŋin',
               'nigin2': 'niŋin2',
               'pisan': 'bisaŋ',
               'pirig': 'piriŋ',
               'sag': 'saŋ',
               'sanga': 'saŋŋa',
               'šeg3': 'šeŋ3', 
               'šeg6': 'šeŋ6',
               'umbisag': 'umbisaŋ',
               'uri2': 'urim2',
               'uri5': 'urim5',
               'uru': 'iri',
               
               'bara2' : 'barag',
               'du10' : 'dug3',
               'du11' : 'dug4',
               'gu4' : 'gud',
               'kala' : 'kalag',
               'ku3' : 'kug',
               'ku5': 'kud',
              # 'lu5' : 'lul',
               'sa6' : 'sag9',
               'ša6' : 'sag9',
               'za3' : 'zag'
           } 

In [22]:
for key in bdtns_oracc:
    df['Normalized'] = [signreplace(key, bdtns_oracc[key], word) for word in df['Normalized']]
#normalized = [signreplace(key, bdtns_oracc[key], word) for key in bdtns_oracc for word in df['Normalized']]


# Capitalize god names
God names (as part of personal names) are not consistently capitalized (as in `{d}utu-ki-ag2`). The first character after `{d}` must be a capital. The `regex` for doing so was found [here](http://stackoverflow.com/questions/8934477/making-letters-uppercase-using-re-sub-in-python).

In [23]:
df['Normalized'] = [re.sub('{d}([a-zšŋ])', lambda match: '{d}'+'{}'.format(match.group(1).upper()), word) 
                    for word in df['Normalized']]

In [24]:
df

Unnamed: 0,Transliteration,Normalized
0,A-AN-ba-az,A-AN-ba-az
1,A-KU-um,A-KU-um
2,A-KU.KU-ta,A-KU.KU-ta
3,A-NI-ta,A-NI-ta
4,A-U.E2-nu-tuku,A-U.E2-nu-tuku
5,A-a-,A-a-
6,A-a-bad3,A-a-bad3
7,A-a-ce3,A-a-še3
8,A-a-dingir,A-a-diŋir
9,A-a-dingir-mu,A-a-diŋir-mu


# Remove Morphology
The following lines remove morphology that can (almost) unambigously be identified, namely `-ta` (ablative); `ke4` (genitive + ergative) and `-še3`. The genitive element of `-ke4` (`-k`) is kept when it immediately follows a vowel because in such cases it usually belongs to the name. Note that `-ra` (dative) is ambiguous, since it may be part of a name ending in /r/ like `Šul-gi-ra` (in the genitive). After removing these morphemes, word-final `-mu` is replaced by `-ŋu10`.

In [25]:
df['Normalized'] = [word[:-3] if word.endswith('-ta') else word for word in df['Normalized']]
df['Normalized'] = [word[:-4] if word.endswith('-še3') else word for word in df['Normalized']]
df['Normalized'] = [word[:-2] if word.endswith('a|e|i|u' + '-ke4') else word for word in df['Normalized']]
df['Normalized'] = [word[:-4] if word.endswith('-ke4') else word for word in df['Normalized']]
df['Normalized'] = [word[:-2]+'ŋu10' if word.endswith('-mu') else word for word in df['Normalized']]

In [26]:
df[1000:1050]

Unnamed: 0,Transliteration,Normalized
1000,Ce-ti-ir-ca{ki},Še-ti-ir-ša{ki}
1001,Cec-Da-da,Šeš-Da-da
1002,Cec-a-ni,Šeš-a-ni
1003,Cec-cec,Šeš-šeš
1004,Cec-cec-mu,Šeš-šeš-ŋu10
1005,Cec-cec-ta,Šeš-šeš
1006,Cec-da-da,Šeš-da-da
1007,Cec-gi,Šeš-gi
1008,Cec-kal-la,Šeš-kal-la
1009,Cec-kal-la-ke4,Šeš-kal-la


# Names with Genitive -k
Names such as `Nin-ŋir2-su` contain a (hidden) genitive morpheme `-(a)k` that only appears when followed by a vowel, as in `Nin-ŋir2-su-ke4`. In the preceding `Nin-ŋir2-su-ke4` has been shortened to `Nin-ŋir2-su-k` (removing the ergative morpheme). The presence of `Nin-ŋir2-su-k` in the list proves that `Nin-ŋir2-su` has a hidden geneitive and should be normalized `Ninŋirsuk`. The following cell tests for the existence of such instances, if yes, the `-k` is added to the normalized form of the name.

In [27]:
df['Normalized'] = [word + '-k' if word + '-k' in df.Normalized.values else word for word in df['Normalized']]

df[1000:1050]

Unnamed: 0,Transliteration,Normalized
1000,Ce-ti-ir-ca{ki},Še-ti-ir-ša{ki}
1001,Cec-Da-da,Šeš-Da-da
1002,Cec-a-ni,Šeš-a-ni
1003,Cec-cec,Šeš-šeš
1004,Cec-cec-mu,Šeš-šeš-ŋu10
1005,Cec-cec-ta,Šeš-šeš
1006,Cec-da-da,Šeš-da-da
1007,Cec-gi,Šeš-gi
1008,Cec-kal-la,Šeš-kal-la
1009,Cec-kal-la-ke4,Šeš-kal-la


# Replace a-a with aya
Replace a-a and A-a with aya and Aya, but only between word boundaries. For this we can use the function `signreplace()` defined above.


In [28]:
df['Normalized'] = [signreplace('a-a', 'aya', word) for word in df['Normalized']]
df[:100]

Unnamed: 0,Transliteration,Normalized
0,A-AN-ba-az,A-AN-ba-az
1,A-KU-um,A-KU-um
2,A-KU.KU-ta,A-KU.KU
3,A-NI-ta,A-NI
4,A-U.E2-nu-tuku,A-U.E2-nu-tuku
5,A-a-,Aya-
6,A-a-bad3,Aya-bad3
7,A-a-ce3,Aya
8,A-a-dingir,Aya-diŋir
9,A-a-dingir-mu,Aya-diŋir-ŋu10


# Remove dashes and sign index numbers
In order to produce a normalized form of the name, sign separators (dashes) and sign index numbers are removed. In first instance this is done *only* if there are no uppercase characters further on in the name (uppercase may indicate a logogram). Secondarily we will consider instances where the uppercase letter is dues to a god name (as in `A-bu-um-{d}Dumu-zi`)

In [29]:
remove = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '-', ':']
df['Normalized'] = [clean(word, remove) if word[1:].islower() 
                    or (word.startswith('{d}') and word[4:].islower()) else word for word in df['Normalized']]


# Theophoric Names
Theophoric names, if the god name is not at the beginning of the word, will have a capital in the middle of the word (as in `A-ba-{d}Dumu-zi-gen7`) and are thus ignored by the previous cell (no removal of dashes etc.). The following cell tests for the presence of `{d}` in the middle of the word (not at position 0) and splits that name into two. Each half may start with a capital, but should be lower case otherwise. If that is the case the dashes and index numbers are removed.

In [30]:
df['Normalized'] = [clean(word, remove) if ('{d}' in word[1:] 
              and word.split('{d}')[0][1:].islower() and word.split('{d}')[1][1:].islower()) else word 
              for word in df['Normalized']]
df[df['Normalized'].str.contains('{d}Dumu')]

Unnamed: 0,Transliteration,Normalized
113,A-bu-um-{d}Dumu-zi,Abuum{d}Dumuzi
2518,Inim-{d}Dumu-zi-da,Inim{d}Dumuzida
3093,Lu2-{d}Dumu-zi,Lu{d}Dumuzi
3094,Lu2-{d}Dumu-zi-da,Lu{d}Dumuzida
3095,Lu2-{d}Dumu-zi-da-ta,Lu{d}Dumuzida
4813,Ur-{d}Dumu-zi,Ur{d}Dumuzi
4814,Ur-{d}Dumu-zi-da,Ur{d}Dumuzida
4815,Ur-{d}Dumu-zi-da-ce3,Ur{d}Dumuzida
4816,Ur-{d}Dumu-zi-da-ke4,Ur{d}Dumuzida
4817,Ur-{d}Dumu-zi-da-ta,Ur{d}Dumuzida


# Theophoric Names with {d} twice
Names of the pattern `{d}Amar-{d}Suen` are still not normalized, because of the capital in position 3 (in `Amar`).

In [31]:
df['Normalized'] = [clean(word, remove) if ('{d}' in word[1:] and word.startswith('{d}')
              and word.split('{d}')[1][1:].islower() and word.split('{d}')[2][1:].islower()) else word 
              for word in df['Normalized']]
df[df['Transliteration'].str.contains('Suen')]

Unnamed: 0,Transliteration,Normalized
251,A-mur-{d}Suen,Amur{d}Suen
288,A-ra-zu-{d}I-bi2-{d}Suen-ka-ce3-pa3-da,Arazu{d}Ibi{d}Suenkašepada
289,A-ra-zu-{d}I-bi2-{d}Suen-na-pa3-da,Arazu{d}Ibi{d}Suennapada
568,Amar-{d}Suen,Amar{d}Suen
569,Amar-{d}Suen-ra,Amar{d}Suenra
637,Arad2-{d}Amar-{d}Suen,Arad{d}Amar{d}Suen
638,Arad2-{d}Amar-{d}Suen-ta,Arad{d}Amar{d}Suen
911,Ca-at-{d}Cu-{d}Suen,Šaat{d}Šu{d}Suen
914,Ca-at-{d}Suen,Šaat{d}Suen
915,Ca-at-{d}Suen-ce3,Šaat{d}Suen


# Theophoric Names without {d}
God names such as `I-šum` and `E2-a` (and others?) are usually not preceded by `{d}`. We can submit those to a similar test, splitting the name at a dash when followed by `I-šum` and `E2-a`, followed by a word boundary `\\b` (using a positive lookahead regex). If the two halves of the split are both lower case after the initial character, remove dashes and index numbers. If other gods are found that are usually not preceded by `{d}` they can be added to the list `gods_no_d`.

In [32]:
gods_no_d = ['I-šum', 'E2-a', 'Er3-ra', 'A-šur5']
for god in gods_no_d:
    df['Normalized'] = [clean(word, remove) if (god in word[2:] 
              and re.split('-(?=' + god + '\\b)', word)[0][1:].islower() 
              and re.split('-(?=' + god + '\\b)', word)[1][1:].islower()) else word 
              for word in df['Normalized']]
df[df['Transliteration'].str.contains('E2-a')]

Unnamed: 0,Transliteration,Normalized
247,A-mur-E2-a,AmurEa
248,A-mur-E2-a-ta,AmurEa
255,A-na-ab-E2-a,AnaabEa
257,A-na-at-E2-a,AnaatEa
904,Ca-at-E2-a,ŠaatEa
1060,Cu-E2-a,ŠuEa
1494,E2-a-DINGIR-ta,E2-a-DINGIR
1495,E2-a-DU.DU,E2-a-DU.DU
1496,E2-a-KAL,E2-a-KAL
1497,E2-a-ba-ac-ti,Eabaašti


# Replace uu by u, etc.
In the normalization replace double vowels and double consonants with single ones (replace `Abuum` with `Abum` and `Abasagga` with `Abasaga`), but not at the beginning or the end of a word.


In [33]:
df['Normalized'] = [re.sub('\\B([a-z])\\1\\B', '\\1', word) for word in df['Normalized']]

# Alef between Vowels
Put alef between lowercase vowels. Note: for some reason the Alef is represented as an Ayin (`ʾ`) in the code cell. It does. however, correctly produce an Alef (`ʿ`) in the output.

In [34]:
vowel_combis = {"ae":"aʾe",
          "ai": "aʾi",
          "au": "aʾu",
          "ea": "eʾa",
          "ei": "eʾi",
          "eu": "eʾu",
          "ia": "iʾa",
          "ie": "iʾe",
          "iu": "iʾu",
          "ua": "uʾa",
          "ue": "uʾe",
          "ui": "uʾi"
         }



In [35]:
for key in vowel_combis:
    df['Normalized'] = [word.replace(key, vowel_combis[key]) for word in df['Normalized']]
df

Unnamed: 0,Transliteration,Normalized
0,A-AN-ba-az,A-AN-ba-az
1,A-KU-um,A-KU-um
2,A-KU.KU-ta,A-KU.KU
3,A-NI-ta,A-NI
4,A-U.E2-nu-tuku,A-U.E2-nu-tuku
5,A-a-,Aya
6,A-a-bad3,Ayabad
7,A-a-ce3,Aya
8,A-a-dingir,Ayadiŋir
9,A-a-dingir-mu,Ayadiŋirŋu


# Assign Proper Noun Classes
Proper noun classes include:
    - RN       Royal Name
    - DN       Divine Name
    - PN       Personal Name
    - SN       Settlement Name
    - GN       Geographical Name (larger Geographical units such as states)
    - TN       Temple Name
    - ON       Object Name (such as divine vessels and chariots)
    - FN       Field Name

The class is indicated after square brackets after the name (as in `Utu[]DN`).

There are only 5 royal names: `UrNammak`, `Šulgir`, `AmarSuen`, `ŠuSuen`, and `IbbiSuen`.

Settlement names are followed by the determinative {ki}.

Divine names are preceded by the determinative {d}.

The rest are considered to be Personal Names. Field names, Temple names, and Geographical names etc. can usually not  be recognized unambiguously.

The first line of the code in the cell below add `[]PN` to each entry in the Normalization, turning every entry into a Personal Name. Subsequent lines replace `PN` with `SN`, `DN`, or `RN` where appropriate.

A certain amount of error is unavoidable. Personal names are often preceded by `{d}` because they contain a divine name. Similarly, god names may contain place names (as in `{d}Nin-Urim5{ki}`: Lady of Ur).

In [36]:
df['Normalized'] = [word+'[]PN' for word in df['Normalized']]
# Settlement names
df['Normalized'] = [word[:-2]+'SN' if '{ki}' in word else word for word in df['Normalized']]

In [37]:
# Divine names
df['Normalized'] = [word[:-2]+'DN' if word.startswith('{d}') and not '{d}' in word[4:] 
                    else word for word in df['Normalized']]

In [38]:
# Royal names
Shulgi = ['Šulgira', 'Šulgi', '{d}Šulgira', '{d}Šulgi']
AmarSuen = ['Amar{d}Suʾenra', 'Amar{d}Suʾen', 'Amar{d}Suʾenka', '{d}Amar{d}Suʾenra', 
            '{d}Amar{d}Suʾen', '{d}Amar{d}Suʾenka']
ShuSuen = ['Šu{d}Suʾen', 'Šu{d}Suʾenra', 'Šu{d}Suʾenka',
          '{d}Šu{d}Suʾen', '{d}Šu{d}Suʾenra', '{d}Šu{d}Suʾenka']
IbbiSuen = ['Ibi{d}Suʾen', 'Ibi{d}Suʾenra', 'Ibi{d}Suʾenka',
          '{d}Ibi{d}Suʾen', '{d}Ibi{d}Suʾenra', '{d}Ibi{d}Suʾenka']
df['Normalized'] = ['Šulgir[]RN' if word[:-4] in Shulgi else word for word in df['Normalized']]
df['Normalized'] = ['AmarSuʾen[]RN' if word[:-4] in AmarSuen else word for word in df['Normalized']]
df['Normalized'] = ['ŠuSuʾen[]RN' if word[:-4] in ShuSuen else word for word in df['Normalized']]
df['Normalized'] = ['IbbiSuʾen[]RN' if word[:-4] in IbbiSuen else word for word in df['Normalized']]
df['Normalized'] = ['UrNammak[]RN' if 'Ur{d}Nama' in word else word for word in df['Normalized']]

In [39]:
df[df['Normalized'].str.contains('[]RN', regex=False)]

Unnamed: 0,Transliteration,Normalized
568,Amar-{d}Suen,AmarSuʾen[]RN
569,Amar-{d}Suen-ra,AmarSuʾen[]RN
1200,Cu-{d}Suen,ŠuSuʾen[]RN
1212,Cul-gi,Šulgir[]RN
1213,Cul-gi-ra,Šulgir[]RN
2135,I-bi2-{d}Suen,IbbiSuʾen[]RN
4874,Ur-{d}Namma,UrNammak[]RN
4875,Ur-{d}Namma-ka,UrNammak[]RN
4876,Ur-{d}Namma-ta,UrNammak[]RN
5208,{d}Amar-{d}Suen,AmarSuʾen[]RN


# Remove Determinatives
Remove determinatives, but only from those names that have been normalized (do not contain `-` or `.` anymore).

In [40]:
df['Normalized'] = [re.sub('{.+?}', '', word) if not '-' in word and not '.' in word 
                    else word for word in df['Normalized']]
df

Unnamed: 0,Transliteration,Normalized
0,A-AN-ba-az,A-AN-ba-az[]PN
1,A-KU-um,A-KU-um[]PN
2,A-KU.KU-ta,A-KU.KU[]PN
3,A-NI-ta,A-NI[]PN
4,A-U.E2-nu-tuku,A-U.E2-nu-tuku[]PN
5,A-a-,Aya[]PN
6,A-a-bad3,Ayabad[]PN
7,A-a-ce3,Aya[]PN
8,A-a-dingir,Ayadiŋir[]PN
9,A-a-dingir-mu,Ayadiŋirŋu[]PN


# Replace numbers by Index numbers
In names that could not be normalized automatically, replace numbers by index numbers and dashes (`-`) by dots (`.`).

In [41]:
numbers_index = {'0':'₀',
               '1': '₁',
               '2':'₂',
               '3':'₃',
               '4':'₄',
               '5':'₅',
               '6':'₆',
               '7':'₇',
               '8':'₈',
               '9':'₉',
                '-': '.'}

In [42]:
for key in numbers_index:
    df['Normalized'] = [word.replace(key, numbers_index[key]) for word in df['Normalized']]
df

Unnamed: 0,Transliteration,Normalized
0,A-AN-ba-az,A.AN.ba.az[]PN
1,A-KU-um,A.KU.um[]PN
2,A-KU.KU-ta,A.KU.KU[]PN
3,A-NI-ta,A.NI[]PN
4,A-U.E2-nu-tuku,A.U.E₂.nu.tuku[]PN
5,A-a-,Aya[]PN
6,A-a-bad3,Ayabad[]PN
7,A-a-ce3,Aya[]PN
8,A-a-dingir,Ayadiŋir[]PN
9,A-a-dingir-mu,Ayadiŋirŋu[]PN


# Statistics
In the current set, how many name forms and how many names? How many name forms could not be normalized?

In [43]:
not_norm_df = df[df['Normalized'].str.contains('.', regex=False)]
norm_df = df[~df['Normalized'].str.contains('.', regex=False)]
not_normalized = len(not_norm_df)
norm = len(norm_df)
norm_set = len(set(norm_df['Normalized']))
names_forms = len(df)
print('Name Instances ' + str(total_names))
print('Name forms: ' + str(names_forms))
print('Name forms Normalized: ' + str(norm) + "; representing " + str(norm_set) + " different names.")
print('Name forms not normalized: ' + str(not_normalized))

Name Instances 156112
Name forms: 6077
Name forms Normalized: 4931; representing 3742 different names.
Name forms not normalized: 1146


# Utamišaram
The Ur III name `Utamišaram` appears in many different spellings and name forms. How did our script do with him?


In [44]:
mica = df[df['Transliteration'].str.contains('mi-ca')]
Utamicaram = mica[mica['Transliteration'].str[0]=='U']
Utamicaram

Unnamed: 0,Transliteration,Normalized
4363,U2-da-mi-ca-ra-am,Udamišaram[]PN
4394,U2-ta-mi-car-ra-am,Utamišaram[]PN
4395,U2-ta-mi-car-ra-am-ta,Utamišaram[]PN
4396,U2-ta-mi-car-ru-um,Utamišarum[]PN
4397,U2-ta1-mi-car-ra-am-ta,Utamišaram[]PN
4402,U2-ta2-mi-ca-ra-am,Utamišaram[]PN
4403,U2-ta2-mi-car-MI-ra-am,U₂.ta₂.mi.šar.MI.ra.am[]PN
4404,U2-ta2-mi-car-am,Utamišaram[]PN
4405,U2-ta2-mi-car-am-ta,Utamišaram[]PN
4406,U2-ta2-mi-car-ra-am,Utamišaram[]PN


Turns out that 10 different spellings/forms are correctly identified with `Utamišaram`; others are normalized as `Udamišaram` or `Utamišarum`, or, in one case `U₂.ta₂.mi.šar.MI.ra.am` (apparently an ancient spelling mistake). Such entries need to be corrected by hand; further research shows that in one case `U2-ta-mi-car-um-ta` is a modern mistake (= `Utamišaram`); the other cases of `Utamišarum` are correctly transliterated and represent a variant form of the same name refering to the same person (an official in the Drehem administration) as `Utamišaram`.

# Save to File
Note that the data is encoded in `utf-16` (rather than `utf-8`) and the separator is a `TAB` (instead of comma). This way Excel can open the file without problem. However, if another program (such as `emacs`) is used for editing it is better to use `utf-8` and perhaps a different separator.

In [45]:
df.to_csv('output/UrIII-Name_Puz_Dagan.csv', sep = '\t', encoding='utf-8')