# Geographical Names
This notebook matches geographical names from our dataset against [Pleiades](http://pleiades.stoa.org), a dataset of ancient place names. The Pleiades data was downloaded as JSON from the [Pleiades downloads page](https://pleiades.stoa.org/downloads).

In [None]:
import json
import requests
from tqdm.auto import tqdm
import pandas as pd
import zipfile

In [None]:
with open('pleiades-places.json', encoding='utf8') as json_file:
    data = json.load(json_file)

In [None]:
places = data['@graph']

Place names are taken from the `romanized` field of each entry in `names`. Are there better fields to use?

In [None]:
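placenames = {}
for p in places:
    coor = p.get("reprPoint")  # representative point of the place; may be None
    for n in p.get("names") or []:  # guard against places without a names list
        name = n.get("romanized")
        if name:  # skip name records without a romanized form
            # a later place with the same name overwrites an earlier one
            placenames[name] = coor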

## Example 1: Drehem

In [None]:
placenames.get("Drehem")

## Example 2: Puzriš-Dagan
This finds nothing, because Pleiades stores the name variants as a single string: "Puzriš-Dagan, Puzrish-Dagan, Puzurish-Dagan".

In [None]:
placenames.get("Puzriš-Dagan")

In [None]:
placenames.get("Puzriš-Dagan, Puzrish-Dagan, Puzurish-Dagan")
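
One possible way around this is to split each combined string on commas and index every variant separately. The following is a minimal sketch, not part of the original workflow: the function name `split_variants` is hypothetical, and the sample coordinates are placeholders rather than real Pleiades values. It also assumes that no single place name legitimately contains a comma.

```python
def split_variants(index):
    """Return a new dict with one key per comma-separated name variant."""
    expanded = {}
    for names, coor in index.items():
        for variant in names.split(","):
            variant = variant.strip()  # drop blanks around each variant
            if variant:
                # keep the first coordinates seen for a given variant
                expanded.setdefault(variant, coor)
    return expanded

# Placeholder data: the key has the shape described above,
# the coordinates are made up for illustration.
sample = {"Puzriš-Dagan, Puzrish-Dagan, Puzurish-Dagan": [0.0, 0.0]}
expanded = split_variants(sample)
print(expanded.get("Puzriš-Dagan"))
```

Applied to the real data, this would be called as `split_variants(placenames)`. Stripping whitespace also handles variants that carry a stray blank.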

## Example 3: Ŋirsu
This finds nothing, because Pleiades does not use the nasal G (Ŋ) and writes Girsu instead.

In [None]:
placenames.get("Ŋirsu")

In [None]:
placenames.get("Girsu")
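
A query name could be folded to the Pleiades spelling before the lookup. The sketch below is an assumption about which characters need folding (Ŋ, ŋ, Ĝ, ĝ); the function name `fold_nasal_g` is hypothetical. Other special characters such as š are left alone, since Pleiades itself uses them in some names.

```python
import unicodedata

# Assumed mapping: the nasal g in its common transliterations becomes a plain g.
NASAL_G = str.maketrans({"Ŋ": "G", "ŋ": "g", "Ĝ": "G", "ĝ": "g"})

def fold_nasal_g(name):
    """Replace nasal-g characters with plain g/G."""
    # NFC first, so that g + combining circumflex becomes the single character ĝ
    return unicodedata.normalize("NFC", name).translate(NASAL_G)

print(fold_nasal_g("Ŋirsu"))      # Girsu
print(fold_nasal_g("Irisaŋrig"))  # Irisagrig
```

The lookup would then read `placenames.get(fold_nasal_g(name))`.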

## Example 4: Telloh (modern name for Ŋirsu/Girsu)
The full name in Pleiades is Tell Telloh, not Telloh.

In [None]:
placenames.get("Telloh")

In [None]:
placenames.get("Tell Telloh")
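
A fallback lookup can retry a name with a site prefix when the bare name is not found. This is only a sketch: `lookup_with_prefix` is a hypothetical helper, "Tell " is the only prefix attested in the example above, and any further prefixes would be assumptions. The sample coordinates are placeholders.

```python
def lookup_with_prefix(name, index, prefixes=("Tell ",)):
    """Try the bare name first, then the name with each prefix.

    Returns (matched_key, coordinates), or (None, None) if nothing matches.
    """
    if name in index:
        return name, index[name]
    for prefix in prefixes:
        candidate = prefix + name
        if candidate in index:
            return candidate, index[candidate]
    return None, None

sample = {"Tell Telloh": [0.0, 0.0]}  # placeholder coordinates
print(lookup_with_prefix("Telloh", sample))
```

Returning the matched key alongside the coordinates makes it possible to review which names were matched only through the fallback.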

## Example 5: Agade
A very different issue: the coordinates are unknown.

In [None]:
placenames.get("Agade")
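
With `dict.get`, a name that is missing from Pleiades and a name that is present without coordinates both return `None`. The sketch below separates the two cases. It assumes that unknown coordinates are stored as `None`; `lookup_status` is a hypothetical helper and the sample coordinates are placeholders.

```python
def lookup_status(name, index):
    """Classify a name as missing, present without coordinates, or located."""
    if name not in index:
        return "not in Pleiades"
    if index[name] is None:  # assumption: unknown coordinates are stored as None
        return "in Pleiades, coordinates unknown"
    return "located"

# Placeholder data for illustration only.
sample = {"Agade": None, "Girsu": [0.0, 0.0]}
for name in ["Agade", "Girsu", "Ŋirsu"]:
    print(name, "->", lookup_status(name, sample))
```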

## Example 6: Irisaŋrig
The name is written " Iri-Saĝrig, Irisagrig, Urusagrig", with a leading blank. The location of the site is known, but it is not yet recorded in Pleiades.

In [None]:
placenames[" Iri-Saĝrig, Irisagrig, Urusagrig"]

In [None]:
project = "epsd2/admin/ur3"

In [None]:
import os

CHUNK = 1024
proj = project.replace('/', '-')
url = f"http://build-oracc.museum.upenn.edu/json/{proj}.zip"
file = f'jsonzip/{proj}.zip'
os.makedirs('jsonzip', exist_ok=True)  # make sure the target directory exists
with requests.get(url, stream=True) as r:
    if r.status_code == 200:
        total_size = int(r.headers.get('content-length', 0))
        tqdm.write(f'Saving {url} as {file}')
        # use tqdm as a context manager so the progress bar is closed when done
        with tqdm(total=total_size, unit='B', unit_scale=True, desc=project) as t, open(file, 'wb') as f:
            for c in r.iter_content(chunk_size=CHUNK):
                t.update(len(c))
                f.write(c)
    else:
        tqdm.write(f"WARNING: {url} could not be downloaded (status {r.status_code}).")

In [None]:
def parsejson(text, id_text):
    l = []
    for JSONobject in text["cdl"]:
        if "cdl" in JSONobject: 
            l.extend(parsejson(JSONobject, id_text))
        if "f" in JSONobject:
            lemm = JSONobject["f"]
            lemm["id_text"] = id_text
            l.append(lemm)
    return l

In [None]:
lemm_l = []  # will hold the lemmatization data of all texts in the project
file = f"jsonzip/{project.replace('/', '-')}.zip"
try:
    z = zipfile.ZipFile(file)  # create a ZipFile object
except (FileNotFoundError, zipfile.BadZipFile):
    print(f"{file} does not exist or is not a proper ZIP file")
    raise  # nothing to parse without the ZIP file
# keep only the JSON files in corpusjson, one per text (P, Q, and X numbers)
files = [name for name in z.namelist() if "corpusjson" in name and name.endswith('.json')]
for filename in tqdm(files, desc=project):  # iterate over the file names
    id_text = project + filename[-13:-5]  # id_text is, for instance, blms/P414332
    try:
        text = z.read(filename).decode('utf-8')  # read and decode the JSON file of one text
        data_json = json.loads(text)  # parse it into a dictionary
        lemm_l.extend(parsejson(data_json, id_text))  # collect the lemmatization data
    except (KeyError, ValueError):  # missing file, missing "cdl" key, or invalid JSON
        tqdm.write(f'{id_text} is not available or not complete')
z.close()

In [None]:
words = pd.DataFrame(lemm_l)
keep = ["id_text", "cf", "gw", "pos"]
words = words[keep]
words = words.fillna("")
GeographicalPOS = ["SN", "GN", "WN"]
words = words.loc[words.pos.isin(GeographicalPOS)]
GeographicalNames = set(words["cf"])

In [None]:
Coordinates = {}
for name in GeographicalNames:
    location = placenames.get(name)
    if location: 
        c = {name: location}
        Coordinates.update(c)

A literal search finds 28 out of 1937 geographical names (this is for the entire Ur III data set, not just for Drehem). Taking into account the issues discussed above may increase that number.

In [None]:
len(GeographicalNames), len(Coordinates)

In [None]:
Coordinates