# Extracting Personal Names from QPN Glossary
Extract proper nouns from `gloss-qpn.json`, together with the `id_text` of each text in which they appear. The appropriate JSON ZIP file must be located in the directory `jsonzip`.

In [4]:
import pandas as pd
import zipfile
import json
import os
import errno
import tqdm
import requests

# Create Download Directory
Create a directory called `jsonzip`. If the directory already exists, do nothing.

For the code, see [Stack Overflow](http://stackoverflow.com/questions/18973418/os-mkdirpath-returns-oserror-when-directory-does-not-exist).

In [2]:
try:
    os.mkdir('jsonzip')
except OSError as exc:
    # Ignore "directory already exists"; re-raise any other error.
    if exc.errno != errno.EEXIST:
        raise
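
As an alternative to catching `EEXIST` by hand, the standard library's `os.makedirs` accepts `exist_ok=True`, which gives the same "create if missing, otherwise do nothing" behaviour. The following is a minimal sketch that runs inside a temporary directory so that it does not touch the real `jsonzip` folder; the names `base` and `target` are illustrative only.

```python
import os
import tempfile

# Work in a throwaway directory so the sketch has no side effects.
with tempfile.TemporaryDirectory() as base:
    target = os.path.join(base, "jsonzip")
    os.makedirs(target, exist_ok=True)   # creates the directory
    os.makedirs(target, exist_ok=True)   # second call is a no-op, no exception
    created = os.path.isdir(target)

print(created)
```

In the notebook itself this would reduce the cell to a single call, `os.makedirs('jsonzip', exist_ok=True)`.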

# Download the ZIP file

In [5]:
CHUNK = 16 * 1024
url = "http://oracc.ub.uni-muenchen.de/json/saao-sargonletters.zip"
file = 'jsonzip/saao-sargonletters.zip'
with requests.get(url, stream=True) as r:
    if r.status_code == 200:
        print("Downloading " + url + " saving as " + file)
        with open(file, 'wb') as f:
            for c in tqdm.tqdm(r.iter_content(chunk_size=CHUNK)):
                f.write(c)
    else:
        print(url + " does not exist.")

Downloading http://oracc.ub.uni-muenchen.de/json/saao-sargonletters.zip saving as jsonzip/saao-sargonletters.zip


968it [00:04, 212.50it/s]


Extract the data from the `gloss-qpn.json` file and use the `json` package to store the results in the variable `data_json`.

In [6]:
filename = ""
z = zipfile.ZipFile(file)
names = z.namelist()
for name in names:
    if "gloss-qpn.json" in name:
        filename = name
qpn = z.read(filename).decode('utf-8')  # read and decode the QPN glossary JSON file
data_json = json.loads(qpn)
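
The cell above leaves `filename` empty if the archive contains no `gloss-qpn.json`, in which case `z.read` fails with a `KeyError`. A more defensive lookup might look like the sketch below. It builds a tiny in-memory ZIP so that it can run without the download; the member paths, the helper `find_member`, and the file contents are all hypothetical and only mimic the shape the notebook relies on.

```python
import io
import json
import zipfile

# Build a small in-memory ZIP archive (hypothetical paths and contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("demo/project/gloss-qpn.json",
                json.dumps({"entries": [], "instances": {}}))
    zf.writestr("demo/project/catalogue.json", "{}")


def find_member(archive, suffix):
    """Return the first member whose name ends with `suffix`, or raise."""
    for name in archive.namelist():
        if name.endswith(suffix):
            return name
    raise FileNotFoundError(suffix + " not found in archive")


with zipfile.ZipFile(buffer) as archive:
    member = find_member(archive, "gloss-qpn.json")
    demo_json = json.loads(archive.read(member).decode("utf-8"))

print(member)
print(sorted(demo_json))
```

Raising early with a descriptive message makes a missing glossary easier to diagnose than a `KeyError` on an empty file name.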

# First DataFrame: Entries
The DataFrame `df_pn` captures the columns `headword`, `xis`, and `pos`. The column `pos` is used to select personal names: regnal names (`RN`) and personal names (`PN`). The value of `xis` is an ID that serves as a key into `instances` (below).

In [7]:
entries = data_json["entries"]
df = pd.DataFrame(entries)
df = df[["headword", "xis", "pos"]]
df_pn = df.loc[df["pos"].isin(["PN", "RN"])]

# Second DataFrame: Instances
`instances` links each `xis` ID to a list of references to the places where the name occurs. Each reference records the text's ID number as well as the line and the word.

In [8]:
instances = data_json["instances"]

Create a list of lists (called `l`) where every list has two elements: the `xis` number and one text ID. The text ID is made out of the references in `instances` which have the format PROJECT:ID_TEXT.ID_LINE.ID_WORD.

In [9]:
l = []
for i in df_pn["xis"]:
    for k in instances[i]:
        QPN = k.split(":")[1]
        QPN = QPN.split(".")[0]
        d = [i, QPN]
        l.append(d)
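
To make the string handling in the loop explicit, here is a small sketch that isolates the parsing step in a helper and applies it to made-up data. The helper name `text_id`, the dictionary `demo_instances`, and the reference strings are hypothetical; only the `PROJECT:ID_TEXT.ID_LINE.ID_WORD` pattern is taken from the description above.

```python
def text_id(reference):
    """Return the ID_TEXT part of a PROJECT:ID_TEXT.ID_LINE.ID_WORD reference."""
    # Drop everything up to the first colon, then keep the part before the first dot.
    _project, _, location = reference.partition(":")
    return location.split(".")[0]


# Hypothetical data in the shape the loop above expects:
# each xis ID maps to a list of references.
demo_instances = {
    "qpn.x1": ["demo/proj:P000001.3.2", "demo/proj:P000001.7.1"],
    "qpn.x2": ["demo/proj:P000002.1.4"],
}

pairs = [[xis, text_id(ref)]
         for xis, refs in demo_instances.items()
         for ref in refs]
print(pairs)
# [['qpn.x1', 'P000001'], ['qpn.x1', 'P000001'], ['qpn.x2', 'P000002']]
```

A name that occurs twice in the same text yields two identical pairs, because the line and word parts of the reference are discarded.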

Create the DataFrame and give names to the columns.

In [10]:
inst_df = pd.DataFrame(l)
inst_df.columns = ["xis", "id_text"]

Merge the two DataFrames on the field `xis`, using the keys from the left object (that is `inst_df`).

In [11]:
df = inst_df.merge(df_pn, on='xis', how='left')
df

| | xis | id_text | headword | pos |
|---|---|---|---|---|
| 0 | qpn.r000004 | P313422 | Abaliuqunu[1]PN | PN |
| 1 | qpn.r000004 | P334090 | Abaliuqunu[1]PN | PN |
| 2 | qpn.r000004 | P334090 | Abaliuqunu[1]PN | PN |
| 3 | qpn.r000004 | P334257 | Abaliuqunu[1]PN | PN |
| 4 | qpn.r00000b | P313425 | Abattu[1]PN | PN |
| 5 | qpn.r000008 | P334282 | Abat-šarri-uṣur[1]PN | PN |
| 6 | qpn.r000008 | P334321 | Abat-šarri-uṣur[1]PN | PN |
| 7 | qpn.r000011 | P334398 | Abile[1]PN | PN |
| 8 | qpn.r00000f | P334412 | Abi-ramu[1]PN | PN |
| 9 | qpn.r00000e | P313421 | Abi-Seʾ[1]PN | PN |
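
Rows 1 and 2 of the output are identical, which happens whenever a name occurs more than once in the same text. Whether such repeats should be kept depends on the question being asked; if only one row per name and text is wanted, `drop_duplicates` can remove them. The sketch below reproduces the left merge on toy data and then deduplicates the result. All values and the names `inst_demo`, `names_demo`, `merged`, and `unique_pairs` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for inst_df and df_pn (hypothetical values).
inst_demo = pd.DataFrame(
    [["x1", "P000001"], ["x1", "P000001"], ["x2", "P000002"]],
    columns=["xis", "id_text"],
)
names_demo = pd.DataFrame(
    {"xis": ["x1", "x2"],
     "headword": ["NameA[1]PN", "NameB[1]PN"],
     "pos": ["PN", "PN"]}
)

# Left merge: one output row per instance row, with name data attached.
merged = inst_demo.merge(names_demo, on="xis", how="left")

# Keep a single row for each distinct combination of values.
unique_pairs = merged.drop_duplicates().reset_index(drop=True)

print(len(merged), len(unique_pairs))
```

Keeping the duplicates instead preserves a count of how often each name is attested in a text.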
