# Sargon's Pen Pals
This notebook will pull all the names (senders and people mentioned) from letters to Sargon II, who reigned over Assyria 722-705 BCE. These letters were published in the series *State Archives of Assyria* (SAA) volumes 1, 5, and 15 (1987, 1990, and 2001). Electronic versions of these volumes are found in the [ORACC](http://oracc.org) project *State Archives of Assyria online* ([SAAo](http://oracc.org/saao)).

In the letters to Sargon (and to other Assyrian kings) the addressee is hardly ever mentioned by name. Instead, the letter opens with "to the king my lord". Simo Parpola, the main editor of the SAA series, assigned the letters to kings, based on his vast knowledge of the corpus.

The current notebook will use a network approach to evaluate these assignments, starting from the names of a few individuals who are known to have been contemporaries of Sargon. If these people are mentioned in a letter, it is likely that the letter was addressed to Sargon. We may also look at second- or third-degree relationships (people mentioned together with someone who is in turn mentioned together with a known contemporary) to estimate the plausibility that a letter was indeed sent to Sargon.

# 1. Import Libraries

In [None]:
import pandas as pd
import zipfile
import json
import os
import sys
util_dir = os.path.abspath('../utils')
sys.path.append(util_dir)
from utils import *

# 2. Create Directories, if Necessary
The two directories needed for this script are `jsonzip` and `output`. The directories are created with the function `make_dirs()` from the `utils` module. 

In [None]:
directories = ['jsonzip', 'output']
make_dirs(directories)
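
The `utils` module is not shown in this notebook. As a rough sketch only, and assuming `make_dirs()` does nothing more than create each missing directory, a helper of this kind might look as follows. The name `make_dirs_sketch` and the `base` parameter are hypothetical and are not part of `utils`.

```python
import os
import tempfile

def make_dirs_sketch(directories, base="."):
    """Create each directory under `base`; existing directories are left alone."""
    for d in directories:
        os.makedirs(os.path.join(base, d), exist_ok=True)

# Try it in a temporary location so nothing is written to the working directory.
with tempfile.TemporaryDirectory() as tmp:
    make_dirs_sketch(["jsonzip", "output"], base=tmp)
    make_dirs_sketch(["jsonzip", "output"], base=tmp)  # a second call is harmless
    created = sorted(os.listdir(tmp))
```

Because of `exist_ok=True`, the cell can be re-run without raising an error when the directories already exist.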

# 3. Download the JSON ZIP files
Using the function `oracc_download()` from the `utils` module.

In [None]:
volumes = ["saao/saa01", "saao/saa05", "saao/saa15"]
oracc_download(volumes)

# 4. Extract Proper Nouns
Each of the ZIP files contains a file `gloss-qpn.json`, which holds the glossary of proper nouns in that volume. This file is read directly from the ZIP (without unpacking it to disk) and parsed with the `json` module. The parsed data are collected in a list; each element of the list represents the proper nouns of one SAA volume.

In [None]:
json_l = []
for v in volumes:
    file = "jsonzip/" + v.replace("/", "-") + ".zip"
    z = zipfile.ZipFile(file)
    filename = v + "/gloss-qpn.json"
    qpn = z.read(filename).decode('utf-8')         #read and decode the qpn glossary json file
    data_json = json.loads(qpn)  
    json_l.append(data_json)
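
The read-and-parse step can be tried without downloading anything. The sketch below builds a miniature glossary in an in-memory ZIP and reads it back in the same way. The glossary content, the name `Name-A`, and the text ID `P000001` are invented for illustration; real `gloss-qpn.json` files carry many more fields.

```python
import io
import json
import zipfile

# Hypothetical miniature glossary with the two keys used in this notebook.
toy_gloss = {"entries": [{"cf": "Name-A", "gw": "1", "pos": "PN", "xis": "x1"}],
             "instances": {"x1": ["saao/saa01:P000001.2.1"]}}

# Write the toy glossary into an in-memory ZIP under the same internal path pattern.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as z:
    z.writestr("saao/saa01/gloss-qpn.json", json.dumps(toy_gloss))

# Read it back: z.read() returns bytes, which are decoded and then parsed.
with zipfile.ZipFile(buffer) as z:
    raw = z.read("saao/saa01/gloss-qpn.json").decode("utf-8")
toy_json = json.loads(raw)
```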

# 5. Entries and Instances
The function `parse()` builds two DataFrames from different elements in the `gloss-qpn.json`. The key `entries` holds all the headwords (lemmas) in the glossary, with information such as Part of Speech (`pos`), original spelling, number of attestations, etc. The elements to be extracted for the Proper Noun DataFrame (`df_pn`) are Citation Form (`cf`), Part of Speech (`pos`), Guide Word (`gw`) and `xis`. The field `xis` holds an ID for this entry. Note that the `xis` ID is unique *within a project* and may well be duplicated in another project. Moreover, the `xis` is *not persistent*. After a new build of the project, the `xis` IDs will be realigned.

The key `instances` in `gloss-qpn.json` provides all the instances (in list form) of each headword, using the same `xis` field to identify the headword. Each instance is referenced in the format PROJECT:ID_TEXT.ID_LINE.ID_WORD, for example: `saao/saa01:P243567.9.1`. We can iterate through the field `xis` in `df_pn` to select the headwords that we need. We build a second DataFrame (`inst_df`) with two columns: the `xis` ID and the text ID. The text ID can be extracted from the reference by taking the part between the colon and the first dot.

The two DataFrames share the field `xis`. In the second DataFrame (`inst_df`) the same `xis` ID may appear multiple times (because the same name may appear in multiple texts, or multiple times in the same text). We can merge the two DataFrames with the `pandas` function `merge()`, merging on `xis` and using the keys from `inst_df`. The DataFrame that is returned will now have a row for each name instance, associated with a text ID (a P, Q, or X number). Each volume of [SAAo](http://oracc.org/saao) will return a separate DataFrame.
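
The two steps described above (cutting the text ID out of a reference, and a left merge on `xis`) can be illustrated on a toy example. The reference string is the one quoted above; the two small DataFrames, the names, and the text IDs in them are invented for illustration.

```python
import pandas as pd

# PROJECT:ID_TEXT.ID_LINE.ID_WORD
ref = "saao/saa01:P243567.9.1"
id_text = ref.split(":")[1].split(".")[0]   # part between the colon and the first dot

# Hypothetical toy data: two headwords, three instances.
toy_pn = pd.DataFrame({"xis": ["x1", "x2"], "cf": ["Name-A", "Name-B"],
                       "gw": ["1", "1"], "pos": ["PN", "PN"]})
toy_inst = pd.DataFrame({"xis": ["x1", "x1", "x2"],
                         "id_text": ["P000001", "P000002", "P000001"]})

# how="left" keeps one row per instance and attaches the headword data to it.
toy_merged = toy_inst.merge(toy_pn, on="xis", how="left")
```

The merged frame has three rows, one per instance, so a name attested in two texts appears twice.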

In [None]:
def parse(data_json):
    entries = data_json["entries"]
    df_pn = pd.DataFrame(entries)
    df_pn = df_pn[["cf", "gw", "pos", "xis"]]
    df_pn = df_pn.loc[df_pn["pos"].isin(["PN", "RN"])]
    instances = data_json["instances"]
    l = []
    for i in df_pn["xis"]:
        for k in instances[i]:
            QPN = k.split(":")[1]
            QPN = QPN.split(".")[0]
            d = [i, QPN]
            l.append(d)
    inst_df = pd.DataFrame(l)
    inst_df.columns = ["xis", "id_text"]
    df = inst_df.merge(df_pn, on='xis', how='left')
    return df

# 6. Create and Concatenate the Lists of Name Instances
For each of the projects with Sargon letters in [SAAo](http://oracc.org/saao) the `json` data from its `gloss-qpn.json` are sent to the `parse()` function. This function returns a DataFrame with all name instances, associated with text IDs. The code below collects those DataFrames in the list `pn_l`, concatenates them, and then drops the field `xis`. Since `xis` is project-specific, it has become meaningless at this stage.

In [None]:
pn_l = []
for j in json_l:
    pns = parse(j)
    pn_l.append(pns)
df = pd.concat(pn_l)
df = df.drop("xis", axis = 1)
df

# 7. Node List
The Node List is simply the list of all unique headwords. Each node is also flagged as a (Sargon-period) eponym or not, based on the list of eponyms in `csv/sargoneponyms.csv`. The node list is saved as `nodes.csv` in the `output` directory.

In [None]:
n_df = df[["cf", "gw"]].copy()
n_df = n_df.drop_duplicates().reset_index(drop=True)
n_df.columns = ["label", "namesake"]

In [None]:
with open("csv/sargoneponyms.csv", mode="r", encoding="utf-8") as f:
    eponyms = pd.read_csv(f)
eponyms["eponym"] = True

In [None]:
nodes_df = n_df.merge(eponyms, on = "label", how="outer")
nodes_df["eponym"] = nodes_df["eponym"].fillna(False)
nodes_df["Id"] = nodes_df["label"].copy()
nodes_df

In [None]:
with open("output/nodes.csv", mode="w", encoding="utf-8") as w:
    nodes_df.to_csv(w, index=False)
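
The effect of the outer merge can be seen on a toy example. The sketch below assumes only that the eponym list has a `label` column (which the merge above requires); the names are invented for illustration.

```python
import pandas as pd

toy_nodes = pd.DataFrame({"label": ["Name-A", "Name-B"], "namesake": ["1", "1"]})
toy_eponyms = pd.DataFrame({"label": ["Name-B", "Name-C"]})
toy_eponyms["eponym"] = True

# An outer merge keeps names from both sides.
merged = toy_nodes.merge(toy_eponyms, on="label", how="outer")
# Names absent from the eponym list have a missing value here; eq(True) turns it into False.
merged["eponym"] = merged["eponym"].eq(True)
merged["Id"] = merged["label"]

flags = dict(zip(merged["label"], merged["eponym"]))
```

Note that with `how="outer"` an eponym that is not attested in the letters (`Name-C` in this sketch) still becomes a node, with an empty `namesake`. Using `how="left"` instead would restrict the node list to names that occur in the corpus.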

# 8. Edge List
Transform the Pandas DataFrame into a simple list of lists. In order to produce the edge list we use a loop within a loop. The first loop goes through all the items in the list (all name instances). For each name, it goes through the rest of the list, to find items that match the same text ID (P number). This way, the routine finds all pairs of names that appear in each letter.

The secondary loop begins just after the index of the primary loop, so each pair of rows is compared only once. Because a name may occur more than once in a letter, the same pair can still turn up in both orders (A == B and B == A). Since the edges are undirected, the two names of each edge are sorted, so that such pairs become identical and can be dropped as duplicates.

If there is a text ID match in the secondary loop, make a list that contains `id_text`, `source`, and `target` - this list represents a single edge. Add this list to the list of lists `edges`.

In [None]:
data = df.values.tolist()           # each row is [id_text, cf, gw, pos]
edges = []
for idx, item in enumerate(data):
    textid, name = item[0], item[1]
    for item_2 in data[idx + 1:]:   # only look at rows after the current one
        if textid == item_2[0] and name != item_2[1]:   # same text, no SELF == SELF edges
            source, target = sorted([name, item_2[1]])  # fixed order for undirected edges
            edges.append([textid, source, target])

The object `edges` is a list of lists that can be transformed again into a Dataframe. If the same name is mentioned multiple times in the same letter, that will create duplicate edges. Drop the duplicates.

In [None]:
edges_df = pd.DataFrame(edges, columns= ["id_text", "source", "target"])
edges_df = edges_df.drop_duplicates().reset_index(drop=True)
edges_df
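
The nested loop compares every row with every later row, which becomes slow as the corpus grows. A possible alternative, which is not the method used above, is to group the names by text and let `itertools.combinations` produce the pairs within each group. The sketch below uses invented names and text IDs.

```python
from itertools import combinations
import pandas as pd

# Hypothetical name instances: Name-A is mentioned twice in P000001.
toy = pd.DataFrame({"id_text": ["P000001", "P000001", "P000001", "P000002", "P000002"],
                    "cf": ["Name-A", "Name-B", "Name-A", "Name-B", "Name-C"]})

toy_edges = []
for id_text, names in toy.groupby("id_text")["cf"]:
    # sorted(set(...)) removes repeated mentions and fixes the order within each pair
    for source, target in combinations(sorted(set(names)), 2):
        toy_edges.append([id_text, source, target])

toy_edges_df = pd.DataFrame(toy_edges, columns=["id_text", "source", "target"])
```

Because each pair is drawn from a sorted set of names, a pair can appear in only one orientation per text and no duplicates need to be dropped afterwards.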

# 9. Save Edge List
Save the edge list as `edges.csv` in the `output` directory.

In [None]:
with open("output/edges.csv", mode="w", encoding="utf-8") as w:
    edges_df.to_csv(w, index=False)
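
The introduction mentions first-, second-, and third-degree relationships to known contemporaries of Sargon. As a minimal sketch of how such degrees could be computed from an edge list of this shape, the code below runs a breadth-first search from a set of seed names. The edges, names, and the function `degrees_from` are hypothetical; this is an illustration under those assumptions, not part of the notebook's pipeline.

```python
from collections import deque

# Hypothetical edge list in the same shape as edges.csv: id_text, source, target.
toy_edges = [["P000001", "Name-A", "Name-B"],
             ["P000002", "Name-B", "Name-C"],
             ["P000003", "Name-D", "Name-E"]]

# Build an undirected adjacency map from the edge list.
neighbors = {}
for _, source, target in toy_edges:
    neighbors.setdefault(source, set()).add(target)
    neighbors.setdefault(target, set()).add(source)

def degrees_from(seeds, neighbors):
    """Breadth-first search: number of co-occurrence steps from the nearest seed name."""
    distance = {s: 0 for s in seeds if s in neighbors}
    queue = deque(distance)
    while queue:
        name = queue.popleft()
        for other in neighbors[name]:
            if other not in distance:
                distance[other] = distance[name] + 1
                queue.append(other)
    return distance

distance = degrees_from(["Name-A"], neighbors)
```

Names that cannot be reached from any seed (here `Name-D` and `Name-E`) are simply absent from the result, which marks the letters mentioning only such names as the least securely connected to the seeds.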