# Create Edge and Node List for Sargon Letters SNA
This notebook creates an edge list and a node list for import into [Gephi](https://gephi.org/). The data are taken from the Sargon letters published online in [SAAo](http://oracc.org/saao), but the code can be used for other data sets as well.

In [None]:
import pandas as pd
import zipfile
import json
import os
import sys
util_dir = os.path.abspath('../utils')
sys.path.append(util_dir)
from utils import *
pd.options.mode.chained_assignment = None  # default='warn'

In [None]:
volumes = ["saao/sargonletters"]
p = oracc_download(volumes, "lmu")

In [None]:
lemm_list = get_lemmas(p)
words_df = dataformat(lemm_list)
directories = ['raw']
make_dirs(directories)
words_df.to_csv('raw/sargonletters.csv', index=False, encoding='utf8')

The `JSON` of saao/sargonletters has been parsed and transformed into a `csv` file called `sargonletters.csv`, which is saved in the directory `raw`. The rest of this notebook extracts the information necessary for an edge list that can be imported into [Gephi](https://gephi.org/). In addition, the code creates a node list with one attribute (`eponym`, either `True` or `False`). For importing these files into [Gephi](https://gephi.org/), see the bottom of this file.

The first step is to select all proper names that appear in the letters. Two proper names that appear in the same letter represent an (undirected) edge. In a second step, this list of edges is augmented with catalog information such as sender location and dossier. In a third step the catalog is used to create additional (directed) edges, representing sender and recipient.

# 1. Create Edge List from Proper Names in Letters

First open the `.csv` file (prepared by parsing the corpus JSON files) and import it into a Pandas Dataframe. 

In [None]:
with open("raw/sargonletters.csv", mode = 'r', encoding = "utf8") as f:
    df = pd.read_csv(f)
df

Select the rows where `pos` is either `PN` (Personal Name) or `RN` (Royal Name).

In [None]:
keep = ["PN", "RN"]
df = df[df["pos"].isin(keep)]

Select the columns `id_text` (the P number) and `cf` (Citation Form), then clean up the `id_text` column by keeping only its last seven characters (the P number).

In [None]:
df = df[["id_text", "cf"]]
df["id_text"] = [idt[-7:] for idt in df["id_text"]]
df

Transform the Pandas Dataframe into a simple list of lists. In order to produce the edge list we use a loop within a loop. The first loop goes through all the items in the list (all names). For each name, it goes through the list again to find items that match the same text ID (P number). This way, the routine finds all pairs of names that appear in each letter.

The secondary loop begins at the index of the primary loop, so a pair that has already been recorded as A - B is not visited again as B - A (the edges are undirected). A reversed pair can still slip through when a name is repeated later in the same letter (for example in the order A, B, A), so the result is not guaranteed to be free of mirrored pairs.

If there is a text ID match in the secondary loop, make a list that contains `id_text`, `source`, and `target` - this list represents a single edge. Add this list to the list of lists `edges`.

In [None]:
data = df.values.tolist()  # rows of [id_text, cf]
edges = []
for idx, item in enumerate(data):
    textid, source = item
    # Start at idx so that earlier rows are not compared again.
    for item_2 in data[idx:]:
        # Same letter, different name: record one edge.
        if textid == item_2[0] and source != item_2[1]:
            edges.append([textid, source, item_2[1]])
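
The nested loop can still record both A - B and B - A when a name recurs after another name in the same letter. As a hedged alternative (not the method used in this notebook), the sketch below first collects the distinct names per letter and then uses `itertools.combinations` on the sorted names, so each pair comes out in one fixed order. The rows and the name `toy_edges` are invented for illustration.

```python
from itertools import combinations

# Toy rows of [id_text, cf]; the values are invented for illustration only.
rows = [
    ["P000001", "A"],
    ["P000001", "B"],
    ["P000001", "A"],  # A is mentioned a second time in the same letter
    ["P000002", "B"],
    ["P000002", "C"],
]

# Collect the set of distinct names per letter.
names_by_text = {}
for text_id, name in rows:
    names_by_text.setdefault(text_id, set()).add(name)

# Sorting the names first means each pair is emitted in one fixed order,
# so A - B can never reappear as B - A.
toy_edges = [
    [text_id, source, target]
    for text_id, names in sorted(names_by_text.items())
    for source, target in combinations(sorted(names), 2)
]
print(toy_edges)
```

Because every pair is generated once per letter, no later `drop_duplicates()` call would be needed for this variant.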

The object `edges` is a list of lists that can be transformed back into a Dataframe.

In [None]:
df_edges = pd.DataFrame(edges, columns= ["id_text", "source", "target"])
df_edges

If the same name is mentioned multiple times in the same letter, that will create duplicate edges. Drop the duplicates.

Add a new field, called `Type` to indicate whether an edge is directed or undirected. So far, all the edges are undirected.

In [None]:
df_edges = df_edges.drop_duplicates().reset_index(drop=True)
df_edges["Type"] = "undirected" 
df_edges

# 2. Add Catalog Information to the Edge List
The file `catalogue.json`, which is available in the file `jsonzip/saao-sargonletters.zip`, contains more information about each letter. The main field in `catalogue.json` is called `members` which contains the catalog information for each text in this corpus. We select a number of relevant catalog fields. We can add this information to the edge list by merging the two dataframes on `id_text` (the P number).

Instead of extracting all the files from the `ZIP` file we can create a `ZipFile` object and then read only the file we need (namely `catalogue.json`). This is transformed into a `JSON` object which can be further manipulated and transformed into a Dataframe.

In [None]:
p = "saao/sargonletters"
file = f"{p.replace('/', '-')}.zip"
z = zipfile.ZipFile(f'jsonzip/{file}', 'r')
data = z.read(f"{p}/catalogue.json").decode("utf-8")
data = json.loads(data)
d = data['members']
df = pd.DataFrame(d).T
df_cat = df[["id_text", "ancient_author", "recipient", "dossier", "senderloc", "sender_title"]].fillna('')
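
Reading a single member from a `ZIP` archive can be tried out without the real download. The sketch below builds a tiny in-memory archive whose member path mirrors the layout described above; the catalog content is invented, and the names `buffer`, `catalogue`, and `members` are assumptions for illustration.

```python
import io
import json
import zipfile

# Build a tiny in-memory archive so the example runs without the real download.
# The member path mirrors the layout described above; the content is made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(
        "saao/sargonletters/catalogue.json",
        json.dumps({"members": {"P000001": {"id_text": "P000001", "dossier": "x"}}}),
    )

# Read only the member we need; the context manager closes the archive afterwards.
with zipfile.ZipFile(buffer, "r") as zf:
    raw = zf.read("saao/sargonletters/catalogue.json").decode("utf-8")

catalogue = json.loads(raw)
members = catalogue["members"]
print(list(members))
```

Using `with` here is a small design choice: it closes the archive even if reading fails, which the plain `zipfile.ZipFile(...)` call does not do on its own.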

The names of author and recipient in the catalog are sometimes slightly different from the name forms in the lemmatization. The code in the following cell replaces the catalog form by the lemmatization form. The `.replace()` method in Pandas will search and replace a full string. In order to perform the search/replace on a partial string the option `regex = True` is necessary. Therefore, characters that have a special function in regular expressions (such as `[` and `(`) must be escaped by preceding them with a backslash.

The search - replace pairs are listed in a dictionary that can be fed to the Pandas `.replace()` method.

In [None]:
# Raw strings (r"...") keep the backslashes intact for the regex engine
# and avoid Python's invalid-escape-sequence warnings.
search_replace = {"Ṭab-ṣill-Ešarra": "Ṭab-ṣil-Ešarra",
                  "’": "ʾ",
                  r"\[": "",
                  r"\]": "",
                  r"Nashir-Bel \(Liphur-Bel\)": "Nashir-Bel",
                  "Sennacherib": "Sin-ahhe-eriba",
                  "Upaqa-Šamaš": "Upaq-Šamaš",
                  }
fields = ["ancient_author", "recipient"]
df_cat[fields] = df_cat[fields].replace(search_replace, regex=True)
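
The difference between whole-value and partial replacement can be checked on a small example. This is a minimal sketch with invented sample values; it uses `re.escape()` from the standard library to build the escaped patterns instead of typing the backslashes by hand, which is an alternative to the dictionary above rather than the notebook's own approach.

```python
import re
import pandas as pd

# Invented sample values; only the pattern syntax matters here.
s = pd.Series(["[Nashir-Bel]", "Nashir-Bel (Liphur-Bel)"])

# Without regex=True, replace() only matches whole cell values, so nothing changes.
unchanged = s.replace({"[": ""})

# With regex=True the keys are regular expressions, so metacharacters must be
# escaped. re.escape() produces the escaped pattern for a literal string.
patterns = {
    re.escape("["): "",
    re.escape("]"): "",
    re.escape("Nashir-Bel (Liphur-Bel)"): "Nashir-Bel",
}
cleaned = s.replace(patterns, regex=True)

print(unchanged.tolist())
print(cleaned.tolist())
```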

Now merge the edges Dataframe with the catalog information.

In [None]:
df_edges_cat = pd.merge(df_edges, df_cat, on="id_text").fillna("")
with open("csv/edges_no_sender.csv", mode="w", encoding="utf-8") as w:
    df_edges_cat.to_csv(w, index=False)
df_edges_cat

# 3. Add Sender and Recipient
Neo-Assyrian letters to or from the king often do not contain the name of sender or recipient, because the king's name is never explicit. We can pull this information from the dataframe `df_cat` that we have created above.

In [None]:
df_cat

Note that `ancient_author` sometimes has more than one name, separated by a comma. The code tests for the presence of a comma in the field `ancient_author`. If a comma appears, the field is split at the comma, resulting in a list of authors (2 or more). For each author a separate row is created that copies the original row, but replaces the field `ancient_author` by the author. The same is done for recipients.

### Note:
It is necessary to use two separate loops, each with its own `if/else` test, in order to take care of the possibility of multiple senders *and* multiple recipients.

In [None]:
df_cat = df_cat[['id_text', 'ancient_author', 'recipient']]

In [None]:
cat = df_cat.values.tolist()
cat_edges = []
for item in cat:
    if ',' in item[1]:
        senders = item[1].split(',')
        for sender in senders:
            edge = item.copy()
            edge[1] = sender.strip()
            cat_edges.append(edge)
    else:
        cat_edges.append(item)
cat_edges2 = []
for item in cat_edges:
    if ',' in item[2]:
        recipients = item[2].split(',')
        for recipient in recipients:
            edge = item.copy()
            edge[2] = recipient.strip()
            cat_edges2.append(edge)
    else:
        cat_edges2.append(item)

In [None]:
df_cat2 = pd.DataFrame(cat_edges2, columns = df_cat.columns)
df_cat2
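
The same splitting can also be expressed with Pandas string methods and `explode()`. The following is a hedged sketch on invented catalog rows, not a replacement for the loops above; the names `toy_cat` and `toy_split` are assumptions for illustration.

```python
import pandas as pd

# Invented catalog rows; the column names follow df_cat above.
toy_cat = pd.DataFrame({
    "id_text": ["P000001", "P000002"],
    "ancient_author": ["A, B", "C"],
    "recipient": ["K", "K, L"],
})

toy_split = toy_cat.copy()
for col in ["ancient_author", "recipient"]:
    # Split on commas, then give every list element its own row.
    toy_split[col] = toy_split[col].str.split(",")
    toy_split = toy_split.explode(col).reset_index(drop=True)
    # Remove the whitespace left over after the comma.
    toy_split[col] = toy_split[col].str.strip()

print(toy_split)
```

Handling the two columns one after the other yields every sender and recipient combination for a letter, which is the same effect the two separate loops achieve.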

Copy the columns `ancient_author` and `recipient` into `source` and `target`, respectively, and add `Type` to make the dataframe compatible with `df_edges_cat` created above. For this set of rows all the edges are directed because they connect sender and recipient.

In [None]:
df_cat2["source"] = df_cat2["ancient_author"].copy()
df_cat2["target"] = df_cat2["recipient"].copy()
df_cat2["Type"] = "directed"

Combine the two Dataframes; change all missing values into the empty string.

In [None]:
# DataFrame.append() was removed in pandas 2.0; pd.concat() does the same job.
df_combined = pd.concat([df_edges_cat, df_cat2], ignore_index=True)
df_combined = df_combined.fillna("")
df_combined

Write to a `CSV` file to be imported as an edge list in [Gephi](https://gephi.org/).

In [None]:
with open("csv/edges.csv", mode="w", encoding="utf-8") as w:
    df_combined.to_csv(w, index=False)

# 4. Create Node List
[Gephi](https://gephi.org/) will automatically create a node list from the edge list. The advantage of explicitly adding a node list is that we can add one or more attributes to the nodes.

First create the node list from the columns `source` and `target` in the `df_combined` dataframe.

In [None]:
nodes = list(set(list(df_combined["source"])) | set(list(df_combined["target"])))
df_nodes = pd.DataFrame(nodes, columns=["label"])

The file `sargoneponyms.csv` is a simple, one-dimensional list of names that match the names as they appear in the data set. Read the list of eponyms (or any other type of attribute) and add a column `eponym`. The value of each row in that column is set to `True`.

In [None]:
with open("csv/sargoneponyms.csv", mode="r", encoding="utf-8") as f:
    eponyms = pd.read_csv(f)
eponyms["eponym"] = True

Merge the `df_nodes` dataframe with the `eponyms` dataframe. The `outer` method keeps all rows of both dataframes and joins them where they match (in this case on `label`). The default behavior of `merge()` is to keep only those rows that match. Where there is no match (not an eponym) the `eponym` column will have `NaN` ("Not a Number", or missing value). These missing values are set to `False`.

Copy the column `label` into a column `Id`. The `Id` column is used by Gephi to identify nodes; the `label` column is used to display labels in a graph.

In [None]:
df = df_nodes.merge(eponyms, on = "label", how="outer")
df["eponym"] = df["eponym"].fillna(False)
df["Id"] = df["label"].copy()
df
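
The effect of the `how` option can be seen on a toy example. The sketch below uses invented labels and compares the default (inner), `outer`, and `left` merges; `indicator=True` adds a `_merge` column showing where each row came from. With `outer`, a name that appears only in the eponym list (here "D") also becomes a node, whereas `left` would keep only names found in the edge list.

```python
import pandas as pd

# Invented labels for illustration.
toy_nodes = pd.DataFrame({"label": ["A", "B", "C"]})
toy_eponyms = pd.DataFrame({"label": ["B", "D"], "eponym": True})

# Default behavior: how="inner" keeps only matching rows.
inner = toy_nodes.merge(toy_eponyms, on="label")
# Outer keeps all rows of both dataframes; _merge records the origin of each row.
outer = toy_nodes.merge(toy_eponyms, on="label", how="outer", indicator=True)
# Left keeps all rows of toy_nodes only.
left = toy_nodes.merge(toy_eponyms, on="label", how="left")

only_in_eponyms = outer.loc[outer["_merge"] == "right_only", "label"].tolist()

print(inner["label"].tolist())
print(sorted(outer["label"]))
print(left["label"].tolist())
print(only_in_eponyms)
```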

Save the nodes list as a CSV.

In [None]:
with open("csv/nodes.csv", mode="w", encoding="utf-8") as w:
    df.to_csv(w, index=False)

# 5. Import into [Gephi](https://gephi.org/)
First, import the node list (go to Data Laboratory and click on `import spreadsheet`). After importing, copy the `Id` column to the `Label` column. Now import the edge list. The order of import is important. When you import an edge list, [Gephi](https://gephi.org/) will add all non-existent nodes to the node list.