## Create Graph Linking Cities By Shared Authors

This notebook creates two files, `edges.tsv` and `nodes.csv`, formatted for import into [flourish.studio](https://flourish.studio/) as a network graph.

The first step below converts the flat CSV data into a nested dictionary keyed by author and then by city, and also counts the total number of records for each city.
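As a rough illustration of the target structure, the sketch below applies the same grouping to a few made-up rows held in memory. The sample rows, the author names, and the `author_cities_demo` variable are hypothetical; only the column names come from the dataset.

```python
import csv
import io

# Hypothetical sample rows; only the column names match the real file.
sample = io.StringIO(
    "Author,City of Publication,State or Province\n"
    "Author A,San Francisco,CA\n"
    "Author A,Berkeley,CA\n"
    "Author A,San Francisco,CA\n"
    "Author B,Seattle,WA\n"
)

author_cities_demo = {}
for row in csv.DictReader(sample):
    author = row["Author"].strip()
    citystate = ", ".join([row["City of Publication"], row["State or Province"]])
    # One inner dictionary per author, mapping "City, ST" to a record count.
    per_author = author_cities_demo.setdefault(author, {})
    per_author[citystate] = per_author.get(citystate, 0) + 1

print(author_cities_demo)
# {'Author A': {'San Francisco, CA': 2, 'Berkeley, CA': 1}, 'Author B': {'Seattle, WA': 1}}
```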

In [None]:
import csv

csv_path = "../comics_as_data_north_america_2020-01-09_Cleaned.csv"
author_cities = {}
city_counts = {}

# newline="" is what the csv module recommends when opening files for reading.
with open(csv_path, newline="") as f:
    csvreader = csv.DictReader(f)
    
    for row in csvreader:

        # Separate the author name from the trailing URI.
        author = row["Author"].split("http")[0].strip()
        # Skip records without an author.
        if author == "":
            continue

        city = row["City of Publication"]

        # Skip records without a city, i.e. where the field holds only a numeric country code.
        if city.isdigit():
            continue

        # Node label, e.g. "Berkeley, CA".
        citystate = ", ".join([city, row["State or Province"]]).strip()
        
        # print(author, city)

        if author not in author_cities:
            author_cities[author] = {}
        
        # Count this author's records for each city.
        author_cities[author][citystate] = author_cities[author].get(citystate, 0) + 1
        
        # Total # of records for each city.
        city_counts[citystate] = city_counts.get(citystate, 0) + 1
            
# Number of distinct authors found.
print(len(author_cities))

The next step iterates over every unique pair of cities associated with each author and creates a link between them, so a link's weight is the number of authors the two cities share. For example, if an author is associated with San Francisco, Berkeley, and Oakland, the following links are created:
- San Francisco -- Berkeley
- San Francisco -- Oakland
- Berkeley -- Oakland

These links are undirected, so the order of the cities within a link doesn't matter.
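The pairing and weighting can be sketched with the standard library alone. This is only an illustration: the notebook itself relies on a `networkx.Graph`, which handles undirected edges directly, whereas this sketch swaps in a `Counter` keyed by `frozenset`. The author names and the `cities_by_author` and `pair_weights` variables are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical authors; the first uses the three cities named above.
cities_by_author = {
    "Author A": ["San Francisco", "Berkeley", "Oakland"],
    "Author B": ["Oakland", "Berkeley"],
}

pair_weights = Counter()
for cities in cities_by_author.values():
    for a, b in combinations(cities, 2):
        # A frozenset ignores order, so (Oakland, Berkeley) and
        # (Berkeley, Oakland) count toward the same link.
        pair_weights[frozenset((a, b))] += 1

for pair, weight in pair_weights.items():
    print(" -- ".join(sorted(pair)), weight)
```

Here the Berkeley and Oakland link ends up with weight 2 because both hypothetical authors published in both cities, while the other two links have weight 1.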

In [None]:
from itertools import combinations
import networkx as nx

G = nx.Graph()

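for author, cities in author_cities.items():
    # Iterating over the dictionary yields its keys, i.e. the city names.
    for i, j in combinations(cities, 2):

        if G.has_edge(i, j):
            # Another author shared by this pair of cities.
            # Each author is visited only once, so no duplicate check is needed.
            G[i][j]["weight"] += 1
            G[i][j]["authors"].append(author)

        else:
            G.add_edge(i, j, weight=1, authors=[author])
            # Node weight is the total number of records for the city.
            G.nodes[i]["weight"] = city_counts[i]
            G.nodes[j]["weight"] = city_counts[j]
            # Assumes each label ends in a two-letter state or province code.
            G.nodes[i]["state"] = i[-2:]
            G.nodes[j]["state"] = j[-2:]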


Lastly, create the two files: one listing all the nodes (publication cities) and one listing which nodes should be connected.
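To make the intended shape of the two files concrete, the sketch below parses small in-memory samples. It assumes the edge file holds one `source<TAB>target<TAB>weight` line per edge and the node file has `id`, `weight`, and `state` columns; all of the values shown are made up.

```python
import csv
import io

# Hypothetical file contents, shaped like the two outputs described above.
edges_text = "San Francisco, CA\tBerkeley, CA\t3\nBerkeley, CA\tOakland, CA\t1\n"
nodes_text = (
    "id,weight,state\n"
    '"San Francisco, CA",12,CA\n'
    '"Berkeley, CA",5,CA\n'
    '"Oakland, CA",4,CA\n'
)

edges = [
    (source, target, int(weight))
    for source, target, weight in csv.reader(io.StringIO(edges_text), delimiter="\t")
]
nodes = {row["id"]: row for row in csv.DictReader(io.StringIO(nodes_text))}

# Every edge endpoint should also appear as a node id.
missing = {name for s, t, _ in edges for name in (s, t)} - set(nodes)
print(edges)
print(missing)
```

Because the node ids contain commas, the `csv` module quotes them in `nodes.csv`, while the tab delimiter keeps them unambiguous in `edges.tsv`.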

In [None]:
import csv

nx.write_edgelist(G, "edges.tsv", delimiter="\t", data=["weight"])
# newline="" prevents blank rows between records on Windows.
with open("nodes.csv", "w", newline="") as f:
    fieldnames = ["id", "weight", "state"]
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for n, data in G.nodes(data=True):
        row = {
            "id": n,
            "weight": data["weight"],
            "state": data["state"]
        }
        writer.writerow(row)
    