
# Creating Networks

In this demo we will look at the core concepts of network analysis by using the Python [NetworkX](https://networkx.org) package to construct networks from raw data. In this case, we will create a network representation from a real dataset of U.S. air transport records.

Firstly, import the required modules, including NetworkX:

In [1]:
from pathlib import Path
import pandas as pd
import networkx as nx

## Data Loading
We will load the flight record data from the file airstats.csv into a Pandas Data Frame. 

In [2]:
in_path = Path("../Data", "airstats.csv")
# create the Data Frame
df = pd.read_csv(in_path, index_col=0)
print("Read %d flight records" % len(df))
# display a few rows
df.head(10)

Read 119038 flight records


FLIGHT_ID,ORIGIN,DEST,ORIGIN_CITY_NAME,DEST_CITY_NAME
4679,AEX,SAT,"Alexandria, LA","San Antonio, TX"
4680,ATL,MDW,"Atlanta, GA","Chicago, IL"
4681,AZA,LAS,"Phoenix, AZ","Las Vegas, NV"
4682,AZA,MIA,"Phoenix, AZ","Miami, FL"
4683,AZO,LEX,"Kalamazoo, MI","Lexington, KY"
4684,BED,BNA,"Bedford, MA","Nashville, TN"
4685,BED,DAL,"Bedford, MA","Dallas, TX"
4686,BED,FRG,"Bedford, MA","East Farmingdale, NY"
4687,BED,GSO,"Bedford, MA","Greensboro/High Point, NC"
4688,BED,RDU,"Bedford, MA","Raleigh/Durham, NC"


## Creating a Directed Network

We will now construct a *directed unweighted network* such that:

- There is a node for each airport involved in a flight record. We will use the three-letter IATA airport codes for the origin and destination as the node identifiers. We will also add the airport city name as an attribute for each node.
- There is a directed edge between each unique origin and destination pair, based on the flight records.

First, get the set of all airports:

In [None]:
# get set of all airports
origins = set(df["ORIGIN"].unique())
destinations = set(df["DEST"].unique())
airports = origins.union(destinations)

In [4]:
# create a map from the airport codes to the city name
city_names = {}
for i, row in df.iterrows():
    city_names[row["ORIGIN"]] = row["ORIGIN_CITY_NAME"]
    city_names[row["DEST"]] = row["DEST_CITY_NAME"]
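The row-by-row loop above can be slow on large Data Frames. As a minimal sketch, the same kind of mapping can be built with vectorised pandas operations. The `toy` Data Frame and the names `pairs` and `toy_city_names` are illustrative assumptions, not part of the original notebook. Note one difference: this version keeps the first city name seen for each code, whereas the loop keeps the last, so the two only agree when each code always maps to the same city name.

```python
import pandas as pd

# toy records using the same column names as airstats.csv (illustrative values)
toy = pd.DataFrame({
    "ORIGIN": ["ATL", "AZA", "AZA"],
    "DEST": ["MDW", "LAS", "MIA"],
    "ORIGIN_CITY_NAME": ["Atlanta, GA", "Phoenix, AZ", "Phoenix, AZ"],
    "DEST_CITY_NAME": ["Chicago, IL", "Las Vegas, NV", "Miami, FL"],
})

# stack the origin and destination columns into a single (code, city) table
origin_part = toy[["ORIGIN", "ORIGIN_CITY_NAME"]].rename(
    columns={"ORIGIN": "code", "ORIGIN_CITY_NAME": "city"})
dest_part = toy[["DEST", "DEST_CITY_NAME"]].rename(
    columns={"DEST": "code", "DEST_CITY_NAME": "city"})
pairs = pd.concat([origin_part, dest_part], ignore_index=True)

# keep the first city name per airport code and convert to a dictionary
toy_city_names = pairs.drop_duplicates("code").set_index("code")["city"].to_dict()
print(toy_city_names)
```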

Create a directed network, with a node for each airport:

In [5]:
# here a DiGraph indicates a directed network
g = nx.DiGraph()
nodes = sorted(airports)
for node in nodes:
    # we add the city name as an attribute
    g.add_node(node, city=city_names[node])

Create a directed edge between each unique origin and destination pair, based on the flight records.

Note that a NetworkX `DiGraph` does not store duplicate edges: adding an edge that already exists leaves the edge count unchanged.
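To illustrate this behaviour in isolation, here is a small sketch on a hypothetical two-airport graph (the name `demo` is an assumption used only for this example). Adding the same directed edge twice leaves a single edge, while the reverse direction counts as a separate edge.

```python
import networkx as nx

demo = nx.DiGraph()
demo.add_edge("ATL", "MDW")
demo.add_edge("ATL", "MDW")  # same direction again: no new edge is created
demo.add_edge("MDW", "ATL")  # opposite direction: a distinct edge in a DiGraph

edge_count = demo.number_of_edges()
print("Demo network has %d edges" % edge_count)
```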

In [6]:
for i, row in df.iterrows():
    node1 = row["ORIGIN"]
    node2 = row["DEST"]
    # ignore self-loops, in case they exist
    if node1 == node2:
        continue
    g.add_edge(node1, node2)

We can check the size of our new network:

In [7]:
print("Network has %d nodes and %d edges" % (g.number_of_nodes(), g.number_of_edges()))

Network has 1043 nodes and 17644 edges
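As an alternative to the explicit loops, a similar unweighted network could be sketched with `nx.from_pandas_edgelist`. The `toy` Data Frame below is an illustrative assumption. One caveat: this shortcut only creates nodes that appear in at least one retained edge, so an airport that occurs solely in self-loop records would be missing, unlike in the loop-based construction above.

```python
import networkx as nx
import pandas as pd

# toy flight records, including a duplicate route and a self-loop
toy = pd.DataFrame({
    "ORIGIN": ["ATL", "ATL", "AZA", "BED", "BED"],
    "DEST": ["MDW", "MDW", "LAS", "BNA", "BED"],
})

# drop self-loops before building the network
no_loops = toy[toy["ORIGIN"] != toy["DEST"]]

# build a directed network; duplicate rows collapse into a single edge
toy_g = nx.from_pandas_edgelist(
    no_loops, source="ORIGIN", target="DEST", create_using=nx.DiGraph)
print("Toy network has %d nodes and %d edges"
      % (toy_g.number_of_nodes(), toy_g.number_of_edges()))
```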


## Creating a Directed Weighted Network

As an alternative network representation, we will now use the original Data Frame to create a *directed weighted network*, such that:

- There is a node for each airport involved in a flight record. We use the three-letter IATA airport codes for the origin and destination as the node identifiers. We add the airport city name as an attribute for each node.
- There is a directed edge between each unique origin and destination pair, based on the flight records. The *weight* on an edge indicates the number of flights from the source airport to the target airport.

In [8]:
# create the new network
g = nx.DiGraph()
nodes = list(airports)
nodes.sort()
for node in nodes:
    g.add_node(node, city=city_names[node])

In [9]:
# count the flight frequencies between each pair of airports.
from collections import Counter
freqs = Counter()
# iterate over every flight record in the Data Frame
for i, row in df.iterrows():
    node1 = row["ORIGIN"]
    node2 = row["DEST"]
    # ignore self-loops, in case they exist
    if node1 == node2:
        continue
    pair = (node1, node2)
    freqs[pair] += 1
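The same counts can also be obtained without an explicit loop. The following is a minimal sketch using a pandas `groupby` on a toy Data Frame. The names `toy`, `route_counts` and `toy_freqs` are assumptions for illustration only.

```python
import pandas as pd

# toy flight records: one repeated route, one reverse route, one self-loop
toy = pd.DataFrame({
    "ORIGIN": ["ATL", "ATL", "MDW", "BED"],
    "DEST": ["MDW", "MDW", "ATL", "BED"],
})

# ignore self-loops, as in the loop-based version
no_loops = toy[toy["ORIGIN"] != toy["DEST"]]

# count the number of records for each (origin, destination) pair
route_counts = no_loops.groupby(["ORIGIN", "DEST"]).size()

# convert to a dictionary keyed by (origin, destination) tuples
toy_freqs = route_counts.to_dict()
print(toy_freqs)
```

The resulting dictionary has the same shape as the `Counter` built above, so it could be passed to the same edge-creation loop.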

Now we create a directed weighted edge between each unique origin and destination airport pair:

In [10]:
for pair in freqs:
    g.add_edge(pair[0], pair[1], weight=freqs[pair])
print("Created network with %d nodes and %d edges" % (g.number_of_nodes(), g.number_of_edges()))

Created network with 1043 nodes and 17644 edges


Based on the weights on the edges in this network, we can identify the most frequent flight routes in this flight network.

One way to do this is to convert the network's edge list into a Data Frame, which we can then sort and browse:

In [11]:
rows = []
for node1, node2, data in g.edges(data=True):
    # get the city names from the node attributes
    origin = g.nodes[node1]["city"]
    destination = g.nodes[node2]["city"]
    rows.append({"Origin":origin, "Destination":destination, "Weight":data["weight"]})
# create the Data Frame and sort in descending order by weight
df_edges = pd.DataFrame(rows).sort_values(by="Weight", ascending=False)
df_edges.head(20)

,Origin,Destination,Weight
11259,"Minneapolis, MN","Chicago, IL",78
12356,"Chicago, IL","Minneapolis, MN",75
4386,"Denver, CO","Salt Lake City, UT",71
4910,"Detroit, MI","Chicago, IL",70
12332,"Chicago, IL","Detroit, MI",69
15841,"Salt Lake City, UT","Denver, CO",63
3902,"Cincinnati, OH","Chicago, IL",61
12317,"Chicago, IL","Kansas City, MO",60
12328,"Chicago, IL","Atlanta, GA",60
9228,"Los Angeles, CA","San Francisco, CA",60
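If only the top few routes are needed, the edges can also be ranked directly without building a Data Frame. This is a sketch on a hypothetical three-node graph with made-up weights. The node names and the variables `demo`, `ranked` and `top_two` are assumptions for illustration.

```python
import networkx as nx

# small weighted directed network with made-up weights
demo = nx.DiGraph()
demo.add_edge("A", "B", weight=5)
demo.add_edge("B", "C", weight=1)
demo.add_edge("C", "A", weight=3)

# edges(data="weight") yields (source, target, weight) tuples
ranked = sorted(demo.edges(data="weight"), key=lambda edge: edge[2], reverse=True)

# keep the two heaviest edges
top_two = ranked[:2]
for source, target, weight in top_two:
    print("%s -> %s: %d" % (source, target, weight))
```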


We can save this weighted directed network as a GEXF file for later use:

In [None]:
nx.write_gexf(g, "airstats-weighted-directed.gexf")
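A saved GEXF file can later be loaded again with `nx.read_gexf`. The sketch below round-trips a hypothetical two-node graph through a temporary file. The graph contents and the file name `demo.gexf` are assumptions for illustration. Edge weights may come back as floats rather than integers, so it is worth checking types after loading.

```python
import os
import tempfile

import networkx as nx

# small weighted directed network with a node attribute
demo = nx.DiGraph()
demo.add_node("ATL", city="Atlanta, GA")
demo.add_node("MDW", city="Chicago, IL")
demo.add_edge("ATL", "MDW", weight=3)

# write to a temporary GEXF file, then read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    gexf_path = os.path.join(tmp_dir, "demo.gexf")
    nx.write_gexf(demo, gexf_path)
    loaded = nx.read_gexf(gexf_path)

print("Loaded network has %d nodes and %d edges"
      % (loaded.number_of_nodes(), loaded.number_of_edges()))
```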