# Lab 03 Tasks - Solution

The dataset used in this lab consists of flight records retrieved from the [US Bureau of Transportation Statistics website](https://www.transtats.bts.gov). The data covers Q1 and Q2 of 2016 and includes each flight’s origin and destination, along with other metadata. The raw data is provided as a single CSV file (airstats-2016.csv).

In [None]:
import networkx as nx
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

### Task 1

Load the flight record data from the file airstats-2016.csv into a Pandas Data Frame, and apply the following filtering steps to the Data Frame:

1. Only include records where the flight origin and destination were both in the United States.
2. Only include records from the time periods Q1 2016 and Q2 2016.
3. Only include records where the reported distance between the origin and destination was at least 20 miles.

In [None]:
# Load the data
df = pd.read_csv("airstats-2016.csv")
print( "%d flight records" % len(df))
df.head()

In [None]:
# Only include records where the flight origin and destination were both in the United States.
df2 = df[(df["ORIGIN_COUNTRY"]=="US") & (df["DEST_COUNTRY"]=="US")]
print("%d flight records after filtering" % len(df2))
df2.head()

In [None]:
# include only records from the time periods Q1 2016 and Q2 2016.
df3 = df2[(df2["QUARTER"]<3)]
print("%d flight records after filtering" % len(df3))
df3.head()

In [None]:
# only include records where the reported distance between the origin and destination was at least 20 miles.
df4 = df3[(df3["DISTANCE"]>=20)]
print("%d flight records after filtering" % len(df4))
df4.head()
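
The three filters above can also be combined into a single boolean mask. The following is a minimal sketch on a small made-up Data Frame (the rows and the names `sample`, `mask` and `filtered` are illustrative, not taken from airstats-2016.csv); it only assumes the same column names used in the cells above.

```python
import pandas as pd

# Tiny made-up sample that reuses the column names from the lab dataset.
sample = pd.DataFrame({
    "ORIGIN_COUNTRY": ["US", "US", "CA", "US"],
    "DEST_COUNTRY":   ["US", "US", "US", "US"],
    "QUARTER":        [1, 3, 2, 2],
    "DISTANCE":       [150.0, 900.0, 400.0, 12.0],
})

# Combine the three conditions with & (each one wrapped in parentheses).
mask = (
    (sample["ORIGIN_COUNTRY"] == "US")
    & (sample["DEST_COUNTRY"] == "US")
    & (sample["QUARTER"].isin([1, 2]))
    & (sample["DISTANCE"] >= 20)
)
filtered = sample[mask]
print("%d flight records after filtering" % len(filtered))
```

Applying the filters one at a time, as in the cells above, has the advantage of showing how many records each step removes.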

### Task 2

Create an **unweighted directed** network from the Pandas Data Frame.

Use the three letter IATA airport codes for the origin and destination as the node identifiers. Also add the airport’s city name as an attribute for each node.

In [None]:
# get set of all airports
origins = set(df4["ORIGIN"].unique())
destinations = set(df4["DEST"].unique())
airports = origins.union(destinations)

In [None]:
# create a map from the airport codes to the city name
city_names = {}
for i, row in df4.iterrows():
    city_names[row["ORIGIN"]] = row["ORIGIN_CITY_NAME"]
    city_names[row["DEST"]] = row["DEST_CITY_NAME"]
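
Looping with `iterrows()` can be slow on a large Data Frame. A possible alternative, sketched here on made-up rows, builds the same kind of mapping with `zip`. It assumes each airport code always appears with the same city name; the names `sample` and `city_names_fast` are illustrative.

```python
import pandas as pd

# Made-up rows with the same column names as the lab dataset.
sample = pd.DataFrame({
    "ORIGIN": ["JFK", "LAX", "JFK"],
    "ORIGIN_CITY_NAME": ["New York, NY", "Los Angeles, CA", "New York, NY"],
    "DEST": ["LAX", "ORD", "ORD"],
    "DEST_CITY_NAME": ["Los Angeles, CA", "Chicago, IL", "Chicago, IL"],
})

# Pair each code with its city name, column by column, without a row loop.
origin_map = dict(zip(sample["ORIGIN"], sample["ORIGIN_CITY_NAME"]))
dest_map = dict(zip(sample["DEST"], sample["DEST_CITY_NAME"]))

# Merge the two maps; destination entries overwrite origin entries on clashes.
city_names_fast = {**origin_map, **dest_map}
print(city_names_fast)
```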

In [None]:
# create a directed network and add the nodes
g = nx.DiGraph()
nodes = sorted(list(airports))
for node in nodes:
    g.add_node(node, city=city_names[node] )

In [None]:
# create a directed edge between each unique origin and destination pair, based on the flight records
# note that, in a standard DiGraph, NetworkX ignores duplicate edges and only adds them once
for i, row in df4.iterrows():
    node1 = row["ORIGIN"]
    node2 = row["DEST"]
    # ignore self-loops, in case they exist
    if node1 == node2:
        continue
    g.add_edge(node1, node2)
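
As an alternative to the explicit loop, NetworkX can build the graph directly from a Data Frame with `nx.from_pandas_edgelist`. The sketch below uses made-up rows and the illustrative names `sample`, `no_loops` and `g_alt`. Note that, unlike the cells above, it does not attach the city attribute, and an airport that only appears in a self-loop record would not become a node.

```python
import networkx as nx
import pandas as pd

# Made-up flight records, including a duplicate route and a self-loop.
sample = pd.DataFrame({
    "ORIGIN": ["JFK", "JFK", "LAX", "ORD"],
    "DEST":   ["LAX", "LAX", "JFK", "ORD"],
})

# Drop self-loops before building the graph.
no_loops = sample[sample["ORIGIN"] != sample["DEST"]]

# Duplicate origin/destination pairs collapse into a single directed edge.
g_alt = nx.from_pandas_edgelist(
    no_loops, source="ORIGIN", target="DEST", create_using=nx.DiGraph()
)
print("Network has %d nodes and %d edges" % (g_alt.number_of_nodes(), g_alt.number_of_edges()))
```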

### Task 3

Characterise the unweighted directed network from Task 2, looking at:
      
1. How many nodes and edges are in the network?
2. The connectedness of the network (i.e. density and number of components).
3. Identify frequent origin and destination airports in the network (i.e. in-degree and out-degree).
4. Identify key hub airports in the network (i.e. betweenness).

In [None]:
# How many nodes and edges are in the network?
print("Network has %d nodes and %d edges" % ( g.number_of_nodes(), g.number_of_edges() ) )

In [None]:
# What is the density of the network?
nx.density(g)
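
For a directed graph with n nodes and m edges, density is m / (n(n-1)), i.e. the fraction of all possible ordered node pairs that are connected. The following sketch checks this on a made-up toy graph (the graph `toy` and its node labels are illustrative).

```python
import networkx as nx

# Toy directed graph: 3 nodes, 2 edges.
toy = nx.DiGraph()
toy.add_edges_from([("A", "B"), ("B", "C")])

n = toy.number_of_nodes()
m = toy.number_of_edges()

# Directed density: edges present divided by the n*(n-1) possible ordered pairs.
manual_density = m / (n * (n - 1))
library_density = nx.density(toy)
print(manual_density, library_density)
```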

In [None]:
# how many strongly connected components?
nx.number_strongly_connected_components(g)
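
In a directed network it can also be useful to count weakly connected components, which ignore edge direction. The sketch below contrasts the two counts on a made-up toy graph (`toy` and its node labels are illustrative); the same call could be applied to `g`.

```python
import networkx as nx

# Toy directed graph: A and B link both ways, C is only reachable from B,
# and D -> E forms a separate part of the network.
toy = nx.DiGraph()
toy.add_edges_from([("A", "B"), ("B", "A"), ("B", "C"), ("D", "E")])

# Strongly connected: every node can reach every other node following edge direction.
n_strong = nx.number_strongly_connected_components(toy)
# Weakly connected: nodes are connected once edge direction is ignored.
n_weak = nx.number_weakly_connected_components(toy)
print(n_strong, n_weak)
```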

In [None]:
# create Pandas series for in-degree and out-degree scores
s_in = pd.Series( dict(g.in_degree()) )
s_out = pd.Series( dict(g.out_degree()) )

In [None]:
# get top ranked by in-degree
s_in.sort_values(ascending=False).head(10)

In [None]:
# get top ranked by out-degree
s_out.sort_values(ascending=False).head(10)

In [None]:
# use betweenness centrality to identify key hub airports in the network:
s_bet = pd.Series(nx.betweenness_centrality(g))
s_bet.sort_values(ascending=False).head(10)

### Task 4

Now create an alternative **weighted directed** network from the filtered Pandas Data Frame.

In [None]:
# create the new network
g = nx.DiGraph()
nodes = list(airports)
nodes.sort()
for node in nodes:
    g.add_node(node, city=city_names[node])

In [None]:
# count the flight frequencies between each pair of airports.
from collections import Counter
freqs = Counter()
# make sure to apply this to the filtered Data Frame
for i, row in df4.iterrows():
    node1 = row["ORIGIN"]
    node2 = row["DEST"]
    # ignore self-loops, in case they exist
    if node1 == node2:
        continue
    pair = (node1,node2)
    freqs[pair] += 1

In [None]:
# now create a directed weighted edge between each unique origin and destination airport pair
for pair in freqs:
    g.add_edge( pair[0], pair[1], weight=freqs[pair] )
print("Created network with %d nodes and %d edges" % ( g.number_of_nodes(), g.number_of_edges() ) )
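
The same pair counts can be obtained without a row loop by grouping on the origin and destination columns. This is a minimal sketch on made-up rows; the names `sample`, `no_loops`, `pair_counts` and `freqs_alt` are illustrative.

```python
import pandas as pd

# Made-up flight records, including a repeated route and a self-loop.
sample = pd.DataFrame({
    "ORIGIN": ["JFK", "JFK", "LAX", "ORD"],
    "DEST":   ["LAX", "LAX", "JFK", "ORD"],
})

# Drop self-loops, then count the records for each (origin, destination) pair.
no_loops = sample[sample["ORIGIN"] != sample["DEST"]]
pair_counts = no_loops.groupby(["ORIGIN", "DEST"]).size()

# Convert to a dict keyed by (origin, destination) tuples, like the Counter above.
freqs_alt = pair_counts.to_dict()
print(freqs_alt)
```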

### Task 5

Based on the weighted directed network, identify:
    
1. The most frequent routes in the network.
2. The most frequent origin and destination airports in the network, considering edge weights.

In [None]:
# get all the edge weights, so that we can find the highest weight pairs
weights = {}
for e in g.edges(data=True):
    pair = (e[0],e[1])
    weights[pair] = e[2]["weight"]

In [None]:
# convert to a series
s_weights = pd.Series(weights)
# get the top ranked pairs
s_weights.sort_values(ascending=False).head(10)

In [None]:
# find the most frequent origin airports, based on weighted out-degree
s_wout = pd.Series( dict(g.out_degree(weight="weight")) )
s_wout.sort_values(ascending=False).head(10)

In [None]:
# find the most frequent destination airports, based on weighted in-degree
s_win = pd.Series( dict(g.in_degree(weight="weight")) )
s_win.sort_values(ascending=False).head(10)
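
To see how the weighted degree differs from the plain degree, the following sketch compares the two on a made-up toy graph (the graph `toy` and its weights are illustrative). The weighted in-degree of a node is the sum of the weights on its incoming edges.

```python
import networkx as nx

# Toy weighted directed graph.
toy = nx.DiGraph()
toy.add_edge("A", "B", weight=3)
toy.add_edge("C", "B", weight=2)
toy.add_edge("B", "A", weight=1)

# Plain in-degree counts incoming edges; weighted in-degree sums their weights.
plain_in = dict(toy.in_degree())
weighted_in = dict(toy.in_degree(weight="weight"))
print(plain_in)
print(weighted_in)
```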