# US Flight Network Analysis with GraphFrames

This notebook explores the structure and connectivity of the US domestic flight network using GraphFrames.

We treat airports as Vertices (Nodes) and individual flights as Edges (Connections) to perform key graph operations such as calculating connectivity (Degree), identifying important hubs (PageRank), finding flight communities (Label Propagation), and searching for paths (BFS).





## Spark set-up and Imports

In [2]:
!java -version
!pip install "pyspark==3.5.0"
# Install Java 17
!sudo apt-get update
!sudo apt-get install -y openjdk-17-jdk-headless

!java -version

openjdk version "17.0.16" 2025-07-15
OpenJDK Runtime Environment (build 17.0.16+8-Ubuntu-0ubuntu122.04.1)
OpenJDK 64-Bit Server VM (build 17.0.16+8-Ubuntu-0ubuntu122.04.1, mixed mode, sharing)
Collecting pyspark==3.5.0
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425346 sha256=dfb1d32c1f6ed9cff0969ed9d887398867587fa009db3d1c651c43ad901cdc9c
  Stored in directory: /root/.cache/pip/wheels/84/40/20/65eefe766118e0a8f8e385cc3ed6e9eb7241c7e51cfc04c51a
Successfully built pyspark
Installing collected packages: pyspark
  Attempting uninstall: pyspark
    Found existing installation: pyspark 3.5.1
    Uninstalling pyspark-3.5.1:
  

In [3]:
%pip install graphframes-py==0.10.0

Collecting graphframes-py==0.10.0
  Downloading graphframes_py-0.10.0-py3-none-any.whl.metadata (3.7 kB)
Collecting nose==1.3.7 (from graphframes-py==0.10.0)
  Downloading nose-1.3.7-py3-none-any.whl.metadata (1.7 kB)
Downloading graphframes_py-0.10.0-py3-none-any.whl (48 kB)
Downloading nose-1.3.7-py3-none-any.whl (154 kB)
Installing collected packages: nose, graphframes-py
Successfully installed graphframes-py-0.10.0 nose-1.3.7


In [None]:
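from pyspark.sql import SparkSession

# Stop any Spark session left over from a previous run
active_session = SparkSession.getActiveSession()
if active_session is not None:
    active_session.stop()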

In [4]:
# Set JAVA_HOME to Java 17
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-17-openjdk-amd64"
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("GraphFrames") \
    .master("local[*]") \
    .config("spark.jars.packages", "io.graphframes:graphframes-spark3_2.12:0.10.0") \
    .getOrCreate()

print(f"spark version: {spark.version}")
print("spark session created with graphframes package specified!")

spark version: 3.5.0
spark session created with graphframes package specified!


In [5]:
# Imports
import os
import glob
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
from pyspark.sql.functions import col
from graphframes import GraphFrame

In [36]:
from pyspark.sql.functions import avg

# 1. Data Loading and Graph Creation

We load the flight data and then define the Vertices (unique airport codes with city names) and Edges (one per flight record).
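
To make the vertex and edge definitions concrete, here is a minimal pure-Python sketch on a handful of made-up flight records. The names `toy_records`, `toy_vertices`, and `toy_edges` are hypothetical and are not part of the notebook's Spark code; the sketch only mirrors the idea of deduplicating airports while keeping every flight as its own edge.

```python
# Toy flight records (hypothetical values): (origin, origin_city, dest, dest_city)
toy_records = [
    ("JFK", "New York, NY", "LAX", "Los Angeles, CA"),
    ("LAX", "Los Angeles, CA", "JFK", "New York, NY"),
    ("JFK", "New York, NY", "ATL", "Atlanta, GA"),
    ("JFK", "New York, NY", "LAX", "Los Angeles, CA"),  # same route flown again
]

# Vertices: every airport seen as an origin or a destination, deduplicated
toy_vertices = sorted(
    {(o, oc) for o, oc, _, _ in toy_records}
    | {(d, dc) for _, _, d, dc in toy_records}
)

# Edges: one (src, dst) pair per flight; repeated routes are kept on purpose
toy_edges = [(o, d) for o, _, d, _ in toy_records]

print(toy_vertices)
print(len(toy_edges))
```

Because every flight is its own edge, the resulting graph is a multigraph: a busy route contributes many parallel edges between the same two airports.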

In [1]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("patrickzel/flight-delay-and-cancellation-dataset-2019-2023")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/patrickzel/flight-delay-and-cancellation-dataset-2019-2023?dataset_version_number=7...


100%|██████████| 140M/140M [00:05<00:00, 27.6MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/patrickzel/flight-delay-and-cancellation-dataset-2019-2023/versions/7


In [6]:
# Load the data
df = spark.read.csv(f"{path}/flights_sample_3m.csv", header=True, inferSchema=True)
df.show(5)

+----------+--------------------+--------------------+------------+--------+---------+------+-------------------+----+--------------------+------------+--------+---------+--------+----------+---------+-------+------------+--------+---------+---------+-----------------+--------+----------------+------------+--------+--------+-----------------+-----------------+-------------+------------------+-----------------------+
|   FL_DATE|             AIRLINE|         AIRLINE_DOT|AIRLINE_CODE|DOT_CODE|FL_NUMBER|ORIGIN|        ORIGIN_CITY|DEST|           DEST_CITY|CRS_DEP_TIME|DEP_TIME|DEP_DELAY|TAXI_OUT|WHEELS_OFF|WHEELS_ON|TAXI_IN|CRS_ARR_TIME|ARR_TIME|ARR_DELAY|CANCELLED|CANCELLATION_CODE|DIVERTED|CRS_ELAPSED_TIME|ELAPSED_TIME|AIR_TIME|DISTANCE|DELAY_DUE_CARRIER|DELAY_DUE_WEATHER|DELAY_DUE_NAS|DELAY_DUE_SECURITY|DELAY_DUE_LATE_AIRCRAFT|
+----------+--------------------+--------------------+------------+--------+---------+------+-------------------+----+--------------------+------------+--------

In [7]:
# Define Vertices (Airports)
vertices = (
    df.select(col("ORIGIN").alias("id"), col("ORIGIN_CITY").alias("city"))
      .union(
          df.select(col("DEST").alias("id"), col("DEST_CITY").alias("city"))
      )
      .distinct()
)

In [8]:
# Define Edges (Flights)
edges = df.select(
    col("ORIGIN").alias("src"),
    col("DEST").alias("dst"),
    col("AIRLINE"),
    col("FL_NUMBER"),
    col("FL_DATE"),
    col("DEP_DELAY"),
    col("ARR_DELAY"),
    col("CANCELLED"),
    col("DISTANCE"),
    col("AIR_TIME"),
)

In [9]:
# Create the GraphFrame
g = GraphFrame(vertices, edges)

In [None]:
print("Number of airports (vertices):", g.vertices.count())
print("Number of flights (edges):", g.edges.count())

Number of airports (vertices): 381
Number of flights (edges): 3000000


In [None]:
g.vertices.show(5)
g.edges.show(5)

+---+--------------------+
| id|                city|
+---+--------------------+
|COS|Colorado Springs, CO|
|SDF|      Louisville, KY|
|PIR|          Pierre, SD|
|CLL|College Station/B...|
|MSN|         Madison, WI|
+---+--------------------+
only showing top 5 rows

+---+---+--------------------+---------+----------+---------+---------+---------+--------+--------+
|src|dst|             AIRLINE|FL_NUMBER|   FL_DATE|DEP_DELAY|ARR_DELAY|CANCELLED|DISTANCE|AIR_TIME|
+---+---+--------------------+---------+----------+---------+---------+---------+--------+--------+
|FLL|EWR|United Air Lines ...|     1562|2019-01-09|     -4.0|    -14.0|      0.0|  1065.0|   153.0|
|MSP|SEA|Delta Air Lines Inc.|     1149|2022-11-19|     -6.0|     -5.0|      0.0|  1399.0|   189.0|
|DEN|MSP|United Air Lines ...|      459|2022-07-22|      6.0|      0.0|      0.0|   680.0|    87.0|
|MSP|SFO|Delta Air Lines Inc.|     2295|2023-03-06|     -1.0|     24.0|      0.0|  1589.0|   249.0|
|MCO|DFW|    Spirit Air Lines|  

# 2. Graph Structure and Hub Analysis

This section analyzes fundamental network properties to identify importance and connectivity.

## 2.1. Degree Analysis (Hub Identification)

The Degree of an airport is the total number of incoming and outgoing flights. A high degree indicates a major hub or high-traffic airport.
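
As a rough illustration of this definition, the following pure-Python sketch counts in-degree, out-degree, and total degree on a tiny made-up edge list. The names `toy_flights`, `toy_in`, `toy_out`, and `toy_degree` are hypothetical, and the flights are invented for the example.

```python
from collections import Counter

# Hypothetical flights as (src, dst) pairs; one pair per flight
toy_flights = [
    ("ATL", "DFW"),
    ("ATL", "ORD"),
    ("DFW", "ATL"),
    ("ORD", "ATL"),
    ("ATL", "DFW"),
]

toy_out = Counter(src for src, _ in toy_flights)  # departures per airport
toy_in = Counter(dst for _, dst in toy_flights)   # arrivals per airport
toy_degree = toy_out + toy_in                     # total degree = in + out

for airport, degree in toy_degree.most_common():
    print(airport, toy_in[airport], toy_out[airport], degree)
```

Since each flight adds one to the source's out-degree and one to the destination's in-degree, the total degree of an airport is simply the number of flight records that touch it.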

In [None]:
# Total connections
g.degrees.orderBy("degree", ascending=False).show(10)

# Incoming flights
g.inDegrees.orderBy("inDegree", ascending=False).show(10)

# Outgoing flights
g.outDegrees.orderBy("outDegree", ascending=False).show(10)

+---+------+
| id|degree|
+---+------+
|ATL|307125|
|DFW|260104|
|ORD|245630|
|DEN|239511|
|CLT|189717|
|LAX|171493|
|PHX|150420|
|LAS|146932|
|SEA|141738|
|MCO|127701|
+---+------+
only showing top 10 rows

+---+--------+
| id|inDegree|
+---+--------+
|ATL|  153569|
|DFW|  129770|
|ORD|  123334|
|DEN|  119592|
|CLT|   95413|
|LAX|   85621|
|PHX|   75605|
|LAS|   73462|
|SEA|   70832|
|MCO|   63818|
+---+--------+
only showing top 10 rows

+---+---------+
| id|outDegree|
+---+---------+
|ATL|   153556|
|DFW|   130334|
|ORD|   122296|
|DEN|   119919|
|CLT|    94304|
|LAX|    85872|
|PHX|    74815|
|LAS|    73470|
|SEA|    70906|
|MCO|    63883|
+---+---------+
only showing top 10 rows



## 2.2. PageRank Algorithm (Network Importance)

PageRank measures the influence of an airport. It scores airports higher if they are connected to other highly influential airports, indicating critical hubs in the network flow.
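
The idea can be sketched with a plain power iteration on a toy hub-and-spoke network. This is a simplified illustration, not the GraphFrames implementation: the function `toy_pagerank`, its fixed iteration count, and the airports used are assumptions made for the example. Scores here are normalized to sum to 1, whereas the scores printed later in this notebook exceed 1, so only the relative ranking is comparable.

```python
def toy_pagerank(flights, reset_probability=0.15, iterations=50):
    """Power-iteration PageRank on (src, dst) pairs; scores sum to 1."""
    nodes = sorted({airport for pair in flights for airport in pair})
    out_count = {n: 0 for n in nodes}
    for src, _ in flights:
        out_count[src] += 1

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every airport receives the same small "teleport" share
        new_rank = {n: reset_probability / len(nodes) for n in nodes}
        # Each flight passes on an equal share of its origin's rank
        for src, dst in flights:
            new_rank[dst] += (1 - reset_probability) * rank[src] / out_count[src]
        # Airports with no departures spread their rank evenly
        dangling = sum(rank[n] for n in nodes if out_count[n] == 0)
        for n in nodes:
            new_rank[n] += (1 - reset_probability) * dangling / len(nodes)
        rank = new_rank
    return rank


# Hypothetical hub-and-spoke network: DFW connects to three small airports
toy_hub_flights = [
    ("DFW", "ABI"), ("ABI", "DFW"),
    ("DFW", "ACT"), ("ACT", "DFW"),
    ("DFW", "AMA"), ("AMA", "DFW"),
]
toy_ranks = toy_pagerank(toy_hub_flights)
print(sorted(toy_ranks.items(), key=lambda kv: kv[1], reverse=True))
```

In this toy network the hub ends up with the highest score because every spoke sends all of its rank to it, while the hub splits its own rank three ways.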



Because of limited resources, we subsample the dataset and keep only the 2023 flights.

In [10]:
df_2023 = df.filter(df.FL_DATE.startswith("2023"))

# rebuild edges & vertices
vertices = (
    df_2023.select(col("ORIGIN").alias("id"))
           .union(df_2023.select(col("DEST").alias("id")))
           .distinct()
)
edges = df_2023.select(
    col("ORIGIN").alias("src"),
    col("DEST").alias("dst"),
    col("AIRLINE"),
    col("FL_NUMBER"),
    col("FL_DATE"),
    col("DEP_DELAY"),
    col("ARR_DELAY"),
    col("CANCELLED"),
    col("DISTANCE"),
    col("AIR_TIME"),
)

g_small = GraphFrame(vertices, edges)

# Now run PageRank
pr = g_small.pageRank(resetProbability=0.15, tol=0.01)
pr.vertices.orderBy("pagerank", ascending=False).show(10)


+---+------------------+
| id|          pagerank|
+---+------------------+
|DFW|14.971593423888594|
|ATL|14.507191176323762|
|DEN|14.400678015494847|
|ORD|12.315512358376747|
|CLT| 8.351375393057925|
|SEA| 7.719350369113928|
|LAX| 7.584317218445228|
|LAS|7.5391506613638395|
|PHX| 7.105252947695909|
|LGA| 6.729546189010542|
+---+------------------+
only showing top 10 rows



In [None]:
# Number of arriving flights
g_small.edges.filter("dst = 'DFW'").count()

19157

In [None]:
# Number of departing flights
g_small.edges.filter("src = 'DFW'").count()

19271

## 2.3. Connected Components
Connected Components identifies groups of airports that are all mutually reachable when edge direction is ignored. This analysis tells us whether the network is a single connected whole or splits into isolated sub-networks.
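
For intuition, here is a small union-find sketch that groups airports into components on a made-up edge list, ignoring edge direction. The helper `toy_components` and the sample flights are hypothetical; GraphFrames uses its own distributed algorithm, so this only illustrates what the result means.

```python
def toy_components(flights):
    """Group airports into connected components, ignoring edge direction."""
    parent = {}

    def find(x):
        # Follow parent links to the representative, shortening the path as we go
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the two endpoints of every flight into one group
    for src, dst in flights:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_src] = root_dst

    groups = {}
    for airport in parent:
        groups.setdefault(find(airport), set()).add(airport)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical network: a mainland chain plus two airports linked only to each other
toy_cc_flights = [("JFK", "ATL"), ("ATL", "DFW"), ("HNL", "OGG")]
toy_cc_result = toy_components(toy_cc_flights)
print(toy_cc_result)
```

A single component in the output means every airport can be reached from every other one when flights are treated as two-way links.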

In [None]:
# Compute connected components
cc = g_small.connectedComponents()

In [None]:
# Show each airport with its component id
cc.select("id", "component").show(10)

+---+---------+
| id|component|
+---+---------+
|BGM|        0|
|PSE|        0|
|DLG|        0|
|INL|        0|
|MSY|        0|
|PPG|        0|
|GEG|        0|
|DRT|        0|
|SNA|        0|
|BUR|        0|
+---+---------+
only showing top 10 rows



In [None]:
# Count number of airports per component
cc.groupBy("component").count().orderBy("count", ascending=False).show(10)

+---------+-----+
|component|count|
+---------+-----+
|        0|  348|
+---------+-----+



This shows that all 348 airports in the 2023 subgraph belong to a single connected component.

# 3. Community Detection (Label Propagation Algorithm)

The Label Propagation Algorithm (LPA) is a fast method to find densely connected communities or clusters. In this context, it can reveal distinct regional or airline-specific networks.
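
The mechanism can be sketched in a few lines of plain Python: every node starts with its own label and repeatedly adopts the most common label among its neighbors. The function `toy_label_propagation`, the synchronous update, and the tie-breaking rule (smallest label wins) are assumptions made for this illustration and may differ from what GraphFrames does internally.

```python
from collections import Counter


def toy_label_propagation(links, max_iter=5):
    """Synchronous label propagation on undirected links; ties pick the smallest label."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Every node starts in its own community
    labels = {node: node for node in neighbors}
    for _ in range(max_iter):
        new_labels = {}
        for node, nbrs in neighbors.items():
            counts = Counter(labels[n] for n in nbrs)
            best = max(counts.values())
            new_labels[node] = min(lab for lab, c in counts.items() if c == best)
        labels = new_labels
    return labels


# Hypothetical graph: two triangles (A-B-C and D-E-F) joined by one bridge C-D
toy_links = [
    ("A", "B"), ("B", "C"), ("A", "C"),
    ("D", "E"), ("E", "F"), ("D", "F"),
    ("C", "D"),
]
toy_labels = toy_label_propagation(toy_links, max_iter=5)
print(toy_labels)
```

On this toy graph the two triangles settle into two separate labels even though a bridge connects them, which is the kind of dense-cluster structure LPA is meant to surface.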

In [30]:
# Run LPA for 5 iterations
lpa = g_small.labelPropagation(maxIter=5)

In [31]:
# Show the top 10 most common community labels
print("Label Propagation: Top 10 airport communities:")
lpa.select("id", "label").groupBy("label").count().orderBy("count", ascending=False).show(10)


Label Propagation: Top 10 airport communities:
+-----+-----+
|label|count|
+-----+-----+
|   21|  337|
|  149|    4|
|  262|    1|
|  232|    1|
|  143|    1|
|  342|    1|
|  312|    1|
|  121|    1|
|  204|    1|
+-----+-----+



## 3.1. Analyzing Small, Isolated Communities

We investigate the airports in the smallest communities (counts of 1 or 4) to understand which specific regional or fringe airports are loosely coupled to the main network.

In [33]:
# Labels of the small communities in the output above (excluding the main community, label 21)
labels_to_show = [149, 262, 232, 143, 342, 312, 121, 204]

# Keep only the airports in the LPA result (lpa) that belong to these small communities
small_communities = lpa.filter(col("label").isin(labels_to_show))

# Display the airport ID ('id') and its community label, sorted by label for grouping.
print("Airports belonging to the smaller communities:")
small_communities.select("id", "label").orderBy("label").show(20)

Airports belonging to the smaller communities:
+---+-----+
| id|label|
+---+-----+
|MCW|  121|
|SPN|  143|
|PPG|  149|
|LIH|  149|
|ITO|  149|
|KOA|  149|
|FOD|  204|
|HNL|  232|
|WRG|  262|
|GUM|  312|
|PSG|  342|
+---+-----+



# 4. Route Analysis and Pathfinding

## 4.1. Reciprocal Routes (Motif Finding)

Motif Finding searches for specific structural patterns. The motif (a)-[e1]->(b); (b)-[e2]->(a) identifies all pairs of airports with a reciprocal (two-way) flight connection, that is, at least one flight in each direction.
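
The reciprocal pattern itself is easy to check on a small set of distinct routes. The sketch below is a plain-Python stand-in for the motif query, using hypothetical names (`toy_routes`, `toy_reciprocal`) and invented routes; it is not how GraphFrames evaluates motifs.

```python
# Hypothetical distinct routes as (src, dst) pairs
toy_routes = {
    ("ATL", "ABE"), ("ABE", "ATL"),
    ("DFW", "ABI"), ("ABI", "DFW"),
    ("JFK", "MSP"),  # one-way only in this toy data
}

# A route is reciprocal when the reverse pair also exists; self-loops are skipped
toy_reciprocal = sorted(
    (a, b) for a, b in toy_routes if a != b and (b, a) in toy_routes
)
print(toy_reciprocal)
```

As with the motif query, each reciprocal pair appears twice, once per direction, unless one ordering is filtered out (for example by keeping only pairs where `a < b`).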

In [40]:
# Motif: (a)-[e1]->(b); (b)-[e2]->(a)
# a and b are airports (vertices), e1 and e2 are flights (edges)
reciprocal_routes = g_small.find("(a)-[e1]->(b); (b)-[e2]->(a)")

In [41]:
# Show the airports (a.id, b.id) and filter out self-loops (a is not b)
print("Top 10 Reciprocal Flight Routes:")
reciprocal_routes.select("a.id", "b.id")\
                 .filter("a.id != b.id")\
                 .distinct()\
                 .limit(10)\
                 .show()

Top 10 Reciprocal Flight Routes:
+---+---+
| id| id|
+---+---+
|ATL|ABE|
|BNA|ABE|
|CLT|ABE|
|FLL|ABE|
|DFW|ABI|
|AUS|ABQ|
|BUR|ABQ|
|BWI|ABQ|
|DAL|ABQ|
|DEN|ABQ|
+---+---+



## 4.2. Breadth-First Search (BFS) for Shortest Multi-Hop Path

We use BFS to find the shortest path, in terms of hops, between JFK (New York, NY) and SCC (Deadhorse, Alaska), two airports highly unlikely to have a direct flight.
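
To show what "shortest in terms of hops" means, here is a minimal breadth-first search over a toy directed route list. The function `toy_bfs_path`, its `max_path_length` parameter, and the sample routes are hypothetical and only loosely mirror the GraphFrames call used in this notebook.

```python
from collections import deque


def toy_bfs_path(flights, start, goal, max_path_length=3):
    """Return one fewest-hop route from start to goal, or None if none is short enough."""
    neighbors = {}
    for src, dst in flights:
        neighbors.setdefault(src, set()).add(dst)

    queue = deque([[start]])
    visited = {start}
    while queue:
        route = queue.popleft()
        if route[-1] == goal:
            return route
        if len(route) - 1 >= max_path_length:
            continue  # already used the allowed number of hops
        for nxt in sorted(neighbors.get(route[-1], ())):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(route + [nxt])
    return None


# Hypothetical routes: a 2-hop option via MSP and a 3-hop option via SEA and ANC
toy_bfs_flights = [
    ("JFK", "MSP"), ("MSP", "SCC"),
    ("JFK", "SEA"), ("SEA", "ANC"), ("ANC", "SCC"),
]
toy_route = toy_bfs_path(toy_bfs_flights, "JFK", "SCC")
print(toy_route)
```

Because BFS explores all 1-hop routes before any 2-hop route, and so on, the first route that reaches the goal is guaranteed to use the fewest hops; flight distance and delay play no role.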

First, we verify the number of 1-hop (direct) paths.

In [24]:
# Check : JFK to SCC
jfk_to_scc_count = g_small.edges.filter("src = 'JFK' AND dst = 'SCC'").count()
print(f"Number of direct flights from JFK to SCC: {jfk_to_scc_count}")

Number of direct flights from JFK to SCC: 0


In [26]:
# BFS from JFK to SCC, max 3 hops
bfs_results_jfk_scc = g_small.bfs(
    fromExpr="id = 'JFK'",
    toExpr="id = 'SCC'",
    maxPathLength=3
)

In [27]:
# Check if any paths were found.
# BFS returns only paths of the shortest length it finds (up to maxPathLength).
path_count = bfs_results_jfk_scc.count()
print(f"\nTotal number of paths found (1, 2, or 3 hops): {path_count}")


Total number of paths found (1, 2, or 3 hops): 5556960


In [28]:
# Now, filter and select the 2-hop paths (v1, the intermediate airport, must exist)
two_hop_paths = bfs_results_jfk_scc.filter(col("v1.id").isNotNull()).select(
    col("from.id").alias("Start_Airport"),
    col("e0.AIRLINE").alias("Flight_1_Airline"),
    col("v1.id").alias("Layover_Airport"),
    col("e1.AIRLINE").alias("Flight_2_Airline"),
    col("to.id").alias("Final_Destination"),
    col("e0.DISTANCE").alias("Dist_1"),
    col("e1.DISTANCE").alias("Dist_2")
)

In [29]:
print("\nShortest Multi-Hop Paths (JFK -> X -> SCC):")
two_hop_paths.withColumn("Total_Distance", col("Dist_1") + col("Dist_2")) \
             .orderBy(col("Total_Distance")) \
             .show(5)


Shortest Multi-Hop Paths (JFK -> X -> SCC):
+-------------+-----------------+---------------+--------------------+-----------------+------+------+--------------+
|Start_Airport| Flight_1_Airline|Layover_Airport|    Flight_2_Airline|Final_Destination|Dist_1|Dist_2|Total_Distance|
+-------------+-----------------+---------------+--------------------+-----------------+------+------+--------------+
|          JFK|Endeavor Air Inc.|            MSP|Delta Air Lines Inc.|              SCC|1029.0|2519.0|        3548.0|
|          JFK|Endeavor Air Inc.|            MSP|Delta Air Lines Inc.|              SCC|1029.0|2519.0|        3548.0|
|          JFK|Endeavor Air Inc.|            MSP|Delta Air Lines Inc.|              SCC|1029.0|2519.0|        3548.0|
|          JFK|Endeavor Air Inc.|            MSP|Delta Air Lines Inc.|              SCC|1029.0|2519.0|        3548.0|
|          JFK|Endeavor Air Inc.|            MSP|Delta Air Lines Inc.|              SCC|1029.0|2519.0|        3548.0|
+----------

### **Shortest Paths to JFK**

In [None]:
paths = g_small.shortestPaths(landmarks=["JFK"])
# Sort by hop distance to JFK ascending (closest airports first)
paths.select("id", "distances").orderBy(col("distances")["JFK"]).show(10, truncate=False)

+---+----------+
|id |distances |
+---+----------+
|JFK|{JFK -> 0}|
|DCA|{JFK -> 1}|
|SJU|{JFK -> 1}|
|ORF|{JFK -> 1}|
|MSY|{JFK -> 1}|
|SAV|{JFK -> 1}|
|BUR|{JFK -> 1}|
|CMH|{JFK -> 1}|
|SJC|{JFK -> 1}|
|AUS|{JFK -> 1}|
+---+----------+
only showing top 10 rows



In [None]:
# Sort descending (farthest airports first)
paths.select("id", "distances").orderBy(col("distances")["JFK"].desc()).show(10, truncate=False)

+---+----------+
|id |distances |
+---+----------+
|SCC|{JFK -> 3}|
|BET|{JFK -> 3}|
|WRG|{JFK -> 3}|
|PSG|{JFK -> 3}|
|IAG|{JFK -> 3}|
|TOL|{JFK -> 3}|
|BRW|{JFK -> 3}|
|HGR|{JFK -> 3}|
|CDV|{JFK -> 3}|
|OME|{JFK -> 3}|
+---+----------+
only showing top 10 rows



### **Route-level analysis**

Average delay per route

In [None]:
g_small.edges.groupBy("src", "dst").avg("ARR_DELAY").orderBy("avg(ARR_DELAY)", ascending=False).show(10)


+---+---+------------------+
|src|dst|    avg(ARR_DELAY)|
+---+---+------------------+
|DEN|ABE|            1080.0|
|SFB|GFK|             866.5|
|IDA|PDX|             746.5|
|PSC|SAN|             620.0|
|SMX|LAS|336.42857142857144|
|LAS|AZA|             313.0|
|HTS|PGD|             265.0|
|GEG|ORD|             234.2|
|CHS|LCK|             218.0|
|FCA|DFW|207.53846153846155|
+---+---+------------------+
only showing top 10 rows



Cancellations

In [None]:
g_small.edges.groupBy("src", "dst").sum("CANCELLED").orderBy("sum(CANCELLED)", ascending=False).show(10)


+---+---+--------------+
|src|dst|sum(CANCELLED)|
+---+---+--------------+
|BOS|LGA|          25.0|
|LGA|ORD|          24.0|
|EWR|ORD|          23.0|
|ORD|LGA|          21.0|
|DFW|LGA|          20.0|
|FLL|LGA|          18.0|
|JFK|BOS|          18.0|
|LGA|BOS|          18.0|
|DEN|LAS|          17.0|
|EWR|FLL|          16.0|
+---+---+--------------+
only showing top 10 rows



Average arrival delay for all flights originating at DFW

In [38]:
dfw_departures_avg_delay = (
    g_small.edges.filter("src = 'DFW'")
    .agg(avg("ARR_DELAY").alias("Avg_DFW_Arr_Delay"))
)

print("Average Arrival Delay for flights departing DFW (2023 data):")
dfw_departures_avg_delay.show()

Average Arrival Delay for flights departing DFW (2023 data):
+------------------+
| Avg_DFW_Arr_Delay|
+------------------+
|15.858564359069867|
+------------------+



Most frequent destination from LAX

In [39]:
most_frequent_dest = (
    g_small.edges.filter("src = 'LAX'")
    .groupBy("dst")
    .count()
    .orderBy("count", ascending=False)
)

print("Top 5 most frequent destinations from LAX:")
most_frequent_dest.show(5)

Top 5 most frequent destinations from LAX:
+---+-----+
|dst|count|
+---+-----+
|SFO|  754|
|LAS|  692|
|JFK|  676|
|SEA|  486|
|DEN|  483|
+---+-----+
only showing top 5 rows



### **Triangle Count**
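
A triangle is a set of three airports that are all pairwise connected, and the triangle count of an airport is the number of such triples it belongs to. The sketch below counts triangles on a toy graph, treating links as undirected and ignoring duplicates and self-loops; `toy_triangle_count` and the sample links are hypothetical and are only meant to illustrate the quantity being reported.

```python
from itertools import combinations


def toy_triangle_count(links):
    """Count, for each node, the triangles it takes part in (undirected, no duplicates)."""
    neighbors = {}
    for a, b in links:
        if a != b:  # ignore self-loops
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)

    counts = {}
    for node, nbrs in neighbors.items():
        # A triangle exists for every pair of neighbors that are themselves connected
        counts[node] = sum(
            1 for u, v in combinations(sorted(nbrs), 2) if v in neighbors[u]
        )
    return counts


# Hypothetical links: triangles DFW-ATL-DEN and DFW-ATL-ORD share the edge DFW-ATL
toy_tri_links = [
    ("DFW", "ATL"), ("ATL", "DEN"), ("DEN", "DFW"),
    ("DFW", "ORD"), ("ORD", "ATL"),
]
toy_tri_counts = toy_triangle_count(toy_tri_links)
print(toy_tri_counts)
```

Airports with high triangle counts sit in tightly interlinked neighborhoods, which is why the large hubs dominate this ranking.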

In [None]:
from pyspark import StorageLevel

In [None]:
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoint")

In [None]:
triangles = g_small.triangleCount(storage_level=StorageLevel.MEMORY_AND_DISK)

# Show the number of triangles each airport participates in
triangles.select("id", "count").orderBy("count", ascending=False).show(10)

+---+-----+
| id|count|
+---+-----+
|DFW| 2199|
|ATL| 2161|
|DEN| 2146|
|ORD| 2069|
|CLT| 1897|
|LAS| 1768|
|MSP| 1765|
|PHX| 1560|
|IAH| 1551|
|LAX| 1535|
+---+-----+
only showing top 10 rows


