
# PySpark Airline Flight Analysis

This notebook demonstrates how to work with **Airlines, Airports, and Flights datasets** using **PySpark RDDs**.  
We will load data from CSV files, filter out header rows, parse the fields into structured records, and compute basic statistics.

---

### **Objectives**
- Load CSV data into RDDs using Spark.
- Remove header rows from datasets.
- Parse date, time, and numeric fields.
- Compute flight-level statistics using RDD transformations and actions.



## 1. Initialize Spark Session and Load Data

We start by creating a `SparkSession`, defining the paths to the three CSV files, and reading the airlines file into an RDD of text lines.


In [None]:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

airlinesDataPath = 'airlines.csv'
airportsDataPath = 'airports.csv'
flightsDataPath = 'flights.csv'

airlinesRdd = spark.sparkContext.textFile(airlinesDataPath)

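# Preview the first 10 lines
for line in airlinesRdd.take(10):
    print(line)

# A notebook cell only displays its last expression, so print both results
print(airlinesRdd.first())   # first line of the file
print(airlinesRdd.count())   # total number of lines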



## 2. Remove Header Rows

CSV files often contain a header line that should not be part of computations.  
We can filter it out using either a **lambda expression** or a **custom function**.


In [None]:

# Remove header using a lambda expression
airlinesWOHeaderRdd = airlinesRdd.filter(lambda x: 'Description' not in x)
print(airlinesWOHeaderRdd.count())   # number of lines after filtering
print(airlinesWOHeaderRdd.first())   # first remaining line


In [None]:

# Define a function instead of a lambda expression
def notHeader(row):
    return 'Description' not in row

airlinesWOHeaderRdd1 = airlinesRdd.filter(notHeader)
airlinesWOHeaderRdd1.count()
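
A more targeted alternative, sketched here on the assumption that the header is always the first line of the file, is to compare each line against the header itself instead of searching for a substring. The helper name `drop_header` and the sample lines are made up for illustration and run on a plain list, so no cluster is needed.

```python
def drop_header(lines, header):
    """Keep every line that is not identical to the header line."""
    return [line for line in lines if line != header]

# Hypothetical sample lines standing in for the contents of airlines.csv
sample_lines = [
    'Code,Description',
    '"19031","Example Air: EXA"',
    '"19032","Description Airways: DSC"',
]

header = sample_lines[0]
data_lines = drop_header(sample_lines, header)

# The substring test would also drop the third line, because its
# (made-up) airline name happens to contain the word "Description".
substring_filtered = [l for l in sample_lines if 'Description' not in l]

print(len(data_lines))          # 2
print(len(substring_filtered))  # 1
```

On an RDD the same idea would read `header = airlinesRdd.first()` followed by `airlinesRdd.filter(lambda line: line != header)`.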



## 3. Parsing Flight Data

We’ll now parse flight records to extract useful fields such as dates, times, and distances.  
To represent structured records, we use Python’s `namedtuple` along with datetime parsing.


In [None]:

flightsRdd = spark.sparkContext.textFile(flightsDataPath)

print(flightsRdd.first())  # inspect one raw line before parsing

from datetime import datetime
from collections import namedtuple

fields = ('date', 'airline', 'flightnum', 'origin', 'dest', 
          'dep', 'dep_delay', 'arv', 'arv_delay', 'airtime', 'distance')

# The `verbose` argument was removed from namedtuple in Python 3.7, so it is omitted here.
Flight = namedtuple('Flight', fields)

DATE_FMT = '%Y-%m-%d'
TIME_FMT = '%H%M'

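def parse(row):
    """Convert a list of string fields into a typed Flight record."""
    row[0] = datetime.strptime(row[0], DATE_FMT).date()   # flight date
    row[5] = datetime.strptime(row[5], TIME_FMT).time()   # departure time
    row[6] = float(row[6])                                # departure delay
    row[7] = datetime.strptime(row[7], TIME_FMT).time()   # arrival time
    row[8] = float(row[8])                                # arrival delay
    row[9] = float(row[9])                                # airtime
    row[10] = float(row[10])                              # distance
    return Flight(*row[:11])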

flightsParsedRdd = flightsRdd.map(lambda x: x.split(",")).map(parse)
flightsParsedRdd.first()
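
To see what the parser produces, and how malformed lines might be handled, here is a minimal standalone sketch. The sample record is made up and only assumes the column order given in `fields`; `parse_line` and `try_parse` are hypothetical helper names, not part of the notebook.

```python
from collections import namedtuple
from datetime import datetime

fields = ('date', 'airline', 'flightnum', 'origin', 'dest',
          'dep', 'dep_delay', 'arv', 'arv_delay', 'airtime', 'distance')
Flight = namedtuple('Flight', fields)

DATE_FMT = '%Y-%m-%d'
TIME_FMT = '%H%M'

def parse_line(line):
    """Split one CSV line and convert its fields to typed values."""
    row = line.split(',')
    row[0] = datetime.strptime(row[0], DATE_FMT).date()
    for i in (5, 7):                 # departure and arrival times
        row[i] = datetime.strptime(row[i], TIME_FMT).time()
    for i in (6, 8, 9, 10):          # delays, airtime, distance
        row[i] = float(row[i])
    return Flight(*row[:11])

def try_parse(line):
    """Return a Flight, or None if the line cannot be parsed."""
    try:
        return parse_line(line)
    except (ValueError, IndexError):
        return None

# Made-up record in the column order assumed by `fields`
good = '2014-04-01,19805,1,JFK,LAX,0854,-6.00,1217,2.00,355.00,2475.00'
bad = 'date,airline,flightnum'       # e.g. a header or truncated line

flight = try_parse(good)
print(flight.date, flight.dep, flight.distance)  # 2014-04-01 08:54:00 2475.0
print(try_parse(bad))                            # None
```

On the RDD, a tolerant pipeline along these lines could be written as `flightsRdd.map(try_parse).filter(lambda f: f is not None)`, which skips unparseable lines instead of failing the job.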



## 4. Flight-Level Analysis

We can compute aggregates such as the **average distance** and the **share of delayed flights** by combining the `map` and `filter` transformations with the `reduce` and `count` actions.


In [None]:

# Average distance travelled by a flight
totalDistance = flightsParsedRdd.map(lambda x: x.distance).reduce(lambda x, y: x + y)
avgDistance = totalDistance / flightsParsedRdd.count()
avgDistance
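
The computation above runs two jobs over the data, one for `reduce` and one for `count`. A single-pass variant carries a `(sum, count)` pair through `RDD.aggregate(zeroValue, seqOp, combOp)`. The sketch below defines the two combining functions and mimics the per-partition behaviour on plain lists so that it runs without a cluster; the names `seq_op` and `comb_op` and the sample distances are illustrative.

```python
from functools import reduce

def seq_op(acc, distance):
    """Fold one distance into a (sum, count) accumulator."""
    return (acc[0] + distance, acc[1] + 1)

def comb_op(a, b):
    """Merge the accumulators of two partitions."""
    return (a[0] + b[0], a[1] + b[1])

zero = (0.0, 0)

# Made-up distances, split into two lists to mimic two partitions
partitions = [[100.0, 300.0], [200.0, 400.0, 500.0]]

partials = [reduce(seq_op, part, zero) for part in partitions]
total, n = reduce(comb_op, partials, zero)
avg_distance = total / n

print(avg_distance)  # 300.0
```

On the parsed RDD this would read `flightsParsedRdd.map(lambda f: f.distance).aggregate(zero, seq_op, comb_op)`. Calling `mean()` on the mapped distances is a shorter built-in alternative.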


In [None]:

# Percentage of flights with a departure delay (dep_delay > 0)
100.0 * flightsParsedRdd.filter(lambda x: x.dep_delay > 0).count() / flightsParsedRdd.count()
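
The two `count()` calls above each trigger a separate pass over the data. As a rough sketch of a single-pass variant, each flight can be mapped to a `(delayed, total)` pair and the pairs summed with one `reduce`. The helper names and the sample delays below are made up, and the example uses Python's built-in `map` and `functools.reduce` in place of the RDD methods so that it runs standalone.

```python
from functools import reduce

def to_counts(dep_delay):
    """Map one departure delay to a (delayed, total) pair."""
    return (1 if dep_delay > 0 else 0, 1)

def add_counts(a, b):
    """Add two (delayed, total) pairs element-wise."""
    return (a[0] + b[0], a[1] + b[1])

# Made-up departure delays in minutes; values <= 0 are treated as not delayed
dep_delays = [-6.0, 0.0, 12.0, 45.0, -2.0]

delayed, total = reduce(add_counts, map(to_counts, dep_delays))
pct_delayed = 100.0 * delayed / total

print(pct_delayed)  # 40.0
```

With the parsed RDD, the corresponding call would be `flightsParsedRdd.map(lambda f: to_counts(f.dep_delay)).reduce(add_counts)`.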



## 5. Summary

In this notebook, we learned how to:
- Load and preprocess CSV data using PySpark RDDs.  
- Remove header rows with `filter`, using either a lambda expression or a named function.  
- Parse structured records using `namedtuple` and `datetime`.  
- Perform analytical computations such as averages and ratios using RDD transformations and actions.

The same transformations and actions apply unchanged to **large-scale flight data**, because Spark distributes RDD operations across the partitions of the dataset.
