## Learning Objectives

Compare CSV loading performance between Pandas and Polars.

Measure and contrast filtering and aggregation speed on flight delay data.

Evaluate join/merge efficiency across large datasets (flights and airports).

Practice benchmarking common operations (mean, filtering, joins) with timing utilities.

Develop awareness of the trade-offs between Pandas (compatibility) and Polars (speed) in real-world data workflows.

## Question 1

In [2]:
import polars as pl
import time
start_time = time.time()
myDF = pl.read_csv("/anvil/projects/tdm/data/flights/subset/2002.csv", null_values="NA")
print("My program took", time.time() - start_time, "to run")

My program took 0.6889636516571045 to run


In [4]:
import pandas as pd
import time
start_time = time.time()
myDF = pd.read_csv("/anvil/projects/tdm/data/flights/subset/2002.csv")
print("My program took", time.time() - start_time, "to run")

My program took 5.482293367385864 to run


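I ran both cells twice. In the run shown above, Pandas took about 4.79 seconds longer than Polars (5.48 seconds versus 0.69 seconds), so Polars was roughly 8 times faster at loading this file.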
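A single `time.time()` measurement can be noisy, for example when the first read of a file is slower than later ones. The following is a minimal sketch of a reusable timing helper, assuming that the best of a few repeats is an acceptable summary. The name `best_of` is hypothetical (it is not part of Pandas or Polars), and the workload is a stand-in so the sketch runs without the flight data.

```python
import time

def best_of(func, repeats=3):
    """Run func() `repeats` times and return (best_seconds, last_result).

    `best_of` is a hypothetical helper name, not a Pandas or Polars function.
    """
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()  # clock meant for measuring short intervals
        result = func()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best, result

# Stand-in workload so the sketch runs anywhere. In the notebook this
# could be something like: lambda: pl.read_csv(path, null_values="NA")
seconds, total = best_of(lambda: sum(range(100_000)))
print("Best of 3 runs took", seconds, "seconds")
```

Passing the operation as a function keeps the timing logic in one place, so the same helper could wrap the Pandas and the Polars version of each question.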

## Question 2

In [8]:
import polars as pl
# read in the 2002 flight data to a Polars data frame again
myDF = pl.read_csv("/anvil/projects/tdm/data/flights/subset/2002.csv", null_values="NA")

import time
start_time = time.time()
myresults = myDF.filter(pl.col("Origin") == "IND")['DepDelay'].mean()
print("My program took", time.time() - start_time, "to run")
print("The average delay for Indianapolis flights was " + str(myresults))

My program took 0.02305126190185547 to run
The average delay for Indianapolis flights was 3.9418510691198407


In [9]:
import pandas as pd
# read in the 2002 flight data to a Pandas data frame again
myDF = pd.read_csv("/anvil/projects/tdm/data/flights/subset/2002.csv")

import time
start_time = time.time()
myresults = myDF[myDF['Origin'] == "IND"]['DepDelay'].mean()
print("My program took", time.time() - start_time, "to run")
print("The average delay for Indianapolis flights was " + str(myresults))

My program took 0.30306196212768555 to run
The average delay for Indianapolis flights was 3.9418510691198407


Dividing the Pandas time by the Polars time gives 0.30306196212768555/0.02305126190185547 ≈ 13.15, so Polars is about 13 times faster than Pandas for this filter and mean.

## Question 3

In [11]:
import polars as pl
# read in the 2002 flight data to a Polars data frame again, but only 3 columns this time:
myDF = pl.read_csv("/anvil/projects/tdm/data/flights/subset/2002.csv", null_values="NA").select(["Origin", "Dest", "DepDelay"])

# also read in the airports data to a Polars data frame as follows:
myairports = pl.read_csv("/anvil/projects/tdm/data/flights/subset/airports.csv")

import time
start_time = time.time()
newDF = myDF.join(myairports, left_on="Origin", right_on="iata")
myresults = newDF.filter(pl.col("state") == "CA")['DepDelay'].mean()
print("My program took", time.time() - start_time, "to run")
print("The average delay for all flights with Origin at any California airports was " + str(myresults))

My program took 0.3490889072418213 to run
The average delay for all flights with Origin at any California airports was 5.245392521079324


In [12]:
import pandas as pd
# read in the 2002 flight data to a Pandas data frame again, but only 3 columns this time:
myDF = pd.read_csv("/anvil/projects/tdm/data/flights/subset/2002.csv", usecols=["Origin", "Dest", "DepDelay"])

# also read in the airports data to a Pandas data frame as follows:
myairports = pd.read_csv("/anvil/projects/tdm/data/flights/subset/airports.csv")

import time
start_time = time.time()
newDF = myDF.merge(myairports, left_on='Origin', right_on='iata')
myresults = newDF[newDF['state'] == "CA"]['DepDelay'].mean()
print("My program took", time.time() - start_time, "to run")
print("The average delay for all flights with Origin at any California airports was " + str(myresults))

My program took 1.1394693851470947 to run
The average delay for all flights with Origin at any California airports was 5.245392521079324


Dividing the Pandas time by the Polars time gives 1.1394693851470947/0.3490889072418213 ≈ 3.26, so Polars is around 3.3 times faster than Pandas for merging the data.
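To make clear what both libraries are being timed on in this question, here is a small pure-Python sketch of the same three steps: an inner join on `Origin` = `iata`, a filter on `state`, and a mean that skips missing delays. The rows are toy data invented for illustration (they are not taken from the real files), and the dictionary lookup is just one simple way to do an equality join, not a description of how either library works internally.

```python
# Toy data, invented for illustration only.
flights = [
    {"Origin": "LAX", "Dest": "IND", "DepDelay": 10.0},
    {"Origin": "SFO", "Dest": "ORD", "DepDelay": 4.0},
    {"Origin": "IND", "Dest": "LAX", "DepDelay": -2.0},
    {"Origin": "LAX", "Dest": "SFO", "DepDelay": None},  # missing delay
]
airports = [
    {"iata": "LAX", "state": "CA"},
    {"iata": "SFO", "state": "CA"},
    {"iata": "IND", "state": "IN"},
]

# Build a lookup keyed on the join column.
state_by_iata = {a["iata"]: a["state"] for a in airports}

# Inner join: keep only flights whose Origin has a matching airport row.
joined = [
    {**f, "state": state_by_iata[f["Origin"]]}
    for f in flights
    if f["Origin"] in state_by_iata
]

# Filter to California, then average the non-missing delays.
ca_delays = [
    r["DepDelay"]
    for r in joined
    if r["state"] == "CA" and r["DepDelay"] is not None
]
ca_mean = sum(ca_delays) / len(ca_delays)
print("Mean CA departure delay in the toy data:", ca_mean)
```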

## Question 4

In [14]:
myairports.head()

| | iata | airport | city | state | country | lat | long |
|---|---|---|---|---|---|---|---|
| 0 | 00M | Thigpen | Bay Springs | MS | USA | 31.953765 | -89.234505 |
| 1 | 00R | Livingston Municipal | Livingston | TX | USA | 30.685861 | -95.017928 |
| 2 | 00V | Meadow Lake | Colorado Springs | CO | USA | 38.945749 | -104.569893 |
| 3 | 01G | Perry-Warsaw | Perry | NY | USA | 42.741347 | -78.052081 |
| 4 | 01J | Hilliard Airpark | Hilliard | FL | USA | 30.688012 | -81.905944 |


In [16]:
myDF.head()

| | DepDelay | Origin | Dest |
|---|---|---|---|
| 0 | -4.0 | PIT | CLT |
| 1 | -5.0 | PIT | CLT |
| 2 | -5.0 | PIT | CLT |
| 3 | -5.0 | PIT | CLT |
| 4 | -8.0 | PIT | CLT |


In [37]:
import polars as pl
import time

# Load the flight data
df = pl.read_csv("/anvil/projects/tdm/data/flights/subset/2002.csv", null_values="NA")
# Start timing the value count operation
start_time = time.time()
# Count how many times each Origin appears
# (group_by/len replace the deprecated groupby/count spelling in recent Polars)
origin_counts = df.group_by("Origin").len()
print("Polars program took", time.time() - start_time, "to run")

Polars program took 0.0665743350982666 to run



In [38]:
import pandas as pd
import time

# Load the flight data
df = pd.read_csv("/anvil/projects/tdm/data/flights/subset/2002.csv")
# Start timing the value count operation
start_time = time.time()
# Count how many times each Origin appears again
origin_counts = df['Origin'].value_counts()
print("Pandas program took", time.time() - start_time, "to run")

Pandas program took 0.25765061378479004 to run


Dividing the Pandas time by the Polars time gives 0.25765061378479004/0.0665743350982666 ≈ 3.87, so Polars is about 3.9 times faster than Pandas for counting flights by origin.

## Question 5

In [29]:
import polars as pl
import time
# Load the dataset using Polars
df = pl.read_csv("/anvil/projects/tdm/data/flights/subset/2002.csv", null_values="NA")
# Start the timer
start_time = time.time()
# Filter rows with DepDelay greater than 60 and sort by DepDelay
filtered = df.filter(pl.col("DepDelay") > 60).sort("DepDelay")
print("Polars program took", time.time() - start_time, "to run")

Polars program took 0.0664510726928711 to run


In [30]:
import pandas as pd
import time
# Load the dataset using Pandas
df = pd.read_csv("/anvil/projects/tdm/data/flights/subset/2002.csv")
# Start the timer
start_time = time.time()
# Filter rows with DepDelay greater than 60 and sort by DepDelay
filtered = df[df["DepDelay"] > 60].sort_values("DepDelay")
print("Pandas program took", time.time() - start_time, "to run")

Pandas program took 0.10351085662841797 to run


Dividing the Pandas time by the Polars time gives 0.10351085662841797/0.0664510726928711 ≈ 1.56, so Polars is about 1.56 times faster than Pandas for filtering and sorting.
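The speedup arithmetic used in each question can be collected in one place. The sketch below copies the timings recorded in the outputs above and always divides the Pandas time by the Polars time, which avoids accidentally inverting the ratio. The labels and the `speedup` helper name are hypothetical and chosen only for this summary.

```python
# Timings in seconds as (pandas, polars), copied from the recorded outputs.
timings = {
    "Q1 read_csv": (5.482293367385864, 0.6889636516571045),
    "Q2 filter + mean": (0.30306196212768555, 0.02305126190185547),
    "Q3 join + filter + mean": (1.1394693851470947, 0.3490889072418213),
    "Q4 count by Origin": (0.25765061378479004, 0.0665743350982666),
    "Q5 filter + sort": (0.10351085662841797, 0.0664510726928711),
}

def speedup(pandas_seconds, polars_seconds):
    """How many times faster Polars was: Pandas time divided by Polars time."""
    return pandas_seconds / polars_seconds

speedups = {name: speedup(pd_t, pl_t) for name, (pd_t, pl_t) in timings.items()}
for name, ratio in speedups.items():
    print(f"{name}: Polars was about {ratio:.2f}x faster")
```

These ratios come from single runs, so they describe this session on this machine and should be read as rough indications only.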