## Learning Objectives

Recreate taxi trip parsing functions (extractdate) using the Polars string and datetime APIs.

Filter large NYC taxi datasets efficiently by year, month, and day with Polars.

Handle large election contribution files (1980–2024) with Polars’ read_csv and grouping operations.

Compare group-by and aggregation performance between Polars and Pandas.

Practice porting existing Pandas workflows to Polars, reinforcing cross-library fluency.

Re-solve past data problems (flights, donations, groceries, etc.) with Polars for efficiency and cleaner syntax.

## Question 1

In [18]:
import polars as pl

# Define the function
def extractdate(myyear, mymonth, mydate):
    # Build filename
    filename = f"/anvil/projects/tdm/data/taxi/yellow/yellow_tripdata_{myyear}-{mymonth}.csv"
    
    # Read the 3 specific columns
    df = pl.read_csv(filename, columns=["Trip_Pickup_DateTime", "Start_Lon", "Start_Lat"])
    
    return df

# Calling the function to show it works
df = extractdate("2009", "05", "29")
print(df)

shape: (14_796_313, 3)
┌──────────────────────┬────────────┬───────────┐
│ Trip_Pickup_DateTime ┆ Start_Lon  ┆ Start_Lat │
│ ---                  ┆ ---        ┆ ---       │
│ str                  ┆ f64        ┆ f64       │
╞══════════════════════╪════════════╪═══════════╡
│ 2009-05-27 07:41:05  ┆ -73.974105 ┆ 40.742892 │
│ 2009-05-27 07:51:06  ┆ -74.008148 ┆ 40.738854 │
│ 2009-05-15 15:22:02  ┆ -73.973343 ┆ 40.764047 │
│ 2009-05-26 22:06:37  ┆ -74.005256 ┆ 40.71973  │
│ …                    ┆ …          ┆ …         │
│ 2009-05-27 09:20:40  ┆ -73.978495 ┆ 40.762614 │
│ 2009-05-27 17:40:30  ┆ -73.953723 ┆ 40.786917 │
│ 2009-05-27 13:10:28  ┆ -73.988455 ┆ 40.737215 │
│ 2009-05-12 14:55:26  ┆ -73.982013 ┆ 40.749249 │
└──────────────────────┴────────────┴───────────┘


## Question 2

In [19]:
import polars as pl

def extractdate(myyear, mymonth, mydate):
    filename = f"/anvil/projects/tdm/data/taxi/yellow/yellow_tripdata_{myyear}-{mymonth}.csv"
    
    df = pl.read_csv(filename, columns=["Trip_Pickup_DateTime", "Start_Lon", "Start_Lat"])
    
    # Parse the pickup timestamp once, then keep rows matching the requested
    # year, month, and day (approach adapted from the Data Mine website)
    pickup = df["Trip_Pickup_DateTime"].str.to_datetime("%Y-%m-%d %H:%M:%S")
    df_filtered = df.filter(
        pickup.dt.year().eq(int(myyear)) &
        pickup.dt.month().eq(int(mymonth)) &
        pickup.dt.day().eq(int(mydate))
    )
    
    return df_filtered

df = extractdate("2009", "05", "29")
print(df)

shape: (523_947, 3)
┌──────────────────────┬────────────┬───────────┐
│ Trip_Pickup_DateTime ┆ Start_Lon  ┆ Start_Lat │
│ ---                  ┆ ---        ┆ ---       │
│ str                  ┆ f64        ┆ f64       │
╞══════════════════════╪════════════╪═══════════╡
│ 2009-05-29 11:36:28  ┆ -73.991699 ┆ 40.738751 │
│ 2009-05-29 22:03:20  ┆ -73.968066 ┆ 40.759642 │
│ 2009-05-29 02:25:00  ┆ -74.007013 ┆ 40.711895 │
│ 2009-05-29 00:49:00  ┆ -73.976188 ┆ 40.765588 │
│ …                    ┆ …          ┆ …         │
│ 2009-05-29 15:15:52  ┆ -73.959823 ┆ 40.762451 │
│ 2009-05-29 06:53:22  ┆ -73.962236 ┆ 40.779161 │
│ 2009-05-29 05:58:24  ┆ -74.001003 ┆ 40.741484 │
│ 2009-05-29 23:51:28  ┆ -73.965153 ┆ 40.791205 │
└──────────────────────┴────────────┴───────────┘



## Question 3

In [17]:
import polars as pl
myDF = pl.read_csv("/anvil/projects/tdm/data/election/itcont2020.txt", has_header=False, separator='|', columns=[9,14], ignore_errors=True)
myDF = myDF.rename({"column_10": "STATE", "column_15": "TRANSACTION_AMT"})
myDF.group_by('STATE').agg(pl.sum('TRANSACTION_AMT')).sort('TRANSACTION_AMT').tail(10)

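shape: (10, 2)
┌───────┬─────────────────┐
│ STATE ┆ TRANSACTION_AMT │
│ ---   ┆ ---             │
│ str   ┆ i64             │
╞═══════╪═════════════════╡
│ PA    ┆ 357275145       │
│ WA    ┆ 399230003       │
│ MA    ┆ 506660138       │
│ IL    ┆ 594629996       │
│ FL    ┆ 883926606       │
│ TX    ┆ 968468772       │
│ DC    ┆ 1038205619      │
│ VA    ┆ 1099663849      │
│ NY    ┆ 2481011887      │
│ CA    ┆ 2723597299      │
└───────┴─────────────────┘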



## Question 4

In [27]:
# Question 3 of Project 11
import polars as pl
import geopandas as gpd
import pandas as pd  # still needed to bridge Polars and GeoPandas

def extractdate(myyear, mymonth, mydate):
    # Build filename
    filename = f"/anvil/projects/tdm/data/taxi/yellow/yellow_tripdata_{myyear}-{mymonth}.csv"

    # Read 3 needed columns
    df = pl.read_csv(
        filename,
        columns=["Trip_Pickup_DateTime", "Start_Lon", "Start_Lat"]
    )

    # Filter for rides on the specified date; each comparison needs its own
    # parentheses because & binds more tightly than ==
    pickup = df["Trip_Pickup_DateTime"].str.to_datetime("%Y-%m-%d %H:%M:%S")
    df = df.filter(
        (pickup.dt.year() == int(myyear))
        & (pickup.dt.month() == int(mymonth))
        & (pickup.dt.day() == int(mydate))
    )

    # Assumed completion (the original attempt stopped before defining gdf):
    # convert to pandas (to_pandas() may require pyarrow) and build point
    # geometry, assuming the coordinates are WGS84 longitude/latitude
    pdf = df.to_pandas()
    gdf = gpd.GeoDataFrame(
        pdf,
        geometry=gpd.points_from_xy(pdf["Start_Lon"], pdf["Start_Lat"]),
        crs="EPSG:4326",
    )
    return gdf

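
When comparisons are combined with `&`, operator precedence matters: Python binds `&` more tightly than `==`. The plain-Python sketch below uses ordinary integers rather than Polars columns to show how an unparenthesized condition is regrouped. The variable names are illustrative only.

```python
year, month = 2009, 5

# Without parentheses this is parsed as the chained comparison
#   year == (2009 & month) == 5
# because & is evaluated before ==
unparenthesized = year == 2009 & month == 5

# With parentheses, each comparison is evaluated first and the
# two boolean results are then combined
parenthesized = (year == 2009) & (month == 5)

print(unparenthesized)  # False
print(parenthesized)    # True
```

With Polars Series the same regrouping would be expected to raise an error rather than silently return a wrong answer, since a chained comparison asks Python for the truth value of a whole Series. Either way, wrapping each comparison in parentheses avoids the problem.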

## Question 5

In [25]:
# From Question 4 of Project 11, which is part 2 of Question 4 above (Question 3 of Project 11)
import polars as pl

def extractdate(myyear, mymonth, mydate):
    # Build filename
    filename = f"/anvil/projects/tdm/data/taxi/yellow/yellow_tripdata_{myyear}-{mymonth}.csv"

    # Read columns
    df = pl.read_csv(
        filename,
        columns=["Trip_Pickup_DateTime", "Start_Lon", "Start_Lat"]
    )

    # Filter by date; each comparison needs its own parentheses
    # because & binds more tightly than ==
    pickup = df["Trip_Pickup_DateTime"].str.to_datetime("%Y-%m-%d %H:%M:%S")
    df = df.filter(
        (pickup.dt.year() == int(myyear))
        & (pickup.dt.month() == int(mymonth))
        & (pickup.dt.day() == int(mydate))
    )

    # NYC bounds filter
    df = df.filter(
        (df["Start_Lon"] >= -74.27) & (df["Start_Lon"] <= -73.68)
        & (df["Start_Lat"] >= 40.49) & (df["Start_Lat"] <= 40.92)
    )


    return df
