# OPAN5505 Lab 3 Assignment (Polars Version)

## Background

You've been engaged by the Federal Aviation Administration to perform an analysis on "strike" data. A "strike" is a collision between an aircraft and an airborne object, typically wildlife.

## Prepare your environment

You will need the `polars` package for this assignment. We'll also use `datetime` for date handling.

In [None]:
import polars as pl
from datetime import datetime

# Configure Polars to display more rows
pl.Config.set_tbl_rows(20)

## Load your dataset

Read the `faa_strikes.txt` dataset into Polars. Name the resulting DataFrame `strikes`.

In [None]:
# Read the faa_strikes.txt data
strikes = pl.read_csv("faa_strikes.txt", separator="\t")
print(f"Shape: {strikes.shape}")
print("\nColumns:")
print(strikes.columns)
strikes.head()

## Question 1: Running total of strikes by day through 2013 (2 points)

You are interested in seeing the running total of strikes over time (`Collision Date and Time`) on a daily basis. Using the `strikes` dataframe, create a new column that removes the time information but keeps the date information from the `Collision Date and Time` column (call this new field `date`). Next, aggregate the `Number of Strikes` by day (call this new field `daily_strikes`). Sort the data in ascending order by `date`. Then, create the running total of `daily_strikes` (name the new field `strikes_cumulative`) and filter to only the records **up to and including** the `date` of 2013-12-31. Name the resulting data frame `running_total_strikes`.

In [None]:
# Convert to datetime and extract date
running_total_strikes = (
    strikes
    .with_columns(
        pl.col("Collision Date and Time").str.strptime(pl.Datetime, format="%m/%d/%Y %H:%M:%S").dt.date().alias("date")
    )
    .group_by("date")
    .agg(
        pl.col("Number of Strikes").sum().alias("daily_strikes")
    )
    .sort("date")
    .with_columns(
        pl.col("daily_strikes").cum_sum().alias("strikes_cumulative")
    )
    .filter(pl.col("date") <= datetime(2013, 12, 31).date())
)

print(f"Shape: {running_total_strikes.shape}")
running_total_strikes.tail(10)

## Question 2: States with the third highest financial cost (2 points)

The FAA is interested in the financial cost of strikes for each `Origin State`. Using the `strikes` dataframe, first sum `Cost: Total $` by `Origin State` (name the resulting column `damage`), then `rank` the states from most to least financial `damage` (name the new column with rank information `ranking`). Next, filter to show the row ranked third. Use a window function to answer this question. Name the resulting dataframe `damage_state`.

In [None]:
# Sum damage by state and rank
damage_state = (
    strikes
    .group_by("Origin State")
    .agg(
        pl.col("Cost: Total $").sum().alias("damage")
    )
    .with_columns(
        pl.col("damage").rank(method="min", descending=True).alias("ranking")
    )
    .filter(pl.col("ranking") == 3)
)

damage_state

## Question 3: What are the second costliest Species Groups for each Aircraft Type? (2 points)

The FAA wants to know if some species groups are more dangerous to particular types of aircraft. Using the `strikes` dataframe: first, sum financial damage (`Cost: Total $`) information by `Aircraft: Type` and `Wildlife: Species Group` (name the new field `damage`). Rank the rows within each `Aircraft: Type` based on the greatest amount of `damage`; when performing the ranking function, name the new column `ranking`. Return the rows that represent the `Wildlife: Species Group` values that caused the **second most** financial `damage` to each `Aircraft: Type`. Name the resulting dataframe `type_species`.

In [None]:
# Sum damage by aircraft type and species group, then rank within each aircraft type
type_species = (
    strikes
    .group_by(["Aircraft: Type", "Wildlife: Species Group"])
    .agg(
        pl.col("Cost: Total $").sum().alias("damage")
    )
    .with_columns(
        pl.col("damage")
        .rank(method="min", descending=True)
        .over("Aircraft: Type")
        .alias("ranking")
    )
    .filter(pl.col("ranking") == 2)
)

print(f"Shape: {type_species.shape}")
type_species

## Question 4: Which days had the greatest positive jump in strikes? (2 points)

The FAA wants to investigate which days had the largest increase in strikes from the previous day. Using the `strikes` dataframe: sum the `Number of Strikes` measure by day (note: not by day and time). You can use the same technique as in Question 1 to create a new column from `Collision Date and Time` that keeps the date and drops the time (name this column `date`). Next, compute the previous day's strikes using a window function (name the new column `previous_day`), and calculate the difference in strikes between the current day and the previous day (name the new column `delta_strikes`). Sort the resulting data by `delta_strikes` in descending order to find which days had the highest increase in strikes from the previous day. Name the resulting dataframe `greatest_strike_increase`.

**Hint:** This code is more involved than the other exercises, so I will explicitly name the steps to answer the question.

1. Start with the `strikes` data frame
2. Create a new column called `date` which removes the time information in the `Collision Date and Time` column
3. Sum the `Number of Strikes` by `date`, and name the new column `daily_strikes`
4. Sort by the `date` column in ascending order
5. Create a new column called `previous_day` to calculate what the `daily_strikes` were in the previous day
6. Create a new column called `delta_strikes` which subtracts `previous_day` from `daily_strikes`
7. Sort in descending order by `delta_strikes`

In [None]:
# Calculate daily strikes and find days with greatest increase
greatest_strike_increase = (
    strikes
    # Step 2: Create date column
    .with_columns(
        pl.col("Collision Date and Time").str.strptime(pl.Datetime, format="%m/%d/%Y %H:%M:%S").dt.date().alias("date")
    )
    # Step 3: Sum strikes by date
    .group_by("date")
    .agg(
        pl.col("Number of Strikes").sum().alias("daily_strikes")
    )
    # Step 4: Sort by date
    .sort("date")
    # Step 5: Previous row's daily_strikes (rows are sorted by date)
    .with_columns(
        pl.col("daily_strikes").shift(1).alias("previous_day")
    )
    # Step 6: Difference between the current and previous day
    .with_columns(
        (pl.col("daily_strikes") - pl.col("previous_day")).alias("delta_strikes")
    )
    # Step 7: Sort by delta_strikes descending
    .sort("delta_strikes", descending=True)
)

print(f"Shape: {greatest_strike_increase.shape}")
greatest_strike_increase.head(10)

## Question 5: Which single day had the greatest increase in strikes for each `Aircraft: Type`? (2 points)

The FAA was interested in the exercise from the last question but now wants to determine the largest delta between days for each `Aircraft: Type`. Which days had the largest positive change in strikes for each `Aircraft: Type`? To answer this question, repeat the exercise from Question 4, but this time include a grouping by `Aircraft: Type`. After grouping by `Aircraft: Type`, add a column called `ranking` and use a window function to determine the day with the largest increase in strikes from the previous day. Name the resulting data frame `greatest_strike_increase_type`.

NOTE: Helicopters are not struck with a high frequency and will not show up in your analysis; this is fine.

In [None]:
# Calculate daily strikes by aircraft type and find day with greatest increase for each type
greatest_strike_increase_type = (
    strikes
    # Create date column
    .with_columns(
        pl.col("Collision Date and Time").str.strptime(pl.Datetime, format="%m/%d/%Y %H:%M:%S").dt.date().alias("date")
    )
    # Sum strikes by date and aircraft type
    .group_by(["date", "Aircraft: Type"])
    .agg(
        pl.col("Number of Strikes").sum().alias("daily_strikes")
    )
    # Sort by aircraft type and date
    .sort(["Aircraft: Type", "date"])
    # Calculate previous day's strikes within each aircraft type
    .with_columns(
        pl.col("daily_strikes").shift(1).over("Aircraft: Type").alias("previous_day")
    )
    # Calculate delta
    .with_columns(
        (pl.col("daily_strikes") - pl.col("previous_day")).alias("delta_strikes")
    )
    # Rank within each aircraft type
    .with_columns(
        pl.col("delta_strikes")
        .rank(method="min", descending=True)
        .over("Aircraft: Type")
        .alias("ranking")
    )
    # Filter for top ranked (greatest increase) per aircraft type
    .filter(pl.col("ranking") == 1)
    .sort("delta_strikes", descending=True)
)

print(f"Shape: {greatest_strike_increase_type.shape}")
greatest_strike_increase_type

## Save results

Save all the resulting DataFrames for verification.

In [None]:
# Save results to parquet files
running_total_strikes.write_parquet("running_total_strikes.parquet")
damage_state.write_parquet("damage_state.parquet")
type_species.write_parquet("type_species.parquet")
greatest_strike_increase.write_parquet("greatest_strike_increase.parquet")
greatest_strike_increase_type.write_parquet("greatest_strike_increase_type.parquet")

print("All results saved to parquet files.")