# Lab Exercise: SQL Analysis with Polars

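In this lab, you'll practice SQL queries using Polars' built-in SQL functionality. Complete each exercise by replacing the `-- Your query here` placeholder with the appropriate SQL query, then uncomment the `print(result)` line to inspect the output.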

In [None]:
# Setup - Run this cell first
import polars as pl

# Load data
airlines = pl.read_csv('data/nyc_airlines.csv')
airports = pl.read_csv('data/nyc_airports.csv')
flights = pl.read_csv('data/nyc_flights.csv')
planes = pl.read_csv('data/nyc_planes.csv')
weather = pl.read_csv('data/nyc_weather.csv')

# Create SQL context
ctx = pl.SQLContext(
    airlines=airlines,
    airports=airports,
    flights=flights,
    planes=planes,
    weather=weather,
    eager=True,  # return DataFrames directly (older Polars releases spelled this eager_execution=True)
)

print("Setup complete! Tables available:")
print(ctx.execute("SHOW TABLES"))

## Exercise 1: Basic Queries

### 1.1 Find all unique carriers in the airlines table

In [None]:
# Write your SQL query here
result = ctx.execute("""
-- Your query here
""")

# print(result)

### 1.2 Find the top 10 destinations by number of flights

In [None]:
# Write your SQL query here
result = ctx.execute("""
-- Your query here
""")

# print(result)

### 1.3 Find all flights that departed more than 2 hours late (120 minutes)

In [None]:
# Write your SQL query here
result = ctx.execute("""
-- Your query here
""")

# print(result)

## Exercise 2: Aggregation

### 2.1 Calculate the average departure delay for each origin airport

In [None]:
# Write your SQL query here
result = ctx.execute("""
-- Your query here
""")

# print(result)

### 2.2 Find the busiest month of the year

Count the number of flights per month and find which month has the most flights.

In [None]:
# First, let's check what columns are available
result = ctx.execute("""
    SELECT *
    FROM flights
    LIMIT 5
""")
# print(result)

# Now write your query to find busiest month
result = ctx.execute("""
-- Your query here
""")

# print(result)

### 2.3 Calculate the on-time performance rate for each carrier

Consider a flight on-time if the departure delay is <= 15 minutes.

In [None]:
# Write your SQL query here
result = ctx.execute("""
-- Your query here
""")

# print(result)

## Exercise 3: Joins

### 3.1 List all flights with their airline names (not just carrier codes)

Show the first 20 flights with carrier code, airline name, flight number, origin, and destination.

In [None]:
# Write your SQL query here
result = ctx.execute("""
-- Your query here
""")

# print(result)

### 3.2 Find the average age of planes for each carrier

Hint: The planes table has a `year` column for manufacture year. Calculate each plane's age as of 2013 (`2013 - year`).

In [None]:
# Write your SQL query here
result = ctx.execute("""
-- Your query here
""")

# print(result)

### 3.3 Find flights that experienced both departure delays and bad weather

Join flights with weather data and find flights where the departure delay exceeds 30 minutes and either `wind_speed > 20` or `precip > 0.1`.

In [None]:
# First, explore the weather table structure
result = ctx.execute("""
    SELECT *
    FROM weather
    LIMIT 5
""")
# print(result)

# Now write your join query
result = ctx.execute("""
-- Your query here
""")

# print(result)

## Exercise 4: Advanced Queries

### 4.1 Find the most popular aircraft types (by number of flights)

Join flights with planes to get manufacturer and model information. Show top 10.

In [None]:
# Write your SQL query here
result = ctx.execute("""
-- Your query here
""")

# print(result)

### 4.2 Analyze route performance

Find the top 10 routes (origin-destination pairs) by number of flights. For each route, report:
- Total number of flights
- Average departure delay
- Percentage of flights delayed more than 30 minutes

Include airport names, not just codes.

In [None]:
# Write your SQL query here
result = ctx.execute("""
-- Your query here
""")

# print(result)

### 4.3 Time-based analysis

Create a query that shows:
- Average delays by time of day (morning: 5-11, afternoon: 12-16, evening: 17-21, night: 22-4; hours inclusive)
- Number of flights in each period

Hint: Use CASE WHEN statements to categorize times. The `hour` column contains the scheduled departure hour.

In [None]:
# Write your SQL query here
result = ctx.execute("""
-- Your query here
""")

# print(result)

## Exercise 5: Challenge Query

### 5.1 Comprehensive Airline Performance Report

Create a query that produces a comprehensive performance report for each airline including:
- Airline name
- Total flights
- Average departure delay
- On-time percentage (delay <= 15 min)
- Number of unique aircraft used
- Average age of fleet

This will require multiple joins and subqueries!

In [None]:
# Write your SQL query here
result = ctx.execute("""
-- Your query here
""")

# print(result)

## Bonus: Compare with Polars

### Choose one of the queries above and implement it using Polars expressions

This will help you understand the relationship between SQL and Polars operations.

In [None]:
# Example: Let's implement Exercise 2.1 (average delay by origin) in Polars

# SQL version (for reference)
sql_result = ctx.execute("""
    SELECT 
        origin,
        AVG(dep_delay) as avg_delay
    FROM flights
    WHERE dep_delay IS NOT NULL
    GROUP BY origin
    ORDER BY avg_delay DESC
""")

# Polars version
polars_result = (
    flights
    .filter(pl.col('dep_delay').is_not_null())
    .group_by('origin')
    .agg(pl.col('dep_delay').mean().alias('avg_delay'))
    .sort('avg_delay', descending=True)
)

print("SQL Result:")
print(sql_result)
print("\nPolars Result:")
print(polars_result)

# Now implement one of your own queries in Polars below:
# Your Polars code here

## Bonus 2: Create Your Own Analysis

Using what you've learned, create your own interesting analysis of the flight data. Try to:
1. Ask an interesting business question
2. Write a SQL query to answer it
3. Visualize or interpret the results

In [None]:
# Your custom analysis here
# Example question: "Which day of the week has the best on-time performance?"

# Your SQL query here
result = ctx.execute("""
-- Your custom query here
""")

# print(result)

## Reflection Questions

After completing these exercises, consider:

1. Which operations felt more natural in SQL vs Polars?
2. When would you choose SQL over Polars (or vice versa)?
3. What are the advantages of using Polars' SQL integration?
4. How does writing SQL queries help you understand your data better?

Write your thoughts here:

*Your reflection here*

## Solution Hints

If you get stuck, here are some hints:

1. **Exercise 1.1**: Use `SELECT DISTINCT carrier`; if each carrier appears only once in the airlines table, a plain `SELECT carrier, name` is enough
2. **Exercise 1.2**: Use `GROUP BY dest` with `COUNT(*)`
3. **Exercise 2.3**: Use `CASE WHEN` to create an on-time indicator, then average it
4. **Exercise 3.2**: Remember to handle NULL values in the planes table's `year` column
5. **Exercise 4.2**: Use CTEs to break down the complex calculation
6. **Exercise 5.1**: Consider using multiple CTEs or subqueries for different metrics

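Remember: the Polars SQL interface covers a subset of standard SQL, so check the Polars documentation if a query fails to parse. Supported features include:
- CTEs (WITH clauses)
- Window functions
- CASE WHEN statements
- Common join types (INNER, LEFT, FULL, CROSS)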
- Aggregate functions