# Analyze a subset with pandas


In this notebook, we'll explore about one year of the airline on-time performance data using pandas and see some of the limitations of pandas.

---

Big data analysis always starts with a manageable subset of the data. This allows you to:

* Explore it with familiar tools like NumPy and pandas, and
* Experiment quickly with the computations you want to run.

After you have your computations ready, you can focus on scaling up!


## Introduce dataset: Airline on-time performance data

In this tutorial, we will analyze **the ["airline on-time performance" dataset](https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ) -- a collection of flight records maintained by the U.S. Department of Transportation's Bureau of Transportation Statistics (BTS)**.

This dataset provides information about the on-time performance of domestic flights operated by large air carriers in the United States, including flight delays, cancellations, and diversions. It covers flights operated by 23 major airlines, with records from 1987 to the present day.

We will work with data from 2003-2022, which is ~70 GB in size on disk.


## Read a subset with pandas

Let's start by reading data for one year, 2022.

The data is stored as one CSV file per month for each year:

<img src="./images/csv-files.png">

The following cell handles the preliminary credentials needed to access the data and lists the files for 2022. We'll take a closer look at cloud storage in a future notebook.

In [None]:
import gcsfs

fs = gcsfs.GCSFileSystem()
files = [f"gcs://{f}" for f in fs.glob("quansight-datasets/airline-ontime-performance/csv/*2022.csv")]

pandas reads one CSV file at a time into a DataFrame, so we'll read the 12 monthly files individually and concatenate them:

In [None]:
import pandas as pd

In [None]:
import json

with open('prep/dtypes.json', 'r') as f:
    dtypes = json.load(f)

In [None]:
# Note: This cell will take ~3 minutes to execute on a medium machine profile

df_list = []

for file in files:
    df_temp = pd.read_csv(file, dtype=dtypes)
    df_list.append(df_temp)

In [None]:
df = pd.concat(df_list)
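
To see the read-and-concatenate pattern in isolation, here is a minimal sketch that uses two tiny, made-up CSV files held in memory instead of the real monthly files. The `example_dtypes` mapping is a hypothetical stand-in for `prep/dtypes.json`, and `ignore_index=True` is an optional extra that the cells above do not use.

```python
import io

import pandas as pd

# Two tiny, made-up "monthly" CSV files held in memory (stand-ins for the real files).
january = io.StringIO("OP_CARRIER,DEP_DELAY\nWN,5\nDL,\n")
february = io.StringIO("OP_CARRIER,DEP_DELAY\nAA,-3\nWN,12\n")

# Hypothetical dtype mapping, playing the role of prep/dtypes.json.
example_dtypes = {"OP_CARRIER": "string", "DEP_DELAY": "float64"}

# Read each file with the same dtypes so the columns line up when combined.
monthly_frames = [pd.read_csv(f, dtype=example_dtypes) for f in (january, february)]

# ignore_index=True gives the combined frame a fresh 0..n-1 index
# instead of repeating each file's own row numbers.
combined = pd.concat(monthly_frames, ignore_index=True)

print(combined)
print(combined.dtypes)
```

Passing explicit dtypes saves pandas from guessing a type for each column in each file, which keeps the twelve DataFrames consistent with one another.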

## Explore the dataset

While the previous cells execute, let's [learn more about the dataset](https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ).

Go to the above link and take a look at the information available. 👆

In [None]:
df.head()

Let's also list some column names for quicker access later, and note that the column names are capitalized in our dataset.

In [None]:
df.columns[:61]

Now let's perform some quick computations to get a better understanding of the dataset.

### What is the total time people spent on a flight in 2022?

In [None]:
time_in_flight = df["ACTUAL_ELAPSED_TIME"].sum()

print(f"People spent a total of {time_in_flight} minutes on domestic flights in the USA in 2022, \nwhich is ~{round(time_in_flight / (60*24*365), 2)} years in aggregate.")
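
The minutes-to-years conversion is plain arithmetic, so it can be checked on its own. The sketch below uses a hypothetical helper, `minutes_to_years`, and a made-up total rather than the real 2022 figure. The length of a year is a parameter: 365 days is assumed by default, while 360 corresponds to twelve 30-day months.

```python
MINUTES_PER_DAY = 60 * 24


def minutes_to_years(total_minutes, days_per_year=365):
    """Convert a number of minutes to years, assuming a fixed-length year."""
    return total_minutes / (MINUTES_PER_DAY * days_per_year)


# Made-up total for illustration, not the real 2022 figure.
example_total_minutes = 1_051_200

print(round(minutes_to_years(example_total_minutes), 2))
print(round(minutes_to_years(example_total_minutes, days_per_year=360), 2))
```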

### 💻 Your turn: What are the maximum and average delays in flight departures?

In [None]:
# Your code here. When ready, click on the three dots for the solution.

In [None]:
max_dep_delay = df["DEP_DELAY"].max()
print(f"The maximum departure delay is {max_dep_delay} minutes, or ~{max_dep_delay // 60} hours.")

mean_dep_delay = df["DEP_DELAY"].mean()
print(f"The average departure delay is {round(mean_dep_delay, 2)} minutes.")

### Which airport/airline has the most flight departure and arrival delays?

**Airport:**

In [None]:
df.groupby("ORIGIN")["DEP_DELAY"].count().idxmax()

In [None]:
df.groupby("ORIGIN")["ARR_DELAY"].count().idxmax()

That's the code for Hartsfield-Jackson Atlanta International Airport. Interesting!

**💻 Your turn: Airline**

In [None]:
# Your code here. When ready, click on the three dots for the solutions.

In [None]:
df.groupby("OP_CARRIER")["DEP_DELAY"].count().idxmax()

In [None]:
df.groupby("OP_CARRIER")["ARR_DELAY"].count().idxmax()

'WN' is the code for Southwest Airlines.
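
One thing to keep in mind: `count()` tallies every non-missing value in the column, so flights that left early or on time are counted as well. If you want to count only flights that were actually late, one option is to filter first. Here is a small sketch on made-up data; treating a delay greater than zero minutes as "late" is an assumption, not a definition taken from the dataset.

```python
import pandas as pd

# Made-up flights: negative delays are early departures, zero is on time.
flights = pd.DataFrame({
    "ORIGIN": ["ATL", "ATL", "ATL", "ORD", "ORD"],
    "DEP_DELAY": [-2.0, 0.0, 15.0, 30.0, 45.0],
})

# count() includes every non-missing value, so early and on-time flights count too.
records_per_airport = flights.groupby("ORIGIN")["DEP_DELAY"].count()

# Keep only flights that left late (assumed here: delay > 0), then count per airport.
late_flights = flights[flights["DEP_DELAY"] > 0]
late_per_airport = late_flights.groupby("ORIGIN")["DEP_DELAY"].count()

print(records_per_airport.idxmax())
print(late_per_airport.idxmax())
```

In this toy example the two approaches pick different airports, which shows that the busiest airport is not necessarily the one with the most late departures.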

### Get all "DISTANCE" values in kilometers instead of miles

In [None]:
df.DISTANCE.apply(lambda x: x*1.609344)
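
`apply` calls the Python function once per row. For simple arithmetic like this, multiplying the whole column at once gives the same values and is usually much faster. A minimal sketch on a made-up Series (the name `KM_PER_MILE` is ours):

```python
import pandas as pd

KM_PER_MILE = 1.609344

# Made-up distances in miles.
distances_miles = pd.Series([100.0, 250.0, 2475.0], name="DISTANCE")

# Row-by-row: the lambda runs once for every value.
km_apply = distances_miles.apply(lambda x: x * KM_PER_MILE)

# Vectorized: a single multiplication over the whole column.
km_vectorized = distances_miles * KM_PER_MILE

print(km_vectorized)
```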

### 💻 Your turn: Which type of delay contributes most to the overall departure delay?

In [None]:
# Your code here. When ready, click on the three dots for the solution.

In [None]:
# Note: Kernel restarts!

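# sum() returns one total per column (a Series), so idxmax() needs no axis argument
delay_columns = ['CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY']
df[delay_columns].sum().idxmax()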

This computation leads to a kernel restart because we're reaching the limits of pandas here.
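
To see what this computation returns when it does fit in memory, here is the same idea on a tiny, made-up DataFrame. The values are invented for illustration and say nothing about the real 2022 result.

```python
import pandas as pd

delay_columns = [
    "CARRIER_DELAY", "WEATHER_DELAY", "NAS_DELAY",
    "SECURITY_DELAY", "LATE_AIRCRAFT_DELAY",
]

nan = float("nan")

# Made-up rows; the last one mimics a flight with no delay breakdown recorded.
delays = pd.DataFrame(
    [
        [10.0, 0.0, 5.0, 0.0, 20.0],
        [0.0, 30.0, 0.0, 0.0, 15.0],
        [nan, nan, nan, nan, nan],
    ],
    columns=delay_columns,
)

# One total per column; missing values are skipped by default.
totals = delays.sum()

# Label of the column with the largest total.
largest = totals.idxmax()

print(totals)
print(largest)
```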

## Need for scale: Try to read the full dataset in pandas

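You also won't be able to load the full dataset in pandas without the kernel crashing. If the kernel restarted in the previous step, re-run the import cells and the cell that creates `fs` first.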
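
A rough way to see why is to measure how much memory a DataFrame takes per row and scale that up. The sketch below does this on a small, made-up DataFrame. The row count used for the extrapolation is hypothetical, and the real dataset has many more columns, so treat the result as an illustration of the method only.

```python
import pandas as pd

# Small, made-up sample with one text column and one numeric column.
sample = pd.DataFrame({
    "OP_CARRIER": ["WN", "DL", "AA"] * 1000,
    "DEP_DELAY": [5.0, -2.0, 30.0] * 1000,
})

# deep=True asks pandas to measure the string contents too,
# not only the fixed-size part of each column.
bytes_per_column = sample.memory_usage(deep=True)
total_bytes = int(bytes_per_column.sum())
bytes_per_row = total_bytes / len(sample)

# Hypothetical row count, for illustration only.
hypothetical_rows = 100_000_000
estimated_gb = bytes_per_row * hypothetical_rows / 1e9

print(f"{total_bytes} bytes for {len(sample)} rows")
print(f"~{estimated_gb:.1f} GB for {hypothetical_rows:,} rows")
```

On your own data, you can run `df.memory_usage(deep=True)` on the one-year DataFrame and compare the total with the memory available on your machine.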

In [None]:
files = [f"gcs://{f}" for f in fs.glob("quansight-datasets/airline-ontime-performance/csv/*.csv")]

In [None]:
with open('prep/dtypes.json', 'r') as f:
    dtypes = json.load(f)

In [None]:
# Note: Kernel restarts

df_list = []

for file in files:
    df_temp = pd.read_csv(file, dtype=dtypes)
    df_list.append(df_temp)

We'll see how we can overcome this with Dask in the upcoming notebooks!

---

## Next →

Let's learn to create some [pretty plots](./02-intro-to-hvplot.ipynb)!