# Analyze a subset
Big data analyses usually start with a manageable subset of the data. This allows you to:
* Download the dataset locally if needed,
* Explore it in depth, and
* Experiment with the computations you wish to do.

Once you have your computations ready, you can focus on scaling up!



## Introduce dataset: Airline on-time performance data

In this tutorial, we will analyze **the ["airline on-time performance" dataset](https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ) -- a collection of flight records maintained by the U.S. Department of Transportation's Bureau of Transportation Statistics (BTS)**.

This dataset provides information about the on-time performance of domestic flights operated by large air carriers in the United States, including flight delays, cancellations, and diversions. It covers flights operated by 23 major airlines, with records from 1987 to the present day.

We will work with data from 2003 to 2022, which is ~70 GB in size on disk.


## Read a subset with pandas


Read the data for a single year, 2022.

In [8]:
import json
import gcsfs

import pandas as pd

We'll look at cloud storage in the next notebook.

In [9]:
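# Load the service-account credentials; the context manager closes the file afterwards
with open("scripts/credentials.json") as f:
    token = json.load(f)
fs = gcsfs.GCSFileSystem(token=token)
storage_options = {"token": token}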

In [10]:
files = [f"gcs://{f}" for f in fs.glob("quansight-datasets/airline-ontime-performance/csv/*2022.csv")]

In [11]:
with open('scripts/dtypes.json', 'r') as f:
    dtypes = json.load(f)
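
The `dtypes` mapping loaded above is passed to `pd.read_csv` so that pandas does not have to guess a type for every column. The contents of `scripts/dtypes.json` are not shown in this notebook, so the sketch below uses a hypothetical two-entry mapping and a made-up two-row CSV purely to illustrate the mechanism.

```python
import io
import json

import pandas as pd

# Hypothetical stand-in for scripts/dtypes.json (the real file is not shown here).
dtypes_text = '{"YEAR": "int16", "OP_UNIQUE_CARRIER": "category"}'
example_dtypes = json.loads(dtypes_text)

# Made-up CSV content using column names that appear in the dataset preview.
csv_text = "YEAR,OP_UNIQUE_CARRIER,TAIL_NUM\n2022,9E,N132EV\n2022,9E,N133EV\n"

# Columns named in the mapping get the requested dtype; the rest are inferred.
example = pd.read_csv(io.StringIO(csv_text), dtype=example_dtypes)
print(example.dtypes)
```

Compact integer types and `category` for repeated strings can reduce memory use, which matters once a full year of flights is loaded.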

As the following cells execute, go to https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ to see what this dataset contains.

In [12]:
# This cell takes ~2.5 minutes to execute on a small/medium machine profile

df_list = []

for file in files:
    df_temp = pd.read_csv(file, dtype=dtypes, storage_options=storage_options)
    df_list.append(df_temp)

In [13]:
df = pd.concat(df_list)
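
By default, `pd.concat` keeps the row labels of each input frame, so the combined index can contain repeated labels. The sketch below shows the difference `ignore_index=True` makes, using two tiny made-up frames in place of the real files.

```python
import pandas as pd

# Two tiny stand-in frames (made-up values) playing the role of two CSV files.
part_a = pd.DataFrame({"MONTH": [1, 1], "OP_UNIQUE_CARRIER": ["9E", "AA"]})
part_b = pd.DataFrame({"MONTH": [2, 2], "OP_UNIQUE_CARRIER": ["9E", "DL"]})

# Default behaviour: each part keeps its own 0..n-1 labels, so labels repeat.
stacked = pd.concat([part_a, part_b])

# ignore_index=True discards the old labels and numbers the rows 0..n-1.
renumbered = pd.concat([part_a, part_b], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Repeated labels are harmless for the aggregations in this notebook, but they can be surprising when selecting rows with `.loc`.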

## Explore the dataset

In [14]:
df.head()

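| Unnamed: 0 | YEAR | QUARTER | MONTH | DAY_OF_MONTH | DAY_OF_WEEK | FL_DATE | OP_UNIQUE_CARRIER | OP_CARRIER_AIRLINE_ID | OP_CARRIER | TAIL_NUM | ... | DIV4_WHEELS_OFF | DIV4_TAIL_NUM | DIV5_AIRPORT | DIV5_AIRPORT_ID | DIV5_AIRPORT_SEQ_ID | DIV5_WHEELS_ON | DIV5_TOTAL_GTIME | DIV5_LONGEST_GTIME | DIV5_WHEELS_OFF | DIV5_TAIL_NUM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022 | 2 | 4 | 1 | 5 | 4/1/2022 12:00:00 AM | 9E | 20363 | 9E | N132EV | ... | | | | | | | | | | |
| 1 | 2022 | 2 | 4 | 1 | 5 | 4/1/2022 12:00:00 AM | 9E | 20363 | 9E | N133EV | ... | | | | | | | | | | |
| 2 | 2022 | 2 | 4 | 1 | 5 | 4/1/2022 12:00:00 AM | 9E | 20363 | 9E | N133EV | ... | | | | | | | | | | |
| 3 | 2022 | 2 | 4 | 1 | 5 | 4/1/2022 12:00:00 AM | 9E | 20363 | 9E | N133EV | ... | | | | | | | | | | |
| 4 | 2022 | 2 | 4 | 1 | 5 | 4/1/2022 12:00:00 AM | 9E | 20363 | 9E | N133EV | ... | | | | | | | | | | |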


In [17]:
# df.describe() -- kernel restarts on medium profile
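
Since `describe()` on the full frame restarts the kernel, one lighter option is to summarize only a handful of columns at a time. This is a minimal sketch; `describe_columns` is a hypothetical helper, and whether the reduced summary fits in memory depends on the machine profile.

```python
import pandas as pd


def describe_columns(frame, columns):
    """Summarize only the requested columns instead of the whole frame."""
    # Selecting the columns first keeps the work limited to those columns.
    return frame[columns].describe()
```

For example, `describe_columns(df, ["YEAR", "MONTH"])` uses two columns visible in the preview above; any other column name would need to be checked against `df.columns` first.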

### Maximum and average delay in departure?
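
One possible starting point is sketched below. It assumes the departure delay is stored in a numeric column named `DEP_DELAY` (the BTS field name; this column is not visible in the truncated preview, so check `df.columns`), and `departure_delay_summary` is a hypothetical helper.

```python
import pandas as pd


def departure_delay_summary(frame, delay_col="DEP_DELAY"):
    """Return the maximum and mean of a delay column.

    Assumes `delay_col` holds numeric delays; the default name is an assumption.
    """
    delays = frame[delay_col]
    # max() and mean() skip missing values by default.
    return {"max": delays.max(), "mean": delays.mean()}
```

Calling `departure_delay_summary(df)` would then give both numbers in one pass. If early departures are recorded as negative values, they pull the mean down, so it may be worth filtering to positive delays depending on the question being asked.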

### Which airport/airline has the most flight departure and arrival delays?
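
A grouped count is one way to approach this. The sketch below counts flights whose delay is greater than zero for each value of a grouping column. The helper `count_delays` is hypothetical, and the column names `ORIGIN`, `DEP_DELAY`, and `ARR_DELAY` mentioned afterwards are assumptions based on the BTS field names.

```python
import pandas as pd


def count_delays(frame, group_col, delay_col):
    """Count delayed flights (delay > 0) per group, largest count first."""
    # Rows with a missing delay compare as False and are dropped here.
    delayed = frame[frame[delay_col] > 0]
    return delayed.groupby(group_col).size().sort_values(ascending=False)
```

With the notebook's frame, `count_delays(df, "ORIGIN", "DEP_DELAY")` would rank airports by departure delays and `count_delays(df, "OP_UNIQUE_CARRIER", "ARR_DELAY")` would rank airlines by arrival delays. Raw counts favor busy airports and large carriers, so dividing by the total number of flights per group gives a fairer comparison.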

### Busiest time of the day/year?

(Can also plot later)
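
Before plotting, the counts themselves can be computed as sketched below. `MONTH` appears in the preview above; `CRS_DEP_TIME` (scheduled departure time) is an assumed column name, and the sketch further assumes it is encoded as `hhmm`, for example `1435` for 14:35. Both helper functions are hypothetical.

```python
import pandas as pd


def flights_per_month(frame, month_col="MONTH"):
    """Number of flight records in each month."""
    return frame.groupby(month_col).size()


def flights_per_hour(frame, time_col="CRS_DEP_TIME"):
    """Number of flight records per scheduled departure hour.

    Assumes times are hhmm-encoded numbers (an assumption about the data).
    """
    # Integer-divide by 100 to drop the minutes and keep the hour.
    hours = pd.to_numeric(frame[time_col], errors="coerce") // 100
    # value_counts() ignores missing values; sort_index() orders by hour.
    return hours.value_counts().sort_index()
```

`flights_per_month(df).idxmax()` and `flights_per_hour(df).idxmax()` would then name the busiest month and hour.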

### Total flight cancellations?

(Can plot which month has the most cancellations, then across all years)
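
A minimal sketch for the counts, assuming cancellations are flagged in a column named `CANCELLED` that holds 1 for cancelled flights and 0 otherwise (an assumption based on the BTS field list). `cancellation_counts` is a hypothetical helper.

```python
import pandas as pd


def cancellation_counts(frame, cancelled_col="CANCELLED", month_col="MONTH"):
    """Return (total cancellations, cancellations per month).

    Assumes `cancelled_col` is a 0/1 indicator; the column name is an assumption.
    """
    # Coerce to numbers and treat missing flags as "not cancelled".
    cancelled = pd.to_numeric(frame[cancelled_col], errors="coerce").fillna(0)
    total = int(cancelled.sum())
    by_month = cancelled.groupby(frame[month_col]).sum().astype(int)
    return total, by_month
```

`total, by_month = cancellation_counts(df)` gives the yearly total plus a per-month series that can be passed straight to a plotting call later.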

### Which type of delay is the most frequent: CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, or LateAircraftDelay?
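
The five names in the heading are delay causes. One way to compare them is to count, for each cause, the flights that report more than zero minutes of that delay. The upper-case column names below are assumptions modeled on the naming style seen in the preview, and `delay_cause_counts` is a hypothetical helper.

```python
import pandas as pd

# Assumed column names for the five delay causes; verify against df.columns.
DELAY_CAUSE_COLS = [
    "CARRIER_DELAY",
    "WEATHER_DELAY",
    "NAS_DELAY",
    "SECURITY_DELAY",
    "LATE_AIRCRAFT_DELAY",
]


def delay_cause_counts(frame, cause_cols=DELAY_CAUSE_COLS):
    """Count flights with a positive delay for each cause, largest first."""
    # The comparison yields a boolean frame; missing values count as False.
    return (frame[list(cause_cols)] > 0).sum().sort_values(ascending=False)
```

`delay_cause_counts(df).index[0]` would name the most frequent cause. Summing the columns instead of counting positive entries would answer a different question: which cause accounts for the most delay minutes.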

### Total time people spent in the air in 2022 in the US? 

Maybe calculate approx. carbon emissions?
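
A rough sketch for the time in the air, assuming each record has a numeric `AIR_TIME` column giving the flight time in minutes (an assumption based on the BTS field list). `total_air_time_hours` is a hypothetical helper.

```python
import pandas as pd


def total_air_time_hours(frame, air_time_col="AIR_TIME"):
    """Sum per-flight air time (assumed to be in minutes) and return hours."""
    # Missing air times (e.g. cancelled flights) are skipped by sum().
    minutes = pd.to_numeric(frame[air_time_col], errors="coerce").sum()
    return minutes / 60
```

Note that `total_air_time_hours(df)` measures aircraft hours, not passenger hours, because each record describes a flight and not the people on board. An emissions estimate would additionally need an external per-hour emissions factor, which is not attempted here.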

---

## Next

Let's chat briefly about [storage formats](02-storage-formats.ipynb)!