# parse_data.ipynb

This notebook parses the data files used for the FP-2 assignment. 

<br>
<br>

First let's read the attached data file:

In [1]:
import pandas as pd

# Fix mixed type warning: cancellation/diversion columns are sparse categorical data
df0 = pd.read_csv('/Users/marino/Documents/UAB/Erasmus/Kyoto/Classes/Data Analysis Practice/Final Project/flightData/flight_delays_2024_combined.csv',
                  dtype={
                      'CancellationCode': str,
                      'Div1TailNum': str,
                      'Div2Airport': str,
                      'Div2TailNum': str,
                      'Div3Airport': str,
                      'Div3TailNum': str
                  })
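As a rough illustration of what the `dtype` argument above does, here is a minimal sketch on a tiny in-memory CSV. The sample rows and the `sample` variable are made up for illustration and do not come from the real file. It assumes pandas' default handling of empty cells, which still reads them as missing values even when the column is forced to `str`.

```python
import io
import pandas as pd

# Toy CSV standing in for the real file (hypothetical values).
csv_text = (
    "Origin,CancellationCode,Div1TailNum\n"
    "PHL,,\n"
    "JFK,B,N123AA\n"
    "PHL,,\n"
)

# Force the sparse columns to be read as strings, as in the cell above.
sample = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"CancellationCode": str, "Div1TailNum": str},
)

# Non-empty cells are kept as strings; empty cells remain missing.
non_missing_types = {type(v) for v in sample["CancellationCode"].dropna()}
missing_count = int(sample["CancellationCode"].isna().sum())
print(non_missing_types, missing_count)
```

Declaring the type up front means pandas does not have to guess it for columns that are mostly empty.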

### Reducing the number of entries
- The full dataset contains roughly 7M entries, so the analysis has been narrowed to delays of flights departing from Philadelphia International Airport (PHL). This reduces the data from about 7M to about 100k flight entries, or 1.42% of all flights in the dataset.

In [2]:
df0_phl = df0[df0['Origin'] == 'PHL'].copy()
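The share of rows kept by a filter like the one above can be checked directly from the row counts. The following is a minimal sketch on a made-up toy frame (`toy`, `toy_phl` and `share` are illustrative names). On the real data, the same ratio would be `len(df0_phl) / len(df0)`.

```python
import pandas as pd

# Toy frame with made-up origins, standing in for df0.
toy = pd.DataFrame({"Origin": ["PHL", "JFK", "PHL", "LAX", "ORD"]})

# Same boolean-mask filter as in the cell above.
toy_phl = toy[toy["Origin"] == "PHL"].copy()

# Fraction of rows that survive the filter.
share = len(toy_phl) / len(toy)
print(f"{len(toy_phl)} of {len(toy)} rows kept ({share:.2%})")
```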

<br>
<br>

The dependent and independent variables (DVs and IVs) that we are interested in are:

**DVs**:
- ArrDelay (Arrival Delay in minutes)

**IVs**:
- DepDelay (Departure delay in minutes)
- DayOfWeek
- CRSDepTime (Computer Reservation System Departure Time)
- Reporting_Airline (Airline Carrier Code)
- Dest
- Distance
- Month


<br>
<br>

Let's extract the relevant columns:
- `Reporting_Airline` and `Dest` do not appear in the `describe()` output because they are not numeric.

In [3]:

df = df0_phl[['ArrDelay', 'DepDelay', 'DayOfWeek', 'CRSDepTime', 
              'Reporting_Airline', 'Dest', 'Distance', 'Month']]

df.describe()

| | ArrDelay | DepDelay | DayOfWeek | CRSDepTime | Distance | Month |
|---|---|---|---|---|---|---|
| count | 98710.0 | 99123.0 | 100635.0 | 100635.0 | 100635.0 | 100635.0 |
| mean | 8.535498 | 15.113283 | 4.001053 | 1347.374909 | 890.54806 | 6.694212 |
| std | 65.515171 | 63.183139 | 2.015506 | 482.392957 | 597.595095 | 3.273155 |
| min | -72.0 | -36.0 | 1.0 | 107.0 | 36.0 | 1.0 |
| 25% | -18.0 | -6.0 | 2.0 | 909.0 | 453.0 | 4.0 |
| 50% | -8.0 | -2.0 | 4.0 | 1345.0 | 678.0 | 7.0 |
| 75% | 11.0 | 10.0 | 6.0 | 1832.0 | 1013.0 | 9.0 |
| max | 2189.0 | 2204.0 | 7.0 | 2359.0 | 2521.0 | 12.0 |
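Since `describe()` leaves out the airline and destination columns when numeric columns are present, they can be summarized separately. The sketch below uses a made-up toy frame, not the real data, and shows one possible approach: calling `describe()` on the text columns only, and `value_counts()` for per-category frequencies.

```python
import pandas as pd

# Toy frame with made-up values, using the original column names.
toy = pd.DataFrame({
    "ArrDelay": [5.0, -3.0, 12.0, 0.0],
    "Reporting_Airline": ["AA", "AA", "DL", "WN"],
    "Dest": ["ORD", "BOS", "ORD", "ATL"],
})

# Default describe(): only the numeric column is summarized.
numeric_summary = toy.describe()

# With only text columns selected, describe() reports count/unique/top/freq.
categorical_summary = toy[["Reporting_Airline", "Dest"]].describe()

# Frequency of each destination, most common first.
top_destinations = toy["Dest"].value_counts()

print(numeric_summary)
print(categorical_summary)
print(top_destinations)
```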


<br>
<br>

Next let's use the `rename` function to give the columns simpler variable names:

In [4]:
df = df.rename(columns={
    'ArrDelay': 'arrival_delay',
    'DepDelay': 'departure_delay',
    'DayOfWeek': 'day_of_week',
    'CRSDepTime': 'scheduled_dep_time',
    'Reporting_Airline': 'airline',
    'Dest': 'destination',
    'Distance': 'distance',
    'Month': 'month'
})

df.describe()

| | arrival_delay | departure_delay | day_of_week | scheduled_dep_time | distance | month |
|---|---|---|---|---|---|---|
| count | 98710.0 | 99123.0 | 100635.0 | 100635.0 | 100635.0 | 100635.0 |
| mean | 8.535498 | 15.113283 | 4.001053 | 1347.374909 | 890.54806 | 6.694212 |
| std | 65.515171 | 63.183139 | 2.015506 | 482.392957 | 597.595095 | 3.273155 |
| min | -72.0 | -36.0 | 1.0 | 107.0 | 36.0 | 1.0 |
| 25% | -18.0 | -6.0 | 2.0 | 909.0 | 453.0 | 4.0 |
| 50% | -8.0 | -2.0 | 4.0 | 1345.0 | 678.0 | 7.0 |
| 75% | 11.0 | 10.0 | 6.0 | 1832.0 | 1013.0 | 9.0 |
| max | 2189.0 | 2204.0 | 7.0 | 2359.0 | 2521.0 | 12.0 |
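The `count` row above is lower for the two delay columns (98,710 and 99,123) than for the other columns (100,635). Because `describe()` counts only non-missing values, this means some delay values are missing. Here is a minimal sketch of how the gaps could be tallied per column, using a made-up toy frame with the renamed column names rather than the real data:

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with the renamed column names.
toy = pd.DataFrame({
    "arrival_delay": [5.0, np.nan, -8.0, np.nan],
    "departure_delay": [2.0, np.nan, -1.0, 30.0],
    "distance": [453.0, 678.0, 1013.0, 36.0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# describe() reports the non-missing count, so the two add up to len(toy).
non_missing = toy.describe().loc["count"]

print(missing_per_column)
print(non_missing)
```

Applied to `df`, the same `isna().sum()` call would show how many rows lack an arrival or departure delay before any modeling step.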
