# STAT 451 Final Project - Group 12 (11:00am)
Mariya Siddiqui, Gavin Fring, Nicholas Dubois, Fiena Sapari, Ahmad Latiffi

## Data Preparation, Exploration, and Transformation
In this section, we do the following:

 1. Load raw data (which is stored locally at `./datasets/airline_delays.csv`)
 2. Run a rough profile of it to get an idea of what the dataset looks like, both as a whole and column by column. For numeric columns (floats, ints, etc.), the profile also contains basic statistics about the values and distribution of that column. For categorical columns, these statistics are skipped.
 3. Do some exploratory analysis of the raw dataset to get 

In [1]:
# Imports
import pandas as pd

# Suppressing all warnings (not only deprecation warnings) temporarily, as they take up a lot of room in output
import warnings
warnings.filterwarnings('ignore')
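
A global `filterwarnings('ignore')` stays in effect for the rest of the session. As a minimal sketch of a more narrowly scoped alternative, the standard library's `warnings.catch_warnings()` context manager restores the previous filters when the block exits. The `noisy` function below is a hypothetical stand-in for any call that emits a deprecation warning; it is not part of this project.

```python
import warnings


def noisy():
    # Hypothetical helper that emits a deprecation warning (illustration only)
    warnings.warn("old API", DeprecationWarning)
    return 42


# Filters changed inside this block are restored when the block exits
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=DeprecationWarning)
    result = noisy()

print(result)
```

Only `DeprecationWarning` is silenced here, and only inside the `with` block, so other warnings still surface.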

### Load Data

In [2]:
df = pd.read_csv("./datasets/airline_delays.csv")
df.head(5)

Unnamed: 0,year,month,carrier,carrier_name,airport,airport_name,arr_flights,arr_del15,carrier_ct,weather_ct,...,security_ct,late_aircraft_ct,arr_cancelled,arr_diverted,arr_delay,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay
0,2023,8,9E,Endeavor Air Inc.,ABE,"Allentown/Bethlehem/Easton, PA: Lehigh Valley ...",89.0,13.0,2.25,1.6,...,0.0,5.99,2.0,1.0,1375.0,71.0,761.0,118.0,0.0,425.0
1,2023,8,9E,Endeavor Air Inc.,ABY,"Albany, GA: Southwest Georgia Regional",62.0,10.0,1.97,0.04,...,0.0,7.42,0.0,1.0,799.0,218.0,1.0,62.0,0.0,518.0
2,2023,8,9E,Endeavor Air Inc.,AEX,"Alexandria, LA: Alexandria International",62.0,10.0,2.73,1.18,...,0.0,4.28,1.0,0.0,766.0,56.0,188.0,78.0,0.0,444.0
3,2023,8,9E,Endeavor Air Inc.,AGS,"Augusta, GA: Augusta Regional at Bush Field",66.0,12.0,3.69,2.27,...,0.0,1.57,1.0,1.0,1397.0,471.0,320.0,388.0,0.0,218.0
4,2023,8,9E,Endeavor Air Inc.,ALB,"Albany, NY: Albany International",92.0,22.0,7.76,0.0,...,0.0,11.28,2.0,0.0,1530.0,628.0,0.0,134.0,0.0,768.0


### Profile Data

In [3]:
def dataframe_profile(df):
    '''
    This function creates a profile of a particular dataset given in the form of a dataframe. 
    It outputs a dataframe that contains the profile. 
    '''
    # Create a DataFrame to store the profile
    profile_df = pd.DataFrame(
        columns=[
            "Column",
            "Data Type",
            "Missing Values",
            "Unique Values",
            "Top Value",
            "Frequency",
            "Min",
            "25th Percentile",
            "Median",
            "75th Percentile",
            "Max",
            "Mean",
            "Standard Deviation",
        ]
    )

    # Populate the profile DataFrame
    for column in df.columns:
        data_type = df[column].dtype
        missing_values = df[column].isnull().sum()
        unique_values = df[column].nunique()
        top_value = df[column].mode().iloc[0] if unique_values > 0 else None
        # value_counts() already sorts counts in descending order
        frequency = df[column].value_counts().iloc[0] if unique_values > 0 else None

        # Additional metadata for numeric columns
        if pd.api.types.is_numeric_dtype(df[column]):
            min_value = df[column].min()
            percentile_25 = df[column].quantile(0.25)
            median_value = df[column].median()
            percentile_75 = df[column].quantile(0.75)
            max_value = df[column].max()
            mean_value = df[column].mean()
            std_deviation = df[column].std()

        else:
            # No additional metadata for other types
            min_value = None
            percentile_25 = None
            median_value = None
            percentile_75 = None
            max_value = None
            mean_value = None
            std_deviation = None

        # Use loc to add rows to the DataFrame
        profile_df.loc[len(profile_df)] = {
            "Column": column,
            "Data Type": data_type,
            "Missing Values": missing_values,
            "Unique Values": unique_values,
            "Top Value": top_value,
            "Frequency": frequency,
            "Min": min_value,
            "25th Percentile": percentile_25,
            "Median": median_value,
            "75th Percentile": percentile_75,
            "Max": max_value,
            "Mean": mean_value,
            "Standard Deviation": std_deviation,
        }

    # Summary stats 
    (df_rows, df_columns) = df.shape
    print(f"The dataframe has {df_rows} rows and {df_columns} columns")

    return profile_df


result_profile = dataframe_profile(df)
display(result_profile)

The dataframe has 345323 rows and 21 columns


Unnamed: 0,Column,Data Type,Missing Values,Unique Values,Top Value,Frequency,Min,25th Percentile,Median,75th Percentile,Max,Mean,Standard Deviation
0,year,int64,0,21,2019,20946,2003.0,2008.0,2013.0,2019.0,2023.0,2013.206213,6.042778
1,month,int64,0,12,6,30098,1.0,4.0,7.0,9.0,12.0,6.493312,3.431955
2,carrier,object,0,29,OO,42164,,,,,,,
3,carrier_name,object,0,33,SkyWest Airlines Inc.,42164,,,,,,,
4,airport,object,0,420,DTW,3243,,,,,,,
5,airport_name,object,0,444,"Detroit, MI: Detroit Metro Wayne County",3243,,,,,,,
6,arr_flights,float64,509,7456,31.0,9864,1.0,58.0,120.0,270.0,21977.0,378.935876,1021.719103
7,arr_del15,float64,747,2366,0.0,9683,0.0,9.0,22.0,56.0,6377.0,73.002383,199.130487
8,carrier_ct,float64,509,19260,0.0,18747,0.0,3.0,8.15,19.76,1792.07,21.416112,48.9841
9,weather_ct,float64,509,5766,0.0,136811,0.0,0.0,0.6,2.01,717.94,2.633833,9.9062
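
As a rough cross-check on the custom profile, pandas' built-in `DataFrame.describe(include="all")` reports many of the same statistics. The sketch below uses a small toy frame (`toy_df`, with illustrative values that are not taken from the real file) rather than the airline data, so it can run on its own.

```python
import pandas as pd

# Toy stand-in for the airline data (illustrative values only)
toy_df = pd.DataFrame(
    {
        "carrier": ["9E", "9E", "OO", "OO"],
        "arr_flights": [89.0, 62.0, None, 66.0],
    }
)

# include="all" reports count/unique/top/freq for object columns
# and mean/std/min/quartiles/max for numeric columns
summary = toy_df.describe(include="all").T

# describe() counts non-null values, so missing = total rows - count
summary["missing"] = len(toy_df) - summary["count"]

print(summary)
```

Statistics that do not apply to a column's type show up as `NaN`, much like the empty cells in the profile above.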


Our categorical columns "carrier", "carrier_name", "airport", and "airport_name" come in pairs. It would make sense for each pair to have the same number of unique values, but the "name" columns have more in both cases (33 vs. 29 for carriers, 444 vs. 420 for airports), so it's clear that some "carrier" and "airport" codes correspond to more than one name.

Because of this naming inconsistency, we use the identifier columns ("carrier" and "airport") in our work instead of the names.

In [4]:
df_carrier_pairs = df[["carrier", "carrier_name"]].drop_duplicates(ignore_index=True)
display(df_carrier_pairs)

df_airport_pairs = df[["airport", "airport_name"]].drop_duplicates(ignore_index=True)
display(df_airport_pairs)

Unnamed: 0,carrier,carrier_name
0,9E,Endeavor Air Inc.
1,AA,American Airlines Inc.
2,AS,Alaska Airlines Inc.
3,B6,JetBlue Airways
4,DL,Delta Air Lines Inc.
5,F9,Frontier Airlines Inc.
6,G4,Allegiant Air
7,HA,Hawaiian Airlines Inc.
8,MQ,Envoy Air
9,NK,Spirit Air Lines


Unnamed: 0,airport,airport_name
0,ABE,"Allentown/Bethlehem/Easton, PA: Lehigh Valley ..."
1,ABY,"Albany, GA: Southwest Georgia Regional"
2,AEX,"Alexandria, LA: Alexandria International"
3,AGS,"Augusta, GA: Augusta Regional at Bush Field"
4,ALB,"Albany, NY: Albany International"
...,...,...
439,MKK,"Hoolehua, HI: Molokai"
440,ILE,"Killeen, TX: Skylark Field"
441,SKA,"Spokane, WA: Fairchild AFB"
442,CBM,"Columbus, MS: Columbus AFB"
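
Scanning the full pair listings by eye is tedious. One way to isolate the codes that map to more than one name is to count distinct names per code, as in this minimal sketch. The `pairs` frame and its values are hypothetical, chosen only to show the pattern; on the real data the same steps would apply to `df[["carrier", "carrier_name"]]` or `df[["airport", "airport_name"]]`.

```python
import pandas as pd

# Toy code/name pairs (hypothetical values, for illustration only)
pairs = pd.DataFrame(
    {
        "carrier": ["AA", "AA", "XX", "XX", "YY"],
        "carrier_name": ["Name A", "Name A", "Old Name", "New Name", "Name Y"],
    }
)

# Count distinct names per code; a count above 1 flags an inconsistent code
names_per_code = pairs.groupby("carrier")["carrier_name"].nunique()
inconsistent_codes = names_per_code[names_per_code > 1].index.tolist()

# List every distinct name attached to the flagged codes
conflicts = (
    pairs[pairs["carrier"].isin(inconsistent_codes)]
    .drop_duplicates()
    .sort_values(["carrier", "carrier_name"])
)

print(inconsistent_codes)
print(conflicts)
```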


Some of the numerical columns have missing values. To see what these rows have in common, we run the same profile on only the rows that contain at least one missing value.

In [5]:
missing_rows_df = df[df.isna().any(axis=1)]
missing_profile = dataframe_profile(missing_rows_df)
display(missing_profile)

The dataframe has 747 rows and 21 columns


Unnamed: 0,Column,Data Type,Missing Values,Unique Values,Top Value,Frequency,Min,25th Percentile,Median,75th Percentile,Max,Mean,Standard Deviation
0,year,int64,0,21,2020,260.0,2003.0,2009.0,2016.0,2020.0,2023.0,2014.69344,5.873875
1,month,int64,0,12,4,195.0,1.0,4.0,5.0,9.0,12.0,6.283802,3.199328
2,carrier,object,0,26,OO,108.0,,,,,,,
3,carrier_name,object,0,29,SkyWest Airlines Inc.,108.0,,,,,,,
4,airport,object,0,239,PVD,10.0,,,,,,,
5,airport_name,object,0,243,"Bristol/Johnson City/Kingsport, TN: Tri Cities",10.0,,,,,,,
6,arr_flights,float64,509,38,1.0,98.0,1.0,1.0,2.0,8.75,120.0,8.689076,16.751975
7,arr_del15,float64,747,0,,,,,,,,,
8,carrier_ct,float64,509,1,0.0,238.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,weather_ct,float64,509,1,0.0,238.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
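
In the profile above, "arr_del15" is missing in all 747 rows while "arr_flights" is missing in only 509 of them, which hints at more than one missingness pattern. A minimal sketch of how such patterns could be counted directly is shown below, using a toy frame (`toy_df`, with made-up values) in place of the airline data.

```python
import pandas as pd

# Toy frame with two different missingness patterns (made-up values)
toy_df = pd.DataFrame(
    {
        "arr_flights": [10.0, None, 5.0, None],
        "arr_del15": [2.0, None, None, None],
        "carrier_ct": [1.0, None, 0.0, None],
    }
)

# Number of missing values in each column
missing_per_column = toy_df.isna().sum()

# Each distinct True/False combination across columns is one pattern;
# value_counts() on the boolean mask counts the rows sharing each pattern
pattern_counts = toy_df.isna().value_counts()

print(missing_per_column)
print(pattern_counts)
```

Applied to `df`, the same two calls would show whether rows fall into a few clean groups (for example, only "arr_del15" missing versus several columns missing together).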
