# Candidate Number: A12988

## Part 2 (b)
*In this task, we are required to evaluate "whether older planes suffer more delays on a year-to-year basis"*

***Data Preparation:***
- ***"Loaded flight data for each year from 1998 to 2007 and supplementary plane data"***
- ***"Defined a function to convert time from HHMM format to minutes"***
- ***"Created time range bins of 60-minute intervals spanning a 24-hour period"***
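
The conversion and binning steps listed above can be sketched in plain Python. This is only an illustration: the names `hhmm_to_minutes` and `hour_bin_label` are hypothetical, and treating readings with a minutes part of 60 or more as malformed is an assumption, not something the dataset documentation specifies.

```python
from typing import Optional


def hhmm_to_minutes(hhmm: Optional[float]) -> Optional[int]:
    """Convert an HHMM clock reading (e.g. 1435) to minutes after midnight."""
    if hhmm is None or hhmm != hhmm:  # NaN is the only value not equal to itself
        return None
    if hhmm < 0:
        return None
    hours, minutes = divmod(int(hhmm), 100)
    if hours > 24 or minutes >= 60:  # assumption: readings such as 1299 are malformed
        return None
    return (hours % 24) * 60 + minutes  # 2400 wraps around to midnight


def hour_bin_label(minutes: int) -> str:
    """Label the 60-minute bin that contains the given minute of the day."""
    hour = minutes // 60
    return f"{hour:02d}:00-{hour:02d}:59"


for reading in (5, 1435, 2400, 1299, float("nan")):
    converted = hhmm_to_minutes(reading)
    label = hour_bin_label(converted) if converted is not None else "invalid"
    print(reading, converted, label)
```

Returning `None` for malformed readings keeps them out of every bin, so they can be dropped together with the missing values.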

***Data Processing:***

- ***"Iterated over each year"***
- ***"Read flight data, merged with plane data, and processed time columns"***
- ***"Segmented departure times into time ranges and assigned labels based on the bins"***
- ***"Handled cases where departure or arrival times exceeded 24 hours"***
- ***"Dealt with missing values in the dataset"***
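
As a rough sketch of the merge and missing-value steps above, the example below computes plane age on two tiny, made-up tables. The rows are hypothetical, and the assumption that the manufacture year may arrive as text with a placeholder such as `None` should be checked against the actual plane file.

```python
import pandas as pd

# Hypothetical miniature tables standing in for one year of flights and the plane data
flights = pd.DataFrame({
    "TailNum": ["N101", "N202", "N303", "N404"],
    "DepDelay": [5.0, 42.0, 18.0, None],
})
planes = pd.DataFrame({
    "tailnum": ["N101", "N202", "N303"],
    "year": ["1975", "1999", "None"],  # assumed: manufacture year stored as text
})

flight_year = 2000

# Coerce the manufacture year to a number; placeholders become NaN
planes["year"] = pd.to_numeric(planes["year"], errors="coerce")

# An inner join keeps only flights whose tail number appears in the plane data
merged = flights.merge(planes, how="inner", left_on="TailNum", right_on="tailnum")
merged["plane_age"] = flight_year - merged["year"]

# Keep rows with a known, non-negative age and a recorded departure delay
clean = merged.dropna(subset=["plane_age", "DepDelay"])
clean = clean[clean["plane_age"] >= 0]

print(clean[["TailNum", "plane_age", "DepDelay"]])
```

Dropping rows with an unknown age explicitly makes it clear that such planes belong to neither the old nor the new group.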

***Data Analysis:***

- ***"Calculated the proportion of delayed flights for each time range"***
- ***"Plotted bar charts for each year, showing the proportion of delayed flights"***
- ***"Set appropriate labels and titles for the plots"***
- ***"Displayed each plot to visualize flight delay trends over the years"***
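
Because the quantity of interest is a proportion of delayed flights, one way to compare old and new planes directly is a two-proportion z-test, with a Bonferroni correction when the test is repeated for each year. The sketch below is not the method used in the cell that follows; the function name and the counts are hypothetical and chosen only to show the mechanics.

```python
import math


def two_proportion_z_test(delayed_a, total_a, delayed_b, total_b):
    """Two-sided z-test for equal proportions, using the pooled standard error."""
    p_a = delayed_a / total_a
    p_b = delayed_b / total_b
    pooled = (delayed_a + delayed_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value


# Hypothetical counts per year: (delayed_old, total_old, delayed_new, total_new)
counts = {
    1998: (1600, 10000, 7500, 50000),
    1999: (1200, 10000, 7800, 50000),
}

alpha = 0.05
bonferroni_alpha = alpha / len(counts)  # stricter per-year threshold

results = {}
for year, (d_old, n_old, d_new, n_new) in counts.items():
    z, p = two_proportion_z_test(d_old, n_old, d_new, n_new)
    results[year] = (z, p, p < bonferroni_alpha)
    print(year, round(z, 3), p, "significant" if p < bonferroni_alpha else "not significant")
```

A positive z means the old planes have the higher proportion of delayed flights in that year, so the sign carries the direction of the difference as well as its size.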


In [11]:
import pandas as pd
from scipy.stats import ttest_ind

# Function to convert HHMM format to minutes
def convert_to_minutes(hhmm_time):
    if pd.isna(hhmm_time):
        return None
    if hhmm_time >= 2400:
        hhmm_time -= 2400
    hours = int(hhmm_time) // 100
    minutes = int(hhmm_time) % 100
    total_minutes = hours * 60 + minutes
    return total_minutes

# List to store t-statistics and p-values for each year
t_statistics = []
p_values = []

# Iterate over each year from 1998 to 2007
for year in range(1998, 2008):
    # Read flight data for the current year
    flights = pd.read_csv(f'/Users/macbookpro/Downloads/dataverse_files/{year}.csv', encoding='latin1', low_memory=False)

    # Convert departure times to minutes and handle missing values
    flights['DepTime'] = pd.to_numeric(flights['DepTime'], errors='coerce')
    flights['DepTime'] = flights['DepTime'].apply(convert_to_minutes)
    
    # Filter out missing values and departure times beyond 24 hours (1440 minutes)
    flights = flights.dropna(subset=['DepTime'])
    flights = flights[flights['DepTime'] < 1440]

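    # Merge flight data with plane data to determine the age of each plane
    # (`planes` is assumed to have been loaded in an earlier cell)
    merged_data = pd.merge(flights, planes, how='inner', left_on='TailNum', right_on='tailnum')

    # Determine the age of each plane; non-numeric manufacture years become NaN
    merged_data['plane_age'] = year - pd.to_numeric(merged_data['year'], errors='coerce')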

    # Split flights into old planes (20 years or more) and new planes (under 20 years)
    is_old = merged_data['plane_age'] >= 20
    is_new = merged_data['plane_age'] < 20

    # Calculate the proportion of flights delayed by more than 15 minutes in each group
    proportion_delayed_old = (merged_data.loc[is_old, 'DepDelay'] > 15).mean()
    proportion_delayed_new = (merged_data.loc[is_new, 'DepDelay'] > 15).mean()

    # Collect the delay lengths (in minutes) of the delayed flights in each group
    delayed_old = merged_data.loc[is_old & (merged_data['DepDelay'] > 15), 'DepDelay']
    delayed_new = merged_data.loc[is_new & (merged_data['DepDelay'] > 15), 'DepDelay']

    # Perform Welch's t-test on the delay lengths of the delayed flights.
    # Note: this compares the mean delay (in minutes) among delayed flights, not the proportions above.
    t_statistic, p_value = ttest_ind(delayed_old, delayed_new, equal_var=False)
    t_statistics.append(t_statistic)
    p_values.append(p_value)

    # Print the results
    print(f"Year: {year}")
    print("Proportion of delayed flights for old planes:", proportion_delayed_old)
    print("Proportion of delayed flights for new planes:", proportion_delayed_new)
    print("T-statistic:", t_statistic)
    print("P-value:", p_value)
    if p_value < 0.05:
        print("There is a significant difference in the mean delay of delayed flights between old and new planes at 5% significance level")
    else:
        print("There is no significant difference in the mean delay of delayed flights between old and new planes at 5% significance level")
    print()

# Overall result: apply a Bonferroni correction across the yearly tests
alpha = 0.05  # Family-wise significance level
bonferroni_alpha = alpha / len(p_values)
significant_years = [year for year, p in zip(range(1998, 2008), p_values) if p < bonferroni_alpha]

if significant_years:
    print(f"Overall result: The mean delay of delayed flights differs significantly between old and new planes in {len(significant_years)} of {len(p_values)} years after Bonferroni correction (per-year alpha = {bonferroni_alpha})")
else:
    print(f"Overall result: No year shows a statistically significant difference in the mean delay of delayed flights between old and new planes after Bonferroni correction (per-year alpha = {bonferroni_alpha})")

Year: 1998
Proportion of delayed flights for old planes: 0.15661161256345363
Proportion of delayed flights for new planes: 0.15079584733985057
T-statistic: 4.649007714823757
P-value: 3.3515157024305223e-06
There is a significant difference in the mean delay of delayed flights between old and new planes at 5% significance level

Year: 1999
Proportion of delayed flights for old planes: 0.12053544339130329
Proportion of delayed flights for new planes: 0.15612890933900433
T-statistic: 7.340953613249759
P-value: 2.193073654318063e-13
There is a significant difference in the mean delay of delayed flights between old and new planes at 5% significance level

Year: 2000
Proportion of delayed flights for old planes: 0.1430432649193753
Proportion of delayed flights for new planes: 0.19237855542978252
T-statistic: 3.719952850697048
P-value: 0.00019965163087563137
There is a significant difference in the mean delay of delayed flights between old and new planes at 5% significance level

Year: 2001
P