# Project Name

## Izzy May and Drew Fitzpatrick

### Dataset Description

1. Our dataset comes from [Kaggle](https://www.kaggle.com/code/farzadnekouei/flight-data-eda-to-preprocessing). It is a CSV file of flight records.

2. The flights.csv dataset includes attributes such as departure and arrival times, airline, origin, destination, time spent in the air, and date.

### Implementation/technical merit

1. Anticipated challenges: We expect it to be difficult to choose the delay ranges we will use for prediction and to find patterns across the delayed flights.

2. Since the number of attributes is large, we plan to rank them by information gain: we will choose a target attribute and measure how much splitting on each of the other attributes reduces the entropy of that target. We also plan to manually identify redundant or irrelevant attributes.
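
A minimal sketch of the information-gain idea, assuming a categorical target. The column names `carrier`, `origin`, and `delayed`, the toy rows, and the helper names `entropy` and `information_gain` are hypothetical and only illustrate the calculation; they are not taken from flights.csv.

```python
import numpy as np
import pandas as pd


def entropy(labels: pd.Series) -> float:
    """Shannon entropy (in bits) of a categorical label column."""
    probs = labels.value_counts(normalize=True).to_numpy()
    probs = probs[probs > 0]  # guard against log2(0)
    return float(-(probs * np.log2(probs)).sum())


def information_gain(df: pd.DataFrame, attribute: str, target: str) -> float:
    """Reduction in target entropy after splitting the rows on one attribute."""
    base = entropy(df[target])
    # Weighted average of the target entropy within each attribute value
    weighted = sum(
        len(group) / len(df) * entropy(group[target])
        for _, group in df.groupby(attribute)
    )
    return base - weighted


# Toy data for illustration only (hypothetical columns and values)
toy = pd.DataFrame({
    "carrier": ["AA", "AA", "UA", "UA", "DL", "DL"],
    "origin": ["JFK", "LGA", "JFK", "LGA", "JFK", "LGA"],
    "delayed": ["yes", "yes", "no", "no", "yes", "no"],
})

gains = {col: information_gain(toy, col, "delayed") for col in ["carrier", "origin"]}
for col, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{col}: {gain:.3f}")
```

In this toy table, `carrier` separates the target better than `origin`, so it receives the higher gain. Numeric attributes would need to be binned before they can be scored this way.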

### Potential impact of the results

1. These results can inform travelers of potential delay risks so that they can plan accordingly. They can also help airports better anticipate delays.

2. Our stakeholders are travelers, airlines, and airports worldwide.

In [1]:
import pandas as pd

def categorize_flight_delays(file_path):
    """
    Categorizes flight delays into specified time intervals and prints statistics.
    
    Parameters:
        file_path (str): The path to the flights.csv file.
    """
    # Load the CSV file into a Pandas DataFrame
    flights = pd.read_csv(file_path)
    
    # Ensure the 'dep_delay' column is numeric
    flights['dep_delay'] = pd.to_numeric(flights['dep_delay'], errors='coerce')
    
    # Remove rows with missing 'dep_delay' values
    flights = flights.dropna(subset=['dep_delay'])
    
    # Categorize delays
    delay_intervals = {
        "On Time": flights['dep_delay'] <= 0,
        "0-30 mins": (flights['dep_delay'] > 0) & (flights['dep_delay'] <= 30),
        "30 mins - 1 hour": (flights['dep_delay'] > 30) & (flights['dep_delay'] <= 60),
        "1-2 hours": (flights['dep_delay'] > 60) & (flights['dep_delay'] <= 120),
        "2-3 hours": (flights['dep_delay'] > 120) & (flights['dep_delay'] <= 180),
        "3-4 hours": (flights['dep_delay'] > 180) & (flights['dep_delay'] <= 240),
        "Over 4 hours": flights['dep_delay'] > 240
    }
    
    # Count occurrences in each category
    counts = {category: flights[condition].shape[0] for category, condition in delay_intervals.items()}
    
    # Print the results
    print("Flight Delay Categories:")
    for category, count in counts.items():
        print(f"{category}: {count}")

# Usage
file_path = "flights.csv"  # Replace with the path to your CSV file
categorize_flight_delays(file_path)


Flight Delay Categories:
On Time: 200089
0-30 mins: 80141
30 mins - 1 hour: 21710
1-2 hours: 16858
2-3 hours: 5830
3-4 hours: 2369
Over 4 hours: 1524
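
The same binning can also be sketched with `pd.cut`, which keeps the bin edges in one list and makes it easy to report each category as a share of all flights. This is one possible alternative, not a replacement for `categorize_flight_delays`. The function name `delay_category_counts` and the sample delays are made up for illustration.

```python
import numpy as np
import pandas as pd

# Bin edges in minutes; right=True makes each bin (low, high], matching the
# "> low and <= high" conditions used in categorize_flight_delays
edges = [-np.inf, 0, 30, 60, 120, 180, 240, np.inf]
labels = ["On Time", "0-30 mins", "30 mins - 1 hour", "1-2 hours",
          "2-3 hours", "3-4 hours", "Over 4 hours"]


def delay_category_counts(dep_delay: pd.Series) -> pd.Series:
    """Count flights per delay category, dropping missing delays first."""
    delays = pd.to_numeric(dep_delay, errors="coerce").dropna()
    categories = pd.cut(delays, bins=edges, labels=labels, right=True)
    return categories.value_counts(sort=False)


# Made-up delays for illustration only; not taken from flights.csv
sample = pd.Series([-5, 0, 12, 30, 45, 61, 150, 200, 300, None])
counts = delay_category_counts(sample)
shares = (counts / counts.sum() * 100).round(1)
print(pd.DataFrame({"count": counts, "percent": shares}))
```

Expressed as shares, the counts printed above total 328,521 flights, of which about 61% departed on time.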
