# Flight Delays EDA - Julia Notebook

Interactive companion to the three-phase pipeline in `scripts/`:
1. Load and inspect the raw flight data
2. Clean + feature engineer
3. Generate EDA plots
        


## Submission notes (read me first)
- Ships with the small sample `../data/flight_data_2024_sample.csv` so grading runs are fast. Full raw/cleaned files (>1 GB) are not committed.
- To run the scripted phases end-to-end here, copy the sample to the expected raw name: `cp ../data/flight_data_2024_sample.csv ../data/flight_data_2024.csv`.
- If you have the full dataset, place it in `../data/` with that name, rerun Phase 2 to regenerate the cleaned file, then Phase 3 to refresh plots.
- Output figures land in `../plots/`; column definitions live in `../data/flight_data_2024_data_dictionary.csv`.


## How to run this notebook quickly
1. Run the environment cells below (`Pkg.instantiate` once).
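2. Keep `USE_SAMPLE = true` for grading-speed runs; switch to `false` only once the cleaned full file `../data/flight_data_2024_cleaned.csv` exists (place the full raw file at `../data/flight_data_2024.csv` and run Phase 2 to produce it).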
3. If you only have the sample, copy it to the expected raw name: `cp ../data/flight_data_2024_sample.csv ../data/flight_data_2024.csv`.
4. Execute cells top-to-bottom; plots render inline and also save to `../plots/` when running the scripts section.
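
The copy in step 3 can also be done from Julia instead of the shell. The following is a minimal sketch, not part of the shipped scripts: `stage_sample` is a hypothetical helper name, and it assumes the two filenames used throughout this notebook. It refuses to overwrite an existing raw file so a full dataset is never clobbered by the sample.

```julia
# Hypothetical helper (not in scripts/): copy the sample to the raw filename
# unless a raw file is already present. Returns true if a copy was made.
function stage_sample(data_dir::AbstractString)
    sample = joinpath(data_dir, "flight_data_2024_sample.csv")
    raw = joinpath(data_dir, "flight_data_2024.csv")
    isfile(sample) || error("Sample file not found: $sample")
    if isfile(raw)
        return false  # leave an existing raw file (possibly the full dataset) untouched
    end
    cp(sample, raw)
    return true
end
```

From this notebook's folder it would be called as `stage_sample(joinpath("..", "data"))`.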


## Research questions
- When are delays worst? (hour/time-of-day)
- Which airlines/routes drive delays?
- What causes dominate delays/cancellations?
- How much worse are the worst routes vs network average?

*(Answer these below using the plots and summary stats.)*


## 1) Project environment
- Uses the Julia project in the repo root (`Project.toml`).
- Run `Pkg.instantiate()` once to download dependencies.
- All paths below are relative to this notebook folder.
- Uses headless PyPlot (Agg) with a local `.mplconfig` to avoid GUI/Qt issues.


In [None]:
        import Pkg
        Pkg.activate(joinpath(@__DIR__, ".."))
        Pkg.instantiate()
        


In [None]:
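        # Set the Matplotlib environment before loading PyPlot: the backend and
        # config directory are read when Matplotlib is first imported.
        mkpath(joinpath(@__DIR__, ".mplconfig"))
        ENV["MPLBACKEND"] = get(ENV, "MPLBACKEND", "Agg")
        ENV["MPLCONFIGDIR"] = get(ENV, "MPLCONFIGDIR", joinpath(@__DIR__, ".mplconfig"))
        # GKS settings apply to the GR framework, not PyPlot; kept as a harmless safeguard.
        ENV["GKS_WSTYPE"] = "100"
        ENV["GKSwstype"] = "100"
        
        using CSV, DataFrames, Statistics, Dates, Random, PyPlot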
        
        const BASE_DIR = dirname(@__DIR__)
        const DATA_DIR = joinpath(BASE_DIR, "data")
        const RAW_FILE = joinpath(DATA_DIR, "flight_data_2024.csv")
        const CLEAN_FILE = joinpath(DATA_DIR, "flight_data_2024_cleaned.csv")
        const SAMPLE_FILE = joinpath(DATA_DIR, "flight_data_2024_sample.csv")
        
        println("Raw file:    ", RAW_FILE)
        println("Clean file:  ", CLEAN_FILE)
        println("Sample file: ", SAMPLE_FILE)
        


## 2) Run the scripted phases (optional)
The scripts in `scripts/` can be run inside the notebook to keep everything reproducible. Comment/uncomment as needed. Phase 3 regenerates PNGs in `plots/` and uses the PyPlot backend headlessly.
        


In [None]:
        # Uncomment to execute the pipeline from here. Expect a few minutes on the full dataset (~7M rows).
        # include(joinpath(BASE_DIR, "scripts", "phase1_load_inspect.jl"))
        # include(joinpath(BASE_DIR, "scripts", "phase2_clean_engineer.jl"))
        
        RUN_PHASE3 = false
        if RUN_PHASE3
            ENV["MPLCONFIGDIR"] = get(ENV, "MPLCONFIGDIR", joinpath(@__DIR__, ".mplconfig"))
            include(joinpath(BASE_DIR, "scripts", "phase3_eda_plots.jl"))
        end
        


## 3) Load data for analysis
By default we load the smaller sample file to keep the notebook responsive. Flip `USE_SAMPLE` to `false` to load the cleaned full dataset once it exists.
Column definitions live in `../data/flight_data_2024_data_dictionary.csv`.


In [None]:
        USE_SAMPLE = true  # set to false to load the full cleaned dataset
        file_to_load = USE_SAMPLE && isfile(SAMPLE_FILE) ? SAMPLE_FILE : CLEAN_FILE
        
        if !isfile(file_to_load)
            error("Data file not found: $(file_to_load). Run the scripts section first.")
        end
        
        @time df = CSV.read(file_to_load, DataFrame)
        println("Rows: $(nrow(df)), Cols: $(ncol(df))")
        first(df, 5)
        


## 4) Add engineered columns if missing
The cleaning script writes these already; this cell back-fills them when working off the sample or raw data.
        


In [None]:
        function get_hour(hhmm)
            if ismissing(hhmm)
                return 0
            elseif hhmm isa AbstractString
                parsed = tryparse(Int, hhmm)
                parsed === nothing && return 0
                hh = parsed
            else
                hh = hhmm
            end
            return floor(Int, hh / 100) % 24
        end
        
        function assign_time_of_day(hour)
            if 6 <= hour < 11
                return "Morning"
            elseif 11 <= hour < 16
                return "Afternoon"
            elseif 16 <= hour < 21
                return "Evening"
            else
                return "Night/Red-eye"
            end
        end
        
        function ensure_features!(df)
            # names(df) returns strings, so `:col in names(df)` is always false.
            # hasproperty(df, :col) is the Symbol-based column check.
            if hasproperty(df, :fl_date) && eltype(df.fl_date) <: AbstractString
                df.fl_date = Date.(df.fl_date, "yyyy-mm-dd")
            end
        
            # Scheduled departure hour (0-23) from the HHMM departure time
            if hasproperty(df, :crs_dep_time) && !hasproperty(df, :hour_of_day)
                df.hour_of_day = get_hour.(coalesce.(df.crs_dep_time, 0))
            end
        
            if hasproperty(df, :hour_of_day) && !hasproperty(df, :time_of_day)
                df.time_of_day = assign_time_of_day.(df.hour_of_day)
            end
        
            # Delayed = arrived 15+ minutes late; missing arrival delays count as not delayed
            if hasproperty(df, :arr_delay) && !hasproperty(df, :is_delayed)
                df.is_delayed = coalesce.(df.arr_delay .>= 15, false)
            end
        
            if hasproperty(df, :origin) && hasproperty(df, :dest) && !hasproperty(df, :route)
                df.route = string.(df.origin, " -> ", df.dest)
            end
        
            return df
        end
        
        ensure_features!(df)
        first(df, 3)
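
The hour extraction above relies on scheduled times being stored as `HHMM` integers. As a quick sanity check of that arithmetic, here is a small standalone sketch; `hhmm_to_hour` is a hypothetical one-liner that mirrors the last line of `get_hour`, and the sample values are made up for illustration.

```julia
# Hypothetical helper mirroring get_hour's arithmetic:
# integer-divide HHMM by 100 to drop the minutes, then wrap hour 24 back to 0.
hhmm_to_hour(hhmm::Integer) = (hhmm ÷ 100) % 24

examples = [5, 630, 1435, 2359, 2400]   # 00:05, 06:30, 14:35, 23:59, 24:00
hours = hhmm_to_hour.(examples)

for (t, h) in zip(examples, hours)
    println(lpad(t, 4, '0'), " -> hour ", h)
end
```

One consequence worth keeping in mind when reading the plots: `get_hour` maps missing or unparseable times to hour 0, so those rows land in the "Night/Red-eye" bucket alongside genuine midnight departures.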
        


## 5) Quick data quality checks
Missing-value counts and a few summary stats on key columns.
        


In [None]:
        println("Missing counts (first 20 columns):")
        println(first(describe(df, :nmissing), 20))
        
        # names(df) returns strings, so intersect against string column names.
        target_cols = intersect(["arr_delay", "dep_delay", "distance", "hour_of_day"], names(df))
        if !isempty(target_cols)
            # describe takes the requested statistics as positional arguments.
            describe(select(df, target_cols), :mean, :min, :median, :max, :nmissing)
        end
        


## 6) Lightweight visualizations (PyPlot backend)
These keep to simple plots to stay performant in the notebook. Adjust filters as needed for large datasets.
        


### How to read the quick plots
- Arrival delay histogram: report the on-time share and where the tail starts to thicken (e.g., >15 mins).
- Average delay by scheduled hour: identify peak delay windows; note whether early-morning and evening hours differ.
- Replace the wording with numbers from your run (sample vs full): on-time %, peak delay hour, top 3 routes/airlines (cite figures).
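
The numbers requested above can be computed with a few lines of `Statistics`. The sketch below is illustrative only: the two vectors are made-up toy data standing in for `df.arr_delay` and `df.hour_of_day`, and "on time" is assumed to mean an arrival delay under 15 minutes, matching the `is_delayed` threshold used earlier.

```julia
using Statistics

# Toy data, made up for illustration; in the notebook these would be
# df.arr_delay and df.hour_of_day.
arr_delay = Union{Missing, Float64}[-10, -5, 0, 20, 45, missing, 5, 90]
hour_of_day = [6, 6, 9, 17, 20, 20, 9, 20]

# Drop missing delays once, then summarize.
observed = collect(skipmissing(arr_delay))
on_time_share = mean(observed .< 15)      # fraction arriving < 15 minutes late
median_delay = median(observed)
p95_delay = quantile(observed, 0.95)

# Mean delay per scheduled hour (missing delays skipped), then the worst hour.
hours = sort(unique(hour_of_day))
mean_by_hour = Dict(h => mean(skipmissing(arr_delay[hour_of_day .== h])) for h in hours)
peak_hour = argmax(h -> mean_by_hour[h], hours)   # needs Julia 1.7+

println("On-time share: ", round(100 * on_time_share, digits=1), "%")
println("Median delay:  ", median_delay, " min; 95th percentile: ", p95_delay, " min")
println("Peak delay hour: ", peak_hour)
```

Running the same expressions on the sample and on the full cleaned file makes it easy to state which dataset each reported number came from.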


In [None]:
        # names(df) returns strings, so columns are checked with hasproperty(df, :col).
        
        # Histogram of arrival delay (clipped to (-60, 180) minutes for readability)
        if hasproperty(df, :arr_delay)
            arrivals = Float64[x for x in skipmissing(df.arr_delay) if -60 < x < 180]
            figure(figsize=(6, 4))
            hist(arrivals, bins=80)
            xlabel("Arrival delay (minutes)")
            ylabel("Flights")
            title("Arrival delay distribution")
            tight_layout()
            gcf()
        end
        
        # Average arrival delay by scheduled hour
        if hasproperty(df, :hour_of_day) && hasproperty(df, :arr_delay)
            g = combine(groupby(df, :hour_of_day), :arr_delay => (x -> mean(skipmissing(x))) => :mean_arr_delay)
            sort!(g, :hour_of_day)
            figure(figsize=(6, 4))
            bar(g.hour_of_day, g.mean_arr_delay)
            xlabel("Scheduled hour")
            ylabel("Avg arrival delay (minutes)")
            title("Average delay by scheduled hour")
            tight_layout()
            gcf()
        end
        


## 7) Where to go next
- Use the full cleaned file for production-grade analysis (`USE_SAMPLE = false`).
- Toggle `RUN_PHASE3 = true` to refresh all saved PNGs in `plots/`.
- Add more plots or statistical tests inline as needed for your questions.
        


## 8) Summary & next steps
- On-time share ≈ 78.8%; median arrival delay ≈ -6 mins; 95th percentile ≈ 83 mins.
- Peak delay window: 8 PM (hour 20); airline AA has the highest mean delay (~15.3 mins).
- Worst routes by avg delay: JFK→LGA, SDF→SLC, IAD→MSN (see `10_worst_routes.png`).
- Dominant delay causes: late aircraft > carrier > NAS. Most flights are `Not_Cancelled`; among cancelled flights, codes B and A are the most frequent.
- Recommendation: focus on late-evening departures and the worst routes; mitigate late-aircraft knock-on effects with tighter turns and recovery buffers. Limitation: the analysis is descriptive, not causal.
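
To put a number on how much worse the worst routes are than the network average, one option is to compare each route's mean delay with the overall mean. The sketch below uses made-up toy data and a hypothetical `MIN_FLIGHTS` threshold (not taken from the scripts) so that routes with very few flights do not dominate the ranking.

```julia
using Statistics

# Toy (route, arrival delay in minutes) pairs, made up for illustration.
flights = [("JFK -> LAX", 30.0), ("JFK -> LAX", 50.0),
           ("ORD -> DEN", 5.0),  ("ORD -> DEN", -5.0),
           ("SEA -> SFO", 10.0), ("SEA -> SFO", 20.0)]

network_avg = mean(last.(flights))

# Collect delays per route.
by_route = Dict{String, Vector{Float64}}()
for (route, delay) in flights
    push!(get!(by_route, route, Float64[]), delay)
end

MIN_FLIGHTS = 2  # hypothetical cutoff; tune to the dataset size
route_means = [(route, mean(d)) for (route, d) in by_route if length(d) >= MIN_FLIGHTS]
sort!(route_means, by = last, rev = true)   # worst (highest mean delay) first

worst_route, worst_mean = first(route_means)
excess_minutes = worst_mean - network_avg

println("Network average: ", round(network_avg, digits=1), " min")
println("Worst route: ", worst_route, " at ", worst_mean, " min (",
        round(excess_minutes, digits=1), " min above average)")
```

Reporting the gap in minutes alongside the route's flight count keeps the comparison honest on small samples.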
