# Basic Exploratory Analysis

## Setup

Our analysis requires the following modules:

In [None]:
import pandas as pd

# Extension of Python package pycountry providing conversion functions
import pycountry_convert as pc

# Python implementation of the Predictive Power Score (PPS)
import ppscore as pps

import matplotlib.pyplot as plt
import seaborn as sns

from ipynb_utils import CFG, plt_savefig

Let us now load the data frame containing the flight delay durations, which must be obtained manually (cf. the preceding notebook).

In [None]:
df = pd.read_csv(CFG["TRAIN_DATA_PATH"])

## First Inspection

As a preliminary inspection, let us invoke the info and sample methods of the data frame.

In [None]:
df.info()

In [None]:
df.sample(10)
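
Besides `info`, missing values can also be counted directly per column. The following is a minimal sketch on a toy frame whose values are invented for illustration; on the real data, the same calls would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the flight data; all values are invented for illustration.
toy = pd.DataFrame(
    {
        "FLTID": ["TU 0712", "TU 0757", "TU 0214"],
        "target": [260.0, 20.0, 0.0],
    }
)

# Number of missing entries per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# True if the frame contains no missing value at all.
has_no_missing = not toy.isna().any().any()
print(has_no_missing)
```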

The dataset contains no missing values. The meaning of the column names may be found at <https://zindi.africa/competitions/flight-delay-prediction-challenge/data>:

| Column | Description |
| --- | --- |
| ID | Unique identifier for the flight |
| DATOP | Date of flight |
| FLTID | Flight number |
| DEPSTN | Departure point (station/airport) |
| ARRSTN | Arrival point (station/airport) |
| STD | Scheduled Time of Departure |
| STA | Scheduled Time of Arrival |
| STATUS | Flight status (e.g., delayed, cancelled) |
| AC | Aircraft code |
| target | Flight delay (in minutes) |

Comparing our dataset with the aforementioned webpage reveals that it documents additional features which are absent from our data frame, namely:

| Column | Description |
| --- | --- |
| ETD | Expected Time of Departure |
| ETA | Expected Time of Arrival |
| ATD | Actual Time of Departure |
| ATA | Actual Time of Arrival |
| DELAY1 | Delay code 1 |
| DUR1 | Delay time 1 |
| DELAY2 | Delay code 2 |
| DUR2 | Delay time 2 |
| DELAY3 | Delay code 3 |
| DUR3 | Delay time 3 |
| DELAY4 | Delay code 4 |
| DUR4 | Delay time 4 |
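
The comparison described above amounts to a set difference between the documented column names and the columns actually present. The sketch below copies both lists from the two tables; the empty toy frame merely stands in for `df`.

```python
import pandas as pd

# Column names copied from the two tables above.
documented_columns = {
    "ID", "DATOP", "FLTID", "DEPSTN", "ARRSTN", "STD", "STA", "STATUS", "AC",
    "target", "ETD", "ETA", "ATD", "ATA", "DELAY1", "DUR1", "DELAY2", "DUR2",
    "DELAY3", "DUR3", "DELAY4", "DUR4",
}

# Empty toy frame carrying only the column names of our training data.
toy = pd.DataFrame(
    columns=[
        "ID", "DATOP", "FLTID", "DEPSTN", "ARRSTN",
        "STD", "STA", "STATUS", "AC", "target",
    ]
)

# Documented columns that do not occur in the frame.
absent = sorted(documented_columns - set(toy.columns))
print(absent)
```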

## Feature Inspection

Let us now examine certain features of the data frame in greater detail, en passant cleansing the data.

### Status Column

We begin by examining the STATUS feature.

In [None]:
# NOTE: The etymologically correct plural of the Latin word *status* [ˈstaː.tus]
# is *status* [ˈstaː.tuːs]!

statuses = df["STATUS"].unique()

print("All Statuses:")
for status in statuses:
    print(f"  {status}")
    print(f"    Number of entries : {df[df['STATUS'] == status].shape[0]}")
    print(f"    Mean              : {df[df['STATUS'] == status]['target'].mean()}")
    print(f"    Median            : {df[df['STATUS'] == status]['target'].median()}")
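
The same per-status summary can presumably be obtained in a single pass with `groupby`, which avoids filtering the frame three times per status. A minimal sketch on a toy frame (values invented for illustration):

```python
import pandas as pd

# Toy stand-in for the flight data; all values are invented for illustration.
toy = pd.DataFrame(
    {
        "STATUS": ["ATA", "ATA", "SCH", "DEP", "ATA"],
        "target": [10.0, 30.0, 0.0, 5.0, 20.0],
    }
)

# Number of entries, mean and median of the delay for each status.
status_summary = toy.groupby("STATUS")["target"].agg(["count", "mean", "median"])
print(status_summary)
```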

The following table elucidates the status codes that arise:

| Code | Name | Description |
| --- | --- | --- |
| ATA | Actual Time of Arrival | Flights that successfully landed at their destination |
| DEP | Departed | Flights that departed but may not have completed their journey |
| RTR | Returned | Flights that took off but returned to the departure airport due to issues |
| SCH | Scheduled | Flights listed in the schedule, no delay data applicable |
| DEL | Cancelled | Flights that were cancelled, treated as permanent delays |

Let us visualise the distribution of delay durations conditioned on STATUS.

In [None]:
fig, axes = plt.subplots(3, 2, figsize=(15, 10))
axes = axes.flatten()

for idx, status in enumerate(statuses):
    ax = axes[idx]
    df[df["STATUS"] == status]["target"].hist(bins=50, log=False, ax=ax)
    ax.set_title(status)
    ax.set_xlabel("Delay")
    ax.set_ylabel("Frequency")

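# Hide the panels of the 3x2 grid that are not used by any status.
for ax in axes[len(statuses):]:
    ax.set_visible(False)

plt.tight_layout()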

plt_savefig("delay-to-sum-flight-histograms")
plt.show()

The meaning of DEP ("Flights that departed but may not have completed their journey") remains somewhat obscure. As only a minority of flights are labelled as DEP, we have chosen to omit the corresponding data points.

Furthermore, measuring the delay of a DEL (cancelled) flight proves difficult. For regularly scheduled flights, one might consider taking the interval between the cancelled flight and the subsequent flight that actually arrives, and adding the delay of the latter. We have nevertheless chosen to delete the corresponding rows as well.

In [None]:
df = df[~df["STATUS"].isin(["DEP", "DEL"])]
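
For illustration only, the alternative treatment of cancelled flights mentioned above (which we do not pursue) might be sketched as follows. The toy data, the restriction to flights on the same route, and the helper column names `next_STD`, `next_delay` and `effective_delay` are assumptions made for this example.

```python
import pandas as pd

# Toy stand-in for the flight data; all values are invented for illustration.
flights = pd.DataFrame(
    {
        "DEPSTN": ["TUN", "TUN", "TUN"],
        "ARRSTN": ["ORY", "ORY", "ORY"],
        "STD": pd.to_datetime(
            ["2016-01-01 08:00:00", "2016-01-01 14:00:00", "2016-01-02 08:00:00"]
        ),
        "STATUS": ["DEL", "ATA", "ATA"],
        "target": [0.0, 30.0, 0.0],
    }
)

cancelled = flights[flights["STATUS"] == "DEL"].sort_values("STD")
flown = (
    flights[flights["STATUS"] == "ATA"]
    .sort_values("STD")
    .rename(columns={"STD": "next_STD", "target": "next_delay"})
    .loc[:, ["DEPSTN", "ARRSTN", "next_STD", "next_delay"]]
)

# For each cancelled flight, look up the next flown flight on the same route.
matched = pd.merge_asof(
    cancelled,
    flown,
    left_on="STD",
    right_on="next_STD",
    by=["DEPSTN", "ARRSTN"],
    direction="forward",
)

# Waiting time until the next flight plus the delay of that flight (minutes).
wait_minutes = (matched["next_STD"] - matched["STD"]).dt.total_seconds() / 60
matched["effective_delay"] = wait_minutes + matched["next_delay"]
print(matched[["STD", "next_STD", "effective_delay"]])
```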

### Airport Columns

We introduce columns that reduce the departure and arrival airports to their respective countries. For this purpose, we require the `airports.csv` file (located at `CFG["AIRPORTS_DATA_PATH"]`) downloaded in the preceding notebook.

In [None]:
airports = (
    pd.read_csv(CFG["AIRPORTS_DATA_PATH"])
    .loc[:, ["iata_code", "iso_country"]]
    .dropna()
)

We incorporate the relevant airport information into our original data frame.

In [None]:
# Keep an untouched copy so that the next cell can be re-run safely (idempotent).
df_bkp = df.copy()

In [None]:
# Merges departure.
dep_countries = airports.loc[:, ["iata_code", "iso_country"]].rename(
    columns={"iata_code": "DEPSTN", "iso_country": "country_dep"}
)
df_tmp = df_bkp.merge(dep_countries, on="DEPSTN", how="left")

# Merges arrival.
arr_countries = airports.loc[:, ["iata_code", "iso_country"]].rename(
    columns={"iata_code": "ARRSTN", "iso_country": "country_arr"}
)
df_tmp = df_tmp.merge(arr_countries, on="ARRSTN", how="left")

df = df_tmp

# ENIGMA: Why was this correction necessary?
df.loc[df["DEPSTN"] == "SXF", "country_dep"] = "DE"
df.loc[df["ARRSTN"] == "SXF", "country_arr"] = "DE"
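
Airports that receive no country in such a left merge can be listed with the `indicator` option of `merge`, which might help to investigate manual corrections like the one above. The following sketch uses invented toy frames; `validate="many_to_one"` additionally guards against duplicated IATA codes in the lookup table, which would otherwise silently duplicate flights.

```python
import pandas as pd

# Toy stand-ins; the lookup table deliberately lacks an entry for "SXF".
flights = pd.DataFrame({"DEPSTN": ["TUN", "SXF", "ORY"]})
airports_toy = pd.DataFrame(
    {"iata_code": ["TUN", "ORY"], "iso_country": ["TN", "FR"]}
)

lookup = airports_toy.rename(
    columns={"iata_code": "DEPSTN", "iso_country": "country_dep"}
)
merged = flights.merge(
    lookup,
    on="DEPSTN",
    how="left",
    validate="many_to_one",  # raises if an IATA code occurs twice in `lookup`
    indicator=True,  # adds the column "_merge"
)

# Airport codes for which no country could be found.
unmatched = merged.loc[merged["_merge"] == "left_only", "DEPSTN"].unique().tolist()
print(unmatched)
```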

To convert the ISO codes to continent codes, we employ the functionality provided by the module `pycountry_convert`.

In [None]:
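def iso_to_continent(iso: str) -> str | None:
    """Map an ISO 3166-1 alpha-2 country code to a continent name."""
    try:
        continent_code = pc.country_alpha2_to_continent_code(iso)
        return pc.convert_continent_code_to_continent_name(continent_code)
    except Exception:  # unknown or missing country code
        return None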


df["continent_dep"] = df["country_dep"].apply(iso_to_continent)
df["continent_arr"] = df["country_arr"].apply(iso_to_continent)

Furthermore, let us remove all flights from the data frame for which the departure and arrival airports coincide. Very likely, these are merely service flights rather than genuine flights.

In [None]:

df = df[df["DEPSTN"] != df["ARRSTN"]]

### Date-related Columns

The dataset contains several columns bearing date semantics. Let us convert them to the appropriate data type.

In [None]:
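# `df` is a filtered slice at this point, so work on an explicit copy.
df = df.copy()

# Plain column assignment replaces the column, so the datetime dtype is kept;
# `df.loc[:, col] = ...` may retain the original object dtype in recent pandas.
df["DATOP"] = pd.to_datetime(df["DATOP"], format="%Y-%m-%d")
df["STD"] = pd.to_datetime(df["STD"], format="%Y-%m-%d %H:%M:%S")
df["STA"] = pd.to_datetime(df["STA"], format="%Y-%m-%d %H.%M.%S")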
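
Note that the format string for STA uses dots as time separators. A minimal sketch of how such strings are parsed, and of how unparsable entries could be counted with `errors="coerce"`; the sample strings are invented for illustration:

```python
import pandas as pd

# Invented sample strings; the last one is deliberately malformed.
raw_sta = pd.Series(["2016-01-03 12.55.00", "2016-01-13 16.55.00", "not a date"])

# Entries that do not match the format become NaT instead of raising an error.
parsed = pd.to_datetime(raw_sta, format="%Y-%m-%d %H.%M.%S", errors="coerce")
n_unparsed = int(parsed.isna().sum())

print(parsed)
print(n_unparsed)
```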

We may now introduce several additional useful features relating to dates and times:

In [None]:
df["DATOP_year"] = df["DATOP"].dt.year
df["DATOP_month"] = df["DATOP"].dt.month
# Day of the week: 1 = Monday, ..., 7 = Sunday.
df["DATOP_day"] = df["DATOP"].dt.dayofweek + 1


def map_hour_to_period(hour: int) -> str:
    if 6 <= hour < 12:
        return "morning"
    elif 12 <= hour < 18:
        return "day"
    elif 18 <= hour < 24:
        return "evening"
    else:
        return "night"


df["STD_hour"] = df["STD"].dt.hour
df["STD_period"] = df["STD_hour"].apply(map_hour_to_period)

df["flight_time"] = (df["STA"] - df["STD"]).dt.total_seconds() / 60
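
As an aside, the mapping from hours to periods could also be expressed in vectorised form with `pd.cut`. This is a sketch of an alternative, not the approach used above; for valid hours from 0 to 23 it should yield the same labels as `map_hour_to_period`.

```python
import pandas as pd

hours = pd.Series([0, 5, 6, 11, 12, 17, 18, 23])

# Left-closed intervals [0, 6), [6, 12), [12, 18) and [18, 24).
periods = pd.cut(
    hours,
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=["night", "morning", "day", "evening"],
)
print(periods.astype(str).tolist())
```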

Let us examine the years covered by the dataset:

In [None]:
DATOP_years = df["DATOP_year"].unique()
DATOP_years

Thus, the data originate from the years 2016, 2017, and 2018.

Next, let us consider the distribution of recorded flights across the months within the period from 2016 to 2018:

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=len(DATOP_years), figsize=(16, 5), sharey=True)

for idx, year in enumerate(DATOP_years):
    # Filter the DataFrame for the specific year
    df_year = df[df["DATOP_year"] == year]

    # Plot the histogram on the respective subplot
    axes[idx].hist(df_year["DATOP_month"], bins=range(1, 14), alpha=0.8, color="blue")
    axes[idx].set_title(f"Flight Distribution for {year}")
    axes[idx].set_xlabel("Month")
    # Set x-axis ticks for months
    axes[idx].set_xticks(range(1, 13))
    axes[idx].set_ylabel("Number of Flights")

plt.tight_layout()

plt_savefig("month-to-sum-flight-by-year_hist")

plt.show()

In each year, we observe one suspicious month in which the number of flights is significantly lower than in the others. An inspection (even a manual one) of the test data set provided by Zindi reveals that the majority of the flights for the affected months are contained therein. Consequently, we exclude these months entirely.

In [None]:
df = df[~((df["DATOP_month"] == 5) & (df["DATOP_year"] == 2016))]
df = df[~((df["DATOP_month"] == 2) & (df["DATOP_year"] == 2017))]
df = df[~((df["DATOP_month"] == 9) & (df["DATOP_year"] == 2018))]
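
Rather than reading the suspicious months off the histograms, they could also be flagged programmatically. The sketch below uses invented toy data, and the threshold of half the yearly median is an arbitrary assumption chosen for illustration.

```python
import pandas as pd

# Toy data: month 2 contains far fewer flights than months 1 and 3.
toy = pd.DataFrame(
    {
        "DATOP_year": 2016,
        "DATOP_month": [1] * 10 + [2] * 2 + [3] * 10,
    }
)

# Number of flights per (year, month).
counts = (
    toy.groupby(["DATOP_year", "DATOP_month"])
    .size()
    .rename("n_flights")
    .reset_index()
)

# Flag months with fewer than half of the median monthly count of their year.
yearly_median = counts.groupby("DATOP_year")["n_flights"].transform("median")
suspicious = counts[counts["n_flights"] < 0.5 * yearly_median]
print(suspicious)
```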

Similarly, let us inspect the average delay duration by weekday (and year).

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 5))
axes = axes.flatten()

# Hide the panels of the 2x2 grid that are not needed.
for ax in axes[len(DATOP_years):]:
    ax.set_visible(False)

for i, year in enumerate(DATOP_years):
    ax = axes[i]
    df_year = df[df["DATOP_year"] == year]
    daily_avg = df_year.groupby("DATOP_day")["target"].mean().reset_index()

    ax.bar(daily_avg["DATOP_day"], daily_avg["target"], color="blue", alpha=0.7)

    ax.set_title(f"Average Delay by Day of Week for {year}", fontsize=12)
    ax.set_xlabel("Day", fontsize=10)
    ax.set_ylabel("Delay", fontsize=10)
    ax.set_xticks(range(1, 8))
    ax.grid(axis="y", linestyle="--", alpha=0.7)

plt.tight_layout()
plt_savefig("month-to-avg-delay-by-year_hist")
plt.show()
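
The averages underlying these bar charts can also be collected in a single table, with one row per weekday and one column per year. A minimal sketch on invented toy data:

```python
import pandas as pd

# Toy stand-in for the flight data; all values are invented for illustration.
toy = pd.DataFrame(
    {
        "DATOP_year": [2016, 2016, 2016, 2017, 2017],
        "DATOP_day": [1, 1, 2, 1, 2],
        "target": [10.0, 30.0, 0.0, 20.0, 40.0],
    }
)

# Mean delay for each combination of weekday and year.
avg_delay = toy.pivot_table(
    index="DATOP_day", columns="DATOP_year", values="target", aggfunc="mean"
)
print(avg_delay)
```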

## Final Overview of the Polished Data

A random sample from the processed data is presented below:

In [None]:
df.sample(10)

Let us create a plot illustrating all uni- and bivariate distributions:

In [None]:
sns.pairplot(df)

plt_savefig("each-vs-each-wrt-distribution_scatterplot")
plt.show()

Let us redraw the joint distribution of flight and delay duration in a dedicated plot:

In [None]:
plt.figure(figsize=(8, 6))
plt.scatter(df["flight_time"], df["target"], color="blue")
plt.xlabel("Flight Duration")
plt.ylabel("Delay Duration")
plt.xlim(1, 3000)
plt.ylim(1, 3000)
plt.title("Flight Duration vs. Delay Duration")
plt_savefig("flight-to-delay_scatterplot")
plt.show()

Let us visualise the correlations between the numerical features of our processed data:

In [None]:
# Restrict to numeric columns; recent pandas versions raise on non-numeric ones.
correlation_matrix = df.corr(numeric_only=True)

# Plot the correlation matrix as a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(
    correlation_matrix,
    annot=True,
    fmt=".2f",
)
plt.title("Correlation Matrix")

plt_savefig("each-vs-each-wrt-correlation_heatmap")
plt.show()
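
The correlation coefficient computed here is only defined for numeric columns. As a sketch on invented toy data, the numeric columns can also be selected explicitly before computing the matrix:

```python
import pandas as pd

# Toy frame mixing a categorical and two numeric columns (invented values).
toy = pd.DataFrame(
    {
        "AC": ["A", "B", "A", "B"],
        "flight_time": [60.0, 120.0, 180.0, 240.0],
        "target": [10.0, 20.0, 30.0, 40.0],
    }
)

# Keep only the numeric columns, then compute the pairwise correlations.
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()
print(corr)
```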

As the data contain numerous categorical features, it is also advisable to compute the predictive power score matrix:

In [None]:
cols = [
    col
    for col in df.columns
    if df[col].nunique() > 1 and not col.startswith("DATOP_") and col != "ID"
]
df_tmp = df[cols]

pp_scores = pps.matrix(df_tmp)[["x", "y", "ppscore"]].pivot(
    columns="x", index="y", values="ppscore"
)

pp_scores = pp_scores.round(2)

plt.figure(figsize=(12, 8))

sns.heatmap(
    pp_scores,
    vmin=0,
    vmax=1,
    # cmap="Reds",
    linewidths=0.5,
    annot=True,
)

plt_savefig("each-vs-each-wrt-pp-score_heatmap")

plt.show()

For reuse in subsequent notebooks, we store our processed data frame as a pickle file on disk.

In [None]:
df.to_pickle(CFG["PROCESSED_DATA_PATH"])
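
One advantage of the pickle format over CSV is that column data types, such as the datetime columns created above, survive the round trip. The following sketch illustrates this on a toy frame written to a temporary directory; the file names are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with a datetime column (invented values).
toy = pd.DataFrame(
    {"STD": pd.to_datetime(["2016-01-03 10:30:00"]), "target": [260.0]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = Path(tmp_dir) / "toy.pkl"  # hypothetical file name
    csv_path = Path(tmp_dir) / "toy.csv"  # hypothetical file name

    toy.to_pickle(pickle_path)
    toy.to_csv(csv_path, index=False)

    from_pickle = pd.read_pickle(pickle_path)
    from_csv = pd.read_csv(csv_path)

# The pickle keeps the datetime dtype; the CSV yields plain strings.
print(from_pickle["STD"].dtype)
print(from_csv["STD"].dtype)
```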