# Airport Data Quality Check and Cleaning (`run_full_qc.ipynb`)

This notebook cleans the **Airport Codes** dataset as part of the Capital One Airline Route Profitability Challenge.

We aim to:
- Validate and standardize raw input fields
- Detect and clean invalid data (textual numbers, special characters, inconsistent types)
- Impute missing values
- Remove null-heavy columns
- Handle outliers
- Save the cleaned data for further analysis and joins

---

### Steps Overview:
1. Load raw airport data
2. Analyze type conversion issues
3. Clean numeric columns
4. Recheck formatting issues
5. Check and visualize null patterns
6. Classify columns by type
7. Filter to U.S. medium/large airports
8. Impute missing values
9. Drop null-heavy or empty columns
10. Handle duplicates
11. Final null check and imputation
12. Detect and cap outliers
13. Save cleaned data

### Step 1: Load Raw Data

Load the raw airport data from `data/raw/Airport_Codes.csv` with `pd.read_csv()`.  
Also generate an automatic data profiling report using `ydata-profiling`.

### Step 2: Initial Type Conversion Checks

Use `analyze_conversion_errors()` to detect:
- Textual numbers (e.g., "Thirty Five")
- Malformed floats (e.g., "8.9abc")
- Invalid date or object formats

Print a column-wise error report using `print_error_report()`.
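
The kind of check described above can be sketched with plain pandas. This is only an illustration, not the implementation of `analyze_conversion_errors()`; the helper name `find_numeric_conversion_errors` and the sample values are hypothetical.

```python
import pandas as pd

def find_numeric_conversion_errors(series: pd.Series) -> pd.Series:
    """Return the non-null values that cannot be parsed as numbers."""
    coerced = pd.to_numeric(series, errors="coerce")
    # A value is a conversion error if it was present but became NaN.
    failed = series.notna() & coerced.isna()
    return series[failed]

# Hypothetical sample resembling a messy numeric column
sample = pd.Series(["120", "Thirty Five", "8.9abc", None, "45.5"])
bad_values = find_numeric_conversion_errors(sample)
print(bad_values.tolist())
```

Genuinely missing entries are not reported as errors here, which keeps the conversion report separate from the later null analysis.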


### Step 3: Clean Numerical Columns

Apply `clean_numeric_string()` to every column that should be numeric, using `apply_cleaning_to_numeric_columns()`.  
This cleans:
- Text values like "Thirty One"
- Floats with extra characters
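
A minimal sketch of this kind of cleaning is shown below. It is an assumption about how such a cleaner might work, not the code behind `clean_numeric_string()`; the names `clean_numeric_value` and `words_to_number` are hypothetical, and the word lookup only handles simple values below one hundred.

```python
import math
import re

# Hypothetical lookup tables for spelled-out numbers (not exhaustive).
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}

def words_to_number(text):
    """Convert e.g. 'Thirty One' to 31.0; return None if a word is unknown."""
    tokens = text.lower().replace("-", " ").split()
    if not tokens:
        return None
    total = 0
    for token in tokens:
        if token in UNITS:
            total += UNITS[token]
        elif token in TENS:
            total += TENS[token]
        else:
            return None
    return float(total)

def clean_numeric_value(value):
    """Return a float for a messy numeric string, or NaN if nothing is recoverable."""
    if value is None:
        return math.nan
    text = str(value).strip()
    # Keep the leading numeric part, dropping trailing junk such as 'abc'.
    match = re.match(r"^[-+]?\d*\.?\d+", text)
    if match:
        return float(match.group())
    number = words_to_number(text)
    return math.nan if number is None else number

print(clean_numeric_value("8.9abc"), clean_numeric_value("Thirty One"))
```

Values that cannot be recovered become NaN, so they are picked up by the null checks and imputation in the later steps.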


### Step 4: Re-check Type Conversions

Run `analyze_conversion_errors()` again to confirm issues were fixed.  
If everything looks good, continue to null analysis.


### Step 5: Visualize and Analyze Null Patterns

Use `plot_missing_values()` and `check_nulls()` to:
- Visualize missingness (matrix, bar, dendrogram)
- Identify columns with missing > 50%
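
As a rough sketch of the tabular part of this step, a null report can be built directly from `DataFrame.isna()`. The helper name `null_report` and the sample rows are hypothetical; the actual `check_nulls()` may return a different structure.

```python
import numpy as np
import pandas as pd

def null_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    report = pd.DataFrame({
        "null_count": df.isna().sum(),
        "null_pct": df.isna().mean() * 100,
    })
    return report.sort_values("null_pct", ascending=False)

# Hypothetical sample using column names from the airport data
sample = pd.DataFrame({
    "IATA_CODE": ["JFK", None, None, None],
    "ELEVATION_FT": [13.0, 125.0, np.nan, 618.0],
    "TYPE": ["large_airport", "medium_airport", "heliport", "small_airport"],
})
report = null_report(sample)
# Columns where more than half of the values are missing
high_null_cols = report.index[report["null_pct"] > 50].tolist()
print(report)
print(high_null_cols)
```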

### Step 6: Classify Columns by Data Type

Use `classify_columns()` to break down:
- Categorical columns
- Numerical columns
- Ordinal (if any)

This helps define proper imputation logic.
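
One simple way to get such a breakdown is to split on pandas dtypes. The sketch below is an assumption, not the logic of `classify_columns()`; it only mirrors the `'numerical'` and `'categorical'` keys that the notebook reads from the result, and the helper name `split_columns_by_dtype` is hypothetical.

```python
import pandas as pd

def split_columns_by_dtype(df: pd.DataFrame) -> dict:
    """Group column names into numerical and categorical by dtype."""
    numerical = df.select_dtypes(include="number").columns.tolist()
    # Everything that is not numeric (strings, dates, booleans) is
    # treated as categorical in this simplified sketch.
    categorical = df.select_dtypes(exclude="number").columns.tolist()
    return {"numerical": numerical, "categorical": categorical}

sample = pd.DataFrame({
    "TYPE": ["large_airport", "medium_airport"],
    "ELEVATION_FT": [13.0, 125.0],
    "NAME": ["Airport A", "Airport B"],
})
column_types = split_columns_by_dtype(sample)
print(column_types)
```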


### Step 7: Filter Dataset to Relevant Airports

Filter only:
- Airports of type: `medium_airport`, `large_airport`
- Country: `US`

Use `filter_dataset_based_on_user()` to isolate the relevant airport rows.
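
The filter itself amounts to combining `isin` masks. The sketch below uses the same filter dictionary as the notebook, but the helper `filter_rows` and the sample rows are hypothetical; the real `filter_dataset_based_on_user()` also returns the null-column lists, which are omitted here.

```python
import pandas as pd

def filter_rows(df: pd.DataFrame, allowed_values: dict) -> pd.DataFrame:
    """Keep rows whose value in each listed column is in the allowed set."""
    mask = pd.Series(True, index=df.index)
    for column, values in allowed_values.items():
        mask &= df[column].isin(values)
    return df[mask].copy()

sample = pd.DataFrame({
    "TYPE": ["large_airport", "heliport", "medium_airport", "large_airport"],
    "ISO_COUNTRY": ["US", "US", "CA", "US"],
    "IATA_CODE": ["JFK", "XX1", "YYC", "LAX"],
})
filters = {"TYPE": ["medium_airport", "large_airport"], "ISO_COUNTRY": ["US"]}
us_airports = filter_rows(sample, filters)
print(us_airports)
```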

### Step 8: Impute Missing Values

Use `multi_strategy_imputer()` to fill:
- Numerical columns: with median
- Categorical columns: with "Unknown"
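
The two strategies named above can be sketched with `fillna`. This is a simplified stand-in for `multi_strategy_imputer()`, whose internals are not shown in this notebook; the helper name `impute_median_and_constant` and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd

def impute_median_and_constant(df, num_cols, cat_cols, fill_value="Unknown"):
    """Fill numeric nulls with the column median and categorical nulls with a constant."""
    result = df.copy()
    for col in num_cols:
        result[col] = result[col].fillna(result[col].median())
    for col in cat_cols:
        result[col] = result[col].fillna(fill_value)
    return result

sample = pd.DataFrame({
    "ELEVATION_FT": [10.0, np.nan, 30.0],
    "MUNICIPALITY": ["Boston", None, "Denver"],
})
imputed = impute_median_and_constant(
    sample, num_cols=["ELEVATION_FT"], cat_cols=["MUNICIPALITY"]
)
print(imputed)
```

The median is less sensitive than the mean to extreme values, which matters here because outliers are only capped in a later step.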


### Step 9: Drop Null-heavy or Fully-null Columns

Drop:
- Columns where **100% of values** are null using `drop_all_null_columns()`
- Columns with >99% missing using `drop_high_null_columns(threshold=99)`
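
A threshold-based drop can be sketched as follows. The helper `drop_columns_above_null_pct` is hypothetical; it returns a `(DataFrame, dropped_columns)` pair only to resemble the way the notebook unpacks the result of `drop_high_null_columns()`, and the contents of the real second return value are an assumption.

```python
import numpy as np
import pandas as pd

def drop_columns_above_null_pct(df: pd.DataFrame, threshold_pct: float):
    """Drop columns whose share of missing values exceeds threshold_pct."""
    null_pct = df.isna().mean() * 100
    to_drop = null_pct[null_pct > threshold_pct].index.tolist()
    return df.drop(columns=to_drop), to_drop

n_rows = 200
sample = pd.DataFrame({
    "FULL": np.arange(n_rows, dtype=float),
    "MOSTLY_EMPTY": [1.0] + [np.nan] * (n_rows - 1),  # 99.5% missing
    "EMPTY": [np.nan] * n_rows,                        # 100% missing
})
trimmed, dropped = drop_columns_above_null_pct(sample, threshold_pct=99)
print(dropped)
```

With a threshold of 99, fully empty columns are removed as well, so a separate all-null pass mainly serves as an explicit, self-documenting step.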


### Step 10: Handle Duplicates

Check for:
- Fully duplicated rows
- Duplicate rows based on key columns (if needed)

Use `check_and_handle_duplicates()` with or without subset keys.
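
Both cases map onto `DataFrame.duplicated`. The sketch below is illustrative only; the helper `remove_duplicates` and the sample rows are hypothetical, and `check_and_handle_duplicates()` may report or handle duplicates differently.

```python
import pandas as pd

def remove_duplicates(df: pd.DataFrame, subset=None):
    """Drop duplicate rows (optionally judged on key columns); return data and count removed."""
    dup_mask = df.duplicated(subset=subset, keep="first")
    return df[~dup_mask].copy(), int(dup_mask.sum())

sample = pd.DataFrame({
    "IATA_CODE": ["JFK", "JFK", "LAX", "LAX"],
    "NAME": ["Airport A", "Airport A", "Airport B", "Airport B (alt)"],
})
# Fully duplicated rows only
deduped_full, n_full = remove_duplicates(sample)
# Duplicates judged on a key column
deduped_key, n_key = remove_duplicates(sample, subset=["IATA_CODE"])
print(n_full, n_key)
```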


### Step 11: Final Null Imputation

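Run `check_nulls()` again and reclassify any columns that still contain nulls with `classify_columns()`.  
Re-impute those columns with `multi_strategy_imputer()`, using the strategy that matches each column type.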


### Step 12: Detect and Cap Outliers (IQR Method)

Use `outlier_summary_report()` and `batch_plot_outliers()` to:
- Summarize skewness, kurtosis, and number of outliers
- Plot box/KDE plots

Detection uses the IQR rule; capping uses `cap_and_report_outliers()`, which is MAD-based and takes a multiplier `k`.
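
The two ideas in this step can be sketched as follows. The formulas are common textbook choices and are assumptions here: the IQR rule with a 1.5 factor for detection, and clipping to median ± k × 1.4826 × MAD for capping. The exact bounds used by `outlier_summary_report()` and `cap_and_report_outliers()` are not shown in this notebook, and the helper names below are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, factor: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - factor * iqr) | (series > q3 + factor * iqr)

def cap_with_mad(series: pd.Series, k: float = 3.0) -> pd.Series:
    """Clip values to median +/- k * scaled MAD (assumed capping rule)."""
    median = series.median()
    mad = (series - median).abs().median()
    scaled_mad = 1.4826 * mad  # makes MAD comparable to a standard deviation
    return series.clip(lower=median - k * scaled_mad,
                       upper=median + k * scaled_mad)

values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 500.0])
n_outliers = int(iqr_outlier_mask(values).sum())
capped = cap_with_mad(values, k=3)
print(n_outliers)
print(capped.tolist())
```

Capping keeps every row and only pulls extreme values back to the bound, which preserves row counts for the downstream joins.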


### Step 13: Save Final Cleaned Dataset

Save cleaned airport dataset to:
`data/cleaned/airports_cleaned.csv` (relative to the project root)

This file will be used for joining with Flights and Tickets data in downstream notebooks.


### Generalized Pipeline for All Three Datasets

The data quality and cleaning process was applied **consistently across all three datasets** using the **same modular codebase** located in `src/`. This ensured standardization, reusability, and reproducibility across the challenge.

#### Reusable Functions from `data_cleaner.py` and `plot_utils.py`:
| Function | Purpose |
|----------|---------|
| `analyze_conversion_errors()` | Detects formatting and type conversion issues in numeric, string, and date fields |
| `clean_numeric_string()` | Handles messy numeric values like "Thirty Five" or "8.9abc" |
| `apply_cleaning_to_numeric_columns()` | Applies cleaning logic across selected columns |
| `check_nulls()` | Generates null count and percentage report |
| `multi_strategy_imputer()` | Imputes missing values using mean/median (numeric) or mode/constant (categorical) |
| `drop_all_null_columns()` / `drop_high_null_columns()` | Drops columns with high or full null values |
| `check_and_handle_duplicates()` | Detects and optionally removes duplicate rows |
| `classify_columns()` | Classifies columns as categorical or numerical for imputation and EDA |
| `filter_dataset_based_on_user()` | Applies domain filters like US-only, medium/large airports |
| `predictive_categorical_imputer()` | Optional imputation using ML model (e.g. `RandomForestClassifier`) |

---



In [None]:
import sys
import os
import warnings
warnings.filterwarnings("ignore")
# Go up one level from notebooks/ to project root, then into src/
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
src_path = os.path.join(project_root, 'src')

if src_path not in sys.path:
    sys.path.insert(0, src_path)

print("src/ path added:", src_path)

# Now import WITHOUT the `src.` prefix
import pandas as pd
from ydata_profiling import ProfileReport, compare
from data_cleaner import (
    analyze_conversion_errors, print_error_report,
    apply_cleaning_to_numeric_columns, check_nulls,
    classify_columns, filter_dataset_based_on_user,
    multi_strategy_imputer, drop_all_null_columns,
    drop_high_null_columns, check_and_handle_duplicates
)
from plot_utils import (
    batch_plot_outliers, outlier_summary_report,
    cap_outliers, print_outlier_summary,
    cap_and_report_outliers, plot_missing_values
)
%matplotlib inline


## Airports QC 

In [None]:

# if __name__ == "__main__":
# === 1. Load Raw Data ===
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
data_dir = os.path.join(project_root, 'data', 'raw')
file_path = os.path.join(data_dir, 'Airport_Codes.csv')
print(f"\n Loading data from: {file_path}")
ac = pd.read_csv(file_path)

# Profiling report for the raw data (part of Step 1)
report = ProfileReport(df=ac, title="Airports Data")
report.to_notebook_iframe()

# === 2. Initial Conversion Error Analysis ===
# Target dtype for each column, checked by analyze_conversion_errors()
convert_dtypes = {
    'TYPE': 'string', 'NAME': 'string', 'ELEVATION_FT': 'float',
    'CONTINENT': 'string', 'ISO_COUNTRY': 'string',
    'MUNICIPALITY': 'string', 'IATA_CODE': 'string', 'COORDINATES': 'string'
}

df_cleaned, error_reports, _ = analyze_conversion_errors(ac, convert_dtypes)
print_error_report(error_reports)

# === 3. Apply Cleaning ===
numeric_columns = ['ELEVATION_FT']
ac = apply_cleaning_to_numeric_columns(ac, numeric_columns)

# === 4. Recheck Conversion ===
print("Error check after conversion")
df_cleaned, error_reports, _ = analyze_conversion_errors(ac, convert_dtypes)
print_error_report(error_reports)

# === 5. Null Checks ===
plot_missing_values(ac)
null_values, na_cols = check_nulls(ac)

# === 6. Column Classification ===
categorical_dict_cols = classify_columns(ac)

# === 7. Filtering ===
filter_dataset_dict = {'TYPE': ['medium_airport', 'large_airport'], 'ISO_COUNTRY': ['US']}
ac, cat_na_cols, num_na_cols = filter_dataset_based_on_user(ac, filter_dataset_dict, na_cols)

# === 8. Imputation ===
ac = multi_strategy_imputer(ac, num_cols=num_na_cols, cat_cols=cat_na_cols,
                                num_strategy='median', cat_strategy='constant', constant_fill='Unknown')

# === 9. Drop NULL Columns ===
plot_missing_values(ac)
ac, _ = drop_all_null_columns(ac)
ac, _ = drop_high_null_columns(ac, threshold=99)

# === 10. Deduplication ===
ac, _ = check_and_handle_duplicates(ac)

# === 11. Final NA Imputation ===
null_values, na_cols = check_nulls(ac)
cat_na_dict = classify_columns(ac[na_cols])
ac = multi_strategy_imputer(ac,
                            num_cols=cat_na_dict['numerical'],
                            cat_cols=cat_na_dict['categorical'],
                            num_strategy='mean', cat_strategy='constant', constant_fill='Unknown')

# === 12. Outlier Handling ===
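# Re-classify here: the Step 6 result predates filtering and column drops,
# so it may list columns that no longer exist in `ac`.
outlier_cols = classify_columns(ac)['numerical']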
print("\n Generating Outlier Summary Report...")
outlier_report_df = outlier_summary_report(ac, outlier_cols, method='iqr')
print(outlier_report_df)
batch_plot_outliers(ac, outlier_cols, method='iqr')

for col in outlier_cols:
    ac = cap_and_report_outliers(ac, column=col, k=1)
    #print_outlier_summary(ac, col, k=3)
batch_plot_outliers(ac, outlier_cols, method='iqr')

# === 13. Save Final Output ===
cleaned_path = os.path.abspath(os.path.join(os.getcwd(), '..', 'data', 'cleaned', 'airports_cleaned.csv'))
ac.to_csv(cleaned_path, index=False)
print(f"\n Cleaned dataset saved to: {cleaned_path}")



## Flights QC 

In [None]:

if __name__ == "__main__":
    # === Load Flights Data ===
    # Note: the name `ac` is reused from the Airports cell; from here on it holds the flights data.
    file_path = os.path.abspath(os.path.join(os.getcwd(), '..', 'data', 'raw', 'Flights.csv'))
    print(f"\n Loading data from: {file_path}")
    ac = pd.read_csv(file_path)

    # === Dtype Mapping ===
    convert_dtypes = {
        'FL_DATE': 'date', 'OP_CARRIER': 'string', 'TAIL_NUM': 'string', 'OP_CARRIER_FL_NUM': 'string',
        'ORIGIN_AIRPORT_ID': 'string', 'ORIGIN': 'string', 'ORIGIN_CITY_NAME': 'string',
        'DEST_AIRPORT_ID': 'string', 'DESTINATION': 'string', 'DEST_CITY_NAME': 'string',
        'DEP_DELAY': 'float', 'ARR_DELAY': 'float', 'CANCELLED': 'float',
        'AIR_TIME': 'float', 'DISTANCE': 'float', 'OCCUPANCY_RATE': 'float'
    }
    report = ProfileReport(df=ac, title="Flights Data")
    report.to_notebook_iframe()

    df_cleaned, error_reports, _ = analyze_conversion_errors(ac, convert_dtypes)
    print_error_report(error_reports)

    # === Clean Known Numeric Columns ===
    numeric_columns = ['AIR_TIME', 'DISTANCE', 'OCCUPANCY_RATE']
    ac = apply_cleaning_to_numeric_columns(ac, numeric_columns)

    # Fix dates manually
    ac['FL_DATE'] = pd.to_datetime(ac['FL_DATE'], errors='coerce')
    print("Error check after conversion")
    # Recheck Conversion Errors
    df_cleaned, error_reports, _ = analyze_conversion_errors(ac, convert_dtypes)
    print_error_report(error_reports)

    # === Initial Null Check ===
    plot_missing_values(ac)
    null_values, na_cols = check_nulls(ac)

    # === Filter Non-Cancelled Flights ===
    filter_dict = {'CANCELLED': [0.0]}
    ac, cat_cols_na, num_cols_na = filter_dataset_based_on_user(ac, filter_dict, na_cols)

    # === Impute ===
    ac = multi_strategy_imputer(ac, num_cols=num_cols_na, cat_cols=cat_cols_na,
                                 num_strategy='median', cat_strategy='constant', constant_fill='Unknown')

    # === Drop null-heavy columns ===
    ac, _ = drop_all_null_columns(ac)
    ac, _ = drop_high_null_columns(ac, 99)

    # === Dedupe ===
    ac, _ = check_and_handle_duplicates(ac)

    # === Final Null Check + Impute ===
    plot_missing_values(ac)
    null_values, na_cols = check_nulls(ac)
    na_col_types = classify_columns(ac[na_cols])
    ac = multi_strategy_imputer(ac,
                                num_cols=na_col_types['numerical'],
                                cat_cols=na_col_types['categorical'],
                                num_strategy='mean', cat_strategy='constant', constant_fill='Unknown')

    # === Outlier Summary & Capping ===
    all_col_types = classify_columns(ac)
    outlier_cols = all_col_types['numerical']
    print("\n Outlier Summary Report:")
    report = outlier_summary_report(ac, outlier_cols, method='iqr')
    print(report)
    batch_plot_outliers(ac, outlier_cols, method='iqr')
    #columns_to_cap = ['DEP_DELAY', 'ARR_DELAY', 'AIR_TIME', 'DISTANCE', 'OCCUPANCY_RATE']
    for col in outlier_cols:
        ac = cap_and_report_outliers(ac, col, k=3)
        #print_outlier_summary(ac, col, k=3)
        
    batch_plot_outliers(ac, outlier_cols, method='iqr')

    # === Save Cleaned ===
    cleaned_path = os.path.abspath(os.path.join(os.getcwd(), '..', 'data', 'cleaned', 'flights_cleaned.csv'))
    ac.to_csv(cleaned_path, index=False)
    print(f"\n Cleaned flights dataset saved to: {cleaned_path}")



## Tickets QC 

In [None]:


# === Load Tickets Data ===
file_path = os.path.abspath(os.path.join(os.getcwd(), '..', 'data', 'raw', 'Tickets.csv'))
print(f"\n Loading data from: {file_path}")
tickets_df = pd.read_csv(file_path)

# Step 1: Define dtypes and run conversion analysis
convert_dtypes = {
    'ITIN_ID': 'string', 'YEAR': 'string', 'QUARTER': 'string', 'ORIGIN': 'string', 'ORIGIN_COUNTRY': 'string',
    'ORIGIN_STATE_ABR': 'string', 'ORIGIN_STATE_NM': 'string', 'ROUNDTRIP': 'float',
    'REPORTING_CARRIER': 'string', 'PASSENGERS': 'int', 'ITIN_FARE': 'float', 'DESTINATION': 'string'
}

report = ProfileReport(df=tickets_df, title="Tickets Data")
report.to_notebook_iframe()

print("\n Running Conversion Error Analysis...")
df_cleaned, error_reports, _ = analyze_conversion_errors(tickets_df, convert_dtypes)
print_error_report(error_reports)

# Step 2: Apply numeric cleanup and standardization
numeric_cols = ['ROUNDTRIP', 'PASSENGERS', 'ITIN_FARE']
tickets_df = apply_cleaning_to_numeric_columns(tickets_df, numeric_cols)

# Step 3: Recheck errors
print("\n Final Conversion Check After Cleaning")
df_cleaned, error_reports, _ = analyze_conversion_errors(tickets_df, convert_dtypes)
print_error_report(error_reports)

# Step 4: Null checks
plot_missing_values(tickets_df)
null_summary, na_cols = check_nulls(tickets_df)
print("\n Null Summary:")
print(null_summary)

# Step 5: Filter for ROUNDTRIP == 1.0
filter_dict = {'ROUNDTRIP': [1.0]}
tickets_df, cat_na_after_filter, num_na_after_filter = filter_dataset_based_on_user(tickets_df, filter_dict, na_cols)

# Step 6: Imputation
tickets_df = multi_strategy_imputer(
    tickets_df,
    num_cols=num_na_after_filter,
    cat_cols=cat_na_after_filter,
    num_strategy="median",
    cat_strategy="constant",
    constant_fill="Unknown"
)

# Step 7: Drop null-heavy columns
tickets_df, _ = drop_all_null_columns(tickets_df)
tickets_df, _ = drop_high_null_columns(tickets_df, threshold=99)

# Step 8: Remove duplicates
tickets_df, _ = check_and_handle_duplicates(tickets_df)

# Step 9: Final impute if anything remains
plot_missing_values(tickets_df)
null_summary, na_cols = check_nulls(tickets_df)
cat_na_final = classify_columns(tickets_df[na_cols])
tickets_df = multi_strategy_imputer(
    tickets_df,
    num_cols=cat_na_final['numerical'],
    cat_cols=cat_na_final['categorical'],
    num_strategy="median",
    cat_strategy="constant",
    constant_fill="Unknown"
)

# Step 10: Outlier Treatment
classified_cols = classify_columns(tickets_df)
outlier_cols = [col for col in classified_cols['numerical'] if tickets_df[col].nunique() > 1]

print("\n Outlier Summary Before Capping:")
outlier_df = outlier_summary_report(tickets_df, outlier_cols, method='iqr')
print(outlier_df)
batch_plot_outliers(tickets_df, outlier_cols, method='iqr')

for col in outlier_cols:
    tickets_df = cap_and_report_outliers(tickets_df, col, k=3)
    # print_outlier_summary(tickets_df, col, k=3)
batch_plot_outliers(tickets_df, outlier_cols, method='iqr')

# Step 11: Save Cleaned File
cleaned_path = os.path.abspath(os.path.join(os.getcwd(), '..', 'data', 'cleaned', 'tickets_cleaned.csv'))
tickets_df.to_csv(cleaned_path, index=False)
print(f"\n Cleaned Tickets Data saved to: {cleaned_path}")
