# BigData 2025: Project 4
Students: Kalju Jake Nekvasil, Joosep Orasmäe, Tanel Tiisler, Kaupo Humal

## Data Ingestion and Preparation

We use 2009 US domestic airline data in CSV format. The dataset is cleaned and saved, then several ML models are trained to predict flight cancellation and compared on both accuracy and AUC. The best model was also used to predict cancellations on the 2010 data.

## Cleaning and Preprocessing

Columns were renamed for readability and unnamed columns were dropped. We checked that categorical columns contain no empty values, and filled missing numerical values with 0 as the most likely value. The resulting dataset was partitioned by airline identifier and saved in Parquet format.

## Exploratory Analysis

To understand the data better, we first looked at the carriers with the most flights:

| UniqueCarrier | Count |
|------|---------------|
| WN | 1127045 | 
| AA | 548194 | 
| OO | 544843 | 
| MQ | 434577 | 
| DL | 424982 | 
| US | 411274 | 
| UA | 375501 | 
| XE | 308340 | 
| EV | 297874 | 
| NW | 291856 | 


We also counted cancelled flights by cancellation code to find the most common reasons for cancellation:

| CancellationCode | Count |
|------|---------------|
| B | 36651 | 
| A | 35568 | 
| C | 14799 | 
| D | 20 | 


Finally, we analysed the flight status imbalance ratio: 

| CANCELLED | Count |
|------|---------------|
| 0.0 | 6326977 | 
| 1.0 | 87038 | 
 
There are about 73 times more normally operated flights than cancelled ones.
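The ratio can be recomputed directly from the counts in the table above. The short sketch below also derives the accuracy of a trivial classifier that always predicts "not cancelled", which is a useful reference point when reading accuracy scores on data this imbalanced. The variable names are ours, chosen for illustration.

```python
# Counts taken from the CANCELLED table above.
operated = 6_326_977
cancelled = 87_038

# How many operated flights there are per cancelled flight.
imbalance_ratio = operated / cancelled

# Accuracy of a model that always predicts "not cancelled".
baseline_accuracy = operated / (operated + cancelled)

print(f"imbalance ratio:   {imbalance_ratio:.1f}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```

Because the always-negative baseline is already correct for more than 98% of flights, accuracy alone says little about how well cancellations are detected.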

## Feature Engineering

We identified features that leak information about flight cancellation and would most likely not be available at prediction time. To find them, we compared the min, max and mean values of each feature for both outcomes (cancelled or not) and dropped features that had mostly zero values in only one of the groups. For example, Airtime values exist only for non-cancelled flights. Flight numbers were also removed, as we were unsure whether they contain any useful encoding or are assigned arbitrarily and would only introduce noise. Additionally, we extracted the expected arrival and departure hours in cyclical format.
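A minimal sketch of the cyclical hour encoding, in plain Python rather than our Spark code. It assumes the scheduled times are stored as `hhmm` integers (for example `1435`). Both that assumption and the function names `hour_from_hhmm` and `encode_hour_cyclical` are illustrative only.

```python
import math

def hour_from_hhmm(hhmm):
    """Extract the hour from an hhmm integer; 2400 is treated as hour 0."""
    return (hhmm // 100) % 24

def encode_hour_cyclical(hour):
    """Map an hour (0-23) onto the unit circle as a (sin, cos) pair."""
    angle = 2.0 * math.pi * (hour % 24) / 24.0
    return math.sin(angle), math.cos(angle)

# Hours 23 and 0 end up as close to each other as hours 0 and 1.
late = encode_hour_cyclical(23)
midnight = encode_hour_cyclical(hour_from_hhmm(2400))
early = encode_hour_cyclical(1)
print(math.dist(late, midnight), math.dist(midnight, early))
```

The point of the encoding is that the model sees 23:00 and 00:00 as neighbours, which a raw hour value from 0 to 23 would not express.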

We converted all categorical variables to numeric indices with StringIndexer and then used OneHotEncoder to convert them into sparse binary vectors. All features were then assembled into a single vector with VectorAssembler, and the original, unconverted features were dropped from the dataset.
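To illustrate what the indexing and one-hot steps produce, here is a plain-Python sketch of the idea. It is not the SparkML implementation: the helper names are hypothetical, categories are ordered by descending frequency (which we understand to be StringIndexer's default), and every category keeps its own slot, whereas OneHotEncoder drops the last category by default.

```python
from collections import Counter

def fit_string_index(values):
    """Assign index 0 to the most frequent category, 1 to the next, and so on.
    Ties are broken alphabetically."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: index for index, value in enumerate(ordered)}

def one_hot(index, size):
    """Sparse one-hot vector represented as (size, {position: 1.0})."""
    return (size, {index: 1.0})

# Toy carrier column, not real data.
carriers = ["WN", "AA", "WN", "OO", "AA", "WN"]
mapping = fit_string_index(carriers)
encoded = [one_hot(mapping[c], len(mapping)) for c in carriers]

print(mapping)
print(encoded[0])
```

Storing only the non-zero position keeps the vectors small even for columns with hundreds of categories, such as origin and destination airports.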

The dataset was split into train and test sets (0.7/0.3) while making sure that both sets have a similar ratio of the target (cancellation) classes.
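The idea of a stratified split can be sketched in plain Python as follows. This is an illustration of the technique under our own assumptions, not the Spark code we ran: rows are split separately within each class, so the class ratio is preserved in both parts. The function name and the toy rows are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_of, train_fraction=0.7, seed=42):
    """Split rows into (train, test), keeping the label ratio in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)  # shuffle within the class before cutting
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# Toy data: (flight id, cancelled flag), with 2% cancelled flights.
rows = [(i, 1 if i % 50 == 0 else 0) for i in range(1000)]
train, test = stratified_split(rows, label_of=lambda row: row[1])

print(len(train), sum(label for _, label in train))
print(len(test), sum(label for _, label in test))
```

With a rare positive class, a purely random split can leave the test set with noticeably more or fewer cancellations than the training set, which is what splitting per class avoids.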

## Modeling

We trained 4 different models as required by the task: LogisticRegression, DecisionTree, RandomForest and GBT. Due to computational complexity and time constraints, we ran cross-validation for each model over just a single hyperparameter, testing 3 different values.

All of the models had very similar accuracy but differing AUC scores. As the GBT model had the highest AUC, we chose it as our final model.
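The toy example below shows why accuracy can be nearly identical across models on imbalanced data while AUC still separates them. It is a plain-Python sketch with made-up scores, not our evaluation code. AUC is computed here as the probability that a random positive is scored above a random negative, with ties counting half.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

def roc_auc(labels, scores):
    """Pairwise (rank-based) ROC AUC; ties between scores count as 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# 98 operated flights and 2 cancelled ones (toy data).
labels = [0] * 98 + [1] * 2

# Model A gives every flight the same score; model B ranks the cancelled
# flights higher, but still below the 0.5 decision threshold.
scores_a = [0.0] * 100
scores_b = [0.1] * 98 + [0.4] * 2

for name, scores in (("A", scores_a), ("B", scores_b)):
    predictions = [1 if s >= 0.5 else 0 for s in scores]
    print(name, accuracy(labels, predictions), roc_auc(labels, scores))
```

Both toy models predict "not cancelled" for every flight and therefore share the same accuracy, yet only model B ranks cancelled flights above operated ones, which is what AUC rewards.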

The GBT model does have the big downside of being more computationally expensive, since its trees are trained sequentially. This meant that we could only do a very limited hyperparameter search within a reasonable runtime. Thankfully, inference is still very fast.

## Explainability

### Top 10 Features by Importance

| Rank | Feature Index | Decoded Feature            | Importance Score | Interpretation |
|------|---------------|----------------------------|------------------|----------------|
| 1    | -             | **DepDelay**               | 0.5778           | Departure delay |
| 2    | 633           | Month_9 (September)        | 0.0454           | Peak hurricane season |
| 3    | 625           | Month_1 (January)          | 0.0357           | Winter storms |
| 4    | 628           | Month_4 (April)            | 0.0311           | Spring break travel disruptions |
| 5    | 632           | Month_8 (August)           | 0.0284           | Summer travel peaks |
| 6    | 607           | UniqueCarrier_FL (AirTran) | 0.0259           | Carrier has since ceased operations |
| 7    | -             | **Distance**               | 0.0253           | Longer flight risk |
| 8    | 317           | Dest_ALO (Waterloo, IA)    | 0.0249           | Small airport, unpredictable Midwest weather |
| 9    | 26            | Origin_ATW (Appleton, WI)  | 0.0244           | Small airport, unpredictable Midwest weather |
| 10   | 304           | Dest_ABI (Abilene, TX)     | 0.0242           | Small airport, extreme Texas weather |

## Model Persistence and Inference

SparkML makes combining different processing steps very easy, as everything is basically a transformer. We also wrote a Preprocessor class that has a `_transform(self, df)` method, allowing it to be used as a pipeline stage. This means that we can combine everything into one big pipeline, such that the raw dataframe goes in on one end and predictions come out the other.

Testing this full pipeline on the 2010.csv dataset, we achieved an accuracy score of **0.9824** and an AUC score of **0.9555**.