# Machine Learning

This file contains instructions on how to approach the challenge.

We are going to work on different types of Machine Learning problems:

- **Regression Problem**: The goal is to predict flight delays.
- **(Stretch) Multiclass Classification**: If the plane is delayed, we will predict which type of delay it is.
- **(Stretch) Binary Classification**: The goal is to predict whether the flight will be cancelled.

In [13]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('dark_background')

import scipy.stats as st
from sklearn import preprocessing as pre

In [14]:
import calendar

from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans
import scipy.cluster.hierarchy as sch
from sklearn.cluster import DBSCAN


In [29]:
flights = pd.read_csv('../Data/files/flights_no_missing.csv')
flights.columns

passengers = pd.read_csv('../Data/files/passengers_no_missing.csv')
passengers.columns

fuel = pd.read_csv('../Data/files/fuel_no_missing.csv')
fuel.columns

flights_test = pd.read_csv('../Data/files/flights_test_no_missing.csv')
flights_test.columns

Index(['fl_date', 'mkt_unique_carrier', 'branded_code_share', 'mkt_carrier',
       'mkt_carrier_fl_num', 'op_unique_carrier', 'tail_num',
       'op_carrier_fl_num', 'origin_airport_id', 'origin', 'origin_city_name',
       'dest_airport_id', 'dest', 'dest_city_name', 'crs_dep_time',
       'crs_arr_time', 'dup', 'crs_elapsed_time', 'flights', 'distance'],
      dtype='object')

## Main Task: Regression Problem

The target variable is **ARR_DELAY**. We need to be careful about which columns we use as predictors. For example, DEP_DELAY would be a near-perfect predictor, but we can't use it: in a real-life scenario, we want to predict the delay before the flight takes off. We can use the average delay from earlier days, but not the delay of the actual flight we are predicting.  

Similarly, the variables **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY** shouldn't be used directly as predictors. However, we can create various transformations from their earlier values.
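
One way to use earlier values without leaking information from the flight being predicted is to join each flight to an average computed a week before it. The sketch below is only an illustration on made-up toy data: the feature name `carrier_mean_arr_delay_7d_ago` and the choice of a 7-day lag are assumptions, not part of the challenge specification.

```python
import pandas as pd

# Toy stand-in for `flights`; the values are invented for illustration.
toy = pd.DataFrame({
    "fl_date": pd.to_datetime(["2019-01-02", "2019-01-02", "2019-01-02",
                               "2019-01-09", "2019-01-09"]),
    "mkt_unique_carrier": ["UA", "UA", "AA", "UA", "AA"],
    "arr_delay": [10.0, 30.0, 0.0, 5.0, 20.0],
})

# Mean arrival delay per carrier and day.
daily = (
    toy.groupby(["mkt_unique_carrier", "fl_date"], as_index=False)["arr_delay"]
       .mean()
       .rename(columns={"arr_delay": "carrier_mean_arr_delay_7d_ago"})
)

# Move each daily average 7 days forward, so that a flight on day D
# is matched with the average observed on day D - 7 (hypothetical lag).
daily["fl_date"] = daily["fl_date"] + pd.Timedelta(days=7)

features = toy.merge(daily, on=["mkt_unique_carrier", "fl_date"], how="left")
print(features)
```

Flights in the first week have no earlier data and receive `NaN`, which has to be imputed or dropped before modeling.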

We will be evaluating your models by predicting the ARR_DELAY for all flights **1 week in advance**.

In [30]:
flights.head()

Unnamed: 0,fl_date,mkt_unique_carrier,branded_code_share,mkt_carrier,mkt_carrier_fl_num,op_unique_carrier,tail_num,op_carrier_fl_num,origin_airport_id,origin,...,crs_elapsed_time,actual_elapsed_time,air_time,flights,distance,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay
0,2019-05-19,UA,UA_CODESHARE,UA,4264,EV,N48901,4264,12266,IAH,...,52.0,51.0,26.0,1,127,0.0,0.0,0.0,0.0,0.0
1,2019-05-19,UA,UA_CODESHARE,UA,4266,EV,N12540,4266,13244,MEM,...,112.0,102.0,81.0,1,468,0.0,0.0,0.0,0.0,0.0
2,2019-05-19,UA,UA_CODESHARE,UA,4272,EV,N11164,4272,12266,IAH,...,176.0,184.0,143.0,1,1091,0.0,0.0,0.0,0.0,0.0
3,2019-05-19,UA,UA_CODESHARE,UA,4281,EV,N13995,4281,11042,CLE,...,80.0,68.0,49.0,1,310,0.0,0.0,0.0,0.0,0.0
4,2019-05-19,UA,UA_CODESHARE,UA,4286,EV,N13903,4286,13061,LRD,...,76.0,80.0,57.0,1,301,0.0,0.0,0.0,0.0,0.0


In [31]:
fuel.head()

Unnamed: 0,month,airline_id,unique_carrier,carrier,carrier_name,carrier_group_new,sdomt_gallons,satl_gallons,spac_gallons,slat_gallons,...,sdomt_cost,satl_cost,spac_cost,slat_cost,sint_cost,ts_cost,tdomt_cost,tint_cost,total_cost,year
0,1,21352.0,0WQ,0WQ,Avjet Corporation,1,0.0,0.0,0.0,0.0,...,0,0.0,0.0,0.0,0.0,0,396216,140239.0,536455,2016
1,1,21645.0,23Q,23Q,Songbird Airways Inc.,1,0.0,0.0,0.0,0.0,...,0,0.0,0.0,0.0,0.0,0,0,0.0,0,2016
2,1,21652.0,27Q,27Q,"Jet Aviation Flight Services, Inc.",1,0.0,0.0,0.0,0.0,...,0,0.0,0.0,0.0,0.0,0,0,0.0,0,2016
3,1,20408.0,5V,5V,Tatonduk Outfitters Limited d/b/a Everts Air A...,1,260848.0,0.0,0.0,0.0,...,522405,0.0,0.0,0.0,0.0,522405,569497,0.0,569497,2016
4,1,19917.0,5X,5X,United Parcel Service,3,32138000.0,9743000.0,16116000.0,2972000.0,...,34098000,9752000.0,17965000.0,3524000.0,31241000.0,65339000,34098000,31241000.0,65339000,2016


In [32]:
passengers.head()

Unnamed: 0,departures_scheduled,departures_performed,payload,seats,passengers,freight,mail,distance,ramp_to_ramp,air_time,...,dest_country,dest_country_name,aircraft_group,aircraft_type,aircraft_config,month,year,distance_group,class,data_source
0,2,2,75800,300,244,0,0,1020,310,259,...,US,United States,6,694,1,2017,3,3,F,DU
1,2,2,72400,256,219,458,950,255,140,90,...,US,United States,6,698,1,2017,3,1,F,DU
2,2,2,76000,300,240,75,335,601,208,174,...,US,United States,6,694,1,2017,3,2,F,DU
3,2,2,104600,376,187,0,58,304,156,103,...,US,United States,6,622,1,2017,3,1,F,DU
4,2,2,72700,256,214,3,553,507,196,156,...,US,United States,6,698,1,2017,3,2,F,DU


In [33]:
passengers.columns

Index(['departures_scheduled', 'departures_performed', 'payload', 'seats',
       'passengers', 'freight', 'mail', 'distance', 'ramp_to_ramp', 'air_time',
       'unique_carrier', 'airline_id', 'unique_carrier_name', 'region',
       'carrier', 'carrier_name', 'carrier_group', 'carrier_group_new',
       'origin_airport_id', 'origin_city_market_id', 'origin',
       'origin_city_name', 'origin_country', 'origin_country_name',
       'dest_airport_id', 'dest_city_market_id', 'dest', 'dest_city_name',
       'dest_country', 'dest_country_name', 'aircraft_group', 'aircraft_type',
       'aircraft_config', 'month', 'year', 'distance_group', 'class',
       'data_source'],
      dtype='object')

In [34]:
flights.dest.unique()

array(['LCH', 'IAH', 'CLE', 'DCA', 'MCI', 'BHM', 'TYS', 'EWR', 'JAX',
       'PWM', 'ORD', 'MLU', 'CAK', 'CHS', 'PIA', 'DFW', 'ORF', 'OKC',
       'CLT', 'PVD', 'ILM', 'CVG', 'PIT', 'GRR', 'IAD', 'RDU', 'CRW',
       'MLB', 'PHL', 'ROC', 'HNL', 'OGG', 'KOA', 'LIH', 'BNA', 'DAL',
       'GSP', 'HOU', 'LAS', 'MCO', 'MDW', 'SAN', 'SAT', 'STL', 'ATL',
       'LAX', 'OAK', 'BWI', 'RSW', 'BOS', 'CMH', 'DEN', 'LGA', 'MSP',
       'MSY', 'TPA', 'PHX', 'SFO', 'SJC', 'BDL', 'FLL', 'MEM', 'MHT',
       'ABQ', 'AMA', 'AUS', 'IND', 'LBB', 'MAF', 'MKE', 'ONT', 'SLC',
       'SMF', 'OMA', 'BUR', 'SDF', 'SNA', 'BUF', 'SJU', 'BOI', 'ELP',
       'TUL', 'GEG', 'PDX', 'RNO', 'DTW', 'SEA', 'ISP', 'TUS', 'LGB',
       'PSP', 'BZN', 'DLH', 'MYR', 'MFE', 'MIA', 'ITH', 'RIC', 'SAV',
       'RST', 'LEX', 'XNA', 'MTJ', 'LIT', 'ROA', 'FAR', 'SGF', 'GJT',
       'EVV', 'MSN', 'HLN', 'BIS', 'MOT', 'PSC', 'ATW', 'SYR', 'CWA',
       'ACY', 'LBE', 'IAG', 'STX', 'CAE', 'TLH', 'TVC', 'ABI', 'GNV',
       'JFK', 'GPT',

In [35]:
flights.mkt_carrier_fl_num.unique()

array([4264, 4266, 4272, ..., 6682, 9301, 6751], dtype=int64)

In [36]:
flights.mkt_unique_carrier.unique()

array(['UA', 'AA', 'HA', 'WN', 'NK', 'DL', 'B6', 'AS', 'G4', 'F9', 'VX'],
      dtype=object)

In [37]:
fuel.unique_carrier.unique()

array(['0WQ', '23Q', '27Q', '5V', '5X', '5Y', '8C', '9E', '9S', 'AA',
       'ABX', 'AS', 'B6', 'CP', 'DL', 'EE', 'EV', 'F9', 'FX', 'G4', 'G7',
       'GFQ', 'GL', 'HA', 'WL', 'KAQ', 'KD', 'KH', 'KLQ', 'L2', 'M6',
       'MQ', 'N8', 'NC', 'NK', 'OH', 'OO', 'PFQ', 'PO', 'PRQ', 'QX', 'S5',
       'SY', 'U7', 'UA', 'VX', 'WE', 'WI', 'WP', 'X9', 'XP', 'YV', 'YX',
       'ZW', 'WN', '09Q', '1BQ', '0JQ', 'FCQ', 'US', '2HQ', '3EQ'],
      dtype=object)

In [38]:
flights.columns

Index(['fl_date', 'mkt_unique_carrier', 'branded_code_share', 'mkt_carrier',
       'mkt_carrier_fl_num', 'op_unique_carrier', 'tail_num',
       'op_carrier_fl_num', 'origin_airport_id', 'origin', 'origin_city_name',
       'dest_airport_id', 'dest', 'dest_city_name', 'crs_dep_time', 'dep_time',
       'dep_delay', 'taxi_out', 'wheels_off', 'wheels_on', 'taxi_in',
       'crs_arr_time', 'arr_time', 'arr_delay', 'cancelled',
       'cancellation_code', 'diverted', 'dup', 'crs_elapsed_time',
       'actual_elapsed_time', 'air_time', 'flights', 'distance',
       'carrier_delay', 'weather_delay', 'nas_delay', 'security_delay',
       'late_aircraft_delay'],
      dtype='object')

In [None]:
# Flights Columns
# ['fl_date', 'mkt_unique_carrier', 'branded_code_share', 'mkt_carrier', 'mkt_carrier_fl_num', 'op_unique_carrier', 'tail_num', 'op_carrier_fl_num', 'origin_airport_id', 'origin', 'origin_city_name', 'dest_airport_id', 'dest', 'dest_city_name', 'crs_dep_time', 'dep_time', 'dep_delay', 'taxi_out', 'wheels_off', 'wheels_on', 'taxi_in', 'crs_arr_time', 'arr_time', 'arr_delay', 'cancelled', 'cancellation_code', 'diverted', 'dup', 'crs_elapsed_time', 'actual_elapsed_time', 'air_time', 'flights', 'distance', 'carrier_delay', 'weather_delay', 'nas_delay', 'security_delay', 'late_aircraft_delay']

# Don't Drop
# ['fl_date', 'mkt_unique_carrier', 'branded_code_share', 'mkt_carrier', 'mkt_carrier_fl_num', 'op_unique_carrier', 'tail_num', 'op_carrier_fl_num', 'origin_airport_id', 'origin', 'origin_city_name', 'dest_airport_id', 'dest', 'dest_city_name', 'crs_dep_time', 'dep_time', 'taxi_out', 'wheels_off', 'wheels_on', 'taxi_in', 'crs_arr_time', 'arr_time', 'arr_delay', 'cancelled', 'cancellation_code', 'diverted', 'dup', 'crs_elapsed_time', 'actual_elapsed_time', 'air_time', 'flights', 'distance', 'carrier_delay', 'weather_delay', 'nas_delay', 'security_delay', 'late_aircraft_delay']

In [25]:
#flights_clean = flights.drop(['fl_date', 'mkt_unique_carrier', 'branded_code_share', 'mkt_carrier', 'mkt_carrier_fl_num', 'op_unique_carrier', 'tail_num', 'op_carrier_fl_num', 'origin_airport_id', 'origin', 'origin_city_name', 'dest_airport_id', 'dest', 'dest_city_name', 'crs_dep_time', 'dep_time', 'dep_delay', 'taxi_out', 'wheels_off', 'wheels_on', 'taxi_in', 'crs_arr_time', 'arr_time', 'arr_delay', 'cancelled', 'cancellation_code', 'diverted', 'dup', 'crs_elapsed_time', 'actual_elapsed_time', 'air_time', 'flights', 'distance', 'carrier_delay', 'weather_delay', 'nas_delay', 'security_delay', 'late_aircraft_delay'], axis=1)

In [42]:
#flights_clean_sample = flights_clean.head()
flights_clean_sample = flights.head()

In [43]:
flights_clean_sample

Unnamed: 0,fl_date,mkt_unique_carrier,branded_code_share,mkt_carrier,mkt_carrier_fl_num,op_unique_carrier,tail_num,op_carrier_fl_num,origin_airport_id,origin,...,crs_elapsed_time,actual_elapsed_time,air_time,flights,distance,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay
0,2019-05-19,UA,UA_CODESHARE,UA,4264,EV,N48901,4264,12266,IAH,...,52.0,51.0,26.0,1,127,0.0,0.0,0.0,0.0,0.0
1,2019-05-19,UA,UA_CODESHARE,UA,4266,EV,N12540,4266,13244,MEM,...,112.0,102.0,81.0,1,468,0.0,0.0,0.0,0.0,0.0
2,2019-05-19,UA,UA_CODESHARE,UA,4272,EV,N11164,4272,12266,IAH,...,176.0,184.0,143.0,1,1091,0.0,0.0,0.0,0.0,0.0
3,2019-05-19,UA,UA_CODESHARE,UA,4281,EV,N13995,4281,11042,CLE,...,80.0,68.0,49.0,1,310,0.0,0.0,0.0,0.0,0.0
4,2019-05-19,UA,UA_CODESHARE,UA,4286,EV,N13903,4286,13061,LRD,...,76.0,80.0,57.0,1,301,0.0,0.0,0.0,0.0,0.0


In [39]:
# Don't Drop!
# ['month', 'airline_id', 'unique_carrier', 'carrier', 'carrier_name','carrier_group_new', 'total_gallons', 'total_cost', 'year']

In [44]:
fuel_clean = fuel.drop(['sdomt_gallons', 'satl_gallons', 'spac_gallons', 'slat_gallons', 'sint_gallons', 'ts_gallons', 'tdomt_gallons', 'tint_gallons', 'sdomt_cost', 'satl_cost', 'spac_cost', 'slat_cost', 'sint_cost', 'ts_cost', 'tdomt_cost', 'tint_cost'], axis=1)

In [45]:
fuel_clean_sample = fuel_clean.head()

In [46]:
fuel_clean_sample

Unnamed: 0,month,airline_id,unique_carrier,carrier,carrier_name,carrier_group_new,total_gallons,total_cost,year
0,1,21352.0,0WQ,0WQ,Avjet Corporation,1,210112.0,536455,2016
1,1,21645.0,23Q,23Q,Songbird Airways Inc.,1,0.0,0,2016
2,1,21652.0,27Q,27Q,"Jet Aviation Flight Services, Inc.",1,0.0,0,2016
3,1,20408.0,5V,5V,Tatonduk Outfitters Limited d/b/a Everts Air A...,1,284362.0,569497,2016
4,1,19917.0,5X,5X,United Parcel Service,3,60969000.0,65339000,2016



In [50]:
### LARGE DATA -- MEMORY ERROR
#flights_fuel_joined = flights.merge(fuel_clean, how='inner', left_on='mkt_unique_carrier', right_on='unique_carrier')
flights_fuel_joined = flights_clean_sample.merge(fuel_clean_sample, how='left', left_on='mkt_unique_carrier', right_on='unique_carrier')

In [55]:
flights_fuel_joined.iloc[:, 30:]

Unnamed: 0,air_time,flights,distance,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay,month,airline_id,unique_carrier,carrier,carrier_name,carrier_group_new,total_gallons,total_cost,year
0,26.0,1,127,0.0,0.0,0.0,0.0,0.0,,,,,,,,,
1,81.0,1,468,0.0,0.0,0.0,0.0,0.0,,,,,,,,,
2,143.0,1,1091,0.0,0.0,0.0,0.0,0.0,,,,,,,,,
3,49.0,1,310,0.0,0.0,0.0,0.0,0.0,,,,,,,,,
4,57.0,1,301,0.0,0.0,0.0,0.0,0.0,,,,,,,,,
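
The all-NaN fuel columns above are expected for this sample: the first five `flights` rows all belong to carrier `UA`, while the first five `fuel_clean` rows cover `0WQ`, `23Q`, `27Q`, `5V` and `5X`, so the left join finds no match. A likely contributor to the memory error on the full data is that `fuel` holds many rows per carrier (one per month and year), so every flight row is repeated once per matching fuel row. A minimal sketch of one workaround, on made-up toy frames, is to aggregate the fuel table to one row per carrier before merging; averaging over all months is an assumption made here for brevity.

```python
import pandas as pd

# Toy stand-ins for `flights` and `fuel_clean`; values are invented.
flights_toy = pd.DataFrame({
    "mkt_unique_carrier": ["UA", "UA", "AA"],
    "distance": [127, 468, 1091],
})
fuel_toy = pd.DataFrame({
    "unique_carrier": ["UA", "UA", "AA", "5X"],
    "month": [1, 2, 1, 1],
    "total_gallons": [100.0, 300.0, 50.0, 10.0],
    "total_cost": [200, 600, 100, 30],
})

# Collapse the fuel table to a single row per carrier.
fuel_per_carrier = (
    fuel_toy.groupby("unique_carrier", as_index=False)[["total_gallons", "total_cost"]]
            .mean()
)

# `validate` raises if the right side still has duplicate keys,
# which guards against accidental row multiplication.
joined = flights_toy.merge(
    fuel_per_carrier,
    how="left",
    left_on="mkt_unique_carrier",
    right_on="unique_carrier",
    validate="many_to_one",
)
print(joined)
```

The joined frame keeps exactly one row per flight. Aggregating per carrier and month instead, and joining on both keys, would preserve seasonal variation in fuel use.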


In [None]:
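# Join on the marketing carrier code, the same key used for the sample join above.
flights_fuel_joined = flights.merge(fuel, left_on='mkt_unique_carrier', right_on='unique_carrier')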

In [18]:
#df_trans_accts_joined = df_transactions.merge(df_accounts, left_on='acct_nbr', right_on='acct_nbr')
#df_master = df_trans_accts_joined.merge(df_customer, left_on='cust_id',right_on='cust_id')

In [13]:
flights.op_unique_carrier.unique()

array(['EV', 'MQ', 'OH', 'HA', 'WN', 'UA', 'C5', 'ZW', 'AX', 'G7', 'NK',
       'AA', 'PT', 'CP', 'YX', '9E', 'OO', 'B6', 'DL', 'YV', 'QX', 'KS',
       'G4', 'AS', 'F9', 'EM', 'VX', '9K'], dtype=object)

In [19]:
passengers['payload'].head()

0     75800
1     72400
2     76000
3    104600
4     72700
Name: payload, dtype: int64

In [20]:
passengers.unique_carrier.unique()

array(['AA', 'OO', 'YX', '5X', 'HBQ', '1AQ', '7S', 'ABX', 'C5', 'EM',
       'F9', 'L2', 'SY', '9E', 'AS', 'AX', 'CH', 'MQ', 'B6', '9K', 'DL',
       'M6', 'VX', 'OH', 'PT', 'WN', 'KAH', 'UA', 'YV', 'FX', 'G7', 'QX',
       'WI', 'ZK', 'ZW', '04Q', '2HQ', 'NK', 'MW', 'EV', 'YR', '9X',
       '28Q', 'G4', '2JQ', 'HA', '22Q', 'CP', '2TQ', 'AJQ', '4B', 'KH',
       'NC', '3M', '2O', 'WP', '1WQ', 'SEB', 'ELL', 'RVQ', 'AAT', 'WRD',
       '1SQ', '2EQ', 'NEW', '1QQ', '3E', '38Q', '1YQ', 'VI', 'WST', 'KLQ',
       '1PQ', 'GCH', 'N8', '2LQ', 'AN', '3DQ', '30Q', '3BQ', 'AC', 'WL',
       'XP', '23Q', '2OQ', '07Q', '0BQ', '0CQ', '0Q', '0QQ', '13Q', '15Q',
       '17Q', '24Q', '2AQ', '2GQ', '3S', '3SD', '5C', '3U', '4C', '4M',
       '5D', '7I', '7L', '8I', '8R', '9V', '9W', 'A0', 'AB', 'AD', 'ADB',
       'AF', 'AI', 'AM', 'AR', 'AV', 'AY', 'AZ', 'BA', 'BR', 'BW', 'BX',
       'C8', 'CAZ', 'CC', 'CI', 'CK', 'CM', 'CRV', 'CX', 'CZ', 'DE',
       'DHQ', 'E7', 'EI', 'EK', 'EQ', 'EY', 'FI', 'FJ', 'F

In [11]:
fuel.tail()

Unnamed: 0,month,airline_id,unique_carrier,carrier,carrier_name,carrier_group_new,sdomt_gallons,satl_gallons,spac_gallons,slat_gallons,...,sdomt_cost,satl_cost,spac_cost,slat_cost,sint_cost,ts_cost,tdomt_cost,tint_cost,total_cost,year
3020,12,20377.0,X9,X9,Omni Air International LLC,2,0.0,0.0,0.0,0.0,...,0,0.0,0.0,0.0,0.0,0,1584314,4588387.0,6172701,2018
3021,12,20207.0,XP,XP,XTRA Airways,1,0.0,0.0,0.0,0.0,...,0,0.0,0.0,0.0,0.0,0,0,0.0,0,2018
3022,12,20378.0,YV,YV,Mesa Airlines Inc.,2,0.0,0.0,0.0,0.0,...,0,0.0,0.0,0.0,0.0,0,0,0.0,0,2018
3023,12,20452.0,YX,YX,Republic Airline,2,21048.0,0.0,0.0,0.0,...,50043,0.0,0.0,0.0,0.0,50043,50043,0.0,50043,2018
3024,12,19393.0,WN,WN,Southwest Airlines Co.,3,173203901.0,0.0,0.0,2440902.0,...,351339537,0.0,0.0,5730304.0,5730304.0,357069841,351820604,5730304.0,357550908,2018


### Feature Engineering

Feature engineering will play a crucial role in these problems. We have only a few attributes, so we need to create features that carry some predictive power.

- weather: we can use a weather API to look up the weather at the time of the scheduled departure and scheduled arrival.
- statistics (mean, median, std, min, max...): we can look at previous delays and compute descriptive statistics.
- airports encoding: we need to decide how to handle the airports and other categorical variables.
- time of the day: the delay probably depends on the airport traffic, which varies during the day.
- airport traffic
- unsupervised learning as feature engineering?
- **what are the additional options?**: Think about what else we could do to improve the model.
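
As an illustration of the time-of-day and airport-traffic ideas, the sketch below derives a departure hour, a coarse day period, and a simple traffic count. It assumes `crs_dep_time` is stored as an HHMM integer (for example 1435 for 14:35), which the outputs above do not show; the bin edges and the names `dep_hour`, `dep_period` and `origin_hour_traffic` are hypothetical.

```python
import pandas as pd

# Toy stand-in for `flights`; values are invented.
toy = pd.DataFrame({
    "origin": ["IAH", "IAH", "MEM", "IAH"],
    "crs_dep_time": [615, 645, 1435, 2359],
})

# Hour of scheduled departure; `% 24` folds a possible 2400 back to hour 0.
toy["dep_hour"] = (toy["crs_dep_time"] // 100) % 24

# Coarse periods with left-closed bins: [0, 6), [6, 12), [12, 18), [18, 24).
toy["dep_period"] = pd.cut(
    toy["dep_hour"],
    bins=[0, 6, 12, 18, 24],
    labels=["night", "morning", "afternoon", "evening"],
    right=False,
)

# Number of scheduled departures per origin airport and hour,
# used here as a simple proxy for airport traffic.
toy["origin_hour_traffic"] = (
    toy.groupby(["origin", "dep_hour"])["crs_dep_time"].transform("count")
)
print(toy)
```

On the real data the traffic count would normally be computed per date as well, so that it reflects how busy the airport is on a given day.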

In [10]:
from typing import Dict, Tuple


def rank_column_by_mean(df: pd.DataFrame,
                        to_rank_name: str,
                        rank_by_name: str,
                        method: str = 'dense',
                        ) -> Tuple[pd.Series, Dict]:
    """
    Converts a categorical column into an ordinal column using the pandas `groupby`, `transform` and `rank` methods.
    Because this is a wrapper function, limitations stem from those of its subcomponents.
    Parameters:
        `df`: pd.DataFrame - Input DataFrame that is expected to contain both the categorical column (`to_rank_name`) to be 
            ranked as well as the numeric column (`rank_by_name`) to use for ranking values.
        `to_rank_name`: str - The name of the categorical column to be ranked as a string. Must be a column in the
            input DataFrame (`df`).
        `rank_by_name`: str - The name of the numeric column to be used for ranking the categorical column. Must be 
            a column in the input DataFrame (`df`).
        `method`: str = 'dense' - the method to be used by the pandas `rank` method to determine ranking. Our default of 
            'dense' is sensible, but {'average', 'min', 'max', 'first', 'dense'} are all accepted here.
    Returns:
        Tuple[pd.Series, Dict] - A tuple with the transformed series and a dictionary containing the mappings to be applied
            on the test data.
    """
    # Replace every row's category with that category's mean of `rank_by_name`, then rank the means.
    _metric_series = df.groupby(to_rank_name)[rank_by_name].transform('mean')
    ranked_series = _metric_series.rank(method=method)

    # Rank the per-category means with the same `method`. With 'dense' this reproduces
    # `ranked_series` exactly; other methods rank rows, so their values depend on group sizes.
    _mapping = df.groupby(to_rank_name)[rank_by_name].mean().rank(method=method)
    rank_mapping = _mapping.to_dict()

    return ranked_series, rank_mapping

In [25]:
from typing import Dict, List, Tuple


def rank_features_by_mean(df: pd.DataFrame,
                          to_rank_names: List[str],
                          rank_by_name: str,
                          method: str = 'dense',
                          ) -> Tuple[pd.DataFrame, Dict[str, dict]]:
    """
    Applies `rank_column_by_mean` on a list of categorical features (expected subset of `df`.columns) to generate a pandas
    DataFrame of ranked columns as well as a dict of their mappings.
    Parameters:
        `df`: pd.DataFrame - Input DataFrame that is expected to contain both the categorical columns (`to_rank_names`) to be 
            ranked as well as the numeric column (`rank_by_name`) to use for ranking values.
        `to_rank_names`: List[str] - The list of names of the categorical columns to be ranked. Expected to be columns in the
            input DataFrame (`df`).
        `rank_by_name`: str - The name of the numeric column to be used for ranking the categorical columns. Must be 
            a column in the input DataFrame (`df`).
        `method`: str = 'dense' - the method to be used by the pandas `rank` method to determine ranking. Our default of 
            'dense' is sensible, but {'average', 'min', 'max', 'first', 'dense'} are all accepted here.
    Returns:
        Tuple[pd.DataFrame, Dict[str, dict]] - A tuple with the transformed DataFrame and a dictionary, keyed by column
            name, containing the mappings to be applied on the test data.
    """
    ranked_features = {}
    mappings = {}

    # Rank each categorical column separately and keep its category-to-rank mapping.
    for to_rank_name in to_rank_names:
        ranked_features[f'{to_rank_name}_ranked'], mappings[to_rank_name] = rank_column_by_mean(df,
                                                                                                to_rank_name,
                                                                                                rank_by_name,
                                                                                                method)

    return pd.DataFrame(ranked_features), mappings
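
The docstrings above say the returned mappings are meant to be applied on the test data. A minimal, self-contained sketch of that step on made-up toy frames follows; it rebuilds a dense-rank mapping inline rather than calling the functions above, and the median-rank fallback for categories that never appear in training is an assumption, not something the notebook prescribes.

```python
import pandas as pd

# Toy train/test frames; values are invented.
train = pd.DataFrame({
    "origin": ["IAH", "IAH", "MEM", "CLE"],
    "arr_delay": [10.0, 20.0, 5.0, 40.0],
})
test = pd.DataFrame({"origin": ["MEM", "CLE", "LRD"]})

# Category -> dense rank of its mean arrival delay, learned on train only.
mapping = (
    train.groupby("origin")["arr_delay"].mean().rank(method="dense").to_dict()
)

# Hypothetical fallback for unseen categories: the median training rank.
fallback = float(pd.Series(mapping).median())

test["origin_ranked"] = test["origin"].map(mapping).fillna(fallback)
print(test)
```

Fitting the mapping on training rows only keeps the target values of the test period out of the encoded feature.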

### Feature Selection / Dimensionality Reduction

We need to try different selection techniques to find out which one works best for our problems.

- Original features vs. PCA components?
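
To make the comparison concrete, the sketch below standardizes a small synthetic feature matrix and keeps as many principal components as are needed to explain 95% of the variance. The synthetic columns (a distance, a correlated elapsed time, and an unrelated departure hour) and the 95% threshold are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic features: elapsed time is almost a linear function of distance.
distance = rng.uniform(100, 2500, size=n)
elapsed = distance / 8 + rng.normal(0, 5, size=n)
dep_hour = rng.uniform(0, 24, size=n)
X = np.column_stack([distance, elapsed, dep_hour])

# PCA is scale-sensitive, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# A float in (0, 1) asks for the smallest number of components
# whose cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

print(X.shape, "->", X_pca.shape)
print(pca.explained_variance_ratio_)
```

Both `X_scaled` and `X_pca` can then be fed to the same model, and the validation scores compared.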

### Modeling

Use different ML techniques to predict each problem.

- linear / logistic / multinomial logistic regression
- Naive Bayes
- Random Forest
- SVM
- XGBoost
- An ensemble of your own choice
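
A simple way to compare several of these techniques is to run them through the same cross-validation loop. The sketch below does this for two scikit-learn regressors on synthetic data; the data, the two models and the choice of mean absolute error are assumptions for illustration, and XGBoost is left out only to keep the example free of extra dependencies.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the engineered flight features.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    # scikit-learn reports errors as negative scores, so flip the sign.
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    results[name] = -scores.mean()

for name, mae in results.items():
    print(f"{name}: MAE = {mae:.3f}")
```

For the flight data, a time-ordered split is usually a safer choice than plain k-fold, because the models will be scored on future dates.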

### Evaluation

You have data from 2018 and 2019 to develop models. Use different evaluation metrics for each problem and compare the performance of different models.

You are required to predict delays on **out of sample** data from the **first 7 days (1st-7th) of January 2020** and to share the file with LighthouseLabs. A sample submission can be found in the file **_sample_submission.csv_**
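
Because the final predictions are for a week that lies after the training period, it can help to validate the same way: hold out the last 7 days of the available data instead of a random sample. The sketch below shows such a split on a toy frame; the one-row-per-day data is invented.

```python
import pandas as pd

# Toy stand-in for `flights`: one row per day in December 2019.
toy = pd.DataFrame({
    "fl_date": pd.date_range("2019-12-01", "2019-12-31", freq="D"),
    "arr_delay": [float(i) for i in range(31)],
})

# Everything after the cutoff (the last 7 days) becomes the validation set.
cutoff = toy["fl_date"].max() - pd.Timedelta(days=7)
train = toy[toy["fl_date"] <= cutoff]
valid = toy[toy["fl_date"] > cutoff]

print(len(train), "training rows up to", train["fl_date"].max().date())
print(len(valid), "validation rows from", valid["fl_date"].min().date())
```

Any historical statistics used as features should be computed from the training part only, so the validation score reflects what will be available when predicting January 2020.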

======================================================================
## Stretch Tasks

### Multiclass Classification

The target variables are **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY**. These variables are continuous (minutes of delay), not binary, so we need additional transformations. For each delayed flight, exactly one of these variables should be 1 and the others 0.

A flight can have more than one type of delay with more than 0 minutes. In that case, set the type with the largest delay to 1 and the others to 0.
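
The transformation described above can be sketched with `idxmax`, which returns the column holding the largest value in each row. The toy rows are invented, and the `delay_type` column name is hypothetical; when two delay types are exactly equal, `idxmax` picks the first column in `delay_cols`.

```python
import pandas as pd

delay_cols = ["carrier_delay", "weather_delay", "nas_delay",
              "security_delay", "late_aircraft_delay"]

# Toy rows: one delay type, two delay types, and no delay at all.
toy = pd.DataFrame(
    [[15.0, 0.0, 0.0, 0.0, 0.0],
     [5.0, 0.0, 20.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 0.0]],
    columns=delay_cols,
)

# Keep only flights that were actually delayed.
delayed = toy[toy[delay_cols].sum(axis=1) > 0].copy()

# Single multiclass label: the delay type with the most minutes.
delayed["delay_type"] = delayed[delay_cols].idxmax(axis=1)

# Equivalent one-hot form, with a column for every delay type.
one_hot = (
    pd.get_dummies(delayed["delay_type"])
      .reindex(columns=delay_cols, fill_value=0)
      .astype(int)
)
print(delayed["delay_type"].tolist())
print(one_hot)
```

Most classifiers accept the single `delay_type` label directly, so the one-hot form is only needed for models that expect it.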

### Binary Classification

The target variable is **CANCELLED**. The main problem here is the huge class imbalance: cancelled flights make up only a small fraction of all flights. It is important to use an appropriate sampling strategy before training and to choose suitable evaluation metrics.
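
One possible way to handle the imbalance is sketched below on synthetic data: a stratified split keeps the share of positives equal in both parts, `class_weight="balanced"` reweights the rare class instead of resampling, and the model is scored with F1 and average precision rather than accuracy. The data, the logistic regression model and the metric choice are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary target standing in for `cancelled`.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
logits = 2.5 * X[:, 0] - 4.5
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

# Stratify so that train and test contain the same share of positives.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Weight each class inversely to its frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
proba = clf.predict_proba(X_test)[:, 1]

f1 = f1_score(y_test, pred)
ap = average_precision_score(y_test, proba)
print(f"positive rate: {y.mean():.3f}, F1: {f1:.3f}, average precision: {ap:.3f}")
```

Accuracy would look high here even for a model that never predicts a cancellation, which is why metrics focused on the positive class are used instead.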