# **Feature Selection and Engineering**
---

In [1]:
import pandas as pd

In [2]:
RANDOM_STATE = 42

In [3]:
flights_test = pd.read_csv('../files/flights_test_no_missing.csv')
feature_space = list(flights_test.columns)
feature_space.append('arr_delay') # Add the target.

flights = pd.read_csv('../files/flights_no_missing.csv')
flights_subspace = flights[feature_space].copy()
flights_subspace.head()

Unnamed: 0,fl_date,mkt_unique_carrier,branded_code_share,mkt_carrier,mkt_carrier_fl_num,op_unique_carrier,tail_num,op_carrier_fl_num,origin_airport_id,origin,...,dest_airport_id,dest,dest_city_name,crs_dep_time,crs_arr_time,dup,crs_elapsed_time,flights,distance,arr_delay
0,2019-05-19,UA,UA_CODESHARE,UA,4264,EV,N48901,4264,12266,IAH,...,12915,LCH,"Lake Charles, LA",1020,1112,N,52.0,1,127,-2.0
1,2019-05-19,UA,UA_CODESHARE,UA,4266,EV,N12540,4266,13244,MEM,...,12266,IAH,"Houston, TX",1148,1340,N,112.0,1,468,-14.0
2,2019-05-19,UA,UA_CODESHARE,UA,4272,EV,N11164,4272,12266,IAH,...,11042,CLE,"Cleveland, OH",1155,1551,N,176.0,1,1091,4.0
3,2019-05-19,UA,UA_CODESHARE,UA,4281,EV,N13995,4281,11042,CLE,...,11278,DCA,"Washington, DC",839,959,N,80.0,1,310,-20.0
4,2019-05-19,UA,UA_CODESHARE,UA,4286,EV,N13903,4286,13061,LRD,...,12266,IAH,"Houston, TX",710,826,N,76.0,1,301,-1.0


Size of the prediction set as a percentage of the train/test/validation set:

In [4]:
flights_test.shape[0] / flights.shape[0] * 100

28.165286444822605

### Objective:

Our aim for this section is to select and engineer a range of features that best capture the common causes of flight delays, using only the feature space available in the **flights_test** dataframe. That dataframe covers the target range of dates for our prediction submission (January 1 - 7, 2020, inclusive). The prediction submission dates are to be treated as if they are in the future for the purpose of generating an appropriate prediction; as such, no historic data for these dates should be considered. Features will be engineered and selected on the basis of how well they capture the different delay classes identified during the exploration phase of the study.

## **flights**

Features considered here should be engineered and selected only from the subset present in both **flights** and **flights_test**. Since we know the **flights_test** feature space is a subset of the **flights** feature space, we will simply use the columns in flights_test to filter the feature space of flights to prepare our data for transformation and modelling.
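The filtering step described above can be sketched on toy data. This is a minimal illustration, not the notebook's own code: the helper name `restrict_to_shared_columns` and the toy frames are hypothetical, and the explicit subset check is an added safeguard for the assumption that every test column also exists in the training data.

```python
import pandas as pd


def restrict_to_shared_columns(train, test, target):
    """Keep only the test-set columns, plus the target, in the training frame."""
    # Fail loudly if the "test columns are a subset" assumption does not hold.
    missing = [col for col in test.columns if col not in train.columns]
    if missing:
        raise ValueError(f"columns absent from the training data: {missing}")
    return train[list(test.columns) + [target]].copy()


# Toy frames standing in for flights and flights_test (hypothetical values).
toy_train = pd.DataFrame({
    'origin': ['IAH', 'MEM'],
    'distance': [127, 468],
    'dep_delay': [3.0, -1.0],    # only known after the flight, so not in the test set
    'arr_delay': [-2.0, -14.0],  # the target
})
toy_test = pd.DataFrame({'origin': ['CLE'], 'distance': [310]})

toy_subspace = restrict_to_shared_columns(toy_train, toy_test, target='arr_delay')
print(list(toy_subspace.columns))
```

Columns that would only be known after the flight has happened drop out automatically, because they never appear in the prediction set.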

### Feature Selection:

* All identifier columns can be discarded initially (except the `origin` and `dest` airport ids); they could later be ranked by delay if that turns out to be interesting and feasible.
* The `flights` column is always 1, so it is completely uninformative.
* The origin and destination city names are already captured by the origin and destination airports, but the `state` could be useful in a later iteration (e.g. binarized into something like `busy_state` for high-traffic origins/destinations).
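Two of the observations in this list are easy to check programmatically. The sketch below uses toy data and a hypothetical helper, `constant_columns`, to flag single-valued columns such as `flights`. It also shows one possible way to pull the state out of a city name, assuming names always follow the `"City, ST"` pattern seen in the preview above.

```python
import pandas as pd


def constant_columns(df):
    """Return the columns that hold a single distinct value (NaN counts as a value)."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]


# Toy frame mirroring a few columns of the flights data.
toy = pd.DataFrame({
    'flights': [1, 1, 1],
    'distance': [127, 468, 1091],
    'dest_city_name': ['Lake Charles, LA', 'Houston, TX', 'Cleveland, OH'],
})

uninformative = constant_columns(toy)

# Split on the last ", " only, so a comma inside the city part would not break the parse.
dest_state = toy['dest_city_name'].str.rsplit(', ', n=1).str[-1]

print(uninformative)
print(dest_state.tolist())
```

A `busy_state` flag could then be derived from `dest_state`, for example by thresholding flight counts per state; the threshold itself would need to be chosen from the data.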

In [5]:
to_drop = [
    'mkt_unique_carrier',
    'mkt_carrier',
    'mkt_carrier_fl_num',
    'op_unique_carrier',
    'tail_num',
    'op_carrier_fl_num',
    'origin',
    'dest',
    'dest_city_name',
    'origin_city_name',
    'flights']

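# Columns required by the submission format; keep them in flights_test even if
# they are dropped from the training features.
submission_columns = ['fl_date', 'mkt_carrier', 'mkt_carrier_fl_num', 'origin', 'dest']

flights_test_drop = sorted(set(to_drop) - set(submission_columns))

flights_subspace = flights_subspace.drop(columns=to_drop)
flights_test = flights_test.drop(columns=flights_test_drop)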

### Feature Engineering

In [6]:
# Parse the date once per frame, then derive the calendar features.
# dayofweek is 0 for Monday through 6 for Sunday.
for frame in (flights_subspace, flights_test):
    fl_date = pd.to_datetime(frame['fl_date'])
    frame['day_of_week'] = fl_date.dt.dayofweek
    frame['month_of_year'] = fl_date.dt.month

Now let us check how these features relate to the mean arrival delay.

In [7]:
flights_subspace.groupby('day_of_week')['arr_delay'].mean()

day_of_week
0    6.696040
1    4.567373
2    4.458588
3    7.273884
4    6.993864
5    2.512690
6    5.173920
Name: arr_delay, dtype: float64

In [8]:
flights_subspace.groupby('month_of_year')['arr_delay'].mean()

month_of_year
1      3.923047
2      6.693371
3      2.902349
4      4.071800
5      6.448396
6     10.562826
7      8.871938
8      8.883695
9      1.694430
10     2.810083
11     3.071748
12     5.087656
Name: arr_delay, dtype: float64

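As the group means above show, the categories can be ordered into discrete ranks by mean arrival delay, but the gaps between adjacent ranks are uneven: Tuesday and Wednesday (`day_of_week` 1 and 2) differ by only about 0.1 minutes, while Saturday (5) sits almost 2 minutes below the next-lowest day. Rank encoding discards these magnitudes, which is an inherent drawback of this method.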
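To make the drawback concrete, the sketch below ranks the `day_of_week` means printed above and measures the gap between each pair of adjacent ranks. It assumes that "this method" means replacing each category with its rank by mean delay; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Mean arr_delay per day_of_week, copied from the groupby output above.
day_means = pd.Series({
    0: 6.696040, 1: 4.567373, 2: 4.458588, 3: 7.273884,
    4: 6.993864, 5: 2.512690, 6: 5.173920,
})

# Rank 1 is the day with the lowest mean delay, rank 7 the highest.
day_ranks = day_means.rank(method='dense').astype(int)

# Difference in mean delay between each day and the day ranked just below it.
gaps = day_means.sort_values().diff().dropna()

print(day_ranks.sort_values())
print(gaps.round(3))
```

Every step between ranks counts as one unit after encoding, even though the underlying differences in mean delay range from roughly 0.1 to 1.9 minutes. Encoding each category with its mean directly (target encoding) would be one alternative that keeps those magnitudes, at the cost of needing care to avoid target leakage.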

In [12]:
flights_subspace = flights_subspace.drop('fl_date', axis=1)

flights_subspace.to_csv('../files/flights_engi.csv', index=False)
flights_test.to_csv('../files/flights_test_engi.csv', index=False)