# **Feature Selection and Engineering**
---

In [1]:
import pandas as pd

In [2]:
RANDOM_STATE = 42

In [3]:
flights_test = pd.read_csv('../files/flights_test_no_missing.csv')
feature_space = list(flights_test.columns)
feature_space.append('arr_delay') # Add the target.

flights = pd.read_csv('../files/flights_no_missing.csv')
flights_subspace = flights[feature_space].copy()
flights_subspace.head()

Unnamed: 0,fl_date,mkt_unique_carrier,branded_code_share,mkt_carrier,mkt_carrier_fl_num,op_unique_carrier,tail_num,op_carrier_fl_num,origin_airport_id,origin,...,dest_airport_id,dest,dest_city_name,crs_dep_time,crs_arr_time,dup,crs_elapsed_time,flights,distance,arr_delay
0,2019-05-19,UA,UA_CODESHARE,UA,4264,EV,N48901,4264,12266,IAH,...,12915,LCH,"Lake Charles, LA",1020,1112,N,52.0,1,127,-2.0
1,2019-05-19,UA,UA_CODESHARE,UA,4266,EV,N12540,4266,13244,MEM,...,12266,IAH,"Houston, TX",1148,1340,N,112.0,1,468,-14.0
2,2019-05-19,UA,UA_CODESHARE,UA,4272,EV,N11164,4272,12266,IAH,...,11042,CLE,"Cleveland, OH",1155,1551,N,176.0,1,1091,4.0
3,2019-05-19,UA,UA_CODESHARE,UA,4281,EV,N13995,4281,11042,CLE,...,11278,DCA,"Washington, DC",839,959,N,80.0,1,310,-20.0
4,2019-05-19,UA,UA_CODESHARE,UA,4286,EV,N13903,4286,13061,LRD,...,12266,IAH,"Houston, TX",710,826,N,76.0,1,301,-1.0


Size of the prediction set relative to the train/test/validation set, as a percentage:

In [4]:
flights_test.shape[0] / flights.shape[0] * 100

28.165286444822605

In [5]:
flights_subspace

Unnamed: 0,fl_date,mkt_unique_carrier,branded_code_share,mkt_carrier,mkt_carrier_fl_num,op_unique_carrier,tail_num,op_carrier_fl_num,origin_airport_id,origin,...,dest_airport_id,dest,dest_city_name,crs_dep_time,crs_arr_time,dup,crs_elapsed_time,flights,distance,arr_delay
0,2019-05-19,UA,UA_CODESHARE,UA,4264,EV,N48901,4264,12266,IAH,...,12915,LCH,"Lake Charles, LA",1020,1112,N,52.0,1,127,-2.0
1,2019-05-19,UA,UA_CODESHARE,UA,4266,EV,N12540,4266,13244,MEM,...,12266,IAH,"Houston, TX",1148,1340,N,112.0,1,468,-14.0
2,2019-05-19,UA,UA_CODESHARE,UA,4272,EV,N11164,4272,12266,IAH,...,11042,CLE,"Cleveland, OH",1155,1551,N,176.0,1,1091,4.0
3,2019-05-19,UA,UA_CODESHARE,UA,4281,EV,N13995,4281,11042,CLE,...,11278,DCA,"Washington, DC",839,959,N,80.0,1,310,-20.0
4,2019-05-19,UA,UA_CODESHARE,UA,4286,EV,N13903,4286,13061,LRD,...,12266,IAH,"Houston, TX",710,826,N,76.0,1,301,-1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2339957,2019-05-19,AA,AA,AA,2025,AA,N185AN,2025,11298,DFW,...,13303,MIA,"Miami, FL",1156,1545,N,169.0,1,1121,131.0
2339958,2019-05-19,AA,AA_CODESHARE,AA,4876,PT,N658AE,4876,13577,MYR,...,14100,PHL,"Philadelphia, PA",1132,1314,N,102.0,1,473,0.0
2339959,2019-05-19,AA,AA_CODESHARE,AA,4879,PT,N647AE,4879,10792,BUF,...,14100,PHL,"Philadelphia, PA",1520,1652,N,92.0,1,279,-15.0
2339960,2019-05-19,UA,UA_CODESHARE,UA,4223,EV,N15912,4223,12266,IAH,...,12448,JAN,"Jackson/Vicksburg, MS",1640,1800,N,80.0,1,351,-5.0


### Objective:

Our aim for this section is to use the feature space available in the **flights_test** dataframe, which covers the target date range for our prediction submission (January 1 - 7, 2020, inclusive), to select and engineer features that best capture the common causes of flight delays. The prediction submission dates are to be treated as if they lie in the future; as such, no historic data for these dates should be considered. Features will be engineered and selected on the basis of how well they captured the different delay classes available to us during the exploration phase of the study.

## **flights**

Features considered here should be engineered and selected only from the subset present in both **flights** and **flights_test**. Since we know the **flights_test** feature space is a subset of the **flights** feature space, we will simply use the columns in flights_test to filter the feature space of flights to prepare our data for transformation and modelling.
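
The subset assumption can be checked before filtering. The following is a minimal sketch on toy frames; `flights_toy`, `flights_test_toy` and their values are illustrative stand-ins, not the real data.

```python
import pandas as pd

# Toy stand-ins for flights and flights_test (illustrative values only).
flights_toy = pd.DataFrame({
    'fl_date': ['2019-05-19'],
    'origin': ['IAH'],
    'dep_delay': [3.0],
    'arr_delay': [-2.0],
})
flights_test_toy = pd.DataFrame({'fl_date': ['2020-01-01'], 'origin': ['IAH']})

# Columns present in the prediction set but missing from the training data.
missing = set(flights_test_toy.columns) - set(flights_toy.columns)
is_subset = not missing

# Keep the prediction-set columns (in their original order) plus the target.
shared = [col for col in flights_test_toy.columns if col in flights_toy.columns]
subspace_toy = flights_toy[shared + ['arr_delay']]
```

If `missing` were non-empty, indexing the training frame with the prediction-set columns would raise a `KeyError`, so the check is cheap insurance.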

### Feature Selection:

* All ID columns can be discarded initially, except `origin_airport_id` and `dest_airport_id`, though they may also be ranked by delay if that turns out to be interesting/feasible
* The `flights` column is always 1, so it is completely uninformative
* The information in the destination and origin city names is already captured by the origin and destination airports, but the `state` could be useful in a later iteration (e.g. binarized to something like `busy_state` for high-traffic origins/destinations)
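
Two of the points above can be checked mechanically. This sketch uses a toy frame (`toy` and its values are illustrative) to find constant columns and to pull the state abbreviation out of a city name, assuming the "City, ST" format seen in the table output.

```python
import pandas as pd

toy = pd.DataFrame({
    'flights': [1, 1, 1],
    'distance': [127, 468, 351],
    'dest_city_name': ['Lake Charles, LA', 'Houston, TX', 'Jackson/Vicksburg, MS'],
})

# A column with a single distinct value carries no information for a model.
constant_columns = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]

# Split on the last ", " only, so city names containing other punctuation survive.
dest_state = toy['dest_city_name'].str.rsplit(', ', n=1).str[-1]
```

A `busy_state` flag, as suggested above, could then be derived from `dest_state`; which states count as busy is left open here.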

In [6]:
to_drop = [
    'mkt_unique_carrier',
    'mkt_carrier',
    'mkt_carrier_fl_num',
    'op_unique_carrier',
    'tail_num',
    'op_carrier_fl_num',
    'origin',
    'dest',
    'dest_city_name',
    'origin_city_name',
    'flights']

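# Columns named in submission_columns must stay in flights_test.
submission_columns = ['fl_date', 'mkt_carrier', 'mkt_carrier_fl_num', 'origin', 'dest']

# Sort the set difference so the dropped columns have a deterministic order.
flights_test_drop = sorted(set(to_drop) - set(submission_columns))

flights_subspace = flights_subspace.drop(columns=to_drop)
flights_test = flights_test.drop(columns=flights_test_drop)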

In [7]:
flights_subspace.head()

Unnamed: 0,fl_date,branded_code_share,origin_airport_id,dest_airport_id,crs_dep_time,crs_arr_time,dup,crs_elapsed_time,distance,arr_delay
0,2019-05-19,UA_CODESHARE,12266,12915,1020,1112,N,52.0,127,-2.0
1,2019-05-19,UA_CODESHARE,13244,12266,1148,1340,N,112.0,468,-14.0
2,2019-05-19,UA_CODESHARE,12266,11042,1155,1551,N,176.0,1091,4.0
3,2019-05-19,UA_CODESHARE,11042,11278,839,959,N,80.0,310,-20.0
4,2019-05-19,UA_CODESHARE,13061,12266,710,826,N,76.0,301,-1.0


In [8]:
flights_subspace.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2339962 entries, 0 to 2339961
Data columns (total 10 columns):
 #   Column              Dtype  
---  ------              -----  
 0   fl_date             object 
 1   branded_code_share  object 
 2   origin_airport_id   int64  
 3   dest_airport_id     int64  
 4   crs_dep_time        int64  
 5   crs_arr_time        int64  
 6   dup                 object 
 7   crs_elapsed_time    float64
 8   distance            int64  
 9   arr_delay           float64
dtypes: float64(2), int64(5), object(3)
memory usage: 178.5+ MB


### Feature Engineering

In [9]:
# Parse each date column once and derive both calendar features from it.
# dayofweek: Monday=0 ... Sunday=6; month: January=1 ... December=12.
for frame in (flights_subspace, flights_test):
    fl_dates = pd.to_datetime(frame['fl_date'])
    frame['day_of_week'] = fl_dates.dt.dayofweek
    frame['month_of_year'] = fl_dates.dt.month
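
As a quick check of the encoding used for these two features, here is a small sketch on two sample dates (the `dates` series is illustrative). pandas numbers weekdays from Monday=0 to Sunday=6 and months from 1 to 12.

```python
import pandas as pd

dates = pd.Series(['2019-05-19', '2020-01-01'])
parsed = pd.to_datetime(dates)  # parse once, reuse for every derived feature

day_of_week = parsed.dt.dayofweek  # 2019-05-19 is a Sunday, 2020-01-01 a Wednesday
month_of_year = parsed.dt.month
```

This matches the value of 6 shown for `day_of_week` on the 2019-05-19 rows.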

Now let us check the relationship with arrival delay for these features.

In [10]:
flights_subspace.groupby('day_of_week')['arr_delay'].mean()

day_of_week
0    6.696040
1    4.567373
2    4.458588
3    7.273884
4    6.993864
5    2.512690
6    5.173920
Name: arr_delay, dtype: float64

In [11]:
flights_subspace.groupby('month_of_year')['arr_delay'].mean()

month_of_year
1      3.923047
2      6.693371
3      2.902349
4      4.071800
5      6.448396
6     10.562826
7      8.871938
8      8.883695
9      1.694430
10     2.810083
11     3.071748
12     5.087656
Name: arr_delay, dtype: float64

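The groupings above can be ordered into discrete ranks by mean delay, but ranks discard the size of the gaps between classes, which is an inherent drawback of this method. For example, days 1 and 2 differ by only about 0.11 minutes on average while days 5 and 2 differ by almost 2 minutes, yet each pair would occupy adjacent ranks.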
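
The drawback can be made concrete with the `day_of_week` means copied from the output above. This is a sketch only; `day_means`, `ranks` and `gaps` are names introduced here for illustration.

```python
import pandas as pd

# Mean arr_delay per day_of_week, copied from the groupby output above.
day_means = pd.Series(
    [6.696040, 4.567373, 4.458588, 7.273884, 6.993864, 2.512690, 5.173920],
    index=pd.RangeIndex(7, name='day_of_week'),
)

# Dense ranks: 1 for the lowest mean delay, 7 for the highest.
ranks = day_means.rank(method='dense').astype(int)

# Difference in mean delay between each rank and the one directly below it.
gaps = day_means.sort_values().diff().dropna()
```

Every step between adjacent ranks is 1, while the underlying gaps in `gaps` range from roughly 0.11 to roughly 1.95 minutes.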

In [12]:
flights_subspace = flights_subspace.drop('fl_date', axis=1)

flights_subspace.to_csv('../files/flights_engi.csv', index=False)
flights_test.to_csv('../files/flights_test_engi.csv', index=False)

In [13]:
flights_subspace.head()

Unnamed: 0,branded_code_share,origin_airport_id,dest_airport_id,crs_dep_time,crs_arr_time,dup,crs_elapsed_time,distance,arr_delay,day_of_week,month_of_year
0,UA_CODESHARE,12266,12915,1020,1112,N,52.0,127,-2.0,6,5
1,UA_CODESHARE,13244,12266,1148,1340,N,112.0,468,-14.0,6,5
2,UA_CODESHARE,12266,11042,1155,1551,N,176.0,1091,4.0,6,5
3,UA_CODESHARE,11042,11278,839,959,N,80.0,310,-20.0,6,5
4,UA_CODESHARE,13061,12266,710,826,N,76.0,301,-1.0,6,5


In [76]:
flights_subspace.values

array([['UA_CODESHARE', 12266, 12915, ..., -2.0, 6, 5],
       ['UA_CODESHARE', 13244, 12266, ..., -14.0, 6, 5],
       ['UA_CODESHARE', 12266, 11042, ..., 4.0, 6, 5],
       ...,
       ['AA_CODESHARE', 10792, 14100, ..., -15.0, 6, 5],
       ['UA_CODESHARE', 12266, 12448, ..., -5.0, 6, 5],
       ['UA_CODESHARE', 10868, 12266, ..., 21.0, 6, 5]], dtype=object)

In [77]:
flights_subspace.branded_code_share.unique()

array(['UA_CODESHARE', 'AA_CODESHARE', 'HA', 'WN', 'UA', 'NK', 'AA',
       'DL_CODESHARE', 'B6', 'AS_CODESHARE', 'DL', 'G4', 'AS', 'F9',
       'HA_CODESHARE', 'VX'], dtype=object)

In [74]:
from sklearn.decomposition import PCA

**CUT HERE**

In [17]:
import sklearn.model_selection as ms
# Note: `pro` is sklearn.preprocessing. To import the custom processing.py instead,
# copy it from the Scripts directory into this one.
import sklearn.preprocessing as pro

In [18]:
df_train, df_test = ms.train_test_split(flights_subspace, random_state=RANDOM_STATE, test_size=0.3)

Sanity check to verify sizes.

In [19]:
df_train.shape[0] / flights.shape[0]

0.699999829057053

In [33]:
df_test.shape[0] / flights.shape[0]

0.30000017094294695

We split into two dataframes first so that our custom ranking implementation in [processing.py](../../Scripts/processing.py) can learn its rankings from the training rows and return mappings to apply to the test rows.
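
The fit-on-train, apply-on-test pattern can be sketched on toy data as follows. The frames, the unseen ID `99999` and the median-rank fallback are all assumptions made for illustration; they are not taken from processing.py.

```python
import pandas as pd

train = pd.DataFrame({
    'origin_airport_id': [12266, 12266, 13244, 11042, 11042],
    'arr_delay': [-2.0, 4.0, -14.0, -20.0, 10.0],
})
test = pd.DataFrame({'origin_airport_id': [13244, 12266, 99999]})  # 99999 is unseen

# Learn the category -> rank mapping from the training rows only.
group_means = train.groupby('origin_airport_id')['arr_delay'].mean()
mapping = group_means.rank(method='dense').to_dict()

train['origin_airport_id_ranked'] = train['origin_airport_id'].map(mapping)

# Unseen categories map to NaN; the median rank is one possible fallback (assumption).
fallback = pd.Series(mapping).median()
test['origin_airport_id_ranked'] = test['origin_airport_id'].map(mapping).fillna(fallback)
```

Because the mapping never sees the test rows' `arr_delay`, no target information leaks from the test split into the feature.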

#### Categorical features:

`fl_date` can be used to add `day_of_week` and `month_of_year` columns. These could also be reordered by rank of expected delay, as long as each change is recorded for the sake of interpretability.

`origin_airport_id` and `dest_airport_id` could also be ordinalized based on expected delay. The IDs should be used and the codes discarded because the IDs will always refer to the same airport across years.

In [65]:
from typing import Dict, List, Optional, Sequence, Tuple, Union

In [70]:
import processing_jonas


TypeError: 'type' object is not subscriptable

In [68]:
def rank_column_by_mean(df: pd.DataFrame,
                        to_rank_name: str,
                        rank_by_name: str,
                        method: str = 'dense',
                        ):
    """
    Converts a categorical column into an ordinal column using the pandas `groupby`, `rank` and `map` methods.
    Because this is a wrapper function, its limitations stem from those of its subcomponents.

    Parameters:

        `df`: pd.DataFrame - Input DataFrame that is expected to contain both the categorical column (`to_rank_name`) to be 
            ranked as well as the numeric column (`rank_by_name`) to use for ranking values.

        `to_rank_name`: str - The name of the categorical column to be ranked as a string. Must be a column in the
            input DataFrame (`df`).

        `rank_by_name`: str - The name of the numeric column to be used for ranking the categorical column. Must be 
            a column in the input DataFrame (`df`).

        `method`: str = 'dense' - the method to be used by the pandas `rank` method to determine ranking. Our default of 
            'dense' is sensible, but {'average', 'min', 'max', 'first', 'dense'} are all accepted here.

    Returns:

        tuple[pd.Series, dict] - A tuple with the transformed series and a dictionary containing the mappings to be applied
            on the test data.

    """
    # Rank each category by its mean of `rank_by_name`, honouring `method`
    # (previously the mapping silently used the pandas default).
    _mapping = df.groupby(to_rank_name)[rank_by_name].mean().rank(method=method)
    rank_mapping = _mapping.to_dict()

    # Map the rows through the same mapping so the returned series and the
    # mapping always agree, whichever `method` is chosen.
    ranked_series = df[to_rank_name].map(rank_mapping)

    return (ranked_series, rank_mapping)

In [61]:
def rank_features_by_mean(df: pd.DataFrame,
                          to_rank_names: list,
                          rank_by_name: str,
                          method: str = 'dense',
                          ):
    """
    Applies `rank_column_by_mean` on a list of categorical features (expected subset of `df`.columns) to generate a pandas
    DataFrame of ranked columns as well as a dict of their mappings.

    Parameters:

        `df`: pd.DataFrame - Input DataFrame that is expected to contain both the categorical columns (`to_rank_names`) to be 
            ranked as well as the numeric column (`rank_by_name`) to use for ranking values.

        `to_rank_names`: list - The list of names of the categorical columns to be ranked. Expected to be columns in the
            input DataFrame (`df`).

        `rank_by_name`: str - The name of the numeric column to be used for ranking the categorical column. Must be 
            a column in the input DataFrame (`df`).

        `method`: str = 'dense' - the method to be used by the pandas `rank` method to determine ranking. Our default of 
            'dense' is sensible, but {'average', 'min', 'max', 'first', 'dense'} are all accepted here.

    Returns:

        tuple[pd.DataFrame, dict[dict]] - A tuple with the transformed DataFrame and a dictionary containing the mappings 
            to be applied on the test data.

    """
    ranked_features = {}
    mappings = {}

    for to_rank_name in to_rank_names:
        ranked_features[f'{to_rank_name}_ranked'], mappings[to_rank_name] = rank_column_by_mean(df,
                                                                                                to_rank_name,
                                                                                                rank_by_name,
                                                                                                method)

    return pd.DataFrame(ranked_features), mappings


In [59]:
type(pro)

module

In [60]:
categorical_to_ordinal = ['origin_airport_id',
                          'dest_airport_id',
                          'month_of_year',
                          'day_of_week']

# `pro` is bound to sklearn.preprocessing, which has no `rank_features_by_mean`;
# call the notebook-level function defined above instead.
ordinal_df, ordinal_mappings = rank_features_by_mean(df_train, categorical_to_ordinal, 'arr_delay')

`dup` and `branded_code_share` can be binarized.
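
One way the binarization could look is sketched below on a toy frame. Two things are assumptions: that `'Y'` marks a duplicate (only `'N'` appears in the output above), and that `code_share` should flag the `_CODESHARE` suffix seen in the `branded_code_share` values.

```python
import pandas as pd

toy = pd.DataFrame({
    'dup': ['N', 'N', 'Y'],
    'branded_code_share': ['UA_CODESHARE', 'AA', 'WN'],
})

# Assumption: 'Y' marks a duplicate; only 'N' is visible in the output above.
toy['dup'] = (toy['dup'] == 'Y').astype(int)

# 1 when the branded carrier value carries the _CODESHARE suffix, else 0.
toy['code_share'] = toy['branded_code_share'].str.endswith('_CODESHARE').astype(int)
toy = toy.drop(columns='branded_code_share')
```

Note that this encoding drops the carrier identity and keeps only whether the flight is a code share.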

In [32]:
numeric_features = [
    'distance',
    'crs_arr_time',
    'crs_dep_time',
    'crs_elapsed_time',
]

# Ranked in terms of correlation with arr_delay from low to high.
ordinal_features = [
    'origin_delay_rank',
    'dest_delay_rank',
    'day_of_week_rank',
    'month_of_year_rank',
]

binary_features = [
    'dup',
    'code_share',
]