# Automated Feature Engineering and Selection for Titanic Dataset

The original was done by ***Liana Napalkova*** on ***3 September 2018***.

This edit:
***LGB & Edits - Jan 23 2019***

# Table of contents
1. [Introduction](#introduction)
2. [Load data](#load_data)
3. [Clean data](#clean_data)
4. [Automated feature engineering](#afe)
5. ["Curse of dimensionality": Feature reduction and selection](#frs)
6. [Training and testing the simple model](#ttm)

## 1. Introduction <a name="introduction"></a>

If you have ever manually created hundreds of features for your ML project (I am sure you have), you will be happy to find out how the Python package "featuretools" can help with this task. The good news is that this package is very easy to use. It is aimed at automated feature engineering. **Of course human expertise cannot be substituted**, but nevertheless **"featuretools" can automate a large amount of routine work**. For exploration purposes I will use the well-known Titanic dataset. The achieved results will be compared to the results obtained with handcrafted feature engineering and manual hyperparameter optimisation.

The main takeaways from this notebook are:

* Firstly, going from 11 total features to 146 using automated feature engineering ("featuretools" package).
* <s>Secondly, applying the feature reduction and selection methods to select X most relevant features out of 146 features.</s> We used all the features that were created in this fork.
* The accuracy of <s>0.74162</s> ~ 0.79 on the public leaderboard with <s>a basic random forest classifier</s> Light GBM.

 <s>**I hope you find this kernel helpful and some <font color="red"><b>UPVOTES</b></font> would be very much appreciated.**</s>
 
 **You should upvote the original, as that is where the bulk of the work was done!**

In [None]:
import numpy as np  # used later for np.triu and np.round
import pandas as pd
#import autosklearn.classification
import featuretools as ft
from featuretools.primitives import *
from featuretools.variable_types import Numeric
##from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
##from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgb
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

# Input data files are available in the "../input/" directory.
import os
print(os.listdir("../input"))

## 2. Load data <a name="load_data"></a>

In [None]:
train_df = pd.read_csv('../input/train.csv')
test_df = pd.read_csv('../input/test.csv')
answers = pd.read_csv('../input/gender_submission.csv')

In [None]:
print(train_df.columns.values)

## 3. Clean data <a name="clean_data"></a>

First of all, it is necessary to clean the data. Since the main focus of this notebook is to explore "featuretools", we will not reinvent the wheel in data cleaning. Therefore, we will apply the cleaning code taken from an existing Kernel, [Best Titanic Survival Prediction for Beginners](https://www.kaggle.com/vin1234/best-titanic-survival-prediction-for-beginners), where an interested reader can find a very detailed explanation of each step of the data cleaning procedure.

How much of our data is null?

In [None]:
train_df.drop('Survived',axis=1).isnull().sum()/len(train_df)

It looks like Cabin and Age are missing quite a few values.

Perhaps we can make some edits to Cabin based on the letter.

We'll get to Age later.

In [None]:
cabin_letter = [str(c)[0] for c in train_df.Cabin]
cabin_letter[:10]

In [None]:
from collections import Counter

In [None]:
Counter(cabin_letter)

So, as we saw before, there are 687 nulls, and the remaining letters are unevenly balanced, with most falling in B, C, D and E.

We will fill the nulls with a capital 'N'.
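
Why 'N'? A missing cabin is stored as NaN, and `str(NaN)` is the string `'nan'`, so taking its first character and upper-casing it gives `'N'`. A tiny sketch on made-up values (the `cabins` series is illustrative only and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up cabin values; the second one is missing.
cabins = pd.Series(["C85", np.nan, "e46", "B28"])

# str(np.nan) == 'nan', so a missing cabin ends up as the letter 'N'.
cabin_letters = [str(c)[0].upper() for c in cabins]
print(cabin_letters)
```

Note that this relies on no real deck being labelled 'N', which holds for the letters counted above.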

In [None]:
cabin_train = [str(c)[0].upper() for c in train_df.Cabin]
train_df['Cabin_C'] = cabin_train

In [None]:
cabin_test = [str(c)[0].upper() for c in test_df.Cabin]
test_df['Cabin_C'] = cabin_test

In [None]:
# train_df.Cabin.value_counts()
train_df.head()
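
The cleaning cell that follows extracts a title from each name and fills missing ages with the median age of passengers sharing that title. As a minimal sketch of that idea on made-up rows (the `toy` frame and its names are illustrative assumptions, not the real data), a label-based `groupby` keeps each title matched to its own median:

```python
import numpy as np
import pandas as pd

# Made-up passengers; the last one has no recorded age.
toy = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen",
        "Heikkinen, Miss. Laina",
        "Allen, Mr. William",
        "Sandstrom, Miss. Marguerite",
        "Moran, Mr. James",
    ],
    "Age": [22.0, 26.0, 35.0, 4.0, np.nan],
})

# The title is the word directly before the first period in the name.
toy["Title"] = toy["Name"].str.extract(r"([A-Za-z]+)\.", expand=False)

# Median age per title, broadcast back to every row of that title.
title_median = toy.groupby("Title")["Age"].transform("median")
toy["Age"] = toy["Age"].fillna(title_median)
print(toy[["Title", "Age"]])
```

Because the medians are aligned by title label rather than by position, the result does not depend on the order in which the titles happen to be listed.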

In [None]:
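# pd.concat does the same job as DataFrame.append, which newer pandas versions no longer provide.
combine = pd.concat([train_df, test_df])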

passenger_id=test_df['PassengerId']
#combine.drop(['PassengerId'], axis=1, inplace=True)
combine = combine.drop(['Ticket', 'Cabin'], axis=1)

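# Fill the missing fare with the mean fare (plain assignment avoids chained inplace modification).
combine['Fare'] = combine['Fare'].fillna(combine['Fare'].mean())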

combine['Sex'] = combine.Sex.apply(lambda x: 0 if x == "female" else 1)

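# Extract the title (e.g. "Mr", "Miss") that precedes the period in each name.
# str.extract is vectorised, so no loop over the names is needed.
combine['Title'] = combine['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)

# Replace the rare titles with more common ones.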
mapping = {'Mlle': 'Miss', 'Major': 'Mr', 'Col': 'Mr', 'Sir': 'Mr', 'Don': 'Mr', 'Mme': 'Miss',
          'Jonkheer': 'Mr', 'Lady': 'Mrs', 'Capt': 'Mr', 'Countess': 'Mrs', 'Ms': 'Miss', 'Dona': 'Mrs'}
combine.replace({'Title': mapping}, inplace=True)

combine = combine.drop(['Name'], axis=1)

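titles = ['Mr', 'Miss', 'Mrs', 'Master', 'Rev', 'Dr']
# Impute missing ages with the median age of passengers sharing the same title.
# The medians are looked up by title label: the groupby result is sorted
# alphabetically, so a positional lookup via titles.index(title) would
# pick the median of a different title.
median_age_by_title = combine.groupby('Title')['Age'].median()
for title in titles:
    combine.loc[(combine['Age'].isnull()) & (combine['Title'] == title), 'Age'] = median_age_by_title[title]
combine.isnull().sum()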

freq_port = train_df.Embarked.dropna().mode()[0]
combine['Embarked'] = combine['Embarked'].fillna(freq_port)
    
combine['Embarked'] = combine['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
combine['Title'] = combine['Title'].map( {'Mr': 0, 'Mrs': 1, 'Miss': 2, 'Master': 3, 'Rev': 4, 'Dr': 5} ).astype(int)
combine.fillna(0, inplace=True)

In [None]:
## new ======================================

cabin_mapping = {'N':0,'C':1,'E':2,'G':3,'D':4,'A':5,'B':6,'F':7,'T':8}
combine.replace({'Cabin_C': cabin_mapping}, inplace=True)

In [None]:
combine.info()

In [None]:
Counter(combine.Cabin_C)

## 4. Perform automated feature engineering <a name="afe"></a> 

Once the data is cleaned, we can proceed to the main attraction of our party, i.e. automated feature engineering. To work with the "featuretools" package, we register the combined dataframe "combine" (train and test rows together) as an entity of the entity set. An entity is just a table with a uniquely identifying column known as an index. "featuretools" can automatically infer the variable types (numeric, categorical, datetime) of the columns, but it's a good idea to also pass in specific datatypes to override this behavior.

In [None]:
es = ft.EntitySet(id = 'titanic_data')

es = es.entity_from_dataframe(entity_id = 'combine', dataframe = combine.drop(['Survived'], axis=1), 
                              variable_types = 
                              {
                                  'Embarked': ft.variable_types.Categorical,
                                  'Sex': ft.variable_types.Boolean,
                                  'Title': ft.variable_types.Categorical,
                                  'Cabin_C': ft.variable_types.Categorical ## New
                              },
                              index = 'PassengerId')

es

Once the entity set is created, it is possible to generate new features using so-called **feature primitives**. A feature primitive is an operation applied to data to create a new feature. Simple calculations can be stacked on top of each other to create complex features. Feature primitives fall into two categories:

* **Aggregation**: these functions group together child datapoints for each parent and then calculate a statistic such as mean, min, max, or standard deviation. The aggregation works across multiple tables using relationships between tables.


* **Transformation**: these functions work on one or multiple columns of a single table.

In our case we do not have different tables linked to each other. However, we can create dummy tables using the "normalize_entity" function. What does this give us? This way we can apply both aggregation and transformation functions to generate new features. To create such tables, we will use categorical, boolean and integer variables.
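
To make the two categories concrete, here is a rough pandas analogy on made-up rows. It is only a sketch of the idea, not how "featuretools" is implemented, and the `toy` frame and the column names are assumptions for illustration:

```python
import pandas as pd

# Made-up "child" table: one row per passenger.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Title": [0, 2, 0, 2],
    "Age": [22.0, 26.0, 35.0, 4.0],
    "Fare": [7.25, 7.92, 8.05, 16.70],
})

# "Parent" table: one row per unique Title, analogous to a normalized entity.
title_table = toy[["Title"]].drop_duplicates().set_index("Title")

# Aggregation: summarise the child rows (passengers) of each parent row (title).
title_table["MEAN(Age)"] = toy.groupby("Title")["Age"].mean()
title_table["MAX(Fare)"] = toy.groupby("Title")["Fare"].max()

# Transformation: combine columns of a single table, row by row.
toy["Age_plus_Fare"] = toy["Age"] + toy["Fare"]

print(title_table)
print(toy)
```

The aggregated columns live on the parent table and can be joined back to each passenger through the shared `Title` key, which is what makes the dummy tables useful.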

In [None]:
es = es.normalize_entity(base_entity_id='combine', new_entity_id='Embarked', index='Embarked')
es = es.normalize_entity(base_entity_id='combine', new_entity_id='Sex', index='Sex')
es = es.normalize_entity(base_entity_id='combine', new_entity_id='Title', index='Title')
es = es.normalize_entity(base_entity_id='combine', new_entity_id='Pclass', index='Pclass')
es = es.normalize_entity(base_entity_id='combine', new_entity_id='Parch', index='Parch')
es = es.normalize_entity(base_entity_id='combine', new_entity_id='SibSp', index='SibSp')
es = es.normalize_entity(base_entity_id='combine', new_entity_id='Cabin_C', index='Cabin_C')
es

In [None]:
primitives = ft.list_primitives()
pd.options.display.max_colwidth = 100
primitives[primitives['type'] == 'aggregation'].head(primitives[primitives['type'] == 'aggregation'].shape[0])

Most of the "transformation" functions, listed in the next cell, apply to datetime or time-dependent variables. Our dataset has no such variables, so these functions will not be used.

In [None]:
primitives[primitives['type'] == 'transform'].head(primitives[primitives['type'] == 'transform'].shape[0])

Now we will apply the **deep feature synthesis (dfs)** function, which generates new features by automatically applying suitable aggregations. I selected a depth of 2. Higher depth values will stack more primitives.

**I decided to go with 3 here in the fork.**

In [None]:
features, feature_names = ft.dfs(entityset = es, 
                                 target_entity = 'combine', 
                                 max_depth = 3 #2
                                )

This is a list of new features. For example, "Title.SUM(combine.Age)" means the sum of Age values for each unique value of Title.
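
As a hedged illustration of what such a feature contains, the same quantity can be reproduced by hand with pandas on a few made-up rows (the `toy` frame is an assumption for illustration; the column name simply mirrors the "featuretools" naming):

```python
import pandas as pd

# Made-up rows: Title is already encoded as an integer, as in the cleaned data.
toy = pd.DataFrame({
    "Title": [0, 2, 0, 2, 0],
    "Age": [22.0, 26.0, 35.0, 4.0, 40.0],
})

# Sum of Age within each Title, repeated on every row that has that Title.
toy["Title.SUM(combine.Age)"] = toy.groupby("Title")["Age"].transform("sum")
print(toy)
```

Every passenger with the same Title therefore carries the same value of this feature, which is one reason many generated features end up highly correlated.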

In [None]:
feature_names

In [None]:
len(feature_names)

In [None]:
features[features['Age'] == 22][["Title.SUM(combine.Age)","Age","Title"]].head()

By using "featuretools", we were able to **generate <s>146</s> 184 features in just a moment**.

The "featuretools" package is powerful and saves time when creating new features from multiple tables of data. However, it does not completely substitute for human domain knowledge. Additionally, we now face another problem known as the "curse of dimensionality".

## 5. "Curse of dimensionality": Feature reduction and selection <a name="frs"></a> 

To deal with the "curse of dimensionality", it's necessary to apply feature reduction and selection, which means removing low-value features from the data. But keep in mind that feature selection can hurt the performance of ML models. The tricky thing is that the design of ML models contains an artistic component. It's definitely not a deterministic process with strict rules that should be followed to achieve success. In order to come up with an accurate model, it is necessary to apply, combine and compare dozens of methods. In this notebook, I will not explain all possible approaches to deal with the "curse of dimensionality". I will rather concentrate on the following methods:

   * Determine collinear features
   * Detect the most relevant features using linear models penalized with the L1 norm

### 5.1 Determine collinear features

Collinearity means high intercorrelations among independent features. If we keep such features in the model, it might be difficult to assess the effect of each independent feature on the target variable. Therefore we will detect these features and delete them, though a manual review should precede removal.

In [None]:
# Threshold for removing correlated variables
threshold = 0.95

# Absolute value correlation matrix
corr_matrix = features.corr().abs()
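# Keep only the upper triangle (k=1 excludes the diagonal) so each pair is checked once.
# The builtin bool replaces the np.bool alias, which newer NumPy versions no longer provide.
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))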
upper.head(50)

In [None]:
# Select columns with correlations above threshold
collinear_features = [column for column in upper.columns if any(upper[column] > threshold)]

print('There are %d features to remove.' % (len(collinear_features)))

In [None]:
features_filtered = features.drop(columns = collinear_features)

print('The number of features that passed the collinearity threshold: ', features_filtered.shape[1])

**Be aware, however, that it is not a good idea to remove features only by correlation without understanding the removal process**. Features that are very highly correlated (for example, Embarked.SUM(combine.Age) and Embarked.SUM(combine.Fare)) yet differ significantly from each other may require additional investigation. Therefore manual guidance is necessary. But this topic is outside the scope of this Kernel.
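
One simple aid for such a manual review is to list the pairs that exceed the threshold instead of only the columns to drop. A minimal sketch on synthetic data (the columns `a`, `a_scaled` and `b` are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_scaled": 3 * a + 1,        # a linear function of "a", so perfectly correlated
    "b": rng.normal(size=200),    # independent noise
})

threshold = 0.95
corr = toy.corr().abs()

# Upper triangle only, so every pair appears once and the diagonal is skipped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Long format: one entry per (feature, feature) pair, filtered by the threshold.
pairs = upper.stack()
pairs = pairs[pairs > threshold].sort_values(ascending=False)
print(pairs)
```

Reading the pairs side by side makes it easier to decide which member of each pair, if any, should actually be removed.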

### 5.2 Detect the most relevant features using linear models penalized with the L1 norm

The next step is to use linear models penalized with the L1 norm.
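
The L1 penalty drives the coefficients of weak features to exactly zero, and `SelectFromModel` keeps only the features whose coefficients survive. A minimal sketch on synthetic data, assuming scikit-learn is available; the `make_classification` data merely stands in for the real feature matrix, and the scaling step is my own assumption rather than part of the original kernel:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in: 20 features, of which only 4 carry signal.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=4, n_redundant=0,
    n_clusters_per_class=1, random_state=0,
)
X = StandardScaler().fit_transform(X)

# dual=False is required for the L1 penalty; a small C means strong regularisation.
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False, random_state=0).fit(X, y)

# Keep the features whose coefficients were not shrunk to (near) zero.
selector = SelectFromModel(lsvc, prefit=True)
X_selected = selector.transform(X)
kept = int(selector.get_support().sum())
print("kept", kept, "of", X.shape[1], "features")
```

Lowering `C` removes more features and raising it keeps more, so the value is worth tuning rather than copying.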

In [None]:
# features_positive = features_filtered.loc[:, features_filtered.ge(0).all()]

# train_X = features_positive[:train_df.shape[0]]
# train_y = train_df['Survived']

# test_X = features_positive[train_df.shape[0]:]

<s>Since the number of features is smaller than the number of observations in "train_X", the parameter "dual" is equal to False.</s>

In [None]:
# lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(train_X, train_y)
# model = SelectFromModel(lsvc, prefit=True)
# X_new = model.transform(train_X)
# X_selected_df = pd.DataFrame(X_new, columns=[train_X.columns[i] for i in range(len(train_X.columns)) if model.get_support()[i]])
# X_selected_df.shape

In [None]:
# X_selected_df.columns

**In the fork we're not going to remove the highly correlated features, nor select features using the L1 norm.**

## 6. Training and testing the simple model <a name="ttm"></a> 

<s>Finally, we will create a basic random forest classifier with 2000 estimators. Please notice that I skip essential steps such as crossvalidation, the analysis of learning curves, etc.</s>

We're going to put together a LightGBM model for this.

All the parameters are listed out in **lgb_params**.
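
A note on validation: the training loop further down draws a fresh random 80/20 split in each round and averages the test predictions. A stratified k-fold split is a common alternative in which every training row is used for validation exactly once. The sketch below shows that pattern on synthetic data; `GradientBoostingClassifier` is used purely as a stand-in for LightGBM so that the example runs with scikit-learn alone, and all names in it are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the train/test feature matrices.
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, y_train, X_test = X[:300], y[:300], X[300:]

n_splits = 3
test_proba = np.zeros(len(X_test))
scores = []

skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
for fold, (trn_idx, val_idx) in enumerate(skf.split(X_train, y_train), start=1):
    model = GradientBoostingClassifier(n_estimators=50, random_state=42)
    model.fit(X_train[trn_idx], y_train[trn_idx])

    # Score on the held-out fold: true labels first, predictions second.
    val_pred = model.predict(X_train[val_idx])
    scores.append(f1_score(y_train[val_idx], val_pred))

    # Average the predicted probabilities of the positive class over the folds.
    test_proba += model.predict_proba(X_test)[:, 1] / n_splits

test_pred = (test_proba >= 0.5).astype(int)
print("fold F1 scores:", np.round(scores, 4))
```

Averaging probabilities and thresholding once at the end mirrors what the notebook does with `preds`.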

In [None]:
lgb_params = {
    "max_depth": 8,
    "num_leaves": 1000,
    "learning_rate": 0.033,
    "objective": "binary",
    "n_estimators": 500,
    "boosting_type": "dart",
    "n_jobs": -1,
    "reg_lambda": 0.01,
    "random_state": 42
}

In [None]:
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split

In [None]:
train_X = features[:train_df.shape[0]].drop('Cabin_C',axis=1)
train_y = train_df['Survived']

test_X = features[train_df.shape[0]:].drop('Cabin_C',axis=1)

In [None]:
train_y.head()
train_X.head()
# features.head()

In [None]:
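# Note: each round below draws a new random 80/20 hold-out split
# rather than a true k-fold partition; test predictions are averaged.
kfolds = 3
preds = 0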
for i in range(kfolds):
    print('In kfold:',str(i+1))
    xt,xv,yt,yv = train_test_split(train_X, train_y, test_size=0.2, random_state=(i*42))
    
    trn = lgb.Dataset(xt,yt.values.flatten())
    val = lgb.Dataset(xv,yv.values.flatten())
    model = lgb.train(lgb_params, train_set=trn,
                     valid_sets=[val], valid_names=['val'],
                     verbose_eval=100,
                     early_stopping_rounds=100)
    
    val_pred = model.predict(xv, num_iteration=model.best_iteration+50)
    pred = model.predict(test_X, num_iteration=model.best_iteration+50)
    preds += pred
    print('=========================')
    # scikit-learn metrics expect (y_true, y_pred) in that order.
    print(classification_report(yv, np.round(val_pred,0).astype(int)))
    print("    F1 Score  : {:.4f}".format(f1_score(yv, np.round(val_pred,0).astype(int))))
    print('=========================')
preds /= kfolds

In [None]:
# random_forest = RandomForestClassifier(n_estimators=2000,oob_score=True)
# random_forest.fit(X_selected_df, train_y)

In [None]:
# X_selected_df.shape

In [None]:
# Y_pred = random_forest.predict(test_X[X_selected_df.columns])

In [None]:
# print(Y_pred)

In [None]:
my_submission = pd.DataFrame({'PassengerId': passenger_id, 'Survived': np.round(preds,0).astype(int)})
print(my_submission.head())
my_submission.to_csv('submission.csv', index=False)

Using LightGBM and making a few changes to the features pulled in (using all the features and a featuretools depth of 3), we got a score of 0.79425!

While it certainly makes sense to limit highly correlated features to get to very good ones, with a tool such as featuretools it is probably worth using everything that's created. Additionally, going from 13 features to 184 would still not be considered a very wide data set. So in this use case, why not use all of them?