# Feature selection

In this example we will see how to **select features** through **model-based selection**.

As a toy example, we will use data from 'Titanic: Machine Learning from Disaster', one of the most popular Kaggle competitions. Rather than the original data set, we will use a modified version produced by a [Kaggle kernel](https://www.kaggle.com/pmarcelino/data-analysis-and-feature-extraction-with-python) that I wrote.

You can access the data set used in this exercise [here](https://github.com/pmarcelino/blog/data/titanic_modified.csv).

---

In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

# Load data

Let's load the data set.

In [20]:
df = pd.read_csv('data/titanic_modified.csv', index_col=0)
df.head()

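| | Survived | Fare | FamilySize | Imputed | Pclass_2 | Pclass_3 | Sex_male | Age_Child | Age_Elder | Embarked_Q | Embarked_S | Title_Miss | Title_Mr | Title_Mrs | Title_Other |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 7.25 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 1 | 71.2833 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1 | 7.925 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 3 | 1 | 53.1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 0 | 8.05 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |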


As I mentioned, this data set results from a [kernel](https://www.kaggle.com/pmarcelino/data-analysis-and-feature-extraction-with-python) that I wrote for the Titanic competition. I'll summarize each of the features to give you some context:

* **Survived**. Target variable. It's 1 if the passenger survived and 0 otherwise.
* **Fare**. Passenger fare. It keeps the same properties as the original feature.
* **FamilySize**. It's the sum of the original features SibSp and Parch. SibSp is the number of siblings and spouses aboard the Titanic. Parch is the number of parents and children aboard the Titanic.
* **Imputed**. Identifies instances where some missing data imputation was made.
* **Pclass**. Ticket class. The feature was [encoded](http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features), so we just have the features corresponding to the second and third class. The one corresponding to the first class was deleted to avoid the [dummy trap](http://www.algosome.com/articles/dummy-variable-trap-regression.html).
* **Sex**. It was also encoded. It's 1 if the instance corresponds to a male, and 0 if it corresponds to a female.
* **Age**. Encoded into three classes: Child, Adult, and Elder. A passenger is an Adult if Age_Child = 0 and Age_Elder = 0.
* **Embarked**. Port of embarkation. Originally, we had three possible ports: C = Cherbourg, Q = Queenstown, S = Southampton. After encoding, only the Queenstown and Southampton columns remain.
* **Title**. Results from the name of the passenger. Guess what? It's also an encoded feature.
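
To make the encoding described above more concrete, here is a minimal sketch of how dummy columns such as Pclass_2, Pclass_3, and Sex_male can be produced with pandas. The small `raw` frame is a hypothetical sample, and `drop_first=True` is my assumption about how the first level was removed; the original kernel may have done it differently.

```python
import pandas as pd

# Hypothetical mini-sample with two of the original categorical columns
raw = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2],
    "Sex": ["male", "female", "female", "female", "male"],
})

# drop_first=True removes the first level of each feature (Pclass 1, female),
# which becomes the implicit baseline and avoids the dummy variable trap
encoded = pd.get_dummies(raw, columns=["Pclass", "Sex"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

A first-class female passenger ends up with zeros in all three columns, which is how the dropped levels are still represented.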

# Train and test data sets

In feature selection, our goal is to distinguish features that are useful for prediction from features that just add noise to the prediction model.

To test the model's performance on unseen data, we need a train and a test data set.

In [21]:
# Create train and test set
from sklearn.model_selection import train_test_split

X = df.drop('Survived', axis=1)  # Keep all features except 'Survived'
y = df['Survived']  # Just keep 'Survived'

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=7)

# Feature selection

The solution we propose uses scikit-learn. You can read its documentation to learn more about [feature selection](http://scikit-learn.org/stable/modules/feature_selection.html).

Basically, you just have to use **SelectFromModel**. When you use this meta-transformer, you specify which **model** you want to use (e.g. Random Forests) and the **threshold** value to use for feature selection. The base model assigns an importance to each feature (for Random Forests, the `feature_importances_` attribute), and the threshold defines which features are kept: features whose importance is greater than or equal to the threshold are kept, and the others are discarded. You can read more about SelectFromModel [here](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel).
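
To see what the threshold does, here is a small sketch that reproduces the selection mask by hand. It uses synthetic data from `make_classification` as a stand-in (not the Titanic set), and the names `X_demo`, `y_demo`, and `selector` are only illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: 10 features, of which 3 are informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0), threshold="mean"
)
selector.fit(X_demo, y_demo)

# Importances computed by the fitted base model
importances = selector.estimator_.feature_importances_

# threshold_ holds the numeric value that the string 'mean' resolved to
mask = importances >= selector.threshold_

print("threshold:", selector.threshold_)
print("kept features:", np.flatnonzero(mask))
```

The hand-made `mask` should match `selector.get_support()`, which is the mask that `transform` applies to the columns.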

In this example, I'll show you how to do feature selection using **Random Forests** as the base model. You can use other models. The logic is the same. I usually go for Random Forests because it's a flexible model, which works well with different types of features and in different types of problems.

Since the Titanic competition is a classification problem, we will use **RandomForestClassifier** as the base model.

In [22]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

select = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=7), threshold='median')
select.fit(X_train, y_train)
X_train_selected = select.transform(X_train)
print(X_train.shape)
print(X_train_selected.shape)

(712, 14)
(712, 7)


Some notes:
* We reduced the number of features from 14 to 7.
* We used 'median' as the threshold value, which is why half of the 14 features (7) are kept. You can use other thresholds, such as 'mean' or the mean times a scaling factor (e.g. '1.25*mean').
* We used n_estimators = 100 for illustrative purposes. You can use a different number.
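
As a rough illustration of the notes above, this sketch counts how many features survive under a few threshold settings. It runs on synthetic data with 14 features (a stand-in, not the Titanic set), so the counts will differ from the ones you get with the real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data with 14 features, like our feature matrix
X_demo, y_demo = make_classification(
    n_samples=300, n_features=14, n_informative=4, n_redundant=0, random_state=0
)

counts = {}
for threshold in ["median", "mean", "1.25*mean"]:
    selector = SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=0),
        threshold=threshold,
    )
    selector.fit(X_demo, y_demo)
    # get_support() returns a boolean mask over the input features
    counts[threshold] = int(selector.get_support().sum())

print(counts)
```

The 'median' threshold keeps half of the features by construction, while 'mean' and scaled versions of it depend on how the importances are distributed.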

# Model's performance

In [23]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold

# Assess model performance
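lr = LogisticRegression()
lr.fit(X_train_selected, y_train)
# Stratified 10-fold CV without shuffling, i.e. the same splits that cv=10 uses for a classifier.
# random_state is omitted: it has no effect without shuffle=True and raises an error in recent scikit-learn versions.
strat_kfold = StratifiedKFold(n_splits=10)
score = cross_val_score(lr, X_train_selected, y_train, scoring='accuracy', cv=strat_kfold)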
print('CV accuracy: %.3f +/- %.3f' % (np.mean(score), np.std(score)))

# Get features
print(select.get_support(indices=True))

CV accuracy: 0.807 +/- 0.052
[ 0  1  4  5 10 11 12]
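
The indices printed above are column positions in `X_train`, so they are easier to read once mapped back to feature names. The sketch below does that with the column order shown by `df.head()`; the list is copied by hand here only to keep the example self-contained.

```python
import numpy as np

# Feature columns in the order shown by df.head(), without the 'Survived' target
feature_names = np.array([
    "Fare", "FamilySize", "Imputed", "Pclass_2", "Pclass_3", "Sex_male",
    "Age_Child", "Age_Elder", "Embarked_Q", "Embarked_S",
    "Title_Miss", "Title_Mr", "Title_Mrs", "Title_Other",
])

# Indices printed by select.get_support(indices=True)
selected_idx = np.array([0, 1, 4, 5, 10, 11, 12])

selected_names = feature_names[selected_idx].tolist()
print(selected_names)
```

In the notebook itself, `X_train.columns[select.get_support()]` should give the same names without copying anything by hand.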


# Conclusion

If we compare the results with the example we used in [univariate feature selection](), we can see that the model's performance is lower with this model-based solution.

However, we must be aware that in this case we defined 'median' as the threshold value. Accordingly, we are restricting the model to half of its original features.

If we compare the results achieved by this model, which has 7 features, with the 7-feature models resulting from univariate feature selection, we will see that the model-based approach is as good as or better than the univariate one.

Feature selection may then be improved by using a different base model or a different threshold value.
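
One way to compare base models and thresholds is to wrap the selector and the classifier in a `Pipeline`, so that the selection is refit inside each cross-validation fold. This is a sketch under my own assumptions: it uses synthetic data rather than the Titanic set, and `ExtraTreesClassifier` is just one possible alternative base model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data with 14 features
X_demo, y_demo = make_classification(
    n_samples=400, n_features=14, n_informative=4, n_redundant=0, random_state=0
)

# Candidate selectors: different base models and thresholds
candidates = {
    "rf_median": SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0), threshold="median"),
    "rf_mean": SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0), threshold="mean"),
    "et_median": SelectFromModel(
        ExtraTreesClassifier(n_estimators=50, random_state=0), threshold="median"),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, selector in candidates.items():
    # The selector is fit on the training part of each fold only
    pipe = Pipeline([("select", selector), ("clf", LogisticRegression(max_iter=1000))])
    scores = cross_val_score(pipe, X_demo, y_demo, scoring="accuracy", cv=cv)
    results[name] = scores.mean()

for name, mean_score in results.items():
    print("%s: %.3f" % (name, mean_score))
```

With the Titanic data you would pass `X_train` and `y_train` instead of the synthetic arrays, and keep the test set aside for a final check of the chosen pipeline.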

# Complete solution

In [24]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold

# Select features
select = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=7), threshold='median')
select.fit(X_train, y_train)
X_train_selected = select.transform(X_train)
print(X_train.shape)
print(X_train_selected.shape)


# Assess model
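lr = LogisticRegression()
lr.fit(X_train_selected, y_train)
# Stratified 10-fold CV without shuffling, i.e. the same splits that cv=10 uses for a classifier.
# random_state is omitted: it has no effect without shuffle=True and raises an error in recent scikit-learn versions.
strat_kfold = StratifiedKFold(n_splits=10)
score = cross_val_score(lr, X_train_selected, y_train, scoring='accuracy', cv=strat_kfold)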
print('CV accuracy: %.3f +/- %.3f' % (np.mean(score), np.std(score)))

# Get features
print(select.get_support(indices=True))

(712, 14)
(712, 7)
CV accuracy: 0.807 +/- 0.052
[ 0  1  4  5 10 11 12]
