# Introduction

- Feature selection is the process of selecting a subset of relevant features for use in model construction.
- It is different from dimensionality reduction. Both seek to reduce the number of attributes in the dataset, but dimensionality reduction methods do so by creating new combinations of attributes, whereas feature selection methods include and exclude attributes present in the data without changing them.
- It is used to identify and remove unneeded, irrelevant, and redundant attributes that do not contribute to the accuracy of a predictive model or may in fact decrease it.

- **OBJECTIVES**: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data.
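The difference between feature selection and dimensionality reduction can be sketched on synthetic data. This is a minimal illustration, not part of the Titanic workflow: `X_demo` and `y_demo` are made-up arrays, and `SelectKBest` with `f_classif` and `PCA` are used only as convenient representatives of each family.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (assumption): 4 features, only column 0 drives the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = (X_demo[:, 0] + 0.1 * rng.normal(size=100) > 0).astype(int)

# Feature selection: keeps 2 of the original columns, values unchanged
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X_demo, y_demo)
kept = selector.get_support(indices=True)

# Dimensionality reduction: builds 2 new columns as combinations of all 4
pca = PCA(n_components=2)
X_projected = pca.fit_transform(X_demo)

print('Indices of the kept columns:', kept)
print('Selected columns are original columns:',
      np.array_equal(X_selected, X_demo[:, kept]))
print('PCA loadings (one row per new component):')
print(pca.components_)
```

The selected matrix is a plain subset of the input columns, while each PCA component mixes all four inputs through its loadings.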

![alt text](Feature_Selection_Techniques.png)

*https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/*

# Assumptions: 

- All the string features have already been transformed to numeric ones (OneHotEncoder, OrdinalEncoder).
- The data has been scaled (StandardScaler).
- There are no missing values.
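A minimal sketch of how these three assumptions could be met is shown here on a hypothetical toy frame. The `toy` frame, its values, and the `Sex_` column prefix are illustrative assumptions and are not taken from this notebook.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame (not the Titanic data used in this notebook)
toy = pd.DataFrame({
    'Sex': ['female', 'male', 'female', None],
    'Age': [38.0, 22.0, 35.0, 54.0],
})

# 1) No missing values: drop incomplete rows
toy = toy.dropna()

# 2) Strings to numbers: one-hot encode the categorical column
encoder = OneHotEncoder()
sex_encoded = encoder.fit_transform(toy[['Sex']]).toarray()
sex_columns = ['Sex_' + str(c) for c in encoder.categories_[0]]

# 3) Scaling: standardize the numeric column to zero mean and unit variance
age_scaled = StandardScaler().fit_transform(toy[['Age']])

prepared = pd.DataFrame(sex_encoded, columns=sex_columns, index=toy.index)
prepared['Age'] = age_scaled.ravel()
print(prepared)
```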

# Imports, Parameters, and Functions

In [7]:
import pandas as pd
import numpy as np
import seaborn as sns

In [8]:
# Will do this just once to create the json file
#df1 = pd.read_csv('../Data/titanic_train.csv')
#df1.to_json(r'../Data/titanic_train.json')
#df2 = pd.read_csv('../Data/titanic_test.csv')
#df2.to_json(r'../Data/titanic_test.json')

In [9]:
# Classification (1) or Regression (0)
classification_problem = 1

# Define the name of the target column
target = 'Survived'

In [10]:
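def select_columns(coeff, threshold):
    # coeff: data frame with one score per feature in its first column,
    #        indexed by feature name
    # Returns the names of the features whose score is strictly above the threshold
    cols = []
    for i in range(len(coeff)):
        value = coeff.iloc[i,0]
        if (value > threshold):
            cols.append(coeff.index[i])

    return(cols)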

# Load the data -> To do: read the JSON file with the dataframe

In [11]:
#df = pd.read_csv('../Data/titanic_train.csv')
df = pd.read_json('../Data/titanic_train.json')

In [12]:
# Remove observations with missing values
df.dropna(inplace=True)

In [13]:
df.head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S


## Split the features from the target

In [14]:
X = df.drop(target, axis = 1)
y = df[target]

In [15]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 183 entries, 1 to 889
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  183 non-null    int64  
 1   Pclass       183 non-null    int64  
 2   Name         183 non-null    object 
 3   Sex          183 non-null    object 
 4   Age          183 non-null    float64
 5   SibSp        183 non-null    int64  
 6   Parch        183 non-null    int64  
 7   Ticket       183 non-null    object 
 8   Fare         183 non-null    float64
 9   Cabin        183 non-null    object 
 10  Embarked     183 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 17.2+ KB


In [16]:
y.head()

1     1
3     1
6     0
10    1
11    1
Name: Survived, dtype: int64

In [17]:
# Create a categorical target just to simulate this kind of output
y_cat = y.apply(lambda x: 'N' if (x==0) else 'Y')
y_cat.head()

1     Y
3     Y
6     N
10    Y
11    Y
Name: Survived, dtype: object

## Split Numerical and Categorical Inputs

In [18]:
X_num = X.select_dtypes(include=[np.number])
X_cat = X.select_dtypes(exclude=[np.number])

df_num = df.select_dtypes(include=[np.number])
df_cat = df.select_dtypes(exclude=[np.number])

In [19]:
X_num.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 183 entries, 1 to 889
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  183 non-null    int64  
 1   Pclass       183 non-null    int64  
 2   Age          183 non-null    float64
 3   SibSp        183 non-null    int64  
 4   Parch        183 non-null    int64  
 5   Fare         183 non-null    float64
dtypes: float64(2), int64(4)
memory usage: 10.0 KB


In [20]:
X_cat.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 183 entries, 1 to 889
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Name      183 non-null    object
 1   Sex       183 non-null    object
 2   Ticket    183 non-null    object
 3   Cabin     183 non-null    object
 4   Embarked  183 non-null    object
dtypes: object(5)
memory usage: 8.6+ KB


In [21]:
print(X_num.shape)
print(y.shape)

(183, 6)
(183,)


# Feature Selection Algorithms: Filter, Wrapper and Intrinsic Methods

https://machinelearningmastery.com/an-introduction-to-feature-selection/

https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/

https://machinelearningmastery.com/feature-selection-machine-learning-python/


## Filter

- Filter methods apply a statistical measure to assign a score to each feature. The features are ranked by the score and either kept in or removed from the dataset.
- Methods: Information Gain, Pearson’s Correlation, Spearman’s Correlation, Feature Importance, Kendall's Tau

**Some univariate statistical measures that can be used for FILTER-based feature selection.**
![alt text](Filter_Based_Feature_Selection.png)

*https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/*

### Numerical Input, Numerical Output

#### Information Gain

In [22]:
from sklearn.feature_selection import mutual_info_classif

# Define the threshold
threshold_IG = 0.02

# Calculate the mutual information coefficients and convert them to a data frame
coeff_IG =pd.DataFrame(mutual_info_classif(X_num, y).reshape(-1, 1),
                         columns=['Coefficient'], index=X_num.columns)

print('Information Gain')
print(coeff_IG)

# Only keep columns whose information gain is higher than the threshold
cols_IG = select_columns(coeff_IG, threshold_IG) 

print('\nColumns to remain in the DF: ', cols_IG)

Information Gain
             Coefficient
PassengerId     0.000000
Pclass          0.000000
Age             0.070762
SibSp           0.036635
Parch           0.020543
Fare            0.076065

Columns to remain in the DF:  ['Age', 'SibSp', 'Parch', 'Fare']


#### Pearson's Correlation (Linear Correlation)

In [23]:
# Define the threshold 
threshold_Pe = 0.09

# Calculate the correlation matrix
corr_mat =  df_num.corr(method = 'pearson')

# Select only those values related to the target
coeff_Pe = corr_mat[target].sort_values(ascending = False)[1:] #discard the first one since it is the target itself

coeff_Pe = pd.DataFrame(coeff_Pe.values, columns=['Coefficient'], index = coeff_Pe.index )

print('Pearsons Correlation')
print(abs(coeff_Pe))

# Only keep columns whose ABSOLUTE value of the correlation is higher than the threshold
cols_Pe = select_columns(abs(coeff_Pe), threshold_Pe) 

print('\nColumns to remain in the DF: ', cols_Pe)

Pearsons Correlation
             Coefficient
PassengerId     0.148495
Fare            0.134241
SibSp           0.106346
Parch           0.023582
Pclass          0.034542
Age             0.254085

Columns to remain in the DF:  ['PassengerId', 'Fare', 'SibSp', 'Age']


#### Spearman’s Correlation (Monotonic Rank Correlation)

In [24]:
# Define the threshold 
threshold_Sp = 0.09

# Calculate the correlation matrix
corr_mat =  df_num.corr(method = 'spearman')

# Select only those values related to the target
coeff_Sp = corr_mat[target].sort_values(ascending = False)[1:] #discard the first one since it is the target itself

coeff_Sp = pd.DataFrame(coeff_Sp.values, columns=['Coefficient'], index = coeff_Sp.index )

print('Spearmans Correlation')
print(abs(coeff_Sp))

# Only keep columns whose ABSOLUTE value of the correlation is higher than the threshold
cols_Sp = select_columns(abs(coeff_Sp), threshold_Sp) 

print('\nColumns to remain in the DF: ', cols_Sp)

Spearmans Correlation
             Coefficient
Fare            0.172005
PassengerId     0.150280
SibSp           0.118469
Parch           0.046836
Pclass          0.001663
Age             0.257242

Columns to remain in the DF:  ['Fare', 'PassengerId', 'SibSp', 'Age']


### Numerical Input, Categorical Output

#### Kendall’s Tau 

- Kendall’s tau is a measure of the correspondence between two rankings. Values close to 1 indicate strong agreement, and values close to -1 indicate strong disagreement.
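A small worked example may help to read the coefficient. The rankings below are made up for illustration, and only `scipy.stats.kendalltau` is used.

```python
from scipy import stats

rank_a = [1, 2, 3, 4, 5]

# Identical rankings: every pair is concordant, so tau = 1
tau_same, _ = stats.kendalltau(rank_a, [1, 2, 3, 4, 5])

# Reversed rankings: every pair is discordant, so tau = -1
tau_reversed, _ = stats.kendalltau(rank_a, [5, 4, 3, 2, 1])

# One swapped pair out of 10: (9 concordant - 1 discordant) / 10 = 0.8
tau_one_swap, _ = stats.kendalltau(rank_a, [1, 3, 2, 4, 5])

print(tau_same, tau_reversed, tau_one_swap)
```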

In [25]:
from scipy import stats

threshold_Ke = 0.13
threshold_Ke_pvalue = 0.05 

corr_tau = []
p_value = []

for i in range(len(X_num.columns)):
    tau, p_val = stats.kendalltau(X_num.iloc[:,i], y_cat)
    corr_tau.append(abs(tau))
    p_value.append(p_val)

corr_tau = pd.DataFrame(corr_tau, columns = ["Coefficient"], index = X_num.columns)    
p_value = pd.DataFrame(p_value, columns = ["p_value"], index = X_num.columns) 

print("Correlation")
print(corr_tau)
print("\nP-values")
print(p_value)

# Only select the statistically significant measures
corr_tau = corr_tau[(p_value["p_value"] < threshold_Ke_pvalue).values] 

print('\nStatistically significant Kendall’s Tau')
print(corr_tau)

# Only keep columns whose ABSOLUTE value of the correlation is higher than the threshold
cols_Ke = select_columns(corr_tau, threshold_Ke) 

print('\nColumns to remain in the DF: ', cols_Ke)

Correlation
             Coefficient
PassengerId     0.123038
Pclass          0.001636
Age             0.212465
SibSp           0.115947
Parch           0.045087
Fare            0.141434

P-values
              p_value
PassengerId  0.042622
Pclass       0.982105
Age          0.000520
SibSp        0.109990
Parch        0.527487
Fare         0.020315

Statistically significant Kendall’s Tau
             Coefficient
PassengerId     0.123038
Age             0.212465
Fare            0.141434

Columns to remain in the DF:  ['Age', 'Fare']


### Categorical Input, Numerical Output

#### Kendall’s Tau 

- Kendall’s tau is a measure of the correspondence between two rankings. Values close to 1 indicate strong agreement, and values close to -1 indicate strong disagreement.

In [26]:
from scipy import stats

threshold_Ke = 0.13
threshold_Ke_pvalue = 0.05 

corr_tau = []
p_value = []

for i in range(len(X_cat.columns)):
    tau, p_val = stats.kendalltau(X_cat.iloc[:,i], y)
    corr_tau.append(abs(tau))
    p_value.append(p_val)

corr_tau = pd.DataFrame(corr_tau, columns = ["Coefficient"], index = X_cat.columns)    
p_value = pd.DataFrame(p_value, columns = ["p_value"], index = X_cat.columns) 

print("Correlation")
print(corr_tau)
print("\nP-values")
print(p_value)

# Only select the statistically significant measures
corr_tau = corr_tau[(p_value["p_value"] < threshold_Ke_pvalue).values] 

print('\nStatistically significant Kendall’s Tau')
print(corr_tau)

# Only keep columns whose ABSOLUTE value of the correlation is higher than the threshold
cols_Ke = select_columns(corr_tau, threshold_Ke) 

print('\nColumns to remain in the DF: ', cols_Ke)

Correlation
          Coefficient
Name         0.122497
Sex          0.532418
Ticket       0.017176
Cabin        0.009128
Embarked     0.099129

P-values
               p_value
Name      4.354185e-02
Sex       6.834238e-13
Ticket    7.776211e-01
Cabin     8.806636e-01
Embarked  1.788363e-01

Statistically significant Kendall’s Tau
      Coefficient
Name     0.122497
Sex      0.532418

Columns to remain in the DF:  ['Sex']


### Categorical (ANY) Input, Categorical (ANY) Output

#### Chi-Squared Test
https://towardsdatascience.com/chi-square-test-for-feature-selection-in-machine-learning-206b1f0b8223

In [27]:
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import chi2

In [28]:
# Null hypothesis H0: target is independent from the features.
# H0 is REJECTED if p_value <= alpha.

alpha = 0.05 #significance level
cols_Chi = []

# Before performing the Chi-Square test we have to make sure the data is label encoded.
label_encoder = LabelEncoder()
X_chi_cat = X_cat.copy()
for col in X_chi_cat.columns:
    X_chi_cat[col] = label_encoder.fit_transform(X_chi_cat[col])

# X_chi is composed of all features, categorical and numeric.
X_chi = pd.concat([X_chi_cat, X_num], axis =1)   

#print(X_chi.head(2))

# y can be categorical or numeric
chi_values, p_values = chi2(X_chi,y) 
print("\n Chi Square values: ", chi_values)
print("\n p-values: ", p_values)

# Test if the target is DEPENDENT on the features. True when p_value <= alpha
for i in range(len(p_values)):
    if (p_values[i] <= alpha):
        cols_Chi.append(X_chi.columns[i])

print('\nColumns to remain in the DF: ', cols_Chi)


 Chi Square values:  [1.25629862e+02 2.49452632e+01 1.96133471e+00 4.50110503e-01
 1.32908589e+00 5.37915073e+02 4.83833072e-02 8.06045908e+01
 1.83879484e+00 1.21236333e-01 2.42972795e+02]

 p-values:  [3.70531311e-029 5.89812658e-007 1.61370645e-001 5.02282483e-001
 2.48967905e-001 5.35803658e-119 8.25900704e-001 2.75718180e-019
 1.75092266e-001 7.27697436e-001 8.84137626e-055]

Columns to remain in the DF:  ['Name', 'Sex', 'PassengerId', 'Age', 'Fare']
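The cell above passes label-encoded and continuous columns to scikit-learn's `chi2`, which expects non-negative values such as counts or booleans. For a single categorical feature, the same null hypothesis can also be checked on a contingency table. The sketch below is an alternative shown under assumptions, not the method used in this notebook: the `toy` frame and its counts are made up, and `scipy.stats.chi2_contingency` replaces `sklearn.feature_selection.chi2`.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: 22 'female' rows followed by 23 'male' rows
toy = pd.DataFrame({
    'Sex': ['female'] * 22 + ['male'] * 23,
    'Survived': [1] * 20 + [0] * 2 + [1] * 3 + [0] * 20,
})

# Observed counts for every (category, class) combination
table = pd.crosstab(toy['Sex'], toy['Survived'])

# H0: the feature and the target are independent
chi2_stat, p_val, dof, expected = chi2_contingency(table)

alpha = 0.05  # significance level
reject_h0 = p_val <= alpha

print(table)
print('Chi-squared:', chi2_stat, 'p-value:', p_val, 'dof:', dof)
print('Keep the feature (H0 rejected):', reject_h0)
```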


## Wrapper

- Consider the selection of a set of features as a search problem, where different combinations are prepared, evaluated and compared to other combinations. A predictive model is used to evaluate a combination of features and assign a score based on model accuracy. 
- Method: Recursive Feature Elimination.

### Numerical Input, Numerical Output

#### Recursive Feature Elimination

First, the estimator is trained on the initial set of features and the importance of each feature is obtained through a specific attribute (such as `coef_` or `feature_importances_`) or a callable. Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is reached.
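The elimination loop described above can be written out by hand to make the recursion visible. This is a simplified sketch on synthetic data, not scikit-learn's actual `RFE` implementation: `X_demo`, `y_demo`, and `n_to_keep` are illustrative assumptions, and one feature is removed per round (the equivalent of `step=1`).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): only columns 1 and 3 determine the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = ((X_demo[:, 1] > 0) & (X_demo[:, 3] > 0)).astype(int)

remaining = list(range(X_demo.shape[1]))  # indices of the features still in play
n_to_keep = 2

while len(remaining) > n_to_keep:
    # Train on the current feature set and read the importances
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_demo[:, remaining], y_demo)

    # Prune the least important feature, then repeat on the reduced set
    weakest = int(np.argmin(model.feature_importances_))
    remaining.pop(weakest)

print('Features kept:', remaining)
```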

In [29]:
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

n_features = int(0.7*len(X_num.columns))

if (classification_problem==1):
    estimator = DecisionTreeClassifier(min_samples_leaf=4)
else:
    estimator = DecisionTreeRegressor(min_samples_leaf=4)

selector = RFE(estimator, n_features_to_select=n_features, step=1)
selector = selector.fit(X_num, y)

cols_RFE = list(X_num.columns[selector.support_])

print('Columns to remain in the DF: ', cols_RFE)

Columns to remain in the DF:  ['PassengerId', 'Age', 'SibSp', 'Fare']


## Intrinsic - Feature Importances

- Some machine learning algorithms perform feature selection automatically as part of learning the model.
- These include penalized regression models such as Lasso, and decision trees, including ensembles of decision trees such as random forests.
- Methods: Elastic Net, Decision Trees, Random Forest.
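The thresholding applied to model coefficients or importances in this section can also be expressed with scikit-learn's `SelectFromModel`. The sketch below is an alternative under stated assumptions and does not replace the `select_columns` helper used in this notebook: the data is synthetic and `threshold='mean'` is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data (assumption): only column 2 determines the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 2] > 0).astype(int)

# Fit the model and keep the features whose importance exceeds the mean importance
forest = RandomForestClassifier(n_estimators=50, random_state=0)
selector = SelectFromModel(forest, threshold='mean')
selector.fit(X_demo, y_demo)

mask = selector.get_support()             # boolean mask over the input columns
X_reduced = selector.transform(X_demo)    # only the selected columns

print('Importances:', selector.estimator_.feature_importances_)
print('Selected mask:', mask)
```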

### Numerical Input, Numerical Output

#### ElasticNet

In [30]:
from sklearn.linear_model import ElasticNet

threshold_El = 0

model = ElasticNet()
model.fit(X_num, y)

feature_importances = pd.DataFrame(np.abs(model.coef_), columns = ['Coefficient'], index = X_num.columns)

print('Feature Importances')
print(feature_importances)

cols_El = select_columns(feature_importances, threshold_El)

print('\nColumns to remain in the DF: ', cols_El)

Feature Importances
             Coefficient
PassengerId     0.000280
Pclass          0.000000
Age             0.005441
SibSp           0.000000
Parch           0.000000
Fare            0.000611

Columns to remain in the DF:  ['PassengerId', 'Age', 'Fare']


#### Decision Trees (Extra-Trees Ensemble)

In [31]:
from sklearn.ensemble import ExtraTreesClassifier, ExtraTreesRegressor

threshold_Tr = 0.11

if (classification_problem==1):
    model = ExtraTreesClassifier(min_samples_leaf=4)
else:
    model = ExtraTreesRegressor(min_samples_leaf=4)
    
model.fit(X_num, y)

feature_importances = pd.DataFrame(model.feature_importances_, columns = ['Coefficient'], index = X_num.columns)

print('Feature Importances')
print(feature_importances)


cols_Tr = select_columns(feature_importances, threshold_Tr) 

print('\nColumns to remain in the DF: ', cols_Tr)

Feature Importances
             Coefficient
PassengerId     0.199157
Pclass          0.079733
Age             0.383714
SibSp           0.114523
Parch           0.066708
Fare            0.156164

Columns to remain in the DF:  ['PassengerId', 'Age', 'SibSp', 'Fare']


#### Random Forest

In [32]:
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

threshold_RF = 0.11

if (classification_problem==1):
    model = RandomForestClassifier(n_estimators=200, n_jobs=-1)
else:
    model = RandomForestRegressor(n_estimators=200, n_jobs=-1)
    
model.fit(X_num, y)

feature_importances = pd.DataFrame(model.feature_importances_, columns = ['Coefficient'], index = X_num.columns)

print('Feature Importances')
print(feature_importances)


cols_RF = select_columns(feature_importances, threshold_RF) 

print('\nColumns to remain in the DF: ', cols_RF)

Feature Importances
             Coefficient
PassengerId     0.295663
Pclass          0.020301
Age             0.301414
SibSp           0.042231
Parch           0.048230
Fare            0.292160

Columns to remain in the DF:  ['PassengerId', 'Age', 'Fare']


# Save the final data frame with the selected columns as a json file

In [33]:
final_selected_cols = cols_RF

In [34]:
df[final_selected_cols].to_json(r'../Data/titanic_train_feature_selection.json')