### How to apply Drop Features transformer & Smart Correlated Features transformer

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
from sklearn.pipeline import Pipeline

## Drop Features

`DropFeatures` drops a list of variables specified by the user. The documentation is [here](https://feature-engine.trainindata.com/en/latest/user_guide/selection/DropFeatures.html). Its `features_to_drop` argument takes the list of features you want to drop.

In [8]:
from feature_engine.selection import DropFeatures

We will use the 'insurance' dataset, which contains information on the relationship between personal attributes (age, sex, body mass index (BMI), number of children, smoking habits), geographic region, and their impact on medical insurance charges.

In [11]:
df = pd.read_csv('insurance.csv')
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,expenses
0,19,female,27.9,0,yes,southwest,16884.92
1,18,male,33.8,1,no,southeast,1725.55
2,28,male,33.0,3,no,southeast,4449.46
3,33,male,22.7,0,no,northwest,21984.47
4,32,male,28.9,0,no,northwest,3866.86


We will set up the pipeline with `DropFeatures()` and drop the variables 'sex' and 'region'. We chose these arbitrarily, just for the exercise.
* In the workplace, you will consider the context. For example, your variable might be CustomerID, which is typically a high-cardinality combination of letters and numbers. It usually carries little predictive information, so you may drop it.
* Another use case is when you create a variable by combining others. For example, from 'distance' and 'time' you may create 'speed' by dividing one by the other. After that, you may discard 'distance' and 'time'.
* After setting up the pipeline, we call `.fit_transform()` on the data.
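The 'distance' / 'time' / 'speed' case above can be sketched in plain pandas. The data, the column names and the `features_to_drop` list are made up for illustration, and `DataFrame.drop` stands in here for what a `DropFeatures` step would do inside a pipeline.

```python
import pandas as pd

# Hypothetical trip data; column names and values are invented for this sketch.
trips = pd.DataFrame({
    "customer_id": ["A1X9", "B7K2", "C3M5"],  # high-cardinality identifier
    "distance": [120.0, 45.0, 300.0],         # kilometres
    "time": [2.0, 0.5, 4.0],                  # hours
})

# Derive the combined feature first...
trips["speed"] = trips["distance"] / trips["time"]

# ...then discard the inputs and the identifier.
features_to_drop = ["customer_id", "distance", "time"]
trips_reduced = trips.drop(columns=features_to_drop)

print(trips_reduced)
```

The order matters: the derived column has to exist before its inputs are dropped.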

In [19]:
pipeline = Pipeline([
    ('drop_features', DropFeatures(features_to_drop=['sex', 'region'])),
])

df_transformed = pipeline.fit_transform(df)
df_transformed.head()

Unnamed: 0,age,bmi,children,smoker,expenses
0,19,27.9,0,yes,16884.92
1,18,33.8,1,no,1725.55
2,28,33.0,3,no,4449.46
3,33,22.7,0,no,21984.47
4,32,28.9,0,no,3866.86


## Smart Correlated Features

It’s a technique that identifies and removes one feature from each pair/group of highly correlated variables, using some logic to "smartly" choose which feature to keep and which to drop.

**Correlated Features:**
* Looks for groups of correlated features (not just pairs).
* Tries to retain the most informative or important feature within each group.
* Often selects features based on missing values, variance, or predictive power (like correlation with the target or mutual information).

According to the documentation, this transformer finds groups of correlated features. It then selects one feature from each group according to a chosen criterion: the feature with the fewest missing values, the most unique values, or the highest variance. The documentation is found [here](https://feature-engine.trainindata.com/en/latest/user_guide/selection/SmartCorrelatedSelection.html).
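To make the three criteria concrete, here is a minimal pandas sketch applied to a single, made-up group of columns. It is not the library's implementation (tie-breaking, for instance, is not reproduced); it only shows how each criterion would rank the same group.

```python
import numpy as np
import pandas as pd

# Made-up data: three columns standing in for one group of correlated features.
group = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [10.0, 20.0, np.nan, 40.0, 50.0],
    "c": [2.0, 2.0, 4.0, 4.0, np.nan],
})

# Criterion 1: the feature with the fewest missing values.
fewest_missing = group.isna().sum().idxmin()

# Criterion 2: the feature with the most unique values (NaN is not counted).
most_unique = group.nunique().idxmax()

# Criterion 3: the feature with the highest variance (NaN is skipped).
highest_variance = group.var().idxmax()

print(fewest_missing, most_unique, highest_variance)
```

In this toy group the criteria do not all agree on which feature to keep, which is why the selection criterion is something you choose explicitly.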
* The arguments we will use are `variables`, the list of variables to evaluate (if you don't pass anything, it considers all numerical variables in the dataset); `method`, the correlation method (such as 'pearson' or 'spearman'); and `threshold`, which, according to the documentation, is the correlation threshold above which a feature will be deemed correlated with another one and removed from the dataset.
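The choice of correlation method can change which features count as correlated. The sketch below uses `DataFrame.corr` from pandas on invented data to compare Pearson (linear association) with Spearman (rank-based, monotonic association); the column names are hypothetical.

```python
import pandas as pd

# Invented data: x_cubed grows monotonically with x, but not linearly.
data = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
data["x_cubed"] = data["x"] ** 3

# Pearson measures linear association; Spearman works on ranks.
pearson = data.corr(method="pearson").loc["x", "x_cubed"]
spearman = data.corr(method="spearman").loc["x", "x_cubed"]

print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```

Because the relationship is perfectly monotonic, the Spearman coefficient reaches 1, while the Pearson coefficient stays below it. With a high threshold, the two methods could therefore flag different pairs.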

In [22]:
from feature_engine.selection import SmartCorrelatedSelection

In [24]:
df = sns.load_dataset('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [70]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 13.5+ KB


We convert every column whose data type is __`'category'`__ to __`'object'`__ by looping over those columns.

In [80]:
for col in df.select_dtypes(include='category').columns:
    df[col] = df[col].astype('object')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 13.5+ KB


In [76]:
df.isnull().sum()

total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
dtype: int64

The ``SmartCorrelatedSelection()`` transformer works on numerical data, so we must first encode the categorical variables. In this exercise we do that with ``OrdinalEncoder()``. We then add ``SmartCorrelatedSelection()`` without passing `variables`, so all numerical variables are evaluated. We set `method` to Pearson, `threshold` to 0.6, and `selection_method` to variance. A threshold of 0.6 means that any pair of variables with at least a moderate correlation is flagged, and one variable of the pair may be removed.
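As a rough illustration of what an arbitrary ordinal encoding does, the sketch below maps each category to an integer in order of first appearance using `pandas.factorize`. This is an assumption-laden stand-in, not the encoder itself: the exact integers the library assigns may differ, and the sample rows are made up.

```python
import pandas as pd

# Made-up sample with two categorical columns.
sample = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "day": ["Sun", "Sun", "Sat", "Thur", "Sat"],
})

encoded = sample.copy()
mappings = {}
for col in sample.columns:
    # factorize assigns 0, 1, 2, ... in order of first appearance.
    codes, categories = pd.factorize(sample[col])
    encoded[col] = codes
    mappings[col] = {category: code for code, category in enumerate(categories)}

print(mappings)
print(encoded)
```

The integers carry no inherent order, so correlations computed on them depend on the arbitrary mapping; keep that in mind when interpreting which encoded categorical variables end up flagged as correlated.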

The threshold defines how strong the correlation between two features has to be before they are considered "too similar" and one of them is dropped.

If `threshold = 0.9`, it means:
* If two features have a correlation above 0.9, they are considered too similar.
* The algorithm then removes one of them, using the chosen selection method (variance, missing values, etc.).

**Correlation Scale**
- 0.0 - 0.3: weak / no correlation
- 0.3 - 0.7: moderate correlation
- 0.7 - 1.0: strong correlation
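The effect of the threshold can be sketched with a small helper that lists the pairs whose correlation exceeds it. The helper `correlated_pairs` and the data are hypothetical, and comparing the absolute correlation with a strict `>` is an assumption based on the "above which" wording of the documentation, not a copy of the library's code.

```python
from itertools import combinations

import pandas as pd

# Made-up data: y is almost a linear function of x, z follows x loosely,
# and w is only weakly related to the others.
data = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6, 7, 8],
    "y": [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8],
    "z": [1, 3, 2, 5, 4, 4, 8, 5],
    "w": [3, 1, 4, 1, 5, 9, 2, 6],
})


def correlated_pairs(frame, threshold, method="pearson"):
    """Return column pairs whose absolute correlation is above the threshold."""
    corr = frame.corr(method=method).abs()
    return [
        (left, right)
        for left, right in combinations(frame.columns, 2)
        if corr.loc[left, right] > threshold  # assumption: strict comparison
    ]


print(correlated_pairs(data, threshold=0.9))
print(correlated_pairs(data, threshold=0.6))
```

Lowering the threshold from 0.9 to 0.6 flags more pairs, so more features become candidates for removal.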

**A big warning**: the tips dataset is intended for a regression task where you are interested in predicting tips. In a real project, the `tip` variable would be the target, not a feature. Here, we left it in as a feature on purpose, just for the sake of the exercise.
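In a real project, the target would be separated from the features before any feature selection. A minimal sketch, using a few rows copied from the tips preview shown earlier; the names `target`, `X` and `y` are just conventions chosen for this example.

```python
import pandas as pd

# A few rows taken from the tips preview, numeric columns only.
tips_sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "size": [2, 3, 3, 2, 4],
})

target = "tip"

# Features: every column except the target.
X = tips_sample.drop(columns=[target])
# Target: the variable we want to predict.
y = tips_sample[target]

print(X.columns.tolist(), y.name)
```

A feature-selection pipeline would then be fitted on `X` only, so the target can never be picked or dropped as if it were a feature.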

In [82]:
from feature_engine.encoding import OrdinalEncoder

pipeline = Pipeline([
    ('ordinal_encoder', OrdinalEncoder(encoding_method='arbitrary')),
    # selection_method can also be "missing_values", "cardinality" or "model_performance"
    ('SmartCorrelatedSelection', SmartCorrelatedSelection(
        method="pearson",
        threshold=0.6,
        selection_method="variance",
    )),
])

df_transformed = pipeline.fit_transform(df)
df_transformed.head()

Unnamed: 0,total_bill,sex,smoker,day,size
0,16.99,0,0,0,2
1,10.34,1,0,0,3
2,21.01,1,0,0,3
3,23.68,1,0,0,2
4,24.59,0,0,0,4


We can check which sets of features were marked as correlated (using the rules we set in the previous pipeline). We do that by accessing the pipeline step and using the attribute `.correlated_feature_sets_`

In [59]:
pipeline['SmartCorrelatedSelection'].correlated_feature_sets_

[{'tip', 'total_bill'}, {'day', 'time'}]

We check which variables were removed with the attribute `.features_to_drop_`

In [62]:
pipeline['SmartCorrelatedSelection'].features_to_drop_

['tip', 'time']

Alternatively, we can inspect `df_transformed`; as expected, the variables were removed.

In [67]:
df_transformed.head()

Unnamed: 0,total_bill,sex,smoker,day,size
0,16.99,0,0,0,2
1,10.34,1,0,0,3
2,21.01,1,0,0,3
3,23.68,1,0,0,2
4,24.59,0,0,0,4


**Additional warning**: This transformer is applied to the features when you set up the pipeline for your ML task. It is typically one of the last feature-engineering steps, since it requires the data to be pre-processed first.