# Recursive Feature Elimination (RFE) - Experiment

This component ranks features by recursive feature elimination (RFE), using a Random Forest estimator with default hyperparameters. K-fold cross-validation scores each candidate feature subset to decide how many features to keep. It uses the `RFECV` implementation from [Scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html).
<br>
Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.
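
To make the elimination idea concrete, here is a simplified sketch of a single recursive elimination pass on synthetic data. It is not how `RFECV` is implemented: it skips cross-validation and always stops at a fixed number of features. The synthetic dataset, `n_keep`, and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption): 8 features, 3 of them informative.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)

n_keep = 3  # hypothetical stopping point, analogous to min_features
remaining = list(range(X_demo.shape[1]))  # indexes of features still in play
eliminated = []  # indexes in the order they were dropped

while len(remaining) > n_keep:
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_demo[:, remaining], y_demo)
    # Drop the feature the forest currently considers least important.
    weakest = int(np.argmin(model.feature_importances_))
    eliminated.append(remaining.pop(weakest))

print("kept:", remaining)
print("eliminated (first dropped first):", eliminated)
```

`RFECV` repeats this kind of elimination inside each cross-validation fold and keeps the subset size with the best average score.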

This notebook shows:
- how to use the [SDK](https://platiagro.github.io/sdk/) to load datasets, save models and other artifacts.
- how to declare parameters and use them to build reusable components.

## Declare parameters and model hyperparameters
Components may declare (and use) these default parameters:
- dataset
- target

Use these parameters to load/save datasets, models, metrics, and figures with the help of [PlatIAgro SDK](https://platiagro.github.io/sdk/). <br />
You may also declare custom parameters to set when running an experiment.

Select the hyperparameters and their respective values to be used when fitting RFE:
- min_features
- n_folds

These are only some of the parameters offered by the model class; you may also use any other existing parameter. <br />
Check the [model parameters](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html) for more information.
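
As a rough illustration of how these hyperparameters reach the model class, the sketch below passes `min_features` and `n_folds` to `RFECV` as `min_features_to_select` and `cv`, and adds `step` as an example of another existing parameter. The synthetic data, the small forest size, and the value of `step` are assumptions made to keep the example fast; they are not this component's defaults.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV

# Synthetic data (assumption), used only to make the sketch runnable.
X_demo, y_demo = make_regression(
    n_samples=150, n_features=6, n_informative=3, random_state=0
)

min_features = 3  # lower bound on the number of selected features
n_folds = 3       # kept small here for speed
step = 2          # extra RFECV parameter: features removed per iteration

selector = RFECV(
    RandomForestRegressor(n_estimators=30, random_state=0),
    min_features_to_select=min_features,
    cv=n_folds,
    step=step,
)
selector.fit(X_demo, y_demo)

print("number of selected features:", selector.n_features_)
print("selection mask:", selector.support_.tolist())
```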

In [None]:
# parameters
dataset = "boston" #@param {type:"string"}
target = "medv" #@param {type:"feature", label:"Target attribute", description: "Your model will be trained to predict the target values."}

# hyperparameters
min_features = 3 #@param {type:"number", label: "Minimum number of features to select"}
n_folds = 10 #@param {type:"number", label: "Number of folds for cross-validation"}

## Load dataset

Load the whole dataset into a `pandas.DataFrame`, then split it into the features `X` and the target `y`.

In [None]:
from platiagro import load_dataset

df = load_dataset(name=dataset)
X = df.drop(target, axis=1)
y = df[target]

## Load metadata about the dataset
For example, below we get the feature type of each column in the dataset (e.g., categorical, numerical, or datetime).

In [None]:
import numpy as np
from platiagro import stat_dataset

metadata = stat_dataset(name=dataset)
featuretypes = metadata["featuretypes"]

columns = df.columns.to_numpy()
featuretypes = np.array(featuretypes)
target_index = np.argwhere(columns == target)
target_type = featuretypes[target_index]

columns = np.delete(columns, target_index)
featuretypes = np.delete(featuretypes, target_index)
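
The index bookkeeping above can be tried on a toy example. The column names and the type strings below are hypothetical stand-ins, not values returned by `stat_dataset`.

```python
import numpy as np

# Hypothetical stand-ins for df.columns and metadata["featuretypes"].
columns_demo = np.array(["age", "city", "income"])
types_demo = np.array(["Numerical", "Categorical", "Numerical"])
target_demo = "income"

# np.argwhere returns a 2-D array of matching positions.
target_index_demo = np.argwhere(columns_demo == target_demo)
# Indexing with that 2-D array yields a 2-D array holding the target's type.
target_type_demo = types_demo[target_index_demo]

# Remove the target so that only feature columns and their types remain.
feature_columns = np.delete(columns_demo, target_index_demo)
feature_types = np.delete(types_demo, target_index_demo)

print(target_index_demo.tolist())  # [[2]]
print(feature_columns.tolist())    # ['age', 'city']
```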

## Define the order of each feature

In [None]:
from platiagro.featuretypes import NUMERICAL

# Selects the indexes of numerical and non-numerical features
numerical_indexes = np.where(featuretypes == NUMERICAL)[0]
non_numerical_indexes = np.where(~(featuretypes == NUMERICAL))[0]

# After the handle_missing_values step,
# numerical features are grouped at the beginning of the array
numerical_indexes_after_handle_missing_values = \
    np.arange(len(numerical_indexes))
non_numerical_indexes_after_handle_missing_values = \
    np.arange(len(numerical_indexes), len(featuretypes))
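
The reordering matters because a `ColumnTransformer` emits its outputs in the order the transformers are listed, and any `remainder='passthrough'` columns are appended to the right of those outputs. The toy sketch below shows both effects; the matrix, the index lists, and the identity transformer are assumptions for illustration and do not impute or encode anything.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

# Toy matrix (assumption): each value equals its original column index.
# Columns 0 and 2 play the numerical role, column 1 the non-numerical role.
X_demo = np.array([[0, 1, 2],
                   [0, 1, 2]])
num_idx = [0, 2]
other_idx = [1]

identity = FunctionTransformer()  # with no func, returns its input unchanged

# Two listed transformers: output order follows the list, numerical first.
step1 = ColumnTransformer(
    [("num", identity, num_idx), ("other", identity, other_idx)],
    remainder="drop",
)
out1 = step1.fit_transform(X_demo)
print(out1[0].tolist())  # [0, 2, 1]

# One listed transformer plus passthrough: the listed column comes first,
# the untouched columns are appended on the right.
step2 = ColumnTransformer(
    [("other", identity, [2])],
    remainder="passthrough",
)
out2 = step2.fit_transform(out1)
print(out2[0].tolist())  # [1, 0, 2]
```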

## Fit a feature selector using sklearn.feature_selection.RFECV

In [None]:
from category_encoders.ordinal import OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.feature_selection import RFECV
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

if target_type[0] == NUMERICAL:
    estimator = RandomForestRegressor(random_state=0)
else:
    estimator = RandomForestClassifier(random_state=0)
    label_encoder = LabelEncoder()
    y = label_encoder.fit_transform(y)
    
pipeline = Pipeline(steps=[
    ('handle_missing_values',
     ColumnTransformer(
        [('imputer_mean', SimpleImputer(strategy='mean'), numerical_indexes),
         ('imputer_mode', SimpleImputer(strategy='most_frequent'), non_numerical_indexes)],
         remainder='drop')),
    ('handle_categorical_features', ColumnTransformer(
        [('handle_cat_features', OrdinalEncoder(), non_numerical_indexes_after_handle_missing_values)],
        remainder='passthrough')),
    ('rfe_estimator', RFECV(estimator, min_features_to_select=min_features, cv=n_folds))
])

pipeline.fit(X, y)

## Selected features

In [None]:
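# The handle_categorical_features step outputs the encoded non-numerical
# columns first and appends the passthrough numerical columns on the right,
# so the names must follow that order to line up with support_.
selected_features = np.array(columns[non_numerical_indexes].tolist() + columns[numerical_indexes].tolist())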
selected_features = selected_features[pipeline['rfe_estimator'].support_].tolist()
print(selected_features)
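
The same boolean-mask indexing can be checked on a toy example. The names, mask, and ranking below are made-up values that mimic the shape of the `support_` and `ranking_` attributes of a fitted `RFECV`, where selected features have rank 1.

```python
import numpy as np

# Made-up values (assumption) shaped like RFECV's support_ and ranking_.
names = np.array(["a", "b", "c", "d"])
support = np.array([True, False, True, False])
ranking = np.array([1, 3, 1, 2])

# Boolean indexing keeps the names whose mask entry is True.
selected = names[support].tolist()

# Sort all names by rank; the entries after the selected ones are the
# eliminated features, from last eliminated to first eliminated.
by_rank = names[np.argsort(ranking, kind="stable")]
eliminated = by_rank[int(support.sum()):].tolist()

print(selected)    # ['a', 'c']
print(eliminated)  # ['d', 'b']
```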

## Save dataset

Stores the transformed dataset in an object storage.

In [None]:
import pandas as pd
from platiagro import save_dataset

save_dataset(name=dataset, df=pd.DataFrame(pipeline.transform(X), columns=selected_features))

## Save model

Stores the model artifacts in an object storage.<br>
This makes the model available for future deployments.

In [None]:
from platiagro import save_model

save_model(
    pipeline=pipeline,
    selected_features=selected_features
)