# Pre Selection

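Remove features according to the following criteria:
- Variance close to 0
- High correlation with another feature

Missing values (NaN) are imputed before the features are evaluated: the mean for numerical features and the most frequent value for the others.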
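
As a rough illustration of the two removal criteria, the sketch below computes the per-feature variance and the absolute correlation on a small made-up array. The array `X_toy` and its values are assumptions for illustration only and are not part of the component.

```python
import numpy as np

# Toy data: 4 samples, 3 features (values are made up for illustration)
X_toy = np.array([
    [1.0, 2.0, 5.0],
    [2.0, 4.0, 5.0],
    [3.0, 6.0, 5.0],
    [4.0, 8.0, 5.0],
])

# Criterion 1: variance close to 0 (the third column is constant)
variances = X_toy.var(axis=0)
low_variance = variances <= 0.0

# Criterion 2: high absolute correlation between the remaining features
kept = X_toy[:, ~low_variance]
corr = np.abs(np.corrcoef(kept, rowvar=False))

print(variances)
print(corr)
```

Here the second feature is exactly twice the first, so their absolute correlation is 1 and one of the two would be a candidate for removal.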

This notebook shows:
- how to use the [SDK](https://platiagro.github.io/sdk/) to load datasets, save models and other artifacts.
- how to declare parameters and use them to build reusable components.

## Declare parameters
Components may declare (and use) these default parameters:
- dataset
- target

Use these parameters to load/save datasets, models, metrics, and figures with the help of [PlatIAgro SDK](https://platiagro.github.io/sdk/). <br />
You may also declare custom parameters to set when running an experiment.

In [None]:
dataset = "iris" #@param {type:"string"}
target = "Species" #@param {type:"string"}
correlation = 0.95 #@param {type:"number", label:"Correlation", description:"Cutoff value for the correlation between features"}
threshold = 0.0 #@param {type:"number", label:"Threshold", description:"Features with variance lower than the threshold will be removed"}

## Load dataset

Load the whole dataset into a pandas.DataFrame, then split it into the features (`X`) and the target (`y`).

In [None]:
from platiagro import load_dataset

df = load_dataset(name=dataset)
X = df.drop(target, axis=1).to_numpy()
y = df[target].to_numpy()

## Load metadata about the dataset
For example, the cell below gets the feature type (e.g. categorical, numerical, or datetime) of each column in the dataset.

In [None]:
import numpy as np
from platiagro import stat_dataset

metadata = stat_dataset(name=dataset)
featuretypes = metadata["featuretypes"]

columns = df.columns.to_numpy()
featuretypes = np.array(featuretypes)
target_index = np.argwhere(columns == target)
columns = np.delete(columns, target_index)
featuretypes = np.delete(featuretypes, target_index)
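
The target removal above can be tried in isolation on made-up metadata. In this sketch the column names and the feature type strings are hypothetical stand-ins, not values returned by the SDK.

```python
import numpy as np

# Hypothetical column names and feature types, for illustration only
columns_toy = np.array(["SepalLength", "Color", "Species"])
featuretypes_toy = np.array(["Numerical", "Categorical", "Categorical"])
target_toy = "Species"

# Locate the target column and remove it from both arrays,
# so that names and feature types stay aligned position by position
target_index_toy = np.argwhere(columns_toy == target_toy)
columns_toy = np.delete(columns_toy, target_index_toy)
featuretypes_toy = np.delete(featuretypes_toy, target_index_toy)

print(columns_toy)
print(featuretypes_toy)
```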

## Wrapping custom transformer

In [None]:
%%writefile CustomTransformer.py
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class CorrelatedFeatures(BaseEstimator, TransformerMixin):
    """Removes numerical features that are highly correlated with an earlier feature."""

    def __init__(self, c_indexes, correlation):
        # scikit-learn expects constructor arguments to be stored under their own names
        self.c_indexes = c_indexes
        self.correlation = correlation

    def _numerical_features(self, X):
        """Returns the numerical columns of X as a float DataFrame"""
        # ColumnTransformer may return an object array when categorical columns are present
        return pd.DataFrame(np.delete(X, self.c_indexes, axis=1)).astype(float)

    def fit(self, X, y=None):
        """Finds the numerical features to be removed"""
        # Create correlation matrix
        corr_matrix = self._numerical_features(X).corr().abs()

        # Select upper triangle of correlation matrix
        upper_triangle = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

        # Find features whose correlation with an earlier feature exceeds the cutoff
        self._drop_indexes = [column for column in upper_triangle.columns if any(upper_triangle[column] > self.correlation)]
        return self

    def get_support(self):
        """Returns indexes to be removed"""
        return self._drop_indexes

    def transform(self, X):
        """Transform data"""
        # Drop the features selected during fit, so that training and inference keep the same columns
        X_n = self._numerical_features(X).drop(columns=self._drop_indexes)

        # Put every numerical feature on the first indexes and append the categorical ones
        return np.concatenate((X_n, X[:, self.c_indexes]), axis=1)
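
The upper-triangle selection used by the transformer can be checked on a small made-up DataFrame. The helper `correlated_columns` and the frame `toy` below are hypothetical names introduced only for this sketch; they are not part of `CustomTransformer.py`.

```python
import numpy as np
import pandas as pd


def correlated_columns(frame, cutoff):
    """Returns the columns whose absolute correlation with an earlier column exceeds cutoff."""
    corr_matrix = frame.corr().abs()
    # Keep only the entries above the diagonal, so each pair is inspected once
    mask = np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    upper = corr_matrix.where(mask)
    return [c for c in upper.columns if (upper[c] > cutoff).any()]


toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],  # exactly 2 * a
    "c": [4.0, 1.0, 3.0, 2.0],
})

print(correlated_columns(toy, cutoff=0.95))
```

Because only the upper triangle is inspected, the later column of a correlated pair (`b`) is flagged and the earlier one (`a`) is kept.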

## Features configuration

In [None]:
from platiagro.featuretypes import NUMERICAL

# Selects the indexes of numerical and non-numerical features
numerical_indexes = np.where(featuretypes == NUMERICAL)[0]
non_numerical_indexes = np.where(~(featuretypes == NUMERICAL))[0]

# After the handle_missing_values step,
# numerical features are grouped at the beginning of the array
numerical_indexes_after_handle_missing_values = \
    np.arange(len(numerical_indexes))
non_numerical_indexes_after_handle_missing_values = \
    np.arange(len(numerical_indexes), len(featuretypes))
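
The index bookkeeping above can be followed on a made-up list of feature types. This sketch uses plain NumPy and the string `"Numerical"` as a stand-in for `platiagro.featuretypes.NUMERICAL`; both the stand-in value and the toy layout are assumptions.

```python
import numpy as np

# Hypothetical layout: columns 1 and 3 are numerical, 0 and 2 are not
featuretypes_toy = np.array(["Categorical", "Numerical", "Categorical", "Numerical"])
NUMERICAL_TOY = "Numerical"  # stand-in for platiagro.featuretypes.NUMERICAL

num_idx = np.where(featuretypes_toy == NUMERICAL_TOY)[0]
cat_idx = np.where(featuretypes_toy != NUMERICAL_TOY)[0]

# Once the numerical columns are grouped first, their positions become 0..n-1
# and the remaining columns occupy the positions that follow
num_idx_after = np.arange(len(num_idx))
cat_idx_after = np.arange(len(num_idx), len(featuretypes_toy))

# Original column positions, listed in the regrouped order
new_order = np.concatenate([num_idx, cat_idx])
print(new_order)
```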

## Remove low-variance and highly correlated features

In [None]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from CustomTransformer import CorrelatedFeatures

pipeline = Pipeline(steps=[
    ('handle_missing_values', 
     ColumnTransformer(
        [('imputer_mean', SimpleImputer(strategy='mean'), numerical_indexes),
         ('imputer_mode', SimpleImputer(strategy='most_frequent'), non_numerical_indexes)],
         remainder='drop')),
    ('handle_low_variance',
     ColumnTransformer(
         [('variance_threshold', VarianceThreshold(threshold=threshold),
           numerical_indexes_after_handle_missing_values)],
          remainder='passthrough')),
    ('correlated_features',
     CorrelatedFeatures(c_indexes=non_numerical_indexes_after_handle_missing_values,
                        correlation=correlation))
])

X_n = pipeline.fit_transform(X)

# Get the mask of features kept by VarianceThreshold
threshold_features = \
    pipeline.named_steps.handle_low_variance.named_transformers_.variance_threshold.get_support()

# Remove highly correlated features from the features kept by VarianceThreshold
numerical_indexes = \
    np.delete(numerical_indexes[threshold_features], pipeline.named_steps.correlated_features.get_support())

# The pipeline changes the order of the features, so the new order must be kept for the inference step:
# numerical features come first, followed by the non-numerical ones
features_after_pipeline = \
    columns[np.concatenate([numerical_indexes, non_numerical_indexes])]

# Convert back to DataFrame
df = pd.DataFrame(X_n, columns=features_after_pipeline)
df[target] = pd.Series(y)
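
To see how the mask returned by `get_support()` maps the transformed array back to column names, the sketch below runs a reduced pipeline (mean imputation followed by `VarianceThreshold`) on made-up data. The names `names`, `X_toy`, and `toy_pipeline` are hypothetical, and the correlation step is left out to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Toy data with one missing value and one constant column
names = np.array(["a", "b", "constant"])
X_toy = np.array([
    [1.0, 10.0, 7.0],
    [2.0, np.nan, 7.0],
    [3.0, 30.0, 7.0],
])

toy_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("variance_threshold", VarianceThreshold(threshold=0.0)),
])
X_out = toy_pipeline.fit_transform(X_toy)

# Boolean mask of the features kept by VarianceThreshold
mask = toy_pipeline.named_steps.variance_threshold.get_support()

# Rebuild a DataFrame using only the names of the kept features
toy_df = pd.DataFrame(X_out, columns=names[mask])
print(toy_df)
```

The missing value in `b` is replaced by the column mean before the variance is computed, and the constant column is dropped.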

## Save dataset

Stores the transformed dataset in an object storage.

In [None]:
from platiagro import save_dataset

save_dataset(name=dataset, df=df)

## Save model

Stores the model artifacts in an object storage.<br>
It will make the model available for future deployments.

In [None]:
from platiagro import save_model

save_model(pipeline=pipeline,
           columns=columns)