<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/marco-canas/Machine-Learning/blob/main/ML/classes/class_march_3/class_march_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
</table>

# Class 22: Pipeline (March 4, 2022)

### Objective

4. Preprocessing using the pipeline concept

# Libraries required for this class

In [14]:
import numpy as np 
import pandas as pd 

from sklearn.model_selection import train_test_split 

from sklearn.base import BaseEstimator, TransformerMixin 

from sklearn.impute import SimpleImputer 

from sklearn.preprocessing import StandardScaler

from sklearn.preprocessing import OneHotEncoder

from sklearn.pipeline import Pipeline

from sklearn.compose import ColumnTransformer

In [2]:
class AdAtribComb(BaseEstimator, TransformerMixin):
    def __init__(self, ad_dph=True, ad_hph=True, ad_pph=True):  # no *args or **kwargs
        self.ad_dph = ad_dph
        self.ad_hph = ad_hph
        self.ad_pph = ad_pph
        
    def fit(self, X, y=None):
        return self # nothing else to do
    
    def transform(self, X, y=None):
        habitaciones, dormitorios, población, hogares = 3, 4, 5, 6 
        if self.ad_dph:
            dormitorios_por_habitación=X[:,dormitorios]/X[:,habitaciones]
            X=np.c_[X, dormitorios_por_habitación]
        if self.ad_hph:
            habitaciones_por_hogar=X[:,habitaciones]/X[:,hogares]
            X=np.c_[X, habitaciones_por_hogar]
        if self.ad_pph: 
            población_por_hogar = X[:, población]/X[:, hogares]
            X=np.c_[X, población_por_hogar] 
        return X
       

## Getting the data

In [5]:
v = pd.read_csv('vivienda.csv')

## Splitting into training and test sets

In [7]:
v_train,v_test = train_test_split(v,test_size = 0.2, random_state = 513) 

## Splitting into predictors and target

In [9]:
v = v_train.drop('precio', axis = 1)
v_labels = v_train.precio.values.ravel() 

## Splitting into numerical and categorical predictors

In [10]:
v_num = v.drop('proximidad', axis = 1)
v_cat = v.proximidad

## Transformation Pipelines

As you can see, there are many data transformation steps that need to be executed in the right order.

Fortunately, Scikit-Learn provides the `Pipeline` class to help with such sequences of transformations.

Here is a small pipeline for the numerical attributes:

In [11]:
from sklearn.pipeline import Pipeline

pipeline_num = Pipeline([
                        ('imputer', SimpleImputer(strategy="median")),
                        ('attribs_adder', AdAtribComb()),
                        ('std_scaler', StandardScaler()),
])
x_num_tr = pipeline_num.fit_transform(v_num)

The `Pipeline` constructor takes a list of name/estimator pairs defining a sequence of steps.

All but the last estimator must be transformers (i.e., they must implement `fit()` and `transform()`, and therefore support `fit_transform()`).

The names can be anything you like
(as long as they are unique and don’t contain double underscores, __); they will
come in handy later for hyperparameter tuning.
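As a minimal sketch of why the step names matter, the snippet below builds a small two-step pipeline and then reads and changes a step's hyperparameter through the `<step name>__<parameter name>` convention. The name `demo_pipeline` is hypothetical and is not part of the class code.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical two-step pipeline used only for this illustration.
demo_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# A step can be retrieved by the name it was given.
imputer_step = demo_pipeline.named_steps["imputer"]

# Nested parameters are addressed as "<step name>__<parameter name>",
# which is why step names may not contain a double underscore.
demo_pipeline.set_params(imputer__strategy="mean")
new_strategy = demo_pipeline.get_params()["imputer__strategy"]
print(new_strategy)
```

The same `imputer__strategy` key is what a parameter grid would use when a search tool tunes the pipeline.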

When you call the pipeline’s `fit()` method, it calls `fit_transform()` sequentially on all transformers, passing the output of each call as the parameter to the next call until it reaches the final estimator, for which it calls the `fit()` method.

The pipeline exposes the same methods as the final estimator. 

In this example, the last estimator is a StandardScaler, which is a transformer, so the pipeline has a `transform()` method that applies all the transforms to the data in sequence (and of course also a `fit_transform()` method, which is the one we used).
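The chaining described above can be checked by hand. The following sketch, which assumes a small made-up array `X_toy` rather than the housing data, compares the pipeline's output with the result of calling `fit_transform()` on each step manually.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data with one missing value (not the housing data).
X_toy = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [3.0, 30.0],
                  [4.0, 40.0]])

# Pipeline version: each step's output feeds the next step.
toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
X_pipeline = toy_pipeline.fit_transform(X_toy)

# Manual version: the same steps chained explicitly.
imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()
X_manual = scaler.fit_transform(imputer.fit_transform(X_toy))

same_result = np.allclose(X_pipeline, X_manual)
print(same_result)
```

Once fitted, `toy_pipeline.transform(new_data)` reuses the learned medians and scaling statistics without refitting, which is what should be applied to a test set.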

So far, we have handled the categorical columns and the numerical columns separately. 

It would be more convenient to have a single transformer able to handle all columns, applying the appropriate transformations to each column. 

In version 0.20, Scikit-Learn introduced the ColumnTransformer for this purpose, and the good news is that it works great with pandas DataFrames. 

Let’s use it to apply all the transformations to the housing data:

In [13]:
from sklearn.compose import ColumnTransformer
lista_atributos_num = list(v_num)
lista_atributos_cat = ["proximidad"]

full_pipeline = ColumnTransformer([
                                   ("num", pipeline_num, lista_atributos_num),
                                   ("cat", OneHotEncoder(), lista_atributos_cat),
                                    ])

X_prep = full_pipeline.fit_transform(v)

First we import the ColumnTransformer class, next we get the list of numerical column names and the list of categorical column names, and then we construct a ColumnTransformer. 

The constructor requires a list of tuples, where each tuple contains a name, a transformer, and a list of names (or indices) of columns that the transformer should be applied to. In this example, we specify that the numerical columns should be transformed using the `pipeline_num` that we defined earlier, and the categorical columns should be transformed using a `OneHotEncoder`. Finally, we apply this `ColumnTransformer` to the housing data: it applies each transformer to the appropriate columns and concatenates the outputs along the second axis (the transformers must return the same number of rows).

Note that the `OneHotEncoder` returns a sparse matrix, while `pipeline_num` returns a dense matrix. When there is such a mix of sparse and dense matrices, the `ColumnTransformer` estimates the density of the final matrix (i.e., the ratio of nonzero cells), and it returns a sparse matrix if the density is lower than a given threshold (by default, `sparse_threshold=0.3`). In this example, it returns a dense matrix. And that's it! We have a preprocessing pipeline that takes the full housing data and applies the appropriate transformations to each column.
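To see the effect of `sparse_threshold`, here is a small sketch on a made-up DataFrame (the columns `size` and `city` and the helper `build_transformer` are hypothetical). Setting the threshold to `0.0` never yields a sparse result, while `1.0` yields one whenever some transformer returns sparse output and the estimated density is below 1.

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: one numerical and one categorical column.
toy = pd.DataFrame({
    "size": [1.0, 2.0, 3.0, 4.0],
    "city": ["a", "b", "c", "d"],
})

def build_transformer(threshold):
    """Hypothetical helper: same transformers, different sparse_threshold."""
    return ColumnTransformer(
        [("num", StandardScaler(), ["size"]),
         ("cat", OneHotEncoder(), ["city"])],
        sparse_threshold=threshold,
    )

out_dense = build_transformer(0.0).fit_transform(toy)   # density is never < 0
out_sparse = build_transformer(1.0).fit_transform(toy)  # density here is < 1

print(type(out_dense), sparse.issparse(out_dense))
print(type(out_sparse), sparse.issparse(out_sparse))
```

Both results hold the same values in the same column order; only the container type differs, so `out_sparse.toarray()` recovers the dense form when a downstream step needs it.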

## TIP

Instead of using a transformer, you can specify the string "drop" if you want the columns to
be dropped, or you can specify "passthrough" if you want the columns to be left untouched.
By default, the remaining columns (i.e., the ones that were not listed) will be dropped, but
you can set the remainder hyperparameter to any transformer (or to "passthrough") if you
want these columns to be handled differently.
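The options in this tip can be sketched on a made-up DataFrame. The column names `income`, `age`, `row_id`, and `rooms` below are hypothetical and chosen only to illustrate the behavior.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Made-up data with four columns.
toy = pd.DataFrame({
    "income": [1.0, 2.0, 3.0],
    "age": [10.0, 20.0, 30.0],
    "row_id": [101, 102, 103],
    "rooms": [4.0, 5.0, 6.0],
})

steps = [
    ("scaled", StandardScaler(), ["income"]),  # transformed
    ("kept", "passthrough", ["age"]),          # left untouched
    ("ignored", "drop", ["row_id"]),           # explicitly dropped
]

# Default remainder="drop": the unlisted column "rooms" is discarded.
out_default = ColumnTransformer(steps).fit_transform(toy)

# remainder="passthrough": "rooms" is appended after the listed columns.
out_keep_rest = ColumnTransformer(steps, remainder="passthrough").fit_transform(toy)

print(out_default.shape)    # (3, 2)
print(out_keep_rest.shape)  # (3, 3)
```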

If you are using Scikit-Learn 0.19 or earlier, you can use a third-party library such as sklearn-pandas, or you can roll out your own custom transformer to get the same functionality as the ColumnTransformer. 

Alternatively, you can use the FeatureUnion class, which can apply different transformers and
concatenate their outputs. 

But you cannot specify different columns for each transformer; they all apply to the whole data. It is possible to work around this limitation using a custom transformer for column selection (see the Jupyter notebook for an example).
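One way such a workaround might look is sketched below. The `ColumnSelector` class is a hypothetical helper written for this illustration (it is not a Scikit-Learn class), and the data is made up; each branch of the `FeatureUnion` first selects its own columns and then applies its transformer.

```python
import pandas as pd
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: keep only the named DataFrame columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X[self.columns].to_numpy()

# Made-up data: one numerical and one categorical column.
toy = pd.DataFrame({"size": [1.0, 2.0, 3.0], "city": ["a", "b", "a"]})

union = FeatureUnion([
    ("num", Pipeline([("select", ColumnSelector(["size"])),
                      ("scale", StandardScaler())])),
    ("cat", Pipeline([("select", ColumnSelector(["city"])),
                      ("onehot", OneHotEncoder())])),
])

X_union = union.fit_transform(toy)
# FeatureUnion may return a sparse matrix when a branch is sparse.
X_dense = X_union.toarray() if sparse.issparse(X_union) else X_union
print(X_dense.shape)  # (3, 3): one scaled column plus two one-hot columns
```

With Scikit-Learn 0.20 or later, the `ColumnTransformer` shown earlier achieves the same result with less code.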

## Select and Train a Model

At last! You framed the problem, you got the data and explored it, you sampled a training set and a test set, and you wrote transformation pipelines to clean up and prepare your data for Machine Learning algorithms automatically. 

You are now ready to select and train a Machine Learning model.