# Creating Pipelines
We need to handle numerical and categorical attributes differently. Numerical attributes need to be scaled, whereas for categorical columns we need to fill in the missing values and then encode the categories as numbers. To apply this sequence of transformations we will use sklearn's Pipeline class. We will also build custom transformers that can be used directly in a Pipeline.

#### ColumnsSelector Pipeline
The sklearn transformers used here do not work with pandas DataFrame columns directly, so we will write our own custom transformer that selects the corresponding attributes (either numerical or categorical) by dtype.

In [1]:
# Selects only the columns of a given dtype from our dataset
from sklearn.base import BaseEstimator, TransformerMixin
class ColumnsSelector(BaseEstimator, TransformerMixin):
  
  def __init__(self, type):
    self.type = type
  
  def fit(self, X, y=None):
    return self
  
  def transform(self,X):
    return X.select_dtypes(include=[self.type])
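
As a quick check, the sketch below applies the same selection logic to a small made-up DataFrame. The column names and values are illustrative assumptions, not the dataset used in this notebook, and the class is repeated only so that the example runs on its own.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnsSelector(BaseEstimator, TransformerMixin):
    """Same transformer as above, repeated so the example is self-contained."""

    def __init__(self, type):
        self.type = type

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.select_dtypes(include=[self.type])


# Hypothetical toy data: two numerical columns and one categorical column.
toy = pd.DataFrame({
    "age": [25, 38, 52],
    "hours": [40.0, 50.0, 35.5],
    "occupation": pd.Series(["Sales", "Tech-support", "Sales"], dtype=object),
})

# 'number' matches both integer and float columns; 'object' matches strings.
numeric_part = ColumnsSelector(type="number").fit_transform(toy)
categorical_part = ColumnsSelector(type="object").fit_transform(toy)

print(list(numeric_part.columns))
print(list(categorical_part.columns))
```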

#### Numerical Data Pipeline

1. We select the numerical attributes using the ColumnsSelector transformer defined above and then scale the values using StandardScaler.
2. A Pipeline is a sequence of transforms with a final estimator.
3. It sequentially applies a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit. The transformers in the pipeline can be cached using the memory argument.
4. steps : list
   List of (name, transform) tuples (implementing fit/transform) that are chained, in the order in which they are chained, with the last object an estimator.

In [2]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline(
    steps=[
        # 'number' selects integer and float columns regardless of platform int size
        ("num_attr_selector", ColumnsSelector(type='number')),
        ("scaler", StandardScaler())
    ]
)
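
To see what such a pipeline produces, here is a minimal sketch on made-up data. It uses a FunctionTransformer as a stand-in for ColumnsSelector so that the block runs on its own; the helper name `select_numeric`, the pipeline name `toy_num_pipeline`, and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def select_numeric(df):
    """Stand-in for ColumnsSelector: keep only the numerical columns."""
    return df.select_dtypes(include=["number"])


# Hypothetical toy data with two numerical columns and one categorical column.
toy = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "hours": [35, 40, 45, 60],
    "occupation": pd.Series(["Sales", "Tech", "Sales", "Tech"], dtype=object),
})

toy_num_pipeline = Pipeline(steps=[
    ("num_attr_selector", FunctionTransformer(select_numeric)),
    ("scaler", StandardScaler()),
])

# fit_transform runs each step in order: select columns, then scale them.
scaled = toy_num_pipeline.fit_transform(toy)

print(scaled.shape)                             # rows x numerical columns
print(np.allclose(scaled.mean(axis=0), 0.0))    # each column is centred
print(np.allclose(scaled.std(axis=0), 1.0))     # each column has unit variance
```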

#### Categorical Data Pipeline
1. We need to replace the missing values in the categorical columns. We will replace them with the most frequently occurring value in each column. Older sklearn releases come with the Imputer class to handle missing values (newer releases replace it with SimpleImputer).
2. However, Imputer works only with numerical values. We will write a custom transformer that accepts a list of columns whose missing values should be replaced and the strategy used to fill them.

In [3]:
class CategoricalImputer(BaseEstimator, TransformerMixin):
  
  def __init__(self, columns = None, strategy='most_frequent'):
    self.columns = columns
    self.strategy = strategy
    
    
  def fit(self,X, y=None):
    if self.columns is None:
      self.columns = X.columns
    
    if self.strategy == 'most_frequent':
      self.fill = {column: X[column].value_counts().index[0] for 
        column in self.columns}
    else:
      self.fill ={column: '0' for column in self.columns}
      
    return self
      
  def transform(self,X):
    X_copy = X.copy()
    for column in self.columns:
      X_copy[column] = X_copy[column].fillna(self.fill[column])
    return X_copy
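
The fill logic inside fit and transform can be tried in isolation with plain pandas. The sketch below uses made-up values for two of the columns named later in the categorical pipeline; the data itself is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "workClass": pd.Series(["Private", np.nan, "Private", "State-gov"], dtype=object),
    "occupation": pd.Series(["Sales", "Sales", np.nan, "Tech-support"], dtype=object),
})

# value_counts() sorts by frequency in descending order and skips NaN,
# so index[0] is the most frequent non-missing value of each column.
fill = {column: toy[column].value_counts().index[0] for column in toy.columns}

# Passing a dict to fillna fills each column with its own value.
filled = toy.fillna(value=fill)

print(fill)
print(filled)
```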


Most machine learning models expect numerical values. We will use pd.get_dummies to convert the categorical values to numerical values. This is similar to using OneHotEncoder, except that older versions of OneHotEncoder require numerical (integer-coded) columns.

We need to merge the train and test datasets before using pd.get_dummies, as there might be classes in the test dataset that are not present in the training dataset.

For this, in the fit method, we will concatenate the train and test datasets and collect all the possible values of each column.

In the transform method, we will convert each column to a Categorical type and specify the list of categories that the column can take. pd.get_dummies will then create a column of all zeros for any category in that list that does not occur in the data being transformed.

The transformer also takes an argument dropFirst, which indicates whether to drop the first dummy column of each attribute after creating the dummy columns with pd.get_dummies. Dropping it avoids multicollinearity, since the dropped column is fully determined by the remaining ones. By default, the value is set to True.

In [4]:
import pandas as pd
from pandas.api.types import CategoricalDtype

class CategoricalEncoder(BaseEstimator, TransformerMixin):
  
  def __init__(self, dropFirst=True):
    self.categories=dict()
    self.dropFirst=dropFirst
    
  def fit(self, X, y=None):
    # Relies on the train_data and test_data DataFrames defined in the notebook's global scope
    join_df = pd.concat([train_data, test_data])
    join_df = join_df.select_dtypes(include=['object'])
    for column in join_df.columns:
      self.categories[column] = join_df[column].value_counts().index.tolist()
    return self
    
  def transform(self, X):
    X_copy = X.copy()
    X_copy = X_copy.select_dtypes(include=['object'])
    for column in X_copy.columns:
      X_copy[column] = X_copy[column].astype(
          CategoricalDtype(self.categories[column]))
    return pd.get_dummies(X_copy, drop_first=self.dropFirst)
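
The effect of fixing the category list before calling pd.get_dummies can be seen in a small sketch. The category names and the variable names (`all_categories`, `test_like`) are assumptions for illustration, standing in for the values collected in fit and for a dataset that lacks one of them.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical full list of categories, as collected from train and test together.
all_categories = ["Private", "Self-emp", "State-gov"]

# A frame in which "Self-emp" never occurs.
test_like = pd.DataFrame({
    "workClass": pd.Series(["Private", "State-gov"], dtype=object),
})
test_like["workClass"] = test_like["workClass"].astype(
    CategoricalDtype(all_categories))

# One dummy column per declared category, even for the one that is absent.
dummies = pd.get_dummies(test_like, drop_first=False)
# With drop_first=True the dummy column of the first category is removed.
dummies_dropped = pd.get_dummies(test_like, drop_first=True)

print(list(dummies.columns))
print(int(dummies["workClass_Self-emp"].sum()))
print(list(dummies_dropped.columns))
```

Because the column layout depends only on the declared categories, train and test data transformed this way end up with the same dummy columns.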

#### Complete Categorical Pipeline

The categorical pipeline chains the column selector, the imputer, and the encoder defined above.

In [5]:
cat_pipeline = Pipeline(steps=[
    ("cat_attr_selector", ColumnsSelector(type='object')),
    ("cat_imputer", CategoricalImputer(columns=
          ['workClass','occupation', 'native-country'])),
    ("encoder", CategoricalEncoder(dropFirst=True))
])

#### Complete Pipeline

We now have two transformer pipelines, num_pipeline and cat_pipeline. We can merge them using FeatureUnion, which applies both to the same input and concatenates their outputs column-wise.

In [7]:
from sklearn.pipeline import FeatureUnion

full_pipeline = FeatureUnion([("num_pipe", num_pipeline), ("cat_pipeline", cat_pipeline)])
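
The sketch below illustrates how FeatureUnion stacks the outputs of a numerical and a categorical branch side by side. To stay self-contained it swaps the custom transformers for FunctionTransformer stand-ins; the helper names (`select_numeric`, `encode_categorical`), the name `toy_union`, and the toy data are all assumptions for illustration.

```python
import pandas as pd
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def select_numeric(df):
    """Stand-in for the numerical ColumnsSelector."""
    return df.select_dtypes(include=["number"])


def encode_categorical(df):
    """Stand-in for the categorical branch: select object columns, one-hot encode."""
    categorical = df.select_dtypes(include=["object"])
    return pd.get_dummies(categorical, drop_first=True).astype(float)


# Hypothetical toy data: one numerical column, one categorical with 3 classes.
toy = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "occupation": pd.Series(["Sales", "Tech", "Sales", "Exec"], dtype=object),
})

toy_union = FeatureUnion([
    ("num_pipe", Pipeline(steps=[
        ("select", FunctionTransformer(select_numeric)),
        ("scaler", StandardScaler()),
    ])),
    ("cat_pipe", FunctionTransformer(encode_categorical)),
])

# Each branch sees the full input; the results are concatenated column-wise.
combined = toy_union.fit_transform(toy)

print(combined.shape)  # 1 scaled column + 2 dummy columns (first class dropped)
```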