## Handling categorical variables
In this notebook, I show an easy way to transform categorical variables. The code might look complex, but read the embedded comments to understand the logic.

## Drawbacks of existing encoders
I wrote the code in the next block because the existing encoders have the following drawbacks:
1. We can use pandas get_dummies().
   But it cannot be used in a pipeline unless we wrap it in a custom transformer.
   For example, if we have ['r','g'] in column color, then get_dummies() will give us the following:

   color_r color_g
   ------- -------
   1       0
   0       1

   But if our test data has an extra class 'b' or does not have 'r' or 'g', then we will get different dummy variables.
   Another drawback is that, by default, color_r and color_g are perfectly correlated, so one variable is sufficient to represent both values.
   If we have n classes, then we need only n-1 dummy variables to represent all the classes.

2. We can use LabelBinarizer(). But it has the following drawbacks:

2a. LabelBinarizer cannot take more than one categorical column as input.

2b. LabelBinarizer behaves differently depending on the number of unique classes in its input.
    For example, suppose we fit a LabelBinarizer on a categorical variable with values ['red','blue'].
    If we then apply transform() to ['red','blue'], we get a single column:
    [1, 0]
    However, if we use the same fitted LabelBinarizer to transform ['red','blue','green'], then
    we get a two-column matrix:
    [[0,1],
     [1,0],
     [0,0]]
    Ideally, we should still get a single column:
    [1, 0, 0]

2c. If a variable has only 2 classes, LabelBinarizer creates only one dummy variable to
    represent the data. But if it has more than 2 classes, the number of dummy variables equals the number of
    classes. Ideally, if we have n classes, we should get only n-1 dummy variables.
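
The drawbacks listed above can be reproduced with a short, self-contained check. This is only a sketch: the `train` and `test` frames and all variable names are made up for illustration, and the exact behaviour may vary across pandas and scikit-learn versions.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Drawback 1: get_dummies derives its columns from the data it is given,
# so training and test frames can end up with different dummy columns.
train = pd.DataFrame({"color": ["r", "g"]})
test = pd.DataFrame({"color": ["r", "b"]})  # 'g' is missing, 'b' is new
train_cols = list(pd.get_dummies(train).columns)
test_cols = list(pd.get_dummies(test).columns)
print(train_cols)  # ['color_g', 'color_r']
print(test_cols)   # ['color_b', 'color_r']

# Drawback 2b: the output shape of a binarizer fitted on 2 classes depends
# on how many distinct values appear in the data passed to transform().
lb = LabelBinarizer().fit(["red", "blue"])
two = lb.transform(["red", "blue"])             # only values seen in fit
three = lb.transform(["red", "blue", "green"])  # 'green' was not seen in fit
print(two.shape)    # (2, 1): a single column
print(three.shape)  # (3, 2): one column per fitted class, as described above

# Drawback 2c: with more than 2 fitted classes we get n columns, not n-1.
three_class = LabelBinarizer().fit(["r", "g", "b"]).transform(["r", "g", "b"])
print(three_class.shape)  # (3, 3)
```

The row for the unseen value 'green' contains only zeros, which is why the transformer below has to handle the one-column and two-column cases separately.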

The following transformer works only on a pandas DataFrame. It can process multiple categorical
columns and creates dummy variables on the principle of n-1 dummy variables per categorical variable, if the
categorical variable has n classes.
NOTE: we will use a global variable categorical_column_transformed to collect the dummy variable names.

In [5]:
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelBinarizer
categorical_column_transformed = []
class CatMultiLabelTransformer(BaseEstimator, TransformerMixin):
    def __init__(self): 
        pass ##Nothing to do
    
    def fit(self, X, y=None):
        
        ##Check if input object is a pandas df, else raise exception
        if not isinstance(X, pd.DataFrame):
            raise TypeError("CatMultiLabelTransformer.fit() accepts only a pandas DataFrame")
        
        ##Create an empty list, which will be updated with LabelBinarizer for each column        
        self.binarizers=[]
        ##Declare the categorical_column_transformed as global variable, so that we can update it
        ##with column names used in the final transformation
        
        global categorical_column_transformed 
        ##Reset the global variable to an empty list, to flush out any previous data.
        categorical_column_transformed = []
        
        ##Fit a LabelBinarizer for each column and save the models in a class variable
        for i in X.columns:
            lb = LabelBinarizer()
            self.binarizers.append(lb.fit(X[i]))
            
            ##Prefix the dummy variables (class names)  with the variable name (or column name), 
            ##so that we can distinguish the data
            ##If a class called 'male' is present in two variables, then the column prefix will help
            ##us to distinguish between the variables data in the dummy variables
            categorical_column_transformed.append([str(i) + "_" + str(j) for j in list(lb.classes_)])
            
        ##Declare a list object to collect the variable names after applying LabelBinarizer on each column
        self.X_fit_classes = []    
        
        for i,j in enumerate(categorical_column_transformed):
              ##If we have only two classes like 'male' and 'female', then we should consider 
              ##only one of the classes, since LabelBinarizer will encode the 2 values using one dummy variable only        
              if len(j) == 2:
                 self.X_fit_classes.append(j[1])
              else:
              ##If we have more than two classes like 'r', 'g' 'b' 
              ## LabelBinarizer will encode the data using 3 dummy variables, and hence we have to consider first 2 variables only
              ## In general, if we have n (where n>2) dummy variables given by LabelBinarizer, then we need to consider only n-1 variables
                 self.X_fit_classes = self.X_fit_classes+j[0:len(j)-1]
        ##Update the global variable                 
        categorical_column_transformed = self.X_fit_classes         
                 
        return self

    def transform(self, X, y=None):
            ##Check if input object is a pandas df, else raise exception
            if not isinstance(X, pd.DataFrame):
                raise TypeError("CatMultiLabelTransformer.transform() accepts only a pandas DataFrame")

            ##Declare a numpy matrix with one placeholder column.
            ##The transformed columns of each variable are appended to this matrix;
            ##the placeholder column is discarded at the end.
            X_transform = np.empty([len(X),1])
            
            ##Collect the class names again, as we did in fit() with the X_fit_classes list.
            ##At the end we select the X_fit_classes columns from the transformed data
            ##(a KeyError is raised if any of them is missing).
            X_transform_classes = []

            #Process all the columns, one after the other             
            for i,j in enumerate(X.columns):
                ##Collect the transformed data into a temp variable
                temp_transformed_data = self.binarizers[i].transform(X[j])
                ##Get the shape of the transformed data
                temp_transformed_data_cols = temp_transformed_data.shape[1]
                
                #If the transformed data has more than 2 columns (more than 2 classes),
                #then collect all the class names, since
                #label binarizer creates n dummy variables for n classes when n > 2.
                #Each class name is prefixed with the column name.
                if temp_transformed_data_cols > 2:
                    X_transform_classes.append([j +"_"+ str(c) for c in list(self.binarizers[i].classes_)])
                    ##Update the recurrent matrix with new data                    
                    X_transform=np.c_[X_transform,temp_transformed_data]

                #If the binarizer was fit on 2 classes but the data being transformed has more than
                #2 distinct values, label binarizer returns 2 columns (one per fitted class).
                #Keep only the second column, which is 1 where the value equals lb.classes_[1].
                #Here also we will prefix the class name with the column name
                if temp_transformed_data_cols == 2:
                    X_transform_classes.append([j + "_" + str(list(self.binarizers[i].classes_)[1])])
                    X_transform=np.c_[X_transform,temp_transformed_data[:,1]]
                 
        
                #If the number of classes is 1 or 2, then label binarizer will create one variable only
                #Here also we will prefix the class name with the column name
                if temp_transformed_data_cols == 1:
                  ##Check if only one dummy variable is created due to the presence of only one class
                  if len(self.binarizers[i].classes_) == 1:
                      X_transform_classes.append([j + "_" + str(list(self.binarizers[i].classes_)[0])])
                  ##Check if only one dummy variable is created due to the presence of only two classes
                  if len(self.binarizers[i].classes_) == 2:
                      X_transform_classes.append([j + "_" + str(list(self.binarizers[i].classes_)[1])])
                  X_transform=np.c_[X_transform,temp_transformed_data]
            
            ##Flatten the list of lists into a single list
            X_transform_classes=[item for sublist in X_transform_classes for item in sublist]   
            
            #Remove the 0th column of X_transform: it is the placeholder column
            #that was needed only to initialize the matrix
            X_transform = np.delete(X_transform,0,axis=1)
            
            #Create a data frame with the dummy variable names.
            X_transform = pd.DataFrame(X_transform,columns = X_transform_classes)
            
            #But we need to drop the extra dummy variables (if we have n classes, we need only n-1 dummy variables)
            X_transform = X_transform[self.X_fit_classes]
            return X_transform

## Testing the above code

In [8]:
def test_CatMultiLabelTransformer(input=True):
  if input:
    ct = CatMultiLabelTransformer()
    demo_df_1=pd.DataFrame(['red','green'],columns=['Color'])
    ct.fit(demo_df_1)
    print("\n\nTraining data:\n{}".format(demo_df_1))
    print("\n\nData to be transformed is given below:.\nObserve that we have new class 'blue'")
    print(" and 'blue' was not existing the training data.\n\n")
    demo_df_2 = pd.DataFrame(['red','green','blue'],columns=['Color'])
    print(demo_df_2)
    print("\nAfter transforming/encoding...\n")
    print(pd.DataFrame(ct.transform(demo_df_2),columns=categorical_column_transformed))
    print("But if we have 3 classes in the training data:")
    print("Another training data set:")
    demo_df_1=pd.DataFrame(['S','M','XL'],columns=['Sizes'])
    ct.fit(demo_df_1)
    print("\n\nFitted data frame or training data:\n{}".format(demo_df_1))
    print("Transformed data:")
    print(pd.DataFrame(ct.transform(demo_df_1),columns=categorical_column_transformed))
  else:
    pass  

#To test CatMultiLabelTransformer, call test_CatMultiLabelTransformer(input=True)      
test_CatMultiLabelTransformer(input=True)




Training data:
   Color
0    red
1  green


Data to be transformed is given below:.
Observe that we have new class 'blue'
 and 'blue' was not existing the training data.


   Color
0    red
1  green
2   blue

After transforming/encoding...

   Color_red
0        1.0
1        0.0
2        0.0
But if we have 3 classes in the training data:
Another training data set:


Fitted data frame or training data:
  Sizes
0     S
1     M
2    XL
Transformed data:
   Sizes_M  Sizes_S
0      0.0      1.0
1      1.0      0.0
2      0.0      0.0
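
For comparison, the n-1 principle can also be sketched without LabelBinarizer, by fixing the category list at fit time with a pandas `Categorical` and then calling `get_dummies(..., drop_first=True)`. This is a different technique from the transformer above, not a replacement for it. `fit_categories` and `encode` are hypothetical helper names, the data is made up, and this sketch always drops the first (alphabetically smallest) class, which is not necessarily the class the transformer above drops.

```python
import pandas as pd


def fit_categories(train_df):
    """Record the sorted unique values of each column (hypothetical helper)."""
    return {col: sorted(train_df[col].dropna().unique()) for col in train_df.columns}


def encode(df, categories):
    """Encode each column against the categories recorded at fit time.

    Values that were not seen at fit time become NaN in the Categorical,
    so they are encoded as a row of zeros.
    """
    parts = []
    for col, cats in categories.items():
        fixed = pd.Series(pd.Categorical(df[col], categories=cats), index=df.index)
        # drop_first=True keeps n-1 dummy columns for n known categories
        parts.append(pd.get_dummies(fixed, prefix=col, drop_first=True, dtype=int))
    return pd.concat(parts, axis=1)


train = pd.DataFrame({"Color": ["red", "green", "red"], "Sizes": ["S", "M", "XL"]})
test = pd.DataFrame({"Color": ["red", "green", "blue"], "Sizes": ["S", "XL", "M"]})

categories = fit_categories(train)
encoded = encode(test, categories)
print(list(encoded.columns))  # ['Color_red', 'Sizes_S', 'Sizes_XL']
print(encoded)
```

Because the category list comes from the training data, the output columns stay the same regardless of which values appear in the data being encoded, and the unseen value 'blue' is encoded as 0 in `Color_red`.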
