## OneHotEncoder using sklearn

### **The function takes the training and test datasets as arguments and returns the one-hot-encoded test dataset.**

### **The ohe_fit_transform function has been created as an alternative to the get_dummies function used for ONE HOT ENCODING.**

**Limitations of get_dummies:**

1. If a categorical column in the training dataset has 'n' distinct values, get_dummies creates 'n' dummy columns for it, or 'n-1' when `drop_first=True` is used, as per **ONE HOT ENCODING**.
2. As a result, the modified training dataset has its remaining original columns plus these dummy columns.
3. Assume that a model has been trained on this modified dataset.
4. Now suppose that the future dataset for prediction has fewer than 'n' distinct values in that column.
5. With get_dummies, the modified future dataset (test data) will then have a different number of columns from the training dataset.
6. Because of this mismatch in the number of columns, the model fitted above cannot be applied to the test data.
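
The mismatch described in the steps above can be reproduced with a small sketch. The toy DataFrames and the column names `colour` and `size` are hypothetical and chosen only for illustration. The training data contains three distinct colours, while the future data contains only two.

```python
import pandas as pd

# Hypothetical toy data: the training set has three colours, the future set only two.
train = pd.DataFrame({"colour": ["red", "green", "blue"], "size": [1, 2, 3]})
future = pd.DataFrame({"colour": ["red", "green"], "size": [4, 5]})

# Encode each dataset independently, as one would with get_dummies.
train_dummies = pd.get_dummies(train, drop_first=True)
future_dummies = pd.get_dummies(future, drop_first=True)

train_columns = list(train_dummies.columns)
future_columns = list(future_dummies.columns)

# The two encoded datasets do not share the same columns.
columns_match = train_columns == future_columns
print(train_columns)
print(future_columns)
print(columns_match)
```

Because get_dummies only looks at the values present in the DataFrame it is given, each call derives its own set of dummy columns, which is what leads to the discrepancy.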

**Acknowledgment**: I am thankful to scikit-learn.org for providing the packages and functions on which this function is based. They are documented at the following link.
    
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

In [None]:
def ohe_fit_transform(Train_data,Test_data):
    
    '''
    ohe_fit_transform function has been created 
    as an alternative to get_dummies function used for ONE HOT ENCODING.
    
    It uses the sklearn package's OneHotEncoder and its fit and transform methods.
    
    Limitations of get_dummies:
    
    1. If a categorical column in the training dataset has 'n' distinct values, get_dummies creates
       'n' dummy columns for it, or 'n-1' when drop_first=True is used, as per ONE HOT ENCODING.
    2. As a result, the modified training dataset has its remaining original columns plus these dummy columns.
    3. Assume that a model has been trained on this modified dataset.
    4. Now suppose that the future dataset for prediction has fewer than 'n' distinct values in that column.
    5. With get_dummies, the modified future dataset (test data) will then have a different number
       of columns from the training dataset.
    6. Because of this mismatch in the number of columns, the model fitted above cannot be applied to the test data.
    
    No ready-made function was found that takes datasets with categorical columns as inputs
    and resolves the above limitations, so the small ohe_fit_transform function has been written.
    It takes two datasets as arguments: Train_data and Test_data.
    
    The function returns a new dataset in which the categorical columns of the test dataset
    are replaced by dummy columns.
    
    To encode the data used for creating the model, pass the training dataset for both arguments:
    
    OHE_Train_data = ohe_fit_transform (Train_data = Training_data, Test_data = Training_data)
    
    To encode future data for prediction, pass it as the Test_data argument:
    
    OHE_Future_data = ohe_fit_transform (Train_data = Training_data, Test_data = Future_data)
    
    Acknowledgment: I am thankful to scikit-learn.org for providing the necessary packages and functions
    at the following link based on which this function has been created.
    
    https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
    
    '''
    
    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    # List the columns of the training dataset with "object" dtype for one-hot encoding
    # (the np.object alias was removed in NumPy 1.24, so the builtin object is used instead)
    cat_cols = [col for col in Train_data.columns if Train_data[col].dtype == object]

    # Store such categorical columns of Train Dataset in a newly formed DataFrame
    Train_data_cat_cols = Train_data[cat_cols] 

    # Fitting the Train Data for One Hot Encoding
    ohe_enc = OneHotEncoder(drop='first').fit(Train_data_cat_cols) 
    
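    # Store the names of the newly formed dummy features
    # (get_feature_names was removed in scikit-learn 1.2; get_feature_names_out needs 1.0 or later)
    cat_labels_new = list(ohe_enc.get_feature_names_out(cat_cols))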
    
    # Select the same categorical columns from the test dataset
    Test_data_cat_cols = Test_data[cat_cols] 

    # One-hot encode the test data with the encoder fitted on the training data.
    # The test index is reused so that the join below aligns rows correctly
    # even when Test_data does not have a default 0..n-1 index.
    enc_df = pd.DataFrame(
        ohe_enc.transform(Test_data_cat_cols).toarray(),
        columns=cat_labels_new,
        index=Test_data.index,
    ).astype(int)
    
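    # Work on a copy so that the caller's DataFrame is not modified
    Test_data_temp = Test_data.copy()

    # Replace the original categorical columns with their dummy columns
    Test_data_temp.drop(cat_cols,axis=1,inplace=True)
    Test_data_temp = Test_data_temp.join(enc_df)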
    
    return Test_data_temp
