
# Required Libraries


In [1]:
import pandas as pd
import numpy as np

# Define Train and Test Data

**Note**

The train data contains NaN values. In one-hot encoding with `get_dummies()`, a NaN value becomes a zero vector.

The test data contains some labels that do not appear in the train data. Once the test columns are aligned with the train columns, these unknown values also become zero vectors.
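
The NaN behaviour described in this note can be checked with a small sketch. The `demo` frame and its values are made up for illustration, and `dtype=int` is passed only so the result prints as 0/1 regardless of the pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
demo = pd.DataFrame({"A": ["p", "q", np.nan]})

# dummy_na defaults to False, so no indicator column is created for NaN
encoded = pd.get_dummies(demo, prefix=["A"], dtype=int)
print(encoded)

# The row that held NaN is a zero vector
print(encoded.iloc[2].tolist())  # [0, 0]
```

If a separate indicator for missing values is wanted instead, `get_dummies` accepts `dummy_na=True`.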

In [11]:
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
train_data = pd.DataFrame({
    "A": ["p", "q", "p", "q", "q", np.nan, "q", np.nan],
    "B": [1, 2, 3, 4, 5, 6, 7, np.nan],
    "C": ["x", "x", "x", "y", "z", "z", np.nan, np.nan],
})
train_data

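|   | A | B | C |
|---|---|---|---|
| 0 | p | 1.0 | x |
| 1 | q | 2.0 | x |
| 2 | p | 3.0 | x |
| 3 | q | 4.0 | y |
| 4 | q | 5.0 | z |
| 5 | NaN | 6.0 | z |
| 6 | q | 7.0 | NaN |
| 7 | NaN | NaN | NaN |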


In [12]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       6 non-null      object 
 1   B       7 non-null      float64
 2   C       6 non-null      object 
dtypes: float64(1), object(2)
memory usage: 320.0+ bytes


In [13]:
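# np.nan is used because the np.NaN alias was removed in NumPy 2.0
test_data = pd.DataFrame({
    "A": ["p", "q", "p", "q", "q", np.nan, "q", np.nan, "t", "q"],
    "B": [1, 2, 3, 4, 5, 6, 7, np.nan, 8, 9],
    "C": ["x", "x", "x", "y", "z", "z", np.nan, np.nan, "u", "u"],
})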
test_data

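|   | A | B | C |
|---|---|---|---|
| 0 | p | 1.0 | x |
| 1 | q | 2.0 | x |
| 2 | p | 3.0 | x |
| 3 | q | 4.0 | y |
| 4 | q | 5.0 | z |
| 5 | NaN | 6.0 | z |
| 6 | q | 7.0 | NaN |
| 7 | NaN | NaN | NaN |
| 8 | t | 8.0 | u |
| 9 | q | 9.0 | u |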


In [14]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       8 non-null      object 
 1   B       9 non-null      float64
 2   C       8 non-null      object 
dtypes: float64(1), object(2)
memory usage: 368.0+ bytes


# Optional Step for Faster Performance

By default, pandas stores categorical columns with the `object` dtype. That works well when values rarely repeat. When a column repeats the same categories many times, we can take advantage of that by casting it from `object` to `category`. This gives more efficient memory usage and faster computation for one-hot encoding (`get_dummies()`).


---


For more information, see this [StackOverflow](https://stackoverflow.com/questions/30601830/when-to-use-category-rather-than-object) discussion.

In [15]:
# optional
train_data["A"]=train_data["A"].astype("category")
train_data["C"]=train_data["C"].astype("category")

In [16]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   A       6 non-null      category
 1   B       7 non-null      float64 
 2   C       6 non-null      category
dtypes: category(2), float64(1)
memory usage: 408.0 bytes
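
The memory benefit is easier to see on a larger column than the eight-row frame used here. The following is a minimal sketch on a made-up column of 100,000 rows drawn from three labels. The data and variable names are assumptions for illustration, and the exact byte counts depend on the pandas version and platform.

```python
import numpy as np
import pandas as pd

# Hypothetical column: 100,000 rows drawn from only three labels
rng = np.random.default_rng(0)
labels = pd.Series(rng.choice(["x", "y", "z"], size=100_000), dtype=object)

# deep=True counts the Python string objects, not just the pointers to them
as_object = labels.memory_usage(deep=True)
as_category = labels.astype("category").memory_usage(deep=True)

print(f"object:   {as_object} bytes")
print(f"category: {as_category} bytes")
```

With `category`, each row stores a small integer code and the three labels are stored once, which is why the second number comes out much smaller.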


# One Hot Encoding on Train Data

We apply one-hot encoding to the categorical features {"A", "C"}.

In [22]:
numerical_cols = ["B"]
categorical_cols = ["A","C"]

In [18]:
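# one hot on selected subset of data
# dtype=int keeps the 0/1 output shown below (pandas >= 2.0 returns booleans by default)
selected_transformed_data = pd.get_dummies(train_data[categorical_cols], prefix=categorical_cols, dtype=int)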
selected_transformed_data

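|   | A_p | A_q | C_x | C_y | C_z |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 1 | 0 | 0 |
| 2 | 1 | 0 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 | 0 | 1 |
| 5 | 0 | 0 | 0 | 0 | 1 |
| 6 | 0 | 1 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 |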


The next step is crucial: we save the column names of the transformed data after one-hot encoding, so that the train and test data end up with the same feature set.

In [19]:
selected_transformed_data_columns = list(selected_transformed_data.columns)
selected_transformed_data_columns

['A_p', 'A_q', 'C_x', 'C_y', 'C_z']

In [25]:
# Transformed training data
transformed_train_data = pd.concat([train_data[numerical_cols],selected_transformed_data],axis=1)
transformed_train_data

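|   | B | A_p | A_q | C_x | C_y | C_z |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 2.0 | 0 | 1 | 1 | 0 | 0 |
| 2 | 3.0 | 1 | 0 | 1 | 0 | 0 |
| 3 | 4.0 | 0 | 1 | 0 | 1 | 0 |
| 4 | 5.0 | 0 | 1 | 0 | 0 | 1 |
| 5 | 6.0 | 0 | 0 | 0 | 0 | 1 |
| 6 | 7.0 | 0 | 1 | 0 | 0 | 0 |
| 7 | NaN | 0 | 0 | 0 | 0 | 0 |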


In [26]:
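# one hot on test data
selected_transformed_test_data = pd.get_dummies(test_data[categorical_cols], prefix=categorical_cols, dtype=int)
# For uniformity in the feature set: keep only the train columns, in the train order.
# reindex fills a train column that is absent from the test data with zeros,
# where plain column selection would raise a KeyError.
selected_transformed_test_data = selected_transformed_test_data.reindex(columns=selected_transformed_data_columns, fill_value=0)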
selected_transformed_test_data

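|   | A_p | A_q | C_x | C_y | C_z |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 1 | 0 | 0 |
| 2 | 1 | 0 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 | 0 | 1 |
| 5 | 0 | 0 | 0 | 0 | 1 |
| 6 | 0 | 1 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 1 | 0 | 0 | 0 |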


In [27]:
# Transformed test data
transformed_test_data = pd.concat([test_data[numerical_cols],selected_transformed_test_data],axis=1)
transformed_test_data

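|   | B | A_p | A_q | C_x | C_y | C_z |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 2.0 | 0 | 1 | 1 | 0 | 0 |
| 2 | 3.0 | 1 | 0 | 1 | 0 | 0 |
| 3 | 4.0 | 0 | 1 | 0 | 1 | 0 |
| 4 | 5.0 | 0 | 1 | 0 | 0 | 1 |
| 5 | 6.0 | 0 | 0 | 0 | 0 | 1 |
| 6 | 7.0 | 0 | 1 | 0 | 0 | 0 |
| 7 | NaN | 0 | 0 | 0 | 0 | 0 |
| 8 | 8.0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 9.0 | 0 | 1 | 0 | 0 | 0 |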
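
The test data in this walkthrough happens to contain every category seen in training. A sketch of the opposite case, where the test set lacks some train categories and adds unseen ones, is given here. The helper name `one_hot_like_train` and the toy frames are hypothetical and not part of the notebook. The sketch assumes `DataFrame.reindex` is an acceptable way to align the columns.

```python
import numpy as np
import pandas as pd

def one_hot_like_train(df, categorical_cols, train_dummy_columns):
    """Hypothetical helper: encode df and align it to the training columns."""
    dummies = pd.get_dummies(df[categorical_cols], prefix=categorical_cols, dtype=int)
    # reindex drops columns for unseen categories and adds
    # missing training columns filled with zeros
    return dummies.reindex(columns=train_dummy_columns, fill_value=0)

# Toy data: the test frame has no "q", "y" or "z", and adds unseen "t" and "u"
train = pd.DataFrame({"A": ["p", "q", np.nan], "C": ["x", "y", "z"]})
test = pd.DataFrame({"A": ["p", "t"], "C": ["x", "u"]})

train_cols = list(pd.get_dummies(train, prefix=["A", "C"], dtype=int).columns)
aligned = one_hot_like_train(test, ["A", "C"], train_cols)
print(aligned)
```

The second test row holds only unseen labels, so it comes out as a zero vector, and the columns `A_q`, `C_y` and `C_z` are present even though the test frame never contained those values.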


**What do we need to save here?**



1.   numerical_cols
2.   categorical_cols
3.   selected_transformed_data_columns **(It is crucial to keep the same fixed feature set in both train and test data, independent of the values they contain)**
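
One possible way to save these three items is a small JSON file, sketched here. The file name `encoding_spec.json` and the dictionary layout are assumptions for illustration. The notebook itself does not prescribe a storage format.

```python
import json
import os
import tempfile

# Values taken from the walkthrough above
encoding_spec = {
    "numerical_cols": ["B"],
    "categorical_cols": ["A", "C"],
    "selected_transformed_data_columns": ["A_p", "A_q", "C_x", "C_y", "C_z"],
}

# Hypothetical file name; a temporary directory keeps the sketch self-contained
path = os.path.join(tempfile.mkdtemp(), "encoding_spec.json")
with open(path, "w") as f:
    json.dump(encoding_spec, f)

# Later, when encoding new data, load the same specification back
with open(path) as f:
    loaded = json.load(f)

print(loaded["selected_transformed_data_columns"])
```

Loading the saved column list before encoding new data ensures the feature set matches the one used in training.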

