# One Hot Encoding Using Sklearn
- Used to convert nominal categorical data into numerical data.
- Used when the categories have no inherent order or relationship to each other.
- One hot encoding leads to a condition called the dummy variable trap: the encoded columns of a feature are perfectly collinear. To counter this, we drop one encoded column per feature.
- When a feature has many categories, we keep the most frequently occurring categories as individual columns and group the rest into a single feature called 'Others'.
- We use this approach when a feature has a few very frequent categories and many rare ones.
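
As a small illustration of the dummy variable trap, the sketch below encodes a toy `fuel` column (hypothetical values, not read from `cars.csv`) with and without dropping one column. It uses `pd.get_dummies` only because it is the shortest way to show the idea.

```python
import pandas as pd

# Toy data, made up for illustration only
fuel = pd.Series(["Petrol", "Diesel", "CNG", "Diesel"], name="fuel")

# Full encoding: one column per category (CNG, Diesel, Petrol)
full = pd.get_dummies(fuel, dtype=int)

# Reduced encoding: the first category (CNG) is dropped
reduced = pd.get_dummies(fuel, drop_first=True, dtype=int)

# In the full encoding every row sums to 1, so any one column can be
# computed from the others. That redundancy is the dummy variable trap.
row_sums = full.sum(axis=1)
```

After dropping one column, a row of all zeros still identifies the dropped category, so no information is lost.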

In [1]:
import pandas as pd
import numpy as np
import sklearn as sks

In [2]:
df = pd.read_csv('cars.csv')
df.sample(5)

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
781,Mahindra,90000,Diesel,Second Owner,650000
7881,Ford,49957,Diesel,First Owner,700000
7267,Mahindra,30000,Diesel,First Owner,700000
4943,Mahindra,39000,Diesel,First Owner,1225000
3890,Maruti,70000,Petrol,First Owner,55000


In [3]:
df.shape

(8128, 5)

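### Why use sklearn for One Hot Encoding
- Because `pd.get_dummies` does not remember which categories it saw, and therefore which columns it produced and in what order.
- The columns can change whenever the code is run on different data, for example the test set or new data in production.
- Therefore, we use sklearn's `OneHotEncoder`, which stores the categories learned during `fit` and reuses them in `transform`.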
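
The difference can be seen on a toy split where one category is missing from the test part. The data below is hypothetical. `handle_unknown="ignore"` and `.toarray()` are choices made for this sketch so that it runs the same way across scikit-learn versions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy split: 'LPG' never appears in the test part
train = pd.DataFrame({"fuel": ["Petrol", "Diesel", "LPG", "Diesel"]})
test = pd.DataFrame({"fuel": ["Diesel", "Petrol"]})

# pd.get_dummies only sees the data it is given
train_cols = list(pd.get_dummies(train["fuel"]).columns)  # 3 columns
test_cols = list(pd.get_dummies(test["fuel"]).columns)    # 2 columns

# OneHotEncoder stores the categories it learned in fit()
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["fuel"]])

# Both outputs have one column per category learned from the training data
train_enc = enc.transform(train[["fuel"]]).toarray()
test_enc = enc.transform(test[["fuel"]]).toarray()
```

Here `train_cols` and `test_cols` differ in length, while `train_enc` and `test_enc` share the same column layout.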

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,0:4],df.iloc[:,-1],test_size=0.2,random_state=0)

In [5]:
X_train

Unnamed: 0,brand,km_driven,fuel,owner
3042,Hyundai,60000,LPG,First Owner
1520,Tata,150000,Diesel,Third Owner
2611,Hyundai,110000,Diesel,Second Owner
3544,Mahindra,28000,Diesel,Second Owner
4138,Maruti,15000,Petrol,First Owner
...,...,...,...,...
4931,Tata,70000,Diesel,Third Owner
3264,Ford,100000,Diesel,Second Owner
1653,Hyundai,90000,Petrol,Second Owner
2607,Volkswagen,90000,Diesel,First Owner


In [6]:
from sklearn.preprocessing import OneHotEncoder

In [7]:
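# Note: scikit-learn >= 1.2 names this argument `sparse_output`; older versions use `sparse`
ohe = OneHotEncoder(drop='first',sparse_output=False,dtype=np.int32)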

In [8]:
# Fit the encoder on the training data only, then reuse the learned categories on the test data
X_train_new = ohe.fit_transform(X_train[['fuel','owner']])
X_test_new = ohe.transform(X_test[['fuel','owner']])

In [9]:
X_train_new

array([[0, 1, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 1],
       [1, 0, 0, ..., 1, 0, 0],
       ...,
       [0, 0, 1, ..., 1, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0]])

In [10]:
X_train[['brand','km_driven']].values

array([['Hyundai', 60000],
       ['Tata', 150000],
       ['Hyundai', 110000],
       ...,
       ['Hyundai', 90000],
       ['Volkswagen', 90000],
       ['Hyundai', 110000]], dtype=object)

In [11]:
# Stack the untouched columns with the encoded ones (ColumnTransformer can do this in one step)
np.hstack((X_train[['brand','km_driven']].values,X_train_new))

array([['Hyundai', 60000, 0, ..., 0, 0, 0],
       ['Tata', 150000, 1, ..., 0, 0, 1],
       ['Hyundai', 110000, 1, ..., 1, 0, 0],
       ...,
       ['Hyundai', 90000, 0, ..., 1, 0, 0],
       ['Volkswagen', 90000, 1, ..., 0, 0, 0],
       ['Hyundai', 110000, 0, ..., 0, 0, 0]], dtype=object)
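
The manual `np.hstack` step can also be expressed with `ColumnTransformer`. The sketch below uses a few hypothetical rows shaped like `X_train`. `remainder="passthrough"` keeps `brand` and `km_driven` unchanged, and `sparse_threshold=0` is set here so that the result is always a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows with the same columns as X_train
X = pd.DataFrame({
    "brand": ["Hyundai", "Tata", "Maruti", "Ford"],
    "km_driven": [60000, 150000, 15000, 100000],
    "fuel": ["LPG", "Diesel", "Petrol", "Diesel"],
    "owner": ["First Owner", "Third Owner", "First Owner", "Second Owner"],
})

ct = ColumnTransformer(
    transformers=[("ohe", OneHotEncoder(drop="first"), ["fuel", "owner"])],
    remainder="passthrough",  # keep brand and km_driven as they are
    sparse_threshold=0,       # always return a dense array
)

# Encoded columns come first, passthrough columns are appended after them
X_new = ct.fit_transform(X)
```

Note that the column order differs from the `np.hstack` call above: the encoded columns come first and the untouched columns last.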

### One Hot Encoding for Frequent Categories

In [12]:
df['brand'].value_counts()

Maruti           2448
Hyundai          1415
Mahindra          772
Tata              734
Toyota            488
Honda             467
Ford              397
Chevrolet         230
Renault           228
Volkswagen        186
BMW               120
Skoda             105
Nissan             81
Jaguar             71
Volvo              67
Datsun             65
Mercedes-Benz      54
Fiat               47
Audi               40
Lexus              34
Jeep               31
Mitsubishi         14
Force               6
Land                6
Isuzu               5
Kia                 4
Ambassador          4
Daewoo              3
MG                  3
Ashok               1
Opel                1
Peugeot             1
Name: brand, dtype: int64

In [13]:
count = df['brand'].value_counts()

In [14]:
df['brand'].nunique()
threshold = 100

In [15]:
# Brands that appear `threshold` times or fewer will be grouped together
repl = count[count <= threshold].index

In [16]:
# Replace the rare brands with 'Others', then one hot encode the result
pd.get_dummies(df['brand'].replace(repl, 'Others'))

Unnamed: 0,BMW,Chevrolet,Ford,Honda,Hyundai,Mahindra,Maruti,Others,Renault,Skoda,Tata,Toyota,Volkswagen
0,0,0,0,0,0,0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,1,0,0,0
2,0,0,0,1,0,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8123,0,0,0,0,1,0,0,0,0,0,0,0,0
8124,0,0,0,0,1,0,0,0,0,0,0,0,0
8125,0,0,0,0,0,0,1,0,0,0,0,0,0
8126,0,0,0,0,0,0,0,0,0,0,1,0,0


# Made by Nitesh Addagatla
- LinkedIn: https://www.linkedin.com/in/nitesh-addagatla/
- GitHub: https://github.com/niteshA04
- Kaggle: https://www.kaggle.com/niteshaddagatla