# One-Hot Encoding

In machine learning, and in linear models especially, the input columns (features) should ideally be independent of one another, while the target value should depend on those independent inputs.

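One-hot encoding can lead to the dummy variable trap. The dummy variable trap means the encoded data is multicollinear: the dummy columns created from one categorical variable always sum to 1, so any one of them can be predicted exactly from the others.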
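
The trap can be seen directly on a tiny example. The sketch below uses a made-up `fuel` column (the values are illustrative and not taken from `cars.csv`) and checks that the dummy columns sum to 1 in every row, which makes a design matrix with an intercept rank-deficient.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical values, not taken from cars.csv
fuel = pd.Series(["Petrol", "Diesel", "CNG", "Petrol", "Diesel", "CNG"], name="fuel")

# Full one-hot encoding: one column per category (K = 3)
dummies = pd.get_dummies(fuel, dtype=int)

# Each row contains exactly one 1, so the dummy columns always sum to 1
row_sums = dummies.sum(axis=1)

# Design matrix of a linear model: intercept column of ones plus all K dummies
X_full = np.column_stack([np.ones(len(fuel)), dummies.to_numpy()])

# The intercept column equals the sum of the dummy columns,
# so the matrix has 4 columns but only rank 3
rank_full = np.linalg.matrix_rank(X_full)

print(dummies.columns.tolist())    # ['CNG', 'Diesel', 'Petrol']
print(row_sums.tolist())           # [1, 1, 1, 1, 1, 1]
print(X_full.shape[1], rank_full)  # 4 3
```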

To avoid this multicollinearity, we drop one column from each encoded categorical variable.

If a categorical variable has K unique categories, it is transformed into K-1 binary variables. The dropped category becomes the baseline and is represented by zeros in all K-1 columns.
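
As a minimal sketch of the K versus K-1 rule, the snippet below encodes a made-up `fuel` column (hypothetical values) with and without `drop_first=True` and compares the resulting column counts. With pandas, the first category in sorted order is the one that gets dropped.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical values, K = 3 unique categories
df = pd.DataFrame({"fuel": ["Petrol", "Diesel", "CNG", "Petrol", "Diesel", "CNG"]})

# K columns: one per category
full = pd.get_dummies(df, columns=["fuel"], dtype=int)

# K-1 columns: the first category in sorted order (CNG) is dropped
reduced = pd.get_dummies(df, columns=["fuel"], drop_first=True, dtype=int)

# With an intercept, the K-1 encoding has full column rank
X_reduced = np.column_stack([np.ones(len(df)), reduced.to_numpy()])
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(full.columns.tolist())     # ['fuel_CNG', 'fuel_Diesel', 'fuel_Petrol']
print(reduced.columns.tolist())  # ['fuel_Diesel', 'fuel_Petrol']
print(X_reduced.shape[1], rank_reduced)  # 3 3
```

A CNG row is encoded as `fuel_Diesel = 0, fuel_Petrol = 0`, so no information is lost by dropping the column.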

Linear models do not perform well with multicollinearity: the coefficient estimates become unstable and hard to interpret.
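
One way to see why the coefficients become unreliable: with an intercept and all K dummy columns, different coefficient vectors produce exactly the same predictions, so the data cannot pick one of them. The sketch below uses a hand-written design matrix and arbitrary coefficient values, all chosen purely for illustration.

```python
import numpy as np

# Design matrix: intercept + all 3 dummy columns of one categorical variable
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
], dtype=float)

# Two different coefficient vectors (arbitrary illustrative values):
# shift the intercept up by c and every dummy coefficient down by c
beta_a = np.array([10.0, 1.0, 2.0, 3.0])
c = 5.0
beta_b = beta_a + np.array([c, -c, -c, -c])

pred_a = X @ beta_a
pred_b = X @ beta_b

# Identical predictions from different coefficients: the solution is not unique
same_predictions = np.allclose(pred_a, pred_b)

# Equivalently, X^T X is singular (rank 3 instead of 4)
rank_xtx = np.linalg.matrix_rank(X.T @ X)

print(same_predictions)  # True
print(rank_xtx)          # 3
```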

## Multicollinearity

Multicollinearity means there is a strong linear relationship between two or more independent variables.
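
A rough way to check for such a relationship is to look at pairwise correlations and at the variance inflation factor (VIF). VIF is a common diagnostic that is not part of the original notebook. The sketch below uses synthetic data, and the helper `vif` is a hypothetical function written here with NumPy only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic features: x2 is almost a linear function of x1, x3 is unrelated
x1 = rng.normal(size=n)
x2 = 2.0 * x1 + rng.normal(scale=0.05, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Pairwise correlations: an entry near +1 or -1 signals a strong linear relationship
corr = np.corrcoef(X, rowvar=False)

def vif(X, j):
    """Variance inflation factor of column j: 1 / (1 - R^2),
    where R^2 comes from regressing column j on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]

print(np.round(corr, 3))
print(np.round(vifs, 1))
```

A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity. Here `x1` and `x2` land far above that range, while `x3` stays close to 1.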



import numpy as np
import pandas as pd

def read_csv(file_name):
    data = pd.read_csv(file_name)
    return data

# 1. One Hot Encoding using Pandas:
def one_hot_encoder_using_pandas(data):
    data_transform = pd.get_dummies(data, columns=['fuel','owner'])
    print("**********************One Hot Encoding using Pandas**************************")
    print("Number of Columns after Transform Data using One Hot Encoding using Pandas : ", data_transform.shape[1])
    return data_transform

# 2. K-1 OneHotEncoding:
# if a categorical variable has K unique categories, it is transformed into K-1 binary variables.
def one_hot_encoder_using_panda_and_handling_multicoliearity(data):
    data_transform = pd.get_dummies(data, columns=['fuel','owner'],drop_first = True)
    print("**********************K-1 OneHotEncoding**************************")
    print("Number of Columns after removing the columns to handle multicoliarity issue : ", data_transform.shape[1])
    return data_transform


# 3. OneHotEncoding using Sklearn:
def one_hot_encoder_using_Sklearn(data):
    from sklearn.model_selection import train_test_split
    X_train,X_test,Y_train,Y_test = train_test_split(data.iloc[:,0:4],
                                                     data.iloc[:,-1],
                                                     test_size=0.2,
                                                     random_state=2)
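    from sklearn.preprocessing import OneHotEncoder
    # drop='first' removes one column per feature (K-1 encoding).
    # Note: scikit-learn >= 1.2 uses `sparse_output`; older releases call this argument `sparse`.
    ohe = OneHotEncoder(drop='first',sparse_output=False,dtype=np.int32)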
    ohe.fit(X_train[['fuel','owner']])
    
    X_train_transform = ohe.transform(X_train[['fuel','owner']])
    X_test_transform = ohe.transform(X_test[['fuel','owner']])
    
    print("**********************one_hot_encoder_using_Sklearn**************************")
    print("Number of rows and number of columns of train data",X_train.shape)
    print("Columns of X_train Data : ", X_train.columns)
    
    X_train_new = np.hstack((X_train[['brand','km_driven']].values, X_train_transform))
    X_test_new = np.hstack((X_test[['brand','km_driven']].values, X_test_transform))
    
    print("Number of Columns of train data after one hot encoder using Sklearn : ", X_train_new.shape[1])
    
    X_train_new = pd.DataFrame(X_train_new)
    X_test_new = pd.DataFrame(X_test_new)
    
    return X_train_new, X_test_new
    


# 4. OneHotEncoding with Top Categories
def one_hot_encoder_with_top_categories(data):
    print("**********************OneHotEncoding with Top Categories*****************************")
    print(" unique brands : ",data['brand'].nunique())
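    counts = data['brand'].value_counts()
    threshold = 100
    # Brands with at most `threshold` rows are grouped under a single 'uncommon' label
    repl = counts[counts <= threshold].index
    print(repl)
    # Encode the grouped column and return a random sample of 5 rows for inspection
    # (dtype=int keeps the 0/1 output on pandas versions that default to booleans)
    df = pd.get_dummies(data['brand'].replace(repl, 'uncommon'), dtype=int).sample(5)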
    return df


data = read_csv('cars.csv')
print("Number of Columns of actual Data : ", data.shape[1])
print("Columns of actual Data : ", data.columns)

data_trans = one_hot_encoder_using_pandas(data)
data_transform = one_hot_encoder_using_panda_and_handling_multicoliearity(data)
X_train_new, X_test_new = one_hot_encoder_using_Sklearn(data)
df = one_hot_encoder_with_top_categories(data)
df

Number of Columns of actual Data :  5
Columns of actual Data :  Index(['brand', 'km_driven', 'fuel', 'owner', 'selling_price'], dtype='object')
**********************One Hot Encoding using Pandas**************************
Number of Columns after Transform Data using One Hot Encoding using Pandas :  12
**********************K-1 OneHotEncoding**************************
Number of Columns after removing the columns to handle multicoliarity issue :  10
**********************one_hot_encoder_using_Sklearn**************************
Number of rows and number of columns of train data (6502, 4)
Columns of X_train Data :  Index(['brand', 'km_driven', 'fuel', 'owner'], dtype='object')
Number of Columns of train data after one hot encoder using Sklearn :  9
**********************OneHotEncoding with Top Categories*****************************
 unique brands :  32
Index(['Nissan', 'Jaguar', 'Volvo', 'Datsun', 'Mercedes-Benz', 'Fiat', 'Audi',
       'Lexus', 'Jeep', 'Mitsubishi', 'Land', 'Force', 'Isuz

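| | BMW | Chevrolet | Ford | Honda | Hyundai | Mahindra | Maruti | Renault | Skoda | Tata | Toyota | Volkswagen | uncommon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1634 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5111 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2963 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1809 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4612 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |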
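
The top-categories idea can be reproduced on a small, self-contained example. The sketch below uses a made-up `brand` column and an illustrative threshold of 2 (the notebook uses 100 on the real data). It groups rare brands with `where`/`isin`, which is assumed here as an equivalent alternative to the `replace` call in the notebook.

```python
import pandas as pd

# Toy data: hypothetical brand counts, not taken from cars.csv
brands = pd.Series(
    ["Maruti"] * 5 + ["Hyundai"] * 4 + ["Tata"] * 3 + ["Volvo", "Lexus", "Jeep"],
    name="brand",
)

counts = brands.value_counts()
threshold = 2  # illustrative value; brands with <= 2 rows count as rare

# Index of rare brands
rare = counts[counts <= threshold].index

# Keep frequent brands as they are, relabel rare ones as 'uncommon'
grouped = brands.where(~brands.isin(rare), "uncommon")

# One column per frequent brand plus a single 'uncommon' column
encoded = pd.get_dummies(grouped, dtype=int)

print(brands.nunique())             # 6
print(encoded.columns.tolist())     # ['Hyundai', 'Maruti', 'Tata', 'uncommon']
print(int(encoded["uncommon"].sum()))  # 3
```

Grouping reduces 6 brand columns to 4 here. On a column with many rare categories the reduction is much larger, at the cost of no longer distinguishing the rare brands from each other.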
