## One Hot Encoding

- One-hot encoding replaces a categorical variable with a set of boolean (0/1) variables, one per label

- Each of these variables is called a dummy variable

- For gender, the dummy variables could be Male, Female and Non-Binary
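
As a minimal sketch of the idea, the gender example can be encoded with pandas `get_dummies`. The toy frame below is invented for illustration and is not part of the notebook's dataset.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"gender": ["Male", "Female", "Non-Binary", "Female"]})

# One dummy column per label; dtype=int gives 0/1 instead of True/False
dummies = pd.get_dummies(toy["gender"], dtype=int)

print(dummies)
```

Each row has exactly one dummy set to 1, namely the column for its own label.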

## Number of Dummies 

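- By default, pandas (`get_dummies`) and sklearn (`OneHotEncoder`) produce K dummy variables, where K is the number of unique labels in the variable

- When K=2, drop one dummy variable, since the remaining one already determines the other

- When K>2, one dummy variable can also be dropped: a row with zeros in all K-1 remaining dummies must belong to the dropped label, so no information is lost

- Use K-1 dummies for linear regression models fitted with an intercept: these models **look** at all the variables at once while fitting to the train set, and K dummies that always sum to 1 are perfectly collinear with the intercept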
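
The K versus K-1 point can be checked numerically. The sketch below uses an invented `colour` column (an assumption for illustration) and compares the rank of the design matrix, intercept included, for the two encodings.

```python
import numpy as np
import pandas as pd

# Toy data with K = 3 labels, invented for illustration only
toy = pd.DataFrame({"colour": ["red", "green", "blue", "green", "red"]})

full = pd.get_dummies(toy["colour"], dtype=int)                      # K columns
reduced = pd.get_dummies(toy["colour"], drop_first=True, dtype=int)  # K-1 columns

# Add the intercept column that a linear regression would normally include
intercept = np.ones((len(toy), 1))
X_full = np.hstack([intercept, full.to_numpy()])
X_reduced = np.hstack([intercept, reduced.to_numpy()])

# The K dummies sum to the intercept column, so X_full loses one rank
print(X_full.shape[1], np.linalg.matrix_rank(X_full))        # 4 columns, rank 3
print(X_reduced.shape[1], np.linalg.matrix_rank(X_reduced))  # 3 columns, rank 3
```

With K dummies the matrix has more columns than its rank, so the regression coefficients are not uniquely determined. Dropping one dummy restores full column rank.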


## OHE of the top most common labels 
__When a variable has high cardinality, we can one-hot encode only its most common labels, which keeps the feature space from growing by one column for every label__

### Pros

- Easy
- Does not greatly expand the feature space

### Cons 

- Loss of information
- No information about the less common labels
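
The trade-off listed above can be seen on a toy column before turning to the real data. This is only a sketch: the `city` column and its labels are invented, and the number of labels to keep (2) is arbitrary.

```python
import pandas as pd

# Toy data: A and B are common, C and D are rare (invented for illustration)
toy = pd.DataFrame({"city": ["A", "A", "A", "B", "B", "C", "D"]})

# Keep only the 2 most frequent labels
top = toy["city"].value_counts().head(2).index.tolist()

# One 0/1 column per kept label
encoded = pd.DataFrame(
    {f"city_{label}": (toy["city"] == label).astype(int) for label in top}
)

print(encoded)
```

The rows for `C` and `D` end up with zeros in every dummy column, so the encoding can no longer tell them apart. That is the loss of information listed under Cons.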


In [392]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [393]:
from google.colab import drive
drive.mount('/content/gdrive')
data = pd.read_csv("gdrive/My Drive/Colab Notebooks/FeatureEngineering/trainh.csv")

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [394]:
data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,...,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,...,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,...,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,...,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,...,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,...,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


In [395]:
data.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [396]:
# Get the number of categories in each categorical variable
categoricals = []
for col in data.columns:
    if data[col].dtypes == 'O':  # object dtype, i.e. categorical
        # unique() counts NaN as its own category
        print('{} categories : {} '.format(col, len(data[col].unique())))
        categoricals.append(col)

MSZoning categories : 5 
Street categories : 2 
Alley categories : 3 
LotShape categories : 4 
LandContour categories : 4 
Utilities categories : 2 
LotConfig categories : 5 
LandSlope categories : 3 
Neighborhood categories : 25 
Condition1 categories : 9 
Condition2 categories : 8 
BldgType categories : 5 
HouseStyle categories : 8 
RoofStyle categories : 6 
RoofMatl categories : 8 
Exterior1st categories : 15 
Exterior2nd categories : 16 
MasVnrType categories : 5 
ExterQual categories : 4 
ExterCond categories : 5 
Foundation categories : 6 
BsmtQual categories : 5 
BsmtCond categories : 5 
BsmtExposure categories : 5 
BsmtFinType1 categories : 7 
BsmtFinType2 categories : 7 
Heating categories : 6 
HeatingQC categories : 5 
CentralAir categories : 2 
Electrical categories : 6 
KitchenQual categories : 4 
Functional categories : 7 
FireplaceQu categories : 6 
GarageType categories : 7 
GarageFinish categories : 4 
GarageQual categories : 6 
GarageCond categories : 6 
PavedDrive cat

In [397]:
# Get variables with more than n categories 
n = 8
cats = []
for col in data.columns:
    if data[col].dtypes =='O': 
        if len(data[col].unique())>n: 
            print('{} categories : {} '.format(col, len(data[col].unique())))
            cats.append(col)

Neighborhood categories : 25 
Condition1 categories : 9 
Exterior1st categories : 15 
Exterior2nd categories : 16 
SaleType categories : 9 


In [398]:
for col in cats:
    # percentage of rows carrying each label (np.float was removed in NumPy 1.24)
    print(100 * data.groupby(col)[col].count() / len(data))
    print()

Neighborhood
Blmngtn     1.164384
Blueste     0.136986
BrDale      1.095890
BrkSide     3.972603
ClearCr     1.917808
CollgCr    10.273973
Crawfor     3.493151
Edwards     6.849315
Gilbert     5.410959
IDOTRR      2.534247
MeadowV     1.164384
Mitchel     3.356164
NAmes      15.410959
NPkVill     0.616438
NWAmes      5.000000
NoRidge     2.808219
NridgHt     5.273973
OldTown     7.739726
SWISU       1.712329
Sawyer      5.068493
SawyerW     4.041096
Somerst     5.890411
StoneBr     1.712329
Timber      2.602740
Veenker     0.753425
Name: Neighborhood, dtype: float64

Condition1
Artery     3.287671
Feedr      5.547945
Norm      86.301370
PosA       0.547945
PosN       1.301370
RRAe       0.753425
RRAn       1.780822
RRNe       0.136986
RRNn       0.342466
Name: Condition1, dtype: float64

Exterior1st
AsbShng     1.369863
AsphShn     0.068493
BrkComm     0.136986
BrkFace     3.424658
CBlock      0.068493
CemntBd     4.178082
HdBoard    15.205479
ImStucc     0.068493
MetalSd    15.068493


In [399]:
data_raw = data.copy()

In [400]:
data = data_raw[cats + ['SalePrice']]

In [401]:
data.columns

Index(['Neighborhood', 'Condition1', 'Exterior1st', 'Exterior2nd', 'SaleType',
       'SalePrice'],
      dtype='object')

In [402]:
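def get_top_variables(data, column, n):
    """One-hot encode the n most frequent labels of `column`, modifying `data` in place."""
    # labels ordered from most to least frequent, truncated to the top n
    top_labels = data[column].value_counts().sort_values(ascending=False).head(n).index.tolist()
    for label in top_labels:
        # 1 where the row carries this label, 0 otherwise
        data[label] = np.where(data[column] == label, 1, 0)
    # the original column is dropped once its top labels are encoded
    data.drop(column, axis=1, inplace=True)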

In [403]:
data.head()

Unnamed: 0,Neighborhood,Condition1,Exterior1st,Exterior2nd,SaleType,SalePrice
0,CollgCr,Norm,VinylSd,VinylSd,WD,208500
1,Veenker,Feedr,MetalSd,MetalSd,WD,181500
2,CollgCr,Norm,VinylSd,VinylSd,WD,223500
3,Crawfor,Norm,Wd Sdng,Wd Shng,WD,140000
4,NoRidge,Norm,VinylSd,VinylSd,WD,250000


In [404]:
get_top_variables(data, 'SaleType', 5)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
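
The warning above appears because `data` was created by selecting columns from `data_raw`, and `get_top_variables` then adds columns to it in place. One common way to avoid it is to take an explicit copy when selecting the columns. A minimal sketch on a toy frame follows. The frame `raw` and its values are invented for illustration, and only the `.copy()` call is the point.

```python
import numpy as np
import pandas as pd

# Toy stand-in for data_raw, invented for illustration only
raw = pd.DataFrame({
    "SaleType": ["WD", "WD", "New", "COD"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# .copy() makes `subset` an independent DataFrame rather than a slice of `raw`
subset = raw[["SaleType", "SalePrice"]].copy()

# Adding a column to the copy does not touch `raw`
subset["WD"] = np.where(subset["SaleType"] == "WD", 1, 0)

print(subset)
print("WD" in raw.columns)
```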


In [405]:
data.head()

Unnamed: 0,Neighborhood,Condition1,Exterior1st,Exterior2nd,SalePrice,WD,New,COD,ConLD,ConLw
0,CollgCr,Norm,VinylSd,VinylSd,208500,1,0,0,0,0
1,Veenker,Feedr,MetalSd,MetalSd,181500,1,0,0,0,0
2,CollgCr,Norm,VinylSd,VinylSd,223500,1,0,0,0,0
3,Crawfor,Norm,Wd Sdng,Wd Shng,140000,1,0,0,0,0
4,NoRidge,Norm,VinylSd,VinylSd,250000,1,0,0,0,0


In [406]:
for i in cats:
  if i in data.columns:
    get_top_variables(data, i, 5)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [407]:
data.head()

Unnamed: 0,SalePrice,WD,New,COD,ConLD,ConLw,NAmes,CollgCr,OldTown,Edwards,Somerst,Norm,Feedr,Artery,RRAn,PosN,VinylSd,HdBoard,MetalSd,Wd Sdng,Plywood
0,208500,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0
1,181500,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0
2,223500,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0
3,140000,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
4,250000,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0
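
Because the new columns are named after the labels alone, two source columns that share labels write to the same dummy column. In the output above this appears to have happened with `Exterior1st` and `Exterior2nd`: only one set of exterior columns is present, and row 3 has `Exterior1st` equal to `Wd Sdng` while its `Wd Sdng` dummy is 0. One way around this is to prefix each dummy with the name of its source column. The sketch below assumes a hypothetical helper `encode_top_labels` and an invented toy frame, neither of which is part of the notebook.

```python
import pandas as pd

def encode_top_labels(df, column, n):
    """Return a copy of df with the n most frequent labels of `column` one-hot encoded."""
    out = df.copy()
    top_labels = out[column].value_counts().head(n).index
    for label in top_labels:
        # Prefix with the source column so labels shared across columns do not collide
        out[f"{column}_{label}"] = (out[column] == label).astype(int)
    return out.drop(columns=column)

# Toy data, invented for illustration: both columns share the label "VinylSd"
toy = pd.DataFrame({
    "Exterior1st": ["VinylSd", "VinylSd", "VinylSd", "Wd Sdng", "Wd Sdng", "MetalSd"],
    "Exterior2nd": ["VinylSd", "VinylSd", "VinylSd", "Wd Shng", "Wd Shng", "MetalSd"],
})

encoded = toy
for col in ["Exterior1st", "Exterior2nd"]:
    encoded = encode_top_labels(encoded, col, 2)

print(encoded.columns.tolist())
```

Returning a new frame instead of modifying the input in place also sidesteps the `SettingWithCopyWarning` shown earlier.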
