<h2>Categorical Variables and One Hot Encoding</h2>

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv("houseprices.csv")
df

Unnamed: 0,Town,Area,Price,House Condition
0,monroe township,2600,550000,average
1,monroe township,3000,565000,good
2,monroe township,3200,610000,good
3,monroe township,3600,680000,excellent
4,monroe township,4000,725000,excellent
5,west windsor,2600,585000,average
6,west windsor,2800,615000,good
7,west windsor,3300,650000,good
8,west windsor,3600,710000,excellent
9,robinsville,2600,575000,poor


<h2 style='color:red'>Using pandas to create dummy variables</h2>

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Town             13 non-null     object
 1   Area             13 non-null     int64 
 2   Price            13 non-null     int64 
 3   House Condition  13 non-null     object
dtypes: int64(2), object(2)
memory usage: 548.0+ bytes


## One Hot Encoding

In [5]:
# Town is a nominal variable (its values have no natural order), so one-hot encoding is the appropriate choice
dummies = pd.get_dummies(df['Town'], dtype = int)
dummies

Unnamed: 0,monroe township,robinsville,west windsor
0,1,0,0
1,1,0,0
2,1,0,0
3,1,0,0
4,1,0,0
5,0,0,1
6,0,0,1
7,0,0,1
8,0,0,1
9,0,1,0


In [6]:
df_merged = pd.concat([df,dummies],axis=1)
df_merged

Unnamed: 0,Town,Area,Price,House Condition,monroe township,robinsville,west windsor
0,monroe township,2600,550000,average,1,0,0
1,monroe township,3000,565000,good,1,0,0
2,monroe township,3200,610000,good,1,0,0
3,monroe township,3600,680000,excellent,1,0,0
4,monroe township,4000,725000,excellent,1,0,0
5,west windsor,2600,585000,average,0,0,1
6,west windsor,2800,615000,good,0,0,1
7,west windsor,3300,650000,good,0,0,1
8,west windsor,3600,710000,excellent,0,0,1
9,robinsville,2600,575000,poor,0,1,0


In [7]:
final_df = df_merged.drop(['Town'], axis=1)
final_df

Unnamed: 0,Area,Price,House Condition,monroe township,robinsville,west windsor
0,2600,550000,average,1,0,0
1,3000,565000,good,1,0,0
2,3200,610000,good,1,0,0
3,3600,680000,excellent,1,0,0
4,4000,725000,excellent,1,0,0
5,2600,585000,average,0,0,1
6,2800,615000,good,0,0,1
7,3300,650000,good,0,0,1
8,3600,710000,excellent,0,0,1
9,2600,575000,poor,0,1,0
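
The three steps above (`get_dummies`, `concat`, `drop`) can also be collapsed into a single call by passing the whole DataFrame together with the `columns` argument. The sketch below is a minimal illustration using three rows copied from the table above rather than the full CSV; the variable name `encoded` is only for this example.

```python
import pandas as pd

# Three rows copied from the table above; the full CSV is not reproduced here.
df = pd.DataFrame({
    "Town": ["monroe township", "west windsor", "robinsville"],
    "Area": [2600, 2600, 2600],
    "Price": [550000, 585000, 575000],
    "House Condition": ["average", "average", "poor"],
})

# Passing columns=["Town"] removes the original column and appends the dummies.
# prefix="" and prefix_sep="" keep the bare town names as column names,
# matching the columns produced by the concat/drop approach.
encoded = pd.get_dummies(df, columns=["Town"], prefix="", prefix_sep="", dtype=int)
print(encoded.columns.tolist())
```

Without the `prefix` arguments, pandas would name the new columns `Town_monroe township`, `Town_robinsville`, and so on, which is often preferable when several columns are encoded at once.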


<h3 style='color:red'>Dummy Variable Trap</h3>

When every category gets its own dummy column, it is better to drop one of them.

When one variable can be derived from the others, the variables are said to be multicollinear. Here,
if you know the values of monroe township and robinsville, you can easily infer the value of west windsor:
it is 1 exactly when monroe township=0 and robinsville=0. The town columns are therefore multicollinear. In this
situation linear regression won't work as expected, because the coefficients are no longer uniquely determined. Hence you need to drop one column.

**NOTE: sklearn's LinearRegression still fits without error even if you don't drop one of the
    town columns; however, we should make a habit of handling the dummy variable
    trap ourselves in case the library you are using does not handle it for you.**
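
The redundancy can be checked numerically. The sketch below is an illustration on a made-up four-row `towns` series (not the CSV itself): the full set of dummy columns always sums to 1, so adding an intercept column gives a design matrix with more columns than its rank. It also shows `drop_first=True`, which drops the first category in sorted order; dropping any one column, as the next cell does by hand, works equally well.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of town labels, using the town names from this dataset.
towns = pd.Series(
    ["monroe township", "west windsor", "robinsville", "monroe township"],
    name="Town",
)

# One dummy column per town: every row contains exactly one 1.
full = pd.get_dummies(towns, dtype=int)
row_sums = full.sum(axis=1)

# With an intercept column of ones, the dummies add up to the intercept,
# so the design matrix has 4 columns but only rank 3.
design = np.column_stack([np.ones(len(full)), full.to_numpy()])
n_columns = design.shape[1]
rank = np.linalg.matrix_rank(design)
print(n_columns, rank)

# drop_first=True removes the first category (in sorted order), leaving
# a design matrix whose columns are linearly independent.
reduced = pd.get_dummies(towns, drop_first=True, dtype=int)
reduced_design = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])
print(reduced.columns.tolist(), np.linalg.matrix_rank(reduced_design))
```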

In [8]:
final_df = final_df.drop(['west windsor'], axis=1)
final_df

Unnamed: 0,Area,Price,House Condition,monroe township,robinsville
0,2600,550000,average,1,0
1,3000,565000,good,1,0
2,3200,610000,good,1,0
3,3600,680000,excellent,1,0
4,4000,725000,excellent,1,0
5,2600,585000,average,0,0
6,2800,615000,good,0,0
7,3300,650000,good,0,0
8,3600,710000,excellent,0,0
9,2600,575000,poor,0,1


In [9]:
X = final_df.drop('Price', axis=1)
X

Unnamed: 0,Area,House Condition,monroe township,robinsville
0,2600,average,1,0
1,3000,good,1,0
2,3200,good,1,0
3,3600,excellent,1,0
4,4000,excellent,1,0
5,2600,average,0,0
6,2800,good,0,0
7,3300,good,0,0
8,3600,excellent,0,0
9,2600,poor,0,1


In [10]:
y = final_df['Price']

<h2 style='color:red'>Using sklearn LabelEncoder</h2>

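LabelEncoder replaces each category with an integer code, which is mostly preferred for ordinal categorical variables. Note that it assigns the codes in sorted (alphabetical) order of the labels, not according to their rank.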

In [11]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

# Fit the encoder and replace each label with its integer code
final_df['House Condition'] = encoder.fit_transform(final_df['House Condition'])
print('\n Encoded (normalized) classes: \n', final_df['House Condition'])

print()

# The reverse transformation (encoder.inverse_transform) is applied in a later cell.


 Encoded (normalized) classes: 
 0     0
1     2
2     2
3     1
4     1
5     0
6     2
7     2
8     1
9     3
10    0
11    2
12    1
Name: House Condition, dtype: int32



In [12]:
final_df

Unnamed: 0,Area,Price,House Condition,monroe township,robinsville
0,2600,550000,0,1,0
1,3000,565000,2,1,0
2,3200,610000,2,1,0
3,3600,680000,1,1,0
4,4000,725000,1,1,0
5,2600,585000,0,0,0
6,2800,615000,2,0,0
7,3300,650000,2,0,0
8,3600,710000,1,0,0
9,2600,575000,3,0,1


In [13]:
# Reverse the transformation
final_df['House Condition'] = encoder.inverse_transform(final_df['House Condition'])
print('\n Reverse from encoded classes to original: \n', final_df['House Condition'])


 Reverse from encoded classes to original: 
 0       average
1          good
2          good
3     excellent
4     excellent
5       average
6          good
7          good
8     excellent
9          poor
10      average
11         good
12    excellent
Name: House Condition, dtype: object


In [14]:
final_df

Unnamed: 0,Area,Price,House Condition,monroe township,robinsville
0,2600,550000,average,1,0
1,3000,565000,good,1,0
2,3200,610000,good,1,0
3,3600,680000,excellent,1,0
4,4000,725000,excellent,1,0
5,2600,585000,average,0,0
6,2800,615000,good,0,0
7,3300,650000,good,0,0
8,3600,710000,excellent,0,0
9,2600,575000,poor,0,1
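
The codes produced by LabelEncoder above follow the alphabetical order of the labels (average=0, excellent=1, good=2, poor=3), which does not match the quality ranking. If the codes should reflect that ranking, one option is sklearn's `OrdinalEncoder` with an explicit category list. The sketch below is a minimal illustration: the worst-to-best order in `order` and the column name `Condition Code` are assumptions made for this example, not something defined in the notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# The four condition labels that appear in the dataset.
conditions = pd.DataFrame(
    {"House Condition": ["average", "good", "excellent", "poor"]}
)

# Assumed ranking from worst to best; adjust if the domain says otherwise.
order = ["poor", "average", "good", "excellent"]

# OrdinalEncoder expects a 2-D input and one category list per column.
ordinal_encoder = OrdinalEncoder(categories=[order])
codes = ordinal_encoder.fit_transform(conditions[["House Condition"]])

# The result is a 2-D float array; flatten it and cast to int for readability.
conditions["Condition Code"] = codes.ravel().astype(int)
print(conditions)
```

With this ordering, poor maps to 0 and excellent to 3, so larger codes mean better condition.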


<h2 style='color:red'>Using the category dtype (cat.codes)</h2>

To use this, we first need to convert the categorical column to the pandas category dtype.



In [15]:
# Cast the 'House Condition' column to the category dtype
final_df['House Condition'] = final_df['House Condition'].astype('category')

final_df['House Condition'].cat.codes

0     0
1     2
2     2
3     1
4     1
5     0
6     2
7     2
8     1
9     3
10    0
11    2
12    1
dtype: int8
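
As with LabelEncoder, the default category dtype sorts the labels alphabetically, so the codes above do not follow the quality ranking. A possible alternative, sketched below, is an ordered `CategoricalDtype` with the categories listed explicitly. The worst-to-best order used here is an assumption for illustration, and the series holds only the four distinct labels rather than the full column.

```python
import pandas as pd

# The four condition labels that appear in the dataset.
condition = pd.Series(
    ["average", "good", "excellent", "poor"], name="House Condition"
)

# Default behaviour: categories are sorted alphabetically before coding.
default_codes = condition.astype("category").cat.codes

# Assumed ranking from worst to best, declared as an ordered categorical dtype.
ordered_dtype = pd.CategoricalDtype(
    categories=["poor", "average", "good", "excellent"], ordered=True
)
ordered_codes = condition.astype(ordered_dtype).cat.codes

print(default_codes.tolist())  # [0, 2, 1, 3]
print(ordered_codes.tolist())  # [1, 2, 3, 0]
```

An ordered dtype also enables comparisons such as `condition.astype(ordered_dtype) >= "good"`, which the alphabetical default does not support.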