Categorical Variables & One Hot Encoding

In [5]:
import pandas as pd

In [6]:
df = pd.read_csv('homeprices.csv')

In [7]:
df

Unnamed: 0,town,area,price
0,Banglore,2600,550000
1,Banglore,3000,565000
2,Banglore,3200,610000
3,Banglore,3600,680000
4,Banglore,4000,725000
5,Mysore,2600,585000
6,Mysore,2800,615000
7,Mysore,3300,650000
8,Mysore,3600,710000
9,Dharwad,2600,575000
10,Dharwad,2900,600000
11,Dharwad,3100,620000
12,Dharwad,3600,695000


Using pandas to create dummy variables for the town column

In [8]:
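# get_dummies creates one 0/1 column per distinct town;
# dtype=int keeps the output as 0/1 on pandas versions that default to booleans
dummies = pd.get_dummies(df.town, dtype=int)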

In [9]:
dummies 

Unnamed: 0,Banglore,Dharwad,Mysore
0,1,0,0
1,1,0,0
2,1,0,0
3,1,0,0
4,1,0,0
5,0,0,1
6,0,0,1
7,0,0,1
8,0,0,1
9,0,1,0
10,0,1,0
11,0,1,0
12,0,1,0


Merging the dummy columns into the original table

In [10]:
merged = pd.concat([df, dummies], axis = 'columns')

In [11]:
merged

Unnamed: 0,town,area,price,Banglore,Dharwad,Mysore
0,Banglore,2600,550000,1,0,0
1,Banglore,3000,565000,1,0,0
2,Banglore,3200,610000,1,0,0
3,Banglore,3600,680000,1,0,0
4,Banglore,4000,725000,1,0,0
5,Mysore,2600,585000,0,0,1
6,Mysore,2800,615000,0,0,1
7,Mysore,3300,650000,0,0,1
8,Mysore,3600,710000,0,0,1
9,Dharwad,2600,575000,0,1,0
10,Dharwad,2900,600000,0,1,0
11,Dharwad,3100,620000,0,1,0
12,Dharwad,3600,695000,0,1,0


In [12]:
# town is categorical and is now represented by the dummy columns, so drop the original column

In [13]:
final = merged.drop(['town'], axis = 'columns')

In [14]:
final

Unnamed: 0,area,price,Banglore,Dharwad,Mysore
0,2600,550000,1,0,0
1,3000,565000,1,0,0
2,3200,610000,1,0,0
3,3600,680000,1,0,0
4,4000,725000,1,0,0
5,2600,585000,0,0,1
6,2800,615000,0,0,1
7,3300,650000,0,0,1
8,3600,710000,0,0,1
9,2600,575000,0,1,0
10,2900,600000,0,1,0
11,3100,620000,0,1,0
12,3600,695000,0,1,0
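
The three steps above (get_dummies, concat, drop) can also be written as a single call. The following is a minimal sketch, assuming a small illustrative DataFrame with the same column names as homeprices.csv; note that this form prefixes the new columns with the source column name (town_Banglore and so on), which differs from the column names used in the rest of this notebook.

```python
import pandas as pd

# Small illustrative frame (not the full dataset) with the same columns
df = pd.DataFrame({
    "town": ["Banglore", "Mysore", "Dharwad"],
    "area": [2600, 2800, 2600],
    "price": [550000, 615000, 575000],
})

# columns=["town"] encodes that column and removes the original in one step;
# dtype=int gives 0/1 values instead of booleans
encoded = pd.get_dummies(df, columns=["town"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

The untouched columns (area, price) come first, followed by one dummy column per town.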


Dummy Variable Trap
When one variable can be derived from the other variables, the variables are said to be multicollinear. Here, if you know the values of Banglore and Mysore, you can infer the value of Dharwad: when Banglore=0 and Mysore=0, the town must be Dharwad. The town dummy variables are therefore multicollinear. In this situation linear regression may not work as expected, so you need to drop one column.

NOTE: With sklearn, linear regression still works even if you don't drop one of the town columns. However, it is a good habit to take care of the dummy variable trap yourself, in case the library you are using does not handle it for you.
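
One way to see the trap numerically is to check the rank of the design matrix. The sketch below assumes a five-row illustrative sample (not the full dataset) and an explicit intercept column of ones: with all three dummies present, the dummy columns sum to the intercept column, so the matrix has four columns but only rank 3.

```python
import numpy as np
import pandas as pd

# Illustrative sample of town values (an assumption, not the full dataset)
towns = pd.Series(["Banglore", "Banglore", "Mysore", "Mysore", "Dharwad"], name="town")
dummies = pd.get_dummies(towns, dtype=int)

# Each row has exactly one 1, so the dummy columns always sum to 1
row_sums = dummies.sum(axis=1)

# Intercept + all three dummies: 4 columns, but one is redundant
full = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])
rank_full = np.linalg.matrix_rank(full)

# Intercept + two dummies (Dharwad dropped): 3 independent columns
reduced = np.column_stack(
    [np.ones(len(dummies)), dummies.drop(columns="Dharwad").to_numpy()]
)
rank_reduced = np.linalg.matrix_rank(reduced)

print(full.shape[1], rank_full)        # number of columns vs. rank
print(reduced.shape[1], rank_reduced)
```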

In [15]:
final = final.drop(['Dharwad'], axis = 'columns')

In [16]:
final

Unnamed: 0,area,price,Banglore,Mysore
0,2600,550000,1,0
1,3000,565000,1,0
2,3200,610000,1,0
3,3600,680000,1,0
4,4000,725000,1,0
5,2600,585000,0,1
6,2800,615000,0,1
7,3300,650000,0,1
8,3600,710000,0,1
9,2600,575000,0,0
10,2900,600000,0,0
11,3100,620000,0,0
12,3600,695000,0,0


In [17]:
X = final.drop('price', axis = 'columns')

In [18]:
X

Unnamed: 0,area,Banglore,Mysore
0,2600,1,0
1,3000,1,0
2,3200,1,0
3,3600,1,0
4,4000,1,0
5,2600,0,1
6,2800,0,1
7,3300,0,1
8,3600,0,1
9,2600,0,0
10,2900,0,0
11,3100,0,0
12,3600,0,0


In [19]:
y = final.price
y

0     550000
1     565000
2     610000
3     680000
4     725000
5     585000
6     615000
7     650000
8     710000
9     575000
10    600000
11    620000
12    695000
Name: price, dtype: int64

In [20]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()

In [21]:
model.fit(X, y)

LinearRegression()

In [22]:
# X contains the columns area, Banglore, Mysore
# Predict the price (y) for every row of X
model.predict(X)

array([539709.73984091, 590468.71640508, 615848.20468716, 666607.18125133,
       717366.1578155 , 579723.71533005, 605103.20361213, 668551.92431735,
       706621.15674048, 565396.15136531, 603465.38378844, 628844.87207052,
       692293.59277574])

In [24]:
model.score(X, y)

0.9573929037221873
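
For a regressor, score returns the coefficient of determination R², computed here on the training data. As a rough sketch of what that number means, the snippet below rebuilds the 13 rows shown in this notebook as plain arrays (the array layout is an assumption made for self-containment) and recomputes R² as 1 minus the ratio of residual to total sum of squares.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: area, Banglore, Mysore (Dharwad is the dropped baseline)
X = np.array([
    [2600, 1, 0], [3000, 1, 0], [3200, 1, 0], [3600, 1, 0], [4000, 1, 0],
    [2600, 0, 1], [2800, 0, 1], [3300, 0, 1], [3600, 0, 1],
    [2600, 0, 0], [2900, 0, 0], [3100, 0, 0], [3600, 0, 0],
])
y = np.array([550000, 565000, 610000, 680000, 725000, 585000, 615000,
              650000, 710000, 575000, 600000, 620000, 695000])

model = LinearRegression().fit(X, y)

# R^2 = 1 - (sum of squared residuals) / (total sum of squares)
residual_ss = np.sum((y - model.predict(X)) ** 2)
total_ss = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - residual_ss / total_ss

print(r2_manual, model.score(X, y))
```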

In [25]:
# 3400 sq ft home in Dharwad: Banglore = 0 and Mysore = 0,
# because the Dharwad column was dropped and serves as the baseline
model.predict([[3400, 0, 0]])



array([666914.10449365])
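
To make the role of the dropped column concrete, a prediction can be rebuilt by hand from the fitted coefficients. This is a sketch under the assumption that the 13 rows shown in this notebook are re-entered as plain arrays; for a Dharwad home both dummy terms vanish, so only the intercept and the area term remain.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: area, Banglore, Mysore (Dharwad is the dropped baseline)
X = np.array([
    [2600, 1, 0], [3000, 1, 0], [3200, 1, 0], [3600, 1, 0], [4000, 1, 0],
    [2600, 0, 1], [2800, 0, 1], [3300, 0, 1], [3600, 0, 1],
    [2600, 0, 0], [2900, 0, 0], [3100, 0, 0], [3600, 0, 0],
])
y = np.array([550000, 565000, 610000, 680000, 725000, 585000, 615000,
              650000, 710000, 575000, 600000, 620000, 695000])

model = LinearRegression().fit(X, y)
coef_area, coef_banglore, coef_mysore = model.coef_

# Dharwad (baseline): both dummies are 0, so only intercept + area term
dharwad_3400 = model.intercept_ + coef_area * 3400

# Mysore: add the Mysore coefficient, i.e. its offset relative to Dharwad
mysore_2800 = model.intercept_ + coef_area * 2800 + coef_mysore

print(dharwad_3400, model.predict([[3400, 0, 0]])[0])
print(mysore_2800, model.predict([[2800, 0, 1]])[0])
```

Each town coefficient is therefore read as a price offset relative to the dropped town.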

In [26]:
# 2800 sq ft home in Mysore: Banglore = 0, Mysore = 1
model.predict([[2800, 0, 1]])



array([605103.20361213])

Using sklearn OneHotEncoder
The first step is to use LabelEncoder to convert the town names into numbers.

In [27]:
from sklearn.preprocessing import LabelEncoder

In [28]:
le = LabelEncoder()

In [30]:
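# Copy the frame so that encoding dfle does not also overwrite df
dfle = df.copy()

# df   -> town stays categorical (names)
# dfle -> town becomes numerical (label codes)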
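
The df / dfle distinction only holds if dfle is a separate object. Plain assignment gives the same DataFrame a second name, whereas .copy() creates an independent one. A minimal sketch with a two-row illustrative frame (the names alias and independent are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"town": ["Banglore", "Mysore"], "area": [2600, 2800]})

alias = df                # second name for the same object
independent = df.copy()   # separate object with its own data

# Changing the copy leaves df untouched
independent["town"] = [0, 1]
unchanged = df["town"].tolist()

# Changing the alias changes df as well
alias["town"] = [0, 1]
changed = df["town"].tolist()

print(alias is df, independent is df)
print(unchanged, changed)
```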

In [31]:
dfle.town = le.fit_transform(dfle.town)

In [32]:
dfle

Unnamed: 0,town,area,price
0,0,2600,550000
1,0,3000,565000
2,0,3200,610000
3,0,3600,680000
4,0,4000,725000
5,2,2600,585000
6,2,2800,615000
7,2,3300,650000
8,2,3600,710000
9,1,2600,575000
10,1,2900,600000
11,1,3100,620000
12,1,3600,695000
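
LabelEncoder assigns codes in sorted order of the class names, which is why Banglore becomes 0, Dharwad 1 and Mysore 2 in the table above. A short sketch, using an illustrative list of towns rather than the full column, shows how to inspect and reverse the mapping:

```python
from sklearn.preprocessing import LabelEncoder

towns = ["Banglore", "Mysore", "Dharwad", "Mysore"]

le = LabelEncoder()
codes = le.fit_transform(towns)

# classes_ holds the sorted class names; the position is the code
mapping = {str(name): code for code, name in enumerate(le.classes_)}

# inverse_transform turns codes back into the original names
decoded = le.inverse_transform(codes)

print(mapping)
print(codes, decoded)
```

The codes are arbitrary labels, not quantities, which is the reason for one-hot encoding them before fitting a linear model.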


In [33]:
X = dfle[['town', 'area']].values

In [34]:
X 

# X contains only town and area

array([[   0, 2600],
       [   0, 3000],
       [   0, 3200],
       [   0, 3600],
       [   0, 4000],
       [   2, 2600],
       [   2, 2800],
       [   2, 3300],
       [   2, 3600],
       [   1, 2600],
       [   1, 2900],
       [   1, 3100],
       [   1, 3600]], dtype=int64)

In [35]:
y = dfle.price.values

In [36]:
y

array([550000, 565000, 610000, 680000, 725000, 585000, 615000, 650000,
       710000, 575000, 600000, 620000, 695000], dtype=int64)

Now use OneHotEncoder to create dummy variables for each town

In [37]:
from sklearn.preprocessing import OneHotEncoder

In [40]:
ohe = OneHotEncoder()

In [41]:
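# Caution: applied to the whole array, OneHotEncoder encodes every column,
# so area is one-hot encoded as well (3 towns + 9 distinct areas = 12 columns)
X = ohe.fit_transform(X).toarray()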

In [42]:
X

array([[1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.]])
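
The 12 columns above can be accounted for by inspecting the fitted encoder. The sketch below re-enters the 13 [town, area] rows from this notebook as a plain array (an assumption made for self-containment) and counts the categories found in each input column.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Columns: town code, area
X = np.array([
    [0, 2600], [0, 3000], [0, 3200], [0, 3600], [0, 4000],
    [2, 2600], [2, 2800], [2, 3300], [2, 3600],
    [1, 2600], [1, 2900], [1, 3100], [1, 3600],
])

ohe = OneHotEncoder()
encoded = ohe.fit_transform(X).toarray()

# categories_ has one array per input column, listing its distinct values
n_categories = [len(c) for c in ohe.categories_]

print(n_categories, encoded.shape)
```

Every distinct area value is treated as its own category, so the numeric meaning of area is lost in this encoding.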

In [49]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# One-hot encode column 0 only and pass all remaining columns through unchanged
ct = ColumnTransformer([('town', OneHotEncoder(), [0])], remainder='passthrough')


In [44]:
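# Caution: X was already one-hot encoded above, so column 0 here is the
# Banglore indicator (0/1), not the town label; it is split into two columns
X = ct.fit_transform(X)
X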

array([[0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [1., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [1., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [1., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.]])

In [45]:
# Drop the first dummy column to avoid the dummy variable trap
X = X[:, 1:]

In [46]:
X

array([[1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.]])

In [47]:
model.fit(X, y)

LinearRegression()

In [48]:
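# Intended: a 3400 sq ft home in Mysore, laid out as [Dharwad, Mysore, area].
# The call fails because the model was fit on the 12-column X above,
# in which area was one-hot encoded along with town.
model.predict([[0, 1, 3400]])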


ValueError: X has 3 features, but LinearRegression is expecting 12 features as input.
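
The error arises because the model was trained on 12 columns while the prediction input has 3. One possible way to get the intended three-column layout is sketched below, assuming the ColumnTransformer is applied directly to the label-encoded [town, area] array (skipping the whole-array OneHotEncoder step) and that drop='first' is used to remove the Banglore dummy. The names X_raw, X_enc and new_home are hypothetical, and sparse_threshold=0 is set only to force a dense result.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Columns: town code (0 = Banglore, 1 = Dharwad, 2 = Mysore), area
X_raw = np.array([
    [0, 2600], [0, 3000], [0, 3200], [0, 3600], [0, 4000],
    [2, 2600], [2, 2800], [2, 3300], [2, 3600],
    [1, 2600], [1, 2900], [1, 3100], [1, 3600],
])
y = np.array([550000, 565000, 610000, 680000, 725000, 585000, 615000,
              650000, 710000, 575000, 600000, 620000, 695000])

# Encode only column 0; drop='first' removes the Banglore dummy
ct = ColumnTransformer(
    [("town", OneHotEncoder(drop="first"), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
X_enc = ct.fit_transform(X_raw)  # columns: Dharwad, Mysore, area

model = LinearRegression().fit(X_enc, y)

# 3400 sq ft home in Mysore (town code 2), passed through the same transformer
new_home = np.array([[2, 3400]])
predicted = model.predict(ct.transform(new_home))

print(X_enc.shape)
print(predicted)
```

Passing new rows through ct.transform keeps the column layout consistent with training, so the feature count always matches what the model expects.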