<h1 style='color:green' align='center'>Categorical Variables and One Hot Encoding</h1>

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("homeprices.csv")
df

<h2 style='color:purple'>Using pandas to create dummy variables</h2>

In [None]:
dummies = pd.get_dummies(df.town)
dummies

In [None]:
merged = pd.concat([df,dummies],axis='columns')
merged

In [None]:
final = merged.drop(['town'], axis='columns')
final

<h3 style='color:purple'>Dummy Variable Trap</h3>

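When one variable can be derived from the others, the variables are said to be multicollinear. For example,
with one dummy column each for california, georgia and new jersey, knowing the values of california and georgia
is enough to infer new jersey: if california=0 and georgia=0, then new jersey must be 1. These state variables
are therefore multicollinear, and the same holds for the town columns in this notebook. In this situation linear
regression may not work as expected, because the coefficients are no longer uniquely determined. Hence you need
to drop one of the dummy columns.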
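
As a rough illustration of the redundancy described above, the sketch below builds a tiny made-up table (the values and the `third town` label are placeholders, not the contents of homeprices.csv) and checks the rank of the design matrix with and without one dummy column. The explicit column of ones stands in for the intercept that a linear model would add.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values (NOT the contents of homeprices.csv)
toy = pd.DataFrame({
    "town": ["robbinsville", "west windsor", "third town",
             "robbinsville", "west windsor", "third town"],
    "area": [2600, 3000, 3200, 2900, 3600, 4000],
})

# One indicator column per town, stored as integers
dummies = pd.get_dummies(toy.town, dtype=int)

# Each row belongs to exactly one town, so the indicators always sum to 1
row_sums = dummies.sum(axis="columns")

# Design matrix: intercept column of ones, area, and ALL dummy columns
ones = np.ones((len(toy), 1))
area = toy[["area"]].to_numpy()
full = np.hstack([ones, area, dummies.to_numpy()])
rank_full = np.linalg.matrix_rank(full)

# Same matrix with one dummy column removed
reduced = np.hstack([ones, area, dummies.iloc[:, 1:].to_numpy()])
rank_reduced = np.linalg.matrix_rank(reduced)

print("columns:", full.shape[1], "rank:", rank_full)
print("columns:", reduced.shape[1], "rank:", rank_reduced)
```

The full matrix has more columns than its rank because the dummy columns add up to the intercept column. After one dummy is dropped, the column count and the rank agree.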

**NOTE: sklearn's LinearRegression still produces a fit when all of the dummy columns are kept, so the code
works even if you don't drop one of the town columns. However, we should make a habit of handling the dummy
variable trap ourselves, in case the library you are using does not handle it for you.**
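
One way to check the note above is to fit `LinearRegression` twice, once with every dummy column and once with one dropped, and compare the predictions. This is a minimal sketch on made-up numbers (the prices, areas and the `third town` label are placeholders); the two sets of predictions should agree up to floating point error, although the individual coefficients differ.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data with made-up values (NOT the contents of homeprices.csv)
toy = pd.DataFrame({
    "town": ["robbinsville", "west windsor", "third town",
             "robbinsville", "west windsor", "third town"],
    "area": [2600, 3000, 3200, 2900, 3600, 4000],
    "price": [575000, 620000, 550000, 600000, 710000, 680000],
})

dummies = pd.get_dummies(toy.town, dtype=int)

# Features with every dummy column kept
X_all = pd.concat([toy[["area"]], dummies], axis="columns")
# Features with one dummy column dropped, as done in this notebook
X_dropped = X_all.drop(columns=["west windsor"])
y = toy.price

pred_all = LinearRegression().fit(X_all, y).predict(X_all)
pred_dropped = LinearRegression().fit(X_dropped, y).predict(X_dropped)

# True when both models give the same fitted prices
same = np.allclose(pred_all, pred_dropped)
print(same)
```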

In [None]:
final = final.drop(['west windsor'], axis='columns')
final

In [None]:
X = final.drop('price', axis='columns')
X

In [None]:
Y = final.price
Y

Now build and train an sklearn linear regression model

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()

In [None]:
model.fit(X,Y)

In [None]:
model.score(X,Y)

In [None]:
model.predict(X) # predict on all the values of 'X'

In [None]:
model.predict([[3400,0,0]]) # 3400 sqr ft home in west windsor

In [None]:
model.predict([[2800,0,1]]) # 2800 sqr ft home in robbinsville

<h2 style='color:purple'>Using sklearn OneHotEncoder</h2>

The first step is to use a label encoder to convert the town names into numbers

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [None]:
dfle = df.copy()  # work on a copy so the original df keeps its town names
dfle.town = le.fit_transform(dfle.town)
dfle

In [None]:
X = dfle[['town','area']].values
X

In [None]:
Y = dfle.price.values
Y

Now use a one hot encoder to create a dummy variable for each of the towns

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('town', OneHotEncoder(), [0])], remainder = 'passthrough')

In [None]:
X = ct.fit_transform(X)
X

In [None]:
X = X[:,1:]
X
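
The manual slice above can also be expressed in the encoder itself. The sketch below is an alternative, not the notebook's method: it assumes a scikit-learn version where `OneHotEncoder` accepts string categories and a `drop` argument, and it uses made-up rows with a placeholder `third town` label. Note that `drop="first"` removes the alphabetically first category, so the surviving columns depend on the town names.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy rows with made-up values (NOT the contents of homeprices.csv)
toy = pd.DataFrame({
    "town": ["robbinsville", "west windsor", "third town", "robbinsville"],
    "area": [2600, 3000, 3200, 2900],
})

ct = ColumnTransformer(
    # Encode the string column directly and drop the first category
    [("town", OneHotEncoder(drop="first"), ["town"])],
    remainder="passthrough",   # keep area unchanged, after the dummies
    sparse_threshold=0,        # always return a dense array
)
X_enc = ct.fit_transform(toy)

# Categories are sorted; the first one is the dropped baseline
categories = list(ct.named_transformers_["town"].categories_[0])
print(categories)
print(X_enc)
```

Here the output columns are the two remaining town indicators followed by area, so no `LabelEncoder` step or manual column slice is needed.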

In [None]:
model.fit(X,Y)

In [None]:
model.predict([[0,1,3400]]) # 3400 sqr ft home in west windsor

In [None]:
model.predict([[1,0,2800]]) # 2800 sqr ft home in robbinsville