## With nominal data, we have to predict from string values as well as numeric values
### We encode those string values as dummy variables and use them as features
#### This can be done in 2 ways:

* **Using Pandas dummies method**
* **Using Sklearn's One Hot Encoder module**

# Using Pandas dummies method

In [77]:
import pandas as pd 

# NOMINAL DATA
df = pd.read_csv("ML Practice Files/Dummy Variables Use/homeprices.csv")
df

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000
5,west windsor,2600,585000
6,west windsor,2800,615000
7,west windsor,3300,650000
8,west windsor,3600,710000
9,robinsville,2600,575000


In [78]:
dummies = pd.get_dummies(df.town)
dummies 
# Dummy variables: one boolean column per town (pass dtype=int to get_dummies for 0/1)

Unnamed: 0,monroe township,robinsville,west windsor
0,True,False,False
1,True,False,False
2,True,False,False
3,True,False,False
4,True,False,False
5,False,False,True
6,False,False,True
7,False,False,True
8,False,False,True
9,False,True,False


In [79]:
merged = pd.concat([df,dummies],axis="columns")
merged

Unnamed: 0,town,area,price,monroe township,robinsville,west windsor
0,monroe township,2600,550000,True,False,False
1,monroe township,3000,565000,True,False,False
2,monroe township,3200,610000,True,False,False
3,monroe township,3600,680000,True,False,False
4,monroe township,4000,725000,True,False,False
5,west windsor,2600,585000,False,False,True
6,west windsor,2800,615000,False,False,True
7,west windsor,3300,650000,False,False,True
8,west windsor,3600,710000,False,False,True
9,robinsville,2600,575000,False,True,False


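> Now we have to perform two steps:
> * Drop the 'town' column (it is no longer needed, since the dummy columns carry the same information)
> * Drop one dummy column (to avoid the dummy variable trap: the dummy columns always sum to 1, so any one of them is fully determined by the others)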

In [80]:
final = merged.drop(['town','west windsor'],axis="columns")
final

Unnamed: 0,area,price,monroe township,robinsville
0,2600,550000,True,False
1,3000,565000,True,False
2,3200,610000,True,False
3,3600,680000,True,False
4,4000,725000,True,False
5,2600,585000,False,False
6,2800,615000,False,False
7,3300,650000,False,False
8,3600,710000,False,False
9,2600,575000,False,True
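
As an aside, the two drop steps above can probably be folded into the encoding call itself. The following is only a sketch, assuming a small inline stand-in for homeprices.csv (three rows copied from the table above; `df_demo` and `final_demo` are made-up names, not part of the notebook). Note that `drop_first=True` drops the first category in sorted order (monroe township), not west windsor as done above.

```python
import pandas as pd

# Inline stand-in for homeprices.csv: three rows copied from the table above
df_demo = pd.DataFrame({
    "town": ["monroe township", "west windsor", "robinsville"],
    "area": [2600, 2600, 2600],
    "price": [550000, 585000, 575000],
})

# Encode 'town', remove the original column, and drop one dummy in a single call
final_demo = pd.get_dummies(df_demo, columns=["town"], drop_first=True, dtype=int)
print(final_demo.columns.tolist())
```

Either choice of dropped column avoids the dummy variable trap. It only changes which town acts as the baseline that the remaining dummy coefficients are measured against.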


In [81]:
# Now Create your model
from sklearn.linear_model import LinearRegression

model = LinearRegression()
x = final.drop(['price'],axis="columns")         # 3 Independent variables (area,monroe township and robinsville)
y = final.price                                  # Dependent Variable (price)

model.fit(x,y)
# model.predict([[2800,0,1]])

# 2800 Area in robinsville
prediction_1 = model.predict(pd.DataFrame([[2800,0,1]],columns=['area','monroe township','robinsville']))

# 3400 Area in west windsor
prediction_2 = model.predict(pd.DataFrame([[3400,0,0]],columns=['area','monroe township','robinsville']))

print(prediction_1,prediction_2)

[590775.63964739] [681241.66845839]
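
To see what the dropped column means for the model, it can help to look at the fitted coefficients. This is a rough sketch that uses only the ten rows displayed above, typed in by hand, so its fitted numbers need not match the predictions printed by the notebook. The names `demo`, `features` and `demo_model` are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# The ten displayed rows, already encoded with 'west windsor' as the dropped (baseline) town
demo = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600, 2600],
    "monroe township": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "robinsville": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "price": [550000, 565000, 610000, 680000, 725000,
              585000, 615000, 650000, 710000, 575000],
})
features = ["area", "monroe township", "robinsville"]
demo_model = LinearRegression().fit(demo[features], demo["price"])

# One coefficient per feature; each dummy coefficient is an offset relative to west windsor
coef = dict(zip(features, demo_model.coef_))
print(coef, demo_model.intercept_)

# For the baseline town both dummies are 0, so only the intercept and area contribute
query = pd.DataFrame([[3400, 0, 0]], columns=features)
by_hand = demo_model.intercept_ + coef["area"] * 3400
print(demo_model.predict(query)[0], by_hand)
```

The two printed values on the last line should agree, which shows that a west windsor prediction uses no dummy coefficient at all.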


# Using Sklearn's One Hot Encoder module

In [82]:
import pandas as pd 
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

new_df = df.copy()                        # work on a copy, so the original df keeps its town names
new_df.town = encoder.fit_transform(new_df.town)
new_df

Unnamed: 0,town,area,price
0,0,2600,550000
1,0,3000,565000
2,0,3200,610000
3,0,3600,680000
4,0,4000,725000
5,2,2600,585000
6,2,2800,615000
7,2,3300,650000
8,2,3600,710000
9,1,2600,575000


In [83]:
x = new_df[['town','area']].values        # town and area are the independent variables
y = new_df.price.values                   # price is the dependent variable
print(x,y)

[[   0 2600]
 [   0 3000]
 [   0 3200]
 [   0 3600]
 [   0 4000]
 [   2 2600]
 [   2 2800]
 [   2 3300]
 [   2 3600]
 [   1 2600]
 [   1 2900]
 [   1 3100]
 [   1 3600]] [550000 565000 610000 680000 725000 585000 615000 650000 710000 575000
 600000 620000 695000]


In [84]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer([('town',OneHotEncoder(),[0])],remainder='passthrough') 
# ColumnTransformer applies OneHotEncoder to column 0 (town) and passes the remaining columns through unchanged

x = ct.fit_transform(x)
x

array([[1.0e+00, 0.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.0e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.2e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 4.0e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.6e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.8e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.3e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 2.9e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.1e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.6e+03]])

In [85]:
x = x[:,1:]                     # Drop the first dummy column (monroe township) to avoid the dummy variable trap
x

array([[0.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 0.0e+00, 3.0e+03],
       [0.0e+00, 0.0e+00, 3.2e+03],
       [0.0e+00, 0.0e+00, 3.6e+03],
       [0.0e+00, 0.0e+00, 4.0e+03],
       [0.0e+00, 1.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 2.8e+03],
       [0.0e+00, 1.0e+00, 3.3e+03],
       [0.0e+00, 1.0e+00, 3.6e+03],
       [1.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 2.9e+03],
       [1.0e+00, 0.0e+00, 3.1e+03],
       [1.0e+00, 0.0e+00, 3.6e+03]])
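
The label-encode, one-hot-encode and slice steps above can likely be shortened, because `OneHotEncoder` accepts string columns directly and has a `drop` parameter. The sketch below assumes a tiny made-up frame (`towns`) with rows in the style of the data above. `ct_demo` and `x_demo` are illustrative names, and `sparse_threshold=0` is set only to force a dense array for printing.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Small illustrative frame; town names stay as strings, no LabelEncoder needed
towns = pd.DataFrame({
    "town": ["monroe township", "west windsor", "robinsville", "west windsor"],
    "area": [2600, 2600, 2600, 3300],
})

ct_demo = ColumnTransformer(
    [("town", OneHotEncoder(drop="first"), ["town"])],  # drop="first" removes one dummy per feature
    remainder="passthrough",                            # keep 'area' unchanged
    sparse_threshold=0,                                 # always return a dense array
)
x_demo = ct_demo.fit_transform(towns)

# Categories found by the encoder, in sorted order; the first one is the dropped baseline
print(ct_demo.named_transformers_["town"].categories_)
print(x_demo)
```

As with the manual slice, the resulting columns are [robinsville, west windsor, area], so a monroe township row has zeros in both dummy columns.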

In [86]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()

model.fit(x,y)

# Column order is now [robinsville, west windsor, area]
# 2800 Area in robinsville
prediction_3 = model.predict([[1,0,2800]])
# 3400 Area in west windsor
prediction_4 = model.predict([[0,1,3400]])

print(prediction_3,prediction_4)

[590775.63964739] [681241.6684584]


<h1 style="color:green">Exercise</h1>

At the same level as this notebook on GitHub, there is an Exercise folder that contains carprices.csv. This file has car sell prices for 3 different models. First, plot the data points on a scatter plot to see whether a linear regression model can be applied. If so, build a model that can answer the following questions:

1) **Predict the price of a Mercedes-Benz that is 4 years old with a mileage of 45000**

2) **Predict the price of a BMW X5 that is 7 years old with a mileage of 86000**

3) **Report the score of your model; for a regression model this is the R² value rather than a classification accuracy. (Hint: use LinearRegression().score())**

In [87]:
import pandas as pd
from sklearn.linear_model import LinearRegression

df2 = pd.read_csv("ML Practice Files/Dummy Variables Use/carprices.csv")            #Reading data
dummies2 = pd.get_dummies(df2['Car Model'])                                         #Dummy Columns
merged = pd.concat([df2,dummies2],axis="columns")                                   #Merge original dataframe and dummy columns
final = merged.drop(['Car Model','Mercedez Benz C class'],axis="columns")           #Then, drop 'Car Model' and 1 dummy column

x = final.drop(['Sell Price($)'],axis="columns")                                    # Independent Variables
y = final['Sell Price($)']                                                          # Dependent Variable

model2 = LinearRegression()
model2.fit(x,y)

mercedez_benz_prediction = model2.predict(pd.DataFrame([[45000,4,0,0]],columns=x.columns))
BMW_prediction = model2.predict(pd.DataFrame([[86000,7,0,1]],columns=x.columns))

print(f"price of a mercedez benz that is 4 yr old with mileage 45000 is {mercedez_benz_prediction}")
print(f"price of a BMW X5 that is 7 yr old with mileage 86000 is {BMW_prediction}")
print(f"score (accuracy) of your model is {model2.score(x,y)}")

price of a mercedez benz that is 4 yr old with mileage 45000 is [36991.31721061]
price of a BMW X5 that is 7 yr old with mileage 86000 is [11080.74313219]
score (accuracy) of your model is 0.9417050937281082
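
A note on the "score" printed above: for `LinearRegression`, `score()` returns the R² (coefficient of determination) of the predictions, not a classification accuracy. The sketch below checks this against `r2_score` on made-up numbers. `X_toy` and `y_toy` are invented for illustration and are not taken from carprices.csv.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Made-up numbers for illustration only, not taken from carprices.csv
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

toy_model = LinearRegression().fit(X_toy, y_toy)

score = toy_model.score(X_toy, y_toy)            # what the exercise prints as "score"
r2 = r2_score(y_toy, toy_model.predict(X_toy))   # R² computed explicitly
print(score, r2)
```

Both values should be identical. Since the score above is computed on the same rows the model was fitted on, it describes fit quality on the training data rather than performance on unseen cars.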
