***Dummy Variables Using Pandas and scikit-learn One-Hot Encoding***

In [1]:
import pandas as pd
import numpy as np
import sklearn

In [2]:
df=pd.read_csv("carprices.csv")
df

Unnamed: 0,Car Model,Mileage,Sell Price($),Age(yrs)
0,BMW X5,69000,18000,6
1,BMW X5,35000,34000,3
2,BMW X5,57000,26100,5
3,BMW X5,22500,40000,2
4,BMW X5,46000,31500,4
5,Audi A5,59000,29400,5
6,Audi A5,52000,32000,5
7,Audi A5,72000,19300,6
8,Audi A5,91000,12000,8
9,Mercedez Benz C class,67000,22000,6


**The Car Model column is categorical but stored as text, so it can't be used directly for prediction**

**Hence we need to convert it into a numeric form**

***1- Using pandas get_dummies***

In [3]:
dummies = pd.get_dummies(df['Car Model'])
dummies

Unnamed: 0,Audi A5,BMW X5,Mercedez Benz C class
0,0,1,0
1,0,1,0
2,0,1,0
3,0,1,0
4,0,1,0
5,1,0,0
6,1,0,0
7,1,0,0
8,1,0,0
9,0,0,1


**Now we have to merge these dummy columns into our original DataFrame**

In [4]:
original=pd.concat([df,dummies],axis=1)
original

Unnamed: 0,Car Model,Mileage,Sell Price($),Age(yrs),Audi A5,BMW X5,Mercedez Benz C class
0,BMW X5,69000,18000,6,0,1,0
1,BMW X5,35000,34000,3,0,1,0
2,BMW X5,57000,26100,5,0,1,0
3,BMW X5,22500,40000,2,0,1,0
4,BMW X5,46000,31500,4,0,1,0
5,Audi A5,59000,29400,5,1,0,0
6,Audi A5,52000,32000,5,1,0,0
7,Audi A5,72000,19300,6,1,0,0
8,Audi A5,91000,12000,8,1,0,0
9,Mercedez Benz C class,67000,22000,6,0,0,1


***Dummy Variable Trap***

*The Dummy Variable Trap occurs when the dummy variables created by one-hot encoding are perfectly collinear (multicollinear). The dummy columns of one feature always sum to 1, so any one of them can be predicted from the others, which makes the fitted coefficients in a regression model difficult to interpret.*
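
The collinearity can be checked numerically. The following is a minimal sketch, assuming a handful of car models taken from the table above; the names `models` and `design` are illustrative and not part of the notebook. It builds the full set of dummy columns, adds an intercept column of ones, and compares the number of columns with the matrix rank.

```python
import numpy as np
import pandas as pd

# A few car models from the table above (illustrative subset).
models = pd.Series(["BMW X5", "Audi A5", "Mercedez Benz C class", "BMW X5", "Audi A5"])

# Full one-hot encoding: one column per category.
# Cast to int so the values are 0/1 regardless of the pandas version.
dummies = pd.get_dummies(models).astype(int)

# Every row contains exactly one 1, so the dummy columns always sum to 1.
row_sums = dummies.sum(axis=1)

# Design matrix with an explicit intercept column of ones.
design = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])

# 4 columns, but the intercept equals the sum of the 3 dummy columns,
# so the columns are linearly dependent and the rank is lower than 4.
rank = np.linalg.matrix_rank(design)
print(design.shape[1], rank)
```

A rank smaller than the number of columns means the regression has no unique set of coefficients for these columns.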

**Solution**

The solution to the dummy variable trap is to drop one of the dummy columns (or, alternatively, drop the intercept constant): if there are m categories, use m-1 columns in the model. The category left out acts as the reference value, and the fitted coefficients of the remaining categories represent the change relative to it.

In [5]:
final_data=original.drop(['Car Model','Mercedez Benz C class'], axis='columns')
final_data

Unnamed: 0,Mileage,Sell Price($),Age(yrs),Audi A5,BMW X5
0,69000,18000,6,0,1
1,35000,34000,3,0,1
2,57000,26100,5,0,1
3,22500,40000,2,0,1
4,46000,31500,4,0,1
5,59000,29400,5,1,0
6,52000,32000,5,1,0
7,72000,19300,6,1,0
8,91000,12000,8,1,0
9,67000,22000,6,0,0
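
The m-1 encoding can also be produced in one step with the `drop_first` parameter of `pd.get_dummies`. This is a small sketch on a made-up three-row frame (`df_demo` is an illustrative name). Note that `drop_first=True` drops the first category in sorted order, which here is Audi A5 rather than Mercedez Benz C class, so the reference category differs from the one chosen above.

```python
import pandas as pd

# Illustrative three-row frame using values from the table above.
df_demo = pd.DataFrame({
    "Car Model": ["BMW X5", "Audi A5", "Mercedez Benz C class"],
    "Mileage": [69000, 59000, 67000],
})

# Encode "Car Model" and drop the first category (Audi A5) in one call.
# The text column is replaced by the remaining dummy columns.
encoded = pd.get_dummies(df_demo, columns=["Car Model"], drop_first=True)
print(encoded.columns.tolist())
```

Either choice of reference category avoids the trap; only the interpretation of the coefficients changes.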


**Now we have clean data. Let's use a regression model**

In [6]:
X=final_data.drop(['Sell Price($)'],axis='columns')
Y=final_data[['Sell Price($)']]

In [7]:
from sklearn.linear_model import LinearRegression
reg=LinearRegression()

In [8]:
reg.fit(X,Y)

**Predicting with the feature order Mileage, Age(yrs), Audi A5, BMW X5 (Mercedez Benz C class is the dropped reference column)**

*If Audi A5 = 0 and BMW X5 = 0, we are predicting for the Mercedez Benz C class*

In [9]:
reg.predict([[59000,5,0,0]])



array([[30477.15426156]])
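
Passing a bare list relies on remembering the column order. A hedged alternative is to build the query as a one-row DataFrame with named columns. The sketch below refits a model on the ten rows shown in the tables above, so its prediction need not match the value printed above if the CSV contains rows that are not displayed; `train`, `features` and `query` are illustrative names.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# The ten rows shown above, with Mercedez Benz C class as the dropped reference.
train = pd.DataFrame({
    "Mileage": [69000, 35000, 57000, 22500, 46000, 59000, 52000, 72000, 91000, 67000],
    "Age(yrs)": [6, 3, 5, 2, 4, 5, 5, 6, 8, 6],
    "Audi A5": [0, 0, 0, 0, 0, 1, 1, 1, 1, 0],
    "BMW X5": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "Sell Price($)": [18000, 34000, 26100, 40000, 31500, 29400, 32000, 19300, 12000, 22000],
})
features = ["Mileage", "Age(yrs)", "Audi A5", "BMW X5"]
model = LinearRegression().fit(train[features], train["Sell Price($)"])

# Name each value explicitly, then reorder to the training column order.
query = pd.DataFrame([{"Mileage": 59000, "Age(yrs)": 5, "Audi A5": 0, "BMW X5": 0}])[features]
prediction = model.predict(query)
print(prediction)
```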

**Model Score (R² on the training data)**

In [10]:
reg.score(X,Y)

0.9417050937281082
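
For a `LinearRegression`, `score` returns the coefficient of determination R², not a classification accuracy. The sketch below recomputes R² by hand on the five BMW X5 rows from the table and compares it with `score`; the names `X_demo` and `y_demo` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Mileage and age of the five BMW X5 rows, with their sell prices.
X_demo = np.array([[69000, 6], [35000, 3], [57000, 5], [22500, 2], [46000, 4]])
y_demo = np.array([18000, 34000, 26100, 40000, 31500])

model = LinearRegression().fit(X_demo, y_demo)
pred = model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

r2_sklearn = model.score(X_demo, y_demo)
print(r2_manual, r2_sklearn)
```

Because the score is computed on the same rows used for fitting, it says how well the model fits the training data, not how well it generalises.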

***--------------------------------------------------------------------------------------------------------------------------------------------------------***


***ONE-HOT-ENCODING***

In [11]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [12]:
org1=df.copy()  # work on a copy so the encoding below does not overwrite df
org1

Unnamed: 0,Car Model,Mileage,Sell Price($),Age(yrs)
0,BMW X5,69000,18000,6
1,BMW X5,35000,34000,3
2,BMW X5,57000,26100,5
3,BMW X5,22500,40000,2
4,BMW X5,46000,31500,4
5,Audi A5,59000,29400,5
6,Audi A5,52000,32000,5
7,Audi A5,72000,19300,6
8,Audi A5,91000,12000,8
9,Mercedez Benz C class,67000,22000,6


In [13]:
org1['Car Model']=le.fit_transform(org1['Car Model'])
org1

Unnamed: 0,Car Model,Mileage,Sell Price($),Age(yrs)
0,1,69000,18000,6
1,1,35000,34000,3
2,1,57000,26100,5
3,1,22500,40000,2
4,1,46000,31500,4
5,0,59000,29400,5
6,0,52000,32000,5
7,0,72000,19300,6
8,0,91000,12000,8
9,2,67000,22000,6
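
The integer assigned to each model can be read from the fitted encoder instead of being inferred from the table. A minimal sketch, assuming a fresh encoder named `le_demo` (illustrative) fitted on the three model names:

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
codes = le_demo.fit_transform(
    ["BMW X5", "Audi A5", "Mercedez Benz C class", "BMW X5"]
)

# classes_ holds the labels in sorted order; the position is the code.
mapping = {str(label): code for code, label in enumerate(le_demo.classes_)}
print(mapping)
```

The codes follow the sorted order of the labels, which is why Audi A5 becomes 0, BMW X5 becomes 1 and Mercedez Benz C class becomes 2 in the table above.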


In [14]:
x1=org1[['Car Model','Mileage','Age(yrs)']].values
x1

array([[    1, 69000,     6],
       [    1, 35000,     3],
       [    1, 57000,     5],
       [    1, 22500,     2],
       [    1, 46000,     4],
       [    0, 59000,     5],
       [    0, 52000,     5],
       [    0, 72000,     6],
       [    0, 91000,     8],
       [    2, 67000,     6],
       [    2, 83000,     7],
       [    2, 79000,     7],
       [    2, 59000,     5]], dtype=int64)

In [15]:
y=org1['Sell Price($)']
y

0     18000
1     34000
2     26100
3     40000
4     31500
5     29400
6     32000
7     19300
8     12000
9     22000
10    20000
11    21000
12    33000
Name: Sell Price($), dtype: int64

In [16]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer 
ct = ColumnTransformer(
    [('one_hot_encoder', OneHotEncoder(categories='auto'), [0])],
    remainder='passthrough' 
)
x1 = ct.fit_transform(x1)
x1=x1[:,1:]
x1

array([[1.00e+00, 0.00e+00, 6.90e+04, 6.00e+00],
       [1.00e+00, 0.00e+00, 3.50e+04, 3.00e+00],
       [1.00e+00, 0.00e+00, 5.70e+04, 5.00e+00],
       [1.00e+00, 0.00e+00, 2.25e+04, 2.00e+00],
       [1.00e+00, 0.00e+00, 4.60e+04, 4.00e+00],
       [0.00e+00, 0.00e+00, 5.90e+04, 5.00e+00],
       [0.00e+00, 0.00e+00, 5.20e+04, 5.00e+00],
       [0.00e+00, 0.00e+00, 7.20e+04, 6.00e+00],
       [0.00e+00, 0.00e+00, 9.10e+04, 8.00e+00],
       [0.00e+00, 1.00e+00, 6.70e+04, 6.00e+00],
       [0.00e+00, 1.00e+00, 8.30e+04, 7.00e+00],
       [0.00e+00, 1.00e+00, 7.90e+04, 7.00e+00],
       [0.00e+00, 1.00e+00, 5.90e+04, 5.00e+00]])

In [17]:
reg.fit(x1,y)

In [18]:
reg.score(x1,y)

0.9417050937281082

Keep in mind that there are 3 car models, which means 3 one-hot columns, but we keep only 2 of them: the 2nd and 3rd, which are BMW X5 and Mercedez Benz C class. The 1st column, Audi A5, was dropped by x1[:,1:] and acts as the reference category. Each row passed to predict therefore has the form [BMW X5, Mercedez Benz C class, mileage, age]. To predict for a BMW X5, the input is 1, 0, mileage, age.
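
Which category is dropped, and what the remaining columns mean, can be checked on the encoder itself. The sketch below uses `OneHotEncoder(drop="first")` as an alternative to slicing off the first column by hand; it is not what the notebook does, and `models`, `enc` and `kept` are illustrative names.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

models = pd.DataFrame(
    {"Car Model": ["BMW X5", "Audi A5", "Mercedez Benz C class", "BMW X5"]}
)

# drop="first" removes the first category in sorted order (Audi A5).
enc = OneHotEncoder(drop="first")
encoded = enc.fit_transform(models).toarray()  # default output is sparse

categories = list(enc.categories_[0])  # all categories, in sorted order
kept = categories[1:]                  # columns remaining after the drop
print(kept)
print(encoded)
```

Here a BMW X5 row is encoded as 1, 0 and an Audi A5 row as 0, 0, matching the first two columns of x1 above.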

In [26]:
reg.predict([[1,0,35000,3]])

array([35286.78445645])

**Price of a Mercedez Benz C class that is 4 years old with a mileage of 45000**

In [27]:
reg.predict([[0,1,45000,4]])

array([36991.31721062])