**DS 301: Applied Data Modeling and Predictive Analysis**

**Lecture 8 – One-Hot Encoding**

# One-Hot Encoding
Nok Wongpiromsarn, 8 August 2022

**Load the automobile price data and apply some filtering as in Lecture 7**

Automobile_price_data_Raw.csv can be downloaded from

https://github.com/MicrosoftLearning/Principles-of-Machine-Learning-Python/tree/master/Module3

We put it under the *datasets* folder, saved as automobile.csv to match the path used in the code below.

In [1]:
import os
import pandas as pd

data_path = os.path.join("datasets", "automobile.csv")
data = pd.read_csv(data_path)

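# Remove all rows whose price or horsepower is missing (recorded as "?")
# .copy() gives an independent DataFrame, so the column assignments below do not trigger SettingWithCopyWarning
data = data[(data.price != "?") & (data.horsepower != "?")].copy()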

# Change the data type of price and horsepower from *object* to a suitable numeric type
data.price = pd.to_numeric(data.price)
data.horsepower = pd.to_numeric(data.horsepower)

# Preview the first 10 rows
data.head(10)

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450
5,2,?,audi,gas,std,two,sedan,fwd,front,99.8,...,136,mpfi,3.19,3.4,8.5,110,5500,19,25,15250
6,1,158,audi,gas,std,four,sedan,fwd,front,105.8,...,136,mpfi,3.19,3.4,8.5,110,5500,19,25,17710
7,1,?,audi,gas,std,four,wagon,fwd,front,105.8,...,136,mpfi,3.19,3.4,8.5,110,5500,19,25,18920
8,1,158,audi,gas,turbo,four,sedan,fwd,front,105.8,...,131,mpfi,3.13,3.4,8.3,140,5500,17,20,23875
10,2,192,bmw,gas,std,two,sedan,rwd,front,101.2,...,108,mpfi,3.5,2.8,8.8,101,5800,23,29,16430
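
The filtering step above can also be done at load time. The following is a minimal sketch, not the lecture's own method: it assumes a small hypothetical CSV (the `csv_text` string and the `toy` name are illustrative only) and uses the `na_values` parameter of `pd.read_csv` so that "?" is parsed as a missing value and the affected columns come out numeric.

```python
import io
import pandas as pd

# Hypothetical miniature CSV standing in for the real file; "?" marks missing values
csv_text = """make,horsepower,price
audi,102,13950
audi,?,17450
bmw,101,?
bmw,121,20970
"""

# na_values="?" makes pandas parse "?" as NaN, so the numeric columns are not read as object
toy = pd.read_csv(io.StringIO(csv_text), na_values="?")

# Drop rows where price or horsepower is missing
toy = toy.dropna(subset=["price", "horsepower"])

print(toy.dtypes)
print(toy)
```

With this approach no separate `pd.to_numeric` call is needed, although columns that contained missing values are read as floating point rather than integer.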


**Apply one-hot encoding to the body-style column**

In [2]:
# dtype=int yields 0/1 indicator columns; recent pandas versions default to True/False
body_1hot = pd.get_dummies(data['body-style'], prefix='body-style', dtype=int)
data = pd.concat([data, body_1hot], axis=1)
data.head(10)

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,horsepower,peak-rpm,city-mpg,highway-mpg,price,body-style_convertible,body-style_hardtop,body-style_hatchback,body-style_sedan,body-style_wagon
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,111,5000,21,27,13495,1,0,0,0,0
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,111,5000,21,27,16500,1,0,0,0,0
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,154,5000,19,26,16500,0,0,1,0,0
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,102,5500,24,30,13950,0,0,0,1,0
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,115,5500,18,22,17450,0,0,0,1,0
5,2,?,audi,gas,std,two,sedan,fwd,front,99.8,...,110,5500,19,25,15250,0,0,0,1,0
6,1,158,audi,gas,std,four,sedan,fwd,front,105.8,...,110,5500,19,25,17710,0,0,0,1,0
7,1,?,audi,gas,std,four,wagon,fwd,front,105.8,...,110,5500,19,25,18920,0,0,0,0,1
8,1,158,audi,gas,turbo,four,sedan,fwd,front,105.8,...,140,5500,17,20,23875,0,0,0,1,0
10,2,192,bmw,gas,std,two,sedan,rwd,front,101.2,...,101,5800,23,29,16430,0,0,0,1,0
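
To see what `pd.get_dummies` does in isolation, here is a small sketch on a hypothetical four-row column (the `toy` DataFrame is made up for illustration and is not part of the automobile data). Each distinct category becomes its own indicator column, and every row has exactly one indicator set.

```python
import pandas as pd

# Hypothetical toy column, not taken from the automobile dataset
toy = pd.DataFrame({"body-style": ["sedan", "wagon", "sedan", "hatchback"]})

# One indicator column per distinct category; dtype=int gives 0/1 values
toy_1hot = pd.get_dummies(toy["body-style"], prefix="body-style", dtype=int)
print(toy_1hot)

# Sanity check: each row has exactly one indicator equal to 1
row_sums = toy_1hot.sum(axis=1)
print(row_sums.tolist())
```

The new columns are named by joining the prefix and the category with an underscore, which is where names such as body-style_sedan in the table above come from.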


In [3]:
# Remove the 'body-style' column. 
# Specify inplace=True to delete the column without having to reassign data.
data.drop('body-style', axis=1, inplace=True)
data.head(10)

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,drive-wheels,engine-location,wheel-base,length,...,horsepower,peak-rpm,city-mpg,highway-mpg,price,body-style_convertible,body-style_hardtop,body-style_hatchback,body-style_sedan,body-style_wagon
0,3,?,alfa-romero,gas,std,two,rwd,front,88.6,168.8,...,111,5000,21,27,13495,1,0,0,0,0
1,3,?,alfa-romero,gas,std,two,rwd,front,88.6,168.8,...,111,5000,21,27,16500,1,0,0,0,0
2,1,?,alfa-romero,gas,std,two,rwd,front,94.5,171.2,...,154,5000,19,26,16500,0,0,1,0,0
3,2,164,audi,gas,std,four,fwd,front,99.8,176.6,...,102,5500,24,30,13950,0,0,0,1,0
4,2,164,audi,gas,std,four,4wd,front,99.4,176.6,...,115,5500,18,22,17450,0,0,0,1,0
5,2,?,audi,gas,std,two,fwd,front,99.8,177.3,...,110,5500,19,25,15250,0,0,0,1,0
6,1,158,audi,gas,std,four,fwd,front,105.8,192.7,...,110,5500,19,25,17710,0,0,0,1,0
7,1,?,audi,gas,std,four,fwd,front,105.8,192.7,...,110,5500,19,25,18920,0,0,0,0,1
8,1,158,audi,gas,turbo,four,fwd,front,105.8,192.7,...,140,5500,17,20,23875,0,0,0,1,0
10,2,192,bmw,gas,std,two,rwd,front,101.2,176.8,...,101,5800,23,29,16430,0,0,0,1,0
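
The encode, concatenate, and drop steps can also be collapsed into one call. This is a sketch under the assumption that the new indicator columns should replace the original column; it passes the whole DataFrame to `pd.get_dummies` with the `columns` parameter. The `toy` DataFrame below is hypothetical.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "body-style": ["sedan", "wagon", "sedan"],
    "price": [13950, 18920, 17450],
})

# Encoding via columns=[...] removes 'body-style' and appends the indicator columns,
# using the column name as the prefix
encoded = pd.get_dummies(toy, columns=["body-style"], dtype=int)
print(encoded.columns.tolist())
```

This form avoids the separate `pd.concat` and `drop` calls, at the cost of being less explicit about each step.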


In [4]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

x = data[['engine-size', 'curb-weight', 'width', 'highway-mpg',
          'body-style_convertible', 'body-style_hardtop', 'body-style_hatchback', 
          'body-style_sedan', 'body-style_wagon']]
y = data['price']

# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=4)

# Apply least squares linear regression
reg = LinearRegression().fit(x_train, y_train)
print("Coefficient: {}".format(reg.coef_))
print("Intercept: {}".format(reg.intercept_))

y_test_predict = reg.predict(x_test)
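# Take the square root of the MSE; the squared=False option is no longer available in recent scikit-learn releases
rmse = mean_squared_error(y_test, y_test_predict) ** 0.5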
rsquared = reg.score(x_test, y_test)
print("RMSE: {}".format(rmse))
print("Coefficient of determination: {}".format(rsquared))

Coefficient: [ 9.57007876e+01  2.67635327e+00  5.47526676e+02 -1.18700329e+02
  3.93405940e+03  1.03456116e+03 -1.81350690e+03 -8.39660977e+02
 -2.31545268e+03]
Intercept: -37061.67552981028
RMSE: 2569.8910071978203
Coefficient of determination: 0.9068317546273804
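
For reference, the two metrics reported above can be computed by hand. The sketch below uses hypothetical arrays (`y_true` and `y_pred` are made-up numbers, not the lecture's test set) and follows the standard definitions: RMSE is the square root of the mean squared residual, and the coefficient of determination is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Hypothetical targets and predictions for illustration only
y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_pred = np.array([11.0, 11.0, 16.0, 18.0])

residuals = y_true - y_pred

# Root mean squared error: square root of the average squared residual
rmse = np.sqrt(np.mean(residuals ** 2))

# Coefficient of determination: 1 - SS_res / SS_tot
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("RMSE: {}".format(rmse))
print("Coefficient of determination: {}".format(r2))
```

RMSE is expressed in the same units as the target (here, price), while the coefficient of determination is unitless.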
