# One Hot Encoding

To predict the CO2 emissions of a vehicle from its specifications, we plan to use a linear regression model. However, our dataset contains categorical variables, which cannot be fed to the model directly as text, and coding them as arbitrary numbers could make the model inaccurate and unreliable. We will therefore use one-hot encoding, a technique that transforms categorical data into numerical columns: for each value of a categorical variable, a new binary (0/1) column is created.
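As a minimal illustration of the idea, independent of any library, the sketch below encodes a handful of toy values by hand. The list `fuel_types` is hypothetical and only reuses labels that occur in the dataset.

```python
# Toy values (hypothetical) for a single categorical variable
fuel_types = ["Z", "D", "X", "Z"]

# One binary column per distinct category, in sorted order
categories = sorted(set(fuel_types))

# Each row gets a 1 in the column of its own category and 0 elsewhere
one_hot = [[1 if value == cat else 0 for cat in categories] for value in fuel_types]

print("columns:", categories)
for value, row in zip(fuel_types, one_hot):
    print(value, row)
```

Every row contains exactly one 1, which is where the name "one-hot" comes from.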

In [1]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt

In [2]:
# Import the cleaned dataset with no outliers
df = pd.read_csv('CO2 Emissions_Canada_cleaned_removed_outliers.csv')
df.drop(df.columns[0], axis=1, inplace=True)

print("Data type : ", type(df))
print("Data dims : ", df.shape)

Data type :  <class 'pandas.core.frame.DataFrame'>
Data dims :  (5965, 12)


In [3]:
df.head()

Unnamed: 0,Make,Model,Vehicle Class,Engine Size(L),Cylinders,Transmission,Fuel Type,Fuel Consumption City (L/100 km),Fuel Consumption Hwy (L/100 km),Fuel Consumption Comb (L/100 km),CO2 Emissions(g/km),Number of Gears
0,ACURA,ILX,COMPACT,2.0,4,AS,Z,9.9,6.7,8.5,196,5
1,ACURA,ILX,COMPACT,2.4,4,M,Z,11.2,7.7,9.6,221,6
2,ACURA,ILX HYBRID,COMPACT,1.5,4,AV,Z,6.0,5.8,5.9,136,7
3,ACURA,MDX 4WD,SUV - SMALL,3.5,6,AS,Z,12.7,9.1,11.1,255,6
4,ACURA,RDX AWD,SUV - SMALL,3.5,6,AS,Z,12.1,8.7,10.6,244,6


## Dropping of Unnecessary Columns 

Our project examines the specifications of a vehicle. The make and model, which denote the manufacturer and the car model respectively, are not vehicle specifications but identifiers. Instead, we will focus on the other categorical variables (Vehicle Class, Transmission, and Fuel Type), as these features describe the distinct specifications of a vehicle.

We will use only the "Fuel Consumption Comb" column because it represents the combined fuel consumption in litres per 100 kilometres, accounting for both city and highway driving. As a result, we will exclude the "Fuel Consumption City" and "Fuel Consumption Hwy" columns.
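As a rough sanity check on this choice, the combined figure looks like a weighted average of the other two columns. The sketch below assumes the 55% city / 45% highway weighting that Natural Resources Canada describes for its combined rating; this weighting is an assumption and is not stated anywhere in the dataset. It is applied to the five rows shown in the `df.head()` output above.

```python
# (city, highway, combined) in L/100 km, copied from the df.head() output above
rows = [
    (9.9, 6.7, 8.5),
    (11.2, 7.7, 9.6),
    (6.0, 5.8, 5.9),
    (12.7, 9.1, 11.1),
    (12.1, 8.7, 10.6),
]

# Assumed weighting, not taken from the dataset itself
CITY_WEIGHT, HWY_WEIGHT = 0.55, 0.45

differences = []
for city, hwy, comb in rows:
    estimate = CITY_WEIGHT * city + HWY_WEIGHT * hwy
    differences.append(abs(comb - estimate))
    print(f"reported={comb:5.1f}  estimated={estimate:6.2f}")
```

If the estimates track the reported values closely, the city and highway columns add little information beyond the combined column.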

In [4]:
# Create a copy of the dataset
df2 = df.copy()

# Drop Make, Model, and the city/highway fuel consumption columns
df2.drop(['Make', 'Model', 'Fuel Consumption Hwy (L/100 km)', 'Fuel Consumption City (L/100 km)'], axis=1, inplace=True)

df2.head()

Unnamed: 0,Vehicle Class,Engine Size(L),Cylinders,Transmission,Fuel Type,Fuel Consumption Comb (L/100 km),CO2 Emissions(g/km),Number of Gears
0,COMPACT,2.0,4,AS,Z,8.5,196,5
1,COMPACT,2.4,4,M,Z,9.6,221,6
2,COMPACT,1.5,4,AV,Z,5.9,136,7
3,SUV - SMALL,3.5,6,AS,Z,11.1,255,6
4,SUV - SMALL,3.5,6,AS,Z,10.6,244,6


Our Categorical Variables are: 
> Vehicle Class  
> Transmission  
> Fuel Type    

None of these variables is ordinal: their categories have no inherent order or numerical relationship. One-hot encoding is therefore an appropriate choice.

In [5]:
# Extract the Categorical Variables and create a copy 
categorical_columns = ['Vehicle Class', 'Transmission', 'Fuel Type']
categorical_df = df2[categorical_columns].copy()
categorical_df.head(10)

Unnamed: 0,Vehicle Class,Transmission,Fuel Type
0,COMPACT,AS,Z
1,COMPACT,M,Z
2,COMPACT,AV,Z
3,SUV - SMALL,AS,Z
4,SUV - SMALL,AS,Z
5,MID-SIZE,AS,Z
6,MID-SIZE,AS,Z
7,MID-SIZE,AS,Z
8,MID-SIZE,M,Z
9,COMPACT,AS,Z


In [6]:
# Display the unique values of categorical columns before encoding
# Display the number of unique values for each categorical column
print("Unique values before encoding: ")
print()
for column in categorical_columns:
    print(column, ":", df2[column].unique())
    print("Number of unique values:", df2[column].nunique())
    print()

Unique values before encoding: 

Vehicle Class : ['COMPACT' 'SUV - SMALL' 'MID-SIZE' 'TWO-SEATER' 'MINICOMPACT'
 'SUBCOMPACT' 'FULL-SIZE' 'STATION WAGON - SMALL' 'SUV - STANDARD'
 'VAN - CARGO' 'VAN - PASSENGER' 'PICKUP TRUCK - STANDARD' 'MINIVAN'
 'SPECIAL PURPOSE VEHICLE' 'STATION WAGON - MID-SIZE'
 'PICKUP TRUCK - SMALL']
Number of unique values: 16

Transmission : ['AS' 'M' 'AV' 'AM' 'A']
Number of unique values: 5

Fuel Type : ['Z' 'D' 'X' 'E' 'N']
Number of unique values: 5



We will use the `OneHotEncoder` class in `sklearn` to perform one-hot encoding.



In [7]:
from sklearn.preprocessing import OneHotEncoder

# One-Hot Encoding
encoder_one_hot = OneHotEncoder()
encoded_data = encoder_one_hot.fit_transform(categorical_df)

# Convert sparse matrix to DataFrame
encoded_df = pd.DataFrame(encoded_data.toarray(), columns=encoder_one_hot.get_feature_names_out())

# Drop the original columns from the DataFrame
df2.drop(columns=categorical_columns, inplace=True)

# Concatenate the original DataFrame with the one-hot encoded DataFrame
df_encoded = pd.concat([df2, encoded_df], axis=1)

In [8]:
# Display the first rows of the full DataFrame after one-hot encoding
df_encoded.head()

Unnamed: 0,Engine Size(L),Cylinders,Fuel Consumption Comb (L/100 km),CO2 Emissions(g/km),Number of Gears,Vehicle Class_COMPACT,Vehicle Class_FULL-SIZE,Vehicle Class_MID-SIZE,Vehicle Class_MINICOMPACT,Vehicle Class_MINIVAN,...,Transmission_A,Transmission_AM,Transmission_AS,Transmission_AV,Transmission_M,Fuel Type_D,Fuel Type_E,Fuel Type_N,Fuel Type_X,Fuel Type_Z
0,2.0,4,8.5,196,5,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,2.4,4,9.6,221,6,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
2,1.5,4,5.9,136,7,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
3,3.5,6,11.1,255,6,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,3.5,6,10.6,244,6,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [9]:
# Display the new columns
encoded_df.head()

Unnamed: 0,Vehicle Class_COMPACT,Vehicle Class_FULL-SIZE,Vehicle Class_MID-SIZE,Vehicle Class_MINICOMPACT,Vehicle Class_MINIVAN,Vehicle Class_PICKUP TRUCK - SMALL,Vehicle Class_PICKUP TRUCK - STANDARD,Vehicle Class_SPECIAL PURPOSE VEHICLE,Vehicle Class_STATION WAGON - MID-SIZE,Vehicle Class_STATION WAGON - SMALL,...,Transmission_A,Transmission_AM,Transmission_AS,Transmission_AV,Transmission_M,Fuel Type_D,Fuel Type_E,Fuel Type_N,Fuel Type_X,Fuel Type_Z
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [10]:
# Display the number of unique values in each column after one-hot encoding
df_encoded.nunique()

Engine Size(L)                             49
Cylinders                                   7
Fuel Consumption Comb (L/100 km)          165
CO2 Emissions(g/km)                       280
Number of Gears                             7
Vehicle Class_COMPACT                       2
Vehicle Class_FULL-SIZE                     2
Vehicle Class_MID-SIZE                      2
Vehicle Class_MINICOMPACT                   2
Vehicle Class_MINIVAN                       2
Vehicle Class_PICKUP TRUCK - SMALL          2
Vehicle Class_PICKUP TRUCK - STANDARD       2
Vehicle Class_SPECIAL PURPOSE VEHICLE       2
Vehicle Class_STATION WAGON - MID-SIZE      2
Vehicle Class_STATION WAGON - SMALL         2
Vehicle Class_SUBCOMPACT                    2
Vehicle Class_SUV - SMALL                   2
Vehicle Class_SUV - STANDARD                2
Vehicle Class_TWO-SEATER                    2
Vehicle Class_VAN - CARGO                   2
Vehicle Class_VAN - PASSENGER               2
Transmission_A                    

In [11]:
print("Data type : ", type(df_encoded))
print("Data dims : ", df_encoded.shape)

Data type :  <class 'pandas.core.frame.DataFrame'>
Data dims :  (5965, 31)


The number of rows remained at 5965, so no data was lost during one-hot encoding. The number of columns changed from 12 to 31: Make, Model, and the City and Hwy fuel consumption columns were dropped, and the Vehicle Class, Transmission, and Fuel Type columns were replaced by 26 new binary columns, one for each category value.
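
The column count can be checked by hand from the shapes and unique-value counts printed earlier. The variable names below are only for illustration.

```python
original_columns = 12             # df.shape[1] right after loading
dropped_identifiers = 2           # Make, Model
dropped_fuel_consumption = 2      # Fuel Consumption City, Fuel Consumption Hwy
dropped_categoricals = 3          # Vehicle Class, Transmission, Fuel Type
new_binary_columns = 16 + 5 + 5   # unique values of the three categorical variables

expected_columns = (
    original_columns
    - dropped_identifiers
    - dropped_fuel_consumption
    - dropped_categoricals
    + new_binary_columns
)
print(expected_columns)  # 31, matching df_encoded.shape[1]
```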

In [13]:
# Create new CSV File after one hot encoding
df_encoded.to_csv('CO2 Emissions_Canada_cleaned_removed_outliers_encoded.csv')
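
By default, `to_csv` also writes the row index as an unnamed first column, which is presumably why the loading cell at the top of this notebook drops `df.columns[0]`. A minimal sketch of an alternative, assuming the index carries no information, is to pass `index=False` when writing. The frame `demo` and the in-memory buffers are hypothetical stand-ins for the real file.

```python
import io
import pandas as pd

demo = pd.DataFrame({"Cylinders": [4, 6], "Fuel Type_Z": [1.0, 1.0]})

# Default behaviour: the index is written as an extra, unnamed first column
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

With `index=False`, a later notebook could read the file back without having to drop the first column.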