This notebook takes the original CO2 emissions data (as downloaded directly from Kaggle) and cleans it into a format that can be used with sklearn's `RandomForestRegressor` (RFR).

The cleaned data is then saved as `emissions_cleaned.csv`, which is the file used in the experiments.

In [38]:
# Load Data
import pandas as pd
emissions = pd.read_csv("CO2_Emissions_Canada.csv")

Below we view the original column names. They contain spaces, parentheses, and units, which makes them awkward to work with, so we rename them.

In [39]:
emissions.columns

Index(['Make', 'Model', 'Vehicle Class', 'Engine Size(L)', 'Cylinders',
       'Transmission', 'Fuel Type', 'Fuel Consumption City (L/100 km)',
       'Fuel Consumption Hwy (L/100 km)', 'Fuel Consumption Comb (L/100 km)',
       'Fuel Consumption Comb (mpg)', 'CO2 Emissions(g/km)'],
      dtype='object')

In [40]:
# Rename columns
cols = ["make", "model", "class", "engine_size", "cyl", "transmission", "fuel_type", "fuel_consump_city", "fuel_consump_hwy", "fuel_consump_comb", "fuel_consump_comb_mpg", "co2_emissions"]
emissions.columns = cols
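Assigning a list to `emissions.columns` relies on the column order of the downloaded file staying the same. As a hedged alternative, the sketch below renames by header name instead. `RENAME_MAP`, `rename_columns`, and the one-row `demo` frame (with a made-up make, model, and transmission) are illustrative names and values, not part of the original notebook.

```python
import pandas as pd

# Original Kaggle headers mapped to the short names used in this notebook.
RENAME_MAP = {
    "Make": "make",
    "Model": "model",
    "Vehicle Class": "class",
    "Engine Size(L)": "engine_size",
    "Cylinders": "cyl",
    "Transmission": "transmission",
    "Fuel Type": "fuel_type",
    "Fuel Consumption City (L/100 km)": "fuel_consump_city",
    "Fuel Consumption Hwy (L/100 km)": "fuel_consump_hwy",
    "Fuel Consumption Comb (L/100 km)": "fuel_consump_comb",
    "Fuel Consumption Comb (mpg)": "fuel_consump_comb_mpg",
    "CO2 Emissions(g/km)": "co2_emissions",
}


def rename_columns(df):
    """Rename by header name and fail loudly if an expected header is missing."""
    missing = set(RENAME_MAP) - set(df.columns)
    if missing:
        raise KeyError(f"Missing expected columns: {sorted(missing)}")
    return df.rename(columns=RENAME_MAP)


# Hypothetical one-row stand-in for the real file, only to show the call.
demo = pd.DataFrame(
    [["ACME", "X1", "COMPACT", 2.0, 4, "AS5", "Z", 9.9, 6.7, 8.5, 33, 196]],
    columns=list(RENAME_MAP),
)
renamed = rename_columns(demo)
print(list(renamed.columns))
```

Because the mapping is keyed on the original header, a reordered or partially changed download raises an error instead of silently mislabeling columns.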

For our analysis we only need a handful of columns. To simplify the data, we drop columns that are either categorical with many distinct values, and therefore hard to encode for the RFR model (`make`, `model`, `transmission`), or near-duplicates of other columns (`fuel_consump_comb_mpg`).

In [41]:
# Drop high-cardinality categorical columns and the redundant mpg column
emissions = emissions.drop(["fuel_consump_comb_mpg", "make", "model", "transmission"], axis=1)


Next, we group the many classes of vehicles into larger categories so we have fewer to work with (compact, SUV/minivan, sedan or similar, truck, other).

In [28]:
emissions["class"].unique()

array(['COMPACT', 'SUV - SMALL', 'MID-SIZE', 'TWO-SEATER', 'MINICOMPACT',
       'SUBCOMPACT', 'FULL-SIZE', 'STATION WAGON - SMALL',
       'SUV - STANDARD', 'VAN - CARGO', 'VAN - PASSENGER',
       'PICKUP TRUCK - STANDARD', 'MINIVAN', 'SPECIAL PURPOSE VEHICLE',
       'STATION WAGON - MID-SIZE', 'PICKUP TRUCK - SMALL'], dtype=object)

In [42]:
# Group vehicle class into larger categories
def group_class(row):
    if row["class"] in ["COMPACT", "MINICOMPACT", "SUBCOMPACT", "TWO-SEATER"]:
        val = "compact"
    elif row["class"] in ["SUV - SMALL", "SUV - STANDARD", "MINIVAN"]:
        val = "SUV_minivan"
    elif row["class"] in ["MID-SIZE", "FULL-SIZE", "STATION WAGON - SMALL", "STATION WAGON - MID-SIZE"]:
        val = "sedan_or_similar"
    elif row["class"] in ["PICKUP TRUCK - SMALL", "PICKUP TRUCK - STANDARD"]:
        val = "truck"
    else:
        val = "other"
    return val

emissions["class"] = emissions.apply(group_class, axis=1)
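The same grouping can also be written as a lookup table, which avoids a row-wise `apply` and keeps the category assignments in one place. This is a minimal sketch of that alternative; `CLASS_GROUPS`, `group_classes`, and the `demo` series are illustrative names, and any class not listed falls back to "other", as in the function above.

```python
import pandas as pd

# Original vehicle class -> coarser category (same assignments as group_class).
CLASS_GROUPS = {
    "COMPACT": "compact",
    "MINICOMPACT": "compact",
    "SUBCOMPACT": "compact",
    "TWO-SEATER": "compact",
    "SUV - SMALL": "SUV_minivan",
    "SUV - STANDARD": "SUV_minivan",
    "MINIVAN": "SUV_minivan",
    "MID-SIZE": "sedan_or_similar",
    "FULL-SIZE": "sedan_or_similar",
    "STATION WAGON - SMALL": "sedan_or_similar",
    "STATION WAGON - MID-SIZE": "sedan_or_similar",
    "PICKUP TRUCK - SMALL": "truck",
    "PICKUP TRUCK - STANDARD": "truck",
}


def group_classes(classes):
    """Map each class through the lookup; unlisted classes become 'other'."""
    return classes.map(CLASS_GROUPS).fillna("other")


# Small example series built from class names that appear in the data.
demo = pd.Series(
    ["COMPACT", "MINIVAN", "FULL-SIZE", "PICKUP TRUCK - SMALL", "VAN - CARGO"]
)
grouped = group_classes(demo)
print(grouped.tolist())
```

On the real data this would be applied as `emissions["class"] = group_classes(emissions["class"])`.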


Then we one-hot encode our remaining categorical variables, since RFR does not accept string-valued categorical data.

In [43]:
# One hot encode fuel type variable and car class variable
one_hot_fuel = pd.get_dummies(emissions['fuel_type'])
one_hot_class = pd.get_dummies(emissions['class'])

emissions = emissions.join(one_hot_fuel)
emissions = emissions.join(one_hot_class)
emissions = emissions.drop(['fuel_type', 'class'], axis=1)
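For comparison, `pd.get_dummies` can encode several columns in one call, add a prefix to each new column, and emit integers directly. The sketch below runs on a small hypothetical frame; the single-letter fuel codes (`D`, `X`, `Z`) are an assumption about how the raw `fuel_type` column is coded, and the resulting names (`fuel_D`, `fuel_X`, ...) differ from the descriptive names this notebook assigns by hand.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the real data.
demo = pd.DataFrame(
    {
        "engine_size": [2.0, 3.5, 5.3],
        "fuel_type": ["Z", "X", "D"],  # assumed single-letter fuel codes
        "class": ["compact", "SUV_minivan", "truck"],
    }
)

# Encode both columns at once; the originals are dropped automatically.
encoded = pd.get_dummies(
    demo,
    columns=["fuel_type", "class"],
    prefix={"fuel_type": "fuel", "class": "class"},
    dtype=int,  # 0/1 integers instead of True/False
)
print(list(encoded.columns))
```

The dummy columns for each variable come out in sorted category order, with uppercase letters sorting before lowercase ones, which is why `SUV_minivan` precedes `compact`.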


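Then we rename the new one-hot columns to match our naming style. The names are assigned by position, so the list must follow the order in which `get_dummies` created the columns (sorted by category within each variable).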

In [46]:
# Rename columns again after one hot encoding
cols = ['engine_size', 'cyl', 'fuel_consump_city', 'fuel_consump_hwy',
       'fuel_consump_comb', 'co2_emissions', "fuel_diesel", "fuel_ethanol", "fuel_natgas", "fuel_regular", "fuel_premium", "class_SUV_minivan", "class_compact", "class_other", "class_sedan_or_similar", "class_truck"]
emissions.columns = cols

Finally, we replace True/False with 1 and 0 so that the one-hot columns are stored as plain numeric values for the RFR model.

In [50]:
# Replace True/False with 1/0
emissions = emissions.replace(True, 1).replace(False, 0)
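A narrower way to do the same conversion is to cast only the boolean columns, leaving every other column untouched. This is a sketch on a hypothetical two-row frame; `demo` and `bool_cols` are illustrative names.

```python
import pandas as pd

# Hypothetical frame with one numeric column and two boolean indicator columns.
demo = pd.DataFrame(
    {
        "engine_size": [2.0, 3.5],
        "fuel_premium": [True, False],
        "class_truck": [False, True],
    }
)

# Find the boolean columns and cast only those to integers.
bool_cols = demo.select_dtypes(include="bool").columns
demo = demo.astype({col: int for col in bool_cols})
print(demo.dtypes)
```

Restricting the cast to boolean columns makes the intent explicit and avoids scanning the numeric columns for values to replace.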

The new data looks as follows:

In [51]:
emissions

,engine_size,cyl,fuel_consump_city,fuel_consump_hwy,fuel_consump_comb,co2_emissions,fuel_diesel,fuel_ethanol,fuel_natgas,fuel_regular,fuel_premium,class_SUV_minivan,class_compact,class_other,class_sedan_or_similar,class_truck
0,2.0,4,9.9,6.7,8.5,196,0,0,0,0,1,0,1,0,0,0
1,2.4,4,11.2,7.7,9.6,221,0,0,0,0,1,0,1,0,0,0
2,1.5,4,6.0,5.8,5.9,136,0,0,0,0,1,0,1,0,0,0
3,3.5,6,12.7,9.1,11.1,255,0,0,0,0,1,1,0,0,0,0
4,3.5,6,12.1,8.7,10.6,244,0,0,0,0,1,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7380,2.0,4,10.7,7.7,9.4,219,0,0,0,0,1,1,0,0,0,0
7381,2.0,4,11.2,8.3,9.9,232,0,0,0,0,1,1,0,0,0,0
7382,2.0,4,11.7,8.6,10.3,240,0,0,0,0,1,1,0,0,0,0
7383,2.0,4,11.2,8.3,9.9,232,0,0,0,0,1,1,0,0,0,0


In [53]:
emissions.to_csv("emissions_cleaned.csv", index=False)
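Before using the saved file, it can be worth confirming that it reloads with only numeric columns. The sketch below shows one way to check this using an in-memory buffer in place of a file on disk; `all_numeric` is a hypothetical helper, and the two `demo` rows reuse a few values from the table above.

```python
import io

import pandas as pd
from pandas.api.types import is_numeric_dtype


def all_numeric(df):
    """Return True if every column has a numeric dtype."""
    return all(is_numeric_dtype(dtype) for dtype in df.dtypes)


# Small stand-in for the cleaned data.
demo = pd.DataFrame(
    {
        "engine_size": [2.0, 3.5],
        "cyl": [4, 6],
        "co2_emissions": [196, 255],
        "fuel_premium": [1, 1],
    }
)

# Write without the index, as in the cell above, then read the result back.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(all_numeric(reloaded))
```

With the real file, the equivalent check would be `all_numeric(pd.read_csv("emissions_cleaned.csv"))`.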