### Categorical Data

When your data contains categories represented by strings, it is difficult to use them to train machine learning models, which often accept only numeric data.

Instead of ignoring the categorical data and excluding that information from the model, you can transform the data so that it can be used in your models.

Take a look at the table below. It is the same data set that we used in the multiple regression chapter.

In [1]:
import pandas as pd # The pandas module allows us to read csv files and manipulate DataFrame objects.

cars = pd.read_csv('data.csv')

print(cars.to_string())

           Car       Model  Volume  Weight  CO2
0       Toyota        Aygo    1000     790   99
1   Mitsubishi  Space Star    1200    1160   95
2        Skoda      Citigo    1000     929   95
3         Fiat         500     900     865   90
4         Mini      Cooper    1500    1140  105
5           VW         Up!    1000     929  105
6        Skoda       Fabia    1400    1109   90
7     Mercedes     A-Class    1500    1365   92
8         Ford      Fiesta    1500    1112   98
9         Audi          A1    1600    1150   99
10     Hyundai         I20    1100     980   99
11      Suzuki       Swift    1300     990  101
12        Ford      Fiesta    1000    1112   99
13       Honda       Civic    1600    1252   94
14      Hundai         I30    1600    1326   97
15        Opel       Astra    1600    1330   97
16         BMW           1    1600    1365   99
17       Mazda           3    2200    1280  104
18       Skoda       Rapid    1600    1119  104
19        Ford       Focus    2000    13

In the multiple regression chapter, we tried to predict the CO2 emitted based on the volume of the engine and the weight of the car, but we excluded information about the car brand and model.

The information about the car brand or the car model might help us make a better prediction of the CO2 emitted.

### One Hot Encoding

We cannot make use of the Car or Model column in our data since they are not numeric. A linear relationship between a categorical variable, Car or Model, and a numeric variable, CO2, cannot be determined.

To fix this issue, we must have a numeric representation of the categorical variable. One way to do this is to have a column representing each group in the category.

For each column, the values will be 1 or 0 where 1 represents the inclusion of the group and 0 represents the exclusion. This transformation is called one hot encoding.
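
To make the transformation concrete, here is a minimal sketch that builds the indicator columns by hand. The three-row `sample` table and the name `manual` are hypothetical and used only for illustration; they are not part of the data set above.

```python
import pandas as pd

# Hypothetical three-row sample, not the full data set used in this chapter.
sample = pd.DataFrame({'Car': ['Toyota', 'Skoda', 'Toyota']})

# Build one indicator column per distinct brand by hand:
# 1 where the row belongs to that brand, 0 otherwise.
manual = pd.DataFrame(index=sample.index)
for brand in sorted(sample['Car'].unique()):
    manual['Car_' + brand] = (sample['Car'] == brand).astype(int)

print(manual)
```

Each row has exactly one 1, in the column of the brand it belongs to.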

You do not have to do this manually. The pandas module has a function called get_dummies() that does one hot encoding.

In [2]:
# The following two lines widen the cells of this Jupyter Notebook in the browser so that the output of the next cell fits.

from IPython.display import display, HTML

display(HTML("<style>.container { width:140% !important; }</style>"))

In [3]:
ohe_cars = pd.get_dummies(cars[['Car']]) # Convert categorical variable into dummy/indicator variables.
# For additional information about get_dummies function, see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html

print(ohe_cars.to_string())

    Car_Audi  Car_BMW  Car_Fiat  Car_Ford  Car_Honda  Car_Hundai  Car_Hyundai  Car_Mazda  Car_Mercedes  Car_Mini  Car_Mitsubishi  Car_Opel  Car_Skoda  Car_Suzuki  Car_Toyota  Car_VW  Car_Volvo
0          0        0         0         0          0           0            0          0             0         0               0         0          0           0           1       0          0
1          0        0         0         0          0           0            0          0             0         0               1         0          0           0           0       0          0
2          0        0         0         0          0           0            0          0             0         0               0         0          1           0           0       0          0
3          0        0         1         0          0           0            0          0             0         0               0         0          0           0           0       0          0
4          0        0         0    

As a result, a column was created for every car brand in the Car column.
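
Depending on the installed pandas version, the indicator columns may print as True/False instead of 1/0, because pandas 2.0 changed the default dtype of get_dummies() to bool. The sketch below, which assumes a hypothetical two-row `sample` table, requests integers explicitly through the `dtype` parameter.

```python
import pandas as pd

# Hypothetical two-row sample used only to show the dtype parameter.
sample = pd.DataFrame({'Car': ['Audi', 'BMW']})

# dtype=int forces 0/1 columns regardless of the pandas default.
ohe_sample = pd.get_dummies(sample[['Car']], dtype=int)

print(ohe_sample)
```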

### Predict CO2

We can use this additional information alongside the volume and weight to predict CO2.

To combine the information, we can use the concat() function from the pandas module, which concatenates pandas objects along a particular axis.

Then we select the independent variables (X) and add the dummy variables column-wise.

We also store the dependent variable in y.

In [9]:
X = pd.concat([cars[['Volume', 'Weight']], ohe_cars], axis=1)

y = cars['CO2']

# We also need to import the linear_model module from sklearn to create a linear model.

from sklearn import linear_model

# Now we can fit the data to a linear regression.

regr = linear_model.LinearRegression()

regr.fit(X.values,y)

# Finally we can predict the CO2 emissions based on the car's volume, weight and manufacturer.
# The feature order is Volume, Weight, then the brand columns, so the row below
# describes a Volvo with a volume of 2300 cm3 and a weight of 1300 kg:

predictedCO2 = regr.predict([[2300, 1300,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1]])

print(predictedCO2)

[110.06989491]
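
Typing the row of zeros by hand is easy to get wrong. As a hedged alternative, the sketch below builds the input row by column name so that its order always matches the training columns. The `sample` table, `X_sample` and `new_car` are hypothetical stand-ins with made-up values, not the data set above.

```python
import pandas as pd

# Hypothetical stand-in for the feature table X (made-up values).
sample = pd.DataFrame({
    'Volume': [1000, 1600, 2000],
    'Weight': [790, 1150, 1328],
    'Car': ['Toyota', 'Audi', 'Volvo'],
})
X_sample = pd.concat(
    [sample[['Volume', 'Weight']], pd.get_dummies(sample[['Car']], dtype=int)],
    axis=1,
)

# Start from an all-zero row with the training columns, then fill in by name.
new_car = pd.DataFrame(0, index=[0], columns=X_sample.columns)
new_car.loc[0, ['Volume', 'Weight', 'Car_Volvo']] = [2300, 1300, 1]

print(new_car.to_string())
```

With a fitted model, `new_car.values` could then be passed to predict() in place of a hand-written list.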


In [5]:
# We now have a coefficient for the volume, the weight, and each car brand in the data set.

# print(regr.coef_)
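
A bare array of coefficients is hard to read, so it can help to label each value with its column name. This is a minimal sketch under the assumption of a hypothetical four-row `sample` table with made-up values; `model` and `coefs` are illustrative names.

```python
import pandas as pd
from sklearn import linear_model

# Hypothetical four-row sample (made-up values), not the chapter's data set.
sample = pd.DataFrame({
    'Volume': [1000, 1600, 2000, 1200],
    'Weight': [790, 1150, 1328, 1160],
    'Car': ['Toyota', 'Audi', 'Volvo', 'Toyota'],
    'CO2': [99, 99, 94, 95],
})
X_sample = pd.concat(
    [sample[['Volume', 'Weight']], pd.get_dummies(sample[['Car']], dtype=int)],
    axis=1,
)

model = linear_model.LinearRegression()
model.fit(X_sample.values, sample['CO2'])

# Pair every coefficient with the feature it belongs to.
coefs = pd.Series(model.coef_, index=X_sample.columns)
print(coefs)
```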

### Dummifying

It is not necessary to create one column for each group in your category. The information can be retained using one column fewer than the number of groups you have.

For example, suppose you have a column representing colors, and in that column you have two colors, red and blue.

In [6]:
colors = pd.DataFrame({'color': ['blue', 'red']}) # Constructing DataFrame from a dictionary. For additional information, see the example at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

print(colors)

  color
0  blue
1   red


You can create 1 column called red where 1 represents red and 0 represents not red, which means it is blue.

To do this, we can use the same function that we used for one hot encoding, get_dummies, and then drop one of the columns. There is an argument, drop_first, which allows us to exclude the first column from the resulting table.

In [7]:
dummies = pd.get_dummies(colors, drop_first=True) # Assign False to drop_first parameter and see the result.

print(dummies)

   color_red
0          0
1          1


What if we have more than 2 groups? How can multiple groups be represented by one column fewer?

Let's say we have three colors this time: red, blue and green. When we call get_dummies() while dropping the first column, we get the following table.

In [8]:
colors_three = pd.DataFrame({'color': ['blue', 'red', 'green']})

print(colors_three, "\n")

dummies = pd.get_dummies(colors_three, drop_first=True)

print(dummies, "\n")

# Add the original color column next to the dummy columns for comparison.
# For additional information about adding a new column to an existing DataFrame in pandas, see
# https://www.geeksforgeeks.org/adding-new-column-to-existing-dataframe-in-pandas/
dummies['color'] = colors_three['color']

print(dummies)

   color
0   blue
1    red
2  green 

   color_green  color_red
0            0          0
1            0          1
2            1          0 

   color_green  color_red  color
0            0          0   blue
1            0          1    red
2            1          0  green
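
The dropped group is still recoverable: a row belongs to it exactly when every remaining indicator is 0. The short sketch below checks this on the three-color example; `is_blue` is an illustrative name, and `dtype=int` is passed so the result is numeric regardless of the pandas version.

```python
import pandas as pd

colors_three = pd.DataFrame({'color': ['blue', 'red', 'green']})
dummies = pd.get_dummies(colors_three, drop_first=True, dtype=int)

# blue was dropped; it is implied by color_green == 0 and color_red == 0.
is_blue = 1 - dummies.sum(axis=1)

print(is_blue.tolist())  # [1, 0, 0]
```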

