## Categorical Variables

So far, we’ve assumed that our data comes in as a two-dimensional array of floating-point numbers, where each column is a continuous feature that describes the data points. For many applications, this is not how the data is collected. A particularly common type of feature is the *categorical feature*. Also known as *discrete features*, these are usually not numeric.

The distinction between categorical features and continuous features is analogous to the distinction between classification and regression, only on the input side rather than the output side. Examples of continuous features that we have seen are pixel brightnesses and size measurements of plant flowers. Examples of categorical features are the brand of a product, the color of a product, or the department (books, clothing, hardware) it is sold in. These are all properties that can describe a product, but they don’t vary in a continuous way. A product belongs either in the clothing department or in the books department. There is no middle ground between books and clothing, and no natural order for the different categories (books is not greater or less than clothing, hardware is not between books and clothing, etc.).

As an example, we will use the dataset of adult incomes in the United States, derived from the 1994 census database. The task for the adult dataset is to predict whether a worker has an income of over $50,000 or not. The features in this dataset include the workers’ ages, how they are employed (self-employed, private industry employee, government employee, etc.), their education, their gender, their working hours per week, their occupation, and more.

The following table shows the first few entries in the dataset.

%%HTML

<p>The first few entries in the adult dataset:</p>

<table>
  <tr>
    <th>&nbsp;</th>
    <th>age</th>
    <th>workclass</th>
    <th>education</th>
    <th>gender</th>
    <th>hours-per-week</th>
    <th>occupation</th>
    <th>income</th>
  </tr>
  
  <tr>
    <td>0</td>
    <td>39</td>
    <td>State-gov</td>
    <td>Bachelors</td>
    <td>Male</td>
    <td>40</td>
    <td>Adm-clerical</td>
    <td>&lt;=50K</td>
  </tr>
  
  <tr>
    <td>1</td>
    <td>50</td>   
    <td>Self-emp-not-inc</td>
    <td>Bachelors</td>
    <td>Male</td>
    <td>13</td>
    <td>Exec-managerial</td>
    <td>&lt;=50K</td>
  </tr>

  <tr>
    <td>2</td>
    <td>38</td>
    <td>Private</td>
    <td>HS-grad</td>
    <td>Male</td>
    <td>40</td>
    <td>Handlers-cleaners</td>
    <td>&lt;=50K</td>
  </tr>
  
  <tr>
    <td>3</td>
    <td>53</td>
    <td>Private</td>
    <td>11th</td>
    <td>Male</td>
    <td>40</td>            
    <td>Handlers-cleaners</td>
    <td>&lt;=50K</td>
  </tr>
  
  <tr>
    <td>4</td>
    <td>28</td>
    <td>Private</td>
    <td>Bachelors</td>
    <td>Female</td>
    <td>40</td>
    <td>Prof-specialty</td>
    <td>&lt;=50K</td>
  </tr>
  
  <tr>
    <td>5</td>
    <td>37</td>
    <td>Private</td>
    <td>Masters</td>
    <td>Female</td>
    <td>40</td>
    <td>Exec-managerial</td>
    <td>&lt;=50K</td>
  </tr>

  <tr>
    <td>6</td>
    <td>49</td>
    <td>Private</td>
    <td>9th</td>
    <td>Female</td>
    <td>16</td>
    <td>Other-service</td>
    <td>&lt;=50K</td>
  </tr>

  <tr>
    <td>7</td>
    <td>52</td>
    <td>Self-emp-not-inc</td>
    <td>HS-grad</td>
    <td>Male</td>
    <td>45</td>
    <td>Exec-managerial</td>
    <td>&gt;50K</td>
  </tr>

  <tr>
    <td>8</td>  
    <td>31</td>   
    <td>Private</td>          
    <td>Masters</td>     
    <td>Female</td>
    <td>50</td>
    <td>Prof-specialty</td>
    <td>&gt;50K</td>
  </tr>
  
  <tr>
    <td>9</td>  
    <td>42</td>   
    <td>Private</td>          
    <td>Bachelors</td>   
    <td>Male</td>     
    <td>40</td>            
    <td>Exec-managerial</td>    
    <td>&gt;50K</td>
  </tr>
  
  <tr>
    <td>10</td> 
    <td>37</td>   
    <td>Private</td>          
    <td>Some-college</td>
    <td>Male</td>    
    <td>80</td>            
    <td>Exec-managerial</td>    
    <td>&gt;50K</td>
  </tr>  
</table>

The task is phrased as a classification task with the two classes being income *<=50K* and *>50K*. It would also be possible to predict the exact income and make this a regression task. However, that would be much more difficult, and the 50K division is interesting to understand on its own.

In this dataset, *age* and *hours-per-week* are continuous features, which we know how to treat. The *workclass*, *education*, *gender*, and *occupation* features are categorical, however. All of them come from a fixed list of possible values, as opposed to a range, and denote a qualitative property, as opposed to a quantity.

As a starting point, let’s say we want to learn a logistic regression classifier on this data. We know from Chapter 2 that a logistic regression makes predictions, ŷ, using the following formula:

    ŷ = w[0] * x[0] + w[1] * x[1] + ... + w[p] * x[p] + b > 0

where w[i] and b are coefficients learned from the training set and x[i] are the input features. This formula makes sense when x[i] are numbers, but not when x[2] is "Masters" or "Bachelors". Clearly we need to represent our data in some different way when applying logistic regression. The next section will explain how we can overcome this problem.
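To make the problem concrete, here is a minimal sketch of the decision rule above in plain Python. The weights and intercept are made-up numbers chosen for illustration (they are not learned from the adult dataset), and `predict` is a hypothetical helper name.

```python
# Hypothetical coefficients and intercept, chosen by hand for illustration
w = [0.03, 0.05]
b = -3.0


def predict(x):
    """Evaluate w[0] * x[0] + w[1] * x[1] + b > 0 for one data point."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b > 0


# Numeric features (age, hours-per-week) work fine
print("Numeric input:", predict([39, 40]))

# A string-valued feature cannot be multiplied by a float coefficient
try:
    predict([39, "Masters"])
except TypeError as err:
    print("String input fails:", err)
```

The second call fails because multiplying a floating-point coefficient by a string is undefined, which is exactly why the categorical values need a numeric representation first.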

### One-Hot-Encoding (Dummy Variables)

By far the most common way to represent categorical variables is using the *one-hot-encoding* or *one-out-of-N encoding*, also known as *dummy variables*. The idea behind dummy variables is to replace a categorical variable with one or more new features that can have the values 0 and 1. The values 0 and 1 make sense in the formula for linear binary classification (and for all other models in scikit-learn), and we can represent any number of categories by introducing one new feature per category, as described here.

Let’s say for the workclass feature we have possible values of "Government Employee", "Private Employee", "Self Employed", and "Self Employed Incorporated". To encode these four possible values, we create four new features, called "Government Employee", "Private Employee", "Self Employed", and "Self Employed Incorporated". A feature is 1 if workclass for this person has the corresponding value and 0 otherwise, so exactly one of the four new features will be 1 for each data point. This is why this is called one-hot or one-out-of-N encoding.

The principle is illustrated in the following table. A single feature is encoded using four new features. When using this data in a machine learning algorithm, we would drop the original *workclass* feature and only keep the 0–1 features.

%%HTML

<p>Encoding the workclass feature using one-hot encoding:</p>

<table>
  <tr>
    <th>&nbsp;</th>
    <th>Government Employee</th>
    <th>Private Employee</th>
    <th>Self Employed</th>  
    <th>Self Employed Incorporated</th>
  </tr>
    
  <tr>
    <td>Government Employee</td>
    <td>1</td>
    <td>0</td>
    <td>0</td>
    <td>0</td>
  </tr>
  
  <tr>
    <td>Private Employee</td>
    <td>0</td>
    <td>1</td>
    <td>0</td>
    <td>0</td>
  </tr>
  
  <tr>
    <td>Self Employed</td>
    <td>0</td>
    <td>0</td>
    <td>1</td>
    <td>0</td>
  </tr>
  
  <tr>
    <td>Self Employed Incorporated</td>
    <td>0</td>
    <td>0</td>
    <td>0</td>
    <td>1</td>
  </tr>
</table>
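The table can be reproduced by hand in a few lines of Python. This is only a sketch of the idea, not how pandas or scikit-learn implement it, and the four sample values in `workclass` are invented for illustration.

```python
# The four possible categories from the table above
categories = ["Government Employee", "Private Employee",
              "Self Employed", "Self Employed Incorporated"]

# Hypothetical workclass values for four people
workclass = ["Private Employee", "Self Employed",
             "Government Employee", "Private Employee"]

# One new 0/1 feature per category: 1 where the value matches, 0 otherwise
one_hot = [[int(value == category) for category in categories]
           for value in workclass]

for value, row in zip(workclass, one_hot):
    print(f"{value:<20} {row}")
```

Each row contains exactly one 1, which is the "one-hot" property described above.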

Note

The one-hot encoding we use is quite similar, but not identical, to the dummy coding used in statistics. For simplicity, we encode each category with a different binary feature. In statistics, it is common to encode a categorical feature with k different possible values into k–1 features (the last one is represented as all zeros). This is done to simplify the analysis (more technically, this will avoid making the data matrix rank-deficient).
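The rank argument in this note can be checked numerically. The sketch below uses a made-up column with three categories. It relies on the `drop_first` parameter of `pd.get_dummies`, which drops the first category rather than the last (the same idea with a different reference category). The rank deficiency appears once a constant intercept column is included, because the k indicator columns always sum to that column. `with_intercept` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Made-up categorical column with k = 3 possible values
workclass = pd.Series(["gov", "private", "self", "private", "gov"],
                      name="workclass")

# k indicator columns (one-hot) versus k-1 columns (dummy coding)
full = pd.get_dummies(workclass, dtype=int)
reduced = pd.get_dummies(workclass, drop_first=True, dtype=int)


def with_intercept(df):
    """Prepend a column of ones, as a linear model with an intercept would."""
    return np.column_stack([np.ones(len(df)), df.values])


for name, encoded in [("one-hot", full), ("k-1 dummies", reduced)]:
    matrix = with_intercept(encoded)
    print(name, "columns:", matrix.shape[1],
          "rank:", np.linalg.matrix_rank(matrix))
```

With the full encoding the matrix has four columns but only rank three; with k-1 columns the number of columns and the rank agree.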

There are two ways to convert your data to a one-hot encoding of categorical variables, using either pandas or scikit-learn. Let’s see how we can do it using pandas. We start by loading the data using pandas from a comma-separated values (CSV) file:

In [2]:
# Standard imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy as sp
import sklearn
from IPython.display import display
import mglearn

# Suppress all warnings (including deprecation warnings) to keep the output clean
import warnings
warnings.filterwarnings('ignore')

In [3]:
import os

# The file has no headers naming the columns, so we pass header=None
# and provide the column names explicitly in "names"
adult_path = os.path.join(mglearn.datasets.DATA_PATH, "adult.data")
data = pd.read_csv(
    adult_path, header=None, index_col=False,
    names=['age', 'workclass', 'fnlwgt', 'education',  'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'gender',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
           'income'])

# For illustration purposes, we only select some of the columns
data = data[['age', 'workclass', 'education', 'gender', 'hours-per-week',
             'occupation', 'income']]

# IPython.display allows nice output formatting within the Jupyter notebook
display(data.head())

   age         workclass  education  gender  hours-per-week         occupation  income
0   39         State-gov  Bachelors    Male              40       Adm-clerical   <=50K
1   50  Self-emp-not-inc  Bachelors    Male              13    Exec-managerial   <=50K
2   38           Private    HS-grad    Male              40  Handlers-cleaners   <=50K
3   53           Private       11th    Male              40  Handlers-cleaners   <=50K
4   28           Private  Bachelors  Female              40     Prof-specialty   <=50K


### Checking String-Encoded Categorical Data

After reading a dataset like this, it is often good to first check if a column actually contains meaningful categorical data. When working with data that was input by humans (say, users on a website), there might not be a fixed set of categories, and differences in spelling and capitalization might require preprocessing. 

For example, it might be that some people specified gender as “male” and some as “man,” and we might want to represent these two inputs using the same category. A good way to check the contents of a column is using the value_counts method of a pandas Series (the type of a single column in a DataFrame), to show us what the unique values are and how often they appear:

In [4]:
print(data.gender.value_counts())

 Male      21790
 Female    10771
Name: gender, dtype: int64


We can see that there are exactly two values for gender in this dataset, Male and Female, meaning the data is already in a good format to be represented using one-hot-encoding. In a real application, you should look at all columns and check their values. We will skip this here for brevity’s sake.
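For data that does need cleaning, a sketch along the following lines may help. The `raw_gender` Series is invented to mimic free-text user input, and the mapping from "man" and "woman" to the two categories is an assumption made for this illustration.

```python
import pandas as pd

# Hypothetical free-text answers with inconsistent spelling and whitespace
raw_gender = pd.Series([" Male", "male", "Man", "female ", "Female", "woman"])

# Normalize whitespace and capitalization first
cleaned = raw_gender.str.strip().str.lower()

# Then merge synonyms into a single category each
cleaned = cleaned.replace({"man": "male", "woman": "female"})

print(cleaned.value_counts())
```

Running `value_counts` again after cleaning is a quick way to confirm that only the intended categories remain. Stripping whitespace may also be worth doing on the adult data, since the values printed above appear to carry a leading space.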

There is a very simple way to encode the data in pandas, using the *get_dummies* function. The *get_dummies* function automatically transforms all columns that have object type (like strings) or are categorical (which is a special pandas concept that we haven’t talked about yet):

In [5]:
print("Original features:\n", list(data.columns), "\n")
data_dummies = pd.get_dummies(data)
print("Features after get_dummies:\n", list(data_dummies.columns))

Original features:
 ['age', 'workclass', 'education', 'gender', 'hours-per-week', 'occupation', 'income'] 

Features after get_dummies:
 ['age', 'hours-per-week', 'workclass_ ?', 'workclass_ Federal-gov', 'workclass_ Local-gov', 'workclass_ Never-worked', 'workclass_ Private', 'workclass_ Self-emp-inc', 'workclass_ Self-emp-not-inc', 'workclass_ State-gov', 'workclass_ Without-pay', 'education_ 10th', 'education_ 11th', 'education_ 12th', 'education_ 1st-4th', 'education_ 5th-6th', 'education_ 7th-8th', 'education_ 9th', 'education_ Assoc-acdm', 'education_ Assoc-voc', 'education_ Bachelors', 'education_ Doctorate', 'education_ HS-grad', 'education_ Masters', 'education_ Preschool', 'education_ Prof-school', 'education_ Some-college', 'gender_ Female', 'gender_ Male', 'occupation_ ?', 'occupation_ Adm-clerical', 'occupation_ Armed-Forces', 'occupation_ Craft-repair', 'occupation_ Exec-managerial', 'occupation_ Farming-fishing', 'occupation_ Handlers-cleaners', 'occupation_ Machine-op-i

You can see that the continuous features age and hours-per-week were not touched, while the categorical features were expanded into one new feature for each possible value:

In [6]:
data_dummies.head()

Unnamed: 0,age,hours-per-week,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,workclass_ State-gov,...,occupation_ Machine-op-inspct,occupation_ Other-service,occupation_ Priv-house-serv,occupation_ Prof-specialty,occupation_ Protective-serv,occupation_ Sales,occupation_ Tech-support,occupation_ Transport-moving,income_ <=50K,income_ >50K
0,39,40,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
1,50,13,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1,0
2,38,40,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,53,40,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,28,40,0,0,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,1,0


We can now use the values attribute to convert the *data_dummies* DataFrame into a NumPy array, and then train a machine learning model on it. Be careful to separate the target variable (which is now encoded in two income columns) from the data before training a model. Including the output variable, or some derived property of the output variable, in the feature representation is a very common mistake in building supervised machine learning models.

In this case, we extract only the columns containing features—that is, all columns from age to occupation_ Transport-moving. This range contains all the features but not the target:

In [7]:
features = data_dummies.loc[:, 'age': 'occupation_ Transport-moving']

# Extract NumPy arrays
X = features.values
y = data_dummies['income_ >50K'].values
print("X.shape: {} y.shape: {}".format(X.shape, y.shape))

X.shape: (32561, 44) y.shape: (32561,)
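An alternative that avoids slicing by column name is to set the target aside before encoding, so that no income column can end up among the features. The following is a sketch on a small made-up frame (`toy`, `X_toy`, and `y_toy` are hypothetical names), not a replacement for the cell above.

```python
import pandas as pd

# A tiny frame with the same kind of columns as the adult data
toy = pd.DataFrame({
    "age": [39, 50, 38, 52],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private",
                  "Self-emp-not-inc"],
    "income": ["<=50K", "<=50K", "<=50K", ">50K"],
})

# Encode only the feature columns; the target never passes through get_dummies
X_toy = pd.get_dummies(toy.drop(columns="income"), dtype=int)

# Binary target: 1 for income >50K, 0 otherwise
y_toy = (toy["income"] == ">50K").astype(int)

# Sanity check: no column derived from the target is among the features
leaked = [col for col in X_toy.columns if col.startswith("income")]
print("Feature columns:", list(X_toy.columns))
print("Leaked target columns:", leaked)
```

A check like the last one is cheap and catches the leakage mistake described above before a model is trained.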


Now the data is represented in a way that scikit-learn can work with, and we can proceed as usual:

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

print("Test score: {:.2f}".format(logreg.score(X_test, y_test)))

Test score: 0.81
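The scikit-learn route mentioned earlier is not shown in this section. As a rough sketch, `OneHotEncoder` from `sklearn.preprocessing` can produce the same kind of 0/1 features. The tiny `workclass` array is made up for illustration, and `toarray()` is called because the encoder returns a SciPy sparse matrix by default.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up single-column input; scikit-learn expects a 2D array
workclass = np.array([["Private"], ["State-gov"], ["Private"],
                      ["Self-emp-inc"]])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(workclass).toarray()

# categories_ holds the values found during fit, one array per input column
print("Categories:", list(encoder.categories_[0]))
print(encoded)
```

Unlike `get_dummies`, the fitted encoder remembers its categories, so the same transformation can later be applied to new data with `encoder.transform`.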


### Numbers Can Encode Categoricals

In the example of the adult dataset, the categorical variables were encoded as strings. On the one hand, that opens up the possibility of spelling errors, but on the other hand, it clearly marks a variable as categorical. Often, whether for ease of storage or because of the way the data is collected, categorical variables are encoded as integers. 

For example, imagine the census data in the adult dataset was collected using a questionnaire, and the answers for workclass were recorded as 0 (first box ticked), 1 (second box ticked), 2 (third box ticked), and so on. Now the column will contain numbers from 0 to 8, instead of strings like "Private", and it won’t be immediately obvious to someone looking at the table representing the dataset whether they should treat this variable as continuous or categorical. Knowing that the numbers indicate employment status, however, it is clear that these are very distinct states and should not be modeled by a single continuous variable.

Categorical features are often encoded using integers. That they are numbers doesn’t mean that they should necessarily be treated as continuous features. It is not always clear whether an integer feature should be treated as continuous or discrete (and one-hot-encoded). If the encoded values have no inherent ordering (as in the workclass example), the feature must be treated as discrete. For other cases, like five-star ratings, the better encoding depends on the particular task and data and on which machine learning algorithm is used.

The *get_dummies* function in pandas treats all numbers as continuous and will not create dummy variables for them. To illustrate, let’s create a DataFrame object with two columns corresponding to two different categorical features, one represented as a string and one as an integer.

In [9]:
# Create a DataFrame with an integer feature and a categorical string feature
demo_df = pd.DataFrame({'Integer Feature': [0, 1, 2, 1],
                        'Categorical Feature': ['socks', 'fox', 'socks', 'box']})
display(demo_df)

   Integer Feature  Categorical Feature
0                0                socks
1                1                  fox
2                2                socks
3                1                  box


Using get_dummies will only encode the string feature and will not change the integer feature, as you can see.

In [10]:
display(pd.get_dummies(demo_df))

   Integer Feature  Categorical Feature_box  Categorical Feature_fox  Categorical Feature_socks
0                0                        0                        0                          1
1                1                        0                        1                          0
2                2                        0                        0                          1
3                1                        1                        0                          0


If you want dummy variables to be created for the “Integer Feature” column, you can explicitly list the columns you want to encode using the *columns* parameter. The following cell additionally converts the column to strings with astype(str). Then, both features will be treated as categorical:

In [11]:
demo_df['Integer Feature'] = demo_df['Integer Feature'].astype(str)
display(pd.get_dummies(demo_df, columns=['Integer Feature', 'Categorical Feature']))

   Integer Feature_0  Integer Feature_1  Integer Feature_2  Categorical Feature_box  Categorical Feature_fox  Categorical Feature_socks
0                  1                  0                  0                        0                        0                          1
1                  0                  1                  0                        0                        1                          0
2                  0                  0                  1                        0                        0                          1
3                  0                  1                  0                        1                        0                          0
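As far as we can tell, the string conversion and the *columns* parameter each achieve this on their own. The sketch below rebuilds the small demo frame under a new, hypothetical name (`demo`) and compares two variants: listing the columns explicitly while leaving the integers untouched, and marking the integer column with the pandas `category` dtype. The `dtype=int` argument only keeps the output as 0/1 integers across pandas versions.

```python
import pandas as pd

demo = pd.DataFrame({'Integer Feature': [0, 1, 2, 1],
                     'Categorical Feature': ['socks', 'fox', 'socks', 'box']})

# Variant 1: name the columns to encode; no string conversion needed
by_columns = pd.get_dummies(
    demo, columns=['Integer Feature', 'Categorical Feature'], dtype=int)

# Variant 2: declare the integer column categorical, then encode as usual
demo_cat = demo.copy()
demo_cat['Integer Feature'] = demo_cat['Integer Feature'].astype('category')
by_dtype = pd.get_dummies(demo_cat, dtype=int)

print(list(by_columns.columns))
print(by_columns.equals(by_dtype))
```

Both variants leave the original `demo` frame unchanged, which can be convenient if the integer values are still needed elsewhere.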
