# 3.6 Converting Categorical Variables to Quantitative Variables

We have seen how quantitative variables can be converted to categorical variables using the `cut` function. What about the other way around? Can categorical variables be converted to quantitative ones?

In [None]:
import numpy as np
import pandas as pd

data_dir = "https://dlsun.github.io/pods/data/"
df_titanic = pd.read_csv(data_dir + "titanic.csv")
df_titanic

Binary categorical variables (categorical variables with exactly two categories) can be converted into quantitative variables by coding one category as 1 and the other category as 0. (In fact, the **survived** column in the Titanic data set has already been coded this way.) The easiest way to do this is to create a boolean mask. For example, to convert **gender** to a quantitative variable **female**, which is 1 if the passenger was female and 0 otherwise, we can do the following:

In [None]:
df_titanic["female"] = 1 * (df_titanic["gender"] == "female")
df_titanic["female"]

Multiplying by 1 converts the `Series` of booleans to a `Series` of integers.
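As a minimal sketch of the same conversion, the boolean mask can also be cast with `.astype(int)`, which states the intent a little more explicitly than multiplying by 1. The toy `gender` values below are made up for illustration and are not taken from the Titanic data.

```python
import pandas as pd

# Toy data, made up for illustration (not the Titanic data set)
gender = pd.Series(["female", "male", "male", "female", "male"])

# The comparison produces a Series of booleans (the mask)
mask = gender == "female"

# Two equivalent ways to turn the mask into 0/1 integers
female_mult = 1 * mask          # the approach used in the text
female_cast = mask.astype(int)  # an explicit cast

print(female_cast.tolist())
print((female_mult == female_cast).all())
```

Both versions give the same 0/1 values, so the choice between them is a matter of readability.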

Now we can manipulate this new variable as we would any other quantitative variable. Because each female passenger contributes a 1 and every other passenger a 0, the sum tells us how many passengers were female, while the mean tells us the _proportion_ of passengers who were female.

In [None]:
df_titanic["female"].sum(), df_titanic["female"].mean()
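To see why the sum and mean of a 0/1 variable behave this way, the sketch below compares them with the counts and proportions that `value_counts` reports on the original categorical variable. The data are a made-up toy example, not the Titanic data.

```python
import pandas as pd

# Toy data, made up for illustration
gender = pd.Series(["female", "male", "male", "female", "male"])
female = (gender == "female").astype(int)

count = female.sum()        # adds up the 1s: number of "female" rows
proportion = female.mean()  # count divided by the number of rows

# Cross-check against the categorical variable itself
counts = gender.value_counts()
props = gender.value_counts(normalize=True)

print(count, counts["female"])
print(proportion, props["female"])
```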

What do we do about a categorical variable with more than two categories, like `embarked`, which has four categories? In general, a categorical variable with $K$ categories can be converted into $K$ separate 0/1 variables, or **dummy variables**. Each of the $K$ dummy variables is an indicator for one of the $K$ categories. That is, a dummy variable is 1 if the observation fell into its particular category and 0 otherwise.

Although it is not difficult to create dummy variables manually, the easiest way to create them is the `get_dummies()` function in `pandas`.
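For comparison, here is one way the manual construction might look: build one indicator column per category with a boolean mask, then check the result against `get_dummies`. The toy `embarked` values are made up for illustration, and the dictionary comprehension is just one possible approach.

```python
import pandas as pd

# Toy data, made up for illustration
embarked = pd.Series(["S", "C", "Q", "S", "B"], name="embarked")

# One 0/1 indicator column per category, built from boolean masks
dummies_manual = pd.DataFrame({
    category: (embarked == category).astype(int)
    for category in sorted(embarked.unique())
})

# The same encoding from pandas (dtype=int requests 0/1 integers)
dummies_pandas = pd.get_dummies(embarked, dtype=int)

print(dummies_manual)
print((dummies_manual.values == dummies_pandas.values).all())
```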

In [None]:
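pd.get_dummies(df_titanic["embarked"], dtype=int)

Since every observation is in exactly one category, each row contains exactly one 1; the rest of the values in the row are 0s. (Passing `dtype=int` guarantees 0/1 output; in `pandas` 2.0 and later, `get_dummies` returns `True`/`False` by default.)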
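The "exactly one 1 per row" property can be checked by summing across the columns. It relies on every observation having a recorded category: the sketch below, on made-up toy data, suggests that a missing value produces a row of all 0s unless `dummy_na=True` is passed, in which case missing values get an indicator column of their own.

```python
import pandas as pd

# Toy data, made up for illustration; the third value is missing
embarked = pd.Series(["S", "C", None, "Q", "S"])

# Default behavior: a missing value falls into no category
dummies = pd.get_dummies(embarked, dtype=int)
row_sums = dummies.sum(axis=1)

# dummy_na=True adds an extra indicator column for missing values
dummies_na = pd.get_dummies(embarked, dummy_na=True, dtype=int)
row_sums_na = dummies_na.sum(axis=1)

print(row_sums.tolist())
print(row_sums_na.tolist())
```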

We can call `get_dummies` on a `DataFrame` to encode multiple categorical variables at once. `pandas` dummy-encodes only the columns it treats as categorical (those with `object`, `string`, or `category` dtype) and leaves numeric columns alone. A categorical variable that is stored in the `DataFrame` with a numeric type must therefore be cast explicitly to a non-numeric type, such as `str`, before calling `get_dummies`. `pandas` also prepends the variable name to the name of each dummy variable, which prevents collisions between column names in the final `DataFrame`.

In [None]:
# Drop name and ticketno, which are neither categorical nor quantitative,
# then dummy-encode the remaining categorical variables as 0/1 integers
df_titanic_quant = pd.get_dummies(
    df_titanic.drop(["name", "ticketno"], axis=1),
    dtype=int
)
df_titanic_quant

Notice that categorical variables, like `class`, were converted to dummy variables with names like `class_1st`, `class_2nd` and `class_3rd`, while quantitative variables, like `age`, were left alone.
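The point about numeric-coded categorical variables can be illustrated with a small sketch. The `DataFrame` below is hypothetical: its `pclass` column stores a category as integers, so `get_dummies` leaves it alone until it is cast to `str`.

```python
import pandas as pd

# Hypothetical toy data: pclass is a category stored as integers
df = pd.DataFrame({
    "pclass": [1, 3, 2, 3],
    "gender": ["female", "male", "male", "female"],
    "age": [29.0, 40.0, 35.0, 22.0],
})

# Without a cast, pclass is treated as quantitative and left alone
encoded_raw = pd.get_dummies(df, dtype=int)

# After casting to str, pclass is dummy-encoded with a "pclass_" prefix
df_cast = df.assign(pclass=df["pclass"].astype(str))
encoded = pd.get_dummies(df_cast, dtype=int)

print(sorted(encoded_raw.columns))
print(sorted(encoded.columns))
```

Using `assign` returns a modified copy, so the original `df` keeps its integer column.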

# Exercises

Exercises 1-3 ask you to work with the Ames housing data set (`https://dlsun.github.io/pods/data/AmesHousing.txt`).

1\. The **Neighborhood** variable in this data set is categorical. Convert it into $K$ quantitative variables. What is $K$ in this case?

How would you use the quantitative variables that you just created to calculate the distribution of houses across the neighborhoods?

2\. How would you use the quantitative variables that you just created, along with the **SalePrice** column, to calculate the average price of a home in each neighborhood?

3\. Suppose you convert both the **Neighborhood** and **Bldg Type** variables to quantitative variables. How many new quantitative variables will you have now? What is the value of the sum across each row of these new quantitative variables?