# Categorical Features in Regression Models

So far, we have fit linear regression models to data where all of the features are quantitative. But what if some or all of the features are categorical? In theory, the solution is simple: we transform the categorical variables into quantitative variables using dummy (i.e., one-hot) encoding. In practice, some care is needed to ensure that the categorical variables are transformed in a consistent way between the training and the test data.
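
To make the consistency issue concrete, here is a minimal sketch on made-up data (the tiny `train` and `test` frames are hypothetical, not part of the Ames data). Encoding each data set separately can produce different columns, whereas an encoder fit once on the training data produces the same column layout for both.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "Gilbert" appears in training but not in the test set.
train = pd.DataFrame({"Neighborhood": ["NAmes", "Gilbert", "NAmes", "Somerst"]})
test = pd.DataFrame({"Neighborhood": ["Somerst", "NAmes"]})

# Encoding each data set on its own yields a different number of dummy columns.
train_cols = list(pd.get_dummies(train).columns)
test_cols = list(pd.get_dummies(test).columns)
print(len(train_cols), len(test_cols))

# Fitting the encoder once, on the training data, fixes the column layout.
enc_toy = OneHotEncoder()
enc_toy.fit(train)
train_encoded = enc_toy.transform(train).toarray()
test_encoded = enc_toy.transform(test).toarray()
print(train_encoded.shape, test_encoded.shape)
```

Both encoded arrays have one column per category seen during fitting, so a model trained on the first can be applied to the second.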

We'll use the Ames housing data as an example.

In [1]:
import pandas as pd
df_housing = pd.read_csv("http://dlsun.github.io/pods/data/AmesHousing.txt", sep="\t")
df_housing.head()

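| | Order | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | ... | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | Sale Condition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 526301100 | 20 | RL | 141.0 | 31770 | Pave | | IR1 | Lvl | ... | 0 | | | | 0 | 5 | 2010 | WD | Normal | 215000 |
| 1 | 2 | 526350040 | 20 | RH | 80.0 | 11622 | Pave | | Reg | Lvl | ... | 0 | | MnPrv | | 0 | 6 | 2010 | WD | Normal | 105000 |
| 2 | 3 | 526351010 | 20 | RL | 81.0 | 14267 | Pave | | IR1 | Lvl | ... | 0 | | | Gar2 | 12500 | 6 | 2010 | WD | Normal | 172000 |
| 3 | 4 | 526353030 | 20 | RL | 93.0 | 11160 | Pave | | Reg | Lvl | ... | 0 | | | | 0 | 4 | 2010 | WD | Normal | 244000 |
| 4 | 5 | 527105010 | 60 | RL | 74.0 | 13830 | Pave | | IR1 | Lvl | ... | 0 | | MnPrv | | 0 | 3 | 2010 | WD | Normal | 189900 |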


## One Categorical Feature

Let's develop some intuition about the predictions that a regression model will make when there is a single categorical feature. First, suppose we train a linear regression model to predict house price from the neighborhood the house is in.

In [2]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

X = df_housing[["Neighborhood"]] # need 2D array for sklearn
y = df_housing["SalePrice"]

enc = OneHotEncoder()
X_dummies = enc.fit_transform(X)

model = LinearRegression()
model.fit(X_dummies, y)

A regression model with just a single categorical feature, **Neighborhood**, will predict the same price for all houses in the same neighborhood. What is that predicted value? We can obtain it by applying the `OneHotEncoder` to a list of the unique neighborhoods in the data set and passing this to `model.predict()`.

One way to obtain a list of the unique neighborhoods is from the encoder itself, which stores them in the attribute `.categories_`. We wrap this list in a one-column `DataFrame` so that it has the 2D shape that scikit-learn expects.

In [3]:
X_test = pd.Series(enc.categories_[0], name="Neighborhood").to_frame()
X_test

| | Neighborhood |
|---|---|
| 0 | Blmngtn |
| 1 | Blueste |
| 2 | BrDale |
| 3 | BrkSide |
| 4 | ClearCr |
| 5 | CollgCr |
| 6 | Crawfor |
| 7 | Edwards |
| 8 | Gilbert |
| 9 | Greens |


In [4]:
model.predict(enc.transform(X_test))

array([196661.67839574, 143589.99984048, 105608.33315723, 124756.24970827,
       208662.09071842, 201803.43534259, 207550.83467159, 130843.38035976,
       190646.57519339, 193531.24984184, 279999.99984623, 103752.90296697,
       136999.99984704,  95756.48630435, 162226.6312716 , 145097.35004572,
       140710.8693962 , 188406.90803463, 330319.12653848, 322018.26448675,
       123991.89372573, 135071.9373072 , 136751.15185901, 184070.18365919,
       229707.3233949 , 324229.19588196, 246599.54144314, 248314.58316341])

It is a bit hard to tell which prediction corresponds to which neighborhood. Let's put these numbers into a `Series`, indexed by the neighborhood.

In [5]:
pd.Series(
    model.predict(enc.transform(X_test)),
    index=X_test["Neighborhood"]
)

Neighborhood
Blmngtn    196661.678396
Blueste    143589.999840
BrDale     105608.333157
BrkSide    124756.249708
ClearCr    208662.090718
CollgCr    201803.435343
Crawfor    207550.834672
Edwards    130843.380360
Gilbert    190646.575193
Greens     193531.249842
GrnHill    279999.999846
IDOTRR     103752.902967
Landmrk    136999.999847
MeadowV     95756.486304
Mitchel    162226.631272
NAmes      145097.350046
NPkVill    140710.869396
NWAmes     188406.908035
NoRidge    330319.126538
NridgHt    322018.264487
OldTown    123991.893726
SWISU      135071.937307
Sawyer     136751.151859
SawyerW    184070.183659
Somerst    229707.323395
StoneBr    324229.195882
Timber     246599.541443
Veenker    248314.583163
dtype: float64

Could we have obtained these predictions some other way, without going through the trouble of fitting a linear regression model? Intuitively, if all we knew about a house was the neighborhood it was in, we would predict the average price of houses in that neighborhood.

In [6]:
df_housing.groupby("Neighborhood")["SalePrice"].mean()

Neighborhood
Blmngtn    196661.678571
Blueste    143590.000000
BrDale     105608.333333
BrkSide    124756.250000
ClearCr    208662.090909
CollgCr    201803.434457
Crawfor    207550.834951
Edwards    130843.381443
Gilbert    190646.575758
Greens     193531.250000
GrnHill    280000.000000
IDOTRR     103752.903226
Landmrk    137000.000000
MeadowV     95756.486486
Mitchel    162226.631579
NAmes      145097.349887
NPkVill    140710.869565
NWAmes     188406.908397
NoRidge    330319.126761
NridgHt    322018.265060
OldTown    123991.891213
SWISU      135071.937500
Sawyer     136751.152318
SawyerW    184070.184000
Somerst    229707.324176
StoneBr    324229.196078
Timber     246599.541667
Veenker    248314.583333
Name: SalePrice, dtype: float64

These numbers match the predictions from our linear regression model to within a cent; the tiny discrepancies are numerical error from the fitting procedure. In other words, linear regression simply predicts the average price in each neighborhood.
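
The agreement between regression predictions and group means can also be checked programmatically. The following is a minimal sketch on a made-up data set (the `toy` frame and its three neighborhoods "A", "B", "C" are hypothetical); it compares the two sets of numbers with `np.allclose`, which tolerates small floating-point differences.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with three neighborhoods.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "B", "C"],
    "SalePrice": [100.0, 120.0, 200.0, 220.0, 240.0, 300.0],
})

# One-hot encode the single categorical feature (dense array for simplicity).
enc_toy = OneHotEncoder()
X_toy = enc_toy.fit_transform(toy[["Neighborhood"]]).toarray()
model_toy = LinearRegression().fit(X_toy, toy["SalePrice"])

# Predict once for each distinct neighborhood.
levels = pd.DataFrame({"Neighborhood": enc_toy.categories_[0]})
predictions = pd.Series(
    model_toy.predict(enc_toy.transform(levels).toarray()),
    index=levels["Neighborhood"],
)

# Compare with the per-neighborhood mean price.
group_means = toy.groupby("Neighborhood")["SalePrice"].mean()
matches = np.allclose(predictions.values, group_means.values)
print(matches)
```

Both `predictions` and `group_means` are ordered alphabetically by neighborhood, so they can be compared position by position.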

## Mathematical interlude

To see mathematically why linear regression predicts the average price in each neighborhood, recall that linear regression minimizes the total squared distance between the observed price and the predicted price:

$$ \text{sum of } (\text{price} - \widehat{\text{price}})^2. $$

After we expand the **Neighborhood** column into 28 dummy variables (e.g., $I\{ \text{Blmngtn} \}$, $I\{ \text{Blueste} \}$, etc.), one for each neighborhood, we can write the predicted price in the linear regression model as

$$ \widehat{\text{price}} = c_1 I\{ \text{Blmngtn} \} + c_2 I\{ \text{Blueste} \} + \ldots + c_{28} I\{ \text{Veenker} \}. $$

(For simplicity, we have omitted the intercept term $b$.)

Now, consider a house in Bloomington Heights, for which $I\{ \text{Blmngtn} \} = 1$ and all of the other dummy variables $I\{ \text{Blueste} \} = \ldots = I\{ \text{Veenker} \} = 0$. Then, $\widehat{\text{price}}$ for a house in Bloomington Heights is $c_1$. Likewise, $\widehat{\text{price}}$ for a house in Bluestem is $c_2$. And so forth.

Now, we can reframe linear regression as learning the values $c_1, c_2, \ldots, c_{28}$ that minimize

$$ \text{sum of } (\text{price} - \widehat{\text{price}})^2 = \underbrace{\text{sum of } (\text{price} - c_1)^2}_{\text{over houses in Blmngtn}} + \underbrace{\text{sum of } (\text{price} - c_2)^2}_{\text{over houses in Blueste}} + \ldots + \underbrace{\text{sum of } (\text{price} - c_{28})^2}_{\text{over houses in Veenker}}. $$

It turns out that the value of $c$ that minimizes the $\text{sum of } (\text{price} - c)^2$ is the mean of the prices. So $\hat c_1$ will be the average price of houses in Bloomington Heights, $\hat c_2$ the average price of houses in Bluestem, and so on. Since $\hat c_1, \hat c_2, \ldots, \hat c_{28}$ are also the predicted values for each neighborhood, this shows that linear regression will predict the average label in each category when there is only one categorical variable in the model.
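
The claim that the mean minimizes the sum of squares follows from a short calculus argument. As a sketch, write $y_1, \ldots, y_n$ for the prices of the $n$ houses in one neighborhood (this notation is introduced here for convenience). Setting the derivative with respect to $c$ to zero gives

$$ \frac{d}{dc} \sum_{i=1}^n (y_i - c)^2 = -2 \sum_{i=1}^n (y_i - c) = 0 \quad \Longrightarrow \quad \hat c = \frac{1}{n} \sum_{i=1}^n y_i. $$

The second derivative is $2n > 0$, so this critical point is indeed a minimum.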

## Mixing Quantitative and Categorical Features

In general, we want to fit machine learning models that use a mix of both categorical and quantitative features. In this situation, we will want to apply the `OneHotEncoder` to only the categorical features. Scikit-learn provides a `ColumnTransformer` that allows us to selectively apply transformations to certain columns.

For example, suppose we want to fit a linear regression model to predict house price from quantitative features (square footage, number of bedrooms, number of full bathrooms) and categorical features (neighborhood, building type). We can use a `ColumnTransformer` to one-hot encode the categorical features and leave the quantitative features unchanged.

In [7]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

ct = make_column_transformer(
    (OneHotEncoder(), ["Neighborhood", "Bldg Type"]),
    remainder="passthrough"  # all other columns in X will be passed through unchanged
)
ct

We have to be careful to transform the training data and the test data in exactly the same way. Most machine learning workflows involve many more preprocessing steps, and as the preprocessing gets more complex, it becomes easy to accidentally omit one of them. For this reason, scikit-learn provides a `Pipeline` object, which simply chains together a sequence of preprocessing and model building steps. If we call `Pipeline.fit()` or `Pipeline.predict()` on the data, all of the steps will be applied to the data in a consistent manner.

Next, we integrate this `ColumnTransformer` into a pipeline with the `LinearRegression` model.

In [8]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression

pipeline = make_pipeline(
    ct,
    LinearRegression()
)

pipeline.fit(X=df_housing[["Gr Liv Area", "Bedroom AbvGr", "Full Bath",
                           "Neighborhood", "Bldg Type"]],
             y=df_housing["SalePrice"])

Now, if we wanted to use this model to predict the price of a 3BR/2BA, 1700 sqft single-family house in Bloomington Heights, we could create a `Series` with this information, and call `pipeline.predict()` on a 2D-array with this single row.

In [9]:
x_test = pd.Series(dtype=object)  # explicit dtype, since the values are mixed
x_test["Gr Liv Area"] = 1700
x_test["Bedroom AbvGr"] = 3
x_test["Full Bath"] = 2
x_test["Neighborhood"] = "Blmngtn"
x_test["Bldg Type"] = "1Fam"

pipeline.predict(X=pd.DataFrame([x_test]))

array([237454.14919708])

So this house is predicted to cost $237,454.
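
One caveat when predicting with a pipeline like this: by default, `OneHotEncoder` raises an error if it meets a category that was not present when it was fit. A possible remedy, sketched below on made-up data (the `train` and `new_house` frames are hypothetical), is to pass `handle_unknown="ignore"`, which encodes an unseen category as all zeros. Whether that behavior is appropriate depends on the application.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data with two neighborhoods.
train = pd.DataFrame({
    "Gr Liv Area": [1000, 1500, 2000, 1200, 1800],
    "Neighborhood": ["NAmes", "NAmes", "Gilbert", "Gilbert", "NAmes"],
    "SalePrice": [120000, 160000, 230000, 150000, 190000],
})

ct_toy = make_column_transformer(
    # Unseen categories are encoded as all zeros instead of raising an error.
    (OneHotEncoder(handle_unknown="ignore"), ["Neighborhood"]),
    remainder="passthrough",
)
pipeline_toy = make_pipeline(ct_toy, LinearRegression())
pipeline_toy.fit(train[["Gr Liv Area", "Neighborhood"]], train["SalePrice"])

# "Veenker" never appeared in the training data.
new_house = pd.DataFrame({"Gr Liv Area": [1600], "Neighborhood": ["Veenker"]})
encoded = pipeline_toy[0].transform(new_house)  # fitted ColumnTransformer step
prediction = pipeline_toy.predict(new_house)
print(encoded)
print(prediction)
```

Here both neighborhood dummies are zero for the new house, so the prediction relies only on the intercept and the square footage.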