# Chapter 4.2 Distance Metrics and Categorical Variables



The distance metrics that we studied in the previous section were designed for quantitative variables. But most data sets contain a mix of categorical and quantitative variables. For example, the Titanic data set contains both quantitative variables, like `age`, and categorical variables, like `sex` and `embarked`. How do we measure the similarity between observations for a data set like this one? The most straightforward solution is to convert the categorical variables into quantitative ones.

In [3]:
%matplotlib inline
import numpy as np
import pandas as pd
pd.options.display.max_rows = 5

titanic = pd.read_csv("https://raw.githubusercontent.com/dlsun/data-science-book/master/data/titanic.csv")
titanic

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1307,3,0,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.2250,,C,,,
1308,3,0,"Zimmerman, Mr. Leo",male,29.0000,0,0,315082,7.8750,,S,,,


## Converting Categorical Variables to Quantitative Variables

Binary categorical variables (categorical variables with two categories) can be converted into quantitative variables by coding one category as 1 and the other category as 0. (In fact, the `survived` column in the Titanic data set is an example of a variable where this has been done.) But what do we do about a categorical variable with more than 2 categories, like `embarked`, which has 3 categories?
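As a minimal sketch of this 0/1 coding, the snippet below uses a small made-up `Series` standing in for the `sex` column (the values are illustrative, not taken from the Titanic data). Which category is coded as 1 is an arbitrary choice.

```python
import pandas as pd

# Hypothetical toy data standing in for a binary categorical variable
sex = pd.Series(["female", "male", "male", "female"], name="sex")

# Code "female" as 1 and "male" as 0 (the choice of which is 1 is arbitrary)
sex_female = (sex == "female").astype(int)
print(sex_female.tolist())
```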

We can convert a categorical variable with $K$ categories into $K$ separate 0/1 variables, or **dummy variables**. Each of the $K$ variables is an indicator for one of the $K$ categories. That is, each dummy variable is 1 if the observation falls into that category and 0 otherwise.
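To make the definition concrete, here is one way the indicators could be built by hand, using a made-up four-row `embarked` column (the data and the name `dummies` are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical toy column with K = 3 categories
embarked = pd.Series(["S", "C", "Q", "S"], name="embarked")

# One indicator column per category: 1 if the observation is in that category, else 0
dummies = pd.DataFrame({
    category: (embarked == category).astype(int)
    for category in sorted(embarked.unique())
})
print(dummies)
```

Each row of `dummies` sums to 1, because every observation in this toy column belongs to exactly one category.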

Although it is not difficult to create dummy variables manually, the easiest way to create them is the `get_dummies()` function in `pandas`.

In [3]:
pd.get_dummies(titanic["embarked"])

Unnamed: 0,C,Q,S
0,0,0,1
1,0,0,1
...,...,...,...
1307,1,0,0
1308,0,0,1


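Since every observation is in exactly one category, each row contains exactly one 1; the rest of the values in each row are 0s. (The exception is an observation whose category is missing: by default, `get_dummies` gives it 0s in every column.)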

We can call `get_dummies` on a `DataFrame` to encode multiple categorical variables at once. `pandas` will only dummy-encode the variables it deems categorical (by default, columns of string or categorical type), leaving the numeric columns alone. If any categorical variables are stored in the `DataFrame` as numbers, they must first be cast explicitly to a non-numeric type, such as `str`. `pandas` will also automatically prepend the variable name to each dummy variable, to prevent collisions between column names in the final `DataFrame`.

In [7]:
# Convert pclass to a categorical type
titanic["pclass"] = titanic["pclass"].astype(str)

# Pass all variables to get_dummies, except ones that are "other" types
titanic_num = pd.get_dummies(
    titanic.drop(["name", "ticket", "cabin", "boat", "body"], axis=1)
)
titanic_num

Unnamed: 0,survived,age,sibsp,parch,fare,pclass_1,pclass_2,pclass_3,sex_female,sex_male,...,"home.dest_Wimbledon Park, London / Hayling Island, Hants","home.dest_Windsor, England New York, NY","home.dest_Winnipeg, MB","home.dest_Winnipeg, MN","home.dest_Woodford County, KY","home.dest_Worcester, England","home.dest_Worcester, MA","home.dest_Yoevil, England / Cottage Grove, OR","home.dest_Youngstown, OH","home.dest_Zurich, Switzerland"
0,1,29.0000,0,0,211.3375,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0.9167,1,2,151.5500,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1307,0,27.0000,0,0,7.2250,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
1308,0,29.0000,0,0,7.8750,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0


Notice that categorical variables, like `pclass`, were converted to dummy variables with names like `pclass_1`, `pclass_2` and `pclass_3`, while quantitative variables, like `age`, were left alone.
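The effect of the cast can be checked on a small made-up `DataFrame` (the name `toy` and its values are assumptions for illustration). This sketch passes `dtype=int` so that the dummies print as 0/1 regardless of the `pandas` version.

```python
import pandas as pd

# Hypothetical toy data: pclass is categorical but stored as integers
toy = pd.DataFrame({
    "pclass": [1, 3, 2],
    "sex": ["female", "male", "male"],
    "age": [29.0, 27.0, 40.0],
})

# Without a cast, pclass is treated as quantitative and passed through unchanged
before = pd.get_dummies(toy, dtype=int)
print(before.columns.tolist())

# After casting to str, pclass is dummy-encoded with the prefix "pclass_"
toy["pclass"] = toy["pclass"].astype(str)
after = pd.get_dummies(toy, dtype=int)
print(after.columns.tolist())
```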

Now that we have converted every variable in our data set into a quantitative variable, we can apply the techniques from the previous section (Section 4.1) to calculate distances between observations. For example, to find the passenger who is most similar to the first passenger, Elisabeth Allen, we can standardize the variables in the above `DataFrame` and then find the row with the smallest Euclidean distance to her row.

In [4]:
titanic_std = (titanic_num - titanic_num.mean()) / titanic_num.std()
np.sqrt(
    ((titanic_std - titanic_std.loc[0]) ** 2).sum(axis=1)
).sort_values()


The passenger who was most similar to Elisabeth Allen, other than herself, is passenger 238. Let's extract these passengers from the original `DataFrame` to see how similar they really are.

In [5]:
titanic.loc[[0, 238]]

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
238,1,1,"Robert, Mrs. Edward Scott (Elisabeth Walton Mc...",female,43.0,0,1,24160,211.3375,B3,S,2,,"St Louis, MO"


The two passengers are indeed very similar. Among the variables we used, they differ only in age and in the number of parents/children accompanying them. They even happen to share the same first two names ("Elisabeth Walton").
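The whole pipeline (encode, standardize, compute distances, drop the reference row itself) can be sketched end to end on a tiny made-up `DataFrame`. The values and the names `toy`, `dists`, and `nearest` are assumptions for illustration, not part of the Titanic analysis above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a mix of quantitative and categorical variables
toy = pd.DataFrame({
    "age": [29.0, 43.0, 27.0, 2.0],
    "fare": [211.0, 211.0, 7.0, 151.0],
    "sex": ["female", "female", "male", "female"],
    "embarked": ["S", "S", "C", "S"],
})

# 1. Dummy-encode the categorical variables
toy_num = pd.get_dummies(toy, dtype=int)

# 2. Standardize every column (dummy variables included)
toy_std = (toy_num - toy_num.mean()) / toy_num.std()

# 3. Euclidean distance from row 0 to every row
dists = np.sqrt(((toy_std - toy_std.loc[0]) ** 2).sum(axis=1))

# 4. Exclude row 0 itself before looking for the closest row
nearest = dists.drop(0).idxmin()
print(nearest)
```

Dropping the reference row before calling `idxmin()` avoids the trivial answer that every observation is closest to itself.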

# Exercises

Exercises 1 and 2 use the Ames housing data set (`https://raw.githubusercontent.com/dlsun/data-science-book/master/data/AmesHousing.txt`).

**Exercise 1.** The neighborhood variable (`Neighborhood`) in this data set is categorical. Convert it to $K$ quantitative variables. What is $K$ in this case?

Based on these $K$ variables only, calculate the Euclidean distance between house 0 and each of the other houses in the data set. What are the possible values of the Euclidean distance? Can you explain what a distance of $0$ means, in the context of this variable? What about a distance of $1$?

In [6]:
ames_df = pd.read_csv("https://raw.githubusercontent.com/dlsun/data-science-book/master/data/AmesHousing.txt", sep='\t')
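# One dummy column per neighborhood, so K = number of distinct neighborhoods.
# dtype=int keeps the dummies as 0/1 integers (newer pandas versions return booleans).
ames_df_dum = pd.get_dummies(ames_df["Neighborhood"], dtype=int)

def eu_dist(df, x):
    """Euclidean distance from row x of df to every row of df."""
    return np.sqrt(((df - df.loc[x]) ** 2).sum(axis=1))

eu_dist(ames_df_dum, 0).value_counts()
# Distance 0: same neighborhood as house 0. Distance sqrt(2): different neighborhoods, since the rows differ in exactly two columns and the squared differences sum to 2.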

1.414214    2487
0.000000     443
dtype: int64
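
As a sketch of why only these two values appear, write $x$ and $y$ for the dummy-encoded rows of two houses (this notation is introduced here for illustration), each with a single 1 among its $K$ entries. If the houses share a neighborhood, the rows are identical. Otherwise they differ in exactly two entries, one where $x$ has the 1 and one where $y$ has the 1:

$$
d(x, y) = \sqrt{\sum_{k=1}^{K} (x_k - y_k)^2} =
\begin{cases}
0 & \text{if the houses are in the same neighborhood} \\
\sqrt{1^2 + (-1)^2} = \sqrt{2} \approx 1.414 & \text{otherwise}
\end{cases}
$$

In particular, a distance of exactly $1$ cannot occur when the distance is based on a single dummy-encoded variable.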

**Exercise 2.** Suppose that you really like house 0 in the data set, but it is too expensive. Find cheaper homes that are similar to it, by calculating distances after encoding categorical variables as dummy variables. Be sure to actually look at the profiles of the homes that your algorithm picked out as most similar. Do they make sense?

Try different distance metrics and different standardization methods. How sensitive are your results to these choices?

_Think:_ If the goal is to find a "good deal" on a similar house, should sale price be included as a variable in your distance metric? 

_Hint:_ There are too many variables in the data set. Do not try to call `pd.get_dummies()` on the entire `DataFrame`! You will want to pare down the number of variables, but be sure to include a mixture of categorical and quantitative variables. Refer to the [data documentation](https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt) for information about the variables.

In [8]:
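# Dummy-encode a small mix of categorical (Utilities, Bldg Type) and quantitative variables
ames_df_dum = pd.get_dummies(ames_df[['Utilities', 'Bldg Type', 'Overall Qual', 'Overall Cond', 'Year Built', 'Full Bath', 'Lot Area', 'Bedroom AbvGr', 'Gr Liv Area']], dtype=int)
# Rank houses by (unstandardized) Euclidean distance to house 0
eu_dist(ames_df_dum, 0).sort_values()
# Compare house 0 with house 2903 in the original data
ames_df.loc[[0, 2903]]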

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,...,0,,,,0,5,2010,WD,Normal,215000
2903,2904,923125030,20,A (agr),125.0,31250,Pave,,Reg,Lvl,...,0,,,,0,5,2006,WD,Normal,81500
