# Distances Between Observations

In [None]:
import pandas as pd
import numpy as np

# Ames housing - three variables only

As in the reading, first we will work with just three quantitative variables from that data set: the number of bedrooms, the number of bathrooms, and the living area (in square feet).

In [None]:
df_housing = pd.read_csv("https://raw.githubusercontent.com/kevindavisross/data301/main/data/AmesHousing.txt", sep="\t")
df_housing["Bathrooms"] = df_housing["Full Bath"] + 0.5 * df_housing["Half Bath"]
df_housing_quant = df_housing[["Bedroom AbvGr", "Gr Liv Area", "Bathrooms"]]
df_housing_quant

In the reading, we scaled these variables using standardized scaling, then computed the Euclidean distance between observations 2927 and 2498 and between observations 2928 and 290.
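
As a reminder of what that computation looks like, here is a minimal sketch on a small made-up table (the values are hypothetical, not taken from the Ames data). It standardizes each column with Pandas and then computes one Euclidean distance. Note that Pandas `.std()` uses the sample standard deviation by default, so the numbers can differ slightly from what `StandardScaler` produces.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, NOT the real Ames observations
toy = pd.DataFrame({
    "Bedroom AbvGr": [3, 2, 4, 3],
    "Gr Liv Area": [1500, 900, 2400, 1600],
    "Bathrooms": [2.0, 1.0, 2.5, 1.5],
})

# Standardize each column: subtract its mean, divide by its standard deviation
toy_std = (toy - toy.mean()) / toy.std()

# Euclidean distance between rows 0 and 3 of the standardized data
diff = toy_std.loc[0] - toy_std.loc[3]
dist_03 = np.sqrt((diff ** 2).sum())
dist_03
```

The same pattern applies to any pair of row labels; only the scaling step changes in the exercises below.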

1\. Instead of standardizing the three variables from the Ames housing data set, normalize them.

You should do this from scratch, without using scikit-learn. (You can also try scikit-learn, but remember that the `Normalizer` scaler normalizes the rows to be length 1, rather than the columns. The scikit-learn function `normalize` is simpler, and allows you to normalize rows or columns using the `axis` argument.)
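
To make the row-versus-column distinction concrete, the sketch below applies both scikit-learn tools to a tiny made-up array (the numbers are hypothetical). It only illustrates the scikit-learn side; the from-scratch version is still left to you.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, normalize

# Hypothetical toy array: 3 observations (rows), 2 variables (columns)
X = np.array([[3.0, 4.0],
              [6.0, 8.0],
              [0.0, 5.0]])

# Normalizer rescales each ROW to have length 1
X_rows = Normalizer().fit_transform(X)

# normalize with axis=0 rescales each COLUMN to have length 1
X_cols = normalize(X, axis=0)

print(np.linalg.norm(X_rows, axis=1))  # length of each row
print(np.linalg.norm(X_cols, axis=0))  # length of each column
```

For comparing variables measured in different units, the column version is the one that matters here.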

In [None]:
# YOUR CODE HERE. ADD CELLS AS NEEDED

2\. Recompute the Euclidean distances between the two pairs of points, but using the normalized values.

In [None]:
# YOUR CODE HERE. ADD CELLS AS NEEDED

3\. Instead of standardizing the three variables from the Ames housing data set, apply a min-max scaling to them.

Try this both from scratch and using the `MinMaxScaler` in scikit-learn.

In [None]:
# YOUR CODE HERE. ADD CELLS AS NEEDED

4\. Recompute the Euclidean distances between the two pairs of points, but using the min-max-scaled values.

In [None]:
# YOUR CODE HERE. ADD CELLS AS NEEDED

5\. Does your conclusion about which pair of observations is more similar change depending on the scaling you use?

**YOUR RESPONSE HERE**

6\. Suppose that you really like house 0 in the data set, but it is too expensive. Find cheaper homes that are similar to it --- in terms of living area, number of bedrooms, number of bathrooms --- by calculating distances from house 0. Try different distance metrics and different scaling methods. How sensitive are your results to these choices?

Be sure to actually look at the profiles of the homes that your algorithm picked out as most similar. Do they make sense?

_Think:_ If the goal is to find a "good deal" on a similar house, should sale price be included as a variable in your distance metric?

In [None]:
# YOUR CODE HERE. ADD CELLS AS NEEDED

**YOUR RESPONSE HERE**

## Using categorical variables when computing distances

So far, we have only computed distances between observations based on quantitative variables. But what if we want to include categorical variables? We can convert categorical variables into dummy quantitative variables, and then include the dummy variables in the distance calculations.
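
Here is a minimal sketch of that idea on a two-row made-up table (hypothetical values). The two houses agree on the quantitative variable and differ only in style, so the whole distance comes from the dummy columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy houses that differ only in "House Style"
toy = pd.DataFrame({
    "Bedroom AbvGr": [3, 3],
    "House Style": ["1Story", "2Story"],
})

# dtype=float makes the dummy columns 0.0/1.0 so arithmetic is straightforward
toy_dummies = pd.get_dummies(toy, dtype=float)

# Euclidean distance between the two rows, dummy columns included
diff = toy_dummies.loc[0] - toy_dummies.loc[1]
dist = np.sqrt((diff ** 2).sum())
dist
```

Two dummy columns each differ by 1, so a change of category contributes $\sqrt{2}$ to the Euclidean distance. How large that is relative to the quantitative variables depends on how those variables are scaled.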

Let's add "House Style" to the variables we are considering for the Ames housing data set.


In [None]:
df_housing_mixed = df_housing[["Bedroom AbvGr", "Gr Liv Area", "Bathrooms", "House Style"]]
df_housing_mixed

Recall that we have seen the Pandas `get_dummies()` command which converts all categorical variables into dummy variables (leaving quantitative variables as is).

In [None]:
df_housing_dummies = pd.get_dummies(df_housing_mixed)
df_housing_dummies

7\. Continuing part 6. Suppose that you really like house 0 in the data set, but it is too expensive. Find cheaper homes that are similar to it --- in terms of living area, number of bedrooms, number of bathrooms, **and House Style** --- by calculating distances from house 0. Try different distance metrics and different scaling methods. How sensitive are your results to these choices?

Be sure to actually look at the profiles of the homes that your algorithm picked out as most similar. Do they make sense?

_Think:_ If the goal is to find a "good deal" on a similar house, should sale price be included as a variable in your distance metric?

In [None]:
# YOUR CODE HERE. ADD CELLS AS NEEDED

**YOUR RESPONSE HERE**

## Activity

Continuing parts 6 and 7. Suppose that you really like house 0 in the data set, but it is too expensive. Find cheaper homes that are similar to it, by calculating distances after encoding categorical variables as dummy variables. Be sure to actually look at the profiles of the homes that your algorithm picked out as most similar. Do they make sense?

Try different distance metrics and different scaling methods. How sensitive are your results to these choices?

_Think:_ If the goal is to find a "good deal" on a similar house, should sale price be included as a variable in your distance metric?

_Hint:_ There are too many variables in the data set. Do not attempt to call `pd.get_dummies()` on the entire `DataFrame`! You will want to pare down the number of variables, but be sure to include a mixture of categorical and quantitative variables. Refer to the [data documentation](https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt) for information about the variables.

There are many approaches to this problem. I'll ask several groups to present their approach. Which variables did you decide to include? Which scaling method? Which distance metric? Why? What houses would you recommend?

In [None]:
# YOUR CODE HERE. ADD CELLS AS NEEDED

**YOUR RESPONSE HERE**

## Dummy encoding in scikit-learn and sparse matrices

You can do dummy, or "one-hot", encoding in scikit-learn using `OneHotEncoder`. There are `fit` and `transform` steps, just like for `StandardScaler`.

In [None]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
enc.fit(df_housing[["House Style"]])
output = enc.transform(df_housing[["House Style"]])
output


Notice that `OneHotEncoder` returns a "sparse matrix", which is not a `DataFrame` or even a `numpy` array. A _sparse matrix_ is one whose entries are mostly zeroes. For example,

$$ \begin{pmatrix} 0 & 0 & 0 & 0 & 0 \\ 1.7 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & -0.8 & 0 \end{pmatrix} $$

is an example of a sparse matrix. Instead of storing 20 values (most of which are equal to 0), we can simply store the locations of the non-zero entries and their values:

- $(1, 0) \rightarrow 1.7$
- $(3, 3) \rightarrow -0.8$

All other entries of the matrix are assumed to be zero. This representation offers substantial memory savings when there are only a few non-zero entries. (But if not, then this representation can actually be more expensive.) Transforming a categorical variable into dummy variables usually returns a sparse matrix, since each row only has one non-zero entry.
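
The storage scheme described above can be sketched with SciPy, which is the library scikit-learn relies on for sparse matrices. The example below uses the small matrix shown earlier; the variable names are just for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# The 4 x 5 example matrix, written out densely (20 stored values)
dense = np.array([
    [0, 0, 0, 0, 0],
    [1.7, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, -0.8, 0],
])

# Convert to a sparse representation that keeps only the non-zero entries
sparse = csr_matrix(dense)

# Recover the (row, column) -> value pairs that are actually stored
rows, cols = sparse.nonzero()
entries = {(int(r), int(c)): float(sparse[r, c]) for r, c in zip(rows, cols)}

print(sparse.nnz)  # number of stored non-zero values
print(entries)
```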

If we want a dense matrix instead of a sparse matrix, set `sparse_output=False` in `OneHotEncoder`.


In [None]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(sparse_output=False)
enc.fit(df_housing[["House Style"]])
enc.transform(df_housing[["House Style"]])


You can also convert a sparse matrix to a dense one using `.todense()`:

In [None]:
output.todense()
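
A small caveat, sketched below under the assumption that the encoder output behaves like a SciPy `csr_matrix`: `.todense()` gives back a `numpy.matrix`, while `.toarray()` gives back an ordinary `numpy` array, which is usually easier to work with.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Tiny hypothetical sparse matrix standing in for the encoder output
m = csr_matrix(np.array([[0.0, 1.0],
                         [0.0, 0.0]]))

as_matrix = m.todense()  # numpy.matrix
as_array = m.toarray()   # plain numpy.ndarray

print(type(as_matrix))
print(type(as_array))
```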

## Selectively Encoding Variables in Scikit-Learn

What if we have a DataFrame, and we only want to dummy encode the categorical variables? We have seen that Pandas `get_dummies` will pass through the quantitative variables unchanged. What about scikit-learn? Scikit-learn provides a `ColumnTransformer` that allows us to selectively apply transformations to certain columns. We can use `ColumnTransformer` to apply the `OneHotEncoder` to the "House Style" variable, and "passthrough" the remaining variables.





In [None]:
from sklearn.compose import ColumnTransformer
enc = ColumnTransformer(
    [("Encoded House Style", OneHotEncoder(), ["House Style"])],
    remainder="passthrough")

enc.fit(df_housing_mixed)
enc.transform(df_housing_mixed)


One advantage of using `ColumnTransformer` is that you can mix scalers for quantitative variables and encoders for categorical variables.

(Note: We will see later how to combine steps like these into a pipeline which both streamlines our analysis and allows us to apply operations consistently across multiple data sets, for example, across both training and testing data.)

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

enc = ColumnTransformer(
    [("Scaled Quant Variables", StandardScaler(), ["Bedroom AbvGr", "Gr Liv Area", "Bathrooms"]),
     ("Encoded House Style", OneHotEncoder(), ["House Style"])],
    remainder="passthrough")

We can visualize the steps in the `ColumnTransformer`:

In [None]:
enc

Now we fit the column transformer to the entire Ames housing data set. Notice that variables we haven't specified will pass through unchanged.

In [None]:
enc.fit(df_housing)
df_housing_enc = enc.transform(df_housing)

df_housing_enc

We can convert the output to a Pandas DataFrame, but unfortunately all of the column names have been stripped away.

In [None]:
pd.DataFrame(df_housing_enc)