# Distances Between Observations

In [147]:
import pandas as pd
import numpy as np

# Ames housing - three variables only

As in the reading, first we will work with just three quantitative variables from that data set: the number of bedrooms, the number of bathrooms, and the living area (in square feet).

In [148]:
df_housing = pd.read_csv("https://raw.githubusercontent.com/kevindavisross/data301/main/data/AmesHousing.txt", sep="\t")
df_housing["Bathrooms"] = df_housing["Full Bath"] + 0.5 * df_housing["Half Bath"]
df_housing_quant = df_housing[["Bedroom AbvGr", "Gr Liv Area", "Bathrooms"]]
df_housing_quant

Unnamed: 0,Bedroom AbvGr,Gr Liv Area,Bathrooms
0,3,1656,1.0
1,2,896,1.0
2,3,1329,1.5
3,3,2110,2.5
4,3,1629,2.5
...,...,...,...
2925,3,1003,1.0
2926,2,902,1.0
2927,3,970,1.0
2928,2,1389,1.0


In the reading, we scaled these variables using standardized scaling, then computed the Euclidean distance between observations 2927 and 2928 between observations 2498 and 290.

1\. Instead of standardizing the three variables from the Ames housing data set, normalize them.

You should do this from scratch, without using scikit-learn. (You can also try scikit-learn, but remember that the `Normalizer` scaler normalizes the rows to be length 1, rather than the columns. The scikit-learn function `normalize` is simpler, and allows you to normalize rows or columns using the `axis` argument.)

In [149]:
# YOUR CODE HERE. ADD CELLS AS NEEDED
df_housing_norm = df_housing_quant / (np.sqrt((df_housing_quant**2).sum()))
df_housing_norm

Unnamed: 0,Bedroom AbvGr,Gr Liv Area,Bathrooms
0,0.018649,0.019331,0.009878
1,0.012433,0.010460,0.009878
2,0.018649,0.015514,0.014817
3,0.018649,0.024631,0.024695
4,0.018649,0.019016,0.024695
...,...,...,...
2925,0.018649,0.011709,0.009878
2926,0.012433,0.010530,0.009878
2927,0.018649,0.011323,0.009878
2928,0.012433,0.016215,0.009878


2\. Recompute the Euclidean distances between the two pairs of points, but using the normalized values.

In [150]:
# YOUR CODE HERE. ADD CELLS AS NEEDED
x = df_housing_norm.iloc[2927]
x1 = df_housing_norm.iloc[2928]

np.sqrt(np.sum((x - x1)**2).sum())

0.007910021508841998

In [151]:
x2 = df_housing_norm.iloc[2498]
x3 = df_housing_norm.iloc[290]

np.sqrt(np.sum((x2- x3)**2).sum())

0.021103948426701397

3\. Instead of standardizing the three variables from the Ames housing data set, apply a min-max scaling to them.

Try this both from scratch and using the `MinMaxScaler` in scikitlearn.

In [152]:
# YOUR CODE HERE. ADD CELLS AS NEEDED
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(df_housing_quant)
df_housing_mm = scaler.transform(df_housing_quant)

df_housing_mm

array([[0.375     , 0.24905803, 0.2       ],
       [0.25      , 0.10587792, 0.2       ],
       [0.375     , 0.1874529 , 0.3       ],
       ...,
       [0.375     , 0.11981914, 0.2       ],
       [0.25      , 0.19875659, 0.2       ],
       [0.375     , 0.31386586, 0.5       ]])

4\. Recompute the Euclidean distances between the two pairs of points, but using the min-max-scaled values.

In [153]:
# YOUR CODE HERE. ADD CELLS AS NEEDED
from sklearn.metrics.pairwise import euclidean_distances

x = df_housing_mm[[2927, 2498], :]
x1 = df_housing_mm[[2928, 290], :]

euclidean_distances(x, x1)

array([[0.14783816, 0.6330872 ],
       [0.33422312, 0.42500067]])

5\. Does your conclusion about which pair of observations is most similar change depending on the scaling you use?

**YOUR RESPONSE HERE**

For normalizing, the distances between 2927 and 2928 was 0.007. And the distance between 2498 and 290 was 0.02. For min-max scaling, the distance between 2927 and 2928 was 0.14. And the distance between 2498 and 290 was 0.42. The conclusion does not change depending on the scaling used.

6\. Suppose that you really like house 0 in the data set, but it is too expensive. Find cheaper homes that are similar to it --- in terms of living area, number of bedrooms, number of bathrooms --- by calculating distances from house 0. Try different distance metrics and different scaling methods. How sensitive are your results to these choices?

Be sure to actually look at the profiles of the homes that your algorithm picked out as most similar. Do they make sense?

_Think:_ If the goal is to find a "good deal" on a similar house, should sale price be included as a variable in your distance metric?

In [154]:
# YOUR CODE HERE. ADD CELLS AS NEEDED
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df_housing_quant)
df_housing_ss = scaler.transform(df_housing_quant)


In [155]:
eu_dist = euclidean_distances(df_housing_ss, df_housing_ss[0, :].reshape(1, -1))
np.argsort(eu_dist, axis=0)[1][0]
df_housing.iloc[np.argsort(eu_dist, axis=0)[1][0]]

Order                  1227
PID               534477270
MS SubClass              80
MS Zoning                RL
Lot Frontage           80.0
                    ...    
Yr Sold                2008
Sale Type               WD 
Sale Condition       Normal
SalePrice            165500
Bathrooms               1.0
Name: 1226, Length: 83, dtype: object

In [156]:
from sklearn.metrics.pairwise import manhattan_distances

man_dist = manhattan_distances(df_housing_ss, df_housing_ss[0, :].reshape(1, -1))
np.argsort(man_dist, axis=0)[1][0]
df_housing.iloc[np.argsort(man_dist, axis=0)[1][0]]

Order                  1227
PID               534477270
MS SubClass              80
MS Zoning                RL
Lot Frontage           80.0
                    ...    
Yr Sold                2008
Sale Type               WD 
Sale Condition       Normal
SalePrice            165500
Bathrooms               1.0
Name: 1226, Length: 83, dtype: object

In [157]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(df_housing_quant)
df_housing_mm = scaler.transform(df_housing_quant)
df_housing_mm

array([[0.375     , 0.24905803, 0.2       ],
       [0.25      , 0.10587792, 0.2       ],
       [0.375     , 0.1874529 , 0.3       ],
       ...,
       [0.375     , 0.11981914, 0.2       ],
       [0.25      , 0.19875659, 0.2       ],
       [0.375     , 0.31386586, 0.5       ]])

In [158]:
eu_dist = euclidean_distances(df_housing_mm, df_housing_mm[0, :].reshape(1, -1))
np.argsort(eu_dist, axis=0)[1][0]
print(df_housing.iloc[np.argsort(eu_dist, axis=0)[1][0]])
print()
man_dist = manhattan_distances(df_housing_mm, df_housing_mm[0, :].reshape(1, -1))
np.argsort(man_dist, axis=0)[1][0]
print(df_housing.iloc[np.argsort(man_dist, axis=0)[1][0]])

Order                  1227
PID               534477270
MS SubClass              80
MS Zoning                RL
Lot Frontage           80.0
                    ...    
Yr Sold                2008
Sale Type               WD 
Sale Condition       Normal
SalePrice            165500
Bathrooms               1.0
Name: 1226, Length: 83, dtype: object

Order                  1227
PID               534477270
MS SubClass              80
MS Zoning                RL
Lot Frontage           80.0
                    ...    
Yr Sold                2008
Sale Type               WD 
Sale Condition       Normal
SalePrice            165500
Bathrooms               1.0
Name: 1226, Length: 83, dtype: object


In [159]:
from sklearn.preprocessing import Normalizer

scaler = Normalizer()
scaler.fit(df_housing_quant)
df_housing_norm = scaler.transform(df_housing_quant)
df_housing_norm

array([[1.81159090e-03, 9.99998177e-01, 6.03863633e-04],
       [2.23213591e-03, 9.99996886e-01, 1.11606795e-03],
       [2.25732915e-03, 9.99996815e-01, 1.12866458e-03],
       ...,
       [3.09276707e-03, 9.99994686e-01, 1.03092236e-03],
       [1.43988294e-03, 9.99998704e-01, 7.19941472e-04],
       [1.49999714e-03, 9.99998094e-01, 1.24999762e-03]])

In [160]:
eu_dist = euclidean_distances(df_housing_norm, df_housing_norm[0, :].reshape(1, -1))
np.argsort(eu_dist, axis=0)[1][0]
print(df_housing.iloc[np.argsort(eu_dist, axis=0)[1][0]])
print()
man_dist = manhattan_distances(df_housing_norm, df_housing_norm[0, :].reshape(1, -1))
np.argsort(man_dist, axis=0)[1][0]
print(df_housing.iloc[np.argsort(man_dist, axis=0)[1][0]])

Order                  1227
PID               534477270
MS SubClass              80
MS Zoning                RL
Lot Frontage           80.0
                    ...    
Yr Sold                2008
Sale Type               WD 
Sale Condition       Normal
SalePrice            165500
Bathrooms               1.0
Name: 1226, Length: 83, dtype: object

Order                  1227
PID               534477270
MS SubClass              80
MS Zoning                RL
Lot Frontage           80.0
                    ...    
Yr Sold                2008
Sale Type               WD 
Sale Condition       Normal
SalePrice            165500
Bathrooms               1.0
Name: 1226, Length: 83, dtype: object


**YOUR RESPONSE HERE**

Using any scaler and any metric for distance for each of the scaler, I got the same house that was the closest to house 0.

In [161]:
# x = df_housing_st.loc[0]
# houses_quant_st["distance"] = np.sqrt(((df_housing_st - x) ** 2).sum(axis = 1)) 

## Using categorical variables when computing distances

So far, we have only computed distances between observations based on quantitative variables. But what if we want to include categorical variables? We can convert categorical variables into dummy quantitative variables, and then include in the dummy variables in the distance calculations.

Let's add "House Style" to the variables we are considering for the Ames housing data set.


In [162]:
df_housing_mixed = df_housing[["Bedroom AbvGr", "Gr Liv Area", "Bathrooms", "House Style"]]
df_housing_mixed

Unnamed: 0,Bedroom AbvGr,Gr Liv Area,Bathrooms,House Style
0,3,1656,1.0,1Story
1,2,896,1.0,1Story
2,3,1329,1.5,1Story
3,3,2110,2.5,1Story
4,3,1629,2.5,2Story
...,...,...,...,...
2925,3,1003,1.0,SLvl
2926,2,902,1.0,1Story
2927,3,970,1.0,SFoyer
2928,2,1389,1.0,1Story


Recall that we have seen the Pandas `get_dummies()` command which converts all categorical variables into dummy variables (leaving quantitative variables as is).

In [163]:
df_housing_dummies = pd.get_dummies(df_housing_mixed)
df_housing_dummies

Unnamed: 0,Bedroom AbvGr,Gr Liv Area,Bathrooms,House Style_1.5Fin,House Style_1.5Unf,House Style_1Story,House Style_2.5Fin,House Style_2.5Unf,House Style_2Story,House Style_SFoyer,House Style_SLvl
0,3,1656,1.0,False,False,True,False,False,False,False,False
1,2,896,1.0,False,False,True,False,False,False,False,False
2,3,1329,1.5,False,False,True,False,False,False,False,False
3,3,2110,2.5,False,False,True,False,False,False,False,False
4,3,1629,2.5,False,False,False,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...
2925,3,1003,1.0,False,False,False,False,False,False,False,True
2926,2,902,1.0,False,False,True,False,False,False,False,False
2927,3,970,1.0,False,False,False,False,False,False,True,False
2928,2,1389,1.0,False,False,True,False,False,False,False,False


7\. Continuing part 6. Suppose that you really like house 0 in the data set, but it is too expensive. Find cheaper homes that are similar to it --- in terms of living area, number of bedrooms, number of bathrooms, **and House Style** --- by calculating distances from house 0. Try different distance metrics and different scaling methods. How sensitive are your results to these choices?

Be sure to actually look at the profiles of the homes that your algorithm picked out as most similar. Do they make sense?

_Think:_ If the goal is to find a "good deal" on a similar house, should sale price be included as a variable in your distance metric?

In [164]:
# YOUR CODE HERE. ADD CELLS AS NEEDED

scaler = StandardScaler()
scaler.fit(df_housing_dummies)
df_housing_ss = scaler.transform(df_housing_dummies)
df_housing_ss

array([[ 0.17609421,  0.30926506, -1.17666295, ..., -0.65146333,
        -0.17074395, -0.21373267],
       [-1.03223376, -1.19442705, -1.17666295, ..., -0.65146333,
        -0.17074395, -0.21373267],
       [ 0.17609421, -0.33771825, -0.3987698 , ..., -0.65146333,
        -0.17074395, -0.21373267],
       ...,
       [ 0.17609421, -1.04801492, -1.17666295, ..., -0.65146333,
         5.85672304, -0.21373267],
       [-1.03223376, -0.21900572, -1.17666295, ..., -0.65146333,
        -0.17074395, -0.21373267],
       [ 0.17609421,  0.9898836 ,  1.1570165 , ...,  1.53500581,
        -0.17074395, -0.21373267]])

In [165]:
eu_dist = euclidean_distances(df_housing_ss, df_housing_ss[0, :].reshape(1, -1))
df_housing["eu_distance"] = eu_dist
df_temp = df_housing.sort_values(by = "eu_distance").head(10)
df_temp.sort_values(by = "SalePrice")


Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice,Bathrooms,eu_distance
2298,2299,923251160,20,RL,124.0,27697,Pave,,Reg,Lvl,...,,,0,11,2007,COD,Abnorml,80000,1.0,0.09497
788,789,905450020,20,RL,73.0,9855,Pave,,Reg,Lvl,...,MnPrv,,0,11,2009,COD,Normal,127500,1.0,0.065292
2700,2701,904100170,20,RL,100.0,21370,Pave,,Reg,Lvl,...,,Shed,600,6,2006,WD,Normal,131000,1.0,0.031657
1940,1941,535353050,20,RL,75.0,9532,Pave,,Reg,Lvl,...,GdPrv,,0,2,2007,WD,Normal,153000,1.0,0.017807
314,315,916125360,20,RL,,57200,Pave,,IR1,Bnk,...,,,0,6,2010,WD,Normal,160000,1.0,0.061335
1240,1241,535176100,20,RL,90.0,13200,Pave,,IR1,Lvl,...,,,0,5,2008,WD,Normal,166800,1.0,0.170155
618,619,534476150,20,RL,80.0,9600,Pave,,Reg,Lvl,...,,,0,10,2009,WD,Normal,167000,1.0,0.023743
2282,2283,923205025,190,RL,,32463,Pave,,Reg,Low,...,,,0,3,2007,WD,Normal,168000,1.0,0.06727
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,...,,,0,5,2010,WD,Normal,215000,1.0,0.0
1541,1542,909428190,20,RL,,14778,Pave,,IR1,Low,...,,,0,11,2008,WD,Normal,224000,1.0,0.031657


In [166]:
man_dist = manhattan_distances(df_housing_ss, df_housing_ss[0, :].reshape(1, -1))
df_housing["man_distance"] = eu_dist
df_temp = df_housing.sort_values(by = "man_distance").head(10)
df_temp.sort_values(by = "SalePrice")

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice,Bathrooms,eu_distance,man_distance
2298,2299,923251160,20,RL,124.0,27697,Pave,,Reg,Lvl,...,,0,11,2007,COD,Abnorml,80000,1.0,0.09497,0.09497
788,789,905450020,20,RL,73.0,9855,Pave,,Reg,Lvl,...,,0,11,2009,COD,Normal,127500,1.0,0.065292,0.065292
2700,2701,904100170,20,RL,100.0,21370,Pave,,Reg,Lvl,...,Shed,600,6,2006,WD,Normal,131000,1.0,0.031657,0.031657
1940,1941,535353050,20,RL,75.0,9532,Pave,,Reg,Lvl,...,,0,2,2007,WD,Normal,153000,1.0,0.017807,0.017807
314,315,916125360,20,RL,,57200,Pave,,IR1,Bnk,...,,0,6,2010,WD,Normal,160000,1.0,0.061335,0.061335
1240,1241,535176100,20,RL,90.0,13200,Pave,,IR1,Lvl,...,,0,5,2008,WD,Normal,166800,1.0,0.170155,0.170155
618,619,534476150,20,RL,80.0,9600,Pave,,Reg,Lvl,...,,0,10,2009,WD,Normal,167000,1.0,0.023743,0.023743
2282,2283,923205025,190,RL,,32463,Pave,,Reg,Low,...,,0,3,2007,WD,Normal,168000,1.0,0.06727,0.06727
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,...,,0,5,2010,WD,Normal,215000,1.0,0.0,0.0
1541,1542,909428190,20,RL,,14778,Pave,,IR1,Low,...,,0,11,2008,WD,Normal,224000,1.0,0.031657,0.031657


**YOUR RESPONSE HERE**

## Activity

Continuing parts 6 and 7. Suppose that you really like house 0 in the data set, but it is too expensive. Find cheaper homes that are similar to it, by calculating distances after encoding categorical variables as dummy variables. Be sure to actually look at the profiles of the homes that your algorithm picked out as most similar. Do they make sense?

Try different distance metrics and different scaling methods. How sensitive are your results to these choices?

_Think:_ If the goal is to find a "good deal" on a similar house, should sale price be included as a variable in your distance metric?

_Hint:_ There are too many variables in the data set. Do not attempt to call `pd.get_dummies()` on the entire `DataFrame`! You will want to pare down the number of variables, but be sure to include a mixture of categorical and quantitative variables. Refer to the [data documentation](https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt) for information about the variables.

There are many approaches to this problem. I'll ask several groups to present their approach. Which variables did you decide to include? Which scaling method? Which distance matric? Why? What houses would you recommend?

In [167]:
# YOUR CODE HERE. ADD CELLS AS NEEDED

**YOUR RESPONSE HERE**

## Dummy encoding in scikit-learn and sparse matrices

You can do dummy, or "onehot", encoding in scikit-learn using `OneHotEncoder`. There are `fit` and `transform` steps, just like for `StandardScaler`.

In [168]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
enc.fit(df_housing[["House Style"]])
output = enc.transform(df_housing[["House Style"]])
output


<2930x8 sparse matrix of type '<class 'numpy.float64'>'
	with 2930 stored elements in Compressed Sparse Row format>

Notice that `OneHotEncoder` returns a "sparse matrix", which is not a `DataFrame` or even a `numpy` array. A _sparse matrix_ is one whose entries are mostly zeroes. For example,

$$ \begin{pmatrix} 0 & 0 & 0 & 0 & 0 \\ 1.7 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & -0.8 & 0 \end{pmatrix} $$

is an example of a sparse matrix. Instead of storing 20 values (most of which are equal to 0), we can simply store the locations of the non-zero entries and their values:

- $(1, 0) \rightarrow 1.7$
- $(3, 3) \rightarrow -0.8$

All other entries of the matrix are assumed to be zero. This representation offers substantial memory savings when there are only a few non-zero entries. (But if not, then this representation can actually be more expensive.) Transforming a categorical variable into dummy variables usually returns a sparse matrix, since each row only has one non-zero entry.

If we want a dense matrix instead of a sparse matrix, set `sparse_output=False` in `OneHotEncoder`.


In [169]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(sparse_output=False)
enc.fit(df_housing[["House Style"]])
enc.transform(df_housing[["House Style"]])


array([[0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.]])

You can also convert a sparse matrix to dense using `.todense()`

In [170]:
output.todense()

matrix([[0., 0., 1., ..., 0., 0., 0.],
        [0., 0., 1., ..., 0., 0., 0.],
        [0., 0., 1., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 1., 0.],
        [0., 0., 1., ..., 0., 0., 0.],
        [0., 0., 0., ..., 1., 0., 0.]])

## Selectively Encoding Variables in Scikit-Learn

What if we have a DataFrame, and we only want to dummy encode the categorical variables? We have seen that Pandas `get_dummies` will pass through the quantitative variables unchanged. What about scikit-learn? Scikit-learn provides a `ColumnTransformer` that allows us to selectively apply transformations to certain columns. We can use `ColumnTransformer` to apply the `OneHotEncoder` to the "House Style" variable, and "passthrough" the remaining variables.





In [171]:
from sklearn.compose import ColumnTransformer
enc = ColumnTransformer(
    [("Encoded House Style", OneHotEncoder(), ["House Style"])],
    remainder="passthrough")

enc.fit(df_housing_mixed)
enc.transform(df_housing_mixed)


array([[0.000e+00, 0.000e+00, 1.000e+00, ..., 3.000e+00, 1.656e+03,
        1.000e+00],
       [0.000e+00, 0.000e+00, 1.000e+00, ..., 2.000e+00, 8.960e+02,
        1.000e+00],
       [0.000e+00, 0.000e+00, 1.000e+00, ..., 3.000e+00, 1.329e+03,
        1.500e+00],
       ...,
       [0.000e+00, 0.000e+00, 0.000e+00, ..., 3.000e+00, 9.700e+02,
        1.000e+00],
       [0.000e+00, 0.000e+00, 1.000e+00, ..., 2.000e+00, 1.389e+03,
        1.000e+00],
       [0.000e+00, 0.000e+00, 0.000e+00, ..., 3.000e+00, 2.000e+03,
        2.500e+00]])

One advantage of using `ColumnTransformer` is that you can mix scalers for quantitative variables and encoders for categorical variables.

(Note: We will see later how to combine steps like these into a pipeline which both streamlines our analysis and allows us to apply operations consistently across multiple data sets, for example, across both training and testing data.)

In [172]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

enc = ColumnTransformer(
    [("Scaled Quant Variables", StandardScaler(), ["Bedroom AbvGr", "Gr Liv Area", "Bathrooms"]),
     ("Encoded House Style", OneHotEncoder(), ["House Style"])],
    remainder="passthrough")

We can visualize the steps in the ColumnTransformer

In [173]:
enc

Now we fit the column transformer to the entire Ames housing data set. Notice that variables we haven't specified will passthrough unchanged

In [174]:
enc.fit(df_housing)
df_housing_enc = enc.transform(df_housing)

df_housing_enc

array([[0.17609421298456596, 0.3092650614142043, -1.176662951985368, ...,
        215000, 0.0, 0.0],
       [-1.0322337590172566, -1.1944270494036793, -1.176662951985368,
        ..., 105000, 1.9290273331547119, 1.9290273331547119],
       [0.17609421298456596, -0.33771825468770084, -0.3987698000636333,
        ..., 172000, 1.0117831621058675, 1.0117831621058675],
       ...,
       [0.17609421298456596, -1.0480149228240432, -1.176662951985368,
        ..., 132000, 6.49407764378536, 6.49407764378536],
       [-1.0322337590172566, -0.2190057196231311, -1.176662951985368,
        ..., 170000, 1.3187594572247943, 1.3187594572247943],
       [0.17609421298456596, 0.9898835957844042, 1.1570165037798361, ...,
        188000, 3.832809136731903, 3.832809136731903]], dtype=object)

We convert the output to a Pandas DataFrame, but unforunately, all of the column names have been stripped away.

In [175]:
pd.DataFrame(df_housing_enc)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,82,83,84,85,86,87,88,89,90,91
0,0.176094,0.309265,-1.176663,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,,,0,5,2010,WD,Normal,215000,0.0,0.0
1,-1.032234,-1.194427,-1.176663,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,MnPrv,,0,6,2010,WD,Normal,105000,1.929027,1.929027
2,0.176094,-0.337718,-0.39877,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,,Gar2,12500,6,2010,WD,Normal,172000,1.011783,1.011783
3,0.176094,1.207523,1.157017,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,,,0,4,2010,WD,Normal,244000,2.500585,2.500585
4,0.176094,0.255844,1.157017,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,MnPrv,,0,3,2010,WD,Normal,189900,3.772272,3.772272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2925,0.176094,-0.982723,-1.176663,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,GdPrv,,0,3,2006,WD,Normal,142500,5.441141,5.441141
2926,-1.032234,-1.182556,-1.176663,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,MnPrv,,0,6,2006,WD,Normal,131000,1.919788,1.919788
2927,0.176094,-1.048015,-1.176663,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,MnPrv,Shed,700,7,2006,WD,Normal,132000,6.494078,6.494078
2928,-1.032234,-0.219006,-1.176663,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,,,0,4,2006,WD,Normal,170000,1.318759,1.318759
