## Notebook set up

Submit your notebook to the class leaderboard on HuggingFace at [huggingface.co/spaces/gperdrizet/leaderboard](https://huggingface.co/spaces/gperdrizet/leaderboard)

**Your task**: Apply at least two different feature engineering techniques to the `housing_df` dataframe to improve the dataset. At the end of the notebook, your engineered dataset and the original dataset will be used to train a linear regression model to predict `MedHouseVal`. Your goal is to achieve better model performance via feature engineering.

Don't change any of the code in the Model evaluation section of the notebook, especially the output saving. Otherwise the leaderboard scoring may not work!

**Note**: If you have read ahead or are familiar with the basics of training ML models: no, there is no train-test split, and yes, this means data leakage and generalizability are concerns. We will cover those topics in the next unit. For now, the goal is to keep things simple while still giving you an idea of how your feature engineering affects model performance.

Before applying transformations, explore the dataset to understand what techniques would be most beneficial.

### Import libraries

In [1]:
from pathlib import Path

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Custom imports
import pygeohash as pgh
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.preprocessing import PolynomialFeatures, RobustScaler, PowerTransformer, QuantileTransformer

# Set random seed for reproducibility
np.random.seed(315)

### Load dataset

In [2]:
# Load California housing dataset
original_housing_df = pd.read_csv('https://gperdrizet.github.io/FSA_devops/assets/data/unit2/california_housing.csv')
housing_df = original_housing_df.copy()

## Task 1: Explore the dataset

Before deciding what feature engineering techniques to apply, explore the dataset to understand its characteristics.

**Things to investigate**:
- Display basic information about the dataset (`.info()`, `.describe()`)
- Check for missing values
- Examine feature distributions (histograms, box plots)
- Look at feature scales and ranges

Use this exploration to inform your feature engineering decisions in the following tasks.
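The checks listed above can be condensed into one summary table. The following is a minimal sketch on synthetic data, not the housing dataset: `toy_df`, its column names, and the `max_over_p99` ratio are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a numeric table (values are made up for illustration).
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "symmetric": rng.normal(loc=30.0, scale=10.0, size=2000),
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=2000),
})

summary = pd.DataFrame({
    # Missing values per column
    "missing": toy_df.isnull().sum(),
    # Sample skewness: values well above 0 indicate a long right tail
    "skew": toy_df.skew(),
    # Maximum relative to the 99th percentile: values far above 1 hint at extreme outliers
    "max_over_p99": toy_df.max() / toy_df.quantile(0.99),
})
print(summary)
```

Columns with high skew are candidates for log or power transforms, and columns with a large maximum relative to the 99th percentile are candidates for clipping.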

In [3]:
housing_df.sample(20)

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
7223,1.8715,34.0,2.942116,1.01996,1962.0,3.916168,34.02,-118.16,1.396
18989,5.0346,2.0,8.813167,1.307829,1791.0,3.186833,38.36,-121.94,2.051
17119,4.0039,36.0,5.062762,0.987448,981.0,4.104603,37.48,-122.14,1.489
19501,2.1528,31.0,4.457286,1.090452,614.0,3.085427,37.66,-121.03,0.755
13310,4.6167,11.0,4.694836,0.91784,1503.0,3.528169,34.09,-117.61,1.44
11422,5.6825,24.0,6.699219,1.070312,1601.0,3.126953,33.71,-117.98,2.646
18065,11.1557,24.0,7.921875,1.088542,539.0,2.807292,37.24,-122.04,5.00001
16471,3.2993,25.0,5.553191,1.148936,789.0,3.357447,38.13,-121.25,0.911
12219,5.5351,4.0,8.58427,1.25,2088.0,2.932584,33.55,-117.27,4.29
12075,5.195,23.0,6.18239,0.991614,1671.0,3.503145,33.86,-117.6,1.61


### Sample view
- MedInc (median income): 1.8715 to 11.1567
- HouseAge (house age): 2.0 to 52.0
- AveRooms (average number of rooms): 2.942116 to 8.813167; clip values above 25 to 25
- AveBedrms (average number of bedrooms): 0.917840 to 1.187970; clip values above 5 to 5
- Population (city population): 614.0 to 2212.0; clip values above 10,000 to 10,000
- AveOccup (average number of people per house): 2.445110 to 4.776243; clip values above 10 to 10
- Latitude: 33.55 to 38.68; bin latitude and longitude into a geohash
- Longitude: -122.43 to -117.34
- MedHouseVal (median house value, the target): 0.71400 to 5.00001

In [4]:
# YOUR CODE HERE
housing_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   MedInc       20640 non-null  float64
 1   HouseAge     20640 non-null  float64
 2   AveRooms     20640 non-null  float64
 3   AveBedrms    20640 non-null  float64
 4   Population   20640 non-null  float64
 5   AveOccup     20640 non-null  float64
 6   Latitude     20640 non-null  float64
 7   Longitude    20640 non-null  float64
 8   MedHouseVal  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB


In [5]:
housing_df.describe()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
count,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0
mean,3.870671,28.639486,5.429,1.096675,1425.476744,3.070655,35.631861,-119.569704,2.068558
std,1.899822,12.585558,2.474173,0.473911,1132.462122,10.38605,2.135952,2.003532,1.153956
min,0.4999,1.0,0.846154,0.333333,3.0,0.692308,32.54,-124.35,0.14999
25%,2.5634,18.0,4.440716,1.006079,787.0,2.429741,33.93,-121.8,1.196
50%,3.5348,29.0,5.229129,1.04878,1166.0,2.818116,34.26,-118.49,1.797
75%,4.74325,37.0,6.052381,1.099526,1725.0,3.282261,37.71,-118.01,2.64725
max,15.0001,52.0,141.909091,34.066667,35682.0,1243.333333,41.95,-114.31,5.00001


In [6]:
housing_df.isnull().sum()

MedInc         0
HouseAge       0
AveRooms       0
AveBedrms      0
Population     0
AveOccup       0
Latitude       0
Longitude      0
MedHouseVal    0
dtype: int64

In [7]:
to_chart = ["MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup"]

# Determine the relationship between extreme MedHouseVal values and the other features
extreme_highend_housing_df = housing_df[housing_df['MedHouseVal'] >= 5]
extreme_lowend_housing_df = housing_df[housing_df['MedHouseVal'] <= 1]


# Explore features with histograms and PDFs on the same chart
for feature in to_chart:
    fig, ax = plt.subplots(figsize=(10, 4))
    
    # Plot histogram with density normalization
    ax.hist(
        extreme_highend_housing_df[feature],
        bins=300,
        density=True,
        alpha=0.7,
        label='High-end (MedHouseVal >= 5)',
        color='green'
    )
    ax.hist(
        housing_df[feature],
        bins=300,
        density=True,
        alpha=0.4,
        label='All Data',
        color='blue'
    )
    ax.hist(
        extreme_lowend_housing_df[feature],
        bins=300,
        density=True,
        alpha=0.4,
        label='Low-end (MedHouseVal <= 1)',
        color='red'
    )
    
    # Plot PDF for high-end subset
    extreme_highend_housing_df[feature].plot(
        kind='density',
        ax=ax,
        linewidth=2,
        label='High-end PDF',
        color='green',
        alpha=0.5
    )
    # Plot PDF for full data
    housing_df[feature].plot(
        kind='density',
        ax=ax,
        linewidth=2,
        label='All Data PDF',
        color='blue',
        alpha=0.5
    )
    # Plot PDF for low-end subset
    extreme_lowend_housing_df[feature].plot(
        kind='density',
        ax=ax,
        linewidth=2,
        label='Low-end PDF',
        color='red',
        alpha=0.5
    )
    
    ax.set_title(f"Binned Count and PDF of {feature} -- Pre-modifications")
    ax.set_xlabel(feature)
    ax.set_ylabel("Density")
    ax.legend()
    plt.tight_layout()
    plt.suptitle('Extreme High, Low, and All Data Values for MedHouseVal', y=1.02)
    plt.show()

### Perform the following operations:
- MedInc (median income): sample range 1.8715 to 11.1567; clip to the range 1.2 to 10.5
- HouseAge (house age): sample range 2.0 to 52.0; no change
- AveRooms (average number of rooms): sample range 2.942116 to 8.813167; clip values above 25 to 25
- AveBedrms (average number of bedrooms): sample range 0.917840 to 1.187970; clip values above 5 to 5
- Population (city population): sample range 614.0 to 2212.0; clip to the range 10 to 10,000
- AveOccup (average number of people per house): sample range 2.445110 to 4.776243; clip values above 10 to 10

In [8]:
# See if I can get the shape of California
coords = housing_df[['Latitude', 'Longitude']].values

plt.figure(figsize=(12, 9))
scatter = plt.scatter(coords[:, 1], coords[:, 0], s=1000/housing_df['Population'], cmap='rainbow', alpha=0.6, c=housing_df['MedHouseVal'])
plt.legend(*scatter.legend_elements(), title="MedHouseVal")
plt.colorbar(label='MedHouseVal')
plt.title('California Housing Locations with Population Size')

Text(0.5, 1.0, 'California Housing Locations with Population Size')

## Task 2: Apply your first feature engineering technique

Based on your exploration, apply your first feature engineering technique.

**Example approaches**:
- Transform skewed features using log, sqrt, power, or quantile transformations
- Create bins/categories from continuous variables
- Create interaction features (e.g., rooms per household = total rooms / households)
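A minimal sketch of the three approaches above on a toy table. The columns `total_rooms`, `households`, and `income` are hypothetical (the housing dataset in this notebook stores per-household averages instead), and the bin edges are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical column names (not the columns of housing_df).
toy_df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "households": [126.0, 1138.0, 177.0, 219.0],
    "income": [8.3, 8.3, 7.3, 1.5],
})

# 1. Log transform: log1p(x) = log(1 + x) compresses a long right tail
toy_df["log_total_rooms"] = np.log1p(toy_df["total_rooms"])

# 2. Binning: turn a continuous variable into ordered categories
toy_df["income_bin"] = pd.cut(
    toy_df["income"],
    bins=[0, 3, 6, np.inf],
    labels=["low", "mid", "high"],
)

# 3. Interaction (ratio) feature: rooms per household
toy_df["rooms_per_household"] = toy_df["total_rooms"] / toy_df["households"]

print(toy_df)
```

Binned columns are categorical, so they still need to be encoded (for example with `pd.get_dummies`) before a linear regression can use them.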

In [9]:
# Clip extreme values so a handful of outliers cannot dominate the linear fit.
# HouseAge is left unchanged.

housing_df['AveRooms'] = housing_df['AveRooms'].clip(upper=25)
housing_df['AveBedrms'] = housing_df['AveBedrms'].clip(upper=5)
housing_df['Population'] = housing_df['Population'].clip(lower=10, upper=10000)   # Current best: lower=10, upper=10000
housing_df['AveOccup'] = housing_df['AveOccup'].clip(upper=10)                    # Current best: upper=10
housing_df['MedInc'] = housing_df['MedInc'].clip(lower=1.2, upper=10.5)           # Current best: lower=1.2, upper=10.5

housing_df.describe()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
count,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0
mean,3.852856,28.639486,5.396687,1.089361,1421.00436,2.931212,35.631861,-119.569704,2.068558
std,1.797798,12.585558,1.779327,0.285856,1071.514855,0.821141,2.135952,2.003532,1.153956
min,1.2,1.0,0.846154,0.333333,10.0,0.692308,32.54,-124.35,0.14999
25%,2.5634,18.0,4.440716,1.006079,787.0,2.429741,33.93,-121.8,1.196
50%,3.5348,29.0,5.229129,1.04878,1166.0,2.818116,34.26,-118.49,1.797
75%,4.74325,37.0,6.052381,1.099526,1725.0,3.282261,37.71,-118.01,2.64725
max,10.5,52.0,25.0,5.0,10000.0,10.0,41.95,-114.31,5.00001


In [10]:
to_chart = ["MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup"]

# Determine the relationship between extreme MedHouseVal values and the other features
extreme_highend_housing_df = housing_df[housing_df['MedHouseVal'] >= 5]
extreme_lowend_housing_df = housing_df[housing_df['MedHouseVal'] <= 1]


# Explore features with histograms and PDFs on the same chart
for feature in to_chart:
    fig, ax = plt.subplots(figsize=(10, 4))
    
    # Plot histogram with density normalization
    ax.hist(
        extreme_highend_housing_df[feature],
        bins=300,
        density=True,
        alpha=0.7,
        label='High-end (MedHouseVal >= 5)',
        color='green'
    )
    ax.hist(
        housing_df[feature],
        bins=300,
        density=True,
        alpha=0.4,
        label='All Data',
        color='blue'
    )
    ax.hist(
        extreme_lowend_housing_df[feature],
        bins=300,
        density=True,
        alpha=0.4,
        label='Low-end (MedHouseVal <= 1)',
        color='red'
    )
    
    # Plot PDF for high-end subset
    extreme_highend_housing_df[feature].plot(
        kind='density',
        ax=ax,
        linewidth=2,
        label='High-end PDF',
        color='green',
        alpha=0.5
    )
    # Plot PDF for full data
    housing_df[feature].plot(
        kind='density',
        ax=ax,
        linewidth=2,
        label='All Data PDF',
        color='blue',
        alpha=0.5
    )
    # Plot PDF for low-end subset
    extreme_lowend_housing_df[feature].plot(
        kind='density',
        ax=ax,
        linewidth=2,
        label='Low-end PDF',
        color='red',
        alpha=0.5
    )
    
    ax.set_title(f"Binned Count and PDF of {feature} -- Post-data clipping")
    ax.set_xlabel(feature)
    ax.set_ylabel("Density")
    ax.legend()
    plt.tight_layout()
    plt.suptitle('Extreme High, Low, and All Data Values for MedHouseVal', y=1.02)
    plt.show()

### Model boosting: seems to overfit (too strong a model)
- Final R² score was 0.96 with nearly zero deviation, so it is likely an upper bound for the dataset.
- Boosting has always felt a little suspect to me anyway, although the model doesn't overfit as badly if I boost with another LinearRegression model (likely due to its simplicity).
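One possible reading of the last observation, sketched on synthetic data rather than the experiment described above: when a second `LinearRegression` is fitted to the residuals of a first one using the same features, ordinary least squares has already removed everything linear, so the second stage has nothing left to learn. The data, the variable names, and the two-stage setup are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(315)
X = rng.normal(size=(500, 3))
# Synthetic target with a nonlinear term that a linear model cannot capture
y = 2.0 * X[:, 0] - X[:, 1] + 0.5 * X[:, 2] ** 2 + rng.normal(scale=0.1, size=500)

# Stage one: ordinary least squares on the raw features
stage_one = LinearRegression().fit(X, y)
residuals = y - stage_one.predict(X)

# Stage two: another linear model fitted to what stage one got wrong
stage_two = LinearRegression().fit(X, residuals)

print("stage-two coefficients:", stage_two.coef_)
print("stage-two R^2 on residuals:", stage_two.score(X, residuals))
```

The stage-two coefficients come out at numerical zero because least-squares residuals are orthogonal to the features they were fitted on. A stronger second-stage learner, such as a tree, would not be limited in this way.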

In [11]:
# Linear Regression Boosting
#X = housing_df.drop(columns=['MedHouseVal'])
#y = housing_df['MedHouseVal']
#lin_reg = LinearRegression()
#cv_scores = cross_val_score(lin_reg, X, y, cv=5, scoring='r2')
#
#print(f"Linear Regression CV R² Scores: {cv_scores}")
#print(f"Mean R² Score: {np.mean(cv_scores)}")
#
## Fit and evaluate the model
#lin_reg.fit(X, y)
#r2_score = lin_reg.score(X, y)
#print(f"R² Score on full data: {r2_score}")
## The mean R² score from cross-validation and the R² score on the full data provide insights into the model's performance.
#
## Make predictions
#lin_reg_preds = lin_reg.predict(X)
#
## Plot predictions vs actual values
#plt.figure(figsize=(10, 6))
#plt.scatter(y, lin_reg_preds, alpha=0.5)

### Binning 

In [12]:
housing_df.describe()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
count,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0
mean,3.852856,28.639486,5.396687,1.089361,1421.00436,2.931212,35.631861,-119.569704,2.068558
std,1.797798,12.585558,1.779327,0.285856,1071.514855,0.821141,2.135952,2.003532,1.153956
min,1.2,1.0,0.846154,0.333333,10.0,0.692308,32.54,-124.35,0.14999
25%,2.5634,18.0,4.440716,1.006079,787.0,2.429741,33.93,-121.8,1.196
50%,3.5348,29.0,5.229129,1.04878,1166.0,2.818116,34.26,-118.49,1.797
75%,4.74325,37.0,6.052381,1.099526,1725.0,3.282261,37.71,-118.01,2.64725
max,10.5,52.0,25.0,5.0,10000.0,10.0,41.95,-114.31,5.00001


In [13]:
# Binning the Population feature based on observed distributions
#from scipy.sparse import issparse
#from sklearn.preprocessing import KBinsDiscretizer
#
#feature_to_bin = 'Population'
#
## Quantile methods available: 
## “inverted_cdf”, “averaged_inverted_cdf”,“closest_observation”, “interpolated_inverted_cdf”, “hazen”, “weibull”, “linear”, “median_unbiased”, “normal_unbiased”
#
#kbd = KBinsDiscretizer(encode='ordinal', n_bins=300, strategy='quantile', quantile_method='weibull')
#KBD_MedInc = kbd.fit_transform(housing_df[[feature_to_bin]])
#            
#names = kbd.get_feature_names_out([feature_to_bin])
#
## Convert sparse matrix to dense NumPy array
#if issparse(KBD_MedInc):
#    KBD_MedInc = KBD_MedInc.toarray()
#
## Remake the KBD_MedInc into a DataFrame and join to housing_df
#KBD_MedInc_df = pd.DataFrame(KBD_MedInc, columns=names)
#
#housing_df = housing_df.join(KBD_MedInc_df, rsuffix='_binned')
#housing_df

In [14]:
# Binning the AveRooms feature based on observed distributions

#feature_to_bin = 'AveRooms'
#
#kbd = KBinsDiscretizer(encode='ordinal', n_bins=300, strategy='kmeans')
#KBD_MedInc = kbd.fit_transform(housing_df[[feature_to_bin]])
#            
#names = kbd.get_feature_names_out([feature_to_bin])
#
## Convert sparse matrix to dense NumPy array
#if issparse(KBD_MedInc):
#    KBD_MedInc = KBD_MedInc.toarray()
#
## Remake the KBD_MedInc into a DataFrame and join to housing_df
#KBD_MedInc_df = pd.DataFrame(KBD_MedInc, columns=names)
#
#housing_df = housing_df.join(KBD_MedInc_df, rsuffix='_binned')
#housing_df

### Feature scaling (log transforms)

In [15]:
housing_df.describe()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
count,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0
mean,3.852856,28.639486,5.396687,1.089361,1421.00436,2.931212,35.631861,-119.569704,2.068558
std,1.797798,12.585558,1.779327,0.285856,1071.514855,0.821141,2.135952,2.003532,1.153956
min,1.2,1.0,0.846154,0.333333,10.0,0.692308,32.54,-124.35,0.14999
25%,2.5634,18.0,4.440716,1.006079,787.0,2.429741,33.93,-121.8,1.196
50%,3.5348,29.0,5.229129,1.04878,1166.0,2.818116,34.26,-118.49,1.797
75%,4.74325,37.0,6.052381,1.099526,1725.0,3.282261,37.71,-118.01,2.64725
max,10.5,52.0,25.0,5.0,10000.0,10.0,41.95,-114.31,5.00001


In [16]:
housing_df['log_MedInc'] = np.log1p(housing_df['MedInc'])
housing_df['log_Population'] = np.log1p(housing_df['Population'])
housing_df['log_AveRooms'] = np.log1p(housing_df['AveRooms'])
housing_df['log_AveBedrms'] = np.log1p(housing_df['AveBedrms'])
housing_df['log_AveOccup'] = np.log1p(housing_df['AveOccup'])

housing_df['log_log_Population'] = np.log1p(housing_df['log_Population'])

housing_df.describe()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal,log_MedInc,log_Population,log_AveRooms,log_AveBedrms,log_AveOccup,log_log_Population
count,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0
mean,3.852856,28.639486,5.396687,1.089361,1421.00436,2.931212,35.631861,-119.569704,2.068558,1.517154,7.02534,1.827947,0.731264,1.350089,2.077973
std,1.797798,12.585558,1.779327,0.285856,1071.514855,0.821141,2.135952,2.003532,1.153956,0.349625,0.734189,0.225777,0.094317,0.189723,0.099243
min,1.2,1.0,0.846154,0.333333,10.0,0.692308,32.54,-124.35,0.14999,0.788457,2.397895,0.613104,0.287682,0.526093,1.223156
25%,2.5634,18.0,4.440716,1.006079,787.0,2.429741,33.93,-121.8,1.196,1.270715,6.669498,1.693911,0.696182,1.232485,2.037251
50%,3.5348,29.0,5.229129,1.04878,1166.0,2.818116,34.26,-118.49,1.797,1.511781,7.062192,1.829236,0.717245,1.339757,2.087185
75%,4.74325,37.0,6.052381,1.099526,1725.0,3.282261,37.71,-118.01,2.64725,1.748025,7.453562,1.953365,0.741712,1.454481,2.134588
max,10.5,52.0,25.0,5.0,10000.0,10.0,41.95,-114.31,5.00001,2.442347,9.21044,3.258097,1.791759,2.397895,2.323411


In [17]:
housing_df

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal,log_MedInc,log_Population,log_AveRooms,log_AveBedrms,log_AveOccup,log_log_Population
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23,4.526,2.232720,5.777652,2.077455,0.704982,1.268511,1.913631
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22,3.585,2.230165,7.784057,1.979364,0.678988,1.134572,2.172938
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24,3.521,2.111110,6.208590,2.228738,0.729212,1.335596,1.975273
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413,1.893579,6.326149,1.919471,0.729025,1.266369,1.991450
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422,1.578195,6.338594,1.985385,0.732888,1.157342,1.993147
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09,0.781,0.940124,6.740519,1.799307,0.757686,1.269931,2.046469
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21,0.771,1.268861,5.877736,1.962070,0.839751,1.416534,1.928289
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22,0.923,0.993252,6.915723,1.825443,0.751460,1.201661,2.068851
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32,0.847,1.053336,6.609349,1.845223,0.775611,1.138861,2.029378


### PolynomialFeatures

In [18]:
x_poly = PolynomialFeatures(degree=3)  #3
new_x = x_poly.fit_transform(housing_df.drop(columns=['MedHouseVal']))
new_feature_names = x_poly.get_feature_names_out()

new_housing_df = pd.DataFrame(new_x, columns=new_feature_names)
new_housing_df['MedHouseVal'] = housing_df['MedHouseVal']
housing_df = new_housing_df.copy(deep=True)
housing_df

Unnamed: 0,1,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,log_MedInc,...,log_AveBedrms^2 log_AveOccup,log_AveBedrms^2 log_log_Population,log_AveBedrms log_AveOccup^2,log_AveBedrms log_AveOccup log_log_Population,log_AveBedrms log_log_Population^2,log_AveOccup^3,log_AveOccup^2 log_log_Population,log_AveOccup log_log_Population^2,log_log_Population^3,MedHouseVal
0,1.0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23,2.232720,...,0.630449,0.951073,1.134401,1.711316,2.581631,2.041188,3.079263,4.645267,7.007683,4.526
1,1.0,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22,2.230165,...,0.523065,1.001777,0.874029,1.673945,3.205950,1.460481,2.797122,5.357064,10.259879,3.585
2,1.0,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24,2.111110,...,0.710203,1.050352,1.300780,1.923783,2.845170,2.382456,3.523523,5.211100,7.706934,3.521
3,1.0,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,1.893579,...,0.673047,1.058412,1.169130,1.838537,2.891223,2.030862,3.193668,5.022258,7.897839,3.413
4,1.0,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,1.578195,...,0.621637,1.070567,0.981660,1.690591,2.911495,1.550193,2.669704,4.597700,7.918049,3.422
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20635,1.0,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09,0.940124,...,0.729052,1.174852,1.221938,1.969129,3.173214,2.048048,3.300390,5.318514,8.570682,0.781
20636,1.0,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21,1.268861,...,0.998913,1.359793,1.685018,2.293769,3.122445,2.842374,3.869246,5.267100,7.169960,0.771
20637,1.0,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22,0.993252,...,0.678568,1.168263,1.085099,1.868172,3.216356,1.735184,2.987397,5.143282,8.854982,0.923
20638,1.0,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32,1.053336,...,0.685108,1.220819,1.005972,1.792577,3.194258,1.477108,2.632112,4.690255,8.357735,0.847
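The width of the expanded table can be checked with a short calculation. With 14 input columns (8 original predictors plus 6 log-transformed ones) and `degree=3`, the number of monomials of total degree at most 3, including the constant column `1`, is the binomial coefficient C(14 + 3, 3). The sketch below compares that count with what `PolynomialFeatures` reports; the all-zeros input is only a placeholder for fitting.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_inputs = 14  # 8 original predictors + 6 log-transformed columns
degree = 3

# Monomials of total degree <= 3 in 14 variables, constant term included
expected_columns = comb(n_inputs + degree, degree)

# Fit on a placeholder row just to let scikit-learn count the output features
poly = PolynomialFeatures(degree=degree).fit(np.zeros((1, n_inputs)))

print(expected_columns, poly.n_output_features_)
```

Both values are 680, so the expansion turns 14 predictors into 680 columns before the target is added back. Many of these columns are strongly correlated with one another.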


### Geohashing

In [19]:
# Create a fine-grained geohash feature (precision 3) from latitude and longitude
housing_df['geohash'] = housing_df.apply(
    lambda row: pgh.encode(row['Latitude'], row['Longitude'], precision=3),
    axis=1
)
# Create a coarser region feature (precision 2) from latitude and longitude
housing_df['region'] = housing_df.apply(
    lambda row: pgh.encode(row['Latitude'], row['Longitude'], precision=2),
    axis=1
)

# Convert the geohash strings to integer category codes
housing_df['geohash_code'] = pd.Categorical(housing_df['geohash']).codes
housing_df['region_code'] = pd.Categorical(housing_df['region']).codes

# Drop the string columns, keep only the numeric codes
housing_df = housing_df.drop('geohash', axis=1)
housing_df = housing_df.drop('region', axis=1)

# One-hot encode the codes so the linear model does not treat them as ordered numbers
housing_df = pd.get_dummies(data=housing_df, columns=['region_code'], drop_first=True)
housing_df = pd.get_dummies(data=housing_df, columns=['geohash_code'], drop_first=True)
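To show what the precision argument means, here is a minimal pure-Python sketch of the standard geohash scheme. `geohash_encode` is a hypothetical helper written for illustration, not pygeohash's implementation, and it is only assumed that `pgh.encode` follows the same standard scheme.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(latitude: float, longitude: float, precision: int) -> str:
    """Illustrative geohash encoder (hypothetical helper, not pygeohash's code)."""
    lat_interval = [-90.0, 90.0]
    lon_interval = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_longitude = True  # bits alternate between longitude and latitude, longitude first

    while len(chars) < precision:
        interval, value = (
            (lon_interval, longitude) if use_longitude else (lat_interval, latitude)
        )
        midpoint = (interval[0] + interval[1]) / 2
        if value >= midpoint:
            bits = (bits << 1) | 1   # upper half: emit 1 and raise the lower bound
            interval[0] = midpoint
        else:
            bits = bits << 1         # lower half: emit 0 and lower the upper bound
            interval[1] = midpoint
        use_longitude = not use_longitude
        bit_count += 1
        if bit_count == 5:           # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0

    return "".join(chars)


# First row of the dataset: latitude 37.88, longitude -122.23
fine = geohash_encode(37.88, -122.23, precision=3)
coarse = geohash_encode(37.88, -122.23, precision=2)
print(fine, coarse)
```

Each extra character splits the current cell into 32 smaller ones, so the precision-2 hash is always a prefix of the precision-3 hash for the same point. The region feature is therefore a coarser grouping of the same cells as the geohash feature.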

## Task 3: Apply your second feature engineering technique

**Example approaches**:
- Scale features to similar ranges
- Encode any categorical variables you created
- Create aggregate statistics by groups
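A minimal sketch of the last two approaches above, on a toy table. The `region` and `income` columns and their values are made up for illustration.

```python
import pandas as pd

# Toy data with hypothetical columns
toy_df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "income": [2.0, 4.0, 6.0, 8.0, 10.0],
})

# Aggregate statistic by group: attach each region's mean income to its rows
toy_df["region_mean_income"] = toy_df.groupby("region")["income"].transform("mean")

# Deviation of each row from its group mean
toy_df["income_vs_region"] = toy_df["income"] - toy_df["region_mean_income"]

# Encode the categorical column; drop_first avoids a redundant dummy column
encoded_df = pd.get_dummies(toy_df, columns=["region"], drop_first=True)
print(encoded_df)
```

`transform("mean")` returns a result aligned with the original rows, which is why it can be assigned directly as a new column.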

In [20]:
# YOUR CODE HERE

Robust_scaler = RobustScaler()
scaled_housing_df = pd.DataFrame(
    Robust_scaler.fit_transform(housing_df.drop(columns=['MedHouseVal'])),
    columns=housing_df.drop(columns=['MedHouseVal']).columns
)
scaled_housing_df['MedHouseVal'] = housing_df['MedHouseVal']
housing_df = scaled_housing_df.copy(deep=True)
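For reference, with its default settings `RobustScaler` subtracts each column's median and divides by its interquartile range (75th minus 25th percentile). A minimal sketch on a made-up column with one large outlier:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One column with a large outlier (values are made up for illustration)
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(values)

# Manual equivalent of the default settings: (x - median) / IQR
median = np.median(values)
iqr = np.percentile(values, 75) - np.percentile(values, 25)
manual = (values - median) / iqr

print(scaled.ravel())
print(manual.ravel())
```

Because the median and IQR barely move when an outlier is added, the bulk of the data stays on a similar scale. The outlier itself is not removed, only rescaled.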


In [21]:
to_chart = ["MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup"]

# Determine the relationship between extreme MedHouseVal values and the other features
extreme_highend_housing_df = housing_df[housing_df['MedHouseVal'] >= 5]
extreme_lowend_housing_df = housing_df[housing_df['MedHouseVal'] <= 1]


# Explore features with histograms and PDFs on the same chart
for feature in to_chart:
    fig, ax = plt.subplots(figsize=(10, 4))
    
    # Plot histogram with density normalization
    ax.hist(
        extreme_highend_housing_df[feature],
        bins=300,
        density=True,
        alpha=0.7,
        label='High-end (MedHouseVal >= 5)',
        color='green'
    )
    ax.hist(
        housing_df[feature],
        bins=300,
        density=True,
        alpha=0.4,
        label='All Data',
        color='blue'
    )
    ax.hist(
        extreme_lowend_housing_df[feature],
        bins=300,
        density=True,
        alpha=0.4,
        label='Low-end (MedHouseVal <= 1)',
        color='red'
    )
    
    # Plot PDF for high-end subset
    extreme_highend_housing_df[feature].plot(
        kind='density',
        ax=ax,
        linewidth=2,
        label='High-end PDF',
        color='green',
        alpha=0.5
    )
    # Plot PDF for full data
    housing_df[feature].plot(
        kind='density',
        ax=ax,
        linewidth=2,
        label='All Data PDF',
        color='blue',
        alpha=0.5
    )
    # Plot PDF for low-end subset
    extreme_lowend_housing_df[feature].plot(
        kind='density',
        ax=ax,
        linewidth=2,
        label='Low-end PDF',
        color='red',
        alpha=0.5
    )
    
    ax.set_title(f"Binned Count and PDF of {feature} -- Post-scaling and Post-data clipping")
    ax.set_xlabel(feature)
    ax.set_ylabel("Density")
    ax.legend()
    plt.tight_layout()
    plt.suptitle('Extreme High, Low, and All Data Values for MedHouseVal', y=1.02)
    plt.show()

**Note**: Could likely get better performance by binning the upper echelon (top 1-10%) of MedInc.
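A minimal sketch of one way to single out the top of the income range, assuming a 90th-percentile cutoff and synthetic data. The column names `MedInc_top10` and `MedInc_above_p90` are hypothetical, and this idea was not evaluated in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for MedInc (not the real data)
rng = np.random.default_rng(315)
toy_df = pd.DataFrame({"MedInc": rng.lognormal(mean=1.2, sigma=0.5, size=1000)})

# Cutoff for the top 10% of values
threshold = toy_df["MedInc"].quantile(0.90)

# Indicator column: 1 for rows in the top decile, 0 otherwise
toy_df["MedInc_top10"] = (toy_df["MedInc"] > threshold).astype(int)

# Amount by which income exceeds the cutoff (0 below it), which lets a linear
# model fit a different slope for the highest incomes
toy_df["MedInc_above_p90"] = (toy_df["MedInc"] - threshold).clip(lower=0)

print(toy_df.describe())
```

The cutoff could be moved anywhere in the 90th to 99th percentile range mentioned in the note; 0.90 is only an example.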

## (Optional) Additional feature engineering

Add more techniques if you'd like to experiment further.

In [22]:
# Feature Selection with Lasso Regression
from sklearn.linear_model import LassoCV
X = housing_df.drop('MedHouseVal', axis=1)
y = housing_df['MedHouseVal']
lasso = LassoCV(cv=5, random_state=315)
lasso.fit(X, y)
importance = np.abs(lasso.coef_)
feature_names = X.columns

# Select features with non-zero importance
selected_features = feature_names[importance > 0]
print("Selected features:", selected_features)
X_selected = X[selected_features]

# Evaluate model with selected features
model = LinearRegression()
scores = cross_val_score(model, X_selected, y, cv=10, scoring='r2')
print("R^2 scores with selected features:", scores)
X_selected

# Remake housing_df with selected features
housing_df = X_selected.copy(deep=True)
housing_df['MedHouseVal'] = y

Selected features: Index(['MedInc', 'Longitude', 'MedInc^2', 'MedInc HouseAge', 'MedInc AveRooms',
       'MedInc Longitude', 'MedInc log_MedInc', 'MedInc log_Population',
       'MedInc log_AveRooms', 'HouseAge log_MedInc',
       ...
       'geohash_code_9', 'geohash_code_10', 'geohash_code_11',
       'geohash_code_12', 'geohash_code_13', 'geohash_code_14',
       'geohash_code_15', 'geohash_code_16', 'geohash_code_18',
       'geohash_code_20'],
      dtype='object', length=112)


R^2 scores with selected features: [0.68958724 0.78936297 0.68851617 0.60816572 0.76993906 0.65219228
 0.59443883 0.66449073 0.50182248 0.66124686]
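The ten per-fold scores above are easier to compare against the final evaluation once they are reduced to a mean and a spread. A minimal sketch, using only the values copied from the output above; the name `fold_scores` is introduced here for illustration and is not part of the notebook.

```python
from statistics import mean, stdev

# R² score for each of the 10 folds, copied from the cell output above
fold_scores = [
    0.68958724, 0.78936297, 0.68851617, 0.60816572, 0.76993906,
    0.65219228, 0.59443883, 0.66449073, 0.50182248, 0.66124686,
]

# Summarize the central tendency and the fold-to-fold variation
print(f'Mean R²: {mean(fold_scores):.3f}')
print(f'Std R²:  {stdev(fold_scores):.3f}')
print(f'Min/max: {min(fold_scores):.3f} / {max(fold_scores):.3f}')
```

A wide gap between the minimum and maximum fold suggests the model fits some parts of the data much better than others, which is worth keeping in mind when reading the mean.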


## Final dataframe

In [23]:
housing_df

| | MedInc | Longitude | MedInc^2 | MedInc HouseAge | MedInc AveRooms | MedInc Longitude | MedInc log_MedInc | MedInc log_Population | MedInc log_AveRooms | HouseAge log_MedInc | ... | geohash_code_10 | geohash_code_11 | geohash_code_12 | geohash_code_13 | geohash_code_14 | geohash_code_15 | geohash_code_16 | geohash_code_18 | geohash_code_20 | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.197582 | -0.986807 | 3.567069 | 3.037982 | 2.481050 | -2.272853 | 2.630926 | 1.485035 | 2.318376 | 1.732024 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.526 |
| 1 | 2.186664 | -0.984169 | 3.542225 | 0.999048 | 2.087821 | -2.261419 | 2.616156 | 2.538149 | 2.135153 | 0.202159 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.585 |
| 2 | 1.707732 | -0.989446 | 2.522385 | 3.478133 | 2.605101 | -1.774374 | 1.981999 | 1.291103 | 2.080711 | 2.356048 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.521 |
| 3 | 0.967177 | -0.992084 | 1.214873 | 2.453273 | 0.915528 | -1.020509 | 1.061153 | 0.694423 | 0.947302 | 1.968976 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.413 |
| 4 | 0.142854 | -0.992084 | 0.144307 | 1.312487 | 0.379598 | -0.181062 | 0.144262 | -0.027244 | 0.269435 | 1.407786 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.422 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | -0.905796 | -0.686016 | -0.631633 | -0.653085 | -0.627677 | 0.893745 | -0.770161 | -0.911016 | -0.754868 | -0.596182 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.781 |
| 20636 | -0.448655 | -0.717678 | -0.374046 | -0.567441 | -0.147815 | 0.431461 | -0.417090 | -0.623425 | -0.286246 | -0.618890 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.771 |
| 20637 | -0.841709 | -0.720317 | -0.603037 | -0.776487 | -0.567261 | 0.828257 | -0.726130 | -0.831994 | -0.692122 | -0.822638 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.923 |
| 20638 | -0.765007 | -0.746702 | -0.565590 | -0.718988 | -0.499125 | 0.750092 | -0.670853 | -0.794746 | -0.619543 | -0.751641 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.847 |


In [24]:
housing_df.columns

Index(['MedInc', 'Longitude', 'MedInc^2', 'MedInc HouseAge', 'MedInc AveRooms',
       'MedInc Longitude', 'MedInc log_MedInc', 'MedInc log_Population',
       'MedInc log_AveRooms', 'HouseAge log_MedInc',
       ...
       'geohash_code_10', 'geohash_code_11', 'geohash_code_12',
       'geohash_code_13', 'geohash_code_14', 'geohash_code_15',
       'geohash_code_16', 'geohash_code_18', 'geohash_code_20', 'MedHouseVal'],
      dtype='object', length=113)

## Model evaluation

Now we'll compare model performance on the original dataset versus your engineered dataset.

### Evaluate datasets

In [25]:
# Create output directory if it doesn't exist
output_directory = 'data/outputs'
Path(output_directory).mkdir(parents=True, exist_ok=True)

# Save a copy of the engineered dataframe
housing_df.to_csv('data/outputs/housing_df.csv', index=False)

In [26]:
# Create linear regression model
model = LinearRegression()

# Evaluate on original dataset
scores_original = cross_val_score(
    model,
    original_housing_df.drop('MedHouseVal', axis=1),
    original_housing_df['MedHouseVal'],
    cv=10,
    scoring='r2'
)

# Evaluate on engineered dataset
scores_engineered = cross_val_score(
    model,
    housing_df.drop('MedHouseVal', axis=1),
    housing_df['MedHouseVal'],
    cv=10,
    scoring='r2'
)

engineered_mean = scores_engineered.mean()
original_mean = scores_original.mean()
mean_improvement = (engineered_mean - original_mean) / original_mean

print(f'\nMean improvement: {mean_improvement:.2f}%')
print(mean_improvement)


Mean improvement: 0.30%
0.2954351174206833


#### Current best: 0.29638208712775677
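Note that `mean_improvement` in the evaluation cell is a relative change expressed as a fraction, so the printed `0.30%` corresponds to a relative gain of roughly 30% in mean R². The sketch below shows the conversion using the fraction from the output above. The helper `relative_improvement` and the example scores are hypothetical and not part of the notebook; the evaluation cell itself should stay unchanged so that leaderboard scoring keeps working.

```python
def relative_improvement(engineered_mean, original_mean):
    """Relative change of the engineered score over the original, as a fraction."""
    return (engineered_mean - original_mean) / original_mean


# Fraction printed by the evaluation cell above
mean_improvement = 0.2954351174206833

# The '%' format specifier multiplies by 100 before printing
print(f'Mean improvement: {mean_improvement:.2%}')

# Hypothetical mean R² scores, for illustration only
example = relative_improvement(engineered_mean=0.66, original_mean=0.50)
print(f'Example: {example:.2%}')
```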

### Visualize model performance comparison

In [27]:
original_model = LinearRegression()
original_model.fit(original_housing_df.drop('MedHouseVal', axis=1), original_housing_df['MedHouseVal'])
original_predictions = original_model.predict(original_housing_df.drop('MedHouseVal', axis=1))

model = LinearRegression()
model.fit(housing_df.drop('MedHouseVal', axis=1), housing_df['MedHouseVal'])
predictions = model.predict(housing_df.drop('MedHouseVal', axis=1))

# Create boxplot comparing performance
data_to_plot = [scores_original, scores_engineered]
labels = ['Original', 'Engineered']

fig, axs = plt.subplots(1, 2, figsize=(9,4.5))

fig.suptitle(f'Model performance comparison\nmean improvement: {mean_improvement:.2f}%')

axs[0].set_title('Cross validation R² scores')
axs[0].boxplot(data_to_plot, tick_labels=labels)
axs[0].set_xlabel('Dataset')
axs[0].set_ylabel('R² score')

axs[1].set_title('Predictions vs true values')
axs[1].plot(
    original_housing_df['MedHouseVal'], original_predictions,
    'o', markersize=1, label='Original', alpha=0.25
)

axs[1].plot(
    housing_df['MedHouseVal'], predictions,
    'o', markersize=1, label='Engineered', alpha=0.25
)

axs[1].set_xlabel('True Values')
axs[1].set_ylabel('Predictions')

leg = axs[1].legend(loc='upper left', markerscale=8, framealpha=1)

for lh in leg.legend_handles: 
    lh.set_alpha(1)

plt.tight_layout()
plt.show()

## 3. Reflection

**Questions to consider**:

1. Which feature engineering techniques had the biggest impact on model performance?
2. Did adding more features always improve performance, or did some hurt it?
3. How might you further improve the engineered dataset?
4. What trade-offs did you consider (e.g., interpretability vs performance, complexity vs gains)?
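One way to approach questions 1 and 2 is an ablation: remove one group of engineered columns at a time and see how much the mean cross-validated R² changes. The following is a minimal sketch on synthetic stand-in data, not the housing dataset. The function `group_ablation`, the column names, and the group labels are all assumptions made for illustration; with the real dataframe, the groups would be whichever sets of columns each technique produced.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def group_ablation(df, target, groups, cv=10):
    """Return the baseline mean R² and the drop in mean R² when each group is removed."""
    features = df.drop(columns=target)
    labels = df[target]
    baseline = cross_val_score(
        LinearRegression(), features, labels, cv=cv, scoring='r2'
    ).mean()

    drops = {}
    for name, columns in groups.items():
        # Refit without this group of columns and record the lost performance
        reduced = features.drop(columns=columns)
        score = cross_val_score(
            LinearRegression(), reduced, labels, cv=cv, scoring='r2'
        ).mean()
        drops[name] = baseline - score

    return baseline, drops


# Synthetic stand-in data: one informative column and two pure-noise columns
rng = np.random.default_rng(315)
demo_df = pd.DataFrame({
    'signal': rng.normal(size=500),
    'noise_a': rng.normal(size=500),
    'noise_b': rng.normal(size=500),
})
demo_df['target'] = 3.0 * demo_df['signal'] + rng.normal(scale=0.5, size=500)

baseline, drops = group_ablation(
    demo_df,
    'target',
    groups={'informative': ['signal'], 'uninformative': ['noise_a', 'noise_b']},
    cv=5,
)

print(f'Baseline mean R²: {baseline:.3f}')
for name, drop in drops.items():
    print(f'Removing {name}: mean R² changes by {-drop:+.3f}')
```

A large drop marks a group the model depends on. A drop near zero, or a negative one, marks a group that adds complexity without helping, which bears directly on the interpretability and complexity trade-offs in question 4.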

**Your reflection**:

*Write your thoughts here...*