# Housing Investment Analysis

Note: In this project, I use the California Housing Prices dataset to practice an end-to-end machine learning workflow, with an emphasis on understanding the data and justifying each modeling decision.

We frame this as a supervised regression task trained with batch learning, since the model won’t be updated continuously. It is a univariate regression problem: for each district, we predict a single target variable, the median house value. Supervised learning applies because the data is labeled with that target (median_house_value), which is used to train the model.

## Get the data

In [32]:
# Import Libraries
import urllib.request
import tarfile
from pathlib import Path
import pandas as pd

tarball_path = Path("datasets/housing.tgz")  # where the tarball is saved to or read from

def get_housing_data():
    """Download the housing tarball if needed, extract it, and load the CSV."""
    if not tarball_path.is_file():
        # Create the parent folder if it doesn't exist
        tarball_path.parent.mkdir(parents=True, exist_ok=True)

        url = "https://github.com/ageron/data/raw/main/housing.tgz"  # target url
        urllib.request.urlretrieve(url, tarball_path)

    # filter="data" rejects unsafe archive members (e.g. absolute paths)
    with tarfile.open(tarball_path) as housing_tarball:
        housing_tarball.extractall("datasets", filter="data")
    return pd.read_csv("datasets/housing/housing.csv")

housing_full = get_housing_data()

### A quick look at the data

In [33]:
housing_full.head(2)

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |


In [34]:
housing_full.shape

(20640, 10)

Our dataset contains 20,640 rows (districts) and 10 columns: nine features describing each district, plus the target, median_house_value.

In [35]:
housing_full.describe()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.38507 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.75 | 296.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | -118.49 | 34.26 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | -118.01 | 37.71 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |


In [36]:
housing_full.info()

<class 'pandas.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  str    
dtypes: float64(9), str(1)
memory usage: 1.6 MB


All features describing the districts are numerical, except ocean_proximity, which is categorical and will require encoding before modeling.

In [37]:
housing_full.isna().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

The total_bedrooms feature has 207 missing values, which we’ll need to handle before modeling.
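One common way to handle such gaps is median imputation. The sketch below illustrates it on made-up toy values (not rows from the dataset), assuming the median is an acceptable fill value for this feature.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up toy values with two missing entries
toy = pd.DataFrame({"total_bedrooms": [129.0, np.nan, 190.0, 235.0, np.nan]})

# The imputer learns the median of the observed values during fit
imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(toy)

print(imputer.statistics_)  # learned median per column
print(filled.ravel())       # missing entries replaced by that median
```

Because the learned median is stored in `statistics_`, the same fitted imputer can later fill gaps in new data with the value learned from the data it was fitted on.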

In [14]:
# train test split

from sklearn.model_selection import train_test_split

In [50]:
# Separate the features (X) from the target (y)
X = housing_full.drop(columns="median_house_value").copy()
y = housing_full[["median_house_value"]].copy()

In [53]:
y.head(1)

|   | median_house_value |
|---|---|
| 0 | 452600.0 |


In [62]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

In [64]:
X_test.head()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|
| 20046 | -122.38 | 40.67 | 10.0 | 2281.0 | 444.0 | 1274.0 | 438.0 | 2.212 | INLAND |
| 3024 | -118.37 | 33.83 | 35.0 | 1207.0 | 207.0 | 601.0 | 213.0 | 4.7308 | <1H OCEAN |
| 15663 | -117.24 | 32.72 | 39.0 | 3089.0 | 431.0 | 1175.0 | 432.0 | 7.5925 | NEAR OCEAN |
| 20484 | -118.44 | 34.05 | 18.0 | 4780.0 | 1192.0 | 1886.0 | 1036.0 | 4.4674 | <1H OCEAN |
| 9814 | -118.44 | 34.18 | 33.0 | 2127.0 | 414.0 | 1056.0 | 391.0 | 4.375 | <1H OCEAN |


In [65]:
X_test.shape[0]/housing_full.shape[0]

0.2

In [68]:
# y_train is a one-column DataFrame; extract the training target as a Series
y_housing_train = y_train["median_house_value"]

In [69]:
y_housing_train.head(2)

14196    291000.0
8267     156100.0
Name: median_house_value, dtype: float64

In [77]:
# preprocessing

from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_selector, make_column_transformer
import numpy as np

In [72]:
num_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler()
)

In [76]:
cat_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder()
)

In [79]:
preprocessing = make_column_transformer(
    (num_pipeline, make_column_selector(dtype_include=np.number)),
    (cat_pipeline, make_column_selector(dtype_include=object))
)

In [81]:
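# Note: this fits on the full feature matrix X, which is fine for inspecting the output.
# When training a model, fit on X_train and only transform X_test to avoid data leakage.
X_prepared = preprocessing.fit_transform(X)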

In [82]:
X_prepared

array([[-1.32783522,  1.05254828,  0.98214266, ...,  0.        ,
         1.        ,  0.        ],
       [-1.32284391,  1.04318455, -0.60701891, ...,  0.        ,
         1.        ,  0.        ],
       [-1.33282653,  1.03850269,  1.85618152, ...,  0.        ,
         1.        ,  0.        ],
       ...,
       [-0.8237132 ,  1.77823747, -0.92485123, ...,  0.        ,
         0.        ,  0.        ],
       [-0.87362627,  1.77823747, -0.84539315, ...,  0.        ,
         0.        ,  0.        ],
       [-0.83369581,  1.75014627, -1.00430931, ...,  0.        ,
         0.        ,  0.        ]], shape=(20640, 13))

In [92]:
feature_names = preprocessing.get_feature_names_out()
feature_names

array(['pipeline-1__longitude', 'pipeline-1__latitude',
       'pipeline-1__housing_median_age', 'pipeline-1__total_rooms',
       'pipeline-1__total_bedrooms', 'pipeline-1__population',
       'pipeline-1__households', 'pipeline-1__median_income',
       'pipeline-2__ocean_proximity_<1H OCEAN',
       'pipeline-2__ocean_proximity_INLAND',
       'pipeline-2__ocean_proximity_ISLAND',
       'pipeline-2__ocean_proximity_NEAR BAY',
       'pipeline-2__ocean_proximity_NEAR OCEAN'], dtype=object)

In [94]:
X_df = pd.DataFrame(X_prepared, columns=feature_names)

In [95]:
X_df.isna().sum()

pipeline-1__longitude                     0
pipeline-1__latitude                      0
pipeline-1__housing_median_age            0
pipeline-1__total_rooms                   0
pipeline-1__total_bedrooms                0
pipeline-1__population                    0
pipeline-1__households                    0
pipeline-1__median_income                 0
pipeline-2__ocean_proximity_<1H OCEAN     0
pipeline-2__ocean_proximity_INLAND        0
pipeline-2__ocean_proximity_ISLAND        0
pipeline-2__ocean_proximity_NEAR BAY      0
pipeline-2__ocean_proximity_NEAR OCEAN    0
dtype: int64
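When the preprocessing feeds a model, the imputation medians, scaling statistics, and category list are usually learned from the training split only and then reused on the test split. The following is a minimal sketch of that pattern on a hypothetical toy frame (`toy_X`, with random made-up values). The `toy_`-prefixed names, `handle_unknown="ignore"`, and the `dtype_exclude=np.number` selector are illustrative assumptions, not part of the notebook above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with random values, not rows from the dataset
rng = np.random.default_rng(42)
toy_X = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, size=50),
    "total_bedrooms": rng.uniform(1.0, 6000.0, size=50),
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY", "<1H OCEAN"], size=50),
})
# Blank out every tenth value to mimic missing entries
toy_X.loc[toy_X.index[::10], "total_bedrooms"] = np.nan

toy_train, toy_test = train_test_split(toy_X, test_size=0.2, random_state=42)

toy_preprocessing = make_column_transformer(
    # Numeric columns: median imputation, then standardization
    (make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
     make_column_selector(dtype_include=np.number)),
    # Non-numeric columns: most-frequent imputation, then one-hot encoding
    (make_pipeline(SimpleImputer(strategy="most_frequent"),
                   OneHotEncoder(handle_unknown="ignore")),
     make_column_selector(dtype_exclude=np.number)),
)

# Learn all statistics from the training split only...
train_prepared = toy_preprocessing.fit_transform(toy_train)
# ...then apply the already-fitted transformers to the test split
test_prepared = toy_preprocessing.transform(toy_test)

print(train_prepared.shape, test_prepared.shape)
```

Calling `transform` rather than `fit_transform` on the test split keeps the test data from influencing the learned medians, means, and standard deviations.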