# Housing Investment Analysis

Note: In this project, I use the California Housing Prices dataset to practice an end-to-end machine learning workflow, with an emphasis on understanding the data and justifying each modeling decision.

We frame this as a supervised, univariate regression task: the data is labeled with each district's median house value, and the model predicts that single target for each district. We use batch learning because the model is trained once on the available data and won't be updated continuously.

## Get the data

In [1]:
# Import Libraries
import urllib.request
import tarfile
from pathlib import Path
import pandas as pd

tarball_path = Path("datasets/housing.tgz")  # where the tarball is saved to or read from

def get_housing_data():
    """Download the housing tarball if needed, extract it, and load the CSV."""
    if not tarball_path.is_file():
        # Create the parent folder if it doesn't exist
        tarball_path.parent.mkdir(parents=True, exist_ok=True)

        url = "https://github.com/ageron/data/raw/main/housing.tgz"  # target URL
        urllib.request.urlretrieve(url, tarball_path)

    # filter="data" rejects unsafe archive members (requires a Python version with extraction filters)
    with tarfile.open(tarball_path) as housing_tarball:
        housing_tarball.extractall("datasets", filter="data")
    return pd.read_csv("datasets/housing/housing.csv")

housing_full = get_housing_data()

### A quick look at the data

In [2]:
housing_full.head(2)

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY


In [3]:
housing_full.shape

(20640, 10)

Our dataset contains 20,640 rows (districts) and 10 columns: nine features describing each district plus the target, median_house_value.

In [4]:
housing_full.describe()

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,20640.0,20640.0,20640.0,20640.0,20433.0,20640.0,20640.0,20640.0,20640.0
mean,-119.569704,35.631861,28.639486,2635.763081,537.870553,1425.476744,499.53968,3.870671,206855.816909
std,2.003532,2.135952,12.585558,2181.615252,421.38507,1132.462122,382.329753,1.899822,115395.615874
min,-124.35,32.54,1.0,2.0,1.0,3.0,1.0,0.4999,14999.0
25%,-121.8,33.93,18.0,1447.75,296.0,787.0,280.0,2.5634,119600.0
50%,-118.49,34.26,29.0,2127.0,435.0,1166.0,409.0,3.5348,179700.0
75%,-118.01,37.71,37.0,3148.0,647.0,1725.0,605.0,4.74325,264725.0
max,-114.31,41.95,52.0,39320.0,6445.0,35682.0,6082.0,15.0001,500001.0


In [5]:
housing_full.info()

<class 'pandas.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  str    
dtypes: float64(9), str(1)
memory usage: 1.6 MB


All features describing the districts are numerical, except ocean_proximity, which is categorical and will require encoding before modeling.

In [6]:
housing_full.isna().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

The total_bedrooms feature has 207 missing values, which we’ll need to handle before modeling.

In [7]:
# train test split

from sklearn.model_selection import train_test_split

In [8]:
X = housing_full.drop(columns="median_house_value").copy()
y = housing_full[["median_house_value"]].copy()

In [9]:
y.head(1)

,median_house_value
0,452600.0


In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)
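As a rough check of what this split does, the sketch below repeats the same call on placeholder data with the same number of rows as `housing_full`. The `toy_X` and `toy_y` frames are hypothetical stand-ins, not the real features. It shows the resulting set sizes and that a fixed `random_state` yields the same test rows on every run.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data with as many rows as housing_full (20,640)
toy_X = pd.DataFrame({"feature": np.arange(20640)})
toy_y = pd.DataFrame({"target": np.arange(20640)})

# Two independent calls with the same seed
split_a = train_test_split(toy_X, toy_y, train_size=0.8, random_state=42)
split_b = train_test_split(toy_X, toy_y, train_size=0.8, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))  # 80% / 20% of 20,640 rows

# The same seed selects the same rows, so the test set is reproducible
same_rows = X_te.index.equals(split_b[1].index)
print(same_rows)
```

Keeping the seed fixed matters because a test set that changes between runs would eventually expose the model to all of the data.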

In [11]:
X_test.head()

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity
20046,-122.38,40.67,10.0,2281.0,444.0,1274.0,438.0,2.212,INLAND
3024,-118.37,33.83,35.0,1207.0,207.0,601.0,213.0,4.7308,<1H OCEAN
15663,-117.24,32.72,39.0,3089.0,431.0,1175.0,432.0,7.5925,NEAR OCEAN
20484,-118.44,34.05,18.0,4780.0,1192.0,1886.0,1036.0,4.4674,<1H OCEAN
9814,-118.44,34.18,33.0,2127.0,414.0,1056.0,391.0,4.375,<1H OCEAN


In [12]:
# Let's deal with missing values

from sklearn.impute import SimpleImputer

In [13]:
num_X_train = X_train.select_dtypes(include="number")
num_X_train

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income
14196,-117.22,32.75,34.0,6001.0,1111.0,2654.0,1072.0,4.5878
8267,-117.03,32.69,10.0,901.0,163.0,698.0,167.0,4.6648
17445,-122.27,37.74,28.0,6909.0,1554.0,2974.0,1484.0,3.6875
14265,-121.82,37.25,25.0,4021.0,634.0,2178.0,650.0,5.1663
2271,-115.98,33.32,8.0,240.0,46.0,63.0,24.0,1.4688
...,...,...,...,...,...,...,...,...
11284,-122.37,37.94,49.0,969.0,229.0,599.0,195.0,1.3167
11964,-118.38,33.89,35.0,1778.0,330.0,732.0,312.0,6.5745
5390,-119.33,36.28,16.0,2624.0,527.0,1077.0,520.0,2.1250
860,-117.19,34.08,22.0,2467.0,555.0,1567.0,494.0,2.6536


In [14]:
num_X_train.isna().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        163
population              0
households              0
median_income           0
dtype: int64

In [19]:
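imputer = SimpleImputer(strategy="most_frequent")
# fit() learns one fill value per column and returns the fitted imputer itself,
# so num_tr refers to the estimator, not to imputed data
num_tr = imputer.fit(num_X_train)
num_tr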

SimpleImputer(strategy='most_frequent')
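To see what a fitted imputer has actually learned, one can inspect its `statistics_` attribute, which holds one fill value per column. The sketch below uses a small made-up frame (`toy_train` and its values are hypothetical) and compares `most_frequent` with `median`, a common alternative for numeric columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training data with one missing total_bedrooms value
toy_train = pd.DataFrame({
    "total_bedrooms": [100.0, 100.0, 300.0, np.nan, 900.0],
    "households": [50.0, 60.0, 70.0, 80.0, 90.0],
})

mode_imputer = SimpleImputer(strategy="most_frequent").fit(toy_train)
median_imputer = SimpleImputer(strategy="median").fit(toy_train)

# statistics_ holds the per-column fill values learned by fit()
print(mode_imputer.statistics_)
print(median_imputer.statistics_)
```

For a continuous feature where few values repeat, the most frequent value can sit far from the center of the distribution, so it is worth comparing the two before settling on a strategy.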


In [21]:
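# transform() returns a NumPy array, so wrap it to restore the column names and row index
num_tr_df = pd.DataFrame(imputer.transform(num_X_train), columns=num_X_train.columns, index=num_X_train.index)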

In [22]:
num_tr_df.isna().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
dtype: int64

In [28]:
pd.DataFrame({"num": [1,3,4]})

,num
0,1
1,3
2,4
