# California Median House Price Prediction
Using Ridge (linear_model) and Random Forest Regressor.  
Steps to follow:
> 1. Collect data
> 2. Prepare/Process the data (including splitting, encoding and scaling)
> 3. Fit the model
> 4. Test the model
> 5. Experiment (Evaluate and tune the model)
> 6. Choose second model and repeat

In [1]:
#import necessary packages
import pandas as pd
import numpy as np
from sklearn.linear_model import Ridge #algorithm 1
from sklearn.ensemble import RandomForestRegressor #algorithm 2

### Data Collection

Import dataset from Scikit-Learn

In [3]:
from sklearn.datasets import fetch_california_housing

housing_dataset_raw = fetch_california_housing()

### Data Preparation

First, explore the data to understand its makeup.

In [4]:
housing_dataset_raw

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]], shape=(20640, 8)),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894], shape=(20640,)),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': 

In [5]:
#extract the features and the target from the dataset object into DataFrames
x = pd.DataFrame(housing_dataset_raw.data, columns=housing_dataset_raw.feature_names)
y = pd.DataFrame(housing_dataset_raw.target, columns=housing_dataset_raw.target_names)

In [6]:
x.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |


In [9]:
x.shape

(20640, 8)

In [10]:
y.head()

| | MedHouseVal |
|---|---|
| 0 | 4.526 |
| 1 | 3.585 |
| 2 | 3.521 |
| 3 | 3.413 |
| 4 | 3.422 |


In [11]:
y.shape

(20640, 1)

In [12]:
x.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Latitude    20640 non-null  float64
 7   Longitude   20640 non-null  float64
dtypes: float64(8)
memory usage: 1.3 MB


In [13]:
y.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 1 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   MedHouseVal  20640 non-null  float64
dtypes: float64(1)
memory usage: 161.4 KB


In [22]:
x.isna().sum() #check for missing values in the features

MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
dtype: int64

In [21]:
y.isna().sum() #check for missing values in the target

MedHouseVal    0
dtype: int64

In [20]:
x.duplicated().sum() #checking for duplicates

np.int64(0)

From this brief look at the dataset, it is fully numerical and has no missing values or duplicate rows.
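The checks above can be bundled into one small helper. This is a minimal sketch: `summarise_quality` is a hypothetical name, and the toy frame holds made-up values rather than rows from the housing dataset.

```python
import numpy as np
import pandas as pd


def summarise_quality(df: pd.DataFrame) -> dict:
    """Return basic data-quality counts for a DataFrame."""
    return {
        "rows": len(df),
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "non_numeric_columns": list(df.select_dtypes(exclude="number").columns),
    }


# Toy frame (made-up values) with one missing value and one duplicated row
toy = pd.DataFrame({
    "MedInc": [8.3, 8.3, 7.2, np.nan],
    "HouseAge": [41.0, 41.0, 52.0, 52.0],
})

report = summarise_quality(toy)
print(report)
```

Calling the same helper on `x` should report zero missing values, zero duplicates, and no non-numeric columns, matching the individual checks above.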

**Splitting Test & Train Data**

In [None]:
from sklearn.model_selection import train_test_split

np.random.seed(30) #fix a seed for the random operations in this notebook so that the results can be replicated

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [26]:
x_train.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 7186 | 2.9167 | 39.0 | 4.544776 | 1.08209 | 690.0 | 5.149254 | 34.03 | -118.18 |
| 7686 | 4.5769 | 35.0 | 5.711268 | 1.06338 | 845.0 | 2.975352 | 33.93 | -118.1 |
| 6332 | 6.7978 | 24.0 | 6.589189 | 0.956757 | 610.0 | 3.297297 | 33.99 | -117.95 |
| 14192 | 4.0921 | 20.0 | 5.577608 | 1.033079 | 1766.0 | 4.493639 | 32.69 | -117.07 |
| 6611 | 5.9009 | 52.0 | 7.287755 | 1.040816 | 1434.0 | 2.926531 | 34.18 | -118.11 |


In [27]:
y_train.head()

| | MedHouseVal |
|---|---|
| 7186 | 1.458 |
| 7686 | 1.861 |
| 6332 | 3.25 |
| 14192 | 1.35 |
| 6611 | 3.76 |


In [29]:
x_train.shape, y_train.shape, x_test.shape, y_test.shape #checking the shapes to ensure they align

((16512, 8), (16512, 1), (4128, 8), (4128, 1))
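The split above depends on the global NumPy seed, so its result changes if any other random call runs before it. An alternative is to pass `random_state` directly to `train_test_split`, which pins the shuffle for that call alone. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo` are made up) rather than the housing frames.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the housing dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# Passing random_state fixes the shuffle for this call only
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=30)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=30)

# Both calls return identical train/test arrays
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split, split_a[0].shape, split_a[1].shape)
```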

**Scaling the Data**

In [33]:
from sklearn.preprocessing import MinMaxScaler #import the scaler function
from sklearn.compose import ColumnTransformer #import the transformer that will apply the scaler function to the columns

In [34]:
normaliser = MinMaxScaler()
features = x.keys()
columns_transformer = ColumnTransformer([("features_scaler", normaliser, features)], remainder="passthrough")

Apply scaling on __x_train__ and __x_test__

In [35]:
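x_train_transformed = columns_transformer.fit_transform(x_train)
x_test_transformed = columns_transformer.fit_transform(x_test) #caution: this re-fits the scaler on the test set, so train and test are scaled with different min/max values; columns_transformer.transform(x_test) would reuse the training-set fit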

In [37]:
x_train_transformed[:5] #preview the transformed x_train data

array([[0.16667356, 0.74509804, 0.02592456, 0.02219633, 0.0240521 ,
        0.00354041, 0.15744681, 0.61454183],
       [0.28116854, 0.66666667, 0.03419637, 0.02164171, 0.0294787 ,
        0.00179091, 0.14680851, 0.62250996],
       [0.43433194, 0.45098039, 0.04042187, 0.01848093, 0.02125127,
        0.00205   , 0.15319149, 0.6374502 ],
       [0.24773451, 0.37254902, 0.03324856, 0.02074345, 0.06172321,
        0.00301279, 0.01489362, 0.7250996 ],
       [0.37247762, 1.        , 0.04537553, 0.02097282, 0.05009978,
        0.00175162, 0.17340426, 0.62151394]])

In [39]:
x_test_transformed[:5] #preview the transformed x_test data

array([[0.14874278, 0.36      , 0.0729864 , 0.03598703, 0.0563404 ,
        0.0520616 , 0.54797441, 0.33229491],
       [0.31639564, 0.48      , 0.08612793, 0.03971209, 0.02651643,
        0.04156553, 0.09808102, 0.68224299],
       [0.28267196, 0.2       , 0.09319063, 0.04196746, 0.05833053,
        0.05136678, 0.6098081 , 0.2305296 ],
       [0.51563427, 0.28      , 0.10201446, 0.03411437, 0.03702769,
        0.03274709, 0.13219616, 0.63551402],
       [0.177701  , 1.        , 0.05627677, 0.03722725, 0.03985873,
        0.02657767, 0.5575693 , 0.18587747]])
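A common convention is to fit the scaler on the training split only and then reuse that fitted scaler on the test split, so that both are mapped with the same minimum and maximum. The sketch below illustrates this on synthetic data (all `*_demo` names are made up); it is not a re-run of the cells above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (not the housing dataset)
rng = np.random.default_rng(0)
X_train_demo = rng.uniform(0, 10, size=(80, 2))
X_test_demo = rng.uniform(0, 10, size=(20, 2))

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learns min/max from the training split
X_test_scaled = scaler.transform(X_test_demo)        # reuses the training min/max

# The same mapping written out by hand with the training statistics
train_min = X_train_demo.min(axis=0)
train_max = X_train_demo.max(axis=0)
expected = (X_test_demo - train_min) / (train_max - train_min)
print(np.allclose(X_test_scaled, expected))
```

With this approach, scaled test values can fall slightly outside the 0 to 1 range when a test row lies beyond the training minimum or maximum; that is expected.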

### Fitting, Testing, and Evaluating the Models

#### (A) Ridge Linear Model

Fit the model to the data (train)

In [81]:
ridge_model = Ridge()
ridge_model.fit(x_train_transformed, y_train.MedHouseVal) #train the algorithm

| Parameter | Value |
|---|---|
| alpha | 1.0 |
| fit_intercept | True |
| copy_X | True |
| max_iter | None |
| tol | 0.0001 |
| solver | 'auto' |
| positive | False |
| random_state | None |


Test the model and check the R^2 (r-squared) score, also called the coefficient of determination. It measures the proportion of the variance in the target that the model explains, with 1.0 being a perfect fit.

In [82]:
pred_med_house_price = ridge_model.predict(x_test_transformed)
print(f"First 5 predictions on x_test: {pred_med_house_price[:5]}", end="\n\n")

r_squared_score = ridge_model.score(x_test_transformed, y_test)
print(f"The r-squared score of the prediction is: {round(r_squared_score * 100, 2)}%")

First 5 predictions on x_test: [1.11592924 2.41952859 1.97364705 3.44929169 2.31589054]

The r-squared score of the prediction is: 57.62%
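For reference, the score reported by `.score()` on a regressor is R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the target from its mean. The sketch below computes it by hand on made-up numbers and compares the result with `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions, for illustration only
y_true = np.array([4.5, 3.6, 3.5, 3.4, 1.2])
y_pred = np.array([4.0, 3.8, 3.1, 3.0, 1.5])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(r2_manual, r2_score(y_true, y_pred)))
```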


An R-squared score of 57.62% suggests that the Ridge model, with its default settings, is not a strong fit for this dataset, though its hyperparameters can be tuned later to try for a better score.
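One possible way to tune Ridge is a cross-validated grid search over `alpha`. The sketch below is an assumption about how that could look, not the tuning used in this notebook: it runs on synthetic data from `make_regression`, the `alpha` grid is illustrative, and the scaler sits inside a `Pipeline` so it is re-fitted on each training fold only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (not the housing dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=30)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=30)

# Scaling and the model are chained, so each CV fold fits its own scaler
pipe = Pipeline([("scaler", MinMaxScaler()), ("ridge", Ridge())])
param_grid = {"ridge__alpha": [0.01, 0.1, 1.0, 10.0]}  # illustrative values

search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_tr, y_tr)

best_alpha = search.best_params_["ridge__alpha"]
test_r2 = search.score(X_te, y_te)  # R^2 of the refitted best pipeline on held-out data
print(best_alpha, round(test_r2, 3))
```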

Moving to the next algorithm

#### (B) Random Forest Regressor

Fit the model to the data

In [85]:
rand_forest_model = RandomForestRegressor()
rand_forest_model.fit(x_train_transformed, y_train.MedHouseVal) #pass the target column as a Series because this model expects a 1D array for y

| Parameter | Value |
|---|---|
| n_estimators | 100 |
| criterion | 'squared_error' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | 1.0 |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |
| bootstrap | True |


In [87]:
rs_score_randforest = rand_forest_model.score(x_test_transformed, y_test)
print(f"The r-squared score of the model's prediction (using scaled features) is: {rs_score_randforest:.0%}")

The r-squared score of the model's prediction (using scaled features) is: 55%


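From the evaluation, the model scores poorly on the scaled features. Tree-based models are largely insensitive to monotonic feature scaling, so the low score more likely reflects the scaler being re-fitted on `x_test` (the `fit_transform` call in the scaling step), which puts the train and test features on different scales.  
Trying it again with unscaled features...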

In [88]:
rand_forest_model = RandomForestRegressor()
rand_forest_model.fit(x_train, y_train.MedHouseVal) #pass the target column as a Series because this model expects a 1D array for y

| Parameter | Value |
|---|---|
| n_estimators | 100 |
| criterion | 'squared_error' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | 1.0 |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |
| bootstrap | True |


In [89]:
rs_score_randforest = rand_forest_model.score(x_test, y_test)
print(f"The r-squared score of the random forest model prediction is: {rs_score_randforest:.0%}")

The r-squared score of the random forest model prediction is: 80%


From the evaluation, the model trained on the unscaled features performs better than the one trained on the scaled features (about 80% versus 55%).
This will be investigated further.
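One way to start that investigation is to train two forests with the same seed, one on raw features and one on features scaled with a single scaler fitted on the training split, and compare their test scores. The sketch below does this on synthetic data (all names and settings are illustrative assumptions). If the two scores come out close, as would be expected for a tree-based model, the gap seen in this notebook would point to how the test set was scaled rather than to scaling itself.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (not the housing dataset)
X_demo, y_demo = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=30)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=30)

# One scaler, fitted on the training split and reused for the test split
scaler = MinMaxScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Same seed for both forests so that only the feature scale differs
forest_raw = RandomForestRegressor(n_estimators=50, random_state=30).fit(X_tr, y_tr)
forest_scaled = RandomForestRegressor(n_estimators=50, random_state=30).fit(X_tr_s, y_tr)

r2_raw = forest_raw.score(X_te, y_te)
r2_scaled = forest_scaled.score(X_te_s, y_te)
print(round(r2_raw, 3), round(r2_scaled, 3))
```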

### Exporting the Models

In [90]:
import pickle

with open("./exported model/median house price prediction models/ridge_regr_med_house_price_pred_model.pkl", "wb") as f: #export the ridge model
    pickle.dump(ridge_model, f)
with open("./exported model/median house price prediction models/randfors_regr_med_house_price_pred_model.pkl", "wb") as f: #export the random forest regressor model
    pickle.dump(rand_forest_model, f)
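After exporting, it is worth confirming that a saved model loads back and predicts the same values. The sketch below shows such a round trip on a small synthetic Ridge model written to a temporary directory; the file name `demo_ridge.pkl` and the data are hypothetical, not the exported files above.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import Ridge

# Small synthetic model to save and restore (not the housing models)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
model = Ridge().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_ridge.pkl"  # hypothetical file name
    with open(model_path, "wb") as f:              # the with-block closes the file
        pickle.dump(model, f)
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions
round_trip_ok = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(round_trip_ok)
```

Pickled models should only be loaded from trusted sources and, ideally, with the same scikit-learn version that created them.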