<a href="https://colab.research.google.com/github/komoni03/housing-pricing-model/blob/main/Housing_Pricing_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
import pandas as pd
df_test = pd.read_csv('/content/sample_data/california_housing_test.csv')
df_train = pd.read_csv('/content/sample_data/california_housing_train.csv')

In [4]:
df_test.shape

(3000, 9)

In [5]:
df_train.shape

(17000, 9)

In [6]:
df_test.describe(include='all')

statistic,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0
mean,-119.5892,35.63539,28.845333,2599.578667,529.950667,1402.798667,489.912,3.807272,205846.275
std,1.994936,2.12967,12.555396,2155.593332,415.654368,1030.543012,365.42271,1.854512,113119.68747
min,-124.18,32.56,1.0,6.0,2.0,5.0,2.0,0.4999,22500.0
25%,-121.81,33.93,18.0,1401.0,291.0,780.0,273.0,2.544,121200.0
50%,-118.485,34.27,29.0,2106.0,437.0,1155.0,409.5,3.48715,177650.0
75%,-118.02,37.69,37.0,3129.0,636.0,1742.75,597.25,4.656475,263975.0
max,-114.49,41.92,52.0,30450.0,5419.0,11935.0,4930.0,15.0001,500001.0


In [7]:
df_train.describe(include='all')

statistic,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0
mean,-119.562108,35.625225,28.589353,2643.664412,539.410824,1429.573941,501.221941,3.883578,207300.912353
std,2.005166,2.13734,12.586937,2179.947071,421.499452,1147.852959,384.520841,1.908157,115983.764387
min,-124.35,32.54,1.0,2.0,1.0,3.0,1.0,0.4999,14999.0
25%,-121.79,33.93,18.0,1462.0,297.0,790.0,282.0,2.566375,119400.0
50%,-118.49,34.25,29.0,2127.0,434.0,1167.0,409.0,3.5446,180400.0
75%,-118.0,37.72,37.0,3151.25,648.25,1721.0,605.25,4.767,265000.0
max,-114.31,41.95,52.0,37937.0,6445.0,35682.0,6082.0,15.0001,500001.0


In [8]:
df_train.isnull().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
dtype: int64

In [9]:
df_test.isnull().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
dtype: int64

Neither the train nor the test dataset has any null values. We can proceed to look at individual columns and ascertain their relevance in determining the housing price.
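One possible way to gauge how relevant each column is would be to look at its linear correlation with `median_house_value`. The sketch below is only an illustration: the tiny DataFrame holds made-up toy values (not rows from the housing data), and the helper name `column_relevance` is hypothetical. In the notebook, the same call could be applied to `df_train`.

```python
import pandas as pd


def column_relevance(df, target):
    """Rank columns by the absolute Pearson correlation with the target column."""
    correlations = df.corr()[target].drop(target)
    return correlations.abs().sort_values(ascending=False)


# Made-up toy values, used only so the sketch runs on its own.
toy = pd.DataFrame(
    {
        "median_income": [1.0, 2.0, 3.0, 4.0, 5.0],
        "housing_median_age": [30.0, 10.0, 40.0, 20.0, 25.0],
        "median_house_value": [100_000.0, 150_000.0, 200_000.0, 250_000.0, 300_000.0],
    }
)

relevance = column_relevance(toy, "median_house_value")
print(relevance)
```

Correlation only captures linear relationships, so a low score does not by itself mean a column is useless to a tree-based model.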

Next, we will split the columns into the target variable and the input variables.

In [10]:
X_train = df_train.drop(columns=['median_house_value'])  # input variables for the train dataset

y_train = df_train['median_house_value']  # target variable for the train dataset


In [11]:
train_shape = {'X_train': X_train.shape, 'y_train': y_train.shape}
print(train_shape)

{'X_train': (17000, 8), 'y_train': (17000,)}


In [12]:
X_test = df_test.drop(columns=['median_house_value'])  # input variables for the test dataset

y_test = df_test['median_house_value']  # target variable for the test dataset


In [13]:
test_shape = {'X_test': X_test.shape, 'y_test': y_test.shape}
print(test_shape)

{'X_test': (3000, 8), 'y_test': (3000,)}


In [14]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

In [15]:
def get_mae(max_leaf_nodes, X_train, X_test, y_train, y_test):
    """Fit a random forest with the given max_leaf_nodes and return its MAE on the test data."""
    model = RandomForestRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    mae = mean_absolute_error(y_test, preds)
    return mae

# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000, 10000]:
    my_mae = get_mae(max_leaf_nodes, X_train, X_test, y_train, y_test)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5  		 Mean Absolute Error:  61937
Max leaf nodes: 50  		 Mean Absolute Error:  43578
Max leaf nodes: 500  		 Mean Absolute Error:  34536
Max leaf nodes: 5000  		 Mean Absolute Error:  32125
Max leaf nodes: 10000  		 Mean Absolute Error:  32107
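
Because the loop above scores every setting on the test set, the MAE of the chosen setting may be slightly optimistic. A common alternative is to hold out part of the training data for tuning and keep the test set for the final score only. The sketch below illustrates that idea under stated assumptions: the helper name `pick_max_leaf_nodes`, the 80/20 split, `n_estimators=20` (chosen only to keep the demo fast), and the synthetic stand-in data are all illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def pick_max_leaf_nodes(X, y, candidates, random_state=0):
    """Return (best_candidate, scores) using a validation split of the training data."""
    X_fit, X_val, y_fit, y_val = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    scores = {}
    for max_leaf_nodes in candidates:
        model = RandomForestRegressor(
            n_estimators=20,  # assumption: small forest so the demo runs quickly
            max_leaf_nodes=max_leaf_nodes,
            random_state=random_state,
        )
        model.fit(X_fit, y_fit)
        scores[max_leaf_nodes] = mean_absolute_error(y_val, model.predict(X_val))
    # The candidate with the lowest validation MAE wins.
    best = min(scores, key=scores.get)
    return best, scores


# Synthetic stand-in data so the sketch runs on its own;
# in the notebook, X_train and y_train would be passed instead.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = 3.0 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(scale=0.1, size=300)

best, scores = pick_max_leaf_nodes(X_demo, y_demo, candidates=[5, 50, 500])
print(best, scores)
```

With this approach the test set is touched once, after the setting has been fixed, so the reported test MAE is not influenced by the tuning step.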


In [16]:
forest_model = RandomForestRegressor(max_leaf_nodes=10000, random_state=1)
forest_model.fit(X_train, y_train)
y_preds = forest_model.predict(X_test)

Compare the RandomForestRegressor model with a LinearRegression model based on their MAE scores.
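
For reference, the mean absolute error is the average of the absolute differences between the actual and predicted values, so it is expressed in the same units as the target (here, dollars). The sketch below spells out that calculation in plain Python; the function name `mean_absolute_error_manual` and the three example prices are made up for illustration, and in the notebook `sklearn.metrics.mean_absolute_error` does this job.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of |actual - predicted| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up example values: the absolute errors are 10,000, 10,000 and 30,000.
actual = [200_000, 150_000, 300_000]
predicted = [210_000, 140_000, 330_000]

example_mae = mean_absolute_error_manual(actual, predicted)
print(f"MAE: {example_mae:,.2f}")
```

A lower MAE means the predictions are, on average, closer to the true house values.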

In [17]:
from sklearn.linear_model import LinearRegression
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
linear_preds_y = linear_model.predict(X_test)

In [18]:
linear_mae_score = mean_absolute_error(y_test, linear_preds_y)
print(f'The mean absolute error for the LinearRegression model is: {linear_mae_score}')

The mean absolute error for the LinearRegression model is: 50352.22825794297


Comparing the mean absolute error of the two models, the RandomForestRegressor (MAE of about 32,107 with max_leaf_nodes=10000) is more accurate on the test set than the LinearRegression model (MAE of about 50,352), so it is the better choice for predicting housing prices from these inputs.
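
To put the gap in perspective, the two reported scores can be compared directly. The short calculation below uses only the numbers printed earlier in this notebook; note that the forest figure was printed with `%d` (so it is truncated to a whole number) and comes from the tuning run with `random_state=0`, not from the `forest_model` fitted afterwards with `random_state=1`.

```python
# MAE values as printed in the notebook outputs above.
forest_mae = 32107                # max_leaf_nodes=10000, truncated by %d
linear_mae = 50352.22825794297    # LinearRegression

reduction = linear_mae - forest_mae
relative_reduction = reduction / linear_mae

print(f"Absolute reduction in MAE: {reduction:,.0f}")
print(f"Relative reduction in MAE: {relative_reduction:.1%}")
```

On these figures, the random forest lowers the average error by roughly 18,000, or about 36% relative to the linear model.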