# **Prediction: House prices in King County, Washington**

**Objective**: To predict house prices based on sales data from May 2014 to May 2015.

* The dataset contains house sale prices for King County, Washington, USA.
* It covers homes sold between May 2014 and May 2015.
* Data Source: https://www.kaggle.com/harlfoxem/housesalesprediction

* Columns:

> 1. id: identifier for a house
2. date: date the house was sold
3. price: sale price (the prediction target)
4. bedrooms: number of bedrooms
5. bathrooms: number of bathrooms
6. sqft_living: square footage of the home
7. sqft_lot: square footage of the lot
8. floors: total floors (levels) in the house
9. waterfront: whether the house has a view of a waterfront
10. view: has been viewed
11. condition: overall condition of the house
12. grade: overall grade given to the housing unit, based on the King County grading system
13. sqft_above: square footage of the house apart from the basement
14. sqft_basement: square footage of the basement
15. yr_built: year built
16. yr_renovated: year the house was renovated
17. zipcode: ZIP code
18. lat: latitude coordinate
19. long: longitude coordinate
20. sqft_living15: living room area in 2015 (implies some renovations)
21. sqft_lot15: lot size in 2015 (implies some renovations)


In [3]:
#pip install pandas_visual_analysis

!pip install --force-reinstall --no-cache-dir numpy pandas jupyter ipykernel

# **Import Libraries and Dataset**

In [None]:
import tensorflow as tf
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from tensorflow import keras
import joblib

In [None]:
house_sales = pd.read_csv('kc_house_data.csv', encoding = 'ISO-8859-1')

In [None]:
house_sales

In [None]:
house_sales.describe()

In [None]:
house_sales.info()

# **Data Cleanup and Visualization**

In [None]:
sns.scatterplot(x='sqft_living', y='price', data = house_sales )

In [None]:
#sns.scatterplot(x='long', y='lat', data = house_sales )
house_sales.plot(kind='scatter', x = 'long', y='lat', alpha = 0.4, s =house_sales['sqft_living']/200, label = 'sqft_living', figsize = (20,14),
                c='price', cmap=plt.get_cmap('jet'), colorbar = True)

In [None]:
#alternative plot
#f, ax = plt.subplots(figsize = (20, 20))
#sns.heatmap(house_sales.corr(), annot = True)

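# numeric_only=True skips non-numeric columns such as the 'date' string,
# which would otherwise raise an error in pandas 2.x
corrmat = house_sales.corr(numeric_only=True)
top_corr_features = corrmat.index
plt.figure(figsize=(20,20))
#plot heat map
g = sns.heatmap(house_sales[top_corr_features].corr(), annot=True, cmap="RdYlGn")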
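
A full heatmap can be hard to scan, so a ranked list of correlations with `price` is a useful companion. The sketch below uses a small made-up frame (the `toy` name and all values are illustrative assumptions, not rows from the dataset) to show the idea:

```python
import pandas as pd

# Toy stand-in for house_sales; every value here is made up for illustration.
toy = pd.DataFrame({
    "price": [200000, 350000, 500000, 650000, 900000],
    "sqft_living": [900, 1500, 2100, 2600, 3800],
    "bedrooms": [2, 3, 3, 4, 4],
    "yr_built": [1990, 1975, 2005, 1960, 2010],
    "date": ["20140502", "20140610", "20141101", "20150115", "20150420"],
})

# numeric_only=True leaves the string `date` column out of the correlation.
corr_with_price = (
    toy.corr(numeric_only=True)["price"]
    .drop("price")                 # the target always correlates 1.0 with itself
    .sort_values(ascending=False)  # strongest positive correlation first
)
print(corr_with_price)
```

Applied to the real frame, the same three calls would rank every numeric column by its linear relationship with `price`.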

In [None]:
house_high = house_sales[house_sales['price']>= 650000]
house_high

In [None]:
# numeric_only=True skips non-numeric columns such as the 'date' string
corrmat1 = house_high.corr(numeric_only=True)
top_corr_features1 = corrmat1.index
plt.figure(figsize=(20,20))
#plot heat map
g = sns.heatmap(house_high[top_corr_features1].corr(), annot=True, cmap="RdYlGn")

## Feature importance for Prices >= $650k


In [None]:
#Using dataset for high prices
yh = house_high['price']
yh

In [None]:
xh = house_high.drop(['price','date'], axis=1)
xh.shape

In [None]:
from sklearn.ensemble import ExtraTreesRegressor
#import matplotlib.pyplot as plt
model_h = ExtraTreesRegressor()
model_h.fit(xh,yh)

In [None]:
print(model_h.feature_importances_)

In [None]:
#plot graph of feature importances for better visualization
#Factors responsible for higher property value - >= $650k price 

feat_importances = pd.Series(model_h.feature_importances_, index=xh.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()

In [None]:
house_high.hist(bins=20, figsize = (20,20), color = 'b')

In [None]:
house_sales.hist(bins=20, figsize = (20,20), color = 'g')

In [None]:
#Data Cleaning 
y = house_sales['price']
y

In [None]:
y.shape

In [None]:
#X = house_sales[['bedrooms', 'bathrooms', 'waterfront', 'grade', 'sqft_above', 'sqft_living', 'lat', 'long', 'sqft_living15']]
X = house_sales.drop(['price', 'date', 'id', 'view'], axis=1)

X.head()

# First Model - Random Forest Regressor

In [None]:
### Feature Importance categorisation 

from sklearn.ensemble import ExtraTreesRegressor
#import matplotlib.pyplot as plt
model_01 = ExtraTreesRegressor()
model_01.fit(X,y)

In [None]:
# feature Importance

print(model_01.feature_importances_)

In [None]:
#plot graph of feature importances for better visualization

feat_importances = pd.Series(model_01.feature_importances_, index=X.columns)
feat_importances.nlargest(7).plot(kind='barh')
plt.show()

### **'Grade' and 'Sqft_living' seem to be the most important features**

In [None]:
#We choose input features with reasonable correlation coefficients to price
#x = house_sales[['bathrooms', 'bedrooms', 'sqft_living', 'sqft_lot', 'floors', 'sqft_above', 'sqft_basement']]
#x

### Split data into training and testing dataset using the Stratified Shuffle split based on the 'sqft_living' input

In [None]:
#Join the clean input features X and the output y in a dataframe
final_house = pd.concat([X, y], axis=1)
final_house.head()

In [None]:
#from sklearn.model_selection import train_test_split
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [None]:
#Bin sqft_living into five strata to use as the stratification key
final_house["sqft_living_cut"] = pd.cut( house_sales["sqft_living"], 
                                bins =[ 0, 1000, 2000, 3000, 4000, np.inf], 
                                labels =[ 1, 2, 3, 4, 5])

In [None]:
final_house['sqft_living_cut'].hist()

In [None]:
#split data using Stratified shuffle Split
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 2)
for train_index, test_index in split.split(final_house, final_house['sqft_living_cut']):
    # split() yields positional indices, so select rows by position
    train_set = final_house.iloc[train_index]
    test_set = final_house.iloc[test_index]

In [None]:
final_house = final_house.drop(['sqft_living_cut'], axis=1)

In [None]:
train_set.shape, test_set.shape
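
A stratified split is only useful if each `sqft_living` bin keeps roughly the same share in the training and test sets. The following sketch checks that on synthetic data; the `demo` frame and its random living areas are assumptions made for illustration, while the bins and split settings mirror the cells above:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)
# Synthetic living areas standing in for house_sales["sqft_living"].
demo = pd.DataFrame({"sqft_living": rng.integers(400, 6000, size=1000)})
demo["sqft_living_cut"] = pd.cut(
    demo["sqft_living"],
    bins=[0, 1000, 2000, 3000, 4000, np.inf],
    labels=[1, 2, 3, 4, 5],
)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=2)
train_idx, test_idx = next(splitter.split(demo, demo["sqft_living_cut"]))

# Share of each bin in the full data, the training set, and the test set.
proportions = pd.DataFrame({
    "overall": demo["sqft_living_cut"].value_counts(normalize=True),
    "train": demo.iloc[train_idx]["sqft_living_cut"].value_counts(normalize=True),
    "test": demo.iloc[test_idx]["sqft_living_cut"].value_counts(normalize=True),
}).sort_index()
print(proportions)
```

If the three columns agree to within a percentage point or so, the split preserved the distribution of living area.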

In [None]:
train_set.head()

In [None]:
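# Column order is [features..., price, sqft_living_cut]:
# drop the last two columns for X and take the second-to-last (price) as y
X_train = train_set.iloc[:,:-2]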
y_train = train_set.iloc[:,-2]

X_test = test_set.iloc[:,:-2]
y_test = test_set.iloc[:,-2]

In [None]:
X_train.shape, y_train.shape

In [None]:
X_train

In [None]:
y_train.head()

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
n_estimators = [int(x) for x in np.linspace(start = 300, stop = 1500, num = 13)]
print(n_estimators)

In [None]:
##Hyperparameters

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 300, stop = 1500, num = 13)]
# Number of features to consider at every split
# ('auto' was removed in scikit-learn 1.3; for regressors it meant all features, i.e. 1.0)
max_features = [1.0, 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(5, 30, num = 6)]
# max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10, 15, 100]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 5, 10]

In [None]:
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}


print(random_grid)

In [None]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf_model01 = RandomForestRegressor()

In [None]:
# Random search of parameters, using 5-fold cross validation,
# sampling 10 different combinations
rf_random = RandomizedSearchCV(estimator = rf_model01, param_distributions = random_grid,scoring='neg_mean_squared_error', 
                               n_iter = 10, cv = 5, verbose=2, random_state=42, n_jobs = 1)
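
To put the search budget in perspective, it helps to compare the size of the full grid with the number of candidates actually sampled. A quick count, using the list lengths defined in the hyperparameter cell and the `n_iter` and `cv` values passed to the search (the `grid_sizes` dictionary itself is just an illustrative helper):

```python
import math

# Number of candidate values per hyperparameter, as defined in the grid above.
grid_sizes = {
    "n_estimators": 13,
    "max_features": 2,
    "max_depth": 6,
    "min_samples_split": 5,
    "min_samples_leaf": 4,
}

total_combinations = math.prod(grid_sizes.values())
n_iter, cv = 10, 5
total_fits = n_iter * cv  # one model fit per sampled candidate per fold

print(f"grid size: {total_combinations}, sampled: {n_iter}, model fits: {total_fits}")
```

Only a small fraction of the grid is explored, so raising `n_iter` is the obvious lever if the search budget allows.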

In [None]:
rf_random.fit(X_train, y_train)

In [None]:
rf_random.best_params_

In [None]:
rf_random.best_score_
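
Because the search was scored with `neg_mean_squared_error`, `best_score_` is a negated MSE in squared dollars. Taking the square root of its negation gives an RMSE that is easier to interpret. A minimal sketch, where the helper name and the example score are hypothetical and not results from this notebook:

```python
import math


def neg_mse_to_rmse(neg_mse: float) -> float:
    """Convert a scikit-learn 'neg_mean_squared_error' score to an RMSE."""
    return math.sqrt(-neg_mse)


# Hypothetical cross-validated score, chosen only to illustrate the conversion.
example_score = -1.6e10
print(round(neg_mse_to_rmse(example_score), 2))
```

In the notebook this would be called as `neg_mse_to_rmse(rf_random.best_score_)`.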

In [None]:
predictions=rf_random.predict(X_test)

In [None]:
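# distplot is deprecated in recent seaborn releases; histplot with kde=True gives the same view of the residuals
sns.histplot(y_test - predictions, kde=True)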

In [None]:
plt.scatter(y_test,predictions)

In [None]:
#Model Evaluation

k = X_test.shape[1]
n = len(X_test)
n, k

In [None]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

RMSE = round(float(np.sqrt(mean_squared_error(y_test, predictions))), 3)
MSE = mean_squared_error(y_test, predictions)
MAE = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
adj_r2 = 1-(1-r2)*(n-1)/(n-k-1)

print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2, '\nAdjusted R2 =', adj_r2) 
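
The adjusted R² used above penalises R² for the number of predictors `k` relative to the number of test samples `n`. Written as a small function, with hypothetical inputs chosen for illustration rather than taken from this notebook's output:

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 for n samples and k predictors: 1 - (1 - R^2)(n - 1)/(n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("need more samples than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical values: an R^2 of 0.86 on 4323 test rows with 17 features.
print(adjusted_r2(0.86, n=4323, k=17))
```

With thousands of rows and only 17 features the penalty is tiny, so R² and adjusted R² should be nearly identical here.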

**The model reaches an R² of about 0.86 on the test set, i.e. it explains roughly 86% of the variance in house prices.**

In [None]:
#import pickle
# open a file, where you want to store the data
#file = open('random_forest_regression_model.pkl', 'wb')

# dump information to that file
#pickle.dump(rf_random, file)

In [None]:
#joblib.dump(rf_random, "random_forest_regression_model.joblib", compress = 9)

# Second Model - Artificial Neural Network

In [None]:
#Normalise the inputs 'X' and output 'y' for training and testing data

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_tr = scaler.fit_transform(X_train)
# Reuse the training min/max on the test set; refitting on test data would leak its statistics
X_te = scaler.transform(X_test)
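
Min-max scaling maps each feature using the minimum and maximum seen during fitting, so the test split is best transformed with the training statistics rather than its own. A small sketch with made-up one-column arrays (`train`, `test` and `scaler_demo` are illustrative names):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up living areas; the test set contains a value outside the training range.
train = np.array([[1000.0], [2000.0], [3000.0]])
test = np.array([[1500.0], [4000.0]])

scaler_demo = MinMaxScaler()
train_scaled = scaler_demo.fit_transform(train)  # learns min=1000, max=3000
test_scaled = scaler_demo.transform(test)        # reuses the training range

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Test values outside the training range land outside [0, 1]. That is expected, and it keeps the training and test inputs on the same scale for the network.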



In [None]:
X_tr

In [None]:
X_te

In [None]:
y_train = y_train.values.reshape(-1,1)
y_test = y_test.values.reshape(-1,1)

# Refit the scaler on the training target only; from here on `scaler` holds the price range
y_tr = scaler.fit_transform(y_train)
y_te = scaler.transform(y_test)

In [None]:
y_tr

In [None]:
scaler.data_max_

In [None]:
X_tr.shape, X_te.shape, y_tr.shape

In [None]:
y_tr.shape, y_te.shape

# Create Model

In [None]:
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(units=200, activation='relu', input_shape=(17, )))
model.add(tf.keras.layers.Dense(units=150, activation='relu'))
model.add(tf.keras.layers.Dense(units=100, activation='relu'))
model.add(tf.keras.layers.Dense(units=50, activation='relu'))
model.add(tf.keras.layers.Dense(units=1, activation='linear'))

In [None]:
model.summary()

In [None]:
model.compile(optimizer='Adam', loss='mean_squared_error')

In [None]:
epochs_hist = model.fit(X_tr, y_tr, epochs = 100, batch_size = 50, validation_split = 0.2)

# **Evaluate the Model**

In [None]:
epochs_hist.history.keys()

In [None]:
plt.plot(epochs_hist.history['loss'])
plt.plot(epochs_hist.history['val_loss'])
plt.title('Model Loss Progress During Training')
plt.xlabel('Epoch')
plt.ylabel('Training and Validation Loss')
plt.legend(['Training Loss', 'Validation Loss'])
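
Besides eyeballing the curves, the history dictionary can be queried directly for the epoch with the lowest validation loss and for the final gap between the two losses. A sketch on a hypothetical history (the numbers are invented and only shaped like `epochs_hist.history`):

```python
# Hypothetical loss history with the same keys as epochs_hist.history.
history = {
    "loss": [0.020, 0.012, 0.009, 0.007, 0.006],
    "val_loss": [0.018, 0.013, 0.011, 0.012, 0.014],
}

val_loss = history["val_loss"]
# Zero-based index of the epoch with the smallest validation loss.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
# A growing gap between validation and training loss hints at overfitting.
final_gap = val_loss[-1] - history["loss"][-1]

print(f"lowest val_loss at epoch {best_epoch + 1}, final gap {final_gap:.3f}")
```

If the best epoch falls well before the last one, training for fewer epochs (or stopping early) is worth trying.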

In [None]:
# Try predict using inputs values
# 'bedrooms','bathrooms','sqft_living','sqft_lot','floors', 'sqft_above', 'sqft_basement'
#X_test_1 = np.array([[ 4, 3, 1960, 5000, 1, 2000, 3000 ]])

#scaler_1 = MinMaxScaler()
#X_test_scaled_1 = scaler_1.fit_transform(X_test_1)

#y_predict_1 = model.predict(X_test_scaled_1)

#y_predict_1 = scaler.inverse_transform(y_predict_1)
#y_predict_1
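
One caveat for single-row predictions like the commented-out attempt above: calling `fit_transform` on a lone row makes that row its own minimum and maximum, so every feature scales to zero. The row has to be transformed with a scaler fitted on the training data. A sketch with an invented two-feature training matrix (`X_fit` and `new_row` are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented training matrix with two features: bedrooms and sqft_living.
X_fit = np.array([[2.0, 800.0], [3.0, 1600.0], [5.0, 4000.0]])
new_row = np.array([[4.0, 1960.0]])

fitted = MinMaxScaler().fit(X_fit)
scaled_with_training_range = fitted.transform(new_row)
scaled_alone = MinMaxScaler().fit_transform(new_row)  # the row is its own min and max

print(scaled_with_training_range)
print(scaled_alone)
```

The second result carries no information about the house, which is one way a model can end up producing implausible prices.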

In [None]:
# Predict on the scaled test inputs and compare with the scaled true prices
y_predict = model.predict(X_te)
plt.plot(y_te, y_predict, "^", color = 'r')
plt.xlabel('True Values')
plt.ylabel('Model Predictions')

In [None]:
y_predict_orig = scaler.inverse_transform(y_predict)
y_test_orig = scaler.inverse_transform(y_te)
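
`inverse_transform` undoes the min-max mapping via x = x_scaled * (max - min) + min. The same relation means an error measured in scaled units converts to dollars by multiplying by the price range. A NumPy-only sketch with invented prices and an invented scaled error:

```python
import numpy as np

# Invented prices used only to illustrate the round trip.
prices = np.array([250000.0, 400000.0, 1000000.0])
p_min, p_max = prices.min(), prices.max()

scaled = (prices - p_min) / (p_max - p_min)  # what MinMaxScaler computes
restored = scaled * (p_max - p_min) + p_min  # what inverse_transform computes

# A hypothetical RMSE of 0.02 in scaled units, expressed in dollars.
scaled_rmse = 0.02
dollar_rmse = scaled_rmse * (p_max - p_min)

print(scaled, restored, dollar_rmse)
```

This is why the metrics in the next cells are computed on the inverse-transformed values: they come out in dollars rather than in scaled units.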

In [None]:
plt.plot(y_test_orig, y_predict_orig, "^", color = 'r')
plt.xlabel('True Values')
plt.ylabel('Model Predictions')
plt.xlim(0, 5000000)
plt.ylim(0, 3000000)


In [None]:
k = X_te.shape[1]
n = len(X_te)

n, k

In [None]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

RMSE = round(float(np.sqrt(mean_squared_error(y_test_orig, y_predict_orig))), 3)
MSE = mean_squared_error(y_test_orig, y_predict_orig)
MAE = mean_absolute_error(y_test_orig, y_predict_orig)
r2 = r2_score(y_test_orig, y_predict_orig)
adj_r2 = 1-(1-r2)*(n-1)/(n-k-1)

print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2, '\nAdjusted R2 =', adj_r2) 


# **Try a wider random train/test split (75/25)**

In [None]:
Xa = house_sales.drop(['price', 'date', 'id', 'view'], axis=1)
#X= house_sales[['bedrooms','bathrooms','sqft_living','sqft_lot','floors', 'sqft_basement', 'waterfront', 'condition', 'grade', 'sqft_above', 'yr_built', 
#'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15']]

In [None]:
Xa

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scale = scaler.fit(Xa)

In [None]:
joblib.dump(scaler, 'scaler_x.gz')

In [None]:
X_scaled = scaler.fit_transform(Xa)

In [None]:
X_scaled

In [None]:
Y= house_sales['price']

In [None]:
Y = Y.values.reshape(-1,1)
Y_scale = scaler.fit(Y)

In [None]:
joblib.dump(scaler, 'scaler_y.gz')
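
The saved scaler files are snapshots of the scaler at the moment of each dump, so they can be reloaded later to scale new inputs and to convert scaled predictions back to dollars. A round-trip sketch using a temporary directory; `scaler_y_demo`, the file name and the two prices are assumptions made for illustration:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Fit a demo target scaler on two invented prices.
scaler_y_demo = MinMaxScaler().fit(np.array([[100000.0], [900000.0]]))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scaler_y_demo.gz")
    joblib.dump(scaler_y_demo, path)  # compression is inferred from the .gz suffix
    reloaded = joblib.load(path)

# A scaled prediction of 0.5 sits halfway between the fitted min and max.
scaled_prediction = np.array([[0.5]])
price = reloaded.inverse_transform(scaled_prediction)
print(price)
```

At inference time the input scaler would be loaded the same way and applied with `transform`, never `fit_transform`.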

In [None]:
Y_scaled = scaler.fit_transform(Y)

In [None]:
Y_scaled

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, Y_scaled, test_size = 0.25)

In [None]:
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(units=200, activation='relu', input_shape=(17, )))
model.add(tf.keras.layers.Dense(units=100, activation='relu'))
model.add(tf.keras.layers.Dense(units=100, activation='relu'))
model.add(tf.keras.layers.Dense(units=50, activation='relu'))
model.add(tf.keras.layers.Dense(units=1, activation='linear'))

In [None]:
model.compile(optimizer='Adam', loss='mean_squared_error')

In [None]:
epochs_hist = model.fit(X_train, y_train, epochs = 100, batch_size = 50, validation_split = 0.20)

In [None]:
plt.plot(epochs_hist.history['loss'])
plt.plot(epochs_hist.history['val_loss'])
plt.title('Model Loss Progress During Training')
plt.ylabel('Training and Validation Loss')
plt.xlabel('Epoch number')
plt.legend(['Training Loss', 'Validation Loss'])

In [None]:
y_predict = model.predict(X_test)
plt.plot(y_test, y_predict, "o", color = 'r')
plt.xlabel("True Value (ground Truth)")
plt.ylabel("Model Predictions")
plt.title('ANN Predictions (scaled)')
plt.show()

In [None]:
y_predict_orig = scaler.inverse_transform(y_predict)
y_test_orig = scaler.inverse_transform(y_test)


In [None]:
plt.plot(y_test_orig, y_predict_orig, "^", color = 'r')
plt.xlabel("True Value (ground Truth)")
plt.ylabel("Model Predictions")
plt.title('ANN Predictions')
plt.show()

In [None]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Recompute n and k for this split; the values from the earlier 80/20 split no longer apply
n, k = X_test.shape

RMSE = round(float(np.sqrt(mean_squared_error(y_test_orig, y_predict_orig))), 3)
MSE = mean_squared_error(y_test_orig, y_predict_orig)
MAE = mean_absolute_error(y_test_orig, y_predict_orig)
r2 = r2_score(y_test_orig, y_predict_orig)
adj_r2 = 1-(1-r2)*(n-1)/(n-k-1)

print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2, '\nAdjusted R2 =', adj_r2) 


**The model reaches an R² of about 0.88 on the test set, i.e. it explains roughly 88% of the variance in house prices.**

In [None]:
from tensorflow.keras.models import load_model

In [None]:
tf.keras.models.save_model(model,'ann_model.h5')

In [None]:
#10, 5492885819312364962500198509825047.5601-122.089835003700

#predann 1322184960.0
#predrf 2182392.88

In [None]:
#file = open('ANN_model.pkl', 'wb')

# dump information to that file
#pickle.dump(model, file)