# House Price Prediction

In this project, I'll use the California Housing dataset available in `sklearn` and use `Jupyter Kernel Gateway` to expose notebook cells as HTTP endpoints.

### Import libraries

In [1]:
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

### Dataset import

The features are in `.data` and the labels in `.target`. We split them into train and test sets using `train_test_split` with a test size of 33%.

In [2]:
fetched_data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(fetched_data.data, fetched_data.target, test_size = 0.33)
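
Because `train_test_split` shuffles rows randomly, each run of the cell above yields a different split. A minimal sketch, assuming reproducible results are wanted: passing `random_state` fixes the shuffle. The synthetic arrays and the seed value 42 are illustrative assumptions; they stand in for the real data so the snippet runs without downloading anything.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the California Housing data
X = np.arange(20640 * 8, dtype=float).reshape(20640, 8)
y = np.arange(20640, dtype=float)

# random_state (42 is an arbitrary choice) makes the shuffle repeatable
split_a = train_test_split(X, y, test_size=0.33, random_state=42)
split_b = train_test_split(X, y, test_size=0.33, random_state=42)

X_train, X_test, y_train, y_test = split_a
print(X_train.shape, X_test.shape)

# Same seed, same rows in the test set
same_split = np.array_equal(split_a[1], split_b[1])
print(same_split)
```

Without `random_state`, metrics computed on the test split will vary slightly from run to run.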

Now we print the dataset's description using `.DESCR` to get the complete information about it.

In [3]:
print(fetched_data.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block
        - HouseAge      median house age in block
        - AveRooms      average number of rooms
        - AveBedrms     average number of bedrooms
        - Population    block population
        - AveOccup      average house occupancy
        - Latitude      house block latitude
        - Longitude     house block longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bur

### Data Analysis

Here, we will analyse the dataset and create a GET endpoint to fetch the basic stats.

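We first wrap the features and labels in DataFrames with explicit column names and then concatenate them column-wise. The target is stored in units of $100,000, so we multiply it by 100000 to get the price in dollars.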

In [4]:
dataset = pd.concat([pd.DataFrame(fetched_data.data, columns = fetched_data.feature_names), 
                     pd.DataFrame(fetched_data.target*100000, columns = ['Price'])], axis = 1)
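
The column-wise concatenation can be sketched on a toy array. The values and the two-feature layout below are made up for illustration; only the use of `pd.concat` with `axis=1` mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical values) for fetched_data.data and fetched_data.target
toy_features = np.array([[8.3, 41.0], [7.2, 21.0], [5.6, 52.0]])
toy_target = np.array([4.526, 3.585, 3.521])  # units of $100,000

toy_dataset = pd.concat(
    [
        pd.DataFrame(toy_features, columns=["MedInc", "HouseAge"]),
        # Scale the target to dollars, as in the notebook cell
        pd.DataFrame(toy_target * 100000, columns=["Price"]),
    ],
    axis=1,  # place the frames side by side, aligned on the row index
)
print(toy_dataset)
```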

Let's analyse the dataset.

In [5]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
MedInc        20640 non-null float64
HouseAge      20640 non-null float64
AveRooms      20640 non-null float64
AveBedrms     20640 non-null float64
Population    20640 non-null float64
AveOccup      20640 non-null float64
Latitude      20640 non-null float64
Longitude     20640 non-null float64
Price         20640 non-null float64
dtypes: float64(9)
memory usage: 1.4 MB


We see that we have a total of 20640 rows, one per census block group. There are a total of 8 feature columns and 1 label column, and there are no `null` values.

In [6]:
dataset.corr()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,Price
MedInc,1.0,-0.119034,0.326895,-0.06204,0.004834,0.018766,-0.079809,-0.015176,0.688075
HouseAge,-0.119034,1.0,-0.153277,-0.077747,-0.296244,0.013191,0.011173,-0.108197,0.105623
AveRooms,0.326895,-0.153277,1.0,0.847621,-0.072213,-0.004852,0.106389,-0.02754,0.151948
AveBedrms,-0.06204,-0.077747,0.847621,1.0,-0.066197,-0.006181,0.069721,0.013344,-0.046701
Population,0.004834,-0.296244,-0.072213,-0.066197,1.0,0.069863,-0.108785,0.099773,-0.02465
AveOccup,0.018766,0.013191,-0.004852,-0.006181,0.069863,1.0,0.002366,0.002476,-0.023737
Latitude,-0.079809,0.011173,0.106389,0.069721,-0.108785,0.002366,1.0,-0.924664,-0.14416
Longitude,-0.015176,-0.108197,-0.02754,0.013344,0.099773,0.002476,-0.924664,1.0,-0.045967
Price,0.688075,0.105623,0.151948,-0.046701,-0.02465,-0.023737,-0.14416,-0.045967,1.0


We see that the price is most strongly correlated with median income (`MedInc`), with a correlation of approximately 0.69.
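
Reading the strongest relationship off a 9x9 table by eye is error-prone. A small sketch, using synthetic data invented for this example (`Price` is constructed to depend mostly on `MedInc`), ranks features by the absolute value of their correlation with the target.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic data: Price is built to depend strongly on MedInc and weakly on HouseAge
toy = pd.DataFrame({
    "MedInc": rng.normal(4.0, 2.0, n),
    "HouseAge": rng.normal(30.0, 10.0, n),
    "Population": rng.normal(1500.0, 500.0, n),
})
toy["Price"] = 3.0 * toy["MedInc"] + 0.05 * toy["HouseAge"] + rng.normal(0.0, 1.0, n)

# Absolute correlation of each feature with the target, strongest first
ranking = (
    toy.corr()["Price"]
    .drop("Price")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
top_feature = ranking.index[0]
```

Applied to `dataset.corr()` from the notebook, the same expression would put `MedInc` first, matching the table above.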

#### GET Endpoint

This endpoint returns summary information about our dataset whenever it is called.

In [7]:
# GET /housing_stats
total_houses = len(dataset)
max_value = dataset['Price'].describe()['max']
min_value = dataset['Price'].describe()['min']
print(json.dumps({
    'total_houses': total_houses,
    'max_value': max_value,
    'min_value': min_value,
    'most_imp_feature': 'Median Income'
}))

{"total_houses": 20640, "max_value": 500000.99999999994, "min_value": 14999.000000000002, "most_imp_feature": "Median Income"}
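
The `max_value` and `min_value` in the response carry floating-point noise from multiplying the target by 100000. One possible cleanup, sketched with only the standard library, rounds the values before serializing. The helper name `build_stats_response` and the two-decimal precision are assumptions for this example, not part of the notebook.

```python
import json

def build_stats_response(total_houses, max_value, min_value, ndigits=2):
    """Return the stats payload as a JSON string with rounded prices."""
    return json.dumps({
        "total_houses": int(total_houses),
        "max_value": round(float(max_value), ndigits),
        "min_value": round(float(min_value), ndigits),
        "most_imp_feature": "Median Income",
    })

# Values copied from the endpoint output above
response = build_stats_response(20640, 500000.99999999994, 14999.000000000002)
print(response)
```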


### Machine Learning

Let's now train our model on the train data and measure its Mean Squared Error on the test data.

In [8]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Fit a random forest on the training split and evaluate it on the held-out test split
clf = RandomForestRegressor(n_estimators = 100, max_depth = 50)
clf.fit(X_train, y_train)
print("Mean Squared Error: {}".format(mean_squared_error(y_test, clf.predict(X_test))))

Mean Squared Error: 0.2616913553723617
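
Mean Absolute Error and Mean Squared Error are easy to mix up, and they are reported in different units. A minimal sketch on made-up values computes both with `mean_absolute_error` and `mean_squared_error` from `sklearn.metrics`; the toy arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up targets and predictions (units of $100,000, like fetched_data.target)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0, 4.5])

mae = mean_absolute_error(y_true, y_pred)   # mean of |error|
mse = mean_squared_error(y_true, y_pred)    # mean of error squared
rmse = np.sqrt(mse)                         # back in the target's units

print("MAE: {:.3f}  MSE: {:.3f}  RMSE: {:.3f}".format(mae, mse, rmse))
```

The value of about 0.26 returned by `mean_squared_error` above corresponds to an RMSE of roughly 0.51, which is about $51,000 because the target is in units of $100,000.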


#### POST Endpoint

Here, I'll train the model on the complete dataset and then use the POST endpoint to get the predicted price.

In [9]:
endpoint_classifier = RandomForestRegressor(n_estimators = 100, max_depth = 50)
endpoint_classifier.fit(fetched_data.data, fetched_data.target)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=50,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

Let's define a sample `REQUEST` object for our POST endpoint using the mean value of each feature in our dataset.

In [10]:
# Column-wise means of the 8 features (row 1 of describe() is the mean)
features = pd.DataFrame(fetched_data.data)
mean_values = features.describe().iloc[1, :]

REQUEST = json.dumps({
    'body': {
        'MedInc': mean_values[0],
        'HouseAge': mean_values[1],
        'AveRooms': mean_values[2],
        'AveBedrms': mean_values[3],
        'Population': mean_values[4],
        'AveOccup': mean_values[5],
        'Latitude': mean_values[6],
        'Longitude': mean_values[7]
    }
})

The endpoint will accept all the values from the user and return the predicted price. The data received is in the `body` part of the request.

In [11]:
# POST /get_price
req = json.loads(REQUEST)
req = np.array(list(req['body'].values()))
expected_price = endpoint_classifier.predict(req.reshape(1, -1))[0]
expected_price = "{0:.2f}".format(expected_price*100000)
print(json.dumps({
    'result': 'The price of the house with your specifications should be approximately: $' + expected_price
}))

{"result": "The price of the house with your specifications should be approximately: $87476.00"}
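
The cell above relies on the keys in `body` arriving in the same order as the training columns. A more defensive sketch, assuming a hypothetical helper `body_to_row`, looks each feature up by name and reports any that are missing. The feature list is copied from the dataset description; the request values are illustrative.

```python
import json

FEATURE_NAMES = [
    "MedInc", "HouseAge", "AveRooms", "AveBedrms",
    "Population", "AveOccup", "Latitude", "Longitude",
]

def body_to_row(body, feature_names=FEATURE_NAMES):
    """Return feature values as floats, ordered like the training columns."""
    missing = [name for name in feature_names if name not in body]
    if missing:
        raise ValueError("Missing features: " + ", ".join(missing))
    return [float(body[name]) for name in feature_names]

# A request whose keys arrive in a different order than the training columns
sample_request = json.dumps({"body": {
    "Longitude": -119.57, "Latitude": 35.63, "AveOccup": 3.07,
    "Population": 1425.0, "AveBedrms": 1.1, "AveRooms": 5.43,
    "HouseAge": 28.6, "MedInc": 3.87,
}})

row = body_to_row(json.loads(sample_request)["body"])
print(row)
```

The resulting list can be passed to the model as `np.array(row).reshape(1, -1)`, as in the cell above.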
