# Building a House Price Predictor API  
The client wants to forecast house prices so the company can identify properties worth investing in. Rather than relying on a registered valuer, they want an approach built on current technology. Jamie has collected data on house prices from the last few years and has asked to see what you can do with it.

# 1. Import Data

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv("./data/regressiondata.csv", index_col="ID")
data

ID,TransactionDate,HouseAge,DistanceToStation,NumberOfPubs,PostCode,HousePrice
0,2020.12,17.0,467.644775,4.0,5222.0,467104
1,2021.04,36.0,659.924963,3.0,5222.0,547714
2,2019.04,38.0,305.475941,7.0,5213.0,277232
3,2021.10,11.0,607.034754,5.0,5213.0,295958
4,2021.02,14.0,378.827222,5.0,5614.0,439963
...,...,...,...,...,...,...
9351,2019.07,36.0,554.324820,3.0,5217.0,420246
9352,2021.02,21.0,2296.349397,4.0,5614.0,256087
9353,2020.11,18.0,856.174897,0.0,5614.0,257663
9354,2021.10,6.0,87.260667,9.0,5614.0,681072


In [3]:
# First five rows, columns at positions 3 to 5 (NumberOfPubs, PostCode, HousePrice)
data.iloc[0:5, 3:6]

ID,NumberOfPubs,PostCode,HousePrice
0,4.0,5222.0,467104
1,3.0,5222.0,547714
2,7.0,5213.0,277232
3,5.0,5213.0,295958
4,5.0,5614.0,439963


In [4]:
# Row at position 9355, the last of the 9356 rows
data.iloc[9355]

TransactionDate         2020.12
HouseAge                   20.0
DistanceToStation    584.007146
NumberOfPubs                4.0
PostCode                 5614.0
HousePrice               403096
Name: 9355, dtype: object

# 2. Split Data to Prevent Snooping Bias

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
# Hold out 30% of the rows for testing; the fixed seed makes the split reproducible
train, test = train_test_split(data, test_size=0.3, random_state=88)

In [7]:
type(train)

pandas.core.frame.DataFrame

In [8]:
print(f"Train dimensions: {train.shape}")
print(f"Test dimensions: {test.shape}")

Train dimensions: (6549, 6)
Test dimensions: (2807, 6)
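
The two shapes above follow from how a fractional `test_size` is converted into row counts. As far as I understand scikit-learn's behaviour, the test count is rounded up and the training set receives the remainder. The sketch below reproduces that arithmetic with the standard library only; `split_sizes` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Hypothetical helper: turn a fractional test_size into (n_train, n_test)."""
    n_test = math.ceil(test_size * n_samples)  # test count is rounded up
    n_train = n_samples - n_test               # training gets the remainder
    return n_train, n_test


# 9356 rows in total: the train and test shapes printed above, added together
n_train, n_test = split_sizes(6549 + 2807, 0.3)
print(n_train, n_test)
```

With 9356 rows, 30% is 2806.8, which rounds up to 2807 test rows and leaves 6549 for training.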


# 3. Exploratory Data Analysis

## Bird's Eye View

In [14]:
train.iloc[0].to_dict()

{'TransactionDate': 2019.05,
 'HouseAge': 17.0,
 'DistanceToStation': 605.6898114,
 'NumberOfPubs': 6.0,
 'PostCode': 5614.0,
 'HousePrice': '355857'}

In [16]:
train.iloc[0].HouseAge

np.float64(17.0)

In [15]:
train.iloc[0].HousePrice

'355857'

In [9]:
train.dtypes

TransactionDate      float64
HouseAge             float64
DistanceToStation    float64
NumberOfPubs         float64
PostCode             float64
HousePrice            object
dtype: object

In [17]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6549 entries, 8505 to 6432
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   TransactionDate    6547 non-null   float64
 1   HouseAge           6546 non-null   float64
 2   DistanceToStation  6547 non-null   float64
 3   NumberOfPubs       6548 non-null   float64
 4   PostCode           6548 non-null   float64
 5   HousePrice         6544 non-null   object 
dtypes: float64(5), object(1)
memory usage: 358.1+ KB
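
The non-null counts above fall slightly short of the 6549 training rows, so every column is missing between one and five values. One way to get those counts directly is `isna().sum()`. The sketch below runs on a small toy frame with made-up values standing in for `train`; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, standing in for `train`
toy = pd.DataFrame({
    "HouseAge": [17.0, np.nan, 38.0, 11.0],
    "NumberOfPubs": [4.0, 3.0, np.nan, np.nan],
    "HousePrice": ["467104", "547714", None, "295958"],
})

# Missing values per column, largest first
missing = toy.isna().sum().sort_values(ascending=False)
print(missing)

# Rows that contain at least one missing value
rows_with_nulls = toy[toy.isna().any(axis=1)]
print(len(rows_with_nulls))
```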


## Analyse Numerical Attributes

### Plot Distributions

### Why isn't House Price Showing up as Numeric?
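
The outputs in the Bird's Eye View section show `HousePrice` stored as strings (`'355857'`) with dtype `object`. A common cause is at least one entry that cannot be parsed as a number. The sketch below is one way to locate such entries using `pd.to_numeric` with `errors="coerce"`. It uses a toy column, and `"??"` is an invented stand-in: the actual offending value in the dataset is not shown here.

```python
import pandas as pd

# Toy column with made-up values; "??" is an invented stand-in for a bad entry
prices = pd.Series(["467104", "547714", "??", None, "295958"], name="HousePrice")

# Unparseable strings become NaN instead of raising an error
as_number = pd.to_numeric(prices, errors="coerce")

# Entries that were present in the raw data but failed to parse
bad = prices[as_number.isna() & prices.notna()]
print(bad.tolist())
print(as_number.dtype)
```

Comparing the coerced result against `notna()` separates values that were genuinely missing from values that were present but malformed.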

### Drop Outlier 

### What's Happening with Pubs?

## Analyse Categorical Variables

## Analyse Relationships Numeric/Numeric

### Calculate Pearson's Correlation
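
A minimal sketch of this step, assuming the numeric columns have already been cleaned: `DataFrame.corr(method="pearson")` returns the pairwise correlation matrix, and the `HousePrice` column of that matrix gives each feature's correlation with the target. The toy frame reuses the five rows displayed in section 1, with `HousePrice` written as integers. Five rows are far too few to draw conclusions from, so the numbers it prints are illustrative only.

```python
import pandas as pd

# The five rows displayed in section 1, with HousePrice written as integers
toy = pd.DataFrame({
    "HouseAge": [17.0, 36.0, 38.0, 11.0, 14.0],
    "DistanceToStation": [467.644775, 659.924963, 305.475941, 607.034754, 378.827222],
    "NumberOfPubs": [4.0, 3.0, 7.0, 5.0, 5.0],
    "HousePrice": [467104, 547714, 277232, 295958, 439963],
})

# Pairwise Pearson correlation between all numeric columns
corr = toy.corr(method="pearson")

# Correlation of each feature with the target, strongest absolute value first
target_corr = corr["HousePrice"].drop("HousePrice")
order = target_corr.abs().sort_values(ascending=False).index
print(target_corr.reindex(order))
```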

## Analyse Relationships Cat/Num

### Is Post Code Driving Value?

### What About the Date It Was Sold?

# 4. Data Preprocessing

## Build Preprocessing Function

## Preview Preprocessed Data

## Clean up Analysis Features

## Create X and y values
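
A minimal sketch of this step, assuming `HousePrice` is the target and every other column is a feature. The toy frame reuses three rows displayed in section 1, and `TARGET` is an illustrative name.

```python
import pandas as pd

# Three rows displayed in section 1, with HousePrice written as integers
toy = pd.DataFrame({
    "HouseAge": [17.0, 36.0, 38.0],
    "NumberOfPubs": [4.0, 3.0, 7.0],
    "HousePrice": [467104, 547714, 277232],
})

TARGET = "HousePrice"  # assumed target column

X = toy.drop(columns=[TARGET])  # feature matrix: everything except the target
y = toy[TARGET]                 # target vector
print(X.shape, y.shape)
```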

# 5. Modelling

## Import ML Dependencies

## Create Pipelines

### Training Outside of a Pipeline

## Create Tuning Grids

## Train Models and Perform HPO

# 6. Evaluate Models

## Import Evaluation Metrics

## Preprocess Test Set For Predictions

### Look for Nulls

### Check Datatypes

### Create X_test and y_test 

## Calculate Regression Metrics
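
The notebook does not show which metrics it reports, so the sketch below assumes three common choices for regression: mean absolute error, root mean squared error and R². It is written with the standard library to make the arithmetic explicit. `regression_metrics` is a hypothetical helper and the input values are made up.

```python
import math


def regression_metrics(y_true, y_pred):
    """Hypothetical helper: MAE, RMSE and R^2 for paired actual/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Made-up values for illustration only
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

MAE and RMSE are in the same units as the target, which makes them easy to read as a typical pricing error, while R² is unitless and compares the model against always predicting the mean.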

## Make Predictions

## Explain Model

### Calculate Feature Importance

### Calculate Feature Importance for All Models

### Plot Trees

# 7. Save Model