### HOUSE PRICE PREDICTION AND HYPERPARAMETER TUNING

**Project Goal**: Use the King County house sales data from the [Geo Data and Lab](https://geodacenter.github.io/data-and-lab/KingCounty-HouseSales2015/) website to predict house prices with Random Forest regression, then tune the model's hyperparameters to improve its performance.

In [4]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

In [5]:
df = pd.read_csv("kc_house_data.csv")
df.head()

| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |


**Exploratory Data Analysis**

In [9]:
print(f"The dataset contains {df.shape[0]} samples and " f"{df.shape[1]} features")

The dataset contains 21613 samples and 21 features


In [10]:
df.isnull().sum()

id               0
date             0
price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
dtype: int64
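
Every count above is zero, so no imputation is needed before modelling. As a minimal sketch of how the same check can be reduced to a single flag, the example below uses a small made-up frame (`toy` and its values are illustrative assumptions, not rows from the dataset):

```python
import pandas as pd

# Toy frame with one deliberately missing price; values are made up.
toy = pd.DataFrame({"price": [221900.0, None, 180000.0], "bedrooms": [3, 3, 2]})

# Per-column count of missing cells, the same call used on df above.
missing_per_column = toy.isnull().sum()

# Collapse to a single boolean: True if any cell in the frame is missing.
has_missing = bool(toy.isnull().any().any())

print(missing_per_column)
print("Any missing values:", has_missing)
```

On the real data, `df.isnull().any().any()` would be expected to evaluate to `False`, given the all-zero counts shown above.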

- We drop the `id` and `date` columns because they are not useful for prediction. We also separate the target variable (`price`) from the predictor variables.

In [11]:
X = df.drop(["id", "price", "date"], axis=1)
y = df["price"]

- Next, we separate the categorical variables from the numerical ones using the column selector from sklearn's `compose` module.

In [12]:
from sklearn.compose import make_column_selector as selector

categorical_column_selector = selector(dtype_include=object)
categorical_columns = categorical_column_selector(X)
categorical_columns

[]
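
The empty list means that no column in `X` has the `object` dtype, so every predictor is stored as a number. To illustrate what the selector does when string columns are present, here is a small sketch on a made-up frame (the `toy` frame and its `label` column are assumptions for illustration only):

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Toy frame with one explicitly object-typed column and two numeric ones.
toy = pd.DataFrame({
    "label": pd.Series(["a", "b", "c"], dtype=object),
    "bedrooms": [3, 4, 2],
    "price": [221900.0, 538000.0, 180000.0],
})

# A selector is a callable: pass it a DataFrame, get back a list of column names.
object_columns = selector(dtype_include=object)(toy)
numeric_columns = selector(dtype_include="number")(toy)

print(object_columns)
print(numeric_columns)
```

Because the selector only looks at dtypes, numerically encoded categories such as `waterfront` are not picked up by `dtype_include=object`.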

In [13]:
# Identify the numerical variables.

num_vars = [var for var in X.columns if var not in categorical_columns]

# Number of numerical variables
print(len(num_vars))
print(num_vars)

18
['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15']


- Note: Some variables are categorical even though they are stored as numbers. We treat every numerical variable with fewer than 20 unique values as categorical and move it to its own list.

In [14]:
# List of discrete variables
categorical_vars = [var for var in num_vars if len(X[var].unique()) < 20]

print("Number of categorical variables: ", len(categorical_vars))
print(categorical_vars)

Number of categorical variables:  6
['bedrooms', 'floors', 'waterfront', 'view', 'condition', 'grade']
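
The unique-value cutoff can also be expressed with `DataFrame.nunique`, which counts the distinct values in each column in one call. The sketch below applies the same idea to a made-up frame; the cutoff of 3 is a hypothetical value chosen for the toy data, whereas the notebook uses 20:

```python
import pandas as pd

# Toy frame: one binary flag and one measurement; values are made up.
toy = pd.DataFrame({
    "waterfront": [0, 1, 0, 0, 1, 0],
    "sqft_living": [1180, 2570, 770, 1960, 1680, 2100],
})

max_levels = 3  # hypothetical cutoff for this toy frame

# Number of distinct values per column.
unique_counts = toy.nunique()

# Columns below the cutoff are treated as discrete/categorical.
discrete = unique_counts[unique_counts < max_levels].index.tolist()

print(unique_counts)
print(discrete)
```

The cutoff is a heuristic: a column with few distinct values is not necessarily categorical, so the resulting list is worth reviewing by hand.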


In [15]:
# Preview the categorical variables.

X[categorical_vars].head()

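| | bedrooms | floors | waterfront | view | condition | grade |
|---|---|---|---|---|---|---|
| 0 | 3 | 1.0 | 0 | 0 | 3 | 7 |
| 1 | 3 | 2.0 | 0 | 0 | 3 | 7 |
| 2 | 2 | 1.0 | 0 | 0 | 3 | 6 |
| 3 | 4 | 1.0 | 0 | 0 | 5 | 7 |
| 4 | 3 | 1.0 | 0 | 0 | 3 | 8 |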


- The remaining numerical variables will now be classified as continuous variables.

In [17]:
# List of continuous variables.
cont_vars = [var for var in num_vars if var not in categorical_vars]

print("Number of continuous variables: ", len(cont_vars))
print(cont_vars)

Number of continuous variables:  12
['bathrooms', 'sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15']


In [18]:
# Preview the continuous variables
X[cont_vars].head()

| | bathrooms | sqft_living | sqft_lot | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1180 | 5650 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 2.25 | 2570 | 7242 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 1.0 | 770 | 10000 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 3.0 | 1960 | 5000 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 2.0 | 1680 | 8080 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |
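
The two lists should split the predictors cleanly: 6 categorical plus 12 continuous variables account for all 18 columns. A quick sanity check along these lines can catch a column that ends up in both lists or in neither. The sketch below uses a shortened, illustrative column list rather than the full set:

```python
# Shortened column list for illustration; the notebook has 18 predictors.
columns = ["bedrooms", "bathrooms", "waterfront", "sqft_living"]
discrete = ["bedrooms", "waterfront"]

# Same construction as cont_vars: everything not marked as discrete.
continuous = [c for c in columns if c not in discrete]

# The two lists must not overlap and must cover every column.
assert set(discrete).isdisjoint(continuous)
assert set(discrete) | set(continuous) == set(columns)

print(continuous)
```

In the notebook, the same two assertions could be run with `categorical_vars`, `cont_vars`, and `X.columns` in place of the toy lists.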
