
# Boston House Prices
In this notebook we build a regression model that estimates house prices (in thousands of dollars) from housing factors such as the neighborhood crime rate, schooling (pupil-teacher ratio), the percentage of lower-status population, etc.

We will apply and elaborate on the steps seen in the first Workshop (where applicable).





**Data Science Cycle:**

    Data Understanding
        0. Exploratory Data Analysis
        
    Data Preparation
        1. Target Definition
        2. Data Splitting
        3. Feature Engineering
    
    Modeling
        4. Variable Selection
        5. Model Selection
        6. Fine-tuning

    Evaluation
        7. Evaluation & Interpretation


# Set Up
Import required libraries

In [None]:
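# Installs - open a terminal in the notebook's environment
# (e.g. Environments > base (root) > Open Terminal) and run the commands below.
# Notes: the PyPI package is named scikit-learn (not sklearn); datetime is part of
# the standard library and needs no install; load_boston and plot_partial_dependence
# were removed in scikit-learn 1.2, so an older release is required.
'''
pip install xgboost
pip install numpy
pip install scipy
pip install statsmodels==0.10.2
pip install "scikit-learn<1.2"
pip install boruta
'''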

# Boston Data
from sklearn.datasets import load_boston

# Data Manipulation & analysis
import pandas as pd
import numpy as np
from scipy import stats

# Visualizations
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# Data Preparation
from sklearn.preprocessing import MinMaxScaler 
from sklearn.model_selection import train_test_split

# Modeling
!pip install boruta
from boruta import BorutaPy 
import statsmodels.api as sm 
import sklearn
from sklearn.model_selection import cross_val_score 
from sklearn.model_selection import RandomizedSearchCV 
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn import ensemble
!pip install xgboost
import xgboost as xgb

# Evaluation
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import KFold
from sklearn.inspection import plot_partial_dependence

# Other Set Up
from datetime import datetime
import os
import warnings
from pprint import pprint

In [None]:
#!pip install tensorflow

Set styles

In [None]:
# Set style for displaying data
%matplotlib inline
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# Set style for plotting
sns.set_style("whitegrid")

Set Timer

In [None]:
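def timer(start_time=None):
    """Return the current time if called without arguments; otherwise print the time elapsed since start_time."""
    if start_time is None:
        # First call: hand back a timestamp to pass in later
        return datetime.now()
    # Second call: split the elapsed seconds into hours, minutes and seconds
    thour, temp_sec = divmod((datetime.now() - start_time).total_seconds(), 3600)
    tmin, tsec = divmod(temp_sec, 60)
    print('\n Time taken: %i hours %i minutes and %s seconds.' % (thour, tmin, round(tsec, 2)))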
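The timer is meant to be called twice: once without arguments to get a start timestamp, and once with that timestamp to print the elapsed time. A minimal usage sketch follows; the function is repeated so the snippet runs on its own, and `time.sleep` is only a stand-in for a long-running step such as a hyperparameter search.

```python
import time
from datetime import datetime


def timer(start_time=None):
    # Same logic as the notebook's timer, repeated here for a standalone example
    if not start_time:
        return datetime.now()
    thour, temp_sec = divmod((datetime.now() - start_time).total_seconds(), 3600)
    tmin, tsec = divmod(temp_sec, 60)
    print('\n Time taken: %i hours %i minutes and %s seconds.' % (thour, tmin, round(tsec, 2)))


start = timer()         # first call: returns the start timestamp
time.sleep(0.1)         # placeholder for the work being timed
result = timer(start)   # second call: prints the elapsed time and returns None
```

In the notebook, `start = timer()` would go right before a model fit and `timer(start)` right after it.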

Fix random seed for reproducibility

In [None]:
np.random.seed(42)
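Seeding NumPy's global generator makes later random draws repeatable, which is what keeps splits and sampled hyperparameters stable between runs when no explicit `random_state` is passed. A small sketch of the effect, using only NumPy (the variable names are illustrative):

```python
import numpy as np

# Seed, draw, then re-seed and draw again
np.random.seed(42)
first_draw = np.random.rand(3)

np.random.seed(42)
second_draw = np.random.rand(3)

# Both draws come from the same generator state, so they match
print(np.array_equal(first_draw, second_draw))
```

Re-running a single cell without re-seeding continues from the current generator state, so passing `random_state` explicitly to splitters and estimators is usually the more robust option.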

# Exploratory Data Analysis

We start by exploring the Boston data (Step 0) and setting the target (Step 1).

The Boston dataset is so widely used in machine learning experiments that it ships with scikit-learn (in releases before 1.2; `load_boston` was removed in 1.2).

In [None]:
# Read Data
boston = load_boston()

A detailed description of the dataset and its features is available in the `DESCR` attribute of the loaded object.

Create a pandas DataFrame with observations in rows and features in columns, and define the target.
In this study the target is already given: the housing price, i.e. `MEDV`, the median value of owner-occupied homes in $1000s, stored below in a column named `PRICE`.

In [None]:
# Build a pandas DataFrame from the feature matrix
boston_data = pd.DataFrame(boston.data)

# Set column names
boston_data.columns = boston.feature_names

# Put the target (MEDV) in its own DataFrame, under the column name PRICE
boston_target = pd.DataFrame(boston.target)
boston_target.columns = ['PRICE']

# Merge features and target into one DataFrame on the shared row index
boston_df = pd.merge(boston_data, boston_target, left_index=True, right_index=True)

# Set X and Y (features and target)
X = pd.DataFrame(boston.data, columns=boston.feature_names)
Y = boston.target
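Merging on the index works because the feature matrix and the target share the same default row order. The sketch below reproduces that step on a tiny stand-in dataset and compares it with a shorter alternative, `DataFrame.assign`. The three feature names are real Boston columns, but the values and the variable names (`data`, `target`, `features`) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for boston.data, boston.target and boston.feature_names
feature_names = ["CRIM", "RM", "LSTAT"]
data = np.array([[0.1, 6.5, 5.0],
                 [0.3, 6.0, 9.1],
                 [0.2, 7.1, 4.0]])
target = np.array([24.0, 21.6, 34.7])

features = pd.DataFrame(data, columns=feature_names)

# Route used in the notebook: merge two frames on their shared default index
merged = pd.merge(features, pd.DataFrame(target, columns=["PRICE"]),
                  left_index=True, right_index=True)

# Alternative: attach the target directly as a new column
assigned = features.assign(PRICE=target)

print(merged.shape, assigned.shape)
```

Both routes rely on rows being in the same order; if either frame had been shuffled or filtered beforehand, the index-based merge would be the safer choice.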

In [None]:
# Check first few rows of data
boston_df.head()

In [None]:
# Get descriptive statistics
boston_df.describe()
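The `count` row of `describe()` only counts non-missing values, so comparing it with the number of rows is a quick way to spot gaps before modeling. A small sketch on a toy frame (made-up values; in the notebook the frame would be `boston_df`):

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value
toy_df = pd.DataFrame({
    "RM": [6.5, 6.0, np.nan, 7.1],
    "LSTAT": [5.0, 9.1, 4.0, 3.0],
    "PRICE": [24.0, 21.6, 34.7, 33.4],
})

summary = toy_df.describe()

# Rows minus the non-null count gives the number of missing values per column
missing_from_count = len(toy_df) - summary.loc["count"]

# The same information, computed directly
missing_direct = toy_df.isna().sum()

print(missing_from_count)
print(missing_direct)
```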