# Understanding Over & Underfitting
## Predicting Boston Housing Prices

## Getting Started
In this project, you will use the Boston Housing Prices dataset to build several models to predict the prices of homes with particular qualities from the suburbs of Boston, MA.
We will build models with several different parameters, which will change the goodness of fit for each. 

In [None]:
# Import packages

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

---
## Data Exploration
Since we want to predict the value of houses, the **target variable**, `'MEDV'`, will be the variable we seek to predict.

### Import and explore the data. Clean the data for outliers and missing values. 
Download the data from [here](https://drive.google.com/file/d/1o-vZHHSywBksnPuGRunvpdYN7grYbe8h/view?usp=sharing) and place it in the `data` folder.

In [None]:
# Your code here

boston = pd.read_csv('../data/boston_data.csv')

In [None]:
boston.dtypes

# all data types are numerical

In [None]:
boston.isnull().sum()

# no missing values

In [None]:
#plt.figure(figsize=(16,10))

#sns.boxplot(x="variable", y="value", data=pd.melt(boston))

In [None]:
# Remove outliers
# Normally I wouldn't just remove all outliers like this without inspection and good reason
# but because this is just practise and I don't know what the variables mean I'm removing all rows with at least one outlier

boston2 = boston[(np.abs(stats.zscore(boston)) < 3).all(axis=1)]

In [None]:
print(boston.shape)
print(boston2.shape)

# 83 rows were removed because they contained at least one outlier
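
One side effect of the z-score filter is worth seeing on a small example. The sketch below uses hypothetical toy data (the column names `value` and `flag` are made up for illustration) to show that a rare binary value can sit more than three standard deviations from its column mean, so the filter removes every row that carries it.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data: one continuous column and one rare binary flag.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "value": rng.normal(loc=20.0, scale=5.0, size=200),
    "flag": np.r_[np.ones(10), np.zeros(190)],  # only 5% of rows are 1
})

# Same filter as in the cell above: keep rows where every column has |z| < 3.
z = np.abs(stats.zscore(toy))
filtered = toy[(z < 3).all(axis=1)]

# The rare flag value has |z| > 3, so all flagged rows are dropped.
print(toy["flag"].value_counts())
print(filtered["flag"].value_counts())
```

If a dataset contains dummy variables like this, it may be safer to apply the z-score filter to the continuous columns only.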

### Next, we want to explore the data. Pick several variables you think will be most correlated with the prices of homes in Boston, and create plots that show the data dispersion as well as the regression line of best fit.

In [None]:
# Your plots here

boston2.corr()

# from this table it appears that two variables are most correlated with house price (medv): 'rm' (r = 0.71) and 'lstat' (r = -0.75)

In [None]:
# I'll plot those two variables against medv

sns.lmplot(x ='lstat', y ='medv', data = boston2) 

In [None]:
sns.lmplot(x ='rm', y ='medv', data = boston2) 
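
`sns.lmplot` draws the fitted line but does not report its coefficients. As a rough sketch, `scipy.stats.linregress` returns the slope, intercept, and correlation of a simple linear fit. The data below is synthetic and only mimics a negative relationship like the one between `lstat` and `medv`; the numbers are assumptions, not values from the Boston data.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a predictor and a price column (hypothetical values).
rng = np.random.default_rng(1)
x = rng.uniform(0, 30, size=100)
y = 35.0 - 0.9 * x + rng.normal(scale=3.0, size=100)

# Ordinary least-squares fit of y on x.
fit = stats.linregress(x, y)
print("slope:", fit.slope)
print("intercept:", fit.intercept)
print("Pearson r:", fit.rvalue)
```

With the real data, `x` and `y` would be replaced by `boston2['lstat']` and `boston2['medv']`.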

### What do these plots tell you about the relationships between these variables and the prices of homes in Boston? Are these the relationships you expected to see in these variables?

In [None]:
# Your response here

# lstat seems to be negatively correlated with house price
# rm seems to be positively correlated with house price
# I don't know what lstat or rm mean, so I can't say what I expected to see

### Make a heatmap of the remaining variables. Are there any variables that you did not consider that have very high correlations? What are they?

In [None]:
# Your response here
corr = boston2.drop(columns=['lstat', 'rm']).corr()
sns.heatmap(corr)

# there are no other high correlations
# I do see that something strange is going on with variable 'chas'

In [None]:
boston2['chas'].value_counts()

# chas is 0.0 for every row in boston2 (most likely because the z-score filter dropped the rare chas == 1 rows),
# so the column is constant and should be removed for the rest of the analysis

### Calculate Statistics
Calculate descriptive statistics for housing price. Include the minimum, maximum, mean, median, and standard deviation. 

In [None]:
# Your code here

boston['medv'].describe()
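
`describe()` reports the median under the label `50%`. If the five requested statistics should appear by name, `Series.agg` is one way to do it. The sketch below uses a small hypothetical price series in place of `boston['medv']`.

```python
import pandas as pd

# Hypothetical prices (in $1000s) standing in for boston['medv'].
prices = pd.Series([21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15.0])

# Compute only the requested statistics, labelled by name.
# Note: pandas' "std" is the sample standard deviation (ddof=1).
summary = prices.agg(["min", "max", "mean", "median", "std"])
print(summary)
```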

----

## Developing a Model

### Implementation: Define a Performance Metric
What is the performance metric with which you will determine the performance of your model? Create a function that calculates this performance metric, and then returns the score. 

In [None]:
from sklearn.metrics import r2_score

def performance_metric(y_true, y_predict):
    """ Calculates and returns the performance score between 
        true and predicted values based on the metric chosen. """
    # Your code here:
    return r2_score(y_true, y_predict)
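
To make clear what `r2_score` measures, here is a minimal sketch that computes the coefficient of determination by hand, R² = 1 − SS_res / SS_tot, and compares it with scikit-learn's result. The helper name `r2_by_hand` and the four data points are illustrative only.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

y_true_demo = [3.0, -0.5, 2.0, 7.0]
y_pred_demo = [2.5, 0.0, 2.0, 8.0]
print(r2_by_hand(y_true_demo, y_pred_demo))
print(r2_score(y_true_demo, y_pred_demo))
```

A score of 1 means perfect predictions, 0 means the model does no better than predicting the mean, and negative values mean it does worse.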

### Implementation: Shuffle and Split Data
Split the data into training and testing datasets. Shuffle the data as well to remove any bias in selecting the training and test sets. 

In [None]:
# Your code here

y = boston2['medv']
X = boston2.drop('medv', axis = 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# according to the documentation train_test_split shuffles the data by default 
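
Because the split is shuffled randomly, every run of the cell above produces a different train/test partition and therefore slightly different scores. A small sketch of how `random_state` makes the split repeatable, using made-up arrays rather than the housing data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Fixing random_state gives the same shuffled split on every call.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("identical splits:", same)
print("train shape:", split_a[0].shape, "test shape:", split_a[1].shape)
```

The value 42 is arbitrary; any fixed integer works.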

----

## Analyzing Model Performance
Next, we are going to build a Random Forest Regressor, and test its performance with several different parameter settings.

### Learning Curves
Let's build the different models. Set the max_depth parameter to 2, 4, 6, 8, and 10 respectively. 

In [None]:
# Five separate RFR here with the given max depths

RFR2 = RandomForestRegressor(max_depth=2)
RFR4 = RandomForestRegressor(max_depth=4)
RFR6 = RandomForestRegressor(max_depth=6)
RFR8 = RandomForestRegressor(max_depth=8)
RFR10 = RandomForestRegressor(max_depth=10)

Now, plot the score for each model on the training set and on the testing set.

In [None]:
# Produce a plot with the score for the testing and training for the different max depths

models = [RFR2, RFR4, RFR6, RFR8, RFR10]
r2score_train = []
r2score_test = []

for model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_train)
    r2score_train.append(r2_score(y_train, y_pred))
    y_test_pred = model.predict(X_test)
    r2score_test.append(r2_score(y_test, y_test_pred))

In [None]:
df_r2 = pd.DataFrame(list(zip(r2score_train, r2score_test)), index = ['RFR2', 'RFR4', 'RFR6', 'RFR8', 'RFR10'], columns =['r2_train', 'r2_test'])


In [None]:
df_r2

In [None]:
# x and y must be passed as keyword arguments in recent seaborn versions
sns.lmplot(x='r2_train', y='r2_test', data=df_r2)
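
An alternative view, sketched here as one possible approach, is to plot both scores against `max_depth` itself, which makes the gap between training and testing performance visible at each depth. The example runs on synthetic data from `make_regression` so that it is self-contained; `n_estimators=50` and `random_state=0` are assumptions chosen for speed and repeatability, not settings from the cells above.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the housing features.
X_syn, y_syn = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

depths = [2, 4, 6, 8, 10]
train_scores, test_scores = [], []
for depth in depths:
    rf = RandomForestRegressor(max_depth=depth, n_estimators=50, random_state=0)
    rf.fit(Xtr, ytr)
    train_scores.append(r2_score(ytr, rf.predict(Xtr)))
    test_scores.append(r2_score(yte, rf.predict(Xte)))

# One line per dataset, with max_depth on the x-axis.
fig, ax = plt.subplots()
ax.plot(depths, train_scores, marker="o", label="train")
ax.plot(depths, test_scores, marker="o", label="test")
ax.set_xlabel("max_depth")
ax.set_ylabel("R^2")
ax.legend()
```

With the notebook's own data, `Xtr`, `Xte`, `ytr`, and `yte` would be replaced by `X_train`, `X_test`, `y_train`, and `y_test`.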


What do these results tell you about the effect of the depth of the trees on the performance of the model?

In [None]:
# Your response here

# Looking at the figure and the r2 scores in the table df_r2, the higher the max depth, the better the model performs
# The model with max depth = 10 has the highest r2 score for the train and the test set
# but the r2 scores for the models with max depths of 6,8 and 10 are very similar 

### Bias-Variance Tradeoff
When the model is trained with a maximum depth of 1, does the model suffer from high bias or from high variance? How about when the model is trained with a maximum depth of 10?

In [None]:
# Your response here

# When the model is trained with a maximum depth of 1 there is a high bias, 
# when the depth increases the bias becomes lower but the variance higher
# so max depth 10 has low bias but high variance
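
The models above start at a depth of 2, so the depth-1 case in the question is never actually fitted. The following sketch compares depths 1 and 10 directly on synthetic data (an assumption made to keep the example self-contained). A low score on both sets suggests high bias, while a large gap between the training and testing scores suggests high variance.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data; the real notebook would use X_train, X_test, y_train, y_test.
X_bv, y_bv = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)
Xtr_bv, Xte_bv, ytr_bv, yte_bv = train_test_split(X_bv, y_bv, test_size=0.2, random_state=0)

scores = {}
for depth in (1, 10):
    rf = RandomForestRegressor(max_depth=depth, n_estimators=50, random_state=0)
    rf.fit(Xtr_bv, ytr_bv)
    # For regressors, .score() returns R^2.
    scores[depth] = {"train": rf.score(Xtr_bv, ytr_bv), "test": rf.score(Xte_bv, yte_bv)}

for depth, s in scores.items():
    gap = s["train"] - s["test"]
    print(f"max_depth={depth}: train={s['train']:.3f} test={s['test']:.3f} gap={gap:.3f}")
```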

### Best-Guess Optimal Model
What is the max_depth parameter that you think would optimize the model? Run your model and explain its performance.

In [None]:
# Your response here

# I think max depth of 6 as a higher max depth doesn't improve the model score much
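
The question also asks to run the model. One way to back up a best guess, sketched here with a technique the cells above do not use, is a cross-validated grid search over `max_depth`. It scores each depth on several held-out folds instead of a single test split. The data is synthetic and the forest settings (`n_estimators=50`, `random_state=0`) are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data; with the notebook's data, fit on X_train and y_train instead.
X_cv, y_cv = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)

search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8, 10]},
    scoring="r2",  # same metric as performance_metric
    cv=5,          # 5-fold cross-validation
)
search.fit(X_cv, y_cv)

print("best max_depth:", search.best_params_["max_depth"])
print("mean cross-validated R^2:", round(search.best_score_, 3))
```

If the best depth's cross-validated score is only marginally above that of a shallower depth, preferring the shallower model is a reasonable choice.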

### Applicability
*In a few sentences, discuss whether the constructed model should or should not be used in a real-world setting.*  
**Hint:** Some questions to answer:
- *How relevant today is data that was collected in 1978?*
- *Are the features present in the data sufficient to describe a home?*
- *Is the model robust enough to make consistent predictions?*
- *Would data collected in an urban city like Boston be applicable in a rural city?*

In [None]:
# Your response here

# I guess the data isn't very relevant today anymore as the housing situation has changed a lot since 1978
# I don't know what the columns/ features mean (did I miss something?)
# The model does seem to score quite well (or am I wrong?)
# the data collected in Boston would probably not be representative for a rural city