# Understanding Over & Underfitting
## Predicting Boston Housing Prices

## Getting Started
In this project, you will use the Boston Housing Prices dataset to build several models to predict the prices of homes with particular qualities from the suburbs of Boston, MA.
We will build models with several different parameters, which will change the goodness of fit for each. 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

---
## Data Exploration
Since we want to predict the value of houses, the **target variable** is `'MEDV'`, which appears as the `medv` column in the CSV file.

### Import and explore the data. Clean the data for outliers and missing values. 
Download the data from [here](https://drive.google.com/file/d/1o-vZHHSywBksnPuGRunvpdYN7grYbe8h/view?usp=sharing) and place it in the `data` folder.

In [None]:
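# Your code here
boston = pd.read_csv('../data/boston_data.csv')
# Print the missing-value counts explicitly; only the last expression in a cell is displayed
print(boston.isna().sum())
boston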

In [None]:
boston.describe()

In [None]:
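from scipy import stats
# Absolute z-score of every value, computed column by column
z = np.abs(stats.zscore(boston))
threshold = 3
# Row and column positions of values more than `threshold` standard deviations from the column mean
np.where(z > threshold)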

In [None]:
# Keep only the rows in which every column is within the z-score threshold
boston = boston[(z < threshold).all(axis=1)]
boston
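The row filter above can be easier to follow on a small example. The sketch below uses a hypothetical toy frame (the values are made up; only the column names echo the dataset) and computes the z-scores with plain pandas, using the population standard deviation to match the default of `scipy.stats.zscore`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 20 ordinary rows plus one row with an extreme 'crim' value
toy = pd.DataFrame({
    "crim": np.append(np.linspace(0.5, 1.5, 20), 50.0),
    "rm": np.append(np.linspace(5.0, 7.0, 20), 6.0),
})

# Column-wise absolute z-scores (ddof=0 gives the population standard deviation)
z_toy = np.abs((toy - toy.mean()) / toy.std(ddof=0))

threshold = 3
# A row survives only if all of its columns are within the threshold
toy_clean = toy[(z_toy < threshold).all(axis=1)]
print(len(toy), "rows before,", len(toy_clean), "rows after")
```

Because `.all(axis=1)` requires every column to pass, a single extreme value is enough to drop the whole row.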

### Next, we want to explore the data. Pick several variables you think will be most correlated with the prices of homes in Boston, and create plots that show the data dispersion as well as the regression line of best fit.

In [None]:
# Your plots here
# crim & medv (per capita crime rate by town & median value of owner-occupied homes in $1000s)
# rm & medv (average number of rooms per dwelling & median value of owner-occupied homes in $1000s)
# lstat & medv (lower status of the population (percent) & median value of owner-occupied homes in $1000s)
# tax & medv (full-value property-tax rate per $10,000 & median value of owner-occupied homes in $1000s)

sns.regplot(x="crim", y="medv", data=boston)
#crim & medv seem to be negatively correlated

In [None]:
sns.regplot(x="rm", y="medv", data=boston)
#rm & medv seem to be positively correlated

In [None]:
sns.regplot(x="lstat", y="medv", data=boston)
#lstat & medv seem to be negatively correlated

In [None]:
sns.regplot(x="tax", y="medv", data=boston)
#tax & medv seem to be negatively correlated

### What do these plots tell you about the relationships between these variables and the prices of homes in Boston? Are these the relationships you expected to see in these variables?

In [None]:
# Your response here
#crim & medv seem to be negatively correlated - expected
#rm & medv seem to be positively correlated - expected
#lstat & medv seem to be negatively correlated - expected
#tax & medv seem to be negatively correlated - expected

### Make a heatmap of the remaining variables. Are there any variables that you did not consider that have very high correlations? What are they?

In [None]:
# Your response here
plt.figure(figsize=(10,10))
sns.heatmap(boston.corr(), annot=True)
#medv is highly correlated with the values that I considered in the previous step. 
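A heatmap can be hard to scan when there are many columns, so ranking the correlations with the target may help spot variables that were overlooked. This is a minimal sketch on hypothetical synthetic data: the frame `toy`, its coefficients, and the `noise` column are made up for illustration, and only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing frame
rng = np.random.default_rng(0)
n = 200
rm = rng.normal(6.0, 0.7, n)
lstat = rng.normal(12.0, 5.0, n)
toy = pd.DataFrame({
    "rm": rm,
    "lstat": lstat,
    "noise": rng.normal(0.0, 1.0, n),  # unrelated to the target by construction
    "medv": 5.0 * rm - 0.8 * lstat + rng.normal(0.0, 1.0, n),
})

# Correlation of every feature with the target, ordered by absolute strength
target_corr = toy.corr()["medv"].drop("medv")
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

On the real data, the same two lines with `boston` in place of `toy` would list every feature, including those not plotted earlier.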

### Calculate Statistics
Calculate descriptive statistics for housing price. Include the minimum, maximum, mean, median, and standard deviation. 

In [None]:
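# Your code here
# Restrict the summary to the target column and the statistics requested above
boston['medv'].agg(['min', 'max', 'mean', 'median', 'std'])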

----

## Developing a Model

### Implementation: Define a Performance Metric
What is the performance metric with which you will determine the performance of your model? Create a function that calculates this performance metric and returns the score.

In [None]:
from sklearn.metrics import r2_score

def performance_metric(y_true, y_predict):
    """ Calculates and returns the performance score between 
        true and predicted values based on the metric chosen. """
    # Your code here:
    return r2_score(y_true,y_predict)
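To see what the R² score measures, it can help to compute it by hand. The sketch below implements the standard definition, one minus the ratio of the residual sum of squares to the total sum of squares; the helper name `r2_by_hand` and the demo values are hypothetical.

```python
import numpy as np

def r2_by_hand(y_true, y_predict):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_predict = np.asarray(y_predict, dtype=float)
    ss_res = np.sum((y_true - y_predict) ** 2)      # error left after prediction
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

y_demo = [3.0, 5.0, 7.0, 9.0]
perfect = r2_by_hand(y_demo, y_demo)                   # predictions match exactly
baseline = r2_by_hand(y_demo, [6.0] * 4)               # always predict the mean
reversed_fit = r2_by_hand(y_demo, [9.0, 7.0, 5.0, 3.0])  # worse than the mean
print(perfect, baseline, reversed_fit)
```

A score of 1 means perfect predictions, 0 means no better than predicting the mean, and negative values mean the model does worse than that baseline.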

### Implementation: Shuffle and Split Data
Split the data into training and testing datasets. Shuffle the data as well to remove any bias in selecting the training and test sets.

In [None]:
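# Your code here
from sklearn.model_selection import train_test_split
X = boston.drop('medv', axis=1)
y = boston['medv']
# shuffle=True is the default; random_state=42 is an arbitrary seed added so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=True, random_state=42)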

----

## Analyzing Model Performance
Next, we are going to build a Random Forest Regressor, and test its performance with several different parameter settings.

### Learning Curves
Let's build the different models. Set the `max_depth` parameter to 2, 4, 6, 8, and 10 respectively.

In [None]:
# Five separate RFR here with the given max depths
from sklearn.ensemble import RandomForestRegressor
model_2=RandomForestRegressor(max_depth=2)
model_4=RandomForestRegressor(max_depth=4)
model_6=RandomForestRegressor(max_depth=6)
model_8=RandomForestRegressor(max_depth=8)
model_10=RandomForestRegressor(max_depth=10)

In [None]:
# fit() returns the fitted estimator itself, so train_N and model_N refer to the same object
train_2 = model_2.fit(X_train, y_train)
train_4 = model_4.fit(X_train, y_train)
train_6 = model_6.fit(X_train, y_train)
train_8 = model_8.fit(X_train, y_train)
train_10 = model_10.fit(X_train, y_train)

# Predictions of the shallowest model on the held-out test set
pred_2 = train_2.predict(X_test)


Now, plot the score of each model on the training set and on the testing set.

In [None]:
# Produce a plot with the score for the testing and training for the different max depths
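One possible way to produce such a plot is sketched below. It is self-contained, so it uses synthetic data from `make_regression` as a stand-in for the housing data; the `Xd_*`/`yd_*` names, `n_estimators=50`, and the fixed `random_state` values are assumptions for illustration, not settings taken from this notebook.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Assumption: synthetic data stands in for X_train, X_test, y_train, y_test
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

depths = [2, 4, 6, 8, 10]
train_scores, test_scores = [], []
for depth in depths:
    model = RandomForestRegressor(max_depth=depth, n_estimators=50, random_state=0)
    model.fit(Xd_train, yd_train)
    # Score the same fitted model on the data it saw and on held-out data
    train_scores.append(r2_score(yd_train, model.predict(Xd_train)))
    test_scores.append(r2_score(yd_test, model.predict(Xd_test)))

plt.plot(depths, train_scores, marker="o", label="training R²")
plt.plot(depths, test_scores, marker="o", label="testing R²")
plt.xlabel("max_depth")
plt.ylabel("R² score")
plt.legend()
plt.show()
```

With the notebook's own variables, the loop would fit on `X_train`, `y_train` and score on both splits in the same way. A widening gap between the two curves as depth grows is the usual sign of overfitting.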



What do these results tell you about the effect of the depth of the trees on the performance of the model?

In [None]:
# Your response here

### Bias-Variance Tradeoff
When the model is trained with a maximum depth of 1, does the model suffer from high bias or from high variance? How about when the model is trained with a maximum depth of 10?

In [None]:
# Your response here

### Best-Guess Optimal Model
What is the max_depth parameter that you think would optimize the model? Run your model and explain its performance.

In [None]:
# Your response here
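Rather than judging the best depth from a single train/test split, one option is to compare depths with k-fold cross-validation. This is a hedged sketch, not the notebook's prescribed method: it runs on synthetic data, and `cv=5`, `n_estimators=50`, and the variable names are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Assumption: synthetic data stands in for the housing features and target
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=15.0, random_state=0)

cv_means = {}
for depth in [2, 4, 6, 8, 10]:
    model = RandomForestRegressor(max_depth=depth, n_estimators=50, random_state=0)
    # For a regressor, the default cross_val_score metric is R²
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    cv_means[depth] = scores.mean()

# Depth with the highest mean cross-validated R²
best_depth = max(cv_means, key=cv_means.get)
print(cv_means)
print("best max_depth:", best_depth)
```

Averaging over several folds makes the comparison less sensitive to which rows happen to land in the test set.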

### Applicability
*In a few sentences, discuss whether the constructed model should or should not be used in a real-world setting.*  
**Hint:** Some questions to answer:
- *How relevant today is data that was collected in 1978?*
- *Are the features present in the data sufficient to describe a home?*
- *Is the model robust enough to make consistent predictions?*
- *Would data collected in an urban city like Boston be applicable in a rural city?*

In [None]:
# Your response here