## Regression Tree or Other Algorithm Exercise

Kaggle hosts a dataset of house sale prices for King County, which includes Seattle.

You can download the dataset from [Kaggle](https://www.kaggle.com/harlfoxem/housesalesprediction) or from my [GitHub](https://raw.githubusercontent.com/mGalarnyk/Tutorial_Data/master/King_County/kingCountyHouseData.csv).


Your goal is to do the following:

1. The challenge is to predict house prices based on whichever features in the dataset you choose. If you don't know what the features in the dataset mean, you can look at the documentation on Kaggle (you don't need an account to view feature information).
2. Do some exploratory data analysis.
3. For this notebook, use cross-validation and grid search. While I haven't shown the code for how to do this in the course, spend some time figuring it out.


For this particular notebook, you need to install folium if you want to run all the cells.

Option 1:
`pip install folium`

Option 2 (Anaconda):
`conda install -c conda-forge folium`

### Import Libraries

```python
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn import tree
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
import folium
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import make_scorer
```

### Load the Data

### Exploratory Data Analysis
This is not a very big dataset and we do not have too many features. Creating plots and examining the data before applying a model is good practice because we may find some outliers or decide to normalize the data. This is not a must, but getting to know the data is always good.

#### Histograms for Continuous Data

#### Bin the histogram into quartiles so we can have some more balanced bins and reasonable colors
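A minimal sketch of quartile binning with `pd.qcut`, compared against equal-width bins from `pd.cut`. The data here is a synthetic, right-skewed stand-in for the `price` column, and the `price_quartile` column name is an assumption for illustration, not the notebook's original code.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed prices standing in for the real `price` column.
rng = np.random.default_rng(0)
df = pd.DataFrame({"price": rng.lognormal(mean=13, sigma=0.5, size=1000)})

# Equal-width bins: the long right tail leaves the bins very unevenly filled.
equal_width = pd.cut(df["price"], bins=4).value_counts().sort_index()

# Quantile bins: each bin holds roughly a quarter of the rows.
df["price_quartile"] = pd.qcut(df["price"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
quartile_counts = df["price_quartile"].value_counts().sort_index()

print(equal_width)
print(quartile_counts)
```

Balanced bins like these are convenient when mapping each bin to a color, since no single color dominates the plot.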

### Working with Multiple Features
One benefit of modeling is the ability to reason about hundreds of features at once. There is no limit to the number of features you can use. However, a small set of features often accounts for most of the variance (assuming there is a linear relationship at all). A relatively good way to choose features is to plot a correlation matrix (though with a lot of variables, the matrix can be a bit overwhelming).
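As a rough illustration of the idea, the sketch below computes a correlation matrix with `DataFrame.corr()` and ranks features by their correlation with the target. The data is synthetic; the column names `sqft_living` and `bedrooms` are assumed to match the Kaggle dataset, and `unrelated` is a made-up noise column.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (the coefficients are arbitrary).
rng = np.random.default_rng(0)
n = 500
sqft_living = rng.uniform(500, 4000, n)
bedrooms = rng.integers(1, 6, n)
df = pd.DataFrame({
    "price": 200 * sqft_living + 5000 * bedrooms + rng.normal(0, 50000, n),
    "sqft_living": sqft_living,
    "bedrooms": bedrooms,
    "unrelated": rng.normal(size=n),  # hypothetical column with no signal
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()

# Rank the features by how strongly they correlate with the target.
price_corr = corr["price"].drop("price").sort_values(ascending=False)
print(price_corr)
```

Passing `corr` to `sns.heatmap(corr, annot=True)` is one common way to view the same matrix as a plot.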

#### Approach 1 to view the correlation matrix

#### Approach 2 to view the correlation matrix

### Arrange Data into Features Matrix and Target Vector

### Split Data into Training and Test Sets
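A minimal sketch of arranging a features matrix `X` and target vector `y`, then splitting them, assuming a DataFrame with a `price` target. The synthetic data and the choice of `sqft_living` and `bedrooms` as features are illustrative assumptions only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing DataFrame.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "sqft_living": rng.uniform(500, 4000, n),
    "bedrooms": rng.integers(1, 6, n),
})
df["price"] = 200 * df["sqft_living"] + 5000 * df["bedrooms"] + rng.normal(0, 50000, n)

# Features matrix (2D) and target vector (1D).
features = ["sqft_living", "bedrooms"]
X = df[features]
y = df["price"]

# Hold out 25% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)
```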

### Make a Model to Show What we Have to Improve Upon
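One possible baseline, sketched here on synthetic data from `make_regression` rather than the housing data: a shallow `DecisionTreeRegressor` scored with R^2 on the test set. The `max_depth=2` setting is an arbitrary assumption chosen only to give a simple starting point.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the housing features and prices.
X, y = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A deliberately simple tree to serve as the score to beat.
baseline = DecisionTreeRegressor(max_depth=2, random_state=0)
baseline.fit(X_train, y_train)

# For regressors, .score() returns R^2 on the given data.
baseline_r2 = baseline.score(X_test, y_test)
print(f"Baseline test R^2: {baseline_r2:.3f}")
```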

## Tune the Depth of a Tree
Finding the optimal value for `max_depth` is one way to tune your model. The code below outputs the R^2 for regression trees with different values of `max_depth`.

Since the graph below shows that the model's R^2 is highest when `max_depth` is greater than or equal to 3, it might be best to choose the least complicated model, with `max_depth = 3`.

This is not an ideal approach, as you will see in the next section.
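The depth sweep described above can be sketched as follows. This is an illustration on synthetic `make_regression` data, with an assumed range of depths from 1 to 10; it is not the notebook's original cell.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the housing features and prices.
X, y = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

max_depth_range = list(range(1, 11))
r2_scores = []

# Fit one tree per depth and record its R^2 on the test set.
for depth in max_depth_range:
    reg = DecisionTreeRegressor(max_depth=depth, random_state=0)
    reg.fit(X_train, y_train)
    r2_scores.append(reg.score(X_test, y_test))

for depth, score in zip(max_depth_range, r2_scores):
    print(f"max_depth={depth:2d}  test R^2={score:.3f}")
```

Note that every candidate depth is scored against the same test set, so the test set is being used to pick the model.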

### Fine-tuning a Machine Learning Model Via Grid Search
In machine learning, we have two types of parameters. The first are those learned from the training data, for example the weights in linear regression. The second are parameters of the learning algorithm that are set separately rather than learned. The latter are tuning parameters, also known as hyperparameters. The code we will use finds the best combination of hyperparameter values using grid search.

Grid search is a brute-force exhaustive search: we specify a list of values for each hyperparameter, and the grid search algorithm evaluates the model performance for every combination to find the best combination of values from this set. If we did this with a single train/test split, we would be reusing the same test set over and over again to choose hyperparameters. This is a problem because the test set effectively becomes part of the training process, and the model we choose is more likely to overfit to it, so its test score is no longer an honest estimate of generalization performance.

One approach would be to split our dataset into three parts: a training set, a validation set, and a test set. The training set is used to fit the different models, and the performance on the validation set is then used for model selection. The following figure illustrates the concept of holdout cross-validation, where you use a validation set to repeatedly evaluate the performance of the model after training with different hyperparameter values. Once we are satisfied with the tuning of hyperparameter values, we estimate the model's generalization performance on the test set.

![images](images/hyperparametersRepeat.png)
Image from [Python Machine Learning](https://github.com/rasbt/python-machine-learning-book-3rd-edition) pg 196.
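The three-way split described above can be sketched with two calls to `train_test_split`. The 60/20/20 proportions, the synthetic data, and the depth range are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the housing features and prices.
X, y = make_regression(n_samples=1000, n_features=5, noise=20.0, random_state=0)

# First carve off 20% as the test set, then 25% of the remainder (20% overall)
# as the validation set, leaving 60% for training.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=0
)

# Model selection uses only the validation set.
best_depth, best_val_r2 = None, float("-inf")
for depth in range(1, 11):
    reg = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    val_r2 = reg.score(X_val, y_val)
    if val_r2 > best_val_r2:
        best_depth, best_val_r2 = depth, val_r2

# The test set is touched once, after the hyperparameter has been chosen.
final = DecisionTreeRegressor(max_depth=best_depth, random_state=0).fit(X_train, y_train)
test_r2 = final.score(X_test, y_test)
print(f"best max_depth={best_depth}, validation R^2={best_val_r2:.3f}, test R^2={test_r2:.3f}")
```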

What scikit-learn ([code](https://github.com/scikit-learn/scikit-learn/blob/95d4f0841/sklearn/model_selection/_search.py#L841), [documentation](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation)) does for GridSearchCV is more similar to the image below.

![images](images/crossvalidationidea.png)

An image of k-fold cross-validation used by scikit-learn for this process is illustrated below. The performance measure reported by k-fold cross-validation is the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary validation set).

![images](images/kfoldcrossvalidation.png)
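A small sketch of k-fold cross-validation on its own, using `cross_val_score` with an explicit `KFold` splitter. The synthetic data, `n_splits=5`, and `max_depth=3` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the housing features and prices.
X, y = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)

# Five folds: each fold serves once as the held-out set while the other four train.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
reg = DecisionTreeRegressor(max_depth=3, random_state=0)

# One R^2 value per fold; the reported performance is their average.
fold_scores = cross_val_score(reg, X, y, cv=cv, scoring="r2")
mean_score = fold_scores.mean()
print(fold_scores, mean_score)
```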

The code below is only looking for the optimal `max_depth`, but in the future we will likely use grid search with a lot more hyperparameters.
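A minimal sketch of a grid search over `max_depth` with `GridSearchCV`, run on synthetic data rather than the housing data. The depth range of 1 to 10 and `cv=5` are assumed values.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the housing features and prices.
X, y = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Every value in the grid is evaluated with 5-fold cross-validation
# on the training data only.
param_grid = {"max_depth": list(range(1, 11))}
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid=param_grid,
    cv=5,
    scoring="r2",
)
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("mean cross-validated R^2:", grid.best_score_)

# By default the best model is refit on all the training data,
# so the test set is used only once, here.
test_r2 = grid.score(X_test, y_test)
print("test R^2:", test_r2)
```

Adding more hyperparameters is a matter of adding more keys to `param_grid`, at the cost of one cross-validated fit per extra combination.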

### Fine-tuning a Machine Learning Model Via Randomized Search
As mentioned earlier, grid search is computationally expensive. [Randomized search](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf) usually performs about as well as grid search but takes much less time and compute. In contrast to GridSearchCV, RandomizedSearchCV doesn't try out all parameter values; instead, a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by `n_iter`. We aren't going to cover how it works, but I mention it because many advances like this have made machine learning more accessible over time.
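As a rough illustration, the sketch below samples 15 settings from two integer distributions with `RandomizedSearchCV`. The data is synthetic, and the choice of `min_samples_leaf` as a second hyperparameter, the ranges, and `n_iter=15` are all assumptions.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the housing features and prices.
X, y = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)

# randint(low, high) samples integers from low up to high - 1.
param_distributions = {
    "max_depth": randint(1, 21),         # 1..20
    "min_samples_leaf": randint(1, 51),  # 1..50
}

# Only n_iter settings are sampled and cross-validated.
search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=15,
    cv=5,
    scoring="r2",
    random_state=0,
)
search.fit(X, y)

print("best params:", search.best_params_)
print("mean cross-validated R^2:", search.best_score_)
```

An exhaustive grid over the same ranges would have 20 × 50 = 1000 combinations, compared with the 15 tried here.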

If you had to make this model better, what are some things you could do?

### Create or Import your Own Scoring Function

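Mean squared error (MSE) is not the most understandable metric because it is expressed in squared units of the target (squared dollars here). People often use R^2 instead, as it is basically a rescaled version of MSE: `R^2 = 1 - MSE / Var(y)`, where `Var(y)` is the variance of the target on the same data.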
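One way to build a custom scorer is `make_scorer`. The sketch below wraps a root mean squared error function, which is in the same units as the target; the `rmse` helper is a hypothetical name, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# greater_is_better=False tells scikit-learn this is a loss, so the scorer
# returns the negated value and "higher is better" still holds internally.
rmse_scorer = make_scorer(rmse, greater_is_better=False)

# Synthetic regression data standing in for the housing features and prices.
X, y = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)
reg = DecisionTreeRegressor(max_depth=3, random_state=0)

neg_rmse = cross_val_score(reg, X, y, cv=5, scoring=rmse_scorer)
rmse_per_fold = -neg_rmse  # flip the sign back to get ordinary RMSE values
print(rmse_per_fold)
```

The same scorer object can be passed as `scoring=` to `GridSearchCV` or `RandomizedSearchCV`. For plain MSE, the built-in string `"neg_mean_squared_error"` avoids the need for a custom function.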

### Trying Multiple Machine Learning Models
A task for future classes is to make a table of different models and their scores.
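A possible shape for such a table, sketched with cross-validated R^2 on synthetic data. The three models listed and the column names `mean_r2` and `std_r2` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the housing features and prices.
X, y = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)

models = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor(max_depth=3)": DecisionTreeRegressor(max_depth=3, random_state=0),
    "DecisionTreeRegressor(max_depth=8)": DecisionTreeRegressor(max_depth=8, random_state=0),
}

# Score every model with the same 5-fold cross-validation and collect one row each.
rows = []
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    rows.append({"model": name, "mean_r2": scores.mean(), "std_r2": scores.std()})

results = (
    pd.DataFrame(rows)
    .sort_values("mean_r2", ascending=False)
    .reset_index(drop=True)
)
print(results)
```

Because `make_regression` generates a linear target, the ranking here says nothing about which model would do best on the housing data.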