In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# plt.style.use('fivethirtyeight')
%config InlineBackend.figure_format = 'retina'

In [None]:
from sklearn import (datasets, dummy, ensemble,
                     linear_model, metrics,
                     model_selection as skms,
                     naive_bayes, neighbors, tree)

In [None]:
from utils import (make_learning_curve, make_complexity_curve, 
                   rms_error, rmse)

In [None]:
import warnings
warnings.filterwarnings("ignore")
np.random.seed(42)

# Part 1:  Fit-Predict-Evaluate

### Exercise 1:
  * Part A:
    * Read in the data from `datasets/housing_small.csv` using `pandas`.
    * There is a target in the `Target` column.  Make that the output target and everything else the input features.
    * Build a 3-nearest neighbor model and train it on that entire dataset.
    * Make predictions on that same dataset.
    * Evaluate the predictions using root-mean-squared-error.
  * Part B:
    * Read in the data from `datasets/housing_small.csv` using `pandas`.
    * There is a target in the `Target` column.  Make that the output target and everything else the input features.
    * Make a training and testing set from that dataset.
    * Build a 3-nearest neighbor model and train it on the training set.
    * With that trained model, make predictions on both the training and testing sets.
    * Evaluate the predictions using root-mean-squared-error.

### Part A: Simple sklearn (in-sample only)
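
One possible shape for the in-sample workflow is sketched here. It is not the reference solution: the data frame is a synthetic stand-in built with `datasets.make_regression` (in the notebook it would come from `pd.read_csv('datasets/housing_small.csv')`), and the feature names `f0`..`f4` are made up. Only the `Target` column name comes from the exercise text.

```python
import numpy as np
import pandas as pd
from sklearn import datasets, metrics, neighbors

# Synthetic stand-in for pd.read_csv('datasets/housing_small.csv').
# Feature names f0..f4 are hypothetical; 'Target' is from the exercise.
X, y = datasets.make_regression(n_samples=200, n_features=5,
                                noise=10.0, random_state=42)
housing = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
housing["Target"] = y

# Everything except 'Target' is an input feature.
features = housing.drop(columns="Target")
target = housing["Target"]

# Fit on the whole dataset, then predict on that same dataset.
knn = neighbors.KNeighborsRegressor(n_neighbors=3)
knn.fit(features, target)
preds = knn.predict(features)

# RMSE is the square root of the mean squared error.
in_sample_rmse = np.sqrt(metrics.mean_squared_error(target, preds))
print(f"in-sample RMSE: {in_sample_rmse:.2f}")
```

Because the model is scored on the same rows it was trained on, this number is likely to be optimistic.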

### Part B: Simple sklearn (train-test)
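
A minimal sketch of the train-test version follows, again on synthetic stand-in data rather than the real `datasets/housing_small.csv`. The helper `rmse_of` is a hypothetical name introduced here; it is not the `rmse` or `rms_error` imported from `utils`, and the 25% test size is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn import datasets, metrics, model_selection as skms, neighbors

# Synthetic stand-in for pd.read_csv('datasets/housing_small.csv').
X, y = datasets.make_regression(n_samples=200, n_features=5,
                                noise=10.0, random_state=42)
housing = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
housing["Target"] = y
features, target = housing.drop(columns="Target"), housing["Target"]

# Hold back 25% of the rows for testing (an arbitrary choice).
train_ftrs, test_ftrs, train_tgt, test_tgt = skms.train_test_split(
    features, target, test_size=0.25, random_state=42)

# Train only on the training portion.
knn = neighbors.KNeighborsRegressor(n_neighbors=3).fit(train_ftrs, train_tgt)

def rmse_of(model, ftrs, tgt):
    """Hypothetical helper: RMSE of a fitted model on (ftrs, tgt)."""
    return np.sqrt(metrics.mean_squared_error(tgt, model.predict(ftrs)))

train_rmse = rmse_of(knn, train_ftrs, train_tgt)
test_rmse = rmse_of(knn, test_ftrs, test_tgt)
print(f"train RMSE: {train_rmse:.2f}   test RMSE: {test_rmse:.2f}")
```

Comparing the two numbers gives a first hint about overfitting: a training error well below the testing error suggests the model is memorizing.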

# Part 2:  Comparing Models on TTS

### Exercise 2:
  * Part A:
    * On a train-test split built from `datasets/housing_small.csv`, fit and predict using a `dummy.DummyRegressor`.
    * Compute the root-mean-squared-error (RMSE) for training and testing.
  * Part B:
    * Create a train-test split from `datasets/housing_small.csv`.
    * Build and evaluate three different nearest neighbor models (by varying the number of neighbors) using RMSE.
  * Part C:
    * Create a train-test split from `datasets/housing_small.csv`.
    * Build and evaluate three different decision tree models (by varying the depth of the tree) using RMSE.

### Part A: Baseline `Predict-the-Mean` Model
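
As a rough sketch of the baseline, `dummy.DummyRegressor` with `strategy="mean"` ignores the features and always predicts the mean of the training targets. The data below is a synthetic stand-in for `datasets/housing_small.csv`, and the split settings are assumptions.

```python
import numpy as np
from sklearn import datasets, dummy, metrics, model_selection as skms

# Synthetic stand-in for the features/target read from the .csv file.
X, y = datasets.make_regression(n_samples=200, n_features=5,
                                noise=10.0, random_state=42)
train_ftrs, test_ftrs, train_tgt, test_tgt = skms.train_test_split(
    X, y, test_size=0.25, random_state=42)

# The baseline learns one number: the mean of the training targets.
baseline = dummy.DummyRegressor(strategy="mean").fit(train_ftrs, train_tgt)

train_rmse = np.sqrt(metrics.mean_squared_error(
    train_tgt, baseline.predict(train_ftrs)))
test_rmse = np.sqrt(metrics.mean_squared_error(
    test_tgt, baseline.predict(test_ftrs)))
print(f"baseline train RMSE: {train_rmse:.2f}   test RMSE: {test_rmse:.2f}")
```

Any model worth keeping should beat these two numbers.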

### Part B: Three k-Nearest Neighbors Models
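
One way to compare several neighbor counts is a simple loop over candidate values. The values 1, 5, and 15 are illustrative picks, not ones prescribed by the exercise, and the data is a synthetic stand-in for `datasets/housing_small.csv`.

```python
import numpy as np
from sklearn import datasets, metrics, model_selection as skms, neighbors

# Synthetic stand-in for the features/target read from the .csv file.
X, y = datasets.make_regression(n_samples=200, n_features=5,
                                noise=10.0, random_state=42)
train_ftrs, test_ftrs, train_tgt, test_tgt = skms.train_test_split(
    X, y, test_size=0.25, random_state=42)

results = {}
for k in (1, 5, 15):  # illustrative neighbor counts
    model = neighbors.KNeighborsRegressor(n_neighbors=k)
    model.fit(train_ftrs, train_tgt)
    results[k] = {
        "train": np.sqrt(metrics.mean_squared_error(
            train_tgt, model.predict(train_ftrs))),
        "test": np.sqrt(metrics.mean_squared_error(
            test_tgt, model.predict(test_ftrs))),
    }

for k, scores in results.items():
    print(f"k={k:2d}  train RMSE: {scores['train']:8.2f}  "
          f"test RMSE: {scores['test']:8.2f}")
```

With `k=1`, every training point is its own nearest neighbor, so the training error collapses to zero while the testing error does not. That gap is a useful reminder of why the test column is the one to compare.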

### Part C: Three Decision Tree Models
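
The same loop pattern works for decision trees by varying `max_depth`. The depths 2, 5, and 10 are illustrative assumptions, and the data is a synthetic stand-in for `datasets/housing_small.csv`.

```python
import numpy as np
from sklearn import datasets, metrics, model_selection as skms, tree

# Synthetic stand-in for the features/target read from the .csv file.
X, y = datasets.make_regression(n_samples=200, n_features=5,
                                noise=10.0, random_state=42)
train_ftrs, test_ftrs, train_tgt, test_tgt = skms.train_test_split(
    X, y, test_size=0.25, random_state=42)

results = {}
for depth in (2, 5, 10):  # illustrative depths
    model = tree.DecisionTreeRegressor(max_depth=depth, random_state=42)
    model.fit(train_ftrs, train_tgt)
    results[depth] = {
        "train": np.sqrt(metrics.mean_squared_error(
            train_tgt, model.predict(train_ftrs))),
        "test": np.sqrt(metrics.mean_squared_error(
            test_tgt, model.predict(test_ftrs))),
    }

for depth, scores in results.items():
    print(f"max_depth={depth:2d}  train RMSE: {scores['train']:8.2f}  "
          f"test RMSE: {scores['test']:8.2f}")
```

Deeper trees can carve the training data into smaller pieces, so the training error tends to fall as depth grows; whether the testing error follows is the question the exercise is probing.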

# Part 3: Cross-Validation and Model Choice

### Exercise 3:
  * Part A:
    * Use `skms.cross_val_score` (imported above) to evaluate the RMSE of a 3-nearest neighbors model on `datasets/housing_small.csv`.  You can use `scoring=rmse` to have `cross_val_score` return the necessary values.
    * Use `skms.cross_val_score` to evaluate the RMSE of the models you built in Exercise 2.
  * Part B:
    * Still working with `datasets/housing_small.csv`, find a good value for the number of neighbors by using `make_complexity_curve`.
    * With the good number of neighbors, generate a learning curve with `make_learning_curve`.
  * Part C:
    * Repeat Part B using a decision tree, this time finding a good `max_depth`.

### Part A: Cross-Validation
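
A hedged sketch of cross-validated RMSE follows. Since the `rmse` scorer from `utils` is not shown in this notebook, the sketch swaps in scikit-learn's built-in `"neg_root_mean_squared_error"` scoring string, which returns negated RMSE values (hence the sign flip). The three models, the 5 folds, and the synthetic data are all assumptions for illustration.

```python
import numpy as np
from sklearn import (datasets, dummy, model_selection as skms,
                     neighbors, tree)

# Synthetic stand-in for the features/target read from the .csv file.
X, y = datasets.make_regression(n_samples=200, n_features=5,
                                noise=10.0, random_state=42)

# Illustrative models in the spirit of Exercise 2.
models = {
    "mean baseline": dummy.DummyRegressor(strategy="mean"),
    "3-NN": neighbors.KNeighborsRegressor(n_neighbors=3),
    "tree (depth 3)": tree.DecisionTreeRegressor(max_depth=3,
                                                 random_state=42),
}

cv_rmse = {}
for name, model in models.items():
    # The built-in scorer is negated so that "higher is better";
    # flip the sign to recover ordinary RMSE values.
    neg_scores = skms.cross_val_score(
        model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    cv_rmse[name] = -neg_scores

for name, scores in cv_rmse.items():
    print(f"{name:15s} mean RMSE: {scores.mean():8.2f}  "
          f"(std {scores.std():.2f})")
```

Reporting the spread across folds alongside the mean helps judge whether a difference between two models is larger than the fold-to-fold noise.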

### Part B: A Good Hyper + A Learning Curve
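
The signatures of `make_complexity_curve` and `make_learning_curve` live in `utils` and are not shown here, so this sketch uses scikit-learn's `validation_curve` and `learning_curve` as stand-ins for the same two ideas. The candidate neighbor counts, fold count, and synthetic data are assumptions.

```python
import numpy as np
from sklearn import datasets, model_selection as skms, neighbors

# Synthetic stand-in for the features/target read from the .csv file.
X, y = datasets.make_regression(n_samples=200, n_features=5,
                                noise=10.0, random_state=42)

# Complexity curve: cross-validated RMSE for each candidate n_neighbors.
param_range = [1, 3, 5, 9, 15]  # illustrative candidates
train_scores, test_scores = skms.validation_curve(
    neighbors.KNeighborsRegressor(), X, y,
    param_name="n_neighbors", param_range=param_range,
    cv=5, scoring="neg_root_mean_squared_error")

# Scores are negated RMSE; average over folds and flip the sign.
cv_rmse = -test_scores.mean(axis=1)
best_k = param_range[int(np.argmin(cv_rmse))]
print("mean CV RMSE per k:", dict(zip(param_range, cv_rmse.round(2))))
print("best k:", best_k)

# Learning curve: how the chosen model behaves as training data grows.
sizes, lc_train, lc_test = skms.learning_curve(
    neighbors.KNeighborsRegressor(n_neighbors=best_k), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5,
    scoring="neg_root_mean_squared_error")

for n, tr, te in zip(sizes, -lc_train.mean(axis=1), -lc_test.mean(axis=1)):
    print(f"n_train={n:3d}  train RMSE: {tr:8.2f}  CV RMSE: {te:8.2f}")
```

If the cross-validated error is still falling at the largest training size, more examples may help; if it has flattened, a different model or different features are more likely to pay off.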

### Part C: A Good Hyper + A Learning Curve
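
For the decision tree, the hyperparameter being swept is `max_depth`. As a sketch under the same caveats (scikit-learn's `validation_curve` and `learning_curve` standing in for the `utils` helpers, synthetic data, illustrative depth candidates):

```python
import numpy as np
from sklearn import datasets, model_selection as skms, tree

# Synthetic stand-in for the features/target read from the .csv file.
X, y = datasets.make_regression(n_samples=200, n_features=5,
                                noise=10.0, random_state=42)

depths = [1, 2, 3, 5, 8, 12]  # illustrative candidates
_, test_scores = skms.validation_curve(
    tree.DecisionTreeRegressor(random_state=42), X, y,
    param_name="max_depth", param_range=depths,
    cv=5, scoring="neg_root_mean_squared_error")

# Pick the depth with the lowest mean cross-validated RMSE.
cv_rmse = -test_scores.mean(axis=1)
best_depth = depths[int(np.argmin(cv_rmse))]
print("best max_depth:", best_depth)

# Learning curve for the tree at the chosen depth.
sizes, lc_train, lc_test = skms.learning_curve(
    tree.DecisionTreeRegressor(max_depth=best_depth, random_state=42),
    X, y, train_sizes=np.linspace(0.2, 1.0, 5), cv=5,
    scoring="neg_root_mean_squared_error")
print("CV RMSE by training size:",
      dict(zip(sizes.tolist(), (-lc_test.mean(axis=1)).round(2))))
```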

# Part 4: Now to Improve!

### Exercise 4:
  * Part A:
    * We can train pretty well with more complex models, but they are overfitting. Can we use more examples to smooth things out?  Using the data in `datasets/housing_tall.csv`:
      * Reevaluate our baseline mean-only model.
      * Find a good nearest neighbors model and build a learning curve for it.
      * Find a good decision tree model and build a learning curve for it.
  * Part B:
    * Does adding more features improve our results?  We'll go back to fewer examples, but use a lot more features.  Using the data in `datasets/housing_wide.csv`:
      * Find a good nearest neighbors model and build a learning curve for it.
      * Find a good decision tree model and build a learning curve for it.
  * Part C (optional):
    * Does it help to be selective about our features?  Using a `RandomForestRegressor` along with `feature_importances_`, identify a top-10 set of features and use those to build a model.
  * Part D:
    * Does using a lot of features and a lot of examples help?  Using the data in `datasets/housing_full.csv`:
      * Find good nearest neighbors and decision tree models.
      * (Optional) Determine if selecting a top-10 set of features (as in Part C) helps.
  * Part E:
    * How have we done overall?  Using the best model you found for `housing_wide.csv` or `housing_full.csv`, train that model on *all* of the data in that `.csv` file.  Evaluate that trained model on the data in `datasets/housing_hot_wide.csv`.

### Part A: More Examples

### Part B: More Features (back to shorter dataset)

### Part C: Let's Be Selective about our Features
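
One possible way to pick a top-10 feature set is to rank columns by a fitted forest's `feature_importances_`. In this sketch the 30-column frame is a synthetic stand-in for `datasets/housing_wide.csv`, the column names are made up, and the follow-up 5-nearest-neighbors model is just one illustrative choice of what to build on the selected features.

```python
import numpy as np
import pandas as pd
from sklearn import (datasets, ensemble, metrics,
                     model_selection as skms, neighbors)

# Synthetic stand-in for a wide dataset: 30 features, 5 informative.
X, y = datasets.make_regression(n_samples=300, n_features=30,
                                n_informative=5, noise=10.0,
                                random_state=42)
features = pd.DataFrame(X, columns=[f"f{i}" for i in range(30)])
train_ftrs, test_ftrs, train_tgt, test_tgt = skms.train_test_split(
    features, y, test_size=0.25, random_state=42)

# Rank features using the training split only, so the test split
# plays no part in the selection.
forest = ensemble.RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(train_ftrs, train_tgt)
importances = pd.Series(forest.feature_importances_,
                        index=train_ftrs.columns)
top10 = importances.sort_values(ascending=False).head(10).index

# Build a model on just the selected columns (5-NN is illustrative).
knn = neighbors.KNeighborsRegressor(n_neighbors=5)
knn.fit(train_ftrs[top10], train_tgt)
test_rmse = np.sqrt(metrics.mean_squared_error(
    test_tgt, knn.predict(test_ftrs[top10])))

print("top-10 features:", list(top10))
print(f"test RMSE on top-10 features: {test_rmse:.2f}")
```

Selecting features on the training split alone keeps the test error an honest estimate; ranking on the full dataset would leak information from the test rows into the model.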

### Part D:  All the Data

### Part E: Train on All Data and Evaluate on Hold-Out Test Set
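
The final step can be sketched as: fit the chosen model on every row of the training file, then score it once on the hold-out file. Both frames below are synthetic stand-ins (for the chosen training `.csv` and for `datasets/housing_hot_wide.csv`), and the depth-5 tree is only a placeholder for whichever model was found to be best.

```python
import numpy as np
import pandas as pd
from sklearn import datasets, metrics, tree

# Synthetic stand-ins: the first 300 rows play the training file,
# the last 100 rows play the hold-out file.
X, y = datasets.make_regression(n_samples=400, n_features=8,
                                noise=10.0, random_state=42)
full = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)])
full["Target"] = y
train_df, holdout_df = full.iloc[:300], full.iloc[300:]

all_ftrs, all_tgt = train_df.drop(columns="Target"), train_df["Target"]

# Use the same columns, in the same order, for the hold-out features.
holdout_ftrs = holdout_df[all_ftrs.columns]
holdout_tgt = holdout_df["Target"]

# Placeholder for the best model found earlier; no split this time.
final_model = tree.DecisionTreeRegressor(max_depth=5, random_state=42)
final_model.fit(all_ftrs, all_tgt)

holdout_rmse = np.sqrt(metrics.mean_squared_error(
    holdout_tgt, final_model.predict(holdout_ftrs)))
print(f"hold-out RMSE: {holdout_rmse:.2f}")
```

The hold-out set should be scored only once, after all model and hyperparameter choices are final; otherwise it turns into another validation set.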