<a href="https://colab.research.google.com/github/ivybrundege/academic-rag/blob/main/BlankFollowAlong.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#import statements- these allow us to use the packages that handle ML operations for us
import pandas as pd #allows us to create data frames for data processing
import numpy as np #allows sophisticated vector and data frame operations
#import os
import matplotlib.pyplot as plt #allows us to plot our results if desired
import seaborn as sns #also used for plotting data

## ML PACKAGES
from sklearn.model_selection import train_test_split #splits our data for us.
from sklearn.tree import DecisionTreeRegressor #the decision tree package
from sklearn.metrics import root_mean_squared_error #one metric for how well our model works

# GWC @ VT: Intro to Supervised Learning
In this notebook, we're going to illustrate the basic principles of the machine learning life cycle, starting with data and ending with a predictive model!

Any spaces for you to fill in will be marked with a line of stars as a comment. You'll want to fill in wherever we have a comment with stars and an instruction.
It'll look something like this:


```
# ************************************************
#Let's start by importing our data
filename = # ** ADD FILE NAME HERE **
df =  # ** CREATE DATA FRAME HERE **
```

To edit a code box, just click on the box and start typing.

Want to add your own text notes or code blocks? Just press the "+ Code" or "+ Text" buttons at the top of the screen. You can relocate boxes within the notebook by pressing the up or down arrows on the right side of the box.



## Step One: Business Understanding

Can we predict the median house price in an area based on some data we have about the area?

## Step Two: Data Understanding
Now that we have a problem, it is important to understand how we're actually going to use our data to solve this problem.

### Build a Dataframe
You can think of a data frame as a table of all our data, which allows us to do all of our preprocessing.
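
As a rough illustration, here is a minimal sketch of turning CSV data into a data frame with `pd.read_csv`. The CSV text and its column names are made up for this example; the workshop's real file and columns may differ.

```python
import io
import pandas as pd

# Hypothetical CSV contents, invented for illustration only.
csv_text = """housing_median_age,total_rooms,median_house_value
41,880,452600
21,7099,358500
52,1467,352100
"""

# pd.read_csv accepts a file path or any file-like object,
# so an in-memory string works the same way a file on disk would.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # the first few rows of the table
```

With a real file, you would pass the file name (a string) to `pd.read_csv` instead of the in-memory text.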

In [None]:
# ************************************************
#Let's start by importing our data
filename = # ** ADD FILE NAME HERE **
df =  # ** CREATE DATA FRAME HERE **

In [None]:
#Next, let's print a few rows from the data frame to see what we're working with.
df.head()

In [None]:
#Let's also make sure that there are no null values (this counts the nulls in each column):
np.sum(df.isnull(), axis = 0)
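
To make the null check concrete, here is a small sketch on a made-up data frame (the columns and values are hypothetical). It counts missing values per column and shows one possible way to handle them, dropping incomplete rows, which is an assumption for illustration rather than something this notebook requires.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing values (NaN).
df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, np.nan],
    "total_bedrooms": [129.0, np.nan, np.nan],
})

# Count the missing values in each column.
null_counts = df.isnull().sum()
print(null_counts)

# One simple option: drop every row that contains a missing value.
cleaned = df.dropna()
print(cleaned.shape)
```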

## Step Three: Data Preparation
Here's where we prepare a training and test data set. We need to do a few things:


1.   Prepare our data frame so that it only contains our label and desired features, and make sure they're in the correct format
2.   Separate our data into a training and testing data set.
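
The first step above might look roughly like the sketch below. The column names are hypothetical stand-ins, so check `df.head()` for the real ones in your data.

```python
import pandas as pd

# Hypothetical data frame with one column we do not want ("population").
df = pd.DataFrame({
    "housing_median_age": [41, 21, 52, 30],
    "total_rooms": [880, 7099, 1467, 1200],
    "population": [322, 2401, 496, 800],
    "median_house_value": [452600, 358500, 352100, 200000],
})

label = "median_house_value"                      # what we want to predict
features = ["housing_median_age", "total_rooms"]  # what we predict it from

# Keep only the features and the label.
df = df[features + [label]]

# Split the table into the label (y) and the features (X).
y = df[label]
X = df.drop(columns=[label])

print(X.head())
print(y.head())
```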



In [None]:
# ************************************************
#1. First, let's select only the relevant columns. In this case, we want median housing age, total rooms, total bedrooms, and median income as features, plus our label.
df = # ** DROP THE NEEDED COLUMNS HERE **
df.head()

In [None]:
# ************************************************
#2. Separate our data into y (the label) and X (the features)
y = # **CREATE LABEL**
X = # ** CREATE FEATURES **

In [None]:
# ************************************************
#3. Use sklearn to randomly separate our data into test and train sets
X_train, X_test, y_train, y_test = # ** CREATE TRAIN AND TEST SETS**

Things to note with the train_test_split method:
*   Returns four sets, in this order: `X_train, X_test, y_train, y_test`. The order matters, so unpack them in exactly this order every time.
*   `test_size` sets the fraction of the data held out for testing. Here we're using 0.3, or 30%, which is pretty typical.
*   `random_state` seeds the random shuffle. Long story short, we're setting it here so that we all get the same random split of data. It's optional for your own data sets.
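
Here is a small sketch of these points on toy data. The arrays and the seed value `42` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 rows with 2 feature columns, and one label per row.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# 30% of the rows go to the test set; the seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

# Calling it again with the same seed gives the same split.
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(np.array_equal(X_test, X_test_again))
```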





In [None]:
#Let's look at the size of our training and testing blocks.
print("X train size: " + str(X_train.shape))
print("X test size: " + str(X_test.shape))
print("y train size: " + str(y_train.shape))
print("y test size: " + str(y_test.shape))

## Step Four: Modeling
Here's the exciting part: let's make our model!!

In this case, we're going to use a decision tree. This is a pretty basic model, with the following hyperparameters:

1.   `max_depth`: how "deep" we'll allow the tree to go
2.   `min_samples_leaf`: the minimum number of training samples each leaf must contain
3.   `criterion`: the function the tree uses to measure how good a split is

Let's start by making a function that will create, train, and evaluate a decision tree.

In [None]:
# ************************************************
#creates a function to train and test a decision tree!

def train_test_DT(X_train, X_test, y_train, y_test, depth, leaf=1, crit='squared_error'):
  model = #** #1: CREATE MODEL OBJECT **
  #** #2: TRAIN MODEL WITH TRAIN DATA**
  predictions = #** #3: USE MODEL TO MAKE PREDICTIONS**
  rmse_score = #** #4: CALCULATE RMSE FOR MODEL**
  return rmse_score

Some explanation of what's happening here:

Parameters: These are how we specify the details of our tree
*   `depth`: sets the max depth
*   `leaf`: sets the minimum number of samples per leaf (`min_samples_leaf`). We're starting by setting it to 1 as a default
*   `crit`: sets the criterion. We're making it `'squared_error'` (mean squared error) by default

What the function does:
*   #1: creates a `DecisionTreeRegressor` object with the hyperparameters we set using the function parameters
*   #2: fits the model to our data. This is the training step
*   #3: creates a set of predictions for our test data set
*   #4: Calculates the root mean squared error. This is just one of many methods we can use to evaluate how well our model predicts new data points.






In [None]:
# ************************************************
#Let's use this to make and test a series of trees with different depths!
max_depth_range = [] # ** ADD SOME DEPTH VALUES**
mse = [] #this will hold our outcomes

for md in max_depth_range:
  print(md)
  mse_score = # ** CALL OUR FUNCTION USING MD AS DEPTH **
  mse.append(mse_score)
  print(mse_score)

We've now made a model! However, our work's not done. We need to tune the hyperparameters to make a better model.

Let's start by looking at which of those depths we chose was the best:

In [None]:
#Graph the results of mse_score
fig = plt.figure()
ax = fig.add_subplot(111)

sns.lineplot(x = max_depth_range, y = mse)

plt.title(r'Test set RMSE of the DT predictions, for $max\_depth\in\{1, 50\}$')  #raw string (r'...') so the backslashes reach matplotlib unchanged
ax.set_xlabel('max_depth')
ax.set_ylabel('root mean squared error')
plt.show()

In reality, we'll want to do much more exhaustive hyperparameter tuning, such as a grid search, which allows us to iterate through different combinations of hyperparameters.

In this case, we're just going to look at leaf size and call it good for the sake of time :)
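
For reference, a grid search over both hyperparameters might look roughly like the sketch below. The synthetic data and the candidate values are assumptions for illustration. Note that `GridSearchCV` compares combinations using cross-validation on the data it is given, rather than on a separate test set.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the label depends linearly on the first feature, plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = 3.0 * X[:, 0] + rng.normal(0, 1.0, size=300)

# Every combination of these values is tried (3 x 3 = 9 combinations).
param_grid = {
    "max_depth": [2, 4, 8],
    "min_samples_leaf": [1, 5, 20],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # sklearn maximizes scores, so RMSE is negated
    cv=3,                                   # 3-fold cross-validation
)
search.fit(X, y)

print(search.best_params_)  # the best combination found
print(-search.best_score_)  # its cross-validated RMSE
```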


In [None]:
# ************************************************
#Look at different leaf sizes
#Let's use this to make and test a series of trees with different leaf sizes!
min_leaf_range = [] # ** SPECIFY LEAF SIZES **
mse = [] #this will hold our outcomes

for ml in min_leaf_range:
  print(ml)
  mse_score = train_test_DT(X_train, X_test, y_train, y_test, depth = 7, leaf = ml)
  mse.append(mse_score)
  print(mse_score)

In [None]:
#Graph the results of mse_score for leaf size
fig = plt.figure()
ax = fig.add_subplot(111)

sns.lineplot(x = min_leaf_range, y = mse)

plt.title(r'Test set RMSE of the DT predictions, for $min\_leaf\in\{1, 21\}$')  #raw string (r'...') so the backslashes reach matplotlib unchanged
ax.set_xlabel('min_leaf')
ax.set_ylabel('root mean squared error')
plt.show()

Ok, let's make our final model using our selected hyperparameters!

In [None]:
# ************************************************
#Our final model!!
final_model = #** CREATE FINAL MODEL**
# ** TRAIN FINAL MODEL **

## Step Five: Evaluation
Let's look at how well our model performs!

In [None]:
#Use the "score" method to evaluate our performance. For a regressor, this returns R^2 (1.0 is a perfect fit).
print("Train set score: " + str(final_model.score(X_train, y_train)))
print("Test set score: " + str(final_model.score(X_test, y_test)))
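
For a regressor such as `DecisionTreeRegressor`, `score` returns the coefficient of determination, R². As a sketch of what that number means, the example below computes R² by hand on made-up values and compares it with scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true labels and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

# Squared error of our predictions.
ss_res = np.sum((y_true - y_pred) ** 2)
# Squared error of always guessing the mean label.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

# R^2 = 1 means perfect predictions; R^2 = 0 means no better than guessing the mean.
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual)
print(r2_score(y_true, y_pred))
```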

Here, we really want to focus on the test set score! This is an indicator of how our model will perform on real-world, previously unseen data.

A score of 0.56 is... fine. Whether it's good enough will depend on your project specification. Real-world applications will also use plenty of other validation techniques to determine efficacy.

However, this exemplifies one of the reasons more advanced techniques were developed. For example, Random Forest is an ensemble model that combines many decision trees. Going further, neural nets use an entirely different design, with hidden layers that allow additional processing.

Note that even with these more advanced techniques, this life cycle process is the same! The hyperparameters you're tuning will be different, but you can follow these same steps :)
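
As a rough sketch of that point, here is the same split, fit, and score pattern applied to a Random Forest. The synthetic data and the hyperparameter values are assumptions for illustration, not tuned choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends linearly on the first feature, plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = 3.0 * X[:, 0] + rng.normal(0, 1.0, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# n_estimators is the number of decision trees in the forest.
forest = RandomForestRegressor(n_estimators=50, max_depth=7, random_state=0)
forest.fit(X_train, y_train)

print("Train set score: " + str(forest.score(X_train, y_train)))
print("Test set score: " + str(forest.score(X_test, y_test)))
```

The only new hyperparameter here is `n_estimators`; the rest of the workflow is unchanged.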

## Step Six: Deployment
Time to make your model known! Some options:

*   GitHub: make your own repo to share your code with the public, or even invite others to collaborate
*   Online: create a website that uses your model!
*   User Interface: make a usable interface so that people don't need to interact with the code to use the model!