# The Price is Right: Predicting Housing Prices

## 1. Introduction

### The Data

In this module you will analyze the California Housing Prices data set from the StatLib repository. This data set contains geographical data collected during the 1990 California census. It includes the median house value, median housing age, and median resident income for each Block Group (a geographical unit in the Census data). You will use this data to build a model that **predicts housing prices** in California.

### Big Picture View

The model you build should learn from the data provided and be able to predict the median housing price, given all other metrics. 

### Framing the Problem

The data set for this problem is **labeled data**, which means a **supervised learning** technique is most appropriate. Since you are asked to **predict a value**, a **regression** approach is probably the best choice. Since you only want to predict one value per Block Group, this is a **univariate** problem, even though multiple variables will be used as inputs.

## !!! Your Turn !!!

Answer the following question(s):

1. What category of machine learning will your algorithm use? 
2. Define the term 'univariate'. How is a univariate model different from a multivariate model?
3. What will be the final output/goal of our algorithm?

***Answer the questions in the cell below***

\<DOUBLE CLICK HERE TO TYPE ANSWERS\>

## 2. Importing the Data

In the following snippets of code you will import the data into our notebook and perform a couple of quick analyses. 

### Downloading and Extracting the Data

The following snippet of code will download the data from the internet and save it in a folder called datasets/housing. Read through each line of the code and then run the snippet (the `# comments` provide info on what each line of code is doing). After running the snippet you should see a new folder called datasets in your Jupyter home screen.

In [None]:
import os
import tarfile
import urllib.request #Import the request submodule explicitly so urllib.request.urlretrieve is available

### Set data set url and location to save data
DOWNLOAD_ROOT = 'https://raw.githubusercontent.com/ageron/handson-ml2/master/'
HOUSING_URL = DOWNLOAD_ROOT + 'datasets/housing/housing.tgz'
HOUSING_PATH = os.path.join("datasets", 'housing')

### Function to fetch the data
def fetch_housing_data(housing_url=HOUSING_URL, housing_path = HOUSING_PATH):
    os.makedirs(housing_path, exist_ok = True) #Make the dataset directory if it doesn't already exist
    tgz_path = os.path.join(housing_path, "housing.tgz") #Set the path for the downloaded (compressed tar) archive
    urllib.request.urlretrieve(housing_url, tgz_path) #Download the data
    housing_tgz = tarfile.open(tgz_path) #Use tar to open the tar file
    housing_tgz.extractall(path=housing_path) #Extract the data
    housing_tgz.close() #Close the tar file/clear from memory
    
fetch_housing_data() #Run the function we just made

### Opening the Data as a Dataframe using Pandas

We will use a data library called Pandas to open and store our data in a `dataframe` object. `Dataframes` store information of all types in a table-like structure. (One of the best ways to think about `dataframes` is as a set of rows and columns.) Pandas `dataframes` have a number of built-in functions that make accessing and manipulating the data *relatively* easy. 

After loading the data we will look at a summary of the information contained in the data set using the `.info()` method of a dataframe. (Methods are things that objects in Python can do, like display information about themselves.) The `.info()` method tells us the column names, how many values in each column contain information (null means empty, so non-null means filled), and the data type of each column (words, integers, decimal numbers (float64), etc.). 

In [None]:
import pandas as pd #Import the pandas library. Instead of writing pandas all the time, set it to just write pd instead

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, 'housing.csv') #Set the exact path to the csv file containing our data
    return pd.read_csv(csv_path)

housing = load_housing_data() #Load the housing data as a dataframe called housing

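display(housing.head()) #View the first five rows of the housing dataframe (display() is used because only the last expression in a cell is shown automatically)
housing.info() #Print the column names, non-null counts, and data types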

## !!! Your Turn !!! 

Answer the following questions:
1. How many data points are included in this data set? 
2. Which column(s) contains incomplete data? 
3. Which column(s) contains non-numeric information?

***Write your answers in the cell below***

\<DOUBLE CLICK HERE TO TYPE ANSWERS\>

## 3. Understanding the Data

Before we begin analyzing the data it is worth taking a moment to get a feel for what data is contained in this data set. We will do this using several of the built-in `dataframe methods`. It is always worth taking time to understand your data (at a very base level!) before beginning your analysis.

Let's look at the `ocean_proximity` data, some general information about the dataset, and histograms of the data. (Run each cell separately)

In [None]:
housing["ocean_proximity"].value_counts() #Tells us the values and number of times each value is found in ocean_proximity

In [None]:
housing.describe() #Run the describe() method to see info like mean, standard deviation, range, etc

In [None]:
%matplotlib inline 
import matplotlib.pyplot as plt #import the plotting library
housing.hist(bins=50, figsize=(20,15)) #Draw a histogram for each numeric attribute
plt.show() #Display the plots

## !!! Your Turn !!!

Answer the following questions:
1. What is the scale for median income? Do you think this is in standard USD? 
2. Provide an explanation for the large peaks at high values for Median House Value and Median House Age.
3. How do the scales of the different attributes vary?
4. Describe the shape of the total_rooms attribute. Does the shape of the distribution look even, front heavy, or tail heavy?

***Write your answers in the cell below***

\<DOUBLE CLICK HERE TO TYPE ANSWERS\>

## 4. Creating a Test Set

Now that we have a general feel for the data, we will make a test set. Crucially, once we make the test set we will ensure that we **never look at the test set again**. This will *prevent bias* in our solutions (whether intentional or unintentional). 

Here, we will select 20% of the data to put aside as a test set. There are many ways to create a test set, but the key is to **ensure the test data cannot be used as training data**. This is especially important if your data can change or new data can be added. In general, there are two approaches to creating a test set:

1. Random sampling
2. Stratified sampling

[Random sampling](https://www.youtube.com/watch?v=yx5KZi5QArQ) is exactly what it sounds like. You randomly select 20% of the data and set it aside as a test set. While this process is simple, it can introduce sampling bias. Sampling bias occurs when a sample is not representative of the whole data set, creating a skewed perspective. As an example, consider polling voters in a county that is 53% male and 47% female. A purely random sample may, by chance, contain noticeably different proportions (for example, overrepresenting male voters), leading to biased results. 

To **remove sampling bias**, you can select test data that is representative of the whole data set. To do this, we will use a [stratified sampling approach](https://www.youtube.com/watch?v=sYRUYJYOpG0) to create *a representative test set* ***AND*** a *representative training set*. Here's the approach:

1. Make categories based on the distribution of median income (determined by us)
2. Use Scikit-Learn methods to split the data into a training set and a test set
3. Remove the temporary categories so the data is back in its original form

**NEVER LOOK AT THE TEST SET AGAIN!!!**

In [None]:
import numpy as np #import the numpy library

housing["income_cat"] = pd.cut(housing["median_income"],
                              bins=[0, 1.5, 3.0, 4.5, 6.0, np.inf],
                              labels=[1,2,3,4,5]) #Create categories for the data based on median income

from sklearn.model_selection import StratifiedShuffleSplit #Import the splitting method

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42) #Set up a single split that puts 20% of the data in the test set, using 42 as the random seed

#Use the row indices produced by the splitter to build the training and test sets
for train_index, test_index in split.split(housing, housing["income_cat"]):
    train_set = housing.loc[train_index]
    test_set = housing.loc[test_index]

#Remove the temporary income_cat column from both the training and test sets
for set_ in (train_set, test_set):
    set_.drop("income_cat", axis=1, inplace=True)
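
To see what stratified sampling does without any library help, here is a minimal sketch written with only the Python standard library. It is not how Scikit-Learn implements `StratifiedShuffleSplit`; the function name `stratified_split` and the toy population (the 53% / 47% county from the example above) are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction, seed=42):
    """Return (train_indices, test_indices), sampling each stratum separately."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for index, label in enumerate(labels):
        by_stratum[label].append(index)      # group row indices by category

    train, test = [], []
    for indices in by_stratum.values():
        rng.shuffle(indices)                 # shuffle within the stratum
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])        # take the same fraction from every stratum
        train.extend(indices[n_test:])
    return train, test

# Toy population: 530 "M" and 470 "F" (hypothetical data)
labels = ["M"] * 530 + ["F"] * 470
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
test_counts = Counter(labels[i] for i in test_idx)
print("stratified:", dict(test_counts))

# For comparison: a plain random sample of the same size
random_idx = random.Random(0).sample(range(len(labels)), 200)
random_counts = Counter(labels[i] for i in random_idx)
print("random:    ", dict(random_counts))
```

The stratified test set keeps the 53% / 47% proportions by construction, while the proportions in the plain random sample depend on the luck of the draw.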



## !!! Your Turn !!!

Answer the following questions:
1. What are the pros and cons of a random sampling procedure for generating a sample set?
2. What problem does a stratified sampling approach help eliminate?
3. Why is it important to never look at the test data again?

***Write your answers in the cell below***

\<DOUBLE CLICK HERE TO TYPE ANSWERS\>

## 5. Look for Insights in the Data


### Visualizing the Data

Now that we have created a training set and a test set, let's take another look at the data. Let's start by visualizing the data based on geographical location in three different ways. (Remember, *run each cell separately*.)

In [None]:
housing = train_set.copy() #make a copy of the training set to play with

#Plot the raw data based on geographic location alone
housing.plot(kind="scatter", x="longitude", y="latitude")

Good news! The above plot does indeed look similar to the shape of California! But it doesn't really tell us too much beyond this. Let's visualize it another way, highlighting areas where the data is most dense (possibly corresponding to urban areas):

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)

This is a bit more insightful! But we can go further. Let's incorporate median house value into the data visualization.

Here, the size of the circle (`s=`) is based on the population size and the color of each spot (`c=`) is based on the median house value. The colour scale is chosen using the cmap option. 

In [None]:
plt.rcParams['image.cmap'] = 'jet' #Set the colour scale

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4, # plot the lat/long data  
            s=housing["population"]/100, label="population",figsize=(10,7), # population is size of dot
            c="median_house_value", colorbar=True) # color of dot is based on house value

## !!! Your Turn !!!

Use the scatter plots above and your knowledge of California (or Google Maps) to answer the following questions.

1. Where is the data most dense? What cities do these locations correspond to?
2. Where are housing prices the highest? Where are they lowest? Provide a possible reason for these trends.

***Write your answers in the cell below*** 

\<DOUBLE CLICK HERE TO TYPE ANSWERS\>

## 6. Looking for Correlations

We can look for standard correlations between pairs of attributes. Here, we will look for *linear* correlations using the *standard correlation coefficient*. The correlation coefficient ranges from -1 to 1: the closer it is to 1 (or to -1, for a negative correlation), the more strongly correlated the data, and the closer it is to 0, the less correlated. Below are examples of data sets with various correlation coefficients.

![Correlation Examples](https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Correlation_examples2.svg/1920px-Correlation_examples2.svg.png)

A key point to note: The standard correlation coefficient will **ONLY LOOK FOR LINEAR CORRELATIONS**. For example, in the third row of the image there is obviously some relationship between the x- and y-axes, even though the correlation coefficient is 0.
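
As a small illustration of this point, the sketch below computes the standard (Pearson) correlation coefficient by hand for three toy data sets. The helper name `pearson_r` and the data are made up for this example; Pandas' `corr()` does the equivalent calculation for us later.

```python
import math

def pearson_r(xs, ys):
    """Standard (Pearson) correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [-3, -2, -1, 0, 1, 2, 3]
r_up = pearson_r(xs, [2 * x + 1 for x in xs])      # straight line going up
r_down = pearson_r(xs, [-2 * x + 1 for x in xs])   # straight line going down
r_parabola = pearson_r(xs, [x ** 2 for x in xs])   # y depends entirely on x, but not linearly

print("line up:  ", r_up)
print("line down:", r_down)
print("parabola: ", r_parabola)
```

The parabola is completely determined by x, yet its coefficient is zero because the relationship is not linear.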

Let's calculate how much each attribute correlates with our target attribute, median house value:

In [None]:
corr_matrix = housing.corr(numeric_only=True) #calculate the correlation coefficients

corr_matrix["median_house_value"].sort_values(ascending=False)

We can also look at correlations visually using a scatter matrix. Let's do that with our data. 

In [None]:
from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12,8))

Here, trends will show up as "line-like" pictures, while non-correlated variables will look like scattered masses. For example, median_house_value and median_income look strongly correlated while housing_median_age and median_house_value do not appear to be correlated. The plots on the diagonal, where an attribute would be compared with itself, show a histogram of that attribute instead. 

### Attribute Combinations

It's worth looking at ways to combine attributes that may allow us to get better information from the data. For example, the attribute `total_rooms` doesn't seem to be very useful in aggregate. However, a rooms per house attribute may be more useful. We can also look at the percentage of rooms in a house that are bedrooms, on average. Finally, a population per household may be of interest as well. Let's create those now and calculate new correlations:

In [None]:
housing['rooms_per_household'] = housing['total_rooms']/housing['households'] # make a rooms per household attribute
housing['bedrooms_per_room'] = housing['total_bedrooms']/housing['total_rooms'] # make a bedrooms per room attribute
housing['pop_per_household'] = housing['population']/housing['households'] # make a population per house attribute

corr_matrix = housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)

It looks like bedrooms per room has a much stronger (negative) correlation with median house value than either the total number of rooms or the total number of bedrooms. This is a win!

## !!! Your Turn !!!

Answer the following questions:

1. Looking at the scatter matrix, which variables seem most correlated? Which variables seem to have no correlation?
2. How does your answer compare with the results of the numeric correlation matrix?
3. What does a negative correlation coefficient mean?
4. Justify why it makes sense to combine the households attribute with the total_rooms attribute.

***Write your answers in the cell below!***

\<DOUBLE CLICK HERE TO TYPE ANSWERS\>

## 7. Cleaning the Data

### Dealing with Missing Data

Up to this point we have only looked at the data to find general trends. But now it's time to "clean" our data to ensure our regression algorithms can handle it. For instance, we know that some total_bedrooms data is missing. The median house age was also capped. Let's make some functions to clean up this data. 

In the case of missing data, we have three options:

1. Get rid of corresponding districts (delete data)
2. Remove the attribute (don't count it)
3. Set the missing values to some chosen value (0, mean, median, etc.).

Here, we will use a built-in Scikit-Learn class, `SimpleImputer`, to accomplish option #3 using the median.

In [None]:
housing = train_set.drop("median_house_value", axis=1) #make a copy of the training data without the prediction attribute
housing_labels = train_set["median_house_value"].copy() #make a copy of the housing values (prediction attribute)

from sklearn.impute import SimpleImputer #import the filling algorithm

imputer = SimpleImputer(strategy="median") #use median to fill in missing values
housing_num = housing.drop("ocean_proximity", axis = 1) #remove the non-numeric ocean_proximity attribute, since the median can only be calculated for numeric data

imputer.fit(housing_num) #calculate the median of each attribute

X = imputer.transform(housing_num) #use the imputer to fill in any missing values with the median of the given attribute

housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index) #convert the imputer result to a dataframe
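
To make the "fit, then transform" idea behind the imputer concrete, here is a minimal sketch of median imputation written by hand. The values are hypothetical, and this only illustrates the idea; it is not Scikit-Learn's implementation.

```python
import statistics

# Hypothetical toy column; None marks a missing value
toy_bedrooms = [129.0, None, 190.0, 235.0, None, 280.0, 1106.0]

# "fit": learn the median from the values that are present
known_values = [v for v in toy_bedrooms if v is not None]
learned_median = statistics.median(known_values)

# "transform": replace each missing value with the learned median
filled = [learned_median if v is None else v for v in toy_bedrooms]

print("median:", learned_median)
print("filled:", filled)
```

Because the median is learned in a separate step, the same value can later be reused to fill gaps in new data (such as the test set) without recalculating it.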

### Dealing with Non-Numeric Data

We've dealt with some basic data cleaning, but we also need to deal with our ocean_proximity data, which is text-based. Unfortunately, most ML algorithms do not work well with non-numeric data. Fortunately, Scikit-Learn has a nifty class, `OneHotEncoder`, which will convert our data into a series of 1's and 0's. Since there are five categories, there will be five columns. For each row, one value will be a 1 (corresponding to that row's category) and the others will all be 0.

In [None]:
from sklearn.preprocessing import OneHotEncoder #import the Encoder code
housing_cat = housing[["ocean_proximity"]] #Make a dataframe with only the ocean_proximity data

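cat_encoder = OneHotEncoder() #Make an encoder object
housing_cat_1hot = cat_encoder.fit_transform(housing_cat) #Learn the categories and convert each row to 1's and 0's (stored as a sparse matrix)
housing_cat_1hot.toarray() #Convert the sparse matrix to a regular array so we can see it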
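
The sketch below shows the same idea by hand on a handful of rows, so you can see where the 1's and 0's come from. The sample rows are hypothetical, and the variable names are chosen only for this illustration.

```python
# Hypothetical sample of ocean_proximity values
toy_values = ["INLAND", "NEAR BAY", "<1H OCEAN", "INLAND", "ISLAND", "NEAR OCEAN"]

# "fit": find the distinct categories (sorted so the column order is predictable)
toy_categories = sorted(set(toy_values))

# "transform": one column per category, with a 1 only where the row matches
toy_one_hot = [[1 if value == category else 0 for category in toy_categories]
               for value in toy_values]

print(toy_categories)
for value, row in zip(toy_values, toy_one_hot):
    print(row, value)
```

Each row contains exactly one 1, so no category is treated as "larger" or "smaller" than another, which would happen if we simply numbered the categories 1 through 5.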

### Code to Add Attributes

Based on the previous explorations, it looks like adding combined attributes will help our model. Let's add some code in to take care of adding the rooms per household, population per household, and bedrooms per room attributes.

***Note: Understanding this code in detail is not necessary. Just make sure to `run` the code***

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin #Import base classes that we will use to modify code

rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6 #column positions of total_rooms, total_bedrooms, population, and households in the data array

# Code to make a CombinedAttributesAdder class that will work with other Scikit-Learn functionality (i.e., pipelines) 
# add_bedrooms_per_room is included as a hyperparameter (a setting we choose before fitting and can switch on or off when tuning)

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True): #no *args or **kwargs allows for hyperparameter tuning
        self.add_bedrooms_per_room = add_bedrooms_per_room
    
    def fit(self, X, y=None):
        return self #nothing to learn here, so just return the transformer itself
    
    def transform(self,X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix] #take slices of data, and divide
        population_per_household = X[:, population_ix] / X[:, households_ix]
        
        if self.add_bedrooms_per_room: #if you want to add the bedrooms per room, by default included
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room] #np.c_ concatenates arrays along the second axis (adds columns)
        else:
            return np.c_[X, rooms_per_household, population_per_household]
    
#Test the code above
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

print(housing_extra_attribs)

### Feature Scaling

The last bit that we need is to scale the attribute values so that they are roughly equal in value. In general, most Machine Learning algorithms don't perform well when the values are vastly different. For example, the total number of rooms currently ranges from 6 to 39,320 while the median income ranges from 0 to 15. We need to apply an appropriate scale so that these values are generally within the same order of magnitude in order for our algorithms to work. Note that, in general, scaling target values is not required.

There are two common ways to scale the data: *min-max scaling* and *standardization*. 

#### Min-Max Scaling
In min-max scaling (sometimes called normalization), the minimum value of an attribute is subtracted from every value and the result is divided by (max - min). This results in values ranging from 0 to 1.

#### Standardization
In this procedure, the mean is subtracted from all values and then the values are divided by the standard deviation. Unlike min-max scaling, this does not bound the values to a fixed range, which can be problematic for *some* algorithms (e.g., neural networks). The trade-off is that standardization is much less affected by outliers. 
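
Both procedures are only a line or two of arithmetic. The sketch below applies them by hand to a small, made-up list of room counts that contains one large outlier; the function names are chosen for this illustration only.

```python
import statistics

def min_max_scale(values):
    """Subtract the minimum, then divide by (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Subtract the mean, then divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical values with one large outlier
toy_rooms = [6.0, 1200.0, 2100.0, 3100.0, 39320.0]

toy_min_max = min_max_scale(toy_rooms)
toy_standard = standardize(toy_rooms)

print("min-max:     ", [round(v, 3) for v in toy_min_max])
print("standardized:", [round(v, 3) for v in toy_standard])
```

Notice how the single outlier forces min-max scaling to squeeze all of the other values into a narrow band near 0, while the standardized values end up with a mean of 0 and a standard deviation of 1.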

Here we will use the standardization scaler (`StandardScaler`) provided by Scikit-Learn. Note that in all cases you should *only fit scalers to training data*. Remember, we do not look at or use the **test set until the end**.

In [None]:
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
scaled_housing = std_scaler.fit_transform(housing_num)

scaled_housing = pd.DataFrame(scaled_housing, columns=housing_num.columns, index = housing_num.index)

scaled_housing.head(15)


## !!! Your Turn !!!

Answer the following questions:

1. Describe how missing values in the total_bedrooms attribute were addressed in this cleaning process.
2. Why was a numeric value needed for the Ocean_Proximity attribute?
3. A data set contains numeric data and will be analyzed using a neural network. Should you use Min-Max scaling or Standardization scaling for this task? Justify your response.

***Write your answers in the cell below!***

\<DOUBLE CLICK HERE TO TYPE ANSWERS\>

## 8. Building a Data Pipeline

In real-world applications it is not uncommon for additional data to be added to a data set or to require repeated cleaning of datasets for multiple fit approaches. Running the cleaning code each time we added data to our dataset would not be pragmatic. To aid in this we can create a data `pipeline` that will automate our data cleaning procedures. 

You can think of a pipeline as the series of steps/transformations that our data needs to go through before it can be used in our ML algorithm. We can have multiple smaller `pipelines` that feed into an overall pipeline. In our case, two smaller pipelines are needed to clean our data: one for numeric data and one for non-numeric data. We will combine the results of the two pipelines using the `ColumnTransformer` class, which needs the column labels to know which pipeline to apply to which columns.

In [None]:
from sklearn.pipeline import Pipeline # Import the pipeline code
from sklearn.compose import ColumnTransformer # Import the ColumnTransformer code.

# Build the numeric pipeline
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler())
])

# Build the categorical pipeline
cat_pipeline = Pipeline([
    ("cat_encoder", OneHotEncoder())
])

# Combine the two pipelines
num_attribs = list(housing_num) #Get the column names for the numeric data only
cat_attribs = ["ocean_proximity"] #Only one categorical attribute here

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs)
])

# Use the full pipeline to clean the data
housing_prepared = full_pipeline.fit_transform(housing)
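
To show what a pipeline is doing under the hood, here is a stripped-down sketch of the idea using only the standard library. The classes `MedianFiller`, `Standardizer`, and `TinyPipeline` are hypothetical stand-ins written for this illustration; they are far simpler than the Scikit-Learn classes they imitate.

```python
import statistics

class MedianFiller:
    def fit(self, values):
        self.median_ = statistics.median(v for v in values if v is not None)
        return self
    def transform(self, values):
        return [self.median_ if v is None else v for v in values]

class Standardizer:
    def fit(self, values):
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values)
        return self
    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]

class TinyPipeline:
    def __init__(self, steps):
        self.steps = steps
    def fit_transform(self, values):
        # Each step learns from the output of the previous step, then transforms it
        for step in self.steps:
            values = step.fit(values).transform(values)
        return values
    def transform(self, values):
        # Reuse what was learned during fit_transform; nothing is re-learned here
        for step in self.steps:
            values = step.transform(values)
        return values

toy_pipeline = TinyPipeline([MedianFiller(), Standardizer()])
toy_train_prepared = toy_pipeline.fit_transform([2.0, None, 4.0, 6.0])
toy_new_prepared = toy_pipeline.transform([None, 8.0])

print(toy_train_prepared)
print(toy_new_prepared)
```

The important design point is the split between `fit_transform` and `transform`: the median, mean, and standard deviation are learned once from the training data and then reused on any new data.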

## !!! Your Turn !!!

Answer the following questions:

1. Why are pipelines helpful for processing data?

***Write your answers in the cell below!***

\<DOUBLE CLICK HERE TO TYPE ANSWERS\>

## 9. Select and Train Your Model

Now that you have data that is cleaned and ready to analyze, it's time to do the Machine Learning and fit the data! 

We will need to settle on a model/type of algorithm that will best fit our data. There are lots of options here, and a full exploration of all the possible models is beyond the scope of this class.

### Linear Model

Let's start with the most basic type of model: a linear regression. As the name might imply, we are literally going to find the line of best fit for the data. The result will be an equation of the form:

$Y = B + \sum m_{i}x_{i} $

where B is the "y-intercept", $m_{i}$ is the slope with respect to attribute i, and $x_{i}$ is the value of attribute i.
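
As a quick worked example of this equation, the sketch below evaluates it for a single district. The intercept, slopes, and attribute values are made-up numbers chosen only to make the arithmetic easy to follow; they are not fitted values from the housing data.

```python
def predict_linear(intercept, slopes, attribute_values):
    """Evaluate Y = B + sum of m_i * x_i for one row of attribute values."""
    return intercept + sum(m * x for m, x in zip(slopes, attribute_values))

# Hypothetical numbers for illustration only
toy_B = 50_000.0
toy_m = [40_000.0, -500.0]   # slopes for two made-up attributes
toy_x = [3.0, 20.0]          # attribute values for one district

toy_prediction = predict_linear(toy_B, toy_m, toy_x)
print(toy_prediction)        # 50,000 + 40,000*3 + (-500)*20
```

Training a linear model amounts to choosing the values of B and each $m_{i}$ that make predictions like this one as close as possible to the known labels.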

Without getting too deep into the theory, let's train a linear model and see the results.

In [None]:
from sklearn.linear_model import LinearRegression #Import the code for a linear model

lin_reg = LinearRegression() # Make a linear model object
lin_reg.fit(housing_prepared, housing_labels) # Fit the model to our data

And that's it! You've trained your linear model. (Pretty easy, eh?)

But, is the model any good? Let's check how good it is using a few rows of the *training data*. (Remember, we **don't look at the test set until the end!!!!**)

In [None]:
some_data = housing.iloc[:5] #pick the first five rows of data
some_labels = housing_labels.iloc[:5] #pick the first five median house values (the labels)
some_data_prepared = full_pipeline.transform(some_data) #put the small sample we selected through the cleaning pipeline

predictions = lin_reg.predict(some_data_prepared) #Calculate the predicted values

print("Predictions: ", predictions) #print the predicted values
print("Labels: ", list(some_labels)) #print the actual values

Qualitatively looking at the values above, it seems like our model leaves something to be desired. Let's quantitatively evaluate our model using the [Root Mean Square](https://www.mathwords.com/r/root_mean_square.htm) Error (RMSE) for the whole training set.
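
Before using the library version, it may help to see the calculation by hand. The sketch below computes an RMSE for four made-up labels and predictions; the numbers and the function name `toy_rmse` are assumptions chosen so the arithmetic is easy to check.

```python
import math

def toy_rmse(labels, predictions):
    """Square each error, average the squares, then take the square root."""
    squared_errors = [(p - y) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

toy_labels = [200_000, 150_000, 300_000, 250_000]
toy_predictions = [206_000, 142_000, 300_000, 250_000]   # errors of 6,000, -8,000, 0, and 0

toy_result = toy_rmse(toy_labels, toy_predictions)
print(toy_result)
```

The result is in the same units as the labels (dollars here), which is what makes RMSE easy to interpret. Because the errors are squared before averaging, large errors count for more than small ones.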

In [None]:
from sklearn.metrics import mean_squared_error #import the library code

housing_predictions = lin_reg.predict(housing_prepared) #Predict values for the entire data set
lin_mse = mean_squared_error(housing_labels, housing_predictions) #Calculate the Mean Squared Error
lin_rmse = np.sqrt(lin_mse) #Take the square root to get the Root Mean Squared Error
print(lin_rmse)

The RMSE for our linear model is \\$68,628. That's okay (better than a random guess), but unacceptably large given that most housing prices are between \\$120,000 and \\$265,000. In this case, it seems the linear model is significantly underfitting our data. This means that either the attributes do not provide enough information to make a good prediction or our model is not powerful enough to make a good prediction. Let's move on to try a different approach. 

### Decision Tree Regressor

Another type of model we could use is a Decision Tree. The full nature of a decision tree is beyond the scope of our exploration here, but for now we can note that it is a very powerful model that can handle non-linear relationships. Let's make and fit a Decision Tree model and see how this works.



In [None]:
from sklearn.tree import DecisionTreeRegressor # Import the appropriate library code

tree_reg = DecisionTreeRegressor() # Make a decision tree model
tree_reg.fit(housing_prepared, housing_labels) # Fit the model

housing_predictions = tree_reg.predict(housing_prepared) # Calculate the predicted values
tree_mse = mean_squared_error(housing_labels, housing_predictions) # Determine the mean squared error
tree_rmse = np.sqrt(tree_mse) # Calculate the square root of the mean squared error
print(tree_rmse)


That's an RMSE of 0. That means this model is perfect, right?!? Actually, it is more likely that our model is *overfitting* our data. That's not good. But RMSE as we have been using it is a rather coarse evaluation tool. Let's look at another way we can evaluate our models.

### Evaluation via Cross-Validation

We can split our training data into a smaller training set and a validation set, and use the validation set to evaluate our model. (***REMEMBER: NEVER LOOK AT YOUR TEST SET!!!!***) This would work, and would be relatively easy to implement, but there is a more powerful way to evaluate our model: using a built-in *K-fold cross-validation* function. (Phew! That's a lot of big words!) Basically, what this means is we will break up the training data into multiple subsets, called folds. Our algorithm will use 10 folds, but we could also use five, as shown in the figure below. We will pick one fold as the validation data and the other nine folds as the training data. Then, we will pick a new fold as our validation data and use the other nine folds as training data. This process repeats 10 times, so that each fold is used for validation exactly once. It's basically getting 10 free validation sets for the price of one! And better yet, the code is implemented for us!

![Cross validation data](https://miro.medium.com/max/1400/1*AAwIlHM8TpAVe4l2FihNUQ.png)
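
The splitting itself is simple bookkeeping. Here is a minimal sketch that produces the training and validation indices for each fold by hand, using 10 rows and five folds to match the figure. The function name `k_fold_indices` is made up for this illustration, and the sketch does no shuffling.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first few folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

toy_folds = list(k_fold_indices(n_samples=10, k=5))
for train, validation in toy_folds:
    print("validate on", validation, "| train on", train)
```

Every row lands in a validation fold exactly once, so the model is always scored on rows it did not see during that round of training.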

Let's evaluate the decision tree model and linear models we fit above using this new technique.

In [None]:
from sklearn.model_selection import cross_val_score

#Evaluate the tree model:
tree_scores = cross_val_score(tree_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv = 10) #perform cross validation on the data with 10 folds
tree_rmse_scores = np.sqrt(-tree_scores) # Calculate the RMSE; Scikit-Learn scores follow a "higher is better" rule, so it reports the negative MSE and we flip the sign

#Evaluate the linear model:
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv = 10)
lin_rmse_scores = np.sqrt(-lin_scores)

#create a function we can reuse to display fit scores from cross-validation
def display_scores(scores):
    print("Scores: ", scores)
    print("Mean: ", scores.mean())
    print("Standard Deviation: ", scores.std())
    
print("Tree Scores: ")
display_scores(tree_rmse_scores)

print("Linear Model Scores: ")
display_scores(lin_rmse_scores)

Well that's interesting! It seems like, when evaluated via cross-validation, the linear model is actually better at predicting housing prices than the DecisionTree model (lower mean RMSE). The decision tree scored an RMSE of 0 on the data it was trained on but does much worse on folds it has not seen, which confirms that it is overfitting. The standard deviation for each set of evaluations is slightly higher for the linear model. Let's finish our model building by trying one more type of model to fit the data.

### Random Forest Model

As with the Decision Tree model, a full treatment of the Random Forest model is beyond the scope of this activity. Briefly, it builds on the Decision Tree model: it trains a large number of Decision Trees and combines their predictions (a forest contains many trees! Yup, mathematicians are goofy like that.) Let's go ahead and build and evaluate our model.

***Note: This model is more advanced and will take more time to calculate. If you see*** `In [*]` ***remain patient, your model is being calculated!***

In [None]:
#This cell is for fitting the model
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)

#This cell is for evaluating the model
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv = 10)
forest_scores_rmse = np.sqrt(-forest_scores)
display_scores(forest_scores_rmse)

Wow! That looks like a better model than either the Linear Regression or Decision Tree models we used earlier! But it is worth noting that the model is likely still overfitting the data, which means some fine-tuning is probably in order. We'll cover that in the next section. But before we do, let's make sure to save each model so we can use it later.

In [None]:
import joblib # Import the code that will help us save the model

joblib.dump(forest_reg, "forest_regression.pkl") #Model to save, name for the model
joblib.dump(lin_reg, "linear_regression.pkl")
joblib.dump(tree_reg, "tree_regression.pkl")

#After running this cell you should see three new files in your JupyterHub homescreen.
#Note the size difference between the different models

## !!! Your Turn !!!

Answer the following questions:

1. What were the three models used to fit the data?
2. A model is fitting corporate incomes that range between \\$1 billion and \\$3 billion. It has an RMSE of \\$135,323. Evaluate the model based on the data provided.
3. What is the point of cross validation? Why not just use the test data instead?
4. Define what it means for a model to overfit the data. How do we know the Decision Tree model is overfitting the housing data?

***Write your answers in the cell below!***

\<DOUBLE CLICK HERE TO TYPE ANSWERS\>

## 10. Fine-Tuning

We now have a small list of models that seem promising for predicting housing prices from our housing data. But we have only trained them in a coarse way so far. There are ways to tune these models, using **hyperparameters**, that will allow us to make these models more accurate.

**Hyperparameters** are parameters of a model that are set before training rather than learned during training. For example, when fitting our decision tree model we could set limits on how many nodes (decision points) we'll allow it to fit. One way to fine-tune the hyperparameters would be to vary them by hand. *Gross!* And so time consuming!

Fortunately, we can use a grid search algorithm (`GridSearchCV`), built into the Scikit-Learn library, to explore a wide range of possible hyperparameter values. Let's do this for the random forest model now. Don't worry about the exact parameters being fit (that's beyond the scope of our study), though the curious researcher is advised to explore further in the text.

In [None]:
from sklearn.model_selection import GridSearchCV #Import the relevant code library

param_grid = [
    {'n_estimators':[3, 10, 30], 'max_features':[2, 4, 6, 8]}, # Set values to search for the n_estimators and max_features hyperparameters
    {'bootstrap': [False], 'n_estimators':[3,10], 'max_features': [2,3,4]} # Second search grid to check, this time with bootstrap parameter set as false
]

forest_reg_tune = RandomForestRegressor() # Make a model

grid_search = GridSearchCV(forest_reg_tune, param_grid, cv=5, 
                           scoring='neg_mean_squared_error', return_train_score=True) #Set up the parameters for the grid search

grid_search.fit(housing_prepared, housing_labels)


The above code performs two grid searches:

- One search with three `n_estimators` values and four `max_features` values (3 x 4 = 12 combinations in total)
- A second search with `bootstrap` set to `False`, two `n_estimators` values, and three `max_features` values (2 x 3 = 6 combinations in total)

For each set of values a cross-validation fit was done five times (five folds in the data). This means that we did 12 x 5 + 6 x 5 fits in total...in other words, 90 fits! (No wonder that block took so long to run!) 
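
If you'd like to check that arithmetic, the sketch below counts the combinations using only Python's standard library. It copies the `param_grid` from the grid search cell; the helper name `count_combinations` is made up for this illustration.

```python
from itertools import product

# Same search grids as in the grid search cell above
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
cv_folds = 5  # Matches cv=5 in the grid search


def count_combinations(grid):
    """Count every combination of hyperparameter values in one search grid."""
    return len(list(product(*grid.values())))


combos_per_grid = [count_combinations(grid) for grid in param_grid]
total_fits = sum(combos_per_grid) * cv_folds  # Each combination is fit once per fold

print("Combinations per grid:", combos_per_grid)  # [12, 6]
print("Total cross-validation fits:", total_fits)  # 90
```

By default `GridSearchCV` also refits the best combination once more on the whole training set (`refit=True`), which is where `best_estimator_` comes from.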

We can get information about the results of our grid search as well. Let's look at that now.

In [None]:
print("Best Parameters: ", grid_search.best_params_) # Print the best parameters
print("Best Estimator: ", grid_search.best_estimator_)

cvres = grid_search.cv_results_ # set the set of scoring results as a variable

for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]): #go through each scoring result and print the results
    print(np.sqrt(-mean_score), params)

Here, we can see a *slight* improvement from fine-tuning our hyperparameters. It's worth noting that these parameters could be tuned further by re-running the grid search with `max_features` values greater than 8 and `n_estimators` values greater than 30. But for now, we can rest assured that we have improved our model a little bit.

There are other algorithms and methods for tuning hyperparameters, but they are beyond the scope of this activity.

## !!! Your Turn !!!

Answer the following questions:

1. Define the term "hyperparameter" in your own words.
2. In your own words, explain how a grid search works.
3. Why might `max_features` values greater than 8 and `n_estimators` values greater than 30 be worth exploring to further tune your model?

***Write your answers in the cell below***

\<DOUBLE CLICK HERE TO TYPE ANSWERS\>

## 11. Evaluate Your Model!

Wow! You've done it! You've studied your data, evaluated different models, and fine-tuned the model to build a final product. It's now time to see if your model is effective by running it on the test data! (Yes, you can finally look at your test data :D). 

There's not much to this. First, we'll clean the test data using the pipeline you built earlier. Then we'll use the cleaned data to make housing price predictions with our final model and compare the predicted results to the actual results. (Similar to how we validated our model earlier.)

Let's get to it!

In [None]:
final_model = grid_search.best_estimator_ # Pick the best model that we tuned in the grid search

X_test = test_set.drop("median_house_value", axis=1) # Make the test attribute data without the housing price information
Y_test = test_set["median_house_value"].copy() # Housing price information is held separate from the attribute data

X_test_prepared = full_pipeline.transform(X_test) # Transform/clean the test attribute data

final_predictions = final_model.predict(X_test_prepared) # Generate predicted values using the model

final_mse = mean_squared_error(Y_test, final_predictions) # Calculate the mean squared error
final_rmse = np.sqrt(final_mse) # Calculate the root mean squared error (RMSE)

print("Final root mean sq: ", final_rmse)

And that's it! Overall, the model can predict a house price with an error of \~\\$48,000. We could compute a few more statistics on the effectiveness of the model, but for now the RMSE is sufficient. (The curious researcher is urged to check the text here.)
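
As one example of an extra statistic, the sketch below computes an RMSE by hand and then a rough 95% interval around it. The function name `rmse_with_interval`, the toy prices, and the use of a normal approximation are all assumptions made for illustration. With only four toy values the interval is very crude; it becomes more meaningful on a test set with thousands of rows.

```python
import math
from statistics import NormalDist, mean, stdev


def rmse_with_interval(y_true, y_pred, confidence=0.95):
    """Return the RMSE and a rough confidence interval (normal approximation)."""
    squared_errors = [(pred - true) ** 2 for true, pred in zip(y_true, y_pred)]
    mse = mean(squared_errors)  # Mean squared error
    # Standard error of the mean of the squared errors
    sem = stdev(squared_errors) / math.sqrt(len(squared_errors))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # About 1.96 for 95%
    low = max(mse - z * sem, 0.0)  # A squared error cannot be negative
    high = mse + z * sem
    # Take square roots to get back to the units of the house prices
    return math.sqrt(mse), (math.sqrt(low), math.sqrt(high))


# Toy house prices (assumed values, not taken from the housing data)
actual_prices = [200000.0, 150000.0, 310000.0, 95000.0]
predicted_prices = [210000.0, 130000.0, 340000.0, 55000.0]

toy_rmse, (toy_low, toy_high) = rmse_with_interval(actual_prices, predicted_prices)
print("RMSE:", round(toy_rmse))
print("Rough 95% interval:", round(toy_low), "to", round(toy_high))
```

To try this on the real results, you could pass `Y_test` and `final_predictions` from the cell above in place of the toy lists.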

Beyond this, it would be time to present your solution (including assumptions, what worked, what didn't work, and your process) and then launch your solution (perhaps as a web app). The model generated here is currently not better than expert price predictions, but it could be used as a labor saver for initial predictions or when targeting new areas to invest in.

The model could be improved using different algorithms or further fine-tuning of hyperparameters (though the impact of overfitting should be considered). 

## !!! Your Turn !!!

Answer the following questions:

1. How well does your model work? What are ways you could improve it?
2. Why is it important to document the process used to generate a model when presenting your solution?
3. Based on your experience here, how important is human input in generating machine learning models?
4. What was your biggest point of learning during this activity?
5. What are one or more ways you would improve this activity for future students?

***Write your answers in the cell below***

\<DOUBLE CLICK HERE TO TYPE ANSWERS\>