# INTRO TO MACHINE LEARNING from [Kaggle](https://www.kaggle.com/learn/intro-to-machine-learning)
Learn the core ideas in machine learning, and build your first models.

In [79]:
def view(n=5):
    '''
    Set the maximum number of rows pandas displays for a DataFrame
    '''
    import pandas as pd
    pd.set_option('display.max_rows', n)

view()

**Builds on**  
[Python](https://www.kaggle.com/learn/python)

**Preparation for**  
[Machine Learning Explainability](https://www.kaggle.com/learn/machine-learning-explainability)  
[Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)  
[Intro to Deep Learning](https://www.kaggle.com/learn/intro-to-deep-learning)

# 1. How Models Work

The first step if you're new to machine learning.

## Introduction

We'll start with an overview of how machine learning models work and how they are used. 

This course will have you build models as you go through the following scenario:

> Your cousin has made millions of dollars speculating on real estate. He's offered to become business partners with you because of your interest in data science. He'll supply the money, and you'll supply models that predict how much various houses are worth. <br>
<br>
You ask your cousin how he's predicted real estate values in the past, and he says it is just intuition. But more questioning reveals that he's identified price patterns from houses he has seen in the past, and he uses those patterns to make predictions for new houses he is considering.

Machine learning works the same way. We'll start with a model called the **Decision Tree**. There are fancier models that give more accurate predictions. But decision trees are easy to understand, and they are the basic building block for some of the best models in data science.

For simplicity, we'll start with the simplest possible decision tree.

<div><img src="http://i.imgur.com/7tsb5b1.png" alt="Simple decision tree" style="width: 400px; margin-left: 20%"></div>

It divides houses into only two **categories**. The predicted price for any house under consideration is the historical average price of houses in the same category.

We use data to decide how to break the houses into two groups, and then again to determine the predicted price in each group. This step of capturing patterns from data is called **fitting** or **training** the model. The data used to **fit** the model is called the **training data**.

The details of how the model is fit (e.g. how to split up the data) are complex enough that we will save them for later. After the model has been fit, you can apply it to new data to **predict** prices of additional homes.
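To make fitting and predicting concrete, here is a minimal sketch of the two-group tree in plain Python. The house data, the function names, and the bedroom threshold are all made up for illustration. A real fitting procedure would also choose the split itself, whereas this sketch fixes it by hand.

```python
# Hypothetical training data: (number of bedrooms, sale price)
training_houses = [
    (1, 150_000),
    (2, 170_000),
    (2, 190_000),
    (3, 180_000),
    (3, 200_000),
    (4, 230_000),
]

def fit_two_group_tree(houses, bedroom_threshold):
    """Fit: compute the average historical price on each side of the split."""
    small = [price for bedrooms, price in houses if bedrooms < bedroom_threshold]
    large = [price for bedrooms, price in houses if bedrooms >= bedroom_threshold]
    return {
        "threshold": bedroom_threshold,
        "small_mean": sum(small) / len(small),
        "large_mean": sum(large) / len(large),
    }

def predict_price(tree, bedrooms):
    """Predict: return the average price of the group the house falls into."""
    if bedrooms < tree["threshold"]:
        return tree["small_mean"]
    return tree["large_mean"]

# The threshold is chosen by hand here (an assumption of this sketch)
tree = fit_two_group_tree(training_houses, bedroom_threshold=3)
print(predict_price(tree, bedrooms=2))  # 170000.0
print(predict_price(tree, bedrooms=5))
```

Every house with fewer than three bedrooms gets the same prediction, which is exactly the limitation the next section addresses.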

## Improving the Decision Tree
Which of the following two decision trees is more likely to result from fitting the real estate training data?

<div><img src="http://i.imgur.com/prAjgku.png" alt="Two decision trees" style="width: 700px; margin-left: 5%"></div>

The decision tree on the left (Decision Tree 1) probably makes more sense, because it captures the reality that houses with more bedrooms tend to sell at higher prices than houses with fewer bedrooms. The biggest shortcoming of this model is that it doesn't capture most **factors** affecting home price, like number of bathrooms, lot size, location, etc.

You can capture more factors using a tree that has more "splits." These are called **"deeper" trees**. A decision tree that also considers the total size of each house's lot might look like this:

<div><img src="http://i.imgur.com/R3ywQsR.png" alt="Deeper tree" style="width: 600px; margin-left: 10%"></div>

You predict the price of any house by tracing through the decision tree, always picking the path corresponding to that house's characteristics. The predicted price for the house is at the bottom of the tree. The point at the bottom where we make a prediction is called a **leaf**.
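Tracing a house through a tree can be sketched with a nested dictionary. The thresholds, prices, and key names here are hypothetical and are not read off the figure.

```python
# Hypothetical tree: internal nodes ask a question, leaves hold a price.
# All thresholds and prices are invented for illustration.
deeper_tree = {
    "feature": "bedrooms", "threshold": 3,
    "below": {
        "feature": "lot_size", "threshold": 8500,
        "below": {"leaf_price": 146_000},
        "at_or_above": {"leaf_price": 188_000},
    },
    "at_or_above": {
        "feature": "lot_size", "threshold": 11500,
        "below": {"leaf_price": 170_000},
        "at_or_above": {"leaf_price": 233_000},
    },
}

def predict(node, house):
    """Follow the branch matching the house until a leaf is reached."""
    while "leaf_price" not in node:
        if house[node["feature"]] < node["threshold"]:
            node = node["below"]
        else:
            node = node["at_or_above"]
    return node["leaf_price"]

house = {"bedrooms": 4, "lot_size": 9000}
print(predict(deeper_tree, house))  # 170000
```

The house has at least 3 bedrooms, so it takes the second branch. Its lot is smaller than 11500, so it lands in the leaf holding 170000.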

The splits and values at the leaves will be determined by the data, so it's time for you to check out the data you will be working with.

# 2. Basic Data Exploration

Load and understand your data.

## Using Pandas to Get Familiar With Your Data
The first step in any machine learning project is to familiarize yourself with the data. You'll use the Pandas library for this. Pandas is the primary tool data scientists use for exploring and manipulating data. Most people abbreviate pandas in their code as `pd`:

In [36]:
import pandas as pd

As an example, we'll look at data about home prices in Melbourne, Australia. In the hands-on exercises, you will apply the same processes to a new dataset, which has home prices in Iowa.

In [38]:
# save filepath to variable for easier access
melbourne_file_path = 'materials/house_prices/melb_data.csv'

# read the data and store data in DataFrame titled melbourne_data
melbourne_data = pd.read_csv(melbourne_file_path)

# print a summary of the data in Melbourne data
view(10)
melbourne_data.describe()

| | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13518.0 | 13580.0 | 7130.0 | 8205.0 | 13580.0 | 13580.0 | 13580.0 |
| mean | 2.937997 | 1075684.0 | 10.137776 | 3105.301915 | 2.914728 | 1.534242 | 1.610075 | 558.416127 | 151.96765 | 1964.684217 | -37.809203 | 144.995216 | 7454.417378 |
| std | 0.955748 | 639310.7 | 5.868725 | 90.676964 | 0.965921 | 0.691712 | 0.962634 | 3990.669241 | 541.014538 | 37.273762 | 0.07926 | 0.103916 | 4378.581772 |
| min | 1.0 | 85000.0 | 0.0 | 3000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1196.0 | -38.18255 | 144.43181 | 249.0 |
| 25% | 2.0 | 650000.0 | 6.1 | 3044.0 | 2.0 | 1.0 | 1.0 | 177.0 | 93.0 | 1940.0 | -37.856822 | 144.9296 | 4380.0 |
| 50% | 3.0 | 903000.0 | 9.2 | 3084.0 | 3.0 | 1.0 | 2.0 | 440.0 | 126.0 | 1970.0 | -37.802355 | 145.0001 | 6555.0 |
| 75% | 3.0 | 1330000.0 | 13.0 | 3148.0 | 3.0 | 2.0 | 2.0 | 651.0 | 174.0 | 1999.0 | -37.7564 | 145.058305 | 10331.0 |
| max | 10.0 | 9000000.0 | 48.1 | 3977.0 | 20.0 | 8.0 | 10.0 | 433014.0 | 44515.0 | 2018.0 | -37.40853 | 145.52635 | 21650.0 |


## Interpreting Data Description
The results show 8 numbers for each column in your original dataset. The first number, the **count**, shows how many rows have **non-missing values**.

Missing values arise for many reasons. For example, the size of the 2nd bedroom wouldn't be collected when surveying a 1 bedroom house. We'll come back to the topic of **missing data**.

The second value is the **mean**, which is the average. Under that, **std** is the standard deviation, which measures how numerically spread out the values are.

To interpret the **min, 25%, 50%, 75%** and **max** values, imagine sorting each column from lowest to highest value. The first (smallest) value is the **min**. If you go a quarter way through the list, you'll find a number that is bigger than **25%** of the values and smaller than **75%** of the values. That is the **25%** value (pronounced _"25th percentile"_). The **50th** and **75th percentiles** are defined analogously, and the **max** is the largest number.
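A tiny example can make these eight numbers easier to read. The sketch below uses a made-up column with one missing value. The values are illustrative only and are not taken from the Melbourne data.

```python
import pandas as pd

# Hypothetical column with one missing value (None is stored as NaN)
landsize = pd.Series([100.0, 200.0, 300.0, 400.0, 500.0, None])

summary = landsize.describe()
print(summary)

# The missing value is excluded from count and from every other statistic
print(summary["count"])  # 5.0
print(summary["25%"])    # 200.0
```

With five sorted values, a quarter of the way through the list falls exactly on the second value, so the 25th percentile is 200. When the position falls between two values, pandas interpolates between them by default.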

# 3. Your First Machine Learning Model

## Selecting Data for Modeling
Your dataset had too many variables to wrap your head around, or even to print out nicely. How can you pare down this overwhelming amount of data to something you can understand?

We'll start by picking a few variables using our intuition. Later courses will show you statistical techniques to automatically prioritize variables.

To choose variables/columns, we'll need to see a list of all columns in the dataset. That is done with the `.columns` property of the DataFrame (the bottom line of code below).

In [16]:
import pandas as pd

melbourne_file_path = 'materials/house_prices/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path) 
melbourne_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

The Melbourne data has some missing values (some houses for which some variables weren't recorded). We'll learn to handle missing values in a later tutorial. Your Iowa data doesn't have missing values in the columns you use. So we will take the simplest option for now, and drop houses with missing values from our data:

In [17]:
# dropna drops missing values (think of na as "not available")
melbourne_data = melbourne_data.dropna(axis=0)

There are many ways to select a subset of your data. The Pandas course covers these in more depth, but we will focus on two approaches for now.

1. **Dot notation**, which we use to select the "prediction target".
2. Selecting with a **column list**, which we use to select the "features".
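The two approaches can be sketched on a miniature DataFrame. The table and the variable names are hypothetical. Only the column names are borrowed from the Iowa data.

```python
import pandas as pd

# Hypothetical miniature dataset (column names borrowed from the Iowa data)
homes = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "1stFlrSF": [856, 1262, 920],
    "SalePrice": [208500, 181500, 223500],
})

# Dot notation returns a single column as a Series
price_column = homes.SalePrice

# A list of column names inside brackets returns a DataFrame
feature_table = homes[["LotArea", "1stFlrSF"]]

# Dot notation only works when the name is a valid Python identifier,
# so a column such as '1stFlrSF' has to be selected with brackets
first_floor = homes["1stFlrSF"]

print(type(price_column).__name__)   # Series
print(type(feature_table).__name__)  # DataFrame
print(feature_table.shape)           # (3, 2)
```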

## Selecting The Prediction Target ($y$)
You can pull out a variable with **dot-notation**. This single column is stored in a Series.

We'll use the dot notation to select the column we want to predict, which is called the **prediction target**. By convention, the prediction target is called $y$. So the code we need to save the house prices in the Melbourne data is

In [21]:
y = melbourne_data.Price
y

1        1035000.0
2        1465000.0
           ...    
12209     560000.0
12212    2450000.0
Name: Price, Length: 6196, dtype: float64

## Choosing "Features" ($X$)
The **columns** that are inputted into our model (and later used to make predictions) are called **features**. In our case, those would be the columns used to determine the home price. Sometimes, you will use all columns except the target as features. Other times you'll be better off with fewer features.

For now, we'll build a model with only a few features. Later on you'll see how to iterate and compare models built with different features.

We select multiple features by providing a list of column names inside brackets. Each item in that list should be a string (with quotes).

Here is an example:

In [24]:
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']

By convention, this data is called $X$.

In [27]:
X = melbourne_data[melbourne_features]

Let's quickly review the data we'll be using to predict house prices. The cells below display `X`, then call the `describe` method for summary statistics and the `info` method for column types and non-null counts.

In [43]:
view()
X

| | Rooms | Bathroom | Landsize | Lattitude | Longtitude |
|---|---|---|---|---|---|
| 1 | 2 | 1.0 | 156.0 | -37.80790 | 144.99340 |
| 2 | 3 | 2.0 | 134.0 | -37.80930 | 144.99440 |
| ... | ... | ... | ... | ... | ... |
| 12209 | 2 | 1.0 | 0.0 | -37.85581 | 144.99025 |
| 12212 | 6 | 3.0 | 1087.0 | -37.81038 | 144.89389 |


In [45]:
view(10)
X.describe()

| | Rooms | Bathroom | Landsize | Lattitude | Longtitude |
|---|---|---|---|---|---|
| count | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 |
| mean | 2.931407 | 1.57634 | 471.00694 | -37.807904 | 144.990201 |
| std | 0.971079 | 0.711362 | 897.449881 | 0.07585 | 0.099165 |
| min | 1.0 | 1.0 | 0.0 | -38.16492 | 144.54237 |
| 25% | 2.0 | 1.0 | 152.0 | -37.855438 | 144.926198 |
| 50% | 3.0 | 1.0 | 373.0 | -37.80225 | 144.9958 |
| 75% | 4.0 | 2.0 | 628.0 | -37.7582 | 145.0527 |
| max | 8.0 | 8.0 | 37000.0 | -37.45709 | 145.52635 |


In [47]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6196 entries, 1 to 12212
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Rooms       6196 non-null   int64  
 1   Bathroom    6196 non-null   float64
 2   Landsize    6196 non-null   float64
 3   Lattitude   6196 non-null   float64
 4   Longtitude  6196 non-null   float64
dtypes: float64(4), int64(1)
memory usage: 290.4 KB


Visually checking your data with these commands is an important part of a data scientist's job. You'll frequently find surprises in the dataset that deserve further inspection.

## Building Your Model
You will use the **scikit-learn** library to create your models. When coding, this library is written as `sklearn`, as you will see in the sample code. 

> Scikit-learn is easily the most popular library for modeling the types of data typically stored in DataFrames.

The steps to building and using a model are:

- **Define**: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
- **Fit**: Capture patterns from provided data. This is the heart of modeling.
- **Predict**: Just what it sounds like.
- **Evaluate**: Determine how accurate the model's predictions are.

Here is an example of defining a decision tree model with scikit-learn and fitting it with the features and target variable.

In [49]:
from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit model
melbourne_model.fit(X, y)

Many machine learning models allow some **randomness in model training**. Specifying a number for `random_state` ensures you get the same results in each run. This is considered a good practice. You can use any number, and model quality won't depend meaningfully on exactly what value you choose.
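A quick way to see what `random_state` buys you is to fit two models with the same value and compare their predictions. The sketch below uses synthetic data, and the variable names are assumptions made for illustration. The scikit-learn documentation notes that features are randomly permuted at each split, so ties between equally good splits can otherwise be resolved differently from run to run.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): 200 rows, 3 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(size=200)
X_new = rng.normal(size=(50, 3))

# Two models defined with the same random_state and fit on the same data
model_a = DecisionTreeRegressor(random_state=1).fit(X_demo, y_demo)
model_b = DecisionTreeRegressor(random_state=1).fit(X_demo, y_demo)

# Their predictions on unseen rows agree exactly
same = np.array_equal(model_a.predict(X_new), model_b.predict(X_new))
print(same)  # True
```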

We now have a fitted model that we can use to make predictions.

In practice, you'll want to make predictions for new houses coming on the market rather than the houses we already have prices for. But we'll make predictions for the first few rows of the training data to see how the predict function works.

In [50]:
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))

Making predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
The predictions are
[1035000. 1465000. 1600000. 1876000. 1636000.]


## Exercises

### Load Data

In [74]:
import pandas as pd

# Path of the file to read
iowa_file_path = 'materials/house_prices/train.csv'
home_data = pd.read_csv(iowa_file_path)

In [75]:
view()
home_data

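| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1458 | 1459 | 20 | RL | 68.0 | 9717 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 4 | 2010 | WD | Normal | 142125 |
| 1459 | 1460 | 20 | RL | 75.0 | 9937 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 6 | 2008 | WD | Normal | 147500 |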


### Step 1: Specify Prediction Target

Select the target variable, which corresponds to the sales price. Save this to a new variable called `y`. You'll need to print a list of the columns to find the name of the column you need.

In [76]:
# print the list of columns in the dataset to find the name of the prediction target
home_data.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [77]:
y = home_data.SalePrice
y

0       208500
1       181500
         ...  
1458    142125
1459    147500
Name: SalePrice, Length: 1460, dtype: int64

### Step 2: Create X
Now you will create a DataFrame called `X` holding the predictive features.

Since you want only some columns from the original data, you'll first create a list with the names of the columns you want in `X`.

You'll use just the following columns in the list (you can copy and paste the whole list to save some typing, though you'll still need to add quotes):

- LotArea
- YearBuilt
- 1stFlrSF
- 2ndFlrSF
- FullBath
- BedroomAbvGr
- TotRmsAbvGrd

After you've created that list of features, use it to create the DataFrame that you'll use to fit the model.

In [78]:
# Create the list of features below
feature_names = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 
                 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']

# Select data corresponding to features in feature_names
X = home_data[feature_names]

#### Review Data
Before building a model, take a quick look at `X` to verify it looks sensible.

In [79]:
X

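| | LotArea | YearBuilt | 1stFlrSF | 2ndFlrSF | FullBath | BedroomAbvGr | TotRmsAbvGrd |
|---|---|---|---|---|---|---|---|
| 0 | 8450 | 2003 | 856 | 854 | 2 | 3 | 8 |
| 1 | 9600 | 1976 | 1262 | 0 | 2 | 3 | 6 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1458 | 9717 | 1950 | 1078 | 0 | 1 | 2 | 5 |
| 1459 | 9937 | 1965 | 1256 | 0 | 1 | 3 | 6 |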


In [80]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   LotArea       1460 non-null   int64
 1   YearBuilt     1460 non-null   int64
 2   1stFlrSF      1460 non-null   int64
 3   2ndFlrSF      1460 non-null   int64
 4   FullBath      1460 non-null   int64
 5   BedroomAbvGr  1460 non-null   int64
 6   TotRmsAbvGrd  1460 non-null   int64
dtypes: int64(7)
memory usage: 80.0 KB


In [81]:
view(10)
X.describe()

| | LotArea | YearBuilt | 1stFlrSF | 2ndFlrSF | FullBath | BedroomAbvGr | TotRmsAbvGrd |
|---|---|---|---|---|---|---|---|
| count | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 10516.828082 | 1971.267808 | 1162.626712 | 346.992466 | 1.565068 | 2.866438 | 6.517808 |
| std | 9981.264932 | 30.202904 | 386.587738 | 436.528436 | 0.550916 | 0.815778 | 1.625393 |
| min | 1300.0 | 1872.0 | 334.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 25% | 7553.5 | 1954.0 | 882.0 | 0.0 | 1.0 | 2.0 | 5.0 |
| 50% | 9478.5 | 1973.0 | 1087.0 | 0.0 | 2.0 | 3.0 | 6.0 |
| 75% | 11601.5 | 2000.0 | 1391.25 | 728.0 | 2.0 | 3.0 | 7.0 |
| max | 215245.0 | 2010.0 | 4692.0 | 2065.0 | 3.0 | 8.0 | 14.0 |


### Step 3: Specify and Fit Model

Create a `DecisionTreeRegressor` and save it to `iowa_model`. Ensure you've done the relevant `import` from `sklearn` to run this command.

Then fit the model you just created using the data in `X` and `y` that you saved above.

In [87]:
from sklearn.tree import DecisionTreeRegressor

# Specify the model
# For model reproducibility, set a numeric value for random_state when specifying the model
iowa_model = DecisionTreeRegressor(random_state=1)

# Fit the model
iowa_model.fit(X, y)

### Step 4: Make Predictions

Make predictions with the model's `predict` command using `X` as the data. Save the results to a variable called `predictions`.

In [68]:
predictions = iowa_model.predict(X)

In [97]:
predictions

array([208500., 181500., 223500., ..., 266500., 142125., 147500.])

### Think About Your Results

Use the `.head` method to compare the top few predictions to the actual home values (in `y`) for those same homes. Anything surprising?

In [98]:
# print(iowa_model.predict(X.head()))
print(predictions[:5])
print('Real Prices:')
print(y.head())

[208500. 181500. 223500. 140000. 250000.]
Real Prices:
0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64


# 4. Model Validation

Measure the performance of your model, so you can test and compare alternatives.

You've built a model. But how good is it?

In this lesson, you will learn to use model validation to measure the quality of your model. Measuring model quality is the key to iteratively improving your models.

## What is Model Validation
You'll want to evaluate almost every model you ever build. In most (though not all) applications, the relevant measure of model quality is **predictive accuracy**. In other words, will the model's predictions be close to what actually happens?

> Many people make a **huge mistake** when measuring predictive accuracy. They make predictions with their training data and compare those predictions to the target values in the training data. 

You'll see the problem with this approach and how to solve it in a moment, but let's think about how we'd do this first.

You'd first need to summarize the model quality in an understandable way. If you compare predicted and actual home values for 10,000 houses, you'll likely find a mix of good and bad predictions. Looking through a list of 10,000 predicted and actual values would be pointless. We need to summarize this into a **single metric**.

### MAE

There are many metrics for summarizing model quality, but we'll start with one called **Mean Absolute Error** (also called **MAE**). Let's break down this metric starting with the last word - _error_.

The prediction error for each house is:

```python
error = actual - predicted
```

So, if a house cost `$150,000` and you predicted it would cost `$100,000` the error is `$50,000`.

With the `MAE` metric, we take the absolute value of each error. This converts each error to a positive number. We then take the average of those absolute errors. This is our measure of model quality. In plain English, it can be said as

> On average, our predictions are off by about `X`.
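The calculation can be worked through by hand on a few houses. The prices below are hypothetical. The first house reuses the `$150,000` versus `$100,000` example from above.

```python
# Hypothetical actual and predicted prices for four houses
actual = [150_000, 200_000, 320_000, 410_000]
predicted = [100_000, 210_000, 300_000, 450_000]

# Error for each house: actual - predicted (can be positive or negative)
errors = [a - p for a, p in zip(actual, predicted)]

# Taking the absolute value makes every error positive, then we average
absolute_errors = [abs(e) for e in errors]
mae = sum(absolute_errors) / len(absolute_errors)

print(errors)  # [50000, -10000, 20000, -40000]
print(mae)     # 30000.0
```

Without the absolute value, the positive and negative errors would partly cancel. The plain average of the errors here is only 5,000, which would make the predictions look much better than they are.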

To calculate MAE, we first need a model:

In [2]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Load data
melbourne_file_path = 'materials/house_prices/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)

# Drop rows that have a missing value in any column
filtered_melbourne_data = melbourne_data.dropna(axis=0)

# Choose target and features
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                        'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]

# Define model
melbourne_model = DecisionTreeRegressor()

# Fit model
melbourne_model.fit(X, y)

Once we have a model, here is how we calculate the mean absolute error:

In [3]:
from sklearn.metrics import mean_absolute_error

predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

434.71594577146544

## The Problem with "In-Sample" Scores
The measure we just computed can be called an "in-sample" score. We used a single "sample" of houses for both building the model and evaluating it. Here's why this is bad.

Imagine that, in the large real estate market, door color is unrelated to home price.

However, in the sample of data you used to build the model, all homes with green doors were very expensive. The model's job is to find patterns that predict home prices, so it will see this pattern, and it will always predict high prices for homes with green doors.

> Since this pattern was derived from the training data, the model will appear accurate in the training data.

But if this pattern doesn't hold when the model sees new data, **the model would be very inaccurate when used in practice**.
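The green-door effect can be reproduced with synthetic data. In the sketch below the single feature is pure noise, standing in for something like door color, so there is no real pattern to learn. The data, sizes, and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): the feature is unrelated to price
rng = np.random.default_rng(0)
X_seen = rng.normal(size=(300, 1))
y_seen = rng.normal(loc=1_000_000, scale=200_000, size=300)
X_unseen = rng.normal(size=(300, 1))
y_unseen = rng.normal(loc=1_000_000, scale=200_000, size=300)

# An unrestricted tree can memorize the houses it was fit on
model = DecisionTreeRegressor(random_state=0).fit(X_seen, y_seen)

in_sample_mae = mean_absolute_error(y_seen, model.predict(X_seen))
new_data_mae = mean_absolute_error(y_unseen, model.predict(X_unseen))

print(f"In-sample MAE: {in_sample_mae:,.0f}")
print(f"New-data MAE:  {new_data_mae:,.0f}")
```

The in-sample score looks nearly perfect even though the feature carries no information, while the error on unseen houses is large. The in-sample score alone would never reveal that.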

## Validation data

Since a model's practical value comes from making predictions on new data, we measure performance on data that wasn't used to build the model. The most straightforward way to do this is to exclude some data from the model-building process, and then use that data to test the model's accuracy on data it hasn't seen before. This data is called **validation data**.

### `train_test_split`
The `scikit-learn` library has a function `train_test_split` to break up the data into **two pieces**. We'll use some of that data as **training data** to fit the model, and we'll use the other data as **validation data** to calculate `mean_absolute_error`.
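How large are the two pieces? When no size argument is given, `train_test_split` holds out 25% of the rows. A small sketch with synthetic stand-in arrays (the names and data are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins (assumption): 100 rows, 2 features
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# With no size arguments, 25% of the rows are held out for validation
demo_train_X, demo_val_X, demo_train_y, demo_val_y = train_test_split(
    X_demo, y_demo, random_state=0
)
print(len(demo_train_X), len(demo_val_X))  # 75 25

# Re-running with the same random_state reproduces the same split
_, _, _, demo_val_y_again = train_test_split(X_demo, y_demo, random_state=0)
print(np.array_equal(demo_val_y, demo_val_y_again))  # True
```

A different proportion can be requested with the `test_size` argument, for example `test_size=0.2`.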

Split data into training and validation data, for both features and target. The split is based on a random number generator. Supplying a numeric value to the `random_state` argument guarantees we get the same split every time we run this script.

Here is the code:

In [4]:
from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)

# Define model
melbourne_model = DecisionTreeRegressor()

# Fit model
melbourne_model.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)

print(mean_absolute_error(val_y, val_predictions))

258886.83214977404


## Conclusions

Your mean absolute error for the in-sample data was about `435` dollars. Out-of-sample it is more than `250,000` dollars!

This is the difference between a model that is almost exactly right, and one that is unusable for most practical purposes. As a point of reference, the average home value in the validation data is 1.1 million dollars. So 

> the error in new data is about a quarter of the average home value.

There are many ways to improve this model, such as experimenting to find better features or different model types.

## Exercises

### Load Data

You've built a model. In this exercise you will test how good your model is.

Run the cell below to set up your coding environment where the previous exercise left off.

In [28]:
# Code you have previously used to load data
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Path of the file to read
iowa_file_path = 'materials/house_prices/train.csv'
home_data = pd.read_csv(iowa_file_path)

# Create target object and features
y = home_data.SalePrice
feature_columns = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[feature_columns]

# Specify Model
iowa_model = DecisionTreeRegressor()
# Fit Model
iowa_model.fit(X, y)

print("First in-sample predictions:\t\t", iowa_model.predict(X.head()))
print("Actual target values for those homes:\t", y.head().tolist())

First in-sample predictions:		 [208500. 181500. 223500. 140000. 250000.]
Actual target values for those homes:	 [208500, 181500, 223500, 140000, 250000]


### Step 1: Split Your Data
Use the `train_test_split` function to split up your data.

Give it the argument `random_state=1` so the check functions know what to expect when verifying your code.

Recall, 
- your target is loaded in `y` and
- your features are loaded in the DataFrame `X`.

In [29]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split features and target into training and validation sets
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

In [30]:
train_y

6       307000
807     223500
         ...  
235      89500
1061     81000
Name: SalePrice, Length: 1095, dtype: int64

In [31]:
train_X

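| | LotArea | YearBuilt | 1stFlrSF | 2ndFlrSF | FullBath | BedroomAbvGr | TotRmsAbvGrd |
|---|---|---|---|---|---|---|---|
| 6 | 10084 | 2004 | 1694 | 0 | 2 | 3 | 7 |
| 807 | 21384 | 1923 | 1072 | 504 | 1 | 3 | 6 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 235 | 1680 | 1971 | 483 | 504 | 1 | 2 | 5 |
| 1061 | 18000 | 1935 | 894 | 0 | 1 | 2 | 6 |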


In [32]:
val_y

258     231500
267     179500
         ...  
1369    232000
628     135000
Name: SalePrice, Length: 365, dtype: int64

In [33]:
val_X

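| | LotArea | YearBuilt | 1stFlrSF | 2ndFlrSF | FullBath | BedroomAbvGr | TotRmsAbvGrd |
|---|---|---|---|---|---|---|---|
| 258 | 12435 | 2001 | 963 | 829 | 2 | 3 | 7 |
| 267 | 8400 | 1939 | 1052 | 720 | 2 | 4 | 8 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1369 | 10635 | 2003 | 1668 | 0 | 2 | 3 | 8 |
| 628 | 11606 | 1969 | 1040 | 1040 | 1 | 5 | 9 |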


### Step 2: Specify and Fit the Model

Create a `DecisionTreeRegressor` model and fit it to the relevant data. Set `random_state` to `1` again when creating the model.

In [34]:
# Specify the model
iowa_model = DecisionTreeRegressor(random_state=1)

# Fit iowa_model with the training data.
iowa_model.fit(train_X, train_y)

### Step 3: Make Predictions with Validation data

In [35]:
# Predict with all validation observations
val_predictions = iowa_model.predict(val_X)

Inspect your predictions and actual values from `validation data`.

In [36]:
# print the top few validation predictions
print(iowa_model.predict(val_X.head()))

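# print the top few actual prices
# note: y.head() shows the first rows of the full dataset, not the validation rows;
# val_y.head() would give the actual prices that match the predictions above
print(y.head())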

[186500. 184000. 130000.  92000. 164500.]
0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64


What do you notice that is different from what you saw with in-sample predictions?

Do you remember why **validation predictions** differ from **in-sample** (or **training**) predictions? This is an important idea from the last lesson.
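When comparing predictions with actual values, it helps to check that both refer to the same rows. The sketch below uses a hypothetical eight-row dataset to show that `train_test_split` keeps the original row labels, so the validation targets line up with the validation features while `head()` on the full target does not.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature dataset (values are illustrative only)
X_demo = pd.DataFrame(
    {"LotArea": [8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382]}
)
y_demo = pd.Series([208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000])

demo_train_X, demo_val_X, demo_train_y, demo_val_y = train_test_split(
    X_demo, y_demo, random_state=1
)

# The validation target carries the same row labels, in the same order,
# as the validation features
print(list(demo_val_X.index) == list(demo_val_y.index))  # True

# head() on the full target returns rows 0 to 4 of the whole dataset,
# which are generally not the validation rows
print(list(y_demo.head().index))  # [0, 1, 2, 3, 4]
```

So the actual prices to set beside validation predictions are the validation targets, not the first rows of the full target.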

### Step 4: Calculate the Mean Absolute Error in Validation Data

In [37]:
from sklearn.metrics import mean_absolute_error
val_mae = mean_absolute_error(val_y, val_predictions)

In [38]:
val_mae

29652.931506849316

Is that `MAE` good? There isn't a general rule for what values are good that applies across applications. But you'll see how to use (and improve) this number in the next step.

# 5. Underfitting and Overfitting

Fine-tune your model for better performance.

At the end of this step, you will understand the concepts of underfitting and overfitting, and you will be able to apply these ideas to make your models more accurate.

## Experimenting With Different Models
Now that you have a reliable way to measure model accuracy, you can experiment with alternative models and see which gives the best predictions. But what alternatives do you have for models?

You can see in [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)'s documentation that the decision tree model has many options (more than you'll want or need for a long time). The most important options determine the tree's **depth**. Recall from the first lesson in this course that 

> a tree's depth is a measure of how many splits it makes before coming to a prediction. 

This is a relatively shallow tree:

<div><img src="http://i.imgur.com/R3ywQsR.png" alt="Three level decision tree" style="width: 500px; margin-left: 10%"></div>

In practice, it's not uncommon for a tree to have 10 splits between the top level (all houses) and a leaf. As the tree gets deeper, the dataset gets sliced up into leaves with fewer houses. If a tree has only 1 split, it divides the data into 2 groups. If each group is split again, we get 4 groups of houses. Splitting each of those again creates 8 groups. If we keep doubling the number of groups by adding more splits at each level, we'll have $2^{10}$ groups of houses by the time we get to the 10th level. That's 1024 leaves.

When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).
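The doubling arithmetic is easy to tabulate. This sketch assumes a perfectly balanced tree in which every group is split at every level, which real trees rarely are. The figure 6196 is the number of Melbourne rows that remained after dropping missing values earlier.

```python
# Rows left in the Melbourne data after dropping missing values
n_houses = 6196

# Assumption: every group is split again at each level (a balanced tree)
for depth in range(1, 11):
    n_leaves = 2 ** depth
    print(f"depth {depth:2d}: {n_leaves:4d} leaves, "
          f"about {n_houses / n_leaves:7.1f} houses per leaf")
```

At depth 10 there are 1024 leaves, which leaves only about 6 houses per leaf on average. Each prediction would then rest on a handful of sales.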

This is a phenomenon called **overfitting**, where 

> a model matches the training data almost perfectly, but does poorly in validation and other new data. 

On the flip side, if we make our tree very shallow, it doesn't divide up the houses into very distinct groups.

At an extreme, if a tree divides houses into only 2 or 4 groups, each group still has a wide variety of houses. Resulting predictions may be far off for most houses, even in the training data (and they will be bad in validation too, for the same reason).

> When a model fails to capture important distinctions and patterns in the data, so it performs poorly even in training data, that is called **underfitting**.

Since we care about accuracy on new data, which we estimate from our validation data, we want to find the sweet spot between underfitting and overfitting. Visually, we want the low point of the (red) validation curve in the figure below.

<div><img src="http://i.imgur.com/AXSEOfI.png" alt="Three level decision tree" style="width: 550px; margin-left: 10%"></div>
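The two curves can be approximated on synthetic data by fitting trees of increasing depth and recording both errors. This is a sketch under stated assumptions: the data is a made-up noisy sine wave, and depth is controlled with scikit-learn's `max_depth` parameter.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): a smooth signal plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(1000, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=1000)

demo_train_X, demo_val_X, demo_train_y, demo_val_y = train_test_split(
    X_demo, y_demo, random_state=0
)

results = {}
for max_depth in [1, 3, 6, 20]:
    model = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    model.fit(demo_train_X, demo_train_y)
    train_mae = mean_absolute_error(demo_train_y, model.predict(demo_train_X))
    val_mae = mean_absolute_error(demo_val_y, model.predict(demo_val_X))
    results[max_depth] = (train_mae, val_mae)
    print(f"depth {max_depth:2d}  train MAE {train_mae:.3f}  validation MAE {val_mae:.3f}")
```

The training error keeps shrinking as the tree gets deeper. The validation error typically improves at first and then stops improving or gets worse, which is the sweet spot the figure describes.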

## Example
There are a few alternatives for controlling the tree depth, and many allow for some routes through the tree to have greater depth than other routes. But the `max_leaf_nodes` argument provides a very sensible way to control _overfitting_ **vs** _underfitting_. The more leaves we allow the model to make, the more we move from the underfitting area in the above graph to the overfitting area.

We can use a utility function to help compare `MAE` scores from different values for `max_leaf_nodes`:

In [40]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

The data is loaded into `train_X`, `val_X`, `train_y` and `val_y` using the code you've already seen (and which you've already written).

In [42]:
# Data Loading Code Runs At This Point
import pandas as pd
from sklearn.model_selection import train_test_split

# Load data
melbourne_file_path = 'materials/house_prices/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)

# Filter rows with missing values
filtered_melbourne_data = melbourne_data.dropna(axis=0)

# Choose target and features
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                        'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]

# split data into training and validation data, for both features and target
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

We can use a for-loop to compare the accuracy of models built with different values for `max_leaf_nodes`.

In [43]:
# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5  		 Mean Absolute Error:  347380
Max leaf nodes: 50  		 Mean Absolute Error:  258171
Max leaf nodes: 500  		 Mean Absolute Error:  243495
Max leaf nodes: 5000  		 Mean Absolute Error:  255575


Of the options listed, `500` is the optimal number of leaves.
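Rather than reading the winner off the printed output, the scores can be kept in a dictionary and the best key picked with `min`. A minimal sketch, using the MAE values copied from the output above (the name `mae_by_leaves` is introduced here for illustration):

```python
# Validation MAE for each candidate, copied from the printed output.
mae_by_leaves = {5: 347380, 50: 258171, 500: 243495, 5000: 255575}

# min() over the keys, ranked by their MAE, returns the best candidate.
best_tree_size = min(mae_by_leaves, key=mae_by_leaves.get)
print(best_tree_size)  # 500
```

In a real run the dictionary would be filled inside the loop, for example with `mae_by_leaves[max_leaf_nodes] = my_mae`.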

## Conclusion
Here's the takeaway: Models can suffer from either:

- **Overfitting**: capturing spurious patterns that won't recur in the future, leading to less accurate predictions, or
- **Underfitting**: failing to capture relevant patterns, again leading to less accurate predictions.

We use validation data, which isn't used in model training, to measure a candidate model's accuracy. This lets us try many candidate models and keep the best one.

## Exercises

### Load Data

You've built your first model, and now it's time to optimize the size of the tree to make better predictions. Run this cell to set up your coding environment where the previous step left off.

In [47]:
# Code you have previously used to load data
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Path of the file to read
iowa_file_path = 'materials/house_prices/train.csv'

home_data = pd.read_csv(iowa_file_path)

# Create target object and call it y
y = home_data.SalePrice

# Create X
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]

# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Specify Model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit Model
iowa_model.fit(train_X, train_y)

# Make validation predictions and calculate mean absolute error
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE: {:,.0f}".format(val_mae))

Validation MAE: 29,653


In [48]:
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

### Step 1: Compare Different Tree Sizes
Write a loop that tries each of the candidate values for `max_leaf_nodes` listed below.

Call the `get_mae` function on each value of `max_leaf_nodes`. Store the output in some way that allows you to select the value of `max_leaf_nodes` that gives the most accurate model on your data.

In [49]:
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
best_mae = None

# Write loop to find the ideal tree size from candidate_max_leaf_nodes
for max_leaf_nodes in candidate_max_leaf_nodes:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print(f'Max leaf nodes: {max_leaf_nodes} \t Mean Absolute Error: {my_mae}')
    if best_mae is None or my_mae < best_mae:
        best_mae = my_mae
        best_tree_size = max_leaf_nodes

# best_tree_size now holds the best value of max_leaf_nodes (either 5, 25, 50, 100, 250 or 500)
print('\nBest tree size:', best_tree_size)

Max leaf nodes: 5 	 Mean Absolute Error: 35044.51299744237
Max leaf nodes: 25 	 Mean Absolute Error: 29016.41319191076
Max leaf nodes: 50 	 Mean Absolute Error: 27405.930473214907
Max leaf nodes: 100 	 Mean Absolute Error: 27282.50803885739
Max leaf nodes: 250 	 Mean Absolute Error: 27893.822225701646
Max leaf nodes: 500 	 Mean Absolute Error: 29454.18598068598

Best tree size: 100


### Step 2: Fit Model Using All Data
You know the best tree size. If you were going to deploy this model in practice, you would make it even more accurate by using all of the data and keeping that tree size. That is, you don't need to hold out the validation data now that you've made all your modeling decisions.

In [50]:
# Build the final model with the optimal tree size
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size)

# Fit the final model on all of the data
final_model.fit(X, y)

You've tuned this model and improved your results. But we are still using Decision Tree models, which are not very sophisticated by modern machine learning standards. In the next step you will learn to use **Random Forests** to improve your models even more.

# 6. Random Forests

Using a more sophisticated machine learning algorithm.

## Introduction
Decision trees leave you with a difficult decision. 

> A deep tree with lots of leaves will overfit because each prediction is coming from historical data from only the few houses at its leaf. <br> 
<br>
But a shallow tree with few leaves will perform poorly because it fails to capture as many distinctions in the raw data.

Even today's most sophisticated modeling techniques face this tension between underfitting and overfitting. But many models have clever ideas that can lead to better performance. We'll look at the **random forest** as an example.

The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree and it works well with default parameters. If you keep modeling, you can learn more models with even better performance, but many of those are sensitive to getting the right parameters.
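The averaging can be checked directly in scikit-learn, where a fitted `RandomForestRegressor` exposes its component trees through the `estimators_` attribute. The sketch below uses synthetic data (an assumption made to keep it self-contained) and compares a hand-computed average with the forest's own prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used here only to keep the example self-contained.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = X[:, 0] * 50 + X[:, 1] ** 2 + rng.normal(0, 5, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X, y)

# Each fitted tree lives in `estimators_`; average their predictions by hand.
tree_preds = np.array([tree.predict(X[:5]) for tree in forest.estimators_])
manual_average = tree_preds.mean(axis=0)

print(manual_average)
print(forest.predict(X[:5]))

# True when the hand-computed average agrees with the forest's prediction.
matches = np.allclose(manual_average, forest.predict(X[:5]))
```

Individual trees can disagree considerably on a given house. Averaging smooths out those differences, which is part of why the forest tends to be less prone to overfitting than any single deep tree.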

## Example
You've already seen the code to load the data a few times. At the end of data-loading, we have the following variables:

- `train_X`
- `val_X`
- `train_y`
- `val_y`

In [51]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load data
melbourne_file_path = 'materials/house_prices/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)

# Filter rows with missing values
melbourne_data = melbourne_data.dropna(axis=0)

# Choose target and features
y = melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                        'YearBuilt', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]

# split data into training and validation data, for both features and target
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

We build a **random forest model** similarly to how we built a decision tree in `scikit-learn` - this time using the `RandomForestRegressor` class instead of `DecisionTreeRegressor`.

In [53]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)

print(mean_absolute_error(val_y, melb_preds))

191669.7536453626


## Conclusion
There is likely room for further improvement, but this is a big improvement over the best decision tree error of about `243,000`. There are parameters which allow you to change the performance of the Random Forest, much as we changed the maximum number of leaf nodes in the single decision tree. But **one of the best features of Random Forest models is that they generally work reasonably well even without this tuning**.
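For readers who want to try such tuning, two commonly adjusted parameters of `RandomForestRegressor` are `n_estimators` (the number of trees) and `max_leaf_nodes` (the leaf limit for each tree). The sketch below is run on synthetic data (an assumption for illustration), so the numbers say nothing about the housing data. It only shows the shape of the loop.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real feature table.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 4))
y = X[:, 0] * 30 + X[:, 1] * X[:, 2] + rng.normal(0, 10, size=300)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

forest_mae = {}
for n_estimators in [10, 100]:
    for max_leaf_nodes in [10, None]:  # None places no limit on the number of leaves
        model = RandomForestRegressor(n_estimators=n_estimators,
                                      max_leaf_nodes=max_leaf_nodes,
                                      random_state=1)
        model.fit(train_X, train_y)
        forest_mae[(n_estimators, max_leaf_nodes)] = mean_absolute_error(
            val_y, model.predict(val_X))

# Print the validation MAE for each parameter combination.
for params, mae in forest_mae.items():
    print(params, round(mae, 1))
```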

## Exercises

### Load Data

In [55]:
# Code you have previously used to load data
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Path of the file to read
iowa_file_path = 'materials/house_prices/train.csv'

home_data = pd.read_csv(iowa_file_path)

# Create target object and call it y
y = home_data.SalePrice
# Create X
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]

# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Specify Model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit Model
iowa_model.fit(train_X, train_y)

# Make validation predictions and calculate mean absolute error
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE when not specifying max_leaf_nodes: \t {:,.0f}".format(val_mae))

# Using best value for max_leaf_nodes
iowa_model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=1)
iowa_model.fit(train_X, train_y)
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE for best value of max_leaf_nodes: \t {:,.0f}".format(val_mae))

Validation MAE when not specifying max_leaf_nodes: 	 29,653
Validation MAE for best value of max_leaf_nodes: 	 27,283


<br>
<br>
Data science isn't always this easy. But replacing the decision tree with a Random Forest is going to be an easy win.

### Step 1: Use a Random Forest

In [56]:
from sklearn.ensemble import RandomForestRegressor

# Define the model. Set random_state to 1
rf_model = RandomForestRegressor(random_state=1)

# fit your model
rf_model.fit(train_X, train_y)

# Calculate the mean absolute error of your Random Forest model on the validation data
rf_val_preds = rf_model.predict(val_X)
rf_val_mae = mean_absolute_error(val_y, rf_val_preds)

print("Validation MAE for Random Forest Model: {:,.0f}".format(rf_val_mae))

Validation MAE for Random Forest Model: 21,857


So far, you have followed specific instructions at each step of your project. This helped you learn key ideas and build your first model.

# 7. Continue Your Progress

There are many ways to improve your model, and experimenting is a great way to learn at this point.

The best way to improve your model is to **add features**. To add more features to the data, revisit the first code cell, and change this line of code to include more column names:

```python
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
```
Some features will cause errors because of issues like missing values or non-numeric data types. Here is a complete list of potential columns that you might like to use, and that won't throw errors:

```
'MSSubClass'
'LotArea'
'OverallQual'
'OverallCond'
'YearBuilt'
'YearRemodAdd'
'1stFlrSF'
'2ndFlrSF'
'LowQualFinSF'
'GrLivArea'
'FullBath'
'HalfBath'
'BedroomAbvGr'
'KitchenAbvGr'
'TotRmsAbvGrd'
'Fireplaces'
'WoodDeckSF'
'OpenPorchSF'
'EnclosedPorch'
'3SsnPorch'
'ScreenPorch'
'PoolArea'
'MiscVal'
'MoSold'
'YrSold'
```
Look at the list of columns and think about what might affect home prices.
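A list like the one above can be derived with pandas by keeping only numeric columns that have no missing values. The sketch below uses a small toy frame in place of `home_data`; its values are illustrative only, and the names `toy_data` and `usable_features` are introduced here.

```python
import pandas as pd

# Toy frame standing in for `home_data`; the values are illustrative only.
toy_data = pd.DataFrame({
    'LotArea': [8450, 9600, 11250],
    'YearBuilt': [2003, 1976, 2001],
    'LotFrontage': [65.0, None, 68.0],   # has a missing value
    'Street': ['Pave', 'Pave', 'Grvl'],  # non-numeric
    'SalePrice': [208500, 181500, 223500],
})

# Keep numeric columns, then drop any column that contains a missing value.
numeric_complete = toy_data.select_dtypes(include='number').dropna(axis=1)

# The target must not be used as a feature.
usable_features = [col for col in numeric_complete.columns if col != 'SalePrice']
print(usable_features)  # ['LotArea', 'YearBuilt']
```

On real data this check is only a starting point: it would also keep any identifier-like numeric column, which you would want to exclude by hand.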

## 1. Decision Tree Regressor - complete

In [65]:
# Code to load data
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


# Path of the file to read
iowa_file_path = 'materials/house_prices/train.csv'

home_data = pd.read_csv(iowa_file_path)
# Create target object and call it y
y = home_data.SalePrice
# Create X
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]

# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Specify Model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit Model
iowa_model.fit(train_X, train_y)

# Make validation predictions and calculate mean absolute error
val_predictions = iowa_model.predict(val_X)
val_mae_simple = mean_absolute_error(val_y, val_predictions)
print("Validation MAE: {:,.0f}".format(val_mae_simple))


Validation MAE: 29,653


In [66]:
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae


# candidate_max_leaf_nodes = [1500]
candidate_max_leaf_nodes = [2*i-1 for i in range(2, 75)]

best_mae = None

# Write loop to find the ideal tree size from candidate_max_leaf_nodes
for leaf_node in candidate_max_leaf_nodes:
    my_mae = get_mae(leaf_node, train_X, val_X, train_y, val_y)
    print(f'Max leaf nodes: {leaf_node}\tMean Absolute Error: {my_mae}')
    if best_mae is None or my_mae<best_mae:
        best_mae = my_mae
        best_tree_size = leaf_node

# Store the best value of max_leaf_nodes 
print('\nBest tree size:', best_tree_size)

Max leaf nodes: 3	Mean Absolute Error: 39912.20512711714
Max leaf nodes: 5	Mean Absolute Error: 35044.51299744237
Max leaf nodes: 7	Mean Absolute Error: 34769.10089767185
Max leaf nodes: 9	Mean Absolute Error: 31863.851616036944
Max leaf nodes: 11	Mean Absolute Error: 30389.783612505194
Max leaf nodes: 13	Mean Absolute Error: 29124.908937039498
Max leaf nodes: 15	Mean Absolute Error: 28125.478430318668
Max leaf nodes: 17	Mean Absolute Error: 27807.663665995344
Max leaf nodes: 19	Mean Absolute Error: 28648.267042530915
Max leaf nodes: 21	Mean Absolute Error: 28750.331097785598
Max leaf nodes: 23	Mean Absolute Error: 28653.86284944501
Max leaf nodes: 25	Mean Absolute Error: 29016.41319191076
Max leaf nodes: 27	Mean Absolute Error: 28616.229360696358
Max leaf nodes: 29	Mean Absolute Error: 28704.92928766505
Max leaf nodes: 31	Mean Absolute Error: 28994.467469483232
Max leaf nodes: 33	Mean Absolute Error: 28355.08322598861
Max leaf nodes: 35	Mean Absolute Error: 28761.35218024895
Max leaf 

In [67]:
# FINAL MODEL

# Build the model with the best tree size found above
iowa_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size)

# fit the final model
iowa_model.fit(train_X, train_y)

val_predictions = iowa_model.predict(val_X)

val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE for best value of candidate_max_leaf_nodes: \t {:,.0f}".format(val_mae))
print("Validation MAE when not specifying candidate_max_leaf_nodes: \t {:,.0f}".format(val_mae_simple))


Validation MAE for best value of candidate_max_leaf_nodes: 	 26,704
Validation MAE when not specifying candidate_max_leaf_nodes: 	 29,653


In [69]:
print('Final model prediction:')
# print(iowa_model.predict(val_X.head()))
print(val_predictions[:5])

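# Caution: y.head() shows the first five rows of the full dataset, which are not
# the houses behind val_predictions[:5]; val_y.head() would give the matching prices.
print('\nReal Prices:')
print(y.head())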

Final model prediction:
[181225.35416667 130647.68518519 125404.5         94060.
 149639.97826087]

Real Prices:
0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64
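When comparing predictions with actual prices, keep in mind that the output of `model.predict(val_X)` lines up row for row with `val_y`, not with the first rows of `y`. One way to keep the pairing explicit is to put both in a DataFrame that shares the index of `val_y`. The sketch below uses made-up data in place of the Iowa file, so the numbers themselves are meaningless.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Made-up data standing in for the Iowa features and prices.
rng = np.random.default_rng(0)
X = pd.DataFrame({'LotArea': rng.integers(5000, 15000, size=40),
                  'YearBuilt': rng.integers(1950, 2010, size=40)})
y = pd.Series(100_000 + 10 * X['LotArea'] + rng.normal(0, 5000, size=40),
              name='SalePrice')

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
model = DecisionTreeRegressor(max_leaf_nodes=5, random_state=1).fit(train_X, train_y)

# Index the predictions by val_y's index so each row pairs a house with its own price.
comparison = pd.DataFrame({'actual': val_y,
                           'predicted': model.predict(val_X)},
                          index=val_y.index)
print(comparison.head())
```

The index column of `comparison` shows which rows of the original data ended up in the validation set.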


## 2. Random Forests

In [71]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
iowa_preds = forest_model.predict(val_X)

print("Validation MAE when not specifying max_leaf_nodes: \t {:,.0f}".format(val_mae_simple))
print('Best Decision Tree Regressor\'s MAE: \t\t\t {:,.0f}'.format(val_mae))
print('Random Forest Regressor\'s MAE: \t\t\t\t {:,.0f}'.format(mean_absolute_error(val_y, iowa_preds)))

Validation MAE when not specifying max_leaf_nodes: 	 29,653
Best Decision Tree Regressor's MAE: 			 26,704
Random Forest Regressor's MAE: 				 21,857


## 3. Random Forest Training by selecting the features for a project

In [75]:
def house_pricing(features: list) -> None:
    '''
    Train a random forest on the given features and print its validation MAE.
    '''
    # Code to load data
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    # Path of the file to read
    iowa_file_path = 'materials/house_prices/train.csv'

    home_data = pd.read_csv(iowa_file_path)

    # Create target object and call it y
    y = home_data.SalePrice
    # Create X
    X = home_data[features]

    # Split into validation and training data
    train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

    forest_model = RandomForestRegressor(random_state=1)
    forest_model.fit(train_X, train_y)
    iowa_preds = forest_model.predict(val_X)

    # Calculate the mean absolute error of the validation predictions
    val_mae = mean_absolute_error(val_y, iowa_preds)

    print('Random Forest Regressor\'s MAE: \t {:,.0f}'.format(val_mae))


In [78]:
a = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 
     'FullBath', 'MSSubClass', 'OverallQual', 'OverallCond', 
     'YearRemodAdd', 'GrLivArea', 'Fireplaces', 'WoodDeckSF']
b = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 
     'FullBath', 'MSSubClass', 'OverallQual', 'OverallCond', 
     'YearRemodAdd', 'GrLivArea', 'Fireplaces', 'WoodDeckSF', 
     'OpenPorchSF']
c = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 
     'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd', 'MSSubClass', 
     'OverallQual', 'OverallCond', 'YearRemodAdd', 'GrLivArea', 
     'HalfBath', 'KitchenAbvGr', 'TotRmsAbvGrd', 'OpenPorchSF', 
     'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 
     'MiscVal', 'MoSold', 'YrSold']

house_pricing(a)
house_pricing(b)
house_pricing(c)

Random Forest Regressor's MAE: 	 17,077
Random Forest Regressor's MAE: 	 17,278
Random Forest Regressor's MAE: 	 18,067


# What's next?

As mentioned above, some of the features will throw an error if you try to use them to train your model. The [Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning) course will teach you how to handle these types of features. You will also learn to use **XGBoost**, a gradient boosting library that often gives even better accuracy than Random Forest.

The [Pandas](https://www.kaggle.com/Learn/Pandas) course will give you the data manipulation skills to quickly go from conceptual idea to implementation in your data science projects.

You are also ready for the [Deep Learning](https://www.kaggle.com/Learn/intro-to-Deep-Learning) course, where you will build models with better-than-human level performance at computer vision tasks.