# Introduction
We'll start with an overview of how machine learning models work and how they are used. This may feel basic if you've done statistical modeling or machine learning before. Don't worry, we will progress to building powerful models soon.

The course will have you build models for the following scenario:

Your cousin has made millions of dollars speculating on real estate. He's offered to become business partners with you because of your interest in data science. He'll supply the money, and you'll supply models that predict how much various houses are worth.

You ask your cousin how he's predicted real estate values in the past, and he says it is just intuition. But more questioning reveals that he's identified price patterns from houses he has seen in the past, and he uses those patterns to make predictions for new houses he is considering.

Machine learning works the same way. We'll start with a model called the Decision Tree. There are fancier models that give more accurate predictions. But decision trees are easy to understand, and they are the basic building block for some of the best models in data science.

For simplicity, we'll start with the simplest possible decision tree.

![](http://i.imgur.com/7tsb5b1.png)

It divides houses into only two categories. You predict the price of a new house by finding out which category it's in, and the prediction is the historical average price from that category.

This captures the relationship between house size and price. We use data to decide how to break the houses into two groups, and then again to determine the predicted price in each group. This step of capturing patterns from data is called **fitting** or **training** the model. The data used to **fit** the model is called the **training data**.

The details of how the model is fit (e.g. how to split up the data) are complex enough that we will save them for later. After the model has been fit, you can apply it to new data to **predict** prices of additional homes.
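
To make "fitting" a little more concrete, here is a minimal sketch of fitting a one-split tree by hand. The toy data and the function names are invented for illustration, and choosing the split that minimizes squared error is an assumption about how such a tree might be fit, not a description of what any particular library does internally.

```python
# Toy data, invented for illustration: (number of bedrooms, sale price).
houses = [(1, 120_000), (2, 150_000), (2, 180_000),
          (3, 210_000), (3, 230_000), (4, 250_000)]

def mean(values):
    return sum(values) / len(values)

def fit_one_split(data):
    """Return (threshold, left_mean, right_mean) with the lowest squared error.

    Houses with bedrooms <= threshold fall in the left group, the rest in the right.
    """
    best = None
    # Drop the largest value so that both groups are always non-empty.
    thresholds = sorted({x for x, _ in data})[:-1]
    for t in thresholds:
        left = [price for x, price in data if x <= t]
        right = [price for x, price in data if x > t]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((p - left_mean) ** 2 for p in left)
               + sum((p - right_mean) ** 2 for p in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    return best[1:]

def predict(model, bedrooms):
    """Predict the historical average price of the group the house falls in."""
    threshold, left_mean, right_mean = model
    return left_mean if bedrooms <= threshold else right_mean

model = fit_one_split(houses)
print(model)              # (2, 150000.0, 230000.0)
print(predict(model, 5))  # 230000.0
```

The fitted "model" is nothing more than the split point and the two group averages, which is exactly what the simple tree in the picture stores.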

# Example
Assuming your decision tree works in a sensible way, which of the two trees shown here do you think you might get from fitting this especially simple decision tree?

![First Decision Trees](http://i.imgur.com/prAjgku.png)

# Improving the Decision Tree
The decision tree on the left (Decision Tree 1) probably makes more sense, because it captures the reality that houses with more bedrooms tend to sell at higher prices than houses with fewer bedrooms. The biggest shortcoming of this model is that it doesn't capture most factors affecting home price, like number of bathrooms, lot size, location, etc.

You can capture more factors using a tree that has more "splits." These are called "deeper" trees. A decision tree that also considers the total size of each house's lot might look like this:

![Depth 2 Tree](http://i.imgur.com/R3ywQsR.png)

You predict the price of any house by tracing through the decision tree, always picking the path corresponding to that house's characteristics. The predicted price for the house is at the bottom of the tree. The point at the bottom where we make a prediction is called a leaf.
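
Tracing a house through a tree can be sketched in a few lines. The tree below is hypothetical: its thresholds, feature names, and leaf prices are invented for illustration and are not read from the figure or fitted to any data.

```python
# A hypothetical depth-2 tree stored as nested dictionaries.
# Every threshold and price here is made up for illustration.
tree = {
    "feature": "bedrooms", "threshold": 2,
    "left": {    # bedrooms <= 2
        "feature": "lot_area", "threshold": 8500,
        "left": {"prediction": 146_000},
        "right": {"prediction": 188_000},
    },
    "right": {   # bedrooms > 2
        "feature": "lot_area", "threshold": 11500,
        "left": {"prediction": 170_000},
        "right": {"prediction": 233_000},
    },
}

def predict(node, house):
    """Follow the branch matching the house at each split until a leaf is reached."""
    while "prediction" not in node:
        go_left = house[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["prediction"]

house = {"bedrooms": 3, "lot_area": 12_000}
print(predict(tree, house))  # 233000
```

This house has more than 2 bedrooms, so it goes right at the first split, and its lot is larger than 11,500, so it goes right again and lands on the 233,000 leaf.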

The splits and values at the leaves will be determined by the data, so it's time for you to check out the data you will be working with.

In [1]:
import pandas as pd

main_file_path = 'data/train.csv'  # path to the Iowa data from the Kaggle website

In [2]:
iowa_data = pd.read_csv(main_file_path)
iowa_data.describe()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


# Selecting and Filtering Data
For datasets with too many variables to understand easily (or to print out), we can narrow down the variables using:
- intuition
- statistical methods

In [3]:
iowa_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

# Choosing the Prediction Target

In [4]:
y = iowa_data.SalePrice

# Choosing Predictors

In [11]:
predictors = ['LotArea', 'YearBuilt', '1stFlrSF', 
              '2ndFlrSF', 'FullBath', 'BedroomAbvGr',
             'TotRmsAbvGrd']
X = iowa_data[predictors]

In [12]:
from sklearn.tree import DecisionTreeRegressor

In [13]:
# Define model
tree_model = DecisionTreeRegressor(random_state=42)

# Fit model
tree_model.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=42, splitter='best')

In [14]:
print('Making predictions for the following 5 houses:')
print(X.head())
print("The predictions are")
print(tree_model.predict(X.head()))

Making predictions for the following 5 houses:
   LotArea  YearBuilt  1stFlrSF  2ndFlrSF  FullBath  BedroomAbvGr  \
0     8450       2003       856       854         2             3   
1     9600       1976      1262         0         2             3   
2    11250       2001       920       866         2             3   
3     9550       1915       961       756         1             3   
4    14260       2000      1145      1053         2             4   

   TotRmsAbvGrd  
0             8  
1             6  
2             6  
3             7  
4             9  
The predictions are
[ 208500.  181500.  223500.  140000.  250000.]


# Model Validation
How good is the model we've just built?

- Generally, the relevant measure of model quality is predictive accuracy. 
    - Compare the predictions on your training data to the actual target values of the training data.
- **MAE** (Mean Absolute Error)
    - error = actual - predicted
    - take the absolute value
    - compute the mean (we average the absolute values to prevent positive and negative errors from canceling each other out in the calculation).
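
The three steps above can be sketched by hand. The prices below are invented for illustration, and `mean_absolute_error_by_hand` is a hypothetical helper, not part of scikit-learn.

```python
def mean_absolute_error_by_hand(actual, predicted):
    # Step 1: error = actual - predicted
    errors = [a - p for a, p in zip(actual, predicted)]
    # Step 2: take the absolute value of each error
    absolute_errors = [abs(e) for e in errors]
    # Step 3: compute the mean
    return sum(absolute_errors) / len(absolute_errors)

# Made-up prices: one prediction is 10,000 too high, one 10,000 too low, one exact.
actual = [200_000, 150_000, 300_000]
predicted = [210_000, 140_000, 300_000]

raw_mean_error = sum(a - p for a, p in zip(actual, predicted)) / len(actual)
print(raw_mean_error)                                  # 0.0
print(mean_absolute_error_by_hand(actual, predicted))  # about 6666.67
```

The plain average of the errors is 0, which would wrongly suggest a perfect model. Taking absolute values first gives an MAE of roughly 6,667.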

In [15]:
from sklearn.metrics import mean_absolute_error

In [16]:
predicted_home_prices = tree_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

62.354337899543388

# The problem with "In-Sample" Scores
The measure we just computed can be called an "in-sample" score. 
- We used a single set of houses (one data sample) both to build the model and to calculate its MAE score.
    - **This is bad**
    - The model may interpret idiosyncratic coincidences in the sample data as generally valid predictive patterns.
       - Imagine that, in the large real estate market, door color is unrelated to home price. However, in the sample of data you used to build the model, it may be that all homes with green doors were very expensive. The model's job is to find patterns that predict home prices, so it will see this pattern, and it will always predict high prices for homes with green doors. That pattern will not hold for new houses, yet an in-sample score cannot reveal the problem.

## Solution:
Score the predictions on data not included in the training/fitting.
- exclude a subset of the data from the model-building process
- test the model's accuracy on the "holdout data."
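
Here is a minimal sketch of why the holdout score matters, using the green-door idea. The data is invented for illustration, and the "model" is deliberately naive: it only memorizes the price it saw for each door color.

```python
# Toy data, invented for illustration: (door color, sale price).
training = [("green", 400_000), ("red", 150_000),
            ("blue", 160_000), ("white", 155_000)]
holdout = [("green", 150_000), ("red", 160_000),
           ("blue", 155_000), ("white", 165_000)]

# "Fit" a model that memorizes the price seen for each door color.
memorized = {color: price for color, price in training}

def mae(data):
    """Mean absolute error of the memorized prices on the given houses."""
    return sum(abs(price - memorized[color]) for color, price in data) / len(data)

print(mae(training))  # 0.0
print(mae(holdout))   # 68750.0
```

The in-sample score looks perfect, while the holdout score exposes the coincidence about green doors.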

In [17]:
from sklearn.model_selection import train_test_split

In [22]:
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 42)
# The split is drawn by a random number generator seeded with random_state=42,
# so the same rows are held out each time (by default, 25% go to validation)

# Retrain the model on the training data
tree_model.fit(train_X, train_y)

val_predictions = tree_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

30160.7424658


# Experimenting With Different Models
Now we can experiment with alternative models and see which give the best predictions. 

## Avoiding overfitting
In practice, it's not uncommon for a decision tree to have 10 splits between the top level (all houses) and a leaf.
- As the tree gets deeper, the dataset gets sliced up into leaves with fewer houses.
    - a tree with n levels of splits can have up to 2^n leaves (or categories)
    - Leaves with few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses)
    - this is an example of **overfitting**

![Mean Absolute Error](http://i.imgur.com/2q85n9s.png)
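
A quick piece of arithmetic shows how fast the leaves thin out. This sketch assumes the tree splits every group at every level, which is the worst case; `houses_per_leaf` is a hypothetical helper written for this illustration.

```python
def houses_per_leaf(n_houses, depth):
    """Average houses per leaf if every group is split at every level."""
    return n_houses / 2 ** depth

n_houses = 1460  # number of rows in the Iowa training file
for depth in (1, 2, 5, 10):
    print(depth, 2 ** depth, round(houses_per_leaf(n_houses, depth), 1))
```

At depth 10 there can be 1,024 leaves, which leaves fewer than 2 houses per leaf on average.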

## Modulating parameters
#### Modulating Decision Tree parameters
To control depth:
- max_leaf_nodes - a sensible way to control the trade-off between overfitting and underfitting.
    - the more leaves we allow, the further the model moves from underfitting toward overfitting
    
## Comparing MAE scores for different max_leaf_nodes values

In [23]:
# define a function to compute and return the MAE for given max_leaf_nodes on a decision tree regressor
def get_mae(max_leaf_nodes, predictors_train, predictors_val,
           targ_train, targ_val):
    model = DecisionTreeRegressor(max_leaf_nodes = max_leaf_nodes,
                                 random_state = 42)
    model.fit(predictors_train, targ_train)
    preds_val = model.predict(predictors_val)
    mae = mean_absolute_error(targ_val, preds_val)
    
    return mae

In [24]:
# loop over different max_leaf_nodes values
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, 
                    val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5  		 Mean Absolute Error:  35244
Max leaf nodes: 50  		 Mean Absolute Error:  27232
Max leaf nodes: 500  		 Mean Absolute Error:  31450
Max leaf nodes: 5000  		 Mean Absolute Error:  31724
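
The loop above only prints the scores. As a small sketch, if we keep them in a dictionary we can pick the best tree size programmatically. The numbers below are copied from the output above; the variable names are just for illustration.

```python
# Validation MAE for each max_leaf_nodes value, copied from the printed output.
scores = {5: 35244, 50: 27232, 500: 31450, 5000: 31724}

# The best tree size is the key with the smallest validation MAE.
best_tree_size = min(scores, key=scores.get)
print(best_tree_size)          # 50
print(scores[best_tree_size])  # 27232
```

With 5 leaves the tree underfits, with 500 or more it overfits, and 50 is the best of the four options tried.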


# Conclusion
Here's the takeaway: Models can suffer from either:

- **Overfitting**: capturing spurious patterns that won't recur in the future, leading to less accurate predictions, or
- **Underfitting**: failing to capture relevant patterns, again leading to less accurate predictions.

We use **validation** data, which isn't used in model training, to measure a candidate model's accuracy. This lets us try many candidate models and keep the best one.

But we're still using Decision Tree models, which are not very sophisticated by modern machine learning standards.

- Decision trees leave you with a difficult decision. A deep tree with lots of leaves will overfit because each prediction is coming from historical data from only the few houses at its leaf. But a shallow tree with few leaves will perform poorly because it fails to capture as many distinctions in the raw data.

- Even today's most sophisticated modeling techniques face this tension between underfitting and overfitting. But many models have clever ideas that can lead to better performance.

## Random Forest:

The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree.
- It generally has much better predictive accuracy than a single decision tree.
- It works well with default parameters.
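
The averaging step can be sketched without any library. The three "trees" below are hypothetical hand-written rules standing in for fitted trees, and all thresholds and prices are invented for illustration. In a real random forest, each tree is typically fitted on a different random sample of the training data.

```python
# Three hypothetical "trees", each a simple rule with made-up numbers.
def tree_a(house):
    return 200_000 if house["bedrooms"] > 2 else 140_000

def tree_b(house):
    return 220_000 if house["lot_area"] > 9000 else 150_000

def tree_c(house):
    return 190_000 if house["year_built"] > 1990 else 160_000

def forest_predict(trees, house):
    """Average the predictions of every component tree."""
    predictions = [tree(house) for tree in trees]
    return sum(predictions) / len(predictions)

house = {"bedrooms": 3, "lot_area": 8450, "year_built": 2003}
print(forest_predict([tree_a, tree_b, tree_c], house))  # 180000.0
```

The individual trees say 200,000, 150,000 and 190,000, so a quirk in any single tree is softened by the other two.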

In [25]:
from sklearn.ensemble import RandomForestRegressor

In [27]:
forest_model = RandomForestRegressor()
forest_model.fit(train_X, train_y)
forest_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, forest_preds))

22287.1210046


Notice that this MAE of 22,287 improves on our previous best of 27,232, which came from the decision tree with 50 leaf nodes.

### Random Forest Advantages:
- can be further tuned
- generally works reasonably well even without tuning

## Submitting predictions in Kaggle competitions

In [28]:
import numpy as np

In [30]:
# Read in the test data
test = pd.read_csv('data/test.csv')
# Treat the test data in the same way as the training data, i.e. pull the same columns
test_X = test[predictors]
# Use the model to make predictions
predicted_prices = forest_model.predict(test_X)
# We will look at the predicted prices to ensure we have something sensible. 
print(predicted_prices)

[ 135065.  155980.  185750. ...,  161760.  141350.  223040.]


# Prepare Submission File
We make submissions in CSV files. Your submissions usually have two columns: an ID column and a prediction column. The ID field comes from the test data (keeping whatever name the ID field had in that data, which for the housing data is the string 'Id'). The prediction column will use the name of the target field.

We will create a DataFrame with this data, and then use the dataframe's to_csv method to write our submission file. Explicitly include the argument index=False to prevent pandas from adding another column in our csv file.

In [31]:
my_submission = pd.DataFrame({'Id': test.Id, 'SalePrice': predicted_prices})

my_submission.to_csv('kaggle_ml_course_submission.csv', index = False)