# Introduction
We'll start with an overview of how machine learning models work and how they are used. This may feel basic if you've done statistical modeling or machine learning before. Don't worry, we will progress to building powerful models soon.

The course will have you build models for the following scenario:

Your cousin has made millions of dollars speculating on real estate. He's offered to become business partners with you because of your interest in data science. He'll supply the money, and you'll supply models that predict how much various houses are worth.

You ask your cousin how he's predicted real estate values in the past, and he says it is just intuition. But more questioning reveals that he's identified price patterns from houses he has seen in the past, and he uses those patterns to make predictions for new houses he is considering.

Machine learning works the same way. We'll start with a model called the Decision Tree. There are fancier models that give more accurate predictions. But decision trees are easy to understand, and they are the basic building block for some of the best models in data science.

We'll start with the simplest possible decision tree.

![](http://i.imgur.com/7tsb5b1.png)

It divides houses into only two categories. You predict the price of a new house by finding out which category it's in, and the prediction is the historical average price from that category.

This captures the relationship between house size and price. We use data to decide how to break the houses into two groups, and then again to determine the predicted price in each group. This step of capturing patterns from data is called **fitting** or **training** the model. The data used to **fit** the model is called the **training data**.

The details of how the model is fit (e.g. how to split up the data) are complex enough that we will save them for later. After the model has been fit, you can apply it to new data to **predict** prices of additional homes.
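
To make the fit/predict vocabulary concrete, here is a minimal sketch of the two-group tree in plain Python. The house records and the split at "more than 2 bedrooms" are invented for illustration; a real fitting procedure would choose the split from the data.

```python
# Hypothetical training data: (bedrooms, sale price). Values are invented.
training_data = [(1, 150_000), (2, 170_000), (2, 190_000),
                 (3, 230_000), (4, 250_000), (3, 210_000)]

def fit_two_group_tree(rows, threshold):
    """'Fit' the tree: average the historical prices on each side of the split."""
    low = [price for beds, price in rows if beds <= threshold]
    high = [price for beds, price in rows if beds > threshold]
    return {"threshold": threshold,
            "low_mean": sum(low) / len(low),
            "high_mean": sum(high) / len(high)}

def predict(tree, bedrooms):
    """Predict by finding the group and returning its historical average."""
    if bedrooms <= tree["threshold"]:
        return tree["low_mean"]
    return tree["high_mean"]

tree = fit_two_group_tree(training_data, threshold=2)
print(predict(tree, 2))  # average of the houses with <= 2 bedrooms
print(predict(tree, 5))  # average of the houses with > 2 bedrooms
```

Fitting happens once, on the training data; prediction only looks up which group a new house belongs to.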

# Example
Assuming your decision tree works in a sensible way, which of the two trees shown here do you think you might get from fitting this especially simple decision tree?

![First Decision Trees](http://i.imgur.com/prAjgku.png)

# Improving the Decision Tree
The decision tree on the left (Decision Tree 1) probably makes more sense, because it captures the reality that houses with more bedrooms tend to sell at higher prices than houses with fewer bedrooms. The biggest shortcoming of this model is that it doesn't capture most factors affecting home price, like number of bathrooms, lot size, location, etc.

You can capture more factors using a tree that has more "splits." These are called "deeper" trees. A decision tree that also considers the total size of each house's lot might look like this:

![Depth 2 Tree](http://i.imgur.com/R3ywQsR.png)

You predict the price of any house by tracing through the decision tree, always picking the path corresponding to that house's characteristics. The predicted price for the house is at the bottom of the tree. The point at the bottom where we make a prediction is called a leaf.
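
Tracing a house through the tree can be sketched as a walk over nested dictionaries. The feature names, thresholds, and leaf prices below are hypothetical and do not come from the pictured tree.

```python
# Hypothetical depth-2 tree: internal nodes test a feature, leaves hold a price.
# Feature names, thresholds, and prices are invented for illustration.
tree = {
    "feature": "bedrooms", "threshold": 2,
    "left": {   # bedrooms <= 2
        "feature": "lot_size", "threshold": 8_500,
        "left": {"leaf": 146_000}, "right": {"leaf": 188_000},
    },
    "right": {  # bedrooms > 2
        "feature": "lot_size", "threshold": 11_500,
        "left": {"leaf": 170_000}, "right": {"leaf": 233_000},
    },
}

def predict(node, house):
    """Walk from the root to a leaf, following the branch that matches the house."""
    while "leaf" not in node:
        branch = "left" if house[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]

print(predict(tree, {"bedrooms": 3, "lot_size": 12_000}))
```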

The splits and values at the leaves will be determined by the data, so it's time for you to check out the data you will be working with.

In [1]:
import pandas as pd

main_file_path = 'data/train.csv' #path to the Iowa data from the kaggle website

In [2]:
iowa_data = pd.read_csv(main_file_path)
iowa_data.describe()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


# Selecting and Filtering Data
For datasets with too many variables to easily understand (or easily print out), we can filter by:
- intuition
- statistical methods

In [3]:
iowa_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

# Choosing the Prediction Target

In [4]:
y = iowa_data.SalePrice

# Choosing Predictors

In [11]:
predictors = ['LotArea', 'YearBuilt', '1stFlrSF', 
              '2ndFlrSF', 'FullBath', 'BedroomAbvGr',
             'TotRmsAbvGrd']
X = iowa_data[predictors]

In [12]:
from sklearn.tree import DecisionTreeRegressor

In [13]:
# Define model
tree_model = DecisionTreeRegressor(random_state=42)

# Fit model
tree_model.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=42, splitter='best')

In [14]:
print('Making predictions for the following 5 houses:')
print(X.head())
print("The predictions are")
print(tree_model.predict(X.head()))

Making predictions for the following 5 houses:
   LotArea  YearBuilt  1stFlrSF  2ndFlrSF  FullBath  BedroomAbvGr  \
0     8450       2003       856       854         2             3   
1     9600       1976      1262         0         2             3   
2    11250       2001       920       866         2             3   
3     9550       1915       961       756         1             3   
4    14260       2000      1145      1053         2             4   

   TotRmsAbvGrd  
0             8  
1             6  
2             6  
3             7  
4             9  
The predictions are
[ 208500.  181500.  223500.  140000.  250000.]


# Model Validation
How good is the model we've just built?

- Generally, the relevant measure of model quality is predictive accuracy. 
    - Compare the predictions on your training data to the actual target values of the training data.
- **MAE** (Mean Absolute Error)
    - error = actual - predicted
    - take the absolute value
    - compute the mean (we average the absolute values so that positive and negative errors don't cancel each other out in the calculation).
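
The three MAE steps can be written out by hand. The prices below are invented; the point is only to show why the absolute value matters.

```python
def mean_absolute_error_manual(actual, predicted):
    """Average of |actual - predicted| over all rows."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

# Hypothetical actual and predicted prices.
actual = [200_000, 150_000, 300_000]
predicted = [210_000, 140_000, 300_000]

# Signed errors are -10000, +10000, 0: their plain mean is 0, hiding the misses.
signed_mean = sum(a - p for a, p in zip(actual, predicted)) / len(actual)
mae = mean_absolute_error_manual(actual, predicted)
print(signed_mean, mae)
```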

In [15]:
from sklearn.metrics import mean_absolute_error

In [16]:
predicted_home_prices = tree_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

62.354337899543388

# The problem with "In-Sample" Scores
The measure we just computed can be called an "in-sample" score. 
- We used a single set of houses (data sample) both for building the model and for calculating its MAE score.
    - **This is bad**
    - the model may interpret idiosyncratic coincidences in the sample data as generally valid predictive variables
       - Imagine that, in the larger real estate market, door color is unrelated to home price. However, in the sample of data you used to build the model, it may be that all homes with green doors were very expensive. The model's job is to find patterns that predict home prices, so it will see this pattern, and it will always predict high prices for homes with green doors.

## Solution:
Score the predictions on data not included in the training/fitting.
- exclude a subset of the data from the model-building process
- test the model's accuracy on the "holdout data."
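
A toy version of the green-door problem shows why the holdout score is the honest one. The "model" here simply memorizes its training rows, which is a deliberately extreme stand-in for an overfit tree; all data is invented.

```python
# Hypothetical data: (door_color, bedrooms) -> price. Values are invented.
train = [(("green", 3), 400_000), (("red", 3), 200_000), (("blue", 2), 150_000)]
holdout = [(("green", 3), 210_000), (("red", 2), 160_000)]

# A deliberately overfit "model": memorize each training row exactly,
# and fall back to the overall training mean for unseen rows.
lookup = dict(train)
fallback = sum(price for _, price in train) / len(train)

def predict(features):
    return lookup.get(features, fallback)

def mae(rows):
    return sum(abs(price - predict(features)) for features, price in rows) / len(rows)

print(mae(train))    # in-sample: the model has seen every answer
print(mae(holdout))  # holdout: the memorized green-door price misleads it
```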

In [17]:
from sklearn.model_selection import train_test_split

In [22]:
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 42)
#the split is generated on a random generator seeded with the random state 42

# Retrain the model on the training data
tree_model.fit(train_X, train_y)

val_predictions = tree_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

30160.7424658


# Experimenting With Different Models
Now we can experiment with alternative models and see which give the best predictions. 

## Avoiding overfitting
In practice it's not uncommon for a decision tree to have 10 levels of splits between the top of the tree and a leaf.
- As the tree gets deeper the dataset gets sliced up into leaves with fewer houses. 
    - for n levels we end up with 2^n leaves (or categories)
    - Leaves with few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses)
    - this is an example of **overfitting**

![Mean Absolute Error](http://i.imgur.com/2q85n9s.png)
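
The 2^n growth is quick to check with arithmetic. This sketch assumes every split divides the data evenly, which a real tree will not do exactly; the 1460 rows are the count reported by describe() earlier.

```python
n_houses = 1460  # rows in the training data

for depth in [1, 2, 5, 10]:
    leaves = 2 ** depth
    # Average houses per leaf if every split divided the data evenly.
    per_leaf = n_houses / leaves
    print(f"depth {depth:2d}: {leaves:5d} leaves, about {per_leaf:.1f} houses per leaf")
```

At depth 10 there are fewer than two houses per leaf on average, which is why such a tree reproduces its training prices almost exactly.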

## Modulating parameters
#### Modulating Decision Tree parameters
To control depth:
- max_leaf_nodes - a sensible way to control overfitting vs underfitting.
    - more leaves lead to more overfitting; fewer leaves lead to more underfitting
    
## comparing MAE scores for different max_leaf_nodes values

In [23]:
# define a function to compute and return the MAE for given max_leaf_nodes on a decision tree regressor
def get_mae(max_leaf_nodes, predictors_train, predictors_val,
           targ_train, targ_val):
    model = DecisionTreeRegressor(max_leaf_nodes = max_leaf_nodes,
                                 random_state = 42)
    model.fit(predictors_train, targ_train)
    preds_val = model.predict(predictors_val)
    mae = mean_absolute_error(targ_val, preds_val)
    
    return(mae)

In [24]:
# loop over different max_leaf_nodes values
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, 
                    val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5  		 Mean Absolute Error:  35244
Max leaf nodes: 50  		 Mean Absolute Error:  27232
Max leaf nodes: 500  		 Mean Absolute Error:  31450
Max leaf nodes: 5000  		 Mean Absolute Error:  31724
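
Picking the winner from these scores is a one-liner. The dictionary below just copies the printed validation MAE values; nothing is refit.

```python
# Validation MAE for each max_leaf_nodes value, copied from the output above.
mae_by_leaf_nodes = {5: 35244, 50: 27232, 500: 31450, 5000: 31724}

# Keep the setting with the lowest validation error.
best_size = min(mae_by_leaf_nodes, key=mae_by_leaf_nodes.get)
print(best_size, mae_by_leaf_nodes[best_size])
```

With 5 leaves the tree is too coarse (underfitting), and with 500 or more the error rises again (overfitting), so 50 is the best of the four candidates.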


# Conclusion
Here's the takeaway: Models can suffer from either:

- **Overfitting**: capturing spurious patterns that won't recur in the future, leading to less accurate predictions, or
- **Underfitting**: failing to capture relevant patterns, again leading to less accurate predictions.

We use **validation** data, which isn't used in model training, to measure a candidate model's accuracy. This lets us try many candidate models and keep the best one.

But we're still using Decision Tree models, which are not very sophisticated by modern machine learning standards.

- Decision trees leave you with a difficult decision. A deep tree with lots of leaves will overfit because each prediction is coming from historical data from only the few houses at its leaf. But a shallow tree with few leaves will perform poorly because it fails to capture as many distinctions in the raw data.

- Even today's most sophisticated modeling techniques face this tension between underfitting and overfitting. But, many models have clever ideas that can lead to better performance. 

## Random Forest:

The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree.
- It generally has a much better predictive accuracy than a single decision tree 
- and it works well with default parameters. 
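
The averaging can be checked directly on a fitted forest. This is a small sketch on synthetic data (invented here, not the Iowa data); it relies on the fitted trees being exposed through the `estimators_` attribute of scikit-learn's RandomForestRegressor.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration): price grows with size plus noise.
rng = np.random.RandomState(0)
X = rng.uniform(500, 3000, size=(200, 1))
y = 100 * X[:, 0] + rng.normal(0, 20_000, size=200)

forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Average the predictions of the individual trees by hand.
per_tree = np.array([tree.predict(X[:5]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The hand-made average should match the forest's own prediction.
print(np.allclose(manual_average, forest.predict(X[:5])))
```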

In [25]:
from sklearn.ensemble import RandomForestRegressor

In [27]:
forest_model = RandomForestRegressor()
forest_model.fit(train_X, train_y)
forest_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, forest_preds))

22287.1210046


Notice that this MAE of 22,287 is an improvement on our previous best of 27,232 (the decision tree with max_leaf_nodes = 50).

### Random Forest Advantages:
- can be further tuned
- generally works reasonably well even without tuning

## Submitting predictions in Kaggle competitions

In [28]:
import numpy as np

In [30]:
# Read in the test data
test = pd.read_csv('data/test.csv')
# Treat the test data in the same way as the training data, i.e. pull the same columns
test_X = test[predictors]
# Use the model to make predictions
predicted_prices = forest_model.predict(test_X)
# We will look at the predicted prices to ensure we have something sensible. 
print(predicted_prices)

[ 135065.  155980.  185750. ...,  161760.  141350.  223040.]


# Prepare Submission File
We make submissions in CSV files. Your submissions usually have two columns: an ID column and a prediction column. The ID field comes from the test data (keeping whatever name the ID field had in that data, which for the housing data is the string 'Id'). The prediction column will use the name of the target field.

We will create a DataFrame with this data, and then use the dataframe's to_csv method to write our submission file. Explicitly include the argument index=False to prevent pandas from adding another column in our csv file.
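
As a rough sketch of what the file looks like, the example below writes a three-row submission to an in-memory buffer instead of a file. The Ids are made up, and the prices are borrowed from the predictions printed above.

```python
import io
import pandas as pd

# Hypothetical Ids; prices stand in for predicted_prices.
submission = pd.DataFrame({"Id": [1461, 1462, 1463],
                           "SalePrice": [135065.0, 155980.0, 185750.0]})

buffer = io.StringIO()  # stands in for a file path
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Without index=False, pandas would write the row index as an extra unnamed first column, and the file would no longer have exactly the two expected columns.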

In [31]:
#my_submission = pd.DataFrame({'Id': test.Id, 'SalePrice': predicted_prices})

#my_submission.to_csv('kaggle_ml_course_submission.csv', index = False)

# Handling Missing Values
There are many ways data can end up with missing values. For example:

- A 2-bedroom house wouldn't include an answer for "How large is the third bedroom?"
- Someone being surveyed may choose not to share their income

Most libraries (including scikit-learn) will give you an error if you try to build a model using data with missing values. So you'll need to choose one of the strategies below.

In [33]:
iowa_data.isnull().sum()

Id                  0
MSSubClass          0
MSZoning            0
LotFrontage       259
LotArea             0
Street              0
Alley            1369
LotShape            0
LandContour         0
Utilities           0
LotConfig           0
LandSlope           0
Neighborhood        0
Condition1          0
Condition2          0
BldgType            0
HouseStyle          0
OverallQual         0
OverallCond         0
YearBuilt           0
YearRemodAdd        0
RoofStyle           0
RoofMatl            0
Exterior1st         0
Exterior2nd         0
MasVnrType          8
MasVnrArea          8
ExterQual           0
ExterCond           0
Foundation          0
                 ... 
BedroomAbvGr        0
KitchenAbvGr        0
KitchenQual         0
TotRmsAbvGrd        0
Functional          0
Fireplaces          0
FireplaceQu       690
GarageType         81
GarageYrBlt        81
GarageFinish       81
GarageCars          0
GarageArea          0
GarageQual         81
GarageCond         81
PavedDrive

## Solutions
# 1) Drop Columns with Missing Values

In [34]:
iowa_data_drp_cols_with_nas = iowa_data.dropna(axis = 1)

In [35]:
#list the columns with missing values so the same columns can be dropped from the test data
cols_with_missing = [col for col in iowa_data.columns if iowa_data[col].isnull().any()]


In [36]:
reduced_original_data = iowa_data.drop(cols_with_missing,
                                          axis = 1)
reduced_test_data = test.drop(cols_with_missing, axis = 1)

- But the model loses access to the information in a column when that column is dropped.
- Also, if test data has missing values in places where your training data did not, this will result in an error.

**So, usually, this is a terrible solution**

# 2) A Better Option: Imputation
Imputation fills in the missing value with some number.

In [37]:
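# Note: Imputer was removed in scikit-learn 0.22; newer releases provide sklearn.impute.SimpleImputer instead
from sklearn.preprocessing import Imputer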

In [40]:
my_imputer = Imputer()

This naive imputer can't handle categorical data stored as strings, so let's use it on the original subset of predictors.

In [42]:
predictors

['LotArea',
 'YearBuilt',
 '1stFlrSF',
 '2ndFlrSF',
 'FullBath',
 'BedroomAbvGr',
 'TotRmsAbvGrd']

In [None]:
# recall we defined X as
# X = iowa_data[predictors]

In [43]:
data_with_imputed_values = my_imputer.fit_transform(X)

The default behavior fills in the mean value for imputation.
- Statisticians have researched more complex strategies
- **but, they typically give no benefit once you plug the results into sophisticated machine learning models!!!**

A nice feature of imputation is that it can be **easily included in a scikit-learn Pipeline.** 
- Pipelines simplify model building, model validation and model deployment.
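
Here is a minimal sketch of imputation inside a Pipeline, on a tiny invented dataset. It uses `SimpleImputer` from `sklearn.impute`, which newer scikit-learn releases provide in place of the `Imputer` imported above; the choice of a decision tree as the final step is only for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor

# Tiny synthetic dataset with a missing value in each column (np.nan).
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [np.nan, 40.0]])
y_train = np.array([100.0, 200.0, 300.0, 400.0])
X_new = np.array([[np.nan, 20.0]])

# The imputer learns column means during fit and reuses them at predict time.
model = make_pipeline(SimpleImputer(strategy="mean"),
                      DecisionTreeRegressor(random_state=0))
model.fit(X_train, y_train)

print(model.named_steps["simpleimputer"].statistics_)  # learned column means
print(model.predict(X_new))
```

Because the imputer lives inside the pipeline, new data is filled with the means learned from the training data, with no separate transform call to forget.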

# 3) An Extension To Imputation
Imputation is the standard approach (and it usually works well).
- However! Imputed values may be systematically above or below their actual values
    - e.g. a missing value for garage square footage may actually mean that there is no garage on that property.
    - Rows with missing values may also be unique in some other way; here they all fall into the category of non-garage properties
- In these cases, the model makes better predictions by considering which values were originally missing. 
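
A small pandas sketch of the idea, on invented data where a missing garage area stands for "no garage". The column names and values are hypothetical, and plain `fillna` with the column mean stands in for the imputer.

```python
import numpy as np
import pandas as pd

# Hypothetical data: NaN garage area here means "no garage".
houses = pd.DataFrame({"GarageArea": [400.0, np.nan, 600.0, np.nan],
                       "LotArea": [8000, 9000, 10000, 7000]})

tracked = houses.copy()
cols_with_missing = [col for col in houses.columns if houses[col].isnull().any()]
for col in cols_with_missing:
    # Record which rows were missing before the values are filled in.
    tracked[col + "_was_missing"] = houses[col].isnull()

# Mean imputation on the columns that had missing values.
tracked[cols_with_missing] = tracked[cols_with_missing].fillna(
    houses[cols_with_missing].mean())
print(tracked)
```

After imputation the filled rows look like average garages, but the `_was_missing` column still lets a model treat them differently.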

In [63]:
# make a copy to avoid changing original data (when Imputing)
new_data = iowa_data.copy()

In [64]:
new_data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [65]:
new_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

In [66]:
# make new columns indicating what will be imputed
# (a list rather than a generator, so it can be reused after the loop)
cols_with_missing = [col for col in new_data.columns
                    if new_data[col].isnull().any()]

for col in cols_with_missing:
    # one boolean column per column with missing values (19 in this data)
    new_data[col + '_was_missing'] = new_data[col].isnull()

In [67]:
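# Imputation
# (this call fails: new_data still contains string columns such as SaleCondition; see the error below)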
my_imputer = Imputer()
new_data = my_imputer.fit_transform(new_data)

ValueError: could not convert string to float: 'Normal'

In some cases tracking what was imputed will meaningfully improve results. In other cases, it doesn't help at all.

*⚠️ The statement above is obscure about what the imputer does with the boolean columns: they contain no missing values, so the imputer leaves them unchanged apart from converting True/False to 1.0/0.0. ⚠️*

## Example (Comparing All Solutions)
We will see an example predicting housing prices with our Iowa housing data.

In [80]:
iowa_data = pd.read_csv('data/train.csv')
iowa_target = iowa_data.SalePrice
iowa_predictors = iowa_data.drop(['SalePrice'], axis = 1)

In [81]:
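# for the sake of simplicity, we'll use only numeric predictors
# NOTE: selecting from iowa_data keeps SalePrice among the predictors (target leakage);
# the outputs recorded below come from this version. A leak-free version would be:
# iowa_numeric_predictors = iowa_predictors.select_dtypes(exclude = ['object'])
iowa_numeric_predictors = iowa_data.select_dtypes(exclude = ['object'])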

In [82]:
iowa_numeric_predictors.head()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
0,1,60,65.0,8450,7,5,2003,2003,196.0,706,...,0,61,0,0,0,0,0,2,2008,208500
1,2,20,80.0,9600,6,8,1976,1976,0.0,978,...,298,0,0,0,0,0,0,5,2007,181500
2,3,60,68.0,11250,7,5,2001,2002,162.0,486,...,0,42,0,0,0,0,0,9,2008,223500
3,4,70,60.0,9550,7,5,1915,1970,0.0,216,...,0,35,272,0,0,0,0,2,2006,140000
4,5,60,84.0,14260,8,5,2000,2000,350.0,655,...,192,84,0,0,0,0,0,12,2008,250000


### Create Function to Measure Quality of An Approach

We divide our data into **training** and **test** sets.

Kaggle used a preloaded function score_dataset(X_train, X_test, y_train, y_test) to compare the quality of different approaches to missing values. This function reports the out-of-sample MAE score from a Random Forest.

- *So, I guess it scores by comparing the predicted values to the known values in the test set (the data held out from training).*
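
Since the function itself is not shown here, the following is only a guess at its shape, based on the description above: fit a random forest on the training split and report the MAE on the test split. The `n_estimators` and `random_state` values and the synthetic data are assumptions made for this sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_test, y_train, y_test):
    """Fit a random forest on the training split; return MAE on the test split."""
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return mean_absolute_error(y_test, preds)

# Synthetic check (invented data): the score is an error in the target's units.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3 * X[:, 0] + X[:, 1]
score = score_dataset(X[:75], X[75:], y[:75], y[75:])
print(score)
```

Wrapping the fit-predict-score steps in one function makes it easy to compare the three missing-value strategies on identical terms.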



In [83]:
X_train, X_test, y_train, y_test = train_test_split(iowa_numeric_predictors, 
                                                  iowa_data.SalePrice, 
                                                  random_state = 42)

### Get Model Score from Dropping Columns with Missing Values

In [84]:
cols_with_missing = [col for col in X_train.columns
                    if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis = 1)
reduced_X_test = X_test.drop(cols_with_missing, axis = 1)

In [85]:
# Kaggle did not publish the function they used, so I'll
# just fit the current forest model

forest_model.fit(reduced_X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [86]:
# .score() returns R^2 (the coefficient of determination), not MAE
forest_model.score(reduced_X_test, y_test)

0.99058886639694632

In [87]:
forest_preds = forest_model.predict(reduced_X_test)

In [88]:
print('Mean Absolute Error from dropping columns with Missing Values:')
print(mean_absolute_error(y_test, forest_preds))

Mean Absolute Error from dropping columns with Missing Values:
1017.88821918


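Compare the above MAE with the earlier random forest score:

- 1,017.89: dropped columns with missing values
- 22,287.00: the "out-of-the-box" random forest

That looks like a dramatic reduction in mean error, but the two scores are not comparable: this model uses all numeric columns rather than the seven hand-picked predictors, and those columns still include SalePrice itself (see the head() output above), so the target leaks into the predictors.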

### make a submission with the reduced dataset

In [176]:
#read in test and train data
train_data = pd.read_csv('data/train.csv')
train_target = train_data.SalePrice
train_predictors = train_data.drop(['SalePrice'], axis = 1)
train_numeric_predictors = train_predictors.select_dtypes(exclude = ['object'])
test = pd.read_csv('data/test.csv')
test_numeric_predictors = test.select_dtypes(exclude = ['object'])

#drop cols with missing values in train and test
train_cols_with_missing = [col for col in train_numeric_predictors.columns
                    if train_numeric_predictors[col].isnull().any()]
test_cols_with_missing = [col for col in test_numeric_predictors.columns
                         if test_numeric_predictors[col].isnull().any()]
#make union of dropped cols
results_list = [train_cols_with_missing, test_cols_with_missing]
cols_with_missing = set().union(*results_list)
reduced_X_train = train_numeric_predictors.drop(cols_with_missing, axis = 1)
reduced_X_test = test_numeric_predictors.drop(cols_with_missing, axis = 1)

#fit random forest
forest_model = RandomForestRegressor(max_leaf_nodes=50)
forest_model.fit(reduced_X_train, train_target)

#generate in-sample predictions
preds_on_reduced_numeric_data = forest_model.predict(reduced_X_train)

#insample score
print('in-sample forest_model.score:', forest_model.score(reduced_X_train, train_target))
print('Mean Absolute in-sample Error from dropping columns with Missing Values:')
print(mean_absolute_error(train_target, preds_on_reduced_numeric_data))

#generate out-sample predictions
preds_on_reduced_numeric_test_data = forest_model.predict(reduced_X_test)

my_submission = pd.DataFrame({'Id': test.Id, 'SalePrice': preds_on_reduced_numeric_test_data})

my_submission.to_csv('kaggle_ml_course_reduced_numeric.csv', index = False)

in-sample forest_model.score: 0.920969244423
Mean Absolute in-sample Error from dropping columns with Missing Values:
16322.4405066


In [172]:
len(preds_on_reduced_numeric_data)

1460

In [None]:
len(test)

### Get Model Score from Imputation

In [89]:
my_imputer = Imputer()
imputed_X_train = my_imputer.fit_transform(X_train)
imputed_X_test = my_imputer.transform(X_test)

In [90]:
forest_model.fit(imputed_X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [91]:
forest_model.score(imputed_X_test, y_test)

0.99373992624303586

In [92]:
forest_preds = forest_model.predict(imputed_X_test)

In [93]:
print('Mean Absolute Error from Imputation:')
print(mean_absolute_error(y_test, forest_preds))

Mean Absolute Error from Imputation:
893.72


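Compare the above MAE with the earlier scores:

- 893.72: imputed values
- 1,017.89: dropped columns with missing values
- 22,287.00: the "out-of-the-box" random forest
- 27,232: the best tuned tree regressor (max_leaf_nodes = 50)
- 30,160.74: the out-of-sample score of the default tree regressor
- 62.35: the in-sample score of the default tree regressor (the gap to its out-of-sample score is the sign of **overfitting**)

In this run, imputation scores somewhat better than dropping columns. The much larger gap to the earlier models is not a fair comparison, because the predictors used here still include SalePrice.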

### make a submission with imputed dataset

In [178]:
#read in test and train data
train_data = pd.read_csv('data/train.csv')
train_target = train_data.SalePrice
train_predictors = train_data.drop(['SalePrice'], axis = 1)
train_numeric_predictors = train_predictors.select_dtypes(exclude = ['object'])
test = pd.read_csv('data/test.csv')
test_numeric_predictors = test.select_dtypes(exclude = ['object'])

#impute value to both test and train data
my_imputer = Imputer()
imputed_X_train = my_imputer.fit_transform(train_numeric_predictors)
imputed_X_test = my_imputer.transform(test_numeric_predictors)

#fit random forest
forest_model = RandomForestRegressor(max_leaf_nodes=50)
forest_model.fit(imputed_X_train, train_target)

#generate in-sample predictions
preds_on_imputed_numeric_data = forest_model.predict(imputed_X_train)

#insample score
print('in-sample forest_model.score:', forest_model.score(imputed_X_train, train_target))
print('Mean Absolute in-sample Error from Imputation:')
print(mean_absolute_error(train_target, preds_on_imputed_numeric_data))

#generate out-sample predictions
preds_on_imputed_numeric_test_data = forest_model.predict(imputed_X_test)

my_submission = pd.DataFrame({'Id': test.Id, 'SalePrice': preds_on_imputed_numeric_test_data})

my_submission.to_csv('kaggle_ml_course_imputed_numeric.csv', index = False)

in-sample forest_model.score: 0.931296484106
Mean Absolute in-sample Error from Imputation:
14767.7396892


### Get score from imputation with Extra Columns Showing What Was Imputed

In [94]:
imputed_X_train_plus = X_train.copy()
imputed_X_test_plus = X_test.copy()

In [95]:
cols_with_missing = (col for col in X_train.columns
                    if X_train[col].isnull().any())
for col in cols_with_missing:
    imputed_X_train_plus[col + '_was_missing'] = imputed_X_train_plus[col].isnull()
    imputed_X_test_plus[col + '_was_missing'] = imputed_X_test_plus[col].isnull()
    
# Imputation
my_imputer = Imputer()
imputed_X_train_plus = my_imputer.fit_transform(imputed_X_train_plus)
imputed_X_test_plus = my_imputer.transform(imputed_X_test_plus)

forest_model.fit(imputed_X_train_plus, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [97]:
forest_model.score(imputed_X_test_plus, y_test)

0.99297195118742509

In [98]:
forest_preds = forest_model.predict(imputed_X_test_plus)

In [99]:
print('Mean Absolute Error from Imputation while Tracking What Was Imputed:')
print(mean_absolute_error(y_test, forest_preds))

Mean Absolute Error from Imputation while Tracking What Was Imputed:
937.506849315


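Compare:

- 937.51: imputed values with tracking
- 893.72: imputed values
- 1,017.89: dropped columns with missing values
- 22,287.00: the "out-of-the-box" random forest
- 27,232: the best tuned tree regressor (max_leaf_nodes = 50)
- 30,160.74: the out-of-sample score of the default tree regressor
- 62.35: the in-sample score of the default tree regressor (the gap to its out-of-sample score is the sign of **overfitting**)

In this run, tracking what was imputed (937.51) did slightly worse than plain imputation (893.72), so the extra columns did not help here. As before, the gap to the earlier models is not a fair comparison, because the predictors used here still include SalePrice.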

### make submission tracking what was missing

In [180]:
#read in test and train data
train_data = pd.read_csv('data/train.csv')
train_target = train_data.SalePrice
train_predictors = train_data.drop(['SalePrice'], axis = 1)
train_numeric_predictors = train_predictors.select_dtypes(exclude = ['object'])
test = pd.read_csv('data/test.csv')
test_numeric_predictors = test.select_dtypes(exclude = ['object'])

# track union of cols with misssing vals in train and test data
train_cols_with_missing = [col for col in train_numeric_predictors.columns
                    if train_numeric_predictors[col].isnull().any()]
test_cols_with_missing = [col for col in test_numeric_predictors.columns
                         if test_numeric_predictors[col].isnull().any()]
#make union of dropped cols
results_list = [train_cols_with_missing, test_cols_with_missing]
cols_with_missing = set().union(*results_list)

#track cols with missing data
for col in cols_with_missing:
    train_numeric_predictors[col + '_was_missing'] = train_numeric_predictors[col].isnull()
    test_numeric_predictors[col + '_was_missing'] = test_numeric_predictors[col].isnull()

#impute value to both test and train data
my_imputer = Imputer()
imputed_X_train = my_imputer.fit_transform(train_numeric_predictors)
imputed_X_test = my_imputer.transform(test_numeric_predictors)
                            #.transform (not .fit_transform) so the test data is filled with the means learned from the training data

#fit random forest
forest_model = RandomForestRegressor()
forest_model.fit(imputed_X_train, train_target)

#generate in-sample predictions
preds_on_imputed_numeric_data = forest_model.predict(imputed_X_train)

#insample score
print('in-sample forest_model.score:', forest_model.score(imputed_X_train, train_target))
print('Mean Absolute in-sample Error from Imputation while Tracking What Was Imputed:')
print(mean_absolute_error(train_target, preds_on_imputed_numeric_data))

#generate out-sample predictions
preds_on_imputed_numeric_test_data = forest_model.predict(imputed_X_test)

my_submission = pd.DataFrame({'Id': test.Id, 'SalePrice': preds_on_imputed_numeric_test_data})

my_submission.to_csv('kaggle_ml_course_tracked_imputed_numeric.csv', index = False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


in-sample forest_model.score: 0.971075690954
Mean Absolute in-sample Error from Imputation while Tracking What Was Imputed:
8055.4580137


## Conclusion

The benefit of tracking which values were imputed can vary widely from one dataset to the next (largely determined by whether rows with missing values are intrinsically like or unlike those without missing values).
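
The `Imputer` class used above was removed from later scikit-learn releases in favour of `SimpleImputer` in `sklearn.impute`. As a minimal sketch, assuming scikit-learn 0.21 or newer, its `add_indicator=True` option appends the "was missing" columns for you. The toy frames and their values are made up for illustration. Unlike the cell above, the indicator columns come from the data passed to `fit`, not from the union of train and test.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames (hypothetical values) standing in for the numeric predictors.
toy_train = pd.DataFrame({'LotFrontage': [60.0, np.nan, 80.0],
                          'GarageYrBlt': [2000.0, 1990.0, np.nan]})
toy_test = pd.DataFrame({'LotFrontage': [np.nan, 70.0],
                         'GarageYrBlt': [1985.0, 2005.0]})

# add_indicator=True appends one 0/1 column per feature that had
# missing values in the data passed to fit.
tracked_imputer = SimpleImputer(strategy='mean', add_indicator=True)
toy_train_imputed = tracked_imputer.fit_transform(toy_train)

# transform (not fit_transform) reuses the training means on the test data.
toy_test_imputed = tracked_imputer.transform(toy_test)

print(toy_train_imputed.shape)
print(toy_test_imputed)
```

The first two output columns are the imputed predictors and the last two are the missing-value flags.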

# Using Categorical Data with One Hot Encoding

## Introduction
Categorical data is data that takes only a limited number of values.

- For example, makes of car: Honda, Toyota, Ford, None, etc.

You will get an error if you try to plug these variables into most machine learning models in Python without "encoding" them first.

## One-Hot Encoding: The Standard Approach for Categorical Data
One-hot encoding is the most widespread approach, and it works very well unless your categorical variable takes on a large number of values (i.e. you generally won't use it for variables taking more than 15 different values. It'd be a poor choice in some cases with fewer values, though that varies).
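
One way to apply that rule of thumb is to count the distinct values in each text column before encoding. This is only a sketch: the toy frame, the `ParcelId` column and the cutoff of 3 are made up for illustration (the text above suggests roughly 15 for real data).

```python
import pandas as pd

# Hypothetical toy predictors; the names only mimic the housing data.
toy_predictors = pd.DataFrame({
    'Street': ['Pave', 'Grvl', 'Pave', 'Pave'],
    'ParcelId': ['a1', 'b2', 'c3', 'd4'],   # a different value on every row
    'LotArea': [8450, 9600, 11250, 9550],
})

max_levels = 3  # illustrative cutoff for this tiny example

# Count distinct values in each non-numeric column.
levels_per_column = toy_predictors.select_dtypes(exclude=['number']).nunique()

# Keep only the text columns with few enough levels to one-hot encode.
low_cardinality_cols = [col for col, n_levels in levels_per_column.items()
                        if n_levels <= max_levels]
print(low_cardinality_cols)
```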

One-hot encoding creates new (binary) columns, indicating the presence of each possible value from the original data. Let's work through an example.

![](https://i.imgur.com/mtimFxh.png)

The values in the original data are Red, Yellow and Green. We create a separate column for each possible value. Wherever the original value was Red, we put a 1 in the Red column.
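
The same idea on a toy column, as a small sketch (the rows are made up and are not the ones in the picture). `dtype=int` is passed so the new columns hold 0/1 rather than booleans on recent pandas versions.

```python
import pandas as pd

# A made-up colour column with the three values from the example.
colors = pd.DataFrame({'Color': ['Red', 'Red', 'Yellow', 'Green', 'Yellow']})

# One new 0/1 column per distinct value, named <column>_<value>.
one_hot_colors = pd.get_dummies(colors, dtype=int)
print(one_hot_colors)
```

Each row has exactly one 1, in the column matching its original value.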

In [182]:
iowa_data = pd.read_csv('data/train.csv')
target = iowa_data.SalePrice
train_predictors = iowa_data.drop(['SalePrice'], axis = 1)

test_predictors = pd.read_csv('data/test.csv')

In [183]:
train_predictors.dtypes.sample(10)

FullBath        int64
RoofStyle      object
Exterior1st    object
ExterQual      object
ScreenPorch     int64
LandSlope      object
PoolQC         object
MiscFeature    object
MSZoning       object
FireplaceQu    object
dtype: object

**Object** indicates a column has text (there are other things it could theoretically be, but that's unimportant for our purposes). It's most common to one-hot encode these "object" columns, since they can't be plugged directly into most models. Pandas offers a convenient function called **get_dummies** to get one-hot encodings. Call it like this:

In [184]:
one_hot_encoded_training_predictores = pd.get_dummies(train_predictors)

In [185]:
one_hot_encoded_test_predictors = pd.get_dummies(test_predictors)

Alternatively, we could have dropped the categoricals. To see how the approaches compare, we can calculate the mean absolute error of models built with two alternative sets of predictors:
1. One-hot encoded categoricals as well as numeric predictors
2. Numerical predictors, where we drop categoricals (which we've done above)

One-hot encoding usually helps, but it varies on a case-by-case basis. In this case, there doesn't appear to be any meaningful benefit from using the one-hot encoded variables.

In [187]:
# Handle missing values through untracked imputation
my_imputer = Imputer()
one_hot_encoded_training_predictores = my_imputer.fit_transform(one_hot_encoded_training_predictores)
one_hot_encoded_test_predictors = my_imputer.fit_transform(one_hot_encoded_test_predictors)

In [188]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

def get_mae(X, y):
    '''multiply by -1 to turn the negative MAE that sklearn returns by convention into a positive score'''
    return -1* cross_val_score(RandomForestRegressor(max_leaf_nodes=50),
                              X, y, 
                              scoring = 'neg_mean_absolute_error').mean()


In [189]:
predictors_without_categoricals = train_predictors.select_dtypes(exclude = ['object'])
# handle missing values through untracked imputation
predictors_without_categoricals = my_imputer.fit_transform(predictors_without_categoricals)

In [190]:
mae_without_categoricals = get_mae(predictors_without_categoricals,
                                  target)

In [191]:
mae_one_hot_encoded = get_mae(one_hot_encoded_training_predictores,
                             target)

In [192]:
print('Mean Absolute Error when Dropping Categoricals and imputing nas: ' + str(int(mae_without_categoricals)))
print('Mean Absolute Error with One-Hot Encoding and imputing nas: ' + str(int(mae_one_hot_encoded)))

Mean Absolute Error when Dropping Categoricals and imputing nas: 20852
Mean Absolute Error with One-Hot Encoding and imputing nas: 20399


In [193]:
mae_without_categoricals
# 20,664

20852.464541759018

In [194]:
mae_one_hot_encoded
# 20,514

20399.917485257756

# Caution
Scikit-learn is sensitive to the ordering of columns, so if the training and test datasets get misaligned, your results will be nonsense. This could happen if a categorical had a different set of levels in the training data vs the test data.

Ensure the test data is encoded in the same manner as the training data with the align command:

In [195]:
# reload original test and train data for certainty
iowa_data = pd.read_csv('data/train.csv')
target = iowa_data.SalePrice
train_predictors = iowa_data.drop(['SalePrice'], axis = 1)
test_predictors = pd.read_csv('data/test.csv')

In [196]:
one_hot_encoded_training_predictores = pd.get_dummies(train_predictors)

one_hot_encoded_test_predictors = pd.get_dummies(test_predictors)

In [198]:
#align categorical levels/columns
final_train, final_test = one_hot_encoded_training_predictores.align(one_hot_encoded_test_predictors,
                                                                    join = 'left',
                                                                    axis = 1)

In [199]:
final_train = my_imputer.fit_transform(final_train)
# the columns now match, so reuse the imputer fit on the training data
final_test = my_imputer.transform(final_test)

In [200]:
#mae of aligned one-hot encoded aligned left (i.e. to train)
get_mae(final_train, target)

20974.262539237254

The align command makes sure the columns show up in the same order in both datasets (it uses column names to identify which columns line up in each dataset). The argument `join = 'left'` specifies that we will do the equivalent of SQL's *left join*. That means, if there are ever columns that show up in one dataset and not the other, we will keep exactly the columns from our training data. The argument `join = 'inner'` would do what SQL databases call an *inner join*, keeping only the columns showing up in both datasets. That's also a sensible choice.
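
A toy illustration of the two join choices, as a sketch with made-up frames: the training data has a `Green` level that the test data lacks, and the test data has a `Blue` level that the training data lacks.

```python
import pandas as pd

toy_train = pd.DataFrame({'Size': [1, 2, 3], 'Color': ['Red', 'Green', 'Red']})
toy_test = pd.DataFrame({'Size': [4, 5], 'Color': ['Red', 'Blue']})

train_encoded = pd.get_dummies(toy_train, dtype=int)  # Size, Color_Green, Color_Red
test_encoded = pd.get_dummies(toy_test, dtype=int)    # Size, Color_Blue, Color_Red

# join='left': keep exactly the training columns; Color_Green is added to
# the test frame as an all-NaN column and Color_Blue is dropped.
left_train, left_test = train_encoded.align(test_encoded, join='left', axis=1)

# join='inner': keep only the columns present in both frames.
inner_train, inner_test = train_encoded.align(test_encoded, join='inner', axis=1)

print(list(left_test.columns))
print(list(inner_test.columns))
```

With the left join, the all-NaN column still has to be filled (for example by the imputer) before a model can use it.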

In [201]:
#align categorical levels/columns
final_train, final_test = one_hot_encoded_training_predictores.align(one_hot_encoded_test_predictors,
                                                                    join = 'inner',
                                                                    axis = 1)

In [202]:
final_train = my_imputer.fit_transform(final_train)
# the columns now match, so reuse the imputer fit on the training data
final_test = my_imputer.transform(final_test)

In [203]:
#mae of aligned one-hot encoded aligned inner (i.e. to intersection of train and test)
get_mae(final_train, target)

20270.741890381945

## make submission with encoded categorical data

In [205]:
## make submission with untracked imputation and inner alignment of dummy variables
# define and fit model to the imputed and inner aligned encoded data
forest_model = RandomForestRegressor()

# fit the model on the aligned, imputed training data
forest_model.fit(final_train, train_target)

#generate out-sample predictions
prds_on_untr_imp_encodedcat_inner_align_test_data = forest_model.predict(final_test)

my_submission = pd.DataFrame({'Id': test.Id, 'SalePrice': prds_on_untr_imp_encodedcat_inner_align_test_data})
my_submission.to_csv('kaggle_ml_course_prds_on_untr_imp_encodedcat_inner_align_test_data.csv', index = False)

## Further Resources for Categorical Data
- **Pipelines:** Deploying models into production-ready systems is a topic unto itself. While one-hot encoding is still a great approach, your code will need to be built in an especially robust way. Scikit-learn pipelines are a great tool for this. Scikit-learn offers a class for one-hot encoding, and this can be added to a Pipeline. Unfortunately, in scikit-learn releases before 0.20 it doesn't handle text or object values, which is a common use case.
- **Applications To Text For Deep Learning:** Keras and TensorFlow have functionality for one-hot encoding, which is useful for working with text.
- **Categoricals with Many Values:** Scikit-learn's FeatureHasher uses the hashing trick to store high-dimensional data. This will add some complexity to your modeling code.
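
As a minimal sketch of the Pipeline approach, assuming scikit-learn 0.20 or newer (where `OneHotEncoder` accepts string columns and `ColumnTransformer` is available). The toy data, the step names and the unseen `'Dirt'` level are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one numeric and one text predictor.
toy_X = pd.DataFrame({'LotArea': [8450, 9600, 11250, 9550, 14260, 14115],
                      'Street': ['Pave', 'Pave', 'Grvl', 'Pave', 'Grvl', 'Pave']})
toy_y = [208500, 181500, 223500, 140000, 250000, 143000]

# Impute the numeric column and one-hot encode the text column.
# handle_unknown='ignore' encodes levels unseen during fit as all zeros,
# so train and test can never end up with different columns.
preprocess = ColumnTransformer([
    ('numeric', SimpleImputer(strategy='mean'), ['LotArea']),
    ('categorical', OneHotEncoder(handle_unknown='ignore'), ['Street']),
])

model = Pipeline([
    ('preprocess', preprocess),
    ('forest', RandomForestRegressor(n_estimators=10, random_state=0)),
])
model.fit(toy_X, toy_y)

# 'Dirt' never appeared in training, yet prediction still works.
new_houses = pd.DataFrame({'LotArea': [10000], 'Street': ['Dirt']})
toy_predictions = model.predict(new_houses)
print(toy_predictions)
```

Because the encoder lives inside the pipeline, the same fitted encoding is applied to any new data, which removes the need for a separate align step.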

# What is XGBoost?
XGBoost is the leading model for working with standard tabular data. XGBoost models dominate many Kaggle competitions.

To reach peak accuracy, XGBoost models require more knowledge and *model tuning* than techniques like Random Forest. This tutorial will:
- Follow the full modeling workflow with XGBoost
- Fine-tune XGBoost models for optimal performance

XGBoost is an implementation of the **Gradient Boosted Decision Trees** algorithm (scikit-learn has another version of this algorithm, but XGBoost has some technical advantages).

What are **Gradient Boosted Decision Trees**?
![](https://i.imgur.com/e7MIgXk.png)

We go through cycles that repeatedly build new models and combine them into an **ensemble** model. We start the cycle by calculating the errors for each observation in the dataset. We then build a new model to predict those. We add predictions from this error-predicting model to the "ensemble of models."

To make a prediction, we add the predictions from all previous models. We can use these predictions to calculate new errors, build the next model, and add it to the ensemble.

There's one piece outside that cycle. We need some base prediction to start the cycle. In practice, the initial predictions can be pretty naive. Even if its predictions are wildly inaccurate, subsequent additions to the ensemble will address those errors.
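
The cycle can be sketched in plain Python. This is a toy illustration and not how XGBoost is implemented: each "model" is a one-split tree on a single feature, the base prediction is the mean, and the data and all function names are made up.

```python
def fit_stump(xs, errors):
    """Fit a one-split tree: pick the threshold on x that best predicts the errors."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [e for x, e in zip(xs, errors) if x <= threshold]
        right = [e for x, e in zip(xs, errors) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((e - left_mean) ** 2 for e in left)
               + sum((e - right_mean) ** 2 for e in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds):
    base = sum(ys) / len(ys)              # naive base prediction: the mean
    ensemble = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        # 1. calculate the errors of the current ensemble
        errors = [y - p for y, p in zip(ys, predictions)]
        # 2. build a new model that predicts those errors
        stump = fit_stump(xs, errors)
        # 3. add it to the ensemble and update the predictions
        ensemble.append(stump)
        predictions = [p + stump(x) for x, p in zip(xs, predictions)]
    return base, ensemble


def predict(base, ensemble, x):
    # A prediction is the base value plus the output of every model so far.
    return base + sum(stump(x) for stump in ensemble)


def toy_mse(base, ensemble, xs, ys):
    return sum((y - predict(base, ensemble, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


sizes = [800, 1000, 1200, 1500, 1800, 2400]   # made-up house sizes
prices = [100, 120, 150, 170, 220, 300]       # made-up prices (thousands)

for n_rounds in (0, 1, 10):
    base, ensemble = boost(sizes, prices, n_rounds)
    print(n_rounds, round(toy_mse(base, ensemble, sizes, prices), 1))
```

The training error shrinks as rounds are added, which is also why too many rounds can overfit.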

This process may sound complicated, but the code to use it is straightforward. [i.e. obscure and protective of scikit-learn's ip] We'll fill in some additional explanatory details in the **model tuning** section below.

In [252]:
data = pd.read_csv('data/train.csv')
data.SalePrice.isnull().sum()

0

In [253]:
data.dropna(axis = 0, subset = ['SalePrice'], inplace = True)
#I think this is meant to drop only those rows for which there is no sale price

In [254]:
len(data)

1460

In [255]:
y = data.SalePrice
y.isnull().sum()

0

In [256]:
len(y)

1460

In [257]:
X = data.drop(['SalePrice'], axis = 1).select_dtypes(exclude = ['object'])
# it seems we're ignoring categorical data for now
# .values replaces .as_matrix(), which newer pandas releases removed
train_X, test_X, train_y, test_y = train_test_split(X.values, 
                                                   y.values, 
                                                   test_size = 0.25,
                                                   random_state = 42)
my_imputer = Imputer()
train_X = my_imputer.fit_transform(train_X)
test_X = my_imputer.transform(test_X)

In [258]:
len(X)

1460

In [259]:
# build the model
from xgboost import XGBRegressor

In [260]:
my_model = XGBRegressor()
# verbose = False avoids printing out updates with each cycle
my_model.fit(train_X, train_y, verbose = False)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [261]:
predictions = my_model.predict(test_X)

print("Mean Absolute Error : " + str(mean_absolute_error(predictions, test_y)))

Mean Absolute Error : 17542.815229


## make submission

In [230]:
# read in the test and train data 
# select numeric test and train features
train = pd.read_csv('data/train.csv')
train.dropna(axis = 0, subset = ['SalePrice'], inplace = True)
train_y = train.SalePrice
train_X = train.drop(['SalePrice'], axis = 1).select_dtypes(exclude = ['object'])

test = pd.read_csv('data/test.csv')
test_X = test.select_dtypes(exclude = ['object'])
# impute any missing values in both test and train
train_X = my_imputer.fit_transform(train_X)
test_X = my_imputer.transform(test_X)
# fit model to train
my_model.fit(train_X, train_y)
# predict on test
xgb_numeric_preds = my_model.predict(test_X)

In [231]:
#submit
my_submission = pd.DataFrame({'Id': test.Id, 'SalePrice': xgb_numeric_preds})
my_submission.to_csv('xgb_numeric_preds.csv', index = False)

# XGB Model Tuning
XGBoost has a few parameters that can dramatically affect your model's accuracy and training speed. The first parameters you should understand are:

**n_estimators and early_stopping_rounds**

**n_estimators** specifies how many times to go through the modeling cycle described above.

In the underfitting vs overfitting graph, n_estimators moves you further to the right.
- Too low a value causes underfitting, which is inaccurate predictions on both training data and new data.
- Too large a value causes overfitting, which is accurate predictions on training data, but inaccurate predictions on new data.

Typical values range from 100-1000, though this depends a lot on the **learning rate** (discussed below).

**early_stopping_rounds** offers a way to automatically find the ideal value. Early stopping causes the model to stop iterating when the validation score stops improving, even if we aren't at the hard stop for **n_estimators**. It's smart to set a high value for **n_estimators** and then use **early_stopping_rounds** to find the optimal time to stop iterating. 

Since random chance sometimes causes a single round where validation scores don't improve, you need to specify a number for how many rounds without improvement to allow before stopping. **early_stopping_rounds = 5** is a reasonable value. Thus we stop once 5 rounds in a row fail to improve on the best validation score so far.
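
The stopping rule can be sketched in a few lines of plain Python. This is not XGBoost's implementation; the function name is made up, and the RMSE values are copied from the verbose log of the `learning_rate = 0.5` run further down in this notebook.

```python
def early_stopping_round(scores, patience):
    """Return (best_round, stop_round) for a list of per-round validation errors.

    Training stops at the first round where `patience` rounds in a row have
    failed to beat the best (lowest) score so far. stop_round is None if
    that never happens.
    """
    best_round = 0
    for current_round, score in enumerate(scores):
        if score < scores[best_round]:
            best_round = current_round
        elif current_round - best_round >= patience:
            return best_round, current_round
    return best_round, None


# validation_0-rmse for rounds 0..21, copied from the log later in the notebook
validation_rmse = [108923, 66220.3, 46770, 38591.9, 35387.7, 34861.7,
                   33999.1, 33492.9, 32693.9, 31953.4, 31751.9, 31821.6,
                   31770.5, 31667.2, 31661.6, 31664, 31469.6, 32019,
                   32000.4, 32049.6, 31807.3, 31617.7]

print(early_stopping_round(validation_rmse, patience=5))  # (16, 21)
```

Rounds 20 and 21 each improve on the round before them, but neither beats round 16, so they still count toward the 5 rounds without improvement.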

Here is the code to fit with early_stopping:

In [262]:
my_model = XGBRegressor(n_estimators = 1000)

my_model.fit(train_X, train_y, early_stopping_rounds=5,
            eval_set=[(test_X, test_y)], verbose = False)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=1000,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [263]:
predictions = my_model.predict(test_X)

print("Mean Absolute Error : " + str(mean_absolute_error(predictions, test_y)))

Mean Absolute Error : 17776.7281678


In [None]:
# mae of 17,776.73 is slightly worse, could be overfitting

When using **early_stopping_rounds**, you need to set aside some of your data for checking the number of rounds to use. If you later want to fit a model with all of your data, set **n_estimators** to whatever value you found to be optimal when run with early stopping.

### learning_rate
Here's a subtle but important trick for better XGBoost models:

Instead of getting predictions by simply adding up the predictions from each component model, we will multiply the predictions from each model by a small number before adding them in. This means each tree we add to the ensemble helps us less. **In practice, this reduces the model's propensity to overfit.** 

So, you can use a higher value of **n_estimators** without overfitting. If you use early stopping, the appropriate number of trees will be set automatically. 

In general, a small learning rate (and a large number of estimators) will yield more accurate XGBoost models, though it will also take the model longer to train since it does more iterations through the cycle.
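
A back-of-the-envelope sketch of why a smaller learning rate needs more estimators. It makes a strong simplifying assumption: every new model predicts the current error perfectly, so each round removes a `learning_rate` fraction of whatever error remains. Real trees are not perfect, so treat the numbers as illustrative only; the function name is made up.

```python
def rounds_to_shrink_error(learning_rate, target_fraction=0.01, max_rounds=1000):
    """Count rounds until the remaining error falls to target_fraction of the start."""
    remaining = 1.0  # error as a fraction of the initial error
    for round_number in range(1, max_rounds + 1):
        # the new model's prediction is scaled by learning_rate before it is added
        remaining -= learning_rate * remaining
        if remaining <= target_fraction:
            return round_number
    return None


for rate in (1.0, 0.5, 0.1):
    print(rate, rounds_to_shrink_error(rate))
```

Under this assumption the remaining error after `k` rounds is `(1 - learning_rate) ** k`, so a rate of 0.1 takes 44 rounds to reach 1% of the initial error, against 7 rounds for a rate of 0.5.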

Modifying the example above to include a learning rate would yield the following code:

In [264]:
my_model = XGBRegressor(n_estimators=1000, learning_rate = 0.5)
my_model.fit(train_X, train_y, early_stopping_rounds = 5,
            eval_set = [(test_X, test_y)], verbose = True)

[0]	validation_0-rmse:108923
Will train until validation_0-rmse hasn't improved in 5 rounds.
[1]	validation_0-rmse:66220.3
[2]	validation_0-rmse:46770
[3]	validation_0-rmse:38591.9
[4]	validation_0-rmse:35387.7
[5]	validation_0-rmse:34861.7
[6]	validation_0-rmse:33999.1
[7]	validation_0-rmse:33492.9
[8]	validation_0-rmse:32693.9
[9]	validation_0-rmse:31953.4
[10]	validation_0-rmse:31751.9
[11]	validation_0-rmse:31821.6
[12]	validation_0-rmse:31770.5
[13]	validation_0-rmse:31667.2
[14]	validation_0-rmse:31661.6
[15]	validation_0-rmse:31664
[16]	validation_0-rmse:31469.6
[17]	validation_0-rmse:32019
[18]	validation_0-rmse:32000.4
[19]	validation_0-rmse:32049.6
[20]	validation_0-rmse:31807.3
[21]	validation_0-rmse:31617.7
Stopping. Best iteration:
[16]	validation_0-rmse:31469.6



XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.5, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=1000,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [265]:
predictions = my_model.predict(test_X)

print("Mean Absolute Error : " + str(mean_absolute_error(predictions, test_y)))

Mean Absolute Error : 20305.7345248


In [None]:
# mae 20,305 is significantly worse; learning_rate = 0.5 is larger than the default 0.1 used in the earlier runs, so this is probably overfitting

### n_jobs
On larger datasets where runtime is a consideration, you can use parallelism to build your models faster. It's common to set the parameter **n_jobs** equal to the number of cores on your machine. On smaller datasets, this won't help.

The resulting model won't be any better, so micro-optimizing for fitting time is typically nothing but a distraction. But it's useful on large datasets where you would otherwise spend a long time waiting during the `fit` command.
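
A small sketch of picking **n_jobs** from the machine's core count. `os.cpu_count()` is from the standard library; the `model_params` dict is a made-up name for keyword arguments you could pass on as `XGBRegressor(**model_params)`.

```python
import os

# os.cpu_count() can return None when the count is unknown, so fall back to 1.
n_jobs = os.cpu_count() or 1

# Hypothetical keyword arguments for XGBRegressor.
model_params = {'n_estimators': 1000, 'n_jobs': n_jobs}
print(model_params)
```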

XGBoost has a multitude of other parameters, but these will go a very long way in helping you fine-tune your XGBoost model for optimal performance. 

## make a tuned xgboost submission

In [271]:
# read in the test and train data 
# select numeric test and train features
train = pd.read_csv('data/train.csv')
train.dropna(axis = 0, subset = ['SalePrice'], inplace = True)
train_y = train.SalePrice
train_X = train.drop(['SalePrice'], axis = 1).select_dtypes(exclude = ['object'])

test = pd.read_csv('data/test.csv')
test_X = test.select_dtypes(exclude = ['object'])

# impute any missing values in both test and train
train_X = my_imputer.fit_transform(train_X)
test_X = my_imputer.transform(test_X)

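# fit tuned model to train
# note: eval_set is the training data itself, so the logged rmse is in-sample;
# it keeps falling (see the log below) and early stopping gives no check on overfitting
my_model = XGBRegressor(n_estimators=1000, learning_rate = 0.5)
my_model.fit(train_X, train_y, early_stopping_rounds = 5,
             eval_set = [(train_X, train_y)],
             verbose = True)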

# predict on test
tuned_xgb_numeric_preds = my_model.predict(test_X)

[0]	validation_0-rmse:106056
Will train until validation_0-rmse hasn't improved in 5 rounds.
[1]	validation_0-rmse:61705.3
[2]	validation_0-rmse:40611.6
[3]	validation_0-rmse:31483.2
[4]	validation_0-rmse:27199.8
[5]	validation_0-rmse:24887.1
[6]	validation_0-rmse:23640.6
[7]	validation_0-rmse:23006.5
[8]	validation_0-rmse:22256.1
[9]	validation_0-rmse:21932.3
[10]	validation_0-rmse:21605.9
[11]	validation_0-rmse:21436.3
[12]	validation_0-rmse:21002.5
[13]	validation_0-rmse:20215.8
[14]	validation_0-rmse:19706.2
[15]	validation_0-rmse:19499
[16]	validation_0-rmse:19303.9
[17]	validation_0-rmse:18880.3
[18]	validation_0-rmse:18675.8
[19]	validation_0-rmse:18258.2
[20]	validation_0-rmse:17905
[21]	validation_0-rmse:17693.7
[22]	validation_0-rmse:17454.9
[23]	validation_0-rmse:17147.8
[24]	validation_0-rmse:16918.1
[25]	validation_0-rmse:16660.9
[26]	validation_0-rmse:16556.8
[27]	validation_0-rmse:16297.4
[28]	validation_0-rmse:15952.7
[29]	validation_0-rmse:15762.9
[30]	validation_0-rms

[259]	validation_0-rmse:3354.07
[260]	validation_0-rmse:3339.75
[261]	validation_0-rmse:3321.65
[262]	validation_0-rmse:3300.42
[263]	validation_0-rmse:3274.4
[264]	validation_0-rmse:3248.24
[265]	validation_0-rmse:3230.01
[266]	validation_0-rmse:3211.15
[267]	validation_0-rmse:3200.92
[268]	validation_0-rmse:3176.99
[269]	validation_0-rmse:3150.37
[270]	validation_0-rmse:3129.04
[271]	validation_0-rmse:3113.24
[272]	validation_0-rmse:3090.42
[273]	validation_0-rmse:3075.46
[274]	validation_0-rmse:3055.07
[275]	validation_0-rmse:3045.78
[276]	validation_0-rmse:3031.88
[277]	validation_0-rmse:3011.36
[278]	validation_0-rmse:2987.01
[279]	validation_0-rmse:2968.68
[280]	validation_0-rmse:2951.28
[281]	validation_0-rmse:2935.6
[282]	validation_0-rmse:2916.82
[283]	validation_0-rmse:2894.18
[284]	validation_0-rmse:2874.52
[285]	validation_0-rmse:2858.89
[286]	validation_0-rmse:2850.15
[287]	validation_0-rmse:2830.06
[288]	validation_0-rmse:2824.53
[289]	validation_0-rmse:2812.38
[290]	vali

[517]	validation_0-rmse:891.41
[518]	validation_0-rmse:888.045
[519]	validation_0-rmse:883.038
[520]	validation_0-rmse:880.461
[521]	validation_0-rmse:877.881
[522]	validation_0-rmse:871.532
[523]	validation_0-rmse:868.307
[524]	validation_0-rmse:864.513
[525]	validation_0-rmse:860.44
[526]	validation_0-rmse:859.579
[527]	validation_0-rmse:856.686
[528]	validation_0-rmse:855.82
[529]	validation_0-rmse:851.086
[530]	validation_0-rmse:847.024
[531]	validation_0-rmse:842.966
[532]	validation_0-rmse:838.233
[533]	validation_0-rmse:834.874
[534]	validation_0-rmse:831.714
[535]	validation_0-rmse:827.771
[536]	validation_0-rmse:821.279
[537]	validation_0-rmse:815.817
[538]	validation_0-rmse:814.344
[539]	validation_0-rmse:810.477
[540]	validation_0-rmse:805.444
[541]	validation_0-rmse:803.644
[542]	validation_0-rmse:800.255
[543]	validation_0-rmse:795.265
[544]	validation_0-rmse:789.427
[545]	validation_0-rmse:781.949
[546]	validation_0-rmse:776.598
[547]	validation_0-rmse:774.597
[548]	valid

[774]	validation_0-rmse:280.659
[775]	validation_0-rmse:279.395
[776]	validation_0-rmse:278.234
[777]	validation_0-rmse:277.416
[778]	validation_0-rmse:277.072
[779]	validation_0-rmse:275.676
[780]	validation_0-rmse:275.145
[781]	validation_0-rmse:274.665
[782]	validation_0-rmse:274.273
[783]	validation_0-rmse:273.09
[784]	validation_0-rmse:272.872
[785]	validation_0-rmse:272.108
[786]	validation_0-rmse:270.996
[787]	validation_0-rmse:269.728
[788]	validation_0-rmse:267.435
[789]	validation_0-rmse:265.422
[790]	validation_0-rmse:262.89
[791]	validation_0-rmse:261.762
[792]	validation_0-rmse:260.346
[793]	validation_0-rmse:259.105
[794]	validation_0-rmse:257.261
[795]	validation_0-rmse:255.919
[796]	validation_0-rmse:255.625
[797]	validation_0-rmse:254.728
[798]	validation_0-rmse:253.167
[799]	validation_0-rmse:252.401
[800]	validation_0-rmse:251.962
[801]	validation_0-rmse:250.963
[802]	validation_0-rmse:249.224
[803]	validation_0-rmse:248.774
[804]	validation_0-rmse:246.971
[805]	vali

In [268]:
#submit
my_submission = pd.DataFrame({'Id': test.Id, 'SalePrice': tuned_xgb_numeric_preds})
my_submission.to_csv('loosely_tuned_xgb_numeric_preds.csv', index = False)

In [None]:
## make an xgboost prediction on encoded data