# Introduction
**This will be your workspace for Kaggle's Machine Learning education track.**

You will build and continually improve a model to predict housing prices as you work through each tutorial.  Fork this notebook and write your code in it.

The data from the tutorial, the Melbourne data, is not available in this workspace.  You will need to translate the concepts to work with the data in this notebook, the Iowa data.

Come to the [Learn Discussion](https://www.kaggle.com/learn-forum) forum for any questions or comments. 

# Write Your Code Below



In [None]:
import pandas as pd

main_file_path = '../input/train.csv'
data = pd.read_csv(main_file_path)
print(data.describe())

In [None]:
print(data.columns)

In [None]:
print(data.SalePrice.head())

In [None]:
print(data[['LotFrontage', 'LotArea']].describe())

# Training

In [57]:
# import pandas as pd
from sklearn.tree import DecisionTreeRegressor

y = data.SalePrice
cols = [
    'LotArea'
    , 'YearBuilt'
    , '1stFlrSF'
    , '2ndFlrSF'
    , 'FullBath'
    , 'BedroomAbvGr'
    , 'TotRmsAbvGrd'
]
X = data[cols]

model = DecisionTreeRegressor()
model.fit(X, y)

print('Predicting "head" of training set:')
print('----------------------------------------')
print(X.head())
print('Predicted price:')
print('--------')
print(model.predict(X.head()))
print('Real price:')
print('-----------')
print(data.SalePrice.head())
print('Looks a bit *too* accurate...')



# Validation

In [28]:
from sklearn.metrics import mean_absolute_error as mae
y_pred = model.predict(X)
mae(y, y_pred)  # sklearn's argument order is (y_true, y_pred); the MAE value is the same either way

The metric above is hard to interpret on its own because it is an absolute number in the same units as the sale price. It looks fairly small compared with the actual sale prices, which makes sense since we are scoring the model on the data it was trained on and overfitting:

In [15]:
y.mean()

However, to get a better idea of how small the error is, I would at least normalize the MAE by the average sale price:

In [16]:
mae(y, y_pred) / y.mean()

0.03% prediction error -- now that's what I call overfitting. Onward to the validation tutorial.
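As a minimal sketch of the normalization used above, the helper below divides the MAE by the mean of the true values. The function name `normalized_mae` and the toy prices are invented for illustration and are not part of the tutorial or of scikit-learn. It uses plain Python so it runs without the Iowa data.

```python
def normalized_mae(y_true, y_pred):
    """Return the MAE divided by the mean of the true values (hypothetical helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    mae_value = sum(abs_errors) / len(abs_errors)
    mean_true = sum(y_true) / len(y_true)
    return mae_value / mean_true


# Toy sale prices, invented for this example only.
toy_true = [100_000, 200_000, 300_000]
toy_pred = [110_000, 190_000, 300_000]

toy_nmae = normalized_mae(toy_true, toy_pred)
print(f"normalized MAE: {toy_nmae:.2%}")
```

For the toy values the errors are 10,000, 10,000 and 0, so the MAE is about 6,667 against a mean price of 200,000, which is roughly 3.3%. A relative number like this is easier to compare across datasets than the raw dollar figure.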

In [27]:
from sklearn.model_selection import train_test_split

# splitting data into training and validation sets:
X_trn, X_val, y_trn, y_val = train_test_split(X, y, random_state = 0)

model.fit(X_trn, y_trn)

y_val_pred = model.predict(X_val)
print('mean absolute validation error:')
print(mae(y_val, y_val_pred))

print('and normalized validation mae:')
print(mae(y_val, y_val_pred)/y_val.mean())

About 20% validation error, which is more like real life.
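The gap between training error and validation error can also be reproduced without the Iowa data. The sketch below is only an illustration under stated assumptions: the synthetic features, the noise level, and names such as `X_demo` and `demo_tree` are invented, while the scikit-learn calls are the same ones used in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic "prices": a linear signal plus noise (assumed values, not real data).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(500, 3))
y_demo = 50_000 + 20_000 * X_demo[:, 0] + rng.normal(0, 15_000, size=500)

# Hold out part of the data, as in the cell above.
X_a, X_b, y_a, y_b = train_test_split(X_demo, y_demo, random_state=0)

# A fully grown tree can memorize the noise in its training rows.
demo_tree = DecisionTreeRegressor(random_state=0)
demo_tree.fit(X_a, y_a)

train_mae = mean_absolute_error(y_a, demo_tree.predict(X_a))
val_mae = mean_absolute_error(y_b, demo_tree.predict(X_b))

print(f"training MAE:   {train_mae:,.0f}")
print(f"validation MAE: {val_mae:,.0f}")
```

The training error comes out far below the validation error because the unrestricted tree fits each training row individually. Only the held-out score says anything about unseen houses.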

# Tuning tree size (max leaf nodes)

In [54]:
def get_mae( max_leafs, X_trn, X_val, y_trn, y_val ):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leafs, random_state=0)
    model.fit(X_trn, y_trn)
    y_val_pred = model.predict(X_val)
    return mae(y_val, y_val_pred)

for max_leafs in [5, 50, 500, 5000]:
    mae_n = get_mae(max_leafs, X_trn, X_val, y_trn, y_val)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leafs, mae_n))
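One way to pick a value from a sweep like the one above is to store each score in a dictionary and take the key with the smallest validation MAE. The sketch below does this on synthetic data. The data, the candidate list, and names such as `X_syn`, `scores` and `best_leaves` are assumptions made for illustration, so the winning value here says nothing about the Iowa data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (invented for this example).
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(400, 2))
y_syn = (
    10_000 * np.sin(X_syn[:, 0])
    + 5_000 * X_syn[:, 1]
    + rng.normal(0, 3_000, size=400)
)

X_fit, X_chk, y_fit, y_chk = train_test_split(X_syn, y_syn, random_state=0)

# Record the validation MAE for each candidate tree size.
candidates = [5, 50, 500, 5000]
scores = {}
for n_leaves in candidates:
    sweep_tree = DecisionTreeRegressor(max_leaf_nodes=n_leaves, random_state=0)
    sweep_tree.fit(X_fit, y_fit)
    scores[n_leaves] = mean_absolute_error(y_chk, sweep_tree.predict(X_chk))

# The key with the smallest score is the preferred setting.
best_leaves = min(scores, key=scores.get)
print(f"best max_leaf_nodes: {best_leaves} (validation MAE {scores[best_leaves]:,.0f})")
```

Keeping the scores in a dictionary makes it easy to inspect the whole sweep afterwards, not just the winner.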

# Random Forest

In [56]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(random_state=0)  # fixed seed so the result is reproducible
model.fit(X_trn, y_trn)
y_val_pred = model.predict(X_val)
print('Random forest validation set prediction error:')
print(mae(y_val, y_val_pred))

print('And normalized by mean sale price:')
print(mae(y_val, y_val_pred)/y_val.mean())

# Tuning the random forest

In [67]:
def get_mae_rf( max_leafs, X_trn, X_val, y_trn, y_val ):
    model = RandomForestRegressor(max_leaf_nodes=max_leafs, random_state=0)
    model.fit(X_trn, y_trn)
    y_val_pred = model.predict(X_val)
    return mae(y_val, y_val_pred)

for max_leafs in range(60, 70):  # fine-grained search; earlier coarse grid: [5, 50, 500, 5000]
    mae_n = get_mae_rf(max_leafs, X_trn, X_val, y_trn, y_val)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leafs, mae_n))

# Submitting

In [69]:
data_tst = pd.read_csv('../input/test.csv')
X_tst = data_tst[cols]
model = RandomForestRegressor(max_leaf_nodes=64, random_state=0)
model.fit(X_trn, y_trn)
y_tst_pred = model.predict(X_tst)
print(y_tst_pred)

In [70]:
sub = pd.DataFrame({'Id':data_tst.Id, 'SalePrice': y_tst_pred})
sub.to_csv('submission.csv', index = False)