# Introduction
**This will be your workspace for Kaggle's Machine Learning education track.**

You will build and continually improve a model to predict housing prices as you work through each tutorial.  Fork this notebook and write your code in it.

The data from the tutorial, the Melbourne data, is not available in this workspace.  You will need to translate the concepts to work with the data in this notebook, the Iowa data.

Come to the [Learn Discussion](https://www.kaggle.com/learn-forum) forum for any questions or comments. 

# Write Your Code Below



# - Level 1

## **---- Decision Trees ----------------------**

In [None]:
import pandas as pd

# path to the Iowa training data (variable names keep the tutorial's melbourne_ prefix)
melbourne_file_path = '../input/train.csv'
# read the data and store it in a DataFrame
melbourne_data = pd.read_csv(melbourne_file_path)
# print a summary of the dataset
print(melbourne_data.describe())
# print the column names
print(melbourne_data.columns)

In [None]:
# store the series of prices separately as melbourne_price_data
melbourne_price_data = melbourne_data.SalePrice
# first 5 rows
print(melbourne_price_data.head())

In [None]:
# select two columns of interest
columns_of_interest = ['LotArea','GrLivArea']
# filter the dataframe
two_columns_of_data = melbourne_data[columns_of_interest]
# summary of these two columns
two_columns_of_data.describe()

In [None]:
# create the prediction target vector
y = melbourne_data.SalePrice
# choose predictors (each column listed once)
melbourne_predictors = ['LotArea','YearBuilt','1stFlrSF','2ndFlrSF','FullBath','BedroomAbvGr','TotRmsAbvGrd']
# create the predictors DataFrame
X = melbourne_data[melbourne_predictors]

In [None]:
# creating the model
from sklearn.tree import DecisionTreeRegressor
# define model
melbourne_model = DecisionTreeRegressor()
# fit model
melbourne_model.fit(X,y)

In [None]:
# score the model on the same data it was fit on
# (this is an in-sample error, not a true validation score)
from sklearn.metrics import mean_absolute_error
predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y,predicted_home_prices)
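
For reference, the MAE computed above is the average of the absolute differences between actual and predicted prices. A minimal sketch in plain Python follows; the prices are made up and the helper name `mae_by_hand` is hypothetical, neither comes from the Iowa data.

```python
def mae_by_hand(actual, predicted):
    """Average absolute difference between paired actual and predicted values."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

# Made-up sale prices, for illustration only
actual_prices = [200000, 150000, 320000]
predicted_prices_demo = [210000, 140000, 300000]

demo_mae = mae_by_hand(actual_prices, predicted_prices_demo)
print(demo_mae)
```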

In [None]:
# printing the results
print('Making predictions for the following 5 houses:')
print(X.head())
print('The predictions are:')
print(melbourne_model.predict(X.head()))

### Splitting my dataset into **TRAINING DATA** and **VALIDATION DATA**

In [None]:
from sklearn.model_selection import train_test_split

# split the data into training and validation data, for both predictors and target
# the split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we run this script
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 42)
# define model
melbourne_model = DecisionTreeRegressor()
# fit model
melbourne_model.fit(train_X,train_y)

# get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
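
To see why the held-out score matters, the sketch below fits an unconstrained tree on synthetic data and compares the in-sample MAE with the validation MAE. Everything here is an assumption for illustration: the data is generated with NumPy, and the variable names are chosen so they do not overwrite `train_X`, `val_X`, and friends.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: a noisy linear relationship (illustrative only)
rng = np.random.default_rng(0)
demo_X = rng.uniform(0, 10, size=(300, 2))
demo_y = 3 * demo_X[:, 0] + 2 * demo_X[:, 1] + rng.normal(0, 2, size=300)

# The same random_state gives the same split on every run
tr_X, va_X, tr_y, va_y = train_test_split(demo_X, demo_y, random_state=42)

demo_tree = DecisionTreeRegressor(random_state=42)
demo_tree.fit(tr_X, tr_y)

# Error on rows the tree has seen vs. rows it has not
in_sample_mae = mean_absolute_error(tr_y, demo_tree.predict(tr_X))
validation_mae = mean_absolute_error(va_y, demo_tree.predict(va_X))
print("In-sample MAE:  %.3f" % in_sample_mae)
print("Validation MAE: %.3f" % validation_mae)
```

With the default settings, `train_test_split` holds out a quarter of the rows for validation.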

### Working with **OVERFITTING** and **UNDERFITTING**
#### Controlling Tree Size with `max_leaf_nodes`

In [None]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, predictors_train, predictors_val, targ_train, targ_val):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=42)
    model.fit(predictors_train, targ_train)
    preds_val = model.predict(predictors_val)
    mae = mean_absolute_error(targ_val, preds_val)
    return mae

In [None]:
# compare MAE with different values for max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000, 50000, 500000]:
    mae_dt = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" %(max_leaf_nodes, mae_dt))
    
# Of the values tried, 50 leaf nodes gave the lowest validation MAE in this run.
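
Rather than reading the best value off the printed table, it can be picked programmatically. The sketch below is only an illustration: it uses synthetic NumPy data, a local copy of the helper (named `demo_get_mae` so it does not shadow `get_mae`), and a shortened candidate list.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def demo_get_mae(max_leaf_nodes, X_tr, X_va, y_tr, y_va):
    """Fit a size-limited tree and return its MAE on the validation rows."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=42)
    model.fit(X_tr, y_tr)
    return mean_absolute_error(y_va, model.predict(X_va))

# Synthetic stand-in data (illustrative only)
rng = np.random.default_rng(1)
syn_X = rng.uniform(0, 10, size=(400, 3))
syn_y = 5 * syn_X[:, 0] - 2 * syn_X[:, 1] + rng.normal(0, 3, size=400)
s_tr_X, s_va_X, s_tr_y, s_va_y = train_test_split(syn_X, syn_y, random_state=42)

# Score every candidate, then keep the one with the lowest validation MAE
candidates = [5, 50, 500, 5000]
scores = {n: demo_get_mae(n, s_tr_X, s_va_X, s_tr_y, s_va_y) for n in candidates}
best_leaf_nodes = min(scores, key=scores.get)
print(scores)
print("Best max_leaf_nodes:", best_leaf_nodes)
```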

## Conclusion
A model can suffer from either:

* **Overfitting:** capturing spurious patterns that won't recur in the future, leading to less accurate predictions, or
* **Underfitting:** failing to capture relevant patterns, again leading to less accurate predictions.

## **---- Random Forests ----------------------**

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# fit a random forest on the training split
forest_model = RandomForestRegressor()
forest_model.fit(train_X, train_y)
# score it on the validation split
melb_preds = forest_model.predict(val_X)
mae_rf = mean_absolute_error(val_y, melb_preds)

In [None]:
# note: mae_dt holds the value from the last loop iteration above (max_leaf_nodes=500000)
print("Mean Absolute Error by Decision Tree: %.2f" %mae_dt)
print("Mean Absolute Error by Random Forests: %.2f" %mae_rf)
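
A random forest's prediction is the average of the predictions of its individual trees. The small sketch below checks this on synthetic NumPy data (not the Iowa data) through the fitted `estimators_` attribute; the variable names and the choice of 10 trees are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (illustrative only)
rng = np.random.default_rng(2)
rf_X = rng.uniform(0, 10, size=(200, 3))
rf_y = 4 * rf_X[:, 0] + rng.normal(0, 1, size=200)

demo_forest = RandomForestRegressor(n_estimators=10, random_state=42)
demo_forest.fit(rf_X, rf_y)

# Average the per-tree predictions by hand for the first five rows
per_tree = np.array([tree.predict(rf_X[:5]) for tree in demo_forest.estimators_])
manual_average = per_tree.mean(axis=0)

# Compare with what the forest itself predicts
forest_prediction = demo_forest.predict(rf_X[:5])
print(np.allclose(manual_average, forest_prediction))
```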

## **---- Preparing Submission File ----------------------**

In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Read the data
train = pd.read_csv('../input/train.csv')

# pull data into target (y) and predictors (X)
train_y = train.SalePrice
predictor_cols = ['LotArea', 'OverallQual', 'YearBuilt', 'TotRmsAbvGrd']

# Create training predictors data
train_X = train[predictor_cols]

my_model = RandomForestRegressor()
my_model.fit(train_X, train_y)

In [None]:
# Read the test data
test = pd.read_csv('../input/test.csv')
# Treat the test data in the same way as training data. In this case, pull same columns.
test_X = test[predictor_cols]
# Use the model to make predictions
predicted_prices = my_model.predict(test_X)
# We will look at the predicted prices to ensure we have something sensible.
print(predicted_prices)

In [None]:
my_submission = pd.DataFrame({'Id': test.Id, 'SalePrice': predicted_prices})
# any filename works; we use submission.csv here
my_submission.to_csv('submission.csv', index=False)
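
Before uploading, it can help to confirm the file has the expected shape. The sketch below builds a tiny stand-in frame with made-up `Id` and `SalePrice` values, writes it to an in-memory buffer instead of disk, and reads it back; none of the values come from the real predictions.

```python
import io
import pandas as pd

# Made-up rows standing in for test.Id and predicted_prices
demo_submission = pd.DataFrame({'Id': [1461, 1462, 1463],
                                'SalePrice': [120000.0, 155000.5, 180250.0]})

# Write without the index, exactly as the real submission does
buffer = io.StringIO()
demo_submission.to_csv(buffer, index=False)
buffer.seek(0)

# Read the CSV back and inspect columns and missing values
round_trip = pd.read_csv(buffer)
print(list(round_trip.columns))
print(round_trip.isnull().sum().sum())
```

The same two checks can be run on `my_submission` itself once it has been built.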

# - Level 2
### Handling Missing Values