# Introduction
**This will be your workspace for Kaggle's Machine Learning education track.**

You will build and continually improve a model to predict housing prices as you work through each tutorial.  Fork this notebook and write your code in it.

The tutorials use the Melbourne housing data, which is not available in this workspace.  You will need to adapt the concepts to the Iowa data provided in this notebook.

Visit the [Learn Discussion](https://www.kaggle.com/learn-forum) forum with any questions or comments.

# Write Your Code Below



In [25]:
import pandas as pd

main_file_path = '../input/train.csv'
iowa_data = pd.read_csv(main_file_path)
print('hello world')

## Using Pandas to Get Familiar with Your Data

In [26]:
print(iowa_data.describe())

# Selecting and Filtering Data

In [27]:
print(iowa_data.columns)

## Selecting a Single Column

In [28]:
iowa_data_lot_area = iowa_data.LotArea
print(iowa_data_lot_area.head())
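
Dot access such as `iowa_data.LotArea` works only when the column name is a valid Python identifier. Several Iowa columns (for example `1stFlrSF`) start with a digit, so they need bracket access instead. A minimal sketch, using a tiny made-up frame named `demo` rather than the real data:

```python
import pandas as pd

# Made-up values, for illustration only
demo = pd.DataFrame({"LotArea": [1000, 2000], "1stFlrSF": [800, 900]})

# Dot access is fine for identifier-like names
lot_area = demo.LotArea

# Bracket access works for any column name, including ones starting with a digit
first_floor = demo["1stFlrSF"]

print(lot_area.head())
print(first_floor.head())
```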

## Selecting Multiple Columns

In [29]:
columns_of_interest = ["LotArea", "LotShape"]
iowa_data_lot_area_shape = iowa_data[columns_of_interest]

In [30]:
iowa_data_lot_area_shape.describe()

## Choosing the Prediction Target

In [31]:
y = iowa_data.SalePrice

## Choosing Predictors

In [32]:
iowa_predictors = ["LotArea", "YearBuilt", "1stFlrSF", "2ndFlrSF", "FullBath", "BedroomAbvGr", "TotRmsAbvGrd"]

In [33]:
X = iowa_data[iowa_predictors]

## Building Your Model

In [34]:
from sklearn.tree import DecisionTreeRegressor

# Define model
iowa_model = DecisionTreeRegressor()

# Fit model (train model)
iowa_model.fit(X, y)

In [35]:
step = 5
for i in range(0, 20, step):
    print("Making predictions for the following 5 houses")
    print(X[i:i + step])
    print("The predictions are")
    print(iowa_model.predict(X[i: i + step]))

## Model Validation

In [36]:
from sklearn.metrics import mean_absolute_error

predicted_home_prices = iowa_model.predict(X)
mean_absolute_error(y, predicted_home_prices)
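
The score above is computed on the same rows the tree was fitted to, so it is likely to look better than the model's accuracy on unseen homes. As a reminder of what the metric measures, mean absolute error is the average of the absolute differences between actual and predicted values. A minimal sketch in plain Python, using made-up prices and a hypothetical helper `mean_abs_error` (not part of scikit-learn):

```python
def mean_abs_error(actual, predicted):
    """Average of |actual - predicted| over paired values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up prices, purely for illustration
actual_prices = [200000, 150000, 320000]
predicted_prices = [210000, 140000, 300000]

# Absolute errors are 10000, 10000 and 20000, so the mean is 40000 / 3
mae = mean_abs_error(actual_prices, predicted_prices)
print(mae)
```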

In [37]:
from sklearn.model_selection import train_test_split

# Passing a fixed random_state makes the split identical on every run
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Define model
iowa_model = DecisionTreeRegressor()

# Fit model (train it)
iowa_model.fit(train_X, train_y)

# Predict prices using validation set
val_predictions = iowa_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

## Experimenting with Different Models


In [38]:
def get_mae(max_leaf_nodes, predictors_train, predictors_val, targ_train, targ_val):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(predictors_train, targ_train)
    preds_val = model.predict(predictors_val)
    mae = mean_absolute_error(targ_val, preds_val)
    return mae

In [39]:
for max_leaf_nodes in [5, 25, 32, 38, 44, 50, 100, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" %(max_leaf_nodes, my_mae))
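
One way to turn the loop above into a decision is to store each score and take the minimum. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_regression` in place of the Iowa data, and `demo_mae`, `candidates`, and the `*_demo` variables are hypothetical names.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (assumption, not the Iowa data)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
train_X_demo, val_X_demo, train_y_demo, val_y_demo = train_test_split(
    X_demo, y_demo, random_state=0
)

def demo_mae(max_leaf_nodes):
    """Fit a tree of the given size and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X_demo, train_y_demo)
    return mean_absolute_error(val_y_demo, model.predict(val_X_demo))

candidates = [5, 25, 50, 100]

# Map each candidate tree size to its validation MAE, then pick the smallest
scores = {n: demo_mae(n) for n in candidates}
best_size = min(scores, key=scores.get)
print(best_size, scores[best_size])
```

With the notebook's own data, the same pattern would call `get_mae(n, train_X, val_X, train_y, val_y)` inside the dictionary comprehension.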

## Random Forests

In [40]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor()
forest_model.fit(train_X, train_y)
iowa_preds = forest_model.predict(val_X)
# Pass the true values first, then the predictions (y_true, y_pred)
print(mean_absolute_error(val_y, iowa_preds))

## Submitting from a Kernel

In [41]:
# Read the test data

test = pd.read_csv("../input/test.csv")
test.describe()

In [44]:

# Treat the test data the same way as the training data. Pull the same columns
test_X = test[iowa_predictors]

# Use the model to make predictions
predicted_home_prices = forest_model.predict(test_X)

# Peek at the predicted prices to check that they look sensible
print(predicted_home_prices)


## Prepare Submission File

In [43]:
my_submission = pd.DataFrame({
    "Id": test.Id,
    "SalePrice": predicted_home_prices
})
# You can use any filename
my_submission.to_csv("submission.csv", index=False)
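
Before submitting, it can help to read the file back and confirm it has the expected columns and no missing predictions. A rough sketch, with a made-up three-row frame standing in for `my_submission`; the temporary directory and the `demo_submission` and `check` names are illustrative only.

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for the real submission
demo_submission = pd.DataFrame({
    "Id": [1, 2, 3],
    "SalePrice": [120000.0, 155000.0, 180000.0],
})

# Write to a throwaway directory, then read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "submission.csv")
    demo_submission.to_csv(path, index=False)
    check = pd.read_csv(path)

# index=False means only the two data columns should be present
print(list(check.columns))
print(check.shape)
print(check["SalePrice"].isna().sum())
```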