# Introduction
**This will be your workspace for Kaggle's Machine Learning education track.**

You will build and continually improve a model to predict housing prices as you work through each tutorial.  Fork this notebook and write your code in it.

The data from the tutorial, the Melbourne data, is not available in this workspace.  You will need to translate the concepts to work with the data in this notebook, the Iowa data.

Come to the [Learn Discussion](https://www.kaggle.com/learn-forum) forum for any questions or comments. 

# Write Your Code Below



In [1]:
import pandas as pd

main_file_path = '../input/train.csv'
df = pd.read_csv(main_file_path)
print('hello world')

In [2]:
df.head()

In [3]:
df.describe()

# Selecting and Filtering Data
1. Print a list of the columns
2. From the list of columns, find the name of the column with the sales prices of the homes. Use dot notation to extract this column to a variable (as you saw above to create melbourne_price_data).
3. Use the head command to print out the top few lines of the variable you just created.
4. Pick any two variables and store them to a new DataFrame (as you saw above to create two_columns_of_data.)
5. Use the describe command with the DataFrame you just created to see summaries of those variables. 
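
As a rough sketch of the selection steps above, the snippet below uses a tiny made-up DataFrame (`toy_df` and its values are assumptions for illustration, not the Iowa data). It also shows that a column such as `1stFlrSF` cannot be reached with dot notation, because the name is not a valid Python identifier, so bracket notation is needed there.

```python
import pandas as pd

# Toy stand-in for the Iowa data; the values are made up for illustration.
toy_df = pd.DataFrame({
    "SalePrice": [208500, 181500, 223500],
    "YearBuilt": [2003, 1976, 2001],
    "1stFlrSF": [856, 1262, 920],
})

# Dot notation returns a single column as a Series.
prices = toy_df.SalePrice

# Bracket notation with a string works for any column name,
# including names that are not valid Python identifiers.
first_floor = toy_df["1stFlrSF"]

# A list of names inside the brackets returns a DataFrame.
two_columns = toy_df[["YearBuilt", "1stFlrSF"]]

print(prices.head())
print(two_columns.describe())
```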

In [4]:
sorted(df.columns)

In [5]:
home_sales_prices = df.SalePrice
home_sales_prices.head()

In [6]:
two_columns_of_data = df[['YearBuilt', 'YrSold']]
two_columns_of_data.describe()

# Choosing the Prediction Target
**The steps to building and using a model are:**

* **Define:** What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
* **Fit:** Capture patterns from provided data. This is the heart of modeling.
* **Predict:** Just what it sounds like
* **Evaluate:** Determine how accurate the model's predictions are
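
The four steps can be sketched on synthetic data as follows. The names `toy_X`, `toy_y`, and `toy_model` are hypothetical, and the data is invented so the snippet runs on its own. For brevity it evaluates on the same rows used for fitting, which gives an optimistic score.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): one feature, target = 3 * feature.
toy_X = np.arange(10).reshape(-1, 1)
toy_y = 3.0 * np.arange(10)

# Define: choose the model type and its parameters.
toy_model = DecisionTreeRegressor(random_state=0)

# Fit: capture patterns from the provided data.
toy_model.fit(toy_X, toy_y)

# Predict: ask the fitted model for target values.
toy_preds = toy_model.predict(toy_X)

# Evaluate: compare predictions with the known targets.
toy_mae = mean_absolute_error(toy_y, toy_preds)
print(toy_mae)
```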


**Now it's time for you to define and fit a model for your data (in your notebook).**

1. Select the target variable you want to predict. You can go back to the list of columns from your earlier commands to recall what it's called (hint: you've already worked with this variable). Save this to a new variable called y.

2. Create a list of the names of the predictors we will use in the initial model. Use just the following columns in the list (you can copy and paste the whole list to save some typing, though you'll still need to add quotes):

    - LotArea
    - YearBuilt
    - 1stFlrSF
    - 2ndFlrSF
    - FullBath
    - BedroomAbvGr
    - TotRmsAbvGrd

3. Using the list of variable names you just created, select a new DataFrame of the predictors data. Save this with the variable name X.

4. Create a `DecisionTreeRegressor` and save it to a variable (with a name like my_model or iowa_model). Ensure you've done the relevant import so you can run this command.

5. Fit the model you have created using the data in X and the target data you saved above.

6. Make a few predictions with the model's predict command and print out the predictions.


In [7]:
y = df.SalePrice

In [8]:
predictors = ["LotArea", "YearBuilt", "1stFlrSF", "2ndFlrSF", "FullBath", "BedroomAbvGr", "TotRmsAbvGrd"]
X = df[predictors]

In [9]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

In [10]:
model = DecisionTreeRegressor()

In [11]:
model.fit(X, y)
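
The last step of the exercise, making a few predictions, could look roughly like the sketch below. It is self-contained, so it fits a separate model on a small invented table (`toy_homes` and all its values are assumptions, not rows from the Iowa data) that uses the same predictor column names.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

predictor_names = ["LotArea", "YearBuilt", "1stFlrSF", "2ndFlrSF",
                   "FullBath", "BedroomAbvGr", "TotRmsAbvGrd"]

# Made-up rows for illustration only.
toy_homes = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "YearBuilt": [2003, 1976, 2001, 1915, 2000],
    "1stFlrSF": [856, 1262, 920, 961, 1145],
    "2ndFlrSF": [854, 0, 866, 756, 1053],
    "FullBath": [2, 2, 2, 1, 2],
    "BedroomAbvGr": [3, 3, 3, 3, 4],
    "TotRmsAbvGrd": [8, 6, 6, 7, 9],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

toy_model = DecisionTreeRegressor(random_state=0)
toy_model.fit(toy_homes[predictor_names], toy_homes.SalePrice)

# Predict for the first three homes and print the results
# next to the prices the model was trained on.
first_rows = toy_homes[predictor_names].head(3)
toy_predictions = toy_model.predict(first_rows)
print(toy_predictions)
print(toy_homes.SalePrice.head(3).tolist())
```

Because these rows were also used for fitting, the predictions reproduce the training prices; that says nothing about how the model behaves on unseen homes.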

# Model Validation

1. Use the train_test_split command to split up your data.
2. Fit the model with the training data
3. Make predictions with the validation predictors
4. Calculate the mean absolute error between your predictions and the actual target values for the validation data.
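
A minimal sketch of these four steps on synthetic data is shown below (the `toy_` names and the generated data are assumptions for illustration). It also computes the error on the training rows, to show how far the in-sample figure can sit from the validation figure for an unconstrained tree.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy linear relationship, invented for this sketch.
rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 10, size=(200, 1))
toy_y = 2.0 * toy_X[:, 0] + rng.normal(0, 1, size=200)

# 1. Split the data (train_test_split holds out 25% by default).
toy_train_X, toy_val_X, toy_train_y, toy_val_y = train_test_split(
    toy_X, toy_y, random_state=0
)

# 2. Fit on the training rows only.
toy_model = DecisionTreeRegressor(random_state=0)
toy_model.fit(toy_train_X, toy_train_y)

# 3. Predict on the validation predictors.
toy_val_preds = toy_model.predict(toy_val_X)

# 4. Compare validation predictions with the actual validation targets.
val_mae = mean_absolute_error(toy_val_y, toy_val_preds)

# For contrast: error on the rows the model has already seen.
train_mae = mean_absolute_error(toy_train_y, toy_model.predict(toy_train_X))
print("train MAE:", train_mae)
print("validation MAE:", val_mae)
```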

In [12]:
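# In-sample error: the model is scored on the same rows it was fit on,
# so this number is optimistic compared with the validation error below.
predicted_sale_prices = model.predict(X)
mean_absolute_error(y, predicted_sale_prices)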

In [13]:
from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

In [14]:
# Define model
model = DecisionTreeRegressor()
# Fit model
model.fit(train_X, train_y)

In [15]:
# get predicted prices on validation data
val_predictions = model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

# Underfitting, Overfitting and Model Optimization

In [16]:
def get_mae(max_leaf_nodes, predictors_train, predictors_val, targ_train, targ_val):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(predictors_train, targ_train)
    preds_val = model.predict(predictors_val)
    mae = mean_absolute_error(targ_val, preds_val)
    return mae

In [17]:
# compare MAE with differing values of max_leaf_nodes
import matplotlib.pyplot as plt

number_of_nodes = list(range(2, 1000))
mae = []
for max_leaf_nodes in number_of_nodes:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    mae.append((max_leaf_nodes, my_mae))

plt.ylabel("MAE Score")
plt.xlabel("# of Leaf Nodes")
plt.plot([x[0] for x in mae], [x[1] for x in mae])

In [18]:
min(mae, key=lambda item: item[1])
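
Scanning every value from 2 to 999 fits almost a thousand trees. A cheaper variant, sketched here on synthetic data, compares only a handful of candidate sizes; the candidate list, the `toy_get_mae` helper, and the generated data are all assumptions for illustration rather than part of the tutorial.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, invented for this sketch.
rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 10, size=(300, 2))
toy_y = 5.0 * toy_X[:, 0] + rng.normal(0, 2, size=300)
tr_X, va_X, tr_y, va_y = train_test_split(toy_X, toy_y, random_state=0)


def toy_get_mae(max_leaf_nodes):
    """Fit a tree with the given size limit and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(tr_X, tr_y)
    return mean_absolute_error(va_y, model.predict(va_X))


# Hypothetical coarse grid of tree sizes.
candidate_sizes = [5, 25, 50, 100, 250]
scores = {size: toy_get_mae(size) for size in candidate_sizes}

# Pick the size with the lowest validation MAE.
best_size = min(scores, key=scores.get)
print(best_size, scores[best_size])
```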

# Random Forests

In [19]:
from sklearn.ensemble import RandomForestRegressor

In [21]:
forest_model = RandomForestRegressor()
forest_model.fit(train_X, train_y)
# Predictions are for the Iowa validation data, so name them accordingly
iowa_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, iowa_preds))

In [22]:
def random_forest_mae(max_leaf_nodes, predictors_train, predictors_val, targ_train, targ_val):
    model = RandomForestRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(predictors_train, targ_train)
    preds_val = model.predict(predictors_val)
    mae = mean_absolute_error(targ_val, preds_val)
    return mae

In [25]:
rf_mae = []
for max_leaf_nodes in number_of_nodes:
    my_mae = random_forest_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    rf_mae.append((max_leaf_nodes, my_mae))

plt.ylabel("MAE Score")
plt.xlabel("# of Leaf Nodes")
plt.plot([x[0] for x in rf_mae], [x[1] for x in rf_mae])

In [27]:
min(rf_mae, key=lambda item: item[1])

In [28]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Read the data
train = pd.read_csv('../input/train.csv')

# pull data into target (y) and predictors (X)
train_y = train.SalePrice

# Create training predictors data
train_X = train[predictors]

my_model = RandomForestRegressor()
my_model.fit(train_X, train_y)

In [30]:
# Read the test data
test = pd.read_csv('../input/test.csv')
# Treat the test data in the same way as training data. In this case, pull same columns.
test_X = test[predictors]
# Use the model to make predictions
predicted_prices = my_model.predict(test_X)
# We will look at the predicted prices to ensure we have something sensible.
print(predicted_prices)

In [31]:
my_submission = pd.DataFrame({'Id': test.Id, 'SalePrice': predicted_prices})
# you could use any filename. We choose submission here
my_submission.to_csv('submission.csv', index=False)