# Introduction
**This will be your workspace for the [Machine Learning course](https://www.kaggle.com/learn/machine-learning).**

You will need to translate the concepts to work with the data in this notebook, the Iowa data. Each page in the Machine Learning course includes instructions for what code to write at that step in the course.

# Write Your Code Below

In [None]:
import pandas as pd

main_file_path = '../input/house-prices-advanced-regression-techniques/train.csv' 
# this is the path to the Iowa data that you will use
data = pd.read_csv(main_file_path)

# Run this code block with the control-enter keys on your keyboard.
# Or click the blue button on the left
print('Some output from running this cell')

In [None]:
# save filepath to variable for easier access
iowa_file_path = '../input/house-prices-advanced-regression-techniques/train.csv'
# read the data and store data in DataFrame titled iowa_data
iowa_data = pd.read_csv(iowa_file_path) 
# print a summary of the data in Iowa data
print(iowa_data.describe())

The summary above shows that `Id` runs from 1 to 1460 and appears to be a unique identifier, `MSSubClass` (building class) ranges from 20 to 190, and `YrSold` (year sold) ranges from 2006 to 2010. The sale years are a concern if this data is used for current-year projections, because the data is dated. `SalePrice` ranges from 34,900 to 755,000, which is a very wide range.
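
The observations above can also be checked directly rather than read off the `describe()` table. The sketch below uses a small hypothetical stand-in frame named `toy` (four made-up rows, not the real Iowa data) so that it runs on its own. In this notebook, the same calls should work on `iowa_data`.

```python
import pandas as pd

# Hypothetical stand-in rows (assumption); in the notebook, use iowa_data instead.
toy = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'MSSubClass': [20, 60, 190, 20],
    'YrSold': [2006, 2008, 2010, 2007],
    'SalePrice': [34900, 208500, 755000, 181500],
})

# True when every Id value occurs exactly once
id_is_unique = toy['Id'].is_unique

# Minimum and maximum of each column of interest
ranges = toy[['MSSubClass', 'YrSold', 'SalePrice']].agg(['min', 'max'])

print(id_is_unique)
print(ranges)
```

Checking `is_unique` confirms whether a column can serve as an identifier, which the minimum and maximum alone cannot show.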

In [None]:
# print list of all columns in the data
print(iowa_data.columns)

In [None]:
# store the series of sales prices separately as iowa_price_data
iowa_price_data = iowa_data.SalePrice
# print out the top few lines of this data
print(iowa_price_data.head())

In [None]:
# select two variables of interest and store in a new DataFrame
columns_of_interest = ['LotArea', '1stFlrSF']
two_columns_of_data = iowa_data[columns_of_interest]
# use describe to see a summary of these two variables
two_columns_of_data.describe()

In [None]:
from sklearn.tree import DecisionTreeRegressor
# define y as the prediction target, sale price
y = iowa_data.SalePrice
# define X as the predictors: lot area, year built, 1st floor sq ft, 2nd floor sq ft,
# full bath, bedroom above ground, and total rooms above ground
iowa_predictors = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = iowa_data[iowa_predictors]

In [None]:
# define the model
iowa_model = DecisionTreeRegressor()
# fit the model
iowa_model.fit(X, y)

In [None]:
# make predictions using the first few rows of the training data for practice
print('Making predictions for the following five houses:')
print(X.head())
print('The predictions are:')
print(iowa_model.predict(X.head()))

In [None]:
# mean absolute error calculation (just for practice)
from sklearn.metrics import mean_absolute_error

predicted_home_prices = iowa_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

In [None]:
# split up data into two pieces, one for training and one for validation
# evaluate model on different data than used to train it - avoid spurious correlations

from sklearn.model_selection import train_test_split

# split data into training and validation data, for both predictors and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
# Define model
iowa_model = DecisionTreeRegressor()
# Fit model
iowa_model.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = iowa_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

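Note that the mean absolute error on the validation data is much higher than the in-sample error calculated previously. The earlier score was computed on the same rows the model was trained on, so it overstates how well the model predicts houses it has not seen.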
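
The gap between the two scores can be reproduced on any data set. The following is a minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are made up for illustration and are not the Iowa data). It fits an unrestricted decision tree and scores it both on its own training rows and on held-out rows.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 200 rows, 3 predictors, noisy target
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(0, 2.0, size=200)

tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)

# No depth or leaf limit, so the tree is free to memorize the training rows
tree = DecisionTreeRegressor(random_state=0).fit(tr_X, tr_y)

# Score on the rows used for fitting, then on rows the tree never saw
in_sample_mae = mean_absolute_error(tr_y, tree.predict(tr_X))
validation_mae = mean_absolute_error(va_y, tree.predict(va_X))
print(in_sample_mae, validation_mae)
```

Because an unrestricted tree can give each training row its own leaf, the in-sample error is close to zero, while the validation error reflects the noise the tree could not have learned.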

In [None]:
# use a utility function to compare MAE scores for different max_leaf_nodes

def get_mae(max_leaf_nodes, predictors_train, predictors_val, targ_train, targ_val):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(predictors_train, targ_train)
    preds_val = model.predict(predictors_val)
    mae = mean_absolute_error(targ_val, preds_val)
    return mae

In [None]:
# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 25, 44, 45, 46, 47, 48, 50, 52, 100, 250, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Of the values tried for max_leaf_nodes, 45 produces the lowest validation MAE.
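
Instead of reading the best value off the printed list, the scores can be stored and the minimum picked programmatically. The sketch below is one way to do that, assuming synthetic data and a hypothetical helper `validation_mae` that mirrors `get_mae`. The candidate list is illustrative, not the one used above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption), not the Iowa data
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = np.sin(X_demo[:, 0]) * 10 + X_demo[:, 1] + rng.normal(0, 1.0, size=300)
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)


def validation_mae(max_leaf_nodes):
    """Fit a tree with the given leaf limit and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(tr_X, tr_y)
    return mean_absolute_error(va_y, model.predict(va_X))


# Map each candidate tree size to its validation MAE
candidates = [5, 25, 50, 100, 250]
scores = {n: validation_mae(n) for n in candidates}

# The key with the smallest score is the best tree size among those tried
best_size = min(scores, key=scores.get)
print(best_size, scores[best_size])
```

The result is only the best among the candidates tried, and it depends on the particular train/validation split.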

In [None]:
# build a random forest model
from sklearn.ensemble import RandomForestRegressor

forest_model = RandomForestRegressor()
forest_model.fit(train_X, train_y)
iowa_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, iowa_preds))

The random forest improves on the best decision tree: a validation MAE of 23,852 versus 27,531.
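
Because `RandomForestRegressor()` is created above without a `random_state`, rerunning the cell may give a slightly different error. A minimal sketch of making the score repeatable follows, assuming synthetic data and a hypothetical helper `forest_mae`. The value `n_estimators=50` is an arbitrary choice to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), not the Iowa data
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(0, 2.0, size=200)
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)


def forest_mae(seed):
    """Fit a forest with a fixed seed and return its validation MAE."""
    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    model.fit(tr_X, tr_y)
    return mean_absolute_error(va_y, model.predict(va_X))


# Two fits with the same seed should agree
first_run = forest_mae(seed=0)
second_run = forest_mae(seed=0)
print(np.isclose(first_run, second_run))
```

Fixing the seed makes a comparison such as the one above reproducible from run to run.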

In [None]:
# submitting from a kernel - practice

# Read the test data
test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
# Treat the test data in the same way as training data. In this case, pull same columns.
test_X = test[iowa_predictors]
# Use the random forest, which scored better on the validation data, to make predictions
predicted_prices = forest_model.predict(test_X)
# We will look at the predicted prices to ensure we have something sensible.
print(predicted_prices)

In [None]:
# prepare the submission file

my_submission = pd.DataFrame({'Id': test.Id, 'SalePrice': predicted_prices})
# you could use any filename. We choose submission here
# include index=False to prevent pandas from adding another column to the file
my_submission.to_csv('submission.csv', index=False)


**If you have any questions or hit any problems, come to the [Learn Discussion](https://www.kaggle.com/learn-forum) for help.**

**Return to [ML Course Index](https://www.kaggle.com/learn/machine-learning)**