# Introduction
**This will be your workspace for Kaggle's Machine Learning education track.**

You will build and continually improve a model to predict housing prices as you work through each tutorial.  Fork this notebook and write your code in it.

The tutorials use the Melbourne housing data, which is not available in this workspace.  You will need to apply the same concepts to the data in this notebook, the Iowa housing data.

Bring any questions or comments to the [Learn Discussion](https://www.kaggle.com/learn-forum) forum.

# Write Your Code Below



In [1]:
import pandas as pd

main_file_path = '../input/train.csv'
data = pd.read_csv(main_file_path)
print('hello world')

In [2]:
data.head()

In [3]:
data.describe()

In [4]:
data.columns

In [5]:
selected = data[['LotArea', 'SalePrice']]

In [6]:
selected.describe(include='all')

In [7]:
# Prediction target: the sale price of each house
target = data['SalePrice']
# Numeric columns used as predictors
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
train = data[features]

In [8]:
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()
model.fit(train, target)

In [9]:
print("Making predictions for the following 5 houses:")
print(train.head())

In [10]:
prediction = model.predict(train.head())
print("The predictions are")
srs = pd.Series(prediction)
srs

In [11]:
# Compare the predictions with the actual sale prices for the first 5 houses
df_compare = pd.concat([data.loc[:4, 'SalePrice'], srs], axis=1)
df_compare = df_compare.rename(columns={'SalePrice': 'RealPrice', 0: 'PredictedPrice'})
df_compare

In [12]:
from sklearn.metrics import mean_absolute_error
# In-sample error: the model is scored on the same rows it was trained on
in_sample_predictions = model.predict(train)
mean_absolute_error(target, in_sample_predictions)
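The score in the cell above is measured on the rows the tree was fitted to, so it is likely to look better than the error on unseen houses. As a reminder of what the metric computes, here is a minimal pure-Python sketch of mean absolute error. The helper name `mean_abs_error` and the prices are made up for illustration and are not taken from the Iowa data; in the notebook itself, `sklearn.metrics.mean_absolute_error` does this job.

```python
def mean_abs_error(actual, predicted):
    """Average of |actual - predicted| over paired values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("need at least one value")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up prices, for illustration only (not from the Iowa data).
actual = [200000, 150000, 300000]
predicted = [210000, 140000, 330000]

# The absolute errors are 10000, 10000 and 30000; the MAE is their mean.
print(mean_abs_error(actual, predicted))
```

Because the errors are averaged in the units of the target, the result reads directly as "dollars off per house on average".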

In [13]:
# Split the data into training and validation sets
from sklearn.model_selection import train_test_split
X_train, val_X, Y_train, val_Y = train_test_split(train, target, random_state=0)
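When no sizes are passed, `train_test_split` holds out 25% of the rows, and `random_state=0` makes the shuffle repeatable. The sketch below illustrates the idea only, using the standard library; it is not scikit-learn's implementation and will not reproduce the same rows. The function name `holdout_split` and its parameters are hypothetical.

```python
import random

def holdout_split(n_rows, val_fraction=0.25, seed=0):
    """Return (train_idx, val_idx) after a seeded shuffle of range(n_rows)."""
    indices = list(range(n_rows))
    # A fixed seed gives the same shuffle, and so the same split, on every run.
    random.Random(seed).shuffle(indices)
    n_val = int(n_rows * val_fraction)
    # The first n_val shuffled rows are held out; the rest are used for training.
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = holdout_split(8)
print(len(train_idx), len(val_idx))
```

Every row lands in exactly one of the two sets, which is what keeps the validation error an honest estimate.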

In [14]:
model_data = DecisionTreeRegressor()
model_data.fit(X_train, Y_train)

In [15]:
# get predicted prices on validation data
val_predictions = model_data.predict(val_X)
print(mean_absolute_error(val_Y, val_predictions))

In [16]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, predictors_train, predictors_val, targ_train, targ_val):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(predictors_train, targ_train)
    preds_val = model.predict(predictors_val)
    mae = mean_absolute_error(targ_val, preds_val)
    return mae

In [17]:
# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, X_train, val_X, Y_train, val_Y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Of the four values tried, `max_leaf_nodes=50` gave the lowest validation MAE.
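Rather than reading the best setting off the printed table, the choice can be made in code. The following is a minimal sketch, assuming a scoring function where lower is better. `pick_best` and `toy_score` are hypothetical names, and `toy_score` is only a stand-in so the block runs without the data; it is not a real MAE.

```python
def pick_best(candidates, score_fn):
    """Return (best_candidate, scores), where the lowest score wins."""
    scores = {c: score_fn(c) for c in candidates}
    best = min(scores, key=scores.get)
    return best, scores

# Stand-in scorer so the example is self-contained; NOT a real validation error.
def toy_score(n):
    return (n - 3) ** 2

best, scores = pick_best([1, 2, 3, 4], toy_score)
print(best, scores)
```

In this notebook the scorer could be `lambda n: get_mae(n, X_train, val_X, Y_train, val_Y)` with the candidates `[5, 50, 500, 5000]`.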

In [18]:
# Fit a random forest on the same training/validation split for comparison
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor()
forest_model.fit(X_train, Y_train)
predict_forest = forest_model.predict(val_X)
print(mean_absolute_error(val_Y, predict_forest))

In [19]:
# Read the test data
test = pd.read_csv('../input/test.csv')
# Treat the test data the same way as the training data. In this case, pull the same columns.
test_X = test[features]
# Use the model to make predictions
predicted_prices = forest_model.predict(test_X)
# We will look at the predicted prices to ensure we have something sensible.
print(predicted_prices)
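Printing the array gives a quick visual check. A few automated checks can back that up before a submission file is written. The sketch below is one possible set of checks, not a Kaggle requirement; the function name `check_predictions` and the sample values are made up for illustration.

```python
import math

def check_predictions(ids, prices):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if len(ids) != len(prices):
        problems.append("row count mismatch: %d ids vs %d prices" % (len(ids), len(prices)))
    if len(set(ids)) != len(ids):
        problems.append("duplicate ids")
    # A sale price should be a finite, positive number.
    bad = [p for p in prices if not math.isfinite(p) or p <= 0]
    if bad:
        problems.append("%d non-finite or non-positive prices" % len(bad))
    return problems

# Made-up values, for illustration only.
print(check_predictions([1, 2, 3], [120000.0, 95000.5, 210000.0]))
print(check_predictions([1, 2, 2], [120000.0, float("nan"), -5.0]))
```

In this notebook the call could look like `check_predictions(list(test.Id), list(predicted_prices))`.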

In [20]:
my_submission = pd.DataFrame({'Id': test.Id, 'SalePrice': predicted_prices})
# Any filename works; submission.csv is used here
my_submission.to_csv('submission.csv', index=False)

In [21]:
my_submission.head()