# Introduction
**This will be your workspace for Kaggle's Machine Learning education track.**

You will build and continually improve a model to predict housing prices as you work through each tutorial.  Fork this notebook and write your code in it.

The data from the tutorial, the Melbourne data, is not available in this workspace.  You will need to translate the concepts to work with the data in this notebook, the Iowa data.

Come to the [Learn Discussion](https://www.kaggle.com/learn-forum) forum with any questions or comments.

# Write Your Code Below



In [2]:
import pandas as pd

main_file_path = '../input/train.csv'
data = pd.read_csv(main_file_path)
# Print a summary of the Iowa training data
print(data.describe(include = "all"))

In [3]:
print(data.columns)

In [6]:
# Print the first few rows of the SalePrice column
print(data.SalePrice.head())

In [74]:
# Choose the prediction target
y = data.SalePrice
# Choose a hand-picked set of predictors
lowa_predictors = ['LotArea','YearBuilt','1stFlrSF','2ndFlrSF','FullBath','BedroomAbvGr','TotRmsAbvGrd','GrLivArea','Fireplaces','GarageArea','MiscVal']
X_orginal = data[lowa_predictors]
# Alternatively, start from every column except the target...
lowa_predictors_new = data.drop(['SalePrice'], axis=1)
# ...and, for the sake of keeping the example simple, use only numeric predictors.
train_X = lowa_predictors_new.select_dtypes(exclude=['object'])
print(X_orginal.shape)
print(train_X.shape)

In [101]:
# Read the test data
test = pd.read_csv('../input/test.csv')
# Treat the test data in the same way as training data. In this case, pull same columns.
test_orginal = test[lowa_predictors]
test_X = test.select_dtypes(exclude=['object'])
print(test_orginal.shape)
print(test_X.shape)


In [76]:
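# SimpleImputer (sklearn.impute) replaces the Imputer class that older
# scikit-learn releases exposed in sklearn.preprocessing.
from sklearn.impute import SimpleImputer

# Find the training columns that contain missing values
cols_with_missing = [col for col in train_X.columns
                     if train_X[col].isnull().any()]

# Drop those columns from both sets. drop() returns a new DataFrame,
# so train_X and test_X themselves are left unchanged.
imputed_X_train_plus = train_X.drop(cols_with_missing, axis=1)
imputed_X_test_plus = test_X.drop(cols_with_missing, axis=1)
print(imputed_X_train_plus.shape)
print(imputed_X_test_plus.shape)
# Imputation: fill any values still missing in the test data
# with the column means learned from the training data
my_imputer = SimpleImputer()
imputed_X_train_plus = my_imputer.fit_transform(imputed_X_train_plus)
imputed_X_test_plus = my_imputer.transform(imputed_X_test_plus)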
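
The fit-on-train, transform-on-test pattern used above can be seen more easily on a tiny example. The sketch below is only an illustration: the `toy_*` frames and their values are made up, and it assumes a recent scikit-learn release, where the imputer class is `SimpleImputer` in `sklearn.impute`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data (not the Iowa data): each frame has one gap per column.
toy_train = pd.DataFrame({"LotArea": [8000.0, 10000.0, 12000.0],
                          "GarageArea": [400.0, np.nan, 600.0]})
toy_test = pd.DataFrame({"LotArea": [np.nan, 9000.0],
                         "GarageArea": [500.0, np.nan]})

# The default strategy replaces each missing value with the column mean.
toy_imputer = SimpleImputer()
# fit_transform learns the means from the training frame and fills its gaps.
filled_train = toy_imputer.fit_transform(toy_train)
# transform reuses the training means, so the test data never influences them.
filled_test = toy_imputer.transform(toy_test)

print(filled_train)
print(filled_test)
```

Both results are NumPy arrays rather than DataFrames, which is why the column names disappear after imputation.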



**Building Your Model**

You will use the scikit-learn library to create your models. When coding, this library is written as sklearn, as you will see in the sample code. Scikit-learn is easily the most popular library for modeling the types of data typically stored in DataFrames.

The steps to building and using a model are:

- **Define:** What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
- **Fit:** Capture patterns from provided data. This is the heart of modeling.
- **Predict:** Just what it sounds like.
- **Evaluate:** Determine how accurate the model's predictions are.

Here is an example of defining and fitting a model.
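
The four steps can be sketched end to end on a handful of made-up rows before applying them to the Iowa data. Everything named `toy_*` below is hypothetical and exists only for illustration; the scikit-learn calls are the same ones used in the rest of this notebook.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: four houses with two numeric predictors.
toy_X = pd.DataFrame({
    "LotArea": [5000, 7500, 9000, 12000],
    "YearBuilt": [1950, 1975, 1990, 2005],
})
toy_y = pd.Series([100000, 150000, 180000, 250000], name="SalePrice")

# Define: pick the model type and its parameters.
toy_model = DecisionTreeRegressor(random_state=0)
# Fit: learn patterns from the predictors and the target.
toy_model.fit(toy_X, toy_y)
# Predict: here on the training rows themselves, only to show the call.
toy_preds = toy_model.predict(toy_X)
# Evaluate: mean absolute error between actual and predicted prices.
toy_mae = mean_absolute_error(toy_y, toy_preds)
print(toy_mae)
```

An unrestricted tree memorizes these four distinct rows, so the error on the training rows is zero. That says nothing about new houses, which is why the notebook evaluates on held-out validation data instead.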



In [102]:
# Hold out validation data to test model accuracy
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Helper: fit a random forest and return its MAE on the held-out data
def score_dataset(X_train, X_test, y_train, y_test):
    model = RandomForestRegressor()
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return mean_absolute_error(y_test, preds)

# Split data into training and validation data, for both predictors and target.
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
org_train_X, org_val_X, train_y, val_y = train_test_split(imputed_X_train_plus, y, random_state=0)
# Define the model
iowa_model = DecisionTreeRegressor()
# Fit the model
iowa_model.fit(org_train_X, train_y)
# Get predicted prices on the validation data
val_predictions = iowa_model.predict(org_val_X)
print("Mean Absolute Error without imputation:")
print(mean_absolute_error(val_y, val_predictions))
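
The effect of `random_state` on the split is easy to check in isolation. The following is a minimal sketch on a made-up array of 20 row indices (`toy_rows` is hypothetical), relying on the default `test_size` of 0.25.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for a dataset: 20 row indices.
toy_rows = np.arange(20)

# Two splits with the same seed select exactly the same rows.
first_train, first_val = train_test_split(toy_rows, random_state=0)
second_train, second_val = train_test_split(toy_rows, random_state=0)

same_split = np.array_equal(first_val, second_val)
print(same_split)
# With the default test_size, a quarter of the rows go to validation.
print(len(first_train), len(first_val))
```

Without a fixed seed the rows assigned to validation can change from run to run, so MAE values from different runs would not be directly comparable.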


In [94]:
# Create a utility function to compare MAE scores for different values of max_leaf_nodes
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes,predictors_train,predictors_val,target_train,target_val):
    model = DecisionTreeRegressor(max_leaf_nodes = max_leaf_nodes, random_state = 0)
    model.fit(predictors_train,target_train)
    preds_val = model.predict(predictors_val)
    mae = mean_absolute_error(target_val,preds_val)
    return mae

In [95]:
# Use a for loop to compare different values of max_leaf_nodes
for max_leaf_nodes in [5,50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes,org_train_X,org_val_X,train_y, val_y)
    print("max leaf nodes : %d \t\t Mean Absolute Error : %d"%(max_leaf_nodes,my_mae))
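
Rather than reading the best tree size off the printed lines, the scores can be collected and the minimum picked programmatically. The sketch below assumes synthetic data generated with NumPy (`toy_X`, `toy_y`, and `candidate_sizes` are made up), so its numbers say nothing about the Iowa data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data: a noisy linear relationship on 200 rows.
rng = np.random.RandomState(0)
toy_X = rng.uniform(0, 10, size=(200, 2))
toy_y = 3.0 * toy_X[:, 0] + rng.normal(0, 1, size=200)

tr_X, va_X, tr_y, va_y = train_test_split(toy_X, toy_y, random_state=0)

# Record the validation MAE for each candidate tree size.
candidate_sizes = [5, 50, 500]
scores = {}
for size in candidate_sizes:
    tree = DecisionTreeRegressor(max_leaf_nodes=size, random_state=0)
    tree.fit(tr_X, tr_y)
    scores[size] = mean_absolute_error(va_y, tree.predict(va_X))

# The size with the lowest validation MAE is the one to keep.
best_size = min(scores, key=scores.get)
print(scores)
print(best_size)
```

The same pattern applies to `get_mae` above: store each returned value in a dictionary keyed by `max_leaf_nodes`, then take the key with the smallest MAE.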

In [103]:
# Build a random forest; its scikit-learn interface mirrors the decision tree's
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor()
forest_model.fit(org_train_X,train_y)
iowa_preds = forest_model.predict(org_val_X)
print(mean_absolute_error(val_y,iowa_preds))



In [98]:
# Use the model to make predictions
predicted_prices = forest_model.predict(imputed_X_test_plus)
# We will look at the predicted prices to ensure we have something sensible.
print(predicted_prices)

In [99]:
# Build the submission from the final predictions
my_submission = pd.DataFrame({'Id': test.Id, 'SalePrice': predicted_prices})
# You could use any filename; we chose submission_new.csv here
my_submission.to_csv('submission_new.csv', index=False)