# Introduction
**This will be your workspace for the [Machine Learning course](https://www.kaggle.com/learn/machine-learning).**

You will need to translate the concepts to work with the data in this notebook, the Iowa data. Each page in the Machine Learning course includes instructions for what code to write at that step in the course.

# Write Your Code Below

In [2]:
import pandas as pd

iowa_file_path = '../input/train.csv' # this is the path to the Iowa data that you will use
iowa_data = pd.read_csv(iowa_file_path)

# print a summary of the Iowa data
print(iowa_data.describe())



In [3]:
# print the columns in the iowa data
print(iowa_data.columns)

Selecting single and multiple columns

In [4]:
# get the price data
iowa_price_data = iowa_data.SalePrice
# print a few lines of the price data
print(iowa_price_data.head())

# two columns of the data
columns_of_interest = ["LotShape", "CentralAir"]
two_columns_data = iowa_data[columns_of_interest]
print(two_columns_data.describe())
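
As a small illustration of the difference between the two selections above: attribute access such as `iowa_data.SalePrice` gives back a single pandas Series, while indexing with a list of column names gives back a DataFrame. The sketch below uses a tiny made-up table (`toy_data` and its values are invented for illustration, not taken from the Iowa file).

```python
import pandas as pd

# Made-up rows standing in for the Iowa data; values are for illustration only.
toy_data = pd.DataFrame({
    "SalePrice": [200000, 150000, 300000],
    "LotShape": ["Reg", "IR1", "Reg"],
    "CentralAir": ["Y", "Y", "N"],
})

one_column = toy_data.SalePrice                      # attribute access -> Series
two_columns = toy_data[["LotShape", "CentralAir"]]   # list of names -> DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(list(two_columns.columns))
```

Because both selected columns hold text, `describe()` on them reports `count`, `unique`, `top` and `freq` rather than means and quartiles.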

About to write some kernels now.... Yay! I'm excited. Can't wait.

In [5]:
# choosing the prediction target
y = iowa_data.SalePrice

# choosing the predictors. We will start with the numeric columns for now.
iowa_data_predictors = ["LotArea", "YearBuilt", "1stFlrSF", "2ndFlrSF", "FullBath", "BedroomAbvGr", "TotRmsAbvGrd"]

X = iowa_data[iowa_data_predictors]

Building the models

In [6]:
from sklearn.tree import DecisionTreeRegressor

# Define model
iowa_model = DecisionTreeRegressor()

# Fit model
iowa_model.fit(X, y)

In [10]:
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions of a few of the houses are: ")
print(iowa_model.predict(X.head()))

# print the actual prices to compare against the predictions
print(iowa_price_data.head())

Evaluating the model we just built

In [11]:
from sklearn.metrics import mean_absolute_error

# do the predictions and evaluate our model.
predicted_iowa_prices = iowa_model.predict(X)
print("Mean Absolute Error (MAE)")
mean_absolute_error(y, predicted_iowa_prices)
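
For reference, MAE is just the average of the absolute differences between actual and predicted values. Here is a minimal sketch of that arithmetic in plain Python; the four prices are made up for illustration, and the variable names are my own.

```python
# Made-up actual and predicted prices, for illustration only.
actual_prices = [200000, 150000, 300000, 250000]
predicted_prices_demo = [210000, 140000, 330000, 250000]

# Absolute error for each house: |actual - predicted|
absolute_errors = [abs(a - p) for a, p in zip(actual_prices, predicted_prices_demo)]

# MAE is the mean of those absolute errors.
mae_by_hand = sum(absolute_errors) / len(absolute_errors)
print(mae_by_hand)  # 12500.0
```

Note that the MAE in the cell above is measured on the same rows the model was trained on, so it is likely to look better than the model really is.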

Split data into Training and Validation sets

In [12]:
from sklearn.model_selection import train_test_split

# split data into training and validation data, for both predictors and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
# Define model
iowa_model = DecisionTreeRegressor()
# Fit model
iowa_model.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = iowa_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
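
To see why the validation split matters, here is a rough sketch on synthetic data (not the Iowa data). It fits an unrestricted decision tree and compares the MAE on the rows it was trained on with the MAE on held-out rows. All names ending in `_demo`, and the data itself, are assumptions made up for this illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends on the first feature plus some noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3 * X_demo[:, 0] + rng.normal(0, 1, size=200)

tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)
demo_model = DecisionTreeRegressor(random_state=0).fit(tr_X, tr_y)

# Error on rows the tree has already seen vs. rows it has not.
in_sample_mae = mean_absolute_error(tr_y, demo_model.predict(tr_X))
validation_mae = mean_absolute_error(va_y, demo_model.predict(va_X))
print("In-sample MAE: ", in_sample_mae)
print("Validation MAE:", validation_mae)
```

An unrestricted tree can memorize its training rows, so the in-sample number says very little. The validation number is the one to trust.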

Experimenting with different models. Specifically, trying different values for max_leaf_nodes

In [14]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, predictors_train, predictors_val, targ_train, targ_val):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(predictors_train, targ_train)
    preds_val = model.predict(predictors_val)
    mae = mean_absolute_error(targ_val, preds_val)
    return mae

In [17]:
# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Of the four values tried, 50 leaf nodes gives the lowest Mean Absolute Error (MAE) on the validation data, so it is the best choice among these options.
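
Instead of reading the best value off the printed lines, it can be picked programmatically. This is a sketch on synthetic data, with a hypothetical `demo_mae` helper that mirrors `get_mae`; the winner on this toy data need not match the Iowa result.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, made up for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = 10 * np.sin(X_demo[:, 0]) + rng.normal(0, 1, size=300)
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)

def demo_mae(max_leaf_nodes):
    """Validation MAE of a tree limited to max_leaf_nodes leaves."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(tr_X, tr_y)
    return mean_absolute_error(va_y, model.predict(va_X))

# Score every candidate, then keep the one with the smallest MAE.
candidates = [5, 50, 500, 5000]
scores = {n: demo_mae(n) for n in candidates}
best_size = min(scores, key=scores.get)
print("Best max_leaf_nodes:", best_size, "MAE:", scores[best_size])
```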


Now let's move on to a more sophisticated model: Random Forest. This should be awesome. Can't wait to start to enjoy this ride. So what's this Random Forest that everybody is talking about?
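
In short, scikit-learn's random forest trains many decision trees, each on a bootstrap sample of the training data, and averages their predictions. The sketch below checks that averaging on synthetic data; the `_demo` names, the data, and the choice of 10 trees are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, made up for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(100, 2))
y_demo = 3 * X_demo[:, 0] + rng.normal(0, 1, size=100)

demo_forest = RandomForestRegressor(n_estimators=10, random_state=0)
demo_forest.fit(X_demo, y_demo)

# Prediction from the forest as a whole...
forest_preds = demo_forest.predict(X_demo[:5])
# ...and the average of the individual trees' predictions.
tree_preds = np.mean(
    [tree.predict(X_demo[:5]) for tree in demo_forest.estimators_], axis=0
)
print(np.allclose(forest_preds, tree_preds))
```

Averaging over many trees smooths out the mistakes of any single tree, which is why a forest usually does well even with default settings.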

In [19]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# get the forest model and do the fitting thing.
forest_model = RandomForestRegressor()
forest_model.fit(train_X, train_y)

# do the prediction, and print the mean absolute error
iowa_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, iowa_preds))


Wow, this MAE is a great improvement on the 27825 we got from the DecisionTree version earlier. Eso es absolutamente la vida buena ("that is absolutely the good life"), or no? Wow, this brought out my little Spanish.

Alright, time to submit this file for the competition now.

In [22]:
# Read the test data
test = pd.read_csv('../input/test.csv')
# Treat the test data in the same way as training data. In this case, pull same columns.
test_X = test[iowa_data_predictors]

# Use the model to make predictions
predicted_prices = forest_model.predict(test_X)
# We will look at the predicted prices to ensure we have something sensible.
print(predicted_prices)

Prepare the submission file now. Submissions are made as CSV files, so I have to somehow convert the Ids and predicted prices to a CSV. This should be fun. Or not, depending on where you stand.

In [23]:
my_submission = pd.DataFrame({'Id': test.Id, 'SalePrice': predicted_prices})
# you could use any filename. We choose iowa_submission here :)
# Note: I'm explicitly including the argument `index=False` here to prevent pandas from writing the row index as an extra column in our csv file.
my_submission.to_csv('iowa_submission.csv', index=False)
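
To see what `index=False` changes, here is a small sketch with a made-up two-row frame (`demo_submission` and its values are invented for illustration). When no path is given, `to_csv` returns the CSV text as a string, which makes the header line easy to inspect.

```python
import pandas as pd

# Made-up Ids and prices, for illustration only.
demo_submission = pd.DataFrame({'Id': [1, 2], 'SalePrice': [120000.0, 155000.0]})

# With no path argument, to_csv returns the CSV text instead of writing a file.
with_index = demo_submission.to_csv()
without_index = demo_submission.to_csv(index=False)

print(with_index.splitlines()[0])     # ,Id,SalePrice  (extra unnamed index column)
print(without_index.splitlines()[0])  # Id,SalePrice
```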


**If you have any questions or hit any problems, come to the [Learn Discussion](https://www.kaggle.com/learn-forum) for help.**

**Return to [ML Course Index](https://www.kaggle.com/learn/machine-learning)**