Okay, let's just follow the tutorial I guess.

In [None]:
import pandas as pd

main_file_path = '../input/house-prices-advanced-regression-techniques/train.csv' # this is the path to the Iowa data that you will use
iowa_df = pd.read_csv(main_file_path)

Let's see the shape of the data:

In [None]:
iowa_df.shape

And some info too:

In [None]:
iowa_df.describe()

Since there are so many columns, it might be nice to see a list of them:

In [None]:
iowa_df.columns

Yeah, just as expected, a bunch of columns. There's a way of making our analysis more focused by selecting only a few columns. For instance:

In [None]:
iowa_df[['Id', 'LotConfig', 'SalePrice']].head()
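A quick aside on the selection syntax, as a minimal sketch: the toy frame and its values below are made up for illustration and are not rows from train.csv. Passing a list of names (double brackets) gives back a DataFrame, while a single name gives back a Series.

```python
import pandas as pd

# Hypothetical toy data, only here to illustrate the indexing syntax.
toy_df = pd.DataFrame({
    'Id': [1, 2, 3],
    'LotConfig': ['Inside', 'Corner', 'Inside'],
    'SalePrice': [100000, 150000, 125000],
})

# A list of column names returns a DataFrame with just those columns.
subset = toy_df[['Id', 'SalePrice']]

# A single column name returns a Series.
single = toy_df['SalePrice']

print(type(subset).__name__, list(subset.columns))
print(type(single).__name__, single.name)
```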

Okay, let's see what might be cool to use in our analysis.

(Yeah, still following the tutorial)

In [None]:

reduced_iowa_df = iowa_df[['LotArea', 'YearBuilt',
                           '1stFlrSF', '2ndFlrSF',
                           'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']]
reduced_iowa_df.describe()

Now it's time for some Machine Learning, it seems. First, let's define the Prediction Target:

In [None]:
y = iowa_df.SalePrice

And we need the predictors too, which will be used to guess the target:

In [None]:
X = reduced_iowa_df

And now, the training!

In [None]:
from sklearn.tree import DecisionTreeRegressor
iowa_model = DecisionTreeRegressor()
iowa_model.fit(X, y)

And it is done it seems!

Still following the guide, we'll test the model with the dataframe used in its training, just to get a feeling for it I guess.

In [None]:
print('Predicting the price for the following houses:')
print(X.head())
print('And using the model we just obtained, we have:')
print(iowa_model.predict(X.head()))
print('And if you are curious, here we have the real values:')
print(list(y.head()))
print("Yeah... seems pretty good! But of course, if it wasn't, I guess we would have a problem...")

Given the real values, we can measure the error of the predictions using mean_absolute_error, a pretty convenient function:

In [None]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y, iowa_model.predict(X))
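For reference, here is roughly what mean_absolute_error computes, as a small sketch with made-up prices (the numbers are assumptions for illustration, not values from the dataset): take the absolute difference between each real value and its prediction, then average those differences.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical real and predicted prices, just for the arithmetic.
actual = np.array([200000, 150000, 300000])
predicted = np.array([210000, 140000, 330000])

# Absolute errors are 10000, 10000 and 30000; their mean is 50000 / 3.
manual_mae = np.mean(np.abs(actual - predicted))
library_mae = mean_absolute_error(actual, predicted)

print(manual_mae, library_mae)
```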

Welp, that doesn't seem so bad. Still, we are using the training data to check for errors, so a small error is about what we'd expect. Now the tutorial teaches us about the train_test_split function, which takes our data and breaks it into separate training and validation sets, so that we can test the model on data it hasn't seen and avoid overly optimistic results.

Seems fairly simple:

In [None]:
from sklearn.model_selection import train_test_split
train_X, validation_X, train_y, validation_y = train_test_split(X, y, random_state = 0)
iowa_model = DecisionTreeRegressor()
iowa_model.fit(train_X, train_y)

And now, testing the model on data it didn't see during training, we have:

In [None]:
print(mean_absolute_error(validation_y, iowa_model.predict(validation_X)))

Dang, that's a lot bigger!
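The gap between the two errors can be reproduced on synthetic data. This is only a sketch: the toy features and target below are randomly generated assumptions, not the Iowa data. An unrestricted tree can memorize its training rows, so its training error is essentially zero, while the held-out rows tell a different story. It also shows that train_test_split keeps 25% of the rows for validation by default.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy linear target.
rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 10, size=(200, 3))
toy_y = toy_X @ np.array([3.0, -2.0, 1.0]) + rng.normal(0, 1, size=200)

# With default settings, a quarter of the rows are held out.
toy_train_X, toy_val_X, toy_train_y, toy_val_y = train_test_split(
    toy_X, toy_y, random_state=0)
split_sizes = (len(toy_train_X), len(toy_val_X))

# A tree with no size limit keeps splitting until it fits every training row.
toy_tree = DecisionTreeRegressor(random_state=0)
toy_tree.fit(toy_train_X, toy_train_y)

train_mae = mean_absolute_error(toy_train_y, toy_tree.predict(toy_train_X))
val_mae = mean_absolute_error(toy_val_y, toy_tree.predict(toy_val_X))

print('Rows (train, validation):', split_sizes)
print('MAE on training rows:  ', train_mae)
print('MAE on validation rows:', val_mae)
```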

You know, maybe we can use the same model -- that is, the DecisionTreeRegressor -- but get better results. For that, controlling the number of leaves of our tree might help!

To do so, we can make a nice little function that computes the model error for a given maximum number of leaves, so that we can try out various values:

In [None]:
def get_mean_absolute_error(maximum_number_of_leaves,
                            predictors_training,
                            predictors_validation,
                            target_training,
                            target_validation):
    # Train a tree limited to the given number of leaves...
    model = DecisionTreeRegressor(max_leaf_nodes = maximum_number_of_leaves,
                                  random_state = 0)
    model.fit(predictors_training, target_training)
    # ...and score it on the validation data.
    predicted_values = model.predict(predictors_validation)
    return mean_absolute_error(target_validation, predicted_values)

Now, let's put this function to work:

In [None]:
for maximum_leaves in [5, 50, 500, 5000, 50000, 500000]:
    current_mean_absolute_error = get_mean_absolute_error(maximum_leaves,
                                                          train_X,
                                                          validation_X,
                                                          train_y,
                                                          validation_y)
    print('For %d maximum number of leaves, \t \t we have a Mean Absolute Error of %d' %(maximum_leaves, current_mean_absolute_error))

Interesting: with about 50 leaves we get the best model, that is, the model with the least error. With fewer leaves the results aren't that good, and with more we kind of stagnate at 33382 as the Mean Absolute Error.
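One likely reason for the stagnation at large leaf limits: a tree can't have more leaves than it has training rows, so once the limit exceeds that, raising it further changes nothing. Here is a minimal sketch on synthetic data (300 randomly generated rows, an assumption for illustration) that counts the leaves actually grown with get_n_leaves():

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): 300 rows, noisy target.
rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 10, size=(300, 2))
toy_y = toy_X[:, 0] * 5 + rng.normal(0, 1, size=300)

leaf_counts = {}
for leaf_limit in [5, 50, 500, 5000, 50000]:
    tree = DecisionTreeRegressor(max_leaf_nodes=leaf_limit, random_state=0)
    tree.fit(toy_X, toy_y)
    # Number of leaves the fitted tree actually ended up with.
    leaf_counts[leaf_limit] = tree.get_n_leaves()
    print('limit %6d -> %d leaves' % (leaf_limit, leaf_counts[leaf_limit]))
```

Past the number of training rows, every limit should yield the same tree, and therefore the same validation error.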

The next step is making a better analysis: to do so, the tutorial teaches us another type of model, which should give us better predictions than the Decision Tree.
This model is the Random Forest, which apparently is obtained by making a bunch of different Decision Trees and averaging their predictions (hence the name, random forest).
Apparently it is pretty simple to use this model:

In [None]:
from sklearn.ensemble import RandomForestRegressor
train_X, validation_X, train_y, validation_y = train_test_split(X, y, random_state = 0)
iowa_model_with_random_forest = RandomForestRegressor()
iowa_model_with_random_forest.fit(train_X, train_y)

In [None]:
mean_absolute_error(validation_y, iowa_model_with_random_forest.predict(validation_X) )

Hey, that's pretty good actually, given that our best bets with the Decision Tree had about 28k as error, and we got that with some testing on the number of leaves. Here it wasn't even needed!
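The "bunch of trees, averaged" idea can be checked directly, since a fitted RandomForestRegressor exposes its individual trees through the estimators_ attribute. A small sketch on synthetic data (randomly generated, not the Iowa data; n_estimators=10 is picked arbitrarily to keep it fast):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 10, size=(200, 3))
toy_y = toy_X @ np.array([3.0, -2.0, 1.0]) + rng.normal(0, 1, size=200)

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(toy_X, toy_y)

# Predict the first five rows with every tree separately, then average.
sample = toy_X[:5]
per_tree = np.array([tree.predict(sample) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

forest_prediction = forest.predict(sample)
print(np.allclose(manual_average, forest_prediction))
```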

Anyway, the tutorial ends about here: the only thing left is to create a submission file, and then make the submission. Ok, let's try to do it.

First, I'll change things up so that the RandomForest is trained on the entire train.csv, instead of a portion of it from the train_test_split():

In [None]:
X = reduced_iowa_df
y = iowa_df.SalePrice
iowa_model = RandomForestRegressor()
iowa_model.fit(X, y)

Now let's try to predict some prices based on the test.csv data:

In [None]:
iowa_df_test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
X_test = iowa_df_test[['LotArea', 'YearBuilt',
                           '1stFlrSF', '2ndFlrSF',
                           'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']]
predicted_prices = iowa_model.predict(X_test)

print('And here are the predicted prices:')
print(predicted_prices)

...or at least some of them, it seems. About 1.5k values is a lot to print, anyway.

Now, to the csv file. Seems fairly simple to create:

In [None]:
submission = pd.DataFrame({'Id': iowa_df_test.Id,
                           'SalePrice': predicted_prices})
submission.to_csv('submission.csv', index = False)
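To double-check what ends up in the file, here is a small sketch that writes a toy submission to an in-memory buffer instead of disk and reads it back. The Ids and prices are made-up placeholders, not real predictions. With index = False, the row index is left out, so the header is just the two column names.

```python
import io
import pandas as pd

# Hypothetical Ids and prices, only to show the resulting file layout.
toy_submission = pd.DataFrame({'Id': [1461, 1462, 1463],
                               'SalePrice': [120000.0, 155000.5, 180250.0]})

# Write to an in-memory buffer so nothing is created on disk.
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back gives the same two columns and the same number of rows.
read_back = pd.read_csv(io.StringIO(csv_text))
print(read_back.shape)
```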