# Introduction
**This will be your workspace for Kaggle's Machine Learning education track.**

You will build and continually improve a model to predict housing prices as you work through each tutorial.  Fork this notebook and write your code in it.

The data from the tutorial, the Melbourne data, is not available in this workspace.  You will need to translate the concepts to work with the data in this notebook, the Iowa data.

Come to the [Learn Discussion](https://www.kaggle.com/learn-forum) forum for any questions or comments. 

# Write Your Code Below



In [6]:
import pandas as pd

main_file_path = '../input/train.csv'
data = pd.read_csv(main_file_path)
print(data.describe())

Print a list of the columns

In [8]:
print(data.columns)

From the list of columns, find the name of the column that holds the sale prices of the homes. Use dot notation to extract it to a variable (as you saw above when creating melbourne_price_data).
Use the head command to print out the top few lines of the variable you just created.

In [9]:
# store the series of prices separately as price_data.
price_data = data.SalePrice
# the head command returns the top few lines of data.
print(price_data.head())
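
Dot notation only works when the column name is a valid Python identifier. The sketch below uses a tiny made-up frame (the values are illustrative, not taken from the Iowa file) to show that bracket notation returns the same Series and also handles names such as `1stFlrSF`, which start with a digit.

```python
import pandas as pd

# Tiny made-up frame that borrows two column names from the Iowa data.
toy = pd.DataFrame({
    "SalePrice": [208500, 181500, 223500],
    "1stFlrSF": [856, 1262, 920],
})

# Dot notation works because SalePrice is a valid Python identifier.
prices_dot = toy.SalePrice

# Bracket notation returns the same Series and works for any column name.
prices_bracket = toy["SalePrice"]
first_floor = toy["1stFlrSF"]  # toy.1stFlrSF would be a SyntaxError

print(prices_dot.equals(prices_bracket))
print(first_floor.head(2).tolist())
```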

Pick any two variables and store them to a new DataFrame (as you saw above to create two_columns_of_data.)
Use the describe command with the DataFrame you just created to see summaries of those variables. 

In [13]:
columns_of_interest = ['OverallQual', 'Fireplaces']
two_columns_of_data = data[columns_of_interest]
two_columns_of_data.describe()

Select the target variable you want to predict. We are going to choose SalePrice as the prediction target. Save this to a new variable called y.

In [14]:
y = data.SalePrice

Create a list of the names of the predictors we will use in the initial model. Use just the following columns in the list (you can copy and paste the whole list to save some typing, though you'll still need to add quotes):

* LotArea
* YearBuilt
* 1stFlrSF
* 2ndFlrSF
* FullBath
* BedroomAbvGr
* TotRmsAbvGrd

Save this with the variable name X

In [17]:
data_predictors = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 
                        'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = data[data_predictors]

Create a DecisionTreeRegressor model and save it to a variable (with a name like my_model or iowa_model). Ensure you've done the relevant import so you can run this command.

Fit the model you have created using the data in X and the target data you saved above.

In [18]:
from sklearn.tree import DecisionTreeRegressor

# Define model
iowa_model = DecisionTreeRegressor()

# Fit model
iowa_model.fit(X, y)

Make a few predictions with the model's predict command and print out the predictions.

In [19]:
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(iowa_model.predict(X.head()))

1. Use the train_test_split command to split up your data.
1. Fit the model with the training data
1. Make predictions with the validation predictors
1. Calculate the mean absolute error between your predictions and the actual target values for the validation data.

In [21]:
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# split data into training and validation data, for both predictors and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
# Define model
iowa_model = DecisionTreeRegressor()
# Fit model
iowa_model.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = iowa_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
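
The mean absolute error printed above is the average of the absolute differences between each predicted price and the actual price. As a rough illustration, here is the same calculation done by hand on four hypothetical houses (the numbers are invented to keep the arithmetic simple).

```python
# Hypothetical prices, chosen only to make the arithmetic easy to follow.
actual = [200000, 150000, 300000, 120000]
predicted = [210000, 140000, 270000, 120000]

# Absolute error per house, ignoring whether the guess was high or low.
abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]

# The mean of those errors is the MAE.
mae = sum(abs_errors) / len(abs_errors)

print(abs_errors)  # [10000, 10000, 30000, 0]
print(mae)         # 12500.0
```

An MAE of 12500 here means the predictions are off by $12,500 on average, in the same units as the target.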

Use a for loop that tries different values of max_leaf_nodes and calls the get_mae function on each to find the ideal number of leaves for your Iowa data.

In [22]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, predictors_train, predictors_val, targ_train, targ_val):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(predictors_train, targ_train)
    preds_val = model.predict(predictors_val)
    mae = mean_absolute_error(targ_val, preds_val)
    return mae

for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))
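
The loop above prints one score per tree size and leaves the comparison to the reader. One possible way to pick the winner programmatically is to store the scores in a dictionary and take the key with the smallest error. The sketch below assumes synthetic data in place of the Iowa file, and `demo_mae` is a hypothetical helper written for this example.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the Iowa predictors and prices (not real data).
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(400, 3))
y_demo = 50 * X_demo[:, 0] + 20 * X_demo[:, 1] ** 2 + rng.normal(0, 25, size=400)

train_X_demo, val_X_demo, train_y_demo, val_y_demo = train_test_split(
    X_demo, y_demo, random_state=0)

def demo_mae(max_leaf_nodes):
    """Fit a tree capped at max_leaf_nodes and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X_demo, train_y_demo)
    return mean_absolute_error(val_y_demo, model.predict(val_X_demo))

candidates = [5, 50, 500, 5000]
scores = {n: demo_mae(n) for n in candidates}

# The best tree size is the one with the lowest validation error.
best_size = min(scores, key=scores.get)
print(scores)
print("Best max_leaf_nodes:", best_size)
```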

Run the RandomForestRegressor on your data. You should see a big improvement over your best Decision Tree models.

In [23]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor()
forest_model.fit(train_X, train_y)
# These are predictions on the Iowa validation data, so name them accordingly
iowa_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, iowa_preds))

We're doing very minimal data setup here so we can focus on how to submit modeling results to competitions. Other tutorials will teach you how to build great models, so the model in this example will be fairly simple. We'll start with the code to read data, select predictors, and fit a model.

In [134]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Read the data
train = pd.read_csv('../input/train.csv')

# pull data into target (y) and predictors (X)
train_y = train.SalePrice
predictor_cols = ['LotArea', 'OverallQual', 'YearBuilt', 'TotRmsAbvGrd']

# Create training predictors data
train_X = train[predictor_cols]

my_model = RandomForestRegressor()
my_model.fit(train_X, train_y)

In addition to your training data, there will be test data. This is frequently stored in a file with the title test.csv. This data won't include a column with your target (y), because that is what we'll have to predict and submit. Here is sample code to do that.

In [104]:
# Read the test data
test = pd.read_csv('../input/test.csv')
# Treat the test data in the same way as training data. In this case, pull same columns.
test_X = test[predictor_cols]
# Use the model to make predictions
predicted_prices = my_model.predict(test_X)
# We will look at the predicted prices to ensure we have something sensible.
print(predicted_prices)

We will create a DataFrame with this data, and then use the DataFrame's to_csv method to write our submission file. Explicitly include the argument index=False to prevent pandas from writing the row index as an extra column in our CSV file.

In [None]:
my_submission = pd.DataFrame({'Id': test.Id, 'SalePrice': predicted_prices})
# you could use any filename. We choose submission here
my_submission.to_csv('submission.csv', index=False)
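
To see what index=False changes, the following sketch writes a two-row made-up submission to in-memory buffers rather than to disk and compares the header lines. The Id and SalePrice values are placeholders.

```python
import io

import pandas as pd

# Placeholder submission with invented values.
demo_submission = pd.DataFrame({"Id": [1461, 1462],
                                "SalePrice": [120000.0, 155000.5]})

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
demo_submission.to_csv(with_index)

# index=False: only the Id and SalePrice columns are written.
without_index = io.StringIO()
demo_submission.to_csv(without_index, index=False)

header_with_index = with_index.getvalue().splitlines()[0]
header_without_index = without_index.getvalue().splitlines()[0]
print(repr(header_with_index))     # ',Id,SalePrice'
print(repr(header_without_index))  # 'Id,SalePrice'
```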

**Level 2 Material**


**Basic Problem Set-Up**

Load data, select variables, split into test and training data.

In [56]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Load data
iowa_data = pd.read_csv('../input/train.csv')

iowa_target = iowa_data.SalePrice
iowa_predictors = iowa_data.drop(['SalePrice'], axis=1)

# For the sake of keeping the example simple, we'll use only numeric predictors.
iowa_numeric_predictors = iowa_predictors.select_dtypes(exclude=['object'])

X_train, X_test, y_train, y_test = train_test_split(iowa_numeric_predictors, 
                                                    iowa_target,
                                                    train_size=0.7, 
                                                    test_size=0.3, 
                                                    random_state=0)

def score_dataset(X_train, X_test, y_train, y_test):
    model = RandomForestRegressor()
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return mean_absolute_error(y_test, preds)



**Get Model Score from Imputation**

In [None]:
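# sklearn.preprocessing.Imputer was removed in scikit-learn 0.22.
# SimpleImputer is its replacement; the alias keeps the later cells unchanged.
from sklearn.impute import SimpleImputer as Imputer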

my_imputer = Imputer()
imputed_X_train = my_imputer.fit_transform(X_train)
imputed_X_test = my_imputer.transform(X_test)
print("Mean Absolute Error from Imputation:")
print(score_dataset(imputed_X_train, imputed_X_test, y_train, y_test))
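
As a small illustration of what the imputer does, the sketch below assumes a recent scikit-learn, where the class is `SimpleImputer` in `sklearn.impute`, and uses a made-up 3x2 array. The column means are learned from the training rows only and then reused to fill gaps in the test rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data: one missing value in train, one in test.
train_demo = np.array([[1.0, np.nan],
                       [3.0, 4.0],
                       [5.0, 6.0]])
test_demo = np.array([[np.nan, 10.0]])

imputer = SimpleImputer()  # default strategy is "mean"

# fit_transform learns the column means (3.0 and 5.0) from the training rows.
train_filled = imputer.fit_transform(train_demo)

# transform reuses those training means; it does not look at test statistics.
test_filled = imputer.transform(test_demo)

print(train_filled)
print(test_filled)
```

Fitting on the training split alone keeps information from the test rows out of the preprocessing step.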




**Get Score from Imputation with Extra Columns Showing What Was Imputed**

In [57]:
imputed_X_train_plus = X_train.copy()
imputed_X_test_plus = X_test.copy()

cols_with_missing = (col for col in X_train.columns 
                                 if X_train[col].isnull().any())
for col in cols_with_missing:
    imputed_X_train_plus[col + '_was_missing'] = imputed_X_train_plus[col].isnull()
    imputed_X_test_plus[col + '_was_missing'] = imputed_X_test_plus[col].isnull()

# Imputation
my_imputer = Imputer()
imputed_X_train_plus = my_imputer.fit_transform(imputed_X_train_plus)
imputed_X_test_plus = my_imputer.transform(imputed_X_test_plus)

print("Mean Absolute Error from Imputation while Tracking What Was Imputed:")
print(score_dataset(imputed_X_train_plus, imputed_X_test_plus, y_train, y_test))
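
The idea behind the extra columns is easier to see on a few rows. This sketch uses pandas only and a made-up frame with two Iowa-style column names. It records which rows were missing before filling them, so a model can still tell imputed values from observed ones.

```python
import numpy as np
import pandas as pd

# Made-up rows; LotFrontage has one gap.
demo = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0],
    "LotArea": [8450, 9600, 11250],
})

# A list (rather than a generator) can be inspected and reused.
cols_with_missing = [col for col in demo.columns if demo[col].isnull().any()]

demo_plus = demo.copy()
for col in cols_with_missing:
    # Record the gap before it is filled in.
    demo_plus[col + "_was_missing"] = demo_plus[col].isnull()

# Mean imputation done by hand for the single affected column.
demo_plus["LotFrontage"] = demo_plus["LotFrontage"].fillna(demo["LotFrontage"].mean())

print(demo_plus)
```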

**Using Categorical Data with One Hot Encoding**

In [111]:
# Use imputing to handle missing values
my_imputer = Imputer()
iowa_numeric_predictors_plus = my_imputer.fit_transform(iowa_numeric_predictors)
# Keep the original column names so every column label is a string
numeric_training_predictors = pd.DataFrame(data=iowa_numeric_predictors_plus,
                                           columns=iowa_numeric_predictors.columns)

# Use one hot encoding on categorical variables
iowa_object_predictors = iowa_predictors.select_dtypes(include=['object'])
one_hot_encoded_training_predictors = pd.get_dummies(iowa_object_predictors)

# Merge numeric and one-hot encoded data together. Both frames share the same
# default row index, so a plain column-wise concat lines the rows up
# (the join_axes argument was removed in pandas 1.0).
one_hot_encoded_training_predictors = pd.concat(
    [one_hot_encoded_training_predictors, numeric_training_predictors], axis=1)
print(one_hot_encoded_training_predictors.head())
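
For a sense of what one-hot encoding produces, here is a minimal sketch on a made-up three-row frame. `Street` and `LotArea` are borrowed column names and the values are invented. Each category becomes its own indicator column, and the numeric column is joined back on column-wise.

```python
import pandas as pd

# Made-up rows with one categorical and one numeric column.
demo = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "LotArea": [8450, 9600, 11250],
})

categorical_part = demo[["Street"]]
numeric_part = demo[["LotArea"]]

# One indicator column per category, named <column>_<category>.
encoded = pd.get_dummies(categorical_part)

# Both parts share the same row index, so axis=1 lines the rows up.
combined = pd.concat([encoded, numeric_part], axis=1)

print(list(combined.columns))  # ['Street_Grvl', 'Street_Pave', 'LotArea']
print(combined)
```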

In [112]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor


def get_mae(X, y):
    # multiply by -1: sklearn returns negative MAE by convention, and we want a positive score
    return -1 * cross_val_score(RandomForestRegressor(50), 
                                X, y, 
                                scoring = 'neg_mean_absolute_error').mean()

predictors_without_categoricals = iowa_numeric_predictors_plus

mae_without_categoricals = get_mae(predictors_without_categoricals, iowa_target )

mae_one_hot_encoded = get_mae(one_hot_encoded_training_predictors, iowa_target )

print('Mean Absolute Error when Dropping Categoricals: ' + str(int(mae_without_categoricals)))
print('Mean Absolute Error with One-Hot Encoding: ' + str(int(mae_one_hot_encoded)))

**Learning to Use XGBoost**

In [137]:
# We are going to use the data that we have one-hot encoded and imputed missing values
X_train, X_test, y_train, y_test = train_test_split(one_hot_encoded_training_predictors,
                                                    iowa_target,
                                                    train_size=0.7,
                                                    test_size=0.3,
                                                    random_state=0)
# We can then proceed with the XGBoost model
from xgboost import XGBRegressor

my_model = XGBRegressor(n_estimators=1000)
# Fit and evaluate on the split created above, not on the arrays from the earlier imputation cell.
# Note: newer xgboost releases may require early_stopping_rounds in the XGBRegressor constructor instead.
my_model.fit(X_train, y_train, early_stopping_rounds=100,
             eval_set=[(X_test, y_test)], verbose=False)

# make predictions
predictions = my_model.predict(X_test)

from sklearn.metrics import mean_absolute_error
print("Mean Absolute Error : " + str(mean_absolute_error(y_test, predictions)))

**Partial Dependence Plots**

In this step, you will learn how to create and interpret partial dependence plots, one of the most valuable ways to extract insight from your models.

Pick three predictors in your project. Formulate a hypothesis about what each partial dependence plot will look like. Create the plots, and check the results against your hypothesis.

In [141]:
def get_some_data():
    cols_to_use = ['LotArea', 'YearBuilt', '1stFlrSF']
    data = pd.read_csv('../input/train.csv')
    y = data.SalePrice
    X = data[cols_to_use]
    my_imputer = Imputer()
    imputed_X = my_imputer.fit_transform(X)
    return imputed_X, y

from sklearn.inspection import PartialDependenceDisplay
from sklearn.ensemble import GradientBoostingRegressor

# get_some_data is defined at the top of this cell.
X, y = get_some_data()
# scikit-learn originally implemented partial dependence plots only for Gradient Boosting models.
# Current releases provide them in sklearn.inspection for any fitted estimator.
my_model = GradientBoostingRegressor()

# fit the model as usual
my_model.fit(X, y)
# Here we make the plot
my_plots = PartialDependenceDisplay.from_estimator(
    my_model,
    X,                   # raw predictors data
    features=[0, 1, 2],  # column numbers of plots we want to show
    feature_names=['LotArea', 'YearBuilt', '1stFlrSF'],  # labels on graphs
    grid_resolution=10)  # number of values to plot on x axis
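
A partial dependence curve can also be computed by hand, which may help when interpreting the plots: for each grid value, overwrite one feature in every row with that value, predict, and average the predictions. The sketch below uses NumPy only. `toy_predict` is a hypothetical stand-in for a fitted model, chosen so the result can be checked mentally, and `partial_dependence_1d` is a helper written for this example, not a scikit-learn function.

```python
import numpy as np

def toy_predict(X):
    # Hypothetical "model": the prediction depends on two features in a known way.
    return 2.0 * X[:, 0] + X[:, 1]

# Made-up predictor rows.
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0]])

def partial_dependence_1d(predict, X, feature, grid):
    """Average prediction when `feature` is forced to each value in `grid`."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # overwrite the feature in every row
        averages.append(predict(X_mod).mean())
    return np.array(averages)

grid = np.array([0.0, 1.0, 2.0])
pd_values = partial_dependence_1d(toy_predict, X_demo, feature=0, grid=grid)
print(pd_values)  # [20. 22. 24.]
```

The curve rises by 2 per unit of the first feature, matching the coefficient in `toy_predict`, while the other feature contributes only its average of 20.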


**Pipelines**

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
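# sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it.
from sklearn.impute import SimpleImputer as Imputer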

my_pipeline = make_pipeline(Imputer(), RandomForestRegressor())
my_pipeline.fit(train_X, train_y)
predictions = my_pipeline.predict(test_X)
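
One benefit of bundling the imputer and the model is that the whole pipeline can be passed to cross_val_score, so imputation is refit inside each fold. The sketch below assumes a recent scikit-learn (with `SimpleImputer`) and synthetic data with randomly injected gaps. The variable names are made up for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic predictors and target (not the Iowa data).
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 30 * X_demo[:, 0] + rng.normal(0, 5, size=200)

# Blank out roughly 10% of the entries to simulate missing values.
X_demo[rng.rand(200, 3) < 0.1] = np.nan

demo_pipeline = make_pipeline(
    SimpleImputer(),
    RandomForestRegressor(n_estimators=20, random_state=0))

# Each fold fits the imputer and the forest on its own training portion.
neg_mae = cross_val_score(demo_pipeline, X_demo, y_demo, cv=3,
                          scoring="neg_mean_absolute_error")
demo_cv_mae = -neg_mae.mean()
print(demo_cv_mae)
```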