# Predicting the Sale Price of Bulldozers using Machine Learning

In this notebook, we're going to work through an example machine learning project with the goal of predicting the sale price of bulldozers.

## 1. Problem Definition

> How well can we predict the future sale price of a bulldozer, given its characteristics and previous examples of how much similar bulldozers have been sold for?

## 2. Data

The data is downloaded from the Kaggle Bluebook for Bulldozers competition: https://www.kaggle.com/c/bluebook-for-bulldozers/data

There are 3 main datasets:

* Train.csv is the training set, which contains data through the end of 2011.
* Valid.csv is the validation set, which contains data from January 1, 2012 - April 30, 2012. You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.
* Test.csv is the test set, which won't be released until the last week of the competition. It contains data from May 1, 2012 - November 2012. Your score on the test set determines your final rank for the competition.

## 3. Evaluation

The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted auction prices.

For more on the evaluation of this project check:
https://www.kaggle.com/c/bluebook-for-bulldozers/overview/evaluation

**Note:** The goal for most regression evaluation metrics is to minimize the error. For example, our goal for this project will be to build a machine learning model which minimizes RMSLE.
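
As a rough illustration of what RMSLE measures, the sketch below computes it by hand with NumPy. The function name `rmsle_manual` and the prices are made up for this example. It assumes the usual definition that takes `log(1 + x)` of both the true and predicted values, which is also what scikit-learn's `mean_squared_log_error` does.

```python
import numpy as np

def rmsle_manual(y_true, y_pred):
    """Root mean squared log error, written out step by step."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Difference between the logs of (1 + prediction) and (1 + truth)
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return np.sqrt(np.mean(log_diff ** 2))

# Hypothetical prices: both predictions are 10% too high
cheap_error = rmsle_manual([10_000], [11_000])
expensive_error = rmsle_manual([100_000], [110_000])
print(cheap_error, expensive_error)
```

Because of the logarithm, the two errors come out almost identical: RMSLE penalizes relative differences rather than absolute dollar amounts, so a 10% miss costs about the same on a cheap machine as on an expensive one.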

## 4. Features

Kaggle provides a data dictionary detailing all of the features of the dataset. You can download it (`Data Dictionary.xlsx`) from the competition's data page:
https://www.kaggle.com/c/bluebook-for-bulldozers/data?select=Data+Dictionary.xlsx

In [None]:
# Basic EDA Tools:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

In [None]:
# Import training and validation sets
# low_memory=False makes pandas read the whole file before inferring each column's dtype,
# which uses more memory but avoids inconsistent (mixed) dtype guesses
df = pd.read_csv("bluebook-for-bulldozers/TrainAndValid.csv",low_memory=False)

In [None]:
df.info()

In [None]:
# Plot sale price against sale date (first 1,000 rows)
fig, ax = plt.subplots()
ax.scatter(df.saledate[:1000],df.SalePrice[:1000]);

In [None]:
# Plotting histogram for observing distribution
df.SalePrice.plot.hist();

### Parsing dates

When we work with time series data, we want to enrich the time & date component as much as possible.

We can do that by telling pandas which of our columns contain dates, using the `parse_dates` parameter of `read_csv`.
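
To see what `parse_dates` changes, here is a small sketch on an in-memory CSV. The two rows are invented for illustration and are not taken from the real dataset.

```python
import io
import pandas as pd

# Toy CSV with made-up values, standing in for the real file
toy_csv = "SalesID,saledate,SalePrice\n1,2006-11-16,66000\n2,2004-03-26,57000\n"

without_parsing = pd.read_csv(io.StringIO(toy_csv))
with_parsing = pd.read_csv(io.StringIO(toy_csv), parse_dates=["saledate"])

print(without_parsing["saledate"].dtype)  # a plain string/object dtype
print(with_parsing["saledate"].dtype)     # a datetime64 dtype
print(with_parsing["saledate"].dt.year.tolist())
```

Only the parsed column exposes the `.dt` accessor, which is what makes the date-based features later in the notebook possible.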

In [None]:
# Import data again but this time parse dates
df = pd.read_csv("bluebook-for-bulldozers/TrainAndValid.csv",
                 low_memory=False,
                 parse_dates=["saledate"])

In [None]:
df.saledate.dtype

In [None]:
df.saledate

In [None]:
fig, ax = plt.subplots()
ax.scatter(df.saledate[:1000],df.SalePrice[:1000]);

In [None]:
df.head()

In [None]:
df.head().T

### Sort DataFrame by saledate

When working with time-series data, it's a good idea to sort it by date.

In [None]:
# Sort dataframe by saledate
df.sort_values("saledate", inplace=True, ascending=True)
df.saledate

### Make a copy of the original dataframe

We make a copy of the original dataframe so that when we manipulate the copy, we still have our original data.

In [None]:
df_tmp = df.copy()
df_tmp.saledate

Now we've parsed `saledate` and sorted the dataframe by it.

## Add datetime parameters for `saledate` column

In [None]:
df_tmp[:1].saledate.dt.year

In [None]:
df_tmp[:1].saledate.dt.day

In [None]:
df_tmp[:1].saledate.dt.day_name()

In [None]:
df_tmp[:1].saledate

In [None]:
# Adding 5 new cols:
df_tmp["saleYear"] = df_tmp.saledate.dt.year
df_tmp["saleMonth"] = df_tmp.saledate.dt.month
df_tmp["saleDay"] = df_tmp.saledate.dt.day
df_tmp["saleDayOfWeek"] = df_tmp.saledate.dt.dayofweek 
# (0: Mon, 1: Tue, 2: Wed, 3:Thu, 4: Fri, 5: Sat, 6:Sun)
df_tmp["saleDayOfYear"] = df_tmp.saledate.dt.dayofyear

In [None]:
df_tmp.info()

In [None]:
# Now that we've enriched our dataframe with datetime features, we can remove `saledate`
df_tmp.drop(["saledate"], axis=1, inplace=True)

## 5. Modelling

We've done enough EDA (we could always do more) but let's start to do some model-driven EDA.

In [None]:
# let's build a machine learning model
from sklearn.ensemble import RandomForestRegressor

# Instantiate a model
model = RandomForestRegressor(n_jobs=-1,
                              random_state=42)

# Fit the model
model.fit(df_tmp.drop("SalePrice",axis=1),df_tmp["SalePrice"])
# This raises a ValueError: the dataframe still contains strings, which can't be converted to floats

In [None]:
df_tmp["UsageBand"].dtype

### Convert strings to categories

One way we can turn all of our data into numbers is by converting the string columns into pandas categories.

We can check the different datatypes compatible with pandas here:
https://pandas.pydata.org/pandas-docs/version/0.25.3/reference/general_utility_functions.html#data-types-related-functionality
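
The sketch below shows, on a small made-up Series, what the conversion to an ordered category does. The values are illustrative only.

```python
import pandas as pd

# Toy column with one missing value
usage = pd.Series(["Low", "High", None, "Medium"])
usage_cat = usage.astype("category").cat.as_ordered()

# Categories are sorted alphabetically, not by meaning
print(list(usage_cat.cat.categories))
# Each value is stored as an integer code; missing values get -1
print(usage_cat.cat.codes.tolist())
```

Note that the resulting order is alphabetical (`High` < `Low` < `Medium`), so the codes are a numeric encoding rather than a meaningful ranking.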



In [None]:
pd.api.types.is_string_dtype(df_tmp["UsageBand"])

In [None]:
# Find the columns which contain strings
for name, content in df_tmp.items():
    if pd.api.types.is_string_dtype(content):
        print(name)

In [None]:
df_tmp["UsageBand"]

In [None]:
# This will turn all of the string values into category values
for name, content in df_tmp.items():
    if pd.api.types.is_string_dtype(content):
        df_tmp[name] = content.astype("category").cat.as_ordered()
# Missing values (NaN) don't get a category of their own; their code is -1

In [None]:
df_tmp["UsageBand"]

In [None]:
df_tmp["UsageBand"].cat.codes

In [None]:
df_tmp.info()

In [None]:
df_tmp.state

In [None]:
df_tmp.state.cat.categories

In [None]:
df_tmp.state.value_counts()

In [None]:
df_tmp.state.cat.codes

Thanks to Pandas categories we now have a way to access all of our data in the form of numbers. 

But we still have a bunch of missing data...

In [None]:
# Check missing data
df_tmp.isnull().sum()/len(df_tmp)

### Save preprocessed data
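
One thing worth knowing before exporting: a CSV file stores plain text, so the `category` dtype does not survive a round trip through `to_csv` and `read_csv`. The sketch below demonstrates this on a made-up two-column dataframe, using an in-memory buffer instead of a file.

```python
import io
import pandas as pd

# Toy dataframe with a categorical column and one missing value
toy = pd.DataFrame({"SalesID": [1, 2, 3],
                    "UsageBand": pd.Categorical(["Low", "High", None])})

# Write to an in-memory buffer and read it straight back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(toy["UsageBand"].dtype)              # category
print(reloaded["UsageBand"].dtype)         # back to a string-like dtype
print(reloaded["UsageBand"].isna().sum())  # the missing value is still missing
```

This is why the reloaded dataframe in this notebook has non-numeric columns again and needs a further conversion step.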

In [None]:
# Export current tmp dataframe
df_tmp.to_csv("bull.csv", index=False)

In [None]:
# Import preprocessed data
df_temp = pd.read_csv("bull.csv", low_memory=False)
df_temp

In [None]:
df_temp.info()

### Fill missing values

#### Fill numerical missing values first
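
As a minimal sketch of the idea, the toy example below fills a missing value with the column median and records where the gap was. The column name `hours` and its values are hypothetical.

```python
import pandas as pd

# Toy numeric column with one missing value and one large outlier
toy = pd.DataFrame({"hours": [100.0, None, 300.0, 5000.0]})

# Record which rows were missing before filling them
toy["hours_is_missing"] = toy["hours"].isna()
# The median of the observed values (100, 300, 5000) is 300
toy["hours"] = toy["hours"].fillna(toy["hours"].median())

print(toy)
```

The median is used rather than the mean because it is less affected by outliers: here the mean of the observed values would be 1800, pulled up by the single large entry.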

In [None]:
for name, content in df_temp.items():
    if pd.api.types.is_numeric_dtype(content):
        print(name)

In [None]:
# Check for which numeric columns have null values
for name, content in df_temp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(name)

In [None]:
# Fill missing numeric values with the column median
for name, content in df_temp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            # Add a column which tells us if the data was missing or not
            df_temp[name+"_is_missing"] = pd.isnull(content)
            # Fill missing numeric values with median
            df_temp[name] = content.fillna(content.median())

In [None]:
# Check if there are any missing numeric values left
for name, content in df_temp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(name)

In [None]:
# Check to see how many examples were missing
df_temp.auctioneerID_is_missing.value_counts()

In [None]:
df_temp.isna().sum()

In [None]:
# Check for columns which aren't numeric
for name, content in df_temp.items():
    if not pd.api.types.is_numeric_dtype(content):
        print(name)

In [None]:
df_temp.state.dtype

In [None]:
pd.Categorical(df_temp["state"]).codes

In [None]:
pd.Categorical(df_temp["UsageBand"]).codes

In [None]:
pd.Categorical(df_temp["UsageBand"])

In [None]:
# Turn categorical variables into numbers
# pd.Categorical gives missing values the code -1,
# so no separate fill step is needed here

for name, content in df_temp.items():
    if not pd.api.types.is_numeric_dtype(content):
        # Add columns to indicate whether sample had missing values
        df_temp[name+"_is_missing"] = pd.isnull(content)
        # Turn categories into numbers and add +1
        df_temp[name] = pd.Categorical(content).codes + 1
        
# pd.Categorical(content).codes uses -1 for missing values, so adding 1
# shifts them to 0 and keeps every code non-negative

In [None]:
df_temp.info()

In [None]:
df_temp.UsageBand.dtype

In [None]:
pd.Categorical(df_temp["UsageBand"]).codes

In [None]:
df_temp.info()

In [None]:
df_temp.isna().sum()

Now that all of our data is numeric and our dataframe has no missing values, we should be able to build a machine learning model.

In [None]:
%%time

# let's build a machine learning model
from sklearn.ensemble import RandomForestRegressor

# Instantiate a model
model = RandomForestRegressor(n_jobs=-1,
                              random_state=42)

# Fit the model
#model.fit(df_temp.drop("SalePrice",axis=1),df_temp["SalePrice"])

In [None]:
# Score the model
# model.score(df_temp.drop("SalePrice",axis=1),df_temp["SalePrice"])

**Question:** Why doesn't the above metric hold water? (why isn't the metric reliable)

**Generalization:** The ability for a machine learning model to perform well on data it hasn't seen before.
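
One way to see why a score computed on the training data says little about generalization is a toy experiment, sketched below under the assumption that the target is pure random noise, so there is nothing real to learn. All names and sizes here are made up for the illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X_noise = rng.normal(size=(300, 5))
y_noise = rng.normal(size=300)  # target is unrelated to the features

# First 200 rows are "seen" during training, the rest are held out
X_seen, y_seen = X_noise[:200], y_noise[:200]
X_unseen, y_unseen = X_noise[200:], y_noise[200:]

toy_model = RandomForestRegressor(n_estimators=50, random_state=42)
toy_model.fit(X_seen, y_seen)

seen_r2 = toy_model.score(X_seen, y_seen)        # R^2 on data the model memorized
unseen_r2 = toy_model.score(X_unseen, y_unseen)  # R^2 on data it has never seen
print(seen_r2, unseen_r2)
```

The model scores well on the rows it was trained on even though no real pattern exists, while the held-out score is far lower. Only the held-out score reflects how the model would behave on new data.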

In [None]:
df_temp.saleYear

In [None]:
# Split data into training and validation sets (validation = sales from 2012)

df_val = df_temp[df_temp.saleYear==2012]
df_train = df_temp[df_temp.saleYear!=2012]
len(df_val), len(df_train)

In [None]:
# Split data into X & y
X_train, y_train = df_train.drop("SalePrice",axis=1), df_train["SalePrice"]
X_valid, y_valid = df_val.drop("SalePrice",axis=1), df_val["SalePrice"]

X_train.shape, y_train.shape, X_valid.shape, y_valid.shape

## Building an evaluation function

In [None]:
# Create evaluation function (the competition uses RMSLE)
from sklearn.metrics import mean_squared_log_error, mean_absolute_error, r2_score

def rmsle(y_true,y_preds):
    """
    Calculates root mean squared log error between predictions & true labels.
    """
    return np.sqrt(mean_squared_log_error(y_true,y_preds))

# Create function to evaluate model on a few different metrics
def show_scores(model):
    train_preds = model.predict(X_train)
    val_preds = model.predict(X_valid)
    scores = {"Training MAE": mean_absolute_error(y_train, train_preds),
              "Valid MAE": mean_absolute_error(y_valid, val_preds),
              "Training RMSLE": rmsle(y_train, train_preds),
              "Valid RMSLE": rmsle(y_valid, val_preds),
              "Training R^2": r2_score(y_train, train_preds),
              "Valid R^2": r2_score(y_valid, val_preds)}
    return scores


## Testing our model on a subset (to tune the hyperparameters)
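
The idea can be sketched on synthetic data: `max_samples` caps how many rows each tree draws for its bootstrap sample, so every tree trains on far less data. The dataset and sizes below are invented for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data standing in for the real training set
X_toy, y_toy = make_regression(n_samples=2000, n_features=10,
                               noise=10.0, random_state=42)

# Each of the 20 trees is fit on a bootstrap sample of only 200 rows
subsampled = RandomForestRegressor(n_estimators=20,
                                   max_samples=200,
                                   random_state=42)
subsampled.fit(X_toy, y_toy)

rows_per_tree_default = len(X_toy)           # max_samples=None draws len(X) rows per tree
rows_per_tree_small = subsampled.max_samples
print(rows_per_tree_default, rows_per_tree_small)
```

The model still predicts for every row; it has simply seen fewer rows per tree during training, which trades some accuracy for much faster experiments.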

In [None]:
# This takes far too long... for experimenting

# %%time
# model = RandomForestRegressor(n_jobs=-1, random_state=42)
# model.fit(X_train, y_train)

In [None]:
len(X_train)

In [None]:
# Change max_samples value (Training on 10000 records)
model = RandomForestRegressor(n_jobs=-1,
                              random_state=42,
                              max_samples=10000)
model

In [None]:
%%time
# Cutting down on the max number of samples each estimator can see improves training time
model.fit(X_train,y_train)

In [None]:
show_scores(model)

### Hyperparameter Tuning with RandomizedSearchCV
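
To get a feel for what a randomized search tries, the sketch below uses scikit-learn's `ParameterSampler` to draw combinations from a grid. The grid `toy_grid` is a shortened, illustrative version and not the one used in this notebook.

```python
import math
import numpy as np
from sklearn.model_selection import ParameterSampler

# Shortened grid for illustration
toy_grid = {"n_estimators": np.arange(10, 100, 10),
            "max_depth": [None, 3, 5, 10],
            "max_features": [0.5, 1.0, "sqrt"]}

# Number of combinations an exhaustive grid search would have to try
total_combinations = math.prod(len(values) for values in toy_grid.values())

# A randomized search only tries n_iter randomly drawn combinations
sampled = list(ParameterSampler(toy_grid, n_iter=2, random_state=42))

print(total_combinations)
for params in sampled:
    print(params)
```

With `n_iter=2` and `cv=5`, a search fits 2 x 5 = 10 models (plus one final refit on the best combination), which covers only a small fraction of the grid.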

In [None]:
%%time
from sklearn.model_selection import RandomizedSearchCV

# Different RandomForestRegressor hyperparameters
# Note: "auto" was removed from max_features in scikit-learn 1.3;
# for regressors it meant "use all features", which is what 1.0 does
rf_grid = {"n_estimators": np.arange(10,100,10),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2,20,2),
           "min_samples_leaf": np.arange(1,20,2),
           "max_features": [0.5, 1, "sqrt", 1.0],
           "max_samples": [10000]}

# Instantiate RandomizedSearchCV model
rs_model = RandomizedSearchCV(RandomForestRegressor(n_jobs=-1,
                                                    random_state=42),
                              param_distributions = rf_grid,
                              n_iter = 2,
                              cv = 5,
                              verbose = True)

# Fit the RandomizedSearchCV model
rs_model.fit(X_train,y_train)

In [None]:
# Find the best hyperparameters
rs_model.best_params_

In [None]:
# Evaluate the RandomizedSearch model
show_scores(rs_model)

### Train a model with the best hyperparameters

**Note:** These were found after 100 iterations of `RandomizedSearchCV`, rather than the 2 iterations run above.

In [None]:
%%time

# Best hyperparameters found
ideal_model = RandomForestRegressor(n_estimators=40,
                                    min_samples_leaf=1,
                                    min_samples_split=14,
                                    max_features=0.5,
                                    n_jobs=-1,
                                    max_samples=None,
                                    random_state=42)

# Fit the ideal model
ideal_model.fit(X_train,y_train)

In [None]:
# Scores for ideal model (trained on all data)
show_scores(ideal_model)

In [None]:
# Scores on rs_model (only trained on 10000 examples)
show_scores(rs_model)

## Make predictions on test data

In [None]:
# Import the test data
df_test = pd.read_csv("bluebook-for-bulldozers/Test.csv",
                      low_memory=False,
                      parse_dates=["saledate"])
df_test.head()

Preprocessing the data (getting the test dataset in the same format as our training & validation dataset)
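
"Same format" here means the same columns, in the same order, as the data the model was fitted on. The sketch below shows one possible way to check and align columns with `DataFrame.reindex`. The column list and values are made up, and filling with `False` is only reasonable when the absent columns are `_is_missing` indicators.

```python
import pandas as pd

# Hypothetical training columns and a toy test frame missing one of them
train_cols = ["SalesID", "auctioneerID", "auctioneerID_is_missing", "saleYear"]
toy_test = pd.DataFrame({"saleYear": [2012, 2012],
                         "SalesID": [1, 2],
                         "auctioneerID": [3.0, 4.0]})

# Which training columns are absent from the test frame?
missing_cols = set(train_cols) - set(toy_test.columns)

# Reorder to match training and fill absent columns with False
aligned = toy_test.reindex(columns=train_cols, fill_value=False)

print(missing_cols)
print(aligned)
```

Recent scikit-learn versions check feature names at prediction time, so aligning the order as well as the set of columns avoids an error when calling `predict`.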

In [None]:
def preprocess_data(df):
    """
    Performs transformations on df and returns transformed df.
    """
    df["saleYear"] = df.saledate.dt.year
    df["saleMonth"] = df.saledate.dt.month
    df["saleDay"] = df.saledate.dt.day
    df["saleDayOfWeek"] = df.saledate.dt.dayofweek 
    df["saleDayOfYear"] = df.saledate.dt.dayofyear
    
    df.drop("saledate", axis=1, inplace=True)
    
    # Fill the empty numeric rows with median
    for name, content in df.items():
        if pd.api.types.is_numeric_dtype(content):
            if pd.isnull(content).sum():
                df[name+"_is_missing"] = pd.isnull(content)
                df[name] = content.fillna(content.median())
        
        # Filling categorical missing data and turning categories into numbers
        if not pd.api.types.is_numeric_dtype(content):
            df[name+"_is_missing"] = pd.isnull(content)
            df[name] = pd.Categorical(content).codes+1
            
    return df

In [None]:
# Process the test data
df_test = preprocess_data(df_test)
df_test.head()

In [None]:
X_train.head()

In [None]:
# We can find how the columns differ using sets
set(X_train.columns)-set(df_test.columns)

In [None]:
# 'auctioneerID' in the training set contains missing values but not in the test set
df_test['auctioneerID_is_missing'] = False

In [None]:
# def preprocess_data(df):
#     """
#     Performs transformations on df and returns transformed df.
#     """
   
#     df["saleYear"] = df.saledate.dt.year
#     df["saleMonth"] = df.saledate.dt.month
#     df["saleDay"] = df.saledate.dt.day
#     df["saleDayOfWeek"] = df.saledate.dt.dayofweek 
#     df["saleDayOfYear"] = df.saledate.dt.dayofyear
    
#     df.drop("saledate", axis=1, inplace=True)
    
    
#     for label, content in df.items():
#         # Turning categories into numbers
#         if not pd.api.types.is_numeric_dtype(content):
#             df[label] = content.astype("category").cat.as_ordered()
            
#         # Fill the numeric rows with median
#         if pd.api.types.is_numeric_dtype(content):
#             if pd.isnull(content).sum():
#                 df[label+"_is_missing"] = pd.isnull(content)
#                 df[label] = content.fillna(content.median())
        
#         # Filling categorical missing data 
#         if not pd.api.types.is_numeric_dtype(content):
#             df[label+"_is_missing"] = pd.isnull(content)
#             df[label] = pd.Categorical(content).codes+1
            
#     return df

In [None]:
# df_test = preprocess_data(df_test)
# df_test.head()

In [None]:
df_test.head()

Finally, our test dataframe has the same features as our training dataframe, so we can make predictions.

In [None]:
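# Make predictions on the test data
# Reorder the columns to match training, since scikit-learn checks feature names and order
test_preds = ideal_model.predict(df_test[X_train.columns])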

In [None]:
test_preds

We've made some predictions but they're not in the same format Kaggle is asking for: https://www.kaggle.com/c/bluebook-for-bulldozers/overview/evaluation

In [None]:
# Format predictions into the same format Kaggle is after
# (the price column is named "SalePrice", matching the training target)
df_preds = pd.DataFrame()
df_preds["SalesID"] = df_test["SalesID"]
df_preds["SalePrice"] = test_preds
df_preds

In [None]:
# Export Prediction Data
df_preds.to_csv("bulldozer_prediction.csv", index=False)

### Feature Importance

Feature importance seeks to figure out which attributes of the data were most important when it comes to predicting the target variable (**SalePrice**).
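
As a small sanity check of what `feature_importances_` reports, the sketch below fits a random forest on synthetic data where, by construction, only the first column affects the target. The data and names are invented for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(500, 3))
# Only column 0 drives the target; columns 1 and 2 are irrelevant
y_toy = 10 * X_toy[:, 0] + rng.normal(scale=0.1, size=500)

toy_model = RandomForestRegressor(n_estimators=50, random_state=42)
toy_model.fit(X_toy, y_toy)

importances = toy_model.feature_importances_
print(importances)        # one value per feature
print(importances.sum())  # the values are normalized to sum to 1
```

The model assigns almost all of the importance to the first column, which matches how the data was generated. On real data the picture is less clear-cut, but the values are read the same way.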

In [None]:
# Find feature importance of our best model
ideal_model.feature_importances_

In [None]:
df = (pd.DataFrame({"features": X_train.columns,
                    "feature_importances": ideal_model.feature_importances_})
      .sort_values("feature_importances", ascending=False)
      .reset_index(drop=True))
df

In [None]:
len(ideal_model.feature_importances_)

In [None]:
# Helper function for plotting feature importance
def plot_features(columns, importances, n=20):
    df = (pd.DataFrame({"features": columns,
                        "feature_importances": importances})
          .sort_values("feature_importances", ascending=False)
          .reset_index(drop=True))
    
    # Plot the top n features as a horizontal bar chart
    fig, ax = plt.subplots()
    ax.barh(df["features"][:n], df["feature_importances"][:n])
    ax.set_ylabel("Features")
    ax.set_xlabel("Feature Importance")
    ax.invert_yaxis()

In [None]:
plot_features(X_train.columns, ideal_model.feature_importances_)

**Question to finish:** Why might knowing the feature importances of a trained machine learning model be helpful?

**Final Challenge:**  What other machine learning models could you try on this dataset?