##  🚚 Predicting the Sale Price of Bulldozers using Machine Learning


In this notebook, we're going to go through an example machine learning project with the goal of predicting the sale price of bulldozers.

## 1. Problem definition

How well can we predict the future sale price of a bulldozer, given its characteristics and previous examples of how much similar bulldozers have been sold for?

## 2. Data

The data is downloaded from the Kaggle Bluebook for Bulldozers competition: https://www.kaggle.com/c/bluebook-for-bulldozers/data

There are 3 main datasets:

1. Train.csv is the training set, which contains data through the end of 2011.
2. Valid.csv is the validation set, which contains data from January 1, 2012 - April 30, 2012. You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.
3. Test.csv is the test set, which won't be released until the last week of the competition. It contains data from May 1, 2012 - November 2012. Your score on the test set determines your final rank for the competition.


## 3. Evaluation
The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted auction prices.

For more on the evaluation of this project check: https://www.kaggle.com/c/bluebook-for-bulldozers/overview/evaluation

Note: The goal for most regression evaluation metrics is to minimize the error. For example, our goal for this project will be to build a machine learning model which minimises RMSLE.

## 4. Features
Kaggle provides a data dictionary detailing all of the features of the dataset. You can view this data dictionary on Google Sheets: https://docs.google.com/spreadsheets/d/18ly-bLR8sbDJLITkWG7ozKm8l3RyieQ2Fpgix-beSYI/edit?usp=sharing

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import sklearn


In [None]:
# Import training and validation sets
df = pd.read_csv("TrainAndValid.csv",
                 low_memory=False)

In [None]:
df.info()

In [None]:
df.isna().sum()


In [None]:
fig,ax= plt.subplots()
ax.scatter(df["saledate"][:1000],df["SalePrice"][:1000]);

In [None]:
df.saledate[:1000]


In [None]:
df.SalePrice.plot.hist()


### Parsing Dates

When we work with time series data, we want to enrich the time and date component as much as possible.

We can do that by telling pandas which of our columns contain dates, using the parse_dates parameter.

In [None]:
# Import data again but this time parse dates
df=pd.read_csv("TrainAndValid.csv",
              low_memory = False ,
              parse_dates=["saledate"])

In [None]:
df.saledate.dtype

In [None]:
df.saledate[:1000]

In [None]:
fig, ax = plt.subplots()
ax.scatter(df["saledate"][:1000], df["SalePrice"][:1000]);

In [None]:
df.head().T

In [None]:
len(df.columns)

In [None]:
df.saledate.head(20)

### Sort DataFrame by saledate

When working with time series data, it's a good idea to sort it by date.

In [None]:
# Sort  dataframe in date order
df.sort_values(by=["saledate"],inplace=True,ascending= True )
df.saledate.head(20)

### Make a copy of the original DataFrame
We make a copy of the original DataFrame so that when we manipulate the copy, we still have our original data.

In [None]:
## make a copy 
df_tmp= df.copy()

## Add datetime parameters for the 'saledate' column

In [None]:
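df_tmp["saleYear"] = df_tmp.saledate.dt.year
df_tmp["saleMonth"] = df_tmp.saledate.dt.month
df_tmp["saleDay"] = df_tmp.saledate.dt.day
df_tmp["saleDayofweek"] = df_tmp.saledate.dt.dayofweek
df_tmp["saleDayofYear"] = df_tmp.saledate.dt.dayofyear

# Now that the date has been split into numeric features, drop the original column
# (this mirrors the drop in preprocess_data() used for the test set later on)
df_tmp.drop("saledate", axis=1, inplace=True)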

In [None]:
df_tmp.head().T

In [None]:
df_tmp.head()

In [None]:
# Check the values of different columns
df_tmp.state.value_counts()

## 5. Modeling 

We've done enough EDA (we could always do more), but let's start to do some model-driven EDA.

In [None]:
# np.random.seed(42)
# #Lets build a machine learning model (check machine learning map to select model )
# from sklearn.ensemble import RandomForestRegressor

# model=RandomForestRegressor(n_jobs=1,
#                             random_state=42)

# model.fit(df_tmp.drop("SalePrice",axis=1),df_tmp["SalePrice"])

In [None]:
# The commented-out fit above raises an error when run, because not all of the data is numeric and there are missing values.
 

### Convert strings to categories 

One way we can turn all of our data into numbers is by converting it to pandas categories.

In [None]:
df_tmp.head().T

In [None]:
pd.api.types.is_string_dtype(df_tmp["UsageBand"])

In [None]:
# Find the columns which contain strings
for label,content in df_tmp.items():
    if pd.api.types.is_string_dtype(content):
        print(label)

In [None]:
# If you're wondering what df.items() does, here's an example
random_dict = {"key1": "hello",
               "key2": "world!"}

for key, value in random_dict.items():
    print(f"this is a key: {key}",
          f"this is a value: {value}")

In [None]:
# This will turn all of the string values into category values
for label, content in df_tmp.items():
    if pd.api.types.is_string_dtype(content):
        df_tmp[label] = content.astype("category").cat.as_ordered()

In [None]:
df_tmp.info()

In [None]:
df_tmp.state.cat.categories

In [None]:
df_tmp.state.cat.codes

In [None]:
# Check missing data
df_tmp.isnull().sum()/len(df_tmp)

## Save preprocessed data

In [None]:

# Export current tmp dataframe
df_tmp.to_csv("train_tmp.csv",
              index=False)


In [None]:

# Import preprocessed data
df_tmp = pd.read_csv("train_tmp.csv",
                     low_memory=False)
df_tmp.head().T

## Fill missing values 

### Fill numeric missing values first 


In [None]:
for label,content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        print(label)

In [None]:

df_tmp.ModelID

In [None]:
# Check for which numeric columns have null values
for label, content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(label)

In [None]:
# Fill numeric rows with the median
for label, content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            # Add a binary column which tells us if the data was missing or not
            df_tmp[label+"_is_missing"] = pd.isnull(content)
            # Fill missing numeric values with median
            df_tmp[label] = content.fillna(content.median())

In [None]:
# Demonstrate how median is more robust than mean
hundreds = np.full((1000,), 100)
hundreds_billion = np.append(hundreds, 1000000000)
np.mean(hundreds), np.mean(hundreds_billion), np.median(hundreds), np.median(hundreds_billion)

In [None]:
# Check if there's any null numeric values
for label, content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(label)

In [None]:
# Check to see how many examples were missing
df_tmp.auctioneerID_is_missing.value_counts()

In [None]:
df_tmp.isna().sum()

### Filling and turning categorical variables into numbers

In [None]:
# Check for columns which aren't numeric
for label, content in df_tmp.items():
    if not pd.api.types.is_numeric_dtype(content):
        print(label)

In [None]:
# Turn categorical variables into numbers and fill missing
for label, content in df_tmp.items():
    if not pd.api.types.is_numeric_dtype(content):
        # Add binary column to indicate whether sample had missing value
        df_tmp[label+"_is_missing"] = pd.isnull(content)
        # Turn categories into numbers and add +1
        df_tmp[label] = pd.Categorical(content).codes+1

In [None]:
pd.Categorical(df_tmp["state"]).codes+1
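
To see why 1 is added to the category codes, here is a minimal sketch on a toy column (the values are made up for illustration and are not taken from the bulldozer data). pandas assigns the code -1 to missing values, so adding 1 shifts missing values to 0 and real categories to 1 and above.

```python
import pandas as pd

# Toy column with one missing value (hypothetical data, not from the bulldozer dataset)
states = pd.Series(["Texas", None, "Alabama", "Texas"])

# Categories are sorted alphabetically: Alabama -> 0, Texas -> 1, missing -> -1
codes = pd.Categorical(states).codes
print(codes)

# After adding 1: missing -> 0, Alabama -> 1, Texas -> 2
print(codes + 1)
```

This keeps every encoded value non-negative, and the separate `_is_missing` column still records which rows were originally empty.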


In [None]:
df_tmp.info()

In [None]:
df_tmp.head().T

In [None]:
df_tmp.isna().sum()

Now that all of our data is numeric and our DataFrame has no missing values, we should be able to build a machine learning model.

In [None]:
df_tmp.head()

In [None]:
%%time
from sklearn.ensemble import RandomForestRegressor

# Instantiate model
model = RandomForestRegressor(n_jobs=-1,
                              random_state=42)

# Fit the model (this can take a while on the full dataset)
model.fit(df_tmp.drop("SalePrice", axis=1), df_tmp["SalePrice"])

In [None]:
# SCORE THE MODEL 
model.score(df_tmp.drop("SalePrice", axis=1), df_tmp["SalePrice"])

Question: Why doesn't the above metric hold water? (why isn't the metric reliable)

## Splitting data into train/valid sets

In [None]:
df_tmp.head()

According to the Kaggle data page, the validation set and test set are split according to dates.

This makes sense since we're working on a time series problem.

E.g. using past events to try and predict future events.

Knowing this, randomly splitting our data into train and test sets using something like train_test_split() wouldn't work.

Instead, we split our data into training, validation and test sets using the date each sample occurred.

In our case:

1. Training = all samples up to the end of 2011
2. Valid = all samples from January 1, 2012 - April 30, 2012
3. Test = all samples from May 1, 2012 - November 2012

For more on making good training, validation and test sets, check out the post How (and why) to create a good validation set by Rachel Thomas.
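
As a small illustration of a date-based split, the sketch below uses a made-up four-row DataFrame (the column names mirror the ones in this notebook, the values are hypothetical). It also adds a sanity check that no validation sale happens before the last training sale.

```python
import pandas as pd

# Hypothetical toy sales data; column names mirror the ones used in this notebook
toy = pd.DataFrame({
    "saledate": pd.to_datetime(["2010-03-01", "2011-12-31", "2012-01-15", "2012-04-30"]),
    "SalePrice": [10000, 12000, 15000, 9000],
})
toy["saleYear"] = toy.saledate.dt.year

# Everything up to the end of 2011 is training data, 2012 is validation data
toy_train = toy[toy.saleYear != 2012]
toy_val = toy[toy.saleYear == 2012]

# Sanity check: the latest training sale comes before the earliest validation sale
print(toy_train.saledate.max() < toy_val.saledate.min())
```

A random split would mix 2012 sales into the training data, which would let the model see the period it is supposed to predict.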

In [None]:

df_tmp.saleYear.value_counts()

In [None]:
# Split data into training and validation
df_val = df_tmp[df_tmp.saleYear == 2012]
df_train = df_tmp[df_tmp.saleYear != 2012]

len(df_val), len(df_train)

In [None]:
# Split data into X & y
X_train, y_train = df_train.drop("SalePrice", axis=1), df_train.SalePrice
X_valid, y_valid = df_val.drop("SalePrice", axis=1), df_val.SalePrice

X_train.shape, y_train.shape, X_valid.shape, y_valid.shape

### Building an evaluation function
According to Kaggle for the Bluebook for Bulldozers competition, the evaluation function they use is root mean squared log error (RMSLE).

RMSLE is useful when you care more about ratios than differences: generally, being off by $10 matters less than being off by 10%. MAE (mean absolute error), by contrast, measures exact differences.

It's important to understand the evaluation metric you're going for.

Since Scikit-Learn doesn't have a function built-in for RMSLE, we'll create our own.

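We can do this by taking the square root of Scikit-Learn's mean_squared_log_error (MSLE). MSLE is the mean of the squared differences between log(1 + actual) and log(1 + predicted), so it is MSE computed on log-transformed values rather than the log of MSE.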

We'll also calculate the MAE and R^2 for fun.
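
To make the "ratios rather than differences" idea concrete, here is a minimal NumPy sketch. The function name `rmsle_manual` and the prices are assumptions for illustration. It writes RMSLE out by hand and compares two predictions that are both 10% too high.

```python
import numpy as np

def rmsle_manual(y_true, y_pred):
    """RMSLE written out by hand: sqrt(mean((log(1 + pred) - log(1 + true))^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Two hypothetical predictions, both 10% too high
cheap = rmsle_manual([10_000], [11_000])
expensive = rmsle_manual([100_000], [110_000])

# The RMSLE values are nearly identical because the ratio is the same
print(cheap, expensive)

# The absolute errors differ by a factor of 10
print(abs(11_000 - 10_000), abs(110_000 - 100_000))
```

MAE would score the second prediction as ten times worse, while RMSLE treats both as roughly the same size of mistake.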

In [None]:
# Create evaluation function (the competition uses Root Mean Square Log Error)
from sklearn.metrics import mean_squared_log_error, mean_absolute_error

def rmsle(y_test, y_preds):
    return np.sqrt(mean_squared_log_error(y_test, y_preds))

# Create function to evaluate our model
def show_scores(model):
    train_preds = model.predict(X_train)
    val_preds = model.predict(X_valid)
    scores = {"Training MAE": mean_absolute_error(y_train, train_preds),
              "Valid MAE": mean_absolute_error(y_valid, val_preds),
              "Training RMSLE": rmsle(y_train, train_preds),
              "Valid RMSLE": rmsle(y_valid, val_preds),
              "Training R^2": model.score(X_train, y_train),
              "Valid R^2": model.score(X_valid, y_valid)}
    return scores

### Testing our model on a subset (to tune the hyperparameters)
Retraining an entire model would take far too long for us to keep experimenting as fast as we want to.

So what we'll do is take a sample of the training set and tune the hyperparameters on that before training a larger model.

If your experiments are taking longer than 10 seconds (give or take how long you're willing to wait), you should be trying to speed things up. You can speed things up by sampling less data or using a faster computer.

In [None]:
# This takes too long...

# %%time
# # Retrain a model on training data
# model.fit(X_train, y_train)
# show_scores(model)

In [None]:
len(X_train)

In [None]:
from sklearn.ensemble import RandomForestRegressor

Depending on your computer (mine is a MacBook Pro), making calculations on ~400,000 rows may take a while...

Let's alter the number of samples each estimator in the RandomForestRegressor sees using the max_samples parameter.

In [None]:
# Change max samples in RandomForestRegressor
model = RandomForestRegressor(n_jobs=-1,
                              max_samples=10000)

Setting max_samples to 10000 means each of the n_estimators (default 100) trees in our RandomForestRegressor will only see 10000 random samples from our DataFrame instead of the entire ~400,000.

In other words, each tree looks at 40x fewer samples, which means faster computation, but we should expect our results to worsen (simply because the model has fewer samples to learn patterns from).

In [None]:
%%time
# Cutting down the max number of samples each tree can see improves training time
model.fit(X_train, y_train)


In [None]:
show_scores(model)

Beautiful, that took far less time than the model with all the data.

With this, let's try tuning some hyperparameters.

### Hyperparameter tuning with RandomizedSearchCV
You can increase n_iter to try more combinations of hyperparameters but in our case, we'll try 20 and see where it gets us.

Remember, we're trying to reduce the amount of time it takes between experiments.

In [None]:
%%time
from sklearn.model_selection import RandomizedSearchCV

# Different RandomForestRegressor hyperparameters
rf_grid = {"n_estimators": np.arange(10, 100, 10),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(1, 20, 2),
           "max_features": [0.5, 1.0, "sqrt"], # Note: "max_features='auto'" is equivalent to "max_features=1.0", as of Scikit-Learn version 1.1
           "max_samples": [10000]}

rs_model = RandomizedSearchCV(RandomForestRegressor(),
                              param_distributions=rf_grid,
                              n_iter=20,
                              cv=5,
                              verbose=True)

rs_model.fit(X_train, y_train)
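
To get a feel for how small a sample n_iter=20 is, the sketch below counts the combinations in the search space. It repeats the rf_grid from the cell above so that it runs on its own. The variable name `n_combinations` is just for illustration.

```python
import numpy as np

# Copy of the search space used above
rf_grid = {"n_estimators": np.arange(10, 100, 10),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(1, 20, 2),
           "max_features": [0.5, 1.0, "sqrt"],
           "max_samples": [10000]}

# Total number of distinct hyperparameter combinations in the grid
n_combinations = int(np.prod([len(values) for values in rf_grid.values()]))

n_iter, cv = 20, 5
print(n_combinations)  # size of the full grid
print(n_iter * cv)     # model fits performed by the search, before the final refit
```

With 20 iterations the search only tries a small fraction of the grid, which is the trade-off we accept to keep experiments fast.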

In [None]:
# Find the best parameters from the RandomizedSearch 
rs_model.best_params_

In [None]:
# Evaluate the RandomizedSearch model
show_scores(rs_model)

### Train a model with the best parameters
In a model I prepared earlier, I tried 100 different combinations of hyperparameters (setting n_iter to 100 in RandomizedSearchCV) and found the best results came from the ones you see below.

Note: This kind of search on my computer (n_iter = 100) took ~2-hours. So it's kind of a set and come back later experiment.

We'll instantiate a new model with these discovered hyperparameters and reset the max_samples back to its original value

In [None]:
%%time
# Most ideal hyperparameters
ideal_model = RandomForestRegressor(n_estimators=90,
                                    min_samples_leaf=1,
                                    min_samples_split=14,
                                    max_features=0.5,
                                    n_jobs=-1,
                                    max_samples=None)
ideal_model.fit(X_train, y_train)

In [None]:

show_scores(ideal_model)

With these new hyperparameters as well as using all the samples, we can see an improvement to our model's performance.

You can make a faster model by altering some of the hyperparameters. Particularly by lowering n_estimators since each increase in n_estimators is basically building another small model.

However, lowering of n_estimators or altering of other hyperparameters may lead to poorer results.

In [None]:
%%time
# Faster model
fast_model = RandomForestRegressor(n_estimators=40,
                                   min_samples_leaf=3,
                                   max_features=0.5,
                                   n_jobs=-1)
fast_model.fit(X_train, y_train)

In [None]:
show_scores(fast_model)

### Make predictions on test data
Now we've got a trained model, it's time to make predictions on the test data.

Remember what we've done.

Our model is trained on data up to the end of 2011. However, the test data is from May 1, 2012 to November 2012.

So what we're doing is trying to use the patterns our model has learned in the training data to predict the sale price of a Bulldozer with characteristics it's never seen before but are assumed to be similar to that of those in the training data.

In [None]:
df_test = pd.read_csv("Test.csv",
                      parse_dates=["saledate"])
df_test.head()


Ahhh... the test data isn't in the same format as our other data, so we have to fix it. Let's create a function to preprocess our data.

### Preprocessing the test data
Our model has been trained on data formatted in the same way as the training data.

This means in order to make predictions on the test data, we need to take the same steps we used to preprocess the training data to preprocess the test data.

Remember: Whatever you do to the training data, you have to do to the test data.

Let's create a function for doing so (by copying the preprocessing steps we used above).

In [None]:
def preprocess_data(df):
    """
    Performs transformations on df and returns transformed df.
    """
    df["saleYear"] = df.saledate.dt.year
    df["saleMonth"] = df.saledate.dt.month
    df["saleDay"] = df.saledate.dt.day
    # Column names must match the training features exactly
    df["saleDayofweek"] = df.saledate.dt.dayofweek
    df["saleDayofYear"] = df.saledate.dt.dayofyear
    
    df.drop("saledate", axis=1, inplace=True)
    
    # Fill the numeric rows with median
    for label, content in df.items():
        if pd.api.types.is_numeric_dtype(content):
            if pd.isnull(content).sum():
                # Add a binary column which tells us if the data was missing or not
                df[label+"_is_missing"] = pd.isnull(content)
                # Fill missing numeric values with median
                df[label] = content.fillna(content.median())
    
        # Fill categorical missing data and turn categories into numbers
        if not pd.api.types.is_numeric_dtype(content):
            df[label+"_is_missing"] = pd.isnull(content)
            # We add +1 to the category code because pandas encodes missing categories as -1
            df[label] = pd.Categorical(content).codes+1
    
    return df

Question: Where would this function break?

Hint: What if the test data had different missing values to the training data?

Now we've got a function for preprocessing data, let's preprocess the test dataset into the same format as our training dataset.

In [None]:
df_test = preprocess_data(df_test)
df_test.head()

In [None]:
# Make predictions on updated test data
test_preds = ideal_model.predict(df_test)

In [None]:
X_train.head()

We've found an error, and it's because our test dataset (after preprocessing) has 101 columns whereas our training dataset (X_train) has 102 columns (after preprocessing).

Let's find the difference.

In [None]:
# We can find how the columns differ using sets
set(X_train.columns) - set(df_test.columns)

In this case, it's because the test dataset wasn't missing any auctioneerID fields.

To fix it, we'll add a column to the test dataset called auctioneerID_is_missing and fill it with False, since none of the auctioneerID fields are missing in the test dataset.

In [None]:
# Match test dataset columns to training dataset
df_test["auctioneerID_is_missing"] = False
df_test=df_test.reindex(columns=list(X_train.columns))
df_test.head()
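
The fix above hard-codes a single column name. A more general sketch is shown below on toy data. The helper `add_missing_flags` and the toy values are assumptions for illustration, not part of the original notebook. It lets `reindex` add any flag column that is absent from the test set and fill it with False.

```python
import pandas as pd

def add_missing_flags(df):
    """Hypothetical helper: add a <column>_is_missing flag only where values are missing."""
    df = df.copy()
    for label in list(df.columns):
        content = df[label]
        if pd.isnull(content).sum():
            df[label + "_is_missing"] = pd.isnull(content)
            df[label] = content.fillna(content.median())
    return df

# Toy data: the training set has a missing auctioneerID, the test set does not
toy_train = add_missing_flags(pd.DataFrame({"auctioneerID": [1.0, None, 3.0]}))
toy_test = add_missing_flags(pd.DataFrame({"auctioneerID": [2.0, 4.0]}))

# The flag column only exists where something was actually missing
print(set(toy_train.columns) - set(toy_test.columns))

# Align the test columns (names and order) with the training columns;
# columns absent from the test set are created and filled with False
toy_test = toy_test.reindex(columns=toy_train.columns, fill_value=False)
print(toy_test.columns.tolist())
```

Filling with False is only appropriate for the `_is_missing` flag columns. A feature column that is entirely absent from the test set would need a different treatment.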

In [None]:
df_test.head()


There's one more step we have to do before we can make predictions on the test data.

And that's to line up the columns (the features) in our test dataset to match the columns in our training dataset.

As in, the order of the columns in the training dataset should match the order of the columns in our test dataset.

Note: As of Scikit-Learn 1.2, the order of columns that were fit on should match the order of columns that are predicted on.

Now the test dataset column names and column order matches the training dataset, we should be able to make predictions on it using our trained model.

In [None]:
# Make predictions on the test dataset using the best model
test_preds = ideal_model.predict(df_test)

When looking at the Kaggle submission requirements, we see that if we wanted to make a submission, the data is required to be in a certain format. Namely, a DataFrame containing the SalesID and the predicted SalePrice of the bulldozer.

Let's make it.

In [None]:
# Create DataFrame compatible with Kaggle submission requirements
df_preds = pd.DataFrame()
df_preds["SalesID"] = df_test["SalesID"]
df_preds["SalePrice"] = test_preds
df_preds

In [None]:
# Export to csv...
#df_preds.to_csv("../data/bluebook-for-bulldozers/predictions.csv",
#                index=False)

### Feature Importance
Since we've built a model which is able to make predictions, the people you share these predictions with (or you yourself) might be curious about which parts of the data led to these predictions.

This is where feature importance comes in. Feature importance seeks to figure out which different attributes of the data were most important when it comes to predicting the target variable.

In our case, after our model learned the patterns in the data, which bulldozer sale attributes were most important for predicting its overall sale price?

Beware: the default feature importances for random forests can lead to non-ideal results.

To find which features were most important for a machine learning model, a good idea is to search something like "[MODEL NAME] feature importance".

Doing this for our RandomForestRegressor leads us to find the feature_importances_ attribute.

Let's check it out.

In [None]:
# Find feature importance of our best model
ideal_model.feature_importances_

In [None]:
# Helper function for plotting feature importance
def plot_features(columns, importances, n=20):
    df = (pd.DataFrame({"features": columns,
                        "feature_importances": importances})
          .sort_values("feature_importances", ascending=False)
          .reset_index(drop=True))
    
    # Plot the dataframe
    fig, ax = plt.subplots()
    ax.barh(df["features"][:n], df["feature_importances"][:n])
    ax.set_ylabel("Features")
    ax.set_xlabel("Feature importance")
    ax.invert_yaxis()

In [None]:
plot_features(X_train.columns, ideal_model.feature_importances_)
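
The warning earlier in this section notes that the default (impurity-based) importances can be misleading. One commonly used alternative is permutation importance from Scikit-Learn's `sklearn.inspection` module, measured on held-out data. The sketch below uses synthetic data where only the first feature matters. The data and the names `X_demo`, `y_demo` and `demo_model` are assumptions for illustration, not the bulldozer model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: only feature 0 influences the target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 3))
y_demo = 5 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

# Fit a small forest on the first 200 rows
demo_model = RandomForestRegressor(n_estimators=20, random_state=42)
demo_model.fit(X_demo[:200], y_demo[:200])

# Shuffle each feature on the held-out rows and measure how much the score drops
result = permutation_importance(demo_model, X_demo[200:], y_demo[200:],
                                n_repeats=5, random_state=42)
print(result.importances_mean)
```

On the bulldozer data the same call could be applied to ideal_model with X_valid and y_valid, although it may take a while on a dataset of that size.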

In [None]:
df["Enclosure"].value_counts()

Question to finish: Why might knowing the feature importances of a trained machine learning model be helpful?

To do: What other machine learning models could you try on our dataset? Hint: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html check out the regression section of this map, or try to look at something like CatBoost.ai or XGBoost.ai.