# Predicting the price of a bulldozer using GridSearchCV

The dataset for this project is available at: https://www.kaggle.com/c/bluebook-for-bulldozers/data
It has been divided into three parts:

Train.csv is the training set, which contains data through the end of 2011.
Valid.csv is the validation set, which contains data from January 1, 2012 - April 30, 2012.
Test.csv is the test set, which contains data from May 1, 2012 - November 2012.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_log_error, mean_absolute_error
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

In [7]:
#load the train and validation dataset
df = pd.read_csv('../input/bluebook-for-bulldozers/TrainAndValid.csv',
                parse_dates=["saledate"], 
                low_memory = False)

In [8]:
#count the missing values in each column of the dataset
df.isna().sum()

In [10]:
#plotting the scatterplot of the date of sale of bulldozer vs. its sale price for first 2000 entries in each respective column
fig,ax = plt.subplots()
ax.scatter(df["saledate"][:2000],df["SalePrice"][:2000]);

In [11]:
df.head()

In [12]:
#sorting the values of the dataset in ascending order with respect to date of sale
df.sort_values(by = ["saledate"],inplace=True, ascending=True)
df.head(30)

In [13]:
#copying the sorted dataset into a new variable so the original stays available for future use
df_new = df.copy()

In [14]:
#adding a few extra columns to the dataframe and getting rid of the date-month-year format
#this step is important because the model cannot use the raw datetime column described above
df_new["saleYear"] = df_new.saledate.dt.year
df_new["saleMonth"] = df_new.saledate.dt.month
df_new["saleDay"] = df_new.saledate.dt.day
df_new["saleDayOfYear"] = df_new.saledate.dt.dayofyear
df_new.drop("saledate",axis=1, inplace=True)
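
As a small illustration of the date decomposition above, here is a sketch on two made-up dates. The toy frame `dates` and its values are assumptions for the example and are not part of the dataset.

```python
import pandas as pd

# Two made-up sale dates, purely for illustration
dates = pd.DataFrame({"saledate": pd.to_datetime(["2012-01-31", "2011-12-25"])})

# The same .dt accessors used in the cell above
dates["saleYear"] = dates.saledate.dt.year
dates["saleMonth"] = dates.saledate.dt.month
dates["saleDay"] = dates.saledate.dt.day
dates["saleDayOfYear"] = dates.saledate.dt.dayofyear

# Drop the raw timestamp once its parts are stored as integer columns
dates = dates.drop("saledate", axis=1)
print(dates)
```

Each new column is a plain integer, which is what lets the regressor consume the sale date at all.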

In [16]:
df_new.head()

The dataset is now sorted in ascending order by sale date, with the date split into numeric columns, so it is ready to be preprocessed.

In [17]:
for label,content in df_new.items():
    
    #in the new dataset, search for columns with string contents and convert them to category type
    
    if pd.api.types.is_string_dtype(content):
        df_new[label] = content.astype("category").cat.as_ordered()

In [18]:
df_new.info()

In [19]:
df_new.to_csv('train_new.csv',
              index = False)

In [20]:
for label,value in df_new.items():
    
    #if a numeric column has missing values, fill them with that column's median
    
    if pd.api.types.is_numeric_dtype(value):
        if pd.isnull(value).sum():
            df_new[label] = value.fillna(value.median())
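
To make the median fill concrete, here is a minimal sketch on a toy frame. The column names `hours` and `model` and their values are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy frame: one numeric column and one string column, each with a gap
toy = pd.DataFrame({
    "hours": [1.0, np.nan, 3.0, 100.0],
    "model": ["a", "b", None, "a"],
})

for label, value in toy.items():
    # Only numeric columns that actually contain missing values are touched
    if pd.api.types.is_numeric_dtype(value) and value.isna().any():
        # The median ignores NaN, so here it is the median of 1, 3 and 100
        toy[label] = value.fillna(value.median())

print(toy)
```

The median is less sensitive than the mean to extreme values such as the 100 in this toy column. The string column is left alone and is handled in a later step.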

In [21]:
#checking for leftover missing values
df_new.isna().sum()

In [23]:
#creating a list of those labels which have missing values
missing = []
for label in df_new:
    if df_new[label].isna().sum():
        missing.append(label)
        
missing

In [24]:
#creating a list of columns with non numeric data type content
not_numeric = []
for label,value in df_new.items():
    if not pd.api.types.is_numeric_dtype(value):
        not_numeric.append(label)
        
not_numeric

By default, pandas assigns the categorical code -1 to missing values. Adding 1 to every code shifts missing values to 0 and keeps all codes non-negative.


In [25]:
for label,value in df_new.items():
    if not pd.api.types.is_numeric_dtype(value):
        
        df_new[label] = pd.Categorical(value).codes+1
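
The effect of the `+1` shift can be checked on a tiny made-up column (the series `sizes` is an assumption for illustration only):

```python
import pandas as pd

# Made-up column with one missing entry
sizes = pd.Series(["Small", None, "Large", "Small"])

# pandas sorts the categories ("Large", "Small") and codes a missing value as -1
raw_codes = pd.Categorical(sizes).codes

# Adding 1 moves every code up by one, so the missing value becomes 0
shifted_codes = raw_codes + 1

print(raw_codes.tolist(), shifted_codes.tolist())
```

Note that the shift applies to every code, not only to the missing ones, so 0 is effectively reserved for "missing".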

In [26]:
df_new.isna().sum().head(40)

Now that we have successfully cleaned our data, we can move on to model it using RandomForestRegressor with GridSearchCV to launch an exhaustive search for finding the best hyperparameters.

# Modelling and hyperparameter tuning
Here we will divide the given dataset into training and validation sets, use grid search CV to find the best hyperparameters, and evaluate the model with RMSLE (root mean squared log error). After that we will use the tuned model to predict prices on the test set.

In [32]:
x = df_new.drop("SalePrice" , axis=1)
x

In [33]:
y = df_new["SalePrice"]
y

In [34]:
#fitting the dataset into the regression model
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_jobs = -1)
model.fit(x , y);

In [31]:
model.score(x,y)

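The random forest scores about 0.987 on the data it was fitted on. Note that model.score returns the coefficient of determination (R^2) for a regressor, not accuracy, and a score measured on the training data is optimistic. Next we split the data into training and validation sets.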
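
For a regressor, `score` returns the coefficient of determination (R^2), and a value computed on the fitting data tends to flatter the model. The sketch below illustrates both points on synthetic data; `X_toy`, `y_toy` and the noise level are assumptions chosen only for the demonstration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic data: a noisy linear signal on the first feature
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(300, 2))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(0, 5.0, size=300)

# Fit on the first 200 rows, hold out the remaining 100
X_fit, y_fit = X_toy[:200], y_toy[:200]
X_held, y_held = X_toy[200:], y_toy[200:]

toy_model = RandomForestRegressor(n_estimators=20, random_state=42)
toy_model.fit(X_fit, y_fit)

# .score() is R^2, the same number r2_score gives for the model's predictions
fit_r2 = toy_model.score(X_fit, y_fit)
held_r2 = toy_model.score(X_held, y_held)
same_as_r2 = np.isclose(fit_r2, r2_score(y_fit, toy_model.predict(X_fit)))

print(fit_r2, held_r2, same_as_r2)
```

The gap between the two scores is the reason a separate validation set is needed before judging the model.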

In [36]:
#validation set as described in the original problem: January 1, 2012 - April 30, 2012.

valid_set = df_new[df_new.saleYear==2012]

#anything other than the data of the year 2012 falls in the training set.

train_set = df_new[df_new.saleYear!= 2012]

#splitting the training and validation data into trainable set and target

x_train , y_train = train_set.drop("SalePrice",axis=1), train_set.SalePrice
x_valid, y_valid = valid_set.drop("SalePrice", axis=1), valid_set.SalePrice

In [37]:
# creating rmsle function to return the root mean squared log error

def rmsle(y_test,y_preds):
    return np.sqrt(mean_squared_log_error(y_test,y_preds))

# creating show_scores function to return the RMSLE on the validation and training sets

def show_scores(model):
    train_preds = model.predict(x_train)
    val_preds = model.predict(x_valid)
    scores = {"Valid RMSLE": rmsle(y_valid, val_preds),
             "Training RMSLE": rmsle(y_train,train_preds)}
    return scores
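
To show what the metric computes, the sketch below evaluates RMSLE by hand and compares it with the scikit-learn helper. The price arrays are made-up values.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Made-up true and predicted prices
y_true = np.array([10000.0, 20000.0, 50000.0])
y_pred = np.array([12000.0, 18000.0, 50000.0])

# RMSLE by hand: root of the mean squared difference of log(1 + value)
manual = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Same quantity via scikit-learn, as in the rmsle function above
library = np.sqrt(mean_squared_log_error(y_true, y_pred))

print(manual, library)
```

Because the error is taken on logarithms, RMSLE measures relative rather than absolute differences: being off by 2,000 on a 10,000 machine costs more than being off by 2,000 on a 50,000 one. `mean_squared_log_error` also requires non-negative values.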

In [38]:
model.fit(x_train,y_train)

In [39]:
show_scores(model)

Here we can clearly see that our model is overfitting, as it performs exceptionally well on the training set compared to the validation set.

GridSearchCV will be used to find the best hyperparameters. However, because it is exhaustive, it may take hours to run even on a powerful machine. Hence, to save time, we will first narrow the search space using RandomizedSearchCV and then apply GridSearchCV.

In [40]:
#making a grid of hyperparameters for RandomizedSearchCV

grid = {"n_estimators": np.arange(10,100,10),
          "max_depth": [None,3,5,10],
          "min_samples_split": np.arange(2,20,2),
          "min_samples_leaf": np.arange(1,20,2),
          # 1.0 (all features) replaces "auto", which scikit-learn 1.3 removed; the integer 1 means a single feature
          "max_features": [0.5,1,"sqrt",1.0],
          "max_samples": [10000]}

# creating a RandomizedSearchCV object by passing in the grid, the number of iterations (n_iter) and the number of folds (cv)

gs_model = RandomizedSearchCV(RandomForestRegressor(n_jobs=-1,
                                                   random_state=42),
                             param_distributions=grid,
                             n_iter = 10,
                             cv=5,
                             verbose=True)
gs_model.fit(x_train,y_train)
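
To put the cost argument in numbers, the sketch below counts how many candidates an exhaustive search over a grid of this shape would have to try. `grid_sketch` is a copy made for the illustration; only the number of values per key matters for the count.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Copy of the randomized-search grid, used here only for counting
grid_sketch = {"n_estimators": np.arange(10, 100, 10),
               "max_depth": [None, 3, 5, 10],
               "min_samples_split": np.arange(2, 20, 2),
               "min_samples_leaf": np.arange(1, 20, 2),
               "max_features": [0.5, 1, "sqrt", 1.0],
               "max_samples": [10000]}

cv = 5
n_combinations = len(ParameterGrid(grid_sketch))  # 9 * 4 * 9 * 10 * 4 * 1

# Cross-validation fits, not counting the final refit on the full training set
exhaustive_fits = n_combinations * cv
randomized_fits = 10 * cv  # n_iter=10 sampled candidates

print(n_combinations, exhaustive_fits, randomized_fits)
```

An exhaustive search here would need 64,800 cross-validation fits, against 50 for the randomized search with `n_iter=10`.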

In [41]:
# extracting the best parameters based on random search
gs_model.best_params_

In [42]:
show_scores(gs_model)

Now we can clearly see that our model is performing similarly on the training set and validation set.
Let's use this information to create another grid that is to be passed in grid search CV.

In [43]:
grid_2 = {"n_estimators": [90],
          "max_depth": [10,15],
          "min_samples_split": [4,6,8],
          "min_samples_leaf": [15],
          # 1.0 (all features) replaces "auto", which scikit-learn 1.3 removed
          "max_features": ["sqrt",1.0],
          "max_samples": [10000]}

In [44]:
# preparing the model for GridSearchCV by passing in grid_2 and the number of folds (cv).
# the number of iterations is not passed because GridSearchCV tries every single combination.
gs_model_2 = GridSearchCV(RandomForestRegressor(n_jobs =-1,
                                                random_state=42),
                          param_grid = grid_2,
                          cv=5,
                          verbose=True)

gs_model_2.fit(x_train,y_train)

In [45]:
gs_model_2.best_params_

In [46]:
show_scores(gs_model_2)
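
Since no `scoring` argument is passed, both searches rank candidates by the estimator's default score (R^2 for `RandomForestRegressor`), while the results are reported as RMSLE. One possible alternative, sketched here on synthetic data, is to have the search optimize RMSLE directly through `make_scorer`. The names `rmsle_scorer`, `toy_grid`, `X_toy` and `y_toy` are assumptions for the example, not part of the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_log_error
from sklearn.model_selection import GridSearchCV

def rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_log_error(y_true, y_pred))

# greater_is_better=False makes the search minimize RMSLE (scores are negated)
rmsle_scorer = make_scorer(rmsle, greater_is_better=False)

# Synthetic, strictly positive "prices"
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 3))
y_toy = 1000 + 500 * X_toy[:, 0] + rng.normal(0, 50, size=200)

# Deliberately tiny grid so the sketch runs in seconds
toy_grid = {"n_estimators": [10, 20], "min_samples_leaf": [1, 5]}

search = GridSearchCV(RandomForestRegressor(random_state=42),
                      param_grid=toy_grid,
                      scoring=rmsle_scorer,
                      cv=3)
search.fit(X_toy, y_toy)

print(search.best_params_, -search.best_score_)
```

With this scorer, `best_score_` is the negated RMSLE, so `-search.best_score_` is the cross-validated RMSLE of the best candidate.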

As is clear from the above results, we were able to improve the performance of our model using grid search CV. We can further improve this by passing in more hyperparameters but that will require a significant amount of running time.

# Predictions on test data

In [48]:
test_set = pd.read_csv("../input/bluebook-for-bulldozers/Test.csv",
                     parse_dates=["saledate"])

In [49]:
#function to preprocess the test data
def preprocess(df):
    df["saleYear"] = df.saledate.dt.year
    df["saleMonth"] = df.saledate.dt.month
    df["saleDay"] = df.saledate.dt.day
    df["saleDayOfYear"] = df.saledate.dt.dayofyear
    df.drop("saledate",axis=1, inplace=True)
    for label,value in df.items():
        if pd.api.types.is_numeric_dtype(value):
            if pd.isnull(value).sum():
                df[label] = value.fillna(value.median())
        else:
            df[label] = pd.Categorical(value).codes+1
            
    return df
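
One thing to watch in a function like this: when category codes are computed separately for each file, the same string can receive different integers in the training and test data. The sketch below shows the effect on made-up columns and one possible way to reuse the training categories. `train_col` and `test_col` are hypothetical, and this is not what the notebook does.

```python
import pandas as pd

# Made-up training and test columns for the same feature
train_col = pd.Series(["Large", "Small", "Medium", None])
test_col = pd.Series(["Small", "Small", "Large"])

# Codes learned from the training column: Large=1, Medium=2, Small=3, missing=0
train_cat = pd.Categorical(train_col)
train_codes = train_cat.codes + 1

# Encoding the test column on its own only sees "Large" and "Small"
independent_codes = pd.Categorical(test_col).codes + 1

# Reusing the training categories keeps the mapping identical across files;
# a value never seen in training would be coded as 0, like a missing value
shared_codes = pd.Categorical(test_col, categories=train_cat.categories).codes + 1

print(train_codes.tolist(), independent_codes.tolist(), shared_codes.tolist())
```

In the independent encoding "Small" becomes 2, whereas the model saw it as 3 during training, so predictions on such columns may be less reliable than the validation scores suggest.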

In [50]:
test_set = preprocess(test_set)

In [51]:
#predict the price of bulldozers on test dataset
test_preds = gs_model_2.predict(test_set)

In [52]:
df_preds = pd.DataFrame()
df_preds["SalesID"] = test_set["SalesID"]
df_preds["SalePrice"] = test_preds
df_preds

In [53]:
df_preds.to_csv("test_prediction.csv", index=False)