# Bulldozer Sale Price Prediction

A machine learning project to predict the future sale price of bulldozers at auction based on their characteristics and historical data.

## Problem Definition

**Objective:** How well can we predict the future sale price of a bulldozer, given its characteristics and previous sale prices?

This is a regression problem where we aim to estimate the continuous value (sale price) of heavy equipment at auction.


## Dataset Description

The dataset is split into three files covering different time periods:

| File | Description | Time Period |
|------|-------------|-------------|
| **Train.csv** | Training set for model development | Data through end of 2011 |
| **Valid.csv** | Validation set for model tuning | January 1, 2012 - April 30, 2012 |
| **Test.csv** | Test set for final evaluation | May 1, 2012 - November 2012 |
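
The chronological split in the table above can be sketched in pandas. This is a minimal illustration, assuming all sales sit in one DataFrame with an already parsed `saledate` column; the helper name `split_by_sale_date` and the toy rows are hypothetical and not part of the original files.

```python
import pandas as pd


def split_by_sale_date(df, valid_start, test_start, date_col="saledate"):
    """Hypothetical helper: split rows chronologically into train/valid/test."""
    dates = df[date_col]
    train_part = df[dates < valid_start]                             # before the validation window
    valid_part = df[(dates >= valid_start) & (dates < test_start)]   # validation window
    test_part = df[dates >= test_start]                              # everything afterwards
    return train_part, valid_part, test_part


# Toy rows with made-up prices, only to exercise the cutoffs
sales = pd.DataFrame(
    {
        "saledate": pd.to_datetime(
            ["2011-11-02", "2011-12-30", "2012-01-15", "2012-04-30", "2012-05-10"]
        ),
        "SalePrice": [10500, 9000, 12000, 11000, 9500],
    }
)

train_part, valid_part, test_part = split_by_sale_date(
    sales,
    valid_start=pd.Timestamp("2012-01-01"),
    test_start=pd.Timestamp("2012-05-01"),
)
print(len(train_part), len(valid_part), len(test_part))
```

Splitting by date rather than at random keeps every validation and test sale later than the training sales, which matches the goal of predicting future prices.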


## Evaluation Metric

**Root Mean Squared Logarithmic Error (RMSLE)**

This metric suits the problem because it:
- Penalizes underestimation more heavily than overestimation of the same absolute size
- Compresses the wide range of sale prices through the logarithm
- Measures relative (ratio) errors rather than absolute differences
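
The properties listed above can be checked numerically. The following is a minimal sketch of RMSLE with NumPy, assuming the common `log(1 + x)` form of the metric; the function name `rmsle` and the sample prices are illustrative and not taken from the dataset.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error for non-negative values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) computes log(1 + x), which stays finite when a value is 0
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff**2)))


# Same absolute error (5,000 USD) at two different price levels
cheap = rmsle([10_000], [15_000])
expensive = rmsle([100_000], [105_000])

# Same absolute error in opposite directions
under = rmsle([10_000], [5_000])
over = rmsle([10_000], [15_000])

print(f"cheap={cheap:.3f}, expensive={expensive:.3f}")
print(f"under={under:.3f}, over={over:.3f}")
```

In this sketch the 5,000 USD error costs more on the cheaper machine (relative error), and predicting 5,000 USD too low costs more than predicting 5,000 USD too high.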

## Key Features

### Core Identifiers
- **SalesID**: Unique identifier for each sale transaction
- **MachineID**: Unique identifier for each machine (machines can have multiple sales)
- **ModelID**: Identifier for machine model (links to fiModelDesc)

### Time & Usage
- **YearMade**: Manufacturing year
- **saledate**: Date of auction sale
- **MachineHoursCurrentMeter**: Current usage in hours at time of sale
- **UsageBand**: Usage category (Low/Medium/High) relative to model average

### Sale Information
- **SalePrice**: 💰 **TARGET VARIABLE** - Sale price in USD
- **datasource**: Source of the sale record
- **auctioneerID**: Company that conducted the auction

### Location
- **State**: US state where sale occurred

### Product Classification
- **ProductGroup**: Top-level hierarchical grouping
- **ProductGroupDesc**: Description of product group
- **ProductClassDesc**: Second-level hierarchical grouping
- **fiModelDesc**: Complete model description
- **fiBaseModel**: Base model designation
- **fiSecondaryDesc**: Secondary description
- **fiModelSeries**: Model series
- **fiModelDescriptor**: Model descriptor
- **ProductSize**: Size class grouping

### Machine Configuration

#### Drive & Mobility
- **Drive_System**: 2WD or 4WD configuration
- **Travel_Controls**: Operator control configuration
- **Steering_Controls**: Steering system type

#### Power & Performance
- **Engine_Horsepower**: Engine power rating
- **Turbocharged**: Naturally aspirated or turbocharged
- **Transmission**: Automatic or manual transmission type

#### Hydraulics
- **Hydraulics**: Type of hydraulic system
- **Hydraulics_Flow**: Normal or high flow system
- **Ride_Control**: Optional loader feature for smoother operation

#### Attachments & Implements
- **Forks**: Lifting attachment
- **Stick**: Control type
- **Stick_Length**: Length of digging implement
- **Thumb**: Grabbing attachment
- **Ripper**: Soil tilling implement
- **Scarifier**: Soil conditioning implement
- **Backhoe_Mounting**: Optional backhoe interface
- **Pushblock**: Configuration option
- **Coupler**: Implement interface type
- **Coupler_System**: Implement interface system
- **Pattern_Changer**: Adjustable operator control configuration

#### Blade Configuration
- **Blade_Type**: Type of blade
- **Blade_Width**: Width measurement
- **Blade_Extension**: Extension beyond standard
- **Tip_Control**: Blade control type

#### Undercarriage & Tracks
- **Pad_Type**: Type of treads for crawler machines
- **Track_Type**: Tread configuration
- **Grouser_Tracks**: Ground contact interface
- **Grouser_Type**: Tread type specification
- **Undercarriage_Pad_Width**: Width of crawler treads

#### Wheels & Tires
- **Tire_Size**: Primary tire dimensions
- **Differential_Type**: Locking or standard differential

#### Enclosure
- **Enclosure**: Cab enclosure presence
- **Enclosure_Type**: Type of cab enclosure



In [2]:
# Timestamp
import datetime

print(f"Notebook last run (end-to-end): {datetime.datetime.now()}")

Notebook last run (end-to-end): 2025-12-30 13:58:36.544957


In [3]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

### Loading the Train data and the Data Dictionary

In [5]:
train_url = "https://media.githubusercontent.com/media/jhlopesalves/classic_workflows/refs/heads/main/supervised_learning/regression/bulldozer/data/Train.csv"
train = pd.read_csv(train_url, low_memory=False)
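
Because the files are organized by sale date, it can help to parse `saledate` into a datetime column after loading. The sketch below is one possible approach, not the notebook's own step: it uses a small inline sample with a few columns and rows copied from the training data instead of downloading the file, it assumes the date strings follow the `month/day/year hour:minute` pattern, and the `saleYear` column name is hypothetical.

```python
import io

import pandas as pd

# Inline sample with a few columns and rows from Train.csv (no download needed)
sample_csv = io.StringIO(
    "SalesID,SalePrice,YearMade,saledate\n"
    "1139246,66000,2004,11/16/2006 0:00\n"
    "1139248,57000,1996,3/26/2004 0:00\n"
    "1139251,38500,2001,5/19/2011 0:00\n"
)

sample = pd.read_csv(sample_csv, low_memory=False)

# Convert the text dates to datetime64 using an explicit (assumed) format
sample["saledate"] = pd.to_datetime(sample["saledate"], format="%m/%d/%Y %H:%M")

# Order rows chronologically and derive an illustrative year feature
sample = sample.sort_values("saledate").reset_index(drop=True)
sample["saleYear"] = sample["saledate"].dt.year

print(sample)
```

The same two steps (explicit `pd.to_datetime`, then `sort_values`) could be applied to the full `train` frame once it is loaded.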

In [6]:
train

|  | SalesID | SalePrice | MachineID | ModelID | datasource | auctioneerID | YearMade | MachineHoursCurrentMeter | UsageBand | saledate | ... | Undercarriage_Pad_Width | Stick_Length | Thumb | Pattern_Changer | Grouser_Type | Backhoe_Mounting | Blade_Type | Travel_Controls | Differential_Type | Steering_Controls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1139246 | 66000 | 999089 | 3157 | 121 | 3.0 | 2004 | 68.0 | Low | 11/16/2006 0:00 | ... |  |  |  |  |  |  |  |  | Standard | Conventional |
| 1 | 1139248 | 57000 | 117657 | 77 | 121 | 3.0 | 1996 | 4640.0 | Low | 3/26/2004 0:00 | ... |  |  |  |  |  |  |  |  | Standard | Conventional |
| 2 | 1139249 | 10000 | 434808 | 7009 | 121 | 3.0 | 2001 | 2838.0 | High | 2/26/2004 0:00 | ... |  |  |  |  |  |  |  |  |  |  |
| 3 | 1139251 | 38500 | 1026470 | 332 | 121 | 3.0 | 2001 | 3486.0 | High | 5/19/2011 0:00 | ... |  |  |  |  |  |  |  |  |  |  |
| 4 | 1139253 | 11000 | 1057373 | 17311 | 121 | 3.0 | 2007 | 722.0 | Medium | 7/23/2009 0:00 | ... |  |  |  |  |  |  |  |  |  |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 401120 | 6333336 | 10500 | 1840702 | 21439 | 149 | 1.0 | 2005 |  |  | 11/2/2011 0:00 | ... | None or Unspecified | None or Unspecified | None or Unspecified | None or Unspecified | Double |  |  |  |  |  |
| 401121 | 6333337 | 11000 | 1830472 | 21439 | 149 | 1.0 | 2005 |  |  | 11/2/2011 0:00 | ... | None or Unspecified | None or Unspecified | None or Unspecified | None or Unspecified | Double |  |  |  |  |  |
| 401122 | 6333338 | 11500 | 1887659 | 21439 | 149 | 1.0 | 2005 |  |  | 11/2/2011 0:00 | ... | None or Unspecified | None or Unspecified | None or Unspecified | None or Unspecified | Double |  |  |  |  |  |
| 401123 | 6333341 | 9000 | 1903570 | 21435 | 149 | 2.0 | 2005 |  |  | 10/25/2011 0:00 | ... | None or Unspecified | None or Unspecified | None or Unspecified | None or Unspecified | Double |  |  |  |  |  |
