# Vehicle Noise Classification (UK GOV dataset)

**Objective:** Predict whether a vehicle is **Quiet** or **Noisy** from engine and emission features in the UK Government fuel/emissions data.

## 1. Setup & Imports

Import core libraries and configure display options.

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

pd.set_option('display.max_columns', None)
sns.set_theme()
print("Environment ready.")

Environment ready.


## 2. Load Data

Load the main vehicle table (`data/Euro_6_latest.csv`, in the `data` folder one level above the notebook's working directory) and preview the schema.

In [12]:
base_dir = Path().resolve().parent
csv_path = base_dir / "data" / "Euro_6_latest.csv"

df = pd.read_csv(csv_path, low_memory=False, encoding='latin1')
print(df.shape)
df.info()
df.head(3)

(4197, 45)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4197 entries, 0 to 4196
Data columns (total 45 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   Manufacturer                           4197 non-null   object 
 1   Model                                  4197 non-null   object 
 2   Description                            4197 non-null   object 
 3   Transmission                           3563 non-null   object 
 4   Manual or Automatic                    4197 non-null   object 
 5   Engine Capacity                        4195 non-null   float64
 6   Fuel Type                              4197 non-null   object 
 7   Powertrain                             4197 non-null   object 
 8   Engine Power (PS)                      4178 non-null   float64
 9   Engine Power (Kw)                      4189 non-null   float64
 10  Electric energy consumption Miles/kWh  2597 non-null   float6

Unnamed: 0,Manufacturer,Model,Description,Transmission,Manual or Automatic,Engine Capacity,Fuel Type,Powertrain,Engine Power (PS),Engine Power (Kw),Electric energy consumption Miles/kWh,wh/km,Maximum range (Km),Maximum range (Miles),Euro Standard,Diesel VED Supplement,Testing Scheme,WLTP Imperial Low,WLTP Imperial Medium,WLTP Imperial High,WLTP Imperial Extra High,WLTP Imperial Combined,WLTP Imperial Combined (Weighted),WLTP Metric Low,WLTP Metric Medium,WLTP Metric High,WLTP Metric Extra High,WLTP Metric Combined,WLTP Metric Combined (Weighted),WLTP CO2,WLTP CO2 Weighted,Equivalent All Electric Range Miles,Equivalent All Electric Range KM,Electric Range City Miles,Electric Range City Km,Emissions CO [mg/km],THC Emissions [mg/km],Emissions NOx [mg/km],THC + NOx Emissions [mg/km],Particulates [No.] [mg/km],RDE NOx Urban,RDE NOx Combined,Noise Level dB(A),Date of change,Unnamed: 44
0,ABARTH,500e MY25,114kW Electric (VL),A1,Electric - Not Applicable,0.0,Electricity,Battery Electric Vehicle (BEV) / Pure Electric...,155.0,114.0,3.6,172.0,264.0,164.0,Euro 6-WLTP (for BEVs only),False,WLTP,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,,223.0,359.0,0.0,,,,,,,68.0,07 October 2025,
1,ABARTH,500e MY25,114kW Electric (VH),A1,Electric - Not Applicable,0.0,Electricity,Battery Electric Vehicle (BEV) / Pure Electric...,155.0,114.0,3.3,187.0,244.0,152.0,Euro 6-WLTP (for BEVs only),False,WLTP,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,,205.0,330.0,0.0,,,,,,,68.0,07 October 2025,
2,ABARTH,500e MY25,114kW Electric Convertible (VL),A1,Electric - Not Applicable,0.0,Electricity,Battery Electric Vehicle (BEV) / Pure Electric...,155.0,114.0,3.6,172.0,264.0,164.0,Euro 6-WLTP (for BEVs only),False,WLTP,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,,223.0,359.0,0.0,,,,,,,68.0,07 October 2025,
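The file name is hard-coded in the cell above. If the name changes between data releases, one alternative is to pick the largest CSV in the data folder, on the assumption that the main vehicle table is the biggest file. A minimal sketch of that idea follows; `largest_csv` is a hypothetical helper, not part of the notebook, and the demo files are made up.

```python
from pathlib import Path
import tempfile


def largest_csv(data_dir: Path) -> Path:
    """Return the largest *.csv file (by size on disk) directly inside data_dir."""
    candidates = list(Path(data_dir).glob("*.csv"))
    if not candidates:
        raise FileNotFoundError(f"No CSV files found in {data_dir}")
    return max(candidates, key=lambda p: p.stat().st_size)


# Self-contained demo: a temporary folder with two hypothetical files.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "small.csv").write_text("a,b\n1,2\n")
    (tmp_dir / "big.csv").write_text("a,b\n" + "1,2\n" * 100)
    chosen_name = largest_csv(tmp_dir).name

print(chosen_name)
```

In the notebook this would be called as `largest_csv(base_dir / "data")`. An explicit file name is easier to reproduce, so the heuristic is only worth it when the name is genuinely unstable.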


## 3. Normalise Column Names

Standardise headers to lowercase snake_case and map common variants to consistent names used later.

In [None]:
df.columns = (
    df.columns
      .str.strip()
      .str.lower()
      .str.replace(r"[^\w]+", "_", regex=True)
)

rename_map = {
    "engine_capacity": "engine_cc",
    "manual_or_automatic": "gear_type",
    "wltp_co2": "co2_gkm",
    # Despite the "mpg" prefix, the metric column holds L/100 km (see the cleaning step).
    "wltp_metric_combined": "mpg_metric_combined",
    "wltp_imperial_combined": "mpg_imperial_combined",
    # "Emissions NOx [mg/km]" normalises with a trailing underscore (from the closing "]").
    "emissions_nox_mg_km_": "nox_mgkm",
    "manufacturer": "make",
}
# rename() ignores keys that are absent, so no per-key membership check is needed.
df = df.rename(columns=rename_map)

# Detect the noise column by name fragments rather than an exact header match.
noise_candidates = [c for c in df.columns if ("noise" in c and "db" in c)]
if noise_candidates:
    df = df.rename(columns={noise_candidates[0]: "noise_db"})

cols_set = set(df.columns)
required = {"noise_db", "fuel_type", "engine_cc", "co2_gkm"}
missing = sorted(required - cols_set)
print("Missing required columns:", missing)

expected = {
    "make","model","engine_cc","fuel_type","transmission","gear_type",
    "co2_gkm","mpg_metric_combined","mpg_imperial_combined","noise_db","nox_mgkm"
}
print("Present (expected ∩ actual):", sorted(expected & cols_set))
print("Detected noise column candidates:", noise_candidates)

Missing required columns: []
Present (expected ∩ actual): ['co2_gkm', 'engine_cc', 'fuel_type', 'gear_type', 'make', 'model', 'mpg_imperial_combined', 'mpg_metric_combined', 'noise_db', 'nox_mgkm', 'transmission']
Detected noise column candidates: ['noise_level_db_a_']
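The detected noise column ends in an underscore (`noise_level_db_a_`) because the regex turns every run of non-word characters into `_`, including a closing bracket at the end of a header. The sketch below reproduces that on three headers taken from the raw table and shows one optional way to trim the edges. The extra `.str.strip("_")` step is a suggestion, not something the notebook does.

```python
import pandas as pd

# Three headers copied from the raw table preview.
raw_headers = pd.Index(["Noise Level dB(A)", "Emissions NOx [mg/km]", "WLTP CO2"])

# Same normalisation chain as the notebook cell.
normalised = (
    raw_headers
      .str.strip()
      .str.lower()
      .str.replace(r"[^\w]+", "_", regex=True)
)
print(list(normalised))

# Optional extra step (an assumption, not in the notebook): trim edge underscores.
trimmed = normalised.str.strip("_")
print(list(trimmed))
```

Any key in `rename_map` has to match the normalised form exactly, so headers ending in `)` or `]` need the trailing underscore unless the names are trimmed first.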


## 4. Build Working Subset & Clean

Keep only the columns relevant to the noise classification task and ensure all numeric fields are valid.

In [21]:
keep_cols = [c for c in [
    "noise_db", "fuel_type", "engine_cc", "co2_gkm", "mpg_metric_combined", "mpg_imperial_combined", "gear_type", "transmission"
] if c in df.columns]

work = df[keep_cols].copy()

# Drop rows missing critical fields
work = work.dropna(subset=["noise_db", "fuel_type", "engine_cc", "co2_gkm"])

# Coerce numeric fields
for col in ["noise_db", "engine_cc", "co2_gkm", "mpg_metric_combined", "mpg_imperial_combined"]:
    if col in work.columns:
        work[col] = pd.to_numeric(work[col], errors="coerce")

# Drop rows with a NaN in any kept column (including non-numeric ones such as transmission)
work = work.dropna()

print("shape after cleaning: ", work.shape)
display(work.describe(include="all").T.head(10))

shape after cleaning:  (3563, 8)


| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| noise_db | 3563.0 | | | | 55.382627 | 25.966766 | 0.0 | 65.0 | 67.0 | 68.0 | 77.0 |
| fuel_type | 3563.0 | 8.0 | Petrol Electric | 1215.0 | | | | | | | |
| engine_cc | 3563.0 | | | | 1689.610441 | 881.649969 | 0.0 | 1199.0 | 1499.0 | 1993.0 | 6749.0 |
| co2_gkm | 3563.0 | | | | 127.484704 | 65.214456 | 0.0 | 114.0 | 132.0 | 152.0 | 453.0 |
| mpg_metric_combined | 3563.0 | | | | 5.574712 | 2.980662 | 0.0 | 4.9 | 5.7 | 6.5 | 42.8 |
| mpg_imperial_combined | 3563.0 | | | | 40.261325 | 18.589255 | 0.0 | 35.75 | 45.6 | 51.75 | 74.3 |
| gear_type | 3563.0 | 3.0 | Automatic | 2438.0 | | | | | | | |
| transmission | 3563.0 | 44.0 | M6 | 567.0 | | | | | | | |
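The coercion step relies on `pd.to_numeric(..., errors="coerce")`, which turns anything unparseable into `NaN` so that a later `dropna()` can remove it. A small sketch on made-up values (the toy frame is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical values: two valid readings, one text placeholder, one missing entry.
toy = pd.DataFrame({
    "noise_db": ["67", "n/a", "70.5", None],
    "fuel_type": ["Petrol", "Diesel", "Petrol", "Diesel"],
})

# Unparseable strings and None both become NaN instead of raising an error.
toy["noise_db"] = pd.to_numeric(toy["noise_db"], errors="coerce")
n_missing = int(toy["noise_db"].isna().sum())

# dropna() with no arguments removes a row if any column is NaN.
cleaned = toy.dropna()
print(n_missing, len(cleaned))
```

Because `dropna()` without a `subset` looks at every column, a missing value in a categorical column removes the row just as a failed numeric conversion does.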


Next, remove invalid zeros and implausible values from the numeric columns.

In [22]:
clean = work.copy()

# Apply a minimum plausible value to each key numeric column (this also removes zeros)
for col, min_valid in [
    ("noise_db", 60),    # type-approved exterior pass-by noise is typically ~60-75 dB(A)
    ("engine_cc", 900),  # <900 cc is unlikely in this table; below ~1 L is typically JDM (Japanese domestic market)
    ("co2_gkm", 1),      # 0 g/km indicates an EV; the features here are engine-based, so those rows are excluded
]:
    if col in clean.columns:
        clean = clean[clean[col] >= min_valid]

# Metric fuel consumption is in L/100 km: must be > 0 and sensible (< 25 L/100 km is generous)
if "mpg_metric_combined" in clean.columns:
    clean = clean[(clean["mpg_metric_combined"] > 0) & (clean["mpg_metric_combined"] < 25)]

# Imperial mpg: keep within a plausible 10-100 mpg window
if "mpg_imperial_combined" in clean.columns:
    clean = clean[(clean["mpg_imperial_combined"] >= 10) & (clean["mpg_imperial_combined"] <= 100)]

print("Shape before: ", work.shape, "after: ", clean.shape)
display(clean.describe(include="all").T.loc[["noise_db","engine_cc","co2_gkm","mpg_metric_combined","mpg_imperial_combined"]].dropna(how="all"))

Shape before:  (3563, 8) after:  (2442, 8)


| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| noise_db | 2442.0 | | | | 67.620925 | 2.202018 | 63.0 | 66.0 | 67.0 | 69.0 | 77.0 |
| engine_cc | 2442.0 | | | | 1848.282965 | 856.386267 | 998.0 | 1199.0 | 1598.0 | 1998.0 | 6749.0 |
| co2_gkm | 2442.0 | | | | 150.865684 | 48.020189 | 33.0 | 123.0 | 140.0 | 160.0 | 453.0 |
| mpg_metric_combined | 2442.0 | | | | 6.575143 | 2.170194 | 3.8 | 5.4 | 6.0 | 6.9 | 23.7 |
| mpg_imperial_combined | 2442.0 | | | | 46.058804 | 10.662516 | 11.9 | 40.9 | 46.3 | 52.3 | 74.3 |
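The filters reduce the table from 3563 to 2442 rows, so it can help to see which rule is responsible for most of the loss. The sketch below counts failures per rule before anything is dropped. It reuses the three minimum thresholds from the cell above, but the toy rows are made up purely for illustration.

```python
import pandas as pd

# Hypothetical rows: an EV-like row of zeros, a valid car, a small engine, a zero-CO2 row.
toy = pd.DataFrame({
    "noise_db":  [0.0, 67.0, 70.0, 68.0],
    "engine_cc": [0.0, 1499.0, 650.0, 1998.0],
    "co2_gkm":   [0.0, 132.0, 110.0, 0.0],
})

# Same minimum thresholds as the cleaning cell.
min_valid = {"noise_db": 60, "engine_cc": 900, "co2_gkm": 1}

# One boolean column per rule: True where the row fails that rule.
fails = pd.DataFrame({col: toy[col] < lo for col, lo in min_valid.items()})

rows_failing_each_rule = fails.sum().to_dict()
rows_kept = int((~fails.any(axis=1)).sum())
print(rows_failing_each_rule, rows_kept)
```

A row can fail more than one rule, so the per-rule counts may add up to more than the number of rows actually removed. Running the same audit on `work` instead of `toy` would show how much of the drop comes from each threshold.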
