# Data Science Project Based on Australian Vehicle Prices Dataset

Jane Citizen 40987654

## Dataset Description:

This dataset contains the latest information on car prices in Australia for the year 2023. It covers various brands, models, types, and features of cars sold in the Australian market. It provides useful insights into the trends and factors influencing the car prices in Australia. The dataset includes information such as brand, year, model, car/suv, title, used/new, transmission, engine, drive type, fuel type, fuel consumption, kilometres, colour (exterior/interior), location, cylinders in engine, body type, doors, seats, and price. The dataset has over 16,000 records of car listings from various online platforms in Australia.

* Brand: Name of the car manufacturer
* Year: Year of manufacture or release
* Model: Name or code of the car model
* Car/Suv: Type of the car (car or suv)
* Title: Title or description of the car
* UsedOrNew: Condition of the car (used or new)
* Transmission: Type of transmission (manual or automatic)
* Engine: Engine capacity or power (in litres or kilowatts)
* DriveType: Type of drive (front-wheel, rear-wheel, or all-wheel)
* FuelType: Type of fuel (petrol, diesel, hybrid, or electric)
* FuelConsumption: Fuel consumption rate (in litres per 100 km)
* Kilometres: Distance travelled by the car (in kilometres)
* ColourExtInt: Colour of the car (exterior and interior)
* Location: Location of the car (city and state)
* CylindersinEngine: Number of cylinders in the engine
* B odyType: Shape or style of the car body (sedan, hatchback, coupe, etc.)
* Doors: Number of doors in the car
* Seats: Number of seats in the car
* Price: Price of the car (in Australian dollars)


## AIM:

We would like to predict the price of the car base on the cars' features (e.g. manufacture year, transmission, engine). Meanwhile, we we like to compare the performance of different regresion models.


## Import Libraries

Here we import all the libraries we need.

In [64]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")


from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Random seed for reproducibility
student_id = 48726591

# Data Loading
file_csv = "australian_vehicle_prices.csv"
data = pd.read_csv(file_csv)

# Data Cleaning and Missing Value Imputation
data.replace(['-', 'POA'], np.nan, inplace=True)
data['Location'] = data['Location'].str.replace('AU-VIC', 'VIC')

# Impute numerical columns
numerical_imputer = SimpleImputer(strategy='mean')
numerical_columns = ['FuelConsumption', 'Kilometres', 'CylindersinEngine', 'Doors', 'Seats', 'Displacement']


# Impute categorical columns
data[['Transmission', 'DriveType', 'FuelType', 'Location']] = data[['Transmission', 'DriveType', 'FuelType', 'Location']].fillna('unknown')

# Encode categorical variables
categorical_features = ['Transmission', 'DriveType', 'FuelType', 'Location']
encoder = OrdinalEncoder()
data[categorical_features] = encoder.fit_transform(data[categorical_features])

# Feature Selection and Data Splitting
features = ['Year', 'Transmission', 'DriveType', 'FuelConsumption', 'Kilometres', 'CylindersinEngine', 'Doors', 'Seats', 'Displacement']

y = data['Price']
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

from sklearn.preprocessing import OrdinalEncoder, SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Random seed for reproducibility
student_id = 48726591

# Data Loading
file_csv = "australian_vehicle_prices.csv"
data = pd.read_csv(file_csv)

# Data Cleaning and Missing Value Imputation
data.replace(['-', 'POA'], np.nan, inplace=True)
data['Location'] = data['Location'].str.replace('AU-VIC', 'VIC')

# Impute numerical columns
numerical_imputer = SimpleImputer(strategy='mean')
numerical_columns = ['FuelConsumption', 'Kilometres', 'CylindersinEngine', 'Doors', 'Seats', 'Displacement']
data[numerical_columns] = numerical_imputer.fit_transform(data[numerical_columns])

# Impute categorical columns
data[['Transmission', 'DriveType', 'FuelType', 'Location']] = data[['Transmission', 'DriveType', 'FuelType', 'Location']].fillna('unknown')

# Encode categorical variables
categorical_features = ['Transmission', 'DriveType', 'FuelType', 'Location']
encoder = OrdinalEncoder()
data[categorical_features] = encoder.fit_transform(data[categorical_features])

# Feature Selection and Data Splitting
features = ['Year', 'Transmission', 'DriveType', 'FuelConsumption', 'Kilometres', 'CylindersinEngine', 'Doors', 'Seats', 'Displacement']
X = data[features]
y = data['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=student_id)

# Model Training and Evaluation
def evaluate_model(y_true, y_pred, model_name):
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)

    print(f"{model_name} Performance:")
    print(f"Mean Absolute Error (MAE): {mae:.2f}")
    print(f"Mean Squared Error (MSE): {mse:.2f}")
    print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
    print(f"R-squared (R²): {r2:.2f}")
    print("-" * 40)

# Initialize and evaluate models
models = {
    "Linear Regression": linear_model.LinearRegression(),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=student_id),
    "MLP Regressor": MLPRegressor(max_iter=500, random_state=student_id)
}

for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    evaluate_model(y_test, y_pred, model_name)


# Model Training and Evaluation
def evaluate_model(y_true, y_pred, model_name):
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)

    print(f"{model_name} Performance:")
    print(f"Mean Absolute Error (MAE): {mae:.2f}")
    print(f"Mean Squared Error (MSE): {mse:.2f}")
    print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
    print(f"R-squared (R²): {r2:.2f}")
    print("-" * 40)

# Initialize and evaluate models
models = {
    "Linear Regression": linear_model.LinearRegression(),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=student_id),
    "MLP Regressor": MLPRegressor(max_iter=500, random_state=student_id)
}

for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    evaluate_model(y_test, y_pred, model_name)


ValueError: Found input variables with inconsistent numbers of samples: [14197, 16734]

As the performance of training set and testing set are similar, we can say it is not overfitting.

## Analysis

According to the result we have, among these three models (LR, DTR and MLP) the best option for this dataset is DTR. However, the releationships among the features within this dataset are not obvious for these models to catch, thus all the performance are not very satisfying. We might need to further clean the data (*e.g.* remove outliers) or deploy deep learning models for the prediction.