# Car Listings Analysis

## Updated Version
This project analyzes scraped data from **100,000 used car listings**, split into one file per car manufacturer. The data was initially collected to build a tool for predicting the selling price of a friend's old car relative to other cars on the market. The idea was then expanded into a more general car value regression model.

## Previous Version
The previous version of this project focused on two common cars on the British market: the **Ford Focus** and the **Mercedes C Class**. The aim was to determine the ideal time to sell a given car, i.e., the age and mileage at which resale value drops significantly. The two models were also compared, and a classifier was developed to distinguish Ford cars from Mercedes cars. More makes and models can easily be added upon request.

## Content
The cleaned data set contains the following columns:
- 'carID'
- 'brand'
- 'model'
- 'year'
- 'transmission'
- 'mileage'
- 'fuelType'
- 'tax'
- 'mpg'
- 'engineSize'

Duplicate listings have been removed and the columns have been cleaned. A notebook detailing the cleaning process and the original data is included for anyone interested in checking or improving the work.
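
The cleaning notebook itself is not reproduced here. As a rough illustration of the two steps mentioned above (tidying columns and removing duplicate listings), a minimal sketch might look like the following. The sample rows and the `clean_listings` helper are hypothetical and are not taken from the project's notebook.

```python
import pandas as pd

# Hypothetical raw listings; the values are illustrative only.
raw = pd.DataFrame({
    "brand": ["ford", "ford", "merc"],
    "model": [" Focus", " Focus", " C Class"],
    "year": [2017, 2017, 2019],
    "mileage": [20000, 20000, 5000],
})


def clean_listings(df, text_columns):
    """Strip stray whitespace from text columns, then drop exact duplicate rows."""
    cleaned = df.copy()
    for col in text_columns:
        cleaned[col] = cleaned[col].str.strip()
    # Rows are duplicates only if every column matches after cleaning.
    return cleaned.drop_duplicates().reset_index(drop=True)


cleaned = clean_listings(raw, text_columns=["brand", "model"])
```

Stripping whitespace before `drop_duplicates` matters because two listings that differ only by a stray space would otherwise both survive.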


In [39]:
import pandas as pd
import numpy as np

from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_log_error

import optuna

In [74]:
X = pd.read_csv("../data/UK Used Car Data set/X_train.csv")
y = pd.read_csv("../data/UK Used Car Data set/y_train.csv")

In [21]:
X.dtypes

carID             int64
brand            object
model            object
year              int64
transmission     object
mileage           int64
fuelType         object
tax             float64
mpg             float64
engineSize      float64
dtype: object

In [22]:
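def preprocess(X):
    """One-hot encode every non-numeric column and return a new DataFrame."""
    # Collect the text columns (brand, model, transmission, fuelType here).
    categorical = [
        label for label, content in X.items()
        if not pd.api.types.is_numeric_dtype(content)
    ]
    # get_dummies drops the originals and appends <column>_<category> indicators.
    return pd.get_dummies(X, columns=categorical)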

In [23]:
X = preprocess(X)

In [27]:
y.drop("carID", axis=1, inplace=True)

In [31]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
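
The held-out part of this split is passed as `eval_set` when the model is fitted and is then used again for the final score, so that score is measured on data that influenced when training stopped. One possible alternative, sketched here as an assumption rather than as what the notebook does, is a three-way split with a separate validation part for early stopping. The `three_way_split` helper and its default fractions are hypothetical.

```python
import numpy as np


def three_way_split(n_rows, valid_size=0.2, test_size=0.2, seed=0):
    """Return shuffled, disjoint row positions for train, validation, and test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_size))
    n_valid = int(round(n_rows * valid_size))
    test_idx = order[:n_test]
    valid_idx = order[n_test:n_test + n_valid]
    train_idx = order[n_test + n_valid:]
    return train_idx, valid_idx, test_idx


# Toy example with 10 rows; with real data, pass len(X) instead.
train_idx, valid_idx, test_idx = three_way_split(10)
```

The positions can be applied with `X.iloc[train_idx]` and `y.iloc[train_idx]` (and likewise for the other two parts), with the validation part going to `eval_set` and the test part reserved for the final metric.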

In [34]:
model = CatBoostRegressor(eval_metric="RMSE", early_stopping_rounds=100)

In [35]:
model.fit(X_train, y_train, eval_set=(X_test, y_test), verbose=False, plot=True);

In [57]:
y_preds = model.predict(X_test)

In [58]:
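# RMSLE: taking the square root explicitly also works on scikit-learn
# releases that no longer accept the squared=False argument.
np.sqrt(mean_squared_log_error(y_test, y_preds))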

0.1515510116638285
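
For reference, the value above is a root mean squared logarithmic error (RMSLE): the square root of the mean squared difference between `log(1 + prediction)` and `log(1 + actual)`. The sketch below computes it by hand with the standard library only; the `rmsle` helper and the sample prices are hypothetical and serve only to show the arithmetic.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_log_diffs = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_diffs) / len(squared_log_diffs))


# Hypothetical prices, not taken from the data set.
actual = [10000.0, 15000.0, 20000.0]
predicted = [11000.0, 14000.0, 20000.0]
score = rmsle(actual, predicted)
```

Because the error is measured on a log scale, it penalizes relative rather than absolute differences: as a rough reading, a score near 0.15 corresponds to predictions that are typically off by a factor of about 1.16.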

In [69]:
test_df = pd.read_csv("../data/UK Used Car Data set/X_test.csv")

In [75]:
test_df = preprocess(test_df)

In [78]:
set(X.columns) - set(test_df.columns)

{'brand', 'fuelType', 'model', 'transmission'}
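
The difference above lists columns that exist in `X` but not in `test_df`. A related mismatch appears whenever training and test frames are one-hot encoded separately, because a category that occurs in only one of them produces a column the other lacks. A minimal sketch of one way to align them, assuming the training columns are the reference; the toy frames below are hypothetical.

```python
import pandas as pd

# Hypothetical frames: "Diesel" occurs only in train, "Hybrid" only in test.
train = pd.DataFrame({"year": [2017, 2019], "fuelType": ["Petrol", "Diesel"]})
test = pd.DataFrame({"year": [2018, 2020], "fuelType": ["Petrol", "Hybrid"]})

train_encoded = pd.get_dummies(train, columns=["fuelType"])
test_encoded = pd.get_dummies(test, columns=["fuelType"])

# Reindex the test frame to the training columns: indicator columns missing
# from test are filled with 0, and columns unseen in training are dropped.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
```

After this step `test_aligned` has exactly the columns, in the same order, that a model fitted on `train_encoded` expects. Dropping unseen categories discards information, so this is a pragmatic choice rather than the only option.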

In [84]:
len(df.columns), len(X.columns)

(6, 10)