# 100K UK Used Car - Price Prediction Model [0.95+]

## Task Details

This notebook was written in response to "*Make a tool to predict car value*", part of the "*Car Price Regression Model*" task for the "*100,000 UK Used Car Dataset*" dataset.

More specifically, the purpose of this notebook is to explore and visualise the dataset as a whole and to build a generalized used car price prediction model.

### Context of the Dataset

"*Scraped data of used cars listings. 100,000 listings, which have been separated into files corresponding to each car manufacturer.*" [Source](https://www.kaggle.com/adityadesai13)

### Content of the Dataset

"*The cleaned data set contains information of price, transmission, mileage, fuel type, road tax, miles per gallon (mpg), and engine size.*"

# Step 1 : Data

In [None]:
# python 3 environment
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O

# plotly libraries
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.colors import n_colors
from plotly.subplots import make_subplots

# machine learning library
from pycaret.regression import *

In [None]:
# data 
audi = pd.read_csv('../100K UK Used Car - Price Prediction Model/Data Sources/audi.csv')
bmw = pd.read_csv('../100K UK Used Car - Price Prediction Model/Data Sources/bmw.csv')
cclass = pd.read_csv('../100K UK Used Car - Price Prediction Model/Data Sources/cclass.csv')
focus = pd.read_csv('../100K UK Used Car - Price Prediction Model/Data Sources/focus.csv')
ford = pd.read_csv('../100K UK Used Car - Price Prediction Model/Data Sources/ford.csv')
hyundai = pd.read_csv('../100K UK Used Car - Price Prediction Model/Data Sources/hyundi.csv')
merc = pd.read_csv('../100K UK Used Car - Price Prediction Model/Data Sources/merc.csv')
skoda = pd.read_csv('../100K UK Used Car - Price Prediction Model/Data Sources/skoda.csv')
toyota = pd.read_csv('../100K UK Used Car - Price Prediction Model/Data Sources/toyota.csv')
vauxhall = pd.read_csv('../100K UK Used Car - Price Prediction Model/Data Sources/vauxhall.csv')
vw = pd.read_csv('../100K UK Used Car - Price Prediction Model/Data Sources/vw.csv')

In [None]:
# data summary
audi.info()

## 1.1 Data Aggregation

The individual files are merged into a single DataFrame so that the analysis covers all manufacturers. First, a `manufacturer` column is added to each file. Then the `tax` column is unified: files that lack it receive a `tax` column filled with `NaN`, and the Hyundai file's `tax(£)` column is renamed to `tax`. Finally, a `has_tax` column records whether `tax` was present in the original file, so that the future model can tell a column that was absent from the source apart from ordinary values.

In [None]:
audi['manufacturer'] = 'Audi'
audi['has_tax'] = 1

bmw['manufacturer'] = 'BMW'
bmw['has_tax'] = 1

cclass['manufacturer'] = 'Mercedes'  # same label as merc, so both files share one category
cclass['has_tax'] = 0
cclass['tax'] = np.nan

focus['manufacturer'] = 'Ford'
focus['has_tax'] = 0
focus['tax'] = np.nan

hyundai['manufacturer'] = 'Hyundai Motor'
hyundai['has_tax'] = 1
hyundai = hyundai.rename(columns={"tax(£)": "tax"})

merc['manufacturer'] = 'Mercedes'
merc['has_tax'] = 1

skoda['manufacturer'] = 'Skoda'
skoda['has_tax'] = 1

toyota['manufacturer'] = 'Toyota'
toyota['has_tax'] = 1

vauxhall['manufacturer'] = 'Vauxhall'
vauxhall['has_tax'] = 1

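vw['manufacturer'] = 'Volkswagen'
vw['has_tax'] = 1

# ford.csv is loaded above but was otherwise never labeled or aggregated
ford['manufacturer'] = 'Ford'
ford['has_tax'] = int('tax' in ford.columns)

In [None]:
# aggregation
cars = pd.concat([audi, bmw, cclass, focus, ford, hyundai, merc, skoda, toyota, vauxhall, vw], ignore_index=True)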

# dimensionality
cars.shape
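
The effect of concatenating files with differing columns can be checked on toy data. The sketch below uses two hypothetical frames (`with_tax` and `without_tax`), not the real CSV files. It shows that the `tax` gaps survive the concatenation as `NaN` and that `has_tax` lines up with them.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for one file with a tax column and one without.
with_tax = pd.DataFrame({"price": [12500, 9800], "tax": [145, 30]})
with_tax["has_tax"] = 1

without_tax = pd.DataFrame({"price": [15300, 7200]})
without_tax["has_tax"] = 0
without_tax["tax"] = np.nan

toy = pd.concat([with_tax, without_tax], ignore_index=True)

# Number of missing values per column after aggregation.
missing = toy.isna().sum()

# The flag should be 0 exactly where tax is missing.
flag_matches = bool(((toy["has_tax"] == 0) == toy["tax"].isna()).all())

print(missing)
print("has_tax consistent with missing tax:", flag_matches)
```

Running the same two checks on `cars` would be a quick way to confirm that the aggregation behaved as intended.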

In [None]:
# data overview
cars

# Step 2 : Exploratory Data Analysis

## 2.1 Quantitative representation of variables

### **Question** : How many cars are listed per manufacturer?

In [None]:
fig = px.histogram(cars, x="manufacturer", hover_data=cars.columns)
fig.update_layout(title='Quantitative representation of the number of vehicles per manufacturer')
fig.show()
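
The counts behind the histogram can also be read off as a table. This is a minimal sketch on a hypothetical mini-sample (`toy`); in the notebook the same calls would be applied to `cars["manufacturer"]`.

```python
import pandas as pd

# Hypothetical mini-sample of manufacturer labels.
toy = pd.DataFrame({"manufacturer": ["Audi", "BMW", "Audi", "Skoda", "Audi", "BMW"]})

# Listings per manufacturer, sorted from most to least frequent.
counts = toy["manufacturer"].value_counts()

# The same information expressed as a fraction of all listings.
shares = toy["manufacturer"].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```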

## 2.2 Statistical Distribution

### **Question** : How are Price, Year, Miles Per Gallon, Mileage, Tax and Engine Size distributed?

In [None]:
fig = px.histogram(cars, x="price", marginal="box")
fig.update_layout(title='Statistical Distribution of Price')
fig.show()

In [None]:
fig = px.histogram(cars, x="year", marginal="box")
fig.update_layout(title='Statistical Distribution of Year')
fig.show()

In [None]:
fig = px.histogram(cars, x="mpg", marginal="box")
fig.update_layout(title='Statistical Distribution of Miles Per Gallon')
fig.show()

In [None]:
fig = px.histogram(cars, x="mileage", marginal="box")
fig.update_layout(title='Statistical Distribution of Mileage')
fig.show()

In [None]:
fig = px.histogram(cars, x="tax", marginal="box")
fig.update_layout(title='Statistical Distribution of Tax')
fig.show()

In [None]:
fig = px.histogram(cars, x="engineSize", marginal="box")
fig.update_layout(title='Statistical Distribution of Engine Size')
fig.show()
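
The box plots above can be complemented with a numerical summary. The sketch below uses a hypothetical `price` series with one extreme listing and applies the common 1.5 × IQR rule to flag outliers. The exact quartile method of the plotting library may differ slightly, so treat the fences as approximate.

```python
import pandas as pd

# Hypothetical prices with one extreme listing.
price = pd.Series([5000, 7500, 9000, 11000, 12500, 14000, 16000, 95000], name="price")

q1, q3 = price.quantile(0.25), price.quantile(0.75)
iqr = q3 - q1

# 1.5 * IQR fences, the rule box plots commonly use for their whiskers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = price[(price < lower) | (price > upper)]

print(price.describe())
print("skewness:", round(price.skew(), 2))  # positive means a long right tail
print("outliers:", outliers.tolist())
```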

## 2.3 Relationship between values

### **Question :** How strongly does the sale price depend on the manufacturer?

In [None]:
fig = px.scatter(cars, x='manufacturer', y='price', color='price')
fig.update_layout(title='Sales Price VS Manufacturer',xaxis_title="Manufacturer",yaxis_title="Price")
fig.show()

### **Question :** How strongly does the sale price depend on mileage?

In [None]:
fig = px.scatter(cars, x='mileage', y='price', color='price')
fig.update_layout(title='Sales Price VS Mileage',xaxis_title="Mileage",yaxis_title="Price")
fig.show()

### **Question :** How strongly does the sale price depend on the transmission?

In [None]:
fig = px.scatter(cars, x='transmission', y='price', color='price')
fig.update_layout(title='Sales Price VS Transmission System',xaxis_title="Transmission",yaxis_title="Price")
fig.show()
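
The scatter plots give a visual impression; a rough numerical counterpart is a per-category median for categorical variables and a correlation coefficient for numeric ones. The sketch below uses a hypothetical six-row frame (`toy`) with the same column names as `cars`.

```python
import pandas as pd

# Hypothetical listings using the dataset's column names.
toy = pd.DataFrame({
    "transmission": ["Manual", "Automatic", "Manual", "Semi-Auto", "Automatic", "Manual"],
    "mileage": [60000, 15000, 42000, 8000, 30000, 75000],
    "price": [7000, 21000, 9500, 26000, 16000, 5500],
})

# Categorical variable: compare the median price of each category.
median_price = (
    toy.groupby("transmission")["price"].median().sort_values(ascending=False)
)

# Numeric variable: Pearson correlation between mileage and price.
corr = toy["mileage"].corr(toy["price"])

print(median_price)
print("mileage/price correlation:", round(corr, 2))
```

A median is used rather than a mean because listing prices tend to have a long right tail.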

# Step 3 : Model

## 3.1 Setting up Environment

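`setup()` initializes the PyCaret environment and creates the transformation pipeline that prepares the data for modeling and deployment. It must be called before any other PyCaret function. The `sampling` and `silent` arguments belong to the PyCaret release this notebook was written against; later releases may no longer accept them.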

In [None]:
env = setup(cars, 
            target='price', 
            normalize=True, 
            transformation=True, 
            transform_target=True, 
            polynomial_features=True, 
            polynomial_degree=2, 
            sampling=False, 
            silent=True,  
            session_id=707)
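
To make some of these options more concrete, here is a rough NumPy sketch of what they mean conceptually: z-score normalization (`normalize=True`), degree-2 polynomial terms (`polynomial_degree=2`), and modelling a transformed target (`transform_target=True`). The data are hypothetical, and the log transform is a stand-in chosen for brevity; it is an assumption, not necessarily the transform PyCaret applies internally.

```python
import numpy as np

# Hypothetical feature matrix: columns are mileage and engine size.
X = np.array([[10000.0, 1.0], [30000.0, 1.6], [60000.0, 2.0], [90000.0, 3.0]])
y = np.array([18000.0, 14000.0, 9000.0, 6000.0])

# z-score normalization: each column ends up with mean 0 and std 1.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

# Degree-2 polynomial expansion: original columns, squares and the cross term.
a, b = X_norm[:, 0], X_norm[:, 1]
X_poly = np.column_stack([a, b, a**2, a * b, b**2])

# Target transform (log as a stand-in) and its inverse for predictions.
y_t = np.log1p(y)
y_back = np.expm1(y_t)

print(X_poly.shape)
print(np.allclose(y_back, y))
```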

## 3.2 Compare Models

`compare_models()` trains every model in the model library and scores each one using K-fold cross-validation. The output is a score grid showing MAE, MSE, RMSE, R2, RMSLE and MAPE for each model, averaged across the folds (10 by default). The `blacklist` argument excludes the listed models from the comparison; later PyCaret releases call this argument `exclude`.

In [None]:
compare_models(blacklist=['svm', 'tr', 'ransac', 'huber', 'lar', 'llar', 'lr'])
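
For reference, the six metrics in the score grid can be written out in a few lines of NumPy. This is a sketch of the standard definitions under the assumption that targets and predictions are non-negative (required for RMSLE) and that no true value is zero (required for MAPE). The helper name `regression_metrics` is hypothetical and not part of PyCaret.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return the six regression metrics as a dict of floats."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    return {
        "MAE": np.mean(np.abs(err)),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        # 1 minus residual sum of squares over total sum of squares
        "R2": 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2),
        "RMSLE": np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)),
        # expressed as a fraction, not a percentage
        "MAPE": np.mean(np.abs(err / y_true)),
    }

# Hypothetical true prices and predictions.
scores = regression_metrics([10000, 20000, 30000], [11000, 19000, 33000])
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```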

## 3.3 Create Model

`create_model()` trains a single model (here a random forest, `'rf'`) and scores it using K-fold cross-validation. The output is a score grid showing MAE, MSE, RMSE, R2, RMSLE and MAPE by fold (10 folds by default), and the function returns the trained model object. The hyperparameters were left at their defaults: tuning was skipped because it takes too much time and was not considered necessary in this context.

In [None]:
rf = create_model('rf')
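
The per-fold scoring can be approximated outside PyCaret with scikit-learn. The sketch below runs 10-fold cross-validation of a random forest on synthetic data. The data, the seed and the `n_estimators` value are assumptions for illustration, and the result is not expected to reproduce PyCaret's score grid.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data: three features, price driven mostly by the first.
rng = np.random.default_rng(707)
X = rng.uniform(0, 1, size=(200, 3))
y = 20000 - 12000 * X[:, 0] + 3000 * X[:, 1] + rng.normal(0, 500, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=707)
cv = KFold(n_splits=10, shuffle=True, random_state=707)

# One R2 score per fold; each fold is held out once while the rest train.
r2_per_fold = cross_val_score(model, X, y, cv=cv, scoring="r2")

print(r2_per_fold.round(3))
print("mean R2:", round(float(r2_per_fold.mean()), 3))
```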

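## 3.4 Evaluate Model

`plot_model()` takes a trained model object and returns a plot based on the test / hold-out set. Two plots are used here: `'parameter'` lists the model's hyperparameters and `'feature'` shows the feature importance.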

In [None]:
plot_model(rf, plot='parameter')

In [None]:
plot_model(rf, plot='feature')
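
For a tree ensemble, a feature-importance ranking can also be read directly from a fitted scikit-learn estimator. The sketch below fits a random forest on synthetic data with hypothetical feature names and ranks the impurity-based importances. It illustrates the idea only and is not necessarily how `plot_model` computes its chart.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: "mileage" dominates, "engineSize" matters less,
# and "noise" is unrelated to the target.
rng = np.random.default_rng(707)
names = ["mileage", "engineSize", "noise"]
X = rng.uniform(0, 1, size=(300, 3))
y = 20000 - 12000 * X[:, 0] + 3000 * X[:, 1] + rng.normal(0, 300, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=707).fit(X, y)

# Impurity-based importances; they are normalized to sum to 1.
importance = dict(zip(names, forest.feature_importances_))
ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)

for name, value in ranked:
    print(f"{name}: {value:.3f}")
```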

## 3.5 Finalize Model

`finalize_model()` refits the estimator on the complete dataset passed to `setup()`, including the hold-out set. Its purpose is to prepare the model for final deployment once experimentation is finished.

In [None]:
rf_final = finalize_model(rf)

## 3.6 Save Model

`save_model()` saves the transformation pipeline and the trained model object to the current working directory as a pickle file for later use.

In [None]:
save_model(rf_final, '100k_used_car_rf_model_lrm_28072020')