# Used Car Sales 

https://s3.amazonaws.com/talent-assets.datacamp.com/DS+Case+Study+-+Used+Car+Sales+-+Prework.pdf

| Column Name   | Details                                                                                                      |
|---------------|--------------------------------------------------------------------------------------------------------------|
| model         | Character, the model of the car, 18 possible values                                                          |
| year          | Numeric, year of registration from 1998 to 2020                                                              |
| price         | Numeric, listed value of the car in GBP                                                                      |
| transmission  | Character, one of "Manual", "Automatic", "Semi-Auto" or "Other"                                              |
| mileage       | Numeric, listed mileage of the car at time of sale                                                           |
| fuelType      | Character, one of "Petrol", "Hybrid", "Diesel" or "Other"                                                    |
| tax           | Numeric, road tax in GBP. Calculated based on CO2 emissions or a fixed price depending on the age of the car |
| mpg           | Numeric, miles per gallon as reported by manufacturer                                                        |
| engineSize    | Numeric, listed engine size, one of 16 possible values                                                       |

## 1. Business Goal

They want us to predict prices within 10% of the listed price. But as their team can only manage 30%, it is probably ok to show we are at least as good as that. I don’t know how close you will get in the time we have, but do your best and present whatever you find.

Next month our most experienced sales team member will be retiring. They have been on the team almost since the company was founded. They are incredibly talented at estimating the sales price of cars. We are quite worried that when they retire we won’t be able to estimate as well and that will have a huge impact on sales.

Currently, when a new car comes in, team members take all of the information that usually appears in the advert and give it to this team member. They then estimate the price. We have been testing the team members estimating themselves but they are always around 30% away from the price we know the car will sell for.

Can you help us estimate the price we should list a car for? The team estimates are always around 30% off, we really want to be within 10% of the price. This will mean we can automate the whole process and be able to sell cars quicker.

As I said, the team member retires in a month, so we would like to get your initial thoughts as soon as possible. We would like to see a presentation, you will be presenting to me and another sales manager. We would like to hear about whatever you manage to achieve to help us make decisions on the way forward.

- **Business Goal 1:** Predict prices within 10% of the listed price (the team currently manages around 30%).
- **Business Goal 2:** Automate the whole process so cars can be sold quicker.
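To make Business Goal 1 measurable, one option is to report the share of predictions that fall within a given tolerance of the actual price. The sketch below is only an illustration: the helper name `share_within_tolerance` and the prices are made up, not part of the brief or the dataset.

```python
import numpy as np


def share_within_tolerance(actual, predicted, tolerance=0.10):
    """Fraction of predictions whose relative error is at most `tolerance`."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    relative_error = np.abs(predicted - actual) / actual
    return float(np.mean(relative_error <= tolerance))


# Made-up prices in GBP, for illustration only
actual_prices = [10000, 12000, 8000, 15000]
predicted_prices = [10500, 13500, 7900, 15200]

within_10 = share_within_tolerance(actual_prices, predicted_prices)        # target from the brief
within_30 = share_within_tolerance(actual_prices, predicted_prices, 0.30)  # current team level
print(within_10, within_30)
```

Reporting both thresholds lets the model be compared with the 10% target and with the 30% the team manages today.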

## 2. Understanding the Data


In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 

df = pd.read_csv('toyota.csv')
df.head()

In [None]:
df.isna().sum()

In [None]:
df.describe()

In [None]:
df.info()

## 3. Cleaning Data

The datatypes look fine and there are no null values in the dataset.

Let's now check for duplicates and see how clean the string features are.

In [None]:
df.duplicated().sum()

In [None]:
df['model'].value_counts()

To make sure no duplicates go undetected because of leading or trailing spaces, let's strip them from the string features.

In [None]:
df['model'] = df['model'].str.strip()
df['transmission'] = df['transmission'].str.strip()
df['fuelType'] = df['fuelType'].str.strip()

In [None]:
for col in ['model', 'transmission', 'fuelType']: 
    df[col] = df[col].astype('category')

In [None]:
df['transmission'].value_counts()

In [None]:
df['fuelType'].value_counts()

We use all the columns to detect and remove duplicates. If mileage were not present we would have kept the duplicates, but it is very rare for two second-hand cars to share the same model, year, engine... and mileage.

In [None]:
df.duplicated(keep=False).sum()

In [None]:
len(df)

In [None]:
df = df.drop_duplicates(keep='first')

In [None]:
len(df)
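The difference between `duplicated(keep=False)` and `drop_duplicates(keep='first')` can be easy to misread, so here is a small sketch on a made-up frame (the rows are invented, not taken from toyota.csv). The first call counts every row that has an identical twin, while the second only removes the repeats and keeps one copy.

```python
import pandas as pd

# Made-up rows: the first two are exact duplicates of each other
toy = pd.DataFrame({
    "model": ["Yaris", "Yaris", "Aygo", "Yaris"],
    "year": [2017, 2017, 2018, 2017],
    "mileage": [20000, 20000, 15000, 21000],
})

n_involved = int(toy.duplicated(keep=False).sum())   # every row that has a twin
n_dropped = int(toy.duplicated(keep="first").sum())  # rows drop_duplicates would remove
deduped = toy.drop_duplicates(keep="first")

print(n_involved, n_dropped, len(deduped))
```

So the count from `keep=False` will usually be larger than the number of rows actually dropped.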

## 4. Exploratory Data Analysis

Now that the dataset is clean, let's do some EDA to better understand the features and how they interact.

In [None]:
df.head()

In [None]:
sns.countplot(data=df, x='model')
plt.xticks(rotation=60)
plt.title('Distribution of observations by model')
plt.show()

In [None]:
df['model'].value_counts(normalize=True)

In [None]:
df['model'].value_counts()

Yaris and Aygo together account for more than 50% of the observations. Three models have fewer than 10 observations each.

In [None]:
df['year'].value_counts()

In [None]:
sns.histplot(data=df, x='year')

Most of the observations were registered between 2015 and 2020, with very few dating from 2010 back to 1998.

In [None]:
plt.figure(figsize=(12,6))
sns.histplot(data=df, x='price')
plt.show()

The price distribution has a long right tail, which suggests that a log transformation could make the prices closer to symmetric.
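As a rough illustration of why a log transformation helps with a right-skewed variable, the sketch below compares skewness before and after taking logs. It uses synthetic log-normal values as a stand-in for prices (an assumption, not the real data), so only the direction of the effect matters.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "prices" (assumption: log-normal, not the real dataset)
rng = np.random.default_rng(0)
synthetic_price = pd.Series(rng.lognormal(mean=9.3, sigma=0.5, size=5000))

skew_raw = synthetic_price.skew()          # clearly positive: long right tail
skew_log = np.log(synthetic_price).skew()  # close to zero after the transform

# np.exp undoes np.log, which is how predictions get back to GBP later on
round_trip_ok = np.allclose(np.exp(np.log(synthetic_price)), synthetic_price)

print(f"skew raw: {skew_raw:.2f}, skew log: {skew_log:.2f}")
```

The same check can be run on `df['price']` to confirm that the real prices behave similarly.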

In [None]:
np.quantile(df.price, 0.985)

Almost 99% of the observations are below 30k. 

In [None]:
df[df['price']>40000]

The extreme prices seem to be perfectly valid ones. These observations seem to correspond to more expensive cars.

In [None]:
df['transmission'].value_counts()

In [None]:
df[df['transmission']=='Other']

In [None]:
df[(df['model']=='Yaris') & (df['year']==2015) & (df['tax']==0) & (df['engineSize']==1.5) & (df['mpg']==78)]

Most of the observations that look like the ones with transmission 'Other' are automatic, so let's assign 'Automatic' to them.

In [None]:
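df.loc[df['transmission'] == 'Other', 'transmission'] = 'Automatic'
df['transmission'] = df['transmission'].cat.remove_unused_categories()  # drop the now-empty 'Other' level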

In [None]:
df['transmission'].value_counts()
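The manual check above could also be done programmatically by replacing 'Other' with the most common transmission among similar cars. This is only a sketch on made-up rows; grouping by `model` and `engineSize` is an assumed definition of "similar", and other keys (year, mpg, tax) could be added.

```python
import pandas as pd

# Made-up rows for illustration; two of them have transmission 'Other'
toy = pd.DataFrame({
    "model": ["Yaris"] * 5 + ["Aygo"] * 3,
    "engineSize": [1.5] * 5 + [1.0] * 3,
    "transmission": ["Automatic", "Automatic", "Manual", "Other", "Automatic",
                     "Manual", "Manual", "Other"],
})

group_cols = ["model", "engineSize"]  # assumed definition of "similar cars"

# Most frequent known transmission per group
known = toy[toy["transmission"] != "Other"]
most_common = known.groupby(group_cols)["transmission"].agg(lambda s: s.mode().iloc[0])

# Look up the group mode for each 'Other' row and write it back
mask = toy["transmission"] == "Other"
keys = pd.MultiIndex.from_frame(toy.loc[mask, group_cols])
toy.loc[mask, "transmission"] = most_common.reindex(keys).to_numpy()

print(toy)
```

A group with no known transmission would end up as NaN here and would need a separate decision.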

In [None]:
plt.figure(figsize=(12,6))
sns.histplot(data=df, x='mileage')
plt.show()

Another distribution highly skewed to the right.

In [None]:
df['fuelType'].value_counts()

In [None]:
plt.figure(figsize=(12,6))
sns.histplot(data=df, x='tax')
plt.show()

In [None]:
plt.figure(figsize=(12,6))
sns.histplot(data=df, x='mpg')
plt.show()

In [None]:
plt.figure(figsize=(12,6))
sns.histplot(data=df, x='engineSize')
plt.show()

In [None]:
df['engineSize'].value_counts()

### Bivariate Analysis

In [None]:
df.head()

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, y='price', hue='model')
plt.show()

The car model seems to have a huge impact on the price.

In [None]:
sns.scatterplot(data=df, x='year', y='price')

Finding: cars more than 20 years old seem to be more expensive, although there are very few such observations.

In [None]:
df['price_log'] = np.log(df['price'])

In [None]:
plt.figure(figsize=(12,6))
sns.histplot(data=df, x='price_log')
plt.show()

In [None]:
sns.lmplot(data=df, x='year', y='price_log')
plt.show()

In [None]:
sns.lmplot(data=df, x='mileage', y='price_log')
plt.show()

In [None]:
df['mileage_log'] = np.log1p(df['mileage'])

In [None]:
sns.lmplot(data=df, x='mileage_log', y='price_log')
plt.show()

In [None]:
df = df.drop('mileage_log', axis=1)

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, y='price_log', hue='model')
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, y='price_log', hue='transmission')
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, y='price_log', hue='fuelType')
plt.show()

In [None]:
# lmplot is figure-level and creates its own figure, so size it with height/aspect
sns.lmplot(data=df, x='engineSize', y='price', height=6, aspect=2)
plt.show()

In [None]:
# lmplot is figure-level and creates its own figure, so size it with height/aspect
sns.lmplot(data=df, x='tax', y='price_log', height=6, aspect=2)
plt.show()

In [None]:
plt.figure(figsize=(12,6))
sns.scatterplot(data=df, x='mpg', y='price', hue='fuelType')
plt.show()

## Feature Engineering

Let's:
- use price_log as the prediction target
- bin tax into ranges
- create an old_timer feature indicating that the car was registered in 2000 or earlier
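When binning tax with `pd.cut`, it is worth checking how the edges are handled. By default the intervals are closed on the right and the lowest edge is excluded, so a value equal to the first bin edge becomes NaN unless `include_lowest=True` is passed. The sketch below shows this on a few made-up tax values; the earlier filter on `tax == 0` suggests such rows exist in the data.

```python
import pandas as pd

# Made-up tax values in GBP, including a zero-tax car
tax = pd.Series([0, 30, 145, 265, 330, 565])
bins = [0, 100, 200, 300, 400, 1000]
labels = ["Low", "Medium", "High", "Very High", "Extreme"]

default_bins = pd.cut(tax, bins=bins, labels=labels)
inclusive_bins = pd.cut(tax, bins=bins, labels=labels, include_lowest=True)

n_missing_default = int(default_bins.isna().sum())      # the 0 falls outside (0, 100]
n_missing_inclusive = int(inclusive_bins.isna().sum())  # the 0 now lands in 'Low'

print(n_missing_default, n_missing_inclusive)
```

Running `df['tax_binned'].isna().sum()` after binning is a quick way to confirm nothing was lost.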

In [None]:
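# include_lowest=True keeps cars with tax == 0 in the 'Low' bin instead of turning them into NaN
df['tax_binned'] = pd.cut(df['tax'], bins=[0, 100, 200, 300, 400, 1000], labels=['Low', 'Medium', 'High', 'Very High', 'Extreme'], include_lowest=True)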

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(data=df, hue='tax_binned', y='price')
plt.show()

In [None]:
df['old_timer'] = (df['year']<=2000)

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(data=df, x='old_timer', y='price')
plt.show()

## Preparing data to train a model

In [None]:
df.columns

In [None]:
features = df[['model', 'year', 'transmission', 'mileage', 'fuelType', 'mpg', 'engineSize', 'tax_binned', 'old_timer']].copy()  # copy so scaling does not trigger SettingWithCopyWarning
y = df['price_log']

In [None]:
from sklearn.preprocessing import StandardScaler

cols_to_scale = ['year', 'mileage', 'mpg', 'engineSize']

scaler = StandardScaler()
features[cols_to_scale] = scaler.fit_transform(features[cols_to_scale])
X = pd.get_dummies(features, columns=['model', 'transmission', 'fuelType', 'tax_binned'], drop_first=True)

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
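Fitting the scaler on the full dataset before splitting lets test-set statistics influence the preprocessing. One way to avoid that is to wrap scaling and encoding in a scikit-learn `Pipeline`, so they are fitted on the training rows only. The sketch below uses a small synthetic frame with a subset of the column names (the values and the target formula are made up), and Ridge is just a placeholder model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the car data (assumption: not the real dataset)
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "year": rng.integers(2010, 2021, size=n),
    "mileage": rng.integers(1000, 100000, size=n),
    "fuelType": rng.choice(["Petrol", "Hybrid", "Diesel"], size=n),
})
toy_y = (9 + 0.05 * (toy["year"] - 2010) - 0.000005 * toy["mileage"]
         + rng.normal(0, 0.1, size=n))  # made-up log price

numeric_cols = ["year", "mileage"]
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["fuelType"]),
])
pipe = Pipeline([("prep", preprocess), ("model", Ridge(alpha=1.0))])

# Split first, then fit: the scaler only ever sees the training rows
X_tr, X_te, y_tr, y_te = train_test_split(toy, toy_y, test_size=0.2, random_state=42)
pipe.fit(X_tr, y_tr)

fitted_scaler = pipe.named_steps["prep"].named_transformers_["num"]
test_r2 = pipe.score(X_te, y_te)
print(f"Test R² on synthetic data: {test_r2:.2f}")
```

A pipeline like this can also be passed directly to `GridSearchCV`, so the preprocessing is refitted inside each cross-validation fold.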

In [None]:
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Define parameter grids for each model
param_grid_ridge = {
    'alpha': [0.1, 1, 10],
    'solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sag', 'saga']
}

param_grid_lasso = {
    'alpha': [0.1, 1, 10],
    'max_iter': [1000, 5000],
    'selection': ['cyclic', 'random']
}

param_grid_rf = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
}

# Initialize models
ridge = Ridge()
lasso = Lasso()
lr = LinearRegression()
rf = RandomForestRegressor(random_state=42)  # fixed seed so results are reproducible

# Train models using GridSearchCV
models = {
    "Ridge": GridSearchCV(ridge, param_grid_ridge, cv=5, scoring='neg_mean_squared_error'),
    "Lasso": GridSearchCV(lasso, param_grid_lasso, cv=5, scoring='neg_mean_squared_error'),
    "Linear Regression": GridSearchCV(lr, {}, cv=5, scoring='neg_mean_squared_error'),
    "Random Forest": GridSearchCV(rf, param_grid_rf, cv=5, scoring='neg_mean_squared_error')
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    # Transform predictions back to original price
    y_pred_train_original = np.exp(y_pred_train)  # Inverse of np.log
    y_pred_test_original = np.exp(y_pred_test)    # Inverse of np.log

    # Calculate RMSE on original price
    train_rmse = np.sqrt(mean_squared_error(np.exp(y_train), y_pred_train_original))
    test_rmse = np.sqrt(mean_squared_error(np.exp(y_test), y_pred_test_original))
    
    # Calculate R² on original price
    train_r2 = r2_score(np.exp(y_train), y_pred_train_original)
    test_r2 = r2_score(np.exp(y_test), y_pred_test_original)

    # Print results
    print(f"{name} Best Parameters: {model.best_params_ if hasattr(model, 'best_params_') else 'N/A'}")
    print(f"Train RMSE (original price): {train_rmse:.2f}")
    print(f"Test RMSE (original price): {test_rmse:.2f}")
    print(f"Train R²: {train_r2:.2f}")
    print(f"Test R²: {test_r2:.2f}\n")

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Define a new parameter grid for fine-tuning
param_grid_rf_fine_tune = {
    'n_estimators': [150, 200, 250],
    'max_depth': [8, 10, 12],
    'min_samples_split': [2, 3, 5],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize Random Forest Regressor
rf_fine_tune = RandomForestRegressor(random_state=42)  # fixed seed so results are reproducible

# Train the model using GridSearchCV
rf_grid_search = GridSearchCV(rf_fine_tune, param_grid_rf_fine_tune, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
rf_grid_search.fit(X_train, y_train)

# Make predictions on train and test sets
y_pred_train_fine = rf_grid_search.predict(X_train)
y_pred_test_fine = rf_grid_search.predict(X_test)

# Transform predictions back to original price
y_pred_train_fine_original = np.exp(y_pred_train_fine)  # Inverse of np.log
y_pred_test_fine_original = np.exp(y_pred_test_fine)    # Inverse of np.log

# Calculate RMSE on original price
train_rmse_fine = np.sqrt(mean_squared_error(np.exp(y_train), y_pred_train_fine_original))
test_rmse_fine = np.sqrt(mean_squared_error(np.exp(y_test), y_pred_test_fine_original))

# Calculate R² on original price
train_r2_fine = r2_score(np.exp(y_train), y_pred_train_fine_original)
test_r2_fine = r2_score(np.exp(y_test), y_pred_test_fine_original)

# Print results
print(f"Random Forest Best Parameters: {rf_grid_search.best_params_}")
print(f"Train RMSE (original price): {train_rmse_fine:.2f}")
print(f"Test RMSE (original price): {test_rmse_fine:.2f}")
print(f"Train R²: {train_r2_fine:.2f}")
print(f"Test R²: {test_r2_fine:.2f}")

In [None]:
# Create a DataFrame for easier analysis
results_df = pd.DataFrame({
    'Actual Price': np.exp(y_test),
    'Predicted Price': np.exp(rf_grid_search.predict(X_test))
})

# Calculate percentage error
results_df['Percentage Error'] = (results_df['Predicted Price'] - results_df['Actual Price']) / results_df['Actual Price'] * 100

# Calculate summary statistics
mean_error = results_df['Percentage Error'].mean()
median_error = results_df['Percentage Error'].median()
std_dev_error = results_df['Percentage Error'].std()

# Output results
print("Mean Percentage Error: {:.2f}%".format(mean_error))
print("Median Percentage Error: {:.2f}%".format(median_error))
print("Standard Deviation of Percentage Error: {:.2f}%".format(std_dev_error))

# You can also inspect the full results DataFrame if needed
print(results_df)

In [None]:
results_df['Percentage Error'].describe()

In [None]:
np.quantile(results_df['Percentage Error'], [0, 0.05, 0.1, 0.9, 0.95, 1])

In [None]:
sns.displot(data=results_df, x='Percentage Error')

In [None]:
mean = results_df['Percentage Error'].mean()
std_dev = results_df['Percentage Error'].std()

# Set the figure size
plt.figure(figsize=(12, 6))

# Create the KDE plot
sns.kdeplot(results_df['Percentage Error'], fill=True, color='skyblue', alpha=0.5)

# Add vertical lines for mean, 1*SD, 2*SD, 3*SD
plt.axvline(mean, color='red', linestyle='--', label='Mean')
plt.axvline(mean + std_dev, color='green', linestyle='--', label='Mean + 1 SD')
plt.axvline(mean - std_dev, color='green', linestyle='--', label='Mean - 1 SD')
plt.axvline(mean + 2 * std_dev, color='orange', linestyle='--', label='Mean + 2 SD')
plt.axvline(mean - 2 * std_dev, color='orange', linestyle='--', label='Mean - 2 SD')
plt.axvline(mean + 3 * std_dev, color='blue', linestyle='--', label='Mean + 3 SD')
plt.axvline(mean - 3 * std_dev, color='blue', linestyle='--', label='Mean - 3 SD')

# Add titles and labels
plt.title('Distribution of Percentage Error with Standard Deviation Bands')
plt.xlabel('Percentage Error (%)')
plt.ylabel('Density')
plt.legend()

# Display the plot
plt.show()

In [None]:
feature_importances = rf_grid_search.best_estimator_.feature_importances_

# Create a DataFrame for better readability
importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': feature_importances
})

# Sort the DataFrame by importance
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Plot using Seaborn
plt.figure(figsize=(10, 6))
sns.barplot(data=importance_df, x='Importance', y='Feature', palette='viridis')
plt.title('Feature Importances from Random Forest')
plt.xlabel('Importance')
plt.ylabel('Features')

plt.show()

In [None]:
np.abs(results_df['Percentage Error']).describe()

In [None]:
np.quantile(np.abs(results_df['Percentage Error']), [0,0.05, 0.1, 0.9, 0.95, 1])

In [None]:
np.round(np.quantile(np.abs(results_df['Percentage Error']), [0,0.05, 0.1, 0.80, 0.9, 0.95, 1]), 2)

In [None]:
features.columns

In [None]:
df.columns