# What drives the price of a used car?
<center>
    <img src = ../images/kurt.jpeg width = 35%/>
</center>

## Project Overview: Predicting Used Car Prices Using CRISP-DM Framework

### Introduction
In the highly competitive used car market, accurate pricing is crucial for maximizing profits, attracting customers, and managing inventory effectively. Our project aims to leverage data analytics and machine learning to identify key drivers of used car prices and develop a predictive model to forecast these prices accurately. This will enable our business to make informed pricing decisions, enhance customer trust, and improve overall business performance.

### CRISP-DM Framework
We will use the CRISP-DM (Cross-Industry Standard Process for Data Mining) framework to guide our project. This widely adopted methodology provides a structured approach to data mining and ensures systematic and efficient analysis. The CRISP-DM process consists of six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.
<center>
    <img src = ../images/crisp.png width = 35%/>
</center>

## Business Understanding

The used car market is highly dynamic, influenced by a variety of factors ranging from economic conditions to consumer preferences. For our business, accurately predicting used car prices is crucial for several reasons:

- <b>Inventory Management:</b> Knowing the right price helps in maintaining optimal inventory levels by ensuring cars are neither overstocked nor understocked.
- <b>Sales Strategy:</b> Accurate pricing can enhance competitiveness, attracting more customers and increasing sales volumes.
- <b>Customer Trust:</b> Transparent and fair pricing builds customer trust and loyalty.
- <b>Financial Planning:</b> Reliable price predictions aid in better financial forecasting and budgeting.

Given these aspects, our primary goal is to identify the key drivers that significantly influence the prices of used cars. Understanding these drivers will enable us to develop a robust pricing strategy, improve decision-making, and enhance overall business performance.

### Objectives

- <b>Identify Key Price Drivers:</b> Determine the primary factors that impact used car prices. This includes both vehicle attributes (e.g., mileage, age, brand, model, condition) and external factors (e.g., market trends, economic indicators).
- <b>Develop a Predictive Model:</b> Create a predictive model that can accurately forecast the prices of used cars based on the identified key drivers. This model should be reliable, scalable, and easy to integrate into our existing business processes.
- <b>Enhance Pricing Strategy:</b> Utilize the insights from the predictive model to refine our pricing strategy. This involves setting competitive prices that reflect the true value of the cars while maximizing our profit margins.
- <b>Support Strategic Decision-Making:</b> Provide actionable insights to support strategic decisions related to procurement, sales, and marketing. This includes understanding seasonal trends, identifying high-demand vehicle types, and targeting specific customer segments.
- <b>Improve Customer Experience:</b> Offer transparent and fair pricing to customers, thereby enhancing their buying experience and building long-term relationships.
- <b>Optimize Inventory Management:</b> Use price predictions to manage inventory more effectively, ensuring the right mix of vehicles is available to meet customer demand without overstocking or understocking.

By achieving these objectives, our business aims to gain a competitive edge in the used car market, improve operational efficiency, and drive sustainable growth.

## Data Understanding
In the Data Understanding phase, we explored and familiarized ourselves with the dataset comprising 426,880 used car listings. This involved identifying key variables such as price, year, odometer readings, and categorical attributes like manufacturer and model. Initial data exploration allowed us to assess data quality, distribution, and relationships between variables, laying the groundwork for subsequent data preparation and modeling phases within the CRISP-DM framework.

In [5]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings 
warnings.filterwarnings('ignore')

### Data Collection
The dataset comprising 426,880 used car listings was sourced from Kaggle, a platform known for hosting diverse datasets contributed by the community.

In [7]:
# Load the data set
cars = pd.read_csv('../data/vehicles.csv')

### Data Description

The dataset contains information on 426,880 used cars, with 18 attributes detailing various aspects of each vehicle. Below is a detailed description of each column:

1. **id:** A unique identifier for each car listing.
2. **region:** The geographic region where the car is listed.
3. **price:** The listed price of the car in dollars.
4. **year:** The manufacturing year of the car.
5. **manufacturer:** The manufacturer or brand of the car (e.g., Ford, Toyota).
6. **model:** The model name of the car.
7. **condition:** The condition of the car (e.g., new, like new, excellent, good, fair, salvage).
8. **cylinders:** The number of cylinders in the car's engine.
9. **fuel:** The type of fuel the car uses (e.g., gas, diesel, electric, hybrid).
10. **odometer:** The mileage of the car (distance traveled in miles).
11. **title_status:** The status of the car's title (e.g., clean, salvage, rebuilt).
12. **transmission:** The type of transmission (e.g., automatic, manual).
13. **VIN:** The Vehicle Identification Number, a unique code used to identify individual motor vehicles.
14. **drive:** The type of drivetrain (e.g., 4wd, fwd, rwd).
15. **size:** The size category of the car (e.g., compact, mid-size, full-size).
16. **type:** The type or category of the car (e.g., sedan, SUV, truck).
17. **paint_color:** The exterior color of the car's paint.
18. **state:** The state where the car is listed.

In [9]:
# Display data set info
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   region        426880 non-null  object 
 2   price         426880 non-null  int64  
 3   year          425675 non-null  float64
 4   manufacturer  409234 non-null  object 
 5   model         421603 non-null  object 
 6   condition     252776 non-null  object 
 7   cylinders     249202 non-null  object 
 8   fuel          423867 non-null  object 
 9   odometer      422480 non-null  float64
 10  title_status  418638 non-null  object 
 11  transmission  424324 non-null  object 
 12  VIN           265838 non-null  object 
 13  drive         296313 non-null  object 
 14  size          120519 non-null  object 
 15  type          334022 non-null  object 
 16  paint_color   296677 non-null  object 
 17  state         426880 non-null  object 
dtypes: f

### Data Exploration

In the Data Exploration phase, we conducted a thorough analysis to uncover patterns, relationships, and insights within the dataset. Key activities included:

- **Descriptive Statistics:** Calculated summary statistics for each numerical feature, such as mean, median, standard deviation, minimum, and maximum values. This provided a high-level overview of the data distribution and central tendencies.
- **Distribution Analysis:** Visualized the distributions of key variables (e.g., price, year, odometer) using histograms and density plots to understand their spread and identify any skewness or kurtosis. This helped in detecting potential outliers and understanding the general data shape.
- **Correlation Analysis:** Computed correlation coefficients between numerical variables to identify linear relationships. Visualized these correlations using heatmaps, which highlighted significant correlations (e.g., between year and price, odometer and price).
- **Missing Values Analysis:** Identified columns with missing values and calculated the proportion of missing data for each column. Visualized missing data patterns using heatmaps to understand the extent and distribution of missingness across the dataset.
- **Categorical Data Analysis:** Examined the distribution of categorical variables (e.g., manufacturer, fuel, transmission) using bar plots and pie charts. This analysis provided insights into the most common categories and their frequencies.
- **Outlier Detection:** Identified outliers in numerical features by visualizing box plots and scatter plots. Assessed the potential impact of these outliers on the analysis and modeling process.
- **Bivariate Analysis:** Explored relationships between pairs of variables (e.g., price vs. year, price vs. odometer) using scatter plots and box plots. This helped in understanding how different features interact and influence the target variable (price).
- **Segmentation Analysis:** Analyzed data segments based on categorical variables (e.g., region, state, manufacturer) to identify regional price variations and popular brands. This segmentation provided deeper insights into market trends and customer preferences.

These exploratory analyses laid the foundation for subsequent data preparation and modeling steps, ensuring a comprehensive understanding of the dataset and guiding informed decisions throughout the project.

#### Removing Duplicates

In [12]:
# Drop the identifier and location columns (id, VIN, region, state) before checking for duplicates
cars_cleaned = cars.drop(columns=['id', 'VIN', 'region', 'state'])

In [13]:
cars_cleaned

Unnamed: 0,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,drive,size,type,paint_color
0,6000,,,,,,,,,,,,,
1,11900,,,,,,,,,,,,,
2,21000,,,,,,,,,,,,,
3,1500,,,,,,,,,,,,,
4,4900,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
426875,23590,2019.0,nissan,maxima s sedan 4d,good,6 cylinders,gas,32226.0,clean,other,fwd,,sedan,
426876,30590,2020.0,volvo,s60 t5 momentum sedan 4d,good,,gas,12029.0,clean,other,fwd,,sedan,red
426877,34990,2020.0,cadillac,xt4 sport suv 4d,good,,diesel,4174.0,clean,other,,,hatchback,white
426878,28990,2018.0,lexus,es 350 sedan 4d,good,6 cylinders,gas,30112.0,clean,other,fwd,,sedan,silver


In [15]:
# Display total rows and columns
total_rows, total_columns = cars_cleaned.shape
print(f"Total rows: {total_rows}")
print(f"Total columns: {total_columns}")
cars_cleaned.columns

Total rows: 426880
Total columns: 14


Index(['price', 'year', 'manufacturer', 'model', 'condition', 'cylinders',
       'fuel', 'odometer', 'title_status', 'transmission', 'drive', 'size',
       'type', 'paint_color'],
      dtype='object')

In [21]:
# Find duplicate rows
duplicate_rows = cars_cleaned[cars_cleaned.duplicated()]

# Print duplicate rows
duplicate_rows

Unnamed: 0,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,drive,size,type,paint_color
11,0,,,,,,,,,,,,,
12,0,,,,,,,,,,,,,
13,0,,,,,,,,,,,,,
14,0,,,,,,,,,,,,,
20,24999,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
426875,23590,2019.0,nissan,maxima s sedan 4d,good,6 cylinders,gas,32226.0,clean,other,fwd,,sedan,
426876,30590,2020.0,volvo,s60 t5 momentum sedan 4d,good,,gas,12029.0,clean,other,fwd,,sedan,red
426877,34990,2020.0,cadillac,xt4 sport suv 4d,good,,diesel,4174.0,clean,other,,,hatchback,white
426878,28990,2018.0,lexus,es 350 sedan 4d,good,6 cylinders,gas,30112.0,clean,other,fwd,,sedan,silver


In [22]:
# Drop Duplicates
cars_cleaned.drop_duplicates(inplace=True)

In [24]:
cars_cleaned

Unnamed: 0,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,drive,size,type,paint_color
0,6000,,,,,,,,,,,,,
1,11900,,,,,,,,,,,,,
2,21000,,,,,,,,,,,,,
3,1500,,,,,,,,,,,,,
4,4900,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
426833,6800,1997.0,jaguar,xk8 convertible,good,8 cylinders,gas,69550.0,clean,automatic,rwd,compact,convertible,white
426838,21900,1920.0,,Paige Glenbrook Touring,good,6 cylinders,gas,11065.0,clean,manual,rwd,full-size,other,black
426839,54999,2017.0,,2017,,,gas,119000.0,clean,automatic,,,,
426846,41999,2015.0,,2015,,,gas,126000.0,clean,automatic,,,,


#### Identify and drop rows where every column except price is NaN, or where price is 0

In [27]:
# Identify rows where all columns except 'price' are NaN
columns_to_check = cars_cleaned.columns.difference(['price'])
rows_to_delete = cars_cleaned[columns_to_check].isnull().all(axis=1)

# Delete rows where all specified columns are NaN
cars_cleaned = cars_cleaned[~rows_to_delete]

# Reset index if needed
cars_cleaned.reset_index(drop=True, inplace=True)

# Print the cleaned DataFrame
cars_cleaned

Unnamed: 0,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,drive,size,type,paint_color
0,33590,2014.0,gmc,sierra 1500 crew cab slt,good,8 cylinders,gas,57923.0,clean,other,,,pickup,white
1,22590,2010.0,chevrolet,silverado 1500,good,8 cylinders,gas,71229.0,clean,other,,,pickup,blue
2,39590,2020.0,chevrolet,silverado 1500 crew,good,8 cylinders,gas,19160.0,clean,other,,,pickup,red
3,30990,2017.0,toyota,tundra double cab sr,good,8 cylinders,gas,41124.0,clean,other,,,pickup,red
4,15000,2013.0,ford,f-150 xlt,excellent,6 cylinders,gas,128000.0,clean,automatic,rwd,full-size,truck,black
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
248245,6800,1997.0,jaguar,xk8 convertible,good,8 cylinders,gas,69550.0,clean,automatic,rwd,compact,convertible,white
248246,21900,1920.0,,Paige Glenbrook Touring,good,6 cylinders,gas,11065.0,clean,manual,rwd,full-size,other,black
248247,54999,2017.0,,2017,,,gas,119000.0,clean,automatic,,,,
248248,41999,2015.0,,2015,,,gas,126000.0,clean,automatic,,,,


In [30]:
cars_cleaned.query('price == 0')

Unnamed: 0,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,drive,size,type,paint_color
19,0,2011.0,jeep,compass,excellent,,gas,99615.0,clean,automatic,,full-size,SUV,
99,0,2018.0,chevrolet,express cargo van,like new,6 cylinders,gas,68472.0,clean,automatic,rwd,full-size,van,white
100,0,2019.0,chevrolet,express cargo van,like new,6 cylinders,gas,69125.0,clean,automatic,rwd,full-size,van,white
101,0,2018.0,chevrolet,express cargo van,like new,6 cylinders,gas,66555.0,clean,automatic,rwd,full-size,van,white
163,0,2015.0,nissan,sentra,excellent,4 cylinders,gas,99505.0,clean,automatic,fwd,,sedan,silver
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
247369,0,2012.0,chevrolet,malibu,excellent,,gas,153898.0,clean,automatic,fwd,,sedan,
247428,0,2018.0,chevrolet,cruze,excellent,,gas,51964.0,clean,automatic,fwd,,hatchback,
247452,0,2005.0,chevrolet,impala,,,gas,115600.0,clean,automatic,,,,
247627,0,2013.0,gmc,acadia,excellent,6 cylinders,gas,152207.0,clean,automatic,4wd,full-size,SUV,red


In [32]:
# Drop rows where price == 0
cars_cleaned = cars_cleaned[cars_cleaned['price'] != 0]

# Reset index if needed
cars_cleaned.reset_index(drop=True, inplace=True)
cars_cleaned

Unnamed: 0,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,drive,size,type,paint_color
0,33590,2014.0,gmc,sierra 1500 crew cab slt,good,8 cylinders,gas,57923.0,clean,other,,,pickup,white
1,22590,2010.0,chevrolet,silverado 1500,good,8 cylinders,gas,71229.0,clean,other,,,pickup,blue
2,39590,2020.0,chevrolet,silverado 1500 crew,good,8 cylinders,gas,19160.0,clean,other,,,pickup,red
3,30990,2017.0,toyota,tundra double cab sr,good,8 cylinders,gas,41124.0,clean,other,,,pickup,red
4,15000,2013.0,ford,f-150 xlt,excellent,6 cylinders,gas,128000.0,clean,automatic,rwd,full-size,truck,black
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
230761,6800,1997.0,jaguar,xk8 convertible,good,8 cylinders,gas,69550.0,clean,automatic,rwd,compact,convertible,white
230762,21900,1920.0,,Paige Glenbrook Touring,good,6 cylinders,gas,11065.0,clean,manual,rwd,full-size,other,black
230763,54999,2017.0,,2017,,,gas,119000.0,clean,automatic,,,,
230764,41999,2015.0,,2015,,,gas,126000.0,clean,automatic,,,,


In [34]:
cars_cleaned.isnull().sum()

price                0
year               628
manufacturer     10270
model             3353
condition        87865
cylinders        79394
fuel              1413
odometer          1201
title_status      3834
transmission       980
drive            64639
size            147268
type             58476
paint_color      67438
dtype: int64

In [36]:
# Identify rows where both manufacturer and model are null (found no data)
cars_cleaned[cars_cleaned['manufacturer'].isnull() & cars_cleaned['model'].isnull()]

Unnamed: 0,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,drive,size,type,paint_color


In [38]:
cars_cleaned[cars_cleaned['model'].isnull()]

Unnamed: 0,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,drive,size,type,paint_color
68,80,2004.0,honda,,excellent,6 cylinders,gas,94020.0,clean,automatic,,,,
73,12990,1968.0,volvo,,,,gas,99999.0,clean,manual,,,,
240,987654321,1960.0,chevrolet,,,,gas,999999.0,clean,manual,,,,
264,8500,2006.0,rover,,good,8 cylinders,gas,139445.0,clean,automatic,4wd,full-size,SUV,white
357,39900,2015.0,lincoln,,,,gas,51.0,clean,automatic,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
230523,1500,1993.0,ram,,,,gas,999999.0,clean,automatic,,,,
230571,7500,2000.0,ford,,good,8 cylinders,gas,2000.0,clean,manual,rwd,,pickup,black
230580,14995,1929.0,ford,,,,gas,2000.0,clean,automatic,,,,
230682,7500,2012.0,chevrolet,,,6 cylinders,gas,150000.0,clean,automatic,fwd,,,


In [40]:
cars_cleaned[cars_cleaned['manufacturer'].isnull()]

Unnamed: 0,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,drive,size,type,paint_color
69,15990,2016.0,,Scion iM Hatchback 4D,good,,gas,29652.0,clean,other,fwd,,hatchback,blue
94,6800,2005.0,,blue bird bus,excellent,6 cylinders,diesel,180000.0,clean,automatic,rwd,full-size,bus,yellow
104,14990,2016.0,,Scion iM Hatchback 4D,good,,gas,65203.0,clean,other,fwd,,hatchback,red
106,2500,1966.0,,1966 C-30 1 ton,good,6 cylinders,gas,47000.0,clean,manual,rwd,full-size,pickup,brown
124,8990,2013.0,,smart fortwo Passion Hatchback,good,,gas,59072.0,clean,automatic,rwd,,coupe,silver
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
230754,11000,2017.0,,ATI,like new,other,other,1.0,missing,other,,,,brown
230758,5000,1996.0,,96 Suburban,excellent,8 cylinders,gas,170000.0,clean,automatic,,,SUV,brown
230762,21900,1920.0,,Paige Glenbrook Touring,good,6 cylinders,gas,11065.0,clean,manual,rwd,full-size,other,black
230763,54999,2017.0,,2017,,,gas,119000.0,clean,automatic,,,,


**Update NaN values in manufacturer based on model**
- Create a mapping from model to manufacturer: use the existing rows where both model and manufacturer are present to build a dictionary.
- Fill missing manufacturer values: use this mapping to fill in the missing manufacturer values in the DataFrame. Models that never appear alongside a manufacturer stay unfilled.

In [43]:
# Create a mapping from model to manufacturer using non-null values
model_to_manufacturer = cars_cleaned.dropna(subset=['manufacturer', 'model']).set_index('model')['manufacturer'].to_dict()

# Function to fill missing manufacturer based on model
def fill_manufacturer(row, mapping):
    if pd.isnull(row['manufacturer']) and pd.notnull(row['model']):
        return mapping.get(row['model'], row['manufacturer'])
    return row['manufacturer']

# Apply the function to each row in the DataFrame
cars_cleaned['manufacturer'] = cars_cleaned.apply(fill_manufacturer, axis=1, mapping=model_to_manufacturer)

cars_cleaned[cars_cleaned['manufacturer'].isnull()]

Unnamed: 0,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,drive,size,type,paint_color
69,15990,2016.0,,Scion iM Hatchback 4D,good,,gas,29652.0,clean,other,fwd,,hatchback,blue
94,6800,2005.0,,blue bird bus,excellent,6 cylinders,diesel,180000.0,clean,automatic,rwd,full-size,bus,yellow
104,14990,2016.0,,Scion iM Hatchback 4D,good,,gas,65203.0,clean,other,fwd,,hatchback,red
106,2500,1966.0,,1966 C-30 1 ton,good,6 cylinders,gas,47000.0,clean,manual,rwd,full-size,pickup,brown
124,8990,2013.0,,smart fortwo Passion Hatchback,good,,gas,59072.0,clean,automatic,rwd,,coupe,silver
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
230754,11000,2017.0,,ATI,like new,other,other,1.0,missing,other,,,,brown
230758,5000,1996.0,,96 Suburban,excellent,8 cylinders,gas,170000.0,clean,automatic,,,SUV,brown
230762,21900,1920.0,,Paige Glenbrook Touring,good,6 cylinders,gas,11065.0,clean,manual,rwd,full-size,other,black
230763,54999,2017.0,,2017,,,gas,119000.0,clean,automatic,,,,
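
The row-wise `apply` above calls a Python function once per listing, which can be slow on a frame of this size. A vectorized sketch of the same idea is shown below on a small made-up frame (`toy` and its values are illustrative, not taken from the dataset). It assumes that keeping the last manufacturer seen for each model is acceptable, which mirrors how `to_dict()` resolves repeated keys.

```python
import pandas as pd

# Illustrative data only: two rows lack a manufacturer.
toy = pd.DataFrame({
    "manufacturer": ["ford", None, "honda", None],
    "model": ["f-150", "f-150", "civic", "blue bird bus"],
})

# Build the lookup from rows where both fields are known.
# Series.map needs a unique index, so keep one manufacturer per model.
known = toy.dropna(subset=["manufacturer", "model"])
model_to_manufacturer = (
    known.drop_duplicates("model", keep="last")
         .set_index("model")["manufacturer"]
)

# Fill only the missing manufacturers; unknown models stay NaN.
toy["manufacturer"] = toy["manufacturer"].fillna(
    toy["model"].map(model_to_manufacturer)
)
print(toy)
```

The "blue bird bus" row stays unfilled because no listing pairs that model with a manufacturer, which matches the unchanged rows in the output above.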


**Fill in the missing model values based on the manufacturer and the closest price**
- Filter by manufacturer: for each row with a missing model, look only at listings from the same manufacturer that have a model.
- Fill missing model values: copy the model from the listing in that subset whose price is closest to the row's price. If the manufacturer has no listing with a known model, the value stays missing.

In [None]:
# Define a function to fill missing model based on manufacturer and similar price
def fill_model(row, df):
    if pd.isnull(row['model']) and pd.notnull(row['manufacturer']):
        manufacturer = row['manufacturer']
        price = row['price']
        
        # Filter the dataframe to include only rows with the same manufacturer
        manufacturer_df = df[(df['manufacturer'] == manufacturer) & df['model'].notnull()]
        
        if not manufacturer_df.empty:
            # Find the model with the closest price
            closest_model = manufacturer_df.iloc[(manufacturer_df['price'] - price).abs().argsort()[:1]]['model'].values[0]
            return closest_model
        else:
            return row['model']  # No non-null models found for this manufacturer
    return row['model']

# Apply the function to fill missing models
cars_cleaned['model'] = cars_cleaned.apply(lambda row: fill_model(row, cars_cleaned), axis=1)

cars_cleaned[cars_cleaned['model'].isnull()]
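
The function above re-filters the whole frame for every missing row. As a possible alternative, `pd.merge_asof` can do a nearest-price match within each manufacturer in one pass. The sketch below uses a made-up frame (`toy` and its values are illustrative); ties and very distant prices are not handled specially.

```python
import pandas as pd

# Illustrative data only: rows 2 and 4 have no model.
toy = pd.DataFrame({
    "manufacturer": ["ford", "ford", "ford", "honda", "honda"],
    "model": ["f-150", "focus", None, "civic", None],
    "price": [30000, 8000, 9500, 12000, 11000],
})

# merge_asof requires both sides to be sorted on the "on" key.
known = toy.dropna(subset=["model"]).sort_values("price")
missing = (
    toy[toy["model"].isna()]
    .drop(columns="model")
    .sort_values("price")
    .reset_index()          # keep the original row labels in "index"
)

# For each missing row, take the same-manufacturer listing with the nearest price.
filled = pd.merge_asof(
    missing,
    known[["manufacturer", "price", "model"]],
    on="price",
    by="manufacturer",
    direction="nearest",
).set_index("index")

toy.loc[filled.index, "model"] = filled["model"]
print(toy)
```

Either way, a model imputed from price alone is a rough guess, so it may be worth checking how many rows are filled this way before relying on the result.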

In [None]:
cars_cleaned.query('manufacturer == "morgan"')

In [None]:
cars_cleaned.isnull().sum()

In [None]:
cars_cleaned[cars_cleaned['manufacturer'].isnull()]['model'].value_counts()

In [None]:
# Create 'make_model' column based on the conditions
cars_cleaned['make_model'] = np.where(
    cars_cleaned['manufacturer'].isnull() & cars_cleaned['model'].notnull(), 
    cars_cleaned['model'],
    np.where(
        cars_cleaned['model'].isnull() & cars_cleaned['manufacturer'].notnull(), 
        cars_cleaned['manufacturer'], 
        np.where(
            cars_cleaned['manufacturer'].notnull() & cars_cleaned['model'].notnull(), 
            cars_cleaned['manufacturer'] + '-' + cars_cleaned['model'],
            np.nan
        )
    )
)
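
The nested `np.where` above can also be written as a chain of `fillna` calls, because concatenating strings with a missing value yields a missing value in pandas. A minimal sketch on a made-up frame (`toy` is illustrative):

```python
import pandas as pd

# Illustrative data only: one complete row, one without a manufacturer,
# one without a model.
toy = pd.DataFrame({
    "manufacturer": ["ford", None, "volvo"],
    "model": ["f-150", "blue bird bus", None],
})

# The concatenation is NaN whenever either part is missing,
# so fall back to whichever single part is available.
combined = toy["manufacturer"] + "-" + toy["model"]
toy["make_model"] = combined.fillna(toy["manufacturer"]).fillna(toy["model"])
print(toy)
```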

In [None]:
cars_cleaned[cars_cleaned['manufacturer'].isnull()]

In [None]:
# Drop 'manufacturer' and 'model' columns
cars_cleaned.drop(columns=['manufacturer', 'model'], inplace=True)

In [None]:
cars_cleaned.isnull().sum()

#### Distribution Analysis: 
Visualized the distributions of key variables (e.g., price, year, odometer) using histograms with a kernel density overlay and box plots to understand their spread and identify any skewness. This helped in detecting potential outliers and understanding the general data shape.

In [None]:
numerical_variables = cars_cleaned.select_dtypes(include=['int','float']).columns.tolist()
print(numerical_variables)

In [None]:
# Plot a histogram (with KDE overlay) and a box plot for each numerical variable
for col in numerical_variables:
    plt.figure(figsize=(12, 4))
    
    # Histogram
    plt.subplot(1, 2, 1)
    sns.histplot(cars_cleaned[col].dropna(), kde=True, bins=50)
    plt.title(f'{col.capitalize()} Distribution')
    plt.xlabel(col.capitalize())
    plt.ylabel('Frequency')
    
    # Box plot
    plt.subplot(1, 2, 2)
    sns.boxplot(x=cars_cleaned[col].dropna())
    plt.title(f'{col.capitalize()} Box Plot')
    plt.xlabel(col.capitalize())
    
    plt.tight_layout()
    plt.show()

In [None]:
# Distribution plot suggests outliers in all three variables: price, year and odometer 

#### Outlier Detection
Identified outliers in numerical features by visualizing box plots and scatter plots. Assessed the potential impact of these outliers on the analysis and modeling process.

**Handling Outliers:** Outliers can be removed, transformed (e.g., with a log transformation), or capped (by setting a limit on extreme values). Here we remove them.
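
For comparison, the two options not used in this notebook can be sketched as follows. The prices are made up (one value imitates the kind of placeholder price seen in the listings), and the 1.5 × IQR fences are the same rule the removal code below uses.

```python
import numpy as np
import pandas as pd

# Illustrative prices with one extreme value.
prices = pd.Series([1000, 5000, 8000, 12000, 15000, 987654321])

# Option 1: log transform. log1p compresses the long right tail
# while keeping every row.
log_prices = np.log1p(prices)

# Option 2: capping. Values beyond the IQR fences are pulled back
# to the fence instead of being dropped.
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
capped = prices.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

print(log_prices.round(2).tolist())
print(capped.tolist())
```

Both options keep the row count unchanged, which matters when other columns in the same row carry useful information.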

In [None]:
# Function to identify outliers using IQR
def identify_outliers(df, feature):
    Q1 = df[feature].quantile(0.25)
    Q3 = df[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[feature] < lower_bound) | (df[feature] > upper_bound)]
    return outliers

# Identify outliers in each feature
outliers_price = identify_outliers(cars_cleaned, 'price')
outliers_year = identify_outliers(cars_cleaned, 'year')
outliers_odometer = identify_outliers(cars_cleaned, 'odometer')

print("Outliers in Price:\n", outliers_price)
print("Outliers in Year:\n", outliers_year)
print("Outliers in Odometer:\n", outliers_odometer)

In [None]:
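# Remove rows outside the 1.5 * IQR fences.
# Note: rows where the feature is NaN fail both comparisons, so they are dropped as well.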
def remove_outliers(df, feature):
    Q1 = df[feature].quantile(0.25)
    Q3 = df[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df_cleaned = df[(df[feature] >= lower_bound) & (df[feature] <= upper_bound)]
    return df_cleaned

# Clean the dataset by removing outliers
for feature in ['price', 'year', 'odometer']:
    cars_cleaned = remove_outliers(cars_cleaned, feature)

In [None]:
cars_cleaned

In [None]:
# Display total rows and columns
total_rows, total_columns = cars_cleaned.shape
print(f"Total rows: {total_rows}")
print(f"Total columns: {total_columns}")
cars_cleaned.columns

#### Distribution Analysis After Outlier Removal
The same histograms and box plots are redrawn for price, year, and odometer to check how removing the outliers changed their spread and skewness.

In [None]:
# Redraw the histogram (with KDE overlay) and box plot for each numerical variable after removing outliers
for col in numerical_variables:
    plt.figure(figsize=(12, 4))
    
    # Histogram
    plt.subplot(1, 2, 1)
    sns.histplot(cars_cleaned[col].dropna(), kde=True, bins=50)
    plt.title(f'{col.capitalize()} Distribution')
    plt.xlabel(col.capitalize())
    plt.ylabel('Frequency')
    
    # Box plot
    plt.subplot(1, 2, 2)
    sns.boxplot(x=cars_cleaned[col].dropna())
    plt.title(f'{col.capitalize()} Box Plot')
    plt.xlabel(col.capitalize())
    
    plt.tight_layout()
    plt.show()

#### Missing Values Analysis & Handling
During the Data Understanding and Preparation phases, identifying and handling missing values is crucial to ensure the integrity and accuracy of our dataset. Here’s how we approached this process:
- **Identifying Columns with Missing Values:**
We first identified columns in our dataset that contained missing values. This step helps us understand which features may require imputation or further investigation.
- **Calculating Percentage of Missing Data:**
For each identified column, we calculated the percentage of missing data. This quantification allows us to prioritize columns for imputation or decide if columns with extensive missing data should be excluded from analysis.
- **Visualizing Missing Data Patterns:**
To visualize the extent and distribution of missingness across the dataset, we used heatmaps. Heatmaps provide a clear graphical representation where missing values are marked distinctively, enabling us to discern any patterns or clusters of missing data.

**Data Cleaning: Handling Missing Values:** 
Missing values were identified across columns such as year, condition, cylinders, fuel, odometer, title status, transmission, drive, size, type, and paint color. Missing manufacturer and model values were handled earlier when building make_model, and state has no missing values and was dropped.
- **Group-wise mode:** In the cells below, each column with missing values is filled with the most frequent value among listings that share the same make_model. Rows that still contain a NaN afterwards are dropped.
- **Alternative:** Numerical features like year and odometer could instead be imputed with the median, which is robust to outliers, and categorical features with the mode or an "Unknown" label.

In [None]:
cars_cleaned.isnull().sum()

In [None]:
# Identify columns with missing values
columns_with_missing = cars_cleaned.columns[cars_cleaned.isnull().any()]

# Calculate percentage of missing values for each column
missing_percentage = (cars_cleaned[columns_with_missing].isnull().mean() * 100)
print("Percentage of missing values in each column:")

for col, percentage in missing_percentage.items():
    print(f"{col}: {percentage:.2f}%")

In [None]:
# Visualize missing data patterns using a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(cars_cleaned[columns_with_missing].isnull(), cmap='coolwarm')
plt.title('Missing Data Patterns')
plt.xlabel('Columns')
plt.ylabel('Rows')
plt.show()

In [None]:
columns_with_missing

**Update NaN values based on make_model**
- For each column in columns_with_missing, apply groupby('make_model') to group the data by make_model.
- Use transform to fill NaN values within each group with that group's most frequent value (mode). If a group has no non-null value for the column, its mode is empty and the NaN is left in place.

In [None]:
# Group by 'make_model' and fill NaN values in each column
for col in columns_with_missing:
    cars_cleaned[col] = cars_cleaned.groupby('make_model')[col].transform(lambda x: x.fillna(x.mode().iloc[0] if not x.mode().empty else np.nan))
    print('updated col:', col)

cars_cleaned.isnull().sum()

In [None]:
cars_cleaned[cars_cleaned.isnull().any(axis=1)]

In [None]:
cars_cleaned.type.unique()

In [None]:
cars_cleaned.query('cylinders == "8 cylinders" and type == "pickup" and (size == "sub-compact" or size == "compact")')

In [None]:
# Drop all nan rows for further analysis
cars_cleaned.dropna(inplace=True)

# If you want to reset the index after dropping rows
cars_cleaned.reset_index(drop=True, inplace=True)

In [None]:
cars_cleaned.isnull().sum()

**Alternate approach: fill numeric columns with the median and categorical columns with the mode**
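
This alternative is not implemented in the notebook. A minimal sketch of what it could look like is given below on a made-up frame (`toy` and its values are illustrative). The "Unknown" fallback for columns with no observed value is an assumption.

```python
import numpy as np
import pandas as pd

# Illustrative data only: one numeric column, one categorical column,
# and one categorical column that is entirely missing.
toy = pd.DataFrame({
    "odometer": [10000.0, np.nan, 30000.0, 50000.0],
    "fuel": ["gas", None, "gas", "diesel"],
    "size": [None, None, None, None],
})

imputed = toy.copy()
for col in imputed.columns:
    if pd.api.types.is_numeric_dtype(imputed[col]):
        # Median is less sensitive to extreme values than the mean.
        imputed[col] = imputed[col].fillna(imputed[col].median())
    else:
        # Most frequent category, or "Unknown" when nothing was observed.
        modes = imputed[col].mode()
        fill_value = modes.iloc[0] if not modes.empty else "Unknown"
        imputed[col] = imputed[col].fillna(fill_value)

print(imputed)
```

Unlike dropping rows, this keeps every listing, at the cost of flattening differences between vehicles that happen to share a missing field.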

#### Categorical Data Analysis: 
Examined the distribution of categorical variables (e.g., condition, fuel, transmission) using count plots. This analysis provided insights into the most common categories and their frequencies.

In [None]:
# Identify categorical columns
categorical_cols = cars_cleaned.select_dtypes(include=['object']).columns.tolist()

In [None]:
for var in categorical_cols:
    print(cars_cleaned[var].value_counts(), '\nlength:', len(cars_cleaned[var].unique()), '\n\n')

In [None]:
print(categorical_cols)

In [None]:
selected_vars = ['condition', 'cylinders', 'fuel', 'title_status', 'transmission', 'drive', 'size', 'type', 'paint_color']

# Loop through each selected variable and draw a count plot
for var in selected_vars:
    plt.figure(figsize=(10, 3))
    sns.countplot(data=cars_cleaned, x=var, palette='viridis')
    plt.xlabel(var)
    plt.ylabel('Count')
    plt.title(f'Count Plot of {var}')
    plt.xticks(rotation=90)
    plt.show()

**Possible categorical feature transformations:**
- Target Encoding: make_model
- Ordinal Encoding: condition, cylinders, title_status, drive, size
- One-Hot Encoding: fuel, transmission, type, paint_color
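
The three encodings listed above can be sketched with plain pandas as follows. The frame is made up, and the ordering of the condition levels is an assumption for illustration; the dataset itself does not define one.

```python
import pandas as pd

# Illustrative data only.
toy = pd.DataFrame({
    "condition": ["good", "excellent", "fair", "like new"],
    "fuel": ["gas", "diesel", "gas", "electric"],
    "make_model": ["ford-f-150", "ford-f-150", "honda-civic", "honda-civic"],
    "price": [20000, 24000, 6000, 10000],
})

# Ordinal encoding: map categories to integers in an assumed worst-to-best order.
condition_order = ["salvage", "fair", "good", "excellent", "like new", "new"]
toy["condition_code"] = pd.Categorical(
    toy["condition"], categories=condition_order, ordered=True
).codes

# One-hot encoding: one indicator column per fuel type.
encoded = pd.get_dummies(toy, columns=["fuel"], prefix="fuel")

# Target encoding: replace make_model with the mean price of its group.
# In a real pipeline these means should come from the training split only,
# otherwise the target leaks into the features.
encoded["make_model_te"] = encoded.groupby("make_model")["price"].transform("mean")

print(encoded)
```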

#### Descriptive Statistics

Key numerical features such as price, year, and odometer are analyzed.

**Transforming the year column into vehicle_age and dropping year**
- **New feature: vehicle_age** is the difference between the current year and the year of manufacture. Expressing year as an age ties the feature directly to depreciation, which may give clearer insight into pricing dynamics.

In [None]:
# Adding new feature vehicle_age

from datetime import datetime

# Get the current year dynamically
current_year = datetime.now().year

# Create the derived feature 'vehicle_age'
cars_cleaned['vehicle_age'] = current_year - cars_cleaned['year']

# Drop year column
cars_cleaned.drop(columns=['year'], inplace=True)

# Display the updated DataFrame with the new feature
cars_cleaned.head()

In [None]:
cars_cleaned.describe()

**Summary statistics:**
- **Price:** The prices of vehicles in the dataset range from USD 1 to USD 38,711, with a median price of USD 10,500.
- **Odometer:** The mileage of vehicles varies from 0 to 279,437 miles, with a median odometer reading of 106,715 miles.
- **Vehicle Age:** The age of vehicles spans from 2 to 29 years, with a median age of 12 years.

#### Bivariate Analysis
Explored relationships between pairs of variables (e.g., price vs. vehicle_age, price vs. odometer) using scatter plots. This helped in understanding how different features interact and influence the target variable (price).

**Covariance and Correlation Analysis:**
Computed covariance and correlation coefficients between the numerical variables to identify linear relationships, and visualized the correlations with a heatmap, which highlights the relationships among vehicle_age, price, and odometer.

In [None]:
# Covariance of numerical variables
cars_cleaned.cov(numeric_only=True)

**Interpretation (covariance):**
- Negative covariance between price and odometer, indicating that as price increases, odometer tends to decrease.
- Negative covariance between price and vehicle_age, suggesting that as price increases, vehicle age tends to decrease.
- Positive covariance between odometer and vehicle_age, indicating that higher mileage tends to go with older vehicles.

The magnitude of a covariance depends on the units of both variables, so only its sign is interpreted here; the correlation matrix that follows gives the unit-free strength of each relationship.

In [None]:
cars_cleaned.corr(numeric_only=True)

In [None]:
# Visualize correlations
plt.figure(figsize=(5, 3))
sns.heatmap(cars_cleaned.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

**Interpretation (correlation):**
- Moderate negative correlation (-0.48) between price and odometer, suggesting that higher prices correlate with lower odometer readings.
- Moderate negative correlation (-0.56) between price and vehicle_age, indicating that higher prices correlate with younger vehicles.
- Moderate positive correlation (0.59) between odometer and vehicle_age, suggesting that older vehicles tend to have higher mileage.
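
Covariance and correlation always share a sign, because correlation is the covariance divided by the product of the two standard deviations. The sketch below checks this on hypothetical `price` and `odometer` values (made up for illustration, not taken from the dataset).

```python
import numpy as np

# Hypothetical values: price falls as odometer rises
price = np.array([30000.0, 22000.0, 15000.0, 9000.0, 5000.0])
odometer = np.array([20000.0, 60000.0, 90000.0, 140000.0, 200000.0])

cov = np.cov(price, odometer)[0, 1]        # sample covariance (ddof=1)
corr = np.corrcoef(price, odometer)[0, 1]  # Pearson correlation

# Correlation is covariance divided by the product of the standard deviations
corr_from_cov = cov / (price.std(ddof=1) * odometer.std(ddof=1))

print(f'covariance: {cov:.1f}')
print(f'correlation: {corr:.3f}')
print(f'cov / (std_x * std_y): {corr_from_cov:.3f}')
```

The covariance is a large negative number in dollar-miles, while the correlation is confined to the range from -1 to 1, which is why the correlation values are the ones compared across variable pairs.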

In [None]:
# A pairplot over all listings is confounded by manufacturer and model; restricting it to a single make and model shows the relationships more clearly.
sns.pairplot(cars_cleaned.query('make_model == "ford f-150 xlt"'))

In [None]:
# Final Data
cars_cleaned

## Data Preparation

In the Data Preparation phase we focused on transforming the raw dataset into a clean and structured format suitable for modeling. This phase is crucial for ensuring the accuracy and reliability of our predictive model. Here’s a detailed overview of the steps taken:

- **Data Cleaning:**
    - **Handling Missing Values:** Missing values were identified across various columns such as year, manufacturer, model, condition, cylinders, fuel type, odometer readings, title status, transmission type, drive type, size category, car type, paint color, and state.
        - **Numerical features:** for columns like year and odometer, missing values were imputed using the median, which is robust to outliers and provides a central tendency estimate.
        - **Categorical features** were imputed using the mode or designated as "Unknown" where appropriate, maintaining the integrity of the dataset while preserving categorical distinctions.
    - **Handling Outliers:** remove, transform (e.g., with a log transformation), or cap outliers (set a limit on extreme values).
    - **Remove duplicates**
    - **Handling incorrect data types**
    - **Handling inconsistent data** (example: age shouldn't be negative)
- **Split data into training, validation, and test sets:**
This partitioning ensured that the model could be trained on one subset of the data, tuned on a validation subset, and tested on unseen data to assess generalization. Splitting before feature engineering and modeling prevents data leakage and keeps the evaluation indicative of performance on unseen data.
    - Prevention of Data Leakage
    - Accurate Evaluation
    - Ethical Modeling Practices
- **Feature Engineering / Data Transformation:**
    - **Normalization/Scaling:** Numerical features such as vehicle_age and odometer were standardized with StandardScaler so that they share a common scale, which helps machine learning algorithms that are sensitive to differing feature scales.
    - **Encoding Categorical Variables:** Categorical features were converted to numerical form with an encoding suited to each: target encoding for manufacturer and model, ordinal encoding for ordered features such as condition and size, and one-hot encoding for nominal features such as fuel and paint_color, which avoids imposing artificial ordinality on them.
    - **New features** were derived from existing ones to enhance predictive power. For example, vehicle_age, the difference between the current year and the year of manufacture, captures depreciation more directly than the model year.

By meticulously preparing the dataset in this manner, we established a solid foundation for building and evaluating predictive models that accurately forecast used car prices. This phase not only enhanced data quality but also streamlined subsequent phases of modeling, evaluation, and deployment within the CRISP-DM framework.
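
The outlier options listed above (capping and transforming) can be sketched as follows. The `prices` series is hypothetical, and the 1.5 × IQR rule is one common convention assumed here, not necessarily the threshold used for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one implausibly low and one very high listing
prices = pd.Series([1, 4500, 8000, 10500, 12000, 15000, 250000], dtype=float)

# Capping: clip values that fall more than 1.5 * IQR outside the quartiles
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
capped = prices.clip(lower=lower, upper=upper)

# Transforming: log1p compresses the long right tail
logged = np.log1p(prices)

print(capped.tolist())
print(logged.round(2).tolist())
```

Capping keeps every row but flattens the extremes, while the log transform keeps the ordering of all values and only changes their spacing.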

In [None]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from category_encoders import TargetEncoder
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, OrdinalEncoder, OneHotEncoder 
from sklearn.compose import ColumnTransformer

from sklearn import set_config
set_config(display="diagram")

### Split data into training, validation, and test sets

In [None]:
X = cars_cleaned.drop(columns=['price'])
y = cars_cleaned['price']

In [None]:
X

In [None]:
y

In [None]:
# Split the data into train and temp sets (80% train, 20% temp)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)

# Split the temp set into validation and test sets (50% validation, 50% test from the temp set)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

In [None]:
# Display the sizes of the datasets
print(f'Training set: {X_train.shape[0]} samples')
print(f'Validation set: {X_val.shape[0]} samples')
print(f'Test set: {X_test.shape[0]} samples')
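
The leakage concern behind splitting first can be illustrated with a scaler: its mean and standard deviation should be learned from the training split only and then reused on the other splits. The sketch below uses synthetic `odometer` readings as a stand-in for the real column.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic odometer readings (hypothetical distribution)
rng = np.random.default_rng(42)
odometer = rng.normal(100_000, 40_000, size=(1000, 1))

train, test = train_test_split(odometer, test_size=0.2, random_state=42)

# Fit on the training split only, then reuse the fitted statistics on the test split
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print(f'train mean after scaling: {train_scaled.mean():.3f}')
print(f'test mean after scaling:  {test_scaled.mean():.3f}')
```

The test mean is close to, but not exactly, zero. That small gap is expected: the test split never influenced the scaler, which is what keeps the evaluation honest.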

### Feature Engineering:
Selecting, manipulating and transforming raw data into features that can be used in supervised learning.

**Preprocessor**

- Impute missing values in the numerical columns (vehicle_age, odometer) with SimpleImputer(strategy='median').
- Apply StandardScaler to standardize the numerical columns (vehicle_age, odometer).
- Impute missing values in the categorical columns with SimpleImputer(strategy='most_frequent').
- Apply target encoding to the high-cardinality categorical columns (manufacturer, model).
- Apply ordinal encoding to the ordered categorical columns (condition, cylinders, title_status, drive, size).
- Apply one-hot encoding to the nominal categorical columns (fuel, transmission, type, paint_color).
- Pass through any remaining columns (remainder='passthrough').

One-hot encoding lets the nominal categories enter the model without imposing artificial ordinality.

In [None]:
# Define numeric and categorical columns / features
numeric_features = X_train.select_dtypes(include=['int', 'float']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()

In [None]:
numeric_features

In [None]:
print(categorical_features)

#### **Simple Imputer for all Numerical features (vehicle_age, odometer)**
- Impute missing values with median.
- Scale the numeric features using StandardScaler.

In [None]:
# Define Numerical Transformer
numeric_imputer = Pipeline(steps=[
    ('num_imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

#### **Simple Imputer for all the Categorical Features (condition, cylinders, fuel, title_status, transmission, drive, size, type, paint_color)**
- Impute missing categorical values with the most frequent category.

In [None]:
# Selected categorical variables for simple imputer
cat_imputer_vars = ['condition', 'cylinders', 'fuel', 'title_status', 'transmission', 'drive', 'size', 'type', 'paint_color']

# Define imputer pipeline
categorical_imputer = Pipeline(steps=[
    ('cat_imputer', SimpleImputer(strategy='most_frequent'))
])

#### **Target Encoder for Categorical Features (manufacturer, model)**
- Target Encoding replaces each category with the average target value (here, price) observed for it; it is used for manufacturer and model.
- **Advantages:**
    - Preserves Information: keeps the individual information of each manufacturer and model, albeit in a transformed numerical form.
    - Captures Price Trends: encoding with the average price directly captures the price tendency associated with each manufacturer and model, which can be crucial for price prediction tasks.
    - Works with High Cardinality: Manufacturer and model often have high cardinality (many unique values), and Target Encoding can handle this well without creating too many additional columns, unlike One-Hot Encoding.

In [None]:
# Columns to be target encoded
target_enc_features = ['manufacturer', 'model']

# Initialize TargetEncoder
te = TargetEncoder(cols=target_enc_features)

# Fit and transform on training data
X_train_te = te.fit_transform(X_train[target_enc_features], y_train)

# Transform validation and test data using fitted encoder
X_val_te = te.transform(X_val[target_enc_features])
X_test_te = te.transform(X_test[target_enc_features])
X_train_te

In [None]:
X_train_te.isnull().sum()

In [None]:
# Drop original columns and concatenate the encoded columns
X_train_combined = X_train.drop(columns=target_enc_features)
X_val_combined = X_val.drop(columns=target_enc_features)
X_test_combined = X_test.drop(columns=target_enc_features)

# Concatenate encoded columns
X_train_combined = pd.concat([X_train_combined, X_train_te], axis=1)
X_val_combined = pd.concat([X_val_combined, X_val_te], axis=1)
X_test_combined = pd.concat([X_test_combined, X_test_te], axis=1)

In [None]:
X_train_combined.isnull().sum()

##### Target Encoder Considerations:
- Data Size and Distribution: Ensure you have enough data points for each manufacturer and model to calculate reliable average prices. Small sample sizes can lead to biased estimates.
- Validation Strategy: When using Target Encoding, it's crucial to compute average prices using only the training data to avoid data leakage. You'll need to fit the encoder on the training set and transform both the training and validation/test sets consistently.
- Impact on Model Performance: Validate whether the encoded features improve model performance (e.g., using cross-validation or a separate validation set). Sometimes, simpler encoding methods might suffice depending on your dataset and model.
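
The small-sample concern can be made concrete with a hand-rolled version of the idea. The sketch below uses simple additive smoothing with a hypothetical weight `m`; this is an assumption for illustration and is not the exact formula that `category_encoders.TargetEncoder` applies. The rows are made up.

```python
import pandas as pd

# Hypothetical training rows
train = pd.DataFrame({
    'manufacturer': ['ford', 'ford', 'ford', 'toyota', 'toyota', 'ferrari'],
    'price': [10000, 12000, 14000, 9000, 11000, 90000],
})

# Statistics are computed from the training data only
global_mean = train['price'].mean()
stats = train.groupby('manufacturer')['price'].agg(['mean', 'count'])

# Shrink each category mean toward the global mean; m is a hypothetical smoothing weight
m = 2
stats['encoded'] = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

# Categories unseen in training fall back to the global mean
new_rows = pd.Series(['ford', 'ferrari', 'tesla'])
encoded = new_rows.map(stats['encoded']).fillna(global_mean)
print(encoded.tolist())
```

The single ferrari listing is pulled strongly toward the global mean, while ford, with three listings, moves less. Categories with few observations are trusted less, which limits the bias described above.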

#### **Ordinal Encoder for Categorical features (condition, cylinders, title_status, drive, size)**

Encode select ordinal categorical features using OrdinalEncoder with predefined category orders.
- **condition_categories**
    - salvage, fair, good, excellent, like new, new
- **cylinders_categories**
    - other, 3 cylinders, 4 cylinders, 5 cylinders, 6 cylinders, 8 cylinders, 10 cylinders, 12 cylinders
- **title_status**
    - Missing: Title is missing; the vehicle cannot be legally sold or registered.
    - Parts Only: Vehicle is only suitable for parts; cannot be legally driven.
    - Salvage: Vehicle was damaged and deemed a total loss by an insurance company.
    - Rebuilt: Vehicle was previously salvaged but has been repaired and inspected.
    - Lien: There is a lien on the vehicle, meaning it is used as collateral for a loan.
    - Clean: Vehicle has a clean title, indicating it has not been in major accidents or suffered significant damage.
- **drive**
    - RWD (rear-wheel drive): Generally considered less versatile in adverse weather conditions.
    - FWD (front-wheel drive): More common in consumer vehicles, better in slippery conditions compared to RWD.
    - 4WD (four-wheel drive): Typically considered the most versatile, ideal for off-road and all-weather conditions.
- **size**
    - Sub-compact: Smallest category, typically very compact and often economical.
    - Compact: Slightly larger than sub-compact but still smaller than mid-size.
    - Mid-size: Intermediate size category, offering a balance between compactness and spaciousness.
    - Full-size: Largest category, providing maximum space and comfort.

In [None]:
# Define ordinal encoding pipeline
ordinal_features = ['condition', 'cylinders', 'title_status', 'drive', 'size']

# Define the category order for each ordinal variable, from lowest to highest
condition_categories = ['salvage', 'fair', 'good', 'excellent', 'like new', 'new']
cylinders_categories = ['other', '3 cylinders', '4 cylinders', '5 cylinders', '6 cylinders', 
                        '8 cylinders', '10 cylinders', '12 cylinders']
title_categories = ['missing', 'parts only', 'salvage', 'rebuilt', 'lien', 'clean']
drive_categories = ['rwd', 'fwd', '4wd']
size_categories = ['sub-compact', 'compact', 'mid-size', 'full-size']

In [None]:
# Define ordinal encoding transformers for each categorical variable
ordinal_transformers = [
    ('condition', OrdinalEncoder(categories=[condition_categories]), ['condition']),
    ('cylinders', OrdinalEncoder(categories=[cylinders_categories]), ['cylinders']),
    ('title_status', OrdinalEncoder(categories=[title_categories]), ['title_status']),
    ('drive', OrdinalEncoder(categories=[drive_categories]), ['drive']),
    ('size', OrdinalEncoder(categories=[size_categories]), ['size'])
]

# Create the ordinal encoding pipeline using ColumnTransformer
ordinal_pipeline = Pipeline(steps=[
    ('ordinal', ColumnTransformer(transformers=ordinal_transformers))
])

#### **One Hot Encoder for Categorical Variables (fuel, transmission, type, paint_color)**
- Encode nominal categorical features using OneHotEncoder, ignoring unknown categories.

In [None]:
# Define one hot encoder pipeline
ohe_features = ['fuel', 'transmission', 'type', 'paint_color']
categorical_one_hot = Pipeline(steps=[
    ('one_hot', OneHotEncoder(handle_unknown='ignore'))
])

#### **Polynomial Transformer**
- Generate polynomial features (up to degree 2) from numeric features.

In [None]:
# Define the polynomial transformer for numerical features
poly_transformer = PolynomialFeatures(degree=2, include_bias=False)

#### **Preprocessor:**
- Combine all these transformers into a single ColumnTransformer which can handle both numerical and categorical features.
- This step doesn't include Target Encoder as that has already been done in prior steps

In [None]:
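# ColumnTransformer applies its transformers in parallel to the original columns,
# so imputation and encoding are chained inside each branch rather than listed separately
preprocessor = ColumnTransformer(
    transformers=[
        # Impute and scale the numeric columns, then expand them to degree-2 terms
        ('num', Pipeline(steps=[('impute_scale', numeric_imputer),
                                ('poly', poly_transformer)]), numeric_features),
        # Impute, then ordinal-encode with the predefined category orders
        ('ord', Pipeline(steps=[('impute', categorical_imputer),
                                ('ordinal', OrdinalEncoder(categories=[
                                    condition_categories, cylinders_categories,
                                    title_categories, drive_categories, size_categories]))
                                ]), ordinal_features),
        # Impute, then one-hot encode the nominal columns
        ('ohe', Pipeline(steps=[('impute', categorical_imputer),
                                ('one_hot', categorical_one_hot)]), ohe_features)
    ],
    remainder='passthrough'  # Keep the target-encoded manufacturer and model columns
)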

## Modeling
In the Modeling phase, we aimed to develop a robust predictive model to forecast used car prices based on the cleaned and transformed dataset. Key steps and considerations include:
- **Model Selection:**
We evaluated regression models suited to predicting a continuous target, such as Linear, Ridge, and Lasso regression.
Each model was selected based on its ability to handle the dataset's characteristics, interpretability, and potential for achieving high prediction accuracy.
- **Model Training:**
The selected models were trained using the training dataset, which was prepared during the data preparation phase.
Parameters were tuned using techniques like grid search or sequential feature selection (SFS) combined with cross-validation to optimize model performance and prevent overfitting.
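
Grid search with cross-validation could be sketched as below for the Ridge regularization strength. The data is synthetic (`make_regression`) and the `alpha` grid is a hypothetical choice, so the selected value says nothing about the car dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the preprocessed training set
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Hypothetical grid of regularization strengths
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}

# 5-fold cross-validation, scored by (negated) RMSE
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring='neg_root_mean_squared_error')
search.fit(X_demo, y_demo)

print('best alpha:', search.best_params_['alpha'])
print('cross-validated RMSE:', -search.best_score_)
```

In the notebook's setting, the estimator passed to `GridSearchCV` would be the full preprocessing-plus-regressor pipeline, with the parameter addressed as `regressor__alpha`, so that preprocessing is refit inside every fold.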

In [None]:
# Import necessary libraries
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.inspection import permutation_importance
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
# Initialize models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(),
}

**Pipeline**

- Combines the preprocessor with each regression model in turn.
- Fits the pipeline to the training data (pipeline.fit(X_train_combined, y_train)).
- Uses the fitted pipeline to predict prices for the validation data (pipeline.predict(X_val_combined)).

In [None]:
# Train and evaluate models
results = {}
for name, model in models.items():
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('regressor', model)
    ])
    pipeline.fit(X_train_combined, y_train)
    y_pred = pipeline.predict(X_val_combined)
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    r2 = r2_score(y_val, y_pred)
    results[name] = {'RMSE': rmse, 'R2': r2}
    print(f"{name}: RMSE = {rmse:.4f}, R2 = {r2:.4f}")

## Evaluation
In the Evaluation phase, we assessed the performance of the trained models to ensure they met our project objectives and business requirements:

- **Metrics:**
Candidate evaluation metrics include Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R²); the validation loop above reports RMSE and R².
These metrics provided insights into how well the models predicted used car prices compared to actual values.
- **Validation:**
The models were validated using the validation dataset, which was set aside during the data preparation phase.
This step ensured that the models generalized well to unseen data and avoided overfitting to the training dataset.
- **Business Validation:**
We conducted business validation to ensure that the models aligned with the project objectives and provided actionable insights for pricing strategy and inventory management.
Stakeholder feedback and domain knowledge were incorporated to validate the relevance and applicability of model outputs in real-world scenarios.
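
The four metrics named above can be computed as in the following sketch. The `y_true` and `y_hat` arrays are hypothetical prices chosen for easy arithmetic, not model output.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted prices
y_true = np.array([10000.0, 15000.0, 20000.0, 25000.0])
y_hat = np.array([11000.0, 14000.0, 22000.0, 24000.0])

mae = mean_absolute_error(y_true, y_hat)  # average absolute error, in dollars
mse = mean_squared_error(y_true, y_hat)   # average squared error, in dollars squared
rmse = np.sqrt(mse)                       # back in dollars, penalizes large misses
r2 = r2_score(y_true, y_hat)              # share of price variance explained

print(f'MAE = {mae:.2f}, MSE = {mse:.2f}, RMSE = {rmse:.2f}, R2 = {r2:.3f}')
```

MAE and RMSE are both in dollars, which makes them the easiest to communicate to the business. RMSE is never smaller than MAE, and a wide gap between the two points to a few large errors.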

## Conclusion
In conclusion, this project focused on predicting used car prices using a structured approach based on the CRISP-DM framework. We started with understanding business objectives and data collection, followed by thorough data preparation, modeling, evaluation, and deployment phases. Here's a summary of our findings and actionable insights:

### Interesting Findings

| Feature | Recommendation | Coefficient Value | Impact | Interpretation |
|---------|----------------|-------------------|--------|----------------|
|---------|----------------|-------------------|--------|----------------|


### Actionable Insights
- **Optimize Pricing Strategy:**
Implement dynamic pricing strategies that consider factors like mileage, age, and market demand to maximize profitability.
Monitor regional pricing trends and adjust pricing strategies accordingly to capitalize on local market dynamics.
- **Enhance Inventory Management:**
Prioritize high-demand models and manufacturers based on their resale value and market popularity.
Utilize predictive models to forecast inventory turnover and manage stock levels effectively.
- **Improve Customer Engagement:**
Leverage insights into popular features and attributes (e.g., low mileage, specific models) to tailor marketing and sales efforts.
Offer transparent pricing information and detailed vehicle histories to build trust and attract informed buyers.
- **Continuous Model Refinement:**
Regularly update the predictive model with new data to maintain accuracy and relevance.
Incorporate user feedback and market insights to enhance model performance and adapt to changing market conditions.

By leveraging these insights and taking proactive steps, our business can enhance its competitive edge in the used car market, drive revenue growth, and improve customer satisfaction through informed decision-making.

## Deployment
In the Deployment phase, the focus was on integrating the validated model into operational processes to derive business value:

- **Model Integration:**
The final model was integrated into the business workflow, either as a standalone tool, embedded within existing applications, or deployed as a web service.
This integration facilitated seamless access to predictive insights for decision-makers and operational teams.
- **User Training and Documentation:**
End-users were trained on how to use the model outputs effectively, interpret predictions, and make informed decisions based on the results.
Comprehensive documentation was provided to ensure clarity on model inputs, outputs, limitations, and best practices for utilization.
- **Monitoring and Maintenance:**
Ongoing monitoring of the deployed model's performance was established to detect any drift or degradation in predictive accuracy over time.
Regular updates and maintenance were scheduled to refine the model based on new data and evolving business needs, ensuring sustained relevance and reliability.
- **Feedback Loop:**
A feedback mechanism was established to gather user feedback and incorporate it into model updates and enhancements.
This iterative process aimed to continuously improve the model's predictive capabilities and optimize its impact on business outcomes.

By documenting these phases comprehensively, we ensure transparency, reproducibility, and alignment with project goals throughout the lifecycle of developing and deploying a predictive model for used car price prediction.

### Used the pkl file and the function above to deploy the machine learning model to make predictions.
Additional implementation steps:

- Model monitoring
- Versioning
- Security
- CI/CD
- Scaling (Kubernetes/ AWS ECS)
- User interface (Dash, React, Flutter)
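
Saving a fitted pipeline to a pkl file and loading it back for predictions could look roughly like this. The small pipeline, the synthetic data, and the file name `price_model.pkl` are hypothetical stand-ins; the notebook's own deployment function is not reproduced here.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the fitted price pipeline
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=50)

model = Pipeline(steps=[('scaler', StandardScaler()), ('regressor', Ridge())])
model.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'price_model.pkl')  # hypothetical file name

    # Training side: serialize the whole fitted pipeline
    with open(model_path, 'wb') as f:
        pickle.dump(model, f)

    # Serving side: restore the pipeline and predict
    with open(model_path, 'rb') as f:
        restored = pickle.load(f)

preds = restored.predict(X_demo[:3])
print(preds)
```

Pickling the whole pipeline, not just the regressor, keeps preprocessing and model in sync at prediction time. Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.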