# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary.

**Business Understanding - Answer1**

The business task "Identify key drivers for used cars prices" can be framed as a data problem assuming that the target variable is the 'car price' then I can use supervised regression to predict the used card price. This can be achieved based on features ( or predictor variables ) like the year, odometer, menaufacturer, etc.

**Business Understanding - Answer2**:

The Plan for using the CRISP-DM framework is after to get a solid enough business understanding,  have to transition to the Data Understanding Phase to run exploratory data analysis to check data quality such as completeness, confiability, and consistency across the data set. The next phase: Data Prep I will clean data and run feature engineering in order to prepare the data set for the next phases. Once I have the data prepared I can move to the Modeling phase where I can develop and validate predictive models—such as linear regression or ensemble methods—to quantify the influence of each predictor on the price, using metrics like RMSE to assess model performance. Finally, The model will predict prices but also is going to identify and rank the key drivers that impact on used car pricing to provide actionable insights that can resolve the original business problem.


### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

**Data Understanding - Answer1**

My exploratory activities were:
Identify key drivers for used car prices means in terms of data tasks
1. Look for data sources of used cars: In this case Kaggle
2. Download in my phython environment
3. Initial data inspection:  using cars.info() and cars.head() to review the structure, columns names, data types, samples

In [1]:
from google.colab import drive
drive.mount('/content/drive')
import os
os.chdir('/content/drive/My Drive/Berkeley/Unit11/practical_application_II')

Mounted at /content/drive


In [2]:
import pandas as pd
import numpy as np
cars =pd.read_csv('data/vehicles.csv')

Metadata Extraction
Understand what each attribute represents, noting the type of data (numerical, categorical, etc.) and any available metadata or data dictionaries. using cars.info(). cars.columns
Data Quality Assessment

In [3]:
print(cars.info())
print(cars.columns)
print(cars.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   region        426880 non-null  object 
 2   price         426880 non-null  int64  
 3   year          425675 non-null  float64
 4   manufacturer  409234 non-null  object 
 5   model         421603 non-null  object 
 6   condition     252776 non-null  object 
 7   cylinders     249202 non-null  object 
 8   fuel          423867 non-null  object 
 9   odometer      422480 non-null  float64
 10  title_status  418638 non-null  object 
 11  transmission  424324 non-null  object 
 12  VIN           265838 non-null  object 
 13  drive         296313 non-null  object 
 14  size          120519 non-null  object 
 15  type          334022 non-null  object 
 16  paint_color   296677 non-null  object 
 17  state         426880 non-null  object 
dtypes: f

**Data Understanding - Answer2**

Data Description & Summary Statistics.

1. I looked into summary statistics for numerical columns with df.describe() to understand the central tendencies, dispersion, and range and possibility to do some feature engineering.
2. I did exploration of Categorical Data:   cars['manufacturer'].value_counts() to explore the distribution of categorical features
3. I reviewed the Quality of the data searching for Missing Values Analysis: Identify missing or null values using methods like df.isnull().sum().
4. I looked for outliers in numerical columns. I used some visualization


In [None]:
print(cars.describe())

                 id         price           year      odometer
count  4.268800e+05  4.268800e+05  425675.000000  4.224800e+05
mean   7.311487e+09  7.519903e+04    2011.235191  9.804333e+04
std    4.473170e+06  1.218228e+07       9.452120  2.138815e+05
min    7.207408e+09  0.000000e+00    1900.000000  0.000000e+00
25%    7.308143e+09  5.900000e+03    2008.000000  3.770400e+04
50%    7.312621e+09  1.395000e+04    2013.000000  8.554800e+04
75%    7.315254e+09  2.648575e+04    2017.000000  1.335425e+05
max    7.317101e+09  3.736929e+09    2022.000000  1.000000e+07


In [None]:
print(cars.isnull().sum())

id                   0
region               0
price                0
year              1205
manufacturer     17646
model             5277
condition       174104
cylinders       177678
fuel              3013
odometer          4400
title_status      8242
transmission      2556
VIN             161042
drive           130567
size            306361
type             92858
paint_color     130203
state                0
dtype: int64


Outlier Identification: Outliers in numerical columns with visualizations.
Outlier Identification: Outliers in numerical columns using Interquartile Range (IQR) method to identify outliers:   
The Interquartile Range (IQR)  IQR = Q3 - Q1
Outliers values should fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR
Further I can adjust the parameter 1.5 to adjust the sensitivity


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

num_cols = ['price', 'year', 'odometer']
for col in num_cols:
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    for col in num_cols:
      fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Boxplot for quick visual identification of outliers
    sns.boxplot(x=cars[col], ax=axes[0])
    axes[0].set_title(f'Boxplot of {col}')

    # Histogram with a kernel density estimate for a detailed distribution view
    sns.histplot(cars[col], kde=True, ax=axes[1])
    axes[1].set_title(f'Histogram of {col}')

    plt.tight_layout()
    plt.show()

    sns.boxplot(x=cars[col], ax=axes[0])
    axes[0].set_title(f'Boxplot of {col}')

    # Histogram with a kernel density estimate for a detailed distribution view
    sns.histplot(cars[col], kde=True, ax=axes[1])
    axes[1].set_title(f'Histogram of {col}')

    plt.tight_layout()
    plt.show()


In [None]:
def detect_outliers(df, col, factor=1.5):
    """
    Detect outliers in a column of a DataFrame using the IQR method.

    Parameters:
        df (pd.DataFrame): The DataFrame containing the data.
        col (str): The name of the column to analyze.
        factor (float): The multiplier to determine the outlier threshold (default 1.5).

    Returns:
        outliers (pd.DataFrame): Subset of the DataFrame containing the outlier rows.
        lower_bound (float): The lower threshold.
        upper_bound (float): The upper threshold.
    """
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - factor * IQR
    upper_bound = Q3 + factor * IQR

    # Filter rows that have values below the lower bound or above the upper bound
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    return outliers, lower_bound, upper_bound

# using the outlier detection function to each numerical column..
for col in num_cols:
    outliers, lower_bound, upper_bound = detect_outliers(cars, col)
    print(f"Column: {col}")
    print(f"  Lower bound: {lower_bound}")
    print(f"  Upper bound: {upper_bound}")
    print(f"  Number of detected outliers: {outliers.shape[0]}")
    print("-" * 50)

Column: price
  Lower bound: -24978.625
  Upper bound: 57364.375
  Number of detected outliers: 8177
--------------------------------------------------
Column: year
  Lower bound: 1994.5
  Upper bound: 2030.5
  Number of detected outliers: 15896
--------------------------------------------------
Column: odometer
  Lower bound: -106053.75
  Upper bound: 277300.25
  Number of detected outliers: 4385
--------------------------------------------------


**Data Understanding - Answer3**

Some conclusions:
From the odometer boxplot and histogram I can conclude: there are a large quantity of odometer values that can be consider as outliers, which is in line with IQR rule.  The Data looks very skewed to lower values near zero. The Distribution range is highly packed around lower values, but the outliers stretch across the entire range of the x-axis, reaching close to 1.0 (normalized data).This high concentration in low values sugggest possible data quality issues or scale issues. In the next step, I should separate the analysis: Main distribution and other for outliers, to check if there is a real data issue or some error induced by normalization.

### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`.

**Data Preparation - Answer1**

**Handling missing values and data consistency**
1. Evaluate Missing Values: I Looked into missing values count per column to decide if I input or drop columns/rows based on the percentage of missing values. If A column has too many missing values and isn't critical, I should drop.

2. Imputation criteria:  For numerical columns, I will consider using the median for imputation
For Categorical columns: I'll use the mode or a constant ( "not found")



In [None]:
import pandas as pd

#. cars = pd.read_csv('data/vehicles.csv'). ##  LOST MY PREVIOUS CAR

# Check missing values
missing_summary = cars.isnull().sum()
print(missing_summary)

# Impute numerical columns with median (example for 'year' and 'odometer')
cars['year'].fillna(cars['year'].median(), inplace=True)
cars['odometer'].fillna(cars['odometer'].median(), inplace=True)

# Impute categorical columns with a placeholder
for col in ['manufacturer', 'model', 'condition', 'cylinders', 'fuel',
            'title_status', 'transmission', 'VIN', 'drive', 'size', 'type', 'paint_color']:
    cars[col].fillna('Unknown', inplace=True)

**Data Preparation - Answer2**

**Outlier handling & data Transformation**

1. Capping the outliers:  For columns like odometer with extreme high values, I will cap the them at a high percentile ( 99th )  to reduce their influence.

2. Log Transformation for Skewed Distributions:
Odometer is heavily right-skewed, a log transformation can help normalize the distribution. (Use np.log1p to handle zero values safely.)

In [None]:
import numpy as np

def cap_outliers(series, lower_quantile=0.01, upper_quantile=0.99):
    lower_bound = series.quantile(lower_quantile)
    upper_bound = series.quantile(upper_quantile)
    return series.clip(lower=lower_bound, upper=upper_bound)

# Cap outliers for odometer
cars['odometer_capped'] = cap_outliers(cars['odometer'])

In [None]:
# Apply log transformation to the original or capped odometer column
cars['odometer_log'] = np.log1p(cars['odometer_capped'])

**Data Preparation - Answer3**

**Feature Engineering**

1. ***Car Age*** : Instead of using the raw "year". I will compute the car's age relative to the current year (or the year of data collection). This is more useful because the Depreciation of the car is measure in how old thus is easier to compare  15-year-old and 20-year-old vehicles, respectively to understand how the depreciation could affect the price. Also, this approach could reduce the risk of incorporate time specific trends that could be present in the original raw data.  ***Age*** suggest me a better feature to find correlations because its refers to how long the car has been used. It can correlates with odometer and other features.

2. ***Price per mile*** I created a new feature :  Price per mile combining price and miles

3. ***Manufacturer reduction of categories***: To reduce the number of categories in the manufacturer column I grouped the infrequent ones into an "Other" category.





1 . **CarAge**

In [None]:
# import pandas as pd
import datetime

# Load dataset
# cars = pd.read_csv('data/vehicles.csv')

# Assume 'year' contains the car's production year and convert it to integer if needed
cars['year'] = cars['year'].fillna(0).astype(int)

# Get the current year
current_year = datetime.datetime.now().year

# Compute the age of the car
cars['age'] = current_year - cars['year']

# Check the result
print(cars[['year', 'age']].head())

   year  age
0  2013   12
1  2013   12
2  2013   12
3  2013   12
4  2013   12


2. **Price per mile**

In [None]:
# missing values in the 'odometer' column by imputing with the median.
cars['odometer'].fillna(cars['odometer'].median(), inplace=True)

# Avoid division by zero: Replace zero odometer values with NaN, then drop these rows
cars.loc[cars['odometer'] == 0, 'odometer'] = np.nan
cars = cars.dropna(subset=['odometer'])

# Create the new feature: price per mile
cars['price_per_mile'] = cars['price'] / cars['odometer']

# Display the first few rows to verify the new feature
print(cars[['price', 'odometer', 'price_per_mile']].head())

3. **Manufacturer Category Reduction**

In [None]:
import pandas as pd

# Load the dataset
# cars = pd.read_csv('data/vehicles.csv')

# Calculate the relative frequency of each manufacturer
manufacturer_freq = cars['manufacturer'].value_counts(normalize=True)

# Define a threshold: keep manufacturers with at least 1% of the records
threshold = 0.01 # Intersting 0.05 only give me Toyota, Chevrolet and ford I reduced to 0.01
# Identify manufacturers that meet or exceed the threshold
frequent_manufacturers = manufacturer_freq[manufacturer_freq >= threshold].index

# Create a new column where infrequent manufacturers are replaced with 'Other'
cars['manufacturer_reduced'] = cars['manufacturer'].apply(
    lambda x: x if x in frequent_manufacturers else 'Other'
)

# Optionally, display the counts for the new column to verify the transformation
print(cars['manufacturer_reduced'].value_counts())


manufacturer_reduced
ford             70497
chevrolet        54862
toyota           34131
Other            26121
honda            21206
nissan           19040
jeep             18975
ram              18160
Unknown          17341
gmc              16734
bmw              14692
dodge            13550
mercedes-benz    11801
hyundai          10300
subaru            9480
volkswagen        9324
kia               8440
lexus             8189
audi              7563
cadillac          6840
chrysler          6014
acura             5972
buick             5487
mazda             5407
infiniti          4789
Name: count, dtype: int64


**Data Preparation - Answer4**

1. **Encoding Categorical Variables** convert categorical variables into numerical representations.
2. **One-Hot Encoding:** For nominal categories with no intrinsic order
3. **Ordinal Encoding:** If there’s a natural order


1. Encoding Categorical Variables

In [None]:
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')
import os
os.chdir('/content/drive/My Drive/Berkeley/Unit11/practical_application_II')
dtype_dict = {
    'price': 'int32',
    'year': 'float32',
    'odometer': 'float32'
}
cars = pd.read_csv('data/vehicles.csv', dtype=dtype_dict)
cars.drop(columns=['VIN'], inplace=True)

# and take  a random 50% of the total rows of the dataset
cars_sample = cars.sample(frac=0.5, random_state=42)

# Create dummy variables for selected categorical columns
cols_to_encode = ['manufacturer', 'cylinders', 'fuel',
                      'transmission']

# categorical_cols = ['region', 'manufacturer', 'model', 'condition',
#                    'cylinders', 'fuel', 'title_status', 'transmission', 'drive', 'size', 'type', 'paint_color', 'state']
# cars_encoded = pd.get_dummies(cars_sample, columns=categorical_cols, drop_first=True)

cars_encoded = pd.get_dummies(cars_sample, columns=cols_to_encode, drop_first=True)

# Display the number of columns in the new DataFrame
num_new_columns = cars_encoded.shape[1]
print(f'The number of new columns created is: {num_new_columns}')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
The number of new columns created is: 67


2. One-Hot Encoding
Is done through pd.get_dummies() which creates a binary indicator or dummy for each category in the specified in columns=categorical_cols. I found that drop_first =True is a good practice to drop the first category to avoid multicollinearity when the encoded variables are used in linear regression. I learned that when I perform one-hot encoding, each categorical variable with K unique categories is transformed into K binary (dummy) variables. If I include all k
dummy variables along with an intercept in my linear regression model, there is an inherent linear dependency because for any observation the sum of these K dummy variables is always 1. This situation can lead to erroneous coefficient estimates.
The first attempt to one hot encode all the columns that looks having limited set of values, resulted in a crazy number of columns created 21199
So, I decided to reduce to the ones that as far I my knowledge goes, could affect the price. manufacturer, cylinders, transmision, and fuel reducing the number of columns to 67

3. Ordinal Encoding. To do Ordinal encoding I decided to look into each " categorical_cols"='region', 'manufacturer', 'model', 'condition','cylinders', 'fuel', 'title_status', 'transmission', 'VIN', 'drive', 'size', 'type', 'paint_color', 'state'] to find which columns have few unique values and if those values can be sorted numerically, then those values are good candidate to be ordinal and I will flag the column. After that I will inspect the column by myself to check if those are real ordinals. The following code try to do this exercise
After running this code I can conclude that columns:  condition, size, title_status, drive.  But, before to make the Ordinal Encoding I

In [None]:
import pandas as pd
# This is my list of candidate columns having ordinal values
Potential_ordinal_cols = ['region', 'condition','title_status','drive', 'size', 'type', 'paint_color', 'state']

def is_convertible_to_float(val):
    try:
        float(val)
        return True
    except (ValueError, TypeError):
        return False

def is_ordinal_candidate(series, threshold_for_unique_values=10, numeric_ratio_threshold=0.8):
    """
     checkin if the serie is a candidate for ordinal

    Parameters:
      series (pd.Series): The categorical column belonging to Potential_ordinal_cols
      threshold_for_unique_values (int): If the number of unique values is less than or equal
                              to this, we consider it as a potential candidate.
      numeric_ratio_threshold (float): If most (>= this fraction) of the unique values
                                       are convertible to numeric, consider it ordinal.

    Returns:
      bool: True if the series is a candidate for ordinal encoding, False otherwise.
    """
    # Drop missing values and get unique values
    unique_vals = series.dropna().unique()
    n_unique = len(unique_vals)

    # Heuristic 1: Very few unique values suggests a candidate.
    if n_unique <= threshold_for_unique_values:
        return True

    # Heuristic 2: Check if most unique values can be converted to float.
    numeric_count = sum(is_convertible_to_float(val) for val in unique_vals)
    if (numeric_count / n_unique) >= numeric_ratio_threshold:
        return True

    # Otherwise, not flagged as ordinal.
    return False

# Loop through each categorical column and flag potential ordinal candidates.
for col in Potential_ordinal_cols:
    candidate = is_ordinal_candidate(cars[col])
    n_unique = cars[col].nunique(dropna=True)
    if candidate:
        print(f"Column '{col}' (unique values: {n_unique}) ordinal encoding good candidate")
    else:
        print(f"Column '{col}' (unique values: {n_unique}) no ordinal.")



Column 'region' (unique values: 404) no ordinal.
Column 'condition' (unique values: 6) ordinal encoding good candidate
Column 'title_status' (unique values: 6) ordinal encoding good candidate
Column 'drive' (unique values: 3) ordinal encoding good candidate
Column 'size' (unique values: 4) ordinal encoding good candidate
Column 'type' (unique values: 13) no ordinal.
Column 'paint_color' (unique values: 12) no ordinal.
Column 'state' (unique values: 51) no ordinal.


Before to do ordinal encoding to all candidates I want to see the real values to see if make sense or not


In [None]:
columns_to_check = ['condition', 'title_status', 'drive', 'size']

for col in columns_to_check:
    # Drop missing values, get unique values, and sort them
    unique_vals = sorted(cars[col].dropna().unique())
    print(f"Unique values in '{col}' (alphabetically sorted):")
    print(unique_vals)
    print()

Unique values in 'condition' (alphabetically sorted):
['excellent', 'fair', 'good', 'like new', 'new', 'salvage']

Unique values in 'title_status' (alphabetically sorted):
['clean', 'lien', 'missing', 'parts only', 'rebuilt', 'salvage']

Unique values in 'drive' (alphabetically sorted):
['4wd', 'fwd', 'rwd']

Unique values in 'size' (alphabetically sorted):
['compact', 'full-size', 'mid-size', 'sub-compact']



The ordinals I see are ***condition***  ***size***  and ***title_status***

1. ***condition*** =  "salvage" < "fair" < "good" < "excellent" < "like new" < "new"  and
2. ***size*** = "compact" < "midsize" < "fullsize"
3. ***title_status***  = "salvage" < "rebuilt" < "clean"

Now I can create a dictionary with the order of those ordinals to create ordinal encoded columns in the following code

In [None]:
My_ordinal_mapping = {
    'condition': {
        'salvage': 1,
        'fair': 2,
        'good': 3,
        'excellent': 4,
        'like new': 5,
        'new': 6
    },
    'size': {
        'compact': 1,
        'midsize': 2,
        'fullsize': 3
    },
    'title_status': {
        'salvage': 1,
        'rebuilt': 2,
        'clean': 3
    }
}

# Print the ordinal mapping for each specified column
for col, mapping in My_ordinal_mapping.items():
    print(f"Ordinals for '{col}':")
    for category, ordinal_value in mapping.items():
        print(f"  {category}: {ordinal_value}")
    print()

Ordinals for 'condition':
  salvage: 1
  fair: 2
  good: 3
  excellent: 4
  like new: 5
  new: 6

Ordinals for 'size':
  compact: 1
  midsize: 2
  fullsize: 3

Ordinals for 'title_status':
  salvage: 1
  rebuilt: 2
  clean: 3



Now I can create new columns with those ordinals:

In [None]:
cars['condition_ordinal'] = cars['condition'].map(My_ordinal_mapping['condition'])
cars['size_ordinal'] = cars['size'].map(My_ordinal_mapping['size'])
cars['title_status_ordinal'] = cars['title_status'].map(My_ordinal_mapping['title_status'])

**Data Preparation - Answer5**


1. Scaling and Normalization.  For machine learning algorithms based on distance metrics or regularization is recommended to use scale numerical features using StandardScaler from sklearn.preprocessing lib

2. Pipeline Integration.After doing a lot of transformations, I need to create a pipeline to integrate all the preprocessing steps and then fits a model. This will ensures that all my transformations are applied consistently during cross-validation and on new data. very important ! :-)


1. Scaling & Normalization

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Define numerical columns (including transformed features)
numerical_cols = ['price', 'year', 'odometer_log', 'age']

# Build a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        # I should add a transformer for categorical variables here ( If I have time :-)
    ],
    remainder='passthrough'
)

2. Pipeline: All the transformation following the CRISP-DM methodology.

In [None]:
from google.colab import drive
drive.mount('/content/drive')
import os
os.chdir('/content/drive/My Drive/Berkeley/Unit11/practical_application_II')

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np
import datetime
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.compose import ColumnTransformer


# Because I got Session Crashed after using all available RAM
# I decided to drop the column VIN ( I don't think the VIN number has any relevant importance predicting the price )
# and used dtype_dict
dtype_dict = {
    'price': 'int32',
    'year': 'float32',
    'odometer': 'float32'
}
cars = pd.read_csv('data/vehicles.csv', dtype=dtype_dict)
cars.drop(columns=['VIN'], inplace=True)

In [None]:
import pandas as pd
import numpy as np
import datetime
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.compose import ColumnTransformer

# Because I got Session Crashed after using all available RAM
# I decided to drop the column VIN ( I don't think the VIN number has any relevant importance predicting the price )
# and used dtype_dict
dtype_dict = {
    'price': 'int32',
    'year': 'float32',
    'odometer': 'float32'
}
cars = pd.read_csv('data/vehicles.csv', dtype=dtype_dict)
cars.drop(columns=['VIN'], inplace=True)

# and take  a random 50% of the total rows of the dataset
cars_sample = cars.sample(frac=0.5, random_state=42)

numerical_cols = ['price', 'year', 'odometer']  # Before log transformation
# Build a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        # I could add more and more transformers for categorical variables here as needed,but no time :-)
    ],
    remainder='passthrough'
)


# ========= Step 1: Outlier Detection and Removal =========
def remove_outliers_iqr(df, numeric_cols=['price', 'year', 'odometer'], factor=1.5):
    """
    Removing those rows from the numeroc colums with outlier values using the IQR method.

    Parameters:
      df (pd.DataFrame): Input DataFrame.
      numeric_cols (list): List of numeric columns to check.
      factor (float): Multiplier for the IQR to define outlier thresholds.

    Returns:
      pd.DataFrame: DataFrame with outliers removed.
    """
    df = df.copy()
    for col in numeric_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - factor * IQR
        upper_bound = Q3 + factor * IQR
        df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
    return df

# ========= Step 2: Missing Value Imputation =========
def impute_missing_values(df):
    df = df.copy()
    # For numeric columns, fill missing values with the median
    numeric_cols = ['year', 'odometer', 'price']
    for col in numeric_cols:
        df[col] = df[col].fillna(df[col].median())

    # For categorical columns, fill missing values with 'Unknown'
    # Not considering state .. I don't think the state could make any difference
    categorical_cols = ['region', 'manufacturer', 'model', 'condition',
                        'cylinders', 'fuel', 'title_status', 'transmission',
                        'drive', 'size', 'type', 'paint_color']
    for col in categorical_cols:
        df[col] = df[col].fillna('Unknown')
        df['year'] = df['year'].astype('int32')

    return df

# ========= Step 3: Outlier Treatment and Odometer Transformation =========
def process_odometer(df):
    df = df.copy()
    # Cap the odometer values at the 1st and 99th percentiles
    lower_bound = df['odometer'].quantile(0.01)
    upper_bound = df['odometer'].quantile(0.99)
    df['odometer_capped'] = df['odometer'].clip(lower=lower_bound, upper=upper_bound)

    # Log-transform the capped odometer to reduce right skewness (use log1p to handle zero values)
    df['odometer_log'] = np.log1p(df['odometer_capped'])
    return df

# ========= Step 4: Feature Engineering =========
def feature_engineering(df):
    df = df.copy()
    # Create a new feature: car age (current year - production year)
    current_year = datetime.datetime.now().year
    df['age'] = current_year - df['year'].astype(int)

    # Create price per mile; avoid division by zero by replacing zero with NaN first
    # instead of remove the whole row to preserv potential valid data. Like new vehicles.
    # can introduce some bias in my analysis...
    # Create price per mile; handle potential division by zero by filling with NaN first
    df['price_per_mile'] = df['price'] / df['odometer']
    # Replace inf with NaN
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    # Fill NaN values with median of price_per_mile column
    df['price_per_mile'].fillna(df['price_per_mile'].median(), inplace=True)

    return df

# ========= Step 5: Reduce Categories for Manufacturer =========
def reduce_manufacturer(df, threshold=0.01):
  # I tested several values for the threshold starting with 0.05, I realized 0.01 gives me fair list of
  # manufacturers. O.05 only gaveme Toyota Chevrolet and Ford
    df = df.copy()
    # Calculate the relative frequency of each manufacturer
    freq = df['manufacturer'].value_counts(normalize=True)
    # Identify manufacturers that represent at least 'threshold' of observations
    frequent = freq[freq >= threshold].index
    # Create a new column where infrequent manufacturers are replaced with 'Other'
    df['manufacturer_reduced'] = df['manufacturer'].apply(
        lambda x: x if x in frequent else 'Other'
    )
    return df

# ========= Step 6: Ordinal Encoding =========
def ordinal_encoding(df):
    df = df.copy()
    # Based in previous analysis These are the ordinal mappings..
    condition_mapping = {
        'salvage': 1,
        'fair': 2,
        'good': 3,
        'excellent': 4,
        'like new': 5,
        'new': 6,
        'Unknown': np.nan
    }
    title_status_mapping = {
        'salvage': 1,
        'rebuilt': 2,
        'clean': 3,
        'Unknown': np.nan
    }
    size_mapping = {
        'compact': 1,
        'midsize': 2,
        'fullsize': 3,
        'Unknown': np.nan
    }

    df['condition_ordinal'] = df['condition'].map(condition_mapping)
    df['title_status_ordinal'] = df['title_status'].map(title_status_mapping)
    df['size_ordinal'] = df['size'].map(size_mapping)

    return df

# ========= Step 7: One-Hot Encoding for Remaining Categorical Variables =========
def one_hot_encoding(df):
    df = df.copy()
    # List the remaining categorical columns to one-hot encode.
    # Exclude columns that have already been ordinal-encoded or transformed.
    cols_to_encode = ['manufacturer', 'cylinders', 'fuel',
                      'transmission']
    df = pd.get_dummies(df, columns=cols_to_encode, drop_first=True)
    return df

# ========= Full Preprocessing Function =========
def full_preprocessing(df):

    # Step 0: Apply ColumnTransformer for scaling (NEW STEP)
    df_transformed = preprocessor.fit_transform(df) # return a numpy array
    # ----> Convert back to DataFrame (important!)
    remainder_columns = [col for col in df.columns if col not in numerical_cols]
    all_columns = numerical_cols + remainder_columns
    df = pd.DataFrame(df_transformed, columns=all_columns)

    df = pd.DataFrame(df, columns=numerical_cols + list(df.columns[len(numerical_cols):]))
    # Step 1: Remove outliers using the IQR method
    df = remove_outliers_iqr(df, numeric_cols = numerical_cols, factor=1.5)
    # Step 2: Impute missing values
    df = impute_missing_values(df)
    # Step 3: Process the odometer variable (cap and log-transform)
    df = process_odometer(df)
    # Step 4: Create additional features (age, price per mile)
    df = feature_engineering(df)
    # Step 5: Reduce manufacturer categories
    df = reduce_manufacturer(df)
    # Step 6: Apply ordinal encoding on select columns
    df = ordinal_encoding(df)
    # Step 7: One-hot encode the remaining categorical variables
    df = one_hot_encoding(df)
    return df

# ========= Create the Pipeline =========
preprocessing_pipeline = Pipeline(steps=[
    ('full_preprocessing', FunctionTransformer(full_preprocessing))
])

# ========= Apply the Pipeline =========
# need to load the sample 50% to avoid memory RAM exhausting
# cars = pd.read_csv('data/vehicles.csv')

# Run the full preprocessing pipeline with the 50% of the full date set to avoid memory issues
cars_preprocessed = preprocessing_pipeline.transform(cars_sample)

# Display the first few rows of the preprocessed data
print(cars_preprocessed.head())



  df[col] = df[col].fillna(df[col].median())
  df[col] = df[col].fillna(df[col].median())
  df[col] = df[col].fillna(df[col].median())
  df.replace([np.inf, -np.inf], np.nan, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['price_per_mile'].fillna(df['price_per_mile'].median(), inplace=True)


      price  year  odometer          id                  region  \
0  0.003609     0 -0.270731  7315883828                lakeland   
3 -0.000587     0 -0.024469  7312663807      northern panhandle   
4 -0.003020     0 -0.230290  7315368523                  eugene   
5 -0.002642     0  0.586199  7309863303  waterloo / cedar falls   
6 -0.000351     0 -0.277230  7315163492                 jackson   

                   model  condition title_status    drive     size  ...  \
0  f150 super cab lariat       good        clean      4wd  Unknown  ...   
3                   328i    Unknown        clean  Unknown  Unknown  ...   
4            suburban ls    Unknown        clean  Unknown  Unknown  ...   
5           town country  excellent        clean      fwd  Unknown  ...   
6        outlander sport  excellent        clean      fwd  Unknown  ...   

  cylinders_Unknown cylinders_other fuel_diesel  fuel_electric  fuel_gas  \
0             False           False       False          False      Tr

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Now I need to create training data set and test dataset
# Separate features (X) and target (y)
X = cars_preprocessed.drop("price", axis=1)
y = cars_preprocessed["price"]

# Split the data: 80% training, 20% testing (adjust test_size as needed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Optionally, print the shapes to verify the split
print("Training set shape:", X_train.shape, y_train.shape)
print("Testing set shape:", X_test.shape, y_test.shape)

Training set shape: (157738, 77) (157738,)
Testing set shape: (39435, 77) (39435,)


### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

Linear Regression
No hyperparameters
I'm going to use it as a base line

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.impute import SimpleImputer

# Model 1: Simple model
model1_features = ['year', 'odometer_log', 'age', 'price_per_mile']

#Create SimpleImputer instance: Created an instance of SimpleImputer with strategy='median'.
# This will replace NaN values with the median of the respective column, I could use main as well
imputer = SimpleImputer(strategy='median')

# Model 2: Extended model with the ordinal encoded columns.
model2_features = model1_features + ['condition_ordinal', 'title_status_ordinal', 'size_ordinal']

# Model 3: Full model using all features (all columns in X_train)
model3_features = X_train.columns.tolist()

# --- Build, fit, and evaluate the models ---

MyResults = {}  # to store evaluation metrics for each model

# Model 1
model1 = LinearRegression()

# Impute NaN values in X_train and X_test with the median (or another strategy) before fitting
# For each column in model1_features, fill NaN with the median of that column in X_train
for feature in model1_features:
    X_train[feature].fillna(X_train[feature].median(), inplace=True)
    X_test[feature].fillna(X_train[feature].median(), inplace=True)

model1.fit(X_train[model1_features], y_train)
pred1 = model1.predict(X_test[model1_features])
mse1 = mean_squared_error(y_test, pred1)
mae1 = mean_absolute_error(y_test, pred1)
MyResults['Model 1'] = {'Features': model1_features, 'MSE': mse1, 'MAE': mae1}

# Model 2
model2 = LinearRegression()
# use the imputer
X_train_model2 = pd.DataFrame(imputer.fit_transform(X_train[model2_features]), columns=model2_features, index=X_train.index)
X_test_model2 = pd.DataFrame(imputer.transform(X_test[model2_features]), columns=model2_features, index=X_test.index)

model2.fit(X_train_model2, y_train)
pred2 = model2.predict(X_test_model2)

mse2 = mean_squared_error(y_test, pred2)
mae2 = mean_absolute_error(y_test, pred2)
MyResults['Model 2'] = {'Features': model2_features, 'MSE': mse2, 'MAE': mae2}

# Model 3  -------------------------------------------------------------------------------------
# the full linear regression give me errors ValueError: Cannot use median strategy with non-numeric data: could not convert string to float:
# need more time to debug
# model3 = LinearRegression()
#I got this error The error "ValueError: Cannot use median strategy with non-numeric data: could not convert string to float:
#I have to identify numerocal features in  X_train[model3_features] using select dtypes(include=np.number) saving in numerical_features_model
# Select only numerical features for imputation
# numerical_features_model3 = X_train[model3_features].select_dtypes(include=np.number).columns.tolist()

# Apply imputation only to numerical features
# X_train_model3_num = pd.DataFrame(imputer.fit_transform(X_train[numerical_features_model3]),
#                                  columns=numerical_features_model3, index=X_train.index)
#X_test_model3_num = pd.DataFrame(imputer.transform(X_test[numerical_features_model3]),
#                                 columns=numerical_features_model3, index=X_test.index)

# X_train_model3 = pd.concat([X_train_model3_num, X_train[model3_features].select_dtypes(exclude=np.number)], axis=1)
# X_test_model3 = pd.concat([X_test_model3_num, X_test[model3_features].select_dtypes(exclude=np.number)], axis=1)
# Previous code with error : X_train_model3 = pd.DataFrame(imputer.fit_transform(X_train[model3_features]), columns=model3_features, index=X_train.index)
# Previous code with error : X_test_model3 = pd.DataFrame(imputer.transform(X_test[model3_features]), columns=model3_features, index=X_test.index)
# model3.fit(X_train_model3, y_train)
# pred3 = model3.predict(X_test_model3)
# mse3 = mean_squared_error(y_test, pred3)
# mae3 = mean_absolute_error(y_test, pred3)
# MyResults['Model 3'] = {'Features': model3_features, 'MSE': mse3, 'MAE': mae3}

# --- Print the evaluation metrics for each model ---
for model_name, res in MyResults.items():
    print(f"\n{model_name} using features: {res['Features']}")
    print(f"  Mean Squared Error: {res['MSE']:.2f}")
    print(f"  Mean Absolute Error: {res['MAE']:.2f}")

### --->>> NEED TO APPLY SEQUENTIAL FEATURE SELECTION w9.2 w9.3 w9.4 Rdge Model <<<< -------------

### ->>. GridSearchCV Best Alphaiterating over alphas


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  X_train[feature].fillna(X_train[feature].median(), inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  X_test[feature].fillna(X_train[feature].median(), inplace=True)



Model 1 using features: ['year', 'odometer_log', 'age', 'price_per_mile']
  Mean Squared Error: 129011559.82
  Mean Absolute Error: 8729.75

Model 2 using features: ['year', 'odometer_log', 'age', 'price_per_mile', 'condition_ordinal', 'title_status_ordinal', 'size_ordinal']
  Mean Squared Error: 127873911.88
  Mean Absolute Error: 8658.76


2 Ridge Regression
Parameter = alpha  but the range should be in log scale 10^-3 to 10^3  

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error
from scipy.stats import uniform

features_Ridge = ['year', 'odometer_log', 'age', 'price_per_mile']
# Option 1 using GridSearchCV to tune it
# Define Ridge estimator with a random state equal 42 to repro acn compare
# plan to tuse auto and svd as the algorithms to solve the best coef,  I believe we
# have some correlation between variables in the dataset.
MyRidge = Ridge(random_state=42)
MyGridSearchCV_params = {
    'alpha': [0.001, 0.01, 0.1, 1, 10, 100],
    'solver':['auto', 'svd'],
    'max_iter':[None,100,1000]
}
# Set up GridSearchCV with 5-fold cross-validation and negative MSE as the scoring metric
grid_search = GridSearchCV(
    estimator= MyRidge,
    param_grid= MyGridSearchCV_params,
    scoring='neg_mean_squared_error',
    cv=5,
    n_jobs=-1
)
grid_search.fit(X_train[features_Ridge], y_train)

# Retrieve the best Ridge estimator found by GridSearchCV
best_ridge_grid = grid_search.best_estimator_
print("Best parameters from GridSearchCV:", grid_search.best_params_)

# Evaluate the best estimator on the test set
prediction_grid = best_ridge_grid.predict(X_test[features_Ridge])

mse_grid = mean_squared_error(y_test, prediction_grid)
mae_grid = mean_absolute_error(y_test, prediction_grid)
print("GridSearchCV Ridge - Test MSE:", mse_grid)
print("GridSearchCV Ridge - Test MAE:", mae_grid)


# Using RandomizeSearchCV to compare with GridSearchCV
# let's see :-)  using the same params
GridSearchCV_params = {
    'alpha': [0.001, 0.01, 0.1, 1, 10, 100],
    'solver':['auto', 'svd'],
    'max_iter':[None,100,1000]
}
MyRandom_search = RandomizedSearchCV(
    estimator= MyRidge,
    param_distributions= GridSearchCV_params,
    scoring='neg_mean_squared_error',
    cv=5,
    n_jobs=-1,
    n_iter=10,
    random_state=42
)

MyRandom_search.fit(X_train[features_Ridge], y_train)
best_ridge_random = MyRandom_search.best_estimator_
print("Best parameters from RandomizedSearchCV:", MyRandom_search.best_params_)

prediction_MyRandom_search = best_ridge_random.predict(X_test[features_Ridge])
print("MyRandomizedSearchCV Ridge -Test MSE:",mean_squared_error(y_test, prediction_MyRandom_search))
print("MyRandomizedSearchCV Ridge -Test MAE:",mean_absolute_error(y_test, prediction_MyRandom_search))





Best parameters from GridSearchCV: {'alpha': 1, 'max_iter': None, 'solver': 'auto'}
GridSearchCV Ridge - Test MSE: 4.464198509715481e-06
GridSearchCV Ridge - Test MAE: 0.0016075500395790945
Best parameters from RandomizedSearchCV: {'solver': 'svd', 'max_iter': 100, 'alpha': 1}
MyRandomizedSearchCV Ridge -Test MSE: 4.4641985097154785e-06
MyRandomizedSearchCV Ridge -Test MAE: 0.0016075500395791073


3 Lasson Regression

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error
from scipy.stats import uniform

# Define the feature set (using the same features as for Ridge)
MyFeatures_Lasso = ['year', 'odometer_log', 'age', 'price_per_mile']

# ---------------------------
# Option 1: Using GridSearchCV for Lasso

# Initialize Lasso estimator with a fixed random_state for reproducibility
my_lasso = Lasso(random_state=42)

# Define the parameter grid for GridSearchCV.
# Note: Lasso does not have a 'solver' parameter like Ridge, but it offers 'selection' (cyclic or random)
param_grid_lasso = {
    'alpha': [0.001, 0.01, 0.1, 1, 10, 100],
    'max_iter': [1000, 5000],
    'tol': [0.0001, 0.001, 0.01],
    'selection': ['cyclic', 'random']
}

# Set up GridSearchCV with 5-fold cross-validation
lasso_grid_search = GridSearchCV(
    estimator=my_lasso,
    param_grid=param_grid_lasso,
    scoring='neg_mean_squared_error',
    cv=5,
    n_jobs=-1
)

# Fit GridSearchCV on the training data
lasso_grid_search.fit(X_train[MyFeatures_Lasso], y_train)

# Retrieve the best Lasso estimator found
best_lasso_grid = lasso_grid_search.best_estimator_
print("Best parameters from GridSearchCV (Lasso):", lasso_grid_search.best_params_)

# Evaluate the best estimator on the test set
pred_lasso_grid = best_lasso_grid.predict(X_test[MyFeatures_Lasso])
mse_lasso_grid = mean_squared_error(y_test, pred_lasso_grid)
mae_lasso_grid = mean_absolute_error(y_test, pred_lasso_grid)
print("GridSearchCV Lasso - Test MSE:", mse_lasso_grid)
print("GridSearchCV Lasso - Test MAE:", mae_lasso_grid)

# ---------------------------
# Option 2: Using RandomizedSearchCV for Lasso

# Define parameter distributions for RandomizedSearchCV.
# For 'alpha', we use a uniform distribution.  Let's see how it works .. :-)
param_dist_lasso = {
    'alpha': uniform(0.001, 100),
    'max_iter': [1000, 5000],
    'tol': [0.0001, 0.001, 0.01],
    'selection': ['cyclic', 'random']
}

lasso_random_search = RandomizedSearchCV(
    estimator=my_lasso,
    param_distributions=param_dist_lasso,
    scoring='neg_mean_squared_error',
    cv=5,
    n_jobs=-1,
    n_iter=10,  # reduce iterations to speed up the search
    random_state=42
)

# Fit RandomizedSearchCV on the training data
lasso_random_search.fit(X_train[MyFeatures_Lasso], y_train)

# Retrieve the best Lasso estimator found
best_lasso_random = lasso_random_search.best_estimator_
print("Best parameters from RandomizedSearchCV (Lasso):", lasso_random_search.best_params_)

# Evaluate the best estimator on the test set
pred_lasso_random = best_lasso_random.predict(X_test[MyFeatures_Lasso])
mse_lasso_random = mean_squared_error(y_test, pred_lasso_random)
mae_lasso_random = mean_absolute_error(y_test, pred_lasso_random)
print("RandomizedSearchCV Lasso - Test MSE:", mse_lasso_random)
print("RandomizedSearchCV Lasso - Test MAE:", mae_lasso_random)


Best parameters from GridSearchCV (Lasso): {'alpha': 0.001, 'max_iter': 1000, 'selection': 'cyclic', 'tol': 0.0001}
GridSearchCV Lasso - Test MSE: 5.7222877858253145e-06
GridSearchCV Lasso - Test MAE: 0.001990068505051068
Best parameters from RandomizedSearchCV (Lasso): {'alpha': 37.455011884736244, 'max_iter': 1000, 'selection': 'cyclic', 'tol': 0.01}
RandomizedSearchCV Lasso - Test MSE: 5.7222877858253145e-06
RandomizedSearchCV Lasso - Test MAE: 0.001990068505051068


### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

#### Evaluation Answer 1

Both models perform very similar, but the Ridge model shows a slight advantage based on error metrics:

Ridge Regression:
Test MSE: ~4.464e-06
Test MAE: ~0.001608
Lasso Regression:
Test MSE: ~5.722e-06
Test MAE: ~0.001990



#### Evaluation Answer 2####

1. Model Performance: The Ridge model achieved lower MSE and MAE values compared to Lasso. Maybe Ridge might be better at capturing the underlying relationships without introducing too much bias with a acceptable speed

2. Regularization Differences: Ridge Regression applies L2 regularization which tends to shrink coefficients but rarely zeroes them out. Lasso Regression applies L1 regularization, However, If most of the  features are useful, the sparsity induced by Lasso may not be as beneficial and could even hurt performance slightly. But at this stage I'm not sure because I haven't had the time to do a more complete analysis. I think that is something I should do.

3. Hyperparameter:
Both GridSearchCV and RandomizedSearchCV converged to similar performance metrics within each model, suggesting that the hyperparameter search was good.
For Ridge, both searches pointed to alpha = 1
For Lasso, although the best parameters differ between GridSearchCV and RandomizedSearchCV, the performance remained the same, indicating that several hyperparameter combinations may yield similar outcomes.

4. Choosing the Best Model
Given the slightly lower error metrics, Ridge Regression looks like the best for this tasks. However, since the differences are relatively small, it’s also worth considering other factors and the potential need for more feature selection.

#### Evaluation Answer 3 ####

What earlier adjustments can be done in the Data Preparation could be adjusted to get better results

1. Refine Outlier Detection: Play more with IQR Factor: Instead of using a fixed factor (e.g., 1.5) for all variables, I could consider adjusting it per variable based on their distributions.

2. Enhanced Missing Value Imputation: Iterative Imputation or Separate Separate Imputation Strategies: Tailor imputation methods for different features. For instance, I coud use mode imputation for categorical features and mean/median (or even a predictive model) for numerical features.

3. Feature Transformation and Scaling: Target and Feature Transformation:
Apply logarithmic transformations to more variables (e.g., price) to stabilize variance and make the distributions more normal I only use log transformation to odometer

4. More Correlation Analysis.  With more time I could examine multicollinearidad between predictors to remove or combine them in new features



### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.

### Used Car Sales Optimization : A Data-Driven Approach###

**Introduction**

In today’s competitive market, understanding what drives used car prices is crucial to managing your inventory effectively. My recent analysis leverages data science techniques to help you optimize your vehicle selection and pricing strategy. By identifying the key factors that influence car values, you can make better purchasing decisions, adjust pricing appropriately, and ultimately improve your profit margins.

**What I analyzed**

I examined a large dataset containing detailed information on hundreds of thousands of used (and new) cars. Key variables in my analysis included:

Year: The production year of the vehicle, which helps determine its age.
Mileage (Odometer): Total miles driven, indicating wear and tear.
Price per Mile: A derived metric that divides the car’s price by its mileage, offering insight into relative value.
Other Factors: Additional features such as the vehicle's condition and manufacturer were also considered during the analysis.

**The Modeling Process**

To predict used car prices accurately, I applied two types of regression models:
Ridge Regression: This model applies a technique called L2 regularization, which shrinks the impact of each variable in a balanced way. My results showed that Ridge Regression performed very well, suggesting that all features in your dataset contribute valuable information to the final price.
Lasso Regression: Lasso uses L1 regularization, which can effectively “turn off” less important features. However, my analysis found that while Lasso was also effective, Ridge Regression slightly outperformed it, indicating that none of the features were completely redundant in predicting car prices.

**Key Findings from the Models**

**Ridge Regression Results:**

Mean Squared Error (MSE): 4.46e-06
Mean Absolute Error (MAE): 0.00161

**Lasso Regression Results:**
MSE: 5.72e-06
MAE: 0.00199

The lower error rates in Ridge Regression mean that, on average, its predictions are closer to the actual sale prices. This suggests that when all factors are considered, each one plays a role in determining the car’s value.


**How This Helps You Fine-Tune your Inventory**

1. Accurate Pricing Based on Key Features
Vehicle Age: Newer models tend to hold their value better. Use the car’s production year and calculated age to adjust pricing.
Mileage and Price per Mile: Lower mileage often indicates a higher value per mile. By comparing similar vehicles, you can identify bargains or overpriced listings.
2. Data-Driven Purchasing Decisions
Focus on Well-Rounded Vehicles: Since Ridge Regression shows that all measured features contribute to the final price, consider a holistic approach when evaluating potential purchases. Vehicles that score well across multiple factors are likely to be strong investments.
3. Inventory and Pricing Adjustments
Dynamic Pricing: Use these insights to set competitive prices that reflect the true market value, reducing the time vehicles spend on your lot.
Targeted Inventory Acquisition: Knowing which factors have the greatest impact on price can help you identify underpriced vehicles with potential for profit after refurbishment or market correction.
4. Continuous Improvement
Refining Data Collection: The more accurate and detailed your data (e.g., precise mileage, detailed condition reports), the more reliable your pricing models will be.
Regular Reassessment: The market is always evolving. Regularly updating your models with fresh data ensures that your inventory strategies remain aligned with current market trends.
Conclusion
Using data science, you can gain a competitive edge in the used car market. This analysis shows that a comprehensive approach—taking into account vehicle age, mileage, and other key features—provides a more accurate picture of a car’s value. In my study, Ridge Regression emerged as the more reliable tool for predicting prices, which suggests that a balanced consideration of all available features is beneficial.

###My Recommendations###

####Long term####

Adopt data-driven pricing: Adjust your pricing strategy based on the factors that truly impact a car’s value.
Enhance your data collection: Gather detailed and accurate information on each vehicle.
Continuously review and update models: As market conditions change, so should your models and strategies.



####Short Term####

1. **Daily in Store  Inventory Assessment**  Morning Routine:
Run the Model on Current Inventory:
What to Do: Every morning, update your inventory data with the latest details (e.g., production year, mileage, condition).
How to Do It: Use the Python script you’ve built to feed your current inventory through the Ridge model. This model will output a predicted price for each vehicle.
Compare Predictions to Listed Prices:
What to Look For: Create a report that shows both the predicted price and the current listing price.
How to Act:
Overpriced Vehicles: If a car’s listing price is significantly above the predicted price, consider lowering the price or running a special promotion.
Underpriced Vehicles: If the listing price is lower than the predicted price, you might have room to increase the price—or if it’s selling too quickly, consider buying more of that model.

2. **Pricing Adjustments**
In-Store and Online:
Adjust Listing Prices:
Action Step: Based on the morning report, update your online listings and in-store price tags.
Tip: Create a pricing band (e.g., within ±5% of the predicted price) as a target range for consistency.
Promotional Strategies:
Action Step: For vehicles that are overpriced compared to the model, consider running flash sales or discounts to move inventory faster.
Tip: Use the model’s confidence in its predictions as a guide—if the prediction error is low, you can be more aggressive with pricing changes.

3. **Inventory Optimization**
Ongoing Inventory Review:
Identify High-Potential Vehicles:
Action Step: Use your model to flag vehicles that show a high “value per mile” or have lower mileage with a good production year.
How to Act: Prioritize these vehicles in marketing efforts or consider expanding your stock of similar models.
Evaluate Underperformers:
Action Step: For vehicles with a low predicted price relative to market demand, plan targeted promotions (e.g., “quick sale” discounts).
Tip: Keep a record of these vehicles to review if repeated adjustments are needed or if they should be phased out in future orders.

4. **Weekly/Monthly  Analysis and Continuous Improvement**
End-of-Week and End-of-Month Review:
Analyze Sales Performance:
Action Step: At the end of each week, compare the actual sale prices with the model’s predictions.
How to Use: Identify trends—are vehicles sold closer to the predicted price? Are there any particular changes leading to faster sales? Enrich the model with those features
Adjust Procurement Decisions:
Action Step: Based on weekly/monthly performance, refine your buying strategy:
Increase orders for high-performing models.
Reduce or discount models that consistently underperform.
Feedback Loop:
Action Step: Update your models with the latest sales data. This helps improve prediction accuracy over time.
5. **Training and Maintenance**
Keep Your Team Trained and Informed:
Ongoing Training:
Action Step: Schedule periodic training sessions on how to interpret the model outputs and dashboard metrics.
How to Act: Share best practices and recent insights from your weekly reviews to empower your team in making data-driven decisions.
Tool Updates:
Action Step: Maintain and update your data pipeline. Ensure that new inventory data is incorporated in real time and that your models are re-trained as market conditions evolve
Keep adding new features as you discover they could affect the sales.
Call Data scientist to update the model  
