# What drives the price of a car?

![](images/kurt.jpeg)

In [None]:
from warnings import filterwarnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder

filterwarnings('ignore')


**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing. Your goal is to understand what factors make a car more or less expensive. As a result of your analysis, you should provide clear recommendations to your client, a used car dealership, as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src="images/crisp.png" width="50%"/>
</center>


To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

### Data Problem Definition:  

The goal is to identify and model the relationships between used car prices and relevant predictor variables (also known as features) from historical sales data. This involves selecting, extracting, transforming, and analyzing a dataset containing information on used cars, including price, make, model, year, mileage, condition, trim level, location, and other relevant attributes, to develop a predictive model that captures the key drivers of used car prices. 

In this formulation: 

- **Predictor variables** refer to the input features (e.g., make, model, year) that may influence used car prices.
- **Target variable** is the outcome or response we're interested in predicting (used car price).
- **Goal** is to develop a predictive model that can accurately estimate used car prices based on the predictor variables.
     

This reframed task aligns with the CRISP-DM methodology, which emphasizes a structured approach to data science projects. 

### 1 Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

### 1.1 Load the data

The following code loads the data into a pandas DataFrame and then provides a high-level view of the columns and column types.

In [None]:
vehicles_df = pd.read_csv('data/vehicles.csv', low_memory=False)
vehicles_df.info()

In [None]:
vehicles_df.sample(5)

### Check for missing data
The following code counts the number of rows in the DataFrame `vehicles_df` that contain at least one missing (NaN) value.

In [None]:
nan_rows = vehicles_df.isnull().any(axis=1).sum()
print(f'There are {nan_rows} rows with at least one missing value.')
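
A single row count does not show which columns are responsible for the missing data. A minimal sketch of a per-column report follows; the helper name `missing_value_report` and the toy DataFrame are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_value_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, missing counts, and cardinality for every column."""
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "pct_missing": (df.isnull().mean() * 100).round(1),
        "n_unique": df.nunique(),
    })
    # Columns with the largest share of missing values come first
    return report.sort_values("pct_missing", ascending=False)


# Hypothetical toy data standing in for vehicles_df
toy_df = pd.DataFrame({
    "price": [4500, 0, 12000, 8000],
    "condition": ["good", None, None, "fair"],
    "odometer": [120000.0, 85000.0, np.nan, 60000.0],
})
report = missing_value_report(toy_df)
print(report)
```

Calling `missing_value_report(vehicles_df)` on the real data would show which columns are candidates for dropping or imputation.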

### Data Preparation

After our initial exploration and fine tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 

Remove the columns and rows that are not needed, and correct some manufacturer errors.

In [None]:
# Function to process the vehicle dataset
def process_vehicle_data(df):
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()

    # Handle missing values in 'cylinders' and 'drive'
    df['cylinders'] = df['cylinders'].fillna('unknown')
    df['drive'] = df['drive'].fillna('unknown')

    # Drop irrelevant columns
    columns_to_drop = ['VIN', 'size', 'Unnamed: 18']
    df = df.drop(columns=columns_to_drop, errors='ignore')

    # Remove rows with missing essential data and zero price
    essential_columns = ['price', 'year', 'manufacturer', 'model']
    df = df.dropna(subset=essential_columns)
    df = df[df['price'] != 0].copy()

    # Remove specific manufacturers
    df = df[df['manufacturer'].str.lower() != 'harley-davidson']

    # Replace specific manufacturer names
    df['manufacturer'] = df['manufacturer'].replace({
        'rover': 'land rover',
        'mini': 'bmw'
    })

    # Clean and correct model names
    df = clean_and_correct_model_names(df, redundant_words=[
        '4x4', 'sedan', 'suv', 'coupe', 'hatchback', 'convertible', 'pickup', 'drw', 'benz', '4wd',
        'wagon', 'cab', 'hd', 'crew', 'extended', 'utility', '2d', '4d', 'crew', 'sport', 'unlimited',
        'connect', 'black', 'white', 'luxury', 'v6', 'all', 'new', 'lt', 'xlt', 'lx', 'xle', 'grand',
        'limited', 'sr', 'big', 'horn', 'r/t', 'ltz', 'super', 'duty', 'ss', 'se', 'xl', 'gt', 'premium',
        'st', 'ls', 'hard', 'top', '2.5i', 'regular', '2.5', 'le', 'exl', 'double', 'doub'
    ])

    # Encode categorical features
    df = encode_categorical_features(df)

    # Clean and fill cylinders and drive columns
    df = clean_and_fill_cylinders_drive(df)

    # Create vehicle categories
    df = create_vehicle_categories(df, type_mapping={
        'pickup': 'Truck/Van', 'truck': 'Truck/Van', 'van': 'Truck/Van', 'mini-van': 'Truck/Van',
        'coupe': 'Car', 'sedan': 'Car', 'hatchback': 'Car', 'convertible': 'Car', 'wagon': 'Car',
        'SUV': 'SUV/Offroad', 'offroad': 'SUV/Offroad', 'bus': 'Commercial/Other',
        'other': 'Commercial/Other', 'unknown': 'Commercial/Other'
    })

    # Apply one-hot encoding
    df = apply_one_hot_encoding(df, ['condition', 'title_status', 'vehicle_category', 'fuel', 'transmission'])

    return df

# Function to clean and correct model names
def clean_and_correct_model_names(df, redundant_words):
    import re  # used to escape trim words such as 'r/t' and '2.5i'

    # Lowercase both columns so that matching is case-insensitive
    model = df['model'].astype(str).str.lower()
    manufacturer = df['manufacturer'].astype(str).str.lower()

    # Remove redundant words only when they appear as whole tokens;
    # plain substring replacement would turn 'mustang' into 'muang' ('st')
    words = sorted(set(redundant_words), key=len, reverse=True)
    pattern = r'(?<!\w)(?:' + '|'.join(re.escape(w) for w in words) + r')(?!\w)'
    model = model.str.replace(pattern, '', regex=True)

    # Apply the correction logic for Silverado 1500 models
    chevy_mask = manufacturer == 'chevrolet'
    silverado_mask = model.str.contains('1500|silverado', regex=True)
    model = model.mask(chevy_mask & silverado_mask, 'silverado 1500')

    # Remove hyphens ("-") from the model names
    model = model.str.replace('-', '', regex=False)

    # Remove stray single-letter tokens " x", " w", " s" from the model names
    model = model.str.replace(r'\s+[xws](?!\w)', '', regex=True)

    # Collapse repeated spaces and strip leading and trailing spaces
    model = model.str.replace(r'\s+', ' ', regex=True).str.strip()

    # Assign the processed names back to the DataFrame
    df['model_corrected'] = model

    return df

# Function to encode categorical features
def encode_categorical_features(df):
    # Frequency Encoding for 'manufacturer'
    df['manufacturer_freq'] = df['manufacturer'].map(df['manufacturer'].value_counts())

    # Frequency Encoding for 'model'
    df['model_freq'] = df['model_corrected'].map(df['model_corrected'].value_counts())

    # Frequency Encoding for 'region'
    df['region_freq'] = df['region'].map(df['region'].value_counts())

    # Frequency Encoding for 'state'
    df['state_encoded'] = df['state'].map(df['state'].value_counts())

    # Ordinal Encoding for 'paint_color' (the integer codes are arbitrary labels)
    color_mapping = {
        'unknown': 0, 'white': 1, 'black': 2, 'silver': 3, 'blue': 4, 'red': 5, 'grey': 6,
        'green': 7, 'custom': 8, 'brown': 9, 'yellow': 10, 'orange': 11, 'purple': 12
    }
    # Missing colors are treated as 'unknown' so they map to 0 instead of NaN
    df['paint_color_ordinal'] = df['paint_color'].fillna('unknown').map(color_mapping)

    return df

# Function to clean and fill cylinders and drive based on model
def clean_and_fill_cylinders_drive(df):
    df = df.copy()

    # Build a per-model lookup from rows where both values are known.
    # Missing values were filled with 'unknown' earlier, so that is the sentinel to test for.
    known = df[(df['cylinders'] != 'unknown') & (df['drive'] != 'unknown')].drop_duplicates(subset='model')
    lookup = known.set_index('model')[['cylinders', 'drive']]

    # Fill 'unknown' entries from the lookup; models absent from it stay 'unknown'
    for col in ['cylinders', 'drive']:
        from_lookup = df['model'].map(lookup[col])
        is_unknown = df[col] == 'unknown'
        df.loc[is_unknown, col] = from_lookup[is_unknown].fillna('unknown')

    # Remove rows where 'cylinders' or 'drive' are still not determined
    df = df[(df['cylinders'] != 'unknown') & (df['drive'] != 'unknown')].copy()

    return df

# Function to create vehicle categories
def create_vehicle_categories(df, type_mapping):
    df['vehicle_category'] = df['type'].map(type_mapping)
    df['vehicle_category'] = df['vehicle_category'].fillna('Commercial/Other')
    return df

# Function to apply one-hot encoding
def apply_one_hot_encoding(df, columns_to_encode):
    dummies = pd.get_dummies(df[columns_to_encode], prefix=columns_to_encode)
    dummies = dummies.astype(int)
    df = pd.concat([df.drop(columns=columns_to_encode), dummies], axis=1)
    return df

# Example usage:
vehicles_df_cleaned = process_vehicle_data(vehicles_df)

# Store 'year' as an integer and calculate 'age'
vehicles_df_cleaned['year'] = vehicles_df_cleaned['year'].astype(int)
current_year = pd.Timestamp.today().year
vehicles_df_cleaned['age'] = current_year - vehicles_df_cleaned['year']

# Convert 'odometer' to integers; rows with a missing odometer reading are
# dropped first because NaN cannot be cast to int
vehicles_df_cleaned = vehicles_df_cleaned.dropna(subset=['odometer']).copy()
vehicles_df_cleaned['odometer'] = vehicles_df_cleaned['odometer'].astype(int)

# Reset the index of the final DataFrame
vehicles_df_cleaned.reset_index(drop=True, inplace=True)

# Display the Final DataFrame
print("\nFinal DataFrame:")
print(vehicles_df_cleaned.head())

# Save the final DataFrame to CSV
vehicles_df_cleaned.to_csv('final_processed_vehicles.csv', index=False)
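
The frequency encoding above computes category counts on the full dataset, which lets information from rows later used for testing reach the features. A minimal sketch of learning the counts from training rows only is shown below; the helper names and the toy series are hypothetical, and mapping unseen categories to 0 is an assumption rather than the notebook's choice.

```python
import pandas as pd


def fit_frequency_encoding(train_col: pd.Series) -> pd.Series:
    """Return a category -> count lookup learned from training rows only."""
    return train_col.value_counts()


def apply_frequency_encoding(col: pd.Series, counts: pd.Series) -> pd.Series:
    """Map categories to their training counts; unseen categories get 0."""
    return col.map(counts).fillna(0).astype(int)


# Hypothetical toy data standing in for the 'manufacturer' column
train = pd.Series(["ford", "ford", "toyota", "bmw", "ford"], name="manufacturer")
test = pd.Series(["toyota", "ford", "tesla"], name="manufacturer")

counts = fit_frequency_encoding(train)
train_freq = apply_frequency_encoding(train, counts)
test_freq = apply_frequency_encoding(test, counts)
print(test_freq.tolist())  # [1, 3, 0]
```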


In [None]:
# Drop non-numeric columns
vehicles_df_final = vehicles_df_cleaned.select_dtypes(include='number')

df_numeric = vehicles_df_final

# Calculate the correlation matrix for the numeric columns
correlation_matrix = df_numeric.corr()

# Plot the heatmap
plt.figure(figsize=(16, 12))
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()
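
With many columns and `annot=False`, the heatmap gives an overall impression but not a ranking. A small sketch of ranking features by the strength of their correlation with `price` follows; the toy DataFrame is invented for illustration, and the same two lines could be applied to `correlation_matrix` from the cell above.

```python
import pandas as pd

# Hypothetical toy data standing in for the numeric vehicle columns
toy_numeric = pd.DataFrame({
    "price": [20, 18, 15, 11, 9],
    "age": [1, 2, 3, 4, 5],
    "doors": [4, 2, 4, 2, 4],
})

toy_corr = toy_numeric.corr()

# Correlation of every feature with price, strongest (by absolute value) first
price_corr = toy_corr["price"].drop("price")
ranking = price_corr.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranking)
```

Sorting by absolute value keeps strong negative relationships, such as age, at the top of the list.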

Before you proceed with machine learning, you need to ensure your data is properly prepared. This includes handling missing values, encoding categorical variables, scaling features, and applying logarithmic transformations where appropriate.

Steps:

- **Handle Missing Values:** Ensure that all missing values are addressed.
- **Encoding Categorical Variables:** Use methods like one-hot encoding or label encoding, depending on the type of categorical variable.
- **Feature Scaling:** Normalize or standardize numerical features to bring them onto a similar scale, especially for regression models.
- **Logarithmic Transformation:** Apply log transformation to skewed numerical features to stabilize variance and make the data more normally distributed.

In [None]:
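df = vehicles_df_final.copy()

# Check for missing values
missing_values = df.isnull().sum()
print("Missing values:\n", missing_values)

# Handle missing values: drop columns that are entirely empty, then drop
# any rows that still contain a missing value
df = df.dropna(axis=1, how='all').dropna()

# Encoding categorical variables
categorical_columns = df.select_dtypes(include=['object']).columns
df = pd.get_dummies(df, columns=categorical_columns, drop_first=True)

# Log transform skewed numerical features. log1p is undefined for values <= -1,
# so columns containing negative values are left untouched.
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
skewed_cols = [c for c in numeric_cols if df[c].min() >= 0 and abs(df[c].skew()) > 0.5]
if skewed_cols:
    df[skewed_cols] = np.log1p(df[skewed_cols])

# Feature scaling. Note that 'price' is part of numeric_cols, so the target is
# transformed and standardized too, and the MSE values reported later are in
# those units rather than in dollars.
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])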

# Verify data preparation
print(df.head())
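
The cell above fits the scaler on every row before the train/test split, so test rows influence the scaling statistics. The four preparation steps can also be sketched as a scikit-learn pipeline whose parameters are learned from training rows only. This is a minimal sketch under stated assumptions: the toy columns, the median and most-frequent imputation strategies, and the manual split are illustrative choices, not the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Hypothetical toy frame standing in for the vehicle features
toy = pd.DataFrame({
    "odometer": [120000.0, 85000.0, np.nan, 60000.0, 150000.0, 30000.0],
    "age": [12, 8, 5, 4, 15, 2],
    "fuel": ["gas", "diesel", "gas", np.nan, "gas", "electric"],
})
train, test = toy.iloc[:4], toy.iloc[4:]

# Numeric columns: impute, log-transform, then standardize
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("log", FunctionTransformer(np.log1p)),
    ("scale", StandardScaler()),
])

# Categorical columns: impute, then one-hot encode; categories unseen
# during fitting are encoded as all zeros instead of raising an error
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    [
        ("num", numeric_pipe, ["odometer", "age"]),
        ("cat", categorical_pipe, ["fuel"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_train_prepared = preprocess.fit_transform(train)  # statistics come from train only
X_test_prepared = preprocess.transform(test)        # reused unchanged on test
print(X_train_prepared.shape, X_test_prepared.shape)
```

Placing a regressor after `preprocess` in a single `Pipeline` would let `cross_val_score` and `GridSearchCV` repeat the fitting inside each fold.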

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models. Here, you should build a number of different regression models with the price as the target. In building your models, you should explore different parameters and be sure to cross-validate your findings.


### Use of Multiple Regression Models

Now that your data is prepared, you can apply multiple regression models to it. Here are a few models you could consider:

- Linear Regression
- Ridge Regression
- Lasso Regression
- ElasticNet Regression
- Random Forest Regression

In [None]:

# Split the data into features (X) and target (y)
X = df.drop(columns=['price'])  # Assuming 'price' is the target variable
y = df['price']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(),
    'Lasso Regression': Lasso(),
    'ElasticNet Regression': ElasticNet(),
#    'Random Forest': RandomForestRegressor()
}

# Evaluate each model using cross-validation
for name, model in models.items():
    model.fit(X_train, y_train)
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"{name} - CV MSE: {-cv_scores.mean()} - Test MSE: {mse}")
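
Because the target was transformed during preparation, the MSE values printed above are not in dollars. One possible way to keep errors in price units is `TransformedTargetRegressor`, which fits on a transformed target and inverts the transform at prediction time. The sketch below uses synthetic data; the price formula and all variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the vehicle data: price decays with age and mileage
age = rng.uniform(0, 20, size=500)
odometer_10k = rng.uniform(0, 20, size=500)  # odometer in units of 10,000 miles
X_demo = np.column_stack([age, odometer_10k])
price = 30_000 * np.exp(-0.08 * age - 0.02 * odometer_10k) * rng.lognormal(0, 0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, price, test_size=0.2, random_state=42)

# Ridge is fitted on log1p(price); predictions are mapped back with expm1
model = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)

rmse = np.sqrt(mean_squared_error(y_te, pred))
print(f"Test RMSE in price units: {rmse:,.0f}")
```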

### Cross-Validation and Grid Search for Hyperparameters

Use cross-validation to assess model performance and grid search to find the best hyperparameters.

In [None]:
from sklearn.model_selection import GridSearchCV

# Example: Grid search for Ridge Regression
ridge = Ridge()
param_grid = {'alpha': [0.1, 1.0, 10.0, 100.0]}
grid_search = GridSearchCV(ridge, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

print("Best parameters for Ridge Regression:", grid_search.best_params_)
print("Best cross-validation score:", -grid_search.best_score_)
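
Grid search can also wrap a whole pipeline, so that scaling is refitted inside every cross-validation fold. A minimal sketch on synthetic data from `make_regression` follows; the step names `scale` and `ridge` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for X_train / y_train
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

pipe = Pipeline([("scale", StandardScaler()), ("ridge", Ridge())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"ridge__alpha": [0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)

# Compare every candidate, not only the winner
for alpha, score in zip(search.cv_results_["param_ridge__alpha"],
                        search.cv_results_["mean_test_score"]):
    print(f"alpha={alpha}: CV MSE={-score:.1f}")
print("Best parameters:", search.best_params_)
```

Looking at all candidates shows whether the best `alpha` sits at the edge of the grid, which would suggest widening the search range.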

### Test-Set Evaluation of the Tuned Model

Once the grid search has selected the best `alpha`, evaluate the refitted model on the held-out test set.

In [None]:
# Evaluate the best model on the test set.
# GridSearchCV refits the best estimator on all of X_train by default (refit=True),
# so the selected alpha does not need to be hardcoded.
best_ridge_model = grid_search.best_estimator_
y_pred = best_ridge_model.predict(X_test)

# Calculate test MSE
test_mse = mean_squared_error(y_test, y_pred)
print(f"Test MSE for the best Ridge Regression model: {test_mse}")


### Interpretation of Model Coefficients:

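Since Ridge Regression is a linear model, its coefficients show the direction and relative strength of each feature's association with the target variable (price). Because the numeric features were standardized during preparation, their coefficient magnitudes can be compared with one another, which indicates which features are most influential.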

In [None]:
# Extract and display coefficients
feature_names = X_train.columns
coefficients = best_ridge_model.coef_

coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': coefficients
}).sort_values(by='Coefficient', ascending=False)

print(coef_df)
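
Sorting by the signed coefficient places strong negative drivers at the bottom of the table. Ranking by absolute value instead puts the most influential features first regardless of sign. The sketch below uses synthetic data with hypothetical feature names; the column `AbsCoefficient` is an addition for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features: 'age' has a strong effect, 'odometer' a weaker one, 'noise' none
X_raw = pd.DataFrame({
    "age": rng.uniform(0, 20, 300),
    "odometer": rng.uniform(0, 200_000, 300),
    "noise": rng.normal(0, 1, 300),
})
y_demo = 10.0 - 0.10 * X_raw["age"] - 0.000002 * X_raw["odometer"] + rng.normal(0, 0.05, 300)

# Standardize so that coefficient magnitudes are comparable across features
X_scaled = StandardScaler().fit_transform(X_raw)
ridge_demo = Ridge(alpha=1.0).fit(X_scaled, y_demo)

coef_demo = pd.DataFrame({"Feature": X_raw.columns, "Coefficient": ridge_demo.coef_})
coef_demo["AbsCoefficient"] = coef_demo["Coefficient"].abs()
coef_demo = coef_demo.sort_values("AbsCoefficient", ascending=False).reset_index(drop=True)
print(coef_demo)
```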


In [None]:
plt.figure(figsize=(10, 12))
sns.barplot(x='Coefficient', y='Feature', data=coef_df, palette='coolwarm')
plt.title('Feature Coefficients')
plt.xlabel('Coefficient')
plt.ylabel('Feature')
plt.show()

In [None]:
# Assuming coef_df is your DataFrame with features and coefficients
# and vehicles_df_cleaned (the preprocessed DataFrame before dropping original columns) is available

# Select the 10 features with the largest positive coefficients
top_positive_features = coef_df.sort_values(by='Coefficient', ascending=False).head(10)['Feature'].tolist()

# Select the 10 features with the most negative coefficients
top_negative_features = coef_df.sort_values(by='Coefficient', ascending=True).head(10)['Feature'].tolist()

# Original columns to include in the output
original_columns = ['manufacturer', 'model', 'price', 'odometer', 'year']  # Add any other original columns you want to see

# Ensure the DataFrame you're working with includes these original columns
vehicles_df_with_originals = vehicles_df_cleaned.copy()

def example_cars(features, n=10):
    # Rows where at least one of the given features is positive (e.g. an active dummy column)
    mask = (vehicles_df_final[features] > 0).any(axis=1)
    # dict.fromkeys removes duplicates while preserving order,
    # e.g. when 'odometer' is both an original column and a top feature
    columns = list(dict.fromkeys(original_columns + features))
    return vehicles_df_with_originals.loc[mask, columns].head(n)

top_positive_cars = example_cars(top_positive_features)
top_negative_cars = example_cars(top_negative_features)

# Display the results
print("\nExample cars with the most positive-coefficient features:")
print(top_positive_cars)

print("\nExample cars with the most negative-coefficient features:")
print(top_negative_cars)




In [None]:
# Calculate residuals (actual minus predicted)
residuals = y_test - y_pred

# Plot residuals
plt.figure(figsize=(10, 6))
sns.histplot(residuals, kde=True)
plt.title("Distribution of Residuals")
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.show()

plt.figure(figsize=(10, 6))
plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.title("Residuals vs Predicted")
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.show()
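
The residual plots can be complemented with a few summary numbers. A minimal sketch on a tiny hand-made example follows; the arrays are invented so that the arithmetic is easy to check by hand, and the same calls could be applied to `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted values
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 8.0, 9.0])

residuals_demo = y_true_demo - y_pred_demo           # [1, 0, -1, 0]
mse = mean_squared_error(y_true_demo, y_pred_demo)   # mean of squared residuals
rmse = np.sqrt(mse)                                  # same units as the target
mae = mean_absolute_error(y_true_demo, y_pred_demo)  # mean of absolute residuals
r2 = r2_score(y_true_demo, y_pred_demo)              # 1 - SS_res / SS_tot

print(f"Mean residual: {residuals_demo.mean():.2f}")
print(f"MSE: {mse:.2f}, RMSE: {rmse:.3f}, MAE: {mae:.2f}, R^2: {r2:.2f}")
```

A mean residual far from zero points to systematic over- or under-prediction, which the histogram alone can make hard to judge.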


In [None]:
from sklearn.preprocessing import PolynomialFeatures

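# Build pairwise interaction terms from the training features.
# With interaction_only=True, squared terms are NOT generated.
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X_train)

# X_poly contains the original features plus one product term per feature pair,
# i.e. n + n*(n-1)/2 columns for n input features, so it grows quickly with n
print(X_poly.shape)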


### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.