# What drives the price of a car?

![](images/kurt.jpeg)

In [194]:
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from warnings import filterwarnings 
filterwarnings('ignore')
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error


import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder


**OVERVIEW**

In this application, you will explore a dataset from kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

### Data Problem Definition:  

The goal is to identify and model the relationships between used car prices and relevant predictor variables (also known as features) from historical sales data. This involves selecting, extracting, transforming, and analyzing a dataset containing information on used cars, including price, make, model, year, mileage, condition, trim level, location, and other relevant attributes, to develop a predictive model that captures the key drivers of used car prices. 

In this formulation: 

    Predictor variables  refer to the input features (e.g., make, model, year) that may influence used car prices.
    Target variable  is the outcome or response we're interested in predicting (used car price).
    Goal  is to develop a predictive model that can accurately estimate used car prices based on the predictor variables.
     

This reframed task aligns with the CRISP-DM methodology, which emphasizes a structured approach to data science projects. 

###  1 Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

### 1.1 Load the data

The following code will load the data into a pandas dataframe and then provide a highlevel view of the columns and column types

In [195]:
vehicles_df = pd.read_csv('data/vehicles.csv', low_memory=False)
vehicles_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   region        426880 non-null  object 
 2   price         426880 non-null  int64  
 3   year          425675 non-null  float64
 4   manufacturer  409234 non-null  object 
 5   model         421603 non-null  object 
 6   condition     252776 non-null  object 
 7   cylinders     249202 non-null  object 
 8   fuel          423867 non-null  object 
 9   odometer      422480 non-null  float64
 10  title_status  418638 non-null  object 
 11  transmission  424324 non-null  object 
 12  VIN           265838 non-null  object 
 13  drive         296313 non-null  object 
 14  size          120519 non-null  object 
 15  type          334022 non-null  object 
 16  paint_color   296677 non-null  object 
 17  state         426880 non-null  object 
dtypes: f

In [196]:
vehicles_df.sample(5)

Unnamed: 0,id,region,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,VIN,drive,size,type,paint_color,state
161065,7313922295,omaha / council bluffs,16995,,,Mazda3 5-Door,,4 cylinders,gas,23229.0,rebuilt,automatic,JM1BPALMXK1108549,fwd,,hatchback,black,ia
116684,7316506250,tampa bay area,27990,2012.0,gmc,sierra 2500 hd extended cab,good,8 cylinders,gas,68696.0,clean,other,1GT220CG8CZ231238,4wd,,pickup,black,fl
89421,7315085741,"washington, DC",5800,2005.0,subaru,legacy sedan (natl) 2.,excellent,4 cylinders,gas,152488.0,,automatic,4S3BL616857222919,4wd,,sedan,,dc
34875,7315899885,los angeles,0,2018.0,chevrolet,silverado,,,gas,,clean,automatic,1GCNCREC5JZ269967,fwd,,pickup,red,ca
235779,7313027353,fayetteville,23500,2013.0,gmc,sierra,excellent,8 cylinders,gas,121000.0,clean,automatic,,4wd,full-size,truck,white,nc


### Check for missing data
The following code is used to count the number of rows in the DataFrame vehicles_df that contain at least one missing (NaN) value. Here's a 

In [197]:
nan_rows = vehicles_df.isnull().T.any().T.sum()
print('There are ' + str(nan_rows) + ' rows with at least one missing value.')

There are 392012 rows with at least one missing value.


### Data Preparation

After our initial exploration and fine tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 

Remove column and rows not needed and correct some manufacturer errors

In [198]:
# Handle Missing Values in 'cylinders' and 'drive'
vehicles_df['cylinders'] = vehicles_df['cylinders'].fillna('unknown')
vehicles_df['drive'] = vehicles_df['drive'].fillna('unknown')

# Drop Irrelevant Columns
columns_to_drop = ['VIN', 'size', 'Unnamed: 18']
vehicles_df_cleaned = vehicles_df.drop(columns=columns_to_drop, errors='ignore').copy()

# Remove Rows with Missing Essential Data and Zero Price
essential_columns = ['price', 'year', 'manufacturer', 'model']
vehicles_df_cleaned.dropna(subset=essential_columns, inplace=True)
vehicles_df_cleaned = vehicles_df_cleaned[vehicles_df_cleaned['price'] != 0]

# Remove Specific Manufacturers
vehicles_df_cleaned = vehicles_df_cleaned[vehicles_df_cleaned['manufacturer'].str.lower() != 'harley-davidson']

# Replace Specific Manufacturer Names
vehicles_df_cleaned['manufacturer'] = vehicles_df_cleaned['manufacturer'].replace({
    'rover': 'land rover',
    'mini': 'bmw'
})

# Filter rows where the 'condition' is 'like new' and 'fuel' is 'gas'
filtered_df = vehicles_df[(vehicles_df['fuel'] == 'other')]

Cleanup the Model Column

In [199]:

# Convert the 'model' and 'manufacturer' columns to NumPy arrays for faster processing
model_array = vehicles_df_cleaned['model'].values.astype(str)
manufacturer_array = vehicles_df_cleaned['manufacturer'].values.astype(str)

# Convert manufacturer names to lowercase using np.char.lower
manufacturer_array = np.char.lower(manufacturer_array)

# Define the redundant words to remove
redundant_words = ['4x4', 'sedan', 'suv', 'coupe', 'hatchback', 'convertible', 'pickup', 'drw', 'benz', '4wd',
                   'wagon', 'cab', 'hd', 'crew','extended','utility','2d','4d','crew', 'sport', 'unlimited',
                   'connect', 'black', 'white', 'luxury', 'v6','all', 'new', 'lt', 'xlt','lx', 'xle', 'grand',
                   'limited', 'sr', 'big', 'horn', 'r/t', 'ltz','super','duty', 'ss', 'se', 'xl', 'gt' ,'premium',
                   'st','ls','hard','top','2.5i', 'regular', '2.5', 'le','exl', 'double', 'doub'
                  ]                 

# Replace redundant words using a loop
for word in redundant_words:
    model_array = np.char.replace(model_array, word, '')

# Apply the correction logic for Silverado 1500 models
chevy_mask = manufacturer_array == 'chevrolet'
silverado_mask = np.char.find(np.char.lower(model_array), '1500') >= 0
silverado_mask = np.char.find(np.char.lower(model_array), 'silverado') >= 0

# Update the model names to 'Silverado 1500' where the conditions are met
model_array[chevy_mask & silverado_mask] = 'silverado 1500'

# Remove hyphens ("-") from the model names
model_array = np.char.replace(model_array, '-', '')

# Remove the substring " x" from the model names
model_array = np.char.replace(model_array, ' x', '')
model_array = np.char.replace(model_array, ' w', '')
model_array = np.char.replace(model_array, ' s', '')

# Strip leading and trailing spaces from the model names
model_array = np.char.strip(model_array)

# Assign the processed array back to the DataFrame
vehicles_df_cleaned['model_corrected'] = model_array
#vehicles_df_cleaned.dropna(subset=essential_columns, inplace=True)

Fix some issues with tesla and jeep rows

In [200]:
# Update 'fuel' and 'cylinders' for 'tesla' Models
tesla_mask = vehicles_df_cleaned['manufacturer'].str.lower() == 'tesla'
vehicles_df_cleaned.loc[tesla_mask, ['fuel', 'cylinders']] = ['electric', 'none']

# Update the 'type' Column Based on Conditions
jeep_mask = vehicles_df_cleaned['manufacturer'].str.lower() == 'jeep'
vehicles_df_cleaned.loc[jeep_mask, 'type'] = 'SUV'
accord_mask = vehicles_df_cleaned['model'].str.lower() == 'accord'
vehicles_df_cleaned.loc[accord_mask, 'type'] = 'sedan'

Frequency encode manufacturer, model, region and state

In [201]:

# Frequency Encoding for 'manufacturer'
vehicles_df_cleaned['manufacturer_freq'] = vehicles_df_cleaned['manufacturer'].map(vehicles_df_cleaned['manufacturer'].value_counts())
#vehicles_df_cleaned = vehicles_df_cleaned.drop(columns=['manufacturer'], errors='ignore').copy()

# Frequency Encoding for 'model'
vehicles_df_cleaned['model_freq'] = vehicles_df_cleaned['model_corrected'].map(vehicles_df_cleaned['model_corrected'].value_counts())
#vehicles_df_cleaned = vehicles_df_cleaned.drop(columns=['model_corrected'], errors='ignore').copy()

# Frequency Encoding for 'region'
vehicles_df_cleaned['region_freq'] = vehicles_df_cleaned['region'].map(vehicles_df_cleaned['region'].value_counts())
#vehicles_df_cleaned = vehicles_df_cleaned.drop(columns=['region'], errors='ignore').copy()

# Frequency Encoding for 'state'
vehicles_df_cleaned['state_encoded'] = vehicles_df_cleaned['state'].map(vehicles_df_cleaned['state'].value_counts())
#vehicles_df_cleaned = vehicles_df_cleaned.drop(columns=['state'], errors='ignore').copy()


Ordinal encode paint color

In [202]:
# Ordinal Encoding for 'paint_color'
color_mapping = {
    'unknown': 0,
    'white': 1,
    'black': 2,
    'silver': 3,
    'blue': 4,
    'red': 5,
    'grey': 6,
    'green': 7,
    'custom': 8,
    'brown': 9,
    'yellow': 10,
    'orange': 11,
    'purple': 12
}

vehicles_df_cleaned['paint_color_ordinal'] = vehicles_df_cleaned['paint_color'].map(color_mapping)
#vehicles_df_cleaned = vehicles_df_cleaned.drop(columns=['paint_color'], errors='ignore').copy()

Fix columns cylinder and drive

In [203]:

# Clean the 'cylinders' column
def clean_cylinders(value):
    try:
        # Attempt to extract the number before the word "cylinders"
        if isinstance(value, str) and value.lower() != 'other' and 'cylinder' in value:
            return int(value.split()[0])
        else:
            return 0  # Set as 0 for "other" or "unknown" values
    except ValueError:
        return 0  # Fallback in case the string isn't in the expected format

vehicles_df_cleaned['cylinders'] = vehicles_df_cleaned['cylinders'].apply(clean_cylinders)

# Populate the 'cylinders_drive' Dictionary
cylinders_drive_df = vehicles_df_cleaned[
    (vehicles_df_cleaned['cylinders'] != 0) &
    (vehicles_df_cleaned['drive'] != 'unknown')
].drop_duplicates(subset='model')[['model', 'cylinders', 'drive']]
cylinders_drive = cylinders_drive_df.set_index('model')[['cylinders', 'drive']].to_dict('index')
cylinders_drive = {model: [info['cylinders'], info['drive']] for model, info in cylinders_drive.items()}

# Fill 'cylinders' and 'drive' Based on 'model'
def fill_cylinders(row):
    if row['cylinders'] == 0:
        return cylinders_drive.get(row['model'], [row['cylinders'], 'unknown'])[0]
    return row['cylinders']

def fill_drive(row):
    if row['drive'] == 'unknown':
        return cylinders_drive.get(row['model'], ['unknown', row['drive']])[1]
    return row['drive']

vehicles_df_cleaned['cylinders'] = vehicles_df_cleaned.apply(fill_cylinders, axis=1)
vehicles_df_cleaned['drive'] = vehicles_df_cleaned.apply(fill_drive, axis=1)

# Remove Rows with Remaining 'unknown' Values
vehicles_df_cleaned = vehicles_df_cleaned[
    (vehicles_df_cleaned['cylinders'] != 0) &
    (vehicles_df_cleaned['drive'] != 'unknown')
]

Convert year to age and convert odometer to integer

In [204]:
# Convert 'year' to Datetime and Calculate 'age'
vehicles_df_cleaned['year'] = pd.to_datetime(vehicles_df_cleaned['year'], format='%Y', errors='coerce')
current_year = pd.to_datetime('today').year
vehicles_df_cleaned['age'] = current_year - vehicles_df_cleaned['year'].dt.year
#vehicles_df_cleaned.drop(columns=['year'], inplace=True)

# Handle Missing Values in Other Columns
vehicles_df_cleaned['fuel'] = vehicles_df_cleaned['fuel'].fillna('gas').replace('other', 'gas')
vehicles_df_cleaned['transmission'] = vehicles_df_cleaned['transmission'].fillna('automatic').replace('other', 'automatic')
vehicles_df_cleaned['condition'] = vehicles_df_cleaned['condition'].fillna('unknown').replace('new', 'like new')

# Fill Missing Values in All Categorical and Numerical Columns
categorical_columns = vehicles_df_cleaned.select_dtypes(include=['object']).columns
vehicles_df_cleaned.loc[:, categorical_columns] = vehicles_df_cleaned[categorical_columns].fillna('unknown')
numerical_columns = vehicles_df_cleaned.select_dtypes(include=['number']).columns
vehicles_df_cleaned.loc[:, numerical_columns] = vehicles_df_cleaned[numerical_columns].fillna(
    vehicles_df_cleaned[numerical_columns].median()
)

# Convert 'odometer' to Integer
vehicles_df_cleaned['odometer'] = vehicles_df_cleaned['odometer'].astype(int)


Cleanup the type column

In [205]:

# Map 'type' to 'vehicle_category'
type_mapping = {
    'pickup': 'Truck/Van',
    'truck': 'Truck/Van',
    'van': 'Truck/Van',
    'mini-van': 'Truck/Van',
    'coupe': 'Car',
    'sedan': 'Car',
    'hatchback': 'Car',
    'convertible': 'Car',
    'wagon': 'Car',
    'SUV': 'SUV/Offroad',
    'offroad': 'SUV/Offroad',
    'bus': 'Commercial/Other',
    'other': 'Commercial/Other',
    'unknown': 'Commercial/Other'
}
vehicles_df_cleaned['vehicle_category'] = vehicles_df_cleaned['type'].map(type_mapping)
vehicles_df_cleaned['vehicle_category'] = vehicles_df_cleaned['vehicle_category'].fillna('Commercial/Other')

One hot encoding of condition, title_status, vehicle_category, fule, tranmission, cylinders and drive

In [206]:
# Apply One-Hot Encoding to Categorical Columns
dummy_columns = ['condition', 'title_status', 'vehicle_category', 'fuel', 'transmission']
dummies = pd.get_dummies(vehicles_df_cleaned[dummy_columns], prefix=dummy_columns)
dummies = dummies.astype(int)

# Check unique values in 'cylinders' before get_dummies
print("Unique values in 'cylinders':", vehicles_df_cleaned['cylinders'].unique())

# Convert 'cylinders' and 'drive' to Integer and Apply One-Hot Encoding
vehicles_df_cleaned['cylinders'] = pd.to_numeric(vehicles_df_cleaned['cylinders'], errors='coerce').fillna(0).astype(int)
vehicles_df_cleaned['drive'] = vehicles_df_cleaned['drive'].astype(str)
cyl_dummies = pd.get_dummies(vehicles_df_cleaned['cylinders'], prefix='cyl')
cyl_dummies = cyl_dummies.astype(int)
drive_dummies = pd.get_dummies(vehicles_df_cleaned['drive'], prefix='drive')
drive_dummies = drive_dummies.astype(int)

# Concatenate Dummy Variables with the Original DataFrame
#columns_to_drop_for_dummies = ['model','condition', 'title_status', 'type', 'vehicle_category', 'fuel', 'transmission', 'cylinders', 'drive']
#vehicles_df_final = vehicles_df_cleaned.drop(columns=columns_to_drop_for_dummies)
vehicles_df_final = pd.concat([vehicles_df_final, dummies, cyl_dummies, drive_dummies], axis=1)

# Reset the index of the final DataFrame
vehicles_df_final.reset_index(drop=True, inplace=True)

# Display the Final DataFrame
print("\nFinal DataFrame:")
print(vehicles_df_final.head())

Unique values in 'cylinders': [ 8  6  4  5 10  3 12]

Final DataFrame:
             id    price  odometer model_corrected  manufacturer_freq  \
0  7.316815e+09  33590.0   57923.0     sierra 1500            15228.0   
1  7.316815e+09  22590.0   71229.0  silverado 1500            49818.0   
2  7.316815e+09  39590.0   19160.0  silverado 1500            49818.0   
3  7.316743e+09  30990.0   41124.0          tundra            31407.0   
4  7.316356e+09  15000.0  128000.0            f150            64199.0   

   model_freq  region_freq  state_encoded  paint_color_ordinal   age  \
0      3410.0        131.0         4407.0                  1.0  10.0   
1     15631.0        131.0         4407.0                  4.0  14.0   
2     15631.0        131.0         4407.0                  5.0   4.0   
3      1886.0        131.0         4407.0                  5.0   7.0   
4     11834.0        131.0         4407.0                  2.0  11.0   

   condition_excellent  condition_fair  condition_good  c

In [207]:
pd.options.display.max_columns = None
vehicles_df_final.head(5)

Unnamed: 0,id,price,odometer,model_corrected,manufacturer_freq,model_freq,region_freq,state_encoded,paint_color_ordinal,age,condition_excellent,condition_fair,condition_good,condition_like new,condition_salvage,condition_unknown,title_status_clean,title_status_lien,title_status_missing,title_status_parts only,title_status_rebuilt,title_status_salvage,title_status_unknown,vehicle_category_Car,vehicle_category_Commercial/Other,vehicle_category_SUV/Offroad,vehicle_category_Truck/Van,fuel_diesel,fuel_electric,fuel_gas,fuel_hybrid,transmission_automatic,transmission_manual,cyl_3,cyl_4,cyl_5,cyl_6,cyl_8,cyl_10,cyl_12,drive_4wd,drive_fwd,drive_rwd,condition_excellent.1,condition_fair.1,condition_good.1,condition_like new.1,condition_salvage.1,condition_unknown.1,title_status_clean.1,title_status_lien.1,title_status_missing.1,title_status_parts only.1,title_status_rebuilt.1,title_status_salvage.1,title_status_unknown.1,vehicle_category_Car.1,vehicle_category_Commercial/Other.1,vehicle_category_SUV/Offroad.1,vehicle_category_Truck/Van.1,fuel_diesel.1,fuel_electric.1,fuel_gas.1,fuel_hybrid.1,transmission_automatic.1,transmission_manual.1,cyl_3.1,cyl_4.1,cyl_5.1,cyl_6.1,cyl_8.1,cyl_10.1,cyl_12.1,drive_4wd.1,drive_fwd.1,drive_rwd.1,condition_excellent.2,condition_fair.2,condition_good.2,condition_like new.2,condition_salvage.2,condition_unknown.2,title_status_clean.2,title_status_lien.2,title_status_missing.2,title_status_parts only.2,title_status_rebuilt.2,title_status_salvage.2,title_status_unknown.2,vehicle_category_Car.2,vehicle_category_Commercial/Other.2,vehicle_category_SUV/Offroad.2,vehicle_category_Truck/Van.2,fuel_diesel.2,fuel_electric.2,fuel_gas.2,fuel_hybrid.2,transmission_automatic.2,transmission_manual.2,cyl_3.2,cyl_4.2,cyl_5.2,cyl_6.2,cyl_8.2,cyl_10.2,cyl_12.2,drive_4wd.2,drive_fwd.2,drive_rwd.2
0,7316815000.0,33590.0,57923.0,sierra 1500,15228.0,3410.0,131.0,4407.0,1.0,10.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,7316815000.0,22590.0,71229.0,silverado 1500,49818.0,15631.0,131.0,4407.0,4.0,14.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,7316815000.0,39590.0,19160.0,silverado 1500,49818.0,15631.0,131.0,4407.0,5.0,4.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,7316743000.0,30990.0,41124.0,tundra,31407.0,1886.0,131.0,4407.0,5.0,7.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,7316356000.0,15000.0,128000.0,f150,64199.0,11834.0,131.0,4407.0,2.0,11.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [208]:
vehicles_df_final.to_csv('final.csv', index=False)
copied_df = vehicles_df_final.copy()
vehicles_df_final = vehicles_df_final.drop(columns=['region','year','manufacturer','model', 'paint_color','state'], errors='ignore').copy()


In [209]:
# Drop non-numeric columns

df_numeric = vehicles_df_final

# Calculate the correlation matrix for the numeric columns
correlation_matrix = df_numeric.corr()

# Plot the heatmap
plt.figure(figsize=(16, 12))
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()

ValueError: could not convert string to float: 'sierra 1500'

Before you proceed with machine learning, you need to ensure your data is properly prepared. This includes handling missing values, encoding categorical variables, scaling features, and applying logarithmic transformations where appropriate.

Steps:

    Handle Missing Values: Ensure that all missing values are addressed.
    Encoding Categorical Variables: Use methods like one-hot encoding or label encoding, depending on the type of categorical variable.
    Feature Scaling: Normalize or standardize numerical features to bring them onto a similar scale, especially for regression models.
    Logarithmic Transformation: Apply log transformation to skewed numerical features to stabilize variance and make the data more normally distributed.

In [None]:
df = vehicles_df_final

# Check for missing values
missing_values = df.isnull().sum()
print("Missing values:\n", missing_values)

# Encoding categorical variables
categorical_columns = df.select_dtypes(include=['object']).columns
df = pd.get_dummies(df, columns=categorical_columns, drop_first=True)

# Log transform skewed numerical features
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
df[numeric_cols] = df[numeric_cols].apply(lambda x: np.log1p(x) if np.abs(x.skew()) > 0.5 else x)

# Feature scaling
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# Verify data preparation
print(df.head())

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models. Here, you should build a number of different regression models with the price as the target. In building your models, you should explore different parameters and be sure to cross-validate your findings.


### Use of Multiple Regression Models

Now that your data is prepared, you can apply multiple regression models to it. Here are a few models you could consider:

    Linear Regression
    Ridge Regression
    Lasso Regression
    ElasticNet Regression
    Random Forest Regression

In [None]:

# Split the data into features (X) and target (y)
X = df.drop(columns=['price'])  # Assuming 'price' is the target variable
y = df['price']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(),
    'Lasso Regression': Lasso(),
    'ElasticNet Regression': ElasticNet(),
#    'Random Forest': RandomForestRegressor()
}

# Evaluate each model using cross-validation
for name, model in models.items():
    model.fit(X_train, y_train)
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"{name} - CV MSE: {-cv_scores.mean()} - Test MSE: {mse}")

### Cross-Validation and Grid Search for Hyperparameters

Use cross-validation to assess model performance and grid search to find the best hyperparameters.

In [None]:
from sklearn.model_selection import GridSearchCV

# Example: Grid search for Ridge Regression
ridge = Ridge()
param_grid = {'alpha': [0.1, 1.0, 10.0, 100.0]}
grid_search = GridSearchCV(ridge, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

print("Best parameters for Ridge Regression:", grid_search.best_params_)
print("Best cross-validation score:", -grid_search.best_score_)

### Interpretation of Coefficients

Once you have the best model, you can interpret the coefficients (for linear models) to understand the impact of each feature.

In [None]:
from sklearn.metrics import mean_squared_error

# Evaluate the best model on the test set
best_ridge_model = Ridge(alpha=100.0)
best_ridge_model.fit(X_train, y_train)
y_pred = best_ridge_model.predict(X_test)

# Calculate test MSE
test_mse = mean_squared_error(y_test, y_pred)
print(f"Test MSE for the best Ridge Regression model: {test_mse}")


### Interpretation of Model Coefficients:

 Since Ridge Regression is a linear model, you can interpret the coefficients to understand the impact of each feature on the target variable (price). This can provide insights into which features are most influential.

In [None]:
# Extract and display coefficients
feature_names = X_train.columns
coefficients = best_ridge_model.coef_

coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': coefficients
}).sort_values(by='Coefficient', ascending=False)

print(coef_df)


In [None]:
plt.figure(figsize=(10, 12))
sns.barplot(x='Coefficient', y='Feature', data=coef_df, palette='coolwarm')
plt.title('Feature Coefficients')
plt.xlabel('Coefficient')
plt.ylabel('Feature')
plt.show()

In [None]:
import pandas as pd

# Assuming coef_df is your DataFrame with features and coefficients
# and vehicles_df_cleaned (the preprocessed DataFrame before dropping original columns) is available

# Select the top 10 features with the highest positive coefficients
top_positive_features = coef_df.sort_values(by='Coefficient', ascending=False).head(10)['Feature']

# Select the top 10 features with the lowest (most negative) coefficients
top_negative_features = coef_df.sort_values(by='Coefficient', ascending=True).head(10)['Feature']

# Original columns to include in the output
original_columns = ['manufacturer', 'model', 'price', 'odometer', 'year']  # Add any other original columns you want to see

# Ensure the DataFrame you're working with includes these original columns
vehicles_df_with_originals = vehicles_df_cleaned.copy()

# Extract cars with top positive coefficients, including original columns
top_positive_cars = vehicles_df_with_originals[
    (vehicles_df_final[top_positive_features] > 0).any(axis=1)
][original_columns + top_positive_features.tolist()].head(10)

# Extract cars with top negative coefficients, including original columns
top_negative_cars = vehicles_df_with_originals[
    (vehicles_df_final[top_negative_features] > 0).any(axis=1)
][original_columns + top_negative_features.tolist()].head(10)

# Display the results
print("\nTop 10 Cars with Best Coefficients:")
print(top_positive_cars)

print("\nTop 10 Cars with Worst Coefficients:")
print(top_negative_cars)




In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Calculate residuals
residuals = y_test - y_pred

# Plot residuals
plt.figure(figsize=(10, 6))
sns.histplot(residuals, kde=True)
plt.title("Distribution of Residuals")
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.show()

plt.figure(figsize=(10, 6))
plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.title("Residuals vs Predicted")
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.show()


In [None]:
from sklearn.preprocessing import PolynomialFeatures

# Suppose X_train has two features
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X_train)

# X_poly will now contain original features, squared terms, and interaction terms
print(X_poly.shape)


### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.