<a href="https://colab.research.google.com/github/lakshmisharma17/module11/blob/main/Module11_PracticalApplication_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary.

**[Lakshmi's Understanding]**

From a data perspective, this is a **supervised regression problem** where:

*   The goal is to predict used car prices from features such as age, mileage, brand, model, fuel type, condition, and transmission, in support of our client, a used car dealership.

*   We will identify which features have the most impact on price and evaluate model performance using metrics such as R² or RMSE.
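
To make the two metrics concrete, here is a minimal sketch on made-up prices. The numbers are illustrative assumptions, not values from the dataset; only `mean_squared_error` and `r2_score` come from the imports used later in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted prices in dollars (illustrative only)
y_true = np.array([10000.0, 15000.0, 20000.0, 25000.0])
y_pred = np.array([11000.0, 14000.0, 21000.0, 24000.0])

# RMSE: typical size of a prediction error, in the same units as price
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# R²: share of the variance in price explained by the predictions
r2 = r2_score(y_true, y_pred)

print(f"RMSE: {rmse:.2f}")
print(f"R²: {r2:.3f}")
```

RMSE is easier to explain to a dealership because it is expressed in dollars, while R² is unitless and convenient for comparing models.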

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

Step 1: Data Collection and Processing

In [34]:
#Import required libraries and their dependencies

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge, Lasso
from sklearn import metrics

from sklearn.metrics import mean_squared_error, r2_score


In [35]:
# Load the vehicle dataset from the CSV file into a pandas DataFrame

car_dataset = pd.read_csv('data/vehicles.csv')


In [36]:
# Summarize the columns, their non-null counts, and their dtypes
car_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   region        426880 non-null  object 
 2   price         426880 non-null  int64  
 3   year          425675 non-null  float64
 4   manufacturer  409234 non-null  object 
 5   model         421603 non-null  object 
 6   condition     252776 non-null  object 
 7   cylinders     249202 non-null  object 
 8   fuel          423867 non-null  object 
 9   odometer      422480 non-null  float64
 10  title_status  418638 non-null  object 
 11  transmission  424324 non-null  object 
 12  VIN           265838 non-null  object 
 13  drive         296313 non-null  object 
 14  size          120519 non-null  object 
 15  type          334022 non-null  object 
 16  paint_color   296677 non-null  object 
 17  state         426880 non-null  object 
dtypes: f

In [37]:
#print top 5 rows of the dataset
car_dataset.head()

Unnamed: 0,id,region,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,VIN,drive,size,type,paint_color,state
0,7222695916,prescott,6000,,,,,,,,,,,,,,,az
1,7218891961,fayetteville,11900,,,,,,,,,,,,,,,ar
2,7221797935,florida keys,21000,,,,,,,,,,,,,,,fl
3,7222270760,worcester / central MA,1500,,,,,,,,,,,,,,,ma
4,7210384030,greensboro,4900,,,,,,,,,,,,,,,nc


In [38]:
# The first 5 rows are mostly empty, so they say little about the column values
# (possibly because these are older listings).
# Print the last 10 rows of the dataset to get a better understanding.
car_dataset.tail(10)

Unnamed: 0,id,region,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,VIN,drive,size,type,paint_color,state
426870,7301592119,wyoming,22990,2020.0,hyundai,sonata se sedan 4d,good,,gas,3066.0,clean,other,5NPEG4JAXLH051710,fwd,,sedan,blue,wy
426871,7301591639,wyoming,17990,2018.0,kia,sportage lx sport utility 4d,good,,gas,34239.0,clean,other,KNDPMCAC7J7417329,,,SUV,,wy
426872,7301591201,wyoming,32590,2020.0,mercedes-benz,c-class c 300,good,,gas,19059.0,clean,other,55SWF8DB6LU325050,rwd,,sedan,white,wy
426873,7301591202,wyoming,30990,2018.0,mercedes-benz,glc 300 sport,good,,gas,15080.0,clean,automatic,WDC0G4JB6JV019749,rwd,,other,white,wy
426874,7301591199,wyoming,33590,2018.0,lexus,gs 350 sedan 4d,good,6 cylinders,gas,30814.0,clean,automatic,JTHBZ1BLXJA012999,rwd,,sedan,white,wy
426875,7301591192,wyoming,23590,2019.0,nissan,maxima s sedan 4d,good,6 cylinders,gas,32226.0,clean,other,1N4AA6AV6KC367801,fwd,,sedan,,wy
426876,7301591187,wyoming,30590,2020.0,volvo,s60 t5 momentum sedan 4d,good,,gas,12029.0,clean,other,7JR102FKXLG042696,fwd,,sedan,red,wy
426877,7301591147,wyoming,34990,2020.0,cadillac,xt4 sport suv 4d,good,,diesel,4174.0,clean,other,1GYFZFR46LF088296,,,hatchback,white,wy
426878,7301591140,wyoming,28990,2018.0,lexus,es 350 sedan 4d,good,6 cylinders,gas,30112.0,clean,other,58ABK1GG4JU103853,fwd,,sedan,silver,wy
426879,7301591129,wyoming,30590,2019.0,bmw,4 series 430i gran coupe,good,,gas,22716.0,clean,other,WBA4J1C58KBM14708,rwd,,coupe,,wy


In [39]:
# The first 5 rows and .info() show many null values in several columns.
# Validate the null count for each column with .isnull() and compare it with
# the non-null counts reported by .info()

car_dataset.isnull().sum()


Unnamed: 0,0
id,0
region,0
price,0
year,1205
manufacturer,17646
model,5277
condition,174104
cylinders,177678
fuel,3013
odometer,4400
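
Raw null counts are easier to compare as percentages. For example, `size` has 120,519 non-null values out of 426,880 rows, so roughly 72% of it is missing. The sketch below shows one way to compute this on a tiny stand-in frame; the `sample` DataFrame and its values are assumptions for illustration, and the same expression could be applied to `car_dataset`.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame with hypothetical values (not rows from vehicles.csv)
sample = pd.DataFrame({
    "price": [6000, 11900, 21000, 1500],
    "year": [2015.0, np.nan, 2018.0, 2020.0],
    "size": [np.nan, np.nan, "full-size", np.nan],
})

# Share of missing values per column, as a percentage, largest first
missing_pct = sample.isnull().mean().mul(100).sort_values(ascending=False)

print(missing_pct)
```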


In [52]:
# Understand the distribution of the data

# Print the distinct values and their counts for each categorical column,
# excluding 'VIN', 'region', and 'state'
object_columns = car_dataset.select_dtypes(include=['object']).columns.tolist()
columns_to_print = [col for col in object_columns if col not in ['VIN', 'region', 'state']]

# Iterate through the selected columns and print unique values
for col in columns_to_print:
    print(f"Unique values in column '{col}':")
    print(car_dataset[col].value_counts(), "\n")
    print("-" * 30) # Separator for clarity


Unique values in column 'manufacturer':
manufacturer
ford               70985
chevrolet          55064
toyota             34202
honda              21269
nissan             19067
jeep               19014
ram                18342
gmc                16785
bmw                14699
dodge              13707
mercedes-benz      11817
hyundai            10338
subaru              9495
volkswagen          9345
kia                 8457
lexus               8200
audi                7573
cadillac            6953
chrysler            6031
acura               5978
buick               5501
mazda               5427
infiniti            4802
lincoln             4220
volvo               3374
mitsubishi          3292
mini                2376
pontiac             2288
rover               2113
jaguar              1946
porsche             1384
mercury             1184
saturn              1090
alfa-romeo           897
tesla                868
fiat                 792
harley-davidson      153
ferrari               

### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`.
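
As one possible illustration of these steps, the sketch below filters out implausible prices (rows with a price of 0 appear in the encoded output further down), log-transforms the target, and derives an age feature. The `sample` frame, the very large price, and the `MIN_PRICE`/`MAX_PRICE` bounds are assumptions for illustration and would need tuning against the real data.

```python
import numpy as np
import pandas as pd

# Stand-in frame with hypothetical values, including a zero and an extreme price
sample = pd.DataFrame({
    "price": [0, 4000, 2500, 22990, 123456789],
    "year": [2018.0, 2002.0, 1995.0, 2020.0, 2010.0],
})

# Assumed bounds for a plausible listing price (not taken from the source)
MIN_PRICE, MAX_PRICE = 500, 200_000

# Keep only rows whose price falls inside the assumed range
plausible = sample[sample["price"].between(MIN_PRICE, MAX_PRICE)].copy()

# log1p compresses the long right tail that prices usually have
plausible["log_price"] = np.log1p(plausible["price"])

# Simple engineered feature: age relative to the newest car in the data
plausible["age"] = plausible["year"].max() - plausible["year"]

print(plausible)
```

If a model is trained on `log_price`, predictions can be mapped back to dollars with `np.expm1`.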

**Step 1: Data Cleaning**

In [48]:
# Remove every row that contains at least one NaN
car_dataset_no_nan = car_dataset.dropna()

# Remove VIN, region, and state (a heuristic choice)

car_dataset_cleaned = car_dataset_no_nan.drop(columns=['VIN', 'region', 'state'])

# Summarize the cleaned dataset
car_dataset_cleaned.info()


<class 'pandas.core.frame.DataFrame'>
Index: 34868 entries, 126 to 426836
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            34868 non-null  int64  
 1   price         34868 non-null  int64  
 2   year          34868 non-null  float64
 3   manufacturer  34868 non-null  object 
 4   model         34868 non-null  object 
 5   condition     34868 non-null  object 
 6   cylinders     34868 non-null  object 
 7   fuel          34868 non-null  object 
 8   odometer      34868 non-null  float64
 9   title_status  34868 non-null  object 
 10  transmission  34868 non-null  object 
 11  drive         34868 non-null  object 
 12  size          34868 non-null  object 
 13  type          34868 non-null  object 
 14  paint_color   34868 non-null  object 
dtypes: float64(2), int64(2), object(11)
memory usage: 4.3+ MB
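
Dropping every row with any missing value keeps 34,868 of the 426,880 rows, about 8%. One alternative worth considering is to drop the sparsest columns first and only then drop incomplete rows. The sketch below illustrates the idea on a stand-in frame; the `sample` values and the `MAX_MISSING_SHARE` cutoff are assumptions, not choices made in this notebook.

```python
import numpy as np
import pandas as pd

# Stand-in frame with hypothetical values; 'size' is mostly missing
sample = pd.DataFrame({
    "price": [6000, 11900, 21000, 1500, 4900],
    "year": [2015.0, 2012.0, np.nan, 2020.0, 2018.0],
    "size": [np.nan, np.nan, "full-size", np.nan, np.nan],
})

# Assumed cutoff: drop columns with more than half of their values missing
MAX_MISSING_SHARE = 0.5

missing_share = sample.isnull().mean()
sparse_columns = missing_share[missing_share > MAX_MISSING_SHARE].index.tolist()

# Drop the sparse columns first, then the rows that are still incomplete
trimmed = sample.drop(columns=sparse_columns).dropna()

print("Sparse columns:", sparse_columns)
print("Rows kept by plain dropna():", len(sample.dropna()))
print("Rows kept after dropping sparse columns first:", len(trimmed))
```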


**Step 2: Data Encoding**

In [53]:
# One-hot encode the categorical columns so that all features are numeric for regression analysis
car_dataset_encoded = pd.get_dummies(car_dataset_cleaned, columns=['manufacturer', 'model', 'condition', 'cylinders', 'fuel', 'title_status', 'transmission', 'drive', 'size', 'type', 'paint_color'])

# Display the first few rows of the encoded DataFrame to see the result
print(car_dataset_encoded.head())
# Print the info of the encoded DataFrame to see the new columns
car_dataset_encoded.info()


             id  price    year  odometer  manufacturer_acura  \
126  7305672709      0  2018.0   68472.0               False   
127  7305672266      0  2019.0   69125.0               False   
128  7305672252      0  2018.0   66555.0               False   
215  7316482063   4000  2002.0  155000.0               False   
219  7316429417   2500  1995.0  110661.0               False   

     manufacturer_alfa-romeo  manufacturer_aston-martin  manufacturer_audi  \
126                    False                      False              False   
127                    False                      False              False   
128                    False                      False              False   
215                    False                      False              False   
219                    False                      False              False   

     manufacturer_bmw  manufacturer_buick  ...  paint_color_brown  \
126             False               False  ...              False   
127     
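
One-hot encoding a free-text column such as `model` creates one column per distinct value, which can make the feature matrix very wide. A possible mitigation, sketched below under assumptions, is to group rare values into an `other` bucket before encoding. The `sample` frame, the `MIN_COUNT` threshold, and the `model_grouped` column name are hypothetical.

```python
import pandas as pd

# Stand-in column with hypothetical model names
sample = pd.DataFrame({
    "model": ["f-150", "f-150", "f-150", "civic", "civic",
              "sonata se sedan 4d", "xt4 sport suv 4d"],
})

# Assumed threshold: values seen fewer than MIN_COUNT times become 'other'
MIN_COUNT = 2

counts = sample["model"].value_counts()
frequent = counts[counts >= MIN_COUNT].index

# Keep frequent values as they are and replace the rest with 'other'
sample["model_grouped"] = sample["model"].where(
    sample["model"].isin(frequent), "other"
)

encoded = pd.get_dummies(sample["model_grouped"], prefix="model")

print(encoded.columns.tolist())
```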

In [55]:
from sklearn.preprocessing import StandardScaler

# Separate the target variable 'price' (y) from the remaining features (X)
X = car_dataset_encoded.drop('price', axis=1)
y = car_dataset_encoded['price']

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler to the features and transform them
car_dataset_X_scaled = scaler.fit_transform(X)

# car_dataset_X_scaled is a NumPy array containing the scaled features;
# convert it back to a DataFrame to keep the column names
car_dataset_X_scaled_df = pd.DataFrame(car_dataset_X_scaled, columns=X.columns)

# Print the last 10 rows of the scaled DataFrame to see the result
print(car_dataset_X_scaled_df.tail(10))

             id      year  odometer  manufacturer_acura  \
34858 -1.902238  0.394170 -0.493498           -0.086171   
34859 -1.902245  0.254548  0.068222           -0.086171   
34860 -1.902248 -1.560529  0.917759           -0.086171   
34861 -1.902247  0.394170  0.213427           -0.086171   
34862 -1.905472  0.533791  0.394324           -0.086171   
34863 -1.925795  0.673412 -0.463150           -0.086171   
34864 -1.951991  0.394170  0.472867           -0.086171   
34865 -1.952583  0.952655 -0.709999           -0.086171   
34866 -2.132589 -1.979393 -0.378842           -0.086171   
34867 -2.141220  0.952655 -0.869877           -0.086171   

       manufacturer_alfa-romeo  manufacturer_aston-martin  manufacturer_audi  \
34858                -0.020746                  -0.009276          -0.114344   
34859                -0.020746                  -0.009276          -0.114344   
34860                -0.020746                  -0.009276          -0.114344   
34861                -0.020746

**Step 3: Splitting the data into train and test sets**

I first tried standardizing the data before splitting. After going back to the notes and checking with Gemini, I found that standardizing before the split causes data leakage: the scaler's mean and standard deviation are computed from all rows, so information from the test set influences the training features. I therefore chose to split first and normalize afterwards.
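
A minimal sketch of the split-first order is shown below. It uses synthetic data so that it runs on its own; the array shapes, `test_size`, and `random_state` are assumptions, and in this notebook `X` and `y` would come from the encoded dataset instead.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features and target (illustrative only)
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = rng.normal(size=200)

# 1. Split first, so the test rows stay unseen during preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Fit the scaler on the training rows only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Reuse the training mean and standard deviation on the test rows
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

The key detail is that `fit_transform` is called only on the training set, while the test set goes through `transform` alone.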


**Step 4: Normalize the data**

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.
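
One way to explore parameters with cross-validation is to wrap scaling and a regularized model in a `Pipeline` and search over the regularization strength. The sketch below uses synthetic data and a `Ridge` model as an example; the data, the `alpha` grid, and the 5-fold setting are assumptions, not results from this analysis.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: a linear signal plus noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([3.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=300)

# Scaling sits inside the pipeline, so it is refit on each training fold
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", Ridge()),
])

# Assumed grid of regularization strengths to compare
param_grid = {"model__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(
    pipeline, param_grid, cv=5, scoring="neg_root_mean_squared_error"
)
search.fit(X, y)

print("Best alpha:", search.best_params_["model__alpha"])
print("Cross-validated RMSE:", -search.best_score_)
```

Because the scaler is part of the pipeline, each fold is standardized using only its own training portion, which avoids the leakage issue described in the data preparation step. Swapping `Ridge` for `Lasso` or `LinearRegression` only requires changing the `model` step and the grid.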

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.
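
For a linear model, one simple way to read off price drivers is to rank the coefficients of standardized features by absolute size. The sketch below does this with a `Lasso` model on synthetic data; the feature names, the price formula, and `alpha=10.0` are assumptions for illustration, so the ranking says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features (illustrative only)
rng = np.random.default_rng(1)
n = 500
features = pd.DataFrame({
    "age": rng.uniform(0, 20, size=n),
    "odometer": rng.uniform(0, 200_000, size=n),
    "noise": rng.normal(size=n),
})

# Assumed price formula: older and higher-mileage cars are cheaper
price = (30_000 - 800 * features["age"] - 0.05 * features["odometer"]
         + rng.normal(scale=1_000, size=n))

# Standardize so that coefficient sizes are comparable across features
X_scaled = StandardScaler().fit_transform(features)
model = Lasso(alpha=10.0).fit(X_scaled, price)

# Rank features by the absolute size of their coefficient
coefs = pd.Series(model.coef_, index=features.columns)
ranking = coefs.reindex(coefs.abs().sort_values(ascending=False).index)

print(ranking)
```

The sign of each coefficient gives the direction of the effect, and its magnitude gives the change in price per one standard deviation of the feature. This sketch fits on all rows because it is only about interpretation; any performance figure should still come from held-out data.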

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.