# Introduction

This analysis examines a dataset of vehicle listings along five dimensions: pricing, brand and model popularity, vehicle age, condition, and time on the market. The goal is to identify the main patterns in the listings, in particular how price relates to model year and which makes and models appear most often.

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
from matplotlib import pyplot as plt
from scipy import stats as st
import math

In [2]:
# Attempt to import data, handle exception, give feedback if successful
file_path = "../vehicles_us.csv"

try:
    vehicles_df = pd.read_csv(file_path)
except FileNotFoundError as error_msg:
    print(f"Error reading file: {error_msg}. Try again!")
else:
    print(f"The file at path: [{file_path}] was imported.")
    print("The import was saved to the variable: [vehicles_df]")

The file at path: [../vehicles_us.csv] was imported.
The import was saved to the variable: [vehicles_df]


## 1. Data Cleaning 

In [3]:
# Print general information
vehicles_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    47906 non-null  float64
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     46265 non-null  float64
 5   fuel          51525 non-null  object 
 6   odometer      43633 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   42258 non-null  object 
 10  is_4wd        25572 non-null  float64
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
dtypes: float64(4), int64(2), object(7)
memory usage: 5.1+ MB


**Observations**

The dataset contains 51,525 entries. Among these, various columns exhibit missing data, namely:

- `model_year`
- `cylinders`
- `odometer`
- `paint_color`
- `is_4wd`

For the analysis, a standardized approach will be employed to address these missing values:

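- Missing values in `model_year` will be replaced with the column median. Missing values in `cylinders` and `odometer` will be replaced with the median and mean, respectively, of vehicles sharing the same `model` and `model_year`, falling back to the overall median and mean where a group has no known values.
- The `paint_color` column's missing values will be filled with the categorical value `"unknown"`.
- Missing values in the `is_4wd` column will be filled with `0`. From an analytical perspective, it is preferable to misclassify a four-wheel-drive vehicle as not having it than to label a vehicle without four-wheel drive as having it.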

Repeated values within individual columns are expected and are not a concern: multiple cars can share the same mileage or posting date. Fully duplicated rows are a different matter, since they could mean the same listing was recorded twice, so the row-level duplicate count is checked once the missing values have been filled.
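
The difference between repeated column values and fully duplicated rows can be sketched on a toy frame. The values below are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy listings (hypothetical values)
toy = pd.DataFrame({
    "price": [9400, 9400, 5500, 5500],
    "odometer": [145000, 88705, 110000, 110000],
    "date_posted": ["2018-06-23", "2018-06-23", "2019-02-07", "2019-02-07"],
})

# Repeated values inside a single column are expected and harmless
repeated_prices = toy["price"].duplicated().sum()

# Rows that match an earlier row in every column are the ones worth reviewing
full_row_duplicates = toy.duplicated().sum()

print(repeated_prices, full_row_duplicates)
```

Only the row-level count says anything about possible double entries, which is why `DataFrame.duplicated()` is called without a `subset` here.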

Furthermore, certain columns will be repurposed and derived from existing ones:
1. Convert the `date_posted` column to `datetime` format.
2. Create `year_posted`, `month_posted`, and `day_posted` columns from `date_posted`, enabling comparisons at daily, monthly, and yearly intervals.
3. Introduce an `id` column to assign each car a unique identifier. Currently, the index serves as the ID, but adding a unique ID ensures persistence even if the index is reset.
4. Split the existing `model` column into `make` and `model` columns. The `make` column will denote the brand (e.g., BMW, Honda, Subaru), while the `model` column will specify the model (e.g., X5, F-150, Sonata). This division facilitates comparisons by brand or specific models.

In [4]:
# Fill missing values. Results are assigned back to the column instead of
# calling fillna(inplace=True) on a column selection, which recent pandas
# versions warn about.

try:
    group_cols = ['model', 'model_year']

    # 'model_year': overall median, rounded to a whole year
    vehicles_df['model_year'] = vehicles_df['model_year'].fillna(round(vehicles_df['model_year'].median()))

    # 'cylinders': median within the same model and model year
    vehicles_df['cylinders'] = vehicles_df['cylinders'].fillna(vehicles_df.groupby(group_cols)['cylinders'].transform('median'))

    # 'odometer': mean within the same model and model year
    vehicles_df['odometer'] = vehicles_df['odometer'].fillna(vehicles_df.groupby(group_cols)['odometer'].transform('mean'))

    # 'paint_color': placeholder category
    vehicles_df['paint_color'] = vehicles_df['paint_color'].fillna("unknown")

    # 'is_4wd': treat missing as not four-wheel drive
    vehicles_df['is_4wd'] = vehicles_df['is_4wd'].fillna(0)

    # Groups with no known values are still missing: fall back to the overall median and mean
    vehicles_df['cylinders'] = vehicles_df['cylinders'].fillna(vehicles_df['cylinders'].median())
    vehicles_df['odometer'] = vehicles_df['odometer'].fillna(vehicles_df['odometer'].mean())

except Exception as e:
    print("An error occurred:", e)
else:
    # Print number of missing values for each column after filling
    for col in ['model_year', 'cylinders', 'odometer', 'paint_color', 'is_4wd']:
        print(f"Missing values for '{col}': {vehicles_df[col].isna().sum()}")

Missing values for 'model_year': 0
Missing values for 'cylinders': 0
Missing values for 'odometer': 0
Missing values for 'paint_color': 0
Missing values for 'is_4wd': 0
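
The reason a second, overall fill is needed after the grouped fill can be seen on a small example. This is a minimal sketch on invented values: the `x5` group has no known odometer reading, so its group mean is itself missing.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): the 'x5' group has no known odometer at all
toy = pd.DataFrame({
    "model": ["f-150", "f-150", "f-150", "x5"],
    "model_year": [2011, 2011, 2011, 2011],
    "odometer": [100000.0, np.nan, 120000.0, np.nan],
})

# Step 1: fill from the mean of the matching (model, model_year) group
group_mean = toy.groupby(["model", "model_year"])["odometer"].transform("mean")
toy["odometer"] = toy["odometer"].fillna(group_mean)
still_missing = toy["odometer"].isna().sum()  # the 'x5' row is still NaN

# Step 2: fall back to the overall mean for groups with no known values
toy["odometer"] = toy["odometer"].fillna(toy["odometer"].mean())

print(still_missing, toy["odometer"].tolist())
```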


In [5]:
# Check for duplicates
vehicles_df.duplicated().sum()

0

In [6]:
# Cast the filled float columns to integers
columns_to_convert = ['model_year', 'cylinders', 'odometer', 'is_4wd']
vehicles_df[columns_to_convert] = vehicles_df[columns_to_convert].astype(int)
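
One detail worth knowing about the cast above: `astype(int)` drops the fractional part instead of rounding, which matters for mean-filled columns such as `odometer`. The sketch below uses invented readings; whether truncation or rounding is preferable for this analysis is a judgment call.

```python
import pandas as pd

# Hypothetical odometer readings, one of them a fractional mean-filled value
odometer = pd.Series([88705.0, 115553.9, 110000.0])

truncated = odometer.astype(int)        # fractional part is dropped
rounded = odometer.round().astype(int)  # rounded to the nearest integer first

print(truncated.tolist(), rounded.tolist())
```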

In [7]:
# Feature engineering dates based on the date_posted column
vehicles_df['date_posted'] = pd.to_datetime(vehicles_df['date_posted'])
vehicles_df['year_posted'] = vehicles_df['date_posted'].dt.year
vehicles_df['month_posted'] = vehicles_df['date_posted'].dt.month
vehicles_df['day_posted'] = vehicles_df['date_posted'].dt.day

# Separate out the make and model so analysis can be done on both
vehicles_df[['make', 'model']] = vehicles_df['model'].str.split(' ', n=1, expand=True)
vehicles_df.insert(2, 'make', vehicles_df.pop('make'))
vehicles_df.insert(3, 'model', vehicles_df.pop('model'))

In [8]:
# Create an ID column based on index. If the index is reset later, the ID will still map back to original
vehicles_df.insert(0, 'id', vehicles_df.index)

**Observations**

Upon examining the data, it's apparent that some values are treated as unique when they shouldn't be. For instance, variations in spelling or formatting, such as "f150" and "f-150," represent the same entity and should be unified for consistency.

Furthermore, string values in columns should be converted to title case to enhance readability and presentation during data visualization. This ensures a cleaner and more uniform dataset.
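
As a sketch of how such variants could be unified in one step, a mapping from each variant to its canonical spelling can be passed to `Series.replace`. The mapping and the sample names below are illustrative assumptions; a real mapping would come from inspecting the unique values.

```python
import pandas as pd

# Toy model names (hypothetical sample)
models = pd.Series(["f150", "f-150", "silverado", "silverado 1500", "x5"])

# Map each known variant to its canonical spelling; unmapped names pass through.
# With a dict, replace() matches whole values, so "silverado 1500" is untouched.
canonical = {"f150": "f-150", "silverado": "silverado 1500"}
cleaned = models.replace(canonical)

# Title-case for display in charts
titled = cleaned.str.title()

print(titled.tolist())
```

Keeping the variants in a single dictionary makes it easy to add further spellings later without writing one `.loc` assignment per variant.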

In [9]:
# Look for any values that should be combined
unique_models = vehicles_df['model'].unique()

In [10]:
# Combine values that refer to the same model
vehicles_df.loc[vehicles_df['model'] == "silverado", 'model'] = 'silverado 1500'
vehicles_df.loc[vehicles_df['model'] == "f150", 'model'] = 'f-150'

In [11]:
# Title casing specific values

columns_to_title = ['condition', 'make', 'paint_color', 'transmission', 'fuel']

for col in columns_to_title:
    vehicles_df[col] = vehicles_df[col].str.title()

In [12]:
vehicles_df

|  | id | price | model_year | make | model | condition | cylinders | fuel | odometer | transmission | type | paint_color | is_4wd | date_posted | days_listed | year_posted | month_posted | day_posted |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 9400 | 2011 | Bmw | x5 | Good | 6 | Gas | 145000 | Automatic | SUV | Unknown | 1 | 2018-06-23 | 19 | 2018 | 6 | 23 |
| 1 | 1 | 25500 | 2011 | Ford | f-150 | Good | 6 | Gas | 88705 | Automatic | pickup | White | 1 | 2018-10-19 | 50 | 2018 | 10 | 19 |
| 2 | 2 | 5500 | 2013 | Hyundai | sonata | Like New | 4 | Gas | 110000 | Automatic | sedan | Red | 0 | 2019-02-07 | 79 | 2019 | 2 | 7 |
| 3 | 3 | 1500 | 2003 | Ford | f-150 | Fair | 8 | Gas | 175165 | Automatic | pickup | Unknown | 0 | 2019-03-22 | 9 | 2019 | 3 | 22 |
| 4 | 4 | 14900 | 2017 | Chrysler | 200 | Excellent | 4 | Gas | 80903 | Automatic | sedan | Black | 0 | 2019-04-02 | 28 | 2019 | 4 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 51520 | 51520 | 9249 | 2013 | Nissan | maxima | Like New | 6 | Gas | 88136 | Automatic | sedan | Black | 0 | 2018-10-03 | 37 | 2018 | 10 | 3 |
| 51521 | 51521 | 2700 | 2002 | Honda | civic | Salvage | 4 | Gas | 181500 | Automatic | sedan | White | 0 | 2018-11-14 | 22 | 2018 | 11 | 14 |
| 51522 | 51522 | 3950 | 2009 | Hyundai | sonata | Excellent | 4 | Gas | 128000 | Automatic | sedan | Blue | 0 | 2018-11-15 | 32 | 2018 | 11 | 15 |
| 51523 | 51523 | 7455 | 2013 | Toyota | corolla | Good | 4 | Gas | 139573 | Automatic | sedan | Black | 0 | 2018-07-02 | 71 | 2018 | 7 | 2 |


In [13]:
# Export the updated data for app.py
vehicles_df.to_csv('../vehicles_data.csv', index=False)

## 2. Data Visualization

In [14]:
# price distribution

price_dist = px.histogram(vehicles_df, x="price", nbins=50)
price_dist.update_layout(title_text="Distribution of Vehicle Price", yaxis_title='Frequency', xaxis_title='Price', bargap=0.2)
price_dist.update_traces(marker_color='rgb(136, 204, 238)')
price_dist.show()

**Summary:** 

Most prices fall between $0 and $30,000, with approximately 28,000 cars in the $0 to $10,000 bracket. This is plausible given the wide range of vehicle ages, mileages, and conditions in the dataset. A few listings with exceptionally high prices appear to be outliers.
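
One common way to flag such high-price outliers is the interquartile-range (IQR) rule. The sketch below applies it to invented prices; the 1.5 multiplier is the conventional default, not a threshold chosen by this analysis.

```python
import pandas as pd

# Hypothetical prices with one extreme value
prices = pd.Series([1500, 5500, 9400, 14900, 25500, 375000])

# Quartiles and interquartile range
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1

# Values above the upper fence are flagged as potential outliers
upper_fence = q3 + 1.5 * iqr
outliers = prices[prices > upper_fence]

print(outliers.tolist())
```

Flagged rows are candidates for review, not automatic removals, since a rare or collectible vehicle can legitimately carry a very high price.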

In [15]:
# model_year distribution

vehicle_year_dist = px.histogram(vehicles_df, x="model_year",  nbins=200)
vehicle_year_dist.update_layout(title_text="Distribution of Vehicle Year", yaxis_title='Frequency', xaxis_title='Year of Model',  bargap=0.2)
vehicle_year_dist.show()

**Summary:** 

Most vehicles in the dataset have a model year of 2000 or later. There are also records dating back to the 1900s, with frequency decreasing as vehicle age increases. The year 2010 shows the highest frequency. Note that the 3,619 missing `model_year` values were filled with the median year during cleaning, which inflates the count for that year.
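
Because missing `model_year` values were filled with the median during cleaning, the most frequent year can shift as a side effect of the fill. A minimal sketch of how to check this, assuming a mask of the originally missing rows is recorded before filling (the `was_imputed` name and the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical model years with four missing entries
years = pd.Series([2010, 2010, 2010, 2011, 2012, 2012, 2013,
                   np.nan, np.nan, np.nan, np.nan])

# Record which rows were missing before filling
was_imputed = years.isna()
filled = years.fillna(round(years.median()))

# Most frequent year with and without the filled rows
mode_all = filled.value_counts().idxmax()
mode_observed = filled[~was_imputed].value_counts().idxmax()

print(mode_all, mode_observed)
```

If the two values differ, the peak in the histogram is partly produced by the fill and not by the listings themselves.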

In [16]:
# make distribution

vehicle_make_dist = px.histogram(vehicles_df, x='make')
vehicle_make_dist.update_xaxes(tickangle=45) 
vehicle_make_dist.update_layout(title_text="Distribution of Vehicle Make", yaxis_title='Frequency', xaxis_title='Vehicle Make', bargap=0.2)
vehicle_make_dist.update_traces(marker_color='rgb(248, 156, 116)')
vehicle_make_dist.show()

**Summary:** 

Ford and Chevrolet are the most common makes in the dataset. American brands in general (Ford, Chevrolet, Ram, GMC, and Jeep) are among the most frequent. Among non-American brands, the Japanese makes Honda, Toyota, and Nissan are also well represented.

In [17]:
# model distribution (top 25)

model_counts = vehicles_df['model'].value_counts()
top_n_models = model_counts.head(25)

vehicle_model_dist = px.histogram(x=top_n_models.index, y=top_n_models.values)
vehicle_model_dist.update_xaxes(tickangle=45) 
vehicle_model_dist.update_layout(title_text="Distribution of Vehicle Model (Top 25)", yaxis_title='Frequency', xaxis_title='Vehicle Model', bargap=0.2)
vehicle_model_dist.update_traces(marker_color='rgb(102, 194, 165)')
vehicle_model_dist.show()

**Summary:** 

The most frequent models are 'Silverado 1500', 'F-150', '1500', 'Wrangler', and '2500', all of which are built by American brands (Chevrolet, Ford, Jeep, and Ram). This is consistent with the make distribution above.

In [18]:
# make distribution by condition

vehicle_make_cond_dist = px.histogram(vehicles_df, x='make', color='condition')
vehicle_make_cond_dist.update_layout(title_text="Distribution of Vehicle Make by Condition", yaxis_title='Frequency', xaxis_title='Vehicle Make', height=800)
vehicle_make_cond_dist.show()

**Summary:** 

The large majority of vehicles are listed as "Excellent" or "Good", with "Like New" the third most common condition. This pattern holds across brands, irrespective of their country of origin. Very few vehicles are listed as "Salvage".
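
Since makes differ widely in how many listings they have, comparing condition shares within each make is easier to read than comparing raw counts. A minimal sketch using `pd.crosstab` on invented rows:

```python
import pandas as pd

# Hypothetical listings
toy = pd.DataFrame({
    "make": ["Ford", "Ford", "Ford", "Ford", "Honda", "Honda"],
    "condition": ["Excellent", "Good", "Good", "Salvage", "Excellent", "Like New"],
})

# normalize="index" turns each make's row into proportions that sum to 1
shares = pd.crosstab(toy["make"], toy["condition"], normalize="index")

print(shares)
```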

In [19]:
# paint color distribution
paint_color_dist = px.histogram(vehicles_df, x='paint_color')
paint_color_dist.update_layout(title_text="Distribution of Vehicle Paint Color", yaxis_title='Frequency', xaxis_title='Vehicle Paint Color')
paint_color_dist.update_traces(marker_color='rgb(102, 166, 30)')
paint_color_dist.show()

**Summary:** 

The most common colors are white, black, and silver. Approximately 9,000 vehicles are labeled 'Unknown' because their color was missing in the source data.

In [20]:
# price vs model year scatter (colored by make)

price_year_scatter = px.scatter(vehicles_df, x='model_year', y='price', title="Price vs Year (By Vehicle Make)", labels={'model_year': 'Year', 'price': 'Price ($)', 'make': "Vehicle Make"}, opacity=0.7, color='make', color_discrete_sequence=px.colors.qualitative.G10)
price_year_scatter.show()

**Summary:** 

The scatter plot shows a relationship between price and model year. Vehicles from roughly 1960 to 1975 tend to command higher prices than those from the early 2000s, and prices appear relatively stable between 1980 and 2000. Prices then rise again for recent model years. In other words, both older vehicles (often considered antiques) and modern vehicles carry higher price tags than those manufactured in between. Coloring the points by make gives little evidence that this pattern differs by brand.
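
A scatter plot with tens of thousands of points can hide the typical price level, so one way to check this pattern is to compare median prices per decade. The sketch below uses invented values shaped to mimic the described pattern; it illustrates the method and is not a result from the dataset.

```python
import pandas as pd

# Hypothetical listings
toy = pd.DataFrame({
    "model_year": [1965, 1968, 1985, 1995, 2003, 2005, 2016, 2018],
    "price": [30000, 26000, 6000, 5000, 4000, 5000, 22000, 28000],
})

# Bucket each model year into its decade (e.g. 1968 -> 1960)
toy["decade"] = (toy["model_year"] // 10) * 10

# The median is used because it is less sensitive to extreme prices than the mean
median_by_decade = toy.groupby("decade")["price"].median()

print(median_by_decade)
```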

In [21]:
# price vs days listed scatter

price_days_listed_scatter = px.scatter(vehicles_df, x='days_listed', y='price', title="Price vs Days Listed", labels={"days_listed": "Days Listed", 'price': 'Price ($)'}, opacity=0.3)
price_days_listed_scatter.show()

**Summary:** 

The scatter suggests that listings with fewer days on the market tend to carry higher prices, while listings that have stayed up longer tend to show lower prices. Note that `days_listed` measures how long an ad has been posted, not the age of the vehicle, so this pattern should not be read as depreciation. A scatter plot alone also cannot show whether price influences listing duration or the reverse.
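
A visual trend in a dense scatter is easy to over-read, so a rank correlation between `days_listed` and `price` is one way to quantify it. The sketch below computes Spearman's coefficient with pandas on invented values; the same call on `vehicles_df[["days_listed", "price"]]` would give the figure for the real data.

```python
import pandas as pd

# Hypothetical listings
toy = pd.DataFrame({
    "days_listed": [5, 10, 20, 30, 40, 50, 70, 80],
    "price": [15000, 9500, 25000, 1500, 9000, 5500, 7500, 4000],
})

# Spearman correlation works on ranks, so it is robust to extreme prices
rho = toy.corr(method="spearman").loc["days_listed", "price"]

print(round(rho, 3))
```

A coefficient near zero would indicate that the apparent trend is weak, while a clearly negative value would support the observation that longer-listed vehicles tend to be cheaper.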

# Conclusion

Through a comprehensive analysis of the dataset, several key trends have emerged, shedding light on various aspects of the automotive market.

1. Price Dynamics and Vehicle Age: There is a discernible relationship between the price of vehicles and their age. Generally, newer vehicles tend to be priced higher, while older vehicles, particularly those considered antiques, also command higher prices. Vehicles from the early 2000s typically fetch lower prices compared to both older and newer counterparts.

2. Condition: Vehicles in excellent or good condition dominate the listings, with "like new" the third most common condition. Few vehicles are labeled as salvage. Since the data describes listings, not completed sales, this reflects what sellers offer and is at most indirect evidence of buyer preference.

3. Brand and Model Popularity: American brands such as Ford, Chevrolet, Ram, GMC, and Jeep are prominent in the dataset, with models like the Silverado 1500, F-150, and Wrangler being highly represented. Additionally, Japanese brands like Honda, Toyota, and Nissan also enjoy significant presence.

4. Color Preferences: The most popular colors among vehicles include white, black, and silver. However, a considerable number of vehicles have an unknown color, indicating incomplete data.

5. Market Duration and Pricing: Listings with fewer days on the market tend to show higher prices, while listings that stay up longer tend to show lower ones. Because `days_listed` measures time on the market and not vehicle age, this is better read as a link between price and listing duration than as depreciation.

In summary, the listings show patterns tied to vehicle age, condition, brand, and market duration. Newer vehicles generally command higher prices, and a small group of American and Japanese makes and models are the most frequently listed, although the price-versus-year pattern showed little variation by brand. Understanding these patterns is useful for both buyers and sellers navigating the automotive market.