In this notebook I analyze the vehicles dataset to see how price varies with model year and with mileage. I also create a box plot of price by model, which will help me check for outliers and for variation in price across different models.

In [1]:
# Importing all necessary libraries
import altair
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import streamlit as st


In [2]:
vehicles_df = pd.read_csv('vehicles_us.csv') # Importing the dataset
vehicles_df.head(5) # Displaying the first section of data

Unnamed: 0,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed
0,9400,2011.0,bmw x5,good,6.0,gas,145000.0,automatic,SUV,,1.0,2018-06-23,19
1,25500,,ford f-150,good,6.0,gas,88705.0,automatic,pickup,white,1.0,2018-10-19,50
2,5500,2013.0,hyundai sonata,like new,4.0,gas,110000.0,automatic,sedan,red,,2019-02-07,79
3,1500,2003.0,ford f-150,fair,8.0,gas,,automatic,pickup,,,2019-03-22,9
4,14900,2017.0,chrysler 200,excellent,4.0,gas,80903.0,automatic,sedan,black,,2019-04-02,28


In [3]:
vehicles_df.duplicated().sum() # Checking for duplicate values, and none were found

np.int64(0)

In [4]:
vehicles_df.isna().sum() # Checking for missing values. Some were found, possibly because those fields were not recorded when the data was collected

price               0
model_year       3619
model               0
condition           0
cylinders        5260
fuel                0
odometer         7892
transmission        0
type                0
paint_color      9267
is_4wd          25953
date_posted         0
days_listed         0
dtype: int64
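
The raw counts above are easier to judge as a share of all rows. The following is a minimal sketch of that calculation; the helper name `missing_share` is my own, and the small `preview` frame only re-creates the five rows displayed earlier so that the snippet runs on its own. On the real data the call would be `missing_share(vehicles_df)`, made before any gaps are filled.

```python
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, largest first.

    Columns without any gaps are left out of the result.
    """
    share = df.isna().mean() * 100
    return share[share > 0].sort_values(ascending=False).round(1)


# The five preview rows from head(5), limited to a few columns.
# Five rows are far too few to describe the full dataset.
preview = pd.DataFrame({
    "price": [9400, 25500, 5500, 1500, 14900],
    "model_year": [2011.0, None, 2013.0, 2003.0, 2017.0],
    "odometer": [145000.0, 88705.0, 110000.0, None, 80903.0],
    "paint_color": [None, "white", "red", None, "black"],
    "is_4wd": [1.0, 1.0, None, None, None],
})

result = missing_share(preview)
print(result)
```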

In [5]:
vehicles_df = vehicles_df.fillna('unknown') # Filling in missing values with 'unknown'. I kept the rows rather than removing them so that I do not lose the data in their other columns. Note that numeric columns such as model_year and odometer now hold a mix of numbers and text
vehicles_df.head(5)

Unnamed: 0,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed
0,9400,2011.0,bmw x5,good,6.0,gas,145000.0,automatic,SUV,unknown,1.0,2018-06-23,19
1,25500,unknown,ford f-150,good,6.0,gas,88705.0,automatic,pickup,white,1.0,2018-10-19,50
2,5500,2013.0,hyundai sonata,like new,4.0,gas,110000.0,automatic,sedan,red,unknown,2019-02-07,79
3,1500,2003.0,ford f-150,fair,8.0,gas,unknown,automatic,pickup,unknown,unknown,2019-03-22,9
4,14900,2017.0,chrysler 200,excellent,4.0,gas,80903.0,automatic,sedan,black,unknown,2019-04-02,28
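
As the preview shows, filling every gap with the same string also puts text into numeric columns such as `model_year` and `odometer`. One possible alternative, sketched below, fills each column according to its type. It rests on assumptions that the source does not confirm: that a missing `is_4wd` means the car is not four-wheel drive (the preview only ever shows 1.0 or a gap), and that numeric gaps are better left as NaN so that plots and statistics can skip them. The function name `fill_missing_by_type` is hypothetical, and `sample` re-creates the five preview rows.

```python
import pandas as pd


def fill_missing_by_type(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with gaps filled per column, keeping numeric columns numeric."""
    out = df.copy()
    # ASSUMPTION: a missing is_4wd value means "not four-wheel drive".
    out["is_4wd"] = out["is_4wd"].fillna(0).astype(int)
    # Text column: a placeholder label does not change the column type.
    out["paint_color"] = out["paint_color"].fillna("unknown")
    # model_year, cylinders and odometer are deliberately left as NaN.
    return out


# The five preview rows from head(5), limited to a few columns.
sample = pd.DataFrame({
    "price": [9400, 25500, 5500, 1500, 14900],
    "model_year": [2011.0, None, 2013.0, 2003.0, 2017.0],
    "odometer": [145000.0, 88705.0, 110000.0, None, 80903.0],
    "paint_color": [None, "white", "red", None, "black"],
    "is_4wd": [1.0, 1.0, None, None, None],
})

filled = fill_missing_by_type(sample)
print(filled)
print(filled.dtypes)
```

With this approach `isna().sum()` would still report gaps in the numeric columns, which is intended: those rows are kept, and only the calculations that need the missing value skip them.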


In [6]:
vehicles_df.isna().sum() # Confirming missing values have all been filled

price           0
model_year      0
model           0
condition       0
cylinders       0
fuel            0
odometer        0
transmission    0
type            0
paint_color     0
is_4wd          0
date_posted     0
days_listed     0
dtype: int64

In [7]:
# Now that my data has been cleaned up, I will begin my data analysis
# Starting by comparing price across model years
# histfunc="avg" plots the average price per year; with y set, the default would sum the prices instead
fig = px.histogram(vehicles_df, x="model_year", y="price", histfunc="avg", title="Average Price by Model Year")
fig.show()

This histogram shows how price varies with model year. The general trend is that the newer the car, the higher its price.
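
To back up the trend with numbers, the typical price per model year can be tabulated directly. This is a minimal sketch rather than part of the original analysis: the helper `median_price_by_year` is a name I made up, the median is my choice of summary because it is less sensitive to a few very expensive listings, and `sample` re-creates the five preview rows, including the 'unknown' placeholder. On the real data the call would be `median_price_by_year(vehicles_df)`.

```python
import pandas as pd


def median_price_by_year(df: pd.DataFrame) -> pd.Series:
    """Median price for each model year, ignoring rows without a usable year."""
    # Placeholder strings such as 'unknown' become NaN and are then dropped.
    year = pd.to_numeric(df["model_year"], errors="coerce")
    return (
        df.assign(model_year=year)
        .dropna(subset=["model_year"])
        .groupby("model_year")["price"]
        .median()
    )


# The five preview rows after the fill step, limited to two columns.
sample = pd.DataFrame({
    "price": [9400, 25500, 5500, 1500, 14900],
    "model_year": [2011.0, "unknown", 2013.0, 2003.0, 2017.0],
})

by_year = median_price_by_year(sample)
print(by_year)
```

Five rows cannot confirm or refute the trend; the sample only shows that the function runs and that the row with an unknown year is excluded.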

In [8]:
# Now comparing each car's mileage with the price it is being sold for
fig = px.scatter(vehicles_df, x="odometer", y="price", title="Mileage vs. Price")
fig.show()

This scatter plot shows the relationship between a car's mileage and the price it is being sold for. The general trend is that the more miles a car has, the lower its price.
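
The direction of this relationship can also be checked with a correlation coefficient. The sketch below is an assumption about how that could be done, not something the original analysis ran: `mileage_price_correlation` is a hypothetical helper, it uses the Pearson correlation (the pandas default), and it first converts `odometer` back to numbers because the fill step left 'unknown' strings in that column. `sample` re-creates the five preview rows.

```python
import pandas as pd


def mileage_price_correlation(df: pd.DataFrame) -> float:
    """Pearson correlation between odometer and price over rows with a known mileage."""
    # 'unknown' placeholders become NaN and are dropped before correlating.
    odometer = pd.to_numeric(df["odometer"], errors="coerce")
    pairs = pd.DataFrame({"odometer": odometer, "price": df["price"]}).dropna()
    return pairs["odometer"].corr(pairs["price"])


# The five preview rows after the fill step, limited to two columns.
sample = pd.DataFrame({
    "price": [9400, 25500, 5500, 1500, 14900],
    "odometer": [145000.0, 88705.0, 110000.0, "unknown", 80903.0],
})

r = mileage_price_correlation(sample)
print(f"correlation on the preview rows: {r:.2f}")
```

A negative value means that price tends to fall as mileage rises. The preview rows only demonstrate the mechanics; the value that matters is the one computed on the full dataset.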

In [9]:
# Comparing the price distribution across different models 
fig = px.box(vehicles_df, x="model", y="price", title="Price Distribution by Model")
fig.show()

I can see that there are outliers for some car models, but for the most part the prices within each model fall in a similar range. The differences may come from newer cars vs. older cars, or from cars that have been customized.
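
To put a number on "some outliers", the listings that fall outside the box plot whiskers can be counted per model. This is a minimal sketch using the common 1.5 × IQR rule; the helper `count_outliers_by_model` is hypothetical, the toy prices below are made up purely for illustration, and because pandas and Plotly may compute quartiles slightly differently, the counts may not match the plotted points exactly.

```python
import pandas as pd


def count_outliers_by_model(df: pd.DataFrame) -> pd.Series:
    """Number of prices per model lying beyond 1.5 * IQR from the quartiles."""
    grouped = df.groupby("model")["price"]
    # Look up each row's own model quartiles.
    q1 = df["model"].map(grouped.quantile(0.25))
    q3 = df["model"].map(grouped.quantile(0.75))
    iqr = q3 - q1
    is_outlier = (df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)
    return is_outlier.groupby(df["model"]).sum()


# Made-up toy data, NOT taken from vehicles_us.csv.
toy = pd.DataFrame({
    "model": ["a"] * 5 + ["b"] * 5,
    "price": [100, 110, 120, 130, 1000, 200, 210, 220, 230, 240],
})

counts = count_outliers_by_model(toy)
print(counts)
```

On the real data, `count_outliers_by_model(vehicles_df).sort_values(ascending=False)` would list the models with the most extreme listings first.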