In this project, we analyze a dataset from MarketCheck that holds information about the used car market in the US.

Aggregated from over 65,000 dealer websites, the dataset contains ~7 million rows and 21 columns.

We'll use the first 2.5 million rows and 15 of the columns to conduct our analysis.

Through the analysis, we aim to:

1. Get to know the 'Used Car Market in the US'.

2. Understand the Used Car Market for newer used cars from years 2010-2021.

3. Answer questions that help a buyer find the best deal on a car.

In [1]:
# IMPORT LIBRARIES
import matplotlib as mpl
import numpy as np
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns
import pandas as pd
import statsmodels.api as sm

In [2]:
# IMPORT DATASET
df_used = pd.read_csv('us-dealers-used.csv')


In [3]:
df_used = pd.read_csv('us-dealers-used.csv',  low_memory=False, nrows=2500000)
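The read above limits the load with `nrows` and passes `low_memory=False`. As a minimal sketch of how those options behave, the example below reads a tiny in-memory CSV rather than the real file; the rows are made up for illustration, and `usecols` is an optional extra that the notebook itself does not use (it selects columns in a later cell instead).

```python
import io
import pandas as pd

# Tiny stand-in for 'us-dealers-used.csv' (hypothetical rows, illustration only)
csv_text = """id,vin,price,year,seller_name
a1,VIN001,20998.0,2015.0,dealer one
b2,VIN002,,2018.0,dealer two
c3,VIN003,11055.0,2018.0,dealer three
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["vin", "price", "year"],  # optional: load only the columns of interest
    nrows=2,                           # stop after the first 2 data rows
    low_memory=False,                  # parse in one pass instead of in chunks
)
print(sample.shape)
print(sample.dtypes)
```

Filtering columns at read time can lower peak memory on a file of this size, but selecting them afterwards, as the notebook does, gives the same resulting frame.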

In [4]:
# show the number of rows and columns used for analysis
df_used.shape

(2500000, 21)

In [5]:
df_used.head()

Unnamed: 0,id,vin,price,miles,stock_no,year,make,model,trim,body_type,...,drivetrain,transmission,fuel_type,engine_size,engine_block,seller_name,street,city,state,zip
0,38b2f52e-8f5d,1GCWGFCF3F1284719,20998.0,115879.0,W1T503168C,2015.0,Chevrolet,Express Cargo,Work Van,Cargo Van,...,RWD,Automatic,E85 / Unleaded,4.8,V,nissan ellicott city,8569 Baltimore National Pike,Ellicott City,MD,21043
1,97ba4955-ccf0,WBY7Z8C59JVB87514,27921.0,7339.0,P33243,2018.0,BMW,i3,s,Hatchback,...,RWD,Automatic,Electric / Premium Unleaded,0.6,I,hendrick honda pompano beach,5381 N Federal Highway,Pompano Beach,FL,33064
2,be1da9fd-0f34,ML32F4FJ2JHF10325,11055.0,39798.0,WM2091A,2018.0,Mitsubishi,Mirage G4,SE,Sedan,...,FWD,Automatic,Unleaded,1.2,I,russ darrow toyota,2700 West Washington St.,West Bend,WI,53095
3,84327e45-6cb6,1GCPTEE15K1291189,52997.0,28568.0,9U2Y425A,2019.0,Chevrolet,Colorado,ZR2,Pickup,...,4WD,Automatic,Diesel,2.8,I,young kia,308 North Main Street,Layton,UT,84041
4,cde691c3-91dd,1G2AL18F087312093,,188485.0,T36625A,2008.0,Pontiac,G5,Base,Coupe,...,FWD,Automatic,Unleaded,2.2,I,pappas toyota,10011 Spencer Rd,Saint Peters,MO,63376


In [6]:
df_used.dtypes

id               object
vin              object
price           float64
miles           float64
stock_no         object
year            float64
make             object
model            object
trim             object
body_type        object
vehicle_type     object
drivetrain       object
transmission     object
fuel_type        object
engine_size     float64
engine_block     object
seller_name      object
street           object
city             object
state            object
zip              object
dtype: object

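The 'year' column needs to be converted from a float to an integer so it can be used as a calendar year. Because NaN cannot be cast to an integer, the conversion is run only after rows with missing values have been dropped (note the cell numbers: the cast is cell 14, after the dropna in cell 12).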

In [14]:
df_used['year'] = df_used['year'].astype(int)

In [15]:
df_used.dtypes

vin              object
price           float64
miles           float64
year              int64
make             object
model            object
body_type        object
vehicle_type     object
drivetrain       object
transmission     object
fuel_type        object
engine_size     float64
city             object
state            object
zip              object
dtype: object
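The order of the cast matters because the raw 'year' column contains missing values. The sketch below uses a three-element toy series (not the real data) to show the failure and two ways around it: dropping the missing rows first, which is what this notebook does, or using pandas' nullable `Int64` dtype, which is offered here only as an alternative.

```python
import numpy as np
import pandas as pd

# Toy 'year' column with one missing value (illustration only)
years = pd.Series([2015.0, 2018.0, np.nan], name="year")

cast_failed = False
try:
    years.astype(int)              # NaN has no integer representation
except (ValueError, TypeError) as err:
    cast_failed = True
    print("astype(int) failed:", err)

# Option 1 (used in this notebook): drop missing rows, then cast
as_int = years.dropna().astype(int)

# Option 2 (alternative): nullable integer dtype keeps the row as <NA>
as_nullable = years.astype("Int64")

print(as_int.tolist())
print(as_nullable.dtype)
```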

We select the 15 of the original 21 columns that are relevant to our analysis.

In [None]:
df_used = df_used[['vin', 'price', 'miles','year','make', 'model', 'body_type', 'vehicle_type',
                   'drivetrain', 'transmission', 'fuel_type', 'engine_size', 'city', 'state', 'zip']]

In [8]:
df_used.shape

(2500000, 15)

We now look at the numeric columns of our dataset to get a rough idea of their basic statistics.

In [11]:
df_used.describe().round(3)

Unnamed: 0,price,miles,year,engine_size
count,2270385.0,2474949.0,2499928.0,2448312.0
mean,27785.537,53168.938,2016.399,2.903
std,19256.314,45979.902,3.899,1.333
min,0.0,0.0,1980.0,0.6
25%,16995.0,21662.0,2015.0,2.0
50%,23997.0,38629.0,2018.0,2.4
75%,34900.0,74602.0,2019.0,3.6
max,1495000.0,2975291.0,2022.0,30.0


Let's start cleaning our data by checking for missing values and duplicates.

First, let's see how many null values exist in our dataset.

In [10]:
df_used.isnull().sum()

vin                  0
price           229615
miles            25051
year                72
make                 0
model             4015
body_type        13922
vehicle_type     19210
drivetrain        7870
transmission      6744
fuel_type        22893
engine_size      51688
city              4208
state             4215
zip               4295
dtype: int64

Price is a key metric and essential for our analysis, so we will remove every row where it is null (229,615 of 2,500,000 rows, about 9.2%).

We will also remove rows with nulls in the other columns. Each of those columns is missing in at most about 2% of rows, so dropping them should not materially distort our results.
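To judge whether the nulls are small relative to the dataset, it helps to look at shares rather than raw counts. The following is a minimal sketch on a four-row toy frame (the values are invented); `toy`, `null_share`, `complete`, and `priced` are hypothetical names used only here. It also contrasts dropping every incomplete row, as this notebook does, with dropping only rows that lack a price.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing values (illustration only)
toy = pd.DataFrame({
    "vin": ["A", "B", "C", "D"],
    "price": [20998.0, np.nan, 11055.0, 52997.0],
    "miles": [115879.0, 7339.0, np.nan, 28568.0],
})

# Percentage of missing values per column
null_share = toy.isnull().mean().mul(100).round(1)
print(null_share)

complete = toy.dropna()                  # drop rows with ANY null (notebook's approach)
priced = toy.dropna(subset=["price"])    # drop only rows with a null price

print(len(toy), len(complete), len(priced))
```

Dropping on a subset keeps more rows, at the cost of having to handle the remaining nulls in later steps.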

In [12]:
df_used = df_used.dropna()

In [13]:
df_used.isnull().sum()

vin             0
price           0
miles           0
year            0
make            0
model           0
body_type       0
vehicle_type    0
drivetrain      0
transmission    0
fuel_type       0
engine_size     0
city            0
state           0
zip             0
dtype: int64

In [16]:
df_used.shape

(2213774, 15)

The Vehicle Identification Number (VIN) is a unique identifier for each individual car. We will check whether any VINs appear more than once and, if so, keep a single row per VIN.

In [17]:
df_used['vin'].duplicated().sum()

1141960

In [18]:
df_used = df_used.drop_duplicates(subset=['vin'])

In [19]:
df_used.vin.duplicated().sum()

0

In [20]:
df_used.shape

(1071814, 15)
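Roughly half of the remaining rows (1,141,960 of 2,213,774) shared a VIN with an earlier row. By default `drop_duplicates` keeps the first occurrence, so which listing survives depends on row order. The sketch below, on an invented five-row frame, shows how one might inspect the repeated VINs before discarding them; the variable names are hypothetical.

```python
import pandas as pd

# Toy listings where VINs A and B appear twice (illustration only)
toy = pd.DataFrame({
    "vin": ["A", "B", "A", "C", "B"],
    "price": [20000.0, 15000.0, 19500.0, 30000.0, 15000.0],
})

# Rows flagged as repeats of an earlier VIN
n_dupes = toy["vin"].duplicated().sum()

# keep=False marks every row of a repeated VIN, useful for inspection
repeated = toy[toy["vin"].duplicated(keep=False)].sort_values("vin")
print(repeated)

# Keep the first listing per VIN (the default behaviour)
deduped = toy.drop_duplicates(subset=["vin"], keep="first")
print(deduped)
```

Here VIN A is listed at two different prices, and the first one is kept. If the latest or cheapest listing were preferred, the frame would need to be sorted accordingly before deduplicating.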

Since we are only interested in used cars from model years 2010-2021, we'll filter the dataframe to keep only those years.

In [21]:
df_used = df_used.loc[(df_used['year'] > 2009) & (df_used['year'] < 2022)]

In [22]:
car_yrs = df_used.year.unique()
car_yrs.sort()
print(car_yrs)

[2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021]


In [23]:
df_used.shape

(995980, 15)
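The year filter reduces the frame from 1,071,814 to 995,980 rows. As a small sketch on toy data, the same inclusive 2010-2021 range can also be written with `Series.between`, which some readers find easier to scan than a pair of strict comparisons; this is an equivalent alternative, not what the notebook runs.

```python
import pandas as pd

# Toy 'year' values on both sides of the range (illustration only)
toy = pd.DataFrame({"year": [2008, 2010, 2015, 2021, 2022]})

# Comparison form used in the notebook
mask_cmp = (toy["year"] > 2009) & (toy["year"] < 2022)

# Equivalent form: between() includes both endpoints by default
mask_between = toy["year"].between(2010, 2021)

filtered = toy.loc[mask_between]
print(filtered["year"].tolist())
```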

In [24]:
df_used.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 995980 entries, 0 to 2499999
Data columns (total 15 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   vin           995980 non-null  object 
 1   price         995980 non-null  float64
 2   miles         995980 non-null  float64
 3   year          995980 non-null  int64  
 4   make          995980 non-null  object 
 5   model         995980 non-null  object 
 6   body_type     995980 non-null  object 
 7   vehicle_type  995980 non-null  object 
 8   drivetrain    995980 non-null  object 
 9   transmission  995980 non-null  object 
 10  fuel_type     995980 non-null  object 
 11  engine_size   995980 non-null  float64
 12  city          995980 non-null  object 
 13  state         995980 non-null  object 
 14  zip           995980 non-null  object 
dtypes: float64(3), int64(1), object(11)
memory usage: 121.6+ MB
