EDA US Cars 

This is a dataset of the cars for sale in the United States. From what I can see in the dataframe is the car price, type, model and general descriptors of the cars. I am going to just do some basic exploratory data analysis and clean up the data to eventually upload this to a web application for my project.

In [184]:
import pandas as pd
import plotly.express as px
import numpy as np

In [185]:
df = pd.read_csv('C:/Users/Joe/project/drhorrible/vehicles_us.csv')
df.sample(15)

Unnamed: 0,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed
20102,15700,2016.0,ford escape,excellent,,gas,,automatic,SUV,,1.0,2018-08-09,14
30518,11995,2018.0,chevrolet cruze,excellent,4.0,gas,23886.0,automatic,hatchback,black,,2018-05-27,55
19027,8995,2015.0,ford escape,excellent,4.0,gas,107000.0,automatic,SUV,black,,2019-03-01,24
3464,16990,2016.0,ford escape,excellent,4.0,gas,,automatic,SUV,grey,1.0,2018-11-02,48
32558,4500,2010.0,toyota corolla,fair,4.0,gas,211000.0,automatic,sedan,green,,2019-01-25,36
33356,16500,2016.0,chevrolet colorado,excellent,6.0,gas,143276.0,automatic,pickup,silver,1.0,2018-07-31,87
38009,9000,1999.0,chevrolet corvette,excellent,8.0,gas,119.0,automatic,coupe,red,,2018-11-29,21
51049,2000,1998.0,nissan maxima,good,6.0,gas,237000.0,automatic,other,blue,,2019-02-19,16
2001,25995,2011.0,ram 3500,good,8.0,diesel,136175.0,automatic,truck,white,1.0,2018-06-28,3
34115,7500,2007.0,toyota tundra,good,8.0,gas,155000.0,automatic,truck,silver,1.0,2018-07-27,31


Ok so first off, there's a lot going on. The model year is a float value, the type of the cars aren't all lowercase, is_4wd looks like it runs off a yes/no system but fills the value with 1 if it fulfills that criteria, and date_posted is not in datetime. All of these are simple fixes that are also verified below by getting the info of the dataframe. Also there's something going on where the price is listed as 1 for some cars, that will be explored further at a later time

In [186]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    47906 non-null  float64
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     46265 non-null  float64
 5   fuel          51525 non-null  object 
 6   odometer      43633 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   42258 non-null  object 
 10  is_4wd        25572 non-null  float64
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
dtypes: float64(4), int64(2), object(7)
memory usage: 5.1+ MB


In [187]:
df.describe()

Unnamed: 0,price,model_year,cylinders,odometer,is_4wd,days_listed
count,51525.0,47906.0,46265.0,43633.0,25572.0,51525.0
mean,12132.46492,2009.75047,6.125235,115553.461738,1.0,39.55476
std,10040.803015,6.282065,1.66036,65094.611341,0.0,28.20427
min,1.0,1908.0,3.0,0.0,1.0,0.0
25%,5000.0,2006.0,4.0,70000.0,1.0,19.0
50%,9000.0,2011.0,6.0,113000.0,1.0,33.0
75%,16839.0,2014.0,8.0,155000.0,1.0,53.0
max,375000.0,2019.0,12.0,990000.0,1.0,271.0


So this gives me a lot of information about the numerical aspects of the data such as the price and odometer. For example something weird is that some of the prices are labeled as 1 and . This is strange because this could either be an error or maybe it's meant for you to contact the owner for pricing. Odometer also has a weird max value of almost 1 million miles, for now I will filter out the cars priced at 1 as it could clutter and mess with our data.

In [188]:
lower_values= df['price'].quantile(0.05)
df = df[df['price'] > lower_values]

I changed the price column to filter out the higher and lower 5% of price values as these are outliers that affect our data 

In [189]:
# changing the date_posted to datetime in case of any date related data analysis later
df['date_posted'] = pd.to_datetime(df['date_posted'])

In [190]:
missing_values = df.isnull().sum()
missing_values

price               0
model_year       3426
model               0
condition           0
cylinders        4980
fuel                0
odometer         7473
transmission        0
type                0
paint_color      8792
is_4wd          24548
date_posted         0
days_listed         0
dtype: int64

In [191]:
df['type']= df['type'].str.lower() 
df['model'] = df['model'].str.replace(' ', '_')

In [192]:
duplicate_vaules = df.duplicated(subset=['model', 'price', 'odometer'])
duplicate_vaules.sum()

np.int64(12130)

If I specify the duplicates to the 3 most general categories of a car it equals out to our data being almost 25% duplicates which is a significant amount

In [193]:
dupes= df[df.duplicated(subset=['model', 'price', 'odometer'], keep=False)]
dupes.head(20)

Unnamed: 0,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed
1,25500,,ford_f-150,good,6.0,gas,88705.0,automatic,pickup,white,1.0,2018-10-19,50
2,5500,2013.0,hyundai_sonata,like new,4.0,gas,110000.0,automatic,sedan,red,,2019-02-07,79
4,14900,2017.0,chrysler_200,excellent,4.0,gas,80903.0,automatic,sedan,black,,2019-04-02,28
5,14990,2014.0,chrysler_300,excellent,6.0,gas,57954.0,automatic,sedan,black,1.0,2018-06-20,15
6,12990,2015.0,toyota_camry,excellent,4.0,gas,79212.0,automatic,sedan,white,,2018-12-27,73
7,15990,2013.0,honda_pilot,excellent,6.0,gas,109473.0,automatic,suv,black,1.0,2019-01-07,68
8,11500,2012.0,kia_sorento,excellent,4.0,gas,104174.0,automatic,suv,,1.0,2018-07-16,19
9,9200,2008.0,honda_pilot,excellent,,gas,147191.0,automatic,suv,blue,1.0,2019-02-15,17
10,19500,2011.0,chevrolet_silverado_1500,excellent,8.0,gas,128413.0,automatic,pickup,black,1.0,2018-09-17,38
11,8990,2012.0,honda_accord,excellent,4.0,gas,111142.0,automatic,sedan,grey,,2019-03-28,29


Okay so I displayed the duplicates but I think I should specify to exact duplicates adding in the days_listed into the filtering. This would filter out true duplicates rather than getting rid of what I would call near duplicates. Adding in the days_listed changes a lot because cars can be relisted and the days_listed is an always changing variable

In [194]:
df_true_duplicates= df.duplicated(subset=['model', 'price', 'odometer', 'days_listed'])
df_true_duplicates.sum()

np.int64(387)

So this now narrows down the true duplicates down to a much smaller amount of duplicates in our data which we will get rid of.

In [195]:
df.drop_duplicates(subset=['model', 'price', 'odometer', 'days_listed'])

Unnamed: 0,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed
0,9400,2011.0,bmw_x5,good,6.0,gas,145000.0,automatic,suv,,1.0,2018-06-23,19
1,25500,,ford_f-150,good,6.0,gas,88705.0,automatic,pickup,white,1.0,2018-10-19,50
2,5500,2013.0,hyundai_sonata,like new,4.0,gas,110000.0,automatic,sedan,red,,2019-02-07,79
4,14900,2017.0,chrysler_200,excellent,4.0,gas,80903.0,automatic,sedan,black,,2019-04-02,28
5,14990,2014.0,chrysler_300,excellent,6.0,gas,57954.0,automatic,sedan,black,1.0,2018-06-20,15
...,...,...,...,...,...,...,...,...,...,...,...,...,...
51520,9249,2013.0,nissan_maxima,like new,6.0,gas,88136.0,automatic,sedan,black,,2018-10-03,37
51521,2700,2002.0,honda_civic,salvage,4.0,gas,181500.0,automatic,sedan,white,,2018-11-14,22
51522,3950,2009.0,hyundai_sonata,excellent,4.0,gas,128000.0,automatic,sedan,blue,,2018-11-15,32
51523,7455,2013.0,toyota_corolla,good,4.0,gas,139573.0,automatic,sedan,black,,2018-07-02,71


In [196]:
# convert model_year to numeric since car manufacturers don't use have years and dropped missing values from the model year. This should be ok as long as the model_year isn't too crucial to the data in my opinion
df['model_year'] = pd.to_numeric(df['model_year'])
df.dropna(subset=['model_year'], inplace=True)
df['model_year'].isna().sum()

np.int64(0)

In [197]:
df['cylinders'].isna().sum()

np.int64(4641)

In [None]:
df['cylinders']= df['cylinders'].fillna(df['cylinders'].median())

np.int64(0)

In [200]:
# I have a bit of a dilemma concerning the odometer. I don't want to just eliminate NaN values but I would not know a safe replacement. Two options I'm thinking of are dropna() or just fillna() with "Unknown"
df.dropna(subset=['odometer'], inplace=True)
df['odometer'].isna().sum()

np.int64(0)

In [201]:
df['paint_color'] = df['paint_color'].fillna('not_listed')
df['paint_color'].isna().sum()

np.int64(0)

In [202]:
df['is_4wd'] = df['is_4wd'].replace(1, "yes",).fillna('no')
df.head()

Unnamed: 0,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed
0,9400,2011.0,bmw_x5,good,6.0,gas,145000.0,automatic,suv,not_listed,yes,2018-06-23,19
2,5500,2013.0,hyundai_sonata,like new,4.0,gas,110000.0,automatic,sedan,red,no,2019-02-07,79
4,14900,2017.0,chrysler_200,excellent,4.0,gas,80903.0,automatic,sedan,black,no,2019-04-02,28
5,14990,2014.0,chrysler_300,excellent,6.0,gas,57954.0,automatic,sedan,black,yes,2018-06-20,15
6,12990,2015.0,toyota_camry,excellent,4.0,gas,79212.0,automatic,sedan,white,no,2018-12-27,73


In [203]:
df.isna().sum()

price           0
model_year      0
model           0
condition       0
cylinders       0
fuel            0
odometer        0
transmission    0
type            0
paint_color     0
is_4wd          0
date_posted     0
days_listed     0
dtype: int64

Ok we have handled all the missing values. Now it's time to do a bit more cleaning like making the type of car all lowercase and then make some graphs and histograms

In [204]:
# Histogram for Car price per car type
fig = px.histogram(df, x='price', color='type', nbins=50, title='Price Distribution by Car Type', opacity=0.7, barmode='overlay')
fig.show()

# Histogram of Price by car's condition
fig= px.histogram(df, x='price', color='condition', nbins=50, title='Price Distribution by Car Condition', opacity=0.7, barmode='overlay')
fig.show()

In [205]:
# Scatter Plot for Price vs Odometer
fig = px.scatter(df, x='odometer', y='price', color='type', title='Price vs. Odometer of Car Type')
fig.show()

# Scatter Plot for Price vs Model Year
fig = px.scatter(df, x='model_year', y='price', color='model', title='Price vs. Model Year')
fig.show()