# Car Sales Data - Exploratory Data Analysis (EDA)

## Introduction

In this project, we will perform Exploratory Data Analysis (EDA) on the car sales dataset. The main objectives are:

1. Preprocess the data by handling missing values and outliers.
2. Visualize key relationships in the data using histograms and scatter plots.
3. Gain insights into how car attributes such as odometer reading and model year relate to vehicle price.


In [3]:
import pandas as pd

# Load the dataset
df = pd.read_csv('/Users/matt/Car-Sales-Advertisements/Car-Sales-Advertisements/vehicles_us.csv')

# Inspect the first few rows
df.head()


| | price | model_year | model | condition | cylinders | fuel | odometer | transmission | type | paint_color | is_4wd | date_posted | days_listed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9400 | 2011.0 | bmw x5 | good | 6.0 | gas | 145000.0 | automatic | SUV | NaN | 1.0 | 2018-06-23 | 19 |
| 1 | 25500 | NaN | ford f-150 | good | 6.0 | gas | 88705.0 | automatic | pickup | white | 1.0 | 2018-10-19 | 50 |
| 2 | 5500 | 2013.0 | hyundai sonata | like new | 4.0 | gas | 110000.0 | automatic | sedan | red | NaN | 2019-02-07 | 79 |
| 3 | 1500 | 2003.0 | ford f-150 | fair | 8.0 | gas | NaN | automatic | pickup | NaN | NaN | 2019-03-22 | 9 |
| 4 | 14900 | 2017.0 | chrysler 200 | excellent | 4.0 | gas | 80903.0 | automatic | sedan | black | NaN | 2019-04-02 | 28 |


In [4]:
# Fill missing values for 'model_year' grouped by 'model' using the median
df['model_year'] = df.groupby('model')['model_year'].transform(lambda x: x.fillna(x.median()))

# Fill missing values for 'cylinders' grouped by 'model' using the median
df['cylinders'] = df.groupby('model')['cylinders'].transform(lambda x: x.fillna(x.median()))

# Fill missing values for 'odometer' grouped by 'model_year' and 'model' using the mean
df['odometer'] = df.groupby(['model_year', 'model'])['odometer'].transform(lambda x: x.fillna(x.mean()))

# Count the missing values that remain in each column
# (paint_color and is_4wd are not imputed in this cell)
df.isna().sum()


price               0
model_year          0
model               0
condition           0
cylinders           0
fuel                0
odometer           83
transmission        0
type                0
paint_color      9267
is_4wd          25953
date_posted         0
days_listed         0
dtype: int64
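
The counts above show that `odometer`, `paint_color`, and `is_4wd` still contain missing values after the grouped imputation. The block below is a minimal sketch of one way to close those gaps, not part of the original notebook. `fill_remaining_missing` is a hypothetical helper, and it rests on two assumptions: that a missing `is_4wd` flag means the vehicle is not four-wheel drive (the preview shows only `1.0` or missing), and that the overall median is an acceptable fallback for the remaining `odometer` gaps. The sample rows are taken from the preview above.

```python
import numpy as np
import pandas as pd


def fill_remaining_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: fill the gaps left after the grouped imputation."""
    out = df.copy()
    # Fall back to the overall median for odometer values that are still missing.
    out['odometer'] = out['odometer'].fillna(out['odometer'].median())
    # Assumption: a missing is_4wd flag means "not four-wheel drive".
    out['is_4wd'] = out['is_4wd'].fillna(0).astype(int)
    # Keep unknown colours as an explicit category instead of dropping rows.
    out['paint_color'] = out['paint_color'].fillna('unknown')
    return out


# A few values copied from the preview rows, for illustration only
sample = pd.DataFrame({
    'odometer': [145000.0, 88705.0, np.nan],
    'is_4wd': [1.0, np.nan, np.nan],
    'paint_color': [None, 'white', 'red'],
})

filled = fill_remaining_missing(sample)
print(filled.isna().sum())
```

On the full dataset this would be called as `df = fill_remaining_missing(df)` before the outlier step. Whether these fill values are appropriate depends on how the listings were collected.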

In [5]:
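# Remove outliers: keep rows between the 1st and 99th percentiles of 'model_year',
# then apply the same trim to 'price' (its percentiles are computed on the already filtered rows)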
q_low_model_year = df['model_year'].quantile(0.01)
q_high_model_year = df['model_year'].quantile(0.99)
df = df[(df['model_year'] >= q_low_model_year) & (df['model_year'] <= q_high_model_year)]

q_low_price = df['price'].quantile(0.01)
q_high_price = df['price'].quantile(0.99)
df = df[(df['price'] >= q_low_price) & (df['price'] <= q_high_price)]
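
The two filters in the cell above follow the same pattern, so they could be folded into one function. The sketch below assumes a hypothetical helper named `trim_by_quantile`. It keeps the inclusive bounds of the original code and reports how many rows a trim removes. The toy prices are made up for illustration.

```python
import pandas as pd


def trim_by_quantile(df: pd.DataFrame, column: str,
                     low: float = 0.01, high: float = 0.99) -> pd.DataFrame:
    """Hypothetical helper: keep rows whose `column` value lies within the
    [low, high] quantile range (bounds inclusive)."""
    lower = df[column].quantile(low)
    upper = df[column].quantile(high)
    return df[df[column].between(lower, upper)]


# Made-up prices: one implausibly low value, one implausibly high value
toy = pd.DataFrame({'price': [1] + [10000] * 99 + [375000]})

trimmed = trim_by_quantile(toy, 'price')
print(f"Rows removed: {len(toy) - len(trimmed)}")
```

Chaining the calls, `trim_by_quantile(trim_by_quantile(df, 'model_year'), 'price')`, would reproduce the order used in the notebook, where the price percentiles are computed after the `model_year` filter.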


In [6]:
import plotly.express as px

# Histogram of prices
fig1 = px.histogram(df, x='price', title='Distribution of Car Prices')
fig1.update_layout(xaxis_title="Price ($)")
fig1.show()

# Scatterplot of odometer vs price
fig2 = px.scatter(df, x='odometer', y='price', title='Odometer vs Price')
fig2.update_layout(xaxis_title="Odometer (miles)", yaxis_title="Price ($)")
fig2.show()
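
A scatter plot shows the direction of a relationship but not its strength. As a rough complement, the sketch below computes the Pearson correlation between `odometer` and `price` on the four preview rows that have an odometer reading. Four rows are far too few to draw conclusions from, so the block only illustrates the call. On the full data it would be `df['odometer'].corr(df['price'])`.

```python
import pandas as pd

# Values copied from the preview rows that have an odometer reading
preview = pd.DataFrame({
    'odometer': [145000.0, 88705.0, 110000.0, 80903.0],
    'price': [9400, 25500, 5500, 14900],
})

# Series.corr computes the Pearson coefficient by default and skips
# rows where either value is missing.
r = preview['odometer'].corr(preview['price'])
print(f"Pearson r between odometer and price: {r:.2f}")
```

A negative coefficient is consistent with higher mileage going along with lower prices. Pearson's coefficient only measures linear association, so a rank-based measure may be a better fit if the scatter plot looks curved.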


## Conclusion

The price histogram shows that most cars are priced between \$5,000 and \$20,000, with a small number of high-end models above that range. The scatter plot shows an inverse relationship between odometer reading and price: cars with lower mileage tend to be priced higher.

Further analysis could explore how other variables, such as brand and condition, affect price.
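
The dataset has no brand column. One possible starting point, sketched below, is to derive the brand from the first word of `model`. This assumes the first word is the manufacturer, as it is in the preview rows, and it would truncate multi-word brand names. The `brand` column name is an illustrative choice, and the rows are copied from the preview.

```python
import pandas as pd

# Model and price values copied from the preview rows
listings = pd.DataFrame({
    'model': ['bmw x5', 'ford f-150', 'hyundai sonata', 'ford f-150', 'chrysler 200'],
    'price': [9400, 25500, 5500, 1500, 14900],
})

# Assumption: the first whitespace-separated token of `model` is the brand.
listings['brand'] = listings['model'].str.split().str[0]

# Median price per brand, highest first
median_by_brand = (
    listings.groupby('brand')['price']
    .median()
    .sort_values(ascending=False)
)
print(median_by_brand)
```

The same `groupby(...).median()` pattern would apply to `condition`.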
