# SDT Sprint Project EDA

This project analyses Crankshaft List data to find which factors affect the price of a vehicle. Hundreds of free vehicle advertisements are published on the Crankshaft List website every day. We will analyse the data collected over the last few years and determine which factors have an impact on a vehicle's price.

In [11]:
# Importing necessary libraries
import pandas as pd
import plotly.express as px

## Importing website data

Here we add the data into a DataFrame and display the info and a sample of the data.

In [26]:
# Importing the data into a pandas DataFrame
try:
    df = pd.read_csv('vehicles_us.csv')
except FileNotFoundError:
    df = pd.read_csv('/datasets/vehicles_us.csv')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    47906 non-null  float64
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     46265 non-null  float64
 5   fuel          51525 non-null  object 
 6   odometer      43633 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   42258 non-null  object 
 10  is_4wd        25572 non-null  float64
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
dtypes: float64(4), int64(2), object(7)
memory usage: 5.1+ MB


|   | price | model_year | model | condition | cylinders | fuel | odometer | transmission | type | paint_color | is_4wd | date_posted | days_listed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9400 | 2011.0 | bmw x5 | good | 6.0 | gas | 145000.0 | automatic | SUV | NaN | 1.0 | 2018-06-23 | 19 |
| 1 | 25500 | NaN | ford f-150 | good | 6.0 | gas | 88705.0 | automatic | pickup | white | 1.0 | 2018-10-19 | 50 |
| 2 | 5500 | 2013.0 | hyundai sonata | like new | 4.0 | gas | 110000.0 | automatic | sedan | red | NaN | 2019-02-07 | 79 |
| 3 | 1500 | 2003.0 | ford f-150 | fair | 8.0 | gas | NaN | automatic | pickup | NaN | NaN | 2019-03-22 | 9 |
| 4 | 14900 | 2017.0 | chrysler 200 | excellent | 4.0 | gas | 80903.0 | automatic | sedan | black | NaN | 2019-04-02 | 28 |


There are 51525 vehicles and 13 columns of relevant information.

According to the documentation:
- `price` = price of the vehicle
- `model_year` = model year of the vehicle
- `model` = model of the vehicle
- `condition` = condition of the vehicle (excellent, good, fair, etc.)
- `cylinders` = number of cylinders in the vehicle
- `fuel` = type of fuel the vehicle takes (gas, diesel, etc.)
- `odometer` = the mileage of the vehicle when it was published to the website
- `transmission` = automatic vs. manual
- `paint_color` = color of the vehicle
- `is_4wd` = if the vehicle has 4-wheel drive (Boolean)
- `date_posted` = the date the vehicle was published to the site
- `days_listed` = how many days the vehicle was listed on the site before removal

## Checking for duplicate and/or missing data

We will now check the DataFrame for duplicate entries and explore the columns that contain null values in order to determine if we will need to use any default values.

### Duplicate data

In [23]:
df.duplicated().sum()

0

The table does not include a unique ID for each vehicle, so we look for duplicates across the whole DataFrame, meaning rows where every column matches. None are found, so no further action is required.
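To illustrate what the whole-row check does, here is a minimal sketch on a toy DataFrame. The values are made up for illustration and are not taken from the dataset. `duplicated()` flags a row only when every column matches an earlier row, while the optional `subset=` argument restricts the comparison to chosen columns.

```python
import pandas as pd

# Toy data, made up for illustration only (not rows from vehicles_us.csv)
toy = pd.DataFrame({
    'price': [9400, 9400, 5500],
    'model': ['bmw x5', 'bmw x5', 'hyundai sonata'],
    'date_posted': ['2018-06-23', '2018-07-01', '2019-02-07'],
})

# Whole-row comparison: the first two rows differ in date_posted,
# so neither is counted as a duplicate
full_row_dupes = toy.duplicated().sum()

# Comparison on a subset of columns: the second row repeats the
# price/model pair of the first row
subset_dupes = toy.duplicated(subset=['price', 'model']).sum()

print(full_row_dupes, subset_dupes)
```

Because two different listings can legitimately share a price and model, the whole-row check is the more conservative choice when no ID column is available.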

### Missing Data

In [28]:
df.isna().sum()

price               0
model_year       3619
model               0
condition           0
cylinders        5260
fuel                0
odometer         7892
transmission        0
type                0
paint_color      9267
is_4wd          25953
date_posted         0
days_listed         0
dtype: int64

The only columns with missing values are `model_year`, `cylinders`, `odometer`, `paint_color`, and `is_4wd`.
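Raw counts can be easier to compare as percentages of all rows. The sketch below shows one way to compute that with `isna().mean()`. It runs on a toy DataFrame with made-up values so it is self-contained; the same expression could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    'price': [9400, 25500, 5500, 1500],
    'model_year': [2011.0, np.nan, 2013.0, 2003.0],
    'odometer': [145000.0, 88705.0, np.nan, np.nan],
})

# isna() gives True/False per cell; the column mean of those booleans
# is the fraction of missing values in each column
missing_share = toy.isna().mean().mul(100).round(1)

# Keep only columns that actually have missing values, largest first
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)

print(missing_share)
```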

#### `is_4wd`

In [30]:
df['is_4wd'].value_counts(dropna=False)

is_4wd
NaN    25953
1.0    25572
Name: count, dtype: int64

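Because this is a Boolean column whose only recorded value is 1.0, we assume a missing value means the vehicle does not have 4-wheel drive, and we replace all missing values with 0 to represent **False**.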

In [34]:
df['is_4wd'] = df['is_4wd'].fillna(0)
df['is_4wd'].value_counts()

is_4wd
0.0    25953
1.0    25572
Name: count, dtype: int64
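Once the column holds only 0.0 and 1.0, an optional follow-up is to store it as a true Boolean. This is not a step the notebook takes, only a sketch of how it could be done, shown on a toy Series with made-up values. The order matters: filling first avoids `NaN` being cast to `True`.

```python
import numpy as np
import pandas as pd

# Toy column, made up for illustration only
toy_4wd = pd.Series([1.0, np.nan, np.nan, 1.0], name='is_4wd')

# Fill missing values with 0 first, then cast to bool.
# Casting before filling would turn NaN into True, since NaN is truthy.
as_bool = toy_4wd.fillna(0).astype(bool)

print(as_bool.tolist())
```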

In [35]:
df.isna().sum()

price              0
model_year      3619
model              0
condition          0
cylinders       5260
fuel               0
odometer        7892
transmission       0
type               0
paint_color     9267
is_4wd             0
date_posted        0
days_listed        0
dtype: int64

#### `odometer`

In [9]:
df['odometer'].value_counts(dropna=False)

odometer
NaN         7892
0.0          185
140000.0     183
120000.0     179
130000.0     178
            ... 
87836.0        1
172625.0       1
103597.0       1
167239.0       1
139573.0       1
Name: count, Length: 17763, dtype: int64