# EDA
___
**From PreProcessing:**<br>
The purpose of this exercise is to analyze used car sales data from the United States and produce a web app to share findings. First, the data will be loaded, then opened and inspected in order to understand the structure and potential issues. Next is processing the data. By making necessary or useful updates, the data will be prepared for further data exploration.

**EDA**<br>
Most of the preprocessing was done in the preprocessing notebook ('PreProcessing.ipynb'). Converting dates to datetime format and creating a second dataframe that stores sales data are handled in this EDA notebook ('EDA.ipynb').

A second dataframe will be constructed with data from the pre-processed dataframe. The second dataframe will focus on sales. It is indexed by date, and its columns will be day, week, month, inventory_in_units, inventory_in_dollars (based on sale price), sales_in_units, and sales_dollars. The original dataframe had listing dates and the number of days each vehicle was listed for sale. I used this information to create a sale date. Any day between the listing date and the sale date counts as a day when the vehicle was in inventory.
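
To make the sale date and inventory rule concrete, here is a minimal sketch on three made-up listings. The `toy` dataframe and its values are hypothetical; only the column names follow the notebook.

```python
import pandas as pd

# hypothetical listings; only the column names match the notebook
toy = pd.DataFrame({
    "date_posted": pd.to_datetime(["2018-05-01", "2018-05-02", "2018-05-03"]),
    "days_listed": [3, 1, 10],
    "price": [5000, 12000, 8000],
})

# sale date = listing date + number of days the listing was live
toy["date_sold"] = toy["date_posted"] + pd.to_timedelta(toy["days_listed"], unit="D")

# a vehicle is in inventory on every day from its listing date through its sale date
day = pd.Timestamp("2018-05-04")
in_inventory = toy[(toy["date_posted"] <= day) & (toy["date_sold"] >= day)]

print(toy[["date_posted", "date_sold"]])
print(len(in_inventory), in_inventory["price"].sum())
```

On 2018-05-04 the second listing has already sold, so only the first and third listings count toward inventory.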

Analyzing the Data: <br>
This notebook contains several plots created to investigate the data further. Plotly Express (plotly.express) is the visualization library used.

Statistical Analysis: <br>
Various statistical calculations will be made in order to interpret data and draw conclusions, identifying trends or correlations.

In [1]:
# import libraries

# built-in library
from datetime import datetime

# third party libraries
import pandas as pd
import numpy as np
import streamlit as st
import plotly.express as px

In [2]:
# read in processed vehicles sales data
vehicles = pd.read_csv("../processed_vehicles.csv")

In [3]:
# convert date column to datetime data type
vehicles['date_posted'] = pd.to_datetime(
    vehicles['date_posted'], format='%Y-%m-%d')

In [4]:
# create a date sold column by adding days listed to date posted
vehicles['date_sold'] = vehicles['date_posted'] + \
    pd.to_timedelta(vehicles['days_listed'], unit='d')

In [5]:
# create the date range of vehicles, from listing to selling
date_range = pd.date_range(
    start=vehicles['date_posted'].min(), end=vehicles['date_sold'].max())

# initialize lists to store the calculated data and build columns
inventory_units = []
inventory_dollars = []
sales_units = []
sales_dollars = []
month_labels = []

# calculate metrics for each date
for single_date in date_range:
    # inventory calculations
    in_inventory = vehicles[(vehicles['date_posted'] <= single_date) &
                            (vehicles['date_sold'] >= single_date)]
    inventory_units.append(len(in_inventory))
    inventory_dollars.append(in_inventory['price'].sum())

    # sales calculation - vehicles sold per day
    sold_vehicles = vehicles[vehicles['date_sold'] == single_date]
    sales_units.append(len(sold_vehicles))
    sales_dollars.append(sold_vehicles['price'].sum())

    # format of month column
    month_labels.append(single_date.strftime("%B %Y"))

# compute the week number offset (.iloc[0] selects the first row by position)
first_week_number = date_range.isocalendar().week.iloc[0]

# construct the second DataFrame
sales = pd.DataFrame({
    'day': date_range.day,
    'week': date_range.isocalendar().week - first_week_number + 1,
    'month': month_labels,
    'inventory_in_units': inventory_units,
    'inventory_in_dollars': inventory_dollars,
    'sales_in_units': sales_units,
    'sales_dollars': sales_dollars
})
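
The daily loop above filters the whole dataframe once per date. As a possible alternative, the same inventory count can be derived from cumulative sums: listings posted so far minus listings sold before the current day. The sketch below uses a hypothetical `toy` dataframe and checks the result against the loop's filter; it is an illustration under those assumptions, not a replacement for the cell above.

```python
import pandas as pd

# hypothetical listings standing in for the vehicles dataframe
toy = pd.DataFrame({
    "date_posted": pd.to_datetime(["2018-05-01", "2018-05-01", "2018-05-02"]),
    "date_sold": pd.to_datetime(["2018-05-02", "2018-05-04", "2018-05-03"]),
})
days = pd.date_range(toy["date_posted"].min(), toy["date_sold"].max())

# listings posted and sold per calendar day, with 0 for days that have none
posted = toy["date_posted"].value_counts().reindex(days, fill_value=0)
sold = toy["date_sold"].value_counts().reindex(days, fill_value=0)

# posted so far minus sold before today (a vehicle still counts on its sale date)
inventory_units = posted.cumsum() - sold.cumsum().shift(1, fill_value=0)

# reference result using the same filter as the loop
expected = [
    int(((toy["date_posted"] <= d) & (toy["date_sold"] >= d)).sum()) for d in days
]
print(inventory_units.tolist(), expected)
```

The dollar-valued inventory could follow the same pattern by summing `price` per day instead of counting rows.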



In [6]:
# Display the resulting DataFrame
sales.info()
sales.head(5)

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 572 entries, 2018-05-01 to 2019-11-23
Freq: D
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   day                   572 non-null    int32 
 1   week                  572 non-null    UInt32
 2   month                 572 non-null    object
 3   inventory_in_units    572 non-null    int64 
 4   inventory_in_dollars  572 non-null    int64 
 5   sales_in_units        572 non-null    int64 
 6   sales_dollars         572 non-null    int64 
dtypes: UInt32(1), int32(1), int64(4), object(1)
memory usage: 31.8+ KB


            day  week     month  inventory_in_units  inventory_in_dollars  sales_in_units  sales_dollars
2018-05-01    1     1  May 2018                 124               1407235               0              0
2018-05-02    2     1  May 2018                 279               3178309               0              0
2018-05-03    3     1  May 2018                 443               5067669               0              0
2018-05-04    4     1  May 2018                 593               6958067               2           8598
2018-05-05    5     1  May 2018                 724               8482729               6          68392
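
One caveat about the `week` column: ISO week numbers restart each January, so subtracting the first week number stops giving a running count once the dates cross into a new year, which this date range does. A minimal sketch of a counter that keeps increasing is shown below. The names `dates`, `first_monday`, and `running_week` are illustrative, and the date span is hard-coded to match the output above.

```python
import pandas as pd

# same span as the notebook's date_range, hard-coded here for illustration
dates = pd.date_range("2018-05-01", "2019-11-23")

# Monday of the week that contains the first date
first_monday = dates[0] - pd.Timedelta(days=dates[0].weekday())

# whole weeks elapsed since that Monday, counted from 1
running_week = (dates - first_monday).days // 7 + 1

print(running_week[0], running_week[-1])
```

Weeks still start on Monday, as in the ISO calendar, but the count no longer resets at the year boundary.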


In [7]:
# turn the date index into a column
sales = sales.reset_index()

In [8]:
# save the sales and processed csv files to the parent directory
sales.to_csv('../vehicle_sales.csv', index=False)
vehicles.to_csv('../processed_vehicles.csv', index=False)

In [9]:
# the vehicles dataframe is large enough to oversaturate plots,
# so plot a random sample of 5,000 rows instead
sample = vehicles.sample(5000)
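
Because `sample` is drawn at random, the plots and averages below change on every run. If repeatable figures are wanted, one option is to pass a seed through `random_state`. The sketch uses a hypothetical `toy` dataframe, and the seed value 42 is arbitrary.

```python
import pandas as pd

# hypothetical stand-in for the vehicles dataframe
toy = pd.DataFrame({"price": range(100)})

# cap the sample size at the number of rows to avoid a ValueError on small data
n = min(5000, len(toy))

# a fixed seed returns the same rows on every run
first = toy.sample(n, random_state=42)
second = toy.sample(n, random_state=42)

print(first.index.equals(second.index))
```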

In [10]:
# Set a default color sequence
px.defaults.color_continuous_scale = px.colors.sequential.Viridis
px.defaults.color_discrete_sequence = px.colors.qualitative.Bold

In [11]:
# total dollar value of all vehicles sold
sum_sales = sales.sales_dollars.sum()

print(f"The total vehicle sales were: ${sum_sales:,}")

# line plot of sales over time
fig = px.line(sales,
              x='index',
              y='sales_dollars',
              title='Total sales per day')

fig.update_layout(xaxis_title="Time in day intervals",
                  yaxis_title="Sum of sales ($)",
                  width=800,
                  height=600,
                  )

# retrieve last listing date
last_listing = vehicles['date_posted'].max()

# add a vertical line at the last listing date
fig.add_shape(type="line",
              x0=last_listing, x1=last_listing,
              y0=0, y1=2500000,
              line=dict(color='red', dash='dash'))

# add text annotation for the line
fig.add_annotation(
    x=last_listing,
    y=2250000,
    text=f"Last listing date: {last_listing}",  # text to display
    showarrow=True,
    arrowhead=1,
    ax=40,  # x offset for the arrow end
    ay=-40,  # y offset for the arrow end
    bordercolor="black",
    borderwidth=1,
    borderpad=4,
    bgcolor="white",
    opacity=0.8
)

fig.show()

The total vehicle sales were: $625,125,255


In [12]:
fig = px.line(sales,
              x='index',
              y='sales_in_units',
              title='Sales over Time')

fig.update_layout(xaxis_title='Time')
fig.show()
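
Daily sales counts tend to be noisy, which can hide the underlying trend. One common way to smooth them is a rolling average. The sketch below applies a 7-day window to made-up numbers; `toy_sales` and the `sales_7d_avg` column are hypothetical names, and the window length is an assumption rather than something chosen in this notebook.

```python
import pandas as pd

# hypothetical daily unit sales in the shape of the sales dataframe
toy_sales = pd.DataFrame({
    "index": pd.date_range("2018-05-01", periods=10),
    "sales_in_units": [0, 0, 0, 2, 6, 4, 9, 7, 7, 14],
})

# average over the current day and the six days before it;
# the first six rows are NaN because the window is not yet full
toy_sales["sales_7d_avg"] = toy_sales["sales_in_units"].rolling(window=7).mean()

print(toy_sales)
```

The smoothed column can be passed to `px.line` as `y` in the same way as `sales_in_units`.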

In [13]:
# average listing duration of vehicle type (how fast vehicle types sell)

# calculate the average listing duration by vehicle type
average_duration = sample.groupby('type')['days_listed'].mean().reset_index()

# bar plot
fig = px.bar(average_duration, x='type',
             y='days_listed', text='days_listed')

# order the bars by descending height
fig.update_layout(xaxis={'categoryorder': 'total descending'})

# automatically adjust text size, format, and position
fig.update_traces(texttemplate='%{text:.2f}', textposition='inside')

# add titles
fig.update_layout(title='Average listing duration for vehicle type',
                  xaxis_title="Vehicle type",
                  yaxis_title="Duration of listing (days)",
                  width=800,
                  height=575,
                  )

# add a horizontal line
fig.add_shape(
    type="line",
    # extend line horizontally across the plot
    x0=-0.5, x1=len(sample['type'].unique()) - 0.5,
    # set at average_value
    y0=sample['days_listed'].mean(), y1=sample['days_listed'].mean(),
    line=dict(color='Red', width=2, dash='dash'),  # Style of the line
)

# Add an annotation if desired
fig.add_annotation(
    x='pickup',  # horizontal position within the plot area
    y=sample['days_listed'].mean()+1,
    text=f"Average duration: {sample['days_listed'].mean():.2f} days",
    showarrow=False,
    yshift=10,  # vertical shift
    bgcolor="white",
    bordercolor="black",
    borderwidth=1,
    opacity=0.8
)


# Show the plot
fig.show()

In [14]:
# average listing duration of vehicle make (how fast vehicle makes sell)

# calculate the average listing duration by vehicle make
average_make = sample.groupby('make')['days_listed'].mean().reset_index()

# bar plot
fig = px.bar(average_make,
             x='make',
             y='days_listed',
             text='days_listed')

# automatically adjust text size, format, and position
fig.update_traces(texttemplate='%{text:.2f}', textposition='inside')

# order the bars by ascending height
fig.update_layout(xaxis={'categoryorder': 'total ascending'})

# add titles/ labels
fig.update_layout(title='Vehicle make average listing duration',
                  xaxis_title="Vehicle make",
                  yaxis_title="Average listing duration (days)",
                  width=800,
                  height=575,
                  )

# add a horizontal line
fig.add_shape(
    type="line",
    # extend line horizontally across the plot
    x0=-0.5, x1=len(sample['make'].unique()) - 0.5,
    # set at average_value
    y0=sample['days_listed'].mean(), y1=sample['days_listed'].mean(),
    line=dict(color='Red', width=2, dash='dash'),  # Style of the line
)

# Add an annotation if desired
fig.add_annotation(
    x='volkswagen',  # horizontal position within the plot area
    y=sample['days_listed'].mean()+1,
    text=f"Average duration: {sample['days_listed'].mean():.2f} days",
    showarrow=False,
    yshift=10,  # vertical shift
    bgcolor="white",
    bordercolor="black",
    borderwidth=1,
    opacity=0.8
)

# Show the plot
fig.show()

In [15]:
# average price of vehicle type

# calculate the average price by vehicle type
average_price = sample.groupby('type')['price'].mean().reset_index()

# bar plot
fig = px.bar(average_price,
             x='type',
             y='price',
             text='price')

# automatically adjust text size, format, and position
fig.update_traces(texttemplate='%{text:,.2f}', textposition='inside')

# order the bars by descending height
fig.update_layout(xaxis={'categoryorder': 'total descending'})

# add titles/ labels
fig.update_layout(title='Vehicle type average listing price',
                  xaxis_title="Vehicle type",
                  yaxis_title="Average listing price ($)",
                  width=800,
                  height=575,
                  )

# add a horizontal line
fig.add_shape(
    type="line",
    # extend line horizontally across the plot
    x0=-0.5, x1=len(sample['type'].unique()) - 0.5,
    # set at average_value
    y0=sample['price'].mean(), y1=sample['price'].mean(),
    line=dict(color='Red', width=2, dash='dash'),  # Style of the line
)

# add an annotation if desired
fig.add_annotation(
    x='van',  # horizontal position within the plot area
    y=sample['price'].mean()+100,
    text=f"Average price: ${sample['price'].mean():,.2f}",
    showarrow=False,
    yshift=10,  # vertical shift
    bgcolor="white",
    bordercolor="black",
    borderwidth=1,
    opacity=0.8
)

# Show the plot
fig.show()

In [16]:
# idea for the web app: let a slider control the sample size


# explore data
fig = px.scatter(sample,
                 x='model_year',
                 y='price',
                 color='condition',
                 category_orders={'condition': [
                     'salvage', 'fair', 'good', 'excellent', 'like new', 'new']}
                 )
fig.update_layout(title='Comparing price with release year of used cars',
                  xaxis_title="Vehicle production year",
                  yaxis_title="Price of vehicle ($)",
                  width=800,
                  height=575,
                  )

# Set x and y axis limits
fig.update_xaxes(range=[sample['model_year'].min() -
                 1, sample['model_year'].max()+1])
fig.update_yaxes(range=[0, sample['price'].max()+1000])

fig.show()
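
The scatter plot gives a visual impression of how price relates to model year. A correlation coefficient puts a number on it. The sketch below computes Pearson's r and a rank-based (Spearman) coefficient on made-up values; the `toy` dataframe is hypothetical, and the same calls could be applied to `sample`.

```python
import pandas as pd

# hypothetical values; only the column names match the notebook
toy = pd.DataFrame({
    "model_year": [2005, 2008, 2011, 2014, 2017],
    "price": [3000, 5500, 9000, 15000, 24000],
})

# Pearson correlation (the default) measures linear association
r = toy["model_year"].corr(toy["price"])

# Spearman correlation is the Pearson correlation of the ranks,
# so it captures any monotonic relationship, linear or not
rho = toy["model_year"].rank().corr(toy["price"].rank())

print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Here the prices rise with every step in model year, so the rank correlation is 1 even though the relationship is not a straight line.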

In [25]:
# create grouping
# calculate the average price by condition
average_condition = sample.groupby('condition')['price'].mean().reset_index()

# histogram plot
fig = px.histogram(sample,
                   x='price',
                   title='Used car prices',
                   nbins=100,
                   color='condition',
                   barmode='overlay',
                   )

# build one "condition: average price" line per condition present in the sample
average_lines = "<br>".join(
    f"{row.condition}: ${row.price:,.2f}"
    for row in average_condition.itertuples(index=False)
)

# add an annotation
fig.add_annotation(
    x=80000,  # horizontal position within the plot area
    y=275,
    text=f"Average prices:<br>{average_lines}",
    showarrow=False,
    bgcolor="white",
    bordercolor="black",
    borderwidth=1,
    opacity=0.8,
    align="left"
)

fig.update_layout(xaxis_title="Price ($)",
                  yaxis_title="Number of vehicles",
                  width=800,
                  height=575,
                  )

# Change transparency/visibility with 'opacity' parameter
fig.update_traces(opacity=0.7)

# set x and y axis limits
fig.update_xaxes(range=[0, 100000])
fig.update_yaxes(range=[0, 350])

fig.show()
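
`groupby` sorts its keys alphabetically, so the averages in the annotation come out as excellent, fair, good, and so on rather than from worst to best condition. One way to get the natural order is an ordered categorical. The sketch below reuses the condition order from the scatter plot's `category_orders`; the `toy` data is hypothetical.

```python
import pandas as pd

# condition order taken from the scatter plot's category_orders
condition_order = ["salvage", "fair", "good", "excellent", "like new", "new"]

# hypothetical listings; only the column names match the notebook
toy = pd.DataFrame({
    "condition": ["good", "new", "fair", "good", "salvage", "like new", "excellent"],
    "price": [8000, 30000, 3000, 10000, 1000, 20000, 15000],
})

# an ordered categorical makes groupby sort by condition quality, not alphabetically
toy["condition"] = pd.Categorical(
    toy["condition"], categories=condition_order, ordered=True)

# observed=True keeps only the conditions that actually occur in the data
average_condition = (
    toy.groupby("condition", observed=True)["price"].mean().reset_index()
)

print(average_condition)
```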