Brian Rivers - Sprint 4 Project Notebook

This notebook records the code used to read and preprocess the data, filter it by several parameters, and visualize the results with interactive charts. It also provides insights into the relationships between car models, prices, model years, and other attributes.

Every code cell has been marked with at least one comment, illustrating the function of each cell.

In [2]:
# Importing the necessary libraries
import pandas as pd
import numpy as np
import streamlit as st
import plotly.express as px
import altair as alt

In [3]:
# Reading the vehicles file
df = pd.read_csv('vehicles_us.csv')

In [4]:
# Previewing the first five rows
df.head()

Unnamed: 0,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed
0,9400,2011.0,bmw x5,good,6.0,gas,145000.0,automatic,SUV,,1.0,2018-06-23,19
1,25500,,ford f-150,good,6.0,gas,88705.0,automatic,pickup,white,1.0,2018-10-19,50
2,5500,2013.0,hyundai sonata,like new,4.0,gas,110000.0,automatic,sedan,red,,2019-02-07,79
3,1500,2003.0,ford f-150,fair,8.0,gas,,automatic,pickup,,,2019-03-22,9
4,14900,2017.0,chrysler 200,excellent,4.0,gas,80903.0,automatic,sedan,black,,2019-04-02,28


In [11]:

# Ensure 'model_year' and 'odometer' are numeric
df['model_year'] = pd.to_numeric(df['model_year'], errors='coerce')
df['odometer'] = pd.to_numeric(df['odometer'], errors='coerce')

# Calculate the median of 'model_year' and 'odometer'
median_year = df['model_year'].median()
median_odometer = df['odometer'].median()

# Replace NaN values in 'model_year' and 'odometer' with their medians
df['model_year'] = df['model_year'].fillna(median_year)
df['odometer'] = df['odometer'].fillna(median_odometer)

In [12]:
# Confirming that the missing 'model_year' and 'odometer' values were filled
df.head()

Unnamed: 0,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed
0,9400,2011.0,bmw x5,good,6.0,gas,145000.0,automatic,SUV,,1.0,2018-06-23,19
1,25500,2011.0,ford f-150,good,6.0,gas,88705.0,automatic,pickup,white,1.0,2018-10-19,50
2,5500,2013.0,hyundai sonata,like new,4.0,gas,110000.0,automatic,sedan,red,,2019-02-07,79
3,1500,2003.0,ford f-150,fair,8.0,gas,113000.0,automatic,pickup,,,2019-03-22,9
4,14900,2017.0,chrysler 200,excellent,4.0,gas,80903.0,automatic,sedan,black,,2019-04-02,28
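
The median fill above can be checked on a small frame. The sketch below is an illustration and not part of the original notebook: the `toy` DataFrame and the `fill_with_median` helper are hypothetical names, and the values are copied from the first five rows shown by `df.head()`. Because the medians here come from only five rows, they differ from the medians of the full dataset.

```python
import pandas as pd

# Hypothetical toy data copied from the first five rows of the preview above
toy = pd.DataFrame({
    "model_year": [2011.0, None, 2013.0, 2003.0, 2017.0],
    "odometer": [145000.0, 88705.0, 110000.0, None, 80903.0],
})

def fill_with_median(frame, columns):
    """Return a copy of frame with NaNs in each listed column replaced by that column's median."""
    result = frame.copy()
    for col in columns:
        # Coerce non-numeric entries to NaN so they are filled as well
        result[col] = pd.to_numeric(result[col], errors="coerce")
        result[col] = result[col].fillna(result[col].median())
    return result

filled = fill_with_median(toy, ["model_year", "odometer"])

# Count missing values per column before and after the fill
missing_before = toy.isna().sum().to_dict()
missing_after = filled.isna().sum().to_dict()
print(missing_before, missing_after)
```

Comparing `missing_before` with `missing_after` is a quick way to confirm that no NaN values remain in the two columns.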


In [13]:
# Creating the header and intro text for the filterable table
st.header('Market of used cars')
st.write('Filter the data below to see the price and model by model year')

# Checkbox to filter by SUV type
show_suv = st.checkbox('Show only SUV type cars')



In [14]:
# Streamlit slider to filter by model year (cast to int so the slider steps in whole years)
min_year = int(df['model_year'].min())
max_year = int(df['model_year'].max())
year_range = st.slider('Select the range of model years', min_year, max_year, (min_year, max_year))

In [15]:
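# Filter the DataFrame based on the selected year range
filtered_df = df[(df['model_year'] >= year_range[0]) & (df['model_year'] <= year_range[1])]

# Apply the SUV checkbox defined in the earlier cell
if show_suv:
    filtered_df = filtered_df[filtered_df['type'] == 'SUV']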

# Display the filtered DataFrame
st.write("Filtered Table of Models and Prices:")
st.dataframe(filtered_df[['model', 'price']])

DeltaGenerator()
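
The year-range and SUV filters can be tried without a running Streamlit app by moving the logic into a plain function. This is a minimal sketch under stated assumptions: `filter_listings` is a hypothetical helper, not something defined in this notebook, and the `toy` frame reuses the first five rows shown earlier with the median-filled values.

```python
import pandas as pd

# Hypothetical toy data based on the first five rows shown earlier
toy = pd.DataFrame({
    "price": [9400, 25500, 5500, 1500, 14900],
    "model_year": [2011.0, 2011.0, 2013.0, 2003.0, 2017.0],
    "model": ["bmw x5", "ford f-150", "hyundai sonata", "ford f-150", "chrysler 200"],
    "type": ["SUV", "pickup", "sedan", "pickup", "sedan"],
})

def filter_listings(frame, year_range, only_suv=False):
    """Keep rows whose model_year lies in year_range (inclusive), optionally SUVs only."""
    low, high = year_range
    mask = frame["model_year"].between(low, high)  # inclusive on both ends
    if only_suv:
        mask = mask & (frame["type"] == "SUV")
    return frame.loc[mask, ["model", "price"]]

in_range = filter_listings(toy, (2010, 2015))
suv_only = filter_listings(toy, (2010, 2015), only_suv=True)
print(in_range)
print(suv_only)
```

In the app, the slider tuple and the checkbox value would be passed as `year_range` and `only_suv`, and the result handed to `st.dataframe`.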

In [16]:
# Creating text for the histogram section
st.header('Data analysis')
st.write("""
###### Analyzing condition and odometer data of the used cars""")

# List of columns to be used for selections
list_for_hist = ['model', 'transmission', 'type', 'paint_color']

# Streamlit selectbox for the column used to color the histogram
selected_type = st.selectbox('Select a column to color by', list_for_hist)

# Create and display histogram (odometer values are summed per vehicle type)
fig1 = px.histogram(df, x='type', y='odometer', color=selected_type)
fig1.update_layout(title=f"<b> Split of odometer by type, colored by {selected_type}</b>")
fig1

2024-07-08 12:21:06.483 Session state does not function when running a script without `streamlit run`


In [17]:
# Define age category function
def age_category(x):
    if x < 5:
        return '<5'
    elif x < 10:
        return '5-10'
    elif x < 20:
        return '10-20'
    else:
        return '>20'

# Calculate age and age category
df['age'] = 2024 - df['model_year']
df['age_category'] = df['age'].apply(age_category)

# List of columns to be used for scatter plot
list_for_scatter = ['price', 'model_year', 'odometer', 'transmission', 'type', 'paint_color']

# Streamlit selectbox for scatter plot
choice_for_scatter = st.selectbox('Dependency on', list_for_scatter)

# Create and display scatter plot, colored by the column chosen in the selectbox
fig2 = px.scatter(df, x="age", y='condition', color=choice_for_scatter, hover_data=['model_year'])
fig2.update_layout(title=f"<b> Condition to age relationship: {choice_for_scatter}</b>")
fig2
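
The age buckets defined in this cell could also be produced in one vectorized step with `pd.cut`. The following is a sketch of that alternative, not the method used in this notebook; `categorize_age` is a hypothetical name. One difference worth noting: `pd.cut` leaves missing ages as NaN, while a chain of `if`/`elif` comparisons sends NaN to the final `'>20'` branch.

```python
import pandas as pd

def categorize_age(age):
    """Bucket ages into the same four labels, using left-closed intervals."""
    bins = [float("-inf"), 5, 10, 20, float("inf")]
    labels = ["<5", "5-10", "10-20", ">20"]
    # right=False makes each interval [low, high), so age 5 falls in '5-10'
    return pd.cut(age, bins=bins, labels=labels, right=False)

# Hypothetical ages chosen to sit on and around the bucket edges
ages = pd.Series([0, 4.9, 5, 9, 10, 19, 20, 35])
categories = categorize_age(ages)
print(categories.tolist())
```

With the notebook's DataFrame this would be used as `df['age_category'] = categorize_age(df['age'])`.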

This notebook provides an interactive analysis of the used car market, allowing users to explore the data dynamically and gain insight into the factors affecting car prices and conditions.