# SDA - Project <a id='intro'></a>

## Project description
This project is an interactive web application built with Streamlit, designed for analyzing a car advertisement dataset. It provides users with a platform to explore various attributes of cars, including condition, model year, price, and mileage. Through features like histograms and scatter plots, users can gain insights into relationships between these attributes, such as understanding the distribution of car conditions across model years and the impact of mileage on car prices.

## Features
- Histogram of Car Conditions: A histogram to visualize the distribution of car conditions across different model years.
- Scatter Plot of Price vs. Mileage: A scatter plot to examine how the price of a car is influenced by its mileage.
- Interactive Checkbox: A checkbox that lets users toggle whether the scatter plot distinguishes cars by age.

## Technologies Used
- Python: The main programming language used for the project.
- Pandas: For data manipulation and analysis.
- Streamlit: For building the interactive web application.
- Plotly Express: For creating interactive visualizations.
- Altair: Optionally used for additional visualizations.

## Files in the Repository
- README.md: This file.
- .gitignore: Specifies files and directories to be ignored by git.
- app.py: The main application file for the Streamlit app.
- vehicles_us.csv: The dataset file (to be downloaded separately).
- notebooks/EDA.ipynb: Jupyter notebook for exploratory data analysis.
- requirements.txt: List of dependencies required by the project.
- .streamlit/config.toml: Configuration file for Streamlit deployment.

In [1]:
import pandas as pd
import streamlit as st
import plotly.express as px

In [2]:
try:
    # Try to read the CSV file from the local path.
    vehicles_df = pd.read_csv('/Users/benjaminstephen/Documents/TripleTen/Sprint_4/Vehicle-Analysis_Project/notebooks/vehicles_us.csv')
except FileNotFoundError:
    try:
        # Try to read the CSV file from the server path
        vehicles_df = pd.read_csv('vehicles_us.csv')
        print("CSV file successfully read from the server path.")
    except FileNotFoundError:
        print("CSV file not found. Please check the file paths.")
else:
    print("CSV file successfully read from the local path.")

CSV file successfully read from the server path.
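The nested `try`/`except` above leaves `vehicles_df` undefined when neither path exists, so later cells fail with an unrelated `NameError`. A minimal sketch of an alternative, assuming the candidate locations can be written relative to the working directory: `load_first_existing` and `CANDIDATE_PATHS` are hypothetical names, not part of the project.

```python
from pathlib import Path

import pandas as pd


def load_first_existing(candidates):
    """Return a DataFrame read from the first candidate path that exists."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            print(f"CSV file successfully read from {path}.")
            return pd.read_csv(path)
    # Fail loudly so later cells never run against an undefined DataFrame.
    raise FileNotFoundError(f"CSV file not found in any of: {list(candidates)}")


# Hypothetical candidate locations, tried in order.
CANDIDATE_PATHS = ["notebooks/vehicles_us.csv", "vehicles_us.csv"]
```

A call such as `vehicles_df = load_first_existing(CANDIDATE_PATHS)` would then replace the cell above.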


In [3]:
display(vehicles_df)
print(vehicles_df.info())
print()
print("DUPLICATE ROWS:")
print(vehicles_df[vehicles_df.duplicated()])

Unnamed: 0,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed
0,9400,2011.0,bmw x5,good,6.0,gas,145000.0,automatic,SUV,,1.0,2018-06-23,19
1,25500,,ford f-150,good,6.0,gas,88705.0,automatic,pickup,white,1.0,2018-10-19,50
2,5500,2013.0,hyundai sonata,like new,4.0,gas,110000.0,automatic,sedan,red,,2019-02-07,79
3,1500,2003.0,ford f-150,fair,8.0,gas,,automatic,pickup,,,2019-03-22,9
4,14900,2017.0,chrysler 200,excellent,4.0,gas,80903.0,automatic,sedan,black,,2019-04-02,28
...,...,...,...,...,...,...,...,...,...,...,...,...,...
51520,9249,2013.0,nissan maxima,like new,6.0,gas,88136.0,automatic,sedan,black,,2018-10-03,37
51521,2700,2002.0,honda civic,salvage,4.0,gas,181500.0,automatic,sedan,white,,2018-11-14,22
51522,3950,2009.0,hyundai sonata,excellent,4.0,gas,128000.0,automatic,sedan,blue,,2018-11-15,32
51523,7455,2013.0,toyota corolla,good,4.0,gas,139573.0,automatic,sedan,black,,2018-07-02,71


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    47906 non-null  float64
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     46265 non-null  float64
 5   fuel          51525 non-null  object 
 6   odometer      43633 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   42258 non-null  object 
 10  is_4wd        25572 non-null  float64
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
dtypes: float64(4), int64(2), object(7)
memory usage: 5.1+ MB
None

DUPLICATE ROWS:
Empty DataFrame
Columns: [price, model_year, model, condition, cylinders, fuel, odometer, transmission, type, paint_color, is_4wd, date_posted, days_listed]
Index

The initial inspection shows null values in five columns (model_year, cylinders, odometer, paint_color, is_4wd) and no duplicate rows. Let's fill in the missing values.
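Reading the null counts off `info()` works, but a small summary table makes the gaps easier to compare. The sketch below uses a hypothetical helper, `missing_summary`, and a made-up three-column frame rather than the real dataset.

```python
import pandas as pd


def missing_summary(df):
    """Count and percentage of missing values for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Tiny made-up frame, only to show the shape of the result.
sample = pd.DataFrame({
    "price": [9400, 25500, 5500, 1500],
    "model_year": [2011.0, None, 2013.0, 2003.0],
    "is_4wd": [1.0, 1.0, None, None],
})
print(missing_summary(sample))
```

Running `missing_summary(vehicles_df)` on the loaded data should list the same five columns that `info()` reports as incomplete.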

In [4]:
# cleaning up 'model_year' column: missing years become the label 'unknown'
# (assigning the result back avoids chained in-place calls, which newer pandas versions warn about)
vehicles_df['model_year'] = vehicles_df['model_year'].fillna(0).astype(int).astype(str)
vehicles_df['model_year'] = vehicles_df['model_year'].replace('0', 'unknown')

In [5]:
# cleaning up 'cylinders' column
vehicles_df['cylinders'] = vehicles_df['cylinders'].fillna(-1).astype(int)  # Use -1 as placeholder for unknown

In [6]:
# cleaning up 'odometer' column
vehicles_df['odometer'] = vehicles_df['odometer'].fillna(-1)  # Use -1 as placeholder for unknown

In [7]:
# cleaning up 'paint_color' column
vehicles_df['paint_color'] = vehicles_df['paint_color'].fillna('unknown')

In [8]:
# cleaning up 'is_4wd' column: 1.0 means 'yes', a missing value is read as 'no'
vehicles_df['is_4wd'] = vehicles_df['is_4wd'].map({1.0: 'yes'}).fillna('no')
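The cells above encode missing values with sentinels ('unknown', -1, 'no'), which turns `model_year` into a string column. As an alternative sketch, not what this notebook does, pandas' nullable `Int64` dtype keeps such columns numeric while still marking the gaps. The rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows that mimic the raw columns; not taken from the dataset.
raw = pd.DataFrame({
    "model_year": [2011.0, None, 2013.0],
    "cylinders": [6.0, None, 4.0],
    "is_4wd": [1.0, None, None],
})

cleaned = raw.assign(
    # Nullable integers keep the column numeric and mark gaps as <NA>.
    model_year=raw["model_year"].astype("Int64"),
    cylinders=raw["cylinders"].astype("Int64"),
    # A missing flag is read as "not 4wd", matching the 'no' fill above.
    is_4wd=raw["is_4wd"].fillna(0).astype(bool),
)
print(cleaned.dtypes)
```

The trade-off is that plots and filters must then handle `<NA>` explicitly instead of matching a sentinel value.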

In [9]:
# printing out the data again to ensure that it is clean
display(vehicles_df)
print(vehicles_df.info())

Unnamed: 0,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed
0,9400,2011,bmw x5,good,6,gas,145000.0,automatic,SUV,unknown,yes,2018-06-23,19
1,25500,unknown,ford f-150,good,6,gas,88705.0,automatic,pickup,white,yes,2018-10-19,50
2,5500,2013,hyundai sonata,like new,4,gas,110000.0,automatic,sedan,red,no,2019-02-07,79
3,1500,2003,ford f-150,fair,8,gas,-1.0,automatic,pickup,unknown,no,2019-03-22,9
4,14900,2017,chrysler 200,excellent,4,gas,80903.0,automatic,sedan,black,no,2019-04-02,28
...,...,...,...,...,...,...,...,...,...,...,...,...,...
51520,9249,2013,nissan maxima,like new,6,gas,88136.0,automatic,sedan,black,no,2018-10-03,37
51521,2700,2002,honda civic,salvage,4,gas,181500.0,automatic,sedan,white,no,2018-11-14,22
51522,3950,2009,hyundai sonata,excellent,4,gas,128000.0,automatic,sedan,blue,no,2018-11-15,32
51523,7455,2013,toyota corolla,good,4,gas,139573.0,automatic,sedan,black,no,2018-07-02,71


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    51525 non-null  object 
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     51525 non-null  int64  
 5   fuel          51525 non-null  object 
 6   odometer      51525 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   51525 non-null  object 
 10  is_4wd        51525 non-null  object 
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
dtypes: float64(1), int64(3), object(9)
memory usage: 5.1+ MB
None


Now we will create a histogram to explore the distribution of car conditions across model years.

In [10]:
# Filter out rows whose model year is unknown (missing years were relabelled 'unknown' during cleaning)
filtered_df = vehicles_df[vehicles_df['model_year'] != 'unknown']

# Group by model year and condition, and count the number of vehicles
grouped_df = filtered_df.groupby(['model_year', 'condition']).size().reset_index(name='count')

# Create histogram
fig1 = px.histogram(grouped_df, x="model_year", y="count", color="condition")

st.plotly_chart(fig1)

2024-05-27 00:18:04.330 
  command:

    streamlit run /opt/anaconda3/lib/python3.11/site-packages/ipykernel_launcher.py [ARGUMENTS]


DeltaGenerator()
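To make the grouping step concrete, here is a small sketch on made-up rows showing the table that feeds the histogram: one row per (model year, condition) pair with its count, after dropping rows labelled 'unknown'.

```python
import pandas as pd

# Made-up rows, only to illustrate the shape of the grouped table.
sample = pd.DataFrame({
    "model_year": ["2011", "2011", "2013", "unknown", "2013"],
    "condition": ["good", "good", "like new", "fair", "good"],
})

# Drop rows without a known model year, then count each pair.
known = sample[sample["model_year"] != "unknown"]
counts = (
    known.groupby(["model_year", "condition"])
    .size()
    .reset_index(name="count")
)
print(counts)
```

Each row of `counts` becomes one coloured segment of a bar when the table is passed to `px.histogram` with `color="condition"`.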

Now we will create a scatter plot to explore how mileage affects the price of a car.

In [11]:
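# Create scatter plot of price against mileage, skipping the -1 placeholder used for unknown odometer readings
fig2 = px.scatter(vehicles_df[vehicles_df['odometer'] >= 0], x="odometer", y="price")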

st.plotly_chart(fig2)

DeltaGenerator()
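The feature list mentions a checkbox that brings car age into the scatter plot, but the notebook does not show it. The following is one possible sketch, under stated assumptions: age is taken as the posting year minus the model year, and the `age_group` column, its bin edges, and both function names are hypothetical. The plotting imports sit inside `render_scatter` so the data helper can be used without Streamlit installed.

```python
import pandas as pd


def add_age_group(df):
    """Return a copy of df with a hypothetical 'age_group' column."""
    out = df.copy()
    posted_year = pd.to_datetime(out["date_posted"]).dt.year
    # 'unknown' model years become NaN and end up in the 'unknown' group.
    model_year = pd.to_numeric(out["model_year"], errors="coerce")
    age = posted_year - model_year
    groups = pd.cut(
        age,
        bins=[float("-inf"), 5, 10, float("inf")],
        labels=["5 years or newer", "6-10 years", "older than 10 years"],
    )
    out["age_group"] = groups.astype(object).fillna("unknown")
    return out


def render_scatter(df):
    """Draw price against mileage, optionally coloured by age group."""
    import plotly.express as px
    import streamlit as st

    show_age = st.checkbox("Color by vehicle age")
    # Skip rows that carry the -1 placeholder for unknown mileage.
    plot_df = add_age_group(df[df["odometer"] >= 0])
    fig = px.scatter(plot_df, x="odometer", y="price",
                     color="age_group" if show_age else None)
    st.plotly_chart(fig)
```

In `app.py`, calling `render_scatter(vehicles_df)` after the cleaning steps would show the plain scatter plot by default and the age-coloured version once the box is ticked.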