## Title: Software Development Tools

## Introduction

This project focuses on enhancing software engineering skills by developing and deploying a public web application built on a vehicle data set. It demonstrates knowledge of software development tools by using Git, GitHub, and Render, and by performing an exploratory data analysis (EDA) on part of the data.

## Libraries

In [69]:
#Loading the libraries for the project. 

import pandas as pd

import plotly.express as px

import os

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

from scipy import stats

from scipy.stats import ttest_ind

## Data Set

In [70]:
#Loading the data set. 

vehicle_data = pd.read_csv('../vehicles_us.csv')  

## Preprocessing Data

In [71]:
#Overview of key metrics. 
vehicle_data.describe()

| | price | model_year | cylinders | odometer | is_4wd | days_listed |
|---|---|---|---|---|---|---|
| count | 51525.0 | 47906.0 | 46265.0 | 43633.0 | 25572.0 | 51525.0 |
| mean | 12132.46492 | 2009.75047 | 6.125235 | 115553.461738 | 1.0 | 39.55476 |
| std | 10040.803015 | 6.282065 | 1.66036 | 65094.611341 | 0.0 | 28.20427 |
| min | 1.0 | 1908.0 | 3.0 | 0.0 | 1.0 | 0.0 |
| 25% | 5000.0 | 2006.0 | 4.0 | 70000.0 | 1.0 | 19.0 |
| 50% | 9000.0 | 2011.0 | 6.0 | 113000.0 | 1.0 | 33.0 |
| 75% | 16839.0 | 2014.0 | 8.0 | 155000.0 | 1.0 | 53.0 |
| max | 375000.0 | 2019.0 | 12.0 | 990000.0 | 1.0 | 271.0 |
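
The `count` row above is lower than 51,525 for several columns, which is what motivates the fill step that follows. As a rough sketch, the number of missing values per column can be derived from those counts. It assumes that the largest count (51,525, for `price` and `days_listed`) equals the total number of rows, which the output does not state directly.

```python
import pandas as pd

# Non-null counts copied from the describe() output above.
counts = pd.Series({
    'price': 51525,
    'model_year': 47906,
    'cylinders': 46265,
    'odometer': 43633,
    'is_4wd': 25572,
    'days_listed': 51525,
})

# Assumption: the largest count is the total number of rows.
n_rows = counts.max()

# Missing values per column, and their share of all rows.
missing = n_rows - counts
missing_share = (missing / n_rows).round(3)

print(pd.DataFrame({'missing': missing, 'share': missing_share}))
```

On the full DataFrame, `vehicle_data.isna().sum()` should report the same numbers directly.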


In [72]:
# Fill missing values in the `model_year` column by grouping by `model`
vehicle_data['model_year'] = vehicle_data.groupby('model')['model_year'].transform(lambda x: x.fillna(x.median()))

# Fill missing values in the `cylinders` column by grouping by `model`
vehicle_data['cylinders'] = vehicle_data.groupby('model')['cylinders'].transform(lambda x: x.fillna(x.median()))

# Fill missing values in the `odometer` column with the median of each (`model`, `model_year`) group
vehicle_data['odometer'] = vehicle_data.groupby(['model', 'model_year'])['odometer'].transform(lambda x: x.fillna(x.median()))
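
To see what the `groupby(...).transform(...)` pattern above does, here is a minimal sketch on a made-up toy DataFrame (the `toy` frame and its values are illustrative, not taken from `vehicles_us.csv`). It also shows one caveat: a group with no observed values has a NaN median, so its rows stay missing. The final fallback to the overall median is an assumption added for illustration, not a step from the original notebook.

```python
import numpy as np
import pandas as pd

# Toy data: model 'c' has no observed odometer value at all.
toy = pd.DataFrame({
    'model': ['a', 'a', 'a', 'b', 'b', 'c'],
    'odometer': [100.0, np.nan, 300.0, 50.0, np.nan, np.nan],
})

# Same pattern as in the notebook: fill each gap with its group's median.
toy['odometer_filled'] = (
    toy.groupby('model')['odometer']
       .transform(lambda x: x.fillna(x.median()))
)

# Rows whose whole group was missing are still NaN after the group fill.
remaining = int(toy['odometer_filled'].isna().sum())

# Hypothetical fallback: use the overall median for whatever is left.
toy['odometer_final'] = toy['odometer_filled'].fillna(toy['odometer'].median())

print(toy)
print('still missing after group fill:', remaining)
```

Checking `isna().sum()` after each fill is a cheap way to confirm whether any gaps survived.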

### Histogram 1 Analysis:
The histogram shows that most vehicles are priced between 5k and 20k, with the highest concentration around 10k–15k. There are few vehicles in the higher 25k–30k range, suggesting potential outliers or luxury items.

In [73]:
#Plot histogram
data = {
    'price': [14990, 15990, 19500, 12990, 14990, 13990, 12500, 7500, 5000, 3890, 8000, 11499, 5100, 11200, 10400, 30300, 13000, 16999, 14950, 1900, 23800, 16500, 11950, 4826, 18000],
    'odometer': [57954, 109473, 128413, 132285, 130725, 100669, 128325, 180000, 137273, 300000, 234000, 54772, 188000, 90302, 111871, 30339, 146000, 137230, 114773, 207, 10899, 123262, 170000, 226111, 50500],
    'model_year': [2014, 2013, 2011, 2009, 2010, 2014, 2013, 2004, 2009, 2011, 2009, 2017, 2008, 2015, 2012, 2017, 2005, 2013, 2012, 1994, 2019, 2012, 2008, 2000, 2010]
}
df = pd.DataFrame(data)

In [74]:
# Histogram for Price
fig_price = px.histogram(df, x='price', nbins=20, title='Histogram 1: Distribution of Vehicle Prices')
fig_price.show()
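
The ranges quoted in the Histogram 1 analysis can be checked numerically. The sketch below uses the same 25 sample prices that `df` holds (not the full `vehicle_data`), and the 5k, 20k, and 25k cut-offs are taken from the analysis text.

```python
import pandas as pd

# The 25 sample prices used for the histogram.
prices = pd.Series([
    14990, 15990, 19500, 12990, 14990, 13990, 12500, 7500, 5000, 3890,
    8000, 11499, 5100, 11200, 10400, 30300, 13000, 16999, 14950, 1900,
    23800, 16500, 11950, 4826, 18000,
])

# Share of vehicles priced between 5k and 20k (inclusive on both ends).
in_range = prices.between(5_000, 20_000)
share_in_range = in_range.mean()

# Number of vehicles priced above 25k.
above_25k = int((prices > 25_000).sum())

print(f'{in_range.sum()} of {len(prices)} prices are in the 5k-20k range '
      f'({share_in_range:.0%}); {above_25k} is above 25k')
```

Running the same check against `vehicle_data['price']` would show whether the pattern holds beyond the sample.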

### Histogram 2 Analysis:
The histogram shows that most vehicles have odometer readings between 50k and 150k miles, with the highest concentration around 100k–150k miles. Few vehicles have lower (under 50k) or very high (over 200k) mileage, indicating that the dataset mainly consists of mid-to-high mileage vehicles. Some high-mileage vehicles near 300k miles may represent outliers or older, heavily used cars.

In [75]:
# Histogram for Odometer
fig_odometer = px.histogram(df, x='odometer', nbins=20, title='Histogram 2: Distribution of Odometer Readings')
fig_odometer.show()

### Histogram 3 Analysis:
The histogram shows that most vehicles in the dataset are from model years 2005 to 2015, with the highest concentration around 2010 to 2015. Older models (pre-2005) are less common, indicating the dataset primarily consists of more recent vehicles.

In [76]:
# Histogram for Model Year
fig_model_year = px.histogram(df, x='model_year', nbins=20, title='Histogram 3: Distribution of Model Years')
fig_model_year.show()

### Scatter Plot 1: Analysis
The scatter plot shows the relationship between price and odometer readings. Based on the plot, cars priced between 10k and 20k mostly have odometer readings between 100k and 150k miles.


In [77]:
# Define a function to remove outliers based on IQR
def remove_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    # Keep rows whose values lie between Q1 - 1.5*IQR and Q3 + 1.5*IQR
    return df[(df[column] >= Q1 - 1.5 * IQR) & (df[column] <= Q3 + 1.5 * IQR)]

# Apply the function to the sample DataFrame `df` built earlier, so the
# scatter plots below exclude outliers in `model_year` and `price`
df = remove_outliers(df, 'model_year')
df = remove_outliers(df, 'price')
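
As a worked example of the IQR rule used by `remove_outliers`, the sketch below computes the bounds for the sample `model_year` values and lists which years fall outside them. The helper name `iqr_bounds` is hypothetical and introduced only for this illustration.

```python
import pandas as pd

# The 25 sample model years.
model_years = pd.Series([
    2014, 2013, 2011, 2009, 2010, 2014, 2013, 2004, 2009, 2011,
    2009, 2017, 2008, 2015, 2012, 2017, 2005, 2013, 2012, 1994,
    2019, 2012, 2008, 2000, 2010,
])

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_bounds(model_years)

# Values outside the bounds are the ones the rule treats as outliers.
outliers = sorted(model_years[(model_years < lower) | (model_years > upper)])
kept = model_years[(model_years >= lower) & (model_years <= upper)]

print(f'bounds: {lower} to {upper}')
print(f'outliers: {outliers}, rows kept: {len(kept)}')
```

Note that the bounds are measured from the quartiles rather than from the median, so a skewed column can have quite asymmetric limits.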


In [78]:
# Scatter plot for Price vs. Odometer

fig_price_odometer = px.scatter(df, x='odometer', y='price', title='Scatter Plot 1: Price vs. Odometer', labels={'odometer': 'Odometer (miles)', 'price': 'Price ($)'})

fig_price_odometer.show()
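
A correlation coefficient gives a one-number summary of the trend in the scatter plot. The sketch below computes the Pearson correlation between `price` and `odometer` on the same 25-row sample. The variable names are illustrative, and with so few points the result should be read as indicative only.

```python
import pandas as pd

# The 25 sample rows used for the scatter plot.
sample = pd.DataFrame({
    'price': [14990, 15990, 19500, 12990, 14990, 13990, 12500, 7500, 5000,
              3890, 8000, 11499, 5100, 11200, 10400, 30300, 13000, 16999,
              14950, 1900, 23800, 16500, 11950, 4826, 18000],
    'odometer': [57954, 109473, 128413, 132285, 130725, 100669, 128325,
                 180000, 137273, 300000, 234000, 54772, 188000, 90302,
                 111871, 30339, 146000, 137230, 114773, 207, 10899, 123262,
                 170000, 226111, 50500],
})

# Pearson correlation (the pandas default) between price and mileage.
r = sample['price'].corr(sample['odometer'])

print(f'Pearson r between price and odometer: {r:.2f}')
```

A negative value means that higher mileage tends to go with lower price in this sample. The vehicle with 207 miles and a price of 1,900 runs against that trend, which is one reason to look at outliers before interpreting the plot.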

### Scatter Plot 2: Analysis

The scatter plot displays the relationship between odometer reading and model year. Vehicles from model years 2007 to 2015 are, on average, above 100k miles. This suggests that vehicles 10 years and older tend to have higher odometer readings, which gives more insight into the life cycle of a vehicle.

In [79]:
# Scatter plot for Odometer vs. Model Year

fig_odometer_model_year = px.scatter(df, x='model_year', y='odometer', title='Scatter Plot 2: Odometer vs. Model Year', labels={'model_year': 'Model Year', 'odometer': 'Odometer (miles)'})

fig_odometer_model_year.show()
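
The claim about vehicles 10 years and older can be checked by splitting the sample into two age groups and comparing median mileage. This is a sketch under one stated assumption: vehicle age is measured against 2019, the newest model year in the data, because the listing date is not used here. The names `REFERENCE_YEAR` and `age_group` are introduced for illustration.

```python
import pandas as pd

# The 25 sample rows used for the scatter plot.
sample = pd.DataFrame({
    'model_year': [2014, 2013, 2011, 2009, 2010, 2014, 2013, 2004, 2009,
                   2011, 2009, 2017, 2008, 2015, 2012, 2017, 2005, 2013,
                   2012, 1994, 2019, 2012, 2008, 2000, 2010],
    'odometer': [57954, 109473, 128413, 132285, 130725, 100669, 128325,
                 180000, 137273, 300000, 234000, 54772, 188000, 90302,
                 111871, 30339, 146000, 137230, 114773, 207, 10899, 123262,
                 170000, 226111, 50500],
})

# Assumption: age is measured relative to the newest model year, 2019.
REFERENCE_YEAR = 2019
sample['age'] = REFERENCE_YEAR - sample['model_year']

# Label each vehicle as 10+ years old or younger.
sample['age_group'] = (sample['age'] >= 10).map(
    {True: '10+ years', False: 'under 10 years'}
)

# Median odometer reading per age group.
medians = sample.groupby('age_group')['odometer'].median()

print(medians)
```

The median is used instead of the mean so that single extreme readings, such as the 300,000-mile vehicle, do not dominate the comparison.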

### Conclusions

Based on the histograms and scatter plots, we can conclude that the value of a car depends on several factors, but three metrics are the most useful: odometer reading, price, and model year. Together, these three metrics can help identify the best-valued car.