# Used Car Project

## Project Description
I retrieved my data from Kaggle: https://www.kaggle.com/datasets/ananaymital/us-used-cars-dataset 
This is a large dataset about used cars. I will be creating a machine learning model to predict the price of used cars.

## Goals
- Discover the drivers of car value
- Use those drivers to develop an ML model that predicts car value
- Deliver a report to a technical data science team

In [None]:
import wrangle as w
import functions as f
import prepare as p

## Acquire
- Data gathered from Kaggle
- Each row represents a vehicle
- Each column represents a feature of the vehicle
- The raw dataset started at roughly 3 million rows and 66 columns

## Prepare

### Prepare Actions:
- Removed columns that did not contain useful information
- Removed null values
- Checked that column data types were appropriate
- Split data into train, validate and test (approx. 60/20/20), stratifying on 'price'
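The 60/20/20 split described above can be sketched as follows. The body of `p.train_val_test` is not shown in this notebook, so this is only an illustration under assumptions: it uses scikit-learn's `train_test_split`, and because `price` is continuous it stratifies on quantile bins of `price` rather than on the raw values. The function name `train_val_test_sketch`, the bin count, and the synthetic data are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def train_val_test_sketch(df, target="price", n_bins=5, seed=42):
    """Split df into roughly 60/20/20 train/validate/test sets."""
    # A continuous target cannot be stratified on directly, so bin it
    # into quantiles first (assumption: the real helper may differ).
    bins = pd.qcut(df[target], q=n_bins, labels=False, duplicates="drop")

    # First cut: hold out 20% of the rows as the test set.
    train_val, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=bins
    )
    # Second cut: 25% of the remaining 80% gives a 20% validate set.
    train, val = train_test_split(
        train_val,
        test_size=0.25,
        random_state=seed,
        stratify=bins.loc[train_val.index],
    )
    return train, val, test


# Synthetic data for illustration only (not the Kaggle dataset).
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {
        "price": rng.uniform(5_000, 60_000, 1_000),
        "mileage": rng.uniform(0, 150_000, 1_000),
    }
)
train, val, test = train_val_test_sketch(demo)
print(len(train), len(val), len(test))
```

Splitting in two passes keeps the test set untouched while the validate set is carved out of what remains.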

## Data Dictionary

| Target Variable | Definition|
|-----------------|-----------|
| price | The total price of the vehicle |

| Feature  | Definition |
|----------|------------|
| back_legroom |  Legroom in the rear seat |
| body_type | Body type of the vehicle (Hatchback, Sedan, Convertible, etc.) |
| city | City where the car is listed |
| city_mpg | Fuel economy in city traffic (km/L) |
| daysonmarket | Number of days since vehicle was listed |
| cyl | Engine configuration (I4, V6, etc.) |
| displ | Engine displacement |
| dealer | Whether the seller is a dealer |
| front_legroom | Legroom in inches for the passenger seat |
| tank_size | Fuel tank's max capacity in gallons |
| fuel_type | Dominant type of fuel used in vehicle |
| accidents | Whether the VIN has any accidents registered |
| hiway_mpg | Fuel economy in highway traffic (km/L) |
| horsepower | Horsepower of the vehicle |
| new | Whether the vehicle is new |
| length | Length of vehicle in inches |
| listed_date | Date the vehicle was first listed |
| model | Model of the car |
| seats | Number of seats in the vehicle |
| mileage | The odometer reading on the vehicle |
| owners | Number of owners the vehicle has had |
| seller_rating | Rating of the seller |
| tran | Transmission type of the vehicle (Manual, Auto, etc.) |
| drive_type | The drive train of the vehicle (FWD, RWD, etc.) |
| wheelbase | Measurement of wheelbase in inches |
| width | Width of the vehicle in inches |
| year | Year the car was made |

In [None]:
# acquiring and cleaning the data
# (called through the `w` alias, since wrangle is imported as `w` above)
cars = w.wrangle_cars()

# splitting the data into train, validate and test
train, val, test = p.train_val_test(cars)

# taking a quick look at the data
train.head()

## Does horsepower affect the price of the vehicle?

**Ho: Horsepower and price are independent of each other.**  
**Ha: Horsepower and price are related.**
- Determine whether horsepower and price are related
- Confidence level of 95%
- Alpha of .05

In [None]:
# visualization of horsepower compared to price in a relplot
f.horsepower_plot(train, 'horsepower','price')

In [None]:
# Pearsonr stats test to determine if the two variables are related.
f.pearson_test(train, 'horsepower')
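The body of `f.pearson_test` is not shown in this notebook. A minimal sketch of what such a helper might do, assuming it wraps `scipy.stats.pearsonr` and compares the p-value to the alpha of .05 stated above, could look like this. The name `pearson_test_sketch` and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

ALPHA = 0.05  # significance level stated in the hypothesis setup


def pearson_test_sketch(df, feature, target="price", alpha=ALPHA):
    """Return (r, p, reject_null) for a Pearson correlation test."""
    r, p = stats.pearsonr(df[feature], df[target])
    # Reject the null of no linear relationship when p falls below alpha.
    return r, p, p < alpha


# Synthetic data for illustration only (not the Kaggle dataset).
rng = np.random.default_rng(0)
horsepower = rng.uniform(100, 400, 500)
price = 80 * horsepower + rng.normal(0, 3_000, 500)
demo = pd.DataFrame({"horsepower": horsepower, "price": price})

r, p, reject = pearson_test_sketch(demo, "horsepower")
print(f"r = {r:.3f}, p = {p:.3g}, reject null: {reject}")
```

Note that Pearson's r only measures linear association, so a scatter plot is still worth checking alongside the test.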

## Takeaways:

- The Pearson test rejects the null hypothesis, which indicates that horsepower and price are related.
- The plot shows a positive correlation between horsepower and price.

## Does mileage affect price of the vehicle?
**Ho: The mileage and price of the vehicle are independent of each other.**  
**Ha: Mileage and price are related.**

- Determine whether mileage and price are related
- Confidence level of 95%
- Alpha of .05

In [None]:
# visualizing the price of the vehicle based on mileage
f.mileage_plot(train, 'mileage', 'price')

In [None]:
# testing for a relationship between mileage and price
f.pearson_test(train, 'mileage')

## Takeaways:

- The Pearson test rejects the null hypothesis, which indicates a relationship between mileage and the price of a vehicle.

## Do the dimensions of the vehicle affect price?
**Ho: The width and price of the vehicle are independent of each other.**  
**Ha: The width and price are related.**

- Determine whether width and price are related
- Confidence level of 95%
- Alpha of .05

In [None]:
# visualizing the correlation between width and price
f.width_plot(train, 'width', 'price')

In [None]:
# testing for a relationship between width and price
f.pearson_test(train, 'width')

**Ho: The length and price of the vehicle are independent of each other.**  
**Ha: Length and price are related.**

- Determine whether length and price are related
- Confidence level of 95%
- Alpha of .05

In [None]:
# visualizing the correlation between length and price
f.length_plot(train, 'length', 'price')

In [None]:
# testing for a relationship between length and price
f.pearson_test(train, 'length')

## Takeaways:

- The Pearson test rejects the null hypothesis, which indicates a relationship between width and price.
- The Pearson test rejects the null hypothesis, which indicates a relationship between length and price.

## Does whether the car is sold by a dealer affect price?
**Ho: A vehicle being sold by a dealer and price of the vehicle are independent of each other.**  
**Ha: A vehicle being sold by a dealer and price are related.**

- Determine whether being sold by a dealer and price are related
- Confidence level of 95%
- Alpha of .05

In [None]:
# visualization of the comparison between whether a car is sold by a dealer and price
f.dealer_plot(train, 'dealer', 'price')

In [None]:
# t-test comparing the price of vehicles sold by a dealer with those that are not
f.ttest_samp(train, 'dealer')
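The body of `f.ttest_samp` is not shown here. One plausible reading is an independent two-sample t-test on `price`, grouped by `dealer`. The sketch below assumes `dealer` is a boolean column and uses Welch's t-test (`equal_var=False`), which does not assume equal variances; both choices are assumptions, as are the name `ttest_sketch` and the synthetic data.

```python
import numpy as np
import pandas as pd
from scipy import stats


def ttest_sketch(df, group_col, target="price", alpha=0.05):
    """Compare the mean of `target` between the two groups of a boolean column."""
    in_group = df.loc[df[group_col], target]
    out_group = df.loc[~df[group_col], target]
    # Welch's t-test: does not assume the two groups share a variance.
    t, p = stats.ttest_ind(in_group, out_group, equal_var=False)
    return t, p, p < alpha


# Synthetic data for illustration only (not the Kaggle dataset).
rng = np.random.default_rng(0)
dealer = rng.random(1_000) < 0.5
price = 20_000 + 3_000 * dealer + rng.normal(0, 5_000, 1_000)
demo = pd.DataFrame({"dealer": dealer, "price": price})

t, p, reject = ttest_sketch(demo, "dealer")
print(f"t = {t:.2f}, p = {p:.3g}, reject null: {reject}")
```

A positive t statistic here means the dealer group has the higher mean price.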

## Takeaways
- The t-test rejects the null hypothesis, which indicates a relationship between whether a vehicle is sold by a dealer and its price.

## Exploration Summary
- Horsepower and price are positively correlated. As horsepower increases, so does price.
- Mileage and price are negatively correlated. As mileage increases, price decreases.
- Width and length are positively correlated with price. As the vehicle gets larger, the price increases.
- Whether a car is sold by a dealer is related to price. The average price is higher for vehicles sold by a dealer.

# Modeling
- I will be using RMSE as the evaluation metric
- The baseline RMSE is 10888
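For reference, a baseline RMSE is commonly obtained by predicting a single constant for every row. The sketch below assumes that constant is the mean of the target; the notebook does not say whether `f.preds_table` uses the mean or the median, so treat this as one possible reading. The toy values are made up.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy prices for illustration only.
y_demo = np.array([10_000.0, 20_000.0, 30_000.0])

# Baseline: predict the mean price for every vehicle.
baseline_pred = np.full_like(y_demo, y_demo.mean())
baseline = rmse(y_demo, baseline_pred)
print(round(baseline, 2))
```

With a mean baseline, the RMSE equals the population standard deviation of the target, which gives a quick sanity check on the reported number.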

In [None]:
# splitting the data again, dropping columns not used in modeling, and scaling
X_train, y_train, X_val, y_val, X_test, y_test = f.split_scale(cars)
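The scaling step inside `f.split_scale` is not shown. A minimal sketch of one common approach, assuming min-max scaling, is to fit the scaler on the training features only and then apply it to validate and test, so that no information leaks from the held-out sets. The name `scale_sketch` and the tiny frames are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def scale_sketch(X_train, X_val, X_test):
    """Min-max scale all three sets using statistics from train only."""
    scaler = MinMaxScaler().fit(X_train)

    def _apply(X):
        # Keep column names and index so the result stays a DataFrame.
        return pd.DataFrame(scaler.transform(X), columns=X.columns, index=X.index)

    return _apply(X_train), _apply(X_val), _apply(X_test)


# Toy features for illustration only.
X_tr = pd.DataFrame({"horsepower": [100.0, 200.0, 300.0],
                     "mileage": [0.0, 50_000.0, 100_000.0]})
X_va = pd.DataFrame({"horsepower": [150.0], "mileage": [120_000.0]})
X_te = pd.DataFrame({"horsepower": [250.0], "mileage": [25_000.0]})

X_tr_s, X_va_s, X_te_s = scale_sketch(X_tr, X_va, X_te)
print(X_va_s)
```

Values outside the training range, such as the validate mileage above, scale to outside [0, 1]; that is expected when the scaler is fit on train alone.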

In [None]:
# creating prediction table
preds, baseline_rmse = f.preds_table(y_train)

In [None]:
# linear regression: add its train predictions to the table and get its RMSE
preds, lm_rmse = f.linear_reg(X_train, y_train, preds)
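The modeling helpers share one pattern: take the training data and the predictions table, then return the updated table and an RMSE. A minimal sketch of that pattern for ordinary least squares follows. It assumes the table has an `actual` column holding the true prices; that column name, the `lm` column, and the function name `linear_reg_sketch` are hypothetical, since the real `f.linear_reg` is not shown.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def linear_reg_sketch(X_train, y_train, preds):
    """Fit OLS on train, store its predictions, and return the train RMSE."""
    model = LinearRegression().fit(X_train, y_train)
    preds = preds.copy()  # avoid mutating the caller's table
    preds["lm"] = model.predict(X_train)
    rmse = float(np.sqrt(np.mean((preds["actual"] - preds["lm"]) ** 2)))
    return preds, rmse


# Toy data with an exactly linear relationship, for illustration only.
X_demo = pd.DataFrame({"horsepower": [100.0, 150.0, 200.0, 250.0]})
y_demo = pd.Series(100.0 * X_demo["horsepower"] + 5_000.0, name="price")
preds_demo = pd.DataFrame({"actual": y_demo})

preds_demo, lm_rmse_demo = linear_reg_sketch(X_demo, y_demo, preds_demo)
print(preds_demo)
print(lm_rmse_demo)
```

Because the toy target is an exact linear function of the feature, the RMSE here is effectively zero; real data will not behave this way.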

In [None]:
# lasso regression: add its train predictions to the table and get its RMSE
preds, lasso_rmse = f.lasso(X_train, y_train, preds)

In [None]:
# linear regression on polynomial features
preds, poly_rmse = f.lm_poly(X_train, y_train, preds)
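For readers unfamiliar with polynomial regression, the idea is to expand the features into polynomial terms and then fit an ordinary linear model on the expanded set. The sketch below assumes degree 2 and uses a scikit-learn pipeline; the degree used by `f.lm_poly` is not stated in this notebook, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


# Synthetic data with a quadratic relationship, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = 3 * X[:, 0] ** 2 + rng.normal(0, 0.01, 200)

# Degree-2 expansion followed by ordinary least squares (degree is an assumption).
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False), LinearRegression()
)
poly_model.fit(X, y)

# Plain linear model on the raw feature, for comparison.
linear_model = LinearRegression().fit(X, y)

poly_rmse_demo = rmse(y, poly_model.predict(X))
linear_rmse_demo = rmse(y, linear_model.predict(X))
print(f"linear: {linear_rmse_demo:.4f}, polynomial: {poly_rmse_demo:.4f}")
```

Wrapping the expansion and the regressor in one pipeline ensures the same transformation is applied when predicting on validate or test.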

In [None]:
# lasso regression on polynomial features
preds, lassopoly_rmse = f.lasso_poly(X_train, y_train, preds)

In [None]:
# XGBoost model
preds, xgb_rmse = f.xgb_model(X_train, y_train, preds)

In [None]:
# collecting the train RMSE of the baseline and every model in one table
rmse_df = f.rmse_table(baseline_rmse, lm_rmse, lasso_rmse, poly_rmse, lassopoly_rmse, xgb_rmse)

In [None]:
# plotting train RMSE by model
f.rmse_graph(rmse_df)

In [None]:
# fitting on train and predicting on validate
val_preds = f.val_tests(X_train, y_train, X_val, y_val)

In [None]:
# calculating RMSE on validate
val_rmse = f.val_rmse(val_preds)
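The validation step can be sketched as fitting each candidate on train and scoring it on both train and validate; a large gap between the two RMSE values suggests overfitting. The helper `validate_models`, the two candidate models, and the synthetic data below are hypothetical and only illustrate the pattern, not the contents of `f.val_tests` or `f.val_rmse`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, LinearRegression


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


def validate_models(models, X_train, y_train, X_val, y_val):
    """Fit each model on train and report RMSE on train and validate."""
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        rows.append(
            {
                "model": name,
                "train_rmse": rmse(y_train, model.predict(X_train)),
                "val_rmse": rmse(y_val, model.predict(X_val)),
            }
        )
    # Best validate score first.
    return pd.DataFrame(rows).sort_values("val_rmse").reset_index(drop=True)


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))
y = 1_000 * X[:, 0] + 500 * X[:, 1] + rng.normal(0, 50, 500)
X_tr, X_va, y_tr, y_va = X[:400], X[400:], y[:400], y[400:]

results = validate_models(
    {"linear": LinearRegression(), "lasso": Lasso(alpha=1.0)},
    X_tr, y_tr, X_va, y_va,
)
print(results)
```

Only the model chosen on validate should then be scored on the test set, and only once.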

In [None]:
# plotting validate RMSE
f.val_plot(val_rmse)

In [None]:
# evaluating on the held-out test set
f.test_set(X_train, y_train, X_test, y_test)