In [None]:
# What drives the price of a car?

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars; the provided dataset is a subset of 426K cars to keep processing fast. Your goal is to understand what factors make a car more or less expensive. As a result of your analysis, you should provide clear recommendations to your client, a used car dealership, about what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src="../images/crisp.png" width="50%"/>
</center>


To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

In [None]:
import pandas as pd
import numpy as np
# For plotting
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from matplotlib import rcParams



In [None]:
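# Load the vehicles.csv file (reading an s3:// path with pandas requires the s3fs package)
vehicles = pd.read_csv('s3://ml-ai-bucket/vehicles.csv')
# Column names, dtypes, and non-null counts
vehicles.info()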

In [None]:
vehicles.sample(10)

In [None]:
vehicles.describe()

### Data Understanding

After considering the business understanding, we want to get familiar with our data. Write down some steps that you would take to get to know the dataset and identify any quality issues within it. Take time to explore what information the dataset contains and how it could be used to inform your business understanding.
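One possible first step is a per-column quality summary. The sketch below is only an illustration: `summarize_quality` is a hypothetical helper, and `toy` is a made-up stand-in for the vehicles data so the example runs without access to the original file.

```python
import pandas as pd

def summarize_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing count, missing percent, and distinct-value count."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "pct_missing": (df.isnull().mean() * 100).round(1),
        "n_unique": df.nunique(),  # NaN is not counted as a distinct value
    })
    # Columns with the most missing data come first
    return summary.sort_values("pct_missing", ascending=False)

# Toy stand-in for the vehicles data (values are invented for illustration)
toy = pd.DataFrame({
    "price": [4500, 0, 12000, 8900],
    "manufacturer": ["ford", None, "toyota", "ford"],
    "size": [None, None, None, "compact"],
})

quality = summarize_quality(toy)
print(quality)
print("Duplicate rows:", toy.duplicated().sum())
```

On the real data, the same call would be `summarize_quality(vehicles)`. Columns near the top of the result are candidates for dropping or imputing later.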

In [None]:
# Count missing values in each column
vehicles.isnull().sum()

In [None]:
vehicles["price"].hist(bins=20)
plt.show()
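If the price column contains extreme values (this has not been verified here), a fixed number of bins over the full range can hide most of the distribution. The sketch below shows one way to trim to a quantile range before binning. `trim_by_quantile` is a hypothetical helper, the quantile cutoffs are arbitrary, and the prices are invented.

```python
import numpy as np
import pandas as pd

def trim_by_quantile(s: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Keep only the values between the given lower and upper quantiles (inclusive)."""
    lo, hi = s.quantile([lower, upper])
    return s[(s >= lo) & (s <= hi)]

# Invented prices: one zero listing and one implausibly large listing
prices = pd.Series([0, 5000, 7000, 9000, 12000, 15000, 3_000_000])

# Wide cutoffs are used here only because the toy sample is tiny
trimmed = trim_by_quantile(prices, lower=0.1, upper=0.9)

# Bin the trimmed values; the same counts could be drawn with trimmed.hist(bins=5)
counts, edges = np.histogram(trimmed, bins=5)
print(trimmed.tolist())
print(counts, edges)
```

Trimming is for visualization only in this sketch; whether such rows should be removed from the data is a separate decision for the data preparation step.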

In [None]:
# Average price by manufacturer
avg_price_by_manufacturer = vehicles.groupby("manufacturer")["price"].mean()
avg_price_by_manufacturer.plot(kind="bar", title="Average price by Manufacturer", figsize=(15,15))

In [None]:
# Average price by cylinders
avg_price_by_cylinders = vehicles.groupby("cylinders")["price"].mean()
avg_price_by_cylinders.plot(kind="bar", title="Average price by Cylinders", figsize=(20,20))
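A group mean can be pulled far from the typical value by a single extreme listing, so it may help to look at the group size and median alongside it. The following is a minimal sketch on invented data, not a result from the vehicles dataset.

```python
import pandas as pd

# Invented listings; the 1,000,000 price stands in for a possible outlier
toy = pd.DataFrame({
    "manufacturer": ["ford", "ford", "toyota", "toyota", "toyota", None],
    "price": [5000, 7000, 9000, 11000, 1_000_000, 8000],
})

# groupby drops rows whose key is missing by default, so the None row is excluded
stats = (
    toy.groupby("manufacturer")["price"]
       .agg(["count", "mean", "median"])
       .sort_values("median", ascending=False)
)
print(stats)
```

Here the toyota mean is 340000 while its median is 11000, which is the kind of gap that suggests checking for outliers before interpreting a bar chart of means.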

In [None]:
vehicles.info()

In [None]:
# Print the number of unique values in each categorical column
print(f'Region: {vehicles["region"].nunique()}')
print(f'Manufacturer: {vehicles["manufacturer"].nunique()}')
print(f'Model: {vehicles["model"].nunique()}')
print(f'Condition: {vehicles["condition"].nunique()}')
print(f'Cylinders: {vehicles["cylinders"].nunique()}')
print(f'Fuel: {vehicles["fuel"].nunique()}')
print(f'Title Status: {vehicles["title_status"].nunique()}')
print(f'Drive: {vehicles["drive"].nunique()}')
print(f'Size: {vehicles["size"].nunique()}')
print(f'Type: {vehicles["type"].nunique()}')
print(f'Paint Color: {vehicles["paint_color"].nunique()}')

In [None]:
# Print the most frequent value(s) in each categorical column.
# mode() returns a Series (ties give several values), so convert it to a list for readable output.
print(f'Region: {vehicles["region"].mode().tolist()}')
print(f'Manufacturer: {vehicles["manufacturer"].mode().tolist()}')
print(f'Model: {vehicles["model"].mode().tolist()}')
print(f'Condition: {vehicles["condition"].mode().tolist()}')
print(f'Cylinders: {vehicles["cylinders"].mode().tolist()}')
print(f'Fuel: {vehicles["fuel"].mode().tolist()}')
print(f'Title Status: {vehicles["title_status"].mode().tolist()}')
print(f'Drive: {vehicles["drive"].mode().tolist()}')
print(f'Size: {vehicles["size"].mode().tolist()}')
print(f'Type: {vehicles["type"].mode().tolist()}')
print(f'Paint Color: {vehicles["paint_color"].mode().tolist()}')
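The mode ignores missing values, so it can be misleading when a column is mostly empty. As a hedged illustration on invented data, `value_counts` with `dropna=False` shows the share of each category, including missing entries.

```python
import pandas as pd

# Invented column in which missing values outnumber any single category
toy = pd.DataFrame({"condition": ["good", "good", "excellent", None, None, None]})

# mode() skips NaN and reports the most frequent non-missing value
mode_values = toy["condition"].mode().tolist()

# Shares per category, with NaN kept as its own row and sorted by frequency
shares = toy["condition"].value_counts(dropna=False, normalize=True)

print(mode_values)
print(shares)
```

In this toy column the mode is "good", yet half of the rows have no condition at all, which matters when deciding whether to impute with the mode.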

In [None]:
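# Save the data under a new name. No cleaning steps have been applied above,
# so this file has the same contents as the loaded data.
# index=False avoids writing the row index as an extra unnamed column.
vehicles.to_csv('s3://ml-ai-bucket/vehicles_cleaned.csv', index=False)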