# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars; the provided dataset is a 426K-car subset that keeps processing fast. Your goal is to understand what factors make a car more or less expensive. As a result of your analysis, you should provide clear recommendations to your client, a used car dealership, as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src="images/crisp.png" width="50%"/>
</center>


To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

The business problem can be reframed as a supervised learning task, specifically regression: build a model that estimates the price of a used car from features such as make, model, year, mileage, and condition. The coefficients or feature importances of the trained model then indicate which attributes are most strongly associated with sale price (coefficients are only comparable across features once the features share a common scale). These key drivers can help a used car dealership curate inventory and set prices.
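As a minimal sketch of reading key drivers off a regression model, the example below fits a linear model on standardized features and ranks the coefficients by magnitude. The data is synthetic, and the column names (`year`, `odometer`, `cylinders`) and the price formula are assumptions made for illustration, not properties of the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: column names and the price formula are assumptions.
rng = np.random.default_rng(0)
n = 500
cars = pd.DataFrame({
    "year": rng.integers(2000, 2022, size=n),
    "odometer": rng.uniform(5_000, 250_000, size=n),
    "cylinders": rng.choice([4, 6, 8], size=n),
})
cars["price"] = (
    800 * (cars["year"] - 2000)
    - 0.05 * cars["odometer"]
    + 500 * cars["cylinders"]
    + rng.normal(0, 1_000, size=n)
)

features = ["year", "odometer", "cylinders"]

# Standardize so each coefficient is "price change per one standard deviation".
X = StandardScaler().fit_transform(cars[features])
model = LinearRegression().fit(X, cars["price"])

# Rank features by absolute coefficient size; the sign gives the direction.
coef_ranking = (
    pd.Series(model.coef_, index=features)
    .sort_values(key=np.abs, ascending=False)
)
print(coef_ranking)
```

A ranking like this describes association within the fitted model, not causation, so it is best treated as a starting hypothesis for the dealership rather than a final answer.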

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

1.  **Load the dataset:** Start by loading the dataset into a pandas DataFrame.

2.  **Initial inspection:**
    *   Use `.head()` and `.tail()` to see the first and last few rows.
    *   Use `.info()` to get a summary of the DataFrame, including column names, data types, and non-null counts. This helps in identifying missing values.
    *   Use `.describe()` to get descriptive statistics for numerical columns (count, mean, std, min, max, quartiles).
    *   Use `.shape` to see the number of rows and columns.
    *   Use `.columns` to list all column names.

3.  **Examine data types:** Check if the data types are appropriate for each column. For example, numerical features should be stored as integers or floats, and categorical features as pandas `object` (string) or `category` columns.

4.  **Check for missing values:**
    *   Use `.isnull().sum()` to count the number of missing values per column.
    *   Visualize missing values using a heatmap or bar chart to understand their distribution.

5.  **Explore categorical features:**
    *   Use `.value_counts()` for each categorical column to see the frequency of each unique category.
    *   Identify categories with very low frequencies or a large number of unique values (high cardinality).

6.  **Explore numerical features:**
    *   Visualize the distribution of numerical features using histograms or box plots. This helps in understanding the spread, skewness, and potential outliers.
    *   Calculate the correlation matrix between numerical features to identify potential multicollinearity.

7.  **Identify potential quality issues:**
    *   **Missing data:** As identified in step 4.
    *   **Inconsistent data:** Look for variations in how categories are spelled or represented (e.g., "New York" vs. "NY").
    *   **Outliers:** Identify extreme values in numerical features using box plots or statistical methods.
    *   **Incorrect data types:** Ensure columns are stored in the correct data type.
    *   **Duplicate rows:** Check for and remove duplicate entries in the dataset.

8.  **Relate data to business understanding:** Consider how each feature might influence car prices. For example, what is the expected relationship between mileage and price? How might the car's make or model affect its value? Use this exploration to form hypotheses about which features will be most important for the price prediction model.
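The inspection steps listed above (shape, summary, missing values, category frequencies) can be sketched as follows. The small DataFrame is a hypothetical stand-in for the real data, which would normally be loaded with `pd.read_csv`; its column names and values are assumptions.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical sample; the real data would come from pd.read_csv(...).
df = pd.DataFrame({
    "price": [15000, 7200, np.nan, 31000, 7200],
    "year": [2015, 2008, 2012, 2020, 2008],
    "manufacturer": ["ford", "toyota", None, "ford", "toyota"],
    "condition": ["good", "fair", "good", None, "fair"],
})

# Initial inspection: size, dtypes with non-null counts, summary statistics.
print(df.shape)
df.info()
print(df.describe())

# Share of missing values per column, largest first.
missing_share = df.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Categorical columns: number of distinct values and frequency of each.
cat_cols = df.select_dtypes(exclude="number").columns
cardinality = df[cat_cols].nunique()
print(cardinality)
for col in cat_cols:
    print(df[col].value_counts(dropna=False))
```

Passing `dropna=False` to `value_counts` keeps missing entries visible as their own category, which makes it easier to spot columns where "unknown" is the most common value.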

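```python
# Core libraries for data handling and plotting
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

Running this cell raised `ModuleNotFoundError: No module named 'pandas'`, which means the packages are not installed in the active environment. Install them (for example with `pip install pandas matplotlib seaborn`) and rerun the cell.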
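For the numerical checks described in the data understanding steps (duplicates, skewness, outliers, correlation), a plot-free sketch might look like the one below. The values and column names are made up for illustration, and the 1.5 × IQR rule is one common outlier heuristic, not the only option.

```python
import pandas as pd

# Hypothetical listings: one exact duplicate row and one implausible price.
df = pd.DataFrame({
    "price": [4500, 8000, 12000, 15500, 21000, 9000, 9000, 1_250_000],
    "odometer": [180000, 140000, 95000, 60000, 30000, 120000, 120000, 50000],
    "year": [2005, 2009, 2013, 2016, 2019, 2011, 2011, 2017],
})

# Duplicate rows: count them, then drop them.
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates()

# Skewness per column; a large positive value signals a long right tail.
skewness = df.skew()
print(skewness)

# Flag price outliers with the 1.5 * IQR rule.
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]
print(outliers)

# Pairwise correlations; strongly correlated predictors hint at multicollinearity.
corr = df.corr()
print(corr.round(2))
```

The same quantities can be inspected visually with histograms or box plots; the numeric version is shown here because it runs without a plotting backend.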

### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 
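One possible shape for this preparation step is sketched below: filter out invalid prices, engineer an `age` feature, log-transform the target, and wrap imputation, scaling, and one-hot encoding in a `ColumnTransformer`. The sample rows, the column names, and the specific choices (median imputation, `log1p` on price) are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical sample with a zero price and some missing values.
df = pd.DataFrame({
    "price": [4500, 0, 12000, 15500, 21000, 9000],
    "year": [2005, 2009, 2013, 2016, 2019, 2011],
    "odometer": [180000, 140000, np.nan, 60000, 30000, 120000],
    "fuel": ["gas", "gas", "diesel", np.nan, "electric", "gas"],
})

# Integrity: a listing with a non-positive price cannot inform a price model.
clean = df[df["price"] > 0].copy()

# Feature engineering: age relative to the newest model year in the data.
clean["age"] = clean["year"].max() - clean["year"]

# Target transformation: log1p compresses the long right tail of prices.
y = np.log1p(clean["price"])

numeric_cols = ["age", "odometer"]
categorical_cols = ["fuel"]

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

X = preprocess.fit_transform(clean[numeric_cols + categorical_cols])
print(X.shape)
```

Keeping the preprocessing inside a transformer object means it can later be combined with an estimator in a single pipeline, so imputation and scaling are refit on each training fold during cross-validation rather than leaking information from held-out rows.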

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.
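A minimal sketch of this modeling loop, assuming a log-price target and two numeric features, compares a plain linear regression against a ridge regression whose `alpha` is tuned by grid search. The data is synthetic, and the feature names, alpha grid, and 5-fold setup are illustrative choices rather than recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and log-price target; names and formula are assumptions.
rng = np.random.default_rng(42)
n = 300
X = pd.DataFrame({
    "age": rng.integers(0, 20, size=n),
    "odometer": rng.uniform(5_000, 250_000, size=n),
})
y = 10.5 - 0.06 * X["age"] - 3e-6 * X["odometer"] + rng.normal(0, 0.2, size=n)

# Baseline: ordinary least squares, scored by 5-fold cross-validated RMSE.
baseline = make_pipeline(StandardScaler(), LinearRegression())
baseline_scores = cross_val_score(
    baseline, X, y, cv=5, scoring="neg_root_mean_squared_error"
)
baseline_rmse = -baseline_scores.mean()

# Ridge regression with the regularization strength chosen by grid search.
ridge = make_pipeline(StandardScaler(), Ridge())
search = GridSearchCV(
    ridge,
    param_grid={"ridge__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

print(f"baseline RMSE: {baseline_rmse:.3f}")
print(f"best ridge alpha: {search.best_params_['ridge__alpha']}")
print(f"best ridge RMSE: {-search.best_score_:.3f}")
```

scikit-learn scorers follow a "higher is better" convention, which is why the RMSE scorer is negated and the sign is flipped back before reporting.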

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.
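One way to judge both model quality and what the model says about price drivers is to score it on held-out data and then compute permutation importance. The sketch below does this on synthetic data; the feature names (including a deliberately irrelevant `noise` column), the price formula, and the choice of a ridge model are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: "noise" has no effect on price by construction.
rng = np.random.default_rng(7)
n = 400
X = pd.DataFrame({
    "age": rng.integers(0, 20, size=n).astype(float),
    "odometer_10k": rng.uniform(0.5, 25, size=n),
    "noise": rng.normal(size=n),
})
y = (
    30_000
    - 1_000 * X["age"]
    - 300 * X["odometer_10k"]
    + rng.normal(0, 1_500, size=n)
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = Ridge(alpha=1.0).fit(X_train, y_train)

# Hold-out quality: MAE is in price units, R^2 is unitless.
pred = model.predict(X_test)
mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)
print(f"MAE: {mae:.0f}, R^2: {r2:.3f}")

# Permutation importance: drop in R^2 when one column is shuffled.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
importances = pd.Series(
    result.importances_mean, index=X.columns
).sort_values(ascending=False)
print(importances)
```

An error metric in price units, such as MAE, is usually easier to communicate to a dealership than R², while the importance ranking speaks directly to the question of which features drive price.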

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.