# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing. Your goal is to understand what factors make a car more or less expensive. As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

1. Our business objective is to identify which aspects/features justify the price of an automobile.

2. Assessing the situation: The Cars Dataset is what is provided, and I will have to take it at face value. The risk of this approach is that we are limiting our price assumptions to a handful of features from a dataset of 426K records.

3. Data Mining Goal: Originally, my goal was to predict the price of a vehicle from the following features: Mileage, Size, Drive Train, Color, Manufacturer and Condition. During the initial phase of analysis I realized that I should first verify the relationship of each feature to price. This way our dealer will know which features of a vehicle drive a lower or higher price.

4. Project Plan: 
    - First, I will need to pre-process the data so that I can determine which features correlate best with price.
    - Next, I will scale all numerical values and transform categorical features.
    - Finally, by using regression and grid search I should be able to determine which features drive automobile prices.


### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

1. The data seemed pretty straightforward: there are 17 columns with features that the common car buyer uses to judge the utility of a vehicle. Below is the information for the dataset and its columns.

2. Quality is not exactly where I want it to be to perform a decent analysis. I'll have to convert categorical columns into numeric values and either drop missing values or replace them with the average of the entire column, depending on the strength of that column's correlation to price.

### Data Preparation

After our initial exploration and fine tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 

Step 1: I decided to drop columns [id, model, region, VIN, state, title_Status] as these features were not relevant to the analysis or could cause too much variability. This left us with 3 numerical columns, including Price, and 8 categorical columns.

Step 2: I needed to deal with null values but did not want to filter more than required, so I checked the proportion of null values in each column and reviewed the unique values of the categorical features (see below):
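As a rough illustration of that check (not the notebook's actual code), the sketch below computes the share of missing values per column and lists the distinct values of each categorical column. The tiny `cars` frame and its values are made up for the example; only the column names come from the dataset described here.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the real listings (illustrative values only).
cars = pd.DataFrame({
    "price": [5000, 12000, 8000, 20000],
    "odometer": [120000.0, np.nan, 60000.0, 30000.0],
    "condition": ["good", None, "excellent", "good"],
    "size": [None, None, None, "mid-size"],
})

# Share of missing values per column, largest first.
null_share = cars.isna().mean().sort_values(ascending=False)

# Distinct non-null values for each categorical column (listed by hand here).
categorical_cols = ["condition", "size"]
unique_values = {
    col: sorted(cars[col].dropna().unique().tolist())
    for col in categorical_cols
}

print(null_share)
print(unique_values)
```

A column with a very high null share, like `size` in this toy frame, is a candidate for dropping rather than imputing.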

Step 3:

After multiple pre-processing steps I refined the dataset by doing the following:
- Dropped: the size column
- Grouped: vehicle 'types' into four categories: Car, SUV, Truck and Other
- Grouped: vehicle 'manufacturer' into continental categories
- Filtered: removed nulls from the 'odometer' column and kept rows where mileage is between 10,000 and 150,000
- Filtered: only selected 'years' that were between 2000 and 2018
- Filtered: where 'condition' did not equal 'New' or 'Salvage'
- Replaced: less common values with 'other' for the following columns: Condition, Fuel, Paint_color and Cylinders
- Categorical Missing Data: chose to fill the null data with the mode of each corresponding column, in this case for Condition, Cylinders, Drive, Type and Paint_Color
- Filtered: I had to keep only prices less than or equal to $35,000, as the standard deviation of price was far too large otherwise

Step 4: Below is the result of a refined dataset with fewer outliers and more consistently populated columns. Almost half of the dataset was removed.
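The refinement steps listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the original notebook: the sample rows are invented, `TYPE_GROUPS` is a hypothetical mapping, the order of operations is a guess, and only a subset of the listed steps is shown.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration; the real dataset has about 426K listings.
cars = pd.DataFrame({
    "price": [9000, 45000, 15000, 7000, 22000, 12000],
    "year": [2012, 2016, 1995, 2010, 2015, 2014],
    "odometer": [80000.0, 30000.0, 90000.0, np.nan, 40000.0, 60000.0],
    "condition": ["good", "excellent", "good", "fair", "salvage", None],
    "type": ["sedan", "SUV", "pickup", "sedan", "truck", "wagon"],
    "size": [None, "full-size", None, None, None, None],
})

# Hypothetical mapping from raw body types to the grouped categories;
# anything not listed falls into "other".
TYPE_GROUPS = {
    "sedan": "car", "coupe": "car", "hatchback": "car",
    "SUV": "SUV",
    "pickup": "truck", "truck": "truck",
}


def clean(cars: pd.DataFrame) -> pd.DataFrame:
    df = cars.drop(columns=["size"])                    # mostly-null column
    df = df.dropna(subset=["odometer"])                 # remove null mileage
    df = df[df["odometer"].between(10_000, 150_000)]    # mileage window
    df = df[df["year"].between(2000, 2018)]             # model-year window
    df = df[~df["condition"].isin(["new", "salvage"])]  # drop extreme conditions
    df = df[df["price"] <= 35_000].copy()               # cap price
    df["vehicle_type_group"] = df["type"].map(TYPE_GROUPS).fillna("other")
    # Fill remaining categorical nulls with the column mode.
    df["condition"] = df["condition"].fillna(df["condition"].mode()[0])
    return df.reset_index(drop=True)


cleaned = clean(cars)
print(cleaned)
```

Each filter is a separate line so that the row count can be checked after every step, which makes it easier to see which rule removes the most data.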

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

Step 1: Price is the target, which makes all other columns the features. I split the data into X = all other columns and y = Price.

Step 2: I then split both the X and y datasets into training and test data using an 80/20 split.
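A minimal sketch of the feature/target separation and the 80/20 split, assuming scikit-learn's `train_test_split`. The ten rows and the `random_state` value are placeholders chosen for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten invented rows so the 80/20 split is easy to verify by eye.
cars = pd.DataFrame({
    "price": [5000, 7000, 9000, 11000, 13000, 15000, 17000, 19000, 21000, 23000],
    "year": [2004, 2006, 2008, 2010, 2011, 2012, 2014, 2015, 2016, 2018],
    "odometer": [140000, 130000, 110000, 100000, 90000, 80000, 60000, 50000, 40000, 20000],
    "drive": ["fwd", "fwd", "rwd", "4wd", "fwd", "rwd", "4wd", "fwd", "4wd", "rwd"],
})

X = cars.drop(columns=["price"])  # features: every column except the target
y = cars["price"]                 # target

# Hold out 20% of the rows for testing; random_state is an arbitrary seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```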

Step 3: I defined the numerical and categorical features and passed each group of columns through a `StandardScaler` or a `OneHotEncoder`, respectively. Both were placed into a preprocessor pipeline.

Step 4: Next, I created a base pipeline that combined the preprocessor with a set of polynomial degrees (1, 2, 3) and a `target_regressor` range of (0.1, 1, 10, 100).
- I created three different pipelines, one for each approach I used: Ridge, Lasso and SFS (Sequential Feature Selection).

Step 5: Each pipeline was run through a GridSearch cross-validation, which searches over a specified parameter grid to find the best combination of hyperparameters for a given estimator. I set the cross-validation to 5 folds for each regression.
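The Ridge variant of Steps 4 and 5 might look roughly like the sketch below. It assumes the values (0.1, 1, 10, 100) are Ridge regularization strengths (`alpha`) and that the polynomial expansion follows the preprocessor; the step names, the synthetic data and the scoring choice are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

# Synthetic data: price rises with year and falls with mileage, plus noise.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "year": rng.integers(2000, 2019, n),
    "odometer": rng.uniform(10_000, 150_000, n),
    "drive": rng.choice(["fwd", "rwd", "4wd"], n),
})
y = 500 * (X["year"] - 2000) - 0.05 * X["odometer"] + rng.normal(0, 1000, n)

preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), ["year", "odometer"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["drive"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("ridge", Ridge()),
])

# 3 degrees x 4 alphas = 12 candidate settings, each scored with 5-fold CV.
param_grid = {
    "poly__degree": [1, 2, 3],
    "ridge__alpha": [0.1, 1, 10, 100],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

print(search.best_params_)
```

Swapping `Ridge()` for `Lasso()` (and renaming the step) would give the second pipeline with the same grid structure.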

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

Hours were spent going back and forth to the data, filtering, dropping or even grouping values just to get close to an R-squared of 70%. I am not too thrilled with the results, but the evaluation below shows that the Ridge regression model performed the best of the three. Features such as year, cylinders, odometer, drive and vehicle_type_group drove the price of a vehicle. The MSE is extremely high, which is driven by the significant variance in price and odometer values. Evaluations for each model are below.
- Additionally, a step-by-step walkthrough with visuals for each model is attached within this folder.
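Part of the reason the MSE looks so large is that it is measured in squared dollars, so its magnitude is not directly comparable to R-squared. The sketch below uses made-up prices and predictions (not results from this project) to show MSE, its square root (RMSE, back in dollars) and R-squared side by side.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Invented true prices and predictions, in dollars.
y_true = np.array([5000.0, 12000.0, 20000.0, 30000.0])
y_pred = np.array([7000.0, 10000.0, 23000.0, 27000.0])

mse = mean_squared_error(y_true, y_pred)  # squared dollars
rmse = np.sqrt(mse)                       # dollars
r2 = r2_score(y_true, y_pred)             # share of variance explained

print(f"MSE:  {mse:,.0f}")
print(f"RMSE: {rmse:,.0f}")
print(f"R^2:  {r2:.3f}")
```

In this toy case the MSE is in the millions even though the typical error is only a few thousand dollars, so RMSE is usually the easier number to present to a client.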

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.

Findings
1. Using the current data with further refinement, I am able to identify drivers of vehicle pricing with a model that reaches an R-squared of close to 70%.
2. Reaching this 70% required various transformations of the dataset, including grouping 'Makes' and 'Models' by their regions of origin to gain insight into their relevance to price.
3. Additionally, it should be noted that the year of the vehicle, along with the engine type and mileage, affects price more than the other categories available.

Questions for the Dealer

I strongly urge the client to determine which type of car dealer they want to be, using the questions below. Based on their decision, I strongly believe that further refinement of the data can yield better results.
- The client should decide if they want to sell high-end or economy-class used vehicles. (Example: BMW, Audi, Mercedes, or Toyota, Nissan, Honda)
- The year range will heavily impact the relationship with price. I encountered too many outliers, but within the data I found that the average year of the vehicles was 2011. Does the client want to carry 10-year-old vehicles within their fleet?
- Since price is the target, what is the maximum value of a vehicle that they want to sell? Within the dataset there were both Kia and Ferrari, which sit at the extreme low and high ends of the price spectrum. The client needs to know what they want to sell and who to sell to.
- Is the client in the electric car business or strictly fuel-powered vehicles?
- Finally, I strongly believe that the client does not want to sell vehicles with more than 50K miles, as they would not be worth as much as lower-mileage cars.