# Car Price Prediction

[ML Cookbook](https://www.ml-book.com) | [SLACK Channel](https://join.slack.com/t/mlckbk/shared_invite/zt-9qsjm911-6nSHAcCSjKfuHi972iEfEg)

## About

In this project, you will build a model that **predicts a car's price from a set of independent variables**.

The project contains 6 sections in total, each with **step-by-step instructions** of what to do. Note that, as we go further with our lessons, we will step away from guided projects like this one toward "less-guided" ones with fewer instructions. Thus, my advice is to try to understand why we do what we do, and in what order.

## Structure
The project is split into **6 sections**, each containing **step-by-step instructions** of what to do. These sections are the following:


1.   Import the Libraries
2.   Import the datasets
3.   Data Preprocessing
4.   Data Overview
5.   Model Building
6.   Conclusion

## Data
There are 3 datasets provided that you should use for this project:

- Automobile_data1.csv
- Automobile_data2.csv
- Automobile_data3.csv

# 1. Import the Libraries

Import the libraries you need (keep adding the required libraries here as you go further with the project).

In [None]:
import pandas as pd



# 2. Import the datasets

Do the following:

*   **Step 1**: Import three datasets as df1, df2 and df3 **(we did that for you)**
*   **Step 2**: See what the dataframes look like
*   **Step 3**: Check the shape of each dataset by printing three lines: 

        Data shape of df1 is (X, Y),
        Data shape of df2 is (X, Y), 
        Data shape of df3 is (X, Y)


Use the .format function for that.



----

## Step 1
Import three datasets as df1, df2 and df3 **(we did that for you)**

In [None]:
df1 = pd.read_csv('https://raw.githubusercontent.com/the-learning-machine/data/master/tlm_project1/Automobile_data1.csv')
df2 = pd.read_csv('https://raw.githubusercontent.com/the-learning-machine/data/master/tlm_project1/Automobile_data2.csv')
df3 = pd.read_csv('https://raw.githubusercontent.com/the-learning-machine/data/master/tlm_project1/Automobile_data3.csv')


## Step 2
See what the dataframes look like

## Step 3
Check the shape of each dataset by printing three lines: 

    Data shape of df1 is (X, Y),
    Data shape of df2 is (X, Y), 
    Data shape of df3 is (X, Y)


Use the .format function for that.
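
As a rough illustration of the output format, here is a minimal sketch that uses small made-up frames in place of the real df1, df2 and df3 (the toy data and the `frames` dictionary are assumptions for the example, not part of the project files):

```python
import pandas as pd

# Toy stand-ins for df1, df2 and df3 (hypothetical data, not the project files)
frames = {
    "df1": pd.DataFrame({"make": ["audi", "bmw"], "price": [13950, 16430]}),
    "df2": pd.DataFrame({"make": ["audi", "bmw"], "horsepower": [102, 101]}),
    "df3": pd.DataFrame({"make": ["audi"], "bore": [3.19]}),
}

# str.format fills each {} placeholder with the next argument, in order
lines = ["Data shape of {} is {}".format(name, frame.shape) for name, frame in frames.items()]
print("\n".join(lines))
```

The shape attribute is a (rows, columns) tuple, so it already prints in the (X, Y) form requested above.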


# 3. Data Preprocessing

Do the following:


*   **Step 1:** Combine the three datasets into one


*   **Step 2**: Check data types. Change if needed.


*   **Step 3**: Check the unique values of each column. Investigate them. Is everything fine? *Bonus*: create a function *unique(col)* that prints the unique values of a column 'col' (where 'col' is the column name as a string)


*   **Step 4**: Check for null and inappropriate values (like "?" or "!"). If there are any, replace them with 0 (if numeric), or 'None' (if categorical).


*   **Step 5**: Normalize the data. Divide the prices by 1000 so that the values are in thousands of euros.


*   **Step 6**: Validate the data: check the *dtypes*, confirm that no wrong values (?, !) or null values remain, and review the unique values of each column

-----

## Step 1
Combine the three datasets into one
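
Which call fits here depends on what the shapes and previews from section 2 reveal. The sketch below assumes the three files share the same columns and simply hold different rows, in which case pd.concat can stack them; if they instead hold different columns for the same cars, a column-wise combination would be needed. The toy frames are made up for illustration.

```python
import pandas as pd

# Hypothetical frames with identical columns (an assumption about the real files)
part1 = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950, 16430]})
part2 = pd.DataFrame({"make": ["mazda"], "price": [5195]})
part3 = pd.DataFrame({"make": ["volvo", "saab"], "price": [12940, 11850]})

# ignore_index=True rebuilds a clean 0..n-1 index instead of repeating 0, 1, 0, ...
combined = pd.concat([part1, part2, part3], ignore_index=True)
print(combined.shape)
```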

## Step 2
Check data types. Change if needed.
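
One possible way to do this, sketched on a made-up frame: inspect dtypes, then convert a text column that should be numeric with pd.to_numeric. Using errors="coerce" is a choice made for this example; it turns unparseable entries such as "?" into NaN, which then still need handling in Step 4.

```python
import pandas as pd

# Hypothetical frame: horsepower was read as text because of a stray "?"
df = pd.DataFrame({"make": ["audi", "bmw", "mazda"], "horsepower": ["102", "?", "68"]})
print(df.dtypes)

# Convert to numbers; values that cannot be parsed become NaN
df["horsepower"] = pd.to_numeric(df["horsepower"], errors="coerce")
print(df.dtypes)
```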

## Step 3
Check the unique values of each column. Investigate them. Is everything fine? Bonus: create a function unique(col) that prints the unique values of a column 'col' (where 'col' is the column name as a string)
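
A minimal sketch of the bonus function follows. The extra `data` argument is an addition for this example so that the function does not depend on a global dataframe; the toy frame is hypothetical.

```python
import pandas as pd

def unique(col, data):
    """Print and return the unique values of column `col` (a column name string)."""
    values = data[col].unique()
    print("{}: {}".format(col, list(values)))
    return values

# Hypothetical data containing a suspicious "?" entry
cars = pd.DataFrame({"fuel-type": ["gas", "diesel", "gas", "?"]})
found = unique("fuel-type", cars)
```

Looping over the dataframe's columns and calling the function for each one gives a quick overview of values that look out of place.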

## Step 4
Check for null and inappropriate values (like "?" or "!"). If there are any, replace them with 0 (if numeric), or 'None' (if categorical).
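
The replacement rule could be sketched as follows, assuming a frame with one numeric and one categorical column (both made up here). Placeholders are first turned into NaN so that a single fillna per column can apply the right default.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with "?" placeholders and a missing value
df = pd.DataFrame({
    "horsepower": ["111", "?", "154"],
    "num-of-doors": ["two", "?", None],
})

# Treat the placeholder symbols as missing values
df = df.replace(["?", "!"], np.nan)

# Numeric column: convert to numbers, then fill the gaps with 0
df["horsepower"] = pd.to_numeric(df["horsepower"]).fillna(0)

# Categorical column: fill the gaps with the string 'None'
df["num-of-doors"] = df["num-of-doors"].fillna("None")
print(df)
```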

## Step 5
Normalize the data. Divide the prices by 1000 so that the values are in thousands of euros.

## Step 6
Validate the data: check the dtypes, confirm that no wrong values (?, !) or null values remain, and review the unique values of each column

# 4. Data Overview

Observe the data:


*   **Step 1**: Find the average price, engine size, bore and stroke for each car manufacturer (column named 'make').


*   **Step 2**: Split the car manufacturers by body style (column named 'body-style') and find the average price, engine size, bore and stroke of each group


*   **Step 3**: Find the total amount of money each manufacturer got from its cars (sum the prices for each manufacturer)



----

## Step 1
Find the average price, engine size, bore and stroke for each car manufacturer (column named 'make').
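
A per-manufacturer average is a typical groupby aggregation. The sketch below uses a made-up frame; the column names 'engine-size', 'bore' and 'stroke' are assumptions about how the real dataset spells them.

```python
import pandas as pd

# Hypothetical data; prices are already in thousands
cars = pd.DataFrame({
    "make": ["audi", "audi", "bmw", "bmw"],
    "price": [14.0, 18.0, 16.0, 20.0],
    "engine-size": [109, 136, 108, 164],
    "bore": [3.19, 3.19, 3.50, 3.31],
    "stroke": [3.40, 3.40, 2.80, 3.19],
})

columns = ["price", "engine-size", "bore", "stroke"]

# One row per manufacturer, one averaged column per selected feature
by_make = cars.groupby("make")[columns].mean()
print(by_make)
```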

## Step 2
Split the car manufacturers by body style (column named 'body-style') and find the average price, engine size, bore and stroke of each group
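
Grouping by two keys at once produces one row per (make, body-style) pair. This is only a sketch on hypothetical data with two of the four columns, and 'engine-size' is an assumed column name.

```python
import pandas as pd

# Hypothetical data; prices are already in thousands
cars = pd.DataFrame({
    "make": ["audi", "audi", "audi", "bmw"],
    "body-style": ["sedan", "sedan", "wagon", "sedan"],
    "price": [14.0, 18.0, 19.0, 16.0],
    "engine-size": [109, 136, 136, 108],
})

# A list of keys yields a two-level index: make first, then body style
by_make_and_style = cars.groupby(["make", "body-style"])[["price", "engine-size"]].mean()
print(by_make_and_style)
```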

## Step 3
Find the total amount of money each manufacturer got from its cars (sum the prices for each manufacturer)

<a id='model_building'></a>

# 5. Model Building

Do the following:


*   **Step 1**: Select two variables, let's call them X1 and X2 (int), that you think are the most significant indicators for price prediction (I'd advise taking horsepower as one of the two). Set the y variable to price


*   **Step 2**: Split the data into train and test sets


*   **Step 3**: Import Lasso regression. Note that the Lasso() model has several parameters; look them up on the web. Set alpha = 1.5 and max_iter = 10000.


*   **Step 4**: Fit the model on the train data


*   **Step 5**: Print lasso's intercept and coefficients


*   **Step 6**: Print a score for the test data


*   **Step 7**: Predict the price of a car that has 90 horsepower and n (any value) of X2. Predict the price of a car that has 330 horsepower and n (any value) of X2. Is the car price different in this case? Do both results make sense?


*   **Step 8**: Code a loop that would print a score for Lasso with alpha = 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4.



----

## Step 1
Select two variables, let's call them X1 and X2 (int), that you think are the most significant indicators for price prediction (I'd advise taking horsepower as one of the two). Set the y variable to price

## Step 2
Split the data into train and test sets
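
A minimal sketch of selecting the features and splitting, using synthetic data. The choice of 'engine-size' as X2, the 25% test share and the random_state are assumptions for the example, not requirements of the project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset (hypothetical values)
rng = np.random.default_rng(0)
n = 40
cars = pd.DataFrame({
    "horsepower": rng.integers(50, 300, size=n),
    "engine-size": rng.integers(60, 330, size=n),
})
cars["price"] = 0.1 * cars["horsepower"] + 0.05 * cars["engine-size"] + rng.normal(0, 1, size=n)

X = cars[["horsepower", "engine-size"]]  # X1 and X2
y = cars["price"]

# Hold out a quarter of the rows; fixing random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(X_train.shape, X_test.shape)
```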

## Step 3
Import Lasso regression. Note that the Lasso() model has several parameters; look them up on the web. Set alpha = 1.5 and max_iter = 10000.

## Step 4
Fit the model on the train data

## Step 5
Print lasso's intercept and coefficients

## Step 6
Print a score for the test data
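
Steps 3 to 6 taken together might look like the sketch below. It runs on synthetic data, so the printed numbers say nothing about the real dataset. For a regressor such as Lasso, score returns the coefficient of determination (R^2).

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic features and prices (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(50, 300, size=(60, 2))
y = 0.1 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(0, 1, size=60)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# alpha controls the strength of the L1 penalty; max_iter caps the solver iterations
lasso = Lasso(alpha=1.5, max_iter=10000)
lasso.fit(X_train, y_train)

print("Intercept:", lasso.intercept_)
print("Coefficients:", lasso.coef_)

test_score = lasso.score(X_test, y_test)
print("Test R^2:", test_score)
```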

## Step 7
Predict the price of a car that has 90 horsepower and n (any value) of X2. Predict the price of a car that has 330 horsepower and n (any value) of X2. Is the car price different in this case? Do both results make sense?
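
To illustrate the mechanics, the sketch below fits a model on synthetic data and queries it for two cars that differ only in horsepower. The 'engine-size' column and the value of n are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic training data (hypothetical)
rng = np.random.default_rng(1)
train = pd.DataFrame({
    "horsepower": rng.uniform(50, 300, size=80),
    "engine-size": rng.uniform(60, 330, size=80),
})
price = 0.1 * train["horsepower"] + 0.05 * train["engine-size"] + rng.normal(0, 1, size=80)

model = Lasso(alpha=1.5, max_iter=10000).fit(train, price)

# Two query cars: same X2, different horsepower
n = 120
queries = pd.DataFrame({"horsepower": [90, 330], "engine-size": [n, n]})
low, high = model.predict(queries)
print("90 hp: {:.2f}, 330 hp: {:.2f}".format(low, high))
```

When judging whether a result makes sense, it helps to check whether the queried horsepower lies inside the range seen during training; a value beyond that range is an extrapolation, and a linear model will extend its trend there without any evidence from the data.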

## Step 8
Code a loop that would print a score for Lasso with alpha = 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4.
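
The loop could be sketched as follows, again on synthetic data. Storing the scores in a dictionary is an addition for this example that makes the values easy to compare afterwards.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical)
rng = np.random.default_rng(0)
X = rng.uniform(50, 300, size=(60, 2))
y = 0.1 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(0, 1, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

alphas = [0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4]
scores = {}
for alpha in alphas:
    # Refit from scratch for every alpha so that the runs do not influence each other
    model = Lasso(alpha=alpha, max_iter=10000).fit(X_train, y_train)
    scores[alpha] = model.score(X_test, y_test)
    print("alpha = {}: R^2 = {:.4f}".format(alpha, scores[alpha]))
```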

# 6. Conclusion

Summarize your **findings**. Did you manage to build a reliable model? What **data preprocessing** strategies and **feature selection** techniques did you use to get the best model? Which model performed best?

Feel free to share/discuss your findings in our [Slack Channel](https://join.slack.com/t/mlcookbook/shared_invite/zt-eyz4czw4-l95j_2iuETCbVRPpgA3kWA)!

In [None]:
# Answer:

'''

I used X model and achieved Y accuracy...
I believe the model is reliable as I performed X feature selection technique...

'''

# 7.* Advanced Zone (OPTIONAL)

*This section is intended for advanced students or those who are willing to do some additional googling to familiarize themselves with potentially new concepts. The steps outlined below are typically used in production data science applications, which is why the ML-Book team thought it important to include them.*

<a id='full_dataset'></a>

# 7.1* Utilizing Full Dataset

*In section 5, step 1, you were advised to select just two features for the sake of simplicity. However, you often want to give a model more features so that it can extract more complex relationships from the data and, hopefully, become more accurate.*

*   **Step 1**: Select all numerical features from the dataset


*   **Step 2**: Complete steps 2-8 (except 7) from [section 5](#model_building).


*   **Step 3**: Compare the test set performance of the model with all numerical features included against the model from [section 5](#model_building) which has just 2 features.

*As you can see, there are a bunch of categorical columns in our dataset too. Let's try plugging them into our model. However, Lasso regression (like many other ML models) can only work with numerical feature values. Thus, some sort of encoding is needed.*

*   **Step 4**: Use all numerical and categorical features in the dataset. For the categorical ones, try any of the encoding techniques outlined in this [tutorial](https://medium.com/machine-learning-eli5/dealing-with-categorical-data-f4c8556cbda0).


*   **Step 5**: Complete steps 2-8 (except 7) from [section 5](#model_building).


*   **Step 6**: Compare the test set performance of the model with all numerical and categorical features included against the model with only numerical features.

----------------

## Step 1
Select all numerical features from the dataset
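
One way to pick the numerical columns without listing them by hand is select_dtypes. The frame below is made up; remember to drop the target column so that price does not end up among the features.

```python
import pandas as pd

# Hypothetical cleaned data
cars = pd.DataFrame({
    "make": ["audi", "bmw", "mazda"],
    "horsepower": [102, 101, 68],
    "bore": [3.19, 3.50, 3.03],
    "price": [13.95, 16.43, 5.195],
})

# Keep only numeric columns, then separate the target from the features
numeric = cars.select_dtypes(include="number")
X = numeric.drop(columns="price")
y = numeric["price"]
print(list(X.columns))
```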

## Step 2
Complete steps 2-8 (except 7) from [section 5](#model_building).

## Step 3
Compare the test set performance of the model with all numerical features included against the model from [section 5](#model_building) which has just 2 features.

## Step 4
*As you can see, there are a bunch of categorical columns in our dataset too. Let's try plugging them into our model. However, Lasso regression (like many other ML models) can only work with numerical feature values. Thus, some sort of encoding is needed.*

Use all numerical and categorical features in the dataset. For the categorical ones, try any of the encoding techniques outlined in this [tutorial](https://medium.com/machine-learning-eli5/dealing-with-categorical-data-f4c8556cbda0).
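
As one common option, one-hot encoding can be sketched with pandas.get_dummies; other encodings may suit particular columns better. The frame and the 'fuel-type' column are assumptions for the example.

```python
import pandas as pd

# Hypothetical data with two categorical columns and one numeric column
cars = pd.DataFrame({
    "fuel-type": ["gas", "diesel", "gas"],
    "body-style": ["sedan", "wagon", "hatchback"],
    "horsepower": [111, 68, 154],
})

# Each category becomes its own 0/1 column; numeric columns pass through unchanged
encoded = pd.get_dummies(cars, columns=["fuel-type", "body-style"], dtype=int)
print(encoded.columns.tolist())
```

Columns with many distinct categories (a manufacturer column, for instance) produce many new features, which is worth keeping in mind when comparing models.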

## Step 5
Complete steps 2-8 (except 7) from [section 5](#model_building).

## Step 6
Compare the test set performance of the model with all numerical and categorical features included against the model with only numerical features.

# 7.2* Exploring Different Models

*As you know, in data science no single algorithm outperforms all others on every dataset. Thus, model selection is typically an iterative process in which we not only search for the optimal set of hyperparameters (as we did in [section 5, step 8](#model_building)) but also explore different machine learning algorithms.*

*As you can imagine, for models with many hyperparameters it quickly becomes tedious to specify by hand all the values you want to check. For this and some other reasons, people use [cross-validation](https://medium.com/machine-learning-eli5/cross-validation-the-right-way-386839ed39b1).*

*   **Step 1**: Train an [Elastic Net](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html) model on the train dataset. Use any [cross-validation](https://medium.com/machine-learning-eli5/cross-validation-the-right-way-386839ed39b1) method of your choice to select optimal values of the hyperparameters *alpha* and *l1_ratio*.


*   **Step 2**: Train a [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) model on the train dataset. Use any [cross-validation](https://medium.com/machine-learning-eli5/cross-validation-the-right-way-386839ed39b1) method of your choice to select optimal values of the hyperparameters *n_estimators* and *max_depth*.


*   **Step 3**: Compare the performance of models from steps 1 and 2 with the model built in [section 7.1](#full_dataset). Keep in mind that it makes sense only to compare models that were trained on the same set of data.

--------

## Step 1
Train an [Elastic Net](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html) model on the train dataset. Use any [cross-validation](https://medium.com/machine-learning-eli5/cross-validation-the-right-way-386839ed39b1) method of your choice to select optimal values of the hyperparameters *alpha* and *l1_ratio*.
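
A grid search with k-fold cross-validation is one way to do this. The sketch below uses scikit-learn's GridSearchCV on synthetic data; the candidate values in the grid and the choice of 5 folds are arbitrary assumptions, not recommendations.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic training data (hypothetical)
rng = np.random.default_rng(0)
X_train = rng.uniform(50, 300, size=(60, 3))
y_train = 0.1 * X_train[:, 0] + 0.05 * X_train[:, 1] + rng.normal(0, 1, size=60)

# Every combination of these values is evaluated with 5-fold cross-validation
param_grid = {"alpha": [0.1, 1.0, 10.0], "l1_ratio": [0.2, 0.5, 0.8]}
search = GridSearchCV(ElasticNet(max_iter=10000), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best mean CV score:", search.best_score_)
```

The same pattern carries over to other estimators by swapping the model and the grid keys. The test set stays untouched during the search and is used only for the final comparison.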

## Step 2
Train a [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) model on the train dataset. Use any [cross-validation](https://medium.com/machine-learning-eli5/cross-validation-the-right-way-386839ed39b1) method of your choice to select optimal values of the hyperparameters *n_estimators* and *max_depth*.

## Step 3
Compare the performance of models from steps 1 and 2 with the model built in [section 7.1](#full_dataset). Keep in mind that it makes sense only to compare models that were trained on the same set of data.

# 7.3* Feature Engineering

*Oftentimes the relationship between our features and the target variable is very complex. Thus, it can be fruitful to include additional features derived from existing ones. In this section we will explore feature engineering for numerical columns only, but there are techniques that can be applied to categorical features too. Feel free to experiment with transformations that are not listed below!*

*   **Step 1**: Generate additional univariate numerical features. Feel free to select any number of features from your dataset to apply any of these transformations:
    *      Power of 2
    *      Square root (watch out for negative values!)
    *      Log transformation (can be applied only to positive values)


*   **Step 2**: Generate additional multivariate numerical features. Feel free to select any number of features from your dataset to apply any of these transformations:
    *      Multiplication of features' values
    *      Ratio of features' values (watch out for zero denominator)
    
    
*   **Step 3**: Train any model that was described in this notebook on this extended dataset.


*   **Step 4**: Compare the performance of the model trained on the extended dataset against the models trained on the original dataset.

--------

## Step 1
Generate additional univariate numerical features. Feel free to select any number of features from your dataset to apply any of these transformations:
- Power of 2
- Square root (watch out for negative values!)
- Log transformation (can be applied only to positive values)
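
The three transformations can be sketched on a made-up column as below. Two choices here are assumptions for the example: negative values are clipped to 0 before the square root, and log1p (the logarithm of 1 + x) stands in for a plain log so that zeros, such as those introduced when missing values were replaced with 0, do not produce minus infinity.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column
cars = pd.DataFrame({"horsepower": [68.0, 111.0, 154.0]})

# Power of 2
cars["horsepower_sq"] = cars["horsepower"] ** 2

# Square root, guarding against negative inputs
cars["horsepower_sqrt"] = np.sqrt(cars["horsepower"].clip(lower=0))

# Log transformation that tolerates zeros: log(1 + x)
cars["horsepower_log"] = np.log1p(cars["horsepower"])
print(cars)
```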

## Step 2
Generate additional multivariate numerical features. Feel free to select any number of features from your dataset to apply any of these transformations:
- Multiplication of features' values
- Ratio of features' values (watch out for zero denominator)
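
A product and a ratio of two columns might be built as sketched below. The 'curb-weight' column is a hypothetical second feature, and replacing zero denominators with NaN is just one way to avoid dividing by zero; the resulting gaps still need to be handled before training.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the 0.0 mimics a missing value that was replaced with 0
cars = pd.DataFrame({
    "horsepower": [68.0, 111.0, 154.0],
    "curb-weight": [1900.0, 0.0, 2800.0],
})

# Multiplication of feature values
cars["hp_x_weight"] = cars["horsepower"] * cars["curb-weight"]

# Ratio of feature values, with zero denominators masked out
cars["hp_per_weight"] = cars["horsepower"] / cars["curb-weight"].replace(0, np.nan)
print(cars)
```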

## Step 3
Train any model that was described in this notebook on this extended dataset.

## Step 4
Compare the performance of the model trained on the extended dataset against the models trained on the original dataset.