Libraries Used: Pandas, scikit-learn, matplotlib
Aim: Our goal is to predict vehicle prices using the open-source Auto dataset from the UCI Machine Learning Repository. The dataset contains prices for 205 automobiles, along with other features such as fuel type, engine type, and engine size.
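A minimal loading sketch, assuming the raw file is the headerless CSV that UCI distributes for its Automobile dataset, where a `?` marks a missing value. The column names follow the UCI documentation as we understand it; the function name `load_autos` and the file name in the usage note are hypothetical and may differ from what this project actually used.

```python
import pandas as pd

# Column names taken from the UCI Automobile documentation (assumption:
# this is the same file the project worked with).
COLUMNS = [
    "symboling", "normalized-losses", "make", "fuel-type", "aspiration",
    "num-of-doors", "body-style", "drive-wheels", "engine-location",
    "wheel-base", "length", "width", "height", "curb-weight", "engine-type",
    "num-of-cylinders", "engine-size", "fuel-system", "bore", "stroke",
    "compression-ratio", "horsepower", "peak-rpm", "city-mpg", "highway-mpg",
    "price",
]


def load_autos(source):
    """Read the headerless CSV from a path or file-like object.

    '?' entries are converted to NaN so that they can be imputed later.
    """
    return pd.read_csv(source, header=None, names=COLUMNS, na_values="?")
```

For example, `df = load_autos("imports-85.data")` followed by `df.isna().sum()` would show how many values are missing in each column, assuming a local copy of the file under that name.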
Description:
- We took the dataset from the UCI Machine Learning Repository to predict vehicle prices using linear regression
- After handling missing values with different types of imputation, we explored the dataset
- We visualized some important features and analyzed their patterns
- Then we checked the linear regression assumptions using tests such as the Anderson-Darling test (normality of residuals) and the Goldfeld-Quandt test (constant variance of residuals)
- After finding multicollinearity in the dataset, we removed two columns to resolve it
- Then we applied a feature engineering pipeline to the rest of the dataset
- We also tested different feature selection techniques and compared the resulting model accuracy
- Finally, we applied regularization to fix the model's overfitting problem
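One common way to detect the multicollinearity mentioned above is the variance inflation factor (VIF). The sketch below is an illustration, not necessarily the check used in this project: it regresses each numeric column on the others and reports 1 / (1 - R^2). The function name `variance_inflation_factors` is hypothetical, and the input is assumed to be an all-numeric DataFrame with no missing values.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def variance_inflation_factors(X: pd.DataFrame) -> pd.Series:
    """Return the VIF of every column, largest first.

    VIF_j = 1 / (1 - R_j^2), where R_j^2 is obtained by regressing
    column j on all remaining columns (with an intercept).
    """
    vifs = {}
    for col in X.columns:
        others = X.drop(columns=col)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        # A perfect fit means the column is an exact linear combination.
        vifs[col] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF").sort_values(ascending=False)
```

A frequently quoted rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity. Columns with the highest VIF are candidates for removal, after which the factors should be recomputed.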
What we have learned so far from this project:
- How to handle missing values
- How to check for linear regression assumptions
- How to apply a feature engineering pipeline
- Different types of feature selection tools
- Regularization techniques for linear regression
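The pieces listed above can be combined in a single scikit-learn pipeline. The following is a minimal sketch under stated assumptions, not the project's actual configuration: median and most-frequent imputation, one-hot encoding, univariate feature selection with `SelectKBest`, and ridge (L2) regularization are illustrative choices, and `build_price_model` with its parameters is a hypothetical helper.

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_price_model(numeric_cols, categorical_cols, k="all", alpha=1.0):
    """Impute, encode, scale, select features, then fit a ridge regression."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        # Keep the k features with the highest univariate F-scores.
        ("select", SelectKBest(score_func=f_regression, k=k)),
        # alpha controls the strength of the L2 penalty.
        ("model", Ridge(alpha=alpha)),
    ])
```

Because imputation, scaling, and feature selection all live inside the pipeline, they are refit on each training fold during cross-validation, which keeps information from the validation data out of the preprocessing. Values for `k` and `alpha` would normally be tuned, for example with `GridSearchCV`.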