---
<center><h1>Used Car Price Prediction</h1></center>
<center><h3>Part of 30 Days 30 ML Projects Challenge</h3></center>

---

## 1) Understanding Problem Statement
---

The used car market in India is a dynamic and ever-changing landscape. Prices can fluctuate wildly based on a variety of factors including the make and model of the car, its mileage, its condition and the current market conditions. As a result, it can be difficult for sellers to accurately price their cars.

The goal of this project is **to develop a machine learning model that can predict the price of a used car based on its features**. Because the target is a continuous value, this is a **regression problem**.

## 2) Understanding Data
---

The project uses the **Used Car Price Data**, which contains several independent variables and one dependent (outcome) variable called **selling_price**. The variables in the dataset are as follows:

- name
- year
- selling_price (Target Variable)
- km_driven
- fuel
- seller_type
- transmission
- owner

## 3) Getting System Ready
---
Importing required libraries


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")
%matplotlib inline

## 4) Data Eyeballing
---

### Loading Data

In [None]:
used_car_df = pd.read_csv('Datasets/Day7_Used_Car_Price_Data.csv') 

In [None]:
used_car_df

In [None]:
print('The size of Dataframe is: ', used_car_df.shape)
print('-'*100)
print('The Column Name, Record Count and Data Types are as follows: ')
used_car_df.info()
print('-'*100)

In [None]:
# Defining numerical & categorical columns
numeric_features = [feature for feature in used_car_df.columns if used_car_df[feature].dtype != 'O']
categorical_features = [feature for feature in used_car_df.columns if used_car_df[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))
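
The dtype comparison above can also be written with `select_dtypes`, which selects by kind of dtype rather than comparing against `'O'`. A minimal sketch on a hypothetical toy DataFrame (the rows are made up for illustration; only the column names come from section 2):

```python
import pandas as pd

# Hypothetical toy rows using column names from section 2
toy_cars = pd.DataFrame({
    "name": ["Car A", "Car B"],
    "year": [2014, 2018],
    "selling_price": [250000, 600000],
    "km_driven": [70000, 20000],
    "fuel": ["Petrol", "Diesel"],
})

# Numeric columns are selected by dtype family; everything else is treated as categorical
numeric_cols = toy_cars.select_dtypes(include="number").columns.tolist()
categorical_cols = toy_cars.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```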

In [None]:
print('Missing values (count and percentage) per column of the DataFrame:')
print('-'*100)
total=used_car_df.isnull().sum().sort_values(ascending=False)
percent=(used_car_df.isnull().sum()/used_car_df.isnull().count()*100).sort_values(ascending=False)
pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

In [None]:
print('Summary Statistics of numerical features for DataFrame are as follows:')
print('-'*100)
used_car_df.describe()

In [None]:
print('Summary Statistics of categorical features for DataFrame are as follows:')
print('-'*100)
used_car_df.describe(include='object').T

## 5) Data Cleaning & Preprocessing
---

### Encoding the Categorical Data

In [None]:
# encoding "Fuel_Type" Column
used_car_df.replace({'Fuel_Type':{'Petrol':0,'Diesel':1,'CNG':2}},inplace=True)

# encoding "Seller_Type" Column
used_car_df.replace({'Seller_Type':{'Dealer':0,'Individual':1}},inplace=True)

# encoding "Transmission" Column
used_car_df.replace({'Transmission':{'Manual':0,'Automatic':1}},inplace=True)
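
The integer codes above give the fuel types an order (Petrol < Diesel < CNG) that carries no real meaning, and models such as linear regression or SVR may treat that order as informative. One possible alternative is one-hot encoding. The sketch below uses a small made-up DataFrame (the rows are illustrative, not taken from the dataset) and `pd.get_dummies`:

```python
import pandas as pd

# Hypothetical toy rows; column names mirror the ones used in the cell above
toy_df = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Diesel", "CNG", "Petrol"],
    "Seller_Type": ["Dealer", "Individual", "Dealer", "Individual"],
    "Transmission": ["Manual", "Automatic", "Manual", "Manual"],
})

# One indicator column per category; drop_first removes one redundant column per feature
encoded_df = pd.get_dummies(
    toy_df,
    columns=["Fuel_Type", "Seller_Type", "Transmission"],
    drop_first=True,
    dtype=int,
)
print(encoded_df)
```

Whether this helps depends on the model: tree-based regressors are usually less sensitive to the choice of encoding than linear ones.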

In [None]:
used_car_df

## 6) Model Building
---

### Creating Feature Matrix (Independent Variables) & Target Variable (Dependent Variable)

In [None]:
# separating the data and labels
X = used_car_df.drop(columns=['Car_Name', 'Selling_Price']) # Feature matrix
y = used_car_df['Selling_Price'] # Target variable

In [None]:
X

In [None]:
y

### Data Standardization

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

In [None]:
scaler.fit(X)

In [None]:
standardized_data = scaler.transform(X)

In [None]:
standardized_data

In [None]:
X = standardized_data

In [None]:
X
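
As a quick check of what `StandardScaler` does, the sketch below compares it with the formula z = (x - mean) / std applied column by column. The feature values are hypothetical and chosen only to show two columns on very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: two columns on very different scales
toy_X = np.array([[2010.0, 50000.0],
                  [2015.0, 30000.0],
                  [2020.0, 10000.0]])

scaled = StandardScaler().fit_transform(toy_X)

# Same result by hand: subtract the column mean, divide by the population std (ddof=0)
manual = (toy_X - toy_X.mean(axis=0)) / toy_X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, each column has mean 0 and standard deviation 1, so no single feature dominates distance-based models such as SVR.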

### Train-Test Split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=45)

In [None]:
print(X.shape, X_train.shape, X_test.shape)

In [None]:
print(y.shape, y_train.shape, y_test.shape)
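
In the cells above, the scaler is fit on the full feature matrix before the split, so statistics from the test rows influence the scaling. One common way to avoid this is to split first and let a pipeline fit the scaler on the training rows only. The sketch below shows the idea on synthetic stand-in data; the variable names (`demo_X`, `pipe`, and so on) are hypothetical and not part of the original notebook:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the used car dataset)
rng = np.random.default_rng(45)
demo_X = rng.normal(size=(200, 3))
demo_y = demo_X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Split first, on the raw (unscaled) features
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=45
)

# The pipeline fits the scaler on the training rows only, then reuses it on the test rows
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(Xd_train, yd_train)
print(round(pipe.score(Xd_test, yd_test), 3))
```

The same pattern should carry over to the other regressors by swapping the last pipeline step.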

### Model Comparison : Training & Evaluation

In [None]:
# For Model Building
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
models = [LinearRegression, Lasso, Ridge, SVR, DecisionTreeRegressor, RandomForestRegressor]
mae_scores = []
mse_scores = []
rmse_scores = []
r2_scores = []

for model in models:
    regressor = model().fit(X_train, y_train)
    y_pred = regressor.predict(X_test)
    
    mae_scores.append(mean_absolute_error(y_test, y_pred))
    mse_scores.append(mean_squared_error(y_test, y_pred))
    # RMSE via np.sqrt; the `squared=False` argument is deprecated in newer scikit-learn releases
    rmse_scores.append(np.sqrt(mean_squared_error(y_test, y_pred)))
    r2_scores.append(r2_score(y_test, y_pred))

In [None]:
regression_metrics_df = pd.DataFrame({
    "Model": ["Linear Regression", "Lasso", "Ridge", "SVR", "Decision Tree Regressor", "Random Forest Regressor"],
    "Mean Absolute Error": mae_scores,
    "Mean Squared Error": mse_scores,
    "Root Mean Squared Error": rmse_scores,
    "R-squared (R2)": r2_scores
})

regression_metrics_df.set_index('Model', inplace=True)
regression_metrics_df
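
To make the four columns of the table concrete, the sketch below computes each metric by hand with NumPy. The true and predicted values are hypothetical numbers chosen for illustration, not results from the models above:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true and predicted prices, for illustration only
y_true_demo = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_demo = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true_demo - y_pred_demo
mae = np.mean(np.abs(errors))   # average absolute error
mse = np.mean(errors ** 2)      # average squared error
rmse = np.sqrt(mse)             # back in the units of the target
# R2: 1 minus (residual sum of squares / total sum of squares)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true_demo - y_true_demo.mean()) ** 2)

print(mae, mse, rmse, r2)
```

Lower MAE, MSE and RMSE are better, while R2 is better the closer it is to 1.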

### Inference

In the context of predicting used car prices:
- **Random Forest Regressor** appears to be the **best-performing model** on the held-out test set, followed closely by the Decision Tree Regressor and the linear models (Linear Regression, Lasso, Ridge). These models show relatively low prediction errors (MAE, MSE, RMSE) and explain a good share of the variance in car prices (R2). The final choice may also depend on other factors such as computational cost and interpretability.
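
A ranking based on a single train-test split can shift with the random seed. One way to make such a comparison more robust is k-fold cross-validation. The sketch below is an illustration on synthetic stand-in data with a nonlinear target, not a rerun of the experiment above; the data, the names `cv_X`, `cv_y` and `candidates`, and the choice of 5 folds are all assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data with a nonlinear target (not the used car dataset)
rng = np.random.default_rng(0)
cv_X = rng.uniform(-2, 2, size=(300, 2))
cv_y = cv_X[:, 0] ** 2 + cv_X[:, 1] + rng.normal(scale=0.1, size=300)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=50, random_state=0),
}

cv_results = {}
for name, estimator in candidates.items():
    # R2 on each of the 5 held-out folds
    scores = cross_val_score(estimator, cv_X, cv_y, cv=cv, scoring="r2")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean R2 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Reporting the mean and spread across folds gives a sense of how stable each model's score is.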