<a href="https://colab.research.google.com/github/ibshafique/mlops_with_poridhi/blob/main/prerequisite_projects/Car_Price/car_price.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prerequisites


Importing the required libraries.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

import pickle as pkl

Importing the dataset.

In [None]:
train_url = "https://raw.githubusercontent.com/ibshafique/mlops_with_poridhi/refs/heads/main/prerequisite_projects/Car_Price/dataset/car_price.csv"
train_df = pd.read_csv(train_url)

# Data Insights

In [None]:
train_df.head()

In [None]:
train_df.shape

In [None]:
train_df.isnull().sum()

In [None]:
train_df.duplicated().sum()

# Data Cleaning

The 'torque' values do not have much relation to the price of a car, so we are dropping this column.

In [None]:
train_df = train_df.drop(columns=['torque'])

In the previous section we saw that some rows have missing values, so we are dropping those rows.

We are also dropping the duplicate rows.

In [None]:
train_df.dropna(inplace=True)
train_df.drop_duplicates(inplace=True)
train_df.shape

In [None]:
train_df.head()

Now we will extract the first space-separated token from each of these columns (the brand for 'name', the number for the others):

i. name
ii. mileage
iii. engine
iv. max_power

In [None]:
train_df['name'] = train_df['name'].str.split(' ').str[0]
train_df['mileage'] = train_df['mileage'].str.split(' ').str[0].astype(float)
train_df['engine'] = train_df['engine'].str.split(' ').str[0].astype(float)
train_df['max_power'] = train_df['max_power'].str.split(' ').str[0]

# There was a non-numerical value in the 'max_power' column, so coerce it to NaN and drop that row:
train_df['max_power'] = pd.to_numeric(train_df['max_power'], errors='coerce')
train_df.dropna(inplace=True)

train_df.head()
train_df.info()
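
The extraction step above can be tried in isolation. The following is a minimal sketch on a hypothetical two-row sample (the values are invented for illustration and are not taken from the dataset); it shows how `errors='coerce'` turns a token without a number into `NaN`, which `dropna` can then remove.

```python
import pandas as pd

# Hypothetical sample rows mimicking the raw column format (not from the dataset)
sample = pd.DataFrame({
    "name": ["Maruti Swift Dzire VDI", "Skoda Rapid 1.5 TDI"],
    "mileage": ["23.4 kmpl", "21.14 kmpl"],
    "engine": ["1248 CC", "1498 CC"],
    "max_power": ["74 bhp", "bhp"],  # second value deliberately has no number
})

# Keep only the first space-separated token of each string
sample["name"] = sample["name"].str.split(" ").str[0]
sample["mileage"] = sample["mileage"].str.split(" ").str[0].astype(float)
sample["engine"] = sample["engine"].str.split(" ").str[0].astype(float)

# errors="coerce" converts tokens that are not numbers (here "bhp") into NaN
sample["max_power"] = pd.to_numeric(
    sample["max_power"].str.split(" ").str[0], errors="coerce"
)

print(sample)
print(sample.dropna().shape)  # rows left after removing the NaN row
```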

In [None]:
print(train_df['name'].unique())
print('')
print(train_df['fuel'].unique())
print('')
print(train_df['seller_type'].unique())
print('')
print(train_df['transmission'].unique())

The columns 'name', 'fuel', 'seller_type', 'transmission' and 'owner' are of type object.
We will convert them to integers.

In [None]:
train_df['name'] = train_df['name'].replace({'Maruti': 1 , 'Skoda': 2, 'Honda': 3, 'Hyundai': 4, 'Toyota': 5, 'Ford': 6, 'Renault': 7,
                                             'Mahindra': 8 , 'Tata': 9 , 'Chevrolet': 10, 'Datsun': 11, 'Jeep': 12, 'Mercedes-Benz': 13,
                                             'Mitsubishi': 14, 'Audi': 15, 'Volkswagen': 16, 'BMW': 17, 'Nissan': 18, 'Lexus': 19,
                                             'Jaguar': 20, 'Land': 21, 'MG': 22, 'Volvo': 23, 'Daewoo': 24, 'Kia': 25, 'Fiat': 26, 'Force': 27,
                                             'Ambassador': 28, 'Ashok': 29, 'Isuzu': 30, 'Opel': 31})

train_df['transmission'] = train_df['transmission'].replace({'Manual': 1, 'Automatic': 2})

train_df['seller_type'] = train_df['seller_type'].replace({'Individual': 1, 'Dealer': 2, 'Trustmark Dealer': 3})

train_df['fuel'] = train_df['fuel'].replace({'Diesel': 1, 'Petrol': 2, 'LPG': 3, 'CNG': 4})

train_df['owner'] = train_df['owner'].replace({'First Owner': 1, 'Second Owner': 2, 'Third Owner': 3, 'Fourth & Above Owner': 4, 'Test Drive Car': 5})
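
One thing worth checking with dictionary-based encoding is whether every category actually appears in the dictionary, because `replace` leaves unknown strings untouched. The sketch below is one possible check, not part of the original notebook: it uses `Series.map` on a hypothetical sample in which 'Electric' is deliberately missing from the mapping.

```python
import pandas as pd

# Hypothetical sample; "Electric" is deliberately absent from the mapping
fuel = pd.Series(["Diesel", "Petrol", "Electric", "CNG"])
fuel_codes = {"Diesel": 1, "Petrol": 2, "LPG": 3, "CNG": 4}

# Series.map returns NaN for values that are not keys of the dictionary,
# whereas Series.replace would keep the original string in place
encoded = fuel.map(fuel_codes)

# List the categories that did not receive a code
unmapped = fuel[encoded.isna()].unique()
print(unmapped)
```

If `unmapped` is empty, every category received an integer code and the column can safely be treated as numeric.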


In [None]:
train_df

In [None]:
train_df.info()

# Data Visualisation

In [None]:
corr_matrix = train_df.corr()

plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()

In [None]:
# Compute the correlation matrix
corr_matrix = train_df.corr().abs()  # Compute absolute correlation values

# Extract correlations with 'selling_price'
selling_price_corr = corr_matrix["selling_price"].sort_values(ascending=False)

# Filter values greater than 0.4 for high correlation (adjustable threshold)
high_corr_selling_price = selling_price_corr[selling_price_corr > 0.4]

# Convert to DataFrame for better readability
high_corr_selling_price_df = high_corr_selling_price.reset_index()
high_corr_selling_price_df.columns = ["Feature", "Correlation with Selling Price"]

# Display the results
print(high_corr_selling_price_df)

# Data Training

## Splitting Data

In [None]:
input_data = train_df.drop(columns=['selling_price'])
output_data = train_df['selling_price']

In [None]:
x_train, x_test, y_train, y_test = train_test_split(input_data, output_data, test_size=0.2)
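
Because the split is random, the metrics change slightly from run to run. As a sketch of how to make the split repeatable, the example below passes a fixed `random_state` to `train_test_split` on toy arrays; the seed value 42 and the toy data are arbitrary choices for illustration, not something the notebook prescribes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for input_data / output_data
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# The same random_state yields the same split on every call
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

print(np.array_equal(a_test, b_test))  # both calls select the same test rows
print(a_train.shape, a_test.shape)     # 80% / 20% of the 10 rows
```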

## Model Training

We are fitting a Linear Regression model to the training data.

In [None]:
model = LinearRegression()
model.fit(x_train, y_train)

In [None]:
y_pred = model.predict(x_test)


In [None]:
# R² Score (coefficient of determination), shown as a percentage
r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2*100:.4f}%")

# Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae:.4f}")

# Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.4f}")

# Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
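
To make the four metrics concrete, here is a small worked example that computes them directly from their definitions with NumPy. The numbers are invented for illustration and have nothing to do with the car price data.

```python
import numpy as np

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 9.5])

residuals = y_true - y_hat

# MAE: mean of the absolute errors
mae_manual = np.mean(np.abs(residuals))

# MSE: mean of the squared errors; RMSE is its square root
mse_manual = np.mean(residuals ** 2)
rmse_manual = np.sqrt(mse_manual)

# R²: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mae_manual, mse_manual, rmse_manual, r2_manual)
```

Note that MAE and RMSE are in the same unit as the target (here, the selling price), while R² is unitless and measures the share of variance explained by the model.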

The R² score of this model is around 0.60, meaning it explains roughly 60% of the variance in selling price. This might be due to the limited amount of data.
However, our main focus is exporting the model so we can build our app.

# Exporting Model

In [None]:
# Use a context manager so the file is closed (and flushed) after writing
with open('car_price_model.pkl', 'wb') as f:
    pkl.dump(model, f)
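
For completeness, this is a minimal sketch of the round trip an app would perform: saving a model and loading it back with `pkl.load`. It uses a toy model and a temporary file (`toy_model.pkl`, a hypothetical name) so that it runs on its own; the app would open `car_price_model.pkl` instead.

```python
import os
import pickle as pkl
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy stand-in for the trained model: learns y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2 * X.ravel() + 1
toy_model = LinearRegression().fit(X, y)

# Save to a temporary location (hypothetical file name)
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
with open(path, "wb") as f:
    pkl.dump(toy_model, f)

# Load it back, as the app would do at start-up
with open(path, "rb") as f:
    loaded_model = pkl.load(f)

# The loaded model should give the same predictions as the original
same = np.allclose(toy_model.predict(X), loaded_model.predict(X))
print(same)
```

Pickle files should only be loaded from trusted sources, and the app should ideally use the same scikit-learn version that was used to train and save the model.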