# ðŸš— Car Price Prediction using Machine Learning

### ðŸ“˜ Project Overview
This project predicts the **selling price of used cars** based on their features such as model year, fuel type, transmission, and kilometers driven.  
It applies **Linear Regression** to build a predictive model and evaluates its performance using **RÂ² Score** and **Mean Absolute Error**.

---


In [1]:
!pip install pandas numpy matplotlib seaborn scikit-learn



## ðŸ“¥ Step 1: Load the Dataset
Weâ€™ll use the *CarDekho car price dataset* to train our model.  
Letâ€™s first import it and take a quick look at the data.


In [2]:
import pandas as pd

df = pd.read_excel("car data.csv.xlsx")
df.head()

Unnamed: 0,Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
0,ritz,2014,3.35,5.59,27000,Petrol,Dealer,Manual,0
1,sx4,2013,4.75,9.54,43000,Diesel,Dealer,Manual,0
2,ciaz,2017,7.25,9.85,6900,Petrol,Dealer,Manual,0
3,wagon r,2011,2.85,4.15,5200,Petrol,Dealer,Manual,0
4,swift,2014,4.6,6.87,42450,Diesel,Dealer,Manual,0


## ðŸ§¹ Step 2: Data Cleaning and Preprocessing
Weâ€™ll remove unnecessary columns and convert categorical features (like Fuel Type and Transmission) into numeric form.


In [3]:
df.info()

# Missing values
print("\nMissing values in each column:\n", df.isnull().sum())

# Overall stats (mean, std, etc.)
print("\nSummary statistics:\n", df.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 301 entries, 0 to 300
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Car_Name       301 non-null    object 
 1   Year           301 non-null    int64  
 2   Selling_Price  301 non-null    float64
 3   Present_Price  301 non-null    float64
 4   Kms_Driven     301 non-null    int64  
 5   Fuel_Type      301 non-null    object 
 6   Seller_Type    301 non-null    object 
 7   Transmission   301 non-null    object 
 8   Owner          301 non-null    int64  
dtypes: float64(2), int64(3), object(4)
memory usage: 21.3+ KB

Missing values in each column:
 Car_Name         0
Year             0
Selling_Price    0
Present_Price    0
Kms_Driven       0
Fuel_Type        0
Seller_Type      0
Transmission     0
Owner            0
dtype: int64

Summary statistics:
               Year  Selling_Price  Present_Price     Kms_Driven       Owner
count   301.000000     301.000000 

In [4]:
df.columns

Index(['Car_Name', 'Year', 'Selling_Price', 'Present_Price', 'Kms_Driven',
       'Fuel_Type', 'Seller_Type', 'Transmission', 'Owner'],
      dtype='object')

In [5]:
# Creating a new column for current year
df['Current_Year'] = 2020

# Now calculating how old the car is
df['Car_Age'] = df['Current_Year'] - df['Year']

# Drop[ing unnecessary columns
df = df.drop(['Car_Name', 'Year', 'Current_Year'], axis=1)

# Converting categorical columns to dummy/indicator variables
df = pd.get_dummies(df, drop_first=True)

# Showing the cleaned data
df.head()

Unnamed: 0,Selling_Price,Present_Price,Kms_Driven,Owner,Car_Age,Fuel_Type_Diesel,Fuel_Type_Petrol,Seller_Type_Individual,Transmission_Manual
0,3.35,5.59,27000,0,6,False,True,False,True
1,4.75,9.54,43000,0,7,True,False,False,True
2,7.25,9.85,6900,0,3,False,True,False,True
3,2.85,4.15,5200,0,9,False,True,False,True
4,4.6,6.87,42450,0,6,True,False,False,True


In [6]:
from sklearn.model_selection import train_test_split

# Separating features (X) and target (y)
X = df.drop(['Selling_Price'], axis=1)
y = df['Selling_Price']

# Splitting into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, X_test.shape

((240, 8), (61, 8))

## ðŸ¤– Step 3: Model Training
Weâ€™ll train a **Linear Regression** model using Scikit-learn to learn the relationship between car features and selling price.


In [7]:
from sklearn.linear_model import LinearRegression

# Creating the model
model = LinearRegression()
model.fit(X_train, y_train)

0,1,2
,fit_intercept,True
,copy_X,True
,tol,1e-06
,n_jobs,
,positive,False


## ðŸ“ˆ Step 4: Model Evaluation
Now, weâ€™ll check how well our model predicts car prices using metrics like **RÂ² Score** and **MAE (Mean Absolute Error)**.


In [8]:
from sklearn.metrics import r2_score, mean_absolute_error

y_pred = model.predict(X_test)

# Comparison of actual vs predicted
comparison = pd.DataFrame({'Actual Price': y_test, 'Predicted Price': y_pred})
print(comparison.head(10))

# Evaluating model performance
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print(f"\nModel Performance:")
print(f"RÂ² Score: {r2:.2f}")
print(f"Mean Absolute Error: {mae:.2f}")

     Actual Price  Predicted Price
177          0.35         2.955343
289         10.11         8.178939
228          4.95         6.454273
198          0.15        -1.424175
60           6.95         9.088899
9            7.45         7.418254
118          1.10         1.336443
154          0.50         0.840272
164          0.45         1.365019
33           6.00         7.490268

Model Performance:
RÂ² Score: 0.85
Mean Absolute Error: 1.22


## ðŸ’° Step 5: Predict Car Price for User Input
Weâ€™ll let the user input car details manually and predict the estimated selling price.


In [9]:
# Example: Predict price from user input
# Getting user input

print("Enter details of the car to predict its price:")
year = int(input("Year of manufacture: "))
present_price = float(input("Present price (in lakhs): "))
kms_driven = int(input("Kms driven: "))
fuel_type = input("Fuel type (Petrol/Diesel/CNG): ")
seller_type = input("Seller type (Dealer/Individual): ")
transmission = input("Transmission type (Manual/Automatic): ")
owner = int(input("Number of previous owners (0, 1, 2, 3): "))

# Creating dataframe for prediction
input_data = pd.DataFrame({
    'Year': [year],
    'Present_Price': [present_price],
    'Kms_Driven': [kms_driven],
    'Fuel_Type': [fuel_type],
    'Seller_Type': [seller_type],
    'Transmission': [transmission],
    'Owner': [owner]
})

# Match encoding used earlier
input_data_encoded = pd.get_dummies(input_data, drop_first=True).reindex(columns=X.columns, fill_value=0)

# Predict price
predicted_price = model.predict(input_data_encoded)[0]
print(f"\nðŸ’° The predicted selling price of the car is: {predicted_price:.2f} lakhs")

Enter details of the car to predict its price:


Year of manufacture:  2022
Present price (in lakhs):  8.5
Kms driven:  50000
Fuel type (Petrol/Diesel/CNG):  Petrol
Seller type (Dealer/Individual):  Dealer
Transmission type (Manual/Automatic):  Manual
Number of previous owners (0, 1, 2, 3):  0



ðŸ’° The predicted selling price of the car is: 8.07 lakhs


---
### âœ… Conclusion
- The model achieved an **RÂ² Score of 0.85**, showing strong predictive performance.
- This project demonstrates how machine learning can help estimate car resale values.
- Tools Used: *Python, Pandas, Scikit-learn, Matplotlib, Seaborn, Streamlit*

---


In [10]:
import pickle

pickle.dump(model, open('car_price_model.pkl', 'wb'))
pickle.dump(X.columns, open('x_columns.pkl', 'wb'))

---

## ðŸ’¬ Final Thoughts

This project was a great exercise in applying machine learning to a real-world business problem.  
It demonstrates how regression techniques can be used to estimate **market-based pricing** effectively.

ðŸ”¹ The model can be further improved by:
- Adding more features such as car brand, engine size, or location  
- Trying advanced algorithms like **Random Forest** or **XGBoost**  
- Deploying it using **Streamlit** or **Flask**