# Introduction

This notebook walks through preparing a dataset of car listings for predictive modeling. The dataset includes features such as car name, year, kilometres driven, fuel type, seller type, mileage, and engine capacity. The objective is to clean the data, encode its features as numbers, and fit a linear regression model that predicts a car's selling price.

Key steps include:

1. Importing necessary libraries and loading the dataset.
2. Cleaning and preprocessing data by handling missing values, removing duplicates, and transforming columns.
3. Encoding categorical variables to prepare the data for modeling.
4. Building and training a linear regression model using scikit-learn.
5. Making predictions on the test data and on a new input, then saving the trained model with pickle.

This notebook aims to provide a clear and reusable approach for similar predictive tasks involving structured datasets.



In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pickle as pk

### Load the dataset

In [2]:
# Load the dataset
car_df = pd.read_csv('Cardetails.csv')


In [3]:
car_df.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats
0,Maruti Swift Dzire VDI,2014,450000,145500,Diesel,Individual,Manual,First Owner,23.4 kmpl,1248 CC,74 bhp,190Nm@ 2000rpm,5.0
1,Skoda Rapid 1.5 TDI Ambition,2014,370000,120000,Diesel,Individual,Manual,Second Owner,21.14 kmpl,1498 CC,103.52 bhp,250Nm@ 1500-2500rpm,5.0
2,Honda City 2017-2020 EXi,2006,158000,140000,Petrol,Individual,Manual,Third Owner,17.7 kmpl,1497 CC,78 bhp,"12.7@ 2,700(kgm@ rpm)",5.0
3,Hyundai i20 Sportz Diesel,2010,225000,127000,Diesel,Individual,Manual,First Owner,23.0 kmpl,1396 CC,90 bhp,22.4 kgm at 1750-2750rpm,5.0
4,Maruti Swift VXI BSIII,2007,130000,120000,Petrol,Individual,Manual,First Owner,16.1 kmpl,1298 CC,88.2 bhp,"11.5@ 4,500(kgm@ rpm)",5.0


### Drop unnecessary columns and handle missing values

In [4]:
# Drop unnecessary columns and handle missing values
car_df.drop(columns='torque', inplace=True)
car_df.dropna(inplace=True)
car_df.drop_duplicates(inplace=True)
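
Before dropping rows, it can help to see how many would be affected. The sketch below counts missing values and duplicate rows on a small toy frame; `toy_df` and its values are illustrative stand-ins, not rows from `Cardetails.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the car data (illustrative values only)
toy_df = pd.DataFrame({
    'name': ['Maruti Swift', 'Skoda Rapid', 'Skoda Rapid', 'Honda City'],
    'mileage': ['23.4 kmpl', '21.14 kmpl', '21.14 kmpl', np.nan],
    'seats': [5.0, 5.0, 5.0, np.nan],
})

# Number of missing values in each column
missing_per_column = toy_df.isna().sum()

# Number of rows that exactly repeat an earlier row
duplicate_rows = int(toy_df.duplicated().sum())

# Same cleaning steps as the cell above, without modifying toy_df in place
cleaned = toy_df.dropna().drop_duplicates()

print(missing_per_column)
print(f'duplicates: {duplicate_rows}, rows before: {len(toy_df)}, rows after: {len(cleaned)}')
```

Comparing the row count before and after shows how much data the cleaning step discards.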


In [5]:
car_df.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,seats
0,Maruti Swift Dzire VDI,2014,450000,145500,Diesel,Individual,Manual,First Owner,23.4 kmpl,1248 CC,74 bhp,5.0
1,Skoda Rapid 1.5 TDI Ambition,2014,370000,120000,Diesel,Individual,Manual,Second Owner,21.14 kmpl,1498 CC,103.52 bhp,5.0
2,Honda City 2017-2020 EXi,2006,158000,140000,Petrol,Individual,Manual,Third Owner,17.7 kmpl,1497 CC,78 bhp,5.0
3,Hyundai i20 Sportz Diesel,2010,225000,127000,Diesel,Individual,Manual,First Owner,23.0 kmpl,1396 CC,90 bhp,5.0
4,Maruti Swift VXI BSIII,2007,130000,120000,Petrol,Individual,Manual,First Owner,16.1 kmpl,1298 CC,88.2 bhp,5.0


### Utility functions for cleaning and transforming data

In [6]:
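# Utility functions for cleaning and transforming data
def get_brand_name(car_name):
    # Keep only the first word of the car name, e.g. 'Maruti Swift Dzire VDI' -> 'Maruti'
    return car_name.split()[0].strip()

def clean_numeric(value):
    # Keep the number before the first space, e.g. '23.4 kmpl' -> 23.4
    value = value.split(' ')[0].strip()
    # Fall back to 0.0 when no number precedes the unit
    return float(value) if value else 0.0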
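
A quick check of the two helpers on sample strings shows what they return, including two edge cases worth knowing about. The functions are repeated here so the snippet runs on its own; the `'Land Rover Discovery'` and `' bhp'` inputs are illustrative and not taken from the dataset.

```python
def get_brand_name(car_name):
    return car_name.split()[0].strip()

def clean_numeric(value):
    value = value.split(' ')[0].strip()
    return float(value) if value else 0.0

brand = get_brand_name('Maruti Swift Dzire VDI')       # first word only
two_word_brand = get_brand_name('Land Rover Discovery')  # truncated to 'Land'
mileage = clean_numeric('23.4 kmpl')                   # number before the unit
empty_power = clean_numeric(' bhp')                    # unit with no number -> 0.0

print(brand, two_word_brand, mileage, empty_power)
```

Two-word brands are cut to their first word, and a string with a unit but no number becomes `0.0`, which the model will then treat as a real measurement.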


### Convert text columns to numeric values and encode categorical variables

In [7]:
# Transform data using utility functions
car_df['name'] = car_df['name'].apply(get_brand_name)
car_df['mileage'] = car_df['mileage'].apply(clean_numeric)
car_df['max_power'] = car_df['max_power'].apply(clean_numeric)
car_df['engine'] = car_df['engine'].apply(clean_numeric)

# Encode categorical variables
encode_dicts = {
    'name': {name: idx+1 for idx, name in enumerate(car_df['name'].unique())},
    'transmission': {'Manual': 1, 'Automatic': 2},
    'seller_type': {'Individual': 1, 'Dealer': 2, 'Trustmark Dealer': 3},
    'fuel': {'Diesel': 1, 'Petrol': 2, 'LPG': 3, 'CNG': 4},
    'owner': {'First Owner': 1, 'Second Owner': 2, 'Third Owner': 3, 'Fourth & Above Owner': 4, 'Test Drive Car': 5}
}
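for col, mapping in encode_dicts.items():
    # Assign the result back: calling replace(..., inplace=True) on car_df[col]
    # may act on a temporary copy and leave car_df unchanged in recent pandas versions
    car_df[col] = car_df[col].map(mapping)  # values missing from the mapping become NaN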
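
The integer codes above give each category a position on a number line, so a linear model treats `CNG` (4) as "four times" `Diesel` (1). One common alternative, shown here only as a sketch and not as this notebook's method, is one-hot encoding with `pd.get_dummies`. The frame `toy_df` is an illustrative stand-in for `car_df`.

```python
import pandas as pd

# Toy frame with two nominal columns and one numeric column (illustrative values)
toy_df = pd.DataFrame({
    'fuel': ['Diesel', 'Petrol', 'CNG', 'Petrol'],
    'transmission': ['Manual', 'Manual', 'Automatic', 'Manual'],
    'km_driven': [145500, 120000, 140000, 127000],
})

# Each category becomes its own 0/1 column, so no ordering is implied
one_hot = pd.get_dummies(toy_df, columns=['fuel', 'transmission'], dtype=int)

print(one_hot.columns.tolist())
print(one_hot)
```

The trade-off is a wider table: a column with many categories, such as the brand name, adds one column per brand.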


In [8]:
car_df.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,seats
0,1,2014,450000,145500,1,1,1,1,23.4,1248.0,74.0,5.0
1,2,2014,370000,120000,1,1,1,2,21.14,1498.0,103.52,5.0
2,3,2006,158000,140000,2,1,1,3,17.7,1497.0,78.0,5.0
3,4,2010,225000,127000,1,1,1,1,23.0,1396.0,90.0,5.0
4,1,2007,130000,120000,2,1,1,1,16.1,1298.0,88.2,5.0


### Train a linear regression model using scikit-learn

In [9]:
# Prepare data for modeling
input_data = car_df.drop(columns=['selling_price'])
output_data = car_df['selling_price']

# Split data
x_train, x_test, y_train, y_test = train_test_split(input_data, output_data, test_size=0.2, random_state=42)

# Train a linear regression model
model = LinearRegression()
model.fit(x_train, y_train)

# Make predictions
predictions = model.predict(x_test)
print(predictions)


[ 771222.5869164   680626.30204771  532211.36059441 ... 1061558.84250543
 -239241.24473153   52377.60034005]
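
Printing raw predictions does not say how good they are, and the output above includes a negative price (`-239241.24...`), which a plain linear regression is free to produce. A minimal sketch of scoring a model on held-out data follows. It uses synthetic data generated with NumPy so that it runs without `Cardetails.csv`; the coefficients and noise level are assumptions made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (assumed for illustration, not the car dataset)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + 1.0 + rng.normal(0, 0.1, size=200)

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(x_train, y_train)
predictions = model.predict(x_test)

# R^2: share of variance explained; MAE: average absolute error in target units
r2 = r2_score(y_test, predictions)
mae = mean_absolute_error(y_test, predictions)

# Count predictions below zero, which would be impossible values for a price
negative_count = int((predictions < 0).sum())

print(f'R^2: {r2:.4f}, MAE: {mae:.4f}, negative predictions: {negative_count}')
```

In this notebook the same three lines of scoring could be applied to `y_test` and `predictions` from the cell above.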


### Predicting the selling price for a new input

In [10]:
# Predicting for new input
new_data = pd.DataFrame(
    [[9, 2001, 9000, 2, 1, 1, 1, 20.3, 1199.0, 84.0, 5.0]],
    columns=['name', 'year', 'km_driven', 'fuel', 'seller_type', 'transmission', 'owner', 'mileage', 'engine', 'max_power', 'seats']
)
new_prediction = model.predict(new_data)
print(new_prediction)

[41814.57258646]


In [11]:
# Save the trained model with pickle (next cell)

In [12]:
# Use a context manager so the file is closed after writing
with open('model.pkl', 'wb') as model_file:
    pk.dump(model, model_file)
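
To use the saved model elsewhere, it has to be loaded back with `pk.load`. The following sketch saves and reloads a small model in a temporary directory and checks that both copies predict the same values. The training data is made up, and the temporary path stands in for `model.pkl`.

```python
import os
import pickle as pk
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Small made-up model so the example runs on its own
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'model.pkl')

    # Write the model to disk
    with open(model_path, 'wb') as model_file:
        pk.dump(model, model_file)

    # Read it back into a new object
    with open(model_path, 'rb') as model_file:
        loaded_model = pk.load(model_file)

# The reloaded model should reproduce the original predictions
same = bool(np.allclose(model.predict(X), loaded_model.predict(X)))
print(same)
```

Pickle files can execute code when loaded, so only load files from a trusted source, and load them with the same scikit-learn version that created them where possible.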

### Run the command below in your terminal to open the predictive web app
```streamlit run cars.py```