### Load the dataset and explore it
##### Step 1.1: Import pandas and read the CSV file
##### Step 1.2: Use .head(), .info(), .describe() to understand the dataset
##### Step 1.3: Check for missing values using .isnull().sum()
##### Step 1.4: Identify which columns are features (X) and which is the label (y)

In [16]:
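import pandas as pd

data = pd.read_csv("Car_Data.csv")
data.head()          # preview the first rows (not rendered: a cell shows only its last expression)
data.info()          # prints column dtypes and non-null counts
data.describe()      # summary statistics (not rendered unless wrapped in print() or display())
data.isnull().sum()  # missing values per column; rendered because it is the last expression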

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 301 entries, 0 to 300
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Car_Name       301 non-null    object 
 1   Year           301 non-null    int64  
 2   Selling_Price  301 non-null    float64
 3   Present_Price  301 non-null    float64
 4   Kms_Driven     301 non-null    int64  
 5   Fuel_Type      301 non-null    object 
 6   Seller_Type    301 non-null    object 
 7   Transmission   301 non-null    object 
 8   Owner          301 non-null    int64  
dtypes: float64(2), int64(3), object(4)
memory usage: 21.3+ KB


Car_Name         0
Year             0
Selling_Price    0
Present_Price    0
Kms_Driven       0
Fuel_Type        0
Seller_Type      0
Transmission     0
Owner            0
dtype: int64
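
Step 1.4 is not carried out in the cell above, and the `.head()` and `.describe()` results are not shown because only the last expression of a cell is rendered. The following is a minimal sketch of both points. It assumes a small hypothetical frame (`sample`) that only mimics the column layout reported by `data.info()`; the row values are made up for illustration, and `target`, `feature_cols`, and `categorical_cols` are illustrative names.

```python
import pandas as pd

# Hypothetical rows: only the column names come from the data.info() output;
# every value below is invented for illustration.
sample = pd.DataFrame({
    "Car_Name": ["car_a", "car_b", "car_c"],
    "Year": [2014, 2016, 2011],
    "Selling_Price": [3.5, 6.0, 2.1],
    "Present_Price": [5.6, 8.9, 4.2],
    "Kms_Driven": [27000, 12000, 60000],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol"],
    "Seller_Type": ["Dealer", "Dealer", "Individual"],
    "Transmission": ["Manual", "Automatic", "Manual"],
    "Owner": [0, 0, 1],
})

# Intermediate results need an explicit print() to appear in the cell output.
print(sample.head())
print(sample.describe())

# Step 1.4: the label is the column to predict; everything else is a candidate feature.
target = "Selling_Price"
feature_cols = [c for c in sample.columns if c != target]

# Non-numeric features must be encoded (or dropped) before fitting a linear model.
categorical_cols = sample[feature_cols].select_dtypes(exclude="number").columns.tolist()
print("Features:", feature_cols)
print("Categorical:", categorical_cols)
```

On the real dataset the same three lines at the end would run against `data` instead of `sample`.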

### Preprocess the data
##### Step 2.1: Handle missing values (if any)
##### Step 2.2: Convert categorical variables into numerical using get_dummies or map
##### Step 2.3: Drop unnecessary columns like car name if needed

In [17]:
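data = data.dropna()                          # Step 2.1: no rows are dropped here, since no column has missing values
data = data.drop(columns=['Car_Name'])        # Step 2.3: drop the car name
data = pd.get_dummies(data, drop_first=True)  # Step 2.2: one-hot encode the remaining object columns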
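
To make the effect of `drop_first=True` concrete, here is a small sketch on a toy frame. The category values (`Petrol`, `Diesel`, `CNG`, `Manual`, `Automatic`) are assumptions for illustration and are not taken from the dataset; `dtype=int` is added only so the dummy columns print as 0/1.

```python
import pandas as pd

# Toy frame with made-up category values.
toy = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Diesel", "CNG", "Petrol"],
    "Transmission": ["Manual", "Automatic", "Manual", "Manual"],
    "Owner": [0, 1, 0, 0],
})

# drop_first=True removes the first (alphabetically sorted) level of each
# categorical column, so k levels become k - 1 indicator columns.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded.columns.tolist())
# ['Owner', 'Fuel_Type_Diesel', 'Fuel_Type_Petrol', 'Transmission_Manual']
```

Dropping one level per column avoids perfectly collinear indicator columns: a row with zeros in both `Fuel_Type_*` columns is implicitly the dropped level.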

### Create feature matrix (X) and label vector (y)
##### Step 3.1: Separate features and target column

In [18]:
y = data['Selling_Price']
X = data.drop(['Selling_Price'], axis=1)

### Train-Test Split
##### Step 4.1: Use train_test_split from sklearn to split X and y

In [19]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
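
As a quick sanity check on the split sizes, the sketch below applies the same `test_size` and `random_state` to placeholder arrays with 301 rows (the row count reported by `data.info()`). The arrays `X_demo` and `y_demo` are stand-ins, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the dataset.
X_demo = np.arange(301 * 2).reshape(301, 2)
y_demo = np.arange(301)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The test size is rounded up: ceil(0.2 * 301) = 61 rows, leaving 240 for training.
print(X_tr.shape, X_te.shape)  # (240, 2) (61, 2)
```

Fixing `random_state` makes the shuffle reproducible, so rerunning the cell yields the same rows in each partition.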

### Train the Linear Regression Model
##### Step 5.1: Import LinearRegression and fit it to training data

In [20]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)

### Evaluate the model
##### Step 6.1: Predict using X_test
##### Step 6.2: Use r2_score and mean_squared_error to evaluate the fit

In [21]:
from sklearn.metrics import r2_score, mean_squared_error
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print("R² Score:", r2)
print("Mean Squared Error:", mse)

R² Score: 0.848981302489644
Mean Squared Error: 3.4788039706439524
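
The MSE is expressed in squared units of `Selling_Price`, which makes it hard to read directly. One common way to interpret it, sketched below, is to take the square root (RMSE), which is in the same units as the target. The `mse` value is copied from the output above.

```python
import math

mse = 3.4788039706439524  # Mean Squared Error printed above

# RMSE is on the same scale as Selling_Price, so it reads as a typical error size.
rmse = math.sqrt(mse)
print(f"RMSE: {rmse:.2f}")  # RMSE: 1.87
```

Read together with the R² of about 0.85, this suggests the model explains most of the variance on the test set while typical prediction errors are a little under 2 price units.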


### Print coefficients and intercept to understand the model

In [22]:
print("Model Coefficients:", model.coef_)
print("Model Intercept:", model.intercept_)

Model Coefficients: [ 3.53801365e-01  4.29152503e-01 -6.15725866e-06 -9.03759824e-01
  2.53327258e+00  7.38464226e-01 -1.19059291e+00 -1.63902155e+00]
Model Intercept: -709.9529230420987
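
The raw coefficient array is easier to read when each value is paired with its feature name, which on the real model would be `pd.Series(model.coef_, index=X.columns)`. The sketch below shows the idea on synthetic, noise-free data so that it runs on its own; `X_toy`, `y_toy`, `toy_model`, and the chosen coefficients (0.35, 0.43, -700) are illustrative assumptions, not values from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic features; the ranges are assumptions chosen for illustration.
X_toy = pd.DataFrame({
    "Year": rng.integers(2005, 2019, size=50),
    "Present_Price": rng.uniform(1.0, 20.0, size=50),
})

# Noise-free linear target with known coefficients and intercept.
y_toy = 0.35 * X_toy["Year"] + 0.43 * X_toy["Present_Price"] - 700.0

toy_model = LinearRegression().fit(X_toy, y_toy)

# Label each coefficient with the column it belongs to.
coef_table = pd.Series(toy_model.coef_, index=X_toy.columns)
print(coef_table)
print("Intercept:", toy_model.intercept_)
```

The toy example also hints at why the real intercept is so large and negative: the intercept is the prediction when every feature is zero, including `Year = 0`. Assuming the model years lie around 2010, the `Year` coefficient of about 0.354 contributes roughly 0.354 × 2010 ≈ 711, which the intercept of about -710 offsets.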


### (Optional) Save the model using pickle or joblib

In [23]:
import pickle
with open('car_price_model.pkl', 'wb') as f:
    pickle.dump(model, f)
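
Saving is only half of the round trip. The sketch below shows loading a pickled model back and checking that it predicts the same values. To stay self-contained it fits a tiny stand-in model and writes to a temporary directory; `small_model`, `restored`, and `demo_model.pkl` are hypothetical names, and the real notebook would open `car_price_model.pkl` instead.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model so the example runs without the dataset.
X_small = np.array([[1.0], [2.0], [3.0]])
y_small = np.array([2.0, 4.0, 6.0])
small_model = LinearRegression().fit(X_small, y_small)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")

    # Save the fitted model.
    with open(path, "wb") as f:
        pickle.dump(small_model, f)

    # Load it back; note the binary read mode.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same = np.allclose(small_model.predict(X_small), restored.predict(X_small))
print(same)  # True
```

Two caveats apply: unpickling executes arbitrary code, so only load files from a trusted source, and a pickle is best loaded with the same scikit-learn version that created it.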