#### A Chinese automobile company aspires to enter the US market by setting up their manufacturing unit there and producing cars locally to give competition to their US and European counterparts. They have contracted an automobile consulting company to understand the factors on which the pricing of cars depends. Specifically, they want to understand the factors affecting the pricing of cars in the American market, since those may be very different from the Chinese market. Essentially, the company wants to know:
##### 1. Which variables are significant in predicting the price of a car.
##### 2. How well those variables describe the price of a car.
##### Based on various market surveys, the consulting firm has gathered a large dataset of different types of cars across the American market.



### Business Goal:

You are required to model the price of cars with the available independent variables. It will be used by the management to understand how exactly the prices vary with the independent variables. They can accordingly manipulate the design of the cars, the business strategy etc. to meet certain price levels. Further, the model will be a good way for the management to understand the pricing dynamics of a new market.



![Rolls Royce Spectre](R.jpeg)


### 1. Loading and Preprocessing 

### Importing Libraries

In [282]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR

In [159]:
### Load the dataset

In [221]:
data = pd.read_csv("Carprice.csv")
data

Unnamed: 0,car_ID,symboling,CarName,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,...,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
0,1,3,alfa-romero giulia,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0
1,2,3,alfa-romero stelvio,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500.0
2,3,1,alfa-romero Quadrifoglio,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500.0
3,4,2,audi 100 ls,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.40,10.0,102,5500,24,30,13950.0
4,5,2,audi 100ls,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.40,8.0,115,5500,18,22,17450.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,201,-1,volvo 145e (sw),gas,std,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,9.5,114,5400,23,28,16845.0
201,202,-1,volvo 144ea,gas,turbo,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,8.7,160,5300,19,25,19045.0
202,203,-1,volvo 244dl,gas,std,four,sedan,rwd,front,109.1,...,173,mpfi,3.58,2.87,8.8,134,5500,18,23,21485.0
203,204,-1,volvo 246,diesel,turbo,four,sedan,rwd,front,109.1,...,145,idi,3.01,3.40,23.0,106,4800,26,27,22470.0


### Check the first few rows of the dataset


In [164]:
data.head()

Unnamed: 0,car_ID,symboling,CarName,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,...,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
0,1,3,alfa-romero giulia,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0
1,2,3,alfa-romero stelvio,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500.0
2,3,1,alfa-romero Quadrifoglio,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500.0
3,4,2,audi 100 ls,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950.0
4,5,2,audi 100ls,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450.0


In [166]:
print(data.head())

   car_ID  symboling                   CarName fueltype aspiration doornumber  \
0       1          3        alfa-romero giulia      gas        std        two   
1       2          3       alfa-romero stelvio      gas        std        two   
2       3          1  alfa-romero Quadrifoglio      gas        std        two   
3       4          2               audi 100 ls      gas        std       four   
4       5          2                audi 100ls      gas        std       four   

       carbody drivewheel enginelocation  wheelbase  ...  enginesize  \
0  convertible        rwd          front       88.6  ...         130   
1  convertible        rwd          front       88.6  ...         130   
2    hatchback        rwd          front       94.5  ...         152   
3        sedan        fwd          front       99.8  ...         109   
4        sedan        4wd          front       99.4  ...         136   

   fuelsystem  boreratio  stroke compressionratio horsepower  peakrpm citympg  \

In [168]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   car_ID            205 non-null    int64  
 1   symboling         205 non-null    int64  
 2   CarName           205 non-null    object 
 3   fueltype          205 non-null    object 
 4   aspiration        205 non-null    object 
 5   doornumber        205 non-null    object 
 6   carbody           205 non-null    object 
 7   drivewheel        205 non-null    object 
 8   enginelocation    205 non-null    object 
 9   wheelbase         205 non-null    float64
 10  carlength         205 non-null    float64
 11  carwidth          205 non-null    float64
 12  carheight         205 non-null    float64
 13  curbweight        205 non-null    int64  
 14  enginetype        205 non-null    object 
 15  cylindernumber    205 non-null    object 
 16  enginesize        205 non-null    int64  
 1

In [170]:
data.shape


(205, 26)

In [172]:
data.columns

Index(['car_ID', 'symboling', 'CarName', 'fueltype', 'aspiration',
       'doornumber', 'carbody', 'drivewheel', 'enginelocation', 'wheelbase',
       'carlength', 'carwidth', 'carheight', 'curbweight', 'enginetype',
       'cylindernumber', 'enginesize', 'fuelsystem', 'boreratio', 'stroke',
       'compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg',
       'price'],
      dtype='object')

In [174]:
# Check for missing values

In [176]:
missing_values = data.isnull().sum()
print("Missing values in each column:")
print(missing_values)

Missing values in each column:
car_ID              0
symboling           0
CarName             0
fueltype            0
aspiration          0
doornumber          0
carbody             0
drivewheel          0
enginelocation      0
wheelbase           0
carlength           0
carwidth            0
carheight           0
curbweight          0
enginetype          0
cylindernumber      0
enginesize          0
fuelsystem          0
boreratio           0
stroke              0
compressionratio    0
horsepower          0
peakrpm             0
citympg             0
highwaympg          0
price               0
dtype: int64


In [178]:
# Check for duplicate rows
data.duplicated().sum()

0

In [180]:
data.columns = data.columns.str.strip()

In [182]:
print(data.columns)

Index(['car_ID', 'symboling', 'CarName', 'fueltype', 'aspiration',
       'doornumber', 'carbody', 'drivewheel', 'enginelocation', 'wheelbase',
       'carlength', 'carwidth', 'carheight', 'curbweight', 'enginetype',
       'cylindernumber', 'enginesize', 'fuelsystem', 'boreratio', 'stroke',
       'compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg',
       'price'],
      dtype='object')


In [184]:
data.shape

(205, 26)

In [224]:
# Drop irrelevant columns
data = data.drop(columns=['car_ID', 'CarName'])
data

Unnamed: 0,symboling,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,carlength,carwidth,...,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
0,3,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0
1,3,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500.0
2,1,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500.0
3,2,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,...,109,mpfi,3.19,3.40,10.0,102,5500,24,30,13950.0
4,2,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,...,136,mpfi,3.19,3.40,8.0,115,5500,18,22,17450.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,-1,gas,std,four,sedan,rwd,front,109.1,188.8,68.9,...,141,mpfi,3.78,3.15,9.5,114,5400,23,28,16845.0
201,-1,gas,turbo,four,sedan,rwd,front,109.1,188.8,68.8,...,141,mpfi,3.78,3.15,8.7,160,5300,19,25,19045.0
202,-1,gas,std,four,sedan,rwd,front,109.1,188.8,68.9,...,173,mpfi,3.58,2.87,8.8,134,5500,18,23,21485.0
203,-1,diesel,turbo,four,sedan,rwd,front,109.1,188.8,68.9,...,145,idi,3.01,3.40,23.0,106,4800,26,27,22470.0


In [226]:
#  Convert categorical variables into numeric using One-Hot Encoding
categorical_columns = ['fueltype', 'aspiration', 'doornumber', 'carbody', 
                       'drivewheel', 'enginelocation', 'enginetype', 'fuelsystem']
data = pd.get_dummies(data, columns=categorical_columns, drop_first=True)
data

Unnamed: 0,symboling,wheelbase,carlength,carwidth,carheight,curbweight,cylindernumber,enginesize,boreratio,stroke,...,enginetype_ohcf,enginetype_ohcv,enginetype_rotor,fuelsystem_2bbl,fuelsystem_4bbl,fuelsystem_idi,fuelsystem_mfi,fuelsystem_mpfi,fuelsystem_spdi,fuelsystem_spfi
0,3,88.6,168.8,64.1,48.8,2548,four,130,3.47,2.68,...,False,False,False,False,False,False,False,True,False,False
1,3,88.6,168.8,64.1,48.8,2548,four,130,3.47,2.68,...,False,False,False,False,False,False,False,True,False,False
2,1,94.5,171.2,65.5,52.4,2823,six,152,2.68,3.47,...,False,True,False,False,False,False,False,True,False,False
3,2,99.8,176.6,66.2,54.3,2337,four,109,3.19,3.40,...,False,False,False,False,False,False,False,True,False,False
4,2,99.4,176.6,66.4,54.3,2824,five,136,3.19,3.40,...,False,False,False,False,False,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,-1,109.1,188.8,68.9,55.5,2952,four,141,3.78,3.15,...,False,False,False,False,False,False,False,True,False,False
201,-1,109.1,188.8,68.8,55.5,3049,four,141,3.78,3.15,...,False,False,False,False,False,False,False,True,False,False
202,-1,109.1,188.8,68.9,55.5,3012,six,173,3.58,2.87,...,False,True,False,False,False,False,False,True,False,False
203,-1,109.1,188.8,68.9,55.5,3217,six,145,3.01,3.40,...,False,False,False,False,False,True,False,False,False,False


In [217]:
data.columns

Index(['symboling', 'wheelbase', 'carlength', 'carwidth', 'carheight',
       'curbweight', 'cylindernumber', 'enginesize', 'boreratio', 'stroke',
       'compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg',
       'price', 'fueltype_gas', 'aspiration_turbo', 'doornumber_two',
       'carbody_hardtop', 'carbody_hatchback', 'carbody_sedan',
       'carbody_wagon', 'drivewheel_fwd', 'drivewheel_rwd',
       'enginelocation_rear', 'enginetype_dohcv', 'enginetype_l',
       'enginetype_ohc', 'enginetype_ohcf', 'enginetype_ohcv',
       'enginetype_rotor', 'fuelsystem_2bbl', 'fuelsystem_4bbl',
       'fuelsystem_idi', 'fuelsystem_mfi', 'fuelsystem_mpfi',
       'fuelsystem_spdi', 'fuelsystem_spfi'],
      dtype='object')

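#### Handle 'cylindernumber' by converting it to numeric
##### A first attempt, `data['cylindernumber'].apply(lambda x: int(x) if x.isdigit() else 0)`, did not work: the column holds spelled-out words such as 'four' and 'six', so `isdigit()` is always False and every row would become 0. An explicit mapping is used instead.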

In [228]:
# Define a mapping for the cylinder numbers
cylinder_mapping = {
    'two': 2,
    'three': 3,
    'four': 4,
    'five': 5,
    'six': 6,
    'eight': 8,
    'twelve': 12
}

# Apply the mapping to the 'cylindernumber' column
data['cylindernumber'] = data['cylindernumber'].map(cylinder_mapping)

# Verify the conversion
print(data['cylindernumber'].unique())


[ 4  6  5  3 12  2  8]
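
The difference between the failed attempt and the dictionary mapping can be reproduced on a few values. This is a minimal sketch; the `cylinders` series below is a hypothetical sample, not the real column.

```python
import pandas as pd

# Hypothetical sample of the raw column: values are spelled-out words.
cylinders = pd.Series(["four", "six", "five", "twelve"])

# str.isdigit() is False for every word, so this attempt turns all rows into 0.
first_attempt = cylinders.apply(lambda x: int(x) if x.isdigit() else 0)

# An explicit dictionary converts the words; any label missing from it would become NaN.
cylinder_mapping = {"two": 2, "three": 3, "four": 4, "five": 5,
                    "six": 6, "eight": 8, "twelve": 12}
mapped = cylinders.map(cylinder_mapping)

print(first_attempt.tolist())
print(mapped.tolist())
print("Unmapped values:", mapped.isna().sum())
```

Checking `isna().sum()` after `map` is a cheap guard against labels that are not in the dictionary.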


In [230]:
data


Unnamed: 0,symboling,wheelbase,carlength,carwidth,carheight,curbweight,cylindernumber,enginesize,boreratio,stroke,...,enginetype_ohcf,enginetype_ohcv,enginetype_rotor,fuelsystem_2bbl,fuelsystem_4bbl,fuelsystem_idi,fuelsystem_mfi,fuelsystem_mpfi,fuelsystem_spdi,fuelsystem_spfi
0,3,88.6,168.8,64.1,48.8,2548,4,130,3.47,2.68,...,False,False,False,False,False,False,False,True,False,False
1,3,88.6,168.8,64.1,48.8,2548,4,130,3.47,2.68,...,False,False,False,False,False,False,False,True,False,False
2,1,94.5,171.2,65.5,52.4,2823,6,152,2.68,3.47,...,False,True,False,False,False,False,False,True,False,False
3,2,99.8,176.6,66.2,54.3,2337,4,109,3.19,3.40,...,False,False,False,False,False,False,False,True,False,False
4,2,99.4,176.6,66.4,54.3,2824,5,136,3.19,3.40,...,False,False,False,False,False,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,-1,109.1,188.8,68.9,55.5,2952,4,141,3.78,3.15,...,False,False,False,False,False,False,False,True,False,False
201,-1,109.1,188.8,68.8,55.5,3049,4,141,3.78,3.15,...,False,False,False,False,False,False,False,True,False,False
202,-1,109.1,188.8,68.9,55.5,3012,6,173,3.58,2.87,...,False,True,False,False,False,False,False,True,False,False
203,-1,109.1,188.8,68.9,55.5,3217,6,145,3.01,3.40,...,False,False,False,False,False,True,False,False,False,False


In [40]:
data.shape

(205, 39)

#### Split the dataset into features (X) and target (y)

In [233]:
X = data.drop(columns='price') 
y = data['price'] 

In [235]:
# Scale the numerical features
scaler = StandardScaler()
numerical_features = ['wheelbase', 'curbweight', 'enginesize', 'horsepower', 'citympg', 'highwaympg']
X[numerical_features] = scaler.fit_transform(X[numerical_features])

In [237]:
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
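
In the cells above the scaler is fitted on the full dataset before the split, so the test rows contribute to the mean and standard deviation used for scaling. A possible alternative is to split first and fit the scaler on the training rows only. The sketch below uses synthetic data as a stand-in; `X_demo`, `y_demo` and `scaler_demo` are hypothetical names and the values are not from Carprice.csv.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the car data (hypothetical values, not the real dataset).
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "wheelbase": rng.normal(98, 6, 205),
    "curbweight": rng.normal(2500, 500, 205),
    "enginesize": rng.normal(127, 40, 205),
})
y_demo = 50 * X_demo["enginesize"] + 4 * X_demo["curbweight"] + rng.normal(0, 500, 205)

# Split first, so the test rows never influence the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
X_tr, X_te = X_tr.copy(), X_te.copy()

cols = ["wheelbase", "curbweight", "enginesize"]
scaler_demo = StandardScaler()
X_tr[cols] = scaler_demo.fit_transform(X_tr[cols])  # learn mean/std from the training rows only
X_te[cols] = scaler_demo.transform(X_te[cols])      # reuse the training statistics on the test rows

print(X_tr.shape, X_te.shape)
print(X_tr[cols].mean().abs().max())  # close to 0 by construction
print(X_te[cols].mean().abs().max())  # usually small but not exactly 0
```

With only 205 rows the effect on the scores is probably small, but fitting on the training rows keeps the test set untouched until evaluation.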

In [239]:
X_train

Unnamed: 0,symboling,wheelbase,carlength,carwidth,carheight,curbweight,cylindernumber,enginesize,boreratio,stroke,...,enginetype_ohcf,enginetype_ohcv,enginetype_rotor,fuelsystem_2bbl,fuelsystem_4bbl,fuelsystem_idi,fuelsystem_mfi,fuelsystem_mpfi,fuelsystem_spdi,fuelsystem_spfi
66,0,1.022697,175.0,66.1,54.4,0.278074,4,0.170739,3.43,3.64,...,False,False,False,False,False,True,False,False,False,False
111,0,1.522109,186.7,68.4,56.7,1.000049,4,-0.166277,3.46,2.19,...,False,False,False,False,False,False,False,True,False,False
153,0,-0.508831,169.7,63.6,59.1,-0.530538,4,-0.840310,3.05,3.03,...,False,False,False,True,False,False,False,False,False,False
96,1,-0.708596,165.3,63.8,54.5,-1.125445,4,-0.719947,3.15,3.29,...,False,False,False,True,False,False,False,False,False,False
38,0,-0.375655,167.5,65.2,53.3,-0.513210,4,-0.407003,3.15,3.58,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
106,1,0.073815,178.5,67.9,49.7,1.123266,6,1.302152,3.43,3.27,...,False,True,False,False,False,False,False,True,False,False
14,1,0.789639,189.0,66.9,55.7,0.961544,6,0.892917,3.31,3.19,...,False,False,False,False,False,False,False,True,False,False
92,1,-0.708596,165.3,63.8,54.5,-1.188979,4,-0.719947,3.15,3.29,...,False,False,False,True,False,False,False,False,False,False
179,3,0.689756,183.5,67.7,52.0,0.886458,6,1.061426,3.27,3.35,...,False,False,False,False,False,False,False,True,False,False


In [241]:
X_test

Unnamed: 0,symboling,wheelbase,carlength,carwidth,carheight,curbweight,cylindernumber,enginesize,boreratio,stroke,...,enginetype_ohcf,enginetype_ohcv,enginetype_rotor,fuelsystem_2bbl,fuelsystem_4bbl,fuelsystem_idi,fuelsystem_mfi,fuelsystem_mpfi,fuelsystem_spdi,fuelsystem_spfi
15,0,0.789639,189.0,66.9,55.7,1.298465,6,1.976184,3.62,3.39,...,False,False,False,False,False,False,False,True,False,False
9,0,0.123757,178.2,67.9,52.0,0.957693,5,0.098522,3.13,3.4,...,False,False,False,False,False,False,False,True,False,False
100,0,-0.259126,173.4,65.2,54.7,-0.488182,4,-0.166277,3.33,3.47,...,False,False,False,True,False,False,False,False,False,False
132,3,0.057168,186.6,66.5,56.1,0.197213,4,-0.142204,3.54,3.07,...,False,False,False,False,False,False,False,True,False,False
68,-1,1.871697,190.9,70.3,58.7,2.299604,5,1.350297,3.58,3.64,...,False,False,False,False,False,True,False,False,False,False
95,1,-0.708596,165.6,63.8,53.3,-1.015705,4,-0.719947,3.15,3.29,...,False,False,False,True,False,False,False,False,False,False
159,0,-0.508831,166.3,64.4,52.8,-0.540164,4,-0.407003,3.27,3.35,...,False,False,False,False,False,True,False,False,False,False
162,0,-0.508831,166.3,64.4,52.8,-0.800075,4,-0.695874,3.19,3.03,...,False,False,False,True,False,False,False,False,False,False
147,0,-0.29242,173.5,65.4,53.0,-0.193616,4,-0.455148,3.62,2.64,...,True,False,False,False,False,False,False,True,False,False
182,2,-0.242478,171.7,65.5,55.7,-0.567118,4,-0.719947,3.01,3.4,...,False,False,False,False,False,True,False,False,False,False


In [243]:
y_train

66     18344.0
111    15580.0
153     6918.0
96      7499.0
38      9095.0
        ...   
106    18399.0
14     24565.0
92      6849.0
179    15998.0
102    14399.0
Name: price, Length: 164, dtype: float64

In [245]:
y_test

15     30760.000
9      17859.167
100     9549.000
132    11850.000
68     28248.000
95      7799.000
159     7788.000
162     9258.000
147    10198.000
182     7775.000
191    13295.000
164     8238.000
65     18280.000
175     9988.000
73     40960.000
152     6488.000
18      5151.000
82     12629.000
86      8189.000
143     9960.000
60      8495.000
101    13499.000
98      8249.000
30      6479.000
25      6692.000
16     41315.000
168     9639.000
195    13415.000
97      7999.000
194    12940.000
67     25552.000
120     6229.000
154     7898.000
202    21485.000
79      7689.000
69     28176.000
145    11259.000
55     10945.000
45      8916.500
84     14489.000
146     7463.000
Name: price, dtype: float64

In [251]:
print('X_train')
print(X_train.shape)
print('X_test')
print(X_test.shape) 

X_train
(164, 38)
X_test
(41, 38)


### 2. Model Implementation

### 1) Linear Regression

In [254]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [260]:
# Initialize the model
linear_reg = LinearRegression()

In [262]:
# Fit the model
linear_reg.fit(X_train, y_train)

In [264]:
# Predict on the test set
y_pred_linear = linear_reg.predict(X_test)

In [266]:
# Evaluate the model
print("Linear Regression Evaluation:")
print("R-squared:", r2_score(y_test, y_pred_linear))
print("Mean Squared Error:", mean_squared_error(y_test, y_pred_linear))
print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred_linear))

Linear Regression Evaluation:
R-squared: 0.826753411274379
Mean Squared Error: 13676782.317684893
Mean Absolute Error: 2342.464448559821


### 2) Decision Tree Regressor

In [269]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [271]:
# Initialize the model
dt_regressor = DecisionTreeRegressor(random_state=42)

In [273]:
# Fit the model
dt_regressor.fit(X_train, y_train)

In [275]:
# Predict on the test set
y_pred_dt = dt_regressor.predict(X_test)

In [277]:
# Evaluate the model
print("Decision Tree Regressor Evaluation:")
print("R-squared:", r2_score(y_test, y_pred_dt))
print("Mean Squared Error:", mean_squared_error(y_test, y_pred_dt))
print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred_dt))

Decision Tree Regressor Evaluation:
R-squared: 0.887947084463948
Mean Squared Error: 8845907.70370461
Mean Absolute Error: 1968.138219512195


### 3) Random Forest Regressor

In [287]:
# Importing the necessary libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [289]:
# Initialize the model
rf_regressor = RandomForestRegressor(random_state=42)

In [291]:
# Fit the model
rf_regressor.fit(X_train, y_train)

In [293]:
# Predict on the test set
y_pred_rf = rf_regressor.predict(X_test)

In [295]:
# Evaluate the model
print("Random Forest Regressor Evaluation:")
print("R-squared:", r2_score(y_test, y_pred_rf))
print("Mean Squared Error:", mean_squared_error(y_test, y_pred_rf))
print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred_rf))

Random Forest Regressor Evaluation:
R-squared: 0.9581631106690152
Mean Squared Error: 3302772.2648851937
Mean Absolute Error: 1260.8193008130083


### 4) Gradient Boosting Regressor

In [298]:
# Importing the necessary libraries
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [300]:
# Initialize the model
gb_regressor = GradientBoostingRegressor(random_state=42)

In [302]:
# Fit the model
gb_regressor.fit(X_train, y_train)

In [304]:
# Predict on the test set
y_pred_gb = gb_regressor.predict(X_test)

In [306]:
# Evaluate the model
print("Gradient Boosting Regressor Evaluation:")
print("R-squared:", r2_score(y_test, y_pred_gb))
print("Mean Squared Error:", mean_squared_error(y_test, y_pred_gb))
print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred_gb))

Gradient Boosting Regressor Evaluation:
R-squared: 0.9273815847432478
Mean Squared Error: 5732789.69027669
Mean Absolute Error: 1731.3006558126804


### 5) Support Vector Regressor (SVR)

In [309]:
# Importing the necessary libraries
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [311]:
# Initialize the model
svr = SVR()

In [313]:
# Fit the model
svr.fit(X_train, y_train)

In [315]:
# Predict on the test set
y_pred_svr = svr.predict(X_test)

In [317]:
# Evaluate the model
print("Support Vector Regressor Evaluation:")
print("R-squared:", r2_score(y_test, y_pred_svr))
print("Mean Squared Error:", mean_squared_error(y_test, y_pred_svr))
print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred_svr))

Support Vector Regressor Evaluation:
R-squared: -0.10242494926033285
Mean Squared Error: 87029858.21266328
Mean Absolute Error: 5708.759653481151


## 3. Model Evaluation.


#### Compare the performance of all the models based on R-squared, Mean Squared Error (MSE), and Mean Absolute Error (MAE).
Identify the best performing model and justify why it is the best.

In [323]:
# Dictionary to store the evaluation results
evaluation_results = {}

# 1) Linear Regression
evaluation_results["Linear Regression"] = {
    "R-squared": r2_score(y_test, y_pred_linear),
    "MSE": mean_squared_error(y_test, y_pred_linear),
    "MAE": mean_absolute_error(y_test, y_pred_linear)
}

# 2) Decision Tree Regressor
evaluation_results["Decision Tree"] = {
    "R-squared": r2_score(y_test, y_pred_dt),
    "MSE": mean_squared_error(y_test, y_pred_dt),
    "MAE": mean_absolute_error(y_test, y_pred_dt)
}

# 3) Random Forest Regressor
evaluation_results["Random Forest"] = {
    "R-squared": r2_score(y_test, y_pred_rf),
    "MSE": mean_squared_error(y_test, y_pred_rf),
    "MAE": mean_absolute_error(y_test, y_pred_rf)
}

# 4) Gradient Boosting Regressor
evaluation_results["Gradient Boosting"] = {
    "R-squared": r2_score(y_test, y_pred_gb),
    "MSE": mean_squared_error(y_test, y_pred_gb),
    "MAE": mean_absolute_error(y_test, y_pred_gb)
}

# 5) Support Vector Regressor
evaluation_results["Support Vector"] = {
    "R-squared": r2_score(y_test, y_pred_svr),
    "MSE": mean_squared_error(y_test, y_pred_svr),
    "MAE": mean_absolute_error(y_test, y_pred_svr)
}

In [325]:
# Displaying the results
evaluation_df = pd.DataFrame(evaluation_results).T
evaluation_df

Unnamed: 0,R-squared,MSE,MAE
Linear Regression,0.826753,13676780.0,2342.464449
Decision Tree,0.887947,8845908.0,1968.13822
Random Forest,0.958163,3302772.0,1260.819301
Gradient Boosting,0.927382,5732790.0,1731.300656
Support Vector,-0.102425,87029860.0,5708.759653
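
The comparison above rests on a single 41-row test split, so the ranking could shift with a different `random_state`. One way to check its stability is k-fold cross-validation. The sketch below is an illustration on synthetic data from `make_regression`, not on the car dataset; `X_cv`, `y_cv` and `cv_summary` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for X and y (an assumption for this demo).
X_cv, y_cv = make_regression(n_samples=205, n_features=10, noise=15.0, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Use the same shuffled 5-fold split for every model so the scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_summary = {}
for name, model in models.items():
    scores = cross_val_score(model, X_cv, y_cv, cv=cv, scoring="r2")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: R-squared = {scores.mean():.3f} +/- {scores.std():.3f}")
```

On this synthetic, purely linear data the linear model is expected to win; on the car data the ranking may differ, which is the point of running the check.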


## Model Evaluation:

### 1. R-squared (R²):
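#### 1) Comparing R-squared values, Random Forest fits best: it explains 95.8% of the variance in price on the test set.
#### 2) Gradient Boosting (0.927), Decision Tree (0.888) and Linear Regression (0.827) also perform well. Support Vector performs worst: its negative R-squared (-0.102) means it does worse than always predicting the mean price, which most likely points to underfitting with untuned default settings rather than overfitting.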

### 2. Mean Squared Error (MSE):
##### 1) Random Forest has the lowest MSE (3.30e+06), indicating that it makes the smallest squared errors in its predictions.
##### 2) Gradient Boosting and Decision Tree have the next-lowest MSEs, 5.73e+06 and 8.85e+06 respectively, which is still good but not as low as Random Forest.
##### 3) Linear Regression has an MSE of 1.37e+07, which is higher than the tree-based models, indicating larger errors.
##### 4) Support Vector has the highest MSE (8.70e+07), which suggests that it performs poorly and makes large errors in its predictions.


### 3. Mean Absolute Error (MAE):
#### 1) Random Forest again has the lowest MAE (1260.82), meaning it has the smallest average error between predicted and actual prices.
#### 2) Linear Regression has a higher MAE of 2342.46, which means it has a larger average error than the tree-based models.
#### 3) Support Vector has the highest MAE of 5708.76, meaning its predictions are off by the largest amount on average.
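
The three metrics discussed above can be computed by hand, which also shows why R-squared can go negative. In this sketch `y_true` holds the first four test prices shown earlier, while `y_hat` is a set of hypothetical predictions invented for the illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([30760.0, 17859.167, 9549.0, 11850.0])  # first four y_test values
y_hat = np.array([29000.0, 18500.0, 9000.0, 13000.0])     # hypothetical predictions

residuals = y_true - y_hat
mse_manual = np.mean(residuals ** 2)            # average squared error
mae_manual = np.mean(np.abs(residuals))         # average absolute error
ss_res = np.sum(residuals ** 2)                 # error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(mae_manual, mean_absolute_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))

# Predicting the mean for every row gives R-squared = 0; anything worse is negative.
mean_baseline = np.full_like(y_true, y_true.mean())
print(r2_score(y_true, mean_baseline))
```

Because R-squared compares the model against the mean-only baseline, a negative value means the model's squared error is larger than that baseline's.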

## Best Performing Model:
#### Based on the evaluation metrics, Random Forest Regressor clearly outperforms the others across all three metrics:
##### It has the highest R-squared (0.958), which indicates that it explains the most variance in the target variable (price).
##### It has the lowest MSE (3.30e+06), meaning it makes the smallest squared errors in predictions.
##### It has the lowest MAE (1260.82), indicating the smallest average error.

### Conclusion:
##### Best Performing Model: Random Forest Regressor
##### Random Forest performs the best on all the evaluation metrics (R-squared, MSE, and MAE). It explains the most variance in the target variable, makes the smallest squared and absolute errors, and appears more robust than the other models. Its performance can be attributed to the fact that it is an ensemble method that combines the predictions of many decision trees, which reduces overfitting and captures complex patterns in the data better than a single model (such as Decision Tree or Linear Regression).
##### Worst Performing Model: Support Vector Regressor
##### The Support Vector Regressor has a negative R-squared value and the highest MSE and MAE, indicating poor performance. This could be due to untuned default parameters or to the data, as prepared here, not being well suited to this model. It might require further preprocessing or hyperparameter tuning to improve its performance.
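
One preprocessing step that often helps SVR is scaling both the features and the target, since the default `C` and `epsilon` are sized for values near 1 rather than prices in the thousands. The sketch below illustrates this on synthetic, price-like data; the data, the coefficients and the names `X_svr`, `y_svr`, `raw_svr` and `scaled_svr` are assumptions for the demo, and the result on the car dataset may differ.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic price-like data (hypothetical coefficients, not fitted to the real dataset).
rng = np.random.default_rng(0)
n = 300
engine = rng.normal(130, 40, n)
weight = rng.normal(2500, 500, n)
X_svr = np.column_stack([engine, weight])
y_svr = 60 * engine + 5 * weight - 7000 + rng.normal(0, 800, n)

X_tr, X_te, y_tr, y_te = train_test_split(X_svr, y_svr, test_size=0.2, random_state=42)

# Default SVR on raw features and raw prices.
raw_svr = SVR().fit(X_tr, y_tr)
r2_raw = r2_score(y_te, raw_svr.predict(X_te))

# Same SVR, but features are standardized in a pipeline and the target is
# standardized for fitting, then transformed back for prediction.
scaled_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR()),
    transformer=StandardScaler(),
)
scaled_svr.fit(X_tr, y_tr)
r2_scaled = r2_score(y_te, scaled_svr.predict(X_te))

print("R-squared, raw SVR:   ", round(r2_raw, 3))
print("R-squared, scaled SVR:", round(r2_scaled, 3))
```

Wrapping the scaler in a pipeline also means it is fitted on the training rows only. Tuning `C`, `epsilon` and `gamma` would be a natural next step.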

## 4. Feature Importance Analysis

In [343]:
from sklearn.ensemble import RandomForestRegressor

In [345]:
# Initialize the model
rf_model = RandomForestRegressor(random_state=42)

In [347]:
# Fit the model to the training data
rf_model.fit(X_train, y_train)

In [349]:
# Get feature importance values
feature_importances = rf_model.feature_importances_

In [351]:
# Creating a DataFrame with feature names and their importance values
importance_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': feature_importances
})

In [353]:
# Sort the features by importance (descending order)
importance_df = importance_df.sort_values(by='Importance', ascending=False)

In [355]:
# Display the top 10 most important features
print(importance_df.head(10))

       Feature  Importance
7   enginesize    0.554492
5   curbweight    0.297062
14  highwaympg    0.046116
11  horsepower    0.027757
3     carwidth    0.013212
2    carlength    0.009072
1    wheelbase    0.007292
12     peakrpm    0.006834
13     citympg    0.006602
8    boreratio    0.005230


#### The table above lists the top 10 features by importance for price prediction, in descending order. enginesize and curbweight are by far the two most important, together accounting for about 85% of the total importance (0.554 + 0.297). This is consistent with larger, heavier cars with bigger engines generally costing more, although these importances show how much a feature is used by the model, not the direction of its effect. The remaining features each contribute less than 5%. The Random Forest Regressor was used for this analysis because it was the best performing model.
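
The `feature_importances_` values above are impurity-based and computed on the training data. A possible cross-check is permutation importance on held-out rows, which measures how much the score drops when one feature is shuffled. The sketch below uses synthetic data; `X_imp`, `y_imp`, `noise_feature` and the coefficients are hypothetical and chosen only to make the ranking easy to read.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: price depends strongly on enginesize, less on curbweight,
# and not at all on noise_feature (all values are made up for the demo).
rng = np.random.default_rng(1)
n = 300
X_imp = pd.DataFrame({
    "enginesize": rng.normal(130, 40, n),
    "curbweight": rng.normal(2500, 500, n),
    "noise_feature": rng.normal(0, 1, n),
})
y_imp = 80 * X_imp["enginesize"] + 3 * X_imp["curbweight"] + rng.normal(0, 500, n)

X_tr, X_te, y_tr, y_te = train_test_split(X_imp, y_imp, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on the test rows and record the mean drop in R-squared.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=42)
perm_df = pd.DataFrame({
    "Feature": X_te.columns,
    "Importance": result.importances_mean,
}).sort_values(by="Importance", ascending=False)
print(perm_df)
```

If the two rankings agree on the car data, that adds confidence that enginesize and curbweight drive the predictions.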

## 5. Hyperparameter Tuning

In [401]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

In [410]:
# Define the model
rf = RandomForestRegressor(random_state=42)

In [411]:
# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['auto', 'sqrt', 'log2']
}

In [412]:
# Perform GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

540 fits failed out of a total of 1620.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
255 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\ktomj\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\ktomj\anaconda3\Lib\site-packages\sklearn\base.py", line 1467, in wrapper
    estimator._validate_params()
  File "C:\Users\ktomj\anaconda3\Lib\site-packages\sklearn\base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\ktomj\anaconda3\Lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_constraints
    raise InvalidParam

In [420]:
# Get the best parameters
best_params = grid_search.best_params_
print("Best parameters found by GridSearchCV:", best_params)

Best parameters found by GridSearchCV: {'max_depth': 30, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 300}


In [414]:
# Get the best model
best_rf = grid_search.best_estimator_

In [419]:
# Evaluate the best model on the test set
y_pred_best_rf = best_rf.predict(X_test)

In [417]:
# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred_best_rf)
print("Mean Squared Error (MSE):", mse)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred_best_rf)
print("Mean Absolute Error (MAE):", mae)

# Calculate R-squared (R²)
r2 = r2_score(y_test, y_pred_best_rf)
print("R-squared (R²):", r2)

Mean Squared Error (MSE): 6168288.738448641
Mean Absolute Error (MAE): 1428.4213062330623
R-squared (R²): 0.9218650295523763


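##### Tuning actually reduced test performance compared with the default Random Forest: R-squared fell from 0.958 to 0.922, MSE rose from 3.30e+06 to 6.17e+06, and MAE rose from 1260.82 to 1428.42. One likely reason is that 540 of the 1620 fits failed, which matches the one-third of the grid that uses 'auto' for max_features. The installed scikit-learn version appears to reject that value, so the search never scored the all-features setting that the default model uses.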
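
A possible fix for the failed fits is to replace `'auto'` with `1.0`, which tells the forest to consider all features at each split, and to set `error_score='raise'` so that an invalid parameter stops the search instead of silently scoring NaN. The sketch below runs a deliberately small grid on synthetic data to keep it quick; `X_gs`, `y_gs` and `param_grid_fixed` are hypothetical names, and the best parameters it prints say nothing about the car dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train and y_train (an assumption for this demo).
X_gs, y_gs = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=42)

param_grid_fixed = {
    "n_estimators": [50, 100],
    "max_depth": [10, None],
    "max_features": [1.0, "sqrt", "log2"],  # 1.0 = all features, used here in place of 'auto'
}

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=param_grid_fixed,
    cv=3,
    scoring="neg_mean_squared_error",
    error_score="raise",  # fail loudly on invalid parameters
)
search.fit(X_gs, y_gs)

print("Best parameters:", search.best_params_)
print("Best CV MSE:", -search.best_score_)
print("Any NaN scores:", np.isnan(search.cv_results_["mean_test_score"]).any())
```

With the all-features setting back in the grid, the search can at least recover the default configuration, so the tuned model should no longer be forced below the untuned one in cross-validation.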