## Project:

Business Objective:

The fundamental goal here is to model the CO2 emissions as a function of several car engine features.

Data Set Details: 

The file contains the data for this example. It has 12 variables (columns) and 7385 instances (rows). The 12 variables are:

- make, the car brand under study.
- model, the specific model of the car.
- vehicle_class, the body type of the car.
- engine_size, the size of the car engine, in liters.
- cylinders, the number of cylinders.
- transmission, "A" for 'Automatic', "AM" for 'Automated manual', "AS" for 'Automatic with select shift', "AV" for 'Continuously variable', "M" for 'Manual'.
- fuel_type, "X" for 'Regular gasoline', "Z" for 'Premium gasoline', "D" for 'Diesel', "E" for 'Ethanol (E85)', "N" for 'Natural gas'.
- fuel_consumption_city, the city fuel consumption rating, in liters per 100 kilometers.
- fuel_consumption_hwy, the highway fuel consumption rating, in liters per 100 kilometers.
- fuel_consumption_comb(l/100km), the combined fuel consumption rating (55% city, 45% highway), in L/100 km.
- fuel_consumption_comb(mpg), the combined fuel consumption rating (55% city, 45% highway), in miles per gallon (mpg).
- co2_emissions, the tailpipe emissions of carbon dioxide for combined city and highway driving, in grams per kilometer.
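
The two combined ratings can be derived from the city and highway ratings. The following is a minimal sketch of that arithmetic. The function names are hypothetical, and it assumes the mpg column uses imperial gallons; if the data uses US gallons, the US litre value should be passed instead.

```python
# Unit definitions used for the conversion.
KM_PER_MILE = 1.609344
LITRES_PER_IMPERIAL_GALLON = 4.54609
LITRES_PER_US_GALLON = 3.785411784


def combined_l_per_100km(city, hwy):
    """Combined rating as a 55% city / 45% highway weighted average."""
    return 0.55 * city + 0.45 * hwy


def l_per_100km_to_mpg(l_per_100km, litres_per_gallon=LITRES_PER_IMPERIAL_GALLON):
    """Convert L/100 km to miles per gallon (imperial gallons assumed by default)."""
    return 100.0 * litres_per_gallon / (KM_PER_MILE * l_per_100km)


# Example: 9.9 L/100 km in the city and 6.7 L/100 km on the highway.
combined = combined_l_per_100km(9.9, 6.7)
mpg_imperial = l_per_100km_to_mpg(combined)
mpg_us = l_per_100km_to_mpg(combined, LITRES_PER_US_GALLON)
print(f"combined = {combined:.2f} L/100 km, {mpg_imperial:.1f} mpg (imperial), {mpg_us:.1f} mpg (US)")
```

Because the mpg column is a deterministic transform of the L/100 km column, the two carry the same information, which is worth keeping in mind when multicollinearity is checked later.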

Acceptance Criterion: The end results must be deployed using Flask, Streamlit, or a similar framework.

In [4]:
#import libraries
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm
from scipy.stats import chi2_contingency
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score, KFold
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

warnings.filterwarnings("ignore")



# **EDA (Exploratory Data Analysis)**

**Understand the dataset by summarizing, visualizing, and identifying patterns, correlations, and anomalies.**

## 1. Load Data

In [None]:
#read the dataset
data = pd.read_csv('co2_emissions (1).csv')

In [None]:
## Basic Information & Summary

In [None]:
#Display first 5 records
data.head()

In [None]:
#display 5 random samples
data.sample(5)

##  2. Understand the Data

In [None]:
#size of rows and columns
data.shape

In [None]:
# Get information about columns types and missing values
data.info()


In [None]:
# Get basic statistical details of numerical columns
data.describe().T

In [None]:
# Basic statistics summary of Object features
data.describe(include= 'object').T

In [None]:
##Get features names
data.columns

In [None]:
# checking duplicate values
data.duplicated().sum()

In [None]:
#check for missing data
data.isnull().sum()

In [None]:
# percentage of missing values
data.isna().sum()/len(data)*100 

## 3.Univariate, Bivariate & Multivariate Analysis 

**Histogram & Barplot**

In [None]:
#Check Distribution of Numerical Data
# Select numerical columns
num_col = data.select_dtypes(include=['float64', 'int64']).columns

# Plot histograms with KDE for each numerical column
for col in num_col:
    plt.figure(figsize=(6, 4))
    sns.histplot(data[col], bins=30, kde=True, color="skyblue")
    plt.title(f'Histplot of {col}')
    plt.show()

**Skewness**

In [None]:
# Check skewness of numerical features

# Select only numerical columns
num_col = data.select_dtypes(include=['float64', 'int64']).columns

# Calculate skewness
skewness_values = data[num_col].skew()

# Display skewness values
print("Skewness of numerical columns:\n", skewness_values)
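
As a hedged illustration of what the skewness value measures, the sketch below recomputes it by hand on a toy sample and compares it with `Series.skew()`. It assumes pandas uses the bias-adjusted Fisher-Pearson estimator; the helper name `adjusted_skewness` and the sample values are made up for this example.

```python
import numpy as np
import pandas as pd


def adjusted_skewness(values):
    """Bias-adjusted Fisher-Pearson skewness: sqrt(n(n-1)) / (n-2) * m3 / m2**1.5."""
    x = np.asarray(values, dtype=float)
    n = x.size
    d = x - x.mean()
    m2 = np.mean(d ** 2)  # second central moment (biased)
    m3 = np.mean(d ** 3)  # third central moment (biased)
    g1 = m3 / m2 ** 1.5
    return np.sqrt(n * (n - 1)) / (n - 2) * g1


# A toy sample with one large value, i.e. a long right tail.
sample = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 10.0])
manual = adjusted_skewness(sample)
from_pandas = sample.skew()
print(f"manual = {manual:.4f}, pandas = {from_pandas:.4f}")
```

A positive value indicates a longer right tail and a negative value a longer left tail; values near zero suggest a roughly symmetric distribution.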

In [None]:
#Identify skewness
# Select numerical columns
num_col = data.select_dtypes(include=['float64', 'int64']).columns

# Plot histograms with KDE for each numerical column
for col in num_col:
    plt.figure(figsize=(6, 4))
    sns.histplot(data[col], bins=30, kde=True, color="skyblue")
    plt.title(f'Distribution & Skewness of {col}')
    plt.axvline(data[col].mean(), color='red', linestyle='dashed', linewidth=2, label="Mean")
    plt.legend()
    plt.show()

In [None]:
# Categorical features

In [None]:
#Brands of cars
print("We have total",len(data['make'].unique()),"Car Companies Data")
df_brand = data['make'].value_counts().reset_index().rename(columns={'count':'Count'})
df_brand.head(20)

In [None]:
plt.figure(figsize=(20,6))
fig1 = sns.barplot(data = df_brand, x = "make",  y= "Count")
plt.xticks(rotation = 75)
plt.title("All Car Companies and their Cars")
plt.xlabel("Companies")
plt.ylabel("Cars")
plt.bar_label(fig1.containers[0])
plt.show()

In [None]:
#Models of cars 
print("We have total",len(data['model'].unique()),"Car Models")
df_model = data['model'].value_counts().reset_index().rename(columns={'count':'Count'})[:25]
df_model.head(20)

In [None]:
plt.figure(figsize=(20,6))
fig2 = sns.barplot(data = df_model, x = "model",  y= "Count")
plt.xticks(rotation = 75)
plt.title("Top 25 Car Models")
plt.xlabel("models")
plt.ylabel("Cars")
plt.bar_label(fig2.containers[0])
plt.show()

In [None]:
#Vehicle Class
print("We have total",len(data['vehicle_class'].unique()),"vehicle_class")
df_vehicle_class = data['vehicle_class'].value_counts().reset_index().rename(columns={'count':'Count'})
df_vehicle_class

In [None]:
plt.figure(figsize=(20,5))
fig3 = sns.barplot(data = df_vehicle_class, x = "vehicle_class",  y= "Count")
plt.xticks(rotation = 75)
plt.title("All Vehicle Class")
plt.xlabel("vehicle_class")
plt.ylabel("Cars")
plt.bar_label(fig3.containers[0])
plt.show()

In [None]:
#Engine Sizes of cars
print("We have total",len(data['engine_size'].unique()),"Types of Engine Size")
df_engine_size = data['engine_size'].value_counts().reset_index().rename(columns={'count':'Count'})
df_engine_size.head(10)

In [None]:
plt.figure(figsize=(20,6))
fig4 = sns.barplot(data = df_engine_size, x = "engine_size",  y= "Count")
plt.xticks(rotation = 90)
plt.title("All Engine Sizes")
plt.xlabel("engine_size")
plt.ylabel("Cars")
plt.bar_label(fig4.containers[0])
plt.show()

In [None]:
#Cylinders
print("We have total",len(data['cylinders'].unique()),"Types of Cylinders")
df_cylinders = data['cylinders'].value_counts().reset_index().rename(columns={'count':'Count'})
df_cylinders.head(10)

In [None]:
plt.figure(figsize=(20,6))
fig5 = sns.barplot(data = df_cylinders, x = "cylinders",  y= "Count")
plt.xticks(rotation = 90)
plt.title("All Cylinders")
plt.xlabel("cylinders")
plt.ylabel("Cars")
plt.bar_label(fig5.containers[0])
plt.show()

In [None]:
## Transmission of Cars 
data['transmission'].unique()

In [None]:
#Here we have to map similar labels into a single label for our Transmission column.
data["transmission"] = np.where(data["transmission"].isin(["A"]), "Automatic", data["transmission"])
data["transmission"] = np.where(data["transmission"].isin(["AM"]), "Automated Manual", data["transmission"])
data["transmission"] = np.where(data["transmission"].isin(["AS"]), "Automatic with Select Shift", data["transmission"])
data["transmission"] = np.where(data["transmission"].isin(["AV"]), "Continuously Variable", data["transmission"])
data["transmission"] = np.where(data["transmission"].isin(["M"]), "Manual", data["transmission"])
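
The same relabelling can be expressed with a single lookup table instead of one `np.where` call per code. This is a minimal sketch on a toy Series rather than the real column; the dictionary name and the extra code `"A9"` are illustrative assumptions.

```python
import pandas as pd

# Code-to-label lookup, taken from the variable description of this dataset.
TRANSMISSION_LABELS = {
    "A": "Automatic",
    "AM": "Automated Manual",
    "AS": "Automatic with Select Shift",
    "AV": "Continuously Variable",
    "M": "Manual",
}

# Toy data; "A9" is a made-up code that is not in the lookup table.
codes = pd.Series(["A", "AM", "AS", "AV", "M", "A9"])

# Series.replace with a dict swaps exact matches and leaves other values untouched.
labels = codes.replace(TRANSMISSION_LABELS)
print(labels.tolist())
```

`Series.map` with the same dictionary would turn unmapped codes into `NaN` instead, which can be useful when unexpected codes should be surfaced rather than kept.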

In [None]:
print("We have total",len(data['transmission'].unique()),"transmissions")
df_transmission = data['transmission'].value_counts().reset_index().rename(columns={'count':'Count'})
df_transmission

In [None]:
plt.figure(figsize=(20,5))
fig6 = sns.barplot(data = df_transmission, x = "transmission",  y= "Count")
plt.title("All Transmissions")
plt.xlabel("transmissions")
plt.ylabel("Cars")
plt.bar_label(fig6.containers[0])
plt.show()

In [None]:
#Fuel Type of Cars
data['fuel_type'].unique()

In [None]:
#Here we have to map similar labels into a single label for our Fuel Type column
data["fuel_type"] = np.where(data["fuel_type"]=="Z", "Premium Gasoline", data["fuel_type"])
data["fuel_type"] = np.where(data["fuel_type"]=="X", "Regular Gasoline", data["fuel_type"])
data["fuel_type"] = np.where(data["fuel_type"]=="D", "Diesel", data["fuel_type"])
data["fuel_type"] = np.where(data["fuel_type"]=="E", "Ethanol(E85)", data["fuel_type"])
data["fuel_type"] = np.where(data["fuel_type"]=="N", "Natural Gas", data["fuel_type"])

In [None]:
print("We have total",len(data['fuel_type'].unique()),"fuel_type")
df_fuel_type = data['fuel_type'].value_counts().reset_index().rename(columns={'count':'Count'})
df_fuel_type

In [None]:
plt.figure(figsize=(20,5))
fig7 = sns.barplot(data = df_fuel_type, x = "fuel_type",  y= "Count")
plt.title("All Fuel Types")
plt.xlabel("fuel_type")
plt.ylabel("Cars")
plt.bar_label(fig7.containers[0])
plt.show()

### Variation in CO2 emissions with different features

### Co2 Emission with Brand

In [None]:
df_co2_make = data.groupby(['make'])['co2_emissions'].mean().sort_values().reset_index()

In [None]:
plt.figure(figsize=(20,5))
fig8 = sns.barplot(data = df_co2_make, x = "make",  y= "co2_emissions")
plt.xticks(rotation = 90)
plt.title("CO2 Emissions variation with Brand")
plt.xlabel("Brands")
plt.ylabel("co2_emissions")
plt.bar_label(fig8.containers[0], fontsize=8, fmt='%.1f')
plt.show()

In [None]:
plt.figure(figsize=(20,7))
order = data.groupby("make")["co2_emissions"].median().sort_values(ascending=True).index
sns.boxplot(x="make", y="co2_emissions", data=data, order=order, width=0.5)
plt.title("Distribution of CO2 Emissions in relation to Make", fontsize=15)
plt.xticks(rotation=90, horizontalalignment='center')
plt.xlabel("make", fontsize=12)
plt.ylabel("co2_emissions", fontsize=12)
plt.axhline(data["co2_emissions"].median(),color='r',linestyle='dashed',linewidth=1)
plt.tight_layout()
plt.show()

### CO2 Emissions variation with Vehicle Class

In [None]:
 
df_co2_vehicle_class = data.groupby(['vehicle_class'])['co2_emissions'].mean().sort_values().reset_index()

In [None]:
plt.figure(figsize=(23,5))
fig9 = sns.barplot(data = df_co2_vehicle_class, x = "vehicle_class",  y= "co2_emissions")
plt.xticks(rotation = 90)
plt.title("CO2 Emissions variation with Vehicle Class")
plt.xlabel("vehicle_class")
plt.ylabel("co2_emissions")
plt.bar_label(fig9.containers[0], fontsize=9)
plt.show()

In [None]:
plt.figure(figsize=(20,7))
order = data.groupby("vehicle_class")["co2_emissions"].median().sort_values(ascending=True).index
sns.boxplot(x="vehicle_class", y="co2_emissions", data=data, order=order, width=0.5)
plt.title("Distribution of CO2 Emissions in relation to Vehicle Class", fontsize=15)
plt.xticks(rotation=90, horizontalalignment='center')
plt.xlabel("vehicle_class", fontsize=12)
plt.ylabel("co2_emissions", fontsize=12)
plt.axhline(data["co2_emissions"].median(),color='r',linestyle='dashed',linewidth=1)
plt.tight_layout()
plt.show()

### CO2 Emissions variation with Transmission

In [None]:

df_co2_transmission = data.groupby(['transmission'])['co2_emissions'].mean().sort_values().reset_index()

In [None]:
fig10 = sns.barplot(data = df_co2_transmission, x = "transmission",  y= "co2_emissions")
plt.xticks(rotation = 90)
plt.title("CO2 Emissions variation with Transmission")
plt.xlabel("transmission")
plt.ylabel("co2_emissions")
plt.bar_label(fig10.containers[0], fontsize=10)
plt.show()    

In [None]:
plt.figure(figsize=(20,7))
order = data.groupby("transmission")["co2_emissions"].median().sort_values(ascending=True).index
sns.boxplot(x="transmission", y="co2_emissions", data=data, order=order, width=0.5)
plt.title("Distribution of CO2 Emissions in relation to Transmission", fontsize=15)
plt.xticks(rotation=90, horizontalalignment='center')
plt.xlabel("transmission", fontsize=12)
plt.ylabel("co2_emissions", fontsize=12)
plt.axhline(data["co2_emissions"].median(),color='r',linestyle='dashed',linewidth=1)
plt.tight_layout()
plt.show()

### CO2 Emissions variation with Fuel Type

In [None]:
df_co2_fuel_type = data.groupby(['fuel_type'])['co2_emissions'].mean().sort_values().reset_index()

In [None]:
plt.figure(figsize=(23,5))
fig11 = sns.barplot(data = df_co2_fuel_type, x = "fuel_type",  y= "co2_emissions")
plt.xticks(rotation = 90)
plt.title("CO2 Emissions variation with Fuel Type")
plt.xlabel("fuel_type")
plt.ylabel("co2_emissions")
plt.bar_label(fig11.containers[0], fontsize=10)
plt.show()

In [None]:
plt.figure(figsize=(20,7))
order = data.groupby("fuel_type")["co2_emissions"].median().sort_values(ascending=True).index
sns.boxplot(x="fuel_type", y="co2_emissions", data=data, order=order, width=0.5)
plt.title("Distribution of CO2 Emissions in relation to Fuel Type", fontsize=15)
plt.xticks(rotation=90, horizontalalignment='center')
plt.xlabel("fuel_type", fontsize=12)
plt.ylabel("co2_emissions", fontsize=12)
plt.axhline(data["co2_emissions"].median(),color='r',linestyle='dashed',linewidth=1)
plt.tight_layout()
plt.show()

### Correlation between numerical features

In [None]:
#Heatmap of Correlations
num_cols = data.select_dtypes(include=['float64', 'int64']).columns
correlation_matrix = data[num_cols].corr(numeric_only=True)

plt.figure(figsize=(8,6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix Heatmap')
plt.show()
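
A heatmap is easiest to read alongside an explicit list of the strongly correlated pairs. The sketch below shows one way to extract them from a correlation matrix. The helper name, the 0.9 threshold, and the toy columns are assumptions for illustration only.

```python
import pandas as pd


def highly_correlated_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, r) for each pair whose absolute Pearson r exceeds threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    # Walk the upper triangle only, so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


# Toy data: b is an exact multiple of a, c is only weakly related to both.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [1.0, -1.0, 1.0, -1.0],
})
pairs = highly_correlated_pairs(toy)
print(pairs)
```

Pairs returned by such a helper are natural candidates for the multicollinearity checks later in the notebook.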

### **Scatter Plot (Numerical vs. Numerical)**

In [None]:
#colors = ["blue" if x < 250 else "red" for x in data["co2_emissions"]]

plt.figure(figsize=(8, 5))
sns.scatterplot(x=data["engine_size"], y=data["co2_emissions"],hue=data['fuel_type'], palette='Set1',alpha=0.7,edgecolor='w')

plt.xlabel("Engine size")
plt.ylabel("CO₂ Emissions")
plt.title("Engine Size vs CO₂ Emissions by Fuel type")
plt.show()

In [None]:
plt.figure(figsize=(8, 5))
sns.scatterplot(x=data["fuel_consumption_hwy"], y=data["co2_emissions"], hue=data['fuel_type'], palette='Set1', alpha=0.7)

plt.xlabel("Fuel Consumption (Highway)")
plt.ylabel("CO₂ Emissions")
plt.title("Fuel Consumption (Highway) vs CO₂ Emissions by Fuel type")
plt.show()

In [None]:
plt.figure(figsize=(9, 5))
sns.scatterplot(x=data["fuel_consumption_city"], y=data["co2_emissions"],hue=data['vehicle_class'], palette='Set1', alpha=0.7,edgecolor='w')

plt.xlabel("Fuel Consumption (city)")
plt.ylabel("CO₂ Emissions")
plt.title("Fuel Consumption (city) vs CO₂ Emissions By Vehicle class")
plt.show()

In [None]:
plt.figure(figsize=(10, 5))

sns.scatterplot( x=data["fuel_consumption_comb(l/100km)"],y=data["co2_emissions"],hue=data["vehicle_class"],  palette="Set1",alpha=0.7,edgecolor="w")

plt.xlabel("Fuel Consumption comb (L/100km)")
plt.ylabel("CO₂ Emissions")
plt.title("Fuel Consumption vs CO₂ Emissions by Vehicle Class")

plt.show()


In [None]:
plt.figure(figsize=(8, 5))
sns.scatterplot(x=data["fuel_consumption_comb(mpg)"], y=data["co2_emissions"], hue=data["fuel_type"], palette = 'Set1', alpha=0.7,edgecolor='w')

plt.xlabel("Fuel Consumption comb(mpg)")
plt.ylabel("CO₂ Emissions")
plt.title("Fuel Consumption comb(mpg) vs CO₂ Emissions by fuel_type")
plt.show()

In [None]:
# Create pairplot with hue
sns.pairplot(data, hue="fuel_type", palette="coolwarm")

plt.show()

In [None]:
sns.pairplot(data, vars=["engine_size", "fuel_consumption_hwy", "co2_emissions"], hue="vehicle_class", palette="coolwarm")
plt.show()

In [None]:
sns.pairplot(data, hue="engine_size", palette="coolwarm", diag_kind="kde")

plt.show()


## 4. Identify Outliers

In [None]:
#check for Outliers

# List of numerical columns
num_col = data.select_dtypes(include=['float64', 'int64']).columns

# Plot boxplots for all numerical features (one figure per column)
for col in num_col:
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=data[col])
    plt.title(f'Boxplot of {col}')
    plt.show()

### ANOVA Test for Categorical Features

In [None]:
# Perform ANOVA test for each categorical feature
anova_results = {}
categorical_features = data.select_dtypes(include=['object']).columns

for feature in categorical_features:
    groups = [data["co2_emissions"][data[feature] == category].values for category in data[feature].unique()]
    anova_results[feature] = stats.f_oneway(*groups)

# Display the ANOVA results
for feature, result in anova_results.items():
    print(f"ANOVA result for {feature}:")
    print(f"F-statistic: {result.statistic}, p-value: {result.pvalue}")
    print()
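
To make the F-statistic reported by `stats.f_oneway` less opaque, here is a hand-rolled sketch of the one-way ANOVA calculation on toy groups. The function name and the group values are hypothetical; it is meant to show the arithmetic, not to replace the SciPy call.

```python
import numpy as np


def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k = len(groups)        # number of groups
    n = all_values.size    # total number of observations

    # Variation of the group means around the grand mean.
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Toy CO2 values for three hypothetical categories.
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f"F = {f_stat:.2f}")
```

A large F means the group means differ by much more than the spread inside each group would suggest, which is why a categorical feature with a large F and small p-value is treated as informative for CO2 emissions.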

In [None]:
##  Encode Categorical Variables

In [None]:
# List of categorical columns
categorical_columns = ['make','model','vehicle_class', 'transmission', 'fuel_type']

# Apply Label Encoding to each categorical column
label_encoders = {}
for col in categorical_columns:
    le = LabelEncoder()
    data[col] = le.fit_transform(data[col])
    label_encoders[col] = le

# Display the first few rows of the labeled dataframe
print(data.head())

In [None]:
df_labeled=data.copy()

In [None]:
correlation_matrix = df_labeled.corr()

plt.figure(figsize=(10,8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix Heatmap')
plt.show()

# **DATA PREPROCESSING**

### **Clean and transform data for machine learning models**

## 1. Handle Duplicate Data

In [None]:
#remove duplicate rows 
#data.drop_duplicates(inplace=True)

In [None]:
data.shape

In [None]:
data.describe().T

## 3. Handling Outliers

In [None]:
num_col = ['engine_size', 'cylinders', 'fuel_consumption_city', 
                      'fuel_consumption_hwy', 'fuel_consumption_comb(l/100km)', 
                      'fuel_consumption_comb(mpg)', 'co2_emissions']


# Function to detect outliers using IQR
def detect_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return outliers

# Apply the function to all numerical columns
for col in num_col:
    outliers = detect_outliers_iqr(data, col)
    print(f"{col} has {len(outliers)} outliers")


In [None]:
def cap_outliers(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    data[column] = np.where(data[column] < lower_bound, lower_bound, data[column])
    data[column] = np.where(data[column] > upper_bound, upper_bound, data[column])

# Apply capping to a copy so that 'data' keeps the original (uncapped) values
# for the before/after comparisons below
df = data.copy()
for col in num_col:
    cap_outliers(df, col)

In [None]:
df.describe().T
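
The IQR rule used above can be checked by hand on a small sample. This is a minimal sketch with made-up values; it uses `Series.clip` as a compact equivalent of the two `np.where` calls, and the helper name `iqr_bounds` is hypothetical.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy sample with one extreme value.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
lower, upper = iqr_bounds(values)

# Values outside the fences are pulled back to the nearest fence, not removed.
capped = values.clip(lower=lower, upper=upper)
print(lower, upper, capped.tolist())
```

Capping keeps every row, so the dataset size is unchanged; only the extreme values move to the fence.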

###  Check skewness of numerical features

In [None]:

# Select only numerical columns

print('After capping outliers')
num_col = ['engine_size', 'cylinders', 'fuel_consumption_city', 
                      'fuel_consumption_hwy', 'fuel_consumption_comb(l/100km)', 
                      'fuel_consumption_comb(mpg)', 'co2_emissions']

#num_col = df.select_dtypes(include=['float64', 'int64']).columns

# Calculate skewness
skewness_values = df[num_col].skew()

# Display skewness values
print("Skewness of numerical columns:\n", skewness_values)

In [None]:
print('After capping outliers')
# Select numerical columns
num_col = ['engine_size', 'cylinders', 'fuel_consumption_city', 
                      'fuel_consumption_hwy', 'fuel_consumption_comb(l/100km)', 
                      'fuel_consumption_comb(mpg)', 'co2_emissions']

#num_col = df.select_dtypes(include=['float64', 'int64']).columns

# Plot histograms with KDE for each numerical column
for col in num_col:
    plt.figure(figsize=(6, 4))
    sns.histplot(df[col], bins=30, kde=True, color="skyblue")
    plt.title(f'Distribution & Skewness of {col}')
    plt.axvline(df[col].mean(), color='red', linestyle='dashed', linewidth=2, label="Mean")
    plt.legend()
    plt.show()
    


In [None]:
# Selecting only numerical columns
numerical_cols = ['engine_size', 'cylinders', 'fuel_consumption_city', 
                      'fuel_consumption_hwy', 'fuel_consumption_comb(l/100km)', 
                      'fuel_consumption_comb(mpg)', 'co2_emissions']

#numerical_cols = data.select_dtypes(include=['number']).columns

# Skewness before preprocessing
print("Skewness Before Processing:")
skew_before = data[numerical_cols].skew()
print(skew_before)

# After preprocessing (outliers capped)
# The processed data is stored in 'df'
print("\nSkewness After Processing:")
skew_after = df[numerical_cols].skew()
print(skew_after)


In [None]:
df.columns


In [None]:
df[numerical_cols].corr()

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(data=df[numerical_cols], orient="h")
plt.title("Box Plot of Features After Outlier Capping")
plt.show()



## 4.Correlation Analysis

In [None]:
df.shape

In [None]:
# Select numerical columns only
num_cols = ['engine_size', 'cylinders', 'fuel_consumption_city', 'fuel_consumption_hwy', 
            'fuel_consumption_comb(l/100km)', 'fuel_consumption_comb(mpg)', 'co2_emissions']

# Compute correlation before capping outliers
corr_before = data[num_cols].corr()

# Compute correlation after capping outliers
corr_after = df[num_cols].corr()

# ------- PLOT HEATMAPS BEFORE & AFTER CAPPING OUTLIERS -------
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Heatmap before outlier capping
sns.heatmap(corr_before, annot=True, cmap="coolwarm", fmt=".2f", ax=axes[0])
axes[0].set_title("Correlation Before Capping Outliers")

# Heatmap after outlier capping
sns.heatmap(corr_after, annot=True, cmap="coolwarm", fmt=".2f", ax=axes[1])
axes[1].set_title("Correlation After Capping Outliers")

plt.tight_layout()
plt.show()


# **Data Cleaning**

**Drop the Natural Gas category from the fuel_type column since it has only one occurrence. Keeping it may not add any significant value to the model and could introduce noise.**

In [None]:
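# fuel_type was relabelled and label-encoded earlier, so look up the encoded value for Natural Gas
ng_code = label_encoders['fuel_type'].transform(['Natural Gas'])[0]
df_n = df[df['fuel_type'] == ng_code]
index=df_n.index
df_n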

In [None]:
# Drop the selected rows from df itself (not from the filtered view)
df.drop(index, axis=0, inplace=True)

In [None]:
# Verify that the selected rows are no longer in df (expects an empty frame)
df[df.index.isin(index)]

In [None]:
dums = pd.get_dummies(df['fuel_type'],prefix='fuel_typeX',drop_first=True)
dums.iloc[0:5]

In [None]:
# Convert the boolean dummy columns to 0/1 integers
dums = dums.astype(int)

In [None]:
frames = [df, dums]
result = pd.concat(frames,axis=1)
result.head()

In [None]:
result.drop(['fuel_type'],inplace=True,axis=1)
result.head()

In [None]:
df_check=df['fuel_type'].value_counts()
df_check

**Impact on Model Performance: If Natural Gas vehicles have distinct CO₂ emission patterns, removing them might slightly affect the model's accuracy**

In [None]:
df_corr=df.select_dtypes(include=['float','int']).columns

In [None]:
df[df_corr].corr().T

In [None]:
df.isnull().sum()

## 5.Variance Inflation Factor (VIF)

In [None]:
X=df.iloc[:,[2,3,5,10]]
y=df.iloc[:,11]
# Fit Linear Model
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)

# Residuals Calculation
residuals = y - y_pred

plt.figure(figsize=(8, 5))
sns.residplot(x=y_pred, y=residuals, lowess=True, line_kws={'color': 'red'})
plt.axhline(y=0, linestyle='--', color='gray')
plt.xlabel("Predicted CO₂ Emissions")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.show()

# R² Score (Check if Linear Fit is Good)
r2 = r2_score(y, y_pred)
print(f"\n✅ R² Score for Linear Fit: {r2:.4f} (Close to 1 means linear)")

**Interpretation of VIF Values**

VIF < 5 → Low multicollinearity (Good) 

VIF 5-10 → Moderate multicollinearity (Consider removing/reducing) 

VIF > 10 → High multicollinearity (Serious issue, needs fixing)
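
The thresholds above follow from the definition VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing feature j on all the other features. The sketch below computes this directly with NumPy on toy data. The function name and the data are assumptions for illustration; it fits each auxiliary regression with an intercept, which may differ from how other implementations treat the constant term.

```python
import numpy as np


def vif_values(X):
    """VIF for each column: 1 / (1 - R^2) from regressing it on the remaining columns."""
    X = np.asarray(X, dtype=float)
    result = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Auxiliary least-squares fit with an intercept column.
        design = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        centered = target - target.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        result.append(1.0 / (1.0 - r2))
    return np.array(result)


# Two orthogonal columns: no multicollinearity, so VIF is 1.
independent = np.array([[1, 1], [-1, 1], [1, -1], [-1, -1]])
vif_independent = vif_values(independent)

# Two nearly identical columns: strong multicollinearity, so VIF is large.
collinear = np.array([[1, 1.1], [2, 1.9], [3, 3.2], [4, 3.9], [5, 5.0]])
vif_collinear = vif_values(collinear)
print(vif_independent, vif_collinear)
```

For example, R_j² = 0.8 gives a VIF of 5 and R_j² = 0.9 gives a VIF of 10, which is where the two cut-offs in the list come from.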

## 6.Feature Scaling

### **Normalization (Min-Max Scaling)**

In [None]:
# Select numerical columns
features =  ['engine_size','cylinders','fuel_consumption_city','fuel_consumption_hwy','fuel_consumption_comb(l/100km)','fuel_consumption_comb(mpg)']

# Apply MinMaxScaler
scaler = MinMaxScaler()
df[features] = scaler.fit_transform(df[features])

#check the normalized values
df[features].head()
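
Min-max scaling maps each feature to x' = (x - min) / (max - min). The sketch below spells this out with NumPy on toy arrays. It also shows a common variant in which the minimum and maximum are learned from the training rows only and then reused for new rows; that split is an assumption of this example and not what the cell above does. The helper names are hypothetical.

```python
import numpy as np


def fit_min_max(train):
    """Learn the per-column minimum and maximum from the training rows."""
    train = np.asarray(train, dtype=float)
    return train.min(axis=0), train.max(axis=0)


def apply_min_max(values, col_min, col_max):
    """Scale with previously learned statistics (assumes col_max > col_min)."""
    return (np.asarray(values, dtype=float) - col_min) / (col_max - col_min)


train = np.array([[1.0, 4.0], [2.0, 8.0], [3.0, 12.0]])
new_rows = np.array([[2.5, 14.0]])

col_min, col_max = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_max)
new_scaled = apply_min_max(new_rows, col_min, col_max)
print(train_scaled)
print(new_scaled)
```

Training values land in [0, 1], while unseen values can fall outside that range (here 14.0 maps to 1.25), which is expected when the statistics come from the training data only.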

In [None]:
df.head()


In [None]:
# Fit a simple linear regression model
X = df["engine_size"]
y = df["co2_emissions"]
X = sm.add_constant(X)  # Add intercept
model = sm.OLS(y, X).fit()
residuals = model.resid  # Get residuals

# Plot residuals
plt.figure(figsize=(8, 5))
plt.scatter(df["engine_size"], residuals, alpha=0.7)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel("Engine size")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.show()


In [None]:
# Fit a simple linear regression model
X = df["cylinders"]
y = df["co2_emissions"]
X = sm.add_constant(X)  # Add intercept
model = sm.OLS(y, X).fit()
residuals = model.resid  # Get residuals

# Plot residuals
plt.figure(figsize=(8, 5))
plt.scatter(df["cylinders"], residuals, alpha=0.7)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel("Cylinders")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.show()


In [None]:
# Fit a simple linear regression model
X = df["fuel_consumption_comb(l/100km)"]
y = df["co2_emissions"]
X = sm.add_constant(X)  # Add intercept
model = sm.OLS(y, X).fit()
residuals = model.resid  # Get residuals

# Plot residuals
plt.figure(figsize=(8, 5))
plt.scatter(df["fuel_consumption_comb(l/100km)"], residuals, alpha=0.7)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel("Fuel Consumption (L/100km)")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.show()


In [None]:
# Fit a simple linear regression model
X = df["fuel_consumption_comb(mpg)"]
y = df["co2_emissions"]
X = sm.add_constant(X)  # Add intercept
model = sm.OLS(y, X).fit()
residuals = model.resid  # Get residuals

# Plot residuals
plt.figure(figsize=(8, 5))
plt.scatter(df["fuel_consumption_comb(mpg)"], residuals, alpha=0.7)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel("Fuel Consumption (mpg)")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.show()

In [None]:

# Fit a simple linear regression model
X = df["fuel_consumption_hwy"]
y = df["co2_emissions"]
X = sm.add_constant(X)  # Add intercept
model = sm.OLS(y, X).fit()
residuals = model.resid  # Get residuals

# Plot residuals
plt.figure(figsize=(8, 5))
plt.scatter(df["fuel_consumption_hwy"], residuals, alpha=0.7)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel("Fuel Consumption (hwy)")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.show()


In [None]:

# Fit a simple linear regression model
X = df["fuel_consumption_city"]
y = df["co2_emissions"]
X = sm.add_constant(X)  # Add intercept
model = sm.OLS(y, X).fit()
residuals = model.resid  # Get residuals

# Plot residuals
plt.figure(figsize=(8, 5))
plt.scatter(df["fuel_consumption_city"], residuals, alpha=0.7)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel("Fuel Consumption City  ")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.show()


In [None]:
from sklearn.feature_selection import f_regression

# Select candidate features by column position (label-encoded categorical and numerical columns)
X = df.iloc[:,[0,1,2,3,4,5,7,8,9,10]]  # Features
y = df['co2_emissions']  # Target variable

# Perform a univariate F-test of each feature against the target
f_values, p_values = f_regression(X, y)

# Store results in a DataFrame
anova_results = pd.DataFrame({'Feature': X.columns, 'F-Value': f_values, 'P-Value': p_values})
anova_results = anova_results.sort_values(by="F-Value", ascending=False)

print("ANOVA F-Test Results:")
print(anova_results)


In [None]:
from sklearn.feature_selection import RFE
# Step 1: Use Filter Method (ANOVA)
significant_features = anova_results[anova_results['P-Value'] < 0.05]['Feature']

# Step 2: Use Wrapper Method (RFE)
rfe = RFE(estimator=LinearRegression(), n_features_to_select=5)
X_rfe = rfe.fit_transform(X[significant_features], y)
rfe_selected_features = significant_features[rfe.support_]

# Step 3: Use Embedded Method (LASSO)
lasso = Lasso(alpha=0.1)
lasso.fit(X[rfe_selected_features], y)
lasso_selected_features = rfe_selected_features[lasso.coef_ != 0]

print("Final Selected Features:", list(lasso_selected_features))


In [None]:
# Select relevant features
selected_features1 = ['engine_size','cylinders','fuel_consumption_comb(mpg)',
                     'fuel_consumption_comb(l/100km)']

# Create a new DataFrame with selected features
X_selected = df[selected_features1]

# Compute VIF
vif_data = pd.DataFrame()
vif_data["Feature"] = X_selected.columns
vif_data["VIF"] = [variance_inflation_factor(X_selected.values, i) for i in range(X_selected.shape[1])]

print(vif_data)


## 7. Split Data for Training & Testing 

In [None]:
X=df[['engine_size','cylinders','fuel_consumption_comb(l/100km)','fuel_consumption_comb(mpg)']]
y=df['co2_emissions']
X.shape,y.shape

In [None]:
X.head()

In [None]:
y.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
X_train.shape,X_test.shape, y_train.shape,y_test.shape

# **MODEL SELECTION**

## **Train the Models & Models Evaluation**

**Fit the selected model to the training data & Assess model performance using metrics**

In [None]:
# Function to compute train and test metrics
def train_val(y_train, y_train_pred, y_test, y_pred, models_name):
    scores = {
        models_name + '_train': { 
            "MAE": mean_absolute_error(y_train, y_train_pred),
            "MSE": mean_squared_error(y_train, y_train_pred),
            "R² Score": r2_score(y_train, y_train_pred),
            "RMSE": np.sqrt(mean_squared_error(y_train, y_train_pred)),
            "Mean Difference": np.mean(abs(y_train - y_train_pred))
        },
        models_name + '_test': {
            "MAE": mean_absolute_error(y_test, y_pred),
            "MSE": mean_squared_error(y_test, y_pred),
            "R² Score": r2_score(y_test, y_pred),
            "RMSE": np.sqrt(mean_squared_error(y_test, y_pred)),
            "Mean Difference": np.mean(abs(y_test - y_pred))
        }
    }
    return pd.DataFrame.from_dict(scores, orient='index')  

# Dictionary to store model scores
models_scores = {}

# Cross-validation function
def cross_val(models, X, y):
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    cv_scores = cross_val_score(models, X, y, cv=kf, scoring='r2')
    return np.mean(cv_scores), np.std(cv_scores)  # Return mean and std deviation

# Train-test split (ensure y is a 1D NumPy array)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)

# ======= Linear Regression =======
lr = LinearRegression()
lr.fit(X_train, y_train)
y_train_pred_lr = lr.predict(X_train)
y_pred_lr = lr.predict(X_test)
models_scores["Linear Regression"] = train_val(y_train, y_train_pred_lr, y_test, y_pred_lr, "Linear")
cv_mean_lr, cv_std_lr = cross_val(lr, X_train, y_train)

# ======= Random Forest Regression =======
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_train_pred_rf = rf.predict(X_train)
y_pred_rf = rf.predict(X_test)
models_scores["Random Forest"] = train_val(y_train, y_train_pred_rf, y_test, y_pred_rf, "RandomForest")
cv_mean_rf, cv_std_rf = cross_val(rf, X_train, y_train)

# ======= Support Vector Regression (SVM) =======
svr = SVR(kernel='rbf')
svr.fit(X_train, y_train)
y_train_pred_svr = svr.predict(X_train)
y_pred_svr = svr.predict(X_test)
models_scores["Support Vector Machine"] = train_val(y_train, y_train_pred_svr, y_test, y_pred_svr, "SVR")
cv_mean_svr, cv_std_svr = cross_val(svr, X_train, y_train)

# ======= K-Nearest Neighbors (KNN) Regression =======
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)
y_train_pred_knn = knn.predict(X_train)
y_pred_knn = knn.predict(X_test)
models_scores["K-Nearest Neighbors"] = train_val(y_train, y_train_pred_knn, y_test, y_pred_knn, "KNN")
cv_mean_knn, cv_std_knn = cross_val(knn, X_train, y_train)

# ======= Print All Scores =======
for models_name, score_df in models_scores.items():
    print(f"\n Model: {models_name}")
    print(score_df)

# ======= Print Cross-Validation Scores =======
cv_results = pd.DataFrame({
    "Model": ["Linear Regression", "Random Forest", "SVR", "KNN"],
    "CV Mean R²": [cv_mean_lr, cv_mean_rf, cv_mean_svr, cv_mean_knn],
    "CV Std Dev": [cv_std_lr, cv_std_rf, cv_std_svr, cv_std_knn]
})

print("\n Cross-Validation Results:")
print(cv_results.round(4))  # Round results for readability


# **Visualization of Model Performance**

**We can use various plots to visualize the model performance**

## 1. Residual Plot (Errors for Each Model)

***Residuals = Actual - Predicted. A good model has residuals scattered randomly around zero.***
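
As a minimal illustration of that definition, the sketch below computes residuals for a handful of made-up values (the numbers are hypothetical, not taken from the dataset) and checks that they are centred on zero.

```python
# Hypothetical actual and predicted CO2 values (g/km), for illustration only
actual = [200.0, 250.0, 180.0, 300.0]
predicted = [195.0, 255.0, 182.0, 298.0]

# Residual = actual - predicted, computed element by element
residuals = [a - p for a, p in zip(actual, predicted)]

# A mean residual close to zero suggests no systematic over- or under-prediction
mean_residual = sum(residuals) / len(residuals)

print(residuals)       # [5.0, -5.0, -2.0, 2.0]
print(mean_residual)   # 0.0
```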

In [None]:
m1 = {
    "Linear Regression": y_pred_lr,
    "Random Forest": y_pred_rf,
    "SVR": y_pred_svr,
    "KNN": y_pred_knn
}

plt.figure(figsize=(12, 8))
for i, (model_name, y_pred) in enumerate(m1.items(), 1):
    plt.subplot(2, 2, i)
    residuals = y_test - y_pred
    sns.histplot(residuals, kde=True, bins=30)
    plt.axvline(x=0, color='red', linestyle='dashed', linewidth=1.5)
    plt.title(f"Residual Distribution - {model_name}")

plt.tight_layout()
plt.show()


In [None]:
m2 = {
    "Linear Regression": y_pred_lr,
    "Random Forest": y_pred_rf,
    "SVR": y_pred_svr,
    "KNN": y_pred_knn
}

plt.figure(figsize=(12, 8))
for i, (model_name, y_pred) in enumerate(m2.items(), 1):
    plt.subplot(2, 2, i)
    residuals = y_test - y_pred  # Calculate residuals (actual - predicted)
    plt.scatter(y_pred, residuals, alpha=0.6, edgecolors='b')
    plt.axhline(y=0, color='red', linestyle='dashed', linewidth=1.5)  # Horizontal line at zero
    plt.xlabel("Predicted CO₂ Emissions")
    plt.ylabel("Residuals (Actual - Predicted)")
    plt.title(f"Residual Scatter Plot - {model_name}")

plt.tight_layout()
plt.show()


## 2. Actual vs. Predicted Plot (Scatter Plot for Each Model)

***This plot compares actual and predicted values. Ideally, the points should lie along the diagonal line (y = x).***
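
How closely the points follow the diagonal is what the R² score summarises. A rough sketch with hypothetical numbers, computing R² by hand as 1 - SS_res / SS_tot (the helper name `r2_by_hand` is made up for this example):

```python
def r2_by_hand(actual, predicted):
    """R² = 1 - SS_res / SS_tot, the usual coefficient of determination."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # squared misses
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # spread of the target
    return 1 - ss_res / ss_tot

actual = [100.0, 200.0, 300.0]         # hypothetical values
on_diagonal = [100.0, 200.0, 300.0]    # perfect predictions: points lie on y = x
off_diagonal = [120.0, 180.0, 320.0]   # each prediction misses by 20

r2_perfect = r2_by_hand(actual, on_diagonal)
r2_imperfect = r2_by_hand(actual, off_diagonal)

print(r2_perfect)                # 1.0
print(round(r2_imperfect, 2))    # 0.94
```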

In [None]:
plt.figure(figsize=(12, 8))
for i, (model_name, y_pred) in enumerate(m1.items(), 1):
    plt.subplot(2, 2, i)
    plt.scatter(y_test, y_pred, alpha=0.6, edgecolors='b')
    plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linestyle='dashed')
    plt.xlabel("Actual CO₂ Emissions")
    plt.ylabel("Predicted CO₂ Emissions")
    plt.title(f"Actual vs Predicted - {model_name}")

plt.tight_layout()
plt.show()



## 3. Error Distribution for Each Model

***The distribution of absolute errors shows how large the prediction errors typically are and how often large misses occur.***
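
One caveat worth illustrating: absolute errors discard the sign, so their histogram shows the size of the errors rather than their direction. The sketch below uses hypothetical numbers to contrast the mean signed error, which reveals systematic over- or under-prediction, with the mean absolute error.

```python
actual = [200.0, 250.0, 180.0, 300.0]       # hypothetical values
predicted = [210.0, 260.0, 190.0, 310.0]    # every prediction is 10 too high

signed_errors = [a - p for a, p in zip(actual, predicted)]
absolute_errors = [abs(e) for e in signed_errors]

mean_signed = sum(signed_errors) / len(signed_errors)        # direction of the error
mean_absolute = sum(absolute_errors) / len(absolute_errors)  # size of the error

print(mean_signed)    # -10.0 (the model over-predicts on average)
print(mean_absolute)  # 10.0
```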

In [None]:
plt.figure(figsize=(12, 8))
for i, (model_name, y_pred) in enumerate(m2.items(), 1):
    plt.subplot(2, 2, i)
    errors = abs(y_test - y_pred)
    sns.histplot(errors, kde=True, bins=30, color="blue")
    plt.xlabel("Prediction Error")
    plt.title(f"Error Distribution - {model_name}")

plt.tight_layout()
plt.show()


# **Hyperparameter Tuning (Optimization)**

**To improve model performance, we can use GridSearchCV or RandomizedSearchCV to find the best hyperparameters.**
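
To get a feel for the trade-off between the two, it helps to count candidates. The sketch below copies the three search spaces and `n_iter` values used in the tuning cells of this notebook. It compares the size of the full grid with the number of candidates a randomized search would sample. It is plain counting and does not fit any model.

```python
from itertools import product

# Search spaces copied from the tuning cells in this notebook
search_spaces = {
    "Random Forest": {
        "n_estimators": [50, 100, 200, 500],
        "max_depth": [None, 10, 20, 30],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
    },
    "SVR": {
        "kernel": ["rbf", "linear", "poly"],
        "C": [0.1, 1, 10, 100],
        "gamma": ["scale", "auto", 0.01, 0.1, 1],
    },
    "KNN": {
        "n_neighbors": [3, 5, 7, 9, 11],
        "weights": ["uniform", "distance"],
        "metric": ["euclidean", "manhattan", "minkowski"],
    },
}
n_iter = {"Random Forest": 20, "SVR": 15, "KNN": 10}  # passed to RandomizedSearchCV

grid_sizes = {}
for name, space in search_spaces.items():
    # An exhaustive grid search would evaluate every combination of values
    grid_sizes[name] = len(list(product(*space.values())))
    print(f"{name}: grid={grid_sizes[name]} candidates, randomized={n_iter[name]}")
```

With `cv=5`, each candidate is fitted five times, so the full Random Forest grid would need 720 fits where the randomized search needs 100. Because every parameter here is given as a list, RandomizedSearchCV samples candidates without replacement.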

### **Random Forest Regressor**

In [None]:
rf_params = {
    'n_estimators': [50, 100, 200, 500],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

rf_random = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions=rf_params,
    n_iter=20,  # Number of combinations to try
    cv=5,
    scoring='r2',
    verbose=1,
    random_state=42,
    n_jobs=-1
)

rf_random.fit(X_train, y_train)
best_rf = rf_random.best_estimator_
print("Best Random Forest Params:", rf_random.best_params_)


### **Support Vector Regressor (SVR)**

In [None]:
svr_params = {
    'kernel': ['rbf', 'linear', 'poly'],
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto', 0.01, 0.1, 1]
}

svr_random = RandomizedSearchCV(
    estimator=SVR(),
    param_distributions=svr_params,
    n_iter=15,
    cv=5,
    scoring='r2',
    verbose=1,
    random_state=42,
    n_jobs=-1
)

svr_random.fit(X_train, y_train)
best_svr = svr_random.best_estimator_
print("Best SVR Params:", svr_random.best_params_)


### **K-Nearest Neighbors (KNN)**

In [None]:
knn_params = {
    'n_neighbors': [3, 5, 7, 9, 11],
    'weights': ['uniform', 'distance'],
    # 'minkowski' with the default p=2 is the same distance as 'euclidean'
    'metric': ['euclidean', 'manhattan', 'minkowski']
}

knn_random = RandomizedSearchCV(
    estimator=KNeighborsRegressor(),
    param_distributions=knn_params,
    n_iter=10,
    cv=5,
    scoring='r2',
    verbose=1,
    random_state=42,
    n_jobs=-1
)

knn_random.fit(X_train, y_train)
best_knn = knn_random.best_estimator_
print("Best KNN Params:", knn_random.best_params_)


# **Implementing the Best Hyperparameters in Models**

**Now that we have tuned our models using RandomizedSearchCV, we can update the models with the best hyperparameters found.**
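
Note that RandomizedSearchCV refits the best configuration on the full training data by default (`refit=True`), so `best_rf`, `best_svr` and `best_knn` from the cells above are already trained. The sketch below checks this on a small synthetic problem. The data and the names `X_demo`, `y_demo`, `search` and `manual` are made up for illustration and are not part of the CO2 workflow.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Small synthetic regression problem (hypothetical data, not the CO2 dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=60)

search = RandomizedSearchCV(
    estimator=KNeighborsRegressor(),
    param_distributions={"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]},
    n_iter=4,
    cv=3,
    scoring="r2",
    random_state=42,
)
search.fit(X_demo, y_demo)  # refit=True by default, so best_estimator_ is already trained

# Rebuild the model from best_params_ and fit it on the same data
manual = KNeighborsRegressor(**search.best_params_).fit(X_demo, y_demo)

# Both models should give the same predictions
same = np.allclose(search.best_estimator_.predict(X_demo), manual.predict(X_demo))
print(search.best_params_, same)
```

Rebuilding the estimators by hand, as the following cells do, is therefore mainly a matter of keeping the hyperparameters explicit and should give equivalent models.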

### **Update Random Forest Regressor**

In [None]:
# Best hyperparameters from RandomizedSearchCV
rf_best_params = rf_random.best_params_

rf = RandomForestRegressor(
    n_estimators=rf_best_params['n_estimators'],
    max_depth=rf_best_params['max_depth'],
    min_samples_split=rf_best_params['min_samples_split'],
    min_samples_leaf=rf_best_params['min_samples_leaf'],
    random_state=42
)

rf.fit(X_train, y_train)
y_train_pred_rf = rf.predict(X_train)
y_pred_rf = rf.predict(X_test)

# Store scores
models_scores["Random Forest (Tuned)"] = train_val(y_train, y_train_pred_rf, y_test, y_pred_rf, "RandomForest_Tuned")
cv_mean_rf, cv_std_rf = cross_val(rf, X_train, y_train)


### **Update Support Vector Regressor (SVR)**

In [None]:
# Best hyperparameters from RandomizedSearchCV
svr_best_params = svr_random.best_params_

svr = SVR(
    kernel=svr_best_params['kernel'],
    C=svr_best_params['C'],
    gamma=svr_best_params['gamma']
)

svr.fit(X_train, y_train)
y_train_pred_svr = svr.predict(X_train)
y_pred_svr = svr.predict(X_test)

# Store scores
models_scores["Support Vector Machine (Tuned)"] = train_val(y_train, y_train_pred_svr, y_test, y_pred_svr, "SVR_Tuned")
cv_mean_svr, cv_std_svr = cross_val(svr, X_train, y_train)


### **Update K-Nearest Neighbors (KNN)**

In [None]:
# Best hyperparameters from RandomizedSearchCV
knn_best_params = knn_random.best_params_

knn = KNeighborsRegressor(
    n_neighbors=knn_best_params['n_neighbors'],
    weights=knn_best_params['weights'],
    metric=knn_best_params['metric']
)

knn.fit(X_train, y_train)
y_train_pred_knn = knn.predict(X_train)
y_pred_knn = knn.predict(X_test)

# Store scores
models_scores["K-Nearest Neighbors (Tuned)"] = train_val(y_train, y_train_pred_knn, y_test, y_pred_knn, "KNN_Tuned")
cv_mean_knn, cv_std_knn = cross_val(knn, X_train, y_train)


### **Print Updated Results**

In [None]:
# Print scores for all models evaluated so far (baseline and tuned)
for models_names, score_df in models_scores.items():
    print(f"\n Model: {models_names}")
    print(score_df)

# Print cross-validation scores for tuned models
cv_results = pd.DataFrame({
    "Model": ["Random Forest (Tuned)", "SVR (Tuned)", "KNN (Tuned)"],
    "CV Mean R²": [cv_mean_rf, cv_mean_svr, cv_mean_knn],
    "CV Std Dev": [cv_std_rf, cv_std_svr, cv_std_knn]
})

print("\n Cross-Validation Results (Tuned Models):")
print(cv_results.round(4))


### **Train Polynomial Regression with Different Degrees**

In [None]:
# Function to evaluate polynomial regression for different degrees
def evaluate_polynomial_regression(degree):
    poly_model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    poly_model.fit(X_train, y_train)
    
    y_train_pred = poly_model.predict(X_train)
    y_test_pred = poly_model.predict(X_test)

    # Store scores
    model_name = f"Polynomial (Degree {degree})"
    models_scores[model_name] = train_val(y_train, y_train_pred, y_test, y_test_pred, model_name)

    # Cross-validation (run once; reuse the fold scores for both mean and std)
    cv_scores = cross_val_score(poly_model, X_train, y_train, cv=5, scoring='r2')
    cv_mean, cv_std = np.mean(cv_scores), np.std(cv_scores)
    
    return cv_mean, cv_std, poly_model


### **Test Different Polynomial Degrees**

In [None]:
cv_results_poly = []
best_poly_model = None
best_r2 = float('-inf')

for degree in range(1, 5):  # Test degrees from 1 to 4
    cv_mean, cv_std, poly_model = evaluate_polynomial_regression(degree)
    
    # Store results
    cv_results_poly.append({"Degree": degree, "CV Mean R²": cv_mean, "CV Std Dev": cv_std})

    # Track the best polynomial model
    if cv_mean > best_r2:
        best_r2 = cv_mean
        best_poly_model = poly_model


### **Display Results**

In [None]:
cv_results_poly_df = pd.DataFrame(cv_results_poly)
print("\n Polynomial Regression Cross-Validation Results:")
print(cv_results_poly_df.round(4))

# Best degree based on cross-validation
best_degree = cv_results_poly_df.loc[cv_results_poly_df["CV Mean R²"].idxmax(), "Degree"]
print(f"\n🔹 Best Polynomial Degree: {best_degree}")


## **Implementing the Best Polynomial Regression Model (Degree = 4) in the Tuned Model Workflow**

**Now that we have determined that Polynomial Regression with Degree 4 gives the best performance, we will integrate it into the tuned model pipeline and compare it with Random Forest, KNN, and SVR.**
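
One thing to keep in mind when choosing a high degree is how quickly the feature count grows. As a rough sketch, assuming scikit-learn's default `PolynomialFeatures` settings (`include_bias=True`, `interaction_only=False`) and a hypothetical five input columns, the number of generated columns is C(n + d, d). The helper name `n_poly_features` is made up for this example.

```python
from math import comb

def n_poly_features(n_inputs, degree):
    """Columns produced by PolynomialFeatures(degree) with default settings,
    counting the constant (bias) column: C(n_inputs + degree, degree)."""
    return comb(n_inputs + degree, degree)

n_inputs = 5  # hypothetical; substitute X_train.shape[1] in the notebook
counts = {degree: n_poly_features(n_inputs, degree) for degree in range(1, 5)}

for degree, n_cols in counts.items():
    print(degree, n_cols)
# 1 6
# 2 21
# 3 56
# 4 126
```

A degree-4 model on many input columns can therefore carry hundreds of coefficients, which is worth weighing against the gap between its train and test scores.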

### **Train the Best Polynomial Regression Model**

In [None]:
# Create and train the best polynomial model
best_poly_model = make_pipeline(PolynomialFeatures(degree=4), LinearRegression())
best_poly_model.fit(X_train, y_train)

# Predict on train and test data
y_train_pred_poly = best_poly_model.predict(X_train)
y_test_pred_poly = best_poly_model.predict(X_test)

# Store model evaluation scores
models_scores["Polynomial Regression (Tuned)"] = train_val(y_train, y_train_pred_poly, y_test, y_test_pred_poly, "Polynomial_Tuned")

# Cross-validation scores for Polynomial Regression (run once, reuse for mean and std)
cv_scores_poly = cross_val_score(best_poly_model, X_train, y_train, cv=5, scoring='r2')
cv_mean_poly = np.mean(cv_scores_poly)
cv_std_poly = np.std(cv_scores_poly)


# **Final Tuned Models**

## **Compare Polynomial Regression with Tuned Models**

In [None]:
cv_results = pd.DataFrame({
    "Model": ["Linear Regression", "Random Forest (Tuned)", "SVR (Tuned)", "KNN (Tuned)", "Polynomial Regression (Tuned)"],
    "CV Mean R²": [cv_mean_lr, cv_mean_rf, cv_mean_svr, cv_mean_knn, cv_mean_poly],
    "CV Std Dev": [cv_std_lr, cv_std_rf, cv_std_svr, cv_std_knn, cv_std_poly]
})

print("\n Cross-Validation Results for All Tuned Models:")
print(cv_results.round(4))


# **Visualizing Performance**

**Now that we have optimized our models, let's visualize their predictions.**

### **Scatter Plot (Actual vs. Predicted)**

In [None]:
def plot_actual_vs_pred(y_test, y_pred, model_name):
    plt.figure(figsize=(6, 6))
    sns.scatterplot(x=y_test, y=y_pred, alpha=0.6, edgecolor='k')
    plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], '--r', linewidth=2)  # Perfect fit line
    plt.xlabel("Actual Values")
    plt.ylabel("Predicted Values")
    plt.title(f"{model_name}: Actual vs Predicted")
    plt.grid()
    plt.show()

# Plot for best models
plot_actual_vs_pred(y_test, y_pred_lr, "Linear Regression")
plot_actual_vs_pred(y_test, y_pred_rf, "Random Forest (Tuned)")
plot_actual_vs_pred(y_test, y_pred_svr, "SVR (Tuned)")
plot_actual_vs_pred(y_test, y_pred_knn, "KNN (Tuned)")
plot_actual_vs_pred(y_test, y_test_pred_poly, "Polynomial Regression (Tuned, Degree 4)")

### **Residual Distribution**

In [None]:
def plot_residuals(y_test, y_pred, model_name):
    residuals = y_test - y_pred
    plt.figure(figsize=(6, 4))
    sns.histplot(residuals, kde=True, bins=30, color="blue", alpha=0.6)
    plt.axvline(0, color='red', linestyle='dashed', linewidth=2)  # Zero residual line
    plt.xlabel("Residuals (Actual - Predicted)")
    plt.ylabel("Frequency")
    plt.title(f"{model_name}: Residual Distribution")
    plt.grid()
    plt.show()

# Plot residuals for best models
plot_residuals(y_test, y_pred_lr, "Linear Regression")
plot_residuals(y_test, y_pred_rf, "Random Forest (Tuned)")
plot_residuals(y_test, y_pred_svr, "SVR (Tuned)")
plot_residuals(y_test, y_pred_knn, "KNN (Tuned)")
plot_residuals(y_test, y_test_pred_poly, "Polynomial Regression (Tuned, Degree 4)")

### **Box Plot Comparison of Errors**

In [None]:
errors_df = pd.DataFrame({
    "Linear Regression": abs(y_test - y_pred_lr),
    "Random Forest (Tuned)": abs(y_test - y_pred_rf),
    "SVR (Tuned)": abs(y_test - y_pred_svr),
    "KNN (Tuned)": abs(y_test - y_pred_knn),
    "Polynomial Regression (Tuned)": abs(y_test - y_test_pred_poly)
})

plt.figure(figsize=(12, 6))
sns.boxplot(data=errors_df)
plt.ylabel("Absolute Error")
plt.title("Comparison of Model Errors")
plt.grid()
plt.show()


# **MODEL DEPLOYMENT**