# 📊 Car Price Prediction — Machine Learning Project

## 📌 Objective
The goal of this project is to build a predictive model for car prices.  
We’ll walk through a complete Machine Learning pipeline:
- Load and clean the dataset
- Perform Exploratory Data Analysis (EDA)
- Handle missing values and outliers
- Encode categorical features
- Train and evaluate regression models
- Compare model performances and feature importance

---


In [None]:
import warnings
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder



In [None]:
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## Section 1: Data Exploration (EDA)

* The dataset was loaded using the Pandas library (`pd.read_csv('your_dataset.csv')`).
* The first five rows were displayed using `.head()`.
* Missing values were checked using `.isnull().sum()`.
* Data types were identified using `.dtypes`, distinguishing numerical and categorical features.
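
The inspection steps listed above can be tried on a small stand-in frame first. The sketch below is only an illustration: the `toy` DataFrame and its values are hypothetical, and the numeric/non-numeric split uses `pd.api.types.is_numeric_dtype` rather than the `select_dtypes` call used later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; the real notebook loads a CSV instead.
toy = pd.DataFrame({
    "make": ["audi", "bmw", None, "audi"],
    "horsepower": [102.0, np.nan, 110.0, 115.0],
    "price": [13950, 16430, 17450, 15250],
})

print(toy.head())              # first rows
missing = toy.isnull().sum()   # missing values per column
print(missing)

# Split the columns into numerical and categorical by their dtype.
numerical = [c for c in toy.columns if pd.api.types.is_numeric_dtype(toy[c])]
categorical = [c for c in toy.columns if c not in numerical]
print(numerical, categorical)
```

On the real dataset the same calls run on `df` right after `pd.read_csv`.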

In [None]:
df = pd.read_csv('/content/cars_price.csv')

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.duplicated().sum()

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
# Calculate the percentage of null values in each column.

null_percentage = (df.isnull().sum()/len(df)) * 100
null_percentage.sort_values(ascending=False)

In [None]:
# Drop columns with greater than 40% null values

remove_cols = null_percentage[null_percentage > 40].keys().tolist()
df.drop(remove_cols, axis=1, inplace=True)

In [None]:
obj_cols = df.select_dtypes(include='object').columns

In [None]:
# Explore object columns

for col in obj_cols:
  print(df[col].unique())
  print(len(df[col].value_counts()))
  print(df[col].value_counts())
  print("================")

## Handling Garbage Values and Data Type Conversion

This step involves cleaning the dataset by identifying and replacing any "garbage" values with `NaN` (Not a Number), which is the standard way to represent missing data in Pandas. Additionally, the data types of the columns will be reviewed and changed if they are not appropriate for the data they contain or for subsequent analysis and modeling.
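
As a compact alternative to handling each column in its own cell, the conversion can be sketched as a loop. This is a minimal sketch on hypothetical data (the `toy` frame and its values are made up for illustration). It relies on `pd.to_numeric(..., errors="coerce")` turning any unparseable entry, including `?`, into `NaN`.

```python
import pandas as pd

# Hypothetical stand-in data with "?" placeholders.
toy = pd.DataFrame({
    "bore": ["3.47", "?", "2.68"],
    "horsepower": ["111", "154", "?"],
})

# errors="coerce" maps every value that cannot be parsed as a number to NaN,
# so the replace step and the type conversion collapse into one call.
for col in ["bore", "horsepower"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

print(toy.dtypes)
print(toy.isna().sum())
```

Columns that should stay categorical, such as `num-of-doors`, still need the explicit `replace('?', np.nan)` step, because they are never passed through `pd.to_numeric`.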

In [None]:
df["normalized-losses"] = df["normalized-losses"].replace('?', np.nan)

In [None]:
df["normalized-losses"] = pd.to_numeric(df["normalized-losses"], errors='coerce')

In [None]:
df["num-of-doors"] = df["num-of-doors"].replace('?', np.nan)

In [None]:
df["bore"] = df["bore"].replace('?', np.nan)

In [None]:
df["bore"] = pd.to_numeric(df["bore"], errors='coerce')

In [None]:
df["stroke"] = df["stroke"].replace('?', np.nan)

In [None]:
df["stroke"] = pd.to_numeric(df["stroke"], errors='coerce')

In [None]:
df["horsepower"] = df["horsepower"].replace('?', np.nan)

In [None]:
df["horsepower"] = pd.to_numeric(df["horsepower"], errors='coerce')

In [None]:
df["peak-rpm"] = df["peak-rpm"].replace('?', np.nan)

In [None]:
df["peak-rpm"] = pd.to_numeric(df["peak-rpm"], errors='coerce')

In [None]:
df["price"] = df["price"].replace('?', np.nan)

In [None]:
df["price"] = pd.to_numeric(df["price"], errors='coerce')

In [None]:
# Drop rows where target value "price" value is null.

df.dropna(subset=["price"], inplace=True)

## Exploring and Filling NaN Values in Object Columns

This step counts the missing values (`NaN`) in columns with the `object` data type, which typically hold categorical or string data. Where a column has only a few missing values, we fill them with the column's most frequent value (the mode).
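
A threshold-based version of this mode imputation might look like the sketch below. The helper name `fill_with_mode`, the `max_missing_share` parameter and its default of 0.25 are assumptions made for illustration. The notebook itself fills every object column without checking a threshold.

```python
import pandas as pd

def fill_with_mode(frame, cols, max_missing_share=0.25):
    """Fill NaNs with the most frequent value when only a few values are missing."""
    for col in cols:
        share = frame[col].isna().mean()  # fraction of missing entries
        if 0 < share <= max_missing_share:
            # mode() ignores NaN by default; [0] picks the first mode on ties.
            frame[col] = frame[col].fillna(frame[col].mode()[0])
    return frame

# Hypothetical stand-in column with one missing entry out of five.
toy = pd.DataFrame({"num-of-doors": ["four", "two", None, "four", "four"]})
toy = fill_with_mode(toy, ["num-of-doors"])
print(toy["num-of-doors"].tolist())
```

Assigning the result back (`frame[col] = ...`) avoids relying on in-place modification of a column selection.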

In [None]:
obj_cols = df.select_dtypes(include='O').columns
for col in obj_cols:
  print(col)
  print(df[col].isna().sum())
  print(len(df[col]))
  print(df[col].value_counts())
  print("=====================")

In [None]:
def fill_categorical_missing_values(cols):
  # Fill NaNs in each given column with that column's most frequent value.
  for col in cols:
    df[col] = df[col].fillna(df[col].mode()[0])
  return None

In [None]:
fill_categorical_missing_values(obj_cols)

In [None]:
num_cols = df.select_dtypes(exclude='O').columns
num_cols

In [None]:
# Separate numerical columns that behave like categorical ones (few unique values)

def separate_categorical_col_by_value_count(cols, threshold=10):
    rem_cat_cols = []
    for col in cols:
        if df[col].nunique() <= threshold:
            rem_cat_cols.append(col)
    return rem_cat_cols

In [None]:
rem_cat_cols = separate_categorical_col_by_value_count(num_cols)
rem_cat_cols

In [None]:
# explore columns

for col in rem_cat_cols:
  print(col)
  print(df[col].unique())
  print(df[col].value_counts())
  print(df[col].isna().sum())
  print("=====================")

In [None]:
# Convert negative values to positive

df['symboling'] = df['symboling'].abs()

In [None]:
fill_categorical_missing_values(rem_cat_cols)

## Creating and Exploring Visualizations for Categorical Data

This step involves generating and analyzing various visualizations to understand the distribution and patterns within our categorical features and their relationship with the target variable (price). Visualizations help in gaining insights into the different categories, their frequencies, and how they might influence car prices.

In [None]:
all_cat_cols = list(df.select_dtypes(include='O').columns) + rem_cat_cols

In [None]:
# Create a box plot for each categorical column to analyze its impact on the target variable 'price'
n_cols = len(all_cat_cols)
cols_per_row = 2
rows_needed = int(np.ceil(n_cols / cols_per_row))

plt.figure(figsize=(15 * cols_per_row, 5 * rows_needed))

for index, col in enumerate(all_cat_cols):
    plt.subplot(rows_needed, cols_per_row, index + 1)
    sns.boxplot(x=col, y='price', data=df)
    plt.title(f'Box Plot of price vs. {col} Column')
    plt.xlabel(f'{col}')
    plt.ylabel('Target Variable Price')
    plt.xticks(rotation=45, ha='right')

plt.tight_layout()
plt.show()

## Dropping Non-Impacting Categorical Columns

Based on the box plots above, the following categorical columns were judged to have little impact on the target variable (`price`):

In [None]:
# Drop irrelevant categorical columns

rem_col = [col for col in all_cat_cols if col in ["fuel-type","num-of-doors","fuel-system","symboling"]]
rem_col

In [None]:
# drop(..., inplace=True) returns None, so the result is not assigned.
df.drop(columns=rem_col, inplace=True)

## Exploring Numerical Columns

This step involves a detailed exploration of the numerical features in the dataset to understand their statistical properties and the extent of missing values. This analysis helps in identifying potential issues, understanding the distribution of the data, and informing subsequent preprocessing steps.

In [None]:
num_cols = df.select_dtypes(exclude='O').columns

In [None]:
def separate_numerial_col_by_value_count(cols, threshold=10):
    num_cols = []
    for col in cols:
        if df[col].nunique() > threshold:
            num_cols.append(col)
    return num_cols

In [None]:
num_cols = separate_numerial_col_by_value_count(num_cols)

In [None]:
for col in num_cols:
  print(col)
  print(df[col].isna().sum())
  print(df[col].describe())
  print("=====================")

## Handling NaN Values in Numerical Columns and Visualizing Numerical Data

This step addresses the missing values (`NaN`) identified in the numerical columns and creates visualizations to further explore their distributions.

In [None]:
# Create a histogram to analyze the distribution of each numerical column.
plt.figure(figsize=(20, 15))
num_rows = (len(num_cols) + 2) // 3
no_cols = 3

for index, col in enumerate(num_cols):
    plt.subplot(num_rows, no_cols, index + 1)
    sns.histplot(df[col], kde=True)  # distplot is deprecated in seaborn
    plt.title(f'{col} Distribution Plot (figure {index+1})')
    plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

*Based on the analysis above, we fill the missing values with the median or the mean: the median for skewed columns and the mean for roughly symmetric ones. As the plots show, most of the columns are right-skewed.*
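
The median-or-mean choice can also be made programmatically from the sample skewness. The sketch below is an illustration only: the helper `fill_by_skew` and its cut-off of 0.5 are assumptions, whereas the cells that follow pick the statistic per column by eye.

```python
import numpy as np
import pandas as pd

def fill_by_skew(series, skew_threshold=0.5):
    """Impute with the median if |skew| exceeds the threshold, else with the mean."""
    use_median = abs(series.skew()) > skew_threshold
    fill_value = series.median() if use_median else series.mean()
    return series.fillna(fill_value), ("median" if use_median else "mean")

# Hypothetical stand-in columns: one symmetric, one with a long right tail.
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan])
right_skewed = pd.Series([1.0, 1.0, 2.0, 2.0, 3.0, 50.0, np.nan])

filled_sym, how_sym = fill_by_skew(symmetric)
filled_skew, how_skew = fill_by_skew(right_skewed)
print(how_sym, how_skew)
```

The median is less sensitive to the long tail, which is why it is the safer fill value for the skewed column.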

In [None]:
df["normalized-losses"] = df["normalized-losses"].fillna(df["normalized-losses"].median())

In [None]:
df["wheel-base"] = df["wheel-base"].fillna(df["wheel-base"].median())

In [None]:
df["length"] = df["length"].fillna(df["length"].mean())

In [None]:
df["width"] = df["width"].fillna(df["width"].median())

In [None]:
df["height"] = df["height"].fillna(df["height"].median())

In [None]:
df["curb-weight"] = df["curb-weight"].fillna(df["curb-weight"].median())

In [None]:
df["engine-size"] = df["engine-size"].fillna(df["engine-size"].median())

In [None]:
df["bore"] = df["bore"].fillna(df["bore"].mean())

In [None]:
df["stroke"] = df["stroke"].fillna(df["stroke"].mean())

In [None]:
df["compression-ratio"] = df["compression-ratio"].fillna(df["compression-ratio"].median())

In [None]:
df["horsepower"] = df["horsepower"].fillna(df["horsepower"].median())

In [None]:
df["peak-rpm"] = df["peak-rpm"].fillna(df["peak-rpm"].median())

In [None]:
df["city-mpg"] = df["city-mpg"].fillna(df["city-mpg"].median())

In [None]:
df["highway-mpg"] = df["highway-mpg"].fillna(df["highway-mpg"].median())

In [None]:
# Create the histograms again to check the distributions after handling missing values.
n_cols = len(num_cols)
cols_per_row = 2
rows_needed = int(np.ceil(n_cols / cols_per_row))

plt.figure(figsize=(15 * cols_per_row, 5 * rows_needed))

for index, col in enumerate(num_cols):
    plt.subplot(rows_needed, cols_per_row, index + 1)
    sns.histplot(df[col], kde=True)
    plt.title(f"Distribution of {col}")
    plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
# Create box plots to analyze outliers and the distribution of each column.
n_cols = len(num_cols)
cols_per_row = 2
rows_needed = int(np.ceil(n_cols / cols_per_row))

plt.figure(figsize=(15 * cols_per_row, 5 * rows_needed))

for index, col in enumerate(num_cols):
  plt.subplot(rows_needed, cols_per_row, index + 1)
  sns.boxplot(df[col])
  plt.title(f'{col} Box Plot (figure {index+1})')
  plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

In [None]:
# Create scatter plots to identify the relationship between the dependent and independent variables.
n_cols = len(num_cols)
cols_per_row = 2
rows_needed = int(np.ceil(n_cols / cols_per_row))

plt.figure(figsize=(15 * cols_per_row, 5 * rows_needed))

for index, col in enumerate(num_cols):
  plt.subplot(rows_needed, cols_per_row, index + 1)
  sns.scatterplot(x=df[col], y=df["price"], color="g")
  plt.title(f'{col} vs Price')
  plt.xlabel(col)
  plt.ylabel('Price')
  plt.xticks(rotation=45, ha='right')
  plt.tight_layout()

plt.show()

In the charts above, the columns "wheel-base", "width", "length", "height", "curb-weight", "bore", "engine-size" and "horsepower" show a linear relationship with price.
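
The visual impression from the scatter plots can be backed by ranking features by their absolute Pearson correlation with `price`. The sketch below uses synthetic data (the `toy` frame and its coefficients are invented for illustration), so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
engine_size = rng.uniform(60, 300, n)

# Synthetic stand-in: price depends on engine-size, peak-rpm is unrelated by construction.
toy = pd.DataFrame({
    "engine-size": engine_size,
    "peak-rpm": rng.uniform(4000, 6500, n),
    "price": 100 * engine_size + rng.normal(0, 500, n),
})

# Absolute correlation of every feature with the target, strongest first.
corr_with_price = (
    toy.corr(numeric_only=True)["price"]
    .drop("price")
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_price)
```

Pearson correlation only captures linear association, so a low value does not rule out a non-linear relationship.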

In [None]:
# Heatmap to analyze correlation.
plt.figure(figsize=(15, 10))
sns.heatmap(df[num_cols].corr(numeric_only=True),annot=True)
plt.show()

In [None]:
# Drop numerical columns that are weakly correlated with the target variable

less_correlation_cols = ['height', 'compression-ratio', 'stroke', 'peak-rpm', 'normalized-losses']
df.drop(columns=less_correlation_cols, inplace=True)

In [None]:
# Fetch categorical columns

all_cols = df.columns.tolist()
cols_to_exclude  = num_cols

cat_cols = [col for col in all_cols if col not in cols_to_exclude]

In [None]:
# Plot count charts to analyze the category counts and class balance in each categorical column.
n_cols = len(cat_cols)
cols_per_row = 2
rows_needed = int(np.ceil(n_cols / cols_per_row))

plt.figure(figsize=(15 * cols_per_row, 5 * rows_needed))

for index, col in enumerate(cat_cols):
  plt.subplot(rows_needed, cols_per_row, index + 1)
  sns.countplot(x=df[col])
  plt.title(col)
  plt.xticks(rotation=45, ha='right')
  plt.tight_layout()

plt.show()

In [None]:
# Pie charts to analyze the share of each category in a column.
n_cols = len(cat_cols)
cols_per_row = 1  # You can adjust this
rows_needed = int(np.ceil(n_cols / cols_per_row))

plt.figure(figsize=(15 * cols_per_row, 5 * rows_needed))  # Adjust figure size

for index, col in enumerate(cat_cols):
  plt.subplot(rows_needed, cols_per_row, index + 1)
  value_counts = df[col].value_counts()
  plt.pie(value_counts, labels=value_counts.index, autopct='%1.1f%%', startangle=140)
  plt.title(col)
  plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
  plt.tight_layout()

plt.show()

## Exploring and Handling Outliers in Numerical Columns

This step involves identifying outliers within the numerical features of the dataset and applying appropriate strategies to handle them. Outliers are data points that deviate significantly from other observations and can potentially skew model training and evaluation.
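
One common way to handle outliers is to cap them at the IQR fences instead of removing rows. The sketch below expresses that with `Series.clip`. The helper name `cap_with_iqr` and the sample values are hypothetical, and the multiplier `k=1.5` is the conventional choice also used in the cells below.

```python
import pandas as pd

def cap_with_iqr(series, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical values with one obvious outlier at the end.
values = pd.Series([10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 18.0, 100.0])
capped = cap_with_iqr(values)
print(capped.tolist())
```

Here Q1 is 12.75 and Q3 is 16.5, so the upper fence is 22.125 and only the last value is changed. Capping keeps the row count intact, which matters on a small dataset.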

In [None]:
#Check for outliers using the IQR
def get_bounds(df, col):
  Q1 = df[col].quantile(0.25)
  Q3 = df[col].quantile(0.75)

  IQR = Q3 - Q1

  lower_bound = Q1 - 1.5*IQR
  upper_bound = Q3 + 1.5*IQR

  return lower_bound, upper_bound

In [None]:
# print the outliers in cols
num_cols = separate_numerial_col_by_value_count(df.select_dtypes(exclude='O').columns)
for col in num_cols:
  lower_bound, upper_bound = get_bounds(df, col)

  outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
  print(lower_bound)
  print(upper_bound)
  print(f"The {col} data has {len(outliers)} outliers")
  print(df[col].describe())
  print("----------------------------------------")

In [None]:
# def remove_outliers(num_col):
#   df_cleaned = df.copy()

#   for col in num_col:
#     lower_bound, upper_bound = get_bounds(df_cleaned, col)

#     # Keep only the rows within bounds
#     df_cleaned = df_cleaned[(df_cleaned[col] >= lower_bound) & (df_cleaned[col] <= upper_bound)]

#   return df_cleaned

In [None]:
# remove_outliers_cols = ["length", "peak-rpm", ""]

# updated_df = remove_outliers(remove_outliers_cols)

In [None]:
# Apply capping to outliers

def iqr_capping(updated_df, cols):
  for col in cols:
    lower_bound, upper_bound = get_bounds(updated_df, col)

    updated_df[col] = np.where(updated_df[col] < lower_bound, lower_bound, np.where(updated_df[col] > upper_bound, upper_bound, updated_df[col]))
  return None

In [None]:
# capping_cols = [
#     "normalized-losses",
#     "wheel-base",
#     "width",
#     "curb-weight",
#     "engine-size",
#     "bore",
#     "horsepower",
#     "peak-rpm",
#     "city-mpg",
#     "highway-mpg",
#     "city-mpg"]
capping_cols = num_cols
iqr_capping(df, capping_cols)

In [None]:
# Check the distribution of "length" after capping
sns.histplot(df["length"], kde=True)
plt.title("Distribution of length")
plt.xticks(rotation=45, ha='right')

plt.tight_layout()
plt.show()

## Encoding Categorical Data into Numerical Representation

This step involves converting the categorical features in our dataset into a numerical format that can be understood and processed by machine learning models. We employed either Label Encoding or One-Hot Encoding based on the nature of each categorical column.

**Actions Performed:**

1.  **Identify Categorical Columns for Encoding:**
    * We selected the remaining categorical columns in the DataFrame (after potentially dropping some non-impacting ones in a previous step). These columns typically have a data type of `object`.


2.  **Apply Encoding Techniques:**
    * **Label Encoding:** This technique was applied to binary categorical features (those with only two unique categories). Label Encoding assigns a numerical label (e.g., 0 and 1) to each category.

    * **One-Hot Encoding:** This technique was applied to multi-category nominal features (those with more than two unique categories and no inherent order). One-Hot Encoding creates new binary columns for each unique category in the original column.

**Verification:**

After applying the encoding techniques, the DataFrame was inspected to:

* Confirm that the original categorical columns have been replaced by numerical representations (either single binary columns from Label Encoding or multiple binary columns from One-Hot Encoding).
* Check the data types of the encoded columns (they should be numerical; in this notebook, `int16` for label-encoded columns and `int` for the one-hot columns).
* Ensure that the number of columns has increased as expected due to One-Hot Encoding.
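
The two encodings and the verification checks can be sketched on a tiny frame. The `toy` data below is hypothetical. The calls mirror the ones used in this notebook (`LabelEncoder` and `pd.get_dummies` with `drop_first=True`).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in data.
toy = pd.DataFrame({
    "aspiration": ["std", "turbo", "std", "std"],            # two categories
    "body-style": ["sedan", "wagon", "hatchback", "sedan"],  # three categories
    "price": [13950, 16430, 17450, 15250],
})

encoded = toy.copy()
# Binary column: label encoding maps the sorted classes to 0 and 1.
encoded["aspiration"] = LabelEncoder().fit_transform(encoded["aspiration"])
# Multi-category column: one-hot encoding, dropping the first category.
encoded = pd.get_dummies(encoded, columns=["body-style"], dtype=int, drop_first=True)

print(encoded.dtypes)
print(encoded.columns.tolist())
```

With `drop_first=True`, a column with k categories becomes k - 1 dummy columns, so the three-category `body-style` column adds only one column net.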

In [None]:
for col in df.select_dtypes(include='O').columns:
  print(col)
  print(df[col].unique())
  print("=====================")

In [None]:
# encode categorical cols, apply label when col has 2 categories else one hot.
def encode_columns(updated_df):
  label_encoder = LabelEncoder()
  cols_to_one_hot = []
  for col in updated_df.select_dtypes(include='O').columns:
    if len(updated_df[col].unique()) == 2:
      updated_df[col] = label_encoder.fit_transform(updated_df[col])
      updated_df[col] = updated_df[col].astype('int16')

    else:
      cols_to_one_hot.append(col)
  print(cols_to_one_hot)
  updated_df = pd.get_dummies(updated_df, columns=cols_to_one_hot, dtype=int, drop_first=True)
  return updated_df

In [None]:
updated_df = encode_columns(df)

In [None]:
updated_df.columns

In [None]:
updated_df.shape

In [None]:
updated_df.head(5)

In [None]:
updated_df.info()

## Training, Testing, and Validating the Linear Regression Model

This step involves building, training, evaluating, and validating a Linear Regression model to predict car prices using the preprocessed numerical features.

**Actions Performed:**

1.  **Data Splitting:**
    * The dataset was split into training and testing sets using `train_test_split` from `sklearn.model_selection`. A typical split ratio of 80% for training and 20% for testing was used. A `random_state` was set for reproducibility.

2.  **Model Instantiation:**
    * A Linear Regression model was set up as an ordinary least squares model using `OLS` from `statsmodels.api`, with an intercept column added via `sm.add_constant`.


3.  **Model Training:**
    * The model was fitted on the training data (`X_train` with the added constant, `y_train`) using the `.fit()` method.


4.  **Model Evaluation on Test Set:**
    * The trained model was used to make predictions on the unseen test data (`X_test`) using the `.predict()` method.
    * The performance of the model was evaluated using appropriate regression metrics such as:
        * **Mean Squared Error (MSE):** Measures the average squared difference between the predicted and actual values.
        * **R-squared (R2 Score):** Represents the proportion of the variance in the dependent variable that is predictable from the independent variables.
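
To make the two metrics concrete, the sketch below computes them by hand with NumPy and compares the result with scikit-learn's `mean_squared_error` and `r2_score`. The arrays `y_true` and `y_hat` are made-up values for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values.
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_hat = np.array([11.0, 11.0, 15.0, 15.0])

# MSE: average squared difference between actual and predicted values.
mse = np.mean((y_true - y_hat) ** 2)
rmse = np.sqrt(mse)  # same units as the target

# R^2: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, rmse, r2)
print(mean_squared_error(y_true, y_hat), r2_score(y_true, y_hat))
```

Every error here is exactly 1 in magnitude, so the MSE is 1.0, and the residual sum of squares (4) against the total sum of squares (20) gives an R² of 0.8.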

**Insights:**

The evaluation metrics on the test set indicate how well the linear model generalizes to unseen data. Cross-validation, which is not run in the cells below, would give a more stable estimate than a single split and help detect overfitting or underfitting. These results serve as a baseline for comparison with the more complex models that follow.
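
The cells below evaluate a single train/test split. A minimal sketch of how k-fold cross-validation could be added is shown here. It is an assumption-laden illustration: it uses scikit-learn's `LinearRegression` instead of the statsmodels `OLS` used in this notebook, and the data (`X_demo`, `y_demo`) is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data with a known linear relationship.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(120, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=120)

# Five shuffled folds; each fold is used once as the validation set.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")

print(scores)
print(scores.mean(), scores.std())
```

A large spread between the fold scores would suggest that the single-split estimate depends heavily on which rows end up in the test set.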

In [None]:
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

In [None]:
# Selected features for the linear model; they are highly correlated with price
X = updated_df[['aspiration',
               'engine-location',
               'wheel-base',
               'length',
               'width',
               'curb-weight',
               'engine-size',
               'bore',
               'horsepower',
               'city-mpg',
               'highway-mpg'
              ]]
y = updated_df['price']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, test_size = 0.2, random_state = 100)

In [None]:
X_train.head()

In [None]:
y_train.head()

In [None]:
# Add a constant to get an intercept
X_train_sm = sm.add_constant(X_train)

# Fit the regression line using 'OLS'
lr = sm.OLS(y_train, X_train_sm).fit()

In [None]:
# print params that define the increase of value and impact on price
lr.params

In [None]:
# Performing a summary operation lists out all the different parameters of the regression line fitted
print(lr.summary())

In [None]:
y_pred_train = lr.predict(X_train_sm)

plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_train, y=y_pred_train)
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Actual Price vs. Predicted Price (Training Data)')
plt.plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 'r--')
plt.show()

In [None]:
y_train_pred = lr.predict(X_train_sm)
res = (y_train - y_train_pred)

In [None]:
fig = plt.figure()
sns.histplot(res, bins = 15, kde=True)  # distplot is deprecated in seaborn
fig.suptitle('Error Terms', fontsize = 15)
plt.xlabel('y_train - y_train_pred', fontsize = 15)
plt.show()

In [None]:
# Add a constant to X_test
X_test_sm = sm.add_constant(X_test)

# Predict the y values corresponding to X_test_sm
y_pred = lr.predict(X_test_sm)

In [None]:
# Residuals on the test set
res_test = (y_test - y_pred)

In [None]:
fig = plt.figure()
sns.histplot(res_test, bins = 15, kde=True)
fig.suptitle('Error Terms (Test Data)', fontsize = 15)
plt.xlabel('y_test - y_pred', fontsize = 15)
plt.show()

In [None]:
y_pred.head()

In [None]:
# mean_squared_error returns the MSE; take the square root to get the RMSE
np.sqrt(mean_squared_error(y_test, y_pred))

In [None]:
r_squared = r2_score(y_test, y_pred)
r_squared

In [None]:
y_pred_test = lr.predict(X_test_sm)

plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_test, y=y_pred_test)
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Actual Price vs. Predicted Price (Test Data)')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.show()

## Training the Decision Tree Regression Model

This step involves building, training, and evaluating a Decision Tree Regression model for predicting car prices. Decision Trees can capture non-linear relationships in the data and might perform differently compared to Linear Regression.

**Actions Performed:**

1.  **Model Instantiation:**
    * A Decision Tree Regression model was instantiated using `DecisionTreeRegressor` from `sklearn.tree`. We started with default hyperparameters, but these can be tuned later for better performance.

2.  **Model Training:**
    * The instantiated Decision Tree Regression model was trained on the training data (`X_train`, `y_train`) using the `.fit()` method.

3.  **Model Evaluation on Test Set:**
    * The trained Decision Tree model was used to make predictions on the unseen test data (`X_test`) using the `.predict()` method.
    * The performance of the model was evaluated using the same regression metrics as Linear Regression: Mean Squared Error (MSE) and R-squared (R2 Score).

4.  **Visualization of the Decision Tree:**
    * For a better understanding of how the Decision Tree makes predictions, the trained tree can be visualized, especially for trees with a reasonable depth.


**Insights:**

The performance metrics of the Decision Tree model on the test set were compared to those of the Linear Regression model. Decision Trees often have the potential to outperform linear models when the underlying relationships in the data are non-linear. However, they are also prone to overfitting, especially if the tree is allowed to grow very deep. The cross-validation results help in assessing the model's generalization capability. Visualizing the tree provides insights into the decision rules learned by the model. The next step would typically involve tuning the hyperparameters of the Decision Tree (e.g., `max_depth`, `min_samples_split`, `min_samples_leaf`) to optimize its performance and prevent overfitting.
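
The overfitting risk described above can be illustrated by comparing an unrestricted tree with a depth-limited one. The sketch below uses synthetic data (`X_demo`, `y_demo`), and the depth of 4 is an arbitrary choice for illustration, not a tuned value.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data with noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) * 5 + X_demo[:, 1] + rng.normal(scale=1.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

results = {}
for depth in (None, 4):  # None lets the tree grow until the leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train R2={results[depth][0]:.3f}, test R2={results[depth][1]:.3f}")
```

The unrestricted tree memorizes the training set, so its training R² is essentially 1. The gap between training and test R² is the quantity to watch when choosing `max_depth`.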

In [None]:
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.model_selection import GridSearchCV

In [None]:
data_train, data_test = train_test_split(updated_df, test_size=0.20, random_state = 42)

In [None]:
# Build a Decision Tree regression model
model = DecisionTreeRegressor()
# scikit-learn estimators expect the features as a 2D array or DataFrame.
x_train = data_train[['aspiration',
               'engine-location',
               'wheel-base',
               'length',
               'width',
               'curb-weight',
               'engine-size',
               'bore',
               'horsepower',
               'city-mpg',
               'highway-mpg'
              ]]
y_train = np.array(data_train["price"])  # 1D target array
model.fit(x_train, y_train)  # learn from the training data

# The next cells score the model on the training data, then on the test data.

In [None]:
r_sq = model.score(x_train, y_train)
print(f"R2: {r_sq}")

In [None]:
mean_squared_error(y_train, model.predict(x_train))

In [None]:
x_test = data_test[['aspiration',
               'engine-location',
               'wheel-base',
               'length',
               'width',
               'curb-weight',
               'engine-size',
               'bore',
               'horsepower',
               'city-mpg',
               'highway-mpg'
              ]]
y_test = np.array(data_test["price"])  # 1D target array

mean_squared_error(y_test, model.predict(x_test))  # mean_squared_error(y_true, y_pred)

In [None]:
plt.figure(figsize=(50,50))
a = plot_tree(model,
              feature_names=x_train.columns.tolist(),  # the frame this tree was fitted on
              filled=True,
              rounded=True,
              fontsize=14)

In [None]:
# Use grid search to tune the hyperparameters

param = {
    'criterion': ['squared_error', 'friedman_mse', 'absolute_error', 'poisson'],
    'max_depth': [None, 5, 10, 15, 20],
    'min_samples_split': [2, 5, 10, 20],
    'splitter': ['best', 'random']
}
# Performing GridSearchCV
grid_search = GridSearchCV(model, cv=5, param_grid=param)
grid_search.fit(x_train, y_train)

# Best parameters and best cross-validated score
print("Best Parameters:", grid_search.best_params_)
print("Best CV Score (R2):", grid_search.best_score_)

# Evaluating on test data
best_model = grid_search.best_estimator_
test_score = best_model.score(x_test, y_test)
print("Test Score (R2):", test_score)

## Training the Random Forest Regression Model

This step involves building, training, and evaluating a Random Forest Regression model, which is an ensemble learning method that combines multiple decision trees to make more robust and accurate predictions. Random Forests often outperform single decision trees by reducing overfitting and improving generalization.

**Actions Performed:**

1.  **Model Instantiation:**
    * A Random Forest Regression model was instantiated using `RandomForestRegressor` from `sklearn.ensemble`. We started with initial hyperparameters, but these are typically tuned using techniques like GridSearchCV or RandomizedSearchCV for optimal performance.

2.  **Model Training:**
    * The instantiated Random Forest Regression model was trained on the training data (`X_train`, `y_train`) using the `.fit()` method. Training a Random Forest involves building multiple decision trees on different subsets of the data and features.

3.  **Model Evaluation on Test Set:**
    * The trained Random Forest model was used to make predictions on the unseen test data (`X_test`) using the `.predict()` method. The predictions are typically the average of the predictions from all the individual decision trees in the forest.
    * The performance of the model was evaluated using Mean Squared Error (MSE) and R-squared (R2 Score).

4.  **Feature Importance Analysis:**
    * Random Forests provide a measure of feature importance, indicating which features contributed most to the predictions. This can be useful for understanding the underlying relationships in the data.


**Insights:**

The performance metrics of the Random Forest model were compared to those of Linear Regression and the single Decision Tree. Random Forests often achieve better performance due to their ability to reduce variance and handle complex relationships. The feature importance analysis provides insights into which features the model relies on most heavily for prediction. The next crucial step would be to tune the hyperparameters of the Random Forest model (e.g., `n_estimators`, `max_depth`, `min_samples_split`, `min_samples_leaf`, `max_features`) using cross-validation to further optimize its performance.
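
How feature importances behave can be checked on data where the answer is known in advance. In the sketch below the data is synthetic (`X_demo`, `y_demo`): the target is built from `engine-size` alone, and the two `noise-*` columns are hypothetical distractors.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Synthetic features: only engine-size drives the target.
X_demo = pd.DataFrame({
    "engine-size": rng.uniform(60, 300, 300),
    "noise-a": rng.normal(size=300),
    "noise-b": rng.normal(size=300),
})
y_demo = 100 * X_demo["engine-size"] + rng.normal(scale=200, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=1).fit(X_demo, y_demo)

# Impurity-based importances, labelled and sorted from most to least important.
importances = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

The importances are normalized to sum to 1. They are impurity-based, so on real data they can be inflated for high-cardinality features, and correlated features tend to split the importance between them.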

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
# Separate features (X) and target (y)
X = updated_df.drop('price', axis=1)
y = updated_df['price']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# 1. Instantiate the Random Forest Regressor model
# You can start with default hyperparameters and tune later
rf_model = RandomForestRegressor(random_state=42)

# 2. Train the model on the training data
rf_model.fit(X_train, y_train)

In [None]:
# 3. Make predictions on the test data
y_pred = rf_model.predict(X_test)

# 4. Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r_squared = r2_score(y_test, y_pred)

print("Random Forest Regressor Evaluation:")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared (R^2): {r_squared:.2f}")

In [None]:
# Visualize the predictions vs. actual values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred)
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.title("Actual Price vs. Predicted Price (Random Forest Regressor)")
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')  # Perfect prediction line
plt.grid(True)
plt.show()

In [None]:
# Feature Importance
feature_importances = rf_model.feature_importances_
feature_names = X_train.columns
sorted_indices = np.argsort(feature_importances)[::-1]

In [None]:
plt.figure(figsize=(12, 8))
plt.title("Feature Importance (Random Forest Regressor)")
plt.bar(range(X_train.shape[1]), feature_importances[sorted_indices], align="center")
plt.xticks(range(X_train.shape[1]), feature_names[sorted_indices], rotation='vertical')
plt.xlabel("Feature")
plt.ylabel("Importance Score")
plt.tight_layout()
plt.show()

In [None]:
print("=========== END =============")