# Comprehensive Regression Workflow: Predicting Laptop Prices

## **1. Introduction**
This notebook aims to build a **regression model to predict the price of laptops** (**Price_euros**) using their specifications, such as **RAM, CPU, Screen Size, and GPU**.  

Regression is a supervised learning technique that estimates a **continuous target variable** (**price**) based on input features. By analyzing technical specifications, we aim to understand which factors influence laptop pricing the most.  

This workflow will guide you through each step of the **laptop price prediction process**, including:

- **Data Exploration**: Understanding the dataset structure and identifying key trends.  
- **Preprocessing**: Handling missing values, encoding categorical variables, and scaling numerical features.  
- **Feature Engineering**: Creating or modifying features to improve model performance.  
- **Model Selection**: Comparing different regression models (Linear Regression, Ridge, Random Forest, etc.).  
- **Hyperparameter Tuning**: Optimizing the best model for improved predictions.  
- **Evaluation**: Assessing performance using metrics like **RMSE (Root Mean Squared Error)** and **R² (R-squared score)**.  

By the end of this notebook, we will have a well-trained **regression model** capable of estimating laptop prices based on specifications. 

## **2. Data Preprocessing**
This section covers data cleaning, handling missing values, and preparing the dataset for analysis.

#### Import necessary libraries

In [None]:
# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Import necessary libraries for data manipulation and visualization
import numpy as np  # For numerical operations
import pandas as pd  # For data manipulation and analysis
import re  # Import the regular expressions module for pattern matching and text processing
import matplotlib.pyplot as plt  # For plotting data
import seaborn as sns  # For enhanced data visualizations

# Import libraries for machine learning models and evaluation
from sklearn.model_selection import train_test_split  # For splitting data into training and testing sets
from sklearn.preprocessing import StandardScaler, OneHotEncoder  # For scaling numerical data and encoding categorical data
from sklearn.linear_model import LinearRegression, ElasticNet  # For linear Regression
from sklearn.tree import DecisionTreeRegressor  # For Decision Tree Regression
from sklearn.ensemble import RandomForestRegressor  # For Random Forest Regression
from sklearn.svm import SVR  # For Support Vector Regression 
import xgboost as xgb # For XGBoost Regression
from sklearn.model_selection import cross_validate  # To perform cross-validation
from sklearn.metrics import mean_squared_error, r2_score, make_scorer  # For model evaluation metrics
from sklearn.model_selection import GridSearchCV   # For hyperparameter tuning


# Set visual settings for plots (optional)
sns.set(style='whitegrid', palette='muted', font_scale=1.2)
plt.rcParams['figure.figsize'] = [10, 6]

#### Brief overview of the dataset

In [None]:
# Load the dataset
data = pd.read_csv("laptop_price.csv", encoding="latin-1")
# encoding="latin-1" is used to handle special characters in the dataset.
# It ensures the file is read correctly when it contains characters such as é, ñ, ü.
# Display the first few rows of the dataset
data.head()

In [None]:
# Display 10 random rows of the dataset
data.sample(10)

In [None]:
# Display a concise summary of the dataframe, including the number of non-null entries and the data type of each column
data.info()

#### Observations:
- The dataset contains 1303 entries and 13 columns.
- `laptop_ID` is an integer but is just an identifier (not useful for modeling).
- Most columns are object (string) types, including `Ram`, `Memory`, and `Weight` which should be converted to numerical values.
- `ScreenResolution`, `Cpu`, and `Gpu` may require feature extraction since they contain multiple pieces of information.
- **No missing values** are detected, so we don't need to handle NaNs.

In [None]:
# Summary statistics of numerical columns
data.describe()

#### Observations for numerical columns:
- `laptop_ID` is an identifier and does not contribute to the regression model.
- The average screen size (`Inches`) is ~15 inches, with a range from 10.1 to 18.4 inches.
- `Price_euros` varies significantly, with a minimum price of €174 and a maximum of €6099.
- The median price (\~€977) is lower than the mean (~€1123), indicating a possible right-skewed distribution. We may need to transform the target later.

In [None]:
# Summary statistics of categorical columns
data.describe(include = 'object')

#### Observations for categorical columns:
- There are 19 unique laptop brands (`Company`), with Dell being the most common.
- The `Product` column contains 618 unique values out of 1303 rows, so it is highly granular and behaves almost like an identifier rather than a meaningful feature; we'll drop it.
- `TypeName` shows that most laptops are Notebooks (727 out of 1303).
- `ScreenResolution` has 40 unique values, suggesting feature extraction might be needed.
- `Cpu`, `Gpu`, and `Memory` have high cardinality (many unique values), requiring encoding or feature engineering.
- `OpSys` is dominated by `Windows 10` (1072 occurrences), so we might consider grouping less common OS types.

#### Drop product

In [None]:
df = data.drop("Product", axis=1)

#### Convert data types of `Ram` and `Weight`

In [None]:
df["Ram"].unique()

In [None]:
# Remove the "GB" suffix from the 'Ram' column and convert it to an integer
df["Ram"] = df["Ram"].str.replace("GB", "").astype(int)

In [None]:
df["Weight"].unique()

In [None]:
# Remove the "kg" suffix from the 'Weight' column and convert it to a float
df["Weight"] = df["Weight"].str[:-2].astype(float)
# ".str[:-2]" removes the last two characters ("kg")
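
Slicing off a fixed number of characters assumes every value ends in exactly the same unit suffix. As a hedged alternative, the sketch below pulls the leading number out with a regular expression instead. The helper name `leading_number` and the sample strings are made up for illustration; they are not taken from the dataset.

```python
import re

def leading_number(text):
    """Return the first number in `text` as a float, or None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None

# Made-up values in the style of the `Ram` and `Weight` columns
samples = ["8GB", "1.37kg", "2kg", "unknown"]
parsed = [leading_number(s) for s in samples]
print(parsed)  # [8.0, 1.37, 2.0, None]
```

Returning `None` for unparseable values makes them easy to spot before converting the column to a numeric type.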

## 4. Feature Engineering
Feature engineering creates new features that may help predict the target variable, which here is the laptop price. In this notebook, we create additional features based on our **domain knowledge** of laptop specifications.

#### Extract screen resolution width & height

In [None]:
# Check for inconsistent formats in ScreenResolution
df["ScreenResolution"].unique()

In [None]:
# The pattern (\d+)x(\d+) captures two groups of digits separated by 'x', e.g., "1920x1080"
df[["ScreenWidth", "ScreenHeight"]] = df["ScreenResolution"].str.extract(r"(\d+)x(\d+)").astype(int)

# Drop the original 'ScreenResolution' column since its information is now split into two separate columns
df.drop(columns=["ScreenResolution"], inplace=True)
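
To see what the `(\d+)x(\d+)` pattern captures, here is a small standalone sketch using Python's `re` module rather than pandas. The function name `parse_resolution` and the sample strings are illustrative assumptions.

```python
import re

RESOLUTION = re.compile(r"(\d+)x(\d+)")

def parse_resolution(text):
    """Return (width, height) as ints, or None when no 'WxH' pattern is present."""
    match = RESOLUTION.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

# Illustrative strings in the style of the ScreenResolution column
examples = ["IPS Panel Full HD 1920x1080", "1366x768", "Touchscreen"]
results = [parse_resolution(e) for e in examples]
print(results)  # [(1920, 1080), (1366, 768), None]
```

In the pandas version, a row without a match would yield missing values, and the `.astype(int)` call would then fail, so it is worth checking the unique values first, as done above.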

#### Extract brand and frequency from `Cpu`

In [None]:
df["Cpu"].unique()

In [None]:
# Extract the brand of the CPU (first word in the 'Cpu' column)
# Example: "Intel Core i5 7200U 2.5GHz" → "Intel"
df["CPU Brand"] = df.Cpu.str.split(" ").apply(lambda x: x[0])


# Extract the CPU frequency (last element in the 'Cpu' column)
# Example: "Intel Core i5 7200U 2.5GHz" → "2.5GHz"
df["CPU Frequency"] = df.Cpu.str.split(" ").apply(lambda x: x[-1])

# Remove 'GHz' and convert CPU Frequency to a numeric format
df["CPU Frequency"] = df["CPU Frequency"].str[:-3].astype("float")
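
The same extraction can be sketched on plain strings, which makes the two steps (first word for the brand, trailing `GHz` value for the frequency) easy to verify. The helper `parse_cpu` is hypothetical, and the second sample string is made up.

```python
import re

def parse_cpu(cpu):
    """Split a CPU description into (brand, frequency in GHz)."""
    brand = cpu.split(" ")[0]
    # Match a number directly followed by "GHz", e.g. "2.5GHz" or "3GHz"
    match = re.search(r"(\d+(?:\.\d+)?)GHz", cpu)
    freq = float(match.group(1)) if match else None
    return brand, freq

cpu_examples = ["Intel Core i5 7200U 2.5GHz", "AMD A9-Series 9420 3GHz"]
cpu_parsed = [parse_cpu(c) for c in cpu_examples]
print(cpu_parsed)  # [('Intel', 2.5), ('AMD', 3.0)]
```

Anchoring the number to the `GHz` suffix avoids picking up model numbers such as `7200` by mistake.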

In [None]:
df.drop(columns=["Cpu"], inplace=True)

#### Extract memory amount and type from `Memory`

In [None]:
df["Memory"].unique()

In [None]:
# Function to convert memory size to MB
def convert_memory_to_MB(memory_str):
    """
    Converts memory sizes (GB, TB) into MB.
    Handles cases where storage types are included (e.g., '128GB SSD').
    Handles multiple storage types correctly (e.g., '256GB SSD + 1TB HDD').
    """
    total_memory = 0  # Initialize total storage size
    
    # Split in case there are multiple storage types
    for mem in memory_str.split("+"):
        mem = mem.strip()  # Remove unnecessary spaces
        
        # Extract numeric value using regex
        match = re.findall(r"(\d+\.?\d*)", mem)  # Finds numbers (including decimals)
        if match:
            size = float(match[0])  # Convert extracted number to float
            
            # Convert to MB based on unit (GB or TB) 1GB = 1000MB; 1TB = 1,000,000MB
            if "GB" in mem:
                total_memory += size * 1000
            elif "TB" in mem:
                total_memory += size * 1000000
    
    return total_memory
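
A quick way to sanity-check the conversion rule (1 GB = 1,000 MB, 1 TB = 1,000,000 MB) is a compact standalone variant with a few assertions. This is only a sketch: `storage_in_mb` is a hypothetical name, and it pairs each number with its unit through a single regular expression instead of splitting on `+`.

```python
import re

UNIT_TO_MB = {"GB": 1_000, "TB": 1_000_000}  # decimal units, as in the function above

def storage_in_mb(memory):
    """Sum every '<number><unit>' pair found in a storage description."""
    total = 0.0
    for size, unit in re.findall(r"(\d+(?:\.\d+)?)\s*(GB|TB)", memory):
        total += float(size) * UNIT_TO_MB[unit]
    return total

checks = {
    "128GB SSD": 128_000.0,
    "256GB SSD + 1TB HDD": 1_256_000.0,
    "1.0TB Hybrid": 1_000_000.0,  # made-up example
}
for text, expected in checks.items():
    assert storage_in_mb(text) == expected, text
```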

In [None]:
# Function to extract storage type (SSD, HDD, etc.)
def extract_memory_type(memory_str):
    """
    Extracts the storage type (SSD, HDD, Hybrid, Flash Storage).
    If multiple storage types exist, it returns all types found.
    """
    types = []
    
    for mem in memory_str.split("+"):
        mem = mem.strip()
        if "SSD" in mem:
            types.append("SSD")
        elif "HDD" in mem:
            types.append("HDD")
        elif "Hybrid" in mem:
            types.append("Hybrid")
        elif "Flash Storage" in mem:
            types.append("Flash Storage")
    
    return " + ".join(types)  # Combine types if multiple exist

In [None]:
# Apply functions to transform memory data
df["Total Memory (MB)"] = df["Memory"].apply(convert_memory_to_MB)
df["Memory Type"] = df["Memory"].apply(extract_memory_type)

In [None]:
df["Memory Type"].unique()

In [None]:
# Drop original 'Memory' column
df.drop(columns=["Memory"], inplace=True)

#### Extract the brand name from the GPU column

In [None]:
df["Gpu"].unique()

In [None]:
# Extract the brand name by splitting the string and taking the first word
df["GPU Brand"] = df.Gpu.str.split(" ").apply(lambda x: x[0])

# Drop the original 'Gpu' column
df = df.drop("Gpu", axis=1)

In [None]:
# Display the processed DataFrame
df.head()

## 3. Exploratory Data Analysis (EDA)
This section includes visualizations and insights to understand the dataset.

### Univariate Analysis
Univariate analysis involves analyzing individual features one at a time. This helps to understand the distribution, central tendency, and variability of each feature.

#### Visualizing Price Distribution

In [None]:
# Set the figure size to 8x5 inches for better visibility
plt.figure(figsize=(8, 5))

# Create a histogram to visualize the distribution of laptop prices
# 'bins=30' ensures the data is divided into 30 intervals
# 'kde=True' adds a Kernel Density Estimate (KDE) line to show the smooth probability distribution
sns.histplot(df["Price_euros"], bins=30, kde=True, color="teal")

# Set the title of the plot
plt.title("Distribution of Laptop Prices")

# Display the plot
plt.show()

Since the price distribution is right-skewed, with very few laptops above €4000, we'll reduce the skew later using a **log transformation**.

#### Distribution of other numerical features

In [None]:
numerical_features = df.select_dtypes(include='number').columns
len(numerical_features)

In [None]:
numerical_features

In [None]:
plt.figure(figsize=(8, 6))
for i, feature in enumerate(numerical_features):
    plt.subplot(3, 3, i + 1)
    sns.boxplot(x=df[feature], palette='viridis')
    plt.title(feature, fontsize=15)
    plt.xlabel(' ')

# Apply tight_layout once, after all subplots are created
plt.tight_layout()
plt.show()

#### Distribution of categorical features

In [None]:
categorical_features = df.select_dtypes(include='object').columns
len(categorical_features)

In [None]:
plt.figure(figsize=(10, 10))
for i in range(0, len(categorical_features)):
    plt.subplot(3, 2, i+1)
    sns.countplot(x = df[categorical_features[i]], palette = 'viridis')
    plt.title(categorical_features[i], fontsize = 10)
    plt.xlabel(' ')
    plt.xticks(rotation=90)
    plt.tight_layout()

In [None]:
# Display the value counts of each categorical feature
for i in range(0, len(categorical_features)):
    # Print the value counts for each categorical feature
    print(f"Value counts for {categorical_features[i]}:")
    print(df[categorical_features[i]].value_counts())

- We will drop `CPU Brand` and `OpSys`, as each is dominated by a single value.

### Bi-Variate Analysis
Bivariate analysis examines two features together to identify relationships or distinctive patterns between them. Here we compare each feature with the target variable, `Price_euros`.

#### Feature Correlation - for numerical columns

- A commonly used technique for bivariate analysis of numerical features is the **correlation matrix**, an effective tool for uncovering linear relationships (correlations) between any two continuous features.

In [None]:
## Correlation matrix of numerical features
plt.figure(figsize=(24, 10))
correlation_matrix = df[numerical_features].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Feature Correlation Matrix')
plt.show()

- The correlations with the target variable (`Price_euros`) are on the 5th row.
- `ScreenWidth` and `ScreenHeight` have a very high correlation (0.99), which might negatively affect models that are sensitive to multicollinearity, so we'll keep just one of them (`ScreenWidth`).
- We may also want to drop `Inches` and, of course, `laptop_ID`, as they correlate very weakly with the price.
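
Screen width and height are so strongly correlated because most panels share nearly the same aspect ratio. The sketch below computes the Pearson correlation by hand on a few common resolutions; the four value pairs are illustrative, not taken from the dataset, and `pearson` is a hypothetical helper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative resolutions, all close to a 16:9 aspect ratio
widths = [1366, 1920, 2560, 3840]
heights = [768, 1080, 1440, 2160]

r = pearson(widths, heights)
print(f"Correlation between width and height: {r:.4f}")
```

When two features move together this closely, the second one adds almost no new information, which is the reasoning behind keeping only one of them.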

#### Categorical Columns vs Laptop prices

In [None]:
# Box plots: Price_euros distribution across the categories of each categorical feature
plt.figure(figsize=(10, 15))  # Tall figure so the grid of subplots stays readable
for i in range(0, len(categorical_features)):
    plt.subplot(3, 2, i+1)  # 3 rows x 2 columns, one subplot per categorical feature
    sns.boxplot(x=categorical_features[i], y='Price_euros', data=df, palette='viridis')
    plt.title(f'Laptop Price vs. {categorical_features[i]}', fontsize=15)
    plt.xlabel(categorical_features[i], fontsize=12)
    plt.xticks(rotation=90)  # Rotate x-axis labels for better readability
    plt.ylabel('Laptop Price', fontsize=12)  # Add y-axis label for clarity

# Apply tight_layout after all subplots are created
plt.tight_layout()
plt.show()

Here we can check whether the price distributions differ enough between categories for a feature to be informative.

## 5. Modelling
Here, we split our data, scale it, and compare different regression models. We will also try out hyperparameter tuning.

#### Encode categorical variables
Most machine learning algorithms require numerical input, so we convert the categorical columns into numerical columns (*one-hot features*) using the pandas `get_dummies()` function, which makes them suitable for feeding into our models.
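
As a minimal illustration of what one-hot encoding produces, the sketch below encodes a made-up three-row frame. The `dtype=int` argument is an optional choice here, used only to show 0/1 values instead of booleans.

```python
import pandas as pd

# Made-up toy data with one numerical and one categorical column
toy = pd.DataFrame({
    "Ram": [8, 16, 8],
    "TypeName": ["Notebook", "Gaming", "Notebook"],
})

# Each category of `TypeName` becomes its own 0/1 column
encoded = pd.get_dummies(toy, columns=["TypeName"], dtype=int)
print(encoded)
```

Each row has exactly one `1` among the new `TypeName_*` columns, which is where the name "one-hot" comes from.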

In [None]:
# List of categorical columns to be encoded
categorical_cols = ["Company", "TypeName", "GPU Brand", "Memory Type"]

# Apply one-hot encoding to the categorical columns
# - pd.get_dummies creates binary columns for each category in the categorical columns
df = pd.get_dummies(df, columns=categorical_cols)

# Set option to display all the columns
pd.set_option("display.max_columns", None)

# Display the first few rows of the updated DataFrame to check the encoding
df.head()

#### Laptop Price Transformation

In [None]:
# Create a histogram to visualize the distribution of laptop prices before transformation
plt.figure(figsize=(10,6))
plt.title("Before transformation of Laptop Price")
sns.histplot(df["Price_euros"], bins=30, kde=True, color="teal")

The distribution is skewed to the right: the tail on the right-hand side of the curve is longer than the tail on the left, and the mean is greater than the mode. This is also called positive skewness.  
A skewed target can hurt the performance of our machine learning model. One way to alleviate this is to apply a **log transformation** to the skewed target, in our case *Price_euros*, to reduce the skewness of its distribution.
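
The effect of the log transformation can be sketched on a handful of made-up prices. The gap between mean and median, scaled by the standard deviation, serves here as a crude skewness indicator (an illustrative choice, not a metric used elsewhere in this notebook). The last step shows that `exp` undoes the transformation, which is how log-scale predictions can be converted back to euros.

```python
import math
import statistics

prices = [300, 450, 600, 800, 1000, 1500, 6000]  # made-up, right-skewed
log_prices = [math.log(p) for p in prices]

def mean_median_gap(values):
    """(mean - median) / std: larger positive values suggest a longer right tail."""
    return (statistics.mean(values) - statistics.median(values)) / statistics.pstdev(values)

gap_raw = mean_median_gap(prices)
gap_log = mean_median_gap(log_prices)
print(f"Gap before log: {gap_raw:.3f}, after log: {gap_log:.3f}")

# exp() is the inverse of log(), so the original prices can be recovered
recovered = [math.exp(v) for v in log_prices]
```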

In [None]:
# Create a histogram to visualize the distribution of log-transformed laptop prices
plt.figure(figsize=(10,6))
plt.title("After transformation of Laptop Price")
sns.histplot(np.log(df["Price_euros"]), bins=30, kde=True, color="teal") # Apply the natural logarithm to the prices

In [None]:
# Apply the log transformation
df["Price_euros"] = np.log(df["Price_euros"])

#### Feature Selection

In [None]:
# Drop irrelevant columns and separate features (X) and target (y)
X = df.drop(columns=["Price_euros", "laptop_ID", "OpSys", "CPU Brand","Inches","ScreenHeight"])  # Features
y = df["Price_euros"]  # Target variable

In [None]:
y

#### Split the data into training and testing sets

In [None]:
# Split into Train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#### Scaling the data
- Scaling ensures that each feature contributes equally to the distance calculations or the optimization process. 
- We'll use **Standardization** here.  Standardization transforms the features to have a mean of 0 and a standard deviation of 1. This is useful when the features have different units or scales.
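
Standardization is simply `(x - mean) / std`, with the mean and standard deviation taken from the training data and then reused for the test data. The sketch below shows this on made-up values; it uses the population standard deviation, which is also what `StandardScaler` uses.

```python
import statistics

train_ram = [4, 8, 8, 16]  # made-up training values
test_ram = [8, 32]         # made-up test values

# Statistics come from the training data only
mu = statistics.mean(train_ram)
sigma = statistics.pstdev(train_ram)  # population standard deviation

train_scaled = [(x - mu) / sigma for x in train_ram]
test_scaled = [(x - mu) / sigma for x in test_ram]  # reuses the training mu and sigma

print("Train mean after scaling:", round(statistics.mean(train_scaled), 6))
print("Train std after scaling:", round(statistics.pstdev(train_scaled), 6))
```

Reusing the training statistics on the test set is what keeps information about the test data from leaking into the preprocessing step.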

In [None]:
# Create a StandardScaler instance
scaler = StandardScaler()

# Fit and transform the training data
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test data
X_test_scaled = scaler.transform(X_test)

#### 1. Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

# Initialize the model
linear_reg = LinearRegression()

# Fit the model to the training data
linear_reg.fit(X_train_scaled, y_train)

# Predict on the test data
y_pred_linear = linear_reg.predict(X_test_scaled)

# Evaluate the model
rmse_linear = np.sqrt(mean_squared_error(y_test, y_pred_linear))
r2_linear = r2_score(y_test, y_pred_linear)

print(f"Linear Regression:\n RMSE: {rmse_linear:.4f}\n R²: {r2_linear:.4f}\n")
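
For reference, both metrics can be computed by hand. The sketch below uses four made-up target values and predictions; since the target in this notebook was log-transformed, RMSE values are in log-price units rather than euros.

```python
import math

y_true = [6.0, 7.0, 8.0, 9.0]  # made-up log-prices
y_pred = [6.5, 7.0, 7.5, 9.0]  # made-up predictions

n = len(y_true)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
mean_true = sum(y_true) / n
ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares

rmse = math.sqrt(ss_res / n)  # sqrt(0.5 / 4), about 0.3536
r2 = 1 - ss_res / ss_tot      # 1 - 0.5 / 5 = 0.9

print(f"RMSE: {rmse:.4f}, R²: {r2:.4f}")
```

A lower RMSE means smaller typical errors, while an R² closer to 1 means the model explains more of the variance in the target.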

#### 2. Ridge Regression

In [None]:
from sklearn.linear_model import Ridge

# Initialize the model
ridge_reg = Ridge(alpha=1.0)

# Fit the model to the training data
ridge_reg.fit(X_train_scaled, y_train)

# Predict on the test data
y_pred_ridge = ridge_reg.predict(X_test_scaled)

# Evaluate the model
rmse_ridge = np.sqrt(mean_squared_error(y_test, y_pred_ridge))
r2_ridge = r2_score(y_test, y_pred_ridge)

print(f"Ridge Regression:\n RMSE: {rmse_ridge:.4f}\n R²: {r2_ridge:.4f}\n")

#### 3. Elastic Net regression

In [None]:
from sklearn.linear_model import ElasticNet

# Initialize the Elastic Net model
elastic_net = ElasticNet(alpha=0.1)

# Fit the model
elastic_net.fit(X_train_scaled, y_train)

# Predicting on test data
y_pred_en = elastic_net.predict(X_test_scaled)

# Evaluate the model
rmse_en = np.sqrt(mean_squared_error(y_test, y_pred_en))
r2_en = r2_score(y_test, y_pred_en)

print(f"Elastic Net Regression:\n RMSE: {rmse_en:.4f}\n R²: {r2_en:.4f}\n")

#### 4. Support Vector Regression (SVR)

In [None]:
from sklearn.svm import SVR

# Initialize the model
svr_reg = SVR(kernel='linear')

# Fit the model to the training data
svr_reg.fit(X_train_scaled, y_train)

# Predict on the test data
y_pred_svr = svr_reg.predict(X_test_scaled)

# Evaluate the model
rmse_svr = np.sqrt(mean_squared_error(y_test, y_pred_svr))
r2_svr = r2_score(y_test, y_pred_svr)

print(f"Support Vector Regression:\n RMSE: {rmse_svr:.4f}\n R²: {r2_svr:.4f}\n")

#### 5. K-Nearest Neighbors (KNN)

In [None]:
from sklearn.neighbors import KNeighborsRegressor

# Initialize the model
knn_reg = KNeighborsRegressor(n_neighbors=5)

# Fit the model to the training data
knn_reg.fit(X_train_scaled, y_train)

# Predict on the test data
y_pred_knn = knn_reg.predict(X_test_scaled)

# Evaluate the model
rmse_knn = np.sqrt(mean_squared_error(y_test, y_pred_knn))
r2_knn = r2_score(y_test, y_pred_knn)

print(f"K-Nearest Neighbors:\n RMSE: {rmse_knn:.4f}\n R²: {r2_knn:.4f}\n")

#### 6. Decision Tree Regressor

In [None]:
from sklearn.tree import DecisionTreeRegressor

# Initialize the model
tree_reg = DecisionTreeRegressor()

# Fit the model to the training data
tree_reg.fit(X_train_scaled, y_train)

# Predict on the test data
y_pred_tree = tree_reg.predict(X_test_scaled)

# Evaluate the model
rmse_tree = np.sqrt(mean_squared_error(y_test, y_pred_tree))
r2_tree = r2_score(y_test, y_pred_tree)

print(f"Decision Tree Regressor:\n RMSE: {rmse_tree:.4f}\n R²: {r2_tree:.4f}\n")

#### 7. Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Initialize the model
forest_reg = RandomForestRegressor()

# Fit the model to the training data
forest_reg.fit(X_train_scaled, y_train)

# Predict on the test data
y_pred_forest = forest_reg.predict(X_test_scaled)

# Evaluate the model
rmse_forest = np.sqrt(mean_squared_error(y_test, y_pred_forest))
r2_forest = r2_score(y_test, y_pred_forest)

print(f"Random Forest Regressor:\n RMSE: {rmse_forest:.4f}\n R²: {r2_forest:.4f}\n")

#### 8. Gradient Boosting Regressor

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

# Initialize the model
gb_reg = GradientBoostingRegressor()

# Fit the model to the training data
gb_reg.fit(X_train, y_train)

# Predict on the test data
y_pred_gb = gb_reg.predict(X_test)

# Evaluate the model
rmse_gb = np.sqrt(mean_squared_error(y_test, y_pred_gb))
r2_gb = r2_score(y_test, y_pred_gb)

print(f"Gradient Boosting Regressor:\n RMSE: {rmse_gb:.4f}\n R²: {r2_gb:.4f}\n")

#### 9. XGBoost (eXtreme Gradient Boosting) Regression

In [None]:
# Install XGBoost
!pip install xgboost

# Import XGBoost
import xgboost as xgb

In [None]:
# Initialize the model with default parameters
xgb_reg = xgb.XGBRegressor()

# Fit the model
xgb_reg.fit(X_train, y_train)

# Predicting on test data
y_pred_xgb = xgb_reg.predict(X_test)


# Evaluate the model
rmse_xgb = np.sqrt(mean_squared_error(y_test, y_pred_xgb))
r2_xgb = r2_score(y_test, y_pred_xgb)

print(f"XGBoost Regressor:\n RMSE: {rmse_xgb:.4f}\n R²: {r2_xgb:.4f}\n")

#### 10. LightGBM (Light Gradient Boosting Machine)

In [None]:
# Install lightgbm
!pip install lightgbm

# Importing LightGBM
import lightgbm as lgb

In [None]:
# Your turn

#### Select the best performing model with a loop

In [None]:
# Initialize regression models
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "ElasticNet Regression": ElasticNet(alpha=0.1),
    "Support Vector Regression": SVR(kernel='linear'),
    "Gradient Boosting Regressor": GradientBoostingRegressor(),
    "K-Nearest Neighbors": KNeighborsRegressor(n_neighbors=5),
    "Random Forest Regressor": RandomForestRegressor(),
    "Decision Tree Regressor": DecisionTreeRegressor(),
    "XGBoost Regressor": xgb.XGBRegressor()
}

# Initializing Best Model Trackers
best_model = None
best_rmse = float("inf") # Set RMSE to a very high value initially (infinity)
best_r2 = float("-inf")  # Set R² to a very low value initially (-infinity)


# Train each model and evaluate its performance
for name, model in models.items():
    # Fit the model to the training data
    model.fit(X_train_scaled, y_train)
    
    # Predict on the test data
    y_pred = model.predict(X_test_scaled)
    
    # Evaluate the model using RMSE and R²
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    
    print(f"{name}:\n RMSE: {rmse:.4f}\n R²: {r2:.4f}\n")
    
    # Select the best model based on RMSE
    if rmse < best_rmse:
        best_rmse = rmse
        best_r2 = r2
        best_model = model
        best_model_name = name

print(f"\nBest Model: {best_model_name}\n Best RMSE: {best_rmse:.4f}\n Best R²: {best_r2:.4f}")

#### Evaluating With Cross Validation
- Cross-validation is a method to evaluate a model by splitting the data into multiple parts, training and testing the model on different subsets in each round, and then averaging the results. This helps ensure the model's performance is reliable and not just specific to one split of the data.
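
The mechanics can be sketched without any library: split the sample indices into `k` folds, hold each fold out once, and average the per-fold errors. The "model" below just predicts the training mean, and both the helper `k_fold_indices` and the target values are made up for illustration.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous folds, without shuffling.

    The first n_samples % k folds receive one extra sample.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

toy_y = [6.1, 6.4, 6.9, 7.2, 7.0, 6.6, 7.5, 6.8, 7.1, 6.3]  # made-up targets

fold_errors = []
for train_idx, test_idx in k_fold_indices(len(toy_y), k=5):
    # Stand-in model: predict the training mean for every held-out sample
    prediction = sum(toy_y[i] for i in train_idx) / len(train_idx)
    mse = sum((toy_y[i] - prediction) ** 2 for i in test_idx) / len(test_idx)
    fold_errors.append(mse ** 0.5)

mean_rmse = sum(fold_errors) / len(fold_errors)
print("RMSE per fold:", [round(e, 3) for e in fold_errors])
print("Average RMSE:", round(mean_rmse, 3))
```

Every sample is used for testing exactly once, so the average is less dependent on any single lucky or unlucky split.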

In [None]:
# Initialize the XGBoost model
model = xgb.XGBRegressor()

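# Create custom scorers for RMSE and R²
# `make_scorer` allows using custom metrics in cross-validation
# RMSE is computed as sqrt(MSE), since `squared=False` is deprecated in recent scikit-learn releases
def rmse_metric(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

rmse_scorer = make_scorer(rmse_metric)
r2_scorer = make_scorer(r2_score)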

# Dictionary of scoring metrics
scoring = {'RMSE': rmse_scorer, 'R2': r2_scorer}

# Perform cross-validation
# `cross_validate` splits the data into multiple folds, trains and tests the model, and calculates the scores.
# Since cross-validation already includes multiple train-test splits, we use the full dataset (X, y).
cv_results = cross_validate(model, X, y, scoring=scoring, cv=5, return_train_score=True)

# Output the results
# `cv_results` contains the scores for each fold
print("RMSE scores:", cv_results['test_RMSE'])  # RMSE scores for each fold
print("R² scores:", cv_results['test_R2'])  # R² scores for each fold
print("Average RMSE:", cv_results['test_RMSE'].mean())  # Average RMSE across all folds
print("Average R²:", cv_results['test_R2'].mean())  # Average R² score across all folds

## 7. Hyperparameter Tuning
- Hyperparameter tuning is the process of finding the best *settings* for a machine learning model to improve its performance.
- Think of it like adjusting the knobs on a machine to make it work better. 
- You test different combinations of settings (hyperparameters) and evaluate how well the model performs with each set. 
- The goal is to find the combination that leads to the best results, such as higher accuracy or lower error rates.

#### Hyperparameter Tuning with Loops

In [None]:
# We'll use XGBoost and tune the n_estimators, learning_rate, and max_depth.

# Initialize the best RMSE and best R² to extreme values to ensure any calculated values will be better
best_rmse = float('inf')
best_r2 = -float('inf')
best_params = {}

# Iterate over different values for n_estimators, learning_rate, and max_depth
for n_estimators in [50, 100, 200]:  # Number of boosting rounds
    for learning_rate in [0.01, 0.1, 0.2]:  # Step size at each iteration
        for max_depth in range(3, 10, 2):  # Maximum depth of each tree

            # Initialize the XGBoost model with the current set of hyperparameters
            xgb_reg = xgb.XGBRegressor(
                n_estimators=n_estimators,       # Number of boosting rounds
                learning_rate=learning_rate,     # Step size at each iteration
                max_depth=max_depth,             # Maximum depth of each tree
                random_state=42                  # Ensures reproducibility
            )
            
            # Train the model using the training data
            xgb_reg.fit(X_train, y_train)
            
            # Predict the target values for the test data
            y_pred_xgb = xgb_reg.predict(X_test)
            
            # Calculate Root Mean Squared Error (RMSE) and R² score for the current model
            rmse = np.sqrt(mean_squared_error(y_test, y_pred_xgb))
            r2 = r2_score(y_test, y_pred_xgb)
            
            # Check if the current RMSE is better (lower) than the best RMSE so far
            if rmse < best_rmse:
                # Update the best RMSE, R² score, and the best parameters
                best_rmse = rmse
                best_r2 = r2
                best_params = {
                    'n_estimators': n_estimators, 
                    'learning_rate': learning_rate, 
                    'max_depth': max_depth
                }

# Print the best hyperparameters and corresponding RMSE and R² score
print("Best Parameters for XGBoost:", best_params)
print("Best RMSE for XGBoost:", best_rmse)
print("Best R² Score for XGBoost:", best_r2)

#### Hyperparameter Tuning with `GridSearchCV` (GridSearch Cross Validation)
- `GridSearchCV` is a technique used to optimize machine learning models by systematically evaluating all possible combinations of specified hyperparameters.
- `GridSearchCV` combines grid search and cross-validation into a single tool.
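
To get a feel for the cost of an exhaustive search, the sketch below expands the same parameter grid used in this section with `itertools.product` and counts the model fits needed with 5-fold cross-validation. It only enumerates the combinations and does not train anything.

```python
from itertools import product

grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.1, 0.2],
    "max_depth": [3, 5, 7, 9],
}
cv_folds = 5

# Every combination of one value per hyperparameter
names = list(grid)
candidates = [dict(zip(names, values)) for values in product(*grid.values())]

n_candidates = len(candidates)
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # 36 180
```

The count grows multiplicatively with every added hyperparameter or value, which is why the search can take a while.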

#####   This will take some time...

In [None]:
# Initialize the XGBoost model with default settings
xgb_model = xgb.XGBRegressor(random_state=42)

# Define the parameter grid to search over
param_grid = {
    'n_estimators': [50, 100, 200],       # Number of boosting rounds
    'learning_rate': [0.01, 0.1, 0.2],    # Step size at each iteration
    'max_depth': [3, 5, 7, 9]             # Maximum depth of each tree
}

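# Define custom scorers for RMSE and R²
# RMSE is computed as sqrt(MSE), since `squared=False` is deprecated in recent scikit-learn releases
def rmse_metric(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

rmse_scorer = make_scorer(rmse_metric, greater_is_better=False)  # Negated so that higher is better
r2_scorer = make_scorer(r2_score)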

# Initialize GridSearchCV with multiple scoring metrics
grid_search = GridSearchCV(
    estimator=xgb_model,                 # Model to use
    param_grid=param_grid,               # Parameter grid to search
    scoring={'RMSE': rmse_scorer, 'R2': r2_scorer},  # Metrics to evaluate
    cv=5,                                # Number of cross-validation folds
    refit = 'RMSE',                       # Metric to optimize
    verbose=1,                           # Level of verbosity for output
    n_jobs= -1                            # Use all available CPU cores
)

# Fit GridSearchCV on the training data only, so the test set stays unseen for the final evaluation
grid_search.fit(X_train, y_train)

# Retrieve the best hyperparameters
best_params = grid_search.best_params_
best_rmse = -grid_search.best_score_  # The RMSE scorer is negated (greater_is_better=False), so flip the sign back
best_r2 = grid_search.cv_results_['mean_test_R2'][grid_search.best_index_]

# Print the best hyperparameters and corresponding RMSE and R² score
print("Best Parameters for XGBoost:", best_params)
print("Best RMSE for XGBoost:", best_rmse)
print("Best R² Score for XGBoost:", best_r2)

#### Predict with the best parameters

In [None]:
# Initialize the XGBoost model with best parameters
best_xgb_model = xgb.XGBRegressor(**best_params)

# Train the model on the full training dataset
best_xgb_model.fit(X_train, y_train)

In [None]:
# Make predictions on the test set
y_pred = best_xgb_model.predict(X_test)

# Calculate RMSE and R² on test data
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
test_r2 = r2_score(y_test, y_pred)

print(f"Test RMSE: {test_rmse:.4f}")
print(f"Test R² Score: {test_r2:.4f}")

In [None]:
# Plot feature importances
xgb.plot_importance(best_xgb_model, max_num_features=20)
plt.show()

In [None]:
# Save the Model for Future Use
import joblib

joblib.dump(best_xgb_model, "best_xgboost_model.pkl")

---
***Your Dataness***,  
Obinna Oliseneku (_**Hybraid**_)  
**[LinkedIn](https://www.linkedin.com/in/obinnao/)** | **[GitHub](https://github.com/hybraid6)**  