# Machine Learning Models for Text Classification  

In this notebook, we will train and evaluate different **Machine Learning models** for **text classification** using the **TF-IDF features** extracted in the previous steps.  

We will use the **TF-IDF matrix (`Xtrain_matrix.pkl`)** as input features and the **target labels (`y_train_encoded.pkl`)** for supervised learning.  
The **test dataset (`Xtest_matrix.pkl`)** will not be used at this stage, as it is reserved for **final result submission**.  

We will start by experimenting with the following models:  
- **Logistic Regression**  
- **Support Vector Machines (SVM)**  
- **Random Forest Classifier**  
- **K-Neighbors Classifier**  
- **Decision Tree Classifier**
- **XGBoost**  
- **Linear SVC** 
- **Voting Classifier**  

📌 **These models have been selected based on standard text classification approaches, but we may adjust our choices depending on their performance.**  

We will perform a **comprehensive hyperparameter search** for all models, optimizing for the **weighted F1-score**, which is the metric the project and challenge use to evaluate classification performance.  
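To make the metric concrete, here is a minimal sketch on hypothetical toy labels (not the project data). It shows that the weighted F1-score is the mean of the per-class F1-scores, weighted by the number of true samples in each class:

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical toy labels: class 0 is frequent, classes 1 and 2 are rare
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 0, 2, 1])

# Per-class F1 and class support (number of true samples per class)
per_class_f1 = f1_score(y_true, y_pred, average=None)
support = np.bincount(y_true)

# Weighted F1 = support-weighted mean of the per-class F1 scores
manual_weighted_f1 = float(np.sum(per_class_f1 * support) / support.sum())
sklearn_weighted_f1 = float(f1_score(y_true, y_pred, average="weighted"))

# Macro F1 (unweighted mean) is shown for comparison only
macro_f1 = float(f1_score(y_true, y_pred, average="macro"))

print(f"Per-class F1: {np.round(per_class_f1, 3)}")
print(f"Weighted F1 (manual):  {manual_weighted_f1:.4f}")
print(f"Weighted F1 (sklearn): {sklearn_weighted_f1:.4f}")
print(f"Macro F1: {macro_f1:.4f}")
```

Because frequent classes carry more weight, a model can reach a high weighted F1 while still performing poorly on rare classes, so the per-class report remains worth reading.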

By the end of this notebook, we will:  
✔ Compare the performance of different models  
✔ Optimize hyperparameters if necessary  
✔ Select and save the best-performing model for future use  


## Import Required Libraries

In [18]:
import sys
import os
from pathlib import Path
import pickle
import pandas as pd

# Machine Learning models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC  # LinearSVC is used in section 3.7
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

# Model selection and hyperparameter tuning
from sklearn.model_selection import train_test_split, GridSearchCV
# Evaluation metrics
from sklearn.metrics import classification_report, confusion_matrix, f1_score, accuracy_score
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

## 1. Loading Preprocessed Data  

Before training our Machine Learning models, we need to load the **TF-IDF matrix (`Xtrain_matrix.pkl`)** and the **target labels (`y_train_encoded.pkl`)**.  


In [19]:
# Get the current notebook directory
CURRENT_DIR = Path(os.getcwd()).resolve()

# Automatically find the project root (go up 3 levels)
PROJECT_ROOT = CURRENT_DIR.parents[2]

# Add project root to sys.path
sys.path.append(str(PROJECT_ROOT))

# Function to get relative paths from project root
def get_relative_path(absolute_path):
    return str(Path(absolute_path).relative_to(PROJECT_ROOT))

# Print project root directory
print(f"Project Root Directory: {PROJECT_ROOT.name}")  # Display only the root folder name

import config  # Now Python can find config.py

# Paths to load
tfidf_path = Path(config.XTRAIN_MATRIX_PATH)
labels_path = Path(config.YTRAIN_ENCODED_PATH)

# Print paths being used (relative to project root)
print(f"Using Config File from: {get_relative_path(config.__file__)}")
print(f"Loading TF-IDF matrix from: {get_relative_path(tfidf_path)}")
print(f"Loading encoded labels from: {get_relative_path(labels_path)}")

# Check if files exist before loading
if not tfidf_path.exists():
    raise FileNotFoundError(f"Error: TF-IDF matrix file not found at {get_relative_path(tfidf_path)}")

if not labels_path.exists():
    raise FileNotFoundError(f"Error: Encoded labels file not found at {get_relative_path(labels_path)}")

# Load the TF-IDF matrix (the context manager closes the file handle after loading)
with open(tfidf_path, "rb") as f:
    X = pickle.load(f)

# Load the classification labels
y = pd.read_pickle(labels_path)

# Print confirmation
print("Data Successfully Loaded!")
print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")


Project Root Directory: Data_Scientist_Rakuten_Project-main
Using Config File from: config.py
Loading TF-IDF matrix from: data\processed\text\Xtrain_matrix.pkl
Loading encoded labels from: data\processed\text\y_train_encoded.pkl
Data Successfully Loaded!
X shape: (84916, 5000)
y shape: (84916,)


## 2. Splitting Data into Training and Validation Sets

In [20]:
# Split data into training and validation sets.
# stratify=y keeps the class proportions identical in both subsets,
# and random_state=42 makes the split reproducible across runs.
# Note: X_test / y_test are the held-out validation split, not the challenge test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Split Completed:")
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")


Split Completed:
X_train shape: (67932, 5000), y_train shape: (67932,)
X_test shape: (16984, 5000), y_test shape: (16984,)
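
As a small illustration of why stratification and a fixed seed matter for an 80/20 split, the sketch below uses synthetic, imbalanced labels (hypothetical data, not the project matrix) and checks that the minority-class share is preserved in both subsets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (hypothetical): 90% class 0, 10% class 1
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([0] * 900 + [1] * 100)

# stratify=y_demo preserves class proportions; random_state fixes the shuffle
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of the minority class in each subset
train_share = float(np.mean(y_tr == 1))
val_share = float(np.mean(y_val == 1))

print(f"Train shape: {X_tr.shape}, validation shape: {X_val.shape}")
print(f"Minority share (train): {train_share:.3f}")
print(f"Minority share (validation): {val_share:.3f}")
```

Without `stratify`, rare classes can end up under-represented in the validation set, which makes per-class scores noisier.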


## 3. Training Machine Learning Models  

### 3.1 Logistic Regression

To optimize the **Logistic Regression** model, we perform a **GridSearchCV** on key hyperparameters, aiming to maximize the **weighted F1-score** to handle class imbalances effectively:

- `multi_class`: Defines the type of classification (`multinomial` for multi-class problems).
- `class_weight`: Balances classes for imbalanced data (`balanced`, `None`, or custom dictionaries).
- `max_iter`: Controls the number of iterations for convergence.
- `C`: Regularization strength, which can impact model performance.
- `solver`: Specifies the algorithm for optimization (e.g., `lbfgs`, `saga`, `liblinear`).
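
To illustrate the role of `C`, here is a minimal sketch on synthetic data (generated with `make_classification`, not the project's TF-IDF matrix). In scikit-learn, `C` is the inverse of the regularization strength, so smaller values shrink the coefficients more strongly:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class problem (hypothetical stand-in for the real features)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# Fit the same model with weak, default, and strong values of C
coef_norms = {}
for C in [0.01, 1, 100]:
    model = LogisticRegression(C=C, max_iter=2000)
    model.fit(X_demo, y_demo)
    # Overall size of the learned coefficients (Frobenius norm)
    coef_norms[C] = float(np.linalg.norm(model.coef_))

for C, norm in coef_norms.items():
    print(f"C={C}: coefficient norm = {norm:.3f}")
```

A very small `C` can underfit, while a very large `C` behaves almost like an unregularized model, which is why the grid spans several orders of magnitude.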

In [None]:
# # Define parameter grid (reduced for efficiency)
# param_grid = {
#     'multi_class': ['multinomial'],
#     'class_weight': ['balanced', None],
#     'max_iter': [500, 1000, 1500]
# }

# # Initialize Logistic Regression model
# log_reg = LogisticRegression()

# # Perform GridSearchCV based on 'weighted F1-score'
# grid_search = GridSearchCV(log_reg, param_grid, cv=3, scoring='f1_weighted', n_jobs=-1)
# grid_search.fit(X_train, y_train)

# # Extract best model
# best_log_reg = grid_search.best_estimator_

# # Display best hyperparameters
# print(f"Best Hyperparameters for Logistic Regression: {grid_search.best_params_}")


In [None]:
# clf_lr = LogisticRegression(multi_class='multinomial', class_weight=None, max_iter=500)
# clf_lr.fit(X_train, y_train.values.ravel())

# # Predict and evaluate
# y_pred = clf_lr.predict(X_test)
# print(classification_report(y_test, y_pred))


In [None]:
# clf_lr = LogisticRegression(multi_class='multinomial', class_weight='balanced', max_iter=1000)
# clf_lr.fit(X_train, y_train.values.ravel())

# # Predict and evaluate
# y_pred = clf_lr.predict(X_test)
# print(classification_report(y_test, y_pred))


- **GridSearchCV** for Logistic Regression

In [None]:
# Define a more extensive parameter grid
param_grid = {
    'multi_class': ['multinomial', 'ovr'],  # Also try 'ovr' (one-vs-rest) to compare with per-class binary classification
    'class_weight': ['balanced', None],  # A custom weighting must be an actual dict ({label: weight}), not the string 'dict'
    'max_iter': [500, 1000, 1500, 2000],  # More values to see the impact on convergence
    'C': [0.01, 0.1, 1, 10, 100],  # Try different regularization strengths
    'solver': ['lbfgs', 'saga', 'liblinear']  # Try different solvers ('liblinear' cannot fit 'multinomial'; those combinations fail to fit and score NaN)
}

# Initialize Logistic Regression model
log_reg = LogisticRegression()

# Perform GridSearchCV based on 'weighted F1-score'
grid_search = GridSearchCV(log_reg, param_grid, cv=3, scoring='f1_weighted', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Extract best model
best_log_reg = grid_search.best_estimator_

# Display best hyperparameters
print(f"Best Hyperparameters for Logistic Regression: {grid_search.best_params_}")


- **Re-train** Logistic Regression with the best hyperparameters.

In [None]:
# After GridSearchCV has finished
# Retrain Logistic Regression with the best hyperparameters
clf_lr = LogisticRegression(
    multi_class=grid_search.best_params_['multi_class'],
    class_weight=grid_search.best_params_['class_weight'],
    max_iter=grid_search.best_params_['max_iter'],
    C=grid_search.best_params_['C'],
    solver=grid_search.best_params_['solver']
)
clf_lr.fit(X_train, y_train)


- **Evaluate** Logistic Regression with the best hyperparameters.

In [None]:
# Evaluate Logistic Regression model
y_pred = clf_lr.predict(X_test)  # Make predictions on the test set

# Print the classification report
print("Classification Report for Logistic Regression:")
print(classification_report(y_test, y_pred))

# Optionally, calculate other evaluation metrics like F1-Score, Accuracy
f1 = f1_score(y_test, y_pred, average='weighted')
accuracy = clf_lr.score(X_test, y_test)

# Display the evaluation metrics
print(f"Weighted F1-Score: {f1:.4f}")
print(f"Accuracy: {accuracy:.4f}")


- **Save** the best Logistic Regression model.

In [None]:
# Save the trained Logistic Regression model to the appropriate path in the models/text directory
with open(os.path.join(config.TEXT_MODELS_DIR, "best_lr_model.pkl"), "wb") as f:
    pickle.dump(best_log_reg, f)

print("Best Logistic Regression model saved.")

### 3.2 Support Vector Machines (SVM)

To optimize the **Support Vector Machines (SVM)** model, we perform a **GridSearchCV** on key hyperparameters, aiming to maximize the **weighted F1-score** to effectively handle class imbalances:

- `C`: Regularization parameter, controlling the trade-off between achieving a low error on the training set and minimizing the model complexity.
- `kernel`: Specifies the kernel type to use in the algorithm (e.g., `linear`, `rbf`, `poly`, or `sigmoid`).
- `class_weight`: Balances the classes for imbalanced data (`balanced`, `None`).
- `max_iter`: Controls the maximum number of iterations for convergence.
- `gamma`: Defines the kernel coefficient for `rbf`, `poly`, and `sigmoid` kernels. It can impact model flexibility and performance.
- `degree`: Defines the degree of the polynomial kernel function (relevant only if `kernel='poly'`).
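
Since `gamma` only affects the `rbf`, `poly`, and `sigmoid` kernels, a flat grid refits the linear kernel once per `gamma` value without changing the result. The sketch below counts candidates with `ParameterGrid` for a flat grid and for a list of per-kernel grids (which `GridSearchCV` also accepts); the grid values mirror the ones used in this section:

```python
from sklearn.model_selection import ParameterGrid

# Flat grid: every kernel is combined with every gamma, including 'linear'
flat_grid = {
    "C": [1, 10],
    "kernel": ["linear", "rbf", "poly"],
    "gamma": [0.01, 0.1, "scale"],
}

# Split grid: gamma is only searched for kernels that actually use it
split_grid = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf", "poly"], "C": [1, 10], "gamma": [0.01, 0.1, "scale"]},
]

n_flat = len(ParameterGrid(flat_grid))    # 2 * 3 * 3 candidates
n_split = len(ParameterGrid(split_grid))  # 2 + 2 * 2 * 3 candidates

print(f"Flat grid candidates:  {n_flat}")
print(f"Split grid candidates: {n_split}")
```

With 3-fold cross-validation, each removed candidate saves three SVM fits, which matters on a dataset of this size.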


- **GridSearchCV** for SVM

In [None]:
# Define a simplified parameter grid for SVM with focus on 'gamma', 'kernel' and 'C'
param_grid = {
    'C': [1, 10],  # Regularization parameter (simplified)
    'kernel': ['linear', 'rbf', 'poly'],  # Focus on three kernel types
    'gamma': [0.01, 0.1, 'scale'],  # Test simple gamma values
}

# Initialize the SVM model
svm = SVC()

# Perform GridSearchCV based on 'weighted F1-score'
grid_search = GridSearchCV(svm, param_grid, cv=3, scoring='f1_weighted', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Extract best model
best_svm = grid_search.best_estimator_

# Display best hyperparameters
print(f"Best Hyperparameters for SVM: {grid_search.best_params_}")


- **Re-train** SVM with the best hyperparameters.

In [None]:
# After GridSearchCV has finished
# Retrain SVM with the best hyperparameters
best_svm = SVC(
    C=grid_search.best_params_['C'],
    kernel=grid_search.best_params_['kernel'],
    gamma=grid_search.best_params_['gamma']
)
best_svm.fit(X_train, y_train)

# If you want to evaluate the model (optional)
# y_pred = best_svm.predict(X_test)
# print(classification_report(y_test, y_pred))

print(f"Re-trained SVM model with the best hyperparameters: {grid_search.best_params_}")


- **Evaluate** SVM with the best hyperparameters.

In [None]:
# Evaluate SVM model
y_pred = best_svm.predict(X_test)  # Make predictions on the test set

# Print the classification report
print("Classification Report for SVM:")
print(classification_report(y_test, y_pred))

# Optionally, calculate other evaluation metrics like F1-Score, Accuracy
f1 = f1_score(y_test, y_pred, average='weighted')
accuracy = best_svm.score(X_test, y_test)

# Display the evaluation metrics
print(f"Weighted F1-Score: {f1:.4f}")
print(f"Accuracy: {accuracy:.4f}")


- **Save** the best SVM model.

In [None]:
# Save the trained SVM model to the appropriate path in the models/text directory
with open(os.path.join(config.TEXT_MODELS_DIR, "best_svm_model.pkl"), "wb") as f:
    pickle.dump(best_svm, f)

print("Best SVM model saved.")

### 3.3 Random Forest Classifier

To optimize the **Random Forest Classifier** model, we perform a **GridSearchCV** on key hyperparameters, aiming to maximize the **weighted F1-score** to effectively handle class imbalances:

- `n_estimators`: The number of trees in the forest. More trees usually improve model performance, but also increase computation time.
- `max_depth`: The maximum depth of the trees. Limiting the depth can prevent overfitting.
- `min_samples_split`: The minimum number of samples required to split an internal node. This can control overfitting by setting higher values.
- `min_samples_leaf`: The minimum number of samples required to be at a leaf node. A higher value can smooth the model.
- `class_weight`: Balances the classes for imbalanced data (`balanced`, `None`).
- `max_features`: The number of features to consider when looking for the best split. This can impact model performance and speed.
- `bootstrap`: Whether bootstrap samples are used when building trees.
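
To show what `class_weight='balanced'` does in practice, this minimal sketch uses hypothetical toy labels and scikit-learn's `compute_class_weight` helper. Each class receives the weight `n_samples / (n_classes * class_count)`:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 8 samples of class 0, 2 samples of class 1
y_demo = np.array([0] * 8 + [1] * 2)
classes = np.unique(y_demo)

# Weights that scikit-learn derives for class_weight='balanced'
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# Same formula written out by hand
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

for label, weight in zip(classes, weights):
    print(f"class {label}: weight = {weight:.3f}")
```

Rare classes get larger weights, so their errors count more during training. Whether this improves the weighted F1-score depends on the data, which is why both `'balanced'` and `None` are included in the grids.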

- **GridSearchCV** for Random Forest

In [None]:
# Define a simplified parameter grid for Random Forest
param_grid = {
    'n_estimators': [100, 200],  # Number of trees in the forest (simplified range)
    'min_samples_split': [2, 4, 5],  # Minimum samples required to split a node (simplified)
    'max_features': ['sqrt', 'log2'],  # Features considered at each split ('auto' was an alias of 'sqrt' and is rejected by recent scikit-learn releases)
    'class_weight': ['balanced', None],  # Handle class imbalance
}

# Initialize the Random Forest model
rf = RandomForestClassifier()

# Perform GridSearchCV based on 'weighted F1-score'
grid_search = GridSearchCV(rf, param_grid, cv=3, scoring='f1_weighted', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Extract best model
best_rf = grid_search.best_estimator_

# Display best hyperparameters
print(f"Best Hyperparameters for Random Forest: {grid_search.best_params_}")


- **Re-train** Random Forest with the best hyperparameters.

In [None]:
# After GridSearchCV has finished
# Retrain Random Forest with the best hyperparameters
best_rf = RandomForestClassifier(
    n_estimators=grid_search.best_params_['n_estimators'],
    min_samples_split=grid_search.best_params_['min_samples_split'],
    max_features=grid_search.best_params_['max_features'],
    class_weight=grid_search.best_params_['class_weight']
)

best_rf.fit(X_train, y_train)

# If you want to evaluate the model (optional)
# y_pred = best_rf.predict(X_test)
# print(classification_report(y_test, y_pred))

print(f"Re-trained Random Forest model with the best hyperparameters: {grid_search.best_params_}")


- **Evaluate** Random Forest with the best hyperparameters.

In [None]:
# Evaluate Random Forest model
y_pred = best_rf.predict(X_test)  # Make predictions on the test set

# Print the classification report
print("Classification Report for Random Forest:")
print(classification_report(y_test, y_pred))

# Optionally, calculate other evaluation metrics like F1-Score, Accuracy
f1 = f1_score(y_test, y_pred, average='weighted')
accuracy = best_rf.score(X_test, y_test)

# Display the evaluation metrics
print(f"Weighted F1-Score: {f1:.4f}")
print(f"Accuracy: {accuracy:.4f}")


- **Save** the best Random Forest model.

In [None]:
# Save the trained Random Forest model to the appropriate path in the models/text directory
with open(os.path.join(config.TEXT_MODELS_DIR, "best_rf_model.pkl"), "wb") as f:
    pickle.dump(best_rf, f)

print("Best Random Forest model saved.")


### 3.4 K-Neighbors Classifier

To optimize the **K-Neighbors Classifier** model, we perform a **GridSearchCV** on key hyperparameters, aiming to maximize the **weighted F1-score** to effectively handle class imbalances:

- `n_neighbors`: The number of neighbors used for k-nearest neighbors classification. Larger values smooth the decision boundary and reduce overfitting, at the risk of underfitting.
- `weights`: Function used to weight the points in the neighborhood. Options include `uniform` (all points are weighted equally) or `distance` (closer points have more influence).
- `algorithm`: The algorithm used to compute the nearest neighbors (`auto`, `ball_tree`, `kd_tree`, `brute`).
- `leaf_size`: The leaf size for the `ball_tree` and `kd_tree` algorithms. It affects the speed of tree construction and queries, as well as memory use.
- `p`: The power parameter for the Minkowski distance. When `p=2`, this corresponds to the Euclidean distance.
- `metric`: The distance metric to use. Can be `minkowski`, `manhattan`, `chebyshev`, or `cosine`.
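
One point worth illustrating about the distance metric: if the TF-IDF rows are L2-normalized (the default of `TfidfVectorizer`, though this notebook does not confirm how the matrix was built), Euclidean distance and cosine similarity rank neighbors identically, because `||a - b||^2 = 2 - 2 * cos(a, b)` for unit vectors. A minimal sketch on hypothetical documents:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Hypothetical product descriptions (not taken from the project data)
docs = [
    "wooden chess board game",
    "plastic toy car for children",
    "children board game with cards",
    "garden table and wooden chairs",
]

# TfidfVectorizer applies L2 normalization by default (norm='l2')
tfidf = TfidfVectorizer().fit_transform(docs)

cos = cosine_similarity(tfidf)
euc = euclidean_distances(tfidf)

# For unit-length rows: squared Euclidean distance = 2 - 2 * cosine similarity
identity_holds = bool(np.allclose(euc ** 2, 2 - 2 * cos, atol=1e-8))
print(f"Identity holds for all pairs: {identity_holds}")
```

Under that assumption, the default Minkowski metric with `p=2` already behaves like a cosine-based neighbor search, so adding `cosine` to the grid would not change the neighbor ranking.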


- **GridSearchCV** for K-Neighbors

In [None]:
# Define a simplified parameter grid for K-Neighbors
param_grid = {
    'n_neighbors': [2, 3, 5, 7, 27],  # Added n_neighbors = 2 and 27
    'weights': ['uniform', 'distance'],  # Weight function for neighbors
    'algorithm': ['auto', 'ball_tree'],  # Simplified algorithms to compute nearest neighbors
}

# Initialize the K-Neighbors Classifier
knn = KNeighborsClassifier()

# Perform GridSearchCV based on 'weighted F1-score'
grid_search = GridSearchCV(knn, param_grid, cv=3, scoring='f1_weighted', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Extract best model
best_knn = grid_search.best_estimator_

# Display best hyperparameters
print(f"Best Hyperparameters for K-Neighbors Classifier: {grid_search.best_params_}")


- **Re-train** K-Neighbors Classifier with the best hyperparameters.

In [None]:
# After GridSearchCV has finished
# Retrain K-Neighbors Classifier with the best hyperparameters
best_knn = KNeighborsClassifier(
    n_neighbors=grid_search.best_params_['n_neighbors'],
    weights=grid_search.best_params_['weights'],
    algorithm=grid_search.best_params_['algorithm']
)

best_knn.fit(X_train, y_train)

# If you want to evaluate the model (optional)
# y_pred = best_knn.predict(X_test)
# print(classification_report(y_test, y_pred))

print(f"Re-trained K-Neighbors Classifier model with the best hyperparameters: {grid_search.best_params_}")


- **Evaluate** K-Neighbors Classifier with the best hyperparameters.

In [None]:
# Evaluate K-Neighbors Classifier model
y_pred = best_knn.predict(X_test)  # Make predictions on the test set

# Print the classification report
print("Classification Report for K-Neighbors Classifier:")
print(classification_report(y_test, y_pred))

# Optionally, calculate other evaluation metrics like F1-Score, Accuracy
f1 = f1_score(y_test, y_pred, average='weighted')
accuracy = best_knn.score(X_test, y_test)

# Display the evaluation metrics
print(f"Weighted F1-Score: {f1:.4f}")
print(f"Accuracy: {accuracy:.4f}")


- **Save** the best K-Neighbors model.

In [None]:
# Save the trained K-Neighbors model to the appropriate path in the models/text directory
with open(os.path.join(config.TEXT_MODELS_DIR, "best_knn_model.pkl"), "wb") as f:
    pickle.dump(best_knn, f)

print("Best K-Neighbors model saved.")


### 3.5 Decision Tree Classifier

To optimize the **Decision Tree Classifier** model, we perform a **GridSearchCV** on key hyperparameters, aiming to maximize the **weighted F1-score** to effectively handle class imbalances:

- `max_depth`: The maximum depth of the tree. Limiting the depth helps to prevent overfitting.
- `min_samples_split`: The minimum number of samples required to split an internal node. This can control overfitting.
- `min_samples_leaf`: The minimum number of samples required to be at a leaf node. A higher value can smooth the model.
- `max_features`: The number of features to consider when looking for the best split. This can impact model performance and speed.
- `criterion`: The function to measure the quality of a split. Common options include `gini` (Gini impurity) and `entropy` (information gain).
- `class_weight`: Balances the classes for imbalanced data (`balanced`, `None`).
- `splitter`: The strategy used to split at each node. Options include `best` (best split) or `random` (random split).
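
The effect of the depth and leaf-size limits can be sketched on synthetic data (generated with `make_classification`, not the project features). An unconstrained tree keeps splitting until it fits the training set, while a constrained tree stays small:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem (hypothetical stand-in for the real features)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=8, n_classes=3, random_state=0
)

# Unconstrained tree versus a tree limited by max_depth and min_samples_leaf
full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
shallow_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=10, random_state=0
).fit(X_demo, y_demo)

for name, tree in [("full", full_tree), ("shallow", shallow_tree)]:
    print(
        f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
        f"train accuracy={tree.score(X_demo, y_demo):.3f}"
    )
```

High training accuracy for the unconstrained tree is a symptom of memorization rather than a sign of quality, which is why the grid search scores candidates on cross-validation folds.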

- **GridSearchCV** for Decision Tree

In [None]:
# Define a simplified parameter grid for Decision Tree Classifier with focus on 'max_features' and 'min_samples_split'
param_grid = {
    'max_depth': [None, 10],  # Simplified range for maximum depth
    'min_samples_split': [2, 5],  # Minimum samples required to split a node
    'max_features': ['sqrt', 'log2'],  # Features considered at each split ('auto' was an alias of 'sqrt' and is rejected by recent scikit-learn releases)
    'class_weight': ['balanced', None],  # Handle class imbalance
    'criterion': ['gini'],  # We can keep only 'gini' for simplicity
}

# Initialize the Decision Tree model
dt = DecisionTreeClassifier()

# Perform GridSearchCV based on 'weighted F1-score'
grid_search = GridSearchCV(dt, param_grid, cv=3, scoring='f1_weighted', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Extract best model
best_dt = grid_search.best_estimator_

# Display best hyperparameters
print(f"Best Hyperparameters for Decision Tree: {grid_search.best_params_}")


- **Re-train** Decision Tree with the best hyperparameters.

In [None]:
# After GridSearchCV has finished
# Retrain Decision Tree with the best hyperparameters
best_dt = DecisionTreeClassifier(
    max_depth=grid_search.best_params_['max_depth'],
    min_samples_split=grid_search.best_params_['min_samples_split'],
    max_features=grid_search.best_params_['max_features'],
    class_weight=grid_search.best_params_['class_weight'],
    criterion=grid_search.best_params_['criterion']
)

best_dt.fit(X_train, y_train)

# If you want to evaluate the model (optional)
# y_pred = best_dt.predict(X_test)
# print(classification_report(y_test, y_pred))

print(f"Re-trained Decision Tree model with the best hyperparameters: {grid_search.best_params_}")


- **Evaluate** Decision Tree with the best hyperparameters.

In [None]:
# Evaluate Decision Tree model
y_pred = best_dt.predict(X_test)  # Make predictions on the test set

# Print the classification report
print("Classification Report for Decision Tree:")
print(classification_report(y_test, y_pred))

# Optionally, calculate other evaluation metrics like F1-Score, Accuracy
f1 = f1_score(y_test, y_pred, average='weighted')
accuracy = best_dt.score(X_test, y_test)

# Display the evaluation metrics
print(f"Weighted F1-Score: {f1:.4f}")
print(f"Accuracy: {accuracy:.4f}")


- **Save** the best Decision Tree model.

In [None]:
# Save the trained Decision Tree model to the appropriate path in the models/text directory
with open(os.path.join(config.TEXT_MODELS_DIR, "best_dt_model.pkl"), "wb") as f:
    pickle.dump(best_dt, f)

print("Best Decision Tree model saved.")


### 3.6 XGBoost

To optimize the **XGBoost** model, we perform a **GridSearchCV** on key hyperparameters, aiming to maximize the **weighted F1-score** to effectively handle class imbalances:

- `n_estimators`: The number of boosting rounds (trees) to build. More trees usually improve performance but increase computation time.
- `learning_rate`: The rate at which the model learns. Smaller values make the model more robust but require more trees.
- `max_depth`: The maximum depth of the decision trees. Increasing this can make the model more complex, potentially leading to overfitting.
- `min_child_weight`: The minimum sum of instance weight (hessian) needed in a child. It can help control overfitting.
- `subsample`: The fraction of samples used for fitting each tree. A lower value can help reduce overfitting.
- `colsample_bytree`: The fraction of features to use for building each tree. It can help control overfitting.
- `gamma`: The minimum loss reduction required to make a further partition. It helps control the complexity of the model.
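
The roles of `gamma` and `min_child_weight` can be read off the split gain that XGBoost evaluates for each candidate split. In the formula below, $G_L, G_R$ are the sums of gradients and $H_L, H_R$ the sums of hessians in the left and right child, and $\lambda$ is the L2 regularization term (`reg_lambda`, a parameter not tuned in this notebook):

```latex
\mathrm{Gain} = \frac{1}{2}\left[
    \frac{G_L^2}{H_L + \lambda}
  + \frac{G_R^2}{H_R + \lambda}
  - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}
\right] - \gamma
```

A split is kept only if this gain is positive, so a larger `gamma` prunes weak splits. Independently, `min_child_weight` rejects any split for which $H_L$ or $H_R$ falls below the threshold.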

- **GridSearchCV** for XGBoost

In [None]:
# Define a simplified parameter grid for XGBoost with focus on the most important hyperparameters
param_grid = {
    'n_estimators': [100, 200],  # Number of boosting rounds (simplified range)
    'learning_rate': [0.05, 0.1],  # Learning rate (step size)
    'max_depth': [3, 5],  # Maximum depth of the trees
    'subsample': [0.8, 1.0],  # Fraction of samples to use for fitting each tree
}

# Initialize the XGBoost model with use_label_encoder=False to suppress warnings
xgb = XGBClassifier(use_label_encoder=False)

# Perform GridSearchCV based on 'weighted F1-score'
grid_search = GridSearchCV(xgb, param_grid, cv=3, scoring='f1_weighted', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Extract best model
best_xgb = grid_search.best_estimator_

# Display best hyperparameters
print(f"Best Hyperparameters for XGBoost: {grid_search.best_params_}")


- **Re-train** XGBoost with the best hyperparameters.

In [None]:
# After GridSearchCV has finished
# Retrain XGBoost with the best hyperparameters
best_xgb = XGBClassifier(
    n_estimators=grid_search.best_params_['n_estimators'],
    learning_rate=grid_search.best_params_['learning_rate'],
    max_depth=grid_search.best_params_['max_depth'],
    subsample=grid_search.best_params_['subsample'],
    use_label_encoder=False  # Ensure to suppress the warning related to label encoding
)

best_xgb.fit(X_train, y_train)

# If you want to evaluate the model (optional)
# y_pred = best_xgb.predict(X_test)
# print(classification_report(y_test, y_pred))

print(f"Re-trained XGBoost model with the best hyperparameters: {grid_search.best_params_}")


- **Evaluate** XGBoost with the best hyperparameters.

In [None]:
# Evaluate XGBoost model
y_pred = best_xgb.predict(X_test)  # Make predictions on the test set

# Print the classification report
print("Classification Report for XGBoost:")
print(classification_report(y_test, y_pred))

# Optionally, calculate other evaluation metrics like F1-Score, Accuracy
f1 = f1_score(y_test, y_pred, average='weighted')
accuracy = best_xgb.score(X_test, y_test)

# Display the evaluation metrics
print(f"Weighted F1-Score: {f1:.4f}")
print(f"Accuracy: {accuracy:.4f}")


- **Save** the best XGBoost model.

In [None]:
# Save the trained XGBoost model to the appropriate path in the models/text directory
with open(os.path.join(config.TEXT_MODELS_DIR, "best_xgb_model.pkl"), "wb") as f:
    pickle.dump(best_xgb, f)

print("Best XGBoost model saved.")


### 3.7 Linear SVC

To optimize the **Linear SVC** model, we perform a **GridSearchCV** on key hyperparameters, aiming to maximize the **weighted F1-score** to effectively handle class imbalances:

- `C`: Regularization parameter. It controls the trade-off between achieving a low error on the training data and minimizing the model complexity. A lower `C` encourages a simpler model.
- `max_iter`: The maximum number of iterations for the solver to converge. Higher values allow the model to converge better, especially for complex data.
- `class_weight`: Balances the classes for imbalanced data (`balanced`, `None`).
- `penalty`: Specifies the norm used in the penalization. Can be `'l1'` or `'l2'`. `l2` is the default and commonly used for Linear SVC.
- `dual`: A boolean flag to choose between the primal and the dual formulation of the optimization problem. Prefer `dual=False` when the number of samples exceeds the number of features, which is the case here.
- `tol`: The tolerance for stopping criteria. Lower values result in higher precision but longer computation time.

- **GridSearchCV** for Linear SVC

In [None]:
# Define a simplified parameter grid for Linear SVC with focus on 'C', 'max_iter', and 'tol'
param_grid = {
    'C': [0.1, 1, 10],  # Regularization parameter (simplified)
    'max_iter': [2000, 3000],  # Maximum number of iterations for solver to converge (simplified range)
    'class_weight': ['balanced', None],  # Handle class imbalance
    'penalty': ['l2'],  # Penalty type (only 'l2' as specified)
    'dual': [False],  # Prefer dual=False when n_samples > n_features (here 67,932 samples vs. 5,000 features)
    'tol': [1e-4, 1e-5]  # Tolerance for stopping criteria (simplified range)
}

# Initialize the Linear SVC model
linear_svc = LinearSVC()

# Perform GridSearchCV based on 'weighted F1-score'
grid_search = GridSearchCV(linear_svc, param_grid, cv=3, scoring='f1_weighted', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Extract best model
best_lsvc = grid_search.best_estimator_

# Display best hyperparameters
print(f"Best Hyperparameters for Linear SVC: {grid_search.best_params_}")


- **Re-train** Linear SVC with the best hyperparameters.

In [None]:
# After GridSearchCV has finished
# Retrain Linear SVC with the best hyperparameters
best_lsvc = LinearSVC(
    C=grid_search.best_params_['C'],
    max_iter=grid_search.best_params_['max_iter'],
    class_weight=grid_search.best_params_['class_weight'],
    penalty=grid_search.best_params_['penalty'],
    dual=grid_search.best_params_['dual'],
    tol=grid_search.best_params_['tol']
)

best_lsvc.fit(X_train, y_train)

# If you want to evaluate the model (optional)
# y_pred = best_lsvc.predict(X_test)
# print(classification_report(y_test, y_pred))

print(f"Re-trained Linear SVC model with the best hyperparameters: {grid_search.best_params_}")


- **Evaluate** Linear SVC with the best hyperparameters.

In [None]:
# Evaluate Linear SVC model
y_pred = best_lsvc.predict(X_test)  # Make predictions on the test set

# Print the classification report
print("Classification Report for Linear SVC:")
print(classification_report(y_test, y_pred))

# Optionally, calculate other evaluation metrics like F1-Score, Accuracy
f1 = f1_score(y_test, y_pred, average='weighted')
accuracy = best_lsvc.score(X_test, y_test)

# Display the evaluation metrics
print(f"Weighted F1-Score: {f1:.4f}")
print(f"Accuracy: {accuracy:.4f}")


- **Save** the best Linear SVC model.

In [None]:
# Save the trained Linear SVC model to the appropriate path in the models/text directory
with open(os.path.join(config.TEXT_MODELS_DIR, "best_lsvc_model.pkl"), "wb") as f:
    pickle.dump(best_lsvc, f)

print("Best Linear SVC model saved.")

## 4. Creating a Voting Classifier
- **Combine** all the best models trained in the previous steps using a **Voting Classifier**.
- Use **soft voting** only if every model supports `predict_proba()`; otherwise fall back to **hard voting**.
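
As a minimal sketch of that compatibility check: soft voting averages predicted probabilities, so it needs `predict_proba()` on every estimator, not just on some of them. The helper name `pick_voting` is hypothetical, and wrapping `LinearSVC` in `CalibratedClassifierCV` is shown as one possible way to obtain probabilities, not as the approach used in this notebook:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC


def pick_voting(estimators):
    """Return 'soft' only if every estimator exposes predict_proba, else 'hard'."""
    supports_proba = all(hasattr(est, "predict_proba") for _, est in estimators)
    return "soft" if supports_proba else "hard"


# LinearSVC has no predict_proba, so this ensemble must use hard voting
mixed = [("lr", LogisticRegression()), ("lsvc", LinearSVC())]

# Calibration adds predict_proba to the margin-based model
calibrated = [
    ("lr", LogisticRegression()),
    ("lsvc", CalibratedClassifierCV(LinearSVC())),
]

mode_mixed = pick_voting(mixed)
mode_calibrated = pick_voting(calibrated)

print(f"Mixed ensemble: {mode_mixed} voting")
print(f"Calibrated ensemble: {mode_calibrated} voting")
```

Checking with `any()` instead of `all()` would select soft voting for the mixed ensemble, and fitting or predicting would then fail on the estimator without probabilities.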

In [None]:
# List of the best models from the previous steps
models = [
    ('lr', best_log_reg),  # Logistic Regression (best estimator from the grid search in section 3.1)
    ('svm', best_svm),  # Support Vector Machine
    ('rf', best_rf),  # Random Forest
    ('knn', best_knn),  # K-Neighbors Classifier
    ('dt', best_dt),  # Decision Tree
    ('xgb', best_xgb),  # XGBoost
    ('lsvc', best_lsvc)  # Linear SVC
]

# Soft voting averages class probabilities, so every model must support predict_proba()
if all(hasattr(model, 'predict_proba') for _, model in models):
    voting_clf = VotingClassifier(estimators=models, voting='soft')  # Soft voting
else:
    voting_clf = VotingClassifier(estimators=models, voting='hard')  # Hard voting (majority vote)

# Train the Voting Classifier
voting_clf.fit(X_train, y_train)

# Optionally, save the Voting Classifier
with open(os.path.join(config.MODELS_DIR, "voting_clf_model.pkl"), "wb") as f:
    pickle.dump(voting_clf, f)

print(f"Voting Classifier trained with {'soft' if voting_clf.voting == 'soft' else 'hard'} voting and saved.")


- **Evaluate** Voting Classifier with the best models.

In [None]:
# Evaluate Voting Classifier model
y_pred = voting_clf.predict(X_test)  # Make predictions on the test set

# Print the classification report
print("Classification Report for Voting Classifier:")
print(classification_report(y_test, y_pred))

# Optionally, calculate other evaluation metrics like F1-Score, Accuracy
f1 = f1_score(y_test, y_pred, average='weighted')
accuracy = voting_clf.score(X_test, y_test)

# Display the evaluation metrics
print(f"Weighted F1-Score: {f1:.4f}")
print(f"Accuracy: {accuracy:.4f}")


## 5. Model Comparison and Selection
- **Compare** the performance of each individual model and the **Voting Classifier** using metrics like the **weighted F1-score**.
- Select the best-performing model for further use or future deployment.

In [None]:
from sklearn.metrics import classification_report, f1_score

# Create a dictionary mapping each model name to its fitted model
models = {
    'Logistic Regression': best_log_reg,
    'SVM': best_svm,
    'Random Forest': best_rf,
    'K-Neighbors': best_knn,
    'Decision Tree': best_dt,
    'XGBoost': best_xgb,
    'Linear SVC': best_lsvc,
    'Voting Classifier': voting_clf  # The Voting Classifier
}

# Initialize an empty dictionary to store the weighted F1 scores
f1_scores = {}

# Compare the performance of each model on the test set
for model_name, model in models.items():
    # Make predictions using the model
    y_pred = model.predict(X_test)
    
    # Calculate the weighted F1-score for the model
    f1 = f1_score(y_test, y_pred, average='weighted')
    f1_scores[model_name] = f1

    # Print the classification report for each model
    print(f"Classification Report for {model_name}:\n")
    print(classification_report(y_test, y_pred))

# Select the best-performing model based on the weighted F1-score
best_model_name = max(f1_scores, key=f1_scores.get)
best_model = models[best_model_name]

print(f"\nBest model based on weighted F1-score: {best_model_name} with a score of {f1_scores[best_model_name]:.4f}")


### Display the Mapping Between Encoded Labels and Original Classes
Before saving the best model, let's display the correspondence between the **encoded labels** (0-26) and their **original classes**. This will help us understand the mapping of the product categories in the context of our model's predictions.
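
For readers unfamiliar with label encoding, here is a minimal sketch of how such a mapping typically arises. It assumes the labels were encoded with scikit-learn's `LabelEncoder`, which this notebook does not confirm, and the product type codes are hypothetical examples:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical original product type codes
original_codes = np.array([2280, 10, 50, 40, 10])

# LabelEncoder sorts the distinct codes and assigns 0..n_classes-1
encoder = LabelEncoder()
encoded = encoder.fit_transform(original_codes)

# inverse_transform maps encoded predictions back to the original codes
decoded = encoder.inverse_transform(encoded)

print(f"Classes (sorted): {encoder.classes_}")
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")
```

The saved mapping DataFrame plays the same role as `inverse_transform` here: it lets us report results in terms of the original categories.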

In [None]:
print(config.PRDTYPECODE_MAPPING_PATH)

In [None]:
# Load the label mapping from the pickle file (Encoded Label → Original Class)
with open(config.PRDTYPECODE_MAPPING_PATH, 'rb') as f:
    prdtypecode_mapping = pickle.load(f)

# The loaded object is already a DataFrame with three columns: "Original prdtypecode", "Encoded target", and "Label"
mapping_df = prdtypecode_mapping

# Display the first 5 rows of the DataFrame after loading
print("Label mapping loaded from pickle file:")
print(mapping_df.head())

# Predict using the best model (selected based on weighted F1-score)
y_pred_best_model = best_model.predict(X_test)  # Make predictions with the best model

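# Create a DataFrame to display the classification report and results for each class
classification_results = classification_report(y_test, y_pred_best_model, output_dict=True)
classification_results_df = pd.DataFrame(classification_results).transpose()

# Keep only the per-class rows: the report index holds class labels as strings, plus
# summary rows ('accuracy', 'macro avg', 'weighted avg') that have no entry in the mapping
classification_results_df = classification_results_df.drop(
    index=["accuracy", "macro avg", "weighted avg"], errors="ignore"
)
classification_results_df.index = classification_results_df.index.astype(int)  # Assumes "Encoded target" holds integers

# Merge the classification results with the mapping
classification_results_with_mapping = pd.merge(classification_results_df, mapping_df, 
                                               left_index=True, right_on="Encoded target", how="left")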

# Display the table showing classification results and the corresponding classes
print("Classification Report with Encoded Labels and Their Original Classes:\n")
print(classification_results_with_mapping)


## 6. Saving the Best Model 
- **Save** the best model (Voting Classifier or other) for future use and potential deployment.
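
Before relying on a saved model, it can be useful to check that it survives a save/load round trip. The sketch below does this with a small model trained on synthetic data and a temporary directory; the file name `demo_model.pkl` is hypothetical:

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic problem (hypothetical stand-in for the real model and data)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_model.pkl"

    # Save the fitted model
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Load it back from disk
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly
same_predictions = bool(
    np.array_equal(model.predict(X_demo), restored.predict(X_demo))
)
print(f"Predictions identical after reload: {same_predictions}")
```

Pickled models are generally tied to the library versions used to create them, so recording the scikit-learn and XGBoost versions next to the file is a sensible precaution.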

In [None]:
# Save the best model (either Voting Classifier or another model)
best_model_path = os.path.join(config.MODELS_DIR, "best_model.pkl")

with open(best_model_path, "wb") as f:
    pickle.dump(best_model, f)

print(f"The best model has been saved at: {best_model_path}")


## 7. 🔄 Next Steps