# Mirakl Technical Test

## Introduction:

On an e-commerce site, it is crucial that products are categorized correctly. Incorrect categorization can lead to products being invisible to buyers and create a poor user experience, especially when large quantities of products are misplaced in the wrong category.

<u>Guidelines:</u>
- Analyze the data and describe some possible classification algorithms for such a dataset.
- Choose and implement at least one of these algorithms using the training set (data_train.csv) and perform a prediction on the provided test set (data_test.csv). Specify the chosen metrics. Explain your steps.
- Conduct a complete analysis of the obtained results.
- Do not use the category hierarchy as an explanatory variable.

<u>Objectives:</u>
- A structured and concise presentation
- Insightful data analysis and relevant modeling

<u>Summary:</u>
1. Exploratory Data Analysis (EDA)
2. Data Preprocessing and Dimensionality Reduction Techniques
3. Evaluation of Classification Algorithms and Linear Models
4. Final Model Performance Evaluation and Next Steps
5. Analysis of Sibling Misclassifications and Parent Category Performance
6. Conclusion

<u>Instructions:</u>
- Data files must be stored in a "data" directory

## 1. Exploratory Data Analysis (EDA)

### Insights:

<u>Dataset Size:</u> 
- The dataset contains 241,483 entries, which is large enough to fit simple models with few parameters; such models should suffice here.
  
<u>Features:</u>
- All 128 features are numerical and of type float, with no missing values. This reduces the need for complex preprocessing and extensive feature engineering.
- High dimensionality due to the 128 features suggests that dimensionality reduction could be beneficial before applying certain algorithms.
    
<u>Feature Correlation:</u>
- The features are pairwise almost uncorrelated, so multicollinearity is not a concern and simple linear models are a reasonable starting point. Pairwise correlation does not rule out interactions between features, however.
    
<u>Product Categories:</u> 
- There are 101 product categories, which are identical in both the train and test sets.
    
<u>Category Distribution:</u>
- The distribution is reasonably similar between the train and test sets, but the problem is highly imbalanced, with up to a 10x difference in category occurrences.
- To avoid bias towards the majority classes, techniques such as class weighting, undersampling, or oversampling (e.g., SMOTE, the Synthetic Minority Over-sampling Technique) can be considered.
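
As an illustration of class weighting, here is a minimal pure-Python sketch of the heuristic behind scikit-learn's `class_weight="balanced"` option, `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` and the toy labels are assumptions made for this example only.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy labels with a 10x imbalance between classes "a" and "c" (illustrative data)
toy_labels = ["a"] * 100 + ["b"] * 40 + ["c"] * 10
weights = balanced_class_weights(toy_labels)

for cls in sorted(weights):
    print(cls, weights[cls])
```

The rarest class receives the largest weight, so each class contributes equally to the weighted loss regardless of its frequency.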

<u>Evaluation Metric:</u>
- Given the significant class imbalance, the F1-score is an appropriate evaluation metric. It balances precision and recall, which makes it effective when both false positives and false negatives are costly, as is the case here.
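
To make the effect of imbalance on the metric concrete, the sketch below computes per-class, macro-averaged and support-weighted F1 from scratch on toy data. It assumes macro averaging is the variant of interest (every class counts equally); the function names and labels are hypothetical.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """F1 per class, computed as 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = {}
    for cls in set(y_true) | set(y_pred):
        denom = 2 * tp[cls] + fp[cls] + fn[cls]
        scores[cls] = 2 * tp[cls] / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)

def weighted_f1(y_true, y_pred):
    """Mean of the per-class F1 scores weighted by true-class support."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in support) / len(y_true)

# A classifier that always predicts the majority class (toy data)
toy_true = ["big"] * 8 + ["small"] * 2
toy_pred = ["big"] * 10

toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
toy_macro = macro_f1(toy_true, toy_pred)
toy_weighted = weighted_f1(toy_true, toy_pred)
print(toy_accuracy, toy_macro, toy_weighted)
```

On this toy case the accuracy and the weighted F1 stay high while the macro F1 drops sharply, because the ignored minority class pulls the unweighted average down.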

<u>Remarks:</u> 
- Automatic EDA tools like "ydata_profiling" demand significant computational time, particularly for large or high-dimensional datasets like this one.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from pathlib import Path

# Path resolves to the platform-specific class (PosixPath or WindowsPath)
data_path: Path = Path.cwd() / "data"

category_parent_df : pd.DataFrame = pd.read_csv(data_path / "category_parent.csv")
test_df : pd.DataFrame = pd.read_csv(data_path / "data_test.csv.gz", compression='gzip')
train_df : pd.DataFrame = pd.read_csv(data_path / "data_train.csv.gz", compression='gzip')

In [2]:
def sanity_check(train_df: pd.DataFrame, test_df: pd.DataFrame):
    def print_df_summary(df, name):
        print(f"Summary for {name}:")
        print(f"Shape: {df.shape}")
        print(f"Data Types Count:\n{df.dtypes.value_counts()}")
        print(f"Max Missing Values (isna): {df.isna().sum(axis=0).max()}")
        print(f"Max Missing Values (isnull): {df.isnull().sum(axis=0).max()}\n")
    
    print_df_summary(train_df, 'Train DataFrame')
    print_df_summary(test_df, 'Test DataFrame')
    
    train_categories = set(train_df.category_id)
    test_categories = set(test_df.category_id)
    
    assert train_categories == test_categories, "Mismatch in categories between train and test DataFrames"
    print("Category consistency check passed: Train and Test categories are the same.")

def plot_correlation_matrix(df: pd.DataFrame, feature_start_col: int = 2) -> None:
    corr_matrix = df.iloc[:, feature_start_col:].corr()
    
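    # Mask the diagonal (self-correlation) by position before taking the maximum;
    # replace(1.0, None) can trigger pandas' fill behaviour instead of inserting NaN
    abs_corr = corr_matrix.abs()
    off_diagonal = abs_corr.index.to_numpy()[:, None] != abs_corr.columns.to_numpy()[None, :]
    max_corr = abs_corr.where(off_diagonal).max().max()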
    print("Maximum of correlation: ", max_corr)

    plt.figure(figsize=(12, 10))
    sns.heatmap(corr_matrix, cmap='coolwarm', vmin=-1, vmax=1, annot=False, fmt='.2f', square=True)
    plt.title("Correlation Matrix of Features")
    plt.show()

def plot_class_distribution_comparison(train_df: pd.DataFrame, test_df: pd.DataFrame) -> None:
    # Calculate normalized class distributions
    train_class_counts = train_df['category_id'].value_counts(normalize=True)
    test_class_counts = test_df['category_id'].value_counts(normalize=True)
    
    # Create a DataFrame for plotting
    melted_class_distribution = (
        pd.DataFrame({
            'category': train_class_counts.index,
            'train_percentage': train_class_counts.values * 100,
            'test_percentage': test_class_counts.reindex(train_class_counts.index, fill_value=0).values * 100
        })
        .sort_values(by='train_percentage', ascending=False)
        .melt(id_vars='category', 
              value_vars=['train_percentage', 'test_percentage'],
              var_name='Set', value_name='Percentage')
    )
    
    # Plot the class distribution comparison
    plt.figure(figsize=(14, 7))
    sns.barplot(x='category', y='Percentage', hue='Set', data=melted_class_distribution, palette={'train_percentage': 'blue', 'test_percentage': 'orange'})
    plt.xlabel('Category ID')
    plt.ylabel('Percentage')
    plt.title('Normalized Class Distribution Comparison Between Train and Test Set')
    plt.xticks(rotation=90)
    plt.legend(title='Data Set')
    plt.show()

In [None]:
sanity_check(train_df, test_df)
plot_correlation_matrix(train_df)
plot_class_distribution_comparison(train_df, test_df)

## 2. Data Preprocessing and Dimensionality Reduction Techniques

### Insights:

<u>Outliers:</u>
- Rows with more than 10 outlier features out of 128 (absolute z-score > 3) were removed, accounting for 2.91% of the training dataset.
- Analysis of row removal per category shows that no more than 10% of the data was removed for any category.

<u>Train, Validation & Test Split:</u>
- The training dataset was split into training and validation sets with a 0.8/0.2 ratio, ensuring stratification to maintain category distribution and random shuffling to prevent order bias.
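
What stratification guarantees can be sketched in pure Python: each class is shuffled and split separately, so the validation set keeps the class proportions of the full data. This is a simplified stand-in, not scikit-learn's implementation; `stratified_split` and the toy labels are assumptions.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction, seed):
    """Return (train_indices, val_indices), splitting each class separately."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        indices_by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in indices_by_class.values():
        rng.shuffle(indices)  # random shuffling prevents order bias
        n_val = round(len(indices) * test_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

# Toy labels with a 10x imbalance (illustrative data)
toy_split_labels = ["a"] * 100 + ["b"] * 10
train_idx, val_idx = stratified_split(toy_split_labels, test_fraction=0.2, seed=42)

val_counts = Counter(toy_split_labels[i] for i in val_idx)
print(val_counts)
```

Both classes keep their 10:1 ratio in the validation set, which a purely random split does not guarantee for rare categories.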

<u>Scaling:</u>
- Standardization is applied to ensure features are on a similar scale, which helps improve model performance and convergence.
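
A minimal sketch of standardization, assuming the usual convention that the mean and standard deviation are estimated on the training split only and then reused for held-out data (to avoid leaking information). The helper names and toy rows are hypothetical.

```python
from statistics import fmean, pstdev

def fit_standardizer(train_rows):
    """Estimate per-column mean and population standard deviation on training rows."""
    columns = list(zip(*train_rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard against constant columns
    return means, stds

def standardize(rows, means, stds):
    """Apply (value - mean) / std column by column."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

toy_train_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
toy_held_out_rows = [[4.0, 40.0]]

means, stds = fit_standardizer(toy_train_rows)
toy_train_scaled = standardize(toy_train_rows, means, stds)
toy_held_out_scaled = standardize(toy_held_out_rows, means, stds)  # reuses training statistics
print(toy_train_scaled)
print(toy_held_out_scaled)
```

After scaling, both training columns have zero mean and unit variance even though their raw scales differ by a factor of ten.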

<u>Dimensionality Reduction:</u>
- PCA: Initial PCA suggests features are nearly orthogonal: 120 principal components are needed to explain 95% of the variance, with similar contributions from each component.
- LassoCV: Removed 2 features, a small reduction intended to limit overfitting and improve generalization.

<u>Feature Interaction:</u>
- No polynomial features were created due to high dimensionality. Decision trees and ensemble methods like Random Forests and Gradient Boosting capture feature interactions implicitly.

In [4]:
import numpy as np

from scipy.stats import zscore
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, RobustScaler

RANDOM_SEED : int = 42
TRAIN_TEST_SPLIT: float = 0.2

def split_labels_from_features(df: pd.DataFrame):
    X = df.iloc[:, 2:]  # Features
    y = df["category_id"]  # Labels
    return X, y

def remove_outliers(df: pd.DataFrame, feature_columns: list, threshold: float, max_outliers_per_row: int) -> tuple[pd.DataFrame, pd.Series]:
    # Calculate z-scores for the specified feature columns & determine outliers
    z_scores = df[feature_columns].apply(zscore)
    outliers = (z_scores > threshold) | (z_scores < -threshold)
    
    # Count the number of outliers per row & filter rows with excessive outliers
    outliers_per_row = outliers.sum(axis=1) 
    filtered_df = df[outliers_per_row <= max_outliers_per_row]
    return filtered_df, outliers_per_row

def scale_features(X_train: pd.DataFrame, X: pd.DataFrame, scaler) -> tuple:
    X_train_scaled = scaler.fit_transform(X_train)
    X_scaled = scaler.transform(X)

    return X_train_scaled, X_scaled

def select_features_with_lasso(X_train_scaled: np.ndarray, y_train_encoded: np.ndarray, X_val_scaled: np.ndarray, X_test_scaled: np.ndarray) -> tuple:

    lasso_cv = LassoCV(cv=5, random_state=RANDOM_SEED)
    lasso_cv.fit(X_train_scaled, y_train_encoded)
    selected_features_mask = lasso_cv.coef_ != 0
    
    X_train_selected = X_train_scaled[:, selected_features_mask]
    X_val_selected = X_val_scaled[:, selected_features_mask]
    X_test_selected = X_test_scaled[:, selected_features_mask]

    return X_train_selected, X_val_selected, X_test_selected, selected_features_mask

def print_summary(original_df: pd.DataFrame, filtered_df: pd.DataFrame, selected_features_mask: np.ndarray, X_train: pd.DataFrame):
    # Dataset shape information
    original_size = original_df.shape
    filtered_size = filtered_df.shape
    rows_removed = original_size[0] - filtered_size[0]
    
    print(f"Original dataset size: {original_size[0]} rows x {original_size[1]} columns")
    print(f"Filtered dataset size: {filtered_size[0]} rows x {filtered_size[1]} columns")
    print(f"Number of rows removed due to outliers: {rows_removed} ({100 * rows_removed / original_size[0]:.2f}%)")
    
    # Feature selection information
    num_selected_features = selected_features_mask.sum()
    
    print(f'Number of selected features after LassoCV: {num_selected_features}')

def plot_outliers_distribution(outliers_per_row):
    plt.figure(figsize=(10, 6))
    sns.histplot(outliers_per_row, kde=True, bins=30, color='skyblue')
    plt.title('Distribution of Outliers per Row', fontsize=16)
    plt.xlabel('Number of Outliers per Row', fontsize=14)
    plt.ylabel('Frequency', fontsize=14)
    plt.show()

def plot_category_removal_percentage(original_df: pd.DataFrame, filtered_df: pd.DataFrame):
    initial_category_counts = original_df["category_id"].value_counts()
    filtered_category_counts = filtered_df["category_id"].value_counts()
    
    # Align counts
    aligned_counts = pd.concat([initial_category_counts, filtered_category_counts], axis=1, keys=['Before', 'After']).fillna(0)
    aligned_counts['% Removed'] = (aligned_counts['Before'] - aligned_counts['After']) / aligned_counts['Before'] * 100
    aligned_counts_sorted = aligned_counts.sort_values('% Removed', ascending=False)
    
    plt.figure(figsize=(12, 8))
    sns.barplot(x=aligned_counts_sorted.index, y=aligned_counts_sorted['% Removed'], palette='coolwarm')
    plt.xticks(rotation=90)
    
    plt.title('Percentage of Rows Removed per Product Category', fontsize=16)
    plt.xlabel('Product Category', fontsize=14)
    plt.ylabel('% of Rows Removed', fontsize=14)
    
    plt.tight_layout()
    plt.show()

In [None]:
Z_SCORE_OUTLIERS_THRESHOLD: int = 3
MAX_OUTLIERS_PER_ROW: int = 10

# outliers
feature_columns = train_df.columns[2:]
filtered_train_df, outliers_per_row = remove_outliers(train_df, feature_columns, Z_SCORE_OUTLIERS_THRESHOLD, MAX_OUTLIERS_PER_ROW)

# split the data
X_test, y_test = split_labels_from_features(test_df)
X, y = split_labels_from_features(filtered_train_df)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=TRAIN_TEST_SPLIT, shuffle=True, stratify=y, random_state=RANDOM_SEED)

# encode labels
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
y_train_encoded = label_encoder.transform(y_train)

# scale data
scaler = StandardScaler()  # or RobustScaler()
X_train_scaled, X_val_scaled  = scale_features(X_train, X_val, scaler)
X_train_scaled, X_test_scaled = scale_features(X_train, X_test, scaler)

# features selection with Lasso
X_train_selected, X_val_selected, X_test_selected, selected_features_mask = select_features_with_lasso(X_train_scaled, y_train_encoded, X_val_scaled, X_test_scaled)

print_summary(train_df, filtered_train_df, selected_features_mask, X_train)
plot_outliers_distribution(outliers_per_row)
plot_category_removal_percentage(train_df, filtered_train_df)

## 3. Evaluation of Classification Algorithms and Linear Models

### Insights

<u>Some relevant classification Algorithms:</u>
- Logistic Regression: A straightforward linear classifier that's efficient for high-dimensional data and handles class imbalance with class weighting.
- Ridge Classifier: A linear classifier that fits a least-squares regression on ±1-encoded targets with L2 regularization, which helps prevent overfitting in high-dimensional spaces.
- RandomForest Classifier: An ensemble method that combines multiple decision trees, reducing overfitting and performing well with high-dimensional data.
- XGBoost/LightGBM: Gradient boosting libraries that build trees sequentially and include parameters for addressing class imbalance.
- Support Vector Machine (SVM): Effective for high-dimensional data, with the ability to capture complex relationships using different kernels, though it requires careful tuning for class imbalance.

<u>Linear models as a baseline:</u>
- High Dimensional Uncorrelated Features: The simplicity and interpretability of linear models are beneficial in this context.
- Computational Efficiency: Linear models are computationally efficient and quick to train.
- Benchmark for More Complex Models: If more complex models significantly outperform linear models, it indicates that there may be non-linear relationships or interactions that the linear model cannot capture.

<u>Performance Analysis of linear models:</u>
- RidgeClassifier:
    - Performs fairly consistently across training, validation, and test datasets.
    - Strengths: Balances between training and validation sets, indicating good generalization.
    - Weaknesses: Slightly lower performance on the test set compared to Logistic Regression.

- LogisticRegression:
    - Shows better performance on the training data and maintains relatively good performance on validation and test datasets.
    - Strengths: Higher F1-scores compared to RidgeClassifier, indicating better overall performance.
    - Weaknesses: Performance drop from training to test set, suggesting some level of overfitting.

- F1-score vs. Support Percentage:
    - Helps identify categories that may be misclassified due to low occurrence, and allows for performance comparison across categories (see below).
    - Highlights that some frequently occurring classes are poorly predicted, while others with fewer occurrences are well predicted. This suggests a possible difference in data quality between categories. Consider removing more outliers or applying oversampling to address this issue.
    - Examples:
        - Category 9821, with more than 3% occurrence and an F1-score of 0.62
        - Category 2824, with 0.5% occurrence and an F1-score of 0.43

In [None]:
from typing import Tuple
from sklearn.metrics import classification_report
from sklearn.linear_model import RidgeClassifier, LogisticRegression

def get_classification_report_df(model, X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    y_pred = model.predict(X)
    return pd.DataFrame(classification_report(y, y_pred, output_dict=True)).transpose()

def score_model(model, X_train: pd.DataFrame, X_val: pd.DataFrame, X_test: pd.DataFrame, 
                y_train: pd.Series, y_val: pd.Series, y_test: pd.Series) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    # The "macro avg" row is the unweighted mean of the per-class F1-scores.
    # Averaging the whole "f1-score" column would also include the
    # "accuracy", "macro avg" and "weighted avg" summary rows.

    # Predict and evaluate on training data
    train_report_df = get_classification_report_df(model, X_train, y_train)
    print("Training data macro-averaged F1-score:")
    print(train_report_df.loc["macro avg", "f1-score"])

    # Predict and evaluate on validation data
    val_report_df = get_classification_report_df(model, X_val, y_val)
    print("Validation data macro-averaged F1-score:")
    print(val_report_df.loc["macro avg", "f1-score"])

    # Predict and evaluate on test data
    test_report_df = get_classification_report_df(model, X_test, y_test)
    print("Test data macro-averaged F1-score:")
    print(test_report_df.loc["macro avg", "f1-score"])
    
    return train_report_df, val_report_df, test_report_df

def plot_f1_vs_support_percentage(df_report: pd.DataFrame) -> None:
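    # Drop the summary rows by name and copy, so that adding a column below
    # does not write into a view of the caller's DataFrame
    df_report = df_report.drop(['accuracy', 'macro avg', 'weighted avg'], errors='ignore').copy()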
    
    # Calculate percentages for the support values
    total_instances = df_report['support'].sum()
    df_report['support_percentage'] = (df_report['support'] / total_instances) * 100
    
    # Plotting F1-score vs Support Percentage
    plt.figure(figsize=(10, 6))
    plt.scatter(df_report['support_percentage'], df_report['f1-score'], color='blue', alpha=0.7)
    plt.title("Relationship between Support Percentage and F1-Score for Each Class", fontsize=16)
    plt.xlabel("Support (Percentage of Total Instances)", fontsize=14)
    plt.ylabel("F1-Score", fontsize=14)
    
    # Optionally, annotate each point with the class label
    for i, class_label in enumerate(df_report.index):
        plt.annotate(class_label, (df_report['support_percentage'].iloc[i], df_report['f1-score'].iloc[i]),
                     textcoords="offset points", xytext=(0,5), ha='center', fontsize=9)
    
    plt.grid(True)
    plt.show()

In [None]:
ridge_clf = RidgeClassifier(class_weight="balanced", random_state=RANDOM_SEED)
ridge_clf.fit(X_train_selected, y_train)

ridge_clf_train_report_df, ridge_clf_validation_report_df, ridge_clf_test_report_df = score_model(
    ridge_clf, X_train_selected, X_val_selected, X_test_selected, y_train, y_val, y_test
)
plot_f1_vs_support_percentage(ridge_clf_validation_report_df)

In [None]:
log_reg = LogisticRegression(class_weight="balanced", random_state=RANDOM_SEED)
log_reg.fit(X_train_selected, y_train)

log_reg_train_report_df, log_reg_validation_report_df, log_reg_test_report_df = score_model(
    log_reg, X_train_selected, X_val_selected, X_test_selected, y_train, y_val, y_test
)
plot_f1_vs_support_percentage(log_reg_validation_report_df)

## 4. Final Model Performance Evaluation and Next Steps

### Insights

The following analysis focuses on the performance of the final model, which is the Logistic Regression trained on the entire training set. This model achieved the best results among those evaluated. Here's a summary of its performance and the recommended next steps for further improvement.

<u>DecisionTreeClassifier:</u>
- DecisionTreeClassifier(class_weight="balanced", random_state=RANDOM_SEED, max_depth=15) F1-score: 0.515 train, 0.383 validation & 0.355 test
- Performs poorly on both validation and test datasets.
- Strengths: Can model complex relationships but is heavily dependent on hyperparameters.
- Weaknesses: High variance, significant drop in performance from training to validation and test datasets. Likely overfitting.

<u>SVM & LightGBM:</u>
- Training these models takes too long and requires excessive resources to prevent overfitting. Even though they remain clear challengers, they are set aside to stay pragmatic in this study.

<u>Ensemble Methods</u>:
- Combining models to leverage the strengths of each:
    - Logistic regression and ridge classifier
    - Logistic regression, ridge classifier and Decision Tree Classifier
- However, as the comparison of F1 scores shows, both linear models make the same errors, so their combination does not improve performance. Adding the overfitted Decision Tree does not help either.
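
Why combining models with identical errors brings nothing can be illustrated with a hard-voting sketch on toy predictions. The prediction lists below are invented for the example and do not come from the fitted models; `majority_vote` is a hypothetical helper.

```python
from collections import Counter

def majority_vote(*model_predictions):
    """Hard voting: keep the most frequent prediction per sample.

    Ties are resolved in favour of the first model listed.
    """
    voted = []
    for sample_preds in zip(*model_predictions):
        counts = Counter(sample_preds)
        best = max(counts.values())
        voted.append(next(p for p in sample_preds if counts[p] == best))
    return voted

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

toy_truth    = [0, 1, 2, 0, 1, 2]
log_reg_pred = [0, 1, 2, 0, 2, 1]  # wrong on the last two samples
ridge_pred   = [0, 1, 2, 0, 2, 1]  # exactly the same errors
tree_pred    = [0, 2, 2, 1, 1, 2]  # different errors, right on the last two samples

two_model_vote = majority_vote(log_reg_pred, ridge_pred)
three_model_vote = majority_vote(log_reg_pred, ridge_pred, tree_pred)

print(accuracy(toy_truth, log_reg_pred))
print(accuracy(toy_truth, two_model_vote))
print(accuracy(toy_truth, three_model_vote))
```

In this toy setting the two agreeing linear models always outvote the tree, even on the samples where the tree is right, so the ensemble reproduces the errors of the linear models.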

<u>Final Model: Logistic Regression on Full Training Set</u>
- F1-Scores:
    - Training Data: 0.66
    - Validation Data: 0.64
    - Test Data: 0.62
- The model performs consistently across the three datasets, although the drop from 0.66 to 0.62 suggests slight overfitting.

<u>Next Steps:</u>
- Handling Imbalance: Investigate techniques like SMOTE or class weighting further to address class imbalance.
- Hyperparameter Tuning & Cross-Validation: Fine-tune hyperparameters more extensively to find a balance between bias and variance.

<u>Useful tools:</u>
- Optuna: https://optuna.org/
- MLflow: https://mlflow.org/

In [9]:
def compare_f1_scores(report_dict1: pd.DataFrame, report_dict2: pd.DataFrame, model1_name: str, model2_name: str) -> None:
    # Drop the summary rows so that only per-class scores remain
    df1 = report_dict1.drop(['accuracy', 'macro avg', 'weighted avg'], errors='ignore')
    df2 = report_dict2.drop(['accuracy', 'macro avg', 'weighted avg'], errors='ignore')
    
    # Concatenate F1-scores for all models
    f1_df = pd.concat([df1[['f1-score']].rename(columns={'f1-score': model1_name}),
                       df2[['f1-score']].rename(columns={'f1-score': model2_name})], axis=1)
    
    # Plot the comparison of F1-scores
    f1_df.plot(kind='bar', figsize=(12, 6))
    plt.title('F1-Score Comparison per Class')
    plt.ylabel('F1-Score')
    plt.xlabel('Class')
    plt.xticks(rotation=90)
    plt.legend()
    plt.show()

In [None]:
compare_f1_scores(ridge_clf_validation_report_df, log_reg_validation_report_df, model1_name="Ridge Classifier", model2_name="Logistic Regression")

In [None]:
final_X_test, final_y_test = split_labels_from_features(test_df)
final_X_train, final_y_train = split_labels_from_features(filtered_train_df)

final_X_train, final_X_test = scale_features(final_X_train, final_X_test, scaler)

final_X_test = final_X_test[:, selected_features_mask]
final_X_train = final_X_train[:, selected_features_mask]

final_log_reg = LogisticRegression(class_weight="balanced", random_state=RANDOM_SEED)
final_log_reg.fit(final_X_train, final_y_train)

final_log_reg_train_report_df = get_classification_report_df(final_log_reg, final_X_train, final_y_train)
final_log_reg_test_report_df = get_classification_report_df(final_log_reg, final_X_test, final_y_test)

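# Report the macro-averaged F1 (the column mean would also include the summary rows)
print("Train macro F1:", final_log_reg_train_report_df.loc["macro avg", "f1-score"])
print("Test macro F1:", final_log_reg_test_report_df.loc["macro avg", "f1-score"])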

plot_f1_vs_support_percentage(final_log_reg_test_report_df)

## 5. Analysis of Sibling Misclassifications and Parent Category Performance

### Insights

The dataset category_parent.csv contains category-parent pairs, providing an opportunity to study sibling misclassifications.

<u>Definitions:</u>
- Sibling: Categories belonging to the same parent category.
- The Fraction of Total Misclassifications: It is the percentage of mistakes that involve a specific parent category out of all the mistakes made by the model.
- The Misclassification Ratio: It is the percentage of mistakes for a specific parent category out of all the instances that actually belong to that parent category.
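
The sibling definition can be made concrete with a small sketch on a toy hierarchy. It assumes each category has exactly one parent and uses a reverse lookup (category to parent); the helper names and the toy data are hypothetical.

```python
def build_category_to_parent(categories_per_parent):
    """Invert {parent: [categories]} into {category: parent}."""
    return {
        category: parent
        for parent, categories in categories_per_parent.items()
        for category in categories
    }

def sibling_misclassification_rate(y_true, y_pred, category_to_parent):
    """Share of errors where the true and predicted categories share a parent."""
    errors = [(t, p) for t, p in zip(y_true, y_pred) if t != p]
    if not errors:
        return 0.0
    sibling_errors = sum(
        1 for t, p in errors
        if category_to_parent.get(t) is not None
        and category_to_parent.get(t) == category_to_parent.get(p)
    )
    return sibling_errors / len(errors)

# Toy hierarchy: parent 100 has children 1 and 2, parent 200 has children 3 and 4
toy_hierarchy = {100: [1, 2], 200: [3, 4]}
toy_category_to_parent = build_category_to_parent(toy_hierarchy)

toy_true_ids = [1, 1, 2, 3, 4, 4]
toy_pred_ids = [1, 2, 3, 4, 4, 1]

toy_rate = sibling_misclassification_rate(toy_true_ids, toy_pred_ids, toy_category_to_parent)
print(f"{toy_rate:.2%}")
```

Of the four errors in the toy data, two (1 predicted as 2, 3 predicted as 4) stay within the same parent, so half of the misclassifications are sibling misclassifications.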

<u>Sibling Misclassification Rate:</u>
- Rate: 22.64%
- Implication: About 1 in 5 misclassifications occur between sibling categories, indicating the model's difficulty distinguishing between similar sibling categories.

<u>High Misclassification Counts with Varying F1 Scores:</u>
- Parent Category 9819: This parent category has a misclassification ratio of approximately 32%, yet its overall performance is good, with an average F1 score of 0.63. While the category is generally well-predicted, developing distinguishing features for subcategories within Parent Category 9819 could improve classification accuracy further.
- Parent Category 3066: This category has a similar misclassification ratio of 28.84%, but its average F1 score is significantly lower at 0.44. This suggests that not only is there a high misclassification rate, but the quality of predictions within this category is also poor.

<u>Categories with Low Misclassification Ratios and Higher F1 Scores:</u>
- Parent Category 2958 has the lowest misclassification ratio (11.40%) and a relatively high average F1 score (0.62). This suggests that this parent category is well-represented in the model, with fewer misclassifications and reasonably accurate predictions.
- Parent Category 2891 has a low misclassification ratio (16.49%) and a decent average F1 score (0.58), indicating that it is relatively well classified with a lower proportion of errors.

<u>Fractions of Total Misclassifications:</u>
- Categories with higher fractions of total misclassifications (e.g., Parent Category 9819 with 4.26%) point to potentially significant issues in the model’s performance for these categories. This can help prioritize areas for further improvement.

<u>Association Between F1 Score and Misclassification Ratio:</u>
- There seems to be an inverse relationship between the misclassification ratio and the average F1 score. Higher misclassification ratios are often associated with lower average F1 scores, suggesting that improving the F1 score might help reduce the misclassification ratio.
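
As a rough check of this association, the sketch below computes a Pearson correlation on the four parent categories quoted above (9819, 3066, 2958, 2891). The value for 9819 is the approximate 32% from the text, and four points are far too few to draw a firm conclusion; the `pearson` helper is written for this example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Values quoted in this section, in the order 9819, 3066, 2958, 2891
quoted_ratios = [32.0, 28.84, 11.40, 16.49]  # 32.0 is approximate
quoted_f1 = [0.63, 0.44, 0.62, 0.58]

r = pearson(quoted_ratios, quoted_f1)
print(f"Pearson r on the four quoted parents: {r:.2f}")
```

The same function can be applied to the full per-parent table printed by `analyze_misclassified_parent_groups` to quantify the relationship over all parent categories.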

In [1]:
from typing import List, Dict, DefaultDict, Tuple
from collections import defaultdict
from sklearn.metrics import f1_score

CategoryDict = Dict[int, List[int]]

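def is_sibling_misclassification(true_label: int, predicted_label: int, categories_per_parent: CategoryDict) -> bool:
    # categories_per_parent maps each parent id to the list of its child categories;
    # two labels are siblings when they appear in the same list
    return any(true_label in siblings and predicted_label in siblings for siblings in categories_per_parent.values())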

def analyze_sibling_misclassifications(true_labels: List[int], predicted_labels: List[int], category_to_parent: CategoryDict) -> float:
    sibling_misclassifications = sum(
        1 for true_label, predicted_label in zip(true_labels, predicted_labels)
        if true_label != predicted_label and is_sibling_misclassification(true_label, predicted_label, category_to_parent)
    )
    total_misclassifications = sum(
        1 for true_label, predicted_label in zip(true_labels, predicted_labels)
        if true_label != predicted_label
    )
    return sibling_misclassifications / total_misclassifications if total_misclassifications > 0 else 0

def calculate_f1_scores(true_labels: List[int], predicted_labels: List[int], category_list: List[int]) -> Dict[int, float]:
    f1_scores = {}
    for category in category_list:
        true_binary = [1 if label == category else 0 for label in true_labels]
        predicted_binary = [1 if label == category else 0 for label in predicted_labels]
        f1_scores[category] = f1_score(true_binary, predicted_binary, zero_division=0)
    return f1_scores

def count_sibling_misclassifications(true_labels: List[int], predicted_labels: List[int], categories_per_parent_id_dict: CategoryDict) -> Tuple[DefaultDict[int, int], DefaultDict[int, int]]:
    misclassification_counts = defaultdict(int)
    occurrence_counts = defaultdict(int)
    
    for true_label, predicted_label in zip(true_labels, predicted_labels):
        if true_label != predicted_label and is_sibling_misclassification(true_label, predicted_label, categories_per_parent_id_dict):
            parent = next((parent for parent, siblings in categories_per_parent_id_dict.items() if true_label in siblings or predicted_label in siblings), None)
            if parent is not None:
                misclassification_counts[parent] += 1
        # Increment occurrences for each true label
        parent = next((parent for parent, siblings in categories_per_parent_id_dict.items() if true_label in siblings), None)
        if parent is not None:
            occurrence_counts[parent] += 1
    
    return misclassification_counts, occurrence_counts

def analyze_misclassified_parent_groups(true_labels: List[int], predicted_labels: List[int], categories_per_parent_id_dict: CategoryDict) -> None:
    misclassification_counts, occurrence_counts = count_sibling_misclassifications(true_labels, predicted_labels, categories_per_parent_id_dict)
    
    all_categories = [cat for cats in categories_per_parent_id_dict.values() for cat in cats]
    f1_scores = calculate_f1_scores(true_labels, predicted_labels, all_categories)
    
    print("Parent Category Misclassification Counts, Average F1 Scores, and Ratios:")
    # The total number of errors is the same for every parent, so compute it once
    total_misclassifications = sum(
        1 for true_label, predicted_label in zip(true_labels, predicted_labels) if true_label != predicted_label
    )

    for parent, count in misclassification_counts.items():
        associated_categories = categories_per_parent_id_dict.get(parent, [])
        associated_f1_scores = [f1_scores.get(cat, 0) for cat in associated_categories]
        average_f1_score = sum(associated_f1_scores) / len(associated_f1_scores) if associated_f1_scores else 0
        
        # Calculate fraction of total misclassifications
        fraction_misclassifications = count / total_misclassifications if total_misclassifications > 0 else 0
        
        # Calculate ratio of misclassifications to occurrences
        occurrences = occurrence_counts.get(parent, 0)
        misclassification_ratio = (count / occurrences * 100) if occurrences > 0 else 0
        
        print(f"Parent Category {parent}: {count} misclassifications")
        print(f"  Associated Categories: {associated_categories}")
        print(f"  Average F1 Score: {average_f1_score:.2f}")
        print(f"  Fraction of Total Misclassifications: {fraction_misclassifications:.2%}")
        print(f"  Misclassification Ratio: {misclassification_ratio:.2f}%")


In [None]:
# Map each parent id to the list of its child categories present in the training set
categories_per_parent_id_dict = (
    category_parent_df
    .loc[lambda df: df.category_id.isin(set(train_df.category_id))]
    .assign(parent_id=lambda df: df.parent_id.astype(int),
            category_id=lambda df: df.category_id.astype(int))
    .groupby("parent_id")["category_id"]
    .apply(list)
    .to_dict()
)

predicted_labels = final_log_reg.predict(final_X_test)

# Calculate sibling error rate
sibling_error_rate = analyze_sibling_misclassifications(final_y_test, predicted_labels, categories_per_parent_id_dict)
print(f"Sibling Misclassification Rate: {sibling_error_rate:.2%}")

# Identify parent groups with high misclassification counts and ratios
analyze_misclassified_parent_groups(final_y_test, predicted_labels, categories_per_parent_id_dict)

## 6. Conclusion

<u>Achievements:</u>
- Model Performance: 
    - Logistic Regression demonstrated the highest performance, with F1-scores of 0.66 on the training set, 0.64 on the validation set, and 0.62 on the test set. This model strikes an excellent balance between performance and complexity due to its simplicity and interpretability.
- Misclassification Insights:
    - Identified parent categories with high internal misclassification ratios (e.g., 9819, 3066) and those with lower ratios and higher F1 scores (e.g., 2958, 2891). Addressing categories with high misclassification ratios can directly enhance customer experience and operational accuracy.

<u>Next Steps:</u>
- Feature Engineering:
    - Create or refine features to better differentiate between similar categories, particularly for those with high misclassification rates.
- Hyperparameter Tuning:
    - Perform extensive hyperparameter tuning and cross-validation to enhance model performance and optimize the balance between bias and variance.
- Explore Complex Models:
    - Consider incorporating more complex models, such as LightGBM, to potentially improve classification performance further.
- Strategic Focus:
    - Concentrate on categories with significant internal misclassification to gain more precise and actionable business insights, leading to improved decision-making and operational efficiency.