# 1. Introduction & Objective

## 📌 Benchmarking CNN Model for Text Classification

### 🎯 Objective:
- Replicate the reference CNN model from the Rakuten challenge.
- Train it on the **designation** field (product titles).
- Evaluate using **Weighted F1-score** to compare with the benchmark (**0.8113**).

### 🔍 Expected Outcome:
- Establish a **baseline** for text classification.
- Compare the F1-score with **0.8113**.
- Prepare for **improvements (Fine-tuning, Hyperparameter Optimization)**.


# 2. Imports & Configuration

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import (
    Input, Embedding, Reshape, Conv2D, MaxPooling2D, Concatenate, Flatten, Dropout, Dense
)
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.metrics import f1_score, precision_score, recall_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd


# 3. Benchmark Model Definition

## Model Used in the Rakuten Benchmark
https://challengedata.ens.fr/participants/challenges/35/
*For the text data a simplified CNN classifier is used. Only the designation fields (product titles) are used in this benchmark model. The input size is the maximum possible designation length, 34 in this case. Shorter inputs are zero-padded. The architecture consists of an embedding layer and 6 convolutional, max-pooling blocks. The embeddings are trained with the entire architecture. Following is the model architecture:*

| Layer (type)          | Output Shape         | Number of Params | Connected to                   |
|-----------------------|---------------------|------------------|--------------------------------|
| InputLayer           | (None, 34)          | 0                | -                              |
| Embedding Layer      | (None, 34, 300)     | 17,320,500       | InputLayer                     |
| Reshape             | (None, 34, 300, 1)  | 0                | Embedding Layer                |
| Conv2D Block 1      | (None, 34, 1, 512)  | 154,112          | Reshape                        |
| MaxPooling2D Block 1 | (None, 1, 1, 512)   | 0                | Conv2D Block 1                 |
| Conv2D Block 2      | (None, 33, 1, 512)  | 307,712          | Reshape                        |
| MaxPooling2D Block 2 | (None, 1, 1, 512)   | 0                | Conv2D Block 2                 |
| Conv2D Block 3      | (None, 32, 1, 512)  | 461,312          | Reshape                        |
| MaxPooling2D Block 3 | (None, 1, 1, 512)   | 0                | Conv2D Block 3                 |
| Conv2D Block 4      | (None, 31, 1, 512)  | 614,912          | Reshape                        |
| MaxPooling2D Block 4 | (None, 1, 1, 512)   | 0                | Conv2D Block 4                 |
| Conv2D Block 5      | (None, 30, 1, 512)  | 768,512          | Reshape                        |
| MaxPooling2D Block 5 | (None, 1, 1, 512)   | 0                | Conv2D Block 5                 |
| Conv2D Block 6      | (None, 29, 1, 512)  | 922,112          | Reshape                        |
| MaxPooling2D Block 6 | (None, 1, 1, 512)   | 0                | Conv2D Block 6                 |
| Concatenate         | (None, 6, 1, 512)   | 0                | All MaxPooling2D Blocks        |
| Flatten            | (None, 3072)        | 0                | Concatenate                    |
| Dropout Layer       | (None, 3072)        | 0                | Flatten                        |
| Dense Layer         | (None, 27)          | 82,971           | Dropout Layer                  |


- This architecture contains a total of **20,632,143 trainable** parameters.
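
The parameter counts in the table can be re-derived from the layer shapes. The sketch below is only an arithmetic check, not part of the benchmark code: the vocabulary size is not stated in the benchmark description, so it is inferred here by dividing the embedding parameter count by the embedding dimension, and the constant and function names are illustrative.

```python
# Shapes taken from the benchmark table; the vocabulary size is inferred, not stated.
EMBEDDING_DIM = 300
NUM_FILTERS = 512
NUM_CLASSES = 27
KERNEL_HEIGHTS = range(1, 7)  # one Conv2D block per kernel height 1..6

embedding_params = 17_320_500
vocab_size = embedding_params // EMBEDDING_DIM  # inferred number of embedding rows


def conv_params(kernel_height):
    # weights (height x width x 1 input channel x filters) plus one bias per filter
    return kernel_height * EMBEDDING_DIM * 1 * NUM_FILTERS + NUM_FILTERS


conv_total = sum(conv_params(k) for k in KERNEL_HEIGHTS)
# Dense layer: 6 pooled vectors of 512 values each, fully connected to 27 classes
dense_params = len(KERNEL_HEIGHTS) * NUM_FILTERS * NUM_CLASSES + NUM_CLASSES
total_params = embedding_params + conv_total + dense_params

print(vocab_size, conv_total, dense_params, total_params)
```

Under these assumptions the embedding corresponds to a 57,735-word vocabulary, and the Dense layer comes to 3072 × 27 + 27 = 82,971 parameters, which is the value the stated total of 20,632,143 requires. A model built with a smaller vocabulary (for example 20,000 rows) will report fewer embedding parameters than the benchmark.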


In [None]:
# Hyperparameters
MAX_SEQUENCE_LENGTH = 34  # Max length of designation field
EMBEDDING_DIM = 300       # Embedding dimension from benchmark
NUM_CLASSES = 27          # Number of product categories

def create_cnn_text_model():
    # Input layer
    input_text = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype="int32")

    # Embedding Layer
    embedding = Embedding(input_dim=20000, output_dim=EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH)(input_text)
    
    # Reshape for Conv2D
    reshape = Reshape((MAX_SEQUENCE_LENGTH, EMBEDDING_DIM, 1))(embedding)

    # 6 Convolution + MaxPooling blocks
    conv_blocks = []
    for filter_size in range(1, 7):
        conv = Conv2D(filters=512, kernel_size=(filter_size, EMBEDDING_DIM), activation="relu")(reshape)
        pool = MaxPooling2D(pool_size=(conv.shape[1], 1))(conv)
        conv_blocks.append(pool)

    # Concatenate all convolution outputs
    concatenated = Concatenate(axis=1)(conv_blocks)

    # Flatten and Dropout
    flatten = Flatten()(concatenated)
    dropout = Dropout(0.5)(flatten)

    # Fully Connected Output Layer
    output = Dense(NUM_CLASSES, activation="softmax")(dropout)

    # Compile Model
    model = Model(inputs=input_text, outputs=output)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    
    return model

model = create_cnn_text_model()
model.summary()


# 4. Data Preparation: Raw vs. Cleaned Text

## Iteration 1: Without Text Cleaning
*Following the benchmark approach, we apply tokenization directly to the raw `designation` field without any preprocessing.*
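
To make the padding step concrete, here is a minimal pure-Python sketch of what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`). The helper `pad_pre` is hypothetical and exists only for illustration. The benchmark description says inputs are zero-padded but does not say on which side, so left-padding is simply the Keras default, not a confirmed benchmark detail.

```python
MAX_SEQUENCE_LENGTH = 34


def pad_pre(token_ids, maxlen=MAX_SEQUENCE_LENGTH):
    """Hypothetical helper mimicking the pad_sequences defaults."""
    kept = token_ids[-maxlen:]                # truncating='pre': keep the last maxlen tokens
    return [0] * (maxlen - len(kept)) + kept  # padding='pre': left-pad with zeros


short = pad_pre([12, 7, 431])      # a 3-token title
long = pad_pre(list(range(1, 41)))  # a 40-token title

print(short[-5:], len(short))
print(long[0], long[-1], len(long))
```

Index 0 is safe to use as the padding value because the Keras `Tokenizer` numbers words starting from 1.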

In [None]:
# Load dataset
df_train = pd.read_csv("data/raw_csv/X_train_update.csv")
df_test = pd.read_csv("data/raw_csv/X_test_update.csv")
df_labels = pd.read_csv("data/raw_csv/Y_train_CVw08PX.csv")

# Merge train data with labels
df_train = df_train.merge(df_labels, left_index=True, right_index=True)

# Tokenization
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_SEQUENCE_LENGTH = 34
tokenizer = Tokenizer(num_words=20000)
tokenizer.fit_on_texts(df_train["designation"])

X_train = pad_sequences(tokenizer.texts_to_sequences(df_train["designation"]), maxlen=MAX_SEQUENCE_LENGTH)
X_test = pad_sequences(tokenizer.texts_to_sequences(df_test["designation"]), maxlen=MAX_SEQUENCE_LENGTH)

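# prdtypecode holds product type codes, not 0..NUM_CLASSES-1 indices, so we
# encode them as contiguous class indices for sparse_categorical_crossentropy
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(df_train["prdtypecode"])  # Target labels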


## Iteration 2: With Text Cleaning
*Now, we apply standard text preprocessing techniques before tokenization:*
- Lowercasing
- Removing punctuation & special characters
- Removing extra spaces

In [None]:
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation; \w is Unicode-aware, so accented letters are kept
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text

# Apply cleaning
df_train["designation_cleaned"] = df_train["designation"].astype(str).apply(clean_text)
df_test["designation_cleaned"] = df_test["designation"].astype(str).apply(clean_text)

# Tokenization on cleaned text
tokenizer_cleaned = Tokenizer(num_words=20000)
tokenizer_cleaned.fit_on_texts(df_train["designation_cleaned"])

X_train_cleaned = pad_sequences(tokenizer_cleaned.texts_to_sequences(df_train["designation_cleaned"]), maxlen=MAX_SEQUENCE_LENGTH)
X_test_cleaned = pad_sequences(tokenizer_cleaned.texts_to_sequences(df_test["designation_cleaned"]), maxlen=MAX_SEQUENCE_LENGTH)


# 5. Model Training 

We train a fresh model on each input, using `epochs=10` and `batch_size=32`:

- **Iteration 1**: Raw text (`designation`)
- **Iteration 2**: Cleaned text (`designation_cleaned`)
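
When `validation_split` is passed to `fit`, Keras takes the validation rows from the end of the arrays, before any shuffling. The sketch below shows which index range that corresponds to; `holdout_bounds` is a hypothetical helper written for this illustration, not a Keras function.

```python
import math


def holdout_bounds(num_samples, validation_split=0.2):
    """Hypothetical helper: index range of the rows held out for validation."""
    split_at = int(math.floor(num_samples * (1.0 - validation_split)))
    return split_at, num_samples


start, stop = holdout_bounds(1000)
print(f"training rows: 0..{start - 1}, validation rows: {start}..{stop - 1}")
```

Because the split is positional, a file sorted by category would give a validation set that does not reflect the overall class distribution. Shuffling the rows once before tokenization is one way to guard against that.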

In [None]:
# Train model on raw text (create_cnn_text_model is defined in section 3)
model_raw = create_cnn_text_model()
history_raw = model_raw.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

# Train model on cleaned text
model_cleaned = create_cnn_text_model()
history_cleaned = model_cleaned.fit(X_train_cleaned, y_train, epochs=10, batch_size=32, validation_split=0.2)

# 6. Model Evaluation

## 6.1 Learning Curves: Accuracy & Loss

In [None]:
import matplotlib.pyplot as plt

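def plot_learning_curves(history, title):
    """Plot training and validation accuracy and loss side by side."""
    fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(12, 4))
    ax_acc.plot(history.history['accuracy'], label='Train Accuracy')
    ax_acc.plot(history.history['val_accuracy'], label='Validation Accuracy')
    ax_acc.set_xlabel('Epoch')
    ax_acc.legend()
    ax_loss.plot(history.history['loss'], label='Train Loss')
    ax_loss.plot(history.history['val_loss'], label='Validation Loss')
    ax_loss.set_xlabel('Epoch')
    ax_loss.legend()
    fig.suptitle(title)
    plt.show()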

plot_learning_curves(history_raw, 'Learning Curve - Raw Text')
plot_learning_curves(history_cleaned, 'Learning Curve - Cleaned Text')


## 6.2 Evaluation Metrics: Weighted F1-Score, Precision & Recall

- **Weighted F1-score**
- **Precision**
- **Recall**
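
As a reminder of what "weighted" means here, the following pure-Python sketch computes a per-class F1 score and averages it with each class's support (number of true samples) as the weight. It is an illustration on made-up toy labels, intended to mirror `f1_score(..., average='weighted')`; the function name `weighted_f1` is ours.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        n_pred = sum(1 for p in y_pred if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += n_true * f1  # weight each class by its number of true samples
    return total / len(y_true)


# Toy labels, for illustration only
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
print(round(weighted_f1(y_true, y_pred), 4))
```

With this weighting, frequent categories dominate the score. Note also that support-weighted recall reduces to plain accuracy, so the recall figure adds no information beyond the accuracy curve.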

In [None]:
from sklearn.metrics import f1_score, precision_score, recall_score

def evaluate_model(model, X_test, y_test, label):
    y_pred = np.argmax(model.predict(X_test), axis=1)
    f1 = f1_score(y_test, y_pred, average='weighted')
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    print(f'📌 {label}')
    print(f'🔹 Weighted F1-score: {f1:.4f}')
    print(f'🔹 Precision: {precision:.4f}')
    print(f'🔹 Recall: {recall:.4f}\n')

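# df_test has no "prdtypecode" column (labels are only loaded for the training set),
# so we score on the last 20% of the training rows, which Keras held out
# via validation_split=0.2.
val_start = int(len(X_train) * (1 - 0.2))
evaluate_model(model_raw, X_train[val_start:], y_train[val_start:], 'Raw Text')
evaluate_model(model_cleaned, X_train_cleaned[val_start:], y_train[val_start:], 'Cleaned Text')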


## 6.3 Detailed Classification Report


In [None]:
from sklearn.metrics import classification_report

def generate_classification_report(model, X_test, y_test, label):
    """Generates a detailed classification report for the given model."""
    y_pred = np.argmax(model.predict(X_test), axis=1)
    report = classification_report(y_test, y_pred)
    
    print(f'📌 {label} - Classification Report:')
    print(report)
    print("="*80)

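# Generate reports for both cases.
# df_test has no "prdtypecode" column, so we use the last 20% of the training
# rows, which Keras held out via validation_split=0.2.
val_start = int(len(X_train) * (1 - 0.2))
generate_classification_report(model_raw, X_train[val_start:], y_train[val_start:], "Raw Text")
generate_classification_report(model_cleaned, X_train_cleaned[val_start:], y_train[val_start:], "Cleaned Text")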


## 6.4 Confusion Matrix: Error Analysis

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(model, X_test, y_test, label, class_labels):
    """Generates and plots a confusion matrix for the given model."""
    y_pred = np.argmax(model.predict(X_test), axis=1)
    cm = confusion_matrix(y_test, y_pred)

    plt.figure(figsize=(16, 8))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=class_labels, yticklabels=class_labels)
    plt.xlabel("Predicted Labels")
    plt.ylabel("True Labels")
    plt.title(f"Confusion Matrix - {label}")
    plt.show()

# Get class labels, sorted so that they line up with the row/column order of confusion_matrix
class_labels = np.sort(df_train["prdtypecode"].unique())

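# Generate confusion matrices for both cases.
# df_test has no "prdtypecode" column, so we use the last 20% of the training
# rows, which Keras held out via validation_split=0.2.
val_start = int(len(X_train) * (1 - 0.2))
plot_confusion_matrix(model_raw, X_train[val_start:], y_train[val_start:], "Raw Text", class_labels)
plot_confusion_matrix(model_cleaned, X_train_cleaned[val_start:], y_train[val_start:], "Cleaned Text", class_labels)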


# 7. Benchmark Results & Next Steps

## 🔹 Results: Raw Text Model
- **Benchmark F1-score (given)**: `0.8113`
- **Our Model F1-score**: `____`  
- **Precision**: `____` 
- **Recall**: `____` 

## 🔹 Results: Cleaned Text Model
- **Benchmark F1-score (given)**: `0.8113`
- **Our Model F1-score**: `____`  
- **Precision**: `____` 
- **Recall**: `____` 

## 🔹 Next Steps:
✅ Compare the **baseline** against a model tuned with **Hyperparameter Optimization**.  
✅ Try **Pretrained Embeddings** (`Word2Vec, FastText, GloVe`).  
✅ Train **BiLSTM or Transformer-based models** (e.g., BERT).  
✅ Integrate **Multimodal (Text + Image) classification** (`05_Bimodal_Integration`).  
