# The Food Hazard Detection Challenge

This notebook implements a multi-label classification approach as described in SemEval 2025 Task 9: The Food Hazard Detection Challenge.  
The goal is to predict hazard and product categories, as well as specific hazards and products, from food recall titles and descriptions. The model combines TF-IDF vectorization, which extracts features from the text column, with XGBoost classifiers that assign each text to the appropriate categories.  
The sub-tasks are:  
ST1: Predict hazard category and product category.  
ST2: Predict hazard and product.   
Specifically, the notebook does the following:  
Load and preprocess the data.  
Check data quality (missing values, duplicates).  
Apply feature extraction using TF-IDF vectorization.   
Train multiple XGBoost classifiers for multi-label classification (training took about 5 hours on my local server).  
Evaluate model performance using F1 scores.  
Save predictions to submission file.

In [1]:
!pip install scikit-learn==1.3.2 xgboost



Loading and preparing the data:  
I load the labeled training, test and validation datasets from "The Food Hazard Detection Challenge". The datasets contain food-related incidents, including the title, the text (description) and the corresponding categories.  
I check for missing and duplicate values to make sure that the data used for training is clean.

In [2]:
import pandas as pd

# Load the train, test and validation datasets
train_data = pd.read_csv("incidents_train.csv")
test_data = pd.read_csv("incidents_test.csv")
valid_data = pd.read_csv("incidents_valid.csv")

# Checking for missing or duplicate values
def check_data_quality(data, name):
    print(f"--- {name} ---")
    print(f"Shape: {data.shape}")
    print(f"Missing values:\n{data.isnull().sum()}\n")
    print(f"Duplicate rows: {data.duplicated().sum()}\n")
    print("="*50)

check_data_quality(train_data, "Training Data")
check_data_quality(test_data, "Testing Data")
check_data_quality(valid_data, "Validation Data")

--- Training Data ---
Shape: (5082, 11)
Missing values:
Unnamed: 0          0
year                0
month               0
day                 0
country             0
title               0
text                0
hazard-category     0
product-category    0
hazard              0
product             0
dtype: int64

Duplicate rows: 0

--- Testing Data ---
Shape: (997, 11)
Missing values:
Unnamed: 0          0
year                0
month               0
day                 0
country             0
title               0
text                0
hazard-category     0
product-category    0
hazard              0
product             0
dtype: int64

Duplicate rows: 0

--- Validation Data ---
Shape: (565, 11)
Missing values:
Unnamed: 0          0
year                0
month               0
day                 0
country             0
title               0
text                0
hazard-category     0
product-category    0
hazard              0
product             0
dtype: int64

Duplicate rows: 0



As we can see, there are no missing or duplicate values, so we can continue to the main part of the notebook.



# Classification


Label Encoding:  
Since the model only processes numerical inputs:  
I encode categorical labels using LabelEncoder for hazard categories, product categories, hazards, and products.  
I handle unseen labels in the test and validation data to avoid errors during inference. LabelEncoder raises an error for labels it did not see during fitting, so the transform_with_unknown function encodes any label that does not appear in the training set as -1. Instead of skipping these examples (which would reduce the number of rows in the submission file), every test and validation example keeps a label and still gets a prediction from the model.  

TF-IDF vectorization:  
After the encoding, I use TF-IDF vectorization to convert the text into numerical features. The vectorizer works on character-level n-grams (from bi-grams to 5-grams) and keeps at most 5000 features (max_features=5000). This helps capture important patterns in the textual data.
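
To make "character-level n-grams" concrete, here is a minimal sketch in plain Python. The helper `char_ngrams` is hypothetical and only illustrates the idea; it is not what scikit-learn runs internally (TfidfVectorizer additionally lowercases the text, normalizes whitespace and weights the counts).

```python
def char_ngrams(text, n_min=2, n_max=5):
    """Return every character n-gram of length n_min..n_max (illustrative helper)."""
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the text
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams

grams = char_ngrams("recall")
print(len(grams))   # 14 n-grams: 5 bi-grams + 4 tri-grams + 3 four-grams + 2 five-grams
print(grams[:5])    # ['re', 'ec', 'ca', 'al', 'll']
```

Because the features are sub-word fragments, spelling variants of the same word still share most of their n-grams, which is one plausible reason for choosing this analyzer on noisy recall texts.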

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, f1_score
from sklearn.preprocessing import LabelEncoder
import numpy as np

X_train = train_data['text']
X_test = test_data['text']
X_valid = valid_data['text']

hazard_category_encoder = LabelEncoder()
product_category_encoder = LabelEncoder()
hazard_encoder = LabelEncoder()
product_encoder = LabelEncoder()

# Fitting LabelEncoders on the training data
hazard_category_encoder.fit(train_data['hazard-category'])
product_category_encoder.fit(train_data['product-category'])
hazard_encoder.fit(train_data['hazard'])
product_encoder.fit(train_data['product'])

# Encoding the training data
y_hazard_category = hazard_category_encoder.transform(train_data['hazard-category'])
y_product_category = product_category_encoder.transform(train_data['product-category'])
y_hazard = hazard_encoder.transform(train_data['hazard'])
y_product = product_encoder.transform(train_data['product'])

# Handle the unseen labels in the testing, validation data
def transform_with_unknown(encoder, labels, fallback_value=-1):
    # Build the label -> integer code mapping once (classes_ is stored in encoding order)
    label_to_code = {label: code for code, label in enumerate(encoder.classes_)}
    # Labels that were not seen during fitting get the fallback value
    return np.array([label_to_code.get(label, fallback_value) for label in labels])

# Encode test labels by using the transform_with_unknown function for unseen labels
y_test_hazard_category = transform_with_unknown(hazard_category_encoder, test_data['hazard-category'])
y_test_product_category = transform_with_unknown(product_category_encoder, test_data['product-category'])
y_test_hazard = transform_with_unknown(hazard_encoder, test_data['hazard'])
y_test_product = transform_with_unknown(product_encoder, test_data['product'])

# Encode validation labels by using the transform_with_unknown function for unseen labels
y_valid_hazard_category = transform_with_unknown(hazard_category_encoder, valid_data['hazard-category'])
y_valid_product_category = transform_with_unknown(product_category_encoder, valid_data['product-category'])
y_valid_hazard = transform_with_unknown(hazard_encoder, valid_data['hazard'])
y_valid_product = transform_with_unknown(product_encoder, valid_data['product'])

# Applying Tf-Idf Vectorization
vectorizer = TfidfVectorizer(max_features=5000, strip_accents='unicode', analyzer='char', ngram_range=(2, 5))
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
X_valid_vec = vectorizer.transform(X_valid)

Model Training:  
Each classifier is trained on the TF-IDF vectorized text data. For the training I used XGBoost Classifiers for each task:  
ST1: Hazard Category Prediction  
ST1: Product Category Prediction  
ST2: Hazard Prediction  
ST2: Product Prediction  
XGBoost (Extreme Gradient Boosting) is a powerful machine learning algorithm that is widely used for structured data and classification tasks.  
I chose it for this task over the other models I tried (K-Nearest Neighbors, Random Forest, MultinomialNB, Hist-Gradient-Boosting, SVM and Logistic Regression) for the following reasons:  

Handling imbalanced data:  
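XGBoost offers options for imbalanced datasets, such as scale_pos_weight (for binary problems) and per-sample weights, and each boosting iteration fits the errors left by the previous trees. This makes it suitable for classification tasks where some categories appear less frequently. Note that the models below use the default settings, without explicit class weighting.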

High predictive performance:  
Compared to other classifiers such as Logistic Regression or traditional Decision Trees, XGBoost typically delivers higher accuracy and better generalization due to its ensemble learning approach.

Regularization and pruning:  
Unlike simple decision trees, XGBoost applies L1 and L2 regularization, preventing overfitting. It also uses a pruning strategy to remove splits that do not improve performance.

Handling Sparse Features:  
Since we are using TF-IDF vectorization, which results in a sparse representation of text, XGBoost efficiently handles this type of input, unlike some classifiers that struggle with sparse data.

Compared with other methods such as SVM (Support Vector Machines) and Random Forest, which I mentioned above, XGBoost was preferred because:  
SVMs can be computationally expensive when working with large datasets.  
Random Forest does not perform as well on high-dimensional text data compared to boosting methods.   

All in all, I chose XGBoost because it balances efficiency, interpretability, and predictive performance, and it works well for this classification task.  
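
As a possible follow-up to the imbalance point above, here is a minimal sketch of "balanced" per-sample weights in plain Python. The helper `balanced_sample_weights` and the toy labels are made up for illustration; the notebook itself trains with default settings, and whether weighting would improve the scores here is untested.

```python
from collections import Counter

def balanced_sample_weights(labels):
    """Weight each sample by n_samples / (n_classes * count of its class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return [n_samples / (n_classes * counts[label]) for label in labels]

# Toy example: three frequent-class rows and one rare-class row
toy_labels = ["biological", "biological", "biological", "allergens"]
weights = balanced_sample_weights(toy_labels)
print(weights)  # frequent class gets about 0.67 per row, rare class gets 2.0
```

Weights of this kind could be passed through the `sample_weight` argument of `XGBClassifier.fit`, so that rare classes contribute as much to the loss as frequent ones.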

In [4]:
# ST1: Training model for hazard category
hazard_category_model_2 = XGBClassifier(random_state=42, eval_metric='mlogloss', n_jobs=-1)
hazard_category_model_2.fit(X_train_vec, y_hazard_category)

In [5]:
# ST1: Training model for product category
product_category_model_2 = XGBClassifier(random_state=42, eval_metric='mlogloss', n_jobs=-1)
product_category_model_2.fit(X_train_vec, y_product_category)

In [6]:
# ST2: Training model for hazard
hazard_model_2 = XGBClassifier(random_state=42, eval_metric='mlogloss', n_jobs=-1)
hazard_model_2.fit(X_train_vec, y_hazard)

In [7]:
# ST2: Training model for product
product_model_2 = XGBClassifier(random_state=42, eval_metric='mlogloss', n_jobs=-1)
product_model_2.fit(X_train_vec, y_product)

# Predictions on Training set

In [8]:
# Predictions on the training set for ST1
hazard_category_train_preds = hazard_category_model_2.predict(X_train_vec)
product_category_train_preds = product_category_model_2.predict(X_train_vec)

In [9]:
# Predictions on the training set for ST2
hazard_train_preds = hazard_model_2.predict(X_train_vec)
product_train_preds = product_model_2.predict(X_train_vec)

# Predictions on Test set

Once trained, the models make predictions on the test set.

In [10]:
# Predictions on the test set for ST1
hazard_category_test_preds = hazard_category_model_2.predict(X_test_vec)
product_category_test_preds = product_category_model_2.predict(X_test_vec)

In [11]:
# Predictions on the test set for ST2
hazard_test_preds = hazard_model_2.predict(X_test_vec)
product_test_preds = product_model_2.predict(X_test_vec)

# Predictions on Validation set


Once trained, the models make predictions on the validation set.

In [12]:
# Predictions on the validation set for ST1
hazard_category_valid_preds = hazard_category_model_2.predict(X_valid_vec)
product_category_valid_preds = product_category_model_2.predict(X_valid_vec)

In [13]:
# Predictions on the validation set for ST2
hazard_valid_preds = hazard_model_2.predict(X_valid_vec)
product_valid_preds = product_model_2.predict(X_valid_vec)

# Evaluation on Test set


Here I use F1 scores to evaluate the performance of the models. The F1 score is the harmonic mean of precision (how many predicted positive instances are actually correct) and recall (how many actual positive instances were correctly predicted).  

We have two types of F1 scores:  
Macro F1: This calculates the F1 score for each class separately and then averages them, giving equal weight to all classes. This is useful when dealing with imbalanced datasets, as it ensures that small classes are not ignored.  
Micro F1: This aggregates predictions across all classes and computes the F1 score based on the overall precision and recall. This favors more frequent classes, meaning it is higher when the model performs well on the dominant categories.  
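
The difference between the two averages can be seen on a toy example. The sketch below re-implements both in plain Python for illustration only (the helper names are made up; the notebook itself uses `f1_score` from scikit-learn). A classifier that always predicts the frequent class gets a high micro F1 but a low macro F1.

```python
def class_counts(y_true, y_pred, label):
    """True positives, false positives and false negatives for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    # Average the per-class F1 scores, every class counting equally
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_from_counts(*class_counts(y_true, y_pred, l)) for l in labels) / len(labels)

def micro_f1(y_true, y_pred):
    # Pool the counts of all classes first, then compute a single F1
    labels = set(y_true) | set(y_pred)
    totals = [class_counts(y_true, y_pred, l) for l in labels]
    tp, fp, fn = (sum(column) for column in zip(*totals))
    return f1_from_counts(tp, fp, fn)

y_true = [0] * 8 + [1] * 2   # class 1 is rare
y_pred = [0] * 10            # the model always predicts class 0
print(f"Macro F1: {macro_f1(y_true, y_pred):.3f}")  # 0.444
print(f"Micro F1: {micro_f1(y_true, y_pred):.3f}")  # 0.800
```

In this single-label setting the micro F1 equals plain accuracy (8 of 10 correct), while the macro F1 is pulled down by the rare class that is never predicted.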

In [14]:
# Evaluation function to print macro and micro F1 scores
def print_f1_scores(y_true, y_pred, label):
    macro = f1_score(y_true, y_pred, average='macro')
    micro = f1_score(y_true, y_pred, average='micro')
    print(f"{label} - Macro F1: {macro:.2f}, Micro F1: {micro:.2f}")

print_f1_scores(y_test_hazard_category, hazard_category_test_preds, "Hazard Category")

print_f1_scores(y_test_product_category, product_category_test_preds, "Product Category")

print_f1_scores(y_test_hazard, hazard_test_preds, "Hazard")

print_f1_scores(y_test_product, product_test_preds, "Product")


Hazard Category - Macro F1: 0.71, Micro F1: 0.93
Product Category - Macro F1: 0.50, Micro F1: 0.64
Hazard - Macro F1: 0.41, Micro F1: 0.77
Product - Macro F1: 0.17, Micro F1: 0.33


In [15]:
# Final Scores on Testing data
def compute_score(hazards_true, products_true, hazards_pred, products_pred):
    # Compute F1 for hazards:
    f1_hazards = f1_score(
        hazards_true,
        hazards_pred,
        average='macro'
    )

    # Compute F1 for products:
    f1_products = f1_score(
        products_true[hazards_pred == hazards_true],
        products_pred[hazards_pred == hazards_true],
        average='macro'
    )

    return (f1_hazards + f1_products) / 2.0

# Final Score for ST1
st1_score = (f1_score(y_test_hazard_category, hazard_category_test_preds, average='macro') +
             f1_score(y_test_product_category, product_category_test_preds, average='macro')) / 2.0
print(f"\nScore Sub-Task 1: {st1_score:.3f}")

# Final Score for ST2
st2_score = compute_score(y_test_hazard, y_test_product, hazard_test_preds, product_test_preds)
print(f"Score Sub-Task 2: {st2_score:.3f}")


Score Sub-Task 1: 0.605
Score Sub-Task 2: 0.298


# Evaluation on Validation set

The evaluation with macro and micro F1 scores is repeated here, this time on the validation data.

In [16]:
# Reuses print_f1_scores defined in the test-set evaluation cell above

print_f1_scores(y_valid_hazard_category, hazard_category_valid_preds, "Hazard Category")

print_f1_scores(y_valid_product_category, product_category_valid_preds, "Product Category")

print_f1_scores(y_valid_hazard, hazard_valid_preds, "Hazard")

print_f1_scores(y_valid_product, product_valid_preds, "Product")


Hazard Category - Macro F1: 0.71, Micro F1: 0.92
Product Category - Macro F1: 0.49, Micro F1: 0.61
Hazard - Macro F1: 0.44, Micro F1: 0.78
Product - Macro F1: 0.17, Micro F1: 0.33


Explaining the results:  

Hazard Category Classification:  
Macro F1: 0.71 → The model is performing well across all categories on average, meaning it balances precision and recall effectively. Some classes might be performing better than others, but overall, the results are good.  
Micro F1: 0.92 → The model is making very few mistakes in total across all predictions, meaning it predicts the correct hazard category for most cases. 
All in all, the model is strong at recognizing hazard categories, with high accuracy overall. However, individual smaller classes might have slightly lower performance.

Product Category Classification:  
Macro F1: 0.49 → The model struggles more with some product categories, meaning that its performance varies significantly between different classes.  
Micro F1: 0.61 → The model still correctly classifies a majority of cases, but compared to hazard category classification, it makes more mistakes overall.  
All in all, we can see that the performance for product categories is weaker than the one for hazard categories, suggesting that there is more variability in the product labels or that the text data does not contain as clear distinctions between product categories.

Hazard Classification:  
Macro F1: 0.44 → The model has difficulty generalizing across all hazard classes, meaning that while it may classify some hazards well, others are much harder to predict correctly.  
Micro F1: 0.78 → The model is still making mostly correct predictions overall but is struggling with less frequent hazard types.  
Here the classification of hazards is weaker than that of hazard categories, likely due to the larger number of hazard types and the greater variation among them in the dataset.

Product Classification:  
Macro F1: 0.17 → The model performs poorly across all product classes, meaning that many product labels are misclassified.  
Micro F1: 0.33 → Only about a third of all examples receive the correct product label, so the error rate is high even on the more common products.  
Product classification is clearly the hardest task for the model, possibly because product names are more ambiguous or because the text contains less information to distinguish between them. Unseen product labels in the validation set are an additional problem: since these labels did not appear in the training data, the model can neither learn nor predict them. Because I assigned a fallback value to unseen labels with the transform_with_unknown function, the model still makes a prediction for every sample, but every sample with an unseen label is necessarily counted as an error, which contributes to the low F1 scores.
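
To judge how much unseen labels matter, one could measure how many evaluation rows carry a label that never occurs in training. The sketch below is a hypothetical helper with made-up toy labels, not part of the original pipeline.

```python
def unseen_label_rate(train_labels, eval_labels):
    """Return the fraction of evaluation rows with a label absent from training,
    together with the sorted list of those unseen labels."""
    known = set(train_labels)
    unseen = [label for label in eval_labels if label not in known]
    return len(unseen) / len(eval_labels), sorted(set(unseen))

# Toy labels, invented for illustration
train = ["ice cream", "chicken", "ice cream", "almonds"]
valid = ["chicken", "pistachios", "ice cream", "tahini"]
rate, missing = unseen_label_rate(train, valid)
print(rate)     # 0.5
print(missing)  # ['pistachios', 'tahini']
```

In the notebook this could be called as `unseen_label_rate(train_data['product'], valid_data['product'])`. Rows with unseen labels can never be predicted correctly, so one minus this rate is an upper bound on the achievable micro F1.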

In [17]:
# Final Scores on Validation data (reuses compute_score defined in the test-set evaluation cell above)

# Final Score for ST1
st1_score = (f1_score(y_valid_hazard_category, hazard_category_valid_preds, average='macro') +
             f1_score(y_valid_product_category, product_category_valid_preds, average='macro')) / 2.0
print(f"\nScore Sub-Task 1: {st1_score:.3f}")

# Final Score for ST2
st2_score = compute_score(y_valid_hazard, y_valid_product, hazard_valid_preds, product_valid_preds)
print(f"Score Sub-Task 2: {st2_score:.3f}")


Score Sub-Task 1: 0.598
Score Sub-Task 2: 0.308


Explanation of the final scores:

The final scores are calculated by averaging the macro F1 scores for the respective tasks.  
Sub-Task 1 - Hazard Category, Product Category:  
Final Score: 0.598  
This is the average of the macro F1 scores for hazard category (0.71) and product category (0.49).
The score is moderate, meaning the model performs reasonably well in classifying hazard and product categories.  

Sub-Task 2 - Hazard, Product:  
Final Score: 0.308  
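This is the average of the macro F1 score for hazard (0.44) and the macro F1 score for product computed only on the examples whose hazard was predicted correctly (as in compute_score). It is therefore not exactly the average of the 0.44 and 0.17 reported above.  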
The low score suggests that predicting specific hazard and product labels is much more difficult than classifying broader categories. The model struggles to differentiate products effectively.  
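
The Sub-Task 2 score returned by compute_score can be written out as follows. The notation is introduced here for illustration and does not come from the task description: $h_i$ and $p_i$ are the true hazard and product of example $i$, $\hat{h}_i$ and $\hat{p}_i$ are the predictions, and $S$ is the set of examples whose hazard is predicted correctly.

$$
S = \{\, i : \hat{h}_i = h_i \,\}, \qquad
\mathrm{ST2} = \frac{1}{2}\Big( F_1^{\mathrm{macro}}(h, \hat{h}) + F_1^{\mathrm{macro}}(p_S, \hat{p}_S) \Big)
$$

Because the product term only looks at the examples in $S$, it generally differs from the product macro F1 computed over all examples.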

In conclusion, the model performs best in hazard category classification, with a high micro F1 score (0.92), meaning it gets most classifications correct overall.  
Performance drops for product category classification, meaning it is harder for the model to distinguish between product categories.  
The worst performance is seen in product classification, meaning the model struggles to classify specific product names.  
The final scores reflect these trends, with hazard categories being easier to classify than individual hazards and products.  

# Save submission file (validation data results)


The final results are saved to a CSV file, which is then packed into a ZIP archive for submission (for download, in our case). The submission file contains the results for the validation data only, not for the test data.

In [19]:
import zipfile
import io
# Decoding the predictions with inverse_transform
hazard_category_valid_preds_decoded = hazard_category_encoder.inverse_transform(hazard_category_valid_preds)
product_category_valid_preds_decoded = product_category_encoder.inverse_transform(product_category_valid_preds)
hazard_valid_preds_decoded = hazard_encoder.inverse_transform(hazard_valid_preds)
product_valid_preds_decoded = product_encoder.inverse_transform(product_valid_preds)

# Combine all decoded predictions to a DataFrame
submission_df = pd.DataFrame({
    "index": valid_data.index,  
    "hazard_category": hazard_category_valid_preds_decoded,  
    "product_category": product_category_valid_preds_decoded, 
    "hazard": hazard_valid_preds_decoded,  
    "product": product_valid_preds_decoded  
})

# Create a zip file containing the 'submission.csv'
csv_buffer = io.StringIO()
submission_df.to_csv(csv_buffer, index=False)
csv_data = csv_buffer.getvalue()

with zipfile.ZipFile("submission.zip", "w") as zipf:
    zipf.writestr("submission.csv", csv_data)  


The validation dataset consists of 565 examples (see the shape reported in the data quality check), so the submission CSV file contains 565 rows plus a header line, 566 lines in total. More specifically, there are 565 food recall texts, and for each of these texts we generate four predicted labels.  
The final score (Sub-Task 1: 0.598, Sub-Task 2: 0.308) is computed based on how well our model's predictions match the actual categories in this dataset.  
Each row in the submission file corresponds to a single food recall entry, and for each row, the model predicts four values:  
Hazard Category  
Product Category  
Hazard  
Product  
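
A quick sanity check of such an archive can be sketched as follows. The example builds a small in-memory ZIP with toy rows (the values are invented) and reads it back, to show that a submission CSV has one header line plus one line per example.

```python
import csv
import io
import zipfile

# Toy rows standing in for the real predictions
rows = [{"index": i, "hazard": "salmonella"} for i in range(3)]

# Write the CSV text, then store it in an in-memory ZIP archive
csv_text = io.StringIO()
writer = csv.DictWriter(csv_text, fieldnames=["index", "hazard"])
writer.writeheader()
writer.writerows(rows)

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("submission.csv", csv_text.getvalue())

# Read the archive back and count the lines
buffer.seek(0)
with zipfile.ZipFile(buffer) as zf:
    names = zf.namelist()
    lines = zf.read("submission.csv").decode("utf-8").splitlines()

print(names)           # ['submission.csv']
print(len(lines))      # 4 lines: 1 header + 3 data rows
print(len(lines) - 1)  # 3 data rows
```

The same read-back step could be pointed at the `submission.zip` written above to confirm the number of data rows before uploading.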