# Leveraging Sentence Transformers Embeddings for Multilabel Text Classification with LightGBM

In this notebook, the aim is to train on embeddings from the best sentence-transformers model instead of a frequency-based vectorization such as TF-IDF.
The hope is that giving LightGBM a better encoding of the input text will eventually yield better results.
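
To illustrate the limitation that motivates this switch, here is a minimal sketch of a frequency-based encoding. It uses a simplified TF-IDF (raw term count times `log(N / df)`), which is an assumption for illustration and not the exact smoothed formula scikit-learn applies; the two example sentences and the helper name `tfidf` are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Simplified TF-IDF: term count times log(N / document frequency)."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in tokenized for t in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors

# Two sentences with nearly the same meaning but different wording.
docs = ["hackers breached the server", "attackers compromised the server"]
vocab, vectors = tfidf(docs)

# Dot product between the two TF-IDF vectors.
overlap = sum(a * b for a, b in zip(*vectors))
print(vocab)
print(overlap)  # 0.0: the shared words occur in every document, so their weight is zero
```

The two sentences share no weighted terms, so a frequency-based encoding treats them as unrelated. A sentence embedding model is expected to place such paraphrases close together, which is the kind of signal we hope LightGBM can use.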


In [6]:
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import accuracy_score, classification_report

import lightgbm as lgb

Now, let's load and prepare the data and the embeddings:

In [7]:
# Load data
train_df = pd.read_csv('../data/processed/clean_train.csv')
valid_df = pd.read_csv('../data/processed/clean_valid.csv')

# Initialize Sentence Transformer Model
model = SentenceTransformer('BAAI/bge-small-en-v1.5')

# Transform 'clean_content' using Sentence Transformer
X_train = model.encode(train_df['clean_content'].to_list(), show_progress_bar=True)
X_valid = model.encode(valid_df['clean_content'].to_list(), show_progress_bar=True)

# Prepare labels for multilabel classification
y_train = train_df[['cyber_label', 'environmental_issue']]
y_valid = valid_df[['cyber_label', 'environmental_issue']]

# Classifier for multiple labels (one LightGBM model per label)
multioutput_classifier = MultiOutputClassifier(lgb.LGBMClassifier(verbosity=2), n_jobs=-1)
multioutput_classifier.fit(X_train, y_train)


Batches: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:07<00:00,  4.39it/s]
Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00,  4.43it/s]


In [8]:

# Prediction and evaluation
y_pred = multioutput_classifier.predict(X_valid)
for i, label in enumerate(y_train.columns):
    print(f"Accuracy for {label}: {accuracy_score(y_valid.iloc[:, i], y_pred[:, i])}")
    print(f"Classification Report for {label}:\n", classification_report(y_valid.iloc[:, i], y_pred[:, i]))


Accuracy for cyber_label: 0.9484126984126984
Classification Report for cyber_label:
               precision    recall  f1-score   support

           0       0.95      0.99      0.97       235
           1       0.75      0.35      0.48        17

    accuracy                           0.95       252
   macro avg       0.85      0.67      0.73       252
weighted avg       0.94      0.95      0.94       252

Accuracy for environmental_issue: 0.8928571428571429
Classification Report for environmental_issue:
               precision    recall  f1-score   support

           0       0.90      0.97      0.93       200
           1       0.84      0.60      0.70        52

    accuracy                           0.89       252
   macro avg       0.87      0.78      0.82       252
weighted avg       0.89      0.89      0.89       252



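Good results: the macro-average F1 is 0.73 for `cyber_label` and 0.82 for `environmental_issue`. Recall on the positive class is the weak spot, at 0.35 and 0.60 respectively.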
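
To make the macro average concrete, the sketch below recomputes the `cyber_label` scores from confusion counts. The counts (TP=6, FP=2, FN=11, TN=233) are not printed by the notebook; they are inferred here from the reported support, precision, recall, and accuracy, so treat them as an assumption. The helper name `f1_from_counts` is illustrative.

```python
def f1_from_counts(tp, fp, fn):
    """F1 for one class from its true positives, false positives, false negatives."""
    return 2 * tp / (2 * tp + fp + fn)

# Counts inferred from the cyber_label report above (assumption, not notebook output).
tp, fp, fn, tn = 6, 2, 11, 233
total = tp + fp + fn + tn

f1_pos = f1_from_counts(tp, fp, fn)  # class 1
f1_neg = f1_from_counts(tn, fn, fp)  # class 0: the roles of FP and FN swap
macro_f1 = (f1_pos + f1_neg) / 2     # unweighted mean over the two classes
accuracy = (tp + tn) / total

# Accuracy of a trivial model that always predicts class 0.
majority_baseline = (tn + fp) / total

print(f"F1 class 1: {f1_pos:.2f}")                    # 0.48
print(f"F1 class 0: {f1_neg:.2f}")                    # 0.97
print(f"macro F1:   {macro_f1:.2f}")                  # 0.73
print(f"accuracy:   {accuracy:.3f}")                  # 0.948
print(f"always-0 accuracy: {majority_baseline:.3f}")  # 0.933
```

With only 17 positive examples in the validation set, accuracy is dominated by class 0, and always predicting 0 would already reach about 0.93. That is why the macro-average F1 is the more informative number here.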

Let's try the upsampled training set:

In [9]:
# Load data
train_df = pd.read_csv('../data/processed/clean_train_upsampled.csv')
valid_df = pd.read_csv('../data/processed/clean_valid.csv')

# Initialize Sentence Transformer Model
model = SentenceTransformer('BAAI/bge-small-en-v1.5')

# Transform 'clean_content' using Sentence Transformer
X_train = model.encode(train_df['clean_content'].to_list(), show_progress_bar=True)
X_valid = model.encode(valid_df['clean_content'].to_list(), show_progress_bar=True)

# Prepare labels for multilabel classification
y_train = train_df[['cyber_label', 'environmental_issue']]
y_valid = valid_df[['cyber_label', 'environmental_issue']]

# MultiOutput Classifier
multioutput_classifier = MultiOutputClassifier(lgb.LGBMClassifier(verbosity=2), n_jobs=-1)
multioutput_classifier.fit(X_train, y_train)


# Prediction and evaluation
y_pred = multioutput_classifier.predict(X_valid)
for i, label in enumerate(y_train.columns):
    print(f"Accuracy for {label}: {accuracy_score(y_valid.iloc[:, i], y_pred[:, i])}")
    print(f"Classification Report for {label}:\n", classification_report(y_valid.iloc[:, i], y_pred[:, i]))


Batches: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 91/91 [00:20<00:00,  4.53it/s]
Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00,  4.42it/s]


Accuracy for cyber_label: 0.9404761904761905
Classification Report for cyber_label:
               precision    recall  f1-score   support

           0       0.95      0.99      0.97       235
           1       0.62      0.29      0.40        17

    accuracy                           0.94       252
   macro avg       0.79      0.64      0.68       252
weighted avg       0.93      0.94      0.93       252

Accuracy for environmental_issue: 0.9047619047619048
Classification Report for environmental_issue:
               precision    recall  f1-score   support

           0       0.92      0.96      0.94       200
           1       0.83      0.67      0.74        52

    accuracy                           0.90       252
   macro avg       0.88      0.82      0.84       252
weighted avg       0.90      0.90      0.90       252
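
How `clean_train_upsampled.csv` was produced is not shown in this notebook. A minimal sketch of one common approach, random oversampling of the minority class with replacement, is given below. The function name `upsample_minority`, the toy rows, and the fixed seed are assumptions for illustration, and the actual file may have been built differently.

```python
import random

def upsample_minority(rows, label_key, seed=0):
    """Duplicate minority-class rows at random until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [r for r in rows if r[label_key] == 1]
    neg = [r for r in rows if r[label_key] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        raise ValueError("cannot upsample: one class has no rows")
    # Sample with replacement to fill the gap between the two classes.
    extra = rng.choices(minority, k=len(majority) - len(minority))
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced

# Toy data: 2 positive rows and 8 negative rows (hypothetical).
rows = [{"clean_content": f"doc {i}", "cyber_label": int(i < 2)} for i in range(10)]
balanced = upsample_minority(rows, "cyber_label")
n_pos = sum(r["cyber_label"] for r in balanced)
print(len(balanced), n_pos)  # 16 8
```

Duplicated rows add no new information about the minority class, which may help explain why upsampling did not clearly improve the validation scores.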



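The results are mixed: macro-average F1 for `cyber_label` dropped from 0.73 to 0.68, while `environmental_issue` rose slightly from 0.82 to 0.84. Back to the cleaned set, let's play around with the hyperparameters: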

In [11]:
# Load data
train_df = pd.read_csv('../data/processed/clean_train.csv')
valid_df = pd.read_csv('../data/processed/clean_valid.csv')

# Initialize Sentence Transformer Model
model = SentenceTransformer('BAAI/bge-small-en-v1.5')

# Transform 'clean_content' using Sentence Transformer
X_train = model.encode(train_df['clean_content'].to_list(), show_progress_bar=True)
X_valid = model.encode(valid_df['clean_content'].to_list(), show_progress_bar=True)

# Prepare labels for multilabel classification
y_train = train_df[['cyber_label', 'environmental_issue']]
y_valid = valid_df[['cyber_label', 'environmental_issue']]

# MultiOutput Classifier
multioutput_classifier = MultiOutputClassifier(
    lgb.LGBMClassifier(
        verbosity=0,
        min_data_in_leaf=30, 
        class_weight='balanced',
        learning_rate=0.15,
        n_estimators=300,
    ),
    n_jobs=-1,
)
multioutput_classifier.fit(X_train, y_train)


# Prediction and evaluation
y_pred = multioutput_classifier.predict(X_valid)
for i, label in enumerate(y_train.columns):
    print(f"Accuracy for {label}: {accuracy_score(y_valid.iloc[:, i], y_pred[:, i])}")
    print(f"Classification Report for {label}:\n", classification_report(y_valid.iloc[:, i], y_pred[:, i]))


Batches: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:07<00:00,  4.39it/s]
Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00,  4.43it/s]




huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Accuracy for cyber_label: 0.9484126984126984
Classification Report for cyber_label:
               precision    recall  f1-score   support

           0       0.96      0.99      0.97       235
           1       0.70      0.41      0.52        17

    accuracy                           0.95       252
   macro avg       0.83      0.70      0.75       252
weighted avg       0.94      0.95      0.94       252

Accuracy for environmental_issue: 0.9007936507936508
Classification Report for environmental_issue:
               precision    recall  f1-score   support

           0       0.92      0.95      0.94       200
           1       0.80      0.69      0.74        52

    accuracy                           0.90       252
   macro avg       0.86      0.82      0.84       252
weighted avg       0.90      0.90      0.90       252
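
The `huggingface/tokenizers` message in the output above is a warning, not an error. It suggests setting `TOKENIZERS_PARALLELISM` explicitly; a minimal sketch of doing so follows. Choosing `false` is an assumption that favours quiet output over tokenizer-level parallelism.

```python
import os

# Set this before sentence_transformers / tokenizers are first used,
# for example in the first cell of the notebook.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])  # false
```

The fork most likely comes from `MultiOutputClassifier(..., n_jobs=-1)` starting worker processes after `model.encode` has already run, so the variable needs to be set before the first `encode` call to take effect.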





In [13]:
# Load data
train_df = pd.read_csv('../data/processed/clean_train.csv')
valid_df = pd.read_csv('../data/processed/clean_valid.csv')

# Initialize Sentence Transformer Model
model = SentenceTransformer('BAAI/bge-small-en-v1.5')

# Transform 'clean_content' using Sentence Transformer
X_train = model.encode(train_df['clean_content'].to_list(), show_progress_bar=True)
X_valid = model.encode(valid_df['clean_content'].to_list(), show_progress_bar=True)

# Prepare labels for multilabel classification
y_train = train_df[['cyber_label', 'environmental_issue']]
y_valid = valid_df[['cyber_label', 'environmental_issue']]

# MultiOutput Classifier
multioutput_classifier = MultiOutputClassifier(
    lgb.LGBMClassifier(
        verbosity=0,
        min_data_in_leaf=20, 
        class_weight='balanced',
        boosting_type='dart',
        num_leaves=50,
        learning_rate=0.1,
        n_estimators=400,
    ),
    n_jobs=-1,
)
multioutput_classifier.fit(X_train, y_train)


# Prediction and evaluation
y_pred = multioutput_classifier.predict(X_valid)
for i, label in enumerate(y_train.columns):
    print(f"Accuracy for {label}: {accuracy_score(y_valid.iloc[:, i], y_pred[:, i])}")
    print(f"Classification Report for {label}:\n", classification_report(y_valid.iloc[:, i], y_pred[:, i]))


Batches: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:07<00:00,  4.40it/s]
Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00,  4.44it/s]


[LightGBM] [Info] Number of positive: 87, number of negative: 921
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.000000
[LightGBM] [Debug] init for col-wise cost 0.000008 seconds, init for row-wise cost 0.001762 seconds
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.009165 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 97919
[LightGBM] [Info] Number of data points in the train set: 1008, number of used features: 384
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.086310 -> initscore=-2.359552
[LightGBM] [Info] Start training from score -2.359552
[LightGBM] [Debug] Trained a tree with leaves = 18 and depth = 8
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 15
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 14
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 16
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 11

Accuracy for cyber_label: 0.9523809523809523
Classification Report for cyber_label:
               precision    recall  f1-score   support

           0       0.96      1.00      0.97       235
           1       0.86      0.35      0.50        17

    accuracy                           0.95       252
   macro avg       0.91      0.67      0.74       252
weighted avg       0.95      0.95      0.94       252

Accuracy for environmental_issue: 0.8928571428571429
Classification Report for environmental_issue:
               precision    recall  f1-score   support

           0       0.91      0.95      0.93       200
           1       0.79      0.65      0.72        52

    accuracy                           0.89       252
   macro avg       0.85      0.80      0.82       252
weighted avg       0.89      0.89      0.89       252
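
The `class_weight='balanced'` setting used in the tuned models weights each class inversely to its frequency, as `n_samples / (n_classes * count)`. The sketch below applies that formula to the counts printed in the LightGBM log above (87 positives, 921 negatives). The helper name `balanced_class_weights` is illustrative and not part of LightGBM.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * count_of_that_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Class counts taken from the LightGBM log: 87 positives, 921 negatives.
labels = [1] * 87 + [0] * 921
weights = balanced_class_weights(labels)
for cls in sorted(weights):
    print(cls, round(weights[cls], 3))
# 0 0.547
# 1 5.793
```

Each positive example then counts roughly 10.6 times as much as a negative one, so both classes contribute the same total weight to the loss.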



In [15]:
# Load data
train_df = pd.read_csv('../data/processed/clean_train.csv')
valid_df = pd.read_csv('../data/processed/clean_valid.csv')

# Initialize Sentence Transformer Model
model = SentenceTransformer('BAAI/bge-small-en-v1.5')

# Transform 'clean_content' using Sentence Transformer
X_train = model.encode(train_df['clean_content'].to_list(), show_progress_bar=True)
X_valid = model.encode(valid_df['clean_content'].to_list(), show_progress_bar=True)

# Prepare labels for multilabel classification
y_train = train_df[['cyber_label', 'environmental_issue']]
y_valid = valid_df[['cyber_label', 'environmental_issue']]

# MultiOutput Classifier
multioutput_classifier = MultiOutputClassifier(
    lgb.LGBMClassifier(
        verbosity=0,
        min_data_in_leaf=20, 
        class_weight='balanced',
        boosting_type='dart',
        num_leaves=20,
        learning_rate=0.1,
        n_estimators=100,
    ),
    n_jobs=-1,
)
multioutput_classifier.fit(X_train, y_train)


# Prediction and evaluation
y_pred = multioutput_classifier.predict(X_valid)
for i, label in enumerate(y_train.columns):
    print(f"Accuracy for {label}: {accuracy_score(y_valid.iloc[:, i], y_pred[:, i])}")
    print(f"Classification Report for {label}:\n", classification_report(y_valid.iloc[:, i], y_pred[:, i]))


Batches: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:07<00:00,  4.39it/s]
Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00,  4.43it/s]


Accuracy for cyber_label: 0.9404761904761905
Classification Report for cyber_label:
               precision    recall  f1-score   support

           0       0.95      0.98      0.97       235
           1       0.60      0.35      0.44        17

    accuracy                           0.94       252
   macro avg       0.78      0.67      0.71       252
weighted avg       0.93      0.94      0.93       252

Accuracy for environmental_issue: 0.8928571428571429
Classification Report for environmental_issue:
               precision    recall  f1-score   support

           0       0.93      0.94      0.93       200
           1       0.76      0.71      0.73        52

    accuracy                           0.89       252
   macro avg       0.84      0.83      0.83       252
weighted avg       0.89      0.89      0.89       252

[LightGBM] [Info] Number of positive: 1452, number of negative: 1452
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.000000
[LightGBM] [De