<a href="https://colab.research.google.com/github/kennethmugo/Swahili-SMS-Spam-Detection/blob/main/research/qwen3_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import os
import pandas as pd
import numpy as np
from tqdm import tqdm
import kagglehub
from sentence_transformers import SentenceTransformer

In [6]:
path = kagglehub.dataset_download("henrydioniz/swahili-sms-detection-dataset")
full_path = os.path.join(path, "bongo_scam.csv")
df = pd.read_csv(full_path)
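# Note: drop_duplicates() compares every column, including "Unnamed: 0".
# If that column is a saved row index, repeated messages are not removed here;
# drop_duplicates(subset=["Category", "Sms"]) would deduplicate on content.
df = df.drop_duplicates()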
df.head()

Unnamed: 0,Category,Sms
0,trust,"Nipigie baada ya saa moja, tafadhali."
1,scam,Naomba unitumie iyo Hela kwenye namba hii ya A...
2,scam,"666,KARIBU FREEMASON UTIMIZE NDOTO KATIKA BIAS..."
3,trust,Watoto wanapenda sana zawadi ulizowaletea.
4,scam,IYO PESA ITUME KWENYE NAMBA HII 0657538690 JIN...


In [18]:
# Load the model
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

Now that the model is loaded, we need to generate the embeddings. These embeddings will then be fed to another model downstream to perform the actual classification.

In [12]:
def generate_embeddings_batch(texts, batch_size=32) -> np.ndarray:
    """Generate embeddings in batches to manage memory"""
    all_embeddings = []

    for i in tqdm(range(0, len(texts), batch_size)):
        batch = texts[i:i + batch_size]
        embeddings = model.encode(batch)
        all_embeddings.append(embeddings)

    return np.vstack(all_embeddings)

In [13]:
# Generate embeddings for all SMS
print("Generating embeddings...")
X = generate_embeddings_batch(df['Sms'].tolist())
y = (df['Category'] == 'scam').astype(int)  # Convert labels to binary

Generating embeddings...


100%|██████████| 44/44 [00:31<00:00,  1.38it/s]


In [14]:
## Generate the train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [15]:
## Train a basic logistic regression model and evaluate it on the held-out test split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

clf = LogisticRegression(max_iter=100)
clf.fit(X_train, y_train)

## Generate a classification report
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       100
           1       1.00      1.00      1.00       176

    accuracy                           1.00       276
   macro avg       1.00      1.00      1.00       276
weighted avg       1.00      1.00      1.00       276
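
The report above comes from a single train-test split. A cross-validated estimate could look roughly like the sketch below. It runs on synthetic data from `make_classification` as a stand-in for the embedding matrix (an assumption, so the snippet runs without the model); in the notebook, `X` and `y` would take the place of `X_demo` and `y_demo`. The fold count and `max_iter` value are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the embeddings and labels (assumption, not the real data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=32, n_informative=8, random_state=42
)

# Stratified folds keep the class ratio roughly constant in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv, scoring="f1"
)

print("F1 per fold:", np.round(scores, 3))
print(f"Mean F1: {scores.mean():.3f} (std {scores.std():.3f})")
```

A large spread between folds would suggest that the single-split score depends heavily on which messages landed in the test set.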



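Wow! The Qwen embeddings give a perfect score on the 276-message test set. A result this clean on a small test set could also mean the dataset is easy or contains near-duplicate messages, so let me try the model on some messages that I received on my phone to see how it performs on that data.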
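
Before moving on, one caveat on that score: a perfect result can also appear when near-identical messages end up on both sides of the split. A rough way to check is to compute, for each test embedding, its highest cosine similarity to any training embedding. The sketch below uses random toy vectors with one planted duplicate; `max_train_similarity` and the `THRESHOLD` value are hypothetical names and choices, not part of the original notebook. In the notebook, `X_test` and `X_train` would be passed instead.

```python
import numpy as np

def max_train_similarity(X_test, X_train):
    """For each test row, return the highest cosine similarity to any train row."""
    # L2-normalise the rows so that a dot product equals cosine similarity.
    test_n = X_test / np.linalg.norm(X_test, axis=1, keepdims=True)
    train_n = X_train / np.linalg.norm(X_train, axis=1, keepdims=True)
    return (test_n @ train_n.T).max(axis=1)

# Toy vectors standing in for the real embeddings (assumption).
rng = np.random.default_rng(0)
X_train_demo = rng.normal(size=(50, 16))
X_test_demo = rng.normal(size=(10, 16))
X_test_demo[0] = X_train_demo[3] * 2.0  # planted duplicate (scaled copy of a train row)

sims = max_train_similarity(X_test_demo, X_train_demo)
THRESHOLD = 0.98  # hypothetical cut-off for "near-duplicate"
suspects = np.flatnonzero(sims > THRESHOLD)
print("Test rows that look like copies of a training row:", suspects)
```

If many test rows exceed the threshold on the real embeddings, the test score mostly measures memorisation rather than generalisation.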

In [16]:
## Test the model on a new spam SMS that I received recently.
new_sms = "HELLO. Ungana na wakenya wengi wanoSHINDA katika PICK A BOX.2024 END YEAR Bonus NI from 50,000. BONYEZA *201# BILA Credo upick BOX YAKO.STOP *456*9*5#"
new_sms_embedding = model.encode([new_sms])
prediction = clf.predict(new_sms_embedding)
probability = clf.predict_proba(new_sms_embedding)
print(prediction)
print(probability)

[1]
[[0.34077544 0.65922456]]


It gets the prediction right, but with a scam probability of only about 0.66 the confidence isn't as high as I'd like. Let me try it out on a ham message.

In [17]:
## Test the model on a new ham SMS that I received recently.
new_sms = "Leo siko kazi."
new_sms_embedding = model.encode([new_sms])
prediction = clf.predict(new_sms_embedding)
probability = clf.predict_proba(new_sms_embedding)
print(prediction)
print(probability)

[0]
[[0.96023508 0.03976492]]


That's more like it! The model assigns a probability of about 0.96 to the legitimate class, which is far more confident than the scam prediction above.
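
The two test cells repeat the same encode-then-predict steps, so they could be wrapped in a small helper. The sketch below is one possible shape for it. `score_sms`, `toy_encode`, and the toy classifier are hypothetical stand-ins so the snippet runs without downloading the model; in the notebook, `model.encode` and `clf` would be passed instead. The `threshold` parameter is an added assumption for cases where a stricter cut-off than 0.5 is wanted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def score_sms(texts, encode, classifier, threshold=0.5):
    """Return a (label, scam_probability) pair for each message."""
    embeddings = np.asarray(encode(texts))
    # Column 1 of predict_proba is P(class 1); class 1 means 'scam' in this notebook.
    scam_prob = classifier.predict_proba(embeddings)[:, 1]
    labels = np.where(scam_prob >= threshold, "scam", "trust")
    return list(zip(labels.tolist(), scam_prob.tolist()))

def toy_encode(texts):
    # Hypothetical stand-in for model.encode: counts of digits and uppercase letters.
    return np.array(
        [[sum(c.isdigit() for c in t), sum(c.isupper() for c in t)] for t in texts],
        dtype=float,
    )

# Toy classifier fitted on four made-up feature rows (illustration only).
toy_clf = LogisticRegression().fit(
    np.array([[0, 1], [1, 2], [10, 20], [12, 30]], dtype=float),
    np.array([0, 0, 1, 1]),
)

messages = [
    "Leo siko kazi.",
    "HELLO. Ungana na wakenya wengi wanoSHINDA katika PICK A BOX.2024 END YEAR "
    "Bonus NI from 50,000. BONYEZA *201# BILA Credo upick BOX YAKO.STOP *456*9*5#",
]
results = score_sms(messages, toy_encode, toy_clf)
for text, (label, prob) in zip(messages, results):
    print(f"{label:5s}  P(scam)={prob:.2f}  {text[:40]}")
```

With the real model, the call would look like `score_sms(messages, model.encode, clf)`. The probabilities from the toy classifier say nothing about the real model's behaviour.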