# Multi-label Classification

This notebook performs multi-label classification of categories on text scraped from individual website pages. The text was scraped by ShareThisPredactiv; the notebook ingests with_label.csv, which ShareThisPredactiv shared with IBM. Each row in the data has a url, the scraped text, and the correct labels according to ShareThis's classification. For each row/url/text there are 3 levels of category labels, and each level can have zero or more category labels. This notebook demonstrates only how to generate categories at the third (lowest) level. For example, for ['/games/computer_&_video_games/adventure_games', '/arts_&_entertainment/fun_&_trivia'] the lowest-level labels are adventure_games and fun_&_trivia. Note that the filtering code below keeps only label paths with at least three levels, so of these two only adventure_games is retained.
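
As an illustration of the label structure described above, here is a minimal sketch of pulling the third-level label out of a label path. The helper name `third_level` is hypothetical; it mirrors the "at least three levels" rule used by the filtering cell later in this notebook.

```python
def third_level(path):
    """Return the third path component, or None if the path has fewer than 3 levels."""
    parts = path.strip("/").split("/")
    return parts[2] if len(parts) >= 3 else None

# Example label list taken from the description above.
labels = [
    "/games/computer_&_video_games/adventure_games",
    "/arts_&_entertainment/fun_&_trivia",
]

# Keep only the paths that actually have a third level.
kept = [t for t in (third_level(p) for p in labels) if t is not None]
print(kept)
```

The second path has only two levels, so it contributes nothing to the third-level label list.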

There are 4 steps in the process:
1. Prepare the data for processing
2. Embed the scraped text and the categories
3. Get the 55 most similar categories for each text using cosine similarity with category frequency bucketing (and test the precision and recall)
4. Generate classification categories for each text (and test the precision and recall)

In [None]:
import ast
import os, types
import pandas as pd
from botocore.client import Config
import ibm_boto3
def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.

cos_client = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='ADD your api key',
    ibm_auth_endpoint="https://iam.cloud.ibm.com/identity/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3.us-south.cloud-object-storage.appdomain.cloud')

bucket = 'Add your bucket name'
object_key = 'with_label.csv'

body = cos_client.get_object(Bucket=bucket,Key=object_key)['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df = pd.read_csv(body)
df['label'] = df['label'].apply(ast.literal_eval)
#use only first 500 words
df['text'] = df['text'].apply(lambda x: ' '.join(str(x).split()[:500]))

In [2]:
df['label'] = df['label'].apply(
    lambda labels: [label.replace("_", " ") for label in labels] if isinstance(labels, list) else labels)

In [3]:
#keep only the third-level component of each label path; the rest of the code works only with third-level categories
df['categories'] = df['label'].apply(lambda lst: [
    '/'.join(s.strip('/').split('/')[2:3])
    for s in lst
    if len(s.strip('/').split('/')) >= 3
])

# Get Categories and Embeddings

1. Get all unique categories at the third level. This dataset has at least one example of every available category.
2. Count categories. The counts are later used to create frequency buckets that make a category more likely to be selected as one of the top k categories if it appears more often in the dataset. This bucketing was also tested on a 70k-row dataset from ShareThis that did not have labels. ShareThis tested the accuracy against labels on their end and reported that the precision and recall were better than the results from this test set. ShareThis also confirmed that the category frequency distribution in this dataset is aligned with what they generally see in their data and can therefore be used for training.
3. Embed scraped text and categories
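
The counting and bucketing in step 2 can be sketched on toy data as follows. The rows in `toy_rows` are hypothetical; the thresholds and bonus weights are the ones defined in the cells later in this notebook.

```python
from collections import Counter

# Hypothetical toy rows: each row is the list of third-level categories for one page.
toy_rows = [["adventure games"]] * 31 + [["herbs & spices"]] * 5 + [["corporate events"]]

# Count how often each category appears across all rows.
counts = Counter(cat for row in toy_rows for cat in row)

def assign_frequency_bucket(freq):
    """Map a raw category count to a named frequency bucket."""
    if freq >= 30:
        return "very high"
    elif freq >= 20:
        return "high"
    elif freq >= 9:
        return "medium"
    elif freq >= 4:
        return "low"
    return "none"

bucket_weights = {"very high": 0.25, "high": 0.2, "medium": 0.15, "low": 0.05, "none": 0.0}

# Bonus that would later be added to each category's similarity score.
bonus = {cat: bucket_weights[assign_frequency_bucket(n)] for cat, n in counts.items()}
print(bonus)
```

Frequent categories receive a larger additive bonus, which is what nudges them into the top k.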

In [4]:
df['categories'] = df['categories'].apply(lambda lst: list(dict.fromkeys(lst)))
df['categories_count'] = df['categories'].apply(len)
df = df[df['categories_count'] != 0]
unique_categories = sorted(set(cat for row in df['categories'] for cat in row))

In [5]:
#!pip install sentence-transformers

In [6]:
from sentence_transformers import SentenceTransformer
import numpy as np
import pandas as pd

model = SentenceTransformer("all-MiniLM-L6-v2")

embeddings = model.encode(unique_categories, normalize_embeddings=True)

#store each category with its embedding (the scraped text is embedded in the next cell)
category_df = pd.DataFrame({
    "category": unique_categories,
    "embedding": list(embeddings)
})


In [7]:
content = df['text'].astype(str).tolist()
content_embedding = model.encode(content, normalize_embeddings=True)
df['text_embedding'] = list(content_embedding)

In [8]:
df.head()

Unnamed: 0,url,text,label,categories,categories_count,text_embedding
0,https://abcnews.go.com/International/photos/is...,Israel-Gaza-Lebanon-Syria conflict: Slideshow ...,[/arts & entertainment/online media/online ima...,[online image galleries],1,"[0.030414103, 0.04519361, 0.015754465, -0.0839..."
1,https://acapt.org/educationevents/advancing-ac...,Advancing Accessibility & Disability Equity Su...,[/business & industrial/business services/corp...,"[corporate events, colleges & universities, ac...",4,"[-0.06956346, -0.05836557, 0.004113691, 0.0333..."
2,https://adcable.com/products/rf-communication....,RF Wire & Cable Manufacturer | Advanced Digita...,[/computers & electronics/electronics & electr...,[electronic components],1,"[-0.13069408, -0.02237501, -0.044177406, 0.012..."
3,https://adventuregamers.com/companies/view/27602,Adventure games by Terrible Toybox | Adventure...,[/games/computer & video games/adventure games...,[adventure games],1,"[0.03588909, 0.039767638, 0.04656665, -0.08386..."
4,https://aggie-horticulture.tamu.edu/vegetable/...,Ginger - Vegetable Resources Vegetable Resourc...,"[/food & drink/food/fruits & vegetables, /food...","[fruits & vegetables, herbs & spices, crops & ...",3,"[-0.020839928, -0.06396611, -0.08687351, 0.016..."


In [9]:
#create frequency buckets for categories - See Markdown cell labeled 'Get Categories and Embeddings' for more info
def assign_frequency_bucket(freq):
    if freq >= 30: return "very high"
    elif freq >= 20: return "high"
    elif freq >= 9: return "medium"
    elif freq >= 4: return "low"
    else: return "none"

In [10]:
from collections import Counter

all_labels = df['categories'].explode()
label_counts = Counter(all_labels)

In [11]:
bucket_weights = {
    "very high": 0.25,
    "high": 0.2,
    "medium": 0.15,
    "low": 0.05,
    "none": 0.0,}

In [12]:
bucket_map = {cat: assign_frequency_bucket(freq) for cat, freq in label_counts.items()}

# Get top 55 categories using cosine similarity with category frequency bucketing

For each text, compute the cosine similarity against every category embedding, add the frequency-bucket bonus, and keep
the 55 highest-scoring categories. The purpose is to narrow down the list of potential categories before generating the predicted categories.
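
A minimal pure-Python sketch of the scoring idea, assuming hypothetical 2-dimensional embeddings and hypothetical bonus values. The helper names `cosine` and `top_k_with_bonus` are illustrative and are not the functions used in the cells below.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_with_bonus(text_vec, category_vecs, bonuses, k):
    """Rank categories by cosine similarity plus an additive frequency bonus."""
    scored = [
        (cat, cosine(text_vec, vec) + bonuses.get(cat, 0.0))
        for cat, vec in category_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [cat for cat, _ in scored[:k]]

# Hypothetical toy embeddings and bonuses.
category_vecs = {
    "adventure games": [1.0, 0.0],
    "fun & trivia": [0.8, 0.6],
    "herbs & spices": [0.0, 1.0],
}
bonuses = {"fun & trivia": 0.25}
text_vec = [1.0, 0.0]

plain = top_k_with_bonus(text_vec, category_vecs, {}, k=2)
boosted = top_k_with_bonus(text_vec, category_vecs, bonuses, k=2)
print(plain)
print(boosted)
```

In this toy case the bonus moves "fun & trivia" ahead of "adventure games" even though its raw similarity is lower, which shows how strongly the bucket weights can influence the ranking.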

In [13]:
from sklearn.metrics.pairwise import cosine_similarity
def frequency_bucket_top_k(row, category_df, bucket_map, bucket_weights, k):
    query_vec = row['text_embedding']
    sims = cosine_similarity([query_vec], list(category_df['embedding']))[0]

    adjusted_scores = []
    for idx, cat_row in category_df.iterrows():
        cat = cat_row['category']
        sim = sims[idx]

        bucket = bucket_map.get(cat, "low")
        bonus = bucket_weights.get(bucket, 0.0)

        adjusted = sim + bonus
        adjusted_scores.append((cat, adjusted))

    top_k = sorted(adjusted_scores, key=lambda x: x[1], reverse=True)[:k]
    return [cat for cat, _ in top_k]

In [14]:
df['top_k_categories'] = df.apply(lambda row: frequency_bucket_top_k(row, category_df, bucket_map, bucket_weights, k=55), axis=1)

In [15]:
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import (
    accuracy_score,        
    f1_score,
    precision_score,
    recall_score,
    jaccard_score,
    hamming_loss,
    classification_report,
    multilabel_confusion_matrix
)

In [16]:
y_true = df['categories']
y_pred = df['top_k_categories']
mlb = MultiLabelBinarizer()

mlb.fit(list(y_true) + list(y_pred))
Y_true = mlb.transform(y_true)
Y_pred = mlb.transform(y_pred)

subset_accuracy = (Y_true == Y_pred).all(axis=1).mean()
micro_f1 = f1_score(Y_true, Y_pred, average='micro')
macro_f1 = f1_score(Y_true, Y_pred, average='macro')
micro_precision = precision_score(Y_true, Y_pred, average='micro')
micro_recall = recall_score(Y_true, Y_pred, average='micro')
macro_precision = precision_score(Y_true, Y_pred, average='macro')
macro_recall = recall_score(Y_true, Y_pred, average='macro')


In [17]:
#Test the precision and recall of cosine similarity with bucketing - the recall should be high with expected
#low precision because the goal is to get as many of the correct labels into the narrowed down top 55.
print(f"Subset accuracy (exact match): {subset_accuracy:.4f}")
print(f"Micro F1: {micro_f1:.4f}")
print(f"Micro Precision: {micro_precision:.4f}")
print(f"Micro Recall: {micro_recall:.4f}")
print(f"Macro F1: {macro_f1:.4f}")
print(f"Macro Precision: {macro_precision:.4f}")
print(f"Macro Recall: {macro_recall:.4f}")

Subset accuracy (exact match): 0.0000
Micro F1: 0.0668
Micro Precision: 0.0346
Micro Recall: 0.9395
Macro F1: 0.1007
Macro Precision: 0.0556
Macro Recall: 0.9124
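
The low micro precision is largely a consequence of always returning 55 candidates. A rough back-of-the-envelope check, assuming every row receives exactly `k` distinct predicted categories (so micro precision is TP / (k * rows) and micro recall is TP / total true labels):

```python
k = 55
micro_precision = 0.0346  # reported above
micro_recall = 0.9395     # reported above

# Dividing the two definitions gives: true labels per row = k * precision / recall.
implied_avg_true_labels = k * micro_precision / micro_recall

# Best micro precision reachable with k predictions per row (i.e. at perfect recall).
precision_ceiling = implied_avg_true_labels / k

print(round(implied_avg_true_labels, 2))
print(round(precision_ceiling, 4))
```

Under that assumption the reported figures imply roughly two true third-level labels per row, so even perfect recall would leave micro precision below 4% at this stage.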


# Predict Final Third-Level Categories for each row

In [None]:
import os
from ibm_watsonx_ai import APIClient, Credentials
import getpass

credentials = Credentials(
    url="https://us-south.ml.cloud.ibm.com",
    api_key="Add your api key"
)

model_id = "mistralai/mistral-small-3-1-24b-instruct-2503"

parameters = {
    "decoding_method": "sample",
    "temperature": .1,
    "max_new_tokens": 200,
    "min_new_tokens": 1,
    "repetition_penalty": 1,
    "stop_sequences": ["]"]
}
project_id = "add your project id"
from ibm_watsonx_ai.foundation_models import ModelInference

model = ModelInference(
	model_id = model_id,
	params = parameters,
	credentials = credentials,
	project_id = project_id
	)

In [19]:
#select 10 random samples to use as training examples in the prompt
sample = df.sample(n=10, random_state=42)
sample = sample[['url','text','categories']]
df_remaining = df.drop(sample.index)
test_set = df_remaining

In [20]:
result_string = []
for i, r in sample.iterrows():
    row_string = f"url:{r['url']}, page content:{r['text']}, Categories:{r['categories']}"
    result_string.append(row_string)
final_string = "\n".join(result_string)

In [21]:
#the prompt to generate the categories takes the top 55 categories, the 10 training examples, the url, and the text
#for a given row. The prompt states that the result should generate between 1 and 7 categories for each row.
for i, r in test_set.iterrows():
    prompt_input = f"""You are a researcher tasked with looking at a webpage url and deciding which category or 
    categories the webpage should be assigned based on provided url and text. The webpage can be assigned a minimum
    of 1 category and a maximum of 7 categories, but the average is 2.5 and the mode is 2.
    Make your selections ONLY from the following categories. Do not make up other categories, if the webpage url 
    and content do not fit a clear category, just pick the closest option from the given categories: 
        {r['top_k_categories']}
        
        EXAMPLES:
        {final_string}
        
        INPUT TO CATEGORIZE:
        url:
        {r['url']}
        
        content: 
        {r['text']}
        
        categories:"""
    test_set.at[i, 'predicted'] = model.generate_text(prompt=prompt_input, guardrails=False) 

In [22]:
def safe_fix(val):
    if isinstance(val, str):
        val = val.strip()
        if val.startswith('[') and not val.endswith(']'):
            val += ']'
        try:
            return ast.literal_eval(val)
        except (SyntaxError, ValueError):
            return []
    return val if isinstance(val, list) else []
import ast
test_set['predicted'] = test_set['predicted'].apply(safe_fix)

In [23]:
def safe(x):
    if isinstance(x, str):
        try:
            return ast.literal_eval(x)
        except Exception as e:
            print(f"Bad row: {x} -> {e}")
            return []
    return x

test_set['predicted'] = test_set['predicted'].apply(safe)

In [24]:
import ast
test_set['predicted'] = test_set['predicted'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
valid_set = set(s.strip() for s in unique_categories)
test_set['predicted'] = test_set['predicted'].apply(lambda preds: [p.strip() for p in preds if isinstance(p, str) and p.strip() in valid_set])

In [25]:
import ast
def clean_and_parse(x):
    if isinstance(x, list):
        return sorted([s.strip() for s in x])
    if isinstance(x, str):
        try:
            # Sanitize leading/trailing spaces and brackets
            x = x.strip()
            if x.startswith('[') and x.endswith(']'):
                x = x[1:-1]  # remove brackets
                items = [item.strip() for item in x.split(',')]
                return sorted([item for item in items if item])
        except Exception as e:
            print("Parse failed:", x, "Error:", e)
    return []
test_set['predicted'] = test_set['predicted'].apply(clean_and_parse)
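
The parsing steps in the cells above could plausibly be combined into a single helper. The sketch below is one possible consolidation, not the notebook's method: `parse_prediction` is a hypothetical name, and unlike the cells above it also removes duplicate labels.

```python
import ast

def parse_prediction(raw, valid_categories):
    """Turn raw model output into a sorted list of valid category names."""
    if isinstance(raw, list):
        items = raw
    elif isinstance(raw, str):
        text = raw.strip()
        # Close the list if the closing bracket is missing from the output.
        if text.startswith("[") and not text.endswith("]"):
            text += "]"
        try:
            items = ast.literal_eval(text)
        except (SyntaxError, ValueError):
            return []
        if not isinstance(items, list):
            return []
    else:
        return []
    # Strip whitespace, drop non-strings and duplicates, keep only known categories.
    cleaned = {item.strip() for item in items if isinstance(item, str)}
    return sorted(c for c in cleaned if c in valid_categories)

# Hypothetical usage with a tiny set of valid categories.
valid = {"adventure games", "herbs & spices"}
print(parse_prediction("['adventure games', 'made up category'", valid))
```

Anything that cannot be parsed as a Python list yields an empty prediction rather than raising an error.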

In [26]:
y_true = test_set['categories']
y_pred = test_set['predicted']
mlb = MultiLabelBinarizer()

mlb.fit(list(y_true) + list(y_pred))
Y_true = mlb.transform(y_true)
Y_pred = mlb.transform(y_pred)

subset_accuracy = (Y_true == Y_pred).all(axis=1).mean()
micro_f1 = f1_score(Y_true, Y_pred, average='micro')
macro_f1 = f1_score(Y_true, Y_pred, average='macro')
micro_precision = precision_score(Y_true, Y_pred, average='micro')
micro_recall = recall_score(Y_true, Y_pred, average='micro')
macro_precision = precision_score(Y_true, Y_pred, average='macro')
macro_recall = recall_score(Y_true, Y_pred, average='macro')



In [27]:
#k55 & 10 samples
#final accuracy results for the predicted categories
print(f"Subset accuracy (exact match): {subset_accuracy:.4f}")
print(f"Micro F1: {micro_f1:.4f}")
print(f"Micro Precision: {micro_precision:.4f}")
print(f"Micro Recall: {micro_recall:.4f}")
print(f"Macro F1: {macro_f1:.4f}")
print(f"Macro Precision: {macro_precision:.4f}")
print(f"Macro Recall: {macro_recall:.4f}")

Subset accuracy (exact match): 0.2495
Micro F1: 0.5906
Micro Precision: 0.6337
Micro Recall: 0.5530
Macro F1: 0.5737
Macro Precision: 0.6103
Macro Recall: 0.6101
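
Subset accuracy is much lower than the micro scores because it only counts rows where the predicted set matches the true set exactly, while micro precision and recall give credit for each correct label. A small sketch on hypothetical toy rows (the label "label c" is a placeholder, and the helper names are illustrative):

```python
def subset_accuracy(y_true, y_pred):
    """Share of rows whose predicted label set equals the true label set exactly."""
    matches = sum(set(t) == set(p) for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

def micro_precision_recall(y_true, y_pred):
    """Micro-averaged precision and recall, pooling label decisions over all rows."""
    tp = sum(len(set(t) & set(p)) for t, p in zip(y_true, y_pred))
    n_pred = sum(len(set(p)) for p in y_pred)
    n_true = sum(len(set(t)) for t in y_true)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    return precision, recall

# Hypothetical toy data: the second row is only partially correct.
toy_true = [["adventure games"], ["fruits & vegetables", "herbs & spices", "label c"]]
toy_pred = [["adventure games"], ["fruits & vegetables", "herbs & spices"]]

print(subset_accuracy(toy_true, toy_pred))
print(micro_precision_recall(toy_true, toy_pred))
```

The second toy row gets two of its three labels right, which counts fully toward micro precision and partially toward micro recall, but counts as a miss for subset accuracy.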


In [28]:
result = test_set[['url','text','label','categories','predicted']]

In [29]:
result.head()

Unnamed: 0,url,text,label,categories,predicted
0,https://abcnews.go.com/International/photos/is...,Israel-Gaza-Lebanon-Syria conflict: Slideshow ...,[/arts & entertainment/online media/online ima...,[online image galleries],[defense industry]
1,https://acapt.org/educationevents/advancing-ac...,Advancing Accessibility & Disability Equity Su...,[/business & industrial/business services/corp...,"[corporate events, colleges & universities, ac...","[academic conferences & publications, physical..."
2,https://adcable.com/products/rf-communication....,RF Wire & Cable Manufacturer | Advanced Digita...,[/computers & electronics/electronics & electr...,[electronic components],[electronic components]
3,https://adventuregamers.com/companies/view/27602,Adventure games by Terrible Toybox | Adventure...,[/games/computer & video games/adventure games...,[adventure games],[adventure games]
4,https://aggie-horticulture.tamu.edu/vegetable/...,Ginger - Vegetable Resources Vegetable Resourc...,"[/food & drink/food/fruits & vegetables, /food...","[fruits & vegetables, herbs & spices, crops & ...","[fruits & vegetables, herbs & spices]"


In [30]:
result['url'][3]

'https://adventuregamers.com/companies/view/27602'