# **Multi-label classification use case**

**Video Tutorial:** https://www.youtube.com/watch?v=CQTCS8SO8bs


### **Author: Partha Seetala**

This example demonstrates the use of deep learning to classify text with multiple labels. We'll define a small training dataset of news headlines and the labels assigned to each one, then build a neural network that learns which kinds of headlines receive which labels. Finally, we'll use the trained network to predict zero or more labels for a few new headlines.

In [None]:
!pip install keras_preprocessing
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.text import text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences



In [None]:
import random
import numpy as np
from keras.models import Sequential
from keras.layers import Flatten, Dense
from keras.utils import to_categorical

**Define a training dataset**

In [None]:
training_samples = [
    ["The gallery hosted an exhibition featuring paintings of basketball players", ['art', 'sports']],
    ["The government is investing heavily in AI and blockchain technology", ['politics', 'technology']],
    ["The government new policy aims to reduce unemployment", ['politics']],
    ["The city will host the next olympics attracting global attention", ['sports']],
    ["New advancements in robotics are transforming the manufacturing industry", ['technology']],
    ["Cutting edge AI was used at the basketball game to track player shots", ['technology', 'sports']],
    ["An artist revealed his latest sculpture at the city gallery", ['art']],
    ["The art exhibition attracted thousands of visitors to see the paintings", ['art']],
    ["A famous artist is known for his unique painting style", ['art']],
    ["The sculpture won the artist an international award attended to by the governor", ['art', 'politics']],
    ["A breakthrough in software development was announced at the tech conference", ['technology']],
    ["The latest hardware release promises faster and more efficient performance", ['technology']],
    ["Blockchain technology is being adopted by artist fraternity", ['art']],
    ["The government announced a new policy to boost the economy", ['politics']],
    ["The senate passed the bill after hours of debate", ['politics']],
    ["The election results will shape the future of the country's politics", ['politics']],
    ["Diplomacy played a key role in the peace negotiations", ['politics']],
    ["The football team won the championship after a thrilling match", ['sports']],
    ["She won the tennis tournament in straight sets", ['sports']],
    ["The cricket match was interrupted by rain, causing a delay", ['sports']],
    ["The basketball team is preparing for the upcoming season", ['sports']],
    ["The artist used AI to create a stunning new piece of digital art", ['art', 'technology']],
    ["The government's new policy on AI ethics is making headlines", ['politics', 'technology']],
    ["A robotics company is sponsoring the local basketball team", ['technology', 'sports']],
    ["The olympics will feature new technology in broadcasting events", ['sports']],
    ["Artists are using blockchain to sell digital art directly to collectors", ['technology']],
    ["The senate discussed the impact of technology on the economy", ['politics']],
    ["A famous artist created a sculpture inspired by the football World Cup", ['art', 'sports']],
    ["The gallery is showcasing an exhibition on sports in contemporary art", ['art']],
    ["The government is funding new research in sports technology", ['politics', 'sports']],
    ["The robotics team demonstrated their latest invention at the tech fair", ['technology']],
    ["The artist painted a mural to celebrate the olympics", ['art', 'sports']],
    ["The policy debate focused on the regulation of AI and blockchain technology", ['politics', 'technology']],
    ["The football team is using advanced software to analyze player performance", ['sports', 'technology']],
    ["The election campaign featured discussions on sports funding and technology", ['politics']]
]

MAX_WORDS_PER_SENTENCE = 100
vocab_size = None  # Will be filled in by our tokenizer below
unique_labels = []

**Utility functions that process the training dataset and convert it into the input (X) and ground-truth values (Ytrue) that we'll pass to the neural network during training**

In [None]:
tokenizer = Tokenizer()

def labelvec_to_labels(labelvec, all_labels):
    labels = []
    for i, label in enumerate(labelvec):
        if label == 1:
            labels.append(all_labels[i])
    return labels

def labels_to_labelvec(labels, all_labels):
    labelvec = [0] * len(all_labels)
    for label in labels:
        labelvec[all_labels.index(label)] = 1
    return labelvec

def process_training_data(tokenizer, data):
    sentences = []
    sentence_labels  = []
    unique_labels = []

    # Go through entire training data set and extract sentences and their
    # corresponding labels into separate arrays
    for item in data:
        # item[0] -> sentence, item[1] -> array of labels
        slabels = []  # per-sentence labels
        for label in item[1]:
            if label not in unique_labels:
                unique_labels.append(label)
            if label not in slabels:
                slabels.append(label)

        sentences.append(item[0])
        sentence_labels.append(slabels)

    tokenizer.fit_on_texts(sentences)
    sequences = tokenizer.texts_to_sequences(sentences)

    # One-hot encode the sequences
    vocab_size = len(tokenizer.word_index) + 1
    max_length = max(len(seq) for seq in sequences)
    # A sentence with exactly MAX_WORDS_PER_SENTENCE tokens still fits
    assert max_length <= MAX_WORDS_PER_SENTENCE

    onehotenc = np.zeros((len(sequences), MAX_WORDS_PER_SENTENCE, vocab_size))

    label_matrix = []
    for i, sequence in enumerate(sequences):
        # get the one-hot encoded vector of labels for this sentence
        labelvec = labels_to_labelvec(sentence_labels[i], unique_labels)
        label_matrix.append(labelvec)
        for j, word_index in enumerate(sequence):
            onehotenc[i, j, word_index] = 1.0


    return onehotenc, np.array(label_matrix), vocab_size, sentences, unique_labels, sequences

x, ytrue, vocab_size, sentences, unique_labels, sequences = process_training_data(tokenizer, training_samples)

print("LABELS:")
print(unique_labels)
for label in unique_labels:
    print("label={:10s}  vector={}".format(label, labels_to_labelvec([label], unique_labels)))

print("labels={:20s} vector={}".format("[art, sports]", labels_to_labelvec(['art', 'sports'], unique_labels)))
print("labels={:20s} vector={}".format("[art, politics, technology]", labels_to_labelvec(['art', 'politics', 'technology'], unique_labels)))

print("\nSENTENCES[0]:")
print("Sentence: ", training_samples[0][0])
print("Sequence: ", sequences[0])
print("\nInput X:")
# Print the one-hot row for each token of the first sentence
for i in range(len(sequences[0])):
    print("tok[", i, "]= ", x[0][i])
print("\nLabels  : ", training_samples[0][1])
print("Ytrue   : ", ytrue[0])

LABELS:
['art', 'sports', 'politics', 'technology']
label=art         vector=[1, 0, 0, 0]
label=sports      vector=[0, 1, 0, 0]
label=politics    vector=[0, 0, 1, 0]
label=technology  vector=[0, 0, 0, 1]
labels=[art, sports]        vector=[1, 1, 0, 0]
labels=[art, politics, technology] vector=[1, 0, 1, 1]

SENTENCES[0]:
Sentence:  The gallery hosted an exhibition featuring paintings of basketball players
Sequence:  [1, 22, 54, 13, 23, 55, 33, 6, 14, 56]

Input X:
tok[ 0 ]=  [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0.]


**Build a Neural Network model for Multi-label classification**

In [None]:

# X shapes
# shape[0] -> number of samples in the training data
# shape[1] -> max number of tokens allowed per sentence
# shape[2] -> number of unique words (vocabulary)

print("x.shape: {}".format(x.shape))
print("x.shapes = {}, {}, {}".format(x.shape[0], x.shape[1], x.shape[2]))
print("ytrue.shape: {}".format(ytrue.shape))
print("Number of samples: ", len(sentences))
print("Number of unique words: ", vocab_size)
print("Number of labels: ", len(unique_labels))

x.shape: (35, 100, 171)
x.shapes = 35, 100, 171
ytrue.shape: (35, 4)
Number of samples:  35
Number of unique words:  171
Number of labels:  4
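
The shapes above determine how large the model becomes once the input is flattened. As a rough back-of-the-envelope sketch, assuming each sentence tensor is flattened and fed to a 64-unit dense layer followed by one output unit per label (which is what the model in the next section does), the weight count can be worked out by hand. The helper name `dense_params` is made up for this illustration.

```python
# Shapes taken from the printed output above.
num_samples, max_tokens, vocab = 35, 100, 171
num_labels = 4
hidden_units = 64  # matches the Dense(64) layer used in the next section

def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

flattened = max_tokens * vocab                    # features per sentence after flattening
hidden = dense_params(flattened, hidden_units)    # first dense layer
output = dense_params(hidden_units, num_labels)   # output layer

print("flattened features:", flattened)    # 17100
print("hidden layer params:", hidden)      # 1094464
print("output layer params:", output)      # 260
print("total params:", hidden + output)    # 1094724
```

With only 35 training sentences, a network with roughly a million weights can memorize the data easily, which is worth keeping in mind when reading the training accuracy later on.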


In [None]:
# Define the neural network model
model = Sequential([
    Flatten(input_shape=(x.shape[1], x.shape[2])),
    Dense(64, activation='relu'),
    Dense(len(unique_labels), activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
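
The output layer uses `sigmoid` rather than `softmax` because each label is treated as its own yes/no decision, so several labels can be active at once. The following is a minimal pure-Python sketch of that idea, not the Keras implementation: the raw scores are made up, and the helper names `sigmoid` and `binary_crossentropy` are local to this example.

```python
import math

def sigmoid(z):
    """Squash a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of the per-label binary cross-entropy terms."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical raw scores for [art, sports, politics, technology]
logits = [2.0, 1.5, -3.0, -2.0]
probs = [sigmoid(z) for z in logits]

y_true = [1, 1, 0, 0]  # a headline labeled both 'art' and 'sports'
print([round(p, 3) for p in probs])
print("sum of probabilities:", round(sum(probs), 3))  # not constrained to 1
print("loss:", round(binary_crossentropy(y_true, probs), 4))

# A model that outputs 0.5 for every label has loss ln(2), about 0.6931
print(round(binary_crossentropy(y_true, [0.5] * 4), 4))
```

Because the probabilities are independent, their sum can exceed 1, which is exactly what allows a headline to carry two or more labels.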



**Train the model feeding it the training samples**

In [None]:
# Train the model
model.fit(x, ytrue, epochs=1000, batch_size=32, validation_split=0.2)

Epoch 1/1000
1/1 ━━━━━━━━━━━━━━━━━━━━ 2s 2s/step - accuracy: 0.2143 - loss: 0.6931 - val_accuracy: 0.4286 - val_loss: 0.6944
Epoch 2/1000
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 116ms/step - accuracy: 0.5357 - loss: 0.6833 - val_accuracy: 0.4286 - val_loss: 0.6928
Epoch 3/1000
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 113ms/step - accuracy: 0.7500 - loss: 0.6743 - val_accuracy: 0.4286 - val_loss: 0.6913
Epoch 4/1000
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 117ms/step - accuracy: 0.8929 - loss: 0.6658 - val_accuracy: 0.4286 - val_loss: 0.6897
Epoch 5/1000
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 132ms/step - accuracy: 0.8929 - loss: 0.6578 - val_accuracy: 0.7143 - val_loss: 0.6881
Epoch 6/1000
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 140ms/step - accuracy: 0.9286 - loss: 0.6502 - val_accuracy: 0.7143 - val_loss: 0.6866
Epoch 7/1000
...

<keras.src.callbacks.history.History at 0x7d4134502d50>
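
The `1/1` in the progress output means each epoch runs a single batch. A small sketch of the arithmetic behind that, assuming the training portion is rounded down and the remaining samples at the end of the arrays are held out for validation (the rounding detail is an assumption here, not something shown in the notebook output):

```python
import math

num_samples = 35          # len(training_samples)
validation_split = 0.2
batch_size = 32

# Assumption: the training portion is rounded down and the remainder,
# taken from the end of the arrays, is used for validation.
num_train = math.floor(num_samples * (1.0 - validation_split))
num_val = num_samples - num_train
steps_per_epoch = math.ceil(num_train / batch_size)

print("training samples:", num_train)        # 28
print("validation samples:", num_val)        # 7
print("steps per epoch:", steps_per_epoch)   # 1
```

With validation metrics computed on only 7 headlines, a single changed prediction moves `val_accuracy` noticeably, so those numbers should be read as a rough signal rather than a reliable estimate.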

**Let's make some predictions**

In [None]:
# Function to predict labels for new data
def predict_labels(tokenizer, text):
    sequence = tokenizer.texts_to_sequences([text])
    one_hot_sequence = np.zeros((1, MAX_WORDS_PER_SENTENCE, vocab_size))

    for i, word_index in enumerate(sequence[0]):
        if i < MAX_WORDS_PER_SENTENCE:
            one_hot_sequence[0, i, word_index] = 1.0

    prediction = model.predict(one_hot_sequence, verbose=0)

    labels = []
    for idx, pred in enumerate(prediction[0]):
        if pred > 0.5:
            labels.append(unique_labels[idx])
    #print(prediction)
    return ", ".join(labels)
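
The last step of `predict_labels` keeps every label whose sigmoid output exceeds 0.5. The sketch below isolates that thresholding step with made-up probabilities; the function `probs_to_labels` and its `threshold` parameter are hypothetical names used only for illustration.

```python
def probs_to_labels(probs, all_labels, threshold=0.5):
    """Keep every label whose predicted probability exceeds the threshold."""
    return [label for label, p in zip(all_labels, probs) if p > threshold]

all_labels = ['art', 'sports', 'politics', 'technology']
probs = [0.91, 0.62, 0.08, 0.47]   # made-up sigmoid outputs

print(probs_to_labels(probs, all_labels))                 # ['art', 'sports']
print(probs_to_labels(probs, all_labels, threshold=0.4))  # ['art', 'sports', 'technology']
print(probs_to_labels([0.1, 0.2, 0.3, 0.4], all_labels))  # [] (no label assigned)
```

Lowering the threshold trades precision for recall, and an empty result is a valid outcome: it corresponds to the "zero labels" case mentioned in the introduction.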

In [None]:
example_sentences = [
    "The gallery hosted an exhibition featuring paintings of cricket players",
    "The government is investing heavily in AI and blockchain technology",
    "Michael plays basketball while Bill builds AI robots that would be used by the government",
    "The government new policy aims to reduce unemployment",
]

for example in example_sentences:
    labels = predict_labels(tokenizer, example)
    print("\"{}\" -> {}".format(example, labels))


"The gallery hosted an exhibition featuring paintings of cricket players" -> art, sports
"The government is investing heavily in AI and blockchain technology" -> politics, technology
"Michael plays basketball while Bill builds AI robots that would be used by the government" -> sports, technology
"The government new policy aims to reduce unemployment" -> politics
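
The third example contains words such as "Michael" and "robots" that never appear in the training headlines. A `Tokenizer` created without an `oov_token`, as in this notebook, skips unknown words when converting text to a sequence, so the remaining words move to earlier positions. The sketch below mimics that behavior in plain Python; `to_sequence` is a stand-in written for this illustration, and the word indices are made up.

```python
# Toy word index with made-up numbers; the real one comes from tokenizer.word_index.
word_index = {"the": 1, "government": 2, "ai": 3, "basketball": 4}

def to_sequence(text, word_index):
    """Stand-in for texts_to_sequences: lowercase, split, skip unknown words."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

seq = to_sequence("Michael plays basketball while Bill builds AI robots", word_index)
print(seq)  # [4, 3]
```

Here "basketball" ends up at position 0 instead of position 2. Since the model above flattens position and word together, a known word at an unfamiliar position may activate weights that were rarely or never trained, so predictions on sentences with many unseen words should be treated with caution.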
