# **Multi-label classification problem using Word Embeddings**
#### **Author: Partha Seetala**
**Video Tutorial: https://www.youtube.com/watch?v=8jqqE8XG5T0**

This example demonstrates the use of Deep Learning for classifying text with multiple labels. We'll create a small sample training dataset of news headlines and the labels assigned to them, then build a neural network that learns which kinds of headlines receive which labels. Finally, we'll use the trained network to predict zero or more labels for a few test headlines.

The general code logic is as follows:

1. Download existing GloVe embeddings
2. Process training data
 - Tokenize every sentence in the training data and build a Vocabulary
 - Construct an embedding_matrix that is limited to tokens in our training data
 - For each tokenized sentence, generate a list of `tokenid`s corresponding to its tokens
 - Pad the lists so that every input to the Neural Network has the same length
3. Pass this as input to Neural Network and make it predict labels
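The processing steps above can be sketched in plain Python before bringing in Keras. This is a simplified illustration, not the notebook's implementation: the helper names (`tokenize`, `build_vocab`, `encode`) and the short `MAX_LEN` are hypothetical, ids are assigned in first-seen order (the Keras `Tokenizer` orders them by word frequency), and long sentences are cut at the end (Keras `pad_sequences` cuts from the front by default).

```python
import re

MAX_LEN = 6  # illustrative only; the notebook uses MAX_SENTENCE_LEN = 20

def tokenize(sentence):
    # Lowercase and split on anything that is not a letter or digit
    return [t for t in re.split(r"[^a-z0-9]+", sentence.lower()) if t]

def build_vocab(sentences):
    # Id 0 is reserved for padding, id 1 for out-of-vocabulary tokens
    vocab = {"<OOV>": 1}
    for sentence in sentences:
        for token in tokenize(sentence):
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab

def encode(sentence, vocab, max_len=MAX_LEN):
    # Map tokens to ids, falling back to the <OOV> id for unseen words
    ids = [vocab.get(token, vocab["<OOV>"]) for token in tokenize(sentence)]
    ids = ids[:max_len]                      # truncate long sentences
    return ids + [0] * (max_len - len(ids))  # pad short ones with 0

toy_sentences = ["The movie was funny", "The senate passed the bill"]
toy_vocab = build_vocab(toy_sentences)
encoded = encode("The movie was boring", toy_vocab)
print(encoded)  # "boring" was never seen, so it maps to the <OOV> id
```

Every encoded sentence has the same length, which is what lets the network accept them as a single fixed-shape input.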

## Load required modules

In [None]:
#!pip install gensim -U
#!pip install numpy==1.25
#!pip install keras_preprocessing


import numpy as np
import pandas as pd
import re
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report, hamming_loss
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.text import text_to_word_sequence
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Flatten, Dropout
import gensim.downloader as api

glove_embeddings = None
MAX_SENTENCE_LEN = 20



## Download pre-built GloVe embeddings

In [None]:
if glove_embeddings is None:
    glove_embeddings = api.load("glove-wiki-gigaword-100")

# Let's see how many words/tokens are in the GloVe Embedding Data
print("Number of tokens in GloVe Data: ", len(glove_embeddings.index_to_key))
print("Dimensions for each token: ", glove_embeddings.vector_size)

Number of tokens in GloVe Data:  400000
Dimensions for each token:  100


## Get some Training Data to work with

In [None]:
training_data = [
    ["The movie was funny", ['art']],
    ["The gallery hosted an exhibition featuring paintings of basketball players", ['art', 'sports']],
    ["The government is investing heavily in AI and robotics technology", ['politics', 'technology']],
    ["The government new policy aims to reduce unemployment", ['politics']],
    ["The city will host the next olympics attracting global attention", ['sports']],
    ["New advancements in robotics are transforming the manufacturing industry", ['technology']],
    ["Cutting edge AI was used at the basketball game to track player shots", ['technology', 'sports']],
    ["An artist revealed his latest sculpture at the city gallery", ['art']],
    ["The art exhibition attracted thousands of visitors to see the paintings", ['art']],
    ["A famous artist is known for his unique painting style", ['art']],
    ["The sculpture won the artist an international award attended to by the governor", ['art', 'politics']],
    ["A breakthrough in software development was announced at the tech conference", ['technology']],
    ["The latest hardware release promises faster and more efficient performance", ['technology']],
    ["robotics technology is being adopted by artist fraternity", ['art']],
    ["The government announced a new policy to boost the economy", ['politics']],
    ["The senate passed the bill after hours of debate", ['politics']],
    ["The election results will shape the future of the country politics", ['politics']],
    ["Diplomacy played a key role in the peace negotiations", ['politics']],
    ["The football team won the championship after a thrilling match", ['sports']],
    ["She won the tennis tournament in straight sets", ['sports']],
    ["The cricket match was interrupted by rain, causing a delay", ['sports']],
    ["The basketball team is preparing for the upcoming season", ['sports']],
    ["The artist used AI to create a stunning new piece of digital art", ['art', 'technology']],
    ["The government new policy on AI ethics is making headlines", ['politics', 'technology']],
    ["A robotics company is sponsoring the local basketball team", ['technology', 'sports']],
    ["The olympics will feature new technology in broadcasting events", ['sports']],
    ["Artists are using robotics to sell digital art directly to collectors", ['technology']],
    ["The senate discussed the impact of technology on the economy", ['politics']],
    ["A famous artist created a sculpture inspired by the football World Cup", ['art', 'sports']],
    ["The gallery is showcasing an exhibition on sports in contemporary art", ['art']],
    ["The government is funding new research in sports technology", ['politics', 'sports']],
    ["The robotics team demonstrated their latest invention at the tech fair", ['technology']],
    ["The artist painted a mural to celebrate the olympics", ['art', 'sports']],
    ["The policy debate focused on the regulation of AI and robotics technology", ['politics', 'technology']],
    ["The football team is using advanced software to analyze player performance", ['sports', 'technology']],
    ["The election campaign featured discussions on sports funding and technology", ['politics']]
]

## Prepare the training data so we can feed it to the Neural Network

In [None]:
# Define a Tokenizer that we'll use to tokenize sentences into individual words/tokens
tokenizer = Tokenizer(num_words=5000, oov_token='<OOV>')

def tokenize_sentence(tokenizer, sentence):
    # Tokenize the sentence into an array of TokenIDs
    sentence_seq = tokenizer.texts_to_sequences([sentence])

    # Pad the sequence with 0 (<PAD>) in case it is shorter than MAX_SENTENCE_LEN
    tokenized_and_padded = pad_sequences(sentence_seq, maxlen=MAX_SENTENCE_LEN, padding='post')

    return tokenized_and_padded

def process_training_data(tokenizer, samples):
    sentences = []
    x = []
    ytrue = []
    labels = []

    for sample in samples:
        sentences.append(sample[0])
        labels.append(sample[1])

    # STEP 1: TOKENIZE & BUILD VOCABULARY
    tokenizer.fit_on_texts(sentences)

    # STEP 2: CONVERT FROM TEXT TOKENS TO TOKENID ARRAY
    token_seq = tokenizer.texts_to_sequences(sentences)

    # STEP 3: PAD IF SENTENCE IS LESS THAN MAX_SENTENCE_LEN
    x = pad_sequences(token_seq, maxlen=MAX_SENTENCE_LEN, padding='post')

    # STEP 4: Convert Labels into an array of integers
    mlb = MultiLabelBinarizer()
    ytrue = mlb.fit_transform(labels)

    # unique labels found across all training samples
    unique_labels = mlb.classes_
    return sentences, unique_labels, x, ytrue


sentences, unique_labels, x, ytrue = process_training_data(tokenizer, training_data)
print(sentences[0])
print(unique_labels)
print(x[0])
print(ytrue[0])

The movie was funny
['art' 'politics' 'sports' 'technology']
[ 2 54 16 55  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
[1 0 0 0]
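The label vector `[1 0 0 0]` printed above is a multi-hot encoding: one slot per known label, in sorted order, set to 1 when the label applies. A minimal pure-Python sketch of that idea follows. The function names are hypothetical and this is not how `MultiLabelBinarizer` is implemented internally; it only mirrors the output for a few toy samples.

```python
def fit_classes(label_lists):
    # Collect every distinct label and sort it, giving a stable column order
    return sorted({label for labels in label_lists for label in labels})

def to_multi_hot(labels, classes):
    # One slot per class: 1 if the sample carries that label, else 0
    present = set(labels)
    return [1 if name in present else 0 for name in classes]

toy_labels = [["art"], ["art", "sports"], ["politics", "technology"]]
classes = fit_classes(toy_labels)
rows = [to_multi_hot(labels, classes) for labels in toy_labels]

print(classes)
for labels, row in zip(toy_labels, rows):
    print(labels, "->", row)
```

Because each column is independent, a headline can switch on any number of labels, including none.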


## Build an Embedding Matrix for the subset of tokens in our training data

In [None]:
# Vocabulary
vocab = tokenizer.word_index

VOCAB_SIZE = len(vocab) + 1
EMBEDDING_DIM = glove_embeddings.vector_size

print("Number of tokens in my training data: ", VOCAB_SIZE)
print("Embedding dimension for each token: ", EMBEDDING_DIM)

# Create an embedding matrix for only the subset of words in our vocabulary
embedding_matrix = np.zeros((VOCAB_SIZE, EMBEDDING_DIM))

for token, i in vocab.items():
    try:
        embedding = glove_embeddings[token]
        embedding_matrix[i] = embedding
    except KeyError:
        # Word is not in the GloVe embedding database; its row stays all zeros
        print("Token '", token , "' not found in downloaded Embeddings")

#print("Embedding for token 'gallery'=", glove_embeddings['gallery'])

Number of tokens in my training data:  172
Embedding dimension for each token:  100
Token ' <OOV> ' not found in downloaded Embeddings
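To make the row layout of the matrix concrete, here is a small sketch using plain lists and made-up 3-dimensional vectors instead of real GloVe data. The names `toy_vectors`, `toy_vocab`, and `embed` are hypothetical. It shows why row 0 (padding) and the `<OOV>` row both stay at zero, as the message above suggests.

```python
# Made-up 3-dimensional vectors standing in for the real 100-dimensional GloVe ones
toy_vectors = {"gallery": [0.1, 0.2, 0.3], "artist": [0.4, 0.5, 0.6]}

# Token ids as a tokenizer might assign them; id 0 is implicitly the padding id
toy_vocab = {"<OOV>": 1, "gallery": 2, "artist": 3}
dim = 3

# One row per token id, plus row 0 for padding, all starting at zero
matrix = [[0.0] * dim for _ in range(len(toy_vocab) + 1)]
missing = []

for token, idx in toy_vocab.items():
    vector = toy_vectors.get(token)
    if vector is None:
        missing.append(token)      # row stays all zeros
    else:
        matrix[idx] = list(vector)

def embed(token_ids):
    # What a frozen embedding lookup amounts to: pick one row per token id
    return [matrix[i] for i in token_ids]

print(missing)
print(embed([2, 1, 0]))  # known word, unknown word, padding
```

One consequence of this layout is that an unknown word and a padding position look identical to the layers that follow, since both map to a zero vector.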


## Build a Deep Neural Network Model to do multi-label classification

In [None]:
model = Sequential([
    Embedding(
        input_dim=VOCAB_SIZE,
        output_dim=EMBEDDING_DIM,
        weights=[embedding_matrix],
        input_shape=(MAX_SENTENCE_LEN,),
        trainable=False
    ),
    Flatten(),  # Flatten the embeddings
    Dense(128, activation='relu'),
    Dense(64, activation='relu'),
    Dense(len(unique_labels), activation='sigmoid')  # Sigmoid for multi-label
])

# Compile the model
model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

# Model summary
model.summary()


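The model pairs a sigmoid output layer with `binary_crossentropy` because each label is treated as its own yes/no decision. The sketch below works through that arithmetic with the standard library only. The logits are hypothetical numbers chosen for illustration, and the helper names are not part of Keras.

```python
import math

def sigmoid(z):
    # Squash a raw score into an independent probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    # Average the per-label log loss; clip to avoid log(0)
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical raw scores for [art, politics, sports, technology]
logits = [2.0, -1.0, 3.0, -2.0]
probs = [sigmoid(z) for z in logits]

target = [1, 0, 1, 0]  # headline labelled art + sports
loss = binary_cross_entropy(target, probs)

print([round(p, 3) for p in probs])
print("sum of probabilities:", round(sum(probs), 3))
print("loss:", round(loss, 4))
```

Unlike softmax, the sigmoid probabilities are not forced to sum to 1, so two labels can both score highly for the same headline.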


## Train the model

In [None]:
model.fit(x, ytrue, epochs=50, batch_size=4)

Epoch 1/50
9/9 ━━━━━━━━━━━━━━━━━━━━ 7s 34ms/step - accuracy: 0.3550 - loss: 0.6326
Epoch 2/50
9/9 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step - accuracy: 0.9693 - loss: 0.3629
Epoch 3/50
9/9 ━━━━━━━━━━━━━━━━━━━━ 0s 16ms/step - accuracy: 0.8647 - loss: 0.1903
Epoch 4/50
9/9 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step - accuracy: 0.8527 - loss: 0.0970
Epoch 5/50
9/9 ━━━━━━━━━━━━━━━━━━━━ 0s 25ms/step - accuracy: 0.8964 - loss: 0.0320
Epoch 6/50
9/9 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step - accuracy: 0.7952 - loss: 0.0224
Epoch 7/50
9/9 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.9472 - loss: 0.0144
Epoch 8/50
9/9 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9791 - loss: 0.0098
Epoch 9/50
9/9 ━━━━━━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x7f1fbb7c0990>
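The import cell pulls in `hamming_loss` from scikit-learn, although the cells shown here never call it. As a rough illustration of what that metric measures for multi-label output, the sketch below computes it by hand next to the stricter exact-match ratio. The function names and the toy predictions are hypothetical, and this is not scikit-learn's implementation.

```python
def hamming_loss_sketch(y_true, y_pred):
    # Fraction of individual label slots that were predicted wrongly
    wrong = 0
    total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            wrong += int(t != p)
            total += 1
    return wrong / total

def exact_match_ratio(y_true, y_pred):
    # Fraction of samples whose whole label vector was predicted correctly
    hits = sum(int(t == p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Columns: art, politics, sports, technology
y_true = [[1, 0, 1, 0], [0, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 1]]  # one slot wrong in each sample

print("hamming loss:", hamming_loss_sketch(y_true, y_pred))
print("exact match :", exact_match_ratio(y_true, y_pred))
```

Here two of eight slots are wrong, so the Hamming loss is low even though neither sample is matched exactly, which is why the two numbers are worth reading together.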

In [None]:


def label_sentence(model, tokenizer, sentence, threshold=0.2):
    tokenized_and_padded = tokenize_sentence(tokenizer, sentence)

    prediction = model.predict(tokenized_and_padded, verbose=0)

    labels = []
    for i, prob in enumerate(prediction[0]):
        if prob > threshold:
            labels.append((unique_labels[i], float(prob) * 100))

    labels.sort(key=lambda x: x[1], reverse=True)
    return labels

def pretty_print_labels(idx, sentence, labels):
    # Use the idx argument rather than relying on the global loop variable i
    print("{}. {}\n   ".format(idx+1, sentence), end="")
    for label in labels:
        print("{}={:3.1f}%".format(label[0], label[1]), end=", ")
    print("\n")


test_sentences = [
    "The gallery hosted an exhibition featuring paintings of kabbadi players",
    "The government is investing heavily in AI and robotics technology",
    "Michael plays basketball while Bill builds AI robots that would be used by the government",
    "The government new policy aims to reduce unemployment",
    "The new museum exhibit features digital art created with AI",
    "Politicians debated the future of sports funding in schools",
    "The new smartphone uses advanced technology for better performance",
]

for i,sentence in enumerate(test_sentences):
    labels = label_sentence(model, tokenizer, sentence)
    pretty_print_labels(i, sentence, labels)

1. The gallery hosted an exhibition featuring paintings of kabbadi players
   art=100.0%, sports=99.8%, 

2. The government is investing heavily in AI and robotics technology
   politics=100.0%, technology=99.9%, 

3. Michael plays basketball while Bill builds AI robots that would be used by the government
   technology=83.5%, politics=44.2%, art=28.7%, 

4. The government new policy aims to reduce unemployment
   politics=100.0%, 

5. The new museum exhibit features digital art created with AI
   art=28.1%, 

6. Politicians debated the future of sports funding in schools
   politics=44.1%, 

7. The new smartphone uses advanced technology for better performance
   sports=76.7%, 
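The `threshold` argument of `label_sentence` decides how many labels survive. The sketch below replays the selection step on its own, using the probabilities reported for sentence 3 above. The `sports` value is a hypothetical placeholder, since the output only shows that it fell below 0.2, and `select_labels` is an illustrative helper rather than part of the notebook.

```python
def select_labels(probabilities, class_names, threshold=0.2):
    # Keep every label whose probability clears the threshold, highest first
    picked = [
        (name, prob * 100)
        for name, prob in zip(class_names, probabilities)
        if prob > threshold
    ]
    picked.sort(key=lambda pair: pair[1], reverse=True)
    return picked

class_names = ["art", "politics", "sports", "technology"]
# art, politics, technology come from the output for sentence 3;
# the sports value (0.05) is assumed for illustration
probs = [0.287, 0.442, 0.05, 0.835]

loose = select_labels(probs, class_names, threshold=0.2)
strict = select_labels(probs, class_names, threshold=0.5)

print("threshold 0.2:", [name for name, _ in loose])
print("threshold 0.5:", [name for name, _ in strict])
```

A low threshold such as 0.2 favors recall and lets weaker guesses like `art` through; raising it trades those away for fewer, more confident labels.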

