# Text Embeddings and NLP Tasks for Indian Languages

Text embeddings are vector representations of text that map the original text into a mathematical space where words or sentences with similar meanings are located near each other.

## PHASE 0: SETUP

In [1]:
!pip install indic-nlp-library gensim -q

#### We have provided 2 files: hindi_cleaned.txt and kannada_cleaned.txt.

These datasets were normalized and cleaned by the NLP preprocessing pipeline from Lab 1.


## Corpus

In [2]:
#make sure these files are in the same directory as this notebook
with open("hindi_cleaned.txt", 'r', encoding='utf-8') as f:
    hindi_corpus = f.read()
with open("kannada_cleaned.txt", 'r', encoding='utf-8') as f:
    kannada_corpus = f.read()

Next, we repeat the sentence splitting and tokenization from Lab 1 so that we can build an embedding for each token.

We also keep only tokens that contain at least one character from the language's Unicode block, which drops tokens made up purely of ASCII punctuation or digits.

At this point each corpus is still a single string.

The goal is to make a list of sentences where each sentence is a list of tokens.
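Before relying on a Unicode-range filter, it can help to check what it actually keeps. The sketch below uses a hypothetical helper, `in_devanagari_block`, that mirrors a range check over U+0900 to U+097F; the sample tokens are made up for illustration.

```python
import re

# Hypothetical helper: True if the token contains at least one
# character from the Devanagari block (U+0900 to U+097F).
def in_devanagari_block(token):
    return bool(re.search(r'[\u0900-\u097F]', token))

samples = ["फिल्म", ",", "।", "2024"]
results = {tok: in_devanagari_block(tok) for tok in samples}
print(results)
```

The danda `।` (U+0964) sits inside the Devanagari block, so a range check of this kind keeps it while dropping ASCII punctuation and digits. That is consistent with `।` appearing at the end of the Hindi token lists printed in this notebook.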

In [3]:
from indicnlp.tokenize import sentence_tokenize
from indicnlp.tokenize import indic_tokenize
import re
import numpy as np

def is_hindi_token(token):
    return bool(re.search(r'[\u0900-\u097F]', token))

def is_kannada_token(token):
    return bool(re.search(r'[\u0C80-\u0CFF]', token))

def preprocess(corpus, lang_code):
    sentences = sentence_tokenize.sentence_split(corpus, lang=lang_code)
    tokenized_corpus = []

    for sentence in sentences:
        tokens = indic_tokenize.trivial_tokenize(sentence, lang=lang_code)

        if lang_code == "hi":
            tokens = [t for t in tokens if is_hindi_token(t)]
        elif lang_code == "kn":
            tokens = [t for t in tokens if is_kannada_token(t)]

        tokenized_corpus.append(tokens)

    return tokenized_corpus

hi_tokens = preprocess(hindi_corpus, 'hi')
kn_tokens = preprocess(kannada_corpus, 'kn')

print("Hindi Tokens:", hi_tokens)
print("Kannada Tokens:", kn_tokens)


Hindi Tokens: [['कभी', 'कभी', 'कोई', 'फिल्म', 'आपको', 'बिल्कुल', 'निशब्द', 'कर', 'देती', 'है', '।'], ['कभी', 'उसका', 'असर', 'इतना', 'गहरा', 'होता', 'है', 'कि', 'शब्द', 'ही', 'नहीं', 'मिलते', '।'], ['कभी', 'आप', 'इतने', 'प्रभावित', 'होते', 'हैं', 'कि', 'उसके', 'बारे', 'में', 'बात', 'करना', 'ही', 'बंद', 'नहीं', 'कर', 'पाते', '।'], ['और', 'कभी', 'वह', 'आपके', 'दिल', 'को', 'इतना', 'छू', 'जाती', 'है', 'कि', 'आप', 'बस', 'उसकी', 'जादुई', 'दुनिया', 'में', 'खो', 'जाते', 'हैं', '।'], ['लेकिन', 'जब', 'कोई', 'फिल्म', 'ये', 'सब', 'एक', 'साथ', 'कर', 'दिखाती', 'है', 'तो', 'वह', 'सिर्फ', 'एक', 'मास्टरपीस', 'नहीं', 'रह', 'जाती', 'वह', 'एक', 'सांस्कृतिक', 'घटना', 'बन', 'जाती', 'है', '।'], ['ऋषभ', 'शेट्टी', 'की', '‘कांतारा', 'चैप्टर', 'ठीक', 'ऐसा', 'ही', 'असर', 'छोड़ती', 'है', '।'], ['फिल्म', 'की', 'शुरुआत', 'कदंब', 'वंश', 'और', 'उसके', 'क्रूर', 'शासक', 'से', 'होती', 'है', 'जिसकी', 'लालच', 'हर', 'ज़मीन', 'और', 'पानी', 'को', 'कब्ज़े', 'में', 'लेने', 'की', 'है', '।'], ['चाहे', 'आदमी', 'हो', 'औरत', 'या', 'ब

# PHASE 1: PREREQUISITES FOR EMBEDDINGS

## BUILDING VOCABULARY

Word to index and index to word mapping

In [4]:
def build_vocab(corpus):
    """
    TODO: Build vocabulary.

    Steps:
    1. Convert the list of sentences into one single list containing every token.
    2. Filter out duplicates.
    3. Sort the tokens.
    4. Build word2idx (0,1,2..)
    5. Build idx2word
    6. Return all three objects. (vocab, word2idx, idx2word)
    """
    # 1. Convert sentences to single list of tokens
    all_tokens = [token for sentence in corpus for token in sentence]
    # 2 & 3. Filter duplicates and sort
    vocab = sorted(list(set(all_tokens)))
    # 4. Build word2idx
    word2idx = {word: i for i, word in enumerate(vocab)}
    # 5. Build idx2word
    idx2word = {i: word for i, word in enumerate(vocab)}

    return vocab, word2idx, idx2word

In [5]:
hi_vocab, hi_word2idx, hi_idx2word = build_vocab(hi_tokens)
print(f"Hindi Vocabulary: {hi_vocab}")
print(f"Length of Hindi Vocabulary: {len(hi_vocab)}")
print(f"Hindi first sentence: {hi_tokens[0]}")

"""
Expected output:
Hindi Vocabulary: ['\nअब', '\nऐसे', '\nकहानी', '\nक्लाइमेक्स', '\nटेक्निकली' ...]
Length of Hindi Vocabulary: 686
Hindi first sentence: ['कभी', 'कभी', 'कोई'..]
"""

Hindi Vocabulary: ['\nअब', '\nऐसे', '\nकहानी', '\nक्लाइमेक्स', '\nटेक्निकली', '\nधमाकेदार', '\nफर्स्ट', '\nफिल्म', '\nबहुप्रतीक्षित', '\nराजकुमारी', '\n‘कांतारा', 'अंदाज', 'अंधेरी', 'अक्तूबर', 'अगर', 'अच्छाई', 'अछूत', 'अजनिश', 'अटेंशन', 'अदाकारी', 'अद्भुत', 'अधूरा', 'अनाउंसमेंट', 'अनुभव', 'अनुभवी', 'अपडेट', 'अपनी', 'अपने', 'अब', 'अभिनय', 'अभियान', 'अभी', 'अमीर', 'अयोग्य', 'अरविंद', 'अरेस्ट', 'अलग', 'अलावा', 'अवसर', 'असर', 'असली', 'अहम', 'आ', 'आइकॉनिक', 'आई', 'आईडिया', 'आखिर', 'आगे', 'आज', 'आता', 'आती', 'आदमी', 'आदेश', 'आने', 'आप', 'आपकी', 'आपके', 'आपको', 'आवाज़', 'इंटरवल', 'इंडियन', 'इतना', 'इतनी', 'इतने', 'इन', 'इफेक्ट्स', 'इमोशंस', 'इमोशन', 'इस', 'इसकी', 'इसके', 'इससे', 'इसी', 'इसे', 'ईश्वर', 'उठ', 'उतरते', 'उन', 'उनका', 'उनकी', 'उनके', 'उनसे', 'उन्हें', 'उन्होंने', 'उमड़ते', 'उस', 'उसका', 'उसकी', 'उसके', 'उसमें', 'उससे', 'उसे', 'ऊपर', 'ऋषभ', 'ए', 'एंट्री', 'एक', 'एक्ट्रेस', 'एक्शन', 'एक्सपीरिएंस', 'एनर्जी', 'एलान', 'एलान\nदरअसल', 'एस', 'ऐसा', 'ऐसी', 'ऐसे', 'और', 'औरत', 'कई', 'कथा', '

In [6]:
kn_vocab, kn_word2idx, kn_idx2word= build_vocab(kn_tokens)
print(f"Kannada Vocabulary: {kn_vocab}")
print(f"Kannada Vocab Size: {len(kn_vocab)}")
print(f"Kannada first sentence: {kn_tokens[0]}")

"""
Expected output:
Kannada Vocabulary: ['1\nನಿರ್ದೇಶಕ', '1000ಕ್ಕೂ', '10ಕ್ಕೆ', 'ʼಕಾಂತಾರ', 'ʼನಾನು',..]
Kannada Vocab Size: 819
Kannada first sentence: ['ಕಾಂತಾರ', 'ಚಾಪ್ಟರ್\u200c', 'ಬರೀ', 'ಸಿನಿಮಾ' ..]
"""

Kannada Vocabulary: ['1\nನಿರ್ದೇಶಕ', '1000ಕ್ಕೂ', '10ಕ್ಕೆ', 'ʼಕಾಂತಾರ', 'ʼನಾನು', 'ಅಂಕಗಳನ್ನು', 'ಅಂಥದ್ದೇ', 'ಅಂದರೆ', 'ಅಂದುಕೊಳ್ಳುವಂತಾಗುತ್ತದೆ', 'ಅಕ್ಟೋಬರ್', 'ಅಕ್ಷರಶಃ', 'ಅಗ್ನಿ', 'ಅಜನೀಶ್', 'ಅಜನೀಶ್\u200c', 'ಅಜ್ಜನ', 'ಅತ್ಯಂತ', 'ಅದನ್ನು', 'ಅದರ', 'ಅದರಲ್ಲೂ', 'ಅದರಿಂದ', 'ಅದು', 'ಅದೊಂದು', 'ಅದ್ದೂರಿ', 'ಅದ್ದೂರಿತನಕ್ಕೆ', 'ಅದ್ಭುತ', 'ಅದ್ಭುತವಾದ', 'ಅಧಿಕ', 'ಅಧ್ಯಾಯ', 'ಅನಿರೀಕ್ಷಿತ', 'ಅನುಭವ', 'ಅನುಭವವಾಗಿದ್ದು', 'ಅನುಭೂತಿ', 'ಅನುಭೂತಿಗಾಗಿ', 'ಅನುಮಾನವಿಲ್ಲ', 'ಅನ್ನು', 'ಅನ್ನುವುದರಲ್ಲಿ', 'ಅನ್ನೋದೇ', 'ಅಪ್\u200cಡೇಟ್ಸ್\u200c', 'ಅಬ್ಬರ', 'ಅಬ್ಬರಿಸಿದಂತೆ', 'ಅಬ್ಬರಿಸಿದೆ', 'ಅಬ್ಬರಿಸಿದ್ದಾರೆ', 'ಅಭಿನಯ', 'ಅಭಿನಯದ', 'ಅರಮನೆ', 'ಅರವಿಂದ್\u200c', 'ಅರಸ', 'ಅರಸರ', 'ಅರಸರಾಗಿ', 'ಅರಸರು', 'ಅರ್ಧದಷ್ಟು', 'ಅಲ್ಲ', 'ಅವಕಾಶ', 'ಅವತರಿಸುವ', 'ಅವನು', 'ಅವರ', 'ಅವರಿಂದ', 'ಅವರು', 'ಅಷ್ಟನ್ನು', 'ಆ', 'ಆಕರ್ಷಕವಾಗಿದೆ', 'ಆಕ್ಷನ್', 'ಆಕ್ಷನ್\u200cಗೆ', 'ಆಗದಂತೆ', 'ಆಗಲಿದ್ದು', 'ಆಗಿದೆ', 'ಆಗಿದ್ದವು', 'ಆಟ', 'ಆತ', 'ಆತನ', 'ಆತನಿಗೆ', 'ಆದ', 'ಆದರೆ', 'ಆಧುನಿಕವೆನಿಸುವ', 'ಆನುವಂಶಿಕವಾಗಿ', 'ಆಯೋಜನೆ', 'ಆರಂಭದಲ್ಲಿಯೇ', 'ಆರಂಭದಿಂದ', 'ಆರಾಧಿಸುತ್ತಾರೆ', 'ಆಳಕ್ಕೆ', 'ಆಳುತ್ತಿದ್ದಾರೆ', 'ಆಳ್ವಿಕೆಯ', 'ಆವರಣದೊಳಗೆ', 'ಆವರಣವನ್ನು', 'ಆ್ಯಕ್ಷನ್

## One hot encoding

In one-hot encoding, each unique word is mapped to a binary vector in which exactly one position has the value 1 and all others are 0. The length of the vector equals the size of the vocabulary, so each word is uniquely identifiable.

- Transforms categorical text data into numeric vectors
- Each word is represented by a vector with a single 1 and remaining 0s
- Easy to understand and implement
- Works well for small vocabularies but becomes inefficient for large ones
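One consequence of the points above is that one-hot vectors carry no notion of similarity: any two different words are orthogonal. A minimal sketch, assuming a made-up three-word vocabulary (`toy_vocab` and `toy_one_hot` are illustrative names, not part of the lab code):

```python
import numpy as np

toy_vocab = ["cat", "dog", "car"]  # hypothetical toy vocabulary
toy_word2idx = {w: i for i, w in enumerate(toy_vocab)}

def toy_one_hot(word):
    # Vector of zeros with a single 1 at the word's index
    vec = np.zeros(len(toy_vocab))
    vec[toy_word2idx[word]] = 1
    return vec

cat, dog, car = (toy_one_hot(w) for w in toy_vocab)

# The dot product between any two different words is 0,
# so "cat" is no closer to "dog" than it is to "car".
print(np.dot(cat, dog), np.dot(cat, car), np.dot(cat, cat))
```

This is one motivation for the dense embeddings learned later in the notebook, where related words can end up with similar vectors.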

In [7]:
def one_hot_vector(word, word_to_idx):
    vocab_size = len(word_to_idx)
    # Initialize a vector of zeros
    vector = np.zeros(vocab_size)
    # Set the index of the word to 1
    if word in word_to_idx:
        vector[word_to_idx[word]] = 1
    return vector

In [8]:
print(f"One-hot vector for first Hindi word: {one_hot_vector(hi_tokens[0][0], hi_word2idx)}")

One-hot vector for first Hindi word: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0

In [9]:
print(f"One-hot vector for first Kannada word: {one_hot_vector(kn_tokens[0][0], kn_word2idx)}")

One-hot vector for first Kannada word: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0.

## BoW

Bag of Words (BoW) represents a piece of text (a sentence, paragraph, or document) as a collection of its words and counts how often each word appears.

It ignores word order and grammar entirely; only the frequency of each word matters.

The length of every single BoW vector is equal to the length of your entire vocabulary.
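To see what ignoring word order means in practice, here is a small sketch with a made-up three-word vocabulary (`bow_word2idx` and `toy_bow` are illustrative names). Two sentences with opposite meanings receive the same vector.

```python
import numpy as np

bow_word2idx = {"bites": 0, "dog": 1, "man": 2}  # hypothetical toy vocabulary

def toy_bow(sentence):
    # Count how often each vocabulary word occurs in the sentence
    vec = np.zeros(len(bow_word2idx))
    for w in sentence:
        if w in bow_word2idx:
            vec[bow_word2idx[w]] += 1
    return vec

a = toy_bow(["dog", "bites", "man"])
b = toy_bow(["man", "bites", "dog"])

print(a, b)
print(np.array_equal(a, b))  # True: word order is lost
```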

In [10]:
import numpy as np
def bag_of_words(sentence, word2idx):
    """
    TODO: Implement Bag of Words encoding for a sentence.
    """

    # 1. Create a vector of zeros with length equal to the vocabulary size
    vocab_size = len(word2idx)
    vector = np.zeros(vocab_size)

    # 2. Iterate through each word in the sentence
    for word in sentence:
        # 3. If the word is in our vocabulary, increment the count at its index
        if word in word2idx:
            index = word2idx[word]
            vector[index] += 1

    # 4. Return the resulting frequency vector
    return vector

print(f"BoW for first Hindi sentence: {bag_of_words(hi_tokens[0], hi_word2idx)}")

BoW for first Hindi sentence: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 2. 0. 0. 1.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0.

In [11]:
print(f"BoW for first Kannada sentence: {bag_of_words(kn_tokens[0], kn_word2idx)}")

BoW for first Kannada sentence: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 

# PHASE 2: TEXT EMBEDDINGS

## Word2Vec

### 1. Using Skip-gram with Negative Sampling

Word2Vec is a family of neural network models that learn word embeddings, i.e. continuous vector representations of words, based on their context within a corpus. The two main architectures of Word2Vec are:

- Skip-Gram: Predicts the context words given a target word.
- Continuous Bag of Words (CBOW): Predicts the target word from its context.
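The difference between the two architectures shows up in how training examples are built from the same text. The sketch below uses a made-up English sentence and a window of 1. Unlike the lab's `generate_cbow_data`, this sketch keeps edge words and simply gives them a shorter context.

```python
sentence = ["the", "cat", "sat", "on", "mat"]  # hypothetical toy sentence
window = 1

skipgram_pairs = []  # (center word, one context word)
cbow_pairs = []      # (list of context words, center word)

for i, center in enumerate(sentence):
    start = max(0, i - window)
    end = min(len(sentence), i + window + 1)
    ctx = [sentence[j] for j in range(start, end) if j != i]

    # Skip-gram: one training pair per context word
    skipgram_pairs.extend((center, c) for c in ctx)
    # CBOW: one training example per center word
    cbow_pairs.append((ctx, center))

print(skipgram_pairs[:3])
print(cbow_pairs[1])
```

Skip-gram produces more, smaller examples (one per context word), while CBOW produces one example per position with the whole window as input.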


You already know how Skip-gram works theoretically: given a center word (e.g., "drinking"), we try to predict the context words (e.g., "juice", "water").

In a standard implementation, to update the weights for "drinking", the model has to calculate the probability of "juice" against every other word in the dictionary (the denominator in the Softmax function).

- If your Hindi vocabulary is 50,000 words, that is 50,000 calculations per training step.

That is too slow.
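A rough sketch of where that cost comes from, using random vectors in place of trained ones. The embedding size `d = 100` and the random initialisation are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 50_000, 100  # vocabulary size from the example; d is an assumed embedding size

center = rng.normal(size=d)                    # vector of the center word
out_vectors = rng.normal(size=(V, d)) * 0.01   # one output vector per vocabulary word

# Full softmax: one dot product per vocabulary word, for every training step
scores = out_vectors @ center                  # shape (V,)
probs = np.exp(scores - scores.max())          # subtract the max for numerical stability
probs /= probs.sum()                           # the denominator sums over all V words

print(scores.shape, probs.sum())
```

Every one of the `V` scores feeds into the denominator, so the gradient for a single training pair touches every output vector.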

Negative sampling addresses this by changing the training objective: instead of predicting a full probability distribution over the vocabulary (as softmax does), the model only learns to distinguish the true context word from a few noise (negative) words. Each training step therefore updates the weights of only a small number of words rather than the whole vocabulary, which significantly reduces computation.

In negative sampling, for each word-context pair, the model not only processes the actual context words (positive samples) but also a few randomly chosen words from the vocabulary that do not appear in the context (negative samples). The modified objective function aims to:

- Maximize the probability that a word-context pair (target word and its context word) is observed in the corpus.
- Minimize the probability that randomly sampled word-context pairs are observed.
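The two goals above can be written as a single loss for one (target, context) pair with `k` negatives. This is a minimal NumPy sketch under assumed names (`neg_sampling_loss`, `log_sigmoid`) and random vectors, not the lab's PyTorch model.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x))
    return -np.logaddexp(0.0, -x)

def neg_sampling_loss(center_vec, context_vec, negative_vecs):
    # Positive term: push sigmoid(center . context) towards 1
    pos = log_sigmoid(np.dot(center_vec, context_vec))
    # Negative terms: push sigmoid(center . negative) towards 0
    neg = np.sum(log_sigmoid(-(negative_vecs @ center_vec)))
    # Negate so that minimising the loss maximises both terms
    return -(pos + neg)

rng = np.random.default_rng(0)
d, k = 100, 5  # assumed embedding size and number of negatives
center_vec = rng.normal(size=d) * 0.1
context_vec = rng.normal(size=d) * 0.1
negative_vecs = rng.normal(size=(k, d)) * 0.1

loss = neg_sampling_loss(center_vec, context_vec, negative_vecs)
print(loss)
```

Only `k + 1` dot products are needed per pair, compared with one per vocabulary word for the full softmax.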

In [13]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# Hyperparameters

# Fixed:
embedding_dim_skipg = 100          # Size of the vector for each word
context_size_skipg = 2             # "Window size": look 2 words behind and 2 words ahead

# Tunable: (TUNE THESE PARAMS TO GET A DECREASING LOSS WITH LAST EPOCH LOSS IN THE RANGE OF 0-5)
# TRY TO ACHIEVE THIS WITH THE SAME hyperparameters FOR BOTH LANGUAGES AND IN THESE RANGES ITSELF
num_negative_samples_skipg = 5     # RANGE = 5-10
learning_rate_skipg = 0.04         # RANGE = 0.01 - 0.05
num_epochs_skipg = 15              # RANGE = 5 - 20
batch_size_skipg = 256             # RANGE = 128 - 512

# 2. DATA PREPARATION

# 3. DATASET CLASS
class SkipGramDataset(Dataset):
    def __init__(self, data):
        self.data = data # list of (target, context) tuples

    def __len__(self):
        # Returns the total number of training pairs
        return len(self.data)

    def __getitem__(self, idx):
        # Returns a single pair at index `idx`
        return self.data[idx]

def generate_training_data_skipgram(words, word2idx, context_size):
    """
    TODO: Implement the Skip-gram training data generator.

    Steps:
    1. data (the output list) -> (target, context) pairs.
    2. Loop through the words considering the context_size.
    3. Identify the Target using the word2idx mapping.
    4. Identify the left and right context words and convert these words into their respective indices from word2idx.
    5. For EVERY context word in the window, create a tuple of (target_index, context_index) and add it to output list.
    6. Return the output list.
    """
    data = []

    for i, word in enumerate(words):
        # 3. Identify Target index
        target_index = word2idx[word]

        # 4. Identify context window boundaries
        start = max(0, i - context_size)
        end = min(len(words), i + context_size + 1)

        for j in range(start, end):
            # Skip the target word itself
            if i != j:
                # 4 cont. Convert context word to index
                context_index = word2idx[words[j]]
                # 5. Add tuple to list
                data.append((target_index, context_index))

    # 6. Return the output list
    return data

In [14]:
def get_negative_samples(target, num_negative_samples, vocab_size):
    """
    TODO: Implement negative sampling.
    """
    neg_samples = []
    # 1. Take a random integer from the range of the vocabulary size
    # and make sure it is not the same as the target word index.
    while len(neg_samples) < num_negative_samples:
        sample = np.random.randint(0, vocab_size)
        if sample != target:
            # 2. Repeat this process until you have the required number of negative samples.
            neg_samples.append(sample)
    return neg_samples

In [15]:

# 4. MODEL ARCHITECTURE

class SkipGramNegSampling(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(SkipGramNegSampling, self).__init__()
        self.vocab_size = vocab_size

        # Input Embeddings (Target words)
        # This is what we usually extract as the final embeddings
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)

        # Output Embeddings (Context/Negative words)
        # These are internal weights used for calculation
        self.context_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # log(sigmoid(x)) as the loss function involves logs of probabilities
        self.log_sigmoid = nn.LogSigmoid()

    def forward(self, target, context, negative_samples):
            """
            TODO: Implement the forward pass for Skip-gram with Negative Sampling.
            """
            # 1. Embeddings: Retrieve vectors
            emb_target = self.embeddings(target)               # [batch, embed_dim]
            emb_context = self.context_embeddings(context)     # [batch, embed_dim]
            emb_neg = self.context_embeddings(negative_samples) # [batch, k, embed_dim]

            # 2. Positive Score: Dot product and LogSigmoid
            # torch.sum(a*b, dim=1) computes the dot product for each pair in the batch
            pos_score = self.log_sigmoid(torch.sum(emb_target * emb_context, dim=1))

            # 3 & 4. Negative Score: Use batch matrix multiplication 'bmm'
            # emb_target.unsqueeze(2) makes it [batch, embed_dim, 1]
            neg_score_raw = torch.bmm(emb_neg, emb_target.unsqueeze(2)).squeeze(2) # [batch, k]
            neg_score = torch.sum(self.log_sigmoid(-neg_score_raw), dim=1)

            # 5 & 6. Loss: Combine scores and return negative mean (to minimize)
            loss = -torch.mean(pos_score + neg_score)

            return loss

In [16]:
# 5. TRAINING

# Flatten the sentences into one long list of words
# (Because the sliding window function expects a single sequence)
hi_flat_words = [word for sentence in hi_tokens for word in sentence]
kn_flat_words = [word for sentence in kn_tokens for word in sentence]

word2vec_skipgram_hi = generate_training_data_skipgram(hi_flat_words, hi_word2idx, context_size_skipg)
word2vec_skipgram_kn = generate_training_data_skipgram(kn_flat_words, kn_word2idx, context_size_skipg)

print(f"Training pairs generated for Hindi: {len(word2vec_skipgram_hi)}")
print(f"Training pairs generated for Kannada: {len(word2vec_skipgram_kn)}")

# 1. Create Dataset and DataLoader
dataset_hi = SkipGramDataset(word2vec_skipgram_hi)
loader_hi = DataLoader(dataset_hi, batch_size=batch_size_skipg, shuffle=True)

# 2. Initialize Model and Optimizer
model_hi = SkipGramNegSampling(vocab_size=len(hi_vocab), embedding_dim=embedding_dim_skipg)
optimizer_hi = optim.Adam(model_hi.parameters(), lr=learning_rate_skipg)

print("Starting Hindi Training:")

# 3. Training Loop
model_hi.to("cuda") # Move model to GPU
for epoch in range(num_epochs_skipg):
    total_loss = 0

    # 1. Loop through each batch from the DataLoader
    for target, context in loader_hi:

        # 2. Generate negative samples for the batch
        neg_samples = torch.stack([torch.LongTensor(get_negative_samples(t.item(), num_negative_samples_skipg, len(hi_vocab))) for t in target])

        # Move tensors to GPU
        target, context, neg_samples = target.to("cuda"), context.to("cuda"), neg_samples.to("cuda")

        # 3. Zero the gradients
        optimizer_hi.zero_grad()

        # 4. Forward Pass -> Backward Pass -> Optimizer Step
        loss = model_hi(target, context, neg_samples)
        loss.backward()
        optimizer_hi.step()

        # 5. Accumulate the loss
        total_loss += loss.item()

    print(f"Epoch {epoch + 1}, Hindi Loss: {total_loss / len(loader_hi):.4f}")

Training pairs generated for Hindi: 7890
Training pairs generated for Kannada: 4518
Starting Hindi Training:
Epoch 1, Hindi Loss: 20.2983
Epoch 2, Hindi Loss: 11.9907
Epoch 3, Hindi Loss: 6.7172
Epoch 4, Hindi Loss: 3.6679
Epoch 5, Hindi Loss: 2.2912
Epoch 6, Hindi Loss: 1.6977
Epoch 7, Hindi Loss: 1.4224
Epoch 8, Hindi Loss: 1.2992
Epoch 9, Hindi Loss: 1.2496
Epoch 10, Hindi Loss: 1.1917
Epoch 11, Hindi Loss: 1.1155
Epoch 12, Hindi Loss: 1.1027
Epoch 13, Hindi Loss: 1.1007
Epoch 14, Hindi Loss: 1.1130
Epoch 15, Hindi Loss: 1.0852


In [17]:
# 4. Extract Embeddings
embeddings_hi = model_hi.embeddings.weight.detach().cpu().numpy()

# 5. Test
test_word_hi = "उसके"
if test_word_hi in hi_word2idx:
    print(f"Vector for '{test_word_hi}':\n", embeddings_hi[hi_word2idx[test_word_hi]][:10], "... (truncated)")
else:
    print(f"'{test_word_hi}' not found in Hindi vocabulary.")

Vector for 'उसके':
 [ 0.7763243  -0.5938838   1.182088    0.5238703   1.2975618   0.19986677
  1.0705981   2.2289062   0.66192013  1.4417475 ] ... (truncated)


In [18]:
# 1. Create Dataset and DataLoader
dataset_kn = SkipGramDataset(word2vec_skipgram_kn)
loader_kn = DataLoader(dataset_kn, batch_size=batch_size_skipg, shuffle=True)

# 2. Initialize Model and Optimizer
model_kn = SkipGramNegSampling(vocab_size=len(kn_vocab), embedding_dim=embedding_dim_skipg)
optimizer_kn = optim.Adam(model_kn.parameters(), lr=learning_rate_skipg)

print("\nStarting Kannada Training:")

# 3. Training Loop
model_kn.to("cuda") # Move model to GPU
for epoch in range(num_epochs_skipg):
    total_loss = 0

    # Implement the training loop for Kannada model
    for target, context in loader_kn:
        neg_samples = torch.stack([torch.LongTensor(get_negative_samples(t.item(), num_negative_samples_skipg, len(kn_vocab))) for t in target])

        target, context, neg_samples = target.to("cuda"), context.to("cuda"), neg_samples.to("cuda")

        optimizer_kn.zero_grad()
        loss = model_kn(target, context, neg_samples)
        loss.backward()
        optimizer_kn.step()

        total_loss += loss.item()

    print(f"Epoch {epoch + 1}, Kannada Loss: {total_loss / len(loader_kn):.4f}")


Starting Kannada Training:
Epoch 1, Kannada Loss: 23.5287
Epoch 2, Kannada Loss: 18.3893
Epoch 3, Kannada Loss: 15.2789
Epoch 4, Kannada Loss: 12.3318
Epoch 5, Kannada Loss: 8.6617
Epoch 6, Kannada Loss: 5.4503
Epoch 7, Kannada Loss: 3.1851
Epoch 8, Kannada Loss: 2.0150
Epoch 9, Kannada Loss: 1.3092
Epoch 10, Kannada Loss: 1.0065
Epoch 11, Kannada Loss: 0.8715
Epoch 12, Kannada Loss: 0.7101
Epoch 13, Kannada Loss: 0.6873
Epoch 14, Kannada Loss: 0.6598
Epoch 15, Kannada Loss: 0.5825


In [19]:
# 4. Extract Embeddings
embeddings_kn = model_kn.embeddings.weight.detach().cpu().numpy()

# 5. Test a word
test_word_kn = "ಅದು"
if test_word_kn in kn_word2idx:
    print(f"Vector for '{test_word_kn}':\n", embeddings_kn[kn_word2idx[test_word_kn]][:10], "... (truncated)")
else:
    print(f"'{test_word_kn}' not found in Kannada vocabulary.")

Vector for 'ಅದು':
 [-0.07927098 -1.0062981   0.13388993  0.5689989   0.7463421  -1.3144639
 -0.9616554   0.59798324  1.7237269  -0.7442783 ] ... (truncated)


In [20]:
# INFERENCE / CHECKING RESULTS
def get_similar_words(word, embeddings, word2idx, idx2word, top_n=5):
    """
    TODO: Finds the closest words in the vector space using Dot Product.

    Steps:
    1. Check if the word is in the vocabulary.
    2. Get the embedding vector for the input word and calculate the similarity between this vector and all other word vectors.
    3. Sort the indices of the other words based on similarity scores in descending order.
    4. Return the top N closest words (excluding the input word itself).
    """
    # 1. Return an empty list if the word is not in the vocabulary
    if word not in word2idx:
        return []
    target_idx = word2idx[word]
    dot_products = np.dot(embeddings, embeddings[target_idx])
    sorted_indices = np.argsort(-dot_products)
    closest_idxs = [idx for idx in sorted_indices if idx != target_idx][:top_n]

    return [idx2word[idx] for idx in closest_idxs]

# Usage for Hindi
print(f"Hindi Similar Words: {get_similar_words('उसके', embeddings_hi, hi_word2idx, hi_idx2word)}")

# Usage for Kannada (Example)
print(f"Kannada Similar Words: {get_similar_words('ಅದು', embeddings_kn, kn_word2idx, kn_idx2word)}")

Hindi Similar Words: ['दिखता', 'था―', 'गढ़ा', 'गायब', 'बैकग्राउंड']
Kannada Similar Words: ['ನೀಡುವ', 'ಅಕ್ಟೋಬರ್', 'ಚಾಕಚಕ್ಯತೆ', 'ಶಕ್ತಿಗಳ', 'ಸಾವಿರ']


Some of the returned words may not look closely related, because the corpus is small.
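A raw dot product also rewards vectors with a large norm, regardless of direction. One possible alternative is cosine similarity, which normalises each vector first. The function `cosine_similar_words` and the toy data below are hypothetical and only sketch the idea.

```python
import numpy as np

def cosine_similar_words(word, embeddings, word2idx, idx2word, top_n=5):
    # Hypothetical variant of a nearest-word lookup using cosine similarity
    if word not in word2idx:
        return []
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # avoid division by zero
    target_idx = word2idx[word]
    sims = unit @ unit[target_idx]
    order = np.argsort(-sims)
    return [idx2word[int(i)] for i in order if i != target_idx][:top_n]

# Toy data: "b" has a large norm but points in a different direction from "a"
toy_emb = np.array([[1.0, 0.0], [10.0, 10.0], [0.9, 0.1], [-1.0, 0.0]])
toy_w2i = {"a": 0, "b": 1, "c": 2, "d": 3}
toy_i2w = {i: w for w, i in toy_w2i.items()}

print(cosine_similar_words("a", toy_emb, toy_w2i, toy_i2w, top_n=2))  # ['c', 'b']
```

With a plain dot product, "b" would rank first for "a" because of its length; with cosine similarity, "c" ranks first because it points in nearly the same direction.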

### 2. Using CBOW with Negative Sampling

CBOW is a neural network-based algorithm that predicts a target word from its surrounding context words. It is a type of unsupervised learning, meaning it learns from unlabeled data, and it is often used to pre-train word embeddings for downstream NLP tasks such as sentiment analysis, text classification, and machine translation.
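A common way to combine the context words, and the one assumed in this sketch, is to average their input vectors into a single hidden vector. The table `toy_in_embeddings` and the indices are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_in_embeddings = rng.normal(size=(10, 4))  # hypothetical: 10 words, 4-dim vectors

# Indices of the 4 context words around one center word (window = 2)
context_indices = np.array([1, 2, 4, 5])

# Average the context vectors into one hidden vector
hidden = toy_in_embeddings[context_indices].mean(axis=0)

# Reversing the context order gives the same hidden vector:
# like Bag of Words, the averaged context ignores word order.
reversed_hidden = toy_in_embeddings[context_indices[::-1]].mean(axis=0)

print(hidden.shape, np.allclose(hidden, reversed_hidden))
```

The hidden vector is then scored against the target word's output vector, in the same way a single center vector is scored against a context word in Skip-gram.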

In [21]:
# 1. hyperparameters

# FIXED
embedding_dim_cbow = 100
context_size_cbow = 2

# Tunable: (TUNE THESE PARAMS TO GET A DECREASING LOSS WITH LAST EPOCH LOSS IN THE RANGE OF 0-5)
# TRY TO ACHIEVE THIS WITH THE SAME hyperparameters FOR BOTH LANGUAGES AND IN THESE RANGES ITSELF
num_negative_samples_cbow = 5 # RANGE = 5-10
learning_rate_cbow = 0.03       # RANGE = 0.001 - 0.05
num_epochs_cbow = 15            # RANGE = 5 - 20
batch_size_cbow = 128           # RANGE = 32 - 512

# 2. DATA PREPARATION

class CBOWDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        context_list, target = self.data[idx]
        # Return the context indices and the target index as LongTensors
        return torch.LongTensor(context_list), torch.tensor(target, dtype=torch.long)

def generate_cbow_data(words, word2idx, context_size):
    """
    Generates training data for CBOW.
    Input: Window of surrounding words
    Output: The center word
    Example: "The cat sat on mat" (window=1) -> Input: [The, sat], Output: cat

    TODO:
    Steps:
    1. data (the output list) -> (context, target) pairs.
    2. Loop through the words considering the context_size.
    3. Identify the Target using the word2idx mapping.
    4. Identify the left and right context words and convert these words into their respective indices from word2idx.
    5. output list -> (context_words, target_word) tuples.
    6. Return the output list.
    """
    data = []
    # 2. Loop through the words considering the context_size
    for i in range(context_size, len(words) - context_size):
        # 3. Identify the Target
        target_word = words[i]
        target_idx = word2idx[target_word]

        # 4. Identify context words indices
        context_indices = []
        for j in range(i - context_size, i + context_size + 1):
            if i != j:
                context_indices.append(word2idx[words[j]])

        # 5. output list -> (context_words, target_word) tuples
        data.append((context_indices, target_idx))

    # 6. Return the output list.
    return data

In [22]:
# 3. MODEL ARCHITECTURE

class CBOWNegSampling(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(CBOWNegSampling, self).__init__()

        # Input Embeddings (For Context Words)
        # In CBOW, we average these to get the hidden layer
        self.in_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # Output Embeddings (For Target/Center Word)
        self.out_embeddings = nn.Embedding(vocab_size, embedding_dim)

        self.log_sigmoid = nn.LogSigmoid()

    def forward(self, context_indices, target_index, negative_indices):
            # 1. Lookup embeddings for context words and take mean
            context_embeds = self.in_embeddings(context_indices) # [batch, 2*window, dim]
            context_avg = torch.mean(context_embeds, dim=1)      # [batch, dim]

            # 2. Process Target and Negatives
            target_embed = self.out_embeddings(target_index)     # [batch, dim]
            neg_embeds = self.out_embeddings(negative_indices)   # [batch, k, dim]

            # 3. Positive Similarity (Dot product)
            pos_dot = torch.sum(context_avg * target_embed, dim=1)
            pos_score = self.log_sigmoid(pos_dot)

            # 4. Negative Similarity (BMM for batch dot product)
            neg_dot = torch.bmm(neg_embeds, context_avg.unsqueeze(2)).squeeze(2)
            neg_score = torch.sum(self.log_sigmoid(-neg_dot), dim=1)

            # 5. Total Loss
            loss = -torch.mean(pos_score + neg_score)

            return loss

In [23]:
# 4. TRAINING LOOP

cbow_data_hi = generate_cbow_data(hi_flat_words, hi_word2idx, context_size_cbow)
dataset_hi = CBOWDataset(cbow_data_hi)
loader_hi = DataLoader(dataset_hi, batch_size=batch_size_cbow, shuffle=True)

vocab_size = len(hi_word2idx)
model_cbow = CBOWNegSampling(vocab_size, embedding_dim_cbow)
optimizer = optim.Adam(model_cbow.parameters(), lr=learning_rate_cbow)

print(f"Starting CBOW Training:")


model_cbow.to("cuda")
for epoch in range(num_epochs_cbow):
    total_loss = 0
    for context, target in loader_hi:
        # Generate negative samples for batch
        neg_indices = torch.stack([torch.LongTensor(get_negative_samples(t.item(), num_negative_samples_cbow, len(hi_vocab))) for t in target])

        # Move to GPU
        context, target, neg_indices = context.to("cuda"), target.to("cuda"), neg_indices.to("cuda")

        # Training steps
        optimizer.zero_grad()
        loss = model_cbow(context, target, neg_indices)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    print(f"Epoch {epoch + 1}, Loss: {total_loss / len(loader_hi):.4f}")

Starting CBOW Training:
Epoch 1, Loss: 11.5630
Epoch 2, Loss: 8.6191
Epoch 3, Loss: 5.6226
Epoch 4, Loss: 3.5168
Epoch 5, Loss: 2.1664
Epoch 6, Loss: 1.4733
Epoch 7, Loss: 1.0991
Epoch 8, Loss: 0.7824
Epoch 9, Loss: 0.6947
Epoch 10, Loss: 0.5269
Epoch 11, Loss: 0.4255
Epoch 12, Loss: 0.4164
Epoch 13, Loss: 0.3155
Epoch 14, Loss: 0.2812
Epoch 15, Loss: 0.2643


In [24]:
embeddings_cbow_hi = model_cbow.in_embeddings.weight.detach().cpu().numpy()

# 5. Test
test_word_hi = "उसके"
if test_word_hi in hi_word2idx:
    print(f"Vector for '{test_word_hi}':\n", embeddings_cbow_hi[hi_word2idx[test_word_hi]][:10], "... (truncated)")
else:
    print(f"'{test_word_hi}' not found in Hindi vocabulary.")

Vector for 'उसके':
 [ 1.0845292  -0.6521129  -0.81119025 -1.3476454  -0.8005418  -0.5911268
 -1.7747544   0.76654977 -0.5583952  -0.8646731 ] ... (truncated)


In [25]:
cbow_data_kn = generate_cbow_data(kn_flat_words, kn_word2idx, context_size_cbow)
dataset_kn_cbow = CBOWDataset(cbow_data_kn)
loader_kn_cbow = DataLoader(dataset_kn_cbow, batch_size=batch_size_cbow, shuffle=True)

model_cbow_kn = CBOWNegSampling(vocab_size=len(kn_vocab), embedding_dim=embedding_dim_cbow)
optimizer_kn = optim.Adam(model_cbow_kn.parameters(), lr=learning_rate_cbow)

print("Starting Kannada CBOW Training:")

# Use the GPU when available; otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model_cbow_kn.to(device)
for epoch in range(num_epochs_cbow):
    total_loss = 0
    for context, target in loader_kn_cbow:
        # Generate negative samples
        neg_indices = torch.stack([torch.LongTensor(get_negative_samples(t.item(), num_negative_samples_cbow, len(kn_vocab))) for t in target])

        # Move to the selected device
        context, target, neg_indices = context.to(device), target.to(device), neg_indices.to(device)

        optimizer_kn.zero_grad()
        loss = model_cbow_kn(context, target, neg_indices)
        loss.backward()
        optimizer_kn.step()

        total_loss += loss.item()

    print(f"Epoch {epoch + 1}, Kannada Loss: {total_loss / len(loader_kn_cbow):.4f}")

Starting Kannada CBOW Training:
Epoch 1, Kannada Loss: 12.3141
Epoch 2, Kannada Loss: 10.7360
Epoch 3, Kannada Loss: 9.7622
Epoch 4, Kannada Loss: 9.1101
Epoch 5, Kannada Loss: 8.3204
Epoch 6, Kannada Loss: 7.0932
Epoch 7, Kannada Loss: 6.2195
Epoch 8, Kannada Loss: 4.6496
Epoch 9, Kannada Loss: 3.2657
Epoch 10, Kannada Loss: 2.2862
Epoch 11, Kannada Loss: 1.7460
Epoch 12, Kannada Loss: 1.1979
Epoch 13, Kannada Loss: 0.9294
Epoch 14, Kannada Loss: 0.6758
Epoch 15, Kannada Loss: 0.5488


In [26]:
embeddings_kn_cbow = model_cbow_kn.in_embeddings.weight.detach().cpu().numpy()

test_word_kn = "ಅದು"
if test_word_kn in kn_word2idx:
    print(f"Vector for '{test_word_kn}':\n", embeddings_kn_cbow[kn_word2idx[test_word_kn]][:10], "... (truncated)")
else:
    print(f"'{test_word_kn}' not found in Kannada vocabulary.")

Vector for 'ಅದು':
 [-0.04312239 -0.37834284 -0.94859004  1.4072022  -1.0910386  -1.0139536
 -0.8097962  -1.565552   -0.93464065  0.02042177] ... (truncated)


In [27]:
# Usage for Hindi
print(f"Hindi Similar Words: {get_similar_words('उसके', embeddings_cbow_hi, hi_word2idx, hi_idx2word)}")

# Usage for Kannada (Example)
print(f"Kannada Similar Words: {get_similar_words('ಅದು', embeddings_kn_cbow, kn_word2idx, kn_idx2word)}")

Hindi Similar Words: ['हटाएं\nप्रीक्वल', 'जंगल', 'वाले', 'संवेदनशीलता', 'एंट्री']
Kannada Similar Words: ['ತೊಂದರೆಗೆ', 'ಕರ್ನಾಟಕಕ್ಕೆ', 'ಆತನಿಗೆ', 'ಕಾಡು', 'ದೈವ']


## GloVe

GloVe is an unsupervised learning algorithm designed to generate dense vector representations. Its primary objective is to capture semantic relationships between words by analyzing their co-occurrence patterns in a large text corpus.

GloVe first builds a word co-occurrence matrix in which each element reflects how often a pair of words appears together within a given context window. It then optimizes the word vectors so that the dot product of two word vectors, plus their bias terms, approximates the logarithm of the pair's co-occurrence count.

Key Weighting Parameters: To ensure stability, the model uses a weighted loss function controlled by two specific hyperparameters:

- X_MAX (Weighting Cutoff): This caps the influence of extremely frequent word pairs (like stop words "the-is"). Any pair appearing more than X_MAX times is treated as having equal weight, preventing them from dominating the training.

- ALPHA (Scaling Factor): Usually set to 0.75, this non-linear scaling dampens the impact of very rare co-occurrences, ensuring the model learns robust patterns rather than noise.

This optimization allows GloVe to produce embeddings that effectively encode both syntactic and semantic relationships across the vocabulary.
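
To make the two weighting parameters concrete, here is a minimal sketch of the weighting function f(x) = (x / X_MAX)^ALPHA, capped at 1.0. The helper name `glove_weight` is made up for this illustration and is not part of the lab code; the defaults mirror the fixed values used in this notebook.

```python
def glove_weight(count, x_max=100, alpha=0.75):
    """Weight f(x) = (x / x_max) ** alpha, capped at 1.0 (illustrative helper)."""
    return min(1.0, (count / x_max) ** alpha)

# Rare pairs get a small weight; pairs at or above x_max all get 1.0
for count in (1, 10, 100, 1000):
    print(count, round(glove_weight(count), 4))
```

Every pair with a count at or above `x_max` contributes with the same weight, while a pair seen only once contributes roughly 3% as much.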

In [28]:
from collections import defaultdict
from torch.utils.data import DataLoader, TensorDataset

# 1. hyperparameters

# FIXED
EMBEDDING_DIM_GLOVE = 100
WINDOW_SIZE_GLOVE = 2
X_MAX = 100                 # GloVe weighting cutoff (pairs with count > x_max get weight 1.0)
ALPHA = 0.75                # GloVe scaling factor (usually 0.75)

# Tunable: (TUNE THESE PARAMS TO GET A DECREASING LOSS WITH LAST EPOCH LOSS IN THE RANGE OF 0-5)
# TRY TO ACHIEVE THIS WITH THE SAME hyperparameters FOR BOTH LANGUAGES AND IN THESE RANGES ITSELF
BATCH_SIZE_GLOVE = 512 # RANGE = 256 - 1024
LEARNING_RATE_GLOVE = 0.05 # RANGE = 0.001 - 0.05
NUM_EPOCHS_GLOVE = 15 # RANGE = 5 - 20

# 2. MATRIX
def build_cooccurrence_matrix(tokenized_corpus, word2idx, window_size):
    """
    Builds a dictionary of {(word_i_id, word_j_id): weighted_count}

    TODO: Build a weighted co-occurrence matrix for GloVe training.

    Steps:
    1. coocs -> (center_idx, context_idx) as key and weight as value.
    2. For each sentence, convert words to their corresponding indices, ensuring only words in 'word2idx' are kept.
    3. For each word (center) in the sentence:
       - Determine the 'start' and 'end' boundaries based on 'window_size'.
       - Iterate through all words within that window.
    4. Calculate the distance between the two word positions. weight = 1.0 / distance
    5. Return the dictionary of co-occurrences.
    """
    coocs = defaultdict(float)
    print(f"Building Co-occurrence Matrix (Window: {window_size}):")

    # 2. For each sentence, convert words to indices
    for sentence in tokenized_corpus:
        indices = [word2idx[w] for w in sentence if w in word2idx]

        # 3. For each word (center) in the sentence
        for i, center_idx in enumerate(indices):
            # Determine boundaries
            start = max(0, i - window_size)
            end = min(len(indices), i + window_size + 1)

            for j in range(start, end):
                if i == j: continue
                context_idx = indices[j]

                # 4. Calculate distance and weight
                distance = abs(i - j)
                weight = 1.0 / distance
                coocs[(center_idx, context_idx)] += weight

    return coocs

def prepare_tensors(cooc_dict):
    """Converts the dictionary to PyTorch tensors."""
    i_idxs, j_idxs, counts = [], [], []
    for (i, j), count in cooc_dict.items():
        i_idxs.append(i)
        j_idxs.append(j)
        counts.append(count)

    return (torch.LongTensor(i_idxs),
            torch.LongTensor(j_idxs),
            torch.FloatTensor(counts))
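
To see what the distance weighting produces, the following self-contained sketch applies the same windowing logic to a three-token toy sentence. The function `toy_cooccurrence` and the placeholder tokens are assumptions made for this example; it keys the dictionary by the tokens themselves rather than by vocabulary indices so the result is easy to read.

```python
from collections import defaultdict

def toy_cooccurrence(tokens, window_size):
    """Distance-weighted co-occurrence counts for one tokenized sentence."""
    coocs = defaultdict(float)
    for i, center in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if i != j:
                # Neighbours one step away add 1.0, two steps away add 0.5
                coocs[(center, tokens[j])] += 1.0 / abs(i - j)
    return dict(coocs)

pairs = toy_cooccurrence(["a", "b", "c"], window_size=2)
for pair, weight in sorted(pairs.items()):
    print(pair, weight)
```

Adjacent tokens such as `("a", "b")` receive a weight of 1.0, while `("a", "c")`, two positions apart, receives 0.5. Each ordered pair is stored separately, so the dictionary is symmetric.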

In [29]:
# 3. GLOVE MODEL
class GloVe(nn.Module):
    def __init__(self, vocab_size, embed_dim, x_max, alpha):
        super(GloVe, self).__init__()

        # Center word embeddings (W) and biases (b)
        self.w_i = nn.Embedding(vocab_size, embed_dim)
        self.b_i = nn.Embedding(vocab_size, 1)

        # Context word embeddings (W_tilde) and biases (b_tilde)
        self.w_j = nn.Embedding(vocab_size, embed_dim)
        self.b_j = nn.Embedding(vocab_size, 1)

        self.x_max = x_max
        self.alpha = alpha

    def forward(self, i_indices, j_indices, counts):
        """
        TODO:

        Steps:
        1. Prediction: dot(w_i, w_j) + b_i + b_j
        2. Target: log(counts)
        3. Weights: (count / x_max)^alpha, capped at 1.0
        4. Compute the weighted mean squared error between the predictions and targets and return it.

        """
        # 1. Prediction: w_i * w_j + b_i + b_j
        w_i = self.w_i(i_indices)
        w_j = self.w_j(j_indices)
        # squeeze(1) drops only the trailing size-1 dim, so a batch of one stays 1-D
        b_i = self.b_i(i_indices).squeeze(1)
        b_j = self.b_j(j_indices).squeeze(1)

        # Dot product across the embedding dimension
        prediction = torch.sum(w_i * w_j, dim=1) + b_i + b_j

        # 2. Target: log(counts)
        target = torch.log(counts)

        # 3. Weights: (count / x_max)^alpha, capped at 1.0
        weight = torch.pow(counts / self.x_max, self.alpha)
        weight = torch.where(weight > 1.0, torch.ones_like(weight), weight)

        # 4. Compute weighted MSE
        loss = torch.mean(weight * torch.pow(prediction - target, 2))
        return loss

    def get_final_embeddings(self):
        # Summing the two matrices as per GloVe paper
        return self.w_i.weight.detach().cpu().numpy() + \
               self.w_j.weight.detach().cpu().numpy()

In [34]:

# 4. TRAINING FUNCTION
def run_glove(corpus, word2idx, name):
    # Build Data
    cooc_dict = build_cooccurrence_matrix(corpus, word2idx, WINDOW_SIZE_GLOVE)
    i_vec, j_vec, count_vec = prepare_tensors(cooc_dict)

    dataset = TensorDataset(i_vec, j_vec, count_vec)
    loader = DataLoader(dataset, batch_size=BATCH_SIZE_GLOVE, shuffle=True)

    # Setup Model
    model = GloVe(len(word2idx), EMBEDDING_DIM_GLOVE, X_MAX, ALPHA)
    optimizer = optim.Adagrad(model.parameters(), lr=LEARNING_RATE_GLOVE)

    print(f"Starting {name} Training ({len(cooc_dict)} pairs):")

    # Train
    for epoch in range(NUM_EPOCHS_GLOVE):
        """
        TODO:
        Loop through each batch(i, j, count) from the DataLoader and perform the exact steps as in the Skip-gram and CBOW training loops (forward pass, backward pass, optimizer step).
        """
        total_loss = 0

        # 1. Loop through each batch
        for i_batch, j_batch, count_batch in loader:

            # 2. Zero the gradients
            optimizer.zero_grad()

            # 3. Forward Pass (Calculates loss)
            loss = model(i_batch, j_batch, count_batch)

            # 4. Backward Pass (Calculates gradients)
            loss.backward()

            # 5. Optimizer Step (Updates weights)
            optimizer.step()

            # 6. Accumulate loss
            total_loss += loss.item()

        print(f"Epoch {epoch+1}, Loss: {total_loss/len(loader):.4f}")

    return model.get_final_embeddings()

In [36]:
# 5. TESTING
def get_similar_words(word, embeddings, word2idx, idx2word, top_n=5):
    """
    TODO:
    Find the closest words in the vector space using Cosine Similarity.

    Steps:
    1. Handle Out-of-Vocabulary.
    2. Get the specific embedding for the input word.
    3. To calculate cosine similarity efficiently, divide the entire embeddings matrix by its row-wise L2 norm.
    4. Divide the target word's vector by its L2 norm. (Normalization)
    5. Perform a dot product between the normalized matrix and the normalized target vector.
    6. Sort the scores in descending order.
    7. Exclude self and take the next 'top_n' indices.
    """
    # 1. Handle OOV
    if word not in word2idx: return []

    # 2. Get specific embedding
    target_idx = word2idx[word]
    target_vec = embeddings[target_idx]

    # 3. Normalize matrix (row-wise L2 norm)
    # We add 1e-9 to avoid division by zero
    norm_matrix = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-9)

    # 4. Normalize target vector
    norm_target = target_vec / (np.linalg.norm(target_vec) + 1e-9)

    # 5. Dot product (Cosine Similarity)
    similarities = np.dot(norm_matrix, norm_target)

    # 6. Sort scores in descending order
    all_sorted = np.argsort(-similarities)

    # 7. Exclude self and take top_n
    # This sets the variable required by the boilerplate return line
    sorted_idxs = [i for i in all_sorted if i != target_idx][:top_n]

    return [idx2word[i] for i in sorted_idxs]

print("\nHindi (GloVe):")
glove_embeddings_hi = run_glove(hi_tokens, hi_word2idx, "Hindi")
test_word_hi_glove = "उसके"
print(f"Similar to '{test_word_hi_glove}':", get_similar_words(test_word_hi_glove, glove_embeddings_hi, hi_word2idx, hi_idx2word))


Hindi (GloVe):
Building Co-occurrence Matrix (Window: 2):
Starting Hindi Training (6099 pairs):
Epoch 1, Loss: 2.8444
Epoch 2, Loss: 1.3077
Epoch 3, Loss: 0.8534
Epoch 4, Loss: 0.6127
Epoch 5, Loss: 0.4610
Epoch 6, Loss: 0.3590
Epoch 7, Loss: 0.2865
Epoch 8, Loss: 0.2326
Epoch 9, Loss: 0.1919
Epoch 10, Loss: 0.1603
Epoch 11, Loss: 0.1350
Epoch 12, Loss: 0.1152
Epoch 13, Loss: 0.0988
Epoch 14, Loss: 0.0854
Epoch 15, Loss: 0.0743
Similar to 'उसके': ['असर', 'ज़मीन', 'उसे', 'स्टोरीटेलिंग', '‘कांतारा’']


In [37]:
print("\nKannada (GloVe):")
glove_embeddings_kn = run_glove(kn_tokens, kn_word2idx, "Kannada")
test_word_kn_glove = "ಅದು"
print(f"Similar to '{test_word_kn_glove}':", get_similar_words(test_word_kn_glove, glove_embeddings_kn, kn_word2idx, kn_idx2word))


Kannada (GloVe):
Building Co-occurrence Matrix (Window: 2):
Starting Kannada Training (3621 pairs):
Epoch 1, Loss: 2.7406
Epoch 2, Loss: 1.0132
Epoch 3, Loss: 0.5501
Epoch 4, Loss: 0.3557
Epoch 5, Loss: 0.2213
Epoch 6, Loss: 0.1584
Epoch 7, Loss: 0.1163
Epoch 8, Loss: 0.1153
Epoch 9, Loss: 0.0903
Epoch 10, Loss: 0.0522
Epoch 11, Loss: 0.0389
Epoch 12, Loss: 0.0304
Epoch 13, Loss: 0.0238
Epoch 14, Loss: 0.0194
Epoch 15, Loss: 0.0166
Similar to 'ಅದು': ['ದೃಶ್ಯ', 'ಹೆಗಲ', 'ಸಾಕ್ಷಾತ್ಕರಿಸುತ್ತವೆ', 'ಕಾಯುತ್ತಿದ್ದ', 'ಪ್ರಿಕ್ವೆಲ್']


## FastText

FastText extends the Skip-gram and CBOW models by representing words as bags of character n-grams rather than atomic units. This fundamental shift allows the model to generate embeddings for previously unseen words (Out-Of-Vocabulary or OOV) and capture morphological relationships between related terms.

Traditional word embedding models treat each word as an indivisible token. FastText breaks words into character n-grams, enabling it to understand word structure and meaning at a granular level.

Consider the word "running":

- 3-grams: `<ru`, `run`, `unn`, `nni`, `nin`, `ing`, `ng>`

- 4-grams: `<run`, `runn`, `unni`, `nnin`, `ning`, `ing>`

- 5-grams: `<runn`, `runni`, `unnin`, `nning`, `ning>`

The angle brackets indicate word boundaries, helping the model distinguish between subwords that appear at different positions.

#### Understanding min_n and max_n:

These two parameters control the granularity of the subword generation:

- min_n (Minimum n-gram length): The smallest subword size to capture. For example, min_n=2 captures short syllables like ru or ng, which is crucial for morphologically rich Indic languages.

- max_n (Maximum n-gram length): The largest subword size. max_n=5 would capture larger roots like runni.

The model learns vectors for all n-grams between these lengths and sums them to create the final word vector.
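
The n-gram lists above can be reproduced with a few lines of Python. This is a simplified sketch: the helper `char_ngrams` is written for this illustration, and it leaves out details of the real FastText implementation such as hashing n-grams into buckets.

```python
def char_ngrams(word, min_n, max_n):
    """Return all character n-grams of '<word>' with lengths min_n..max_n."""
    marked = f"<{word}>"  # angle brackets mark the word boundaries
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(marked) - n + 1):
            grams.append(marked[start:start + n])
    return grams

print(char_ngrams("running", 3, 3))
# ['<ru', 'run', 'unn', 'nni', 'nin', 'ing', 'ng>']
```

Python strings are indexed by Unicode code point, so for Hindi or Kannada words a vowel sign counts as its own character in this sketch.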

#### Efficiency of Hierarchical Softmax:

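FastText can use hierarchical softmax instead of a full softmax for computational efficiency. Rather than computing probabilities across all vocabulary words, it builds a binary tree in which each leaf represents a word and each internal node represents a binary decision; a word's probability is the product of the decisions along its path from the root. Note that gensim's `FastText` defaults to negative sampling (`hs=0`, `negative=5`), so the model trained below uses hierarchical softmax only if `hs=1` is passed explicitly.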

Key advantages of hierarchical softmax:

- Reduces the cost of each prediction from O(V) to O(log V), where V is the vocabulary size
- Uses Huffman coding so that frequent words get shorter paths in the tree
- Typically keeps accuracy close to the full softmax while significantly improving training speed
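
The effect of Huffman coding on path length can be sketched with the standard library alone. The function `huffman_code_lengths` and the toy frequencies below are assumptions made for this example; they are not taken from the lab corpus or from the FastText source.

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Return {word: path length} for a Huffman tree built from word frequencies."""
    tie = count()  # unique tie-breaker so the heap never has to compare dicts
    heap = [(freq, next(tie), {word: 0}) for word, freq in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every word in them one level deeper
        merged = {w: depth + 1 for w, depth in {**left, **right}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

toy_freqs = {"the": 50, "river": 20, "bank": 15, "daiva": 10, "guliga": 5}
lengths = huffman_code_lengths(toy_freqs)
for word, length in sorted(lengths.items(), key=lambda item: item[1]):
    print(word, length)
```

In this toy vocabulary the most frequent word sits one decision away from the root, while the two rarest words need four decisions each, which is why frequent words are cheap to score.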

Note: In this exercise, we will not be implementing the mathematical FastText training algorithm from scratch. Instead, we will utilize the gensim module to visualize how FastText aggregates these subword embeddings to generate meaningful vectors for words it has never seen before.


In [38]:
from gensim.models import FastText

def run_fasttext(language_name, sentences_tokens, target_oov_word):
    """
    TODO:
    Train a FastText model to generate embeddings for an Out-Of-Vocabulary (OOV) word.

    Steps:
    1. Define hyperparameters (Vector size, Window, Epochs, n-grams).
    2. Initialize the FastText model using gensim.
    3. Build the vocabulary from the training corpus.
    4. Train the model on the corpus.
    5. Verify that the target word generates a vector despite not being in the vocab.
    """

    print(f"Target OOV Word: '{target_oov_word}'")

    #create training data EXCLUDING the target word - remove the word to prove FastText can generate vectors for unseen words
    training_corpus = []
    removed_count = 0

    for sentence in sentences_tokens:
        clean_sent = [word for word in sentence if word != target_oov_word]
        # Count every removed occurrence, not just every affected sentence
        removed_count += len(sentence) - len(clean_sent)
        if clean_sent:
            training_corpus.append(clean_sent)

    print(f"Original Sentence Count: {len(sentences_tokens)}")
    print(f"Sentences processed: {len(training_corpus)}")
    print(f"Occurrences of '{target_oov_word}' removed: {removed_count}")

    if len(training_corpus) < 1:
        print("Error: Not enough data to train.")
        return

    # Hyperparameter tuning
    # TODO: Tune these values to observe the effect on vector quality

    vector_size = 100
    window = 5
    min_count = 1
    epochs = 50
    min_n = 2
    max_n = 6

    # Implementation using gensim's FastText
    # TODO: Initialize, Build Vocab, and Train the FastText model

    model = None

    # --- YOUR CODE STARTS HERE ---

    model = FastText(vector_size=vector_size, window=window, min_count=min_count, min_n=min_n, max_n=max_n)
    model.build_vocab(corpus_iterable=training_corpus)
    model.train(corpus_iterable=training_corpus, total_examples=len(training_corpus), epochs=epochs)

    # --- YOUR CODE ENDS HERE ---

    if model is None:
        print("Model not initialized.")
        return

    is_in_vocab = target_oov_word in model.wv.key_to_index
    print(f"Is '{target_oov_word}' in vocabulary? -> {is_in_vocab}")

    if is_in_vocab:
        print("FAILURE: The word leaked into the training data!")
    else:
        print("SUCCESS: The word is truly unknown to the model.")

    # FastText constructs this vector from the n-grams (subwords) it learned.
    vector = model.wv[target_oov_word]
    print(f"Generated Vector (First 10 dims): {vector[:10]}...")

    # Semantic check - for most similar words
    try:
        similar = model.wv.most_similar(target_oov_word, topn=3)
        print(f"Most similar words to '{target_oov_word}':")
        for word, score in similar:
            print(f"  - {word}: {score:.4f}")
    except Exception:
        print("  (Not enough data to find neighbors)")
    print("\n")

# Run for Hindi by testing OOV capability on the word 'भार'
run_fasttext("Hindi", hi_tokens, "भार")

Target OOV Word: 'भार'
Original Sentence Count: 73
Sentences processed: 73
Occurrences of 'भार' removed: 0
Is 'भार' in vocabulary? -> False
SUCCESS: The word is truly unknown to the model.
Generated Vector (First 10 dims): [ 0.04468874  0.315929   -0.10713156  0.709043    0.5696158  -0.63706076
 -0.430622    1.2000194  -0.15808907  0.2307816 ]...
Most similar words to 'भार':
  - परतदार: 1.0000
  - पार: 1.0000
  - बार: 1.0000




In [39]:
# Run for Kannada by testing OOV capability on the word 'ಬೆಂಗಳೂರ'
run_fasttext("Kannada", kn_tokens, "ಬೆಂಗಳೂರ")

Target OOV Word: 'ಬೆಂಗಳೂರ'
Original Sentence Count: 122
Sentences processed: 122
Occurrences of 'ಬೆಂಗಳೂರ' removed: 0
Is 'ಬೆಂಗಳೂರ' in vocabulary? -> False
SUCCESS: The word is truly unknown to the model.
Generated Vector (First 10 dims): [ 1.5165162e-02 -1.3181415e-01 -1.3346118e-02  1.1572949e-01
  1.4321609e-04  1.7812507e-02  2.4328345e-01 -7.1489297e-02
 -8.2327448e-02 -1.2490988e-01]...
Most similar words to 'ಬೆಂಗಳೂರ':
  - ಬೆಂಗಳೂರಿನ: 1.0000
  - ಬೆಂಗಳೂರು: 1.0000
  - ವಿಲನ್‌ಗಳಾಗಿ: 1.0000




## BERT Based - Contextualized Embeddings

**Note:** In this section, we will utilize the IndicBERT model specifically to generate and visualize contextualized embeddings. We are not performing any downstream tasks like classification or fine-tuning here; our goal is simply to inspect the high-dimensional vector representations the model produces for Hindi and Kannada text.

BERT (Bidirectional Encoder Representations from Transformers) differs from static models like GloVe or FastText by producing dynamic embeddings. The vector for a word changes based on its context (the words around it).

If you are asked to grant access to HF_TOKEN while running this subtask, click "Grant Access".


In [53]:
from huggingface_hub import login
from google.colab import userdata

token = userdata.get('HF_TOKEN')
login(token=token)

In [54]:
from huggingface_hub import HfApi
api = HfApi()
try:
    user_info = api.whoami()
    print(f"✅ Success! Logged in as: {user_info['name']}")
except Exception as e:
    print("❌ Login failed. Check your token and Secret toggle.")

✅ Success! Logged in as: preranamp2005


In [55]:
from transformers import AutoModel, AutoTokenizer
import torch

# 1. Load IndicBERT
model_name = "ai4bharat/IndicBERTv2-MLM-only"
print(f"Loading {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def get_sentence_embedding(text):
    """
    TODO:
    Generate a single dense vector for an input sentence using BERT.

    Steps:
    1. Tokenize the input text. Ensure padding and truncation are enabled, and return PyTorch tensors ('pt').
    2. Pass the inputs through the model to get the outputs.
       HINT: Use 'torch.no_grad()' context manager to ensure no gradients are calculated (saves memory).
    3. Extract the 'last_hidden_state' from the model outputs. This represents the raw token-level embeddings.
       (Shape: Batch_Size, Sequence_Length, Hidden_Dim)
    4. Perform Mean Pooling: Calculate the average of the token vectors across the sequence dimension (dim=1) to get a single sentence vector.
    """

    # 1. Tokenize input text
    inputs = tokenizer(text, padding=True, truncation=True, return_tensors='pt')

    # 2. Pass inputs through model (no_grad to save memory)
    with torch.no_grad():
        outputs = model(**inputs)

    # 3. Extract last_hidden_state
    raw_embeddings = outputs.last_hidden_state # Shape: (1, seq_len, 768)

    # 4. Perform Mean Pooling
    sentence_vector = torch.mean(raw_embeddings, dim=1) # Shape: (1, 768)

    return sentence_vector, raw_embeddings

Loading ai4bharat/IndicBERTv2-MLM-only...


BertModel LOAD REPORT from: ai4bharat/IndicBERTv2-MLM-only
Key                                        | Status     |  | 
-------------------------------------------+------------+--+-
cls.predictions.transform.dense.bias       | UNEXPECTED |  | 
cls.seq_relationship.bias                  | UNEXPECTED |  | 
bert.embeddings.position_ids               | UNEXPECTED |  | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED |  | 
cls.predictions.transform.dense.weight     | UNEXPECTED |  | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED |  | 
cls.seq_relationship.weight                | UNEXPECTED |  | 
cls.predictions.decoder.weight             | UNEXPECTED |  | 
cls.predictions.decoder.bias               | UNEXPECTED |  | 
cls.predictions.bias                       | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
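
`get_sentence_embedding` averages over every token position, which is fine for a single sentence. If several sentences of different lengths were batched with `padding=True`, the padded positions would also enter the mean. A common remedy is to average only over positions where the tokenizer's `attention_mask` is 1. The sketch below shows the idea in plain Python on nested lists; the function `masked_mean_pool` and the toy numbers are assumptions for illustration, and in the notebook the same computation would be done with tensor operations.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping positions whose mask value is 0."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            for d in range(dim):
                total[d] += vector[d]
            kept += 1
    # max(kept, 1) guards against an all-padding input
    return [value / max(kept, 1) for value in total]

# Two real tokens followed by one padding position
toy_tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]
print(masked_mean_pool(toy_tokens, toy_mask))
```

With the mask applied, the padding vector `[100.0, 100.0]` no longer distorts the sentence vector.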


In [56]:
text_hi = "बैंक नदी के किनारे है"
sent_vec_hi, raw_hi = get_sentence_embedding(text_hi)

print(f"Input Hindi: '{text_hi}'")
print(f"Token-level Embeddings Shape: {raw_hi.shape} (Batch, Tokens, Dim)")
print(f"Final Sentence Vector Shape:  {sent_vec_hi.shape} (Batch, Dim)")
print(f"Hindi Vector: {sent_vec_hi[0]}")

Input Hindi: 'बैंक नदी के किनारे है'
Token-level Embeddings Shape: torch.Size([1, 7, 768]) (Batch, Tokens, Dim)
Final Sentence Vector Shape:  torch.Size([1, 768]) (Batch, Dim)
Hindi Vector: tensor([-3.6153e-04, -9.2446e-02,  1.1137e-02,  6.0685e-02,  3.6829e-02,
        -1.6569e-01, -1.0330e-01, -3.2641e-02,  2.3512e-02,  3.2311e-02,
        -4.1213e-02, -2.2319e-02,  3.7742e-02,  7.3300e-02, -1.5746e-02,
        -3.1520e-02, -9.5638e-02, -3.8164e-02, -4.6424e-03,  2.4899e-01,
         1.6728e-01, -8.5957e-02,  2.7910e-02,  7.0017e-04,  1.6346e-01,
        -1.0854e-01, -4.9784e-02, -7.1044e-02,  1.4839e-01,  2.6561e-02,
         6.5278e-03, -2.1694e-02,  2.1538e-01, -4.7706e-02, -4.7297e-02,
        -2.3620e-01, -9.9296e-02,  2.1077e-01,  2.8460e-02, -7.5039e-02,
         1.1843e-01, -3.0535e-02, -1.3885e-02, -2.9177e-01, -6.5315e-02,
        -1.6293e-01, -2.7968e-02,  5.3277e-02,  1.5835e-01, -1.5898e-01,
         5.4970e-02, -1.9136e-02, -2.5600e-02,  3.2297e-01, -9.8974e-02,
       

In [57]:
text_kn = "ಬೆಂಗಳೂರು ಕರ್ನಾಟಕದ ರಾಜಧಾನಿ"
sent_vec_kn, raw_kn = get_sentence_embedding(text_kn)

print(f"\nInput Kannada: '{text_kn}'")
print(f"Token-level Embeddings Shape: {raw_kn.shape} (Batch, Tokens, Dim)")
print(f"Final Sentence Vector Shape:  {sent_vec_kn.shape}")
print(f"Kannada Vector: {sent_vec_kn[0]}")


Input Kannada: 'ಬೆಂಗಳೂರು ಕರ್ನಾಟಕದ ರಾಜಧಾನಿ'
Token-level Embeddings Shape: torch.Size([1, 5, 768]) (Batch, Tokens, Dim)
Final Sentence Vector Shape:  torch.Size([1, 768])
Kannada Vector: tensor([ 1.6938e-02, -7.0767e-02, -2.1789e-02, -1.3382e-01, -4.8083e-02,
        -1.0159e-01,  1.1649e-01,  1.8540e-01,  1.1519e-01,  8.8466e-02,
         3.6696e-02, -7.4599e-02, -8.6353e-02,  5.3005e-02,  4.1776e-02,
        -1.0148e-01,  6.7454e-03,  1.2830e-01, -9.4740e-02,  6.3213e-02,
         7.8564e-02, -1.4309e-01,  2.1462e-02,  3.8186e-02,  1.1200e-01,
         4.4798e-02,  4.6691e-02, -1.7362e-01,  6.2766e-01, -2.1079e-02,
        -1.8509e-01, -5.6519e-02,  1.7626e-01,  1.0224e-01, -3.9882e-02,
        -1.8685e-01,  2.1846e-02,  7.7390e-02, -6.7398e-02,  1.4881e-01,
         6.0286e-02,  8.5481e-02, -4.0085e-02, -2.4488e-01, -8.8397e-02,
         1.1578e-01, -1.5053e-02, -5.8249e-02,  1.3288e-01, -1.0979e-01,
         4.1129e-02, -1.4868e-01, -1.1567e-01, -1.2480e-02, -2.4694e-01,
        -5.

# NLP Tasks on Indian Languages

In this module, we will move beyond simple text processing and implement four high-level NLP tasks that power modern AI applications.

### 1. Machine Translation (MT)
**Goal:** Automatically convert text from a source language (e.g., Kannada) to a target language (e.g., English).

**Real-World Application:** Breaking language barriers in global business, social media content moderation (e.g., translating regional posts for policy review), and cross-border e-commerce.

**Our Lab:** We compare Google Translate (a closed-source production NMT service) vs. NLLB-200 (an open-source neural model) to translate cultural reviews of the movie *Kantara Chapter 1*.


### 2. Text Summarization
**Goal:** Compress a long document into a short, fluent summary while retaining key information.

**Real-World Application:** Generating news snippets for apps (like InShorts), summarizing legal contracts, or creating "TL;DR" versions of research papers.

**Our Lab:** We test Abstractive models (which write new sentences like a human) against Extractive models (which function like a highlighter).


### 3. Zero-Shot Text Classification
**Goal:** Classify text into categories the model has never seen before during training.

**Real-World Application:** Organizing customer support tickets on the fly (e.g., tagging a complaint as "Billing" or "Technical" without training a specific model for it) or content tagging for dynamic recommendation systems.

**Our Lab:** We use a model trained on logical inference to categorize movie reviews into genres like Action, Folklore, or Romance without any prior training on those specific labels.


### 4. Extractive Question Answering (QA)
**Goal:** Retrieve precise answers to user questions from a given context or document.

**Real-World Application:** Search engine "featured snippets" (e.g., Google highlighting the answer at the top of search results), automated chatbots that scan manuals to answer user queries, and digital assistants (Siri/Alexa).

**Our Lab:** We build a cross-lingual system that can answer English questions (e.g., "Who is the director?") based on a Hindi news report.


## PHASE 0: SETUP

In [58]:
!pip install transformers sentencepiece protobuf accelerate huggingface_hub deep-translator sumy nltk -q

In [59]:
import warnings
warnings.filterwarnings('ignore')

import os
import gc
import torch
import textwrap
import nltk
from transformers import pipeline
from huggingface_hub import notebook_login
from deep_translator import GoogleTranslator

# Sumy Imports (For Extractive Summarization)
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')
try:
    nltk.data.find('tokenizers/punkt_tab')
except LookupError:
    nltk.download('punkt_tab')

# **Authentication Required for HuggingFace**

When prompted by the Hugging Face login popup, paste your read-only access token and click "Login". You can disable the checkbox for "Add token as git credential?".

This step authenticates your session with the Hugging Face Hub and is required to download and use the models for this lab.


In [60]:
notebook_login()

# Utility Functions for the Lab (DO NOT MODIFY)

In [61]:
def get_device():
    return 0 if torch.cuda.is_available() else -1

# utility to clear model cache in GPU before loading another model
def flush_memory():
    gc.collect()
    torch.cuda.empty_cache()
    print("GPU Memory Flushed.")

def load_corpus(filename):
    if not os.path.exists(filename):
        print(f"Error: {filename} not found.")
        return ""
    with open(filename, 'r', encoding='utf-8') as f:
        return f.read()

def pretty_print(title, text):
    print(f"\n--- {title} ---")
    print(textwrap.fill(text, width=100))

# Load Data
hindi_text = load_corpus('hindi_cleaned.txt')
kannada_text = load_corpus('kannada_cleaned.txt')
device_id = get_device()

print(f"Setup Complete. Using {'GPU' if device_id==0 else 'CPU'}.")

Setup Complete. Using GPU.


# Task 1: Machine Translation

### The Task
Our goal is to translate a culturally dense review of the movie *Kantara* from **Kannada** to **English**. This is a stress test for translation systems because the text contains:
* **Complex Morphology:** Kannada is an agglutinative language (words are formed by stringing together morphemes).
* **Cultural Nuance:** Terms like *"Daiva"* and *"Guliga"* require context to translate correctly, not just dictionary lookups.


#### 1. Google Translate as a utility
* **Type:** Production-Grade Neural Machine Translation (NMT).
* **Mechanism:** Accessed via the `deep-translator` library. It uses a massive, closed-source model optimized for factual accuracy and speed.
* **Strength:** Strong handling of **named entities**. It usually preserves proper nouns (like character names) better than raw open-source models.

#### 2. LLM Used: NLLB-200 (No Language Left Behind)
* **Model ID:** `facebook/nllb-200-1.3B`
* **Developer:** Meta AI (FAIR)
* **Architecture:** Sequence-to-Sequence (Seq2Seq) Transformer
    * **Type:** Full Encoder-Decoder Transformer (based on the original "Attention Is All You Need" architecture).
    * **Components:** This 1.3B parameter variant typically consists of **24 Encoder Layers** and **24 Decoder Layers**.
        * **Encoder:** Reads the Kannada text and turns it into a sequence of contextual vectors, one per input token (a numerical representation of the meaning).
        * **Decoder:** Attends to those encoder vectors and generates English text token-by-token.
    * **Dense Model:** Unlike its larger 54B sibling (which uses Mixture-of-Experts), this 1.3B version is **dense**, meaning every input token passes through all the parameters of each layer rather than being routed to a subset of experts.
    * **Tokenization:** Uses a shared **SentencePiece** model that can handle 200+ languages, allowing it to translate between rare language pairs directly.

### Implementation Strategy: Chunking
Even with a billion-parameter model, feeding 2000 characters at once is risky. We implement a **Chunking Wrapper**:
1. **Split:** Divide the paragraph into individual sentences (using `split('. ')`).
2. **Translate:** Pass each sentence through the Encoder-Decoder separately.
3. **Merge:** Recombine the English outputs.

This mimics real-world production pipelines, ensuring high stability and preventing the model's context window from overflowing.
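
The split, translate, and merge steps can be sketched without loading a model. The snippet below is a minimal illustration, assuming a hypothetical `translate_fn` callable stands in for the NLLB call; `translate_in_chunks` and `fake_translate` are illustrative names and not part of this notebook. It also trims sentence-final periods before rejoining, so that each sentence ends with exactly one.

```python
def translate_in_chunks(text, translate_fn, sep=". "):
    """Split text on `sep`, translate each piece, and rejoin the results."""
    pieces = [p.strip() for p in text.split(sep) if p.strip()]
    translated = []
    for piece in pieces:
        out = translate_fn(piece).strip()
        # Drop any trailing period so the join below adds exactly one.
        translated.append(out.rstrip("."))
    return sep.join(translated) + "." if translated else ""


def fake_translate(sentence):
    # Hypothetical stand-in for a real model call: upper-cases the input
    # and appends a period, the way a model often re-adds punctuation.
    return sentence.upper() + "."


result = translate_in_chunks("first sentence. second sentence. third one.", fake_translate)
print(result)
```

Passing the translator in as a function keeps the chunking logic testable on its own and lets the same wrapper be reused with a different backend.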

In [62]:
flush_memory()

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Translation utility
def standard_tool_translate(text, target='en'):
    return GoogleTranslator(source='auto', target=target).translate(text[:4000])

model_name = "facebook/nllb-200-1.3B"

print(f"Loading {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

if device_id == 0:
    model = model.to("cuda")

def nllb_translate_chunked(text, src_lang, tgt_lang, chunk_size=512):
    """
    Professional Translation Wrapper:
    Splits text into chunks to prevent the 'Repetition Loop' bug seen in smaller models.

    TODO:
    Steps:
    1. Split the input text into manageable sentences or chunks.
    2. For each chunk:
        - Tokenize and prepare inputs for the model.
        - Generate translation using the model.
    3. Collect and concatenate the translated chunks to form the final output.

    """
    sentences = text.split('. ')
    translated_chunks = []

    tokenizer.src_lang = src_lang
    forced_bos_token_id = tokenizer.convert_tokens_to_ids(tgt_lang)

    # YOUR CODE HERE
    for sentence in sentences:
        if not sentence.strip():
            continue

        # Tokenize and prepare inputs for the model
        inputs = tokenizer(sentence, return_tensors="pt").to(model.device)

        # Generate translation using the model
        with torch.no_grad():
            translated_tokens = model.generate(
                **inputs,
                forced_bos_token_id=forced_bos_token_id,
                max_length=chunk_size
            )

        # Collect and concatenate the translated chunks
        translation = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
        translated_chunks.append(translation)

    return ". ".join(translated_chunks)

kannada_snippet = kannada_text[:2000]

print("Running Tool (Google Translate)...")
tool_output = standard_tool_translate(kannada_snippet)

print("Running Model (NLLB 1.3B)...")
# Codes: kan_Knda (Kannada), eng_Latn (English)
model_output = nllb_translate_chunked(kannada_snippet, src_lang="kan_Knda", tgt_lang="eng_Latn")

GPU Memory Flushed.
Loading facebook/nllb-200-1.3B...


Running Tool (Google Translate)...
Running Model (NLLB 1.3B)...


In [63]:
pretty_print("ORIGINAL KANNADA", kannada_snippet)
pretty_print("A. Using Google Translate", tool_output)
pretty_print("B. Using NLLB 1.3B", model_output)

flush_memory()


--- ORIGINAL KANNADA ---
ಕಾಂತಾರ ಚಾಪ್ಟರ್‌ 1 ಬರೀ ಸಿನಿಮಾ ( 1) ಅಲ್ಲ, ಅದೊಂದು ಅನುಭೂತಿ. ಅದನ್ನು ಕಥೆಗಾಗಿ ನೋಡಬಾರದು, ಅದು ನೀಡುವ
ರೋಮಾಂಚನಕ್ಕಾಗಿ ನೋಡಬೇಕು. ಅದು ನೀಡುವ ದೃಶ್ಯವೈಭವ, ಸೀಟಿನ ತುದಿಗೆ ತಂದು ಕೂರಿಸುವ ಥ್ರಿಲ್ಲಿಂಗ್‌ ಹೊಡೆದಾಟದ
ದೃಶ್ಯಗಳು, ಕಾಡಿನ ಮರೆಯಲ್ಲಿ ಹುದುಗಿಕೊಂಡ ರಹಸ್ಯಗಳು, ತುಳುನಾಡಿನ ದೈವಗಳು ಮತ್ತು ಅದರಿಂದ ಕಾಯಲ್ಪಡುವ ಮನುಷ್ಯಲೋಕದ ಆಟ-
ಹೋರಾಟಗಳ ಭಾವುಕ- ರಮ್ಯ ಲೋಕದ ಚಿತ್ರಣಕ್ಕಾಗಿ ಇದನ್ನು ನೋಡಬೇಕು. ದೊಡ್ಡ ತೆರೆಯಲ್ಲಿ ನೋಡಿದರೆ ಮಾತ್ರವೇ ಈ ದೃಶ್ಯ ವೈಭವದ
ನೈಜ ಸಾಕ್ಷಾತ್ಕಾರ ಸಾಧ್ಯ. ಕಥೆಯಲ್ಲಿ ಹೊಸತೇನಿಲ್ಲ. ಅದು ಒಳಿತು ಕೆಡುಕುಗಳ ಸಮರ. ಒಳಿತಿನ ವಿಜಯ. ಕೇಡಿನ ಶಕ್ತಿಗಳ
ವಿರುದ್ಧ ಸಜ್ಜನರ, ಮನುಷ್ಯರು ನಂಬಿದ ದೈವಗಳ ವಿಜಯ. ಆರಂಭದಲ್ಲಿಯೇ ತುಳುನಾಡಿಗೆ ಕೈಲಾಸದಿಂದ ಅವತರಿಸುವ ಶಿವಗಣಗಳು
ದೈವವಾಗಿ ನಾಡನ್ನು ಕಾಯುತ್ತವೆ. ಇದರ ನಡುವೆಯೂ ದೈವವನ್ನು ಬಂಧಿಸಲು ಯತ್ನಿಸುವ ದುರ್ಜನರು ಇದ್ದಾರೆ. ಈ ದುರ್ಜನರನ್ನು
ಮಟ್ಟಹಾಕಲು ಮನುಷ್ಯಶಕ್ತಿಯೂ ದೈವಶಕ್ತಿಯೂ ಕೈ ಜೋಡಿಸಬೇಕಾಗುತ್ತದೆ. ರಿಷಬ್‌ ಶೆಟ್ಟಿಯ ಬೆರ್ಮೆ ಪಾತ್ರದಲ್ಲಿ ಇವೆರಡೂ
ಜೋಡಿಯಾಗಿ ನಮ್ಮನ್ನು ರೋಮಾಂಚಿತಗೊಳಿಸುತ್ತವೆ. ತುಳುನಾಡನ್ನು ಬಂಗ್ರ ಅರಸರು ಆಳುತ್ತಿದ್ದಾರೆ. ಬಂಗ್ರದ ಅರಸರ ಹೊಸ
ರಾಜಕುಮಾರನಿಗೂ ಕಾಂತಾರದ ಕಾನನ ನಿವಾಸಿಗಳಿಗೂ ಇಕ್ಕಟ್ಟು ಬಿಕ್ಕಟ್ಟುಗಳು ತಲೆದೋರುತ್ತವೆ. ಇದನ್ನು ಪರಿಹರಿಸಲು ಕಾಂತಾರ
ನಿವಾಸಿಗಳು ತಮಗೆ ಪ

# Task 2: Text Summarization

### The Task
We are condensing the detailed English translation of the *Kantara Chapter 1* review into short, digestible insights. In NLP, summarization is categorized into two distinct paradigms, both of which we implement here:

1.  **Extractive Summarization:** The model selects the most important *existing* sentences from the text and stitches them together. It never generates new words.

2.  **Abstractive Summarization:** The model understands the core meaning and generates completely new sentences to represent it.

#### 1. Sumy (LexRank) as a summarization utility
* **Algorithm:** **LexRank** (Unsupervised Graph-Based Algorithm).
* **LexRank Mechanism:** LexRank operates by constructing a graph where each sentence serves as a node. It calculates similarity edges between these nodes based on shared vocabulary (using cosine similarity of TF-IDF vectors). By applying an eigenvector centrality algorithm akin to Google's PageRank, it identifies the most "connected" or central sentences. The system then selects the sentences with the highest centrality scores to form the extractive summary.
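
The mechanism above can be sketched in plain Python. This is a simplified illustration written for this explanation, not the `sumy` implementation, which may differ in details such as similarity thresholding; the function name `lexrank_scores` and the toy sentences are assumptions.

```python
import math
import re
from collections import Counter


def lexrank_scores(sentences, damping=0.85, iterations=50):
    """Score sentences by centrality in a TF-IDF cosine-similarity graph."""
    docs = [re.findall(r"\w+", s.lower()) for s in sentences]
    n = len(docs)
    # Inverse document frequency, treating each sentence as a document.
    df = Counter(word for doc in docs for word in set(doc))
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}
    vectors = [{w: tf * idf[w] for w, tf in Counter(doc).items()} for doc in docs]

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a if w in b)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    # Edge weights between distinct sentences (no self-loops).
    sim = [[cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)] for i in range(n)]
    row_sums = [sum(row) or 1.0 for row in sim]

    # Power iteration, as in PageRank.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[j] * sim[j][i] / row_sums[j] for j in range(n))
            for i in range(n)
        ]
    return scores


sentences = [
    "The film shows the gods of the forest.",
    "The gods guard the forest and the people.",
    "The people of the forest trust the gods.",
    "Tickets were expensive.",
]
scores = lexrank_scores(sentences)
best = max(range(len(sentences)), key=scores.__getitem__)
print(sentences[best])
```

The last sentence shares no vocabulary with the others, so it has no edges and receives only the baseline score; an extractive summary would be drawn from the three connected sentences.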

#### 2. LLM A: BART as an Extreme summarizer
* **Model ID:** `facebook/bart-large-xsum`
* **Architecture:** **BART (Bidirectional and Auto-Regressive Transformer)**.
    * **Type:** Full **Encoder-Decoder** Transformer.
    * **Encoders/Decoders:** The "Large" version typically has **12 Encoder layers** and **12 Decoder layers**.
    * **Mechanism:**
        * **Encoder (Bidirectional):** Reads the entire input text at once (like BERT) to understand deep context.
        * **Decoder (Auto-Regressive):** Generates the summary word-by-word (like GPT), attending to the Encoder's output.
* **Training Objective:** BART is trained as a "Denoising Autoencoder." It is fed corrupted text (spans masked, tokens deleted, or sentences shuffled) and trained to reconstruct the original clean text. This objective suits tasks that require rewriting text, such as summarization.
* **Finetuning:** This specific variant is fine-tuned on the **XSum dataset** (BBC news articles $\rightarrow$ one-sentence summaries). It is optimized for **Extreme Summarization**, i.e. condensing an article into a single sentence.
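
To make the denoising objective concrete, the sketch below builds one (corrupted input, clean target) pair by replacing a span of tokens with a single `<mask>` token. It is a toy illustration under assumed names (`corrupt`, `span_length`) and covers only one of the corruption types; it is not BART's actual pre-training code.

```python
import random


def corrupt(tokens, span_length=2, seed=0):
    """Replace one contiguous span of tokens with a single <mask> token."""
    # Assumes len(tokens) >= span_length.
    rng = random.Random(seed)
    start = rng.randrange(0, len(tokens) - span_length + 1)
    return tokens[:start] + ["<mask>"] + tokens[start + span_length:]


original = "the forest hides many secrets from the king".split()
noisy = corrupt(original)
print("input to encoder  :", " ".join(noisy))
print("target for decoder:", " ".join(original))
```

Because the mask hides how many tokens were removed, the decoder has to decide both what is missing and how long it is.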

#### 3. LLM B: DistilBART as a Descriptive summarizer
* **Model ID:** `sshleifer/distilbart-cnn-12-6`
* **Technique:** **Knowledge Distillation**.
* **Why Distil?** The original BART-Large model is big (roughly 400 million parameters, about 1.6 GB of weights in 32-bit precision). This model is a "Student" that was trained to mimic the "Teacher" (BART-Large) but is smaller and faster at generation.
* **Architecture:**
    * **12 Encoder Layers:** (Same depth as teacher to capture full context).
    * **6 Decoder Layers:** (Half the depth of teacher to speed up generation).
* **Finetuning:** This variant is fine-tuned on the **CNN/DailyMail dataset**. Unlike XSum, this dataset favors **Descriptive Summaries** (multi-sentence highlights that cover the key points of an article).

### Implementation Note: Memory Management
Because we are running two heavy Transformers (BART and DistilBART) in a single session, we implement a strict **Load-Generate-Delete** cycle. We load one model into the GPU, generate the summary, and immediately delete it from VRAM (`del model`, `gc.collect()`) before loading the next. This prevents **Out-Of-Memory (OOM)** crashes on standard Colab instances.
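
The Load-Generate-Delete cycle can be expressed as a small helper. The sketch below is an assumption-laden illustration: `run_and_release` and `TinyModel` are hypothetical names, the `torch` import is optional, and this is not the notebook's `flush_memory` implementation.

```python
import gc


def run_and_release(load, use):
    """Load a model, apply `use` to it, then drop it before returning."""
    model = load()
    try:
        return use(model)
    finally:
        # Remove the only reference so the weights can be reclaimed.
        del model
        gc.collect()
        try:
            import torch  # optional: only needed to release cached GPU memory

            if torch.cuda.is_available():
                torch.cuda.empty_cache()
        except ImportError:
            pass


class TinyModel:
    # Hypothetical stand-in for a summarization model.
    def summarize(self, text):
        return text.split(".")[0] + "."


summary = run_and_release(TinyModel, lambda m: m.summarize("Good wins. Evil loses."))
print(summary)
```

Keeping the model reference local to the helper means only the plain-text result escapes, so nothing outside the function can keep the weights alive by accident.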

In [64]:
flush_memory()

# Abstractive Summarization Wrapper
def generate_abstractive_summary(model_name, text, min_length, max_length):
    """
    Loads a model, generates summary, and immediately clears memory.
    This prevents OOM errors on Colab free tier.

    TODO:

    Steps:
    1. Load the specified model and tokenizer.
    2. Tokenize the input text with appropriate truncation and padding.
    3. Generate the summary using the model's generate function with specified parameters.
    4. Decode the generated summary back to text.
    5. Clear the model and tokenizer from memory and flush GPU cache.
    6. Return the generated summary.
    """
    print(f"Loading {model_name}...")

    # 1. Load the specified model and tokenizer.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to("cuda" if torch.cuda.is_available() else "cpu")

    # 2. Tokenize the input text with appropriate truncation and padding.
    inputs = tokenizer(text, max_length=1024, truncation=True, return_tensors="pt").to(model.device)

    # 3. Generate the summary using the model's generate function.
    summary_ids = model.generate(
        inputs["input_ids"],
        num_beams=4,
        min_length=min_length,
        max_length=max_length,
        early_stopping=True
    )

    # 4. Decode the generated summary back to text.
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    # CLEANUP
    del model
    del tokenizer
    del inputs
    flush_memory()

    return summary

# Extractive Summarization Wrapper
def sumy_extractive_summarizer(text, sentences_count=1):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LexRankSummarizer()
    summary = summarizer(parser.document, sentences_count)
    return " ".join([str(sentence) for sentence in summary])


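# Prefer the NLLB translation; fall back to the Google Translate output
# so that english_input is always defined.
if model_output and len(model_output) > 100:
    english_input = model_output
else:
    english_input = tool_output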


print(f"\n[Input Text({len(english_input)} chars)]: {english_input}...")

GPU Memory Flushed.

[Input Text(2283 chars)]: Kantaara Chapter 1 is not just a cinema, it is an experience. You should not look at it for the story, you should look at the thrill it offers.. It must be seen for the spectacular glory it offers, the thrilling scenes of the thrilling shootout that brings you to the edge of the seat, the secrets embedded in the forest, the gods of Tulu and the human world guarded by it, the playful, emotional, romantic world of the game-of-fights.. Only if seen on the big screen is this scene of glory real realisation possible. There is nothing new in the story. It is the battle of good and evil.. The victory of goodness.. The victory of the good against the forces of evil, the victory of the gods in whom men believe.. Shivgans, who in the beginning incarnate Tulu Nadu from Kailasa, as deities, guard the land.. Despite this, there are still evil people who try to bind the god. It will take human power and divine power to join hands to crush these evil one

In [65]:
print("\n--- 1. Abstractive Extreme Summarization using BART-XSum ---")
summ_abs_extreme = generate_abstractive_summary(
    "facebook/bart-large-xsum",
    english_input,
    min_length=10,
    max_length=40
)
pretty_print("BART-XSum Result (Extreme)", summ_abs_extreme)


--- 1. Abstractive Extreme Summarization using BART-XSum ---
Loading facebook/bart-large-xsum...



GPU Memory Flushed.

--- BART-XSum Result (Extreme) ---
Kantaara Chapter 1 is a thousand-year-old story of a battle between good and evil, set against the
backdrop of the colonial era in the southern Indian state of Karnataka.


In [66]:
print("\n--- 2. Abstractive Descriptive Summarization using (DistilBART) ---")
# DistilBART is designed for descriptive summaries (like news bullets)
summ_abs_desc = generate_abstractive_summary(
    "sshleifer/distilbart-cnn-12-6",
    english_input,
    min_length=50,
    max_length=120
)
pretty_print("DistilBART Result (Descriptive)", summ_abs_desc)


--- 2. Abstractive Descriptive Summarization using (DistilBART) ---
Loading sshleifer/distilbart-cnn-12-6...



Please make sure the generation config includes `forced_bos_token_id=0`. 





GPU Memory Flushed.

--- DistilBART Result (Descriptive) ---
 Kantaara Chapter 1 is not just a cinema, it is an experience . It must be seen for the spectacular
glory it offers, the thrilling scenes of the thrilling shootout that brings you to the edge of the
seat . Only if seen on the big screen is this scene of glory real realisation possible .


In [67]:
print("\n--- 3. Extractive Extreme Summarization using Sumy LexRank ---")
summ_ext_extreme = sumy_extractive_summarizer(english_input, sentences_count=1)
pretty_print("Sumy Result (Extreme)", summ_ext_extreme)


--- 3. Extractive Extreme Summarization using Sumy LexRank ---

--- Sumy Result (Extreme) ---
The new prince of the Kings of Bengra and the inhabitants of Kanana of Kantara face a series of
crises.. To solve this, the inhabitants of Kantara will have to face a new world that is unfamiliar
to them.


In [68]:
print("\n--- 4. Extractive Descriptive Summarization using Sumy LexRank ---")
summ_ext_desc = sumy_extractive_summarizer(english_input, sentences_count=3)
pretty_print("Sumy Result (Descriptive)", summ_ext_desc)

flush_memory()


--- 4. Extractive Descriptive Summarization using Sumy LexRank ---

--- Sumy Result (Descriptive) ---
It must be seen for the spectacular glory it offers, the thrilling scenes of the thrilling shootout
that brings you to the edge of the seat, the secrets embedded in the forest, the gods of Tulu and
the human world guarded by it, the playful, emotional, romantic world of the game-of-fights.. Only
if seen on the big screen is this scene of glory real realisation possible. The victory of the good
against the forces of evil, the victory of the gods in whom men believe.. Shivgans, who in the
beginning incarnate Tulu Nadu from Kailasa, as deities, guard the land.. The new prince of the Kings
of Bengra and the inhabitants of Kanana of Kantara face a series of crises.. To solve this, the
inhabitants of Kantara will have to face a new world that is unfamiliar to them.
GPU Memory Flushed.


# Task 3: Zero-Shot Text Classification

### The Task
In traditional Machine Learning, if you wanted to classify text as "Action" or "Romance," you would need to train a model on thousands of labeled examples for each category.
**Zero-Shot Text Classification** flips this paradigm. We use a pre-trained Large Language Model (LLM) to classify text into **arbitrary labels** that we define on the fly without any specific training on those labels.

In this task, we apply two different label sets to our movie data:
1.  **Creative Labels:** Analyzing the review to detect themes like *Mythology*, *Visual Effects*, *Action*, or *Romance*.
2.  **Business Labels:** Analyzing a production update to detect *Budget*, *Marketing*, or *Legal* issues.

### The Labels
We define these labels dynamically in the code. The model has no specific "Marketing" neuron; instead, it scores how strongly the text entails a sentence built from each label.
* **Creative Labels:** `["Mythology & Folklore", "Visual Effects", "Action", "Romance"]`
* **Business Labels:** `["Budget & Finance", "Marketing", "Legal", "Casting"]`

### The Model: BART-Large-MNLI
* **Model ID:** `facebook/bart-large-mnli`
* **Developer:** Meta AI (FAIR)
* **Architecture:** **BART (Bidirectional and Auto-Regressive Transformer)**.
    * **Type:** Full **Encoder-Decoder** Transformer.
    * **Components:** This "Large" variant consists of **12 Encoder Layers** and **12 Decoder Layers**.
    * **Pre-training:** Trained as a Denoising Autoencoder (reconstructing corrupted text), giving it robust language understanding.

### How does "Zero-Shot Text Classification" work?
This model wasn't trained to classify movie reviews. It was trained on **MNLI (Multi-Genre Natural Language Inference)**.
* **The NLI Task:** In NLI, a model is given two sentences (a **Premise** and a **Hypothesis**) and asked: Does the Premise imply that the Hypothesis is true? (Entailment, Contradiction, or Neutral).
* **The Zero-Shot Trick:** We convert our classification task into an NLI problem:
    * **Premise:** Your input text (e.g., *"The poster was launched in Mumbai"*).
    * **Hypothesis:** We construct a template sentence for each label: *"This text is about **Marketing**."*
    * **Prediction:** The model calculates the probability that the Premise **entails** (implies) the Hypothesis.
    * **Result:** If the Entailment score is high (e.g., 99%), we say the text belongs to the "Marketing" category.

This "NLI Trick" allows us to use any label we want, as long as it can be inserted into a sentence!
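
The conversion from labels to hypotheses and from entailment logits to label scores can be sketched as follows. The logits here are made-up numbers, and `build_hypotheses` and `label_scores` are hypothetical helper names; the sketch assumes the single-label setting, where scores are normalized across the candidate labels and therefore sum to 1.

```python
import math


def build_hypotheses(labels, template="This text is about {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]


def label_scores(entailment_logits):
    """Softmax across labels over each hypothesis's entailment logit."""
    peak = max(entailment_logits)
    exps = [math.exp(x - peak) for x in entailment_logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["Budget & Finance", "Marketing", "Legal", "Casting"]
hypotheses = build_hypotheses(labels)
# Hypothetical entailment logits, one per (premise, hypothesis) pair.
# In practice these come from the NLI model's forward pass.
logits = [2.1, 1.9, 1.4, 0.0]
scores = label_scores(logits)
for label, score in sorted(zip(labels, scores), key=lambda pair: -pair[1]):
    print(f"{label}: {score:.4f}")
```

Because the scores compete with each other in this setting, a text that fits two labels well splits its probability mass between them instead of scoring high on both.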

Since `bart-large-mnli` was trained primarily on English data, it performs poorly on raw Kannada or Hindi text. We therefore classify English translations: the NLLB output from Task 1 for the Kannada review, and a **Google Translate** translation (via `deep_translator`) for the Hindi snippet.

In [69]:
"""
TODO:
Implement Zero-Shot Text Classification using a pre-trained model.
- Load "facebook/bart-large-mnli" MODEL.
- Use translator from NLP task 1 to translate inputs before passing it to the model.
"""

# load the model for zero-shot text classification
zsc = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=device_id)
labels_creative = ["Mythology & Folklore", "Visual Effects", "Action", "Romance"]
res_creative = zsc(model_output, labels_creative)

# Hindi content snippet for business labelling
hindi_biz = "फिल्म का बजट 100 करोड़ है और पोस्टर मुंबई में लॉन्च हुआ।"

hindi_biz_eng = standard_tool_translate(hindi_biz)

labels_business = ["Budget & Finance", "Marketing", "Legal", "Casting"]
res_biz = zsc(hindi_biz_eng, labels_business)

print(f"\n[Creative Analysis (Content translated from Kannada)]: {model_output}...")
for l, s in zip(res_creative['labels'], res_creative['scores']):
    print(f"- {l}: {s:.4f}")



[Creative Analysis (Content translated from Kannada)]: Kantaara Chapter 1 is not just a cinema, it is an experience. You should not look at it for the story, you should look at the thrill it offers.. It must be seen for the spectacular glory it offers, the thrilling scenes of the thrilling shootout that brings you to the edge of the seat, the secrets embedded in the forest, the gods of Tulu and the human world guarded by it, the playful, emotional, romantic world of the game-of-fights.. Only if seen on the big screen is this scene of glory real realisation possible. There is nothing new in the story. It is the battle of good and evil.. The victory of goodness.. The victory of the good against the forces of evil, the victory of the gods in whom men believe.. Shivgans, who in the beginning incarnate Tulu Nadu from Kailasa, as deities, guard the land.. Despite this, there are still evil people who try to bind the god. It will take human power and divine power to join hands to crush these

In [70]:
print(f"\n[Business Analysis (Content translated from Hindi)]: {hindi_biz_eng}")
for l, s in zip(res_biz['labels'], res_biz['scores']):
    print(f"- {l}: {s:.4f}")

flush_memory()


[Business Analysis (Content translated from Hindi)]: The budget of the film is Rs 100 crore and the poster was launched in Mumbai.
- Budget & Finance: 0.4027
- Marketing: 0.3331
- Legal: 0.2124
- Casting: 0.0519
GPU Memory Flushed.


# Task 4: Cross-Lingual Extractive QA

### The Task
Question Answering (QA) comes in two flavors: *Generative* (like ChatGPT) and *Extractive* (like a Search Engine snippet).
In this task, we perform **Extractive QA**. We provide the model with a "Context" (a news report about *Kantara*) and ask it specific questions. The model must locate and highlight the exact span of text that contains the answer.

The cross-lingual challenge comes from the fact that our source text is in **Hindi**, but we want to ask questions and get answers in **English**.
* **Workflow:** Hindi Text $\rightarrow$ [Translation Bridge] $\rightarrow$ English Context $\rightarrow$ [QA Model] $\rightarrow$ Answer.


#### 1. Google Translate as a bridge between vernacular data and model
* Since our QA model is monolingual (English), we first translate the Hindi context using `deep_translator`. This allows us to use high-precision English models on Indian language data without needing a massive multilingual model.

#### 2. LLM Used: RoBERTa (Robustly Optimized BERT Pretraining Approach)
* **Model ID:** `deepset/roberta-base-squad2`
* **Developer:** Deepset (Finetuned from Facebook/Meta AI's base model)
* **Architecture: Encoder-Only Transformer**
    * **Type:** Unlike BART (Encoder-Decoder) used in previous tasks, RoBERTa is **Encoder-Only**. It is designed to *understand* text, not generate it.
    * **Components:** This "Base" variant consists of **12 Encoder Layers**, **12 Attention Heads**, and a hidden dimension size of **768**.
    * **Total Parameters:** ~125 Million.

### How Extractive QA Works
This is **not** text generation. The model does not write the answer; it predicts the positions in the context where the answer starts and ends.
1.  **Input:** The model receives the question and context as one sequence: `<s> Question </s></s> Context </s>` (RoBERTa's equivalent of BERT's `[CLS] Question [SEP] Context [SEP]`).
2.  **Processing:** The 12 Encoder layers model the semantic relationship between the question tokens and the context tokens.
3.  **Output Heads:** The model produces two scores for every token:
    * **Start Logits:** An unnormalized score for each token being the *start* of the answer.
    * **End Logits:** An unnormalized score for each token being the *end* of the answer.
4.  **Prediction:** The valid span (start not after end) with the highest combined start/end score is returned as the answer.
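
The prediction step can be illustrated with a small span search over start and end logits. The logits below are invented for the example, and `best_span` with its `max_answer_len` parameter is a hypothetical helper, not the pipeline's internal code.

```python
def best_span(start_logits, end_logits, max_answer_len=5):
    """Return (start, end, score) maximizing start_logit + end_logit."""
    best = None
    for i, s in enumerate(start_logits):
        # Only consider spans that end at or after the start token.
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = s + end_logits[j]
            if best is None or score > best[2]:
                best = (i, j, score)
    return best


tokens = ["Rishabh", "Shetty", "directed", "the", "film"]
# Hypothetical logits; a real model produces one pair per input token.
start_logits = [4.0, 0.5, -1.0, -2.0, -1.5]
end_logits = [0.2, 3.5, -0.5, -2.0, 0.1]
start, end, _ = best_span(start_logits, end_logits)
print(" ".join(tokens[start:end + 1]))
```

Restricting the search to `start <= end` and a maximum length rules out spans that would pair a strong start with an end token located earlier in the text or far away from it.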



In [71]:
flush_memory()
"""
TODO:

Implement Question Answering using a pre-trained model.
- Load "deepset/roberta-base-squad2" MODEL.
- Use translator from NLP task 1 to translate inputs before passing it to the model.
"""
qa_model = pipeline("question-answering", model="deepset/roberta-base-squad2", device=device_id) # the pipeline for question answering

hindi_snippet = hindi_text[:2000]

print("Translating Hindi Context to English...")
context_eng = standard_tool_translate(hindi_snippet) # translator

print(f"\n[CONTEXT]: {context_eng}...\n")

GPU Memory Flushed.



RobertaForQuestionAnswering LOAD REPORT from: deepset/roberta-base-squad2
Key                             | Status     |  | 
--------------------------------+------------+--+-
roberta.embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.



Translating Hindi Context to English...

[CONTEXT]: 2 Sometimes a film leaves you completely speechless. Sometimes its impact is so deep that words cannot be found. Sometimes you are so impressed that you can't stop talking about it. And sometimes she touches your heart so much that you just get lost in her magical world.
But when a film pulls it all together, it's not just a masterpiece, it becomes a cultural phenomenon. Rishabh Shetty's 'Kantara Chapter 1' leaves exactly the same impact.
The film begins with the Kadamba dynasty and its cruel ruler, whose greed is to take over every land and water. Whether man, woman or child does not matter to him. He spreads his rule by killing everyone.
Once, during one such expedition, he sees a mysterious old man fishing on the beach. Orders his soldiers to capture him. As they drag him away, valuables fall out of his bag.
The ruler sees those things and sets out to find their source. The journey takes him to Kantara, where tribes live in harmony

In [72]:
questions = [
    "Who is the director of the film?",
    "Which dynasty is mentioned in the story?",
    "Who is the King of Bhangra?",
    "Who is the daughter of the King?"
]

for q in questions:
    res = qa_model(question=q, context=context_eng)
    print(f"Q: {q}")
    print(f"A: {res['answer']}")
    print(f"   (Confidence: {res['score']:.4f})\n")

flush_memory()

Q: Who is the director of the film?
A: Rishabh Shetty
   (Confidence: 0.8783)

Q: Which dynasty is mentioned in the story?
A: Kadamba
   (Confidence: 0.3159)

Q: Who is the King of Bhangra?
A: Vijayendra
   (Confidence: 1.9010)

Q: Who is the daughter of the King?
A: Kanakavati
   (Confidence: 1.4085)

GPU Memory Flushed.


# Resources:

1. https://www.geeksforgeeks.org/nlp/what-is-text-embedding/
2. https://www.geeksforgeeks.org/nlp/continuous-bag-of-words-cbow-in-nlp/
3. https://www.geeksforgeeks.org/python/python-word-embedding-using-word2vec/
4. https://www.geeksforgeeks.org/nlp/glove-word-embedding-in-nlp/
5. https://huggingface.co/ai4bharat/IndicBERTv2-MLM-only
6. https://pypi.org/project/deep-translator/
7. https://huggingface.co/facebook/nllb-200-1.3B
8. https://huggingface.co/facebook/bart-large-xsum
9. https://huggingface.co/sshleifer/distilbart-cnn-12-6
10. https://www.geeksforgeeks.org/nlp/mastering-text-summarization-with-sumy-a-python-library-overview/
11. https://huggingface.co/facebook/bart-large-mnli
12. https://huggingface.co/deepset/roberta-base-squad2