# Exploring Applications of NLP Basics for Indian Languages

# The Scenario
You are the **Lead Data Scientist at Hombale Films**, the production house behind India's new wave of global hits. The marketing team is strategizing for the massive release of the prequel, ***Kantara: Chapter 1***.

The film is an unusual cinematic experiment: it is deeply rooted in the folklore of the Karavali region of coastal Karnataka, yet it is being watched by a massive national audience. This creates a cultural interpretation challenge for the studio: reviews on local Kannada forums often discuss the film through spiritual concepts and specific cultural terminology (e.g., *Daiva Nartana*, *Bhoota Kola*), whereas reviews on pan-India portals tend to focus on cinematic aspects such as VFX, action, and screenplay. Standard "off-the-shelf" NLP tools struggle with this data because they handle neither the complex morphology of Dravidian languages nor the heavy code-mixing used by fans on social media.

---

# The Task
You have been tasked with engineering a **custom NLP pipeline** to process and compare vernacular text streams in **Hindi** and **Kannada**. Your objective is to address the disparities in linguistic tool availability for these languages by developing custom tokenizers to handle complex agglutination. You will then deploy and evaluate standard models to determine how effectively they can identify domain-specific entities across disparate scripts and morphological structures.

Instead of using pre-built tools for your implementation, you will engineer the core components from scratch:

> ### 1. Data Ingestion & Normalization Module
You will implement a web scraping module to extract raw text from disparate sources (news articles vs. online forums) and apply Unicode normalization to handle script-specific inconsistencies, ensuring a clean data foundation.

> ### 2. Custom Tokenization Layer
Since standard tokenizers struggle with Dravidian agglutination, you will engineer **BPE** and **WordPiece** algorithms from scratch. You will analyze N-Grams and vocabulary structures to determine which algorithm best preserves the semantic integrity of complex Kannada words compared to Hindi.

> ### 3. Morphological & Semantic Analytics Layer
You will implement a comparative study of **stemming vs. lemmatization** to demonstrate why dictionary-based approaches are critical for Indian languages. Finally, you will deploy and "stress-test" standard models (**POS Taggers**, **XLM-RoBERTa** for NER) to identify specific linguistic entities.

---

# Final Objective
Through this implementation, you will explicitly map the resource and research gaps, identifying exactly where standard NLP pipelines fail for low-resource languages (Kannada) and where custom engineering is the only viable solution.

### IMPORTANT NOTE

Kannada, like many Indian languages, is a low-resource language: only limited linguistic data and tooling are available for natural language processing.

Challenges with such languages:
- Data Scarcity
- Limited Linguistic Resources
- Lack of Dedicated Solutions



## LIBRARY INSTALLATIONS

In [1]:
!pip install requests beautifulsoup4 indic-nlp-library stanza transformers stopwordsiso snowballstemmer



## PHASE 0 : LAB CONFIGURATION [DO NOT EDIT!]
These constants ensure your results are comparable for grading.

In [2]:
# 1. Target URLs (Use these specific articles to build your corpus)
HINDI_URLS = [
    "https://www.livemint.com/hindi/trends/kantara-2-movie-review-in-hindi-rishab-shetty-film-is-spellbinding-spectacle-241759366677483.html",
    "https://www.aajtak.in/entertainment/film-review/story/kantara-chapter-1-review-rishab-shetty-magical-mesmerising-visuals-story-ntcpsm-dskc-2346497-2025-10-02",
    "https://www.amarujala.com/entertainment/kantara-3-confirm-rishab-shetty-kantara-chapter-1-sequel-announced-title-will-be-kantara-a-legend-chapter-2-2025-10-02?pageId=1"

]

KANNADA_URLS = [
    "https://vishwavani.news/cinema/kantara-chapter-1-review-thrilling-experience-of-rishab-shetty-cinema-56780.html",
    "https://www.kannadaprabha.com/cinema/review/2025/Oct/04/kantara-chapter-1-movie-review-visually-stunning-and-compelling-film-traces-the-bloodlines-of-myth-and-power",
    "https://kannada.asianetnews.com/sandalwood/kantara-chapter-1-review-blockbuster-pre-sequel-climax-stun-rishab-shetty-sat/articleshow-fvuu0nm"

]

# 2. Parameters For BPE/WordPiece comparison
VOCAB_SIZE = 500

# 3. Pretrained models for tokenization
MODELS_TOKENIZERS = {
    "BERT-BASE-MULTILINGUAL-CASED MODEL": "bert-base-multilingual-cased",
    "XLM-ROBERTA-BASE MODEL": "xlm-roberta-base",
    "INDICBERT MODEL" : "ai4bharat/IndicBERTv2-SS"
}

# 4. Stopwords URL for Kannada
KANNADA_STOPWORDS_URL = "https://raw.githubusercontent.com/crvineeth97/kannada-stop-words/master/stop-words.txt"

# 5. Model Language Codes for POS tagging & Model Names for NER
STANZA_LANG = 'hi'
SNOWBALL_LANG = 'hindi'
NER_MODEL_NAME = "Davlan/xlm-roberta-base-wikiann-ner"

'''
# 6. Test Corpus
You must use these exact sentences as given in your boilerplate code for :
a) POS Tagging on Hindi test corpus only
b) NER Tagging on both Hindi and Kannada test corpus
'''
# KANNADA
kannada_reports = [
    "ನಟ ರಿಷಬ್ ಶೆಟ್ಟಿ ಅವರು ಕುಂದಾಪುರ ನಗರದಲ್ಲಿ ಚಿತ್ರೀಕರಣ ಆರಂಭಿಸಿದ್ದಾರೆ.",
    "ಹೊಂಬಾಳೆ ಫಿಲ್ಮ್ಸ್ ಬೆಂಗಳೂರು ನಗರದಲ್ಲಿ ಹೊಸ ಕಚೇರಿಯನ್ನು ತೆರೆದಿದೆ.",
    "ನಿರ್ಮಾಪಕ ವಿಜಯ್ ಕಿರಗಂದೂರು ಮಂಗಳೂರು ನಗರಕ್ಕೆ ಭೇಟಿ ನೀಡಿದರು.",
    "ಉಡುಪಿ ಹಾಗೂ ಭಾರತದಲ್ಲಿ ಈ ಕಥೆ ಪ್ರಸಿದ್ಧವಾಗಿದೆ.",
    "ಪಿವಿಆರ್ ಸಿನಿಮಾಸ್ ಮುಂದೆ ಅಭಿಮಾನಿಗಳು ಸಂಭ್ರಮಿಸುತ್ತಿದ್ದಾರೆ."
]

# HINDI
hindi_reports = [
    "ऋषभ शेट्टी ने मुंबई में कांतारा चैप्टर 1 का पोस्टर लॉन्च किया।",
    "प्रगति शेट्टी जी ने पुष्टि की है कि शूटिंग कर्नाटक में होगी।",
    "विजय किरागंदूर ने बताया कि फिल्म का बजट सौ करोड़ रुपये है।",
    "नई दिल्ली में गूगल पर सबसे ज्यादा सर्च की जाने वाली फिल्म कांतारा है।",
    "अजनीश लोकनाथ ने केराडी में संगीत रिकॉर्ड किया।"
]

print("Lab Constants Loaded!")

Lab Constants Loaded!


# PHASE 1: DATA ACQUISITION AND SANITATION

#### 1.1 Web Scraping

There are different ways to collect data:
- Public Dataset: We can search for publicly available data as per our problem statement.

- Web Scraping: Web scraping is a technique for extracting data from a website. Here, we use Beautiful Soup to scrape the text data from web pages.

We are going to web scrape data from different URLs.
This approach allows us to work with truly unstructured data.

In [3]:
import requests
from bs4 import BeautifulSoup
import re

def get_text_from_url(url):
    """
    Fetches a URL and extracts its paragraph text.
    """
    try:
        r = requests.get(url, timeout=10)
        soup = BeautifulSoup(r.text, "html.parser")
        # Scrape only <p> elements, since the article body lives in them
        paragraphs = soup.find_all('p')
        text_content = "\n".join([p.get_text() for p in paragraphs])

        return text_content

    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return ""

In [4]:
def extract_full_corpus(urls, lang_code, filename):
    full_corpus = ""
    for url in urls:
        print(f"Fetching: {url[:50]}...")
        raw_text = get_text_from_url(url)
        full_corpus += raw_text + "\n"

    with open(filename, "w", encoding="utf-8") as f:
        f.write(full_corpus)

    print(f"Saved data to {filename} (Total chars: {len(full_corpus)})")
    return full_corpus

# Run for Hindi scraped content
unstructured_hindi_content = extract_full_corpus(HINDI_URLS, "hi", "hindi_kantara.txt")
print("SAMPLE HINDI TEXT: \n", unstructured_hindi_content)


Fetching: https://www.livemint.com/hindi/trends/kantara-2-mo...
Fetching: https://www.aajtak.in/entertainment/film-review/st...
Fetching: https://www.amarujala.com/entertainment/kantara-3-...
Saved data to hindi_kantara.txt (Total chars: 12176)
SAMPLE HINDI TEXT: 
 Kantara 2 Movie Review: कभी-कभी कोई फिल्म आपको बिल्कुल निशब्द कर देती है। कभी उसका असर इतना गहरा होता है कि शब्द ही नहीं मिलते। कभी आप इतने प्रभावित होते हैं कि उसके बारे में बात करना ही बंद नहीं कर पाते। और कभी वह आपके दिल को इतना छू जाती है कि आप बस उसकी जादुई दुनिया में खो जाते हैं।
लेकिन जब कोई फिल्म ये सब एक साथ कर दिखाती है, तो वह सिर्फ एक मास्टरपीस नहीं रह जाती वह एक सांस्कृतिक घटना बन जाती है। ऋषभ शेट्टी की ‘कांतारा: चैप्टर 1’ ठीक ऐसा ही असर छोड़ती है।
फिल्म की शुरुआत कदंब वंश और उसके क्रूर शासक से होती है, जिसकी लालच हर ज़मीन और पानी को कब्ज़े में लेने की है। चाहे आदमी हो, औरत या बच्चा उसके लिए कोई मायने नहीं। वह सबको मारकर अपनी हुकूमत फैलाता है।
एक बार, ऐसे ही अभियान के दौरान वह समुद्र किनारे मछली पकड़ते एक रहस्यमय

In [5]:
# Run for Kannada scraped content
unstructured_kannada_content = extract_full_corpus(KANNADA_URLS, "kn", "kannada_kantara.txt")
print("SAMPLE KANNADA TEXT: \n", unstructured_kannada_content)

Fetching: https://vishwavani.news/cinema/kantara-chapter-1-r...
Fetching: https://www.kannadaprabha.com/cinema/review/2025/O...
Fetching: https://kannada.asianetnews.com/sandalwood/kantara...
Saved data to kannada_kantara.txt (Total chars: 9648)
SAMPLE KANNADA TEXT: 
  -
ಕಾಂತಾರ ಚಾಪ್ಟರ್‌ 1 ಬರೀ ಸಿನಿಮಾ (Kantara Chapter 1) ಅಲ್ಲ, ಅದೊಂದು ಅನುಭೂತಿ. ಅದನ್ನು ಕಥೆಗಾಗಿ ನೋಡಬಾರದು, ಅದು ನೀಡುವ ರೋಮಾಂಚನಕ್ಕಾಗಿ ನೋಡಬೇಕು. ಅದು ನೀಡುವ ದೃಶ್ಯವೈಭವ, ಸೀಟಿನ ತುದಿಗೆ ತಂದು ಕೂರಿಸುವ ಥ್ರಿಲ್ಲಿಂಗ್‌ ಹೊಡೆದಾಟದ ದೃಶ್ಯಗಳು, ಕಾಡಿನ ಮರೆಯಲ್ಲಿ ಹುದುಗಿಕೊಂಡ ರಹಸ್ಯಗಳು, ತುಳುನಾಡಿನ ದೈವಗಳು ಮತ್ತು ಅದರಿಂದ ಕಾಯಲ್ಪಡುವ ಮನುಷ್ಯಲೋಕದ ಆಟ- ಹೋರಾಟಗಳ ಭಾವುಕ- ರಮ್ಯ ಲೋಕದ ಚಿತ್ರಣಕ್ಕಾಗಿ ಇದನ್ನು ನೋಡಬೇಕು. ದೊಡ್ಡ ತೆರೆಯಲ್ಲಿ ನೋಡಿದರೆ ಮಾತ್ರವೇ ಈ ದೃಶ್ಯ ವೈಭವದ ನೈಜ ಸಾಕ್ಷಾತ್ಕಾರ ಸಾಧ್ಯ.

ಕಥೆಯಲ್ಲಿ ಹೊಸತೇನಿಲ್ಲ. ಅದು ಒಳಿತು ಕೆಡುಕುಗಳ ಸಮರ. ಒಳಿತಿನ ವಿಜಯ. ಕೇಡಿನ ಶಕ್ತಿಗಳ ವಿರುದ್ಧ ಸಜ್ಜನರ, ಮನುಷ್ಯರು ನಂಬಿದ ದೈವಗಳ ವಿಜಯ. ಆರಂಭದಲ್ಲಿಯೇ ತುಳುನಾಡಿಗೆ ಕೈಲಾಸದಿಂದ ಅವತರಿಸುವ ಶಿವಗಣಗಳು ದೈವವಾಗಿ ನಾಡನ್ನು ಕಾಯುತ್ತವೆ. ಇದರ ನಡುವೆಯೂ ದೈವವನ್ನು ಬಂಧಿಸಲು ಯತ್ನಿಸುವ ದುರ್ಜನರು ಇದ್ದಾರೆ. ಈ ದುರ್ಜನರನ್ನು ಮಟ್ಟಹಾಕಲು ಮನುಷ್ಯಶಕ್ತಿಯೂ ದೈವಶಕ್ತಿಯೂ ಕೈ ಜೋಡಿ

### 1.2 Text Cleaning

The data which we acquire is usually not very clean. It may contain HTML tags, spelling mistakes, or special characters. This is why we need to clean the text.

Here, we remove:

- URLs, handles, and emails
- Punctuation
- English letters (`[a-zA-Z]`) and digits, so that only the target vernacular script remains

Finally, we normalize whitespace.
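
As a rough illustration of these cleaning rules, the sketch below applies them to one made-up line. The helper name `strip_non_indic` and the sample strings are assumptions for illustration and are not part of the lab code. One deliberate difference: removed characters are replaced with a space rather than deleted, so that hyphenated words such as कभी-कभी are not fused into a single token.

```python
import re

# Hypothetical helper and patterns, for illustration only.
URLS_AND_EMAILS = re.compile(r"http\S+|www\S+|\S+@\S+")
# Anything outside Devanagari (U+0900-U+097F), Kannada (U+0C80-U+0CFF) and whitespace.
NON_INDIC = re.compile(r"[^\u0900-\u097F\u0C80-\u0CFF\s]")

def strip_non_indic(line):
    """Drop URLs, emails and every non-Indic character, then collapse whitespace."""
    line = URLS_AND_EMAILS.sub(" ", line)
    line = NON_INDIC.sub(" ", line)
    return " ".join(line.split())

sample = "Kantara 2 Review: फिल्म शानदार है! https://example.com @fan"
cleaned = strip_non_indic(sample)
hyphenated = strip_non_indic("कभी-कभी")

print(cleaned)      # only the three Hindi words remain
print(hyphenated)   # the hyphen becomes a space instead of vanishing
```

Note that the danda (।, U+0964) lies inside the Devanagari block, so a range-based filter like this one keeps it.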

In [6]:
import re
def clean_indic_text(data, lang):
    """
    Cleans raw web-scraped text to return only valid Hindi and Kannada sentences.
    """
    lines = data.split('\n')
    cleaned_sentences = []

    """
    TODO: Regex to remove English letters and noisy punctuation
    Syntax : english_garbage_pattern = re.compile('<regex>')
    """

    english_garbage_pattern = re.compile(
        r"""
        http\S+|www\S+|          # URLs
        \S+@\S+|                 # Emails
        @[A-Za-z0-9_]+|          # Handles
        [a-zA-Z]|                # English letters
        [0-9]|                   # Numbers
        [^\u0900-\u097F\u0C80-\u0CFF\s]  # Everything except Hindi, Kannada and spaces
        """,
        re.VERBOSE
    )

    # Regex to ensure the line actually contains Hindi/Kannada characters
    if lang == "hi":
        pattern = re.compile(r'[\u0900-\u097F]')
    else:
        pattern = re.compile(r'[\u0C80-\u0CFF]')

    for line in lines:
        line = line.strip()

        # Remove URLs, English letters, digits, and any other non-Indic characters
        clean_line = english_garbage_pattern.sub('', line)

        # Remove extra spaces
        clean_line = ' '.join(clean_line.split())

        '''
        1. Length of the line > 10 characters
        2. Must contain at least one Hindi/Kannada character
        '''
        if len(clean_line) > 10 and pattern.search(clean_line):
            cleaned_sentences.append(clean_line)
    cleaned_sentences = "\n".join(cleaned_sentences)
    return cleaned_sentences

In [7]:
cleaned_hindi_content = clean_indic_text(unstructured_hindi_content, "hi")
print("HINDI TEXT AFTER CLEANING: \n",cleaned_hindi_content)

HINDI TEXT AFTER CLEANING: 
 कभीकभी कोई फिल्म आपको बिल्कुल निशब्द कर देती है। कभी उसका असर इतना गहरा होता है कि शब्द ही नहीं मिलते। कभी आप इतने प्रभावित होते हैं कि उसके बारे में बात करना ही बंद नहीं कर पाते। और कभी वह आपके दिल को इतना छू जाती है कि आप बस उसकी जादुई दुनिया में खो जाते हैं।
लेकिन जब कोई फिल्म ये सब एक साथ कर दिखाती है तो वह सिर्फ एक मास्टरपीस नहीं रह जाती वह एक सांस्कृतिक घटना बन जाती है। ऋषभ शेट्टी की कांतारा चैप्टर ठीक ऐसा ही असर छोड़ती है।
फिल्म की शुरुआत कदंब वंश और उसके क्रूर शासक से होती है जिसकी लालच हर ज़मीन और पानी को कब्ज़े में लेने की है। चाहे आदमी हो औरत या बच्चा उसके लिए कोई मायने नहीं। वह सबको मारकर अपनी हुकूमत फैलाता है।
एक बार ऐसे ही अभियान के दौरान वह समुद्र किनारे मछली पकड़ते एक रहस्यमयी बूढ़े आदमी को देखता है। अपने सैनिकों को उसे पकड़ने का आदेश देता है। जैसे ही वे उसे खींचकर ले जाते हैं उसके थैले से कीमती सामान गिरते हैं।
शासक उन चीज़ों को देखता है और उनके स्रोत की खोज में निकल पड़ता है। यह सफ़र उसे कांतारा तक ले जाता है जहां जनजातियां प्रकृति के साथ 

In [8]:
cleaned_kannada_content = clean_indic_text(unstructured_kannada_content, "kn")
print("KANNADA TEXT AFTER CLEANING: \n",cleaned_kannada_content)

KANNADA TEXT AFTER CLEANING: 
 ಕಾಂತಾರ ಚಾಪ್ಟರ್ ಬರೀ ಸಿನಿಮಾ ಅಲ್ಲ ಅದೊಂದು ಅನುಭೂತಿ ಅದನ್ನು ಕಥೆಗಾಗಿ ನೋಡಬಾರದು ಅದು ನೀಡುವ ರೋಮಾಂಚನಕ್ಕಾಗಿ ನೋಡಬೇಕು ಅದು ನೀಡುವ ದೃಶ್ಯವೈಭವ ಸೀಟಿನ ತುದಿಗೆ ತಂದು ಕೂರಿಸುವ ಥ್ರಿಲ್ಲಿಂಗ್ ಹೊಡೆದಾಟದ ದೃಶ್ಯಗಳು ಕಾಡಿನ ಮರೆಯಲ್ಲಿ ಹುದುಗಿಕೊಂಡ ರಹಸ್ಯಗಳು ತುಳುನಾಡಿನ ದೈವಗಳು ಮತ್ತು ಅದರಿಂದ ಕಾಯಲ್ಪಡುವ ಮನುಷ್ಯಲೋಕದ ಆಟ ಹೋರಾಟಗಳ ಭಾವುಕ ರಮ್ಯ ಲೋಕದ ಚಿತ್ರಣಕ್ಕಾಗಿ ಇದನ್ನು ನೋಡಬೇಕು ದೊಡ್ಡ ತೆರೆಯಲ್ಲಿ ನೋಡಿದರೆ ಮಾತ್ರವೇ ಈ ದೃಶ್ಯ ವೈಭವದ ನೈಜ ಸಾಕ್ಷಾತ್ಕಾರ ಸಾಧ್ಯ
ಕಥೆಯಲ್ಲಿ ಹೊಸತೇನಿಲ್ಲ ಅದು ಒಳಿತು ಕೆಡುಕುಗಳ ಸಮರ ಒಳಿತಿನ ವಿಜಯ ಕೇಡಿನ ಶಕ್ತಿಗಳ ವಿರುದ್ಧ ಸಜ್ಜನರ ಮನುಷ್ಯರು ನಂಬಿದ ದೈವಗಳ ವಿಜಯ ಆರಂಭದಲ್ಲಿಯೇ ತುಳುನಾಡಿಗೆ ಕೈಲಾಸದಿಂದ ಅವತರಿಸುವ ಶಿವಗಣಗಳು ದೈವವಾಗಿ ನಾಡನ್ನು ಕಾಯುತ್ತವೆ ಇದರ ನಡುವೆಯೂ ದೈವವನ್ನು ಬಂಧಿಸಲು ಯತ್ನಿಸುವ ದುರ್ಜನರು ಇದ್ದಾರೆ ಈ ದುರ್ಜನರನ್ನು ಮಟ್ಟಹಾಕಲು ಮನುಷ್ಯಶಕ್ತಿಯೂ ದೈವಶಕ್ತಿಯೂ ಕೈ ಜೋಡಿಸಬೇಕಾಗುತ್ತದೆ ರಿಷಬ್ ಶೆಟ್ಟಿಯ ಬೆರ್ಮೆ ಪಾತ್ರದಲ್ಲಿ ಇವೆರಡೂ ಜೋಡಿಯಾಗಿ ನಮ್ಮನ್ನು ರೋಮಾಂಚಿತಗೊಳಿಸುತ್ತವೆ
ತುಳುನಾಡನ್ನು ಬಂಗ್ರ ಅರಸರು ಆಳುತ್ತಿದ್ದಾರೆ ಬಂಗ್ರದ ಅರಸರ ಹೊಸ ರಾಜಕುಮಾರನಿಗೂ ಕಾಂತಾರದ ಕಾನನ ನಿವಾಸಿಗಳಿಗೂ ಇಕ್ಕಟ್ಟು ಬಿಕ್ಕಟ್ಟುಗಳು ತಲೆದೋರುತ್ತವೆ ಇದನ್ನು ಪರಿಹರಿಸಲು ಕಾಂತಾರ ನಿವಾಸಿಗಳು ತಮಗೆ ಪರಿಚಿತವಲ್ಲದ ಹೊಸ ಲೋಕವನ್ನು ಪರಿ

### 1.3 Normalization
We must normalize the text to ensure that identical words are represented by the same byte sequence, regardless of how they were typed.

#### A. Nukta Normalization
This is crucial for languages written in the Devanagari script, such as Hindi. Several Devanagari letters that carry a nukta (dot) can be represented in Unicode in two equivalent ways:

- *Precomposed Character:* A single Unicode character that represents the consonant and the nukta combined.

    * Example: Za (ज़) = `U+095B`

- *Decomposed Sequence:* Two distinct Unicode characters: the base consonant followed by the nukta combining mark.

    * Example: Ja (ज) `U+091C` + Nukta (़) `U+093C`

*The Problem:* To a computer, `U+095B` and `U+091C + U+093C` are two completely different strings, even though they look identical (ज़ vs ज़) and mean the same thing.

*The Fix:* The normalizer converts these mixed representations into a single standard form (usually the NFC form).

#### B. Unicode Normalization (NFC)
The library applies standard *Unicode Normalization Form C (NFC)*.

- *Canonical Composition:* It combines base characters and combining marks into their single-character equivalents wherever Unicode permits.

- *Result:* Even if a user types the same word with different keystroke sequences, the machine sees one consistent code-point sequence.
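
The effect can be checked with Python's standard `unicodedata` module. This is only a sketch that uses the standard library as a stand-in; it is not the IndicNLP normalizer used in the lab, and the variable names are illustrative.

```python
import unicodedata

precomposed = "\u095B"         # ज़ stored as a single code point
decomposed = "\u091C\u093C"    # ज followed by the combining nukta

# The raw strings differ even though they render identically.
print(precomposed == decomposed)

nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)

# After normalization both spellings share one code-point sequence.
print(nfc_a == nfc_b)
print([f"U+{ord(ch):04X}" for ch in nfc_a])
```

For this particular letter, Unicode excludes the precomposed form from recomposition, so NFC settles on the two-code-point sequence. What matters for the pipeline is that both spellings end up identical.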

We are using the IndicNLP library:
https://github.com/anoopkunchukuttan/indic_nlp_library

This library has been made to perform NLP on Indian Languages specifically.

It fills a crucial gap in computational linguistics by providing a wide range of NLP functionalities specifically designed for languages spoken in India.

Language codes for this library: "hi" for Hindi and "kn" for Kannada.

In [11]:
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory

# Global factory instance
factory = IndicNormalizerFactory()

## Use this factory to normalize text
def normalize_indic_text(text, lang):
    """
    Normalizes Indic text (handles Nuktas, canonical Unicode forms, etc.).

    Args:
        text (str): Input text.
        lang (str): Language code ('hi', 'kn', etc.)
    """

    """
    TODO: Initialize the normalizer -> Get the normalized text and return it.
    Syntax:
     - factory.get_normalizer(<language>)
     - normalizer.normalize(<text>)
    """
    normalizer = factory.get_normalizer(lang)
    normalized_text = normalizer.normalize(text)
    return normalized_text

kannada_text = normalize_indic_text(cleaned_kannada_content, "kn")
hindi_text = normalize_indic_text(cleaned_hindi_content, 'hi')

In [12]:
print("HINDI TEXT AFTER NORMALIZATION: \n", hindi_text)

HINDI TEXT AFTER NORMALIZATION: 
 कभीकभी कोई फिल्म आपको बिल्कुल निशब्द कर देती है। कभी उसका असर इतना गहरा होता है कि शब्द ही नहीं मिलते। कभी आप इतने प्रभावित होते हैं कि उसके बारे में बात करना ही बंद नहीं कर पाते। और कभी वह आपके दिल को इतना छू जाती है कि आप बस उसकी जादुई दुनिया में खो जाते हैं।
लेकिन जब कोई फिल्म ये सब एक साथ कर दिखाती है तो वह सिर्फ एक मास्टरपीस नहीं रह जाती वह एक सांस्कृतिक घटना बन जाती है। ऋषभ शेट्टी की कांतारा चैप्टर ठीक ऐसा ही असर छोड़ती है।
फिल्म की शुरुआत कदंब वंश और उसके क्रूर शासक से होती है जिसकी लालच हर ज़मीन और पानी को कब्ज़े में लेने की है। चाहे आदमी हो औरत या बच्चा उसके लिए कोई मायने नहीं। वह सबको मारकर अपनी हुकूमत फैलाता है।
एक बार ऐसे ही अभियान के दौरान वह समुद्र किनारे मछली पकड़ते एक रहस्यमयी बूढ़े आदमी को देखता है। अपने सैनिकों को उसे पकड़ने का आदेश देता है। जैसे ही वे उसे खींचकर ले जाते हैं उसके थैले से कीमती सामान गिरते हैं।
शासक उन चीज़ों को देखता है और उनके स्रोत की खोज में निकल पड़ता है। यह सफ़र उसे कांतारा तक ले जाता है जहां जनजातियां प्रकृति के साथ 

In [13]:
print("KANNADA TEXT AFTER NORMALIZATION: \n", kannada_text)

KANNADA TEXT AFTER NORMALIZATION: 
 ಕಾಂತಾರ ಚಾಪ್ಟರ್ ಬರೀ ಸಿನಿಮಾ ಅಲ್ಲ ಅದೊಂದು ಅನುಭೂತಿ ಅದನ್ನು ಕಥೆಗಾಗಿ ನೋಡಬಾರದು ಅದು ನೀಡುವ ರೋಮಾಂಚನಕ್ಕಾಗಿ ನೋಡಬೇಕು ಅದು ನೀಡುವ ದೃಶ್ಯವೈಭವ ಸೀಟಿನ ತುದಿಗೆ ತಂದು ಕೂರಿಸುವ ಥ್ರಿಲ್ಲಿಂಗ್ ಹೊಡೆದಾಟದ ದೃಶ್ಯಗಳು ಕಾಡಿನ ಮರೆಯಲ್ಲಿ ಹುದುಗಿಕೊಂಡ ರಹಸ್ಯಗಳು ತುಳುನಾಡಿನ ದೈವಗಳು ಮತ್ತು ಅದರಿಂದ ಕಾಯಲ್ಪಡುವ ಮನುಷ್ಯಲೋಕದ ಆಟ ಹೋರಾಟಗಳ ಭಾವುಕ ರಮ್ಯ ಲೋಕದ ಚಿತ್ರಣಕ್ಕಾಗಿ ಇದನ್ನು ನೋಡಬೇಕು ದೊಡ್ಡ ತೆರೆಯಲ್ಲಿ ನೋಡಿದರೆ ಮಾತ್ರವೇ ಈ ದೃಶ್ಯ ವೈಭವದ ನೈಜ ಸಾಕ್ಷಾತ್ಕಾರ ಸಾಧ್ಯ
ಕಥೆಯಲ್ಲಿ ಹೊಸತೇನಿಲ್ಲ ಅದು ಒಳಿತು ಕೆಡುಕುಗಳ ಸಮರ ಒಳಿತಿನ ವಿಜಯ ಕೇಡಿನ ಶಕ್ತಿಗಳ ವಿರುದ್ಧ ಸಜ್ಜನರ ಮನುಷ್ಯರು ನಂಬಿದ ದೈವಗಳ ವಿಜಯ ಆರಂಭದಲ್ಲಿಯೇ ತುಳುನಾಡಿಗೆ ಕೈಲಾಸದಿಂದ ಅವತರಿಸುವ ಶಿವಗಣಗಳು ದೈವವಾಗಿ ನಾಡನ್ನು ಕಾಯುತ್ತವೆ ಇದರ ನಡುವೆಯೂ ದೈವವನ್ನು ಬಂಧಿಸಲು ಯತ್ನಿಸುವ ದುರ್ಜನರು ಇದ್ದಾರೆ ಈ ದುರ್ಜನರನ್ನು ಮಟ್ಟಹಾಕಲು ಮನುಷ್ಯಶಕ್ತಿಯೂ ದೈವಶಕ್ತಿಯೂ ಕೈ ಜೋಡಿಸಬೇಕಾಗುತ್ತದೆ ರಿಷಬ್ ಶೆಟ್ಟಿಯ ಬೆರ್ಮೆ ಪಾತ್ರದಲ್ಲಿ ಇವೆರಡೂ ಜೋಡಿಯಾಗಿ ನಮ್ಮನ್ನು ರೋಮಾಂಚಿತಗೊಳಿಸುತ್ತವೆ
ತುಳುನಾಡನ್ನು ಬಂಗ್ರ ಅರಸರು ಆಳುತ್ತಿದ್ದಾರೆ ಬಂಗ್ರದ ಅರಸರ ಹೊಸ ರಾಜಕುಮಾರನಿಗೂ ಕಾಂತಾರದ ಕಾನನ ನಿವಾಸಿಗಳಿಗೂ ಇಕ್ಕಟ್ಟು ಬಿಕ್ಕಟ್ಟುಗಳು ತಲೆದೋರುತ್ತವೆ ಇದನ್ನು ಪರಿಹರಿಸಲು ಕಾಂತಾರ ನಿವಾಸಿಗಳು ತಮಗೆ ಪರಿಚಿತವಲ್ಲದ ಹೊಸ ಲೋಕವನ್ನು ಪರಿ

# PHASE 2: COMPARATIVE ANALYSIS OF TOKENIZATION STRATEGIES

Tokenization is the foundational step in the NLP pipeline, involving the segmentation of raw text into discrete units called **tokens**.

To preserve the semantic structure of the document, we follow a hierarchical approach:
1.  **Sentence Tokenization:** The corpus is first segmented into distinct grammatical sentences.
2.  **Word Tokenization:** Each sentence is subsequently split into individual lexical units (words).

**Implementation Note:** It is standard practice to perform sentence tokenization *before* word tokenization. Consequently, your output will be structured as a **List of Lists** (e.g., `[[Word1, Word2], [Word3, Word4]]`), where the outer list represents the document and the inner lists represent sentences.
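
A minimal sketch of this two-level structure is shown below. The helper names and the naive splitting rules (danda, common sentence punctuation, whitespace) are assumptions for illustration; real sentence splitters handle abbreviations and other edge cases.

```python
import re

def sentence_tokenize(text):
    """Split on the danda and common sentence-ending punctuation (naive rule)."""
    parts = re.split(r"[।.!?\n]+", text)
    return [part.strip() for part in parts if part.strip()]

def word_tokenize(sentence):
    """Split a sentence on whitespace."""
    return sentence.split()

doc = "फिल्म अच्छी है। संगीत शानदार है।"

# Outer list = document, inner lists = sentences.
tokens = [word_tokenize(sentence) for sentence in sentence_tokenize(doc)]
print(tokens)
```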

In [14]:
import unicodedata
from collections import Counter
import re

# Maps token_id (int) -> token_str (str)
vocab = {}
# Maps token_str (str) -> token_id (int)
inverse_vocab = {}
# Stores the merges: {(id1, id2): new_merged_id}
bpe_merges = {}

def initialize_vocab(text):
    """
    Initializes vocab with all unique characters in the text.
    Uses 'Ġ' to represent spaces for reversible tokenization.
    """
    # global, so it can be used later easily
    global vocab, inverse_vocab

    # 1. Preprocess text: Replace spaces with Ġ to preserve them
    processed_chars = []
    for char in text:
        if char == " ":
            processed_chars.append("Ġ")
        else:
            processed_chars.append(char)

    # 2. Extract unique characters for initial vocab
    unique_chars = sorted(list(set(processed_chars)))

    # 3. Build mappings
    vocab = {i: char for i, char in enumerate(unique_chars)} # mapping : index:char
    inverse_vocab = {char: i for i, char in vocab.items()} # mapping : char:index

    # 4. Return the text as a list of IDs using inverse_vocab
    return [inverse_vocab[char] for char in processed_chars]


In [15]:
def get_stats(ids):
    """
    Step 1: Count frequency of adjacent pairs.
    Args:
        ids (list of int): The current list of token IDs (e.g., [1, 2, 1, 2, 3])

    Returns:
        dict: A dictionary mapping a tuple (id1, id2) to its frequency count.
              Example: {(1, 2): 2, (2, 1): 1, (2, 3): 1}
    """
    counts = Counter()

    '''
    TODO: Implement this method.
    Iterate through the list of `ids` and count every adjacent pair.
    use zip(ids, ids[1:]) to traverse the pairs.
    '''
    for pair in zip(ids, ids[1:]):
        counts[pair] += 1

    return counts


In [16]:
def merge_ids(ids, pair, idx):
    """
    Step 2: Replace all occurrences of a specific pair with a new token ID.

    Args:
        ids (list of int): The current list of token IDs.
        pair (tuple): The pair of IDs to replace (e.g., (1, 2)).
        idx (int): The new ID to assign to this pair.

    Returns:
        list of int: The new list of IDs with the pair replaced.
    """

    new_ids = []

    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i+1]) == pair:
            new_ids.append(idx)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1

    return new_ids
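
To make one training step concrete, the following self-contained sketch counts pairs and applies a single merge on the toy sequence `[1, 2, 1, 2, 3]`. The function names `pair_counts` and `merge_pair` are illustrative stand-ins that mirror `get_stats` and `merge_ids`; the new ID `4` is an arbitrary choice.

```python
from collections import Counter

def pair_counts(ids):
    """Count every adjacent pair of token IDs."""
    return Counter(zip(ids, ids[1:]))

def merge_pair(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = [1, 2, 1, 2, 3]
stats = pair_counts(ids)            # (1, 2) occurs twice; (2, 1) and (2, 3) once each
best = max(stats, key=stats.get)    # the most frequent pair
merged = merge_pair(ids, best, 4)   # both (1, 2) occurrences collapse into ID 4

print(dict(stats))
print(best, merged)
```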

In [17]:
def train(text, vocab_size):
    """
    The Main Training Loop.

    Args:
        text (str): The text to train on.
        vocab_size (int): How many tokens we want in our final vocabulary.
    """
    # to store the merges
    global vocab, inverse_vocab, bpe_merges

    # Reset globals for a fresh train
    vocab = {}
    inverse_vocab = {}
    bpe_merges = {}

    print(f"Training BPE (Target Vocab: {vocab_size})")

    # 1. Initialize vocab with characters (ids are now a list of integers)
    ids = initialize_vocab(text)

    # 2. Loop until we reach vocab size -> We keep merging until we reach the desired vocab_size
    num_merges = vocab_size - len(vocab)

    for i in range(num_merges):
        # Count all adjacent pairs
        stats = get_stats(ids)
        if not stats:
            break

        # 1. Find the most frequent pair
        best_pair = max(stats, key=stats.get)

        # 2. Create new token id
        idx = len(vocab)

        # 3. Merge the ids using this best pair
        ids = merge_ids(ids, best_pair, idx)

        # 4. Store merge rule
        bpe_merges[best_pair] = idx

        # 5. Update vocab and inverse_vocab
        new_token = vocab[best_pair[0]] + vocab[best_pair[1]]
        vocab[idx] = new_token
        inverse_vocab[new_token] = idx

    print(f"Training complete. Final vocab size: {len(vocab)}")
    return vocab

In [18]:

def encode(text):
    """
    Encodes text into token IDs using the trained BPE merges.
    """
    if not bpe_merges:
        return []

    # 1. Start by converting the text to character IDs (pre-processing)
    processed_chars = []
    for char in text:
        if char == " ":
            processed_chars.append("Ġ")
        else:
            processed_chars.append(char)

    # Initial IDs (character level)
    ids = [inverse_vocab[c] for c in processed_chars if c in inverse_vocab]

    while len(ids) >= 2:
        '''
        TODO: Implement the encoding logic.
        Repeatedly compress the sequence by applying the learned BPE merge rules.
        Unlike training (which merges the most frequent pair), encoding must merge the pair that was learned earliest during training.

        1. Identify which pairs in the current sequence are valid - have a merge rule (check in bpe_merges).
        2. Select the valid pair with the lowest merge ID (highest priority) and merge it.
        3. Terminate the loop if no known pairs remain in the sequence.
        '''
        # Step 1: Collect all adjacent pairs in the current sequence
        pairs = [(ids[i], ids[i+1]) for i in range(len(ids) - 1)]

        # Step 2: Keep only those pairs which were learned during training
        valid_pairs = [p for p in pairs if p in bpe_merges]

        # Step 3: If no valid pairs exist, no more merges are possible → stop encoding
        if not valid_pairs:
            break

        # Step 4: Choose the pair that was learned earliest during training
        # (smallest merge ID = highest priority)
        best_pair = min(valid_pairs, key=lambda p: bpe_merges[p])
        best_idx = bpe_merges[best_pair]

        # Step 5: Merge this pair in the current sequence
        ids = merge_ids(ids, best_pair, best_idx)

    return ids
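
The role of merge priority during encoding can be traced on a toy example. Everything below (the three-letter alphabet, the merge table, and the helper names) is hypothetical and chosen only to keep the trace short.

```python
# Toy vocabulary and merge table (hypothetical values, not from the lab corpus).
toy_vocab = {0: "a", 1: "b", 2: "c", 3: "ab", 4: "abc"}
toy_merges = {(0, 1): 3, (3, 2): 4}   # learned in this order: "ab" first, then "abc"
char_to_id = {"a": 0, "b": 1, "c": 2}

def apply_merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def toy_encode(text):
    ids = [char_to_id[ch] for ch in text]
    while len(ids) >= 2:
        candidates = [p for p in zip(ids, ids[1:]) if p in toy_merges]
        if not candidates:
            break
        # The lowest merge ID is the rule learned earliest, so it is applied first.
        earliest = min(candidates, key=toy_merges.get)
        ids = apply_merge(ids, earliest, toy_merges[earliest])
    return ids

encoded = toy_encode("abcab")
roundtrip = "".join(toy_vocab[i] for i in encoded)
print(encoded, roundtrip)
```

Here `"abcab"` first becomes `[3, 2, 3]` (both `ab` pairs merged) and then `[4, 3]`, since the `abc` rule can only fire once `ab` exists.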

### **Concept: What is Detokenization?**

If **Tokenization** is the process of breaking raw text into numerical IDs for the machine, **Detokenization** is the "reverse engineering" phase that reconstructs human-readable text from those numbers.

It isn't as simple as just "gluing words back together" because we need to respect the original spacing and formatting.

#### **Understanding the Logic in `decode()`**
Your function performs three critical steps to restore the text:

1.  **ID-to-Token Mapping:**
    * The machine only knows integers (e.g., `[452, 1109, 23]`).
    * We use the **Vocabulary** (dictionary) to map these back to string fragments (e.g., `['Hel', 'lo', 'ĠWorld']`).

2.  **The "Ġ" (Space Marker) and its importance:**
    * *Why do we see this character?* In algorithms like **BPE (Byte Pair Encoding)**, the tokenizer replaces standard spaces with a special character (often `Ġ`) to preserve whitespace information during the split.
    * *The Fix:* The `Ġ` character replacement is the sanitization step. It converts these internal markers back into actual spaces so the text is readable.

3.  **Reconstruction:**
    * Finally, the fragments are joined to form the complete sentence.

**Visual Summary:**
`[Ids]` $\xrightarrow{\text{Vocab}}$ `['Hel', 'lo', 'ĠWorld']` $\xrightarrow{\text{Join}}$ `"HelloĠWorld"` $\xrightarrow{\text{Sanitize}}$ `"Hello World"`
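
The three steps can be sketched on a tiny hand-made vocabulary. The IDs and the helper name `decode_ids` are assumptions for illustration, not the vocabulary learned in this lab.

```python
# Hypothetical three-entry vocabulary: token ID -> string fragment.
demo_vocab = {0: "Hel", 1: "lo", 2: "ĠWorld"}

def decode_ids(ids, vocab):
    fragments = [vocab[i] for i in ids]   # 1. ID-to-token mapping
    joined = "".join(fragments)           # 2. reconstruction: "HelloĠWorld"
    return joined.replace("Ġ", " ")       # 3. turn space markers back into spaces

decoded = decode_ids([0, 1, 2], demo_vocab)
print(decoded)
```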

In [19]:
def decode(ids):
    """
    Decodes IDs back to string.
    """
    '''
    TODO: Implement text decoding.
    Reconstruct the original human-readable text from the sequence of token IDs.

    1. Convert each integer ID in the list back to its corresponding string token
       using the vocabulary.
    2. Concatenate the tokens into a single string and restore original formatting by replacing the tokenizer's special space marker ('Ġ')
       with an actual space.
    '''

    # 1. Convert IDs to tokens
    tokens = [vocab[i] for i in ids]

    # 2. Join tokens and restore spaces
    text = "".join(tokens).replace("Ġ", " ")

    return text

In [20]:
def bpe_output(text, lang):
    print("RAW TEXT: \n", text[:200])
    print("\nTokenization using BPE: (Subword tokenization)")
    text = unicodedata.normalize('NFKC', text)
    mid = len(text) // 2
    train_bpe = text[:mid]
    test_bpe = text[mid:]
    vocab = train(train_bpe, VOCAB_SIZE)
    print("Vocabulary:", vocab)
    ids = encode(test_bpe)

    print(f"Encoded: {ids}")
    print(f"Decoded: {decode(ids)}")

bpe_output(hindi_text, "hi")

RAW TEXT: 
 कभीकभी कोई फिल्म आपको बिल्कुल निशब्द कर देती है। कभी उसका असर इतना गहरा होता है कि शब्द ही नहीं मिलते। कभी आप इतने प्रभावित होते हैं कि उसके बारे में बात करना ही बंद नहीं कर पाते। और कभी वह आपके दिल क

Tokenization using BPE: (Subword tokenization)
Training BPE (Target Vocab: 500)
Training complete. Final vocab size: 500
Vocabulary: {0: '\n', 1: 'Ġ', 2: 'ँ', 3: 'ं', 4: 'अ', 5: 'आ', 6: 'इ', 7: 'ई', 8: 'उ', 9: 'ऋ', 10: 'ए', 11: 'ऐ', 12: 'औ', 13: 'क', 14: 'ख', 15: 'ग', 16: 'घ', 17: 'च', 18: 'छ', 19: 'ज', 20: 'झ', 21: 'ट', 22: 'ठ', 23: 'ड', 24: 'ढ', 25: 'ण', 26: 'त', 27: 'थ', 28: 'द', 29: 'ध', 30: 'न', 31: 'प', 32: 'फ', 33: 'ब', 34: 'भ', 35: 'म', 36: 'य', 37: 'र', 38: 'ल', 39: 'व', 40: 'श', 41: 'ष', 42: 'स', 43: 'ह', 44: '़', 45: 'ा', 46: 'ि', 47: 'ी', 48: 'ु', 49: 'ू', 50: 'ृ', 51: 'े', 52: 'ै', 53: 'ॉ', 54: 'ो', 55: 'ौ', 56: '्', 57: '।', 58: 'ीĠ', 59: 'ाĠ', 60: 'ेĠ', 61: 'रĠ', 62: 'है', 63: 'Ġक', 64: 'ंĠ', 65: 'औरĠ', 66: 'ोĠ', 67: 'तीĠ', 68: 'है।', 69: 'सेĠ', 70: 'ार', 71: '

In [21]:
bpe_output(kannada_text, "kn")

RAW TEXT: 
 ಕಾಂತಾರ ಚಾಪ್ಟರ್ ಬರೀ ಸಿನಿಮಾ ಅಲ್ಲ ಅದೊಂದು ಅನುಭೂತಿ ಅದನ್ನು ಕಥೆಗಾಗಿ ನೋಡಬಾರದು ಅದು ನೀಡುವ ರೋಮಾಂಚನಕ್ಕಾಗಿ ನೋಡಬೇಕು ಅದು ನೀಡುವ ದೃಶ್ಯವೈಭವ ಸೀಟಿನ ತುದಿಗೆ ತಂದು ಕೂರಿಸುವ ಥ್ರಿಲ್ಲಿಂಗ್ ಹೊಡೆದಾಟದ ದೃಶ್ಯಗಳು ಕಾಡಿನ ಮರೆಯಲ್ಲಿ ಹುದುಗಿಕ

Tokenization using BPE: (Subword tokenization)
Training BPE (Target Vocab: 500)
Training complete. Final vocab size: 500
Vocabulary: {0: '\n', 1: 'Ġ', 2: 'ಂ', 3: 'ಃ', 4: 'ಅ', 5: 'ಆ', 6: 'ಇ', 7: 'ಈ', 8: 'ಉ', 9: 'ಊ', 10: 'ಎ', 11: 'ಏ', 12: 'ಐ', 13: 'ಒ', 14: 'ಓ', 15: 'ಕ', 16: 'ಖ', 17: 'ಗ', 18: 'ಚ', 19: 'ಜ', 20: 'ಟ', 21: 'ಡ', 22: 'ಣ', 23: 'ತ', 24: 'ಥ', 25: 'ದ', 26: 'ಧ', 27: 'ನ', 28: 'ಪ', 29: 'ಫ', 30: 'ಬ', 31: 'ಭ', 32: 'ಮ', 33: 'ಯ', 34: 'ರ', 35: 'ಲ', 36: 'ಳ', 37: 'ವ', 38: 'ಶ', 39: 'ಷ', 40: 'ಸ', 41: 'ಹ', 42: 'ಾ', 43: 'ಿ', 44: 'ೀ', 45: 'ು', 46: 'ೂ', 47: 'ೃ', 48: 'ೆ', 49: 'ೇ', 50: 'ೈ', 51: 'ೊ', 52: 'ೋ', 53: 'ೌ', 54: '್', 55: 'ುĠ', 56: 'ೆĠ', 57: 'ಿĠ', 58: 'ತ್', 59: 'ನ್', 60: 'ಕಾ', 61: 'ದĠ', 62: 'ನ್ನ', 63: 'ಲ್', 64: 'ಾಗ', 65: 'ಕ್', 66: 'ತ್ತ', 67: 'ಲ್ಲ', 68: 'ಿಸ', 69: 'ಾರ', 70: 'ಗಳ', 71:

## Implementing WordPiece from scratch

Key challenges in traditional tokenization:

- A word-level vocabulary keeps growing as the corpus grows, because new word forms keep appearing.

- Technical terms and proper nouns create endless edge cases.

WordPiece solves this by breaking words into meaningful subunits.

Instead of treating "unbreakable" as a single unknown token, it breaks it down into recognizable pieces: ["un", "##break", "##able"].

The "##" prefix indicates that a token continues from the previous piece, preserving word boundaries and also enabling flexible decomposition.

This approach ensures that even completely new words can be represented through their constituent parts, which improves model robustness and generalization.
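
The "unbreakable" example can be reproduced with a small greedy longest-match tokenizer. This is a sketch under stated assumptions: the toy vocabulary, the `[UNK]` fallback, and the function name are illustrative, and the lab's `encode_wordpiece` differs in that it skips unmatched characters instead of emitting an unknown token.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # mark a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]                       # no known piece fits here
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"un", "##break", "##able", "##b"}
pieces = wordpiece_tokenize("unbreakable", toy_vocab)
unknown = wordpiece_tokenize("unzip", toy_vocab)
print(pieces, unknown)
```

Even though `##b` is in the toy vocabulary, the longer `##break` wins because the longest candidate is tried first.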

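The implementation below takes a data-driven approach to building its vocabulary. It uses a simplified, frequency-based merge rule (the same criterion as BPE); the original WordPiece algorithm instead scores each pair by its frequency divided by the product of the frequencies of its two parts.

- Initialize vocabulary with all individual characters

- Count frequency of all adjacent symbol pairs in the corpus

- Merge the most frequent pair into a single token

- Update the corpus with the new merged token

- Repeat until reaching desired vocabulary size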
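
For comparison, the likelihood-style score commonly attributed to the original WordPiece, `freq(pair) / (freq(first) * freq(second))`, can pick a different pair than raw frequency does. The sketch below uses a made-up English word list and an illustrative helper name; it is not the selection rule used in this lab's training loop.

```python
from collections import Counter

def pair_statistics(words):
    """Return (unit frequencies, pair frequencies) over '##'-marked characters."""
    unit_freq, pair_freq = Counter(), Counter()
    for word in words:
        units = [word[0]] + ["##" + ch for ch in word[1:]]
        unit_freq.update(units)
        pair_freq.update(zip(units, units[1:]))
    return unit_freq, pair_freq

# Toy corpus: each word repeated by its assumed frequency.
corpus = ["hug"] * 10 + ["pug"] * 5 + ["pun"] * 12 + ["bun"] * 4 + ["hugs"] * 5
unit_freq, pair_freq = pair_statistics(corpus)

# Score each pair by its frequency relative to the frequency of its parts.
scores = {
    pair: count / (unit_freq[pair[0]] * unit_freq[pair[1]])
    for pair, count in pair_freq.items()
}

most_frequent = max(pair_freq, key=pair_freq.get)
best_scored = max(scores, key=scores.get)
print(most_frequent, best_scored)
```

Frequency alone favours `("##u", "##g")`, while the score favours `("##g", "##s")`, because `##s` almost never appears without `##g` in front of it.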

In [22]:
from collections import Counter

# 1. Initialization
def initialize_vocab_wordpiece(text):
    vocab = {}
    words = text.split()
    unique_chars = set()
    for word in words:
        if not word: continue
        unique_chars.add(word[0])
        for char in word[1:]:
            unique_chars.add("##" + char)

    vocab = {i: token for i, token in enumerate(sorted(list(unique_chars)))}
    return vocab

In [23]:
# 2. Stats Calculation
# Note: this redefines the BPE get_stats above, adding a skip for the -1 word-boundary marker
def get_stats(ids):
    counts = Counter()
    for i in range(len(ids) - 1):
        # Skip word boundary markers
        if ids[i] == -1 or ids[i+1] == -1:
            continue
        counts[(ids[i], ids[i+1])] += 1
    return counts

In [24]:
def encode_wordpiece(text, vocab, inverse_vocab):
    text = unicodedata.normalize('NFKC', text)
    tokens = []
    words = text.split()
    for word in words:
        i = 0
        while i < len(word):
            matched = False
            # Try longest subword first (greedy matching)
            for j in range(len(word), i, -1):
                sub = word[i:j]
                if i > 0:
                    sub = "##" + sub
                if sub in inverse_vocab:
                    tokens.append(inverse_vocab[sub])
                    i = j
                    matched = True
                    break

            # If no subword matches, drop this character and move on
            # (it gets no token ID, so decoding cannot restore it)
            if not matched:
                i += 1

        # Add word boundary marker
        tokens.append(-1)
    return tokens

In [25]:
def train_wordpiece(text, vocab_size=VOCAB_SIZE):
    print(f"\nTraining WordPiece (Target Vocab: {vocab_size})")
    vocab = initialize_vocab_wordpiece(text)
    inverse_vocab = {v: k for k, v in vocab.items()}

    # Calculate initial IDs once
    ids = encode_wordpiece(text, vocab, inverse_vocab)
    """
    TODO: Implement the main WordPiece training loop:
    1.  In each iteration, calculate pair frequencies using `get_stats(ids)`.
    2.  Identify the most frequent pair. If no pairs exist, stop training.
    3.  Construct the new token string by combining the pair's tokens. (Note: If the second token starts with '##', remove the '##' prefix before concatenating).
    4.  If the new token is not already in the vocabulary, add it and assign a new ID.
    5.  Update the `ids` list by replacing all occurrences of the best pair with the new token ID using `merge_ids`.
    """
    while len(vocab) < vocab_size:
        stats = get_stats(ids)
        if not stats:
            break

        # 1. Find most frequent pair
        best_pair = max(stats, key=stats.get)

        # 2. Build new token
        tok1 = vocab[best_pair[0]]
        tok2 = vocab[best_pair[1]]

        # Remove ## from second token if present
        if tok2.startswith("##"):
            new_token = tok1 + tok2[2:]
        else:
            new_token = tok1 + tok2

        # 3. Add to vocabulary if new
        if new_token not in inverse_vocab:
            new_id = len(vocab)
            vocab[new_id] = new_token
            inverse_vocab[new_token] = new_id
        else:
            new_id = inverse_vocab[new_token]

        # 4. Merge ids using BPE-style merging
        ids = merge_ids(ids, best_pair, new_id)


    print(f"Training complete. Final vocab size: {len(vocab)}")
    return vocab, inverse_vocab

In [29]:
# 4. Decoding Function
def decode_wordpiece(ids, vocab):
    # Filter out boundary markers
    tokens = [vocab[i] for i in ids if i != -1]
    """
    TODO: Implement the decoding logic to convert token IDs back to text:
    1.  Convert the list of IDs into token strings using the vocabulary, ignoring any `-1` boundary markers.
    2.  Iterate through the token strings to reconstruct the sentence:
        - If a token starts with "##", append it to the text excluding the "##" prefix (merge with previous).
        - Otherwise, treat it as a new word: add a space before the token and append it.
    """
    text = ""
    for token in tokens:
        if token.startswith("##"):
            # Continuation of previous word
            text += token[2:]
        else:
            # New word, add space before
            text += " " + token

    return text.strip()

In [30]:
def wordpiece_output(text, lang):
    print("RAW TEXT: \n", text[:200])
    print("\nTokenization using WordPiece: (Subword tokenization)")

    text = unicodedata.normalize('NFKC', text)
    mid = len(text) // 2
    train_wp = text[:mid]
    test_wp = text[mid:]

    # Train
    vocab, inverse_vocab = train_wordpiece(train_wp, vocab_size=VOCAB_SIZE)
    print("Vocabulary:", vocab)
    # Encode using WordPiece encoder
    encoded = encode_wordpiece(test_wp, vocab, inverse_vocab)

    # Decode using the same vocabulary
    decoded = decode_wordpiece(encoded, vocab)

    print(f"Encoded: {encoded}")
    print(f"Decoded: {decoded}")

In [31]:
wordpiece_output(hindi_text, "Hindi")

RAW TEXT: 
 कभीकभी कोई फिल्म आपको बिल्कुल निशब्द कर देती है। कभी उसका असर इतना गहरा होता है कि शब्द ही नहीं मिलते। कभी आप इतने प्रभावित होते हैं कि उसके बारे में बात करना ही बंद नहीं कर पाते। और कभी वह आपके दिल क

Tokenization using WordPiece: (Subword tokenization)

Training WordPiece (Target Vocab: 500)
Training complete. Final vocab size: 500
Vocabulary: {0: '##ँ', 1: '##ं', 2: '##अ', 3: '##आ', 4: '##इ', 5: '##ई', 6: '##ए', 7: '##क', 8: '##ख', 9: '##ग', 10: '##घ', 11: '##च', 12: '##छ', 13: '##ज', 14: '##ट', 15: '##ठ', 16: '##ड', 17: '##ढ', 18: '##ण', 19: '##त', 20: '##थ', 21: '##द', 22: '##ध', 23: '##न', 24: '##प', 25: '##फ', 26: '##ब', 27: '##भ', 28: '##म', 29: '##य', 30: '##र', 31: '##ल', 32: '##व', 33: '##श', 34: '##ष', 35: '##स', 36: '##ह', 37: '##़', 38: '##ा', 39: '##ि', 40: '##ी', 41: '##ु', 42: '##ू', 43: '##ृ', 44: '##े', 45: '##ै', 46: '##ॉ', 47: '##ो', 48: '##ौ', 49: '##्', 50: '##।', 51: 'अ', 52: 'आ', 53: 'इ', 54: 'ई', 55: 'उ', 56: 'ऋ', 57: 'ए', 58: 'ऐ', 59: 'औ', 60: 'क'

In [32]:
wordpiece_output(kannada_text, "Kannada")

RAW TEXT: 
 ಕಾಂತಾರ ಚಾಪ್ಟರ್ ಬರೀ ಸಿನಿಮಾ ಅಲ್ಲ ಅದೊಂದು ಅನುಭೂತಿ ಅದನ್ನು ಕಥೆಗಾಗಿ ನೋಡಬಾರದು ಅದು ನೀಡುವ ರೋಮಾಂಚನಕ್ಕಾಗಿ ನೋಡಬೇಕು ಅದು ನೀಡುವ ದೃಶ್ಯವೈಭವ ಸೀಟಿನ ತುದಿಗೆ ತಂದು ಕೂರಿಸುವ ಥ್ರಿಲ್ಲಿಂಗ್ ಹೊಡೆದಾಟದ ದೃಶ್ಯಗಳು ಕಾಡಿನ ಮರೆಯಲ್ಲಿ ಹುದುಗಿಕ

Tokenization using WordPiece: (Subword tokenization)

Training WordPiece (Target Vocab: 500)
Training complete. Final vocab size: 500
Vocabulary: {0: '##ಂ', 1: '##ಃ', 2: '##ಕ', 3: '##ಖ', 4: '##ಗ', 5: '##ಚ', 6: '##ಜ', 7: '##ಟ', 8: '##ಡ', 9: '##ಣ', 10: '##ತ', 11: '##ಥ', 12: '##ದ', 13: '##ಧ', 14: '##ನ', 15: '##ಪ', 16: '##ಫ', 17: '##ಬ', 18: '##ಭ', 19: '##ಮ', 20: '##ಯ', 21: '##ರ', 22: '##ಲ', 23: '##ಳ', 24: '##ವ', 25: '##ಶ', 26: '##ಷ', 27: '##ಸ', 28: '##ಹ', 29: '##ಾ', 30: '##ಿ', 31: '##ೀ', 32: '##ು', 33: '##ೂ', 34: '##ೃ', 35: '##ೆ', 36: '##ೇ', 37: '##ೈ', 38: '##ೊ', 39: '##ೋ', 40: '##ೌ', 41: '##್', 42: 'ಅ', 43: 'ಆ', 44: 'ಇ', 45: 'ಈ', 46: 'ಉ', 47: 'ಊ', 48: 'ಎ', 49: 'ಏ', 50: 'ಐ', 51: 'ಒ', 52: 'ಓ', 53: 'ಕ', 54: 'ಗ', 55: 'ಚ', 56: 'ಜ', 57: 'ಟ', 58: 'ಡ', 59: 'ತ', 60: 'ಥ', 61: 'ದ', 62: 'ನ'

## Tokenization using pre-trained models

We will use the `transformers` library.

The transformers library (by Hugging Face) provides a unified API to download, configure, and use thousands of pre-trained models like BERT, GPT, RoBERTa, and Llama.

Instead of writing complex neural network architecture code from scratch (in PyTorch or TensorFlow), this library allows you to:

- Download weights.

- Tokenize: Automatically process text into numbers the way the model expects.

- Infer/Train: Run the model on new data.

In [34]:
from transformers import AutoTokenizer

def tokenize(text, lang_name):
    """
      Tokenizes Indic text with each pre-trained tokenizer in MODELS_TOKENIZERS and prints the results.
      Args:
        text (str): The input text string.
        lang_name (str): The name of the language.
      Prints:
        tokens (list): A list of tokens.
        ids (list): A list of token IDs.

    """


    print(f"\n Tokenizing for {lang_name}: ")
    print(f"Original Text: {text}")

    for name, model_path in MODELS_TOKENIZERS.items():
        try:
            """
            TODO:
            Syntax:
              - tokenizer = AutoTokenizer.from_pretrained(<model_name>)
              - tokens = tokenizer.tokenize(<text>)
              - ids = tokenizer.convert_tokens_to_ids(tokens)
            """
            tokenizer = AutoTokenizer.from_pretrained(model_path)
            tokens = tokenizer.tokenize(text)
            ids = tokenizer.convert_tokens_to_ids(tokens)

            print(f"\nModel: {name}")
            print(f"Tokens: {tokens}")
            print(f"IDs:    {ids}")

        except Exception as e:
            print(f"Error loading {name}: {e}")


In [35]:
#TOKENIZATION TESTS FOR A SAMPLE INPUT SENTENCE FOR HINDI AND KANNADA

# "ऋषभ शेट्टी ने मुंबई में कांतारा चैप्टर 1 का पोस्टर लॉन्च किया।"
sample_hindi_text = hindi_reports[0]
# "ನಟ ರಿಷಬ್ ಶೆಟ್ಟಿ ಅವರು ಕುಂದಾಪುರ ನಗರದಲ್ಲಿ ಚಿತ್ರೀಕರಣ ಆರಂಭಿಸಿದ್ದಾರೆ."
sample_kannada_text = kannada_reports[0]

# 1. HINDI
tokenize(sample_hindi_text, "Hindi")


 Tokenizing for Hindi: 
Original Text: ऋषभ शेट्टी ने मुंबई में कांतारा चैप्टर 1 का पोस्टर लॉन्च किया।



Model: BERT-BASE-MULTILINGUAL-CASED MODEL
Tokens: ['ऋ', '##ष', '##भ', 'श', '##ेट', '##्टी', 'ने', 'मुंबई', 'में', 'का', '##ंत', '##ारा', 'च', '##ै', '##प', '##्टर', '1', 'का', 'प', '##ो', '##स्ट', '##र', 'ल', '##ॉन', '##्', '##च', 'किया', '।']
IDs:    [857, 39765, 60270, 896, 39680, 84218, 13088, 65800, 10532, 11081, 24786, 54350, 870, 18438, 18187, 54071, 122, 11081, 885, 13718, 44611, 11549, 893, 69016, 20429, 16940, 13016, 920]



Model: XLM-ROBERTA-BASE MODEL
Tokens: ['▁ऋष', 'भ', '▁शे', 'ट्टी', '▁ने', '▁मुंबई', '▁में', '▁कां', 'ता', 'रा', '▁चै', 'प्ट', 'र', '▁1', '▁का', '▁पोस्ट', 'र', '▁लॉन्च', '▁किया', '।']
IDs:    [241861, 6576, 35993, 86644, 1142, 17360, 421, 115710, 1480, 2815, 115739, 104332, 1393, 106, 641, 15484, 1393, 126736, 4029, 125]



Model: INDICBERT MODEL
Tokens: ['ऋषभ', 'शेट्टी', 'ने', 'मुंबई', 'में', 'कांता', '##रा', 'चै', '##प्टर', '1', 'का', 'पोस्टर', 'लॉन्च', 'किया', '।']
IDs:    [69362, 53285, 15621, 18568, 15708, 197387, 15736, 20253, 45107, 62, 15711, 55007, 52883, 16837, 1408]


In [36]:
# 2. KANNADA
tokenize(sample_kannada_text, "Kannada")


 Tokenizing for Kannada: 
Original Text: ನಟ ರಿಷಬ್ ಶೆಟ್ಟಿ ಅವರು ಕುಂದಾಪುರ ನಗರದಲ್ಲಿ ಚಿತ್ರೀಕರಣ ಆರಂಭಿಸಿದ್ದಾರೆ.

Model: BERT-BASE-MULTILINGUAL-CASED MODEL
Tokens: ['ನ', '##ಟ', 'ರ', '##ಿ', '##ಷ', '##ಬ್', 'ಶ', '##ೆ', '##ಟ್ಟಿ', 'ಅವರು', 'ಕ', '##ು', '##ಂದ', '##ಾ', '##ಪು', '##ರ', 'ನ', '##ಗರ', '##ದಲ್ಲಿ', 'ಚ', '##ಿತ್ರ', '##ೀ', '##ಕರ', '##ಣ', 'ಆ', '##ರ', '##ಂ', '##ಭ', '##ಿಸಿದ', '##್ದಾರೆ', '.']
IDs:    [1281, 36312, 1288, 13232, 107724, 73513, 1292, 13833, 83301, 22397, 1263, 14284, 38114, 14921, 89856, 14060, 1281, 94352, 17886, 1267, 64491, 35461, 75074, 18409, 1251, 14060, 26521, 111364, 61364, 41661, 119]

Model: XLM-ROBERTA-BASE MODEL
Tokens: ['▁ನಟ', '▁ರಿ', 'ಷ', 'ಬ್', '▁ಶೆಟ್ಟಿ', '▁ಅವರು', '▁ಕು', 'ಂದ', 'ಾಪುರ', '▁ನಗರ', 'ದಲ್ಲಿ', '▁ಚಿತ್ರ', 'ೀಕರಣ', '▁ಆರಂಭಿಸ', 'ಿದ್ದಾರೆ', '.']
IDs:    [54669, 90802, 20776, 23599, 147032, 11759, 28504, 23454, 210783, 49490, 2908, 16738, 103359, 190433, 11367, 5]

Model: INDICBERT MODEL
Tokens: ['ನ', '##ಟ', 'ರ', '##ಿ', '##ಷ', '##ಬ', '##್', 'ಶ', '##ೆ', '##ಟ', '##್', '##ಟ', 

In [37]:
#TOKENIZATION TESTS FOR THE WHOLE HINDI AND KANNADA CORPUS

def pre_trained_models_output(text, lang):
    print("Tokenization using pre-trained models:")
    # tokenize() prints its results and returns None, so there is nothing to capture
    tokenize(text, lang)

pre_trained_models_output(hindi_text, "Hindi")

Tokenization using pre-trained models:

 Tokenizing for Hindi: 
Original Text: कभीकभी कोई फिल्म आपको बिल्कुल निशब्द कर देती है। कभी उसका असर इतना गहरा होता है कि शब्द ही नहीं मिलते। कभी आप इतने प्रभावित होते हैं कि उसके बारे में बात करना ही बंद नहीं कर पाते। और कभी वह आपके दिल को इतना छू जाती है कि आप बस उसकी जादुई दुनिया में खो जाते हैं।
लेकिन जब कोई फिल्म ये सब एक साथ कर दिखाती है तो वह सिर्फ एक मास्टरपीस नहीं रह जाती वह एक सांस्कृतिक घटना बन जाती है। ऋषभ शेट्टी की कांतारा चैप्टर ठीक ऐसा ही असर छोड़ती है।
फिल्म की शुरुआत कदंब वंश और उसके क्रूर शासक से होती है जिसकी लालच हर ज़मीन और पानी को कब्ज़े में लेने की है। चाहे आदमी हो औरत या बच्चा उसके लिए कोई मायने नहीं। वह सबको मारकर अपनी हुकूमत फैलाता है।
एक बार ऐसे ही अभियान के दौरान वह समुद्र किनारे मछली पकड़ते एक रहस्यमयी बूढ़े आदमी को देखता है। अपने सैनिकों को उसे पकड़ने का आदेश देता है। जैसे ही वे उसे खींचकर ले जाते हैं उसके थैले से कीमती सामान गिरते हैं।
शासक उन चीज़ों को देखता है और उनके स्रोत की खोज में निकल पड़ता है। यह सफ़र उसे का

Token indices sequence length is longer than the specified maximum sequence length for this model (3786 > 512). Running this sequence through the model will result in indexing errors



Model: BERT-BASE-MULTILINGUAL-CASED MODEL
Tokens: ['कभी', '##क', '##भी', 'कोई', 'फिल्म', 'आ', '##प', '##को', 'ब', '##िल', '##्क', '##ुल', 'न', '##िश', '##ब', '##्द', 'कर', 'दे', '##ती', 'है', '।', 'कभी', 'उसका', 'अ', '##सर', 'इ', '##तन', '##ा', 'ग', '##हर', '##ा', 'होता', 'है', 'कि', 'शब्द', 'ही', 'नहीं', 'म', '##िल', '##ते', '।', 'कभी', 'आ', '##प', 'इ', '##तन', '##े', 'प्रभावित', 'होते', 'हैं', 'कि', 'उसके', 'बारे', 'में', 'बात', 'करना', 'ही', 'ब', '##ंद', 'नहीं', 'कर', 'प', '##ात', '##े', '।', 'और', 'कभी', 'वह', 'आ', '##प', '##के', 'द', '##िल', 'को', 'इ', '##तन', '##ा', 'छ', '##ू', 'जाती', 'है', 'कि', 'आ', '##प', 'ब', '##स', 'उसकी', 'जा', '##द', '##ु', '##ई', 'दुनिया', 'में', 'ख', '##ो', 'जाते', 'हैं', '।', 'लेकिन', 'जब', 'कोई', 'फिल्म', 'ये', 'सब', 'एक', 'साथ', 'कर', 'द', '##िखा', '##ती', 'है', 'तो', 'वह', 'स', '##िर', '##्फ', 'एक', 'मा', '##स्ट', '##र', '##पी', '##स', 'नहीं', 'र', '##ह', 'जाती', 'वह', 'एक', 'सांस्कृतिक', 'घ', '##टना', 'बन', 'जाती', 'है', '।', 'ऋ', '##ष', '##भ', 'श

Token indices sequence length is longer than the specified maximum sequence length for this model (2717 > 512). Running this sequence through the model will result in indexing errors



Model: XLM-ROBERTA-BASE MODEL
Tokens: ['▁कभी', 'क', 'भी', '▁कोई', '▁फिल्म', '▁आपको', '▁बिल्कुल', '▁नि', 'शब्द', '▁कर', '▁देती', '▁है', '।', '▁कभी', '▁उसका', '▁असर', '▁इतना', '▁ग', 'हरा', '▁होता', '▁है', '▁कि', '▁शब्द', '▁ही', '▁नहीं', '▁मिल', 'ते', '।', '▁कभी', '▁आप', '▁इतने', '▁प्रभावित', '▁होते', '▁हैं', '▁कि', '▁उसके', '▁बारे', '▁में', '▁बात', '▁करना', '▁ही', '▁बंद', '▁नहीं', '▁कर', '▁पा', 'ते', '।', '▁और', '▁कभी', '▁वह', '▁आपके', '▁दिल', '▁को', '▁इतना', '▁छ', 'ू', '▁जाती', '▁है', '▁कि', '▁आप', '▁बस', '▁उसकी', '▁जा', 'दु', 'ई', '▁दुनिया', '▁में', '▁खो', '▁जाते', '▁हैं', '।', '▁लेकिन', '▁जब', '▁कोई', '▁फिल्म', '▁ये', '▁सब', '▁एक', '▁साथ', '▁कर', '▁दिखा', 'ती', '▁है', '▁तो', '▁वह', '▁सिर्फ', '▁एक', '▁मास्टर', 'पी', 'स', '▁नहीं', '▁रह', '▁जाती', '▁वह', '▁एक', '▁सांस्कृतिक', '▁घटना', '▁बन', '▁जाती', '▁है', '।', '▁ऋष', 'भ', '▁शे', 'ट्टी', '▁की', '▁कां', 'ता', 'रा', '▁चै', 'प्ट', 'र', '▁ठीक', '▁ऐसा', '▁ही', '▁असर', '▁छोड़', 'ती', '▁है', '।', '▁फिल्म', '▁की', '▁शुरुआत', '▁क', 'द', 'ंब', '

In [38]:
pre_trained_models_output(kannada_text, "Kannada")

Tokenization using pre-trained models:

 Tokenizing for Kannada: 
Original Text: ಕಾಂತಾರ ಚಾಪ್ಟರ್ ಬರೀ ಸಿನಿಮಾ ಅಲ್ಲ ಅದೊಂದು ಅನುಭೂತಿ ಅದನ್ನು ಕಥೆಗಾಗಿ ನೋಡಬಾರದು ಅದು ನೀಡುವ ರೋಮಾಂಚನಕ್ಕಾಗಿ ನೋಡಬೇಕು ಅದು ನೀಡುವ ದೃಶ್ಯವೈಭವ ಸೀಟಿನ ತುದಿಗೆ ತಂದು ಕೂರಿಸುವ ಥ್ರಿಲ್ಲಿಂಗ್ ಹೊಡೆದಾಟದ ದೃಶ್ಯಗಳು ಕಾಡಿನ ಮರೆಯಲ್ಲಿ ಹುದುಗಿಕೊಂಡ ರಹಸ್ಯಗಳು ತುಳುನಾಡಿನ ದೈವಗಳು ಮತ್ತು ಅದರಿಂದ ಕಾಯಲ್ಪಡುವ ಮನುಷ್ಯಲೋಕದ ಆಟ ಹೋರಾಟಗಳ ಭಾವುಕ ರಮ್ಯ ಲೋಕದ ಚಿತ್ರಣಕ್ಕಾಗಿ ಇದನ್ನು ನೋಡಬೇಕು ದೊಡ್ಡ ತೆರೆಯಲ್ಲಿ ನೋಡಿದರೆ ಮಾತ್ರವೇ ಈ ದೃಶ್ಯ ವೈಭವದ ನೈಜ ಸಾಕ್ಷಾತ್ಕಾರ ಸಾಧ್ಯ
ಕಥೆಯಲ್ಲಿ ಹೊಸತೇನಿಲ್ಲ ಅದು ಒಳಿತು ಕೆಡುಕುಗಳ ಸಮರ ಒಳಿತಿನ ವಿಜಯ ಕೇಡಿನ ಶಕ್ತಿಗಳ ವಿರುದ್ಧ ಸಜ್ಜನರ ಮನುಷ್ಯರು ನಂಬಿದ ದೈವಗಳ ವಿಜಯ ಆರಂಭದಲ್ಲಿಯೇ ತುಳುನಾಡಿಗೆ ಕೈಲಾಸದಿಂದ ಅವತರಿಸುವ ಶಿವಗಣಗಳು ದೈವವಾಗಿ ನಾಡನ್ನು ಕಾಯುತ್ತವೆ ಇದರ ನಡುವೆಯೂ ದೈವವನ್ನು ಬಂಧಿಸಲು ಯತ್ನಿಸುವ ದುರ್ಜನರು ಇದ್ದಾರೆ ಈ ದುರ್ಜನರನ್ನು ಮಟ್ಟಹಾಕಲು ಮನುಷ್ಯಶಕ್ತಿಯೂ ದೈವಶಕ್ತಿಯೂ ಕೈ ಜೋಡಿಸಬೇಕಾಗುತ್ತದೆ ರಿಷಬ್ ಶೆಟ್ಟಿಯ ಬೆರ್ಮೆ ಪಾತ್ರದಲ್ಲಿ ಇವೆರಡೂ ಜೋಡಿಯಾಗಿ ನಮ್ಮನ್ನು ರೋಮಾಂಚಿತಗೊಳಿಸುತ್ತವೆ
ತುಳುನಾಡನ್ನು ಬಂಗ್ರ ಅರಸರು ಆಳುತ್ತಿದ್ದಾರೆ ಬಂಗ್ರದ ಅರಸರ ಹೊಸ ರಾಜಕುಮಾರನಿಗೂ ಕಾಂತಾರದ ಕಾನನ ನಿವಾಸಿಗಳಿಗೂ ಇಕ್ಕಟ್ಟು ಬಿಕ್ಕಟ್ಟುಗಳು ತಲೆದೋರುತ್ತವೆ ಇದನ್ನು ಪರಿಹರಿಸಲು 

Token indices sequence length is longer than the specified maximum sequence length for this model (4407 > 512). Running this sequence through the model will result in indexing errors



Model: BERT-BASE-MULTILINGUAL-CASED MODEL
Tokens: ['ಕ', '##ಾ', '##ಂತ', '##ಾರ', 'ಚ', '##ಾ', '##ಪ್', '##ಟರ್', 'ಬ', '##ರ', '##ೀ', 'ಸಿ', '##ನಿ', '##ಮ', '##ಾ', 'ಅ', '##ಲ್ಲ', 'ಅ', '##ದ', '##ೊಂದು', 'ಅ', '##ನು', '##ಭ', '##ೂ', '##ತಿ', 'ಅದನ್ನು', 'ಕ', '##ಥ', '##ೆ', '##ಗಾಗಿ', 'ನ', '##ೋ', '##ಡ', '##ಬ', '##ಾರದ', '##ು', 'ಅದು', 'ನ', '##ೀ', '##ಡುವ', 'ರ', '##ೋ', '##ಮ', '##ಾ', '##ಂ', '##ಚ', '##ನ', '##ಕ್ಕಾಗಿ', 'ನ', '##ೋ', '##ಡ', '##ಬೇಕು', 'ಅದು', 'ನ', '##ೀ', '##ಡುವ', 'ದ', '##ೃ', '##ಶ್', '##ಯ', '##ವ', '##ೈ', '##ಭ', '##ವ', 'ಸ', '##ೀ', '##ಟಿ', '##ನ', 'ತ', '##ು', '##ದಿ', '##ಗೆ', 'ತ', '##ಂದು', 'ಕ', '##ೂ', '##ರಿ', '##ಸುವ', 'ಥ', '##್ರಿ', '##ಲ್ಲಿ', '##ಂಗ್', 'ಹ', '##ೊ', '##ಡೆದ', '##ಾ', '##ಟದ', 'ದ', '##ೃ', '##ಶ್', '##ಯ', '##ಗಳು', 'ಕ', '##ಾ', '##ಡಿ', '##ನ', 'ಮ', '##ರೆ', '##ಯಲ್ಲಿ', 'ಹ', '##ುದು', '##ಗಿ', '##ಕ', '##ೊಂಡ', 'ರ', '##ಹ', '##ಸ್', '##ಯ', '##ಗಳು', 'ತ', '##ು', '##ಳು', '##ನಾ', '##ಡಿ', '##ನ', 'ದ', '##ೈ', '##ವ', '##ಗಳು', 'ಮತ್ತು', 'ಅದರ', '##ಿಂದ', 'ಕ', '##ಾಯ', '##ಲ್ಪ', '##ಡುವ', 'ಮ', '##ನು', '##ಷ್', '##ಯ', '##ಲ', '##

Token indices sequence length is longer than the specified maximum sequence length for this model (2705 > 512). Running this sequence through the model will result in indexing errors



Model: XLM-ROBERTA-BASE MODEL
Tokens: ['▁', 'ಕಾಂ', 'ತಾರ', '▁ಚಾ', 'ಪ್', 'ಟರ್', '▁ಬರ', 'ೀ', '▁ಸಿನಿಮಾ', '▁ಅಲ್ಲ', '▁ಅ', 'ದೊಂದು', '▁ಅನು', 'ಭೂತ', 'ಿ', '▁ಅದನ್ನು', '▁ಕಥೆ', 'ಗಾಗಿ', '▁ನೋಡ', 'ಬಾರದು', '▁ಅದು', '▁ನೀಡುವ', '▁', 'ರೋ', 'ಮಾ', 'ಂಚ', 'ನ', 'ಕ್ಕಾಗಿ', '▁ನೋಡ', 'ಬೇಕು', '▁ಅದು', '▁ನೀಡುವ', '▁ದೃಶ್ಯ', 'ವೈ', 'ಭವ', '▁ಸೀ', 'ಟಿ', 'ನ', '▁ತು', 'ದ', 'ಿಗೆ', '▁ತಂದು', '▁ಕ', 'ೂರ', 'ಿಸುವ', '▁', 'ಥ್', 'ರಿ', 'ಲ್ಲಿ', 'ಂಗ್', '▁ಹೊಡೆದ', 'ಾಟ', 'ದ', '▁ದೃಶ್ಯ', 'ಗಳು', '▁ಕಾಡ', 'ಿನ', '▁ಮರೆಯ', 'ಲ್ಲಿ', '▁ಹು', 'ದು', 'ಗಿ', 'ಕೊಂಡ', '▁ರ', 'ಹ', 'ಸ್', 'ಯ', 'ಗಳು', '▁ತು', 'ಳು', 'ನಾಡ', 'ಿನ', '▁ದ', 'ೈ', 'ವ', 'ಗಳು', '▁ಮತ್ತು', '▁ಅದರ', 'ಿಂದ', '▁ಕಾಯ', 'ಲ್', 'ಪ', 'ಡುವ', '▁ಮನುಷ್ಯ', 'ಲೋಕ', 'ದ', '▁ಆಟ', '▁ಹೋರಾಟ', 'ಗಳ', '▁ಭಾವ', 'ು', 'ಕ', '▁ರ', 'ಮ್', 'ಯ', '▁ಲೋಕ', 'ದ', '▁ಚಿತ್ರ', 'ಣ', 'ಕ್ಕಾಗಿ', '▁ಇದನ್ನು', '▁ನೋಡ', 'ಬೇಕು', '▁ದೊಡ್ಡ', '▁ತೆರೆ', 'ಯಲ್ಲಿ', '▁ನೋಡಿದ', 'ರೆ', '▁ಮಾತ್ರ', 'ವೇ', '▁ಈ', '▁ದೃಶ್ಯ', '▁ವೈ', 'ಭವ', 'ದ', '▁ನೈ', 'ಜ', '▁ಸಾ', 'ಕ್ಷ', 'ಾ', 'ತ್', 'ಕಾರ', '▁ಸಾಧ್ಯ', '▁ಕಥೆ', 'ಯಲ್ಲಿ', '▁ಹೊಸ', 'ತೇ', 'ನ', 'ಿಲ್ಲ', '▁ಅದು', '▁ಒಳ', 'ಿತು', '▁ಕೆ', 'ಡು', 'ಕ
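
The "Token indices sequence length is longer than the specified maximum sequence length" warnings in the outputs above mean the full corpus cannot be fed to these models in one pass. One common workaround is to split the ID sequence into windows that fit the limit. The sketch below is an illustration only: `chunk_ids`, `max_len`, and `stride` are hypothetical names, and a real pipeline would also need to reserve positions for special tokens such as `[CLS]` and `[SEP]`.

```python
def chunk_ids(ids, max_len=512, stride=0):
    """Split a list of token IDs into windows of at most `max_len` items.

    `stride` is the number of IDs shared between consecutive windows.
    """
    if max_len <= 0 or not 0 <= stride < max_len:
        raise ValueError("need max_len > 0 and 0 <= stride < max_len")
    if not ids:
        return []
    step = max_len - stride
    chunks, start = [], 0
    while True:
        chunks.append(ids[start:start + max_len])
        if start + max_len >= len(ids):  # this window reached the end
            break
        start += step
    return chunks

# Stand-in for the 3786 IDs reported in the first warning above
fake_ids = list(range(3786))
windows = chunk_ids(fake_ids, max_len=512)
print(len(windows), [len(w) for w in windows])
```

Overlapping windows (a non-zero `stride`) trade extra computation for context around each window boundary, which can matter for entity recognition when a name straddles two windows.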

## Tokenization using IndicNLP library
- Word Tokenization and Detokenization
- Sentence Splitting

In [41]:
from indicnlp.tokenize import sentence_tokenize
from indicnlp.tokenize import indic_tokenize
from indicnlp.tokenize import indic_detokenize

def sentence_tokenization(text, lang_code):
    """
    Splits a paragraph into individual sentences.
    Args:
        text (str): The input text.
        lang_code (str): 'hi', 'ta', 'te', 'kn'. (Depending on the language)
    Return:
        list: A list of sentence strings.
    TODO:
    Syntax:
    sentence_tokenize.sentence_split(text, lang=lang_code)
    """

    sentences = sentence_tokenize.sentence_split(text, lang=lang_code)
    return sentences

def word_tokenization(sentence, lang_code):
    """
    Splits a sentence into tokens.
    Args:
        sentence (str): A single sentence string.
        lang_code (str): Language code.
    Return:
        list: A list of individual tokens.

    TODO:
    Syntax:
    indic_tokenize.trivial_tokenize(sentence, lang=lang_code)
    """
    tokens = indic_tokenize.trivial_tokenize(sentence, lang=lang_code)
    return tokens

def detokenization(tokens, lang_code):
    """
    Reconstructs a sentence from tokens, fixing punctuation spacing.
    Args:
        tokens (list): A list of token strings.
        lang_code (str): Language code.
    Return:
        str: The reconstructed text, truncated to its first 1000 characters.

    TODO:
    Syntax:
    indic_detokenize.trivial_detokenize(tokens, lang=lang_code)
    """
    tokenized_text = ' '.join(tokens)
    text = indic_detokenize.trivial_detokenize(tokenized_text, lang=lang_code)
    return text[:1000]

In [42]:
def indic_nlp_output(text, lang):
    print("\nTokenization using IndicNLP's word tokenization: ")
    word_tokens = word_tokenization(text, lang)
    print(word_tokens)


    print("\nTokenization using IndicNLP's sentence splitting: ")
    sentences = sentence_tokenization(text, lang)
    print(sentences)


    print("\nTokenization using IndicNLP's detokenization: ")
    detokenized_text = detokenization(word_tokens, lang)
    print(detokenized_text)

print("HINDI TEXT INDIC NLP OUTPUTS: ")
indic_nlp_output(hindi_text, "hi")  # IndicNLP expects the language code, not the language name

HINDI TEXT INDIC NLP OUTPUTS: 

Tokenization using IndicNLP's word tokenization: 
['कभीकभी', 'कोई', 'फिल्म', 'आपको', 'बिल्कुल', 'निशब्द', 'कर', 'देती', 'है', '।', 'कभी', 'उसका', 'असर', 'इतना', 'गहरा', 'होता', 'है', 'कि', 'शब्द', 'ही', 'नहीं', 'मिलते', '।', 'कभी', 'आप', 'इतने', 'प्रभावित', 'होते', 'हैं', 'कि', 'उसके', 'बारे', 'में', 'बात', 'करना', 'ही', 'बंद', 'नहीं', 'कर', 'पाते', '।', 'और', 'कभी', 'वह', 'आपके', 'दिल', 'को', 'इतना', 'छू', 'जाती', 'है', 'कि', 'आप', 'बस', 'उसकी', 'जादुई', 'दुनिया', 'में', 'खो', 'जाते', 'हैं', '।', '\nलेकिन', 'जब', 'कोई', 'फिल्म', 'ये', 'सब', 'एक', 'साथ', 'कर', 'दिखाती', 'है', 'तो', 'वह', 'सिर्फ', 'एक', 'मास्टरपीस', 'नहीं', 'रह', 'जाती', 'वह', 'एक', 'सांस्कृतिक', 'घटना', 'बन', 'जाती', 'है', '।', 'ऋषभ', 'शेट्टी', 'की', 'कांतारा', 'चैप्टर', 'ठीक', 'ऐसा', 'ही', 'असर', 'छोड़ती', 'है', '।', '\nफिल्म', 'की', 'शुरुआत', 'कदंब', 'वंश', 'और', 'उसके', 'क्रूर', 'शासक', 'से', 'होती', 'है', 'जिसकी', 'लालच', 'हर', 'ज़मीन', 'और', 'पानी', 'को', 'कब्ज़े', 'में', 'लेने', 'क

In [43]:
print("KANNADA TEXT INDIC NLP OUTPUTS: ")
indic_nlp_output(kannada_text, "kn")  # IndicNLP expects the language code, not the language name

KANNADA TEXT INDIC NLP OUTPUTS: 

Tokenization using IndicNLP's word tokenization: 
['ಕಾಂತಾರ', 'ಚಾಪ್ಟರ್', 'ಬರೀ', 'ಸಿನಿಮಾ', 'ಅಲ್ಲ', 'ಅದೊಂದು', 'ಅನುಭೂತಿ', 'ಅದನ್ನು', 'ಕಥೆಗಾಗಿ', 'ನೋಡಬಾರದು', 'ಅದು', 'ನೀಡುವ', 'ರೋಮಾಂಚನಕ್ಕಾಗಿ', 'ನೋಡಬೇಕು', 'ಅದು', 'ನೀಡುವ', 'ದೃಶ್ಯವೈಭವ', 'ಸೀಟಿನ', 'ತುದಿಗೆ', 'ತಂದು', 'ಕೂರಿಸುವ', 'ಥ್ರಿಲ್ಲಿಂಗ್', 'ಹೊಡೆದಾಟದ', 'ದೃಶ್ಯಗಳು', 'ಕಾಡಿನ', 'ಮರೆಯಲ್ಲಿ', 'ಹುದುಗಿಕೊಂಡ', 'ರಹಸ್ಯಗಳು', 'ತುಳುನಾಡಿನ', 'ದೈವಗಳು', 'ಮತ್ತು', 'ಅದರಿಂದ', 'ಕಾಯಲ್ಪಡುವ', 'ಮನುಷ್ಯಲೋಕದ', 'ಆಟ', 'ಹೋರಾಟಗಳ', 'ಭಾವುಕ', 'ರಮ್ಯ', 'ಲೋಕದ', 'ಚಿತ್ರಣಕ್ಕಾಗಿ', 'ಇದನ್ನು', 'ನೋಡಬೇಕು', 'ದೊಡ್ಡ', 'ತೆರೆಯಲ್ಲಿ', 'ನೋಡಿದರೆ', 'ಮಾತ್ರವೇ', 'ಈ', 'ದೃಶ್ಯ', 'ವೈಭವದ', 'ನೈಜ', 'ಸಾಕ್ಷಾತ್ಕಾರ', 'ಸಾಧ್ಯ\nಕಥೆಯಲ್ಲಿ', 'ಹೊಸತೇನಿಲ್ಲ', 'ಅದು', 'ಒಳಿತು', 'ಕೆಡುಕುಗಳ', 'ಸಮರ', 'ಒಳಿತಿನ', 'ವಿಜಯ', 'ಕೇಡಿನ', 'ಶಕ್ತಿಗಳ', 'ವಿರುದ್ಧ', 'ಸಜ್ಜನರ', 'ಮನುಷ್ಯರು', 'ನಂಬಿದ', 'ದೈವಗಳ', 'ವಿಜಯ', 'ಆರಂಭದಲ್ಲಿಯೇ', 'ತುಳುನಾಡಿಗೆ', 'ಕೈಲಾಸದಿಂದ', 'ಅವತರಿಸುವ', 'ಶಿವಗಣಗಳು', 'ದೈವವಾಗಿ', 'ನಾಡನ್ನು', 'ಕಾಯುತ್ತವೆ', 'ಇದರ', 'ನಡುವೆಯೂ', 'ದೈವವನ್ನು', 'ಬಂಧಿಸಲು', 'ಯತ್ನಿಸುವ', 'ದುರ್ಜನರು', 'ಇದ್ದಾರೆ', 'ಈ', 'ದುರ್ಜನರನ್ನು', 'ಮಟ್ಟಹಾಕಲು', 'ಮನುಷ್

## N-grams

An N-gram is a contiguous sequence of N items, such as words or characters, taken from text or speech.

The items can be letters, words, or base pairs, depending on the application.

The value of N determines the order of the N-gram.

N-grams are named according to the value of N:

- Unigrams (1-grams) are single words
- Bigrams (2-grams) are pairs of consecutive words
- Trigrams (3-grams) are triplets of consecutive words
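
The definition above applies to any sequence, so the same sliding window yields word n-grams from a token list and character n-grams from a string. A minimal sketch, where the helper name `ngrams` and the placeholder tokens are chosen for this example:

```python
def ngrams(items, n):
    """Return all contiguous windows of length n as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

words = ["w1", "w2", "w3", "w4"]      # placeholder tokens
word_bigrams = ngrams(words, 2)       # word-level bigrams
word_trigrams = ngrams(words, 3)      # word-level trigrams
char_bigrams = ngrams("abc", 2)       # character-level bigrams

print(word_bigrams)
print(word_trigrams)
print(char_bigrams)
```

A sequence of length L produces L - n + 1 n-grams, and none at all when n exceeds L.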

In [44]:
import string

def generate_ngrams(text, lang, n):
    """
    Generates n-grams from text.
    n = 1 -> Unigram
    n = 2 -> Bigram
    n = 3 -> Trigram
    """

    # Step 1: Tokenize the text into words using IndicNLP word tokenizer
    # This gives us a list like: ['कभीकभी', 'कोई', 'फिल्म', ...]
    words = word_tokenization(text, lang)

    # Filter out empty tokens and ASCII punctuation marks.
    # Note: string.punctuation is ASCII-only, so the danda '।' is not removed here.
    words = [w for w in words if w not in string.punctuation and w.strip() != ""]

    """
    Step 2: Generate N-grams using a sliding window approach

    For example, if words = [w1, w2, w3, w4] and n = 2 (bigram):
    Windows will be:
        [w1, w2]
        [w2, w3]
        [w3, w4]

    If n = 3 (trigram):
        [w1, w2, w3]
        [w2, w3, w4]
    """

    ngrams = []

    # Sliding window over the word list
    for i in range(len(words) - n + 1):
        # Take n consecutive words starting at position i
        ngram = words[i:i + n]
        ngrams.append(ngram)

    # Step 3: Join tokens inside each n-gram with space
    # Example: ['कभीकभी', 'कोई'] -> "कभीकभी कोई"
    return [" ".join(ngram) for ngram in ngrams]


In [45]:
def n_grams_output(text, lang):
    print("N-Grams (Unigram, Bigram and Trigram): \n")
    unigrams = generate_ngrams(text, lang, 1)
    bigrams = generate_ngrams(text,lang,  2)
    trigrams = generate_ngrams(text,lang, 3)

    print(f"Unigrams ({len(unigrams)}): {unigrams}")
    print(f"Bigrams  ({len(bigrams)}):  {bigrams}")
    print(f"Trigrams ({len(trigrams)}): {trigrams}")

print("HINDI TEXT N-GRAMS OUTPUTS: \n")
n_grams_output(hindi_text, "hi")


HINDI TEXT N-GRAMS OUTPUTS: 

N-Grams (Unigram, Bigram and Trigram): 

Unigrams (1948): ['कभीकभी', 'कोई', 'फिल्म', 'आपको', 'बिल्कुल', 'निशब्द', 'कर', 'देती', 'है', '।', 'कभी', 'उसका', 'असर', 'इतना', 'गहरा', 'होता', 'है', 'कि', 'शब्द', 'ही', 'नहीं', 'मिलते', '।', 'कभी', 'आप', 'इतने', 'प्रभावित', 'होते', 'हैं', 'कि', 'उसके', 'बारे', 'में', 'बात', 'करना', 'ही', 'बंद', 'नहीं', 'कर', 'पाते', '।', 'और', 'कभी', 'वह', 'आपके', 'दिल', 'को', 'इतना', 'छू', 'जाती', 'है', 'कि', 'आप', 'बस', 'उसकी', 'जादुई', 'दुनिया', 'में', 'खो', 'जाते', 'हैं', '।', '\nलेकिन', 'जब', 'कोई', 'फिल्म', 'ये', 'सब', 'एक', 'साथ', 'कर', 'दिखाती', 'है', 'तो', 'वह', 'सिर्फ', 'एक', 'मास्टरपीस', 'नहीं', 'रह', 'जाती', 'वह', 'एक', 'सांस्कृतिक', 'घटना', 'बन', 'जाती', 'है', '।', 'ऋषभ', 'शेट्टी', 'की', 'कांतारा', 'चैप्टर', 'ठीक', 'ऐसा', 'ही', 'असर', 'छोड़ती', 'है', '।', '\nफिल्म', 'की', 'शुरुआत', 'कदंब', 'वंश', 'और', 'उसके', 'क्रूर', 'शासक', 'से', 'होती', 'है', 'जिसकी', 'लालच', 'हर', 'ज़मीन', 'और', 'पानी', 'को', 'कब्ज़े', 'में', 'लेन

In [46]:
print("KANNADA TEXT N-GRAMS OUTPUTS: \n")
n_grams_output(kannada_text, "kn")

KANNADA TEXT N-GRAMS OUTPUTS: 

N-Grams (Unigram, Bigram and Trigram): 

Unigrams (1078): ['ಕಾಂತಾರ', 'ಚಾಪ್ಟರ್', 'ಬರೀ', 'ಸಿನಿಮಾ', 'ಅಲ್ಲ', 'ಅದೊಂದು', 'ಅನುಭೂತಿ', 'ಅದನ್ನು', 'ಕಥೆಗಾಗಿ', 'ನೋಡಬಾರದು', 'ಅದು', 'ನೀಡುವ', 'ರೋಮಾಂಚನಕ್ಕಾಗಿ', 'ನೋಡಬೇಕು', 'ಅದು', 'ನೀಡುವ', 'ದೃಶ್ಯವೈಭವ', 'ಸೀಟಿನ', 'ತುದಿಗೆ', 'ತಂದು', 'ಕೂರಿಸುವ', 'ಥ್ರಿಲ್ಲಿಂಗ್', 'ಹೊಡೆದಾಟದ', 'ದೃಶ್ಯಗಳು', 'ಕಾಡಿನ', 'ಮರೆಯಲ್ಲಿ', 'ಹುದುಗಿಕೊಂಡ', 'ರಹಸ್ಯಗಳು', 'ತುಳುನಾಡಿನ', 'ದೈವಗಳು', 'ಮತ್ತು', 'ಅದರಿಂದ', 'ಕಾಯಲ್ಪಡುವ', 'ಮನುಷ್ಯಲೋಕದ', 'ಆಟ', 'ಹೋರಾಟಗಳ', 'ಭಾವುಕ', 'ರಮ್ಯ', 'ಲೋಕದ', 'ಚಿತ್ರಣಕ್ಕಾಗಿ', 'ಇದನ್ನು', 'ನೋಡಬೇಕು', 'ದೊಡ್ಡ', 'ತೆರೆಯಲ್ಲಿ', 'ನೋಡಿದರೆ', 'ಮಾತ್ರವೇ', 'ಈ', 'ದೃಶ್ಯ', 'ವೈಭವದ', 'ನೈಜ', 'ಸಾಕ್ಷಾತ್ಕಾರ', 'ಸಾಧ್ಯ\nಕಥೆಯಲ್ಲಿ', 'ಹೊಸತೇನಿಲ್ಲ', 'ಅದು', 'ಒಳಿತು', 'ಕೆಡುಕುಗಳ', 'ಸಮರ', 'ಒಳಿತಿನ', 'ವಿಜಯ', 'ಕೇಡಿನ', 'ಶಕ್ತಿಗಳ', 'ವಿರುದ್ಧ', 'ಸಜ್ಜನರ', 'ಮನುಷ್ಯರು', 'ನಂಬಿದ', 'ದೈವಗಳ', 'ವಿಜಯ', 'ಆರಂಭದಲ್ಲಿಯೇ', 'ತುಳುನಾಡಿಗೆ', 'ಕೈಲಾಸದಿಂದ', 'ಅವತರಿಸುವ', 'ಶಿವಗಣಗಳು', 'ದೈವವಾಗಿ', 'ನಾಡನ್ನು', 'ಕಾಯುತ್ತವೆ', 'ಇದರ', 'ನಡುವೆಯೂ', 'ದೈವವನ್ನು', 'ಬಂಧಿಸಲು', 'ಯತ್ನಿಸುವ', 'ದುರ್ಜನರು', 'ಇದ್ದಾರೆ', 'ಈ', 'ದುರ್ಜನರನ್ನು', 'ಮಟ್ಟಹಾಕಲು', 

# PHASE 3: MORPHOLOGICAL ANALYSIS

## Stopword Removal

Stop words are commonly occurring words in a language, such as "the", "and", and "a" in English.

Hindi: और (aur) - And, and हैं (hain) - Are

Kannada: ಈ (ee) - This, and ಬಗ್ಗೆ (bagge) - About

They are usually removed during preprocessing because they carry little meaning on their own and add noise to the data.
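
At its core, stopword removal is a set-membership filter over tokens. The sketch below shows the idea in isolation; the function name `remove_stopwords` and the tiny hand-picked stopword set are assumptions for illustration, not the lists loaded later in this notebook.

```python
def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword collection."""
    stopwords = set(stopwords)  # set lookup keeps the filter fast on long texts
    return [tok for tok in tokens if tok not in stopwords]

# Tiny illustrative stopword set, not a complete list
toy_hindi_stopwords = {"और", "हैं", "में", "ने", "का"}
tokens = "ऋषभ शेट्टी ने मुंबई में कांतारा चैप्टर 1 का पोस्टर लॉन्च किया".split()

filtered = remove_stopwords(tokens, toy_hindi_stopwords)
print(filtered)
```

Because the comparison is an exact string match, the text and the stopword list should go through the same Unicode normalization first; otherwise visually identical words may fail to match.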

In [47]:
import json
import stopwordsiso

# Hindi stopwords ship with the stopwordsiso package (the community-maintained stopwords-iso collection, keyed by ISO 639-1 language codes).
stop_set_hindi = list(stopwordsiso.stopwords("hi"))
print("Number of Hindi Stopwords: ", len(stop_set_hindi))
print("Hindi Stopwords: ", stop_set_hindi)


# Kannada is a low-resource language for NLP, so we rely on an online source for its stopwords.
import requests

url = KANNADA_STOPWORDS_URL
response = requests.get(url, timeout=30)
if response.status_code == 200:
    stop_set_kannada = [line.strip() for line in response.text.splitlines() if line.strip()]
    print("Number of Kannada Stopwords: ", len(stop_set_kannada))
    print("Kannada Stopwords: ", stop_set_kannada)
else:
    stop_set_kannada = []  # keep the name defined so later cells do not raise NameError
    print("Failed to retrieve the file.")

Number of Hindi Stopwords:  225
Hindi Stopwords:  ['सो', 'उसी', 'इन्हों', 'इसकी', 'में', 'दबारा', 'निचे', 'मे', 'कइ', 'यह', 'तब', 'की', 'किन्हों', 'अंदर', 'भि', 'कौन', 'जहां', 'दो', 'पूरा', 'हुइ', 'द्वारा', 'लिए', 'कहा', 'इतयादि', 'उन्हें', 'सकता', 'होना', 'सभि', 'एसे', 'तिस', 'इसके', 'इसे', 'ऱ्वासा', 'इस', 'जहाँ', 'बनी', 'वुह', 'बहुत', 'आप', 'पर', 'होने', 'ओर', 'यही', 'कितना', 'अपने', 'उन्हों', 'थे', 'जिन', 'वग़ैरह', 'जिधर', 'किसी', 'जेसे', 'जिस', 'दूसरे', 'करना', 'करता', 'सबसे', 'वहीं', 'जा', 'किन्हें', 'वहिं', 'काफि', 'कि', 'नीचे', 'पहले', 'हे', 'उनका', 'बही', 'किर', 'किंहों', 'था', 'मानो', 'अप', 'होता', 'अपना', 'जीधर', 'करने', 'वगेरह', 'अपनि', 'पे', 'होति', 'तक', 'न', 'इंहों', 'यिह', 'वाले', 'उंहिं', 'सकते', 'हुई', 'है', 'भी', 'उनकि', 'इन', 'होती', 'होते', 'इंहें', 'कई', 'पुरा', 'अत', 'हि', 'यदि', 'को', 'या', 'वहां', 'काफ़ी', 'कुल', 'घर', 'साबुत', 'इसका', 'तिसे', 'कर', 'किया', 'फिर', 'हैं', 'ऐसे', 'अदि', 'अभी', 'उन', 'रहा', 'हुआ', 'ना', 'यहां', 'इन्हें', 'कोई', 'थी', 'का', 'इन्हीं', 

In [48]:
import re

def stopword_removal(text, lang):
    """
    Removes stopwords from Indic text robustly.
    """

    # 1. Select the correct stopword set based on language
    if lang == "hi":
        stopword_set = set(stop_set_hindi)
    elif lang == "kn":
        stopword_set = set(stop_set_kannada)
    else:
        stopword_set = set()   # fallback for unsupported languages

    # 2. Tokenize using IndicNLP word tokenizer
    tokens = word_tokenization(text, lang)

    filtered_tokens = []

    for token in tokens:
        # 3. Strip ASCII punctuation and other symbols using regex.
        # The explicit Devanagari and Kannada ranges keep vowel signs, which \w
        # alone does not match; they also keep the danda '।' (U+0964).
        clean_token = re.sub(r"[^\w\u0900-\u097F\u0C80-\u0CFF]+", "", token)

        # 4. Apply filtering conditions:
        #    - token must not be empty
        #    - token must not be a stopword
        if clean_token and clean_token not in stopword_set:
            filtered_tokens.append(clean_token)

    return filtered_tokens
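
The explicit Unicode ranges in the regex above matter because Python's `\w` matches letters and digits but not the combining vowel signs used in Devanagari and Kannada. The short check below illustrates this with the standard `re` module; the variable names are chosen for this example, and the strings are written as escapes so the code points are unambiguous.

```python
import re

WORD_ONLY = re.compile(r"[^\w]+")                              # \w alone
WITH_RANGES = re.compile(r"[^\w\u0900-\u097F\u0C80-\u0CFF]+")  # pattern used above

ki = "\u0915\u093f"   # Devanagari letter KA followed by vowel sign I
danda = "\u0964"      # Devanagari danda, the sentence-final mark

stripped_plain = WORD_ONLY.sub("", ki)            # the vowel sign is removed
stripped_ranges = WITH_RANGES.sub("", ki + ",")   # only the comma is removed
danda_kept = WITH_RANGES.sub("", danda)           # the danda lies inside the range

print(repr(stripped_plain), repr(stripped_ranges), repr(danda_kept))
```

This is also why the danda still appears in the Hindi output after stopword removal: it falls inside the Devanagari block, so the pattern treats it as a character to keep.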


In [49]:
def stopword_output(text, lang):
    filtered_tokens = stopword_removal(text, lang)
    return filtered_tokens

print("Hindi Vocabulary after stopword removal:", stopword_output(hindi_text, "hi"))

Hindi Vocabulary after stopword removal: ['कभीकभी', 'फिल्म', 'आपको', 'बिल्कुल', 'निशब्द', 'देती', '।', 'कभी', 'उसका', 'असर', 'इतना', 'गहरा', 'शब्द', 'मिलते', '।', 'कभी', 'इतने', 'प्रभावित', 'बारे', 'बात', 'बंद', 'पाते', '।', 'कभी', 'आपके', 'दिल', 'इतना', 'छू', 'जाती', 'बस', 'उसकी', 'जादुई', 'दुनिया', 'खो', 'जाते', '।', 'फिल्म', 'सब', 'दिखाती', 'सिर्फ', 'मास्टरपीस', 'रह', 'जाती', 'सांस्कृतिक', 'घटना', 'बन', 'जाती', '।', 'ऋषभ', 'शेट्टी', 'कांतारा', 'चैप्टर', 'ठीक', 'ऐसा', 'असर', 'छोड़ती', '।', 'फिल्म', 'शुरुआत', 'कदंब', 'वंश', 'क्रूर', 'शासक', 'जिसकी', 'लालच', 'हर', 'ज़मीन', 'पानी', 'कब्ज़े', 'लेने', '।', 'चाहे', 'आदमी', 'औरत', 'बच्चा', 'मायने', '।', 'सबको', 'मारकर', 'हुकूमत', 'फैलाता', '।', 'बार', 'अभियान', 'दौरान', 'समुद्र', 'किनारे', 'मछली', 'पकड़ते', 'रहस्यमयी', 'बूढ़े', 'आदमी', 'देखता', '।', 'सैनिकों', 'पकड़ने', 'आदेश', 'देता', '।', 'खींचकर', 'ले', 'जाते', 'थैले', 'कीमती', 'सामान', 'गिरते', '।', 'शासक', 'चीज़ों', 'देखता', 'स्रोत', 'खोज', 'निकल', 'पड़ता', '।', 'सफ़र', 'कांतारा', 'ले'

In [50]:
print("Kannada Vocabulary after stopword removal:", stopword_output(kannada_text, "kn"))

Kannada Vocabulary after stopword removal: ['ಕಾಂತಾರ', 'ಚಾಪ್ಟರ್', 'ಬರೀ', 'ಅದೊಂದು', 'ಅನುಭೂತಿ', 'ಕಥೆಗಾಗಿ', 'ನೋಡಬಾರದು', 'ನೀಡುವ', 'ರೋಮಾಂಚನಕ್ಕಾಗಿ', 'ನೋಡಬೇಕು', 'ನೀಡುವ', 'ದೃಶ್ಯವೈಭವ', 'ಸೀಟಿನ', 'ತುದಿಗೆ', 'ತಂದು', 'ಕೂರಿಸುವ', 'ಥ್ರಿಲ್ಲಿಂಗ್', 'ಹೊಡೆದಾಟದ', 'ದೃಶ್ಯಗಳು', 'ಕಾಡಿನ', 'ಮರೆಯಲ್ಲಿ', 'ಹುದುಗಿಕೊಂಡ', 'ರಹಸ್ಯಗಳು', 'ತುಳುನಾಡಿನ', 'ದೈವಗಳು', 'ಅದರಿಂದ', 'ಕಾಯಲ್ಪಡುವ', 'ಮನುಷ್ಯಲೋಕದ', 'ಆಟ', 'ಹೋರಾಟಗಳ', 'ಭಾವುಕ', 'ರಮ್ಯ', 'ಲೋಕದ', 'ಚಿತ್ರಣಕ್ಕಾಗಿ', 'ಇದನ್ನು', 'ನೋಡಬೇಕು', 'ದೊಡ್ಡ', 'ತೆರೆಯಲ್ಲಿ', 'ನೋಡಿದರೆ', 'ಮಾತ್ರವೇ', 'ದೃಶ್ಯ', 'ವೈಭವದ', 'ನೈಜ', 'ಸಾಕ್ಷಾತ್ಕಾರ', 'ಸಾಧ್ಯಕಥೆಯಲ್ಲಿ', 'ಹೊಸತೇನಿಲ್ಲ', 'ಒಳಿತು', 'ಕೆಡುಕುಗಳ', 'ಸಮರ', 'ಒಳಿತಿನ', 'ವಿಜಯ', 'ಕೇಡಿನ', 'ಶಕ್ತಿಗಳ', 'ವಿರುದ್ಧ', 'ಸಜ್ಜನರ', 'ಮನುಷ್ಯರು', 'ನಂಬಿದ', 'ದೈವಗಳ', 'ವಿಜಯ', 'ಆರಂಭದಲ್ಲಿಯೇ', 'ತುಳುನಾಡಿಗೆ', 'ಕೈಲಾಸದಿಂದ', 'ಅವತರಿಸುವ', 'ಶಿವಗಣಗಳು', 'ದೈವವಾಗಿ', 'ನಾಡನ್ನು', 'ಕಾಯುತ್ತವೆ', 'ಇದರ', 'ನಡುವೆಯೂ', 'ದೈವವನ್ನು', 'ಬಂಧಿಸಲು', 'ಯತ್ನಿಸುವ', 'ದುರ್ಜನರು', 'ಇದ್ದಾರೆ', 'ದುರ್ಜನರನ್ನು', 'ಮಟ್ಟಹಾಕಲು', 'ಮನುಷ್ಯಶಕ್ತಿಯೂ', 'ದೈವಶಕ್ತಿಯೂ', 'ಕೈ', 'ಜೋಡಿಸಬೇಕಾಗುತ್ತದೆ', 'ರಿಷಬ್', 'ಶೆಟ್ಟಿಯ', 'ಬೆರ್ಮೆ', 'ಪಾತ್ರದಲ್ಲಿ', 'ಇವೆರಡೂ', 'ಜೋಡಿಯ

## Stemming and Lemmatization

Stemming is a rule-based text normalization technique that reduces words to a root form by stripping prefixes or suffixes. The resulting form, called a stem, may not be a valid or meaningful word in the language.

Lemmatization is a linguistically driven text normalization technique that converts words into their base dictionary form, known as a lemma, by considering grammar, vocabulary and context.
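
To make the contrast concrete, here is a minimal toy sketch (not the Snowball or Stanza implementation). The suffix list `TOY_SUFFIXES`, the dictionary `TOY_LEMMAS`, and both helper functions are hypothetical and exist only for illustration: the stemmer blindly strips a suffix, while the lemmatizer looks up a dictionary form.

```python
# Toy illustration only: the suffix list and lemma dictionary are hypothetical
# and far smaller than anything a real stemmer or lemmatizer would use.
TOY_SUFFIXES = ["ों", "ें", "ी", "ा", "े"]
TOY_LEMMAS = {"बच्चों": "बच्चा", "किताबें": "किताब"}


def toy_stem(word):
    """Strip the longest matching suffix; the result need not be a real word."""
    for suffix in sorted(TOY_SUFFIXES, key=len, reverse=True):
        # Keep at least two characters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word


def toy_lemma(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)


for w in ["बच्चों", "किताबें", "शेट्टी"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemma(w))
```

In this sketch the stem of a proper noun such as "शेट्टी" loses its final vowel sign, while the dictionary lookup leaves it intact. This is the kind of difference the comparison table in this section is meant to surface.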

#### Stanza uses a neural lemmatizer, which is generally more accurate than rule-based stemming. Therefore, we use the `lemma` attribute to normalize words to their canonical forms.

#### We are using the Stanza library for lemmatization here (and for POS tagging in Phase 4):

Stanza is an easy-to-use Python library developed by the Stanford NLP Group for Natural Language Processing (NLP) tasks such as tokenization, part-of-speech tagging, named entity recognition, and dependency parsing. It is built on deep learning and supports over 70 languages.

Stanza makes it simple to analyze and understand text in multiple languages with high accuracy.

Limitation: Stanza works for Hindi, Tamil, and Telugu, but not for Kannada.

### Note: Reliable open-source Lemmatizers are **currently unavailable for Kannada**. You will perform this analysis on the **Hindi corpus only** to demonstrate the concept.

In [51]:
import stanza
import pandas as pd
from snowballstemmer import stemmer as snowball_stemmer

# 1. Models setup
print("Loading Models...")
stanza.download(STANZA_LANG, verbose=False)

# Loading pipeline with tokenization, POS tagging, and lemmatization
nlp_stanza = stanza.Pipeline(STANZA_LANG, processors='tokenize,pos,lemma', verbose=False)
sb_stemmer = snowball_stemmer(SNOWBALL_LANG)
print("Models Loaded Successfully!\n")

for i, report in enumerate(hindi_reports, 1):
    doc = nlp_stanza(report)

    data = []
    for sentence in doc.sentences:
        for word in sentence.words:
            original = word.text

            # ---------------------------------------------------
            # Morphological Extraction
            # ---------------------------------------------------

            # 1. Lemmatization using Stanza (Neural Lemmatizer)
            # The lemma is the dictionary form of the word.
            # Example: "जा रहा" → "जा", "लड़कों" → "लड़का"
            lemma = word.lemma

            # 2. Stemming using Snowball Stemmer (Rule-based)
            # This chops off suffixes to produce a stem.
            # The stem may not always be a valid dictionary word.
            # stemWord() takes a single string and returns a string
            # (stemWords() is the variant that takes and returns a list).
            sb_stem = sb_stemmer.stemWord(original)

            # ---------------------------------------------------
            # Collect results
            # ---------------------------------------------------
            data.append({
                "Original": original,
                "Stanza (Lemma)": lemma if lemma else "Pending...",
                "Snowball (Stem)": sb_stem if sb_stem else "Pending..."
            })

    df = pd.DataFrame(data)

    print(f"REPORT {i} ANALYSIS")
    print(f"Sentence: \"{report}\"")
    print("=" * 80)
    print(f"{'ORIGINAL':<20} | {'LEMMA (Root) - STANZA':<25} | {'STEM (Suffix Chopped) - SNOWBALL':<35} ")
    print("-" * 80)

    for _, row in df.iterrows():
        print(f"{row['Original']:<20} | {row['Stanza (Lemma)']:<25} | {row['Snowball (Stem)']:<35}")
    print("\n")


Loading Models...
Models Loaded Successfully!

REPORT 1 ANALYSIS
Sentence: "ऋषभ शेट्टी ने मुंबई में कांतारा चैप्टर 1 का पोस्टर लॉन्च किया।"
ORIGINAL             | LEMMA (Root) - STANZA     | STEM (Suffix Chopped) - SNOWBALL    
--------------------------------------------------------------------------------
ऋषभ                  | ऋषभ                       | ऋषभ                                
शेट्टी               | शेट्टी                    | शेट्ट                              
ने                   | ने                        | न                                  
मुंबई                | मुंबई                     | मुंब                               
में                  | में                       | म                                  
कांतारा              | कांतारा                   | कांतार                             
चैप्टर               | चैप्टर                    | चैप्टर                             
1                    | 1                         | 1                              

# PHASE 4: SYNTACTIC & SEMANTIC EXTRACTION (POS & NER)

## POS Tagging

POS tagging involves assigning a part of speech tag to each word in a text.

This step is commonly used in various NLP tasks such as named entity recognition, sentiment analysis, and machine translation.

### Note: Due to the low-resource nature of Dravidian languages in standard libraries, **we cannot perform reliable POS tagging for the Kannada** stream in this lab.
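
As a rough illustration of how POS tags can feed a downstream step such as entity spotting, the sketch below joins runs of consecutive `PROPN` tokens into candidate names. The helper `group_proper_nouns` is hypothetical and not part of the lab code. The sample tags are copied from the Review 1 output of the Hindi POS run in this notebook.

```python
def group_proper_nouns(pos_tags):
    """Join runs of consecutive PROPN tokens into candidate entity strings."""
    candidates, current = [], []
    for word, tag in pos_tags:
        if tag == "PROPN":
            current.append(word)
        elif current:
            # A non-PROPN token closes the current run.
            candidates.append(" ".join(current))
            current = []
    if current:
        candidates.append(" ".join(current))
    return candidates


# (word, upos) pairs in the shape returned by a POS tagger.
sample_tags = [
    ("ऋषभ", "PROPN"), ("शेट्टी", "PROPN"), ("ने", "ADP"),
    ("मुंबई", "PROPN"), ("में", "ADP"),
    ("कांतारा", "PROPN"), ("चैप्टर", "PROPN"), ("1", "PROPN"),
    ("का", "ADP"), ("पोस्टर", "NOUN"), ("लॉन्च", "X"),
    ("किया", "VERB"), ("।", "PUNCT"),
]
print(group_proper_nouns(sample_tags))
```

This heuristic only finds candidate spans. It cannot say whether a span is a person, a place, or a film title, which is what the NER step later in this phase adds.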

In [52]:
import stanza

# Global model cache to store loaded models (prevents reloading model for every sentence)
_model_cache = {}

def get_pos_tags_hindi(text):
    """
    Performs Part-of-Speech tagging for Hindi using Stanza.
    Args:
        text (str): The raw Hindi text.
    Returns:
        list: A list of tuples [(word, upos_tag), ...]
    """

    # ----------------------------------------------------
    # 1. Tokenize the text
    # We reuse the Indic NLP word tokenizer implemented earlier
    # This gives us a list of tokens (already split words)
    # ----------------------------------------------------
    tokens = word_tokenization(text, STANZA_LANG)

    # ----------------------------------------------------
    # 2. Load the Stanza Pipeline with caching
    # We only download and initialize the model ONCE
    # ----------------------------------------------------
    if STANZA_LANG not in _model_cache:
        print(f"Downloading and loading Stanza model for {STANZA_LANG}...")

        # Download model for Hindi with tokenize + POS processors
        stanza.download(STANZA_LANG, processors="tokenize,pos", verbose=False)

        # Initialize pipeline
        # tokenize_pretokenized=True is CRITICAL because
        # we are passing tokens ourselves instead of raw text
        _model_cache[STANZA_LANG] = stanza.Pipeline(
            lang=STANZA_LANG,
            processors="tokenize,pos",
            tokenize_pretokenized=True,
            verbose=False
        )

    # Retrieve model from cache
    nlp = _model_cache.get(STANZA_LANG)

    if not nlp or not tokens:
        return []

    # ----------------------------------------------------
    # 3. Process the text
    # Since tokenize_pretokenized=True, input must be:
    #   List[List[str]]
    # One list = one sentence
    # ----------------------------------------------------
    doc = nlp([tokens])

    # ----------------------------------------------------
    # 4. Extract POS Tags
    # word.upos gives the Universal POS tag
    # ----------------------------------------------------
    pos_tags = []
    for sentence in doc.sentences:
        for word in sentence.words:
            pos_tags.append((word.text, word.upos))

    return pos_tags
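
The module-level dictionary used above is one way to avoid reloading a model. As an alternative sketch, the same load-once behaviour can be had from the standard library's `functools.lru_cache`. Here `load_model` and `LOAD_CALLS` are hypothetical stand-ins, and the returned dictionary is a placeholder rather than a real Stanza pipeline.

```python
from functools import lru_cache

LOAD_CALLS = []  # records each real load so the caching effect is visible


@lru_cache(maxsize=None)
def load_model(lang):
    """Stand-in for an expensive loader such as building an NLP pipeline."""
    LOAD_CALLS.append(lang)
    return {"lang": lang}  # placeholder object, not a real model


first = load_model("hi")
second = load_model("hi")  # served from the cache, loader body is skipped
print(first is second, len(LOAD_CALLS))
```

Either approach works. The explicit dictionary makes the cache easy to inspect or clear by hand, while the decorator keeps the loading function free of bookkeeping.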

In [53]:
print(f"{'='*70}")
print(f"ANALYZING HINDI MOVIE REPORTS AFTER POS TAGGING")
print(f"{'='*70}")

for i, report in enumerate(hindi_reports, 1):
    print(f"\nReview {i}: {report}")

    tags = get_pos_tags_hindi(report)

    if tags:
        print(f"   >>> POS TAGS:")
        for word, tag in tags:
            print(f"       - {word:15} : {tag}")
    else:
        print("   >>> [NO TAGS FOUND]")

ANALYZING HINDI MOVIE REPORTS AFTER POS TAGGING

Review 1: ऋषभ शेट्टी ने मुंबई में कांतारा चैप्टर 1 का पोस्टर लॉन्च किया।
Downloading and loading Stanza model for hi...
   >>> POS TAGS:
       - ऋषभ             : PROPN
       - शेट्टी          : PROPN
       - ने              : ADP
       - मुंबई           : PROPN
       - में             : ADP
       - कांतारा         : PROPN
       - चैप्टर          : PROPN
       - 1               : PROPN
       - का              : ADP
       - पोस्टर          : NOUN
       - लॉन्च           : X
       - किया            : VERB
       - ।               : PUNCT

Review 2: प्रगति शेट्टी जी ने पुष्टि की है कि शूटिंग कर्नाटक में होगी।
   >>> POS TAGS:
       - प्रगति          : PROPN
       - शेट्टी          : PROPN
       - जी              : PART
       - ने              : ADP
       - पुष्टि          : NOUN
       - की              : VERB
       - है              : AUX
       - कि              : SCONJ
       - शूटिंग          : NOUN
       - कर्नाटक   

## Named Entity Recognition (NER)

NER involves identifying and classifying named entities in text, such as people, organizations, and locations.

This step is commonly used in various NLP tasks such as information extraction, machine translation, and question-answering.

### Note: Unlike the previous steps, this **model is multilingual**, so run it on **BOTH the Hindi and Kannada test corpora** provided to you.

In [54]:
from transformers import pipeline

# Global variable to store the model (prevents model reloading on every function call)
_NER_PIPELINE = None

def load_ner_model():
    """
    Loads the NER model only once and caches it.
    HuggingFace pipeline is expensive to load, so we avoid reloading it.
    """
    global _NER_PIPELINE

    if _NER_PIPELINE is None:
        print(f"Downloading/Loading NER Model: {NER_MODEL_NAME}...")

        # ---------------------------------------------------
        # Initialize HuggingFace NER pipeline
        # task="ner" → Named Entity Recognition
        # model=NER_MODEL_NAME → multilingual NER model
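        # aggregation_strategy="simple" →
        #   groups adjacent subword tokens that share an entity label
        #   into one span, e.g. pieces like ["Kan", "ta", "ra"] → "Kantara"
        #   (illustrative pieces; the exact split depends on the tokenizer)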
        # ---------------------------------------------------
        _NER_PIPELINE = pipeline(
            task="ner",
            model=NER_MODEL_NAME,
            aggregation_strategy="simple"
        )

    return _NER_PIPELINE


def get_ner_tags(text):
    """
    Performs NER on the input text using the cached model.
    Args:
        text (str): Input text.
    Returns:
        list: A list of tuples [(entity_text, entity_group), ...]
    """
    if not text or not text.strip():
        return []

    # Get cached pipeline
    ner_pipe = load_ner_model()

    if ner_pipe is None:
        return []

    # ---------------------------------------------------
    # 1. Run inference
    # We pass the raw text directly to the pipeline
    # ---------------------------------------------------
    results = ner_pipe(text)

    # ---------------------------------------------------
    # 2. Format Output
    # Each result item looks like:
    # {
    #   'entity_group': 'PER',
    #   'score': 0.99,
    #   'word': 'ऋषभ शेट्टी',
    #   'start': 0,
    #   'end': 10
    # }
    # ---------------------------------------------------
    entities = []
    for item in results:
        # Extract clean word and its entity category
        entities.append((item["word"], item["entity_group"]))

    return entities
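
`get_ner_tags` drops the `score` field. As a possible variation, the sketch below keeps only entities at or above a confidence threshold. The helper `filter_entities`, the `min_score` default, and the scores in `sample_results` are assumptions made for illustration. The sample follows the result shape described in the comments above but is hand-written, not real model output.

```python
def filter_entities(results, min_score=0.80):
    """Keep (word, entity_group, score) for results at or above min_score."""
    return [
        (item["word"], item["entity_group"], round(float(item["score"]), 2))
        for item in results
        if item["score"] >= min_score
    ]


# Hand-written sample in the pipeline's output shape; the scores are made up.
sample_results = [
    {"entity_group": "PER", "score": 0.99, "word": "ऋषभ शेट्टी", "start": 0, "end": 10},
    {"entity_group": "LOC", "score": 0.97, "word": "मुंबई", "start": 14, "end": 19},
    {"entity_group": "ORG", "score": 0.55, "word": "कांतारा चैप्टर 1", "start": 24, "end": 40},
]
print(filter_entities(sample_results))
```

A threshold like this trades recall for precision, so any cut-off would need to be checked against the actual scores the model produces on the Hindi and Kannada corpora.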



In [55]:
def run_ner_test(lang_name, corpus):
    print(f"\n{'='*70}")
    print(f"ANALYZING {lang_name.upper()} MOVIE REPORTS FOR NER TAGGING")
    print(f"{'='*70}")

    for i, report in enumerate(corpus, 1):
        print(f"\nReview {i}: {report}")
        entities = get_ner_tags(report)

        if entities:
            print(f"   >>> DETECTED ENTITIES:")
            for word, tag in entities:
                # Map short labels to readable names; unknown labels pass through
                tag_name = {"LOC": "LOCATION", "PER": "PERSON"}.get(tag, tag)
                print(f"       - {word:15} : {tag_name}")
        else:
            print("   >>> [NO ENTITIES FOUND]")

# Run the tests
# Note: The model loads ONLY ONCE at the start of the first function call.
run_ner_test("Kannada", kannada_reports)
run_ner_test("Hindi", hindi_reports)


ANALYZING KANNADA MOVIE REPORTS FOR NER TAGGING

Review 1: ನಟ ರಿಷಬ್ ಶೆಟ್ಟಿ ಅವರು ಕುಂದಾಪುರ ನಗರದಲ್ಲಿ ಚಿತ್ರೀಕರಣ ಆರಂಭಿಸಿದ್ದಾರೆ.
Downloading/Loading NER Model: Davlan/xlm-roberta-base-wikiann-ner...


   >>> DETECTED ENTITIES:
       - ರಿಷಬ್ ಶೆಟ್ಟಿ    : PERSON
       - ಕುಂದಾಪುರ ನಗರ    : LOCATION

Review 2: ಹೊಂಬಾಳೆ ಫಿಲ್ಮ್ಸ್ ಬೆಂಗಳೂರು ನಗರದಲ್ಲಿ ಹೊಸ ಕಚೇರಿಯನ್ನು ತೆರೆದಿದೆ.
   >>> DETECTED ENTITIES:
       - ಹೊಂಬಾಳೆ ಫಿಲ್ಮ್ಸ್ : ORG
       - ಬೆಂಗಳೂರು        : LOCATION

Review 3: ನಿರ್ಮಾಪಕ ವಿಜಯ್ ಕಿರಗಂದೂರು ಮಂಗಳೂರು ನಗರಕ್ಕೆ ಭೇಟಿ ನೀಡಿದರು.
   >>> DETECTED ENTITIES:
       - ವಿಜಯ್ ಕಿರಗಂದೂರು : PERSON
       - ಮಂಗಳೂರು ನಗರ     : LOCATION

Review 4: ಉಡುಪಿ ಹಾಗೂ ಭಾರತದಲ್ಲಿ ಈ ಕಥೆ ಪ್ರಸಿದ್ಧವಾಗಿದೆ.
   >>> DETECTED ENTITIES:
       - ಉಡುಪಿ           : LOCATION
       - ಭಾರತದಲ್ಲಿ       : LOCATION

Review 5: ಪಿವಿಆರ್ ಸಿನಿಮಾಸ್ ಮುಂದೆ ಅಭಿಮಾನಿಗಳು ಸಂಭ್ರಮಿಸುತ್ತಿದ್ದಾರೆ.
   >>> DETECTED ENTITIES:
       - ಪಿವಿಆರ್         : ORG

ANALYZING HINDI MOVIE REPORTS FOR NER TAGGING

Review 1: ऋषभ शेट्टी ने मुंबई में कांतारा चैप्टर 1 का पोस्टर लॉन्च किया।
   >>> DETECTED ENTITIES:
       - ऋषभ शेट्टी      : PERSON
       - मुंबई           : LOCATION
       - कांतारा चैप्टर 1 : ORG

Review 2: प्रगति शेट्टी जी ने पुष्टि की है कि शूटिंग

# Resources:

1. https://www.geeksforgeeks.org/nlp/natural-language-processing-nlp-pipeline/
2. https://www.geeksforgeeks.org/nlp/byte-pair-encoding-bpe-in-nlp/
3. https://nbviewer.org/url/anoopkunchukuttan.github.io/indic_nlp_library/doc/indic_nlp_examples.ipynb
4. https://www.geeksforgeeks.org/nlp/introduction-to-the-indic-nlp-library-and-its-core-functionalities/
5. https://www.geeksforgeeks.org/nlp/how-wordpiece-tokenization-addresses-the-rare-words-problem-in-nlp/