# Week 1 Hands-on Exercises


# 1. Word Tokenization

In [None]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer

# Example text in Hindi
hindi_text = "मैं कल बाजार गया था। वहाँ मैंने किताबें खरीदीं।"

# Word Tokenization using NLTK
tokens = word_tokenize(hindi_text)
print("Tokenized words (NLTK):", tokens)

# Using a regular-expression tokenizer; note in the output that \w+ splits Devanagari words at vowel signs
regexp_tokenizer = RegexpTokenizer(r'\w+')
tokens_regexp = regexp_tokenizer.tokenize(hindi_text)
print("Tokenized words (Regex):", tokens_regexp)


Tokenized words (NLTK): ['मैं', 'कल', 'बाजार', 'गया', 'था।', 'वहाँ', 'मैंने', 'किताबें', 'खरीदीं।']
Tokenized words (Regex): ['म', 'कल', 'ब', 'ज', 'र', 'गय', 'थ', 'वह', 'म', 'न', 'क', 'त', 'ब', 'खर', 'द']


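\d

    Matches any decimal digit. With the re.ASCII flag (or in bytes patterns) this is equivalent to the class [0-9]; in Unicode (str) patterns, the Python 3 default, it also matches decimal digits from other scripts, such as the Devanagari digits ०-९.

\D

    Matches any non-digit character; this is the complement of \d ([^0-9] in ASCII mode).

\s

    Matches any whitespace character. In ASCII mode this is equivalent to the class [ \t\n\r\f\v]; Unicode patterns also match other whitespace, such as the no-break space.

\S

    Matches any non-whitespace character; this is the complement of \s ([^ \t\n\r\f\v] in ASCII mode).

\w

    Matches any word character. In ASCII mode this is equivalent to the class [a-zA-Z0-9_]; in Unicode patterns it matches letters and digits from any script plus the underscore, but not combining marks. Devanagari vowel signs (matras) such as ा and ि are combining marks, which is why \w+ breaks the Hindi words above into fragments.

\W

    Matches any non-word character; this is the complement of \w ([^a-zA-Z0-9_] in ASCII mode).

These sequences can be included inside a character class. For example, [\s,.] is a character class that will match any whitespace character, or ',' or '.'.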

In [None]:
import re

pattern=re.compile(r'\w+')
matches=pattern.findall(hindi_text)
print(matches)

['म', 'कल', 'ब', 'ज', 'र', 'गय', 'थ', 'वह', 'म', 'न', 'क', 'त', 'ब', 'खर', 'द']
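The `\w+` pattern fragments Hindi words because vowel signs are combining marks. One possible workaround, sketched here under the assumption that the input is Devanagari only, is to match runs of code points from the Devanagari Unicode block (U+0900 to U+097F) while leaving out the danda (U+0964) and double danda (U+0965). The names `DEVANAGARI_WORD`, `sample`, and `devanagari_tokens` are illustrative and not part of NLTK.

```python
import re

# Devanagari block U+0900-U+097F, minus danda (U+0964) and double danda (U+0965).
# Vowel signs, anusvara and nukta fall inside this range, so words stay whole.
DEVANAGARI_WORD = re.compile(r'[\u0900-\u0963\u0966-\u097F]+')

sample = "मैं कल बाजार गया था। वहाँ मैंने किताबें खरीदीं।"
devanagari_tokens = DEVANAGARI_WORD.findall(sample)
print(devanagari_tokens)
```

Unlike `\w+`, this pattern ignores Latin text and ASCII digits, so mixed-script input would need an extra alternative in the pattern.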


# 2. Challenges of Tokenization

In [None]:
from nltk.tokenize import sent_tokenize

# Hindi text with Devanagari full stops
hindi_text = "यह एक वाक्य है। यह दूसरा वाक्य है।"

# Sentence Segmentation
sentences = sent_tokenize(hindi_text)
print("Segmented sentences (NLTK):", sentences)

# Custom segmentation for Devanagari full stop
import re
custom_sentences = re.split(r'।\s*', hindi_text)
print("Segmented sentences (Custom):", [s for s in custom_sentences if s])


Segmented sentences (NLTK): ['यह एक वाक्य है। यह दूसरा वाक्य है।']
Segmented sentences (Custom): ['यह एक वाक्य है', 'यह दूसरा वाक्य है']
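The custom split above discards the danda. If the sentence terminator should stay attached to its sentence, a lookbehind can split on the whitespace that follows it instead. This is a minimal sketch; `split_keep_terminator` is a hypothetical helper name, and it assumes sentences are separated by whitespace after the terminator.

```python
import re

def split_keep_terminator(text):
    # Split on whitespace that is preceded by a danda, '!' or '?'.
    # The lookbehind does not consume the terminator, so it stays in the sentence.
    parts = re.split(r'(?<=[।!?])\s+', text.strip())
    return [p for p in parts if p]

sentences_with_danda = split_keep_terminator("यह एक वाक्य है। यह दूसरा वाक्य है।")
print(sentences_with_danda)
```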


# Tokenization for Indian Languages Using indic-nlp-library

In [None]:
pip install indic-nlp-library

The **UnicodeIndicTransliterator** is a component of the Indic NLP Library that transliterates text between Indic scripts. It works across most major Indian languages, so the same text can be rendered in a different script (for example, Hindi text in Tamil script). Romanization is handled by a separate class in the same module, **ItransTransliterator**, which converts between Indic scripts and the ITRANS romanization scheme.

**Key Features:**

**Script-to-Script Transliteration:** Convert text from one Indic script to another (e.g., Devanagari to Tamil) with UnicodeIndicTransliterator.

**Roman Transliteration:** Convert Indic script text to Roman script (and vice versa) with ItransTransliterator.

**Supports Multiple Languages:** Handles most Indian languages, such as Hindi, Tamil, Bengali, Telugu, etc.


**Supported Codes:**

**Indic Scripts:** hi (Hindi), ta (Tamil), te (Telugu), kn (Kannada), etc. These are language codes; the library maps each one to its script.

**Roman Script:** ITRANS, through ItransTransliterator.to_itrans and ItransTransliterator.from_itrans.
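Script-to-script conversion is feasible largely because the Unicode blocks for the major Indic scripts follow a parallel layout, so a character's offset within its block usually identifies the corresponding character in another script. The sketch below is a naive illustration of that idea, not the library's implementation. The function and constant names are hypothetical, and letters with no counterpart in the target script (for example, aspirated consonants in Tamil) would land on unassigned code points.

```python
DEVANAGARI_START = 0x0900
TAMIL_START = 0x0B80
BLOCK_SIZE = 0x80
# Danda and double danda live in the Devanagari block but are shared by all scripts
SHARED_PUNCTUATION = {0x0964, 0x0965}

def naive_offset_transliterate(text, src_start, tgt_start):
    out = []
    for ch in text:
        cp = ord(ch)
        in_source_block = src_start <= cp < src_start + BLOCK_SIZE
        if in_source_block and cp not in SHARED_PUNCTUATION:
            # Keep the offset within the block, move to the target block
            out.append(chr(cp - src_start + tgt_start))
        else:
            out.append(ch)
    return "".join(out)

print(naive_offset_transliterate("यह एक", DEVANAGARI_START, TAMIL_START))
```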

In [None]:
from indicnlp.tokenize import sentence_tokenize, indic_tokenize
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator

# Load the library
# Install: pip install indic-nlp-library

# Example text
hindi_text = "यह एक वाक्य है। यह दूसरा वाक्य है।"

# Sentence Tokenization
sentences = sentence_tokenize.sentence_split(hindi_text, lang='hi')
print("Sentence Segmentation (IndicNLP):", sentences)

# Word Tokenization
tokens = indic_tokenize.trivial_tokenize(hindi_text)
print("Word Tokenization (IndicNLP):", tokens)


# Transliterate from Devanagari to Tamil script
tamil_text = UnicodeIndicTransliterator.transliterate(hindi_text, "hi", "ta")
print("Hindi to Tamil:", tamil_text)



Sentence Segmentation (IndicNLP): ['यह एक वाक्य है।', 'यह दूसरा वाक्य है।']
Word Tokenization (IndicNLP): ['यह', 'एक', 'वाक्य', 'है', '।', 'यह', 'दूसरा', 'वाक्य', 'है', '।']
Hindi to Tamil: யஹ ஏக வாக்ய ஹை। யஹ தூஸரா வாக்ய ஹை।


# 5. Advanced Approaches for Tokenization and Sentence Segmentation

---
**Byte Pair Encoding (BPE)**

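Byte Pair Encoding (BPE) is a data compression technique that has been adapted for subword tokenization in NLP. Starting from individual characters, it repeatedly merges the most frequent pair of adjacent symbols, so frequent words end up as single tokens while rare words are split into smaller subword units.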

**Why BPE is Useful in NLP**

 *  **Handling Out-of-Vocabulary Words:** By breaking words into subwords, models can understand and generate words not seen during training.
 *  **Morphological Representation:** Captures prefixes, suffixes, and root words, helping in understanding word structures.
 *  **Efficiency:** Reduces the vocabulary size, making the model more efficient without losing significant information.
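To make the merge procedure concrete, here is a toy BPE learner in plain Python. It is a teaching sketch, not the algorithm as implemented in the `tokenizers` library: `learn_bpe_merges` is a hypothetical name, words are treated as sequences of characters, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    # Each word starts as a tuple of single-character symbols
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged symbol
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
print(dict(vocab))
```

After two merges the frequent stem "low" has become a single symbol, while the rarer endings remain split into characters.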


The BPE trainer in the code below registers five special tokens. These tokens are commonly used in transformer-based models like BERT. Here's a breakdown of their purposes:

**[PAD]:** Used for padding sequences to the same length in a batch. Padded positions are masked out through the attention mask, so they do not affect the model's predictions.

**[UNK]:** Represents an unknown token, used for words or symbols not in the model's vocabulary.

**[CLS]:** A special token added at the beginning of every input sequence. In BERT, the output corresponding to this token is often used as a summary representation for tasks like classification.

**[SEP]:** Separates segments in tasks that involve multiple sequences, like sentence-pair classification or question-answering.

**[MASK]:** Used in masked language modeling, where tokens are randomly replaced with this mask token to train the model to predict them.
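As an illustration of how these tokens fit together, the sketch below assembles a BERT-style input by hand from already tokenized text. `build_bert_style_input` is a hypothetical helper written for this example; real tokenizers do this internally, and the simple truncation used here may cut off the final [SEP].

```python
def build_bert_style_input(tokens_a, tokens_b=None, max_len=12):
    # [CLS] opens the sequence; [SEP] closes each segment
    sequence = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    if tokens_b:
        sequence += list(tokens_b) + ["[SEP]"]
    sequence = sequence[:max_len]

    # 1 marks real tokens, 0 marks padding that the model should ignore
    attention_mask = [1] * len(sequence)
    padding = max_len - len(sequence)
    return sequence + ["[PAD]"] * padding, attention_mask + [0] * padding

padded_tokens, mask = build_bert_style_input(["भारत", "महान"], ["देश"], max_len=8)
print(padded_tokens)
print(mask)
```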


In [None]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Prepare a small dataset with Indian language text
texts = [
    "भारत विविधता में एकता का देश है।",
    "வாழைப்பழத்தை வாங்கினேன்.",
    "रामगया बाज़ार।",
]

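# Initialize the tokenizer
# Note: BPE() is created without an unk_token, so characters never seen in
# training (such as 'न' in the example below) are silently dropped from the
# output. Passing unk_token="[UNK]" to BPE would emit [UNK] for them instead.
tokenizer = Tokenizer(BPE())
trainer = BpeTrainer(special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"])
tokenizer.pre_tokenizer = Whitespace()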

# Train the tokenizer on the text data
tokenizer.train_from_iterator(texts, trainer)

# Tokenize a new text
output = tokenizer.encode("भारत महान देश है।")
print("Tokenized Output:", output.tokens)


Tokenized Output: ['भारत', 'म', 'ह', 'ा', 'देश', 'है', '।']


***Deep Learning-Based Sentence Segmentation***

In [None]:
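from transformers import AutoTokenizer

# Load the IndicBERT tokenizer. This cell only demonstrates the pretrained
# subword tokenizer; it does not load a model or segment sentences.
# Note: by default this tokenizer strips combining marks, so vowel signs are
# missing from the output below (e.g. 'भारत' becomes '▁भरत'). Passing
# keep_accents=True to from_pretrained preserves them.
model_name = "ai4bharat/indic-bert"
tokenizer = AutoTokenizer.from_pretrained(model_name)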

# Load text in Hindi
text = "भारत एक महान देश है। यहाँ विभिन्न भाषाएँ बोली जाती हैं।"

# Tokenize using IndicBERT
tokens = tokenizer.tokenize(text)
print("IndicBERT Tokenized Output:", tokens)


IndicBERT Tokenized Output: ['▁भरत', '▁एक', '▁मह', 'न', '▁दश', '▁ह', '।', '▁यह', '▁व', 'भ', 'नन', '▁भ', 'ष', 'ए', '▁बल', '▁जत', '▁ह', '।']


# Advanced Task: Building a Pipeline for Indian Languages
Here's an outline for building a tokenization and segmentation pipeline:


1.   **Input:**
       *   Accept Indian language text.
2.   **Preprocessing:**
       *   Normalize the text (e.g., handle diacritics in Hindi or Tamil).
       *   Remove noise like special characters.
3.   **Tokenization:**
       *   Use Indic NLP Library or a pretrained tokenizer.
4.   **Sentence Segmentation:**
       *   Implement segmentation logic for punctuation like "।".
5.   **Output:**
       *   Provide tokenized words and segmented sentences.
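The preprocessing step can be sketched with the standard library alone. The example below assumes Unicode NFC normalization is an acceptable way to unify equivalent spellings (for instance, the precomposed letter ज़ and the sequence ज + nukta), and it uses a hand-picked set of noise characters; both choices, and the `preprocess` name, are assumptions for illustration.

```python
import re
import unicodedata

# Hypothetical noise set; adjust it to the data at hand
NOISE = re.compile(r'[#@*_~^<>|\\]+')

def preprocess(text):
    # Unify canonically equivalent spellings of the same character
    text = unicodedata.normalize("NFC", text)
    # Replace noise characters with a space, then collapse repeated whitespace
    text = NOISE.sub(" ", text)
    return re.sub(r'\s+', " ", text).strip()

precomposed = "बा\u095bार  ##  गया"        # uses the single code point U+095B
decomposed = "बा\u091c\u093cार गया"        # uses ज (U+091C) + nukta (U+093C)
print(preprocess(precomposed) == preprocess(decomposed))
```

Zero-width joiners are deliberately left untouched here, because they can change how conjuncts are rendered in Indic scripts.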








In [None]:
from indicnlp.tokenize import sentence_tokenize, indic_tokenize

def process_text(text, lang='hi'):
    # Sentence segmentation
    sentences = sentence_tokenize.sentence_split(text, lang=lang)
    print("Segmented Sentences:", sentences)

    # Word tokenization
    for sentence in sentences:
        tokens = indic_tokenize.trivial_tokenize(sentence)
        print(f"Tokens for '{sentence}':", tokens)

# Example Input
hindi_text = "भारत एक महान देश है। यहाँ अनेक भाषाएँ बोली जाती हैं।"
process_text(hindi_text, lang='hi')


Segmented Sentences: ['भारत एक महान देश है।', 'यहाँ अनेक भाषाएँ बोली जाती हैं।']
Tokens for 'भारत एक महान देश है।': ['भारत', 'एक', 'महान', 'देश', 'है', '।']
Tokens for 'यहाँ अनेक भाषाएँ बोली जाती हैं।': ['यहाँ', 'अनेक', 'भाषाएँ', 'बोली', 'जाती', 'हैं', '।']


# 9. Building a Tokenization and Sentence Segmentation Pipeline

This section outlines a complete pipeline for tokenizing and segmenting Indian language text. It handles multiple languages, integrates a pretrained tokenizer, and addresses the challenges discussed earlier.



---

**Pipeline Outline:**
1.   **Preprocessing:**
       *   Normalize text (e.g., handle diacritics or unwanted special characters).
       *   Detect language/script using libraries like langdetect.
2.   **Sentence Segmentation:**
       *   Use language-specific punctuation rules.
       *   Apply Indic NLP Library or a custom regular expression.
3.   **Word Tokenization:**
       *   Leverage Indic NLP, Hugging Face tokenizers, or subword tokenization methods.
4.   **Error Handling:**
       *   Handle unknown or mixed-language tokens robustly.
5.   **Output:**
       *   Provide tokenized words and segmented sentences with metadata.
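One way to provide output "with metadata" is to return a structured object rather than printing. The sketch below uses dataclasses and plain regular expressions as stand-ins for the segmentation and tokenization steps, so that it runs without extra dependencies; the class names, `run_pipeline`, and the regex rules are all assumptions for illustration.

```python
import re
from dataclasses import dataclass, field, asdict

@dataclass
class SentenceResult:
    text: str
    tokens: list

@dataclass
class PipelineResult:
    lang: str
    sentences: list = field(default_factory=list)

def run_pipeline(text, lang):
    result = PipelineResult(lang=lang)
    # Stand-in segmentation: split after a terminator followed by whitespace
    for sent in re.split(r'(?<=[।.!?])\s+', text.strip()):
        if sent:
            # Stand-in tokenization: words, or single punctuation marks
            tokens = re.findall(r'[^\s।.!?,]+|[।.!?,]', sent)
            result.sentences.append(SentenceResult(sent, tokens))
    return result

result = run_pipeline("भारत महान देश है। यहाँ अनेक भाषाएँ हैं।", "hi")
print(asdict(result))
```

Returning a dataclass keeps the language code, sentence text, and tokens together, and `asdict` makes the result easy to serialize.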

**Step 1: Installing Dependencies**

Install necessary libraries:




In [None]:
pip install nltk indic-nlp-library langdetect transformers



**Step 2: Implementation**

In [None]:
import re  # used by segment_sentences for the fallback split

from langdetect import detect
from indicnlp.tokenize import sentence_tokenize, indic_tokenize
from transformers import AutoTokenizer

# Load Pretrained Tokenizer (IndicBERT)
model_name = "ai4bharat/indic-bert"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Detect Language
def detect_language(text):
    try:
        lang = detect(text)
        print(f"Detected language: {lang}")
        return lang
    except Exception as e:
        print(f"Error detecting language: {e}")
        return None

# Sentence Segmentation
def segment_sentences(text, lang='hi'):
    if lang in ['hi', 'ta', 'te', 'kn', 'ml']:  # Supported Indian languages
        sentences = sentence_tokenize.sentence_split(text, lang=lang)
    else:
        # Default to splitting on ".", "!", or "?"
        sentences = re.split(r'[.!?]', text)
    return [s.strip() for s in sentences if s.strip()]

# Word Tokenization
def tokenize_words(text, lang='hi'):
    if lang in ['hi', 'ta', 'te', 'kn', 'ml']:
        tokens = indic_tokenize.trivial_tokenize(text)
    else:
        tokens = tokenizer.tokenize(text)
    return tokens

# Complete Pipeline
def process_text(text):
    lang = detect_language(text) or 'unknown'
    print("\nOriginal Text:", text)
    print("Language Detected:", lang)

    # Segment sentences
    sentences = segment_sentences(text, lang=lang)
    print("\nSegmented Sentences:")
    for i, sent in enumerate(sentences):
        print(f"{i+1}: {sent}")

    # Tokenize each sentence
    print("\nWord Tokens for Each Sentence:")
    for sentence in sentences:
        tokens = tokenize_words(sentence, lang=lang)
        print(f"Sentence: {sentence}")
        print(f"Tokens: {tokens}\n")

# Example Texts
hindi_text = "भारत एक महान देश है। यहाँ अनेक भाषाएँ बोली जाती हैं।"
tamil_text = "இந்தியாவில் பல மொழிகள் பேசப்படுகின்றன. இது ஒரு சிறப்பான நாடு."
mixed_text = "India is a multilingual country. यहाँ अनेक भाषाएँ बोली जाती हैं।"

# Process Texts
process_text(hindi_text)
process_text(tamil_text)
process_text(mixed_text)


Detected language: hi

Original Text: भारत एक महान देश है। यहाँ अनेक भाषाएँ बोली जाती हैं।
Language Detected: hi

Segmented Sentences:
1: भारत एक महान देश है।
2: यहाँ अनेक भाषाएँ बोली जाती हैं।

Word Tokens for Each Sentence:
Sentence: भारत एक महान देश है।
Tokens: ['भारत', 'एक', 'महान', 'देश', 'है', '।']

Sentence: यहाँ अनेक भाषाएँ बोली जाती हैं।
Tokens: ['यहाँ', 'अनेक', 'भाषाएँ', 'बोली', 'जाती', 'हैं', '।']

Detected language: ta

Original Text: இந்தியாவில் பல மொழிகள் பேசப்படுகின்றன. இது ஒரு சிறப்பான நாடு.
Language Detected: ta

Segmented Sentences:
1: இந்தியாவில் பல மொழிகள் பேசப்படுகின்றன.
2: இது ஒரு சிறப்பான நாடு.

Word Tokens for Each Sentence:
Sentence: இந்தியாவில் பல மொழிகள் பேசப்படுகின்றன.
Tokens: ['இந்தியாவில்', 'பல', 'மொழிகள்', 'பேசப்படுகின்றன', '.']

Sentence: இது ஒரு சிறப்பான நாடு.
Tokens: ['இது', 'ஒரு', 'சிறப்பான', 'நாடு', '.']

Detected language: hi

Original Text: India is a multilingual country. यहाँ अनेक भाषाएँ बोली जाती हैं।
Language Detected: hi

Segmented Sentences:


# Advanced Features for the Pipeline

**Handling Mixed-Language Texts:**

*   Detect multiple languages in a single text.
*   Apply appropriate tokenization rules based on detected segments.

In [None]:
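import re

from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # Make langdetect's results reproducible

def process_mixed_language_text(text):
    detected_segments = []

    # Split on sentence-ending punctuation, including the Devanagari danda
    for sentence in re.split(r'[।.!?]', text):
        sentence = sentence.strip()
        if sentence:
            try:
                lang = detect(sentence)
            except LangDetectException:
                # Raised when a segment has no usable features (e.g. only digits)
                lang = 'unknown'
            detected_segments.append((sentence, lang))

    print("\nDetected Language Segments:")
    for segment, lang in detected_segments:
        print(f"[{lang}] {segment}")

    # Run the full pipeline on each segment; process_text detects the language again
    for segment, _ in detected_segments:
        process_text(segment)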
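Statistical language detection can be unreliable on short segments, so a complementary approach is to group tokens by Unicode script before choosing a tokenizer. The sketch below is one possible way to do that with the standard library; the block ranges cover only three scripts, and `script_of` and `split_by_script` are hypothetical helper names.

```python
from itertools import groupby

# Unicode block ranges for a few Indic scripts (not exhaustive)
SCRIPT_RANGES = {
    "devanagari": (0x0900, 0x097F),
    "bengali": (0x0980, 0x09FF),
    "tamil": (0x0B80, 0x0BFF),
}

def script_of(token):
    # Classify a token by the first character that belongs to a known script
    for ch in token:
        cp = ord(ch)
        for name, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                return name
        if ch.isascii() and ch.isalpha():
            return "latin"
    return "other"

def split_by_script(text):
    tagged = [(script_of(tok), tok) for tok in text.split()]
    # Merge consecutive tokens that share a script into one segment
    return [
        (script, " ".join(tok for _, tok in group))
        for script, group in groupby(tagged, key=lambda pair: pair[0])
    ]

segments = split_by_script("India is diverse. भारत महान देश है।")
print(segments)
```

Script identifies the writing system, not the language: Hindi and Marathi both use Devanagari, so a language detector may still be needed within a script segment.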


# Challenges and How the Pipeline Addresses Them
**Challenges:**

1.   **Agglutination:**
     *   Use subword tokenization (e.g., Byte Pair Encoding).
     *   Indic NLP Library's morphological analysis can help with highly agglutinative words.
2.   **Ambiguity in Segmentation:**
     *   Example: quoted text or nested punctuation.
     *   Use regex and context-aware sentence splitting.
3.   **Mixed-Language Contexts:**
     *   Detect and process each language segment independently.
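For the segmentation ambiguity caused by quoted text, a small amount of context tracking goes a long way. The sketch below ignores sentence terminators that occur inside double quotes. It assumes straight double quotes that are properly paired, and `split_sentences_quote_aware` is a hypothetical helper, not a library function.

```python
def split_sentences_quote_aware(text, terminators="।.!?"):
    sentences, current, in_quote = [], [], False
    for ch in text:
        current.append(ch)
        if ch == '"':
            # Toggle quote state; terminators inside quotes do not end a sentence
            in_quote = not in_quote
        elif ch in terminators and not in_quote:
            sentences.append("".join(current).strip())
            current = []
    # Keep any trailing text that has no terminator
    tail = "".join(current).strip()
    if tail:
        sentences.append(tail)
    return sentences

quoted = 'उसने कहा, "रुको। मैं आता हूँ।" और चला गया। सब चुप थे।'
quote_aware = split_sentences_quote_aware(quoted)
print(quote_aware)
```

A plain split on "।" would cut this text into four pieces, two of them in the middle of the quotation.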

# Exercises for Practice
**1. Tokenization and Segmentation Practice:**

     Input: "भारत महान देश है। India is diverse."
     Output: Provide tokens and sentences for both Hindi and English parts.

**2. Error Identification and Handling:**

     Input: Tamil text with typos: "இந்தியாவின் பல மொழிகள் பேசப் பட்டுகின்றன."
     Task: Identify and correct tokenization errors.

**3. Experiment with Pretrained Models:**

     Load IndicBERT and test it on multilingual Indian texts. Compare its tokenization with that of Indic NLP Library.


# Solution for Exercise 2



In [None]:
pip install indic-nlp-library

In [None]:
from indicnlp.tokenize import indic_tokenize

# Input text with a typo: the verb "பேசப்படுகின்றன" is misspelled and split in two
text = "இந்தியாவின் பல மொழிகள் பேசப் பட்டுகின்றன."

# Tokenize the text using indic_nlp_library
tokens = list(indic_tokenize.trivial_tokenize(text, lang='ta'))

# The tokenizer splits on whitespace, so the typo produces two tokens.
# For demonstration, a hand-written lookup maps the known bad token pair
# back to the intended single word.
corrections = {("பேசப்", "பட்டுகின்றன"): "பேசப்படுகின்றன"}

corrected_tokens = []
i = 0
while i < len(tokens):
    pair = tuple(tokens[i:i + 2])
    if pair in corrections:
        corrected_tokens.append(corrections[pair])
        i += 2  # skip both halves of the merged pair
    else:
        corrected_tokens.append(tokens[i])
        i += 1

# Display original and corrected tokens
print("Original Tokens:", tokens)
print("Corrected Tokens:", corrected_tokens)


Original Tokens: ['இந்தியாவின்', 'பல', 'மொழிகள்', 'பேசப்', 'பட்டுகின்றன', '.']
Corrected Tokens: ['இந்தியாவின்', 'பல', 'மொழிகள்', 'பேசப்படுகின்றன', '.']


# Solution for Exercise 1

In [None]:
!pip install indic-nlp-library spacy
!python -m spacy download en_core_web_sm


In [None]:
from indicnlp.tokenize import indic_tokenize
import spacy

# Load spaCy model for English
nlp = spacy.load("en_core_web_sm")

# Input text
text = "भारत महान देश है। India is diverse."

# Split text into sentences manually
sentences = text.split("।")

# Process Hindi sentence using Indic NLP Library
hindi_sentence = sentences[0].strip() + "।"  # Add the period back
hindi_tokens = list(indic_tokenize.trivial_tokenize(hindi_sentence, lang='hi'))

# Process English sentence using spaCy
english_sentence = sentences[1].strip() if len(sentences) > 1 else ""
english_doc = nlp(english_sentence)
english_tokens = [token.text for token in english_doc]

# Output results
print("Hindi Sentence:", hindi_sentence)
print("Hindi Tokens:", hindi_tokens)
print("\nEnglish Sentence:", english_sentence)
print("English Tokens:", english_tokens)


Hindi Sentence: भारत महान देश है।
Hindi Tokens: ['भारत', 'महान', 'देश', 'है', '।']

English Sentence: India is diverse.
English Tokens: ['India', 'is', 'diverse', '.']


# Solution for Exercise 3

In [None]:
pip install transformers indic-nlp-library


In [None]:
from transformers import AutoTokenizer
from indicnlp.tokenize import indic_tokenize

# Load the IndicBERT tokenizer (the model itself is not needed to compare tokenizations)
model_name = "ai4bharat/indic-bert"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example multilingual Indian text
texts = {
    "tamil": "இந்தியாவின் பல மொழிகள் பேசப்படுகின்றன.",
    "hindi": "भारत में कई भाषाएँ बोली जाती हैं।",
    "bengali": "ভারতে অনেক ভাষা কথা বলা হয়।"
}

# Tokenization using Indic NLP Library
def tokenize_indic_nlp(text, lang):
    return list(indic_tokenize.trivial_tokenize(text, lang=lang))

# Tokenization using IndicBERT
def tokenize_indic_bert(text):
    return tokenizer.tokenize(text)

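# Map the dictionary keys to the language codes expected by Indic NLP Library
# (slicing the key would turn "bengali" into "be" instead of "bn")
LANG_CODES = {"tamil": "ta", "hindi": "hi", "bengali": "bn"}

# Compare results
for lang, text in texts.items():
    print(f"\nLanguage: {lang.capitalize()}")
    print(f"Original Text: {text}")

    # Indic NLP Library Tokenization
    indic_nlp_tokens = tokenize_indic_nlp(text, lang=LANG_CODES[lang])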
    print(f"Indic NLP Library Tokens: {indic_nlp_tokens}")

    # IndicBERT Tokenization
    indicbert_tokens = tokenize_indic_bert(text)
    print(f"IndicBERT Tokens: {indicbert_tokens}")



Language: Tamil
Original Text: இந்தியாவின் பல மொழிகள் பேசப்படுகின்றன.
Indic NLP Library Tokens: ['இந்தியாவின்', 'பல', 'மொழிகள்', 'பேசப்படுகின்றன', '.']
IndicBERT Tokens: ['▁இ', 'ந', 'தய', 'வன', '▁பல', '▁மழ', 'கள', '▁பச', 'ப', 'பட', 'கன', 'றன', '.']

Language: Hindi
Original Text: भारत में कई भाषाएँ बोली जाती हैं।
Indic NLP Library Tokens: ['भारत', 'में', 'कई', 'भाषाएँ', 'बोली', 'जाती', 'हैं', '।']
IndicBERT Tokens: ['▁भरत', '▁म', '▁कई', '▁भ', 'ष', 'ए', '▁बल', '▁जत', '▁ह', '।']

Language: Bengali
Original Text: ভারতে অনেক ভাষা কথা বলা হয়।
Indic NLP Library Tokens: ['ভারতে', 'অনেক', 'ভাষা', 'কথা', 'বলা', 'হয়', '।']
IndicBERT Tokens: ['▁ভরত', '▁অন', 'ক', '▁ভ', 'ষ', '▁কথ', '▁বল', '▁হ', 'য', '।']


