<font>
<div dir=ltr align=center>
<img src='https://cdn.freebiesupply.com/logos/large/2x/sharif-logo-png-transparent.png' width=150 height=150> <br>
<font color=0F5298 size=6>
Natural Language Processing<br>
<font color=2565AE size=4>
Computer Engineering Department<br>
Spring 2025<br>
<font color=3C99D size=4>
Workshop 1 - NLP Frameworks - Hazm<br>
<font color=696880 size=3>
<a href='https://language.ml'>https://language.ml</a><br>
info [AT] language [dot] ml

# 📖 Part 1: Introduction

## ❓ What is Hazm? ([Hazm Official Website](https://www.roshan-ai.ir/hazm/), [GitHub](https://github.com/roshan-research/hazm))

**Hazm** is a Python library for **Persian (Farsi) natural language processing**, offering tools tailored to Persian script and grammar. Key features include:

- **Text normalization** (handling different Unicode variants, diacritics, etc.)
- **Sentence & word tokenization** optimized for Persian
- **Stemming** and **lemmatization** for Persian morphology
- **POS tagging** and **dependency parsing** trained on Persian corpora

---

## ✅ When to Use Hazm

- You need **accurate Persian preprocessing** (normalization, tokenization).
- You’re building **classical NLP pipelines** (stemming, POS tagging) for Persian text.
- You require **dependency parses** of Persian sentences.
- You want a **lightweight** library without heavy ML frameworks.

---

## 🚫 When Not to Use Hazm

- You need **deep learning** or **transformer-based models** for Persian (use 🤗 Hugging Face’s Persian models instead).
- You require **multilingual pipelines**—Hazm is Persian-only.
- You need **real-time, high-throughput** production systems at scale (consider spaCy + custom Persian models).
- You want **entity linking** or **semantic role labeling**—Hazm focuses on core NLP tasks only.

## ⚙️ Installation & Setup

In [22]:
# Needs Python 3.11 or older
!python --version

Python 3.10.16


In [23]:
!pip install hazm
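
Before installing, it can help to check the interpreter version programmatically. The sketch below takes the "3.11 or older" limit from the comment in the cell above; `MAX_SUPPORTED` and `is_supported` are illustrative names, not part of Hazm.

```python
import sys

# Assumption taken from the comment in the cell above: Python 3.11 or older.
MAX_SUPPORTED = (3, 11)


def is_supported(version=None):
    """Return True if the (major, minor) version is at most MAX_SUPPORTED."""
    major, minor = (version or sys.version_info)[:2]
    return (major, minor) <= MAX_SUPPORTED


print(is_supported())          # result depends on the running interpreter
print(is_supported((3, 10)))   # a 3.10 interpreter passes the check
print(is_supported((3, 12)))   # a 3.12 interpreter does not
```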



# 💡 Part 2: Preprocessing

In this section we’ll normalize raw Persian text and then split it into sentences and words using Hazm’s built-in tools.

## 📦 Imports

In [24]:
from hazm import Normalizer, sent_tokenize, word_tokenize

## 🔹 2.1 Text Normalization

Hazm’s `Normalizer` handles common Persian‐specific cleanup (Unicode variants, diacritics, punctuation unification, etc.).

In [25]:
normalizer = Normalizer()

raw_text = 'سلام!  چطورید؟   این متن، شامل  چند  فاصلهٔ اضافه  و نیم‌فاصله‌است.'

norm_text = normalizer.normalize(raw_text)
print(norm_text)

سلام! چطورید؟ این متن، شامل چند فاصله اضافه و نیم‌فاصله‌است.
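
To get a feel for what this kind of cleanup involves, here is a rough, stdlib-only approximation of three typical steps (character unification, diacritic removal, whitespace collapsing). It is a sketch for intuition only: `rough_normalize` is a hypothetical helper, not Hazm's implementation, and Hazm's `Normalizer` does considerably more.

```python
import re
import unicodedata

# Arabic kaf (U+0643) and yeh (U+064A) mapped to Persian kaf (U+06A9) and yeh (U+06CC).
ARABIC_TO_PERSIAN = str.maketrans({'\u0643': '\u06a9', '\u064a': '\u06cc'})


def rough_normalize(text: str) -> str:
    """Approximate a few normalization steps with the standard library only."""
    text = text.translate(ARABIC_TO_PERSIAN)
    # Drop combining marks (Unicode category Mn), which covers Arabic-script diacritics.
    text = ''.join(ch for ch in text if unicodedata.category(ch) != 'Mn')
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r'[ \t]+', ' ', text)
    return text.strip()


# Arabic-kaf "ketab", extra spaces, then Arabic-yeh + kasra + Persian kaf ("yek").
messy = '\u0643\u062a\u0627\u0628   \u064a\u0650\u06a9'
print(rough_normalize(messy))
```

The zero-width non-joiner (U+200C) belongs to category `Cf`, so this sketch leaves half-spaces untouched.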


## 🔹 2.2 Sentence Tokenization

Use `sent_tokenize` to split the normalized text into sentences.

In [26]:
sentences = sent_tokenize(norm_text)

for sent in sentences:
    print(sent)

سلام!
چطورید؟
این متن، شامل چند فاصله اضافه و نیم‌فاصله‌است.


## 🔹 2.3 Word Tokenization
Use `word_tokenize` to split each sentence into word tokens.

In [27]:
for sent in sentences:
    tokens = word_tokenize(sent)
    print(tokens)

['سلام', '!']
['چطورید', '؟']
['این', 'متن', '،', 'شامل', 'چند', 'فاصله', 'اضافه', 'و', 'نیم\u200cفاصله\u200cاست', '.']
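
The `\u200c` in the printed list is the zero-width non-joiner (ZWNJ), the Persian half-space; Python shows it escaped because `print` on a list uses `repr`. A small sketch, using a few tokens copied from the output above, shows how to inspect it:

```python
ZWNJ = '\u200c'  # zero-width non-joiner, the Persian "half-space"

tokens = ['این', 'متن', 'نیم\u200cفاصله\u200cاست', '.']

for tok in tokens:
    # Printing the string itself renders the ZWNJ instead of escaping it.
    print(tok, tok.count(ZWNJ))

# Splitting on ZWNJ exposes the parts that the tokenizer kept together as one token.
parts = tokens[2].split(ZWNJ)
print(parts)
```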


## ⭐ Tip — Customize the Normalizer

Suppose you want to **keep Western digits** and **diacritics**. You can customize the `Normalizer` like this:

In [28]:
from hazm import Normalizer

raw_text = 'می‌روم؛ 123؟ این یِک مَتنِ آزمایشی‌ست.'

default_norm = Normalizer()
print(default_norm.normalize(raw_text))

custom_norm = Normalizer(
    remove_diacritics=False,
    persian_numbers=False,
)
print(custom_norm.normalize(raw_text))

می‌روم؛ ۱۲۳؟ این یک متن آزمایشی‌ست.
می‌روم؛ 123؟ این یِک مَتنِ آزمایشی‌ست.
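
The digit conversion visible in the first output line (`123` becoming `۱۲۳`) can be pictured as a simple character mapping. The following is a minimal sketch of that idea with `str.translate`; the table names are illustrative and this is not how Hazm is implemented internally.

```python
WESTERN = '0123456789'
# Extended Arabic-Indic digits U+06F0..U+06F9, the ones used in Persian text.
PERSIAN = ''.join(chr(0x06F0 + i) for i in range(10))

TO_PERSIAN = str.maketrans(WESTERN, PERSIAN)
TO_WESTERN = str.maketrans(PERSIAN, WESTERN)

converted = '123'.translate(TO_PERSIAN)
print(converted)
print(converted.translate(TO_WESTERN))
```

Keeping a reverse table around is handy when downstream code (for example `int()` parsing or regexes written for ASCII digits) expects Western digits.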


# 🧠 Part 3: Stemming & Lemmatization

In this part, we’ll reduce words to their base forms using Hazm’s **Stemmer** and **Lemmatizer**. Stemmer is rule-based (heuristic), while Lemmatizer uses a dictionary for more accurate results.

## 📦 Imports

In [29]:
from hazm import Stemmer, Lemmatizer, word_tokenize

## 🔹 3.1 Stemming

In [30]:
stemmer = Stemmer()

words = ['می‌روم', 'رفتن', 'رفتند', 'دوستان', 'کتاب‌ها']

stems = [stemmer.stem(w) for w in words]
print(words)  # Original
print(stems)  # Stemmed

['می\u200cروم', 'رفتن', 'رفتند', 'دوستان', 'کتاب\u200cها']
['می\u200cرو', 'رفتن', 'رفتند', 'دوس', 'کتاب']


## 🔹 3.2 Lemmatization

In [31]:
lemmatizer = Lemmatizer()

words = ['می‌روم', 'رفتن', 'رفتند', 'دوستان', 'کتاب‌ها']

lemmas = [lemmatizer.lemmatize(w) for w in words]
print(words)  # Original
print(lemmas)  # Lemmatized

['می\u200cروم', 'رفتن', 'رفتند', 'دوستان', 'کتاب\u200cها']
['رفت#رو', 'رفتن', 'رفت#رو', 'دوستان', 'کتاب']


### 🔍 When to Use Which?

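- **Stemmer** is fast and rule-based (it strips common Persian suffixes) but may over- or under-stem, as with `دوستان` → `دوس` and the unchanged `رفتند` above.
- **Lemmatizer** is more accurate for Persian morphology but depends on its dictionary; verbs come back as two stems joined by `#` (e.g. `رفت#رو`).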
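
Verb lemmas such as `رفت#رو` pack two stems into one string: the past stem before `#` and the present stem after it. If downstream code needs a single form, a small helper can split them. This is a sketch; `split_verb_lemma` is a hypothetical name, not a Hazm function.

```python
def split_verb_lemma(lemma: str):
    """Return (past_stem, present_stem) for 'past#present' lemmas, else None."""
    if '#' not in lemma:
        return None  # not a verb-style lemma
    past, present = lemma.split('#', 1)
    return past, present


for lemma in ['رفت#رو', '#هست', 'کتاب']:
    print(lemma, split_verb_lemma(lemma))
```

Note that one side can be empty, as in `#هست`, so callers should not assume both stems exist.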

# ✍️ Part 4: POS Tagging & Chunking

## 📦 Imports & Model Download

Download the pretrained "POSTagger" model from [Hazm’s GitHub Pretrained-Models section](https://github.com/roshan-research/hazm?tab=readme-ov-file#pretrained-models) and save it in:<br>
`resources/pos_tagger.model`

Download the pretrained "Chunker" model from [Hazm’s GitHub Pretrained-Models section](https://github.com/roshan-research/hazm?tab=readme-ov-file#pretrained-models) and save it in:<br>
`resources/chunker.model`

In [32]:
from hazm import POSTagger, Chunker, word_tokenize, tree2brackets

## 🔹 4.1 POS Tagging

In [33]:
tagger = POSTagger(model='resources/pos_tagger.model')

sentence = 'من دوست دارم زبان فارسی را بهتر یاد بگیرم.'
tokens = word_tokenize(sentence)

tags = tagger.tag(tokens)
print(tags)

[('من', 'PRON'), ('دوست', 'NOUN'), ('دارم', 'VERB'), ('زبان', 'NOUN,EZ'), ('فارسی', 'NOUN'), ('را', 'ADP'), ('بهتر', 'ADJ'), ('یاد', 'NOUN'), ('بگیرم', 'VERB'), ('.', 'PUNCT')]
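
Some tags in this output carry a suffix, as in `NOUN,EZ`, where `EZ` marks a word that takes the ezafe. When you only care about the base category, you can split the tag. A minimal sketch, with `split_tag` as a hypothetical helper and a few pairs copied from the output above:

```python
tags = [('زبان', 'NOUN,EZ'), ('فارسی', 'NOUN'), ('را', 'ADP')]


def split_tag(tag: str):
    """Split a tag like 'NOUN,EZ' into its base tag and an ezafe flag."""
    base, *flags = tag.split(',')
    return base, 'EZ' in flags


for word, tag in tags:
    base, has_ezafe = split_tag(tag)
    print(word, base, has_ezafe)
```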


## 🔹 4.2 Chunking

In [34]:
chunker = Chunker(model='resources/chunker.model')
chunk_tree = chunker.parse(tags)

In [35]:
print(chunk_tree)

(S
  (NP من/PRON)
  (VP دوست/NOUN دارم/VERB)
  (NP زبان/NOUN,EZ فارسی/NOUN)
  (POSTP را/ADP)
  (ADJP بهتر/ADJ)
  (VP یاد/NOUN بگیرم/VERB)
  ./PUNCT)


In [36]:
tree2brackets(chunk_tree)

'[من NP] [دوست دارم VP] [زبان فارسی NP] [را POSTP] [بهتر ADJP] [یاد بگیرم VP] .'

> `NP` denotes noun phrases, `VP` verb phrases, `ADJP` adjective phrases, and `POSTP` postpositional phrases, giving you easy access to phrase-level units for further analysis.
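
If you only have the bracketed string, the phrase-level units can be recovered with a regular expression. The sketch below works on the string printed above; the pattern assumes labels are uppercase ASCII and that phrases contain no square brackets.

```python
import re

bracketed = '[من NP] [دوست دارم VP] [زبان فارسی NP] [را POSTP] [بهتر ADJP] [یاد بگیرم VP] .'

# Group 1: phrase text, group 2: chunk label (the last space-separated item).
CHUNK = re.compile(r'\[([^\[\]]+) ([A-Z]+)\]')

chunks = [(m.group(1), m.group(2)) for m in CHUNK.finditer(bracketed)]
noun_phrases = [text for text, label in chunks if label == 'NP']

print(chunks)
print(noun_phrases)
```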

# 🌳 Part 5: Dependency Parsing

## 📦 Imports & Model Download

Download the pretrained "DependencyParser" model from [Hazm’s GitHub Pretrained-Models section](https://github.com/roshan-research/hazm?tab=readme-ov-file#pretrained-models) and save it as:

`resources/universal_dependency_parser`

In [37]:
from hazm import POSTagger, Lemmatizer, DependencyParser, word_tokenize

## 🔹 5.1 Initialization

In [38]:
tagger = POSTagger(model='resources/pos_tagger.model')
lemmatizer = Lemmatizer()
parser = DependencyParser(tagger=tagger, lemmatizer=lemmatizer, working_dir='resources/universal_dependency_parser')

## 🔹 5.2 Parse a Sentence

In [39]:
sentence = 'من دوست دارم زبان فارسی را بهتر یاد بگیرم.'
tokens = word_tokenize(sentence)

graph = parser.parse(tokens)
print(graph)

defaultdict(<function DependencyGraph.__init__.<locals>.<lambda> at 0x364cb3be0>,
            {0: {'address': 0,
                 'ctag': 'TOP',
                 'deps': defaultdict(<class 'list'>, {'root': [3], 'ROOT': []}),
                 'feats': None,
                 'head': None,
                 'lemma': None,
                 'rel': None,
                 'tag': 'TOP',
                 'word': None},
             1: {'address': 1,
                 'ctag': 'PRON',
                 'deps': defaultdict(<class 'list'>, {}),
                 'feats': '_',
                 'head': 3,
                 'lemma': 'من',
                 'rel': 'nsubj',
                 'tag': 'PRON',
                 'word': 'من'},
             2: {'address': 2,
                 'ctag': 'NOUN',
                 'deps': defaultdict(<class 'list'>, {}),
                 'feats': '_',
                 'head': 3,
                 'lemma': 'دوست',
                 'rel': 'compound',
                 'tag': '



### 📖 Quick Guide: What’s in `graph.nodes[idx]`?

Each node is a Python dict with these keys:

| Key       | Meaning                                                                                           |
|-----------|---------------------------------------------------------------------------------------------------|
| `address` | 1-based position of the token in the sentence (0 = artificial ROOT)                               |
| `word`    | Surface form of the token                                                                         |
| `lemma`   | Lemmatized form (from Hazm’s lemmatizer)                                                          |
| `tag`     | POS tag assigned by the tagger; in the output shown here it equals `ctag` (e.g. `PRON`)           |
| `ctag`    | Coarse POS tag (Universal POS) e.g. `NOUN`, `VERB`, `ADJ`, `ADP`, `PRON`                          |
| `feats`   | Morphological features string (often `_` if unavailable) e.g. `Number=Sing\|Person=1\|Tense=Past` |
| `head`    | Address of the syntactic head (0 if this token is the root)                                       |
| `rel`     | Dependency relation to the head (e.g., `nsubj`, `obj`, `advmod`)                                  |
| `deps`    | Dict mapping each child-relation label → list of child addresses                                  |

In [40]:
node = graph.nodes[1]
node

{'address': 1,
 'word': 'من',
 'lemma': 'من',
 'ctag': 'PRON',
 'tag': 'PRON',
 'feats': '_',
 'head': 3,
 'deps': defaultdict(list, {}),
 'rel': 'nsubj'}
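
To read the parse as a list of (dependent, relation, head) edges, you can walk the node dicts and follow each `head` address. The sketch below uses a hand-written stand-in for `graph.nodes` so it runs without the parser: node 1 is copied from the output above, while nodes 2 and 3 are partly reconstructed from the truncated printout and the token order, so treat them as illustrative.

```python
# Stand-in for graph.nodes, reduced to the three keys used here.
nodes = {
    0: {'word': None, 'head': None, 'rel': None},
    1: {'word': 'من', 'head': 3, 'rel': 'nsubj'},
    2: {'word': 'دوست', 'head': 3, 'rel': 'compound'},
    3: {'word': 'دارم', 'head': 0, 'rel': 'root'},
}


def edges(nodes):
    """Return (word, relation, head_word) triples, skipping the artificial root."""
    out = []
    for address in sorted(nodes):
        node = nodes[address]
        if node['head'] is None:
            continue  # node 0 has no head
        head_word = nodes[node['head']]['word'] or 'ROOT'
        out.append((node['word'], node['rel'], head_word))
    return out


for edge in edges(nodes):
    print(edge)
```

The same function should work on a real `graph.nodes` mapping, since it only relies on the `word`, `head`, and `rel` keys described in the table.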

# 🔚 Part 6: Summary

## ✅ What We’ve Covered in Hazm

| Component                 | Hazm API                                                         | Notes                                                |
|---------------------------|------------------------------------------------------------------|------------------------------------------------------|
| **Normalization**         | `Normalizer().normalize(text)`                                   | Unicode variants, diacritics, spacing, numbers       |
| **Sentence Tokenization** | `sent_tokenize(text)`                                            | Persian‐specific rules                               |
| **Word Tokenization**     | `word_tokenize(sentence)`                                        | Handles Persian clitics and punctuation              |
| **Stemming**              | `Stemmer().stem(word)`                                           | Fast, rule‐based                                     |
| **Lemmatization**         | `Lemmatizer().lemmatize(word)`                                   | Dictionary‐based, more accurate                      |
| **POS Tagging**           | `POSTagger(model='resources/pos_tagger.model')` → `.tag(tokens)` | Universal POS tags here, `,EZ` suffix marks ezafe    |
| **Chunking**              | `Chunker(model='resources/chunker.model').parse(tags)`           | Parses POS tags into phrases                         |
| **Dependency Parsing**    | `DependencyParser(...).parse(tokens)`                            | MaltParser under the hood, returns `DependencyGraph` |

# 🧪 Part 7: Integrated Pipeline Exercise

Instead of running each NLP step separately, let’s build one **`preprocess_hazm`** function that:

1. **Normalizes** raw text
2. **Sentence-tokenizes** & **word-tokenizes**
3. **Removes punctuation**
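4. **POS-filters** (optional: keeps only the tags listed in `allowed_pos`, e.g. nouns, verbs, adjectives, adverbs)
5. **Stems** or **lemmatizes** (configurable)
6. **Lowercases** the final tokens (this only affects embedded Latin-script words, since Persian script has no letter case)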

Then we’ll wrap it in a simple class and compare performance using stemming vs. lemmatization.


### 🔹 7.1 Write `preprocess_hazm`

In [41]:
from hazm import Normalizer, sent_tokenize, word_tokenize, POSTagger, Stemmer, Lemmatizer
import string

# 1. Instantiate components
normalizer = Normalizer()
postagger = POSTagger(model='resources/pos_tagger.model')
stemmer = Stemmer()
lemmatizer = Lemmatizer()

# 2. Define punctuation set (including Persian marks)
PUNCT = set(string.punctuation + '؟،؛…«»')


def preprocess_hazm(text: str, use_lemmatizer: bool = True, allowed_pos: list[str] | None = None) -> list[str]:
    """
    - Normalize text
    - Tokenize into sentences and words
    - Remove punctuation tokens
    - If allowed_pos is provided, filter tokens by those POS tags
    - Otherwise keep all tokens
    - Lemmatize or stem
    - Lowercase
    """
    # Your code here
    pass

In [42]:
sample = 'سلام! من عاشق پردازش زبان طبیعی با هضم هستم.'
print(preprocess_hazm(sample, use_lemmatizer=True))
print(preprocess_hazm(sample, use_lemmatizer=False))
# Expected:
    # ['سلام', 'من', 'عاشق', 'پردازش', 'زبان', 'طبیعی', 'با', 'هضم', '#هست']
    # ['سلا', 'من', 'عاشق', 'پرداز', 'زب', 'طبیع', 'با', 'هض', 'هس']

None
None


### ✅ Answer

In [43]:
from hazm import Normalizer, sent_tokenize, word_tokenize, POSTagger, Stemmer, Lemmatizer
import string

# 1. Instantiate components
normalizer = Normalizer()
postagger = POSTagger(model='resources/pos_tagger.model')
stemmer = Stemmer()
lemmatizer = Lemmatizer()

# 2. Define punctuation set (including Persian marks)
PUNCT = set(string.punctuation + '؟،؛…«»')


def preprocess_hazm(text: str, use_lemmatizer: bool = True, allowed_pos: list[str] | None = None) -> list[str]:
    """
    - Normalize text
    - Tokenize into sentences and words
    - Remove punctuation tokens
    - If allowed_pos is provided, filter tokens by those POS tags
    - Otherwise keep all tokens
    - Lemmatize or stem
    - Lowercase
    """
    # Normalize
    normed = normalizer.normalize(text)

    # Tokenize, skipping punctuation tokens
    tokens = []
    for sent in sent_tokenize(normed):
        for tok in word_tokenize(sent):
            if tok in PUNCT:
                continue
            tokens.append(tok)

    # POS tagging
    tagged = postagger.tag(tokens)

    # Filter by POS if requested; also compare the base tag so that
    # a tag like 'NOUN,EZ' still matches an allowed 'NOUN'
    if allowed_pos is not None:
        allowed = set(allowed_pos)
        tagged = [(w, tag) for w, tag in tagged if tag in allowed or tag.split(',')[0] in allowed]

    # Extract words
    words = [w for w, _ in tagged]

    # Lemmatize or stem
    if use_lemmatizer:
        processed = [lemmatizer.lemmatize(w) for w in words]
    else:
        processed = [stemmer.stem(w) for w in words]

    # Lowercase (only affects Latin-script tokens)
    return [w.lower() for w in processed]

In [44]:
sample = 'سلام! من عاشق پردازش زبان طبیعی با هضم هستم.'
print(preprocess_hazm(sample, use_lemmatizer=True))
print(preprocess_hazm(sample, use_lemmatizer=False))

['سلام', 'من', 'عاشق', 'پردازش', 'زبان', 'طبیعی', 'با', 'هضم', '#هست']
['سلا', 'من', 'عاشق', 'پرداز', 'زب', 'طبیع', 'با', 'هض', 'هس']


## 🔹 7.2 Wrap as a “Pipeline” Class

In [45]:
class HazmPreprocessor:
    def __init__(self, use_lemmatizer: bool = True, allowed_pos: list[str] | None = None):
        self.use_lem = use_lemmatizer
        self.allowed_pos = allowed_pos

    def __call__(self, text: str) -> list[str]:
        return preprocess_hazm(text, use_lemmatizer=self.use_lem, allowed_pos=self.allowed_pos)

In [46]:
pp = HazmPreprocessor(use_lemmatizer=True)
print(pp(sample))

['سلام', 'من', 'عاشق', 'پردازش', 'زبان', 'طبیعی', 'با', 'هضم', '#هست']


## 🔹 7.3 Performance Comparison
Measure throughput for a batch of sentences with lemmatization vs. stemming:

In [47]:
import time

docs = [sample] * 500  # repeat sample 500×


def time_it(preprocessor, docs):
    start = time.perf_counter()

    for d in docs:
        preprocessor(d)

    return time.perf_counter() - start


pp_lem = HazmPreprocessor(use_lemmatizer=True)
pp_stem = HazmPreprocessor(use_lemmatizer=False)

t_lem = time_it(pp_lem, docs)
t_stem = time_it(pp_stem, docs)

print(f'Lemmatizer time: {t_lem:.3f}s')
print(f'Stemmer time:    {t_stem:.3f}s')

Lemmatizer time: 0.058s
Stemmer time:    0.058s
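
A single timed pass over a small batch can be noisy, so the two numbers above may not separate the variants reliably. One option is to repeat the measurement and keep the fastest run, for example with the standard `timeit` module. In this sketch `best_time` is a hypothetical helper and `fake_preprocess` is a stand-in so the block runs without Hazm; in the notebook you would pass `pp_lem` or `pp_stem` instead.

```python
import timeit


def best_time(func, arg, repeat=5, number=100):
    """Return the fastest per-call time in seconds over several repeats."""
    runs = timeit.repeat(lambda: func(arg), repeat=repeat, number=number)
    return min(runs) / number


def fake_preprocess(text):
    # Stand-in for a HazmPreprocessor instance (assumption for this sketch).
    return text.split()


per_call = best_time(fake_preprocess, 'a b c')
print(f'{per_call * 1e6:.2f} µs per call')
```

Taking the minimum rather than the mean reduces the influence of background activity on the machine.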
