<font>
<div dir=ltr align=center>
<img src='https://cdn.freebiesupply.com/logos/large/2x/sharif-logo-png-transparent.png' width=150 height=150> <br>
<font color=0F5298 size=6>
Natural Language Processing<br>
<font color=2565AE size=4>
Computer Engineering Department<br>
Spring 2025<br>
<font color=3C99D size=4>
Workshop 1 - NLP Frameworks - Parsivar<br>
<font color=696880 size=3>
<a href='https://language.ml'>https://language.ml</a><br>
info [AT] language [dot] ml
</font></font></font></font>
</div>
</font>

# 📖 Part 1: Introduction

## ❓ What is Parsivar? ([GitHub](https://github.com/ICTRC/Parsivar))

**Parsivar** is an open-source Python library built specifically for **Persian (Farsi) NLP**.
It offers ready-made components for:

- **Normalization** and orthographic fixes
- **Sentence / word tokenization**
- **Stemming** (rule-based)
- **Part-of-speech tagging** (Wapiti-based)
- **Dependency parsing** (UD-style)
- **Spell checking / autocorrection**

*(Parsivar does **not** ship a lemmatizer or NER in its current release.)*

---

## ✅ When to Use Parsivar

- You need a **quick, self-contained Persian NLP** pipeline in pure Python.
- You want features **not bundled in Hazm** (e.g. built-in spell correction).
- You prefer **rule + ML hybrids** instead of heavier deep-learning models.
- You’re preprocessing Persian text for classical ML or linguistic analysis.

---

## 🚫 When Not to Use Parsivar

- You need **transformer-based** embeddings or state-of-the-art accuracy (use 🤗 Hugging Face models).
- You require **lemmatization, embeddings, or NER** (Parsivar doesn’t provide them).
- You’re building **multilingual** pipelines (Parsivar is Persian-only).
- You need **industrial throughput**: for very large-scale workloads, spaCy with custom Persian models is faster.

## ⚙️ Installation & Setup

In [1]:
!pip install parsivar

Collecting parsivar
  Downloading parsivar-0.2.3.1-py3-none-any.whl.metadata (242 bytes)
Downloading parsivar-0.2.3.1-py3-none-any.whl (18.0 MB)
Installing collected packages: parsivar
Successfully installed parsivar-0.2.3.1


## 💡 Part 2: Preprocessing

We’ll start by **normalizing** raw Persian text and then splitting it into **sentences** and **words** with Parsivar.

## 📦 Imports

In [17]:
from parsivar import Normalizer, Tokenizer

## 🔹 2.1 Normalization

In [56]:
text = "به گزارش ایسنا سمینارها ی شیمی آلی از امروز ۱۱ شهریور ۱۳۹۶ در دانشگاه علم و صنعت ایران آغاز به کار کردند. این سمینار تا ۱۳ شهریور ادامه می یابد."
normalizer = Normalizer()
print(normalizer.normalize(text))

به گزارش ایسنا سمینارها ی شیمی آلی از امروز 11 شهریور 1396 در دانشگاه علم و صنعت ایران آغاز به کار کردند . این سمینار تا 13 شهریور ادامه می‌یابد .
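
The output above shows the Persian digits (۱۱, ۱۳۹۶, ۱۳) rewritten as ASCII digits. As a rough illustration of that single step, and not Parsivar's actual implementation, the sketch below maps Persian digits with `str.translate`. The helper name `to_ascii_digits` is hypothetical.

```python
# Hypothetical helper: illustrates digit normalization only,
# not how Parsivar implements it internally.
PERSIAN_DIGITS = "۰۱۲۳۴۵۶۷۸۹"
DIGIT_TABLE = str.maketrans(PERSIAN_DIGITS, "0123456789")


def to_ascii_digits(text: str) -> str:
    """Replace Persian digits with their ASCII equivalents."""
    return text.translate(DIGIT_TABLE)


print(to_ascii_digits("۱۱ شهریور ۱۳۹۶"))  # 11 شهریور 1396
```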


In [57]:
normalizer = Normalizer(statistical_space_correction=True)
print(normalizer.normalize(text))

به گزارش ایسنا سمینارها‌ی شیمی آلی از امروز 11 شهریور 1396 در دانشگاه علم و صنعت ایران آغاز به کار کردند . این سمینار تا 13 شهریور ادامه می‌یابد . 


In [58]:
normalizer = Normalizer(date_normalizing_needed=True)
print(normalizer.normalize(text))

به گزارش ایسنا سمینارها ی شیمی آلی از امروز y1396m6d11 در دانشگاه علم و صنعت ایران آغاز به کار کردند . این سمینار تا y0m6d13 ادامه می‌یابد .
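
With `date_normalizing_needed=True`, dates in the output above are replaced by placeholders such as `y1396m6d11` and `y0m6d13`. Judging only from this one output, the pattern looks like `y<year>m<month>d<day>`, with `0` where the year was not stated. Under that assumption, a small sketch for pulling the placeholders back out (the helper `extract_dates` is hypothetical):

```python
import re

# Assumed placeholder shape, inferred from the output above: y<year>m<month>d<day>
DATE_TOKEN = re.compile(r"y(\d+)m(\d+)d(\d+)")


def extract_dates(normalized: str) -> list[tuple[int, ...]]:
    """Return (year, month, day) tuples for every date placeholder found."""
    return [tuple(int(g) for g in m.groups()) for m in DATE_TOKEN.finditer(normalized)]


line = "از امروز y1396m6d11 در دانشگاه آغاز به کار کردند . این سمینار تا y0m6d13 ادامه می‌یابد ."
print(extract_dates(line))  # [(1396, 6, 11), (0, 6, 13)]
```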


In [59]:
normalizer = Normalizer(pinglish_conversion_needed=True)
print(normalizer.normalize("farda asman abri ast."))

فردا اسمان ابری است .


## 🔹 2.2 Sentence Tokenization

In [60]:
tokenizer = Tokenizer()
normalizer = Normalizer()
sentences = tokenizer.tokenize_sentences(normalizer.normalize(text))

for sent in sentences:
    print(sent)

به گزارش ایسنا سمینارها ی شیمی آلی از امروز 11 شهریور 1396 در دانشگاه علم و صنعت ایران آغاز به کار کردند  .
 این سمینار تا 13 شهریور ادامه می‌یابد  .


## 🔹 2.3 Word Tokenization

In [61]:
for sent in sentences:
    words = tokenizer.tokenize_words(normalizer.normalize(sent))
    print(words)

['به', 'گزارش', 'ایسنا', 'سمینارها', 'ی', 'شیمی', 'آلی', 'از', 'امروز', '11', 'شهریور', '1396', 'در', 'دانشگاه', 'علم', 'و', 'صنعت', 'ایران', 'آغاز', 'به', 'کار', 'کردند', '.']
['این', 'سمینار', 'تا', '13', 'شهریور', 'ادامه', 'می\u200cیابد', '.']
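
The `\u200c` inside `'می\u200cیابد'` is the zero-width non-joiner (ZWNJ), the invisible character Persian uses to attach prefixes and suffixes without a visible space. The following stdlib-only snippet makes it visible; it does not depend on Parsivar.

```python
import unicodedata

ZWNJ = "\u200c"
word = "می\u200cیابد"

print(unicodedata.name(ZWNJ))  # ZERO WIDTH NON-JOINER
print(word.split(ZWNJ))        # ['می', 'یابد']
print(len(word))               # 7 (6 letters + 1 invisible ZWNJ)
```

Because the ZWNJ is part of the string, `'می\u200cیابد'` and `'می یابد'` are different tokens. This is one reason to normalize before comparing or counting words.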


# 🧠 Part 3: Morphology — Stemming

Parsivar provides a rule-based **stemmer** but **does not** include a lemmatizer.

If you need full lemmatization you’ll have to integrate another tool (e.g. Hazm’s `Lemmatizer`).

## 📦 Imports

In [None]:
from parsivar import FindStems, Tokenizer

## 🔹 3.1 Stemming

In [62]:
stemmer = FindStems()
tokenizer = Tokenizer()

sentence = 'کتاب‌های جدیدم را به دوستان خوبم نشان دادم.'
tokens = tokenizer.tokenize_words(sentence)

stems = [stemmer.convert_to_stem(t) for t in tokens]
print(tokens)
print(stems)

['کتاب\u200cهای', 'جدیدم', 'را', 'به', 'دوستان', 'خوبم', 'نشان', 'دادم.']
['کتاب', 'جدید', 'را', 'به', 'دوست', 'خوب', 'نشان', 'دادم.']
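
In the output above the last token is `'دادم.'`, with the period still attached. The sentence was tokenized without being normalized first, whereas in Part 2 the normalized text already had a space before the period. If you cannot normalize first, one workaround is to split trailing punctuation yourself. The helper below is a hypothetical, stdlib-only sketch and reuses the punctuation set defined in Part 8.

```python
import string

PUNCT = set(string.punctuation + "؟،؛«»…")


def split_trailing_punct(token: str) -> list[str]:
    """Split punctuation off the end of a token, e.g. 'دادم.' -> ['دادم', '.']."""
    end = len(token)
    while end > 0 and token[end - 1] in PUNCT:
        end -= 1
    # Nothing to split: token is all punctuation or has none at the end.
    if end == 0 or end == len(token):
        return [token]
    return [token[:end], token[end:]]


print(split_trailing_punct("دادم."))  # ['دادم', '.']
```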


# ✍️ Part 4: POS Tagging & Chunking

## 📦 Imports

Parsivar’s POS tagger uses the **Wapiti** sequence-labeling library under the hood. Parsivar itself was installed in Part 1, so before you run the tagger you only need to add the Wapiti binding (`libwapiti`).

In [25]:
!pip install libwapiti

Collecting libwapiti
  Downloading libwapiti-0.2.1.tar.gz (233 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: libwapiti
  Building wheel for libwapiti (setup.py) ... done
  Created wheel for libwapiti: filename=libwapiti-0.2.1-cp310-cp310-macosx_11_0_arm64.whl size=49039 sha256=6d4d50bf5d5f716bde41e58399bb2506bc54152c3a911c822257ba8ecd035392
  Stored in directory: /Users/AmirMohammad/Library/Caches/pip/wheels/9f/cb/30/fef48ecac051e433987eccdb5682900b4c00d44a4bcd4d4ec8
Successfully built libwapiti
Installing collected packages: libwapiti
Successfully installed libwapiti-0.2.1


In [35]:
from parsivar import Tokenizer, POSTagger, FindChunks

## 🔹 4.1 Tokenization & POS Tagging

In [34]:
# Instantiate tokenizer and tagger (use "wapiti" or "stanford")
tokenizer = Tokenizer()
tagger = POSTagger(tagging_model="wapiti")

sentence = 'من دوست دارم زبان فارسی را بهتر یاد بگیرم.'

tokens = tokenizer.tokenize_words(sentence)
tags = tagger.parse(tokens)

tags

[('من', 'PRO'),
 ('دوست', 'N_SING'),
 ('دارم', 'V_PRS'),
 ('زبان', 'N_SING'),
 ('فارسی', 'ADJ'),
 ('را', 'CLITIC'),
 ('بهتر', 'ADJ_CMPR'),
 ('یاد', 'N_SING'),
 ('بگیرم.', 'V_SUB')]

### 📋 Parsivar POS Tagset Reference

| Tag      | Category     | Meaning                              |
|----------|--------------|--------------------------------------|
| N_SING   | Noun         | Singular noun                        |
| N_PLUR   | Noun         | Plural noun                          |
| V_PRS    | Verb         | Present-tense verb                   |
| V_PST    | Verb         | Past-tense verb                      |
| V_SUB    | Verb         | Subjunctive verb                     |
| V_IMP    | Verb         | Imperative verb                      |
| ADJ      | Adjective    | Adjective                            |
| ADJ_CMPR | Adjective    | Comparative adjective                |
| ADV      | Adverb       | Adverb                               |
| PRO      | Pronoun      | Pronoun                              |
| CLITIC   | Clitic       | Clitic (e.g. the object marker `را`) |
| DET      | Determiner   | Determiner                           |
| NUM      | Numeral      | Numeral                              |
| PREP     | Preposition  | Preposition                          |
| CONJ     | Conjunction  | Conjunction                          |
| PUNC     | Punctuation  | Punctuation mark                     |
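
Many of these tags share a coarse prefix before the underscore (`N_SING`, `V_PRS`). If you only care about the coarse category, you can group on that prefix. This is a minimal sketch using the tagger output from 4.1 as literal data; `group_by_coarse_tag` is a hypothetical helper, not part of Parsivar.

```python
from collections import defaultdict

# Tagger output copied from section 4.1.
tagged_sentence = [
    ('من', 'PRO'), ('دوست', 'N_SING'), ('دارم', 'V_PRS'),
    ('زبان', 'N_SING'), ('فارسی', 'ADJ'), ('را', 'CLITIC'),
    ('بهتر', 'ADJ_CMPR'), ('یاد', 'N_SING'), ('بگیرم.', 'V_SUB'),
]


def group_by_coarse_tag(tagged):
    """Group words by the part of the tag before the first underscore."""
    groups = defaultdict(list)
    for word, tag in tagged:
        coarse = tag.split('_', 1)[0]
        groups[coarse].append(word)
    return dict(groups)


for coarse, words in group_by_coarse_tag(tagged_sentence).items():
    print(coarse, words)
```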

## 🔹 4.2 Phrase Chunking

In [73]:
chunker = FindChunks()
chunk_tree = chunker.chunk_sentence(tags)

In [74]:
print(chunk_tree)

(S
  من/PRO
  (VP (VP دوست/N_SING دارم/V_PRS))
  (NP زبان/N_SING فارسی/ADJ)
  را/CLITIC
  بهتر/ADJ_CMPR
  (VP (VP یاد/N_SING بگیرم./V_SUB)))


In [75]:
print(chunker.convert_nestedtree2rawstring(chunk_tree))

من [دوست دارم VP] [زبان فارسی NP] را بهتر [یاد بگیرم. VP]


> `NP` denotes noun phrases, `VP` verb phrases, giving you easy access to phrase-level units for further analysis.
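
If you want the chunks as data rather than as a printed string, one option is to parse the bracketed form shown above with a regular expression. The sketch assumes the flat `[words LABEL]` layout seen in this output, with no nested brackets; `extract_chunks` is a hypothetical helper.

```python
import re

# Matches "[<words> <LABEL>]" where the label is the last word inside the brackets.
CHUNK = re.compile(r"\[([^\[\]]+) (\w+)\]")


def extract_chunks(raw: str) -> list[tuple[str, str]]:
    """Return (label, phrase) pairs from a bracketed chunk string."""
    return [(label, phrase) for phrase, label in CHUNK.findall(raw)]


raw = "من [دوست دارم VP] [زبان فارسی NP] را بهتر [یاد بگیرم. VP]"
for label, phrase in extract_chunks(raw):
    print(label, "->", phrase)
```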

# 🌳 Part 5: Dependency Parsing

## 📦 Imports

In [63]:
from parsivar import Tokenizer, DependencyParser

## 🔹 5.1 Parse Sentences

In [67]:
tokenizer = Tokenizer()
parser = DependencyParser()

text = "به گزارش ایسنا سمینارها ی شیمی آلی از امروز ۱۱ شهریور ۱۳۹۶ در دانشگاه علم و صنعت ایران آغاز به کار کردند. این سمینار تا ۱۳ شهریور ادامه می یابد."

sentences = tokenizer.tokenize_sentences(text)
parsed_graphs = parser.parse_sents(sentences)

for g in parsed_graphs:
    print(g.tree())

(کردند
  (به (گزارش (ایسنا (سمینارها (ی (شیمی آلی))))))
  (از امروز)
  (شهریور ۱۱)
  (دانشگاه (در ۱۳۹۶) (علم (و (صنعت ایران))))
  آغاز
  (به کار)
  .)
(یابد (سمینار این) (تا (شهریور ۱۳)) ادامه می .)
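
In the bracketed trees printed above, each group starts with a head word followed by its dependents. To work with that structure in code, the sketch below reads the printed form into nested Python lists. It is a generic bracket reader written for illustration, not a Parsivar API, and it assumes that no token itself contains a parenthesis.

```python
def parse_bracketed(text: str) -> list:
    """Parse '(head child (head2 child2) ...)' into nested lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos: int):
        # tokens[pos] is '('; collect items until the matching ')'.
        node = []
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                child, pos = read(pos)
                node.append(child)
            else:
                node.append(tokens[pos])
                pos += 1
        return node, pos + 1

    tree, _ = read(0)
    return tree


sample_tree = "(یابد (سمینار این) (تا (شهریور ۱۳)) ادامه می .)"
tree = parse_bracketed(sample_tree)
print("root:", tree[0])
print("dependents:", tree[1:])
```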


# 🛠️ Part 6: Spell Correction

## 📦 Imports

In [68]:
from parsivar import SpellCheck

## 🔹 6.1 Correct Misspellings and Spacing

To use the spell checker, download its resources from [here](https://www.dropbox.com/s/tlyvnzv1ha9y1kl/spell.zip?dl=0), extract the archive, and copy the `spell/` directory into `parsivar/resource`. Without these files, `SpellCheck()` fails with an error like:<br>
`FileNotFoundError: [Errno 2] No such file or directory: '/opt/homebrew/anaconda3/envs/Jupyter310/lib/python3.10/site-packages/parsivar/resource/spell/mybigram_lm.pckl'`
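
To avoid hitting that error mid-notebook, you can check for the file before creating `SpellCheck()`. The sketch below uses only the standard library. The relative path `resource/spell/mybigram_lm.pckl` is taken from the error message above, and both helper names are hypothetical.

```python
import importlib.util
from pathlib import Path


def find_package_dir(name: str = "parsivar"):
    """Return the installed package directory, or None if it is not installed."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent


def spell_resources_present(package_dir) -> bool:
    """Check for the language-model file named in the FileNotFoundError."""
    return (Path(package_dir) / "resource" / "spell" / "mybigram_lm.pckl").is_file()


pkg_dir = find_package_dir()
if pkg_dir is None:
    print("parsivar is not installed")
elif not spell_resources_present(pkg_dir):
    print(f"Copy the extracted spell/ directory into {pkg_dir / 'resource'}")
else:
    print("Spell-check resources found")
```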

In [71]:
spell = SpellCheck()

bad_sentence = "نمازگذاران وارد مسلی شدند."
fixed_sentence = spell.spell_corrector(bad_sentence)

print(bad_sentence)
print(fixed_sentence)

نمازگذاران وارد مسلی شدند.
نمازگزاران وارد مصلی شدند .


# 🔚 Part 7: Summary & Comparison (Parsivar vs. Hazm)

| Feature / Task               | **Hazm**                                          | **Parsivar**                                        |
|------------------------------|---------------------------------------------------|-----------------------------------------------------|
| **Normalization**            | ✔ Extensive, configurable                         | ✔ Good, statistical space correction                |
| **Spell Correction**         | ✖                                                 | ✔ `SpellCheck` (needs LM files)                     |
| **Sentence Tokenizer**       | ✔ `sent_tokenize`                                 | ✔ `Tokenizer.tokenize_sentences`                    |
| **Word Tokenizer**           | ✔ `word_tokenize`                                 | ✔ `Tokenizer.tokenize_words`                        |
| **Stemming**                 | ✔ `Stemmer`                                       | ✔ `FindStems`                                       |
| **Lemmatization**            | ✔ `Lemmatizer`                                    | ✖ (not included)                                    |
| **POS Tagging**              | ✔ Bijankhan tagger                                | ✔ Wapiti-based tagger                               |
| **Chunker**                  | ✔ `Chunker(model='chunker.model')`                | ✔ `FindChunks`                                      |
| **Dependency Parsing**       | ✔ MaltParser wrapper (needs Java)                 | ✔ Pure-Python UD parser                             |
| **Word/Sent Embeddings**     | ✔ `WordEmbedding` (fastText / word2vec, etc.)     | ✖                                                   |
| **Named-Entity Recognition** | Optional (works if you supply a trained model)    | ✖ (no built-in support)                             |
| **Installation weight**      | Light (Python + optional Java)                    | Light (Python + optional Wapiti)                    |
| **Best for …**               | Rich morphology, chunking, embeddings, custom NER | Quick normalization + spell-fix + Java-free parsing |

## Key Takeaways

- **Hazm** now covers **chunking** and **word/sentence embeddings** (fastText/word2vec).
  It can also run NER **if you point it to a trained model**, but ships none by default.
- **Parsivar** offers built-in **spell correction**, a pure-Python dependency parser, and chunking, but **no lemmatizer, embeddings, or NER**.
- Choose **Hazm** for deeper linguistic pipelines (lemmas, embeddings, chunker, optional NER).
- Choose **Parsivar** when you need fast spell-correction and a Java-free dependency parser, accepting the absence of lemmatizer/embeddings/NER.

## 🧪 Part 8: Integrated Pre-processing Pipeline (Parsivar Style)

You’ll create a **`preprocess_parsivar`** function, wrap it in a class, and compare speed with spell-correction ON vs. OFF.

## 📦 Imports (local to this part)

In [76]:
from parsivar import SpellCheck, Normalizer, Tokenizer, FindStems
import string, time

## 🔹 8.1 preprocess_parsivar Function

In [77]:
spell = SpellCheck()
normalizer = Normalizer(statistical_space_correction=True)
tokenizer = Tokenizer()
stemmer = FindStems()
PUNCT = set(string.punctuation + '؟،؛«»…')


def preprocess_parsivar(
        text: str,
        fix_spelling: bool = True,
        use_stemmer: bool = False
) -> list[str]:
    """Return a list of clean (optionally stemmed) tokens."""
    # your code here
    pass

In [78]:
sample = "نمازگذاران وارد مسلی شدند."
print(preprocess_parsivar(sample))
# Expected:
# ['نمازگزاران', 'وارد', 'مصلی', 'شدند']

None


### ✅ Answer

In [80]:
spell = SpellCheck()
normalizer = Normalizer(statistical_space_correction=True)
tokenizer = Tokenizer()
stemmer = FindStems()
PUNCT = set(string.punctuation + '؟،؛«»…')


def preprocess_parsivar(
        text: str,
        fix_spelling: bool = True,
        use_stemmer: bool = False
) -> list[str]:
    """Return a list of clean (optionally stemmed) tokens."""
    if fix_spelling:
        text = spell.spell_corrector(text)
    text = normalizer.normalize(text)

    tokens = []
    for sent in tokenizer.tokenize_sentences(text):
        for tok in tokenizer.tokenize_words(sent):
            if tok in PUNCT:
                continue
            tokens.append(tok)

    if use_stemmer:
        tokens = [stemmer.convert_to_stem(t) for t in tokens]

    return [t.lower() for t in tokens]

In [81]:
sample = "نمازگذاران وارد مسلی شدند."
print(preprocess_parsivar(sample))

['نمازگزاران', 'وارد', 'مصلی', 'شدند']
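
One limitation of the answer above: `tok in PUNCT` only matches tokens that are exactly one punctuation character, so a token such as `'...'` or `'؟!'` would slip through. A possible refinement, shown here as a stdlib-only sketch with a hypothetical helper name, is to test every character of the token.

```python
import string

PUNCT = set(string.punctuation + "؟،؛«»…")


def is_punct(token: str) -> bool:
    """True if the token is non-empty and made up entirely of punctuation."""
    return bool(token) and all(ch in PUNCT for ch in token)


tokens = ["نمازگزاران", "وارد", "مصلی", "شدند", "...", "؟!"]
print([t for t in tokens if not is_punct(t)])
```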


## 🔹 8.2 ParsivarPreprocessor Class

In [82]:
class ParsivarPreprocessor:
    def __init__(self, fix_spelling=True, use_stemmer=False):
        self.fix_spelling = fix_spelling
        self.use_stemmer = use_stemmer

    def __call__(self, text: str) -> list[str]:
        return preprocess_parsivar(
            text,
            fix_spelling=self.fix_spelling,
            use_stemmer=self.use_stemmer
        )

    def batch(self, docs):
        return [self(d) for d in docs]

In [83]:
pp = ParsivarPreprocessor(fix_spelling=True, use_stemmer=True)
print(pp("کتاب های جدیدم را خوانده ام ."))

['کتاب', 'جدید', 'را', 'خواند&خوان']
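
The last item above, `'خواند&خوان'`, joins two stems with `&`. For verbs this appears to be the past stem followed by the present stem. That reading is an assumption based on this output, so check it against the Parsivar documentation. If downstream code expects a single string per token, you can pick one of the two. `pick_stem` below is a hypothetical helper.

```python
def pick_stem(stem: str, prefer: str = "present") -> str:
    """Reduce a 'past&present' verb stem to one form; other stems pass through."""
    if "&" not in stem:
        return stem
    past, present = stem.split("&", 1)
    return present if prefer == "present" else past


stems = ['کتاب', 'جدید', 'را', 'خواند&خوان']
print([pick_stem(s) for s in stems])
print([pick_stem(s, prefer="past") for s in stems])
```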


## 🔹 8.3 Speed Comparison

In [84]:
docs = [sample] * 500  # duplicate sample 500×


def time_it(prep, docs):
    start = time.perf_counter()
    _ = prep.batch(docs)
    return time.perf_counter() - start


pp_spell = ParsivarPreprocessor(fix_spelling=True, use_stemmer=False)
pp_raw = ParsivarPreprocessor(fix_spelling=False, use_stemmer=False)

t_spell = time_it(pp_spell, docs)
t_raw = time_it(pp_raw, docs)

print(f"With spell-correction : {t_spell:.3f}s")
print(f"Without correction    : {t_raw:.3f}s")

With spell-correction : 0.624s
Without correction    : 0.034s
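
In the single run above, spell correction makes the pipeline roughly 18× slower (0.624 s vs. 0.034 s). One run over 500 copies of the same sentence is a noisy measurement. A slightly more careful sketch repeats the timing and keeps the best run. `best_of` is a hypothetical helper that accepts any callable, such as a `ParsivarPreprocessor` instance.

```python
import time


def best_of(prep, docs, repeats: int = 5) -> float:
    """Time prep over docs several times and return the fastest run in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for doc in docs:
            prep(doc)
        timings.append(time.perf_counter() - start)
    return min(timings)


# Ratio from the single run reported above.
speedup = 0.624 / 0.034
print(f"Spell correction is about {speedup:.1f}x slower")  # about 18.4x

# Stand-in callable so the snippet runs without Parsivar installed.
print(f"{best_of(str.split, ['a b c'] * 500):.6f}s")
```

For real measurements, pass `pp_spell` and `pp_raw` in place of `str.split`, ideally with a varied set of documents.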
