<font>
<div dir=ltr align=center>
<img src='https://cdn.freebiesupply.com/logos/large/2x/sharif-logo-png-transparent.png' width=150 height=150> <br>
<font color=0F5298 size=6>
Natural Language Processing<br>
<font color=2565AE size=4>
Computer Engineering Department<br>
Spring 2025<br>
<font color=3C99D size=4>
Workshop 1 - NLP Frameworks - Parsi.io<br>
<font color=696880 size=3>
<a href='https://language.ml'>https://language.ml</a><br>
info [AT] language [dot] ml

# 📖 Part 1: Introduction

## ❓ What is Parsi.io? ([GitHub](https://github.com/language-ml/parsi.io))

`parsi.io` is an **industrial-grade Persian NLP toolkit** maintained by the language-ml lab.
It goes beyond basic preprocessing by offering ready-made extractors for prices, quantities, verb morphology, and even Old-Persian text normalization. The project is open-source and receives active updates (≈ 180 commits, last push June 2024).

---

## ✅ When to Use Parsi.io

* **E-commerce or classifieds analytics** – extract prices, units, and product names from unstructured ads.
* **Temporal or event extraction** – leverage the verb-information module for tense, person, and root detection.
* **Digital humanities on classical corpora** – the Old-Persian normalizer and lemmatizer fill gaps left by Hazm/Parsivar.
* **Quick prototyping** – one-line `.run(text)` returns structured JSON, avoiding custom regex pipelines.

---

## 🚫 When Not to Use Parsi.io

* **Neural downstream tasks (e.g., text classification, embeddings)** – opt for Hugging Face transformers instead.
* **Token-level customization** – if you need low-level control over tokenization or custom linguistic attributes, spaCy with a Persian model is more flexible.
* **Ultra-large corpora requiring distributed processing** – parsi.io is a pure-Python library aimed at single-machine workloads; Spark-based pipelines or Elasticsearch ingest plugins may scale better.

## ⚙️ Installation & Setup

In [1]:
# On Apple Silicon, you may need to install the following:
!brew install cmake boost
!CMAKE_OSX_ARCHITECTURES=arm64 pip install camel-tools

==> Downloading https://formulae.brew.sh/api/formula.jws.json
==> Downloading https://formulae.brew.sh/api/cask.jws.json
To reinstall 4.0.1, run:
  brew reinstall cmake
To reinstall 1.88.0, run:
  brew reinstall boost


In [2]:
!pip install camel-kenlm



In [7]:
!pip install git+https://github.com/language-ml/parsi.io.git

# or, for development:
# !pip install -e git+https://github.com/language-ml/parsi.io.git

Collecting git+https://github.com/language-ml/parsi.io.git
  Cloning https://github.com/language-ml/parsi.io.git to /private/var/folders/tk/kx7c6x1j16gdg23gt5g8mr440000gn/T/pip-req-build-tufrljtk
  Running command git clone --filter=blob:none --quiet https://github.com/language-ml/parsi.io.git /private/var/folders/tk/kx7c6x1j16gdg23gt5g8mr440000gn/T/pip-req-build-tufrljtk
  Resolved https://github.com/language-ml/parsi.io.git to commit 22427882850fecbcbcc91eabbafebc3a30dbdf66
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: parsi_io
  Building wheel for parsi_io (setup.py) ... done
  Created wheel for parsi_io: filename=parsi_io-0.1.0-py3-none-any.whl size=19538839 sha256=c86eb51554f6c174b11242aa595082af4b93f1c528216f4de057fcff44f66956
  Stored in directory: /private/var/folders/tk/kx7c6x1j16gdg23gt5g8mr440000gn/T/pip-ephem-wheel-cache-ega_3lq9/wheels/be/de/2b/bfc92c814d7a9cd7a2d7034872de2cdb317fec746ec2835208
Successfully buil

In [6]:
# Uncomment to remove an existing installation before reinstalling:
# !pip uninstall parsi_io parsi-io -y

Found existing installation: parsi_io 0.1.0
Uninstalling parsi_io-0.1.0:
  Successfully uninstalled parsi_io-0.1.0
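
Before moving on, it can help to confirm that the package is importable in the current kernel. The following is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name introduced here, and `parsi_io` is the import name used in the demos of this workshop.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# `parsi_io` is the import name used in the demo cells of this notebook.
print("parsi_io importable:", is_installed("parsi_io"))
```

If this prints `False` right after a successful install, restarting the notebook kernel is usually worth trying before reinstalling.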


# 🧠 Part 2 – Verb Information Extraction

This part shows how to identify each verb in Persian text and return its **tense, root, person, and type**.

## 🔹 2.1 What the extractor returns

| Key      | Description                                                |
|----------|------------------------------------------------------------|
| `زمان`   | Tense: `گذشته`, `حال`, or `آینده`                          |
| `بن فعل` | Verb root (past or present stem)                           |
| `شخص`    | Grammatical person (1st, 2nd, 3rd; singular/plural)        |
| `نوع`    | Sub-class such as `گذشته ساده`, `حال اخباری`, `آینده ساده` |
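
Because the keys are Persian strings, downstream code that expects ASCII field names may want to rename them first. The sketch below shows one way to do that with a plain dictionary; the English names and the `to_english_keys` helper are illustrative assumptions, not part of parsi.io, and the sample record is hand-written in the shape described by the table.

```python
# Persian keys from the table above mapped to English names.
# The English names are illustrative choices, not part of parsi.io.
KEY_MAP = {
    "زمان": "tense",
    "بن فعل": "root",
    "شخص": "person",
    "نوع": "type",
}


def to_english_keys(record: dict) -> dict:
    """Rename known Persian keys; leave any unknown key unchanged."""
    return {KEY_MAP.get(key, key): value for key, value in record.items()}


# A hand-written record shaped like the extractor's output.
sample = {"زمان": "حال", "بن فعل": "نویس", "شخص": "اول شخص مفرد", "نوع": "حال اخباری"}
english = to_english_keys(sample)
print(english)
```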

## 🚀 2.2 Quick demo

In [20]:
from parsi_io.modules.verb_info_extractions import VerbInfoExtraction

extractor = VerbInfoExtraction()

text = "من فردا به دانشگاه خواهم رفت اما امروز در خانه می‌نویسم."
output = extractor.run(text)
output

100%|██████████| 11/11 [00:03<00:00,  2.89it/s]


[{'زمان': 'آینده',
  'بن فعل': 'رفت',
  'شخص': 'اول شخص مفرد',
  'نوع': 'آینده ساده'},
 {'زمان': 'حال', 'بن فعل': 'نویس', 'شخص': 'اول شخص مفرد', 'نوع': 'حال اخباری'}]
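
Since the result is a plain list of dictionaries, ordinary Python is enough to post-process it. As a small sketch, the snippet below groups verb roots by tense; it uses the output above copied as literal data, so it runs without parsi.io installed.

```python
from collections import defaultdict

# Output of the demo above, copied as literal data.
verbs = [
    {"زمان": "آینده", "بن فعل": "رفت", "شخص": "اول شخص مفرد", "نوع": "آینده ساده"},
    {"زمان": "حال", "بن فعل": "نویس", "شخص": "اول شخص مفرد", "نوع": "حال اخباری"},
]

# Collect the root of every verb under its tense label.
roots_by_tense = defaultdict(list)
for verb in verbs:
    roots_by_tense[verb["زمان"]].append(verb["بن فعل"])

print(dict(roots_by_tense))
```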

# 🔢 Part 3 – Number Extraction

This part shows how to extract numeric expressions (numerals, words, or mixed forms) from Persian text and convert them into int/float values.

## 🔹 3.1 What the extractor returns

| Field    | Description                                    |
|----------|------------------------------------------------|
| `span`   | `[start, end]` character offsets of the phrase |
| `phrase` | The matched numeric phrase                     |
| `value`  | Numeric value (`int` or `float`) of the phrase |
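
The offsets in `span` appear to follow Python's half-open slicing convention, so `text[start:end]` should reproduce `phrase`. The sketch below checks this for one record taken from the quick demo in 3.2; `span_matches` is a hypothetical helper, and the half-open interpretation is an assumption inferred from that sample output rather than from the library's documentation.

```python
text = "من در بیست و پنجمین روز فروردین سوار اتوبوس ۱۲ شدم."

# One record from the quick demo output, copied as literal data.
record = {"span": [6, 16], "phrase": "بیست و پنج", "value": 25}


def span_matches(text: str, record: dict) -> bool:
    """Check that slicing the text with the span yields the reported phrase."""
    start, end = record["span"]
    return text[start:end] == record["phrase"]


print(span_matches(text, record))
```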

## 🚀 3.2 Quick demo

In [31]:
from parsi_io.modules.number_extractor import NumberExtractor

extractor = NumberExtractor()

text = "من در بیست و پنجمین روز فروردین سوار اتوبوس ۱۲ شدم."
output = extractor.run(text)
output

[{'span': [6, 16], 'phrase': 'بیست و پنج', 'value': 25},
 {'span': [44, 46], 'phrase': '۱۲', 'value': 12.0}]
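
One common use of this output is normalizing the text by replacing each matched phrase with its numeric value. The sketch below does so with the result above copied as literal data; `replace_numbers` is a hypothetical helper, and rendering whole floats such as `12.0` as `12` is a design choice of this example, not library behavior.

```python
text = "من در بیست و پنجمین روز فروردین سوار اتوبوس ۱۲ شدم."

# Output of the demo above, copied as literal data.
matches = [
    {"span": [6, 16], "phrase": "بیست و پنج", "value": 25},
    {"span": [44, 46], "phrase": "۱۲", "value": 12.0},
]


def replace_numbers(text: str, matches: list) -> str:
    """Replace each matched phrase with its value, rendered in ASCII digits."""
    # Work from the rightmost span so earlier offsets stay valid.
    for match in sorted(matches, key=lambda m: m["span"][0], reverse=True):
        start, end = match["span"]
        value = match["value"]
        # Render whole floats such as 12.0 as "12".
        rendered = str(int(value)) if float(value).is_integer() else str(value)
        text = text[:start] + rendered + text[end:]
    return text


normalized = replace_numbers(text, matches)
print(normalized)
```

Note that the first match covers only the cardinal part of the ordinal `بیست و پنجمین`, so the ordinal suffix `مین` remains attached to the replaced number.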