# Persian-English Translator Project (Classical Seq2Seq)

This notebook works toward a classical Persian↔English translator using Seq2Seq (LSTM). Its planned stages are preprocessing, POS tagging, NER, tokenization, embedding, model design, training, and evaluation.

---


## Table of contents
1. Project goals
2. Environment setup
3. Data collection
4. Preprocessing
5. POS tagging
6. NER
7. Tokenization & Embedding
8. Seq2Seq model
9. Evaluation
10. Analysis and report



## 2. Environment setup

In [1]:
!pip install --upgrade pip
!pip install numpy pandas matplotlib scikit-learn tensorflow keras hazm parsivar sacrebleu nltk sentencepiece
# Optional: for advanced Persian NER datasets and models
!pip install datasets transformers seqeval




In [3]:
%unload_ext cudf.pandas

The cudf.pandas extension is not loaded.


In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import tensorflow as tf
import keras

import hazm
import parsivar
import sacrebleu
import nltk
import sentencepiece
import datasets
import transformers
import seqeval

2025-09-24 18:28:40.412122: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2025-09-24 18:28:40.419399: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-09-24 18:28:40.789652: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

## 3. Data collection

### 3.1: Imports

In [5]:
from datasets import load_dataset, DatasetDict
import re, os
from pprint import pprint

### 3.2: Load dataset and inspect (normal load)

In [6]:
# Data collection: load the Hugging Face dataset (full, non-streaming)

DATASET_ID = "shenasa/English-Persian-Parallel-Dataset"

print("Loading dataset (this may take a minute)...")
ds = load_dataset(DATASET_ID)   # loads all splits (here likely only 'train')
print(ds)                       # quick summary (splits, size)

# Print column names and example
split_name = list(ds.keys())[0]      # should be 'train'
print("Split:", split_name)
print("Columns:", ds[split_name].column_names)
print("\nOne sample (0):")
pprint(ds[split_name][0])


Loading dataset (this may take a minute)...
DatasetDict({
    train: Dataset({
        features: ['flash fire .', 'فلاش آتش .'],
        num_rows: 3960172
    })
})
Split: train
Columns: ['flash fire .', 'فلاش آتش .']

One sample (0):
{'flash fire .': 'superheats the air . burns the lungs like rice paper .',
 'فلاش آتش .': 'هوا را فوق العاده گرم می کند . ریه ها را مثل کاغذ برنج می '
               'سوزاند .'}


## 4. Preprocessing (Normalization & cleaning)

### 4.1: Imports

In [7]:
from datasets import Dataset
import re
from hazm import Normalizer
from datasets import DatasetDict

### 4.2: Auto-detect which column is Persian / English & rename

In [8]:
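# Identify which column contains Persian text and which contains English.
def is_persian(s):
    # True if the text has at least one character from the Arabic-script block
    if s is None: return False
    return bool(re.search(r'[\u0600-\u06FF]', s))

cols = ds[split_name].column_names
sample = ds[split_name][0]   # detection relies on this single row

persian_col = None
english_col = None
for c in cols:
    value = sample[c]
    if value is not None and not isinstance(value, str):
        continue   # skip non-text columns
    if is_persian(value):
        persian_col = c
    else:
        english_col = c

if persian_col is None or english_col is None:
    raise ValueError(f"Could not detect both language columns in {cols}")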

print("Detected Persian column:", persian_col)
print("Detected English column:", english_col)



Detected Persian column: فلاش آتش .
Detected English column: flash fire .
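
The two detected column names are themselves an English/Persian sentence pair, which suggests the source file has no header row and its first line was read as the column names. A minimal sketch, assuming that reading is correct, of turning the column names back into one extra sentence pair. `header_as_pair` is a hypothetical helper, and `is_persian` is repeated here only to keep the block self-contained:

```python
import re

def is_persian(s):
    """True if s is a string with at least one Arabic-script character."""
    return isinstance(s, str) and bool(re.search(r'[\u0600-\u06FF]', s))

def header_as_pair(column_names):
    """Return the two column names as a sentence pair, or None if they
    do not look like one Persian and one non-Persian sentence."""
    if len(column_names) != 2:
        return None
    fa = [c for c in column_names if is_persian(c)]
    en = [c for c in column_names if not is_persian(c)]
    if len(fa) == 1 and len(en) == 1:
        return {"persian": fa[0], "english": en[0]}
    return None

# Column names as printed by the detection cell
print(header_as_pair(['flash fire .', 'فلاش آتش .']))
# Ordinary header names are not mistaken for a sentence pair
print(header_as_pair(['persian', 'english']))
```

If the result is not `None`, the pair could be cleaned and added to the data like any other row.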


### 4.3: Rename columns to standard names and run light cleaning

In [9]:
# If the dataset columns are e.g. 'translation' you may need to adapt the mapping below.

def rename_and_select(example):
    return {"persian": example[persian_col], "english": example[english_col]}

print("Mapping and renaming columns... (this may take a while for large datasets)")
ds_simple = ds[split_name].map(rename_and_select, remove_columns=ds[split_name].column_names)

print("New columns:", ds_simple.column_names)
print("Sample:", ds_simple[0])


Mapping and renaming columns... (this may take a while for large datasets)
New columns: ['persian', 'english']
Sample: {'persian': 'هوا را فوق العاده گرم می کند . ریه ها را مثل کاغذ برنج می سوزاند .', 'english': 'superheats the air . burns the lungs like rice paper .'}


### 4.4: Cleaning helpers (Persian normalizer + English lite cleaning)

In [10]:
# Minimal cleaning pipeline: Persian normalization plus basic English normalization.

normalizer = Normalizer()

# Arabic yeh/kaf → Persian yeh/kaf
AR_TO_FA = {'\u064A': '\u06CC', '\u0643': '\u06A9'}
# ZWNJ (U+200C) and directional marks. hazm's Normalizer re-inserts ZWNJ
# where its spacing rules apply (see the cleaned sample in 4.5).
ZERO_WIDTH = ['\u200c', '\u200f', '\u202a', '\u202b']
PERSIAN_DIGITS = '۰۱۲۳۴۵۶۷۸۹'
ASCII_DIGITS = '0123456789'

def replace_arabic_chars(text):
    for a, f in AR_TO_FA.items():
        text = text.replace(a, f)
    return text

def remove_zero_width(text):
    for ch in ZERO_WIDTH:
        text = text.replace(ch, '')
    return text

def persian_to_ascii_digits(text):
    for p, a in zip(PERSIAN_DIGITS, ASCII_DIGITS):
        text = text.replace(p, a)
    return text

def clean_persian(text):
    if text is None: return ""
    text = str(text)
    text = replace_arabic_chars(text)
    text = remove_zero_width(text)
    try:
        text = normalizer.normalize(text)
    except Exception:
        pass
    text = persian_to_ascii_digits(text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def clean_english(text):
    if text is None: return ""
    text = str(text).strip()
    # optional: lowercasing (skip this step if casing should be preserved)
    text = text.lower()
    text = re.sub(r'\s+', ' ', text)
    return text

# Quick test
print(clean_persian("این ۱۲۳ کتاب‌ هاست‎"))
print(clean_english(" This IS  a Test! "))


این 123 کتاب هاست‎
this is a test!
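
The three character-level helpers above each loop over `str.replace`. As an alternative, the same mappings can be folded into one translation table and applied in a single pass. This is only a sketch: the names `FA_DIGITS`, `EN_DIGITS`, `INVISIBLE`, `CHAR_TABLE` and `normalize_chars` are hypothetical and chosen so they do not shadow the notebook's own variables.

```python
# Same mappings as AR_TO_FA, ZERO_WIDTH and the digit strings in the cell above
ARABIC_SRC = '\u064A\u0643'      # Arabic yeh, Arabic kaf
PERSIAN_DST = '\u06CC\u06A9'     # Persian yeh, Persian kaf
FA_DIGITS = '۰۱۲۳۴۵۶۷۸۹'
EN_DIGITS = '0123456789'
INVISIBLE = '\u200c\u200f\u202a\u202b'

# str.maketrans(src, dst, delete): map src[i] -> dst[i] and drop every char in `delete`
CHAR_TABLE = str.maketrans(ARABIC_SRC + FA_DIGITS, PERSIAN_DST + EN_DIGITS, INVISIBLE)

def normalize_chars(text):
    """Apply all character-level replacements in a single pass."""
    return text.translate(CHAR_TABLE)

# Arabic kaf + Arabic yeh, Persian digits, and a right-to-left mark
print(repr(normalize_chars('\u0643\u064A \u06F1\u06F2\u06F3\u200f')))
```

Note that `clean_persian` converts digits after calling hazm's normalizer, so a combined table applied before normalization would change that order. Other invisible marks could be added to `INVISIBLE` in the same way if they show up in the data.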


### 4.5: Apply cleaning (batched) and filter empty/very-long pairs

In [11]:
# Apply cleaning to the dataset in batches (pass num_proc=N to map() if the CPU allows parallelism)
def clean_batch(batch):
    # With batched=True, each column arrives as a list of values
    return {
        "persian": [clean_persian(t) for t in batch["persian"]],
        "english": [clean_english(t) for t in batch["english"]],
    }

print("Cleaning dataset (batched)...")
ds_clean = ds_simple.map(clean_batch, batched=True)
print("After cleaning sample:", ds_clean[0])

# Filter out empty or too-long sentences (character-length heuristic)
MAX_CHARS = 512
def filter_empty_or_long(ex):
    if not ex["persian"].strip() or not ex["english"].strip():
        return False
    if len(ex["persian"]) > MAX_CHARS or len(ex["english"]) > MAX_CHARS:
        return False
    return True

ds_clean = ds_clean.filter(filter_empty_or_long)
print("Size after filter:", len(ds_clean))


Cleaning dataset (batched)...
After cleaning sample: {'persian': 'هوا را فوق\u200cالعاده گرم می\u200cکند. ریه\u200cها را مثل کاغذ برنج می\u200cسوزاند.', 'english': 'superheats the air . burns the lungs like rice paper .'}
Size after filter: 3948961


### 4.6: Create train/validation/test splits

In [12]:
# If the original dataset already had splits, you can use them directly.
# Here ds_clean is a single large 'train' split, so we carve out validation and test sets.
total = len(ds_clean)
print("Total pairs:", total)

# Large dataset: hold out ~3%, then halve it into validation and test (~1.5% each).
# Small dataset (<= 1000 pairs): hold out 20%, giving an 80/10/10 split.
holdout_frac = 0.03 if total > 1000 else 0.2
split1 = ds_clean.train_test_split(test_size=holdout_frac, seed=42, shuffle=True)
hold = split1['test'].train_test_split(test_size=0.5, seed=42)
dataset_splits = DatasetDict({
    'train': split1['train'],
    'validation': hold['train'],
    'test': hold['test']
})

for k in dataset_splits:
    print(k, len(dataset_splits[k]))

# Peek at a few examples
print("Example pair (train[0]):")
print(dataset_splits['train'][0])


Total pairs: 3948961
train 3830492
validation 59234
test 59235
Example pair (train[0]):
{'persian': 'Thakhek 5121 **** تلفن', 'english': 'thakhek 5121 **** phone'}
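
The split sizes printed above can be checked by hand. The sketch below assumes the held-out portion is rounded up and the remainder goes to the other part; that assumption reproduces the reported counts. `split_sizes` is a hypothetical helper, not part of the `datasets` API.

```python
import math

def split_sizes(n, test_size):
    """Return (n_first, n_second) for a fractional split of n items,
    rounding the second (held-out) part up."""
    n_second = math.ceil(n * test_size)
    return n - n_second, n_second

total = 3948961                                   # "Total pairs" from the output above
train, holdout = split_sizes(total, 0.03)         # ~3% held out
validation, test = split_sizes(holdout, 0.5)      # holdout halved into val/test

print("train", train)            # train 3830492
print("validation", validation)  # validation 59234
print("test", test)              # test 59235
```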


### 4.7: Save to disk / optional Google Drive mount

In [13]:
# Option A: save to Colab local storage
os.makedirs("data", exist_ok=True)
dataset_splits['train'].to_csv("data/train.csv")
dataset_splits['validation'].to_csv("data/validation.csv")
dataset_splits['test'].to_csv("data/test.csv")
print("Saved CSVs to ./data/")

# Option B: save to Google Drive (uncomment to use)
# from google.colab import drive
# drive.mount('/content/drive')
# out_dir = "/content/drive/MyDrive/persian_translation_dataset"
# os.makedirs(out_dir, exist_ok=True)
# dataset_splits['train'].to_csv(os.path.join(out_dir,"train.csv"))
# dataset_splits['validation'].to_csv(os.path.join(out_dir,"validation.csv"))
# dataset_splits['test'].to_csv(os.path.join(out_dir,"test.csv"))
# print("Saved CSVs to Google Drive:", out_dir)


Saved CSVs to ./data/


### 4.8: Create a small subset for fast experiments

In [14]:
# Create a small subset (e.g., 10k pairs) for fast development / debugging.
small_size = 10000
if len(dataset_splits['train']) > small_size:
    small_train = dataset_splits['train'].select(range(small_size))
else:
    small_train = dataset_splits['train']

small_train.to_csv("data/train_small.csv")
print("Saved small train:", len(small_train))


Saved small train: 10000


### 4.9: Streaming mode — inspect a few examples without full download

In [15]:
# If dataset is extremely large and you prefer to stream:
print("Streaming a few examples (no full download):")
stream_ds = load_dataset(DATASET_ID, split="train", streaming=True)
for i, ex in enumerate(stream_ds):
    print(i, ex)
    if i >= 10:
        break

Streaming a few examples (no full download):
0 {'flash fire .': 'superheats the air . burns the lungs like rice paper .', 'فلاش آتش .': 'هوا را فوق العاده گرم می کند . ریه ها را مثل کاغذ برنج می سوزاند .'}
1 {'flash fire .': 'hey , guys . down here . down here .', 'فلاش آتش .': 'سلام بچه ها . این پایین . این پایین .'}
2 {'flash fire .': 'what do you got down this corridor is the bow , right .', 'فلاش آتش .': 'چه چیزی در این راهرو پایین آمده است ، درست است .'}
3 {'flash fire .': 'theres an access hatch right there that puts us into the bowthruster room .', 'فلاش آتش .': 'یک دریچه دسترسی درست در آنجا وجود دارد که ما را وارد اتاق کمان می کند .'}
4 {'flash fire .': 'we get into the propeller tubes and the only thing between us and the outside .', 'فلاش آتش .': 'وارد لوله های پروانه می شویم و تنها چیزی که بین ما و بیرون است .'}
5 {'flash fire .': 'is nothing . all right . lets go .', 'فلاش آتش .': 'هیچی نیست . خیلی خوب . بیا بریم .'}
6 {'flash fire .': 'lets go . thats our way out .', 'فلاش
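
The loop above breaks once `i >= 10`, so it prints eleven examples (indices 0 through 10). When an exact count matters, `itertools.islice` states it directly. In this small sketch, `fake_stream` is a hypothetical stand-in for the streaming dataset so the block runs without a download; with the real data, `stream_ds` would take its place.

```python
from itertools import islice

def fake_stream():
    """Hypothetical stand-in for a streaming dataset: yields dicts indefinitely."""
    i = 0
    while True:
        yield {"id": i}
        i += 1

def peek(iterable, n):
    """Return the first n items; nothing beyond them is pulled from the iterable."""
    return list(islice(iterable, n))

for i, ex in enumerate(peek(fake_stream(), 5)):
    print(i, ex)
```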

## 5. POS tagging (Parsivar/Hazm for Persian, NLTK for English)

In [17]:
from typing import List, Tuple

def add_pos_columns(ds_split):
    # Persian (Parsivar → Hazm fallback)
    fa_backend = None
    fa_tokenizer = None
    fa_tagger = None

    try:
        from parsivar import Normalizer as PVNormalizer, Tokenizer as PVTokenizer, POSTagger as PVPOSTagger
        fa_norm_pv = PVNormalizer()
        fa_tokenizer = PVTokenizer()
        fa_tagger = PVPOSTagger(tagging_model="wapiti")
        fa_backend = "parsivar"
    except Exception:
        try:
            from hazm import POSTagger as HZPOSTagger, word_tokenize as hz_word_tokenize
            fa_tagger = HZPOSTagger()
            fa_backend = "hazm"
        except Exception:
            pass

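    import nltk
    from nltk import pos_tag as nltk_pos_tag, word_tokenize as nltk_word_tokenize
    # Newer NLTK releases look for 'punkt_tab' and 'averaged_perceptron_tagger_eng',
    # older ones for the unsuffixed names, so check for (and fetch) both.
    nltk_resources = {
        'tokenizers/punkt': 'punkt',
        'tokenizers/punkt_tab': 'punkt_tab',
        'taggers/averaged_perceptron_tagger': 'averaged_perceptron_tagger',
        'taggers/averaged_perceptron_tagger_eng': 'averaged_perceptron_tagger_eng',
    }
    for path, pkg in nltk_resources.items():
        try:
            nltk.data.find(path)
        except LookupError:
            try:
                nltk.download(pkg, quiet=True)
            except Exception:
                pass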

    def pos_str(pairs: List[Tuple[str,str]]) -> str:
        return " ".join(f"{w}/{t}" for w,t in pairs)

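    def tag_fa(text: str) -> str:
        # Returns "" when no tagger is available or tagging raises, so an empty
        # fa_pos value can mean a silent failure rather than an empty sentence.
        if not text or fa_tagger is None: return ""
        try:
            if fa_backend == "parsivar":
                t = fa_norm_pv.normalize(text)
                toks = fa_tokenizer.tokenize_words(t)
                # Parsivar's POSTagger exposes parse(), which returns (token, tag) pairs
                return pos_str(fa_tagger.parse(toks))
            else:
                toks = hz_word_tokenize(text)
                return pos_str(fa_tagger.tag(toks))
        except Exception:
            return ""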

    def tag_en(text: str) -> str:
        if not text: return ""
        try:
            toks = nltk_word_tokenize(text)
            return pos_str(nltk_pos_tag(toks))
        except Exception:
            return ""

    def add_pos(batch):
        return {
            "fa_pos": [tag_fa(x) for x in batch["persian"]],
            "en_pos": [tag_en(x) for x in batch["english"]],
        }

    ds_with_pos = ds_split.map(add_pos, batched=True, batch_size=64)
    return ds_with_pos

# Example:
pos_valid = add_pos_columns(dataset_splits["validation"])
pos_valid.to_csv("data/validation_with_pos.csv")
print("Saved:", "data/validation_with_pos.csv", "rows:", len(pos_valid))
print(pos_valid[0]["fa_pos"][:120], "...")
print(pos_valid[0]["en_pos"][:120], "...")


Saved: data/validation_with_pos.csv rows: 59234
 ...
 ...
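
Both preview lines above are empty apart from the trailing `...`, which means the first validation row has empty `fa_pos` and `en_pos` strings. Because `tag_fa` and `tag_en` return `""` on any exception, it helps to measure how many rows came back empty before trusting the CSV. In this minimal sketch, `empty_fraction` is a hypothetical helper and the demo list is made-up data in the `word/TAG` format that `pos_str` produces.

```python
def empty_fraction(column):
    """Share of entries that are empty or whitespace-only strings."""
    values = list(column)
    if not values:
        return 0.0
    empty = sum(1 for v in values if not v or not v.strip())
    return empty / len(values)

# Made-up tag strings: two tagged rows, two empty ones
demo = ["this/DT is/VBZ fine/JJ", "", "   ", "ok/JJ"]
print(f"{empty_fraction(demo):.0%} empty")   # 50% empty
```

With the notebook's data this would be called as `empty_fraction(pos_valid["fa_pos"])` and `empty_fraction(pos_valid["en_pos"])`. A value near 100% points to a tagger that failed to load or run.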


## 6. Named Entity Recognition

In [18]:
# Only needed if these packages are not already installed
!pip install stanza spacy parsivar
!python -m spacy download en_core_web_sm



In [19]:
TARGET_SPLIT = "validation"   
OUT_NER_CSV  = f"data/{TARGET_SPLIT}_with_ner.csv"

import os
os.makedirs("data", exist_ok=True)

split = dataset_splits[TARGET_SPLIT]
len(split)


59234

In [20]:
print("Initializing Persian NER...")
fa_backend = None
fa_stanza = None
fa_pv_tok = None
fa_pv_ner = None

# Preferred backend: stanza
try:
    import stanza
    fa_kwargs = dict(lang='fa', processors='tokenize,ner', tokenize_no_ssplit=True, verbose=False)
    try:
        fa_stanza = stanza.Pipeline(**fa_kwargs)
    except Exception:
        # Models are probably missing: download them, then build the pipeline again
        stanza.download('fa', processors='tokenize,ner', verbose=False)
        fa_stanza = stanza.Pipeline(**fa_kwargs)
    fa_backend = "stanza"
    print("FA NER → stanza")
except Exception as e:
    print("stanza not available:", e)

# Fallback: Parsivar
if fa_stanza is None:
    try:
        from parsivar import Tokenizer as PVTok, NERTagger as PVNERTagger
        fa_pv_tok = PVTok()
        fa_pv_ner = PVNERTagger()
        fa_backend = "parsivar"
        print("FA NER → Parsivar")
    except Exception as e:
        print("Parsivar NER not available:", e)

def ner_fa(text: str) -> str:
    """Output format:
       - stanza: 'text|TYPE; ...'
       - parsivar: 'tok/TAG tok/TAG ...'
    """
    if not text: return ""
    try:
        if fa_backend == "stanza" and fa_stanza is not None:
            doc = fa_stanza(text)
            return "; ".join(f"{ent.text}|{ent.type}" for ent in doc.entities)
        elif fa_backend == "parsivar" and fa_pv_tok is not None and fa_pv_ner is not None:
            toks = fa_pv_tok.tokenize_words(text)
            tags = fa_pv_ner.tag(toks)
            return " ".join(f"{w}/{t}" for w,t in zip(toks, tags))
    except Exception:
        return ""
    return ""


Initializing Persian NER...




FA NER → stanza


In [21]:
print("Initializing English NER...")
en_backend = None
en_spacy = None

# Preferred backend: spaCy
try:
    import spacy
    try:
        en_spacy = spacy.load("en_core_web_sm")
    except Exception:
        try:
            from spacy.cli import download as spacy_download
            spacy_download("en_core_web_sm")
            en_spacy = spacy.load("en_core_web_sm")
        except Exception:
            en_spacy = None
    if en_spacy is not None:
        en_backend = "spacy"
        print("EN NER → spaCy")
except Exception as e:
    print("spaCy not available:", e)

# Fallback: NLTK
if en_spacy is None:
    try:
        import nltk
        from nltk import word_tokenize, pos_tag, ne_chunk
        # Request both the older resource names and the suffixed names that newer
        # NLTK releases look for; nltk.download skips packages that are up to date.
        for pkg in ['punkt', 'punkt_tab',
                    'averaged_perceptron_tagger', 'averaged_perceptron_tagger_eng',
                    'maxent_ne_chunker', 'maxent_ne_chunker_tab', 'words']:
            try:
                nltk.download(pkg, quiet=True)
            except Exception:
                pass
        en_backend = "nltk"
        print("EN NER → NLTK")
    except Exception as e:
        print("NLTK ne_chunk not available:", e)

def ner_en(text: str) -> str:
    """Output format: 'TEXT|LABEL; ...'"""
    if not text: return ""
    try:
        if en_backend == "spacy" and en_spacy is not None:
            doc = en_spacy(text)
            return "; ".join(f"{ent.text}|{ent.label_}" for ent in doc.ents)
        elif en_backend == "nltk":
            import nltk
            toks = nltk.word_tokenize(text)
            pos  = nltk.pos_tag(toks)
            tree = nltk.ne_chunk(pos, binary=False)
            spans = []
            for node in tree:
                if hasattr(node, 'label'):
                    words = " ".join(leaf[0] for leaf in node.leaves())
                    spans.append(f"{words}|{node.label()}")
            return "; ".join(spans)
    except Exception:
        return ""
    return ""


Initializing English NER...
EN NER → spaCy


In [22]:
from tqdm.auto import tqdm

print(f"Tagging NER on split: {TARGET_SPLIT} (size={len(split)})")

def add_ner(batch):
    fa_list, en_list = [], []
    for fa, en in zip(batch["persian"], batch["english"]):
        fa_list.append(ner_fa(fa))
        en_list.append(ner_en(en))
    return {"fa_ner": fa_list, "en_ner": en_list}

# Batched map: a trade-off between speed and memory
ner_split = split.map(add_ner, batched=True, batch_size=32)

ner_split.to_csv(OUT_NER_CSV)
print(f"Saved: {OUT_NER_CSV} rows: {len(ner_split)}")

# Quick sample
sample = ner_split[0]
print({k: sample.get(k,"")[:180] for k in ["persian","english","fa_ner","en_ner"]})


Tagging NER on split: validation (size=59234)



Saved: data/validation_with_ner.csv rows: 59234
{'persian': 'فقط با مشکلی دست و پنجه نرم می\u200cکنم که به نظر می\u200cرسد نمی\u200cتوانم GU - 50 s را به سوکت فشار دهم.', 'english': 'just battling the problem that i seem not to be able to push the gu-50s into the sockets.', 'fa_ner': '', 'en_ner': ''}
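
The `fa_ner` and `en_ner` columns store all entities of a sentence as one `'TEXT|LABEL; ...'` string, as the docstrings of `ner_fa` and `ner_en` describe. A minimal sketch of turning such a string back into pairs for later analysis follows. `parse_entities` is a hypothetical helper and the example strings are made up. It covers only the `TEXT|LABEL` format, not Parsivar's `tok/TAG` output, and entity text that itself contains `"; "` would be split incorrectly.

```python
from typing import List, Tuple

def parse_entities(ner_string: str) -> List[Tuple[str, str]]:
    """Parse 'TEXT|LABEL; TEXT|LABEL' into [(text, label), ...]."""
    entities = []
    for chunk in ner_string.split("; "):
        # rpartition splits on the last '|', so a '|' inside the text survives
        text, sep, label = chunk.rpartition("|")
        if sep and text and label:
            entities.append((text, label))
    return entities

print(parse_entities("Tehran|GPE; Ali Daei|PERSON"))
print(parse_entities(""))   # rows with no entities give an empty list
```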


## 7. Tokenization & embedding

## 8. Seq2Seq model

## 9. Evaluation

## End of Notebook
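This Colab notebook covers environment setup, data collection, preprocessing, POS tagging, and NER. Sections 7 to 9 (tokenization and embedding, the Seq2Seq model, and evaluation) are headings only and have no content yet.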