# Persian-English Translator Project (Classical Seq2Seq)

This notebook implements a classical Persian↔English translator using Seq2Seq (LSTM). It includes preprocessing, POS tagging, NER, tokenization, embedding, model design, training, and evaluation.

---


## Table of contents
1. Project goals
2. Environment setup
3. Data collection
4. Preprocessing
5. POS tagging
6. NER
7. Tokenization & Embedding
8. Seq2Seq model
9. Evaluation
10. Analysis and report



## 2. Environment setup

In [1]:
!pip install --upgrade pip
!pip install numpy pandas matplotlib scikit-learn tensorflow keras hazm parsivar sacrebleu nltk sentencepiece
# Optional: for advanced Persian NER datasets and models
!pip install datasets transformers seqeval



In [2]:
#%load_ext cudf.pandas

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import tensorflow as tf
import keras

import hazm
import parsivar
import sacrebleu
import nltk
import sentencepiece
import datasets
import transformers
import seqeval

## 3. Data collection

### 3.1: Imports

In [4]:
from datasets import load_dataset, DatasetDict
import re, os
from pprint import pprint

### 3.2: Load dataset and inspect (normal load)

In [5]:
# Data collection: load the Hugging Face dataset (full, non-streaming)

DATASET_ID = "shenasa/English-Persian-Parallel-Dataset"

print("Loading dataset (this may take a minute)...")
data_set = load_dataset(DATASET_ID)   # loads all splits (here only 'train')
print("Dataset is loaded successfully")

Loading dataset (this may take a minute)...


Dataset is loaded successfully


In [6]:
# quick summary (splits, size)
print(data_set)

# Print column names and example
split_name = list(data_set.keys())[0]      # should be 'train'
print("Split:", split_name)
print("Columns:", data_set[split_name].column_names)
print("\nOne sample (0):")
pprint(data_set[split_name][0])

DatasetDict({
    train: Dataset({
        features: ['flash fire .', 'فلاش آتش .'],
        num_rows: 3960172
    })
})
Split: train
Columns: ['flash fire .', 'فلاش آتش .']

One sample (0):
{'flash fire .': 'superheats the air . burns the lungs like rice paper .',
 'فلاش آتش .': 'هوا را فوق العاده گرم می کند . ریه ها را مثل کاغذ برنج می '
               'سوزاند .'}


## 4. Preprocessing (Normalization & cleaning)

### 4.1: Imports

In [7]:
import re
import nltk
from datasets import Dataset, DatasetDict
from hazm import Normalizer, word_tokenize, stopwords_list
from nltk.corpus import stopwords
#import spacy

In [8]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### 4.2: Select a subset of the dataset

In [9]:
# take a random sample of 500k
ds = DatasetDict({
    "train": data_set["train"].shuffle(seed=42).select(range(500_000))
})
print(ds)

DatasetDict({
    train: Dataset({
        features: ['flash fire .', 'فلاش آتش .'],
        num_rows: 500000
    })
})


### 4.3: Auto-detect which column is Persian / English & rename

In [10]:
# Detect Persian / English columns
def is_persian(s):
    if s is None:
        return False
    return bool(re.search(r'[\u0600-\u06FF]', s))

cols = ds[split_name].column_names
sample = ds[split_name][0]

persian_col = None
english_col = None
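# Note: detection looks only at the first row of the sample
for c in cols:
    value = sample[c]
    if not isinstance(value, str):
        continue  # skip non-text columns instead of silently swallowing errors
    if is_persian(value):
        persian_col = c
    else:
        english_col = c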

print("Detected Persian column:", persian_col)
print("Detected English column:", english_col)


# Helper functions to detect unwanted text
def contains_english(text):
    if text is None:
        return False
    return bool(re.search(r"[A-Za-z]", text))

def contains_persian(text):
    if text is None:
        return False
    return bool(re.search(r"[\u0600-\u06FF]", text))


# Filter dataset
filtered_dataset = ds[split_name].filter(
    lambda x: not contains_english(x[persian_col]) and not contains_persian(x[english_col])
)

# Wrap in a plain dict (not a DatasetDict) so the "train" key keeps working
ds = {"train": filtered_dataset}
print(ds)


Detected Persian column: فلاش آتش .
Detected English column: flash fire .
{'train': Dataset({
    features: ['flash fire .', 'فلاش آتش .'],
    num_rows: 400320
})}


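- Caveat: this filter is too aggressive. It drops every Persian sentence that contains even a single Latin character (and every English sentence that contains a single Persian character), which is why the sample shrinks from 500,000 to 400,320 pairs.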
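
One way to soften the script filter above is to drop a pair only when the share of foreign-script letters passes a threshold. The sketch below is an illustration, not what this notebook runs: `foreign_ratio`, `keep_pair`, and the `MAX_FOREIGN_RATIO` value of 0.2 are hypothetical names and settings that would need tuning on a sample of the data.

```python
import re

LATIN_RE = re.compile(r"[A-Za-z]")
PERSIAN_RE = re.compile(r"[\u0600-\u06FF]")

# Hypothetical threshold (assumption): tolerate up to 20% foreign-script letters
MAX_FOREIGN_RATIO = 0.2


def foreign_ratio(text, foreign_re, native_re):
    """Share of foreign-script letters among all letters of either script."""
    if not text:
        return 0.0
    foreign = len(foreign_re.findall(text))
    native = len(native_re.findall(text))
    total = foreign + native
    return foreign / total if total else 0.0


def keep_pair(persian_text, english_text, max_ratio=MAX_FOREIGN_RATIO):
    """Keep a pair unless either side is dominated by the other script."""
    return (
        foreign_ratio(persian_text, LATIN_RE, PERSIAN_RE) <= max_ratio
        and foreign_ratio(english_text, PERSIAN_RE, LATIN_RE) <= max_ratio
    )


# A Persian sentence with one Latin letter survives; an all-Latin "Persian" side does not
print(keep_pair("این کتاب A است", "this is book A"))
print(keep_pair("hello world", "hello world"))
```

In the notebook this could replace the lambda passed to `filter`, for example `ds[split_name].filter(lambda x: keep_pair(x[persian_col], x[english_col]))`.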

In [11]:
ds

{'train': Dataset({
     features: ['flash fire .', 'فلاش آتش .'],
     num_rows: 400320
 })}

### 4.4: Rename columns to standard names and run light cleaning

In [12]:
def rename_and_select(example):
    return {"persian": example[persian_col], "english": example[english_col]}

print("Mapping and renaming columns...")
ds_simple = ds[split_name].map(rename_and_select, remove_columns=ds[split_name].column_names) # split_name="train"

print("Old columns:", ds[split_name].column_names)
print("New columns:", ds_simple.column_names)

Mapping and renaming columns...
Old columns: ['flash fire .', 'فلاش آتش .']
New columns: ['persian', 'english']


In [13]:
print(ds)
print(ds_simple)

{'train': Dataset({
    features: ['flash fire .', 'فلاش آتش .'],
    num_rows: 400320
})}
Dataset({
    features: ['persian', 'english'],
    num_rows: 400320
})


### 4.5: Cleaning helpers (Persian normalizer + light English cleaning)

In [14]:
normalizer = Normalizer()

# Maps Arabic to Persian chars
AR_TO_FA = {'\u064A': '\u06CC', '\u0643': '\u06A9'}
# Zero-width and invisible characters
ZERO_WIDTH = ['\u200c', '\u200f', '\u202a', '\u202b']

PERSIAN_DIGITS = '۰۱۲۳۴۵۶۷۸۹'
ASCII_DIGITS = '0123456789'

# Persian + English stopwords
persian_stopwords = set(stopwords_list())
english_stopwords = set(stopwords.words("english"))

# --- Cleaning functions ---
def replace_arabic_chars(text):
    for a, f in AR_TO_FA.items():
        text = text.replace(a, f)
    return text

def remove_zero_width(text):
    for zw in ZERO_WIDTH:
        text = text.replace(zw, " ")
    return text

def normalize_digits(text):
    for fa, en in zip(PERSIAN_DIGITS, ASCII_DIGITS):
        text = text.replace(fa, en)
    return text

def clean_html(text):
    return re.sub(r"<.*?>", " ", text)

def clean_urls(text):
    return re.sub(r"http\S+|www\.\S+", " ", text)

def remove_unwanted_chars(text):
    # Keep Persian and English letters, ASCII digits, and whitespace; replace everything else with a space
    return re.sub(r"[^آ-یA-Za-z0-9\s]", " ", text)

def remove_extra_spaces(text):
    return re.sub(r"\s+", " ", text).strip()


def remove_stopwords(text):
    tokens = word_tokenize(text)
    tokens = [t for t in tokens if t not in persian_stopwords and t.lower() not in english_stopwords]
    return " ".join(tokens)



In [15]:
def clean_persian(text):
    if text is None: return ""
    text = str(text)
    text = text.replace("٫", " ")
    text = text.replace(",", " ")
    text = clean_html(text)
    text = clean_urls(text)
    text = replace_arabic_chars(text)
    text = remove_zero_width(text)
    text = normalize_digits(text)
    text = normalizer.normalize(text)
    # hazm's Normalizer may convert digits back to Persian forms, so normalize them again
    text = normalize_digits(text)
    text = remove_unwanted_chars(text)
    text = remove_extra_spaces(text)
    #text = remove_stopwords(text)
    return text

def clean_english(text):
    if text is None: return ""
    text = str(text).strip()
    text = text.replace("٫", " ")
    text = text.replace(",", " ")
    # optional: lowercasing (depends if you want to preserve casing)
    text = text.lower()
    text = re.sub(r'\s+', ' ', text)
    text = clean_html(text)
    text = clean_urls(text)
    text = remove_zero_width(text)
    text = normalizer.normalize(text)
    text = normalize_digits(text)
    text = remove_unwanted_chars(text)
    text = remove_extra_spaces(text)
    #text = remove_stopwords(text)
    return text

# Quick test
print(clean_persian("این٫  ۱۲۳ کتاب‌ هاست‎"))
print(clean_english(" This IS  ,a Test! "))
print(clean_persian(" کاتالوگ 4 5۵, ٫ مگابایت"))
print(clean_english(" کاتالوگ ۵۴ 7, ٫ مگابایت"))

این 123 کتاب هاست
this is a test
کاتالوگ 4 55 مگابایت
کاتالوگ 54 7 مگابایت
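
The three character-level helpers (`replace_arabic_chars`, `remove_zero_width`, `normalize_digits`) each loop over `str.replace`. As a hedged alternative, the same mappings can be merged into one translation table and applied in a single pass. `CHAR_TABLE` and `normalize_chars` are hypothetical names introduced here; the constants are copied from the cell above so the block runs on its own.

```python
# Constants copied from the cleaning-helpers cell
AR_TO_FA = {"\u064A": "\u06CC", "\u0643": "\u06A9"}
ZERO_WIDTH = ["\u200c", "\u200f", "\u202a", "\u202b"]
PERSIAN_DIGITS = "۰۱۲۳۴۵۶۷۸۹"
ASCII_DIGITS = "0123456789"

# One table: Arabic letters -> Persian, zero-width chars -> space, Persian digits -> ASCII
CHAR_TABLE = str.maketrans({
    **AR_TO_FA,
    **{zw: " " for zw in ZERO_WIDTH},
    **dict(zip(PERSIAN_DIGITS, ASCII_DIGITS)),
})


def normalize_chars(text):
    """Apply all single-character replacements in one pass over the string."""
    return text.translate(CHAR_TABLE)


print(normalize_chars("كتاب ۱۲۳"))
```

This only covers the one-to-one character replacements; the hazm normalizer and the regex-based steps would still run separately.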


### 4.6: Apply cleaning

In [16]:
# Apply cleaning to dataset (use num_proc if Colab CPU allows parallelism)
def clean_example(ex):
    return {
        "persian": clean_persian(ex["persian"]),
        "english": clean_english(ex["english"])
    }

print("Cleaning dataset (row by row)...")
ds_clean = ds_simple.map(clean_example, batched=False)  # batched=True can be faster, but clean_example would then need to accept lists of values
print("After cleaning sample:", ds_clean[0])

Cleaning dataset (row by row)...
After cleaning sample: {'persian': 'بابابالو در 30 ژوئن 2011 ساعت 22 40 دقیقه گفت', 'english': 'bababaloo on 30 june 2011 at 22 h 40 minutes said'}


In [17]:
# Filter out empty or too-long sentences (character-length heuristic)
MAX_CHARS = 512
def filter_empty_or_long(ex):
    if ex["persian"].strip()=="" or ex["english"].strip()=="":
        return False
    if len(ex["persian"]) > MAX_CHARS or len(ex["english"]) > MAX_CHARS:
        return False
    return True

ds_clean = ds_clean.filter(filter_empty_or_long)
print("Size after filter:", len(ds_clean))

Size after filter: 399355


In [18]:
print("some examples:")

# Note: ds_clean was filtered, so index i can drift from ds / ds_simple once an earlier row is dropped
for i in range(5):
    print(f"\n {i}-raw:", ds[split_name][i])
    print(f"\n {i}-Pe_En:", ds_simple[i])
    print(f"\n {i}-clean:", ds_clean[i])
    print("\n")

some examples:

 0-raw: {'flash fire .': 'Bababaloo on 30 June 2011 at 22 h 40 minutes said:', 'فلاش آتش .': 'بابابالو در 30 ژوئن 2011 ساعت 22 : 40 دقیقه گفت :'}

 0-Pe_En: {'persian': 'بابابالو در 30 ژوئن 2011 ساعت 22 : 40 دقیقه گفت :', 'english': 'Bababaloo on 30 June 2011 at 22 h 40 minutes said:'}

 0-clean: {'persian': 'بابابالو در 30 ژوئن 2011 ساعت 22 40 دقیقه گفت', 'english': 'bababaloo on 30 june 2011 at 22 h 40 minutes said'}



 1-raw: {'flash fire .': 'After the chicken was put on paper towels, she went on out to the back porch with her guitar, sat down, and began to play.', 'فلاش آتش .': 'بعد از اینکه مرغ را روی دستمال کاغذی گذاشتند ، با گیتار به ایوان پشتی رفت و نشست و شروع به نواختن کرد .'}

 1-Pe_En: {'persian': 'بعد از اینکه مرغ را روی دستمال کاغذی گذاشتند ، با گیتار به ایوان پشتی رفت و نشست و شروع به نواختن کرد .', 'english': 'After the chicken was put on paper towels, she went on out to the back porch with her guitar, sat down, and began to play.'}

 1-clean: {'persia

### 4.7: Apply cleaning with stop-word removal

In [19]:
def clean_stopwords_persian(text):
    if text is None: return ""
    text = remove_stopwords(text)
    return text

def clean_stopwords_english(text):
    if text is None: return ""
    text = remove_stopwords(text)
    return text

# Quick test
print(clean_stopwords_persian(clean_persian("این٫  ۱۲۳ کتاب‌ هاست‎")))
print(clean_stopwords_english(clean_english(" This IS  ,a Test! ")))
print(clean_stopwords_persian(clean_persian(" کاتالوگ, ٫ مگابایت")))
print(clean_stopwords_english(clean_english(" کاتالوگ, ٫ مگابایت")))

123 کتاب هاست
test
کاتالوگ مگابایت
کاتالوگ مگابایت


In [20]:
# Apply stop-words cleaning to dataset (use num_proc if Colab CPU allows parallelism)
def clean_stopwords_example(ex):  # separate name so the earlier clean_example is not overwritten
    return {
        "persian": clean_stopwords_persian(ex["persian"]),
        "english": clean_stopwords_english(ex["english"])
    }

print("Cleaning dataset ...")
ds_stopwords_clean = ds_clean.map(clean_stopwords_example, batched=False)  # batched=True can be faster, but the function would then need to accept lists of values
print("After stop-words cleaning sample:", ds_stopwords_clean[0])

Cleaning dataset ...
After stop-words cleaning sample: {'persian': 'بابابالو 30 ژوئن 2011 ساعت 22 40 دقیقه', 'english': 'bababaloo 30 june 2011 22 h 40 minutes said'}


In [21]:
print("some examples:")

for i in range(5):
    print(f"\n {i}-raw:", ds[split_name][i])
    #print(f"\n {i}-Pe_En:", ds_simple[i])
    print(f"\n {i}-clean:", ds_clean[i])
    print(f"\n {i}-clean_stopwords:", ds_stopwords_clean[i])
    print("\n")

some examples:

 0-raw: {'flash fire .': 'Bababaloo on 30 June 2011 at 22 h 40 minutes said:', 'فلاش آتش .': 'بابابالو در 30 ژوئن 2011 ساعت 22 : 40 دقیقه گفت :'}

 0-clean: {'persian': 'بابابالو در 30 ژوئن 2011 ساعت 22 40 دقیقه گفت', 'english': 'bababaloo on 30 june 2011 at 22 h 40 minutes said'}

 0-clean_stopwords: {'persian': 'بابابالو 30 ژوئن 2011 ساعت 22 40 دقیقه', 'english': 'bababaloo 30 june 2011 22 h 40 minutes said'}



 1-raw: {'flash fire .': 'After the chicken was put on paper towels, she went on out to the back porch with her guitar, sat down, and began to play.', 'فلاش آتش .': 'بعد از اینکه مرغ را روی دستمال کاغذی گذاشتند ، با گیتار به ایوان پشتی رفت و نشست و شروع به نواختن کرد .'}

 1-clean: {'persian': 'بعد از اینکه مرغ را روی دستمال کاغذی گذاشتند با گیتار به ایوان پشتی رفت و نشست و شروع به نواختن کرد', 'english': 'after the chicken was put on paper towels she went on out to the back porch with her guitar sat down and began to play'}

 1-clean_stopwords: {'persian': 'م

### 4.8: Create train/validation/test splits

**4.8.1: ds_clean**

In [22]:
def split_dataset(dataset, seed: int = 42):
    """Split into train/validation/test; validation and test share the holdout equally."""
    total = len(dataset)
    print("Total pairs:", total)

    # Large dataset: ~3% holdout (1.5% validation, 1.5% test); small dataset: 80/10/10
    holdout_frac = 0.03 if total > 1000 else 0.2

    split1 = dataset.train_test_split(test_size=holdout_frac, seed=seed, shuffle=True)
    hold = split1['test'].train_test_split(test_size=0.5, seed=seed)
    dataset_splits = DatasetDict({
        'train': split1['train'],
        'validation': hold['train'],
        'test': hold['test']
    })

    # Print sizes
    for k in dataset_splits:
        print(f"{k}: {len(dataset_splits[k])}")

    # Peek at one example
    print("\nExample pair (train[10]):")
    print(dataset_splits['train'][10])

    return dataset_splits


In [23]:
dataset_splits = split_dataset(ds_clean)

Total pairs: 399355
train: 387374
validation: 5990
test: 5991

Example pair (train[10]):
{'persian': 'به همین ترتیب مانند گناه یک انسان محکومیت بر همه انسان ها وارد شد و به همین ترتیب با عدالت یک نفر پیروزی زندگی برای همه مردم خواهد بود', 'english': 'in like manner as by one man s offence condemnation came upon all men even so by the righteousness of one man will the victory to life be to all men'}
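
The split sizes printed above can be checked by hand. The sketch below assumes the holdout part of each split is rounded up, which is an assumption about the rounding rule rather than something stated in this notebook, though it reproduces the printed numbers. `expected_split_sizes` is a hypothetical helper.

```python
import math


def expected_split_sizes(total, holdout_frac=0.03):
    """Estimate train/validation/test sizes, assuming ceil rounding for the held-out part."""
    holdout = math.ceil(total * holdout_frac)   # first split: train vs. holdout
    test = math.ceil(holdout * 0.5)             # second split: holdout halved
    return {
        "train": total - holdout,
        "validation": holdout - test,
        "test": test,
    }


print(expected_split_sizes(399355))
# {'train': 387374, 'validation': 5990, 'test': 5991}
```

With 399,355 pairs, 3% is 11,980.65, which rounds up to 11,981 held-out pairs; halving that gives 5,990 for validation and 5,991 for test.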


In [24]:
# To split the stop-word-free variant instead, uncomment the line below
#dataset_splits = split_dataset(ds_stopwords_clean)

### 4.9: Save to disk / optional Google Drive mount

In [25]:
# Option A: save to Colab local storage
os.makedirs("data", exist_ok=True)
dataset_splits['train'].to_csv("data/train.csv")
dataset_splits['validation'].to_csv("data/validation.csv")
dataset_splits['test'].to_csv("data/test.csv")
print("Saved CSVs to ./data/")

Saved CSVs to ./data/


In [26]:
# Option B: save to Google Drive (uncomment to use)
# from google.colab import drive
# drive.mount('/content/drive')
# out_dir = "/content/drive/MyDrive/persian_translation_dataset"
# os.makedirs(out_dir, exist_ok=True)
# dataset_splits['train'].to_csv(os.path.join(out_dir, "train.csv"))
# dataset_splits['validation'].to_csv(os.path.join(out_dir, "validation.csv"))
# dataset_splits['test'].to_csv(os.path.join(out_dir, "test.csv"))
# print("Saved CSVs to Google Drive:", out_dir)

## 5. POS tagging (Hazm)

## 6. Named Entity Recognition

## 7. Tokenization & embedding

## 8. Seq2Seq model

## 9. Evaluation

## End of Notebook
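This Colab notebook currently covers environment setup, data collection, and preprocessing. The POS tagging, NER, tokenization, Seq2Seq model, and evaluation sections are outlined above but have no content yet.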