# Persian-English Translator Project (Classical Seq2Seq)

This notebook builds a classical Persian↔English translator using a Seq2Seq (LSTM) model. The planned pipeline covers preprocessing, POS tagging, NER, tokenization, embedding, model design, training, and evaluation.

---


## Table of contents
1. Project goals
2. Environment setup
3. Data collection
4. Preprocessing
5. POS tagging
6. NER
7. Tokenization & Embedding
8. Seq2Seq model
9. Evaluation
10. Analysis and report



## 2. Environment setup

In [1]:
!pip install --upgrade pip
!pip install numpy pandas matplotlib scikit-learn tensorflow keras hazm parsivar sacrebleu nltk sentencepiece
# Optional: for advanced Persian NER datasets and models
!pip install datasets transformers seqeval



In [None]:
'''
!pip install hazm
!pip install parsivar
!pip install sacrebleu
!pip install nltk
!pip install sentencepiece
!pip install datasets
!pip install transformers
!pip install seqeval
'''

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import tensorflow as tf
import keras

import hazm
import parsivar
import sacrebleu
import nltk
import sentencepiece
import datasets
import transformers
import seqeval

## 3. Data collection

### 3.1: Imports

In [3]:
from datasets import load_dataset, DatasetDict
import os
import re
from pprint import pprint

### 3.2: Load dataset and inspect (normal load)

In [4]:
# Data collection: load the Hugging Face dataset (full, non-streaming)

DATASET_ID = "shenasa/English-Persian-Parallel-Dataset"

print("Loading dataset (this may take a minute)...")
ds = load_dataset(DATASET_ID)   # loads all splits (here likely only 'train')
print(ds)                       # quick summary (splits, size)

# Print column names and example
split_name = list(ds.keys())[0]      # should be 'train'
print("Split:", split_name)
print("Columns:", ds[split_name].column_names)
print("\nOne sample (0):")
pprint(ds[split_name][0])


Loading dataset (this may take a minute)...



DatasetDict({
    train: Dataset({
        features: ['flash fire .', 'فلاش آتش .'],
        num_rows: 3960172
    })
})
Split: train
Columns: ['flash fire .', 'فلاش آتش .']

One sample (0):
{'flash fire .': 'superheats the air . burns the lungs like rice paper .',
 'فلاش آتش .': 'هوا را فوق العاده گرم می کند . ریه ها را مثل کاغذ برنج می '
               'سوزاند .'}
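
The column names printed above (`'flash fire .'` and `'فلاش آتش .'`) look like a sentence pair rather than field names, which suggests the source file has no header row and its first pair was consumed as the header. The sketch below shows one way to recover that pair, assuming this reading is correct. `looks_persian` and `header_as_pair` are hypothetical helpers, not part of `datasets`.

```python
import re

def looks_persian(text):
    """True if the string contains at least one Arabic-script character."""
    return bool(re.search(r'[\u0600-\u06FF]', text))

def header_as_pair(column_names):
    """Turn two column names that are really data into a persian/english record.

    Hypothetical helper: assumes exactly two columns, one per language.
    """
    if len(column_names) != 2:
        raise ValueError("expected exactly two columns")
    fa = [c for c in column_names if looks_persian(c)]
    en = [c for c in column_names if not looks_persian(c)]
    if len(fa) != 1 or len(en) != 1:
        raise ValueError("could not tell the two languages apart")
    return {"persian": fa[0], "english": en[0]}

recovered = header_as_pair(['flash fire .', 'فلاش آتش .'])
print(recovered)
```

With close to four million pairs, losing one pair has no practical effect on training. The sketch mainly documents why the column names look odd.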


## 4. Preprocessing (Normalization & cleaning)

### 4.1: Imports

In [5]:
import re
from datasets import Dataset, DatasetDict
from hazm import Normalizer

### 4.2: Auto-detect which column is Persian / English & rename

In [6]:
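# Identify which column contains Persian text and which contains English,
# based on the first row only.
def is_persian(s):
    """Return True if s contains at least one Arabic-script character."""
    if s is None:
        return False
    return bool(re.search(r'[\u0600-\u06FF]', s))

cols = ds[split_name].column_names
sample = ds[split_name][0]

persian_col = None
english_col = None
for c in cols:
    try:
        if is_persian(sample[c]):
            persian_col = c
        else:
            english_col = c
    except TypeError:
        # Skip columns whose values are not strings
        continue

# Fail early instead of carrying a None column name into later cells
if persian_col is None or english_col is None:
    raise ValueError(f"Could not detect both languages in columns: {cols}")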

print("Detected Persian column:", persian_col)
print("Detected English column:", english_col)



Detected Persian column: فلاش آتش .
Detected English column: flash fire .
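
The detection above inspects a single row and treats any column without Arabic-script characters as English. Rows that mix scripts, such as the `'Thakhek 5121 **** تلفن'` pair that appears later in this notebook, show why a vote over several rows may be safer. The sketch below is one possible approach. `persian_ratio`, `detect_columns`, and the 0.5 threshold are illustrative assumptions.

```python
import re

ARABIC_SCRIPT = re.compile(r'[\u0600-\u06FF]')
LATIN = re.compile(r'[A-Za-z]')

def persian_ratio(text):
    """Share of Arabic-script characters among Arabic-script + Latin characters."""
    fa = len(ARABIC_SCRIPT.findall(text or ""))
    en = len(LATIN.findall(text or ""))
    return fa / (fa + en) if (fa + en) else 0.0

def detect_columns(rows, threshold=0.5):
    """Return (persian_col, english_col) by averaging the ratio over sample rows."""
    cols = list(rows[0].keys())
    avg = {c: sum(persian_ratio(r[c]) for r in rows) / len(rows) for c in cols}
    persian = [c for c in cols if avg[c] > threshold]
    english = [c for c in cols if avg[c] <= threshold]
    if len(persian) != 1 or len(english) != 1:
        raise ValueError(f"ambiguous columns: {avg}")
    return persian[0], english[0]

rows = [
    {"a": "hey , guys . down here .", "b": "سلام بچه ها . این پایین ."},
    {"a": "thakhek 5121 **** phone", "b": "Thakhek 5121 **** تلفن"},
]
detected = detect_columns(rows)
print(detected)
```

In the notebook, `rows` could be built from the first few dozen examples of `ds[split_name]`.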


### 4.3: Rename columns to standard names and run light cleaning

In [7]:
# If the dataset columns are e.g. 'translation' you may need to adapt the mapping below.

def rename_and_select(example):
    return {"persian": example[persian_col], "english": example[english_col]}

print("Mapping and renaming columns... (this may take a while for large datasets)")
ds_simple = ds[split_name].map(rename_and_select, remove_columns=ds[split_name].column_names)

print("New columns:", ds_simple.column_names)
print("Sample:", ds_simple[0])


Mapping and renaming columns... (this may take a while for large datasets)


New columns: ['persian', 'english']
Sample: {'persian': 'هوا را فوق العاده گرم می کند . ریه ها را مثل کاغذ برنج می سوزاند .', 'english': 'superheats the air . burns the lungs like rice paper .'}


### 4.4: Cleaning helpers (Persian normalizer + English lite cleaning)

In [8]:
# Minimal cleaning pipeline for Persian, plus basic English normalization.

normalizer = Normalizer()

# Arabic yeh/kaf -> Persian yeh/kaf
AR_TO_FA = {'\u064A': '\u06CC', '\u0643': '\u06A9'}
# Invisible characters to strip: ZWNJ, RLM, LRE, RLE.
# hazm's normalizer re-inserts ZWNJ where its spacing rules apply.
ZERO_WIDTH = ['\u200c', '\u200f', '\u202a', '\u202b']
PERSIAN_DIGITS = '۰۱۲۳۴۵۶۷۸۹'
ASCII_DIGITS = '0123456789'

def replace_arabic_chars(text):
    for a, f in AR_TO_FA.items():
        text = text.replace(a, f)
    return text

def remove_zero_width(text):
    for ch in ZERO_WIDTH:
        text = text.replace(ch, '')
    return text

def persian_to_ascii_digits(text):
    for p, a in zip(PERSIAN_DIGITS, ASCII_DIGITS):
        text = text.replace(p, a)
    return text

def clean_persian(text):
    if text is None: return ""
    text = str(text)
    text = replace_arabic_chars(text)
    text = remove_zero_width(text)
    try:
        text = normalizer.normalize(text)
    except Exception:
        pass
    text = persian_to_ascii_digits(text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def clean_english(text):
    if text is None: return ""
    text = str(text).strip()
    # optional: lowercasing (depends if you want to preserve casing)
    text = text.lower()
    text = re.sub(r'\s+', ' ', text)
    return text

# Quick test
print(clean_persian("این ۱۲۳ کتاب‌ هاست‎"))
print(clean_english(" This IS  a Test! "))


این 123 کتاب هاست‎
this is a test!
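
The three character-level helpers above each loop over their mapping and call `str.replace` once per entry. The same replacements can be done in one pass with `str.translate`, sketched below. `CHAR_TABLE` and `fast_char_cleanup` are illustrative names, and the sketch covers only the character-level steps, not the hazm normalization.

```python
# Same mappings as the cell above, merged into one translation table.
AR_TO_FA = {'\u064A': '\u06CC', '\u0643': '\u06A9'}
ZERO_WIDTH = ['\u200c', '\u200f', '\u202a', '\u202b']
PERSIAN_DIGITS = ''.join(chr(0x06F0 + i) for i in range(10))  # U+06F0..U+06F9
ASCII_DIGITS = '0123456789'

CHAR_TABLE = {ord(a): f for a, f in AR_TO_FA.items()}
CHAR_TABLE.update({ord(ch): None for ch in ZERO_WIDTH})  # None deletes the character
CHAR_TABLE.update({ord(p): a for p, a in zip(PERSIAN_DIGITS, ASCII_DIGITS)})

def fast_char_cleanup(text):
    """Apply all character-level replacements in a single pass."""
    return text.translate(CHAR_TABLE)

# Arabic yeh and kaf, Persian digits, and a ZWNJ, written as escapes for clarity
sample = "\u0639\u0644\u064A \u06F1\u06F2\u06F3 \u0643\u062A\u0627\u0628\u200c\u0647\u0627"
cleaned = fast_char_cleanup(sample)
print(cleaned)
```

Because `clean_persian` converts digits after calling the normalizer, and the normalizer may itself rewrite digits, the digit mapping would likely still need to run after normalization if this table were used in its place.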


### 4.5: Apply cleaning and filter empty/very-long pairs

In [9]:
# Apply cleaning to dataset (use num_proc if Colab CPU allows parallelism)
def clean_example(ex):
    return {
        "persian": clean_persian(ex["persian"]),
        "english": clean_english(ex["english"])
    }

print("Cleaning dataset (batched)...")
# Note: despite the message, this runs one example at a time (batched=False).
# batched=True can be faster but requires clean_example to accept lists.
ds_clean = ds_simple.map(clean_example, batched=False)
print("After cleaning sample:", ds_clean[0])

# Filter out empty or too-long sentences (character-length heuristic)
MAX_CHARS = 512
def filter_empty_or_long(ex):
    if ex["persian"].strip()=="" or ex["english"].strip()=="":
        return False
    if len(ex["persian"]) > MAX_CHARS or len(ex["english"]) > MAX_CHARS:
        return False
    return True

ds_clean = ds_clean.filter(filter_empty_or_long)
print("Size after filter:", len(ds_clean))


Cleaning dataset (batched)...


After cleaning sample: {'persian': 'هوا را فوق\u200cالعاده گرم می\u200cکند. ریه\u200cها را مثل کاغذ برنج می\u200cسوزاند.', 'english': 'superheats the air . burns the lungs like rice paper .'}


Size after filter: 3948961
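
The cleaning cell above maps one example at a time and notes that `batched=True` would need a different function. With `batched=True`, `datasets` passes the mapped function a dict of column name to list of values and expects a dict of lists back. The sketch below shows that shape. `make_batched_cleaner` is a hypothetical wrapper, and the stand-in cleaners exist only so the example runs without hazm.

```python
def make_batched_cleaner(clean_fa, clean_en):
    """Wrap two per-string cleaners into a function for map(..., batched=True)."""
    def clean_batch(batch):
        # batch maps each column name to a list of values
        return {
            "persian": [clean_fa(t) for t in batch["persian"]],
            "english": [clean_en(t) for t in batch["english"]],
        }
    return clean_batch

# Stand-in cleaners; in the notebook these would be clean_persian and clean_english.
clean_batch = make_batched_cleaner(str.strip, lambda s: s.strip().lower())

demo = {"persian": [" سلام ", "دنیا"], "english": [" Hello ", "WORLD"]}
result = clean_batch(demo)
print(result)
```

In the notebook this would be called as `ds_simple.map(clean_batch, batched=True)`, optionally with `num_proc` to spread the work across CPU cores.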


### 4.6: Create train/validation/test splits

In [10]:
# If the original dataset already had splits, you can use them directly.
# Here ds_clean is a single large 'train' split, so we create val/test ourselves.
total = len(ds_clean)
print("Total pairs:", total)

# Large dataset: hold out ~3%, then split the holdout in half (~1.5% each for val/test).
# Small dataset: hold out 20%, giving an 80/10/10 split.
holdout_fraction = 0.03 if total > 1000 else 0.2
split1 = ds_clean.train_test_split(test_size=holdout_fraction, seed=42, shuffle=True)
hold = split1['test'].train_test_split(test_size=0.5, seed=42)
dataset_splits = DatasetDict({
    'train': split1['train'],
    'validation': hold['train'],
    'test': hold['test']
})

for k in dataset_splits:
    print(k, len(dataset_splits[k]))

# Peek at a few examples
print("Example pair (train[0]):")
print(dataset_splits['train'][0])


Total pairs: 3948961
train 3830492
validation 59234
test 59235
Example pair (train[0]):
{'persian': 'Thakhek 5121 **** تلفن', 'english': 'thakhek 5121 **** phone'}
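
The split sizes printed above can be checked by hand. The sketch below assumes the test side of a fractional split is rounded up and the train side rounded down. That rounding rule is an assumption about the library, but it reproduces the printed numbers.

```python
import math

def split_sizes(n, test_size):
    """Sizes for a fractional split, assuming ceil for test and floor for train."""
    n_test = math.ceil(test_size * n)
    n_train = math.floor((1 - test_size) * n)
    return n_train, n_test

total = 3948961
n_train, n_hold = split_sizes(total, 0.03)   # 3% holdout
n_val, n_test = split_sizes(n_hold, 0.5)     # holdout split in half
print(n_train, n_val, n_test)                # 3830492 59234 59235
```

The holdout contains 118469 pairs, an odd number, which is why the validation and test sets differ by one.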


### 4.7: Save to disk / optional Google Drive mount

In [11]:
# Option A: save to Colab local storage
os.makedirs("data", exist_ok=True)
dataset_splits['train'].to_csv("data/train.csv")
dataset_splits['validation'].to_csv("data/validation.csv")
dataset_splits['test'].to_csv("data/test.csv")
print("Saved CSVs to ./data/")

# Option B: save to Google Drive (uncomment to use)
# from google.colab import drive
# drive.mount('/content/drive')
# out_dir = "/content/drive/MyDrive/persian_translation_dataset"
# os.makedirs(out_dir, exist_ok=True)
# dataset_splits['train'].to_csv(os.path.join(out_dir,"train.csv"))
# dataset_splits['validation'].to_csv(os.path.join(out_dir,"validation.csv"))
# dataset_splits['test'].to_csv(os.path.join(out_dir,"test.csv"))
# print("Saved CSVs to Google Drive:", out_dir)


Saved CSVs to ./data/


### 4.8: Create a small subset for fast experiments

In [12]:
# Create a small subset (e.g., 10k pairs) for fast development / debugging.
small_size = 10000
if len(dataset_splits['train']) > small_size:
    small_train = dataset_splits['train'].select(range(small_size))
else:
    small_train = dataset_splits['train']

small_train.to_csv("data/train_small.csv")
print("Saved small train:", len(small_train))


Saved small train: 10000


### 4.9: Streaming mode — inspect a few examples without full download

In [13]:
# If dataset is extremely large and you prefer to stream:
print("Streaming a few examples (no full download):")
stream_ds = load_dataset(DATASET_ID, split="train", streaming=True)
for i, ex in enumerate(stream_ds):
    print(i, ex)
    if i >= 10:
        break

Streaming a few examples (no full download):
0 {'flash fire .': 'superheats the air . burns the lungs like rice paper .', 'فلاش آتش .': 'هوا را فوق العاده گرم می کند . ریه ها را مثل کاغذ برنج می سوزاند .'}
1 {'flash fire .': 'hey , guys . down here . down here .', 'فلاش آتش .': 'سلام بچه ها . این پایین . این پایین .'}
2 {'flash fire .': 'what do you got down this corridor is the bow , right .', 'فلاش آتش .': 'چه چیزی در این راهرو پایین آمده است ، درست است .'}
3 {'flash fire .': 'theres an access hatch right there that puts us into the bowthruster room .', 'فلاش آتش .': 'یک دریچه دسترسی درست در آنجا وجود دارد که ما را وارد اتاق کمان می کند .'}
4 {'flash fire .': 'we get into the propeller tubes and the only thing between us and the outside .', 'فلاش آتش .': 'وارد لوله های پروانه می شویم و تنها چیزی که بین ما و بیرون است .'}
5 {'flash fire .': 'is nothing . all right . lets go .', 'فلاش آتش .': 'هیچی نیست . خیلی خوب . بیا بریم .'}
6 {'flash fire .': 'lets go . thats our way out .', 'فلاش
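
The loop above stops a streamed iteration with a manual counter and `break`. An alternative is `itertools.islice`, which takes the first `n` items from any iterable without consuming the rest. The sketch below uses a stand-in generator, `fake_stream`, which is hypothetical and only mimics a lazily produced dataset.

```python
from itertools import islice

def fake_stream():
    """Stand-in for a streaming dataset: yields records lazily, without end."""
    i = 0
    while True:
        yield {"english": f"sentence {i}", "persian": f"جمله {i}"}
        i += 1

# Take exactly five records; the generator is never exhausted.
first_five = list(islice(fake_stream(), 5))
for i, ex in enumerate(first_five):
    print(i, ex)
```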

## 5. POS tagging (Hazm)

In [None]:
print('test')

## 6. Named Entity Recognition

## 7. Tokenization & embedding

## 8. Seq2Seq model

## 9. Evaluation

## End of Notebook
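This Colab notebook currently covers environment setup, data collection, and preprocessing. The POS tagging, NER, tokenization, Seq2Seq model, and evaluation sections are placeholders still to be filled in.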