# 🌍 English-to-Sindhi Neural Translator 🇬🇧➡️🇸🇳

**A Natural Language Processing project that bridges English and Sindhi using Deep Learning and Rule-Based NLP techniques.**

This project enables seamless translation between English and Sindhi — focusing on accuracy, accessibility, and preserving regional languages through modern AI.


## 📌 Features

- ✅ Rule-based Sindhi ➡️ English translation (dictionary lookup)
- 🤖 English ➡️ Sindhi neural machine translation using HuggingFace’s MarianMT
- 📚 Custom parallel dataset (English ↔️ Sindhi)
- 🧠 Fine-tuned Transformer model for low-resource translation
- 💬 Potential for speech and TTS integration


## 🧠 Model Architecture

- **Model:** MarianMT (`Helsinki-NLP/opus-mt-en-ROMANCE`)
- **Frameworks:** Hugging Face Transformers, PyTorch
- **Dataset:** 10,000+ paired English↔Sindhi sentences from `s1.csv`
- **Approach:**
  - Preprocessing with tokenization
  - Fine-tuning with `Seq2SeqTrainer`
  - Evaluated on accuracy of translation and sentence structure


## 📁 Dataset

- Built from local resources and parallel sentence collections
- Columns: `English`, `Sindhi`
- Preprocessed using Hugging Face’s `Dataset` module


In [1]:
# Step 0: Sindhi to English Dictionary
sindhi_to_english = {
    'مان': 'I',
    'توھان': 'you',
    'هو': 'he',
    'آهي': 'is',
    'هئا': 'was',
    'هجي': 'should be',
    'آهن': 'are',
    'ويندس': 'will go',
    'وڃان': 'go',
    'اسڪول': 'school',
    'ڪتاب': 'book',
    'سٺو': 'good',
    'ڪتو': 'dog',
    'ڏسيو': 'saw',
    'ڳالهايو': 'spoke',
    'سان': 'with',
    'ٿي': 'became',
    'آيو': 'came',
    'نه': 'not',
    'سائين': 'sir'
}

# Step 1: Rule-Based Translator Function
def rule_based_translate(sindhi_sentence):
    words = sindhi_sentence.strip().split()
    translated_words = []

    for word in words:
        translated = sindhi_to_english.get(word, f"[{word}]")  # keep unknowns in []
        translated_words.append(translated)

    return ' '.join(translated_words)

# Step 2: Example Usage
sindhi_input = "سائين مان اسڪول وڃان"
english_output = rule_based_translate(sindhi_input)
print("Sindhi:  ", sindhi_input)
print("English: ", english_output)


Sindhi:   سائين مان اسڪول وڃان
English:  sir I school go


### Pre Processing

In [20]:
import pandas as pd
from datasets import Dataset
from transformers import (
    MarianMTModel, MarianTokenizer, Seq2SeqTrainer, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq
)
import os

In [19]:
import pandas as pd
from datasets import Dataset

# Load and sample 10k rows
df = pd.read_csv('s1.csv')
df = df[['English', 'Sindhi']].dropna().sample(10000, random_state=42)

# HuggingFace dataset format
dataset = Dataset.from_pandas(df)


In [22]:
dataset = Dataset.from_pandas(df)

In [7]:
df.head()

Unnamed: 0,English,Sindhi
49351,were going to work tonight,اسان اڄ رات ڪم ڪرڻ وارا آهيون
24066,i knew everyone there,مان اتي سڀني کي سڃاڻان
69801,she visited her husband in prison,هوء جيل پنهنجي مڙس سان ملاقات ڪئي
23540,he went into teaching,هو درس هليو ويو
19161,why are you cursing,ڇو ٿا لعنتون


## **Tokenize and Preprocess**

## **Model + Tokenizer**

In [24]:
# Load tokenizer and model (or initialize your own small one)
model_name = "Helsinki-NLP/opus-mt-en-sd"  # start from small checkpoint
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

ImportError: 
MarianTokenizer requires the SentencePiece library but it was not found in your environment. Checkout the instructions on the
installation page of its repo: https://github.com/google/sentencepiece#installation and follow the ones
that match your environment. Please note that you may need to restart your runtime after installation.


In [8]:
def preprocess(example):
    inputs = tokenizer(example['English'], padding="max_length", truncation=True, max_length=128)
    targets = tokenizer(example['Sindhi'], padding="max_length", truncation=True, max_length=128)
    inputs["labels"] = targets["input_ids"]
    return inputs

tokenized_dataset = dataset.map(preprocess, remove_columns=dataset.column_names)

NameError: name 'dataset' is not defined

In [8]:
import transformers
print(transformers.__version__)

4.51.3


In [10]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
import os
os.environ["WANDB_DISABLED"] = "true"

training_args = Seq2SeqTrainingArguments(
    output_dir="./opus-mt-en-sd",
    save_steps=500,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    save_total_limit=2,
    predict_with_generate=True,
    logging_dir='./logs',
    logging_steps=100,
    learning_rate=2e-5,
    warmup_steps=500,
    weight_decay=0.01,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset.select(range(500)),  # small dev set
)

trainer.train()

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Step,Training Loss
100,5.5179
200,0.8283
300,0.551
400,0.466
500,0.408
600,0.3653
700,0.3264
800,0.3043
900,0.2766
1000,0.273


  Arguments:


TrainOutput(global_step=11250, training_loss=0.20444406127929687, metrics={'train_runtime': 1554.3361, 'train_samples_per_second': 28.951, 'train_steps_per_second': 7.238, 'total_flos': 1525426421760000.0, 'train_loss': 0.20444406127929687, 'epoch': 3.0})

In [16]:
import torch
def translate(text):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Move model to the right device
    model.to(device)

    # Tokenize and move inputs to the same device
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)

    # Generate translation
    translated_tokens = model.generate(**inputs)

    # Decode and return
    output = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
    return output

In [18]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# move model to device
model.to(device)

def translate(text):
    # Move input to same device as model
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    inputs = {key: val.to(device) for key, val in inputs.items()}  # 💥 this is the key

    translated_tokens = model.generate(**inputs)
    output = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
    return output


توهان ڪيئن آهيو آهيو


In [21]:
# Example
print(translate("tell me who is your fathers?"))

مون کي ٻڌايو ته توهان جا پيء ڪير آهن


In [29]:
print(translate("I love learning new things."))
print(translate("Translate this to French."))

مون کي نئين شيون سکڻ پسند آهي
هن کي فرانسيس ترجم ڪريو


In [30]:

pip install transformers huggingface_hub



In [36]:
from huggingface_hub import notebook_login
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [37]:
model.save_pretrained("my-translator")
tokenizer.save_pretrained("my-translator")

('my-translator/tokenizer_config.json',
 'my-translator/special_tokens_map.json',
 'my-translator/vocab.json',
 'my-translator/source.spm',
 'my-translator/target.spm',
 'my-translator/added_tokens.json')

In [39]:
from huggingface_hub import upload_folder
upload_folder(
    folder_path="my-translator",
    repo_id="Jawadah1/english-sindhi-translator",  # Replace with your real Hugging Face username + repo
    repo_type="model"
)

source.spm:   0%|          | 0.00/790k [00:00<?, ?B/s]

Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]

target.spm:   0%|          | 0.00/707k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/308M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Jawadah1/english-sindhi-translator/commit/b97142e4af75f27d44e394d3f7193d863ec29be5', commit_message='Upload folder using huggingface_hub', commit_description='', oid='b97142e4af75f27d44e394d3f7193d863ec29be5', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Jawadah1/english-sindhi-translator', endpoint='https://huggingface.co', repo_type='model', repo_id='Jawadah1/english-sindhi-translator'), pr_revision=None, pr_num=None)

# **UI**

In [40]:
!pip install -q fastapi uvicorn transformers huggingface_hub

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/95.2 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m95.2/95.2 kB[0m [31m8.4 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/62.5 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m62.5/62.5 kB[0m [31m6.2 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/72.0 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m72.0/72.0 kB[0m [31m6.8 MB/s[0m eta [36m0:00:00[0m
[?25h

In [41]:
from fastapi import FastAPI, Request
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

app = FastAPI()

# Load model from Hugging Face
model_name = "Jawadah1/english-sindhi-translator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

class TranslateRequest(BaseModel):
    text: str

@app.post("/translate")
def translate(req: TranslateRequest):
    inputs = tokenizer(req.text, return_tensors="pt", padding=True, truncation=True)
    translated = model.generate(**inputs)
    output = tokenizer.decode(translated[0], skip_special_tokens=True)
    return {"translation": output}


tokenizer_config.json:   0%|          | 0.00/849 [00:00<?, ?B/s]

source.spm:   0%|          | 0.00/790k [00:00<?, ?B/s]

target.spm:   0%|          | 0.00/707k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.55M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/74.0 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/1.32k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/308M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/288 [00:00<?, ?B/s]

In [45]:
def predict(input_data):
    # Your model inference code here
    return "model prediction for " + input_data


In [50]:
!pip install gradio

Collecting gradio
  Downloading gradio-5.31.0-py3-none-any.whl.metadata (16 kB)
Collecting aiofiles<25.0,>=22.0 (from gradio)
  Downloading aiofiles-24.1.0-py3-none-any.whl.metadata (10 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.5.0-py3-none-any.whl.metadata (3.0 kB)
Collecting gradio-client==1.10.1 (from gradio)
  Downloading gradio_client-1.10.1-py3-none-any.whl.metadata (7.1 kB)
Collecting groovy~=0.1 (from gradio)
  Downloading groovy-0.1.2-py3-none-any.whl.metadata (6.1 kB)
Collecting pydub (from gradio)
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting python-multipart>=0.0.18 (from gradio)
  Downloading python_multipart-0.0.20-py3-none-any.whl.metadata (1.8 kB)
Collecting ruff>=0.9.3 (from gradio)
  Downloading ruff-0.11.11-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (25 kB)
Collecting safehttpx<0.2.0,>=0.1.6 (from gradio)
  Downloading safehttpx-0.1.6-py3-none-any.whl.metadata (4.2 kB)
Collecting semantic-version~=2.

In [51]:
import gradio as gr

def predict(input_text):
    # Dummy example, replace with your actual model code
    return "Prediction: " + input_text.upper()

iface = gr.Interface(fn=predict, inputs="text", outputs="text")
iface.launch()


It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://722f76e98b27dc9dac.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




In [44]:

!pip install pyngrok
from pyngrok import ngrok
ngrok_tunnel = ngrok.connect(8000)
print("Public URL:", ngrok_tunnel.public_url)




ERROR:pyngrok.process.ngrok:t=2025-05-24T06:52:40+0000 lvl=eror msg="failed to reconnect session" obj=tunnels.session err="authentication failed: Usage of ngrok requires a verified account and authtoken.\n\nSign up for an account: https://dashboard.ngrok.com/signup\nInstall your authtoken: https://dashboard.ngrok.com/get-started/your-authtoken\r\n\r\nERR_NGROK_4018\r\n"
ERROR:pyngrok.process.ngrok:t=2025-05-24T06:52:40+0000 lvl=eror msg="session closing" obj=tunnels.session err="authentication failed: Usage of ngrok requires a verified account and authtoken.\n\nSign up for an account: https://dashboard.ngrok.com/signup\nInstall your authtoken: https://dashboard.ngrok.com/get-started/your-authtoken\r\n\r\nERR_NGROK_4018\r\n"
ERROR:pyngrok.process.ngrok:t=2025-05-24T06:52:40+0000 lvl=eror msg="terminating with error" obj=app err="authentication failed: Usage of ngrok requires a verified account and authtoken.\n\nSign up for an account: https://dashboard.ngrok.com/signup\nInstall your aut

PyngrokNgrokError: The ngrok process errored on start: authentication failed: Usage of ngrok requires a verified account and authtoken.\n\nSign up for an account: https://dashboard.ngrok.com/signup\nInstall your authtoken: https://dashboard.ngrok.com/get-started/your-authtoken\r\n\r\nERR_NGROK_4018\r\n.