This project implements a Seq2Seq language translation model using an LSTM-based encoder-decoder in PyTorch.
Supported examples:
- English -> French
- English -> Tamil
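For orientation, the LSTM encoder-decoder can be sketched roughly as below; layer sizes are illustrative and the real classes in train_seq2seq.py may differ:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    # Embeds source tokens and compresses the sentence into final LSTM states.
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):
        _, (h, c) = self.lstm(self.embed(src))
        return h, c  # final states seed the decoder

class Decoder(nn.Module):
    # Generates target tokens step by step from the encoder states.
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tgt, state):
        output, state = self.lstm(self.embed(tgt), state)
        return self.out(output), state  # logits over the target vocabulary
```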
Project files:
- `train_seq2seq.py` - training, evaluation, and interactive translation
- `app.py` - Flask frontend for browser-based translation
- `templates/index.html` - main frontend page
- `static/styles.css` - frontend styling
- `requirements.txt` - Python dependencies
- `data/sample_en_fr.tsv` - tiny English-French sample dataset
- `data/sample_en_ta.tsv` - tiny English-Tamil sample dataset
Use a tab-separated file with one sentence pair per line:
```
english sentence<TAB>target sentence
```
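For reference, a minimal loader for this format might look like the following; it keeps only the first two tab-separated columns, which is also how the trainer handles the extra metadata columns in fra.txt (see Notes):

```python
def load_pairs(path, max_samples=None):
    # Keep only the first two tab-separated columns per line;
    # fra.txt carries extra metadata columns that are ignored.
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if len(cols) >= 2:
                pairs.append((cols[0], cols[1]))
            if max_samples is not None and len(pairs) >= max_samples:
                break
    return pairs
```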
Install dependencies:

```bash
pip install -r requirements.txt
```

English -> French:

```bash
python train_seq2seq.py --data_path data/sample_en_fr.tsv --source_lang en --target_lang fr --epochs 300
```

Train on a larger French dataset file such as C:/Users/admin/OneDrive/Desktop/fra.txt:

```bash
python train_seq2seq.py --data_path "C:/Users/admin/OneDrive/Desktop/fra.txt" --source_lang en --target_lang fr --max_samples 50000 --epochs 15 --batch_size 64
```

Force GPU training when CUDA is available:

```bash
python train_seq2seq.py --data_path "C:/Users/admin/OneDrive/Desktop/fra.txt" --source_lang en --target_lang fr --max_samples 20000 --epochs 8 --batch_size 64 --device cuda
```

English -> Tamil:

```bash
python train_seq2seq.py --data_path data/sample_en_ta.tsv --source_lang en --target_lang ta --epochs 300
```
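The flags above correspond to a conventional argparse setup; the sketch below shows one plausible shape (defaults are assumptions, not the actual values in train_seq2seq.py):

```python
import argparse
import torch

def build_parser():
    # Flags match the commands above; defaults here are assumptions.
    p = argparse.ArgumentParser(description="Seq2Seq trainer")
    p.add_argument("--data_path", required=True, help="TSV file of sentence pairs")
    p.add_argument("--source_lang", default="en")
    p.add_argument("--target_lang", default="fr")
    p.add_argument("--epochs", type=int, default=300)
    p.add_argument("--batch_size", type=int, default=32)
    p.add_argument("--max_samples", type=int, default=None, help="cap on pairs read")
    p.add_argument("--mode", choices=["train", "translate"], default="train")
    p.add_argument("--device", choices=["auto", "cpu", "cuda"], default="auto")
    return p

def resolve_device(choice):
    # "auto" picks CUDA when available, mirroring the --device note below.
    if choice == "auto":
        return torch.device("cuda" if torch.cuda.is_available() else "cpu")
    return torch.device(choice)
```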
After training, the script saves:
- `artifacts/config.json`
- `artifacts/model.pt`
- `artifacts/vocab.json`
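A plausible shape for that save step, assuming the config and vocabulary are plain dicts (the helper name is illustrative):

```python
import json
from pathlib import Path

import torch

def save_artifacts(model, config, vocab, out_dir="artifacts"):
    # Illustrative only: writes the three files listed above.
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / "config.json", "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    torch.save(model.state_dict(), out / "model.pt")
    # ensure_ascii=False keeps Tamil (and accented French) text readable.
    with open(out / "vocab.json", "w", encoding="utf-8") as f:
        json.dump(vocab, f, ensure_ascii=False, indent=2)
```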
Run interactive inference:
```bash
python train_seq2seq.py --mode translate
```

Start the web app:

```bash
python app.py
```

Then open:
http://127.0.0.1:5000
The frontend uses the latest files in artifacts/. If those files are missing, train the model first.
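app.py itself is the source of truth; as a rough sketch, a minimal Flask app of this shape could look like the following (the /translate endpoint and the placeholder translate_sentence are assumptions, not the real code):

```python
from flask import Flask, jsonify, render_template, request

app = Flask(__name__)

def translate_sentence(text):
    # Placeholder: the real app.py presumably loads artifacts/ (config.json,
    # vocab.json, model.pt) and runs greedy decoding with the trained model.
    raise NotImplementedError("load artifacts/ and run the model")

@app.route("/")
def index():
    # Serves templates/index.html, the main frontend page.
    return render_template("index.html")

@app.route("/translate", methods=["POST"])
def translate():
    text = (request.get_json(silent=True) or {}).get("text", "")
    return jsonify({"translation": translate_sentence(text)})

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000)
```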
Notes:
- `fra.txt` contains extra metadata columns; the trainer automatically uses only the first two columns.
- Start with `--max_samples 50000` for a manageable first run.
- After training on `fra.txt`, restart the Flask app so it picks up the new artifacts.
- Use `--device cuda` to force GPU training, `--device cpu` to force CPU, or leave the default `auto`.
- The included datasets are intentionally tiny, so translations are only for demonstration.
- For real performance, replace the sample dataset with a larger parallel corpus.
- The model uses greedy decoding for simplicity.
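For reference, greedy decoding picks the highest-probability token at each step and stops at the end-of-sequence token. A minimal sketch against the Encoder/Decoder interfaces sketched earlier, with assumed start/end token ids:

```python
import torch

def greedy_decode(decoder, encoder_state, sos_id, eos_id, max_len=50):
    # Greedy search: take the argmax token at each step (no beam, no sampling).
    token = torch.tensor([[sos_id]])   # shape (batch=1, seq=1)
    state = encoder_state              # (h, c) from the encoder
    output_ids = []
    for _ in range(max_len):
        logits, state = decoder(token, state)   # logits: (1, 1, vocab_size)
        token = logits.argmax(dim=-1)           # most likely next token
        if token.item() == eos_id:
            break
        output_ids.append(token.item())
    return output_ids
```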