<a href="https://colab.research.google.com/github/priyanshu2103/Sanskrit-Hindi-Machine-Translation/blob/main/Supervised_Statistical_MT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **```Supervised Statistical Machine Translation```**

This notebook runs the Moses SMT system, following the official documentation. Moses is used to translate Sanskrit to English, which is then translated to Hindi using Google Translate API.

NOTE: Moses training and tuning takes around 2-2.5 hrs to run 

In [3]:
# Downloading Moses Binaries directly, as it is complicated to install moses by source 
!wget http://www.statmt.org/moses/RELEASE-4.0/binaries/ubuntu-17.04.tgz
!tar -xvzf ubuntu-17.04.tgz
!rm -rf ubuntu-17.04.tgz

--2020-11-22 12:47:51--  http://www.statmt.org/moses/RELEASE-4.0/binaries/ubuntu-17.04.tgz
Resolving www.statmt.org (www.statmt.org)... 129.215.197.184
Connecting to www.statmt.org (www.statmt.org)|129.215.197.184|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 120382471 (115M) [application/x-gzip]
Saving to: ‘ubuntu-17.04.tgz’


2020-11-22 12:49:34 (1.19 MB/s) - ‘ubuntu-17.04.tgz’ saved [120382471/120382471]

ubuntu-17.04/
ubuntu-17.04/training-tools/
ubuntu-17.04/training-tools/mkcls
ubuntu-17.04/training-tools/snt2cooc
ubuntu-17.04/training-tools/merge_alignment.py
ubuntu-17.04/training-tools/mgiza
ubuntu-17.04/moses/
ubuntu-17.04/moses/scripts/
ubuntu-17.04/moses/scripts/analysis/
ubuntu-17.04/moses/scripts/analysis/sg2dot.perl
ubuntu-17.04/moses/scripts/analysis/smtgui/
ubuntu-17.04/moses/scripts/analysis/smtgui/newsmtgui.cgi
ubuntu-17.04/moses/scripts/analysis/smtgui/file-factors
ubuntu-17.04/moses/scripts/analysis/smtgui/Corpus.pm
ubuntu-17.04/moses/scri

This is the repo which contains the parallel data we are going to use.

In [4]:
# Cloning our repo
%cd /content
!git clone https://github.com/priyanshu2103/Sanskrit-Hindi-Machine-Translation.git

/content
Cloning into 'Sanskrit-Hindi-Machine-Translation'...
remote: Enumerating objects: 112, done.[K
remote: Counting objects: 100% (112/112), done.[K
remote: Compressing objects: 100% (97/97), done.[K
remote: Total 112 (delta 38), reused 58 (delta 12), pack-reused 0[K
Receiving objects: 100% (112/112), 9.25 MiB | 17.31 MiB/s, done.
Resolving deltas: 100% (38/38), done.


We are using indic-nlp library for sanskrit text tokenization.

In [5]:
# Install indic-nlp for tokenization
!pip install indic-nlp-library
!cp /content/Sanskrit-Hindi-Machine-Translation/indic_tokenize.py /usr/local/lib/python3.6/dist-packages/indicnlp/tokenize/indic_tokenize.py

Collecting indic-nlp-library
  Downloading https://files.pythonhosted.org/packages/2f/51/f4e4542a226055b73a621ad442c16ae2c913d6b497283c99cae7a9661e6c/indic_nlp_library-0.71-py3-none-any.whl
Collecting morfessor
  Downloading https://files.pythonhosted.org/packages/39/e6/7afea30be2ee4d29ce9de0fa53acbb033163615f849515c0b1956ad074ee/Morfessor-2.0.6-py3-none-any.whl
Installing collected packages: morfessor, indic-nlp-library
Successfully installed indic-nlp-library-0.71 morfessor-2.0.6


In [6]:
# Extracting the parallel data into lists
%cd /content/Sanskrit-Hindi-Machine-Translation/parallel-corpus/sanskrit-english/

eng_lines = []
sanskrit_lines = []

with open('bhagvadgita_english.txt', 'r') as f:
  eng_lines.extend([x.replace('\n', '') for x in f.readlines()])

with open('bible_english.txt', 'r') as f:
  eng_lines.extend([x.replace('\n', '') for x in f.readlines()])

with open('manu_english.txt', 'r') as f:
  eng_lines.extend([x.replace('\n', '') for x in f.readlines()])

with open('ramayan_english.txt', 'r') as f:
  eng_lines.extend([x.replace('\n', '') for x in f.readlines()])

with open('rigveda_english.txt', 'r') as f:
  eng_lines.extend([x.replace('\n', '') for x in f.readlines()])



with open('bhagvadgita_sanskrit.txt', 'r') as f:
  sanskrit_lines.extend([x.replace('\n', '') for x in f.readlines()])

with open('bible_sanskrit.txt', 'r') as f:
  sanskrit_lines.extend([x.replace('\n', '') for x in f.readlines()])

with open('manu_sanskrit.txt', 'r') as f:
  sanskrit_lines.extend([x.replace('\n', '') for x in f.readlines()])

with open('ramayan_sanskrit.txt', 'r') as f:
  sanskrit_lines.extend([x.replace('\n', '') for x in f.readlines()])

with open('rigveda_sanskrit.txt', 'r') as f:
  sanskrit_lines.extend([x.replace('\n', '') for x in f.readlines()])

# print(eng_lines[:100])
# print(sanskrit_lines[:100])

print(len(eng_lines))
print(len(sanskrit_lines))

/content/Sanskrit-Hindi-Machine-Translation/parallel-corpus/sanskrit-english
34374
34374


In [7]:
# Randomly shuffling the data into training and dev set, the test data is already provided on github
import random
c = list(zip(sanskrit_lines, eng_lines))
random.shuffle(c)

sanskrit_lines, eng_lines = zip(*c)

train_text_sa = sanskrit_lines[:-1374]
train_text_en = eng_lines[:-1374]

dev_text_sa = sanskrit_lines[-1374:]
dev_text_en = eng_lines[-1374:]

In [8]:
# Setting up the required files
%cd /content
!mkdir -p corpus
%cd corpus
!mkdir -p training
%cd training

with open('parallel.sa-en.sa', 'w') as f:
  for line in train_text_sa:
    f.write(line + '\n')

with open('parallel.sa-en.en', 'w') as f:
  for line in train_text_en:
    f.write(line + '\n')

!mkdir -p /content/corpus/dev
%cd /content/corpus/dev
with open('dev.sa-en.sa', 'w') as f:
  for line in dev_text_sa:
    f.write(line + '\n')

with open('dev.sa-en.en', 'w') as f:
  for line in dev_text_en:
    f.write(line + '\n')

/content
/content/corpus
/content/corpus/training
/content/corpus/dev


In [9]:
%cd /content

/content


Tokenize

In [10]:
!ubuntu-17.04/moses/scripts/tokenizer/tokenizer.perl -l en \
    < /content/corpus/training/parallel.sa-en.en    \
    > /content/corpus/parallel.sa-en.tok.en

!python /usr/local/lib/python3.6/dist-packages/indicnlp/tokenize/indic_tokenize.py /content/corpus/training/parallel.sa-en.sa /content/corpus/parallel.sa-en.tok.sa sa

Tokenizer Version 1.1
Language: en
Number of threads: 1


Truecase

We first need to learn the truecaser, then apply it

In [11]:
!ubuntu-17.04/moses/scripts/recaser/train-truecaser.perl \
     --model /content/corpus/truecase-model.en --corpus     \
     /content/corpus/parallel.sa-en.tok.en
!ubuntu-17.04/moses/scripts/recaser/train-truecaser.perl \
     --model /content/corpus/truecase-model.sa --corpus     \
     /content/corpus/parallel.sa-en.tok.sa

In [12]:
!ubuntu-17.04/moses/scripts/recaser/truecase.perl \
   --model /content/corpus/truecase-model.en         \
   < /content/corpus/parallel.sa-en.tok.en \
   > /content/corpus/parallel.sa-en.true.en
!ubuntu-17.04/moses/scripts/recaser/truecase.perl \
   --model /content/corpus/truecase-model.sa  \
   < /content/corpus/parallel.sa-en.tok.sa \
   > /content/corpus/parallel.sa-en.true.sa

Clean the data by filtering out sentences longer than 80 words and remove any blank lines

In [13]:
!/content/ubuntu-17.04/moses/scripts/training/clean-corpus-n.perl \
    /content/corpus/parallel.sa-en.true sa en \
    /content/corpus/parallel.sa-en.clean 1 80

clean-corpus.perl: processing /content/corpus/parallel.sa-en.true.sa & .en to /content/corpus/parallel.sa-en.clean, cutoff 1-80, ratio 9
...
Input sentences: 33000  Output sentences:  32621


Train the Language Model

In [14]:
!mkdir -p /content/lm
%cd /content/lm
!/content/ubuntu-17.04/moses/bin/lmplz -o 3 < /content/corpus/parallel.sa-en.true.en > parallel.sa-en.arpa.en

/content/lm
=== 1/5 Counting and sorting n-grams ===
Reading /content/corpus/parallel.sa-en.true.en
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
tcmalloc: large alloc 2514862080 bytes == 0x55adabb84000 @  0x7fa5756211e7 0x55adaade33f4 0x55adaae5f900 0x55adaae4e800 0x55adaada1110 0x7fa5744b8bf7 0x55adaada2aca
tcmalloc: large alloc 8382865408 bytes == 0x55ae419e0000 @  0x7fa5756211e7 0x55adaade33f4 0x55adaae3aa3e 0x55adaae3b3fe 0x55adaae4e817 0x55adaada1110 0x7fa5744b8bf7 0x55adaada2aca
****************************************************************************************************
Unigram tokens 817144 types 24559
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:294708 2:3799160832 3:7123426816
tcmalloc: large alloc 7123427328 bytes == 0x55adabb84000 @  0x7fa5756211e7 0x55adaade33f4 0x55adaae3aa3e 0x55adaae3b3fe 0x55adaae4ed5d 0x55adaada1110 0x7fa5744b8bf7 0x55adaada2aca
tcmalloc: large alloc 3799162880 bytes 

In [15]:
!/content/ubuntu-17.04/moses/bin/build_binary parallel.sa-en.arpa.en parallel.sa-en.blm.en

Reading parallel.sa-en.arpa.en
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS


In [16]:
# Simple sanity check to see if correct process uptil now
!echo "is this an English sentence ?"                       \
   | /content/ubuntu-17.04/moses/bin/query parallel.sa-en.blm.en

is=16 2 -2.9549935	this=6 2 -2.1058443	an=110 2 -3.5131161	English=0 1 -5.8624034	sentence=10004 1 -4.6240244	?=56 1 -2.667434	</s>=2 2 -0.37507707	Total: -22.102894 OOV: 1
Perplexity including OOVs:	1437.329242285302
Perplexity excluding OOVs:	509.03589279121127
OOVs:	1
Tokens:	7
Name:query	VmPeak:53840 kB	VmRSS:6228 kB	RSSMax:18236 kB	user:0.000998	sys:0.002995	CPU:0.003993	real:0.00166383


## Actual training using mgiza alignment tool.
### It will show that the cell has finished running instantly, however it is actually running in the background. Status can be seen at /working/training.out. Process is finished when the end of line in file says "(9) create moses.ini". Proceed further then

In [17]:
!mkdir -p /content/working
%cd /content/working
!nohup nice /content/ubuntu-17.04/moses/scripts/training/train-model.perl -root-dir train \
 -mgiza \
 -corpus /content/corpus/parallel.sa-en.clean \
 -f sa -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
 -lm 0:3:/content/lm/parallel.sa-en.blm.en:8                          \
 -cores 4 \
 -external-bin-dir /content/ubuntu-17.04/training-tools >& training.out &

/content/working


In [18]:
%cd /content/corpus
!/content/ubuntu-17.04/moses/scripts/tokenizer/tokenizer.perl -l en \
   < dev/dev.sa-en.en > dev.sa-en.tok.en
!python /usr/local/lib/python3.6/dist-packages/indicnlp/tokenize/indic_tokenize.py /content/corpus/dev/dev.sa-en.sa /content/corpus/dev.sa-en.tok.sa sa

!/content/ubuntu-17.04/moses/scripts/recaser/truecase.perl --model truecase-model.en \
   < dev.sa-en.tok.en > dev.sa-en.true.en
!/content/ubuntu-17.04/moses/scripts/recaser/truecase.perl --model truecase-model.sa \
   < dev.sa-en.tok.sa > dev.sa-en.true.sa

/content/corpus
Tokenizer Version 1.1
Language: en
Number of threads: 1


## Tuning
### It will show that the cell has finished running instantly, however it is actually running in the background. Status can be seen at /working/mert.out. Process is finished moses.ini file is formed inside /working/mert-work

In [19]:
%cd /content/working
!nohup nice /content/ubuntu-17.04/moses/scripts/training/mert-moses.pl \
  /content/corpus/dev.sa-en.true.sa /content/corpus/dev.sa-en.true.en \
  /content/ubuntu-17.04/moses/bin/moses train/model/moses.ini --mertdir /content/ubuntu-17.04/moses/bin/ \
  --decoder-flags="-threads 4" \
  &> mert.out &

/content/working


In [21]:
%cd /content/corpus
# !/content/ubuntu-17.04/moses/scripts/tokenizer/tokenizer.perl -l en \
#    < test/dev.sa-en.en > dev.sa-en.tok.en
!python /usr/local/lib/python3.6/dist-packages/indicnlp/tokenize/indic_tokenize.py /content/Sanskrit-Hindi-Machine-Translation/parallel-corpus/sanskrit-hindi/test_sanskrit.sa /content/corpus/test.tok.sa sa

# !/content/ubuntu-17.04/moses/scripts/recaser/truecase.perl --model truecase-model.en \
#    < dev.sa-en.tok.en > dev.sa-en.true.en
!/content/ubuntu-17.04/moses/scripts/recaser/truecase.perl --model truecase-model.sa \
   < test.tok.sa > test.true.sa

/content/corpus


In [22]:
!nohup nice /content/ubuntu-17.04/moses/bin/moses \
   -f /content/working/mert-work/moses.ini   \
   < /content/corpus/test.true.sa  \
   > /content/working/test.translated.en  \
   2> /content/working/test_result.out


# !/content/ubuntu-17.04/moses/scripts/generic/multi-bleu.perl \
#    -lc /content/corpus/dev.sa-en.true.en    \
#    < /content/working/test.translated.en

Google translate API to translate English to Hindi

In [23]:
!pip install googletrans
from googletrans import Translator
translator = Translator()

Collecting googletrans
  Downloading https://files.pythonhosted.org/packages/71/3a/3b19effdd4c03958b90f40fe01c93de6d5280e03843cc5adf6956bfc9512/googletrans-3.0.0.tar.gz
Collecting httpx==0.13.3
[?25l  Downloading https://files.pythonhosted.org/packages/54/b4/698b284c6aed4d7c2b4fe3ba5df1fcf6093612423797e76fbb24890dd22f/httpx-0.13.3-py3-none-any.whl (55kB)
[K     |████████████████████████████████| 61kB 5.0MB/s 
[?25hCollecting rfc3986<2,>=1.3
  Downloading https://files.pythonhosted.org/packages/78/be/7b8b99fd74ff5684225f50dd0e865393d2265656ef3b4ba9eaaaffe622b8/rfc3986-1.4.0-py2.py3-none-any.whl
Collecting hstspreload
[?25l  Downloading https://files.pythonhosted.org/packages/d3/3c/cdeaf9ab0404853e77c45d9e8021d0d2c01f70a1bb26e460090926fe2a5e/hstspreload-2020.11.21-py3-none-any.whl (981kB)
[K     |████████████████████████████████| 983kB 8.4MB/s 
[?25hCollecting httpcore==0.9.*
[?25l  Downloading https://files.pythonhosted.org/packages/dd/d5/e4ff9318693ac6101a2095e580908b591838c6f33

For translating multiple lines from file

In [24]:
with open('/content/working/test.translated.en', 'r') as f:
  test = [x.replace('\n', '') for x in f.readlines()]

translated = []
print(len(test))

for line in test:
  ans = translator.translate(line, dest='hi')
  translated.append(ans.text)

with open('translated_hindi.hi', 'w') as f:
  for line in translated:
    f.write(line + '\n')

1374


In [28]:
# calculate bleu score
import nltk
with open('/content/Sanskrit-Hindi-Machine-Translation/parallel-corpus/sanskrit-hindi/test_hindi.hi', 'r') as f: # ground truth file
  reference = [x.replace('\n', '') for x in f.readlines()]

with open('translated_hindi.hi', 'r') as f: # translated hindi file (our output)
  candidate = [x.replace('\n', '') for x in f.readlines()]

from nltk.translate.bleu_score import SmoothingFunction
smoothie = SmoothingFunction().method4

print("bleu", nltk.translate.bleu_score.corpus_bleu(reference, candidate, smoothing_function=smoothie))

bleu 0.2830206410946437


For translating single line present in /content/input.txt

In [None]:
!/content/ubuntu-17.04/moses/bin/moses -f /content/working/mert-work/moses.ini < /content/input.txt > /content/temp.txt
with open('/content/temp.txt', 'r') as f:
  line = f.readline
ans = translator.translate(line, dest='hi')
print(ans.text)