<a href="https://colab.research.google.com/github/priyanshu2103/Sanskrit-Hindi-Machine-Translation/blob/main/Supervised_Statistical_MT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Supervised Statistical Machine Translation

This notebook trains and runs the Moses statistical machine translation (SMT) system, following the official Moses documentation. Moses translates Sanskrit to English, and the English output is then translated to Hindi with the Google Translate API (through the `googletrans` package).

In [1]:
# Download the prebuilt Moses binaries, since building Moses from source is complicated
!wget http://www.statmt.org/moses/RELEASE-4.0/binaries/ubuntu-17.04.tgz
!tar -xvzf ubuntu-17.04.tgz
!rm -rf ubuntu-17.04.tgz

--2020-11-20 10:04:15--  http://www.statmt.org/moses/RELEASE-4.0/binaries/ubuntu-17.04.tgz
Resolving www.statmt.org (www.statmt.org)... 129.215.197.184
Connecting to www.statmt.org (www.statmt.org)|129.215.197.184|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 120382471 (115M) [application/x-gzip]
Saving to: ‘ubuntu-17.04.tgz’


2020-11-20 10:05:53 (1.18 MB/s) - ‘ubuntu-17.04.tgz’ saved [120382471/120382471]

ubuntu-17.04/
ubuntu-17.04/training-tools/
ubuntu-17.04/training-tools/mkcls
ubuntu-17.04/training-tools/snt2cooc
ubuntu-17.04/training-tools/merge_alignment.py
ubuntu-17.04/training-tools/mgiza
ubuntu-17.04/moses/
ubuntu-17.04/moses/scripts/
ubuntu-17.04/moses/scripts/analysis/
ubuntu-17.04/moses/scripts/analysis/sg2dot.perl
ubuntu-17.04/moses/scripts/analysis/smtgui/
ubuntu-17.04/moses/scripts/analysis/smtgui/newsmtgui.cgi
ubuntu-17.04/moses/scripts/analysis/smtgui/file-factors
ubuntu-17.04/moses/scripts/analysis/smtgui/Corpus.pm
ubuntu-17.04/moses/scri

The following repository contains the parallel Sanskrit-English data used in this notebook.

In [20]:
# Cloning our repo
%cd /content
!git clone https://github.com/priyanshu2103/Sanskrit-Hindi-Machine-Translation.git

/content
Cloning into 'Sanskrit-Hindi-Machine-Translation'...
remote: Enumerating objects: 61, done.
remote: Counting objects: 100% (61/61), done.
remote: Compressing objects: 100% (48/48), done.
remote: Total 61 (delta 13), reused 55 (delta 10), pack-reused 0
Unpacking objects: 100% (61/61), done.


We use the indic-nlp library to tokenize the Sanskrit text.

In [21]:
# Install indic-nlp for tokenization
!pip install indic-nlp-library
!cp /content/Sanskrit-Hindi-Machine-Translation/indic_tokenize.py /usr/local/lib/python3.6/dist-packages/indicnlp/tokenize/indic_tokenize.py



In [6]:
# Extracting the parallel data into lists
%cd /content/Sanskrit-Hindi-Machine-Translation/parallel-corpus/sanskrit-english/

# Each corpus has an English file and a line-aligned Sanskrit file
corpora = ['bhagvadgita', 'bible', 'manu', 'ramayan', 'rigveda']

def read_lines(path):
  # Read a UTF-8 text file and strip the newline from each line
  with open(path, 'r', encoding='utf-8') as f:
    return [x.replace('\n', '') for x in f.readlines()]

eng_lines = []
sanskrit_lines = []
for name in corpora:
  eng_lines.extend(read_lines(name + '_english.txt'))
  sanskrit_lines.extend(read_lines(name + '_sanskrit.txt'))

# print(eng_lines[:100])
# print(sanskrit_lines[:100])

# The two counts must match for the corpus to be line-aligned
print(len(eng_lines))
print(len(sanskrit_lines))

/content
/content/corpus
/content/corpus/training


In [None]:
# Shuffle the sentence pairs and split them into training and dev sets; the test data is already provided on GitHub
import random

DEV_SIZE = 1374  # number of sentence pairs held out for the dev set

c = list(zip(sanskrit_lines, eng_lines))
random.shuffle(c)

sanskrit_lines, eng_lines = zip(*c)

train_text_sa = sanskrit_lines[:-DEV_SIZE]
train_text_en = eng_lines[:-DEV_SIZE]

dev_text_sa = sanskrit_lines[-DEV_SIZE:]
dev_text_en = eng_lines[-DEV_SIZE:]
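
The shuffle above is unseeded, so each run produces a different train/dev split, which makes scores hard to compare across runs. A minimal sketch of a reproducible split, assuming a fixed seed is acceptable; the helper name `split_parallel`, the seed value, and the toy data are illustrative and not part of the original notebook:

```python
import random

def split_parallel(src_lines, tgt_lines, dev_size, seed=0):
  """Shuffle aligned sentence pairs with a fixed seed and hold out a dev set."""
  if len(src_lines) != len(tgt_lines):
    raise ValueError("source and target must have the same number of lines")
  pairs = list(zip(src_lines, tgt_lines))
  random.Random(seed).shuffle(pairs)  # local RNG, so the split is repeatable
  return pairs[:-dev_size], pairs[-dev_size:]

# Toy data standing in for sanskrit_lines / eng_lines
src = ["sa%d" % k for k in range(10)]
tgt = ["en%d" % k for k in range(10)]
train_pairs, dev_pairs = split_parallel(src, tgt, dev_size=3)
print(len(train_pairs), len(dev_pairs))
```

Using a local `random.Random(seed)` instance keeps the split stable without changing the global random state used elsewhere in the notebook.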

In [None]:
# Setting up the required files
%cd /content
!mkdir -p corpus
%cd corpus
!mkdir -p training
%cd training

with open('parallel.sa-en.sa', 'w') as f:
  for line in train_text_sa:
    f.write(line + '\n')

with open('parallel.sa-en.en', 'w') as f:
  for line in train_text_en:
    f.write(line + '\n')

!mkdir -p /content/corpus/dev
%cd /content/corpus/dev
with open('dev.sa-en.sa', 'w') as f:
  for line in dev_text_sa:
    f.write(line + '\n')

with open('dev.sa-en.en', 'w') as f:
  for line in dev_text_en:
    f.write(line + '\n')

In [16]:
%cd /content

/content


In [22]:
!ubuntu-17.04/moses/scripts/tokenizer/tokenizer.perl -l en \
    < /content/corpus/training/parallel.sa-en.en    \
    > /content/corpus/parallel.sa-en.tok.en

!python /usr/local/lib/python3.6/dist-packages/indicnlp/tokenize/indic_tokenize.py /content/corpus/training/parallel.sa-en.sa /content/corpus/parallel.sa-en.tok.sa sa

Tokenizer Version 1.1
Language: en
Number of threads: 1


In [23]:
!ubuntu-17.04/moses/scripts/recaser/train-truecaser.perl \
     --model /content/corpus/truecase-model.en --corpus     \
     /content/corpus/parallel.sa-en.tok.en
!ubuntu-17.04/moses/scripts/recaser/train-truecaser.perl \
     --model /content/corpus/truecase-model.sa --corpus     \
     /content/corpus/parallel.sa-en.tok.sa

In [24]:
!ubuntu-17.04/moses/scripts/recaser/truecase.perl \
   --model /content/corpus/truecase-model.en         \
   < /content/corpus/parallel.sa-en.tok.en \
   > /content/corpus/parallel.sa-en.true.en
!ubuntu-17.04/moses/scripts/recaser/truecase.perl \
   --model /content/corpus/truecase-model.sa  \
   < /content/corpus/parallel.sa-en.tok.sa \
   > /content/corpus/parallel.sa-en.true.sa

In [25]:
!/content/ubuntu-17.04/moses/scripts/training/clean-corpus-n.perl \
    /content/corpus/parallel.sa-en.true sa en \
    /content/corpus/parallel.sa-en.clean 1 300

clean-corpus.perl: processing /content/corpus/parallel.sa-en.true.sa & .en to /content/corpus/parallel.sa-en.clean, cutoff 1-300, ratio 9
..
Input sentences: 26887  Output sentences:  26723
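
The log above reports a length cutoff of 1-300 tokens and a length ratio of 9, which removed 164 of the 26887 sentence pairs. The following is a simplified Python sketch of this kind of filter, not the exact logic of `clean-corpus-n.perl`; the function name `keep_pair` and the toy pairs are assumptions made for illustration:

```python
def keep_pair(src, tgt, min_len=1, max_len=300, max_ratio=9):
  """Return True if a sentence pair passes the length and ratio limits."""
  n_src, n_tgt = len(src.split()), len(tgt.split())
  if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
    return False
  # Reject pairs whose token counts differ by more than max_ratio in either direction
  return n_src <= max_ratio * n_tgt and n_tgt <= max_ratio * n_src

pairs = [
  ("a b c", "x y"),             # kept
  ("", "x y"),                  # dropped: empty source side
  ("a", " ".join(["x"] * 20)),  # dropped: length ratio 20 exceeds 9
]
kept = [p for p in pairs if keep_pair(*p)]
print(len(pairs), len(kept))
```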


In [26]:
!mkdir -p /content/lm
%cd /content/lm
!/content/ubuntu-17.04/moses/bin/lmplz -o 3 < /content/corpus/parallel.sa-en.true.en > parallel.sa-en.arpa.en

/content/lm
=== 1/5 Counting and sorting n-grams ===
Reading /content/corpus/parallel.sa-en.true.en
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
tcmalloc: large alloc 2514862080 bytes == 0x559b41154000 @  0x7f470dade1e7 0x559b3fba73f4 0x559b3fc23900 0x559b3fc12800 0x559b3fb65110 0x7f470c975bf7 0x559b3fb66aca
tcmalloc: large alloc 8382865408 bytes == 0x559bd6fb0000 @  0x7f470dade1e7 0x559b3fba73f4 0x559b3fbfea3e 0x559b3fbff3fe 0x559b3fc12817 0x559b3fb65110 0x7f470c975bf7 0x559b3fb66aca
****************************************************************************************************
Unigram tokens 665185 types 20605
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:247260 2:3799177216 3:7123457536
tcmalloc: large alloc 7123460096 bytes == 0x559b41154000 @  0x7f470dade1e7 0x559b3fba73f4 0x559b3fbfea3e 0x559b3fbff3fe 0x559b3fc12d5d 0x559b3fb65110 0x7f470c975bf7 0x559b3fb66aca
tcmalloc: large alloc 3799179264 bytes 

In [27]:
!/content/ubuntu-17.04/moses/bin/build_binary parallel.sa-en.arpa.en parallel.sa-en.blm.en

Reading parallel.sa-en.arpa.en
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS


In [28]:
!echo "is this an English sentence ?"                       \
   | /content/ubuntu-17.04/moses/bin/query parallel.sa-en.blm.en

is=37 1 -6.564294	this=69 2 -1.904331	an=621 2 -3.3780825	English=0 1 -5.777322	sentence=481 1 -4.7360168	?=0 1 -5.5146914	</s>=2 1 -5.204671	Total: -33.079407 OOV: 2
Perplexity including OOVs:	53165.45488211808
Perplexity excluding OOVs:	22776.066071282476
OOVs:	2
Tokens:	7
Name:query	VmPeak:50728 kB	VmRSS:6156 kB	RSSMax:18236 kB	user:0.002063	sys:0.002063	CPU:0.004126	real:0.00145599
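
The perplexity figures in this output can be recovered from the per-token scores. The sketch below assumes that each token is printed as `word=vocabulary_id ngram_length log10_probability` and that vocabulary id 0 marks an out-of-vocabulary (OOV) word, which is consistent with the `OOV: 2` count above. Perplexity is then 10 raised to the negative mean log10 probability:

```python
# (word, vocabulary id, log10 probability) copied from the query output above
scores = [
  ("is", 37, -6.564294),
  ("this", 69, -1.904331),
  ("an", 621, -3.3780825),
  ("English", 0, -5.777322),
  ("sentence", 481, -4.7360168),
  ("?", 0, -5.5146914),
  ("</s>", 2, -5.204671),
]

def perplexity(log10_probs):
  """Perplexity = 10 ** (-mean log10 probability)."""
  return 10 ** (-sum(log10_probs) / len(log10_probs))

ppl_incl = perplexity([lp for _, _, lp in scores])
ppl_excl = perplexity([lp for _, vid, lp in scores if vid != 0])
print(round(ppl_incl, 1), round(ppl_excl, 1))
```

Both values should agree with the reported perplexities up to rounding of the printed log probabilities.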


In [35]:
!mkdir -p /content/working
%cd /content/working
!nohup nice /content/ubuntu-17.04/moses/scripts/training/train-model.perl -root-dir train \
 -mgiza \
 -corpus /content/corpus/parallel.sa-en.clean \
 -f sa -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
 -lm 0:3:/content/lm/parallel.sa-en.blm.en:8                          \
 -cores 4 \
 -external-bin-dir /content/ubuntu-17.04/training-tools >& training.out &

/content/working


In [37]:
!mkdir -p /content/corpus/dev
!cp /content/working/drive/MyDrive/sans-eng\ parallel\ corpus/bible_english.txt /content/corpus/dev/dev_text.en
!cp /content/working/drive/MyDrive/sans-eng\ parallel\ corpus/bible_sanskrit.txt /content/corpus/dev/dev_text.sa

In [41]:
%cd /content/corpus
!/content/ubuntu-17.04/moses/scripts/tokenizer/tokenizer.perl -l en \
   < dev/dev_text.en > dev_text.tok.en
!python /usr/local/lib/python3.6/dist-packages/indicnlp/tokenize/indic_tokenize.py /content/corpus/dev/dev_text.sa /content/corpus/dev_text.tok.sa sa

!/content/ubuntu-17.04/moses/scripts/recaser/truecase.perl --model truecase-model.en \
   < dev_text.tok.en > dev_text.true.en
!/content/ubuntu-17.04/moses/scripts/recaser/truecase.perl --model truecase-model.sa \
   < dev_text.tok.sa > dev_text.true.sa

/content/corpus
Tokenizer Version 1.1
Language: en
Number of threads: 1


In [42]:
%cd /content/working
!nohup nice /content/ubuntu-17.04/moses/scripts/training/mert-moses.pl \
  /content/corpus/dev_text.true.sa /content/corpus/dev_text.true.en \
  /content/ubuntu-17.04/moses/bin/moses train/model/moses.ini --mertdir /content/ubuntu-17.04/moses/bin/ \
  --decoder-flags="-threads 4" \
  &> mert.out &

/content/working


In [43]:
!pip install googletrans

Collecting googletrans
  Downloading https://files.pythonhosted.org/packages/71/3a/3b19effdd4c03958b90f40fe01c93de6d5280e03843cc5adf6956bfc9512/googletrans-3.0.0.tar.gz
Collecting httpx==0.13.3
  Downloading https://files.pythonhosted.org/packages/54/b4/698b284c6aed4d7c2b4fe3ba5df1fcf6093612423797e76fbb24890dd22f/httpx-0.13.3-py3-none-any.whl (55kB)
     |████████████████████████████████| 61kB 3.0MB/s 
Collecting hstspreload
  Downloading https://files.pythonhosted.org/packages/fe/f4/a290dfbc9cfebf7957c63e8caca5b80caddfd1d7721821a3e8b40a399923/hstspreload-2020.10.20-py3-none-any.whl (972kB)
     |████████████████████████████████| 972kB 6.3MB/s 
Collecting rfc3986<2,>=1.3
  Downloading https://files.pythonhosted.org/packages/78/be/7b8b99fd74ff5684225f50dd0e865393d2265656ef3b4ba9eaaaffe622b8/rfc3986-1.4.0-py2.py3-none-any.whl
Collecting sniffio
  Downloading https://files.pythonhosted.org/packages/52/b0/7b2e028b63d092804b6794595871f936aafa5e9322dcaaad50ebf67

In [48]:
!/content/ubuntu-17.04/moses/bin/moses -f /content/working/mert-work/moses.ini

Defined parameters (per moses.ini or switch):
	config: /content/working/mert-work/moses.ini 
	distortion-limit: 6 
	feature: UnknownWordPenalty WordPenalty PhrasePenalty PhraseDictionaryMemory name=TranslationModel0 num-features=4 path=/content/working/train/model/phrase-table.gz input-factor=0 output-factor=0 LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=/content/working/train/model/reordering-table.wbe-msd-bidirectional-fe.gz Distortion KENLM name=LM0 factor=0 path=/content/lm/parallel.sa-en.blm.en order=3 
	input-factors: 0 
	mapping: 0 T 0 
	threads: 48 
	weight: LexicalReordering0= 0.0864738 0.00939731 0.0217044 0.0602748 0.0552573 0.00817419 Distortion0= 0.00273384 LM0= 0.0214872 WordPenalty0= -0.188945 PhrasePenalty0= -0.470726 TranslationModel0= 0.00428964 0.00768241 -0.0339488 0.0289053 UnknownWordPenalty0= 1 
line=UnknownWordPenalty
FeatureFunction: UnknownWordPenalty0 start: 0 end: 0
line=Word

In [50]:
%cd /content/corpus
!/content/ubuntu-17.04/moses/scripts/tokenizer/tokenizer.perl -l en \
   < test/dev.sa-en.en > dev.sa-en.tok.en
!python /usr/local/lib/python3.6/dist-packages/indicnlp/tokenize/indic_tokenize.py /content/corpus/test/dev.sa-en.sa /content/corpus/dev.sa-en.tok.sa sa

!/content/ubuntu-17.04/moses/scripts/recaser/truecase.perl --model truecase-model.en \
   < dev.sa-en.tok.en > dev.sa-en.true.en
!/content/ubuntu-17.04/moses/scripts/recaser/truecase.perl --model truecase-model.sa \
   < dev.sa-en.tok.sa > dev.sa-en.true.sa

/content/corpus
Tokenizer Version 1.1
Language: en
Number of threads: 1


In [51]:
!nohup nice /content/ubuntu-17.04/moses/bin/moses \
   -f /content/working/mert-work/moses.ini   \
   < /content/corpus/dev.sa-en.true.sa  \
   > /content/working/test.translated.en  \
   2> /content/working/test_result.out

!/content/ubuntu-17.04/moses/scripts/generic/multi-bleu.perl \
   -lc /content/corpus/dev.sa-en.true.en    \
   < /content/working/test.translated.en

BLEU = 3.15, 14.2/3.7/2.0/1.4 (BP=0.908, ratio=0.912, hyp_len=19465, ref_len=21335)
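
The BLEU line above can be read as a brevity penalty (BP) multiplied by the geometric mean of the four n-gram precisions. A small sketch that recombines the printed components; because the printed precisions are rounded to one decimal place, the result is close to, but not exactly, the reported score:

```python
import math

# Values copied from the multi-bleu.perl output above
precisions = [14.2, 3.7, 2.0, 1.4]  # 1- to 4-gram precision, in percent
hyp_len, ref_len = 19465, 21335

# Brevity penalty: penalizes hypotheses that are shorter than the reference
bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)

# Geometric mean of the n-gram precisions
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))

bleu = bp * geo_mean
print(round(bp, 3), round(bleu, 2))
```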


In [66]:
from googletrans import Translator
translator = Translator()

In [61]:
with open('/content/working/test.translated.en', 'r') as f:
  test = [x.replace('\n', '') for x in f.readlines()]

# Translate each English line to Hindi; failed requests are recorded as 'NULL'
translated = []
print(len(test))
i = 0  # number of lines translated successfully so far
for line in test:
  try:
    ans = translator.translate(line, dest='hi')
    i += 1
    translated.append(ans.text)
  except Exception:
    print("done", i)
    translated.append('NULL')

870
done 152


In [59]:
print(ans)

Translated(src=en, dest=hi, text=start _ then the पर्व्वतमारुह्येश्वरमुद्दिश्य प्रार्थयमानः the night at the words of the entire यापितवान् _ the son of the lord of the earth _ END , pronunciation=start _ then the पर्व्वतमारुह्येश्वरमुद्दिश्य प्रार्थयमानः the night at the words of the entire यापितवान् _ the son of the lord of the earth _ END , extra_data="{'translat...")


In [60]:
print(ans.origin)
print(ans.text)

start _ then the पर्व्वतमारुह्येश्वरमुद्दिश्य प्रार्थयमानः the night at the words of the entire यापितवान् _ the son of the lord of the earth _ END 
start _ then the पर्व्वतमारुह्येश्वरमुद्दिश्य प्रार्थयमानः the night at the words of the entire यापितवान् _ the son of the lord of the earth _ END 
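
In the output above, `ans.origin` and `ans.text` are identical, so the request returned without actually translating the line. A minimal sketch of a check for such cases, assuming that an output equal to its input (or the `'NULL'` placeholder used in the loop) counts as a failure; the helper name `is_untranslated` and the toy data are illustrative only:

```python
def is_untranslated(source, translated):
  """Heuristic: output equal to the input, or the 'NULL' placeholder, is a failure."""
  return translated == 'NULL' or source.strip() == translated.strip()

# Toy data standing in for the `test` and `translated` lists
sources = ["the son of the lord", "he went home", "good morning"]
outputs = ["the son of the lord", "NULL", "शुभ प्रभात"]

failed = [k for k, (s, t) in enumerate(zip(sources, outputs)) if is_untranslated(s, t)]
print(len(failed), "of", len(sources), "lines were not translated:", failed)
```

Counting these lines before scoring gives a rough idea of how much of the Hindi output is still English or mixed text.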


In [None]:
from nltk.translate.bleu_score import corpus_bleu

def bleu(pairs, translate):
  # pairs: list of (source sentence, reference sentence) strings
  # translate: callable that maps a source sentence to a list of output words
  candidate = []
  reference = []
  for source, target in pairs:
      # print('>', source)
      # print('=', target)
      output_words = translate(source)
      output_sentence = ' '.join(output_words)
      # corpus_bleu expects a list of references for each hypothesis
      reference.append([target.split()])
      candidate.append(output_sentence.split())
      # print('<', output_sentence)
      # print('')
  print("bleu", corpus_bleu(reference, candidate))

In [71]:
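with open('/content/corpus/dev.sa-en.true.en', 'r') as f:
  test = [x.replace('\n', '') for x in f.readlines()]

# Translate the English references to Hindi to obtain Hindi ground truth;
# this trial run stops after 10 successful requests
hindi_gt = []
print(len(test))
i = 0  # number of lines translated successfully so far
for line in test:
  try:
    ans = translator.translate(line, dest='hi')
    i += 1
    hindi_gt.append(ans.text)
    print(ans.text)
    if i == 10:
      break
  except Exception:
    print("done", i)
    hindi_gt.append('NULL')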

1374
he sendeth out his voice and many loving friends of him the highly lauded hasten with their songs .
&apos;O victorious lord he who does not repay the help got through your grace and your brother &apos;s is despicable among men .
who sit as deities in heaven above the skyvaults luminous sphere .
&quot; just as a cow fond of its calf begins to distil milk from its teats at the sight of the calf , so does my heart melt at the sight of this excellent jewel .
quickly he becomes righteous-souled ( minded ) and attains peace permanently . o son of kunti ! i swear that my devotee gets never lost .
so will she shine on days to come immortal she moves on in her own strength undecaying .
&apos;O rama vali will experience sorrow and become pale bereft of me even in heaven just as you are filled with sorrow even on a delightful mountain slope bereft of princess of videha .
and from thence to Philippi , which is the chief city of that part of Macedonia , and a colony : and we were in that city 

In [63]:
print(test[:100])

['he sendeth out his voice and many loving friends of him the highly lauded hasten with their songs .', '&apos;O victorious lord he who does not repay the help got through your grace and your brother &apos;s is despicable among men .', 'who sit as deities in heaven above the skyvaults luminous sphere .', '&quot; just as a cow fond of its calf begins to distil milk from its teats at the sight of the calf , so does my heart melt at the sight of this excellent jewel .', 'quickly he becomes righteous-souled ( minded ) and attains peace permanently . o son of kunti ! i swear that my devotee gets never lost .', 'so will she shine on days to come immortal she moves on in her own strength undecaying .', '&apos;O rama vali will experience sorrow and become pale bereft of me even in heaven just as you are filled with sorrow even on a delightful mountain slope bereft of princess of videha .', 'and from thence to Philippi , which is the chief city of that part of Macedonia , and a colony : and we 

In [67]:
ans = translator.translate(test[0], dest='hi')

In [70]:
print(ans.text)

he sendeth out his voice and many loving friends of him the highly lauded hasten with their songs .


In [65]:
!pip uninstall googletrans
!git clone https://github.com/BoseCorp/py-googletrans.git
%cd ./py-googletrans
!python setup.py install

Uninstalling googletrans-3.0.0:
  Would remove:
    /usr/local/bin/translate
    /usr/local/lib/python3.6/dist-packages/googletrans-3.0.0.dist-info/*
    /usr/local/lib/python3.6/dist-packages/googletrans/*
Proceed (y/n)? y
  Successfully uninstalled googletrans-3.0.0
Cloning into 'py-googletrans'...
remote: Enumerating objects: 431, done.
remote: Total 431 (delta 0), reused 0 (delta 0), pack-reused 431
Receiving objects: 100% (431/431), 104.76 KiB | 4.19 MiB/s, done.
Resolving deltas: 100% (254/254), done.
/content/corpus/py-googletrans
running install
running bdist_egg
running egg_info
creating googletrans.egg-info
writing googletrans.egg-info/PKG-INFO
writing dependency_links to googletrans.egg-info/dependency_links.txt
writing requirements to googletrans.egg-info/requires.txt
writing top-level names to googletrans.egg-info/top_level.txt
writing manifest file 'googletrans.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'googletrans.egg-info/