<a href="https://colab.research.google.com/github/priyanshu2103/Sanskrit-Hindi-Machine-Translation/blob/main/Supervised_Statistical_MT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **```Supervised Statistical Machine Translation```**

This notebook runs the Moses SMT system, following the official documentation. Moses is used to translate Sanskrit to English, which is then translated to Hindi using Google Translate API.

NOTE: Moses training and tuning takes around 2-2.5 hrs to run 

In [1]:
# Downloading Moses Binaries directly, as it is complicated to install moses by source 
!wget http://www.statmt.org/moses/RELEASE-4.0/binaries/ubuntu-17.04.tgz
!tar -xvzf ubuntu-17.04.tgz
!rm -rf ubuntu-17.04.tgz

--2020-11-20 10:04:15--  http://www.statmt.org/moses/RELEASE-4.0/binaries/ubuntu-17.04.tgz
Resolving www.statmt.org (www.statmt.org)... 129.215.197.184
Connecting to www.statmt.org (www.statmt.org)|129.215.197.184|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 120382471 (115M) [application/x-gzip]
Saving to: ‘ubuntu-17.04.tgz’


2020-11-20 10:05:53 (1.18 MB/s) - ‘ubuntu-17.04.tgz’ saved [120382471/120382471]

ubuntu-17.04/
ubuntu-17.04/training-tools/
ubuntu-17.04/training-tools/mkcls
ubuntu-17.04/training-tools/snt2cooc
ubuntu-17.04/training-tools/merge_alignment.py
ubuntu-17.04/training-tools/mgiza
ubuntu-17.04/moses/
ubuntu-17.04/moses/scripts/
ubuntu-17.04/moses/scripts/analysis/
ubuntu-17.04/moses/scripts/analysis/sg2dot.perl
ubuntu-17.04/moses/scripts/analysis/smtgui/
ubuntu-17.04/moses/scripts/analysis/smtgui/newsmtgui.cgi
ubuntu-17.04/moses/scripts/analysis/smtgui/file-factors
ubuntu-17.04/moses/scripts/analysis/smtgui/Corpus.pm
ubuntu-17.04/moses/scri

This is the repo which contains the parallel data we are going to use.

In [20]:
# Cloning our repo
%cd /content
!git clone https://github.com/priyanshu2103/Sanskrit-Hindi-Machine-Translation.git

/content
Cloning into 'Sanskrit-Hindi-Machine-Translation'...
remote: Enumerating objects: 61, done.[K
remote: Counting objects: 100% (61/61), done.[K
remote: Compressing objects: 100% (48/48), done.[K
remote: Total 61 (delta 13), reused 55 (delta 10), pack-reused 0[K
Unpacking objects: 100% (61/61), done.


We are using indic-nlp library for sanskrit text tokenization.

In [21]:
# Install indic-nlp for tokenization
!pip install indic-nlp-library
!cp /content/Sanskrit-Hindi-Machine-Translation/indic_tokenize.py /usr/local/lib/python3.6/dist-packages/indicnlp/tokenize/indic_tokenize.py



In [6]:
# Extracting the parallel data into lists
%cd /content/Sanskrit-Hindi-Machine-Translation/parallel-corpus/sanskrit-english/

eng_lines = []
sanskrit_lines = []

with open('bhagvadgita_english.txt', 'r') as f:
  eng_lines.extend([x.replace('\n', '') for x in f.readlines()])

with open('bible_english.txt', 'r') as f:
  eng_lines.extend([x.replace('\n', '') for x in f.readlines()])

with open('manu_english.txt', 'r') as f:
  eng_lines.extend([x.replace('\n', '') for x in f.readlines()])

with open('ramayan_english.txt', 'r') as f:
  eng_lines.extend([x.replace('\n', '') for x in f.readlines()])

with open('rigveda_english.txt', 'r') as f:
  eng_lines.extend([x.replace('\n', '') for x in f.readlines()])



with open('bhagvadgita_sanskrit.txt', 'r') as f:
  sanskrit_lines.extend([x.replace('\n', '') for x in f.readlines()])

with open('bible_sanskrit.txt', 'r') as f:
  sanskrit_lines.extend([x.replace('\n', '') for x in f.readlines()])

with open('manu_sanskrit.txt', 'r') as f:
  sanskrit_lines.extend([x.replace('\n', '') for x in f.readlines()])

with open('ramayan_sanskrit.txt', 'r') as f:
  sanskrit_lines.extend([x.replace('\n', '') for x in f.readlines()])

with open('rigveda_sanskrit.txt', 'r') as f:
  sanskrit_lines.extend([x.replace('\n', '') for x in f.readlines()])

# print(eng_lines[:100])
# print(sanskrit_lines[:100])

print(len(eng_lines))
print(len(sanskrit_lines))

/content
/content/corpus
/content/corpus/training


In [None]:
# Randomly shuffling the data into training and dev set, the test data is already provided on github
import random
c = list(zip(sanskrit_lines, eng_lines))
random.shuffle(c)

sanskrit_lines, eng_lines = zip(*c)

train_text_sa = sanskrit_lines[:-1374]
train_text_en = eng_lines[:-1374]

dev_text_sa = sanskrit_lines[-1374:]
dev_text_en = eng_lines[-1374:]

In [None]:
# Setting up the required files
%cd /content
!mkdir -p corpus
%cd corpus
!mkdir -p training
%cd training

with open('parallel.sa-en.sa', 'w') as f:
  for line in train_text_sa:
    f.write(line + '\n')

with open('parallel.sa-en.en', 'w') as f:
  for line in train_text_en:
    f.write(line + '\n')

!mkdir -p /content/corpus/dev
%cd /content/corpus/dev
with open('dev.sa-en.sa', 'w') as f:
  for line in dev_text_sa:
    f.write(line + '\n')

with open('dev.sa-en.en', 'w') as f:
  for line in dev_text_en:
    f.write(line + '\n')

In [16]:
%cd /content

/content


Tokenize

In [22]:
!ubuntu-17.04/moses/scripts/tokenizer/tokenizer.perl -l en \
    < /content/corpus/training/parallel.sa-en.en    \
    > /content/corpus/parallel.sa-en.tok.en

!python /usr/local/lib/python3.6/dist-packages/indicnlp/tokenize/indic_tokenize.py /content/corpus/training/parallel.sa-en.sa /content/corpus/parallel.sa-en.tok.sa sa

Tokenizer Version 1.1
Language: en
Number of threads: 1


Truecase

We first need to learn the truecaser, then apply it

In [23]:
!ubuntu-17.04/moses/scripts/recaser/train-truecaser.perl \
     --model /content/corpus/truecase-model.en --corpus     \
     /content/corpus/parallel.sa-en.tok.en
!ubuntu-17.04/moses/scripts/recaser/train-truecaser.perl \
     --model /content/corpus/truecase-model.sa --corpus     \
     /content/corpus/parallel.sa-en.tok.sa

In [24]:
!ubuntu-17.04/moses/scripts/recaser/truecase.perl \
   --model /content/corpus/truecase-model.en         \
   < /content/corpus/parallel.sa-en.tok.en \
   > /content/corpus/parallel.sa-en.true.en
!ubuntu-17.04/moses/scripts/recaser/truecase.perl \
   --model /content/corpus/truecase-model.sa  \
   < /content/corpus/parallel.sa-en.tok.sa \
   > /content/corpus/parallel.sa-en.true.sa

Clean the data by filtering out sentences longer than 80 words and remove any blank lines

In [None]:
!/content/ubuntu-17.04/moses/scripts/training/clean-corpus-n.perl \
    /content/corpus/parallel.sa-en.true sa en \
    /content/corpus/parallel.sa-en.clean 1 80

Train the Language Model

In [26]:
!mkdir -p /content/lm
%cd /content/lm
!/content/ubuntu-17.04/moses/bin/lmplz -o 3 < /content/corpus/parallel.sa-en.true.en > parallel.sa-en.arpa.en

/content/lm
=== 1/5 Counting and sorting n-grams ===
Reading /content/corpus/parallel.sa-en.true.en
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
tcmalloc: large alloc 2514862080 bytes == 0x559b41154000 @  0x7f470dade1e7 0x559b3fba73f4 0x559b3fc23900 0x559b3fc12800 0x559b3fb65110 0x7f470c975bf7 0x559b3fb66aca
tcmalloc: large alloc 8382865408 bytes == 0x559bd6fb0000 @  0x7f470dade1e7 0x559b3fba73f4 0x559b3fbfea3e 0x559b3fbff3fe 0x559b3fc12817 0x559b3fb65110 0x7f470c975bf7 0x559b3fb66aca
****************************************************************************************************
Unigram tokens 665185 types 20605
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:247260 2:3799177216 3:7123457536
tcmalloc: large alloc 7123460096 bytes == 0x559b41154000 @  0x7f470dade1e7 0x559b3fba73f4 0x559b3fbfea3e 0x559b3fbff3fe 0x559b3fc12d5d 0x559b3fb65110 0x7f470c975bf7 0x559b3fb66aca
tcmalloc: large alloc 3799179264 bytes 

In [27]:
!/content/ubuntu-17.04/moses/bin/build_binary parallel.sa-en.arpa.en parallel.sa-en.blm.en

Reading parallel.sa-en.arpa.en
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS


In [28]:
# Simple sanity check to see if correct process uptil now
!echo "is this an English sentence ?"                       \
   | /content/ubuntu-17.04/moses/bin/query parallel.sa-en.blm.en

is=37 1 -6.564294	this=69 2 -1.904331	an=621 2 -3.3780825	English=0 1 -5.777322	sentence=481 1 -4.7360168	?=0 1 -5.5146914	</s>=2 1 -5.204671	Total: -33.079407 OOV: 2
Perplexity including OOVs:	53165.45488211808
Perplexity excluding OOVs:	22776.066071282476
OOVs:	2
Tokens:	7
Name:query	VmPeak:50728 kB	VmRSS:6156 kB	RSSMax:18236 kB	user:0.002063	sys:0.002063	CPU:0.004126	real:0.00145599


Actual training using mgiza alignment tool

In [35]:
!mkdir -p /content/working
%cd /content/working
!nohup nice /content/ubuntu-17.04/moses/scripts/training/train-model.perl -root-dir train \
 -mgiza \
 -corpus /content/corpus/parallel.sa-en.clean \
 -f sa -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
 -lm 0:3:/content/lm/parallel.sa-en.blm.en:8                          \
 -cores 4 \
 -external-bin-dir /content/ubuntu-17.04/training-tools >& training.out &

/content/working


In [41]:
%cd /content/corpus
!/content/ubuntu-17.04/moses/scripts/tokenizer/tokenizer.perl -l en \
   < dev/dev.sa-en.en > dev.sa-en.tok.en
!python /usr/local/lib/python3.6/dist-packages/indicnlp/tokenize/indic_tokenize.py /content/corpus/dev/dev.sa-en.sa /content/corpus/dev.sa-en.tok.sa sa

!/content/ubuntu-17.04/moses/scripts/recaser/truecase.perl --model truecase-model.en \
   < dev.sa-en.tok.en > dev.sa-en.true.en
!/content/ubuntu-17.04/moses/scripts/recaser/truecase.perl --model truecase-model.sa \
   < dev.sa-en.tok.sa > dev.sa-en.true.sa

/content/corpus
Tokenizer Version 1.1
Language: en
Number of threads: 1


In [42]:
%cd /content/working
!nohup nice /content/ubuntu-17.04/moses/scripts/training/mert-moses.pl \
  /content/corpus/dev.sa-en.true.sa /content/corpus/dev.sa-en.true.en \
  /content/ubuntu-17.04/moses/bin/moses train/model/moses.ini --mertdir /content/ubuntu-17.04/moses/bin/ \
  --decoder-flags="-threads 4" \
  &> mert.out &

/content/working


In [None]:
%cd /content/corpus
# !/content/ubuntu-17.04/moses/scripts/tokenizer/tokenizer.perl -l en \
#    < test/dev.sa-en.en > dev.sa-en.tok.en
!python /usr/local/lib/python3.6/dist-packages/indicnlp/tokenize/indic_tokenize.py /content/Sanskrit-Hindi-Machine-Translation/parallel-corpus/sanskrit-hindi/test_sanskrit.sa /content/corpus/test.tok.sa sa

# !/content/ubuntu-17.04/moses/scripts/recaser/truecase.perl --model truecase-model.en \
#    < dev.sa-en.tok.en > dev.sa-en.true.en
!/content/ubuntu-17.04/moses/scripts/recaser/truecase.perl --model truecase-model.sa \
   < test.tok.sa > test.true.sa

In [None]:
!nohup nice /content/ubuntu-17.04/moses/bin/moses \
   -f /content/working/mert-work/moses.ini   \
   < /content/corpus/test.true.sa  \
   > /content/working/test.translated.en  \
   2> /content/working/test_result.out


# !/content/ubuntu-17.04/moses/scripts/generic/multi-bleu.perl \
#    -lc /content/corpus/dev.sa-en.true.en    \
#    < /content/working/test.translated.en

Google translate API to translate English to Hindi

In [66]:
!pip install googletrans
from googletrans import Translator
translator = Translator()

For translating multiple lines from file

In [None]:
with open('/content/working/test.translated.en', 'r') as f:
  test = [x.replace('\n', '') for x in f.readlines()]

translated = []
print(len(test))

for line in test:
  ans = translator.translate(line, dest='hi')
  translated.append(ans.text)

with open('translated_hindi.hi', 'w') as f:
  for line in translated:
    f.write(line + '\n')

In [2]:
# calculate bleu score
import nltk
with open('/content/Sanskrit-Hindi-Machine-Translation/parallel-corpus/sanskrit-hindi/test_hindi.hi', 'r') as f: # ground truth file
  reference = [x.replace('\n', '') for x in f.readlines()]

with open('translated_hindi.hi', 'r') as f: # translated hindi file (our output)
  candidate = [x.replace('\n', '') for x in f.readlines()]

from nltk.translate.bleu_score import SmoothingFunction
smoothie = SmoothingFunction().method4

print("bleu", nltk.translate.bleu_score.corpus_bleu(reference, candidate, smoothing_function=smoothie))

bleu 0.28475005173770523


For translating single line present in /content/input.txt

In [None]:
!/content/ubuntu-17.04/moses/bin/moses -f /content/working/mert-work/moses.ini < /content/input.txt > /content/temp.txt
with open('/content/temp.txt', 'r') as f:
  line = f.readline
ans = translator.translate(line, dest='hi')
print(ans.text)