### Setting up the data

Downloading the global test set is simple. We set English as the source language and your chosen language as the target, find the intersection of the English global test set with the English side of the corpus, and then take the corresponding target sentences from the target side of the corpus.
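
The procedure described above can be sketched on toy data. This is only a minimal illustration, not the code used later in this notebook: the function name `split_by_test_set` and the sample sentences are made up for the example.

```python
def split_by_test_set(source_lines, target_lines, test_sentences):
    """Split line-aligned parallel data into training pairs and test pairs.

    A pair goes to the test split when its source sentence is in
    `test_sentences`; otherwise it stays in the training split.
    """
    train_pairs, test_pairs = [], []
    for src, tgt in zip(source_lines, target_lines):
        if src in test_sentences:
            test_pairs.append((src, tgt))
        else:
            train_pairs.append((src, tgt))
    return train_pairs, test_pairs


# Hypothetical toy data: three aligned sentence pairs.
english = ["Hello .", "Thank you .", "Good night ."]
target = ["tgt-1", "tgt-2", "tgt-3"]
global_test = {"Thank you ."}

train, test = split_by_test_set(english, target, global_test)
print(train)
print(test)
```

Because the test sentences are stored in a set, each membership check is fast even for a large corpus.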

In [1]:
%%capture
!pip install opustools-pkg

# SET THE LANGUAGE CODE and other variables.

You need to change the value below for your language!

The language codes from the [JW300 corpus website](https://opus.nlpl.eu/JW300.php) are: 
```
["ab", "ach", "ada", "ady", "aed", "af", "aha", "ajg", "alt", "alz", "am", "ami", "amu", "aoc", "ar", "arh", "arn", "as", "ase", "asf", "ati", "ay", "az", "az_Cyrl", "ba", "bas", "bbc", "bbj", "bci", "bcl", "bem", "bfi", "bg", "bhw", "bi", "bin", "bn", "btg", "bts", "btx", "bum", "bvl", "byv", "bzj", "bzs", "cab", "cac", "cak", "cat", "cce", "ceb", "chf", "chj", "chk", "chw", "cjk", "cmn_Hans", "cmn_Hant", "cnh", "crs", "cs", "cse", "csf", "csg", "csl", "csn", "csr", "cto", "ctu", "cuk", "cv", "cy", "da", "daf", "de", "dga", "dhv", "djk", "dtp", "dua", "dyu", "ecs", "ee", "efi", "el", "en", "es", "esn", "et", "eu", "ewo", "fa", "fan", "fat", "fcs", "fi", "fj", "fo", "fon", "fr", "fse", "fsl", "ga", "gaa", "gcf", "gcr", "gil", "gl", "gom", "gsg", "gsm", "gss", "gu", "guc", "gug", "gum", "gur", "guw", "gxx", "gym", "ha", "hch", "hds", "he", "hi", "hil", "hmn", "ho", "hr", "hrx", "hsh", "ht", "hu", "hus", "hy", "hy_arevmda", "hye_x_hma", "hye_x_hms", "hz", "iba", "ibg", "id", "ig", "ilo", "inl", "ins", "is", "ise", "ish", "iso", "it", "ja", "jiv", "jmx", "jsl", "jv", "jw_dgr", "jw_dmr", "jw_ibi", "jw_paa", "jw_qcs", "jw_rmg", "jw_rmv", "jw_spl", "jw_ssa", "jw_tpo", "jw_vlc", "jw_vz", "ka", "kab", "kac", "kam", "kbp", "kea", "kek", "kg", "ki", "kj", "kjh", "kk_Arab", "kk_Cyrl", "kl", "km", "kmb", "kmr", "kmr_Cyrl", "kmr_latn", "kn", "ko", "koo", "kqn", "krc", "kri", "kss", "ksw", "kvk", "kwn", "kwy", "ky", "lam", "lg", "ln", "lo", "loz", "lsp", "lt", "lu", "lua", "lue", "lun", "luo", "lus", "lv", "mam", "mau", "maz", "mco", "mcp", "men", "mfe", "mfs", "mg", "mgr", "mh", "mhr", "miq", "mk", "ml", "mn", "mos", "mr", "mrq", "mt", "mxv", "my", "mzy", "nba", "nch", "ncj", "ncs", "ncx", "nd", "ndc", "ne", "ng", "ngl", "ngu", "nhk", "nia", "nij", "niu", "nl", "nnh", "no", "nr", "nso", "nv", "ny", "nya", "nyk", "nyn", "nyu", "nzi", "oke", "om", "or", "os", "ote", "pa", "pag", "pap", "pbb", "pcm", "pdc", "pdt", "pid", "pis", "pl", "pnb", "pon", "prl", "pso", "psp", "psr", "pt", 
"pys", "qu", "quc", "que", "qug", "qus", "quw", "quy", "quz", "qvi", "qvz", "rar", "rcf", "rmc_sk", "rmn", "rmn_Cyrl", "rms", "rmy", "rmy_AR", "rnd", "ro", "rsl", "ru", "run", "rw", "sah", "sbs", "seh", "sfs", "sfw", "sg", "sgn_AO", "si", "sid", "sk", "sl", "sm", "sn", "sop", "sq", "sqk", "sr_Cyrl", "sr_Latn", "srm", "srn", "ss", "ssp", "st", "su", "sv", "svk", "sw", "swc", "sxn", "ta", "tcf", "tdt", "tdx", "te", "tg", "th", "ti", "tiv", "tk", "tk_Cyrl", "tl", "tll", "tn", "to", "tob", "tog", "toh", "toi", "toi_zw", "toj", "top", "tpi", "tr", "ts", "tsc", "tso_MZ", "tss", "tsz", "tt", "ttj", "tum", "tvl", "tw", "ty", "tyv", "tzh", "tzo", "udm", "ug_Cyrl", "uk", "umb", "ur", "urh", "uz_Cyrl", "uz_Latn", "ve", "vec", "vi", "vmw", "vsl", "wal", "war", "wba", "wes", "wes_ng", "wls", "wlv", "xh", "xmf", "xpe", "yao", "yap", "ybb", "yo", "yua", "yue_Hans", "yue_Hant", "zab", "zai", "zdj", "zib", "zlm", "zne", "zpa", "zpg", "zsl", "zu"]
```
Already-created test sets: https://github.com/juliakreutzer/masakhane/tree/master/jw300_utils/test

For this example, we pick `ab`, which is Abkhazian.

In [2]:
import os
source_language = "en"
target_language = "ab" # TODO: CHANGE THIS TO YOUR LANGUAGE! "ha" is hausa. See the language codes at https://opus.nlpl.eu/JW300.php
lc = False  # If True, lowercase the data.
seed = 42  # Random seed for shuffling.
tag = "baseline" # Give a unique name to your folder - this is to ensure you don't overwrite any models you've already submitted

os.environ["src"] = source_language # Sets them in bash as well, since we often use bash scripts
os.environ["tgt"] = target_language
os.environ["tag"] = tag

# No need to use gdrive since we are training on gcp
!mkdir -p "$src-$tgt-$tag"
os.environ["gdrive_path"] = "%s-%s-%s" % (source_language, target_language, tag) # saving directly on the vm

In [3]:
!echo $gdrive_path

en-ab-baseline


#### Downloading the corpus data

As a precaution, we first remove any old data.

In [4]:
!rm -f jw300.$src jw300.$tgt JW300_latest_xml_$src-$tgt.xml.gz JW300_latest_xml_$src-$tgt.xml JW300_latest_xml_$src.zip  JW300_latest_xml_$tgt.zip test.en-any.en

In [5]:
# Downloading our corpus
! opus_read -d JW300 -s $src -t $tgt -wm moses -w jw300.$src jw300.$tgt -q

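# Extract the alignment file. The download is named with the language codes in
# alphabetical order (here ab-en), so this command fails for en-ab, as the output
# below shows. The extracted XML is not used in the following steps.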
! gunzip JW300_latest_xml_$src-$tgt.xml.gz


Alignment file /proj/nlpl/data/OPUS/JW300/latest/xml/ab-en.xml.gz not found. The following files are available for downloading:

 312 KB https://object.pouta.csc.fi/OPUS-JW300/v1b/xml/ab-en.xml.gz
   3 MB https://object.pouta.csc.fi/OPUS-JW300/v1b/xml/ab.zip
 263 MB https://object.pouta.csc.fi/OPUS-JW300/v1b/xml/en.zip

 266 MB Total size
./JW300_latest_xml_ab-en.xml.gz ... 100% of 312 KB
./JW300_latest_xml_ab.zip ... 100% of 3 MB
./JW300_latest_xml_en.zip ... 100% of 263 MB
gzip: JW300_latest_xml_en-ab.xml.gz: No such file or directory
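
The `gzip` error above comes from the file name: the download log shows `JW300_latest_xml_ab-en.xml.gz`, while the command asked for `en-ab`. A minimal sketch for building the name, assuming the two codes always appear in sorted order (this matches the log above but is not confirmed here for every language pair); `alignment_filename` is a hypothetical helper.

```python
def alignment_filename(src, tgt):
    """Build the alignment file name with the two codes in sorted order.

    Assumption: the codes appear alphabetically sorted in the file name,
    as seen in the download log for the en/ab pair.
    """
    first, second = sorted([src, tgt])
    return f"JW300_latest_xml_{first}-{second}.xml.gz"


print(alignment_filename("en", "ab"))  # JW300_latest_xml_ab-en.xml.gz
```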


In [6]:
! wget https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-any.en
  
# Re-export the language codes so that later shell commands can use $src and $trg.
os.environ["trg"] = target_language 
os.environ["src"] = source_language 

--2021-06-24 19:43:27--  https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-any.en
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 277791 (271K) [text/plain]
Saving to: ‘test.en-any.en’


2021-06-24 19:43:28 (9.09 MB/s) - ‘test.en-any.en’ saved [277791/277791]



In [7]:
# Read the test data to filter from train and dev splits.
# Store english portion in set for quick filtering checks.
en_test_sents = set()
filter_test_sents = "test.en-any.en"
j = 0
with open(filter_test_sents) as f:
  for line in f:
    en_test_sents.add(line.strip())
    j += 1
print('Loaded {} global test sentences to filter from the training/dev data.'.format(j))

Loaded 3571 global test sentences to filter from the training/dev data.


In [8]:
!ls

en-ab-baseline	JW300_latest_xml_ab-en.xml.gz  sample_data
jw300.ab	JW300_latest_xml_ab.zip        test.en-any.en
jw300.en	JW300_latest_xml_en.zip


#### Building the corpus

In the 2 cells below you can check if the 2 datasets are aligned. Even if you don't speak the language you can get a sense, especially with similar words, punctuation, and so forth.

In [9]:
! head -5 jw300.en

Table of Contents
© 2016 Watch Tower Bible and Tract Society of Pennsylvania
Would the world be a better place if everyone lived by this Bible principle ?
“ We wish to conduct ourselves honestly in all things . ” ​ — Hebrews 13 : 18 .
This issue of The Watchtower discusses how honesty touches every aspect of our life .


In [10]:
! head -5 jw300.$tgt

Аҵакы
© 2016 Watch Tower Bible and Tract Society of Pennsylvania
Еиӷьхозма адунеи , ауаа зегь ари абиблиатә принцип иқәныҟәозҭгьы ?
« Ҳарҭ . . . иҟаҳҵо зегьы гәык - ԥсык ала , иаша - ҵабыргла иҟаҳҵоит » ( Ауриацәа рахь 13 : 18 ) .
Ари аброшиураҿы иануп аиашара ҳаԥсҭазаара ишаныруа . Аиашара .
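
Besides eyeballing the first lines, a quick structural check is that both files contain the same number of lines. The sketch below shows the idea on two small temporary files standing in for the real corpus files; `count_lines` is a hypothetical helper, not part of the notebook.

```python
import os
import tempfile


def count_lines(path):
    """Return the number of lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


# Demonstration on two small temporary files (toy stand-ins for the corpus files).
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "toy.en")
    tgt_path = os.path.join(tmp, "toy.tgt")
    with open(src_path, "w", encoding="utf-8") as f:
        f.write("one\ntwo\nthree\n")
    with open(tgt_path, "w", encoding="utf-8") as f:
        f.write("a\nb\nc\n")
    n_src, n_tgt = count_lines(src_path), count_lines(tgt_path)
    same_length = n_src == n_tgt
    print(n_src, n_tgt, same_length)
```

Equal line counts do not prove the sentences are translations of each other, but unequal counts are a clear sign that something went wrong.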


In [11]:
import pandas as pd

# Read the Moses-format files into a dataframe
source_file = 'jw300.' + source_language  # Source language is English.
target_file = 'jw300.' + target_language  # Target is whatever you set. In this example it is ab, so jw300.ab
target_test = {}
source = []
target = []
english_sentences_in_global_test_set = {}  # Collect the line numbers of the source portion to skip the same lines for the target portion.
with open(source_file) as src_f:
    for i, line in enumerate(src_f):
        # Skip sentences that are contained in the test set and add them to the new target-language test set
        if line.strip() not in en_test_sents:
            source.append(line.strip())
        else:
            # Here is the intersection with the global test set
            english_sentences_in_global_test_set[i] = line.strip()           
with open(target_file) as f:
    for j, line in enumerate(f):
        # Only add to corpus if corresponding source was not skipped.
        if j not in english_sentences_in_global_test_set.keys():
            target.append(line.strip())
        else:
            #Collecting the aligned test sentences
            target_test[j] = line.strip()
    
print('Loaded data and skipped {}/{} lines since contained in test set.'.format(len(english_sentences_in_global_test_set.keys()), i))
    
df = pd.DataFrame(zip(source, target), columns=['source_sentence', 'target_sentence'])
# If you get "TypeError: data argument can't be an iterator", your pandas version does not accept a zip object; run the line below instead.
#df = pd.DataFrame(list(zip(source, target)), columns=['source_sentence', 'target_sentence'])
df.head(10)

Loaded data and skipped 2984/28885 lines since contained in test set.


| | source_sentence | target_sentence |
|---|---|---|
| 0 | Why It Pays to Be Honest 6 | Иашарыла анхара заԥсоу 6 |
| 1 | The Bible Changes Lives | Абиблиа аԥсҭазаара аԥсахуеит |
| 2 | Did You Know ? | Ижәдыруама шәара ? |
| 3 | 10 | 10 |
| 4 | Ancient Wisdom for Modern Living | Ажәытәтәи аҟәыӷара иахьатәи аԥсҭазааразы |
| 5 | Do Not Be Anxious 15 | Шәазымхьаалан 15 |
| 6 | What Does the Bible Say ? | Иаҳәозеи Абиблиа ? |
| 7 | 16 | 16 |
| 8 | Can the Bible Help Me if I’m Depressed ? | |
| 9 | While reviewing a financial account with his s... | Афинанстә ҳасабырба ангәарҭоз , аусура аиҳабы ... |


## Check a random item!
Let's pick one of the keys in the dictionary at random and check it. 

In [12]:
import random
keys_in_target_test = list(target_test.keys())
print(type(keys_in_target_test))
random_key = random.choice(keys_in_target_test)
print(f"The random key we picked was {random_key}")

<class 'list'>
The random key we picked was 8583


In [13]:
target_test[random_key]

'б ) Ишԥарылшо аизара аиҳабацәа аишьцәа доуҳатә хьчаҩцәаны иҟаларц азы қәҿиарала разыҟаҵара ?'

In [14]:
english_sentences_in_global_test_set[random_key]

'( b ) How can the elders effectively train future shepherds of the congregation ?'

Do the two look like they line up? 
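
Checking a single pair can be misleading, so it may help to look at a few at once. The following is a sketch under stated assumptions: `sample_pairs` is a hypothetical helper, the toy dictionaries stand in for the two line-number dictionaries built earlier, and reusing a fixed seed here is a choice made for reproducibility in this example.

```python
import random


def sample_pairs(source_by_line, target_by_line, k=3, seed=42):
    """Return up to k (line_number, source, target) triples chosen at random.

    Only line numbers present in both dictionaries are considered.
    """
    shared = sorted(set(source_by_line) & set(target_by_line))
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    chosen = rng.sample(shared, min(k, len(shared)))
    return [(n, source_by_line[n], target_by_line[n]) for n in sorted(chosen)]


# Hypothetical toy dictionaries keyed by line number.
toy_en = {0: "Hello .", 5: "Thank you .", 9: "No ."}
toy_tgt = {0: "tgt-0", 5: "tgt-5", 9: "tgt-9"}
for line_no, en, tgt in sample_pairs(toy_en, toy_tgt, k=2):
    print(line_no, en, "|", tgt)
```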

## Check several rows at the tail end

Let's get a sample from the end of the dataset

In [15]:
target_test_set = pd.DataFrame(zip(target_test.values(), english_sentences_in_global_test_set.values()), columns=['target_equivalent', 'english_equivalent'])

In [16]:
target_test_set = target_test_set.reset_index()

In [17]:
target_test_set = target_test_set.set_index("index")

In [18]:
target_test_set.tail()

| index | target_equivalent | english_equivalent |
|---|---|---|
| 2979 | ( Шәахәаԥш астатиа алагамҭаҿы иаагоу асахьа . ) | ( See opening picture . ) |
| 2980 | Ажәамаанақәа 14 : 15 аҿы иануп : « Аԥышәа змам... | Proverbs 14 : 15 says : “ The naive person bel... |
| 2981 | б ) Еилҳаргозеи анаҩстәи астатиаҿы ? | ( b ) What will we discuss in the next article ? |
| 2982 | ( Шәаԥхьа 1 Тимофеи иахь 6 : 17 — 19 . ) | ( Read 1 Timothy 6 : 17 - 19 . ) |
| 2983 | Мап . | No . |


Next, we remove duplicates from the English and target columns.

In [19]:
target_test_set = target_test_set.drop_duplicates(subset='target_equivalent')

In [20]:
target_test_set = target_test_set.drop_duplicates(subset='english_equivalent')
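
The two `drop_duplicates` calls remove whole rows, so the two columns stay aligned. The plain-Python sketch below mirrors that behaviour, keeping the first occurrence as pandas does by default. `dedupe_pairs` and the toy data are hypothetical and only meant to show the effect.

```python
def dedupe_pairs(pairs):
    """Drop pairs whose target was seen before, then pairs whose English was seen before.

    `pairs` is a list of (target, english) tuples; the first occurrence is kept.
    """
    seen_target = set()
    by_target = []
    for tgt, en in pairs:
        if tgt not in seen_target:
            seen_target.add(tgt)
            by_target.append((tgt, en))

    seen_english = set()
    result = []
    for tgt, en in by_target:
        if en not in seen_english:
            seen_english.add(en)
            result.append((tgt, en))
    return result


toy = [("t1", "Yes ."), ("t1", "Yes !"), ("t2", "Yes ."), ("t3", "No .")]
print(dedupe_pairs(toy))  # [('t1', 'Yes .'), ('t3', 'No .')]
```

Note that a row is dropped if either side repeats, which is why the test set shrinks after these two steps.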

In [21]:
target_test_set.head()

| index | target_equivalent | english_equivalent |
|---|---|---|
| 0 | Аҵакы | Table of Contents |
| 1 | © 2016 Watch Tower Bible and Tract Society of ... | © 2016 Watch Tower Bible and Tract Society of ... |
| 2 | Еиӷьхозма адунеи , ауаа зегь ари абиблиатә при... | Would the world be a better place if everyone ... |
| 3 | « Ҳарҭ . . . иҟаҳҵо зегьы гәык - ԥсык ала , иа... | “ We wish to conduct ourselves honestly in all... |
| 4 | Ари аброшиураҿы иануп аиашара ҳаԥсҭазаара ишан... | This issue of The Watchtower discusses how hon... |


In [22]:
target_test_set.shape

(2516, 2)

In [23]:
target_test_set.loc[~target_test_set.english_equivalent.isin(en_test_sents)]

| index | target_equivalent | english_equivalent |
|---|---|---|


## Write out target-language test set file
In our example, we should have `test.ab-any.ab`, but it will be different for you if you picked a different code.

In [25]:
target_test_filename = f"test.{target_language}-any.{target_language}"
print(target_test_filename)

test.ab-any.ab


## Write out English-language test set file
In our example, we should have `test.en-ab.en`, but it will be different for you if you picked a different code.

**Make sure the data lines up in the two files!**
The first line of each file should be translations of each other.
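
One way to keep two files aligned is to write both from the same list of pairs in a single pass and then read them back to compare. This is a sketch, not the approach used in the cells of this notebook: `write_parallel`, `read_lines`, and the toy pairs are hypothetical, and unlike a `"\n".join(...)` write it ends each file with a trailing newline.

```python
import os
import tempfile


def write_parallel(pairs, src_path, tgt_path):
    """Write (source, target) pairs to two files, one sentence per line."""
    with open(src_path, "w", encoding="utf-8") as src_f, \
         open(tgt_path, "w", encoding="utf-8") as tgt_f:
        for src, tgt in pairs:
            src_f.write(src + "\n")
            tgt_f.write(tgt + "\n")


def read_lines(path):
    """Read a file back as a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


toy_pairs = [("Hello .", "tgt-1"), ("No .", "tgt-2")]
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "test.en")
    tgt_path = os.path.join(tmp, "test.tgt")
    write_parallel(toy_pairs, src_path, tgt_path)
    round_trip = list(zip(read_lines(src_path), read_lines(tgt_path)))

print(round_trip == toy_pairs)
```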


In [26]:

with open(target_test_filename, "w") as test_tgt_any_tgt:
    test_tgt_any_tgt.write("\n".join(target_test_set.target_equivalent))

In [27]:
!head -5 test.$tgt-any.$tgt

Аҵакы
© 2016 Watch Tower Bible and Tract Society of Pennsylvania
Еиӷьхозма адунеи , ауаа зегь ари абиблиатә принцип иқәныҟәозҭгьы ?
« Ҳарҭ . . . иҟаҳҵо зегьы гәык - ԥсык ала , иаша - ҵабыргла иҟаҳҵоит » ( Ауриацәа рахь 13 : 18 ) .
Ари аброшиураҿы иануп аиашара ҳаԥсҭазаара ишаныруа . Аиашара .


In [28]:
source_test_filename = f"test.en-{target_language}.en"
print(f"saving english aligned sentences to {source_test_filename}")
with open(source_test_filename, "w") as test_en_tgt_en:
    test_en_tgt_en.write("\n".join(target_test_set.english_equivalent))
!ls -al

saving english aligned sentences to test.en-ab.en
total 272920
drwxr-xr-x 1 root root      4096 Jun 24 19:46 .
drwxr-xr-x 1 root root      4096 Jun 24 19:36 ..
drwxr-xr-x 4 root root      4096 Jun 15 13:37 .config
drwxr-xr-x 2 root root      4096 Jun 24 19:42 en-ab-baseline
-rw-r--r-- 1 root root   4044520 Jun 24 19:43 jw300.ab
-rw-r--r-- 1 root root   2329929 Jun 24 19:43 jw300.en
-rw-r--r-- 1 root root    318785 Jun 24 19:42 JW300_latest_xml_ab-en.xml.gz
-rw-r--r-- 1 root root   2596928 Jun 24 19:43 JW300_latest_xml_ab.zip
-rw-r--r-- 1 root root 269378154 Jun 24 19:43 JW300_latest_xml_en.zip
drwxr-xr-x 1 root root      4096 Jun 15 13:37 sample_data
-rw-r--r-- 1 root root    300741 Jun 24 19:46 test.ab-any.ab
-rw-r--r-- 1 root root    185786 Jun 24 19:46 test.en-ab.en
-rw-r--r-- 1 root root    277791 Jun 24 19:43 test.en-any.en


In [29]:
!head -5 test.en-$tgt.en

Table of Contents
© 2016 Watch Tower Bible and Tract Society of Pennsylvania
Would the world be a better place if everyone lived by this Bible principle ?
“ We wish to conduct ourselves honestly in all things . ” ​ — Hebrews 13 : 18 .
This issue of The Watchtower discusses how honesty touches every aspect of our life .


## One last check to see if the two files are aligned

Let's just get one more sample! Let's take from the end this time

In [30]:
!echo "test.en-$tgt.en"
!tail -5 test.en-$tgt.en
!echo
!echo "**********************"
!echo "test.$tgt-any.$tgt"
!echo "**********************"
!tail -5 test.$tgt-any.$tgt

test.en-ab.en
The sword of the spirit ( See paragraphs 19 - 20 )
With Jehovah’s help , we can stand firm against him !
( Read Hebrews 11 : 24 - 27 . )
( Read Luke 10 : 29 - 37 . )
Proverbs 14 : 15 says : “ The naive person believes every word , but the shrewd one ponders each step . ”
**********************
test.ab-any.ab
**********************
Адоуҳатә бџьар , аҳәа ( Шәрыхәаԥш абзацқәа 19 — 20 . )
Иегова ицхыраарала ҳара иҳалшоит уи иҿагылара !
( Шәаԥхьа Ауриацәа рахь 11 : 24 — 27 . )
( Шәаԥхьа Лука 10 : 29 — 37 . )
Ажәамаанақәа 14 : 15 аҿы иануп : « Аԥышәа змам иарбан ажәазаалакгьы агәра игоит , аилкаара змоу ауаҩы ишьаҿақәа заа дрызхәыцуеит » .