### Setting up the data

Building the French global test set is simple. We set English and French as the source and target languages, find the intersection of the English global test set with the English corpus, and then take the corresponding French sentences from the French corpus.
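The idea can be sketched on toy data before touching the real corpus. This is a minimal illustration, not the notebook's actual code: the sentences and the helper name `split_by_test_set` are made up for the example, and it assumes the two corpora are aligned line by line.

```python
# Toy data standing in for the real files (hypothetical sentences).
english_corpus = ["Why not ?", "Hello .", "No .", "Good morning ."]
french_corpus = ["Et pourquoi pas ?", "Bonjour .", "Non .", "Bonjour ."]
english_test = {"Why not ?", "No ."}


def split_by_test_set(src_lines, tgt_lines, test_sentences):
    """Split aligned sentence pairs into training pairs and test pairs.

    A pair goes to the test side when its source sentence is in the
    global test set; otherwise it stays in the training data.
    """
    train_pairs, test_pairs = [], []
    for src, tgt in zip(src_lines, tgt_lines):
        pair = (src.strip(), tgt.strip())
        if pair[0] in test_sentences:
            test_pairs.append(pair)
        else:
            train_pairs.append(pair)
    return train_pairs, test_pairs


train_pairs, test_pairs = split_by_test_set(english_corpus, french_corpus, english_test)
print(test_pairs)
```

The French side of `test_pairs` is what ends up in the French test file.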

In [1]:
import os
source_language = "en"
target_language = "fr"  # fr is the language code for French
lc = False  # If True, lowercase the data.
seed = 42  # Random seed for shuffling.
tag = "baseline" # Give a unique name to your folder - this is to ensure you don't overwrite any models you've already submitted

os.environ["src"] = source_language # Sets them in bash as well, since we often use bash scripts
os.environ["tgt"] = target_language
os.environ["tag"] = tag

# No need to use gdrive since we are training on gcp
!mkdir -p "$src-$tgt-$tag"
os.environ["gdrive_path"] = "%s-%s-%s" % (source_language, target_language, tag) # saving directly on the vm

In [None]:
!echo $gdrive_path

#### Downloading the corpus data

As a precaution, I remove any old data first.

In [4]:
!rm -f jw300.$src jw300.$tgt JW300_latest_xml_$src-$tgt.xml.gz JW300_latest_xml_$src-$tgt.xml JW300_latest_xml_$src.zip  JW300_latest_xml_$tgt.zip

In [5]:
# Downloading our corpus
! opus_read -d JW300 -s $src -t $tgt -wm moses -w jw300.$src jw300.$tgt -q

# extract the corpus file
! gunzip JW300_latest_xml_$src-$tgt.xml.gz


Alignment file /proj/nlpl/data/OPUS/JW300/latest/xml/en-fr.xml.gz not found. The following files are available for downloading:

  21 MB https://object.pouta.csc.fi/OPUS-JW300/v1/xml/en-fr.xml.gz
 263 MB https://object.pouta.csc.fi/OPUS-JW300/v1/xml/en.zip
 278 MB https://object.pouta.csc.fi/OPUS-JW300/v1/xml/fr.zip

 563 MB Total size
./JW300_latest_xml_en-fr.xml.gz ... 100% of 21 MB
./JW300_latest_xml_en.zip ... 100% of 263 MB
./JW300_latest_xml_fr.zip ... 100% of 278 MB


In [6]:
! wget https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-any.en
  
# Export the language codes to bash again (this cell uses "trg"; earlier cells use "tgt").
os.environ["trg"] = target_language
os.environ["src"] = source_language

--2020-02-09 18:47:58--  https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-any.en
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.192.133, 151.101.128.133, 151.101.64.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.192.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 277791 (271K) [text/plain]
Saving to: ‘test.en-any.en.2’


2020-02-09 18:47:58 (6.40 MB/s) - ‘test.en-any.en.2’ saved [277791/277791]



In [7]:
# Read the test data to filter from train and dev splits.
# Store english portion in set for quick filtering checks.
en_test_sents = set()
filter_test_sents = "test.en-any.en"
j = 0
with open(filter_test_sents) as f:
  for line in f:
    en_test_sents.add(line.strip())
    j += 1
print('Loaded {} global test sentences to filter from the training/dev data.'.format(j))

Loaded 3571 global test sentences to filter from the training/dev data.


In [8]:
!ls

JW300_latest_xml_en-fr.xml			 jw300.fr
JW300_latest_xml_en.zip				 jw300.swc
JW300_latest_xml_fr-swc.xml			 test.en
JW300_latest_xml_fr.zip				 test.en-any.en
JW300_latest_xml_swc.zip			 test.en-any.en.1
baseline.ipynb					 test.en-any.en.2
buiding_french_english_test_set.ipynb		 test.en-swc.en
buiding_french_english_test_set_approach1.ipynb  test.en-swc.swc
en-fr-baseline					 test.fr-swc.fr
fr-swc-baseline					 test.fr-swc.swc
jw300.en					 test.swc


#### Building the corpus

For readers who know French: the two cells below check that the two datasets are aligned.

In [9]:
! head -5 jw300.en

“ A Good Word for the Witnesses ”
THE preaching activity of Jehovah’s witnesses is growing very rapidly .
This has required a large expansion of facilities at their international headquarters in Brooklyn , New York .
The expansion is arousing much comment in the community , even prompting a sermon at the Plymouth Church ( Congregational ) , located just two blocks away .
More than a century ago , the church’s first minister , Henry Ward Beecher , lived on property that is now part of the Watchtower Society’s headquarters complex .


In [10]:
! head -5 jw300.fr

“ Éloge des Témoins ”
L’ŒUVRE de prédication des témoins de Jéhovah s’étend rapidement .
Cette extension a exigé l’agrandissement de leur siège principal situé à Brooklyn , New York .
Pareille expansion suscite de nombreux commentaires dans la localité et a même été l’objet d’un prêche prononcé dans le temple Plymouth ( de l’Église congrégationaliste ) , situé à deux pâtés de maisons du siège des témoins de Jéhovah .
Il y a plus d’un siècle , le premier pasteur de ce temple , Henry Ward Beecher , habitait une maison qui fait partie aujourd’hui de l’ensemble des bâtiments appartenant à la Société Watchtower .
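Eyeballing the first five lines only covers the start of the files. A cheap complementary check is that both files have the same number of lines, since the rest of the notebook pairs sentences by line number. The sketch below is an assumption-laden illustration: `count_lines` and `check_same_length` are hypothetical helpers, and the demo runs on small temporary files rather than the real corpus.

```python
import os
import tempfile


def count_lines(path):
    """Count the lines of a text file without loading it into memory."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_same_length(src_path, tgt_path):
    """Return (lengths_match, n_src, n_tgt) for two line-aligned files."""
    n_src, n_tgt = count_lines(src_path), count_lines(tgt_path)
    return n_src == n_tgt, n_src, n_tgt


# Demo on two tiny temporary files (toy stand-ins for jw300.en / jw300.fr).
with tempfile.TemporaryDirectory() as tmp:
    en_path = os.path.join(tmp, "toy.en")
    fr_path = os.path.join(tmp, "toy.fr")
    with open(en_path, "w", encoding="utf-8") as f:
        f.write("Why not ?\nNo .\n")
    with open(fr_path, "w", encoding="utf-8") as f:
        f.write("Et pourquoi pas ?\nNon .\n")
    same, n_en, n_fr = check_same_length(en_path, fr_path)

print(same, n_en, n_fr)
```

On the real data the call would be `check_same_length("jw300.en", "jw300.fr")`. Equal line counts do not prove the sentences are correct translations, only that line-number pairing is possible.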


In [11]:
import pandas as pd

# Moses-format text files to dataframe
source_file = 'jw300.' + source_language  # Source language is English
target_file = 'jw300.' + target_language  # Target language is French
french_test = {}
source = []
target = []
english_sentences_in_global_test_set = {}  # Collect the line numbers of the source portion to skip the same lines for the target portion.
with open(source_file) as src_f:
    for i, line in enumerate(src_f):
        # Skip sentences that are contained in the test set and add them to the new French test set
        if line.strip() not in en_test_sents:
            source.append(line.strip())
        else:
            # TODOS : Here is the intersection with the global test set
            english_sentences_in_global_test_set[i] = line.strip()           
with open(target_file) as f:
    for j, line in enumerate(f):
        # Only add to corpus if corresponding source was not skipped.
        if j not in english_sentences_in_global_test_set.keys():
            target.append(line.strip())
        else:
            #TODOS : Collecting the aligned test sentences
            french_test[j] = line.strip()
    
print('Loaded data and skipped {}/{} lines since contained in test set.'.format(len(english_sentences_in_global_test_set.keys()), i))
    
df = pd.DataFrame(zip(source, target), columns=['source_sentence', 'target_sentence'])
# If you get "TypeError: data argument can't be an iterator", your pandas version does not accept a zip object; use the line below instead.
#df = pd.DataFrame(list(zip(source, target)), columns=['source_sentence', 'target_sentence'])
df.head(10)

Loaded data and skipped 10612/2304442 lines since contained in test set.


Unnamed: 0,source_sentence,target_sentence
0,“ A Good Word for the Witnesses ”,“ Éloge des Témoins ”
1,THE preaching activity of Jehovah’s witnesses ...,L’ŒUVRE de prédication des témoins de Jéhovah ...
2,This has required a large expansion of facilit...,Cette extension a exigé l’agrandissement de le...
3,The expansion is arousing much comment in the ...,Pareille expansion suscite de nombreux comment...
4,"More than a century ago , the church’s first m...","Il y a plus d’un siècle , le premier pasteur d..."
5,The sermon was delivered by Dr . Harry H . Kru...,"Le prêche en question , prononcé par le pasteu..."
6,At the outset he declared : “ I have to say th...,L’orateur commença par déclarer : “ Je dois di...
7,"However , he then commented : “ But I can stil...",Il poursuivit toutefois en disant : “ Mais je ...
8,He said :,Il déclara alors :
9,I admire the Witnesses for talking about their...,“ J’admire les témoins parce qu’ils parlent de...


In [12]:
french_test[6794]

'Et pourquoi pas ?'

In [13]:
english_sentences_in_global_test_set[6794]

'Why not ?'

In [15]:
french_test_set = pd.DataFrame(zip(french_test.values(), english_sentences_in_global_test_set.values()), columns=['french_equivalent', 'english_equivalent'])

In [16]:
french_test_set = french_test_set.reset_index()

In [21]:
french_test_set = french_test_set.set_index("index")

In [22]:
french_test_set.tail()

index,french_equivalent,english_equivalent
10607,"» Mais un mois plus tard , ils ont reçu une no...","But then , a month later , they received thril..."
10608,Míriam explique : « On nous a proposé d’être p...,Miriam says : “ We were invited to serve as sp...
10609,Quelle joie de pouvoir rester dans notre terri...,What a joy to be able to stay in our assignmen...
10610,Ils ont fait confiance à la promesse de Psaume...,They trusted in the promise found at Psalm 37 ...
10611,"Aujourd’hui , on voit bien que oui , et on ne ...","Today we do , and we lack nothing of real impo..."


Removing duplicates from the English and French sets

In [24]:
french_test_set = french_test_set.drop_duplicates(subset='french_equivalent')

In [26]:
french_test_set = french_test_set.drop_duplicates(subset='english_equivalent')

In [27]:
french_test_set.head()

index,french_equivalent,english_equivalent
0,Et pourquoi pas ?,Why not ?
2,Non .,No .
5,Je vais lui faire une aide qui soit son complé...,"I am going to make a helper for him , as a com..."
6,Mais un autre rouleau fut ouvert ; c’est le ro...,But another scroll was opened ; it is the scro...
20,Comment le savons - ​ nous ?,How do we know ?


In [28]:
french_test_set.shape

(3332, 2)
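The two `drop_duplicates` calls keep the first row for each distinct French sentence, and then, among the surviving rows, the first row for each distinct English sentence. The pandas-free sketch below mimics that behaviour on toy pairs to show which rows survive; `drop_duplicates_by` is a hypothetical helper written for this illustration, and the sentences are made up.

```python
def drop_duplicates_by(pairs, key_index):
    """Keep the first pair for each distinct value at position key_index."""
    seen = set()
    kept = []
    for pair in pairs:
        key = pair[key_index]
        if key not in seen:
            seen.add(key)
            kept.append(pair)
    return kept


# (french, english) toy pairs with one repeated sentence on each side.
pairs = [
    ("Et pourquoi pas ?", "Why not ?"),
    ("Pourquoi pas ?", "Why not ?"),  # repeated English sentence
    ("Non .", "No ."),
    ("Non .", "No !"),                # repeated French sentence
]

# First pass on the French side (index 0), second pass on the English side (index 1).
deduped = drop_duplicates_by(drop_duplicates_by(pairs, 0), 1)
print(deduped)
```

Because only the first occurrence is kept, a sentence with several valid translations ends up with a single one in the test set.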

In [31]:
french_test_set.loc[~french_test_set.english_equivalent.isin(en_test_sents)]

index,french_equivalent,english_equivalent


In [32]:
with open("test.fr-any.fr", "w") as test_fr_any_fr:
    test_fr_any_fr.write("\n".join(french_test_set.french_equivalent))

In [34]:
!head -5 test.fr-any.fr

Et pourquoi pas ?
Non .
Je vais lui faire une aide qui soit son complément . ”
Mais un autre rouleau fut ouvert ; c’est le rouleau de vie .
Comment le savons - ​ nous ?
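After deduplication the French file has 3332 lines, while `test.en-any.en` was loaded with 3571 lines, so the two files cannot be matched line by line. One way to keep a usable parallel test set is to write the English side of the same rows next to the French one. This is a minimal sketch on toy data, not part of the notebook: `write_parallel` and the toy file names are illustrative assumptions.

```python
import os
import tempfile


def write_parallel(pairs, fr_path, en_path):
    """Write (french, english) pairs to two files, one sentence per line."""
    with open(fr_path, "w", encoding="utf-8") as fr_f, \
         open(en_path, "w", encoding="utf-8") as en_f:
        for fr_sentence, en_sentence in pairs:
            fr_f.write(fr_sentence + "\n")
            en_f.write(en_sentence + "\n")


toy_pairs = [("Et pourquoi pas ?", "Why not ?"), ("Non .", "No .")]

# Write to a temporary directory and read both files back.
with tempfile.TemporaryDirectory() as tmp:
    fr_path = os.path.join(tmp, "toy.fr")
    en_path = os.path.join(tmp, "toy.en")
    write_parallel(toy_pairs, fr_path, en_path)
    with open(fr_path, encoding="utf-8") as f:
        fr_lines = f.read().splitlines()
    with open(en_path, encoding="utf-8") as f:
        en_lines = f.read().splitlines()

print(len(fr_lines) == len(en_lines))
```

With the dataframe from this notebook, the pairs could come from `zip(french_test_set.french_equivalent, french_test_set.english_equivalent)`.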
