In [1]:
import os
import pandas as pd

# Loading

## Load EDP

In [2]:
fr_files = sorted(os.listdir("EDP/WMT/all/FR"))
en_files = sorted(os.listdir("EDP/WMT/all/EN"))
assert fr_files == en_files

lines = []
en_path = "EDP/WMT/all/EN"
fr_path = "EDP/WMT/all/FR"
fnames = en_files
for fname in fnames:
    with open(os.path.join(en_path, fname), "r") as f_en, open(os.path.join(fr_path, fname), "r") as f_fr:
        for en_line, fr_line in zip(f_en, f_fr):
            lines.append((en_line[:-1], fr_line[:-1]))

sentences_EDP = pd.DataFrame(lines, columns=("english", "french"))
sample = sentences_EDP.sample(n=5)
for _, row in sample.iterrows():
    print(row.english)
    print(row.french)
    print("\n")

Genetic selection, which provides substantial improvements in both oil content and fatty acid composition, should therefore be expected to enhance the quality of sunflower meal.
Selon nos résultats, la teneur en protéines calculée dans les tourteaux décortiqués, à partir des caractéristiques mesurées sur les différentes variétés étudiées, varie dans une gamme de 34 à 44% sur matière brute.


There was a significant association between multiple facial fractures and road traffic accidents.
Le pic de fréquence (58,2 %) était observé entre 20 et 39 ans et le rapport homme/femme était de 7.1:1.


There are two clinical aspects, the minor and the major forms.
l’une mineure, dont les symptômes sont peu invalidants, l’autre majeure, comporte une atteinte cutanée et muqueuse buccale et/ou génitale, et peut entraîner une altération de l’état général.


The addition of the acyl moiety is required for the protein function and usually mediates protein subcellular localization, protein-protein inter

## Load pubmed

In [3]:
with open("PUBMED/pubmed_en_fr.txt", "r") as f_in:
    raw_lines = f_in.readlines()  

clean_lines = []

for index, line in enumerate(raw_lines):
    _, en, fr = line.split("|")
    en = en[1:-2]
    fr = fr[:-2]
    clean_lines.append((en, fr))
    
    # The domain specific corpus should be smaller
    if index > 50000:
        break

sentences_pubmed = pd.DataFrame(clean_lines, columns=("english", "french"))

# Remove two typical issues from the dataset
m_1 = sentences_pubmed.english != "Not Available"
m_2 = sentences_pubmed.english != "In Process Citation"
sentences_pubmed = sentences_pubmed[m_1 & m_2]

sample = sentences_pubmed.sample(n=5)
for _, row in sample.iterrows():
    print(row.english)
    print(row.french)
    print("\n")
    
print(len(sentences_pubmed))

Cardiovascular disease in the elderly. A new public health challenge and new medical training needs
Maladies cardiovasculaires du sujet âgé. Un nouvel enjeu de santé publique et la nécessité d'une nouvelle formation médicale


From dysphagia of the patient to medical gastroparesis
De la dysphagie du patient a la gastroparésie médicale


The specificities of orthopedic trauma in the geriatric patient
Les particularités de la traumatologie de l'appareil locomoteur chez le sujet gériatrique


The cricopharyngeal muscle and the laryngeal nerves: contribution to the functional anatomy of swallowing
Le muscle cricopharyngien et les nerfs laryngés supérieur et inférieur : contribution à l'anatomie fonctionnelle de la déglutition


Liver fibrosis: from pathophysiology to therapeutic openings
Fibrose hépatique: de la physiopathologie aux implications thérapeutiques


48607


## Load gigafren

In [4]:
sentences_gigafren = []
gigafren_path = "gigafren/en-fr"

with open(os.path.join(
    gigafren_path, "giga-fren.en-fr.en"), "r"
) as f_en, open(os.path.join(
    gigafren_path, "giga-fren.en-fr.fr"), "r"
) as f_fr:
    for index, pair in enumerate(zip(f_en, f_fr)):
        
        # Remove \n characters
        pair = (s[:-1] for s in pair)
        
        sentences_gigafren.append(pair)
        
        # Take only the first 250000
        if index > 250000:
            break
            
sentences_gigafren = pd.DataFrame(
    sentences_gigafren, columns=("english", "french")
)
sample = sentences_gigafren.sample(n=5)
for _, row in sample.iterrows():
    print(row.english)
    print(row.french)
    print("\n")

print(len(sentences_gigafren))

Private Label Products Private label brands (also called store brands) are a critical component in the evolution of the retail grocery market, and currently account for 20% of all items sold in American supermarkets.
Produits vendus sous la marque du distributeur Les produits vendus sous la marque du distributeur (aussi appelée marque du magasin) sont une composante essentielle de l'évolution du marché de l'épicerie de détail et représentent actuellement 20 % de tous les articles vendus dans les supermarchés américains.


March Canada Child Tax Benefit Payments? [ 2007-03-22 ] Canada Revenue Agency:
La prestation fiscale canadienne pour enfants de mars? [ 2007-03-22 ] Agence du revenu du Canada :


Key sectors of the food and drink industry The "various food products" category is the largest sector of the food and drink industry, accounting for 26% of total turnover and 42% of the workforce.
Principales branches du secteur des aliments et boissons Les « produits alimentaires divers » s

In [5]:
sentences = pd.concat((sentences_EDP, sentences_gigafren, sentences_pubmed))
sentences = sentences.sample(frac=1).reset_index(drop=True)
sentences

Unnamed: 0,english,french
0,"Cependant, les réservations pour le T4 2002 so...",Recent retail and wholesale trade data also su...
1,o Click on the Institutions tab at the top of ...,o Cliquez sur l'onglet Institutions au haut de...
2,o Canadian Environmental Solutions (CES) - Ind...,o Solutions environnementales canadiennes (SEC...
3,Since the 1965 introduction of the Canadian fl...,"Depuis l'adoption du drapeau canadien en 1965,..."
4,"It could be that a perceived, whether real or ...","Il se peut que la perception, réelle ou non, d..."
...,...,...
301519,o TRAINING,o FORMATION
301520,Beef and Poultry Glut Appears to be Easing CAN...,La surabondance de boeuf et de volaille semble...
301521,Register a Canadian Company To register your C...,Inscription de votre entreprise canadienne Pou...
301522,Copepods distribution in relation to a Cape Gh...,Distribution des copépodes en relation avec la...


# Cleaning

In [6]:
sentences["source_len"] = sentences.iloc[:, 0].map(len)
sentences["target_len"] = sentences.iloc[:, 1].map(len)

In [7]:
L_MAX = 300
L_MIN = 10
len_mask = (sentences.source_len < L_MAX) & (sentences.source_len > L_MIN) & (sentences.target_len > L_MIN) & (sentences.target_len > L_MIN)
bad_sample = sentences[~len_mask].sample(n=1)
print(bad_sample.iloc[0, 0])
print(bad_sample.iloc[0, 1])

8.4 Within a ten year time-frame, what should have changed in relation to the vitality of official language minority communities and linguistic duality in the field of justice, following a thorough implementation of Section 41 of the Official Languages Act by Justice Canada? 8.5 How can progress be measured objectively?
8.4 En se donnant un horizon de planification d'une dizaine d'années, qu'est-ce qui devra avoir changé concernant l'épanouissement des communautés et la dualité linguistique dans le domaine de la justice suite à une mise en oeuvre diligente de l'article 41 de la Loi sur les langues officielles par Justice Canada? 8.5 Comment pourrons-nous mesurer objectivement les progrès accomplis ?


In [8]:
n_sentences_left = len(sentences[len_mask])
fraction_left = n_sentences_left / len(sentences)
print("Percentage of sentences left if we removed the sentences longer than {} and shorter than {} characters : {:.2f}".format(L_MAX, L_MIN, fraction_left))
print("{} sentences left".format(n_sentences_left))

Percentage of sentences left if we removed the sentences longer than 300 and shorter than 10 characters : 0.95
286188 sentences left


In [9]:
# Update sentences content
sentences = sentences[len_mask]

## Find bad alignements

Differences in target and source sentence lengths may indicate bad alignements. 
We remove from the dataframe the pairs for which the character count difference is superior to a certain threshold.

In [10]:
sentences.loc[:, "len_difference"] = abs(sentences.loc[:, "target_len"].copy() - sentences.loc[:, "source_len"].copy()).astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sentences.loc[:, "len_difference"] = abs(sentences.loc[:, "target_len"].copy() - sentences.loc[:, "source_len"].copy()).astype(int)


In [11]:
greater_differences = sentences.sort_values(
    "len_difference", ascending=False
)
for _, row in greater_differences[:4].iterrows():
    print(row.english)
    print(row.french)
    print("\n")

Their diagnosis is clinical and requires no further examination.
Dans les autres cas, l’orientation diagnostique se fait à partir de l’analyse des lésions élémentaires, empruntées à la dermatologie : enduits (langue saburrale, langue noire villeuse, muguet), érythèmes (érythroplasie, anémies, mycoses, maladie de Kawasaki, scarlatine), vésicules essentiellement virales, bulles (érythème polymorphe, Stevens-Johnson et Lyell, pemphigoïde cicatricielle, pemphigus, bulles hémorragiques), ulcérations (traumatiques, aphtes, carcinome, médicamenteuses…), plaques blanches (leucoplasies, kératoses frictionnelles, lichen plan, leucoplasie orale chevelue, white sponge naevus, stomatite urémique), atrophie (anémies chroniques, lichen plan), pigmentations (tatouage, lentigo, médicamenteuses, mélanome…), infiltrations (amylose, maladie de Kaposi, myxoedème).


Manitoba First Nations Justice Strategy Multidisciplinary too early too early N.B. Domestic Legal Aid Multidisciplinary too early too early N.

In [12]:
greater_differences = sentences.sort_values(
    "len_difference", ascending=True
)
temp_mask = greater_differences["len_difference"] > 40

In [13]:
for _, row in greater_differences[temp_mask][:4].iterrows():
    print(row.english)
    print(row.french)
    print("\n")

o On the other hand, travellers from overseas countries spent $7.9 billion in Canada in 2006, a 2.8% gain over 2005.
o D'autre part, les voyageurs en provenance des pays d'outre-mer ont dépensé 7,9 milliards de dollars au Canada en 2006, un gain de 2,8 % par rapport à 2005.


o Combining the DKS U.S. Travel Study research with the Explorer Segmentation Research study, the CTC has been able to identify three key customer segments that can deliver the best potential ROI in the US market.
o En combinant les résultats de deux études, l'une de la firme DKS sur le marché américain du voyage et l'autre autour du quotient explorateur, la CCT est parvenue à dégager trois grands profils de consommateurs à haut rendement pour le marché américain.


The MAB conducted reviews of and offered advice on a number of issues of strategic importance, including comments on the specific action plans developed by the OSB to achieve its strategic objectives and on the results obtained.
Ainsi, le CCG s'est pen

In [14]:
# We choose 40 characters as the difference threshold
alignement_mask = sentences.len_difference < 40
sentences = sentences[alignement_mask]

In [15]:
sample = sentences.sample(n=5)
for _, row in sample.iterrows():
    print(row.english)
    print(row.french)
    print("\n")
print(len(sentences))

There are a number of distinct categories that protect intellectual property.
La propriété intellectuelle est protégée de plusieurs façons.


Pseudotumoral soft tissue masses in children and adolescents
Masses des tissus mous d'allure tumorale de l'enfant et de l'adolescent


o South Carolina was ranked 43rd in the country for per capita income at US$26,138 in 2003.
o La Caroline du Sud se classait 43e au pays en 2003 pour son revenu par habitant, qui se chiffrait à 26 138 $ US.


Research, Technology, Innovation Assistance Programs Federal Government Assistance Programs Assistance Programs
Recherche technologie et l'innovation Programmes d'aide Programmes d'aide financière du gouvernement fédéral Programmes d'aide


Monitoring Canada's Accommodation Industry During A Challenging Year - 2003 The Impacts Of The Iraq War And SARS Quarter 3 Results
Revenus provenant de la location de chambres, AACJ 2003 par rapport à 2002 - Est de l'Ontario -5,7 % 140 Revenus (millions $) 120 100 80 60




In [17]:
with open("merged_shuffled_data_raw.en", "w") as f_en, open("merged_shuffled_data_raw.fr", "w") as f_fr:
    for _, row in sentences.iterrows():
        f_en.write(row.english + "\n")
        f_fr.write(row.french + "\n")

In [18]:
sentences.head()

Unnamed: 0,english,french,source_len,target_len,len_difference
0,"Cependant, les réservations pour le T4 2002 so...",Recent retail and wholesale trade data also su...,119,140,21
1,o Click on the Institutions tab at the top of ...,o Cliquez sur l'onglet Institutions au haut de...,140,157,17
2,o Canadian Environmental Solutions (CES) - Ind...,o Solutions environnementales canadiennes (SEC...,59,67,8
3,Since the 1965 introduction of the Canadian fl...,"Depuis l'adoption du drapeau canadien en 1965,...",121,109,12
4,"It could be that a perceived, whether real or ...","Il se peut que la perception, réelle ou non, d...",135,136,1
