# Ideas

- Generate our own translation and compare it with the given translation to produce a score
- Number of words
- Distance between the sentence vectors in embedding space
- Use punctuation to delimit sub-segments of the sentence and evaluate the proximity between corresponding segments
- Fraction of simple words
- Evaluate the syntactic complexity of the English sentence -> for a simple sentence, the translation should be of good quality
- Rare words -> check whether the word has been translated or not
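
As a rough starting point for the "number of words" and "punctuation segments" ideas, a minimal sketch in plain Python. The helper names `length_features` and `split_on_punctuation` are hypothetical, and whitespace tokenization is a simplifying assumption.

```python
import re

def length_features(src, mt):
    """Return simple length-based features for a (source, translation) pair."""
    src_words = src.split()
    mt_words = mt.split()
    # Guard against an empty source to avoid a division by zero.
    ratio = len(mt_words) / len(src_words) if src_words else 0.0
    return {"n_src": len(src_words), "n_mt": len(mt_words), "len_ratio": ratio}

def split_on_punctuation(sentence):
    """Split a sentence into sub-segments at commas, semicolons and colons."""
    parts = re.split(r"[,;:]", sentence)
    return [p.strip() for p in parts if p.strip()]

features = length_features("the cat sat on the mat", "die katze sass")
segments = split_on_punctuation("first part, second part; third part")
```

The segments of the source and of the translation could then be compared pairwise, provided both sentences split into the same number of segments.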

# Code

In [267]:
from collections import Counter, defaultdict
import math
import copy
import random
import operator
import pandas as pd

flatten = lambda l: [item for sublist in l for item in sublist]

# some helper functions
def prepare_data(filename):
    # Tokenize each non-empty line on whitespace and append an end-of-sentence marker.
    with open(filename) as f:
        data = [l.strip().split() + ['</s>'] for l in f if l.strip()]
    corpus = flatten(data)
    vocab = set(corpus)
    return vocab, data

In [271]:
def extract_sentences(filename, lower=False):
    # Read the non-empty lines of the file, stripped; optionally lowercase them.
    with open(filename) as f:
        data = [l.strip() for l in f if l.strip()]
    if lower:
        data = [l.lower() for l in data]
    return data

In [272]:
sentences_en = pd.DataFrame(extract_sentences('../data/en_de/train.ende.src'),columns = ['sentences_en'])
sentences_ge = pd.DataFrame(extract_sentences('../data/en_de/train.ende.mt'),columns = ['sentences_ge'])
scores = pd.read_csv('../data/en_de/train.ende.scores',header=None)
scores = scores.rename(columns={0:"scores"})

In [273]:
dataset = pd.merge(sentences_en,sentences_ge,left_index=True,right_index=True)
dataset = pd.merge(dataset,scores,left_index=True,right_index=True)

In [274]:
# Sort once by score, then take the lowest, the highest, and a mid-range slice.
sorted_dataset = dataset.sort_values('scores').reset_index(drop=True)
bottom_10 = sorted_dataset.iloc[0:10]
top_10 = sorted_dataset.iloc[-10:]
middle = sorted_dataset.iloc[4000:4010]

In [280]:
sample = dataset.sample(30)

In [281]:
sample.style.set_properties(subset=['sentences_en'], **{'width': '300px'})

index,sentences_en,sentences_ge,scores
1479,Numerous NP cabooses remain in private collections.,Zahlreiche NP-Kabinen bleiben in Privatsammlungen.,-0.116238
6540,"The reproductive system of females consist of paired ovaries, lateral oviducts, spermatheca, and a genital chamber.","Das Fortpflanzungssystem der Weibchen besteht aus paarigen Eierstöcken, lateralen Eizellen, Spermatheca und einer Genitalkammer.",-0.257776
3416,"Ben Schwartz talks about Season 3 of House of Lies with Red Crab, Haardvark, and Paul rating Ben's impersonations.","Ben Schwartz spricht über Staffel 3 von House of Lies mit Red Crab, Haardvark und Paul, die Ben 's Imitationen bewerten.",0.299273
4001,"Her funeral drew 60,000 mourners, many of whom traveled from outside the United States.","Ihr Begräbnis zog 60.000 Trauernde an, von denen viele außerhalb der Vereinigten Staaten reisten.",0.415565
5686,"During the 2002 Gujarat riots, several such ghettos became easy targets for the rioting mobs, as they enabled the profiling of residential colonies.","Während der Ausschreitungen in Gujarat 2002 wurden mehrere solcher Ghettos leichte Ziele für die Aufruhr von Mobs, da sie die Profilerstellung von Wohnkolonien ermöglichten.",0.36035
5549,The yolk provides the liquid nourishment for the embryo caterpillar until it escapes from the shell.,"Das Eigelb versorgt die Embryonenkatze mit flüssiger Nahrung, bis es aus der Schale entweicht.",0.0520693
6683,The hemolymph of the mealworm beetle Tenebrio molitor contains several antifreeze proteins.,Die Hämolymphe des Mealworm Beetle Tenebrio molitor enthält mehrere Frostschutzproteine.,-0.577143
5946,Croc went on a rampage after a confrontation with a vagrant and Croc wound up in a shopping mall.,Croc ging auf eine Rampage nach einer Konfrontation mit einem Vagranten und Croc aufgewickelt in einem Einkaufszentrum.,0.307129
4112,"Jourdain, Herta, Nakano, Fittipaldi, and Gugelmin rounded out the top ten.","Jourdain, Herta, Nakano, Fittipaldi und Gugelmin runden die Top Ten ab.",0.137732
2589,Anglican devotions Anglican Service Book Prayer Book Rebellion Prayer Book Society of Canada The Books of Homilies,Anglikanische Andacht Anglikanische Gottesdienst Buch Gebetsaufstand Gebetsgesellschaft von Kanada Bücher der Homilien,-1.34875


In [275]:
top_10.style.set_properties(subset=['sentences_en'], **{'width': '300px'})

index,sentences_en,sentences_ge,scores
6990,"Confirmation of President Obama's first nominee, Andrew Traver, stalled in 2011 after the NRA expressed strong opposition.","Die Bestätigung des ersten Kandidaten von Präsident Obama, Andrew Traver, kam 2011 ins Stocken, nachdem die NRA starke Opposition zum Ausdruck gebracht hatte.",1.85087
6991,"In the aftermath of the assassination, the Metropolitan Police and the Home Office rapidly suppressed India House.",Nach der Ermordung unterdrückten die Metropolitan Police und das Home Office Indien House rasch.,1.85087
6992,Legislators who codified these laws sought to modernize them without abandoning their foundations in traditional jurisprudence.,"Gesetzgeber, die diese Gesetze kodifizierten, versuchten, sie zu modernisieren, ohne ihre Grundlagen in der traditionellen Rechtsprechung aufzugeben.",1.85087
6993,The eastern bobwhite attacks these ants by digging out the mounds looking for young queens.,"Der Ostbobwhite greift diese Ameisen an, indem er die Hügel ausgräbt, die nach jungen Königinnen suchen.",2.09382
6994,The visibility of prostitution in Laos belies the practice's illegality.,Die Sichtbarkeit der Prostitution in Laos widerlegt die Illegalität der Praxis.,2.09382
6995,"Soon afterwards, the JMA too downgraded Muifa to a Severe Tropical storm.",Bald darauf hat die JMA Muifa zu einem schweren Tropensturm herabgestuft.,2.22772
6996,"On the 5th, a northward drift ensued while well southeast of Okinawa.","Am 5. folgte eine Norddrift, während weit südöstlich von Okinawa.",2.22772
6997,"On 12 June 1626, the Commons launched a direct protestation attacking Buckingham, stating,","Am 12. Juni 1626 startete die Commons einen direkten Protest gegen Buckingham, in dem sie erklärte:",2.22772
6998,"The sulfate minerals all contain the sulfate anion, SO42−.","Die Sulfatminerale enthalten alle Sulfatanion, SO42 −.",2.22772
6999,132 Slim Williams and his Tennessee Avenue Boys Throwing Kisses / Reckon As How I Love You 1949,132 Slim Williams und seine Tennessee Avenue Jungen werfen Küsse / Reckon wie ich dich liebe 1949,2.76458


In [219]:
bottom_10.style.set_properties(subset=['sentences_en'], **{'width': '300px'})

index,sentences_en,sentences_ge,scores
0,"the mummy, a handbook of egyptian funerary archaeology.","the mummy, a handbook of egyptian funerary archaeology.",-8.14071
1,he vacated the wwa cruiserweight title in april 2002 after returning to wwf.,nach seiner rückkehr zum wwf verließ er im april 2002 den titel wwa cruiserweight.,-7.48152
2,"haroun and s.a. mourad, proceedings of international colloquium on structural engineering, cairo, egypt, april 1992, pp.","haroun and s.a. mourad, proceedings of international colloquium on structure engineering, cairo, egypt, april 1992, pp.",-7.14853
3,boone trails grand towers gravois trail new horizons north star osage ozark trailblazers pathfinder river trails thunderbird,boone trails grand towers gravois trail new horizons north star osage ozark trailblazers pathfinder river trails thunderbird,-7.02872
4,"illustrated encyclopedia of woodworking handtools, instruments & devices.","illustrated encyclopedia of woodworking handtools, instruments & devices.",-6.77649
5,"faerie apocalypse, jason franks, ifwg publishing australia.","faerie apocalypse, jason franks, ifwg publishing australia.",-6.77649
6,"""the great reform act of 1832 and british democratization.""","""the great reform act of 1832 and british democracy"".",-6.77649
7,"pollock, griselda, generations and geographies in the visual arts, routledge, london, 1996.","pollock, griselda, generations and geographies in the visual arts, routledge, london, 1996.",-6.77649
8,"""radio birdman"", in the trouser press record guide, 4th ed., ed.","""radio birdman"", in the hose press record guide, 4th ed., ed.",-6.62988
9,"u.s. army national infantry museum, ""indian wars"", u.s. army infantry home page viele, john.","u.s. army national infantry museum, ""indian wars"", u.s. army infantry home page viele, john.",-6.62429


In [220]:
middle.style.set_properties(subset=['sentences_en'], **{'width': '300px'})

index,sentences_en,sentences_ge,scores
4000,"on 22 june 1911, a contingent of 61 camerons, parading with their allied regiment, participated in the coronation of king george v.","am 22. juni 1911 nahm ein kontingent von 61 kameronen, die mit ihrem alliierten regiment paradierten, an der krönung von könig georg v. teil.",0.235618
4001,the creek heads approximately south with a few bends through sugarloaf township for about 2 miles.,der bach geht etwa nach süden mit ein paar kurven durch sugarloaf township für etwa 2 meilen.,0.235866
4002,"in october 2010, lindelof submitted his rewritten screenplay to fox.",im oktober 2010 reichte lindelof sein neu geschriebenes drehbuch bei fox ein.,0.236079
4003,"the tudor rose represents england, a thistle scotland, the flax flower and shamrock northern ireland, and the leek and daffodil wales.","die tudor rose repräsentiert england, eine distel schottland, die flachsblume und den shamrock nordirland und den lauch und narzissen wales.",0.236079
4004,"the tibia is characteristically a slender segment in adult insects, only a little shorter than the femur or the combined femur and trochanter.","die tibia ist charakteristisch für ein schlankes segment bei erwachsenen insekten, nur wenig kürzer als der oberschenkel oder der kombinierte oberschenkel und trochanter.",0.236079
4005,this inexperience contributed to tactical mistakes and avoidable casualties during the gallipoli campaign.,diese unerfahrenheit trug zu taktischen fehlern und vermeidbaren opfern während der gallipoli kampagne bei.,0.236079
4006,"murat then succeeded in bluffing hohenlohe into capitulating, even though the prussian was neither surrounded nor outnumbered.","murat gelang es dann hohenlohe zu kapitulieren, obwohl der preußische weder umzingelt noch überzählig war.",0.236113
4007,"kopje outcrops erupt from the landscape, and a weep trickles down the face of the rocks.","kopje ausbrüche brechen aus der landschaft, und ein weinen knirscht das gesicht der felsen hinunter.",0.236147
4008,"ireland depended on, mainly, british tankers for petroleum.",irland war hauptsächlich von britischen tankern für erdöl abhängig.,0.236181
4009,"it sailed in a convoy from hoboken, new jersey, for france on 18 october 1917.","sie segelte in einem konvoi von hoboken, new jersey, nach frankreich am 18. oktober 1917.",0.236314


## Exploration

### Test spaCy

In [18]:
import spacy
import numpy as np

In [11]:
nlp = spacy.load("en_core_web_md")
nlp_german = spacy.load("de_core_news_md")


In [195]:
phrase_en = "confirmation of president obama's first nominee, andrew traver, stalled in 2011 after the nra expressed strong opposition."
phrase_ge = "die bestätigung des ersten kandidaten von präsident obama, andrew traver, kam 2011 ins stocken, nachdem die nra starke opposition zum ausdruck gebracht hatte."

In [32]:
doc = nlp(phrase_en)
# Collect [text, POS] pairs while printing the full token annotations.
decomposition = []
for token in doc:
    decomposition.append([token.text, token.pos_])
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)
decomposition = np.array(decomposition)

confirmation confirmation NOUN NN nsubj xxxx True False
of of ADP IN prep xx True True
president president PROPN NNP compound xxxx True False
obama obama PROPN NNP poss xxxx True False
's 's PART POS case 'x False True
first first ADJ JJ amod xxxx True True
nominee nominee NOUN NN pobj xxxx True False
, , PUNCT , punct , False False
andrew andrew PROPN NNP compound xxxx True False
traver traver PROPN NNP appos xxxx True False
, , PUNCT , punct , False False
stalled stall VERB VBD ROOT xxxx True False
in in ADP IN prep xx True True
2011 2011 NUM CD pobj dddd False False
after after ADP IN mark xxxx True True
the the DET DT det xxx True True
nra nra PROPN NNP nsubj xxx True False
expressed express VERB VBD advcl xxxx True False
strong strong ADJ JJ amod xxxx True False
opposition opposition NOUN NN dobj xxxx True False
. . PUNCT . punct . False False


In [193]:
"die bestätigung des ersten kandidaten von präsident obama, andrew traver, kam 2011 ins stocken, nachdem die nra starke opposition zum ausdruck gebracht hatte.".split()

['die',
 'bestätigung',
 'des',
 'ersten',
 'kandidaten',
 'von',
 'präsident',
 'obama,',
 'andrew',
 'traver,',
 'kam',
 '2011',
 'ins',
 'stocken,',
 'nachdem',
 'die',
 'nra',
 'starke',
 'opposition',
 'zum',
 'ausdruck',
 'gebracht',
 'hatte.']

In [266]:
doc_ge.vector.shape

(300,)
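
For the "distance in embedding space" idea, a minimal cosine-similarity sketch over two equal-length vectors such as the 300-dimensional `doc.vector` above. The helper `cosine_similarity` is hypothetical. One caveat, stated as an assumption: the `en_core_web_md` and `de_core_news_md` vectors are probably not trained in a shared space, so comparing an English vector with a German one directly is likely not meaningful without some cross-lingual alignment.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

same = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```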

In [265]:
doc_ge = nlp_german(phrase_ge)
# Collect [text, POS] pairs while printing the full token annotations.
decomposition_ge = []
for token in doc_ge:
    decomposition_ge.append([token.text, token.pos_])
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop, token.vector.shape)
decomposition_ge = np.array(decomposition_ge)

die der DET ART nk xxx True True (300,)
bestätigung bestätigung NOUN NN sb xxxx True False (300,)
des der DET ART nk xxx True True (300,)
ersten erst ADJ ADJA nk xxxx True True (300,)
kandidaten kandidaten NOUN NN ag xxxx True False (300,)
von von ADP APPR pg xxx True True (300,)
präsident präsident NOUN NN nk xxxx True False (300,)
obama obama PROPN NE nk xxxx True False (300,)
, , PUNCT $, punct , False False (300,)
andrew andrew PROPN NE pnc xxxx True False (300,)
traver traver PROPN NE app xxxx True False (300,)
, , PUNCT $, punct , False False (300,)
kam kommen VERB VVFIN ROOT xxx True True (300,)
2011 2011 NUM CARD da dddd False False (300,)
ins ins ADP APPRART mo xxx True True (300,)
stocken stocken NOUN NN nk xxxx True False (300,)
, , PUNCT $, punct , False False (300,)
nachdem nachdem SCONJ KOUS cp xxxx True True (300,)
die der DET ART nk xxx True True (300,)
nra nra PROPN NE sb xxx True False (300,)
starke starke ADJ ADJA nk xxxx True False (300,)
opposition opposition NOUN 

In [7]:
# Assumes NLTK, with its 'stopwords' corpus already downloaded.
from nltk.corpus import stopwords

stopwords_en = set(stopwords.words('english'))
stopwords_ge = set(stopwords.words('german'))

In [46]:
decomposition = [elt for elt in decomposition if elt[0] not in stopwords_en]
decomposition_ge = [elt for elt in decomposition_ge if elt[0] not in stopwords_ge]

In [47]:
decomposition_ge

[array(['bestätigung', 'NOUN'], dtype='<U11'),
 array(['ersten', 'ADJ'], dtype='<U11'),
 array(['kandidaten', 'NOUN'], dtype='<U11'),
 array(['präsident', 'NOUN'], dtype='<U11'),
 array(['obama', 'PROPN'], dtype='<U11'),
 array([',', 'PUNCT'], dtype='<U11'),
 array(['andrew', 'PROPN'], dtype='<U11'),
 array(['traver', 'PROPN'], dtype='<U11'),
 array([',', 'PUNCT'], dtype='<U11'),
 array(['kam', 'VERB'], dtype='<U11'),
 array(['2011', 'NUM'], dtype='<U11'),
 array(['stocken', 'NOUN'], dtype='<U11'),
 array([',', 'PUNCT'], dtype='<U11'),
 array(['nachdem', 'SCONJ'], dtype='<U11'),
 array(['nra', 'PROPN'], dtype='<U11'),
 array(['starke', 'ADJ'], dtype='<U11'),
 array(['opposition', 'NOUN'], dtype='<U11'),
 array(['ausdruck', 'NOUN'], dtype='<U11'),
 array(['gebracht', 'VERB'], dtype='<U11'),
 array(['.', 'PUNCT'], dtype='<U11')]

In [48]:
decomposition

[array(['confirmation', 'NOUN'], dtype='<U12'),
 array(['president', 'PROPN'], dtype='<U12'),
 array(['obama', 'PROPN'], dtype='<U12'),
 array(["'s", 'PART'], dtype='<U12'),
 array(['first', 'ADJ'], dtype='<U12'),
 array(['nominee', 'NOUN'], dtype='<U12'),
 array([',', 'PUNCT'], dtype='<U12'),
 array(['andrew', 'PROPN'], dtype='<U12'),
 array(['traver', 'PROPN'], dtype='<U12'),
 array([',', 'PUNCT'], dtype='<U12'),
 array(['stalled', 'VERB'], dtype='<U12'),
 array(['2011', 'NUM'], dtype='<U12'),
 array(['nra', 'PROPN'], dtype='<U12'),
 array(['expressed', 'VERB'], dtype='<U12'),
 array(['strong', 'ADJ'], dtype='<U12'),
 array(['opposition', 'NOUN'], dtype='<U12'),
 array(['.', 'PUNCT'], dtype='<U12')]
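
One way the two filtered decompositions could be compared is by how much their POS tag counts overlap. A minimal sketch, assuming lists of `[text, pos]` pairs like the ones above; `pos_overlap` is a hypothetical helper and the normalization by the larger tag count is an arbitrary choice.

```python
from collections import Counter

def pos_overlap(pairs_src, pairs_mt, ignore=("PUNCT",)):
    """Fraction of POS tags shared between two lists of (text, pos) pairs."""
    c_src = Counter(pos for _, pos in pairs_src if pos not in ignore)
    c_mt = Counter(pos for _, pos in pairs_mt if pos not in ignore)
    # Multiset intersection: per tag, the smaller of the two counts.
    shared = sum((c_src & c_mt).values())
    total = max(sum(c_src.values()), sum(c_mt.values()))
    return shared / total if total else 0.0

# Small excerpt of the pairs printed above.
en_pairs = [("confirmation", "NOUN"), ("president", "PROPN"), ("stalled", "VERB"), (".", "PUNCT")]
ge_pairs = [("bestätigung", "NOUN"), ("präsident", "NOUN"), ("kam", "VERB"), (".", "PUNCT")]
overlap = pos_overlap(en_pairs, ge_pairs)
```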

### English-German dictionary

In [82]:
import pandas as pd

In [172]:
dic = pd.read_csv('../data/de-en.txt',sep=' ',header=None)
dic = dic.rename(columns={0:'ge',1:'en'})

In [173]:
def remove_first_elt(x):
    # Drop a leading "#" from string entries; leave other values (e.g. NaN) untouched.
    return x[1:] if (isinstance(x, str) and x.startswith("#")) else x

In [176]:
dic['ge'] = dic['ge'].apply(remove_first_elt)
dic['en'] = dic['en'].apply(remove_first_elt)

In [178]:
def add(x):
    return list(x)

In [185]:
res = dic.groupby('en').agg({'ge':add}).reset_index(drop=False)

In [191]:
correspondance_dict = res.set_index("en").to_dict()["ge"]

In [258]:
def count_translated_words(phrase_en, phrase_ge):
    # Count the English words that have at least one dictionary translation
    # present in the German sentence; also return the number of English words.
    count = 0
    sep_german_words = phrase_ge.split(" ")
    sep_english_words = phrase_en.split(" ")
    n_english_words = len(sep_english_words)
    for word in sep_english_words:
        for word_ge in sep_german_words:
            if word in correspondance_dict and word_ge in correspondance_dict[word]:
                count += 1
                break

    return (count, n_english_words)

In [259]:
count_translated_words(phrase_en,phrase_ge)

(8, 17)
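
Splitting on spaces leaves punctuation attached to tokens (e.g. `'obama,'` or `'hatte.'` in the split output earlier), so such tokens probably never match a dictionary entry. A possible variant that strips surrounding punctuation first, sketched with a toy dictionary; `count_translated_words_clean` and `toy_dict` are hypothetical and not part of the notebook.

```python
import string

def count_translated_words_clean(phrase_en, phrase_ge, en_to_ge):
    """Like count_translated_words, but strips surrounding ASCII punctuation first."""
    def clean(word):
        return word.strip(string.punctuation)

    ge_words = {clean(w) for w in phrase_ge.split()} - {""}
    en_words = [w for w in (clean(w) for w in phrase_en.split()) if w]
    # An English word counts as translated if any of its candidate
    # translations appears among the German words.
    count = sum(1 for w in en_words if ge_words.intersection(en_to_ge.get(w, [])))
    return count, len(en_words)

# Toy English -> German mapping, for illustration only.
toy_dict = {
    "president": ["präsident"],
    "opposition": ["opposition", "widerstand"],
    "strong": ["stark"],
}
result = count_translated_words_clean(
    "strong opposition, president.", "starke opposition. präsident,", toy_dict
)
```

Here `strong` is not matched because the German sentence contains the inflected form `starke`; lemmatizing both sides (for instance with the spaCy `lemma_` attribute used earlier) might help with that.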

In [260]:
def tr(x):
    return count_translated_words(x["sentences_en"],x["sentences_ge"])

In [261]:
dataset["corresp"] = dataset.apply(tr,axis=1)

In [262]:
dataset

index,sentences_en,sentences_ge,scores,corresp
0,josé ortega y gasset visited husserl at freibu...,1934 besuchte josé ortega y gasset husserl in ...,1.101697,"(6, 10)"
1,"however, a disappointing ninth in china meant ...",eine enttäuschende neunte in china bedeutete j...,-0.516656,"(10, 16)"
2,"in his diary, chase wrote that the release of ...","in seinem tagebuch, chase schrieb, dass die ve...",-2.226388,"(10, 19)"
3,heavy arquebuses mounted on wagons were called...,schwere arquebuses auf waggons montiert wurden...,-0.827379,"(5, 10)"
4,once north pacific salmon die off after spawni...,sobald der nordpazifische lachs nach dem laich...,0.364695,"(7, 17)"
5,they engaged in crossfire at guamaní river bri...,sie verübten kreuzfeuer an der guamaní river b...,-0.688501,"(10, 19)"
6,"several wha teams courted gretzky, notably the...","mehrere wha-teams umworben gretzky, insbesonde...",0.311271,"(7, 13)"
7,16 penn state in which redshirt freshman jim k...,"16 penn state, in dem rothemd-neuling jim kell...",-0.493057,"(14, 24)"
8,"on 12 april 1983, the landmarks preservation c...",am 12. april 1983 ernannte die landmarks prese...,0.018337,"(8, 19)"
9,"after that, chinese junks came to northern bor...","danach kamen chinesische junks mit gewürzen, v...",-0.680303,"(8, 20)"


In [251]:
dataset["scores"].corr(dataset["corresp"])

0.01210202332874808
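
As built above, `corresp` holds `(count, n_words)` tuples, which a Pearson correlation would likely not accept directly; the value above presumably comes from an earlier scalar version of the feature (an assumption, given the cell numbering). A minimal sketch of turning the tuples into a coverage fraction before correlating, in plain Python with toy values; `coverage` and `pearson` are hypothetical helpers.

```python
import math

def coverage(pair):
    """Turn a (count, n_words) tuple into a fraction in [0, 1]."""
    count, n_words = pair
    return count / n_words if n_words else 0.0

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)

# Toy values, not taken from the dataset.
toy_corresp = [(1, 4), (2, 4), (3, 4)]
toy_scores = [-1.0, 0.0, 1.0]
fractions = [coverage(p) for p in toy_corresp]
r = pearson(fractions, toy_scores)
```

In the notebook, the equivalent would be something like `dataset["corresp"].apply(coverage)` followed by `.corr(dataset["scores"])`.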