/
report.txt
84 lines (57 loc) · 8.52 KB
/
report.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
####1. THE F-LANGUAGE AND GENERAL OBSERVATIONS####
We have chosen the canonical F-language in machine translation, French. We chose French mainly because the both of us know the its grammatical structure reasonably well, and therefore it is easier for us to implement rules after we do word-for-word translation.
French is a Romance language. Some of its grammatical features are shared with other Romance languages:
a) Two grammatical genders (masculin, féminin)
b) Tenses formed from auxiliaries
c) The use of pronominal verbs. In general, pronominal verbs in French are either reflexive, reciprocal or idiomatic. As the term suggests, idiomatic pronominal verbs are quite idiosyncratic and the pronoun may modify the sense of the verb in unexpected ways.
More quirky grammatical features include:
a) Declarative word order is SVO, except that pronominal objects precede verbs ('Tu parles français'; 'tu le parles')
b) Verb-subject inversion to denote question parking ('parles-tu français?'), although this is often not required ('tu parles français')
c) Existence of simple past, principally used in poetic diction
d) Split negation, typically of the form 'ne (VERB) pas'. Other forms of negation include: 'ne (VERB) point', 'ne (VERB) plus', 'ne (VERB)
Some vocabulary features:
a) The majority of French words come from Vulgar Latin. The Roman conquest in England, as well as the French influence over the English court, explain the large overlap of its vocabulary with that of English.
Some phonological features:
a) The key feature to note is elision; this is otherwise unproblematic except that certain forms of elision are carried over into the orthography, for example, 'j'ai' <- 'je ai', 'l'homme qu'il a vu' <- 'le homme que il a vu'.
We proceed to discuss the impact that these features have on the translation:
PREPROCESSING:
Elision meant that we needed to be more sensitive to tokenization, so we couldn't simply remove every single punctuation mark. However, splitting on punctuation marks could lead to confusion too. Take, for example, 'je l'ai vu'. In our word-by-word translation model, there is no straightforward way of distinguishing between 'l'' as an article ('the'), and 'l'' as a pronoun ('he' or 'she' or 'it). Thankfully our training set did not have very many of these ambiguities; it would be less easy in this deterministic setup to deal with such problems.
TRANSLATING:
We were thankful that most of these French words had a pretty straightforward counterpart in English, due to the cultural (and geographic!) proximity of English to French. It would have been far harder translating a different language, say, Indonesian, in which many words would escape straightforward translation because many words would be dependent on some basis of cultural understanding. This meant that we didn't have to worry too much about fidelity.
We were also glad that French and English are hot languages, so we do not have to deal with inferring agents, as they are explicitly marked. However, a problem that comes up as a result is that we often have to deal with pronouns, and there is no straightforward way of dealing with this, without further context: more specifically, in English, one refers to any entity (physical or not) as 'it', as long as this entity is not a person. In French, however, any entity is gendered, and this affects the pronoun that is used. For example, in English one says 'This is my passport. It is blue.' In French, one says 'C'est mon passeport. Il est bleu.'
Another thing that proves problematic is the existence of pronominal verbs. It is not immediately obvious how to deal with them, since the pronoun 'se' is typically translated as 'himself', and in some cases it makes sense (particularly for reflexive verbs). However, for other cases it is far less obvious what we should do with the pronoun. For example, it would take a more sophisticated language model to tell that 'il se peut' ought to be translated as 'it is possible' or 'it appears that' or 'it could be', rather than 'he itself can'.
####2. THE TEST DATA ####
L'ambassade de Belgique à Washington a l'honneur de vous informer de la mise en place d'une nouvelle mesure européenne en matière de délivrance de passeports.
Depuis le 15 février 2012, les passeports délivrés par cette Ambassade contiennent les empreintes digitales de leur demandeur et ce dès l'âge de 12 ans.
Depuis cette date, la comparution personnelle de tout demandeur de passeports inscrits ou non dans les registres consulaires de cette Ambassade est obligatoire. Cette Ambassade n'est plus en mesure d'accepter les demandes introduites par courrier postal.
Veuillez également noter qu'à partir de l'âge de 6 ans, les enfants doivent se présenter à l'Ambassade pour la prise de photo et la signature du passeport.
La même procédure est d'application aurpsè des Consulats Généraux d'Atlanta, Los Angeles et New York depuis le 1er octobre 2012.
Pour de plus amples informations, veuillez consulter les liens suivants
La section passeport précise que ce procédé d'identification n'a aucune autre incidence sur l'introduction, le traitement, et la délivrance des passeports.
Ces procédures s'effectuent comme par le passé, et sur base des mêmes documents.
Cette nouvelle mesure est imposée par un règlement Européen du 13.12.2004 et s'inscrit dans la lutte contre la fraude d'identité.
Dans le futur, elle permettra aux détenteurs de passeports d'Etats membres de l'Union Européenne de bonne foi, de bénéficier de contrôles de frontière plus aisés.
####3. THE TRANSLATIONS ####
######## NOTE THIS PART MUST BE COMPLETED##############
####4. THE RULES WE USED ####
First, as instructed, we ran the output through a POS tagger. Then we had 10 rules, described below:
I: NN JJ -> JJ NN
Whenever we encountered a noun followed by an adjective, we switched it such that the adjective comes before the noun. This is perhaps the most obvious difference between English and French (in that most new students of French will learn it within the first two weeks of class) and that is why we had this rule first.
II: NN VBG -> VBG NN
Implemented to invert gerund constructions: in French probably more grammatical to say something along the lines of 'l'homme gagnant', in English 'the winning man'. Rarer than the previous case, but some cases persisted in our training set, so we inverted gerund constructions.
III: NN1 "OF" NN2 -> NN2 NN1
Implemented to catch compound nouns in French. For example, the phrase "fraud d'identité" should translate as "identity fraud" instead of "fraud of identity". This is simply the way French compounds nouns, whereas in English we can simply concatenate them.
IV: "HOPE FOR" -> "PLEASE"
Implemented to dramatically improve fluency. The French "veuillez" is literally translated as "hope for", but this translation is inadequate, because it is used as a marker of politeness, usually closer to 'please' instead.
V: DELETING ARTICLES FROM DATES
French represents dates as "le 12ème janvier", i.e. they append a definite article at the front of dates. English never does this, so we remove the article.
VI: DELETING REPEATED WORDS
Sometimes languages will translate word-for-word extremely badly, because noun phrases in one language can be translated into a single noun in another language. For example, "courrier" translates as "mail" and "postal" translates as "mail", but "courrier postal" should translate as "mail", not "mail mail". If we see two same words in a row we should be able to delete one of them without much loss of fidelity, and probably gain some fluency.
VII: "NOT" VB "NO" -> VB "NO" [DELETE THE "NOT"]
This has to do with French having a split negation, so verbs are negated with as "ne" + VB + PARTICLE, with the particle denoting the richness of negation. So whenever we see a verb surrounded by negation, we need to make a choice. In this case, it comes from French "ne...aucune" which means "no".
VIII: NOT" VB "MORE" -> VB "NO LONGER" [DELETE THE "NOT", "MORE", INSERT "NO LONGER"]
Refer to rule 7 for split negation. In this case, the split negation arose from French "ne... plus" which means "no longer".
IX: "OF TO" --> "TO" [DELETE "OF"]
French often requires the preposition "DE" in front of an infinitive in order to be grammatical, e.g. 'permettre ... de bénéficier'. This does not hold in English, one would never say "allow ... of to benefit", one would simply say "allow to benefit".
X: "AT TO LEAVE OF" -> "FROM"
In our translation model, "à partir de" is translated as "AT" + "TO LEAVE" + "OF", this is idiomatic in French but hardly comprehensible in English. One is better off translating this as "FROM".