Word alignment symmetrization is broken #1829

gtoffoli · 2017-09-14T16:33:31Z

The module nltk.translate.gdfa, which includes the grow_diag_final_and function and a couple of nested functions, is broken.
It has a few serious bugs; in particular, the main loop of the grow_diag internal function runs forever except in some trivial cases. Possibly this symmetrization algorithm has been ported to NLTK from another language and hasn't been tested.

Although I didn't analyze the algorithm implementation in depth, I think that I have fixed the bugs preventing it from working. I've put the patches in the attached diff file.

gdfa.py.txt

alvations · 2017-09-15T01:17:14Z

Thanks @gtoffoli for reporting this!

Do you have example sentences with word alignments that result in the infinite loop? I'll create a unit test with theses sentences to test the gdfa function.

gtoffoli · 2017-09-15T08:41:26Z

I'm attaching the test_gdfa.py module and 4 data files:

source.txt
target.txt
link_fwd.txt
link_rev.txt

They all must be put in the same directory.
The test execution should create a file link_sym.txt, containing symmetrized alignments, in the same directory.

I used as a small bi-lingual corpus some segments extracted from an Italian web site (source.txt) and the corresponding translations in English (target.txt).
The corpus was tokenized and the tokens replaced in the source and the target texts with numerical codes referring mono-lingual vocabularies (Italian and English respectively); the vocabularies themselves are not needed for this test.
With the eflomal package (https://github.com/robertostling/eflomal) and some more code I produced the italian-to-english alignments (link_fwd.txt) and the english-to-italian alignments (link_rev.txt).

All data files include the same number of lines, except that:

the first line of source.txt contains 2 numbers: number of the lines following and number of word types (unique tokens) in the Italian vocabulary
the first line of target.txt contains 2 numbers: number of the lines following and number of word types (unique tokens) in the English vocabulary.

I must correct my previous statement: without my patches, grow_diag_final_and starts looping indefinitely from the start (first couple of asymmetric alignments); only after a partial fix, I was able to symmetrize the first, very simple, alignments,

test_gdfa.zip

alvations · 2017-09-15T08:48:01Z

Thanks @gtoffoli for the details! I'll look into it over the weekend.

alvations · 2017-12-13T04:57:26Z

@gtoffoli Sorry for getting back this late. Had been busy these months.

It's because the infinite while loop didn't have a breaking condition grow_diag(). This seems to fix the issue:

    def grow_diag():
        """
        Search for the neighbor points and them to the intersected alignment
        points if criteria are met.
        """
        prev_len = len(alignment) - 1
        # iterate until no new points added
        while prev_len < len(alignment):
            no_new_points = True
            # for english word e = 0 ... en
            for e in range(srclen):
                # for foreign word f = 0 ... fn
                for f in range(trglen): 
                    # if ( e aligned with f)
                    if (e,f) in alignment:
                        # for each neighboring point (e-new, f-new)
                        for neighbor in neighbors:
                            neighbor = tuple(i+j for i,j in zip((e,f),neighbor))
                            e_new, f_new = neighbor
                            # if ( ( e-new not aligned and f-new not aligned) 
                            # and (e-new, f-new in union(e2f, f2e) )
                            if (e_new not in aligned and f_new not in aligned)\
                            and neighbor in union:
                                alignment.add(neighbor)
                                aligned['e'].add(e_new); aligned['f'].add(f_new)
                                prev_len+=1
                                no_new_points = False
            # iterate until no new points added
            if no_new_points:
                break

Also, I agree that the spaghetti function is weird but it stays close to the pseudo-code (perl-like syntax)

- Fixing #1829 - Also removed unused import.

alvations added bug SMT tests labels Sep 15, 2017

alvations self-assigned this Sep 15, 2017

alvations added this to the 3.2.6 milestone Oct 14, 2017

alvations added a commit that referenced this issue Dec 13, 2017

Added mechanism to check that no new points are added

c24d851

- Fixing #1829 - Also removed unused import.

alvations mentioned this issue Dec 13, 2017

Fixing GDFA: Check that no new points are added #1914

Merged

alvations modified the milestone: 3.2.6 Dec 13, 2017

stevenbird closed this as completed Dec 15, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Word alignment symmetrization is broken #1829

Word alignment symmetrization is broken #1829

gtoffoli commented Sep 14, 2017

alvations commented Sep 15, 2017

gtoffoli commented Sep 15, 2017

alvations commented Sep 15, 2017

alvations commented Dec 13, 2017 •

edited

Loading

Word alignment symmetrization is broken #1829

Word alignment symmetrization is broken #1829

Comments

gtoffoli commented Sep 14, 2017

alvations commented Sep 15, 2017

gtoffoli commented Sep 15, 2017

alvations commented Sep 15, 2017

alvations commented Dec 13, 2017 • edited Loading

alvations commented Dec 13, 2017 •

edited

Loading