Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Word alignment symmetrization is broken #1829

Closed
gtoffoli opened this issue Sep 14, 2017 · 4 comments
Closed

Word alignment symmetrization is broken #1829

gtoffoli opened this issue Sep 14, 2017 · 4 comments
Assignees
Milestone

Comments

@gtoffoli
Copy link

The module nltk.translate.gdfa, which includes the grow_diag_final_and function and a couple of nested functions, is broken.
It has a few serious bugs; in particular, the main loop of the grow_diag internal function runs forever except in some trivial cases. Possibly this symmetrization algorithm has been ported to NLTK from another language and hasn't been tested.

Although I didn't analyze the algorithm implementation in depth, I think that I have fixed the bugs preventing it from working. I've put the patches in the attached diff file.

gdfa.py.txt

@alvations
Copy link
Contributor

Thanks @gtoffoli for reporting this!

Do you have example sentences with word alignments that result in the infinite loop? I'll create a unit test with theses sentences to test the gdfa function.

@gtoffoli
Copy link
Author

I'm attaching the test_gdfa.py module and 4 data files:

  • source.txt
  • target.txt
  • link_fwd.txt
  • link_rev.txt

They all must be put in the same directory.
The test execution should create a file link_sym.txt, containing symmetrized alignments, in the same directory.

I used as a small bi-lingual corpus some segments extracted from an Italian web site (source.txt) and the corresponding translations in English (target.txt).
The corpus was tokenized and the tokens replaced in the source and the target texts with numerical codes referring mono-lingual vocabularies (Italian and English respectively); the vocabularies themselves are not needed for this test.
With the eflomal package (https://github.com/robertostling/eflomal) and some more code I produced the italian-to-english alignments (link_fwd.txt) and the english-to-italian alignments (link_rev.txt).

All data files include the same number of lines, except that:

  • the first line of source.txt contains 2 numbers: number of the lines following and number of word types (unique tokens) in the Italian vocabulary
  • the first line of target.txt contains 2 numbers: number of the lines following and number of word types (unique tokens) in the English vocabulary.

I must correct my previous statement: without my patches, grow_diag_final_and starts looping indefinitely from the start (first couple of asymmetric alignments); only after a partial fix, I was able to symmetrize the first, very simple, alignments,

test_gdfa.zip

@alvations
Copy link
Contributor

Thanks @gtoffoli for the details! I'll look into it over the weekend.

@alvations alvations added this to the 3.2.6 milestone Oct 14, 2017
@alvations
Copy link
Contributor

alvations commented Dec 13, 2017

@gtoffoli Sorry for getting back this late. Had been busy these months.

It's because the infinite while loop didn't have a breaking condition grow_diag(). This seems to fix the issue:

    def grow_diag():
        """
        Search for the neighbor points and them to the intersected alignment
        points if criteria are met.
        """
        prev_len = len(alignment) - 1
        # iterate until no new points added
        while prev_len < len(alignment):
            no_new_points = True
            # for english word e = 0 ... en
            for e in range(srclen):
                # for foreign word f = 0 ... fn
                for f in range(trglen): 
                    # if ( e aligned with f)
                    if (e,f) in alignment:
                        # for each neighboring point (e-new, f-new)
                        for neighbor in neighbors:
                            neighbor = tuple(i+j for i,j in zip((e,f),neighbor))
                            e_new, f_new = neighbor
                            # if ( ( e-new not aligned and f-new not aligned) 
                            # and (e-new, f-new in union(e2f, f2e) )
                            if (e_new not in aligned and f_new not in aligned)\
                            and neighbor in union:
                                alignment.add(neighbor)
                                aligned['e'].add(e_new); aligned['f'].add(f_new)
                                prev_len+=1
                                no_new_points = False
            # iterate until no new points added
            if no_new_points:
                break

Also, I agree that the spaghetti function is weird but it stays close to the pseudo-code (perl-like syntax)

alvations added a commit that referenced this issue Dec 13, 2017
- Fixing #1829
- Also removed unused import.
@alvations alvations modified the milestone: 3.2.6 Dec 13, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants