Word alignment symmetrization is broken #1829
Comments
Thanks @gtoffoli for reporting this! Do you have example sentences with word alignments that result in the infinite loop? I'll create a unit test with these sentences to test the gdfa function.
I'm attaching the test_gdfa.py module and 4 data files:
They all must be put in the same directory. I used as a small bi-lingual corpus some segments extracted from an Italian web site (source.txt) and the corresponding translations in English (target.txt). All data files include the same number of lines, except that:
I must correct my previous statement: without my patches, grow_diag_final_and starts looping indefinitely from the start (first couple of asymmetric alignments); only after a partial fix, I was able to symmetrize the first, very simple, alignments,
Thanks @gtoffoli for the details! I'll look into it over the weekend.
@gtoffoli Sorry for getting back to you this late; I have been busy these past months. The function looped forever because the while loop didn't have a breaking condition:

    def grow_diag():
        """
        Search for the neighbor points and add them to the intersected
        alignment points if criteria are met.
        """
        prev_len = len(alignment) - 1
        # iterate until no new points added
        while prev_len < len(alignment):
            no_new_points = True
            # for english word e = 0 ... en
            for e in range(srclen):
                # for foreign word f = 0 ... fn
                for f in range(trglen):
                    # if ( e aligned with f)
                    if (e, f) in alignment:
                        # for each neighboring point (e-new, f-new)
                        for neighbor in neighbors:
                            neighbor = tuple(i + j for i, j in zip((e, f), neighbor))
                            e_new, f_new = neighbor
                            # if ( ( e-new not aligned and f-new not aligned)
                            # and (e-new, f-new in union(e2f, f2e) )
                            if (e_new not in aligned and f_new not in aligned) \
                                    and neighbor in union:
                                alignment.add(neighbor)
                                aligned['e'].add(e_new)
                                aligned['f'].add(f_new)
                                prev_len += 1
                                no_new_points = False
            # iterate until no new points added
            if no_new_points:
                break

Also, I agree that the spaghetti function is weird, but it stays close to the pseudo-code (Perl-like syntax).
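For readers who want to try the termination logic outside NLTK, here is a minimal standalone sketch of the grow-diag step. It is not the code in nltk.translate.gdfa: the function signature, the `added` flag, and the two sets `aligned_e` and `aligned_f` are assumptions made for illustration. In particular, the snippet above tests membership against `aligned` itself while also indexing it by 'e' and 'f'; the sketch keeps one set per side instead. It follows the "and" condition stated in the comments of the snippet.

```python
def grow_diag(e2f, f2e, srclen, trglen):
    """Grow the intersection of two directional alignments (illustrative sketch).

    e2f and f2e are iterables of (e, f) index pairs; the result is a set of pairs.
    """
    # The eight neighboring offsets (horizontal, vertical, diagonal).
    neighbors = [(-1, 0), (0, -1), (1, 0), (0, 1),
                 (-1, -1), (-1, 1), (1, -1), (1, 1)]
    union = set(e2f) | set(f2e)
    alignment = set(e2f) & set(f2e)
    # Assumption: track aligned positions in one set per side.
    aligned_e = {e for e, _ in alignment}
    aligned_f = {f for _, f in alignment}

    added = True
    # Each pass either adds a point from the finite union or ends the loop.
    while added:
        added = False
        for e in range(srclen):
            for f in range(trglen):
                if (e, f) not in alignment:
                    continue
                for de, df in neighbors:
                    e_new, f_new = e + de, f + df
                    if (e_new not in aligned_e
                            and f_new not in aligned_f
                            and (e_new, f_new) in union):
                        alignment.add((e_new, f_new))
                        aligned_e.add(e_new)
                        aligned_f.add(f_new)
                        added = True
    return alignment


grown = grow_diag({(0, 0), (1, 1)}, {(0, 0)}, srclen=2, trglen=2)
unchanged = grow_diag({(0, 0), (0, 1)}, {(0, 0)}, srclen=2, trglen=2)
```

In the first call the diagonal neighbor (1, 1) is in the union and both of its positions are unaligned, so it is added. In the second call (0, 1) shares an already aligned source position, so nothing is added and the loop exits after one pass.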
- Fixing #1829 - Also removed unused import.
The module nltk.translate.gdfa, which includes the grow_diag_final_and function and a couple of nested functions, is broken.
It has a few serious bugs; in particular, the main loop of the grow_diag internal function runs forever except in some trivial cases. Possibly this symmetrization algorithm has been ported to NLTK from another language and hasn't been tested.
Although I didn't analyze the algorithm implementation in depth, I think that I have fixed the bugs preventing it from working. I've put the patches in the attached diff file.
gdfa.py.txt