Extending data model for parallel corpus alignments #234
Comments
The Django join field options are |
Got it. Then |
Some notes on how to implement this: In the table view:
Displaying alignments:
Data model:
|
I'm working on this now, and my thinking is to leave the token annotation view alone (it is used in too many places) and modify the sentence view and the example view to show the alignments. One option is a new grid in the sentence view that replicates the PTokenAnnotation view, with extra columns for aligned language, aligned sentence ID, and aligned example ID. The example view could get a new grid (or maybe reuse the existing grid) with extra records for the aligned example ID across languages. I'm proposing a new model class (data table) with the following columns:
The joins to the other tables would be on the example ID, which seems more granular than the sentence ID. The table could also be queried on the sentence ID (for the sentence view). Need to check the following:
|
An example is a particular adposition token. In general we don't have alignments at the adposition token level (we have them for a few chapters of Chinese-English). But I think we need to first implement sentence-level alignments. And note that these are not always 1-to-1. In terms of the display, I would start with the simplest possible addition—maybe on the sentence page, a list of links to translation sentences. We can improve the UI later. (Also see #242 about making the current views more legible) |
OK, understood. I was thinking of combining the adposition and sentence into a single alignment view, as adposition is the most granular, but it looks like the sentence alignments have value by themselves. In this case, I could try with the single datamodel mentioned above for plain sentence alignment, which could be adapted for adposition alignment when they are available. |
OK so: ParallelSentAlignment model
The source and target languages can be inferred from the sentence entries right? Is there an advantage to having a copy of this info with the alignment? |
It looks like a join with the CorpusSentence and Language models will get the language text, so there's no need to have it in the model. I would propose leaving the Source Example ID and Target Example ID columns in the ParallelSentAlignment model and populating them with nulls until they are available. Maybe there is a way in the Django layer to refine the query to get just what is needed for the sentence alignment view; I could take a look at this. This could save some space in the database, since an aligned adposition would probably need to be uniquely identified by the sentence ID. It could also avoid joins with the PTokenAnnotation table, which is a pretty large table. |
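To illustrate the point about inferring the language through a join rather than storing it on the alignment, here is a minimal sketch using Python's built-in sqlite3 module instead of the Django ORM. The table and column names (`language`, `corpus_sentence`, `parallel_sent_alignment`, `sent_id`, and so on) and the sample rows are assumptions made up for this example; the real CorpusSentence and Language models have more fields and may be named differently in the database.

```python
import sqlite3

# Hypothetical, simplified schema: the alignment table stores only the two
# sentence keys, and the language is recovered through joins.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE language (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE corpus_sentence (
    id INTEGER PRIMARY KEY,
    sent_id TEXT NOT NULL,
    language_id INTEGER NOT NULL REFERENCES language(id)
);
CREATE TABLE parallel_sent_alignment (
    id INTEGER PRIMARY KEY,
    source_sentence_id INTEGER NOT NULL REFERENCES corpus_sentence(id),
    target_sentence_id INTEGER NOT NULL REFERENCES corpus_sentence(id)
);
""")

# Made-up sample data: one English sentence aligned to two Chinese sentences.
conn.executemany("INSERT INTO language VALUES (?, ?)",
                 [(1, "English"), (2, "Chinese")])
conn.executemany("INSERT INTO corpus_sentence VALUES (?, ?, ?)",
                 [(10, "en-s1", 1), (20, "zh-s1", 2), (21, "zh-s2", 2)])
conn.executemany(
    "INSERT INTO parallel_sent_alignment (source_sentence_id, target_sentence_id) "
    "VALUES (?, ?)",
    [(10, 20), (10, 21)],
)


def aligned_targets(conn, source_sentence_id):
    """Return (sentence id, language name) for each aligned target sentence."""
    return conn.execute(
        """
        SELECT tgt.sent_id, lang.name
        FROM parallel_sent_alignment AS a
        JOIN corpus_sentence AS tgt ON tgt.id = a.target_sentence_id
        JOIN language AS lang ON lang.id = tgt.language_id
        WHERE a.source_sentence_id = ?
        ORDER BY tgt.id
        """,
        (source_sentence_id,),
    ).fetchall()


print(aligned_targets(conn, 10))
```

Because the language comes from the target sentence row, the alignment table needs no language columns of its own, and the query never touches the token annotation table.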
But a sentence will typically contain multiple adposition tokens, so wouldn't a separate model be needed for those alignments? |
OK, yeah, I thought the sentence and adposition alignments could be combined into one model, then I tried out some scenarios in an Excel sheet (1:1, 1:many, no alignments, etc.) and things quickly became unwieldy. Maybe a separate ParallelAdpositionAlignment model (AdpositionAlignmentID, SourceExampleID, TargetExampleID) works best for the adposition alignments. |
Agreed (though I would call it ParallelPTokenAlignment). |
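The two-model split agreed on above can be sketched with plain dataclasses. This is only an illustration of the record shapes and of the 1:1, 1:many, and no-alignment scenarios, not the actual Django models; the field names and the integer IDs are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelSentAlignment:
    # One row per aligned sentence pair (hypothetical field names).
    source_sentence_id: int
    target_sentence_id: int


@dataclass(frozen=True)
class ParallelPTokenAlignment:
    # One row per aligned adposition token pair (hypothetical field names).
    source_example_id: int
    target_example_id: int


def targets_for_sentence(alignments, source_sentence_id):
    """Collect every target sentence aligned to the given source sentence."""
    return [
        a.target_sentence_id
        for a in alignments
        if a.source_sentence_id == source_sentence_id
    ]


# Made-up rows covering the scenarios from the discussion.
sent_alignments = [
    ParallelSentAlignment(1, 101),  # 1:1
    ParallelSentAlignment(2, 102),  # 1:many, first target
    ParallelSentAlignment(2, 103),  # 1:many, second target
]

# Token-level rows live in their own table, so a sentence with several
# adpositions does not multiply the sentence-level rows.
ptoken_alignments = [
    ParallelPTokenAlignment(5001, 9001),
    ParallelPTokenAlignment(5002, 9002),
]

print(targets_for_sentence(sent_alignments, 2))
```

Storing one row per pair keeps the non-1-to-1 cases simple: a sentence with several translations has several rows, and a sentence with no translation has none.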
For actually creating the alignments in the database, can we assume that the alignments can always be derived from an external conllulex file set for the language pairs? Then I can write a set of scripts to a) read the alignments off the conllulex file (using the existing zh-en template) and b) populate the alignment models in the database using these alignments. (This is what I did for the prototype; I guess I just need to make the scripts generalizable to any language pair.) |
Yes I think it would be a good idea to assume that alignments are integrated into the .conllulex files so they can be uniformly imported. For the sentence level, there is the |
The screenshots look good! But I'm not sure it's necessary to repeat info about the source language if it's displayed at the top of the page. |
OK, I've removed the source columns from the sentence alignment view.
There is a format present in the Chinese conllulex which stores the AlignedTokId (ID of the English token) and AlignedAdposition (English adposition token) in a column. I'll work with this to start and create some scripts for production deployment (assuming that alignments to English are present in the other language's conllulex file), but I guess it can get complicated with language pairs that do not include English. |
Thanks @nitinvwaran. The code update/db migration went smoothly. Next step is to import the alignments. I will wait for the overnight db backup before doing that. |
Running scripts/generate_alignments_from_conllulex.py as described in the release instructions mentioned above: As expected, there were warnings about "No CorpusSentence object for sentence pairs" (missing English ch. 1, 4, 5)
"No Token object" warning for 10 tokens:
|
Looks good! E.g. at http://www.xposition.org/en/the%20little%20prince%20(english)0.9/lpp_1943.1153/ |
In the future we might want to have alignments for e.g. translations of The Little Prince. Some ideas for how to approach this:
CorpusSentence: a many-to-many join on corpus sentences
CorpusSentence: a join on corpus sentences and an additional field indicating how many sentences "after" that aligned sentence should be included for the given sentence
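The second idea (a single join plus a count of following sentences) could be expanded at display time roughly as follows. This is a sketch under stated assumptions: the function name `expand_alignment`, the `extra_after` parameter, and the sentence IDs are hypothetical, and it assumes the target corpus sentences are available in document order.

```python
def expand_alignment(ordered_sentence_ids, first_target_id, extra_after):
    """Return the aligned target sentence plus the next `extra_after` sentences.

    `ordered_sentence_ids` is the target corpus in document order.
    Raises ValueError if `first_target_id` is not in that list.
    """
    start = ordered_sentence_ids.index(first_target_id)
    # Slicing stops at the end of the corpus if the count overshoots.
    return ordered_sentence_ids[start:start + 1 + extra_after]


# Made-up target corpus of four sentences.
target_corpus = ["t1", "t2", "t3", "t4"]

# A source sentence aligned to t2 only.
print(expand_alignment(target_corpus, "t2", 0))
# A source sentence aligned to t2 and the two sentences after it.
print(expand_alignment(target_corpus, "t2", 2))
```

Compared with a many-to-many join, this stores one row per source sentence but can only represent contiguous target spans, whereas the many-to-many table can represent any set of target sentences at the cost of more rows.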