Extending data model for parallel corpus alignments #234

lgessler · 2021-10-13T19:31:30Z

In the future we might want to have alignments for e.g. translations of The Little Prince. Some ideas for how to approach this:

Give CorpusSentence a many-to-many join on corpus sentences
Give CorpusSentence a join on corpus sentences and an additional field indicating how many sentences "after" that aligned sentence should be included for the given sentence
Do we need a one-to-many join? Look at Django ORM docs

lgessler · 2021-10-13T19:41:54Z

The Django relation field options are ManyToMany, OneToOne and ForeignKey. OneToOne is obviously not what we need here. ForeignKey would get us a one-to-many, but that would not be sufficient either: the foreign key lives in a single column on the sentence row, and a SQL column cannot hold multiple values, so a sentence could point to at most one aligned sentence. So ManyToMany is the only remaining option among the choices the Django ORM gives us. Reference: https://docs.djangoproject.com/en/3.2/ref/models/fields/#module-django.db.models.fields.related
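
For illustration, here is a rough sketch of what a many-to-many relation comes down to at the SQL level: a separate join table with one row per aligned pair, instead of a column on the sentence row. It uses plain `sqlite3` rather than the Django ORM, and the table names, column names and sentence IDs (`sentence`, `sentence_alignment`, `en-1`, ...) are made up for the example; they are not the actual Xposition schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sentence (
        id INTEGER PRIMARY KEY,
        sent_id TEXT UNIQUE NOT NULL
    );
    -- Join table: one row per aligned pair, so a sentence can have many partners.
    CREATE TABLE sentence_alignment (
        source_id INTEGER NOT NULL REFERENCES sentence(id),
        target_id INTEGER NOT NULL REFERENCES sentence(id),
        PRIMARY KEY (source_id, target_id)
    );
""")

# Hypothetical data: one English sentence aligned to two Chinese sentences.
conn.executemany(
    "INSERT INTO sentence (id, sent_id) VALUES (?, ?)",
    [(1, "en-1"), (2, "zh-1"), (3, "zh-2")],
)
conn.executemany(
    "INSERT INTO sentence_alignment (source_id, target_id) VALUES (?, ?)",
    [(1, 2), (1, 3)],
)


def aligned_sent_ids(conn, sent_id):
    """Return the sent_ids of all sentences aligned to the given source sentence."""
    rows = conn.execute(
        """
        SELECT t.sent_id
        FROM sentence s
        JOIN sentence_alignment a ON a.source_id = s.id
        JOIN sentence t ON t.id = a.target_id
        WHERE s.sent_id = ?
        ORDER BY t.id
        """,
        (sent_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

With the sample rows above, `aligned_sent_ids(conn, "en-1")` returns both Chinese sentences, which is exactly what a single foreign key column could not express.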

nschneid · 2021-10-13T22:24:28Z

Got it. Then ManyToMany is the most natural, and we could probably get it to work. Or we could hack around it with ForeignKey + number of subsequent aligned sentences.
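
A minimal sketch of the "ForeignKey + number of subsequent aligned sentences" workaround, to show what it can and cannot represent. The class and field names (`SentenceRow`, `aligned_start`, `extra_following`) are hypothetical, and sentences are identified by their position in the target corpus purely for the sake of the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SentenceRow:
    position: int                        # order of this sentence in its own corpus
    aligned_start: Optional[int] = None  # position of the first aligned sentence (the "ForeignKey")
    extra_following: int = 0             # how many sentences after it are also aligned


def aligned_positions(row: SentenceRow) -> List[int]:
    """Expand (aligned_start, extra_following) into the full list of aligned positions."""
    if row.aligned_start is None:
        return []
    return list(range(row.aligned_start, row.aligned_start + row.extra_following + 1))
```

For example, `aligned_positions(SentenceRow(position=5, aligned_start=7, extra_following=2))` gives `[7, 8, 9]`. The encoding can only describe a contiguous run of target sentences, which is one reason a real many-to-many relation is the more natural fit.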

lgessler · 2021-10-20T19:11:27Z

Some notes on how to implement this:

In the table view:

Add a column for language
Indicate somehow (prob. another column) whether it has sentence alignments in another language
But probably not whether a token has an alignment (an adposition in one language might not have an adpositional equivalent in a translation)

Displaying alignments:

Under the "pretty" view of a sentence in the sentence view (example) or token annotation view (example), show aligned sentence(s) in the same style
Indicate somehow the alignment of annotated tokens (maybe highlight on hover in both sentences) on both pages

Data model:

For a given source sentence, we need to know all its aligned sentences
For a given token, we need to know its aligned token (if it exists) and the sentence that aligned token occurs in
We need to have fields for both because a translation might not have an adposition
Also: should the "is parallel" flag get replaced by a computed property? (check the join, and if it exists, it's parallel)
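
The computed-property idea could look roughly like this. It is a plain-Python sketch, not Django code: the `Sentence` class and its `aligned` list are stand-ins for whatever the real model and relation end up being.

```python
from typing import List, Optional


class Sentence:
    """Stand-in for a sentence record with a list of aligned sentences."""

    def __init__(self, sent_id: str, aligned: Optional[List["Sentence"]] = None):
        self.sent_id = sent_id
        self.aligned = aligned if aligned is not None else []

    @property
    def is_parallel(self) -> bool:
        # Derived from the relation instead of stored, so it cannot go stale.
        return bool(self.aligned)
```

The trade-off is that the flag is always consistent with the alignment data, at the cost of a lookup on the relation each time it is read.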

nitinvwaran · 2022-08-09T16:27:10Z

I'm working on this now, and my thinking is to leave the token annotation view alone (used in too many places), and modify the sentence view and the example view to show the alignments.

Maybe a new grid in the sentence view that replicates the PTokenAnnotation view, with extra columns for aligned language, aligned sentence ID, and aligned example ID. And a new grid in the example view (or maybe the existing grid) with extra records for the aligned example ID across languages.

Proposing a new model class (data table), with the following columns:

Alignment ID (just an Identity column)
Source Sentence ID
Source example ID
Source language
Target Sentence ID
Target example ID
Target language

The joins to the other tables would be on the example ID, which seems more granular than the sentence ID. The table could also be queried on the sentence ID (for the sentence view).

Need to check the following:

How to align the example IDs
Check whether MWE adpositions that map to a single adposition (or vice versa) still fall under a single example ID each.

nschneid · 2022-08-10T03:28:14Z

An example is a particular adposition token. In general we don't have alignments at the adposition token level (we have them for a few chapters of Chinese-English). But I think we need to first implement sentence-level alignments. And note that these are not always 1-to-1.

In terms of the display, I would start with the simplest possible addition—maybe on the sentence page, a list of links to translation sentences. We can improve the UI later. (Also see #242 about making the current views more legible)

nitinvwaran · 2022-08-10T14:25:22Z

OK, understood. I was thinking of combining the adposition and sentence into a single alignment view, as adposition is the most granular, but it looks like the sentence alignments have value by themselves.

In this case, I could start with the single data model mentioned above for plain sentence alignment, which could be adapted for adposition alignments when they are available.

nschneid · 2022-08-10T14:37:38Z

OK so:

ParallelSentAlignment model

Alignment ID (just an Identity column)
Source Sentence ID
Source language
Target Sentence ID
Target language

The source and target languages can be inferred from the sentence entries right? Is there an advantage to having a copy of this info with the alignment?

nitinvwaran · 2022-08-10T15:07:05Z

It looks like a join with the CorpusSentence and Language models will get the language text, so there's no need to have it in the model.

I would propose leaving the Source Example ID and Target Example ID columns in the ParallelSentAlignment model and populating them with nulls until the values are available. Maybe there is a way in the Django layer to refine the query to get just what is needed for the sentence alignment view; I could take a look at this. This could save some space in the database, since an aligned adposition would probably need to be uniquely identified by the sentence ID. It could also avoid joins with the PTokenAnnotation table, which is a pretty large table.
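
A small sketch of that join, using `sqlite3` in place of the Django ORM. The alignment table stores only the two sentence references, and the language names are recovered through the sentence rows. Table and column names here are guesses loosely modeled on the names in this thread, and the sentence IDs are invented; none of it is the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE language (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE corpus_sentence (
        id INTEGER PRIMARY KEY,
        sent_id TEXT NOT NULL,
        language_id INTEGER NOT NULL REFERENCES language(id)
    );
    -- No language columns here: they would only duplicate corpus_sentence.language_id.
    CREATE TABLE parallel_sent_alignment (
        id INTEGER PRIMARY KEY,
        source_sentence_id INTEGER NOT NULL REFERENCES corpus_sentence(id),
        target_sentence_id INTEGER NOT NULL REFERENCES corpus_sentence(id)
    );
    INSERT INTO language VALUES (1, 'English'), (2, 'Chinese');
    INSERT INTO corpus_sentence VALUES (1, 'zh-1', 2), (2, 'en-1', 1);
    INSERT INTO parallel_sent_alignment VALUES (1, 1, 2);
""")


def alignments_with_languages(conn):
    """Return (source sent_id, source language, target sent_id, target language) rows."""
    return conn.execute(
        """
        SELECT s.sent_id, sl.name, t.sent_id, tl.name
        FROM parallel_sent_alignment a
        JOIN corpus_sentence s ON s.id = a.source_sentence_id
        JOIN language sl ON sl.id = s.language_id
        JOIN corpus_sentence t ON t.id = a.target_sentence_id
        JOIN language tl ON tl.id = t.language_id
        ORDER BY a.id
        """
    ).fetchall()
```

Keeping the language out of the alignment table means there is a single source of truth for it, so the two copies can never disagree.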

nschneid · 2022-08-10T15:32:14Z

But a sentence will typically contain multiple adposition tokens, so wouldn't a separate model be needed for those alignments?

nitinvwaran · 2022-08-10T16:17:21Z

OK, yeah, I thought the sentence and adposition alignments could be combined into one model, but then I tried out some scenarios in an Excel sheet (1:1, 1:many, no alignments, etc.) and things quickly became unwieldy. Maybe a separate ParallelAdpositionAlignment model (AdpositionAlignmentID, SourceExampleID, TargetExampleID) works best for the adposition alignments.

nschneid · 2022-08-10T16:39:37Z

Agreed (though I would call it ParallelPTokenAlignment).
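
As a rough picture of the two-model split agreed on here, the sketch below uses frozen dataclasses as stand-ins for the Django models. The field names follow the column lists in this thread (minus the identity column), while the sample IDs and the two lookup helpers are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ParallelSentAlignment:
    source_sentence_id: str
    target_sentence_id: str


@dataclass(frozen=True)
class ParallelPTokenAlignment:
    source_example_id: int
    target_example_id: int


# Hypothetical data: one source sentence aligned to two target sentences,
# and two adposition tokens from that sentence aligned independently.
sent_alignments = [
    ParallelSentAlignment("zh-1", "en-1"),
    ParallelSentAlignment("zh-1", "en-2"),
]
ptoken_alignments = [
    ParallelPTokenAlignment(101, 201),
    ParallelPTokenAlignment(102, 203),
]


def targets_for_sentence(sentence_id: str) -> List[str]:
    """All target sentences aligned to a source sentence (may be more than one)."""
    return [a.target_sentence_id for a in sent_alignments
            if a.source_sentence_id == sentence_id]


def targets_for_example(example_id: int) -> List[int]:
    """All target examples aligned to a source adposition token."""
    return [a.target_example_id for a in ptoken_alignments
            if a.source_example_id == example_id]
```

Because the token-level rows are separate, a sentence with several adposition tokens does not force extra or half-empty rows into the sentence-level table.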

nitinvwaran · 2022-08-22T18:07:27Z

I created a prototype of the Sentence and PToken alignment grids, in the Corpus Sentence View and PToken View, using the existing Zh-En alignments.

The sentence alignment page looks like this; I added a grid at the bottom to show the aligned sentences across languages. For 1-to-many alignments, all the aligned sentences for the record will appear.

The PToken alignments look like this. The new grid is a replica of the PTokenAnnotation table with the info for all target tokens. Tokens across multiple languages can be added to the DataModel:

Let me know if you have any feedback. I still have to write a new script to generate these alignments and load them into the database via the admin console.

nitinvwaran · 2022-08-22T19:34:29Z

For actually creating the alignments in the database, can we assume that the alignments can always be derived from an external conllulex file set for the language pairs? Then I can write a set of scripts to a) read the alignments off the conllulex file (using the existing zh-en template) and b) populate the alignment models in the database using these alignments.

(This is what I did for the prototype; I guess I just need to generalize the scripts to work for any language pair.)

nschneid · 2022-08-22T20:58:01Z

Yes I think it would be a good idea to assume that alignments are integrated into the .conllulex files so they can be uniformly imported. For the sentence level, there is the en_sent_id metadata field. For token-level alignments we will need to figure out a format but let's not worry about that now.
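
A minimal sketch of pulling sentence-level pairs out of such a file, assuming the metadata is written as CoNLL-U style comment lines (`# key = value`) and that each sentence block carries both `sent_id` and `en_sent_id`. The function name `sentence_pairs` and the sample text are illustrative only, and the `en_sent_id` value is treated as an opaque string since the exact format (e.g. for 1-to-many alignments) is not specified here.

```python
import re
from typing import Dict, Iterator, Tuple

# Matches metadata comment lines such as "# sent_id = lpp_1943_zh-280".
META_LINE = re.compile(r"^#\s*(\w+)\s*=\s*(.*)$")


def sentence_pairs(conllulex_text: str) -> Iterator[Tuple[str, str]]:
    """Yield (sent_id, en_sent_id) for every sentence block that has both fields."""
    meta: Dict[str, str] = {}
    # The extra "" guarantees the last block is flushed even without a trailing blank line.
    for line in conllulex_text.splitlines() + [""]:
        line = line.strip()
        if not line:
            if "sent_id" in meta and "en_sent_id" in meta:
                yield meta["sent_id"], meta["en_sent_id"]
            meta = {}
            continue
        match = META_LINE.match(line)
        if match:
            meta[match.group(1)] = match.group(2).strip()
        # Token lines are ignored; only sentence-level metadata matters here.


sample = "\n".join([
    "# sent_id = lpp_1943_zh-280",
    "# en_sent_id = lpp_1943.280",
    "1\ttoken\t_",
    "",
    "# sent_id = lpp_1943_zh-281",   # no en_sent_id, so this block is skipped
    "1\ttoken\t_",
])
```

Blocks without an `en_sent_id` are simply skipped, which keeps unaligned sentences from producing empty alignment rows.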

nschneid · 2022-08-22T21:00:49Z

The screenshots look good! But I'm not sure it's necessary to repeat info about the source language if it's displayed at the top of the page.

nitinvwaran · 2022-08-23T01:47:24Z

OK, I've removed the source columns from the sentence alignment view.

> For token-level alignments we will need to figure out a format but let's not worry about that now.

There is a format present in the Chinese conllulex which stores the AlignedTokId (id of the English token) and AlignedAdposition (english adposition token) in a column.

I'll work with this to start and create some scripts for production deployment (assuming that alignments to English are present in the other language's conllulex file), but I guess it can get complicated with language pairs that do not include English.

nitinvwaran · 2022-08-30T17:23:19Z

Created a PR #244. Release instructions for server deployment can be found here.

I tested the release instructions on my local copy with a production database backup.

nschneid · 2022-11-29T04:31:26Z

Thanks @nitinvwaran. The code update/db migration went smoothly. Next step is to import the alignments. I will wait for the overnight db backup before doing that.

nschneid · 2022-12-04T01:41:52Z

running scripts/generate_alignments_from_conllulex.py as described in the release instructions mentioned above—

As expected, warnings about "No CorpusSentence object for sentence pairs" (missing English ch. 1, 4, 5)

lpp_1943_zh-1 through lpp_1943_zh-35
lpp_1943_zh-146 through lpp_1943_zh-262

"No Token object" warning for 10 tokens:

No Token object ID: 1 	 16 for sentence pairs lpp_1943_zh-280 	 lpp_1943.280
No Token object ID: 17 	 29 for sentence pairs lpp_1943_zh-325 	 lpp_1943.325
No Token object ID: 19 	 29 for sentence pairs lpp_1943_zh-325 	 lpp_1943.325
No Token object ID: 3 	 6 for sentence pairs lpp_1943_zh-426 	 lpp_1943.426
No Token object ID: 11 	 10 for sentence pairs lpp_1943_zh-426 	 lpp_1943.426
No Token object ID: 2 	 3 4 for sentence pairs lpp_1943_zh-550 	 lpp_1943.550
No Token object ID: 1 	 10 for sentence pairs lpp_1943_zh-590 	 lpp_1943.590
No Token object ID: 3 	 8 for sentence pairs lpp_1943_zh-1000 	 lpp_1943.1000
No Token object ID: 9 	 7 for sentence pairs lpp_1943_zh-1186 	 lpp_1943.1186
No Token object ID: 9 	 10 11 for sentence pairs lpp_1943_zh-1192 	 lpp_1943.1192
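
For triaging output like the above, a small helper can count the warnings per sentence pair. This is only a sketch: the regular expression is inferred from the lines shown here rather than from the script itself, and `warnings_per_pair` is a made-up name.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Captures the two sentence IDs at the end of a "No Token object" warning line.
PAIR = re.compile(r"^No Token object ID:.*for sentence pairs\s+(\S+)\s+(\S+)\s*$")


def warnings_per_pair(lines: Iterable[str]) -> Counter:
    """Count 'No Token object' warnings for each (source, target) sentence pair."""
    counts: Counter = Counter()
    for line in lines:
        match = PAIR.match(line.strip())
        if match:
            pair: Tuple[str, str] = (match.group(1), match.group(2))
            counts[pair] += 1
    return counts


log_lines = [
    "No Token object ID: 1 \t 16 for sentence pairs lpp_1943_zh-280 \t lpp_1943.280",
    "No Token object ID: 17 \t 29 for sentence pairs lpp_1943_zh-325 \t lpp_1943.325",
    "No Token object ID: 19 \t 29 for sentence pairs lpp_1943_zh-325 \t lpp_1943.325",
]
counts = warnings_per_pair(log_lines)
```

On the three sample lines, the pair for sentence 325 is counted twice and the pair for sentence 280 once, which makes it easy to see which sentences to inspect first.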

nschneid · 2022-12-04T01:45:28Z

Looks good! E.g. at http://www.xposition.org/en/the%20little%20prince%20(english)0.9/lpp_1943.1153/

This was referenced Aug 24, 2022

Cross-Linguistic alignments for multiple language pairs #243

Open

Parallel corpora #202

Closed

Adding Alignments to Xposition #244

Merged

nschneid added a commit that referenced this issue Dec 4, 2022

requirements.txt: conllu for script that imports crosslingual alignments (#234)

8b880d1

nschneid closed this as completed Dec 4, 2022
