Extending data model for parallel corpus alignments #234

Closed
lgessler opened this issue Oct 13, 2021 · 20 comments

Comments

@lgessler
Contributor

In the future we might want to have alignments for e.g. translations of The Little Prince. Some ideas for how to approach this:

  • Give CorpusSentence a many-to-many join on corpus sentences
  • Give CorpusSentence a join on corpus sentences and an additional field indicating how many sentences "after" that aligned sentence should be included for the given sentence
  • Do we need a one-to-many join? Look at Django ORM docs
@lgessler
Contributor Author

lgessler commented Oct 13, 2021

The Django relationship field options are ManyToMany, OneToOne and ForeignKey. OneToOne is obviously not what we need here. ForeignKey would get us a many-to-one (each sentence pointing at a single aligned sentence), but that would not be sufficient either, because a foreign key column can hold only one value per row. So ManyToMany is the only remaining option among the choices the Django ORM gives us. Reference: https://docs.djangoproject.com/en/3.2/ref/models/fields/#module-django.db.models.fields.related
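To make the single-column limitation concrete, here is a minimal sketch of the kind of junction table a many-to-many relationship relies on, written against plain `sqlite3` rather than the Django ORM. The table names, column names, and sentence IDs are all hypothetical and are not taken from the project's schema.

```python
import sqlite3

# Hypothetical, simplified schema: names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sentence (
        id INTEGER PRIMARY KEY,
        sent_id TEXT UNIQUE
    );
    -- One row per aligned pair, so a sentence can take part in any
    -- number of alignments without a multi-valued column.
    CREATE TABLE sentence_alignment (
        id INTEGER PRIMARY KEY,
        source_id INTEGER NOT NULL REFERENCES sentence(id),
        target_id INTEGER NOT NULL REFERENCES sentence(id),
        UNIQUE (source_id, target_id)
    );
""")


def add_sentence(sent_id):
    """Insert a sentence and return its primary key."""
    cur = conn.execute("INSERT INTO sentence (sent_id) VALUES (?)", (sent_id,))
    return cur.lastrowid


def align(source_pk, target_pk):
    """Record that source_pk is aligned to target_pk."""
    conn.execute(
        "INSERT INTO sentence_alignment (source_id, target_id) VALUES (?, ?)",
        (source_pk, target_pk),
    )


def aligned_targets(source_pk):
    """Return the sent_ids of all sentences aligned to source_pk."""
    rows = conn.execute(
        "SELECT s.sent_id FROM sentence_alignment a "
        "JOIN sentence s ON s.id = a.target_id "
        "WHERE a.source_id = ? ORDER BY s.id",
        (source_pk,),
    )
    return [row[0] for row in rows]


# Made-up IDs: one source sentence aligned to two target sentences.
zh = add_sentence("zh-1")
en_a = add_sentence("en-1")
en_b = add_sentence("en-2")
align(zh, en_a)
align(zh, en_b)
```

With this layout, `aligned_targets(zh)` returns both target sentences, which a single foreign key column on the sentence row could not express.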

@nschneid
Contributor

Got it. Then ManyToMany is the most natural, and we could probably get it to work. Or we could hack around it with ForeignKey + number of subsequent aligned sentences.
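For comparison, the "ForeignKey + number of subsequent aligned sentences" workaround could look roughly like the sketch below. It is plain Python, not ORM code, and the field names `aligned_first` and `n_following` are hypothetical. It assumes the target corpus is an ordered list, so the aligned sentences must form a contiguous run.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SentenceRow:
    sent_id: str
    # Position of the first aligned sentence in the target corpus (hypothetical field).
    aligned_first: Optional[int] = None
    # How many sentences after the first one are also aligned (hypothetical field).
    n_following: int = 0


def expand_alignment(row: SentenceRow, target_corpus: List[str]) -> List[str]:
    """Return the contiguous run of target sentence IDs aligned to `row`."""
    if row.aligned_first is None:
        return []
    end = row.aligned_first + 1 + row.n_following
    return target_corpus[row.aligned_first:end]


# Made-up IDs: the second source sentence aligns to two consecutive targets.
english = ["en-1", "en-2", "en-3", "en-4"]
row = SentenceRow("zh-2", aligned_first=1, n_following=1)
aligned = expand_alignment(row, english)
```

The trade-off is visible in the slice: this encoding cannot represent an alignment to non-adjacent sentences, which the many-to-many option handles without special cases.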

@lgessler
Contributor Author

Some notes on how to implement this:

In the table view:

  • Add a column for language
  • Indicate somehow (prob. another column) whether it has sentence alignments in another language
  • But probably not whether a token has an alignment (an adposition in one language might not have an adpositional equivalent in a translation)

Displaying alignments:

  • Under the "pretty" view of a sentence in the sentence view (example) or token annotation view (example), show aligned sentence(s) in the same style
  • Indicate somehow the alignment of annotated tokens (maybe highlight on hover in both sentences) on both pages

Data model:

  • For a given source sentence, we need to know all its aligned sentences
  • For a given token, we need to know its aligned token (if it exists) and the sentence that aligned token occurs in
  • We need to have fields for both because a translation might not have an adposition
  • Also: should the "is parallel" flag get replaced by a computed property? (check the join, and if it exists, it's parallel)
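The computed-property idea in the last bullet can be sketched as follows. This is a plain-Python stand-in, not the project's model: the class and the `aligned_ids` field are hypothetical, and in Django the same check would presumably go through the relation (for example with `.exists()`) instead of a list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sentence:
    sent_id: str
    # IDs of aligned sentences; stands in for the join (hypothetical field).
    aligned_ids: List[str] = field(default_factory=list)

    @property
    def is_parallel(self) -> bool:
        """Derived from the alignments instead of being stored as a flag."""
        return len(self.aligned_ids) > 0


with_alignment = Sentence("lpp_1943_zh-280", ["lpp_1943.280"])
without_alignment = Sentence("lpp_1943_zh-1")
```

Deriving the flag this way means it cannot drift out of sync with the alignment data, at the cost of one extra lookup whenever it is read.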

@nitinvwaran
Contributor

nitinvwaran commented Aug 9, 2022

I'm working on this now, and my thinking is to leave the token annotation view alone (used in too many places), and modify the sentence view and the example view to show the alignments.

Maybe a new grid in the sentence view that replicates the PTokenAnnotation view, with extra columns for aligned language, aligned sentence ID, and aligned example ID. And a new grid in the example view (or maybe the existing grid) with extra records for the aligned example ID across languages.

Proposing a new model class (data table), with the following columns:

  1. Alignment ID (just an Identity column)
  2. Source Sentence ID
  3. Source example ID
  4. Source language
  5. Target Sentence ID
  6. Target example ID
  7. Target language

The joins to the other tables would be on the example ID, which seems more granular than the sentence ID. The table could also be queried on the sentence ID (for the sentence view).

Need to check the following:

  1. How to align the example IDs
  2. Check whether MWE adpositions that map to a single adposition, or vice versa, still fall under a single example ID each.

@nschneid
Contributor

nschneid commented Aug 10, 2022

An example is a particular adposition token. In general we don't have alignments at the adposition token level (we have them for a few chapters of Chinese-English). But I think we need to first implement sentence-level alignments. And note that these are not always 1-to-1.

In terms of the display, I would start with the simplest possible addition—maybe on the sentence page, a list of links to translation sentences. We can improve the UI later. (Also see #242 about making the current views more legible)

@nitinvwaran
Contributor

OK, understood. I was thinking of combining the adposition and sentence into a single alignment view, as adposition is the most granular, but it looks like the sentence alignments have value by themselves.

In this case, I could start with the single data model mentioned above for plain sentence alignment, which could be adapted for adposition alignments when they are available.

@nschneid
Contributor

OK so:

ParallelSentAlignment model

  • Alignment ID (just an Identity column)
  • Source Sentence ID
  • Source language
  • Target Sentence ID
  • Target language

The source and target languages can be inferred from the sentence entries right? Is there an advantage to having a copy of this info with the alignment?

@nitinvwaran
Contributor

nitinvwaran commented Aug 10, 2022

Looks like a join with the CorpusSentence and Language models will get the language text. So there's no need to have it in the model.
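As a rough illustration of recovering the languages through a join instead of storing them on the alignment row, here is a sketch in plain `sqlite3`. The table and column names are assumptions loosely modeled on the names in this thread (CorpusSentence, Language, ParallelSentAlignment), not the actual schema, and the language names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified tables; the real models have more fields.
    CREATE TABLE language (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE corpus_sentence (
        id INTEGER PRIMARY KEY,
        sent_id TEXT,
        language_id INTEGER REFERENCES language(id)
    );
    CREATE TABLE parallel_sent_alignment (
        id INTEGER PRIMARY KEY,
        source_sentence_id INTEGER REFERENCES corpus_sentence(id),
        target_sentence_id INTEGER REFERENCES corpus_sentence(id)
    );
    INSERT INTO language VALUES (1, 'Chinese'), (2, 'English');
    INSERT INTO corpus_sentence VALUES
        (1, 'lpp_1943_zh-280', 1),
        (2, 'lpp_1943.280', 2);
    INSERT INTO parallel_sent_alignment VALUES (1, 1, 2);
""")

# Join each side of the alignment to its sentence, then to its language.
QUERY = """
    SELECT src.sent_id, src_lang.name, tgt.sent_id, tgt_lang.name
    FROM parallel_sent_alignment a
    JOIN corpus_sentence src ON src.id = a.source_sentence_id
    JOIN language src_lang ON src_lang.id = src.language_id
    JOIN corpus_sentence tgt ON tgt.id = a.target_sentence_id
    JOIN language tgt_lang ON tgt_lang.id = tgt.language_id
"""
rows = conn.execute(QUERY).fetchall()
```

Each result row carries the source and target language names even though the alignment table stores only the two sentence keys.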

I would propose leaving the Source Example ID and Target Example ID columns in the ParallelSentAlignment model and populating them with nulls until they are available. Maybe there is a way in the Django layer to refine the query to get just what is needed for the sentence alignment view; I could take a look at this. This could save some space in the database, since an aligned adposition would probably need to be uniquely identified by the sentence ID. It could also avoid joins with the PTokenAnnotation table, which is a pretty large table.

@nschneid
Contributor

But a sentence will typically contain multiple adposition tokens, so wouldn't a separate model be needed for those alignments?

@nitinvwaran
Contributor

OK, yeah, I thought the sentence and adposition alignments could be combined into one model, then I tried out some scenarios on an Excel sheet (1:1, 1:many, no alignments, etc.) and things quickly became unwieldy. Maybe a separate ParallelAdpositionAlignment model (AdpositionAlignmentID, SourceExampleID, TargetExampleID) works best for the adposition alignments.

@nschneid
Contributor

Agreed (though I would call it ParallelPTokenAlignment).

@nitinvwaran
Contributor

nitinvwaran commented Aug 22, 2022

I created a prototype of the Sentence and PToken alignment grids, in the Corpus Sentence View and PToken View, using the existing Zh-En alignments.

The sentence alignment page looks like this; I added a grid at the bottom to show the aligned sentences across languages. For 1-to-many alignments, all the aligned sentences for the record will appear.
[Screenshot: sentenceview_2]

The PToken alignments look like this. The new grid is a replica of the PTokenAnnotation table with the info for all target tokens. Tokens across multiple languages can be added to the DataModel:

[Screenshot: ptokenview_3]

Let me know if you have any feedback. I still have to make a new script to generate and load these alignments for languages into the database via the admin console.

@nitinvwaran
Contributor

For actually creating the alignments in the database, can we assume that the alignments can always be derived from an external conllulex file set for the language pairs? Then I can write a set of scripts to a) read the alignments off the conllulex file (using the existing zh-en template) and b) populate the alignment models in the database using these alignments.

(This is what I did for the prototype; I guess I just need to make the scripts general enough for any language pair.)

@nschneid
Contributor

Yes I think it would be a good idea to assume that alignments are integrated into the .conllulex files so they can be uniformly imported. For the sentence level, there is the en_sent_id metadata field. For token-level alignments we will need to figure out a format but let's not worry about that now.
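A minimal sketch of reading sentence-level alignments from such metadata is shown below. It assumes CoNLL-U style comment lines of the form `# key = value`, a `sent_id` field next to `en_sent_id`, and whitespace-separated IDs for 1-to-many alignments. These are assumptions about the file layout, not a description of the actual import script, and the function name is hypothetical.

```python
def read_sentence_alignments(lines, key="en_sent_id"):
    """Collect (sent_id, aligned_sent_id) pairs from metadata comment lines.

    Sentences are assumed to be separated by blank lines.
    """
    pairs = []
    meta = {}

    def flush():
        # Emit one pair per aligned ID, then reset for the next sentence.
        if "sent_id" in meta and key in meta:
            for target in meta[key].split():
                pairs.append((meta["sent_id"], target))
        meta.clear()

    for raw in lines:
        line = raw.strip()
        if not line:
            flush()
        elif line.startswith("#") and "=" in line:
            name, value = line[1:].split("=", 1)
            meta[name.strip()] = value.strip()
    flush()  # handle a final sentence with no trailing blank line
    return pairs


# Toy input: the token lines are placeholders, not real annotations.
sample = """\
# sent_id = lpp_1943_zh-280
# en_sent_id = lpp_1943.280
1\ttoken

# sent_id = lpp_1943_zh-281
1\ttoken
"""
pairs = read_sentence_alignments(sample.splitlines())
```

Sentences without the alignment field are skipped, so an importer built this way would not create alignment rows for untranslated material.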

@nschneid
Contributor

The screenshots look good! But I'm not sure it's necessary to repeat info about the source language if it's displayed at the top of the page.

@nitinvwaran
Contributor

OK, I've removed the source columns from the sentence alignment view.

> For token-level alignments we will need to figure out a format but let's not worry about that now.

There is a format present in the Chinese conllulex which stores the AlignedTokId (ID of the English token) and AlignedAdposition (English adposition token) in a column.

I'll work with this to start and create some scripts for production deployment (assuming that alignments to English are present in the other language's conllulex file), but I guess it can get complicated with language pairs not including English.

@nitinvwaran
Contributor

Created a PR #244. Release instructions for server deployment can be found here.

I tested the release instructions on my local copy with a production database backup.

@nschneid
Contributor

nschneid commented Nov 29, 2022

Thanks @nitinvwaran. The code update/db migration went smoothly. Next step is to import the alignments. I will wait for the overnight db backup before doing that.

@nschneid
Contributor

nschneid commented Dec 4, 2022

Running scripts/generate_alignments_from_conllulex.py as described in the release instructions mentioned above:

As expected, warnings about "No CorpusSentence object for sentence pairs" (missing English ch. 1, 4, 5)

  • lpp_1943_zh-1 through lpp_1943_zh-35
  • lpp_1943_zh-146 through lpp_1943_zh-262

"No Token object" warning for 10 tokens:

No Token object ID: 1 	 16 for sentence pairs lpp_1943_zh-280 	 lpp_1943.280
No Token object ID: 17 	 29 for sentence pairs lpp_1943_zh-325 	 lpp_1943.325
No Token object ID: 19 	 29 for sentence pairs lpp_1943_zh-325 	 lpp_1943.325
No Token object ID: 3 	 6 for sentence pairs lpp_1943_zh-426 	 lpp_1943.426
No Token object ID: 11 	 10 for sentence pairs lpp_1943_zh-426 	 lpp_1943.426
No Token object ID: 2 	 3 4 for sentence pairs lpp_1943_zh-550 	 lpp_1943.550
No Token object ID: 1 	 10 for sentence pairs lpp_1943_zh-590 	 lpp_1943.590
No Token object ID: 3 	 8 for sentence pairs lpp_1943_zh-1000 	 lpp_1943.1000
No Token object ID: 9 	 7 for sentence pairs lpp_1943_zh-1186 	 lpp_1943.1186
No Token object ID: 9 	 10 11 for sentence pairs lpp_1943_zh-1192 	 lpp_1943.1192

@nschneid
Contributor

nschneid commented Dec 4, 2022
