Migrate VitalSource annotations to be associated with book rather than chapter #7709
The format of the new selectors is described at https://github.com/hypothesis/client/blob/7267c198adbf31bcd0bf0065aa376b3a4bf2702e/src/types/api.ts#L75. Only the "url" field is currently marked as required. For PDF/fixed-layout books, there is also a
Some notes on how many VitalSource annotations will need to be migrated and the books and Hypothesis groups they are associated with: https://hypothes-is.slack.com/archives/C4K6M7P5E/p1664901048607019.
Are new VitalSource annotations with the old format still being created? If not, then I wonder about doing a one-off DB migration to migrate all annotations to the new format in the DB. That's what we'd normally do if we wanted to migrate a bunch of data in the DB. You'd then use the admin pages to reindex those annotations. There are already various admin pages in h to reindex all annotations of a user/group/etc. You may be able to use one of those, or you may need to add a new one.

I think it should also be possible to write a "migrate to the new VitalSource format" admin page if you want to do it that way. But this will be the first time we've written a Celery task to do a bulk migration on the DB; those have always been done using DB migrations in the past. The task would also schedule each annotation for reindexing after the annotation has been changed in the DB. And I suppose once you're finished, you'll delete the admin page?

Do we know what volume of annotations we're talking about here?
Number of annotations: 12,656 (2023-01-24 update: 15,360)

`select count(*) from annotation join document_uri on annotation.document_id = document_uri.document_id where document_uri.uri_normalized like 'httpx://jigsaw.vitalsource.com/%';`

Number of groups: 75 (2023-01-24 update: 101)

`select count(distinct(annotation.groupid)) from annotation join document_uri on annotation.document_id = document_uri.document_id where document_uri.uri_normalized like 'httpx://jigsaw.vitalsource.com/%';`

Number of users: 432 (2023-01-24 update: 726)

`select count(distinct(annotation.userid)) from annotation join document_uri on annotation.document_id = document_uri.document_id where document_uri.uri_normalized like 'httpx://jigsaw.vitalsource.com/%';`

Number of URLs (== number of distinct chapters/pages): (2023-01-24 update: 655)

`select count(distinct(uri_normalized)) from document_uri where uri_normalized like 'httpx://jigsaw.vitalsource.com/%';`
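For anyone who wants to try these counting queries without access to the production database, here is a minimal sketch that runs the same joins against an in-memory SQLite database. The table layout is a simplified assumption with only the columns the queries touch, the rows are made up, and the real h database is PostgreSQL, so treat this as an illustration of the query shape only.

```python
import sqlite3

# Same LIKE pattern as in the queries above.
VS_PATTERN = "httpx://jigsaw.vitalsource.com/%"

conn = sqlite3.connect(":memory:")
# Simplified, assumed schema: only the columns used by the queries.
conn.executescript(
    """
    CREATE TABLE annotation (
        id INTEGER PRIMARY KEY, document_id INTEGER, groupid TEXT, userid TEXT
    );
    CREATE TABLE document_uri (document_id INTEGER, uri_normalized TEXT);
    """
)
# Made-up sample rows: two VitalSource chapters and one unrelated page.
conn.executemany(
    "INSERT INTO document_uri VALUES (?, ?)",
    [
        (1, "httpx://jigsaw.vitalsource.com/books/L-999-70049/epub/OPS/loc_002.xhtml"),
        (2, "httpx://jigsaw.vitalsource.com/books/L-999-70049/epub/OPS/loc_003.xhtml"),
        (3, "httpx://example.com/article"),
    ],
)
conn.executemany(
    "INSERT INTO annotation (document_id, groupid, userid) VALUES (?, ?, ?)",
    [
        (1, "group-a", "acct:alice@example.com"),
        (1, "group-a", "acct:bob@example.com"),
        (2, "group-b", "acct:alice@example.com"),
        (3, "group-c", "acct:carol@example.com"),
    ],
)


def vs_count(expression):
    """Count VitalSource annotations, groups, or users.

    `expression` is interpolated into the SQL text, so it must only ever be
    one of the fixed strings used below, never user input.
    """
    query = f"""
        SELECT count({expression})
        FROM annotation
        JOIN document_uri ON annotation.document_id = document_uri.document_id
        WHERE document_uri.uri_normalized LIKE ?
    """
    return conn.execute(query, (VS_PATTERN,)).fetchone()[0]


annotation_count = vs_count("*")
group_count = vs_count("DISTINCT annotation.groupid")
user_count = vs_count("DISTINCT annotation.userid")
```

With the sample rows, the annotation on `example.com` is excluded from all three counts.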
Given that there are only a small number of groups, we could use the existing "Reindex all annotations in a group" facility in the search index management page at http://localhost:5000/admin/search to handle reindexing. It would be more convenient if we modified that form to accept a list of groups (e.g. as a comma-separated list) to reindex.
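As a rough idea of what accepting a comma-separated list could involve on the form-handling side, here is a small sketch. The function name `parse_group_ids` and the sample IDs are hypothetical; this is not the existing admin form code.

```python
def parse_group_ids(raw: str) -> list[str]:
    """Split a comma-separated string of group IDs.

    Surrounding whitespace and empty entries are dropped, and duplicates are
    removed while preserving the order in which IDs were first given.
    """
    seen = set()
    group_ids = []
    for part in raw.split(","):
        group_id = part.strip()
        if group_id and group_id not in seen:
            seen.add(group_id)
            group_ids.append(group_id)
    return group_ids


# Hypothetical form input with stray whitespace, an empty entry and a repeat.
example_ids = parse_group_ids(" abc123, def456,,abc123 ")
```

Each resulting ID could then be passed to whatever the form already does for a single group.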
As noted in the issue description, the migrated annotations should include some data which is not present in the original annotation:
The We could omit these fields and make the client dynamically look up the CFI and title that correspond to the To add this data during the migration, we have a couple of options:
A total of 74 different books have been annotated so far. Query:

`select distinct(substring(uri_normalized, '/books/[0-9A-Z-]+')) from annotation join document_uri on annotation.document_id = document_uri.document_id where document_uri.uri_normalized like 'httpx://jigsaw.vitalsource.com/%';`
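The same book-ID extraction can be done outside the database, for example in a script that processes a list of annotated URLs. This is a minimal sketch using the same character class as the query; the helper name `book_id_from_url` is made up for illustration. Unlike the SQL `substring` call, which returns the whole `/books/...` match, the capture group here returns only the ID.

```python
import re

# Same pattern as the SQL query, with a capture group around the book ID.
BOOK_PATH_RE = re.compile(r"/books/([0-9A-Z-]+)")


def book_id_from_url(url):
    """Return the VitalSource book ID in `url`, or None if there is none."""
    match = BOOK_PATH_RE.search(url)
    return match.group(1) if match else None


urls = [
    "https://jigsaw.vitalsource.com/books/L-999-70049/epub/OPS/loc_002.xhtml",
    "https://jigsaw.vitalsource.com/books/L-999-70049/epub/OPS/loc_003.xhtml",
    "https://example.com/not-a-book",
]
# Distinct book IDs across all URLs, ignoring URLs without a book path.
distinct_books = {
    book_id for book_id in map(book_id_from_url, urls) if book_id is not None
}
```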
In preparation for enabling the `book_as_single_document` feature for everyone, enable capturing the EPUBContentSelector selector whether the feature flag is enabled or not. Once this is released, all new VS annotations will have all the data they will need after they are migrated to the new format [1] and only the annotation URL will need to be changed. This will leave us with only a fixed set of older annotations for which we will need to obtain the missing CFI and chapter title data. [1] See hypothesis/h#7709
I think it might be helpful to do this migration in several stages:
Some notes on step (2) of the migration: For each existing annotated VitalSource URL (example: "https://jigsaw.vitalsource.com/books/L-999-70049/epub/OPS/loc_002.xhtml") we need to:
I'm currently working on a script to gather the data needed for the backfilled EPUBContentSelector selectors. I encountered an issue with PDF-based books, as not all pages have an entry in the table of contents. See https://vitalsource.slack.com/archives/C01208U1A2F/p1671548778110049.
Using the above APIs I got a dump of the TOC and pages data for all the VS books annotated so far. See https://drive.google.com/file/d/16FMKv2VmKDnpZEzdA-3MTc4W22c1pPHB/view?usp=share_link (H internal only). This covers steps 1-3.
I have a first pass of a JSON file containing the data for the updates we'll need to apply: https://gist.github.com/robertknight/96a438e4869930d3e4fc285ca711d989 contains a mapping from the current URL of an annotation, to an object with The JSON output here was generated from an input list of current annotation URLs using this script. This data is not final because there were some URLs in the input list for which I could not find the necessary entries in the VitalSource data, and I need to check some issues relating to the "title" field for some entries. These issues won't affect the structure of the data though.
I have updated the data at https://gist.github.com/robertknight/96a438e4869930d3e4fc285ca711d989 with document titles. When we migrate annotation URLs, we'll need to make sure document entries get created for the new URLs and have at least the titles set. The data now looks like:
There were a small number of annotated PDF page URLs which no longer appear in the page index for the book. I suspect what has happened is that the book has been updated or re-processed since it was originally annotated. We didn't record page numbers or CFIs at the time when these annotations were created, so we can't easily locate the correct page in the book. Fortunately for all new annotations that are created, we are capturing the CFI and page number. Log output from https://github.com/hypothesis/vitalsource-url-migration/blob/main/gen_epub_selectors.py:
The latest version of the data that we'll need for the migration is now at https://github.com/hypothesis/vitalsource-url-migration/blob/main/vs-selectors.json. It has updated URLs and document (book) titles for all books. A small number of chapter/page URLs, mentioned in the previous comment, still had to be skipped.
Looking through a list of all the document titles that were fetched, I see there are some HTML entities and character references (
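One possible way to clean up HTML entities and character references in fetched titles is Python's standard `html.unescape`. The sketch below assumes the titles are plain strings with single-level escaping; the helper name `clean_title` and the sample titles are made up for illustration.

```python
import html


def clean_title(raw_title):
    """Decode HTML entities and character references in a document title."""
    return html.unescape(raw_title).strip()


# Hypothetical titles: one named entity, one numeric character reference.
cleaned = [
    clean_title("Research &amp; Writing"),
    clean_title("Caf&#233; Society"),
]
```

A title that was escaped twice (such as `&amp;amp;`) would need a second pass, so it is worth checking the cleaned output for any remaining `&...;` sequences.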
Add a service that can perform batch migration of annotations from one set of URLs / documents to another. The initial use case is migrating VitalSource ebook annotations from individual chapter URLs to whole-book URLs as part of #7709. The migration reuses the `h.storage.update_annotation` function that was originally used for handling annotation updates via the API, to ensure that annotations and documents are updated in a way that is consistent with how they would be updated if users "moved" the annotations via API calls for each annotation.
Add route at `/admin/documents` for moving annotations from one URL to another. The initial use case is for #7709.
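To illustrate the general shape of a URL-to-URL batch move, here is a toy, in-memory sketch. It is not the h service: it does not call `h.storage.update_annotation`, touch the database, update documents, or schedule reindexing. The `Annotation` class, the mapping layout, the `example.com` target URL and the selector value are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Toy stand-in for an annotation (not the real h model)."""

    id: str
    target_uri: str
    selectors: list = field(default_factory=list)


def migrate_annotations(annotations, url_mappings):
    """Move annotations to new URLs according to `url_mappings`.

    `url_mappings` maps an old URL to a dict with a new "url" and optional
    extra "selectors" (an assumed layout). Annotations whose URL has no
    mapping are left untouched and reported as skipped.
    Returns a `(migrated_ids, skipped_ids)` tuple.
    """
    migrated, skipped = [], []
    for ann in annotations:
        mapping = url_mappings.get(ann.target_uri)
        if mapping is None:
            skipped.append(ann.id)
            continue
        ann.target_uri = mapping["url"]
        # Append the backfilled selectors without mutating the mapping.
        ann.selectors = ann.selectors + list(mapping.get("selectors", []))
        migrated.append(ann.id)
    return migrated, skipped


OLD_URL = "https://jigsaw.vitalsource.com/books/L-999-70049/epub/OPS/loc_002.xhtml"
UNMAPPED_URL = "https://jigsaw.vitalsource.com/books/L-999-70049/pages/1"
mappings = {
    OLD_URL: {
        # Placeholder target, not the real whole-book URL format.
        "url": "https://example.com/books/L-999-70049",
        "selectors": [
            {
                "type": "EPUBContentSelector",
                "url": "/books/L-999-70049/epub/OPS/loc_002.xhtml",
            }
        ],
    }
}
annotations = [
    Annotation(id="a1", target_uri=OLD_URL),
    Annotation(id="a2", target_uri=UNMAPPED_URL),
]
migrated, skipped = migrate_annotations(annotations, mappings)
```

Reporting skipped IDs separately mirrors the situation described in this thread, where a small number of chapter/page URLs could not be mapped.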
The migration has been initiated and is expected to complete in the next 20 minutes or so. Slack thread with operations analysis here: https://hypothes-is.slack.com/archives/C4K6M7P5E/p1674638705104229.
The bulk of the migration is complete. There were a total of 24 out of ~15,400 annotations that could not be migrated. See notes at https://hypothes-is.slack.com/archives/C4K6M7P5E/p1674643433767469?thread_ts=1674638705.104229&cid=C4K6M7P5E.
As part of the "treat VitalSource books as one document" project, we are changing the URL and selectors that the Hypothesis client captures. These changes are currently behind a feature flag. In order to roll this change out to all users, we will need to migrate the existing annotations to use the same URL and selector format.
The current thinking is that this will be done via a task in the h admin panel that can be run multiple times during the transition, with optional filters to control which users or groups are processed on each run.
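A minimal sketch of how optional user and group filters could narrow down what each run processes, assuming annotations are represented as plain dicts with `userid` and `groupid` keys. The function name `select_for_run` and the sample data are hypothetical and not part of h.

```python
def select_for_run(annotations, userids=None, groupids=None):
    """Return the annotations that match the optional filters.

    A filter left as None is not applied, so calling this with no filters
    selects every annotation.
    """
    selected = []
    for ann in annotations:
        if userids is not None and ann["userid"] not in userids:
            continue
        if groupids is not None and ann["groupid"] not in groupids:
            continue
        selected.append(ann)
    return selected


# Made-up sample data.
sample = [
    {"id": "a1", "userid": "acct:alice@example.com", "groupid": "group-a"},
    {"id": "a2", "userid": "acct:bob@example.com", "groupid": "group-a"},
    {"id": "a3", "userid": "acct:alice@example.com", "groupid": "group-b"},
]
first_run = select_for_run(sample, groupids={"group-a"})
second_run = select_for_run(
    sample, userids={"acct:alice@example.com"}, groupids={"group-b"}
)
```

Running with a narrow filter first (for example one test group) and then widening it is one way to use such a task repeatedly during the transition.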
The existing annotations have data that looks like this:
EPUB ("reflowable") book example:
PDF ("fixed layout") book example:
The migrated annotations will look like this:
Note that some of the information that is needed in the new format is not available in the existing data. We will either need to make everything work without it, or look the information up via requests to the VitalSource metadata API.