
Conversation

@jkachel (Contributor) commented Jul 16, 2024

What are the relevant tickets?

Closes mitodl/hq#4553

Description (What does it do?)

Makes some changes to the topics code:

  • Adds topics and mappings to offeror data feeds according to the list provided in the issue referenced above
  • Adds support for persisting the mapping data in the database
  • Adds support for specifying the topic icon in the database
  • Updates the ETL pipelines to use a standardized topic mapping function, and to resolve topics via the topic mappings stored in the database

How can this be tested?

Automated tests should pass.

Topic data

You will need to run the migrations in the data_fixtures app to upsert the topic data into the system. Be advised that the migration truncates the topics table with cascade, so it will also wipe out mappings and associations with learning resources; topic channels will be invalid after this as well. You will want to backpopulate everything that carries topics (ocw, mitx, mitxonline, xpro, prolearn, etc.) after running the migration. (For just testing the topic data upsert, though, you don't need to do that right away.)

The topic data is sourced from these places (from the original issue):

We have interpreted the vocabularies and cross-walks into 3 mappings (awaiting a final approval):

  • OCW->PMT Topic Map ("draft3" tab)
    • note that this presumes that we add a topic "archaeology" under "humanities"
  • ProLearn->PMT Topic Map ("draft3" tab)
    • note that this presumes that we add "manufacturing" and "technology innovation" topics
  • edX->PMT Topic Map ("draft3" tab)
  • OpenLearningLibrary (OLL) doesn't have any topic metadata at the moment

Then, check that the data is in there properly and matches the data in the YAML file.
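A quick spot-check from a Django shell might look like this (a sketch; the models module path is an assumption, and "Archaeology" is one of the topics this PR adds):

```python
# Sketch: verify the upserted topic data against topics.yaml.
# Assumes the model lives in learning_resources.models.
from learning_resources.models import LearningResourceTopic

print(LearningResourceTopic.objects.count())  # compare with the YAML entry count
print(LearningResourceTopic.objects.filter(name="Archaeology").exists())
```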

You can test topic upsert using a new topic or updating an existing topic in these ways:

  • Via the data_fixtures app - create a migration that either pulls a file with the data in it, or embeds the update YAML into a string within the migration.
  • Manually - call `learning_resources.utils.upsert_topic_data_string` or `upsert_topic_data_file`.

If you're testing modifying an existing topic, you should specify the database ID for the topic, as in the sketch below. (It will try to match on name, but that won't work if the name is what you're updating.)
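For example, a minimal update delta might look like the following (entry shape taken from the topics.yaml excerpt quoted later in this review; whether entries sit under a wrapper key is an assumption):

```yaml
# Hypothetical delta: the id pins the match so the name can be changed
- icon: RiPaletteLine
  id: 3
  mappings: {}
  name: Environmental Design
```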

ETL Pipelines

Clear the learning_resources_learningresource_topics table (which contains the mappings between learning resources and topics) and run the backpopulate_ commands (sketched below). This should fill the table, and you should be able to single out individual resources and confirm that their topics are mapped to Open ones properly. Not all topics will map successfully; only the ones we have a mapping for will come over. We no longer create new topics when a topic doesn't already exist in the system.
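A sketch of that reset from a Django shell (the through-model access is an assumption based on the table name; only backpopulate_podcast_data is confirmed by name later in this review, and the other commands are assumed to follow the same pattern):

```python
# Sketch: clear resource<->topic mappings, then re-run the backpopulate commands.
from django.core.management import call_command
from learning_resources.models import LearningResource

LearningResource.topics.through.objects.all().delete()  # assumes an M2M named "topics"
call_command("backpopulate_podcast_data")
# ...repeat for the other sources that carry topic data (ocw, mitx, xpro, prolearn, etc.)
```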

In particular, when syncing OCW data, you should not see a bunch of "Skipped adding topic" messages in the logs. (There may be a few but they should be few and far between.) OCW has a lot of topic mappings so this is a good test to make sure the mapping system is working.

Additional Context

Since the topic maps will be different from what they were, items may not be in the places you expect them; check against the topics.yaml file to see where things should be.

The topics.yaml has IDs in it that were created from a truncated and reset table (so they start at 1 and go up). For consistency, the migration also truncates the table. This requires a re-populate of a bunch of stuff (notably topic channels, and any data source that includes topic data) so some care will need to be taken to perform those tasks after this merges for both local deployments and Heroku ones.

@jkachel jkachel added the product:mit-open and Work in Progress labels Jul 16, 2024
@jkachel jkachel self-assigned this Jul 16, 2024
@jkachel jkachel force-pushed the jkachel/4553-update-topics-workflow branch 2 times, most recently from b7adae8 to 194e95f on July 23, 2024 13:18
@jkachel jkachel marked this pull request as ready for review July 23, 2024 13:31
@jkachel jkachel marked this pull request as draft July 23, 2024 13:42
jkachel and others added 18 commits July 23, 2024 11:43
This is the base file and represents the topics that were listed in the A tab in the Google doc. (See issue mitodl/hq#4553 for details.)
…e and added database changes to support persisting the new topic fields and mappings
- Upserting from string is intended to be a helper for migrations, so we can put the updates directly in the migration
- Rewrote test so that it works with the new file format
- Added test for string upserter and topic updating
- Updated LearningResourceOfferorFactory to allow for "see"
For xpro, prolearn, and mitxonline, this reconfigures the ETL pipelines to pull topic data from the mappings in the database rather than a CSV file. This requires some test changes that haven't happened yet.

This also adds a "full name" that generates a topic name with the parent topic in the manner that the code expects. This really should only be used here as it's recursive and can potentially be expensive to trigger.
…for initial topic data load

Tests will still fail - still need to fix up prolearn ETL, probably some other things too.
Fixed some issues with the prolearn ETL, but now have issues elsewhere; re-thinking how the topic transformation works
This is no-verify since it's not at all done
- Use the transform_topics stuff rather than reimplementing it in the tests that use it
- Fix some formatting issues with resolved topics
- Update learning resource offeror factory to include a MITx Online option
Now using transform_topics and doing some manipulation of the topics that get returned
This gets lumped into mitx for this processing - maybe we revisit this in the future?
@jkachel jkachel force-pushed the jkachel/4553-update-topics-workflow branch from 1429e62 to 2a11310 on July 23, 2024 16:45
@jkachel jkachel marked this pull request as ready for review July 23, 2024 17:18
…scade and restarts the identity

Not sure this will stick around but it does simplify the data somewhat (since there's definitely going to be data on prod and RC)
@mbertrand mbertrand self-assigned this Jul 24, 2024
@mbertrand (Member) left a comment:

Overall looks great!

```python
# ... (preceding lines of the script are not shown here)
yaml.dump(topic_data, yaml_output_file)
```

You can open a Django shell or a notebook and paste that in, then run `process_topics()` to process your data file. The result will be written to the specified output location.
Member:

Worth making a management command out of this script?

On a related note, the update_topics management command should probably be deleted, along with some old mapping files under learning_resources/data (edx-topic-mappings.csv, ucc-topic-mappings.csv).

Contributor (author):

I culled the old data files.

I also updated the update_topics command to be a front end to those functions. I didn't do this initially because I didn't want to encourage people to use the management command to update the data (that should be done with a data migration), but after some thought it seemed handy to have for generating the files for migrations.

Use the `data_fixtures` app: create a migration there that uses `RunPython` to call one of the upsert functions (both in `learning_resources/utils.py`):

- `upsert_topic_data_file` - if you've got an external file of changes
- `upsert_topic_data_string` - if you've got a string instead
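For illustration, a minimal sketch of such a migration (the dependency entry and the exact YAML payload are assumptions; the entry shape mirrors the topics.yaml excerpt quoted below):

```python
# Hypothetical data_fixtures migration that upserts a topic delta.
from django.db import migrations

from learning_resources.utils import upsert_topic_data_string

TOPIC_DELTA = """\
- icon: RiPaletteLine
  id: 3
  mappings: {}
  name: Environmental Design
"""  # payload shape assumed from the topics.yaml excerpt in this review


def upsert_topics(apps, schema_editor):
    upsert_topic_data_string(TOPIC_DELTA)


class Migration(migrations.Migration):
    dependencies = [("data_fixtures", "0001_initial")]  # placeholder dependency
    operations = [migrations.RunPython(upsert_topics, migrations.RunPython.noop)]
```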
Member:

Will have to be careful that no breaking changes are made to these functions in the future (or that older migrations using them are updated) to avoid errors when running all data fixture migrations on a fresh database.

```yaml
icon: RiPaletteLine
id: 3
mappings: {}
name: Environmental Deisgn
```
Member:

Typo: Environmental Design

Originating from the CSV file?

```yaml
- Drama
- Fiction
- International Literature
- Nonfiction Prose *Periodic Literature
```
Member:

Is the asterisk supposed to be there?

Contributor (author):

Yes. (@pdpinch this is in the spreadsheet on row 196 - I don't know who to ask about this but it seems a bit out of place.)

```python
dependencies = [
    ("learning_resources", "0057_add_icon_and_mappings_to_topics"),
    ("learning_resources", "0057_learningresource_next_prices"),
]
```
Member:

Since this hasn't been merged yet, might be better to rename 0057_add_icon_and_mappings_to_topics to 0058_add_icon_and_mappings_to_topics and delete this one.

```python
offeror = models.ForeignKey(
    "LearningResourceOfferor",
    on_delete=models.CASCADE,
    related_name="+",
```
Member:

Is there a reason for using "+" as the related name for these?

Contributor (author):

I wanted to avoid having these show up elsewhere: the only part of the app that should care about the mappings is the ETL pipelines, so I didn't want mappings to show up in, say, APIs because someone used `fields = "__all__"`.
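For reference, `related_name="+"` tells Django not to create a reverse relation at all. A sketch of the effect (the accessor name below is the default Django would have generated for a hypothetical mapping model):

```python
from learning_resources.models import LearningResourceOfferor

offeror = LearningResourceOfferor.objects.first()
# With related_name="+", Django creates no reverse relation, so the default
# accessor does not exist and this raises AttributeError:
offeror.learningresourcetopicmapping_set.all()
```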

```python
return topic_detail.channel.channel_url if topic_detail is not None else None

@cached_property
def full_name(self):
```
Member:

This doesn't seem to be used anywhere, maybe it can be removed?

```python
migrations.RunSQL(
    sql=[
        (
            "TRUNCATE learning_resources_learningresourcetopic "
```
Member:

This doesn't seem to quite get rid of everything it should. I have lots of old channels with channel_type="topic" that have a null ChannelTopicDetail. These should probably be deleted as well.

```python
Channel.objects.filter(channel_type="topic").filter(topic_detail__isnull=True).count()
# Out[20]: 279
```

Wondering if another Python function could replace this raw SQL?

```python
LearningResourceTopic.objects.all().delete()
Channel.objects.filter(channel_type=ChannelType.topic.name).delete()
```

Maybe topic channels should be automatically deleted when the topic is deleted, but that can be an issue for another PR.

Contributor (author):

I reworked the migration to cull topic channels and to delete topics using model objects.

```python
lr_topic, created = LearningResourceTopic.objects.filter(
    Q(name=topic["name"]) | Q(id=topic["id"])
).update_or_create(
    defaults={
```
Member:

What do you think of changing this to:

```python
        ).update_or_create(
            id=topic["id"],
            defaults={
```

Assuming the file will always have an id supplied, anyway. This way the topics will be assigned the ids in the yaml file, and if running upsert_topic_data_file on a file that specifies an existing id and a changed name, the name will be updated on the existing topic instead of a new topic being created and the id value being ignored.

If id isn't always present, then maybe that arg could be used conditionally.
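A sketch of that conditional matching (the dict shape follows the snippets above; otherwise an assumption):

```python
# Match on id when the entry has one (updates); fall back to name (new topics).
lookup = {"id": topic["id"]} if topic.get("id") else {"name": topic["name"]}
lr_topic, created = LearningResourceTopic.objects.update_or_create(
    **lookup,
    defaults={"name": topic["name"]},
)
```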

Contributor (author):

The ID isn't always there: you may be adding a new one. However, this led me down a bit of a rabbit hole wrt the IDs - this was using database (int) IDs, and that ended up being a problem, so I refactored to add a UUID field. (Using database IDs caused weirdness because inserting rows with explicit IDs doesn't advance the identity sequence, which generates database errors when you later try to add new ones.)
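The UUID field described here might look something like this (field name and options are assumptions; the commit list below only confirms that a UUID field was added):

```python
import uuid

from django.db import models


class LearningResourceTopic(models.Model):
    # Stable identifier for YAML fixtures; unlike explicit integer PKs,
    # client-supplied UUIDs don't fight the identity sequence.
    topic_uuid = models.UUIDField(default=uuid.uuid4, unique=True)
```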


```python
def test_prolearn_transform_courses(mock_mitpe_courses_data):
    """Test that prolearn courses data is correctly transformed into our normalized structure"""
    upsert_topic_data_file()
```
Member:

Could be done via a pytest fixture, but this works too.

Contributor (author):

Rolled this into the existing offeror fixture (since it depends on it).

@mbertrand mbertrand added the Waiting on author label and removed the Needs Review label Jul 24, 2024
This won't be signed because that seems to have broken for some reason?!?

- Update the data_fixture migration to delete more data and do it without using raw SQL
- Remove stale topic mapping files
- Update learning_resource migrations so that there's not a merge one
- Update LearningResourceTopic to include a UUID field that we can use in the yaml files - with the database ID, the sequence gets screwed up and you then can't add new ones later
- Update the util functions for uuids
- Update the update_topics command to use the new topics code, and to provide a dump of the topics into a yaml file for migrations
- Update the topics file so it also has UUIDs
@jkachel jkachel force-pushed the jkachel/4553-update-topics-workflow branch from 7fbf26b to 1287c70 on July 24, 2024 21:27
jkachel and others added 3 commits July 25, 2024 08:32
This was pulling and parsing the OCW topics, but those are hard-coded anyway; this component needs to be updated to use the API-provided icon, and we're not tackling that in this PR.
@jkachel jkachel requested a review from mbertrand July 25, 2024 13:58
@jkachel (Contributor, author) commented Jul 25, 2024

Last few commits should address everything here.

@mbertrand (Member) left a comment:

Looking good, just another couple of comments.

```python
ChannelTopicDetail = apps.get_model("channels", "ChannelTopicDetail")

LearningResourceTopic.objects.all().delete()
ChannelTopicDetail.objects.all().delete()
```
Member:

This still leaves a bunch of orphaned Channel objects with channel_type="topic" and a null ChannelTopicDetail. This should handle it, and also delete the related ChannelTopicDetail objects via cascade:

```python
Channel.objects.filter(channel_type="topic").delete()
```

Member:

Another consequence of deleting all existing topics is that any learning paths (like featured lists for OCW, xPro, etc) will have their manually assigned topics field set to an empty list. Not sure how big a deal this is, as there aren't that many of them, but might be worth checking with @pdpinch.

Contributor (author):

We should run the pipelines for sources that include topic data once this gets merged anyway - would these lists get updated as part of that? (I know there's a populate_featured_lists for featured courses, for instance.)

```python
help = (
    "Update LearningResourceTopic data from a yaml file. Optionally "
    "dump the updated topics to a file."
)
```
Member:

I think this might be better left to the data_fixtures app, @shanbady can you chime in?

Also, in terms of the option for generating the yaml file, it seems like there might be an issue if the current data in the db is incomplete. Should the yaml file only be generated from a presumably 100% complete CSV file?

Contributor:

Yeah, that's probably the way to go. I feel like anything static-fixture-related should be consolidated under the data_fixtures app.

Contributor (author):

I removed it. The upsert APIs can be run from a shell if someone needs to process a file or string manually for whatever reason.

The upsert functions work fine with deltas, so the incoming data doesn't need to be complete.

@jkachel jkachel requested a review from mbertrand July 25, 2024 19:35
@mbertrand (Member) left a comment:

Sorry for noticing this late (I often overlook podcasts & videos), but there are a lot more podcast episodes without topics under the new system. I saw lots of these messages in the logs when running ./manage.py backpopulate_podcast_data:

```
Skipped adding topic Physical Education and Recreation to resource LearningResource object (5257)
Skipped adding topic Physical Education and Recreation to resource LearningResource object (5258)
Skipped adding topic Business to resource LearningResource object (5258)
```

Still running the video pipeline, but that looks better in terms of assigning topics.

EDIT: video topics are fine because they're assigned later as topics of similar courses based on an opensearch query.

The lower number of podcast episodes with matching topics may just be a result of the lower number of new topics (~108 vs ~335 on prod), since those were not being mapped to anything. So this might be okay.

@mbertrand (Member) left a comment:

Getting the same number of podcast episodes without topics on the main branch; I'm guessing the discrepancy on prod must be due to those topics being added before we stopped adding new topics as they came in.

@mbertrand (Member) commented:
Sorry, haven't had enough coffee, forgot to populate old topics after switching back to the main branch. Once I did that, most podcast episodes were assigned topics. Probably best to check with Peter/Ferdi about how important it is for most podcasts/episodes to have topics before merging.

@jkachel (Contributor, author) commented Jul 26, 2024

Discussed briefly this morning, and it doesn't seem like topics on podcasts are too big a deal.

I'm holding off on merging this until Monday: since this will require some manual guidance to deploy, it's best to do it in a controlled manner and not block anyone else working in Open. (Plus, it's Friday, so no releases anyway.) We've decided on getting this out to RC today.

@jkachel jkachel merged commit e8cbea2 into main Jul 26, 2024
@jkachel jkachel deleted the jkachel/4553-update-topics-workflow branch July 26, 2024 14:42
@odlbot mentioned this pull request Jul 26, 2024