
Conversation

@jkachel (Contributor) commented Jul 16, 2024

What are the relevant tickets?

Closes mitodl/hq#4553

Description (What does it do?)

Makes some changes to the topics code:

  • Adds topics and mappings to offeror data feeds according to the list provided in the issue referenced above
  • Adds support for persisting the mapping data in the database
  • Adds support for specifying the topic icon in the database
  • Updates the ETL pipelines to use a standardized topic mapping function, and to resolve topics via the topic mappings stored in the database

How can this be tested?

Automated tests should pass.

Topic data

You will need to run the migrations in the data_fixtures app to upsert the topic data into the system. Be advised that the migration truncates the topics table with cascade, so it will also wipe out mappings and associations with learning resources; topic channels will be invalid after this as well. You will want to backpopulate everything that carries topics (ocw, mitx, mitxonline, xpro, prolearn, etc.) after running the migration. (For just testing the topic data upsert, though, you don't need to do that right away.)

The topic data is sourced from these places (from the original issue):

We have interpreted the vocabularies and cross-walks into 3 mappings (awaiting a final approval):

  • OCW->PMT Topic Map ("draft3" tab)
    • note that this presumes that we add a topic "archaeology" under "humanities"
  • ProLearn->PMT Topic Map ("draft3" tab)
    • note that this presumes that we add "manufacturing" and "technology innovation" topics
  • edX->PMT Topic Map ("draft3" tab)
  • OpenLearningLibrary (OLL) doesn't have any topic metadata at the moment

Then, check that the data is in there properly and matches the data in the YAML file.
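A quick spot-check from a Django shell might look like this (a sketch; the models module path is an assumption, and "Archaeology" is one of the topics this PR adds):

```python
# Sketch: verify the upserted topic data against topics.yaml.
# Assumes the model lives in learning_resources.models.
from learning_resources.models import LearningResourceTopic

print(LearningResourceTopic.objects.count())  # compare with the YAML entry count
print(LearningResourceTopic.objects.filter(name="Archaeology").exists())
```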

You can test topic upsert using a new topic or updating an existing topic in these ways:

  • Via the data_fixtures app - create a migration that either pulls a file with the data in it, or embeds the update YAML into a string within the migration.
  • Manually - call `learning_resources.utils.upsert_topic_data_string` or `upsert_topic_data_file`.

If you're testing modifying an existing topic, you should specify the database ID for the topic, as in the sketch below. (It will try to match on name, but that won't work if the name is what you're updating.)
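For example, a minimal update delta might look like the following (entry shape taken from the topics.yaml excerpt quoted later in this review; whether entries sit under a wrapper key is an assumption):

```yaml
# Hypothetical delta: the id pins the match so the name can be changed
- icon: RiPaletteLine
  id: 3
  mappings: {}
  name: Environmental Design
```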

ETL Pipelines

Clear the learning_resources_learningresource_topics table (which contains the mappings between learning resources and topics) and run the backpopulate_ commands (sketched below). This should fill the table, and you should be able to single out individual resources and confirm that their topics are mapped to Open ones properly. Not all topics will map successfully; only the ones we have a mapping for will come over. We no longer create new topics when a topic doesn't already exist in the system.
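A sketch of that reset from a Django shell (the through-model access is an assumption based on the table name; only backpopulate_podcast_data is confirmed by name later in this review, and the other commands are assumed to follow the same pattern):

```python
# Sketch: clear resource<->topic mappings, then re-run the backpopulate commands.
from django.core.management import call_command
from learning_resources.models import LearningResource

LearningResource.topics.through.objects.all().delete()  # assumes an M2M named "topics"
call_command("backpopulate_podcast_data")
# ...repeat for the other sources that carry topic data (ocw, mitx, xpro, prolearn, etc.)
```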

In particular, when syncing OCW data, you should not see a bunch of "Skipped adding topic" messages in the logs. (There may be a few but they should be few and far between.) OCW has a lot of topic mappings so this is a good test to make sure the mapping system is working.

Additional Context

Since the topic maps will be different from what they were, items may not be in the places you expect them; check against the topics.yaml file to see where things should be.

The topics.yaml has IDs in it that were created from a truncated and reset table (so they start at 1 and go up). For consistency, the migration also truncates the table. This requires a re-populate of a bunch of stuff (notably topic channels, and any data source that includes topic data) so some care will need to be taken to perform those tasks after this merges for both local deployments and Heroku ones.

@jkachel jkachel added the product:mit-open and Work in Progress labels Jul 16, 2024
@jkachel jkachel self-assigned this Jul 16, 2024
@jkachel jkachel force-pushed the jkachel/4553-update-topics-workflow branch 2 times, most recently from b7adae8 to 194e95f on July 23, 2024 13:18
@jkachel jkachel marked this pull request as ready for review July 23, 2024 13:31
@jkachel jkachel marked this pull request as draft July 23, 2024 13:42
jkachel and others added 18 commits July 23, 2024 11:43
This is the base file and represents the topics that were listed in the A tab in the Google doc. (See issue mitodl/hq#4553 for details.)
…e and added database changes to support persisting the new topic fields and mappings
- Upserting from string is intended to be a helper for migrations, so we can put the updates directly in the migration
- Rewrote test so that it works with the new file format
- Added test for string upserter and topic updating
- Updated LearningResourceOfferorFactory to allow for "see"
For xpro, prolearn, and mitxonline, this reconfigures the ETL pipelines to pull topic data from the mappings in the database rather than a CSV file. This requires some test changes that haven't happened yet.

This also adds a "full name" that generates a topic name with the parent topic in the manner that the code expects. This really should only be used here as it's recursive and can potentially be expensive to trigger.
…for initial topic data load

Tests will still fail - still need to fix up prolearn ETL, probably some other things too.
Fixed some issues with the prolearn ETL, but now have issues elsewhere; re-thinking how the topic transformation works
This is no-verify since it's not at all done
- Use the transform_topics stuff rather than reimplementing it in the tests that use it
- Fix some formatting issues with resolved topics
- Update learning resource offeror factory to include a MITx Online option
Now using transform_topics and doing some manipulation of the topics that get returned
This gets lumped into mitx for this processing - maybe we revisit this in the future?
@jkachel jkachel force-pushed the jkachel/4553-update-topics-workflow branch from 1429e62 to 2a11310 on July 23, 2024 16:45
@jkachel jkachel marked this pull request as ready for review July 23, 2024 17:18
…scade and restarts the identity

Not sure this will stick around but it does simplify the data somewhat (since there's definitely going to be data on prod and RC)
@mbertrand mbertrand self-assigned this Jul 24, 2024
@mbertrand (Member) left a comment:

Overall looks great!

```python
# ... (preceding lines of the script are not shown here)
yaml.dump(topic_data, yaml_output_file)
```

You can open a Django shell or a notebook and paste that in, then run `process_topics()` to process your data file. The result will be written to the specified output location.
Member:

Worth making a management command out of this script?

On a related note, the update_topics management command should probably be deleted, along with some old mapping files under learning_resources/data (edx-topic-mappings.csv, ucc-topic-mappings.csv).

Contributor (author):

I culled the old data files.

I also updated the update_topics command to be a front end to those functions. I didn't do this initially because I didn't want to encourage people to use the management command to update the data (that should be done with a data migration), but after some thought it seemed handy to have for generating the files for migrations.

Use the `data_fixtures` app: create a migration there that uses `RunPython` to call one of the upsert functions (both in `learning_resources/utils.py`):

- `upsert_topic_data_file` - if you've got an external file of changes
- `upsert_topic_data_string` - if you've got a string instead
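For illustration, a minimal sketch of such a migration (the dependency entry and the exact YAML payload are assumptions; the entry shape mirrors the topics.yaml excerpt quoted below):

```python
# Hypothetical data_fixtures migration that upserts a topic delta.
from django.db import migrations

from learning_resources.utils import upsert_topic_data_string

TOPIC_DELTA = """\
- icon: RiPaletteLine
  id: 3
  mappings: {}
  name: Environmental Design
"""  # payload shape assumed from the topics.yaml excerpt in this review


def upsert_topics(apps, schema_editor):
    upsert_topic_data_string(TOPIC_DELTA)


class Migration(migrations.Migration):
    dependencies = [("data_fixtures", "0001_initial")]  # placeholder dependency
    operations = [migrations.RunPython(upsert_topics, migrations.RunPython.noop)]
```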
Member:

Will have to be careful that no breaking changes are made to these functions in the future (or that older migrations using them are updated) to avoid errors when running all data fixture migrations on a fresh database.

```yaml
icon: RiPaletteLine
id: 3
mappings: {}
name: Environmental Deisgn
```
Member:

Typo: Environmental Design

Originating from the CSV file?

```yaml
- Drama
- Fiction
- International Literature
- Nonfiction Prose *Periodic Literature
```
Member:

Is the asterisk supposed to be there?

Contributor (author):

Yes. (@pdpinch this is in the spreadsheet on row 196 - I don't know who to ask about this but it seems a bit out of place.)

```python
dependencies = [
    ("learning_resources", "0057_add_icon_and_mappings_to_topics"),
    ("learning_resources", "0057_learningresource_next_prices"),
]
```
Member:

Since this hasn't been merged yet, might be better to rename 0057_add_icon_and_mappings_to_topics to 0058_add_icon_and_mappings_to_topics and delete this one.

```python
offeror = models.ForeignKey(
    "LearningResourceOfferor",
    on_delete=models.CASCADE,
    related_name="+",
```
Member:

Is there a reason for using "+" as the related name for these?

Contributor (author):

I wanted to avoid having these show up elsewhere: the only part of the app that should care about the mappings is the ETL pipelines, so I didn't want mappings to show up in, say, APIs because someone used `fields = "__all__"`.
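For reference, `related_name="+"` tells Django not to create a reverse relation at all. A sketch of the effect (the accessor name below is the default Django would have generated for a hypothetical mapping model):

```python
from learning_resources.models import LearningResourceOfferor

offeror = LearningResourceOfferor.objects.first()
# With related_name="+", Django creates no reverse relation, so the default
# accessor does not exist and this raises AttributeError:
offeror.learningresourcetopicmapping_set.all()
```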

```python
return topic_detail.channel.channel_url if topic_detail is not None else None

@cached_property
def full_name(self):
```
Member:

This doesn't seem to be used anywhere, maybe it can be removed?

```python
migrations.RunSQL(
    sql=[
        (
            "TRUNCATE learning_resources_learningresourcetopic "
```
Member:

This doesn't seem to quite get rid of everything it should. I have lots of old channels with channel_type="topic" that have a null ChannelTopicDetail. These should probably be deleted as well.

```python
Channel.objects.filter(channel_type="topic").filter(topic_detail__isnull=True).count()
# Out[20]: 279
```

Wondering if another Python function could replace this raw SQL?

```python
LearningResourceTopic.objects.all().delete()
Channel.objects.filter(channel_type=ChannelType.topic.name).delete()
```

Maybe topic channels should be automatically deleted when the topic is deleted, but that can be an issue for another PR.

Contributor (author):

I reworked the migration to cull topic channels and to delete topics using model objects.

```python
lr_topic, created = LearningResourceTopic.objects.filter(
    Q(name=topic["name"]) | Q(id=topic["id"])
).update_or_create(
    defaults={
```
Member:

What do you think of changing this to:

```python
        ).update_or_create(
            id=topic["id"],
            defaults={
```

Assuming the file will always have an id supplied, anyway. This way the topics will be assigned the ids in the yaml file, and if running upsert_topic_data_file on a file that specifies an existing id and a changed name, the name will be updated on the existing topic instead of a new topic being created and the id value being ignored.

If id isn't always present, then maybe that arg could be used conditionally.
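A sketch of that conditional matching (the dict shape follows the snippets above; otherwise an assumption):

```python
# Match on id when the entry has one (updates); fall back to name (new topics).
lookup = {"id": topic["id"]} if topic.get("id") else {"name": topic["name"]}
lr_topic, created = LearningResourceTopic.objects.update_or_create(
    **lookup,
    defaults={"name": topic["name"]},
)
```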

Contributor (author):

The ID isn't always there: you may be adding a new one. However, this led me down a bit of a rabbit hole wrt the IDs - this was using database (int) IDs, and that ended up being a problem, so I refactored to add a UUID field. (Using database IDs caused weirdness because inserting rows with explicit IDs doesn't advance the identity sequence, which generates database errors when you later try to add new ones.)
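The UUID field described here might look something like this (field name and options are assumptions; the commit list below only confirms that a UUID field was added):

```python
import uuid

from django.db import models


class LearningResourceTopic(models.Model):
    # Stable identifier for YAML fixtures; unlike explicit integer PKs,
    # client-supplied UUIDs don't fight the identity sequence.
    topic_uuid = models.UUIDField(default=uuid.uuid4, unique=True)
```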


```python
def test_prolearn_transform_courses(mock_mitpe_courses_data):
    """Test that prolearn courses data is correctly transformed into our normalized structure"""
    upsert_topic_data_file()
```
Member:

Could be done via a pytest fixture, but this works too.

Contributor (author):

Rolled this into the existing offeror fixture (since it depends on it).

@mbertrand mbertrand added the Waiting on author label and removed the Needs Review label Jul 24, 2024
This won't be signed because that seems to have broken for some reason?!?

- Update the data_fixture migration to delete more data and do it without using raw SQL
- Remove stale topic mapping files
- Update learning_resource migrations so that there's not a merge one
- Update LearningResourceTopic to include a UUID field that we can use in the yaml files - with the database ID, the sequence gets screwed up and you then can't add new ones later
- Update the util functions for uuids
- Update the update_topics command to use the new topics code, and to provide a dump of the topics into a yaml file for migrations
- Update the topics file so it also has UUIDs
@jkachel jkachel force-pushed the jkachel/4553-update-topics-workflow branch from 7fbf26b to 1287c70 on July 24, 2024 21:27
jkachel and others added 3 commits July 25, 2024 08:32
This was pulling and parsing the OCW topics, but those are hard-coded anyway; this component needs to be updated to use the API-provided icon, and we're not tackling that in this PR.
@jkachel jkachel requested a review from mbertrand July 25, 2024 13:58
@jkachel (Contributor, author) commented Jul 25, 2024

Last few commits should address everything here.

@mbertrand (Member) left a comment:

Looking good, just another couple of comments.

```python
ChannelTopicDetail = apps.get_model("channels", "ChannelTopicDetail")

LearningResourceTopic.objects.all().delete()
ChannelTopicDetail.objects.all().delete()
```
Member:

This still leaves a bunch of orphaned Channel objects with channel_type="topic" and a null ChannelTopicDetail. This should handle it, and also delete the related ChannelTopicDetail objects via cascade:

```python
Channel.objects.filter(channel_type="topic").delete()
```

Member:

Another consequence of deleting all existing topics is that any learning paths (like featured lists for OCW, xPro, etc) will have their manually assigned topics field set to an empty list. Not sure how big a deal this is, as there aren't that many of them, but might be worth checking with @pdpinch.

Contributor (author):

We should run the pipelines for sources that include topic data once this gets merged anyway - would these lists get updated as part of that? (I know there's a populate_featured_lists for featured courses, for instance.)

```python
help = (
    "Update LearningResourceTopic data from a yaml file. Optionally "
    "dump the updated topics to a file."
)
```
Member:

I think this might be better left to the data_fixtures app, @shanbady can you chime in?

Also, in terms of the option for generating the yaml file, it seems like there might be an issue if the current data in the db is incomplete. Should the yaml file only be generated from a presumably 100% complete CSV file?

Contributor:

Yeah, that's probably the way to go. I feel like anything static-fixture-related should be consolidated under the data_fixtures app.

Contributor (author):

I removed it. The upsert APIs can be run from a shell if someone needs to process a file or string manually for whatever reason.

The upsert functions work fine with deltas, so the incoming data doesn't need to be complete.

@jkachel jkachel requested a review from mbertrand July 25, 2024 19:35
@mbertrand (Member) left a comment:

Sorry for noticing this late (I often overlook podcasts & videos), but there are a lot more podcast episodes without topics under the new system. I saw lots of these messages in the logs when running ./manage.py backpopulate_podcast_data:

```
Skipped adding topic Physical Education and Recreation to resource LearningResource object (5257)
Skipped adding topic Physical Education and Recreation to resource LearningResource object (5258)
Skipped adding topic Business to resource LearningResource object (5258)
```

Still running the video pipeline, but that looks better in terms of assigning topics.

EDIT: video topics are fine because they're assigned later as topics of similar courses based on an opensearch query.

The lower number of podcast episodes with matching topics may just be a result of the lower number of new topics (~108 vs ~335 on prod), since those were not being mapped to anything. So this might be okay.

@mbertrand (Member) left a comment:

Getting the same number of podcast episodes without topics on the main branch; I'm guessing the discrepancy on prod must be due to those topics being added before we stopped adding new topics as they came in.

@mbertrand (Member) commented:
Sorry, haven't had enough coffee, forgot to populate old topics after switching back to the main branch. Once I did that, most podcast episodes were assigned topics. Probably best to check with Peter/Ferdi about how important it is for most podcasts/episodes to have topics before merging.

@jkachel (Contributor, author) commented Jul 26, 2024

Discussed briefly this morning, and it doesn't seem like topics on podcasts are too big a deal.

I'm holding off on merging this until Monday: since this will require some manual guidance to deploy, it's best to do it in a controlled manner and not block anyone else working in Open. (Plus, it's Friday, so no releases anyway.) We've decided on getting this out to RC today.

@jkachel jkachel merged commit e8cbea2 into main Jul 26, 2024
@jkachel jkachel deleted the jkachel/4553-update-topics-workflow branch July 26, 2024 14:42
@odlbot mentioned this pull request Jul 26, 2024