Update topics code for PWT topic mappings #1275

Conversation
Force-pushed from b7adae8 to 194e95f.
This is the base file and represents the topics that were listed in the A tab in the Google doc. (See issue mitodl/hq#4553 for details.)
…e and added database changes to support persisting the new topic fields and mappings
- Upserting from string is intended to be a helper for migrations, so we can put the updates directly in the migration
- Rewrote test so that it works with the new file format
- Added test for string upserter and topic updating
- Updated LearningResourceOfferorFactory to allow for "see"
For xpro, prolearn, and mitxonline, this reconfigures the ETL pipelines to pull topic data from the mappings in the database rather than a CSV file. This requires some test changes that haven't happened yet. This also adds a "full name" that generates a topic name with the parent topic in the manner that the code expects. This really should only be used here as it's recursive and can potentially be expensive to trigger.
…or mitxonline (still not done yet)
…for initial topic data load. Tests will still fail - still need to fix up the prolearn ETL, probably some other things too.
Fixed some issues with the prolearn ETL but now have issues elsewhere; re-thinking how the topic transformation works. This is no-verify since it's not at all done.
- Use the transform_topics stuff rather than reimplementing it in the tests that use it
- Fix some formatting issues with resolved topics
- Update the learning resource offeror factory to include a MITx Online option
for more information, see https://pre-commit.ci
Now using transform_topics and doing some manipulation of the topics that get returned
… some parse_topics stuff in prolearn
This gets lumped into mitx for this processing - maybe we revisit this in the future?
Force-pushed from 1429e62 to 2a11310.
(saving is important when merging)
…scade and restarts the identity. Not sure this will stick around, but it does simplify the data somewhat (since there's definitely going to be data on prod and RC).
Overall looks great!
````
yaml.dump(topic_data, yaml_output_file)
```

You can open a Django shell or a notebook and paste that in, then run `process_topics()` to process your datafile. The result of this will be written to the output location specified.
````
Worth making a management command out of this script?
On a related note, the update_topics management command should probably be deleted, along with some old mapping files under learning_resources/data (edx-topic-mappings.csv, ucc-topic-mappings.csv).
I culled the old data files.
I also updated the update_topics command to be a front-end to those functions. I didn't do this initially because I didn't want to encourage people to use the management command to update the data (this should be done with a data migration), but after some thought it would be handy to have for generating the files for migrations.
```
Use the `data_fixtures` app. Create a migration there that calls `RunPython` and calls one of the upsert functions (both in `learning_resources/utils.py`):

- `upsert_topic_data_file` - if you've got an external file of changes
- `upsert_topic_data_string` - if you've got a string instead
```
Will have to be careful that no breaking changes are made to these functions in the future (or that older migrations using them are updated) to avoid errors when running all data fixture migrations on a fresh database.
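For illustration, a minimal sketch of what such a `data_fixtures` migration could look like. The upsert helpers' exact signatures aren't shown in this thread, so the string argument, the YAML payload, and the migration dependency below are all assumptions:

```python
from django.db import migrations

from learning_resources.utils import upsert_topic_data_string

# Hypothetical YAML delta; match the real schema in learning_resources/data/topics.yaml.
TOPIC_UPDATE_YAML = """
topics:
  - name: Environmental Design
    icon: RiPaletteLine
"""


def upsert_topics(apps, schema_editor):
    # Assumed signature: parse the YAML string and upsert the topics it describes.
    upsert_topic_data_string(TOPIC_UPDATE_YAML)


class Migration(migrations.Migration):
    dependencies = [("data_fixtures", "0001_initial")]  # placeholder dependency

    operations = [migrations.RunPython(upsert_topics, migrations.RunPython.noop)]
```

Running `./manage.py migrate data_fixtures` would then apply the upsert.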
learning_resources/data/topics.yaml (Outdated)
```
icon: RiPaletteLine
id: 3
mappings: {}
name: Environmental Deisgn
```
Typo: Environmental Design
Originating from the csv file?
```
- Drama
- Fiction
- International Literature
- Nonfiction Prose *Periodic Literature
```
Is asterisk supposed to be there?
Yes. (@pdpinch this is in the spreadsheet on row 196 - I don't know who to ask about this but it seems a bit out of place.)
```
dependencies = [
    ("learning_resources", "0057_add_icon_and_mappings_to_topics"),
    ("learning_resources", "0057_learningresource_next_prices"),
]
```
Since this hasn't been merged yet, might be better to rename 0057_add_icon_and_mappings_to_topics to 0058_add_icon_and_mappings_to_topics and delete this one.
```
offeror = models.ForeignKey(
    "LearningResourceOfferor",
    on_delete=models.CASCADE,
    related_name="+",
```
Is there a reason for using "+" as the related name for these?
I wanted to avoid having these show up elsewhere - the only part of the app that should care about the mappings is the ETL pipelines, so I didn't want mappings to show up in, say, APIs because someone used `fields = "__all__"`.
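For context, a minimal standalone illustration (hypothetical models, not the PR's) of what `related_name="+"` buys you in Django:

```python
from django.db import models


class Offeror(models.Model):
    name = models.CharField(max_length=128)


class TopicMapping(models.Model):
    # "+" suppresses the reverse relation entirely: Django creates no
    # offeror.topicmapping_set accessor, so the mapping rows can't be reached
    # (or accidentally serialized) by walking relations from the offeror side.
    offeror = models.ForeignKey(Offeror, on_delete=models.CASCADE, related_name="+")
```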
learning_resources/models.py (Outdated)
```
return topic_detail.channel.channel_url if topic_detail is not None else None

@cached_property
def full_name(self):
```
This doesn't seem to be used anywhere, maybe it can be removed?
```
migrations.RunSQL(
    sql=[
        (
            "TRUNCATE learning_resources_learningresourcetopic "
```
This doesn't seem to quite get rid of everything it should. I have lots of old channels with channel_type="topic" that have a null ChannelTopicDetail. These should probably be deleted as well.

```
Channel.objects.filter(channel_type="topic").filter(topic_detail__isnull=True).count()
Out[20]: 279
```

Wondering if another Python function could replace this raw SQL?

```
LearningResourceTopic.objects.all().delete()
Channel.objects.filter(channel_type=ChannelType.topic.name).delete()
```

Maybe topic channels should be automatically deleted when the topic is deleted, but that can be an issue for another PR.
I reworked the migration to cull topic channels and to delete topics using model objects.
learning_resources/utils.py (Outdated)
```
lr_topic, created = LearningResourceTopic.objects.filter(
    Q(name=topic["name"]) | Q(id=topic["id"])
).update_or_create(
    defaults={
```
What do you think of changing this to:

```
).update_or_create(
    id=topic["id"],
    defaults={
```

Assuming the file will always have an id supplied, anyway. This way the topics will be assigned the ids in the yaml file, and if running upsert_topic_data_file on a file that specifies an existing id and a changed name, the name will be updated on the existing topic instead of a new topic being created and the id value being ignored.

If id isn't always present, then maybe that arg could be used conditionally.
The ID isn't always there: you may be adding a new one. However, this led me down a bit of a rabbit hole wrt the IDs - this was using database (int) IDs and that ended up being a problem, so I refactored to add a UUID field. (Using database IDs ended up causing weirdness because inserting rows with IDs doesn't update the identity sequence, which generates database errors when you're trying to add new ones.)
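For anyone unfamiliar with the sequence behavior being described, a hypothetical illustration (simplified to the fields shown in this PR):

```python
from learning_resources.models import LearningResourceTopic

# Inserting with an explicit primary key does not advance the Postgres identity
# sequence behind the table:
LearningResourceTopic.objects.create(id=42, name="Math")

# Later inserts that rely on the sequence will eventually hand out id=42 again
# and fail with an IntegrityError (duplicate key) - hence the separate UUID
# field for stable identifiers in the yaml files.
LearningResourceTopic.objects.create(name="Physics")
```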
```
def test_prolearn_transform_courses(mock_mitpe_courses_data):
    """Test that prolearn courses data is correctly transformed into our normalized structure"""
    upsert_topic_data_file()
```
Could be done via a pytest fixture, but this works too.
Rolled this into the existing offeror fixture (since it depends on it).
This won't be signed because that seems to have broken for some reason?!?
- Update the data_fixture migration to delete more data and do it without using raw SQL
- Remove stale topic mapping files
- Update learning_resource migrations so that there's not a merge one
- Update LearningResourceTopic to include a UUID field that we can use in the yaml files - with the database ID, the sequence gets screwed up and you then can't add new ones later
- Update the util functions for uuids
- Update the update_topics command to use the new topics code, and to provide a dump of the topics into a yaml file for migrations
- Update the topics file so it also has UUIDs
Force-pushed from 7fbf26b to 1287c70.
This was pulling and parsing the OCW topics, but these are hard-coded anyway. This component needs to be updated to use the API-provided icon; not tackling that in this PR.
Last few commits should address everything here.
Looking good, just another couple comments
```
ChannelTopicDetail = apps.get_model("channels", "ChannelTopicDetail")

LearningResourceTopic.objects.all().delete()
ChannelTopicDetail.objects.all().delete()
```
This still leaves a bunch of orphaned Channel objects with channel_type="topic" and a null ChannelTopicDetail. This should handle it, and also delete the related ChannelTopicDetail objects via cascade:

```
Channel.objects.filter(channel_type="topic").delete()
```
Another consequence of deleting all existing topics is that any learning paths (like featured lists for OCW, xPro, etc) will have their manually assigned topics field set to an empty list. Not sure how big a deal this is, as there aren't that many of them, but might be worth checking with @pdpinch.
We should run the pipelines for sources that include topic data once this gets merged anyway - would these lists get updated as part of that? (I know there's a populate_featured_lists for featured courses, for instance.)
```
help = (
    "Update LearningResourceTopic data from a yaml file. Optionally "
    "dump the updated topics to a file."
)
```
I think this might be better left to the data_fixtures app, @shanbady can you chime in?
Also, in terms of the option for generating the yaml file, it seems like there might be an issue if the current data in the db is incomplete. Should the yaml file only be generated from a presumably 100% complete CSV file?
Yea, that's probably the way to go. I feel like anything static fixture related should be consolidated under the data fixtures app.
I removed it. The upsert APIs can be run from a shell if someone needs to process a file or string manually for whatever reason.
The upsert functions work fine with deltas, so the incoming data doesn't need to be complete.
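As a sketch, running an upsert manually from a Django shell might look like this (the file-path argument is an assumption; check the actual signature in `learning_resources/utils.py`):

```python
# ./manage.py shell
from learning_resources.utils import upsert_topic_data_file

# Deltas are fine - per the above, the incoming data doesn't need to be complete.
upsert_topic_data_file("learning_resources/data/topics.yaml")
```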
Sorry for noticing this late, I often overlook podcasts & videos, but noticed there are lots more podcast episodes without topics using the new system. Saw lots of these messages in the logs when running ./manage.py backpopulate_podcast_data:

```
Skipped adding topic Physical Education and Recreation to resource LearningResource object (5257)
Skipped adding topic Physical Education and Recreation to resource LearningResource object (5258)
Skipped adding topic Business to resource LearningResource object (5258)
```
Still running the video pipeline but that looks better in terms of assigning topics.
EDIT: video topics are fine because they're assigned later as topics of similar courses based on an opensearch query.
The lower number of podcast episodes with matching topics may just be a result of the lower number of new topics (~108 vs ~335 on prod), since they were not being mapped to anything. So might be okay.
Getting the same number of podcast episodes without topics on the main branch; I'm guessing the discrepancy on prod must be due to those topics being added before we stopped adding new topics as they came in.
Sorry, haven't had enough coffee; forgot to populate old topics after switching back to the main branch. Once I did that, most podcast episodes were assigned topics. Probably best to check with Peter/Ferdi about how important it is for most podcasts/episodes to have topics before merging.
Discussed briefly this morning and it doesn't seem like topics on podcasts were too big of a deal.
What are the relevant tickets?
Closes mitodl/hq#4553
Description (What does it do?)
Makes some changes to the topics code:
How can this be tested?
Automated tests should pass.
Topic data
You will need to run the migrations in the `data_fixtures` app to upsert the topic data into the system. Be advised that the migration will truncate the topics table with cascade, so this will also wipe out mappings and associations with learning resources. Topic channels will be invalid after this as well. You will want to backpopulate anything that's got topics in it (ocw, mitx, mitxonline, xpro, prolearn, etc.) after running the migration. (However, for just testing the topic data upsert stuff, you don't need to do it right away.)

The topic data is sourced from these places (from the original issue):
Then, check to make sure the data is in there properly and matches the data in the YAML file.
You can test topic upsert using a new topic or updating an existing topic in these ways:

- `data_fixtures` app - create a migration that either pulls a file with the data in it, or embeds the update YAML into a string within the migration.
- `learning_resources.utils.upsert_topic_data_string` or `_file`.

If you're testing modifying an existing topic, you should specify the database ID for the topic. (It will try to match on name, but that won't work if that's what you're updating.)
ETL Pipelines
Clear the `learning_resources_learningresource_topics` table (which contains the maps between learning resources and topics) and run the `backpopulate_*` commands; a sketch of one way to do the clearing follows below. This should fill the table, and you should be able to single out individual resources and confirm that their topics are mapped to Open ones properly. Not all topics will map successfully - only ones we have a mapping for will come over. We no longer create new topics if the topic didn't exist in the system.

In particular, when syncing OCW data, you should not see a bunch of "Skipped adding topic" messages in the logs. (There may be a few, but they should be few and far between.) OCW has a lot of topic mappings, so this is a good test to make sure the mapping system is working.
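As a sketch, one way to clear that through table from a Django shell (this assumes `topics` is a plain many-to-many on `LearningResource`, which is what the table name suggests):

```python
# ./manage.py shell
from learning_resources.models import LearningResource

# Deletes only the resource<->topic link rows, not the topics themselves;
# the backpopulate commands then re-create the links from the mappings.
LearningResource.topics.through.objects.all().delete()
```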
Additional Context
Since the topic maps will be different than they were, items may not be in the places you expect them - check against the `topics.yaml` file to see where things should be.

The `topics.yaml` has IDs in it that were created from a truncated and reset table (so they start at 1 and go up). For consistency, the migration also truncates the table. This requires a re-populate of a bunch of stuff (notably topic channels, and any data source that includes topic data), so some care will need to be taken to perform those tasks after this merges for both local deployments and Heroku ones.