-
Notifications
You must be signed in to change notification settings - Fork 3
ETL pipeline for MIT edX programs #1536
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
rhysyngsun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code review
| Helper function to determine if a course is an MIT course | ||
| Args: | ||
| course (dict): The JSON object representing the course with all its course runs | ||
| Returns: | ||
| bool: indicates whether the course is owned by MIT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docstring needs to be updated from copy/paste.
| # use the OpenEdx factory to create our extract and transform funcs | ||
| extract, _transform = openedx_extract_transform_factory(get_open_edx_config) | ||
|
|
||
| # modified transform function that filters the course list to ones that pass the _is_mit_course() predicate # noqa: E501 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update comment referencing wrong function
learning_resources/etl/openedx.py
Outdated
| Transform a course run into the normalized data structure | ||
| Args: | ||
| config (OpenEdxConfiguration): configuration for the openedx backend | ||
| Returns: | ||
| dict: the tranformed course run data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copy/paste typos
learning_resources/etl/openedx.py
Outdated
| return dates | ||
|
|
||
|
|
||
| def _add_course_prices(program): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest changing the name of this function since _add_course_prices sounds like it'd add the course prices to the program object being passed in. Replacing "add" with "sum" is probably good enough.
learning_resources/tasks.py
Outdated
| programs = pipelines.mit_edx_programs_etl(api_datafile) | ||
| clear_search_cache() | ||
| return len(courses) | ||
| return len(courses + programs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: it's probably a bit better to take len(courses) + len(programs) so that you're not generating an entirely new list in memory just to take the len() of it.
6e09db4 to
262ea10
Compare
|
@rhysyngsun ready for another look |
rhysyngsun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
262ea10 to
1e4f1bb
Compare
1e4f1bb to
e504a81
Compare
What are the relevant tickets?
Closes https://github.com/mitodl/hq/issues/5389
Description (What does it do?)
Adds an ETL pipeline for ingesting MIT edX programs from an API endpoint
Screenshots (if appropriate):
How can this be tested?
./manage.py backpopulate_edx_dataAdditional Context
The programs API endpoint also returns micromasters programs, which are currently ignored in favor of the data from the micromasters API endpoint.