Skip to content

Conversation

@mbertrand
Copy link
Member

@mbertrand mbertrand commented Jul 1, 2024

What are the relevant tickets?

Closes https://github.com/mitodl/hq/issues/4731

Related ol-infrastructure PR: mitodl/ol-infrastructure#2536

Description (What does it do?)

  • Adds a new ETL pipeline to ingest courses and programs directly from Professional Education's own API feed instead of Prolearn's.
  • Excludes PE courses from being ingested by the Prolearn ETL pipeline.

Screenshots (if appropriate):

Screenshot 2024-07-08 at 8 59 20 AM

How can this be tested?

  • Set the following values in your backend .env file:

     PROLEARN_CATALOG_API_URL=https://prolearn.mit.edu/graphql
     MITPE_BASE_API_URL=https://professional.mit.edu/
    
  • Run ./manage.py backpopulate_prolearn_data

  • When it is done, there should be roughly >130 learning resources with etl_source=prolearn and offered_by=mitpe. They will all be courses and not programs. http://api.open.odl.local:8063/admin/learning_resources/learningresource/?o=2&offered_by__code__exact=mitpe

  • If you don't already have a populated featured learning path for Professional Education, run ./manage.py populate_featured_lists to create one

  • Switch to this branch.

  • Set MITPE_API_ENABLED=True in your backend .env file, restart containers.

  • Run ./manage.py backpopulate_mitpe_data followed by ./manage.py backpopulate_prolearn_data. When it is done, most/all the same courses returned above should now have etl_source=mitpe and have new readable_id values. Some that were courses in prolearn should be unpublished, and created as new programs instead of courses. There should not be any published dupes from different sources (prolearn and mitpe). Some will be unpublished now due to changed criteria (past enrollment end date or past end date).

  • If you go to http://open.odl.local:8062/learningpaths and open the Professional Education featured learning path, it will be empty. Run the following to reassign the old featured courses from prolearn to their new equivalents

./manage.py transfer_list_resources course course title prolearn mitpe
./manage.py transfer_list_resources course program title prolearn mitpe
  • Run ./manage.py update_index --all
  • Once complete, you should have courses/programs in the MITPE featured learning path again. These commands will need to be run on RC/production as well to retain the current featured lists.

@mbertrand mbertrand added Needs Review An open Pull Request that is ready for review Work in Progress and removed Work in Progress Needs Review An open Pull Request that is ready for review labels Jul 2, 2024
@mbertrand mbertrand added Needs Review An open Pull Request that is ready for review and removed Work in Progress labels Jul 3, 2024
@abeglova abeglova self-assigned this Jul 8, 2024
Copy link
Contributor

@abeglova abeglova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works great! Back to you with a request for some variable name changes and a question

**{unique_field: unique_value},
platform=platform,
resource_type=resource_type,
published=False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little confused by this logic. Why is the delete necessary? And does that mean that courses that are imported as unpublished, such as mitpe courses with no future runs in the logic below are deleted and recreated every run?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code should be run only when the readable_id is not the unique_field value, and in that case if there may be multiple unpublished resources with the same value for unique_field that need to be removed to avoid database duplicate errors.

Good catch on the mitpe courses with no future runs though, those probably should just be skipped altogether instead of (re)created as unpublished. Not much point in importing unpublished resources.

img_metadata = get_related_object(img_url)
if img_metadata:
field_image = img_metadata["relationships"]["field_media_image"]
img_url_2 = field_image["links"]["related"]["href"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be called image_data_url or something like that?

Returns:
list of dict: list containing topic dicts with a name attribute
"""
topic_data = resource_data["relationships"]["field_course_topics"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be

topics_exist = resource_data["relationships"]["field_course_topics"]["data"] 

instead of reusing the variable name topic_data to mean different things

@mbertrand
Copy link
Member Author

@abeglova Peter and I are going to have a meeting at some point with the Professional Education API developer to discuss possible enhancements like clearly delineating programs vs courses. So I'm going to label this as "Blocked" until after that meeting.

@gitguardian
Copy link

gitguardian bot commented Jul 15, 2024

️✅ There are no secrets present in this pull request anymore.

If these secrets were true positive and are still valid, we highly recommend you to revoke them.
Once a secret has been leaked into a git repository, you should consider it compromised, even if it was deleted immediately.
Find here more information about risks.


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

@mbertrand mbertrand removed the Needs Review An open Pull Request that is ready for review label Jul 17, 2024
@mbertrand mbertrand force-pushed the mb/profed_etl branch 3 times, most recently from d48bc1a to 9a5299a Compare August 7, 2024 14:51
@mbertrand mbertrand force-pushed the mb/profed_etl branch 2 times, most recently from 580fcc4 to f3495f4 Compare August 9, 2024 20:07
@mbertrand mbertrand changed the title Professional Education ETL PIpeline Professional Education ETL Pipeline Aug 9, 2024
@mbertrand mbertrand force-pushed the mb/profed_etl branch 3 times, most recently from 45537cf to ddcb055 Compare August 28, 2024 14:16
@mbertrand mbertrand force-pushed the mb/profed_etl branch 4 times, most recently from 023533a to 41770bb Compare September 24, 2024 15:08
@mbertrand mbertrand force-pushed the mb/profed_etl branch 4 times, most recently from 2fff71a to ab55004 Compare October 28, 2024 19:43
@mbertrand mbertrand added Needs Review An open Pull Request that is ready for review and removed Blocked labels Oct 29, 2024
Copy link
Contributor

@abeglova abeglova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

work great one minor fix required


log = logging.getLogger(__name__)

BASE_URL = "https://professional.mit.edu/"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This duplicates settings.MITPE_BASE_API_URL. You should get rid of either the constant or the setting

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, at first the value was different while the API was in development (professional-dev.mit.edu) but now they're the same and that's unlikely to ever change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's already a settings.MITPE_BASE_URL, going to use that here instead and remove both the above constant BASE_URL as well as settings.MITPE_BASE_API_URL

@mbertrand mbertrand merged commit c1066fd into main Oct 29, 2024
11 checks passed
@odlbot odlbot mentioned this pull request Oct 30, 2024
5 tasks
@rhysyngsun rhysyngsun deleted the mb/profed_etl branch February 7, 2025 20:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Needs Review An open Pull Request that is ready for review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants