Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds Taxonomy, Tag, ObjectTag models and APIs #57

Merged
merged 4 commits into from
Jun 27, 2023

Conversation

pomegranited
Copy link
Contributor

@pomegranited pomegranited commented Jun 13, 2023

Description

Implements the Taxonomy, Tag, and ObjectTag models, APIs, and permissions described in the discovery document for the tagging app.

Supporting information

Relates to openedx/modular-learning#63

Testing instructions

Manual testing is possible but not very useful, since we don't yet have views for these models and APIs.

  • Ensure that all models and methods from the discovery doc are implemented.
  • Ensure that the tests cover the expected behavior of the models, APIs and permissions as described in the discovery doc.

Other information

The import_tags, export_tags and autocomplete_tags will be added by a subsequents PRs

SystemTaxonomy will be added by a subsequent PR as a subclass of Taxonomy.

ContentTaxonomy and ContentTag will subclass these models, and will be implemented in edx-platform openedx.features.content_tagging.

To do before merge

@openedx-webhooks
Copy link

openedx-webhooks commented Jun 13, 2023

Thanks for the pull request, @pomegranited! Please note that it may take us up to several weeks or months to complete a review and merge your PR.

Feel free to add as much of the following information to the ticket as you can:

  • supporting documentation
  • Open edX discussion forum threads
  • timeline information ("this must be merged by XX date", and why that is)
  • partner information ("this is a course on edx.org")
  • any other information that can help Product understand the context for the PR

All technical communication about the code itself will be done via the GitHub pull request interface. As a reminder, our process documentation is here.

Please let us know once your PR is ready for our review and all tests are green.

@openedx-webhooks openedx-webhooks added the open-source-contribution PR author is not from Axim or 2U label Jun 13, 2023
@pomegranited pomegranited force-pushed the jill/taxonomy-api branch 7 times, most recently from a4910b3 to 12ce0d2 Compare June 15, 2023 02:25
@pomegranited pomegranited marked this pull request as ready for review June 15, 2023 02:26
Copy link
Contributor

@bradenmacdonald bradenmacdonald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool :) I am liking this approach.

openedx_tagging/core/tagging/models.py Outdated Show resolved Hide resolved
openedx_tagging/core/tagging/models.py Outdated Show resolved Hide resolved
openedx_tagging/core/tagging/models.py Outdated Show resolved Hide resolved
openedx_tagging/core/tagging/models.py Outdated Show resolved Hide resolved
@ChrisChV
Copy link
Contributor

Other information

SystemTaxonomy will be added by a subsquent PR as a subclass of Taxonomy.

ContentTaxonomy and ContentTag will subclass these models, and will be implemented in edx-platform openedx.features.tagging.

Also the import_tags, export_tags and autocomplete_tags will be added by a subsequents PRs

@ChrisChV
Copy link
Contributor

@pomegranited Great work! Looks good 👍

@@ -26,6 +26,8 @@ But because permissions #2 + #3 require access to the edx-platform CMS model `Co

Per `OEP-9`_, ``openedx_tagging`` will allow applications to use the standard Django API to query permissions, for example: ``user.has_perm('openedx_tagging.edit_taxonomy', taxonomy)``, and the appropriate permissions will be applied in that application's context.

These rules will be enforced in the tagging `views`_, not the API or models, so that external code using this library need not have a logged-in user in order to call the API. So please use with care.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ChrisChV CC @ormsbee @bradenmacdonald

Does this seem reasonable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pomegranited It's reasonable to me. Maybe it is good to add this also as a comment in the API so that it is taken into account when using it

openedx_tagging/core/tagging/models.py Outdated Show resolved Hide resolved

operations = [
migrations.CreateModel(
name='TagContent',
name="TagContent",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JV TODO: Since we don't need this table at all, omit this and merge all the migrations down to this 0001_initial.py file.

openedx_tagging/core/tagging/models.py Outdated Show resolved Hide resolved
openedx_tagging/core/tagging/models.py Outdated Show resolved Hide resolved
openedx_tagging/core/tagging/models.py Outdated Show resolved Hide resolved
Reconciles all stored ObjectTag entries with any changes made to their associated taxonomies and tags.
"""
num_changed = 0
for object_tag in ObjectTag.objects.all():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are a lot of ObjecTag rows, this has the downside of loading them all from the database and then converting them all into python objects before iterating over them. It seems like using for object_tag in ObjectTag.objects.all().iterator() is better:

For a QuerySet which returns a large number of objects that you only need to access once, this can result in better performance and a significant reduction in memory.

However, it's still not as good as I hoped:

MySQL doesn’t support streaming results, hence the Python database driver loads the entire result set into memory.

What?! Yet another point for postgres...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bradenmacdonald Cool, I didn't know about that! Will fix this.

MySQL doesn’t support streaming results, hence the Python database driver loads the entire result set into memory.

What?! Yet another point for postgres...

My favourite related comment so far came from this issue: https://code.djangoproject.com/ticket/2495#comment:14

After eight years, and considering that the fix is far from easy, could we file this under "use a better database" and close the ticket?

@pomegranited pomegranited force-pushed the jill/taxonomy-api branch 2 times, most recently from c37f9f4 to 25d4a62 Compare June 21, 2023 02:32
@pomegranited
Copy link
Contributor Author

@ChrisChV My apologies for my rebase on main.. I've rolled it back and did a merge with main instead, so you should still be able to change the base branch of your open PRs to this one to ease reviews there.

But I'll need to do a rebase before I merge here so I can squash those migrations. Awaiting word on whether we need a uuid field before I do that.

Reconciles all stored ObjectTag entries with any changes made to their associated taxonomies and tags.
"""
num_changed = 0
for object_tag in ObjectTag.objects.all().iterator():
Copy link
Contributor

@ormsbee ormsbee Jun 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize I've been a bit in-and-out of the taxonomy/tagging model discussions, so please pardon me if you folks have already gone over this:

I wonder if this could be less expensive if we make one more layer of model indirection between the Object and the Taxonomy's Tag. Right now, it seems like we have the global representation of a Tag, and a per-tagged-thing copy in case the global thing goes away. We need that because the Taxonomies+Tags ownership and lifecycle is different from ObjectTags. The process of reconciliation is between per-object copies and the taxonomy's canonical Tag.

But do we really need that many copies? Can't we keep a more normalized representation as long as we create a copy that follows the same ownership as the ObjectTags in question?

What if we kept a copy of the Taxonomy at the layer of the importable/exportable thing (e.g. library, course, learning package)? Take a course for example. Let's say:

  • ObjectTags don't have local copies of the name.
  • All the ObjectTags for course content is against course-level Taxonomies.
  • Some of those are CourseTaxonomies that are entirely owned and internal to the course.
  • Some of those are CourseTaxonomies that serve as links-with-copies to other Taxonomies (CourseTaxonomies controlled by other courses, OrgTaxonomies, SystemTaxonomies).
  • Inconsistencies within the Course on import are errors (they can be ignored and the values dropped, or they can stop the process, but there's no reconciliation).
  • Reconciliation for updated values happens between Taxonomies.

That would significantly cut down on the number of rows that would have to be iterated over during any sort of reconciliation process. Would that be helpful?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ormsbee , we haven't gone over an option like this, so thank you for raising it! And you're right, it is storage-heavy to store copies of the name:value on each ObjectTag, which will store the most rows of any model in this package.

But, this does seem more complicate, so do we care that much about storage space?

Right now, it seems like we have the global representation of a Tag, and a per-tagged-thing copy in case the global thing goes away.

We have three uses for the ObjectTag._value field:

  1. What you said ^
  2. We can import tagged content without having to have a Taxonomy with these Tags in place before importing.
  3. Free-text tags can be anything the content author wants, and are not hierarchical.
    So we don't store them as Tags, we just create an ObjectTag with _value="free text value" and tag=None. This also avoids race conditions; we don't have to worry about creating both a Tag and an ObjectTag when free-text tagging an object.

(ObjectTag._name is also in the model for reasons 1 & 2.)

But there's a lot of dancing going on around this _value field, which likely means it's trying to do too much.

If I understand you correctly, you're suggesting something like:

  • Taxonomy: canonical source of a taxonomy's tags and metadata. Taxonomy Admins create these, import/export tags.
  • ContentTaxonomy: references a source Taxonomy`, and keeps a copy of its Tag tree, e.g. as ContentTaxonomyTag
  • ContentTag: references a ContentTaxonomy and a ContentTaxonomyTag.
  • ContentTaxonomy objects know how to reconcile changes made to its source Taxonomy or the source Taxonomy's Tag without losing ContentTag data.

So this deals with point #1 above, but how should we handle points #2 and #3?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My assumption was that reconciliation would be quite a rare process. If that's the case, I would strongly lean toward the simpler version, even if it uses a bit more space for storage. (Tags are usually quite small so I don't think the space is an issue.)

Even if it's not that rare, I think there's a simpler approach: If it's ObjectTag.objects.all() that's the problem, I suggest changing the reconciliation API to work on a single taxonomy or a single course at a time, rather than all tags of all objects. So for example it could take a course ID as its argument, and then do the reconciliation based on ObjectTag.objects.filter(object_type=..., object_id__startswith="block-v1:the+course+ID@"). Then you get much more efficient (and targeted) reconciliation at appropriate times (e.g. after a course import), and the data model is simple and easy to reason about.

What about taxonomy import? It seems like you might need to iterate over all the ObjectTags in that case, to see if any of them should be "upgraded" to new values in the taxonomy based on their name and value. But I think you can actually do a much more targeted iteration by only iterating the ObjectTag that have either (taxonomy_id is None) or (taxonomy_id==imported_taxonomy_id and tag_id is None). Which is again much much faster than iterating all the ObjectTags, because I expect that the vast majority of ObjectTags will be filtered out by that criteria (because they're linked to a taxonomy and/or a tag).

Copy link
Contributor Author

@pomegranited pomegranited Jun 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bradenmacdonald

If that's the case, I would strongly lean toward the simpler version, even if it uses a bit more space for storage.

That's my thinking too, and hence this PR.. but I am open to other options @ormsbee .

Even if it's not that rare, I think there's a simpler approach: If it's ObjectTag.objects.all() that's the problem, I suggest changing the reconciliation API to work on a single taxonomy or a single course at a time, rather than all tags of all objects.

👍 to adding filter options to the resync_tags method.

What about taxonomy import?

Perfect use of this, yes! We can use that as part of the import taxonomy API.

Thank you!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My apologies: I thought I replied to this yesterday but I never submitted my comment.

@bradenmacdonald: Yes, my concern was much more about the number of records that need to be updated (and any potential locking), and not the space usage.

@pomegranited:

  1. We can import tagged content without having to have a Taxonomy with these Tags in place before importing.

Yeah, what I was proposing assumes that we'd have local Taxonomies/Tags in place for import, because those taxonomies can be locally controlled at the scope the import is happening.

  1. Free-text tags can be anything the content author wants, and are not hierarchical.
    So we don't store them as Tags, we just create an ObjectTag with _value="free text value" and tag=None. This also avoids race conditions; we don't have to worry about creating both a Tag and an ObjectTag when free-text tagging an object.

I'm fuzzy on this part. It doesn't have to block this PR, but don't we have to worry about race conditions where we're making the same free text value ObjectTag for the same object multiple times? It seems like it would benefit from being a separate model altogether, so we could put a uniqueness constraints around the freeform tag and the thing being tagged, independent of a taxonomy?

Copy link
Contributor

@ormsbee ormsbee Jun 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear: I'm fine with the proposal to keep the general shape of the data structures now and do filtering during reconciliation. We can always add in the local-taxonomy-copy thing later if it turns out we want it. The questions I'm asking now are just to improve my understanding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like it would benefit from being a separate model altogether, so we could put a uniqueness constraints around the freeform tag and the thing being tagged, independent of a taxonomy?

I'll give it some thought.. originally when I was doing the discovery, I thought we'd end up having to subclass all of these base classes, which would lead to lots and lots of classes, e.g. SystemTaxonomy, SystemTaxonomyTag, SystemTaxonomyObjectTag, ... And so I was trying to reduce the number of classes used at the base here. But now that I've gotten into the ContentTaxonomy implementation, I didn't even need to create a ContentTag class, because all of the ObjectTag validation is handled by the taxonomy. We'll still need a FreeFormTaxonomy of some sort, even if it has no tags, because that model is where the namespace for the object tags are stored.

Unfortunately we're running short on time for this phase 1, and so I'd rather not tweak it too much more until we actually have a frontend that's using it. But if I find that the frontend is having to do a lot of `if freeform then do this else do something else" before calling the backend, then separating them out will make much more sense.

@pomegranited
Copy link
Contributor Author

Thank you for all your reviews and advice, @ChrisChV @ormsbee and @bradenmacdonald !

I think this is ready to merge now -- if you agree, could someone please squash+merge this? I haven't squashed and rebased the commits here to keep the conversations working, but I can do that if it's preferred.

@ormsbee
Copy link
Contributor

ormsbee commented Jun 22, 2023

@pomegranited: Could you please resolve merge conflicts? (I think it's fine for you to squash.)

Also:
* Adds compile-requirements make target and updates (without upgrading) the requirements,
  which pulled in some other requirements, since it's been a while since
  they were compiled.
* Adds ddt as a test dependency.
* Replaces the placeholder TagContent class, and squashes migrations down to a single file
* use the openedx_learning.lib multi-collation char and text fields
Also:
* Adds rules requirement and app settings to enable it
* Adds mock to test requirements, so we can test system taxonomy rules
* ADR: Clarifies that rules will be enforced in the views, not the model or APIs
Tests are failing in openedx on sqlite.
@ormsbee
Copy link
Contributor

ormsbee commented Jun 27, 2023

@pomegranited: is this okay for me to rebase+merge now?

@pomegranited
Copy link
Contributor Author

@ormsbee Yes please! Sorry for the delay, I didn't think you were even up :)

@ormsbee ormsbee merged commit 821a53c into openedx:main Jun 27, 2023
5 checks passed
@openedx-webhooks
Copy link

@pomegranited 🎉 Your pull request was merged! Please take a moment to answer a two question survey so we can improve your experience in the future.

@ormsbee
Copy link
Contributor

ormsbee commented Jun 27, 2023

I'm a parent. I keep weird hours sometimes. 😉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
open-source-contribution PR author is not from Axim or 2U
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

5 participants