-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds Taxonomy, Tag, ObjectTag models and APIs #57
Conversation
Thanks for the pull request, @pomegranited! Please note that it may take us up to several weeks or months to complete a review and merge your PR. Feel free to add as much of the following information to the ticket as you can:
All technical communication about the code itself will be done via the GitHub pull request interface. As a reminder, our process documentation is here. Please let us know once your PR is ready for our review and all tests are green. |
a4910b3
to
12ce0d2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool :) I am liking this approach.
Also the |
@pomegranited Great work! Looks good 👍 |
d499752
to
eaee7e4
Compare
@@ -26,6 +26,8 @@ But because permissions #2 + #3 require access to the edx-platform CMS model `Co | |||
|
|||
Per `OEP-9`_, ``openedx_tagging`` will allow applications to use the standard Django API to query permissions, for example: ``user.has_perm('openedx_tagging.edit_taxonomy', taxonomy)``, and the appropriate permissions will be applied in that application's context. | |||
|
|||
These rules will be enforced in the tagging `views`_, not the API or models, so that external code using this library need not have a logged-in user in order to call the API. So please use with care. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ChrisChV CC @ormsbee @bradenmacdonald
Does this seem reasonable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pomegranited It's reasonable to me. Maybe it is good to add this also as a comment in the API so that it is taken into account when using it
|
||
operations = [ | ||
migrations.CreateModel( | ||
name='TagContent', | ||
name="TagContent", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JV TODO: Since we don't need this table at all, omit this and merge all the migrations down to this 0001_initial.py
file.
openedx_tagging/core/tagging/api.py
Outdated
Reconciles all stored ObjectTag entries with any changes made to their associated taxonomies and tags. | ||
""" | ||
num_changed = 0 | ||
for object_tag in ObjectTag.objects.all(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there are a lot of ObjecTag
rows, this has the downside of loading them all from the database and then converting them all into python objects before iterating over them. It seems like using for object_tag in ObjectTag.objects.all().iterator()
is better:
For a
QuerySet
which returns a large number of objects that you only need to access once, this can result in better performance and a significant reduction in memory.
However, it's still not as good as I hoped:
MySQL doesn’t support streaming results, hence the Python database driver loads the entire result set into memory.
What?! Yet another point for postgres...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bradenmacdonald Cool, I didn't know about that! Will fix this.
MySQL doesn’t support streaming results, hence the Python database driver loads the entire result set into memory.
What?! Yet another point for postgres...
My favourite related comment so far came from this issue: https://code.djangoproject.com/ticket/2495#comment:14
After eight years, and considering that the fix is far from easy, could we file this under "use a better database" and close the ticket?
c37f9f4
to
25d4a62
Compare
@ChrisChV My apologies for my rebase on But I'll need to do a rebase before I merge here so I can squash those migrations. Awaiting word on whether we need a uuid field before I do that. |
openedx_tagging/core/tagging/api.py
Outdated
Reconciles all stored ObjectTag entries with any changes made to their associated taxonomies and tags. | ||
""" | ||
num_changed = 0 | ||
for object_tag in ObjectTag.objects.all().iterator(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize I've been a bit in-and-out of the taxonomy/tagging model discussions, so please pardon me if you folks have already gone over this:
I wonder if this could be less expensive if we make one more layer of model indirection between the Object and the Taxonomy's Tag. Right now, it seems like we have the global representation of a Tag, and a per-tagged-thing copy in case the global thing goes away. We need that because the Taxonomies+Tags ownership and lifecycle is different from ObjectTags. The process of reconciliation is between per-object copies and the taxonomy's canonical Tag.
But do we really need that many copies? Can't we keep a more normalized representation as long as we create a copy that follows the same ownership as the ObjectTags in question?
What if we kept a copy of the Taxonomy at the layer of the importable/exportable thing (e.g. library, course, learning package)? Take a course for example. Let's say:
- ObjectTags don't have local copies of the name.
- All the ObjectTags for course content is against course-level Taxonomies.
- Some of those are CourseTaxonomies that are entirely owned and internal to the course.
- Some of those are CourseTaxonomies that serve as links-with-copies to other Taxonomies (CourseTaxonomies controlled by other courses, OrgTaxonomies, SystemTaxonomies).
- Inconsistencies within the Course on import are errors (they can be ignored and the values dropped, or they can stop the process, but there's no reconciliation).
- Reconciliation for updated values happens between Taxonomies.
That would significantly cut down on the number of rows that would have to be iterated over during any sort of reconciliation process. Would that be helpful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @ormsbee , we haven't gone over an option like this, so thank you for raising it! And you're right, it is storage-heavy to store copies of the name:value
on each ObjectTag, which will store the most rows of any model in this package.
But, this does seem more complicate, so do we care that much about storage space?
Right now, it seems like we have the global representation of a Tag, and a per-tagged-thing copy in case the global thing goes away.
We have three uses for the ObjectTag._value
field:
- What you said ^
- We can import tagged content without having to have a Taxonomy with these Tags in place before importing.
- Free-text tags can be anything the content author wants, and are not hierarchical.
So we don't store them asTag
s, we just create anObjectTag
with_value="free text value"
andtag=None
. This also avoids race conditions; we don't have to worry about creating both aTag
and anObjectTag
when free-text tagging an object.
(ObjectTag._name
is also in the model for reasons 1 & 2.)
But there's a lot of dancing going on around this _value
field, which likely means it's trying to do too much.
If I understand you correctly, you're suggesting something like:
- Taxonomy: canonical source of a taxonomy's tags and metadata. Taxonomy Admins create these, import/export tags.
- ContentTaxonomy: references a
source
Taxonomy`, and keeps a copy of its Tag tree, e.g. as ContentTaxonomyTag - ContentTag: references a ContentTaxonomy and a ContentTaxonomyTag.
- ContentTaxonomy objects know how to reconcile changes made to its
source
Taxonomy or the source Taxonomy's Tag without losingContentTag
data.
So this deals with point #1 above, but how should we handle points #2 and #3?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My assumption was that reconciliation would be quite a rare process. If that's the case, I would strongly lean toward the simpler version, even if it uses a bit more space for storage. (Tags are usually quite small so I don't think the space is an issue.)
Even if it's not that rare, I think there's a simpler approach: If it's ObjectTag.objects.all()
that's the problem, I suggest changing the reconciliation API to work on a single taxonomy or a single course at a time, rather than all tags of all objects. So for example it could take a course ID as its argument, and then do the reconciliation based on ObjectTag.objects.filter(object_type=..., object_id__startswith="block-v1:the+course+ID@")
. Then you get much more efficient (and targeted) reconciliation at appropriate times (e.g. after a course import), and the data model is simple and easy to reason about.
What about taxonomy import? It seems like you might need to iterate over all the ObjectTags in that case, to see if any of them should be "upgraded" to new values in the taxonomy based on their name
and value
. But I think you can actually do a much more targeted iteration by only iterating the ObjectTag
that have either (taxonomy_id is None) or (taxonomy_id==imported_taxonomy_id and tag_id is None). Which is again much much faster than iterating all the ObjectTags, because I expect that the vast majority of ObjectTags will be filtered out by that criteria (because they're linked to a taxonomy and/or a tag).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that's the case, I would strongly lean toward the simpler version, even if it uses a bit more space for storage.
That's my thinking too, and hence this PR.. but I am open to other options @ormsbee .
Even if it's not that rare, I think there's a simpler approach: If it's ObjectTag.objects.all() that's the problem, I suggest changing the reconciliation API to work on a single taxonomy or a single course at a time, rather than all tags of all objects.
👍 to adding filter options to the resync_tags method.
What about taxonomy import?
Perfect use of this, yes! We can use that as part of the import taxonomy API.
Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My apologies: I thought I replied to this yesterday but I never submitted my comment.
@bradenmacdonald: Yes, my concern was much more about the number of records that need to be updated (and any potential locking), and not the space usage.
- We can import tagged content without having to have a Taxonomy with these Tags in place before importing.
Yeah, what I was proposing assumes that we'd have local Taxonomies/Tags in place for import, because those taxonomies can be locally controlled at the scope the import is happening.
- Free-text tags can be anything the content author wants, and are not hierarchical.
So we don't store them asTag
s, we just create anObjectTag
with_value="free text value"
andtag=None
. This also avoids race conditions; we don't have to worry about creating both aTag
and anObjectTag
when free-text tagging an object.
I'm fuzzy on this part. It doesn't have to block this PR, but don't we have to worry about race conditions where we're making the same free text value ObjectTag for the same object multiple times? It seems like it would benefit from being a separate model altogether, so we could put a uniqueness constraints around the freeform tag and the thing being tagged, independent of a taxonomy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear: I'm fine with the proposal to keep the general shape of the data structures now and do filtering during reconciliation. We can always add in the local-taxonomy-copy thing later if it turns out we want it. The questions I'm asking now are just to improve my understanding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like it would benefit from being a separate model altogether, so we could put a uniqueness constraints around the freeform tag and the thing being tagged, independent of a taxonomy?
I'll give it some thought.. originally when I was doing the discovery, I thought we'd end up having to subclass all of these base classes, which would lead to lots and lots of classes, e.g. SystemTaxonomy, SystemTaxonomyTag, SystemTaxonomyObjectTag, ... And so I was trying to reduce the number of classes used at the base here. But now that I've gotten into the ContentTaxonomy implementation, I didn't even need to create a ContentTag class, because all of the ObjectTag validation is handled by the taxonomy. We'll still need a FreeFormTaxonomy of some sort, even if it has no tags, because that model is where the namespace for the object tags are stored.
Unfortunately we're running short on time for this phase 1, and so I'd rather not tweak it too much more until we actually have a frontend that's using it. But if I find that the frontend is having to do a lot of `if freeform then do this else do something else" before calling the backend, then separating them out will make much more sense.
Thank you for all your reviews and advice, @ChrisChV @ormsbee and @bradenmacdonald ! I think this is ready to merge now -- if you agree, could someone please squash+merge this? I haven't squashed and rebased the commits here to keep the conversations working, but I can do that if it's preferred. |
1d01dfc
to
80060df
Compare
@pomegranited: Could you please resolve merge conflicts? (I think it's fine for you to squash.) |
198bea7
to
1551cea
Compare
Also: * Adds compile-requirements make target and updates (without upgrading) the requirements, which pulled in some other requirements, since it's been a while since they were compiled. * Adds ddt as a test dependency. * Replaces the placeholder TagContent class, and squashes migrations down to a single file * use the openedx_learning.lib multi-collation char and text fields
Also: * Adds rules requirement and app settings to enable it * Adds mock to test requirements, so we can test system taxonomy rules * ADR: Clarifies that rules will be enforced in the views, not the model or APIs
1551cea
to
ac388bb
Compare
Tests are failing in openedx on sqlite.
@pomegranited: is this okay for me to rebase+merge now? |
@ormsbee Yes please! Sorry for the delay, I didn't think you were even up :) |
@pomegranited 🎉 Your pull request was merged! Please take a moment to answer a two question survey so we can improve your experience in the future. |
I'm a parent. I keep weird hours sometimes. 😉 |
Description
Implements the Taxonomy, Tag, and ObjectTag models, APIs, and permissions described in the discovery document for the tagging app.
Supporting information
Relates to openedx/modular-learning#63
Testing instructions
Manual testing is possible but not very useful, since we don't yet have views for these models and APIs.
Other information
The
import_tags
,export_tags
andautocomplete_tags
will be added by a subsequents PRsSystemTaxonomy will be added by a subsequent PR as a subclass of Taxonomy.
ContentTaxonomy and ContentTag will subclass these models, and will be implemented in edx-platform
openedx.features.content_tagging
.To do before merge
0001_initial.py