Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combine RawContent and TextContent into Content #149

Merged
merged 7 commits into from
Feb 9, 2024

Conversation

ormsbee
Copy link
Contributor

@ormsbee ormsbee commented Feb 1, 2024

This is mostly pulling together RawContent and TextContent into a unified Content model in order to reduce confusion and not force text to always go to a file-based storage backend.

Other things that may also be a part of this PR as issues I've noticed along the way:

  • reduced the primary key for LearningPackage to 4-bytes to reduce the size of indexes.
  • maybe eliminate the FileField to save space (and use the low level storages API instead)
  • fix a potential cache corruption issue with lru in rollback situations.

@ormsbee
Copy link
Contributor Author

ormsbee commented Feb 1, 2024

Okay, thinking on this a little more–the FileField is pretty much entirely redundant, as well as being one of the biggest parts of the table. The storage location is based entirely on the LearningPackage UUID + the Content hash digest. I'm going to see if it's cumbersome to make it a boolean flag + accessor method.

@ormsbee
Copy link
Contributor Author

ormsbee commented Feb 3, 2024

Other self notes before I forget:

  • The text should be null, not blank, when the data is only in the file.
  • In the future, we might want to allow a blob field in Content for compressed text (e.g. using zlib compression). This could save us a lot of space, at the expense of making it more difficult to query. MySQL has a compressed row type sounds perfect for the usage pattern Content has, but we can't use it in RDS Aurora.

@ormsbee
Copy link
Contributor Author

ormsbee commented Feb 3, 2024

The compression thing came to mind because I was looking through some example course data and there are a handful of Capa problems that weigh in at ~13-14 KB. But when compressed with zlib, that goes down to about 2K–the larger problems tend to be that way because they have a lot of Python code and HTML table markup, both of which compress really well.

There are HTML blocks that are dramatically larger than this, but that's because they're encoding images into the raw HTML using base64-encoded data URLs (<img src="image/png;base64,....>"). We are definitely not supporting that.

@ormsbee
Copy link
Contributor Author

ormsbee commented Feb 4, 2024

Okay, I poked into the compression thing just a little bit further. I'm going to stop now because it's not critical to get in for the short term, but I have a general plan for it:

  1. Rename the text field to uncompressed_text.
  2. Create a new BinaryField for compressed_text.
  3. Create a cached property text that knows how to switch between the two.

At the time of write, we run zlib compression on the text and decide whether to use the compressed or uncompressed field for this row. The other field is left null. When we first introduce this feature, we can run it as a data migration, though that wouldn't be a requirement.

Pruning is still the more important feature for controlling the content size growth.

@ormsbee ormsbee marked this pull request as ready for review February 6, 2024 03:50
@ormsbee
Copy link
Contributor Author

ormsbee commented Feb 6, 2024

@bradenmacdonald, @kdmccormick: A little later than I had hoped, but it's ready for real review.

@ormsbee
Copy link
Contributor Author

ormsbee commented Feb 6, 2024

@bradenmacdonald, @kdmccormick: Do you folks know what this mypy error is about by any chance?

openedx_learning/core/contents/api.py:152: error: Incompatible type for "text" of "Content" (got "None", expected "Union[str, Combinable]") [misc]

@kdmccormick
Copy link
Member

@ormsbee It's telling you that when constructing a Content model instance, you can't set text=None; it wants you to set it to an actual str instance. Would it make sense to do text="" here, or is that fundamentally different?

You can ignore the Combinable thing--that's like F and Q objects.

@ormsbee
Copy link
Contributor Author

ormsbee commented Feb 6, 2024

@ormsbee It's telling you that when constructing a Content model instance, you can't set text=None; it wants you to set it to an actual str instance. Would it make sense to do text="" here, or is that fundamentally different?

It's fundamentally different in this case. text=None means "there is no text representation in the database for this Content–it only exists in the file store". text="" means "there is a text representation in the database for this Content, and that happens to be an empty string".

But it's a nullable field, so it should be permitted. Does the type-checker just not accept nullable text fields?

@kdmccormick
Copy link
Member

But it's a nullable field, so it should be permitted. Does the type-checker just not accept nullable text fields?

That would surprise me, so I tested it out by changing the definition of text to a regular TextField:

#    text = MultiCollationTextField(
    text = models.TextField(
        blank=True,
        null=True,
        max_length=MAX_TEXT_LENGTH,
        # We don't really expect to ever sort by the text column, but we may
        # want to do case-insensitive searches, so it's useful to have a case
        # and accent insensitive collation.
#        db_collations={
#            "sqlite": "NOCASE",
#            "mysql": "utf8mb4_unicode_ci",
#        }
    )

and that type-checked fine (except for a new error that pops up on openedx_learning/core/components/admin.py:167, where I think you need to change content_obj.text to context_obj.text or "").

So, I think the issue that the field-value type argument (specifically, str|None) isn't getting passed through MultiCollationTextField into TextField for some reason. It's possible that we need to define MultiCollationTextField as a generic class, although I'll need to play with that a bit to figure out the right syntax.

If that ends up blocking this PR, you could hack around the error for now by adding a type annotation directly to text:

    # TextField type args are [TypeForGetting,TypeForSetting]
    text: models.TextField[str|None, str|None] = MultiCollationTextField(
        blank=True,
        null=True,
        max_length=MAX_TEXT_LENGTH,
        # We don't really expect to ever sort by the text column, but we may
        # want to do case-insensitive searches, so it's useful to have a case
        # and accent insensitive collation.
        db_collations={
            "sqlite": "NOCASE",
            "mysql": "utf8mb4_unicode_ci",
        }
    )

@kdmccormick
Copy link
Member

kdmccormick commented Feb 6, 2024

@ormsbee if you want to just work around this for now using text: models.TextField[str|None, str|None], I opened #152, which the type-checking nerd in me would be happy to work through at some later date.

The only reason I balk at just using # type: ignore is that, if left unresolved, we'd lose some type safety in edx-platform wherever .text is referenced, since the field value's type is in fact str|None, not str, even if mypy thinks it's the latter.

@kdmccormick kdmccormick self-requested a review February 6, 2024 16:36
@ormsbee
Copy link
Contributor Author

ormsbee commented Feb 6, 2024

@kdmccormick: I respect and appreciate your inner type-checking nerd. I'll use the workaround you suggested. Thank you.

Copy link
Member

@kdmccormick kdmccormick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The general shape of the refactoring looks great.

I'm about 2/3rds through; I'll leave the rest of my review after lunch.

openedx_learning/core/components/api.py Show resolved Hide resolved
openedx_learning/core/components/api.py Show resolved Hide resolved
openedx_learning/core/components/api.py Outdated Show resolved Hide resolved
openedx_learning/core/contents/api.py Show resolved Hide resolved
openedx_learning/core/contents/api.py Outdated Show resolved Hide resolved
openedx_learning/core/contents/api.py Show resolved Hide resolved
openedx_learning/core/contents/api.py Outdated Show resolved Hide resolved
Copy link
Member

@kdmccormick kdmccormick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alrighty, all I have is a bunch of docstring nits.

I'm sure my next review will be a ✅ , so feel free to merge if someone beats me to it.

openedx_learning/core/contents/models.py Outdated Show resolved Hide resolved
openedx_learning/__init__.py Outdated Show resolved Hide resolved
openedx_learning/core/contents/models.py Show resolved Hide resolved
openedx_learning/core/contents/models.py Outdated Show resolved Hide resolved
openedx_learning/core/contents/models.py Outdated Show resolved Hide resolved
openedx_learning/core/contents/models.py Outdated Show resolved Hide resolved
openedx_learning/core/contents/models.py Outdated Show resolved Hide resolved
openedx_learning/core/contents/models.py Outdated Show resolved Hide resolved
openedx_learning/core/publishing/models.py Outdated Show resolved Hide resolved
@ormsbee ormsbee self-assigned this Feb 8, 2024
@ormsbee
Copy link
Contributor Author

ormsbee commented Feb 9, 2024

@kdmccormick: Incorporated all your suggestions except this one on get_component_by_key/component_exists_by_key.

Copy link
Member

@kdmccormick kdmccormick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@ormsbee ormsbee merged commit b386858 into openedx:main Feb 9, 2024
7 checks passed
@ormsbee ormsbee deleted the contents-refactoring-3 branch February 9, 2024 16:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

None yet

2 participants