Combine RawContent and TextContent into Content #149

ormsbee · 2024-02-01T21:40:56Z

This is mostly pulling together RawContent and TextContent into a unified Content model in order to reduce confusion and not force text to always go to a file-based storage backend.

Other things that may also become part of this PR, based on issues I've noticed along the way:

Reduced the primary key for LearningPackage to 4 bytes to reduce the size of indexes.
Maybe eliminate the FileField to save space (and use the low-level storages API instead).
Fix a potential cache corruption issue with lru in rollback situations.

ormsbee · 2024-02-01T21:59:52Z

Okay, thinking on this a little more–the FileField is pretty much entirely redundant, as well as being one of the biggest parts of the table. The storage location is based entirely on the LearningPackage UUID + the Content hash digest. I'm going to see if it's cumbersome to make it a boolean flag + accessor method.
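A minimal sketch of the "boolean flag + accessor method" idea, in plain Python rather than as a Django model. The class name, the `has_file` flag, the `file_path` method, and the directory layout are all assumptions for illustration; the comment above only says that the location is derived from the LearningPackage UUID and the Content hash digest.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional
from uuid import UUID


@dataclass(frozen=True)
class ContentSketch:
    """Hypothetical stand-in for the Content model (not the real class)."""

    learning_package_uuid: UUID
    hash_digest: str
    has_file: bool  # a single boolean instead of a stored file path

    def file_path(self) -> Optional[PurePosixPath]:
        """Derive the storage path on demand instead of storing it per row."""
        if not self.has_file:
            return None
        # Assumed layout: <learning package uuid>/<content hash digest>
        return PurePosixPath(str(self.learning_package_uuid)) / self.hash_digest
```

Because the path is a pure function of two values the row already carries, nothing is lost by dropping the stored path column.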

ormsbee · 2024-02-03T01:54:45Z

Other self notes before I forget:

The text should be null, not blank, when the data is only in the file.
In the future, we might want to allow a blob field in Content for compressed text (e.g. using zlib compression). This could save us a lot of space, at the expense of making it more difficult to query. MySQL has a compressed row type that sounds perfect for the usage pattern Content has, but we can't use it in RDS Aurora.

ormsbee · 2024-02-03T19:09:48Z

The compression thing came to mind because I was looking through some example course data and there are a handful of Capa problems that weigh in at ~13-14 KB. But when compressed with zlib, that goes down to about 2 KB. The larger problems tend to be that way because they have a lot of Python code and HTML table markup, both of which compress really well.
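A quick way to see the effect on repetitive markup, using only the standard library. The table below is synthetic data made up for illustration, not one of the Capa problems mentioned above, so the exact ratio will differ from the real course data.

```python
import zlib

# Synthetic stand-in for a large problem: many near-identical HTML table rows.
rows = "".join(
    f"<tr><td>Item {i}</td><td>{i * 2}</td></tr>\n" for i in range(300)
)
markup = f"<table>\n{rows}</table>\n"

raw = markup.encode("utf-8")
compressed = zlib.compress(raw)

# The data round-trips losslessly through compress/decompress.
restored = zlib.decompress(compressed).decode("utf-8")

# Fraction of the original size that remains after compression.
ratio = len(compressed) / len(raw)
print(f"{len(raw)} bytes -> {len(compressed)} bytes ({ratio:.0%} of original)")
```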

There are HTML blocks that are dramatically larger than this, but that's because they're encoding images into the raw HTML using base64-encoded data URLs (<img src="data:image/png;base64,....">). We are definitely not supporting that.

ormsbee · 2024-02-04T05:05:57Z

Okay, I poked into the compression thing just a little bit further. I'm going to stop now because it's not critical to get in for the short term, but I have a general plan for it:

Rename the text field to uncompressed_text.
Create a new BinaryField for compressed_text.
Create a cached property text that knows how to switch between the two.

At the time of write, we run zlib compression on the text and decide whether to use the compressed or uncompressed field for this row. The other field is left null. When we first introduce this feature, we can run it as a data migration, though that wouldn't be a requirement.
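The plan above can be sketched roughly as follows, in plain Python rather than as Django model fields. The class name, the `from_text` constructor, and the "use whichever is smaller" rule are assumptions; the comment only says a decision is made at write time and that the unused field is left null.

```python
import zlib
from dataclasses import dataclass
from functools import cached_property
from typing import Optional


@dataclass
class ContentRowSketch:
    """Hypothetical row with the two storage fields described above."""

    uncompressed_text: Optional[str] = None
    compressed_text: Optional[bytes] = None

    @classmethod
    def from_text(cls, text: str) -> "ContentRowSketch":
        """Choose a field at write time; the other one stays None."""
        raw = text.encode("utf-8")
        packed = zlib.compress(raw)
        # Assumed rule: only store the compressed form if it is smaller.
        if len(packed) < len(raw):
            return cls(compressed_text=packed)
        return cls(uncompressed_text=text)

    @cached_property
    def text(self) -> Optional[str]:
        """Return the text from whichever field is populated."""
        if self.compressed_text is not None:
            return zlib.decompress(self.compressed_text).decode("utf-8")
        return self.uncompressed_text
```

Very short strings tend to grow under zlib because of its header and checksum overhead, which is one reason to keep the uncompressed field as a fallback.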

Pruning is still the more important feature for controlling the content size growth.

ormsbee · 2024-02-06T03:50:29Z

@bradenmacdonald, @kdmccormick: A little later than I had hoped, but it's ready for real review.

ormsbee · 2024-02-06T04:49:42Z

@bradenmacdonald, @kdmccormick: Do you folks know what this mypy error is about by any chance?

openedx_learning/core/contents/api.py:152: error: Incompatible type for "text" of "Content" (got "None", expected "Union[str, Combinable]") [misc]

kdmccormick · 2024-02-06T13:23:24Z

@ormsbee It's telling you that when constructing a Content model instance, you can't set text=None; it wants you to set it to an actual str instance. Would it make sense to do text="" here, or is that fundamentally different?

You can ignore the Combinable thing--that's like F and Q objects.

ormsbee · 2024-02-06T15:36:11Z

> @ormsbee It's telling you that when constructing a Content model instance, you can't set text=None; it wants you to set it to an actual str instance. Would it make sense to do text="" here, or is that fundamentally different?

It's fundamentally different in this case. text=None means "there is no text representation in the database for this Content–it only exists in the file store". text="" means "there is a text representation in the database for this Content, and that happens to be an empty string".
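To make the distinction concrete, here is a small sketch of how calling code might tell the two cases apart. The function names and the returned labels are made up for illustration and are not part of the Content API.

```python
from typing import Optional


def describe_text(text: Optional[str]) -> str:
    """Classify a Content.text value using the semantics described above."""
    if text is None:
        # No text representation in the database; data lives only in the file store.
        return "file-only"
    if text == "":
        # A text representation exists in the database and is an empty string.
        return "empty-in-db"
    return "text-in-db"


def display_text(text: Optional[str]) -> str:
    """For callers that need a str (e.g. for display), collapse None to ''."""
    return text or ""
```

Note that `display_text` deliberately throws away the None/"" distinction, so it is only appropriate where that distinction does not matter.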

But it's a nullable field, so it should be permitted. Does the type-checker just not accept nullable text fields?

kdmccormick · 2024-02-06T16:17:00Z

> But it's a nullable field, so it should be permitted. Does the type-checker just not accept nullable text fields?

That would surprise me, so I tested it out by changing the definition of text to a regular TextField:

#    text = MultiCollationTextField(
    text = models.TextField(
        blank=True,
        null=True,
        max_length=MAX_TEXT_LENGTH,
        # We don't really expect to ever sort by the text column, but we may
        # want to do case-insensitive searches, so it's useful to have a case
        # and accent insensitive collation.
#        db_collations={
#            "sqlite": "NOCASE",
#            "mysql": "utf8mb4_unicode_ci",
#        }
    )

and that type-checked fine (except for a new error that pops up on openedx_learning/core/components/admin.py:167, where I think you need to change content_obj.text to content_obj.text or "").

So, I think the issue is that the field-value type argument (specifically, str|None) isn't getting passed through MultiCollationTextField into TextField for some reason. It's possible that we need to define MultiCollationTextField as a generic class, although I'll need to play with that a bit to figure out the right syntax.

If that ends up blocking this PR, you could hack around the error for now by adding a type annotation directly to text:

    # TextField type args are [TypeForSetting, TypeForGetting]
    text: models.TextField[str|None, str|None] = MultiCollationTextField(
        blank=True,
        null=True,
        max_length=MAX_TEXT_LENGTH,
        # We don't really expect to ever sort by the text column, but we may
        # want to do case-insensitive searches, so it's useful to have a case
        # and accent insensitive collation.
        db_collations={
            "sqlite": "NOCASE",
            "mysql": "utf8mb4_unicode_ci",
        }
    )
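As a toy illustration of the "generic class" idea, the sketch below uses plain `typing` generics and a descriptor, with no Django or django-stubs involved. All class names are hypothetical stand-ins, and whether the real fix for MultiCollationTextField looks like this is still an open question in this thread.

```python
from typing import Any, Generic, Optional, TypeVar

ST = TypeVar("ST")  # type accepted when setting the attribute
GT = TypeVar("GT")  # type returned when getting the attribute


class FieldSketch(Generic[ST, GT]):
    """Toy descriptor standing in for a typed model field."""

    def __set_name__(self, owner: type, name: str) -> None:
        self._attr = "_" + name

    def __get__(self, instance: Any, owner: Optional[type] = None) -> GT:
        if instance is None:
            return self  # type: ignore[return-value]  # class-level access
        return getattr(instance, self._attr)

    def __set__(self, instance: Any, value: ST) -> None:
        setattr(instance, self._attr, value)


class TextFieldSketch(FieldSketch[ST, GT]):
    """Stand-in for a TextField that is generic over set/get types."""


class MultiCollationTextFieldSketch(TextFieldSketch[ST, GT]):
    """Subclass that re-declares the type parameters so they pass through."""


class TypedContentSketch:
    # Same shape as the workaround above: annotate the attribute explicitly.
    text: TextFieldSketch[Optional[str], Optional[str]] = (
        MultiCollationTextFieldSketch()
    )
```

The point of the sketch is only that a subclass has to carry the type parameters forward for an annotation like `[Optional[str], Optional[str]]` to mean anything to a type checker.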

kdmccormick · 2024-02-06T16:32:51Z

@ormsbee if you want to just work around this for now using text: models.TextField[str|None, str|None], I opened #152, which the type-checking nerd in me would be happy to work through at some later date.

The only reason I balk at just using # type: ignore is that, if left unresolved, we'd lose some type safety in edx-platform wherever .text is referenced, since the field value's type is in fact str|None, not str, even if mypy thinks it's the latter.

ormsbee · 2024-02-06T16:41:17Z

@kdmccormick: I respect and appreciate your inner type-checking nerd. I'll use the workaround you suggested. Thank you.

kdmccormick

The general shape of the refactoring looks great.

I'm about 2/3rds through; I'll leave the rest of my review after lunch.

openedx_learning/core/components/api.py

openedx_learning/core/contents/api.py

kdmccormick

Alrighty, all I have is a bunch of docstring nits.

I'm sure my next review will be a ✅ , so feel free to merge if someone beats me to it.

openedx_learning/core/contents/models.py

openedx_learning/__init__.py

openedx_learning/core/contents/models.py

openedx_learning/core/publishing/models.py

ormsbee · 2024-02-09T04:57:15Z

@kdmccormick: Incorporated all your suggestions except this one on get_component_by_key/component_exists_by_key.

kdmccormick

🚀

This was referenced Feb 2, 2024

Discovery: Determine how v2 Content Libraries can make use of Learning Core #30

Closed

[DEPR]: Blockstore openedx/public-engineering#238

Open

ormsbee force-pushed the contents-refactoring-3 branch from f2cd20a to 6c1f89b Compare February 6, 2024 03:49

ormsbee marked this pull request as ready for review February 6, 2024 03:50

ormsbee mentioned this pull request Feb 6, 2024

Switch v2 libraries to Learning Core data models openedx/edx-platform#34066

Merged

6 tasks

ormsbee force-pushed the contents-refactoring-3 branch from dd74059 to ffad90c Compare February 6, 2024 16:02

kdmccormick mentioned this pull request Feb 6, 2024

Nullable MultiCollationTextField type-checking #152

Open

kdmccormick self-requested a review February 6, 2024 16:36

ormsbee mentioned this pull request Feb 7, 2024

Add compression support for Content #153

Open

kdmccormick reviewed Feb 7, 2024

kdmccormick reviewed Feb 8, 2024

ormsbee self-assigned this Feb 8, 2024

ormsbee added 7 commits February 8, 2024 23:55

refactor: merge RawContent and TextContent into Content

fad495d

fix: forgot to set the default for file creation back to False after debugging

fba30b9

chore: remove unused import, add some comments

39ba96a

fix: fixup a number of linter violations

1e84e78

chore: remove misc. references to TextContent and FileContent

2a6603a

refactor: change some of the doc strings around file access

8a074eb

refactor: incorporate suggestions from Kyle's review

fdc11ee

ormsbee force-pushed the contents-refactoring-3 branch from 1e17b5d to fdc11ee Compare February 9, 2024 04:56

kdmccormick approved these changes Feb 9, 2024

ormsbee merged commit b386858 into openedx:main Feb 9, 2024
7 checks passed

ormsbee deleted the contents-refactoring-3 branch February 9, 2024 16:24
