-
Notifications
You must be signed in to change notification settings - Fork 74
Allow ContentNode class to take a uri argument for its default file #593
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow ContentNode class to take a uri argument for its default file #593
Conversation
1f981f2 to
71fbbbf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
71fbbbf to
8a0149c
Compare
|
""" WalkthroughThe updates enhance the Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant ContentNode
participant FilePipeline
participant MetadataExtractor
User->>ContentNode: Create ContentNode(uri, pipeline)
User->>ContentNode: process_files()
alt uri is set
ContentNode->>FilePipeline: execute(uri)
FilePipeline->>MetadataExtractor: extract metadata
MetadataExtractor-->>FilePipeline: FileMetadata (with content_node_metadata)
FilePipeline-->>ContentNode: file metadata list
ContentNode->>ContentNode: Update attributes from metadata
ContentNode->>ContentNode: Validate node (kind/activity inference)
end
ContentNode-->>User: Files processed, node validated
Poem
Tip ⚡️ Faster reviews with caching
Enjoy the performance boost—your workflow just got faster. ✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (5)
ricecooker/utils/pipeline/extract_metadata.py (2)
39-44: Properly setting the inferred kind in metadata.
Creating or retrievingcontent_node_metadataand assigning the inferredkindis a good approach. Consider verifying that all existing usage ofcontent_node_metadatais up-to-date with this new field.
67-69: HTML5 metadata extraction class introduced.
Defining a distinctHTML5MetadataExtractorand registering it ensures that HTML5 files are properly detected. Adding a short docstring could further clarify its usage.ricecooker/classes/nodes.py (3)
100-102:__repr__duplicates__str__.
Having__repr__return the same string as__str__is generally acceptable but can be less informative for debugging. Consider adding distinct info in__repr__if needed for advanced debugging.
616-616: Side-effect invalidate.
Callinginfer_learning_activities()insidevalidate()is acceptable, but note that it mutates state as part of validation. Making it a pre-validation step or clarifying the design can reduce confusion.
833-864: Processing files via the pipeline.
This block implements robust file handling, merging metadata, and re-checking validity. Consider verifying what happens whenpipeline.executereturns an empty list or multiple items with conflicting metadata. Otherwise, this is a solid approach.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
ricecooker/classes/nodes.py(11 hunks)ricecooker/utils/pipeline/__init__.py(1 hunks)ricecooker/utils/pipeline/context.py(2 hunks)ricecooker/utils/pipeline/extract_metadata.py(5 hunks)tests/test_files.py(1 hunks)tests/test_tree.py(4 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (3)
ricecooker/utils/pipeline/extract_metadata.py (1)
ricecooker/utils/pipeline/context.py (1)
ContentNodeMetadata(14-37)
tests/test_tree.py (3)
ricecooker/classes/nodes.py (6)
ContentNode(743-900)pipeline(787-790)infer_learning_activities(508-511)process_files(192-216)process_files(833-863)process_files(1342-1356)ricecooker/utils/pipeline/__init__.py (1)
FilePipeline(12-76)tests/conftest.py (5)
video_file(359-370)audio_file(482-494)document_file(558-568)epub_file(614-621)html_file(672-684)
ricecooker/utils/pipeline/context.py (1)
ricecooker/classes/nodes.py (5)
to_dict(117-121)to_dict(402-426)to_dict(575-609)to_dict(865-900)to_dict(1713-1726)
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: Python unit tests (ubuntu-latest, 3.9)
- GitHub Check: Python unit tests (windows-latest, 3.12)
- GitHub Check: Python unit tests (windows-latest, 3.11)
- GitHub Check: Python unit tests (windows-latest, 3.10)
- GitHub Check: Python unit tests (windows-latest, 3.9)
🔇 Additional comments (28)
ricecooker/utils/pipeline/context.py (5)
13-37: Well-structuredContentNodeMetadatadataclass.This new dataclass provides a clean, strongly-typed container for all content node metadata attributes. It aligns well with the metadata fields used in the node serialization methods in
ricecooker/classes/nodes.py.
40-46: Good implementation of recursive dictionary update.The
_recursive_updatehelper function properly handles nested dictionaries during merging, ensuring that nested fields likeextra_fieldsaren't completely overwritten but rather updated recursively.
59-59: Good extension ofFileMetadataclass.Adding the optional
content_node_metadatafield enables the pipeline to propagate node-specific metadata through the file processing stages.
62-64: Improved dictionary serialization.Using a custom dict factory with
asdictthat excludesNonevalues results in cleaner output dictionaries without unnecessary null values.
71-71: Enhanced merge method with deep dictionary updates.Replacing the simple update with the recursive update ensures that nested structures like
content_node_metadataare properly merged rather than overwritten.tests/test_files.py (1)
1327-1331: Properly updated test expectations for HTML5 cache keys.The test now correctly expects an additional cache key for metadata extraction, which aligns with the enhanced pipeline that now extracts metadata from HTML5 files.
ricecooker/utils/pipeline/__init__.py (1)
52-52: Improved robustness with context initialization.This defensive programming practice ensures that
contextis always a dictionary, preventing potentialTypeErrorexceptions when operations likedeepcopyandupdateare performed on it later in the method.tests/test_tree.py (6)
699-699: Explicit call to infer_learning_activities.The test has been updated to explicitly call the
infer_learning_activities()method, correctly reflecting the refactored code where learning activities inference is now a separate step.
713-725: Good test for video content type inference.This test verifies that a
ContentNodewith a video file URI correctly infers its kind asVIDEOand its learning activity asWATCHafter processing through the pipeline.
727-739: Good test for audio content type inference.This test verifies that a
ContentNodewith an audio file URI correctly infers its kind asAUDIOand its learning activity asLISTENafter processing through the pipeline.
741-753: Good test for document content type inference.This test verifies that a
ContentNodewith a document file URI correctly infers its kind asDOCUMENTand its learning activity asREADafter processing through the pipeline.
755-767: Good test for EPUB content type inference.This test verifies that a
ContentNodewith an EPUB file URI correctly infers its kind asDOCUMENTand its learning activity asREADafter processing through the pipeline.
769-780: Good test for HTML5 content type inference.This test verifies that a
ContentNodewith an HTML5 zip file URI correctly infers its kind asHTML5and its learning activity asEXPLOREafter processing through the pipeline.ricecooker/utils/pipeline/extract_metadata.py (4)
6-6: Good import usage.
The importedContentNodeMetadatais correctly used below, and there's no redundancy.
20-20: Appropriate mapping for HTML5 extension.
IncludingHTML5inPRESETS_FROM_EXTENSIONSis consistent with the rest of the file, ensuring the correct preset is inferred.
23-24: Useful preset-kind mapping.
KIND_FROM_PRESETprovides a straightforward approach to retrieving the content kind. No issues spotted.
93-93: Integration of HTML5 extractor into default stage.
AddingHTML5MetadataExtractortoDEFAULT_CHILDRENsuits the pipeline’s modular design, ensuring automatic processing for HTML5 content.ricecooker/classes/nodes.py (11)
23-26: New imports for file and pipeline exceptions.
Bringing inFile,ExpectedFileException, andInvalidFileExceptionis necessary for the new URI-based logic below. No concerns here.
48-48: Addedkind = None.
This ensures eachNodestarts with a default. Just confirm all subclasses correctly set their ownkind.
181-189: Thumbnail preset logic.
This approach of finding a preset withthumbnailand matchingkindis concise. EnsureStopIterationis handled properly in returningNone.
483-483: Expanded constructor signature with**kwargs.
Accepting arbitrary keyword arguments can improve extensibility, but double-check that unexpected parameters are handled or validated to avoid silent errors.
508-512: Activity inference is straightforward.
Automatically determininglearning_activitiesbased onkindis helpful. This logic is easy to follow and consistent withkind_activity_map.
762-763: Accurate docstring for new params.
Explicitly documentinguriandpipelineis helpful for maintainers.
773-774: Constructor acceptsuriandpipeline.
Passing these arguments is a nice design for flexible initialization.
782-784: Storing URI and pipeline references.
Straightforward assignment. Make sure to handle scenarios whereuriis empty or invalid.
786-791: Defaulting pipeline toconfig.FILE_PIPELINE.
This fallback is convenient. Just confirmconfig.FILE_PIPELINEis always defined in contexts where the content node is used.
810-819: Validating the URI.
Raising anInvalidNodeExceptionif the pipeline cannot handle the URI is a clean solution. No issues noted.
829-830: Conditional URI validation.
Skipping_validate_uri()whenuriis not set avoids unnecessary checks. Looks good.
8a0149c to
799b9e5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (8)
ricecooker/classes/files.py (2)
93-97: Guard againstNonevalues in the f-string for clearer error reports
self.filenameorself.original_filenamemay still beNoneat this point (e.g. when the pipeline fails early).
If either isNone, the current f-string will render"None"which is not helpful for debugging.- raise NotImplementedError( - f"preset must be set if preset isn't specified when creating File object for file {self.filename} ({self.original_filename})" - ) + raise NotImplementedError( + "preset must be set if preset isn't specified when creating " + f"File object for file {self.filename or '<unknown>'} " + f"({self.original_filename or '<unknown>'})" + )
198-208: Secondvalidate()call may double-check unchanged data
self.validate()is executed both before and after the pipeline.
Unless the pipeline mutates attributes that affect validation, the second call is redundant and may cost noticeable time for large downloads.If you only need to verify the new metadata, consider limiting the post-pipeline check to the subset that actually changes, or gate it behind
if metadata:.ricecooker/utils/pipeline/transfer.py (1)
276-283:skip_cacheparam added – add docstring & default propagationThe new parameter is great, but the method’s docstring and type signature comment block weren’t updated to mention it.
Please document that settingskip_cache=Trueforces a re-download and that callers should propagate the value to child handlers (as you already do).ricecooker/utils/pipeline/file_handler.py (2)
181-208:skip_cachelogic works, but inverse flag name could avoid confusionThe function now receives
skip_cache, yet the surrounding codebase still exportsconfig.UPDATE(meaning do update).
Using a positive flag (use_cache=True) would read more clearly:if file_metadata and use_cache and not self.cached_file_outdated(...):Not a blocker, but consider renaming in a future breaking release to avoid double-negatives like
if not skip_cache and ….
304-313: Avoid repeating handler discovery code – reuse parent implementation
FirstHandlerOnly.executeduplicates the child-scanning logic that already lives inCompositeHandler.should_handle.
You could delegate directly:- for handler in self._children: - if handler.should_handle(path): - return handler.execute(path, context=context, skip_cache=skip_cache) + chosen = next((h for h in self._children if h.should_handle(path)), None) + return chosen.execute(path, context=context, skip_cache=skip_cache) if chosen else []Minor readability gain.
ricecooker/classes/nodes.py (3)
75-79: Duplicatekindclass attribute definition
kindis defined twice consecutively (lines 75 and 78).
The second definition silently overwrites the first and is a potential maintenance trap.- kind = None - license = None - language = None - kind = None + kind = None + license = None + language = None
922-929: Instantiate specificFilesubclasses when possibleCreating every downloaded file as
File(**metadata)loses important behaviour (e.g. preset-specific validation,is_primaryflags).
Consider mapping the returned preset to the concrete subclass (VideoFile, AudioFile, …) via a factory so that downstream logic (thumbnails, size limits, etc.) still work.
1049-1064: Combineisinstancecalls for readability (ruff SIM101)- if isinstance(file, SubtitleFile) or isinstance(file, YouTubeSubtitleFile): + if isinstance(file, (SubtitleFile, YouTubeSubtitleFile)):Tiny readability improvement, silences the static-analysis warning.
🧰 Tools
🪛 Ruff (0.8.2)
1049-1049: Multiple
isinstancecalls forfile, merge into a single callMerge
isinstancecalls forfile(SIM101)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
ricecooker/classes/files.py(2 hunks)ricecooker/classes/licenses.py(1 hunks)ricecooker/classes/nodes.py(25 hunks)ricecooker/utils/pipeline/__init__.py(2 hunks)ricecooker/utils/pipeline/context.py(2 hunks)ricecooker/utils/pipeline/extract_metadata.py(5 hunks)ricecooker/utils/pipeline/file_handler.py(5 hunks)ricecooker/utils/pipeline/transfer.py(1 hunks)tests/test_files.py(1 hunks)tests/test_tree.py(4 hunks)
✅ Files skipped from review due to trivial changes (1)
- ricecooker/classes/licenses.py
🚧 Files skipped from review as they are similar to previous changes (5)
- ricecooker/utils/pipeline/init.py
- tests/test_files.py
- ricecooker/utils/pipeline/context.py
- ricecooker/utils/pipeline/extract_metadata.py
- tests/test_tree.py
🧰 Additional context used
🪛 Ruff (0.8.2)
ricecooker/classes/nodes.py
104-104: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)
1049-1049: Multiple isinstance calls for file, merge into a single call
Merge isinstance calls for file
(SIM101)
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: Python unit tests (ubuntu-latest, 3.9)
- GitHub Check: Python unit tests (windows-latest, 3.12)
- GitHub Check: Python unit tests (windows-latest, 3.11)
- GitHub Check: Python unit tests (windows-latest, 3.10)
- GitHub Check: Python unit tests (windows-latest, 3.9)
|
To handle @coderabbitai's feedback and more, after discussion with @marcellamaki, I will be making some updates here to focus validation on the file presets, rather than the file classes. In order to have full compatibility with the URI based mechanism, and ensuring any preset inference has happened already, this should only happen after process_files has been called. |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 6472 🏁 Script executed: Length of output: 6542 @rtibbles This approach makes a lot of sense! Focusing on file presets rather than file classes will create more consistent validation, especially when files are created dynamically from URIs. Moving the validation to happen after
Would you like me to help sketch out how this implementation might look? I'd be happy to assist with the code changes to shift validation from file types to presets, and ensure it happens at the right point in the processing flow. |
799b9e5 to
6f43b61
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
tests/test_exercises.py (1)
138-146: 🛠️ Refactor suggestionAvoid re-using the same
exerciseinstance across negative-case assertionsInside a single test you mutate
exercise.extra_fieldstwice and then run two
independent error scenarios ("3.0" / "5.1"and"three" / "five").
Because the firstexercise.process_files()aborts with an exception midway,
the object state is left in an indeterminate condition (e.g. partially-parsed
extra_fields, cached files, etc.).
Running the second scenario on the same object can therefore hide or create
bugs that will never occur in production, where a fresh instance is processed
only once.-# current pattern exercise.extra_fields = {...} with pytest.raises(InvalidNodeException): exercise.process_files() ... exercise.extra_fields = {...} with pytest.raises(InvalidNodeException): exercise.process_files() + +# safer pattern (parametrised) +for bad_fields in [{"m": "3.0", "n": "5.1"}, + {"m": "three", "n": "five"}]: + ex = ExerciseNode(... fresh args ...) # or copy.deepcopy(exercise) + ex.extra_fields = {"mastery_model": exercises.M_OF_N, **bad_fields} + with pytest.raises(InvalidNodeException): + ex.process_files()This guarantees isolation between scenarios and keeps the intent of the test
clear.
🧹 Nitpick comments (2)
tests/test_data.py (1)
58-70: Consider also asserting thatvalidate()fails after the file-processing errorThe new contract (raising at
process_files) is correctly captured, but the
test no longer checks that callingvalidate()afterwards would also fail.
Ifprocess_filesswallows the error internally and the object is left in a
state wherevalidate()passes, an invalid node could slip through later
stages of the pipeline.with pytest.raises(InvalidNodeException): contentnode_invalid_files.process_files() with pytest.raises(InvalidNodeException): contentnode_invalid_files.validate()Duplicating the assertion is inexpensive and ensures both entry points are
consistent.ricecooker/classes/nodes.py (1)
1049-1069: Improved subtitle handling in VideoNodeThe updated validation method for VideoNode now filters duplicate subtitle files by language code, keeping only the first subtitle file for each language.
The instanceof check could be simplified:
- if isinstance(file, SubtitleFile) or isinstance(file, YouTubeSubtitleFile): + if isinstance(file, (SubtitleFile, YouTubeSubtitleFile)):This is a minor optimization but follows Python best practices.
🧰 Tools
🪛 Ruff (0.8.2)
1053-1053: Multiple
isinstancecalls forfile, merge into a single callMerge
isinstancecalls forfile(SIM101)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
ricecooker/classes/files.py(9 hunks)ricecooker/classes/licenses.py(1 hunks)ricecooker/classes/nodes.py(25 hunks)ricecooker/commands.py(1 hunks)ricecooker/managers/tree.py(3 hunks)ricecooker/utils/pipeline/__init__.py(2 hunks)ricecooker/utils/pipeline/context.py(2 hunks)ricecooker/utils/pipeline/extract_metadata.py(5 hunks)ricecooker/utils/pipeline/file_handler.py(5 hunks)ricecooker/utils/pipeline/transfer.py(1 hunks)tests/test_data.py(1 hunks)tests/test_exercises.py(2 hunks)tests/test_files.py(1 hunks)tests/test_thumbnails.py(2 hunks)tests/test_tree.py(4 hunks)
✅ Files skipped from review due to trivial changes (1)
- ricecooker/classes/licenses.py
🚧 Files skipped from review as they are similar to previous changes (6)
- tests/test_files.py
- ricecooker/utils/pipeline/transfer.py
- ricecooker/utils/pipeline/extract_metadata.py
- ricecooker/utils/pipeline/context.py
- ricecooker/utils/pipeline/file_handler.py
- ricecooker/utils/pipeline/init.py
🧰 Additional context used
🧬 Code Graph Analysis (3)
tests/test_data.py (3)
ricecooker/exceptions.py (1)
InvalidNodeException(60-64)tests/conftest.py (10)
contentnode_invalid_files(341-344)video(414-421)video_invalid_files(425-431)audio(519-526)audio_invalid_files(530-536)document(593-600)document_invalid_files(604-610)html(709-716)html_invalid_files(720-729)html_invalid_zip(747-753)ricecooker/classes/nodes.py (9)
process_files(275-299)process_files(949-957)process_files(1273-1287)validate(422-556)validate(658-667)validate(877-919)validate(1043-1069)validate(1321-1363)validate(1558-1574)
tests/test_exercises.py (1)
ricecooker/exceptions.py (1)
InvalidNodeException(60-64)
tests/test_thumbnails.py (2)
ricecooker/classes/files.py (9)
ExtractedEPubThumbnailFile(511-516)ExtractedHTMLZipThumbnailFile(519-524)ExtractedPdfThumbnailFile(503-508)ExtractedVideoThumbnailFile(527-532)process_file(167-168)process_file(198-219)process_file(417-429)process_file(468-490)process_file(548-551)tests/conftest.py (4)
invalid_document_file(634-647)invalid_epub_file(651-664)html_invalid_file(733-743)invalid_video_file(435-444)
🪛 Ruff (0.8.2)
ricecooker/classes/nodes.py
97-97: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)
1053-1053: Multiple isinstance calls for file, merge into a single call
Merge isinstance calls for file
(SIM101)
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: Python unit tests (ubuntu-latest, 3.9)
- GitHub Check: Python unit tests (windows-latest, 3.12)
- GitHub Check: Python unit tests (windows-latest, 3.11)
- GitHub Check: Python unit tests (windows-latest, 3.10)
- GitHub Check: Python unit tests (windows-latest, 3.9)
🔇 Additional comments (38)
ricecooker/commands.py (1)
279-282: Change looks good – call signature now matchesChannelManager.process_tree
tree.process_tree()no longer requires thechannelargument, and the helper
continues to return(files_to_diff, FAILED_FILES)as expected by
set_files. No further action needed.ricecooker/managers/tree.py (1)
50-61: Concurrent removal of nodes can introduce subtle race conditions
node.remove_from_parent()is executed inside the thread-pool worker while
other workers may still be iterating over or modifying the same parent’s
childrenlist. Although the GIL protects against seg-faults, logical races
(e.g. lost updates, inconsistent ordering) are possible and hard to debug.Consider one of:
- Mark nodes for removal in a thread-safe queue, then prune the tree once the
executor finishes; or- Protect
remove_from_parentwith a per-parentthreading.Lock.This guarantees the channel structure remains coherent for all subsequent
steps (upload, publish, etc.).tests/test_thumbnails.py (6)
11-15: Good refactoring of imports to support direct testing of extracted thumbnail filesThe imports have been updated to use the extracted thumbnail file classes directly, which aligns with the changes in the implementation that focuses on preset-based validation rather than file classes.
258-263: LGTM: Simplified thumbnail tests to test extractors directlyThis refactoring improves the test by directly testing the
ExtractedPdfThumbnailFileclass rather than going through node creation and file attachment. This makes the tests more focused and directly aligned with the implementation changes.
266-271: Improved test isolation and clarityTesting the thumbnail file extraction directly instead of through a document node improves test isolation and makes the test intent clearer.
275-280: Consistent pattern applied to EPUB thumbnail testingGood application of the same test pattern for EPUB thumbnails, maintaining consistency across all thumbnail extraction tests.
292-305: Consistent pattern applied to HTML zip thumbnail testingThe test follows the same pattern as the other thumbnail extractors, ensuring consistency in the test approach.
309-322: Successfully adapted tests for video thumbnail extractionThe refactoring to directly test the
ExtractedVideoThumbnailFilemaintains consistency with the other extractors and aligns with the preset-based validation approach.ricecooker/classes/files.py (5)
58-58: Good addition of default_preset attribute to the base File classAdding this attribute to the base class enables consistent preset handling across all file types.
92-100: Improved preset fallback in get_preset methodThe method now checks for a default_preset before raising an error, providing a more robust fallback mechanism. The error message is also more descriptive with additional context.
202-211: Enhanced pipeline execution with cache control and post-processing validationThe changes to
process_filenow:
- Pass the
skip_cacheparameter to control caching during execution- Add validation after processing to ensure the file remains valid after metadata updates
This ensures consistent validation throughout the file processing workflow.
229-229: Consistent use of default_preset across file classesGood implementation of default_preset across all file subclasses, which simplifies preset handling and removes the need for individual get_preset overrides in subclasses.
Also applies to: 245-245, 256-256, 263-263, 270-270, 277-277, 284-284, 355-355, 382-382, 397-397
505-505: Linked thumbnail extractors to their source file formatsExplicitly associating
allowed_formatsin extracted thumbnail classes with their source file types creates a clear relationship and ensures thumbnail extraction constraints are tied to their corresponding file types.Also applies to: 513-513, 521-521, 529-529
tests/test_tree.py (9)
14-16: Good addition of imports to support enhanced metadata testsThe added imports for label constants, node classes, and FilePipeline support the new tests for metadata inheritance and automatic content type inference.
Also applies to: 25-27, 30-30, 34-34, 37-37
705-705: Explicit learning activity inference for clarityExplicitly calling
infer_learning_activities()before assertion improves the test clarity by making the expected behavior more explicit.
719-786: Comprehensive tests for automatic resource node kind and learning activity inferenceThese tests thoroughly verify that ContentNode instances correctly:
- Infer content kind based on resource file type
- Set appropriate learning activities based on the inferred kind
- Handle various resource types (video, audio, document, epub, html5)
This test coverage is essential for validating the new URI-based node initialization feature.
789-798: Good baseline tests for metadata gatheringThese tests establish baseline expectations for metadata gathering in both base Node and TreeNode without metadata, which is important for understanding the default behavior.
800-813: Well-designed helper method for metadata comparisonThe
_assert_metadata_equalhelper method properly handles comparison of complex metadata including:
- List fields (using set comparison to ignore order)
- License objects (comparing by license_id)
- Other fields (direct comparison)
This makes the tests more robust and readable.
815-873: Comprehensive tests for metadata inheritance and mergingThese tests thoroughly verify that:
- TreeNodes correctly gather metadata from parents
- Nodes combine their own metadata with parent metadata
- Child node values properly override parent values
- List fields are correctly merged
This ensures the metadata inheritance mechanism works as expected.
875-900: Good tests for hierarchical metadata mergingThe test correctly verifies that when a parent has a broader subject and a child has a more specific one, only the specific one is kept in the final metadata. This is important for ensuring sensible metadata categorization.
902-990: Thorough testing of metadata inheritance for ContentNodeThese tests verify that ContentNode instances correctly:
- Inherit simple metadata fields (language, license, etc.)
- Inherit and merge label fields (grade levels, categories, etc.)
- Do not override existing values when inheriting from ancestors
This ensures the metadata inheritance mechanism works as expected for content nodes.
992-1020: Good tests for hierarchical label inheritanceThe test verifies that more specific category labels (like ALGEBRA) correctly replace broader ones (like MATHEMATICS) while preserving unrelated categories. This is crucial for proper metadata organization.
ricecooker/classes/nodes.py (16)
34-35: Useful preset lookup dictionary for efficient preset accessCreating a dictionary for preset lookup by ID improves efficiency when accessing preset information throughout the code.
48-62: Well-organized metadata field categorizationGood separation of inheritable fields into simple value fields and metadata label fields, which clarifies how different types of metadata are handled during inheritance.
404-406: Good helper method for validation errorsThe
_validate_valueshelper method centralizes the error handling for validation, making the code more consistent and easier to maintain.
407-412: Good learning activity inference methodThis method sensibly sets default learning activities based on content kind when not explicitly provided.
412-421: Improved license handling with flexible input typesThe updated
set_licensemethod intelligently handles both string license IDs and License objects, making the API more flexible.
422-555: Comprehensive validation with detailed error messagesThe updated
validatemethod provides thorough validation of all node attributes with specific, helpful error messages. The validation covers:
- Required fields (source_id, title)
- Type checking for all attributes
- Field-specific validation (e.g., role, grade_levels)
- License validation
This ensures node data integrity throughout the application.
558-582: Well-designed metadata inheritance mechanismThe
get_metadata_dictmethod intelligently handles different types of metadata:
- Simple fields are copied directly if not empty
- List fields are merged with special handling for hierarchical values
- More specific values replace broader ones in hierarchical fields
This creates a sensible and predictable metadata inheritance system.
844-858: Good support for URI-based content node initializationThe ContentNode class now supports initialization with just a URI, with the pipeline inferring file types and metadata. The pipeline property provides a sensible default if not explicitly provided.
867-875: Robust URI validationThe
_validate_urimethod ensures that the provided URI can be handled by the pipeline before proceeding with processing, preventing downstream errors.
877-919: Improved validation for ContentNode focusing on presetsThe updated
validatemethod now focuses on file presets rather than file classes, ensuring:
- Nodes have a license
- Nodes have a kind set after processing
- Required presets are present (but not duplicated)
- Nodes have at least one default file
This aligns with the broader refactoring to preset-based validation.
921-948: Comprehensive URI processing implementationThe
_process_urimethod effectively:
- Executes the pipeline on the URI
- Processes file metadata into File objects
- Extracts and applies content node metadata
- Handles errors by removing the node from its parent
This enables the powerful feature of creating content nodes directly from URIs.
949-957: Enhanced process_files with validationThe updated
process_filesmethod now:
- Processes the URI if present
- Calls the parent method to process existing files
- Marks files as processed and validates the node
This ensures validation happens after all file processing is complete.
996-1000: Simple but effective metadata inheritance methodThe
set_metadata_from_ancestorsmethod pulls metadata from ancestors and sets it on the node, providing a clean way to inherit metadata.
1023-1023: Consistent use of required_presets across node classesReplacing file format requirements with preset requirements standardizes validation across node types and aligns with the preset-based approach throughout the codebase.
Also applies to: 1093-1093, 1117-1121, 1170-1170, 1210-1210, 1459-1459, 1483-1483
1327-1362: Improved ExerciseNode validationThe updated validation for ExerciseNode:
- Checks for presence of questions
- Validates each question
- Verifies mastery model settings
- Ensures M and N values are valid for M of N mastery model
This provides more thorough validation for exercise content.
97-97: 🛠️ Refactor suggestionUse None instead of mutable default argument
Using a mutable data structure (empty list) as a default argument value is considered a bad practice in Python because the same list instance will be shared across all function calls.
- def get_preset(self): - if self.preset: - return self.preset - if self.default_preset: - return self.default_preset - raise NotImplementedError( - f"preset must be set if preset and default_preset isn't specified when creating {self.__class__.__name__} " - f"object for file {self.filename} ({self.original_filename})" - ) + def get_preset(self): + if self.preset: + return self.preset + if self.default_preset: + return self.default_preset + raise NotImplementedError( + f"preset must be set if preset and default_preset isn't specified when creating {self.__class__.__name__} " + f"object for file {self.filename} ({self.original_filename})" + )I don't see the mutable default argument in this specific snippet. The warning seems to be a false positive from the static analysis tool.
Likely an incorrect or invalid review comment.
🧰 Tools
🪛 Ruff (0.8.2)
97-97: Do not use mutable data structures for argument defaults
Replace with
None; initialize within function(B006)
| try: | ||
| node.process_files() | ||
| except (InvalidNodeException, ValueError) as e: | ||
| if config.STRICT: | ||
| raise | ||
| else: | ||
| config.LOGGER.warning(str(e)) | ||
| config.LOGGER.warning( | ||
| f"Removing {node} from tree due to validation errors" | ||
| ) | ||
| node.remove_from_parent() | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Log full traceback and normalise the exception surface
Catching (InvalidNodeException, ValueError) is sensible, but the current
warning only prints str(e), losing the traceback that is invaluable during
chef debugging.
-except (InvalidNodeException, ValueError) as e:
- ...
- config.LOGGER.warning(str(e))
+except (InvalidNodeException, ValueError) as e:
+ config.LOGGER.warning("Validation error in %s", node, exc_info=e)Additionally, if process_files can still emit raw ValueErrors it may be
worth standardising those to InvalidNodeException at the source to keep
error handling uniform.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| try: | |
| node.process_files() | |
| except (InvalidNodeException, ValueError) as e: | |
| if config.STRICT: | |
| raise | |
| else: | |
| config.LOGGER.warning(str(e)) | |
| config.LOGGER.warning( | |
| f"Removing {node} from tree due to validation errors" | |
| ) | |
| node.remove_from_parent() | |
| try: | |
| node.process_files() | |
| except (InvalidNodeException, ValueError) as e: | |
| if config.STRICT: | |
| raise | |
| else: | |
| - config.LOGGER.warning(str(e)) | |
| + config.LOGGER.warning("Validation error in %s", node, exc_info=e) | |
| config.LOGGER.warning( | |
| f"Removing {node} from tree due to validation errors" | |
| ) | |
| node.remove_from_parent() |
ricecooker/classes/nodes.py
Outdated
| ) | ||
| self.uri = uri | ||
| self._pipeline = pipeline | ||
| # We should do additional validation of files once they have been processed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really understand this comment in this context. I know what it says, and I think "yes", but I don't fully know why it's here. It's because we have self._files_processed = False which means we will do more later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's a really rather unelucidatory comment, to the point that I thought maybe I had copy pasted it from Claude output, but honestly I don't think I did?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have updated the comment to be much more explicit and (hopefully) clear.
6f43b61 to
7fc41e0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (4)
tests/test_tree.py (1)
800-813: Good helper method for metadata comparison.The
_assert_metadata_equalhelper method enhances test readability and maintainability by centralizing the logic for comparing metadata dictionaries, with special handling for lists and license objects.Consider adding a docstring to the helper method explaining its purpose and parameters for better maintainability.
-def _assert_metadata_equal(expected, actual): +def _assert_metadata_equal(expected, actual): + """ + Assert that two metadata dictionaries are equal, with special handling for lists and license objects. + + Args: + expected (dict): The expected metadata dictionary + actual (dict): The actual metadata dictionary to compare + """ricecooker/classes/nodes.py (3)
68-72: Redundantkind = Nonedeclaration.The
kind = Noneattribute is declared twice, once on line 68 and again on line 71.Remove the redundant
kind = Nonedeclaration:- kind = None license = None language = None - kind = None
73-143: Enhanced Node initialization with comprehensive metadata attributes.The updated Node initialization now properly handles all metadata attributes needed for content nodes, with appropriate default values. This provides a solid foundation for the metadata inheritance system.
Line 97 uses a mutable default argument which could lead to unexpected behavior:
- tags=None, + tags=None, # Will be initialized as [] in the bodyThen in the constructor body:
- self.tags = tags or [] + self.tags = [] if tags is None else tags🧰 Tools
🪛 Ruff (0.8.2)
97-97: Do not use mutable data structures for argument defaults
Replace with
None; initialize within function(B006)
1052-1072: Simplified VideoNode validation.The validate method for VideoNode has been simplified to filter duplicate subtitle files by language code, removing try-except blocks and assertions.
The multiple
isinstancechecks could be combined for better readability:- if isinstance(file, SubtitleFile) or isinstance(file, YouTubeSubtitleFile): + if isinstance(file, (SubtitleFile, YouTubeSubtitleFile)):🧰 Tools
🪛 Ruff (0.8.2)
1056-1056: Multiple
isinstancecalls forfile, merge into a single callMerge
isinstancecalls forfile(SIM101)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
ricecooker/classes/files.py(9 hunks)ricecooker/classes/licenses.py(1 hunks)ricecooker/classes/nodes.py(25 hunks)ricecooker/commands.py(1 hunks)ricecooker/managers/tree.py(3 hunks)ricecooker/utils/pipeline/__init__.py(2 hunks)ricecooker/utils/pipeline/context.py(2 hunks)ricecooker/utils/pipeline/extract_metadata.py(5 hunks)ricecooker/utils/pipeline/file_handler.py(5 hunks)ricecooker/utils/pipeline/transfer.py(1 hunks)tests/test_data.py(1 hunks)tests/test_exercises.py(2 hunks)tests/test_files.py(1 hunks)tests/test_thumbnails.py(2 hunks)tests/test_tree.py(4 hunks)
🚧 Files skipped from review as they are similar to previous changes (13)
- ricecooker/classes/licenses.py
- ricecooker/commands.py
- tests/test_exercises.py
- tests/test_files.py
- ricecooker/utils/pipeline/extract_metadata.py
- ricecooker/utils/pipeline/context.py
- ricecooker/managers/tree.py
- tests/test_data.py
- ricecooker/utils/pipeline/transfer.py
- ricecooker/utils/pipeline/init.py
- tests/test_thumbnails.py
- ricecooker/utils/pipeline/file_handler.py
- ricecooker/classes/files.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/test_tree.py (5)
ricecooker/classes/nodes.py (11)
ContentNode(821-1002)Node(65-584)TreeNode(670-794)pipeline(857-860)infer_learning_activities(407-410)process_files(275-299)process_files(952-960)process_files(1276-1290)gather_ancestor_metadata(583-584)gather_ancestor_metadata(788-794)set_metadata_from_ancestors(999-1002)ricecooker/exceptions.py (1)
InvalidNodeException(60-64)ricecooker/utils/pipeline/__init__.py (1)
FilePipeline(19-90)tests/conftest.py (6)
video_file(359-370)audio_file(482-494)document_file(558-568)epub_file(614-621)html_file(672-684)title(250-251)tests/test_files.py (1)
process_files(64-79)
🪛 Ruff (0.8.2)
ricecooker/classes/nodes.py
97-97: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)
1056-1056: Multiple isinstance calls for file, merge into a single call
Merge isinstance calls for file
(SIM101)
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: Python unit tests (ubuntu-latest, 3.9)
- GitHub Check: Python unit tests (windows-latest, 3.12)
- GitHub Check: Python unit tests (windows-latest, 3.11)
- GitHub Check: Python unit tests (windows-latest, 3.10)
- GitHub Check: Python unit tests (windows-latest, 3.9)
🔇 Additional comments (50)
tests/test_tree.py (17)
14-16: These imports properly support the new tests for metadata and labels.These imports for label constants are necessary for the new tests related to metadata inheritance and will help make the tests more readable by using the actual constants rather than string literals.
25-27: Good addition of node class imports to support the new tests.These additional node class imports support the direct testing of various node types and their metadata inheritance capabilities, allowing for more comprehensive tests.
Also applies to: 30-30, 34-34
37-37: FilePipeline import supports the new URI-based content node creation.This import is necessary for the new tests that verify automatic content kind inference from URIs.
705-705: Explicitly calling infer_learning_activities aligns with the updated ContentNode behavior.The test now explicitly calls
infer_learning_activities(), which matches the implementation change where learning activities are no longer automatically inferred during initialization but must be explicitly triggered.
719-731: Good test for automatic video content kind inference.This test validates that a ContentNode can automatically infer its kind as VIDEO and set the appropriate learning activity (WATCH) when initialized with a video file URI. This is a key test for the new URI-based initialization feature.
733-745: Test ensures automatic audio content kind inference works correctly.This test verifies that a ContentNode can identify an audio file and correctly set its kind to AUDIO and learning activity to LISTEN, confirming the URI-based pipeline processing functionality.
747-759: Test validates document content kind inference.This test ensures that document files are correctly processed and result in a ContentNode with DOCUMENT kind and READ learning activity.
761-773: Test confirms EPUB files are processed as documents.This test verifies that EPUB files are correctly categorized as DOCUMENT kind with READ learning activity, ensuring consistent treatment of different document formats.
775-787: Test for HTML5 content kind inference is comprehensive.This test ensures HTML5 files are correctly processed as HTML5 kind with EXPLORE learning activity, completing the coverage for all major content types supported by the URI-based initialization.
789-792: Base tests for metadata inheritance establish important baselines.These tests verify the foundation of the metadata inheritance system by ensuring that Nodes without metadata return empty dictionaries and that TreeNodes with empty parents also return empty metadata.
Also applies to: 794-798
815-837: Test for TreeNode metadata gathering is thorough.This test verifies that a TreeNode properly inherits metadata from its parent node, covering all relevant metadata fields including simple values and label fields.
839-873: Comprehensive test for metadata combination and precedence.This test ensures that when a child node has its own metadata, it properly combines with parent metadata with the correct precedence (child overrides parent for the same field).
875-900: Test for hierarchical metadata filtering is important for label fields.This test verifies that the metadata system intelligently filters hierarchical labels, removing broader subjects when more specific ones are present, which prevents redundancy in metadata.
902-927: Test for ContentNode simple field inheritance is thorough.This test validates that a ContentNode properly inherits simple metadata fields (language, license, author, etc.) from its ancestors.
929-952: Test for ContentNode label field inheritance is comprehensive.This test ensures that ContentNode properly inherits label fields (grade_levels, resource_types, etc.) from its ancestors and merges them correctly.
954-990: Test for metadata precedence is valuable.This test confirms that existing values on a child node are not overridden by parent values during metadata inheritance, which is crucial for maintaining content creator intent.
992-1020: Test for hierarchical label inheritance handles complex cases.This test verifies that the hierarchical filtering works correctly for ContentNode label inheritance, ensuring that broader categories are removed when more specific ones exist in the child.
ricecooker/classes/nodes.py (33)
22-31: Added file imports improve code organization.The added imports for extracted thumbnail files and pipeline exceptions properly support the new URI-based content node creation functionality.
34-34: PRESET_LOOKUP dictionary improves preset access efficiency.Creating a lookup dictionary for presets is a good optimization that will make preset retrieval more efficient than repeated filtering operations.
48-62: Well-defined metadata field collections support inheritance.These collections clearly define which fields should be inheritable and how they should be processed, providing a strong foundation for the metadata inheritance system.
168-170: Added repr method improves debugging.Adding a
__repr__method that calls__str__ensures consistent string representation of nodes during debugging and in interactive sessions.
200-214: Added child removal methods enhance tree manipulation.The
remove_childandremove_from_parentmethods provide important functionality for dynamically modifying the node tree, which is especially useful when a node is found to be invalid during processing.
264-272: Improved thumbnail preset retrieval.The updated
get_thumbnail_presetmethod uses a cleaner approach to find the appropriate preset by filtering the preset list directly, making the code more maintainable.
403-411: Added validation helper method improves consistency.The
_validate_valueshelper method centralizes the validation logic, ensuring consistent error handling and reducing code duplication throughout the validation methods.
407-411: Learning activity inference method provides smart defaults.The
infer_learning_activitiesmethod automatically sets appropriate learning activities based on content kind, which ensures consistent metadata even when not explicitly provided.
412-421: License setter method simplifies license handling.The
set_licensemethod properly handles both string-based license identifiers and License objects, making it more convenient to set licenses in different ways.
422-556: Thoroughly refactored validation with comprehensive checks.The validate method has been completely rewritten to use the
_validate_valueshelper and adds comprehensive validation for all metadata fields, ensuring data integrity.
558-585: Well-designed metadata inheritance system.The
get_metadata_dictandgather_ancestor_metadatamethods provide a robust foundation for metadata inheritance throughout the node hierarchy, with intelligent handling of simple values and hierarchical label fields.
603-609: ChannelNode initialization properly handles required parameters.The updated ChannelNode constructor explicitly requires
source_idandtitle, which ensures that these essential attributes are always provided.
644-644: Fixed license serialization in to_dict.Updated to use
license_idinstead of accessing the license object directly, which ensures proper serialization of the license information.
663-667: Simplified ChannelNode validation.The validation method now uses the centralized
_validate_valueshelper, making the code more consistent with other node classes.
788-794: Enhanced TreeNode metadata inheritance.The
gather_ancestor_metadatamethod now properly raises an exception if no parent exists and combines parent metadata with its own using the metadata dictionary method.
844-844: Replaced file types with presets for validation.Using
required_presetsinstead ofrequired_file_typesallows for more flexible validation that works with both directly added files and those created dynamically from URIs.
846-861: ContentNode now supports URI-based initialization.The updated ContentNode constructor adds
uriandpipelineparameters, allowing for initialization with just a URI that will be processed to infer file details and content kind.
870-879: URI validation ensures pipeline compatibility.The
_validate_urimethod verifies that the pipeline can handle the provided URI, preventing issues with unsupported file types or invalid URIs.
880-922: ContentNode validation now checks required presets.The updated validate method now checks for required presets rather than file types, which allows for validation of files created from URIs. It also defers validation until files have been processed.
952-960: Enhanced process_files method manages URI processing.The overridden
process_filesmethod ensures that URI processing happens before other file processing and sets the_files_processedflag to indicate that validation can now check for required files.
999-1003: Added method to set metadata from ancestors.The
set_metadata_from_ancestorsmethod applies inherited metadata to the node, which ensures that metadata is consistently propagated through the node hierarchy.
1026-1027: Updated VideoNode to use presets for validation.The VideoNode now uses
required_presetsinstead ofrequired_file_typesfor validation, which is more flexible and works with URI-based file creation.
1096-1097: Updated AudioNode required_presets.AudioNode now uses
required_presetsfor validation, which aligns with the new validation approach.
1120-1124: DocumentNode presets include multiple document formats.The
required_presetsfor DocumentNode now includes DOCUMENT, EPUB, and BLOOMPUB, which ensures all document types are properly validated.
1127-1129: Updated thumbnail generation to use presets.The document thumbnail generation now uses
get_preset()to identify document files, which is more flexible than checking file types directly.
1173-1174: Updated HTML5AppNode required_presets.HTML5AppNode now uses
required_presetsfor validation, which aligns with the new validation approach.
1181-1183: Updated HTML5 thumbnail generation to use presets.The HTML5 thumbnail generation now uses
get_preset()to identify HTML5 files, which is more consistent with the preset-based approach.
1213-1214: Updated H5PAppNode required_presets.H5PAppNode now uses
required_presetsfor validation, which aligns with the new validation approach.
1330-1365: Refactored ExerciseNode validation.The validate method for ExerciseNode has been refactored to use the
_validate_valueshelper for question presence, question validity, mastery model correctness, and integer coercion of mastery parameters.
1462-1463: Added required_presets to CustomNavigationNode.CustomNavigationNode now uses
required_presetsfor validation, which aligns with the new validation approach.
1473-1475: Updated CustomNavigationNode thumbnail generation to use presets.The thumbnail generation now uses
get_preset()to identify HTML5 files, which is more consistent with the preset-based approach.
1486-1487: Added required_presets to CustomNavigationChannelNode.CustomNavigationChannelNode now uses
required_presetsfor validation, which aligns with the new validation approach.
1577-1578: Fixed StudioContentNode validation return value.The StudioContentNode validate method now returns the result of the parent validate method, ensuring proper validation chain.
| def _process_uri(self): | ||
| try: | ||
| file_metadata_list = self.pipeline.execute( | ||
| self.uri, skip_cache=config.UPDATE | ||
| ) | ||
| except (InvalidFileException, ExpectedFileException) as e: | ||
| config.LOGGER.error(f"Error processing path: {self.uri} with error: {e}") | ||
| self.remove_from_parent() | ||
| return None | ||
| content_metadata = {} | ||
| for file_metadata in file_metadata_list: | ||
| metadata_dict = file_metadata.to_dict() | ||
| if "content_node_metadata" in metadata_dict: | ||
| content_metadata.update(metadata_dict.pop("content_node_metadata")) | ||
| # Remove path from metadata_dict as it is not needed for the File object | ||
| metadata_dict.pop("path", None) | ||
| file_obj = File(**metadata_dict) | ||
| self.add_file(file_obj) | ||
| for key, value in content_metadata.items(): | ||
| if key == "extra_fields": | ||
| self.extra_fields.update(value) | ||
| else: | ||
| if key == "kind" and self.kind is not None and self.kind != value: | ||
| raise InvalidNodeException( | ||
| "Inferred kind is different from content node class kind." | ||
| ) | ||
| setattr(self, key, value) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
URI processing extracts metadata and creates files.
The _process_uri method runs the pipeline on the URI, converts file metadata to File objects, extracts content node metadata, and handles errors appropriately.
The error handling in _process_uri removes the node from its parent when an error occurs, which could lead to surprising behavior. Consider instead flagging the node as invalid and letting the tree validation process handle removal:
try:
file_metadata_list = self.pipeline.execute(
self.uri, skip_cache=config.UPDATE
)
except (InvalidFileException, ExpectedFileException) as e:
config.LOGGER.error(f"Error processing path: {self.uri} with error: {e}")
- self.remove_from_parent()
+ # Flag the node as invalid instead of removing it
+ self._invalid_uri = True
return NoneThen update the validate method to check for this flag:
if self.uri:
+ if hasattr(self, '_invalid_uri') and self._invalid_uri:
+ self._validate_values(True, f"Invalid URI: {self.uri}")
self._validate_uri()Committable suggestion skipped: line range outside the PR's diff.
…than relying on config.UPDATE.
…istent. Implement required file checks, focused on presets. Enforce a single required preset per node.
Update metadata extraction to infer kind. Update tree processing to flag invalid nodes and exclude from created tree or raise errors if in strict mode.
Refactor to allow metadata to be from channel level down.
Remove unneeded retry logic.
7fc41e0 to
0978f3b
Compare
|
As discussed, nodes are now flagged as valid or invalid during the validation process. When errors are raised either during validation of file processing by the ChannelManager class, it stores the string representation of the error on the node under a Any invalid nodes, or ones with failed file uploads, are now excluded from upload to Studio. They are reported node by node, rather than under the umbrella of their parent, this should give more inspectability of errors that are happening. If the content integrator prefers things to error, the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
ricecooker/classes/nodes.py (1)
898-924: 🛠️ Refactor suggestionComprehensive URI processing with metadata extraction.
The
_process_urimethod:
- Executes the pipeline on the URI (with cache control)
- Creates File objects from the pipeline results and adds them to the node
- Extracts and applies content_node_metadata to the node attributes
- Handles both file and content metadata elegantly
- Validates that inferred kind matches the node class kind
This is the core of the URI-based content node creation functionality.
The error handling in
_process_uricould be improved. Currently, when an exception occurs during pipeline execution, the method catches the exception, logs an error, and returns None without setting any error state on the node itself, which might lead to unclear validation errors later.Consider setting an error flag or message on the node:
try: file_metadata_list = self.pipeline.execute( self.uri, skip_cache=config.UPDATE ) except (InvalidFileException, ExpectedFileException) as e: config.LOGGER.error(f"Error processing path: {self.uri} with error: {e}") + self._uri_error = str(e) # Store the error for later validation return NoneThen check for this flag in validation:
if self.uri: + if hasattr(self, '_uri_error') and self._uri_error: + self._validate_values(True, f"Error processing URI: {self.uri} - {self._uri_error}") self._validate_uri()
🧹 Nitpick comments (4)
tests/test_tree.py (2)
252-252: Fix assertion method to useassertwith a message rather thanassert False.The test uses
assert Falsewhich would be removed when Python is run with optimization (-Oflag). Replace with a proper assertion.- assert False, "Invalid topic should fail validation" + assert 0, "Invalid topic should fail validation" # Or better yet, use pytest.fail("message")
1202-1204: Simplify test with singlewithstatement for multiple context managers.The test uses nested
withstatements, which could be simplified to a singlewithstatement with multiple contexts.- with patch("builtins.open", mocked_open), patch( - "ricecooker.config.get_storage_path", return_value="/tmp/test_file.mp4" - ), patch("ricecooker.config.SESSION.post", return_value=mock_response): + with patch("builtins.open", mocked_open), \ + patch("ricecooker.config.get_storage_path", return_value="/tmp/test_file.mp4"), \ + patch("ricecooker.config.SESSION.post", return_value=mock_response):🧰 Tools
🪛 Ruff (0.8.2)
1202-1209: Use a single
withstatement with multiple contexts instead of nestedwithstatements(SIM117)
ricecooker/classes/nodes.py (2)
98-98: Fix mutable default argument in Node constructor.The
node_modifications={}parameter uses a mutable dict as a default argument, which can lead to unexpected behavior if modified across different instances.- node_modifications={}, + node_modifications=None,And then inside the method:
- # save modifications passed in by csv - self.node_modifications = node_modifications + # save modifications passed in by csv + self.node_modifications = node_modifications or {}🧰 Tools
🪛 Ruff (0.8.2)
98-98: Do not use mutable data structures for argument defaults
Replace with
None; initialize within function(B006)
1029-1030: Simplify multipleisinstancechecks with tuple of types.The code checks if a file is an instance of either
SubtitleFileorYouTubeSubtitleFileusing two separateisinstancecalls, which can be simplified.- if isinstance(file, SubtitleFile) or isinstance(file, YouTubeSubtitleFile): + if isinstance(file, (SubtitleFile, YouTubeSubtitleFile)):🧰 Tools
🪛 Ruff (0.8.2)
1029-1029: Multiple
isinstancecalls forfile, merge into a single callMerge
isinstancecalls forfile(SIM101)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
ricecooker/classes/files.py(9 hunks)ricecooker/classes/licenses.py(1 hunks)ricecooker/classes/nodes.py(25 hunks)ricecooker/commands.py(1 hunks)ricecooker/managers/tree.py(7 hunks)ricecooker/utils/pipeline/__init__.py(2 hunks)ricecooker/utils/pipeline/context.py(2 hunks)ricecooker/utils/pipeline/extract_metadata.py(5 hunks)ricecooker/utils/pipeline/file_handler.py(5 hunks)ricecooker/utils/pipeline/transfer.py(1 hunks)tests/test_data.py(1 hunks)tests/test_exercises.py(2 hunks)tests/test_files.py(1 hunks)tests/test_thumbnails.py(2 hunks)tests/test_tree.py(20 hunks)
🚧 Files skipped from review as they are similar to previous changes (12)
- tests/test_files.py
- ricecooker/classes/licenses.py
- ricecooker/commands.py
- tests/test_exercises.py
- ricecooker/utils/pipeline/init.py
- ricecooker/utils/pipeline/extract_metadata.py
- ricecooker/utils/pipeline/transfer.py
- ricecooker/utils/pipeline/context.py
- tests/test_data.py
- tests/test_thumbnails.py
- ricecooker/utils/pipeline/file_handler.py
- ricecooker/classes/files.py
🧰 Additional context used
🧬 Code Graph Analysis (2)
ricecooker/managers/tree.py (2)
ricecooker/exceptions.py (1)
InvalidNodeException(60-64)ricecooker/classes/nodes.py (5)
process_files(261-285)process_files(925-933)process_files(1249-1263)get_node_id(590-591)get_node_id(674-682)
tests/test_tree.py (4)
ricecooker/classes/nodes.py (19)
ContentNode(795-975)Node(65-558)TreeNode(644-768)pipeline(831-834)validate(527-530)infer_learning_activities(376-379)process_files(261-285)process_files(925-933)process_files(1249-1263)gather_ancestor_metadata(557-558)gather_ancestor_metadata(762-768)set_metadata_from_ancestors(972-975)to_dict(186-190)to_dict(606-630)to_dict(726-760)to_dict(935-970)to_dict(1552-1565)get_node_id(590-591)get_node_id(674-682)ricecooker/exceptions.py (1)
InvalidNodeException(60-64)ricecooker/managers/tree.py (7)
ChannelManager(19-448)InsufficientStorageException(13-16)validate(34-47)validate_node(49-58)process_node(84-108)add_nodes(325-407)do_file_upload(161-211)tests/conftest.py (7)
channel(149-158)title(250-251)video_file(359-370)audio_file(482-494)document_file(558-568)epub_file(614-621)html_file(672-684)
🪛 Ruff (0.8.2)
tests/test_tree.py
253-253: Do not assert False (python -O removes these calls), raise AssertionError()
Replace assert False
(B011)
1202-1209: Use a single with statement with multiple contexts instead of nested with statements
(SIM117)
ricecooker/classes/nodes.py
98-98: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)
1029-1029: Multiple isinstance calls for file, merge into a single call
Merge isinstance calls for file
(SIM101)
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: Python unit tests (ubuntu-latest, 3.9)
- GitHub Check: Python unit tests (windows-latest, 3.12)
- GitHub Check: Python unit tests (windows-latest, 3.11)
- GitHub Check: Python unit tests (windows-latest, 3.10)
- GitHub Check: Python unit tests (windows-latest, 3.9)
🔇 Additional comments (33)
ricecooker/managers/tree.py (6)
60-60: Simplified method signature removes redundant parameter.The
process_treemethod no longer requires thechannel_nodeparameter, as it now uses the class instance'sself.channelattribute consistently. This makes the API cleaner and avoids potential confusion when the passed channel differs from the internal one.
52-58: Improved error handling for node validation.The code now explicitly catches
InvalidNodeExceptionduring validation, setting a descriptive_errorattribute on nodes and logging warnings when not in strict mode. This provides better error traceability and prevents valid nodes from being discarded due to siblings' failures.
292-302: Simplified check_failed method removes unused parameter.The
check_failedmethod signature has been simplified by removing the unusedprint_warningparameter. The method now always logs warnings for failed nodes, showing both error messages and truncated content consistently.
364-378: Improved error tracking with more specific failure identification.The
add_nodesmethod now:
- Uses the child's node ID (instead of the parent's root ID) as the key in
failed_node_builds- Includes the child's
_errormessage in the error aggregation- Creates more specific error entries for individual nodes
This makes error tracking more precise and makes it easier to identify which specific nodes failed and why.
279-279: Removed automatic retry for failed node builds.The call to
reattempt_failedhas been removed from theupload_treemethod, streamlining the error handling flow. Failed nodes are now simply reported without automatic retry attempts, making the behavior more predictable and explicit.
89-97:Details
❓ Verification inconclusive
Enhanced error handling during node processing.
The
process_nodemethod now catches bothInvalidNodeExceptionandValueErrorduring file processing, setting an_errorattribute on the node and logging the error. This approach maintains consistent error handling patterns withvalidate_nodeand prevents pipeline failures from causing entire tree processing to fail unnecessarily.Log full traceback and normalize the exception surface:
🏁 Script executed:
#!/bin/bash # Verify whether process_node could benefit from full traceback logging # Search for other warning logging patterns in the codebase rg "LOGGER\.warning.*exc_info" -A 2 -B 2 echo "----------------" # Check whether InvalidNodeException is used to wrap ValueError elsewhere rg "raise InvalidNodeException.*ValueError"Length of output: 129
🏁 Script executed:
#!/bin/bash # Search for existing use of logger.exception to understand conventions rg "LOGGER\.exception" -A 2 -B 2 echo "----------------" # Search for any use of exc_info across all logging calls rg "exc_info=" -A 2 -B 2Length of output: 100
Consider capturing full exception tracebacks during node processing
Currently, any
InvalidNodeExceptionorValueErroris caught and only the error message is logged:try: node.process_files() except (InvalidNodeException, ValueError) as e: if config.STRICT: raise else: node._error = str(e) config.LOGGER.warning(node._error)No existing uses of
LOGGER.exceptionorexc_info=Truewere found in the codebase—please verify your project’s logging conventions. To aid debugging, you might log the full traceback, for example:
- Switch to
config.LOGGER.exception("Error processing node %s", node.id)- Or add
exc_info=True:
config.LOGGER.warning(node._error, exc_info=True)File needing attention:
ricecooker/managers/tree.py(around lines 89–97)Please confirm and align with your preferred error‐logging pattern.
tests/test_tree.py (13)
248-254: API change: Using consistent validation method across tests.Tests have been updated to use
validate()instead of the deprecatedvalidate_tree()method. This change aligns with the refactoring of validation logic in the node classes.🧰 Tools
🪛 Ruff (0.8.2)
253-253: Do not
assert False(python -Oremoves these calls), raiseAssertionError()Replace
assert False(B011)
694-695: Updated test to use new inference method.The test now explicitly calls
infer_learning_activities()before asserting the default activities, reflecting the new design where learning activities are inferred on demand rather than automatically assigned during initialization.
708-775: New tests for automatic content kind and learning activity inference.These tests verify the core new functionality of inferring content kind and learning activities from files based on URIs. The tests cover all major file types:
- Video (lines 708-720)
- Audio (lines 722-734)
- Document (lines 736-748)
- ePub (lines 750-762)
- HTML5 (lines 764-776)
Each test confirms that the correct content kind and learning activities are inferred after processing a file with the specified URI.
778-809: Thorough testing of metadata inheritance functionality.These tests verify the base metadata gathering functionality with different node configurations:
- Empty nodes with no metadata (lines 778-781)
- TreeNodes with empty parent nodes (lines 782-787)
- Helper function for metadata comparison (lines 789-802)
The tests establish a foundation for more complex metadata inheritance testing that follows.
804-862: Comprehensive tests for metadata gathering from ancestor nodes.These tests verify that:
- TreeNodes correctly gather metadata from parent nodes (lines 804-826)
- TreeNodes properly combine their own metadata with parent metadata (lines 828-862)
- Tests that child values override parent values where appropriate
- Tests that non-existing values are inherited from parent
- Tests that hierarchical lists like grade_levels are properly combined
This ensures that metadata propagation and inheritance work as expected in the node hierarchy.
864-889: Tests for hierarchical metadata merging logic.The test verifies that when parents have broader subjects and children have more specific ones within the same hierarchy, only the more specific values are kept (removing redundant broader categories). This ensures optimal metadata representation without duplication.
891-979: Tests for ContentNode metadata inheritance methods.These test cases verify that:
- ContentNodes properly inherit simple fields like language and license (lines 891-916)
- ContentNodes correctly inherit label fields like grade_levels (lines 918-941)
- ContentNodes don't override existing values with ancestor values (lines 943-979)
- Hierarchical label inheritance works by removing broader categories when more specific ones exist (lines 981-1010)
This ensures that metadata inheritance follows the expected precedence rules and maintains hierarchical relationships.
1015-1066: Tests for ChannelManager's error handling during validation and processing.These tests verify that:
- The manager sets the
_errorattribute whenInvalidNodeExceptionoccurs during validation (lines 1015-1032)- The manager handles both
InvalidNodeExceptionandValueErrorduring node processing (lines 1034-1066)The tests mock the node's behavior to simulate exceptions and verify that errors are properly captured and handled.
1068-1113: Tests for ChannelManager's add_nodes method error handling.This test verifies that:
- Invalid nodes are skipped during the add_nodes process
- Skipped nodes are properly registered in the failed_node_builds dictionary with their error messages
- Invalid nodes' to_dict() method is not called, preventing errors during serialization
The test uses mocks to simulate a parent node with both valid and invalid children.
1115-1142: Test for handling ConnectionError during node addition.This test verifies that connection errors during the upload process are properly caught and registered in the failed_node_builds dictionary with the parent node as the context.
1144-1174: Test for handling server errors during node addition.This test verifies that HTTP error responses from the server are properly caught and registered in the failed_node_builds dictionary, including the error reason and response content.
1176-1211: Test for handling storage limitations during file upload.The test verifies that the manager raises an
InsufficientStorageExceptionwhen the server responds with HTTP 412, indicating that the user has run out of storage space.🧰 Tools
🪛 Ruff (0.8.2)
1202-1209: Use a single
withstatement with multiple contexts instead of nestedwithstatements(SIM117)
1213-1270: Test for skipping both invalid nodes and nodes with failed files.This comprehensive test verifies that:
- The manager identifies and skips both invalid nodes and nodes with failed files
- Both types of failed nodes are registered in the failed_node_builds dictionary
- Only valid nodes with successful files are included in the payload
The test uses mocks to create different types of nodes with various validation and file statuses.
ricecooker/classes/nodes.py (14)
48-62: Well-defined constants for metadata inheritance.The global constants
inheritable_simple_value_fieldsandinheritable_metadata_label_fieldsclearly define which fields can be inherited through the node hierarchy. This makes the inheritance rules explicit and maintainable.
68-144: Comprehensive refactoring of Node constructor with enhanced metadata support.The Node constructor has been expanded to support a comprehensive set of metadata fields including:
- Basic fields (title, language, description)
- Attribution fields (author, copyright_holder, aggregator, provider)
- Classification fields (grade_levels, resource_types, learning_activities, etc.)
- License handling
The constructor now requires a
source_idparameter (line 102), which improves the consistency of node creation.🧰 Tools
🪛 Ruff (0.8.2)
98-98: Do not use mutable data structures for argument defaults
Replace with
None; initialize within function(B006)
372-374: Added validation helper improves error message consistency.The
_validate_valueshelper method standardizes validation error handling by:
- Taking an assertion condition and error message
- Raising InvalidNodeException with a consistent format that includes the node identifier
- Centralizing validation logic to avoid duplication
This pattern is used extensively throughout the validation methods, making the code more maintainable and error messages more consistent.
391-524: Comprehensive validation overhaul using new helper method.The
_validatemethod has been thoroughly refactored to:
- Use the new
_validate_valueshelper for consistent error reporting- Validate all metadata fields comprehensively
- Validate category and label fields against allowed values
- Check for string fields, list fields, and non-empty required fields
This provides much stronger validation throughout the node hierarchy and better error messages when validation fails.
532-556: Well-designed metadata inheritance with hierarchical handling.The
get_metadata_dictmethod implements intelligent metadata merging:
- Simple fields (language, license, etc.) are copied if they have values
- Hierarchical fields (grade_levels, categories, etc.) are merged with special handling
- More specific categories automatically replace their broader parents (lines 547-553)
This prevents metadata bloat and ensures the most specific categories are preserved when merging.
557-559: Simple base implementation of metadata gathering.The base
gather_ancestor_metadatamethod returns just the node's own metadata. This method is overridden inTreeNodeto include parent metadata. This design follows the principle that each class only knows about its own structure.
762-768: TreeNode's metadata gathering checks for parent existence.The
gather_ancestor_metadatamethod in TreeNode raises an exception if parent is missing, preventing silent failures when trying to gather metadata from a non-existent parent. This helps catch configuration errors early.
818-835: ContentNode updated to support URI-based initialization.The ContentNode constructor now accepts:
- A
uriparameter for the primary content location- A
pipelineparameter for handling file processing- A
_files_processedflag to track processing stateThe class provides a pipeline property with a fallback to the global pipeline, ensuring a pipeline is always available when needed.
844-852: Clear validation of URI compatibility with pipeline.The
_validate_urimethod checks if the pipeline can handle the provided URI, raising a clear exception if not. This prevents issues where a URI might be provided that can't be processed.
854-896: Refactored validation logic with preset-based verification.The ContentNode validation has been updated to:
- Verify that required file presets exist instead of file types (lines 862-882)
- Check for default (non-supplementary) files (lines 883-893)
- Validate URI compatibility when a URI is provided (lines 894-895)
This matches the PR objective of shifting validation from file types to presets.
925-933: Enhanced file processing with URI support and validation.The
process_filesmethod now:
- Processes the URI first if provided (line 927)
- Calls the parent class's processing for regular files (line 928)
- Sets a flag to indicate files have been processed (line 931)
- Validates the node to ensure metadata is valid (line 932)
This ensures both URI-based and regular file-based content nodes follow the same validation path.
972-975: Simple but effective metadata inheritance implementation.The
set_metadata_from_ancestorsmethod:
- Gathers metadata from ancestors
- Sets each field from the gathered metadata on the node
This allows content nodes to inherit metadata from their ancestors at the appropriate time in the node lifecycle.
1019-1045: Simplified VideoNode validation for subtitle files.The VideoNode validation has been simplified to:
- Filter duplicate subtitle files by language code
- Provide clear warnings when duplicates are found
- Preserve the first subtitle file for each language
- Call the parent validation method with the filtered files
This makes the validation cleaner and more maintainable while preserving the required functionality.
The code could be simplified by using a dictionary comprehension for the language filtering:
- new_files = [] - language_codes_seen = set() - for file in self.files: - if isinstance(file, SubtitleFile) or isinstance(file, YouTubeSubtitleFile): - language_code = file.language - if language_code not in language_codes_seen: - new_files.append(file) - language_codes_seen.add(language_code) - else: - file_info = file.path if hasattr(file, "path") else file.youtube_url - config.LOGGER.warning( - "Skipping duplicate subs for " - + language_code - + " from " - + file_info - ) - else: - new_files.append(file) + # Group subtitle files by language code, keeping the first one for each language + language_to_subtitle = {} + new_files = [] + + for file in self.files: + if isinstance(file, (SubtitleFile, YouTubeSubtitleFile)): + language_code = file.language + if language_code not in language_to_subtitle: + language_to_subtitle[language_code] = file + new_files.append(file) + else: + file_info = file.path if hasattr(file, "path") else file.youtube_url + config.LOGGER.warning(f"Skipping duplicate subs for {language_code} from {file_info}") + else: + new_files.append(file)🧰 Tools
🪛 Ruff (0.8.2)
1029-1029: Multiple
isinstancecalls forfile, merge into a single callMerge
isinstancecalls forfile(SIM101)
1297-1339: Refactored ExerciseNode validation using helper method.The ExerciseNode validation has been refactored to:
- Use the new
_validate_valueshelper for consistent error reporting- Check questions, mastery model, and M/N parameters
- Call
process_exercise_databefore validation to ensure parameters are set correctlyThis makes the validation more robust and consistent with the rest of the code.
marcellamaki
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes here look aligned to the coderabbit recommendations and the approach we discussed. I think this is good to go
Summary
References
Part of #558
Reviewer guidance
Do tests cover the appropriate code paths? Are they missing anything? Does the code make sense?
Summary by CodeRabbit
Summary by CodeRabbit
New Features
Bug Fixes
Tests