Skip to content

Conversation

@rtibbles
Copy link
Member

@rtibbles rtibbles commented Apr 9, 2025

Summary

  • Updates the ContentNode class to allow just passing a URI for the resource, which will then infer the kind, files etc to make the node
  • Updates pipeline metadata extraction to infer kind for an associated content node
  • Cleans up ContentNode behaviour to allow deferred setting of the kind

References

Part of #558

Reviewer guidance

Do tests cover the appropriate code paths? Are they missing anything? Does the code make sense?

Summary by CodeRabbit

Summary by CodeRabbit

  • New Features

    • Added support for processing content nodes using a pipeline with URI-based file handling.
    • Content nodes now automatically infer their type and learning activities based on the resource file type.
    • Enhanced metadata extraction for HTML5 files, including kind inference and metadata merging.
    • Introduced comprehensive metadata inheritance and validation across node hierarchies.
    • Added new metadata fields and improved license serialization and validation.
    • Unified file preset handling with default presets and improved thumbnail file format associations.
    • Added explicit error handling to remove invalid nodes during validation and processing.
    • Enabled optional cache skipping in pipeline and file handling stages for more flexible processing.
  • Bug Fixes

    • Improved handling of context initialization in file pipelines to prevent errors when context is not provided.
    • Enhanced error messages for file preset retrieval and improved pipeline execution with optional cache skipping.
    • Fixed license validation error messages to be clearer and simplified cache usage control in file handling.
  • Tests

    • Added tests to verify automatic kind and learning activity inference for various file types.
    • Updated tests to check for new cache keys related to HTML5 metadata extraction.
    • Added tests for metadata inheritance and merging in node hierarchies.
    • Updated tests to expect exceptions during file processing rather than validation for invalid content nodes.
    • Refactored thumbnail failure tests to focus on thumbnail file processing directly.
    • Updated exercise tests to expect consistent exception types for invalid extra field values.
    • Added tests covering robust error handling in node validation, processing, and upload management.

@rtibbles rtibbles force-pushed the automatic_for_the_people branch 2 times, most recently from 1f981f2 to 71fbbbf Compare April 9, 2025 20:54
@rtibbles rtibbles requested a review from Copilot April 10, 2025 13:52
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

@rtibbles rtibbles force-pushed the automatic_for_the_people branch from 71fbbbf to 8a0149c Compare April 16, 2025 03:20
@coderabbitai
Copy link

coderabbitai bot commented Apr 16, 2025

"""

Walkthrough

The updates enhance the Node and its subclasses with extensive metadata attributes, uniform validation, and improved metadata inheritance. The ContentNode class now supports processing content from a URI using a pipeline, which extracts metadata, infers content kind and learning activities, and validates the node. Metadata extraction is extended to support HTML5 content, and the pipeline context supports deep merging of nested metadata. Pipeline handlers and execution methods were updated to allow skipping cache. Tests were expanded to verify metadata inheritance and automatic inference of content kind and learning activities for various resource types, including HTML5.

Changes

File(s) Change Summary
ricecooker/classes/nodes.py Added source_id and many metadata attributes to Node. Added __repr__, _validate_values, metadata inheritance, and improved validation logic. Simplified thumbnail preset lookup. Removed TreeNode constructor; moved metadata gathering there. Modified ChannelNode constructor and validation. Extended ContentNode with URI/pipeline, added pipeline processing, metadata merging, and validation. Replaced string file format checks with class-based required_presets in content node subclasses. Simplified or removed redundant validations in subclasses.
ricecooker/utils/pipeline/init.py Added optional skip_cache parameter to FilePipeline.execute. Ensured context defaults to an empty dict if None. Passed skip_cache to handlers.
ricecooker/utils/pipeline/context.py Added ContentNodeMetadata dataclass with educational metadata fields. Extended FileMetadata with optional content_node_metadata. Modified to_dict to exclude None fields. Updated merge to perform recursive deep merges of metadata dictionaries. Introduced helper _recursive_update for deep merging.
ricecooker/utils/pipeline/extract_metadata.py Added support for HTML5 file format with a new HTML5MetadataExtractor class. Mapped presets to content node kinds. Modified MetadataExtractor to set content_node_metadata.kind based on preset. Added HTML5MetadataExtractor to extraction stage handlers.
ricecooker/utils/pipeline/file_handler.py Added optional skip_cache parameter to Handler.execute and subclasses. Modified caching logic to respect skip_cache instead of global config.
ricecooker/utils/pipeline/transfer.py Added optional skip_cache parameter to DownloadStageHandler.execute, passing it through to superclass.
ricecooker/classes/files.py Added default_preset class attribute to File and subclasses, replacing overridden get_preset methods. Modified DownloadFile.process_file to pass skip_cache to pipeline execution and revalidate after processing. Added allowed_formats attribute to extracted thumbnail file subclasses.
ricecooker/classes/licenses.py Removed "Assertion Failed: " prefix from error messages in License.validate assertions.
ricecooker/managers/tree.py Narrowed exception handling in validate_node to catch only InvalidNodeException. Added exception handling in process_node for InvalidNodeException and ValueError to remove invalid nodes if not strict. Removed channel_node parameter from process_tree. Simplified check_failed and removed reattempt_failed. Updated add_nodes to track failed nodes by child ID with error messages.
ricecooker/commands.py Modified process_tree_files to call process_tree() without arguments.
tests/test_files.py Updated test_html5_zip_cache_keys to expect an additional cache key "EXTRACT_METADATA:{filename}" after processing an HTMLZipFile.
tests/test_tree.py Added imports for new labels and node classes. Modified test_default_learning_activities_in_tree_node to call infer_learning_activities. Added tests verifying automatic inference of content kind and learning activities for video, audio, document, epub, and HTML5 content nodes after pipeline processing. Added tests for metadata inheritance and merging behavior in node hierarchies, including label filtering and learner needs inheritance.
tests/test_data.py Changed tests for invalid content nodes to expect InvalidNodeException raised by process_files instead of validate.
tests/test_exercises.py Modified tests to expect InvalidNodeException instead of ValueError when processing exercises with invalid extra_fields string values.
tests/test_thumbnails.py Refactored failure tests to instantiate and test Extracted*ThumbnailFile classes directly, removing node/file setup. Updated assertions to check process_file() returns None, failed files list, and filename state. Removed imports and tests related to original file classes and MP3 nodes.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant ContentNode
    participant FilePipeline
    participant MetadataExtractor

    User->>ContentNode: Create ContentNode(uri, pipeline)
    User->>ContentNode: process_files()
    alt uri is set
        ContentNode->>FilePipeline: execute(uri)
        FilePipeline->>MetadataExtractor: extract metadata
        MetadataExtractor-->>FilePipeline: FileMetadata (with content_node_metadata)
        FilePipeline-->>ContentNode: file metadata list
        ContentNode->>ContentNode: Update attributes from metadata
        ContentNode->>ContentNode: Validate node (kind/activity inference)
    end
    ContentNode-->>User: Files processed, node validated
Loading

Poem

A pipeline now flows through the node,
With URIs and metadata in code.
HTML5 joins the fun,
Learning activities inferred for everyone!
Deep merges keep the data right,
While rabbits test from day to night.
🐇✨
"""

Tip

⚡️ Faster reviews with caching
  • CodeRabbit now supports caching for code and dependencies, helping speed up reviews. This means quicker feedback, reduced wait times, and a smoother review experience overall. Cached data is encrypted and stored securely. This feature will be automatically enabled for all accounts on May 16th. To opt out, configure Review - Disable Cache at either the organization or repository level. If you prefer to disable all data retention across your organization, simply turn off the Data Retention setting under your Organization Settings.

Enjoy the performance boost—your workflow just got faster.

✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (5)
ricecooker/utils/pipeline/extract_metadata.py (2)

39-44: Properly setting the inferred kind in metadata.
Creating or retrieving content_node_metadata and assigning the inferred kind is a good approach. Consider verifying that all existing usage of content_node_metadata is up-to-date with this new field.


67-69: HTML5 metadata extraction class introduced.
Defining a distinct HTML5MetadataExtractor and registering it ensures that HTML5 files are properly detected. Adding a short docstring could further clarify its usage.

ricecooker/classes/nodes.py (3)

100-102: __repr__ duplicates __str__.
Having __repr__ return the same string as __str__ is generally acceptable but can be less informative for debugging. Consider adding distinct info in __repr__ if needed for advanced debugging.


616-616: Side-effect in validate.
Calling infer_learning_activities() inside validate() is acceptable, but note that it mutates state as part of validation. Making it a pre-validation step or clarifying the design can reduce confusion.


833-864: Processing files via the pipeline.
This block implements robust file handling, merging metadata, and re-checking validity. Consider verifying what happens when pipeline.execute returns an empty list or multiple items with conflicting metadata. Otherwise, this is a solid approach.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b7ba4ea and 8a0149c.

📒 Files selected for processing (6)
  • ricecooker/classes/nodes.py (11 hunks)
  • ricecooker/utils/pipeline/__init__.py (1 hunks)
  • ricecooker/utils/pipeline/context.py (2 hunks)
  • ricecooker/utils/pipeline/extract_metadata.py (5 hunks)
  • tests/test_files.py (1 hunks)
  • tests/test_tree.py (4 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (3)
ricecooker/utils/pipeline/extract_metadata.py (1)
ricecooker/utils/pipeline/context.py (1)
  • ContentNodeMetadata (14-37)
tests/test_tree.py (3)
ricecooker/classes/nodes.py (6)
  • ContentNode (743-900)
  • pipeline (787-790)
  • infer_learning_activities (508-511)
  • process_files (192-216)
  • process_files (833-863)
  • process_files (1342-1356)
ricecooker/utils/pipeline/__init__.py (1)
  • FilePipeline (12-76)
tests/conftest.py (5)
  • video_file (359-370)
  • audio_file (482-494)
  • document_file (558-568)
  • epub_file (614-621)
  • html_file (672-684)
ricecooker/utils/pipeline/context.py (1)
ricecooker/classes/nodes.py (5)
  • to_dict (117-121)
  • to_dict (402-426)
  • to_dict (575-609)
  • to_dict (865-900)
  • to_dict (1713-1726)
⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: Python unit tests (ubuntu-latest, 3.9)
  • GitHub Check: Python unit tests (windows-latest, 3.12)
  • GitHub Check: Python unit tests (windows-latest, 3.11)
  • GitHub Check: Python unit tests (windows-latest, 3.10)
  • GitHub Check: Python unit tests (windows-latest, 3.9)
🔇 Additional comments (28)
ricecooker/utils/pipeline/context.py (5)

13-37: Well-structured ContentNodeMetadata dataclass.

This new dataclass provides a clean, strongly-typed container for all content node metadata attributes. It aligns well with the metadata fields used in the node serialization methods in ricecooker/classes/nodes.py.


40-46: Good implementation of recursive dictionary update.

The _recursive_update helper function properly handles nested dictionaries during merging, ensuring that nested fields like extra_fields aren't completely overwritten but rather updated recursively.


59-59: Good extension of FileMetadata class.

Adding the optional content_node_metadata field enables the pipeline to propagate node-specific metadata through the file processing stages.


62-64: Improved dictionary serialization.

Using a custom dict factory with asdict that excludes None values results in cleaner output dictionaries without unnecessary null values.


71-71: Enhanced merge method with deep dictionary updates.

Replacing the simple update with the recursive update ensures that nested structures like content_node_metadata are properly merged rather than overwritten.

tests/test_files.py (1)

1327-1331: Properly updated test expectations for HTML5 cache keys.

The test now correctly expects an additional cache key for metadata extraction, which aligns with the enhanced pipeline that now extracts metadata from HTML5 files.

ricecooker/utils/pipeline/__init__.py (1)

52-52: Improved robustness with context initialization.

This defensive programming practice ensures that context is always a dictionary, preventing potential TypeError exceptions when operations like deepcopy and update are performed on it later in the method.

tests/test_tree.py (6)

699-699: Explicit call to infer_learning_activities.

The test has been updated to explicitly call the infer_learning_activities() method, correctly reflecting the refactored code where learning activities inference is now a separate step.


713-725: Good test for video content type inference.

This test verifies that a ContentNode with a video file URI correctly infers its kind as VIDEO and its learning activity as WATCH after processing through the pipeline.


727-739: Good test for audio content type inference.

This test verifies that a ContentNode with an audio file URI correctly infers its kind as AUDIO and its learning activity as LISTEN after processing through the pipeline.


741-753: Good test for document content type inference.

This test verifies that a ContentNode with a document file URI correctly infers its kind as DOCUMENT and its learning activity as READ after processing through the pipeline.


755-767: Good test for EPUB content type inference.

This test verifies that a ContentNode with an EPUB file URI correctly infers its kind as DOCUMENT and its learning activity as READ after processing through the pipeline.


769-780: Good test for HTML5 content type inference.

This test verifies that a ContentNode with an HTML5 zip file URI correctly infers its kind as HTML5 and its learning activity as EXPLORE after processing through the pipeline.

ricecooker/utils/pipeline/extract_metadata.py (4)

6-6: Good import usage.
The imported ContentNodeMetadata is correctly used below, and there's no redundancy.


20-20: Appropriate mapping for HTML5 extension.
Including HTML5 in PRESETS_FROM_EXTENSIONS is consistent with the rest of the file, ensuring the correct preset is inferred.


23-24: Useful preset-kind mapping.
KIND_FROM_PRESET provides a straightforward approach to retrieving the content kind. No issues spotted.


93-93: Integration of HTML5 extractor into default stage.
Adding HTML5MetadataExtractor to DEFAULT_CHILDREN suits the pipeline’s modular design, ensuring automatic processing for HTML5 content.

ricecooker/classes/nodes.py (11)

23-26: New imports for file and pipeline exceptions.
Bringing in File, ExpectedFileException, and InvalidFileException is necessary for the new URI-based logic below. No concerns here.


48-48: Added kind = None.
This ensures each Node starts with a default. Just confirm all subclasses correctly set their own kind.


181-189: Thumbnail preset logic.
This approach of finding a preset with thumbnail and matching kind is concise. Ensure StopIteration is handled properly in returning None.


483-483: Expanded constructor signature with **kwargs.
Accepting arbitrary keyword arguments can improve extensibility, but double-check that unexpected parameters are handled or validated to avoid silent errors.


508-512: Activity inference is straightforward.
Automatically determining learning_activities based on kind is helpful. This logic is easy to follow and consistent with kind_activity_map.


762-763: Accurate docstring for new params.
Explicitly documenting uri and pipeline is helpful for maintainers.


773-774: Constructor accepts uri and pipeline.
Passing these arguments is a nice design for flexible initialization.


782-784: Storing URI and pipeline references.
Straightforward assignment. Make sure to handle scenarios where uri is empty or invalid.


786-791: Defaulting pipeline to config.FILE_PIPELINE.
This fallback is convenient. Just confirm config.FILE_PIPELINE is always defined in contexts where the content node is used.


810-819: Validating the URI.
Raising an InvalidNodeException if the pipeline cannot handle the URI is a clean solution. No issues noted.


829-830: Conditional URI validation.
Skipping _validate_uri() when uri is not set avoids unnecessary checks. Looks good.

@rtibbles rtibbles changed the title Add AutomaticResourceNode class. Allow ContentNode class to take a uri argument for its default file Apr 23, 2025
@rtibbles rtibbles force-pushed the automatic_for_the_people branch from 8a0149c to 799b9e5 Compare April 28, 2025 15:10
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (8)
ricecooker/classes/files.py (2)

93-97: Guard against None values in the f-string for clearer error reports

self.filename or self.original_filename may still be None at this point (e.g. when the pipeline fails early).
If either is None, the current f-string will render "None" which is not helpful for debugging.

-        raise NotImplementedError(
-            f"preset must be set if preset isn't specified when creating File object for file {self.filename} ({self.original_filename})"
-        )
+        raise NotImplementedError(
+            "preset must be set if preset isn't specified when creating "
+            f"File object for file {self.filename or '<unknown>'} "
+            f"({self.original_filename or '<unknown>'})"
+        )

198-208: Second validate() call may double-check unchanged data

self.validate() is executed both before and after the pipeline.
Unless the pipeline mutates attributes that affect validation, the second call is redundant and may cost noticeable time for large downloads.

If you only need to verify the new metadata, consider limiting the post-pipeline check to the subset that actually changes, or gate it behind if metadata:.

ricecooker/utils/pipeline/transfer.py (1)

276-283: skip_cache param added – add docstring & default propagation

The new parameter is great, but the method’s docstring and type signature comment block weren’t updated to mention it.
Please document that setting skip_cache=True forces a re-download and that callers should propagate the value to child handlers (as you already do).

ricecooker/utils/pipeline/file_handler.py (2)

181-208: skip_cache logic works, but inverse flag name could avoid confusion

The function now receives skip_cache, yet the surrounding codebase still exports config.UPDATE (meaning do update).
Using a positive flag (use_cache=True) would read more clearly:

if file_metadata and use_cache and not self.cached_file_outdated(...):

Not a blocker, but consider renaming in a future breaking release to avoid double-negatives like if not skip_cache and ….


304-313: Avoid repeating handler discovery code – reuse parent implementation

FirstHandlerOnly.execute duplicates the child-scanning logic that already lives in CompositeHandler.should_handle.
You could delegate directly:

-        for handler in self._children:
-            if handler.should_handle(path):
-                return handler.execute(path, context=context, skip_cache=skip_cache)
+        chosen = next((h for h in self._children if h.should_handle(path)), None)
+        return chosen.execute(path, context=context, skip_cache=skip_cache) if chosen else []

Minor readability gain.

ricecooker/classes/nodes.py (3)

75-79: Duplicate kind class attribute definition

kind is defined twice consecutively (lines 75 and 78).
The second definition silently overwrites the first and is a potential maintenance trap.

-    kind = None
-    license = None
-    language = None
-    kind = None
+    kind = None
+    license = None
+    language = None

922-929: Instantiate specific File subclasses when possible

Creating every downloaded file as File(**metadata) loses important behaviour (e.g. preset-specific validation, is_primary flags).
Consider mapping the returned preset to the concrete subclass (VideoFile, AudioFile, …) via a factory so that downstream logic (thumbnails, size limits, etc.) still work.


1049-1064: Combine isinstance calls for readability (ruff SIM101)

-            if isinstance(file, SubtitleFile) or isinstance(file, YouTubeSubtitleFile):
+            if isinstance(file, (SubtitleFile, YouTubeSubtitleFile)):

Tiny readability improvement, silences the static-analysis warning.

🧰 Tools
🪛 Ruff (0.8.2)

1049-1049: Multiple isinstance calls for file, merge into a single call

Merge isinstance calls for file

(SIM101)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8a0149c and 799b9e5.

📒 Files selected for processing (10)
  • ricecooker/classes/files.py (2 hunks)
  • ricecooker/classes/licenses.py (1 hunks)
  • ricecooker/classes/nodes.py (25 hunks)
  • ricecooker/utils/pipeline/__init__.py (2 hunks)
  • ricecooker/utils/pipeline/context.py (2 hunks)
  • ricecooker/utils/pipeline/extract_metadata.py (5 hunks)
  • ricecooker/utils/pipeline/file_handler.py (5 hunks)
  • ricecooker/utils/pipeline/transfer.py (1 hunks)
  • tests/test_files.py (1 hunks)
  • tests/test_tree.py (4 hunks)
✅ Files skipped from review due to trivial changes (1)
  • ricecooker/classes/licenses.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • ricecooker/utils/pipeline/init.py
  • tests/test_files.py
  • ricecooker/utils/pipeline/context.py
  • ricecooker/utils/pipeline/extract_metadata.py
  • tests/test_tree.py
🧰 Additional context used
🪛 Ruff (0.8.2)
ricecooker/classes/nodes.py

104-104: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


1049-1049: Multiple isinstance calls for file, merge into a single call

Merge isinstance calls for file

(SIM101)

⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: Python unit tests (ubuntu-latest, 3.9)
  • GitHub Check: Python unit tests (windows-latest, 3.12)
  • GitHub Check: Python unit tests (windows-latest, 3.11)
  • GitHub Check: Python unit tests (windows-latest, 3.10)
  • GitHub Check: Python unit tests (windows-latest, 3.9)

@rtibbles
Copy link
Member Author

To handle @coderabbitai's feedback and more, after discussion with @marcellamaki, I will be making some updates here to focus validation on the file presets, rather than the file classes.

In order to have full compatibility with the URI based mechanism, and ensuring any preset inference has happened already, this should only happen after process_files has been called.

@coderabbitai
Copy link

coderabbitai bot commented Apr 29, 2025

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Let's first check the current implementation of ContentNode validation and file presets
echo "### ContentNode class and validation ###"
cat ricecooker/classes/nodes.py | grep -A 50 "class ContentNode" | grep -B 10 -A 30 "def validate"

echo "### process_files method ###"
cat ricecooker/classes/nodes.py | grep -n "def process_files" -A 15

echo "### File presets related code ###"
grep -r "preset" --include="*.py" ricecooker/classes/ | head -30

echo "### Looking at file_formats and format_presets ###"
grep -r "file_formats\|format_presets" --include="*.py" ricecooker/ | head -20

Length of output: 6472


🏁 Script executed:

#!/bin/bash
# Let's examine the ContentNode validation method to understand current implementation
echo "### ContentNode validate method ###"
cat ricecooker/classes/nodes.py | grep -n -A 30 "def validate" | grep -A 50 "class ContentNode"

# Let's look at how content node subclasses define required file types
echo -e "\n### ContentNode subclasses and their file type requirements ###"
grep -rn "required_file_types\|required_file_format\|required_presets" --include="*.py" ricecooker/classes/nodes.py | head -30

# Let's look at the _process_uri method to better understand how URIs are processed
echo -e "\n### _process_uri method ###"
cat ricecooker/classes/nodes.py | grep -n -A 40 "_process_uri"

# Let's understand the relationship between file formats and presets
echo -e "\n### format_presets in le_utils ###"
pip install le-utils -q
python -c "from le_utils.constants import format_presets; print([{p.id: {'convertible_formats': p.convertible_formats, 'associated_mimetype': getattr(p, 'associated_mimetypes', None)}} for p in format_presets.PRESETLIST][:5])"

Length of output: 6542


@rtibbles This approach makes a lot of sense! Focusing on file presets rather than file classes will create more consistent validation, especially when files are created dynamically from URIs.

Moving the validation to happen after process_files is called is the right strategy because:

  1. It ensures all URI processing has completed
  2. All preset inference will have already happened
  3. It creates a uniform validation path for both direct file addition and URI-based content nodes

Would you like me to help sketch out how this implementation might look? I'd be happy to assist with the code changes to shift validation from file types to presets, and ensure it happens at the right point in the processing flow.

@rtibbles rtibbles force-pushed the automatic_for_the_people branch from 799b9e5 to 6f43b61 Compare May 5, 2025 23:25
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🔭 Outside diff range comments (1)
tests/test_exercises.py (1)

138-146: 🛠️ Refactor suggestion

Avoid re-using the same exercise instance across negative-case assertions

Inside a single test you mutate exercise.extra_fields twice and then run two
independent error scenarios ("3.0" / "5.1" and "three" / "five").
Because the first exercise.process_files() aborts with an exception midway,
the object state is left in an indeterminate condition (e.g. partially-parsed
extra_fields, cached files, etc.).
Running the second scenario on the same object can therefore hide or create
bugs that will never occur in production, where a fresh instance is processed
only once.

-# current pattern
exercise.extra_fields = {...}
with pytest.raises(InvalidNodeException):
    exercise.process_files()
...
exercise.extra_fields = {...}
with pytest.raises(InvalidNodeException):
    exercise.process_files()
+
+# safer pattern (parametrised)
+for bad_fields in [{"m": "3.0", "n": "5.1"},
+                   {"m": "three", "n": "five"}]:
+    ex = ExerciseNode(... fresh args ...)          # or copy.deepcopy(exercise)
+    ex.extra_fields = {"mastery_model": exercises.M_OF_N, **bad_fields}
+    with pytest.raises(InvalidNodeException):
+        ex.process_files()

This guarantees isolation between scenarios and keeps the intent of the test
clear.

🧹 Nitpick comments (2)
tests/test_data.py (1)

58-70: Consider also asserting that validate() fails after the file-processing error

The new contract (raising at process_files) is correctly captured, but the
test no longer checks that calling validate() afterwards would also fail.
If process_files swallows the error internally and the object is left in a
state where validate() passes, an invalid node could slip through later
stages of the pipeline.

with pytest.raises(InvalidNodeException):
    contentnode_invalid_files.process_files()
with pytest.raises(InvalidNodeException):
    contentnode_invalid_files.validate()

Duplicating the assertion is inexpensive and ensures both entry points are
consistent.

ricecooker/classes/nodes.py (1)

1049-1069: Improved subtitle handling in VideoNode

The updated validation method for VideoNode now filters duplicate subtitle files by language code, keeping only the first subtitle file for each language.

The instanceof check could be simplified:

-            if isinstance(file, SubtitleFile) or isinstance(file, YouTubeSubtitleFile):
+            if isinstance(file, (SubtitleFile, YouTubeSubtitleFile)):

This is a minor optimization but follows Python best practices.

🧰 Tools
🪛 Ruff (0.8.2)

1053-1053: Multiple isinstance calls for file, merge into a single call

Merge isinstance calls for file

(SIM101)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 799b9e5 and 6f43b61.

📒 Files selected for processing (15)
  • ricecooker/classes/files.py (9 hunks)
  • ricecooker/classes/licenses.py (1 hunks)
  • ricecooker/classes/nodes.py (25 hunks)
  • ricecooker/commands.py (1 hunks)
  • ricecooker/managers/tree.py (3 hunks)
  • ricecooker/utils/pipeline/__init__.py (2 hunks)
  • ricecooker/utils/pipeline/context.py (2 hunks)
  • ricecooker/utils/pipeline/extract_metadata.py (5 hunks)
  • ricecooker/utils/pipeline/file_handler.py (5 hunks)
  • ricecooker/utils/pipeline/transfer.py (1 hunks)
  • tests/test_data.py (1 hunks)
  • tests/test_exercises.py (2 hunks)
  • tests/test_files.py (1 hunks)
  • tests/test_thumbnails.py (2 hunks)
  • tests/test_tree.py (4 hunks)
✅ Files skipped from review due to trivial changes (1)
  • ricecooker/classes/licenses.py
🚧 Files skipped from review as they are similar to previous changes (6)
  • tests/test_files.py
  • ricecooker/utils/pipeline/transfer.py
  • ricecooker/utils/pipeline/extract_metadata.py
  • ricecooker/utils/pipeline/context.py
  • ricecooker/utils/pipeline/file_handler.py
  • ricecooker/utils/pipeline/init.py
🧰 Additional context used
🧬 Code Graph Analysis (3)
tests/test_data.py (3)
ricecooker/exceptions.py (1)
  • InvalidNodeException (60-64)
tests/conftest.py (10)
  • contentnode_invalid_files (341-344)
  • video (414-421)
  • video_invalid_files (425-431)
  • audio (519-526)
  • audio_invalid_files (530-536)
  • document (593-600)
  • document_invalid_files (604-610)
  • html (709-716)
  • html_invalid_files (720-729)
  • html_invalid_zip (747-753)
ricecooker/classes/nodes.py (9)
  • process_files (275-299)
  • process_files (949-957)
  • process_files (1273-1287)
  • validate (422-556)
  • validate (658-667)
  • validate (877-919)
  • validate (1043-1069)
  • validate (1321-1363)
  • validate (1558-1574)
tests/test_exercises.py (1)
ricecooker/exceptions.py (1)
  • InvalidNodeException (60-64)
tests/test_thumbnails.py (2)
ricecooker/classes/files.py (9)
  • ExtractedEPubThumbnailFile (511-516)
  • ExtractedHTMLZipThumbnailFile (519-524)
  • ExtractedPdfThumbnailFile (503-508)
  • ExtractedVideoThumbnailFile (527-532)
  • process_file (167-168)
  • process_file (198-219)
  • process_file (417-429)
  • process_file (468-490)
  • process_file (548-551)
tests/conftest.py (4)
  • invalid_document_file (634-647)
  • invalid_epub_file (651-664)
  • html_invalid_file (733-743)
  • invalid_video_file (435-444)
🪛 Ruff (0.8.2)
ricecooker/classes/nodes.py

97-97: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


1053-1053: Multiple isinstance calls for file, merge into a single call

Merge isinstance calls for file

(SIM101)

⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: Python unit tests (ubuntu-latest, 3.9)
  • GitHub Check: Python unit tests (windows-latest, 3.12)
  • GitHub Check: Python unit tests (windows-latest, 3.11)
  • GitHub Check: Python unit tests (windows-latest, 3.10)
  • GitHub Check: Python unit tests (windows-latest, 3.9)
🔇 Additional comments (38)
ricecooker/commands.py (1)

279-282: Change looks good – call signature now matches ChannelManager.process_tree

tree.process_tree() no longer requires the channel argument, and the helper
continues to return (files_to_diff, FAILED_FILES) as expected by
set_files. No further action needed.

ricecooker/managers/tree.py (1)

50-61: Concurrent removal of nodes can introduce subtle race conditions

node.remove_from_parent() is executed inside the thread-pool worker while
other workers may still be iterating over or modifying the same parent’s
children list. Although the GIL protects against seg-faults, logical races
(e.g. lost updates, inconsistent ordering) are possible and hard to debug.

Consider one of:

  1. Mark nodes for removal in a thread-safe queue, then prune the tree once the
    executor finishes; or
  2. Protect remove_from_parent with a per-parent threading.Lock.

This guarantees the channel structure remains coherent for all subsequent
steps (upload, publish, etc.).

tests/test_thumbnails.py (6)

11-15: Good refactoring of imports to support direct testing of extracted thumbnail files

The imports have been updated to use the extracted thumbnail file classes directly, which aligns with the changes in the implementation that focuses on preset-based validation rather than file classes.


258-263: LGTM: Simplified thumbnail tests to test extractors directly

This refactoring improves the test by directly testing the ExtractedPdfThumbnailFile class rather than going through node creation and file attachment. This makes the tests more focused and directly aligned with the implementation changes.


266-271: Improved test isolation and clarity

Testing the thumbnail file extraction directly instead of through a document node improves test isolation and makes the test intent clearer.


275-280: Consistent pattern applied to EPUB thumbnail testing

Good application of the same test pattern for EPUB thumbnails, maintaining consistency across all thumbnail extraction tests.


292-305: Consistent pattern applied to HTML zip thumbnail testing

The test follows the same pattern as the other thumbnail extractors, ensuring consistency in the test approach.


309-322: Successfully adapted tests for video thumbnail extraction

The refactoring to directly test the ExtractedVideoThumbnailFile maintains consistency with the other extractors and aligns with the preset-based validation approach.

ricecooker/classes/files.py (5)

58-58: Good addition of default_preset attribute to the base File class

Adding this attribute to the base class enables consistent preset handling across all file types.


92-100: Improved preset fallback in get_preset method

The method now checks for a default_preset before raising an error, providing a more robust fallback mechanism. The error message is also more descriptive with additional context.


202-211: Enhanced pipeline execution with cache control and post-processing validation

The changes to process_file now:

  1. Pass the skip_cache parameter to control caching during execution
  2. Add validation after processing to ensure the file remains valid after metadata updates

This ensures consistent validation throughout the file processing workflow.


229-229: Consistent use of default_preset across file classes

Good implementation of default_preset across all file subclasses, which simplifies preset handling and removes the need for individual get_preset overrides in subclasses.

Also applies to: 245-245, 256-256, 263-263, 270-270, 277-277, 284-284, 355-355, 382-382, 397-397


505-505: Linked thumbnail extractors to their source file formats

Explicitly associating allowed_formats in extracted thumbnail classes with their source file types creates a clear relationship and ensures thumbnail extraction constraints are tied to their corresponding file types.

Also applies to: 513-513, 521-521, 529-529

tests/test_tree.py (9)

14-16: Good addition of imports to support enhanced metadata tests

The added imports for label constants, node classes, and FilePipeline support the new tests for metadata inheritance and automatic content type inference.

Also applies to: 25-27, 30-30, 34-34, 37-37


705-705: Explicit learning activity inference for clarity

Explicitly calling infer_learning_activities() before assertion improves the test clarity by making the expected behavior more explicit.


719-786: Comprehensive tests for automatic resource node kind and learning activity inference

These tests thoroughly verify that ContentNode instances correctly:

  1. Infer content kind based on resource file type
  2. Set appropriate learning activities based on the inferred kind
  3. Handle various resource types (video, audio, document, epub, html5)

This test coverage is essential for validating the new URI-based node initialization feature.


789-798: Good baseline tests for metadata gathering

These tests establish baseline expectations for metadata gathering in both base Node and TreeNode without metadata, which is important for understanding the default behavior.


800-813: Well-designed helper method for metadata comparison

The _assert_metadata_equal helper method properly handles comparison of complex metadata including:

  1. List fields (using set comparison to ignore order)
  2. License objects (comparing by license_id)
  3. Other fields (direct comparison)

This makes the tests more robust and readable.


815-873: Comprehensive tests for metadata inheritance and merging

These tests thoroughly verify that:

  1. TreeNodes correctly gather metadata from parents
  2. Nodes combine their own metadata with parent metadata
  3. Child node values properly override parent values
  4. List fields are correctly merged

This ensures the metadata inheritance mechanism works as expected.


875-900: Good tests for hierarchical metadata merging

The test correctly verifies that when a parent has a broader subject and a child has a more specific one, only the specific one is kept in the final metadata. This is important for ensuring sensible metadata categorization.


902-990: Thorough testing of metadata inheritance for ContentNode

These tests verify that ContentNode instances correctly:

  1. Inherit simple metadata fields (language, license, etc.)
  2. Inherit and merge label fields (grade levels, categories, etc.)
  3. Do not override existing values when inheriting from ancestors

This ensures the metadata inheritance mechanism works as expected for content nodes.


992-1020: Good tests for hierarchical label inheritance

The test verifies that more specific category labels (like ALGEBRA) correctly replace broader ones (like MATHEMATICS) while preserving unrelated categories. This is crucial for proper metadata organization.

ricecooker/classes/nodes.py (16)

34-35: Useful preset lookup dictionary for efficient preset access

Creating a dictionary for preset lookup by ID improves efficiency when accessing preset information throughout the code.


48-62: Well-organized metadata field categorization

Good separation of inheritable fields into simple value fields and metadata label fields, which clarifies how different types of metadata are handled during inheritance.


404-406: Good helper method for validation errors

The _validate_values helper method centralizes the error handling for validation, making the code more consistent and easier to maintain.


407-412: Good learning activity inference method

This method sensibly sets default learning activities based on content kind when not explicitly provided.


412-421: Improved license handling with flexible input types

The updated set_license method intelligently handles both string license IDs and License objects, making the API more flexible.


422-555: Comprehensive validation with detailed error messages

The updated validate method provides thorough validation of all node attributes with specific, helpful error messages. The validation covers:

  1. Required fields (source_id, title)
  2. Type checking for all attributes
  3. Field-specific validation (e.g., role, grade_levels)
  4. License validation

This ensures node data integrity throughout the application.


558-582: Well-designed metadata inheritance mechanism

The get_metadata_dict method intelligently handles different types of metadata:

  1. Simple fields are copied directly if not empty
  2. List fields are merged with special handling for hierarchical values
  3. More specific values replace broader ones in hierarchical fields

This creates a sensible and predictable metadata inheritance system.


844-858: Good support for URI-based content node initialization

The ContentNode class now supports initialization with just a URI, with the pipeline inferring file types and metadata. The pipeline property provides a sensible default if not explicitly provided.


867-875: Robust URI validation

The _validate_uri method ensures that the provided URI can be handled by the pipeline before proceeding with processing, preventing downstream errors.


877-919: Improved validation for ContentNode focusing on presets

The updated validate method now focuses on file presets rather than file classes, ensuring:

  1. Nodes have a license
  2. Nodes have a kind set after processing
  3. Required presets are present (but not duplicated)
  4. Nodes have at least one default file

This aligns with the broader refactoring to preset-based validation.


921-948: Comprehensive URI processing implementation

The _process_uri method effectively:

  1. Executes the pipeline on the URI
  2. Processes file metadata into File objects
  3. Extracts and applies content node metadata
  4. Handles errors by removing the node from its parent

This enables the powerful feature of creating content nodes directly from URIs.


949-957: Enhanced process_files with validation

The updated process_files method now:

  1. Processes the URI if present
  2. Calls the parent method to process existing files
  3. Marks files as processed and validates the node

This ensures validation happens after all file processing is complete.


996-1000: Simple but effective metadata inheritance method

The set_metadata_from_ancestors method pulls metadata from ancestors and sets it on the node, providing a clean way to inherit metadata.


1023-1023: Consistent use of required_presets across node classes

Replacing file format requirements with preset requirements standardizes validation across node types and aligns with the preset-based approach throughout the codebase.

Also applies to: 1093-1093, 1117-1121, 1170-1170, 1210-1210, 1459-1459, 1483-1483


1327-1362: Improved ExerciseNode validation

The updated validation for ExerciseNode:

  1. Checks for presence of questions
  2. Validates each question
  3. Verifies mastery model settings
  4. Ensures M and N values are valid for M of N mastery model

This provides more thorough validation for exercise content.


97-97: 🛠️ Refactor suggestion

Use None instead of mutable default argument

Using a mutable data structure (empty list) as a default argument value is considered a bad practice in Python because the same list instance will be shared across all function calls.

-        def get_preset(self):
-            if self.preset:
-                return self.preset
-            if self.default_preset:
-                return self.default_preset
-            raise NotImplementedError(
-                f"preset must be set if preset and default_preset isn't specified when creating {self.__class__.__name__} "
-                f"object for file {self.filename} ({self.original_filename})"
-            )
+        def get_preset(self):
+            if self.preset:
+                return self.preset
+            if self.default_preset:
+                return self.default_preset
+            raise NotImplementedError(
+                f"preset must be set if preset and default_preset isn't specified when creating {self.__class__.__name__} "
+                f"object for file {self.filename} ({self.original_filename})"
+            )

I don't see the mutable default argument in this specific snippet. The warning seems to be a false positive from the static analysis tool.

Likely an incorrect or invalid review comment.

🧰 Tools
🪛 Ruff (0.8.2)

97-97: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

Comment on lines 92 to 97
try:
node.process_files()
except (InvalidNodeException, ValueError) as e:
if config.STRICT:
raise
else:
config.LOGGER.warning(str(e))
config.LOGGER.warning(
f"Removing {node} from tree due to validation errors"
)
node.remove_from_parent()

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Log full traceback and normalise the exception surface

Catching (InvalidNodeException, ValueError) is sensible, but the current
warning only prints str(e), losing the traceback that is invaluable during
chef debugging.

-except (InvalidNodeException, ValueError) as e:
-    ...
-    config.LOGGER.warning(str(e))
+except (InvalidNodeException, ValueError) as e:
+    config.LOGGER.warning("Validation error in %s", node, exc_info=e)

Additionally, if process_files can still emit raw ValueErrors it may be
worth standardising those to InvalidNodeException at the source to keep
error handling uniform.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
try:
node.process_files()
except (InvalidNodeException, ValueError) as e:
if config.STRICT:
raise
else:
config.LOGGER.warning(str(e))
config.LOGGER.warning(
f"Removing {node} from tree due to validation errors"
)
node.remove_from_parent()
try:
node.process_files()
except (InvalidNodeException, ValueError) as e:
if config.STRICT:
raise
else:
- config.LOGGER.warning(str(e))
+ config.LOGGER.warning("Validation error in %s", node, exc_info=e)
config.LOGGER.warning(
f"Removing {node} from tree due to validation errors"
)
node.remove_from_parent()

)
self.uri = uri
self._pipeline = pipeline
# We should do additional validation of files once they have been processed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand this comment in this context. I know what it says, and I think "yes", but I don't fully know why it's here. It's because we have self._files_processed = False which means we will do more later?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's a really rather unelucidatory comment, to the point that I thought maybe I had copy pasted it from Claude output, but honestly I don't think I did?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have updated the comment to be much more explicit and (hopefully) clear.

@rtibbles rtibbles force-pushed the automatic_for_the_people branch from 6f43b61 to 7fc41e0 Compare May 9, 2025 16:38
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (4)
tests/test_tree.py (1)

800-813: Good helper method for metadata comparison.

The _assert_metadata_equal helper method enhances test readability and maintainability by centralizing the logic for comparing metadata dictionaries, with special handling for lists and license objects.

Consider adding a docstring to the helper method explaining its purpose and parameters for better maintainability.

-def _assert_metadata_equal(expected, actual):
+def _assert_metadata_equal(expected, actual):
+    """
+    Assert that two metadata dictionaries are equal, with special handling for lists and license objects.
+    
+    Args:
+        expected (dict): The expected metadata dictionary
+        actual (dict): The actual metadata dictionary to compare
+    """
ricecooker/classes/nodes.py (3)

68-72: Redundant kind = None declaration.

The kind = None attribute is declared twice, once on line 68 and again on line 71.

Remove the redundant kind = None declaration:

-    kind = None
     license = None
     language = None
-    kind = None

73-143: Enhanced Node initialization with comprehensive metadata attributes.

The updated Node initialization now properly handles all metadata attributes needed for content nodes, with appropriate default values. This provides a solid foundation for the metadata inheritance system.

Line 97 uses a mutable default argument which could lead to unexpected behavior:

-        tags=None,
+        tags=None,  # Will be initialized as [] in the body

Then in the constructor body:

-        self.tags = tags or []
+        self.tags = [] if tags is None else tags
🧰 Tools
🪛 Ruff (0.8.2)

97-97: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


1052-1072: Simplified VideoNode validation.

The validate method for VideoNode has been simplified to filter duplicate subtitle files by language code, removing try-except blocks and assertions.

The multiple isinstance checks could be combined for better readability:

-            if isinstance(file, SubtitleFile) or isinstance(file, YouTubeSubtitleFile):
+            if isinstance(file, (SubtitleFile, YouTubeSubtitleFile)):
🧰 Tools
🪛 Ruff (0.8.2)

1056-1056: Multiple isinstance calls for file, merge into a single call

Merge isinstance calls for file

(SIM101)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6f43b61 and 7fc41e0.

📒 Files selected for processing (15)
  • ricecooker/classes/files.py (9 hunks)
  • ricecooker/classes/licenses.py (1 hunks)
  • ricecooker/classes/nodes.py (25 hunks)
  • ricecooker/commands.py (1 hunks)
  • ricecooker/managers/tree.py (3 hunks)
  • ricecooker/utils/pipeline/__init__.py (2 hunks)
  • ricecooker/utils/pipeline/context.py (2 hunks)
  • ricecooker/utils/pipeline/extract_metadata.py (5 hunks)
  • ricecooker/utils/pipeline/file_handler.py (5 hunks)
  • ricecooker/utils/pipeline/transfer.py (1 hunks)
  • tests/test_data.py (1 hunks)
  • tests/test_exercises.py (2 hunks)
  • tests/test_files.py (1 hunks)
  • tests/test_thumbnails.py (2 hunks)
  • tests/test_tree.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (13)
  • ricecooker/classes/licenses.py
  • ricecooker/commands.py
  • tests/test_exercises.py
  • tests/test_files.py
  • ricecooker/utils/pipeline/extract_metadata.py
  • ricecooker/utils/pipeline/context.py
  • ricecooker/managers/tree.py
  • tests/test_data.py
  • ricecooker/utils/pipeline/transfer.py
  • ricecooker/utils/pipeline/init.py
  • tests/test_thumbnails.py
  • ricecooker/utils/pipeline/file_handler.py
  • ricecooker/classes/files.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/test_tree.py (5)
ricecooker/classes/nodes.py (11)
  • ContentNode (821-1002)
  • Node (65-584)
  • TreeNode (670-794)
  • pipeline (857-860)
  • infer_learning_activities (407-410)
  • process_files (275-299)
  • process_files (952-960)
  • process_files (1276-1290)
  • gather_ancestor_metadata (583-584)
  • gather_ancestor_metadata (788-794)
  • set_metadata_from_ancestors (999-1002)
ricecooker/exceptions.py (1)
  • InvalidNodeException (60-64)
ricecooker/utils/pipeline/__init__.py (1)
  • FilePipeline (19-90)
tests/conftest.py (6)
  • video_file (359-370)
  • audio_file (482-494)
  • document_file (558-568)
  • epub_file (614-621)
  • html_file (672-684)
  • title (250-251)
tests/test_files.py (1)
  • process_files (64-79)
🪛 Ruff (0.8.2)
ricecooker/classes/nodes.py

97-97: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


1056-1056: Multiple isinstance calls for file, merge into a single call

Merge isinstance calls for file

(SIM101)

⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: Python unit tests (ubuntu-latest, 3.9)
  • GitHub Check: Python unit tests (windows-latest, 3.12)
  • GitHub Check: Python unit tests (windows-latest, 3.11)
  • GitHub Check: Python unit tests (windows-latest, 3.10)
  • GitHub Check: Python unit tests (windows-latest, 3.9)
🔇 Additional comments (50)
tests/test_tree.py (17)

14-16: These imports properly support the new tests for metadata and labels.

These imports for label constants are necessary for the new tests related to metadata inheritance and will help make the tests more readable by using the actual constants rather than string literals.


25-27: Good addition of node class imports to support the new tests.

These additional node class imports support the direct testing of various node types and their metadata inheritance capabilities, allowing for more comprehensive tests.

Also applies to: 30-30, 34-34


37-37: FilePipeline import supports the new URI-based content node creation.

This import is necessary for the new tests that verify automatic content kind inference from URIs.


705-705: Explicitly calling infer_learning_activities aligns with the updated ContentNode behavior.

The test now explicitly calls infer_learning_activities(), which matches the implementation change where learning activities are no longer automatically inferred during initialization but must be explicitly triggered.


719-731: Good test for automatic video content kind inference.

This test validates that a ContentNode can automatically infer its kind as VIDEO and set the appropriate learning activity (WATCH) when initialized with a video file URI. This is a key test for the new URI-based initialization feature.


733-745: Test ensures automatic audio content kind inference works correctly.

This test verifies that a ContentNode can identify an audio file and correctly set its kind to AUDIO and learning activity to LISTEN, confirming the URI-based pipeline processing functionality.


747-759: Test validates document content kind inference.

This test ensures that document files are correctly processed and result in a ContentNode with DOCUMENT kind and READ learning activity.


761-773: Test confirms EPUB files are processed as documents.

This test verifies that EPUB files are correctly categorized as DOCUMENT kind with READ learning activity, ensuring consistent treatment of different document formats.


775-787: Test for HTML5 content kind inference is comprehensive.

This test ensures HTML5 files are correctly processed as HTML5 kind with EXPLORE learning activity, completing the coverage for all major content types supported by the URI-based initialization.


789-792: Base tests for metadata inheritance establish important baselines.

These tests verify the foundation of the metadata inheritance system by ensuring that Nodes without metadata return empty dictionaries and that TreeNodes with empty parents also return empty metadata.

Also applies to: 794-798


815-837: Test for TreeNode metadata gathering is thorough.

This test verifies that a TreeNode properly inherits metadata from its parent node, covering all relevant metadata fields including simple values and label fields.


839-873: Comprehensive test for metadata combination and precedence.

This test ensures that when a child node has its own metadata, it properly combines with parent metadata with the correct precedence (child overrides parent for the same field).


875-900: Test for hierarchical metadata filtering is important for label fields.

This test verifies that the metadata system intelligently filters hierarchical labels, removing broader subjects when more specific ones are present, which prevents redundancy in metadata.


902-927: Test for ContentNode simple field inheritance is thorough.

This test validates that a ContentNode properly inherits simple metadata fields (language, license, author, etc.) from its ancestors.


929-952: Test for ContentNode label field inheritance is comprehensive.

This test ensures that ContentNode properly inherits label fields (grade_levels, resource_types, etc.) from its ancestors and merges them correctly.


954-990: Test for metadata precedence is valuable.

This test confirms that existing values on a child node are not overridden by parent values during metadata inheritance, which is crucial for maintaining content creator intent.


992-1020: Test for hierarchical label inheritance handles complex cases.

This test verifies that the hierarchical filtering works correctly for ContentNode label inheritance, ensuring that broader categories are removed when more specific ones exist in the child.

ricecooker/classes/nodes.py (33)

22-31: Added file imports improve code organization.

The added imports for extracted thumbnail files and pipeline exceptions properly support the new URI-based content node creation functionality.


34-34: PRESET_LOOKUP dictionary improves preset access efficiency.

Creating a lookup dictionary for presets is a good optimization that will make preset retrieval more efficient than repeated filtering operations.


48-62: Well-defined metadata field collections support inheritance.

These collections clearly define which fields should be inheritable and how they should be processed, providing a strong foundation for the metadata inheritance system.


168-170: Added repr method improves debugging.

Adding a __repr__ method that calls __str__ ensures consistent string representation of nodes during debugging and in interactive sessions.


200-214: Added child removal methods enhance tree manipulation.

The remove_child and remove_from_parent methods provide important functionality for dynamically modifying the node tree, which is especially useful when a node is found to be invalid during processing.


264-272: Improved thumbnail preset retrieval.

The updated get_thumbnail_preset method uses a cleaner approach to find the appropriate preset by filtering the preset list directly, making the code more maintainable.


403-411: Added validation helper method improves consistency.

The _validate_values helper method centralizes the validation logic, ensuring consistent error handling and reducing code duplication throughout the validation methods.


407-411: Learning activity inference method provides smart defaults.

The infer_learning_activities method automatically sets appropriate learning activities based on content kind, which ensures consistent metadata even when not explicitly provided.


412-421: License setter method simplifies license handling.

The set_license method properly handles both string-based license identifiers and License objects, making it more convenient to set licenses in different ways.


422-556: Thoroughly refactored validation with comprehensive checks.

The validate method has been completely rewritten to use the _validate_values helper and adds comprehensive validation for all metadata fields, ensuring data integrity.


558-585: Well-designed metadata inheritance system.

The get_metadata_dict and gather_ancestor_metadata methods provide a robust foundation for metadata inheritance throughout the node hierarchy, with intelligent handling of simple values and hierarchical label fields.


603-609: ChannelNode initialization properly handles required parameters.

The updated ChannelNode constructor explicitly requires source_id and title, which ensures that these essential attributes are always provided.


644-644: Fixed license serialization in to_dict.

Updated to use license_id instead of accessing the license object directly, which ensures proper serialization of the license information.


663-667: Simplified ChannelNode validation.

The validation method now uses the centralized _validate_values helper, making the code more consistent with other node classes.


788-794: Enhanced TreeNode metadata inheritance.

The gather_ancestor_metadata method now properly raises an exception if no parent exists and combines parent metadata with its own using the metadata dictionary method.


844-844: Replaced file types with presets for validation.

Using required_presets instead of required_file_types allows for more flexible validation that works with both directly added files and those created dynamically from URIs.


846-861: ContentNode now supports URI-based initialization.

The updated ContentNode constructor adds uri and pipeline parameters, allowing for initialization with just a URI that will be processed to infer file details and content kind.


870-879: URI validation ensures pipeline compatibility.

The _validate_uri method verifies that the pipeline can handle the provided URI, preventing issues with unsupported file types or invalid URIs.


880-922: ContentNode validation now checks required presets.

The updated validate method now checks for required presets rather than file types, which allows for validation of files created from URIs. It also defers validation until files have been processed.


952-960: Enhanced process_files method manages URI processing.

The overridden process_files method ensures that URI processing happens before other file processing and sets the _files_processed flag to indicate that validation can now check for required files.


999-1003: Added method to set metadata from ancestors.

The set_metadata_from_ancestors method applies inherited metadata to the node, which ensures that metadata is consistently propagated through the node hierarchy.


1026-1027: Updated VideoNode to use presets for validation.

The VideoNode now uses required_presets instead of required_file_types for validation, which is more flexible and works with URI-based file creation.


1096-1097: Updated AudioNode required_presets.

AudioNode now uses required_presets for validation, which aligns with the new validation approach.


1120-1124: DocumentNode presets include multiple document formats.

The required_presets for DocumentNode now includes DOCUMENT, EPUB, and BLOOMPUB, which ensures all document types are properly validated.


1127-1129: Updated thumbnail generation to use presets.

The document thumbnail generation now uses get_preset() to identify document files, which is more flexible than checking file types directly.


1173-1174: Updated HTML5AppNode required_presets.

HTML5AppNode now uses required_presets for validation, which aligns with the new validation approach.


1181-1183: Updated HTML5 thumbnail generation to use presets.

The HTML5 thumbnail generation now uses get_preset() to identify HTML5 files, which is more consistent with the preset-based approach.


1213-1214: Updated H5PAppNode required_presets.

H5PAppNode now uses required_presets for validation, which aligns with the new validation approach.


1330-1365: Refactored ExerciseNode validation.

The validate method for ExerciseNode has been refactored to use the _validate_values helper for question presence, question validity, mastery model correctness, and integer coercion of mastery parameters.


1462-1463: Added required_presets to CustomNavigationNode.

CustomNavigationNode now uses required_presets for validation, which aligns with the new validation approach.


1473-1475: Updated CustomNavigationNode thumbnail generation to use presets.

The thumbnail generation now uses get_preset() to identify HTML5 files, which is more consistent with the preset-based approach.


1486-1487: Added required_presets to CustomNavigationChannelNode.

CustomNavigationChannelNode now uses required_presets for validation, which aligns with the new validation approach.


1577-1578: Fixed StudioContentNode validation return value.

The StudioContentNode validate method now returns the result of the parent validate method, ensuring proper validation chain.

def _process_uri(self):
try:
file_metadata_list = self.pipeline.execute(
self.uri, skip_cache=config.UPDATE
)
except (InvalidFileException, ExpectedFileException) as e:
config.LOGGER.error(f"Error processing path: {self.uri} with error: {e}")
self.remove_from_parent()
return None
content_metadata = {}
for file_metadata in file_metadata_list:
metadata_dict = file_metadata.to_dict()
if "content_node_metadata" in metadata_dict:
content_metadata.update(metadata_dict.pop("content_node_metadata"))
# Remove path from metadata_dict as it is not needed for the File object
metadata_dict.pop("path", None)
file_obj = File(**metadata_dict)
self.add_file(file_obj)
for key, value in content_metadata.items():
if key == "extra_fields":
self.extra_fields.update(value)
else:
if key == "kind" and self.kind is not None and self.kind != value:
raise InvalidNodeException(
"Inferred kind is different from content node class kind."
)
setattr(self, key, value)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

URI processing extracts metadata and creates files.

The _process_uri method runs the pipeline on the URI, converts file metadata to File objects, extracts content node metadata, and handles errors appropriately.

The error handling in _process_uri removes the node from its parent when an error occurs, which could lead to surprising behavior. Consider instead flagging the node as invalid and letting the tree validation process handle removal:

     try:
         file_metadata_list = self.pipeline.execute(
             self.uri, skip_cache=config.UPDATE
         )
     except (InvalidFileException, ExpectedFileException) as e:
         config.LOGGER.error(f"Error processing path: {self.uri} with error: {e}")
-        self.remove_from_parent()
+        # Flag the node as invalid instead of removing it
+        self._invalid_uri = True
         return None

Then update the validate method to check for this flag:

     if self.uri:
+        if hasattr(self, '_invalid_uri') and self._invalid_uri:
+            self._validate_values(True, f"Invalid URI: {self.uri}")
         self._validate_uri()

Committable suggestion skipped: line range outside the PR's diff.

rtibbles added 5 commits May 12, 2025 16:32
…istent.

Implement required file checks, focused on presets. Enforce a single required preset per node.
Update metadata extraction to infer kind.
Update tree processing to flag invalid nodes and exclude from created tree or raise errors if in strict mode.
Refactor to allow metadata to be from channel level down.
@rtibbles rtibbles force-pushed the automatic_for_the_people branch from 7fc41e0 to 0978f3b Compare May 12, 2025 23:59
@rtibbles
Copy link
Member Author

As discussed, nodes are now flagged as valid or invalid during the validation process. When errors are raised either during validation of file processing by the ChannelManager class, it stores the string representation of the error on the node under a _error attribute, which is then reused for reporting.

Any invalid nodes, or ones with failed file uploads, are now excluded from upload to Studio. They are reported node by node, rather than under the umbrella of their parent, this should give more inspectability of errors that are happening.

If the content integrator prefers things to error, the config.STRICT configuration can be used.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
ricecooker/classes/nodes.py (1)

898-924: 🛠️ Refactor suggestion

Comprehensive URI processing with metadata extraction.

The _process_uri method:

  1. Executes the pipeline on the URI (with cache control)
  2. Creates File objects from the pipeline results and adds them to the node
  3. Extracts and applies content_node_metadata to the node attributes
  4. Handles both file and content metadata elegantly
  5. Validates that inferred kind matches the node class kind

This is the core of the URI-based content node creation functionality.

The error handling in _process_uri could be improved. Currently, when an exception occurs during pipeline execution, the method catches the exception, logs an error, and returns None without setting any error state on the node itself, which might lead to unclear validation errors later.

Consider setting an error flag or message on the node:

    try:
        file_metadata_list = self.pipeline.execute(
            self.uri, skip_cache=config.UPDATE
        )
    except (InvalidFileException, ExpectedFileException) as e:
        config.LOGGER.error(f"Error processing path: {self.uri} with error: {e}")
+       self._uri_error = str(e)  # Store the error for later validation
        return None

Then check for this flag in validation:

    if self.uri:
+       if hasattr(self, '_uri_error') and self._uri_error:
+           self._validate_values(True, f"Error processing URI: {self.uri} - {self._uri_error}")
        self._validate_uri()
🧹 Nitpick comments (4)
tests/test_tree.py (2)

252-252: Fix assertion method to use assert with a message rather than assert False.

The test uses assert False which would be removed when Python is run with optimization (-O flag). Replace with a proper assertion.

-        assert False, "Invalid topic should fail validation"
+        assert 0, "Invalid topic should fail validation"  # Or better yet, use pytest.fail("message")

1202-1204: Simplify test with single with statement for multiple context managers.

The test uses nested with statements, which could be simplified to a single with statement with multiple contexts.

-    with patch("builtins.open", mocked_open), patch(
-        "ricecooker.config.get_storage_path", return_value="/tmp/test_file.mp4"
-    ), patch("ricecooker.config.SESSION.post", return_value=mock_response):
+    with patch("builtins.open", mocked_open), \
+         patch("ricecooker.config.get_storage_path", return_value="/tmp/test_file.mp4"), \
+         patch("ricecooker.config.SESSION.post", return_value=mock_response):
🧰 Tools
🪛 Ruff (0.8.2)

1202-1209: Use a single with statement with multiple contexts instead of nested with statements

(SIM117)

ricecooker/classes/nodes.py (2)

98-98: Fix mutable default argument in Node constructor.

The node_modifications={} parameter uses a mutable dict as a default argument, which can lead to unexpected behavior if modified across different instances.

-    node_modifications={},
+    node_modifications=None,

And then inside the method:

-        # save modifications passed in by csv
-        self.node_modifications = node_modifications
+        # save modifications passed in by csv
+        self.node_modifications = node_modifications or {}
🧰 Tools
🪛 Ruff (0.8.2)

98-98: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


1029-1030: Simplify multiple isinstance checks with tuple of types.

The code checks if a file is an instance of either SubtitleFile or YouTubeSubtitleFile using two separate isinstance calls, which can be simplified.

-            if isinstance(file, SubtitleFile) or isinstance(file, YouTubeSubtitleFile):
+            if isinstance(file, (SubtitleFile, YouTubeSubtitleFile)):
🧰 Tools
🪛 Ruff (0.8.2)

1029-1029: Multiple isinstance calls for file, merge into a single call

Merge isinstance calls for file

(SIM101)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7fc41e0 and 0978f3b.

📒 Files selected for processing (15)
  • ricecooker/classes/files.py (9 hunks)
  • ricecooker/classes/licenses.py (1 hunks)
  • ricecooker/classes/nodes.py (25 hunks)
  • ricecooker/commands.py (1 hunks)
  • ricecooker/managers/tree.py (7 hunks)
  • ricecooker/utils/pipeline/__init__.py (2 hunks)
  • ricecooker/utils/pipeline/context.py (2 hunks)
  • ricecooker/utils/pipeline/extract_metadata.py (5 hunks)
  • ricecooker/utils/pipeline/file_handler.py (5 hunks)
  • ricecooker/utils/pipeline/transfer.py (1 hunks)
  • tests/test_data.py (1 hunks)
  • tests/test_exercises.py (2 hunks)
  • tests/test_files.py (1 hunks)
  • tests/test_thumbnails.py (2 hunks)
  • tests/test_tree.py (20 hunks)
🚧 Files skipped from review as they are similar to previous changes (12)
  • tests/test_files.py
  • ricecooker/classes/licenses.py
  • ricecooker/commands.py
  • tests/test_exercises.py
  • ricecooker/utils/pipeline/init.py
  • ricecooker/utils/pipeline/extract_metadata.py
  • ricecooker/utils/pipeline/transfer.py
  • ricecooker/utils/pipeline/context.py
  • tests/test_data.py
  • tests/test_thumbnails.py
  • ricecooker/utils/pipeline/file_handler.py
  • ricecooker/classes/files.py
🧰 Additional context used
🧬 Code Graph Analysis (2)
ricecooker/managers/tree.py (2)
ricecooker/exceptions.py (1)
  • InvalidNodeException (60-64)
ricecooker/classes/nodes.py (5)
  • process_files (261-285)
  • process_files (925-933)
  • process_files (1249-1263)
  • get_node_id (590-591)
  • get_node_id (674-682)
tests/test_tree.py (4)
ricecooker/classes/nodes.py (19)
  • ContentNode (795-975)
  • Node (65-558)
  • TreeNode (644-768)
  • pipeline (831-834)
  • validate (527-530)
  • infer_learning_activities (376-379)
  • process_files (261-285)
  • process_files (925-933)
  • process_files (1249-1263)
  • gather_ancestor_metadata (557-558)
  • gather_ancestor_metadata (762-768)
  • set_metadata_from_ancestors (972-975)
  • to_dict (186-190)
  • to_dict (606-630)
  • to_dict (726-760)
  • to_dict (935-970)
  • to_dict (1552-1565)
  • get_node_id (590-591)
  • get_node_id (674-682)
ricecooker/exceptions.py (1)
  • InvalidNodeException (60-64)
ricecooker/managers/tree.py (7)
  • ChannelManager (19-448)
  • InsufficientStorageException (13-16)
  • validate (34-47)
  • validate_node (49-58)
  • process_node (84-108)
  • add_nodes (325-407)
  • do_file_upload (161-211)
tests/conftest.py (7)
  • channel (149-158)
  • title (250-251)
  • video_file (359-370)
  • audio_file (482-494)
  • document_file (558-568)
  • epub_file (614-621)
  • html_file (672-684)
🪛 Ruff (0.8.2)
tests/test_tree.py

253-253: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)


1202-1209: Use a single with statement with multiple contexts instead of nested with statements

(SIM117)

ricecooker/classes/nodes.py

98-98: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


1029-1029: Multiple isinstance calls for file, merge into a single call

Merge isinstance calls for file

(SIM101)

⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: Python unit tests (ubuntu-latest, 3.9)
  • GitHub Check: Python unit tests (windows-latest, 3.12)
  • GitHub Check: Python unit tests (windows-latest, 3.11)
  • GitHub Check: Python unit tests (windows-latest, 3.10)
  • GitHub Check: Python unit tests (windows-latest, 3.9)
🔇 Additional comments (33)
ricecooker/managers/tree.py (6)

60-60: Simplified method signature removes redundant parameter.

The process_tree method no longer requires the channel_node parameter, as it now uses the class instance's self.channel attribute consistently. This makes the API cleaner and avoids potential confusion when the passed channel differs from the internal one.


52-58: Improved error handling for node validation.

The code now explicitly catches InvalidNodeException during validation, setting a descriptive _error attribute on nodes and logging warnings when not in strict mode. This provides better error traceability and prevents valid nodes from being discarded due to siblings' failures.


292-302: Simplified check_failed method removes unused parameter.

The check_failed method signature has been simplified by removing the unused print_warning parameter. The method now always logs warnings for failed nodes, showing both error messages and truncated content consistently.


364-378: Improved error tracking with more specific failure identification.

The add_nodes method now:

  1. Uses the child's node ID (instead of the parent's root ID) as the key in failed_node_builds
  2. Includes the child's _error message in the error aggregation
  3. Creates more specific error entries for individual nodes

This makes error tracking more precise and makes it easier to identify which specific nodes failed and why.


279-279: Removed automatic retry for failed node builds.

The call to reattempt_failed has been removed from the upload_tree method, streamlining the error handling flow. Failed nodes are now simply reported without automatic retry attempts, making the behavior more predictable and explicit.


89-97:

Details

❓ Verification inconclusive

Enhanced error handling during node processing.

The process_node method now catches both InvalidNodeException and ValueError during file processing, setting an _error attribute on the node and logging the error. This approach maintains consistent error handling patterns with validate_node and prevents pipeline failures from causing entire tree processing to fail unnecessarily.

Log full traceback and normalize the exception surface:


🏁 Script executed:

#!/bin/bash
# Verify whether process_node could benefit from full traceback logging

# Search for other warning logging patterns in the codebase
rg "LOGGER\.warning.*exc_info" -A 2 -B 2
echo "----------------"
# Check whether InvalidNodeException is used to wrap ValueError elsewhere
rg "raise InvalidNodeException.*ValueError"

Length of output: 129


🏁 Script executed:

#!/bin/bash
# Search for existing use of logger.exception to understand conventions
rg "LOGGER\.exception" -A 2 -B 2
echo "----------------"
# Search for any use of exc_info across all logging calls
rg "exc_info=" -A 2 -B 2

Length of output: 100


Consider capturing full exception tracebacks during node processing

Currently, any InvalidNodeException or ValueError is caught and only the error message is logged:

try:
    node.process_files()
except (InvalidNodeException, ValueError) as e:
    if config.STRICT:
        raise
    else:
        node._error = str(e)
        config.LOGGER.warning(node._error)

No existing uses of LOGGER.exception or exc_info=True were found in the codebase—please verify your project’s logging conventions. To aid debugging, you might log the full traceback, for example:

  • Switch to config.LOGGER.exception("Error processing node %s", node.id)
  • Or add exc_info=True:
    config.LOGGER.warning(node._error, exc_info=True)

File needing attention:

  • ricecooker/managers/tree.py (around lines 89–97)

Please confirm and align with your preferred error‐logging pattern.

tests/test_tree.py (13)

248-254: API change: Using consistent validation method across tests.

Tests have been updated to use validate() instead of the deprecated validate_tree() method. This change aligns with the refactoring of validation logic in the node classes.

🧰 Tools
🪛 Ruff (0.8.2)

253-253: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)


694-695: Updated test to use new inference method.

The test now explicitly calls infer_learning_activities() before asserting the default activities, reflecting the new design where learning activities are inferred on demand rather than automatically assigned during initialization.


708-775: New tests for automatic content kind and learning activity inference.

These tests verify the core new functionality of inferring content kind and learning activities from files based on URIs. The tests cover all major file types:

  • Video (lines 708-720)
  • Audio (lines 722-734)
  • Document (lines 736-748)
  • ePub (lines 750-762)
  • HTML5 (lines 764-776)

Each test confirms that the correct content kind and learning activities are inferred after processing a file with the specified URI.


778-809: Thorough testing of metadata inheritance functionality.

These tests verify the base metadata gathering functionality with different node configurations:

  • Empty nodes with no metadata (lines 778-781)
  • TreeNodes with empty parent nodes (lines 782-787)
  • Helper function for metadata comparison (lines 789-802)

The tests establish a foundation for more complex metadata inheritance testing that follows.


804-862: Comprehensive tests for metadata gathering from ancestor nodes.

These tests verify that:

  1. TreeNodes correctly gather metadata from parent nodes (lines 804-826)
  2. TreeNodes properly combine their own metadata with parent metadata (lines 828-862)
    • Tests that child values override parent values where appropriate
    • Tests that non-existing values are inherited from parent
    • Tests that hierarchical lists like grade_levels are properly combined

This ensures that metadata propagation and inheritance work as expected in the node hierarchy.


864-889: Tests for hierarchical metadata merging logic.

The test verifies that when parents have broader subjects and children have more specific ones within the same hierarchy, only the more specific values are kept (removing redundant broader categories). This ensures optimal metadata representation without duplication.


891-979: Tests for ContentNode metadata inheritance methods.

These test cases verify that:

  1. ContentNodes properly inherit simple fields like language and license (lines 891-916)
  2. ContentNodes correctly inherit label fields like grade_levels (lines 918-941)
  3. ContentNodes don't override existing values with ancestor values (lines 943-979)
  4. Hierarchical label inheritance works by removing broader categories when more specific ones exist (lines 981-1010)

This ensures that metadata inheritance follows the expected precedence rules and maintains hierarchical relationships.


1015-1066: Tests for ChannelManager's error handling during validation and processing.

These tests verify that:

  1. The manager sets the _error attribute when InvalidNodeException occurs during validation (lines 1015-1032)
  2. The manager handles both InvalidNodeException and ValueError during node processing (lines 1034-1066)

The tests mock the node's behavior to simulate exceptions and verify that errors are properly captured and handled.


1068-1113: Tests for ChannelManager's add_nodes method error handling.

This test verifies that:

  1. Invalid nodes are skipped during the add_nodes process
  2. Skipped nodes are properly registered in the failed_node_builds dictionary with their error messages
  3. Invalid nodes' to_dict() method is not called, preventing errors during serialization

The test uses mocks to simulate a parent node with both valid and invalid children.


1115-1142: Test for handling ConnectionError during node addition.

This test verifies that connection errors during the upload process are properly caught and registered in the failed_node_builds dictionary with the parent node as the context.


1144-1174: Test for handling server errors during node addition.

This test verifies that HTTP error responses from the server are properly caught and registered in the failed_node_builds dictionary, including the error reason and response content.


1176-1211: Test for handling storage limitations during file upload.

The test verifies that the manager raises an InsufficientStorageException when the server responds with HTTP 412, indicating that the user has run out of storage space.

🧰 Tools
🪛 Ruff (0.8.2)

1202-1209: Use a single with statement with multiple contexts instead of nested with statements

(SIM117)


1213-1270: Test for skipping both invalid nodes and nodes with failed files.

This comprehensive test verifies that:

  1. The manager identifies and skips both invalid nodes and nodes with failed files
  2. Both types of failed nodes are registered in the failed_node_builds dictionary
  3. Only valid nodes with successful files are included in the payload

The test uses mocks to create different types of nodes with various validation and file statuses.

ricecooker/classes/nodes.py (14)

48-62: Well-defined constants for metadata inheritance.

The global constants inheritable_simple_value_fields and inheritable_metadata_label_fields clearly define which fields can be inherited through the node hierarchy. This makes the inheritance rules explicit and maintainable.


68-144: Comprehensive refactoring of Node constructor with enhanced metadata support.

The Node constructor has been expanded to support a comprehensive set of metadata fields including:

  • Basic fields (title, language, description)
  • Attribution fields (author, copyright_holder, aggregator, provider)
  • Classification fields (grade_levels, resource_types, learning_activities, etc.)
  • License handling

The constructor now requires a source_id parameter (line 102), which improves the consistency of node creation.

🧰 Tools
🪛 Ruff (0.8.2)

98-98: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


372-374: Added validation helper improves error message consistency.

The _validate_values helper method standardizes validation error handling by:

  1. Taking an assertion condition and error message
  2. Raising InvalidNodeException with a consistent format that includes the node identifier
  3. Centralizing validation logic to avoid duplication

This pattern is used extensively throughout the validation methods, making the code more maintainable and error messages more consistent.


391-524: Comprehensive validation overhaul using new helper method.

The _validate method has been thoroughly refactored to:

  1. Use the new _validate_values helper for consistent error reporting
  2. Validate all metadata fields comprehensively
  3. Validate category and label fields against allowed values
  4. Check for string fields, list fields, and non-empty required fields

This provides much stronger validation throughout the node hierarchy and better error messages when validation fails.


532-556: Well-designed metadata inheritance with hierarchical handling.

The get_metadata_dict method implements intelligent metadata merging:

  1. Simple fields (language, license, etc.) are copied if they have values
  2. Hierarchical fields (grade_levels, categories, etc.) are merged with special handling
  3. More specific categories automatically replace their broader parents (lines 547-553)

This prevents metadata bloat and ensures the most specific categories are preserved when merging.


557-559: Simple base implementation of metadata gathering.

The base gather_ancestor_metadata method returns just the node's own metadata. This method is overridden in TreeNode to include parent metadata. This design follows the principle that each class only knows about its own structure.


762-768: TreeNode's metadata gathering checks for parent existence.

The gather_ancestor_metadata method in TreeNode raises an exception if parent is missing, preventing silent failures when trying to gather metadata from a non-existent parent. This helps catch configuration errors early.


818-835: ContentNode updated to support URI-based initialization.

The ContentNode constructor now accepts:

  1. A uri parameter for the primary content location
  2. A pipeline parameter for handling file processing
  3. A _files_processed flag to track processing state

The class provides a pipeline property with a fallback to the global pipeline, ensuring a pipeline is always available when needed.


844-852: Clear validation of URI compatibility with pipeline.

The _validate_uri method checks if the pipeline can handle the provided URI, raising a clear exception if not. This prevents issues where a URI might be provided that can't be processed.


854-896: Refactored validation logic with preset-based verification.

The ContentNode validation has been updated to:

  1. Verify that required file presets exist instead of file types (lines 862-882)
  2. Check for default (non-supplementary) files (lines 883-893)
  3. Validate URI compatibility when a URI is provided (lines 894-895)

This matches the PR objective of shifting validation from file types to presets.


925-933: Enhanced file processing with URI support and validation.

The process_files method now:

  1. Processes the URI first if provided (line 927)
  2. Calls the parent class's processing for regular files (line 928)
  3. Sets a flag to indicate files have been processed (line 931)
  4. Validates the node to ensure metadata is valid (line 932)

This ensures both URI-based and regular file-based content nodes follow the same validation path.


972-975: Simple but effective metadata inheritance implementation.

The set_metadata_from_ancestors method:

  1. Gathers metadata from ancestors
  2. Sets each field from the gathered metadata on the node

This allows content nodes to inherit metadata from their ancestors at the appropriate time in the node lifecycle.


1019-1045: Simplified VideoNode validation for subtitle files.

The VideoNode validation has been simplified to:

  1. Filter duplicate subtitle files by language code
  2. Provide clear warnings when duplicates are found
  3. Preserve the first subtitle file for each language
  4. Call the parent validation method with the filtered files

This makes the validation cleaner and more maintainable while preserving the required functionality.

The code could be simplified by using a dictionary comprehension for the language filtering:

-        new_files = []
-        language_codes_seen = set()
-        for file in self.files:
-            if isinstance(file, SubtitleFile) or isinstance(file, YouTubeSubtitleFile):
-                language_code = file.language
-                if language_code not in language_codes_seen:
-                    new_files.append(file)
-                    language_codes_seen.add(language_code)
-                else:
-                    file_info = file.path if hasattr(file, "path") else file.youtube_url
-                    config.LOGGER.warning(
-                        "Skipping duplicate subs for "
-                        + language_code
-                        + " from "
-                        + file_info
-                    )
-            else:
-                new_files.append(file)
+        # Group subtitle files by language code, keeping the first one for each language
+        language_to_subtitle = {}
+        new_files = []
+        
+        for file in self.files:
+            if isinstance(file, (SubtitleFile, YouTubeSubtitleFile)):
+                language_code = file.language
+                if language_code not in language_to_subtitle:
+                    language_to_subtitle[language_code] = file
+                    new_files.append(file)
+                else:
+                    file_info = file.path if hasattr(file, "path") else file.youtube_url
+                    config.LOGGER.warning(f"Skipping duplicate subs for {language_code} from {file_info}")
+            else:
+                new_files.append(file)
🧰 Tools
🪛 Ruff (0.8.2)

1029-1029: Multiple isinstance calls for file, merge into a single call

Merge isinstance calls for file

(SIM101)


1297-1339: Refactored ExerciseNode validation using helper method.

The ExerciseNode validation has been refactored to:

  1. Use the new _validate_values helper for consistent error reporting
  2. Check questions, mastery model, and M/N parameters
  3. Call process_exercise_data before validation to ensure parameters are set correctly

This makes the validation more robust and consistent with the rest of the code.

Copy link
Member

@marcellamaki marcellamaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes here look aligned to the coderabbit recommendations and the approach we discussed. I think this is good to go

@rtibbles rtibbles merged commit 82625e9 into learningequality:develop May 13, 2025
17 checks passed
@rtibbles rtibbles deleted the automatic_for_the_people branch May 13, 2025 15:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants