
Generate a content manifest file with exportcontent #9460

Merged

Conversation

dylanmccall
Contributor

@dylanmccall dylanmccall commented May 24, 2022

Summary

This change updates Kolibri's exportcontent command to generate a manifest file describing the intentional content selection associated with the exported data files. This is distinct from the user's provided node_ids and exclude_node_ids, because it needs to be reproducible on another Kolibri instance with a different set of available content. To achieve this, I split get_import_export_data into a few different functions, and added another function (get_content_nodes_selectors) which finds an optimal list of include nodes to describe the content being exported.

That list of nodes is what is written to a manifest.json file alongside the exported content. For simplicity, it is possible for the same channel ID to be repeated multiple times, with different sets of include nodes and exclude nodes, in the same manifest.json file. This will only happen if the user runs exportcontent multiple times for the same channel.
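For a rough idea of the file's shape, here is a simplified sketch (based on the concrete example that appears later in this conversation, not a specification; the exact format is defined by the code):

{
    "channels": [
        {
            "id": "<channel id>",
            "version": 10,
            "include_node_ids": ["<node id>", "<node id>"]
        }
    ],
    "channel_list_hash": "<hash of the channel list>"
}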

This pull request does not include the importcontent counterpart to this work, which is in #9467.

References

Reviewer guidance

Testing checklist

  • Contributor has fully tested the PR manually
  • If there are any front-end changes, before/after screenshots are included
  • Critical user journeys are covered by Gherkin stories
  • Critical and brittle code paths are covered by unit tests

PR process

  • PR has the correct target branch and milestone
  • PR has 'needs review' or 'work-in-progress' label
  • If PR is ready for review, a reviewer has been added. (Don't use 'Assignees')
  • If this is an important user-facing change, PR or related issue has a 'changelog' label
  • If this includes an internal dependency change, a link to the diff is provided

Reviewer checklist

  • Automated test coverage is satisfactory
  • PR is fully functional
  • PR has been tested for accessibility regressions
  • External dependency files were updated if necessary (yarn and pip)
  • Documentation is updated
  • Contributor is in AUTHORS.md

@dylanmccall
Contributor Author

dylanmccall commented May 26, 2022

Getting this to play nicely with exportcontent is a bit trickier than it sounds at first, because we need to deal with the tool being run multiple times: we need to generate a sane set of include_nodes / exclude_nodes and then update that set as needed. After some poking and prodding today, the solution I have for now is to append a new {channel id, include nodes, exclude nodes} entry to the manifest file for every run. When I'm working on importcontent I think I'll be able to add a mechanism that extends the existing selection for a channel as long as it is the same version, but I think it's reasonable to expect the channel to appear multiple times in some cases.

@dylanmccall dylanmccall force-pushed the exportcontent-manifest-file branch 3 times, most recently from d459fc5 to 35e01c1 Compare May 28, 2022 00:15
@dylanmccall dylanmccall marked this pull request as ready for review May 28, 2022 00:15
@dylanmccall dylanmccall force-pushed the exportcontent-manifest-file branch 2 times, most recently from 30f4533 to e695c66 Compare May 28, 2022 00:25
@dylanmccall dylanmccall changed the title WIP: Generate a content manifest file with exportcontent Generate a content manifest file with exportcontent May 28, 2022
@dylanmccall dylanmccall force-pushed the exportcontent-manifest-file branch 4 times, most recently from 37d5bf6 to db8c569 Compare May 30, 2022 22:10
danigm added a commit to endlessm/kolibri-explore-plugin that referenced this pull request May 31, 2022
This is a first step towards importing content from a USB drive if one is present, instead of from the network, importing just the metadata rather than the content itself.

The list of channels is fetched from the network right now, while waiting for the metadata.json definition that could be present in the KOLIBRI_DATA/content folder:
learningequality/kolibri#9460

https://phabricator.endlessm.com/T33526
    files_to_download = list(queried_file_objects.values())

    total_bytes_to_transfer = sum(map(lambda x: x["file_size"] or 0, files_to_download))
    return number_of_resources, files_to_download, total_bytes_to_transfer


def get_content_nodes_selectors(channel_id, nodes_queries_list):
Member

I feel like there's a more efficient way to do this with a hierarchical aggregation, similar to how we do our annotation logic. I will try to sketch out what that might look like and get back to you here.

Contributor Author

Yeah, I think you're right that there's probably a better way. I opted for the simple-to-read, procedural approach since it's part of a one-off task that users already understand will take a length of time measured in minutes (so a few seconds generating a manifest isn't a big deal, as long as we aren't pegging their CPU to do it). I think it's more important here that the code is clearly understandable, and somebody can optimize it once it has been proved out.

But fetching all the node IDs into memory is certainly ugly. For Khan Academy, for example, that's a really big query and about 800 kB of memory for our set of available_node_ids. I'm sure we could do something clever involving the lft / rght values of leaf nodes that are excluded, bubbling up from them to select the correct topics. There's a danger that it could turn into a lot of queries for a problem where we're really just shuffling primary keys around. So while it would involve less duplicate data in memory, I don't think we're guaranteed that it would be more efficient, particularly for the sort of deployment where someone would be using exportcontent. Worth exploring in a branch?

Member

We do something very similar in our importability annotation https://github.com/learningequality/kolibri/blob/develop/kolibri/core/content/utils/importability_annotation.py#L36 - the main difference here would be that instead of annotating based on some children being available, we would annotate based on all children being available. Otherwise the code would be nearly identical. We can then descend down the tree in a read query until we have a set of ids that are marked as 'all' available.

The main trick here is that the write and read all happen within a transaction, and then we roll it back at the end.
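A very rough sketch of the descend-until-fully-available idea, written as plain Python over an in-memory tree rather than the annotation-plus-rollback approach described above (the node attributes and helper names are illustrative, not Kolibri's actual models):

def all_leaves_available(node):
    # A leaf counts as available if its own flag is set; a topic counts as
    # fully available only if every leaf underneath it is available.
    if not node.children:
        return node.available
    return all(all_leaves_available(child) for child in node.children)


def find_selectors(node):
    # Return the smallest set of nodes that covers every available leaf
    # under `node`: stop descending as soon as a subtree is fully available.
    if not node.children:
        return [node] if node.available else []
    if all_leaves_available(node):
        return [node]
    selectors = []
    for child in node.children:
        selectors.extend(find_selectors(child))
    return selectors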

Comment on lines 37 to 29
try:
    FileNotFoundError
except NameError:
    FileNotFoundError = IOError
Contributor

👍 I guess this is to be compatible with older Python.

@dylanmccall
Contributor Author

dylanmccall commented Jun 22, 2022

Okay, I added a test suite here, which should give us a nice way to measure performance, although the current test fixture isn't quite doing that. (It's a very small tree of content). In general, when I say this is slow I really just mean "it isn't trying very hard to be fast," but (anecdotally) it is acceptable with the channels I've tried it against. In the worst case I've had it spend ~1.5 seconds generating a manifest file for Khan Academy, which is a particularly large channel. It could be faster with some added cleverness, but I don't think we need to block on it here - and the nice thing is the test suite should make it easier to add such cleverness in the future without breaking existing behaviour :)

Here's a handy little script to play with this in isolation:

from kolibri.utils.main import initialize
initialize(skip_update=True)

import json

# Constants to know:
# Khan Academy (c9d7f950ab6b5a1199e3d6c10d7f0103)

from kolibri.core.content.utils.import_export_content import get_import_export_nodes
from kolibri.core.content.utils.import_export_content import get_content_nodes_selectors

KHAN_ACADEMY="c9d7f950ab6b5a1199e3d6c10d7f0103"

nodes_segments = get_import_export_nodes(KHAN_ACADEMY, available=True)
content_selection = get_content_nodes_selectors(KHAN_ACADEMY, nodes_segments)

print("Content selection:", json.dumps(content_selection, indent=4))

@dylanmccall dylanmccall force-pushed the exportcontent-manifest-file branch 2 times, most recently from ddc6709 to 1cb17ea Compare June 23, 2022 21:03
@dylanmccall dylanmccall force-pushed the exportcontent-manifest-file branch 2 times, most recently from 704c71a to 933d5eb Compare July 5, 2022 20:57
@dylanmccall
Contributor Author

dylanmccall commented Jul 5, 2022

I did a couple tweaks here, moving some stuff over from #9467:

  • In get_import_export_nodes, differentiate between node_ids=None (all node IDs) and node_ids=[] (no node IDs). This required changes to a bunch of tests which were using an empty array to mean "all the nodes from this channel", but I think it makes sense. Kolibri's annotation code appears to be making the same distinction: https://github.com/learningequality/kolibri/blob/develop/kolibri/core/content/utils/annotation.py#L207-L228.
  • When exporting all nodes in a channel, include_node_ids in the manifest file will be set to [channel_id]. In a previous version, I had it output an empty list instead.
  • With these changes together, we have the ability to clearly express both a channel with no content and a channel with all of the content in our manifest files. In theory, that means we can have exportchannel generate a manifest file as well.
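To illustrate the distinction described in the first bullet above, here is a minimal sketch (purely illustrative, not the actual get_import_export_nodes code):

def select_nodes(all_channel_nodes, node_ids=None):
    # node_ids=None means "no filter was given": select every node.
    if node_ids is None:
        return list(all_channel_nodes)
    # node_ids=[] means "an explicit empty selection": select nothing.
    wanted = set(node_ids)
    return [node for node in all_channel_nodes if node.id in wanted]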

@dylanmccall dylanmccall requested a review from rtibbles July 5, 2022 21:25
@rtibbles
Member

rtibbles commented Jul 5, 2022

In get_import_export_nodes, differentiate between node_ids=None (all node IDs) and node_ids=[] (no node IDs). This required changes to a bunch of tests which were using an empty array to mean "all the nodes from this channel", but I think it makes sense. Kolibri's annotation code appears to be making the same distinction: https://github.com/learningequality/kolibri/blob/develop/kolibri/core/content/utils/annotation.py#L207-L228.

I think that's generally helpful, and may be the cause of some behavioural errors I've witnessed elsewhere in Kolibri develop.

When exporting all nodes in a channel, include_node_ids in the manifest file will be set to [channel_id]. In a previous version, I had it output an empty list instead.

I think for precision this should be [root_node_id] - they are frequently identical, but there are some cases where they are not.

With these changes together, we have the ability to clearly express both a channel with no content and a channel with all of the content in our manifest files. In theory, that means we can have exportchannel generate a manifest file as well.

This would also seem fine.

@dylanmccall
Contributor Author

dylanmccall commented Jul 5, 2022

I think for precision this should be [root_node_id] - they are frequently identical, but there are some cases where they are not.

Ah, good point. That is in fact what it's doing - I didn't realize they could be different, so that's useful to know! In a previous version, it was special-casing this by saying "if the list we end up with is [channel_id] then make it []", which was silly, so now it just isn't doing that special case :)


channels = manifest_data.setdefault("channels", [])

# TODO: If the channel is already listed, it would be nice to merge the
Member

So, in the case where the channel is already in the manifest, what happens? Do we end up with two entries for the same channel in the manifest?

Contributor Author

Yeah, exactly. So it could end up with a manifest file like this:

{
    "channels": [
        {
            "id": "e409b964366a59219c148f2aaa741f43",
            "version": 10,
            "include_node_ids": [
                "6277aa0c44235435acdc8a9ed98f466b"
            ],
            "exclude_node_ids": []
        },
        {
            "id": "e409b964366a59219c148f2aaa741f43",
            "version": 10,
            "include_node_ids": [
                "e409b964366a59219c148f2aaa741f43"
            ],
            "exclude_node_ids": []
        }
    ],
    "channel_list_hash": "a2c59ee9bab4586a78e56ca8cf13a912"
}

Where the first time around, I exported just "6277aa0c44235435acdc8a9ed98f466b", and the second time I exported the entire channel.

Note that identical entries are skipped, but if we have a different selection for the same channel, each one gets a new entry. It would be better if it merged them more exhaustively: for instance, the second entry is a superset of the first entry. So, I added a TODO about that. We could do that by adding the existing entries to our list of selected nodes.

Come to think of it, that doesn't sound too difficult to implement, so I think I'll give it a shot right now over in #9467 :)
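For what it's worth, the merge described above could look something like this sketch (a hypothetical helper, ignoring exclude_node_ids for simplicity, and not the code in this PR):

def merge_channel_entries(channels):
    # Combine include_node_ids for entries that share a channel id and version.
    merged = {}
    for entry in channels:
        key = (entry["id"], entry["version"])
        merged.setdefault(key, set()).update(entry["include_node_ids"])
    return [
        {"id": channel_id, "version": version, "include_node_ids": sorted(node_ids)}
        for (channel_id, version), node_ids in merged.items()
    ]

A real implementation would also want to re-simplify the combined selection, for instance dropping the first entry's node once the second entry's root node already covers it, which is what the TODO is about.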

Member

Yes - I was wondering if we could just do the TODO now, and not leave it for later!

Contributor Author

I think so, indeed. Ran out of time today, but I'll let you know how it goes tomorrow!

Contributor Author

Okay, I added a commit with a couple things:

  • Removed exclude_node_ids from manifest.json. It isn't being used right now, and it makes our life easier if it isn't there. If we are only including nodes, we can trivially extend that set of nodes when a new export happens. Otherwise (if there is the possibility of nodes being excluded), we would have to expand the selection into a complete set of nodes and then simplify it again.
  • Added a content_manifest module to deal with this in one place, moving some file access and json parsing code out of import_export_content.py. It is responsible for reading and writing manifest files, as well as simplifying the provided set of node IDs.
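As a rough picture of the shape of that module, here is an illustrative skeleton (the method names and internal layout here are mine, not necessarily the ones in the final code):

import json


class ContentManifest(object):
    # Reads and writes manifest.json files, keeping one node selection per
    # (channel id, channel version) pair.

    def __init__(self):
        self._channels = {}  # (channel_id, channel_version) -> set of include node ids

    def read(self, path):
        with open(path, "r") as fp:
            manifest_data = json.load(fp)
        for channel in manifest_data.get("channels", []):
            key = (channel["id"], channel["version"])
            self._channels.setdefault(key, set()).update(channel["include_node_ids"])

    def write(self, path):
        channels = [
            {"id": channel_id, "version": version, "include_node_ids": sorted(node_ids)}
            for (channel_id, version), node_ids in self._channels.items()
        ]
        with open(path, "w") as fp:
            json.dump({"channels": channels}, fp, indent=4)

    def add_content_nodes(self, channel_id, channel_version, include_node_ids):
        key = (channel_id, channel_version)
        self._channels.setdefault(key, set()).update(include_node_ids)

    def get_include_node_ids(self, channel_id, channel_version):
        return sorted(self._channels.get((channel_id, channel_version), set()))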

Member

@rtibbles rtibbles left a comment

All seems reasonable on a read through, and the tests and generation time for KA give assurance that this is a reasonable approach.

One question about the merge case.

Member

@rtibbles rtibbles left a comment

A couple of questions - nothing that is completely blocking, but I think it would be good to handle malformed JSON in the manifest reading, and I wonder if optional filtering by available might be useful.

Beyond that, some docstrings on the methods of the ContentManifest class would be helpful - especially to distinguish the roles of different but somewhat similarly named methods.

    json_str = fp.read()
    if not json_str:
        return
    # Raises JSONDecodeError if the file is invalid
Member

Note in Python 2 this raises a ValueError, of which JSONDecodeError is a subclass.

Contributor Author

Oh yeah. I made this clearer along with my fix for the below comment, defining JSONDecodeError = ValueError as a fallback.

    if not json_str:
        return
    # Raises JSONDecodeError if the file is invalid
    manifest_data = json.loads(json_str)
Member

@rtibbles rtibbles Jul 13, 2022

Any reason not to just do a try...except here, catch any errors and just do json.load directly from the fp object?

This would also have the benefit of handling non-empty, but invalid JSON.

Contributor Author

Yeah, it's funny it ended up looking like this :b There doesn't seem to be any reason, so I switched it to use json.load(fp).
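Putting those two points together, the reading side presumably ends up looking something like this sketch (a simplified illustration with a made-up function name, not the exact diff):

import json

try:
    JSONDecodeError = json.JSONDecodeError
except AttributeError:
    # Python 2's json module has no JSONDecodeError; it raises plain ValueError.
    JSONDecodeError = ValueError


def read_manifest_data(fp):
    try:
        return json.load(fp)
    except JSONDecodeError:
        # Empty or malformed manifest file: behave as if there were no manifest.
        return None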

@@ -97,15 +97,56 @@ def filter_by_file_availability(nodes_to_include, channel_id, drive_id, peer_id)

def get_import_export_data( # noqa: C901
Member

Guessing we can probably remove the noqa from here now?

Contributor Author

👍 Indeed we can!

    if len(missing_leaf_nodes) == 0:
        include_node_ids.add(node.id)
    elif len(matching_leaf_nodes) > 0:
        available_nodes_queue.extend(node.children.all())
Member

Should we be optionally filtering the node children by an available flag here (similarly to how we do in the import export data function)?

Contributor Author

Oh, yeah, I think that's right. Assuming the availability of nodes in the tree is correct, this should be doing node.children.filter(available=True). Unavailable nodes aren't going to resolve to anything useful here, so it's just wasting memory. I should make sure the tests actually cover this properly first (might need to adopt that channel builder instead of the questionable database fixture :b), so I'll do that on Monday.

I see what you mean about passing an available= flag to the function to be consistent with the import export data function, although I should tweak some names in that case, and probably wrap my head around the use case there. It does sound like it would be convenient for something.
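Concretely, the change being discussed would be a one-line tweak to the loop quoted above, along these lines (sketch):

    if len(missing_leaf_nodes) == 0:
        include_node_ids.add(node.id)
    elif len(matching_leaf_nodes) > 0:
        # Only descend into children that are marked available; unavailable
        # subtrees can never contribute useful selectors here.
        available_nodes_queue.extend(node.children.filter(available=True))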

Member

Yeah, looking at how the function is used, I'm not sure there is a need to have optional available filtering. It's really just that it can be filtered by available=True to limit the search space.

The only question in my mind is.... does the manifest handle the case where a ContentNode has the files it needs in the exported directory, but is not available on the machine doing the exporting?

So, for example, if I exported the basic arithmetic topic from Khan Academy from my machine, and it generated a content manifest - I then pass that USB key to you, and you export more of Khan Academy from your machine, but you don't have basic arithmetic on yours - is that going to now be excluded from the updated manifest?

Would be good to make a test case to cover this scenario, I think.

Contributor Author

Okay, I added some test cases which explore that. Specifically:

  • If we add some new content nodes to a ContentManifest, include_node_ids is extended to include those nodes. Any nodes that were listed previously are still there.
  • If content nodes are added for one channel version, and then for a different version of the same channel, the two versions are kept separate; the lists of include_node_ids are not merged. (In practice, importcontent merges them in #9467, "Read options from a manifest file in importcontent", albeit with a warning message.)

This behaviour means a bit of duplication in some cases. That can be solved, but it adds some complexity - especially if we start thinking about channel versions - so I think just something to be aware of for the moment.
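In spirit, the new test cases look something like this illustrative pseudo-structure (hypothetical method names matching the ContentManifest sketch earlier in this thread, not the actual test code):

def test_extending_selection_keeps_existing_nodes():
    manifest = ContentManifest()
    manifest.add_content_nodes("channel-1", 10, ["node-a"])
    manifest.add_content_nodes("channel-1", 10, ["node-b"])
    # Nodes listed previously are still present after the selection is extended.
    assert set(manifest.get_include_node_ids("channel-1", 10)) == {"node-a", "node-b"}


def test_different_channel_versions_stay_separate():
    manifest = ContentManifest()
    manifest.add_content_nodes("channel-1", 10, ["node-a"])
    manifest.add_content_nodes("channel-1", 11, ["node-b"])
    # Each channel version keeps its own include_node_ids; they are not merged.
    assert set(manifest.get_include_node_ids("channel-1", 10)) == {"node-a"}
    assert set(manifest.get_include_node_ids("channel-1", 11)) == {"node-b"}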

Member

@rtibbles rtibbles left a comment

I think this is nearly ready to go, I just have one more question that I couldn't be sure of by looking at the code and test cases.


@dylanmccall dylanmccall force-pushed the exportcontent-manifest-file branch 2 times, most recently from cfa0be5 to 0787542 Compare August 4, 2022 15:13
Using the same list of nodes selected for export, we can generate a set of node IDs which work with importcontent.

This case indicates that all of the content from the channel is included. It is distinct from include_node_ids being an empty list, which suggests that none of the content is included.

When node_ids is set to an empty list, get_import_export_nodes will understand that to mean no nodes are selected, which is different from when node_ids is unset, in which case all nodes are selected.

Previously, node_ids defaulted to an empty list. With the previous change, that resulted in no nodes being exported by default.

This is designed similarly to ConfigParser, with a ContentManifest class that can read from or write to a JSON file. It is responsible for simplifying the list of nodes to include and merging additions to that list for the same channel ID and version.

These explore situations where we extend an existing list of node IDs. We need to ensure that data is not lost in the process.

In get_content_nodes_selectors, only step through child nodes that are marked as available.
@dylanmccall
Contributor Author

Funky build failure from the CI system, but the tests passed :)

@rtibbles
Member

Oh yes, the Buildkite error was due to a sysops issue with the builder.

@rtibbles rtibbles merged commit b4539a6 into learningequality:develop Aug 11, 2022