PLIP: Content import/export #1386

Closed
ebrehault opened this Issue Feb 15, 2016 · 41 comments

ebrehault commented Feb 15, 2016

Proposer : Eric Bréhault

Seconder : Dylan Jay, @hvelarde

Abstract

Provide a method to allow an editor to import content, export content or move content securely between sites.

Motivation

Exporting and importing content from a Plone site should be easy.
Tools like transmogrifier and collective.jsonify require a lot of programming skills.
This feature must be exposed through a simple UI.

There are a number of scenarios where content import/export is useful.

  • User has created a site locally and now wants to put it into production
  • User has a staging server where they want to make wide-ranging changes which get moved into production in a single transaction
    • also moving production content back into staging, i.e. resyncing
  • When upgrades are too hard you might want to start afresh and move just the content over
  • Cherry-pick parts of the site to export
  • Export selected metadata for auditing purposes
  • Export of content into an external system
  • Export of content or metadata to be filtered or modified before reimporting for a bulk update
  • Exporting from another source and importing content into Plone
  • Allowing editors to import content where they have permission to add content

The motivation is to try to cover all of this with a single UI; if that proves impossible, the primary use case is moving content between sites.

Assumptions

It might be relevant to re-use here the plone.restapi serialization.

Proposal & Implementation

Rationale

See #1373 for links to previous discussions

Scope

The import/export feature will be applicable to all the Plone 5 default content-types and any regular Dexterity type.
Note: Processing other content types or Dexterity types involving custom fields can be done by registering custom adapters.
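As a rough illustration of what such an adapter registration could look like, here is a minimal pure-Python sketch. The real implementation would presumably use zope.component adapters; all names here (FIELD_SERIALIZERS, register_serializer, serialize_field, the "geolocation" field type) are hypothetical.

```python
# Hypothetical sketch of a custom-field serializer registry; the real
# implementation would use zope.component adapters. All names here are
# illustrative only, not part of any existing Plone API.

FIELD_SERIALIZERS = {}

def register_serializer(field_type):
    """Register a serializer callable for a custom field type."""
    def decorator(func):
        FIELD_SERIALIZERS[field_type] = func
        return func
    return decorator

@register_serializer("geolocation")
def serialize_geolocation(value):
    # e.g. a custom (lat, lon) tuple becomes a plain dict
    return {"lat": value[0], "lon": value[1]}

def serialize_field(field_type, value):
    # Fall back to the value itself for simple, built-in field types
    serializer = FIELD_SERIALIZERS.get(field_type, lambda v: v)
    return serializer(value)
```

An add-on would register one serializer (and, symmetrically, a deserializer) per custom field type, and the export machinery would look them up by type.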

UI

The screen is accessible from the Actions menu on any folder.
The screen would propose two tabs: Basic and Advanced.

The Basic tab would allow exporting the entire folder and its tree of contents (the result is immediately downloaded), or importing contents by uploading a file (see Data format below).

The Advanced tab would allow the same, but would also let the user choose:

  • the types to export/import (or maybe a query),
  • whether to exclude:
    • comments,
    • assigned portlets,
    • workflow state,
    • sharing,
    • content rule assignments,
    • display view assignments,
  • the fields to export/import (based on schemas of locally addable types),
  • the export mode (browser upload/download, or read/write in the server ./var folder),
  • the data format,
  • the action to take if content already exists: stop, ignore, update, overwrite, or rename (import only),
  • dry-run mode (import only).

After import, a report is given of how many objects were created, updated, etc.
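The "content already exists" actions from the Advanced tab could behave roughly as in this sketch. It is illustrative only; ConflictError and the plain container mapping are hypothetical stand-ins for the real Plone API.

```python
# Hypothetical sketch of the "action if content already exists" option
# (stop / ignore / update / overwrite / rename). ConflictError and the
# container mapping are illustrative, not actual Plone APIs.

class ConflictError(Exception):
    pass

def resolve_existing(container, item_id, action):
    """Return (target_id, mode) for an incoming item."""
    if item_id not in container:
        return item_id, "create"
    if action == "stop":
        raise ConflictError(item_id)
    if action == "ignore":
        return item_id, "skip"
    if action in ("update", "overwrite"):
        return item_id, action
    if action == "rename":
        # find the first free id by appending a counter
        n = 1
        while f"{item_id}-{n}" in container:
            n += 1
        return f"{item_id}-{n}", "create"
    raise ValueError(f"unknown action: {action}")
```

For example, importing "page1" into a folder that already contains it with action "rename" would yield ("page1-1", "create"), while "ignore" would yield ("page1", "skip") and feed the post-import report.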

Data format

The default data format would be a .zip containing:

  • a single CSV file with all the metadata. When fields aren't text, numbers, or dates, JSON will be used, e.g.:

        path,title,description,authors_json,...
        "/folder1/page1","A page","blah, blah","[""djay"",""hector""]",...
  • a set of separate files containing the actual inner contents: attached files + rich text (as HTML files); folders are represented as folders.
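The CSV-with-embedded-JSON convention could be sketched as follows. This is a minimal illustration, assuming a hypothetical *_json column-name convention and helper names (rows_to_csv, csv_to_rows); it is not the PLIP's actual implementation.

```python
import csv
import io
import json

# Sketch of the proposed default format: scalar fields go directly into
# CSV columns; list/dict fields are JSON-encoded in columns whose names
# end in "_json". The naming convention is illustrative only.

def rows_to_csv(rows, fieldnames):
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        encoded = {}
        for key, value in row.items():
            if key.endswith("_json"):
                encoded[key] = json.dumps(value)  # complex value -> JSON text
            else:
                encoded[key] = value
        writer.writerow(encoded)
    return out.getvalue()

def csv_to_rows(text):
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for key in row:
            if key.endswith("_json"):
                row[key] = json.loads(row[key])  # JSON text -> complex value
        rows.append(row)
    return rows
```

A spreadsheet user can still edit the scalar columns freely; only the *_json columns need to stay valid JSON for a lossless round trip.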

The Advanced tab will allow choosing a pure JSON format instead of the CSV format. It will be a .zip file containing:

  • one file per object containing metadata + rich text fields,
  • attached files as separate files.

If we choose to use the server ./var folder instead of upload/download, the files are not zipped.

Note: we propose to use CSV as a default format because standard users are more likely to open/edit/manipulate CSV files rather than JSON.

Security

By default the corresponding permission will be assigned to Managers only.

As data can be exposed and manipulated in transit when uploading or downloading contents (see Risks), we just propose to add the following warning:
"If you choose to upload/download exported contents, be aware your data can be exposed and manipulated in transit. For a more secure procedure, prefer server local folder import/export mode."

Implementation details

Import/export processing must (or at least should) be done asynchronously.

Deliverables

  • a new module named plone.importexport to implement the import and export core mechanism
  • a new version of Products.CMFPlone providing the needed control panels
  • documentation (note: the documentation will explain how to implement an import/export adapter in add-ons)

Risks

  • To export all the data some internal data structures could be exposed and manipulated in transit.
    • could be mitigated by encrypting the data with a key from the target site before export.
    • could allow only managers to do full import, and lower users to only import fields they can normally edit.
  • It might be possible for the data to end up in an inconsistent state if manipulated. Validation may rely on certain add forms and the order of input.
  • The order of creating content can lead to unexpected results if validation involves relationships between objects.
  • Very large exports or imports can be expensive:
    • Might need a way to restrict by user or size
    • Might need to support resumable uploads
  • Ops might like to prevent data being dumped or accessed from the server.
  • Not allowing uploading of partial metadata for bulk updates means other plugins would be needed to handle this use case, and those would also be labeled import/export.
  • A data format like JSON makes it harder for non-technical users to run reports on content metadata, filter or change content before upload, or take data out of other systems and import it as content.
    • could use a combination of CSV and JSON (such as data format 4) to allow some metadata to be manipulated using a spreadsheet, while still supporting complex data structures.

Alternatives

Data format

  1. Zip of json files, one per object
  2. Single jsonlines file
  3. Single json file (compatible with collective.jsonify)
  4. Zip of files with primary field data (images, html etc) and metadata in single CSV file. Where fields aren't text, numbers or dates, json will be used.
  5. Zip of files with primary field data (images, html etc) and metadata stored in a similarly named .json file
  6. Zip of files with primary field data and metadata stored as RFC822 marshalling, compatible with current GS.

Participants

ebrehault added this to the Future milestone Feb 15, 2016

hvelarde commented Feb 15, 2016

+1

djay commented Feb 15, 2016

I edited it to include more motivations, risks and some alternative data formats and UI options

hvelarde commented Feb 15, 2016

@djay I think it was clearer before; by including so much information and so many corner cases we put the initial implementation at risk.

djay commented Feb 16, 2016

@hvelarde by not including the corner cases we risk implementing something half as useful. I am trying to make it clearer while still showing that risk.

ebrehault commented Feb 16, 2016

@djay @hvelarde I edited to describe the UI (2 tabs: Basic / Advanced) and the data format.

seanupton commented Feb 16, 2016

+1 on preserving portlet assignment data and comments;
+1 on zip file from export any arbitrary content tree, not just site root;
+1 on pure JSON, with primary field (blob) data stored separately, with naming convention.

Non-content data should be pluggable: I think it should be possible for an add-on author to write adapters on content to serialize (and restore) things the general case does not (e.g. I store things in the ZODB on hybrid content that stores form data distinct from content, but "contained" within; some folks may want something to keep data made in a Rapido application inside their Plone site). This is loosely analogous to the toJSON() method JavaScript's JSON.stringify() serialization uses, but a plug point should (a) use adaptation here, (b) provide both serialization from context and restoration back to it, and (c) ideally make it possible to have multiple adapters for one kind of content, with some kind of namespacing used in the output/merged JSON object. It might well be that making this pluggable allows good-enough to ship, with improvement to follow, without expecting perfection.

datakurre commented Feb 17, 2016

FWIW about legacy approaches:

There used to be RFC822 marshalling and demarshalling (mostly familiar from webdav support), which was a bit of a hack for Archetypes, but had a good framework (plone.rfc822) for Dexterity. Of course, the RFC822 format feels outdated in the current JSON world. I did Archetypes adapters for plone.rfc822 for my own migration use at https://github.com/datakurre/collective.atrfc822

There used to be CMF GS export and import steps for content ("structure"), but it didn't get much love from Plone. Quite deprecated, again, but I did experiment with adding support for current content types in https://github.com/datakurre/collective.themesitesetup

About portlets. Portlets have good GS import code, but are lacking export code for context portlets below the root. Some pointers for re-using GS for exporting and importing context portlets for arbitrary objects at https://github.com/datakurre/transmogrifier_ploneblueprints/blob/master/src/transmogrifier_ploneblueprints/portlets.py

djay commented Feb 18, 2016

@datakurre the marshalling format can't handle any kind of complex data, I think, so it would be limited in what can be transferred. This is aiming to be more similar to jsonify, which transfers everything it can.
I think these new formats could be good candidates for inclusion as GS handlers, and could also be embedded as default content inside a theme that could be installed on initial setup.

datakurre commented Feb 18, 2016

@djay I agree. Marshalling also excludes all non-schema defined data, and there's a lot of that in "Plone to Plone migrations".

Are you co-operating with plone.restapi? (It has (re)invented content import/export in jsonld.)

hvelarde commented Feb 18, 2016

@ebrehault do you mind adding documentation on how to implement an import/export adapter in add-ons as a deliverable?

ebrehault commented Feb 18, 2016

@hvelarde good idea, I'll add it

hvelarde commented Feb 26, 2016

we have a bunch of old sites waiting to be upgraded; implementing the import/export in a way that is usable for upgrading sites seems pretty important to me; count on us to help on this!

thet commented Mar 1, 2016

Another use case would be to "Backup" the site. Of course that's implicitly the case with this PLIP, it's just not mentioned explicitly.

I have some questions:

  • How is binary content exported? Base64 encoded, which is hard to edit? Or as file content in the ZIP file (which is easier to edit), within a folder structure reflecting the files' paths?
  • What about exports of the whole site, especially if it's really big (> 20 GB)?
  • The CSV export/import feature would require an additional handler to plone.restapi's serialization.
seanupton commented Mar 1, 2016

How is binary content exported? Base64 encoded, which is hard to edit? Or as file (which is easier to edit) content in the ZIP file, within a folder structure reflecting the files path?

@thet I think as files, with a "sidecar" file for metadata (personally, I prefer JSON everything). Maybe my nomenclature is too file-centric ("sidecar" is a term Adobe uses, for example, in asset management to describe XML metadata in a distinct file) -- the JSON is actually the primary thing for the type; all the binary stuff it includes should be cleanly referenced/named, and stored as distinct files.

Using a naming convention alone for association between JSON and binary content makes sense for 1:1 content ("primary field"), but may be unwieldy for content types that store more than one file/binary field and have issues with name collision (depending on implementation). Instead, I like the idea that every content item gets a master JSON file that has names/links to the files included within (maybe stored with some unique scheme like OID-plus-original-filename, but original filename metadata for the blob needs to be preserved as well).

A few caveats/challenges:

(1) What if a (text) field in a content type stores JSON, how does one serialize JSON-within-JSON? We have a few types like this.

(2) Can we preserve all of the following (non-lossy round-trip):

  • Original filename, as kept in the BLOB.
  • Primary fields and secondary file content alike
  • Uniqueness of stored filename in dump/output relative to other files in directory. Prevent collision (e.g. two different content items in same directory have an attached file called "logo.png").
  • Ability for a human to read filenames of binary output, while still preventing collision?
  • Bytes fields containing XML or JSON?
  • Content/mime type for the file in the metadata.

(3) Are we assuming that HTML primary fields should just be kept within the JSON, or kept in distinct files for editing by tools/editors?
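One way to reconcile human-readable filenames with collision prevention (caveat 2 above) would be a counter-suffix scheme. The following is purely a sketch of that idea, not anything the PLIP specifies:

```python
# Sketch of a collision-safe naming scheme for dumped binary files:
# keep the original filename readable, but suffix a counter when two
# items in the same directory attach a file with the same name.
# Illustrative only; not from the PLIP itself.

def unique_name(filename, taken):
    """Return filename, or stem-N.ext, avoiding names already in `taken`."""
    if filename not in taken:
        taken.add(filename)
        return filename
    if "." in filename:
        stem, ext = filename.rsplit(".", 1)
        ext = "." + ext
    else:
        stem, ext = filename, ""
    n = 1
    while f"{stem}-{n}{ext}" in taken:
        n += 1
    name = f"{stem}-{n}{ext}"
    taken.add(name)
    return name
```

The original filename would still need to be recorded in the item's JSON metadata so the round trip stays lossless.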

djay commented Mar 2, 2016

@thet @seanupton I realise JSON is preferable for developers and that CSV requires more work. I'm not sure CSV really needs to be integrated into restapi. But I do think it's going to be very useful for end users, as I've outlined above, and that makes it worth the extra complexity.

Imagine a transfer where you want to drop all files of a certain size, of a certain content type, or by a certain author. That doesn't require a programmer; it would only require very basic Excel.

If they wanted to do a bulk change of all the owner ids on the same site: just export metadata only, remove the extra columns, upload with just the path and owner, and import.

As much as possible it would be nice to keep filenames and folders that are easily readable rather than folders with pages of OIDs. Again, I realise this increases the complexity.

Again, I'm imagining an end-user case where a hand-made import zip is created. They create a CSV with a path field so the primary data can be found on import. That could work for additional binary fields too, for example CSV column names like lead_image__binary_path__ and lead_image__binary_name__.

(3) I'd keep HTML as a separate file. Maybe some kind of rule around length or mimetype?

ebrehault commented Mar 2, 2016

(3): I think having separate HTML files for rich text fields is much easier for users. But on this point (and others too), we can have a specific option in the Advanced tab (so Basic mode would export separate HTML files, but in Advanced mode we could choose to put them into the JSON; we could also choose whether to put attached files as base64 in the JSON, etc.)

hvelarde commented Mar 2, 2016

@djay you are addressing a different use case with your example; the purpose of this PLIP is to have an import/export function in Plone. once we have that, we can start imagining other scenarios.

I would love to have a ZIP file with the whole site exported in JSON format maintaining the structure and exporting the attachments as separate files.

djay commented Mar 2, 2016

I helped write the PLIP, hector. It addresses the use cases listed under Motivation. Read it rather than just reading what you think it should be.

seanupton commented Mar 2, 2016

@djay you are addressing a different use case with your example; the purpose of this PLIP is to have an import/export function in Plone. once we have that, we can start imagining other scenarios.

@hvelarde there are some reasons these goals are not really in conflict, I think:

(1) It is possible to start with JSON and bolt on other formats;

(2) It is reasonable to use the "advanced" tab to select formats;

(3) Other formats (CSV, legacy RFC822) are easy to output in the common case, though IMHO the JSON format should be the one that absolutely guarantees a lossless round-trip;

(4) A compressed archive makes having both CSV and JSON of the same content in the same archive negligible in storage cost.

(5) Designing an "advanced" tab with checkboxes for formats (JSON checked by default) has trivial additional cost, and probably isn't YAGNI.

(6) Supporting multiple formats through some sort of pluggability (adapters?) will encourage contribution of folks who want to scratch an itch beyond some core promise of lossless JSON.

hvelarde commented Mar 2, 2016

@djay yes, you listed 9 scenarios, and one of them is the one I would prefer to move to another issue, because almost the whole discussion here has been around it; you prefer CSV because you want to edit the file, but that is a different use case.

besides that, this PLIP will be the result of this conversation and not only what you think it should be.

@seanupton yes, so we can leave CSV conversion for another iteration.

djay commented Mar 2, 2016

So your argument, hector, is that you don't like the use case?

If we have a chance to solve two problems with one solution, then in my book that is a great thing.

Unless someone can argue a disadvantage, I think we should shut this discussion down, as it seems purely based on assumptions and intuition at this stage.

hvelarde commented Mar 3, 2016

my argument, @djay, is the scope of this PLIP; I have spent 10 of the last 15 years dealing with this kind of thing and I can smell when a task needs to be split into smaller ones to be accomplished with the minimum amount of time and resources (yes, call it intuition if you like).

as I said before, you have listed 9 scenarios just to emphasize how important it is to have content import/export working out of the box in Plone; this will save us a lot of time and effort if implemented the right way, and I am committing our scarce resources to help on that.

can we just move on and focus on the main issue? content import/export.

starting with a well-defined scope can help us think about implementation details like content, format, structure, versions supported, etc. then a bunch of details will emerge and we can solve them one by one.

worth reading: Advantages of User Stories for Requirements

djay commented Mar 3, 2016

nah. I think it's possible to fix it all. I've detailed how. Let's give it a try.

hvelarde commented Mar 4, 2016

we should probably also include users and versioning information (someone asked for it today in the IRC channel).

tisto commented Mar 15, 2016

I like the idea and this is definitely something that Plone should provide out of the box. I share @hvelarde's concerns about the scope of the PLIP though. I also think that we should finish plone.restapi first and then build the import/export feature on top of that, not the other way around. Otherwise we risk ending up with two incompatible "APIs" or reinventing the wheel twice. Both tasks are just too big to work on in parallel, in my opinion.

pigeonflight commented Mar 25, 2016

+1 ... I have code from a project that I could possibly rework to be useful on this.

gforcada commented Mar 25, 2016

So one way to re-focus this PLIP could be to provide the format and utilities that will be used by the restapi, so that when implementing the restapi only one API call needs to be made, i.e.:

api.content.export(obj=my_cool_object)

I wrote that up in plone.api style; it could be one of the first users of this utility/adapter as well.

jensens closed this Sep 23, 2016

hvelarde commented Sep 23, 2016

well, I think this is important guys; can we redefine the scope to something less ambitious?

rnixx commented Sep 23, 2016

What about exposing TTW ZEXP format export/import for manager users? Anyway it should be possible to disable this feature entirely.

ebrehault commented Sep 23, 2016

@hvelarde I also think it is important, but it hasn't been rejected because of the scope.
I proposed to base this feature on plone.restapi (because we need to serialize/deserialize Plone contents, and that's what plone.restapi does, so it would be foolish to implement a similar thing elsewhere).
For now plone.restapi is an addon; it is not in the core.
So the import/export feature has to be an addon too.
Once plone.restapi is in the core, I will re-submit this PLIP to the framework team.

Meanwhile, I plan to develop the addon (I proposed it as a GSoC subject but nobody picked it). If you want to help you are welcome :)

hvelarde commented Sep 23, 2016

yes, we want to help; our problem is we sometimes lack the time/experience to work faster on this kind of stuff. When @rodfersou finishes the stuff that is still pending, we will contact you on this.

djay commented Sep 24, 2016

@ebrehault it is mostly implemented already in https://github.com/collective/collective.importexport. It just needs work to include JSON in the CSV for additional fields, and a zip file format for blobs.

ebrehault commented Sep 24, 2016

@djay oh right, I totally forgot this add-on.
Supporting files should be quite easy, don't you think?

Rudd-O commented Sep 29, 2016

Yes, trim the scope please. This is an important feature to have in a CMS.

I think the minimal viable product should start with JSON export/import and then proceed from there. As a developer, that's the format I would use to load and process data structures, with a second format being YAML (it's much nicer and more readable, but less popular), and a distant third being XML.

CSV may be useful for some folks — those who use spreadsheets — but that set of folks has little overlap with the set of folks maintaining and operating Plone sites. CSV also can't capture hierarchy well, and it's a mess to query, manipulate, et cetera.

I remember using collective.importexport and it really wasn't a nice experience, precisely because of the CSV.

djay commented Oct 25, 2016

@Rudd-O developers have plenty of options for import/export. They aren't the main target, IMO. I think you might have a skewed view of who is maintaining websites. There are plenty of cases where developers or devops don't want to get involved for simple cases of importing and exporting content. If it's the case of a webmaster who wants to migrate the whole site out of Plone, then if this function helps them, great, but I don't think that's the target.

collective.importexport is a start. All good UIs are born out of lots of user testing and feedback. So I'd very much appreciate hearing where you felt you got lost in using that plugin and what your use case was, so I can improve it. But not here.

Rudd-O commented Oct 27, 2016

Very sad to hear this getting trimmed from the set of deliverables :-(

I understand the reasons given, and I still think it would be a valuable, self-promoting goal.

djay commented Feb 2, 2017

I will propose this as a GSOC project

Summary

Strategically, good content import/export is important because

  1. It allows new users to get started quickly.
  2. It helps overcome the obstacle for users who expect, as with a SQL database, to be able to import and export their data.
  3. It makes regular bulk uploads or syncs of content from external sources easier, and makes them possible for non-technical users and non-Python developers.

The end goal is to make Plone more approachable for webmasters, which will in turn help grow the install base.

The aim would be an online UI which allows:

  • both CSV and JSON import and export of content (using separate files for binary content). CSV is included so non-technical users can update metadata. Both formats will be able to hold the same data; CSV will need to use quoted JSON for certain parts.
  • both object creation and finding and updating existing content via various unique attributes, such as path or custom fields.
  • working with metadata, content, and binary content, or just combinations of these (i.e. just a metadata refresh if required).
  • exporting content and then reimporting it into a new site with a different version, such that almost all data is retained.
  • help when imports go wrong, e.g. dry-run mode, reports on content created, skipped, etc.
  • permissions and security to be respected: lower roles can still use it, just with the content/fields they have access to.
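The dry-run mode and post-import report could behave roughly like this sketch. The item structure and the exists/apply_change callbacks are hypothetical, chosen only to show the shape of the feature:

```python
# Sketch of dry-run import with a created/updated/skipped report.
# The item dicts and the `exists`/`apply_change` callbacks are
# hypothetical stand-ins for the real content-lookup machinery.

def import_items(items, exists, apply_change, dry_run=False):
    """Walk importable items; return counts of what would be (or was) done."""
    report = {"created": 0, "updated": 0, "skipped": 0}
    for item in items:
        if exists(item["path"]):
            action = "updated" if item.get("changed") else "skipped"
        else:
            action = "created"
        if not dry_run and action != "skipped":
            apply_change(item, action)  # only mutate the site outside dry-run
        report[action] += 1
    return report
```

In dry-run mode the same report is produced but apply_change is never called, so the user can preview the outcome before committing.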

Implementation

It will be implemented as an addon, or will extend an existing addon, that can be incorporated into Plone at a later date. collective.importexport is an example of an existing addon that could be extended.

Skills

Mainly Python. Some UX skills to help create an intuitive UI, but these can be provided by the mentors.

Mentors

Dylan Jay, Eric Bréhault

Aims

An addon, and a PLIP to include it in core Plone.

cewing commented Feb 3, 2017

Thanks for the write-up @djay. I'll include this in 2017 GSoC listings.

djay commented Mar 1, 2017

@ebrehault why did you close this one instead of #1373 ?

ebrehault commented Mar 1, 2017

@djay we closed it because this one is the PLIP and it has been rejected; the other one is the actual user problem description, and it is still very valid (to me).
