Export CSV report to external service #339
The CSV report is currently saved in a publicly accessible location on the Lagotto server. We want to store the report in an external service instead, using one of these three options:
I started a cursory investigation into Zenodo just now and it sparked the following questions:
The size of the zipped report in CSV format was something like 25 MB a few months ago. I think 2 GB is fine. If we go with Zenodo for the JSON dump, we could split that file into smaller sizes.
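Splitting the dump could be as simple as chunking by byte size. A minimal plain-Ruby sketch (the method name and size limit are illustrative, not from the Lagotto codebase):

```ruby
# Split a list of newline-delimited JSON lines into chunks that each stay
# under a per-file byte limit, so no single upload approaches Zenodo's cap.
def split_dump(lines, max_bytes)
  chunks = [[]]
  size = 0
  lines.each do |line|
    # Start a new chunk when adding this line would exceed the limit
    if size + line.bytesize > max_bytes && !chunks.last.empty?
      chunks << []
      size = 0
    end
    chunks.last << line
    size += line.bytesize
  end
  chunks
end
```

Each chunk could then be written out and uploaded as its own file.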
I forgot to mention that we can ask @lnielsen if there are specific questions about Zenodo.
The other reports we are generating are slightly different in nature and there is no need to make them publicly accessible. They are only available to admins and much smaller in size.
I'm not 100% sure I understand the third question, but I could imagine other CSV reports in the future.
In more general terms, as it applies to this GH issue...
I've spent more time with Zenodo yesterday and today, and have started documenting some non-intuitive behavior (at least to me) and questions in Zenodo's GH Issue 348.
Lagotto integration (errors)
Narrowing in on how to handle errors during the export process, it appears that:
@mfenner, how does the above sound for the general approach to error handling? Does anything else come to mind that should be considered for logging the export process in general?
The default Rails error log also uses Logstash format in production, but I don't think there is a need to also write to the log when an error occurs. I would only use
A few more things I've noticed about Zenodo (I've emailed email@example.com about the first two so they can confirm this is the case):
These aren't showstoppers, but they do seem counterintuitive. Their API seems to tell you one thing, but then something else happens behind the scenes, and you either can't access your content via the public URL they provide or you get weird HTTP errors.
Need Guidance / Input On
Next, there are some attributes related to the deposition that you'll likely want to look at. I'll need some guidance on the appropriate values. Currently I'm using a very small set of the attributes, as I'm not sure most of them make sense for this data.
Here's what I'm using now:
For all possible attributes and their meaning see the Deposition Metadata API docs.
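For reference, a deposition payload along these lines could be built as a plain Ruby hash. The field names follow Zenodo's Deposition Metadata docs; the actual values here are placeholders, not what we'd ship:

```ruby
require 'json'

# Sketch of a deposition metadata payload for the monthly report.
# Titles, descriptions, and creators below are placeholders.
metadata = {
  metadata: {
    upload_type: 'dataset',
    title:       'Lagotto monthly CSV report',           # placeholder
    description: 'Monthly ALM report exported from Lagotto.',
    creators:    [{ name: 'Lagotto' }],                   # placeholder
    keywords:    ['altmetrics', 'ALM', 'CSV']
  }
}

payload = JSON.generate(metadata)
```

The JSON body would then be sent along with the deposition creation request.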
@zdennis the Zenodo data repository creates a persistent identifier (a DOI) for every uploaded dataset (or other item type such as a text document). I hope you appreciate the dogfood aspect of this, and also that you could in theory use lagotto to track metrics around the dataset you just created (e.g. how many times the ALM monthly CSV has been downloaded, etc.).
When using DOIs you have to provide a standard set of metadata, and that is reflected in the fields required by Zenodo. To make things more complicated, there are different DOI registration agencies with different required and optional metadata. Zenodo uses DataCite and the required and optional metadata are described in the DataCite Metadata Schema which you can find at http://doi.org/10.5438/0010. Incidentally (or not), @lnielsen is one of the two chairs of the group that defines this metadata schema. I think the schema provides good background information for the fields required by Zenodo.
Regarding the specific metadata fields used when uploading ALM monthly reports to Zenodo I suggest the following:
One challenge is to link the reports together when we upload them multiple times, e.g. monthly. In theory we could use the same DOI and update the data, but I think it is a better idea to have a separate DOI for each report. You could use the
Sorry for all the trouble. The reason for the effort above is to make it easier for people to find and reuse the CSV file, and I think all of the above can be automated.
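One way to automate the linking between successive monthly reports, assuming Zenodo's `related_identifiers` metadata field and DataCite relation types (the DOI below is a placeholder):

```ruby
# Append a related-identifier entry pointing at the previous month's report,
# so each new deposition is linked back to its predecessor.
def link_to_previous(metadata, previous_doi)
  metadata[:related_identifiers] ||= []
  metadata[:related_identifiers] << {
    identifier: previous_doi,      # DOI of last month's report
    relation:   'isNewVersionOf'   # DataCite relation type
  }
  metadata
end
```

The previous DOI could be read from the most recent successful export record.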
Re-posting this comment. Accidentally posted to the zenodo project...
On a more general process note (not specifically Zenodo) I pushed up some spike code to issues/339_export_csv_report
Here's a brain dump of what I'm thinking based on this spike. This is intended to work for what we need today, but to allow re-use for other files we may want to export later (either to Zenodo or to other services):
FileExport is a model that represents any file that we want to export. It has a generic data model that more specialized subclasses would rely/build on. FileExport implements STI so subclasses can have their own specialized logic instantiated automatically.
ZenodoFileExport is a subclass of FileExport and implements the responsibility for what it means to upload a file to Zenodo. If we wanted to export multiple kinds of files to Zenodo this would either own that responsibility or we could have multiple ZenodoXXXFileExport classes that are responsible for specific kinds of files.
If Lagotto ever needs to export data to a new service it'd be as simple as implementing a new FileExport subclass that acted as the integration point.
rake export:all would look at all FileExport(s) that have not been exported and are not in the process of exporting, and queue up a generic FileExportJob for each specific FileExport.
FileExportJob is an ActiveJob that will find the specific FileExport in the DB and then tell it to export itself.
Over time there will be a record of all of the files that are exported, along with information that can be used to locate them later if necessary.
Exceptions will be stored as Alert(s), which already exist and already surface to the user through the administrative interface.
Retry logic for the jobs will be the default Sidekiq retry logic for now. Will revisit this next week.
API_KEYs etc would be added as environment variables in the same way it is done today.
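The outline above could be sketched in plain Ruby like this. The real models would be ActiveRecord STI subclasses and the job an ActiveJob; everything here beyond the FileExport and ZenodoFileExport names is an assumption to illustrate the shape:

```ruby
# Plain-Ruby sketch of the proposed class relationships.
class FileExport
  attr_reader :state

  def initialize
    @state = :pending
  end

  # Generic export lifecycle shared by all subclasses.
  def export!
    @state = :exporting
    perform_export        # subclass responsibility
    @state = :exported
  rescue StandardError => e
    @state = :failed
    record_alert(e)       # would create an Alert record in the real app
  end

  def perform_export
    raise NotImplementedError, 'subclasses implement the actual upload'
  end

  def record_alert(error)
    # placeholder for Alert.create(message: error.message, ...)
  end
end

# Owns what it means to upload a file to Zenodo.
class ZenodoFileExport < FileExport
  def perform_export
    # would upload the file via the Zenodo API here
  end
end
```

A FileExportJob would then just load the record and call `export!`, so adding a new service means adding a new subclass and nothing else.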
Thoughts, Concerns, Feedback?
Let me know. Thanks!
@zdennis I like the above outline on file export. Some thoughts:
I do. I was a little nervous about publishing test files to their production environment, since I was fearful they'd steal valuable real estate on Zenodo's home page from other meaningfully published works.
I heard back from Zenodo via email and they do have a developer sandbox: http://sandbox.zenodo.org/. I'm going to switch to using this for development.
Related: the Zenodo ruby gem only supports production right now and doesn't provide a way to override this setting. I'm going to submit another patch so the URL can be overridden, letting us point development and test environments at the sandbox as needed while production continues to use the Zenodo production URL.
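Once the gem accepts an override, picking the base URL per environment could look something like this. The ZENODO_URL variable name and the fallback logic are assumptions, not the gem's actual API:

```ruby
ZENODO_PRODUCTION_URL = 'https://zenodo.org/api'
ZENODO_SANDBOX_URL    = 'https://sandbox.zenodo.org/api'

# Resolve the Zenodo base URL: an explicit ZENODO_URL env var wins,
# otherwise production gets the real service and everything else
# gets the sandbox.
def zenodo_base_url(env = ENV)
  env.fetch('ZENODO_URL') do
    env['RAILS_ENV'] == 'production' ? ZENODO_PRODUCTION_URL : ZENODO_SANDBOX_URL
  end
end
```

This keeps the decision in one place and stays consistent with how API_KEYs are already injected via environment variables.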
Awesome! Thank you for providing information on this so quickly.
I do like the idea of using
Here are three possible ways that we could implement this in some pseudo-code:
Re: file exports to Zenodo
Ah, okay. I was initially thinking one deposition per file, but I can see where that wouldn't always make sense. We can allow for multiple files to be added, but in our current use-case for the ALM report(s) we'd still zip them up and upload the zip file.
With respect to the README files, we could add them to the zip file OR add them as a sibling to the zip file in the Zenodo deposition. From browsing other depositions on Zenodo's site, it may be nice to have the README live outside the zip file so it can be viewed before downloading the data-set. But that may be a bit presumptuous on my part. Do you have any preferences based on your understanding of how this may be used? Or what do you think would be a good first go?
With that being said I'm now leaning towards:
Yeah, I was planning on keeping the reports zipped up, but possibly extracting that to its own
I will look into that. If alright with you I'd like to get the base functionality in place and then come back and add this as a separate PR.
Related to this, though: do you know when/where that report is generated? I don't see anywhere that calls
If the above sounds like a good place to start, I'll start honing in on what I outlined above and in the pseudo-code gist and get a branch up. We can always review and adapt once the code starts taking shape, for any areas that may be hard to visualize now.
Let me know if you have any other feedback or thoughts as I get started on this. Thanks.
Let's go with
Ideally I would want to have the README in the zip file. It makes it a bit harder to extract that information for display on the Zenodo web pages, but it makes it easier to keep the README associated with the CSV on a local computer. In the future I might take the next step and use the Universal Container Format, which is basically a ZIP with some extra rules. I attended the very interesting http://csvconf.com/ last year, which covered some cool ideas around CSV files. One example was linting CSV files.
I don't suspect these will be much effort. I've got to go for the evening though, and wanted you to have something to look at when you're on next.