Export CSV report to external service #339

Closed
mfenner opened this Issue Jun 23, 2015 · 17 comments

@mfenner
Member

mfenner commented Jun 23, 2015

The CSV report is currently saved in a publicly accessible location on the Lagotto server. We want to store the report in an external service instead, using one of these three options:

  • automatically store in Amazon S3 or a similar service
  • automatically store in Zenodo data repository
  • keep functionality, upload to Zenodo manually

@zdennis
Contributor

zdennis commented Jun 24, 2015

I started a cursory investigation into Zenodo just now and it sparked the following questions:

  • Zenodo has a maximum file size of 2 GB; Amazon allows up to 5 TB. Do you know how large the report currently is?
  • The aggregate ALM report is currently the only one shared publicly via an HTTP-accessible route, but there are other generated reports as well. Should those also be stored on whatever the final storage mechanism is, even though they may not be publicly HTTP-accessible?
  • Should all of the generated reports be publicly HTTP-accessible since they are already generated, or are they truly just works in progress (e.g. source-specific CSV reports)?

@mfenner
Member

mfenner commented Jun 25, 2015

The size of the zipped report in CSV format was something like 25 MB a few months ago. I think 2 GB is fine. If we go with Zenodo for the JSON dump, we could split that file into smaller sizes.

I forgot to mention that we can ask @lnielsen if there are specific questions about Zenodo.

The other reports we are generating are slightly different in nature and there is no need to make them publicly accessible. They are only available to admins and much smaller in size.

I'm not 100% sure I understand the third question, but I could imagine other CSV reports in the future.

@lnielsen

lnielsen commented Jun 25, 2015

FYI: I think we will soon increase the 2GB per file limit so don't let that stop you.

@zdennis
Contributor

zdennis commented Jun 25, 2015

@mfenner, you answered my 3rd question :) – that the other smaller admin reports don't need to be made accessible over HTTP.

@lnielsen, awesome! Out of curiosity do you know (and can you share) what it will be increasing to?

@zdennis
Contributor

zdennis commented Jul 2, 2015

@lnielsen, any word on @mfenner's question from the other week? Is Zenodo appropriate for the kind of information we're looking to put up there?

In more general terms, as it applies to this GH issue...

Zenodo

I've spent more time with Zenodo yesterday and today and have started to document some non-intuitive behavior (at least to me) and questions in Zenodo's GH Issue 348.

I've also opened up some minor issues and PRs for the Zenodo ruby gem: 2, 3, and 4.

Lagotto integration (errors)

Narrowing in on how to handle errors during the export process, it appears that:

  • Alert(s) should be created for logging errors. For example, if Zenodo fails with an unexpected error code, create a new Alert. This seems well supported in the app and its UI already (see the sketch after this list).
  • LogStash doesn't seem to be necessary here, as it's currently only used to log 3rd-party API responses (non-errors) during the ingestion process, and Alert(s) seem more appropriate for system/job-level errors.
  • There doesn't seem to be a need to log/instrument any ActiveSupport::Notifications as it relates to this work.
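
A minimal sketch of the Alert-based approach, assuming a hypothetical ZenodoExportError class and illustrative Alert attribute names:

    # Minimal sketch: record an Alert when the Zenodo upload fails unexpectedly.
    # upload_report_to_zenodo, ZenodoExportError, and the exact Alert attributes
    # are illustrative assumptions, not existing Lagotto API.
    begin
      upload_report_to_zenodo(report)
    rescue ZenodoExportError => e
      Alert.create(exception: e,
                   class_name: e.class.to_s,
                   message: "Zenodo export failed: #{e.message}")
    end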

@mfenner, how does the above sound as the general approach to error handling? Does anything else come to mind that should be considered for logging the export process in general?

@mfenner
Member

mfenner commented Jul 2, 2015

@zdennis yes, Alert(s) make the most sense here. If you find it useful you can define a custom error in lib/custom_error.rb to make it easier to filter the alerts by these errors.

The default Rails error log also uses Logstash format in production, but I don't think there is a need to also write to the log when an error occurs. I would only use ActiveSupport::Notifications if there is a need to time the duration of the API calls to Zenodo, which I don't see.
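
For example, a minimal sketch of such a custom error, assuming the existing lib/custom_error.rb module (the ZenodoError name is illustrative):

    # lib/custom_error.rb (the ZenodoError class name is an assumption)
    module CustomError
      class ZenodoError < StandardError; end
    end

    # Alerts could then be filtered by class name, e.g.:
    Alert.where(class_name: "CustomError::ZenodoError")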

@zdennis
Contributor

zdennis commented Jul 2, 2015

A few more things I've noticed about Zenodo (I've emailed info@zenodo.com about the first two so they can confirm whether this is the case):

  • It doesn't seem that Zenodo will let you remove a published file; you must email them with a request to remove it. There is probably a good reason for this, but I published a test file and had to email them to see about removing it.
  • It doesn't seem that Zenodo has the concept of a test or sandboxed environment. Not a show stopper, but a bummer for integration testing, since to test it you have to create, upload, and ultimately publish files that you never actually intend to publish.
  • Uploading files seemed relatively fast, but it seems to take a very long time for the file to actually become available (I'm assuming my test file will eventually be available). I'm not sure if they have a person look at each published upload, but the Zenodo API will tell you that a file is done, submitted, and has a public URL, and yet you cannot access the file.

These aren't show stoppers, but they do seem counterintuitive. Their API seems to tell you one thing, but then something else is happening behind the scenes and you either can't access your content via the public URL they provide or you get weird HTTP errors.

Need Guidance / Input On

Next, there are some deposition attributes that you'll likely want to look at; I'll need some guidance on the appropriate values. Currently I'm using a very small set of the attributes, as I'm not sure most of them make sense for this data.

Here's what I'm using now (sketched in code below):

  • upload_type - this is set to dataset.
  • publication_date - right now this defaults to the date of the upload rather than the date the report was run. Should this be the day the report ran, in case they differ, or does that not matter?
  • title - this is set to 'ALM Monthly Stats Report'. What should this be?
  • description - this is set to 'ALM Monthly Stats Report', the same as the title. What should this be?
  • creators - there is one creator listed, 'Public Library of Science'. What should this be?

For all possible attributes and their meaning see the Deposition Metadata API docs.
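
In code form, what's currently being sent looks roughly like this (the creators hash shape is an assumption based on those docs):

    # Roughly the deposition metadata currently in use (values as listed above).
    metadata = {
      upload_type:      "dataset",
      publication_date: Date.today.to_s,  # currently the upload date
      title:            "ALM Monthly Stats Report",
      description:      "ALM Monthly Stats Report",
      creators:         [{ name: "Public Library of Science" }]
    }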

@mfenner
Member

mfenner commented Jul 2, 2015

@zdennis the Zenodo data repository creates a persistent identifier (a DOI) for every uploaded dataset (or other item type, such as a text document). I hope you appreciate the dogfood aspect of this, and also that you could in theory use Lagotto to track metrics around the dataset you just created (e.g. how many times the ALM monthly CSV has been downloaded).

When using DOIs you have to provide a standard set of metadata, and that is reflected in the fields required by Zenodo. To make things more complicated, there are different DOI registration agencies with different required and optional metadata. Zenodo uses DataCite and the required and optional metadata are described in the DataCite Metadata Schema which you can find at http://doi.org/10.5438/0010. Incidentally (or not), @lnielsen is one of the two chairs of the group that defines this metadata schema. I think the schema provides good background information for the fields required by Zenodo.

Regarding the specific metadata fields used when uploading ALM monthly reports to Zenodo, I suggest the following (assembled into a single hash below):

  • upload_type: dataset
  • publication_date: date the report was run
  • title: the title should reflect the date the report was run. We want to run the report monthly, but that may change. I therefore suggest including the publication_date in the title, but not using "monthly". Instead of ALM we should use ENV["APPLICATION"] to distinguish reports from different Lagotto installations. In summary: #{ENV["APPLICATION"]} Summary Stats on #{publication_date}
  • description: This should contain some text that describes how the data were generated, for now we can be short. My suggestion: #{ENV["APPLICATION"]} Summary Stats on #{publication_date}, generated by the Lagotto software.
  • creators: this should be an ENV variable. For now we only need one creator, e.g. Public Library of Science, so we can do [ENV["CREATOR"]]
  • access_right: open (the default)
  • license: cc-zero (the default for datasets)
  • keywords: This could be a constant ZENODO_KEYWORDS (e.g. at config/initializers/constants.rb) with array ["alm","lagotto","article-level metrics","altmetrics","bibliometrics","mendeley","facebook","twitter","scopus"]
  • related_identifiers: [{ relation: "IsCompiledBy", identifier: "https://github.com/articlemetrics/lagotto" }, { relation: "IsPartOf", identifier: ENV["ZENODO_DOI"] }]

One challenge is to link the reports together when we upload them multiple times, e.g. monthly. In theory we could use the same DOI and update the data, but I think it is a better idea to have a separate DOI for each report. You could use the related_identifier field with relation IsNewVersionOf, but that will be complex to generate. It would be easier to manually generate a parent DOI once and then use the related_identifier IsPartOf to link all datasets to that parent DOI, using a ZENODO_DOI ENV variable.
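
Assembled into a single hash, the above would look roughly like this (a sketch only; it assumes publication_date holds the date the report was run, and that ZENODO_KEYWORDS and ENV["ZENODO_DOI"] are defined as described):

    # Sketch of the suggested deposition metadata, per the list above.
    metadata = {
      upload_type:      "dataset",
      publication_date: publication_date.to_s,
      title:            "#{ENV['APPLICATION']} Summary Stats on #{publication_date}",
      description:      "#{ENV['APPLICATION']} Summary Stats on #{publication_date}, " \
                        "generated by the Lagotto software.",
      creators:         [ENV["CREATOR"]],
      access_right:     "open",
      license:          "cc-zero",
      keywords:         ZENODO_KEYWORDS,
      related_identifiers: [
        { relation: "IsCompiledBy", identifier: "https://github.com/articlemetrics/lagotto" },
        { relation: "IsPartOf",     identifier: ENV["ZENODO_DOI"] }
      ]
    }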

Sorry for all the trouble. The reason for the effort above is to make it easier for people to find and reuse the CSV file, and I think all of the above can be automated.

@zdennis
Contributor

zdennis commented Jul 2, 2015

Re-posting this comment. Accidentally posted to the zenodo project...

On a more general process note (not specifically Zenodo) I pushed up some spike code to issues/339_export_csv_report

Here's a brain dump of what I'm thinking based on this spike. This is intended to work for what we need today, but to allow re-use for other files we may want to export later (either to Zenodo or to other services):

FileExport is a model that represents any file that we want to export. It has a generic data model that more specialized subclasses would rely/build on. FileExport implements STI so subclasses can have their own specialized logic instantiated automatically.

ZenodoFileExport is a subclass of FileExport and implements the responsibility for what it means to upload a file to Zenodo. If we wanted to export multiple kinds of files to Zenodo this would either own that responsibility or we could have multiple ZenodoXXXFileExport classes that are responsible for specific kinds of files.

If Lagotto ever needs to export data to a new service it'd be as simple as implementing a new FileExport subclass that acted as the integration point.

rake export:all would look at all FileExport(s) that have not been exported and are not in the process of exporting, and queue up a generic FileExportJob for each.

FileExportJob is an ActiveJob that will find the specific FileExport in the DB and then tell it to export! itself. If a FileExport successfully finishes, it will record the response as well as the finish time of the export. It also records the start time when the export begins.

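A minimal sketch of this design, with illustrative column and method names based on the description above:

    # Sketch of the STI design described above; column names are assumptions.
    class FileExport < ActiveRecord::Base
      # STI: the `type` column instantiates the right subclass automatically.
      def export!
        raise NotImplementedError, "subclasses implement the actual upload"
      end
    end

    class ZenodoFileExport < FileExport
      def export!
        # create a Zenodo deposition, upload the file, publish, store the response
      end
    end

    class FileExportJob < ActiveJob::Base
      def perform(file_export_id)
        export = FileExport.find(file_export_id)
        export.update!(started_exporting_at: Time.zone.now)
        export.export!
        export.update!(finished_exporting_at: Time.zone.now)
      end
    end
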
Over time there will be a record of all of the files that have been exported, along with information that can be used to track them down for later retrieval if necessary.

Exceptions will be stored as Alert(s), which already exist and already communicate to the user through the administrative interface.

Retry logic for the jobs will be the default Sidekiq retry logic for now. Will revisit this next week.

API_KEY(s) etc. would be added as environment variables in the same way it is done today.

Thoughts, Concerns, Feedback?

Let me know. Thanks!

@mfenner
Member

mfenner commented Jul 3, 2015

@zdennis I like the above outline on file export. Some thoughts:

  • in theory one Zenodo data package could consist of multiple files, and that could also be possible for other export services such as S3. Is that part of the general logic? If not then maybe you could implement a FolderExport that can handle multiple files. One idea for the future would be to add a README, as CSV files are usually not very informative.
  • are you planning to zip the CSV you send to Zenodo? If that is the case, then the FolderExport would also work well for a single file and could be generic.
  • we have a report that is sent whenever a CSV file is generated (send_work_statistics_report), and registered users can subscribe to that report. If that existing functionality is easy to integrate into this FileExport framework, then it would be good to have in there.

@mfenner
Member

mfenner commented Jul 3, 2015

@zdennis you don't have to wait to hear back from @lnielsen regarding the question of whether the data we deposit are appropriate for Zenodo.

@zdennis
Contributor

zdennis commented Jul 6, 2015

the Zenodo data repository creates a persistent identifier (a DOI) for every uploaded dataset (or other item type such as a text document). I hope you appreciate the dogfood aspect of this

I do. I was a little nervous about publishing test files to their production environment, since I feared they'd steal valuable real estate on Zenodo's home page from other meaningfully published works.

I heard back from Zenodo via email and they do have a developer sandbox: http://sandbox.zenodo.org/. I'm going to switch to using this for development.

Related: the Zenodo ruby gem only supports production right now and doesn't provide a way to override this setting. I'm going to submit another patch so the URL can be overridden; that way development and test environments can point at the sandbox as needed while production uses the Zenodo production URL.

Regarding the specific metadata fields used when uploading ALM monthly reports to Zenodo I suggest the following...

Awesome! Thank you for providing information on this so quickly.

One challenge is to link the reports together when we upload them multiple times, e.g. monthly...

I do like the idea of using related_identifier and its IsNewVersionOf relation (as opposed to the manual DOI registration). Here's why I like it:

  • easy to store/track the DOI since Zenodo registers one when we publish a file
  • easy to retrieve/use the next time we upload/publish a file
  • if new reports/files/etc. that get added to Zenodo are generated over time, the functionality could be entirely inherited and/or very easy to use/extend
  • we can also take advantage of the convention for tracking kinds of reports/files/etc and update IsPreviousVersionOf and IsPartOf

Here are three possible ways that we could implement this in some pseudo-code:

https://gist.github.com/zdennis/f1ee525797b2fe3c6c1b
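
For illustration only (the gist has the actual pseudo-code options), the linking might look something like this, assuming the export model records a doi once a file is published:

    # Illustrative assumption: ZenodoFileExport stores the DOI after publishing.
    related_identifiers = [
      { relation: "IsCompiledBy", identifier: "https://github.com/articlemetrics/lagotto" }
    ]
    if (previous = ZenodoFileExport.where.not(doi: nil).order(:created_at).last)
      related_identifiers << { relation: "IsNewVersionOf", identifier: previous.doi }
    end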

Re: file exports to Zenodo

in theory one Zenodo data package could consist of multiple files, and that could also be possible for other export services such as S3. Is that part of the general logic? If not then maybe you could implement a FolderExport that can handle multiple files. One idea for the future would be to add a README, as CSV files are usually not very informative.

Ah, okay. I was initially thinking one deposition per file, but I can see where that wouldn't always make sense. We can allow for multiple files to be added, but in our current use-case for the ALM report(s) we'd still zip them up and upload the zip file.

With respect to the README files, we could add them to the zip file OR we could add them as a sibling of the zip file in the Zenodo deposition. From browsing other depositions on Zenodo's site, it may be nice to have the README live outside of the zip file so it can be viewed prior to downloading the data-set. But that may be a bit presumptuous on my part. Do you have any preferences based on your understanding of how this may be used? Or, what do you think would be a good first go?

With that being said I'm now leaning towards:

  • Rename FileExport to ThirdPartyDataExport to encapsulate one or more files being uploaded, by defining the public interface that all future third-party data exporters will adhere to.
  • Rename ZenodoFileExport to ZenodoDepositionExport to indicate what we're exporting to in Zenodo and not mislead a code reader into thinking we're uploading a single file (I think we can avoid having both FileExport and FolderExport this way).

are you planning to zip the CSV you send to Zenodo? If that is the case, then the FolderExport would also work well for a single file and could be generic.

Yeah, I was planning on keeping the reports zipped up, but possibly extracting that to its own Zipper or ZipUtility class. I was planning on keeping that separate from the third-party export functionality as much as possible, although I'd like to reserve the ability to do a 180 once I get there and see how the code unfolds. :)

we have a report that is sent whenever a CSV file is generated (send_work_statistics_report), and registered users can subscribe to that report. If that existing functionality is easy to integrate into this FileExport framework, then it would be good to have in there.

I will look into that. If that's alright with you, I'd like to get the base functionality in place and then come back and add this as a separate PR.

Related to this, though: do you know when/where that report is generated? I don't see anywhere that calls Report.send_work_statistics_report, and that appears to be the only place that calls ReportMailer.send_work_statistics_report.

Thoughts?

If the above sounds like a good place to start, I'll start honing in on what I outlined above and in the pseudo-code gist and get a branch up. We can always review and adapt once the code starts taking shape for any areas that may be hard to visualize now.

Let me know if you have any other feedback or thoughts as I get started on this. Thanks.

@mfenner
Member

mfenner commented Jul 6, 2015

@zdennis the work_statistics_report is generated via a cron job that runs once a month: https://github.com/articlemetrics/lagotto/blob/master/lib/tasks/cron.rake#L70-L71 (all cron jobs are consolidated in cron.rake).

@mfenner
Member

mfenner commented Jul 6, 2015

Let's go with IsNewVersionOf to link a new version of the report. There might be other ways to do that in the future, but this should work for now.

Ideally I would want to have the README in the zip file. It makes it a bit harder to extract that information for display on the Zenodo web pages, but it makes it easier to keep the README associated with the CSV on a local computer. In the future I might take the next step and use the Universal Container Format, which is basically a ZIP with some extra rules. I attended the very interesting http://csvconf.com/ last year, which covered some cool ideas around CSV files; one example was linting CSV files.

@zdennis
Contributor

zdennis commented Jul 7, 2015

@mfenner, opened up PR #362 for review. Still some things left to do on it:

  • Add README files for the report and include them in the zip file
  • Add environment variables for Zenodo deposition attributes and relationships (right now I'm just using test metadata)
  • Fix a rake spec I broke
  • Review the change-set and make sure everything else that should be accounted for is

I don't suspect these will be much effort. I've got to go for the evening, though, and wanted you to have something to look at when you're on next.

@mfenner
Member

mfenner commented Jul 7, 2015

Thanks, will look at this tomorrow.

@mfenner
Member

mfenner commented Jul 13, 2015

Closing the issue. Merged into master and confirmed to work as expected.

@mfenner mfenner closed this Jul 13, 2015

mfenner pushed a commit that referenced this issue Jul 13, 2015
