Add common data sources #2

Open
4 of 8 tasks
oxinabox opened this issue Nov 13, 2017 · 13 comments
Labels
Data Repo For tagging repos / APIs that could be added

Comments

@oxinabox
Owner Author

I think the GitHub repos will be fairly easy to scrape.

The others may actually be really easy because they might use a lot of modern HTML practices like filling in #ids.

@oxinabox
Owner Author

OAI-PMH is a standard API that is exposed by Figshare and DataDryad and probably many others.

@pdurbin

pdurbin commented Nov 30, 2017

Dataverse supports OAI-PMH. You can find a list of OAI sets by installation at https://docs.google.com/spreadsheets/d/12cxymvXCqP_kCsLKXQD32go79HBWZ1vU_tdG4kvP5S8/edit?usp=sharing

@oxinabox
Owner Author

@pdurbin thanks.
I've not properly gone through OAI-PMH.
Am I right in saying that I should be able to use it to generate a data deps registration line (given some ID, like a URL)?
A data deps registration line needs at minimum a list of URLs to download a local copy.
And really wants to have a bunch of metadata like author, website, and papers to cite.
Ideally also has a SHA checksum for each file.
I think OAI-PMH exists almost specifically to make it easy to get this kind of information.
But I am not sure.
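
For illustration, here is a minimal sketch of what harvesting those fields could look like, assuming the repository exposes OAI-PMH with the standard `oai_dc` (Dublin Core) metadata format. The base URL, identifier, and sample response are made up; `GetRecord`, `identifier`, and `metadataPrefix` are standard OAI-PMH request parameters. Whether a record also carries file URLs or checksums depends on the repository, so this only pulls descriptive metadata.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces used by OAI-PMH responses that carry Dublin Core metadata.
NS = {
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def get_record_url(base_url, identifier):
    """Build a GetRecord request URL asking for one item as Dublin Core."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": "oai_dc",
    })
    return f"{base_url}?{query}"


def registration_fields(response_xml):
    """Collect the metadata a registration block would want from a response."""
    root = ET.fromstring(response_xml)
    dc = root.find(".//oai_dc:dc", NS)

    def texts(tag):
        return [el.text for el in dc.findall(f"dc:{tag}", NS)]

    return {
        "title": texts("title"),
        "authors": texts("creator"),
        "rights": texts("rights"),
        "links": texts("identifier"),
    }


# Hypothetical response, hand-written for this example.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record><metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Example dataset</dc:title>
      <dc:creator>Doe, Jane</dc:creator>
      <dc:rights>CC0</dc:rights>
      <dc:identifier>https://repository.example.org/dataset/123</dc:identifier>
    </oai_dc:dc>
  </metadata></record></GetRecord>
</OAI-PMH>"""

url = get_record_url("https://repository.example.org/oai", "oai:example.org:123")
fields = registration_fields(SAMPLE)
```

The file list and checksums would still have to come from somewhere else if the repository does not put them in the record.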

You can see the current generator prototype (with reference outputs) I have for the UCI ML repo at https://github.com/oxinabox/DataDepsGenerators.jl/pull/1/files

@pdurbin

pdurbin commented Nov 30, 2017

@oxinabox I'm sorry, but I'm not familiar enough with OAI-PMH to know the answer. Someone on the dataverse-community mailing list might be, and you'd be welcome to start a thread about this: https://groups.google.com/forum/#!forum/dataverse-community

@pdurbin

pdurbin commented Jan 12, 2018

@oxinabox if that was you over at http://irclog.iq.harvard.edu/dataverse/2018-01-12 I'm sorry I missed you. Yes, you can think of SWORD as being for uploads and OAI-PMH as being for downloading metadata (but not files, generally speaking).

@oxinabox
Owner Author

Indeed it was me. I'm thinking about this a bit more again.
Sometimes that metadata includes file URLs and checksums (I think?).
And even if it doesn't, it includes other data I want to harvest, like author and copyright status.

@pdurbin

pdurbin commented Jan 12, 2018

Right, from DDI you can get names of files and such. For tabular files, you can even get summary stats on variables (columns), like this example from https://dataverse.harvard.edu/api/datasets/export?exporter=ddi&persistentId=doi:10.7910/DVN/TJCLKP

<var ID="v17909793" name="stars" intrvl="discrete">
  <location fileid="f3040230"/>
  <labl level="variable">stars</labl>
  <sumStat type="medn">3.0</sumStat>
  <sumStat type="vald">74.0</sumStat>
  <sumStat type="max">196.0</sumStat>
  <sumStat type="stdev">38.35085209417775</sumStat>
  <sumStat type="min">0.0</sumStat>
  <sumStat type="invd">0.0</sumStat>
  <sumStat type="mean">19.081081081081102</sumStat>
  <sumStat type="mode">.</sumStat>
  <varFormat type="numeric"/>
  <notes subject="Universal Numeric Fingerprint" level="variable" type="Dataverse:UNF">UNF:6:HLicTVd/u3Cwzb/nrk29VA==</notes>
</var>
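
As a rough sketch of how a generator might read such an element, the snippet below parses a trimmed copy of the `<var>` quoted above with Python's standard library. It assumes the element is un-namespaced, as shown here; the full export may declare an XML namespace, in which case the tag lookups would need to be namespace-qualified. Treating `.` as "not computed" is also an assumption.

```python
import xml.etree.ElementTree as ET

# The <var> element quoted above, trimmed to a few of its statistics.
DDI_VAR = """<var ID="v17909793" name="stars" intrvl="discrete">
  <location fileid="f3040230"/>
  <labl level="variable">stars</labl>
  <sumStat type="medn">3.0</sumStat>
  <sumStat type="max">196.0</sumStat>
  <sumStat type="min">0.0</sumStat>
  <sumStat type="mode">.</sumStat>
  <varFormat type="numeric"/>
</var>"""


def summary_stats(var_xml):
    """Return (variable name, file id, {statistic type: value or None})."""
    var = ET.fromstring(var_xml)
    stats = {}
    for stat in var.findall("sumStat"):
        try:
            stats[stat.get("type")] = float(stat.text)
        except (TypeError, ValueError):
            # Non-numeric text such as "." is kept as None (assumed: not computed).
            stats[stat.get("type")] = None
    return var.get("name"), var.find("location").get("fileid"), stats


name, file_id, stats = summary_stats(DDI_VAR)
```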

I'm not really an expert on all this, but again if you email https://groups.google.com/forum/#!forum/dataverse-community someone with more information could weigh in.

@oxinabox oxinabox mentioned this issue Jan 13, 2018
@oxinabox
Owner Author

oxinabox commented Jan 29, 2018

The DataONE API is really nice.
It does exactly what I want:
http://wiki.datadryad.org/DataONE_RESTful_API
Metadata + actual links to files + checksums

Looks like it would add a fair few sites, https://www.dataone.org/current-member-nodes#uploads
including DataDryad

The way to do this would be to implement an abstract dispatch type DataOne,
then if required implement DataDryad as a concrete case of it.
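
A rough sketch of that design, written in Python with an abstract base class standing in for a Julia abstract type with methods dispatching on its subtypes. All class names, the placeholder base URL, and the `/meta/<pid>` and `/object/<pid>` path layout are assumptions made for illustration and should be checked against the DataONE documentation.

```python
from abc import ABC, abstractmethod
from urllib.parse import quote


class DataOneRepo(ABC):
    """Shared behaviour for any repository that speaks the DataONE REST API."""

    @abstractmethod
    def base_url(self):
        """Root URL of this member node's API."""

    def metadata_url(self, pid):
        # Assumed layout: system metadata (including the checksum) per object.
        return f"{self.base_url()}/meta/{quote(pid, safe='')}"

    def object_url(self, pid):
        # Assumed layout: the object's bytes, i.e. the file to download.
        return f"{self.base_url()}/object/{quote(pid, safe='')}"


class GenericDataOne(DataOneRepo):
    """Any member node, identified only by its base URL."""

    def __init__(self, url):
        self._url = url.rstrip("/")

    def base_url(self):
        return self._url


class DataDryad(DataOneRepo):
    """Concrete case, only needed if DataDryad deviates from the generic node."""

    def base_url(self):
        return "https://dryad.example.org/mn"  # placeholder, not the real endpoint


dryad_file = DataDryad().object_url("doi:10.5061/dryad.abc")
other_meta = GenericDataOne("https://node.example.org/api/").metadata_url("abc 1")
```

The point of the split is that everything written against the base type works for every member node, and a concrete subtype only overrides what differs.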

@oxinabox oxinabox added the Data Repo For tagging repos / APIs that could be added label Jun 19, 2018
@BeastyBlacksmith

If I may, I would like to point to EDMOND, the open data repository of the Max Planck Society.

@oxinabox
Owner Author

oxinabox commented Jul 9, 2019

@BeastyBlacksmith I am not actively adding new data sources at the moment,
but I will review PRs.

You might also want to raise an issue with the EDMOND team asking them to follow the Google/schema.org guidelines for including JSON-LD structured data fragments on their pages.
This will make DataDepsGenerators.jl work with it automatically,
and will make it show up in Google Dataset Search.

https://developers.google.com/search/docs/data-types/dataset
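
To show why such a fragment is enough to work from, here is a minimal sketch that pulls a schema.org `Dataset` description out of a page and lists its download URLs. The sample page is made up, and this is not necessarily how DataDepsGenerators.jl does it; it assumes each JSON-LD block is a single object whose `distribution` is a list of `DataDownload` entries with a `contentUrl`.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every application/ld+json script tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.items.append(json.loads("".join(self._chunks)))
            self._in_jsonld = False


def dataset_download_urls(html):
    """Return the contentUrl of every distribution of every Dataset found."""
    parser = JsonLdExtractor()
    parser.feed(html)
    urls = []
    for item in parser.items:
        if isinstance(item, dict) and item.get("@type") == "Dataset":
            for dist in item.get("distribution", []):
                urls.append(dist["contentUrl"])
    return urls


# Hypothetical landing page, hand-written for this example.
PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset",
 "name": "Example dataset",
 "creator": {"@type": "Person", "name": "Jane Doe"},
 "distribution": [{"@type": "DataDownload",
                   "encodingFormat": "text/csv",
                   "contentUrl": "https://repository.example.org/files/data.csv"}]}
</script>
</head><body></body></html>"""

urls = dataset_download_urls(PAGE)
```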

@pdurbin

pdurbin commented Jul 10, 2019

@oxinabox should I help answer questions for pull request #40? I didn't notice it until just now.

@oxinabox
Owner Author

@pdurbin it is kinda stalled, since that GSoC project is over. But feel free to.


3 participants