Open Data Ship List #1483

Closed
14 of 15 tasks
andrew opened this issue Jun 5, 2017 · 11 comments
Comments

@andrew
Contributor

andrew commented Jun 5, 2017

Todo list issue for #609 and https://github.com/librariesio/supporters/issues/9

Related but not required:

  • Improve API documentation
  • Add list of API clients
  • Add license information to footer of the site
  • Add license information to API with commercial prompt to contact support@libraries.io
@andrew
Contributor Author

andrew commented Jun 5, 2017

Tables to be exported:

  • Projects
  • Versions
  • Dependencies
  • Repositories
  • Tags
  • Manifests
  • Repository Dependencies

Tables we won't be exporting as the data is available elsewhere in more reliable forms:

  • Issues
  • Readmes

We won't be exporting any User related data in this first release, namely these tables:

  • Contribution
  • Repository Users
  • Repository Organisations

This was referenced Jun 5, 2017
@andrew andrew changed the title from "Open Data Release" to "Open Data Ship List" Jun 5, 2017
@BenJam
Contributor

BenJam commented Jun 5, 2017

The DOI number seems like it would be managed by Zenodo, no? Worth a check; otherwise CrossRef has the nicest logo of any DOI provider 😬

@andrew andrew self-assigned this Jun 6, 2017
@andrew
Contributor Author

andrew commented Jun 6, 2017

The first version of the export rake tasks is now merged: https://github.com/librariesio/libraries.io/blob/master/lib/tasks/open_data.rake

@andrew
Contributor Author

andrew commented Jun 7, 2017

Some initial stats:

  • projects
    • filesize: 649MB
    • compressed: 147MB
    • rows: 2,607,879
  • versions
    • filesize: 932MB
    • compressed: 136MB
    • rows: 9,244,443
  • dependencies
    • filesize: 228MB
    • compressed: 147MB
    • rows: 49,826,914
  • repositories
    • filesize: 5.7GB
    • compressed: 1.39GB
    • rows: 23,267,322
  • tags
    • filesize: 5.9GB
    • compressed: 1.53GB
    • rows: 39,502,588
  • repository dependencies
    • filesize: 8.75GB
    • compressed: 850MB
    • rows: 91,840,527

@andrew
Contributor Author

andrew commented Jun 8, 2017

Was investigating some invalid rows being generated in the projects CSV file. Fixed by removing newline characters from descriptions.
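
A minimal sketch of that kind of fix, for illustration only: the real export is a Ruby rake task, and the names `clean_description` and `write_rows` are made up here. The idea is to collapse line breaks in a free-text field before the row is written, so each record stays on one physical line.

```python
import csv
import io


def clean_description(text):
    """Collapse every whitespace run, including CR and LF, into a single space."""
    if text is None:
        return ""
    # str.split() with no arguments splits on any whitespace run,
    # so this also normalizes tabs and repeated spaces.
    return " ".join(text.split())


def write_rows(rows):
    """Write (name, description) pairs as CSV and return the resulting text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for name, description in rows:
        writer.writerow([name, clean_description(description)])
    return buffer.getvalue()


output = write_rows([("example", "first line\r\nsecond line")])
```

A conforming CSV parser can read quoted fields that contain newlines, but line-oriented tools (`wc -l`, `split`, naive importers) cannot, which is one reason to strip them at export time.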

@andrew
Contributor Author

andrew commented Jun 8, 2017

A folder containing all 6 csv files gzipped comes in at 4.3GB
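
As a quick sanity check, the per-file compressed sizes from the initial stats can be summed. This is only a sketch: it treats all sizes as megabytes and assumes decimal units (1GB = 1000MB), neither of which is stated in the thread.

```python
# Compressed sizes in MB, copied from the initial stats comment.
# Assumption: decimal units, so 1.39GB is taken as 1390MB.
compressed_mb = {
    "projects": 147,
    "versions": 136,
    "dependencies": 147,
    "repositories": 1390,
    "tags": 1530,
    "repository dependencies": 850,
}

total_gb = sum(compressed_mb.values()) / 1000
print(f"{total_gb:.1f}GB")  # prints 4.2GB
```

The result is close to the 4.3GB quoted for the folder; the files may also have been regenerated between the two comments.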

@andrew
Contributor Author

andrew commented Jun 9, 2017

Plans for data cleanup:

  • check all projects to see if they've been yanked/removed (running now)
  • check repositories with at least one dependency to see if they've been removed (running now)
  • resync as many projects as possible (running now)
  • resync most popular repositories

@andrew
Contributor Author

andrew commented Jun 12, 2017

Thinking about adding back in the primary keys for projects, versions and repositories to make joining the data in the different CSV files easier.
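
To illustrate what the keys would enable, here is a small sketch of joining versions to projects on a project ID. The column names (`ID`, `Project ID`, `Name`, `Number`) and the toy rows are hypothetical; the real export headers may differ.

```python
import csv
import io

# Toy data with made-up column names, standing in for the exported files.
PROJECTS_CSV = "ID,Name\n1,alpha\n2,beta\n"
VERSIONS_CSV = "ID,Project ID,Number\n10,1,0.1.0\n11,1,0.2.0\n12,2,1.0.0\n"


def join_versions_to_projects(projects_file, versions_file):
    """Return (project name, version number) pairs joined on the project key."""
    # Build an in-memory lookup from project primary key to project name.
    projects_by_id = {row["ID"]: row["Name"] for row in csv.DictReader(projects_file)}
    joined = []
    for row in csv.DictReader(versions_file):
        name = projects_by_id.get(row["Project ID"])
        if name is not None:  # skip versions whose project is missing
            joined.append((name, row["Number"]))
    return joined


pairs = join_versions_to_projects(io.StringIO(PROJECTS_CSV), io.StringIO(VERSIONS_CSV))
```

Without a key column, the same join would have to match on something like platform plus name, which is slower and more error-prone.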

@BenJam
Contributor

BenJam commented Jun 13, 2017

I've created data@ and opendata@ btw.

@andrew
Contributor Author

andrew commented Jun 15, 2017

Submitted a PR to awesome-public-datasets: awesomedata/awesome-public-datasets#300

@andrew
Contributor Author

andrew commented Jun 15, 2017

🚢


2 participants