This repository has been archived by the owner on Apr 16, 2020. It is now read-only.

GitHub #22

Open
rht opened this issue Sep 17, 2015 · 25 comments

Comments

@rht

rht commented Sep 17, 2015

via https://github.com/joeyh/github-backup

@harlantwood

+1.

Maybe eventually everything else open source on github too.

@davidar
Collaborator

davidar commented Sep 18, 2015

👍

@cryptix you're in charge of git-on-ipfs :)

Maybe eventually everything else open source on github too.

Haha, not asking much :). In the meantime, it would be cool if ipfs.io offered a webhook that automatically uploaded github repos (including github pages) to IPFS. Bonus points for providing dnslinks so that I can visit

CC: @jbenet @lgierth

davidar changed the title from github.com/ipfs to GitHub on Sep 18, 2015
@cryptix

cryptix commented Sep 18, 2015

Mirroring the repos is easy IMHO. Does GitHub-backup also copy the issues and PR discussions? A lot of insight gets lost when these become unavailable.
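
A minimal sketch of the repository half of this, assuming `git` and the `ipfs` CLI are installed and a local IPFS repo is initialised. The helper names and the working-directory layout are hypothetical, not something proposed in this thread, and the sketch covers only the git data, not issues or PR discussions.

```python
import subprocess
from pathlib import Path


def mirror_commands(owner: str, repo: str, workdir: str) -> list[list[str]]:
    """Return the commands that mirror one GitHub repo and add it to IPFS."""
    target = str(Path(workdir) / f"{repo}.git")
    url = f"https://github.com/{owner}/{repo}.git"
    return [
        # Bare mirror: all branches and tags, no working tree.
        ["git", "clone", "--mirror", url, target],
        # Write info/refs so the bare repo can be fetched over plain HTTP.
        ["git", "-C", target, "update-server-info"],
        # Recursively add the directory; -Q prints only the root hash.
        ["ipfs", "add", "-r", "-Q", target],
    ]


def mirror_to_ipfs(owner: str, repo: str, workdir: str) -> str:
    """Run the commands in order and return the output of the last one."""
    output = ""
    for cmd in mirror_commands(owner, repo, workdir):
        result = subprocess.run(cmd, check=True, capture_output=True, text=True)
        output = result.stdout.strip()
    return output
```

Calling `mirror_to_ipfs("ipfs", "go-ipfs", "/tmp/mirror")` would clone the repo and return whatever root hash `ipfs add` prints. Building the command list separately keeps the steps inspectable without touching the network.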

@harlantwood

Looks like it... the readme says:

It backs up everything GitHub publishes about the repository, including branches, tags, other forks, issues, comments, wikis, milestones, pull requests, watchers, and stars.

@jbenet
Contributor

jbenet commented Sep 19, 2015

yeah would be nice

@rht
Author

rht commented Sep 23, 2015

There is a transitional state here: http://ghtorrent.org/ (https://github.com/gousiosg/github-mirror).
Today the archive totals ~6.5 TB, harvested with ~29 donated API keys.
One step here could be s/torrent/ipfs/.

Though I still wish for a script to archive github.com/ipfs for offline viewing (one that other users/orgs can reuse when needed).

@davidar
Collaborator

davidar commented Sep 23, 2015

@rht wow, that's really cool 👍

@jbenet
Contributor

jbenet commented Sep 23, 2015

Putting ghtorrent on IPFS would be great. one thing also is that if the mongodb-specific stuff is dropped in favor of just the raw data on ipfs, we can make tools that query + render the data directly on the web! think libre issues.

cc @gousiosg for a quick overview of ipfs see #24 (comment)

@gousiosg

Hi, with GHTorrent you can retrieve the full state of any repo on GitHub using the ght-data-retrieval command. It stores Github API responses (actual JSON objects) to a configurable backend (currently, MongoDB and no-op are supported) and metadata in MySQL (actually any DB supported by Ruby's Sequel gem). Let me know if you need any help running it.
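
To illustrate the idea of keeping raw API responses as plain JSON rather than in MongoDB, here is a rough sketch. It is not GHTorrent's code or storage format: the function names and the one-file-per-URL layout are assumptions made for illustration. Only the issues endpoint of the GitHub REST API is shown, and no request is actually sent.

```python
import hashlib
import json
from pathlib import Path
from urllib.parse import urlencode

API_ROOT = "https://api.github.com"


def issues_url(owner: str, repo: str, page: int = 1) -> str:
    """Build the REST URL for one page of a repo's issues (open and closed)."""
    query = urlencode({"state": "all", "per_page": 100, "page": page})
    return f"{API_ROOT}/repos/{owner}/{repo}/issues?{query}"


def store_response(root: Path, url: str, payload: object) -> Path:
    """Write one decoded JSON response to a file named after its URL."""
    # Hypothetical layout: the SHA-256 of the URL is the file name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".json"
    path = root / name
    path.parent.mkdir(parents=True, exist_ok=True)
    # Keep the URL next to the body so each file is self-describing.
    path.write_text(json.dumps({"url": url, "body": payload}, sort_keys=True))
    return path
```

A directory written this way is just files, so it could be added to IPFS with `ipfs add -r`. Note that the issues endpoint also returns pull requests, since GitHub models every pull request as an issue.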

@rht
Author

rht commented Sep 25, 2015

Hi @gousiosg I tried with ght-retrieve-repo ipfs go-ipfs https://ipfs.io/ipfs/QmaEffVPYqZboKMLvbnsxii28cdjNZgaZQpLt8DQbsQWrC, but so far the only non-empty tables are organization_members, project_languages, projects, schema_info, users.

Also, the default format given for the metadata (sqlite3) doesn't work; it fails with the error "joined datasets cannot be modified" (sqlite3 doesn't have this feature).
Had a mongodb set up as in https://github.com/gousiosg/github-mirror/wiki/Setting-up-a-mirroring-cluster.
Did I misconfigure anything?

@rht
Author

rht commented Sep 25, 2015

And since github-mirror is written in ruby, this issue (entire github mirror) depends on the addition of 'ipfs add' in https://github.com/Fryie/ipfs-ruby @Fryie.

While for one user/org, github-backup should do.

@Fryie

Fryie commented Sep 25, 2015

So sorry for not making faster progress, @rht and others, i'm on it, but have exams currently and a lot of work. :(

In the meantime you can also take a look at https://github.com/hjoest/ipfs4r by @hjoest, which apparently was started independently :) I think it already works with add.

@gousiosg

@rht I think you are using master instead of a released version, and perhaps you forgot to specify your user name and password to raise the API rate limit.

Here is a session of GHTorrent's latest release running on itself:
https://gist.github.com/gousiosg/58ce4f0d198064fc9a12

This is the slowest version you can use: if you configure MongoDB for caching it will be 10x faster.
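
On the rate-limit point: the GitHub REST API allows 60 unauthenticated requests per hour and 5,000 per hour with credentials, and reports the remaining quota in the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers. The helper below is a hypothetical sketch of how a crawler might decide how long to pause. It is not taken from GHTorrent.

```python
import time
from typing import Mapping, Optional


def seconds_until_reset(headers: Mapping[str, str],
                        now: Optional[float] = None) -> float:
    """Return how long to pause before the next request, or 0 if quota remains.

    Assumes `headers` uses exactly these header names; a real HTTP client
    may need a case-insensitive lookup.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    # X-RateLimit-Reset is the UTC epoch second at which the quota refills.
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset_at - now)
```

A crawl loop could call `time.sleep(seconds_until_reset(response_headers))` after each response.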

@rht
Author

rht commented Sep 26, 2015

(Thanks so much for the directed help)

I was using the version in gem (0.10, released last year) before you updated it to 0.11.
Yes I did with the name/password, where I used a pki instead.
No longer had the sqlite error in 0.11.

Currently timing runs with and without MongoDB.
Wouldn't it be much faster if the commits are retrieved through git instead of the api? (though the former only contains the clean "history")

@rht
Author

rht commented Sep 26, 2015

@Fryie saw it. It does have add --recursive as well.

@gousiosg

@rht The commit API has more information per commit (e.g. linking the committer to his/her account on GitHub) and you do not need to clone the repo.

@rht
Author

rht commented Sep 29, 2015

ic, yes it is cloning the github ecosystem, not just the repos.
Though I wish there existed a git repo for each of the issues, PRs, ... (github does it for the wiki, why not extend it to the rest?).

Been looking, but haven't found any git-backed database (e.g. git checkout -c 'select * from commits limit 10' -b out remote/query).

update here:
My local computer is so slow (1.5s for each fetch, disconnects, etc.) that I ran it on one of the ipfs servers (pollux) instead.
Tried 3 times; it errored midway after 2.5 hours, usually at stargazers.
TODO: this will be fixed if I can run mongod without root access.

@davidar
Collaborator

davidar commented Oct 1, 2015

this will be fixed if I can run mongod without root access.

@rht you can ask @lgierth to install stuff: ipfs/infra#70

@ghost

ghost commented Oct 1, 2015

Maybe run it in docker containers?

@rht
Author

rht commented Oct 1, 2015

(already half-expecting that answer...)

@rht
Author

rht commented Oct 23, 2015

Just fixed the mongo issue (the mongod simply needs to be run with --dbpath).
Had all the go-ipfs github content mirrored.

go-ipfs git repo: 38 MB
go-ipfs github metadata: 3.7 MB
go-ipfs github db (in mongo): 448 MB

Dumping the last one into bson requires a mongoexport/mongodump binary.
I tried the binary from https://www.mongodb.org/downloads, but it didn't work.
Maybe the admin should install mongodb-clients?
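
For reference, a sketch of the two steps described above: starting `mongod` against a user-owned data directory (the default `/data/db` usually needs root to create) and dumping a database with `mongodump`. The wrapper functions, the directory names, and the database name `github` are assumptions for illustration; only the `--dbpath`, `--port`, `--db`, and `--out` flags come from the MongoDB tools themselves.

```python
import subprocess
from pathlib import Path


def mongod_command(dbpath: str, port: int = 27017) -> list[str]:
    """Command for a mongod that stores its data in a user-owned directory."""
    return ["mongod", "--dbpath", dbpath, "--port", str(port)]


def mongodump_command(database: str, outdir: str, port: int = 27017) -> list[str]:
    """Command that dumps one database as BSON files under outdir."""
    return ["mongodump", "--port", str(port), "--db", database, "--out", outdir]


def start_mongod(dbpath: str, port: int = 27017) -> subprocess.Popen:
    """Create the data directory if needed and start mongod in the background."""
    Path(dbpath).mkdir(parents=True, exist_ok=True)
    return subprocess.Popen(mongod_command(dbpath, port))
```

With a server started via `start_mongod("ghtorrent-db")`, running `subprocess.run(mongodump_command("github", "dump"), check=True)` would write the BSON files under `dump/`, which could then be added to IPFS like any other directory.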

@jbenet
Contributor

jbenet commented Oct 27, 2015

@rht 👏 👏 👏 👏

@davidar
Collaborator

davidar commented Jul 23, 2016

@qianlitayunhai

Hi, how can I get the desired research data from GitHub? Thanks!

@gousiosg

Does this help? http://ghtorrent.org/downloads.html
