
March data dump #881

Closed
klop opened this Issue Feb 16, 2016 · 130 comments


klop commented Feb 16, 2016

Is there any way to get a dump of 6.86 matches only? All I could find were the 500k Dec 2015 and 3.5M dumps.

albertcui (Member) commented Feb 16, 2016

We'll probably do another dump in March.

howardchung (Member) commented Feb 16, 2016

Maybe with skill data this time!

howardchung added this to the 2016-4 milestone Feb 16, 2016

klop commented Feb 16, 2016

With skill data would be awesome.

howardchung added the data label Feb 19, 2016

howardchung changed the title from Data dump by patch to March data dump Feb 19, 2016

howardchung modified the milestones: 2016-4, 2016-3 Feb 25, 2016

howardchung (Member) commented Mar 1, 2016

Do we want to make this a quarterly or semiannual thing?

albertcui (Member) commented Mar 27, 2016

Pushing back because we're doing the import right now.

albertcui modified the milestones: 2016-4, 2016-3 Mar 27, 2016

onelivesleft (Contributor) commented Mar 30, 2016

Posting to say this would be good quarterly (unless you get the BigQuery thing updating live). Will you post a blog post when the next dump happens?

howardchung (Member) commented Mar 31, 2016

If it were up to me I'd probably do semiannual, but if @albertcui wants to do it quarterly I won't say no (he's the one who has to export/upload the data anyway).

Regarding future dumps:
I think at some point after we complete the import we will do a massive pg_dump (this would produce a PostgreSQL-specific dump) with every match ever played (~1.2 billion matches, mostly unparsed). This will also help us do a data migration if we need to move our match data somewhere else (possibly because Google gets too expensive). Then we can do periodic "addendum" dumps to keep the exported records up to date. It is up to @albertcui whether he wants to continue doing the more generic JSON dumps as well.

We could possibly also get away with not keeping snapshots in Google (that would save nearly $100 a month).

ETA for import is 10-15 days.
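
For later readers, a minimal sketch of what one of those "addendum" exports could look like in psql, assuming the matches table has a start_time epoch column and reusing the COPY ... TO PROGRAM pattern that appears later in this thread. Table/column names and the cutoff are illustrative, not the actual export script.

```sql
-- Hypothetical incremental ("addendum") export: only matches that started
-- after the previous dump's cutoff, compressed on the fly.
-- Table/column names and the cutoff value are assumptions for illustration.
COPY (
    SELECT *
    FROM matches
    WHERE start_time > 1459468800   -- epoch cutoff of the previous dump
) TO PROGRAM 'gzip > /var/lib/postgresql/data/pgdata/matches_addendum.gz'
  CSV HEADER;
```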

onelivesleft (Contributor) commented Mar 31, 2016

That'd be great: I'd love to be able to query a db about matches (like the official API, but not limited to the last x hundred games). If I have to download a massive file first, that's not really a problem.

I take it opening up an API of your own would have too high a bandwidth overhead?

howardchung (Member) commented Mar 31, 2016

Yeah, APIs are expensive to operate.

mikkelam commented Apr 7, 2016

I'm very interested in using the MMR data for machine learning. Is it included in this data dump? I suspect one can estimate a player's MMR with very high accuracy.

howardchung (Member) commented Apr 7, 2016

@albertcui are you planning to dump player_ratings? Or perhaps export a "snapshot" of current MMR data?
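
A rough sketch of what such a "current MMR" snapshot export could look like, assuming a player_ratings table keyed by (account_id, time) with a solo_competitive_rank column; these column names are assumptions for illustration, not taken from the actual schema.

```sql
-- Hypothetical snapshot export: latest rating row per player.
-- Schema details (columns, types) are assumed for illustration.
COPY (
    SELECT DISTINCT ON (account_id)
           account_id,
           solo_competitive_rank,
           time
    FROM player_ratings
    ORDER BY account_id, time DESC
) TO PROGRAM 'gzip > /var/lib/postgresql/data/pgdata/mmr_snapshot.gz'
  CSV HEADER;
```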

paulodfreitas commented Apr 12, 2016

I think it would be nice if the dumps were somewhat synchronized with the Majors. That way they would be released at known intervals and roughly line up with the big updates.

howardchung (Member) commented Apr 19, 2016

Import is done. Been talking with @albertcui about doing a full dump this time (with every match ever played).

We'd dump matches, player_matches, and match_skill as CSV. Users would have to join the data themselves.
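
For anyone working with those CSVs, a minimal sketch of the join after loading them back into a database. match_id is the shared key; the other column names are assumed examples, so match them to the CSV headers.

```sql
-- Illustrative join across the three dumped tables.
-- Columns other than match_id are assumptions for illustration.
SELECT m.match_id,
       m.duration,
       m.radiant_win,
       ms.skill,            -- skill bracket id (assumed encoding)
       pm.account_id,
       pm.hero_id
FROM matches m
JOIN player_matches pm ON pm.match_id = m.match_id
LEFT JOIN match_skill ms ON ms.match_id = m.match_id   -- skill data covers only a subset of matches
LIMIT 10;
```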

onelivesleft (Contributor) commented Apr 19, 2016

Sounds good

howardchung (Member) commented Apr 20, 2016

@albertcui I put sample queries in the OP. You may want to try them locally on your devbox first to make sure they work properly.

howardchung modified the milestones: 2016-5, 2016-4 Apr 24, 2016

albertcui (Member) commented Apr 27, 2016

yasp=# COPY matches TO PROGRAM 'gzip > /var/lib/postgresql/data/pgdata/matches.gz' CSV HEADER;
COPY 1191768403
yasp=# COPY match_skill TO PROGRAM 'gzip > /var/lib/postgresql/data/pgdata/match_skill.gz' CSV HEADER;
COPY 132447335

matches.gz is 146 GB. Currently exporting player_matches.
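
For anyone restoring these files later, a rough sketch of loading one of the gzipped CSVs into a local Postgres. The column list is illustrative only (the real CSV header has many more columns), and COPY FROM PROGRAM requires superuser.

```sql
-- Illustrative restore of a gzipped CSV dump.
-- The table definition must match the CSV header column-for-column;
-- the columns shown here are assumed examples, not the full schema.
CREATE TABLE matches (
    match_id    bigint PRIMARY KEY,
    start_time  integer,
    duration    integer,
    radiant_win boolean
    -- ...remaining columns from the CSV header
);

COPY matches FROM PROGRAM 'gunzip -c /path/to/matches.gz' CSV HEADER;
```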

howardchung (Member) commented Dec 25, 2016

So, doing some quick searching on the subject, it looks like the torrent file size is based on the number of pieces the data is split into. If we want a smaller torrent, we probably need to pass an option specifying the number of pieces. The two commands probably have different defaults.

I assume you have the seeding working as described here? https://lists.freebsd.org/pipermail/freebsd-questions/2009-June/201753.html

Perhaps try another tracker -- wasn't there a public one that you were going to try?

waprin commented Jan 4, 2017

Just got too frustrated and wanted to take a break from this, especially since I was traveling. Getting back home tomorrow; going to build a new PC with a bigger disk, then I'll loop back around on this and learn more about torrents sometime this month. I might even try to host my own torrent tracker, which might be a good learning experience.

howardchung (Member) commented Jan 14, 2017

If you just put the blobs on Google Cloud Storage and shared the download links, would you be able to pay for the download bandwidth/storage on your personal account? Or should we wait until we get a torrent working before making a blog post/public announcement?

rossengeorgiev commented Jan 26, 2017

I know you guys haven't finished dealing with the original dump, but is there any chance for a fresh one? Like the last month or something. It would be really useful given the dramatic changes in 7.00.

howardchung (Member) commented Jan 26, 2017

Unfortunately, the old code we used for dumps doesn't work anymore since the move to Cassandra. No telling when we'll be able to get a new migration script working.

I think @waprin wants to eventually get something set up where match data is streamed directly to BigQuery. If we get that working, it would probably be the best place to obtain fresh data dumps.

rossengeorgiev commented Jan 26, 2017

I'm unfamiliar with Cassandra, but there seems to be a CAPTURE command that would export the results of queries. I couldn't find any details about performance. Maybe that could do it?

I really like the idea of streaming data to BigQuery, but it doesn't seem to be happening any time soon. The original issue was about just a slice of the data. I'm looking for the same thing a year later and it seems to be even further from happening.
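
For reference, a rough cqlsh sketch of the two export options being referred to. The keyspace/table names are assumptions; CAPTURE writes formatted query output to a file, while cqlsh's COPY ... TO is the purpose-built CSV export. Neither is guaranteed to perform well at this scale.

```sql
-- cqlsh (Cassandra), not psql; keyspace/table names are assumed.
CAPTURE '/tmp/matches_sample.txt';
SELECT * FROM yasp.matches LIMIT 1000;   -- rows go to the capture file above
CAPTURE OFF;

-- Dedicated CSV export path in cqlsh:
COPY yasp.matches TO '/tmp/matches.csv' WITH HEADER = TRUE;
```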

howardchung (Member) commented Jan 26, 2017

waprin commented Jan 26, 2017

I just ordered a new HDD so I can seed from home as soon as it arrives. Torrent does seem like the best option.

Would like to stream matches directly from the API into BigQuery, so I will look into that.

rossengeorgiev commented Feb 11, 2017

I've recently scraped all Dota matches for January and I'm making them available as a torrent. It's 33 million matches, not counting the Dark Moon ones.

http://static.rgp.io/dota2_matches_jan2017.torrent

howardchung modified the milestones: Backlog, 2016-12 Feb 15, 2017

bippum (Contributor) commented Mar 1, 2017

Downloaded and created torrent files from #881 (comment).

Files:

bippum (Contributor) commented Mar 7, 2017

@albertcui, I believe my links should be good to go as long as you upload the files to academictorrents. It should resolve this error in my client:

[image]

jvanhees commented Mar 14, 2017

Is there anyone seeding the files from @bippum, and is there some sample data available? I've got plenty of space and a home server that can seed 24/7 at 200 Mbit/s, but I first need to download the data :). I will leave the torrents provided above running for now, hoping that someone can share them. If there are other torrents available, please let me know.

howardchung (Member) commented Mar 14, 2017

The OP has small sample datasets.

@albertcui can you please upload the torrents to Academic Torrents?

albertcui (Member) commented Mar 18, 2017

I've uploaded matches + match_skill. It won't let me upload player_matches:

"Sorry, the piece length is too small. The torrent file must be less than 2MB. Increase your piece length to lower the file size" :(

For reference, all the torrents are in this collection: http://academictorrents.com/collection/opendota-formerly-yasp-data-dumps
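
That error follows from how .torrent metadata works: a v1 .torrent stores a 20-byte SHA-1 hash for every piece, so a larger piece length means fewer pieces and a smaller .torrent file. A back-of-the-envelope check, assuming a payload of roughly 150 GiB (in line with the file sizes mentioned in this thread):

```sql
-- Rough arithmetic for the .torrent size at two piece lengths,
-- assuming a ~150 GiB payload and 20 bytes of hash per piece.
SELECT round(150 * 1024.0 * 1024 * 1024 / (512 * 1024) * 20 / (1024 * 1024), 2) AS torrent_mib_512kib_pieces,
       round(150 * 1024.0 * 1024 * 1024 / (4 * 1024 * 1024) * 20 / (1024 * 1024), 2) AS torrent_mib_4mib_pieces;
-- ~5.86 MiB with 512 KiB pieces vs ~0.73 MiB with 4 MiB pieces,
-- which is why increasing the piece length gets the .torrent under the 2 MB limit.
```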

bippum (Contributor) commented Mar 18, 2017

I'll attempt to create the torrent again within the day.

albertcui (Member) commented Mar 18, 2017

Thanks, sorry for the delay. Did uploading the other ones fix the error?

bippum (Contributor) commented Mar 18, 2017

Yes it did.

jvanhees commented Mar 20, 2017

Great, thanks guys, I'm currently downloading the files and will continue to seed them :). Good work!

bippum (Contributor) commented Mar 20, 2017

Glad to hear you are able to download them OK. I switched torrent clients (from Transmission to Deluge) after having difficulty getting Transmission to do anything, let alone upload. Now I wake up to see that match_skill and matches are seeding! I updated the player_matches link with a 4 MiB piece torrent; @albertcui, could you upload that one to academictorrents? Thanks.

If this one doesn't work I can try creating it again with 8 MiB pieces.

howardchung (Member) commented Mar 20, 2017

Awesome, once they're all up we can write a release blog post and then we can finally close this! :)

albertcui (Member) commented Mar 20, 2017

howardchung (Member) commented Mar 27, 2017

viniciusmr commented Apr 5, 2018

Hi there guys!
Aren't the files available anywhere else?
(Or maybe someone who has the file wants to / can join the swarm?)

I'm currently downloading "OpenDota - All Matches from March 2016 - Matches"
(matches.gz, 155.94 GB).
However, the torrent availability is less than 1 (0.781), which means that even if I leave this downloading forever I won't be able to finish it, because there are pieces missing from the swarm.
(And actually there is only one seeder =/ )

bippum (Contributor) commented Apr 5, 2018

Hi, I'm currently seeding all 3 files with 100% completion, so it should complete eventually.

[image]

pranavchintala commented Sep 19, 2018

Hello, would anyone be willing to seed player_matches.gz? Haven't been able to find a seeder for a week now and could really use this data for a project! Thanks in advance!

bippum (Contributor) commented Sep 19, 2018

The copy I had got corrupted during a transfer between hard drives. I no longer have the original files from the Amazon Cloud Drive location, and they aren't obtainable anymore either. Sorry for the inconvenience.

pranavchintala commented Sep 19, 2018

Alright, no problem, thanks for the response!
Would anybody else have even a subset of this data available? Perhaps something larger than the 4 GB samples above would do the trick.
