Dota 2 YASP dataset loaded in BigQuery #924
Hey, I loaded the YASP 452 GB file into BigQuery - it will allow you to run fast queries over the raw dataset.
I did something similar for Wikidata, see more here: https://lists.wikimedia.org/pipermail/wikidata/2016-March/008414.html
It has 3.5 million rows, and I can do quick calculations like:
(the average game takes 2507.74 seconds)
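A minimal sketch of the kind of query meant here, assuming the duration has already been extracted into a `duration` column; the table name is a placeholder, not the actual dataset path:

```sql
-- Average match duration in seconds (legacy SQL).
-- [mydataset.yasp_matches] is a placeholder table name.
SELECT AVG(duration) AS avg_duration_seconds
FROM [mydataset.yasp_matches]
```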
Or the average duration for a game, depending on the number of human players:
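Roughly, assuming `human_players` was also extracted as a column (names are placeholders):

```sql
-- Average duration bucketed by number of human players (legacy SQL).
SELECT
  human_players,
  AVG(duration) AS avg_duration_seconds,
  COUNT(*)      AS games
FROM [mydataset.yasp_matches]
GROUP BY human_players
ORDER BY human_players
```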
These queries consume almost nothing of the monthly free quota; things only get expensive if you try to parse the big JSON object at query time. It would be much better to parse those columns out, as I did for Wikidata, but that's an exercise for the future.
How I extracted the columns:
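The extraction step would look something like the following, a sketch assuming the raw dump sits in a single `json` string column (column and table names are placeholders); the result is written to a destination table so later queries never touch the 452 GB column:

```sql
-- Pull top-level fields out of the raw JSON into real columns (legacy SQL).
-- Run with "Destination Table" set, so the output is materialized.
SELECT
  INTEGER(JSON_EXTRACT_SCALAR(json, '$.match_id'))      AS match_id,
  INTEGER(JSON_EXTRACT_SCALAR(json, '$.duration'))      AS duration,
  INTEGER(JSON_EXTRACT_SCALAR(json, '$.human_players')) AS human_players
FROM [mydataset.yasp_raw]
```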
Thanks Howard! I would only use it for analytics right now. It's good for people who want to run arbitrary queries without having to download and set up the whole thing first.
What would be super interesting: Could we set up a pipeline where you stream new records in realtime to BigQuery? BigQuery can handle up to 100,000 streamed rows per second per table, so that wouldn't be a problem.
Re: License, I added this description to the table. Does it look fine in terms of compliance?
(maybe it would be great to generate the production charts too, but I don't know enough about YASP to make that call)
Sure, we can probably direct interested users to it with a link from the FAQ or something.
We may be able to add new records as we get them, but it can be a little tricky since the data comes in two distinct steps. We insert a record from the API and then update the record with new columns when we parse the match. It looks like the fields you've extracted so far are only the basic API data, so perhaps we can just start with that.
@albertcui can answer the question about license.
Are you already running YASP on Google Compute Engine? Great! Then there are no egress costs to worry about between GCE and BigQuery :).
Streaming results: Yeah, for BigQuery I would only stream data once the games are completed, but there might also be a good strategy for streaming partial results.
Let's see who is interested in consuming data this way, and we can plan the rest.
Hi @fhoffa, how does one query the dataset on the free quota? I ran 2 very simple SELECT queries, no WHERE clause, no aggregate functions, just
I'm confused. Why does
Some good news: The free quota replenishes on an ongoing basis, so you don't need to wait until next month to query again - just wait a couple of hours.
On how BigQuery charges per query, I wrote a longer answer here: http://stackoverflow.com/a/22001277/132438
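The gist of that answer: BigQuery bills a query by the total size of the columns it references, not by the number of rows returned or matched. A hypothetical illustration against this dataset (placeholder table name):

```sql
-- Scans only the small 'duration' column: cheap, even over millions of rows.
SELECT AVG(duration) FROM [mydataset.yasp_matches]

-- Referencing the raw 'json' column scans all 452 GB of it,
-- regardless of any WHERE clause:
-- SELECT COUNT(*) FROM [mydataset.yasp_matches]
-- WHERE json CONTAINS 'radiant_win'
```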
@fhoffa Thanks for the response. I am still confused about the
I've read the Stack Overflow answer, and the part about limiting columns makes sense. However, at the very end there is a GitHub example where you specify a time window, and the query supposedly goes through only a fraction of the data. I would expect that to be the case with
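For what it's worth, the time-window trick in such examples works because the data is sharded into one table per day, so the wildcard function only opens the tables inside the window; it is not a WHERE clause on one big table. A sketch with placeholder table names:

```sql
-- TABLE_DATE_RANGE expands to the daily tables in the window (legacy SQL),
-- so only those tables' bytes are scanned and billed.
SELECT COUNT(*)
FROM TABLE_DATE_RANGE([mydataset.events_],
                      TIMESTAMP('2016-03-01'),
                      TIMESTAMP('2016-03-07'))
```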