Dota 2 YASP dataset loaded in BigQuery #924

Closed
fhoffa opened this Issue Mar 17, 2016 · 14 comments

fhoffa commented Mar 17, 2016

Hey, I loaded the 452 GB YASP file into BigQuery - it will allow you to run fast queries over the raw dataset.

I did something similar for Wikidata, see more here: https://lists.wikimedia.org/pipermail/wikidata/2016-March/008414.html

Current status: I've only parsed the main top-level fields out of the raw JSON rows ingested. Querying those columns will be fast, but at 452 GB, a query that touches the nested features will eat about half of your monthly free BigQuery quota. Things would be much more efficient if I extracted the nested data into a normal table (see Wikidata), but I haven't done that yet. In the meantime you can navigate the JSON in realtime, and write arbitrary JavaScript code over these rows - see the sketch below.
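
For example, a nested field can be pulled out of the raw `item` column at query time. Just a sketch - the '$.players[0].hero_id' path is an assumption about how the raw YASP match JSON is laid out, so adjust it to the real structure:

-- Hero picked in the first player slot, read straight out of the raw JSON.
-- Note: referencing `item` scans the whole raw JSON column (most of the
-- 452 GB), even with LIMIT.
SELECT
  match_id,
  JSON_EXTRACT_SCALAR(item, '$.players[0].hero_id') first_player_hero
FROM [fh-bigquery:public_dump.dota2_yasp_v1]
LIMIT 5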

It has 3.5 million rows, and I can do quick calculations like:

SELECT AVG(duration)
FROM [fh-bigquery:public_dump.dota2_yasp_v1]

2507.74

(the average game takes 2507.74 seconds)

Or the average duration for a game, depending on the number of human players:

SELECT human_players, AVG(duration) avg_duration
FROM [fh-bigquery:public_dump.dota2_yasp_v1] 
GROUP BY 1
ORDER BY 1

http://i.imgur.com/XBAKN9s.png

These queries consume almost nothing of the monthly free quota - things only get big if you try to parse the big JSON object in real time. It would be much better if I parsed those columns out, as I did for Wikidata, but that's an exercise for the future.

How I extracted the columns:

SELECT
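  -- Cast each top-level field of the raw JSON row into its own typed column;
  -- the original JSON is kept in `item` for everything nested, and the
  -- LENGTH filter at the end drops rows too short to hold a real JSON object.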
  INTEGER(JSON_EXTRACT_SCALAR(item, '$.match_id')) match_id,
  INTEGER(JSON_EXTRACT_SCALAR(item, '$.match_seq_num')) match_seq_num,
  (JSON_EXTRACT_SCALAR(item, '$.radiant_win')) radiant_win,
  INTEGER(JSON_EXTRACT_SCALAR(item, '$.start_time')) start_time,
  INTEGER(JSON_EXTRACT_SCALAR(item, '$.duration')) duration,
  INTEGER(JSON_EXTRACT_SCALAR(item, '$.tower_status_radiant')) tower_status_radiant,
  INTEGER(JSON_EXTRACT_SCALAR(item, '$.tower_status_dire')) tower_status_dire,
  INTEGER(JSON_EXTRACT_SCALAR(item, '$.barracks_status_radiant')) barracks_status_radiant,
  INTEGER(JSON_EXTRACT_SCALAR(item, '$.barracks_status_dire')) barracks_status_dire,
  INTEGER(JSON_EXTRACT_SCALAR(item, '$.cluster')) cluster,
  INTEGER(JSON_EXTRACT_SCALAR(item, '$.first_blood_time')) first_blood_time,
  INTEGER(JSON_EXTRACT_SCALAR(item, '$.lobby_type')) lobby_type,
  INTEGER(JSON_EXTRACT_SCALAR(item, '$.human_players')) human_players,
  INTEGER(JSON_EXTRACT_SCALAR(item, '$.leagueid')) leagueid,
  INTEGER(JSON_EXTRACT_SCALAR(item, '$.positive_votes')) positive_votes,
  INTEGER(JSON_EXTRACT_SCALAR(item, '$.negative_votes')) negative_votes,
  INTEGER(JSON_EXTRACT_SCALAR(item, '$.game_mode')) game_mode,
  INTEGER(JSON_EXTRACT_SCALAR(item, '$.engine')) engine,
  JSON_EXTRACT_SCALAR(item, '$.picks_bans') picks_bans,
  INTEGER(JSON_EXTRACT_SCALAR(item, '$.parse_status')) parse_status,
  item
FROM [imports.dota2_yasp_raw] 
WHERE LENGTH(item)>3

-- https://twitter.com/felipehoffa

howardchung added the backend label Mar 17, 2016

howardchung added this to the Backlog milestone Mar 17, 2016

howardchung commented Mar 17, 2016

Interesting. Do you think we could serve production traffic out of it or would it only be useful for analysis?

fhoffa commented Mar 17, 2016

Thanks Howard! I would only use it for analytics right now. It's good for people who want to run arbitrary queries without having to download and set up the whole thing first.

What would be super interesting: could we set up a pipeline where you stream new records to BigQuery in realtime? BigQuery can handle up to 100,000 rows per second into a table, so that wouldn't be a problem.

Re: License, I added this description to the table. Does it look fine in terms of compliance?

http://i.imgur.com/clVTgyX.png
https://bigquery.cloud.google.com/table/fh-bigquery:public_dump.dota2_yasp_v1

(maybe it would be great to generate the production charts too, but I don't know enough about YASP to make that call)

howardchung commented Mar 17, 2016

Sure, we can probably direct interested users to it with a link from the FAQ or something.

We may be able to add new records as we get them, but it can be a little tricky since the data comes in two distinct steps. We insert a record from the API and then update the record with new columns when we parse the match. It looks like the fields you've extracted so far are only the basic API data, so perhaps we can just start with that.

@albertcui can answer the question about license.

howardchung commented Mar 17, 2016

Also re: uploading records, we'd have to think about the bandwidth cost, since I think GCE will count it as egress traffic if it's coming out of our instances into your BigQuery table.

fhoffa commented Mar 17, 2016

Are you already running YASP on Google Compute Engine? Great! Then there are no egress costs to worry about between GCE and BigQuery :).

Streaming results: Yeah, for BigQuery I would only stream data once the games are completed - but there could be a good strategy to also do partial results.
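
One possible pattern for the partial-results side (just a sketch - the insert_time column is hypothetical, nothing like it exists in the table today): stream a row when a match first shows up, stream another when the parse finishes, and since streaming inserts are append-only, keep just the newest row per match at query time:

SELECT match_id, duration, parse_status
FROM (
  SELECT
    match_id, duration, parse_status,
    -- insert_time would be a timestamp added at streaming time (hypothetical)
    ROW_NUMBER() OVER(PARTITION BY match_id ORDER BY insert_time DESC) dedup_rank
  FROM [fh-bigquery:public_dump.dota2_yasp_v1]
)
WHERE dedup_rank = 1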

Let's see who is interested in consuming data this way, and we can plan the rest.

fhoffa commented Mar 17, 2016

(a link from https://yasp.co/blog/33 to this thread would be awesome!)

howardchung commented Mar 17, 2016

Feel free to submit a PR with the link :) Posts are in the _posts directory.

albertcui commented Mar 18, 2016

License looks good. Thanks for the PR. We can promote this more when I finally get around to #881.

howardchung commented Mar 18, 2016

Closing issue since no more action is needed.

howardchung commented Apr 12, 2016

@fhoffa If we were to dump the matches and player_matches tables with pg_dump, would you be able to load them into BigQuery?

fhoffa commented Apr 13, 2016

@howardchung I would love to!

(COPY TO CSV would be better than a pg_dump, though)
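
For reference, a rough sketch of the export on the Postgres side (the output paths are placeholders, and any JSON/array columns may need casting to text first):

-- Server-side export; needs filesystem access on the DB host.
-- From a client machine, psql's \copy meta-command does the same thing.
COPY matches TO '/tmp/matches.csv' WITH (FORMAT csv, HEADER true);
COPY player_matches TO '/tmp/player_matches.csv' WITH (FORMAT csv, HEADER true);

BigQuery can then load the headered CSVs directly, e.g. via Cloud Storage.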

rossengeorgiev commented Apr 19, 2016

Hi @fhoffa, how does one query the dataset on the free quota? I ran two very simple SELECT queries - no WHERE clause, no aggregate functions, just JSON_EXTRACT and LIMIT 5. On the third query I got "Your project exceeded quota for free query bytes scanned". I have not tried to query the full dataset at any point, but the query status did say 500 GB+ scanned.

I'm confused. Why does SELECT * FROM [table] LIMIT 5 need to process 452GB?

fhoffa commented May 3, 2016

Hi Rossen!

Some good news: The free quota replenishes on an ongoing basis, so you don't need to wait until next month to query again - just wait a couple of hours.

On how BigQuery charges per query, I wrote a longer answer here: http://stackoverflow.com/a/22001277/132438

Short answer:

  • There is a free way to do the same as "SELECT * LIMIT 5" - just use the preview feature on the web UI instead (or the equivalent REST call)
  • Things will work much better for this Yasp-dota table in BigQuery when I complete the transition from "raw json" to more standard columns. I need to put more effort into it, and everyone will benefit!
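
To make the per-column pricing concrete (a sketch - the exact byte counts depend on each column's size; the extracted integer columns are tiny, while the raw `item` column accounts for nearly all of the 452 GB):

-- Cheap: only the small, already-extracted columns are read.
SELECT game_mode, COUNT(*) games, AVG(duration) avg_duration
FROM [fh-bigquery:public_dump.dota2_yasp_v1]
GROUP BY 1

-- Expensive: SELECT * touches every column, including the raw `item` JSON,
-- and LIMIT only caps the rows returned, not the bytes scanned.
SELECT *
FROM [fh-bigquery:public_dump.dota2_yasp_v1]
LIMIT 5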

rossengeorgiev commented May 5, 2016

@fhoffa Thanks for the response. I am still confused about the LIMIT 5 query. I know I wrote SELECT * LIMIT 5, but my point was that I was testing whether my query was correct before running it over the whole dataset. It was a more complex query - I was trying to extract specific fields from the JSON data. In an effort to preserve my quota I added LIMIT, thinking it wouldn't go through the whole dataset while I was testing.

I've read the Stack Overflow answer, and the part about limiting columns makes sense. However, at the very end there is a GitHub example where you specify a time window and the query supposedly goes through only a fraction of the data. I would have expected the same with LIMIT 5 - that it would only go through 5 rows. After all, I am only doing a simple SELECT with JSON_EXTRACT.
