**How to Query the Stack Overflow Data (BigQuery Dataset)**

In [1]:
from bq_helper import BigQueryHelper

# Helper bound to the public Stack Overflow dataset; the queries below run through it.
# Package introduction: https://www.kaggle.com/sohier/introduction-to-the-bq-helper-package
stackOverflow = BigQueryHelper(active_project="bigquery-public-data",
                               dataset_name="stackoverflow")

In [2]:
bq_assistant = BigQueryHelper("bigquery-public-data", "stackoverflow")
bq_assistant.list_tables()

['badges',
 'comments',
 'post_history',
 'post_links',
 'posts_answers',
 'posts_moderator_nomination',
 'posts_orphaned_tag_wiki',
 'posts_privilege_wiki',
 'posts_questions',
 'posts_tag_wiki',
 'posts_tag_wiki_excerpt',
 'posts_wiki_placeholder',
 'stackoverflow_posts',
 'tags',
 'users',
 'votes']

In [3]:
bq_assistant.head("comments", num_rows=20)

Unnamed: 0,id,text,creation_date,post_id,user_id,user_display_name,score
0,50232955,`json.keys()` will give you fileds of your class.,2015-06-29 06:52:25.147000+00:00,30938884,2823164,,0
1,5733229,"MiffTheFox: I'm talking about autoloading, not...",2011-02-25 02:51:17.527000+00:00,5112964,388916,,1
2,59596181,Microsoft.mshtml.dll is an ancient assembly an...,2016-03-13 13:52:45.057000+00:00,35969898,17034,,0
3,72357806,"Edited my answer, missed the actual problem.",2017-03-05 21:47:57.003000+00:00,42613986,3476154,,0
4,67742655,Glad to help. Remember that if you found diffe...,2016-10-25 12:48:47.883000+00:00,40239896,4323648,,0
5,57931161,"Hadn't thought of that. That means, Windows, h...",2016-01-30 17:59:37.473000+00:00,35104966,4957620,,0
6,69267765,`mapData = [serverdata];` also seems wrong,2016-12-08 01:02:05.210000+00:00,41029970,14104,,0
7,55666198,i fixed it myself. i appreciate your help. Tha...,2015-11-27 09:29:23.887000+00:00,33952075,3046937,,0
8,10545382,Welcome to SO. Your question would profit grea...,2011-12-15 08:54:23.533000+00:00,8517384,847601,,2
9,78972208,@scaisEdge I think Stack Overflow's auto-tagge...,2017-09-01 17:56:09.263000+00:00,46005434,87189,,1


In [4]:
bq_assistant.table_schema("comments")

[SchemaField('id', 'INTEGER', 'REQUIRED', None, ()),
 SchemaField('text', 'STRING', 'NULLABLE', None, ()),
 SchemaField('creation_date', 'TIMESTAMP', 'NULLABLE', None, ()),
 SchemaField('post_id', 'INTEGER', 'NULLABLE', None, ()),
 SchemaField('user_id', 'INTEGER', 'NULLABLE', None, ()),
 SchemaField('user_display_name', 'STRING', 'NULLABLE', None, ()),
 SchemaField('score', 'INTEGER', 'NULLABLE', None, ())]

What percentage of questions have been answered, year by year?


In [5]:
query1 = """SELECT
  EXTRACT(YEAR FROM creation_date) AS Year,
  COUNT(*) AS Number_of_Questions,
  ROUND(100 * SUM(IF(answer_count > 0, 1, 0)) / COUNT(*), 1) AS Percent_Questions_with_Answers
FROM
  `bigquery-public-data.stackoverflow.posts_questions`
GROUP BY
  Year
HAVING
  Year > 2008 AND Year < 2016
ORDER BY
  Year;
        """
response1 = stackOverflow.query_to_pandas_safe(query1)
response1.head(10)

Unnamed: 0,Year,Number_of_Questions,Percent_Questions_with_Answers
0,2009,344099,99.7
1,2010,694733,99.1
2,2011,1201116,97.1
3,2012,1646361,94.5
4,2013,2061829,91.5
5,2014,2166032,88.3
6,2015,2221374,86.1

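The percentage column comes from a conditional aggregation: count the rows where `answer_count > 0` and divide by the total per year. The sketch below reproduces that pattern on a few made-up rows using Python's built-in `sqlite3`, so it can be run without BigQuery access. The toy rows are hypothetical, and the SQL is SQLite dialect (`strftime` and `CASE` stand in for BigQuery's `EXTRACT` and `IF`), so treat it as an illustration of the idea rather than the query itself.

```python
import sqlite3

# Hypothetical toy rows: (creation_date, answer_count). Not real data.
rows = [
    ("2009-03-01", 2),
    ("2009-07-15", 0),
    ("2010-01-20", 1),
    ("2010-05-05", 3),
    ("2010-11-30", 0),
    ("2010-12-25", 1),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts_questions (creation_date TEXT, answer_count INTEGER)")
conn.executemany("INSERT INTO posts_questions VALUES (?, ?)", rows)

# SQLite dialect: strftime and CASE replace BigQuery's EXTRACT and IF.
toy_query = """
SELECT
  CAST(strftime('%Y', creation_date) AS INTEGER) AS Year,
  COUNT(*) AS Number_of_Questions,
  ROUND(100.0 * SUM(CASE WHEN answer_count > 0 THEN 1 ELSE 0 END) / COUNT(*), 1)
    AS Percent_Questions_with_Answers
FROM posts_questions
GROUP BY Year
ORDER BY Year
"""
toy_result = conn.execute(toy_query).fetchall()
conn.close()

print(toy_result)  # [(2009, 2, 50.0), (2010, 4, 75.0)]
```

In the toy data, one of the two 2009 questions and three of the four 2010 questions have at least one answer, which gives 50.0 and 75.0.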

What is the reputation and badge count of users across different tenures on Stack Overflow?


In [6]:
query2 = """SELECT User_Tenure,
       COUNT(1) AS Num_Users,
       ROUND(AVG(reputation)) AS Avg_Reputation,
       ROUND(AVG(num_badges)) AS Avg_Num_Badges
FROM (
  SELECT users.id AS user,
         ROUND(TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), ANY_VALUE(users.creation_date), DAY)/365) AS user_tenure,
         ANY_VALUE(users.reputation) AS reputation,
         SUM(IF(badges.user_id IS NULL, 0, 1)) AS num_badges
  FROM `bigquery-public-data.stackoverflow.users` users
  LEFT JOIN `bigquery-public-data.stackoverflow.badges` badges
  ON users.id = badges.user_id
  GROUP BY user
)
GROUP BY User_Tenure
ORDER BY User_Tenure;
        """
response2 = stackOverflow.query_to_pandas_safe(query2)
response2.head(10)

Unnamed: 0,User_Tenure,Num_Users,Avg_Reputation,Avg_Num_Badges
0,0.0,632793,3.0,1.0
1,1.0,1712249,7.0,1.0
2,2.0,1511557,16.0,1.0
3,3.0,1239810,31.0,2.0
4,4.0,1178283,49.0,2.0
5,5.0,1067822,90.0,3.0
6,6.0,592135,229.0,7.0
7,7.0,335525,475.0,11.0
8,8.0,181367,865.0,16.0
9,9.0,56772,2979.0,38.0

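When reading the `User_Tenure` column, it helps to remember that the query rounds `days / 365` to the nearest whole year rather than truncating. A small sketch of that bucketing follows. The function name `tenure_bucket` is made up for illustration, and it assumes BigQuery's `ROUND` behavior of rounding halves away from zero.

```python
import math


def tenure_bucket(days_since_signup: int) -> int:
    """Mirror ROUND(days / 365) for non-negative day counts.

    floor(x + 0.5) rounds halves up, which matches rounding away from
    zero when x is non-negative.
    """
    return math.floor(days_since_signup / 365 + 0.5)


# Show where the bucket boundaries fall.
for days in (0, 182, 183, 365, 547, 548):
    print(days, "->", tenure_bucket(days))
```

Under this rounding, bucket 0 covers only accounts younger than roughly half a year, while bucket 1 covers roughly 0.5 to 1.5 years. That narrower window is one likely reason the 0 row has fewer users than the rows just after it. The highest bucket is probably also partial, since it can only contain accounts created near the site's launch.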

What are 10 of the “easier” gold badges to earn?


In [7]:
query3 = """SELECT badge_name AS First_Gold_Badge,
       COUNT(1) AS Num_Users,
       ROUND(AVG(tenure_in_days)) AS Avg_Num_Days
FROM
(
  SELECT
    badges.user_id AS user_id,
    badges.name AS badge_name,
    TIMESTAMP_DIFF(badges.date, users.creation_date, DAY) AS tenure_in_days,
    ROW_NUMBER() OVER (PARTITION BY badges.user_id
                       ORDER BY badges.date) AS row_number
  FROM
    `bigquery-public-data.stackoverflow.badges` badges
  JOIN
    `bigquery-public-data.stackoverflow.users` users
  ON badges.user_id = users.id
  WHERE badges.class = 1
)
WHERE row_number = 1
GROUP BY First_Gold_Badge
ORDER BY Num_Users DESC
LIMIT 10;
        """
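# query_to_pandas_safe refuses to run queries that would scan more than the
# limit (1 GB by default), so allow up to 10 GB for this join.
response3 = stackOverflow.query_to_pandas_safe(query3, max_gb_scanned=10)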
response3.head(10)

Unnamed: 0,First_Gold_Badge,Num_Users,Avg_Num_Days
0,Famous Question,220280,1255.0
1,Fanatic,17888,673.0
2,Unsung Hero,14834,649.0
3,Great Answer,13330,1489.0
4,Electorate,7155,903.0
5,Populist,6571,1323.0
6,Steward,1180,1044.0
7,Great Question,617,747.0
8,Copy Editor,267,619.0
9,Marshal,198,631.0

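The query treats a badge as "easier" if it is often the first gold badge a user earns: `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)` numbers each user's gold badges by date, and only row 1 is kept. A rough pure-Python equivalent of that "first row per group" step, on made-up rows, looks like this (the sample data and variable names are hypothetical):

```python
from collections import Counter
from datetime import date

# Hypothetical toy rows: (user_id, badge_name, date_awarded). Not real data.
gold_badges = [
    (1, "Great Answer", date(2015, 6, 1)),
    (1, "Famous Question", date(2013, 2, 10)),
    (2, "Fanatic", date(2014, 1, 5)),
    (3, "Famous Question", date(2016, 9, 9)),
    (3, "Electorate", date(2017, 3, 3)),
]

# Sort by (user, date) and keep the first badge seen for each user,
# which plays the role of "WHERE row_number = 1".
first_badge = {}
for user_id, name, awarded in sorted(gold_badges, key=lambda r: (r[0], r[2])):
    first_badge.setdefault(user_id, name)

# Count how many users earned each badge as their first gold badge.
counts = Counter(first_badge.values())
print(counts.most_common())  # [('Famous Question', 2), ('Fanatic', 1)]
```

Note that "most common first gold badge" is only a proxy for difficulty. It also reflects how many users attempt the activity behind each badge.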

Which day of the week has the most questions answered within an hour?


In [8]:
query4 = """SELECT
  Day_of_Week,
  COUNT(1) AS Num_Questions,
  SUM(answered_in_1h) AS Num_Answered_in_1H,
  ROUND(100 * SUM(answered_in_1h) / COUNT(1),1) AS Percent_Answered_in_1H
FROM
(
  SELECT
    q.id AS question_id,
    -- DAYOFWEEK returns 1 (Sunday) through 7 (Saturday)
    EXTRACT(DAYOFWEEK FROM q.creation_date) AS day_of_week,
    MAX(IF(a.parent_id IS NOT NULL AND
           (UNIX_SECONDS(a.creation_date)-UNIX_SECONDS(q.creation_date))/(60*60) <= 1, 1, 0)) AS answered_in_1h
  FROM
    `bigquery-public-data.stackoverflow.posts_questions` q
  LEFT JOIN
    `bigquery-public-data.stackoverflow.posts_answers` a
  ON q.id = a.parent_id
  -- The filter on a.creation_date also drops questions that have no answers.
  WHERE EXTRACT(YEAR FROM a.creation_date) = 2016
    AND EXTRACT(YEAR FROM q.creation_date) = 2016
  GROUP BY question_id, day_of_week
)
GROUP BY
  Day_of_Week
ORDER BY
  Day_of_Week;
        """
response4 = stackOverflow.query_to_pandas_safe(query4, max_gb_scanned=10)
response4.head(10)

Unnamed: 0,Day_of_Week,Num_Questions,Num_Answered_in_1H,Percent_Answered_in_1H
0,1,157479,88966,56.5
1,2,283124,164912,58.2
2,3,311049,183508,59.0
3,4,319240,187658,58.8
4,5,314876,185159,58.8
5,6,283119,166664,58.9
6,7,160005,90557,56.6

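One general SQL behavior affects how these percentages should be read. `query4` uses a `LEFT JOIN`, but its `WHERE` clause filters on `a.creation_date`, a column from the right-hand table. For a question with no answers that column is `NULL`, so the row fails the filter and the question drops out of the denominator. The sketch below shows the effect on a tiny made-up dataset with Python's built-in `sqlite3`. The table names, rows, and SQLite dialect are assumptions for illustration, not the BigQuery tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE questions (id INTEGER, created TEXT);
CREATE TABLE answers (parent_id INTEGER, created TEXT);
INSERT INTO questions VALUES (1, '2016-01-04'), (2, '2016-01-05'), (3, '2016-01-06');
-- Question 3 has no answers at all.
INSERT INTO answers VALUES (1, '2016-01-04'), (2, '2016-02-01');
""")

# Filter on the right-hand table in WHERE: unanswered question 3 disappears.
where_filter = conn.execute("""
SELECT q.id, COUNT(a.parent_id)
FROM questions q
LEFT JOIN answers a ON q.id = a.parent_id
WHERE strftime('%Y', a.created) = '2016'
GROUP BY q.id
ORDER BY q.id
""").fetchall()

# Same filter moved into ON: question 3 is kept with zero matching answers.
on_filter = conn.execute("""
SELECT q.id, COUNT(a.parent_id)
FROM questions q
LEFT JOIN answers a
  ON q.id = a.parent_id AND strftime('%Y', a.created) = '2016'
GROUP BY q.id
ORDER BY q.id
""").fetchall()
conn.close()

print(where_filter)  # [(1, 1), (2, 1)]
print(on_filter)     # [(1, 1), (2, 1), (3, 0)]
```

If the same effect applies to `query4`, `Num_Questions` counts only 2016 questions that received at least one answer in 2016, and `Percent_Answered_in_1H` is relative to that subset rather than to all questions asked. Moving the answer-date condition into the `ON` clause would be one way to keep unanswered questions in the count, at the cost of different (lower) percentages than those shown above.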