# StackOverflow Dataset Analysis with BigQuery

This notebook explores the [StackOverflow public dataset](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=stackoverflow&page=dataset) available on Google BigQuery.

**SQL** is used to analyze trends in developer questions, focusing on:

- Popularity of tags (e.g., Python, SQL, etc.)
- Temporal trends in post creation
- Post scores and activity by year
- Differences between question and answer posts

The goal is to demonstrate how SQL can be used to extract meaningful insights from a large, real-world dataset using BigQuery within a Jupyter/Colab environment.


Run the next cell to fetch the `stackoverflow` dataset.

In [1]:
from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "stackoverflow" dataset
dataset_ref = client.dataset("stackoverflow", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

Using Kaggle's public dataset BigQuery integration.


In [2]:
# Get a list of available tables 

tables =  list(client.list_tables(dataset)) 
list_of_tables = [table.table_id for table in tables]

In [3]:
# Construct a reference to the "posts_answers" table
answers_table_ref = dataset_ref.table("posts_answers")

# API request - fetch the table
answers_table = client.get_table(answers_table_ref)

# Preview the first five lines of the "posts_answers" table
client.list_rows(answers_table, max_results=5).to_dataframe()

Unnamed: 0,id,title,body,accepted_answer_id,answer_count,comment_count,community_owned_date,creation_date,favorite_count,last_activity_date,last_edit_date,last_editor_display_name,last_editor_user_id,owner_display_name,owner_user_id,parent_id,post_type_id,score,tags,view_count
0,18,,<p>For a table like this:</p>\n\n<pre><code>CR...,,,2,NaT,2008-08-01 05:12:44.193000+00:00,,2016-06-02 05:56:26.060000+00:00,2016-06-02 05:56:26.060000+00:00,Jeff Atwood,126039,phpguy,,17,2,59,,
1,165,,"<p>You can use a <a href=""http://sharpdevelop....",,,0,NaT,2008-08-01 18:04:25.023000+00:00,,2019-04-06 14:03:51.080000+00:00,2019-04-06 14:03:51.080000+00:00,,1721793,user2189331,,145,2,10,,
2,1028,,<p>The VB code looks something like this:</p>\...,,,0,NaT,2008-08-04 04:58:40.300000+00:00,,2013-02-07 13:22:14.680000+00:00,2013-02-07 13:22:14.680000+00:00,,395659,user2189331,,947,2,8,,
3,1073,,<p>My first choice would be a dedicated heap t...,,,0,NaT,2008-08-04 07:51:02.997000+00:00,,2015-09-01 17:32:32.120000+00:00,2015-09-01 17:32:32.120000+00:00,,45459,user2189331,,1069,2,29,,
4,1260,,<p>I found the answer. all you have to do is a...,,,0,NaT,2008-08-04 14:06:02.863000+00:00,,2016-12-20 08:38:48.867000+00:00,2016-12-20 08:38:48.867000+00:00,,1221571,Jin,,1229,2,1,,


In [4]:
# Construct a reference to the "posts_questions" table
questions_table_ref = dataset_ref.table("posts_questions")

# API request - fetch the table
questions_table = client.get_table(questions_table_ref)

# Preview the first five lines of the "posts_questions" table
client.list_rows(questions_table, max_results=5).to_dataframe()

Unnamed: 0,id,title,body,accepted_answer_id,answer_count,comment_count,community_owned_date,creation_date,favorite_count,last_activity_date,last_edit_date,last_editor_display_name,last_editor_user_id,owner_display_name,owner_user_id,parent_id,post_type_id,score,tags,view_count
0,320268,Html.ActionLink doesn’t render # properly,<p>When using Html.ActionLink passing a string...,,0,0,NaT,2008-11-26 10:42:37.477000+00:00,0,2009-02-06 20:13:54.370000+00:00,NaT,,,Paulo,,,1,0,asp.net-mvc,390
1,324003,Primitive recursion,<p>how will i define the function 'simplify' ...,,0,0,NaT,2008-11-27 15:12:37.497000+00:00,0,2012-09-25 19:54:40.597000+00:00,2012-09-25 19:54:40.597000+00:00,Marcin,1288.0,,41000.0,,1,0,haskell|lambda|functional-programming|lambda-c...,497
2,390605,While vs. Do While,<p>I've seen both the blocks of code in use se...,390608.0,0,0,NaT,2008-12-24 01:49:54.230000+00:00,2,2008-12-24 03:08:55.897000+00:00,NaT,,,Unkwntech,115.0,,1,0,language-agnostic|loops,11262
3,413246,Protect ASP.NET Source code,<p>Im currently doing some research in how to ...,,0,0,NaT,2009-01-05 14:23:51.040000+00:00,0,2009-03-24 21:30:22.370000+00:00,2009-01-05 14:42:28.257000+00:00,Tom Anderson,13502.0,Velnias,,,1,0,asp.net|deployment|obfuscation,4823
4,454921,"Difference between ""int[] myArray"" and ""int my...",<blockquote>\n <p><strong>Possible Duplicate:...,454928.0,0,0,NaT,2009-01-18 10:22:52.177000+00:00,0,2009-01-18 10:30:50.930000+00:00,2017-05-23 11:49:26.567000+00:00,,-1.0,Evan Fosmark,49701.0,,1,0,java|arrays,798


In [5]:
# Construct a reference to the "users" table
users_table_ref = dataset_ref.table("users")

# API request - fetch the table
users_table = client.get_table(users_table_ref)

# Preview the first five lines of the table
client.list_rows(users_table, max_results=5).to_dataframe()

Unnamed: 0,id,display_name,about_me,age,creation_date,last_access_date,location,reputation,up_votes,down_votes,views,profile_image_url,website_url
0,366,littlecharva,,,2008-08-05 06:49:49.780000+00:00,2022-08-08 13:12:26.337000+00:00,,4214,90,1,252,,http://www.twitter.com/littlecharva
1,1110,user1110,,,2008-08-12 12:33:05.770000+00:00,2009-07-08 20:03:10.577000+00:00,,59,0,0,49,,
2,1666,SCdF,<p>...!</p>\n\n<p><i>(contactable via email: d...,,2008-08-17 21:07:18.133000+00:00,2022-01-13 16:16:27.517000+00:00,"London, UK",55061,740,133,2308,,https://sdufresne.info
3,1754,David R. Longnecker,"<p>As an entrepreneur, I bridge the business a...",,2008-08-18 12:45:25.120000+00:00,2022-09-17 03:59:36.623000+00:00,"Kansas City, MO, USA",2767,170,2,614,https://i.stack.imgur.com/egFxf.jpg?s=128&g=1,https://drlongnecker.com
4,2169,Michał Niedźwiedzki,,,2008-08-20 17:54:21.833000+00:00,2021-03-01 11:02:01.960000+00:00,"Katowice, Poland",12739,99,28,1094,https://i.stack.imgur.com/FE0lM.jpg,


### 1) Selecting the right questions

The query selects the `id`, `title` and `owner_user_id` columns from the `posts_questions` table. 
- Restrict the results to rows that contain the word "bigquery" in the `tags` column. 
- Include rows where there is other text in addition to the word "bigquery" 

In [6]:
questions_query = """
                  SELECT id, title, owner_user_id
                  FROM `bigquery-public-data.stackoverflow.posts_questions`
                  WHERE tags LIKE '%bigquery%'
                  """

# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 1 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
questions_query_job = client.query(questions_query, job_config = safe_config) # Your code goes here

# API request - run the query, and return a pandas DataFrame
questions_results = questions_query_job.to_dataframe() # Your code goes here

# Preview results
print(questions_results.head())



         id                                              title  owner_user_id
0  64345717  Loop by array and union looped result in BigQuery       13304769
1  64610766  BigQuery Transfer jobs from S3 stuck pending o...       14549617
2  64383871  How to get sum of values in days intervals usi...       12472644
3  64251948                BigQuery get row above empty column        4572124
4  64323398  SQL: Remove part of string that is in another ...        6089137


### 2) Joins
The query returns the `id`, `body` and `owner_user_id` columns from the `posts_answers` table.  

In [7]:
answers_query = """
                SELECT a.id, a.body, a.owner_user_id
                FROM `bigquery-public-data.stackoverflow.posts_questions` AS q 
                INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                    ON q.id = a.parent_id
                WHERE q.tags LIKE '%bigquery%'
                """

# Set up the query
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=27*10**10)
answers_query_job = client.query(answers_query, job_config = safe_config) # Your code goes here

# API request - run the query, and return a pandas DataFrame
answers_results = answers_query_job.to_dataframe() # Your code goes here

# Preview results
print(answers_results.head())



         id                                               body  owner_user_id
0  57032546  <p>Another solution - same logic, just a diffe...        1391685
1  57032790  <p>Recommendation : </p>\n\n<pre><code>select\...       10369599
2  57036857  <p>Basically, this is the answer</p>\n\n<pre><...       11506172
3  57042242  <p>first, I hope that year+week &amp; year+day...        2929192
4  57049923  <p>I think the solutions mentioned by others s...        1391685


### 3) Answer the question
This query returns the list of users who have answered many questions.

The query that has a single row for each user who answered at least one question with a tag that includes the string "bigquery". The result will have two columns:
- `user_id` - contains the `owner_user_id` column from the `posts_answers` table
- `number_of_answers` - contains the number of answers the user has written to "bigquery"-related questions

In [8]:
bigquery_experts_query = """
                SELECT a.owner_user_id AS user_id, COUNT(1) AS number_of_answers
                FROM `bigquery-public-data.stackoverflow.posts_questions` AS q 
                INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                    ON q.id = a.parent_id
                WHERE q.tags LIKE '%bigquery%'
                GROUP BY a.owner_user_id
                """

# Set up the query
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
bigquery_experts_query_job = client.query(bigquery_experts_query, job_config = safe_config) # Your code goes here

# API request - run the query, and return a pandas DataFrame
bigquery_experts_results = bigquery_experts_query_job.to_dataframe() # Your code goes here

# Preview results
print(bigquery_experts_results.head())



    user_id  number_of_answers
0  12742892                  2
1   6611037                  1
2   6511361                 17
3   4404302                  3
4   1405208                  1


### 4) How long does it take for questions to receive answers?

Use this information to better design the order in which questions are presented to Stack Overflow users.

The query below,  focuses on questions asked in January 2018. It returns a table with two columns:
- `q_id` - the ID of the question
- `time_to_answer` - how long it took (in seconds) for the question to receive an answer

In [9]:
# Your code here
correct_query = """
              SELECT q.id AS q_id,
                  MIN(TIMESTAMP_DIFF(a.creation_date, q.creation_date, SECOND)) as time_to_answer
              FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
                  LEFT JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
              ON q.id = a.parent_id
              WHERE q.creation_date >= '2018-01-01' and q.creation_date < '2018-02-01'
              GROUP BY q_id
              ORDER BY time_to_answer
              LIMIT 1000
              """

# Set up the query
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
correct_query_job = client.query(correct_query, job_config = safe_config) # Your code goes here

# API request - run the query, and return a pandas DataFrame
correct_results = correct_query_job.to_dataframe() # Your code goes here

# Preview results
print(correct_results.head())

# Run the query, and return a pandas DataFrame
correct_result = client.query(correct_query).result().to_dataframe()
print("Percentage of answered questions: %s%%" % \
      (sum(correct_result["time_to_answer"].notnull()) / len(correct_result) * 100))
print("Number of questions:", len(correct_result))



       q_id  time_to_answer
0  48397572            <NA>
1  48520044            <NA>
2  48173626            <NA>
3  48551746            <NA>
4  48155883            <NA>




Percentage of answered questions: 0.0%
Number of questions: 1000


### 5) Initial questions and answers

Is important to understand the initial experiences that users typically have with the Stack Overflow website.  Is it more common for users to first ask questions or provide answers? After signing up, how long does it take for users to first interact with the website?  

The query returns a table with three columns:
- `owner_user_id` - the user ID
- `q_creation_date` - the first time the user asked a question 
- `a_creation_date` - the first time the user contributed an answer 

Goal: keep track of users who have asked questions, but have yet to provide answers. The table should also include users who have answered questions, but have yet to pose their own questions.  

To avoid returning too much data, the attention is restricted to questions and answers posed in January 2019. 

In [10]:
q_and_a_query = """
                SELECT q.owner_user_id AS owner_user_id,
                    MIN(q.creation_date) AS q_creation_date,
                    MIN(a.creation_date) AS a_creation_date
                FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
                    LEFT JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                ON q.owner_user_id = a.owner_user_id 
                WHERE q.creation_date >= '2019-01-01' AND q.creation_date < '2019-02-01' 
                    AND a.creation_date >= '2019-01-01' AND a.creation_date < '2019-02-01'
                GROUP BY owner_user_id
                --LIMIT 1000
                """
# Set up the query
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
q_and_a_job = client.query(q_and_a_query, job_config = safe_config) # Your code goes here

# API request - run the query, and return a pandas DataFrame
q_and_a_result = q_and_a_job.to_dataframe() # Your code goes here

# Preview results
print(q_and_a_result.head())

# Run the query, and return a pandas DataFrame
q_and_a_result = client.query(q_and_a_query).result().to_dataframe()
print("Percentage of answered questions: %s%%" % \
      (sum(correct_result["time_to_answer"].notnull()) / len(correct_result) * 100))
print("Number of questions:", len(correct_result))

   owner_user_id                  q_creation_date  \
0        8403648 2019-01-05 06:17:32.730000+00:00   
1        2627166 2019-01-12 10:06:41.903000+00:00   
2        7844349 2019-01-09 09:28:21.313000+00:00   
3        3760984 2019-01-04 14:54:45.177000+00:00   
4       10342402 2019-01-10 13:29:44.913000+00:00   

                   a_creation_date  
0 2019-01-05 06:20:53.400000+00:00  
1 2019-01-12 11:30:18.110000+00:00  
2 2019-01-04 11:18:44.007000+00:00  
3 2019-01-05 11:49:12.873000+00:00  
4 2019-01-14 11:31:55.283000+00:00  




Percentage of answered questions: 0.0%
Number of questions: 1000


### 6) Initial questions and answers

Now, need to pull information from *three* different tables.  This syntax very similar to the case when we have to join only two tables. It is possible to use two different **JOINs** to link together information from all three tables, in a single query.

The interest is in understanding users who joined the site in January 2019 and to track their activity on the site: when did they post their first questions and answers, if ever?

The query that returns the following columns:
- `id` - the IDs of all users who created Stack Overflow accounts in January 2019 (January 1, 2019, to January 31, 2019, inclusive)
- `q_creation_date` - the first time the user posted a question on the site; if the user has never posted a question, the value should be null
- `a_creation_date` - the first time the user posted a question on the site; if the user has never posted a question, the value should be null

NB: questions and answers posted after January 31, 2019, should still be included in the results.  And, all users who joined the site in January 2019 should be included (even if they have never posted a question or provided an answer).

In [11]:
init_qa_query = """
                     SELECT u.id AS id,
                         MIN(q.creation_date) AS q_creation_date,
                         MIN(a.creation_date) AS a_creation_date
                     FROM `bigquery-public-data.stackoverflow.users` AS u
                         LEFT JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                             ON u.id = a.owner_user_id
                         LEFT JOIN `bigquery-public-data.stackoverflow.posts_questions` AS q
                             ON q.owner_user_id = u.id
                     WHERE u.creation_date >= '2019-01-01' and u.creation_date < '2019-02-01'
                     GROUP BY id
                    """

# Set up the query
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
init_qa_job = client.query(init_qa_query, job_config = safe_config) # Your code goes here

# API request - run the query, and return a pandas DataFrame
init_qa_result = init_qa_job.to_dataframe() # Your code goes here

# Preview results
print(init_qa_result.head())

         id                  q_creation_date a_creation_date
0  10884195                              NaT             NaT
1  10939640 2022-01-21 06:17:58.520000+00:00             NaT
2  10900835                              NaT             NaT
3  10861579 2019-11-06 14:54:11.567000+00:00             NaT
4  10873792                              NaT             NaT


### 7) How many distinct users posted on January 1, 2019?

The query that returns a table with a single column:
- `owner_user_id` - the IDs of all users who posted at least one question or answer on January 1, 2019.  Each user ID should appear at most once.

In the `posts_questions` (and `posts_answers`) tables, it is possible to get the ID of the original poster from the `owner_user_id` column.  Likewise, the date of the original posting can be found in the `creation_date` column.  

In [12]:
all_users_query = """
                     SELECT q.owner_user_id, 
                     FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
                     WHERE EXTRACT(DATE FROM q.creation_date) = '2019-01-01'
                     UNION DISTINCT
                     SELECT a.owner_user_id, 
                     FROM `bigquery-public-data.stackoverflow.posts_answers` AS a
                     WHERE EXTRACT(DATE FROM a.creation_date) = '2019-01-01'
                  """

# Set up the query
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
all_users_job = client.query(all_users_query, job_config = safe_config) # Your code goes here

# API request - run the query, and return a pandas DataFrame
all_users_result = all_users_job.to_dataframe() # Your code goes here

# Preview results
print(all_users_result.head())



   owner_user_id
0        8402112
1        8865951
2        6767417
3        3049065
4        2018434


### 8) Volume of questions per month

In [13]:
q_per_m_query = """
                     SELECT EXTRACT(MONTH FROM creation_date) AS month,
                     COUNT(*) AS total_questions
                     FROM `bigquery-public-data.stackoverflow.posts_questions`
                     WHERE EXTRACT(YEAR FROM creation_date) = 2019
                     GROUP BY month
                     ORDER BY month ASC
                     LIMIT 2000
                  """

# Set up the query
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
q_per_m_job = client.query(q_per_m_query, job_config = safe_config) # Your code goes here

# API request - run the query, and return a pandas DataFrame
q_per_m_result = q_per_m_job.to_dataframe() # Your code goes here

# Preview results
print(q_per_m_result)



    month  total_questions
0       1           150355
1       2           147198
2       3           162023
3       4           154351
4       5           152252
5       6           136660
6       7           151903
7       8           137970
8       9           137737
9      10           153648
10     11           149313
11     12           133523


### 9) Questions with most answers

In [14]:
most_answered_query = """
                     SELECT id,
                     answer_count
                     FROM `bigquery-public-data.stackoverflow.posts_questions`
                     WHERE EXTRACT(YEAR FROM creation_date) = 2019
                     AND answer_count >= 
                       ( SELECT AVG(answer_count)  
                         FROM `bigquery-public-data.stackoverflow.posts_questions`
                         WHERE EXTRACT(YEAR FROM creation_date) = 2019
                       )
                     ORDER BY answer_count DESC
                     LIMIT 2000
                  """

# Set up the query
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
most_answered_job = client.query(most_answered_query, job_config = safe_config) # Your code goes here

# API request - run the query, and return a pandas DataFrame
most_answered_result = most_answered_job.to_dataframe() # Your code goes here

# Preview results
print(most_answered_result)



            id  answer_count
0     55441230            58
1     59516408            50
2     57606462            46
3     55921442            43
4     54557479            43
...        ...           ...
1995  58318150             8
1996  55879712             8
1997  56855766             8
1998  58596541             8
1999  54909711             8

[2000 rows x 2 columns]


### 10) Most frequent tags

In [15]:
frequent_tags_query = """
                     SELECT tag,
                     COUNT(*) AS num_tags
                     FROM `bigquery-public-data.stackoverflow.posts_questions`,
                     UNNEST(SPLIT(tags, '|')) AS tag
                     WHERE EXTRACT(YEAR FROM creation_date) = 2019
                     GROUP BY tag
                     ORDER BY num_tags DESC
                     LIMIT 2000
                  """

# Set up the query
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
frequent_tags_job = client.query(frequent_tags_query, job_config = safe_config) # Your code goes here

# API request - run the query, and return a pandas DataFrame
frequent_tags_result = frequent_tags_job.to_dataframe() # Your code goes here

# Preview results
print(frequent_tags_result)



                tag  num_tags
0            python    223982
1        javascript    188287
2              java    126031
3                c#     99384
4           android     84814
...             ...       ...
1995  .net-standard       258
1996   nested-lists       258
1997         jpanel       258
1998      git-merge       258
1999     xlsxwriter       257

[2000 rows x 2 columns]


### 11) Questions with AT LEAST one answer + average time at first answer


In [16]:
first_answer_query = """
                     WITH first_answer AS 
                     ( SELECT parent_id AS answer_id,
                     MIN(creation_date) AS first_answer_time
                     FROM `bigquery-public-data.stackoverflow.posts_answers`
                     GROUP BY parent_id
                     )

                     SELECT DATE(q.creation_date) AS date,
                     COUNT(*) AS total_questions,
                     COUNT(a.answer_id) AS questions_with_answer,
                     ROUND(100 * COUNT(a.answer_id) / COUNT(*), 2) AS pct_answered,
                     AVG(TIMESTAMP_DIFF(a.first_answer_time, q.creation_date, DAY)) AS avg_day_to_first_answer
                     FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
                     JOIN first_answer AS a ON q.id = a.answer_id
                     WHERE EXTRACT (YEAR FROM q.creation_date)= 2019
                     GROUP BY date
                     ORDER BY date ASC
                     
                     LIMIT 2000
                  """

# Set up the query
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
first_answer_job = client.query(first_answer_query, job_config = safe_config) # Your code goes here

# API request - run the query, and return a pandas DataFrame
first_answer_result = first_answer_job.to_dataframe() # Your code goes here

# Preview results
print(first_answer_result)



           date  total_questions  questions_with_answer  pct_answered  \
0    2019-01-01             1971                   1971         100.0   
1    2019-01-02             3818                   3818         100.0   
2    2019-01-03             4362                   4362         100.0   
3    2019-01-04             4264                   4264         100.0   
4    2019-01-05             2479                   2479         100.0   
..          ...              ...                    ...           ...   
360  2019-12-27             3187                   3187         100.0   
361  2019-12-28             2250                   2250         100.0   
362  2019-12-29             2187                   2187         100.0   
363  2019-12-30             3548                   3548         100.0   
364  2019-12-31             2816                   2816         100.0   

     avg_day_to_first_answer  
0                  14.527651  
1                  17.539550  
2                  16.882393  

### 12) Question comparison between two tags: Python vs JavaScript

In [17]:
comparison_query = """SELECT tag,
                      COUNT(*) AS num_tags
                      FROM `bigquery-public-data.stackoverflow.posts_questions`,
                      UNNEST(SPLIT(tags, '|')) AS tag
                      WHERE tag IN ("python", "javascript")
                      GROUP BY tag
                      ORDER BY num_tags DESC;
                      """

# Set up the query
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
comparison_job = client.query(comparison_query, job_config = safe_config) # Your code goes here

# API request - run the query, and return a pandas DataFrame
comparison_result = comparison_job.to_dataframe() # Your code goes here

# Preview results
print(comparison_result)

          tag  num_tags
0  javascript   2426570
1      python   2026601




### 13) Percentage of answered questions within one day

In [18]:
qa_oneday_query = """
                     WITH qa_pair AS 
                     ( SELECT q.id AS question_id,
                     q.creation_date AS question_time,
                     MIN(q.creation_date) AS first_answer_time
                     FROM `bigquery-public-data.stackoverflow.posts_questions` AS q LEFT JOIN
                     `bigquery-public-data.stackoverflow.posts_answers` AS a ON q.id = a.parent_id
                     WHERE EXTRACT (YEAR FROM q.creation_date) = 2019
                     GROUP BY q.id, q.creation_date
                     )

                     SELECT
                     COUNTIF( TIMESTAMP_DIFF(first_answer_time, question_time, HOUR) <= 24 ) AS answer_24h,

                     ROUND(100 * COUNTIF(TIMESTAMP_DIFF(first_answer_time, question_time, HOUR) <= 24) / COUNT(*), 2) AS pct_answered_24h

                     FROM qa_pair
                     
                     LIMIT 2000
                  """

# Set up the query
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
qa_oneday_job = client.query(qa_oneday_query, job_config = safe_config) # Your code goes here

# API request - run the query, and return a pandas DataFrame
qa_oneday_result = qa_oneday_job.to_dataframe() # Your code goes here

# Preview results
print(qa_oneday_result)



   answer_24h  pct_answered_24h
0     1766933             100.0
