In [1]:
from google.cloud import bigquery
from google.oauth2 import service_account

# Import a custom helper, client(), which returns an authorized BigQuery client built with the right credentials.
# Because client is a function defined in bq_sa_auth.py, every use needs an extra pair of parentheses: client().

from bq_sa_auth import client
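
The contents of `bq_sa_auth.py` are not shown in this notebook. As a rough sketch only, such a helper might look like the following, assuming a service-account key stored at the hypothetical path `service_account_key.json`. Caching the client with `lru_cache` is also an assumption, added so that repeated `client()` calls reuse one object instead of authenticating again.

```python
# Hypothetical bq_sa_auth.py; the real module is not shown in this notebook.
from functools import lru_cache


@lru_cache(maxsize=1)
def client(key_path="service_account_key.json"):
    """Return an authorized BigQuery client, built once and then reused."""
    # Imported here so the Google libraries are only loaded when a client is requested.
    from google.cloud import bigquery
    from google.oauth2 import service_account

    # key_path is an assumed location for the service-account JSON key.
    credentials = service_account.Credentials.from_service_account_file(key_path)
    return bigquery.Client(credentials=credentials, project=credentials.project_id)
```

With this shape, `client().query(...)` and `client().get_table(...)` in the cells below would all share the same underlying client.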

## **[Advanced SQL](https://www.kaggle.com/learn/advanced-sql)**

## Lecture 1: [JOINs and UNIONs](https://www.kaggle.com/code/alexisbcook/joins-and-unions)

#### Combine information from multiple tables.

-------------

### Keywords: LEFT/RIGHT JOIN, FULL JOIN & UNION ALL/DISTINCT

### **Exercises:** 

#### Introduction

We will use different types of SQL **JOINs** to answer questions about the [Stack Overflow](https://www.kaggle.com/stackoverflow/stackoverflow) dataset.

- First fetch the `posts_questions` and `posts_answers` tables from the `stackoverflow` dataset.  We also preview the first five rows of these tables.

In [2]:
# Construct a reference to the "stackoverflow" dataset
sof_ref = client().dataset("stackoverflow", project="bigquery-public-data")

# API request - fetch the dataset
sof_ds = client().get_dataset(sof_ref)

# Construct a reference to the "posts_questions" and "posts_answers" table
pq_table_ref = sof_ds.table("posts_questions")
pa_table_ref = sof_ds.table("posts_answers")

# API request - fetch the tables
pq_table = client().get_table(pq_table_ref)
pa_table = client().get_table(pa_table_ref)

In [3]:
# Preview the first five lines of the table
client().list_rows(pq_table, max_results=5).to_dataframe()

Unnamed: 0,id,title,body,accepted_answer_id,answer_count,comment_count,community_owned_date,creation_date,favorite_count,last_activity_date,last_edit_date,last_editor_display_name,last_editor_user_id,owner_display_name,owner_user_id,parent_id,post_type_id,score,tags,view_count
0,320268,Html.ActionLink doesn’t render # properly,<p>When using Html.ActionLink passing a string...,,0,0,NaT,2008-11-26 10:42:37.477000+00:00,0,2009-02-06 20:13:54.370000+00:00,NaT,,,Paulo,,,1,0,asp.net-mvc,390
1,324003,Primitive recursion,<p>how will i define the function 'simplify' ...,,0,0,NaT,2008-11-27 15:12:37.497000+00:00,0,2012-09-25 19:54:40.597000+00:00,2012-09-25 19:54:40.597000+00:00,Marcin,1288.0,,41000.0,,1,0,haskell|lambda|functional-programming|lambda-c...,497
2,390605,While vs. Do While,<p>I've seen both the blocks of code in use se...,390608.0,0,0,NaT,2008-12-24 01:49:54.230000+00:00,2,2008-12-24 03:08:55.897000+00:00,NaT,,,Unkwntech,115.0,,1,0,language-agnostic|loops,11262
3,413246,Protect ASP.NET Source code,<p>Im currently doing some research in how to ...,,0,0,NaT,2009-01-05 14:23:51.040000+00:00,0,2009-03-24 21:30:22.370000+00:00,2009-01-05 14:42:28.257000+00:00,Tom Anderson,13502.0,Velnias,,,1,0,asp.net|deployment|obfuscation,4823
4,454921,"Difference between ""int[] myArray"" and ""int my...",<blockquote>\n <p><strong>Possible Duplicate:...,454928.0,0,0,NaT,2009-01-18 10:22:52.177000+00:00,0,2009-01-18 10:30:50.930000+00:00,2017-05-23 11:49:26.567000+00:00,,-1.0,Evan Fosmark,49701.0,,1,0,java|arrays,798


In [4]:
# Preview the first five lines of the table
client().list_rows(pa_table, max_results=5).to_dataframe()

Unnamed: 0,id,title,body,accepted_answer_id,answer_count,comment_count,community_owned_date,creation_date,favorite_count,last_activity_date,last_edit_date,last_editor_display_name,last_editor_user_id,owner_display_name,owner_user_id,parent_id,post_type_id,score,tags,view_count
0,18,,<p>For a table like this:</p>\n\n<pre><code>CR...,,,2,NaT,2008-08-01 05:12:44.193000+00:00,,2016-06-02 05:56:26.060000+00:00,2016-06-02 05:56:26.060000+00:00,Jeff Atwood,126039,phpguy,,17,2,59,,
1,165,,"<p>You can use a <a href=""http://sharpdevelop....",,,0,NaT,2008-08-01 18:04:25.023000+00:00,,2019-04-06 14:03:51.080000+00:00,2019-04-06 14:03:51.080000+00:00,,1721793,user2189331,,145,2,10,,
2,1028,,<p>The VB code looks something like this:</p>\...,,,0,NaT,2008-08-04 04:58:40.300000+00:00,,2013-02-07 13:22:14.680000+00:00,2013-02-07 13:22:14.680000+00:00,,395659,user2189331,,947,2,8,,
3,1073,,<p>My first choice would be a dedicated heap t...,,,0,NaT,2008-08-04 07:51:02.997000+00:00,,2015-09-01 17:32:32.120000+00:00,2015-09-01 17:32:32.120000+00:00,,45459,user2189331,,1069,2,29,,
4,1260,,<p>I found the answer. all you have to do is a...,,,0,NaT,2008-08-04 14:06:02.863000+00:00,,2016-12-20 08:38:48.867000+00:00,2016-12-20 08:38:48.867000+00:00,,1221571,Jin,,1229,2,1,,


### 1) How long does it take for questions to receive answers?

You're interested in exploring the data to better understand how long it generally takes for questions to receive answers.  You plan to use this knowledge to better design the order in which questions are presented to Stack Overflow users.

With this goal in mind, you write the query below, which focuses on questions asked in January 2018.  It returns a table with two columns:
- `q_id` - the ID of the question
- `time_to_answer` - how long it took (in seconds) for the question to receive an answer
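
To make the `time_to_answer` column concrete, here is a small sketch of the same calculation in plain Python. The timestamps are invented toy values, not rows from the dataset: the elapsed seconds are computed for every answer to one question, and the smallest value is kept, which mirrors the `MIN(TIMESTAMP_DIFF(...))` idea used in the query.

```python
from datetime import datetime, timezone

# Toy timestamps (not taken from the dataset): one question and two answers to it.
question_created = datetime(2018, 1, 5, 12, 0, 0, tzinfo=timezone.utc)
answers_created = [
    datetime(2018, 1, 5, 12, 30, 0, tzinfo=timezone.utc),
    datetime(2018, 1, 5, 12, 10, 0, tzinfo=timezone.utc),
]

# Seconds from the question to each answer, then the minimum over all answers.
seconds_to_each_answer = [
    int((answered - question_created).total_seconds()) for answered in answers_created
]
time_to_answer = min(seconds_to_each_answer)

print(seconds_to_each_answer)  # [1800, 600]
print(time_to_answer)  # 600
```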

In [6]:
anstime_query = """
                 SELECT q.id AS q_id, MIN(TIMESTAMP_DIFF(a.creation_date, q.creation_date, SECOND)) AS time_to_answer
                 
                 FROM `bigquery-public-data.stackoverflow.posts_answers` AS a
                 INNER JOIN `bigquery-public-data.stackoverflow.posts_questions` AS q 
                    ON q.id = a.parent_id
                 WHERE EXTRACT(YEAR FROM q.creation_date) = 2018 AND EXTRACT(MONTH FROM q.creation_date) = 01
                 GROUP BY q_id
                 ORDER BY time_to_answer
                 """

# Cap billing at 27*10**10 bytes (270 GB); a query that would bill more fails instead of running
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=27 * 10**10)

query_job = client().query(anstime_query, job_config=safe_config)

sof_anstime = query_job.to_dataframe()

sof_anstime.head()

Unnamed: 0,q_id,time_to_answer
0,48382183,-132444692
1,48396661,0
2,48287697,0
3,48118504,0
4,48108963,0


In [7]:
# Compute the percentage of answered questions from the query output; notnull() returns True where the field is not NaN

print(f"The percentage of answered questions: {sum(sof_anstime['time_to_answer'].notnull())/len(sof_anstime) * 100}%")
print("\n")
print(f"Total number of questions: {len(sof_anstime)}")

The percentage of answered questions: 100.0%


Total number of questions: 134719


You're surprised at the results and strongly suspect that something is wrong with your query.  In particular,
- According to the query, 100% of the questions from January 2018 received an answer.  But, you know that ~80% of the questions on the site usually receive an answer.
- The total number of questions is surprisingly low.  You expected to see at least 150,000 questions represented in the table.

Given these observations, you think that the type of **JOIN** you have chosen has inadvertently excluded unanswered questions.  Using the code cell below, can you figure out what type of **JOIN** to use to fix the problem so that the table includes unanswered questions?

**Note**: You need only amend the type of **JOIN** (i.e., **INNER**, **LEFT**, **RIGHT**, or **FULL**) to answer the question successfully.

In [8]:
# To keep every question, including those with no answer whose parent_id matches q.id, use a RIGHT JOIN:
# posts_questions is the right-hand table in this query, so all of its rows are preserved.

correct_anstime_query = """
                 SELECT q.id AS q_id, MIN(TIMESTAMP_DIFF(a.creation_date, q.creation_date, SECOND)) AS time_to_answer
                 
                 FROM `bigquery-public-data.stackoverflow.posts_answers` AS a
                 RIGHT JOIN `bigquery-public-data.stackoverflow.posts_questions` AS q 
                    ON q.id = a.parent_id
                 WHERE EXTRACT(YEAR FROM q.creation_date) = 2018 AND EXTRACT(MONTH FROM q.creation_date) = 01
                 GROUP BY q_id
                 ORDER BY time_to_answer
                 """

safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=27 * 10**10)

query_job = client().query(correct_anstime_query, job_config=safe_config)

correct_sof_anstime = query_job.to_dataframe()

correct_sof_anstime.head()

Unnamed: 0,q_id,time_to_answer
0,48503640,
1,48549064,
2,48462818,
3,48300296,
4,48140644,
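
The difference between the two join types can be reproduced on toy data. The sketch below uses Python's built-in `sqlite3` module with invented `questions` and `answers` tables (hypothetical names and values, with integer times instead of timestamps). It places the questions table on the left and uses a LEFT JOIN, which is equivalent to the RIGHT JOIN above with the table order swapped; this is only because older SQLite builds do not support RIGHT JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (id INTEGER, creation_time INTEGER);
    CREATE TABLE answers (id INTEGER, parent_id INTEGER, creation_time INTEGER);
    -- Toy data: question 3 has no answer.
    INSERT INTO questions VALUES (1, 0), (2, 100), (3, 200);
    INSERT INTO answers VALUES (10, 1, 60), (11, 1, 30), (12, 2, 400);
""")

query = """
    SELECT q.id AS q_id, MIN(a.creation_time - q.creation_time) AS time_to_answer
    FROM questions AS q
    {join_type} JOIN answers AS a ON q.id = a.parent_id
    GROUP BY q.id
    ORDER BY q.id
"""

# INNER JOIN drops question 3; LEFT JOIN keeps it with a NULL time_to_answer.
inner_rows = conn.execute(query.format(join_type="INNER")).fetchall()
left_rows = conn.execute(query.format(join_type="LEFT")).fetchall()

print(inner_rows)  # [(1, 30), (2, 300)]
print(left_rows)   # [(1, 30), (2, 300), (3, None)]
```

With the inner join every returned question is answered by construction, which is why the first query reported 100%.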


In [9]:
# Compute the percentage of answered questions from the query output; notnull() returns True where the field is not NaN

print(f"The percentage of answered questions: {sum(correct_sof_anstime['time_to_answer'].notnull())/len(correct_sof_anstime) * 100:.2f}%")
print("\n")
print(f"Total number of questions: {len(correct_sof_anstime)}")

The percentage of answered questions: 83.34%


Total number of questions: 161656


### 2) Initial questions and answers, Part 1

You're interested in understanding the initial experiences that users typically have with the Stack Overflow website.  Is it more common for users to first ask questions or provide answers?  After signing up, how long does it take for users to first interact with the website?  To explore this further, you draft the (partial) query in the code cell below.

The query returns a table with three columns:
- `owner_user_id` - the user ID
- `q_creation_date` - the first time the user asked a question 
- `a_creation_date` - the first time the user contributed an answer 

You want to keep track of users who have asked questions, but have yet to provide answers.  And, your table should also include users who have answered questions, but have yet to pose their own questions.  

With this in mind, please fill in the appropriate **JOIN** (i.e., **INNER**, **LEFT**, **RIGHT**, or **FULL**) to return the correct information.  

**Note**: You need only fill in the appropriate **JOIN**.  All other parts of the query should be left as-is.  (You also don't need to write any additional code to run the query, since the `check()` method will take care of this for you.)

To avoid returning too much data, we'll restrict our attention to questions and answers posed in January 2019.  We'll amend the timeframe in Part 2 of this question to be more realistic!

In [10]:
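# A FULL JOIN keeps users from both tables. The join key is owner_user_id (not q.id = a.parent_id),
# since a user may have rows in only one of the two tables.
# Note: the WHERE clause below filters on both q.creation_date and a.creation_date,
# so rows where either side is NULL are dropped from the result.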

qa_init_query ="""
                 SELECT q.owner_user_id AS owner_user_id, MIN(q.creation_date) AS q_creation_date, MIN(a.creation_date) AS a_creation_date
                 
                 FROM `bigquery-public-data.stackoverflow.posts_answers` AS a
                 FULL JOIN `bigquery-public-data.stackoverflow.posts_questions` AS q 
                    ON q.owner_user_id = a.owner_user_id
                    
                 WHERE EXTRACT(YEAR FROM q.creation_date) = 2019 AND EXTRACT(MONTH FROM q.creation_date) = 01
                   AND EXTRACT(YEAR FROM a.creation_date) = 2019 AND EXTRACT(MONTH FROM a.creation_date) = 01
                 GROUP BY owner_user_id
                 ORDER BY q_creation_date
                 """

safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=27 * 10**10)

query_job = client().query(qa_init_query, job_config=safe_config)

qa_init = query_job.to_dataframe()

qa_init.head()

Unnamed: 0,owner_user_id,q_creation_date,a_creation_date
0,6266364,2019-01-01 00:02:08.370000+00:00,2019-01-02 11:27:17.873000+00:00
1,9706003,2019-01-01 00:02:27.607000+00:00,2019-01-01 16:20:01.050000+00:00
2,492015,2019-01-01 00:11:20.773000+00:00,2019-01-07 19:09:32.523000+00:00
3,10751189,2019-01-01 00:12:06.683000+00:00,2019-01-03 05:47:02.663000+00:00
4,1953250,2019-01-01 00:15:23.450000+00:00,2019-01-14 21:41:53.133000+00:00
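
One thing worth keeping in mind about this query: a `WHERE` condition that tests columns from both sides of a FULL JOIN is never true for rows where one side is NULL, so question-only and answer-only users are filtered out again. The preview above, where both dates are present in every row, is consistent with that. The sketch below illustrates the effect with Python's built-in `sqlite3` module on invented toy tables (hypothetical names and values). The FULL JOIN is emulated with a LEFT JOIN plus the answer-only rows, purely so the example also runs on older SQLite builds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (owner_user_id INTEGER, creation_date TEXT);
    CREATE TABLE answers (owner_user_id INTEGER, creation_date TEXT);
    -- Toy data: user 1 asks and answers, user 2 only asks, user 3 only answers.
    INSERT INTO questions VALUES (1, '2019-01-02'), (2, '2019-01-05');
    INSERT INTO answers VALUES (1, '2019-01-03'), (3, '2019-01-07');
""")

# Emulated FULL JOIN: all question rows with matching answers, plus answer-only rows.
full_join = """
    SELECT q.owner_user_id AS q_user, q.creation_date AS q_date,
           a.owner_user_id AS a_user, a.creation_date AS a_date
    FROM questions AS q
    LEFT JOIN answers AS a ON q.owner_user_id = a.owner_user_id
    UNION ALL
    SELECT NULL, NULL, a.owner_user_id, a.creation_date
    FROM answers AS a
    WHERE NOT EXISTS (SELECT 1 FROM questions AS q WHERE q.owner_user_id = a.owner_user_id)
"""

all_rows = conn.execute(f"SELECT * FROM ({full_join})").fetchall()
# Filtering on both date columns removes every row that has a NULL on either side.
filtered_rows = conn.execute(
    f"SELECT * FROM ({full_join}) WHERE q_date >= '2019-01-01' AND a_date >= '2019-01-01'"
).fetchall()

print(len(all_rows))   # 3
print(filtered_rows)   # [(1, '2019-01-02', 1, '2019-01-03')]
```

If the one-sided users are wanted, one option is to apply the date filter to each table in a subquery before joining, rather than in the outer `WHERE`.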


### 3) Initial questions and answers, Part 2

Now you'll address a more realistic (and complex!) scenario.  To answer this question, you'll need to pull information from *three* different tables!  The syntax is very similar to the case when we have to join only two tables.  For instance, consider the three tables below.

![three tables](https://storage.googleapis.com/kaggle-media/learn/images/OyhYtD1.png)

We can use two different **JOINs** to link together information from all three tables, in a single query.

![double join](https://storage.googleapis.com/kaggle-media/learn/images/G6buS7P.png)
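
As a runnable illustration of chaining two joins, the sketch below uses Python's built-in `sqlite3` module with invented `users`, `questions`, and `answers` tables (hypothetical names and toy values, not the Stack Overflow data). Both LEFT JOINs are keyed on the user ID from the `users` table, so a user who only answered, or who never posted at all, still appears in the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER);
    CREATE TABLE questions (owner_user_id INTEGER, creation_date TEXT);
    CREATE TABLE answers (owner_user_id INTEGER, creation_date TEXT);
    -- Toy data: user 1 asks and answers, user 2 only answers, user 3 does neither.
    INSERT INTO users VALUES (1), (2), (3);
    INSERT INTO questions VALUES (1, '2019-01-10'), (1, '2019-02-01');
    INSERT INTO answers VALUES (1, '2019-01-15'), (2, '2019-03-01');
""")

rows = conn.execute("""
    SELECT u.id,
           MIN(q.creation_date) AS q_creation_date,
           MIN(a.creation_date) AS a_creation_date
    FROM users AS u
    LEFT JOIN questions AS q ON q.owner_user_id = u.id
    LEFT JOIN answers AS a ON a.owner_user_id = u.id
    GROUP BY u.id
    ORDER BY u.id
""").fetchall()

print(rows)
# [(1, '2019-01-10', '2019-01-15'), (2, None, '2019-03-01'), (3, None, None)]
```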

With this in mind, say you're interested in understanding users who joined the site in January 2019.  You want to track their activity on the site: when did they post their first questions and answers, if ever?

Write a query that returns the following columns:
- `id` - the IDs of all users who created Stack Overflow accounts in January 2019 (January 1, 2019, to January 31, 2019, inclusive)
- `q_creation_date` - the first time the user posted a question on the site; if the user has never posted a question, the value should be null
- `a_creation_date` - the first time the user posted an answer on the site; if the user has never posted an answer, the value should be null

Note that questions and answers posted after January 31, 2019, should still be included in the results.  And, all users who joined the site in January 2019 should be included (even if they have never posted a question or provided an answer).

The query from the previous question should be a nice starting point to answering this question!  You'll need to use the `posts_answers` and `posts_questions` tables.  You'll also need to use the `users` table from the Stack Overflow dataset.  The relevant columns from the `users` table are `id` (the ID of each user) and `creation_date` (when the user joined the Stack Overflow site, in DATETIME format).

In [11]:
three_tables_query = """
                    SELECT u.id AS id, MIN(q.creation_date) AS q_creation_date, MIN(a.creation_date) AS a_creation_date
                 
                    FROM `bigquery-public-data.stackoverflow.users` AS u
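                    -- LEFT JOINs from users keep every January 2019 sign-up, even those with no posts
                    LEFT JOIN `bigquery-public-data.stackoverflow.posts_questions` AS q 
                        ON q.owner_user_id = u.id
                    -- Join answers on u.id (not q.owner_user_id) so users who only answered keep their answer date
                    LEFT JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a 
                        ON a.owner_user_id = u.id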
                    
                     
                    WHERE EXTRACT(YEAR FROM u.creation_date) = 2019 AND EXTRACT(MONTH FROM u.creation_date) = 01
                    GROUP BY id
                    ORDER BY q_creation_date     
                    """


safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=27 * 10**10)

query_job = client().query(three_tables_query, job_config=safe_config)

qa_JAN_init = query_job.to_dataframe()

qa_JAN_init.tail(15)

Unnamed: 0,id,q_creation_date,a_creation_date
141691,10991654,2022-09-21 15:08:06.950000+00:00,NaT
141692,10885997,2022-09-21 18:25:07.780000+00:00,NaT
141693,10921854,2022-09-22 08:28:17.277000+00:00,2022-09-22 09:06:21.627000+00:00
141694,10867903,2022-09-22 08:31:43.093000+00:00,NaT
141695,10971940,2022-09-22 10:03:21.127000+00:00,NaT
141696,10941431,2022-09-22 19:04:44.320000+00:00,NaT
141697,10998252,2022-09-22 19:05:30.347000+00:00,NaT
141698,10990887,2022-09-22 20:41:05.500000+00:00,NaT
141699,10917584,2022-09-22 22:19:38.527000+00:00,NaT
141700,10994945,2022-09-23 08:35:41.933000+00:00,NaT


### 4) How many distinct users posted on January 1, 2019?

In the code cell below, write a query that returns a table with a single column:
- `owner_user_id` - the IDs of all users who posted at least one question or answer on January 1, 2019.  Each user ID should appear at most once.

In the `posts_questions` (and `posts_answers`) tables, you can get the ID of the original poster from the `owner_user_id` column.  Likewise, the date of the original posting can be found in the `creation_date` column.  

In order for your answer to be marked correct, your query must use a **UNION**.

In [12]:
JAN1_19_query = """
                    SELECT a.owner_user_id FROM `bigquery-public-data.stackoverflow.posts_answers` AS a 
                    WHERE EXTRACT(DATE FROM a.creation_date) = '2019-01-01'
                 
                    UNION DISTINCT

                    SELECT q.owner_user_id FROM `bigquery-public-data.stackoverflow.posts_questions` AS q 
                    WHERE EXTRACT(DATE FROM q.creation_date) = '2019-01-01' 
                  """

safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=27 * 10**10)

query_job = client().query(JAN1_19_query, job_config=safe_config)

JAN1_19_POSTS = query_job.to_dataframe()

JAN1_19_POSTS.head()

Unnamed: 0,owner_user_id
0,10853045
1,10852547
2,10185816
3,10309381
4,10235648
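
The difference between the two UNION variants can be seen on toy data. The sketch below uses Python's built-in `sqlite3` module with invented tables (hypothetical names and values). Note one dialect difference: SQLite spells the de-duplicating form as plain `UNION`, whereas BigQuery requires it to be written as `UNION DISTINCT`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (owner_user_id INTEGER);
    CREATE TABLE answers (owner_user_id INTEGER);
    -- Toy data: user 1 posted one question and two answers.
    INSERT INTO questions VALUES (1), (2);
    INSERT INTO answers VALUES (1), (1), (3);
""")

# UNION ALL keeps every row, so user 1 appears three times.
union_all = conn.execute(
    "SELECT owner_user_id FROM questions UNION ALL SELECT owner_user_id FROM answers"
).fetchall()

# Plain UNION in SQLite removes duplicates, like UNION DISTINCT in BigQuery.
union_distinct = conn.execute(
    "SELECT owner_user_id FROM questions UNION SELECT owner_user_id FROM answers"
).fetchall()

print(len(union_all))          # 5
print(sorted(union_distinct))  # [(1,), (2,), (3,)]
```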


In [13]:
print(f"The number of distinct users who posted to Stack Overflow on Jan 1, 2019: {len(JAN1_19_POSTS)}")

The number of distinct users who posted to Stack Overflow on Jan 1, 2019: 4352
