# Joining data

You have the tools to obtain data from a single table in whatever format you want it. But what if the data you want is spread across multiple tables?

That's where JOIN comes in! A JOIN combines rows from two tables by matching values in a column they share, and it comes up constantly in practical SQL workflows.

In general, when you're joining tables, it's a good habit to specify which table each of your columns comes from. That way, you don't have to pull up the schema every time you go back to read the query.
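
To make the habit concrete, here is a minimal sketch that runs locally with Python's built-in `sqlite3` module instead of BigQuery. The `pets` and `owners` tables and their rows are made up for illustration. Both tables have a `Name` column, so the table aliases are what tell you which one each selected column refers to.

```python
import sqlite3

# Toy in-memory tables; the schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pets   (ID INTEGER, Name TEXT);
    CREATE TABLE owners (ID INTEGER, Name TEXT, Pet_ID INTEGER);
    INSERT INTO pets   VALUES (1, 'Ripley'), (2, 'Moon'), (3, 'Tom');
    INSERT INTO owners VALUES (1, 'Ana', 1), (2, 'Ben', 3);
""")

# Each column is prefixed with its table alias (p or o), so the query
# stays readable without looking up the schema.
rows = conn.execute("""
    SELECT p.Name AS Pet_Name, o.Name AS Owner_Name
    FROM pets AS p
    INNER JOIN owners AS o
        ON p.ID = o.Pet_ID
    ORDER BY p.ID
""").fetchall()

print(rows)  # [('Ripley', 'Ana'), ('Tom', 'Ben')]
```

The pet named Moon has no matching row in `owners`, so an INNER JOIN leaves it out of the result.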


## Example: How many files are covered by each type of software license?
GitHub is the most popular place to collaborate on software projects. A GitHub repository (or repo) is a collection of files associated with a specific project.

Most repos on GitHub are shared under a specific legal license, which determines the legal restrictions on how they are used. For our example, we're going to look at how many different files have been released under each license.

We'll work with two tables in the database. The first table is the `licenses` table, which provides the name of each GitHub repo (in the `repo_name` column) and its corresponding license. Here's a view of the first five rows.

In [2]:
from google.cloud import bigquery

client = bigquery.Client(project="sqlbigquery7711")

dataset_ref = client.dataset("github_repos", project="bigquery-public-data")

dataset = client.get_dataset(dataset_ref)

In [5]:
tables = list(client.list_tables(dataset))

In [6]:
for table in tables:
    print(table.table_id)

commits
contents
files
languages
licenses
sample_commits
sample_contents
sample_files
sample_repos


In [9]:
table_ref = dataset_ref.table('licenses')

license_table = client.get_table(table_ref)

In [10]:
client.list_rows(license_table, max_results=5).to_dataframe()

Unnamed: 0,repo_name,license
0,autarch/Dist-Zilla-Plugin-Test-TidyAll,artistic-2.0
1,thundergnat/Prime-Factor,artistic-2.0
2,kusha-b-k/Turabian_Engin_Fan,artistic-2.0
3,onlinepremiumoutlet/onlinepremiumoutlet.github.io,artistic-2.0
4,huangyuanlove/LiaoBa_Service,artistic-2.0


The second table is the `sample_files` table, which provides, among other information, the GitHub repo that each file belongs to (in the `repo_name` column). The first several rows of this table are printed below.



In [11]:
sample_ref = dataset_ref.table('sample_files')

sample_table = client.get_table(sample_ref)

client.list_rows(sample_table, max_results=5).to_dataframe()

Unnamed: 0,repo_name,ref,path,mode,id,symlink_target
0,EOL/eol,refs/heads/master,generate/vendor/railties,40960,0338c33fb3fda57db9e812ac7de969317cad4959,/usr/share/rails-ruby1.8/railties
1,np/ling,refs/heads/master,tests/success/merger_seq_inferred.t/merger_seq...,40960,dd4bb3d5ecabe5044d3fa5a36e0a9bf7ca878209,../../../fixtures/all/merger_seq_inferred.ll
2,np/ling,refs/heads/master,fixtures/sequence/lettype.ll,40960,8fdf536def2633116d65b92b3b9257bcf06e3e45,../all/lettype.ll
3,np/ling,refs/heads/master,fixtures/failure/wrong_order_seq3.ll,40960,c2509ae1196c4bb79d7e60a3d679488ca4a753e9,../all/wrong_order_seq3.ll
4,np/ling,refs/heads/master,issues/sequence/keep.t,40960,5721de3488fb32745dfc11ec482e5dd0331fecaf,../keep.t


In [14]:
query = """
        SELECT  lic.license AS license, COUNT(1) AS number_of_files
        FROM    `bigquery-public-data.github_repos.licenses` AS lic 
                INNER JOIN `bigquery-public-data.github_repos.sample_files` AS smp
                ON lic.repo_name = smp.repo_name
        GROUP BY license
        ORDER BY number_of_files DESC
"""

query_job = client.query(query)

query_df = query_job.to_dataframe()

In [15]:
query_df.head()


Unnamed: 0,license,number_of_files
0,mit,20560894
1,gpl-2.0,16608922
2,apache-2.0,7201141
3,gpl-3.0,5107676
4,bsd-3-clause,3465437
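
To see what the join and the `GROUP BY` are doing, it can help to run the same query shape on a few rows you can check by hand. The sketch below uses Python's built-in `sqlite3` module and tiny made-up versions of the two tables. The repo names and file paths are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE licenses     (repo_name TEXT, license TEXT);
    CREATE TABLE sample_files (repo_name TEXT, path TEXT);
    INSERT INTO licenses VALUES
        ('a/one', 'mit'), ('b/two', 'mit'),
        ('c/three', 'gpl-2.0'), ('d/four', 'apache-2.0');
    INSERT INTO sample_files VALUES
        ('a/one', 'README.md'), ('a/one', 'main.py'),
        ('b/two', 'index.js'),
        ('c/three', 'Makefile'),
        ('z/unlicensed', 'notes.txt');
""")

# Same shape as the BigQuery query: one joined row per file whose repo
# has a license, then one output row per license.
license_counts = conn.execute("""
    SELECT lic.license AS license, COUNT(1) AS number_of_files
    FROM licenses AS lic
    INNER JOIN sample_files AS smp
        ON lic.repo_name = smp.repo_name
    GROUP BY lic.license
    ORDER BY number_of_files DESC
""").fetchall()

print(license_counts)  # [('mit', 3), ('gpl-2.0', 1)]
```

Two things drop out of the toy result: `apache-2.0` (its repo has no files in `sample_files`) and the file from `z/unlicensed` (its repo has no row in `licenses`). An INNER JOIN keeps only rows with a match in both tables.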


## Exercises

[Stack Overflow](https://stackoverflow.com/) is a widely beloved question and answer site for technical questions. You'll probably use it yourself as you keep using SQL (or any programming language). 

Its data is publicly available. What do you think it could be used for?

Here's one idea:
You could set up a service that identifies the Stack Overflow users who have demonstrated expertise with a specific technology by answering related questions about it, so someone could hire those experts for in-depth help.

In this exercise, you'll write the SQL queries that might serve as the foundation for this type of service.

In [3]:
dataset_ref = client.dataset("stackoverflow", project="bigquery-public-data")

dataset = client.get_dataset(dataset_ref)

### 1) Explore the data

Before writing queries or **JOIN** clauses, you'll want to see what tables are available. 

In [4]:
tables = list(client.list_tables(dataset))

In [6]:
list_of_tables = []
for table in tables:
    print(table.table_id)
    list_of_tables.append(table.table_id)

badges
comments
post_history
post_links
posts_answers
posts_moderator_nomination
posts_orphaned_tag_wiki
posts_privilege_wiki
posts_questions
posts_tag_wiki
posts_tag_wiki_excerpt
posts_wiki_placeholder
stackoverflow_posts
tags
users
votes

### 2) Review relevant tables


In [8]:
answers_table_ref = dataset_ref.table("posts_answers")

answers_table = client.get_table(answers_table_ref)

client.list_rows(answers_table, max_results=5).to_dataframe()

Unnamed: 0,id,title,body,accepted_answer_id,answer_count,comment_count,community_owned_date,creation_date,favorite_count,last_activity_date,last_edit_date,last_editor_display_name,last_editor_user_id,owner_display_name,owner_user_id,parent_id,post_type_id,score,tags,view_count
0,18,,<p>For a table like this:</p>\n\n<pre><code>CR...,,,2,NaT,2008-08-01 05:12:44.193000+00:00,,2016-06-02 05:56:26.060000+00:00,2016-06-02 05:56:26.060000+00:00,Jeff Atwood,126039,phpguy,,17,2,59,,
1,165,,"<p>You can use a <a href=""http://sharpdevelop....",,,0,NaT,2008-08-01 18:04:25.023000+00:00,,2019-04-06 14:03:51.080000+00:00,2019-04-06 14:03:51.080000+00:00,,1721793,user2189331,,145,2,10,,
2,1028,,<p>The VB code looks something like this:</p>\...,,,0,NaT,2008-08-04 04:58:40.300000+00:00,,2013-02-07 13:22:14.680000+00:00,2013-02-07 13:22:14.680000+00:00,,395659,user2189331,,947,2,8,,
3,1073,,<p>My first choice would be a dedicated heap t...,,,0,NaT,2008-08-04 07:51:02.997000+00:00,,2015-09-01 17:32:32.120000+00:00,2015-09-01 17:32:32.120000+00:00,,45459,user2189331,,1069,2,29,,
4,1260,,<p>I found the answer. all you have to do is a...,,,0,NaT,2008-08-04 14:06:02.863000+00:00,,2016-12-20 08:38:48.867000+00:00,2016-12-20 08:38:48.867000+00:00,,1221571,Jin,,1229,2,1,,


It isn't clear yet how to find users who answered questions on any given topic. But `posts_answers` has a `parent_id` column. If you are familiar with the Stack Overflow site, you might figure out that the `parent_id` is the question each post is answering.

Look at `posts_questions` using the cell below.

In [10]:
questions_table_ref = dataset_ref.table("posts_questions")

questions_table = client.get_table(questions_table_ref)

client.list_rows(questions_table, max_results=5).to_dataframe()

Unnamed: 0,id,title,body,accepted_answer_id,answer_count,comment_count,community_owned_date,creation_date,favorite_count,last_activity_date,last_edit_date,last_editor_display_name,last_editor_user_id,owner_display_name,owner_user_id,parent_id,post_type_id,score,tags,view_count
0,320268,Html.ActionLink doesn’t render # properly,<p>When using Html.ActionLink passing a string...,,0,0,NaT,2008-11-26 10:42:37.477000+00:00,0,2009-02-06 20:13:54.370000+00:00,NaT,,,Paulo,,,1,0,asp.net-mvc,390
1,324003,Primitive recursion,<p>how will i define the function 'simplify' ...,,0,0,NaT,2008-11-27 15:12:37.497000+00:00,0,2012-09-25 19:54:40.597000+00:00,2012-09-25 19:54:40.597000+00:00,Marcin,1288.0,,41000.0,,1,0,haskell|lambda|functional-programming|lambda-c...,497
2,390605,While vs. Do While,<p>I've seen both the blocks of code in use se...,390608.0,0,0,NaT,2008-12-24 01:49:54.230000+00:00,2,2008-12-24 03:08:55.897000+00:00,NaT,,,Unkwntech,115.0,,1,0,language-agnostic|loops,11262
3,413246,Protect ASP.NET Source code,<p>Im currently doing some research in how to ...,,0,0,NaT,2009-01-05 14:23:51.040000+00:00,0,2009-03-24 21:30:22.370000+00:00,2009-01-05 14:42:28.257000+00:00,Tom Anderson,13502.0,Velnias,,,1,0,asp.net|deployment|obfuscation,4823
4,454921,"Difference between ""int[] myArray"" and ""int my...",<blockquote>\n <p><strong>Possible Duplicate:...,454928.0,0,0,NaT,2009-01-18 10:22:52.177000+00:00,0,2009-01-18 10:30:50.930000+00:00,2017-05-23 11:49:26.567000+00:00,,-1.0,Evan Fosmark,49701.0,,1,0,java|arrays,798


### 3) Selecting the right questions

A lot of this data is text. 

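A **WHERE** clause can limit your results to rows with certain text using the **LIKE** operator. In a **LIKE** pattern, `%` is a wildcard that matches any number of characters, so `'%bigquery%'` matches any value that contains the text "bigquery". The next cell uses this to select questions whose `tags` column mentions "bigquery".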
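
As a small illustration of how the `%` wildcard changes what **LIKE** matches, here is a sketch using Python's built-in `sqlite3` module and a made-up three-row table (the ids and tag strings are hypothetical). One caveat: SQLite's `LIKE` ignores case for ASCII letters, while BigQuery's is case-sensitive, so treat this only as a demonstration of the wildcard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts_questions (id INTEGER, tags TEXT);
    INSERT INTO posts_questions VALUES
        (1, 'google-bigquery|sql'),
        (2, 'python|pandas'),
        (3, 'bigquery');
""")

def matching_ids(pattern):
    """Return the ids of rows whose tags match the given LIKE pattern."""
    rows = conn.execute(
        "SELECT id FROM posts_questions WHERE tags LIKE ? ORDER BY id",
        (pattern,),
    ).fetchall()
    return [row[0] for row in rows]

exact = matching_ids("bigquery")       # no wildcard: the whole value must match
contains = matching_ids("%bigquery%")  # wildcard on both sides: substring match

print(exact)     # [3]
print(contains)  # [1, 3]
```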



In [11]:
query = """
        SELECT id, title, owner_user_id
        FROM `bigquery-public-data.stackoverflow.posts_questions`
        WHERE tags LIKE '%bigquery%'
        """

query_job = client.query(query)

query_df = query_job.to_dataframe()

In [12]:
query_df.head()

Unnamed: 0,id,title,owner_user_id
0,36163455,bigquery : data versioning and incremental upd...,6021401
1,36174202,How to connect spark to BigQuery using BigQuer...,2784956
2,35987304,Google Bigquery - Bulk Load,6029436
3,36188504,Google BigQuery Error: Resources exceeded duri...,1965449
4,35809021,How can I set expiration time for Big Query ta...,3741898
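
One thing to keep in mind: `LIKE '%bigquery%'` is a substring match, so it also picks up any longer tag that merely contains "bigquery". Judging from the sample rows above, the `tags` column separates tags with `|`. The sketch below contrasts the two kinds of match in plain Python. The helper `has_tag` and the sample strings are hypothetical.

```python
def has_tag(tags, tag):
    """Return True if the '|'-separated tags string contains tag as a whole tag."""
    return tag in tags.split("|")

# Hypothetical values in the style of the tags column.
sample_tags = ["google-bigquery|sql", "bigquery|python", "java|arrays"]

# Substring match, like tags LIKE '%bigquery%'.
substring_hits = [t for t in sample_tags if "bigquery" in t]

# Whole-tag match: only rows where one tag is exactly 'bigquery'.
whole_tag_hits = [t for t in sample_tags if has_tag(t, "bigquery")]

print(substring_hits)  # ['google-bigquery|sql', 'bigquery|python']
print(whole_tag_hits)  # ['bigquery|python']
```

For this exercise the broader substring match is what you want. If you ever need a whole-tag match in BigQuery, a condition along the lines of `'bigquery' IN UNNEST(SPLIT(tags, '|'))` should work.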


### 4) Your first join
Now that you have a query to select questions on any given topic (in this case, you chose "bigquery"), you can find the answers to those questions with a **JOIN**.  

Write a query that returns the `id`, `body` and `owner_user_id` columns from the `posts_answers` table for answers to "bigquery"-related questions. 
- You should have one row in your results for each answer to a question that has "bigquery" in the tags.  
- Remember you can get the tags for a question from the `tags` column in the `posts_questions` table.

Here's a reminder of what a **JOIN** looked like in the tutorial:
```
query = """
        SELECT p.Name AS Pet_Name, o.Name AS Owner_Name
        FROM `bigquery-public-data.pet_records.pets` as p
        INNER JOIN `bigquery-public-data.pet_records.owners` as o 
            ON p.ID = o.Pet_ID
        """
```

It may be useful to scroll up and review the first several rows of the `posts_answers` and `posts_questions` tables.  

In [18]:
query = """
        SELECT pa.id, pa.body, pa.owner_user_id
        FROM    `bigquery-public-data.stackoverflow.posts_answers` as pa
                INNER JOIN `bigquery-public-data.stackoverflow.posts_questions` as pq
                ON pa.parent_id = pq.id
        WHERE pq.tags LIKE '%bigquery%'
        """

query_job = client.query(query)

query_df = query_job.to_dataframe()

In [19]:
query_df

Unnamed: 0,id,body,owner_user_id
0,26841187,"<p>The problem is that ""destinationTable"" must...",3915486
1,26906162,<p>You need to first login in g+ then share vi...,2037889
2,26922819,"<p>This is what finally worked, modified from ...",2436922
3,26989406,<p>you are looking for:schema.fields[].descrip...,2881671
4,27003080,"<p>I.e. something like ""SELECT ThisRowCommitTi...",4267102
...,...,...,...
27802,24518205,<p>The issue is that the position function get...,1366527
27803,24903013,<p>I don't think there has been a syntax chang...,1366527
27804,24785113,"<p>As of this July 18 2014, all daily quotas a...",1366527
27805,24746718,<p>Analytic functions have an odd syntax... yo...,1366527


### 5) Answer the question
You have the join you need. But you want a list of users who have answered many questions, which takes a bit more work on top of your previous result.

Write a new query that has a single row for each user who answered at least one question with a tag that includes the string "bigquery". Your results should have two columns:
- `user_id` - contains the `owner_user_id` column from the `posts_answers` table
- `number_of_answers` - contains the number of answers the user has written to "bigquery"-related questions

In [20]:
query = """
        WITH ss AS (
        SELECT pa.id, pa.body, pa.owner_user_id
        FROM    `bigquery-public-data.stackoverflow.posts_answers` as pa
                INNER JOIN `bigquery-public-data.stackoverflow.posts_questions` as pq
                ON pa.parent_id = pq.id
        WHERE pq.tags LIKE '%bigquery%'
        )
        SELECT  owner_user_id as user_id,
                COUNT(1) as number_of_answers
        FROM ss
        GROUP BY user_id
        ORDER BY number_of_answers DESC
        """

query_job = client.query(query)

query_df = query_job.to_dataframe()

In [21]:
query_df

Unnamed: 0,user_id,number_of_answers
0,5221944,5203
1,1144035,1634
2,132438,898
3,6253347,737
4,1366527,620
...,...,...
6366,655489,1
6367,11255025,1
6368,6805427,1
6369,546061,1


### 6) Building a more generally useful service
How could you convert what you've done to a general function a website could call on the backend to get experts on any topic?

In [22]:
def expert_finder(topic, client):
    '''
    Returns a DataFrame with the user IDs who have written Stack Overflow answers on a topic.

    Inputs:
        topic: A string with the topic of interest
        client: A Client object that specifies the connection to the Stack Overflow dataset

    Outputs:
        results: A DataFrame with columns for user_id and number_of_answers. Follows the same logic as the query in step 5 above.
    '''
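    my_query = """
               SELECT a.owner_user_id AS user_id, COUNT(1) AS number_of_answers
               FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
               INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                   ON q.id = a.parent_id
               WHERE q.tags LIKE CONCAT('%', @topic, '%')
               GROUP BY a.owner_user_id
               """

    # Pass the topic as a named query parameter (@topic). A plain string
    # containing '{topic}' is never filled in, and pasting user input into
    # the SQL text would be unsafe anyway.
    # Also cap the bytes billed (a real service would have good error
    # handling for queries that scan too much data).
    safe_config = bigquery.QueryJobConfig(
        maximum_bytes_billed=10**10,
        query_parameters=[
            bigquery.ScalarQueryParameter("topic", "STRING", topic)
        ],
    )
    my_query_job = client.query(my_query, job_config=safe_config)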

    # API request - run the query, and return a pandas DataFrame
    results = my_query_job.to_dataframe()

    return results
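
When the topic comes from a website visitor, it is generally safer to bind it as a query parameter than to paste it into the SQL text. The sketch below shows the idea on toy data with Python's built-in `sqlite3` module, using a `?` placeholder. BigQuery offers the same idea through named parameters such as `@topic`, supplied via `bigquery.ScalarQueryParameter`. The function `count_answers_by_user` and all rows here are hypothetical stand-ins, not part of the exercise.

```python
import sqlite3

def count_answers_by_user(conn, topic):
    """Toy stand-in for expert_finder: topic is bound as a parameter."""
    query = """
        SELECT a.owner_user_id AS user_id, COUNT(1) AS number_of_answers
        FROM posts_questions AS q
        INNER JOIN posts_answers AS a
            ON q.id = a.parent_id
        WHERE q.tags LIKE '%' || ? || '%'
        GROUP BY a.owner_user_id
        ORDER BY number_of_answers DESC, user_id
    """
    return conn.execute(query, (topic,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts_questions (id INTEGER, tags TEXT);
    CREATE TABLE posts_answers (id INTEGER, parent_id INTEGER, owner_user_id INTEGER);
    INSERT INTO posts_questions VALUES
        (1, 'bigquery|sql'), (2, 'python'), (3, 'google-bigquery');
    INSERT INTO posts_answers VALUES
        (10, 1, 100), (11, 1, 200), (12, 3, 100), (13, 2, 300);
""")

experts = count_answers_by_user(conn, "bigquery")
print(experts)  # [(100, 2), (200, 1)]
```

User 300 only answered the `python` question, so they do not appear when the topic is "bigquery".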