# Kaggle Intro to SQL (and BigQuery)
- https://www.kaggle.com/learn/intro-to-sql

## 6. Joining Data
- Combining data sources is critical for almost all real-world data problems.

### Introduction

- If the data you want is spread across multiple tables, you must use __JOIN__ to combine them.

### Example

Suppose we have two tables: `pets` and `owners`.
- `pets` columns: ID, Name, Animal
- `owners` columns: ID, Name, Pet_ID

```python
query = '''
    SELECT p.Name AS Pet_Name, o.Name AS Owner_Name
    FROM `bigquery-public-data.pet_records.pets` AS p
    INNER JOIN `bigquery-public-data.pet_records.owners` AS o
        ON p.ID = o.Pet_ID '''
```
- __INNER JOIN__: a row appears in the final output table only if the value in the columns used for the join (here `p.ID` and `o.Pet_ID`) shows up in both tables. Pets without an owner and owners without a pet are dropped.
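To make the matching rule concrete, here is a minimal sketch that runs the same kind of join locally. It uses Python's built-in `sqlite3` module as a stand-in for BigQuery, and the rows in `pets` and `owners` are hypothetical sample data invented for illustration, not taken from the course.

```python
import sqlite3

# In-memory SQLite database used as a local stand-in for BigQuery.
conn = sqlite3.connect(":memory:")

# Hypothetical sample rows: pet 2 has no owner, and owner 3 has no pet.
conn.executescript("""
    CREATE TABLE pets (ID INTEGER, Name TEXT, Animal TEXT);
    CREATE TABLE owners (ID INTEGER, Name TEXT, Pet_ID INTEGER);
    INSERT INTO pets VALUES (1, 'Biscuit', 'Dog'),
                            (2, 'Mango', 'Cat'),
                            (3, 'Pebble', 'Rabbit');
    INSERT INTO owners VALUES (1, 'Ana', 1),
                              (2, 'Ben', 3),
                              (3, 'Cleo', NULL);
""")

# Same join shape as the query above, against the local tables.
rows = conn.execute("""
    SELECT p.Name AS Pet_Name, o.Name AS Owner_Name
    FROM pets AS p
    INNER JOIN owners AS o
        ON p.ID = o.Pet_ID
    ORDER BY p.ID
""").fetchall()

print(rows)  # [('Biscuit', 'Ana'), ('Pebble', 'Ben')]
```

Mango (no matching `Pet_ID`) and Cleo (`Pet_ID` is NULL) do not appear in the result, because neither has a match on the other side of the join.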

### Example: licenses, files, GitHub

- Most repos on GitHub are shared under a specific legal license, which determines the legal restrictions on how they can be used. In this example, we count how many files have been released under each license.
- We'll work with two tables in the database: `licenses` and `sample_files`.
- `licenses`: provides the name of each GitHub repo (in the `repo_name` column) and its corresponding license. Columns: repo_name, license.
- `sample_files`: provides the GitHub repo that each file belongs to (in the `repo_name` column). Columns: repo_name, ref, path, mode, id, symlink_target.

### Let's see the licenses table
from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client('jmproject86385')

# Construct a reference to the "github_repos" dataset
dataset_ref = client.dataset("github_repos", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# Construct a reference to the "licenses" table
table_ref = dataset_ref.table("licenses")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview a sample of rows taken from the first 1,000 rows of the "licenses" table
# (to preview only the first five rows: client.list_rows(table, max_results=5).to_dataframe())
client.list_rows(table, max_results=1_000).to_dataframe().iloc[[0, 5, 9, -9, -5, -1]]



| | repo_name | license |
|---|---|---|
| 0 | autarch/Dist-Zilla-Plugin-Test-TidyAll | artistic-2.0 |
| 5 | gitpan/Mojo-Server-FCGI | artistic-2.0 |
| 9 | elbehosg/sg_test1 | artistic-2.0 |
| 991 | denjello/brunch-with-crypto | artistic-2.0 |
| 995 | BarelyFunctional/DiffTool | artistic-2.0 |
| 999 | augint/rz.switchblade-osx | artistic-2.0 |


### Let's see the sample_files table
from google.cloud import bigquery

client = bigquery.Client('jmproject86385')
dataset_ref = client.dataset("github_repos", project="bigquery-public-data")
dataset = client.get_dataset(dataset_ref)
table_ref = dataset_ref.table("sample_files")
table = client.get_table(table_ref)
client.list_rows(table, max_results=1_000).to_dataframe().iloc[[0, 5, 9, -9, -5, -1]]



| | repo_name | ref | path | mode | id | symlink_target |
|---|---|---|---|---|---|---|
| 0 | EOL/eol | refs/heads/master | generate/vendor/railties | 40960 | 0338c33fb3fda57db9e812ac7de969317cad4959 | /usr/share/rails-ruby1.8/railties |
| 5 | np/ling | refs/heads/master | fixtures/compile/non_dep_recv_dom_args.ll | 40960 | a292d7932b3a57da966f25fb1bcf08622c594a12 | ../all/non_dep_recv_dom_args.ll |
| 9 | ello/Moya | refs/heads/master | Demo/Pods/Headers/Private/ReactiveCocoa/UICont... | 40960 | b140b74ce03183daffa941d8a9be7d0a147338db | ../../../ReactiveCocoa/ReactiveCocoa/Objective... |
| 991 | ozonos/ozon-icon-theme | refs/heads/master | Icons/Ozon-Grey/48x48/places/gtk-network.svg | 40960 | 7da8cb2e67f1ae4cfe264b52fd579783e66ceca2 | network.svg |
| 995 | varlesh/elementary-add | refs/heads/master | elementary-add/mimes/48/text-x-sql.svg | 40960 | 9bd77b4778327ef82972849122127fb7fe807220 | office-database.svg |
| 999 | varlesh/elementary-add | refs/heads/master | elementary-add/apps/48/gnome-settings-keybindi... | 40960 | f4f8197894ae13a634711a56ba15753f8e2b72fe | gnome-characters.svg |


And now the query that uses information from both tables to determine how many files are released under each license.

# Query to determine the number of files per license, sorted by number of files
query = '''
    SELECT L.license, COUNT(1) AS number_of_files
    FROM `bigquery-public-data.github_repos.sample_files` AS sf
    INNER JOIN `bigquery-public-data.github_repos.licenses` AS L
        ON sf.repo_name = L.repo_name
    GROUP BY L.license
    ORDER BY number_of_files DESC '''
# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(query, job_config=safe_config)
# API request - run the query, and convert the results to a pandas DataFrame
file_count_by_license = query_job.to_dataframe()
# Print the DataFrame
file_count_by_license

| | license | number_of_files |
|---|---|---|
| 0 | mit | 20560894 |
| 1 | gpl-2.0 | 16608922 |
| 2 | apache-2.0 | 7201141 |
| 3 | gpl-3.0 | 5107676 |
| 4 | bsd-3-clause | 3465437 |
| 5 | agpl-3.0 | 1372100 |
| 6 | lgpl-2.1 | 799664 |
| 7 | bsd-2-clause | 692357 |
| 8 | lgpl-3.0 | 582277 |
| 9 | mpl-2.0 | 457000 |
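
To see how the join and the `GROUP BY` work together, the sketch below runs the same query shape on a tiny local dataset. It uses Python's built-in `sqlite3` module instead of BigQuery, and the repo names, paths, and licenses are hypothetical values made up for illustration. Each file row is matched to its repo's license through `repo_name`, and the matched rows are then counted per license.

```python
import sqlite3

# In-memory SQLite database used as a local stand-in for BigQuery.
conn = sqlite3.connect(":memory:")

# Hypothetical sample data: 'd/four' has a file but no entry in licenses.
conn.executescript("""
    CREATE TABLE licenses (repo_name TEXT, license TEXT);
    CREATE TABLE sample_files (repo_name TEXT, path TEXT);
    INSERT INTO licenses VALUES ('a/one', 'mit'),
                                ('b/two', 'gpl-2.0'),
                                ('c/three', 'mit');
    INSERT INTO sample_files VALUES ('a/one', 'README.md'),
                                    ('a/one', 'main.py'),
                                    ('b/two', 'lib.c'),
                                    ('c/three', 'index.js'),
                                    ('d/four', 'orphan.txt');
""")

# Same structure as the BigQuery query: join on repo_name, then count per license.
file_counts = conn.execute("""
    SELECT L.license, COUNT(1) AS number_of_files
    FROM sample_files AS sf
    INNER JOIN licenses AS L
        ON sf.repo_name = L.repo_name
    GROUP BY L.license
    ORDER BY number_of_files DESC
""").fetchall()

print(file_counts)  # [('mit', 3), ('gpl-2.0', 1)]
```

The file in `d/four` is not counted under any license, because the inner join drops rows whose `repo_name` has no match in `licenses`.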
