# Kaggle Intro to SQL (and BigQuery)
- https://www.kaggle.com/learn/intro-to-sql

## Getting Started With SQL and BigQuery
- Learn the workflow for handling big datasets with BigQuery and SQL

In [1]:
from google.cloud import bigquery

To run this locally, first set up Application Default Credentials (ADC): https://cloud.google.com/docs/authentication/provide-credentials-adc#local-dev

C:\Users\jm\AppData\Local\Google\Cloud SDK>gcloud auth application-default login
Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=764086051850-6qr4p6gpi6hn506pt8ejuq83di341hur.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8085%2F&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fsqlservice.login+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth&state=ajIOntCs5MVRm1Cv5K4mjDyJKzvwdr&access_type=offline&code_challenge=5F0px6oGP6ZiE8bqyYkM0u0j3wnahqexweyvCSmWn_4&code_challenge_method=S256


Credentials saved to file: [C:\Users\jm\AppData\Roaming\gcloud\application_default_credentials.json]

These credentials will be used by any library that requests Application Default Credentials (ADC).
WARNING:
Cannot find a quota project to add to ADC. You might receive a "quota exceeded" or "API not enabled" error. Run $ gcloud auth application-default set-quota-project to add a quota project.
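The file named in the log above is where client libraries look when they request ADC. As a rough sketch of that lookup, the helper below builds the expected path. `adc_file_path` is a hypothetical name, not part of any Google library. The non-Windows location is assumed from Google's documented default rather than taken from this log, and the helper does not load or validate anything.

```python
import os
from pathlib import Path


def adc_file_path(env=None, os_name=None) -> Path:
    """Guess where Application Default Credentials would be read from.

    Hypothetical helper for illustration only: it builds a path and does
    not check that the credentials are valid.
    """
    env = os.environ if env is None else env
    os_name = os.name if os_name is None else os_name

    # An explicit GOOGLE_APPLICATION_CREDENTIALS setting takes precedence.
    explicit = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit:
        return Path(explicit)

    # Otherwise fall back to the file written by
    # `gcloud auth application-default login`.
    if os_name == "nt":
        base = Path(env.get("APPDATA", "")) / "gcloud"
    else:
        home = Path(env["HOME"]) if "HOME" in env else Path.home()
        base = home / ".config" / "gcloud"
    return base / "application_default_credentials.json"


path = adc_file_path()
print(path, "(exists)" if path.exists() else "(missing)")
```

If the file is reported as missing, rerunning `gcloud auth application-default login` is the usual first step.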

In [2]:
# Create a Client object; the project ID is passed explicitly
client = bigquery.Client(project='jmproject')



In [3]:
# Construct a reference to the "hacker_news" dataset
dataset_ref = client.dataset("hacker_news", project="bigquery-public-data")
# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

Every dataset is just a collection of tables. You can think of a dataset as a spreadsheet file containing multiple tables, all composed of rows and columns.

We use the list_tables() method to list the tables in the dataset.

In [4]:
# List all the tables in the "hacker_news" dataset
tables = list(client.list_tables(dataset))

# Print names of all tables in the dataset (only "full" at the time of this run)
for table in tables:
    print(table.table_id)

tables

full


[<google.cloud.bigquery.table.TableListItem at 0x1aad2be3f40>]

In [5]:
# Construct a reference to the "full" table
table_ref = dataset_ref.table("full")
# API request - fetch the table
table = client.get_table(table_ref)

table

Table(TableReference(DatasetReference('bigquery-public-data', 'hacker_news'), 'full'))
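As an alternative to chaining reference objects, `get_table()` also accepts a fully qualified ID string of the form `project.dataset.table`. Newer releases of google-cloud-bigquery also appear to flag `client.dataset()` as deprecated, so this may be the more durable route. The sketch below only builds that string. `qualified_table_id` is a hypothetical helper name, and the actual API call is left commented out because it needs credentials.

```python
def qualified_table_id(project: str, dataset_id: str, table_id: str) -> str:
    """Join the three parts into BigQuery's "project.dataset.table" form."""
    parts = (project, dataset_id, table_id)
    if not all(parts):
        raise ValueError("project, dataset_id and table_id must all be non-empty")
    return ".".join(parts)


full_id = qualified_table_id("bigquery-public-data", "hacker_news", "full")
print(full_id)  # bigquery-public-data.hacker_news.full

# With the `client` created earlier in this notebook, the table can then be
# fetched in one call (requires credentials, so left commented out here):
# table = client.get_table(full_id)
```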

## Table schema
The structure of a table is called its schema. We need to understand a table's schema to effectively pull out the data we want.

In this example, we'll investigate the "full" table that we fetched above.

In [6]:
# Print information on all the columns in the "full" table in the "hacker_news" dataset
table.schema

[SchemaField('title', 'STRING', 'NULLABLE', 'Story title', (), None),
 SchemaField('url', 'STRING', 'NULLABLE', 'Story url', (), None),
 SchemaField('text', 'STRING', 'NULLABLE', 'Story or comment text', (), None),
 SchemaField('dead', 'BOOLEAN', 'NULLABLE', 'Is dead?', (), None),
 SchemaField('by', 'STRING', 'NULLABLE', "The username of the item's author.", (), None),
 SchemaField('score', 'INTEGER', 'NULLABLE', 'Story score', (), None),
 SchemaField('time', 'INTEGER', 'NULLABLE', 'Unix time', (), None),
 SchemaField('timestamp', 'TIMESTAMP', 'NULLABLE', 'Timestamp for the unix time', (), None),
 SchemaField('type', 'STRING', 'NULLABLE', 'Type of details (comment, comment_ranking, poll, story, job, pollopt)', (), None),
 SchemaField('id', 'INTEGER', 'NULLABLE', "The item's unique id.", (), None),
 SchemaField('parent', 'INTEGER', 'NULLABLE', 'Parent comment ID', (), None),
 SchemaField('descendants', 'INTEGER', 'NULLABLE', 'Number of story or poll descendants', (), None),
 SchemaField
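The raw `SchemaField` repr is hard to scan. A minimal sketch of a more readable listing follows, assuming each field exposes `name`, `field_type`, `mode` and `description`, as `SchemaField` does. `describe_schema` and `sample_schema` are hypothetical names. The sample objects are stand-ins copied from two entries of the output above, so the block runs without BigQuery credentials.

```python
from types import SimpleNamespace


def describe_schema(schema):
    """Return one 'name  TYPE  MODE  description' line per field."""
    lines = []
    for field in schema:
        # name, field_type, mode and description are SchemaField attributes.
        lines.append(
            f"{field.name:<12} {field.field_type:<10} "
            f"{field.mode:<9} {field.description or ''}"
        )
    return lines


# Stand-in objects mirroring two entries of the schema output above.
# In the notebook, pass table.schema instead.
sample_schema = [
    SimpleNamespace(name="by", field_type="STRING", mode="NULLABLE",
                    description="The username of the item's author."),
    SimpleNamespace(name="time", field_type="INTEGER", mode="NULLABLE",
                    description="Unix time"),
]
for line in describe_schema(sample_schema):
    print(line)
```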

In [7]:
client.list_rows(table, max_results=5).to_dataframe()



| | title | url | text | dead | by | score | time | timestamp | type | id | parent | descendants | ranking | deleted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | I would rather just have wired earbuds, period... | | zeveb | | 1591717736 | 2020-06-09 15:48:56+00:00 | comment | 23467666 | 23456782 | | | |
| 1 | | | DNS? | | nly | | 1572810465 | 2019-11-03 19:47:45+00:00 | comment | 21436112 | 21435130 | | | |
| 2 | | | These benchmarks seem pretty good.  Filterable... | | mrkeen | | 1591717727 | 2020-06-09 15:48:47+00:00 | comment | 23467665 | 23467426 | | | |
| 3 | | | Oh really?\<p>\* Excel alone uses 86.1MB of priv... | | oceanswave | | 1462987532 | 2016-05-11 17:25:32+00:00 | comment | 11677248 | 11676886 | | | |
| 4 | | | These systems are useless.  Of the many flaws:... | | nyxxie | | 1572810473 | 2019-11-03 19:47:53+00:00 | comment | 21436113 | 21435025 | | | |
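The schema describes `time` as Unix time and `timestamp` as the timestamp for that value. A quick way to check that the two columns agree is to convert a `time` value by hand. The sketch below uses only the standard library and two values copied from the preview above. `unix_to_utc` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def unix_to_utc(seconds: int) -> datetime:
    """Convert Unix seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# `time` values copied from rows 0 and 1 of the preview above.
print(unix_to_utc(1591717736))  # 2020-06-09 15:48:56+00:00
print(unix_to_utc(1572810465))  # 2019-11-03 19:47:45+00:00
```

Both results match the `timestamp` column for those rows.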


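The list_rows() method also lets us look at just the information in specific columns, through its selected_fields argument. Note that table.schema[:1] selects the first field in the schema, which is title in this table (by sits at index 4), so the cell below previews the first five entries in the title column. The first rows of the table are comments, whose title is empty, so the values come back blank.

In [8]:
# Preview the first five entries in the first schema field ("title") of the "full" table
client.list_rows(table, selected_fields=table.schema[:1], max_results=5).to_dataframe()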



| | title |
|---|---|
| 0 | |
| 1 | |
| 2 | |
| 3 | |
| 4 | |
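Slicing `table.schema` by position depends on the column order, which is how a preview meant for `by` can end up showing `title`. One possible way around this is to pick fields by name. In the sketch below, `fields_named` is a hypothetical helper, and `sample_schema` is a stand-in holding the first five field names from this table so the block runs without credentials.

```python
from types import SimpleNamespace


def fields_named(schema, *names):
    """Return the schema fields whose name is in `names`, in the requested order."""
    by_name = {field.name: field for field in schema}
    missing = [n for n in names if n not in by_name]
    if missing:
        raise KeyError(f"not in schema: {missing}")
    return [by_name[n] for n in names]


# Stand-in schema with the first five field names of the "full" table.
sample_schema = [SimpleNamespace(name=n) for n in ("title", "url", "text", "dead", "by")]
print([f.name for f in fields_named(sample_schema, "by")])  # ['by']

# In the notebook (requires credentials, so left commented out here):
# client.list_rows(table, selected_fields=fields_named(table.schema, "by"),
#                  max_results=5).to_dataframe()
```

Raising on an unknown name is a deliberate choice here: a silent empty selection would be harder to notice than an error.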
