## Introduction

BigQuery is aimed mainly at enterprise data and can process very large datasets in little time. I haven't dived deep into BigQuery yet, but it is similar to Hive. So the only question is: how do we use it?

There are three ways:
* Web UI
* Command line
* REST API

I will use Python to interact with BigQuery and run queries.

In [0]:
! pip install google-cloud-bigquery



## Make data ready

I created a dataset (named `database_tutorial`) and a table with the web UI, so we can now query them from Python. To learn how to create a dataset and table with the web UI, follow this [link](https://cloud.google.com/bigquery/docs/quickstarts/quickstart-web-ui).

In [0]:
# before doing anything else, point the client library at the credentials file
import os

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = [x for x in os.listdir('.') if x.endswith('json')][0]
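
Picking the first `.json` file in the directory can select the wrong file when other JSON files are present. A slightly more defensive sketch follows, assuming the key is a standard service-account key file with a top-level `"type": "service_account"` field. The helper name `find_service_account_key` is hypothetical and not part of any Google library.

```python
import json
import os


def find_service_account_key(directory="."):
    """Return the path of the first JSON file that looks like a service-account key.

    Hypothetical helper: it assumes key files carry
    "type": "service_account" at the top level.
    """
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".json"):
            continue
        path = os.path.join(directory, name)
        try:
            with open(path, encoding="utf-8") as handle:
                content = json.load(handle)
        except (OSError, ValueError):
            continue  # unreadable or not valid JSON: skip it
        if isinstance(content, dict) and content.get("type") == "service_account":
            return os.path.abspath(path)
    raise FileNotFoundError(f"no service-account key found in {directory!r}")
```

It could then be used as `os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = find_service_account_key('.')`, which fails loudly when no key file is found instead of raising an `IndexError`.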

In [0]:
from google.cloud import bigquery

project_id = "cloudtutorial-279003" 
# init client
client = bigquery.Client(project_id)

query = "select sales_region, count(1) as num from database_tutorial.query_table group by sales_region"

query_job = client.query(query)

# get result
result = query_job.result()

In [0]:
# let's inspect the result object
# it is an iterator (a RowIterator); to_dataframe converts it into a DataFrame.
print(type(result))
print(dir(result))

<class 'google.cloud.bigquery.table.RowIterator'>
['_DEFAULT_ITEMS_KEY', '_HTTP_METHOD', '_MAX_RESULTS', '_NEXT_TOKEN', '_PAGE_TOKEN', '_RESERVED_PARAMS', '__abstractmethods__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_cache', '_abc_negative_cache', '_abc_negative_cache_version', '_abc_registry', '_field_to_index', '_get_next_page_response', '_get_progress_bar', '_get_query_params', '_has_next_page', '_items_iter', '_items_key', '_next_page', '_next_token', '_page_iter', '_page_size', '_page_start', '_preserve_order', '_project', '_schema', '_selected_fields', '_started', '_table', '_to_arrow_iterable', '_to_dataframe_iterable', '_to_page_iterable', '_total_rows', '_veri

In [0]:
# let's make it into a dataframe
df = result.to_dataframe()

df.head()

  sales_region  num
0     Region_1    3
1     Region_2   12
2     Region_3    4
3     Region_4    3
4     Region_5    2


In [0]:
# one more thing: result is an iterator, so once it has been consumed it cannot be read again.
# to read the rows again, request a new result from the job.
result_new = query_job.result()

for row in result_new:
  print(row.sales_region, row.num)

Region_1 3
Region_2 12
Region_3 4
Region_4 3
Region_5 2
Region_6 13
Region_7 9
Region_8 8
Region_9 2
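
If the rows are needed more than once, one option is to copy them into a list on the first pass instead of requesting a new result. The sketch below uses a plain Python generator as a stand-in for the row iterator, so it runs without BigQuery access; `fake_rows` and `materialize` are hypothetical names used only for illustration.

```python
def fake_rows():
    """Stand-in for a one-shot row iterator (not a real BigQuery object)."""
    yield {"sales_region": "Region_1", "num": 3}
    yield {"sales_region": "Region_2", "num": 12}


def materialize(rows):
    """Consume a one-shot iterator once and keep the rows in a list."""
    return list(rows)


rows = materialize(fake_rows())

# A list can be traversed as many times as needed.
first_pass = [row["sales_region"] for row in rows]
second_pass = [row["num"] for row in rows]
print(first_pass, second_pass)
```

With a real job the equivalent would be along the lines of `rows = list(query_job.result())`. This keeps every row in memory, so it is only reasonable for small results like the one here.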


## BigQuery with command line

We can also interact with BigQuery from the command line.

In [0]:
# before using the gcloud command, we first have to authenticate.
! gcloud auth login

Go to the following link in your browser:

    https://accounts.google.com/o/oauth2/auth?client_id=32555940559.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fappengine.admin+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcompute+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth&code_challenge=AMgRyG_twFRoc4S3CPgKVt0mRgr_Tdsd4CARrgt4EIc&code_challenge_method=S256&access_type=offline&response_type=code&prompt=select_account


Enter verification code: 4/0QHOpikCdJTLwS3XfLYn66w19c4G_PW_XIggtN1NedtaFQhLosEnWtc

You are now logged in as [gqianglu1990@gmail.com].
Your current project is [None].  You can change this setting by running:
  $ gcloud config set project PROJECT_ID


In [0]:
# the table reference (project:dataset.table) must not contain any spaces
! bq show cloudtutorial-279003:database_tutorial.query_table

Table cloudtutorial-279003:database_tutorial.query_table

   Last modified            Schema            Total Rows   Total Bytes   Expiration   Time Partitioning   Clustered Fields   Labels  
 ----------------- ------------------------- ------------ ------------- ------------ ------------------- ------------------ -------- 
  03 Jun 01:24:45   |- state_id: integer      56           1887                                                                      
                    |- state_code: string                                                                                            
                    |- state_name: string                                                                                            
                    |- sales_region: string                                                                                          



In [0]:
# let's query with command line
!bq query --use_legacy_sql=false \
"select sales_region, count(1) as num from database_tutorial.query_table group by sales_region"

Waiting on bqjob_r45b4568b13c95ad4_0000017277d9037a_1 ... (0s) Current status: DONE   
+--------------+-----+
| sales_region | num |
+--------------+-----+
| Region_1     |   3 |
| Region_2     |  12 |
| Region_3     |   4 |
| Region_4     |   3 |
| Region_5     |   2 |
| Region_6     |  13 |
| Region_7     |   9 |
| Region_8     |   8 |
| Region_9     |   2 |
+--------------+-----+
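
The same command can also be driven from Python, which helps when the result has to be processed further rather than just displayed. This is a minimal sketch, assuming the `bq` CLI is installed and authenticated and that its global `--quiet` and `--format=json` flags leave only the JSON result on stdout. The function names are hypothetical.

```python
import json
import subprocess


def build_bq_query_command(sql):
    """Return the argument list for running `sql` with standard SQL and JSON output."""
    return ["bq", "--quiet", "--format=json", "query", "--use_legacy_sql=false", sql]


def run_bq_query(sql):
    """Run the query through the bq CLI and parse the JSON it prints.

    Assumes bq is on PATH and prints only the JSON result on stdout.
    """
    completed = subprocess.run(
        build_bq_query_command(sql),
        check=True,           # raise if bq exits with a non-zero status
        capture_output=True,
        text=True,
    )
    return json.loads(completed.stdout)


sql = (
    "select sales_region, count(1) as num "
    "from database_tutorial.query_table group by sales_region"
)
command = build_bq_query_command(sql)
print(command)
```

Passing the arguments as a list avoids shell quoting problems with the SQL string. Calling `run_bq_query(sql)` would then execute the query, but only in an environment where `bq` is configured.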


## Create dataset with command

Let's create a new dataset and table from the command line.


In [0]:
# first, list the datasets we already have
!bq ls 

      datasetId      
 ------------------- 
  database_tutorial  


In [0]:
# let's create a dataset
! bq mk command_dataset

Dataset 'cloudtutorial-279003:command_dataset' successfully created.


In [0]:
# let's check
!  bq ls 

      datasetId      
 ------------------- 
  command_dataset    
  database_tutorial  


In [0]:
# before uploading, we need the column names and data types to build the schema
import pandas as pd

file_name = [x for x in os.listdir('.') if x.endswith('csv')][0]
print(file_name)
df = pd.read_csv(file_name)

df.dtypes

bigqurey_data.csv


state_id         int64
state_code      object
state_name      object
sales_region    object
dtype: object
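
The `dtypes` output above is what the inline schema for `bq load` is derived from. As a rough sketch, the schema string can be generated from the dtype names. The mapping below is an assumption that covers only a few common pandas dtypes, and `schema_from_dtypes` is a hypothetical helper. It takes plain strings so that it runs without pandas; with a DataFrame, something like `{col: str(t) for col, t in df.dtypes.items()}` would produce the input.

```python
# Assumed mapping from pandas dtype names to BigQuery column types.
DTYPE_TO_BIGQUERY = {
    "int64": "integer",
    "float64": "float",
    "bool": "boolean",
    "object": "string",
}


def schema_from_dtypes(dtypes):
    """Build a `name:type,name:type` schema string with no spaces."""
    parts = []
    for column, dtype_name in dtypes.items():
        # Unknown dtypes fall back to string, which may not be what you want.
        parts.append(f"{column}:{DTYPE_TO_BIGQUERY.get(dtype_name, 'string')}")
    return ",".join(parts)


dtypes = {
    "state_id": "int64",
    "state_code": "object",
    "state_name": "object",
    "sales_region": "object",
}
schema = schema_from_dtypes(dtypes)
print(schema)
# prints state_id:integer,state_code:string,state_name:string,sales_region:string
```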

In [0]:
# the dataset already exists, so let's load the data into BigQuery.
# the data file has been uploaded to Colab, so we can load it directly.
# this creates the table and loads the data into it.
# arguments, in order: table name, data file, schema
# (the schema must not contain any spaces; comments cannot follow the trailing backslashes)
! bq load --source_format=CSV --skip_leading_rows=1 \
command_dataset.query_table \
./bigqurey_data.csv \
state_id:integer,state_code:string,state_name:string,sales_region:string


Upload complete.
Waiting on bqjob_r6e0a07e401b7dc04_0000017277eb5ff3_1 ... (0s) Current status: DONE   


In [0]:
# let's check with table information
! bq show cloudtutorial-279003:command_dataset.query_table

Table cloudtutorial-279003:command_dataset.query_table

   Last modified            Schema            Total Rows   Total Bytes   Expiration   Time Partitioning   Clustered Fields   Labels  
 ----------------- ------------------------- ------------ ------------- ------------ ------------------- ------------------ -------- 
  03 Jun 02:03:36   |- state_id: integer      56           1887                                                                      
                    |- state_code: string                                                                                            
                    |- state_name: string                                                                                            
                    |- sales_region: string                                                                                          



In [0]:
# let's query the new table; --nouse_legacy_sql makes the query use standard SQL syntax
# any query could be run here...
! bq query --nouse_legacy_sql "select * from command_dataset.query_table limit 4"

Waiting on bqjob_r615b98fb64e94e97_0000017277eda9a1_1 ... (0s) Current status: DONE   
+----------+------------+----------------+--------------+
| state_id | state_code |   state_name   | sales_region |
+----------+------------+----------------+--------------+
|        1 | MO         | Missouri       | Region_1     |
|        2 | SC         | South Carolina | Region_1     |
|        3 | IN         | Indiana        | Region_1     |
|        6 | DE         | Delaware       | Region_2     |
+----------+------------+----------------+--------------+


In [0]:
# let's remove the dataset; -r also removes the tables inside it
! bq rm -r command_dataset

rm: remove dataset 'cloudtutorial-279003:command_dataset'? (y/N) y
