[View in Colaboratory](https://colab.research.google.com/github/lynnlangit/gcp-ml/blob/master/Getting_started_with_BigQuery.ipynb)

# Before you begin


1.   Use the [Cloud Resource Manager](https://console.cloud.google.com/cloud-resource-manager) to create a Cloud Platform project if you do not already have one.
2.   [Enable billing](https://support.google.com/cloud/answer/6293499#enable-billing) for the project.
3.   [Enable the BigQuery API](https://console.cloud.google.com/flows/enableapi?apiid=bigquery) for the project.


### Declare the Cloud project ID used throughout this notebook

In [0]:
project_id = '[your project ID]'
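
Queries will fail if the placeholder above is left unchanged. The sketch below is an optional sanity check. The helper name `looks_like_project_id` is hypothetical, and the pattern assumes the common project ID format (6 to 30 characters, lowercase letters, digits and hyphens, starting with a letter and not ending with a hyphen). Older domain-scoped IDs such as `example.com:my-project` would not match.

```python
import re

# Assumed format: 6-30 characters, lowercase letters, digits and hyphens,
# starting with a letter and not ending with a hyphen.
_PROJECT_ID_RE = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def looks_like_project_id(value):
    """Return True if `value` matches the assumed project ID format."""
    return _PROJECT_ID_RE.fullmatch(value) is not None


# The unedited placeholder is rejected; a plausible ID is accepted.
print(looks_like_project_id('[your project ID]'))
print(looks_like_project_id('my-sample-project'))
```

In the notebook, `assert looks_like_project_id(project_id)` right after the declaration would catch a forgotten placeholder early.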

### Provide your credentials to the runtime

In [0]:
from google.colab import auth
auth.authenticate_user()

# Use BigQuery via magics

The `google.cloud.bigquery` library includes a `%%bigquery` cell magic that runs a query and displays the result, optionally saving it to a variable (here `df`) as a pandas `DataFrame`. Replace `yourprojectid` in the cell below with your own project ID.

In [0]:
%%bigquery --project yourprojectid df
SELECT 
  COUNT(*) as total_rows
FROM `bigquery-public-data.samples.gsod`

| | total_rows |
|---|---|
| 0 | 114420316 |


In [0]:
df

| | total_rows |
|---|---|
| 0 | 114420316 |
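
For use outside a notebook cell, the magic can be approximated with the client library directly. This is a minimal sketch rather than the magic's actual implementation. The helper name `query_to_dataframe` is hypothetical, and it only relies on `Client.query()` returning a job that offers `to_dataframe()`.

```python
COUNT_SQL = """
SELECT
  COUNT(*) AS total_rows
FROM `bigquery-public-data.samples.gsod`
"""


def query_to_dataframe(client, sql):
    """Run `sql` with a BigQuery client and return the result as a DataFrame."""
    job = client.query(sql)    # starts the query job
    return job.to_dataframe()  # waits for completion and downloads the rows


# In the notebook (needs credentials and network access):
#   from google.cloud import bigquery
#   df = query_to_dataframe(bigquery.Client(project=project_id), COUNT_SQL)
```

Passing the client in as an argument keeps the helper independent of how the runtime was authenticated.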


# Use BigQuery through Pandas

[Pandas GBQ Documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_gbq.html)

The [GSOD sample table](https://bigquery.cloud.google.com/table/bigquery-public-data:samples.gsod) contains weather information collected by NOAA, such as precipitation amounts and wind speeds from late 1929 to early 2010.


### Sample approximately 2000 random rows

In [0]:
import pandas as pd

sample_count = 2000
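# Count the rows first; the total sets the sampling probability used below.
row_count = pd.read_gbq('''
  SELECT
    COUNT(*) AS total
  FROM `bigquery-public-data.samples.gsod`''', project_id=project_id, dialect='standard').total[0]

# Keep each row with probability sample_count / row_count, so the sample size is approximate.
df = pd.read_gbq('''
  SELECT
    *
  FROM
    `bigquery-public-data.samples.gsod`
  WHERE RAND() < %d/%d
''' % (sample_count, row_count), project_id=project_id, dialect='standard')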

print('Full dataset has %d rows' % row_count)

Full dataset has 114420316 rows
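
The `RAND()` filter keeps each row independently, so the sample size follows a binomial distribution and will rarely be exactly 2000. The sketch below works out the expected size and its standard deviation from the row count printed above. It is plain arithmetic and does not query BigQuery.

```python
import math

sample_count = 2000
row_count = 114420316  # total rows reported by the COUNT(*) query

# Probability that any single row passes the RAND() filter.
keep_probability = sample_count / row_count

# Mean and standard deviation of a binomial(row_count, keep_probability) count.
expected_rows = row_count * keep_probability
std_rows = math.sqrt(row_count * keep_probability * (1 - keep_probability))

print('Expected sample size: %.1f rows' % expected_rows)
print('Standard deviation: %.1f rows' % std_rows)
```

The standard deviation comes out at roughly 45 rows, so the 1933 rows returned in the run recorded in this notebook (see the `count` row of `df.describe()`) sit about 1.5 standard deviations below the target.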


### Describe the sampled data

In [0]:
df.describe()

| | station_number | wban_number | year | month | day | mean_temp | num_mean_temp_samples | mean_dew_point | num_mean_dew_point_samples | mean_sealevel_pressure | ... | mean_visibility | num_mean_visibility_samples | mean_wind_speed | num_mean_wind_speed_samples | max_sustained_wind_speed | max_gust_wind_speed | max_temperature | min_temperature | total_precipitation | snow_depth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1933.0 | 1933.0 | 1933.0 | 1933.0 | 1933.0 | 1933.0 | 1933.0 | 1848.0 | 1848.0 | 1465.0 | ... | 1747.0 | 1747.0 | 1914.0 | 1914.0 | 1885.0 | 261.0 | 1932.0 | 0.0 | 1777.0 | 113.0 |
| mean | 505374.663735 | 89885.038282 | 1986.491981 | 6.500259 | 15.838076 | 52.308277 | 12.921883 | 41.841234 | 12.912879 | 1014.935359 | ... | 12.262736 | 12.459073 | 6.932602 | 12.8814 | 12.330239 | 25.877395 | 43.592754 | | 0.081322 | 10.19469 |
| std | 302512.051035 | 26937.098157 | 16.49652 | 3.448471 | 8.770726 | 23.873779 | 7.930051 | 21.898547 | 7.964763 | 9.237936 | ... | 9.747367 | 7.850253 | 5.044049 | 7.92061 | 6.889988 | 10.191199 | 23.680489 | | 0.423668 | 11.434962 |
| min | 10050.0 | 13.0 | 1937.0 | 1.0 | 1.0 | -96.5 | 4.0 | -52.599998 | 4.0 | 931.400024 | ... | 0.0 | 4.0 | 0.0 | 4.0 | 1.0 | 7.8 | -100.300003 | | 0.0 | 0.4 |
| 25% | 249510.0 | 99999.0 | 1977.0 | 4.0 | 8.0 | 38.200001 | 7.0 | 29.700001 | 7.0 | 1009.900024 | ... | 6.4 | 6.0 | 3.5 | 7.0 | 7.8 | 19.0 | 31.1 | | 0.0 | 2.4 |
| 50% | 516440.0 | 99999.0 | 1989.0 | 6.0 | 16.0 | 54.599998 | 8.0 | 43.900002 | 8.0 | 1014.700012 | ... | 9.7 | 8.0 | 5.8 | 8.0 | 11.1 | 24.9 | 45.950001 | | 0.0 | 6.3 |
| 75% | 725805.0 | 99999.0 | 2000.0 | 10.0 | 23.0 | 71.0 | 23.0 | 57.599998 | 23.25 | 1019.900024 | ... | 14.9 | 23.0 | 9.2 | 23.0 | 15.5 | 30.9 | 60.799999 | | 0.01 | 13.0 |
| max | 999999.0 | 99999.0 | 2010.0 | 12.0 | 31.0 | 100.5 | 24.0 | 80.400002 | 24.0 | 1056.599976 | ... | 99.400002 | 24.0 | 52.099998 | 24.0 | 70.900002 | 71.900002 | 89.099998 | | 10.31 | 59.099998 |
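
The `count` row differs between columns because `describe()` counts only non-missing values, which is why `min_temperature` reports a count of 0 and no other statistics. The sketch below reproduces that behaviour on a small made-up frame. The values are illustrative and were not queried from the GSOD table.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values only): one complete column, one with a gap,
# and one that is missing everywhere.
toy = pd.DataFrame({
    'mean_temp': [52.3, 38.2, 71.0, 54.6],
    'mean_dew_point': [41.8, np.nan, 57.6, 43.9],
    'min_temperature': [np.nan, np.nan, np.nan, np.nan],
})

summary = toy.describe()

# `count` ignores NaN, so it varies per column; an all-NaN column has
# count 0 and NaN for every other statistic.
print(summary.loc['count'])
```

Comparing `count` against `len(df)` is a quick way to spot sparsely populated columns before analysing them.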


### View the first 10 rows

In [0]:
df.head(10)

| | station_number | wban_number | year | month | day | mean_temp | num_mean_temp_samples | mean_dew_point | num_mean_dew_point_samples | mean_sealevel_pressure | ... | min_temperature | min_temperature_explicit | total_precipitation | snow_depth | fog | rain | snow | hail | thunder | tornado |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 645560 | 99999 | 1976 | 5 | 15 | 75.800003 | 6 | 71.0 | 6.0 | | ... | | | 0.0 | | True | True | True | True | True | True |
| 1 | 911610 | 99999 | 1982 | 1 | 31 | 51.400002 | 4 | 45.5 | 4.0 | | ... | | | 0.12 | | False | False | False | False | False | False |
| 2 | 766131 | 99999 | 1988 | 4 | 21 | 67.0 | 11 | 34.200001 | 4.0 | | ... | | | 0.0 | | False | False | False | False | False | False |
| 3 | 744900 | 14702 | 1991 | 11 | 4 | 44.299999 | 16 | 32.5 | 16.0 | | ... | | | 0.0 | | False | False | False | False | False | False |
| 4 | 423690 | 99999 | 2005 | 1 | 24 | 57.599998 | 16 | 48.599998 | 16.0 | | ... | | | 0.0 | | True | True | True | True | True | True |
| 5 | 946310 | 99999 | 2008 | 9 | 9 | 52.099998 | 8 | 46.200001 | 8.0 | | ... | | | 0.0 | | False | False | False | False | False | False |
| 6 | 898680 | 99999 | 1999 | 9 | 6 | -25.299999 | 17 | | | | ... | | | 0.0 | | False | False | False | False | False | False |
| 7 | 477590 | 99999 | 1962 | 4 | 14 | 50.0 | 4 | 34.299999 | 4.0 | 1014.900024 | ... | | | 0.0 | | False | False | False | False | False | False |
| 8 | 128050 | 99999 | 1988 | 5 | 3 | 61.700001 | 4 | 45.900002 | 4.0 | 1008.400024 | ... | | | 0.0 | | False | False | False | False | False | False |
| 9 | 232190 | 99999 | 1989 | 5 | 18 | 47.200001 | 7 | 32.5 | 7.0 | 1009.700012 | ... | | | 0.0 | | False | False | False | False | False | False |


In [0]:
# 10 highest total_precipitation samples
df.sort_values('total_precipitation', ascending=False).head(10)[['station_number', 'year', 'month', 'day', 'total_precipitation']]

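| | station_number | year | month | day | total_precipitation |
|---|---|---|---|---|---|
| 1252 | 319090 | 1984 | 6 | 10 | 10.31 |
| 946 | 614010 | 1978 | 2 | 11 | 7.87 |
| 602 | 284280 | 1970 | 3 | 21 | 5.91 |
| 101 | 605400 | 2003 | 10 | 7 | 4.02 |
| 1401 | 349290 | 1959 | 9 | 8 | 2.95 |
| 413 | 218240 | 1959 | 10 | 13 | 2.95 |
| 105 | 804370 | 2005 | 12 | 6 | 2.72 |
| 1837 | 161340 | 1991 | 10 | 25 | 2.44 |
| 947 | 683000 | 1974 | 3 | 3 | 2.2 |
| 797 | 725990 | 1973 | 4 | 12 | 2.13 |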
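
The same top-N selection can also be written with `DataFrame.nlargest`, which avoids sorting the whole frame. The sketch below uses a small made-up frame, so the station numbers and precipitation values are illustrative only. Rows with a missing `total_precipitation` are not selected when enough non-missing rows exist.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values only), including one missing measurement.
toy = pd.DataFrame({
    'station_number': [1, 2, 3, 4, 5],
    'total_precipitation': [0.0, 2.5, np.nan, 10.31, 0.12],
})

# Pick the 3 rows with the highest total_precipitation, largest first.
top = toy.nlargest(3, 'total_precipitation')[['station_number', 'total_precipitation']]
print(top)
```

Applied to the sampled data, the equivalent call would be `df.nlargest(10, 'total_precipitation')` followed by the same column selection. Rows with tied values may come back in a different order than with `sort_values`.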


# Use BigQuery through google.cloud.bigquery

[Documentation](https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html)

In [0]:
from google.cloud import bigquery

client = bigquery.Client(project=project_id)

for dataset in client.list_datasets():
  print(dataset.dataset_id)
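
`list_datasets()` only covers datasets in your own project, which may be empty for a new project. To look inside a public dataset such as the one queried above, a similar loop over `Client.list_tables()` can be used. The sketch below wraps it in a hypothetical helper, `list_table_ids`, that accepts any client object so it can be reused across cells.

```python
def list_table_ids(client, dataset):
    """Return the sorted table IDs in `dataset` using an existing BigQuery client.

    `dataset` may be a 'project.dataset' string such as
    'bigquery-public-data.samples'.
    """
    return sorted(table.table_id for table in client.list_tables(dataset))


# In the notebook (needs credentials and network access):
#   print(list_table_ids(client, 'bigquery-public-data.samples'))
```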