<a href="https://colab.research.google.com/github/paulboal/hds5210-2023-private/blob/main/week15/module45_bigquery.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Before you begin


1.   Use the [Cloud Resource Manager](https://console.cloud.google.com/cloud-resource-manager) to create a Cloud Platform project if you do not already have one.
2.   [Enable billing](https://support.google.com/cloud/answer/6293499#enable-billing) for the project.
3.   [Enable BigQuery](https://console.cloud.google.com/flows/enableapi?apiid=bigquery) APIs for the project.


### Provide your credentials to the runtime

In [1]:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


## Optional: Enable data table display

Colab includes the ``google.colab.data_table`` package that can be used to display large pandas dataframes as an interactive data table.
It can be enabled with:

In [2]:
%load_ext google.colab.data_table

If you would prefer to return to the classic pandas DataFrame display, you can disable this by running:
```python
%unload_ext google.colab.data_table
```

# Use BigQuery via magics

The `google.cloud.bigquery` library includes a `%%bigquery` cell magic that runs a query and either displays the result or saves it to a variable as a `DataFrame`.

In [4]:
# Display query output immediately
# Replace YOUR_PROJECT with your own project ID (for example, hds5210-tracker)

%%bigquery --project YOUR_PROJECT
SELECT
  COUNT(*) as total_rows
FROM `bigquery-public-data.samples.gsod`

Unnamed: 0,total_rows
0,114420316


In [5]:
# Save output in a variable `df`

%%bigquery df --project YOUR_PROJECT
SELECT
  COUNT(*) as total_rows
FROM `bigquery-public-data.samples.gsod`

In [6]:
df

Unnamed: 0,total_rows
0,114420316


# Use BigQuery through google-cloud-bigquery

See [BigQuery documentation](https://cloud.google.com/bigquery/docs) and [library reference documentation](https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html).

The [GSOD sample table](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=samples&t=gsod&page=table) contains weather information collected by NOAA, such as precipitation amounts and wind speeds from late 1929 to early 2010.


### Declare the Cloud project ID which will be used throughout this notebook

In [9]:
project_id = 'YOUR_PROJECT'

### Sample approximately 2000 random rows

In [10]:
from google.cloud import bigquery

client = bigquery.Client(project=project_id)

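sample_count = 2000

# Count the rows in the table so the sampling probability can be computed.
row_count = client.query('''
  SELECT
    COUNT(*) as total
  FROM `bigquery-public-data.samples.gsod`''').to_dataframe().total[0]

# Keep each row independently with probability sample_count / row_count,
# which yields roughly (not exactly) sample_count rows.
df = client.query('''
  SELECT
    *
  FROM
    `bigquery-public-data.samples.gsod`
  WHERE RAND() < %d/%d
''' % (sample_count, row_count)).to_dataframe()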

print('Full dataset has %d rows' % row_count)

Full dataset has 114420316 rows
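
The cell above builds its SQL with `%` string formatting. As an alternative, the sampling fraction can be passed as a named query parameter. The sketch below assumes the same `client` object as above; the helper names `sampling_fraction` and `sample_rows` and the constant `SAMPLE_SQL` are hypothetical and introduced only for illustration.

```python
SAMPLE_SQL = '''
  SELECT *
  FROM `bigquery-public-data.samples.gsod`
  WHERE RAND() < @fraction
'''


def sampling_fraction(sample_count, row_count):
    """Probability of keeping each row so that about sample_count rows remain."""
    if row_count <= 0:
        raise ValueError("row_count must be positive")
    # Cap at 1.0 so a request larger than the table simply keeps every row.
    return min(1.0, sample_count / row_count)


def sample_rows(client, sample_count, row_count):
    """Run the sampling query with the fraction bound as a FLOAT64 parameter."""
    # Imported here so the pure helper above can be used without the library.
    from google.cloud import bigquery

    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter(
                "fraction", "FLOAT64", sampling_fraction(sample_count, row_count)
            ),
        ]
    )
    return client.query(SAMPLE_SQL, job_config=job_config).to_dataframe()
```

With the variables from the cell above, this would be called as `df = sample_rows(client, sample_count, row_count)`.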


### Describe the sampled data

In [11]:
df.describe()



Unnamed: 0,station_number,wban_number,year,month,day,mean_temp,num_mean_temp_samples,mean_dew_point,num_mean_dew_point_samples,mean_sealevel_pressure,...,mean_visibility,num_mean_visibility_samples,mean_wind_speed,num_mean_wind_speed_samples,max_sustained_wind_speed,max_gust_wind_speed,max_temperature,min_temperature,total_precipitation,snow_depth
count,1943.0,1943.0,1943.0,1943.0,1943.0,1943.0,1943.0,1854.0,1854.0,1471.0,...,1717.0,1717.0,1912.0,1912.0,1888.0,263.0,1943.0,0.0,1776.0,94.0
mean,515410.151312,89654.097787,1987.404014,6.603706,15.586207,52.184714,13.038085,41.670982,12.935275,1014.884025,...,11.897903,12.648224,6.647333,13.005753,11.979555,25.581749,43.649871,,0.086616,11.565957
std,299820.894825,27225.747829,16.109017,3.460149,8.795283,25.375585,7.827872,23.756412,7.838295,9.055176,...,8.71476,7.710659,4.740706,7.808639,6.554459,9.04687,25.481299,,0.436317,11.436854
min,10010.0,17.0,1937.0,1.0,1.0,-57.599998,4.0,-62.0,4.0,972.200012,...,0.0,4.0,0.0,4.0,1.0,6.0,-64.099998,,0.0,0.4
25%,249750.0,99999.0,1977.0,4.0,8.0,38.099998,7.0,28.925,7.0,1009.5,...,6.3,7.0,3.4,7.0,7.8,19.799999,30.9,,0.0,2.4
50%,561820.0,99999.0,1990.0,7.0,16.0,56.200001,8.0,44.400002,8.0,1014.400024,...,9.6,8.0,5.6,8.0,11.1,23.9,46.900002,,0.0,8.5
75%,725142.0,99999.0,2001.0,10.0,23.0,71.400002,23.0,57.900002,23.0,1020.299988,...,14.9,23.0,8.7,23.0,15.0,30.9,62.400002,,0.01,18.0
max,999999.0,99999.0,2010.0,12.0,31.0,103.099998,24.0,81.0,24.0,1051.099976,...,99.400002,24.0,44.400002,24.0,66.0,64.099998,91.199997,,11.0,49.599998
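
The `count` row shows 1943 rows rather than exactly 2000. That is expected: `RAND() < sample_count/row_count` keeps each row independently, so the sample size varies from run to run with a standard deviation of about 45 rows. The simulation below is a minimal sketch of that effect using a hypothetical, scaled-down table of 200,000 rows so it runs quickly; the function names are illustrative and not part of any library.

```python
import math
import random


def bernoulli_sample_size(row_count, sample_count, seed=None):
    """Count how many of row_count rows pass a RAND() < sample_count/row_count filter."""
    rng = random.Random(seed)
    threshold = sample_count / row_count
    return sum(1 for _ in range(row_count) if rng.random() < threshold)


def expected_spread(row_count, sample_count):
    """Standard deviation of the sample size when rows are kept independently."""
    p = sample_count / row_count
    return math.sqrt(row_count * p * (1 - p))


# Hypothetical table size, much smaller than the real GSOD table.
row_count = 200_000
sample_count = 2000

# Three independent runs give three different sample sizes near 2000.
sizes = [bernoulli_sample_size(row_count, sample_count, seed=s) for s in range(3)]
print("simulated sample sizes:", sizes)
print("typical spread: +/- %.1f rows" % expected_spread(row_count, sample_count))
```

For the full table, `expected_spread(114420316, 2000)` is close to the square root of 2000, so a result of 1943 is a little more than one standard deviation below the target.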


### View the first 10 rows

In [12]:
df.head(10)



Unnamed: 0,station_number,wban_number,year,month,day,mean_temp,num_mean_temp_samples,mean_dew_point,num_mean_dew_point_samples,mean_sealevel_pressure,...,min_temperature,min_temperature_explicit,total_precipitation,snow_depth,fog,rain,snow,hail,thunder,tornado
0,725720,24127,1955,11,14,38.700001,24,32.400002,24.0,1000.900024,...,,,,,False,False,False,False,False,False
1,916230,99999,1974,3,24,80.599998,4,71.199997,4.0,1010.299988,...,,,0.0,,False,False,False,False,False,False
2,64470,99999,1976,10,10,61.099998,7,58.0,7.0,1012.400024,...,,,0.0,,False,False,False,False,False,False
3,782640,99999,1981,5,29,81.699997,15,76.400002,15.0,1014.0,...,,,,,False,False,False,False,False,False
4,726587,99999,1982,11,1,46.799999,12,,,,...,,,0.0,,True,True,True,True,True,True
5,32100,99999,1986,12,31,43.700001,21,41.200001,21.0,1000.700012,...,,,0.12,,False,False,False,False,False,False
6,356630,99999,1989,5,1,38.299999,8,26.1,8.0,1021.400024,...,,,0.0,,False,False,False,False,False,False
7,277030,99999,1991,1,15,17.4,8,13.4,8.0,1028.800049,...,,,0.12,,False,False,False,False,False,False
8,722170,3813,1992,12,6,36.599998,24,18.6,24.0,1026.5,...,,,0.01,,False,False,False,False,False,False
9,10780,99999,2000,6,9,39.400002,8,33.900002,8.0,1010.900024,...,,,0.11,,False,False,False,False,False,False


In [None]:
# 10 highest total_precipitation samples
df.sort_values('total_precipitation', ascending=False).head(10)[['station_number', 'year', 'month', 'day', 'total_precipitation']]

Unnamed: 0,station_number,year,month,day,total_precipitation
644,230220,1964,7,15,5.91
1155,985430,2008,12,8,3.46
1196,248260,1961,11,1,2.95
1588,257670,1959,8,9,2.95
980,299150,1962,3,1,2.95
1325,470250,1965,11,25,2.95
1917,288380,1994,8,6,2.32
1211,585190,1995,4,14,2.32
250,647000,2005,8,19,2.2
1418,964710,1975,9,8,1.97


# Use BigQuery through pandas-gbq

The `pandas-gbq` library is a community-led project maintained by the pandas community. It covers basic functionality, such as writing a DataFrame to BigQuery and running a query, but as a third-party library it may not handle all BigQuery features or use cases.

[Pandas GBQ Documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_gbq.html)

In [None]:
import pandas_gbq as gbq

# Top 100 names in Texas by total occurrences
df = gbq.read_gbq('''
  SELECT name, SUM(number) as count
  FROM `bigquery-public-data.usa_names.usa_1910_2013`
  WHERE state = 'TX'
  GROUP BY name
  ORDER BY count DESC
  LIMIT 100
''', project_id=project_id, dialect='standard')

df.head()

Unnamed: 0,name,count
0,James,272793
1,John,235139
2,Michael,225320
3,Robert,220399
4,David,219028
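
The section above mentions writing a DataFrame to BigQuery but only shows reading. A minimal sketch of the write direction with `pandas_gbq.to_gbq` follows. The helper names, and any dataset or table names passed to them, are hypothetical; the sketch also assumes the destination dataset already exists in your project and that you have write access to it.

```python
VALID_IF_EXISTS = ("fail", "replace", "append")


def qualified_table(dataset_id, table_id):
    """Build the 'dataset.table' destination string that to_gbq expects."""
    return "%s.%s" % (dataset_id, table_id)


def write_dataframe(df, dataset_id, table_id, project_id, if_exists="fail"):
    """Write df to a BigQuery table (hypothetical helper around pandas_gbq.to_gbq)."""
    if if_exists not in VALID_IF_EXISTS:
        raise ValueError("if_exists must be one of %s" % (VALID_IF_EXISTS,))
    # Imported here so the argument checks above run without the library installed.
    import pandas_gbq

    pandas_gbq.to_gbq(
        df,
        qualified_table(dataset_id, table_id),
        project_id=project_id,
        if_exists=if_exists,  # "fail" refuses to overwrite an existing table
    )
```

For example, `write_dataframe(df, 'my_dataset', 'tx_names', project_id, if_exists='replace')` would store the query result above in a table of your own, where `my_dataset` and `tx_names` are placeholder names.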


# Syntax highlighting
`google.colab.syntax` adds syntax highlighting to Python string literals that hold a query to be run later.

In [None]:
from google.colab import syntax
query = syntax.sql('''
SELECT
  COUNT(*) as total_rows
FROM
  `bigquery-public-data.samples.gsod`
''')

gbq.read_gbq(query, project_id=project_id, dialect='standard')

Unnamed: 0,total_rows
0,114420316
