# Using Cloud Storage

Cloud Storage can be used to persist data across notebook instances. By default, each project is provisioned with a Google Cloud Storage bucket that
can be used across various services. In this notebook, we persist artifacts from the notebook instance to Cloud Storage for future use. We
also create a JavaScript library that can be used within BigQuery.

## Configuration

The bucket is located at `gs://${PROJECT}`. The project ID and the active account can be read with the `gcloud` command. We inject these values into the environment so that both Python and Bash cells can use them.

In [1]:
! gcloud config get-value project
! gcloud config get-value account

moz-fx-data-pion-nonprod-b3c9
notebook-amiyaguchi@moz-fx-data-pion-nonprod-b3c9.iam.gserviceaccount.com


In [2]:
import os
import subprocess

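def run(command: str) -> str:
    """Run a command and return its stdout with surrounding whitespace stripped."""
    # str.split() breaks on every space, so this is only safe for commands without quoted arguments.
    return subprocess.run(command.split(), stdout=subprocess.PIPE).stdout.strip().decode()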

os.environ["PROJECT"] = run("gcloud config get-value project")
os.environ["USER"] = run("gcloud config get-value account").split("@")[0]

! echo $PROJECT
! echo $USER

moz-fx-data-pion-nonprod-b3c9
notebook-amiyaguchi
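Because the same `gs://$PROJECT/$USER/...` paths recur throughout the notebook, it may help to build them in one place on the Python side. The following is a minimal sketch, assuming `PROJECT` and `USER` have been set as in the cell above; `bucket_uri` is a hypothetical helper name, not part of any library.

```python
import os


def bucket_uri(*parts: str) -> str:
    """Build a gs:// URI under the per-user prefix of the project bucket.

    Hypothetical helper: it only reads the PROJECT and USER variables
    that the configuration cell placed in the environment.
    """
    project = os.environ["PROJECT"]
    user = os.environ["USER"]
    # Drop empty segments and stray slashes so the path never contains "//".
    clean = [p.strip("/") for p in parts if p.strip("/")]
    return "/".join([f"gs://{project}", user, *clean])
```

For example, `bucket_uri("test", "artifacts")` yields the destination used by the sync step later in this notebook.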


## Using `gsutil rsync` for artifact storage

The [`gsutil rsync` command](https://cloud.google.com/storage/docs/gsutil/commands/rsync) makes the contents of a destination match a source, where either side can be a local directory or a bucket path. We will synchronize the tutorial notebooks into the project bucket.

The tutorials are installed on the instance by default.

In [3]:
! tree tutorials/storage

tutorials/storage
├── Cloud Storage client library.ipynb
├── resources
│   ├── downloaded-us-states.txt
│   └── us-states.txt
└── Storage command-line tool.ipynb

1 directory, 4 files
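Before copying anything, it can be useful to preview what a sync would do. The sketch below builds the `gsutil rsync` argument list from Python and adds the `-n` (dry run) flag by default. The function names `rsync_command` and `preview` are hypothetical. Passing the command as a list, rather than splitting a string on spaces, keeps paths that contain spaces intact.

```python
import shlex
import subprocess
from typing import List


def rsync_command(source: str, destination: str, dry_run: bool = True) -> List[str]:
    """Return the argument list for a recursive `gsutil rsync`."""
    command = ["gsutil", "rsync", "-r"]
    if dry_run:
        # -n asks gsutil to report what it would copy without copying anything.
        command.append("-n")
    return command + [source, destination]


def preview(source: str, destination: str) -> None:
    """Print the dry-run command, then execute it (requires gsutil on PATH)."""
    command = rsync_command(source, destination, dry_run=True)
    print(shlex.join(command))
    subprocess.run(command, check=True)
```

Calling `preview("tutorials/storage/", "gs://<bucket>/<prefix>")` with a real destination would list the pending copies; rerunning with `dry_run=False` performs them.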


Now we recursively sync these files into Cloud Storage.

In [4]:
! gsutil rsync -r tutorials/storage/ gs://$PROJECT/$USER/test/artifacts


both the source and destination. Your crcmod installation isn't using the
module's C extension, so checksumming will run very slowly. If this is your
first rsync since updating gsutil, this rsync can take significantly longer than
usual. For help installing the extension, please see "gsutil help crcmod".

Building synchronization state...
Starting synchronization...
Copying file://tutorials/storage/Cloud Storage client library.ipynb [Content-Type=application/octet-stream]...
Copying file://tutorials/storage/Storage command-line tool.ipynb [Content-Type=application/octet-stream]...
Copying file://tutorials/storage/resources/downloaded-us-states.txt [Content-Type=text/plain]...
Copying file://tutorials/storage/resources/us-states.txt [Content-Type=text/plain]...
/ [4 files][ 17.3 KiB/ 17.3 KiB]                                                
Operation completed over 4 objects/17.3 KiB.                                     


Now they are persisted outside of this notebook instance. The notebook instance can be deleted without losing these artifacts.

In [None]:
! gsutil ls -r gs://$PROJECT/$USER

gs://moz-fx-data-pion-nonprod-b3c9/notebook-amiyaguchi/:

gs://moz-fx-data-pion-nonprod-b3c9/notebook-amiyaguchi/test/:

gs://moz-fx-data-pion-nonprod-b3c9/notebook-amiyaguchi/test/artifacts/:
gs://moz-fx-data-pion-nonprod-b3c9/notebook-amiyaguchi/test/artifacts/Cloud Storage client library.ipynb
gs://moz-fx-data-pion-nonprod-b3c9/notebook-amiyaguchi/test/artifacts/Storage command-line tool.ipynb

gs://moz-fx-data-pion-nonprod-b3c9/notebook-amiyaguchi/test/artifacts/resources/:
gs://moz-fx-data-pion-nonprod-b3c9/notebook-amiyaguchi/test/artifacts/resources/downloaded-us-states.txt
gs://moz-fx-data-pion-nonprod-b3c9/notebook-amiyaguchi/test/artifacts/resources/us-states.txt
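The same listing can be produced from Python rather than the shell. This is a minimal sketch, assuming the `google-cloud-storage` package is installed on the instance and that default credentials are available; `split_gs_uri` and `list_artifacts` are hypothetical helper names.

```python
from typing import List, Tuple


def split_gs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    scheme = "gs://"
    if not uri.startswith(scheme):
        raise ValueError(f"not a gs:// URI: {uri}")
    bucket, _, prefix = uri[len(scheme):].partition("/")
    return bucket, prefix


def list_artifacts(uri: str) -> List[str]:
    """Return the names of all objects stored under the given gs:// prefix."""
    # Imported lazily so split_gs_uri stays usable without the client library.
    from google.cloud import storage

    bucket, prefix = split_gs_uri(uri)
    client = storage.Client()
    return [blob.name for blob in client.list_blobs(bucket, prefix=prefix)]
```

Object names come back relative to the bucket, so a prefix of `<user>/test` would return entries such as `<user>/test/artifacts/resources/us-states.txt`.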


## BigQuery UDFs

Cloud Storage can be used to store [JavaScript libraries that are called by BigQuery](https://cloud.google.com/bigquery/docs/reference/standard-sql/user-defined-functions#including-javascript-libraries). These libraries can even wrap compiled WebAssembly (wasm) code for performing moderately complex tasks. First we define a function `addOne` in a library on the local disk.

In [None]:
from pathlib import Path

resources = Path(".") / "resources"
resources.mkdir(exist_ok=True, parents=True)

(resources / "addOne.js").write_text("""
    (function() { addOne = function(x) { return x + 1}; }())
""".strip())

! cat resources/addOne.js

Now we copy this file into the storage bucket.

In [None]:
! gsutil rsync -r resources/ gs://$PROJECT/$USER/test/resources/

We can use this function in BigQuery by listing the library in the `OPTIONS` clause of a user-defined function.

In [10]:
from google.cloud import bigquery

client = bigquery.Client()
query = f"""
CREATE TEMP FUNCTION addOne(a FLOAT64)
  RETURNS FLOAT64
  LANGUAGE js
  OPTIONS (
    library=["gs://{os.environ['PROJECT']}/{os.environ['USER']}/test/resources/addOne.js"]
  )
  AS
'''
    // Call the addOne function defined in the library
    return addOne(a);
''';

SELECT addOne(3.14);
"""

for row in client.query(query):
    print(row.f0_)

4.140000000000001
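Reading the result through `row.f0_` relies on the column name BigQuery generates for an unnamed expression. A variant that names the column and passes the input as a query parameter might look like the sketch below. It assumes that named query parameters can be combined with a temporary function, and it declares `FLOAT64` as the return type; `add_one_query`, `run_add_one`, and the `plus_one` alias are hypothetical names.

```python
def add_one_query(library_uri: str) -> str:
    """Build the UDF query for a library stored at the given gs:// URI."""
    return f"""
CREATE TEMP FUNCTION addOne(a FLOAT64)
  RETURNS FLOAT64
  LANGUAGE js
  OPTIONS (library=["{library_uri}"])
  AS
'''
    return addOne(a);
''';

SELECT addOne(@value) AS plus_one;
""".strip()


def run_add_one(library_uri: str, value: float) -> float:
    """Run the query and return the single result (requires BigQuery access)."""
    # Imported lazily so add_one_query can be inspected without the client.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("value", "FLOAT64", value)]
    )
    rows = client.query(add_one_query(library_uri), job_config=job_config).result()
    # Index by the explicit alias instead of the generated f0_ name.
    return next(iter(rows))["plus_one"]
```

Keeping the SQL in its own function makes it easy to print and review the statement before sending it to BigQuery.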


## Cleanup

In [11]:
! gsutil rm -r gs://$PROJECT/$USER/test

Removing gs://moz-fx-data-pion-nonprod-b3c9/notebook-amiyaguchi/test/artifacts/Cloud Storage client library.ipynb#1596067736046955...
Removing gs://moz-fx-data-pion-nonprod-b3c9/notebook-amiyaguchi/test/artifacts/Storage command-line tool.ipynb#1596067736124013...
Removing gs://moz-fx-data-pion-nonprod-b3c9/notebook-amiyaguchi/test/artifacts/resources/downloaded-us-states.txt#1596067736224044...
Removing gs://moz-fx-data-pion-nonprod-b3c9/notebook-amiyaguchi/test/artifacts/resources/us-states.txt#1596067736300602...
/ [4 objects]                                                                   
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m rm ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Removing gs://moz-fx-data-pion-nonprod-b3c9/notebook-amiyaguchi/test/resources/addOne.js#1596067740924127...
/ [5 objects]                       