# Get Started with Jupyter on Google Cloud
The examples below show how to access data and start ML routines on Google Cloud resources such as CPUs, GPUs, and TPUs.

First, we import pandas, which is used throughout the examples below. The Google Cloud client libraries are imported where they are first needed.

In [4]:
import pandas as pd

## Access Data on Google Cloud Storage


Cloud Storage is Google Cloud's object storage service. It scales to practically unlimited amounts of data and is typically used for files with unstructured data, such as images and text files, and for semi-structured formats such as CSV, Avro, Parquet, and TFRecord.

We start by creating a Cloud Storage client in Python. The client allows us to interact with the Cloud Storage service. With the client we can download and upload files.

In [49]:
from google.cloud import storage
client = storage.Client()
print("Client created using default project: {}".format(client.project))

Client created using default project: ecb-fsf-hackathon-base


To explicitly specify a project when constructing the client, set the `project` parameter:

In [23]:
# client = storage.Client(project='your-project-id')
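
When `project` is omitted, the client falls back to the default project of the environment. As a rough sketch of that precedence, the hypothetical helper `resolve_project` below prefers an explicit ID and otherwise reads the `GOOGLE_CLOUD_PROJECT` environment variable, which is one of the sources the Google auth library consults. It deliberately ignores the other sources (for example the gcloud configuration), so treat it as an illustration rather than the library's actual lookup logic.

```python
import os


def resolve_project(explicit=None, environ=None):
    """Return the explicit project ID if given, else GOOGLE_CLOUD_PROJECT (or None)."""
    environ = os.environ if environ is None else environ
    return explicit or environ.get("GOOGLE_CLOUD_PROJECT")


def make_storage_client(project=None):
    """Create a Cloud Storage client for the resolved project (hypothetical helper)."""
    # Imported lazily so resolve_project() works without the client library installed.
    from google.cloud import storage

    return storage.Client(project=resolve_project(project))
```

Passing `project=None` to `storage.Client` leaves the choice to the library's own default detection.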

First, we work with a bucket, the top-level container for objects in Cloud Storage. A bucket can hold many files and has (practically) no size limit. Here is how we access our bucket for the hackathon:

In [44]:
bucket_name = "ecb-fsf-hackathon-base-data"
bucket = client.get_bucket(bucket_name)

print("Bucket name: {}".format(bucket.name))
print("Bucket location: {}".format(bucket.location))
print("Bucket storage class: {}".format(bucket.storage_class))

Bucket name: ecb-fsf-hackathon-base-data
Bucket location: EU
Bucket storage class: STANDARD


Let's list all files in the bucket:

In [45]:
blobs = bucket.list_blobs()

print("Blobs in {}:".format(bucket.name))
for item in blobs:
    print("\t" + item.name)

Blobs in ecb-fsf-hackathon-base-data:
	data.csv


We can also list the bucket contents with the `gsutil` command-line tool:

In [29]:
!gsutil ls gs://{bucket_name}

gs://ecb-fsf-hackathon-base-data/data.csv
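
Our bucket holds a single file, but larger buckets are usually organized with name prefixes that behave like folders. The sketch below wraps `list_blobs` with its `prefix` argument. The helper name `list_blob_names` and the prefix in the usage comment are hypothetical and not part of the hackathon setup.

```python
def list_blob_names(bucket, prefix=None):
    """Return the sorted names of all objects in `bucket` whose name starts with `prefix`."""
    # `bucket` is expected to be a google.cloud.storage Bucket (anything with list_blobs works).
    return sorted(blob.name for blob in bucket.list_blobs(prefix=prefix))


# Usage with the bucket from above (the prefix is an assumed example):
# list_blob_names(bucket, prefix="images/")
```

With `prefix=None` the helper lists every object, which matches the loop shown earlier.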


Now we can get details about one of the files and download it to the local file system:

In [46]:
blob_name = "data.csv"
blob = bucket.get_blob(blob_name)

print("Name: {}".format(blob.id))  # blob.id has the form <bucket>/<object name>/<generation>
print("Size: {} bytes".format(blob.size))
print("Content type: {}".format(blob.content_type))
print("Public URL: {}".format(blob.public_url))

output_file_name = "/tmp/data.csv"
blob.download_to_filename(output_file_name)

print("Downloaded blob {} to {}.".format(blob.name, output_file_name))

Name: ecb-fsf-hackathon-base-data/data.csv/1567698639814380
Size: 4198 bytes
Content type: application/octet-stream
Public URL: https://storage.googleapis.com/ecb-fsf-hackathon-base-data/data.csv
Downloaded blob data.csv to /tmp/data.csv.
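
The client can also upload files, as mentioned at the start of this section. A minimal sketch of the reverse direction follows, assuming you have write access to the bucket. The helper name `upload_file` and the destination name in the usage comment are hypothetical.

```python
def upload_file(bucket, source_path, destination_name):
    """Upload a local file to `bucket` under `destination_name` and return the blob."""
    blob = bucket.blob(destination_name)    # local handle only; no API call yet
    blob.upload_from_filename(source_path)  # performs the actual upload
    return blob


# Usage with the bucket from above (requires write permission on the bucket):
# upload_file(bucket, "/tmp/data.csv", "copies/data.csv")
```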


Again, the same can be achieved using the gsutil command line tool:

In [32]:
!gsutil cp gs://{bucket_name}/{blob_name} /tmp/{blob_name}

Copying gs://ecb-fsf-hackathon-base-data/data.csv...
/ [1 files][  4.1 KiB/  4.1 KiB]                                                
Operation completed over 1 objects/4.1 KiB.                                      
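
`gsutil` addresses objects with `gs://` URIs, whereas the Python client takes a bucket name and an object name separately. The following sketch converts between the two forms. `split_gs_uri` is a hypothetical helper, not part of the client library.

```python
def split_gs_uri(uri):
    """Split 'gs://bucket/path/to/object' into ('bucket', 'path/to/object')."""
    scheme = "gs://"
    if not uri.startswith(scheme):
        raise ValueError("Not a gs:// URI: {!r}".format(uri))
    # Everything up to the first slash is the bucket; the rest is the object name.
    bucket_name, _, object_name = uri[len(scheme):].partition("/")
    if not bucket_name:
        raise ValueError("Missing bucket name in: {!r}".format(uri))
    return bucket_name, object_name


print(split_gs_uri("gs://ecb-fsf-hackathon-base-data/data.csv"))
# → ('ecb-fsf-hackathon-base-data', 'data.csv')
```

The two parts can then be passed to `client.get_bucket(...)` and `bucket.get_blob(...)` as shown above.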


With the file stored locally, we can load it into a pandas DataFrame. The file has no header row, so we pass `header=None`:

In [47]:
df = pd.read_csv(output_file_name, header=None)
df.describe()

|  | 0 | 1 | 2 |
|---|---|---|---|
| count | 60 | 60 | 60 |
| unique | 2 | 60 | 3 |
| top | TRAIN | gs://sandbox-michael-menzel-vcm/clouds/cumulus... | cirrus |
| freq | 45 | 1 | 20 |


Let's look at the first rows of the DataFrame:

In [48]:
df.head()

|  | 0 | 1 | 2 |
|---|---|---|---|
| 0 | TRAIN | gs://sandbox-michael-menzel-vcm/clouds/cirrus/... | cirrus |
| 1 | TRAIN | gs://sandbox-michael-menzel-vcm/clouds/cirrus/... | cirrus |
| 2 | TRAIN | gs://sandbox-michael-menzel-vcm/clouds/cirrus/... | cirrus |
| 3 | TRAIN | gs://sandbox-michael-menzel-vcm/clouds/cirrus/... | cirrus |
| 4 | TRAIN | gs://sandbox-michael-menzel-vcm/clouds/cirrus/... | cirrus |
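
Because the file is read with `header=None`, the columns are labeled `0`, `1`, and `2`. A small sketch of assigning readable names at load time follows. The names `Set`, `URL`, and `Label` are an assumption about what the three columns mean, and the sample rows use a hypothetical bucket so the cell runs without Cloud access.

```python
import io

import pandas as pd

# Hypothetical sample rows in the same three-column layout as data.csv.
sample_csv = io.StringIO(
    "TRAIN,gs://example-bucket/clouds/cirrus/1.jpg,cirrus\n"
    "TRAIN,gs://example-bucket/clouds/cumulus/2.jpg,cumulus\n"
    "TEST,gs://example-bucket/clouds/cirrus/3.jpg,cirrus\n"
)

# Passing `names` labels the columns of a header-less file; every line is treated as data.
named_df = pd.read_csv(sample_csv, names=["Set", "URL", "Label"])

# Named columns make filters easier to read than positional ones such as df[0].
train_df = named_df[named_df["Set"] == "TRAIN"]
print(list(named_df.columns))
print(len(train_df))
```

For the real file, the same call would be `pd.read_csv(output_file_name, names=[...])` with whichever names fit your data.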


Let's use pandas' built-in support for Google Cloud Storage, which reads `gs://` paths directly (this relies on the `gcsfs` package being installed):

In [35]:
df = pd.read_csv('gs://ecb-fsf-hackathon-base-data/data.csv', header=None)
df.describe()

|  | 0 | 1 | 2 |
|---|---|---|---|
| count | 60 | 60 | 60 |
| unique | 2 | 60 | 3 |
| top | TRAIN | gs://sandbox-michael-menzel-vcm/clouds/cumulus... | cirrus |
| freq | 45 | 1 | 20 |


`.head()` should return the same rows as with our manual download:

In [38]:
df.head()

|  | 0 | 1 | 2 |
|---|---|---|---|
| 0 | TRAIN | gs://sandbox-michael-menzel-vcm/clouds/cirrus/... | cirrus |
| 1 | TRAIN | gs://sandbox-michael-menzel-vcm/clouds/cirrus/... | cirrus |
| 2 | TRAIN | gs://sandbox-michael-menzel-vcm/clouds/cirrus/... | cirrus |
| 3 | TRAIN | gs://sandbox-michael-menzel-vcm/clouds/cirrus/... | cirrus |
| 4 | TRAIN | gs://sandbox-michael-menzel-vcm/clouds/cirrus/... | cirrus |


**Learn more about interacting with Cloud Storage in the following tutorials:**
- [Cloud Storage client library](../tutorials/storage/Cloud%20Storage%20client%20library.ipynb)
- [Storage command-line tool](../tutorials/storage/Storage%20command-line%20tool.ipynb)

## Access Tables & Views on Google BigQuery

BigQuery is Google Cloud's serverless data warehouse for querying tables and views with SQL. As with Cloud Storage, we start by creating a client in Python:



In [50]:
from google.cloud import bigquery
client = bigquery.Client(location="EU")
print("Client created using default project: {}".format(client.project))

Client created using default project: ecb-fsf-hackathon-base


To explicitly specify a project when constructing the client, set the `project` parameter:

In [51]:
# client = bigquery.Client(location="EU", project="your-project-id")

Next, we run a SQL query against a table and load the result into a DataFrame:

In [52]:
query = """
    SELECT `Set`, URL, Label
    FROM `hackathon_dataset.data_table`
    LIMIT 60
"""
query_job = client.query(query, location="EU")
df = query_job.to_dataframe()
df.describe()

|  | Set | URL | Label |
|---|---|---|---|
| count | 60 | 60 | 60 |
| unique | 2 | 60 | 3 |
| top | TRAIN | gs://sandbox-michael-menzel-vcm/clouds/cumulus... | cirrus |
| freq | 45 | 1 | 20 |


In [53]:
df.head()

|  | Set | URL | Label |
|---|---|---|---|
| 0 | TEST | gs://sandbox-michael-menzel-vcm/clouds/cirrus/... | cirrus |
| 1 | TEST | gs://sandbox-michael-menzel-vcm/clouds/cirrus/... | cirrus |
| 2 | TEST | gs://sandbox-michael-menzel-vcm/clouds/cirrus/... | cirrus |
| 3 | TEST | gs://sandbox-michael-menzel-vcm/clouds/cirrus/... | cirrus |
| 4 | TEST | gs://sandbox-michael-menzel-vcm/clouds/cirrus/... | cirrus |
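
When a query depends on a value chosen at runtime, query parameters are generally safer than formatting the value into the SQL string. The sketch below filters the same table by label. The `WHERE` clause, the parameter name `label`, and the helper `query_by_label` are illustrative additions, not part of the original notebook.

```python
PARAM_QUERY = """
    SELECT `Set`, URL, Label
    FROM `hackathon_dataset.data_table`
    WHERE Label = @label
    LIMIT 60
"""


def query_by_label(client, label):
    """Run PARAM_QUERY with `label` bound to @label and return a DataFrame."""
    # Imported lazily so this cell can be defined without the client library installed.
    from google.cloud import bigquery

    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("label", "STRING", label)]
    )
    return client.query(PARAM_QUERY, job_config=job_config).to_dataframe()


# Usage with the BigQuery client created above:
# cirrus_df = query_by_label(client, "cirrus")
```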


You can also execute a query directly in a cell using the `%%bigquery` cell magic:

In [56]:
%%bigquery --verbose
SELECT Label, COUNT(*) AS Occurrence
FROM `hackathon_dataset.data_table`
GROUP BY Label
LIMIT 10

Executing query with job ID: a5c4e954-489c-4344-9caa-7cdaf2093eb8
Query executing: 0.42s
Query complete after 1.24s


|  | Label | Occurrence |
|---|---|---|
| 0 | cirrus | 20 |
| 1 | cumulonimbus | 20 |
| 2 | cumulus | 20 |
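
For a result that is already in a DataFrame, the same per-label count can be computed locally with pandas instead of a second query. This is a minimal sketch on hypothetical labels. The column name `Count` is chosen here for illustration.

```python
import pandas as pd

# Hypothetical labels; in the notebook this column would come from the query result.
labels_df = pd.DataFrame({"Label": ["cirrus", "cumulus", "cirrus", "cumulonimbus"]})

# Equivalent of: SELECT Label, COUNT(*) ... GROUP BY Label
counts = labels_df.groupby("Label").size().reset_index(name="Count")
print(counts)
```

Running the aggregation in BigQuery is usually preferable for large tables, since only the small result is transferred to the notebook.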


**Learn more about interacting with BigQuery in the following tutorials:**
- [BigQuery basics](../tutorials/bigquery/BigQuery%20basics.ipynb)
- [BigQuery command-line tool](../tutorials/bigquery/BigQuery%20command-line%20tool.ipynb)
- [BigQuery query magic](../tutorials/bigquery/BigQuery%20query%20magic.ipynb)

## Train Models with Google Cloud AI Platform Training