# Get Started with Jupyter on Google Cloud
This notebook collects code examples that show how to access data and start ML routines on Google Cloud resources such as CPUs, GPUs, and TPUs.

We start by importing pandas, which most of the examples below use. Other libraries are imported in the cells where they are first needed.

In [1]:
import pandas as pd

## Access Data on Google Cloud Storage


Cloud Storage is a storage service in the Google Cloud. It can store virtually infinite amounts of data. Typically, Cloud Storage is used to store files with unstructured data, such as images, text files, and semi-structured file formats, such as CSV, Avro, Parquet, and TFRecords.

We start by creating a Cloud Storage client in Python. The client allows us to interact with the Cloud Storage service. With the client we can download and upload files.

In [34]:
from google.cloud import storage
client = storage.Client()
print("Client created using default project: {}".format(client.project))
project_id = client.project

Client created using default project: sandbox-michael-menzel


To explicitly specify a project when constructing the client, set the `project` parameter:

In [3]:
# client = storage.Client(project='your-project-id')

First, we work with a bucket, which is the top-level container for files in Cloud Storage. Buckets can contain many files and have (practically) no size limit. Here is how we access our bucket for the hackathon:

In [4]:
bucket_name = "public-sample-data-data-science-hackathon"
bucket = client.get_bucket(bucket_name)

print("Bucket name: {}".format(bucket.name))
print("Bucket location: {}".format(bucket.location))
print("Bucket storage class: {}".format(bucket.storage_class))

Bucket name: public-sample-data-data-science-hackathon
Bucket location: EU
Bucket storage class: STANDARD


Let's list all files in the bucket:

In [5]:
blobs = bucket.list_blobs()

print("Blobs in {}:".format(bucket.name))
for item in blobs:
    print("\t" + item.name)

Blobs in public-sample-data-data-science-hackathon:
	sample.csv
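
Since buckets can hold many files, it often helps to list only the names that share a prefix. The following is a minimal sketch rather than part of the original notebook: `list_blob_names` is a hypothetical helper name, and it relies on the `prefix` argument of `Bucket.list_blobs`.

```python
def list_blob_names(bucket, prefix=None):
    """Return the names of all blobs in `bucket`, optionally limited to a prefix.

    `bucket` is expected to be a google.cloud.storage Bucket, such as the one
    returned by client.get_bucket(bucket_name) above.
    """
    return [blob.name for blob in bucket.list_blobs(prefix=prefix)]


# Usage with the bucket object created above (needs credentials and network):
# list_blob_names(bucket, prefix="sample")
```

With the hackathon bucket shown above, a prefix of `"sample"` would match `sample.csv`.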


The `gsutil` command-line tool produces the same listing:

In [6]:
!gsutil ls gs://{bucket_name}

gs://public-sample-data-data-science-hackathon/sample.csv


Now we can get details about one of the files and download it:

In [7]:
blob_name = "sample.csv"
blob = bucket.get_blob(blob_name)

print("ID: {}".format(blob.id))
print("Size: {} bytes".format(blob.size))
print("Content type: {}".format(blob.content_type))
print("Public URL: {}".format(blob.public_url))

output_file_name = "/tmp/sample.csv"
blob.download_to_filename(output_file_name)

print("Downloaded blob {} to {}.".format(blob.name, output_file_name))

ID: public-sample-data-data-science-hackathon/sample.csv/1571217850249200
Size: 7157 bytes
Content type: application/octet-stream
Public URL: https://storage.googleapis.com/public-sample-data-data-science-hackathon/sample.csv
Downloaded blob sample.csv to /tmp/sample.csv.
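
The client can upload files as well as download them. The sketch below shows the reverse direction under a few assumptions: `upload_file` is a hypothetical helper name, and the target has to be a bucket you are allowed to write to (the public hackathon bucket is probably read-only for most users).

```python
def upload_file(bucket, source_path, destination_name):
    """Upload a local file to `bucket` and return its gs:// URI.

    `bucket` is expected to be a google.cloud.storage Bucket with write access.
    """
    # bucket.blob() only creates a local handle; no request is sent yet.
    blob = bucket.blob(destination_name)
    # The actual transfer happens here.
    blob.upload_from_filename(source_path)
    return "gs://{}/{}".format(bucket.name, blob.name)


# Usage (needs credentials, network, and a writable bucket of your own):
# my_bucket = client.get_bucket("your-own-bucket")
# upload_file(my_bucket, "/tmp/sample.csv", "copies/sample.csv")
```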


Again, the same can be achieved using the gsutil command line tool:

In [8]:
!gsutil cp gs://{bucket_name}/{blob_name} /tmp/{blob_name}

Copying gs://public-sample-data-data-science-hackathon/sample.csv...
/ [1 files][  7.0 KiB/  7.0 KiB]                                                
Operation completed over 1 objects/7.0 KiB.                                      
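
The Python client addresses a file by bucket name and blob name, while `gsutil` uses a single `gs://` URI. A small sketch for converting between the two forms follows. The helper names `split_gcs_uri` and `make_gcs_uri` are made up for this example and are not part of any Google library.

```python
from urllib.parse import urlparse


def split_gcs_uri(uri):
    """Split 'gs://bucket/path/to/blob' into (bucket_name, blob_name)."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError("Not a gs:// URI: {!r}".format(uri))
    # The path starts with '/', which is not part of the blob name.
    return parsed.netloc, parsed.path.lstrip("/")


def make_gcs_uri(bucket_name, blob_name):
    """Build a gs:// URI from a bucket name and a blob name."""
    return "gs://{}/{}".format(bucket_name, blob_name)


bucket_part, blob_part = split_gcs_uri(
    "gs://public-sample-data-data-science-hackathon/sample.csv")
print(bucket_part, blob_part)
```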


With the file stored locally, we can load it into a Pandas dataframe:

In [9]:
df = pd.read_csv(output_file_name, header=None)
df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,14,15,16,17,18,19,20,21,22,23
count,13,13,13,9,13,13,13,12,13,13.0,...,13,1,13,13,13,13,13,13,13,1
unique,13,13,13,9,13,13,13,10,5,11.0,...,2,1,2,13,7,13,8,3,3,1
top,Getränke Bier,https://www.edeka24.de/Lebensmittel/Nuesse-Fru...,Meica Mini Wini Singles extra zart 260g,Bei unserer leckeren REWE Beste Wahl frischen ...,product_id_store,816ff5f8088f0b049d61e22a88d93d79,"65g (100 g = 1,68 )",50,gram,3.99,...,FALSE,score,1,product,Nahrung,2132,111,11,1,Controversial_Classification
freq,1,1,1,1,1,1,1,2,8,2.0,...,12,1,12,1,4,1,4,10,10,1


Let's also look at the first rows of the dataframe:

In [10]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,14,15,16,17,18,19,20,21,22,23
0,category,url,product_name,product_description,product_id_store,product_id,volume,qty,unit,price,...,reduction,score,match_level,product,agg_category,Coicop5_Final,Coicop4_Final,Coicop3_Final,Coicop2_Final,Controversial_Classification
1,"Nahrungsmittel Nudeln, Reis & Getreide Reis Ba...",https://shop.rewe.de/p/davert-echter-basmati-r...,Davert Echter Basmati Reis 500g,"In Nord Indien, am Fuße des Himalayas befindet...",p/davert-echter-basmati-reis-500g/8752591,fa3ff18774b0e1304931e13b31296f65,"500g (1 kg = 7,98 )",500,gram,3.99,...,FALSE,,1,https://shop.rewe.de/p/davert-echter-basmati-r...,Nahrung,01111,0111,011,01,
2,Nahrungsmittel Backzutaten Mehl,https://shop.rewe.de/p/goldpuder-weizenmehl-ty...,Goldpuder Weizenmehl Typ 405 5kg,,p/goldpuder-weizenmehl-typ-405-5kg/4477286,ee64a255097af44bb99d432271796682,"5kg (1 kg = 0,80 )",5,kilogram,3.99,...,FALSE,,1,https://shop.rewe.de/p/goldpuder-weizenmehl-ty...,Nahrung,01112,0111,011,01,
3,Nahrungsmittel Brot & Backwaren Aufbackwaren,https://shop.rewe.de/p/rewe-beste-wahl-baguett...,REWE Beste Wahl Baguette 360g,Im Backofen in wenigen Minuten schnell zuberei...,p/rewe-beste-wahl-baguette-360g/1079106,711dfea4c27e434c2384f09b4aeb25c3,"360g (1 kg = 2,75 )",360,gram,0.99,...,FALSE,,1,https://shop.rewe.de/p/rewe-beste-wahl-baguett...,Nahrung,01113,0111,011,01,
4,Lebensmittel Nudeln,https://www.edeka24.de/Lebensmittel/Nudeln/Bar...,Barilla Nudeln Fusilli,Teigwaren aus Hartweizengrieß,TY4y6t3wbijnjwhGWBzBEd-1142,bd3abb2f23de5861632b954d9a99d6b3,Inhalt: 500 g,500,gram,1.69,...,FALSE,,1,https://www.edeka24.de/Lebensmittel/Nudeln/Bar...,Lebensm,01116,0111,011,01,
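
Note that the cell above passes `header=None`, which is why the column names (`category`, `url`, ...) appear as data in row 0 and the columns are numbered instead of named. The sketch below illustrates the difference on a tiny in-memory CSV. The two-column sample text is invented for the illustration and is not taken from `sample.csv`.

```python
from io import StringIO

import pandas as pd

# Invented two-column sample, only for illustration.
csv_text = "product_id,price\na1,3.99\nb2,0.99\n"

# header=None: the first line is treated as data, columns are numbered 0, 1, ...
raw = pd.read_csv(StringIO(csv_text), header=None)

# Default (header=0): the first line provides the column names.
parsed = pd.read_csv(StringIO(csv_text))

print(raw.shape, list(raw.columns))
print(parsed.shape, list(parsed.columns))
```

With the default header handling, numeric columns such as `price` are also parsed as numbers rather than as strings mixed with the header text.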


Let's use pandas' built-in support for Google Cloud Storage, which reads `gs://` paths directly (this relies on the `gcsfs` package being installed):

In [11]:
df = pd.read_csv('gs://public-sample-data-data-science-hackathon/sample.csv', dtype=str)
df.describe()

Unnamed: 0,category,url,product_name,product_description,product_id_store,product_id,volume,qty,unit,price,...,reduction,score,match_level,product,agg_category,Coicop5_Final,Coicop4_Final,Coicop3_Final,Coicop2_Final,Controversial_Classification
count,12,12,12,8,12,12,12,11,12,12.0,...,12,0.0,12,12,12,12,12,12,12,0.0
unique,12,12,12,8,12,12,12,9,4,10.0,...,1,0.0,1,12,6,12,7,2,2,0.0
top,Lebensmittel Nüsse/Früchte/Samen,https://shop.rewe.de/p/davert-echter-basmati-r...,Meica Mini Wini Singles extra zart 260g,Bei unserer leckeren REWE Beste Wahl frischen ...,4gn9fe52eff7475bc8270473c539b3fc,88f34a92b134c6f9b08cb0b86778fe09,"1 Stück ca. 200 g (1 kg = 36,90 )",50,gram,3.99,...,FALSE,,1,https://shop.rewe.de/p/davert-echter-basmati-r...,Nahrung,2132,111,11,1,
freq,1,1,1,1,1,1,1,2,8,2.0,...,12,,12,1,4,1,4,10,10,


`.head()` should return the same rows as with our manual download, this time with the header line used for the column names:

In [12]:
df.head()

Unnamed: 0,category,url,product_name,product_description,product_id_store,product_id,volume,qty,unit,price,...,reduction,score,match_level,product,agg_category,Coicop5_Final,Coicop4_Final,Coicop3_Final,Coicop2_Final,Controversial_Classification
0,"Nahrungsmittel Nudeln, Reis & Getreide Reis Ba...",https://shop.rewe.de/p/davert-echter-basmati-r...,Davert Echter Basmati Reis 500g,"In Nord Indien, am Fuße des Himalayas befindet...",p/davert-echter-basmati-reis-500g/8752591,fa3ff18774b0e1304931e13b31296f65,"500g (1 kg = 7,98 )",500,gram,3.99,...,False,,1,https://shop.rewe.de/p/davert-echter-basmati-r...,Nahrung,1111,111,11,1,
1,Nahrungsmittel Backzutaten Mehl,https://shop.rewe.de/p/goldpuder-weizenmehl-ty...,Goldpuder Weizenmehl Typ 405 5kg,,p/goldpuder-weizenmehl-typ-405-5kg/4477286,ee64a255097af44bb99d432271796682,"5kg (1 kg = 0,80 )",5,kilogram,3.99,...,False,,1,https://shop.rewe.de/p/goldpuder-weizenmehl-ty...,Nahrung,1112,111,11,1,
2,Nahrungsmittel Brot & Backwaren Aufbackwaren,https://shop.rewe.de/p/rewe-beste-wahl-baguett...,REWE Beste Wahl Baguette 360g,Im Backofen in wenigen Minuten schnell zuberei...,p/rewe-beste-wahl-baguette-360g/1079106,711dfea4c27e434c2384f09b4aeb25c3,"360g (1 kg = 2,75 )",360,gram,0.99,...,False,,1,https://shop.rewe.de/p/rewe-beste-wahl-baguett...,Nahrung,1113,111,11,1,
3,Lebensmittel Nudeln,https://www.edeka24.de/Lebensmittel/Nudeln/Bar...,Barilla Nudeln Fusilli,Teigwaren aus Hartweizengrieß,TY4y6t3wbijnjwhGWBzBEd-1142,bd3abb2f23de5861632b954d9a99d6b3,Inhalt: 500 g,500,gram,1.69,...,False,,1,https://www.edeka24.de/Lebensmittel/Nudeln/Bar...,Lebensm,1116,111,11,1,
4,Frische & Kühlung Fleisch Lamm,https://shop.rewe.de/p/wilhelm-brandenburg-lam...,Wilhelm Brandenburg Lamm Rückensteak ohne Knoc...,VOR DEM VERZEHR GUT DURCHGAREN. NICHT ZUM ROHV...,p/wilhelm-brandenburg-lamm-rueckensteak-ohne-k...,88f34a92b134c6f9b08cb0b86778fe09,"1 Stück ca. 200 g (1 kg = 36,90 )",1,gram,7.38,...,False,,1,https://shop.rewe.de/p/wilhelm-brandenburg-lam...,Frische,1123,112,11,1,
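
The `dtype=str` argument used above matters for code-like columns. The COICOP codes in this dataset start with a zero (for example `01111` in the earlier output), and numeric parsing would drop it. A minimal sketch on an invented one-row CSV:

```python
from io import StringIO

import pandas as pd

# Invented sample row; only the leading zero in the code matters here.
csv_text = "Coicop5_Final,qty\n01111,500\n"

as_text = pd.read_csv(StringIO(csv_text), dtype=str)
as_inferred = pd.read_csv(StringIO(csv_text))

print(as_text["Coicop5_Final"][0])      # kept as a string
print(as_inferred["Coicop5_Final"][0])  # parsed as an integer
```

The trade-off is that every column becomes a string, so numeric columns such as `qty` have to be converted explicitly, for example with `pd.to_numeric`.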


**Learn more about interacting with Cloud Storage in the following tutorials:**
- [Cloud Storage client library](../tutorials/storage/Cloud%20Storage%20client%20library.ipynb)
- [Storage command-line tool](../tutorials/storage/Storage%20command-line%20tool.ipynb)

## Access Tables & Views on Google BigQuery



In [13]:
from google.cloud import bigquery
client = bigquery.Client(location="EU")
print("Client created using default project: {}".format(client.project))

Client created using default project: sandbox-michael-menzel


To explicitly specify a project when constructing the client, set the `project` parameter:

In [14]:
# client = bigquery.Client(location="US", project="your-project-id")

In [17]:
query = """
    SELECT `timestamp`, transaction_id, block_id
    FROM `bigquery-public-data.bitcoin_blockchain.transactions`
    LIMIT 60
"""
query_job = client.query(query, location="US")
df = query_job.to_dataframe()
df.describe()

Unnamed: 0,timestamp
count,60.0
mean,1401283000000.0
std,127522800000.0
min,1232107000000.0
25%,1271254000000.0
50%,1514928000000.0
75%,1515270000000.0
max,1515801000000.0


In [18]:
df.head()

Unnamed: 0,timestamp,transaction_id,block_id
0,1241693386000,b78dd4052c5c19ed15bff7f7cbc072cb87601680165412...,000000006b6810ea2b71871065c31f0939c61bc73ca19e...
1,1261947871000,bfcb4467092290da3bee702d5ffedfe1933c36a18b0e77...,000000003d0aa75d182618516bf64536d94119d23ef412...
2,1262072718000,a069017c031239357a6d325c7a10e6f4ed7cb722b1cb38...,00000000b574d15c470a479874f19ea232b8b26e3ab742...
3,1261474382000,9b9d3a70b70df897e2383fe16a09286502222f7ca06653...,000000006224e9ce1dbe8a9b593d8f0485a19983b479bd...
4,1277392209000,3fe2d7fa73e776f591e075783bc24cbd3e2fff8d444c72...,0000000001ad7196de0396085a3fa95f2322722aa8b805...
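
The `timestamp` values in the result look like milliseconds since the Unix epoch, which is an assumption based on their magnitude rather than something the notebook states. Under that assumption, they can be converted to readable datetimes with pandas. The sketch uses two values copied from the output above instead of running the query again.

```python
import pandas as pd

# Two timestamp values copied from the query output above.
sample = pd.DataFrame({"timestamp": [1241693386000, 1261947871000]})

# Assumption: the values are milliseconds since 1970-01-01 UTC.
sample["time_utc"] = pd.to_datetime(sample["timestamp"], unit="ms", utc=True)

print(sample)
```

The same line works on the `df` returned by `query_job.to_dataframe()`.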


You can also execute a query with the `%%bigquery` cell magic, which stores the result in the dataframe named after the magic (here `df`):

In [19]:
%%bigquery --verbose df
SELECT `timestamp`, transaction_id, block_id
FROM `bigquery-public-data.bitcoin_blockchain.transactions`
LIMIT 60

Executing query with job ID: 518518a2-e1ac-4bd7-847b-b6e1acebd7ae
Query executing: 0.45s
Query complete after 0.67s


In [20]:
df.head()

Unnamed: 0,timestamp,transaction_id,block_id
0,1260877681000,c0ce69e1272486661aa56673ca5c988bc898ddf5269016...,0000000084e1341d1202440622599aabaee14e017109d9...
1,1261677656000,2160f284d8f818cb76225ad07eaca026d6d8f860f8aa02...,00000000c17522535540533052ce12a6db0c9123c026e7...
2,1266112798000,790b1a2a3e246718889eb05bb747b8817ee85dad882228...,0000000009eb1ee6db7c8ea6e3c9f1ccf37a4c99192881...
3,1241222355000,366e80396c61babc4c001daadab0e68a7c47db24be0e2e...,00000000ba16045c3dae8a6b1abee685a1c0db2d56195c...
4,1271617286000,268ee4e02e4ba11185ff78f85800766dd90c0bd71aa4cd...,0000000013eedbaf2f3a8e5c8a9df17fb1219992251774...


**Learn more about interacting with BigQuery in the following tutorials:**
- [BigQuery basics](../tutorials/bigquery/BigQuery%20basics.ipynb)
- [BigQuery command-line tool](../tutorials/bigquery/BigQuery%20command-line%20tool.ipynb)
- [BigQuery query magic](../tutorials/bigquery/BigQuery%20query%20magic.ipynb)

## PAIR Visualizations

In [28]:
!pip3 install -q facets-overview

# Helper functions that render Facets Overview and Facets Dive inside the notebook
from IPython.display import display, HTML
from facets_overview.generic_feature_statistics_generator import GenericFeatureStatisticsGenerator
import base64
import uuid

def facets_overview(df):
    gfsg = GenericFeatureStatisticsGenerator()
    proto = gfsg.ProtoFromDataFrames([{'name': 'df', 'table': df}])
    protostr = base64.b64encode(proto.SerializeToString()).decode("utf-8")

    HTML_TEMPLATE = """
            <script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
            <link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
            <facets-overview id="facets-overview-{uuid}"></facets-overview>
            <script>
              document.querySelector("#facets-overview-{uuid}").protoInput = "{protostr}";
            </script>"""

    html = HTML_TEMPLATE.format(protostr=protostr, uuid=uuid.uuid4())
    display(HTML(html))

def facets_dive(df): 
    HTML_TEMPLATE = """
            <script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
            <link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
            <facets-dive id="facets-dive-{uuid}" height="500"></facets-dive>
            <script>
              document.querySelector("#facets-dive-{uuid}").data = {jsonstr};
            </script>"""

    jsonstr = df.to_json(orient='records')
    html = HTML_TEMPLATE.format(jsonstr=jsonstr, uuid=uuid.uuid4())
    display(HTML(html))

In [29]:
facets_overview(df)

In [30]:
facets_dive(df)
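
For orientation, `facets_dive` hands the data to the browser as a JSON array of records produced by `df.to_json(orient='records')`. The sketch below shows what that string contains for a small, invented dataframe.

```python
import json

import pandas as pd

# Invented two-row dataframe, only for illustration.
small = pd.DataFrame({"unit": ["gram", "kilogram"], "qty": [500, 5]})

# One JSON object per row, keyed by column name.
jsonstr = small.to_json(orient="records")
records = json.loads(jsonstr)

print(jsonstr)
```

Because the whole dataframe is embedded in the page this way, it is probably wise to pass a sample (for example `df.head(1000)`) when the dataframe is large.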

## Cloud AI APIs and Cloud AutoML

Some useful resources to get started with our Cloud APIs for NLP and [AutoML](https://cloud.google.com/automl/) for NLP:
* [Cloud NLP Intro](https://cloud.google.com/natural-language/)
* [Cloud Natural Language API Docs](https://cloud.google.com/natural-language/docs/)
* [Cloud AutoML Get Started Guides](https://cloud.google.com/natural-language/overview/docs/get-started)
* [Cloud AutoML NLP in the Console](https://console.cloud.google.com/natural-language)

There is also [Cloud AutoML Tables](https://cloud.google.com/automl-tables/) to build ML models on tabular data (e.g. from BigQuery):
* [Cloud AutoML Tables Intro](https://cloud.google.com/automl-tables/)
* [Cloud AutoML Tables Docs](https://cloud.google.com/automl-tables/docs/)
* [Cloud AutoML Tables in the Console](https://console.cloud.google.com/automl-tables)


## Data Transformation with Apache Beam (and Cloud Dataflow)

In [26]:
!pip3 install -q apache-beam[gcp]

In [32]:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions, DebugOptions, GoogleCloudOptions, WorkerOptions

pipeline_options = PipelineOptions.from_dictionary({
    # Use 'DataflowRunner' instead to run massively parallel on Cloud Dataflow.
    'runner': 'DirectRunner',
    'job_name': 'notebook',
    'streaming': True
})

def collect(i):
    output.append(i)
    return True

output = []

p = beam.Pipeline(options=pipeline_options)

pipeline = (
    p 
    | 'generate' >> beam.Create(range(1000))
    | 'square' >> beam.Map(lambda x: x**2)
    | "print" >> beam.Map(collect)
)

result = p.run()
result.wait_until_finish()

output[:10]

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

In [46]:
import apache_beam.io as io
import apache_beam.io.fileio as fileio

target_bucket = '[YOUR_TARGET_BUCKET]'

input_file = f"gs://{bucket_name}/sample.csv"
output_file = f"gs://{target_bucket}/converted.csv"
output_file_local = './converted.csv'

options = PipelineOptions()
options.view_as(StandardOptions).runner = 'DataflowRunner'
options.view_as(StandardOptions).streaming = False

google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = project_id
google_cloud_options.job_name = 'prepare-sample-csv'
google_cloud_options.region = 'europe-west1'
google_cloud_options.staging_location = f"gs://{target_bucket}/staging"
google_cloud_options.temp_location = f"gs://{target_bucket}/temp"

In [47]:
from io import StringIO
from functools import reduce
import csv
import uuid
import numpy as np

def file_cacher(row):
    filename = "/tmp/%s.tmp" % uuid.uuid4()
    print("Caching %s" % filename)
    # The with-block closes the file, so no explicit close() is needed.
    with open(filename, "a") as f:
        f.write(row)
    return filename

def read_file(file):
    rows = file.read_utf8().split('\n')
    rows[0] = rows[0].replace('.', '_')
    return rows

def csv_reader(rows):
    # pd.np is no longer available in current pandas; use NumPy directly.
    return list(
        pd.read_csv(StringIO(rows), 
                    delimiter=',',
                    verbose=True)
          .replace({'-': None})
          .replace({np.nan: None})
          .to_dict('records'))

def csv_writer(rows):
    csv_str = StringIO("")
    writer = csv.DictWriter(csv_str, fieldnames=rows[0].keys())
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return csv_str.getvalue()

def csv_trimmer(row):
    stripped_row = ";".join(map(str.strip, row.split(';')))
    return stripped_row

def concatinator(rows):
    return "\n".join(rows)

def id_filter(row):
    return row['product_id'] is not None and row['product_id'] != ''

def data_printer(data, verbose=True):
    if verbose:
        print("Data: %s" % data)
    return data
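
The helpers above are plain Python functions, so they can be tried on a few rows locally before they run inside a pipeline. The sketch below repeats `id_filter` and `csv_writer` so that it is self-contained, and applies them to invented sample rows.

```python
import csv
from io import StringIO


def id_filter(row):
    # Keep only rows that carry a non-empty product_id.
    return row['product_id'] is not None and row['product_id'] != ''


def csv_writer(rows):
    # Serialize a list of dicts to CSV text, using the first row's keys as header.
    buffer = StringIO()
    writer = csv.DictWriter(buffer, fieldnames=rows[0].keys())
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


# Invented sample rows: one valid ID, one empty ID, one missing ID.
sample_rows = [
    {'product_id': 'fa3f', 'price': 3.99},
    {'product_id': '', 'price': 1.69},
    {'product_id': None, 'price': 0.99},
]

kept = [row for row in sample_rows if id_filter(row)]
csv_text = csv_writer(kept)
print(csv_text)
```

Note that `csv_writer` reads `rows[0]`, so it raises an `IndexError` when the filter removes every row.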

In [48]:
p = beam.Pipeline(options=options)

cleaned = (p 
           | 'List CSV files' >> fileio.MatchFiles(input_file)
           | 'Download CSV files' >> fileio.ReadMatches()
           | 'Read CSV files into rows' >> beam.ParDo(read_file)
           | 'Trim values in row' >> beam.Map(csv_trimmer)
           | 'Concatenate rows' >> beam.CombineGlobally(concatinator)
           | 'Parse as CSV format' >> beam.FlatMap(csv_reader)
           | 'Remove rows with empty ID' >> beam.Filter(id_filter)
           | 'Print data' >> beam.Map(data_printer, True))

to_csv = (cleaned   
           | 'Collect CSV dicts into list' >> beam.transforms.combiners.ToList()
           | 'Convert rows to CSV' >> beam.Map(csv_writer))

to_csv | 'Write to Cloud Storage' >> io.WriteToText(output_file)
to_csv | 'Write to local file' >> io.WriteToText(output_file_local)

result = p.run()



## Train Models with Google Cloud AI Platform Training

First, we enable the AI Platform (ML) and Container Registry APIs in our project.

In [None]:
!gcloud services enable ml.googleapis.com
!gcloud services enable containerregistry.googleapis.com

Then, we need to create a bucket for staging and for the training results. Replace `[YOUR_GCS_BUCKET]` with a name of your choice (it needs to be globally unique!):

In [None]:
!gsutil mb gs://[YOUR_GCS_BUCKET]

Now we are ready to start our training job! Fill in your bucket name wherever you find brackets. You can modify the `model_dir` parameter to change where the training output is stored.
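
The command below reads the job name from `$JOB_NAME`, which this notebook does not set anywhere. One way to provide it is sketched here, as an assumption rather than as the original author's setup: `make_job_name` is a hypothetical helper that appends a UTC timestamp so that repeated submissions get distinct names, and that replaces characters other than letters, digits, and underscores, which to my knowledge are not accepted in job IDs.

```python
import os
import re
from datetime import datetime, timezone


def make_job_name(prefix="resnet"):
    """Build a job name such as 'resnet_20240101_120000' (hypothetical helper)."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    # Replace anything that is not a letter, digit, or underscore.
    return re.sub(r"[^A-Za-z0-9_]", "_", "{}_{}".format(prefix, stamp))


job_name = make_job_name()

# Shell commands started from the notebook inherit this environment variable.
os.environ["JOB_NAME"] = job_name
print(job_name)
```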

In [None]:
%%bash
# JOB_NAME must be set in the environment before running this cell.
gcloud ai-platform jobs submit training $JOB_NAME \
    --staging-bucket gs://[YOUR_GCS_BUCKET] \
    --runtime-version 1.8 \
    --scale-tier BASIC_TPU \
    --module-name resnet.resnet_main \
    --package-path resnet/ \
    --region us-central1 \
    -- \
    --data_dir=gs://cloud-tpu-test-datasets/fake_imagenet \
    --model_dir=gs://[YOUR_GCS_BUCKET]/training_result/ \
    --resnet_depth=50 \
    --train_steps=1024

**Learn more about AI Platform Training & Serving (formerly ML Engine):**
- [Training & Serving on ML Engine with scikit-learn](../tutorials/cloud-ml-engine/Training%20and%20prediction%20with%20scikit-learn.ipynb)
- [GitHub repo full of Training & Prediction Examples](https://github.com/GoogleCloudPlatform/cloudml-samples)