## Introduction

According to the official documentation, Dataproc is a managed Apache Spark and Apache Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data ([official documentation](https://cloud.google.com/dataproc/docs)).

When we need Spark for data processing, Dataproc gives us a managed cluster to run those Spark jobs on.

In [None]:
# first authenticate this Colab session
from google.colab import auth
auth.authenticate_user()

In [None]:
# install dataproc first
! pip install google-cloud-dataproc --quiet

### Create a cluster

Please follow the steps in the [console quickstart](https://cloud.google.com/dataproc/docs/quickstarts/quickstart-console) to create a cluster. I created a cluster with a single node; the cluster size doesn't matter for the code, because **Spark** handles the distribution logic for us.

We could also create the cluster with the **gcloud** command-line tool, but here I created it in the console.
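The **gcloud** route can be sketched as follows. This is a minimal sketch, not the command used for this notebook: `build_create_cluster_command` is a hypothetical helper, the cluster name and region are the ones defined further down, and `--single-node` is assumed to match the one-node cluster created in the console. The helper only builds and prints the command; nothing is executed.

```python
import shlex


def build_create_cluster_command(cluster_name, region, single_node=True):
    """Return the gcloud argument list for creating a Dataproc cluster.

    Hypothetical helper for illustration; it does not call gcloud.
    """
    args = ["gcloud", "dataproc", "clusters", "create", cluster_name,
            "--region", region]
    if single_node:
        # assumption: a single-node cluster, like the one created in the console
        args.append("--single-node")
    return args


command = build_create_cluster_command("dataproc-spark", "us-central1")
# print the command as it would be typed in a terminal or a `!` cell
print(shlex.join(command))
```

To actually create the cluster, the printed line could be pasted into a `!` cell after setting the project.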

In [None]:
! gcloud config set project emerald-road-282501

Updated property [core/project].


#### Note

Because of the `sandbox` limits, I couldn't get the job to run successfully there: Dataproc starts its cluster on Compute Engine, which is really expensive for a `sandbox` account. The logic should be similar in any case: we send our code or jars to the cluster and start the job on the remote server. Keep in mind that we have to delete the Dataproc cluster afterwards, so that it doesn't keep costing money. In short, this is a way to run our Spark job in the cloud on a cluster that we create manually.

In [None]:
# basic cluster info used by the commands below
cluster_name = "dataproc-spark"
region = "us-central1"


## Spark Feature engineering with Dataproc

Now we can write our training or processing logic with PySpark. Here I create a sample script that uses Spark for feature extraction. Once the logic is finished, we could upload the resulting file to Cloud Storage for later use, for example loading the data into **BigQuery**, querying the result there, and continuing with the next step.

In [None]:
# here I write the job logic
%%writefile training_spark.py
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

import logging

logger = logging.getLogger(__name__)

logger.info('init spark')
spark = SparkSession.builder.getOrCreate()

sentenceDataFrame = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic,regression,models,are,neat")
], ["id", "sentence"])

# split each sentence into words
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")

# a user-defined function (UDF) that counts the tokens in each sentence
countTokens = udf(lambda words: len(words), IntegerType())

# tokenize the sentences, then use the UDF to add the token count per sentence
tokenized = tokenizer.transform(sentenceDataFrame)
token_selected = tokenized.select("sentence", "words")\
    .withColumn("tokens", countTokens(col("words")))

print("Get final result: ")
token_selected.show(truncate=False)

logger.info('whole Spark logic finished.')

Writing training_spark.py
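To see what the job computes without a cluster, here is a minimal plain-Python sketch of the same feature. It assumes that Spark's `Tokenizer` lowercases the sentence and splits it on whitespace; `tokenize` is a hypothetical stand-in for it, not the Spark implementation.

```python
sentences = [
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic,regression,models,are,neat"),
]


def tokenize(sentence):
    # approximation of pyspark.ml.feature.Tokenizer: lowercase, split on whitespace
    return sentence.lower().split()


# same columns as the Spark job: sentence, words, tokens
rows = [(sentence, tokenize(sentence), len(tokenize(sentence)))
        for _, sentence in sentences]

for sentence, words, tokens in rows:
    print(tokens, words)
```

The third sentence contains no spaces, so it stays a single token; splitting it on commas would need `RegexTokenizer`, which the script imports but does not use.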


In [None]:
# we can submit the job with a single gcloud command;
# the job output is shown here, and it is also available in the console
# under the Dataproc `Jobs` tab.
! gcloud dataproc jobs submit pyspark training_spark.py --cluster $cluster_name --region $region

Job [9c51e38e1d5b43ec809536ce89e0ef82] submitted.
Waiting for job output...
20/07/08 05:20:25 INFO org.spark_project.jetty.util.log: Logging initialized @4403ms
20/07/08 05:20:25 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/07/08 05:20:25 INFO org.spark_project.jetty.server.Server: Started @4602ms
20/07/08 05:20:25 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@171d6c9{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
20/07/08 05:20:25 WARN org.apache.spark.scheduler.FairSchedulableBuilder: Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
20/07/08 05:20:28 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at dataproc-spark-m/10.128.0.3:8032
20/07/08 05:20:28 INFO org.apache.hadoop.yarn.client.AHSProxy: 

#### Note
Submitting jobs with only our personal account may not work, so we need to create a service account linked to our email, which makes this easier.

In [None]:
# note: so far I have submitted the job from a local file;
# a more common approach is to upload the job file to a bucket
# and then trigger the job from the file in the bucket

# let's test it

# first we need to create a bucket
! gsutil mb gs://dataproc_lugq

# then upload our file into bucket
! gsutil cp training_spark.py gs://dataproc_lugq

# let's check it
! gsutil ls gs://dataproc_lugq

Creating gs://dataproc_lugq/...
ServiceException: 409 Bucket dataproc_lugq already exists.
Copying file://training_spark.py [Content-Type=text/x-python]...
/ [1 files][  1.0 KiB/  1.0 KiB]                                                
Operation completed over 1 objects/1.0 KiB.                                      
gs://dataproc_lugq/sample.txt
gs://dataproc_lugq/training_spark.py
gs://dataproc_lugq/words.txt/


In [None]:
# then let's trigger our training job with storage file
! gcloud dataproc jobs submit pyspark gs://dataproc_lugq/training_spark.py \
--cluster $cluster_name \
--region $region

Job [a081b9d1e7c54346b82ef5307e124095] submitted.
Waiting for job output...
20/07/08 05:21:37 INFO org.spark_project.jetty.util.log: Logging initialized @4017ms
20/07/08 05:21:38 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/07/08 05:21:38 INFO org.spark_project.jetty.server.Server: Started @4211ms
20/07/08 05:21:38 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@2b5dc039{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
20/07/08 05:21:38 WARN org.apache.spark.scheduler.FairSchedulableBuilder: Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
20/07/08 05:21:40 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at dataproc-spark-m/10.128.0.3:8032
20/07/08 05:21:41 INFO org.apache.hadoop.yarn.client.AHSProxy:

Alright, we have submitted our job to the cluster both from a local file and from a file stored in **GCS**.

### Dataproc with GCS

The next step is to use **Dataproc** with **Cloud Storage** to read and write data.

Let's start.

In [None]:
# first let's make some sample data; most of the time we would use structured data,
# but here we make a sample text file to count word frequencies
import os

sample_text = """
This part of the Python Guestbook code walkthrough shows how to deploy the application to App Engine.
This page is part of a multi-page tutorial. To start from the beginning and see instructions for setting up, go to Creating a Guestbook.
"""

with open("sample.txt", 'w') as f:
  f.write(sample_text)

# let's check
print("Current folder: ", os.listdir('.'))

Current folder:  ['.config', 'sample.txt', 'training_spark.py', 'adc.json', 'sample_data']


In [None]:
# upload this file into the bucket: dataproc_lugq
! gsutil cp sample.txt gs://dataproc_lugq/

! gsutil ls gs://dataproc_lugq


Copying file://sample.txt [Content-Type=text/plain]...
/ [1 files][  240.0 B/  240.0 B]                                                
Operation completed over 1 objects/240.0 B.                                      
gs://dataproc_lugq/sample.txt
gs://dataproc_lugq/training_spark.py
gs://dataproc_lugq/words.txt/


In [None]:
# next step is our main function for word count

%%writefile words_count.py
from pyspark.sql import SparkSession

input_uri = "gs://dataproc_lugq/sample.txt"

# init spark context
spark = SparkSession.builder.getOrCreate()

sc = spark.sparkContext

# use the Spark context to read the file; the result is an RDD
lines = sc.textFile(input_uri)

# main steps for word count: flatMap to split -> map to (word, 1) -> reduceByKey -> sortBy
word_counts = lines.flatMap(lambda x: x.split()).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y).sortBy(lambda x: x[1])

# let's save our RDD into GCS
output_uri = "gs://dataproc_lugq/words.txt"
# note: even though we name a single file, Spark writes a directory of part files,
# because the data is distributed; saveAsTextFile also fails if the output path already exists
word_counts.saveAsTextFile(output_uri)

# collect the whole result into driver memory
print("Words count:", word_counts.collect())


Writing words_count.py
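The four RDD steps in the script can be traced on a tiny input in plain Python. This is a minimal sketch with made-up input lines, meant only to show what each step produces; it runs on one machine and does not use Spark.

```python
from itertools import chain

# hypothetical input, standing in for the lines of sample.txt
lines = ["to be or", "not to be"]

# flatMap: split every line and flatten the result into one list of words
words = list(chain.from_iterable(line.split() for line in lines))

# map: turn every word into a (word, 1) pair
pairs = [(word, 1) for word in words]

# reduceByKey: add up the values of all pairs that share the same key
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

# sortBy: order the (word, count) pairs by count, ascending
word_counts = sorted(counts.items(), key=lambda kv: kv[1])

print(word_counts)
```

In Spark the relative order of words with the same count may differ, because the data is spread over partitions.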


In [None]:
# submit our code, specifying the cluster and region where it should run
! gcloud dataproc jobs submit pyspark words_count.py --cluster $cluster_name \
--region $region

Job [729d1fa544d14c4e8ed901189abc51f9] submitted.
Waiting for job output...
20/07/08 05:29:32 INFO org.spark_project.jetty.util.log: Logging initialized @3716ms
20/07/08 05:29:32 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/07/08 05:29:32 INFO org.spark_project.jetty.server.Server: Started @3983ms
20/07/08 05:29:32 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@2b5dc039{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
20/07/08 05:29:32 WARN org.apache.spark.scheduler.FairSchedulableBuilder: Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
20/07/08 05:29:35 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at dataproc-spark-m/10.128.0.3:8032
20/07/08 05:29:35 INFO org.apache.hadoop.yarn.client.AHSProxy:

In [None]:
# let's check with storage
! gsutil ls gs://dataproc_lugq

gs://dataproc_lugq/sample.txt
gs://dataproc_lugq/training_spark.py
gs://dataproc_lugq/words.txt/


Well done, we have read and written data in **Cloud Storage**. In fact, we can treat GCS much like HDFS or S3.
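The listing shows `words.txt/` as a folder, not a file, because `saveAsTextFile` writes one part file per partition. Below is a minimal local simulation of that layout and of reading it back. The two partitions are invented for illustration, and the `part-00000` style names plus the `_SUCCESS` marker are assumed to follow the usual Hadoop output convention.

```python
import glob
import os
import tempfile

# hypothetical split of the word counts over two partitions
partitions = [[("or", 1), ("not", 1)], [("to", 2), ("be", 2)]]

# simulate the output "file", which is really a directory
output_dir = os.path.join(tempfile.mkdtemp(), "words.txt")
os.makedirs(output_dir)
for index, rows in enumerate(partitions):
    part_path = os.path.join(output_dir, "part-{:05d}".format(index))
    with open(part_path, "w") as f:
        for row in rows:
            # each element is written as its string form, one per line
            f.write("{}\n".format(row))
# empty marker file written when the job succeeds
open(os.path.join(output_dir, "_SUCCESS"), "w").close()

# read the result back by concatenating the part files in order
part_files = sorted(glob.glob(os.path.join(output_dir, "part-*")))
saved_lines = []
for path in part_files:
    with open(path) as f:
        saved_lines.extend(f.read().splitlines())

print([os.path.basename(p) for p in part_files])
print(saved_lines)
```

On GCS the same idea applies: a later job can read the whole folder by passing the folder URI to `sc.textFile`.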

### Dataproc with BigQuery

When we need to process large amounts of relational data, BigQuery should be the most common way for us to interact with the database.

Let's create a sample dataset and upload it into BigQuery, use **Dataproc** to read it and do some feature engineering, and finally train a model on the new data. For later use, we could store the trained model in **GCS**.

Let's get started.

#### Load data into bucket

In [None]:
# first let's create a sample dataset
import numpy as np
import pandas as pd
import os
from sklearn.datasets import load_iris

x, y = load_iris(return_X_y = True)
data = np.concatenate([x, y[:, np.newaxis]], axis=1)

df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd', 'label'])

# save our dataframe into disk
df.to_csv('data.csv', index=False)

print("Now what files we have: ", os.listdir('.'))

Now what files we have:  ['.config', 'sample.txt', 'training_spark.py', 'words_count.py', 'adc.json', 'data.csv', 'sample_data']


In [None]:
# let's upload our data.csv into the bucket
! gsutil cp data.csv gs://dataproc_lugq/

# check that the upload worked
! gsutil ls gs://dataproc_lugq

Copying file://data.csv [Content-Type=text/csv]...
/ [1 files][  2.9 KiB/  2.9 KiB]                                                
Operation completed over 1 objects/2.9 KiB.                                      
gs://dataproc_lugq/data.csv
gs://dataproc_lugq/sample.txt
gs://dataproc_lugq/training_spark.py
gs://dataproc_lugq/words.txt/


#### Load data into bigquery

Before we load the data with Python, we need to create a dataset in the console: go to the **BigQuery** console and create a dataset named **iris_dataset**.

After the dataset has been created, let's create our table with Python.

In [None]:
# first install bigquery module
! pip install google-cloud-bigquery --quiet

In [None]:
from google.cloud import bigquery

# we need to create the dataset in console first
project_id = "emerald-road-282501"
dataset_id = "iris_dataset"
bucket_name = "dataproc_lugq"

# init bigquery client
client = bigquery.Client(project_id)

# create a dataset reference
dataset_ref = client.dataset(dataset_id)

# define schema
job_config = bigquery.LoadJobConfig()
job_config.schema = [bigquery.SchemaField("a", "float"),
                     bigquery.SchemaField("b", "float"),
                     bigquery.SchemaField("c", "float"),
                     bigquery.SchemaField("d", "float"),
                     bigquery.SchemaField("label", "float")]

# skip the CSV header row, so that only the 150 data rows are loaded
job_config.skip_leading_rows = 1
# set to load csv
job_config.source_format = bigquery.SourceFormat.CSV

# data uri
data_uri = "gs://{}/{}".format(bucket_name, "data.csv")

# create a load job
load_job = client.load_table_from_uri(data_uri, dataset_ref.table('iris'), job_config=job_config)
print("submitted job: {}".format(load_job.job_id))

# wait for the load job to finish
load_job.result()

submitted job: ad004d23-ac25-4637-a1a5-442ab5ef7cda


<google.cloud.bigquery.job.LoadJob at 0x7f7bfec3c940>

In [None]:
# let's check how many rows have been inserted
# 150 rows, as expected
response = client.get_table(dataset_ref.table('iris'))

print("there are {} records in bigquery".format(response.num_rows))

there are 150 records in bigquery
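The count of 150 follows from the header handling: `data.csv` has one header line plus 150 data rows, and `skip_leading_rows = 1` drops only the header. A minimal local sketch of that arithmetic, using placeholder values instead of the real iris measurements:

```python
import csv
import io

# build an in-memory CSV with the same shape as data.csv:
# one header line followed by 150 data rows (placeholder values)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["a", "b", "c", "d", "label"])
for _ in range(150):
    writer.writerow([5.1, 3.5, 1.4, 0.2, 0.0])

all_lines = buffer.getvalue().splitlines()

# same meaning as job_config.skip_leading_rows = 1
skip_leading_rows = 1
data_rows = all_lines[skip_leading_rows:]

print(len(all_lines), len(data_rows))
```

Leaving the header in would make the load try to parse `a,b,c,d,label` as floats, which does not match the schema.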


#### Model Training with Dataproc and Bigquery

In [None]:
# now that the data is loaded into BigQuery,
# let's use Dataproc to read it from BigQuery,
# so that we can combine the power of Spark and BigQuery

%%writefile spark_train_bigquery.py

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression
import logging

logger = logging.getLogger(__name__)

# combine the features into a vector and return it together with the label
def inputs_to_vector(row):
  return (row['label'], Vectors.dense(float(row['a']), 
                                      float(row['b']), 
                                      float(row['c']), 
                                      float(row['d']) ))

# create sparksession
spark = SparkSession.builder.getOrCreate()

# read the BigQuery table into a DataFrame
logger.info("Read data from bigquery")
df = spark.read.format('bigquery').option('table', 'iris_dataset.iris').load()

# logger.info("get dataframe:", df.show(5))
df.createOrReplaceTempView('iris')

df_new = spark.sql("select * from iris")

# map dataframe with vector function
data_df = df_new.rdd.map(inputs_to_vector).toDF(["label", "features"])

# split into train and test
(train_df, test_df) = data_df.randomSplit([0.7, 0.3])

# cache dataframe
train_df.cache()

lr = LogisticRegression(maxIter=10, regParam=0.1,elasticNetParam=0.8)

logger.info("start to train model")
model = lr.fit(train_df)

# get model prediction on test data
pred = model.transform(test_df)


# let's try to save our trained model into GCS
bucket_name = "dataproc_lugq"
model_folder = "lr_model"
model_storage_path = "gs://{}/{}".format(bucket_name, model_folder)

# I found that I couldn't save the model files directly into the bucket,
# so the idea is to save the model on the cluster's local disk and then upload it with a command
# reference here: https://stackoverflow.com/questions/48684048/save-python-data-object-to-file-in-google-storage-from-a-pyspark-job-running-in
# import os

# local_path = os.getcwd()
# model_name = 'lr_model'
# model_path = os.path.join(local_path, model_name)
# # with `write` function won't write the model
# # model.write().overwrite().save(model_path)
# model.save(model_path)

# try:
#   print("what we have:", os.listdir(local_path))
#   os.system("gsutil cp -r {} {}".format(model_path, model_storage_path))
# except Exception as e:
#   print("when upload model with error:", e)

# this is try to get the model file
# try:
#   file_list = os.listdir(local_path)
#   if any(['model' in x for x in file_list]):
#     file_name = [x for x in file_list if x.endswith('model')][0]
#     print("Get file:{}".format(file_name))
#     os.system("gsutil cp -r {} {}".format(file_name, model_storage_path))
#   else:
#     print("we don't find with model file: current we have :", file_list)
# except Exception as e:
#   print("When upload file inot bucket with error: ", e)

# let's use command to upload file
# from subprocess import call
# print("Save model into bucket")
# call(['gsutil', 'cp', '-r', "./*.model", model_storage_path])

# show() prints the rows itself and returns None, so call it on its own line
print("get prediction:")
pred.show(5)

Overwriting spark_train_bigquery.py


#### Note

To use **BigQuery** from **Dataproc**, we need to provide the connector between the two: `--jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar`. Keep this in mind: if you get an error saying that `bigquery` cannot be found, pass the connector with `--jars`.

In [None]:
# let's submit our job into Dataproc
! gcloud dataproc jobs submit pyspark spark_train_bigquery.py \
--cluster $cluster_name \
--region $region \
--jars gs://spark-lib/bigquery/spark-bigquery-latest.jar

Job [70af41685a264c34b3cc71d7b0fe2742] submitted.
Waiting for job output...
20/07/08 06:29:57 INFO org.spark_project.jetty.util.log: Logging initialized @8474ms
20/07/08 06:29:57 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/07/08 06:29:57 INFO org.spark_project.jetty.server.Server: Started @8703ms
20/07/08 06:29:57 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@304f32a4{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
20/07/08 06:29:58 WARN org.apache.spark.scheduler.FairSchedulableBuilder: Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
20/07/08 06:29:59 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at dataproc-spark-m/10.128.0.3:8032
20/07/08 06:29:59 INFO org.apache.hadoop.yarn.client.AHSProxy:

In [None]:
# finally, let's check the bucket to see whether the model exists;
# the model-saving code above is commented out, so no model folder is listed.
! gsutil ls gs://dataproc_lugq/

gs://dataproc_lugq/data.csv
gs://dataproc_lugq/sample.txt
gs://dataproc_lugq/training_spark.py
gs://dataproc_lugq/words.txt/


#### Writing data from Dataproc into BigQuery

We have read data from **BigQuery**. What if we have done some processing and want to write the result back into BigQuery, so that we can query it with **SQL** in a later step?


In [5]:
! gcloud auth login

You are now logged in as [gqianglu1990@gmail.com].
Your current project is [cloudtutorial-282707].  You can change this setting by running:
  $ gcloud config set project PROJECT_ID


In [6]:
! gcloud config set project cloudtutorial-282707

Updated property [core/project].


First we need to create a cluster named `word-count` and a BigQuery dataset named `words` in the console.

In [23]:
%%bash
export cluster_name="word-count"
export region='us-central1'
export dataset_name='words'
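One caveat with the cell above: a `%%bash` cell runs in its own subprocess, so variables exported there are gone when the cell finishes, which is why the submit command later in this notebook spells out the cluster name and region by hand. A minimal sketch of an alternative, assuming we set the values on the notebook kernel through `os.environ`, which child processes inherit. The child here is a plain Python subprocess standing in for a later `%%bash` cell.

```python
import os
import subprocess
import sys

# set the values on the kernel process itself
os.environ["cluster_name"] = "word-count"
os.environ["region"] = "us-central1"
os.environ["dataset_name"] = "words"

# a child process (stand-in for a later %%bash cell) sees the same variables
child_code = "import os; print(os.environ['cluster_name'], os.environ['region'])"
result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    text=True,
    check=True,
)

print(result.stdout.strip())
```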

In [3]:
# first let's make sample file and upload it into bucket
import os

sample_texts = "This configures the Hive servers to read from and write to the correct location. You provide the Cloud SQL Proxy initialization action that Dataproc automatically"
with open('sample.txt', 'w') as f:
  f.write(sample_texts)

print(os.listdir('.'))

['.config', 'sample.txt', 'sample_data']


In [10]:
# create bucket and upload file into bucket
%%bash
# make bucket
gsutil mb gs://words_count

# copy files
gsutil cp sample.txt gs://words_count

# check files
gsutil ls gs://words_count

gs://words_count/sample.txt


Creating gs://words_count/...
ServiceException: 409 Bucket words_count already exists.
Copying file://sample.txt [Content-Type=text/plain]...
/ [1 files][  162.0 B/  162.0 B]
Operation completed over 1 objects/162.0 B.                                      


In [11]:
# before we submit our Spark job, we should create the BigQuery table
! pip install google-cloud-bigquery --quiet

In [None]:
# Note: creating the table through the API requires credentials

# from google.cloud import bigquery

# project_id = "cloudtutorial-282707"

# client = bigquery.Client(project_id)

# # Here I create the table with the API; SQL should also work.
# # first define the table schema
# schema = [bigquery.SchemaField('words', 'STRING', mode='REQUIRED'),
#           bigquery.SchemaField('num', 'INTEGER', mode='REQUIRED')]
# # bigquery.Table needs a fully qualified "project.dataset.table" ID
# table_name = "{}.words.sample_words".format(project_id)

# table = bigquery.Table(table_name, schema=schema)

# # make request to create table
# table = client.create_table(table)

# print("Created table")

In [21]:
# so here I just create the table with the bq command
%%bash
bq mk --table cloudtutorial-282707:words.sample_words words:STRING,num:INTEGER

Table 'cloudtutorial-282707:words.sample_words' successfully created.


In [30]:
# next step is to create our main logic here
%%writefile words_count_to_bigquery.py
import os
from pyspark.sql import SparkSession

bucket_name = "words_count"
file_name = "sample.txt"
file_path = "gs://{}/{}".format(bucket_name, file_name)

spark = SparkSession.builder.master('yarn').getOrCreate()

# we have to set a temporary bucket for the connector, otherwise we get the error: requirement failed: Temporary GCS path has not been set
spark.conf.set('temporaryGcsBucket', bucket_name)

sc = spark.sparkContext

# read the file as an RDD with sc, count the words, then convert the result to a DataFrame
words_df = sc.textFile(file_path).flatMap(lambda x: x.split()).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y).toDF(["words", "num"])

print("what we have:")
words_df.show()

# now that we have a DataFrame, let's save the result into BigQuery
words_df.write.format("bigquery").option("table", "words.sample_words").mode('overwrite').save()

print("Save data finished.")

Overwriting words_count_to_bigquery.py


In [31]:
# now let's submit our job into cluster
%%bash
gcloud dataproc jobs submit pyspark  words_count_to_bigquery.py \
--cluster word-count --region us-central1 --jars gs://spark-lib/bigquery/spark-bigquery-latest.jar

done: true
driverControlFilesUri: gs://dataproc-staging-us-central1-870970564128-vrn3qoba/google-cloud-dataproc-metainfo/f64b324b-9298-4586-9426-ff8ec2fe748f/jobs/3ce310d450e34ccb8aa2d882c0e36535/
driverOutputResourceUri: gs://dataproc-staging-us-central1-870970564128-vrn3qoba/google-cloud-dataproc-metainfo/f64b324b-9298-4586-9426-ff8ec2fe748f/jobs/3ce310d450e34ccb8aa2d882c0e36535/driveroutput
jobUuid: 1d046aca-3fa6-353b-b05e-25ab59ae6f72
placement:
  clusterName: word-count
  clusterUuid: f64b324b-9298-4586-9426-ff8ec2fe748f
pysparkJob:
  jarFileUris:
  - gs://spark-lib/bigquery/spark-bigquery-latest.jar
  mainPythonFileUri: gs://dataproc-staging-us-central1-870970564128-vrn3qoba/google-cloud-dataproc-metainfo/f64b324b-9298-4586-9426-ff8ec2fe748f/jobs/3ce310d450e34ccb8aa2d882c0e36535/staging/words_count_to_bigquery.py
reference:
  jobId: 3ce310d450e34ccb8aa2d882c0e36535
  projectId: cloudtutorial-282707
status:
  state: DONE
  stateStartTime: '2020-07-09T02:44:34.428Z'
statusHistory:


Job [3ce310d450e34ccb8aa2d882c0e36535] submitted.
Waiting for job output...
20/07/09 02:43:37 INFO org.spark_project.jetty.util.log: Logging initialized @4759ms
20/07/09 02:43:37 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/07/09 02:43:37 INFO org.spark_project.jetty.server.Server: Started @4964ms
20/07/09 02:43:37 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@2a07dfeb{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
20/07/09 02:43:37 WARN org.apache.spark.scheduler.FairSchedulableBuilder: Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
20/07/09 02:43:39 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at word-count-m/10.128.0.5:8032
20/07/09 02:43:39 INFO org.apache.hadoop.yarn.client.AHSProxy: Con

In [33]:
# Good news, our job ran successfully.

# let's check the result in BigQuery with the bq command
! bq query --nouse_legacy_sql "select * from words.sample_words"

Waiting on bqjob_r7334220d269721cc_0000017331795fcd_1 ... (0s) Current status: DONE   
+----------------+-----+
|     words      | num |
+----------------+-----+
| and            |   1 |
| location.      |   1 |
| from           |   1 |
| Dataproc       |   1 |
| provide        |   1 |
| initialization |   1 |
| Hive           |   1 |
| write          |   1 |
| This           |   1 |
| read           |   1 |
| Cloud          |   1 |
| action         |   1 |
| You            |   1 |
| automatically  |   1 |
| that           |   1 |
| configures     |   1 |
| Proxy          |   1 |
| SQL            |   1 |
| servers        |   1 |
| correct        |   1 |
| to             |   2 |
| the            |   3 |
+----------------+-----+


Good, so we can both read and write **BigQuery** data with **Dataproc**.

### Last words

Alright, we can train our model with **Dataproc** on data from **BigQuery**. Because **Spark** is a unified framework, we can do many things with it, such as **ML** and **SQL**. Whatever we can do with **Spark** can also be done with **Dataproc**, as it is simply a cloud service that runs Spark.

Let's remove our storage and Dataproc cluster!

In [None]:
# delete the Dataproc cluster
! gcloud dataproc clusters delete dataproc-spark --region $region

The cluster 'dataproc-spark' and all attached disks will be deleted.

Do you want to continue (Y/n)?  y

Waiting on operation [projects/cloudtutorial-282208/regions/us-central1/operations/aca94a03-5eb6-3c6e-bb1b-6b76606c6183].
Deleted [https://dataproc.googleapis.com/v1/projects/cloudtutorial-282208/regions/us-central1/clusters/dataproc-spark].


In [None]:
# remove the bigquery dataset
! bq rm -r -d iris_dataset


Welcome to BigQuery! This script will walk you through the 
process of initializing your .bigqueryrc configuration file.

First, we need to set up your credentials if they do not 
already exist.

Credential creation complete. Now we will select a default project.

List of projects:
  #        projectId           friendlyName    
 --- ---------------------- ------------------ 
  1   cloudtutorial-282208   CloudTutorial     
  2   my-project-34336       My Project 34336  
Found multiple projects. Please enter a selection for 
which should be the default, or leave blank to not 
set a default.

Enter a selection (1 - 2): 1

BigQuery configuration complete! Type "bq" to get started.

rm: remove dataset 'cloudtutorial-282208:iris_dataset'? (y/N) y


In [None]:
# WARNING: this deletes every bucket in the project, including the objects in them
from google.cloud import storage

client = storage.Client(project_id)

buckets_list = list(client.list_buckets())

# delete all buckets
for bucket in buckets_list:
  print("Now to delete: {}".format(bucket.name))
  bucket.delete(force=True)

Now to delete: dataproc-staging-us-central1-397497159726-nw76gzlj
Now to delete: dataproc-temp-us-central1-397497159726-jbjqy65f
Now to delete: dataproc_lugq
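The cell above deletes every bucket in the project, which is only safe in a throwaway project. A minimal sketch of narrowing the deletion to the tutorial's buckets by name prefix: `select_buckets` is a hypothetical helper, the first three names come from the output above, and the last name is invented to show a bucket being kept.

```python
def select_buckets(bucket_names, prefixes):
    """Return only the bucket names that start with one of the given prefixes."""
    return [name for name in bucket_names if name.startswith(tuple(prefixes))]


bucket_names = [
    "dataproc-staging-us-central1-397497159726-nw76gzlj",
    "dataproc-temp-us-central1-397497159726-jbjqy65f",
    "dataproc_lugq",
    "unrelated-production-data",  # invented name, must survive the cleanup
]

to_delete = select_buckets(bucket_names, ["dataproc-", "dataproc_"])

for name in to_delete:
    print("Would delete: {}".format(name))
```

With the storage client, the same filter could be applied to `bucket.name` inside the loop before calling `bucket.delete(force=True)`.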


### Summary

This tutorial used **Dataproc** to do feature engineering and model training on data from **BigQuery**. When we need to do big data processing, **Dataproc** is a tool we may reach for often.