## Introduction

According to the official website: Dataproc is a managed Apache Spark and Apache Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data [official](https://cloud.google.com/dataproc/docs).

When we need Spark for data processing, we can use Dataproc to run it.

In [1]:
# first authenticate the Colab session
from google.colab import auth
auth.authenticate_user()

In [2]:
# install dataproc first
! pip install google-cloud-dataproc --quiet


### Create a cluster

Please follow the steps [here](https://cloud.google.com/dataproc/docs/quickstarts/quickstart-console) to create a cluster. I created a cluster with just one node; this does not affect the code, because **Spark** handles the distribution logic for us.

We could also create the cluster with **gcloud**, but here I simply created it in the console.
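
For readers who prefer the command line over the console, the cluster creation could be sketched as below. This is a minimal sketch and not what was run in this notebook: the helper names `build_create_cluster_cmd` and `create_cluster` are hypothetical, and it assumes the `gcloud` CLI is installed and authenticated. The `--single-node` flag requests a one-node cluster, matching the setup described above.

```python
import subprocess


def build_create_cluster_cmd(cluster_name, region, single_node=True):
    """Build the argument list for `gcloud dataproc clusters create`."""
    cmd = ["gcloud", "dataproc", "clusters", "create", cluster_name,
           "--region", region]
    if single_node:
        # One node acts as both master and worker.
        cmd.append("--single-node")
    return cmd


def create_cluster(cluster_name, region):
    """Run the command; raises CalledProcessError if gcloud fails."""
    subprocess.run(build_create_cluster_cmd(cluster_name, region), check=True)


# Only print the command here; call create_cluster(...) to really run it.
print(" ".join(build_create_cluster_cmd("dataproc-spark", "us-central1")))
```

Building the command as a list keeps it easy to inspect before any billable resource is created.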

In [3]:
! gcloud config set project cloudtutorial-282208

Updated property [core/project].


## Note

Because of the `sandbox` limits, I could not get the job to run there: Dataproc starts its cluster on Compute Engine, which is too expensive for the `sandbox`. The logic stays the same, though: we send our jars to the cluster and start the job on the remote server. Keep in mind that we have to delete the Dataproc cluster afterwards, so that it does not keep costing money. In short, this is a way to run a Spark job in the cloud on a cluster that we create manually.
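
The reminder about deleting the cluster can be turned into a habit with a small wrapper. The sketch below is one possible pattern, not part of the original notebook: `temporary_cluster` and `run_gcloud` are hypothetical names, and the example uses a recording runner instead of the real `gcloud` CLI so that nothing is created when it runs.

```python
import subprocess
from contextlib import contextmanager


def run_gcloud(args):
    """Default runner: execute a gcloud command and fail loudly on errors."""
    subprocess.run(["gcloud"] + args, check=True)


@contextmanager
def temporary_cluster(name, region, runner=run_gcloud):
    """Create a single-node cluster and always delete it afterwards."""
    runner(["dataproc", "clusters", "create", name,
            "--region", region, "--single-node"])
    try:
        yield name
    finally:
        # --quiet skips the interactive "Do you want to continue" prompt.
        runner(["dataproc", "clusters", "delete", name,
                "--region", region, "--quiet"])


# Dry run: record the commands instead of executing them.
recorded = []
with temporary_cluster("dataproc-spark", "us-central1", runner=recorded.append):
    recorded.append(["<jobs would be submitted here>"])
for args in recorded:
    print("gcloud", " ".join(args))
```

Because the delete call sits in a `finally` block, the cluster is removed even when a job inside the `with` block raises an error.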

In [4]:
# define with basic info
cluster_name = "dataproc-spark"
region = "us-central1"


## Spark Feature engineering with Dataproc

Now we can start our training or processing logic with PySpark. Here I create a sample script that uses Spark for feature extraction. Once the logic is finished, we can upload the resulting file to storage for later use, for example loading the data into **BigQuery**, querying the result from there and continuing with the later steps.

In [10]:
%%writefile training_spark.py
# the job logic lives in this script
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

import logging

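# without a configured level, INFO messages are dropped by default
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)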

logger.info('init spark')
spark = SparkSession.builder.getOrCreate()

sentenceDataFrame = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic,regression,models,are,neat")
], ["id", "sentence"])

# split each sentence into lowercase words on whitespace
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")

# a UDF that returns the number of tokens in each sentence
countTokens = udf(lambda words: len(words), IntegerType())

# tokenize first, then add the token count as a new column
tokenized = tokenizer.transform(sentenceDataFrame)
token_selected = tokenized.select("sentence", "words")\
    .withColumn("tokens", countTokens(col("words")))

print("Get final result: ")
token_selected.show(truncate=False)

logger.info('whole Spark logic finished.')

Overwriting training_spark.py
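
To make the expected result of the script concrete, the same token counting can be mimicked in plain Python. This is only a sketch: `simple_tokenize` is a hypothetical stand-in, written under the assumption that Spark's `Tokenizer` lowercases the text and splits it on whitespace. It does not use Spark at all.

```python
def simple_tokenize(sentence):
    """Rough stand-in for pyspark.ml.feature.Tokenizer:
    lowercase the text, then split on whitespace."""
    return sentence.lower().split()


sentences = [
    "Hi I heard about Spark",
    "I wish Java could use case classes",
    "Logistic,regression,models,are,neat",
]

for sentence in sentences:
    words = simple_tokenize(sentence)
    # Mirrors the "sentence", "words", "tokens" columns of the Spark job.
    print(sentence, "|", words, "|", len(words))
```

The third sentence stays a single token because commas are not whitespace; splitting on them would need a pattern-based tokenizer such as Spark's `RegexTokenizer`.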


In [11]:
# we can submit the job with a single gcloud command;
# the output is streamed here, and it is also available
# in the console under the Dataproc `jobs` tab.
! gcloud dataproc jobs submit pyspark training_spark.py --cluster $cluster_name --region $region

Job [1942c96d1eb64030b8aa290fea31cd80] submitted.
Waiting for job output...
20/07/03 08:22:40 INFO org.spark_project.jetty.util.log: Logging initialized @3748ms
20/07/03 08:22:41 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/07/03 08:22:41 INFO org.spark_project.jetty.server.Server: Started @3930ms
20/07/03 08:22:41 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@3f432a74{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
20/07/03 08:22:41 WARN org.apache.spark.scheduler.FairSchedulableBuilder: Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
20/07/03 08:22:43 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at dataproc-spark-m/10.128.0.2:8032
20/07/03 08:22:43 INFO org.apache.hadoop.yarn.client.AHSProxy:

#### **Note**
If we only use our own user account, this will not work well, so we need to create a service account linked to our email, which makes things easier.

In [14]:
# one thing to notice: so far the job was submitted from a local file.
# a more common way is to upload the script into a bucket
# and then trigger the job with the file in the bucket

# let's test it

# first we need to create a bucket
! gsutil mb gs://dataproc_lugq

# then upload our file into bucket
! gsutil cp training_spark.py gs://dataproc_lugq

# let's check it
! gsutil ls gs://dataproc_lugq

Creating gs://dataproc_lugq/...
Copying file://training_spark.py [Content-Type=text/x-python]...
/ [1 files][  970.0 B/  970.0 B]                                                
Operation completed over 1 objects/970.0 B.                                      
gs://dataproc_lugq/training_spark.py


In [16]:
# then let's trigger our training job with storage file
! gcloud dataproc jobs submit pyspark gs://dataproc_lugq/training_spark.py \
--cluster $cluster_name \
--region $region

Job [dcc583272cdc43199a7a44acca877f92] submitted.
Waiting for job output...
20/07/03 08:34:31 INFO org.spark_project.jetty.util.log: Logging initialized @3860ms
20/07/03 08:34:31 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/07/03 08:34:31 INFO org.spark_project.jetty.server.Server: Started @4045ms
20/07/03 08:34:31 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@7703d473{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
20/07/03 08:34:31 WARN org.apache.spark.scheduler.FairSchedulableBuilder: Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
20/07/03 08:34:34 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at dataproc-spark-m/10.128.0.2:8032
20/07/03 08:34:34 INFO org.apache.hadoop.yarn.client.AHSProxy:

Alright, we have now submitted the job to the cluster both from a local file and from a file stored in **GCS**.
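
The `google-cloud-dataproc` package installed at the top of the notebook can submit the same job from Python instead of `gcloud`. The sketch below is an assumption-laden illustration rather than what was run here: it assumes a recent (2.x or later) release of the library, and the helper names `build_pyspark_job` and `submit_pyspark_job` are hypothetical. The library is imported lazily, so the job description can be built and inspected without it.

```python
def build_pyspark_job(cluster_name, main_python_file_uri, jar_file_uris=None):
    """Describe a PySpark job as a plain dict in the Dataproc Job shape."""
    job = {
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {"main_python_file_uri": main_python_file_uri},
    }
    if jar_file_uris:
        job["pyspark_job"]["jar_file_uris"] = list(jar_file_uris)
    return job


def submit_pyspark_job(project_id, region, job):
    """Submit the job and block until it finishes (needs credentials)."""
    # Imported here so build_pyspark_job works without the library.
    from google.cloud import dataproc_v1

    client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    operation = client.submit_job_as_operation(
        request={"project_id": project_id, "region": region, "job": job}
    )
    return operation.result()


job = build_pyspark_job("dataproc-spark", "gs://dataproc_lugq/training_spark.py")
print(job)
# submit_pyspark_job("cloudtutorial-282208", "us-central1", job)
```

The submit call is left commented out because it needs a running cluster and valid credentials.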

#### Dataproc with BigQuery

In fact, when we need to process relational data at big-data scale, **BigQuery** is probably the most common way to interact with the data.

Let's create a sample dataset, upload it into **BigQuery**, use **Dataproc** to read it and do some feature engineering, and finally train a model on the new data. For later use, we can store the trained model in **GCS**.

Let's get started.

#### Load data into bucket

In [17]:
# first let's create a sample dataset
import numpy as np
import pandas as pd
import os
from sklearn.datasets import load_iris

x, y = load_iris(return_X_y = True)
data = np.concatenate([x, y[:, np.newaxis]], axis=1)

df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd', 'label'])

# save our dataframe into disk
df.to_csv('data.csv', index=False)

print("Now what files we have: ", os.listdir('.'))

Now what files we have:  ['.config', 'adc.json', 'training_spark.py', 'data.csv', 'sample_data']


In [18]:
# let's upload our data.csv into bucket
! gsutil cp data.csv gs://dataproc_lugq/

# check that the upload worked
! gsutil ls gs://dataproc_lugq

Copying file://data.csv [Content-Type=text/csv]...
/ [1 files][  2.9 KiB/  2.9 KiB]                                                
Operation completed over 1 objects/2.9 KiB.                                      
gs://dataproc_lugq/data.csv
gs://dataproc_lugq/training_spark.py


#### Load data into BigQuery

Before loading the data with Python, we first need to create a dataset in the console: go to the **BigQuery** console and create a dataset named **iris_dataset**.
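
As an alternative to the console, the dataset could also be created from Python. The following is a minimal sketch, assuming the `google-cloud-bigquery` client library and a `US` location (the location is an assumption, not something stated in this tutorial); `full_dataset_id` and `ensure_dataset` are hypothetical helper names.

```python
def full_dataset_id(project_id, dataset_id):
    """BigQuery identifies a dataset as `<project>.<dataset>`."""
    return "{}.{}".format(project_id, dataset_id)


def ensure_dataset(project_id, dataset_id, location="US"):
    """Create the dataset if it is missing (needs credentials)."""
    # Imported here so the ID helper works without the library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    dataset = bigquery.Dataset(full_dataset_id(project_id, dataset_id))
    dataset.location = location  # assumed location
    # exists_ok=True makes the call safe to repeat.
    return client.create_dataset(dataset, exists_ok=True)


print(full_dataset_id("cloudtutorial-282208", "iris_dataset"))
# ensure_dataset("cloudtutorial-282208", "iris_dataset")
```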

After the dataset has been created, let's create our table with Python.

In [19]:
# first install bigquery module
! pip install google-cloud-bigquery --quiet

In [25]:
from google.cloud import bigquery

# we need to create the dataset in console first
project_id = "cloudtutorial-282208"
dataset_id = "iris_dataset"
bucket_name = "dataproc_lugq"

# init bigquery client
client = bigquery.Client(project_id)

# create a reference to the dataset
dataset_ref = client.dataset(dataset_id)

# define schema
job_config = bigquery.LoadJobConfig()
job_config.schema = [bigquery.SchemaField("a", "float"),
                     bigquery.SchemaField("b", "float"),
                     bigquery.SchemaField("c", "float"),
                     bigquery.SchemaField("d", "float"),
                     bigquery.SchemaField("label", "float")]

# skip the CSV header row so that only the data rows are loaded
job_config.skip_leading_rows = 1
# set to load csv
job_config.source_format = bigquery.SourceFormat.CSV

# data uri
data_uri = "gs://{}/{}".format(bucket_name, "data.csv")

# create a load job
load_job = client.load_table_from_uri(data_uri, dataset_ref.table('iris'), job_config=job_config)
print("submitted job: {}".format(load_job.job_id))

# wait for the load job to finish
load_job.result()

submitted job: f970b412-4eb4-4581-a9f4-0df615470b17


<google.cloud.bigquery.job.LoadJob at 0x7f7f87f69550>

In [28]:
# let's check how many rows have been inserted
# 150 is the full iris dataset, so the load is correct
response = client.get_table(dataset_ref.table('iris'))

print("there are {} records in bigquery".format(response.num_rows))

there are 150 records in bigquery
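
The count matches because `skip_leading_rows = 1` drops only the header line of the CSV. A quick local cross-check is to count the data rows of the file before loading it. The sketch below uses only the standard library; `count_data_rows` is a hypothetical helper, and the tiny temporary file stands in for `data.csv` so the example runs anywhere.

```python
import csv
import os
import tempfile


def count_data_rows(path, header_rows=1):
    """Count CSV records in `path`, excluding the leading header rows."""
    with open(path, newline="") as f:
        total = sum(1 for _ in csv.reader(f))
    return max(total - header_rows, 0)


# Tiny stand-in file; point count_data_rows at data.csv in the notebook.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.csv")
    with open(sample, "w", newline="") as f:
        csv.writer(f).writerows([["a", "label"], [5.1, 0.0], [4.9, 0.0]])
    print("data rows:", count_data_rows(sample))
```

Comparing this number with `num_rows` from BigQuery shows quickly whether the header was skipped as intended.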


#### Model Training with Dataproc and BigQuery

In [36]:
%%writefile spark_train_bigquery.py
# after the data has been loaded into BigQuery,
# use Dataproc to read it back from BigQuery
# so that we can combine the power of Spark and BigQuery

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression
import logging

logger = logging.getLogger(__name__)

# combine the features into a vector and keep the label
def inputs_to_vector(row):
  return (row['label'], Vectors.dense(float(row['a']),
                                      float(row['b']),
                                      float(row['c']),
                                      float(row['d'])))

# create sparksession
spark = SparkSession.builder.getOrCreate()

# read bigquery return into a DataFrame
logger.info("Read data from bigquery")
df = spark.read.format('bigquery').option('table', 'iris_dataset.iris').load()

# logger.info("get dataframe:", df.show(5))
df.createOrReplaceTempView('iris')

df_new = spark.sql("select * from iris")

# map dataframe with vector function
data_df = df_new.rdd.map(inputs_to_vector).toDF(["label", "features"])

# split into train and test
(train_df, test_df) = data_df.randomSplit([0.7, 0.3])

# cache dataframe
train_df.cache()

lr = LogisticRegression(maxIter=10, regParam=0.1,elasticNetParam=0.8)

logger.info("start to train model")
model = lr.fit(train_df)

# get model prediction on test data
pred = model.transform(test_df)


# save the trained model into GCS
bucket_name = "dataproc_lugq"
model_folder = "lr_model"
model_storage_path = "gs://{}/{}".format(bucket_name, model_folder)

# I found that I could not save the model directly into the bucket,
# so the model is first saved under `local_path` (overwriting any
# earlier copy) and then uploaded with a gsutil command.
# reference here: https://stackoverflow.com/questions/48684048/save-python-data-object-to-file-in-google-storage-from-a-pyspark-job-running-in
local_path = './logistic_model'
model.write().overwrite().save(local_path)

# a saved Spark model is a directory, so copy it recursively with -r
from subprocess import call
print("Save model into bucket")
call(['gsutil', 'cp', '-r', local_path, model_storage_path])

# show() prints the rows itself and returns None, so call it on its own
print("get prediction:")
pred.show(5)

Overwriting spark_train_bigquery.py


#### Note

To read from **BigQuery** in **Dataproc**, the job needs the connector between the two, which we pass with `--jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar`. If the job fails because it cannot find `bigquery`, check that this jar was provided with `--jars`.
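
To avoid forgetting the jar, the submit command can be assembled by a small helper. This is a sketch only: `build_submit_cmd` is a hypothetical name, and it simply builds the argument list for `gcloud dataproc jobs submit pyspark`, joining several jars with commas as `--jars` expects.

```python
BIGQUERY_CONNECTOR_JAR = "gs://spark-lib/bigquery/spark-bigquery-latest.jar"


def build_submit_cmd(script, cluster_name, region, jars=None):
    """Build the argument list for `gcloud dataproc jobs submit pyspark`."""
    cmd = ["gcloud", "dataproc", "jobs", "submit", "pyspark", script,
           "--cluster", cluster_name, "--region", region]
    if jars:
        # --jars takes a single comma-separated value.
        cmd += ["--jars", ",".join(jars)]
    return cmd


cmd = build_submit_cmd("spark_train_bigquery.py", "dataproc-spark",
                       "us-central1", jars=[BIGQUERY_CONNECTOR_JAR])
print(" ".join(cmd))
# To run it for real: subprocess.run(cmd, check=True)
```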

In [37]:
# let's submit our job into Dataproc
! gcloud dataproc jobs submit pyspark spark_train_bigquery.py \
--cluster $cluster_name \
--region $region \
--jars gs://spark-lib/bigquery/spark-bigquery-latest.jar

Job [0ce69abc56734071984a541c48037548] submitted.
Waiting for job output...
20/07/03 09:09:32 INFO org.spark_project.jetty.util.log: Logging initialized @5119ms
20/07/03 09:09:32 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/07/03 09:09:32 INFO org.spark_project.jetty.server.Server: Started @5302ms
20/07/03 09:09:32 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@598fc0ab{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
20/07/03 09:09:32 WARN org.apache.spark.scheduler.FairSchedulableBuilder: Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
20/07/03 09:09:34 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at dataproc-spark-m/10.128.0.2:8032
20/07/03 09:09:34 INFO org.apache.hadoop.yarn.client.AHSProxy:

In [39]:
# finally, check the bucket to confirm that
# the trained model has been saved there.
! gsutil ls gs://dataproc_lugq/lr_model

gs://dataproc_lugq/lr_model/
gs://dataproc_lugq/lr_model/data/
gs://dataproc_lugq/lr_model/metadata/


Alright, we can train our model with **Dataproc** on data read from **BigQuery**. Because **Spark** is a unified framework, we can do many things with it, such as **ML** and **SQL**. Whatever we can do with **Spark** can also be done with **Dataproc**, as it is simply a cloud service that runs Spark.

Let's remove our storage and the Dataproc cluster!

In [42]:
# delete the Dataproc cluster
! gcloud dataproc clusters delete dataproc-spark --region $region

The cluster 'dataproc-spark' and all attached disks will be deleted.

Do you want to continue (Y/n)?  y

Waiting on operation [projects/cloudtutorial-282208/regions/us-central1/operations/aca94a03-5eb6-3c6e-bb1b-6b76606c6183].
Deleted [https://dataproc.googleapis.com/v1/projects/cloudtutorial-282208/regions/us-central1/clusters/dataproc-spark].


In [43]:
! bq rm -r -d iris_dataset


Welcome to BigQuery! This script will walk you through the 
process of initializing your .bigqueryrc configuration file.

First, we need to set up your credentials if they do not 
already exist.

Credential creation complete. Now we will select a default project.

List of projects:
  #        projectId           friendlyName    
 --- ---------------------- ------------------ 
  1   cloudtutorial-282208   CloudTutorial     
  2   my-project-34336       My Project 34336  
Found multiple projects. Please enter a selection for 
which should be the default, or leave blank to not 
set a default.

Enter a selection (1 - 2): 1

BigQuery configuration complete! Type "bq" to get started.

rm: remove dataset 'cloudtutorial-282208:iris_dataset'? (y/N) y


In [57]:
# remove the buckets
# WARNING: this deletes every bucket in the project, not only dataproc_lugq
from google.cloud import storage

client = storage.Client(project_id)

buckets_list = list(client.list_buckets())

# force=True also deletes the objects inside each bucket
for bucket in buckets_list:
  print("Now to delete: {}".format(bucket.name))
  bucket.delete(force=True)

Now to delete: dataproc-staging-us-central1-397497159726-nw76gzlj
Now to delete: dataproc-temp-us-central1-397497159726-jbjqy65f
Now to delete: dataproc_lugq
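
Deleting every bucket is fine for a throwaway project, but in a shared project a narrower cleanup is safer. The sketch below is one possible variant, not the code that produced the output above: `select_buckets` and `delete_buckets` are hypothetical helpers that only touch buckets named explicitly.

```python
def select_buckets(all_names, wanted_names):
    """Keep only the explicitly listed names, in their original order."""
    wanted = set(wanted_names)
    return [name for name in all_names if name in wanted]


def delete_buckets(project_id, wanted_names):
    """Delete only the listed buckets (needs credentials)."""
    # Imported here so select_buckets works without the library.
    from google.cloud import storage

    client = storage.Client(project=project_id)
    existing = [bucket.name for bucket in client.list_buckets()]
    for name in select_buckets(existing, wanted_names):
        print("Now to delete: {}".format(name))
        # force=True removes the contained objects first.
        client.get_bucket(name).delete(force=True)


existing = ["dataproc_lugq", "some-other-team-bucket"]
print(select_buckets(existing, ["dataproc_lugq"]))
# delete_buckets("cloudtutorial-282208", ["dataproc_lugq"])
```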


### Last word

This tutorial used **Dataproc** to do feature engineering and model training on data stored in **BigQuery**. Whenever we need to do big data processing, **Dataproc** is likely to be a tool we come back to often.