## Introduction

According to the official website: Dataproc is a managed Apache Spark and Apache Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data ([official documentation](https://cloud.google.com/dataproc/docs)).

When we need Spark for data processing, Dataproc gives us a managed cluster to run the job on.

In [0]:
# first auth the lab
from google.colab import auth
auth.authenticate_user()

In [0]:
# install dataproc
! pip install google-cloud-dataproc --quiet

In [0]:
import argparse
import time

from google.cloud import dataproc_v1 as dataproc
from google.cloud import storage

# define the parameters
project_id = "cloudtutorial-279003"
cluster_name = 'python-cluster'
region = 'us-central1'


def create_cluster():
  # first create cluster client
  cluster_client = dataproc.ClusterControllerClient(client_options={
        'api_endpoint': '{}-dataproc.googleapis.com:443'.format(region)
    })


  cluster = {
          'project_id': project_id,
          'cluster_name': cluster_name,
          'config': {
              'master_config': {
                  'num_instances': 1,
                  'machine_type_uri': 'n1-standard-1'
              },
              'worker_config': {
                  'num_instances': 2,
                  'machine_type_uri': 'n1-standard-1'
              }
          }
      }


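  # then create the cluster; create_cluster returns a long-running operation
  operation = cluster_client.create_cluster(project_id, region, cluster)
  result = operation.result()

  print("Cluster has been created: {}".format(result.cluster_name))

  # return the client itself (not the operation) so that later steps can reuse it
  return cluster_client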


def submit_job(job_path):
  # create job client, so that we could submit our job into cluster
  job_client = dataproc.JobControllerClient(client_options={
      'api_endpoint': "{}-dataproc.googleapis.com:443".format(region)
  })

  # create job config
  job = {
      'placement': {
          'cluster_name': cluster_name
      },
      'pyspark_job':{
          'main_python_file_uri': job_path
      }
  }

  job_response = job_client.submit_job(project_id, region, job)
  job_id = job_response.reference.job_id

  print('Submit job {}'.format(job_id))

  # the terminal states of a job
  terminal_states = {
      dataproc.types.JobStatus.ERROR,
      dataproc.types.JobStatus.CANCELLED,
      dataproc.types.JobStatus.DONE
  }

  # we could also configure a timeout, so that a job that runs too long is cancelled
  timeout_seconds = 600
  time_start = time.time()

  # we have to wait for the job to complete
  while job_response.status.state not in terminal_states:
    if time.time() > time_start + timeout_seconds:
      job_client.cancel_job(project_id, region, job_id)
      print("Cancel job {} after {} seconds".format(job_id, timeout_seconds))

    # check the job state once per second
    time.sleep(1)
    job_response = job_client.get_job(project_id, region, job_id)

  # return only after the loop ends, i.e. once the job reached a terminal state
  return job_response


def save_result_into_bucket(cluster_client, job_response):
  # read the job's driver output from the cluster's bucket, so that we could get some information about the job.
  storage_client = storage.Client()
  job_id = job_response.reference.job_id

  cluster_info = cluster_client.get_cluster(project_id, region, cluster_name)

  bucket = storage_client.get_bucket(cluster_info.config.config_bucket)

  output_blob = 'google-cloud-dataproc-metainfo/{}/jobs/{}/driveroutput.000000000'.format(cluster_info.cluster_uuid, job_id)
  output = bucket.blob(output_blob).download_as_string()

  print('job {} finished with status: {}:\n{}'.format(job_id, job_response.status.state, output))


def delete_cluster(cluster_client):
  # after whole step finished, we could delete our cluster
  operation = cluster_client.delete_cluster(project_id, region, cluster_name)
  operation.result()

  print('cluster has been deleted')
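
The wait-with-timeout logic in `submit_job` can also be written as a small standalone helper. The following is only a minimal sketch: `poll_until` and its parameters are hypothetical names that are not part of the Dataproc client, and the job states are simulated with plain strings so that the cell runs without a cluster.

```python
import time


def poll_until(get_state, terminal_states, timeout_seconds=600, poll_seconds=1.0):
    """Call get_state() repeatedly until it returns a terminal state.

    get_state is a hypothetical zero-argument callable supplied by the caller.
    Raises TimeoutError if no terminal state is seen before the deadline.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_state()
        if state in terminal_states:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError("no terminal state after {} seconds".format(timeout_seconds))
        time.sleep(poll_seconds)


# simulated job states, standing in for repeated get_job calls
states = iter(["PENDING", "RUNNING", "DONE"])
final_state = poll_until(lambda: next(states), {"DONE", "ERROR", "CANCELLED"}, poll_seconds=0.01)
print(final_state)  # DONE
```

With a real job, `get_state` would wrap the `get_job` call, and the caller could catch `TimeoutError` to cancel the job.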

In [0]:
! gcloud config set project $project_id

Updated property [core/project].


In [0]:
# first let's get the cluster obj
cluster = create_cluster()

Cluster has been created: python-cluster


## Note

Because of the `sandbox` limitation, I haven't managed to run the job successfully: Dataproc starts its cluster on Compute Engine, which is really expensive for a `sandbox` account. The logic should still be similar, though: we send our jars to the server and start the job on the remote server. Do keep in mind that we have to delete the Dataproc cluster afterwards, so that it doesn't keep costing money. In fact, this is just one way to run a Spark job in the cloud on a cluster that we create manually.
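
One way to make sure the cluster is always deleted, even when the job fails or is interrupted, is to wrap the lifecycle in a context manager. This is a minimal sketch under assumptions: `ephemeral_cluster`, `fake_create` and `fake_delete` are hypothetical names, and the stubs only record what would happen instead of calling Dataproc.

```python
from contextlib import contextmanager


@contextmanager
def ephemeral_cluster(create, delete):
    """Create a cluster, hand it to the with-block, and always delete it afterwards."""
    cluster = create()
    try:
        yield cluster
    finally:
        # runs on success, on exceptions, and on KeyboardInterrupt
        delete(cluster)


events = []


def fake_create():
    # stand-in for the real cluster creation
    events.append("create")
    return "python-cluster"


def fake_delete(name):
    # stand-in for the real cluster deletion
    events.append("delete " + name)


try:
    with ephemeral_cluster(fake_create, fake_delete) as name:
        events.append("submit job on " + name)
        raise RuntimeError("job failed")
except RuntimeError:
    pass

print(events)  # ['create', 'submit job on python-cluster', 'delete python-cluster']
```

In this notebook, `create` and `delete` could be thin wrappers around `create_cluster` and `delete_cluster`.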

In [0]:
# then we could submit our job to the cluster
# (storage_path is defined in the upload cell of the "Training file" section)
job_path = 'gs://{}'.format(storage_path)
job_response = submit_job(job_path)

KeyboardInterrupt: ignored

## Training file

Now we could write our training or processing logic with PySpark. Here I just create a sample file that uses Spark to do feature extraction. After we finish the logic, we should upload the training file to Cloud Storage.

In [0]:
# here I write the job logic
%%writefile training_spark.py
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

import logging

logger = logging.getLogger(__name__)

logger.info('init spark')
spark = SparkSession.builder.getOrCreate()

sentenceDataFrame = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic,regression,models,are,neat")
], ["id", "sentence"])

# split sentence into words.
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")

countTokens = udf(lambda words: len(words), IntegerType())

# with udf function to get each sentence length.
tokenized = tokenizer.transform(sentenceDataFrame)
tokenized.select("sentence", "words")\
    .withColumn("tokens", countTokens(col("words"))).show(truncate=False)

logger.info('whole Spark logic finished.')

Overwriting training_spark.py
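
To preview what the script should print without starting Spark, the tokenization can be approximated in plain Python. Spark's `Tokenizer` lowercases the input and splits it on whitespace; the `tokenize` helper below is a hypothetical stand-in for it, not Spark code.

```python
import re


def tokenize(sentence):
    # approximates pyspark.ml.feature.Tokenizer: lowercase, then split on whitespace
    return re.split(r"\s", sentence.lower())


sentences = [
    "Hi I heard about Spark",
    "I wish Java could use case classes",
    "Logistic,regression,models,are,neat",
]

token_counts = [len(tokenize(s)) for s in sentences]
print(token_counts)  # [5, 7, 1]
```

The last sentence yields a single token because it contains commas but no whitespace; splitting it on commas would need `RegexTokenizer` instead.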


In [0]:
# upload the file into the bucket
import os

bucket_name = 'dataflow_tutorial_bucket'
folder_name = 'spark_code'
file_name = 'training_spark.py'

storage_path = os.path.join(bucket_name, folder_name, file_name)

! gsutil cp ./training_spark.py gs://$storage_path

Copying file://./training_spark.py [Content-Type=text/x-python]...
/ [1 files][  909.0 B/  909.0 B]                                                
Operation completed over 1 objects/909.0 B.                                      
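
`os.path.join` produces forward slashes on Colab because it runs on Linux, but on Windows it would insert backslashes, which are not valid separators in a `gs://` path. A minimal sketch of a portable alternative, where `gcs_uri` is a hypothetical helper name:

```python
import posixpath


def gcs_uri(bucket_name, *parts):
    """Build a gs:// URI; posixpath.join always uses forward slashes."""
    return "gs://" + posixpath.join(bucket_name, *parts)


uri = gcs_uri('dataflow_tutorial_bucket', 'spark_code', 'training_spark.py')
print(uri)  # gs://dataflow_tutorial_bucket/spark_code/training_spark.py
```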
