### Introduction

This notebook walks through the main GCP big data components in one place: GCS, BigQuery, Dataproc, Composer, and Terraform.

In a real product we would wire these steps together with a CI/CD pipeline, but that is process around the work rather than the fundamental part, so it is left out here.

Let's get started.

In [45]:
# First we need a *.tf file that describes the cloud environment we need,
# so this cell writes it to a file called init.tf (any name works).

# Initializing Terraform inside the notebook isn't easy, so I run it from my own environment.
%%writefile init.tf
// create the resources needed for some common AIA use cases

// first with composer
resource "google_composer_environment" "test" {
  name = "proc-cluster-lu"
  region = "us-central1"
  config {
    node_count = 3
    node_config {
      zone = "us-central1-a"
      machine_type = "n1-standard-1"
    }
  }
}

// create bucket
resource "google_storage_bucket" "default" {
  name = "aia-terraform-lugq"
  location = "US"

}

// create dataproc cluster
resource "google_dataproc_cluster" "my-cluster" {
  name = "my-cluster"
  region = "us-central1"

  cluster_config {
    master_config {
      num_instances = 1
      machine_type = "n1-standard-1"
      disk_config {
        boot_disk_size_gb = 15
      }
    }
    worker_config {
      num_instances = 2
      machine_type = "n1-standard-1"
      disk_config {
        boot_disk_size_gb = 15
      }
    }
  }
}

Writing init.tf


In [1]:
#auth the notebook
from google.colab import auth

auth.authenticate_user()

In [2]:
! gcloud config set project cloudtutorial-283609

Updated property [core/project].


In [21]:
project_id = "cloudtutorial-283609"

In [4]:
from sklearn.datasets import load_iris
import numpy as np
import pandas as pd

x, y = load_iris(return_X_y=True)

data = np.concatenate([x, y[:, np.newaxis]], axis=1)

df = pd.DataFrame(data, columns=['a', 'b','c', 'd', 'label'])

df.to_csv('data.csv', index=False)

### upload file into bucket

In [5]:
from google.cloud import storage

client = storage.Client(project_id)

bucket_name = "full_step_bucket_lugq"

bucket = client.create_bucket(bucket_name)

In [6]:
! gsutil ls gs://

gs://full_step_bucket_lugq/


In [7]:
# let's upload the file into bucket
blob = bucket.blob('data.csv')
try:
  blob.upload_from_filename('data.csv')
  print("file has been uploaded")
except Exception as e:
  print("Upload failed with error: {}".format(e))

file has been uploaded


In [8]:
# list bucket
bucket = client.get_bucket(bucket_name)
blobs = bucket.list_blobs()
print(list(blobs))

[<Blob: full_step_bucket_lugq, data.csv, 1595209256655044>]


### Read file from bucket with DataProc

In [10]:
### create a file to read data from storage
# and do a feature engineering step with Binarizer
%%writefile process_data_with_spark.py

from pyspark.sql import SparkSession
from pyspark.ml.feature import Binarizer
import os
import logging

logger = logging.getLogger(__name__)

project_id = "cloudtutorial-283609"
bucket_name = "full_step_bucket_lugq"
file_name = "data.csv"

file_path = "gs://{}/{}".format(bucket_name, file_name)

# init spark
spark = SparkSession.builder.getOrCreate()

logger.info("read data.")
# set `inferSchema` so that Spark infers the column data types instead of reading every column as a string
df = spark.read.format('csv').option('header', 'true').option("inferSchema", 'true').load(file_path)

# show() prints the rows itself and returns None, so call it outside the log message
logger.info("Get data:")
df.show(5)

# binarize column `a`: values above the threshold become 1.0, the rest 0.0
binarizer = Binarizer(threshold=3.0, inputCol="a", outputCol="a_binary")

df_new = binarizer.transform(df)
logger.info("get binarizer result:")
df_new.show(5)

# write the result into the bucket
logger.info("save result into bucket")
# overwrite in case the output already exists
df_new.write.format('csv').mode("overwrite").save("gs://{}/{}".format(bucket_name, 'data_new.csv'))

logger.info("whole step finished.")

Overwriting process_data_with_spark.py
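
To make the feature engineering step concrete, here is a minimal pure-Python sketch of what a binarizer with a threshold does: values strictly greater than the threshold map to 1.0, everything else to 0.0. The `binarize` helper and the sample values are illustrative only and are not part of the Spark job.

```python
def binarize(values, threshold):
    """Return 1.0 for values strictly greater than threshold, else 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]


# Hypothetical sample values, chosen to sit on both sides of the threshold.
sample_a = [5.1, 3.0, 2.9, 4.3]
print(binarize(sample_a, threshold=3.0))  # → [1.0, 0.0, 0.0, 1.0]
```

In the iris data the first feature (column `a`) lies roughly between 4.3 and 7.9, so with `threshold=3.0` every row would end up as 1.0; a threshold inside that range would give a more informative feature.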


In [11]:
# let's upload our pyspark file into bucket
! gsutil cp process_data_with_spark.py gs://full_step_bucket_lugq/process_data_with_spark.py

Copying file://process_data_with_spark.py [Content-Type=text/x-python]...
/ [1 files][   1004 B/   1004 B]                                                
Operation completed over 1 objects/1004.0 B.                                     


In [12]:
# The job needs an existing Dataproc cluster; here it is named proc-cluster-lu
CLUSTER = "proc-cluster-lu"

In [13]:
# set region for DataProc
! gcloud config set dataproc/region us-central1

Updated property [dataproc/region].


In [14]:
# check
! gsutil ls gs://full_step_bucket_lugq

gs://full_step_bucket_lugq/data.csv
gs://full_step_bucket_lugq/process_data_with_spark.py


In [15]:
# install dataproc
! pip install google-cloud-dataproc --quiet


In [16]:
# importing the package raised: AttributeError: module 'google.protobuf.descriptor' has no attribute '_internal_create_key'
# reinstalling protobuf fixes it
! pip uninstall protobuf --quiet

! pip install protobuf --quiet

Proceed (y/n)? y

In [17]:
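# after restarting the runtime, point GOOGLE_APPLICATION_CREDENTIALS at the service account key file
# (here: the first file in the working directory whose name starts with 'cloud')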
import os

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = [x for x in os.listdir('.') if x.lower().startswith('cloud')][0]

In [18]:
# use gcloud to submit the job to our cluster
# if the Dataproc API isn't enabled yet, gcloud will ask whether to enable it
! gcloud dataproc jobs submit pyspark process_data_with_spark.py --cluster=$CLUSTER

Job [7cab330510b6490d9918a755d806a3e3] submitted.
Waiting for job output...
20/07/20 01:51:45 INFO org.spark_project.jetty.util.log: Logging initialized @4094ms
20/07/20 01:51:45 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/07/20 01:51:45 INFO org.spark_project.jetty.server.Server: Started @4275ms
20/07/20 01:51:45 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@735566d1{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
20/07/20 01:51:46 WARN org.apache.spark.scheduler.FairSchedulableBuilder: Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
20/07/20 01:51:48 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at proc-cluster-lu-m/10.128.0.2:8032
20/07/20 01:51:48 INFO org.apache.hadoop.yarn.client.AHSProxy

### Load storage file into bigquery

In [27]:
! gsutil ls gs://

gs://dataproc-staging-us-central1-544826698357-zpklxhfw/
gs://dataproc-temp-us-central1-544826698357-vyt5vdvo/
gs://full_step_bucket_lugq/


In [28]:
# Spark doesn't save a single file to storage; it writes a folder of partition files,
# so here we look up the part file's name in the bucket
from google.cloud import storage

storage_client = storage.Client("cloudtutorial-283609")

bucket_name = "full_step_bucket_lugq"
folder_name = "data_new.csv"

bucket = storage_client.get_bucket(bucket_name)

file_list =  list(bucket.list_blobs(prefix=folder_name))

try:
  file_name = [x.name for x in file_list if x.name.lower().endswith('csv')][0]
  print("get file: {}".format(file_name))
except IndexError:
  print("No CSV file found under {} in the bucket.".format(folder_name))

get file: data_new.csv/part-00000-c7d40999-3e58-42b6-836c-1ec4043af258-c000.csv
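
The lookup above can be sketched without any cloud access. This is a minimal example that assumes the output folder contains part files plus a `_SUCCESS` marker, as Spark normally writes; the `pick_part_file` helper and the object names are made up for illustration.

```python
def pick_part_file(object_names, folder):
    """Return the first Spark part file under `folder`, or None if there is none."""
    prefix = folder.rstrip("/") + "/part-"
    candidates = sorted(
        name for name in object_names
        if name.startswith(prefix) and name.lower().endswith(".csv")
    )
    return candidates[0] if candidates else None


# Hypothetical listing of the output folder.
names = [
    "data_new.csv/",
    "data_new.csv/_SUCCESS",
    "data_new.csv/part-00000-example.csv",
]
print(pick_part_file(names, "data_new.csv"))  # → data_new.csv/part-00000-example.csv
```

Matching on the `part-` prefix instead of only the `csv` suffix avoids picking up an unrelated object, and returning `None` makes the "no file yet" case explicit.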


Next we create the table through the API. The table has to live in a BigQuery `dataset`, which plays roughly the role of a `database`, so let's create the dataset first.

In [29]:
# create dataset with command
! bq mk iris_dataset


Welcome to BigQuery! This script will walk you through the 
process of initializing your .bigqueryrc configuration file.

First, we need to set up your credentials if they do not 
already exist.

Credential creation complete. Now we will select a default project.

List of projects:
  #        projectId           friendlyName    
 --- ---------------------- ------------------ 
  1   cloudtutorial-283609   CloudTutorial     
  2   my-project-34336       My Project 34336  
Found multiple projects. Please enter a selection for 
which should be the default, or leave blank to not 
set a default.

Enter a selection (1 - 2): 1

BigQuery configuration complete! Type "bq" to get started.

Dataset 'cloudtutorial-283609:iris_dataset' successfully created.


In [30]:
from google.cloud import bigquery

# the dataset must already exist (created with `bq mk` above)
dataset_id = "iris_dataset"

# init bigquery client
client = bigquery.Client(project_id)

# create dataset reference
dataset_ref = client.dataset(dataset_id)

# define schema
job_config = bigquery.LoadJobConfig()
job_config.schema = [bigquery.SchemaField("a", "float"),
                     bigquery.SchemaField("b", "float"),
                     bigquery.SchemaField("c", "float"),
                     bigquery.SchemaField("d", "float"),
                     bigquery.SchemaField("label", "float"),
                     bigquery.SchemaField("a_bina", "float")]

# don't skip a leading row: Spark wrote the CSV without a header line,
# so skipping the first row drops a real record (149 rows instead of 150)
# job_config.skip_leading_rows = 1
# set to load csv
job_config.source_format = bigquery.SourceFormat.CSV

# data uri
data_uri = "gs://{}/{}".format(bucket_name, file_name)

# create a load job
load_job = client.load_table_from_uri(data_uri, dataset_ref.table('iris'), job_config=job_config)
print("submitted job: {}".format(load_job.job_id))

# wait for the job to finish
load_job.result()



submitted job: fc4ecc80-28b3-44c8-a10f-2aed61412fe4


<google.cloud.bigquery.job.LoadJob at 0x7f25f128ae48>

In [31]:
# let's check
check_table = client.get_table(dataset_ref.table('iris'))

print("there are {} records in bigquery".format(check_table.num_rows))

there are 150 records in bigquery
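
The 149 versus 150 row difference comes from the file having no header line: the Spark job wrote the CSV without the `header` option, so the first line is already data. A small sketch with the standard `csv` module shows the effect; the three rows and the `load_rows` helper are illustrative and only mimic what skipping leading rows does.

```python
import csv
import io

# Hypothetical three-row extract in the shape Spark wrote it: no header line.
headerless = (
    "5.1,3.5,1.4,0.2,0.0,1.0\n"
    "4.9,3.0,1.4,0.2,0.0,1.0\n"
    "4.7,3.2,1.3,0.2,0.0,1.0\n"
)


def load_rows(text, skip_leading_rows=0):
    """Parse CSV text and drop the first `skip_leading_rows` lines."""
    rows = list(csv.reader(io.StringIO(text)))
    return rows[skip_leading_rows:]


print(len(load_rows(headerless)))                       # → 3
print(len(load_rows(headerless, skip_leading_rows=1)))  # → 2, one real record is lost
```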


In [32]:
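# let's put this into a file for the Composer test
%%writefile load_data_into_bigquery.py
from google.cloud import storage

# the script runs on its own, so define the project here instead of relying on a notebook variable
project_id = "cloudtutorial-283609"

storage_client = storage.Client(project_id)

bucket_name = "full_step_bucket_lugq"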
folder_name = "data_new.csv"

bucket = storage_client.get_bucket(bucket_name)

file_list =  list(bucket.list_blobs(prefix=folder_name))

file_name = [x.name for x in file_list if x.name.lower().endswith('csv')][0]

print("get file: {}".format(file_name))

from google.cloud import bigquery

# the dataset must already exist (created with `bq mk` earlier in this notebook)
dataset_id = "iris_dataset"

# init bigquery client
client = bigquery.Client(project_id)

# create dataset reference
dataset_ref = client.dataset(dataset_id)

# define schema
job_config = bigquery.LoadJobConfig()
job_config.schema = [bigquery.SchemaField("a", "float"),
                     bigquery.SchemaField("b", "float"),
                     bigquery.SchemaField("c", "float"),
                     bigquery.SchemaField("d", "float"),
                     bigquery.SchemaField("label", "float"),
                     bigquery.SchemaField("a_bina", "float")]

# don't skip a leading row: Spark wrote the CSV without a header line,
# so skipping the first row drops a real record (149 rows instead of 150)
# job_config.skip_leading_rows = 1
# set to load csv
job_config.source_format = bigquery.SourceFormat.CSV

# data uri
data_uri = "gs://{}/{}".format(bucket_name, file_name)

# create a load job
load_job = client.load_table_from_uri(data_uri, dataset_ref.table('iris'), job_config=job_config)
print("submitted job: {}".format(load_job.job_id))

# wait for the job to finish
load_job.result()
# let's check
check_table = client.get_table(dataset_ref.table('iris'))

print("there are {} records in bigquery".format(check_table.num_rows))

Writing load_data_into_bigquery.py


#### DataProc with bigquery

In [33]:
# this file reads data from BigQuery and trains a model
%%writefile spark_train_bigquery.py

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression
import logging

logger = logging.getLogger(__name__)

# combine the features into a vector and get the label
def inputs_to_vector(row):
  return (row['label'], Vectors.dense(float(row['a']), 
                                      float(row['b']), 
                                      float(row['c']), 
                                      float(row['d']), 
                                      float(row['a_bina']) ))

# create sparksession
spark = SparkSession.builder.getOrCreate()

# read the BigQuery table into a DataFrame
logger.info("Read data from bigquery")
df = spark.read.format('bigquery').option('table', 'iris_dataset.iris').load()

# logger.info("get dataframe:", df.show(5))
df.createOrReplaceTempView('iris')

df_new = spark.sql("select * from iris")

# map dataframe with vector function
data_df = df_new.rdd.map(inputs_to_vector).toDF(["label", "features"])

# split into train and test
(train_df, test_df) = data_df.randomSplit([0.7, 0.3])

# cache dataframe
train_df.cache()

lr = LogisticRegression(maxIter=100, regParam=0.1,elasticNetParam=0.8)

logger.info("start to train model")
model = lr.fit(train_df)

pred = model.transform(test_df)

# show() prints the rows itself and returns None, so call it on its own line
print("get prediction:")
pred.show(5)


Writing spark_train_bigquery.py


In [36]:
# upload python file into bucket
! gsutil cp spark_train_bigquery.py gs://full_step_bucket_lugq

Copying file://spark_train_bigquery.py [Content-Type=text/x-python]...
/ [1 files][  1.3 KiB/  1.3 KiB]                                                
Operation completed over 1 objects/1.3 KiB.                                      


In [38]:
# check bucket
! gsutil ls gs://full_step_bucket_lugq

gs://full_step_bucket_lugq/data.csv
gs://full_step_bucket_lugq/process_data_with_spark.py
gs://full_step_bucket_lugq/spark_train_bigquery.py
gs://full_step_bucket_lugq/data_new.csv/


##### Tips:

When connecting Spark to BigQuery, we have to provide the connector jar: **gs://spark-lib/bigquery/spark-bigquery-latest.jar**.
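
If the job is submitted from a script rather than from a notebook cell, the same command can be assembled as an argument list. This is only a sketch: the `build_submit_command` helper is hypothetical, and it uses just the `--cluster`, `--region` and `--jars` flags that appear in this notebook.

```python
import shlex


def build_submit_command(main_file, cluster, region, jars=()):
    """Assemble the argument list for `gcloud dataproc jobs submit pyspark`."""
    cmd = [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", main_file,
        "--cluster=" + cluster,
        "--region=" + region,
    ]
    if jars:
        # several jars are passed as one comma-separated value
        cmd.append("--jars=" + ",".join(jars))
    return cmd


cmd = build_submit_command(
    "gs://full_step_bucket_lugq/spark_train_bigquery.py",
    cluster="proc-cluster-lu",
    region="us-central1",
    jars=["gs://spark-lib/bigquery/spark-bigquery-latest.jar"],
)
# Print a shell-safe rendering; the list itself could be handed to subprocess.run.
print(" ".join(shlex.quote(part) for part in cmd))
```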

In [42]:
# submit the dataproc job with command
! gcloud dataproc jobs submit pyspark gs://full_step_bucket_lugq/spark_train_bigquery.py \
--cluster=$CLUSTER --region=us-central1 \
--jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar

Job [e7d9c005583e4a8195b9b528bad2ec1f] submitted.
Waiting for job output...
20/07/20 02:01:09 INFO org.spark_project.jetty.util.log: Logging initialized @4728ms
20/07/20 02:01:09 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/07/20 02:01:09 INFO org.spark_project.jetty.server.Server: Started @4919ms
20/07/20 02:01:09 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@2a07dfeb{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
20/07/20 02:01:10 WARN org.apache.spark.scheduler.FairSchedulableBuilder: Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
20/07/20 02:01:11 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at proc-cluster-lu-m/10.128.0.2:8032
20/07/20 02:01:11 INFO org.apache.hadoop.yarn.client.AHSProxy

### Composer with Dataproc and bigquery

In [41]:
# first we need to create a Composer environment
# note: creating a Composer environment also creates a Kubernetes (GKE) cluster!
# by default that cluster has 3 nodes
! gcloud composer environments create dataproc --location us-central1 --async 

API [composer.googleapis.com] not enabled on project [544826698357]. 
Would you like to enable and retry (this will take a few minutes)? 
(y/N)?  y

Enabling service [composer.googleapis.com] on project [544826698357]...
Operation "operations/acf.bd914bc3-ec92-4b47-ae75-15b721c924d0" finished successfully.
Create in progress for environment [projects/cloudtutorial-283609/locations/us-central1/environments/dataproc] with operation [projects/cloudtutorial-283609/locations/us-central1/operations/6abd9347-d18e-4ca9-a69c-f4f7d9143ce8].
metadata:
  '@type': type.googleapis.com/google.cloud.orchestration.airflow.service.v1.OperationMetadata
  createTime: '2020-07-20T02:00:57.605Z'
  operationType: CREATE
  resource: projects/cloudtutorial-283609/locations/us-central1/environments/dataproc
  resourceUuid: 58cf2723-e86a-43be-a87b-a644199075af
  state: PENDING
name: projects/cloudtutorial-283609/locations/us-central1/operations/6abd9347-d18e-4ca9-a69c-f4f7d9143ce8


In [43]:
# workaround for an 'sdk not found' error: have cluster access use application default credentials
!gcloud config set container/use_application_default_credentials true

Updated property [container/use_application_default_credentials].


#### install Cloud SDK in colab

In [None]:
# before we install kubectl, we have to install Cloud SDK
# then let's install the kubectl to interact with k8s
! echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
! sudo apt-get install apt-transport-https ca-certificates gnupg

! curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add -

# install gcloud-sdk
! sudo apt-get update && sudo apt-get install google-cloud-sdk

# install kubectl
! sudo apt-get install kubectl

##### Set parameters for Composer

In [None]:
# first we set up the variables that we will need in Composer

# if we need to make some variables available to Composer,
# we can set them as below


# ! gcloud composer environments run dataproc --location us-central1 \
# variables -- \
# --set project_id  cloudtutorial-282102

In [None]:
# set gce region into airflow
# ! gcloud composer environments run dataproc --location us-central1 \
# variables -- --set gce_region us-central1

In [None]:
# set gce zone (a zone such as us-central1-a, not the region)
# ! gcloud composer environments run dataproc --location us-central1 \
# variables -- --set gce_zone us-central1-a

In [None]:
# set bucket
# ! gcloud composer environments run dataproc --location us-central1 \
# variables -- --set bucket_path gs://full_step_bucket

##### Create DAG

In [57]:
# last step is to create airflow python DAG
%%writefile dataproc_dag.py
"""
To run the code we need 3 variables:
* gcp_project - Google Cloud Project to use for the Cloud Dataproc cluster.
* gce_zone - Google Compute Engine zone where Cloud Dataproc cluster should be
  created.
* gcs_bucket - Google Cloud Storage bucket to use for result of Hadoop job.
  See https://cloud.google.com/storage/docs/creating-buckets for creating a
  bucket.
"""
import datetime

from airflow import models
from airflow.contrib.operators import  dataproc_operator
from airflow.utils import trigger_rule

### These lines read the parameters configured as Airflow variables (unused here; the values are hard-coded below)
# bucket_path = models.Variable.get("bucket_path")
# project_id = models.Variable.get('project_id')
# gce_zone = models.Variable.get("gce_zone")
# gce_region = models.Variable.get('gce_region')


# default args; start_date has to lie in the past so that the DAG gets scheduled, hence now minus one day
now = datetime.datetime.now() - datetime.timedelta(days=1)

default_args = {
    'owner': 'lugq',
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'start_date': now,
    'project_id': "cloudtutorial-283609",    # note: when switching to a new project, remember to change this, otherwise you get a permission error.
    'region': 'us-central1',
    'zone': 'us-central1-a',
   #'cluster_name': " proc-cluster-lu",  # according to: https://stackoverflow.com/questions/50134110/dataprocpysparkoperator-cluster-region-and-zone-issue
}

# spark training file url
training_file = "gs://full_step_bucket_lugq/process_data_with_spark.py"

# Define the DAG. Rather than relying on a long-running cluster, let Composer create the Dataproc cluster,
# run the job, and delete the cluster afterwards.
with models.DAG("dataproc_dag", default_args=default_args,
                schedule_interval=datetime.timedelta(days=1)) as dag:
  # the DAG is: create cluster -> submit job -> delete cluster
  
  # 1. create cluster; because of the current IP usage problem, the original cluster has to be deleted first.
  create_cluster = dataproc_operator.DataprocClusterCreateOperator(
      task_id = "create_cluster",
      cluster_name = "composer-cluster",
      num_workers=0,
      zone = 'us-central1-a',
      master_machine_type='n1-standard-1',
  )
  
  # 2. submit jobs
  # Note: to submit the job to Dataproc we have to pass the cluster name (and region etc.) explicitly;
  # otherwise the default 'cluster-1' is used. I tried this many times.
  data_job = dataproc_operator.DataProcPySparkOperator(task_id="training_with_airflow", 
                                                       main=training_file,
                                                       cluster_name="composer-cluster")
  
  # 3. delete cluster
  delete_cluster = dataproc_operator.DataprocClusterDeleteOperator(
      task_id="delete_cluster",
      cluster_name = "composer-cluster",
      # ALL_DONE: delete the cluster whether the upstream tasks succeeded or failed.
      trigger_rule = trigger_rule.TriggerRule.ALL_DONE
  )

  # define the task order
  create_cluster >> data_job >> delete_cluster

  # this runs whenever Airflow parses the file, not when the tasks finish
  print("dataproc DAG definition loaded")

Overwriting dataproc_dag.py
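
The `trigger_rule` on `delete_cluster` is what keeps the cluster from lingering after a failed job. Below is a simplified pure-Python model of two Airflow trigger rules. It is not Airflow's implementation; the `should_run` helper is hypothetical and only mirrors the documented meaning of `all_success` (the default) and `all_done`.

```python
# Task states that count as "done" in this simplified model.
FINISHED = {"success", "failed", "upstream_failed", "skipped"}


def should_run(trigger_rule, upstream_states):
    """Decide whether a task may start, given the states of its upstream tasks."""
    if trigger_rule == "all_success":
        return all(state == "success" for state in upstream_states)
    if trigger_rule == "all_done":
        return all(state in FINISHED for state in upstream_states)
    raise ValueError("unsupported trigger rule: " + trigger_rule)


# Suppose the PySpark job fails; delete_cluster has the job as its only upstream task.
job_state = "failed"
print(should_run("all_success", [job_state]))  # → False, the cluster would stay up
print(should_run("all_done", [job_state]))     # → True, the cluster is still deleted
```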


In [55]:
# let's get the DAGs bucket path
! gcloud composer environments describe  --location us-central1 --format=json dataproc

{
  "config": {
    "airflowUri": "https://qd0d300c3c3857e79p-tp.appspot.com",
    "dagGcsPrefix": "gs://us-central1-dataproc-6abd9347-bucket/dags",
    "gkeCluster": "projects/cloudtutorial-283609/zones/us-central1-a/clusters/us-central1-dataproc-6abd9347-gke",
    "nodeConfig": {
      "diskSizeGb": 100,
      "ipAllocationPolicy": {},
      "location": "projects/cloudtutorial-283609/zones/us-central1-a",
      "machineType": "projects/cloudtutorial-283609/zones/us-central1-a/machineTypes/n1-standard-1",
      "network": "projects/cloudtutorial-283609/global/networks/default",
      "oauthScopes": [
        "https://www.googleapis.com/auth/cloud-platform"
      ],
      "serviceAccount": "544826698357-compute@developer.gserviceaccount.com"
    },
    "nodeCount": 3,
    "privateEnvironmentConfig": {
      "cloudSqlIpv4CidrBlock": "10.0.0.0/12",
      "privateClusterConfig": {},
      "webServerIpv4CidrBlock": "172.31.245.0/24"
    },
    "softwareConfig": {
      "imageVersion": "com

In [58]:
# upload the DAG file into the environment's DAG folder
# (the `dagGcsPrefix` value from the describe output above; the Composer console shows it as well)

# The bucket name changes with each environment, so update it every time this is rerun in another project.
! gsutil cp dataproc_dag.py gs://us-central1-dataproc-6abd9347-bucket/dags/

Copying file://dataproc_dag.py [Content-Type=text/x-python]...
/ [1 files][  3.0 KiB/  3.0 KiB]                                                
Operation completed over 1 objects/3.0 KiB.                                      
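
Because the DAG bucket changes with every environment, the upload target can be derived from the describe output instead of being pasted by hand. This is a minimal sketch that assumes the JSON has the `config.dagGcsPrefix` field shown earlier; the `describe_output` string is a trimmed stand-in for the real command output, and both helper names are made up.

```python
import json

# Trimmed stand-in for `gcloud composer environments describe ... --format=json`.
describe_output = """
{
  "config": {
    "dagGcsPrefix": "gs://us-central1-dataproc-6abd9347-bucket/dags",
    "nodeCount": 3
  }
}
"""


def dag_folder(describe_json):
    """Return the DAG folder URI from the describe output."""
    return json.loads(describe_json)["config"]["dagGcsPrefix"]


def upload_target(describe_json, filename):
    """Return the full destination URI for a DAG file."""
    return dag_folder(describe_json).rstrip("/") + "/" + filename


print(upload_target(describe_output, "dataproc_dag.py"))
# → gs://us-central1-dataproc-6abd9347-bucket/dags/dataproc_dag.py
```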


#### Conclusion

The good news is that **Composer** can create the cluster, submit the job, and delete the cluster, and the job succeeds. The result shows that the whole run finished successfully:
![Composer with Dataproc result](https://docs.google.com/uc?export=download&id=1u998GAmTc6ohrC_RHMYjMCFDNnc82Hhc)

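After all the steps have finished, we can run the Terraform command `terraform destroy` to tear down the resources that Terraform created for this project. Resources created outside Terraform (for example with `gcloud` or the client libraries in this notebook) have to be deleted separately.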