# GCP Professional Data Engineer
### Leveraging Unstructured Data with Cloud Dataproc on Google Cloud Platform
#### Modules:
- Introduction to Cloud Dataproc
- Running Dataproc Jobs
- Leveraging GCP
- Analyzing Unstructured Data

#### Course Description:

This 1-week, accelerated course builds upon previous courses in the Data Engineering on Google Cloud Platform specialization. Through a combination of video lectures, demonstrations, and hands-on labs, you'll learn how to create and manage computing clusters to run Hadoop, Spark, Pig and/or Hive jobs on Google Cloud Platform. You will also learn how to access various cloud storage options from your compute clusters and integrate Google’s machine learning capabilities into your analytics programs.

In the hands-on labs, you will create and manage Dataproc clusters using the Web Console and the CLI, and use the cluster to run Spark and Pig jobs. You will then create iPython notebooks that integrate with BigQuery and storage and utilize Spark. Finally, you will integrate the machine learning APIs into your data analysis.

#### Prerequisites:
- Google Cloud Platform Big Data & Machine Learning Fundamentals (or equivalent experience)
- Some knowledge of Python

## Module 1: Introduction to Cloud Dataproc
### Introducing Cloud Dataproc
- Cloud Dataproc is a fast, easy-to-use, fully-managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way. Operations that used to take hours or days take seconds or minutes instead, and you pay only for the resources you use (with per-second billing). Cloud Dataproc also easily integrates with other Google Cloud Platform (GCP) services, giving you a powerful and complete platform for data processing, analytics and machine learning.

### Why Unstructured Data
- About 90% of enterprise data that is collected by a business tends to be unstructured. This includes:
    - Emails, reviews, text, etc.
- Consider the Google Street View initiative: cars gathered street-level photos with no immediate payoff, yet that imagery is now the foundation for one of the most useful datasets available for autonomous cars.
    
### Why Cloud Dataproc
- Horizontal vs vertical scaling
- Hadoop origins:
    - Based on Google's whitepaper describing MapReduce, Hadoop is an open-source framework for distributed processing. Its storage layer is the Hadoop Distributed File System (HDFS). Spark is a processing engine that can run on a Hadoop cluster and read data from the distributed file system to process tasks efficiently.
- Running an onsite Hadoop cluster is costly and often inefficient.
- Additional benefits:
    - Stateless clusters in <90 seconds
    - Supports Hadoop, Spark, Pig, Hive
    - High-level APIs for job submission
    - Connectors for Bigtable, BigQuery, and Cloud Storage

### Lab: Create a Dataproc Cluster

#### Objectives:
- Prepare a bucket for cluster initialization
- Create a Dataproc Hadoop Cluster customized to use the Google Cloud API
- Enable secure access to the Dataproc cluster
- Explore Hadoop operations

#### Task 1: Prepare Environment Variables
- In the Console, on the Navigation menu, click Compute Engine > VM instances.
- Locate the line with the instance called training_vm.
- On the far right, under 'connect', click on SSH to open a terminal window.
- In this lab you will enter CLI commands on the training_vm.
##### Create the source file for setting and resetting environment variables

- In the training_vm SSH terminal window, using your preferred command line editor, create and edit a file named myenv to hold your environment variables.
- One environment variable that you will set is PROJECT_ID, which contains the Google Cloud project ID required to access billable resources.
- In the Console, on the Navigation menu, click Home. In the panel with Project Info, the Project ID is listed. You can also find this information in the Qwiklabs tab under Connection Details, where it is labeled GCP Project ID.
- Add the environment variable to myenv for easy reference.
- Dataproc can use a Cloud Storage bucket to stage its files during initialization. You can use this bucket to stage application programs or data for use by Dataproc. The bucket can also host Dataproc initialization scripts and output. The bucket name must be globally unique. Qwiklabs has already created a bucket for you that has the same name as the Project ID, which is already globally unique.
- In the Console, on the Navigation menu, click Storage > Browser. Verify that the bucket exists. Notice the default storage class and the location (region) of this bucket. You will be using this region information next.
- Add the line to myenv to create an environment variable named BUCKET.
- You can use $BUCKET in CLI commands. If you need to enter the bucket name <your-bucket> in a text field in Console, you can quickly retrieve the name with echo $BUCKET.
- You will be creating a Dataproc cluster in a specific region. The Dataproc cluster and the bucket it will use for staging must be in the same region. Since the bucket you are using already exists, you will need to match the environment variable $MYREGION to the bucket region.
- You can find the region used by Qwiklabs on the Qwiklabs tab under Connection Details, labeled QL Region.
- The zone must be in the same region. MYZONE will contain this value.
- You can find the zone used by Qwiklabs on the Qwiklabs tab under Connection Details, labeled QL Zone.
- Add the environment variables to myenv for easy reference.
- You will use the browser IP address to enable your local browser to reach the Dataproc cluster.
- Find your computer's browser IP address by opening a browser window and viewing http://ip4.me/. Copy the IP address.
- Add the line to myenv to create an environment variable named BROWSER_IP.
- After you have added all five definitions to myenv and saved the file, use the source command to create the environment variables.

```
cd ~
nano myenv

# contents of myenv
PROJECT_ID=<project ID>
BUCKET=<project ID>
MYREGION=<region>
MYZONE=<zone>
BROWSER_IP=<your-browser-ip>

# set environment variables
source myenv
# verify variables are set
echo $PROJECT_ID
echo $MYREGION $MYZONE
echo $BUCKET
echo $BROWSER_IP
```
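The requirement that the zone sit inside the chosen region can be checked mechanically before creating the cluster. The following is a minimal sketch, assuming the usual Compute Engine naming convention in which a zone name is the region name plus a one-letter suffix (for example, us-central1-a in us-central1). The function name `zone_in_region` is hypothetical and not part of the lab.

```python
def zone_in_region(zone: str, region: str) -> bool:
    """Return True if `zone` looks like it belongs to `region`.

    Assumes zone names follow the pattern <region>-<letter>.
    """
    # Split on the last hyphen: "us-central1-a" -> ("us-central1", "-", "a")
    prefix, _, suffix = zone.rpartition("-")
    return prefix == region and len(suffix) == 1 and suffix.isalpha()


print(zone_in_region("us-central1-a", "us-central1"))  # True
print(zone_in_region("us-east1-b", "us-central1"))     # False
```

A check like this only validates the naming pattern. It does not confirm that the zone actually exists or that the bucket lives in that region.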

#### Task 2. Create a Dataproc Cluster
- In the Console, on the Navigation menu, click Dataproc > Clusters.
- Click Create Cluster.
- Click on Preemptible workers, bucket, network, version, initialization, & access options to show the advanced settings.
- Specify the following, and leave the remaining settings as their defaults:
    - Name: cluster-dataproc
    - Region: <myregion>
    - Zone: <myzone>
    - Cluster mode: Standard (1 master, n workers)
    - (Master node) Machine type: n1-standard-2
    - (Master node) Primary disk size: 100GB
    - (Worker node) Machine type: n1-standard-1
    - (Worker node) Primary disk size: 50GB
    - Nodes: 3
    - Network tags: hadoopaccess
    - Cloud storage staging bucket: <your bucket>
    - Image version: 1.2
    - Project access: Allow API access
    
    
- Click Create.
- The cluster will take several minutes to become operational. In the Console, on the Navigation menu, click Dataproc > Clusters.
- Click on your cluster, cluster-dataproc. Then click on the VM Instances tab. The instances will become operational before the hadoop software has completed initialization. When a checkmark in a green circle appears next to the name of the cluster, it is operational.

#### Task 3: Enable secure access to Dataproc cluster
- Create a firewall rule that allows access only to the Master Node from your computer's IP address. Only ports 8088 (Hadoop Job Interface) and 9870 (Hadoop Admin interface) will be permitted.
- Port 8042 is the web UI for the node manager on the worker nodes and port 8080 is the default port for Datalab. Datalab is a notebook-based integrated development environment derived from Jupyter notebooks. It is a common tool for developing Dataproc applications. The Serverless Machine Learning on GCP course uses Datalab extensively.
- Recall your computer's browser IP address for use in Console.

    ```echo $BROWSER_IP```


- In the Console, on the Navigation menu, click VPC Network > Firewall rules.
- Click Create Firewall Rule.
- Specify the following, and leave the remaining settings as their defaults:
    - Name: allow-hadoop
    - Network: default
    - Priority: 1000
    - Direction of traffic: Ingress
    - Action on match: allow
    - Targets: specified target tags
    - Target tags: hadoopaccess
    - Source IP ranges: <yourIP>/32
    - Specified ports and protocols: tcp:9870;tcp:8088
    
- Verify that the network tag "hadoopaccess" is set on the Master Node. That will apply the firewall rule to the Master Node, giving your laptop access to it.
- In the Console, on the Navigation menu, click Compute Engine > VM Instances.
- Click on the Master Node, cluster-dataproc-m.
- Verify that under Network Tags it lists hadoopaccess.
- If the tag is not there, Click EDIT.
- Under Network Tags add the tag: hadoopaccess
- Click Save.
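The `/32` suffix in the source range restricts the rule to exactly one IPv4 address. A minimal sketch of that idea using Python's standard `ipaddress` module follows. The helper name `single_host_range` and the sample address (taken from a documentation-only range) are assumptions for illustration.

```python
import ipaddress


def single_host_range(ip: str) -> str:
    """Return the /32 CIDR range that matches only `ip` (IPv4)."""
    address = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    return f"{address}/32"


network = ipaddress.ip_network(single_host_range("203.0.113.7"))
print(network, network.num_addresses)  # 203.0.113.7/32 1
```

A wider prefix such as `/24` would admit 256 addresses, which is why the lab uses `/32` to limit access to your own browser.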

#### Task 4: Explore Hadoop Operations
- In the Console, on the Navigation menu, click Compute Engine > VM Instances.
- In the list of VM instances, in the row for cluster-dataproc-m, highlight the External IP and copy it.
- Open a new browser tab or window and paste the External IP. Add ":8088" after the IP and press enter. Example: <External IP>:8088 The web page displayed is the Hadoop Applications interface.
- Open a new browser tab or window. Paste the External IP. Add ":9870" after the IP and press return. Example: <External IP>:9870 The webpage displayed is the Hadoop Administration Interface and should look something like this:
- Click on the Datanodes tab. This will show you how much capacity is being used on the worker nodes HDFS (Hadoop Distributed File System) and how much capacity remains.
- Click on Utilities > Logs. This shows you the Hadoop log files for each node in the cluster. This is where you can go to investigate problems with Hadoop. Use your browser's back button to return to the Hadoop Administration console.
- Click on Utilities > Browse the file system. After a few moments the file system will appear in the browser page. You can use this to navigate the file system. In the row that says Owner is hdfs and Group is hadoop, click on the link that says user. Here you can see directories for all the Hadoop services.
- Leave the JobTracker <External IP>:8088 and the Administration Interface <External IP>:9870 tabs or windows open. You will use them in the next task.


### Module 1 Review

1.) Which of the following statements is true about Cloud Dataproc?
- Lets you run Spark and Hadoop clusters with minimal administration
- Helps you create job-specific clusters w/o HDFS

2.) Matching definitions:
- Zone: determines the Google data center where compute nodes will be
- Preemptible: costs less but may not always be available
- Standard cluster mode: Provides 1 master and n workers

## Module 2: Running Dataproc Jobs
### Running Jobs
- Secure Shell (SSH) is a cryptographic network protocol for operating network services securely over an unsecured network. The standard TCP port for SSH is 22. The best known example application is for remote login to computer systems by users.

### Lab: Work with structured and semi-structured Data
#### Objectives:
- Use the Hive CLI and run a Pig job
- Hive is used for structured data, similar to SQL
- Pig is used for semi-structured data, similar to SQL + scripting

#### Task 1: Preparation
- A Dataproc cluster has been prepared for you. If you log in to GCP before the progress bar reports that the "Lab is Running", you may have to wait several minutes for the cluster to transition from "Provisioning" to "Running" before it completes setup.
- You will be performing most of the lab steps from the Master Node of the cluster in an SSH terminal window.
- In the Console, on the Navigation menu, click Dataproc > Clusters.
- Locate the cluster named dataproc-cluster. Which region and zone is it located in? The region and zone have been selected automatically for you by Qwiklabs.
- Notice the Cloud Storage staging bucket defined for this cluster. This bucket has the same name as the project ID, which is a convenient way to make the name globally unique.
- Click on the name dataproc-cluster to go to the Cluster details page.
- The Cluster details page opens to the "Overview" tab. Click on the tab labeled "VM Instances".
- Open the Master Node terminal
- On the line for the VM named dataproc-cluster-m you will see that it has the Role of Master and there is an SSH link next to it. Click on SSH to open a terminal window to the Master Node.

#### Task 2. Enable secure web access to the Dataproc cluster
- Create a restrictive firewall rule using Target tags, IP address, and protocol
- Create a firewall rule that allows access only to the Master Node from your computer's IP address. Only ports 8088 (Hadoop Job Interface) and 9870 (Hadoop Admin interface) will be permitted.
- Verify that the network tag is set on the Master Node
- Verify that the network tag "hadoopaccess" is set on the Master Node. That will apply the firewall rule to the Master Node, giving your laptop access to it.
- In the Console, on the Navigation menu, click Compute Engine > VM Instances.
- Click on the Master Node, dataproc-cluster-m.
- Verify that under Network Tags it lists hadoopaccess.
- If the tag is not there, Click EDIT.
- Under Network Tags add the tag: hadoopaccess
- Click Save.
- Identify the browser IP address
- You will use the browser IP address to allow your local browser to connect to the Dataproc cluster.
- Find your computer's browser IP address by opening a browser window and viewing http://ip4.me/ Copy the IP address.
- Create the firewall rule
- In the Console, on the Navigation menu, click VPC Network > Firewall rules.
- Click Create Firewall Rule.
- Specify the following, and leave the remaining settings as their defaults:
    - Name: allow-hadoop
    - Network: default
    - Priority: 1000
    - Direction of traffic: Ingress
    - Action on match: allow
    - Targets: specified target tags
    - Target tags: hadoopaccess
    - Source IP ranges: <yourIP>/32
    - Specified ports and protocols: tcp:9870;tcp:8088
- Click Create.
    
#### Task 3. Prepare the data for Hive
- Copy sample files to the Master node home directory
- The sample files you need have already been archived on the Master Node. You will need to copy them into your user directory with the following commands.
- In the Master Node SSH terminal window:

```
cd
cp -r /training .
ls
cd training/training-data-analyst/courses/unstructured
ls pet*.*
# view structured data in text file
cat pet-details.txt
# stage data in HDFS
hadoop fs -mkdir /pet-details
hadoop fs -put pet-details.txt /pet-details
```

- In the Console, on the Navigation menu, click Compute Engine > VM Instances.
- In the list of VM instances, in the row for dataproc-cluster-m, highlight the External IP and copy it.
- Open a new browser tab or window and paste the External IP. Add ":9870" after the IP and press enter. Example: <External IP>:9870
- You should now see the Hadoop Administration interface. Under Utilities, click on Browse the file system. Click on the folder /pet-details.
- Notice that the file pet-details.txt is inside /pet-details.
- Leave the Hadoop Administration interface open. You will return to it in later steps.
    
#### Task 4. Explore Hive using the Hive interactive CLI
- Use HIVE to access the data in HDFS as if it were in a database
- Hive provides a subset of SQL. The way it does this is by maintaining metadata to define a schema on top of the data. This is one way to work with a large amount of distributed data in HDFS using familiar SQL syntax.
- In the master node SSH window, make sure you are in the right directory and start the Hive CLI interpreter:

```
# start the Hive CLI interpreter (run in the shell)
hive

-- the following statements are entered at the hive> prompt
CREATE DATABASE pets;
USE pets;
-- create table
CREATE TABLE details (Type String, Name String, Breed String, Color String, Weight Int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
SHOW TABLES;
DESCRIBE pets.details;
-- establish relationship between metadata schema and data in HDFS
LOAD DATA INPATH '/pet-details/pet-details.txt' OVERWRITE INTO TABLE details;
-- verify that everything is working
SELECT * FROM pets.details;
-- quit the Hive interpreter
quit;
```

- Use the Hadoop Administration interface to see how hive works
- Hive ingested the pet-details.txt file into a data warehouse format requiring a schema. You will use the Hadoop Administration interface to see this transformation.
- Return to the Hadoop Administration interface in the browser.
- Under Utilities, click on Browse the file system. Click on the folder /pet-details. The file pet-details.txt is gone.
- Under Utilities, click on Browse the file system. Then click on user > hive > warehouse > pets.db > details. The file pet-details.txt has been moved to this location.

***Note:*** Hive is designed for batch jobs and not for transactions. It ingests data into a data warehouse format requiring a schema. It does not support real-time queries, row-level updates, or unstructured data. Some queries may run much slower than others due to the underlying transformations Hive has to implement to simulate SQL.
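The idea of a schema layered on top of a delimited text file can be illustrated outside of Hive. The sketch below is not how Hive is implemented. It only shows, in plain Python, how the column names and types from the `CREATE TABLE` statement above could be applied to comma-delimited rows at read time. The sample rows are hypothetical, since the contents of pet-details.txt are not reproduced in these notes.

```python
import csv
import io

# Column names and types taken from the CREATE TABLE statement in this task.
SCHEMA = [("Type", str), ("Name", str), ("Breed", str), ("Color", str), ("Weight", int)]

# Hypothetical sample rows; the real pet-details.txt contents are not shown here.
RAW = "Dog,Rex,Labrador,Black,30\nCat,Tom,Siamese,Cream,5\n"


def apply_schema(text):
    """Parse comma-delimited text and cast each field using SCHEMA."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        rows.append({name: cast(value) for (name, cast), value in zip(SCHEMA, fields)})
    return rows


rows = apply_schema(RAW)
print(rows[0]["Name"], rows[0]["Weight"])  # Rex 30
```

The underlying file stays plain text. Only the metadata (here, `SCHEMA`) gives the fields names and types.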

#### Task 5: Run a Pig job

- In the master node's SSH window, view the Pig application:

```cat pet-details.pig```

- In line 'x1', the load statement in the application creates a schema on top of the HDFS data file. Lines 'x2' through 'x5' perform transformations on the data. And the last line stores the result in a folder called /GroupedByType in HDFS.
- The application expects to find the ingest file in HDFS in the directory /pet-details. Make another copy of the data at that location:

```hadoop fs -put pet-details.txt /pet-details```

- Run the application:

```pig < pet-details.pig```

- Return to the browser tab containing the Hadoop Applications interface and refresh it, or reopen it with <External-IP>:8088. Notice that Pig generated a Java MapReduce job which is running on the cluster. Click the browser refresh button to watch for job completion.
- Return to the browser tab containing the Hadoop Administration interface and refresh it, or reopen it with <External-IP>:9870. Under Utilities, click on Browse the file system. In the resulting list, click on GroupedByType. This is the output directory specified in the Pig application. The file named part-r-00000 is the HDFS file containing the output. You cannot view the contents from here. First, you must download that part to the local file system.
- Return to the SSH terminal on the Master Node, dataproc-cluster-m, make a local output directory, and retrieve the results from HDFS.

```
cd
mkdir output
cd output
hadoop fs -get /GroupedByType/part* .
# view results of the pig job
cat part-r-00000
```

***Note:*** Pig provides SQL primitives similar to Hive, but in a more flexible scripting language format. Pig can also deal with semi-structured data, such as data having partial schemas, or for which the schema is not yet known. For this reason it is sometimes used for Extract Transform Load (ETL). It generates Java MapReduce jobs. Pig is not designed to deal with unstructured data.
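To make the "generates MapReduce jobs" point more concrete, here is a rough sketch of a group-by expressed as a map step and a reduce step in plain Python. The contents of pet-details.pig are not reproduced in these notes, so this is an assumption based only on the output directory name /GroupedByType. The records and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical records in the (Type, Name, Breed, Color, Weight) layout.
RECORDS = [
    ("Dog", "Rex", "Labrador", "Black", 30),
    ("Cat", "Tom", "Siamese", "Cream", 5),
    ("Dog", "Fido", "Beagle", "Brown", 12),
]


def map_phase(records):
    """Emit one (key, value) pair per record, keyed by pet type."""
    for record in records:
        yield record[0], record


def reduce_phase(pairs):
    """Collect all values that share a key, as a shuffle and reduce step would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


grouped = reduce_phase(map_phase(RECORDS))
for pet_type in sorted(grouped):
    print(pet_type, [record[1] for record in grouped[pet_type]])
```

On a real cluster the map and reduce steps run on different worker nodes, and each reducer writes its own output file, which is where names such as part-r-00000 come from.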

### End Lab

### Separation of Storage & Compute
### Submitting Jobs
### Spark RDDs, Transformations, and Actions
### Lab: Working with Spark Jobs
### Module 2 Review

## Module 3: Leveraging GCP
### BigQuery Support
### Lab: Leverage GCP
#### Objectives:
- Explore Spark using PySpark jobs
- Using Cloud Storage instead of HDFS
- Run a PySpark application from Cloud Storage
- Using Python Pandas to add BigQuery to a Spark application

#### Task 1: Prepare the Master Node and the Bucket
- In the Console, on the Navigation menu, click Dataproc > Clusters.
- Locate the cluster named dataproc-cluster.
- Click on the name dataproc-cluster to go to the Cluster details page.
- The Cluster details page opens to the "Overview" tab. Click on the tab labeled "VM Instances".
- On the line for the VM named dataproc-cluster-m you will see that it has the Role of Master and there is an SSH link next to it. Click on SSH to open a terminal window to the Master Node.
- In the Master Node SSH terminal window, type:

```
cd
cp -r /training .
ls
```

#### Note: A Cloud Storage bucket has already been created for you. It has the same name as the Project ID. You will create an environment variable to make it easy to reference the bucket from the command line on the Master Node.

- In the console, on the Navigation menu click Storage -> Bucket
- In the Master Node SSH terminal window:

```
BUCKET=<bucket-name>
echo $BUCKET
```

#### Task 2: The two letter lab

#### Note: Why would you want to use Cloud Storage instead of HDFS?
- You can shut down the cluster when you are not running jobs. The storage persists even when the cluster is shut down, so you don't have to pay for the cluster just to maintain data in HDFS.
- In some cases Cloud Storage provides better performance than HDFS.
- Cloud Storage does not require the administration overhead of a local file system.

- Place a copy of your sample data file in a Cloud Storage bucket instead of HDFS.
- In the Master Node terminal window, enter the following gsutil command to copy the sample text files to the Cloud Storage bucket.

```
gsutil cp /training/road-not-taken.txt gs://$BUCKET
```

- In the SSH terminal for the Master Node, use nano or vi to create the file wordcount.py
- Copy and paste the following code into the file:

```
from pyspark.sql import SparkSession
from operator import add
import re

print("Okay Google.")

spark = SparkSession\
        .builder\
        .appName("CountUniqueWords")\
        .getOrCreate()

# read the text file and keep the single string column of each row
lines = spark.read.text("/sampledata/road-not-taken.txt").rdd.map(lambda x: x[0])
# split into words, drop tokens with no letters or a single character,
# uppercase, then count occurrences of each word
counts = lines.flatMap(lambda x: x.split(' ')) \
                  .filter(lambda x: re.sub('[^a-zA-Z]+', '', x)) \
                  .filter(lambda x: len(x)>1 ) \
                  .map(lambda x: x.upper()) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add) \
                  .sortByKey()
output = counts.collect()
for (word, count) in output:
  print("%s = %i" % (word, count))

spark.stop()
```
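To see what each stage of the pipeline does without a Spark installation, the same steps can be mirrored on an ordinary Python list. This is a sketch for understanding only, not a replacement for the Spark job. The function name `count_words` and the input lines are hypothetical.

```python
import re
from collections import Counter


def count_words(lines):
    """Mirror the stages of the wordcount.py pipeline on a list of strings."""
    # flatMap: split every line into tokens
    words = (word for line in lines for word in line.split(" "))
    # filter: re.sub returns the token with non-letters removed; an empty result
    # is falsy, so this stage only drops tokens that contain no letters at all
    words = (word for word in words if re.sub("[^a-zA-Z]+", "", word))
    # filter: drop single-character tokens
    words = (word for word in words if len(word) > 1)
    # map + reduceByKey: uppercase each token and count occurrences
    counts = Counter(word.upper() for word in words)
    # sortByKey: order the (word, count) pairs alphabetically
    return sorted(counts.items())


# Hypothetical input lines, not the contents of road-not-taken.txt.
result = count_words(["to be, or not to be", "be 42 a"])
for word, count in result:
    print("%s = %i" % (word, count))
```

Note that the regular expression is only used as a test, so punctuation stays attached to a word: in this example `BE,` and `BE` are counted separately.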

- First, verify that the data file does not exist in HDFS:
```
hadoop fs -ls
```
- Next, use the Hadoop file system command to view the files through the hadoop connector to Cloud Storage. This verifies that the connector is working and that the file is available in the bucket.
```
hadoop fs -ls gs://$BUCKET
```
- Edit wordcount.py in nano or vi and replace this line:

```
lines = spark.read.text("/sampledata/road-not-taken.txt").rdd.map(lambda x: x[0])
```

- with a line that refers to the file in Cloud Storage. Remember to remove "/sampledata" because that directory does not exist. Remember to use the actual bucket name and not the environment variable. The Worker Nodes on the cluster where the program will run do not know the value of the local environment variable on the Master Node.

```
lines = spark.read.text("gs://<YOUR-BUCKET>/road-not-taken.txt").rdd.map(lambda x: x[0])
```

- Run the job:

```
spark-submit wordcount.py
```

#### Task 3: Run a PySpark application from Cloud Storage

- In the previous task you created a PySpark application in a development environment (on the Master Node). You tested the application using spark-submit.
- In this task you will migrate the application from the development environment to a production environment. You will stage the working application file in Cloud Storage. And you will run the production job from Console.
- In the Master Node terminal, use the following command to copy the tested wordcount.py PySpark application to the bucket.
```
gsutil cp wordcount.py gs://$BUCKET
```

- In the Console, on the Navigation menu, click Dataproc > Clusters. Take note of the region where the cluster is located. You will need that in the next steps.
- You will also need the bucket name. You can also retrieve the bucket name from the Master Node terminal by entering the following. Highlight the bucket name and copy it.

```
echo $BUCKET
```
- In the Console, on the Navigation menu, click Dataproc > Jobs.
- Click Submit job and specify the following:
    - Region: <your-region>
    - Cluster: dataproc-cluster
    - Job type: PySpark
    - Main python file: gs://<your bucket>/wordcount.py
- Click Submit. Check Dataproc -> Jobs for progress.
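The same job can also be submitted from the command line with `gcloud dataproc jobs submit pyspark`. The sketch below only assembles the argument list for that command and prints it, so it can be inspected before anything is run. The helper name `pyspark_submit_command` and the sample bucket and region values are assumptions for illustration.

```python
import shlex


def pyspark_submit_command(bucket, cluster, region, script="wordcount.py"):
    """Build the argument list for submitting a PySpark job with gcloud."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark",
        f"gs://{bucket}/{script}",
        f"--cluster={cluster}",
        f"--region={region}",
    ]


# Hypothetical bucket and region values for illustration.
command = pyspark_submit_command("my-project-id", "dataproc-cluster", "us-central1")
print(shlex.join(command))
```

Keeping the command as a list makes it straightforward to pass to `subprocess.run` later without worrying about shell quoting.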
    
### End Lab

### Customizing Clusters
### Lab: Cluster Automation using CLI Commands
#### Objectives:
- Create a customized Dataproc cluster using Cloud Shell

#### Task 1: Preparation of Env Variables
- In the Console, on the Navigation menu, click Compute Engine > VM instances.
- Locate the line with the instance called training_vm.
- On the far right, under 'connect', click on SSH to open a terminal window.
- In this lab you will enter CLI commands on the training_vm.
#### Note: Dataproc can use a Cloud Storage bucket to stage its files during initialization. You can use this bucket to stage application programs or data for use by Dataproc. The bucket can also host Dataproc initialization scripts and output. The bucket name must be globally unique. Qwiklabs has already created a bucket for you that has the same name as the Project ID, which is already globally unique.

- In the Console, on the Navigation menu click Storage > Browser. Verify that the bucket exists. Notice the default storage class and the location (region) of this bucket. You will be using this region information next.
- On the training_vm SSH terminal, set the BUCKET.
```
BUCKET=<bucket name>
```

- You will be creating a Dataproc cluster in a specific region. The Dataproc cluster and the bucket it will use for staging must be in the same region. Since the bucket you are using already exists, you will need to match the environment variable $MYREGION to the bucket region.
- You can find the region used by Qwiklabs on the Qwiklabs tab under Connection Details, labeled QL Region.
- The zone must be in the same region. $MYZONE will contain this value.
- You can find the zone used by Qwiklabs on the Qwiklabs tab under Connection Details, labeled QL Zone.
- On the training_vm SSH terminal, set the region and zone.
```
MYREGION=<region> 
# example region: us-central1
MYZONE=<zone>
# example zone: us-central1-a
```
- One environment variable that you will set is $PROJECT_ID that contains the Google Cloud project ID required to access billable resources.
- In the Console, on the Navigation menu, click Home. In the panel with Project Info, the Project ID is listed. You can also find this information in the Qwiklabs tab under Connection Details, where it is labeled GCP Project ID.
- On the training_vm SSH terminal, set the PROJECT_ID.
```
PROJECT_ID=<project ID>
```
- Find your computer's browser IP address by opening a browser window and viewing http://ip4.me/ Copy the IP address.
- Create an environment variable named BROWSER_IP.
```
BROWSER_IP=<your-browser-ip>
```
- In the training_vm SSH terminal window, copy the training repository into your home directory:
```
cd
cp -r /training/training-data-analyst .
ls
```

#### Task 2: Customize the Dataproc Initialization Action

- Review the cluster customization script.
```
cd ~/training-data-analyst/courses/unstructured/
cat init-script.sh
```
- Use nano or vi to edit the init-script.sh file:

```
#!/bin/bash

# install Google Python client on all nodes
apt-get update
apt-get install -y python-pip
pip install --upgrade google-api-python-client

# clone the training repository on the Master Node only
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
   git clone https://github.com/GoogleCloudPlatform/training-data-analyst
fi
```

#### Task 3: Create the Dataproc Cluster
- Verify that the Cloud Storage bucket exists and that the $BUCKET and $MYZONE environment variables are still set. The bucket will be used by the Dataproc cluster to stage files as the cluster initializes.
```
echo $BUCKET $MYREGION $MYZONE
echo $PROJECT_ID
# copy the customization script to the bucket
gsutil cp init-script.sh gs://$BUCKET
```
#### Note: Cloud Storage is a very sophisticated distributed and resilient data service that supports Spark RDDs. It is connected to Dataproc by a petabit bisection bandwidth network enabling the data to be processed from where it is located rather than needing to be copied. So you can use Cloud Storage instead of HDFS.
#### Because data in Cloud Storage survives cluster shutdown, if you used it instead of HDFS, you can terminate clusters when they are not being used to reduce the expense. You can schedule the cluster to terminate after it is idle for a period (when the jobs are done).
#### https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/scheduled-deletion


- In addition to the custom initialization script, you can use initialization scripts that have been predefined. The script located at: gs://dataproc-initialization-actions/datalab/datalab.sh installs Datalab on the Master Node. Datalab is a notebook-based development environment based on Jupyter notebooks.
- Notice that this cluster includes two preemptible worker nodes.
- Create the custom cluster:

```
gcloud dataproc clusters create cluster-custom \
--bucket $BUCKET \
--subnet default \
--zone $MYZONE \
--master-machine-type n1-standard-2 \
--master-boot-disk-size 100 \
--num-workers 2 \
--worker-machine-type n1-standard-1 \
--worker-boot-disk-size 50 \
--num-preemptible-workers 2 \
--image-version 1.2 \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--tags customaccess \
--project $PROJECT_ID \
--initialization-actions 'gs://'$BUCKET'/init-script.sh','gs://dataproc-initialization-actions/datalab/datalab.sh'
```

#### Options used in this command include security, cost-savings, and flexibility features.

- `--tags`: Applies a network tag so you can automate the creation of firewall rules.
- `--scopes`: Applies Cloud IAM restrictions and permissions to the cluster.
- `--num-preemptible-workers`: Controls the number of low cost worker nodes present.
- `--initialization-actions`: Customizes the software on the cluster.
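The value passed to `--initialization-actions` in the command above is a single comma-separated string: the shell concatenates the quoted pieces around `$BUCKET` into one script URI and appends the predefined Datalab script. A minimal sketch of building the same string in Python follows. The function name `initialization_actions` and the sample bucket name are hypothetical.

```python
DATALAB_SCRIPT = "gs://dataproc-initialization-actions/datalab/datalab.sh"


def initialization_actions(bucket, predefined=(DATALAB_SCRIPT,)):
    """Return the comma-separated value for --initialization-actions."""
    custom = f"gs://{bucket}/init-script.sh"
    return ",".join([custom, *predefined])


# Hypothetical bucket name for illustration.
print(initialization_actions("my-project-id"))
```

Building the value in one place makes it easier to add or remove scripts without editing the long `gcloud` command by hand.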

#### Options for further study:

- `--no-address`, `--network`, `--subnet`: VMs only have internal IPs for added security. Requires enabling Private Google Access on the network, establishing specific firewall rules, and passing the subnet.

https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/network

#### Task 4: Verify Cluster Customization

```
# verify browser IP address is set as env variable for use in firewall rule
echo $BROWSER_IP
# create firewall rule
gcloud compute \
--project=$PROJECT_ID \
firewall-rules create allow-custom \
--direction=INGRESS \
--priority=1000 \
--network=default \
--action=ALLOW \
--rules=tcp:9870,tcp:8088,tcp:8080 \
--source-ranges=$BROWSER_IP/32 \
--target-tags=customaccess
```
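The `--rules` value in the firewall command is a comma-separated list of protocol and port pairs. The sketch below builds that string from a small table of the ports mentioned in these labs, which also documents what each port is for. The names `PORTS` and `firewall_rules` are hypothetical.

```python
# Port purposes as described in the labs above.
PORTS = {
    9870: "Hadoop Admin interface",
    8088: "Hadoop Job Interface",
    8080: "Datalab",
}


def firewall_rules(ports, protocol="tcp"):
    """Format ports as the comma-separated value used by --rules."""
    return ",".join(f"{protocol}:{port}" for port in ports)


print(firewall_rules(PORTS))  # tcp:9870,tcp:8088,tcp:8080
```

Dropping a port from the table (for example 8080 when Datalab is not installed) narrows the rule accordingly.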
- Locate the Master Node External IP Address. In the Console, on the Navigation menu, click Dataproc > Clusters. Click on cluster-custom.
- Click on VM instances.
- Click on cluster-custom-m.
- In the Network Interfaces section, find the External IP. Highlight and copy it.
- Open a new browser tab or window. Enter <external IP>:8080 and press return.
- You should see the Google Cloud Datalab.
- Creating the custom cluster is the objective of this lab. If this was your production environment, your next steps might be:
    - Turn the create commands into a script so that you can start up a cluster on demand.
    - Add an option to the command to terminate the cluster after a quiet period.
    - Turn the firewall rule into a script so that you can enable/disable external (browser) access only when it is required for administration activities.
    - Develop and test your application in Datalab notebooks.
    - Host the production application in a Cloud Storage bucket and access your data in either Cloud Storage, BigQuery, or Bigtable.
    - To adjust capacity, edit the number of preemptible worker nodes in the Console; the running cluster will adapt.
    - Shut down the cluster when not in use, or schedule auto termination.
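
As a rough sketch of the first two next steps, the create command can be assembled in Python and run on demand. The function names, default worker counts, and the `my-bucket` value are illustrative assumptions. `--max-idle` is, to my understanding, the gcloud scheduled-deletion flag for idle clusters; confirm it with `gcloud dataproc clusters create --help` for your SDK version before relying on it.

```python
import subprocess


def build_create_command(cluster, bucket, workers=2, preemptible=2, max_idle=None):
    """Assemble the argument list for creating a customized cluster."""
    cmd = [
        "gcloud", "dataproc", "clusters", "create", cluster,
        "--num-workers={}".format(workers),
        "--num-preemptible-workers={}".format(preemptible),
        "--tags=customaccess",
        "--initialization-actions=gs://{}/init-script.sh".format(bucket),
    ]
    if max_idle:
        # Assumed flag: delete the cluster after this much idle time, e.g. "30m".
        cmd.append("--max-idle={}".format(max_idle))
    return cmd


def create_cluster(cmd):
    """Run the assembled command; raises CalledProcessError if gcloud fails."""
    subprocess.run(cmd, check=True)


# Print the command instead of running it, so it can be reviewed first.
print(" ".join(build_create_command("cluster-custom", "my-bucket", max_idle="30m")))
```

Keeping the command as a list rather than a single string avoids shell quoting problems when values contain special characters.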
    
### End Lab

### Module 3 Review

1.) Which of the following will you typically NOT use an initialization action script for?

- Change the number of workers in the cluster

#### Initialization scripts ARE used for:

- Copying custom configuration files
- Installing software libraries on the master and worker nodes

## Module 4: Analyzing Unstructured Data
### Infuse Your Business With Machine Learning
### Lab: Add Machine Learning
#### Objectives:
- Add Machine Learning (ML) to a Spark application

#### Task 1: Preparation of Dataproc clusters
- In the Console, on the Navigation menu, click Dataproc > Clusters.
- Locate the cluster named dataproc-cluster. Which region and zone is it located in? The region and zone have been selected automatically for you by Qwiklabs.
- Notice the Cloud Storage staging bucket defined for this cluster. This bucket has the same name as the project ID, which is a convenient way to make the name globally unique.
- Click on the name dataproc-cluster to go to the Cluster details page.
- The Cluster details page opens to the "Overview" tab. Click on the tab labeled "VM Instances".
- On the line for the VM named dataproc-cluster-m you will see that it has the Role of Master and there is an SSH link next to it. Click on SSH to open a terminal window to the Master Node.
#### Prepare API Key


- In the Console, on the Navigation menu, click APIs & Services > Credentials.
- Click on Create Credentials and select API Key
- Copy the API Key. In the terminal, create an environment variable for easy recall of the key.


```
APIKEY=<your-api-key>
```
#### Dataproc can use a Cloud Storage bucket to stage its files during initialization. You can use this bucket to stage application programs or data for use by Dataproc. The bucket can also host Dataproc initialization scripts and output. The bucket name must be globally unique. Qwiklabs has already created a bucket for you that has the same name as the Project ID, which is already globally unique.

- In the Console, on the Navigation menu, click Storage > Browser. Verify that the bucket exists. Notice the default storage class and the location (region) of this bucket. You will be using this region information next.
- On the training_vm SSH terminal, set the BUCKET.
```
BUCKET=<bucket name>
```
- One environment variable that you will set is $DEVSHELL_PROJECT_ID, which contains the Google Cloud project ID required to access billable resources.
- In the Console, on the Navigation menu, click Home. In the panel with Project Info, the Project ID is listed. You can also find this information in the Qwiklabs tab under Connection Details, where it is labeled GCP Project ID.
- On the training_vm SSH terminal, set the DEVSHELL_PROJECT_ID.
```
DEVSHELL_PROJECT_ID=<project ID>
```
- Verify that these environment variables are set. Do not proceed until they are set:

```
echo $DEVSHELL_PROJECT_ID, $BUCKET, $APIKEY
export DEVSHELL_PROJECT_ID
export BUCKET
export APIKEY

# Copy application files to training_vm home directory
cd
cp -r /training/training-data-analyst .
ls
cd ~/training-data-analyst/courses/unstructured/
# run staging script
./stagelabs.sh
```
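
The same "do not proceed until they are set" check can be done from Python, for example inside a notebook. This is a minimal sketch; the helper name `missing_variables` is an assumption for illustration, and it treats an empty value the same as an unset one.

```python
import os

REQUIRED = ("DEVSHELL_PROJECT_ID", "BUCKET", "APIKEY")


def missing_variables(names, environ):
    """Return the names that are unset or empty in the given mapping."""
    return [name for name in names if not environ.get(name)]


missing = missing_variables(REQUIRED, os.environ)
if missing:
    print("Not set: " + ", ".join(missing))
else:
    print("All required variables are set")
```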

```
#!/bin/bash
# Go to the standard location
cd ~/training-data-analyst/courses/unstructured/
# "If at first you don't succeed, try, try again."
#   If this is our first time here, backup the program files
#   If this is a subsequent run, restore fresh from backup before proceeding
#
if [ -d "backup" ]; then
  cp backup/*dataproc* .
else
  mkdir backup
  cp *dataproc* backup
fi
# Verify that the environment variables exist (and are not empty)
#
OKFLAG=1
if [[ -z "$BUCKET" ]]; then
  echo "BUCKET environment variable not found"
  OKFLAG=0
fi
if [[ -z "$DEVSHELL_PROJECT_ID" ]]; then
  echo "DEVSHELL_PROJECT_ID environment variable not found"
  OKFLAG=0
fi
if [[ -z "$APIKEY" ]]; then
  echo "APIKEY environment variable not found"
  OKFLAG=0
fi
# Only stage the files when every variable was found
if [ "$OKFLAG" -eq 1 ]; then
  # Edit the script files
  sed -i "s/your-api-key/$APIKEY/" *dataprocML.py
  sed -i "s/your-project-id/$DEVSHELL_PROJECT_ID/" *dataprocML.py
  sed -i "s/your-bucket/$BUCKET/" *dataprocML.py
  # Copy python scripts to the bucket
  gsutil cp *dataprocML.py gs://$BUCKET/
  # Copy data to the bucket
  gsutil cp gs://cloud-training/gcpdei/road* gs://$BUCKET/sampledata/
  gsutil cp gs://cloud-training/gcpdei/time* gs://$BUCKET/sampledata/
fi
```

This is what the staging script does:
- Edits the three Python scripts 01-dataprocML.py, 02-dataprocML.py, and 03-dataprocML.py, replacing the API key, project ID, and bucket placeholders with the values from the exported environment variables.
- Copies the updated files to your bucket in Cloud Storage, so that Dataproc can access them.
- Copies sample data files to your bucket.

Verify that the PySpark application files and sample data files are in the bucket:
- In the Console, on the Navigation menu, click Storage > Browser.
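
The placeholder substitution that the staging script performs with `sed` can be sketched in Python as follows. The placeholder strings come from the script; the function name `fill_placeholders`, the template text, and the sample values are assumptions for illustration. Unlike `sed`, `str.replace` treats both the placeholder and the value literally, so characters such as `/` or `&` in a value need no escaping.

```python
def fill_placeholders(source, api_key, project_id, bucket):
    """Replace the first occurrence of each placeholder on every line.

    This mirrors sed 's/placeholder/value/' without the g flag.
    """
    replacements = (
        ("your-api-key", api_key),
        ("your-project-id", project_id),
        ("your-bucket", bucket),
    )
    lines = []
    for line in source.splitlines(True):  # keep the line endings
        for placeholder, value in replacements:
            line = line.replace(placeholder, value, 1)
        lines.append(line)
    return "".join(lines)


# Hypothetical template resembling the settings block of the lab scripts.
template = 'APIKEY="your-api-key"\nPROJECT_ID="your-project-id"\nBUCKET="your-bucket"\n'
print(fill_placeholders(template, "abc123", "my-project", "my-bucket"))
```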


#### Task 2: Natural Language Processing

#### The three programs are "snapshots" from a development process. Each program builds on and enhances the one before it. Examining and running each program shows you how to progressively develop a Dataproc/Spark + Machine Learning application.
#### The sample data is unstructured data. That is, it either lacks structure, or it has a structure that is not suited to the intended purpose. In this lab you will use Machine Learning to identify and associate the data with values, giving it structure and making the data useful.

```
cd ~/training-data-analyst/courses/unstructured/
```

- Examine 01-dataprocML.py using an editor such as nano. Don't make any changes to the file.
- This program is just a Python program. It will run on Dataproc, but it does not make use of any of the big data features. The program creates a sample line of text in memory and then passes it to the Natural Language Processing service for Sentiment Analysis.
- The function SentimentAnalysis() is a wrapper around the REST API. This code creates the structured format of the request and passes the request along with the API Key.
- Why is the output printed using json.dumps?
- You could do post-processing of the returned data using Python.
- The stagelabs.sh script you ran in Task 1 should have replaced the DEVSHELL_PROJECT_ID, BUCKET, and APIKEY with your information from the environment variables.
- Run the application.
- In the Console, on the Navigation menu, click Dataproc > Jobs. Then click SUBMIT JOB.
- You will need to select the region where your cluster is located, and the cluster, dataproc-cluster. The Job Type is PySpark.
- In the field for Main python file, enter the path to the application file, which is something like this: `gs://<bucket name>/01-dataprocML.py`, where you replace `<bucket name>` with your bucket name.
- Click Submit. View the output.
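
On the `json.dumps` question: the client library returns a nested Python dictionary, and `json.dumps` with `indent` and `sort_keys` renders it in a readable, stable layout. The sketch below uses a hand-written stand-in shaped like a sentiment response, not real service output; the field values are made up and the helper name `build_request_body` is an assumption.

```python
import json


def build_request_body(text):
    """Build the request body that the wrapper sends for plain text."""
    return {"document": {"type": "PLAIN_TEXT", "content": text}}


# Hypothetical response; the numbers are invented for illustration.
sample_response = {
    "language": "en",
    "documentSentiment": {"magnitude": 0.5, "score": 0.5},
    "sentences": [
        {
            "text": {"content": "There are places I remember.", "beginOffset": -1},
            "sentiment": {"magnitude": 0.5, "score": 0.5},
        }
    ],
}

print(type(sample_response))
print(json.dumps(sample_response, sort_keys=True, indent=4))
```

Printing the dictionary directly would put everything on one line, which makes the nesting of `sentences`, `text`, and `sentiment` hard to see.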
    
#### Task 3: Load Sample Data

- In the terminal, enter the following commands to copy sample files to the Cloud Storage bucket.

```
gsutil cp /training/road-not-taken.txt gs://$BUCKET/sampledata/road-not-taken.txt
```
- In the Console, on the Navigation menu, click Storage > Browser.
- Click on your bucket.
- Click on sampledata
- Some files have already been staged.

#### Task 4: Testing Sentiment Analysis with Spark

- Examine 02-dataprocML.py using an editor such as nano. Don't make any changes to the file.
- This program uses Spark RDDs. It reads a small sample file and passes it to the Natural Language Processing service for Sentiment Analysis.
- Post-processing of the returned data is done in the pipeline using transformations.
- In the Console, on the Navigation menu, click Dataproc > Jobs. Then click SUBMIT JOB.
- You will need to select the region where your cluster is located, and the cluster, dataproc-cluster. The Job Type is PySpark.
- In the field for Main python file, enter the path to the application file, which is something like this: `gs://<bucket name>/02-dataprocML.py`, where you replace `<bucket name>` with your bucket name.
- Click Submit. View the output.
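
The Console steps above have a command-line counterpart in `gcloud dataproc jobs submit pyspark`. As a small sketch, the argument list can be assembled in Python; the function name and the `my-bucket` and `us-central1` values are placeholders, so substitute your own bucket and the region of your cluster.

```python
def build_submit_command(bucket, script, cluster, region):
    """Assemble the argument list for submitting a PySpark job."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark",
        "gs://{}/{}".format(bucket, script),
        "--cluster={}".format(cluster),
        "--region={}".format(region),
    ]


# Print the command for review; pass the list to subprocess.run to execute it.
print(" ".join(build_submit_command(
    "my-bucket", "02-dataprocML.py", "dataproc-cluster", "us-central1")))
```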

#### Task 5: Doing Something Useful

- Examine 03-dataprocML.py using an editor such as nano. Don't make any changes to the file.

- This program builds on the previous one. Instead of reading a poem it is going to read an entire book. However, it could just as easily read and process an entire library.
- It adds a filter (in the Spark pipeline) and a sort (in Python).
- This gives a list of the lines in the book with the strongest sentiment, both positive and negative.
- Now this was just a book. Imagine how you could use this to sort through social media commentary. For example, consider the feedback left by customers on a shopping website. You could use this kind of data analysis to identify the most admired and most despised products.
- Run the application.
- In the Console, on the Navigation menu, click Dataproc > Jobs. Then click SUBMIT JOB.
- You will need to select the region where your cluster is located, and the cluster, dataproc-cluster. The Job Type is PySpark.
- In the field for Main python file, enter the path to the application file, which is something like this: `gs://<bucket name>/03-dataprocML.py`, where you replace `<bucket name>` with your bucket name.

### End Lab

### Module 4 Review

1.) Which (one) of these is NOT a good use case for a ML API?
- Identify images where your product is shown upside down (requires domain knowledge & orientation)

ML API is GOOD for: 
- Identifying objects in images
- Translation of text
- Transcribing audio to text

#### Example Dataproc ML Script [1]

```
#!/usr/bin/env python
#! 01-dataprocML.py
# Copyright 2018 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
'''
  This program takes a sample line of text and passes it to a Natural Language Processing
  service, sentiment analysis, and processes the results in Python.
'''
from __future__ import print_function  # print() then works on both Python 2 and 3
import logging
import argparse
import json
import os
from googleapiclient.discovery import build
from pyspark import SparkContext
sc = SparkContext("local", "Simple App")
'''
You must set these values for the job to run.
'''
APIKEY="removed_API_key"   # CHANGE
print(APIKEY)
PROJECT_ID="qwiklabs-gcp-40459444aeb2e780"  # CHANGE
print(PROJECT_ID)
BUCKET="qwiklabs-gcp-40459444aeb2e780"   # CHANGE
## Wrappers around the NLP REST interface
def SentimentAnalysis(text):
    from googleapiclient.discovery import build
    lservice = build('language', 'v1beta1', developerKey=APIKEY)
    response = lservice.documents().analyzeSentiment(
        body={
            'document': {
                'type': 'PLAIN_TEXT',
                'content': text
            }
        }).execute()
    return response
## main
sampleline = u'There are places I remember, all my life though some have changed.'
#
# Calling the Natural Language Processing REST interface
#
results = SentimentAnalysis(sampleline)
#
#  What is the service returning?
#
print("Function returns: ", type(results))
print(json.dumps(results, sort_keys=True, indent=4))
```

#### Example Dataproc ML Script [2]

This program uses Spark RDDs. 

It reads a small sample file and passes it to the Natural Language Processing service for Sentiment Analysis.
Post-processing of the returned data is done in the pipeline using transformations.

```
#!/usr/bin/env python
#! 02-dataprocML.py
# Copyright 2018 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
'''
  This program reads a text file and passes it to a Natural Language Processing
  service, sentiment analysis, and processes the results in Spark.
'''
from __future__ import print_function  # print() then works on both Python 2 and 3
import logging
import argparse
import json
import os
from googleapiclient.discovery import build
from pyspark import SparkContext
sc = SparkContext("local", "Simple App")
'''
You must set these values for the job to run.
'''
APIKEY="removed_API_key"   # CHANGE
print(APIKEY)
PROJECT_ID="qwiklabs-gcp-40459444aeb2e780"  # CHANGE
print(PROJECT_ID)
BUCKET="qwiklabs-gcp-40459444aeb2e780"   # CHANGE
## Wrappers around the NLP REST interface
def SentimentAnalysis(text):
    from googleapiclient.discovery import build
    lservice = build('language', 'v1beta1', developerKey=APIKEY)
    response = lservice.documents().analyzeSentiment(
        body={
            'document': {
                'type': 'PLAIN_TEXT',
                'content': text
            }
        }).execute()
    return response
## main
# We could use sc.textFile(...)
#
#   However, that will read each line of text as a separate object.
#   And using the REST API to NLP for each line will rapidly exhaust the rate-limit quota
#   producing HTTP 429 errors
#
#   Instead, it is more efficient to pass an entire document to NLP in a single call.
#
#   So we are using sc.wholeTextFiles(...)
#
#      This provides a file as a tuple.
#      The first element is the file pathname, and second element is the content of the file.
#
sample = sc.wholeTextFiles("gs://{0}/sampledata/road-not-taken.txt".format(BUCKET))
# Calling the Natural Language Processing REST interface
#
rdd1 = sample.map(lambda x: SentimentAnalysis(x[1]))
rdd2 =  rdd1.flatMap(lambda x: x['sentences'] )\
            .flatMap(lambda x: [(x['sentiment']['magnitude'], x['sentiment']['score'], [x['text']$
results = rdd2.take(50)
for item in results:
  print('Magnitude= ',item[0],' | Score= ',item[1], ' | Text= ',item[2])
```
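
The two `flatMap` steps turn each service response into one tuple per sentence. A plain-Python sketch of that reshaping, with no Spark required, is shown here. It assumes each sentence carries its text under `text` and then `content`, which matches my understanding of the Natural Language API sentence object but is not visible in the listing, where that line is cut off. The sample responses and their numbers are invented, and `sentence_tuples` is a hypothetical helper name.

```python
def sentence_tuples(response):
    """Yield (magnitude, score, [text]) for every sentence in one response."""
    for sentence in response.get("sentences", []):
        sentiment = sentence["sentiment"]
        # Assumption: the sentence text lives under text -> content.
        yield (sentiment["magnitude"], sentiment["score"],
               [sentence["text"]["content"]])


# Hypothetical responses standing in for what rdd1 would hold.
responses = [
    {"sentences": [
        {"text": {"content": "Two roads diverged in a yellow wood,"},
         "sentiment": {"magnitude": 0.2, "score": -0.2}},
        {"text": {"content": "And that has made all the difference."},
         "sentiment": {"magnitude": 0.8, "score": 0.8}},
    ]},
]

# Flatten every response into a single list of tuples, as flatMap does.
rows = [row for response in responses for row in sentence_tuples(response)]
for magnitude, score, text in rows:
    print("Magnitude=", magnitude, "| Score=", score, "| Text=", text)
```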

#### Example Dataproc ML Script [3]

This program builds on the previous one. Instead of reading a poem it is going to read an entire book. However, it could just as easily read and process an entire library.

It adds a filter (in the Spark pipeline) and a sort (in Python).

This gives a list of the lines in the book with the strongest sentiment, both positive and negative.

Now this was just a book. Imagine how you could use this to sort through social media commentary. For example, consider the feedback left by customers on a shopping website. You could use this kind of data analysis to identify the most admired and most despised products.

```
#!/usr/bin/env python
#! 03-dataprocML.py
# Copyright 2018 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
'''
  This program reads a text file and passes it to a Natural Language Processing
  service, sentiment analysis, and processes the results in Spark.
'''
from __future__ import print_function  # print() then works on both Python 2 and 3
import logging
import argparse
import json
import os
from googleapiclient.discovery import build
from pyspark import SparkContext
sc = SparkContext("local", "Simple App")
'''
You must set these values for the job to run.
'''
APIKEY="removed_API_key"   # CHANGE
print(APIKEY)
PROJECT_ID="qwiklabs-gcp-40459444aeb2e780"  # CHANGE
print(PROJECT_ID)
BUCKET="qwiklabs-gcp-40459444aeb2e780"   # CHANGE
## Wrappers around the NLP REST interface
def SentimentAnalysis(text):
    from googleapiclient.discovery import build
    lservice = build('language', 'v1beta1', developerKey=APIKEY)
    response = lservice.documents().analyzeSentiment(
        body={
            'document': {
                'type': 'PLAIN_TEXT',
                'content': text
            }
        }).execute()
    return response
## main
# We could use sc.textFile(...)
#
#   However, that will read each line of text as a separate object.
#   And using the REST API to NLP for each line will rapidly exhaust the rate-limit quota
#   producing HTTP 429 errors
#
#   Instead, it is more efficient to pass an entire document to NLP in a single call.
#
#   So we are using sc.wholeTextFiles(...)
#
#      This provides a file as a tuple.
#      The first element is the file pathname, and second element is the content of the file.
#
sample = sc.wholeTextFiles("gs://{0}/sampledata/time-machine.txt".format(BUCKET))
# Calling the Natural Language Processing REST interface
#
# results = SentimentAnalysis(sampleline)
rdd1 = sample.map(lambda x: SentimentAnalysis(x[1]))
# The RDD contains a dictionary, using the key 'sentences' picks up each individual sentence
# The value that is returned is a list. And inside the list is another dictionary
# The key 'sentiment' produces a value of another dictionary.
# And the keys magnitude and score produce values of floating numbers.
#
rdd2 =  rdd1.flatMap(lambda x: x['sentences'] )\
            .flatMap(lambda x: [(x['sentiment']['magnitude'], x['sentiment']['score'], [x['text']$
# First item in the list tuple is magnitude
# Filter on only the statements with the most intense sentiments
#
rdd3 =  rdd2.filter(lambda x: x[0]>.75)
results = sorted(rdd3.take(50))
print('\n\n')
for item in results:
  print('Magnitude= ',item[0],' | Score= ',item[1], ' | Text= ',item[2],'\n')
```
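
The filter and sort at the end of the script can be tried without Spark. This is a minimal sketch on invented tuples; the function name `strongest` and the sample rows are assumptions for illustration. It keeps the same order of operations as the script: filter on magnitude, take the first 50 matches, then sort them in Python.

```python
# Hypothetical (magnitude, score, [text]) tuples standing in for rdd2.
rows = [
    (0.9, -0.9, ["hypothetical sentence A"]),
    (0.1, 0.1, ["hypothetical sentence B"]),
    (0.8, 0.8, ["hypothetical sentence C"]),
    (0.75, 0.0, ["hypothetical sentence D"]),
]


def strongest(rows, threshold=0.75, limit=50):
    """Keep rows whose magnitude exceeds threshold, cap at limit, then sort."""
    kept = [row for row in rows if row[0] > threshold]  # like rdd2.filter(...)
    return sorted(kept[:limit])                          # like sorted(rdd3.take(50))


for magnitude, score, text in strongest(rows):
    print("Magnitude=", magnitude, "| Score=", score, "| Text=", text)
```

Because the cap is applied before the sort, the result is the first 50 matching sentences in sorted order, not necessarily the 50 most intense ones. If the latter is wanted, Spark's `RDD.top` or `RDD.takeOrdered` would be the usual choice.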