# GCP Professional Data Engineer
### Leveraging Unstructured Data with Cloud Dataproc on Google Cloud Platform
#### Modules:
- Introduction to Cloud Dataproc
- Running Dataproc Jobs
- Leveraging GCP
- Analyzing Unstructured Data

## Module 1: Introduction to Cloud Dataproc
### Introducing Cloud Dataproc
- Cloud Dataproc is a fast, easy-to-use, fully-managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way. Operations that used to take hours or days take seconds or minutes instead, and you pay only for the resources you use (with per-second billing). Cloud Dataproc also easily integrates with other Google Cloud Platform (GCP) services, giving you a powerful and complete platform for data processing, analytics and machine learning.

### Why Unstructured Data
- About 90% of the data a business collects tends to be unstructured. This includes:
    - Emails, reviews, text, etc.
- Consider the Google Street View initiative: cars gathered street-level photos with no immediate payoff, yet that imagery is now the foundation of one of the most useful datasets available for autonomous cars.
    
### Why Cloud Dataproc
- Horizontal vs vertical scaling
- Hadoop origins:
    - Hadoop is an open-source framework based on Google's whitepapers describing MapReduce and the Google File System. Its storage layer is the Hadoop Distributed File System (HDFS). Spark is a processing framework that runs on the cluster and takes advantage of the distributed data to process tasks efficiently.
- Running an onsite Hadoop cluster is costly and often inefficient.
- Additional benefits:
    - Stateless clusters in <90 seconds
    - Supports Hadoop, Spark, Pig, Hive
    - High-level APIs for job submission
    - Connectors for Bigtable, BigQuery, and Cloud Storage
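
To make the MapReduce idea behind Hadoop concrete, here is a minimal single-process sketch of the three phases (map, shuffle, reduce) applied to a word count. It is an illustration only, not Hadoop's implementation; the function names and the sample records are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    """Run the three phases in sequence on an in-memory list of records."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(map_reduce(["the cat", "the dog"]))
```

On a real cluster the map and reduce calls run in parallel on different worker nodes, which is what makes horizontal scaling possible.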

### Lab: Create a Dataproc Cluster

#### Objectives:
- Prepare a bucket for cluster initialization
- Create a Dataproc Hadoop Cluster customized to use the Google Cloud API
- Enable secure access to the Dataproc cluster
- Explore Hadoop operations

#### Task 1: Prepare Environment Variables
- In the Console, on the Navigation menu, click Compute Engine > VM instances.
- Locate the line with the instance called training_vm.
- On the far right, under Connect, click SSH to open a terminal window.
- In this lab you will enter CLI commands on the training_vm.
##### Create the source file for setting and resetting environment variables

- In the training_vm SSH terminal window, using your preferred command line editor, create and edit a file named myenv to hold your environment variables.
- One environment variable that you will set is PROJECT_ID, which contains the Google Cloud project ID required to access billable resources.
- In the Console, on the Navigation menu, click Home. In the Project Info panel, the Project ID is listed. You can also find this information in the Qwiklabs tab under Connection Details, where it is labeled GCP Project ID.
- Add the PROJECT_ID environment variable to myenv for easy reference.
- Dataproc can use a Cloud Storage bucket to stage its files during initialization. You can use this bucket to stage application programs or data for use by Dataproc. The bucket can also host Dataproc initialization scripts and output. The bucket name must be globally unique. Qwiklabs has already created a bucket for you that has the same name as the Project ID, which is already globally unique.
- In the Console, on the Navigation menu, click Storage > Browser. Verify that the bucket exists. Notice the default storage class and the location (region) of this bucket. You will be using this region information next.
- Add a line to myenv to create an environment variable named BUCKET.
- You can use $BUCKET in CLI commands. If you need to enter the bucket name <your-bucket> in a text field in the Console, you can quickly retrieve the name with echo $BUCKET.
- You will be creating a Dataproc cluster in a specific region. The Dataproc cluster and the bucket it will use for staging must be in the same region. Since the bucket you are using already exists, you will need to match the environment variable MYREGION to the bucket region.
- You can find the region used by Qwiklabs on the Qwiklabs tab under Connection Details, labeled QL Region.
- The zone must be in the same region. The environment variable MYZONE will contain this value.
- You can find the zone used by Qwiklabs on the Qwiklabs tab under Connection Details, labeled QL Zone.
- Add the MYREGION and MYZONE environment variables to myenv for easy reference.
- You will use the browser IP address to enable your local browser to reach the Dataproc cluster.
- Find your computer's browser IP address by opening a browser window and viewing http://ip4.me/ and copy the IP address.
- Add a line to myenv to create an environment variable named BROWSER_IP.
- After you have added all of the definitions to myenv and saved the file, use the source command to create the environment variables.

```
# create the file that holds the environment variables
cd ~
nano myenv

# contents of myenv (replace each placeholder with your own value)
PROJECT_ID=<project ID>
BUCKET=<project ID>
MYREGION=<region>
MYZONE=<zone>
BROWSER_IP=<your-browser-ip>

# set environment variables
source myenv
# verify variables are set
echo $PROJECT_ID
echo $MYREGION $MYZONE
echo $BUCKET
echo $BROWSER_IP
```
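
The constraint that the zone must belong to the chosen region can be checked before the variables are used. The sketch below parses KEY=VALUE text in the style of myenv and reports missing variables or a mismatched zone. It relies on the naming convention that a Compute Engine zone is its region name plus a letter suffix (for example us-central1-a); the function names and the sample values are hypothetical.

```python
REQUIRED = ("PROJECT_ID", "BUCKET", "MYREGION", "MYZONE", "BROWSER_IP")


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def zone_in_region(zone, region):
    """A zone name is the region name followed by '-' and a letter."""
    return zone.rsplit("-", 1)[0] == region


def check_env(env):
    """Return a list of problems; an empty list means the file looks usable."""
    problems = [f"missing {name}" for name in REQUIRED if not env.get(name)]
    region, zone = env.get("MYREGION"), env.get("MYZONE")
    if region and zone and not zone_in_region(zone, region):
        problems.append("MYZONE is not in MYREGION")
    return problems


if __name__ == "__main__":
    # Hypothetical sample values, not taken from the lab.
    sample = """
    PROJECT_ID=my-project
    BUCKET=my-project
    MYREGION=us-central1
    MYZONE=us-central1-a
    BROWSER_IP=203.0.113.7
    """
    print(check_env(parse_env(sample)))
```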

#### Task 2. Create a Dataproc Cluster
- In the Console, on the Navigation menu, click Dataproc > Clusters.
- Click Create Cluster.
- Specify the following, and leave the remaining settings as their defaults:
    - Name: cluster-dataproc
    - Region: <myregion>
    - Zone: <myzone>
    - Cluster mode: Standard (1 master, n workers)
    - (Master node) Machine type: n1-standard-2
    - (Master node) Primary disk size: 100GB
    - (Worker node) Machine type: n1-standard-1
    - (Worker node) Primary disk size: 50GB
    - Nodes: 3
- Click on Preemptible workers, bucket, network, version, initialization, & access options.
- Specify the following, and leave the remaining settings as their defaults:
    - Network tags: hadoopaccess
    - Cloud Storage staging bucket: <your bucket>
    - Image version: 1.2
    - Project access: Allow API access
- Click Create.
- The cluster will take several minutes to become operational. In the Console, on the Navigation menu, click Dataproc > Clusters.
- Click on your cluster, cluster-dataproc. Then click on the VM Instances tab. The instances will become operational before the Hadoop software has completed initialization. When a checkmark in a green circle appears next to the name of the cluster, it is operational.
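
The same cluster could plausibly be created from the command line instead of the Console. The sketch below only builds and prints a `gcloud dataproc clusters create` command; it does not run it. The mapping from Console fields to flags is an assumption of this example (in particular, "Nodes: 3" is read as three workers and "Allow API access" as the cloud-platform scope), and the argument values in the usage lines are placeholders.

```python
import shlex


def create_cluster_command(name, region, zone, bucket):
    """Build the gcloud argument list that mirrors the Console settings above."""
    return [
        "gcloud", "dataproc", "clusters", "create", name,
        f"--region={region}",
        f"--zone={zone}",
        "--master-machine-type=n1-standard-2",
        "--master-boot-disk-size=100GB",
        "--worker-machine-type=n1-standard-1",
        "--worker-boot-disk-size=50GB",
        "--num-workers=3",            # assumption: "Nodes: 3" means three workers
        "--tags=hadoopaccess",
        f"--bucket={bucket}",
        "--image-version=1.2",
        "--scopes=cloud-platform",    # assumption: equivalent of "Allow API access"
    ]


if __name__ == "__main__":
    # Placeholder values; substitute your own region, zone, and bucket.
    command = create_cluster_command("cluster-dataproc", "us-central1", "us-central1-a", "my-bucket")
    print(shlex.join(command))
```

Building the command as a list keeps each argument separate, so it can be passed to `subprocess.run` without shell quoting issues if you decide to execute it.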

#### Task 3: Enable secure access to Dataproc cluster
- Create a firewall rule that allows access only to the Master Node from your computer's IP address. Only ports 8088 (Hadoop Job Interface) and 9870 (Hadoop Admin interface) will be permitted.
- Port 8042 is the web UI for the node manager on the worker nodes and port 8080 is the default port for Datalab. Datalab is a notebook-based integrated development environment derived from Jupyter notebooks. It is a common tool for developing Dataproc applications. The Serverless Machine Learning on GCP course uses Datalab extensively.
- Recall your computer's browser IP address for use in Console.

    ```echo $BROWSER_IP```


- In the Console, on the Navigation menu, click VPC Network > Firewall rules.
- Click Create Firewall Rule.
- Specify the following, and leave the remaining settings as their defaults:
    - Name: allow-hadoop
    - Network: default
    - Priority: 1000
    - Direction of traffic: Ingress
    - Action on match: allow
    - Targets: specified target tags
    - Target tags: hadoopaccess
    - Source IP ranges: <yourIP>/32
    - Specified ports and protocols: tcp:9870;tcp:8088
- Click Create.
    
- Verify that the network tag "hadoopaccess" is set on the Master Node. That will apply the firewall rule to the Master Node, giving your laptop access to it.
- In the Console, on the Navigation menu, click Compute Engine > VM Instances.
- Click on the Master Node, cluster-dataproc-m.
- Verify that under Network Tags it lists hadoopaccess.
- If the tag is not there, click EDIT.
- Under Network Tags add the tag: hadoopaccess
- Click Save.
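
To see how the three conditions in the rule combine (source range, target tag, and port), here is a small model of the matching logic using Python's standard `ipaddress` module. It is a simplified illustration and not how the VPC firewall is implemented; the class name and the sample address 203.0.113.7 (from a range reserved for documentation) are assumptions.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressRule:
    """A simplified allow rule: one source range, one target tag, a set of TCP ports."""
    source_range: str
    target_tag: str
    allowed_tcp_ports: frozenset

    def allows(self, source_ip, instance_tags, port):
        """Traffic is allowed only when all three conditions match."""
        in_range = ipaddress.ip_address(source_ip) in ipaddress.ip_network(self.source_range)
        return in_range and self.target_tag in instance_tags and port in self.allowed_tcp_ports


if __name__ == "__main__":
    rule = IngressRule("203.0.113.7/32", "hadoopaccess", frozenset({8088, 9870}))
    print(rule.allows("203.0.113.7", {"hadoopaccess"}, 8088))   # matching IP, tag, and port
    print(rule.allows("203.0.113.8", {"hadoopaccess"}, 8088))   # different source IP
    print(rule.allows("203.0.113.7", set(), 9870))              # instance without the tag
```

The /32 suffix restricts the range to exactly one address, which is why the rule admits only your browser's IP.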

#### Task 4: Explore Hadoop Operations
- In the Console, on the Navigation menu, click Compute Engine > VM Instances.
- In the list of VM instances, in the row for cluster-dataproc-m, highlight the External IP and copy it.
- Open a new browser tab or window and paste the External IP. Add ":8088" after the IP and press enter. Example: <External IP>:8088. The web page displayed is the Hadoop Applications interface.
- Open a new browser tab or window. Paste the External IP. Add ":9870" after the IP and press return. Example: <External IP>:9870. The web page displayed is the Hadoop Administration interface.
- Click on the Datanodes tab. This will show you how much HDFS (Hadoop Distributed File System) capacity is being used on the worker nodes and how much capacity remains.
- Click on Utilities > Logs. This shows you the Hadoop log files for each node in the cluster. This is where you can go to investigate problems with Hadoop. Use your browser's back button to return to the Hadoop Administration interface.
- Click on Utilities > Browse the file system. After a few moments the file system will appear in the browser page. You can use this to navigate the file system. In the row where Owner is hdfs and Group is hadoop, click on the link that says user. Here you can see directories for all the Hadoop services.
- Leave the Hadoop Applications interface (<External IP>:8088) and the Hadoop Administration interface (<External IP>:9870) tabs or windows open. You will use them in the next task.


### Module 1 Review

1.) Which of the following statements are true about Cloud Dataproc?
- Lets you run Spark and Hadoop clusters with minimal administration
- Helps you create job-specific clusters without HDFS

2.) Matching definitions:
- Zone: determines the Google data center where compute nodes will be
- Preemptible: costs less but may not always be available
- Standard cluster mode: provides 1 master and n workers

## Module 2: Running Dataproc Jobs
### Running Jobs
- Secure Shell (SSH) is a cryptographic network protocol for operating network services securely over an unsecured network. The standard TCP port for SSH is 22. The best known example application is for remote login to computer systems by users.

### Lab: Work with structured and semi-structured Data
#### Objectives:
- Use the Hive CLI and run a Pig job
- Hive is used for structured data, similar to SQL
- Pig is used for semi-structured data, similar to SQL + scripting

#### Task 1: Preparation
- A Dataproc cluster has been prepared for you. If you log in to GCP before the progress bar reports that the "Lab is Running", you may have to wait several minutes for the cluster to transition from "Provisioning" to "Running" before the cluster completes setup.
- You will be performing most of the lab steps from the Master Node of the cluster in an SSH terminal window.
- In the Console, on the Navigation menu, click Dataproc > Clusters.
- Locate the cluster named dataproc-cluster. Which region and zone is it located in? The region and zone have been selected automatically for you by Qwiklabs.
- Notice the Cloud Storage staging bucket defined for this cluster. This bucket has the same name as the project ID, which is a convenient way to make the name globally unique.
- Click on the name dataproc-cluster to go to the Cluster details page.
- The Cluster details page opens to the "Overview" tab. Click on the tab labeled "VM Instances".
- Open the Master Node terminal
- On the line for the VM named dataproc-cluster-m you will see that it has the Role of Master and there is an SSH link next to it. Click on SSH to open a terminal window to the Master Node.

#### Task 2. Enable secure web access to the Dataproc cluster
- Create a restrictive firewall rule using Target tags, IP address, and protocol
- Create a firewall rule that allows access only to the Master Node from your computer's IP address. Only ports 8088 (Hadoop Job Interface) and 9870 (Hadoop Admin interface) will be permitted.
- Verify that the network tag is set on the Master Node
- Verify that the network tag "hadoopaccess" is set on the Master Node. That will apply the firewall rule to the Master Node, giving your laptop access to it.
- In the Console, on the Navigation menu, click Compute Engine > VM Instances.
- Click on the Master Node, dataproc-cluster-m.
- Verify that under Network Tags it lists hadoopaccess.
- If the tag is not there, click EDIT.
- Under Network Tags add the tag: hadoopaccess
- Click Save.
- Identify the browser IP address
- You will use the browser IP address to allow your local browser to connect to the Dataproc cluster.
- Find your computer's browser IP address by opening a browser window and viewing http://ip4.me/ and copy the IP address.
- Create the firewall rule
- In the Console, on the Navigation menu, click VPC Network > Firewall rules.
- Click Create Firewall Rule.
- Specify the following, and leave the remaining settings as their defaults:
    - Name: allow-hadoop
    - Network: default
    - Priority: 1000
    - Direction of traffic: Ingress
    - Action on match: allow
    - Targets: specified target tags
    - Target tags: hadoopaccess
    - Source IP ranges: <yourIP>/32
    - Specified ports and protocols: tcp:9870;tcp:8088
- Click Create.
    
#### Task 3. Prepare the data for Hive
- Copy sample files to the Master node home directory
- The sample files you need have already been archived on the Master Node. You will need to copy them into your user directory with the following command.
- In the Master Node SSH terminal window:

```
cd
cp -r /training .
ls
cd training/training-data-analyst/courses/unstructured
ls pet*.*
# view structured data in text file
cat pet-details.txt
# stage data in HDFS
hadoop fs -mkdir /pet-details
hadoop fs -put pet-details.txt /pet-details
```

- In the Console, on the Navigation menu, click Compute Engine > VM Instances.
- In the list of VM instances, in the row for dataproc-cluster-m, highlight the External IP and copy it.
- Open a new browser tab or window and paste the External IP. Add ":9870" after the IP and press enter. Example: <External IP>:9870
- You should now see the Hadoop Administration interface. Under Utilities, click on Browse the file system. Click on the folder /pet-details.
- Notice that the file pet-details.txt is inside /pet-details.
- Leave the Hadoop Administration interface open. You will return to it in later steps.
    
#### Task 4. Explore Hive using the Hive interactive CLI
- Use Hive to access the data in HDFS as if it were in a database
- Hive provides a subset of SQL. The way it does this is by maintaining metadata to define a schema on top of the data. This is one way to work with a large amount of distributed data in HDFS using familiar SQL syntax.
- In the master node SSH window, make sure you are in the right directory and start the Hive CLI interpreter:

```
# start the Hive CLI interpreter from the shell
hive
-- the remaining statements are entered at the hive> prompt
CREATE DATABASE pets;
USE pets;
-- create table
CREATE TABLE details (Type String, Name String, Breed String, Color String, Weight Int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
SHOW TABLES;
DESCRIBE pets.details;
-- establish relationship between metadata schema and data in HDFS
LOAD DATA INPATH '/pet-details/pet-details.txt' OVERWRITE INTO TABLE details;
-- verify that everything is working
SELECT * FROM pets.details;
-- quit the Hive interpreter
quit;
```


- Use the Hadoop Administration interface to see how hive works
- Hive ingested the pet-details.txt file into a data warehouse format requiring a schema. You will use the Hadoop Administration interface to see this transformation.
- Return to the Hadoop Administration interface in the browser.
- Under Utilities, click on Browse the file system. Click on the folder /pet-details. The file pet-details.txt is gone.
- Under Utilities, click on Browse the file system. Then click on user > hive > warehouse > pets.db > details. The file pet-details.txt has been moved to this location.

***Note:*** Hive is designed for batch jobs and not for transactions. It ingests data into a data warehouse format requiring a schema. It does not support real-time queries, row-level updates, or unstructured data. Some queries may run much slower than others due to the underlying transformations Hive has to implement to simulate SQL.
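
The idea of laying a schema over a delimited text file can be sketched in a few lines of plain Python. The column names and types below follow the CREATE TABLE statement used in this task; the sample rows and the function name are hypothetical. A value that does not fit its column type becomes None, which mirrors how Hive returns NULL for such a field in a text-backed table.

```python
# Column names and types follow the CREATE TABLE statement in this task.
SCHEMA = [("Type", str), ("Name", str), ("Breed", str), ("Color", str), ("Weight", int)]


def read_with_schema(text, schema=SCHEMA, delimiter=","):
    """Apply the schema to each delimited line at read time."""
    rows = []
    for line in text.splitlines():
        if not line:
            continue
        row = {}
        for (name, cast), raw in zip(schema, line.split(delimiter)):
            try:
                row[name] = cast(raw)
            except ValueError:
                row[name] = None  # value does not fit the declared type
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Hypothetical sample rows, not the contents of pet-details.txt.
    sample = "Dog,Rex,Labrador,Black,30\nCat,Tom,Siamese,White,heavy\n"
    for row in read_with_schema(sample):
        print(row)
```

The data file itself is never rewritten; only the interpretation applied while reading changes, which is why the metadata can be defined after the data already exists.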

#### Task 5: Run a Pig job

- In the Master Node SSH window, view the Pig application:

```cat pet-details.pig```

- In line 'x1', the load statement in the application creates a schema on top of the HDFS data file. Lines 'x2' through 'x5' perform transformations on the data. And the last line stores the result in a folder called /GroupedByType in HDFS.
- The application expects to find the ingest file in HDFS in the directory /pet-details. Make another copy of the data at that location:

```hadoop fs -put pet-details.txt /pet-details```

- Run the application:

```pig < pet-details.pig```

- Return to the browser tab containing the Hadoop Applications interface and refresh it, or reopen it with <External-IP>:8088. Notice that Pig generated a Java MapReduce job which is running on the cluster. Click the browser refresh button to watch for job completion.
- Return to the browser tab containing the Hadoop Administration interface and refresh it, or reopen it with <External-IP>:9870. Under Utilities, click on Browse the file system. In the resulting list, click on GroupedByType. This is the output directory specified in the Pig application. The file named part-r-00000 is the HDFS file containing the output. You cannot view the contents from here. First, you must download that part to the local file system.
- Return to the SSH terminal on the Master Node, dataproc-cluster-m, make a local output directory, and retrieve the results from HDFS.

```
cd
mkdir output
cd output
hadoop fs -get /GroupedByType/part* .
# view results of the pig job
cat part-r-00000
```


***Note:*** Pig provides SQL primitives similar to Hive, but in a more flexible scripting language format. Pig can also deal with semi-structured data, such as data having partial schemas, or for which the schema is not yet known. For this reason it is sometimes used for Extract Transform Load (ETL). It generates Java MapReduce jobs. Pig is not designed to deal with unstructured data.
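
The load, transform, and group flow of a Pig script can be imitated in plain Python. This sketch is not the contents of pet-details.pig; the field names, the filter step, and the sample rows are assumptions chosen to show how records with a partial schema can still be processed step by step.

```python
from collections import defaultdict

FIELDS = ("type", "name", "breed", "color", "weight")


def load(text):
    """Attach field names to each comma-delimited line; short lines yield partial records."""
    return [dict(zip(FIELDS, line.split(","))) for line in text.splitlines() if line]


def keep_complete(records):
    """Drop records that are missing trailing fields."""
    return [record for record in records if "weight" in record]


def group_by_type(records):
    """Collect the names of all records that share the same type."""
    groups = defaultdict(list)
    for record in records:
        groups[record["type"]].append(record["name"])
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical sample; the last row has only three fields.
    sample = "Dog,Rex,Lab,Black,30\nCat,Tom,Siamese,White,4\nDog,Fido,Poodle\n"
    print(group_by_type(keep_complete(load(sample))))
```

Each function corresponds to one statement in a dataflow script: a named intermediate result feeds the next step, and only the final result is stored.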

### End Lab

### Separation of Storage & Compute
### Submitting Jobs
### Spark RDDs, Transformations, and Actions
### Lab: Working with Spark Jobs
### Module 2 Review

## Module 3: Leveraging GCP
### BigQuery Support
### Lab: Leverage GCP
#### Objectives:
- Explore Spark using PySpark jobs
- Using Cloud Storage instead of HDFS
- Run a PySpark application from Cloud Storage
- Using Python Pandas to add BigQuery to a Spark application

#### Task 1: Prepare the Master Node and the Bucket
- In the Console, on the Navigation menu, click Dataproc > Clusters.
- Locate the cluster named dataproc-cluster.
- Click on the name dataproc-cluster to go to the Cluster details page.
- The Cluster details page opens to the "Overview" tab. Click on the tab labeled "VM Instances".
- On the line for the VM named dataproc-cluster-m you will see that it has the Role of Master and there is an SSH link next to it. Click on SSH to open a terminal window to the Master Node.
- In the Master Node SSH terminal window, type:

```
cd
cp -r /training .
ls
```

#### Note: A Cloud Storage bucket has already been created for you. It has the same name as the Project ID. You will create an environment variable to make it easy to reference the bucket from the command line on the Master Node.

- In the Console, on the Navigation menu, click Storage > Browser and note the bucket name.
- In the Master Node SSH terminal window:

```
BUCKET=<bucket-name>
echo $BUCKET
```

#### Task 2: The two letter lab

#### Note: Why would you want to use Cloud Storage instead of HDFS?
- You can shut down the cluster when you are not running jobs. The storage persists even when the cluster is shut down, so you don't have to pay for the cluster just to maintain data in HDFS.
- In some cases Cloud Storage provides better performance than HDFS.
- Cloud Storage does not require the administration overhead of a local file system.

- Place a copy of your sample data file in a Cloud Storage bucket instead of HDFS.
- In the Master Node terminal window, enter the following gsutil command to copy the sample text files to the Cloud Storage bucket.

```
gsutil cp /training/road-not-taken.txt gs://$BUCKET
```

- In the SSH terminal for the Master Node, use nano or vi to create the file wordcount.py
- Copy and paste the following code into the file:

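```
from pyspark.sql import SparkSession
from operator import add
import re

print("Okay Google.")

spark = SparkSession\
        .builder\
        .appName("CountUniqueWords")\
        .getOrCreate()

# read the file as an RDD with one string per line of text
lines = spark.read.text("/sampledata/road-not-taken.txt").rdd.map(lambda x: x[0])
# split lines into tokens, keep tokens that contain at least one letter and
# are longer than one character, then count each upper-cased token
# (the first filter only tests the token; it does not strip punctuation)
counts = lines.flatMap(lambda x: x.split(' ')) \
                  .filter(lambda x: re.sub('[^a-zA-Z]+', '', x)) \
                  .filter(lambda x: len(x)>1 ) \
                  .map(lambda x: x.upper()) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add) \
                  .sortByKey()
# bring the results back to the driver and print them
output = counts.collect()
for (word, count) in output:
  print("%s = %i" % (word, count))

spark.stop()
```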

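
To check what each stage of the word count pipeline does without a cluster, the same steps can be reproduced in plain Python. This is a local approximation for study purposes, not a replacement for the Spark job; the function name and the sample lines are assumptions.

```python
import re
from collections import Counter


def count_unique_words(lines):
    """Mirror the flatMap / filter / map / reduceByKey / sortByKey chain on a list of strings."""
    words = [w for line in lines for w in line.split(' ')]        # flatMap: split on spaces
    words = [w for w in words if re.sub('[^a-zA-Z]+', '', w)]     # keep tokens containing a letter
    words = [w for w in words if len(w) > 1]                      # drop one-character tokens
    counts = Counter(w.upper() for w in words)                    # map to upper case, then count
    return sorted(counts.items())                                 # sortByKey


if __name__ == "__main__":
    sample = ["Two roads diverged in a yellow wood,", "two roads"]
    for word, count in count_unique_words(sample):
        print("%s = %i" % (word, count))
```

Running it on the sample shows that "wood," keeps its trailing comma, because the regular expression is used only as a test inside `filter` and its result is never stored.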

- First, verify that the data file does not exist in HDFS:
```
hadoop fs -ls
```
- Next, use the Hadoop file system command to view the files through the hadoop connector to Cloud Storage. This verifies that the connector is working and that the file is available in the bucket.
```
hadoop fs -ls gs://$BUCKET
```
- Edit wordcount.py in nano or vi. Replace this line:

```
lines = spark.read.text("/sampledata/road-not-taken.txt").rdd.map(lambda x: x[0])
```

- with a line that refers to the file in Cloud Storage. Remember to remove "/sampledata" because that directory does not exist. Remember to use the actual bucket name and not the environment variable. The Worker Nodes on the cluster where the program will run do not know the value of the local environment variable on the Master Node. The edited line should look like this:

```
lines = spark.read.text("gs://<YOUR-BUCKET>/road-not-taken.txt").rdd.map(lambda x: x[0])
```

- Run the job:

```
spark-submit wordcount.py
```
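
One possible alternative to hard-coding the bucket name is to pass it to the script as a command-line argument, so the value is resolved once at submit time. The sketch below shows only the argument handling; the `--bucket` and `--object` options and the `input_uri` function are hypothetical and not part of the lab's wordcount.py.

```python
import argparse


def input_uri(argv=None):
    """Build the Cloud Storage URI of the input file from command-line arguments."""
    parser = argparse.ArgumentParser(description="Word count input location")
    parser.add_argument("--bucket", required=True, help="bucket name without the gs:// prefix")
    parser.add_argument("--object", default="road-not-taken.txt", help="object name inside the bucket")
    args = parser.parse_args(argv)
    return f"gs://{args.bucket}/{args.object}"


if __name__ == "__main__":
    # Example with a placeholder bucket name.
    print(input_uri(["--bucket", "my-bucket"]))
```

With this in place the job could be started as `spark-submit wordcount.py --bucket $BUCKET`: the shell on the Master Node expands the variable before the application starts, so the script only ever sees the literal bucket name.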

#### Task 3: Run a PySpark application from Cloud Storage

- In the previous task you created a PySpark application in a development environment (on the Master Node). You tested the application using spark-submit.
- In this task you will migrate the application from the development environment to a production environment. You will stage the working application file in Cloud Storage and run the production job from the Console.
- In the Master Node terminal, use the following command to copy the tested wordcount.py PySpark application to the bucket.
```
gsutil cp wordcount.py gs://$BUCKET
```

- In the Console, on the Navigation menu, click Dataproc > Clusters. Take note of the region where the cluster is located. You will need that in the next steps.
- You will also need the bucket name. You can also retrieve the bucket name from the Master Node terminal by entering the following. Highlight the bucket name and copy it.

```
echo $BUCKET
```
- In the Console, on the Navigation menu, click Dataproc > Jobs.
- Click Submit job and specify the following:
    - Region: <your-region>
    - Cluster: dataproc-cluster
    - Job type: PySpark
    - Main python file: gs://<your bucket>/wordcount.py
- Click Submit. Check Dataproc > Jobs for progress.
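
The Console submission above has a command-line counterpart in `gcloud dataproc jobs submit pyspark`. The sketch below only assembles and prints that command from the same four settings; it does not submit anything. The function name is an assumption, and the bucket and region in the usage lines are placeholders.

```python
import shlex


def submit_pyspark_command(bucket, region, cluster="dataproc-cluster", script="wordcount.py"):
    """Build the gcloud argument list for submitting a PySpark file staged in Cloud Storage."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark",
        f"gs://{bucket}/{script}",
        f"--cluster={cluster}",
        f"--region={region}",
    ]


if __name__ == "__main__":
    # Placeholder values; substitute your own bucket and region.
    print(shlex.join(submit_pyspark_command("my-bucket", "us-central1")))
```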

### Customizing Clusters
### Lab: Cluster Automation using CLI Commands
### Module 3 Review

## Module 4: Analyzing Unstructured Data
### Infuse Your Business With Machine Learning
### Lab: Add Machine Learning
### Module 4 Review