# GCP Professional Data Engineer
### Leveraging Unstructured Data with Cloud Dataproc on Google Cloud Platform
#### Modules:
- Introduction to Cloud Dataproc
- Running Dataproc Jobs
- Leveraging GCP
- Analyzing Unstructured Data

## Module 1: Introduction to Cloud Dataproc
### Introducing Cloud Dataproc
- Cloud Dataproc is a fast, easy-to-use, fully-managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way. Operations that used to take hours or days take seconds or minutes instead, and you pay only for the resources you use (with per-second billing). Cloud Dataproc also easily integrates with other Google Cloud Platform (GCP) services, giving you a powerful and complete platform for data processing, analytics and machine learning.

### Why Unstructured Data
- About 90% of the data a business collects tends to be unstructured. This includes:
    - Emails, reviews, text, etc.
- Consider the Google Street View initiative: cars gathered street-level photos that had no immediate impact, yet those photos are now the foundation for one of the most useful datasets available for autonomous cars.
    
### Why Cloud Dataproc
- Horizontal vs vertical scaling
- Hadoop origins:
    - Hadoop is an open-source framework based on a whitepaper describing MapReduce. It pairs a MapReduce processing engine with a distributed file system known as HDFS. Spark is a framework that can take advantage of the distributed file system to process tasks efficiently.
- Running an onsite Hadoop cluster is costly and often inefficient.
- Additional benefits:
    - Stateless clusters in <90 seconds
    - Supports Hadoop, Spark, Pig, Hive
    - High-level APIs for job submission
    - Connectors for Bigtable, BigQuery, and Cloud Storage
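
To make the MapReduce idea mentioned under Hadoop origins more concrete, here is a minimal single-machine sketch of a word count written as a shell pipeline. It is only an analogy and not how Hadoop is implemented: the `wordcount` function name is hypothetical, `tr` plays the role of the map step, `sort` stands in for the shuffle, and `uniq -c` acts as the reduce step.

```shell
# Single-machine analogy of MapReduce word count (illustrative only).
wordcount() {
  # map: emit one word per line
  tr -s '[:space:]' '\n' |
    sed '/^$/d' |
    # shuffle: bring identical keys next to each other
    sort |
    # reduce: count each group, then print "word<TAB>count"
    uniq -c |
    awk '{print $2 "\t" $1}'
}

printf 'to be or not to be\n' | wordcount
```

On a real cluster the map and reduce steps run in parallel on many worker nodes, which is where horizontal scaling pays off.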

### Lab: Create a Dataproc Cluster

#### Objectives:
- Prepare a bucket for cluster initialization
- Create a Dataproc Hadoop Cluster customized to use the Google Cloud API
- Enable secure access to the Dataproc cluster
- Explore Hadoop operations

#### Task 1: Prepare Environment Variables
- In the Console, on the Navigation menu, click Compute Engine > VM instances.
- Locate the line with the instance called training_vm.
- On the far right, under 'connect', click SSH to open a terminal window.
- In this lab you will enter CLI commands on the training_vm.
##### Create the source file for setting and resetting environment variables


- In the training_vm SSH terminal window, use your preferred command-line editor to create a file named myenv that will hold your environment variables.
- One environment variable that you will set is PROJECT_ID, which contains the Google Cloud project ID required to access billable resources.
- In the Console, on the Navigation menu, click Home. The Project ID is listed in the Project Info panel. You can also find this information on the Qwiklabs tab under Connection Details, where it is labeled GCP Project ID.
- Add the PROJECT_ID environment variable to myenv for easy reference.
- Dataproc can use a Cloud Storage bucket to stage its files during initialization. You can use this bucket to stage application programs or data for use by Dataproc. The bucket can also host Dataproc initialization scripts and output. The bucket name must be globally unique. Qwiklabs has already created a bucket for you that has the same name as the Project ID, which is already globally unique.
- In the Console, on the Navigation menu, click Storage > Browser. Verify that the bucket exists. Notice the default storage class and the location (region) of this bucket. You will use this region information next.
- Add a line to myenv to create an environment variable named BUCKET.
- You can use $BUCKET in CLI commands. If you need to enter the bucket name <your-bucket> in a text field in the Console, you can quickly retrieve the name with echo $BUCKET.
- You will create a Dataproc cluster in a specific region. The Dataproc cluster and the bucket it uses for staging must be in the same region. Since the bucket you are using already exists, you need to match the environment variable MYREGION to the bucket's region.
- You can find the region used by Qwiklabs on the Qwiklabs tab under Connection Details, labeled QL Region.
- The zone must be in the same region. The environment variable MYZONE will contain this value.
- You can find the zone used by Qwiklabs on the Qwiklabs tab under Connection Details, labeled QL Zone.
- Add the MYREGION and MYZONE environment variables to myenv for easy reference.
- You will use the browser IP address to enable your local browser to reach the Dataproc cluster.
- Find your computer's browser IP address by opening a browser window and viewing http://ip4.me/. Copy the IP address.
- Add a line to myenv to create an environment variable named BROWSER_IP.
- After you have added all of the definitions to myenv and saved the file, use the source command to create the environment variables.

```shell
cd ~
nano myenv

# Contents of myenv (replace each <placeholder> with your own value):
PROJECT_ID=<project ID>
BUCKET=<project ID>
MYREGION=<region>
MYZONE=<zone>
BROWSER_IP=<your-browser-ip>

# Set the environment variables
source myenv
# Verify the variables are set
echo $PROJECT_ID
echo $MYREGION $MYZONE
echo $BUCKET
echo $BROWSER_IP
```
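
Because every later task depends on these variables, it can help to check them in one pass after sourcing myenv. The following is a minimal sketch, not part of the lab: the `check_env` function name is hypothetical, and it only tests that each variable is non-empty, not that the value is correct.

```shell
# Report every variable from myenv that is unset or empty.
# Returns 0 when all are set, 1 otherwise.
check_env() {
  missing=0
  for name in PROJECT_ID BUCKET MYREGION MYZONE BROWSER_IP; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}

check_env || echo "Edit myenv, then run 'source myenv' again."
```

The variables only exist in the shell session that sourced myenv, so a new SSH window needs `source myenv` to be run again.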

#### Task 2. Create a Dataproc Cluster
- In the Console, on the Navigation menu, click Dataproc > Clusters.
- Click Create Cluster.
- Specify the following, and leave the remaining settings as their defaults:
    - Name: cluster-dataproc
    - Region: <myregion>
    - Zone: <myzone>
    - Cluster mode: Standard (1 master, n workers)
    - (Master node) Machine type: n1-standard-2
    - (Master node) Primary disk size: 100GB
    - (Worker node) Machine type: n1-standard-1
    - (Worker node) Primary disk size: 50GB
    - Nodes: 3
- Click Preemptible workers, bucket, network, version, initialization, & access options.
- Specify the following, and leave the remaining settings as their defaults:
    - Network tags: hadoopaccess
    - Cloud Storage staging bucket: <your-bucket>
    - Image version: 1.2
    - Project access: Allow API access
- Click Create.
- The cluster will take several minutes to become operational. In the Console, on the Navigation menu, click Dataproc > Clusters.
- Click your cluster, cluster-dataproc, then click the VM Instances tab. The instances will become operational before the Hadoop software has completed initialization. When a checkmark in a green circle appears next to the name of the cluster, it is operational.
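
The Console settings in this task can also be expressed as a single `gcloud` command. The sketch below only prints the command so it can be reviewed before running; it is not taken from the lab. The `print_create_cluster_cmd` helper is hypothetical, and two mappings are assumptions: that "Nodes: 3" means three worker nodes, and that "Allow API access" corresponds to `--scopes=cloud-platform`.

```shell
# Print (do not run) a gcloud command mirroring the Console settings.
# Arguments: region, zone, staging bucket name.
print_create_cluster_cmd() {
  region=$1 zone=$2 bucket=$3
  cat <<EOF
gcloud dataproc clusters create cluster-dataproc \\
  --region=$region \\
  --zone=$zone \\
  --master-machine-type=n1-standard-2 \\
  --master-boot-disk-size=100GB \\
  --worker-machine-type=n1-standard-1 \\
  --worker-boot-disk-size=50GB \\
  --num-workers=3 \\
  --tags=hadoopaccess \\
  --bucket=$bucket \\
  --image-version=1.2 \\
  --scopes=cloud-platform
EOF
}

# Falls back to placeholders when myenv has not been sourced.
print_create_cluster_cmd "${MYREGION:-<region>}" "${MYZONE:-<zone>}" "${BUCKET:-<your-bucket>}"
```

Printing first makes it easy to compare each flag against the Console form; once it looks right, the printed text can be pasted into the training_vm terminal.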

#### Task 3: Enable secure access to Dataproc cluster
- Create a firewall rule that allows access only to the Master Node from your computer's IP address. Only ports 8088 (Hadoop Job Interface) and 9870 (Hadoop Admin interface) will be permitted.
- Port 8042 is the web UI for the node manager on the worker nodes and port 8080 is the default port for Datalab. Datalab is a notebook-based integrated development environment derived from Jupyter notebooks. It is a common tool for developing Dataproc applications. The Serverless Machine Learning on GCP course uses Datalab extensively.
- Recall your computer's browser IP address for use in Console.

    ```echo $BROWSER_IP```


- In the Console, on the Navigation menu, click VPC Network > Firewall rules.
- Click Create Firewall Rule.
- Specify the following, and leave the remaining settings as their defaults:
    - Name: allow-hadoop
    - Network: default
    - Priority: 1000
    - Direction of traffic: Ingress
    - Action on match: Allow
    - Targets: Specified target tags
    - Target tags: hadoopaccess
    - Source IP ranges: <yourIP>/32
    - Specified protocols and ports: tcp:9870;tcp:8088
    
- Verify that the network tag "hadoopaccess" is set on the Master Node. That will apply the firewall rule to the Master Node, giving your laptop access to it.
- In the Console, on the Navigation menu, click Compute Engine > VM Instances.
- Click on the Master Node, cluster-dataproc-m.
- Verify that under Network Tags it lists hadoopaccess.
- If the tag is not there, Click EDIT.
- Under Network Tags add the tag: hadoopaccess
- Click Save.
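
The firewall rule and the network tag check from this task could also be done from the command line. The sketch below only prints the two `gcloud` commands for review and is not part of the lab; the `print_firewall_cmds` helper is hypothetical. Note that `gcloud` separates multiple ports with commas, whereas the Console field uses semicolons.

```shell
# Print (do not run) gcloud commands mirroring the Console steps.
# Arguments: browser IP address, zone of the master node.
print_firewall_cmds() {
  browser_ip=$1 zone=$2
  cat <<EOF
gcloud compute firewall-rules create allow-hadoop \\
  --network=default \\
  --priority=1000 \\
  --direction=INGRESS \\
  --action=ALLOW \\
  --target-tags=hadoopaccess \\
  --source-ranges=$browser_ip/32 \\
  --rules=tcp:9870,tcp:8088

gcloud compute instances add-tags cluster-dataproc-m \\
  --zone=$zone \\
  --tags=hadoopaccess
EOF
}

# Falls back to placeholders when myenv has not been sourced.
print_firewall_cmds "${BROWSER_IP:-<your-browser-ip>}" "${MYZONE:-<zone>}"
```

The `/32` suffix limits the source range to a single address, so only the machine whose IP you recorded in BROWSER_IP can reach the two ports.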

#### Task 4: Explore Hadoop Operations
- In the Console, on the Navigation menu, click Compute Engine > VM Instances.
- In the list of VM instances, in the row for cluster-dataproc-m, highlight the External IP and copy it.
- Open a new browser tab or window and paste the External IP. Add ":8088" after the IP and press Enter. Example: <External IP>:8088. The web page displayed is the Hadoop Applications interface.
- Open a new browser tab or window. Paste the External IP. Add ":9870" after the IP and press Enter. Example: <External IP>:9870. The web page displayed is the Hadoop Administration Interface.
- Click on the Datanodes tab. This shows how much capacity is being used on the worker nodes' HDFS (Hadoop Distributed File System) and how much capacity remains.
- Click on Utilities > Logs. This shows you the Hadoop log files for each node in the cluster. This is where you can go to investigate problems with Hadoop. Use your browser's back button to return to the Hadoop Administration console.
- Click on Utilities > Browse the file system. After a few moments the file system will appear in the browser page. You can use this to navigate the file system. In the row where Owner is hdfs and Group is hadoop, click the link that says user. Here you can see directories for all the Hadoop services.
- Leave the JobTracker <External IP>:8088 and the Administration Interface <External IP>:9870 tabs or windows open. You will use them in the next task.


### Module 1 Review

1.) Which of the following statements are true about Cloud Dataproc?
- Lets you run Spark and Hadoop clusters with minimal administration
- Helps you create job-specific clusters without HDFS

2.) Matching definitions:
- Zone: determines the Google data center where compute nodes will be
- Preemptible: costs less but may not always be available
- Standard cluster mode: Provides 1 master and n workers

## Module 2: Running Dataproc Jobs
### Running Jobs
### Lab: Work with structured and semi-structured Data
### Separation of Storage & Compute
### Submitting Jobs
### Spark RDDs, Transformations, and Actions
### Lab: Working with Spark Jobs
### Module 2 Review

## Module 3: Leveraging GCP
### Big Query Support
### Lab: Leverage GCP
### Customizing Clusters
### Lab: Cluster Automation using CLI Commands
### Module 3 Review

## Module 4: Analyzing Unstructured Data
### Infuse Your Business With Machine Learning
### Lab: Add Machine Learning
### Module 4 Review