## GCP Professional Data Engineer
### Introduction to the Data and Machine Learning on Google Cloud Platform Specialization

## Module 1: Introduction to Google Cloud Platform and its Big Data Products

#! What is the Google Cloud Platform?

GCP is the natural evolution from personal compute centers to a global, distributed computational network.

### Data technologies to research
2002 - GFS
2004 - MapReduce (deprecated)
2006 - BigTable
2008 - Dremel (data processing)
2009 - Colossus
2010 - Flume (data processing)
2011 - Megastore
2012 - Spanner
2013 - MillWheel
2014 - PubSub / F1
2015 - TensorFlow

http://research.google.com/pubs/papers.html

#! GCP Big Data Products

### A functional view of the platform:

Foundation - Compute engine, Cloud storage
Databases - Datastore, Cloud SQL, Cloud Bigtable
Analytics & ML - BigQuery, Cloud Datalab, Translate API
Data-handling Frameworks - Cloud PubSub, Cloud Dataflow, Cloud Dataproc

#!! Module 1 Review 

Q. What is success for you?
A. ~Confidential.

#! Welcome to the Foundations of GCP Compute and Storage

CPUs on Demand
https://cloud.google.com/custom-machine-types
https://cloud.google.com/compute/pricing

### Lab: Create a Compute Engine Instance

Objectives:
- Create a Compute Engine instance with the necessary Access and Security
- SSH into the instance
- Install the software package Git (for source code version control)

Launch GCP via console.cloud.google.com

Task 1: Create Compute Engine instance with the necessary API access

- In the GCP Console, on the Navigation menu, click Compute Engine.
- Click Create and wait for a form to load. You will need to change some options on the form that comes up.
- For Name, leave the default value; for Region, select us-central1; and for Zone, select us-central1-a.
- For Identity and API access, in Access scopes, select Allow full access to all Cloud APIs.
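The form options in this task can probably also be set from the command line. The following is a minimal sketch, not part of the lab: the instance name `lab-instance` is a hypothetical placeholder (the lab keeps the console default), and the `run` helper only prints the command unless `RUN=1` is set.

```shell
#!/bin/sh
# Sketch: create the lab VM with gcloud instead of the console form.
# INSTANCE_NAME is a hypothetical placeholder, not a value from the lab.
INSTANCE_NAME="${INSTANCE_NAME:-lab-instance}"
ZONE="us-central1-a"   # zone selected in the lab

# Print the command by default; set RUN=1 to actually execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

create_instance() {
  # The cloud-platform scope corresponds to
  # "Allow full access to all Cloud APIs" in the console.
  run gcloud compute instances create "$1" \
    --zone="$ZONE" \
    --scopes=cloud-platform
}

create_instance "$INSTANCE_NAME"
```

Running the script as-is only prints the `gcloud` command, so it can be reviewed before running it with `RUN=1`.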

Task 2: SSH into the Instance

- When the instance you just created is available, click SSH
- To view information about the Compute Engine instance you just launched, type the following into your SSH terminal:

In [None]:
cat /proc/cpuinfo

Task 3: Install software

In [None]:
# Install git
sudo apt-get update
sudo apt-get -y -qq install git
# Verify installation
git --version
# exit
exit

### End Lab 1

## Module 2: Foundations of GCP Compute and Storage

#! A Global Filesystem

https://cloud.google.com/storage/docs/overview

### Lab: Interact With Cloud Storage

Objectives: 
- Create a Compute Engine instance with the necessary Access and Security
- SSH into the instance
- Install the software package Git (for source code version control)
- Ingest data into a Compute Engine instance
- Transform data on the Compute Engine instance
- Store the transformed data on Cloud Storage
- Publish Cloud Storage data to the web

###### Sample code for transformation
https://github.com/GoogleCloudPlatform/datalab-samples/blob/master/basemap/earthquakes.ipynb


Task 1: Create Compute Engine instance with the necessary API access

Ibid.

Task 2: SSH into the instance

Ibid.

Task 3: Install software and Ingest USGS data

In [None]:
# Install git
sudo apt-get update
sudo apt-get -y -qq install git
# Verify installation
git --version
# Copy data from Github and navigate to folder
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
cd training-data-analyst/CPB100/lab2b
# Examine 
less ingest.sh # less allows you to view the file in the terminal
bash ingest.sh # bash runs the file
head earthquakes.csv # head shows the first few lines
# Install python packages
bash install_missing.sh
# Run transform code from link in lab description
python transform.py
# List directory contents
ls -l

Task 5: Create bucket

- In the GCP Console, on the Navigation menu, click Storage.
- Click Create Bucket.
- For Name, enter your Project ID, then click Create. To find your Project ID, click the project in the top menu of the GCP Console and copy the value under ID for your selected project.
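As an alternative to the console, the bucket could be created from the SSH terminal or Cloud Shell with `gsutil mb`. This is a sketch under assumptions: `my-project-id` is a hypothetical placeholder for your real Project ID, and the command is only printed unless `RUN=1` is set.

```shell
#!/bin/sh
# Sketch: create a bucket named after the project.
# PROJECT_ID is a hypothetical placeholder; in Cloud Shell,
# `gcloud config get-value project` prints the active project ID.
PROJECT_ID="${PROJECT_ID:-my-project-id}"

# Print the command by default; set RUN=1 to actually execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

make_bucket() {
  # Bucket names are global, so using the Project ID keeps the name unique.
  run gsutil mb "gs://$1"
}

make_bucket "$PROJECT_ID"
```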

Task 6: Store Data

- In your SSH terminal, type the following, replacing <YOUR-BUCKET> with the name of the bucket you created in the previous task:

In [None]:
gsutil cp earthquakes.* gs://<YOUR-BUCKET>/earthquakes/ # copy data to bucket

- In the GCP Console, click the bucket name and notice there are three new files present in the earthquakes folder (click Refresh if necessary).

Task 7: Publish Cloud Storage files to web

- In the GCP Console, check Public link for all three files in the earthquakes folder.
- For earthquakes.htm, click Public link.
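If the Public link option is hard to find in the console, granting read access to all users with `gsutil acl ch` may achieve the same result. This is a sketch, not the lab's method: `my-bucket` is a hypothetical placeholder, and ACL changes can be rejected on buckets that use uniform bucket-level access.

```shell
#!/bin/sh
# Sketch: make the objects in the earthquakes folder publicly readable.
# BUCKET is a hypothetical placeholder for the bucket created earlier.
BUCKET="${BUCKET:-my-bucket}"

# Print the command by default; set RUN=1 to actually execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

publish_folder() {
  # AllUsers:R grants read access to anyone; gsutil expands the wildcard itself.
  run gsutil acl ch -u AllUsers:R "gs://$1/earthquakes/*"
}

public_url() {
  # Public objects are served under storage.googleapis.com/<bucket>/<object>.
  printf 'https://storage.googleapis.com/%s/earthquakes/%s\n' "$1" "$2"
}

publish_folder "$BUCKET"
public_url "$BUCKET" earthquakes.htm
```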

##### Note: this feature was non-obvious, possibly due to a UI change

### End Lab 2

#! Module 2 Review

1. Compute nodes on GCP are:
- Allocated on demand, and you pay for the time they are up
- Cheaper if you allow them to be shut down at any time

2. Google Cloud Storage is a good option for storing data that:
- May be required to be read at some later time
- May be imported into a cluster for analysis

## Module 3: Data Analysis on the Cloud

#! Resources

Compute Engine: https://cloud.google.com/compute/
Storage: https://cloud.google.com/storage/
Pricing: https://cloud.google.com/pricing/
Cloud Launcher: https://cloud.google.com/launcher/
Pricing Philosophy: https://cloud.google.com/pricing/philosophy/

### Lab: Working with Cloud SQL

Objectives:
- Create Cloud SQL instance
- Create database tables by importing .sql files from Cloud Storage
- Populate the tables by importing .csv files from Cloud Storage
- Allow access to Cloud SQL
- Explore the rentals data using SQL statements from Cloud Shell

Task 1: Access lab code

In [None]:
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
# Navigate to folder
cd training-data-analyst/CPB100/lab3a
# Examine file
less cloudsql/table_creation.sql
# Examine first few rows
head cloudsql/*.csv

Task 2: Create bucket
- In the GCP Console, on the Navigation menu, click Storage.
- Click Create Bucket.
- For Name, enter your Project ID, then click Create. To find your Project ID, click the project in the top menu of the GCP Console and copy the value under ID for your selected project.

Task 3: Stage .sql & .csv files into Cloud Storage

In [None]:
# copy files to bucket
gsutil cp cloudsql/* gs://<BUCKET-NAME>/sql/

Task 4:  Create Cloud SQL Instance
- In the GCP console, click SQL (in the Storage section).
- Click Create instance.
- Click Choose MySQL, then click Configure MySQL Development.
- For Instance ID, type rentals
- Specify password
- Scroll down and click Show configuration options. Click Authorize networks, then click + Add network.
- From Cloud Shell within the lab3a directory, find your IP address by typing:

In [None]:
# find IP address
bash ./find_my_ip.sh

- In the New network dialog, enter any Name, and for Network, type the IP address from the previous step. Click Done.

Note: If you lose your Cloud Shell VM due to inactivity, you will have to reauthorize your new Cloud Shell VM with Cloud SQL. For your convenience, lab3a includes a script called authorize\_cloudshell.sh that you can run.

- Click Create to create the instance. It will take a minute or so for your Cloud SQL instance to be provisioned.
- Note down the IP address of your Cloud SQL instance (from the browser window) in the third row of the table you started.
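The console steps in this task could also be approximated with `gcloud sql instances create`. This is a rough sketch under assumptions: the region `us-central1` and the client address `203.0.113.7` are hypothetical placeholders, the root password still has to be set separately, and the command is only printed unless `RUN=1` is set.

```shell
#!/bin/sh
# Sketch: create the rentals instance and authorize one client IP.
# REGION and CLIENT_IP are hypothetical placeholders; CLIENT_IP stands in
# for the address reported by find_my_ip.sh.
REGION="${REGION:-us-central1}"
CLIENT_IP="${CLIENT_IP:-203.0.113.7}"

# Print the command by default; set RUN=1 to actually execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

create_sql_instance() {
  # --authorized-networks expects CIDR notation; /32 means a single address.
  run gcloud sql instances create rentals \
    --region="$REGION" \
    --authorized-networks="$1/32"
}

create_sql_instance "$CLIENT_IP"
```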

Task 5: Create tables

- In Cloud SQL, click rentals to view instance information.
- Click Import (on the top menu bar).
- Click Browse. This will bring up a list of buckets. Click on the bucket you created, then navigate into sql and click table_creation.sql.
- Click Select, then click Import. 

Task 6: Populate tables

- To import CSV files from Cloud Storage, from the GCP console page with the Cloud SQL instance details, click Import (top menu).
- Click Browse, browse in the bucket you created to sql, then click accommodation.csv. Click Select.
- For Database, select recommendation_spark.
- For Table, type Accommodation.
- Click Import.
- Repeat the Import (steps 1 - 5) for rating.csv, but for Table, type Rating.
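The import steps from Tasks 5 and 6 can likely be scripted with `gcloud sql import`. This is a sketch only: `my-bucket` is a hypothetical placeholder, the Cloud SQL service account is assumed to have read access to the bucket, and commands are only printed unless `RUN=1` is set.

```shell
#!/bin/sh
# Sketch: import the schema and both CSV files into the rentals instance.
# BUCKET is a hypothetical placeholder for the bucket holding the sql/ folder.
BUCKET="${BUCKET:-my-bucket}"

# Print the command by default; set RUN=1 to actually execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

import_schema() {
  # Equivalent of importing table_creation.sql through the console.
  run gcloud sql import sql rentals "gs://$1/sql/table_creation.sql"
}

import_table() {
  # $1 = bucket, $2 = CSV file name, $3 = target table
  run gcloud sql import csv rentals "gs://$1/sql/$2" \
    --database=recommendation_spark \
    --table="$3"
}

import_schema "$BUCKET"
import_table "$BUCKET" accommodation.csv Accommodation
import_table "$BUCKET" rating.csv Rating
```

The schema has to be imported first, because the CSV imports expect the tables to exist already.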

Task 7: Explore Cloud SQL

- To explore Cloud SQL, you can use the mysql CLI. In Cloud Shell, type the following, replacing <MySQLIP> with the Public IP address of your rentals instance:

In [None]:
mysql --host=<MySQLIP> --user=root --password

- The IP address is the one for the database server (i.e. the third row in the notes). You can also find it on the instance details on the cloud console.
- MySQL will prompt you for the root password. Enter it when prompted.
- In Cloud Shell, at the mysql prompt, type:

In [None]:
use recommendation_spark; # sets database in mysql session
show tables; # view list of tables
select * from Rating; # verify data was loaded
select * from Accommodation where type = 'castle' and price < 1500; # check for cheap castles


### End Lab 3

### Lab: Recommendations ML w/ Dataproc

Objectives:
    
- Launch Dataproc
- Train and apply ML model written in PySpark to create product recommendations
- Explore inserted rows in Cloud SQL 

Task 1: Create Assets

- In Cloud Shell, clone the repo using the following command:

In [None]:
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
# Navigate to folder containing training data
cd training-data-analyst/CPB100/lab3a

- In the GCP Console, on the Navigation menu, click Storage.
- Click Create Bucket.
- For Name, enter your Project ID, then click Create. To find your Project ID, click the project in the top menu of the GCP Console and copy the value under ID for your selected project.
- Finally, stage the table definition and data files into Cloud Storage so that you can later import them into Cloud SQL. From Cloud Shell, within the lab3a directory, type the following, replacing <BUCKET-NAME> with the name of the bucket you just created:

In [None]:
gsutil cp cloudsql/* gs://<BUCKET-NAME>/sql/

- From the GCP console, go to Storage, navigate to your bucket and verify that the .sql and .csv files now exist on Cloud Storage.

Task 2: Create Cloud SQL instance

- In the GCP Console, on the Navigation menu, click SQL (in the Storage section).
- Click Create Instance.
- Click Choose MySQL, then click Configure MySQL Development.
- For Instance ID, type rentals.
- Scroll down and specify a root password. Before you forget, note down the root password (please don't do this in real life!).
- Scroll down and in Configuration options, click Authorize networks - Add network.
- In Cloud Shell, make sure you're in the lab3a directory and find your IP address by typing:

In [None]:
bash ./find_my_ip.sh

- In the Add Network dialog, enter an optional Name and enter the IP address output in the previous step. Click Done.

Note: If you lose your Cloud Shell VM due to inactivity, you will have to reauthorize your new Cloud Shell VM with Cloud SQL. For your convenience, lab3a includes a script called authorize\_cloudshell.sh that you can run.

- Click Create to create the instance. It will take a minute or so for your Cloud SQL instance to be provisioned.
- Note down the Public IP address of your Cloud SQL instance (from the browser window).

Task 3: Create and populate tables

- Click rentals to view details about your Cloud SQL instance.
- Click Import.
- Click Browse. This will bring up a list of buckets. Click on the bucket you created, then navigate into /sql, click table_creation.sql, then click Select.
- Click Import.
- Next, to import CSV files from Cloud Storage, click Import.
- Click Browse, navigate into /sql, click accommodation.csv, then click Select.
- Fill out the rest of the dialog as follows:
- For Database, select recommendation_spark
- For Table, type Accommodation
- Click Import.
- Repeat the Import process (steps 5 - 8) for rating.csv, but for Table, type Rating

Task 4: Launch Dataproc

- In the GCP Console, on the Navigation menu, click SQL and note the region of your Cloud SQL instance.
- In the GCP Console, on the Navigation menu, click Dataproc and click Enable API if prompted. Once enabled, click Create cluster.
- Change the zone to be in the same region as your Cloud SQL instance. This will minimize network latency between the cluster and the database.
- For Master node, for Machine type, select 2 vCPU (n1-standard-2).
- For Worker nodes, for Machine type, select 2 vCPU (n1-standard-2).
- Leave all other values with their default and click Create. It will take 1-2 minutes to provision your cluster.
- Note the Name, Zone and Total worker nodes in your cluster.
- In Cloud Shell, navigate to the folder corresponding to this lab and authorize all the Dataproc nodes to be able to access your Cloud SQL instance, replacing <Cluster-Name>, <Zone>, and <Total-Worker-Nodes> with the values you noted in the previous step:

In [None]:
cd ~/training-data-analyst/CPB100/lab3b
bash authorize_dataproc.sh <Cluster-Name> <Zone> <Total-Worker-Nodes>
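The console steps for creating the cluster in this task might also be expressed with `gcloud dataproc clusters create`. This is a sketch under assumptions: the cluster name, region, and zone below are hypothetical placeholders, the worker count is left at its default as in the lab, and the command is only printed unless `RUN=1` is set.

```shell
#!/bin/sh
# Sketch: create a Dataproc cluster in the same region as the Cloud SQL instance.
# CLUSTER, REGION and ZONE are hypothetical placeholders.
CLUSTER="${CLUSTER:-lab-cluster}"
REGION="${REGION:-us-central1}"
ZONE="${ZONE:-us-central1-a}"

# Print the command by default; set RUN=1 to actually execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

create_cluster() {
  # $1 = cluster name, $2 = region, $3 = zone
  # n1-standard-2 matches the 2 vCPU machine type chosen in the lab.
  run gcloud dataproc clusters create "$1" \
    --region="$2" \
    --zone="$3" \
    --master-machine-type=n1-standard-2 \
    --worker-machine-type=n1-standard-2
}

create_cluster "$CLUSTER" "$REGION" "$ZONE"
```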

Task 5: Run ML model

- Edit the model training file using nano:

In [None]:
nano sparkml/train_and_apply.py

- Change the fields marked #CHANGE at the top of the file (scroll down using the down arrow key) to match your Cloud SQL setup (see earlier parts of this lab where you noted these down), and save the file using Ctrl+O then press Enter, and then press Ctrl+X to exit from the file.
- Copy this file to your Cloud Storage bucket using:

In [None]:
gsutil cp sparkml/tr*.py gs://<bucket-name>/

- In the Dataproc console, click Jobs.
- Click Submit job.
- For Job type, select PySpark and for Main python file, specify the location of the Python file you uploaded to your bucket.

In [None]:
gs://<bucket-name>/train_and_apply.py

- Click Submit and wait for the job Status to change from Running (this will take up to 5 minutes) to Succeeded.
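Submitting the job from Cloud Shell instead of the Dataproc console could look roughly like this. It is a sketch, not the lab's procedure: the bucket, cluster, and region values are hypothetical placeholders, and the command is only printed unless `RUN=1` is set.

```shell
#!/bin/sh
# Sketch: submit the uploaded PySpark file as a Dataproc job.
# BUCKET, CLUSTER and REGION are hypothetical placeholders.
BUCKET="${BUCKET:-my-bucket}"
CLUSTER="${CLUSTER:-lab-cluster}"
REGION="${REGION:-us-central1}"

# Print the command by default; set RUN=1 to actually execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

submit_job() {
  # $1 = bucket, $2 = cluster name, $3 = region
  run gcloud dataproc jobs submit pyspark "gs://$1/train_and_apply.py" \
    --cluster="$2" \
    --region="$3"
}

submit_job "$BUCKET" "$CLUSTER" "$REGION"
```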

Task 6: Explore inserted rows

- In Cloud Shell, authorize your Cloud Shell VM to access the Cloud SQL instance. This will also deauthorize the Dataproc cluster.

In [None]:
bash ../lab3a/authorize_cloudshell.sh

- Connect to your Cloud SQL instance, replacing <MySQLIP> with your SQL instance Public IP Address noted in an earlier task:

In [None]:
mysql --host=<MySQLIP> --user=root --password

At the mysql prompt, type:

In [None]:
use recommendation_spark; # set database in mysql session
# find recommendations for some user
select r.userid, 
r.accoid, 
r.prediction, 
a.title, 
a.location, 
a.price, 
a.rooms, 
a.rating, 
a.type 
from Recommendation as r, 
Accommodation as a 
where r.accoid = a.id and r.userid = 10;
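The same lookup can be written with an explicit `JOIN` and run non-interactively through the mysql client. This is a sketch: the host address `203.0.113.7` is a hypothetical placeholder for your instance's Public IP, and the command is only printed unless `RUN=1` is set.

```shell
#!/bin/sh
# Sketch: run the recommendation lookup without opening an interactive session.
# MYSQL_IP is a hypothetical placeholder for the instance's Public IP.
MYSQL_IP="${MYSQL_IP:-203.0.113.7}"

# Same columns and filter as the interactive query, using an explicit JOIN.
QUERY='SELECT r.userid, r.accoid, r.prediction,
       a.title, a.location, a.price, a.rooms, a.rating, a.type
FROM Recommendation AS r
JOIN Accommodation AS a ON r.accoid = a.id
WHERE r.userid = 10;'

# Print the command by default; set RUN=1 to actually execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

query_recommendations() {
  # --password with no value makes the client prompt for the root password.
  run mysql --host="$1" --user=root --password \
    --database=recommendation_spark --execute="$QUERY"
}

query_recommendations "$MYSQL_IP"
```

The explicit `JOIN ... ON` form keeps the join condition separate from the `userid` filter, which tends to be easier to read as queries grow.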

### End Lab 4

#! Module 3 Review

1. Relational databases are a good choice when you need:
- Transactional updates on relatively small datasets

2. Cloud SQL and Cloud Dataproc offer familiar tools (MySQL and Hadoop/Pig/Hive/Spark). What is the value-add provided by Google Cloud Platform?
- Google-proprietary extensions and bug fixes to MySQL, Hadoop, and so on
- Fully-managed versions of the software offer no-ops
- Running it on Google infrastructure offers reliability and cost savings

## Module 4: Scaling Data Analysis: Compute with GCP

Objectives:

- Employ BigQuery and Cloud Datalab to carry out interactive data analysis
- Train and use a neural network using TensorFlow

#### Sections:

#! Intro to Scaling Data Analysis: Change How You Compute w/ GCP
#! Fast Random Access
#! Interactive, iterative development
#! Warehouse and query petabytes
#! Machine learning w/ TensorFlow
#! Fully built machine learning models

### Lab: Create ML Dataset with BigQuery

Objectives:

- Use BigQuery and Datalab to explore and visualize data
- Build a Pandas dataframe that will be used as the training dataset for machine learning using TensorFlow

Task 1: Launch Cloud Datalab

- In Cloud Shell, type:

In [None]:
gcloud compute zones list
datalab create bdmlvm --zone <ZONE> # choose a zone from list

Task 2: Checkout notebook into Cloud Datalab

- Click on the Web Preview icon (looks like a browser <> window) on the top-right corner of the Cloud Shell ribbon. Click on Change port. Switch to port 8081 using the Change Preview Port dialog box, and then click on Change and Preview.

Note: The connection to your Datalab instance remains open for as long as the datalab command is active. If the cloud shell used for running the datalab command is closed or interrupted, the connection to your Cloud Datalab VM will terminate. If that happens, you may be able to reconnect using the command datalab connect bdmlvm in your new Cloud Shell.

- In Datalab, click on the icon for Open ungit in the top-right ribbon. (looks like a forked branch)
- In the Ungit window, select the text that reads /content/datalab/notebooks and remove the notebooks so that it reads /content/datalab, then hit Enter.
- In the panel that comes up, type the following as the GitHub repository to Clone from:

In [None]:
https://github.com/GoogleCloudPlatform/training-data-analyst

Task 3: Open a Datalab notebook
    
- In the Datalab browser, navigate to training-data-analyst > CPB100 > lab4a > demandforecast.ipynb.
- Read the commentary, click Clear | Clear all Cells, then run the Python snippets in each cell step by step (use Shift+Enter to run each piece of code).
- When you reach the section Machine Learning with TensorFlow, please stop -- that is the next lab.

### End Lab