# Chapter 5. Interactive Data Exploration

* Psygrammer (싸이그래머) / DeepCity & CloudΨ - DS on GCP [1]
* 김무성

# Contents
* Preparation
* Exploratory Data Analysis
* Loading Flights Data into BigQuery
    - Advantages of a Serverless Columnar Database
    - Staging on Cloud Storage
    - Access Control
    - Federated Queries
    - Ingesting CSV Files
    - Reading a query explanation
* Exploratory Data Analysis in Cloud Datalab
    - Jupyter Notebooks
    - Cloud Datalab
    - Installing Packages in Cloud Datalab
    - Jupyter Magic for Google Cloud Platform
        - Exploring arrival delays
* Quality Control
    - Oddball Values
    - Outlier Removal: Big Data Is Different
    - Filtering Data on Occurrence Frequency
* Arrival Delay Conditioned on Departure Delay
    - Applying Probabilistic Decision Threshold
    - Empirical Probability Distribution Function
    - The Answer Is...
* Evaluating the Model
    - Random Shuffling
    - Splitting by Date
    - Training and Testing
* Summary

----------------------------

#### Reference
* Getting started with Google Cloud Training Material - 2018 - https://www.slideshare.net/jkbaseer/getting-started-with-google-cloud-training-material-2018



---------------------

# Preparation

Go to Google Cloud Shell:
* https://console.cloud.google.com/



--------------------

# Lab 1. Load the CSV files created in Chapter 3 into BigQuery

### BigQuery Basics

1. Setup:
   * https://github.com/GoogleCloudPlatform/data-science-on-gcp/tree/master/05_bqdatalab
   <br><br>
   
2. Open Cloud Shell and change to the `05_bqdatalab` directory.
    
3. Run the script to load data into BigQuery:
	```shell
	bash load_into_bq.sh <BUCKET-NAME>
	```
    <br><br>
    
4. Visit the BigQuery console and query the new table:
    ```sql
    SELECT
      *
    FROM (
      SELECT
        ORIGIN,
        AVG(DEP_DELAY) AS dep_delay,
        AVG(ARR_DELAY) AS arr_delay,
        COUNT(ARR_DELAY) AS num_flights
      FROM
        flights.tzcorr
      GROUP BY
        ORIGIN )
    WHERE
      num_flights > 3650
    ORDER BY
      dep_delay DESC
    ```
    <br><br>
    
5. In BigQuery, run this query and save the results as a table named `trainday`:
    ```sql
    #standardSQL
    SELECT
      FL_DATE,
      IF(MOD(ABS(FARM_FINGERPRINT(CAST(FL_DATE AS STRING))), 100) < 70, 'True', 'False') AS is_train_day
    FROM (
      SELECT
        DISTINCT(FL_DATE) AS FL_DATE
      FROM
        `flights.tzcorr`)
    ORDER BY
      FL_DATE
    ```
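
The `trainday` query in step 5 hashes each flight date and keeps the dates whose hash, modulo 100, falls below 70, so roughly 70% of days are marked as training days and the assignment is repeatable across runs. The following is a minimal Python sketch of that idea, not the BigQuery implementation: SHA-256 stands in for `FARM_FINGERPRINT` (an assumption, so the individual bucket assignments will differ from BigQuery's), and the helper names and the date range are hypothetical.

```python
import hashlib
from datetime import date, timedelta


def is_train_day(fl_date: str, train_percent: int = 70) -> bool:
    """Deterministically decide whether a date string is a training day."""
    # Assumption: SHA-256 replaces FARM_FINGERPRINT for illustration only.
    digest = hashlib.sha256(fl_date.encode("utf-8")).digest()
    # The first 8 bytes are read as an unsigned integer, so no ABS() is
    # needed here; the SQL needs ABS() because FARM_FINGERPRINT is signed.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < train_percent


def date_range(start: date, days: int):
    """Yield `days` consecutive ISO-formatted dates starting at `start`."""
    for offset in range(days):
        yield (start + timedelta(days=offset)).isoformat()


if __name__ == "__main__":
    # Hypothetical date range, chosen only to have something to hash.
    dates = list(date_range(date(2015, 1, 1), 365))
    train = [d for d in dates if is_train_day(d)]
    print(f"{len(train)} of {len(dates)} days are training days")
```

Because the decision depends only on the date string, every flight on a given day lands on the same side of the split, and rerunning the code reproduces the same split without storing a random seed.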

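The query in step 4 can be read as three stages: group flights by `ORIGIN`, compute per-airport averages and a count, then keep only airports with more than 3650 counted flights (which works out to 10 per day over 365 days) and sort by average departure delay. One detail worth noting is that `COUNT(ARR_DELAY)` and `AVG(...)` skip NULL values, so flights without a recorded arrival delay do not contribute to `num_flights`. The sketch below reproduces that logic in plain Python on hypothetical toy rows; the airport codes, values, and the `summarize` helper are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (ORIGIN, DEP_DELAY, ARR_DELAY). None plays the role of NULL.
rows = [
    ("AAA", 10.0, 5.0),
    ("AAA", 20.0, None),  # no recorded arrival delay
    ("AAA", 0.0, -5.0),
    ("BBB", 30.0, 40.0),
]


def avg(values):
    """Average that skips None, mirroring how SQL AVG skips NULL."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


def summarize(rows, min_flights):
    """Group by origin, aggregate, filter on count, sort by dep_delay DESC."""
    by_origin = defaultdict(list)
    for origin, dep, arr in rows:
        by_origin[origin].append((dep, arr))

    result = []
    for origin, pairs in by_origin.items():
        arr_values = [arr for _, arr in pairs]
        # COUNT(ARR_DELAY) counts only non-NULL arrival delays.
        num_flights = sum(1 for a in arr_values if a is not None)
        if num_flights > min_flights:  # the outer WHERE clause
            result.append({
                "ORIGIN": origin,
                "dep_delay": avg([dep for dep, _ in pairs]),
                "arr_delay": avg(arr_values),
                "num_flights": num_flights,
            })

    # ORDER BY dep_delay DESC, with missing averages placed last.
    result.sort(
        key=lambda r: float("-inf") if r["dep_delay"] is None else r["dep_delay"],
        reverse=True,
    )
    return result


if __name__ == "__main__":
    for row in summarize(rows, min_flights=0):
        print(row)
```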
-----------------------------------

# Exploratory Data Analysis

# Loading Flights Data into BigQuery
* Advantages of a Serverless Columnar Database
* Staging on Cloud Storage
* Access Control
* Federated Queries
* Ingesting CSV Files
* Reading a query explanation

## Advantages of a Serverless Columnar Database

## Staging on Cloud Storage

<img src="./figures/cap01.png" width=600 />

## Access Control

<img src="./figures/cap02.png" width=600 />

<img src="./figures/cap03.png" width=600 />

<img src="./figures/cap04.png" width=600 />
<img src="./figures/cap05.png" width=600 />

<img src="./figures/cap06.png" width=600 />

## Federated Queries

<img src="./figures/cap07.png" width=600 />

## Ingesting CSV Files

<img src="./figures/cap08.png" width=600 />

## Reading a query explanation

<img src="./figures/cap09.png" width=600 />

<img src="./figures/cap10.png" width=600 />

# Exploratory Data Analysis in Cloud Datalab
* Jupyter Notebooks
* Cloud Datalab
* Installing Packages in Cloud Datalab
* Jupyter Magic for Google Cloud Platform
* Exploring arrival delays

---------------------

# Lab 2. Create a Datalab instance

1. In Cloud Shell, create a Datalab instance (change the zone to match where your bucket is located):

```shell
datalab create --zone us-central1-a dsongcp
```

2. Once you get the message that the instance is reachable on localhost:8081, navigate to the web page using the Web Preview button on the top-right of Cloud Shell.
<br><br>

3. Within Datalab, start a new notebook. Then copy and paste the cells from `exploration.ipynb` and click Run to execute the code.

-------------------------------

<img src="./figures/cap11.png" width=600 />

## Jupyter Notebooks

## Cloud Datalab

<img src="./figures/cap12.png" width=600 />

<img src="./figures/cap13.png" width=600 />

<img src="./figures/cap14.png" width=600 />

## Installing Packages in Cloud Datalab

## Jupyter Magic for Google Cloud Platform

<img src="./figures/cap15.png" width=600 />

<img src="./figures/cap16.png" width=600 />

## Exploring arrival delays

<img src="./figures/cap17.png" width=600 />

<img src="./figures/cap18.png" width=600 />

<img src="./figures/cap19.png" width=600 />

<img src="./figures/cap20.png" width=600 />

# Quality Control
* Oddball Values
* Outlier Removal: Big Data Is Different
* Filtering Data on Occurrence Frequency

## Oddball Values

<img src="./figures/cap21.png" width=600 />

<img src="./figures/cap22.png" width=600 />

## Outlier Removal: Big Data Is Different

## Filtering Data on Occurrence Frequency

<img src="./figures/cap23.png" width=600 />

# Arrival Delay Conditioned on Departure Delay
* Applying Probabilistic Decision Threshold
* Empirical Probability Distribution Function
* The Answer Is...

<img src="./figures/cap24.png" width=600 />

<img src="./figures/cap25.png" width=600 />

## Applying Probabilistic Decision Threshold

<img src="./figures/cap26.png" width=600 />

## Empirical Probability Distribution Function

<img src="./figures/cap27.png" width=600 />

<img src="./figures/cap28.png" width=600 />

## The Answer Is...

# Evaluating the Model
* Random Shuffling
* Splitting by Date
* Training and Testing

## Random Shuffling

## Splitting by Date

<img src="./figures/cap29.png" width=600 />

<img src="./figures/cap30.png" width=600 />

## Training and Testing

<img src="./figures/cap31.png" width=600 />

<img src="./figures/cap32.png" width=600 />

<img src="./figures/cap33.png" width=600 />

<img src="./figures/cap34.png" width=600 />

# Summary

# References
* [1] Data Science on the Google Cloud Platform: Implementing End-to-End Real-Time Data Pipelines: From Ingest to Machine Learning - https://www.amazon.com/Data-Science-Google-Cloud-Platform/dp/1491974567
* [2] Book github - https://github.com/GoogleCloudPlatform/data-science-on-gcp