# Data Ingestion
## Contents
1. [Loading Data into Google Cloud Storage](#LoadingDataGCS)
2. [Loading Data into BigQuery](#LoadingData)<br>
    2.1 [From local directory](#localdirectory)<br>
    2.2 [From Google Cloud Storage](#loadGCS)<br>
    2.3 [Triggering loading when new files are added to Google Cloud Storage](#loadCloudFn)<br>
    2.4 [From Pandas DataFrame](#LoadDF)
3. [Viewing BigQuery table schema](#ViewSchema)<br> 
4. [Reading from BigQuery](#ReadingData)<br>

# Getting Started
First, we want to ensure that the `Data.zip` file (and its unzipped contents) is available to this notebook.<br/>
The necessary files should already be uploaded to the `data` directory. If they are not, follow the steps below.
### Step #1: Upload data into your Jupyter notebook
Manually upload the `Data.zip` file using the upload button. This is necessary because Jupyter does not have access to your local file system.   
<img src="img/upload_button.png" title="Upload Button"/>   
### Step #2: Unzip the file  
Running the following Bash command will unzip the `Data.zip` file and add the contents to the `data/` directory.

In [None]:
%%bash
unzip Data.zip -d data

<a id='LoadingDataGCS'></a>
# 1. Loading Data into Google Cloud Storage (GCS)

### Step #1: Create Google Cloud Storage bucket (Optional)
A Google Cloud Storage bucket is...   
<br>You only need to create a new GCS bucket if you're not uploading to an existing bucket.

In [None]:
%%bash
gsutil mb gs://email-propensity-sandbox-data

### Step #2: Upload files to GCS bucket

In [70]:
%%bash
gsutil -m cp data/Data/* gs://email-propensity-sandbox-data/

Copying file://data/Data/events.csv [Content-Type=text/csv]...
Copying file://data/Data/campaign_types.csv [Content-Type=text/csv]...
Copying file://data/Data/Opens.csv [Content-Type=text/csv]...
Copying file://data/Data/Sends.csv [Content-Type=text/csv]...
Copying file://data/Data/User Info.csv [Content-Type=text/csv]...
==> NOTE: You are uploading one or more large file(s), which would run
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil help crcmod"). This is because
without a compiled crcmod, computing checksums on composite objects is
so slow that gsutil disables downloads of composite objects.
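
The same upload can also be scripted in Python. The sketch below is one possible approach and not part of the original notebook: `upload_directory` is a hypothetical helper, and it assumes a bucket object that exposes `blob(name)` and `upload_from_filename(path)`, which is the interface of a bucket from the `google-cloud-storage` package.

```python
from pathlib import Path


def upload_directory(bucket, local_dir):
    """Upload every regular file in local_dir to the bucket root.

    `bucket` is assumed to behave like google.cloud.storage.Bucket:
    bucket.blob(name) returns an object with upload_from_filename(path).
    Returns the list of object names that were uploaded.
    """
    uploaded = []
    for path in sorted(Path(local_dir).iterdir()):
        if not path.is_file():
            continue  # skip sub-directories, like `gsutil cp` without -r
        blob = bucket.blob(path.name)  # object name = file name
        blob.upload_from_filename(str(path))
        uploaded.append(path.name)
    return uploaded
```

With `google-cloud-storage` installed, `storage.Client().bucket('email-propensity-sandbox-data')` returns a suitable bucket object, so `upload_directory(bucket, 'data/Data')` mirrors the `gsutil cp` command above, without the parallelism of `-m`.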



<a id='LoadingData'></a>
# 2. Loading Data into BigQuery
First, we need to load the sample dataset into BigQuery.   
   
The same commands can be used from this notebook and from your laptop's command line. The only difference is that, when loading from your laptop's command line, you have to authenticate (tell GCP who you are) and authorize (obtain sufficient permissions for) your request. When using Cloud Shell or AI Platform Notebooks, your requests are already authenticated and authorized (since your notebook is already running on the cloud!).   
<br/>
Let's start by identifying the GCP project_id and (desired) BigQuery dataset id. 

In [None]:
project_id = 'email-propensity-sandbox'
dataset_id = 'test_upload'

<a id='localdirectory'></a>
## Option #1: Loading from a local directory using the command line
### Step #1: Load data into BigQuery using the bq command-line tool 
If you want to create your own BigQuery dataset, run the following Bash command to create a new dataset with name `project_id:dataset_id`. You'll get an error message if this dataset already exists.   

The bq command-line tool provides a convenient point of entry to interact with the BigQuery service on Google Cloud Platform, although everything you do with bq can be done using the REST API and most things can also be accomplished using the GCP web console. Here, we are asking it to make (mk) a dataset.
  
Datasets in BigQuery function like top-level folders that are used to organize and control access to tables, views, and machine learning models. The dataset is created in the current project.


In [None]:
%%bash -s "$dataset_id"
bq --location=US mk --dataset $1

Run the following Bash commands to load your local CSV files into BigQuery.   
   
In this case, we are asking bq to load each file into its own table, telling the tool that the source format is CSV and that we would like the tool to auto-detect the schema (i.e., the data types of individual columns). The `--replace` flag overwrites the table if it already exists.<br/>
<br/>
The bq tool uses the following syntax to load local files into BigQuery:
```
bq load --autodetect --source_format=CSV [dataset_id].[table_name] [path of local CSV]
```

In [62]:
%%bash -s "$dataset_id"
bq load --autodetect --replace --source_format=CSV $1.events data/Data/events.csv
bq load --autodetect --replace --source_format=CSV $1.opens data/Data/Opens.csv
bq load --autodetect --replace --source_format=CSV $1.sends data/Data/Sends.csv
bq load --autodetect --replace --source_format=CSV $1.user_info "data/Data/User Info.csv"
bq load --autodetect --replace --source_format=CSV $1.campaign_types data/Data/campaign_types.csv

Waiting on bqjob_r64a13fb39209c49e_0000016c5e2319ff_1 ... (1s) Current status: DONE   
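
When there are many files, the `bq load` commands can be generated rather than typed by hand. The following is a minimal sketch, not part of the original notebook: `table_name` and `bq_load_command` are hypothetical helpers, and the naming rule (lowercase the file name, replace spaces with underscores) is an assumption that happens to match the table names used above.

```python
import shlex
from pathlib import Path


def table_name(csv_path):
    """Derive a table name from a file name: 'User Info.csv' -> 'user_info'."""
    return Path(csv_path).stem.lower().replace(" ", "_")


def bq_load_command(dataset_id, csv_path):
    """Build one `bq load` command using the same flags as the cell above."""
    args = [
        "bq", "load", "--autodetect", "--replace", "--source_format=CSV",
        f"{dataset_id}.{table_name(csv_path)}",
        str(csv_path),
    ]
    # shlex.quote protects paths that contain spaces, e.g. "User Info.csv"
    return " ".join(shlex.quote(a) for a in args)


for name in ["events.csv", "Opens.csv", "User Info.csv"]:
    print(bq_load_command("test_upload", f"data/Data/{name}"))
```

The printed lines can be pasted into a `%%bash` cell. Only the path containing a space is wrapped in quotes.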

<a id='loadGCS'></a>
## Option #2: Loading from Google Cloud Storage
As in Option #1, create the BigQuery dataset first if it does not already exist. The command below creates a new dataset with name `project_id:dataset_id` in the current project; you'll get an error message if this dataset already exists.   


In [None]:
%%bash -s "$dataset_id"
bq --location=US mk --dataset $1

The bq tool command uses the following syntax to load GCS files into BigQuery:
```
bq load --autodetect --source_format=CSV [dataset_id].[table_name] [path of CSV in GCS]
```

In [None]:
%%bash -s "$dataset_id"
bq load --autodetect --replace --source_format=CSV $1.events gs://email-propensity-sandbox-data/events.csv
bq load --autodetect --replace --source_format=CSV $1.opens  gs://email-propensity-sandbox-data/Opens.csv
bq load --autodetect --replace --source_format=CSV $1.sends  gs://email-propensity-sandbox-data/Sends.csv
bq load --autodetect --replace --source_format=CSV $1.user_info  "gs://email-propensity-sandbox-data/User Info.csv"
bq load --autodetect --replace --source_format=CSV $1.campaign_types  gs://email-propensity-sandbox-data/campaign_types.csv

<a id='loadCloudFn'></a>
## Option #3: Triggering loading when new files are added to Google Cloud Storage
With Cloud Functions you write simple, single-purpose functions that are attached to events emitted from your cloud infrastructure and services. Your Cloud Function is triggered when an event being watched is fired. Your code executes in a fully managed environment. There is no need to provision any infrastructure or worry about managing any servers.   
  
This example demonstrates how to trigger a BigQuery load job when a new file is added to a Google Cloud Storage bucket, as shown in the diagram below.  
<br/>
<img src="img/gcs_to_bq.png" title="GCS to BigQuery" style="width: 500px;"/>   

In the [Cloud Functions UI](http://console.cloud.google.com/functions), select "Create Function".  
<br>Set the following values:
* Trigger: "Cloud Storage"
* Event Type: "Finalize/Create"
* Bucket: browse and select bucket where new files will be added
* Runtime: Python 3.7<br/>

<br/>
<b>Paste the below code into the `main.py` inline editor.</b>
<br/>

```python
def import_bigquery(event, context):
    """Import CSV to BigQuery after file is added to GCS.

    Args:
         event (dict): Event payload.
         context (google.cloud.functions.Context): Metadata for the event.
    """
    from google.cloud import bigquery
    
    file = event
    print(f"Processing file: {file['name']}.")
    table_id = file['name'].replace('.csv','') # set BQ table name as the filename
    bucket = file['bucket']
    uri = 'gs://{}/{}'.format(bucket, file['name'])
    
    client = bigquery.Client()
    dataset = client.dataset('test_upload') # replace with your dataset id
    job_config = bigquery.LoadJobConfig()
    job_config.autodetect = True # auto-detect BigQuery schema
    job_config.skip_leading_rows = 1 # skip header row
    job_config.source_format = bigquery.SourceFormat.CSV
    load_job = client.load_table_from_uri(
        uri,
        dataset.table(table_id),
        job_config=job_config
    )
    load_job.result() # wait for the load to finish so failures show up in the function logs
```
<br/>
<b>Paste the below code into the `requirements.txt` inline editor.</b>

```
google-cloud-bigquery
google-cloud-storage
```
<b>Set "Function to execute" as "import_bigquery"</b>
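
The function above runs for every object finalized in the bucket and uses the file name minus `.csv` as the table name, so a file such as `User Info.csv` would produce a table ID containing a space, and a non-CSV upload would still start a load job. A minimal sketch of one way to guard against both is shown below; `table_id_for_object` is a hypothetical helper, and the normalization rule (keep letters, digits, and underscores, then lowercase) is an assumption rather than a BigQuery requirement.

```python
import re
from pathlib import PurePosixPath


def table_id_for_object(object_name):
    """Return a table ID for a CSV object name, or None for other files.

    Hypothetical helper: strips any folder prefix and the extension, then
    replaces every character outside [A-Za-z0-9_] with an underscore.
    """
    path = PurePosixPath(object_name)
    if path.suffix.lower() != ".csv":
        return None  # ignore non-CSV uploads instead of starting a load job
    return re.sub(r"[^A-Za-z0-9_]", "_", path.stem).lower()


print(table_id_for_object("User Info.csv"))          # user_info
print(table_id_for_object("2019/Sends.csv"))         # sends
print(table_id_for_object("img/upload_button.png"))  # None
```

Inside `import_bigquery`, the result could replace the `table_id = ...` line, with an early `return` when it is `None`.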

<a id='LoadDF'></a>
## Option #4: Load Pandas DataFrame to a BigQuery table 
You can load a Pandas DataFrame into BigQuery.  
<br/>
This is useful because it enables the following workflow while preprocessing data:
1. Query raw data and load results into a Pandas DataFrame.
2. Manipulate data using Pandas.  
3. Upload DataFrame back to BigQuery.  
<br/>

In [74]:
from google.cloud import bigquery
import pandas as pd

In [80]:
data = [
  (1, u'What is BigQuery?'),
  (2, u'Query essentials'),
]
df = pd.DataFrame(data, columns=['chapter', 'title'])

In [83]:
client = bigquery.Client()
dataset_ref = client.dataset('test_upload') # set to name of dataset
table_ref = dataset_ref.table('pandas_table') # set to name of destination table

# use load_config to overwrite old table contents
load_config = bigquery.job.LoadJobConfig(
    create_disposition=bigquery.job.CreateDisposition.CREATE_IF_NEEDED,
    write_disposition=bigquery.job.WriteDisposition.WRITE_TRUNCATE)

job = client.load_table_from_dataframe(
    dataframe=df, # set to name of DataFrame
    destination=table_ref,
    job_config=load_config)
job.result() # wait for the load job to complete

<a id='ViewSchema'></a>
# 3. Viewing BigQuery table schema
Before querying a table, it is helpful to first know the table's schema.
## Option #1: BigQuery UI
Schemas are displayed in the [BigQuery UI](http://console.cloud.google.com/bigquery) under a table's "Schema" tab.   

## Option #2: Command-Line
Use the bq command-line tool to print a JSON representation of a table's schema. "name" corresponds to the column name, "type" to that column's data type, and "mode" to whether the column accepts NULL values.  
An example is shown below for the `campaign_types` table.

In [65]:
%%bash -s "$dataset_id"
bq show --format prettyjson --schema $1.campaign_types

[
  {
    "mode": "NULLABLE", 
    "name": "launch_id", 
    "type": "DATE"
  }, 
  {
    "mode": "NULLABLE", 
    "name": "camptype", 
    "type": "STRING"
  }
]
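
Because the schema is plain JSON, it is easy to process programmatically. The sketch below, which is an illustration rather than part of the original notebook, parses the output shown above with the standard `json` module; `column_types` is a hypothetical helper name.

```python
import json

# Output of `bq show --format prettyjson --schema`, copied from above.
schema_json = """
[
  {"mode": "NULLABLE", "name": "launch_id", "type": "DATE"},
  {"mode": "NULLABLE", "name": "camptype", "type": "STRING"}
]
"""


def column_types(schema_text):
    """Map each column name to its BigQuery data type."""
    return {field["name"]: field["type"] for field in json.loads(schema_text)}


print(column_types(schema_json))
# {'launch_id': 'DATE', 'camptype': 'STRING'}
```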


<a id='ReadingData'></a>
# 4. Reading from BigQuery
## Option #1: BigQuery Magic
More information about BigQuery Magic can be found [here](https://cloud.google.com/bigquery/docs/visualize-jupyter).

In [None]:
%%bash
pip3 install --upgrade 'google-cloud-bigquery[pandas]'

In [5]:
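from google.cloud import bigquery
import pandas as pd
import numpy as np

# make the %%bigquery cell magic available (harmless if it is already loaded)
%load_ext google.cloud.bigquery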

In [6]:
%%bigquery output_open_rate
SELECT sum(opened)/count(*) as aggregate_open_rate
FROM `email-propensity-sandbox.emails.sends`

The query result is saved to a Pandas DataFrame, `output_open_rate`.   
You can now try out data manipulations and visualizations using this dataframe.

In [12]:
print("Aggregate Open Rate: {:.2%}".format(output_open_rate['aggregate_open_rate'][0]))

Aggregate Open Rate: 18.63%
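
To make the arithmetic behind the query concrete, the sketch below repeats the same calculation in plain Python. The `opened` values are made up for illustration and are not taken from the `sends` table.

```python
# Hypothetical `opened` flags for five sends (1 = opened, 0 = not opened).
opened = [1, 0, 0, 1, 0]

# Same arithmetic as sum(opened)/count(*) in the query.
aggregate_open_rate = sum(opened) / len(opened)

print("Aggregate Open Rate: {:.2%}".format(aggregate_open_rate))
# Aggregate Open Rate: 40.00%
```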


## Option #2: BigQuery Python Client API

In [14]:
client = bigquery.Client()
sql = """SELECT sum(opened)/count(*) as aggregate_open_rate
    FROM `email-propensity-sandbox.emails.sends`
    """
query_job = client.query(sql) # API request
result = query_job.to_dataframe()

`client.query(sql)` submits the query, and the last line waits for it to finish and saves the results to a Pandas DataFrame, `result`.   
So, any data manipulations or visualizations that you did with the results from BigQuery Magic can also be applied to these results.

In [15]:
print("Aggregate Open Rate: {:.2%}".format(result['aggregate_open_rate'][0]))

Aggregate Open Rate: 18.63%
