<h1 style="text-align:center"> INFO 323: Cloud Computing and Big Data</h1>
<h2 style="text-align:center"> College of Computing and Informatics</h2>
<h2 style="text-align:center">Drexel University</h2>

<h3 style="text-align:center"> Exploratory Data Analysis using BigQuery on Flights Data</h3>
<h3 style="text-align:center"> Yuan An, PhD</h3>
<h3 style="text-align:center">Associate Professor</h3>

Reference: Data Science on the Google Cloud Platform. Valliappa Lakshmanan. O'Reilly, 2nd ed., April 2022

Ch5. Interactive Data Exploration

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

from google.cloud import bigquery

In [2]:
bq = bigquery.Client()

## Data Ingest:
### Go to the Storage section of the GCP web console and create a new bucket
### Open CloudShell and git clone this repo: `git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp`
### Then, run:
- `cd data-science-on-gcp/02_ingest`
- `./ingest_from_crsbucket.sh bucketname`
- `./bqload.sh (csv-bucket-name) YEAR`
- `cd ../03_sqlstudio`
- `./create_views.sh`
- `cd ../04_streaming`
- `./ingest_from_crsbucket.sh`

After the above steps, 26 JSON files should appear in the folder `flights/tzcorr/` in the bucket, and a table `flights_tzcorr` should appear in BigQuery under the dataset `dsongcp`. Use these data to conduct the exploratory data analysis.

## Vertex AI Workbench 
GCP Vertex AI Workbench provides a hosted version of JupyterLab on Google Cloud. It knows how to authenticate against Google Cloud, which gives easy access to Cloud Storage, BigQuery, Cloud Dataflow, Vertex AI Training, and so on.

## Jupyter magics 
Jupyter magics provide a mechanism to run code in a wide variety of languages, and the mechanism is extensible so that more can be added. The BigQuery Python package adds a few magics that make interaction with Google Cloud Platform convenient.

For example, you can run a query on your BigQuery table using the `%%bigquery` magic environment that comes with Vertex AI Workbench:

```
%%bigquery
SELECT
  COUNTIF(arr_delay >= 15)/COUNT(arr_delay) AS frac_delayed
FROM dsongcp.flights_tzcorr
```

## BigQuery from Python
If there is a piece of code that you’d ultimately want to run outside a notebook (perhaps as part of a scheduled script), it’s better to use the underlying Python and not the magic pragma:

```
sql = """
SELECT
  COUNTIF(arr_delay >= 15)/COUNT(arr_delay) AS frac_delayed
FROM dsongcp.flights_tzcorr
"""
from google.cloud import bigquery
bq = bigquery.Client()
df = bq.query(sql).to_dataframe()
print(df)
```

## Exploring Arrival Delays
Now that we have a notebook up and running, let’s use it to do exploratory analysis of arrival delays because this is the variable we want to be able to predict.

## Basic Statistics
Find the arrival delay for flights that depart more than 10 minutes late. Get the basic statistics using the Pandas DataFrame `describe()` method.
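A minimal sketch of this step, assuming the column names `arr_delay` and `dep_delay` used in the queries above. The function name `describe_late_departures` is hypothetical, and `client` is expected to be a `bigquery.Client` such as the `bq` object created earlier.

```python
# Arrival and departure delays of flights that left more than 10 minutes late.
LATE_DEPARTURES_SQL = """
SELECT arr_delay, dep_delay
FROM dsongcp.flights_tzcorr
WHERE dep_delay > 10
"""


def describe_late_departures(client):
    """Run the query and return pandas summary statistics.

    `client` is assumed to be a google.cloud.bigquery.Client (e.g. `bq`).
    """
    df = client.query(LATE_DEPARTURES_SQL).to_dataframe()
    # describe() reports count, mean, std, min, quartiles, and max per column.
    return df.describe()
```

In the notebook this would be called as `describe_late_departures(bq)`. The query pulls every matching row into memory, so a sampling condition may be needed if the result is too large.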

## Plot Distribution 
Plot a violin plot of our decision surface, i.e., of the arrival delay for flights that depart more than 10 minutes late.
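One possible way to draw the plot is sketched below. It assumes a DataFrame with an `arr_delay` column that already holds only the late-departing flights. The function name is hypothetical, and matplotlib's own `violinplot` is used to keep the sketch small; seaborn's `sns.violinplot` would serve equally well.

```python
import matplotlib.pyplot as plt


def plot_arrival_delay_violin(df, column="arr_delay"):
    """Draw a single violin of the arrival-delay distribution and return the Axes."""
    # Missing delays would break the kernel density estimate, so drop them first.
    values = df[column].dropna().to_numpy()
    fig, ax = plt.subplots(figsize=(4, 6))
    ax.violinplot(values, showmedians=True)
    ax.set_xticks([])  # only one violin, so the x axis carries no information
    ax.set_ylabel("Arrival delay (minutes)")
    ax.set_title("Flights departing more than 10 minutes late")
    return ax
```

The width of the violin at a given height shows how common that arrival delay is, and the horizontal bar inside it marks the median.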

## Probabilistic Analysis

### Query 1: 
Find average departure and arrival delays at each airport in the US, retain only airports where there were at least 3650 flights, and sort them by departure delay in descending order.
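The query could be sketched as follows. The column name `origin` for the departure airport is an assumption about the table schema, and the helper name `airport_delays_sql` is hypothetical.

```python
def airport_delays_sql(min_flights=3650):
    """Build the per-airport average-delay query.

    `origin` is an assumed name for the departure-airport column.
    """
    return f"""
SELECT
  origin,
  AVG(dep_delay) AS avg_dep_delay,
  AVG(arr_delay) AS avg_arr_delay,
  COUNT(arr_delay) AS num_flights
FROM dsongcp.flights_tzcorr
GROUP BY origin
HAVING num_flights >= {int(min_flights)}
ORDER BY avg_dep_delay DESC
"""
```

With the client from earlier, `bq.query(airport_delays_sql()).to_dataframe()` would return one row per airport. The `HAVING` clause filters on the aggregated count, which a `WHERE` clause cannot do.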

### Query 2: 
Assume a flight is delayed if its arrival delay was more than 15 minutes. Find the percentage of the delayed flights.
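A sketch of the percentage query, using the strict `> 15` from the wording here. `COUNT(arr_delay)` counts only rows where `arr_delay` is not NULL, so flights without a recorded arrival delay are left out of the denominator. The helper `pct_delayed` is a hypothetical local equivalent, included only to make that NULL handling explicit.

```python
PCT_DELAYED_SQL = """
SELECT
  ROUND(100 * COUNTIF(arr_delay > 15) / COUNT(arr_delay), 2) AS pct_delayed
FROM dsongcp.flights_tzcorr
"""


def pct_delayed(arr_delays, threshold=15):
    """Local equivalent of the query for a list of delays (None = missing)."""
    # Mirror COUNT(arr_delay): ignore missing values entirely.
    known = [d for d in arr_delays if d is not None]
    # Mirror COUNTIF(arr_delay > 15): count strictly-greater values.
    delayed = sum(1 for d in known if d > threshold)
    return round(100 * delayed / len(known), 2)
```

For example, `pct_delayed([0, 20, None, 16, 15])` ignores the missing value and counts two of the remaining four flights as delayed.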

### Query 3: 
Create a DataFrame containing the ARR_DELAY and DEP_DELAY of 1% of the flights that departed at least 15 minutes later than the scheduled time. Show the summary statistics of the DataFrame.
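Random sampling in BigQuery can be sketched with `RAND()`. The helper below is hypothetical, and it assumes that "15 minutes later than scheduled" translates to `dep_delay >= 15`.

```python
def sample_delays_sql(fraction, min_dep_delay=None):
    """Build a query returning a random sample of arr_delay and dep_delay.

    RAND() < fraction keeps each row with probability `fraction`, so the
    sample size is approximate and changes from run to run.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    conditions = [f"RAND() < {fraction}"]
    if min_dep_delay is not None:
        # Restrict to late departures before sampling.
        conditions.insert(0, f"dep_delay >= {min_dep_delay}")
    where = " AND ".join(conditions)
    return f"""
SELECT arr_delay, dep_delay
FROM dsongcp.flights_tzcorr
WHERE {where}
"""
```

A 1% sample of late departures would then be `bq.query(sample_delays_sql(0.01, min_dep_delay=15)).to_dataframe()`, followed by `describe()`. Calling `sample_delays_sql(0.001)` without a delay filter gives a 0.1% sample of all flights.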

### Visualization: 
Plot the distribution of the ARR_DELAY of the flights extracted in the DataFrame of Query 3.

### Query 4: 
Create a DataFrame containing the ARR_DELAY and DEP_DELAY of 0.1% of all flights. Show the summary statistics of the DataFrame.

### Adding Label Column: 
Add a new column 'ontime' to the DataFrame extracted in Query 4 using the following conditions. Show the summary statistics including the 'ontime' column.
-  'ontime' = True if the DEP_DELAY is less than 15 minutes. 
-  Otherwise, 'ontime' = False.
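The labeling rule above can be sketched in pandas as follows. The column names `dep_delay` and `arr_delay` are assumptions (adjust them if your DataFrame uses uppercase names), the function name is hypothetical, and the toy values are made up for illustration.

```python
import pandas as pd


def add_ontime(df, threshold=15):
    """Return a copy of df with a boolean 'ontime' column."""
    out = df.copy()
    # A missing dep_delay compares as False, so such rows are labeled not on time.
    out["ontime"] = out["dep_delay"] < threshold
    return out


# Hypothetical values, for illustration only.
toy = pd.DataFrame({
    "dep_delay": [0.0, 14.0, 15.0, 40.0],
    "arr_delay": [-3.0, 10.0, 12.0, 35.0],
})
labeled = add_ontime(toy)
# include="all" is needed because describe() skips boolean columns by default.
print(labeled.describe(include="all"))
```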

### Visualization: 
Plot and compare the distributions of ARR_DELAY for flights with 'ontime' = True and flights with 'ontime' = False using the DataFrame above.
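A side-by-side comparison could look like the sketch below. It assumes a DataFrame with a numeric `arr_delay` column and a boolean `ontime` column. The function name is hypothetical, and matplotlib is used here, although seaborn would work as well.

```python
import matplotlib.pyplot as plt


def plot_delay_by_ontime(df):
    """Side-by-side violins of arr_delay for late and on-time departures."""
    clean = df.dropna(subset=["arr_delay"])
    late = clean.loc[~clean["ontime"], "arr_delay"].to_numpy()
    ontime = clean.loc[clean["ontime"], "arr_delay"].to_numpy()
    fig, ax = plt.subplots()
    # One violin per group; the medians make the shift between groups easy to see.
    ax.violinplot([late, ontime], positions=[1, 2], showmedians=True)
    ax.set_xticks([1, 2])
    ax.set_xticklabels(["ontime = False", "ontime = True"])
    ax.set_ylabel("Arrival delay (minutes)")
    return ax
```

Plotting both groups on the same vertical axis makes it possible to compare the center and the spread of the two distributions directly.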

### Query 5: 
Create a DataFrame containing the average ARR_DELAY and total number of flights for each DEP_DELAY. Order the results by DEP_DELAY in ascending order. Show the head of the DataFrame.

### Query 6: 
Repeat Query 5, but remove the records with fewer than 365 flights in total and add a column for the standard deviation of ARR_DELAY. Show the head of the DataFrame.
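A sketch of the grouped query follows. The helper name and the output column names `arrival_delay`, `stddev_arrival_delay`, and `num_flights` are hypothetical, and flights are counted with `COUNT(arr_delay)`, which skips rows without an arrival delay.

```python
def delay_stats_sql(min_flights=365):
    """Build a query with the mean and standard deviation of arr_delay per dep_delay."""
    return f"""
SELECT
  dep_delay,
  AVG(arr_delay) AS arrival_delay,
  STDDEV(arr_delay) AS stddev_arrival_delay,
  COUNT(arr_delay) AS num_flights
FROM dsongcp.flights_tzcorr
GROUP BY dep_delay
HAVING num_flights >= {int(min_flights)}
ORDER BY dep_delay ASC
"""
```

Running `bq.query(delay_stats_sql()).to_dataframe().head()` would show the first few departure delays. Dropping sparsely populated groups keeps the standard deviation from being dominated by a handful of flights.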

### Visualization: 
Plot the relationship between DEP_DELAY and the mean and standard deviation of ARR_DELAY using the DataFrame extracted in Query 6.
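One way to show the mean and the standard deviation together is an error-bar plot, sketched below. The column names `dep_delay`, `arrival_delay`, and `stddev_arrival_delay` are assumptions about how the Query 6 DataFrame was built, and the function name is hypothetical.

```python
import matplotlib.pyplot as plt


def plot_arrival_vs_departure(stats):
    """Plot mean arrival delay against departure delay with +/- 1 std error bars."""
    fig, ax = plt.subplots()
    ax.errorbar(
        stats["dep_delay"].to_numpy(),
        stats["arrival_delay"].to_numpy(),
        yerr=stats["stddev_arrival_delay"].to_numpy(),
        fmt="o-",
        markersize=3,
        ecolor="lightgray",  # keep the error bars visually secondary to the mean
    )
    ax.set_xlabel("Departure delay (minutes)")
    ax.set_ylabel("Arrival delay (minutes)")
    return ax
```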

### Make Probabilistic Suggestion for Canceling a Meeting at Destination: 
Assume we have an important meeting at the destination. We want to decide whether to postpone or cancel the meeting depending on the predicted arrival delay. Our decision criterion is 15 minutes and 30%. That is, if the plane is more than 30% likely to be delayed (on arrival) by more than 15 minutes, we want to send a text message asking to postpone or cancel the meeting. At what departure delay does this happen? Plot the data to find the DEP_DELAY where there is a 30% chance that the ARR_DELAY is more than 15 minutes using the DataFrame extracted in Query 6.
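One way to read the threshold off the Query 6 statistics is a normal approximation, sketched below. It assumes that arrival delay at each departure delay is roughly normally distributed, which may not hold for skewed data. An empirical percentile, for example from BigQuery's `APPROX_QUANTILES`, avoids that assumption. The function name and the sample rows are hypothetical.

```python
from statistics import NormalDist


def decision_threshold(rows, max_arr_delay=15, max_probability=0.3):
    """Return the smallest dep_delay where P(arr_delay > max_arr_delay) exceeds max_probability.

    rows: iterable of (dep_delay, mean_arr_delay, stddev_arr_delay) tuples.
    Returns None if no row crosses the threshold.
    """
    # A 30% chance of exceeding a value puts that value at the 70th percentile,
    # which lies about 0.52 standard deviations above the mean of a normal.
    z = NormalDist().inv_cdf(1 - max_probability)
    for dep_delay, mean, std in sorted(rows):
        if mean + z * std > max_arr_delay:
            return dep_delay
    return None


# Hypothetical statistics, for illustration only.
example_rows = [(10, 3.0, 10.0), (12, 5.0, 10.0), (14, 7.0, 12.0), (16, 12.0, 12.0)]
print(decision_threshold(example_rows))
```

Plotting `mean + z * std` against the departure delay and marking where the curve crosses 15 minutes gives the same answer graphically.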