# 01 - Data Analysis and Preparation

This notebook covers the following tasks:

1. Perform Exploratory Data Analysis and Visualization.
2. Prepare the data for the ML task in BigQuery.
3. Create a managed dataset.
4. Produce and fix the raw data schema.

## Dataset

The [Chicago Taxi Trips](https://pantheon.corp.google.com/marketplace/details/city-of-chicago-public-data/chicago-taxi-trips) dataset is one of the [public datasets hosted with BigQuery](https://cloud.google.com/bigquery/public-data/). It includes taxi trips from 2013 to the present, reported to the City of Chicago in its role as a regulatory agency. The `taxi_trips` table is 70.72 GB in size and contains more than 195 million records. The dataset includes information about the trips, such as pickup and dropoff datetime and location, passenger count, miles travelled, and trip toll.

The ML task is to predict whether a given trip will result in a tip of at least 20% of the fare.
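
As a minimal sketch of the label definition, the function below mirrors the `IF((tips/fare >= 0.2), 1, 0)` expression used later in the preparation query. The function name `tip_bin` and the `threshold` parameter are illustrative choices, and the guard on `fare` reflects the query's `fare > 0` filter rather than anything the label itself requires.

```python
def tip_bin(tips: float, fare: float, threshold: float = 0.2) -> int:
    """Return 1 when the tip is at least `threshold` of the fare, else 0."""
    if fare <= 0:
        # The preparation query only keeps trips with fare > 0.
        raise ValueError("fare must be positive")
    return 1 if tips / fare >= threshold else 0


print(tip_bin(tips=2.0, fare=10.0))  # tip is exactly 20% of the fare
print(tip_bin(tips=1.0, fare=10.0))  # tip is 10% of the fare
```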

## Setup

In [None]:
import os
import pandas as pd
import tensorflow as tf
import tensorflow_data_validation as tfdv
from google.cloud import bigquery
import matplotlib.pyplot as plt
from google.cloud.aiplatform import gapic as aip

In [None]:
PROJECT = 'ksalama-cloudml'
REGION = 'us-central1'
BQ_DATASET_NAME = 'playground_us'
BQ_TABLE_NAME = 'chicago_taxitrips_prep'

API_ENDPOINT = f"{REGION}-aiplatform.googleapis.com"
PARENT = f"projects/{PROJECT}/locations/{REGION}"
DATASET_DISPLAYNAME = 'chicago_taxi_tips'
BQ_URI = f"bq://{PROJECT}.{BQ_DATASET_NAME}.{BQ_TABLE_NAME}"

client_options = {"api_endpoint": API_ENDPOINT}

In [None]:
LOCAL_WORKSPACE = '_workspace'
LOCAL_DATA_DIR = os.path.join(LOCAL_WORKSPACE, 'csv_data')
RAW_SCHEMA_DIR = 'model_src/raw_schema'
REMOVE_WORKSPACE = True

if tf.io.gfile.exists(LOCAL_WORKSPACE) and REMOVE_WORKSPACE:
    print("Removing previous local workspace...")
    tf.io.gfile.rmtree(LOCAL_WORKSPACE)

print("Creating new local workspace...")
tf.io.gfile.mkdir(LOCAL_WORKSPACE)
print("Creating data directory...")
tf.io.gfile.mkdir(LOCAL_DATA_DIR)
    
tf.io.gfile.mkdir(RAW_SCHEMA_DIR)

In [None]:
!bq --location=US mk -d \
$PROJECT:$BQ_DATASET_NAME

## 1. Explore the Data in BigQuery

In [None]:
%%bigquery data

SELECT 
    CAST(EXTRACT(DAYOFWEEK FROM trip_start_timestamp) AS string) AS trip_dayofweek, 
    FORMAT_DATE('%A',cast(trip_start_timestamp as date)) AS trip_dayname,
    COUNT(*) as trip_count,
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
    EXTRACT(YEAR FROM trip_start_timestamp) = 2015 
GROUP BY
    trip_dayofweek,
    trip_dayname
ORDER BY
    trip_dayofweek
;

In [None]:
data

In [None]:
data.plot(kind='bar', x='trip_dayname', y='trip_count')

## 2. Create data for the ML task

We add a `data_split` column, where 80% of the records are set to `UNASSIGNED` and the other 20% are set to `TEST`.
This column will be used by both AutoML Tables and the custom model to split the data for learning and testing.
In the learning phase, each model will split the `UNASSIGNED` records into `train` and `eval`. The `TEST` split is the same for
both models, for a fair comparison in the testing phase.
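
The notebook does not say how each model divides the `UNASSIGNED` records into `train` and `eval`. One common approach, shown here only as a sketch, is a deterministic hash-based split, so that the same record always lands in the same subset. The function name `assign_split`, the choice of MD5, the 20% eval share, and the use of a timestamp string as the row key are all assumptions made for illustration.

```python
import hashlib


def assign_split(row_key: str, eval_percent: int = 20) -> str:
    """Deterministically map a row key to 'train' or 'eval'."""
    digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return "eval" if bucket < eval_percent else "train"


# Hypothetical row keys; any stable identifier of a record would do.
keys = [f"2020-01-01 00:{minute:02d}:00" for minute in range(60)]
splits = [assign_split(key) for key in keys]
print(splits.count("train"), splits.count("eval"))
```

Because the assignment depends only on the key, rerunning the split gives the same result, which is harder to guarantee with a random draw such as `RAND()`.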

In [None]:
sample_size = 1000000
year = 2020

In [None]:
sql_script = '''
CREATE OR REPLACE TABLE `@PROJECT.@DATASET.@TABLE` 
AS (
    WITH
      taxitrips AS (
      SELECT
        trip_start_timestamp,
        trip_seconds,
        trip_miles,
        payment_type,
        pickup_longitude,
        pickup_latitude,
        dropoff_longitude,
        dropoff_latitude,
        tips,
        fare
      FROM
        `bigquery-public-data.chicago_taxi_trips.taxi_trips`
      WHERE 1=1 
      AND pickup_longitude IS NOT NULL
      AND pickup_latitude IS NOT NULL
      AND dropoff_longitude IS NOT NULL
      AND dropoff_latitude IS NOT NULL
      AND trip_miles > 0
      AND trip_seconds > 0
      AND fare > 0
      AND EXTRACT(YEAR FROM trip_start_timestamp) = @YEAR
    )

    SELECT
      trip_start_timestamp,
      EXTRACT(MONTH from trip_start_timestamp) as trip_month,
      EXTRACT(DAY from trip_start_timestamp) as trip_day,
      EXTRACT(DAYOFWEEK from trip_start_timestamp) as trip_day_of_week,
      EXTRACT(HOUR from trip_start_timestamp) as trip_hour,
      trip_seconds,
      trip_miles,
      payment_type,
      ST_AsText(
          ST_SnapToGrid(ST_GeogPoint(pickup_longitude, pickup_latitude), 0.1)
      ) AS pickup_grid,
      ST_AsText(
          ST_SnapToGrid(ST_GeogPoint(dropoff_longitude, dropoff_latitude), 0.1)
      ) AS dropoff_grid,
      ST_Distance(
          ST_GeogPoint(pickup_longitude, pickup_latitude), 
          ST_GeogPoint(dropoff_longitude, dropoff_latitude)
      ) AS euclidean,
      CONCAT(
          ST_AsText(ST_SnapToGrid(ST_GeogPoint(pickup_longitude,
              pickup_latitude), 0.1)), 
          ST_AsText(ST_SnapToGrid(ST_GeogPoint(dropoff_longitude,
              dropoff_latitude), 0.1))
      ) AS loc_cross,
      IF((tips/fare >= 0.2), 1, 0) AS tip_bin,
      IF(RAND() < 0.8, 'UNASSIGNED', 'TEST') AS data_split
    FROM
      taxitrips
    LIMIT @LIMIT
)
'''
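
To make the geographic features in the query above more concrete, here is a rough Python sketch of what they compute. `ST_Distance` returns the shortest distance in meters between the two points on the Earth's surface, so the `euclidean` column is a geodesic distance despite its name; the haversine formula below approximates it, assuming a spherical Earth with a mean radius of 6,371 km. The `snap` helper assumes that `ST_SnapToGrid` rounds each coordinate to the nearest multiple of the grid size, and the exact number formatting of `ST_AsText` in BigQuery may differ from the string built here. All function names are illustrative.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius in meters


def haversine_m(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Approximate great-circle distance in meters between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def snap(value: float, grid_size: float = 0.1) -> float:
    """Round a coordinate to the nearest multiple of grid_size (assumed behavior)."""
    return round(round(value / grid_size) * grid_size, 6)


def grid_cell(lon: float, lat: float, grid_size: float = 0.1) -> str:
    """WKT-style text for the snapped point, in longitude-latitude order."""
    return f"POINT({snap(lon, grid_size)} {snap(lat, grid_size)})"


pickup = (-87.6298, 41.8781)   # illustrative coordinates in Chicago
dropoff = (-87.9073, 41.9742)  # illustrative coordinates in Chicago
print(grid_cell(*pickup), grid_cell(*dropoff))
print(round(haversine_m(*pickup, *dropoff)), "meters")
```

A grid size of 0.1 degrees is coarse (roughly 11 km of latitude per cell), so `pickup_grid`, `dropoff_grid`, and `loc_cross` behave as low-cardinality categorical features.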

In [None]:
sql_script = sql_script.replace(
    '@PROJECT', PROJECT).replace(
    '@DATASET', BQ_DATASET_NAME).replace(
    '@TABLE', BQ_TABLE_NAME).replace(
    '@YEAR', str(year)).replace(
    '@LIMIT', str(sample_size))
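
Chained `str.replace` calls work for this script, but they silently leave a placeholder in place if one is forgotten, and they can misfire if one placeholder name is a prefix of another. As an alternative sketch, the hypothetical helper `render_sql` below substitutes every `@NAME` token in a single pass and fails loudly when a value is missing. It assumes placeholders consist of uppercase letters and underscores only, and that `@` does not otherwise appear in the script.

```python
import re


def render_sql(template: str, params: dict) -> str:
    """Replace each @NAME placeholder with str(params['NAME'])."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in params:
            raise KeyError(f"No value supplied for placeholder @{name}")
        return str(params[name])

    return re.sub(r"@([A-Z_]+)", substitute, template)


# Illustrative values; the notebook uses PROJECT, BQ_DATASET_NAME, and so on.
demo_template = "SELECT * FROM `@PROJECT.@DATASET.@TABLE` LIMIT @LIMIT"
demo_params = {"PROJECT": "my-project", "DATASET": "my_dataset", "TABLE": "my_table", "LIMIT": 10}
rendered = render_sql(demo_template, demo_params)
print(rendered)
```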

In [None]:
print(sql_script)

In [None]:
bq_client = bigquery.Client()
job = bq_client.query(sql_script)
job.result()

In [None]:
%%bigquery

SELECT data_split, COUNT(*)
FROM ksalama-cloudml.playground_us.chicago_taxitrips_prep
GROUP BY data_split

### Save data locally as CSV

In [None]:
%%bigquery sample_data

SELECT * EXCEPT (data_split)
FROM ksalama-cloudml.playground_us.chicago_taxitrips_prep
#LIMIT 10000

In [None]:
sample_data.head().T

In [None]:
sample_data.tip_bin.value_counts()

In [None]:
sample_data.euclidean.hist()

In [None]:
DATA_FILE = os.path.join(LOCAL_DATA_DIR,'sample_data.csv')
sample_data.to_csv(DATA_FILE, index=False, header=False)

In [None]:
!wc -l $DATA_FILE

## 3. Create Managed AI Platform Dataset

In [None]:
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value

### Create managed dataset

In [None]:
dataset_client = aip.DatasetServiceClient(
    client_options=client_options)

In [None]:
metadata_dict = {
    "input_config": {
        "bigquery_source": {"uri": BQ_URI}
    }
}

metadata = json_format.ParseDict(metadata_dict, Value())

dataset_desc = {
    "display_name": DATASET_DISPLAYNAME,
    "metadata_schema_uri": "gs://google-cloud-aiplatform/schema/dataset/metadata/tabular_1.0.0.yaml",
    "metadata": metadata,
}
    
response = dataset_client.create_dataset(
    parent=PARENT, dataset=dataset_desc)
response.result()

### List datasets

In [None]:
datasets = dataset_client.list_datasets(parent=PARENT)
dataset_uri = None
for dataset in datasets:
    if dataset.display_name == DATASET_DISPLAYNAME:
        dataset_uri = dataset.name
        break
        
print("Dataset uri:", dataset_uri)

In [None]:
dataset = dataset_client.get_dataset(name=dataset_uri)
dataset

In [None]:
dataset.metadata['inputConfig']['bigquerySource']['uri']
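
Rerunning the `create_dataset` cell may register another dataset with the same display name, in which case the lookup loop above returns whichever match it encounters first. A small helper like the hypothetical `find_dataset_name` below can wrap that loop so the notebook can check for an existing dataset before creating a new one. It relies only on the `list_datasets(parent=...)` call and the `display_name` and `name` attributes already used above. The `_StubClient` class is a stand-in so the sketch runs without credentials; it is not part of any library.

```python
from types import SimpleNamespace


def find_dataset_name(client, parent: str, display_name: str):
    """Return the resource name of the first dataset with this display name, or None."""
    for dataset in client.list_datasets(parent=parent):
        if dataset.display_name == display_name:
            return dataset.name
    return None


class _StubClient:
    """Hypothetical stand-in for the dataset client, for illustration only."""

    def list_datasets(self, parent: str):
        return [
            SimpleNamespace(display_name="other", name=f"{parent}/datasets/1"),
            SimpleNamespace(display_name="chicago_taxi_tips", name=f"{parent}/datasets/2"),
        ]


parent = "projects/my-project/locations/us-central1"  # illustrative value
found = find_dataset_name(_StubClient(), parent, "chicago_taxi_tips")
missing = find_dataset_name(_StubClient(), parent, "no_such_dataset")
print(found, missing)
```

In the notebook, the call would take the form `find_dataset_name(dataset_client, PARENT, DATASET_DISPLAYNAME)`, and the dataset would be created only when the result is `None`.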

## 4. Generate Raw Data Schema

The raw data schema will be used in:
1. Defining the input columns for the AutoML Tables model.
2. Identifying the raw data types and shapes in the data transformation.
3. Creating the serving input signature for the custom model.
4. Validating the new raw training data in the TFX pipeline.
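
As a rough illustration of the validation use in item 4, the sketch below checks a new CSV file against a saved schema with TFDV. The function name `validate_against_schema` and its parameters are assumptions, and a TFX pipeline would normally do this through its own components rather than a helper like this. The import sits inside the function only so that the helper can be defined in an environment where TFDV is not installed.

```python
def validate_against_schema(csv_path: str, schema_path: str, column_names: list) -> dict:
    """Return a mapping of feature name to anomaly description (empty if none)."""
    # Imported here so the helper can be defined without TFDV installed.
    import tensorflow_data_validation as tfdv

    # Load the schema previously written with tfdv.write_schema_text.
    schema = tfdv.load_schema_text(schema_path)
    # Compute statistics over the new data; the CSV is assumed to have no header row.
    stats = tfdv.generate_statistics_from_csv(
        data_location=csv_path, column_names=column_names
    )
    # Compare the new statistics with the expectations recorded in the schema.
    anomalies = tfdv.validate_statistics(statistics=stats, schema=schema)
    return {
        feature: info.description
        for feature, info in anomalies.anomaly_info.items()
    }
```

An empty result suggests that the new data conforms to the schema; any entries name the offending feature and describe the mismatch.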

In [None]:
stats = tfdv.generate_statistics_from_csv(
    data_location=DATA_FILE, 
    column_names=list(sample_data.columns), # The CSV file was written without a header row, so column names are supplied here
)

In [None]:
tfdv.visualize_statistics(stats)

In [None]:
schema = tfdv.infer_schema(statistics=stats)
tfdv.display_schema(schema=schema)

In [None]:
raw_schema_location = os.path.join(RAW_SCHEMA_DIR, 'schema.pbtxt')
tfdv.write_schema_text(schema, raw_schema_location)