# GCP Resources Setup

This notebook configures everything needed to run the pipeline.

## Install libraries

Two requirements files are provided: one listing only the packages installed directly (*requirements.txt*) and one capturing the full environment in which the pipeline was run (*full_requirements.txt*).

In [1]:
!cat requirements.txt

google-cloud-pipeline-components==0.2.0
kfp==1.8.9
scikit-learn==1.0.0
google-cloud-bigquery
google-cloud-bigquery-storage
google-cloud-aiplatform
pandas
numpy

In [None]:
#!pip install -r requirements.txt

In [1]:
import matplotlib.pyplot as plt
import pandas as pd

import kfp

from kfp.v2 import compiler, dsl
from kfp.v2.dsl import pipeline, component, Artifact, Dataset, Input, Metrics, Model, Output, InputPath, OutputPath, ClassificationMetrics
from typing import NamedTuple

from google.cloud import aiplatform

# We'll use this namespace for metadata querying
from google.cloud import aiplatform_v1

from google.cloud.aiplatform import pipeline_jobs
from google_cloud_pipeline_components import aiplatform as gcc_aip
from google.cloud import bigquery

import os
PROJECT_ID = ""

# Get your Google Cloud project ID from gcloud
if not os.getenv("IS_TESTING"):
    shell_output=!gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID: ", PROJECT_ID)
    
    
PATH=%env PATH
%env PATH={PATH}:/home/jupyter/.local/bin
REGION="us-central1"

from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")


Project ID:  teco-prod-adam-dev-826c
env: PATH=/usr/local/cuda/bin:/opt/conda/bin:/opt/conda/condabin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/home/jupyter/.local/bin


## BigQuery - Database (Raw stage)

We need two BigQuery datasets: one holding historical data, and one simulating current data (which will have no target variable). We create them with the `bq` command-line tool, passing in some Python variables.

**Important**: This code needs to run only once. It sets up the datasets used throughout the pipeline.

In [4]:
BQ_DATASET_HISTORIC_NAME = 'chicago_taxi_historic_test'
BQ_DATASET_CURRENT_NAME = 'chicago_taxi_current_test'

BQ_HISTORIC_RAW = 'raw'
BQ_CURRENT_RAW = 'raw'

BQ_LOCATION = 'US'

In [3]:
# !bq --location=US mk -d \
# $PROJECT_ID:$BQ_DATASET_HISTORIC_NAME

Dataset 'vertex-testing-327520:chicago_taxi_historic_test' successfully created.


In [4]:
# !bq --location=US mk -d \
# $PROJECT_ID:$BQ_DATASET_CURRENT_NAME

Dataset 'vertex-testing-327520:chicago_taxi_current_test' successfully created.


We define two functions that compute the year and month to query, so the extraction can be parametrized by time period and sample size.

In [5]:
import datetime as dt

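def get_year_and_month():
    # Step back 33 days from the first day of the current month. This always
    # lands in the month two before the current one.
    reference = dt.date.today().replace(day=1) - dt.timedelta(days=33)
    # Take year and month from the same date so year rollovers are handled
    # correctly (e.g. running in January must return the previous year).
    return reference.year, reference.month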

In [6]:
def get_year_and_month_hist(current_year, current_month):
    month_hist = current_month - 1
    if month_hist == 0:
        year_hist = current_year -1
        month_hist = 12
    else:
        year_hist = current_year
    
    return year_hist, month_hist
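The month arithmetic is easier to verify with fixed dates than with the system clock. The sketch below restates both helpers with an explicit `today` argument; the function names and that argument are assumptions made for illustration and are not part of the notebook.

```python
import datetime as dt


def target_year_and_month(today: dt.date):
    """Year and month reached by going 33 days back from the 1st of `today`'s month."""
    shifted = today.replace(day=1) - dt.timedelta(days=33)
    # Reading both fields from the same date keeps year rollovers consistent.
    return shifted.year, shifted.month


def previous_year_and_month(year: int, month: int):
    """Year and month immediately before the given one."""
    if month == 1:
        return year - 1, 12
    return year, month - 1


print(target_year_and_month(dt.date(2022, 3, 15)))  # (2022, 1)
print(target_year_and_month(dt.date(2022, 1, 10)))  # (2021, 11)
print(previous_year_and_month(2022, 1))             # (2021, 12)
```

The first call reproduces the values printed by the cells in this section (2022, month 1, with 2021, month 12 as the historic period).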

In [7]:
import datetime as dt
SAMPLE_SIZE = 100000
YEAR, MONTH = get_year_and_month()

print('current year: ', YEAR)
print('current month: ', MONTH)
print('sample size: ', SAMPLE_SIZE)

current year:  2022
current month:  1
sample size:  100000


In [8]:
HIST_YEAR, HIST_MONTH = get_year_and_month_hist(YEAR, MONTH)
print('past year: ', HIST_YEAR)
print('past month: ', HIST_MONTH)

past year:  2021
past month:  12


### Current dataset

We use a SQL query to populate the table.

In [12]:
current_sql_script = '''
CREATE OR REPLACE TABLE `@PROJECT_ID.@DATASET.@TABLE` 
AS (
    WITH
      taxitrips AS (
      SELECT
        trip_start_timestamp,
        trip_seconds,
        trip_miles,
        payment_type,
        pickup_longitude,
        pickup_latitude,
        dropoff_longitude,
        dropoff_latitude,
        tips,
        fare
      FROM
        `bigquery-public-data.chicago_taxi_trips.taxi_trips`
      WHERE 1=1 
      AND pickup_longitude IS NOT NULL
      AND pickup_latitude IS NOT NULL
      AND dropoff_longitude IS NOT NULL
      AND dropoff_latitude IS NOT NULL
      AND trip_miles > 0
      AND trip_seconds > 0
      AND fare > 0
      AND EXTRACT(YEAR FROM trip_start_timestamp) = @YEAR
      AND EXTRACT(MONTH FROM trip_start_timestamp) = @MONTH
    )

    SELECT
      trip_start_timestamp,
      EXTRACT(MONTH from trip_start_timestamp) as trip_month,
      EXTRACT(DAY from trip_start_timestamp) as trip_day,
      EXTRACT(DAYOFWEEK from trip_start_timestamp) as trip_day_of_week,
      EXTRACT(HOUR from trip_start_timestamp) as trip_hour,
      trip_seconds,
      trip_miles,
      payment_type,
      ST_AsText(
          ST_SnapToGrid(ST_GeogPoint(pickup_longitude, pickup_latitude), 0.1)
      ) AS pickup_grid,
      ST_AsText(
          ST_SnapToGrid(ST_GeogPoint(dropoff_longitude, dropoff_latitude), 0.1)
      ) AS dropoff_grid,
      ST_Distance(
          ST_GeogPoint(pickup_longitude, pickup_latitude), 
          ST_GeogPoint(dropoff_longitude, dropoff_latitude)
      ) AS euclidean,
      CONCAT(
          ST_AsText(ST_SnapToGrid(ST_GeogPoint(pickup_longitude,
              pickup_latitude), 0.1)), 
          ST_AsText(ST_SnapToGrid(ST_GeogPoint(dropoff_longitude,
              dropoff_latitude), 0.1))
      ) AS loc_cross,
      IF((tips/fare >= 0.2), 1, 0) AS tip_bin,
      IF(ABS(MOD(FARM_FINGERPRINT(STRING(trip_start_timestamp)), 10)) < 9, 'UNASSIGNED', 'TEST') AS data_split
    FROM
      taxitrips
    LIMIT @LIMIT
)
'''
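The `data_split` column in the query assigns rows deterministically by hashing the trip timestamp: hash buckets 0 to 8 become `UNASSIGNED` and bucket 9 becomes `TEST`, so roughly one tenth of the distinct timestamps are held out. A minimal Python sketch of the same idea follows. It uses SHA-256 from `hashlib` as a stand-in, because `FARM_FINGERPRINT` is not available in the standard library, so a given timestamp will not land in the same bucket as in BigQuery. The function name and its parameters are assumptions.

```python
import hashlib


def assign_split(key: str, buckets: int = 10, test_buckets: int = 1) -> str:
    """Map a key to 'TEST' or 'UNASSIGNED' based on a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer and reduce it to a bucket.
    bucket = int.from_bytes(digest[:8], "big") % buckets
    # The last `test_buckets` buckets are held out, mirroring `< 9` in the query.
    return "TEST" if bucket >= buckets - test_buckets else "UNASSIGNED"


# The same key always receives the same label, across runs and machines.
label = assign_split("2022-01-15 08:30:00+00")
print(label == assign_split("2022-01-15 08:30:00+00"))  # True
```

Because the label depends only on the key, rebuilding the table does not move rows between splits.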

In [10]:
current_sql_script = current_sql_script.replace(
    '@PROJECT_ID', PROJECT_ID).replace(
    '@DATASET', BQ_DATASET_CURRENT_NAME).replace(
    '@TABLE', BQ_CURRENT_RAW).replace(
    '@YEAR', str(YEAR)).replace(
    '@LIMIT', str(SAMPLE_SIZE)).replace(
    '@MONTH', str(MONTH))
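A chain of `.replace()` calls silently leaves a placeholder in the query if one substitution is forgotten. As a hedged alternative, the sketch below fills every `@NAME` placeholder from a dictionary and raises an error when a value is missing. The helper `render_sql` and the example values are hypothetical. This is plain text substitution, so it is only appropriate for trusted values such as the notebook's own constants.

```python
import re

# Placeholders are written as @ followed by upper-case letters and underscores.
PLACEHOLDER = re.compile(r"@([A-Z_]+)")


def render_sql(template: str, params: dict) -> str:
    """Replace each @NAME in `template` with str(params['NAME'])."""

    def substitute(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"No value supplied for placeholder @{name}")
        return str(params[name])

    return PLACEHOLDER.sub(substitute, template)


example = "SELECT * FROM `@PROJECT_ID.@DATASET.@TABLE` WHERE y = @YEAR LIMIT @LIMIT"
print(render_sql(example, {
    "PROJECT_ID": "my-project",  # hypothetical project ID
    "DATASET": "chicago_taxi_current_test",
    "TABLE": "raw",
    "YEAR": 2022,
    "LIMIT": 100000,
}))
# SELECT * FROM `my-project.chicago_taxi_current_test.raw` WHERE y = 2022 LIMIT 100000
```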

In [11]:
bq_client = bigquery.Client(project=PROJECT_ID, location=BQ_LOCATION)
job = bq_client.query(current_sql_script)
_ = job.result()

### Historic dataset

In [12]:
historic_sql_script = '''
CREATE OR REPLACE TABLE `@PROJECT_ID.@DATASET.@TABLE` 
AS (
    WITH
      taxitrips AS (
      SELECT
        trip_start_timestamp,
        trip_seconds,
        trip_miles,
        payment_type,
        pickup_longitude,
        pickup_latitude,
        dropoff_longitude,
        dropoff_latitude,
        tips,
        fare
      FROM
        `bigquery-public-data.chicago_taxi_trips.taxi_trips`
      WHERE 1=1 
      AND pickup_longitude IS NOT NULL
      AND pickup_latitude IS NOT NULL
      AND dropoff_longitude IS NOT NULL
      AND dropoff_latitude IS NOT NULL
      AND trip_miles > 0
      AND trip_seconds > 0
      AND fare > 0
      AND EXTRACT(YEAR FROM trip_start_timestamp) = @YEAR
      AND EXTRACT(MONTH FROM trip_start_timestamp) = @MONTH
    )

    SELECT
      trip_start_timestamp,
      EXTRACT(MONTH from trip_start_timestamp) as trip_month,
      EXTRACT(DAY from trip_start_timestamp) as trip_day,
      EXTRACT(DAYOFWEEK from trip_start_timestamp) as trip_day_of_week,
      EXTRACT(HOUR from trip_start_timestamp) as trip_hour,
      trip_seconds,
      trip_miles,
      payment_type,
      ST_AsText(
          ST_SnapToGrid(ST_GeogPoint(pickup_longitude, pickup_latitude), 0.1)
      ) AS pickup_grid,
      ST_AsText(
          ST_SnapToGrid(ST_GeogPoint(dropoff_longitude, dropoff_latitude), 0.1)
      ) AS dropoff_grid,
      ST_Distance(
          ST_GeogPoint(pickup_longitude, pickup_latitude), 
          ST_GeogPoint(dropoff_longitude, dropoff_latitude)
      ) AS euclidean,
      CONCAT(
          ST_AsText(ST_SnapToGrid(ST_GeogPoint(pickup_longitude,
              pickup_latitude), 0.1)), 
          ST_AsText(ST_SnapToGrid(ST_GeogPoint(dropoff_longitude,
              dropoff_latitude), 0.1))
      ) AS loc_cross,
      IF((tips/fare >= 0.2), 1, 0) AS tip_bin,
      IF(ABS(MOD(FARM_FINGERPRINT(STRING(trip_start_timestamp)), 10)) < 9, 'UNASSIGNED', 'TEST') AS data_split
    FROM
      taxitrips
    LIMIT @LIMIT
)
'''

In [13]:
historic_sql_script = historic_sql_script.replace(
    '@PROJECT_ID', PROJECT_ID).replace(
    '@DATASET', BQ_DATASET_HISTORIC_NAME).replace(
    '@TABLE', BQ_HISTORIC_RAW).replace(
    '@YEAR', str(HIST_YEAR)).replace(
    '@LIMIT', str(SAMPLE_SIZE)).replace(
    '@MONTH', str(HIST_MONTH))

In [14]:
bq_client = bigquery.Client(project=PROJECT_ID, location=BQ_LOCATION)
job = bq_client.query(historic_sql_script)
_ = job.result()

## Cloud Storage - Artifacts

We'll create GCS buckets to store the objects the pipeline produces as it runs. The *stage* bucket holds the more relevant files, such as backups of the train, validation, and test splits, so they can be reviewed quickly and easily. The *pipelines* bucket mostly holds execution outputs and logs.

In [13]:
STAGE_DATA_BUCKET = f'{PROJECT_ID}-chicago_taxi_stage'
PIPELINE_DATA_BUCKET = f'{PROJECT_ID}-chicago_taxi_pipelines'

print('Stage bucket: ', STAGE_DATA_BUCKET)
print('Pipeline bucket: ', PIPELINE_DATA_BUCKET)


Stage bucket:  teco-prod-adam-dev-826c-chicago_taxi_stage
Pipeline bucket:  teco-prod-adam-dev-826c-chicago_taxi_pipelines


In [3]:
from google.cloud import storage


def create_bucket_class_location(bucket_name, location):
    storage_client = storage.Client()

    bucket = storage_client.bucket(bucket_name)
    new_bucket = storage_client.create_bucket(bucket, location=location)

    print(
        "Created bucket {} in {} ".format(
            new_bucket.name, new_bucket.location
        )
    )
    return new_bucket
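Calling `create_bucket` for a name that already exists fails, so the cell above can only be run once per bucket. A minimal sketch of a re-runnable variant follows. The helper name `get_or_create_bucket` is an assumption, and the client is passed in as an argument so the function can be exercised without credentials. It relies on `lookup_bucket`, which returns `None` when the bucket does not exist.

```python
def get_or_create_bucket(client, bucket_name, location):
    """Return the named bucket, creating it in `location` only if it is missing.

    `client` is expected to be a google.cloud.storage.Client instance.
    """
    existing = client.lookup_bucket(bucket_name)  # None when the bucket is absent
    if existing is not None:
        return existing
    return client.create_bucket(bucket_name, location=location)


# Intended usage in this notebook (requires credentials, so left commented out):
# from google.cloud import storage
# bucket_stage = get_or_create_bucket(storage.Client(), STAGE_DATA_BUCKET, REGION)
```

If the bucket already exists in a different location, this sketch returns it unchanged rather than raising an error.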

In [4]:
bucket_stage = create_bucket_class_location(STAGE_DATA_BUCKET, REGION)

Created bucket vertex-testing-327520-chicago_taxi_stage in US-CENTRAL1 


In [5]:
bucket_pipeline = create_bucket_class_location(PIPELINE_DATA_BUCKET, REGION)

Created bucket vertex-testing-327520-chicago_taxi_pipelines in US-CENTRAL1 


## Container Registry - Hyperparameter tuning jobs

The most convenient way to run a hyperparameter tuning job is with Docker images. Two algorithm configurations are provided: Random Forest and logistic regression.

#### General structure

In [16]:
!tree hp_lr

hp_lr
├── Dockerfile
└── trainer
    └── task.py

1 directory, 2 files


#### Dockerfile

In [17]:
cat hp_lr/Dockerfile

from gcr.io/deeplearning-platform-release/sklearn-cpu

WORKDIR /

# Installs hypertune library
RUN pip install cloudml-hypertune sklearn scipy google-cloud-bigquery joblib pandas google-cloud-storage

# Copies the trainer code to the docker image.
COPY trainer /trainer

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "-m", "trainer.task"]

#### Model

In [14]:
cat hp_lr/trainer/task.py

from sklearn.metrics import roc_curve, confusion_matrix, accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from google.cloud import bigquery#
from google.cloud import storage
from joblib import dump

import os
import pandas as pd

#from xgboost import XGBClassifier
#from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

import argparse
import hypertune
from sklearn.model_selection import train_test_split as tts



STAGE_DATA_BUCKET = 'your_bucket'
TRAIN_DATA_PATH = 'data/chicago_taxi_train.csv'
LOCAL_DATA_PATH = 'chicago_taxi_train.csv'

cols = ['trip_month', 'trip_day', 'trip_day_of_week',
       'trip_hour', 'trip_seconds', 'trip_miles', 'euclidean', 'target',
       'payment_type_Credit_Card', 'payment_type_Dispute', 'payment_type_Mobile',
       'payment_type_No_Charge', 'payment_type_Prcard', 'payment_type_Unknown']

def get_args():
    '''Parses args. Must include all hyperparameters you want to tune.''

- The hyperparameters to tune are passed as command-line arguments (parsed with `argparse`).
- The metric (`f1_score`) and the goal (maximize) are selected.
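The listing of `task.py` above is cut off before the body of `get_args`, so the following is only a sketch of how tunable hyperparameters are typically exposed through `argparse`. The flags `--C` and `--max_iter` are hypothetical; they were picked because they are constructor parameters of scikit-learn's `LogisticRegression`, and the actual script may tune different ones.

```python
import argparse


def get_args(argv=None):
    """Parse the hyperparameters that the tuning service passes to each trial."""
    parser = argparse.ArgumentParser(description="Logistic regression trial")
    # Hypothetical flags: one per hyperparameter the tuning job should vary.
    parser.add_argument("--C", type=float, default=1.0,
                        help="Inverse regularization strength")
    parser.add_argument("--max_iter", type=int, default=100,
                        help="Maximum number of solver iterations")
    # argv=None makes argparse read sys.argv, which is what the container uses.
    return parser.parse_args(argv)


args = get_args(["--C", "0.5", "--max_iter", "200"])
print(args.C, args.max_iter)  # 0.5 200
```

Each trial then trains a model with these values, computes the F1 score, and reports it through the `hypertune` package imported at the top of `task.py`.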

In [15]:
!sed -i 's/your_bucket/{STAGE_DATA_BUCKET}/' hp_lr/trainer/task.py

In [16]:
!sed -i 's/your_bucket/{STAGE_DATA_BUCKET}/' hp_rf/trainer/task.py

#### Image: build and push

In [17]:
RANDOM_FOREST_IMAGE = f'gcr.io/{PROJECT_ID}/rf_hp_job:v1'
RANDOM_FOREST_IMAGE

'gcr.io/teco-prod-adam-dev-826c/rf_hp_job:v1'

In [18]:
LOG_REG_IMAGE = f'gcr.io/{PROJECT_ID}/lr_hp_job:v1'
LOG_REG_IMAGE

'gcr.io/teco-prod-adam-dev-826c/lr_hp_job:v1'

In [19]:
# !docker build ./hp_rf -t $RANDOM_FOREST_IMAGE

Sending build context to Docker daemon  11.78kB
Step 1/5 : from gcr.io/deeplearning-platform-release/sklearn-cpu
 ---> 2574879dfd34
Step 2/5 : WORKDIR /
 ---> Using cache
 ---> 3dbd66708df1
Step 3/5 : RUN pip install cloudml-hypertune sklearn scipy google-cloud-bigquery joblib pandas google-cloud-storage
 ---> Using cache
 ---> 0040d68f198b
Step 4/5 : COPY trainer /trainer
 ---> 5f12dfadda9e
Step 5/5 : ENTRYPOINT ["python", "-m", "trainer.task"]
 ---> Running in 457ce0cad30d
Removing intermediate container 457ce0cad30d
 ---> ed1dea4c596c
Successfully built ed1dea4c596c
Successfully tagged gcr.io/teco-prod-adam-dev-826c/rf_hp_job:v1


In [20]:
# !docker build ./hp_lr -t $LOG_REG_IMAGE

Sending build context to Docker daemon  11.78kB
Step 1/5 : from gcr.io/deeplearning-platform-release/sklearn-cpu
 ---> 2574879dfd34
Step 2/5 : WORKDIR /
 ---> Using cache
 ---> 3dbd66708df1
Step 3/5 : RUN pip install cloudml-hypertune sklearn scipy google-cloud-bigquery joblib pandas google-cloud-storage
 ---> Using cache
 ---> 0040d68f198b
Step 4/5 : COPY trainer /trainer
 ---> 78bbf259a494
Step 5/5 : ENTRYPOINT ["python", "-m", "trainer.task"]
 ---> Running in f75c53f39ec8
Removing intermediate container f75c53f39ec8
 ---> 47e042de7f52
Successfully built 47e042de7f52
Successfully tagged gcr.io/teco-prod-adam-dev-826c/lr_hp_job:v1


In [21]:
# !docker push $LOG_REG_IMAGE

The push refers to repository [gcr.io/teco-prod-adam-dev-826c/lr_hp_job]

8b9ad592: Preparing
c56a3432: Preparing
bfb2e242: Preparing
b0d69ede: Preparing
c3dd1b30: Preparing
f55e5b0f: Preparing
b8042115: Preparing
e7ceaeea: Preparing
659ee3aa: Preparing
153ced2f: Preparing
b90e8bce: Preparing
d686dc1d: Preparing
c56dcfc0: Preparing
dd09476c: Preparing
384be1ed: Preparing
0864cc76: Preparing
24e4876e: Preparing
bf18a086: Preparing
282950fe: Preparing
0b19050d: Preparing
b453bec5: Preparing
8b9ad592: Pushed
v1: digest: sha256:8c44e9838cd84363ded24fb75c08d46d654127c0e69c93ac175e7d74a87a243b size: 4916


In [22]:
# !docker push $RANDOM_FOREST_IMAGE

The push refers to repository [gcr.io/teco-prod-adam-dev-826c/rf_hp_job]

9f902063: Preparing
c56a3432: Preparing
bfb2e242: Preparing
b0d69ede: Preparing
c3dd1b30: Preparing
f55e5b0f: Preparing
b8042115: Preparing
e7ceaeea: Preparing
659ee3aa: Preparing
153ced2f: Preparing
b90e8bce: Preparing
d686dc1d: Preparing
c56dcfc0: Preparing
dd09476c: Preparing
384be1ed: Preparing
0864cc76: Preparing
24e4876e: Preparing
bf18a086: Preparing
282950fe: Preparing
0b19050d: Preparing
b453bec5: Preparing
9f902063: Pushed
v1: digest: sha256:c45a8a6c16f117abf1776d2ecadb12e8f9f4494dba0cfccae0ed23771ae0a70b size: 4916
