<h1><center>Apigee History Data Preprocessing with Dataproc Cluster</center></h1>
<a id="tc"></a>

## Table of Contents
1. [Configuration](#configuration) 
2. [Source Selection](#select)
3. [Augmentation](#augmentation)
4. [Remove Duplicates](#rmduplicates)
5. [Push Metrics To DB](#todb)
6. [Push Notebook to GCS Bucket](#gcs)

<a id="configuration"></a>
## Configuration
[back to Table Of Contents](#tc)

In [13]:
import os 
os.environ["JAVA_HOME"] = '/usr/lib/jvm/jdk1.8.0_221'
os.environ["PATH"] += os.pathsep + os.environ["JAVA_HOME"] + '/bin'

In [14]:
BUCKET = 'ai4ops-main-storage-bucket'
PROJECT = 'kohls-kos-cicd'
CLUSTER = 'ai4ops'
REGION='global'
SCRIPT_PATH = 'poc/spark/ingest'
AI4OPS_HISTORY_PATH=f"gs://{BUCKET}/apigee_history/apigee/metrics/history"
RESOURCES='/opt/dataproc/.resources'
DATA_START_DATE = '2019-05-20T00:00:00Z'
DATA_END_DATE = '2019-06-09T00:00:00Z'
PUSH_TO_DB_START_FROM = '2019-05-20T00:00:00Z'
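
The augmentation step later expands the half-open window from `DATA_START_DATE` to `DATA_END_DATE` into one row per minute per metric, so it can help to check how many time steps the configured window implies before submitting a job. A minimal sketch, assuming the timestamps follow the `%Y-%m-%dT%H:%M:%SZ` format shown in the job's argument help; the helper name `expected_minute_steps` is hypothetical and not part of `job_api`:

```python
from datetime import datetime, timedelta

# Assumed to match ISO_TIME_FORMAT from apigee_ingest_utils (not shown in this notebook).
ISO_TIME_FORMAT = '%Y-%m-%dT%H:%M:%SZ'


def expected_minute_steps(start_date, end_date):
    """Count one-minute steps in the half-open window [start_date, end_date)."""
    start = datetime.strptime(start_date, ISO_TIME_FORMAT)
    end = datetime.strptime(end_date, ISO_TIME_FORMAT)
    if end <= start:
        raise ValueError('end_date must be later than start_date')
    return int((end - start) / timedelta(minutes=1))


steps = expected_minute_steps('2019-05-20T00:00:00Z', '2019-06-09T00:00:00Z')
print(steps)  # 20 days * 1440 minutes = 28800
```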

<a id="select"></a>
## Source Selection
[back to Table Of Contents](#tc)

In [15]:
import os
import json

def get_transition(transition_file):
    with open(transition_file, 'r') as f:
        return json.load(f)

In [16]:
transition = get_transition('api_transition_ingest.json')
print(transition)
INGEST_OUTPUT_JOB_1 = transition.get('INGEST_OUTPUT_JOB_1', '')
INGEST_OUTPUT_JOB_2 = transition.get('INGEST_OUTPUT_JOB_2', '')
INGEST_OUTPUT_JOB_3 = transition.get('INGEST_OUTPUT_JOB_3', '')
INPUT_PATH = f'{INGEST_OUTPUT_JOB_1},{INGEST_OUTPUT_JOB_2},{INGEST_OUTPUT_JOB_3}'

{'INGEST_JOB_1': 'ai4ops_history_ingest_1567422501', 'INGEST_JOB_2': 'ai4ops_history_ingest_1567422590', 'INGEST_JOB_3': 'ai4ops_history_ingest_1567422598', 'INGEST_TIMESTAMP': '1567422808', 'INGEST_BUCKET': 'ai4ops-main-storage-bucket', 'INGEST_OUTPUT_JOB_1': 'gs://ai4ops-main-storage-bucket/apigee_history/apigee/metrics/history/ai4ops_history_ingest_1567422501/chunk*', 'INGEST_OUTPUT_JOB_2': 'gs://ai4ops-main-storage-bucket/apigee_history/apigee/metrics/history/ai4ops_history_ingest_1567422590/chunk*', 'INGEST_OUTPUT_JOB_3': 'gs://ai4ops-main-storage-bucket/apigee_history/apigee/metrics/history/ai4ops_history_ingest_1567422598/chunk*', 'INGEST_STATE_JOB_1': 'RUNNING', 'INGEST_STATE_JOB_2': 'RUNNING', 'INGEST_STATE_JOB_3': 'RUNNING'}


In [17]:
from job_api import *
import importlib
from datetime import datetime
import sys
import pyspark

In [18]:
builder = DataprocJobBuilder()
session = Session(BUCKET, REGION, CLUSTER, PROJECT)

In [19]:
%%py_script  --task --name select_sources.py
import argparse
from pyspark.sql import SparkSession
from ai4ops_db import *
from pyspark.sql.functions import col
from job_api import Task


class SelectTask(Task):
    def run():
        parser = argparse.ArgumentParser()
        parser.add_argument('--input_data_path', type=str, help='comma separated input data paths')
        parser.add_argument('--output_data_path', type=str, help='total base path')

        args, u = parser.parse_known_args()

        spark = SparkSession.builder.getOrCreate()
        sc = spark.sparkContext
        df = (spark.read.format("csv").
              option("header", "false").
              schema(DB.metrics_schema()).
              option('delimiter', ',').
              load(args.input_data_path.split(',')))
        df.printSchema()
        df_normal = df.filter(col('source') == 'apigee-kohls-prod')
        df_empty = df.filter(col('source') == 'apigee-kohls-prod-empty')
        df_error = df.filter(col('source') == 'apigee-kohls-prod-error')
        df_normal.write.format('csv').save(args.output_data_path + '/normal')
        df_empty.write.format('csv').save(args.output_data_path + '/empty')
        df_error.write.format('csv').save(args.output_data_path + '/error')

<job_api.PyScript at 0x7f0dc460fd30>

In [20]:
sel_job_name = "api_ai4ops_select_source_{}".format(int(datetime.now().timestamp()))

SELECTION_OUT = f"{AI4OPS_HISTORY_PATH}/selected/{sel_job_name}"

arguments = {
    "--input_data_path": INPUT_PATH,
    "--output_data_path": SELECTION_OUT,
}

selection_job = builder.task_script('select_sources.py')\
.job_id(sel_job_name)\
.py_file(f'{SCRIPT_PATH}/apigee_ingest_utils.py')\
.py_file(f'{SCRIPT_PATH}/ai4ops_db.py')\
.py_file(f'{SCRIPT_PATH}/yarn_logging.py')\
.arguments(**arguments)\
.build_job()

session = Session(BUCKET, REGION, CLUSTER, PROJECT)

select_executor = DataprocExecutor(selection_job, session)

In [21]:
selection_res = select_executor.submit_job(run_async=True)

Job with id api_ai4ops_select_source_1567427501 was submitted to the cluster ai4ops


In [22]:
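from time import sleep

# Give the job a minute to start, then check that it has not failed.
sleep(60)
state = select_executor.get_job_state()

print('State : {}'.format(state))
if state not in ['DONE', 'RUNNING']:
    raise RuntimeError('Workflow step failed with state {}'.format(state))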


State : RUNNING
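
A single `sleep(60)` followed by one state check lets the workflow continue while the job is still `RUNNING`, so the next step may read incomplete output. A minimal polling sketch follows. `wait_for_job` is a hypothetical helper: it only assumes a zero-argument callable, such as `select_executor.get_job_state`, that returns a state string, and the set of terminal states is an assumption based on the job states Dataproc reports.

```python
import time


def wait_for_job(get_state, poll_seconds=30, timeout_seconds=3600,
                 terminal_states=('DONE', 'ERROR', 'CANCELLED')):
    """Poll get_state() until it returns a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_state()
        if state in terminal_states:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(
                'Job still in state {} after {}s'.format(state, timeout_seconds))
        time.sleep(poll_seconds)


# Stand-in for executor.get_job_state, used only to demonstrate the helper.
_states = iter(['PENDING', 'RUNNING', 'RUNNING', 'DONE'])
final_state = wait_for_job(lambda: next(_states), poll_seconds=0)
print('State : {}'.format(final_state))
```

In the notebook this could be called as `state = wait_for_job(select_executor.get_job_state)` before the transition file is written, so that the recorded state is a final one.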


In [16]:
select_executor.download_output_from_gs()

Downloading output file.


b'19/09/02 08:01:32 INFO org.spark_project.jetty.util.log: Logging initialized @3096ms\n19/09/02 08:01:32 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown\n19/09/02 08:01:32 INFO org.spark_project.jetty.server.Server: Started @3218ms\n19/09/02 08:01:32 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@5edef7ef{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}\n19/09/02 08:01:33 WARN org.apache.spark.scheduler.FairSchedulableBuilder: Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.\n19/09/02 08:01:33 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at ai4ops-m/10.208.107.14:8032\n19/09/02 08:01:33 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at ai4ops-m/10.208.107.14:10200

In [23]:
select_transition = {
    "SELECT_SOURCE_JOB": sel_job_name,
    "SELECT_SOURCE_OUTPUT": SELECTION_OUT,
    "SELECT_SOURCE_STATE": state
}

print(select_transition)

with open('api_transition_select.json', 'w') as f:
    json.dump(select_transition, f)

{'SELECT_SOURCE_JOB': 'api_ai4ops_select_source_1567427501', 'SELECT_SOURCE_OUTPUT': 'gs://ai4ops-main-storage-bucket/apigee_history/apigee/metrics/history/selected/api_ai4ops_select_source_1567427501', 'SELECT_SOURCE_STATE': 'RUNNING'}


<a id="augmentation"></a>
## Augmentation
[back to Table Of Contents](#tc)

In [3]:
%%py_script  --task --name augmentation.py

from pyspark.sql import SparkSession
from ai4ops_db import *
from pyspark.sql.functions import spark_partition_id, pandas_udf
from pyspark.sql.functions import PandasUDFType
from datetime import datetime, timedelta
from apigee_ingest_utils import ApigeeIngest, ISO_TIME_FORMAT
from pytz import timezone
import pandas as pd
import argparse
import time
from job_api import Task
import sys

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.addPyFile('yarn_logging.py')
import yarn_logging
import gc

logger = yarn_logging.YarnLogger()


def prepare_augmented_pd_light(pdf, start, end, time_unit='m', chunk_id=''):
    # All rows of a chunk share one source, so take it from the first row.
    source = pdf['source'].iloc[0]
    names = pdf['metric'].unique().tolist()
    logger.info('Chunk ID: {}\nmetrics: {},\nsource: {}'.format(chunk_id, names, source))
    time_range_size = int((end - start) / ApigeeIngest.delta(1, 1, time_unit)) + 1
    time_range = [
        (start + ApigeeIngest.delta(i, 1, time_unit)).strftime(ISO_TIME_FORMAT) for i in range(time_range_size)
    ]
    time_range = pd.DataFrame(time_range, columns=['time'])
    time_range.loc[:, 'tmp'] = 1
    time_range.loc[:, 'source_new'] = '{}-empty'.format(source)
    gc.collect()
    # One row per distinct metric; 'tmp' is the constant key for the cross join below.
    metrics = pdf[['metric']].drop_duplicates()
    metrics['tmp'] = 1
    augmented = pd.merge(time_range, metrics, on=['tmp'], how='inner')
    augmented.loc[:, 'value_new'] = None
    gc.collect()

    augmented = pd.merge(pdf, augmented, on=['time', 'metric'], how='right')
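    # Assign the result instead of calling fillna(inplace=True) on a column selection,
    # which is not guaranteed to modify the parent frame.
    augmented['source'] = augmented['source'].fillna(augmented['source_new'])
    augmented['value'] = augmented['value'].fillna(augmented['value_new'])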
    gc.collect()
    return augmented.drop(['tmp', 'source_new', 'value_new'], axis=1)


def prepare_augmented_udf(start, end, time_unit='m', chunk_id=''):
    print(f'from {start} to {end}')
    return pandas_udf(lambda p: prepare_augmented_pd_light(p, start, end, time_unit, chunk_id),
                      returnType=DB.metrics_schema(),
                      functionType=PandasUDFType.GROUPED_MAP)


class MyTask(Task):
    def run():
        parser = argparse.ArgumentParser()
        parser.add_argument('--input_data_path', type=str, help='Input Data files path including wildcards', default='')
        parser.add_argument('--output_data_path', type=str, help='Output data files path', default='')
        parser.add_argument('--start_date', type=str, help='Epoch start date in ISO format %%Y-%%m-%%dT%%H:%%M:%%SZ', default='')
        parser.add_argument('--end_date', type=str, help='Epoch end date (exclusive) in ISO format %%Y-%%m-%%dT%%H:%%M:%%SZ', default='')
        args, d = parser.parse_known_args()

        sc = spark.sparkContext
        df = (spark.read.format("csv").
              option("header", "false").
              schema(DB.metrics_schema()).
              option('delimiter', ',').
              load(args.input_data_path.split(',')))

        start_time = datetime.strptime(args.start_date, ISO_TIME_FORMAT)
        start_time = start_time.replace(tzinfo=timezone('UTC'))
        end_time = datetime.strptime(args.end_date, ISO_TIME_FORMAT)
        end_time = end_time.replace(tzinfo=timezone('UTC'))
        chunk_id = '{}_{}'.format(start_time.strftime('%Y-%m-%d-%H-%M'), end_time.strftime('%Y-%m-%d-%H-%M'))
        chunk = (df.repartition(2000, "metric")
                 .groupby(spark_partition_id())
                 .apply(prepare_augmented_udf(start_time, end_time + timedelta(minutes=-1), time_unit='m', chunk_id=chunk_id)))
        chunk.write.format('csv').save(args.output_data_path + '/chunk-{}'.format(chunk_id))

<job_api.PyScript at 0x7fd01fe74e80>
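
The augmentation job builds the full minute-by-minute time grid for every metric in a partition and right-joins the observed rows onto it, so missing points appear as rows with an empty value and a `<source>-empty` source. The same idea in plain Python, as a simplified sketch: the function name `fill_gaps`, the tuple layout `(time, metric, value, source)`, and the sample metric name are assumptions for illustration. Unlike the job, this version collapses duplicate input rows instead of leaving them for the Remove Duplicates step.

```python
from datetime import datetime, timedelta

ISO_TIME_FORMAT = '%Y-%m-%dT%H:%M:%SZ'  # assumed timestamp format


def fill_gaps(rows, start, end, source):
    """Return one row per (minute, metric) in [start, end]; gaps get '<source>-empty'."""
    observed = {(t, m): (v, s) for t, m, v, s in rows}
    metrics = sorted({m for _, m, _, _ in rows})
    steps = int((end - start) / timedelta(minutes=1)) + 1
    result = []
    for i in range(steps):
        t = (start + timedelta(minutes=i)).strftime(ISO_TIME_FORMAT)
        for m in metrics:
            # Fall back to an empty value and the '-empty' source for missing points.
            value, src = observed.get((t, m), (None, '{}-empty'.format(source)))
            result.append((t, m, value, src))
    return result


# Made-up sample rows with a one-minute gap at 00:01.
rows = [
    ('2019-05-20T00:00:00Z', 'cart-tps', 12.0, 'apigee-kohls-prod'),
    ('2019-05-20T00:02:00Z', 'cart-tps', 15.0, 'apigee-kohls-prod'),
]
filled = fill_gaps(rows, datetime(2019, 5, 20, 0, 0), datetime(2019, 5, 20, 0, 2),
                   'apigee-kohls-prod')
for row in filled:
    print(row)
```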

In [6]:
transition = get_transition('api_transition_select.json')
print(transition)
SELECTION_OUT = transition.get('SELECT_SOURCE_OUTPUT', '')

{'SELECT_SOURCE_JOB': 'api_ai4ops_select_source_1567411439', 'SELECT_SOURCE_OUTPUT': 'gs://ai4ops-main-storage-bucket/apigee_history/apigee/metrics/history/selected/api_ai4ops_select_source_1567411439', 'SELECT_SOURCE_STATE': 'DONE'}


In [7]:
builder = DataprocJobBuilder()
session = Session(BUCKET, REGION, CLUSTER, PROJECT)

In [8]:
aug_job_name = "augmentation_normal_{}".format(int(datetime.now().timestamp()))

AUGMENTATION_OUT=f'{AI4OPS_HISTORY_PATH}/normal_augmented/{aug_job_name}'

arguments = {
    "--input_data_path": f"{SELECTION_OUT}/normal",
    "--output_data_path": AUGMENTATION_OUT,
    "--start_date": DATA_START_DATE,
    "--end_date": DATA_END_DATE,
}

augmentation_job = builder.task_script('augmentation.py')\
.job_id(aug_job_name)\
.py_file(f'{SCRIPT_PATH}/apigee_ingest_utils.py')\
.py_file(f'{SCRIPT_PATH}/ai4ops_db.py')\
.py_file(f'{SCRIPT_PATH}/yarn_logging.py')\
.arguments(**arguments)\
.build_job()

aug_executor = DataprocExecutor(augmentation_job, session)

In [10]:
aug_res = aug_executor.submit_job(run_async=True)

Job with id augmentation_normal_1567414052 was submitted to the cluster ai4ops


In [13]:
sleep(60)
state = aug_executor.get_job_state()

print('State : {}'.format(state))
if state not in ['DONE', 'RUNNING']:
    raise RuntimeError('Workflow step failed with state {}'.format(state))


State : RUNNING


In [15]:
aug_executor.get_job_state()

'DONE'

In [17]:
augmentation_transition = {
    "AUGMENTATION_JOB": aug_job_name,
    "AUGMENTATION_OUTPUT": AUGMENTATION_OUT,
    "AUGMENTATION_STATE": state
}

with open('api_transition_augmentation.json', 'w') as f:
    json.dump(augmentation_transition, f)

<a id="rmduplicates"></a>
## Remove Duplicates
[back to Table Of Contents](#tc)


In [29]:
builder = DataprocJobBuilder()
session = Session(BUCKET, REGION, CLUSTER, PROJECT)

In [30]:
%%py_script --task --name remove_duplicates.py
import argparse
from pyspark.sql import SparkSession
from ai4ops_db import *
import pyspark.sql.functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType, spark_partition_id
import numpy as np
from job_api import Task


@pandas_udf(DB.metrics_schema(), PandasUDFType.GROUPED_MAP)
def grouped_mean(pdf):
    res = pdf.sort_values([DB.METRIC, DB.TIME, DB.SOURCE]).groupby([DB.METRIC, DB.TIME], as_index=False).agg({
        DB.VALUE: np.mean,
        DB.SOURCE: 'first'
    })
    return res[DB.metrics_schema_names()]

class DeduplicationTask(Task):
    def run():
        parser = argparse.ArgumentParser()
        parser.add_argument('--input_data_path', type=str, help='comma separated input data paths')
        parser.add_argument('--output_data_path', type=str, help='total base path')

        args, u = parser.parse_known_args()

        spark = SparkSession.builder.getOrCreate()
        sc = spark.sparkContext
        df = (spark.read.format("csv").
              option("header", "false").
              schema(DB.metrics_schema()).
              option('delimiter', ',').
              load(args.input_data_path.split(',')))
        df.printSchema()
        # df = df.groupby('time', 'metric', 'source').agg(F.max('value').alias('value'))
        df = df.repartition(1000, DB.METRIC).groupby(spark_partition_id()).apply(grouped_mean)
        # df = df.groupby(DB.TIME, DB.METRIC).apply(grouped_mean)
        df.select(DB.metrics_schema().names).write.format('csv').save(args.output_data_path)

<job_api.PyScript at 0x7fd01d01c588>
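
`grouped_mean` collapses rows that share the same metric and time into a single row whose value is the mean and whose source is the first one in sorted order. A plain-Python sketch of that rule, using made-up sample rows and a hypothetical `dedup_mean` helper; the tuple layout `(time, metric, value, source)` is an assumption for illustration:

```python
from collections import defaultdict
from statistics import mean


def dedup_mean(rows):
    """Collapse (time, metric, value, source) rows to one row per (metric, time)."""
    groups = defaultdict(list)
    # Sorting by metric, time, source makes the "first source" choice deterministic.
    for t, m, v, s in sorted(rows, key=lambda r: (r[1], r[0], r[3])):
        groups[(m, t)].append((v, s))
    result = []
    for (m, t), items in groups.items():
        values = [v for v, _ in items if v is not None]
        # Missing values are skipped; a group with no values keeps None.
        value = mean(values) if values else None
        result.append((t, m, value, items[0][1]))
    return result


rows = [
    ('2019-05-20T00:00:00Z', 'cart-tps', 10.0, 'apigee-kohls-prod'),
    ('2019-05-20T00:00:00Z', 'cart-tps', 20.0, 'apigee-kohls-prod'),
    ('2019-05-20T00:01:00Z', 'cart-tps', None, 'apigee-kohls-prod-empty'),
]
deduped = dedup_mean(rows)
for row in deduped:
    print(row)
```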

In [31]:
transition = get_transition('api_transition_augmentation.json')
print(transition)
AUGMENTATION_OUT = transition.get('AUGMENTATION_OUTPUT', '')

{'AUGMENTATION_JOB': 'augmentation_normal_1567414052', 'AUGMENTATION_OUTPUT': 'gs://ai4ops-main-storage-bucket/apigee_history/apigee/metrics/history/normal_augmented/augmentation_normal_1567414052', 'AUGMENTATION_STATE': 'RUNNING'}


In [32]:
dedup_job_name = "remove_duplicates_{}".format(int(datetime.now().timestamp()))

DEDUPLICATION_OUT=f'{AI4OPS_HISTORY_PATH}/no_duplicates/{dedup_job_name}'

arguments = {
    "--input_data_path": f"{AUGMENTATION_OUT}/chunk*",
    "--output_data_path": DEDUPLICATION_OUT,
}

deduplication_job = builder.task_script('remove_duplicates.py')\
.job_id(dedup_job_name)\
.py_file(f'{SCRIPT_PATH}/apigee_ingest_utils.py')\
.py_file(f'{SCRIPT_PATH}/ai4ops_db.py')\
.py_file(f'{SCRIPT_PATH}/yarn_logging.py')\
.arguments(**arguments)\
.build_job()

dedup_executor = DataprocExecutor(deduplication_job, session)

In [34]:
dedup_res = dedup_executor.submit_job(run_async=True)

Job with id remove_duplicates_1567415840 was submitted to the cluster ai4ops


In [43]:
sleep(60)
state = dedup_executor.get_job_state()

print('State : {}'.format(state))
if state not in ['DONE', 'RUNNING']:
    raise RuntimeError('Workflow step failed with state {}'.format(state))

State : DONE


In [28]:
py_scripts

{'augmentation.py': <job_api.PyScript at 0x7fd01fe74e80>,
 'remove_duplicates.py': <job_api.PyScript at 0x7fd01fe74940>}

In [44]:
deduplication_transition = {
    "REMOVE_DUPLICATES_JOB": dedup_job_name,
    "REMOVE_DUPLICATES_OUTPUT": DEDUPLICATION_OUT,
    "REMOVE_DUPLICATES_STATE": state
}

with open('api_transition_remove_duplicates.json', 'w') as f:
    json.dump(deduplication_transition, f)

<a id="todb"></a>
## Push Metrics to DB
[back to Table Of Contents](#tc)

### May-June

In [45]:
DB_SECRET="kohls_db.txt"
TPS_FILTER = '%-tps-%-proxy'
TOTAL_LATENCY_FILTER = '%-totalLatency%'
PUSH_TO_DB_START_FROM = '2019-05-20T00:00:00Z'
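
`TPS_FILTER` and `TOTAL_LATENCY_FILTER` are SQL `LIKE` patterns, where `%` matches any run of characters. To preview which metric names a pattern selects without starting a job, the pattern can be translated to a regular expression. A small sketch: `like_to_regex` is a hypothetical helper, the metric names are invented, and `LIKE` escape characters are not handled.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern ('%' = any run, '_' = any one char) to a regex."""
    parts = []
    for ch in pattern:
        if ch == '%':
            parts.append('.*')
        elif ch == '_':
            parts.append('.')
        else:
            parts.append(re.escape(ch))
    return re.compile(''.join(parts), re.DOTALL)


tps = like_to_regex('%-tps-%-proxy')
# Invented metric names, used only to show which ones the pattern keeps.
names = ['apigee-tps-cart-proxy', 'apigee-totalLatency-cart', 'apigee-tps-cart-target']
selected = [n for n in names if tps.fullmatch(n)]
print(selected)
```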

In [74]:
%%py_script --task --name augmentation_to_mysql.py
import argparse
import json

from pyspark.sql import SparkSession

from ai4ops_db import DB
from apigee_ingest_utils import ApigeeIngest
from augmentation import augment_corrupt
from apigee_ingest_utils import TIME_NODE
from job_api import Task

DB_TABLE = 'metric'
DB_STATS_TABLE = 'stats'

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.addPyFile('yarn_logging.py')
import yarn_logging

logger = yarn_logging.YarnLogger()


def store_to_db(df, table_name, db_credentials):
    db = DB(db_credentials, db_table=table_name)
    df.write.format(DB.DB_FORMAT).options(**db.get_spark_db_params()).mode(DB.TBL_APPEND_FORMAT).save()


def store_stats_to_db(stats_path, db_credentials):
    logger.info("Storing statistics from csv to db start")
    df = spark.read.csv(stats_path, header=True)
    df = df.drop('expected_timestep_quantity')
    df = df.toDF('metric', 'normal_all', 'normal_null', 'empty', 'error', 'timestep_quantity')

    logger.info("Dataframe with statistics has been prepared, start storing to db")
    store_to_db(df, DB_STATS_TABLE, db_credentials)


def store_metrics_to_db(metrics_path, db_credentials, start_from=None, metric_filter='%'):
    df = spark.read.csv(metrics_path, header=False, schema=DB.metrics_schema())
    # Apply the metric filter even when no start time is given.
    df = df.filter("metric like '{}'".format(metric_filter))
    if start_from is not None:
        df = df.filter("{} >= '{}'".format(TIME_NODE, start_from))
    store_to_db(df, DB_TABLE, db_credentials)


def augmentation_to_db(spark_session, latency_path, traffic_path, db_credentials):
    good_latency, bad_latency, good_traffic, bad_traffic = augment_corrupt(spark_session, latency_path, traffic_path)

    logger.info("start saving bad_latency")
    store_to_db(bad_latency, DB_TABLE, db_credentials)
    logger.info("end saving bad_latency")

    logger.info("start saving bad_traffic")
    store_to_db(bad_traffic, DB_TABLE, db_credentials)
    logger.info("end saving bad_traffic")

    logger.info("start saving good_latency")
    store_to_db(good_latency, DB_TABLE, db_credentials)
    logger.info("end saving good_latency")

    logger.info("start saving good_traffic")
    store_to_db(good_traffic, DB_TABLE, db_credentials)
    logger.info("end saving good_traffic")

class SaveToDbTask(Task):
    def run():
        parser = argparse.ArgumentParser()
        parser.add_argument('--db_credentials_file_path', type=str, help='db credentials file path on cluster file system')
        parser.add_argument('--db_credentials_gcs_file_path', type=str, help='db credentials file path on GCS')
        parser.add_argument('--res_path', type=str, help='resources directory path')
        parser.add_argument('--metrics_path', type=str, help='metrics path')
        parser.add_argument('--start_from', type=str, help='time string')
        parser.add_argument('--metric_filter', type=str, help='SQL LIKE pattern for metric names', default='%')

        args, u = parser.parse_known_args()

        if args.db_credentials_file_path is None and args.db_credentials_gcs_file_path is None:
            # SystemExit with a message prints it to stderr and exits with status 1.
            raise SystemExit('No DB credentials path was provided')

        db_credentials_file_path = args.db_credentials_file_path
        db_credentials_gcs_file_path = args.db_credentials_gcs_file_path
        res_path = args.res_path
        if db_credentials_file_path is not None:
            db_credentials = json.loads(ApigeeIngest.dcr(res_path + '/resource.txt', db_credentials_file_path).decode('utf-8'))
        else:
            db_credentials_rows = spark.read.text(db_credentials_gcs_file_path).collect()
            db_credentials_file_path = 'db.txt'
            with open(db_credentials_file_path, 'w') as f:
                f.write(db_credentials_rows[0][0])
            db_credentials = json.loads(ApigeeIngest.dcr(res_path + '/resource.txt', db_credentials_file_path).decode('utf-8'))

        store_metrics_to_db(args.metrics_path, db_credentials, args.start_from, metric_filter=args.metric_filter)


<job_api.PyScript at 0x7fd01cfb7550>

In [75]:
transition = get_transition('api_transition_remove_duplicates.json')
print(transition)
DEDUPLICATION_OUT = transition.get('REMOVE_DUPLICATES_OUTPUT', '')

{'REMOVE_DUPLICATES_JOB': 'remove_duplicates_1567415840', 'REMOVE_DUPLICATES_OUTPUT': 'gs://ai4ops-main-storage-bucket/apigee_history/apigee/metrics/history/no_duplicates/remove_duplicates_1567415840', 'REMOVE_DUPLICATES_STATE': 'DONE'}


In [76]:
builder = DataprocJobBuilder()
session = Session(BUCKET, REGION, CLUSTER, PROJECT)

In [78]:
save_to_db_tps_job_name = "api_push_metrics_to_mysql_{}".format(int(datetime.now().timestamp()))

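# The deduplication job writes part-* files directly under DEDUPLICATION_OUT,
# so read 'part*' here, as the latency job below does.
arguments = {
    "--metrics_path": f"{DEDUPLICATION_OUT}/part*",
    "--db_credentials_gcs_file_path": f"gs://{BUCKET}/resources/{DB_SECRET}",
    "--res_path": RESOURCES,
    "--start_from": PUSH_TO_DB_START_FROM,
    "--metric_filter": TPS_FILTER,
}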

save_to_db_tps_job = builder.task_script('augmentation_to_mysql.py')\
.job_id(save_to_db_tps_job_name)\
.py_file(f'{SCRIPT_PATH}/apigee_ingest_utils.py')\
.py_file(f'{SCRIPT_PATH}/ai4ops_db.py')\
.py_file(f'{SCRIPT_PATH}/yarn_logging.py')\
.py_file(f'{SCRIPT_PATH}/augmentation.py')\
.jar(f'gs://{BUCKET}/resources/mysql-connector-java-8.0.16.jar')\
.arguments(**arguments)\
.build_job()

save_tps_executor = DataprocExecutor(save_to_db_tps_job, session)

In [79]:
save_to_db_lat_job_name = "api_push_metrics_to_mysql_{}".format(int(datetime.now().timestamp()))


arguments = {
    "--metrics_path": f"{DEDUPLICATION_OUT}/part*",
    "--db_credentials_gcs_file_path": f"gs://{BUCKET}/resources/{DB_SECRET}",
    "--res_path": RESOURCES,
    "--start_from": PUSH_TO_DB_START_FROM,
    "--metric_filter": TOTAL_LATENCY_FILTER,
}

save_to_db_lat_job = builder.task_script('augmentation_to_mysql.py')\
.job_id(save_to_db_lat_job_name)\
.py_file(f'{SCRIPT_PATH}/apigee_ingest_utils.py')\
.py_file(f'{SCRIPT_PATH}/ai4ops_db.py')\
.py_file(f'{SCRIPT_PATH}/yarn_logging.py')\
.py_file(f'{SCRIPT_PATH}/augmentation.py')\
.jar(f'gs://{BUCKET}/resources/mysql-connector-java-8.0.16.jar')\
.arguments(**arguments)\
.build_job()

save_lat_executor = DataprocExecutor(save_to_db_lat_job, session)

In [80]:
save_tps_res = save_tps_executor.submit_job(run_async=False)

Job with id api_push_metrics_to_mysql_1567420284 was submitted to the cluster ai4ops
Job STATUS was set to PENDING at 2019-09-02 10:31:30
Job STATUS was set to SETUP_DONE at 2019-09-02 10:31:30
      Yarn APP augmentation_to_mysql.py with STATUS ACCEPTED has PROGRESS 0
      Yarn APP augmentation_to_mysql.py with STATUS RUNNING has PROGRESS 10
Job STATUS was set to RUNNING at 2019-09-02 10:31:30
      Yarn APP augmentation_to_mysql.py with STATUS FINISHED has PROGRESS 100
Job STATUS was set to DONE at 2019-09-02 10:31:57


In [73]:
save_tps_executor.job_description()

{'reference': {'project_id': 'kohls-kos-cicd',
  'job_id': 'push_metrics_to_mysql_1567416573'},
 'placement': {'cluster_name': 'ai4ops'},
 'pyspark_job': {'main_python_file_uri': 'gs://ai4ops-main-storage-bucket/jobs-root/push_metrics_to_mysql_1567416573/run.py',
  'args': ['--metrics_path',
   'gs://ai4ops-main-storage-bucket/apigee_history/apigee/metrics/history/no_duplicates/remove_duplicates_1567415840/chunk*',
   '--db_credentials_gcs_file_path',
   'gs://ai4ops-main-storage-bucket/resources/kohls_db.txt',
   '--res_path',
   '/opt/dataproc/.resources',
   '--start_from',
   '2019-05-20T00:00:00Z',
   '--metric_filter',
   '%-tps-%-proxy'],
  'python_file_uris': ['gs://ai4ops-main-storage-bucket/jobs-root/push_metrics_to_mysql_1567416573/apigee_ingest_utils.py',
   'gs://ai4ops-main-storage-bucket/jobs-root/push_metrics_to_mysql_1567416573/ai4ops_db.py',
   'gs://ai4ops-main-storage-bucket/jobs-root/push_metrics_to_mysql_1567416573/yarn_logging.py',
   'gs://ai4ops-main-storage-bu

In [81]:
save_lat_res = save_lat_executor.submit_job(run_async=False)

Job with id api_push_metrics_to_mysql_1567420285 was submitted to the cluster ai4ops
Job STATUS was set to PENDING at 2019-09-02 10:32:23
Job STATUS was set to SETUP_DONE at 2019-09-02 10:32:23
      Yarn APP augmentation_to_mysql.py with STATUS RUNNING has PROGRESS 10
Job STATUS was set to RUNNING at 2019-09-02 10:32:24
      Yarn APP augmentation_to_mysql.py with STATUS FINISHED has PROGRESS 100
Job STATUS was set to DONE at 2019-09-02 10:32:48


<a id="gcs"></a>
## Push Notebook to GCS Bucket
[back to Table Of Contents](#tc)

In [None]:
!gsutil cp api_data_preprocessing_workflow.ipynb gs://ai4ops-main-storage-bucket/ai4ops-source/ai4ops-jupyter-ds-03/api
