# Trendspotting POC

The goal of this notebook is to:
* Load signals data into a managed Vertex AI dataset for time series forecasting
* Create a forecast prediction model for each term, geo and category combination
* Project the forecasts on a holdout set of data to assess performance and trends
* Clean up results of test predictions
* Cluster test predictions
* Create dashboard for backtesting

End users can run this parameterized pipeline to produce futurama backtests using clustering and forecasting.

[Source Control Link](https://source.cloud.google.com/cpg-cdp/trendspotting/+/master:pipeline_train.ipynb)

When run, the pipeline will look something like this:

![pipeline example](img/pipeline_example.png)

[Link to pipeline](https://pantheon.corp.google.com/vertex-ai/locations/us-central1/pipelines/runs/report-pipe-trendspotting-pipeline-20220309212043?authuser=0&project=cpg-cdp)

## Install packages, create bucket (only run once)

In [1]:
# ! pip3 install -U google-cloud-storage $USER_FLAG
# ! pip3 install $USER_FLAG kfp google-cloud-pipeline-components==1.0.5
# # !git clone https://github.com/kubeflow/pipelines.git
# # !pip install pipelines/components/google-cloud/.
# !pip install google-cloud-aiplatform
# from google_cloud_pipeline_components.v1.bigquery import BigqueryCreateModelJobOp

### Import libs and types for KFP pipeline

In [2]:
import json
import os
import time
from datetime import datetime
from typing import Any, Callable, Dict, List, NamedTuple, Optional, Sequence, Tuple, Union

from IPython.core.display import HTML
from IPython.display import Image, clear_output

from google import auth
from google.api_core import exceptions as google_exceptions
from google_cloud_pipeline_components import aiplatform as gcc_aip
from google_cloud_pipeline_components.experimental import forecasting as gcc_aip_forecasting
from google_cloud_pipeline_components.experimental import bigquery as gcp_aip_bq
from google_cloud_pipeline_components.v1.bigquery import BigqueryCreateModelJobOp

import google.cloud.aiplatform
from google.cloud import bigquery
from google.cloud import storage

import kfp
import kfp.v2.dsl
from kfp.v2.components.types.type_utils import artifact_types
from kfp.v2.dsl import Artifact, Input, Model
from kfp.v2.google import client as pipelines_client

from matplotlib import dates as mdates
from matplotlib import pyplot as plt

import pandas as pd
import seaborn as sns


In [3]:
PROJECT_ID = 'cpg-cdp'
LOCATION = 'us-central1'

In [4]:
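PIPELINES = {}

# GCS location; also used as `pipeline_root` for the pipeline below.
PIPELINES_FILEPATH = 'gs://trendspotting-pipeline' # <--- TODO: CHANGE THIS; can be blank json file

# NOTE: os.path.isfile() and open() only handle local paths, so with a gs:// URI
# this check is always False and PIPELINES stays empty.
if os.path.isfile(PIPELINES_FILEPATH):
    with open(PIPELINES_FILEPATH) as f:
        PIPELINES = json.load(f)

def save_pipelines():
    # Only works when PIPELINES_FILEPATH is a local path; open() cannot write to gs://.
    with open(PIPELINES_FILEPATH, 'w') as f:
        json.dump(PIPELINES, f)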

### KFP Custom Component - training data query

Details: From `futurama_weekly`, pull data between 7/20 and 12/21 (100 GB limit for AutoML Tables). The training, validation, and test splits are set automatically as follows:

* Train: 2/20-4/21
* Validate: 5/21-6/21
* Test: 6/21-12/21

Also set `series_id` to be a concat: `concat(category_id, geo_id, term) as series_id`

Note that this creates a backtest over the test period to assess the efficacy of the predictions.
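
The split and `series_id` logic described above can be sketched in plain Python. This is only an illustration of the idea, not the query used by the component: the function names are hypothetical, the `geo_id` value in the usage example is made up, and the `TRAIN` / `VALIDATE` / `TEST` labels are assumed to be the values expected in a predefined split column.

```python
from datetime import date
from typing import Optional


def make_series_id(category_id: int, geo_id: int, term: str) -> str:
    """Mirror concat(category_id, geo_id, term): plain concatenation, no separator."""
    return f"{category_id}{geo_id}{term}"


def assign_split(d: date, train_st: date, train_end: date,
                 valid_st: date, valid_end: date) -> Optional[str]:
    """Label a row's date as TRAIN, VALIDATE or TEST (assumed label values)."""
    if d < train_st:
        return None  # before the modelling window
    if d <= train_end:
        return "TRAIN"
    if valid_st <= d <= valid_end:
        return "VALIDATE"
    if d > valid_end:
        return "TEST"  # everything after validation is held out for the backtest
    return None  # falls in a gap between train_end and valid_st


# Usage with illustrative dates (geo_id 2840 is a made-up value).
window = (date(2021, 5, 30), date(2021, 10, 10), date(2021, 10, 17), date(2021, 12, 26))
print(make_series_id(10055, 2840, "dry shampoo"))
print(assign_split(date(2021, 11, 7), *window))
```

Rows that return `None` would simply be filtered out before training.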

### Starter component - check source table date for run

#### Detailed Report Custom Component

This creates the futurama weekly table with Swivel NLP embeddings (20 float fields).
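
As a rough picture of what "20 float fields" means, the sketch below unpacks a 20-element embedding vector into named columns. The `emb1` to `emb20` names follow the attribute columns used in the forecasting step; the helper itself is hypothetical and stands in for whatever the component does in SQL.

```python
from typing import Dict, Sequence

EMBED_DIM = 20  # the embeddings are described as 20 float fields


def embedding_to_columns(vec: Sequence[float], prefix: str = "emb") -> Dict[str, float]:
    """Turn a 20-element vector into {emb1: ..., emb2: ..., ..., emb20: ...}."""
    if len(vec) != EMBED_DIM:
        raise ValueError(f"expected {EMBED_DIM} values, got {len(vec)}")
    # Column names are 1-based to match emb1..emb20.
    return {f"{prefix}{i}": float(v) for i, v in enumerate(vec, start=1)}


columns = embedding_to_columns([0.5] * EMBED_DIM)
print(list(columns)[:3])
```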

## Top level report

This component takes `futurama_weekly`, adds the NLP features, and applies the trained 100-cluster model.
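
Applying a trained k-means model amounts to assigning each embedding to its nearest centroid. A minimal sketch of that step follows, assuming Euclidean distance and 1-based cluster ids; both are assumptions for illustration, and the real assignment is done by the BigQuery ML model rather than by this function.

```python
import math
from typing import List, Sequence


def nearest_centroid(vec: Sequence[float], centroids: List[Sequence[float]]) -> int:
    """Return the 1-based id of the centroid closest to vec (Euclidean distance)."""
    distances = [math.dist(vec, c) for c in centroids]
    return distances.index(min(distances)) + 1


# Toy 2-dimensional example; the real embeddings have 20 dimensions.
toy_centroids = [[0.0, 0.0], [10.0, 10.0]]
print(nearest_centroid([1.0, 1.0], toy_centroids))
print(nearest_centroid([9.0, 8.0], toy_centroids))
```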

#### Creation of aggregated cluster-level data
Used in the cluster forecasts
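
Conceptually, the cluster-level table rolls term-level rows up to one row per cluster and week. The sketch below assumes the roll-up is a sum of `volume` keyed by `topic_id` and `date`; the column names come from the cluster forecast configuration, while the choice of `sum` and the function name are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aggregate_by_cluster(rows: Iterable[dict]) -> Dict[Tuple[int, str], float]:
    """Sum volume per (topic_id, date) pair."""
    totals: Dict[Tuple[int, str], float] = defaultdict(float)
    for row in rows:
        totals[(row["topic_id"], row["date"])] += row["volume"]
    return dict(totals)


rows = [
    {"topic_id": 1, "date": "2021-06-06", "volume": 10.0},
    {"topic_id": 1, "date": "2021-06-06", "volume": 5.0},
    {"topic_id": 2, "date": "2021-06-06", "volume": 7.0},
]
print(aggregate_by_cluster(rows))
```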

### Feature Specs for forecasts
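
The forecasting jobs take a `column_transformations` argument, which is a list of single-key dictionaries of the form `{"<type>": {"column_name": "<column>"}}`. The actual specs live in `src.components` and are not reproduced here; the sketch below only shows how such a list could be built, and the type assigned to each column is a guess.

```python
from typing import Dict, List, Sequence

Transformation = Dict[str, Dict[str, str]]


def build_column_transformations(
    timestamp: Sequence[str] = (),
    numeric: Sequence[str] = (),
    categorical: Sequence[str] = (),
    text: Sequence[str] = (),
) -> List[Transformation]:
    """Build a list of {"<type>": {"column_name": "<col>"}} entries."""
    spec: List[Transformation] = []
    for kind, columns in (("timestamp", timestamp), ("numeric", numeric),
                          ("categorical", categorical), ("text", text)):
        spec.extend({kind: {"column_name": col}} for col in columns)
    return spec


# Hypothetical type assignments for the term-level columns named in this notebook.
EXAMPLE_TRANSFORMATIONS = build_column_transformations(
    timestamp=["date"],
    numeric=["category_rank"] + [f"emb{i}" for i in range(1, 21)],
    categorical=["geo_name", "geo_id", "category_id", "term"],
    text=["sentences"],
)
print(EXAMPLE_TRANSFORMATIONS[:2])
```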

### Pipeline

The pipeline combines custom components with reusable Vertex AI components that create the training dataset and train the forecast models.

Note that the BigQuery output table for the test-set predictions is set by `target_term_forecast_table` (and by `target_cluster_forecast_table` for the cluster model), which is passed to `export_evaluated_data_items_bigquery_destination_uri`.

In [5]:
from src.components import components

In [6]:
VERSION = 'v4'
SUFFIX = "png_hair_22"
PIPELINE_TAG = f'{SUFFIX}-trendspotting-pipeline-{VERSION}' # <--- TODO; optionally name pipeline

@kfp.v2.dsl.pipeline(
    name=f'{VERSION}-{PIPELINE_TAG}'.replace('_', '-'),
    pipeline_root=PIPELINES_FILEPATH,
)
def pipeline(
    vertex_project: str,
    location: str,
    version: str,
    ds_display_name_terms: str,
    ds_display_name_cluster: str,
    label_table: str,
    scored_classification_table: str,
    classification_train_table: str,
    classification_model_name: str,
    classification_model_budget: int,
    auto_cluster_train_table: str,
    auto_min_cluster: int,
    auto_max_cluster: int,
    auto_cluster_target_table: str,
    label_list: list,
    train_st: str,
    train_end: str,
    valid_st: str,
    valid_end: str,
    predict_on_dt: str,
    override: str,
    fix_embed_target: str,
    k_means_name: str,
    n_clusters: int,
    top_n_results: int,
    six_month_dt: str,
    source_table: str ,
    target_term_forecast_table: str ,
    target_cluster_forecast_table: str ,
    budget_milli_node_hours: int,
    budget_milli_node_hours_cluster: int,
    context_window: int,
    forecast_horizon: int,
    top_movers_target_table: str,
    cluster_table_agg: str,
    cluster_table: str,
    subcat_id: int,
    model_name: str,
):


    # get_data_source = gcp_aip_bq.BigqueryQueryJobOp(
    #   project = 'cpg-cdp',
    #   location = 'US',
    #   query = f"""select distinct date from {source_table}""",
    #     # encryption_spec_key_name=CMEK
    # )
    
    embed_terms = components.create_prediction_dataset_term_level(
      target_table = f'cpg-cdp.trendspotting.ETL_futurama_weekly_embed_{SUFFIX}',
      source_table_uri = source_table,
      train_st = train_st,
      train_end = train_end,
      valid_st = valid_st,
      valid_end = valid_end,
      subcat_id = subcat_id,
    ) #-> NamedTuple('Outputs', [('training_data_table_uri', str)])
    
    fix_embed = components.prep_forecast_term_level(
        source_table = embed_terms.outputs['training_data_table_uri'],
        target_table = fix_embed_target,
        )# -> NamedTuple('Outputs', [('term_train_table', str)]):


    time_series_dataset_create_op = gcc_aip.TimeSeriesDatasetCreateOp(
        display_name=ds_display_name_terms, 
        bq_source=fix_embed.outputs['term_train_table'],
        project=vertex_project,
        location=location,
    )
    
    term_forecasting_op = gcc_aip.AutoMLForecastingTrainingJobRunOp(
        display_name='train-point-forecast-futurama',
        model_display_name='point-forecast-futurama',
        dataset=time_series_dataset_create_op.outputs['dataset'],
        context_window=context_window,
        forecast_horizon=forecast_horizon,
        budget_milli_node_hours=budget_milli_node_hours,
        project=vertex_project,
        location=location,
        export_evaluated_data_items=True,
        export_evaluated_data_items_override_destination=True,
        target_column='category_rank',
        time_column='date',
        time_series_identifier_column='series_id',
        time_series_attribute_columns=['geo_name', 'geo_id', 'category_id', 'term', 
                                      'emb1', 'emb2', 'emb3', 'emb4', 'emb5', 'emb6',
                                      'emb7', 'emb8', 'emb9', 'emb10', 'emb11', 'emb12',
                                      'emb13', 'emb14', 'emb15', 'emb16', 'emb17', 'emb18', 
                                      'emb19', 'emb20', 'sentences'],
        unavailable_at_forecast_columns=['category_rank'],
        available_at_forecast_columns=['date'],
        data_granularity_unit='week',
        data_granularity_count=1,
        predefined_split_column_name= 'split_col', 
        optimization_objective='minimize-rmse',
        column_transformations=components.COLUMN_TRANSFORMATIONS,
        export_evaluated_data_items_bigquery_destination_uri = target_term_forecast_table, # must be format:``bq://<project_id>:<dataset_id>:<table>``
    )
    
    top_movers_data_op = components.create_top_mover_table(
        source_table = target_term_forecast_table,
        target_table = top_movers_target_table,
        predict_on_dt = predict_on_dt,
        six_month_dt = six_month_dt,
        trained_model = term_forecasting_op.outputs['model'],
        top_n_results = top_n_results,
        ) #-> NamedTuple('Outputs', [('term_train_table', str)]):
    
    
    #HIGH LEVEL REPORT PIPELINE STARTS HERE
    
    #######################################
    
    model_train_sql = f"""CREATE TEMPORARY FUNCTION arr_to_input_20(arr ARRAY<FLOAT64>)
                        RETURNS 
                        STRUCT<p1 FLOAT64, p2 FLOAT64, p3 FLOAT64, p4 FLOAT64,
                               p5 FLOAT64, p6 FLOAT64, p7 FLOAT64, p8 FLOAT64, 
                               p9 FLOAT64, p10 FLOAT64, p11 FLOAT64, p12 FLOAT64, 
                               p13 FLOAT64, p14 FLOAT64, p15 FLOAT64, p16 FLOAT64,
                               p17 FLOAT64, p18 FLOAT64, p19 FLOAT64, p20 FLOAT64>
                        AS (
                        STRUCT(
                            arr[OFFSET(0)]
                            , arr[OFFSET(1)]
                            , arr[OFFSET(2)]
                            , arr[OFFSET(3)]
                            , arr[OFFSET(4)]
                            , arr[OFFSET(5)]
                            , arr[OFFSET(6)]
                            , arr[OFFSET(7)]
                            , arr[OFFSET(8)]
                            , arr[OFFSET(9)]
                            , arr[OFFSET(10)]
                            , arr[OFFSET(11)]
                            , arr[OFFSET(12)]
                            , arr[OFFSET(13)]
                            , arr[OFFSET(14)]
                            , arr[OFFSET(15)]
                            , arr[OFFSET(16)]
                            , arr[OFFSET(17)]
                            , arr[OFFSET(18)]
                            , arr[OFFSET(19)]    
                        ));
                        
            CREATE OR REPLACE MODEL `{model_name}` OPTIONS(model_type='kmeans', KMEANS_INIT_METHOD='KMEANS++', num_clusters={n_clusters}) AS
                    select arr_to_input_20(output_0) AS comments_embed from 
                        ML.PREDICT(MODEL trendspotting.swivel_text_embed,(
                      SELECT date, geo_name, term AS sentences, volume
                      FROM `{source_table}`
                      WHERE date >= '{train_st}'
                      and category_id = {subcat_id}
                      ))
    """
    # Check whether the label table exists
    
    sct_exists_task = components.if_tbl_exists(label_table, vertex_project)
    with kfp.v2.dsl.Condition(sct_exists_task.output=="True"):
        ### if labeled data exists, we will create a model and auto cluster each category
        
        train_model_op = components.train_classification_model(
              target_table = scored_classification_table,
              source_table = fix_embed.outputs['term_train_table'],
              label_table = label_table,
              train_table = classification_train_table,
              classification_model_name = classification_model_name,
              project_id = vertex_project,
              classification_budget_hours = classification_model_budget
            ) 
        
        auto_cluster_op = components.auto_cluster(
            cluster_min = auto_min_cluster,
            cluster_max = auto_max_cluster,
            labels = label_list,
            cluster_train_table = auto_cluster_train_table,
            classified_terms_table = train_model_op.output,
            target_table = auto_cluster_target_table,
            project_id = vertex_project
            )# -> NamedTuple('Outputs', [('target_table', str)]):
        
        aggregate_cluster_op = components.aggregate_clusters(
            source_table = cluster_table,
            category_table = auto_cluster_op.output,
            target_table = cluster_table_agg,
            train_st = train_st,
            train_end = train_end,
            valid_st = valid_st,
            valid_end = valid_end,
            model_name = model_name,
            )# -> Name
        #create training ds in vertex
        time_series_dataset_create_op_high_level = gcc_aip.TimeSeriesDatasetCreateOp(
            display_name=ds_display_name_cluster, 
            bq_source=aggregate_cluster_op.outputs['term_cluster_agg_table'],
            project=vertex_project,
            location=location,
        )

        cluster_forecasting_op = gcc_aip.AutoMLForecastingTrainingJobRunOp(
            display_name='train-cluster-forecast-futurama',
            model_display_name='cluster-forecast-futurama',
            dataset=time_series_dataset_create_op_high_level.outputs['dataset'],
            context_window=context_window,
            forecast_horizon=forecast_horizon,
            budget_milli_node_hours=budget_milli_node_hours_cluster,
            project=vertex_project,
            location=location,
            export_evaluated_data_items=True,
            export_evaluated_data_items_override_destination=True,
            target_column='volume',
            time_column='date',
            time_series_identifier_column='series_id',
            time_series_attribute_columns=['topic_id', 'category', 'comments_embed_p1', 'comments_embed_p2', 'comments_embed_p3', 'comments_embed_p4', 'comments_embed_p5', 'comments_embed_p6',
                                          'comments_embed_p7', 'comments_embed_p8', 'comments_embed_p9', 'comments_embed_p10', 'comments_embed_p11', 'comments_embed_p12',
                                          'comments_embed_p13', 'comments_embed_p14', 'comments_embed_p15', 'comments_embed_p16', 'comments_embed_p17', 'comments_embed_p18', 
                                          'comments_embed_p19', 'comments_embed_p20'],
            unavailable_at_forecast_columns=['volume'],
            available_at_forecast_columns=['date'],
            data_granularity_unit='week',
            data_granularity_count=1,
            predefined_split_column_name= 'split_col', 
            optimization_objective='minimize-rmse',
            column_transformations=components.COLUMN_TRANSFORMS_CLUSTER,
            export_evaluated_data_items_bigquery_destination_uri = target_cluster_forecast_table, # must be format:``bq://<project_id>:<dataset_id>:<table>``
        )

    with kfp.v2.dsl.Condition(sct_exists_task.output=="False"):
        train_k_means_op = BigqueryCreateModelJobOp(
            project=PROJECT_ID,
            location='US',
            query=model_train_sql,
        )
        create_cluster_terms_op = components.nlp_featurize_and_cluster(
            source_table = source_table,
            target_table = cluster_table,
            train_st = train_st,
            train_end = train_end,
            subcat_id = subcat_id,
            model_name = model_name
            ).after(train_k_means_op)# -> NamedTuple('Outputs', [('term_cluster_table', str)]):
    
    

## TODO - to get explanations
Drop predictions on the training set; use `google_cloud_pipeline_components.aiplatform.ModelBatchPredictOp`.

In [7]:
from kfp.v2 import compiler

# Compile the pipeline function into a JSON spec that PipelineJob can submit.
compiler.Compiler().compile(
    pipeline_func=pipeline,
    package_path='trendspotting.json',
)



### Set parameters for pipeline here

In [8]:
PROJECT_ID = 'cpg-cdp' # <--- TODO: If not set
LOCATION = 'us-central1' # <--- TODO: If not set
SERVICE_ACCOUNT = 'vertex-pipelines@cpg-cdp.iam.gserviceaccount.com' # <--- TODO: Change This if needed
N_CLUSTERS = 20
K_MEANS_MODEL_NAME = f"cpg-cdp.trendspotting.trendspotting_{N_CLUSTERS}_rmse_{SUFFIX}"

MODEL_NAME = f'cpg-cdp.trendspotting.{SUFFIX}_kmeans_{N_CLUSTERS}'
# BQ table for source data
SOURCE_DATA = 'futurama_weekly'
TOP_N_RESULTS = 500
# TODO: Forecasting Configuration:
HISTORY_WINDOW_n = 52 #  {type: 'integer'} # context_window
FORECAST_HORIZON = 52 #  {type: 'integer'} 
BUDGET_MILLI_NODE_HOURS = 20000
BUDGET_MILLI_NODE_HOURS_CLUSTER = 20000
BUDGET_HOURS_CLASSIFICATION = 2

categories = ['Hair Straighteners and Relaxers',
 'Near me',
 'Scalp/Anti-Dandruff Products',
 'Hair Dyes & Coloring',
 'Damaged Hair',
 'Hair Styling',
 'Hair Loss Products',
 'Lice',
 'Shampoos & Conditioners']

In [9]:
! gcloud config set project $PROJECT_ID

Updated property [core/project].


### Run the pipeline
Follow the link to see the execution.

In [10]:
from google.cloud import aiplatform


PIPELINE_PARAMETERS = {
    'subcat_id': 10055, #hair care for just this run
    'vertex_project': PROJECT_ID,
    'location': LOCATION,
    'version': VERSION,
    'label_table': f'trendspotting.labels_jw_pl_{SUFFIX}',
    'scored_classification_table': f'cpg-cdp.ETL_trendspotting.classified_terms_bqml_aml_pl_{SUFFIX}',
    'fix_embed_target': f'cpg-cdp.trendspotting.ETL_futurama_weekly_embed_aml_pl_{SUFFIX}',
    'classification_train_table': f'trendspotting.ETL_labeled_distinct_training_jw_pl_{SUFFIX}',
    'classification_model_name': f'trendspotting.bqml_distinct_pl_{SUFFIX}',
    'classification_model_budget': BUDGET_HOURS_CLASSIFICATION,
    'auto_min_cluster': 2,
    'auto_max_cluster': 10,
    'auto_cluster_train_table': f'trendspotting.cat_clus_train_{SUFFIX}',
    'auto_cluster_target_table': f'trendspotting.full_cat_clus_{SUFFIX}',
    'label_list' : categories,
    'train_st': '2021-05-30',
    'train_end': '2021-10-10',
    'valid_st': '2021-10-17',
    'valid_end': '2021-12-26',
    'predict_on_dt': '2022-01-02',
    'six_month_dt': '2022-07-17',
    'context_window': HISTORY_WINDOW_n,
    'forecast_horizon': FORECAST_HORIZON,
    'budget_milli_node_hours': BUDGET_MILLI_NODE_HOURS,
    'ds_display_name_terms': f'futurama-term-forecasts-{SUFFIX}',
    'ds_display_name_cluster': f'futurama-clusters-{SUFFIX}',
    'k_means_name': K_MEANS_MODEL_NAME,
    'n_clusters': N_CLUSTERS,
    'top_n_results': TOP_N_RESULTS,
    'cluster_table_agg': f"cpg-cdp.trendspotting.ETL_futurama_weekly_embed_cluster_agg_100_{SUFFIX}",
    'cluster_table': f"cpg-cdp.trendspotting.ETL_futurama_weekly_embed_cluster_100_{SUFFIX}",
    'override' : 'false',
    'source_table' : f'cpg-cdp.trendspotting.futurama_weekly', #FIX -TODO
    'target_term_forecast_table': f'cpg-cdp.trendspotting.ETL_predict_{SUFFIX}',
    'target_cluster_forecast_table': f'cpg-cdp.trendspotting.predict_cluster_{SUFFIX}',
    'top_movers_target_table': f'cpg-cdp.trendspotting.top_movers_pl_{SUFFIX}',
    'budget_milli_node_hours_cluster': BUDGET_MILLI_NODE_HOURS_CLUSTER,
    'model_name': MODEL_NAME
    }

job = aiplatform.PipelineJob(
    display_name = f'trendspotting_{PIPELINE_PARAMETERS["subcat_id"]}_{SUFFIX}',
    template_path = 'trendspotting.json',
    pipeline_root = PIPELINES_FILEPATH,
    parameter_values = PIPELINE_PARAMETERS,
    project = PROJECT_ID,
    location = LOCATION,
    enable_caching = True,
)

job.submit()

INFO:google.cloud.aiplatform.pipeline_jobs:Creating PipelineJob
INFO:google.cloud.aiplatform.pipeline_jobs:PipelineJob created. Resource name: projects/939655404703/locations/us-central1/pipelineJobs/v4-png-hair-22-trendspotting-pipeline-v4-20220907153554
INFO:google.cloud.aiplatform.pipeline_jobs:To use this PipelineJob in another session:
INFO:google.cloud.aiplatform.pipeline_jobs:pipeline_job = aiplatform.PipelineJob.get('projects/939655404703/locations/us-central1/pipelineJobs/v4-png-hair-22-trendspotting-pipeline-v4-20220907153554')
INFO:google.cloud.aiplatform.pipeline_jobs:View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/v4-png-hair-22-trendspotting-pipeline-v4-20220907153554?project=939655404703
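
Because the forecasts use weekly granularity (`data_granularity_unit='week'`), the date parameters should all fall on the same weekly grid. A small check along the following lines, using the dates from `PIPELINE_PARAMETERS`, could catch a misaligned value before a run is submitted. The helper name is hypothetical and is not part of the pipeline.

```python
from datetime import date


def weeks_between(start: str, end: str) -> int:
    """Whole weeks from start to end (ISO dates); raises if not a multiple of 7 days."""
    days = (date.fromisoformat(end) - date.fromisoformat(start)).days
    if days % 7 != 0:
        raise ValueError(f"{start} and {end} are not on the same weekly grid")
    return days // 7


print(weeks_between("2021-05-30", "2021-10-10"))  # train_st to train_end → 19
print(weeks_between("2021-10-17", "2021-12-26"))  # valid_st to valid_end → 10
print(weeks_between("2022-01-02", "2022-07-17"))  # predict_on_dt to six_month_dt → 28
```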
