<a target="_blank" href="https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/demo/gretel-demo-sequence-of-events.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Create Complex Time Series Sequences

* This notebook demonstrates how to use Gretel DGAN with Gretel Tuner to generate synthetic time series that adhere to business logic or a set of rules.
* To run this notebook, you will need an API key from the [Gretel Console](https://console.gretel.ai/).

## Getting Started

In [1]:
!pip install -Uqq "gretel-client[tuner]"

In [None]:
# Download the file with helper functions needed to run this notebook
!curl -o dgan_tuner_utils.py https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/docs/notebooks/demo/demo-utilities-lib/dgan_tuner_utils.py

## Dataset description

This dataset describes a project management process, shown in the figure below. The process has five main phases, and each phase combines mandatory events with optional events so the workflow can adapt to a project's needs.

<img src="https://gretel-datasets.s3.us-west-2.amazonaws.com/time-series-events/project_management_workflow.png" width="25%" height="25%">

Here's a breakdown of the dataset based on this workflow:

### Phases:
**A Initiation:** The starting point of the project, focused on establishing the project's foundation.  
**B Planning:** Involves detailed preparation and strategizing for how the project will be executed.  
**C Execution:** The phase where the planned activities are carried out to create the project's deliverables.  
**D Monitoring and Controlling:** Concurrent with execution, this phase ensures the project stays on track and adheres to quality standards.  
**E Closure:** Concludes the project by ensuring all aspects are completed satisfactorily and formally closing the project.  


### Mandatory Events:
**A-1 Project Kick-off**: Marks the official start of the project.  
**B-1 Requirements Gathering**: Collection of all necessary project requirements.  
**B-2 Resource Allocation**: Assignment and scheduling of resources needed for the project.  
**C-1 Development Start**: Commencement of the project development activities.  
**D-1 Quality Assurance Testing**: Testing to ensure the quality of the project's outputs.  
**E-1 Final Review**: Comprehensive review of all project deliverables.  

### Optional Events:
**B-3 Risk Assessment**: Evaluation of potential project risks and their impacts.  
**C-2 First Prototype Review**: Assessment of an early project prototype.  
**C-3 Mid-Project Evaluation**: Evaluation of project progress before completion.  
**D-2 Client Feedback Session**: Gathering feedback from clients or stakeholders.  
**D-3 Adjustments Based on Feedback**: Making changes to the project based on received feedback.  
**E-2 Deployment**: Release of the final product to the end-users or stakeholders.  
**E-3 Project Retrospective**: Reflective meeting to discuss what went well and what could be improved.

### Workflow Logic:
The workflow is designed with a logical progression from initiation to closure, with mandatory events ensuring the project's essential milestones are met. Optional events provide opportunities to enhance project outcomes, address unforeseen challenges, or incorporate stakeholder feedback. Solid lines represent the flow between mandatory events, while dashed lines indicate where optional events can be integrated into the project lifecycle.

Together, the dataset and the workflow diagram describe a structured approach to managing projects that still leaves room to adapt to project-specific requirements and changes throughout the project.

## Load and preview training data

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv(
    "https://gretel-datasets.s3.us-west-2.amazonaws.com/time-series-events/project_management_sequences.csv"
)

EXAMPLE_COLUMN = "PROJECT_ID"
EVENT_COLUMN = "EVENT_TYPE"

selected_projects = np.random.choice(
    df[EXAMPLE_COLUMN].unique(), 3, replace=False
)

pd.set_option("display.max_rows", None)
display(df[df[EXAMPLE_COLUMN].isin(selected_projects)])

## Prepare data for DGAN

In [None]:
from dgan_tuner_utils import pad_sequence

# Length of the longest sequence, used as the target length for padding
max_len = df.groupby(EXAMPLE_COLUMN).size().max()

# Pad each group and concatenate back into a DataFrame
data_source = pd.concat(
    [
        pad_sequence(
            group,
            max_len,
            example_id_column=EXAMPLE_COLUMN,
            event_column=EVENT_COLUMN,
            pad_value="[END]",
        )
        for _, group in df.groupby(EXAMPLE_COLUMN)
    ],
    ignore_index=True,
)

# Number of sequences in the source dataset
NUM_RECORDS = len(data_source.groupby(EXAMPLE_COLUMN))

# MAX_SEQUENCE_LEN is the fixed length of every training example and of every generated synthetic sequence.
# It is computed from the padded data rather than hard-coded, so it equals the length of the longest sequence in the source dataset.
MAX_SEQUENCE_LEN = int(data_source.groupby(EXAMPLE_COLUMN).size().max())

# SAMPLE_LEN specifies the number of time points generated by a single RNN cell within the generator.
# It must be a divisor of MAX_SEQUENCE_LEN. A value of 1 here means each RNN cell generates 1 time point in the sequence.
# For optimal model learning and memory management, the ratio of MAX_SEQUENCE_LEN to SAMPLE_LEN should ideally be between 10 and 20.
SAMPLE_LEN = 1

print(f"Number of Records: {NUM_RECORDS}")
print(f"Maximum Sequence Length: {MAX_SEQUENCE_LEN}")
print(f"Sample Length: {SAMPLE_LEN}")
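To make the padding step and the `SAMPLE_LEN` constraint concrete, here is a minimal plain-Python sketch. `pad_events` and `valid_sample_lens` are hypothetical helpers written for this illustration. They are not the `pad_sequence` implementation from `dgan_tuner_utils`, and the toy sequences are made up. The sketch right-pads each sequence with `[END]` up to the longest length, then lists the divisors of that length, which are the values `SAMPLE_LEN` is allowed to take.

```python
from typing import Dict, List

PAD_TOKEN = "[END]"  # same pad value the notebook passes to pad_sequence


def pad_events(events: List[str], target_len: int, pad_value: str = PAD_TOKEN) -> List[str]:
    """Right-pad one event sequence to target_len (hypothetical helper)."""
    if len(events) > target_len:
        raise ValueError("sequence is longer than target_len")
    return events + [pad_value] * (target_len - len(events))


def valid_sample_lens(max_sequence_len: int) -> List[int]:
    """Return every divisor of max_sequence_len (hypothetical helper)."""
    return [d for d in range(1, max_sequence_len + 1) if max_sequence_len % d == 0]


# Toy sequences, made up for illustration (not taken from the dataset)
sequences: Dict[str, List[str]] = {
    "P1": ["A-1", "B-1", "B-2", "C-1", "D-1", "E-1"],
    "P2": ["A-1", "B-1", "B-2", "B-3", "C-1", "C-2", "D-1", "E-1"],
}

max_len = max(len(events) for events in sequences.values())
padded = {pid: pad_events(events, max_len) for pid, events in sequences.items()}

print(padded["P1"])
print(valid_sample_lens(max_len))  # [1, 2, 4, 8]
```

Choosing `SAMPLE_LEN = 1` always satisfies the divisor constraint, which is one reason it is a safe default for short sequences like these.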

## Visualize the sequences

In [None]:
from dgan_tuner_utils import (
    plot_event_sequences,
    plot_transition_matrices,
    plot_event_type_distribution,
)

# Define the valid sequence including optional events. Optional events are provided as a list.
valid_sequence = [
    "A-1 Project Kick-off",
    "B-1 Requirements Gathering",
    "B-2 Resource Allocation",
    ["B-3 Risk Assessment"],  # Optional
    "C-1 Development Start",
    ["C-2 First Prototype Review", "C-3 Mid-Project Evaluation"],  # Optional
    "D-1 Quality Assurance Testing",
    ["D-2 Client Feedback Session", "D-3 Adjustments Based on Feedback"],  # Optional
    "E-1 Final Review",
    ["E-2 Deployment", "E-3 Project Retrospective", "[END]"],  # Optional
]

event_mapping = {
    "A-1 Project Kick-off": "A-1",
    "B-1 Requirements Gathering": "B-1",
    "B-2 Resource Allocation": "B-2",
    "B-3 Risk Assessment": "B-3",  # Optional
    "C-1 Development Start": "C-1",
    "C-2 First Prototype Review": "C-2",  # Optional
    "C-3 Mid-Project Evaluation": "C-3",  # Optional
    "D-1 Quality Assurance Testing": "D-1",
    "D-2 Client Feedback Session": "D-2",  # Optional
    "D-3 Adjustments Based on Feedback": "D-3",  # Optional
    "E-1 Final Review": "E-1",
    "E-2 Deployment": "E-2",  # Optional
    "E-3 Project Retrospective": "E-3",  # Optional
    "[END]": "[END]",  # Optional
}

plot_event_sequences(
    data_source,
    example_id_column=EXAMPLE_COLUMN,
    event_column=EVENT_COLUMN,
    num_sequences=5,
    event_mapping=event_mapping,
)
plot_event_type_distribution(
    data_source,
    event_column=EVENT_COLUMN,
    event_mapping=event_mapping,
)
plot_transition_matrices(
    data_source,
    example_id_column=EXAMPLE_COLUMN,
    event_column=EVENT_COLUMN,
    event_mapping=event_mapping,
)
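The `valid_sequence` structure mixes plain strings (mandatory events) with lists (optional events). One way such a specification could be checked is sketched below. `follows_workflow` is a hypothetical function, not the `check_series_order` helper from `dgan_tuner_utils`, and its rules are assumptions: trailing `[END]` padding is ignored, mandatory events must appear in order, and each optional event may appear at most once within its slot. The example uses short event codes and a shortened workflow for readability.

```python
from typing import List, Sequence, Union

Step = Union[str, List[str]]  # str = mandatory event, list = optional events


def follows_workflow(
    events: Sequence[str], spec: Sequence[Step], pad_value: str = "[END]"
) -> bool:
    """Return True if events match spec (hypothetical checker)."""
    # Ignore trailing padding
    remaining = list(events)
    while remaining and remaining[-1] == pad_value:
        remaining.pop()

    pos = 0
    for step in spec:
        if isinstance(step, str):
            # Mandatory: the next event must be exactly this one
            if pos >= len(remaining) or remaining[pos] != step:
                return False
            pos += 1
        else:
            # Optional: consume events from this slot, each at most once
            seen = set()
            while (
                pos < len(remaining)
                and remaining[pos] in step
                and remaining[pos] not in seen
            ):
                seen.add(remaining[pos])
                pos += 1
    # Any leftover event is out of place
    return pos == len(remaining)


# Shortened workflow with short codes, for illustration only
spec = ["A-1", "B-1", ["B-3"], "C-1", ["C-2", "C-3"], "E-1"]

print(follows_workflow(["A-1", "B-1", "C-1", "E-1", "[END]"], spec))  # True
print(follows_workflow(["A-1", "C-1", "B-1", "E-1"], spec))  # False
```

A checker like this can be applied per sequence, which is how validity is assessed for the synthetic data later in the notebook.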


## Create Gretel Session and Project

In [None]:
from gretel_client import Gretel

gretel = Gretel(
    project_name="gretel-demo-events",
    api_key="prompt",
    cache="yes",
    validate=True,
)

## Define base DGAN configuration

In [None]:
# Create a custom DGAN config that overrides some of the base settings
import yaml
from gretel_client.gretel.config_setup import create_model_config_from_base

config = create_model_config_from_base(
    "time-series",
    params={
        "apply_example_scaling": True,
        "max_sequence_len": MAX_SEQUENCE_LEN,
        "sample_len": SAMPLE_LEN,
    },
    example_id_column=EXAMPLE_COLUMN,
)

with open("custom_base_dgan_config.yaml", "w") as file:
    yaml.dump(config, file, default_flow_style=False, sort_keys=False)

## Train DGAN models leveraging Gretel Tuner

In [None]:
from dgan_tuner_utils import EventTypeHistogramAndTransitionDistance

# This cell should take ~20 minutes to complete.
tuner_config = """
base_config: custom_base_dgan_config.yaml

params:

    attribute_loss_coef:
        choices: [1, 5, 10]

    attribute_num_layers:
        choices: [3, 4]

    attribute_num_units:
        choices: [50, 100, 200]

    batch_size:
        choices: [100, 200]

    epochs:
        choices: [1000, 2000, 4000, 8000]

    generator_learning_rate:
        log_range: [0.000001, 0.001]

    discriminator_learning_rate:
        log_range: [0.000001, 0.001]

"""

target_job = "tune-dgan"

metric = EventTypeHistogramAndTransitionDistance(reference_df=data_source, example_id_column=EXAMPLE_COLUMN, event_column=EVENT_COLUMN, num_samples=5000)

tuner_results = gretel.run_tuner(
    tuner_config,
    data_source=data_source,
    n_jobs=4,
    n_trials=24,
    metric=metric,
)
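To get a feel for the size of this search space, the sketch below counts the combinations of the discrete `choices` in the tuner config and draws a few learning rates. It assumes that `log_range` means log-uniform sampling, which is an assumption about the tuner rather than something this notebook confirms. `sample_log_uniform` is a hypothetical helper.

```python
import math
import random

# Discrete choices copied from the tuner config above
choices = {
    "attribute_loss_coef": [1, 5, 10],
    "attribute_num_layers": [3, 4],
    "attribute_num_units": [50, 100, 200],
    "batch_size": [100, 200],
    "epochs": [1000, 2000, 4000, 8000],
}

# Number of distinct combinations of the discrete parameters
n_discrete = math.prod(len(values) for values in choices.values())
print(n_discrete)  # 144


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


# Assumption: log_range samples learning rates log-uniformly
rng = random.Random(0)
rates = [sample_log_uniform(1e-6, 1e-3, rng) for _ in range(5)]
print(rates)
```

With `n_trials=24`, the tuner evaluates only a fraction of the 144 discrete combinations, before the two continuous learning rates are even considered, so increasing `n_trials` is one option if results are unsatisfactory.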

In [None]:
# Show the best config found by the tuner
best_config = tuner_results.best_config
print(yaml.dump(best_config))

## Generate synthetic time-series

In [None]:
# Generate synthetic data from the best model found by the tuner
trained = gretel.fetch_train_job_results(tuner_results.best_model_id)
generated = gretel.submit_generate(trained.model_id, num_records=1000)

## Assess synthetic time-series results

In [None]:
from dgan_tuner_utils import (
    calculate_percentage_of_valid_sequences,
    remove_invalid_sequences,
    check_series_order,
)

# Retrieve synthetic data and remove invalid sequences
generated_data = generated.synthetic_data
series_validity = generated_data.groupby(EXAMPLE_COLUMN)[EVENT_COLUMN].apply(
    lambda x: check_series_order(x, valid_sequence)
)
generated_data_valid = remove_invalid_sequences(
    generated_data, series_validity, EXAMPLE_COLUMN
)

# Calculate the percentage of valid sequences
percentage_valid = calculate_percentage_of_valid_sequences(series_validity)

print(f"Percentage of Valid Sequences: {percentage_valid:.2f}%")
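The filtering and percentage steps can be illustrated in plain Python on a toy example. `percent_valid` and `drop_invalid` are hypothetical stand-ins, not the `dgan_tuner_utils` functions, and the rows and validity flags are made up. The sketch assumes one validity flag per sequence ID.

```python
from typing import Dict, List

# Toy rows, made up for illustration: one row per event
rows: List[Dict[str, str]] = [
    {"PROJECT_ID": "S1", "EVENT_TYPE": "A-1"},
    {"PROJECT_ID": "S1", "EVENT_TYPE": "B-1"},
    {"PROJECT_ID": "S2", "EVENT_TYPE": "B-1"},
    {"PROJECT_ID": "S2", "EVENT_TYPE": "A-1"},
    {"PROJECT_ID": "S3", "EVENT_TYPE": "A-1"},
    {"PROJECT_ID": "S3", "EVENT_TYPE": "B-1"},
]

# One validity flag per sequence (S2 is out of order)
validity: Dict[str, bool] = {"S1": True, "S2": False, "S3": True}


def percent_valid(flags: Dict[str, bool]) -> float:
    """Share of sequences flagged valid, as a percentage (hypothetical helper)."""
    return 100.0 * sum(flags.values()) / len(flags) if flags else 0.0


def drop_invalid(
    data: List[Dict[str, str]], flags: Dict[str, bool], id_column: str
) -> List[Dict[str, str]]:
    """Keep only rows whose sequence is flagged valid (hypothetical helper)."""
    return [row for row in data if flags.get(row[id_column], False)]


kept = drop_invalid(rows, validity, "PROJECT_ID")
print(f"{percent_valid(validity):.2f}%")  # 66.67%
print(sorted({row["PROJECT_ID"] for row in kept}))  # ['S1', 'S3']
```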

In [None]:
# Compare source data with the valid synthetic sequences
plot_event_sequences(
    generated_data_valid,
    example_id_column=EXAMPLE_COLUMN,
    event_column=EVENT_COLUMN,
    df_ref=data_source,
    num_sequences=5,
    event_mapping=event_mapping,
)

plot_event_type_distribution(
    generated_data_valid,
    event_column=EVENT_COLUMN,
    df_ref=data_source,
    event_mapping=event_mapping,
)

plot_transition_matrices(
    generated_data_valid,
    example_id_column=EXAMPLE_COLUMN,
    event_column=EVENT_COLUMN,
    df_ref=data_source,
    event_mapping=event_mapping
)
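The plots compare source and synthetic data visually. A numeric comparison in the same spirit is sketched below: it measures the L1 distance between event-type histograms and between transition frequencies. This is an illustrative assumption about how such a comparison could work, not the implementation of `EventTypeHistogramAndTransitionDistance`. All function names and the toy sequences are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List

Sequences = Dict[str, List[str]]  # sequence id -> ordered event labels


def _normalize(counts: Counter) -> Dict[Hashable, float]:
    """Turn raw counts into relative frequencies."""
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()} if total else {}


def event_histogram(seqs: Sequences) -> Dict[Hashable, float]:
    """Relative frequency of each event type across all sequences."""
    return _normalize(Counter(e for s in seqs.values() for e in s))


def transition_frequencies(seqs: Sequences) -> Dict[Hashable, float]:
    """Relative frequency of each (event, next event) pair."""
    return _normalize(Counter(pair for s in seqs.values() for pair in zip(s, s[1:])))


def l1_distance(p: Dict[Hashable, float], q: Dict[Hashable, float]) -> float:
    """Sum of absolute differences over the union of keys."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


# Toy sequences, made up for illustration
reference = {"P1": ["A", "B", "C"], "P2": ["A", "B", "C"]}
synthetic = {"S1": ["A", "B", "C"], "S2": ["A", "C", "B"]}

print(l1_distance(event_histogram(reference), event_histogram(synthetic)))  # 0.0
print(l1_distance(transition_frequencies(reference), transition_frequencies(synthetic)))  # 1.0
```

The toy example shows why both parts matter: the two datasets have identical event-type histograms, yet their transitions differ, which only the second distance detects.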