# Lab 02: Bulk Flow Creation with Templates

**Objective:** Learn to automate data pipeline creation by using templates and loops to generate multiple flow variations from a single pattern.

**What you'll build:** Three transaction processing pipelines (Low, Medium, and High value), created programmatically in under 5 minutes. Building the same three flows by hand in the UI takes approximately 60 minutes.

**Key Concept:** Define the pipeline once as a function, then call it with different parameters. Each additional variation costs one more dictionary entry instead of another manual build in the UI.

## Part 1: Setup and Configuration

### Cell 1: (Optional) SDK Reinstall

**Skip this cell if you just completed Lab 01.** Only uncomment and run if you're starting a fresh session or experiencing import issues.

In [None]:
# Uncomment the lines below ONLY if you need to reinstall the SDK
# !pip uninstall ibm_watsonx_data_integration -y
# !pip install ibm_watsonx_data_integration --force-reinstall

### Cell 2: Import Required Libraries

In [None]:
from ibm_watsonx_data_integration import *
from ibm_watsonx_data_integration.common.auth import IAMAuthenticator
from ibm_watsonx_data_integration.services.datastage import *
from ibm_watsonx_data_integration.services.datastage.models.enums import SEQUENTIALFILE
from datetime import datetime

print("Libraries imported successfully")

### Cell 3: Set Your Credentials

**Use the same API key and Project ID from Lab 00 and Lab 01.**

In [None]:
API_KEY = "YOUR_IBM_CLOUD_API_KEY"
PROJECT_ID = "YOUR_PROJECT_ID"

print("Credentials set (not displayed for security)")
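
Hard-coding an API key in a notebook makes it easy to leak through version control or a shared export. A minimal sketch of one alternative, assuming you export the values as environment variables before starting Jupyter. The variable names `WXDI_API_KEY` and `WXDI_PROJECT_ID` and the helper `load_credentials` are hypothetical, not names the SDK defines:

```python
import os


def load_credentials(env=None):
    """Return (api_key, project_id) from environment-style variables.

    WXDI_API_KEY and WXDI_PROJECT_ID are hypothetical names chosen for this
    sketch; the SDK does not read them on its own.
    """
    env = os.environ if env is None else env
    missing = [name for name in ("WXDI_API_KEY", "WXDI_PROJECT_ID") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env["WXDI_API_KEY"], env["WXDI_PROJECT_ID"]


# Demonstration with an explicit mapping instead of the real environment
api_key, project_id = load_credentials(
    {"WXDI_API_KEY": "example-key", "WXDI_PROJECT_ID": "example-project"}
)
print("Credentials loaded (not displayed for security)")
```

In the lab you would call `load_credentials()` with no argument and assign the result to `API_KEY` and `PROJECT_ID`.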

### Cell 4: Connect to Platform and Project

In [None]:
# Authenticate
auth = IAMAuthenticator(api_key=API_KEY, base_auth_url="https://cloud.ibm.com")
print("Authenticator created")

# Connect to platform
platform = Platform(auth, base_api_url="https://api.ca-tor.dai.cloud.ibm.com")
print("Platform connection initialized")

# Get project
project = platform.projects.get(guid=PROJECT_ID)
print(f"Connected to project: {project.name}")
print(f"Project ID: {PROJECT_ID}")

### Cell 5: Define Flow Parameters

**Key Concept:** This parameter list defines three pipeline variations. Each one filters transactions at a different threshold and writes to a uniquely named CSV file.

**Note:** Flow names use the "Templated" prefix to avoid conflicts with Lab 01 flows.

In [None]:
# Generate timestamp for unique CSV filenames
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# Define three flow variations
FLOW_PARAMETERS = [
    {
        "flow_name": "TemplatedTransactionsLowValue-SDK",
        "filter_condition": "amount > 25",
        "mysql_table": "processed_transactions",
        "csv_filename": f"templated_low_value_{timestamp}.csv"
    },
    {
        "flow_name": "TemplatedTransactionsMediumValue-SDK",
        "filter_condition": "amount > 100",
        "mysql_table": "processed_transactions",
        "csv_filename": f"templated_medium_value_{timestamp}.csv"
    },
    {
        "flow_name": "TemplatedTransactionsHighValue-SDK",
        "filter_condition": "amount > 750",
        "mysql_table": "processed_transactions",
        "csv_filename": f"templated_high_value_{timestamp}.csv"
    }
]

print(f"Configured {len(FLOW_PARAMETERS)} flow variations")
print(f"Timestamp: {timestamp}")
for params in FLOW_PARAMETERS:
    print(f"  - {params['flow_name']}: {params['filter_condition']}")
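
The three entries above differ only in the tier name and the threshold, so the list itself can be generated. A minimal sketch, assuming the naming pattern stays as shown in this cell; `TIER_THRESHOLDS` and `build_flow_parameters` are hypothetical names introduced for illustration:

```python
from datetime import datetime

# Hypothetical compact description: tier label -> minimum amount
TIER_THRESHOLDS = {"Low": 25, "Medium": 100, "High": 750}


def build_flow_parameters(tier_thresholds, timestamp, mysql_table="processed_transactions"):
    """Expand a tier -> threshold mapping into the parameter dicts used in this lab."""
    parameters = []
    for tier, threshold in tier_thresholds.items():
        parameters.append({
            "flow_name": f"TemplatedTransactions{tier}Value-SDK",
            "filter_condition": f"amount > {threshold}",
            "mysql_table": mysql_table,
            "csv_filename": f"templated_{tier.lower()}_value_{timestamp}.csv",
        })
    return parameters


generated = build_flow_parameters(TIER_THRESHOLDS, datetime.now().strftime("%Y%m%d_%H%M%S"))
for params in generated:
    print(f"  - {params['flow_name']}: {params['filter_condition']}")
```

Adding a fourth tier then means adding one key to the mapping rather than copying a whole dictionary.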

### Cell 6: (Optional) Delete Existing Flows

**Use this if you need to re-run the lab.** Uncomment to delete any flows with matching names before creating new ones.

In [None]:
# Uncomment the code below to delete existing flows before re-running

# for params in FLOW_PARAMETERS:
#     try:
#         existing_flow = project.flows.get(name=params['flow_name'])
#         project.delete_flow(existing_flow)
#         print(f"Deleted existing flow: {params['flow_name']}")
#     except Exception as e:
#         print(f"Skipped '{params['flow_name']}' (not found or delete failed): {e}")

print("Delete operation complete (if uncommented)")

## Part 2: Create Flow Template Function

This function encapsulates the entire flow creation pattern from Lab 01, making it reusable with different parameters.

### Cell 7: Define Flow Creation Function

**This is the template pattern.** The function accepts parameters and builds a complete pipeline in six steps (seven stages in total, because step 6 pairs a Peek stage with the Sequential file output).

In [None]:
def create_transaction_flow(flow_name, filter_condition, mysql_table, csv_filename):
    """
    Create a transaction processing flow with specified parameters.
    
    Args:
        flow_name: Name for the flow
        filter_condition: SQL-like filter condition (e.g., 'amount > 100')
        mysql_table: Target MySQL table name
        csv_filename: Output CSV filename
    
    Returns:
        Created flow object
    """
    
    print(f"\nCreating flow: {flow_name}")
    print(f"  Filter: {filter_condition}")
    print(f"  MySQL Table: {mysql_table}")
    print(f"  CSV File: {csv_filename}")
    
    # Create flow container
    flow = project.create_flow(name=flow_name, environment=None, flow_type="datastage")
    
    # Stage 1: PostgreSQL Source
    transactions_1 = flow.add_stage("PostgreSQL", "transactions_1")
    transactions_1.configuration.runtime_column_propagation = False
    transactions_1.configuration.table_name = "transactions"
    transactions_1.configuration.connection.name = "PostgreSQL_conn"
    transactions_1.configuration.connection.database = "cpd"
    transactions_1.configuration.connection.defer_credentials = False
    transactions_1.configuration.connection.hostname_or_ip_address = "52.116.198.152"
    transactions_1.configuration.connection.password = "DataDuck!"
    transactions_1.configuration.connection.port = "5432"
    transactions_1.configuration.connection.proxy = False
    transactions_1.configuration.connection.port_is_ssl_enabled = False
    transactions_1.configuration.connection.username = "cpd"
    
    # Stage 2: Peek (monitor input)
    peek_1 = flow.add_stage("Peek", "Peek_1")
    peek_1.configuration.runtime_column_propagation = False
    peek_1.configuration.outputlink_ordering_list = [{"link_label": "Output 1", "link_name": "Link_2"}]
    
    # Stage 3: Filter (parameterized condition)
    filter_1 = flow.add_stage("Filter", "Filter_1")
    filter_1.configuration.show_coll_type = False
    filter_1.configuration.show_part_type = True
    filter_1.configuration.show_sort_options = False
    filter_1.configuration.where_properties = [{"where": filter_condition, "target": "0"}]
    
    # Stage 4: Transformer (reshape and route)
    transformer_1 = flow.add_stage("Transformer", "Transformer_1")
    
    # Stage 5: MySQL Target
    tm_ds_db_1_1 = flow.add_stage("MySQL", "TM_DS_DB_1_1")
    tm_ds_db_1_1.configuration.column_metadata_change_propagation = False
    tm_ds_db_1_1.configuration.output_acp_should_hide = False
    tm_ds_db_1_1.configuration.schema_name = "TM_DS_DB_1"
    tm_ds_db_1_1.configuration.show_coll_type = False
    tm_ds_db_1_1.configuration.show_part_type = True
    tm_ds_db_1_1.configuration.show_sort_options = False
    tm_ds_db_1_1.configuration.table_name = mysql_table
    tm_ds_db_1_1.configuration.connection.name = "MySQL Legacy Financial DB"
    tm_ds_db_1_1.configuration.connection.database = "TM_DS_DB_1"
    tm_ds_db_1_1.configuration.connection.defer_credentials = False
    tm_ds_db_1_1.configuration.connection.hostname_or_ip_address = "4d275b38-2eee-4b4d-8a88-cb022388e975.blijti4d0v0nkr55oei0.databases.appdomain.cloud"
    tm_ds_db_1_1.configuration.connection.password = "eDGxvzFX7tK_"
    tm_ds_db_1_1.configuration.connection.port = "32661"
    tm_ds_db_1_1.configuration.connection.proxy = False
    tm_ds_db_1_1.configuration.connection.port_is_ssl_enabled = True
    tm_ds_db_1_1.configuration.connection.username = "TM_DS_USER"
    
    # Stage 6a: Peek (monitor before CSV)
    peek_2 = flow.add_stage("Peek", "Peek_2")
    peek_2.configuration.outputlink_ordering_list = [{"link_label": "Output 1", "link_name": "Link_6"}]
    
    # Stage 6b: Sequential File (CSV output)
    sequential_file_1 = flow.add_stage("Sequential file", "Sequential_file_1")
    sequential_file_1.configuration.file = [f"/ds-storage/{csv_filename}"]
    sequential_file_1.configuration.first_line_is_column_names = SEQUENTIALFILE.FirstLineColumnNames.true
    sequential_file_1.configuration.null_field_value = "'NULL'"
    sequential_file_1.configuration.show_coll_type = True
    sequential_file_1.configuration.show_part_type = False
    sequential_file_1.configuration.show_sort_options = True
    
    # Connect all stages (create the graph)
    link_1 = transactions_1.connect_output_to(peek_1)
    link_1.name = "Link_1"
    
    link_2 = peek_1.connect_output_to(filter_1)
    link_2.name = "Link_2"
    
    link_3 = filter_1.connect_output_to(transformer_1)
    link_3.name = "Link_3"
    
    link_4 = transformer_1.connect_output_to(tm_ds_db_1_1)
    link_4.name = "Link_4"
    
    link_5 = transformer_1.connect_output_to(peek_2)
    link_5.name = "Link_5"
    
    link_6 = peek_2.connect_output_to(sequential_file_1)
    link_6.name = "Link_6"
    
    # Define schemas on all links
    
    # Link 1: PostgreSQL to Peek_1
    transactions_1_schema = link_1.create_schema()
    transactions_1_schema.add_field("INTEGER", "id")
    transactions_1_schema.add_field("INTEGER", "account_id")
    transactions_1_schema.add_field("VARCHAR", "timestamp").length(50)
    transactions_1_schema.add_field("NUMERIC", "amount").length(10).scale(2)
    transactions_1_schema.add_field("LONGVARCHAR", "location").length(1024)
    
    # Link 2: Peek_1 to Filter
    peek_1_schema = link_2.create_schema()
    peek_1_schema.add_field("INTEGER", "id").source("Link_1.id")
    peek_1_schema.add_field("INTEGER", "account_id").source("Link_1.account_id")
    peek_1_schema.add_field("VARCHAR", "timestamp").source("Link_1.timestamp").length(50)
    peek_1_schema.add_field("NUMERIC", "amount").source("Link_1.amount").length(10).scale(2)
    peek_1_schema.add_field("LONGVARCHAR", "location").source("Link_1.location").length(1024)
    
    # Link 3: Filter to Transformer
    filter_1_schema = link_3.create_schema()
    filter_1_schema.add_field("INTEGER", "id").source("Link_2.id")
    filter_1_schema.add_field("INTEGER", "account_id").source("Link_2.account_id")
    filter_1_schema.add_field("VARCHAR", "timestamp").source("Link_2.timestamp").length(50)
    filter_1_schema.add_field("NUMERIC", "amount").source("Link_2.amount").length(10).scale(2)
    filter_1_schema.add_field("LONGVARCHAR", "location").source("Link_2.location").length(1024)
    
    # Link 4: Transformer to MySQL (removes id field for auto-increment)
    transformer_1_schema = link_4.create_schema()
    transformer_1_schema.add_field("INTEGER", "account_id").source("Link_3.account_id")
    transformer_1_schema.add_field("VARCHAR", "timestamp").source("Link_3.timestamp").length(50)
    transformer_1_schema.add_field("NUMERIC", "amount").source("Link_3.amount").length(10).scale(2)
    transformer_1_schema.add_field("LONGVARCHAR", "location").source("Link_3.location").length(1024)
    
    # Link 5: Transformer to Peek_2 (keeps all fields)
    transformer_1_schema_2 = link_5.create_schema()
    transformer_1_schema_2.add_field("INTEGER", "id").source("Link_3.id")
    transformer_1_schema_2.add_field("INTEGER", "account_id").source("Link_3.account_id")
    transformer_1_schema_2.add_field("VARCHAR", "timestamp").source("Link_3.timestamp").length(50)
    transformer_1_schema_2.add_field("NUMERIC", "amount").source("Link_3.amount").length(10).scale(2)
    transformer_1_schema_2.add_field("LONGVARCHAR", "location").source("Link_3.location").length(1024)
    
    # Link 6: Peek_2 to CSV
    peek_2_schema = link_6.create_schema()
    peek_2_schema.add_field("INTEGER", "id").source("Link_5.id")
    peek_2_schema.add_field("INTEGER", "account_id").source("Link_5.account_id")
    peek_2_schema.add_field("VARCHAR", "timestamp").source("Link_5.timestamp").length(50)
    peek_2_schema.add_field("NUMERIC", "amount").source("Link_5.amount").length(10).scale(2)
    peek_2_schema.add_field("LONGVARCHAR", "location").source("Link_5.location").length(1024)
    
    print("  Flow structure created with 7 stages and 6 links")
    
    return flow

print("Flow creation function defined")

## Part 3: Create and Execute Flows

**This is where the automation happens.** Watch as we create three complete pipelines with a simple loop.

**Note:** You may see "unverified HTTPS request" warnings. They are expected in this lab environment and do not affect the results.

### Cell 8: Create All Flows Using the Template

In [None]:
print("="*70)
print("CREATING FLOWS - This would take approximately 60 minutes manually")
print("="*70)

created_flows = []

for i, params in enumerate(FLOW_PARAMETERS, 1):
    print(f"\n[{i}/{len(FLOW_PARAMETERS)}] Processing: {params['flow_name']}")
    
    # Create the flow using our template function
    flow = create_transaction_flow(
        flow_name=params['flow_name'],
        filter_condition=params['filter_condition'],
        mysql_table=params['mysql_table'],
        csv_filename=params['csv_filename']
    )
    
    # Save the flow to watsonx.data Integration
    project.update_flow(flow)
    print("  Flow saved to watsonx.data Integration")
    
    # Store for job creation
    created_flows.append({
        'flow': flow,
        'name': params['flow_name']
    })

print("\n" + "="*70)
print(f"SUCCESS! Created {len(created_flows)} flows in seconds")
print("="*70)
print("\nFlows created:")
for flow_info in created_flows:
    print(f"  - {flow_info['name']}")
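
The loop above stops at the first exception, which would leave later flows uncreated. A minimal sketch of one way to isolate failures so that each flow is attempted and the errors are reported at the end. `create_all` and the toy `fake_create` are hypothetical; in the lab, the function passed in would call `create_transaction_flow` and `project.update_flow`:

```python
def create_all(parameter_sets, create_fn):
    """Call create_fn(**params) for each entry; return (created, failed) lists."""
    created, failed = [], []
    for params in parameter_sets:
        try:
            created.append({"name": params["flow_name"], "flow": create_fn(**params)})
        except Exception as exc:  # broad on purpose: the SDK's exception types are not assumed here
            failed.append({"name": params["flow_name"], "error": str(exc)})
    return created, failed


# Toy stand-in for the real creation function, failing on one input
def fake_create(flow_name, filter_condition):
    if "Medium" in flow_name:
        raise ValueError("simulated failure")
    return f"<flow {flow_name}>"


demo_params = [
    {"flow_name": "LowDemo", "filter_condition": "amount > 25"},
    {"flow_name": "MediumDemo", "filter_condition": "amount > 100"},
    {"flow_name": "HighDemo", "filter_condition": "amount > 750"},
]
created, failed = create_all(demo_params, fake_create)
print(f"Created: {[c['name'] for c in created]}")
print(f"Failed: {[(f['name'], f['error']) for f in failed]}")
```

Whether to continue past a failure or stop immediately is a design choice: continuing suits independent flows like these, while stopping is safer when later flows depend on earlier ones.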

### Cell 9: Create and Execute Jobs for All Flows

**Note:** Jobs will run concurrently. Execution typically takes 1-2 minutes total for all three jobs.

In [None]:
print("\n" + "="*70)
print("CREATING AND EXECUTING JOBS")
print("="*70)

job_runs = []

for i, flow_info in enumerate(created_flows, 1):
    flow = flow_info['flow']
    flow_name = flow_info['name']
    
    print(f"\n[{i}/{len(created_flows)}] {flow_name}")
    
    # Create job
    job = project.create_job(
        name=f"{flow_name}_job",
        flow=flow
    )
    print(f"  Job created: {flow_name}_job")
    
    # Execute job
    job_run = job.start(
        name=f"{flow_name} job run",
        description="Lab 02 - Bulk flow creation"
    )
    print("  Job started")
    
    job_runs.append({
        'flow_name': flow_name,
        'job': job,
        'job_run': job_run
    })

print("\n" + "="*70)
print(f"ALL JOBS STARTED! {len(job_runs)} pipelines now processing data")
print("="*70)
print("\nJobs executing:")
for run_info in job_runs:
    print(f"  - {run_info['flow_name']}_job")
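
The cell above starts each job without waiting for it to finish. A minimal sketch of a generic polling loop that waits until every run reaches a terminal state. How to read a run's state from the SDK is not shown in this lab, so the sketch takes a `get_state` callable instead; `wait_for_runs`, `TERMINAL_STATES`, and the state names are assumptions for illustration:

```python
import time

# Assumed terminal state names; check the SDK's job-run documentation for the real values
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_runs(run_names, get_state, poll_seconds=10, timeout_seconds=600, sleep=time.sleep):
    """Poll get_state(name) until every run reaches a terminal state or the timeout passes."""
    final_states = {}
    waited = 0
    while True:
        for name in run_names:
            if name not in final_states:
                state = get_state(name)
                if state in TERMINAL_STATES:
                    final_states[name] = state
        if len(final_states) == len(run_names) or waited >= timeout_seconds:
            return final_states
        sleep(poll_seconds)
        waited += poll_seconds


# Simulated states so the sketch runs without the SDK or real waiting
_scripted = {
    "low_job": iter(["Running", "Completed"]),
    "high_job": iter(["Running", "Running", "Failed"]),
}
result = wait_for_runs(
    ["low_job", "high_job"],
    get_state=lambda name: next(_scripted[name]),
    sleep=lambda seconds: None,
)
print(result)
```

Runs missing from the returned dictionary had not finished when the timeout passed, so the caller can tell unfinished runs apart from failed ones.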

## Part 4: Summary and Verification

### Cell 10: Lab Summary and Next Steps

In [None]:
print("\n" + "="*70)
print("LAB 02 COMPLETE - WHAT YOU ACCOMPLISHED")
print("="*70)

print("\nTHE NUMBERS:")
print(f"  Flows created: {len(created_flows)}")
print(f"  Jobs executed: {len(job_runs)}")
print(f"  Total stages configured: {len(created_flows) * 7}")
print(f"  Total links defined: {len(created_flows) * 6}")
print("  Execution time: approximately 2-3 minutes")
print("  Manual UI time would be: approximately 60 minutes")

print("\nWHAT YOU LEARNED:")
print("  - Template-based flow creation pattern")
print("  - Parameterization for reusability")
print("  - Loop-based bulk operations")
print("  - Production automation workflows")
print("  - The power of SDK vs. manual UI work")

print("\nVERIFY YOUR WORK:")
print("  1. Go to watsonx.data Integration UI:")
print("     https://ca-tor.dai.cloud.ibm.com/df/home?context=df")
print("  2. Navigate to Jobs tab")
print("     Confirm all 3 jobs are running/completed")
print("  3. Navigate to Assets tab")
print("     Confirm all 3 flows exist:")
for flow_info in created_flows:
    print(f"       - {flow_info['name']}")

print("\nNEXT STEP: Lab 03 - Real-time Streaming Pipeline")
print("  Learn to process continuous data streams with:")
print("  - Kafka message consumption")
print("  - Real-time fraud detection logic")
print("  - Dual streaming outputs (Elasticsearch + SingleStore)")
print("  - Continuous pipeline monitoring")

print("\n" + "="*70)