How This New Pipeline Works

Read and Decode: The pipeline starts the same way, by reading the raw JSON messages from your Pub/Sub subscription. Each message is then decoded from bytes and parsed as JSON.

ExtractPageViews DoFn:

This is the crucial new step. Instead of processing the whole session, this DoFn iterates through the events array within each message.

If an event is a 'page_view', the DoFn parses the event's timestamp string with datetime.fromisoformat, which handles timezone offsets such as '-04:00', and converts the result to a Beam timestamp.

It then yields a TimestampedValue. This is a special Beam object that contains a value (in this case, just the number 1) and an associated timestamp. This tells Beam to use the event's time for windowing, not the time the message was processed.
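
To see why the offset matters, here is a minimal plain-Python sketch (no Beam required) of converting an ISO 8601 string to epoch seconds. The helper name event_time_seconds and the sample timestamps are made up for illustration; the pipeline's own parsing lives in ExtractPageViews.

```python
from datetime import datetime


def event_time_seconds(timestamp_str):
    """Return seconds since the Unix epoch for an ISO 8601 string with a UTC offset."""
    aware_dt = datetime.fromisoformat(timestamp_str)
    if aware_dt.tzinfo is None:
        # Without an offset the instant is ambiguous, so this sketch rejects it.
        raise ValueError("expected a timestamp with a UTC offset")
    return aware_dt.timestamp()


# The same instant written with two different offsets maps to the same event time.
eastern = event_time_seconds("2025-06-23T22:41:30-04:00")
utc = event_time_seconds("2025-06-24T02:41:30+00:00")
print(eastern == utc)  # True
```

A numeric value like this can be used as the timestamp of a TimestampedValue, so both spellings of the instant would land in the same window.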

WindowInto(window.FixedWindows(60)): This groups all the 1s that were yielded into fixed, non-overlapping 60-second windows based on their assigned event timestamps.
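
As a rough illustration of the assignment rule, a fixed window of size 60 places each timestamp in the window that starts at the timestamp rounded down to a multiple of 60 seconds. The function below is a plain-Python sketch of that arithmetic, not Beam's implementation; fixed_window_start is a hypothetical helper name, and the timestamps are small integers chosen for readability.

```python
def fixed_window_start(event_seconds, size=60):
    """Return the start (in epoch seconds) of the fixed window containing event_seconds."""
    return event_seconds - (event_seconds % size)


# Each window covers the half-open interval [start, start + 60).
for t in (0, 59, 60, 125):
    start = fixed_window_start(t)
    print(f"t={t} falls in window [{start}, {start + 60})")
```

Because the rule depends only on the event timestamp, two events with the same event time end up in the same window no matter when their messages arrive.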

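Count.Globally(): For each one-minute window, this transform counts how many elements (1s) the window contains, which gives you the total page views for that minute. The pipeline calls .without_defaults() on it, so a window with no page views produces no output rather than a default count.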
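
The combined effect of windowing and counting can be sketched in plain Python with a Counter keyed by window start. This only illustrates the result the pipeline computes, not how Beam executes it; count_per_window is a hypothetical helper and the input timestamps are invented.

```python
from collections import Counter


def count_per_window(event_seconds, size=60):
    """Count events per fixed window, keyed by window start in epoch seconds."""
    return dict(Counter(t - (t % size) for t in event_seconds))


counts = count_per_window([0, 10, 59, 60, 125])
print(counts)  # {0: 3, 60: 1, 120: 1}
```

Windows that contain no events simply do not appear in the result.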


LogAggregation DoFn:

This final DoFn receives the count for each completed window.

It accesses the window's start time via beam.DoFn.WindowParam.

It formats a user-friendly message, wraps it in a JSON log entry together with the window start and the count, and prints the JSON string. When running on Dataflow, worker output is sent to Google Cloud Logging and can be viewed on the job's log page.

In [30]:
!pip install --quiet "apache-beam[gcp]"
!pip install -q google-cloud-logging


In [None]:
# use a terminal window in your project to create this firewall rule

gcloud compute firewall-rules create dataflow-allow-egress-to-google-apis \
    --network=default \
    --action=ALLOW \
    --direction=EGRESS \
    --destination-ranges=0.0.0.0/0 \
    --rules=tcp:443 \
    --priority=1000 \
    --project=jellyfish-training-demo-6

In [None]:
# this cell runs the pipeline code as a Dataflow job

import logging
import json
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.options.pipeline_options import PipelineOptions
import time
from datetime import datetime

# =================================================================
# 1. DEFINE PIPELINE LOGIC (DoFns)
# =================================================================

class ExtractPageViews(beam.DoFn):
    """
    Parses incoming JSON from Pub/Sub and yields 1 for each 'page_view' event.
    It uses the original timestamp from the data for event-time windowing.
    """
    def process(self, element, *args, **kwargs):
        try:
            # The element from ReadFromPubSub is a byte string.
            data_str = element.decode('utf-8')
            data = json.loads(data_str)
            
            events = data.get('events', [])
            for event_data in events:
                event = event_data.get('event', {})
                if event.get('event_type') == 'page_view':
                    event_timestamp_str = event.get('timestamp')
                    if event_timestamp_str:
                        # datetime.fromisoformat() parses ISO 8601 strings with
                        # timezone offsets (like '-04:00') into a timezone-aware
                        # datetime object.
                        aware_dt_object = datetime.fromisoformat(event_timestamp_str)

                        # Convert to seconds since the Unix epoch. For an aware
                        # datetime, timestamp() accounts for the offset. A string
                        # without an offset would be interpreted in the worker's
                        # local timezone.
                        event_timestamp = aware_dt_object.timestamp()

                        # TimestampedValue accepts epoch seconds as the timestamp.
                        yield beam.window.TimestampedValue(1, event_timestamp)
                        
        except Exception as e:
            # Dataflow will automatically log errors to Cloud Logging.
            logging.error(f"Error parsing element: {e} - Element: {str(element)[:200]}")


class LogAggregation(beam.DoFn):
    """
    Formats the aggregated count and logs it via structured logging.
    """
    def process(self, element, window=beam.DoFn.WindowParam):
        window_start = window.start.to_rfc3339()
        page_view_count = element
        
        # Build a JSON log entry so the window start and the count are available
        # as separate fields wherever the log line is parsed as JSON.
        log_entry = {
            "message": (
                f"PIPELINE OUTPUT: In the 1-minute window starting at {window_start}, "
                f"there were {page_view_count} page views."
            ),
            "severity": "INFO",
            "jsonPayload": {
                "window_start": window_start,
                "page_view_count": page_view_count,
                "metric": "page_views_per_minute"
            }
        }
        # Dataflow workers forward stdout to Cloud Logging, so printing the JSON
        # string makes it visible in the job's worker logs.
        print(json.dumps(log_entry))


# =================================================================
# 2. DEFINE PIPELINE OPTIONS AND EXECUTION FOR JUPYTER NOTEBOOK
# =================================================================

def run_dataflow_from_notebook():
    """Defines and runs the Dataflow pipeline from a notebook cell."""

    # --- Manually define all pipeline options for notebook execution ---
    
    # --- REQUIRED: Replace with your details if they are different ---
    PROJECT_ID = "jellyfish-training-demo-6"
    SUBSCRIPTION_NAME = "dsl-clickstream-ddos"
    # IMPORTANT: Make sure this GCS bucket exists in your project.
    BUCKET_NAME = "jellyfish-training-demo-6" 
    REGION = "us-central1"
    
    # Construct the full subscription path
    subscription_path = f"projects/{PROJECT_ID}/subscriptions/{SUBSCRIPTION_NAME}"
    
    # Create a unique job name using a timestamp
    job_name = f"pageview-counter-notebook-job-{int(time.time())}"

    # Build the list of arguments as if they were coming from the command line.
    # This is the standard way to launch Dataflow programmatically.
    pipeline_args = [
        '--runner=DataflowRunner',
        f'--project={PROJECT_ID}',
        f'--region={REGION}',
        f'--job_name={job_name}',
        f'--temp_location=gs://{BUCKET_NAME}/temp',
        f'--staging_location=gs://{BUCKET_NAME}/staging',
        '--streaming' # Enable streaming mode
    ]
    
    pipeline_options = PipelineOptions(pipeline_args)

    print(f"--- Submitting Dataflow Streaming Pipeline: {job_name} ---")
    print(f"--- Reading from: {subscription_path} ---")

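    # Leaving the 'with' block runs the pipeline and waits for it to finish.
    # For a streaming job, this cell keeps running until the job is cancelled
    # or the cell is interrupted.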
    with beam.Pipeline(options=pipeline_options) as pipeline:
        (
            pipeline
            # The Dataflow runner fully supports ReadFromPubSub.
            | 'Read from Pub/Sub' >> beam.io.ReadFromPubSub(subscription=subscription_path)
            | 'Extract Page Views' >> beam.ParDo(ExtractPageViews())
            | 'Window into One Minute Batches' >> beam.WindowInto(window.FixedWindows(60))
            | 'Count Events per Minute' >> beam.combiners.Count.Globally().without_defaults()
            | 'Format and Log Count' >> beam.ParDo(LogAggregation())
        )
    print("--- Job submitted successfully! You can monitor it in the Dataflow UI. ---")
    print(f"--- Job Name: {job_name} ---")


# =================================================================
# 3. SCRIPT ENTRY POINT
# =================================================================
# This block will now execute the function defined above when you run the cell.

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run_dataflow_from_notebook()


INFO:root:Runner defaulting to pickling library: cloudpickle.


--- Submitting Dataflow Streaming Pipeline: pageview-counter-notebook-job-1750732750 ---
--- Reading from: projects/jellyfish-training-demo-6/subscriptions/dsl-clickstream-ddos ---


INFO:apache_beam.runners.dataflow.dataflow_runner:Pipeline has additional dependencies to be installed in SDK worker container, consider using the SDK container image pre-building workflow to avoid repetitive installations. Learn more on https://cloud.google.com/dataflow/docs/guides/using-custom-containers#prebuild
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://jellyfish-training-demo-6/staging/pageview-counter-notebook-job-1750732750.1750732752.323969/submission_environment_dependencies.txt...
INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://jellyfish-training-demo-6/staging/pageview-counter-notebook-job-1750732750.1750732752.323969/submission_environment_dependencies.txt in 0 seconds.
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://jellyfish-training-demo-6/staging/pageview-counter-notebook-job-1750732750.1750732752.323969/pipeline.pb...
INFO:apache_beam.runners.dataflow.internal.apicl

{"message": "PIPELINE OUTPUT: In the 1-minute window starting at 2025-06-24T02:41:00Z, there were 58 page views.", "severity": "INFO", "jsonPayload": {"window_start": "2025-06-24T02:41:00Z", "page_view_count": 58, "metric": "page_views_per_minute"}}
PIPELINE OUTPUT: In the 1-minute window starting at 2025-06-24T02:41:00Z, there were 22 page views.
PIPELINE OUTPUT: In the 1-minute window starting at 2025-06-24T02:41:00Z, there were 15 page views.
PIPELINE OUTPUT: In the 1-minute window starting at 2025-06-24T02:42:00Z, there were 16 page views.
{"message": "PIPELINE OUTPUT: In the 1-minute window starting at 2025-06-24T02:42:00Z, there were 52 page views.", "severity": "INFO", "jsonPayload": {"window_start": "2025-06-24T02:42:00Z", "page_view_count": 52, "metric": "page_views_per_minute"}}
PIPELINE OUTPUT: In the 1-minute window starting at 2025-06-24T02:42:00Z, there were 17 page views.


INFO:apache_beam.runners.dataflow.dataflow_runner:2025-06-24T02:43:48.049Z: JOB_MESSAGE_BASIC: All workers have finished the startup processes and began to receive work requests.


PIPELINE OUTPUT: In the 1-minute window starting at 2025-06-24T02:43:00Z, there were 23 page views.
PIPELINE OUTPUT: In the 1-minute window starting at 2025-06-24T02:43:00Z, there were 20 page views.
{"message": "PIPELINE OUTPUT: In the 1-minute window starting at 2025-06-24T02:43:00Z, there were 20 page views.", "severity": "INFO", "jsonPayload": {"window_start": "2025-06-24T02:43:00Z", "page_view_count": 20, "metric": "page_views_per_minute"}}
PIPELINE OUTPUT: In the 1-minute window starting at 2025-06-24T02:44:00Z, there were 19 page views.
{"message": "PIPELINE OUTPUT: In the 1-minute window starting at 2025-06-24T02:44:00Z, there were 23 page views.", "severity": "INFO", "jsonPayload": {"window_start": "2025-06-24T02:44:00Z", "page_view_count": 23, "metric": "page_views_per_minute"}}
PIPELINE OUTPUT: In the 1-minute window starting at 2025-06-24T02:44:00Z, there were 23 page views.

PIPELINE OUTPUT: In the 1-minute window starting at 2025-06-24T02:45:00Z, there were 19 page views.


KeyboardInterrupt: 

INFO:apache_beam.runners.dataflow.dataflow_runner:2025-06-24T03:16:55.491Z: JOB_MESSAGE_BASIC: Cancel request is committed for workflow job: 2025-06-23_19_39_12-18214927796337867699.
INFO:apache_beam.runners.dataflow.dataflow_runner:2025-06-24T03:16:55.649Z: JOB_MESSAGE_BASIC: Stopping worker pool...
INFO:apache_beam.runners.dataflow.dataflow_runner:2025-06-24T03:16:55.702Z: JOB_MESSAGE_BASIC: Stopping worker pool...
INFO:apache_beam.runners.dataflow.dataflow_runner:Job 2025-06-23_19_39_12-18214927796337867699 is in state JOB_STATE_CANCELLING


In [42]:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
import time
import logging

def run_minimal_dataflow_job():
    # --- Manually define options for a minimal test job ---
    PROJECT_ID = "jellyfish-training-demo-6"
    BUCKET_NAME = "jellyfish-training-demo-6"
    REGION = "us-central1"
    job_name = f"minimal-test-job-{int(time.time())}"

    pipeline_args = [
        '--runner=DataflowRunner',
        f'--project={PROJECT_ID}',
        f'--region={REGION}',
        f'--job_name={job_name}',
        f'--temp_location=gs://{BUCKET_NAME}/temp',
        f'--staging_location=gs://{BUCKET_NAME}/staging',
        # NOTE: This is a batch job, so no '--streaming' flag is needed.
    ]

    pipeline_options = PipelineOptions(pipeline_args)

    print(f"--- Submitting MINIMAL test job: {job_name} ---")

    try:
        with beam.Pipeline(options=pipeline_options) as pipeline:
            (
                pipeline
                | 'Create Test Data' >> beam.Create(['hello world'])
                | 'Log Output' >> beam.Map(print)
            )
        
        print("--- Minimal job submitted successfully! ---")
        print(f"--- Job Name: {job_name} ---")

    except Exception as e:
        print("--- An error occurred during submission ---")
        print(e)

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run_minimal_dataflow_job()

INFO:root:Runner defaulting to pickling library: cloudpickle.


--- Submitting MINIMAL test job: minimal-test-job-1750734980 ---


INFO:apache_beam.runners.dataflow.dataflow_runner:Pipeline has additional dependencies to be installed in SDK worker container, consider using the SDK container image pre-building workflow to avoid repetitive installations. Learn more on https://cloud.google.com/dataflow/docs/guides/using-custom-containers#prebuild
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://jellyfish-training-demo-6/staging/minimal-test-job-1750734980.1750734982.540984/submission_environment_dependencies.txt...
INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://jellyfish-training-demo-6/staging/minimal-test-job-1750734980.1750734982.540984/submission_environment_dependencies.txt in 0 seconds.
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://jellyfish-training-demo-6/staging/minimal-test-job-1750734980.1750734982.540984/pipeline.pb...
INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://jelly

--- Minimal job submitted successfully! ---
--- Job Name: minimal-test-job-1750734980 ---
PIPELINE OUTPUT: In the 1-minute window starting at 2025-06-24T03:36:00Z, there were 14 page views.
{"message": "PIPELINE OUTPUT: In the 1-minute window starting at 2025-06-24T03:36:00Z, there were 27 page views.", "severity": "INFO", "jsonPayload": {"window_start": "2025-06-24T03:36:00Z", "page_view_count": 27, "metric": "page_views_per_minute"}}
PIPELINE OUTPUT: In the 1-minute window starting at 2025-06-24T03:36:00Z, there were 6 page views.
PIPELINE OUTPUT: In the 1-minute window starting at 2025-06-24T03:37:00Z, there were 21 page views.
{"message": "PIPELINE OUTPUT: In the 1-minute window starting at 2025-06-24T03:37:00Z, there were 40 page views.", "severity": "INFO", "jsonPayload": {"window_start": "2025-06-24T03:37:00Z", "page_view_count": 40, "metric": "page_views_per_minute"}}
PIPELINE OUTPUT: In the 1-minute window starting at 2025-06-24T03:37:00Z, there were 22 page views.
PIPELINE OU

In [None]:
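# run this from a terminal; it assumes the pipeline code is saved as ddos_pipeline.py
# and that the script accepts a --subscription argument

python3 ddos_pipeline.py \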
    --runner DataflowRunner \
    --project jellyfish-training-demo-6 \
    --region us-central1 \
    --temp_location gs://jellyfish-training-demo-6/temp \
    --job_name "pageview-counter-final-job-$(date +'%Y%m%d-%H%M%S')" \
    --subscription projects/jellyfish-training-demo-6/subscriptions/dsl-clickstream-ddos
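
The notebook does not show ddos_pipeline.py itself, so the following is only a sketch of how such a script might separate its own --subscription flag from the Beam and Dataflow flags. It uses argparse from the standard library; the function name parse_args and the help text are assumptions, not taken from the original script.

```python
import argparse


def parse_args(argv):
    """Split argv into the script's own subscription flag and the remaining pipeline flags."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--subscription",
        required=True,
        help="Full Pub/Sub subscription path to read from.",
    )
    # parse_known_args leaves flags it does not recognise (runner, project, ...) untouched.
    known_args, pipeline_args = parser.parse_known_args(argv)
    return known_args.subscription, pipeline_args


subscription, pipeline_args = parse_args([
    "--runner", "DataflowRunner",
    "--region", "us-central1",
    "--subscription", "projects/jellyfish-training-demo-6/subscriptions/dsl-clickstream-ddos",
])
print(subscription)
print(pipeline_args)
```

In a script structured this way, the remaining pipeline_args list could be handed to PipelineOptions(pipeline_args), the same way the notebook cell above builds its options from a list of flags.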