# 24 - NiFi + Kafka + Spark Integration

Integrating Apache NiFi with Kafka and Spark Structured Streaming in the context of the MovieLens recommender system.

**Topics:**
- NiFi as a Kafka producer (PublishKafka processor)
- Spark Structured Streaming consuming from Kafka
- Full pipeline: Source -> NiFi -> Kafka -> Spark -> HDFS/PostgreSQL
- NiFi Site-to-Site protocol
- NiFi Registry - flow versioning
- Error handling and retry in NiFi
- End-to-end pipeline monitoring
- Final exercise: end-to-end pipeline for new ratings

## 1. End-to-end pipeline architecture

```
  ┌────────────┐     ┌─────────────────────────┐     ┌─────────┐     ┌──────────────────┐
  │   Source   │     │       Apache NiFi       │     │  Kafka  │     │ Spark Streaming  │
  │            │     │                         │     │         │     │                  │
  │ CSV files  │────>│ GetFile -> Transform    │────>│  topic: │────>│ readStream       │
  │ REST API   │     │         -> PublishKafka │     │ ratings │     │   .format(kafka) │
  │ Database   │     │                         │     │         │     │   .groupBy()     │
  └────────────┘     └─────────────────────────┘     └─────────┘     │   .writeStream   │
                                                                     └────────┬─────────┘
                                                                              │
                                                              ┌───────────────┼───────────────┐
                                                              ▼               ▼               ▼
                                                        ┌──────────┐  ┌──────────────┐  ┌───────────┐
                                                        │   HDFS   │  │  PostgreSQL  │  │  Console  │
                                                        │ (parquet)│  │ (aggregates) │  │  (debug)  │
                                                        └──────────┘  └──────────────┘  └───────────┘
```

### Why put Kafka between NiFi and Spark?
- **Buffering** - Kafka retains messages (within the topic's retention limits) even while Spark is unavailable
- **Decoupling** - NiFi and Spark can scale independently
- **Replay** - data still retained in Kafka can be reprocessed
- **Multi-consumer** - multiple applications can read the same data
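
The four points above all follow from Kafka behaving like an append-only log with per-consumer offsets. The sketch below illustrates that idea in memory only. `TinyLog` and its methods are hypothetical names invented for this example and are not part of any Kafka client API; real Kafka adds partitions, persistence and retention on top.

```python
from collections import defaultdict


class TinyLog:
    """Toy append-only log: records stay in place, each consumer group has its own offset."""

    def __init__(self):
        self.records = []
        self.offsets = defaultdict(int)  # consumer group -> next offset to read

    def append(self, record):
        """Producer side: add a record and return its offset."""
        self.records.append(record)
        return len(self.records) - 1

    def poll(self, group, max_records=10):
        """Consumer side: return the next records for this group and advance its offset."""
        start = self.offsets[group]
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch

    def seek(self, group, offset):
        """Move a group's offset, for example back to 0 to replay everything."""
        self.offsets[group] = offset


log = TinyLog()
for rating in ["r1", "r2", "r3"]:
    log.append(rating)               # buffering: written while no consumer is running

spark_batch = log.poll("spark")      # first consumer reads everything so far
audit_batch = log.poll("audit")      # multi-consumer: a second group reads the same data
log.seek("spark", 0)                 # replay: rewind one group only
replayed = log.poll("spark")
```

Because reading never removes records, the producer and each consumer can run and scale on their own schedule, which is the decoupling the list refers to.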

## 2. Setup

In [None]:
import requests
import json
import time
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# === NiFi API ===
NIFI_API = "https://nifi:8443/nifi-api"
nifi = requests.Session()
nifi.verify = False

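def get_nifi_token(username="admin", password="admin123456789"):
    """Request a bearer token and attach it to the session; continue without auth on failure."""
    try:
        resp = nifi.post(
            f"{NIFI_API}/access/token",
            data={"username": username, "password": password},
            headers={"Content-Type": "application/x-www-form-urlencoded"}
        )
        if resp.status_code == 201:
            nifi.headers.update({"Authorization": f"Bearer {resp.text}"})
            print("NiFi: token obtained")
        else:
            # Previously a non-201 response was ignored silently
            print(f"NiFi: token request returned HTTP {resp.status_code}, continuing without auth")
    except requests.RequestException as e:
        print(f"NiFi: continuing without auth ({e})")

get_nifi_token()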

def _nifi_request(method, path, **kwargs):
    """Send a request to the NiFi REST API and return the decoded JSON body."""
    r = nifi.request(method, f"{NIFI_API}{path}", **kwargs)
    r.raise_for_status()
    return r.json()

def nifi_get(path):
    return _nifi_request("GET", path)

def nifi_post(path, data):
    # json= serializes the body and sets Content-Type: application/json
    return _nifi_request("POST", path, json=data)

def nifi_put(path, data):
    return _nifi_request("PUT", path, json=data)

def nifi_delete(path, params=None):
    return _nifi_request("DELETE", path, params=params)

root_flow = nifi_get("/flow/process-groups/root")
ROOT_PG_ID = root_flow["processGroupFlow"]["id"]
print(f"NiFi Root PG: {ROOT_PG_ID}")

# Connection test
diag = nifi_get("/system-diagnostics")
print(f"NiFi Heap: {diag['systemDiagnostics']['aggregateSnapshot']['heapUtilization']}")

In [None]:
from pyspark.sql import SparkSession
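# Note: this star import shadows Python builtins such as max, min and sum;
# later cells rely on the Spark versions of count, avg, max and min.
from pyspark.sql.functions import *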
from pyspark.sql.types import *

spark = SparkSession.builder \
    .appName("24_NiFi_Kafka_Spark") \
    .master("spark://spark-master:7077") \
    .config("spark.jars.packages",
            "org.postgresql:postgresql:42.7.1,"
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0") \
    .config("spark.driver.memory", "6g") \
    .config("spark.executor.memory", "7g") \
    .config("spark.driver.host", "recommender-jupyter") \
    .config("spark.driver.bindAddress", "0.0.0.0") \
    .getOrCreate()

HDFS_URL = "hdfs://namenode:9000"
jdbc_url = "jdbc:postgresql://postgres:5432/recommender"
jdbc_props = {"user": "recommender", "password": "recommender", "driver": "org.postgresql.Driver"}

print(f"Spark UI: {spark.sparkContext.uiWebUrl}")
print("Spark + Kafka ready!")

## 3. NiFi as a Kafka producer

We create a NiFi flow that:
1. Reads CSV files with MovieLens data
2. Splits them into lines (records)
3. Converts each line to JSON
4. Publishes to the Kafka topic `movielens-ratings`
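
Steps 2 and 3 can be tried outside NiFi first. The sketch below applies the same regular expression that the ReplaceText processor in this flow uses, in plain Python. The function names and the sample rows are made up for illustration; dropping lines that do not match is a choice of this sketch, not a statement about how NiFi treats them.

```python
import json
import re

# Same pattern as the one configured on ReplaceText in this flow.
CSV_LINE = re.compile(r"^(.*?),(.*?),(.*?),(.*?)$")
# Python writes back-references as \1..\4 where NiFi (Java regex) uses $1..$4.
TEMPLATE = r'{"user_id":"\1","movie_id":"\2","rating":"\3","timestamp":"\4"}'


def csv_line_to_json(line):
    """Return the JSON text for one CSV line, or None if the line does not match."""
    if not CSV_LINE.match(line):
        return None
    return CSV_LINE.sub(TEMPLATE, line)


def csv_to_json_records(text, header_lines=1):
    """Skip the header, then convert each remaining line (SplitText + ReplaceText)."""
    records = []
    for line in text.splitlines()[header_lines:]:
        converted = csv_line_to_json(line)
        if converted is not None:
            records.append(converted)
    return records


sample = (
    "userId,movieId,rating,timestamp\n"
    "1,31,4.5,1260759144\n"
    "1,1029,3.0,1260759179\n"
)
records = csv_to_json_records(sample)
first = json.loads(records[0])
```

Every value ends up as a JSON string (`"4.5"`, not `4.5`) because the replacement template quotes each capture group. That is why the Spark schema later declares all four fields as `StringType` and casts them afterwards.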

In [None]:
# Helpers for creating processors and connections
def create_processor(pg_id, proc_type, name, position, config=None, auto_terminate=None):
    body = {
        "revision": {"version": 0},
        "component": {
            "type": proc_type,
            "name": name,
            "position": position,
            "config": {}
        }
    }
    if config:
        body["component"]["config"]["properties"] = config
    if auto_terminate:
        body["component"]["config"]["autoTerminatedRelationships"] = auto_terminate
    result = nifi_post(f"/process-groups/{pg_id}/processors", body)
    print(f"  + {name} ({result['id'][:8]}...)")
    return result

def create_connection(pg_id, src_id, dst_id, rels, name=""):
    body = {
        "revision": {"version": 0},
        "component": {
            "name": name,
            "source": {"id": src_id, "type": "PROCESSOR", "groupId": pg_id},
            "destination": {"id": dst_id, "type": "PROCESSOR", "groupId": pg_id},
            "selectedRelationships": rels,
            "backPressureObjectThreshold": 10000,
            "backPressureDataSizeThreshold": "100 MB"
        }
    }
    return nifi_post(f"/process-groups/{pg_id}/connections", body)

# === Build the flow: CSV -> Kafka ===
pg = nifi_post(f"/process-groups/{ROOT_PG_ID}/process-groups", {
    "revision": {"version": 0},
    "component": {"name": "Ratings to Kafka", "position": {"x": 100, "y": 100}}
})
KAFKA_PG_ID = pg["id"]
print(f"Process Group: {KAFKA_PG_ID}\n")

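# 1. GetFile - read the CSV
# Keep Source File=true leaves the file in place, so GetFile picks it up again
# on every run; stop the processor after the first pass to avoid duplicates.
p_getfile = create_processor(
    KAFKA_PG_ID,
    "org.apache.nifi.processors.standard.GetFile",
    "Read Ratings CSV",
    {"x": 100, "y": 100},
    config={"Input Directory": "/data/raw/movielens", "File Filter": "rating\\.csv",
            "Keep Source File": "true"}
)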

# 2. SplitText - split the CSV into lines (1 FlowFile = 1 rating)
p_split = create_processor(
    KAFKA_PG_ID,
    "org.apache.nifi.processors.standard.SplitText",
    "Split CSV Lines",
    {"x": 100, "y": 300},
    config={"Line Split Count": "1", "Header Line Count": "1",
            "Remove Trailing Newlines": "true"},
    auto_terminate=["original", "failure"]
)

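# 3. ReplaceText - convert each CSV line to JSON
p_to_json = create_processor(
    KAFKA_PG_ID,
    "org.apache.nifi.processors.standard.ReplaceText",
    "CSV to JSON",
    {"x": 100, "y": 500},
    config={
        # The REST API expects internal property names; in NiFi 1.x the UI field
        # "Search Value" is stored as "Regular Expression" (verify for your version).
        "Regular Expression": "^(.*?),(.*?),(.*?),(.*?)$",
        "Replacement Value": '{"user_id":"$1","movie_id":"$2","rating":"$3","timestamp":"$4"}',
        "Replacement Strategy": "Regex Replace"
    },
    # failure must be connected or auto-terminated, otherwise the processor is invalid
    auto_terminate=["failure"]
)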

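# 4. UpdateAttribute - set the kafka.key attribute
# Note: ${fragment.index} is the line's position within the split file, not the
# user_id. Keying by user_id would require extracting it from the content first
# (for example with ExtractText).
p_key = create_processor(
    KAFKA_PG_ID,
    "org.apache.nifi.processors.attributes.UpdateAttribute",
    "Set Kafka Key",
    {"x": 100, "y": 700},
    config={"kafka.key": "${fragment.index}"}
)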

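# 5. PublishKafka - send to Kafka
p_kafka = create_processor(
    KAFKA_PG_ID,
    "org.apache.nifi.processors.kafka.pubsub.PublishKafka_2_6",
    "Publish to Kafka",
    {"x": 100, "y": 900},
    config={
        # Internal property names as used by the REST API in NiFi 1.x
        # (assumed here; check the processor documentation for your version).
        "bootstrap.servers": "kafka:9092",
        "topic": "movielens-ratings",
        "acks": "all",                  # UI label: Guarantee Replicated Delivery
        "compression.type": "snappy",
        "max.request.size": "1 MB"
        # No key property is set: the processor falls back to the kafka.key
        # attribute. "Message Key Field" belongs to the record-based variant.
    },
    auto_terminate=["success", "failure"]
)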

# Connections
create_connection(KAFKA_PG_ID, p_getfile["id"], p_split["id"], ["success"])
create_connection(KAFKA_PG_ID, p_split["id"], p_to_json["id"], ["splits"])
create_connection(KAFKA_PG_ID, p_to_json["id"], p_key["id"], ["success"])
create_connection(KAFKA_PG_ID, p_key["id"], p_kafka["id"], ["success"])

print("\nFlow: GetFile -> SplitText -> ReplaceText -> UpdateAttribute -> PublishKafka")

## 4. Spark Structured Streaming from Kafka

Spark reads from the Kafka topic `movielens-ratings` in streaming mode.

### Kafka message format:
```
key:   "12345"  (kafka.key attribute; the flow in section 3 sets it to ${fragment.index}, not user_id)
value: {"user_id":"1","movie_id":"31","rating":"4.5","timestamp":"1260759144"}
```
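
Since every field in the message value arrives as a string, the consumer has to parse and cast it. The sketch below shows that step in plain Python for a single message. It only approximates what the Spark code in this section does with `from_json` and `cast`: under default settings Spark yields nulls rather than raising on bad input, but its exact casting rules differ (for example for `"4.5"` cast to an integer). `parse_rating` and `_cast` are hypothetical helper names.

```python
import json


def _cast(value, to_type):
    """Lenient cast: return None instead of raising on missing or malformed input."""
    try:
        return to_type(value)
    except (TypeError, ValueError):
        return None


def parse_rating(value_bytes):
    """Decode one Kafka message value into typed fields; bad JSON yields all-None fields."""
    try:
        data = json.loads(value_bytes.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {
        "user_id": _cast(data.get("user_id"), int),
        "movie_id": _cast(data.get("movie_id"), int),
        "rating": _cast(data.get("rating"), float),
        "rating_ts": _cast(data.get("timestamp"), int),
    }


good = parse_rating(
    b'{"user_id":"1","movie_id":"31","rating":"4.5","timestamp":"1260759144"}'
)
broken = parse_rating(b"not json at all")
partial = parse_rating(b'{"user_id":"7","rating":"high"}')
```

Rows where a field comes back as `None` are the ones worth filtering out or routing to a side output before they reach a sink.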

In [None]:
# JSON schema of the Kafka message value
rating_schema = StructType([
    StructField("user_id", StringType(), True),
    StructField("movie_id", StringType(), True),
    StructField("rating", StringType(), True),
    StructField("timestamp", StringType(), True)
])

# Read the stream from Kafka
kafka_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "movielens-ratings") \
    .option("startingOffsets", "earliest") \
    .option("maxOffsetsPerTrigger", 10000) \
    .load()

# Parse the JSON in value
parsed_stream = kafka_stream \
    .select(
        col("key").cast("string").alias("kafka_key"),
        from_json(col("value").cast("string"), rating_schema).alias("data"),
        col("topic"),
        col("partition"),
        col("offset"),
        col("timestamp").alias("kafka_timestamp")
    ) \
    .select(
        "kafka_key",
        col("data.user_id").cast("integer").alias("user_id"),
        col("data.movie_id").cast("integer").alias("movie_id"),
        col("data.rating").cast("double").alias("rating"),
        col("data.timestamp").cast("long").alias("rating_ts"),
        "topic", "partition", "offset", "kafka_timestamp"
    )

print("Stream schema:")
parsed_stream.printSchema()

In [None]:
# Output 1: Console (debug) - check that data is flowing
console_query = parsed_stream \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .option("truncate", False) \
    .option("numRows", 10) \
    .trigger(processingTime="10 seconds") \
    .queryName("ratings_console") \
    .start()

print("Console stream started.")
print("Start the NiFi flow to send data to Kafka!")
print(f"Query status: {console_query.status}")

# Wait for a few micro-batches
time.sleep(30)
console_query.stop()
print("Stream stopped.")

## 5. Full Pipeline: NiFi -> Kafka -> Spark -> HDFS + PostgreSQL

Now we build the complete pipeline with two sinks:
- **HDFS**: raw data in Parquet format (archive)
- **PostgreSQL**: aggregates computed per micro-batch, in near real time
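
One consequence of the PostgreSQL sink in this section: it appends one row per movie for every micro-batch, so a single movie ends up with several rows. To get an overall average, the per-batch averages have to be weighted by their counts. The sketch below shows that merge in plain Python. The column names match the aggregates written by the sink; `combine_batch_stats` is a hypothetical helper. It avoids the builtins `max` and `min` on purpose, because the notebook's star import from `pyspark.sql.functions` shadows them.

```python
def combine_batch_stats(rows):
    """Merge per-batch rows into one summary per movie (count-weighted average)."""
    totals = {}
    for row in rows:
        t = totals.get(row["movie_id"])
        if t is None:
            t = {"count": 0, "weighted_sum": 0.0,
                 "max": row["batch_max_rating"], "min": row["batch_min_rating"]}
            totals[row["movie_id"]] = t
        t["count"] += row["batch_count"]
        # avg * count recovers the sum of ratings in that batch
        t["weighted_sum"] += row["batch_avg_rating"] * row["batch_count"]
        if row["batch_max_rating"] > t["max"]:
            t["max"] = row["batch_max_rating"]
        if row["batch_min_rating"] < t["min"]:
            t["min"] = row["batch_min_rating"]
    return {
        movie_id: {
            "count": t["count"],
            "avg_rating": t["weighted_sum"] / t["count"],
            "max_rating": t["max"],
            "min_rating": t["min"],
        }
        for movie_id, t in totals.items()
    }


rows = [
    {"movie_id": 31, "batch_count": 2, "batch_avg_rating": 4.0,
     "batch_max_rating": 5.0, "batch_min_rating": 3.0},
    {"movie_id": 31, "batch_count": 1, "batch_avg_rating": 1.0,
     "batch_max_rating": 1.0, "batch_min_rating": 1.0},
    {"movie_id": 50, "batch_count": 4, "batch_avg_rating": 3.5,
     "batch_max_rating": 4.0, "batch_min_rating": 3.0},
]
summary = combine_batch_stats(rows)
```

A plain average of the two rows for movie 31 would give 2.5, while the weighted result is 3.0, which matches the three underlying ratings.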

In [None]:
# Sink 1: HDFS Parquet (append mode, partitioned by date)
hdfs_stream = parsed_stream \
    .withColumn("date", to_date(col("kafka_timestamp"))) \
    .select("user_id", "movie_id", "rating", "rating_ts", "date") \
    .writeStream \
    .outputMode("append") \
    .format("parquet") \
    .option("path", f"{HDFS_URL}/data/movielens/streaming/ratings") \
    .option("checkpointLocation", f"{HDFS_URL}/checkpoints/ratings_hdfs") \
    .partitionBy("date") \
    .trigger(processingTime="30 seconds") \
    .queryName("ratings_to_hdfs") \
    .start()

print("Stream -> HDFS started")
print(f"Output: {HDFS_URL}/data/movielens/streaming/ratings/")
print(f"Checkpoint: {HDFS_URL}/checkpoints/ratings_hdfs/")

In [None]:
# Sink 2: PostgreSQL (per-movie aggregates for each micro-batch)

def write_to_postgres(batch_df, batch_id):
    """Write the aggregates of one micro-batch to PostgreSQL."""
    if batch_df.isEmpty():
        return

    # Cache the batch: it is scanned for the aggregation and again for the counts
    batch_df.persist()
    try:
        # Per-movie aggregates (count/avg/max/min come from pyspark.sql.functions)
        movie_stats = batch_df.groupBy("movie_id").agg(
            count("*").alias("batch_count"),
            avg("rating").alias("batch_avg_rating"),
            max("rating").alias("batch_max_rating"),
            min("rating").alias("batch_min_rating")
        )

        # Append to PostgreSQL (table streaming_movie_stats).
        # A micro-batch replayed after a failure is appended again, so duplicates are possible.
        movie_stats.write \
            .mode("append") \
            .jdbc(jdbc_url, "movielens.streaming_movie_stats", properties=jdbc_props)

        print(f"  Batch {batch_id}: {batch_df.count()} ratings -> "
              f"{movie_stats.count()} movie stats")
    finally:
        batch_df.unpersist()

pg_stream = parsed_stream \
    .select("user_id", "movie_id", "rating") \
    .writeStream \
    .outputMode("append") \
    .foreachBatch(write_to_postgres) \
    .option("checkpointLocation", f"{HDFS_URL}/checkpoints/ratings_pg") \
    .trigger(processingTime="30 seconds") \
    .queryName("ratings_to_postgres") \
    .start()

print("Stream -> PostgreSQL started")

In [None]:
# Monitor the active streams
print("=== Active streams ===")
for q in spark.streams.active:
    status = q.status
    progress = q.lastProgress
    print(f"\n  Query: {q.name}")
    print(f"  Status: {'active' if q.isActive else 'stopped'}")
    print(f"  Message: {status.get('message', 'n/a')}")
    if progress:
        print(f"  Rows/sec: {progress.get('processedRowsPerSecond', 0):.0f}")
        print(f"  Batch ID: {progress.get('batchId', 'n/a')}")
        sources = progress.get('sources', [{}])
        if sources:
            s = sources[0]
            print(f"  Start offset: {s.get('startOffset', 'n/a')}")
            print(f"  End offset: {s.get('endOffset', 'n/a')}")

# Wait for processing
print("\nWaiting 60 seconds for the data to be processed...")
time.sleep(60)

# Stop the streams
for q in spark.streams.active:
    q.stop()
    print(f"Stream '{q.name}' stopped.")

## 6. NiFi Site-to-Site Protocol

Site-to-Site (S2S) is NiFi's native protocol for transferring data between NiFi instances or from external applications.

### S2S compared with Kafka:
| Feature | Site-to-Site | Kafka |
|---------|--------------|-------|
| Configuration | Minimal | Requires a broker |
| Back pressure | Native | Configurable |
| Provenance | Full lineage | Offsets only |
| Multi-consumer | No | Yes |
| Replay | No | Yes (retention) |
| Use case | NiFi-to-NiFi | General purpose |

### S2S configuration (`nifi.properties`):
```
nifi.remote.input.http.enabled=true
nifi.remote.input.socket.port=10000
```

In [None]:
# Check the Site-to-Site configuration
try:
    s2s_info = nifi_get("/site-to-site")
    print("=== Site-to-Site Configuration ===")
    print(f"  Controller:  {s2s_info.get('controller', {})}")
except Exception as e:
    print(f"S2S info: {e}")

# Create a Remote Process Group (RPG): a connection to another NiFi instance
# An RPG lets you transfer data between NiFi clusters

def create_remote_process_group(pg_id, target_uri, transport="HTTP"):
    """Create a Remote Process Group for S2S communication."""
    body = {
        "revision": {"version": 0},
        "component": {
            "targetUris": target_uri,
            "communicationsTimeout": "30 sec",
            "yieldDuration": "10 sec",
            "transportProtocol": transport,
            "position": {"x": 400, "y": 100}
        }
    }
    return nifi_post(f"/process-groups/{pg_id}/remote-process-groups", body)

# Example: connect to another NiFi instance
# rpg = create_remote_process_group(ROOT_PG_ID, "https://nifi-remote:8443/nifi")
# print(f"RPG created: {rpg['id']}")

print("\nSite-to-Site is useful when you run multiple NiFi instances")
print("e.g. Edge NiFi (collects data) -> Central NiFi (processes it)")

## 7. NiFi Registry - flow versioning

NiFi Registry is a separate service for versioning flows (like Git for NiFi).

### Workflow:
1. Create a bucket in the Registry
2. Save the Process Group to the Registry (commit)
3. Edit the flow in NiFi
4. Save a new version (commit)
5. Roll back to an earlier version if needed (revert)

```
NiFi Canvas  ───commit───>  NiFi Registry
    │                           │
    │<──────revert─────────────┘
    │
    │         ┌─ v1.0 (initial)
    │         ├─ v1.1 (added transform)
    │         ├─ v2.0 (added Kafka output)
    │         └─ v2.1 (fixed error handling)
```

In [None]:
# NiFi Registry API (separate service, default port 18080)
REGISTRY_API = "http://nifi-registry:18080/nifi-registry-api"

def registry_get(path):
    r = requests.get(f"{REGISTRY_API}{path}"); r.raise_for_status(); return r.json()

def registry_post(path, data):
    r = requests.post(f"{REGISTRY_API}{path}", json=data,
                      headers={"Content-Type": "application/json"})
    r.raise_for_status(); return r.json()

# 1. List the existing buckets
buckets = []
try:
    buckets = registry_get("/buckets")
    print(f"Existing buckets ({len(buckets)}):")
    for b in buckets:
        print(f"  {b['name']} (id: {b['identifier']})")
except Exception as e:
    print(f"Registry unavailable: {e}")
    print("Make sure the nifi-registry container is running.")

# 2. Create the "movielens-flows" bucket (reuse it if an earlier run already created it)
BUCKET_ID = next((b["identifier"] for b in buckets if b["name"] == "movielens-flows"), None)
if BUCKET_ID is None:
    try:
        bucket = registry_post("/buckets", {
            "name": "movielens-flows",
            "description": "Flow definitions for MovieLens recommender system"
        })
        BUCKET_ID = bucket["identifier"]
        print(f"\nBucket created: {BUCKET_ID}")
    except Exception as e:
        print(f"Could not create the bucket: {e}")
else:
    print(f"\nBucket already exists: {BUCKET_ID}")

In [None]:
# 3. Connect NiFi to the Registry
# First add a Registry Client in NiFi

REGISTRY_CLIENT_ID = None
try:
    # Check for existing clients
    clients = nifi_get("/controller/registry-clients")
    existing = clients.get("registries", [])
    print(f"Existing Registry Clients: {len(existing)}")

    if not existing:
        # Add a new client. This payload shape is the one used by NiFi 1.x;
        # newer NiFi releases may expect a different one.
        client = nifi_post("/controller/registry-clients", {
            "revision": {"version": 0},
            "component": {
                "name": "Local NiFi Registry",
                "uri": "http://nifi-registry:18080"
            }
        })
        REGISTRY_CLIENT_ID = client["id"]
        print(f"Registry Client added: {REGISTRY_CLIENT_ID}")
    else:
        REGISTRY_CLIENT_ID = existing[0]["id"]
        print(f"Registry Client already exists: {REGISTRY_CLIENT_ID}")

except Exception as e:
    print(f"Registry client: {e}")

# 4. Save the Process Group to the Registry (start version control)
if BUCKET_ID and REGISTRY_CLIENT_ID and KAFKA_PG_ID:
    try:
        # Use the group's current revision instead of assuming version 0
        pg_entity = nifi_get(f"/process-groups/{KAFKA_PG_ID}")
        vc = nifi_post(f"/versions/process-groups/{KAFKA_PG_ID}", {
            "processGroupRevision": pg_entity["revision"],
            "versionedFlow": {
                "registryId": REGISTRY_CLIENT_ID,  # tells NiFi which registry client to use
                "bucketId": BUCKET_ID,
                "flowName": "ratings-to-kafka",
                "description": "Pipeline: CSV ratings -> Kafka topic",
                "comments": "Initial version"
            }
        })
        print(f"\nFlow saved to the Registry!")
        print(f"  Version: {vc.get('versionControlInformation', {}).get('version', '?')}")
    except Exception as e:
        print(f"Version control: {e}")

## 8. Error handling and retry in NiFi

NiFi has built-in error-handling mechanisms:

### Strategies:
1. **Failure relationship** - most processors expose a `failure` relationship that can be routed to:
   - Retry (connect failure back to the same processor)
   - Dead letter queue (a separate PutFile/PutHDFS processor)
   - Alert (LogMessage, PutEmail)

2. **Penalty duration** - a FlowFile that caused an error is "penalized": it is not processed again until the penalty expires (default 30 s)

3. **Yield duration** - a processor that hit an error waits before it is scheduled again (default 1 s)

4. **Bulletin** - an error message visible in the UI
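
The retry-then-dead-letter strategy is easier to reason about as plain control flow. The sketch below simulates it in Python, assuming a counter attribute named `retry.count` and a limit of 3, the values used by the flow in this section. `process_with_retry` is a hypothetical function for illustration; it does not model penalty or yield delays.

```python
MAX_ATTEMPTS = 3  # mirrors a ${retry.count:lt(3)} routing rule


def process_with_retry(flowfile_attrs, call, max_attempts=MAX_ATTEMPTS):
    """Run `call` until it succeeds or the retry counter reaches the limit.

    Returns ("success", attempts) or ("dead_letter", attempts).
    """
    attrs = dict(flowfile_attrs)
    while True:
        try:
            call(attrs)
            return "success", int(attrs.get("retry.count", 0)) + 1
        except Exception:
            # UpdateAttribute step: retry.count = (retry.count or 0) + 1
            attrs["retry.count"] = int(attrs.get("retry.count", 0)) + 1
            # RouteOnAttribute step: retry only while retry.count < max_attempts
            if attrs["retry.count"] >= max_attempts:
                return "dead_letter", attrs["retry.count"]


def always_fail(attrs):
    raise RuntimeError("service unavailable")


calls = {"n": 0}


def fail_twice_then_succeed(attrs):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")


dead = process_with_retry({}, always_fail)
recovered = process_with_retry({}, fail_twice_then_succeed)
```

Because the counter is incremented before the check, a limit of 3 means three attempts in total (one initial call plus two retries), after which the FlowFile goes to the dead letter queue.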

In [None]:
# Build a flow with error handling
err_pg = nifi_post(f"/process-groups/{ROOT_PG_ID}/process-groups", {
    "revision": {"version": 0},
    "component": {"name": "Error Handling Example", "position": {"x": 100, "y": 400}}
})
ERR_PG_ID = err_pg["id"]
print(f"Error Handling PG: {ERR_PG_ID}\n")

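# Main processor (may fail)
# Note: nothing feeds this processor in the example; connect a source of
# rating FlowFiles before starting it.
p_invoke = create_processor(
    ERR_PG_ID,
    "org.apache.nifi.processors.standard.InvokeHTTP",
    "Call Rating API",
    {"x": 100, "y": 100},
    config={
        "HTTP Method": "POST",
        "Remote URL": "http://api:8000/ratings",
        "Content-Type": "application/json"
    },
    # Original is not connected anywhere, so it has to be auto-terminated
    auto_terminate=["Original"]
)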

# Retry (UpdateAttribute with an attempt counter)
p_retry_count = create_processor(
    ERR_PG_ID,
    "org.apache.nifi.processors.attributes.UpdateAttribute",
    "Increment Retry Count",
    {"x": 400, "y": 100},
    config={
        "retry.count": "${retry.count:replaceNull('0'):plus(1)}"
    }
)

# Router: retry vs dead letter
p_retry_router = create_processor(
    ERR_PG_ID,
    "org.apache.nifi.processors.standard.RouteOnAttribute",
    "Check Retry Limit",
    {"x": 400, "y": 300},
    config={
        "Routing Strategy": "Route to Property name",
        "can_retry": "${retry.count:lt(3)}",   # at most 3 attempts in total
    }
    # unmatched is not auto-terminated: it is connected to the dead letter queue
)

# Dead letter queue
p_dead_letter = create_processor(
    ERR_PG_ID,
    "org.apache.nifi.processors.standard.PutFile",
    "Dead Letter Queue",
    {"x": 400, "y": 500},
    config={
        "Directory": "/data/dead_letter/${now():format('yyyy-MM-dd')}",
        "Conflict Resolution Strategy": "replace"
    },
    auto_terminate=["success", "failure"]
)

# Success sink
p_log_success = create_processor(
    ERR_PG_ID,
    "org.apache.nifi.processors.standard.LogAttribute",
    "Log Success",
    {"x": 100, "y": 300},
    config={"Log Level": "info"},
    auto_terminate=["success"]
)

# Connections
# success -> log
create_connection(ERR_PG_ID, p_invoke["id"], p_log_success["id"],
                  ["Response"], "success")
# failure -> retry counter
create_connection(ERR_PG_ID, p_invoke["id"], p_retry_count["id"],
                  ["Failure", "No Retry", "Retry"], "on failure")
# retry counter -> router
create_connection(ERR_PG_ID, p_retry_count["id"], p_retry_router["id"],
                  ["success"], "check retry")
# router -> retry (back to invoke)
create_connection(ERR_PG_ID, p_retry_router["id"], p_invoke["id"],
                  ["can_retry"], "retry")
# router -> dead letter (unmatched = exceeded max retries)
create_connection(ERR_PG_ID, p_retry_router["id"], p_dead_letter["id"],
                  ["unmatched"], "dead letter")
# InvokeHTTP's Original relationship has no connection, so it must be
# auto-terminated on the processor for the flow to be valid.

print("\nError handling flow:")
print("  InvokeHTTP -success-> LogAttribute")
print("  InvokeHTTP -failure-> UpdateAttribute(retry++) -> RouteOnAttribute")
print("    -> can_retry -> InvokeHTTP (loop)")
print("    -> exceeded  -> PutFile (dead letter queue)")

## 9. End-to-end pipeline monitoring

We monitor the whole pipeline: NiFi -> Kafka -> Spark -> Sinks.
Each component exposes its own metrics; we collect them in a single view.
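
One metric that ties the Kafka and Spark stages together is consumer lag: how far the last processed offset is behind the newest offset in the topic. The sketch below computes it from a streaming progress dictionary. It assumes the first source reports both `endOffset` and `latestOffset` as `{topic: {partition: offset}}` maps, which is how the Kafka source reports them in recent Spark 3.x releases as far as I know; verify this on your version. `kafka_lag` is a hypothetical helper and the sample dictionary is made up.

```python
def kafka_lag(progress):
    """Sum (latestOffset - endOffset) over all partitions of the first source.

    Returns None when the progress dict does not carry both offset maps.
    """
    if not progress or not progress.get("sources"):
        return None
    source = progress["sources"][0]
    end = source.get("endOffset")
    latest = source.get("latestOffset")
    if not isinstance(end, dict) or not isinstance(latest, dict):
        return None
    lag = 0
    for topic, partitions in latest.items():
        for partition, latest_offset in partitions.items():
            behind = latest_offset - end.get(topic, {}).get(partition, 0)
            if behind > 0:  # plain comparison: the notebook shadows the builtin max
                lag += behind
    return lag


sample_progress = {
    "batchId": 12,
    "sources": [{
        "endOffset": {"movielens-ratings": {"0": 100, "1": 50}},
        "latestOffset": {"movielens-ratings": {"0": 120, "1": 50}},
    }],
}
lag = kafka_lag(sample_progress)
```

In the notebook this could be called as `kafka_lag(q.lastProgress)` for each active query. A lag that keeps growing suggests that Spark is not keeping up with what NiFi publishes.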

In [None]:
def monitor_pipeline():
    """Collect metrics from the whole pipeline."""
    print("=" * 70)
    print("  END-TO-END PIPELINE MONITORING")
    print("=" * 70)

    # --- NiFi ---
    print("\n[NiFi]")
    try:
        # /flow/status reports controller-wide component counts and queue totals;
        # the process-group status snapshot does not include running/stopped counts
        cs = nifi_get("/flow/status")["controllerStatus"]
        print(f"  Running processors:  {cs.get('runningCount', 0)}")
        print(f"  Stopped processors:  {cs.get('stoppedCount', 0)}")
        print(f"  Invalid processors:  {cs.get('invalidCount', 0)}")
        print(f"  Queued FlowFiles:    {cs.get('flowFilesQueued', 0)}")
        print(f"  Active threads:      {cs.get('activeThreadCount', 0)}")

        # Bulletins
        bulletins = nifi_get("/flow/bulletin-board")
        bl = bulletins.get("bulletinBoard", {}).get("bulletins", [])
        errors = [b for b in bl if b.get("bulletin", {}).get("level") == "ERROR"]
        print(f"  Error bulletins:     {len(errors)}")
    except Exception as e:
        print(f"  Error: {e}")
    
    # --- Kafka (via Spark) ---
    print("\n[Kafka]")
    try:
        kafka_batch = spark.read \
            .format("kafka") \
            .option("kafka.bootstrap.servers", "kafka:9092") \
            .option("subscribe", "movielens-ratings") \
            .option("startingOffsets", "earliest") \
            .option("endingOffsets", "latest") \
            .load()
        msg_count = kafka_batch.count()
        partitions = kafka_batch.select("partition").distinct().count()
        print(f"  Topic: movielens-ratings")
        print(f"  Messages:            {msg_count}")
        print(f"  Partitions:          {partitions}")
    except Exception as e:
        print(f"  Error: {e}")
    
    # --- Spark Streaming ---
    print("\n[Spark Streaming]")
    active = spark.streams.active
    print(f"  Active streams:      {len(active)}")
    for q in active:
        p = q.lastProgress
        rows_sec = p.get("processedRowsPerSecond", 0) if p else 0
        batch_id = p.get("batchId", "?") if p else "?"
        print(f"    {q.name}: batch={batch_id}, {rows_sec:.0f} rows/s")
    
    # --- HDFS ---
    print("\n[HDFS Output]")
    try:
        hdfs_data = spark.read.parquet(
            f"{HDFS_URL}/data/movielens/streaming/ratings")
        print(f"  Ratings on HDFS:     {hdfs_data.count()}")
    except Exception as e:
        print(f"  No data: {e}")
    
    # --- PostgreSQL ---
    print("\n[PostgreSQL Output]")
    try:
        pg_data = spark.read.jdbc(
            jdbc_url, "movielens.streaming_movie_stats", properties=jdbc_props)
        print(f"  Movie stats rows:    {pg_data.count()}")
        print(f"  Unique movies:       {pg_data.select('movie_id').distinct().count()}")
    except Exception as e:
        print(f"  Table not available: {e}")
    
    print("\n" + "=" * 70)

monitor_pipeline()

## 10. Cleaning up resources

In [None]:
# Stop the Spark streams
for q in spark.streams.active:
    q.stop()
    print(f"Stream '{q.name}' stopped.")

# Clean up NiFi Process Groups (uncomment the calls below)
def cleanup_pg(pg_id):
    """Stop and delete a Process Group."""
    try:
        nifi_put(f"/flow/process-groups/{pg_id}", {"id": pg_id, "state": "STOPPED"})
        time.sleep(2)
        flow = nifi_get(f"/flow/process-groups/{pg_id}")
        for c in flow["processGroupFlow"]["flow"]["connections"]:
            try:
                # Empty the queue first; a connection with queued FlowFiles cannot be deleted
                nifi_post(f"/flowfile-queues/{c['id']}/drop-requests", {})
                time.sleep(1)
            except Exception:
                pass  # if the drop request failed, the delete below reports the problem
            nifi_delete(f"/connections/{c['id']}", params={"version": c["revision"]["version"]})
        for p in flow["processGroupFlow"]["flow"]["processors"]:
            nifi_delete(f"/processors/{p['id']}", params={"version": p["revision"]["version"]})
        pg = nifi_get(f"/process-groups/{pg_id}")
        nifi_delete(f"/process-groups/{pg_id}", params={"version": pg["revision"]["version"]})
        print(f"PG {pg_id} deleted.")
    except Exception as e:
        print(f"Cleanup error: {e}")

# Uncomment to clean up:
# cleanup_pg(KAFKA_PG_ID)
# cleanup_pg(ERR_PG_ID)
print("Uncomment the lines above to delete the flows.")

In [None]:
spark.stop()
print("Spark stopped.")

## Final exercise: end-to-end pipeline for new ratings

Build a complete pipeline:

1. **NiFi flow** (via API):
   - `GenerateFlowFile` - simulates new ratings (JSON)
   - `UpdateAttribute` - add metadata (timestamp, source)
   - `PublishKafka` - send to `movielens-new-ratings`
   - Add error handling with retry (max 3 attempts) and a dead letter queue

2. **Spark Structured Streaming**:
   - Read from the Kafka topic `movielens-new-ratings`
   - Parse the JSON and validate the data (rating 0.5-5.0, user_id > 0)
   - Sink 1: HDFS Parquet (partitioned by date)
   - Sink 2: PostgreSQL (running average per movie)

3. **Monitoring**:
   - Use the `monitor_pipeline()` function to check the state
   - Check provenance in NiFi
   - Check Spark UI -> Structured Streaming tab

4. **Verification**:
   - How many messages are in Kafka?
   - How many records are on HDFS?
   - How many movies are in the aggregates table?
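
As a starting point for the validation step, the rule from point 2 can be written as a plain Python predicate and tested on a few records before porting it to a Spark `filter`. This is only a sketch of the stated rule (rating between 0.5 and 5.0, user_id greater than 0); `is_valid_rating` and `split_valid` are hypothetical names and the sample records are invented.

```python
def is_valid_rating(record):
    """True when user_id > 0 and 0.5 <= rating <= 5.0; missing or non-numeric fields fail."""
    user_id = record.get("user_id")
    rating = record.get("rating")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(user_id, bool) or not isinstance(user_id, int):
        return False
    if isinstance(rating, bool) or not isinstance(rating, (int, float)):
        return False
    return user_id > 0 and 0.5 <= rating <= 5.0


def split_valid(records):
    """Partition records into (valid, rejected) so rejected ones can be inspected."""
    valid, rejected = [], []
    for record in records:
        (valid if is_valid_rating(record) else rejected).append(record)
    return valid, rejected


candidates = [
    {"user_id": 1, "movie_id": 31, "rating": 4.5},
    {"user_id": 0, "movie_id": 31, "rating": 3.0},     # user_id must be > 0
    {"user_id": 2, "movie_id": 50, "rating": 5.5},     # rating out of range
    {"user_id": 3, "movie_id": 50, "rating": None},    # failed cast upstream
    {"user_id": 4, "movie_id": 10, "rating": 0.5},     # lower bound is inclusive
]
valid, rejected = split_valid(candidates)
```

Keeping the rejected records, rather than silently dropping them, makes the verification questions in point 4 easier to answer, because the counts in Kafka and on HDFS can then be reconciled.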

In [None]:
# Your solution:
