# 23 - NiFi Data Flows

Apache NiFi is a platform for automating data flow between systems.

**Topics:**
- NiFi architecture: FlowFile, Processor, Connection, Process Group
- NiFi REST API - managing flows from Python
- Creating processors: GetFile, PutFile, RouteOnAttribute
- Building a flow: CSV -> transformation -> Parquet/HDFS
- NiFi Expression Language
- Data provenance and lineage
- Monitoring: bulletins, back pressure, stats via API
- Final exercise: a flow for ingesting MovieLens data

## 1. Architektura Apache NiFi

```
                        ┌─────────────────────────────────────┐
                        │           NiFi Cluster              │
                        │                                     │
  ┌──────────┐          │  ┌───────────┐    ┌───────────┐    │         ┌──────────┐
  │  Source   │──FlowFile─>│ Processor │───>│ Processor │────│─────────│  Sink    │
  │(CSV/API)  │          │  │ (GetFile) │    │(Transform)│    │         │(HDFS/DB) │
  └──────────┘          │  └───────────┘    └───────────┘    │         └──────────┘
                        │       │                  │          │
                        │       └──Connection──────┘          │
                        │      (queue + back pressure)        │
                        └─────────────────────────────────────┘
```

### Key concepts:

| Concept | Description |
|---------|-------------|
| **FlowFile** | The unit of data in NiFi - content (bytes) + attributes (key-value metadata) |
| **Processor** | A component that processes FlowFiles (GetFile, PutHDFS, RouteOnAttribute...) |
| **Connection** | A FIFO queue between processors, with back pressure |
| **Process Group** | A container that groups processors (like a folder) |
| **Controller Service** | A shared service (e.g. DBCPConnectionPool, SSLContext) |
| **Provenance** | The full history of every FlowFile - where it came from, where it went, what changed |
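
To make the FlowFile row concrete, here is a minimal Python sketch of a content-plus-attributes record. The `FlowFile` dataclass and the `with_attributes` helper are illustrative names, not NiFi classes; real FlowFiles live in NiFi's repositories, so treat this only as a mental model.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class FlowFile:
    """Toy model of a FlowFile: opaque content plus string attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)


def with_attributes(flow_file, **new_attrs):
    """Return a copy with extra attributes; the content is left untouched."""
    merged = {**flow_file.attributes, **new_attrs}
    return replace(flow_file, attributes=merged)


original = FlowFile(b"userId,movieId,rating\n1,2,3.5\n", {"filename": "rating.csv"})
tagged = with_attributes(original, dataset="movielens")

print(tagged.attributes)
print(tagged.content == original.content)
```

The point of the sketch is that attribute changes (what UpdateAttribute does) never touch the content bytes.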

### Why NiFi?
- **Visual interface** - design flows with drag & drop
- **Back pressure** - automatically slows the source down when the sink cannot keep up
- **Guaranteed delivery** - data is not lost (Write-Ahead Log)
- **Data provenance** - full lineage for every byte
- **300+ processors** out of the box

## 2. Setup - connecting to the NiFi REST API

NiFi exposes a full REST API for managing flows programmatically.
API base URL: `https://nifi:8443/nifi-api` (HTTPS) or `http://nifi:8080/nifi-api` (HTTP)

In [None]:
import requests
import json
import time
import urllib3

# Disable SSL warnings for the self-signed cert
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# NiFi API configuration
# Adjust the URL to your setup (HTTPS with auth, or HTTP without)
NIFI_API = "https://nifi:8443/nifi-api"
# Alternatively: NIFI_API = "http://nifi:8080/nifi-api"

# HTTP session with default settings
session = requests.Session()
session.verify = False  # self-signed cert

# If NiFi requires authentication (HTTPS), obtain a token:
def get_nifi_token(username="admin", password="admin123456789"):
    """Obtain an access token for the NiFi API."""
    try:
        resp = session.post(
            f"{NIFI_API}/access/token",
            data={"username": username, "password": password},
            headers={"Content-Type": "application/x-www-form-urlencoded"}
        )
        if resp.status_code == 201:
            token = resp.text
            session.headers.update({"Authorization": f"Bearer {token}"})
            print(f"Token obtained (length: {len(token)})")
            return token
        else:
            print(f"Auth not required, or error: {resp.status_code}")
            return None
    except requests.RequestException as e:
        print(f"Token request failed, continuing without auth: {e}")
        return None

# Try to obtain a token (optional)
token = get_nifi_token()

# Connection test
resp = session.get(f"{NIFI_API}/system-diagnostics")
print(f"\nConnection status: {resp.status_code}")
if resp.status_code == 200:
    diag = resp.json()
    heap = diag["systemDiagnostics"]["aggregateSnapshot"]["heapUtilization"]
    print(f"NiFi Heap: {heap}")
    print("Connection to NiFi OK!")

In [None]:
# Helper functions for working with the NiFi API

def nifi_get(path):
    """GET request to the NiFi API."""
    resp = session.get(f"{NIFI_API}{path}")
    resp.raise_for_status()
    return resp.json()

def nifi_post(path, data):
    """POST request to the NiFi API."""
    resp = session.post(
        f"{NIFI_API}{path}",
        json=data,
        headers={"Content-Type": "application/json"}
    )
    resp.raise_for_status()
    return resp.json()

def nifi_put(path, data):
    """PUT request to the NiFi API."""
    resp = session.put(
        f"{NIFI_API}{path}",
        json=data,
        headers={"Content-Type": "application/json"}
    )
    resp.raise_for_status()
    return resp.json()

def nifi_delete(path, params=None):
    """DELETE request to the NiFi API."""
    resp = session.delete(f"{NIFI_API}{path}", params=params)
    resp.raise_for_status()
    return resp.json()

# Fetch the root process group ID
root_flow = nifi_get("/flow/process-groups/root")
ROOT_PG_ID = root_flow["processGroupFlow"]["id"]
print(f"Root Process Group ID: {ROOT_PG_ID}")

# List existing processors
processors = root_flow["processGroupFlow"]["flow"]["processors"]
print(f"\nExisting processors ({len(processors)}):")
for p in processors:
    name = p["component"]["name"]
    ptype = p["component"]["type"].split(".")[-1]
    state = p["component"]["state"]
    print(f"  {name:<30} [{ptype}] state={state}")

## 3. Creating a Process Group

A Process Group is a container for processors, much like a folder in a file system.
It lets you organize a flow into logical blocks.

In [None]:
# Create the "MovieLens Ingestion" Process Group
pg_body = {
    "revision": {"version": 0},
    "component": {
        "name": "MovieLens Ingestion",
        "position": {"x": 100, "y": 100}
    }
}

pg_resp = nifi_post(f"/process-groups/{ROOT_PG_ID}/process-groups", pg_body)
PG_ID = pg_resp["id"]
print(f"Process Group created: {PG_ID}")
print(f"Name: {pg_resp['component']['name']}")
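
Since the same request body is needed every time a Process Group is created, it can be factored into a small pure function. This is only a sketch: `build_process_group_body` is a hypothetical helper name, and it simply mirrors the `pg_body` structure used above, where a revision version of 0 marks a new component.

```python
def build_process_group_body(name, x=0, y=0):
    """Build the JSON body for creating a new Process Group."""
    return {
        "revision": {"version": 0},   # version 0: the component does not exist yet
        "component": {
            "name": name,
            "position": {"x": x, "y": y},
        },
    }


body = build_process_group_body("MovieLens Ingestion", x=100, y=100)
print(body)
```

The result could then be passed to `nifi_post(f"/process-groups/{ROOT_PG_ID}/process-groups", body)` in place of the inline dict.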

## 4. Creating processors

Processors are the basic units of processing in NiFi.

### The most important processor types:

| Processor | Category | Description |
|-----------|----------|-------------|
| **GetFile** | Input | Reads files from the local file system |
| **PutFile** | Output | Writes FlowFiles to the local file system |
| **PutHDFS** | Output | Writes FlowFiles to HDFS |
| **RouteOnAttribute** | Routing | Routes FlowFiles based on their attributes |
| **UpdateAttribute** | Transform | Modifies FlowFile attributes |
| **ConvertRecord** | Transform | Format conversion (CSV->JSON->Avro) |
| **ExecuteSQL** | Database | Executes a SQL query |
| **PutDatabaseRecord** | Database | Inserts records into a database |
| **PublishKafka** | Messaging | Publishes to a Kafka topic |
| **InvokeHTTP** | External | Calls an HTTP endpoint |
In [None]:
def create_processor(pg_id, proc_type, name, position, config=None, auto_terminate=None):
    """Create a processor in the given Process Group.

    Args:
        pg_id: Process Group ID
        proc_type: fully qualified processor type (e.g. org.apache.nifi.processors.standard.GetFile)
        name: display name
        position: dict with x, y
        config: dict of processor properties
        auto_terminate: list of relationships to auto-terminate
    """
    body = {
        "revision": {"version": 0},
        "component": {
            "type": proc_type,
            "name": name,
            "position": position,
            "config": {}
        }
    }
    if config:
        body["component"]["config"]["properties"] = config
    if auto_terminate:
        body["component"]["config"]["autoTerminatedRelationships"] = auto_terminate
    
    result = nifi_post(f"/process-groups/{pg_id}/processors", body)
    print(f"Processor '{name}' created: {result['id']}")
    return result


# --- Processor 1: GetFile - reads CSV from disk ---
get_file = create_processor(
    PG_ID,
    "org.apache.nifi.processors.standard.GetFile",
    "Read MovieLens CSV",
    {"x": 100, "y": 100},
    config={
        "Input Directory": "/data/raw/movielens",
        "File Filter": "rating\\.csv",
        "Keep Source File": "true",       # do not delete the original
        "Batch Size": "1"                  # one file at a time
    }
)

# --- Processor 2: UpdateAttribute - add metadata ---
update_attr = create_processor(
    PG_ID,
    "org.apache.nifi.processors.attributes.UpdateAttribute",
    "Add Metadata",
    {"x": 100, "y": 300},
    config={
        "dataset": "movielens",
        "ingestion.timestamp": "${now():format('yyyy-MM-dd HH:mm:ss')}",
        "source.type": "csv",
        "schema.name": "ratings"
    }
)

# --- Processor 3: RouteOnAttribute - route by file size ---
route = create_processor(
    PG_ID,
    "org.apache.nifi.processors.standard.RouteOnAttribute",
    "Route by Size",
    {"x": 100, "y": 500},
    config={
        "Routing Strategy": "Route to Property name",
        "large_file": "${fileSize:gt(10485760)}",   # > 10MB
        "small_file": "${fileSize:le(10485760)}"     # <= 10MB
    },
    auto_terminate=["unmatched"]
)

# --- Processor 4: PutFile - write the result ---
put_file = create_processor(
    PG_ID,
    "org.apache.nifi.processors.standard.PutFile",
    "Write to Staging",
    {"x": 100, "y": 700},
    config={
        "Directory": "/data/staging/movielens/${schema.name}",
        "Conflict Resolution Strategy": "replace"
    },
    auto_terminate=["success", "failure"]
)

print("\nAll processors created!")
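
The "Route to Property name" strategy used by "Route by Size" can be mimicked in plain Python: every dynamic property is a named predicate, and a FlowFile goes to each relationship whose predicate is true, or to `unmatched` when none is. The sketch below is only a model; `route_on_attribute` is a hypothetical function, the lambdas stand in for the Expression Language predicates, and the sample file sizes are made up.

```python
TEN_MB = 10 * 1024 * 1024  # 10485760, the threshold used in "Route by Size"

# Python stand-ins for the two Expression Language predicates.
# Attribute values are strings in NiFi, hence the int() conversion.
ROUTES = {
    "large_file": lambda attrs: int(attrs["fileSize"]) > TEN_MB,
    "small_file": lambda attrs: int(attrs["fileSize"]) <= TEN_MB,
}


def route_on_attribute(attributes, routes=ROUTES):
    """Return the relationships a FlowFile with these attributes would match."""
    matched = [name for name, predicate in routes.items() if predicate(attributes)]
    return matched or ["unmatched"]


print(route_on_attribute({"filename": "rating.csv", "fileSize": "690353377"}))
print(route_on_attribute({"filename": "tag.csv", "fileSize": "1024"}))
```

Because the two predicates here are complementary, `unmatched` is never reached for a FlowFile that has a `fileSize`, which is why auto-terminating it is safe in this flow.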

In [None]:
# Creating connections between processors

def create_connection(pg_id, source_id, dest_id, relationships, name=""):
    """Create a connection between two processors."""
    body = {
        "revision": {"version": 0},
        "component": {
            "name": name,
            "source": {"id": source_id, "type": "PROCESSOR", "groupId": pg_id},
            "destination": {"id": dest_id, "type": "PROCESSOR", "groupId": pg_id},
            "selectedRelationships": relationships,
            "backPressureObjectThreshold": 10000,
            "backPressureDataSizeThreshold": "1 GB",
            "flowFileExpiration": "0 sec"   # never expire
        }
    }
    result = nifi_post(f"/process-groups/{pg_id}/connections", body)
    print(f"Connection '{name}': {source_id[:8]}... -> {dest_id[:8]}... [{', '.join(relationships)}]")
    return result

# GetFile -> UpdateAttribute
conn1 = create_connection(
    PG_ID,
    get_file["id"], update_attr["id"],
    ["success"],
    "raw csv"
)

# UpdateAttribute -> RouteOnAttribute
conn2 = create_connection(
    PG_ID,
    update_attr["id"], route["id"],
    ["success"],
    "with metadata"
)

# RouteOnAttribute (large_file) -> PutFile
conn3 = create_connection(
    PG_ID,
    route["id"], put_file["id"],
    ["large_file"],
    "large files"
)

# RouteOnAttribute (small_file) -> PutFile
conn4 = create_connection(
    PG_ID,
    route["id"], put_file["id"],
    ["small_file"],
    "small files"
)

print("\nFlow connected!")
print("Layout: GetFile -> UpdateAttribute -> RouteOnAttribute -> PutFile")

## 5. NiFi Expression Language

NiFi Expression Language (EL) lets you reference FlowFile attributes dynamically.

### Basic syntax:

```
${attribute_name}                          # attribute value
${filename:substringBefore('.')}           # string operations
${fileSize:gt(1048576)}                    # comparison (> 1MB)
${now():format('yyyy-MM-dd')}              # current date
${literal('hello'):append(' world')}       # literal + append
```

### Most commonly used functions:

| Function | Description | Example |
|----------|-------------|---------|
| `substringBefore(sep)` | Text before the separator | `${filename:substringBefore('.')}` -> `rating` |
| `substringAfter(sep)` | Text after the separator | `${filename:substringAfter('.')}` -> `csv` |
| `replace(old, new)` | Replace text | `${filename:replace('.csv', '.parquet')}` |
| `toUpper()` / `toLower()` | Change case | `${dataset:toUpper()}` -> `MOVIELENS` |
| `format(pattern)` | Format a date | `${now():format('yyyyMMdd')}` -> `20260209` |
| `gt(val)` / `lt(val)` | Comparisons | `${fileSize:gt(1000000)}` -> `true` |
| `isEmpty()` | Whether the value is empty | `${attr:isEmpty()}` -> `true/false` |
| `ifElse(t, f)` | Conditional | `${x:gt(5):ifElse('big','small')}` |
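
To see what the substring functions return, the following Python sketch emulates three of them. It is an approximation written for illustration, not NiFi code: the function names are hypothetical, and it assumes (as the Expression Language guide describes) that the subject is returned unchanged when the separator is not found.

```python
def substring_before(subject, sep):
    """Emulates ${subject:substringBefore(sep)}: text before the first sep."""
    index = subject.find(sep)
    return subject if index == -1 else subject[:index]


def substring_after(subject, sep):
    """Emulates ${subject:substringAfter(sep)}: text after the first sep."""
    index = subject.find(sep)
    return subject if index == -1 else subject[index + len(sep):]


def substring_after_last(subject, sep):
    """Emulates ${subject:substringAfterLast(sep)}: text after the last sep."""
    index = subject.rfind(sep)
    return subject if index == -1 else subject[index + len(sep):]


for name in ("rating.csv", "ratings.2026.csv", "README"):
    print(name, "|", substring_before(name, "."), "|",
          substring_after(name, "."), "|", substring_after_last(name, "."))
```

The second file name shows why `substringAfterLast` is the safer choice for extracting an extension: `substringAfter` keeps everything after the first dot.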

In [None]:
# Example: UpdateAttribute with NiFi Expression Language
# We create a processor that generates paths dynamically

el_processor = create_processor(
    PG_ID,
    "org.apache.nifi.processors.attributes.UpdateAttribute",
    "Dynamic Path Generator",
    {"x": 400, "y": 100},
    config={
        # Dynamic HDFS path with a date partition
        "hdfs.output.path": "/data/movielens/bronze/${now():format('yyyy/MM/dd')}",

        # File name with a timestamp
        "output.filename": "${filename:substringBefore('.')}_${now():format('yyyyMMdd_HHmmss')}.${filename:substringAfter('.')}",

        # Size in MB (the EL division function is divide)
        "file.size.human": "${fileSize:divide(1048576)} MB",

        # File extension
        "file.extension": "${filename:substringAfterLast('.')}",

        # Conditional attribute
        "priority": "${fileSize:gt(52428800):ifElse('high', 'normal')}"
    }
)

print("\nSample Expression Language results:")
print("  hdfs.output.path  = /data/movielens/bronze/2026/02/09")
print("  output.filename   = rating_20260209_143022.csv")
print("  file.size.human   = 245 MB")
print("  file.extension    = csv")
print("  priority          = high (if > 50MB)")
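
The sample values printed above can be cross-checked without a running NiFi instance. The sketch below recomputes them in plain Python for one hypothetical input (a 245 MB `rating.csv` at a fixed timestamp); `preview_attributes` is an illustrative helper, and it assumes the size expression performs whole-number division.

```python
from datetime import datetime


def preview_attributes(filename, file_size, now):
    """Compute, in plain Python, the values the five expressions should yield."""
    stem, sep, rest = filename.partition(".")    # substringBefore / substringAfter
    if not sep:
        rest = filename                          # no separator: subject unchanged
    extension = filename.rpartition(".")[2]      # substringAfterLast
    return {
        "hdfs.output.path": "/data/movielens/bronze/" + now.strftime("%Y/%m/%d"),
        "output.filename": f"{stem}_{now.strftime('%Y%m%d_%H%M%S')}.{rest}",
        "file.size.human": f"{file_size // 1048576} MB",   # whole-number division
        "file.extension": extension,
        "priority": "high" if file_size > 52428800 else "normal",
    }


preview = preview_attributes("rating.csv", 245 * 1048576, datetime(2026, 2, 9, 14, 30, 22))
for key, value in preview.items():
    print(f"  {key:<18}= {value}")
```

Note that the Java-style patterns used by `format()` (`yyyy`, `MM`, `HH`) are written as `%Y`, `%m`, `%H` in Python's `strftime`.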

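## 6. Flow: CSV -> HDFS with transformation

We build a more advanced flow that:
1. Reads CSV files from a directory
2. Validates the schema
3. Converts to Avro/JSON
4. Writes to HDFS, partitioned by date

The cell below wires up steps 1 and 4 (GetFile -> UpdateAttribute -> PutHDFS). Validation and conversion are not created here; they would need record-based processors such as ConvertRecord.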

In [None]:
# Create a new Process Group for the HDFS flow
hdfs_pg = nifi_post(f"/process-groups/{ROOT_PG_ID}/process-groups", {
    "revision": {"version": 0},
    "component": {
        "name": "CSV to HDFS Pipeline",
        "position": {"x": 500, "y": 100}
    }
})
HDFS_PG_ID = hdfs_pg["id"]
print(f"HDFS Process Group: {HDFS_PG_ID}")

# 1. GetFile - CSV source
hdfs_getfile = create_processor(
    HDFS_PG_ID,
    "org.apache.nifi.processors.standard.GetFile",
    "Ingest CSV",
    {"x": 100, "y": 100},
    config={
        "Input Directory": "/data/raw/movielens",
        "File Filter": ".*\\.csv",
        "Keep Source File": "true",
        "Batch Size": "10"
    }
)

# 2. UpdateAttribute - add metadata and the HDFS path
hdfs_metadata = create_processor(
    HDFS_PG_ID,
    "org.apache.nifi.processors.attributes.UpdateAttribute",
    "Set HDFS Path",
    {"x": 100, "y": 300},
    config={
        "hdfs.directory": "/data/movielens/bronze/${now():format('yyyy-MM-dd')}",
        "dataset.name": "movielens",
        "ingestion.id": "${UUID()}"
    }
)

# 3. PutHDFS - write to HDFS
hdfs_put = create_processor(
    HDFS_PG_ID,
    "org.apache.nifi.processors.hadoop.PutHDFS",
    "Write to HDFS",
    {"x": 100, "y": 500},
    config={
        "Hadoop Configuration Resources": "/etc/hadoop/core-site.xml,/etc/hadoop/hdfs-site.xml",
        "Directory": "${hdfs.directory}",
        "Conflict Resolution Strategy": "replace"
    },
    auto_terminate=["success", "failure"]
)

# Connections
create_connection(HDFS_PG_ID, hdfs_getfile["id"], hdfs_metadata["id"], ["success"], "raw csv")
create_connection(HDFS_PG_ID, hdfs_metadata["id"], hdfs_put["id"], ["success"], "to HDFS")

print("\nCSV -> HDFS flow ready!")
print("Layout: GetFile -> UpdateAttribute -> PutHDFS")
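
The schema validation step is not part of the flow created above. As a rough illustration of what such a check does, here is a standalone Python sketch that compares a CSV header with an expected column list. It is not how NiFi validates records, `validate_csv_header` is a hypothetical helper, and the column names are an assumption about the MovieLens ratings file, so adjust them to your copy.

```python
import csv
import io

# Assumed column layout of the ratings file (not taken from the flow above).
EXPECTED_COLUMNS = ["userId", "movieId", "rating", "timestamp"]


def validate_csv_header(text, expected=EXPECTED_COLUMNS):
    """Return (ok, message) after comparing the first CSV row to `expected`."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return False, "empty file"
    if header != expected:
        return False, f"unexpected header: {header}"
    return True, "ok"


sample = "userId,movieId,rating,timestamp\n1,2,3.5,2005-04-02 23:53:47\n"
print(validate_csv_header(sample))
print(validate_csv_header("user,movie\n1,2\n"))
```

Inside NiFi, a failed check like this would typically send the FlowFile to a separate relationship instead of letting it reach the HDFS writer.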

## 7. Starting and stopping a flow

Processors in NiFi have the following states:
- **STOPPED** - not processing (the default after creation)
- **RUNNING** - actively processing FlowFiles
- **DISABLED** - disabled (cannot be started)
- **INVALID** - configuration error

In [None]:
def start_processor(processor_id):
    """Start a processor."""
    proc = nifi_get(f"/processors/{processor_id}")
    version = proc["revision"]["version"]
    return nifi_put(f"/processors/{processor_id}/run-status", {
        "revision": {"version": version},
        "state": "RUNNING"
    })

def stop_processor(processor_id):
    """Stop a processor."""
    proc = nifi_get(f"/processors/{processor_id}")
    version = proc["revision"]["version"]
    return nifi_put(f"/processors/{processor_id}/run-status", {
        "revision": {"version": version},
        "state": "STOPPED"
    })

def start_process_group(pg_id):
    """Start all processors in a Process Group."""
    return nifi_put(f"/flow/process-groups/{pg_id}", {
        "id": pg_id,
        "state": "RUNNING"
    })

def stop_process_group(pg_id):
    """Stop all processors in a Process Group."""
    return nifi_put(f"/flow/process-groups/{pg_id}", {
        "id": pg_id,
        "state": "STOPPED"
    })

# Start the whole MovieLens Ingestion flow
print("Starting Process Group...")
start_process_group(PG_ID)
print("Flow started!")

# Wait a moment, then check the status
time.sleep(5)

pg_status = nifi_get(f"/flow/process-groups/{PG_ID}/status")
stats = pg_status["processGroupStatus"]["aggregateSnapshot"]
print("\nFlow statistics:")
print(f"  FlowFiles In:     {stats.get('flowFilesIn', 0)}")
print(f"  FlowFiles Out:    {stats.get('flowFilesOut', 0)}")
print(f"  Bytes Read:       {stats.get('bytesRead', 0)}")
print(f"  Bytes Written:    {stats.get('bytesWritten', 0)}")
print(f"  Queued:           {stats.get('flowFilesQueued', 0)} files")

# Stop the flow
stop_process_group(PG_ID)
print("\nFlow stopped.")
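
The fixed `time.sleep(5)` above is a guess at how long the flow needs. A polling loop with a timeout is usually more robust. The sketch below is one way to write it: `wait_until` is a hypothetical helper, and the status call is replaced by a fake so the example runs without NiFi.

```python
import time


def wait_until(fetch, predicate, timeout=30.0, interval=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll `fetch()` until `predicate(result)` is true or `timeout` seconds pass.

    Returns the last fetched result; raises TimeoutError if the condition never held.
    """
    deadline = clock() + timeout
    while True:
        result = fetch()
        if predicate(result):
            return result
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        sleep(interval)


# Stand-in for the status call: pretends the queue drains over three polls.
fake_queue_sizes = iter([5, 2, 0])
final_stats = wait_until(
    fetch=lambda: {"flowFilesQueued": next(fake_queue_sizes)},
    predicate=lambda s: s["flowFilesQueued"] == 0,
    interval=0.0,
)
print(final_stats)
```

In the notebook, `fetch` could wrap the same status request used above, for example `lambda: nifi_get(f"/flow/process-groups/{PG_ID}/status")["processGroupStatus"]["aggregateSnapshot"]`.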

## 8. Data Provenance - tracking data

NiFi records the full history of every FlowFile:
- **CREATE** - FlowFile created inside NiFi (e.g. by GenerateFlowFile)
- **RECEIVE** - FlowFile received from an external source (e.g. a file picked up by GetFile)
- **SEND** - FlowFile sent to an external system
- **ATTRIBUTES_MODIFIED** - attributes changed
- **CONTENT_MODIFIED** - content changed
- **ROUTE** - FlowFile routed to a relationship
- **DROP** - FlowFile removed from the flow

This is a key advantage of NiFi over many other ETL tools: full **data lineage**.

In [None]:
# Provenance query - FlowFile history
def query_provenance(processor_id=None, max_results=100):
    """Query provenance events."""
    query = {
        "provenance": {
            "request": {
                "maxResults": max_results,
                "summarize": False,
                "searchTerms": {}
            }
        }
    }
    if processor_id:
        query["provenance"]["request"]["searchTerms"]["ProcessorID"] = {
            "value": processor_id
        }
    
    # Submit query
    result = nifi_post("/provenance", query)
    query_id = result["provenance"]["id"]
    
    # Poll for results
    for _ in range(10):
        time.sleep(1)
        result = nifi_get(f"/provenance/{query_id}")
        if result["provenance"]["finished"]:
            break
    
    events = result["provenance"]["results"]["provenanceEvents"]
    
    # Cleanup
    nifi_delete(f"/provenance/{query_id}")
    
    return events

# Fetch provenance events
events = query_provenance(max_results=20)
print(f"Provenance events ({len(events)}):")
print(f"{'Time':<25} {'Type':<25} {'Processor':<25} {'File'}")
print("-" * 100)
for e in events[:20]:
    ts = e.get("eventTime", "?")
    etype = e.get("eventType", "?")
    comp = e.get("componentName", "?")[:24]
    fname = e.get("attributes", {}).get("filename", "?")
    print(f"{ts:<25} {etype:<25} {comp:<25} {fname}")
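
A table of raw events gets long quickly, so a per-type summary is often more useful. The sketch below counts events by `eventType` and `componentName`, the same keys read in the loop above. `summarize_events` is a hypothetical helper and the sample events are invented for illustration, not real NiFi output.

```python
from collections import Counter


def summarize_events(events):
    """Count provenance events per eventType and per componentName."""
    by_type = Counter(e.get("eventType", "?") for e in events)
    by_component = Counter(e.get("componentName", "?") for e in events)
    return by_type, by_component


# Hypothetical events with the same keys as the dicts returned by query_provenance
sample_events = [
    {"eventType": "RECEIVE", "componentName": "Read MovieLens CSV"},
    {"eventType": "ATTRIBUTES_MODIFIED", "componentName": "Add Metadata"},
    {"eventType": "ROUTE", "componentName": "Route by Size"},
    {"eventType": "SEND", "componentName": "Write to Staging"},
    {"eventType": "DROP", "componentName": "Write to Staging"},
]

by_type, by_component = summarize_events(sample_events)
for event_type, count in by_type.most_common():
    print(f"{event_type:<22} {count}")
```

Calling `summarize_events(events)` on the list returned by `query_provenance` would give the same breakdown for real data.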

## 9. Monitoring: bulletins, back pressure, statistics

### Back Pressure
Every Connection (queue) has two limits:
- **Object Threshold**: max number of FlowFiles in the queue (default 10,000)
- **Data Size Threshold**: max total size of data in the queue (default 1 GB)

Once a limit is reached, NiFi automatically **stops scheduling** the source processor until the queue drains.
This keeps queues from growing without bound (and exhausting memory or disk) and keeps the pipeline stable.
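
The rule can be sketched in a few lines of Python. This is a toy model under stated assumptions, not NiFi's scheduler: `back_pressure_applied` and `simulate` are illustrative helpers, back pressure is assumed to engage once the queue reaches either threshold, and the simulation only tracks object count.

```python
ONE_GB = 1024 ** 3


def back_pressure_applied(queued_count, queued_bytes,
                          object_threshold=10_000, size_threshold_bytes=ONE_GB):
    """True when the queue has reached either of its two thresholds."""
    return queued_count >= object_threshold or queued_bytes >= size_threshold_bytes


def simulate(ticks, produce_per_tick, consume_per_tick, object_threshold):
    """Toy queue: the source is skipped on ticks where back pressure is applied."""
    queued, history = 0, []
    for _ in range(ticks):
        if not back_pressure_applied(queued, 0, object_threshold=object_threshold):
            queued += produce_per_tick          # source runs
        queued = max(0, queued - consume_per_tick)  # sink drains the queue
        history.append(queued)
    return history


print(simulate(ticks=8, produce_per_tick=4, consume_per_tick=1, object_threshold=10))
```

In this model the queue can briefly exceed the threshold, because the check happens before the source runs, not while it is producing. The thresholds therefore act as soft limits.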

### Bulletins
Error and warning messages, visible in the UI and through the API.

In [None]:
# === Monitoring NiFi via API ===

# 1. System Diagnostics
diag = nifi_get("/system-diagnostics")
snap = diag["systemDiagnostics"]["aggregateSnapshot"]
print("=== System Diagnostics ===")
print(f"  Heap Used:        {snap['usedHeap']}")
print(f"  Heap Available:   {snap['freeHeap']}")
print(f"  Heap Utilization: {snap['heapUtilization']}")
print(f"  Threads:          {snap['totalThreads']}")
print(f"  Content Repo:     {snap['contentRepositoryStorageUsage'][0]['utilization']}")
print(f"  FlowFile Repo:    {snap['flowFileRepositoryStorageUsage']['utilization']}")

# 2. Process Group status (metrics for the whole flow)
print("\n=== Root Process Group Status ===")
root_status = nifi_get(f"/flow/process-groups/root/status")
rs = root_status["processGroupStatus"]["aggregateSnapshot"]
print(f"  Active Threads:    {rs.get('activeThreadCount', 0)}")
print(f"  FlowFiles Queued:  {rs.get('flowFilesQueued', 0)}")
print(f"  Bytes Queued:      {rs.get('bytesQueued', 0)}")
print(f"  Running:           {rs.get('runningCount', 0)} processors")
print(f"  Stopped:           {rs.get('stoppedCount', 0)} processors")
print(f"  Invalid:           {rs.get('invalidCount', 0)} processors")

# 3. Bulletins (errors/warnings)
print("\n=== Bulletins (latest messages) ===")
bulletins = nifi_get("/flow/bulletin-board")
bulletin_list = bulletins.get("bulletinBoard", {}).get("bulletins", [])
if bulletin_list:
    for b in bulletin_list[:10]:
        bb = b.get("bulletin", {})
        print(f"  [{bb.get('level', '?')}] {bb.get('sourceName', '?')}: {bb.get('message', '?')[:80]}")
else:
    print("  No bulletins reported.")

In [None]:
# 4. Detailed per-processor metrics

def get_processor_stats(pg_id):
    """Fetch statistics for all processors in a Process Group."""
    flow = nifi_get(f"/flow/process-groups/{pg_id}")
    processors = flow["processGroupFlow"]["flow"]["processors"]
    
    print(f"{'Processor':<30} {'Status':<10} {'In':<8} {'Out':<8} {'Read':<12} {'Written':<12} {'Tasks'}")
    print("-" * 100)
    
    for p in processors:
        comp = p["component"]
        status = p.get("status", {}).get("aggregateSnapshot", {})
        print(
            f"{comp['name']:<30} "
            f"{comp['state']:<10} "
            f"{status.get('flowFilesIn', 0):<8} "
            f"{status.get('flowFilesOut', 0):<8} "
            f"{status.get('bytesRead', 0):<12} "
            f"{status.get('bytesWritten', 0):<12} "
            f"{status.get('taskCount', 0)}"
        )

# 5. Connection queue stats (back pressure monitoring)
def get_connection_stats(pg_id):
    """Fetch queue statistics for a Process Group."""
    flow = nifi_get(f"/flow/process-groups/{pg_id}")
    connections = flow["processGroupFlow"]["flow"]["connections"]
    
    print(f"\n{'Connection':<25} {'Queued Files':<15} {'Queued Size':<15} {'Back Pressure'}")
    print("-" * 70)
    
    for c in connections:
        name = c["component"].get("name", "unnamed")[:24]
        status = c.get("status", {}).get("aggregateSnapshot", {})
        queued_count = status.get("flowFilesQueued", 0)
        queued_size = status.get("bytesQueued", 0)
        bp_pct = status.get("percentUseCount", 0)  # percent of the object threshold in use
        print(f"{name:<25} {queued_count:<15} {queued_size:<15} {bp_pct}%")

print("=== MovieLens Ingestion statistics ===")
get_processor_stats(PG_ID)
get_connection_stats(PG_ID)

## 10. Cleaning up resources

Before deleting a Process Group you need to:
1. Stop all processors
2. Empty the queues (connections)
3. Delete the connections
4. Delete the processors
5. Delete the Process Group

In [None]:
def cleanup_process_group(pg_id):
    """Delete a Process Group and everything in it."""
    # 1. Stop all processors
    stop_process_group(pg_id)
    time.sleep(2)
    
    # 2. Get all connections and processors
    flow = nifi_get(f"/flow/process-groups/{pg_id}")
    connections = flow["processGroupFlow"]["flow"]["connections"]
    processors = flow["processGroupFlow"]["flow"]["processors"]
    
    # 3. Drop queues and delete connections
    for c in connections:
        conn_id = c["id"]
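        # Drop queue (best effort; the drop request runs asynchronously)
        try:
            nifi_post(f"/flowfile-queues/{conn_id}/drop-requests", {})
            time.sleep(1)
        except requests.HTTPError as e:
            print(f"Could not drop queue {conn_id}: {e}")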
        # Delete connection
        version = c["revision"]["version"]
        nifi_delete(f"/connections/{conn_id}", params={"version": version})
    
    # 4. Delete processors
    for p in processors:
        proc_id = p["id"]
        version = p["revision"]["version"]
        nifi_delete(f"/processors/{proc_id}", params={"version": version})
    
    # 5. Delete process group
    pg = nifi_get(f"/process-groups/{pg_id}")
    version = pg["revision"]["version"]
    nifi_delete(f"/process-groups/{pg_id}", params={"version": version})
    
    print(f"Process Group {pg_id} deleted.")

# Uncomment to clean up:
# cleanup_process_group(PG_ID)
# cleanup_process_group(HDFS_PG_ID)
print("Uncomment the lines above to delete the created flows.")

## Final exercise

Build a complete flow for ingesting MovieLens data:

1. **GetFile** - read CSV files from `/data/raw/movielens/`
2. **UpdateAttribute** - add metadata: dataset, timestamp, schema name
3. **RouteOnAttribute** - split files: `rating.csv` vs `movie.csv` vs others
4. **PutHDFS** (x2) - separate HDFS paths for ratings and movies:
   - `/data/movielens/bronze/ratings/YYYY-MM-DD/`
   - `/data/movielens/bronze/movies/YYYY-MM-DD/`
5. Start the flow and verify that the data is on HDFS
6. Check provenance - how many events? Which types?
7. Check the statistics - how many bytes were processed?

**Bonus:** Add a LogAttribute processor between steps to log FlowFile attributes.

In [None]:
# Your solution:
