# 26 - Security with Apache Ranger

Apache Ranger is a centralized security management platform for the Hadoop ecosystem. It lets you define, administer, and audit data access policies.

**Topics:**
- Apache Ranger architecture: Admin, Plugin, UserSync, Audit
- Ranger REST API - managing policies from Python
- Access policies: HDFS, Hive, Spark SQL
- Row-level security: restricting which rows are visible
- Column masking: masking PII (user_id -> hash)
- Tag-based policies: policies driven by Atlas classifications
- Audit and compliance: who read which data, and when
- Ranger + Atlas integration: governance-driven security

## 1. Architektura Apache Ranger

```
                    +-------------------------------+
                    |       Ranger Admin UI         |
                    |    (http://ranger:6080)       |
                    +---------------+---------------+
                                    |
                    +---------------v---------------+
                    |      Ranger Admin Server      |
                    |   - Policy Management         |
                    |   - User/Group Management     |
                    |   - Audit aggregation         |
                    |   - REST API                  |
                    +------+--------+--------+------+
                           |        |        |
              +------------+  +-----+-----+  +------------+
              |               |           |               |
    +---------v---------+ +--v-----------v--+ +---------v---------+
    |   HDFS Plugin     | | Hive/Spark      | |  Kafka Plugin     |
    |                   | | Plugin          | |                   |
    | Intercepts HDFS   | | Intercepts SQL  | | Intercepts topic  |
    | file access       | | queries         | | access            |
    +-------------------+ +-----------------+ +-------------------+
              |                   |                   |
    +---------v---------+ +------v----------+ +------v------------+
    |   HDFS            | | Hive/SparkSQL   | |  Kafka            |
    +-------------------+ +-----------------+ +-------------------+

    +-------------------+
    |   UserSync        |   <-- Synchronizes users from LDAP/AD
    +-------------------+

    +-------------------+
    |   Audit (Solr)    |   <-- Audit logs: who, what, when
    +-------------------+
```

### Key components:
- **Ranger Admin**: central server for managing policies
- **Plugins**: agents embedded in the services (HDFS, Hive, Spark) that enforce policies
- **UserSync**: synchronizes users from LDAP/Active Directory
- **Audit**: records every data access (Solr/HDFS/DB)
- **KMS**: Key Management Server for data encryption

## 2. Setup - connecting to the Ranger REST API

Ranger Admin exposes a REST API on port 6080. The default credentials in this environment are `admin`/`rangerR0cks!`.
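
Hard-coded credentials are convenient for a local container but risky anywhere else. A minimal sketch, assuming hypothetical environment variables `RANGER_URL`, `RANGER_USER` and `RANGER_PASSWORD` (these names are illustrative and not defined by Ranger), with fallbacks to the values used in this notebook:

```python
import os


def load_ranger_config(env=None):
    """Return (base_url, (user, password)) for the Ranger REST API.

    The environment variable names are hypothetical; the fallbacks are
    the values used in this notebook's local setup.
    """
    env = os.environ if env is None else env
    base_url = env.get("RANGER_URL", "http://ranger:6080").rstrip("/")
    user = env.get("RANGER_USER", "admin")
    password = env.get("RANGER_PASSWORD", "rangerR0cks!")
    return base_url, (user, password)


# Passing a dict instead of os.environ keeps the example deterministic.
base_url, auth = load_ranger_config({"RANGER_URL": "http://localhost:6080/"})
print(base_url, auth[0])
```

The returned pair could be passed to a client constructor in place of module-level constants.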

In [None]:
import requests
import json
import hashlib
from datetime import datetime, timedelta

# Ranger REST API configuration
RANGER_URL = "http://ranger:6080"
RANGER_AUTH = ("admin", "rangerR0cks!")
HEADERS = {"Content-Type": "application/json", "Accept": "application/json"}


class RangerClient:
    """Client for the Apache Ranger REST API."""

    def __init__(self, base_url, auth):
        self.base_url = base_url
        self.auth = auth
        self.session = requests.Session()
        self.session.auth = auth
        self.session.headers.update(HEADERS)

    def get(self, endpoint, params=None):
        resp = self.session.get(f"{self.base_url}{endpoint}", params=params)
        resp.raise_for_status()
        return resp.json()

    def post(self, endpoint, data):
        resp = self.session.post(f"{self.base_url}{endpoint}", json=data)
        resp.raise_for_status()
        return resp.json()

    def put(self, endpoint, data):
        resp = self.session.put(f"{self.base_url}{endpoint}", json=data)
        resp.raise_for_status()
        return resp.json()

    def delete(self, endpoint):
        resp = self.session.delete(f"{self.base_url}{endpoint}")
        resp.raise_for_status()
        return resp.status_code


ranger = RangerClient(RANGER_URL, RANGER_AUTH)

# Connection test
try:
    services = ranger.get("/service/public/v2/api/service")
    print(f"Connected to Ranger. Registered services: {len(services)}")
    for svc in services:
        print(f"  - {svc['name']} (type: {svc['type']}, status: {'active' if svc.get('isEnabled') else 'disabled'})")
except requests.exceptions.ConnectionError:
    print("WARNING: Ranger is not available. Start the Ranger container.")
    print("The notebook can still be read as educational material.")

## 3. Managing users and groups

Before defining policies, we need users and groups.
In production, UserSync synchronizes them from LDAP/AD. Here we create them manually.

### Role model for the recommendation system:

| Role | Description | Access |
|------|-------------|--------|
| `data_engineer` | Builds ETL pipelines | Full access to Bronze/Silver/Gold |
| `data_scientist` | Trains ML models | Read Silver/Gold, write models |
| `analyst` | Analyzes business data | Read Gold only, no PII |
| `app_service` | API serving recommendations | Read Gold recommendations only |

In [None]:
# Create users
users = [
    {"name": "etl_admin", "firstName": "ETL", "lastName": "Admin",
     "emailAddress": "etl@recommender.local", "password": "Etl@dmin123",
     "userRoleList": ["ROLE_USER"], "groupIdList": []},
    {"name": "ml_engineer", "firstName": "ML", "lastName": "Engineer",
     "emailAddress": "ml@recommender.local", "password": "Ml@eng123",
     "userRoleList": ["ROLE_USER"], "groupIdList": []},
    {"name": "business_analyst", "firstName": "Business", "lastName": "Analyst",
     "emailAddress": "analyst@recommender.local", "password": "An@lyst123",
     "userRoleList": ["ROLE_USER"], "groupIdList": []},
    {"name": "api_service", "firstName": "API", "lastName": "Service",
     "emailAddress": "api@recommender.local", "password": "Api@svc123",
     "userRoleList": ["ROLE_USER"], "groupIdList": []},
]

for user in users:
    try:
        result = ranger.post("/service/xusers/secure/users", user)
        print(f"Created user: {user['name']} (ID: {result.get('id', '?')})")
    except Exception as e:
        print(f"User {user['name']}: {e}")

# Create groups
groups = [
    {"name": "data_engineering", "description": "Data engineering team - full ETL access"},
    {"name": "data_science", "description": "Data science team - access to Silver/Gold"},
    {"name": "analysts", "description": "Business analysts - Gold only, no PII"},
    {"name": "services", "description": "Service accounts - restricted access"}
]

for group in groups:
    try:
        result = ranger.post("/service/xusers/secure/groups", group)
        print(f"Created group: {group['name']} (ID: {result.get('id', '?')})")
    except Exception as e:
        print(f"Group {group['name']}: {e}")

# Assign users to groups
user_group_mapping = [
    ("etl_admin", "data_engineering"),
    ("ml_engineer", "data_science"),
    ("business_analyst", "analysts"),
    ("api_service", "services")
]

for user_name, group_name in user_group_mapping:
    try:
        # A membership is a separate object that links a user ID to a group ID.
        # The endpoint paths follow Ranger's xusers REST API; verify them
        # against the Ranger version you run.
        user_id = ranger.get(f"/service/xusers/users/userName/{user_name}")["id"]
        group_id = ranger.get(f"/service/xusers/groups/groupName/{group_name}")["id"]
        ranger.post("/service/xusers/groupusers", {
            "name": group_name, "parentGroupId": group_id, "userId": user_id
        })
        print(f"Assigned {user_name} -> {group_name}")
    except Exception as e:
        print(f"Mapping {user_name}->{group_name}: {e}")

## 4. Access policies - HDFS

HDFS policies control access to files and directories on HDFS.
Each policy defines:
- **Resource**: an HDFS path (e.g. `/data/movielens/gold/*`)
- **Users/Groups**: who gets access
- **Permissions**: which operations (read, write, execute)
- **Conditions**: extra conditions (e.g. only during working hours)
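
How allow and deny items interact is easier to see in code. The sketch below is a simplified model of the evaluation order (deny items first, then allow items, each with its exceptions), not Ranger's actual implementation: it ignores wildcards, conditions and policy priority, and all function names are hypothetical. A result of `NOT_DETERMINED` means no policy decided; for HDFS the decision may then fall back to native file permissions, depending on plugin configuration.

```python
def item_matches(item, user, groups, access):
    """True if a policy item names this user (or one of their groups) and the access type."""
    who = user in item.get("users", []) or bool(set(groups) & set(item.get("groups", [])))
    what = any(a["type"] == access and a.get("isAllowed", True)
               for a in item.get("accesses", []))
    return who and what


def path_covered(policy, path):
    """Prefix match only; real policies also support wildcards."""
    res = policy["resources"]["path"]
    for base in res["values"]:
        base = base.rstrip("/")
        if path == base or (res.get("isRecursive") and path.startswith(base + "/")):
            return True
    return False


def decide(policies, path, user, groups, access):
    """Return 'DENY', 'ALLOW' or 'NOT_DETERMINED' for one access request."""
    def hit(policy, key):
        return any(item_matches(i, user, groups, access) for i in policy.get(key, []))

    relevant = [p for p in policies if path_covered(p, path)]
    # Deny items are checked first and win over any allow.
    if any(hit(p, "denyPolicyItems") and not hit(p, "denyExceptions") for p in relevant):
        return "DENY"
    if any(hit(p, "policyItems") and not hit(p, "allowExceptions") for p in relevant):
        return "ALLOW"
    return "NOT_DETERMINED"


def make_item(groups, accesses):
    return {"users": [], "groups": groups,
            "accesses": [{"type": a, "isAllowed": True} for a in accesses]}


policies = [
    {"resources": {"path": {"values": ["/data/movielens"], "isRecursive": True}},
     "policyItems": [make_item(["analysts"], ["read"])]},
    {"resources": {"path": {"values": ["/data/movielens/bronze"], "isRecursive": True}},
     "denyPolicyItems": [make_item(["analysts"], ["read"])]},
]

for p in ("/data/movielens/gold/movie_stats", "/data/movielens/bronze/ratings"):
    print(p, decide(policies, p, "business_analyst", ["analysts"], "read"))
```

In this toy data the broad allow on `/data/movielens` covers both paths, but the deny on the Bronze subtree takes precedence there.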

In [None]:
def create_hdfs_policy(name, description, paths, policy_items, is_deny=False):
    """Builds an HDFS policy payload for Ranger."""
    policy = {
        "service": "hdfs_service",  # name of the HDFS service in Ranger
        "name": name,
        "description": description,
        "isEnabled": True,
        "isAuditEnabled": True,
        "resources": {
            "path": {
                "values": paths,
                "isRecursive": True
            }
        },
        "policyItems": [] if is_deny else policy_items,
        "denyPolicyItems": policy_items if is_deny else [],
        "allowExceptions": [],
        "denyExceptions": []
    }
    return policy


def policy_item(users=None, groups=None, accesses=None):
    """Builds a policy item (who gets which permissions)."""
    return {
        "users": users or [],
        "groups": groups or [],
        "accesses": [{"type": a, "isAllowed": True} for a in (accesses or ["read"])],
        "delegateAdmin": False
    }


# --- HDFS policies for the recommendation system ---

hdfs_policies = [
    # 1. Data Engineering - full access to all layers
    create_hdfs_policy(
        name="recommender-de-full-access",
        description="Data Engineering: full access to the entire data lake",
        paths=["/data/movielens"],
        policy_items=[
            policy_item(groups=["data_engineering"],
                        accesses=["read", "write", "execute"])
        ]
    ),
    # 2. Data Science - read access to Silver and Gold
    create_hdfs_policy(
        name="recommender-ds-silver-gold-read",
        description="Data Science: read access to Silver and Gold",
        paths=["/data/movielens/silver", "/data/movielens/gold"],
        policy_items=[
            policy_item(groups=["data_science"],
                        accesses=["read", "execute"])
        ]
    ),
    # 3. Analysts - Gold only (no PII)
    create_hdfs_policy(
        name="recommender-analyst-gold-read",
        description="Analysts: read access to the Gold layer only",
        paths=["/data/movielens/gold"],
        policy_items=[
            policy_item(groups=["analysts"],
                        accesses=["read", "execute"])
        ]
    ),
    # 4. Analysts - DENY on Bronze and Silver
    create_hdfs_policy(
        name="recommender-analyst-deny-raw",
        description="Analysts: no access to Bronze and Silver",
        paths=["/data/movielens/bronze", "/data/movielens/silver"],
        policy_items=[
            policy_item(groups=["analysts"],
                        accesses=["read", "write", "execute"])
        ],
        is_deny=True
    ),
    # 5. API Service - Gold recommendations only
    create_hdfs_policy(
        name="recommender-api-recommendations-only",
        description="API Service: access to the recommendations output only",
        paths=["/data/movielens/gold/recommendations", "/data/movielens/gold/movie_stats"],
        policy_items=[
            policy_item(groups=["services"],
                        accesses=["read", "execute"])
        ]
    )
]

for pol in hdfs_policies:
    try:
        result = ranger.post("/service/public/v2/api/policy", pol)
        print(f"Created HDFS policy: {pol['name']} (ID: {result.get('id', '?')})")
    except Exception as e:
        print(f"Policy error {pol['name']}: {e}")

print("\n--- HDFS access matrix ---")
print(f"{'Layer':<15} {'data_eng':>12} {'data_sci':>12} {'analyst':>12} {'api_svc':>12}")
print("-" * 65)
print(f"{'Bronze':<15} {'RWX':>12} {'---':>12} {'DENY':>12} {'---':>12}")
print(f"{'Silver':<15} {'RWX':>12} {'R-X':>12} {'DENY':>12} {'---':>12}")
print(f"{'Gold':<15} {'RWX':>12} {'R-X':>12} {'R-X':>12} {'R-X*':>12}")
print("\n* api_svc: only /gold/recommendations and /gold/movie_stats")

## 5. Row-Level Security (RLS)

Row-level security restricts which rows a user can see, based on filter conditions.
Example: an analyst sees only ratings for movies in specific genres.

In Ranger, RLS is available for Hive/SparkSQL through Row Filter Policies.
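
The effect of a row filter can be pictured as the engine reading from a filtered subquery instead of the raw table. The sketch below simulates that idea with the standard-library `sqlite3` module; it does not involve Ranger or Hive, and the `ROW_FILTERS` mapping and `query_as` helper are hypothetical names chosen for illustration.

```python
import sqlite3

# Group -> filter expression (illustrative; groups without an entry see all rows)
ROW_FILTERS = {"analysts": "rating >= 3.0"}


def query_as(conn, group, table="ratings"):
    """Read `table` as a member of `group`, wrapping it in the group's row filter."""
    row_filter = ROW_FILTERS.get(group)
    source = f"(SELECT * FROM {table} WHERE {row_filter})" if row_filter else table
    sql = f"SELECT user_id, rating FROM {source} ORDER BY user_id"
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (user_id INTEGER, movie_id INTEGER, rating REAL)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [(1, 10, 4.5), (2, 10, 2.0), (3, 11, 3.0)],
)

print("analysts:    ", query_as(conn, "analysts"))
print("data_science:", query_as(conn, "data_science"))
```

The user's query text stays the same for every group; only the source it reads from changes.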

In [None]:
# Row-level security policy for Hive/SparkSQL
def create_row_filter_policy(name, description, database, table, filters):
    """Builds a row-level security policy payload."""
    policy = {
        "service": "hive_service",
        "name": name,
        "description": description,
        "isEnabled": True,
        "isAuditEnabled": True,
        "policyType": 2,  # 0=access, 1=masking, 2=row filter
        "resources": {
            "database": {"values": [database]},
            "table": {"values": [table]},
        },
        "rowFilterPolicyItems": filters
    }
    return policy


def row_filter_item(users=None, groups=None, filter_expr=""):
    """Builds a row filter policy item."""
    return {
        "users": users or [],
        "groups": groups or [],
        "accesses": [{"type": "select", "isAllowed": True}],
        "rowFilterInfo": {
            "filterExpr": filter_expr
        },
        "delegateAdmin": False
    }


# Example: analysts see only ratings >= 3.0 (no low ratings)
# Data scientists see everything
rls_ratings = create_row_filter_policy(
    name="recommender-rls-ratings",
    description="Row-level security on the ratings table",
    database="recommender",
    table="ratings",
    filters=[
        # Analysts - only ratings >= 3.0
        row_filter_item(
            groups=["analysts"],
            filter_expr="rating >= 3.0"
        ),
        # Data scientists - all rows
        row_filter_item(
            groups=["data_science"],
            filter_expr="1=1"  # no filter
        ),
        # API service - only the last 2 years
        row_filter_item(
            groups=["services"],
            filter_expr="rating_timestamp >= date_sub(current_date(), 730)"
        )
    ]
)

try:
    result = ranger.post("/service/public/v2/api/policy", rls_ratings)
    print(f"Created RLS policy: {rls_ratings['name']}")
except Exception as e:
    print(f"Error: {e}")

print("\n--- Row-Level Security: ratings ---")
print(f"{'Group':<20} {'Row filter':<50}")
print("-" * 70)
print(f"{'data_engineering':<20} {'(no filter - full access)':50}")
print(f"{'data_science':<20} {'1=1 (full access)':50}")
print(f"{'analysts':<20} {'rating >= 3.0':50}")
print(f"{'services':<20} {'rating_timestamp >= now() - 2 years':50}")

## 6. Column Masking - masking PII

Column masking hides or transforms column values.
Available masking methods:

| Method | Description | Example |
|--------|-------------|---------|
| `MASK` | Redact letters and digits | `Jan Kowalski` -> `Xxx Xxxxxxxx` |
| `MASK_SHOW_LAST_4` | Show only the last 4 characters | `123456789` -> `xxxxx6789` |
| `MASK_HASH` | Replace with a SHA-256 hash | `12345` -> `5994471abb...` |
| `MASK_NULL` | Replace with NULL | `Jan` -> `NULL` |
| `CUSTOM` | Custom transformation | `user_id * 1000 + 7` |
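
To make the table concrete, here is a pure-Python approximation of the first three mask types. It is a sketch of the behavior described above, not the functions the query engine actually runs; in particular the replacement characters (`X`, `x`, `n`) are an assumption and may differ between engines and versions.

```python
import hashlib


def mask(value):
    """Redact: uppercase -> 'X', lowercase -> 'x', digit -> 'n', other characters kept."""
    out = []
    for ch in str(value):
        if ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isdigit():
            out.append("n")
        else:
            out.append(ch)
    return "".join(out)


def mask_show_last_4(value, fill="x"):
    """Replace letters and digits with `fill`, except in the last 4 characters."""
    s = str(value)
    head = "".join(fill if ch.isalnum() else ch for ch in s[:-4])
    return head + s[-4:]


def mask_hash(value):
    """SHA-256 hex digest of the value's string form."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


print(mask("Jan Kowalski"))
print(mask_show_last_4("123456789"))
print(mask_hash("12345")[:10] + "...")
```

Hashing keeps values joinable (equal inputs give equal hashes) while hiding the original identifier, which is why it suits analyst-facing views of `user_id`.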

In [None]:
def create_masking_policy(name, description, database, table, column, masking_items):
    """Builds a column masking policy payload."""
    policy = {
        "service": "hive_service",
        "name": name,
        "description": description,
        "isEnabled": True,
        "isAuditEnabled": True,
        "policyType": 1,  # 0=access, 1=masking, 2=row filter
        "resources": {
            "database": {"values": [database]},
            "table": {"values": [table]},
            "column": {"values": [column]}
        },
        "dataMaskPolicyItems": masking_items
    }
    return policy


def masking_item(users=None, groups=None, mask_type="MASK_HASH", custom_expr=None):
    """Builds a masking policy item."""
    item = {
        "users": users or [],
        "groups": groups or [],
        "accesses": [{"type": "select", "isAllowed": True}],
        "dataMaskInfo": {
            "dataMaskType": mask_type
        },
        "delegateAdmin": False
    }
    if custom_expr:
        item["dataMaskInfo"]["valueExpr"] = custom_expr
    return item


# Masking user_id in the ratings table
masking_user_id = create_masking_policy(
    name="recommender-mask-user-id",
    description="Mask user_id (PII) - hash for analysts, null for services",
    database="recommender",
    table="ratings",
    column="user_id",
    masking_items=[
        # Analysts - see a hash instead of user_id
        masking_item(groups=["analysts"], mask_type="MASK_HASH"),
        # API service - sees NULL
        masking_item(groups=["services"], mask_type="MASK_NULL"),
        # Data science - sees the original values (needed for ML)
        # No masking item = unmasked values
    ]
)

try:
    result = ranger.post("/service/public/v2/api/policy", masking_user_id)
    print(f"Created masking policy: {masking_user_id['name']}")
except Exception as e:
    print(f"Error: {e}")

# Masking demo (without Ranger - simulated in Python)
print("\n--- user_id masking demo ---")
sample_user_ids = [12345, 67890, 11111, 99999]

print(f"\n{'user_id':<12} {'data_eng':<12} {'data_sci':<12} {'analyst':<20} {'api_svc':<12}")
print("-" * 70)
for uid in sample_user_ids:
    hashed = hashlib.sha256(str(uid).encode()).hexdigest()[:12]
    print(f"{uid:<12} {uid:<12} {uid:<12} {hashed:<20} {'NULL':<12}")

## 7. Tag-Based Policies - integration with Atlas

Instead of defining policies per resource (HDFS path, table), you can define them per tag.
Tags (classifications) come from Apache Atlas.

**Example**: "All data tagged as PII requires masking" - one policy instead of many.

```
Atlas                          Ranger
+------------------+           +------------------+
| ratings -> PII   |  -------> | Tag: PII         |
| user_profiles    |           | -> mask user_id  |
|   -> PII         |           | -> audit all     |
+------------------+           +------------------+
                               Applied automatically
                               to ALL datasets
                               tagged as PII!
```
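
The indirection in the diagram (resource -> tag -> policy) can be sketched in a few lines. This is a toy model with made-up data and hypothetical names, and it only considers allow rules; real tag policies can also carry deny items, masking and row filters.

```python
# Resource -> tags, as a classification tool such as Atlas might report them
resource_tags = {
    "recommender.ratings": {"PII"},
    "recommender.user_profiles": {"PII"},
    "recommender.movies": {"Public"},
}

# Tag -> groups allowed to read (one entry per tag policy)
tag_read_groups = {
    "PII": {"data_engineering", "data_science"},
    "Public": {"data_engineering", "data_science", "analysts", "services"},
}


def groups_allowed(resource):
    """Union of the groups granted read access through any tag on the resource."""
    allowed = set()
    for tag in resource_tags.get(resource, set()):
        allowed |= tag_read_groups.get(tag, set())
    return allowed


# Tagging a new dataset is enough; no new policy has to be written.
resource_tags["recommender.recommendation_feedback"] = {"PII"}
print(sorted(groups_allowed("recommender.recommendation_feedback")))
```

The policy table never mentions the new dataset; its access follows from the tag alone.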

In [None]:
# Tag-based policy - applies to every resource carrying the tag
def create_tag_policy(name, description, tag, policy_items, policy_type=0):
    """Builds a tag-based policy payload for Ranger."""
    policy = {
        "service": "tag_service",  # dedicated Tag service
        "name": name,
        "description": description,
        "isEnabled": True,
        "isAuditEnabled": True,
        "policyType": policy_type,
        "resources": {
            "tag": {"values": [tag]}
        },
        "policyItems": policy_items if policy_type == 0 else [],
        "dataMaskPolicyItems": policy_items if policy_type == 1 else [],
        "rowFilterPolicyItems": policy_items if policy_type == 2 else []
    }
    return policy


# Tag policies use component-prefixed access types (e.g. "hdfs:read", "hive:select"),
# because one tag can cover resources in several services.

# Policy: PII data - access limited to data engineering and data science
tag_pii_access = create_tag_policy(
    name="tag-pii-restricted-access",
    description="PII data: restricted access. Only data_eng and data_sci.",
    tag="PII",
    policy_items=[
        policy_item(groups=["data_engineering"],
                    accesses=["hdfs:read", "hdfs:write", "hdfs:execute"]),
        policy_item(groups=["data_science"], accesses=["hdfs:read", "hdfs:execute"]),
        # analysts and services are not listed, so this policy grants them nothing.
        # Blocking access they get from resource policies would need an explicit deny item.
    ]
)

# Policy: Confidential data - every access is audited
tag_confidential = create_tag_policy(
    name="tag-confidential-audit",
    description="Confidential data: every access is audited",
    tag="Confidential",
    policy_items=[
        policy_item(groups=["data_engineering", "data_science"],
                    accesses=["hdfs:read", "hdfs:execute"]),
    ]
)

# Policy: Public data - open access
tag_public = create_tag_policy(
    name="tag-public-open-access",
    description="Public data: access for everyone",
    tag="Public",
    policy_items=[
        policy_item(groups=["data_engineering", "data_science", "analysts", "services"],
                    accesses=["hdfs:read", "hdfs:execute"]),
    ]
)

for pol in [tag_pii_access, tag_confidential, tag_public]:
    try:
        result = ranger.post("/service/public/v2/api/policy", pol)
        print(f"Created tag-based policy: {pol['name']}")
    except Exception as e:
        print(f"Error: {pol['name']} - {e}")

print("\n--- Tag-Based Policies ---")
print(f"{'Tag':<15} {'data_eng':>12} {'data_sci':>12} {'analyst':>12} {'api_svc':>12}")
print("-" * 65)
print(f"{'PII':<15} {'RWX':>12} {'R-X':>12} {'DENY':>12} {'DENY':>12}")
print(f"{'Confidential':<15} {'R-X':>12} {'R-X':>12} {'DENY':>12} {'DENY':>12}")
print(f"{'Public':<15} {'R-X':>12} {'R-X':>12} {'R-X':>12} {'R-X':>12}")
print("\nBenefit: a new dataset tagged PII automatically inherits the policy!")

## 8. Audit and Compliance

Ranger records every data access:
- **Who**: user/service
- **What**: which resource (HDFS file, table, column)
- **When**: timestamp
- **How**: operation (read, write, execute)
- **Result**: allow/deny

Audit records are stored in Solr and available through the API.
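
Compliance questions are usually scoped to a time window ("who accessed X in the last week?"). A small helper can build the date parameters for such a query. This is a sketch: the `startDate`/`endDate` parameter names match the ones used in this notebook, while the `MM/dd/yyyy` date format is an assumption that should be checked against your Ranger version.

```python
from datetime import date, timedelta


def audit_window_params(days, today=None, fmt="%m/%d/%Y"):
    """Return startDate/endDate query parameters covering the last `days` days.

    `fmt` is an assumed date format; adjust it if your Ranger expects another one.
    """
    today = today or date.today()
    start = today - timedelta(days=days)
    return {"startDate": start.strftime(fmt), "endDate": today.strftime(fmt)}


# A fixed "today" keeps the example reproducible.
print(audit_window_params(7, today=date(2025, 1, 15)))
```

The resulting dictionary can be merged into the query parameters of an audit search.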

In [None]:
def get_audit_logs(service_type=None, user=None, resource=None,
                   start_date=None, end_date=None, limit=20):
    """Fetches audit logs from Ranger."""
    params = {
        "page": 0,
        "pageSize": limit,
        "sortBy": "eventTime",
        "sortType": "desc"
    }
    if service_type:
        params["repoType"] = service_type  # 1=HDFS, 3=Hive
    if user:
        params["requestUser"] = user
    if resource:
        params["resourcePath"] = resource
    if start_date:
        params["startDate"] = start_date
    if end_date:
        params["endDate"] = end_date

    try:
        result = ranger.get("/service/assets/accessAudit", params=params)
        return result.get("vXAccessAudits", [])
    except Exception as e:
        print(f"Error: {e}")
        return []


def print_audit_report(logs):
    """Prints a formatted audit report."""
    print(f"\n{'Time':<22} {'User':<18} {'Operation':<10} {'Resource':<40} {'Result':<8}")
    print("-" * 100)
    for log in logs:
        event_time = log.get("eventTime", "")[:19]
        user = log.get("requestUser", "?")
        access_type = log.get("accessType", "?")
        resource = log.get("resourcePath", "?")[:38]
        result = log.get("accessResult", 0)
        result_str = "ALLOW" if result == 1 else "DENY"
        print(f"{event_time:<22} {user:<18} {access_type:<10} {resource:<40} {result_str:<8}")


# Fetch audit logs
print("=== Latest audit logs ===")
logs = get_audit_logs(limit=10)
if logs:
    print_audit_report(logs)
else:
    print("No logs (Ranger Audit is unavailable or there was no activity)")

# Compliance report: who accessed PII data?
print("\n=== Access to PII data (ratings) ===")
pii_logs = get_audit_logs(resource="/data/movielens/*/ratings", limit=10)
if pii_logs:
    print_audit_report(pii_logs)
else:
    print("No PII access logs")

# Simulated compliance report
print("\n=== Compliance report (simulation) ===")
compliance_report = [
    {"eventTime": "2025-01-15 10:23:45", "requestUser": "etl_admin",
     "accessType": "write", "resourcePath": "/data/movielens/bronze/ratings", "accessResult": 1},
    {"eventTime": "2025-01-15 11:00:12", "requestUser": "ml_engineer",
     "accessType": "read", "resourcePath": "/data/movielens/silver/ratings", "accessResult": 1},
    {"eventTime": "2025-01-15 11:30:00", "requestUser": "business_analyst",
     "accessType": "read", "resourcePath": "/data/movielens/bronze/ratings", "accessResult": 0},
    {"eventTime": "2025-01-15 12:00:00", "requestUser": "business_analyst",
     "accessType": "read", "resourcePath": "/data/movielens/gold/movie_stats", "accessResult": 1},
    {"eventTime": "2025-01-15 14:15:30", "requestUser": "api_service",
     "accessType": "read", "resourcePath": "/data/movielens/gold/recommendations", "accessResult": 1},
    {"eventTime": "2025-01-15 14:20:00", "requestUser": "api_service",
     "accessType": "read", "resourcePath": "/data/movielens/silver/ratings", "accessResult": 0},
]
print_audit_report(compliance_report)

In [None]:
# Audit analysis - statistics
from collections import Counter

def audit_statistics(logs):
    """Prints summary statistics for a list of audit log entries."""
    print("=== Audit statistics ===")
    if not logs:
        # Avoid dividing by zero in the percentage calculations below
        print("No audit logs to analyze")
        return

    # Accesses per user
    user_counts = Counter(log["requestUser"] for log in logs)
    print("\nAccesses per user:")
    for user, count in user_counts.most_common():
        print(f"  {user}: {count}")

    # Allow vs Deny
    results = Counter("ALLOW" if log["accessResult"] == 1 else "DENY" for log in logs)
    total = len(logs)
    print("\nAllow vs Deny:")
    print(f"  ALLOW: {results['ALLOW']} ({results['ALLOW']/total*100:.0f}%)")
    print(f"  DENY:  {results['DENY']} ({results['DENY']/total*100:.0f}%)")

    # Denials - potential violations
    denies = [log for log in logs if log["accessResult"] == 0]
    if denies:
        print("\nACCESS DENIALS (potential violations):")
        for d in denies:
            print(f"  {d['requestUser']} attempted {d['accessType']} on {d['resourcePath']}")


audit_statistics(compliance_report)

## 9. Ranger + Atlas Integration: Governance-Driven Security

Full Atlas + Ranger integration provides **governance-driven security**:

1. **Atlas** classifies the data (PII, Confidential, Public)
2. **Ranger** enforces policies based on those classifications
3. A new dataset tagged PII **automatically** gets a security policy

```
  Atlas (Governance)              Ranger (Security)
  +-------------------+          +-------------------+
  | 1. Klasyfikacja   |  ------> | 3. Tag-based      |
  |    PII, Conf.     |  tagsync |    policy apply    |
  +-------------------+          +-------------------+
         ^                              |
         |                              v
  +------+------------+          +-------------------+
  | 2. Lineage        |          | 4. Audit log      |
  |    propagation    |          |    compliance     |
  +-------------------+          +-------------------+
```

### TagSync
Ranger TagSync synchronizes tags from Atlas to Ranger, typically within a few seconds of a change.
Example configuration in `ranger-tagsync-site.xml`:

```xml
<property>
  <name>ranger.tagsync.source.atlas</name>
  <value>true</value>
</property>
<property>
  <name>ranger.tagsync.source.atlas.kafka.bootstrap.servers</name>
  <value>kafka:9092</value>
</property>
```
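
When debugging TagSync it can help to check which properties a site file actually sets. The sketch below reads Hadoop-style `<property>` entries into a dictionary using only the standard library. It assumes the fragment is wrapped in a root `<configuration>` element, and the `read_site_properties` helper is a hypothetical name, not part of Ranger.

```python
import xml.etree.ElementTree as ET


def read_site_properties(xml_text):
    """Return {name: value} for every <property> element in a site XML document."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


# The two properties from the fragment above, wrapped in an assumed root element
sample = """<configuration>
  <property>
    <name>ranger.tagsync.source.atlas</name>
    <value>true</value>
  </property>
  <property>
    <name>ranger.tagsync.source.atlas.kafka.bootstrap.servers</name>
    <value>kafka:9092</value>
  </property>
</configuration>"""

props = read_site_properties(sample)
for name, value in props.items():
    print(f"{name} = {value}")
```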

In [None]:
# Simulation of the full flow: Atlas tag -> Ranger policy -> Audit

print("""
=== Scenario: a new dataset containing PII ===

1. A data engineer creates a new dataset: recommendation_feedback
   Contains: user_id, movie_id, feedback_text, timestamp

2. The Atlas Hook registers the dataset automatically
   -> Atlas: recommender_dataset 'recommendation_feedback'

3. A data steward classifies it in Atlas:
   -> Tag: PII (because of user_id + feedback_text)
   -> Tag: Confidential (because of feedback_text)

4. Ranger TagSync picks up the tags from Atlas
   -> Tag PII -> policy 'tag-pii-restricted-access' applies
   -> Tag Confidential -> policy 'tag-confidential-audit' applies

5. Immediate effect (no extra configuration!):
   - data_engineering: full access (RWX)
   - data_science: read access (R-X)
   - analysts: NO ACCESS (not granted by the PII policy)
   - api_service: NO ACCESS (not granted by the PII policy)
   - Every access is recorded in the audit log

6. If an analyst needs the data:
   -> Masking policy: user_id -> HASH, feedback_text -> MASK
   -> Row filter: only feedback from the last month
""")

# Summary of the security model
print("\n=== Security model of the recommendation system ===")
print(f"\n{'Layer/Data':<25} {'Classification':<15} {'data_eng':>10} {'data_sci':>10} {'analyst':>10} {'api_svc':>10}")
print("-" * 85)
print(f"{'Bronze/ratings':<25} {'PII':<15} {'RWX':>10} {'---':>10} {'DENY':>10} {'DENY':>10}")
print(f"{'Bronze/movies':<25} {'Public':<15} {'RWX':>10} {'---':>10} {'DENY':>10} {'DENY':>10}")
print(f"{'Silver/ratings':<25} {'PII':<15} {'RWX':>10} {'R-X':>10} {'DENY':>10} {'DENY':>10}")
print(f"{'Silver/movies':<25} {'Public':<15} {'RWX':>10} {'R-X':>10} {'DENY':>10} {'DENY':>10}")
print(f"{'Gold/movie_stats':<25} {'Public':<15} {'RWX':>10} {'R-X':>10} {'R-X':>10} {'R-X':>10}")
print(f"{'Gold/user_profiles':<25} {'Confid.':<15} {'RWX':>10} {'R-X':>10} {'R-X*':>10} {'DENY':>10}")
print(f"{'Gold/recommendations':<25} {'Confid.':<15} {'RWX':>10} {'R-X':>10} {'DENY':>10} {'R-X':>10}")
print("\n* analyst: user_id masked (HASH), row filter: rating >= 3.0")

## Final exercise

Design a complete security model for the recommendation system with four roles:

**Roles:**
- `data_engineer` - builds pipelines, full access
- `data_scientist` - trains models, needs Silver/Gold
- `analyst` - business analysis, must not see PII
- `app_service` - API serving recommendations

**Tasks:**
1. Define HDFS policies for the new path `/data/movielens/models/` (ML models)
   - data_engineer: RWX, data_scientist: RWX, analyst: deny, app_service: R-X
2. Add a row-level security policy on the `user_profiles` table:
   - analyst sees only the power_user and active segments
   - app_service sees everything, but with user_id masked
3. Add a column masking policy on the `user_profiles` table:
   - analyst: avg_rating -> MASK_NULL, user_id -> MASK_HASH
   - app_service: user_id -> MASK_HASH
4. Create a tag-based policy for the new tag `ML_Feature`:
   - only data_scientist and data_engineer have access
5. Write a function that generates a compliance report:
   - who tried to read PII data in the last week?
   - how many access attempts were denied?

In [None]:
# Your solution:


In [None]:
# Summary
print("""
=== Summary: Apache Ranger in the recommendation system ===

1. ACCESS CONTROL
   - HDFS: Bronze/Silver/Gold paths with different permissions per role
   - Hive/SparkSQL: tables and columns with fine-grained policies
   - Kafka: recommendation topics -> services only

2. ROW-LEVEL SECURITY
   - Analysts do not see low ratings (bias in reports?)
   - The API sees only recent data (last 2 years)
   - ML engineers see everything (needed for training)

3. COLUMN MASKING
   - user_id: hash for analysts, null for the API
   - Supports GDPR compliance without changing the pipeline

4. TAG-BASED POLICIES
   - PII -> access restricted automatically
   - New dataset tagged PII -> protected right away
   - Integration with Atlas: governance -> security

5. AUDIT
   - Full history of data access
   - Compliance reports: who, what, when
   - Violation detection (deny logs)

Ranger + Atlas = full governance-driven security.
Data is protected from the moment of classification to the moment of consumption.
""")