# 22 - Flink SQL & Table API

Learning Flink SQL and the Table API: stream and batch processing in a single model.

**Topics:**
- Flink SQL vs Spark SQL - comparing syntax and capabilities
- PyFlink Table Environment setup
- Creating tables: DDL, connectors (JDBC, filesystem)
- Flink SQL: SELECT, JOIN, GROUP BY, window functions
- Continuous queries vs batch queries
- Temporal joins (versioned tables)
- CDC (Change Data Capture): Flink CDC connectors
- Pattern matching: MATCH_RECOGNIZE (CEP in SQL)
- Catalogs: Flink + Hive Metastore
- Performance comparison: Flink SQL vs Spark SQL on MovieLens
- Final exercise

## 1. Flink SQL vs Spark SQL - comparison

| Feature | Flink SQL | Spark SQL |
|---------|-----------|-----------|
| **Model** | Stream-first (batch = bounded stream) | Batch-first (streaming added later) |
| **Latency** | Milliseconds (record-at-a-time streaming) | Typically hundreds of milliseconds to seconds (micro-batch) |
| **Time semantics** | Event time + watermarks, declared in DDL | Event time + watermarks via `withWatermark` (Structured Streaming) |
| **Continuous queries** | Yes - a query runs continuously | Via Structured Streaming (micro-batch triggers) |
| **CDC** | Native support (changelog stream) | No native changelog model; typically via Delta Lake / Hudi |
| **MATCH_RECOGNIZE** | Yes - CEP in SQL | Not available |
| **Temporal joins** | Native | Not available (must be built manually) |
| **Catalog** | Hive Metastore, JDBC Catalog | Hive Metastore |
| **UDF** | Java, Scala, Python (PyFlink) | Python, Scala, Java |

### When to use Flink SQL?
- Real-time processing of data streams (Kafka, CDC)
- Continuous queries - queries that run non-stop
- Complex Event Processing (CEP) - detecting patterns in streams
- Low latency (milliseconds, versus micro-batch latency in Spark)

### When to use Spark SQL?
- Large batch ETL over historical data
- Machine Learning pipelines (MLlib)
- Ad-hoc analytics on large data sets
- Integration with the Python ecosystem (pandas UDF)

## 2. PyFlink Table Environment - setup

Flink offers two main table environments:
- `TableEnvironment` - the unified entry point for pure SQL / Table API programs; it runs in streaming or batch mode depending on `EnvironmentSettings` (used in this notebook)
- `StreamTableEnvironment` - additionally bridges to the DataStream API (converting between `DataStream` and `Table`)

The Flink Table API is built around the **Table** concept - the counterpart of a DataFrame in Spark.

In [None]:
from pyflink.table import EnvironmentSettings, TableEnvironment
from pyflink.table.expressions import col, lit, call
from pyflink.common import Row
import pandas as pd

# Most queries in this notebook read bounded JDBC/Parquet sources, so the
# environment is created in batch mode. The runtime mode is fixed when the
# TableEnvironment is created; switching modes later requires a new environment.
env_settings = EnvironmentSettings.in_batch_mode()
t_env = TableEnvironment.create(env_settings)

# Flink configuration (in this setup the JobManager runs at flink-jobmanager:8081)
t_env.get_config().set("parallelism.default", "2")
t_env.get_config().set("table.exec.resource.default-parallelism", "2")

print("Flink Table Environment created!")
print("Runtime mode: batch")

## 3. Creating tables - DDL and connectors

Flink uses DDL statements to create tables bound to external data sources.
Every table declares a **connector** (JDBC, filesystem, Kafka, etc.).

### Connector architecture:
```
Flink SQL Query
      |
      v
Table (DDL)  -->  Connector  -->  External system
                  - jdbc          - PostgreSQL
                  - filesystem    - HDFS / local
                  - kafka         - Kafka topics
                  - upsert-kafka  - Kafka compacted
```

In [None]:
# JDBC table - connection to PostgreSQL
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS ratings (
        user_id     INT,
        movie_id    INT,
        rating      DOUBLE,
        rating_timestamp BIGINT
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:postgresql://postgres:5432/recommender',
        'table-name' = 'movielens.ratings',
        'username' = 'recommender',
        'password' = 'recommender',
        'scan.partition.column' = 'user_id',
        'scan.partition.num' = '4',
        'scan.partition.lower-bound' = '1',
        'scan.partition.upper-bound' = '300000'
    )
""")

t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS movies (
        movie_id    INT,
        title       STRING,
        genres      STRING
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:postgresql://postgres:5432/recommender',
        'table-name' = 'movielens.movies',
        'username' = 'recommender',
        'password' = 'recommender'
    )
""")

print("JDBC tables created.")

# Filesystem table - reads Parquet files from HDFS
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS ratings_hdfs (
        user_id     INT,
        movie_id    INT,
        rating      DOUBLE,
        rating_timestamp BIGINT
    ) WITH (
        'connector' = 'filesystem',
        'path' = 'hdfs://namenode:9000/user/spark/movielens/ratings',
        'format' = 'parquet'
    )
""")

print("HDFS (filesystem) table created.")

In [None]:
# List the available tables
t_env.execute_sql("SHOW TABLES").print()

## 4. Flink SQL - basic queries

Flink SQL syntax is very close to ANSI SQL and Spark SQL.
The main difference: in streaming mode Flink treats query results as a **changelog stream** (insert/update/delete).

In [None]:
# SELECT with WHERE - same as in Spark SQL
t_env.execute_sql("""
    SELECT user_id, movie_id, rating
    FROM ratings
    WHERE rating >= 4.5 AND user_id <= 100
    ORDER BY rating DESC
    LIMIT 10
""").print()

# GROUP BY with HAVING
t_env.execute_sql("""
    SELECT user_id,
           COUNT(*) AS rating_count,
           ROUND(AVG(rating), 2) AS avg_rating,
           MIN(rating) AS min_rating,
           MAX(rating) AS max_rating
    FROM ratings
    GROUP BY user_id
    HAVING COUNT(*) > 1000
    ORDER BY rating_count DESC
    LIMIT 20
""").print()

In [None]:
# JOIN - the same SQL as in Spark
t_env.execute_sql("""
    SELECT m.title,
           COUNT(*) AS num_ratings,
           ROUND(AVG(r.rating), 2) AS avg_rating
    FROM ratings r
    JOIN movies m ON r.movie_id = m.movie_id
    GROUP BY m.title
    HAVING COUNT(*) > 5000
    ORDER BY avg_rating DESC
    LIMIT 15
""").print()

# CTE (Common Table Expressions) - same as in Spark SQL
t_env.execute_sql("""
    WITH user_stats AS (
        SELECT user_id,
               COUNT(*) AS num_ratings,
               ROUND(AVG(rating), 2) AS avg_rating
        FROM ratings
        GROUP BY user_id
    ),
    user_categories AS (
        SELECT *,
               CASE
                   WHEN num_ratings >= 1000 THEN 'power_user'
                   WHEN num_ratings >= 100 THEN 'active'
                   WHEN num_ratings >= 20 THEN 'casual'
                   ELSE 'rare'
               END AS user_category
        FROM user_stats
    )
    SELECT user_category,
           COUNT(*) AS num_users,
           ROUND(AVG(avg_rating), 2) AS mean_avg_rating,
           CAST(ROUND(AVG(num_ratings), 0) AS INT) AS mean_num_ratings
    FROM user_categories
    GROUP BY user_category
    ORDER BY mean_num_ratings DESC
""").print()

## 5. Table API - the programmatic counterpart of SQL

The Table API is Flink's programmatic interface - the counterpart of the DataFrame API in Spark.
Most relational queries can be written either as SQL or as a chain of Table API operations.

| Spark DataFrame API | Flink Table API |
|--------------------|-----------------|
| `df.select(...)` | `table.select(...)` |
| `df.filter(...)` | `table.where(...)` |
| `df.groupBy(...).agg(...)` | `table.group_by(...).select(...)` |
| `df.join(...)` | `table.join(...)` |
| `df.orderBy(...)` | `table.order_by(...)` |

In [None]:
from pyflink.table.expressions import col, lit, call

# Get the tables as Table objects
ratings_table = t_env.from_path("ratings")
movies_table = t_env.from_path("movies")

# SELECT with WHERE - Table API
result = ratings_table \
    .where(col("rating") >= 4.5) \
    .where(col("user_id") <= 100) \
    .select(col("user_id"), col("movie_id"), col("rating")) \
    .order_by(col("rating").desc) \
    .fetch(10)

result.print()

# GROUP BY - Table API
user_stats = ratings_table \
    .group_by(col("user_id")) \
    .select(
        col("user_id"),
        col("rating").count.alias("rating_count"),
        col("rating").avg.alias("avg_rating")
    ) \
    .where(col("rating_count") > 500) \
    .order_by(col("rating_count").desc) \
    .fetch(10)

user_stats.print()

In [None]:
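# JOIN - Table API
# Both inputs have a movie_id column and a Table API join needs unique field
# names, so the key on the movies side is renamed first.
movies_renamed = movies_table.select(
    col("movie_id").alias("m_movie_id"),
    col("title")
)

joined = ratings_table \
    .join(movies_renamed, col("movie_id") == col("m_movie_id")) \
    .group_by(col("title")) \
    .select(
        col("title"),
        col("rating").count.alias("num_ratings"),
        col("rating").avg.alias("avg_rating")
    ) \
    .where(col("num_ratings") > 5000) \
    .order_by(col("avg_rating").desc) \
    .fetch(10)

joined.print()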

# Mixing the Table API with SQL - same as in Spark
temp_table = ratings_table \
    .group_by(col("movie_id")) \
    .select(
        col("movie_id"),
        col("rating").count.alias("cnt"),
        col("rating").avg.alias("avg_r")
    )

t_env.create_temporary_view("movie_stats", temp_table)

# Continue in SQL
t_env.execute_sql("""
    SELECT m.title, s.cnt, ROUND(s.avg_r, 2) AS avg_rating
    FROM movie_stats s
    JOIN movies m ON s.movie_id = m.movie_id
    WHERE s.cnt > 10000
    ORDER BY s.avg_r DESC
    LIMIT 10
""").print()

## 6. Continuous queries vs batch queries

This is the **key difference** between Flink and Spark:

- **Batch query**: processes a finite data set and terminates (like Spark SQL)
- **Continuous query**: runs indefinitely, updating its results as new data arrives

In Flink SQL the same SQL statement can run in both modes!

```
Batch:      [data] ---> [query] ---> [result] (done)
Continuous: [stream] ---> [query] ---> [changelog stream] (ongoing)
```

### Changelog stream
The result of a continuous query is a **changelog**:
- `+I` = INSERT (new row)
- `-U` = UPDATE BEFORE (old value before the update)
- `+U` = UPDATE AFTER (new value after the update)
- `-D` = DELETE

Example: `SELECT genre, COUNT(*) FROM ratings GROUP BY genre` (illustrative; assumes a `genre` column)
```
+I[Action, 1]        -- first Action rating
+I[Comedy, 1]        -- first Comedy rating
-U[Action, 1]        -- Action update (old value)
+U[Action, 2]        -- Action update (new value)
```
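
The changelog semantics above can be sketched in plain Python. This is a simplified illustration, not Flink's implementation: `count_changelog` is a hypothetical helper that keeps a running count per key and emits the rows a `GROUP BY ... COUNT(*)` continuous query would produce for each incoming record.

```python
from collections import defaultdict


def count_changelog(keys):
    """Yield (kind, key, count) changelog rows for a running COUNT(*) per key."""
    counts = defaultdict(int)
    for key in keys:
        old = counts[key]
        counts[key] = old + 1
        if old == 0:
            # First record for this key: a plain insert
            yield ("+I", key, 1)
        else:
            # Later records retract the old count and emit the new one
            yield ("-U", key, old)
            yield ("+U", key, old + 1)


changelog = list(count_changelog(["Action", "Comedy", "Action"]))
for kind, key, cnt in changelog:
    print(f"{kind}[{key}, {cnt}]")
```

A sink that understands keys (an upsert sink) only needs the `+I`/`+U`/`-D` rows, while an append-only sink cannot consume such a stream at all.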

In [None]:
# Batch mode - processes all the data and terminates
t_env.get_config().set("execution.runtime-mode", "batch")

t_env.execute_sql("""
    SELECT movie_id,
           COUNT(*) AS num_ratings,
           ROUND(AVG(rating), 2) AS avg_rating
    FROM ratings
    GROUP BY movie_id
    HAVING COUNT(*) > 10000
    ORDER BY avg_rating DESC
    LIMIT 10
""").print()

# Streaming mode - the same SQL, but as a continuous query.
# In streaming mode GROUP BY emits a changelog (+I, -U, +U).
# The runtime mode is fixed when a TableEnvironment is created, so a separate
# streaming environment is used instead of changing "execution.runtime-mode".
s_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Tables live in each environment's own in-memory catalog, so the source
# table has to be registered again.
s_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS ratings (
        user_id     INT,
        movie_id    INT,
        rating      DOUBLE,
        rating_timestamp BIGINT
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:postgresql://postgres:5432/recommender',
        'table-name' = 'movielens.ratings',
        'username' = 'recommender',
        'password' = 'recommender'
    )
""")

# Sink table for the results (print connector)
s_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS rating_counts_sink (
        movie_id INT,
        num_ratings BIGINT,
        avg_rating DOUBLE,
        PRIMARY KEY (movie_id) NOT ENFORCED
    ) WITH (
        'connector' = 'print'
    )
""")

# INSERT INTO ... SELECT - starts the continuous query.
# With an unbounded source (e.g. Kafka) it would run indefinitely.
# Here it finishes once all rows from JDBC (a bounded source) are processed.
s_env.execute_sql("""
    INSERT INTO rating_counts_sink
    SELECT movie_id,
           COUNT(*) AS num_ratings,
           ROUND(AVG(rating), 2) AS avg_rating
    FROM ratings
    GROUP BY movie_id
""").wait()

print("Continuous query finished (bounded source).")

## 7. Window functions and Flink SQL specifics

Flink SQL supports standard window functions like Spark, and also adds **time windows** specific to streaming:

| Window type | Description | Spark equivalent |
|-------------|-------------|------------------|
| `TUMBLE` | Non-overlapping windows of fixed size | `window(col, "1 hour")` |
| `HOP` | Overlapping (sliding) windows | `window(col, "1 hour", "15 minutes")` |
| `CUMULATE` | Cumulative (growing) windows | No equivalent |
| `SESSION` | Session windows (gap-based) | `session_window(col, "5 minutes")` (Spark 3.2+) |
| `OVER` | Standard window functions | `Window.partitionBy().orderBy()` |
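
How `TUMBLE` and `HOP` assign a row to windows can be sketched in plain Python. This is a minimal sketch, assuming integer timestamps in seconds and windows aligned to the epoch with no offset; `tumble_window` and `hop_windows` are illustrative helper names, not Flink APIs.

```python
def tumble_window(ts, size):
    """Return the single [start, end) tumbling window that contains ts."""
    start = ts - ts % size
    return (start, start + size)


def hop_windows(ts, size, slide):
    """Return every [start, end) hopping window of length `size`,
    starting every `slide` seconds, that contains ts."""
    windows = []
    start = ts - ts % slide          # latest window start not after ts
    while start > ts - size:         # window still covers ts
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


print(tumble_window(125, 60))
print(hop_windows(125, 60, 15))
```

A tumbling window is the special case of a hopping window where the slide equals the size, which is why a row lands in exactly one tumbling window but in `size / slide` hopping windows.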

In [None]:
# Standard window functions - work the same as in Spark SQL
t_env.execute_sql("""
    SELECT user_id, movie_id, rating, rn
    FROM (
        SELECT user_id, movie_id, rating,
               ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY rating DESC) AS rn
        FROM ratings
        WHERE user_id <= 5
    )
    WHERE rn <= 3
    ORDER BY user_id, rn
""").print()

# TUMBLE window - time windows (specific to Flink)
# Requires a time attribute: a TIMESTAMP column with a WATERMARK declaration
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS ratings_with_ts (
        user_id     INT,
        movie_id    INT,
        rating      DOUBLE,
        rating_timestamp BIGINT,
        ts AS TO_TIMESTAMP_LTZ(rating_timestamp, 0),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:postgresql://postgres:5432/recommender',
        'table-name' = 'movielens.ratings',
        'username' = 'recommender',
        'password' = 'recommender'
    )
""")

# Aggregation in TUMBLE time windows (roughly per year: 365-day buckets
# aligned to the epoch, not calendar years). DAY(3) is needed because the
# default DAY precision allows only two digits.
t_env.execute_sql("""
    SELECT
        window_start,
        window_end,
        COUNT(*) AS num_ratings,
        ROUND(AVG(rating), 2) AS avg_rating
    FROM TABLE(
        TUMBLE(TABLE ratings_with_ts, DESCRIPTOR(ts), INTERVAL '365' DAY(3))
    )
    GROUP BY window_start, window_end
    ORDER BY window_start
""").print()

## 8. Temporal joins and CDC (Change Data Capture)

### Temporal joins
A temporal join joins a stream of events with a **versioned table** - for each event
it uses the version of the table that was valid at the time of the event.

Example: a rating made at time T is joined with the movie prices valid at time T.

```sql
-- Temporal join in Flink SQL (movie_prices is an illustrative versioned table)
SELECT r.user_id, r.movie_id, r.rating, p.price
FROM ratings r
JOIN movie_prices FOR SYSTEM_TIME AS OF r.event_time AS p
  ON r.movie_id = p.movie_id
```
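
The "version valid at time T" lookup can be sketched in plain Python. This is an illustration under stated assumptions, not how Flink stores versioned tables: `price_history`, `price_as_of` and the sample rows are made up, timestamps are plain integers, and each movie's versions are kept sorted by their `valid_from` time.

```python
from bisect import bisect_right

# Hypothetical versioned table: movie_id -> [(valid_from, price), ...] sorted by valid_from
price_history = {
    1: [(0, 9.99), (100, 12.99)],
    2: [(50, 4.99)],
}


def price_as_of(movie_id, event_time):
    """Return the price version valid at event_time, or None if none exists yet."""
    versions = price_history.get(movie_id, [])
    starts = [valid_from for valid_from, _ in versions]
    idx = bisect_right(starts, event_time) - 1   # last version starting at or before event_time
    return versions[idx][1] if idx >= 0 else None


# (user_id, movie_id, rating, event_time)
ratings = [(7, 1, 4.0, 60), (7, 1, 5.0, 150), (8, 2, 3.0, 10)]

# Inner temporal join: ratings without a valid price version are dropped
joined = [
    (user_id, movie_id, rating, price_as_of(movie_id, ts))
    for user_id, movie_id, rating, ts in ratings
    if price_as_of(movie_id, ts) is not None
]
print(joined)
```

The same movie yields different prices for the two ratings because each rating is matched against the version that was current at its own event time, which is the difference from a regular join against the latest state.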

### CDC - Change Data Capture
Flink can read a database's **changelog** as a stream:
- `mysql-cdc` - reads the MySQL binlog
- `postgres-cdc` - reads the PostgreSQL WAL (Write-Ahead Log)
- `mongodb-cdc` - reads the MongoDB oplog

```sql
-- CDC table - Flink continuously reads changes from PostgreSQL
CREATE TABLE movies_cdc (
    movie_id INT,
    title STRING,
    genres STRING,
    PRIMARY KEY (movie_id) NOT ENFORCED
) WITH (
    'connector' = 'postgres-cdc',
    'hostname' = 'postgres',
    'port' = '5432',
    'username' = 'recommender',
    'password' = 'recommender',
    'database-name' = 'recommender',
    'schema-name' = 'movielens',
    'table-name' = 'movies',
    'slot.name' = 'flink_movies_slot'
);
```

When someone runs an UPDATE/INSERT/DELETE in PostgreSQL, Flink sees the change with low latency.
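
What a downstream consumer does with such a change stream can be sketched in plain Python. This is a simplified illustration, not the CDC connector's implementation: `apply_changelog` is a hypothetical helper and the sample events are made up. It materializes the current state of a table keyed by `movie_id` from changelog rows.

```python
def apply_changelog(table, events):
    """Apply (op, row) changelog events to a dict keyed by movie_id, in place."""
    for op, row in events:
        key = row["movie_id"]
        if op in ("+I", "+U"):
            table[key] = row          # insert, or overwrite with the new image
        elif op == "-D":
            table.pop(key, None)      # remove the deleted row
        # "-U" carries the old image; an upsert keyed by primary key can skip it
    return table


events = [
    ("+I", {"movie_id": 1, "title": "Toy Story (1995)"}),
    ("+I", {"movie_id": 2, "title": "Jumanji (1995)"}),
    ("-U", {"movie_id": 2, "title": "Jumanji (1995)"}),
    ("+U", {"movie_id": 2, "title": "Jumanji"}),
    ("-D", {"movie_id": 1, "title": "Toy Story (1995)"}),
]
movies = apply_changelog({}, events)
print(movies)
```

This is why the CDC table above declares a `PRIMARY KEY`: the key tells Flink which earlier row an update or delete refers to.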

In [None]:
# Simulating a temporal join with a lookup join (bounded version)
# Lookup join: for every row from the stream, query the dimension table

# Dimension table (movies) with the lookup cache enabled
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS movies_lookup (
        movie_id    INT,
        title       STRING,
        genres      STRING,
        PRIMARY KEY (movie_id) NOT ENFORCED
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:postgresql://postgres:5432/recommender',
        'table-name' = 'movielens.movies',
        'username' = 'recommender',
        'password' = 'recommender',
        'lookup.cache.max-rows' = '10000',
        'lookup.cache.ttl' = '1h'
    )
""")

# Lookup join: the ratings stream is joined with the movies table.
# FOR SYSTEM_TIME AS OF is the key temporal join syntax. A lookup join needs a
# processing-time attribute on the probe side, so PROCTIME() is added in a subquery.
# In batch mode it behaves like a regular join with a cache.
t_env.execute_sql("""
    SELECT r.user_id,
           r.movie_id,
           r.rating,
           m.title,
           m.genres
    FROM (SELECT *, PROCTIME() AS proc_time FROM ratings_with_ts) AS r
    JOIN movies_lookup FOR SYSTEM_TIME AS OF r.proc_time AS m
      ON r.movie_id = m.movie_id
    WHERE r.user_id <= 5 AND r.rating = 5.0
    LIMIT 15
""").print()

## 9. MATCH_RECOGNIZE - Pattern matching (CEP in SQL)

MATCH_RECOGNIZE is a distinctive feature of Flink SQL - it detects **patterns in sequences of events**
directly in SQL. It is the SQL counterpart of Complex Event Processing (CEP).

Spark SQL **does not have** this functionality.

### Syntax:
```sql
SELECT *
FROM table_name
MATCH_RECOGNIZE (
    PARTITION BY key_column
    ORDER BY time_column
    MEASURES
        -- what to extract from each match
    ONE ROW PER MATCH          -- ALL ROWS PER MATCH is not supported by Flink
    PATTERN (regex-like pattern)
    DEFINE
        -- definitions of the pattern variables
)
```

### Example: detect users with increasing ratings
Pattern: 3 consecutive ratings, each higher than the previous one.
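
Before reading the SQL version, it can help to see the same idea in plain Python. This is only an analogue of the pattern for a single user's ratings ordered by time, not Flink's matching engine; `increasing_streaks` is a hypothetical helper and the sample ratings are made up.

```python
def increasing_streaks(ratings, min_len=3):
    """Return maximal runs of strictly increasing values with at least min_len items."""
    streaks, run = [], []
    for value in ratings:
        if run and value <= run[-1]:
            # The increase is broken: keep the run if it is long enough, then restart
            if len(run) >= min_len:
                streaks.append(run)
            run = []
        run.append(value)
    if len(run) >= min_len:
        streaks.append(run)
    return streaks


user_ratings = [3.0, 3.5, 4.0, 4.5, 2.0, 1.0, 2.0, 3.0]
for streak in increasing_streaks(user_ratings):
    print(f"first={streak[0]} last={streak[-1]} length={len(streak)}")
```

In MATCH_RECOGNIZE terms, `PARTITION BY` corresponds to running this per user, `ORDER BY` to the time ordering of the input list, and `MEASURES` to the values printed for each streak.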

In [None]:
# MATCH_RECOGNIZE - detect a pattern of increasing ratings
# Find users who gave 3+ consecutive, strictly increasing ratings.
# Flink does not allow a greedy quantifier on the last pattern variable,
# so the streak A{3,} is closed by a row B that breaks the increase.
t_env.execute_sql("""
    SELECT *
    FROM ratings_with_ts
    MATCH_RECOGNIZE (
        PARTITION BY user_id
        ORDER BY ts
        MEASURES
            FIRST(A.rating) AS first_rating,
            LAST(A.rating) AS last_rating,
            COUNT(A.rating) AS streak_length,
            FIRST(A.ts) AS streak_start,
            LAST(A.ts) AS streak_end
        ONE ROW PER MATCH
        AFTER MATCH SKIP PAST LAST ROW
        PATTERN (A{3,} B)
        DEFINE
            A AS LAST(A.rating, 1) IS NULL OR A.rating > LAST(A.rating, 1),
            B AS B.rating <= LAST(A.rating)
    )
    WHERE user_id <= 100
    ORDER BY streak_length DESC
    LIMIT 15
""").print()

# MATCH_RECOGNIZE - detect a sudden drop in ratings (V-shape)
# Pattern: high rating -> low rating -> high rating
t_env.execute_sql("""
    SELECT *
    FROM ratings_with_ts
    MATCH_RECOGNIZE (
        PARTITION BY user_id
        ORDER BY ts
        MEASURES
            A.rating AS high_before,
            B.rating AS low_point,
            C.rating AS high_after,
            B.movie_id AS disliked_movie
        ONE ROW PER MATCH
        AFTER MATCH SKIP PAST LAST ROW
        PATTERN (A B C)
        DEFINE
            A AS A.rating >= 4.0,
            B AS B.rating <= 2.0,
            C AS C.rating >= 4.0
    )
    WHERE user_id <= 50
    LIMIT 15
""").print()

## 10. Catalogs - Flink + Hive Metastore

Flink can use the **Hive Metastore** as a catalog, so that:
- Tables are shared between Flink, Spark and Trino
- Metadata survives a Flink restart (persistent catalog)
- Hive/Parquet tables on HDFS can be queried

```sql
-- Register the Hive catalog
CREATE CATALOG hive_catalog WITH (
    'type' = 'hive',
    'hive-conf-dir' = '/opt/hive/conf'
);

USE CATALOG hive_catalog;
SHOW DATABASES;
SHOW TABLES;

-- Hive tables can now be queried
SELECT * FROM movielens.ratings LIMIT 10;
```

### Object hierarchy:
```
Catalog (e.g. hive_catalog, default_catalog)
  └── Database (e.g. movielens, default)
       └── Table (e.g. ratings, movies)
```

In [None]:
# Default catalog (in-memory)
t_env.execute_sql("SHOW CATALOGS").print()
t_env.execute_sql("SHOW DATABASES").print()

# Hive catalog - metadata shared with other engines
# Requires hive-conf-dir containing a hive-site.xml file
try:
    t_env.execute_sql("""
        CREATE CATALOG hive_catalog WITH (
            'type' = 'hive',
            'hive-conf-dir' = '/opt/hive/conf'
        )
    """)
    t_env.execute_sql("USE CATALOG hive_catalog")
    t_env.execute_sql("SHOW DATABASES").print()
    t_env.execute_sql("USE CATALOG default_catalog")
except Exception as e:
    print(f"Hive catalog unavailable (requires a Hive Metastore): {e}")
    print("In production a Hive catalog lets Flink share tables with Spark and Trino.")

## 11. Performance comparison: Flink SQL vs Spark SQL on MovieLens

We run the same queries in Flink SQL and compare them with the Spark SQL results (notebook 05).

**Note:** Flink in batch mode on bounded data is not its typical use case - Flink shines
in streaming. This benchmark is for educational purposes only.

In [None]:
import time

t_env.get_config().set("execution.runtime-mode", "batch")

queries = {
    "simple_count": """
        SELECT COUNT(*) AS total FROM ratings
    """,
    "group_by": """
        SELECT movie_id, COUNT(*) AS cnt, ROUND(AVG(rating), 2) AS avg_r
        FROM ratings
        GROUP BY movie_id
        ORDER BY cnt DESC
        LIMIT 10
    """,
    "join_agg": """
        SELECT m.title, COUNT(*) AS cnt, ROUND(AVG(r.rating), 2) AS avg_r
        FROM ratings r
        JOIN movies m ON r.movie_id = m.movie_id
        GROUP BY m.title
        HAVING COUNT(*) > 5000
        ORDER BY avg_r DESC
        LIMIT 10
    """,
    "window_function": """
        SELECT user_id, movie_id, rating, rn FROM (
            SELECT user_id, movie_id, rating,
                   ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY rating DESC) AS rn
            FROM ratings WHERE user_id <= 100
        ) WHERE rn <= 3
        ORDER BY user_id, rn
        LIMIT 20
    """
}

print(f"{'Query':<20} {'Flink Time':>12}")
print("-" * 35)

for name, sql in queries.items():
    start = time.time()
    result = t_env.execute_sql(sql)
    # Consume all results
    with result.collect() as results:
        rows = list(results)
    elapsed = time.time() - start
    print(f"{name:<20} {elapsed:>10.2f}s ({len(rows)} rows)")

print()
print("Compare with the Spark SQL results from notebook 05.")
print("Flink batch is usually comparable to Spark on small data sets.")
print("Flink's advantage shows up in streaming (ms latency vs Spark micro-batch).")

## Final exercise

Write a complete MovieLens analysis in Flink SQL that combines the techniques covered above:

### Part 1: SQL analysis (batch)
1. Find the **top 10 best-rated movies** (min. 1000 ratings) - use a JOIN with the `movies` table
2. For each movie show: title, genre, number of ratings, average rating
3. Add a **ranking** of the movies using `ROW_NUMBER()` OVER

### Part 2: User segmentation
1. Split users into segments: power (>1000 ratings), active (100-1000), casual (<100)
2. For each segment compute: number of users, average rating, average number of ratings

### Part 3: MATCH_RECOGNIZE (CEP)
1. For users 1-200 find the pattern: **3 consecutive 5.0 ratings**
2. Show user_id, the movie_id of the first and last movie in the streak, and the streak length

### Part 4 (optional): Comparing sources
1. Compare COUNT(*) and AVG(rating) from the `ratings` (JDBC) and `ratings_hdfs` (filesystem) tables
2. Are the results identical? Why might they differ?

In [None]:
# Your solution - Part 1: Top 10 movies with ranking
t_env.execute_sql("""

""").print()

In [None]:
# Your solution - Part 2: User segmentation
t_env.execute_sql("""

""").print()

In [None]:
# Your solution - Part 3: MATCH_RECOGNIZE
t_env.execute_sql("""

""").print()

In [None]:
# Your solution - Part 4: Comparing sources
t_env.execute_sql("""

""").print()