# 05 - Spark SQL

Learning Spark SQL: writing SQL queries against Spark data.

**Topics:**
- Creating temporary views (createOrReplaceTempView)
- SQL queries: SELECT, WHERE, GROUP BY, HAVING
- Subqueries and CTEs (WITH)
- Window functions
- UDFs in SQL
- Combining the DataFrame API with SQL

## 1. Setup

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("05_Spark_SQL") \
    .master("spark://spark-master:7077") \
    .config("spark.jars.packages", "org.postgresql:postgresql:42.7.1") \
    .config("spark.driver.memory", "6g") \
    .config("spark.executor.memory", "7g") \
    .config("spark.driver.host", "recommender-jupyter") \
    .config("spark.driver.bindAddress", "0.0.0.0") \
    .getOrCreate()

jdbc_url = "jdbc:postgresql://postgres:5432/recommender"
properties = {
    "user": "recommender",
    "password": "recommender",
    "driver": "org.postgresql.Driver"
}

ratings = spark.read.jdbc(
    jdbc_url, "movielens.ratings", properties=properties,
    column="user_id", lowerBound=1, upperBound=300000, numPartitions=10
)
movies = spark.read.jdbc(jdbc_url, "movielens.movies", properties=properties)

## 2. Creating temporary views

To query a DataFrame with SQL in Spark, first register it as a temporary view.

In [None]:
# createOrReplaceTempView - a view scoped to the current session
ratings.createOrReplaceTempView("ratings")
movies.createOrReplaceTempView("movies")

# Now we can write SQL!
spark.sql("SELECT * FROM ratings LIMIT 5").show()

In [None]:
# createOrReplaceGlobalTempView - a view shared by all sessions of the same Spark application,
# registered in the reserved global_temp database
# movies.createOrReplaceGlobalTempView("movies_global")
# spark.sql("SELECT * FROM global_temp.movies_global LIMIT 5").show()

# List the available tables and views
spark.sql("SHOW TABLES").show()

## 3. Basic SQL queries

In [None]:
# SELECT with WHERE
spark.sql("""
    SELECT user_id, movie_id, rating
    FROM ratings
    WHERE rating >= 4.5 AND user_id <= 100
    ORDER BY rating DESC
    LIMIT 10
""").show()

In [None]:
# GROUP BY with HAVING
spark.sql("""
    SELECT user_id, 
           COUNT(*) as rating_count, 
           ROUND(AVG(rating), 2) as avg_rating,
           MIN(rating) as min_rating,
           MAX(rating) as max_rating
    FROM ratings
    GROUP BY user_id
    HAVING COUNT(*) > 1000
    ORDER BY rating_count DESC
    LIMIT 20
""").show()

In [None]:
# JOIN in SQL
spark.sql("""
    SELECT m.title, 
           COUNT(*) as num_ratings, 
           ROUND(AVG(r.rating), 2) as avg_rating
    FROM ratings r
    JOIN movies m ON r.movie_id = m.movie_id
    GROUP BY m.title
    HAVING COUNT(*) > 5000
    ORDER BY avg_rating DESC
    LIMIT 20
""").show(truncate=False)

### Exercise 1
Write a SQL query that shows the 10 worst-rated movies (with at least 100 ratings).

In [None]:
# Your solution:
spark.sql("""

""").show(truncate=False)

## 4. Subqueries

In [None]:
# Scalar subquery in HAVING - users who rated more movies than the average user
spark.sql("""
    SELECT user_id, COUNT(*) as cnt
    FROM ratings
    GROUP BY user_id
    HAVING cnt > (
        SELECT AVG(user_cnt)
        FROM (
            SELECT COUNT(*) as user_cnt
            FROM ratings
            GROUP BY user_id
        )
    )
    ORDER BY cnt DESC
    LIMIT 10
""").show()

In [None]:
# Subquery in FROM
spark.sql("""
    SELECT rating_bucket, COUNT(*) as num_users
    FROM (
        SELECT user_id, 
               CASE 
                   WHEN AVG(rating) >= 4.0 THEN 'generous'
                   WHEN AVG(rating) >= 3.0 THEN 'moderate'
                   ELSE 'harsh'
               END as rating_bucket
        FROM ratings
        GROUP BY user_id
    )
    GROUP BY rating_bucket
    ORDER BY num_users DESC
""").show()

In [None]:
# EXISTS / NOT EXISTS
spark.sql("""
    SELECT m.movie_id, m.title
    FROM movies m
    WHERE NOT EXISTS (
        SELECT 1 FROM ratings r WHERE r.movie_id = m.movie_id
    )
    LIMIT 10
""").show(truncate=False)

## 5. CTE (Common Table Expressions) - WITH

CTEs let you give subqueries a name, which makes multi-step queries easier to read.
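
A CTE is only a naming device: the same query written with a nested subquery returns the same rows. The sketch below checks this on a tiny hand-made table. It uses Python's built-in `sqlite3` module as a local stand-in for Spark SQL (an assumption made so the example runs without a cluster); the table contents and the two-rating threshold are invented for illustration.

```python
import sqlite3

# SQLite stands in for Spark SQL here; the WITH syntax used below is the same.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (user_id INTEGER, rating REAL)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?)",
    [(1, 4.0), (1, 5.0), (1, 3.0), (2, 2.0), (3, 4.0), (3, 4.0)],
)

# Version 1: the per-user categorisation is an anonymous subquery in FROM.
nested = """
SELECT user_category, COUNT(*) AS num_users
FROM (
    SELECT user_id,
           CASE WHEN COUNT(*) >= 2 THEN 'active' ELSE 'rare' END AS user_category
    FROM ratings
    GROUP BY user_id
) AS s
GROUP BY user_category
ORDER BY user_category
"""

# Version 2: the same step, named with WITH and then queried like a table.
with_cte = """
WITH user_categories AS (
    SELECT user_id,
           CASE WHEN COUNT(*) >= 2 THEN 'active' ELSE 'rare' END AS user_category
    FROM ratings
    GROUP BY user_id
)
SELECT user_category, COUNT(*) AS num_users
FROM user_categories
GROUP BY user_category
ORDER BY user_category
"""

nested_rows = conn.execute(nested).fetchall()
cte_rows = conn.execute(with_cte).fetchall()
print(nested_rows == cte_rows, cte_rows)
```

The benefit of the CTE form grows with the number of steps, because each step can be read top to bottom instead of inside out.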

In [None]:
# CTE - user profile with an activity category
spark.sql("""
    WITH user_stats AS (
        SELECT user_id,
               COUNT(*) as num_ratings,
               ROUND(AVG(rating), 2) as avg_rating,
               ROUND(STDDEV(rating), 2) as std_rating
        FROM ratings
        GROUP BY user_id
    ),
    user_categories AS (
        SELECT *,
               CASE
                   WHEN num_ratings >= 1000 THEN 'power_user'
                   WHEN num_ratings >= 100 THEN 'active'
                   WHEN num_ratings >= 20 THEN 'casual'
                   ELSE 'rare'
               END as user_category
        FROM user_stats
    )
    SELECT user_category,
           COUNT(*) as num_users,
           ROUND(AVG(avg_rating), 2) as mean_avg_rating,
           ROUND(AVG(num_ratings), 0) as mean_num_ratings
    FROM user_categories
    GROUP BY user_category
    ORDER BY mean_num_ratings DESC
""").show()

In [None]:
# Multi-step CTE - top movies per genre
spark.sql("""
    WITH movie_ratings AS (
        SELECT m.movie_id, m.title, m.genres,
               COUNT(*) as num_ratings,
               ROUND(AVG(r.rating), 2) as avg_rating
        FROM movies m
        JOIN ratings r ON m.movie_id = r.movie_id
        GROUP BY m.movie_id, m.title, m.genres
        HAVING COUNT(*) >= 500
    ),
    comedy_top AS (
        SELECT title, avg_rating, num_ratings, 'Comedy' as genre
        FROM movie_ratings
        WHERE genres LIKE '%Comedy%'
        ORDER BY avg_rating DESC
        LIMIT 5
    ),
    drama_top AS (
        SELECT title, avg_rating, num_ratings, 'Drama' as genre
        FROM movie_ratings
        WHERE genres LIKE '%Drama%'
        ORDER BY avg_rating DESC
        LIMIT 5
    )
    SELECT * FROM comedy_top
    UNION ALL
    SELECT * FROM drama_top
    ORDER BY genre, avg_rating DESC
""").show(truncate=False)

### Exercise 2
Write a query with CTEs that:
1. Finds users who rated more than 500 movies
2. For those users, counts how many movies they rated 5.0
3. Shows the top 10 users with the highest percentage of 5.0 ratings

In [None]:
# Your solution:
spark.sql("""

""").show()

## 6. Window functions

Window functions compute a value over a "window" of related rows without reducing the number of rows (unlike GROUP BY).
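
The contrast with GROUP BY can be seen on five hand-made rows. This is a minimal sketch that uses Python's built-in `sqlite3` module (SQLite 3.25 or newer, which supports `OVER`) as a local stand-in for Spark SQL; the data is invented for illustration.

```python
import sqlite3

# SQLite stands in for Spark SQL here so the example runs without a cluster.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (user_id INTEGER, movie_id INTEGER, rating REAL)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [(1, 10, 4.0), (1, 11, 2.0), (2, 10, 5.0), (2, 12, 3.0), (2, 13, 4.0)],
)

# GROUP BY collapses each user to a single output row.
grouped = conn.execute(
    "SELECT user_id, AVG(rating) FROM ratings GROUP BY user_id ORDER BY user_id"
).fetchall()

# The window version keeps all five rows and attaches the per-user average to each.
windowed = conn.execute(
    """
    SELECT user_id, movie_id, rating,
           AVG(rating) OVER (PARTITION BY user_id) AS user_avg
    FROM ratings
    ORDER BY user_id, movie_id
    """
).fetchall()

print(len(grouped), len(windowed))
```

Two users produce two grouped rows, while the windowed query still returns one row per rating.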

- `ROW_NUMBER()` - row number within the window
- `RANK()` / `DENSE_RANK()` - ranking
- `LAG()` / `LEAD()` - previous/next row
- `SUM() OVER` / `AVG() OVER` - cumulative/moving aggregates
- `NTILE(n)` - split into n groups
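
The three numbering functions differ only in how they treat ties. The plain-Python sketch below mimics their behaviour on a list that is already sorted in window order; `rank_values` is a hypothetical helper written for this illustration, not part of PySpark.

```python
def rank_values(sorted_values):
    """Return (row_number, rank, dense_rank) for values already sorted in window order."""
    out = []
    rank = dense = 0
    prev = object()  # sentinel that compares unequal to every real value
    for position, value in enumerate(sorted_values, start=1):
        if value != prev:
            rank = position  # RANK jumps to the current row position after a tie
            dense += 1       # DENSE_RANK always advances by exactly one
            prev = value
        out.append((position, rank, dense))
    return out


ratings_desc = [5.0, 4.5, 4.5, 4.0]
for value, (rn, r, dr) in zip(ratings_desc, rank_values(ratings_desc)):
    print(value, rn, r, dr)
```

For the two tied 4.5 ratings, the row number still distinguishes them (2 and 3), both ranks are 2, and the following row gets rank 4 but dense rank 3.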

In [None]:
# ROW_NUMBER - numbers the rows within a window
# E.g. the 3 highest-rated movies per user (ties in rating are ordered arbitrarily)
spark.sql("""
    SELECT user_id, movie_id, rating, rn
    FROM (
        SELECT user_id, movie_id, rating,
               ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY rating DESC) as rn
        FROM ratings
    )
    WHERE rn <= 3 AND user_id <= 5
    ORDER BY user_id, rn
""").show(20)

In [None]:
# RANK vs DENSE_RANK
# RANK: 1, 2, 2, 4 (leaves gaps after ties)
# DENSE_RANK: 1, 2, 2, 3 (no gaps)
spark.sql("""
    SELECT user_id, movie_id, rating,
           RANK() OVER (PARTITION BY user_id ORDER BY rating DESC) as rank,
           DENSE_RANK() OVER (PARTITION BY user_id ORDER BY rating DESC) as dense_rank
    FROM ratings
    WHERE user_id = 1
    ORDER BY rating DESC
    LIMIT 15
""").show()

In [None]:
# LAG / LEAD - previous/next row
# Show how a user's ratings changed over time
spark.sql("""
    SELECT user_id, movie_id, rating, rating_timestamp,
           LAG(rating) OVER (PARTITION BY user_id ORDER BY rating_timestamp) as prev_rating,
           rating - LAG(rating) OVER (PARTITION BY user_id ORDER BY rating_timestamp) as rating_diff
    FROM ratings
    WHERE user_id = 42
    ORDER BY rating_timestamp
    LIMIT 20
""").show()

In [None]:
# Cumulative count and moving average
spark.sql("""
    SELECT user_id, movie_id, rating, rating_timestamp,
           COUNT(*) OVER (PARTITION BY user_id ORDER BY rating_timestamp) as cumulative_count,
           ROUND(AVG(rating) OVER (
               PARTITION BY user_id 
               ORDER BY rating_timestamp 
               ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
           ), 2) as moving_avg_5
    FROM ratings
    WHERE user_id = 42
    ORDER BY rating_timestamp
    LIMIT 20
""").show()

In [None]:
# NTILE - split into quartiles
spark.sql("""
    WITH user_avg AS (
        SELECT user_id, AVG(rating) as avg_rating, COUNT(*) as cnt
        FROM ratings
        GROUP BY user_id
        HAVING COUNT(*) >= 50
    )
    SELECT 
        quartile,
        COUNT(*) as num_users,
        ROUND(MIN(avg_rating), 2) as min_avg,
        ROUND(MAX(avg_rating), 2) as max_avg
    FROM (
        SELECT *, NTILE(4) OVER (ORDER BY avg_rating) as quartile
        FROM user_avg
    )
    GROUP BY quartile
    ORDER BY quartile
""").show()

### Exercise 3
Use window functions to find, for each user (id <= 50), their:
- first rating (chronologically)
- last rating
- the difference between them

Show the users who became harsher over time (last rating lower than the first).

In [None]:
# Your solution:


## 7. Window functions with the DataFrame API

The same window operations can be expressed with the DataFrame API.

In [None]:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, rank, dense_rank, lag, lead, col, avg, count, desc

# Define the window
user_window = Window.partitionBy("user_id").orderBy(desc("rating"))

# Top 3 movies per user
top_movies = ratings \
    .withColumn("rn", row_number().over(user_window)) \
    .filter(col("rn") <= 3) \
    .filter(col("user_id") <= 5)

top_movies.orderBy("user_id", "rn").show(20)

In [None]:
# Moving average
time_window = Window.partitionBy("user_id") \
    .orderBy("rating_timestamp") \
    .rowsBetween(-4, 0)

ratings.filter(col("user_id") == 42) \
    .withColumn("moving_avg", avg("rating").over(time_window)) \
    .select("user_id", "movie_id", "rating", "rating_timestamp", "moving_avg") \
    .orderBy("rating_timestamp") \
    .show(15)
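
The frame `rowsBetween(-4, 0)` (in SQL: `ROWS BETWEEN 4 PRECEDING AND CURRENT ROW`) covers the current row and up to four rows before it, so the first rows of each partition are averaged over fewer than five values. A plain-Python sketch of that frame logic, with a hypothetical `moving_avg` helper and invented ratings:

```python
def moving_avg(values, preceding=4):
    """Average over the current element and up to `preceding` earlier ones."""
    out = []
    for i in range(len(values)):
        # The frame is clipped at the start of the partition.
        frame = values[max(0, i - preceding): i + 1]
        out.append(round(sum(frame) / len(frame), 2))
    return out


ratings_in_time_order = [4.0, 3.0, 5.0, 4.0, 2.0, 5.0]
print(moving_avg(ratings_in_time_order))
# [4.0, 3.5, 4.0, 4.0, 3.6, 3.8]
```

Only from the fifth row onward is the average taken over a full five-row frame; the sixth value drops the first rating from its frame.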

## 8. UDFs (User-Defined Functions) in SQL
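
A Python UDF is an ordinary function until it is registered, so its edge cases can be checked without Spark. The sketch below repeats the two functions used in this section and calls them directly on a few inputs, including `None`, which is what a UDF receives for a SQL NULL. The sample titles are chosen for illustration.

```python
import re


def extract_year(title):
    """Extract the four-digit year in parentheses, or None if there is none."""
    match = re.search(r'\((\d{4})\)', title or '')
    return match.group(1) if match else None


def genres_to_list(genres):
    """Split a pipe-separated genre string into a list."""
    return (genres or '').split('|')


print(extract_year('Toy Story (1995)'))              # 1995
print(extract_year('2001: A Space Odyssey (1968)'))  # 1968, digits outside parentheses are ignored
print(extract_year(None))                            # None
print(genres_to_list('Comedy|Drama'))                # ['Comedy', 'Drama']
print(genres_to_list(None))                          # [''], a one-element list, not an empty one
```

The last case is worth knowing before exploding the genre list: a missing `genres` value yields a single empty-string genre rather than no genres.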

In [None]:
from pyspark.sql.types import StringType, ArrayType

# Define the Python UDFs
def extract_year(title):
    """Extracts the year from a movie title, e.g. 'Toy Story (1995)' -> '1995'"""
    import re
    match = re.search(r'\((\d{4})\)', title or '')
    return match.group(1) if match else None

def genres_to_list(genres):
    """Converts 'Comedy|Drama' to ['Comedy', 'Drama']"""
    return (genres or '').split('|')

# Register the UDFs with Spark SQL
spark.udf.register("extract_year", extract_year, StringType())
spark.udf.register("genres_to_list", genres_to_list, ArrayType(StringType()))

# Use them in SQL
spark.sql("""
    SELECT title, 
           extract_year(title) as year,
           genres,
           genres_to_list(genres) as genre_list
    FROM movies
    LIMIT 10
""").show(truncate=False)

In [None]:
# UDF in an analysis - distribution of movies per decade
spark.sql("""
    SELECT CONCAT(FLOOR(CAST(extract_year(title) AS INT) / 10) * 10, 's') as decade,
           COUNT(*) as num_movies
    FROM movies
    WHERE extract_year(title) IS NOT NULL
    GROUP BY decade
    ORDER BY decade
""").show()

## 9. Mixing SQL with the DataFrame API

The result of `spark.sql()` is an ordinary DataFrame, so it can be processed further with the DataFrame API.

In [None]:
# SQL -> DataFrame API
popular_movies = spark.sql("""
    SELECT movie_id, COUNT(*) as num_ratings, AVG(rating) as avg_rating
    FROM ratings
    GROUP BY movie_id
    HAVING COUNT(*) > 1000
""")

# Continue with the DataFrame API
result = popular_movies \
    .join(movies, "movie_id") \
    .filter(col("avg_rating") >= 4.0) \
    .orderBy(desc("num_ratings")) \
    .select("title", "num_ratings", "avg_rating")

result.show(10, truncate=False)

In [None]:
# DataFrame API -> SQL
# A DataFrame built with the DataFrame API can be registered as a view for SQL
result.createOrReplaceTempView("popular_good_movies")

spark.sql("""
    SELECT title, num_ratings, ROUND(avg_rating, 2) as avg_rating
    FROM popular_good_movies
    WHERE num_ratings > 10000
    ORDER BY avg_rating DESC
""").show(truncate=False)

## Final exercise

Write a SQL query (you may use CTEs and window functions) that computes the following for each movie genre (split `genres` into individual genres using a UDF):
1. Number of movies
2. Average rating
3. The movie with the highest average rating (min. 100 ratings) - use ROW_NUMBER
4. Ranking of genres by popularity (number of ratings)

Sort the result by number of movies, descending.

In [None]:
# Your solution:


In [None]:
spark.stop()