# 11 - Advanced MLlib

Advanced machine-learning techniques in Spark MLlib.

**Topics:**
- Pipeline API - chaining transformers and estimators
- Content-based filtering with TF-IDF on genres/tags
- User clustering - K-Means on ALS latent vectors
- FPGrowth - frequent itemsets ("people who like X also like Y")
- Hybrid model - combining ALS and content-based scores
- Feature engineering with Spark

## 1. Setup

In [None]:
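from pyspark.sql import SparkSession
# NOTE: the star import shadows Python builtins such as min, max, sum and round
# with their Spark column counterparts; use the builtins module for plain Python values.
from pyspark.sql.functions import *
from pyspark.sql.types import *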

spark = SparkSession.builder \
    .appName("11_Advanced_MLlib") \
    .master("spark://spark-master:7077") \
    .config("spark.jars.packages", "org.postgresql:postgresql:42.7.1") \
    .config("spark.driver.memory", "6g") \
    .config("spark.executor.memory", "7g") \
    .config("spark.driver.host", "recommender-jupyter") \
    .config("spark.driver.bindAddress", "0.0.0.0") \
    .getOrCreate()

jdbc_url = "jdbc:postgresql://postgres:5432/recommender"
properties = {
    "user": "recommender",
    "password": "recommender",
    "driver": "org.postgresql.Driver"
}

ratings = spark.read.jdbc(
    jdbc_url, "movielens.ratings", properties=properties,
    column="user_id", lowerBound=1, upperBound=300000, numPartitions=10
)
movies = spark.read.jdbc(jdbc_url, "movielens.movies", properties=properties)

ratings.cache()
movies.cache()
print(f"Ratings: {ratings.count()}, Movies: {movies.count()}")

## 2. Pipeline API

A Pipeline combines several processing steps into a single object:
- **Transformer** - transforms a DataFrame (e.g. VectorAssembler, or a fitted model such as ALSModel)
- **Estimator** - learns from data and produces a Transformer (e.g. StringIndexer, ALS, KMeans)
- **Pipeline** - an ordered sequence of Transformers and Estimators

```
Pipeline: [StringIndexer → VectorAssembler → ALS]
         fit(training)  →  PipelineModel
         PipelineModel.transform(test)  →  predictions
         Evaluator.evaluate(predictions)  →  metric
```
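
To make the fit/transform contract concrete, here is a minimal pure-Python sketch that runs without Spark. The classes `MeanEstimator`, `MeanModel`, `Doubler`, `ToyPipeline` and `ToyPipelineModel` are hypothetical stand-ins invented for illustration, not MLlib classes; they only mimic how an Estimator's `fit` returns a Transformer and how a pipeline fits its stages in order.

```python
import math

# Toy illustration of the Estimator/Transformer contract (not MLlib code).

class MeanModel:
    """Transformer: subtracts a mean that was learned earlier."""
    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [x - self.mean for x in rows]


class MeanEstimator:
    """Estimator: learns the mean from data and returns a Transformer."""
    def fit(self, rows):
        return MeanModel(math.fsum(rows) / len(rows))


class Doubler:
    """Plain Transformer: needs no fitting."""
    def transform(self, rows):
        return [2 * x for x in rows]


class ToyPipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted on the data produced so far; Transformers are used as-is.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return ToyPipelineModel(fitted)


model = ToyPipeline([Doubler(), MeanEstimator()]).fit([1.0, 2.0, 3.0])
print(model.transform([4.0]))  # reuses the mean learned during fit
```

The point of the design is that whatever was learned during `fit` travels with the fitted model, so the test data is transformed with training-time statistics only.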

In [None]:
from pyspark.ml import Pipeline
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Simple ALS pipeline
als = ALS(
    userCol="user_id",
    itemCol="movie_id",
    ratingCol="rating",
    coldStartStrategy="drop",
    seed=42
)

pipeline = Pipeline(stages=[als])

# Parameter grid for the pipeline
param_grid = ParamGridBuilder() \
    .addGrid(als.rank, [10, 20]) \
    .addGrid(als.regParam, [0.1, 0.3]) \
    .addGrid(als.maxIter, [10]) \
    .build()

evaluator = RegressionEvaluator(
    metricName="rmse", labelCol="rating", predictionCol="prediction"
)

# CrossValidator wrapped around the pipeline
cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=param_grid,
    evaluator=evaluator,
    numFolds=3,
    seed=42
)

# Train on a 5% sample
sample = ratings.sample(0.05, seed=42).cache()
(train, test) = sample.randomSplit([0.8, 0.2], seed=42)

print(f"Training: {train.count()}, Test: {test.count()}")
cv_model = cv.fit(train)

# Evaluate
predictions = cv_model.transform(test)
rmse = evaluator.evaluate(predictions)
print(f"\nBest Pipeline RMSE: {rmse:.4f}")
print(f"Best rank: {cv_model.bestModel.stages[0].rank}")

In [None]:
# Save the whole pipeline model
cv_model.bestModel.write().overwrite().save("/tmp/als_pipeline")

# Load it back
from pyspark.ml import PipelineModel
loaded_pipeline = PipelineModel.load("/tmp/als_pipeline")
loaded_predictions = loaded_pipeline.transform(test)
print(f"Loaded pipeline RMSE: {evaluator.evaluate(loaded_predictions):.4f}")

## 3. Content-Based Filtering with TF-IDF

Instead of collaborative filtering (ALS), recommendations are based on **movie features** (genres, tags).

1. Convert genres into a TF-IDF vector
2. Compute cosine similarity between movies
3. Recommend movies similar to the ones the user likes
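
Before running this on Spark, the arithmetic behind the first two steps can be checked by hand. The sketch below is plain Python on a made-up three-movie corpus (the movie names `A`, `B`, `C` and the helpers `tfidf` and `cosine` are illustrative assumptions). It uses raw counts as term frequency and the smoothed IDF `log((N + 1) / (df + 1))`, which is the form Spark's `IDF` documents; it does not reproduce feature hashing.

```python
import math

# Toy corpus: each "document" is a movie's genre list (made-up data).
docs = {
    "A": ["Comedy", "Romance"],
    "B": ["Comedy", "Animation"],
    "C": ["Horror"],
}
vocab = sorted({g for genres in docs.values() for g in genres})
n_docs = len(docs)

# Document frequency: in how many movies each genre appears.
df = {g: 0 for g in vocab}
for genres in docs.values():
    for g in set(genres):
        df[g] += 1

# Smoothed IDF: log((N + 1) / (df + 1)); rarer genres get a larger weight.
idf = {g: math.log((n_docs + 1) / (df[g] + 1)) for g in vocab}

def tfidf(genres):
    # Term frequency is the raw count of the genre in the list.
    return [genres.count(g) * idf[g] for g in vocab]

def cosine(u, v):
    dot = math.fsum(a * b for a, b in zip(u, v))
    denom = math.sqrt(math.fsum(a * a for a in u)) * math.sqrt(math.fsum(b * b for b in v))
    return dot / denom if denom else 0.0

vectors = {name: tfidf(genres) for name, genres in docs.items()}
sim_ab = cosine(vectors["A"], vectors["B"])
sim_ac = cosine(vectors["A"], vectors["C"])
print(f"sim(A, B)={sim_ab:.3f}, sim(A, C)={sim_ac:.3f}")
```

Movies A and B share only the common genre Comedy, so their similarity is positive but small, while A and C share nothing and score zero.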

In [None]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, StringIndexer, VectorAssembler
from pyspark.ml.linalg import Vectors

# Build one "document" per movie = its list of genres
movies_docs = movies \
    .withColumn("genres_list", split(col("genres"), "\\|")) \
    .withColumn("year", regexp_extract(col("title"), r"\((\d{4})\)", 1).cast("int"))

movies_docs.select("movie_id", "title", "genres_list").show(5, truncate=False)

In [None]:
# TF-IDF pipeline on genres
# HashingTF: turns a list of terms into a sparse term-frequency vector
# IDF: inverse-document-frequency weighting (rare genres get a higher weight)
# Note: with numFeatures=30, distinct genres can hash to the same bucket

hashing_tf = HashingTF(inputCol="genres_list", outputCol="raw_features", numFeatures=30)
idf = IDF(inputCol="raw_features", outputCol="tfidf_features")

tfidf_pipeline = Pipeline(stages=[hashing_tf, idf])
tfidf_model = tfidf_pipeline.fit(movies_docs)

movies_tfidf = tfidf_model.transform(movies_docs)
movies_tfidf.select("movie_id", "title", "genres", "tfidf_features").show(5, truncate=False)

In [None]:
import numpy as np
from pyspark.sql.types import DoubleType

# Find movies whose content is similar to Toy Story
target_movie = 1  # Toy Story

target_features = movies_tfidf.filter(col("movie_id") == target_movie) \
    .select("tfidf_features").collect()[0][0]
target_np = target_features.toArray()

@udf(DoubleType())
def cosine_sim(features):
    v = features.toArray()
    denom = np.linalg.norm(target_np) * np.linalg.norm(v)
    if denom == 0:
        return 0.0
    return float(np.dot(target_np, v) / denom)

content_similar = movies_tfidf \
    .withColumn("similarity", cosine_sim(col("tfidf_features"))) \
    .filter(col("movie_id") != target_movie) \
    .orderBy(desc("similarity"))

print("Content-based: movies similar to Toy Story (by genre):")
content_similar.select("title", "genres", round(col("similarity"), 4).alias("sim")) \
    .show(15, truncate=False)

### Exercise 1
Build a content-based recommender that, for a given user:
1. Finds their top 5 movies (highest rated)
2. For each of them, finds the 5 most similar movies by content
3. Filters out movies the user has already seen
4. Returns the top 10 recommendations

In [None]:
# Your solution:


## 4. User Clustering with K-Means

We use the ALS latent vectors as features and group users into clusters.

Clusters = segments of users with similar taste.
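
How well a clustering separates such segments is commonly measured with the silhouette score. The following is a small plain-Python sketch of that metric on made-up 2-D points, using squared Euclidean distance (the default distance measure of `ClusteringEvaluator`). The helper names `sq_dist` and `mean_silhouette` are assumptions for illustration, and this brute-force version is not how Spark computes the score at scale.

```python
import math

def sq_dist(p, q):
    # Squared Euclidean distance between two points.
    return math.fsum((a - b) ** 2 for a, b in zip(p, q))

def mean_silhouette(points, labels):
    scores = []
    for i, p in enumerate(points):
        # Distances from p to the members of each cluster (excluding p itself).
        per_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                per_cluster.setdefault(labels[j], []).append(sq_dist(p, q))
        means = {c: math.fsum(d) / len(d) for c, d in per_cluster.items()}
        a = means.pop(labels[i], None)   # cohesion: mean distance within own cluster
        if a is None or not means:
            scores.append(0.0)           # singleton cluster, or only one cluster overall
            continue
        b = sorted(means.values())[0]    # separation: nearest other cluster
        denom = a if a > b else b
        scores.append((b - a) / denom if denom else 0.0)
    return math.fsum(scores) / len(scores)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = mean_silhouette(points, [0, 0, 1, 1])   # clusters follow the two groups
bad = mean_silhouette(points, [0, 1, 0, 1])    # clusters cut across the groups
print(f"well-separated: {good:.3f}, mixed-up: {bad:.3f}")
```

Values near 1 mean points sit much closer to their own cluster than to any other; values below 0 suggest points were assigned to the wrong cluster.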

In [None]:
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# First train ALS to obtain the user factors
als = ALS(
    maxIter=10, regParam=0.1, rank=20,
    userCol="user_id", itemCol="movie_id", ratingCol="rating",
    coldStartStrategy="drop", seed=42
)

als_model = als.fit(ratings)

# User factors - latent vectors of users
user_factors = als_model.userFactors \
    .withColumnRenamed("id", "user_id")

print(f"User factors: {user_factors.count()} users × rank=20")
user_factors.show(5, truncate=False)

In [None]:
# Elbow method / silhouette scan - find a good number of clusters
silhouette_scores = []
cost_scores = []
K_values = [3, 5, 8, 10, 15, 20]

evaluator_cluster = ClusteringEvaluator(
    predictionCol="cluster", featuresCol="features", metricName="silhouette"
)

for k in K_values:
    kmeans = KMeans(k=k, featuresCol="features", predictionCol="cluster", seed=42)
    model = kmeans.fit(user_factors)
    predictions = model.transform(user_factors)
    
    silhouette = evaluator_cluster.evaluate(predictions)
    cost = model.summary.trainingCost
    
    silhouette_scores.append(silhouette)
    cost_scores.append(cost)
    print(f"K={k:2d}: silhouette={silhouette:.4f}, cost={cost:.0f}")

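# Best K by silhouette
# (the star import replaced max() with Spark's column function, so use the builtin explicitly)
import builtins
best_k = K_values[silhouette_scores.index(builtins.max(silhouette_scores))]
print(f"\nBest K by silhouette: {best_k}")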

In [None]:
# Train the final model
kmeans_final = KMeans(k=best_k, featuresCol="features", predictionCol="cluster", seed=42)
kmeans_model = kmeans_final.fit(user_factors)
user_clusters = kmeans_model.transform(user_factors)

# Cluster sizes
print("Cluster sizes:")
user_clusters.groupBy("cluster") \
    .count() \
    .orderBy("cluster") \
    .show()

In [None]:
# Profile of each cluster - which genres do its users prefer?
# Join with ratings and movies
cluster_profiles = user_clusters.select("user_id", "cluster") \
    .join(ratings.select("user_id", "movie_id", "rating"), "user_id") \
    .join(movies, "movie_id")

# Explode the genres into one row per genre
cluster_genres = cluster_profiles \
    .withColumn("genre", explode(split(col("genres"), "\\|"))) \
    .groupBy("cluster", "genre") \
    .agg(
        count("*").alias("num_ratings"),
        round(avg("rating"), 2).alias("avg_rating")
    )

# Top 3 genres per cluster
from pyspark.sql.window import Window

w = Window.partitionBy("cluster").orderBy(desc("num_ratings"))

top_genres = cluster_genres \
    .withColumn("rank", row_number().over(w)) \
    .filter(col("rank") <= 3)

print("Top 3 genres per cluster:")
top_genres.orderBy("cluster", "rank").show(best_k * 3, truncate=False)

In [None]:
# Average rating per cluster
cluster_stats = cluster_profiles \
    .groupBy("cluster") \
    .agg(
        countDistinct("user_id").alias("num_users"),
        round(avg("rating"), 2).alias("avg_rating"),
        round(stddev("rating"), 2).alias("std_rating"),
        round(avg(when(col("rating") >= 4.0, 1).otherwise(0)), 2).alias("pct_positive")
    ) \
    .orderBy("cluster")

print("Cluster statistics:")
cluster_stats.show()

## 5. FPGrowth - Frequent Itemsets

"Users who like movie X also like movie Y."

FPGrowth finds **frequent patterns** (itemsets that co-occur often) in transactions.
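
The rules derived from frequent itemsets are usually judged by support, confidence and lift. As a hedged illustration, the plain-Python sketch below computes those three numbers by brute force on five made-up baskets; the helpers `support` and `rule_metrics` are hypothetical names, and the code shows only the definitions, not the FP-tree algorithm itself.

```python
# Toy baskets: each set holds the movies one user liked (made-up data).
baskets = [
    {"Toy Story", "Shrek", "Up"},
    {"Toy Story", "Shrek"},
    {"Toy Story", "Alien"},
    {"Shrek", "Up"},
    {"Alien"},
]

def support(itemset):
    # Fraction of baskets that contain every item of the itemset.
    hits = [b for b in baskets if itemset <= b]
    return len(hits) / len(baskets)

def rule_metrics(antecedent, consequent):
    both = support(antecedent | consequent)
    confidence = both / support(antecedent)   # P(consequent | antecedent)
    lift = confidence / support(consequent)   # > 1 means a positive association
    return both, confidence, lift

sup, conf, lift = rule_metrics({"Toy Story"}, {"Shrek"})
print(f"support={sup:.2f}, confidence={conf:.2f}, lift={lift:.2f}")
```

Here the pair appears in 2 of 5 baskets, and 2 of the 3 Toy Story fans also like Shrek, so the rule has confidence 2/3 and a lift slightly above 1.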

In [None]:
from pyspark.ml.fpm import FPGrowth

# Build "baskets" - the movies each user rated highly
# (treated like purchases: a user "bought" every movie they rated >= 4.0)

# Work on a subset of users - FPGrowth is expensive
user_baskets = ratings \
    .filter(col("rating") >= 4.0) \
    .filter(col("user_id") <= 10000) \
    .groupBy("user_id") \
    .agg(collect_set("movie_id").alias("items")) \
    .filter(size(col("items")) >= 5)  # at least 5 movies per basket

print(f"Baskets: {user_baskets.count()} users")
user_baskets.show(5, truncate=False)

In [None]:
# FPGrowth
fp = FPGrowth(
    itemsCol="items",
    minSupport=0.05,     # an itemset must appear in at least 5% of baskets
    minConfidence=0.3    # a rule must have at least 30% confidence
)

fp_model = fp.fit(user_baskets)

# Frequent itemsets - the most common sets of movies
print(f"Frequent itemsets: {fp_model.freqItemsets.count()}")
fp_model.freqItemsets \
    .orderBy(desc("freq")) \
    .show(20, truncate=False)

In [None]:
# Association rules - "if X then Y"
rules = fp_model.associationRules
print(f"Association rules: {rules.count()}")

# Map movie_id to titles for readability
rules_top = rules.orderBy(desc("confidence")).limit(30).collect()

# Keep movie titles in a driver-side dict
movie_names = {r.movie_id: r.title for r in movies.collect()}

print("\nTop association rules:")
print(f"{'Antecedent':<60} → {'Consequent':<40} {'conf':<7}lift")
print("-" * 130)
for rule in rules_top:
    ant = ", ".join([movie_names.get(mid, str(mid))[:25] for mid in rule.antecedent])
    con = ", ".join([movie_names.get(mid, str(mid))[:25] for mid in rule.consequent])
    print(f"{ant:<60} → {con:<40} {rule.confidence:.2f}   {rule.lift:.2f}")

In [None]:
# Use the rules for recommendations
# For a given user: which movies do we recommend based on their history?

user_id = 42
user_movies = ratings.filter(
    (col("user_id") == user_id) & (col("rating") >= 4.0)
).select("movie_id").rdd.flatMap(lambda x: x).collect()

user_movies_set = set(user_movies)
print(f"User {user_id} likes {len(user_movies_set)} movies")

# Keep rules whose antecedent is a subset of the user's liked movies
@udf("boolean")
def is_subset(antecedent):
    return all(mid in user_movies_set for mid in antecedent)

@udf("boolean")
def not_seen(consequent):
    return all(mid not in user_movies_set for mid in consequent)

recommendations = rules \
    .filter(is_subset(col("antecedent"))) \
    .filter(not_seen(col("consequent"))) \
    .orderBy(desc("confidence"))

print(f"\nFPGrowth recommendations for user {user_id}:")
recs = recommendations.limit(10).collect()
for r in recs:
    ant = ", ".join([movie_names.get(m, str(m))[:30] for m in r.antecedent])
    con = ", ".join([movie_names.get(m, str(m))[:30] for m in r.consequent])
    print(f"  Because you like [{ant}] → we recommend [{con}] (conf={r.confidence:.2f})")

## 6. Hybrid Model - ALS + Content-Based

We combine the ALS (collaborative) and TF-IDF (content-based) scores into a single ranking.

**Strategy: weighted hybrid**
- `final_score = α * als_score + (1 - α) * content_score`
- α is the ALS weight (e.g. 0.7)
- both scores are min-max normalized to [0, 1] before mixing
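
The weighted formula is easy to verify on a handful of numbers. Below is a minimal plain-Python sketch with made-up scores for three hypothetical movie ids; `min_max` and `hybrid_rank` are illustrative helper names, and the guard for a constant score list is an added assumption about how to handle that edge case.

```python
import builtins as py  # py.min / py.max stay the Python builtins even after a Spark star import

def min_max(scores):
    """Rescale a {movie_id: score} dict to [0, 1]; constant input maps to 0.0."""
    lo, hi = py.min(scores.values()), py.max(scores.values())
    span = hi - lo
    return {m: (s - lo) / span if span else 0.0 for m, s in scores.items()}

def hybrid_rank(als_scores, content_scores, alpha=0.7):
    als_norm = min_max(als_scores)
    content_norm = min_max(content_scores)
    # Only movies scored by both models are ranked (mirrors an inner join).
    common = als_norm.keys() & content_norm.keys()
    final = {m: alpha * als_norm[m] + (1 - alpha) * content_norm[m] for m in common}
    return sorted(final.items(), key=lambda kv: kv[1], reverse=True)

# Made-up scores for three hypothetical movie ids.
als_scores = {1: 4.8, 2: 4.2, 3: 3.6}
content_scores = {1: 0.10, 2: 0.90, 3: 0.50}
ranking = hybrid_rank(als_scores, content_scores)
print(ranking)
```

With α = 0.7 the ALS favorite stays on top even though its content score is the lowest; lowering α shifts the ranking toward content similarity.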

In [None]:
# ALS recommendations for user 42
user_42_df = spark.createDataFrame([(42,)], ["user_id"])
als_recs = als_model.recommendForUserSubset(user_42_df, 50)

als_scores = als_recs.select(
    explode("recommendations").alias("rec")
).select(
    col("rec.movie_id"),
    col("rec.rating").alias("als_score")
)

# Normalize ALS scores to [0, 1]
als_min = als_scores.agg(min("als_score")).collect()[0][0]
als_max = als_scores.agg(max("als_score")).collect()[0][0]

als_normalized = als_scores.withColumn(
    "als_norm",
    (col("als_score") - als_min) / (als_max - als_min)
)

print("ALS recommendations (normalized):")
als_normalized.show(10)

In [None]:
# Content-based scores: similarity to the user's favorite movies
# Take every movie the user rated >= 4.0 and score all movies against their mean TF-IDF vector

user_top_movies = ratings.filter(
    (col("user_id") == 42) & (col("rating") >= 4.0)
).join(movies_tfidf, "movie_id").select("movie_id", "tfidf_features").collect()

# Mean TF-IDF vector of the favorite movies = "user profile"
user_profile = np.mean([r.tfidf_features.toArray() for r in user_top_movies], axis=0)
user_profile_norm = np.linalg.norm(user_profile)

@udf(DoubleType())
def content_score(features):
    v = features.toArray()
    denom = user_profile_norm * np.linalg.norm(v)
    if denom == 0:
        return 0.0
    return float(np.dot(user_profile, v) / denom)

content_scores = movies_tfidf \
    .withColumn("content_score", content_score(col("tfidf_features"))) \
    .select("movie_id", "content_score")

# Normalize to [0, 1]
cs_min = content_scores.agg(min("content_score")).collect()[0][0]
cs_max = content_scores.agg(max("content_score")).collect()[0][0]

content_normalized = content_scores.withColumn(
    "content_norm",
    (col("content_score") - cs_min) / (cs_max - cs_min)
)

In [None]:
# Hybrid: combine ALS and content-based
# (user_movies, the movies user 42 rated >= 4.0, comes from the FPGrowth section)
ALPHA = 0.7  # ALS weight

hybrid = als_normalized.join(content_normalized, "movie_id") \
    .withColumn(
        "hybrid_score",
        ALPHA * col("als_norm") + (1 - ALPHA) * col("content_norm")
    ) \
    .join(movies, "movie_id") \
    .filter(~col("movie_id").isin(user_movies)) \
    .orderBy(desc("hybrid_score"))

print("Hybrid recommendations (ALS + Content-Based):")
hybrid.select(
    "title", "genres",
    round(col("als_norm"), 3).alias("als"),
    round(col("content_norm"), 3).alias("content"),
    round(col("hybrid_score"), 3).alias("hybrid")
).show(15, truncate=False)

### Exercise 2
Compare the quality of the three approaches (ALS, content-based, hybrid) on a test set:
1. For 100 random users, generate top-10 recommendations with each approach
2. Compute Precision@10 for each (ground truth = movies rated >= 4.0 in the test set)
3. Which approach gives the best results?
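
As a starting point, Precision@k for a single user can be written in a few lines of plain Python. The helper `precision_at_k` and the sample ids below are hypothetical, and this sketch divides by k even when fewer than k items were recommended, which is one common convention rather than the only one.

```python
def precision_at_k(recommended, relevant, k=10):
    """Share of the first k recommended items that appear in the relevant set."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    hits = [item for item in top_k if item in relevant]
    return len(hits) / k

# Made-up example: 5 recommendations, 2 of which the user rated >= 4.0 in the test set.
recommended = [10, 20, 30, 40, 50]
relevant = {20, 50, 99}
p_at_5 = precision_at_k(recommended, relevant, k=5)
print(f"Precision@5 = {p_at_5:.2f}")
```

Averaging this value over the sampled users gives one number per approach that can be compared directly.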

In [None]:
# Your solution:


## 7. Feature Engineering Pipeline

Building features that can feed any ML model.

In [None]:
from pyspark.ml.feature import VectorAssembler, StandardScaler, OneHotEncoder, StringIndexer

# Movie features
movie_features = movies \
    .withColumn("year", regexp_extract(col("title"), r"\((\d{4})\)", 1).cast("int")) \
    .withColumn("num_genres", size(split(col("genres"), "\\|"))) \
    .withColumn("is_comedy", col("genres").contains("Comedy").cast("int")) \
    .withColumn("is_drama", col("genres").contains("Drama").cast("int")) \
    .withColumn("is_action", col("genres").contains("Action").cast("int")) \
    .withColumn("is_horror", col("genres").contains("Horror").cast("int")) \
    .withColumn("is_scifi", col("genres").contains("Sci-Fi").cast("int")) \
    .withColumn("title_length", length(col("title")))

# Aggregates from ratings
movie_agg = ratings.groupBy("movie_id").agg(
    count("*").alias("num_ratings"),
    round(avg("rating"), 2).alias("avg_rating"),
    round(stddev("rating"), 2).alias("std_rating"),
    countDistinct("user_id").alias("unique_raters")
)

# Join movie features with the rating aggregates
movie_full = movie_features.join(movie_agg, "movie_id", "left").fillna(0)

# VectorAssembler - collect the features into a single vector
feature_cols = [
    "year", "num_genres", "is_comedy", "is_drama", "is_action",
    "is_horror", "is_scifi", "title_length", "num_ratings",
    "avg_rating", "std_rating", "unique_raters"
]

assembler = VectorAssembler(inputCols=feature_cols, outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="scaled_features")

feature_pipeline = Pipeline(stages=[assembler, scaler])
feature_model = feature_pipeline.fit(movie_full)
movies_featured = feature_model.transform(movie_full)

movies_featured.select("title", "scaled_features").show(5, truncate=False)

## Final Exercise

Build a complete **Recommendation Engine** with:

1. **ALS collaborative filtering** - train with the optimal parameters
2. **Content-based** - TF-IDF on genres
3. **FPGrowth rules** - association rules
4. **Hybrid scorer** - combined ranking:
   - ALS score (weight 0.5)
   - Content similarity (weight 0.3)
   - FPGrowth confidence boost (weight 0.2)

5. For 5 selected users, show:
   - Their history (what they like)
   - Top 10 hybrid recommendations
   - A comparison with pure ALS

6. Save the pipeline and the models

In [None]:
# Your solution:


In [None]:
ratings.unpersist()
movies.unpersist()
spark.stop()