# 14 - Integration Projects

Connecting Spark with the other components of the recommendation system.

**Topics:**
- Feature Store - precompute features in Spark, serve them from PostgreSQL
- Model Export - export ALS factors to a servable format
- Batch Scoring Pipeline - generate recommendations in batch
- A/B Testing Framework - compare models
- Lambda Architecture - batch + speed layer
- End-to-end pipeline: train → export → serve

## 1. Setup

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
import time

spark = SparkSession.builder \
    .appName("14_Integration") \
    .master("spark://spark-master:7077") \
    .config("spark.jars.packages", "org.postgresql:postgresql:42.7.1") \
    .config("spark.driver.memory", "6g") \
    .config("spark.executor.memory", "7g") \
    .config("spark.driver.host", "recommender-jupyter") \
    .config("spark.driver.bindAddress", "0.0.0.0") \
    .getOrCreate()

jdbc_url = "jdbc:postgresql://postgres:5432/recommender"
properties = {
    "user": "recommender",
    "password": "recommender",
    "driver": "org.postgresql.Driver"
}

ratings = spark.read.jdbc(
    jdbc_url, "movielens.ratings", properties=properties,
    column="user_id", lowerBound=1, upperBound=300000, numPartitions=10
)
movies = spark.read.jdbc(jdbc_url, "movielens.movies", properties=properties)

ratings.cache()
movies.cache()
print(f"Ratings: {ratings.count()}, Movies: {movies.count()}")

## 2. Feature Store

Feature Store = precomputed features, ready to be used by ML models and the API.

**Pipeline:**
```
Spark (batch)  →  PostgreSQL (feature store)  →  FastAPI (serving)
                    ↑ nightly refresh              ↑ reads <10ms
```
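
On the serving side, the read path reduces to a single primary-key lookup. The sketch below illustrates that idea with Python's built-in `sqlite3` as a stand-in for PostgreSQL, so it runs without a database server. The table layout, the sample row, and the helper name `get_user_features` are assumptions made for illustration, not part of this notebook's API.

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL feature store (assumption).
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    """
    CREATE TABLE user_features (
        user_id       INTEGER PRIMARY KEY,  -- keyed lookups rely on this index
        total_ratings INTEGER,
        avg_rating    REAL,
        user_segment  TEXT
    )
    """
)
# Made-up sample row, standing in for the nightly Spark export.
conn.execute(
    "INSERT INTO user_features VALUES (?, ?, ?, ?)", (42, 150, 3.85, "active")
)
conn.commit()


def get_user_features(connection, user_id):
    """Return the feature row for one user as a dict, or None if unknown."""
    row = connection.execute(
        "SELECT * FROM user_features WHERE user_id = ?", (user_id,)
    ).fetchone()
    return dict(row) if row is not None else None


print(get_user_features(conn, 42))
print(get_user_features(conn, 7))  # None: no features computed for this user
```

Fast reads depend on an index on `user_id`. Spark's JDBC `overwrite` mode drops and recreates the target table by default, so an index created on the PostgreSQL side would probably have to be re-created after each export, or the writer's `truncate` option used instead.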

In [None]:
# === USER FEATURES ===

user_features = ratings.groupBy("user_id").agg(
    count("*").alias("total_ratings"),
    round(avg("rating"), 4).alias("avg_rating"),
    round(stddev("rating"), 4).alias("std_rating"),
    sum(when(col("rating") >= 4.0, 1).otherwise(0)).alias("positive_count"),
    sum(when(col("rating") <= 2.0, 1).otherwise(0)).alias("negative_count"),
    countDistinct("movie_id").alias("unique_movies"),
    min("rating_timestamp").alias("first_rating_at"),
    max("rating_timestamp").alias("last_rating_at")
)

# Add derived features
user_features = user_features \
    .withColumn("positivity_ratio", round(col("positive_count") / col("total_ratings"), 4)) \
    .withColumn("activity_span_days",
        datediff(col("last_rating_at"), col("first_rating_at"))
    ) \
    .withColumn("ratings_per_day",
        when(col("activity_span_days") > 0,
             round(col("total_ratings") / col("activity_span_days"), 4))
        .otherwise(col("total_ratings").cast("double"))
    ) \
    .withColumn("user_segment",
        when(col("total_ratings") >= 1000, "power")
        .when(col("total_ratings") >= 100, "active")
        .when(col("total_ratings") >= 20, "casual")
        .otherwise("new")
    ) \
    .withColumn("_computed_at", current_timestamp())

print(f"User features: {user_features.count()} users")
user_features.show(5)

In [None]:
# === MOVIE FEATURES ===

movie_features = ratings.groupBy("movie_id").agg(
    count("*").alias("total_ratings"),
    round(avg("rating"), 4).alias("avg_rating"),
    round(stddev("rating"), 4).alias("std_rating"),
    countDistinct("user_id").alias("unique_raters"),
    round(avg(when(col("rating") >= 4.0, 1).otherwise(0)), 4).alias("positive_rate"),
    min("rating_timestamp").alias("first_rated_at"),
    max("rating_timestamp").alias("last_rated_at")
)

# Join with movie metadata
movie_features = movie_features.join(movies, "movie_id") \
    .withColumn("year", regexp_extract(col("title"), r"\((\d{4})\)", 1).cast("int")) \
    .withColumn("primary_genre", element_at(split(col("genres"), "\\|"), 1)) \
    .withColumn("num_genres", size(split(col("genres"), "\\|"))) \
    .withColumn("_computed_at", current_timestamp())

print(f"Movie features: {movie_features.count()} movies")
movie_features.show(5)

In [None]:
# === USER-GENRE PREFERENCE FEATURES ===
# Matrix: user × genre → average rating

user_genre_prefs = ratings.join(
    movies.select("movie_id", "genres"), "movie_id"
).withColumn(
    "genre", explode(split(col("genres"), "\\|"))
).groupBy("user_id", "genre").agg(
    count("*").alias("genre_count"),
    round(avg("rating"), 4).alias("genre_avg_rating")
)

# Pivot into a wide table (one column per genre)
top_genres = ["Drama", "Comedy", "Action", "Thriller", "Romance",
              "Adventure", "Sci-Fi", "Crime", "Horror", "Animation"]

user_genre_wide = user_genre_prefs \
    .filter(col("genre").isin(top_genres)) \
    .groupBy("user_id") \
    .pivot("genre", top_genres) \
    .agg(first("genre_avg_rating")) \
    .fillna(0.0)

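# Rename columns to pref_<genre>, e.g. "Sci-Fi" -> pref_sci_fi
for g in top_genres:
    user_genre_wide = user_genre_wide.withColumnRenamed(g, f"pref_{g.lower().replace('-', '_')}")

# DataFrames are immutable: assign the result, otherwise the new column is lost
user_genre_wide = user_genre_wide.withColumn("_computed_at", current_timestamp())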

print(f"User-genre preferences: {user_genre_wide.count()} users × {len(top_genres)} genres")
user_genre_wide.show(5)

In [None]:
# === EXPORT TO POSTGRESQL ===

# Write the features to PostgreSQL (feature store)
user_features.write.mode("overwrite") \
    .jdbc(jdbc_url, "features.user_features", properties=properties)

movie_features.select(
    "movie_id", "title", "genres", "primary_genre", "year", "num_genres",
    "total_ratings", "avg_rating", "std_rating", "unique_raters",
    "positive_rate", "_computed_at"
).write.mode("overwrite") \
    .jdbc(jdbc_url, "features.movie_features", properties=properties)

user_genre_wide.write.mode("overwrite") \
    .jdbc(jdbc_url, "features.user_genre_preferences", properties=properties)

print("Feature store exported to PostgreSQL (schema: features)")

## 3. Model Export - ALS factors for serving

Serving directly from Spark is slow, so we export the latent vectors to PostgreSQL instead.

**Serving:**
1. User request → FastAPI
2. FastAPI loads user_factor from PostgreSQL
3. Multiply user_factor × item_factors = scores
4. Top N movies → response
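
The four steps above can be sketched in plain Python. This is an illustration under assumptions: the factor values are made up, `ITEM_FACTORS` stands for the item vectors a service would keep cached in memory, and `top_n_for_user` is a hypothetical helper name.

```python
import heapq
import json
import math

# Made-up factors with rank=2; the user vector is JSON-encoded the way a
# features_json column would store it.
USER_FACTORS_JSON = {42: "[1.0, 0.5]"}
ITEM_FACTORS = {            # movie_id -> latent vector, assumed cached in memory
    1: [0.75, 0.5],
    2: [0.25, 1.0],
    3: [1.0, 1.0],
    4: [-0.5, 0.5],
}


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return math.fsum(a * b for a, b in zip(u, v))


def top_n_for_user(user_id, n=2):
    """Score every item against the user's vector and return the n best."""
    raw = USER_FACTORS_JSON.get(user_id)
    if raw is None:
        return []           # unknown user: nothing to recommend
    user_vec = json.loads(raw)
    scored = ((dot(user_vec, vec), movie_id) for movie_id, vec in ITEM_FACTORS.items())
    # nlargest keeps only n candidates instead of sorting the whole catalogue
    return [(movie_id, score) for score, movie_id in heapq.nlargest(n, scored)]


print(top_n_for_user(42))   # [(3, 1.5), (1, 1.0)]
```

With a large catalogue, a vectorised matrix-vector product (for example with NumPy) would likely be the faster choice; the structure of the computation stays the same.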

In [None]:
# Train ALS
als = ALS(
    maxIter=10, regParam=0.1, rank=20,
    userCol="user_id", itemCol="movie_id", ratingCol="rating",
    coldStartStrategy="drop", seed=42
)
model = als.fit(ratings)

# Export the factors
user_factors = model.userFactors \
    .withColumnRenamed("id", "user_id") \
    .withColumn("features_json",
        to_json(col("features").cast(ArrayType(DoubleType()))))

item_factors = model.itemFactors \
    .withColumnRenamed("id", "movie_id") \
    .withColumn("features_json",
        to_json(col("features").cast(ArrayType(DoubleType()))))

print(f"User factors: {user_factors.count()} users × rank={model.rank}")
print(f"Item factors: {item_factors.count()} items × rank={model.rank}")

user_factors.select("user_id", "features_json").show(3, truncate=80)

In [None]:
# Export to PostgreSQL
user_factors.select("user_id", "features_json") \
    .write.mode("overwrite") \
    .jdbc(jdbc_url, "models.user_factors", properties=properties)

item_factors.select("movie_id", "features_json") \
    .write.mode("overwrite") \
    .jdbc(jdbc_url, "models.item_factors", properties=properties)

# Model metadata: read hyperparameters from the estimator instead of hardcoding them
model_metadata = spark.createDataFrame([
    ("als_v1", model.rank, als.getMaxIter(), als.getRegParam(), ratings.count())
], ["model_name", "rank", "max_iter", "reg_param", "training_size"]) \
    .withColumn("trained_at", current_timestamp())  # a real timestamp, not str(Column)

model_metadata.write.mode("overwrite") \
    .jdbc(jdbc_url, "models.metadata", properties=properties)

print("Model factors exported to PostgreSQL (schema: models)")

In [None]:
# Serving simulation: how FastAPI would use these factors
import numpy as np
import json

def get_recommendations_from_factors(user_id, top_n=10):
    """
    Simulate serving recommendations from PostgreSQL.
    In a real API:
    - user_factor is loaded from PostgreSQL (1 query)
    - item_factors are preloaded in memory (cache)
    - numpy dot product → scores
    """
    # Fetch the user factor
    uf = user_factors.filter(col("user_id") == user_id) \
        .select("features_json").collect()
    if not uf:
        return None
    
    user_vec = np.array(json.loads(uf[0].features_json))
    
    # Fetch the item factors and compute the scores
    all_items = item_factors.select("movie_id", "features_json").collect()
    
    scores = []
    for item in all_items:
        item_vec = np.array(json.loads(item.features_json))
        score = float(np.dot(user_vec, item_vec))
        scores.append((item.movie_id, score))
    
    # Sort and take the top N
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:top_n]

# Test
start = time.time()
recs = get_recommendations_from_factors(42)
serving_time = time.time() - start

print(f"Recommendations for user 42 (computed in {serving_time:.3f}s):")
movie_names = {r.movie_id: r.title for r in movies.collect()}
for movie_id, score in recs:
    print(f"  {movie_names.get(movie_id, 'Unknown')[:50]:<50} score={score:.3f}")

## 4. Batch Scoring Pipeline

Generating recommendations for ALL users in a single batch.

**Cron:** run nightly → refresh the recommendations table → the API reads from it.

In [None]:
# Top 20 recommendations for all users
start = time.time()
all_recs = model.recommendForAllUsers(20).cache()
n_users = all_recs.count()  # Spark is lazy: an action must run before the timer stops
batch_time = time.time() - start
print(f"Batch scoring for {n_users} users: {batch_time:.1f}s")

# Unpack the recommendations into a flat table
recs_flat = all_recs.select(
    col("user_id"),
    posexplode("recommendations").alias("rank", "rec")
).select(
    "user_id",
    (col("rank") + 1).alias("rank"),
    col("rec.movie_id").alias("movie_id"),
    round(col("rec.rating"), 4).alias("predicted_score")
)

# Add metadata
recs_with_meta = recs_flat \
    .join(movies.select("movie_id", "title", "genres"), "movie_id") \
    .withColumn("model_version", lit("als_v1")) \
    .withColumn("generated_at", current_timestamp())

print(f"Total recommendations: {recs_with_meta.count()} rows")
recs_with_meta.filter(col("user_id") == 42).show(10, truncate=False)
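
For readers less familiar with `posexplode`, the flattening step can be pictured in plain Python: each user's ordered list becomes one row per recommendation, and the zero-based position is shifted to a 1-based rank. The input data and the helper name `flatten_recs` are made up for illustration.

```python
def flatten_recs(all_recs):
    """Turn {user_id: [(movie_id, score), ...]} into flat
    (user_id, rank, movie_id, score) rows."""
    rows = []
    for user_id, recommendations in all_recs.items():
        # enumerate(..., start=1) plays the role of posexplode's position + 1
        for rank, (movie_id, score) in enumerate(recommendations, start=1):
            rows.append((user_id, rank, movie_id, score))
    return rows


nested = {
    42: [(318, 4.75), (858, 4.5)],   # best recommendation first
    7: [(50, 4.25)],
}
for row in flatten_recs(nested):
    print(row)
```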

In [None]:
# Export to PostgreSQL - ready to be served by the API
recs_with_meta.select(
    "user_id", "rank", "movie_id", "title", "genres",
    "predicted_score", "model_version", "generated_at"
).write.mode("overwrite") \
    .jdbc(jdbc_url, "recommendations.batch_recs", properties=properties)

print("Batch recommendations exported to PostgreSQL")
print("API query: SELECT * FROM recommendations.batch_recs WHERE user_id = ? ORDER BY rank")

## 5. A/B Testing Framework

Comparing two models on the same dataset.

**Scenario:**
- Model A: ALS with rank=10
- Model B: ALS with rank=50
- 50% of users get model A, 50% get model B
- We compare offline metrics
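
A 50/50 split is commonly made deterministic by hashing the user ID, so that a user lands in the same group on every run. Here is a minimal sketch of that idea using `zlib.crc32` from the standard library; the function name `assign_ab_group` and the optional `salt` parameter are assumptions added for illustration.

```python
import zlib


def assign_ab_group(user_id, salt=""):
    """Deterministically map a user ID to group "A" or "B" via CRC32 parity."""
    key = f"{salt}{user_id}".encode("utf-8")
    return "A" if zlib.crc32(key) % 2 == 0 else "B"


groups = [assign_ab_group(u) for u in range(1, 11)]
print(groups)

# The same user always gets the same group, run after run.
print(assign_ab_group(42) == assign_ab_group(42))
```

Changing the salt reshuffles the assignment for a new experiment. A hash-based split is only approximately balanced, so it is worth checking the group sizes before comparing metrics.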

In [None]:
# Prepare train/test
(train, test) = ratings.randomSplit([0.8, 0.2], seed=42)
train.cache()
test.cache()

# Model A: rank=10
als_a = ALS(maxIter=10, regParam=0.1, rank=10,
    userCol="user_id", itemCol="movie_id", ratingCol="rating",
    coldStartStrategy="drop", seed=42)
model_a = als_a.fit(train)

# Model B: rank=50
als_b = ALS(maxIter=10, regParam=0.1, rank=50,
    userCol="user_id", itemCol="movie_id", ratingCol="rating",
    coldStartStrategy="drop", seed=42)
model_b = als_b.fit(train)

print("Models trained")

In [None]:
# Deterministic, hash-based assignment of users to A/B groups
users = ratings.select("user_id").distinct() \
    .withColumn("ab_group",
        when(crc32(col("user_id").cast("string")) % 2 == 0, "A")
        .otherwise("B")
    )

print("A/B group distribution:")
users.groupBy("ab_group").count().show()

In [None]:
# Offline metrics per model
evaluator_rmse = RegressionEvaluator(
    metricName="rmse", labelCol="rating", predictionCol="prediction"
)
evaluator_mae = RegressionEvaluator(
    metricName="mae", labelCol="rating", predictionCol="prediction"
)

# Test set per group
test_a = test.join(users.filter(col("ab_group") == "A"), "user_id")
test_b = test.join(users.filter(col("ab_group") == "B"), "user_id")

# Predictions
preds_a = model_a.transform(test_a)
preds_b = model_b.transform(test_b)

# Metrics
results = {
    "Model A (rank=10)": {
        "RMSE": evaluator_rmse.evaluate(preds_a),
        "MAE": evaluator_mae.evaluate(preds_a),
        "users": test_a.select("user_id").distinct().count(),
        "predictions": preds_a.count()
    },
    "Model B (rank=50)": {
        "RMSE": evaluator_rmse.evaluate(preds_b),
        "MAE": evaluator_mae.evaluate(preds_b),
        "users": test_b.select("user_id").distinct().count(),
        "predictions": preds_b.count()
    }
}

print("\n=== A/B Test Results ===")
for model_name, metrics in results.items():
    print(f"\n{model_name}:")
    for metric, value in metrics.items():
        print(f"  {metric}: {value:.4f}" if isinstance(value, float) else f"  {metric}: {value}")

In [None]:
# Ranking metrics per model
K = 10

def compute_precision_at_k(model, test_df, k=10):
    """Precision@K for a model."""
    # Ground truth: movies rated >= 4.0
    relevant = test_df.filter(col("rating") >= 4.0) \
        .groupBy("user_id") \
        .agg(collect_set("movie_id").alias("relevant_movies"))
    
    # Recommendations
    recs = model.recommendForUserSubset(
        test_df.select("user_id").distinct(), k
    ).select(
        "user_id",
        expr("transform(recommendations, x -> x.movie_id)").alias("rec_movies")
    )
    
    # Precision@K
    precision = recs.join(relevant, "user_id") \
        .withColumn("hits", size(array_intersect(col("rec_movies"), col("relevant_movies")))) \
        .agg(round(avg(col("hits") / k), 4).alias("precision_at_k")) \
        .collect()[0][0]
    
    return precision

# Pass K explicitly so the printed label always matches the computed metric
prec_a = compute_precision_at_k(model_a, test_a, K)
prec_b = compute_precision_at_k(model_b, test_b, K)

print(f"\nPrecision@{K}:")
print(f"  Model A (rank=10): {prec_a}")
print(f"  Model B (rank=50): {prec_b}")
print(f"  Winner: {'Model A' if prec_a > prec_b else 'Model B'}")
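
As a worked example of the metric itself, Precision@K is the share of the top K recommended items that the user actually found relevant. The plain-Python sketch below uses made-up movie IDs; `precision_at_k` is an illustrative helper, not the Spark function defined above.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that appear in the relevant set."""
    top_k = recommended[:k]
    hits = len([item for item in top_k if item in relevant])
    return hits / k


recommended = [10, 20, 30, 40, 50]   # ranked model output (made-up IDs)
relevant = {20, 50, 99}              # held-out movies the user rated >= 4.0

print(precision_at_k(recommended, relevant, k=5))  # 2 hits out of 5 -> 0.4
print(precision_at_k(recommended, relevant, k=2))  # 1 hit out of 2 -> 0.5
```

One caveat when reading the Spark numbers: as far as I know, `recommendForUserSubset` does not exclude movies a user already rated in the training split, and those can never be hits in the test split. The reported values are therefore likely a pessimistic estimate.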

In [None]:
# Save the A/B test results
ab_results = spark.createDataFrame([
    ("A", "als_rank10", 10, results["Model A (rank=10)"]["RMSE"],
     results["Model A (rank=10)"]["MAE"], float(prec_a or 0)),
    ("B", "als_rank50", 50, results["Model B (rank=50)"]["RMSE"],
     results["Model B (rank=50)"]["MAE"], float(prec_b or 0)),
], ["ab_group", "model_name", "rank", "rmse", "mae", "precision_at_10"])

ab_results.withColumn("test_date", current_timestamp()) \
    .write.mode("overwrite") \
    .jdbc(jdbc_url, "experiments.ab_test_results", properties=properties)

ab_results.show()

## 6. Lambda Architecture - Batch + Speed Layer

```
                         ┌─────────────────┐
  New Data ──────────────┤  Speed Layer     │──── Real-time view
       │                 │  (Streaming)     │      (recent data)
       │                 └─────────────────┘
       │
       └────────────────┐
                        ▼
                 ┌──────────────┐
                 │  Batch Layer │──── Batch view
                 │  (nightly)   │      (full history)
                 └──────────────┘
                                          
                 ┌──────────────┐
                 │ Serving Layer│──── Merge batch + speed
                 │  (API/DB)    │      = complete view
                 └──────────────┘
```
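
The serving-layer merge in the diagram can be sketched with plain dictionaries: the batch view supplies full-history counts, the speed view supplies recent counts, and movies missing from the speed view default to zero. All numbers and the helper name `merge_views` are made up for illustration.

```python
def merge_views(batch_view, speed_view):
    """Combine full-history counts with recent counts into one serving view."""
    merged = {}
    for movie_id, total in batch_view.items():
        # Behaves like a left join followed by coalesce(recent, 0)
        recent = speed_view.get(movie_id, 0)
        merged[movie_id] = {
            "total_ratings": total,
            "recent_ratings": recent,
            "trending_score": recent / total,   # assumes total > 0
        }
    return merged


batch_view = {1: 1000, 2: 200, 3: 50}   # movie_id -> all-time rating count
speed_view = {2: 50, 3: 25}             # movie_id -> recent rating count

merged = merge_views(batch_view, speed_view)
ranking = sorted(merged, key=lambda m: merged[m]["trending_score"], reverse=True)
print(ranking)   # [3, 2, 1]
```

A ratio like this favours movies with very few ratings overall, so a minimum-count filter may be worth adding before using it for ranking.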

In [None]:
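# Batch Layer: model trained on the full data (nightly)
# We already have this - the ALS model + batch recommendations

# Speed Layer: near-real-time popularity updates
# We simulate a stream of new ratings with the last 30 days present in the data.
# MovieLens is historical, so the window is anchored to the newest rating rather
# than to current_date(), which would likely select no rows.

latest_ts = ratings.agg(max("rating_timestamp")).first()[0]
recent_ratings = ratings.filter(
    col("rating_timestamp") >= date_sub(lit(latest_ts), 30)
)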

# Real-time movie popularity (speed layer)
speed_layer_stats = recent_ratings.groupBy("movie_id").agg(
    count("*").alias("recent_ratings"),
    round(avg("rating"), 2).alias("recent_avg_rating"),
    max("rating_timestamp").alias("latest_rating")
)

# Batch layer: historical statistics
batch_layer_stats = ratings.groupBy("movie_id").agg(
    count("*").alias("total_ratings"),
    round(avg("rating"), 2).alias("overall_avg_rating")
)

# Serving Layer: merge batch + speed
serving_view = batch_layer_stats \
    .join(speed_layer_stats, "movie_id", "left") \
    .join(movies.select("movie_id", "title"), "movie_id") \
    .withColumn("trending_score",
        coalesce(col("recent_ratings"), lit(0)) / col("total_ratings")
    ) \
    .fillna(0)

print("Serving view - top 'trending' movies (batch + speed):")
serving_view.orderBy(desc("trending_score")) \
    .select("title", "total_ratings", "overall_avg_rating",
            "recent_ratings", "recent_avg_rating",
            round(col("trending_score"), 4).alias("trending")) \
    .show(15, truncate=False)

## 7. End-to-End Pipeline: train → export → serve

The full pipeline you would run every night.

In [None]:
def run_nightly_pipeline():
    """Full nightly recommendation pipeline."""
    pipeline_start = time.time()
    
    # 1. Load data
    print("[1/6] Loading data...")
    ratings = spark.read.jdbc(
        jdbc_url, "movielens.ratings", properties=properties,
        column="user_id", lowerBound=1, upperBound=300000, numPartitions=10
    ).cache()
    movies = spark.read.jdbc(jdbc_url, "movielens.movies", properties=properties).cache()
    n_ratings = ratings.count()
    print(f"   Loaded {n_ratings} ratings")
    
    # 2. Evaluate on a held-out split
    # Scoring a model on rows it was trained on would make RMSE look too good,
    # so the evaluation model only sees the training split.
    print("[2/6] Evaluating on a held-out split...")
    als = ALS(
        maxIter=10, regParam=0.1, rank=20,
        userCol="user_id", itemCol="movie_id", ratingCol="rating",
        coldStartStrategy="drop", seed=42
    )
    (train, test) = ratings.randomSplit([0.8, 0.2], seed=42)
    preds = als.fit(train).transform(test)
    rmse = RegressionEvaluator(
        metricName="rmse", labelCol="rating", predictionCol="prediction"
    ).evaluate(preds)
    print(f"   RMSE: {rmse:.4f}")
    
    # 3. Train the production model on all ratings
    print("[3/6] Training ALS model on the full data...")
    model = als.fit(ratings)
    print(f"   Model trained (rank={model.rank})")
    
    # 4. Generate batch recommendations
    print("[4/6] Generating batch recommendations...")
    all_recs = model.recommendForAllUsers(20)
    recs_flat = all_recs.select(
        "user_id",
        posexplode("recommendations").alias("rank", "rec")
    ).select(
        "user_id", (col("rank") + 1).alias("rank"),
        col("rec.movie_id"), round(col("rec.rating"), 4).alias("score")
    ).join(movies.select("movie_id", "title"), "movie_id") \
     .withColumn("model_version", lit("als_v1")) \
     .withColumn("generated_at", current_timestamp())
    print(f"   Generated {recs_flat.count()} recommendations")
    
    # 5. Export to PostgreSQL
    print("[5/6] Exporting to PostgreSQL...")
    recs_flat.write.mode("overwrite") \
        .jdbc(jdbc_url, "recommendations.batch_recs", properties=properties)
    
    # Model factors
    model.userFactors.withColumnRenamed("id", "user_id") \
        .withColumn("features_json", to_json(col("features").cast(ArrayType(DoubleType())))) \
        .select("user_id", "features_json") \
        .write.mode("overwrite") \
        .jdbc(jdbc_url, "models.user_factors", properties=properties)
    
    model.itemFactors.withColumnRenamed("id", "movie_id") \
        .withColumn("features_json", to_json(col("features").cast(ArrayType(DoubleType())))) \
        .select("movie_id", "features_json") \
        .write.mode("overwrite") \
        .jdbc(jdbc_url, "models.item_factors", properties=properties)
    print("   Export complete")
    
    # 6. Log pipeline run
    print("[6/6] Logging pipeline run...")
    total_time = time.time() - pipeline_start
    
    run_log = spark.createDataFrame([
        ("als_v1", n_ratings, rmse, model.rank, total_time, "success")
    ], ["model_version", "training_size", "rmse", "rank", "duration_seconds", "status"])
    run_log.withColumn("run_at", current_timestamp()) \
        .write.mode("append") \
        .jdbc(jdbc_url, "pipeline.run_log", properties=properties)
    
    print(f"\nPipeline completed in {total_time:.1f}s")
    print(f"RMSE: {rmse:.4f}")
    
    ratings.unpersist()
    movies.unpersist()
    
    return model, rmse

# Run the pipeline!
model, rmse = run_nightly_pipeline()
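
The pipeline above only ever records a `"success"` status, so a run that fails halfway leaves no trace in the log. One possible way to capture failures as well is a small step runner like the sketch below. The names `run_with_log`, `load`, and `train` are hypothetical, and the in-memory list stands in for the `pipeline.run_log` table.

```python
import time


def run_with_log(steps, log):
    """Run (name, callable) steps in order and append one status record to `log`."""
    started = time.time()
    record = {"status": "success", "failed_step": None, "error": None}
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            # Remember which step broke, then stop: later steps depend on it
            record.update(status="failed", failed_step=name, error=str(exc))
            break
    record["duration_seconds"] = time.time() - started
    log.append(record)
    return record


def load():
    pass  # placeholder for a real pipeline step


def train():
    raise ValueError("not enough ratings")  # simulated failure


run_log = []
result = run_with_log([("load", load), ("train", train), ("export", load)], run_log)
print(result["status"], result["failed_step"], result["error"])
```

In a nightly job, the record would be written to the log table in both cases, so that a missing or failed run is visible to monitoring.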

### Task 1
Extend the nightly pipeline with:
1. Feature store update (user_features + movie_features from section 2)
2. Data quality check on the input (min. 10M ratings, no nulls in keys)
3. A/B test: train 2 models (rank=10 and rank=30), compare RMSE
4. Automatic selection of the better model
5. Alert if RMSE > threshold (e.g. 1.0)

In [None]:
# Your solution:


## Final Task

Build a **complete recommendation system** with:

1. **Data Pipeline** (ETL)
   - Bronze: raw data from PostgreSQL
   - Silver: cleaned and enriched
   - Gold: feature store, model input

2. **Model Pipeline**
   - Train ALS with the optimal rank (grid search)
   - Evaluation: RMSE + Precision@10
   - Export the factors to PostgreSQL

3. **Serving Pipeline**
   - Batch recommendations (top 20 per user) → PostgreSQL
   - Similar movies (top 10 per movie) → PostgreSQL
   - Trending movies (last 30 days) → PostgreSQL

4. **Monitoring**
   - Pipeline run log with metrics
   - Comparison with the previous model

Resulting PostgreSQL tables:
- `recommendations.batch_recs` - recommendations per user
- `recommendations.similar_movies` - similar movies
- `recommendations.trending` - trending movies
- `features.user_features` - feature store
- `features.movie_features` - feature store
- `models.user_factors` / `item_factors` - ALS vectors
- `pipeline.run_log` - run log

In [None]:
# Your solution:


In [None]:
spark.stop()