### Task: "Come Back!" Marketing Campaign

You have users_df (all users) and orders_df (purchases).  
Find the users who have never bought anything.  
Junior approach: do a left join, then keep only the rows where the order columns are NULL (for example, order_id IS NULL).  
Senior approach: use the left_anti join type.  
Syntax: users_df.join(orders_df, ..., "left_anti").  
Compare the execution plans (explain()) of the two approaches. left_anti is usually cleaner and more memory-efficient: its result contains only the users_df columns, so no NULL-padded joined rows have to be produced and then filtered out.
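
To make the difference between the two approaches concrete, here is a minimal plain-Python sketch of their semantics. It is only a model of what each query returns, not of how Spark executes it; the function names `left_join_then_filter` and `left_anti_join` are made up for this illustration, and the tuples mirror the sample data used in this notebook.

```python
# Sample rows: (id, name, country) and (order_id, user_id, amount)
users = [(1, "Alice", "USA"), (2, "Bob", "UK"), (3, "Charlie", "CN"), (4, "David", "USA")]
orders = [(101, 1, 500.0), (102, 1, 300.0), (103, 2, 1200.0), (104, 99, 50.0)]


def left_join_then_filter(users, orders):
    """Junior approach: build every left-join row, then keep the unmatched ones."""
    joined = []
    for user in users:
        matches = [order for order in orders if order[1] == user[0]]
        if matches:
            # One output row per matching order
            joined.extend(user + order for order in matches)
        else:
            # No match: pad the order columns with None (NULL)
            joined.append(user + (None, None, None))
    # Keep rows whose order_id is NULL and drop the order columns
    return [row[:3] for row in joined if row[3] is None]


def left_anti_join(users, orders):
    """Senior approach: only the set of order keys is needed, not the order rows."""
    keys_with_orders = {order[1] for order in orders}
    return [user for user in users if user[0] not in keys_with_orders]


junior_result = left_join_then_filter(users, orders)
senior_result = left_anti_join(users, orders)
print(junior_result)
print(senior_result)
```

Both functions return Charlie and David. The first one builds a wider intermediate result before discarding most of it, which is the intuition behind the memory argument above.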

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("Joins_Lab").getOrCreate()

# Users table (user dimension)
users_data = [
    (1, "Alice", "USA"),
    (2, "Bob", "UK"),
    (3, "Charlie", "CN"),
    (4, "David", "USA")
]

users_df = spark.createDataFrame(users_data, ["id", "name", "country"])

# Orders table (orders fact)
orders_data = [
    (101, 1, 500.0), # Alice
    (102, 1, 300.0), # Alice
    (103, 2, 1200.0), # Bob
    (104, 99, 50.0)   # Unknown user (id 99)
]

orders_df = spark.createDataFrame(orders_data, ["order_id", "user_id", "amount"])

users_df.show()
orders_df.show()

+---+-------+-------+
| id|   name|country|
+---+-------+-------+
|  1|  Alice|    USA|
|  2|    Bob|     UK|
|  3|Charlie|     CN|
|  4|  David|    USA|
+---+-------+-------+

+--------+-------+------+
|order_id|user_id|amount|
+--------+-------+------+
|     101|      1| 500.0|
|     102|      1| 300.0|
|     103|      2|1200.0|
|     104|     99|  50.0|
+--------+-------+------+



In [9]:
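# Junior approach: left join, then keep the rows that have no matching order.
# Note: testing amount works here only because amount is never NULL in orders_df;
# orders_df.user_id (the join key) is the safer column to test.
join_j_df = users_df.join(orders_df, users_df.id == orders_df.user_id, "left").filter(F.col("amount").isNull())
join_j_df.show()

# Senior approach: left_anti returns only the users without a matching order.
join_s_df = users_df.join(orders_df, users_df.id == orders_df.user_id, "left_anti")
join_s_df.show()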

+---+-------+-------+--------+-------+------+
| id|   name|country|order_id|user_id|amount|
+---+-------+-------+--------+-------+------+
|  3|Charlie|     CN|    NULL|   NULL|  NULL|
|  4|  David|    USA|    NULL|   NULL|  NULL|
+---+-------+-------+--------+-------+------+

+---+-------+-------+
| id|   name|country|
+---+-------+-------+
|  3|Charlie|     CN|
|  4|  David|    USA|
+---+-------+-------+



In [11]:
# Print the physical plan of each approach for comparison
join_j_df.explain()
join_s_df.explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Filter isnull(amount#8)
   +- SortMergeJoin [id#0L], [user_id#7L], LeftOuter
      :- Sort [id#0L ASC NULLS FIRST], false, 0
      :  +- Exchange hashpartitioning(id#0L, 8), ENSURE_REQUIREMENTS, [plan_id=1136]
      :     +- Scan ExistingRDD[id#0L,name#1,country#2]
      +- Sort [user_id#7L ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(user_id#7L, 8), ENSURE_REQUIREMENTS, [plan_id=1137]
            +- Filter isnotnull(user_id#7L)
               +- Scan ExistingRDD[order_id#6L,user_id#7L,amount#8]


== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [id#0L], [user_id#7L], LeftAnti
   :- Sort [id#0L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(id#0L, 8), ENSURE_REQUIREMENTS, [plan_id=1163]
   :     +- Scan ExistingRDD[id#0L,name#1,country#2]
   +- Sort [user_id#7L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(user_id#7L, 8), ENSURE_REQUIREMENTS, [plan_id=11