# 📈 Task 2: Revenue from sold products (filtered orders)
 
Goal:
- Exclude canceled orders and orders suspected of fraud.
- Compute the total revenue for each product.
- Display the products along with their revenue.
- Save the results to a CSV file.
- Apply appropriate formatting and styles.

## ✅ Step 1: Start the Spark session

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as _sum

spark = SparkSession.builder.appName("ProductRevenue").getOrCreate()



## ✅ Step 2: Load the data from Parquet

Load the `order_items` and `orders` tables:

In [35]:
project_dir = "<PROJECT_DIR>"
data_dir = f"{project_dir}/data"
outputs_dir = f"{data_dir}/outputs"
input_tables = f"{data_dir}/sklep/"
output_tables = f"{outputs_dir}/sklep/"

In [25]:
order_items = spark.read.parquet(f"{input_tables}/order_items/")
orders = spark.read.parquet(f"{input_tables}/orders/")

## ✅ Step 3: Filter valid orders

**TODO**: Remove orders with status `CANCELED` or `SUSPECTED_FRAUD`.

In [36]:
valid_orders = orders.filter(~col("order_status").isin("CANCELED", "SUSPECTED_FRAUD"))

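To make the filter's behaviour concrete, here is a plain-Python sketch that mirrors `~col("order_status").isin(...)` on a few made-up rows. The sample rows, the statuses `COMPLETE` and `PENDING`, and the helper `keep_order` are hypothetical and only serve as illustration. One detail worth knowing: in Spark SQL a NULL status makes the `isin` expression evaluate to NULL, and `filter` keeps only rows where the condition is true, so orders with a missing status are dropped as well.

```python
# Plain-Python mirror of the Spark filter; all sample rows are made up.
EXCLUDED = {"CANCELED", "SUSPECTED_FRAUD"}


def keep_order(status):
    """Return True if the ~isin(...) filter would keep a row with this status."""
    if status is None:
        # NULL IN (...) is NULL, NOT NULL is still NULL, and filter drops NULL.
        return False
    return status not in EXCLUDED


sample_orders = [
    {"order_id": 1, "order_status": "COMPLETE"},
    {"order_id": 2, "order_status": "CANCELED"},
    {"order_id": 3, "order_status": "SUSPECTED_FRAUD"},
    {"order_id": 4, "order_status": None},
    {"order_id": 5, "order_status": "PENDING"},
]

valid_ids = [o["order_id"] for o in sample_orders if keep_order(o["order_status"])]
print(valid_ids)
```

If orders with a missing status should count as valid, the Spark condition would need an explicit `isNull()` branch; whether that applies here depends on the data.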

## ✅ Step 4: Compute the revenue

**TODO**: Join the tables and group by `order_item_product_id`, summing `order_item_subtotal`:

In [37]:
revenue_per_product = order_items \
    .join(valid_orders, order_items.order_item_order_id == valid_orders.order_id) \
    .groupBy("order_item_product_id") \
    .agg(_sum("order_item_subtotal").alias("revenue"))


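The join and aggregation above can be sanity-checked on paper with a small plain-Python sketch. The rows below are invented for illustration and reuse only the column names from the notebook; the sketch also assumes each `order_id` appears once in `valid_orders`, so the inner join does not duplicate any item.

```python
from collections import defaultdict

# Hypothetical sample data; only the column names come from the notebook.
valid_order_ids = {1, 5}  # ids that survived the status filter
sample_items = [
    {"order_item_order_id": 1, "order_item_product_id": 897, "order_item_subtotal": 100.0},
    {"order_item_order_id": 1, "order_item_product_id": 858, "order_item_subtotal": 50.0},
    # Order 2 was filtered out, so this item must not count.
    {"order_item_order_id": 2, "order_item_product_id": 897, "order_item_subtotal": 999.0},
    {"order_item_order_id": 5, "order_item_product_id": 897, "order_item_subtotal": 25.5},
]

revenue = defaultdict(float)
for item in sample_items:
    # Inner join: items without a matching valid order contribute nothing.
    if item["order_item_order_id"] in valid_order_ids:
        revenue[item["order_item_product_id"]] += item["order_item_subtotal"]

revenue = dict(revenue)
print(revenue)
```

Because the join is an inner join, a product whose items all belong to excluded orders does not appear in the result at all, rather than appearing with a revenue of zero.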

## ✅ Step 5: Display the results

Display the products with their revenue:

In [28]:
revenue_per_product.show(10)

+---------------------+------------------+
|order_item_product_id|           revenue|
+---------------------+------------------+
|                  897| 78568.55933380127|
|                  858| 44797.76123046875|
|                  251|  74511.7201538086|
|                  804| 71963.99922943115|
|                   78| 72392.75991821289|
|                  642|          106800.0|
|                   44| 218843.5263671875|
|                  743| 41477.56134033203|
|                  860|     19199.6796875|
|                  926|56220.839210510254|
+---------------------+------------------+


Display the revenue rounded to two decimal places:

In [29]:
from pyspark.sql.functions import format_number

revenue_formatted = revenue_per_product.select(
    "order_item_product_id",
    format_number("revenue", 2).alias("revenue_formatted")
)

revenue_formatted.show(10)

+---------------------+-----------------+
|order_item_product_id|revenue_formatted|
+---------------------+-----------------+
|                  897|        78,568.56|
|                  858|        44,797.76|
|                  251|        74,511.72|
|                  804|        71,964.00|
|                   78|        72,392.76|
|                  642|       106,800.00|
|                   44|       218,843.53|
|                  743|        41,477.56|
|                  860|        19,199.68|
|                  926|        56,220.84|
+---------------------+-----------------+
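Note that `format_number` returns a string column, so the formatted values are best kept for display only. The plain-Python sketch below applies the same thousands-grouped, two-decimal format to three revenue values copied from the unformatted output above, and shows why sorting the formatted strings gives a different order than sorting the numbers. The variable names are illustrative only.

```python
# Revenue values copied from the unformatted output above.
raw = {897: 78568.55933380127, 804: 71963.99922943115, 642: 106800.0}

# Same presentation as format_number(col, 2): grouping commas, two decimals.
formatted = {pid: f"{value:,.2f}" for pid, value in raw.items()}
print(formatted)

# Sorting strings compares characters, not magnitudes.
by_text = sorted(formatted.values())
by_number = [f"{value:,.2f}" for value in sorted(raw.values())]
print(by_text)
print(by_number)
```

For ordering or further arithmetic, a reasonable approach is to sort on the numeric `revenue` column first and apply the formatting as the last step.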


Display the revenue with a currency symbol:

In [33]:
from pyspark.sql.functions import concat, lit

revenue_with_currency = revenue_per_product.select(
    "order_item_product_id",
    concat(lit("$ "), format_number("revenue", 2)).alias("revenue_usd")
)

revenue_with_currency.show(truncate=False)

+---------------------+------------+
|order_item_product_id|revenue_usd |
+---------------------+------------+
|897                  |$ 78,568.56 |
|858                  |$ 44,797.76 |
|251                  |$ 74,511.72 |
|804                  |$ 71,964.00 |
|78                   |$ 72,392.76 |
|642                  |$ 106,800.00|
|44                   |$ 218,843.53|
|743                  |$ 41,477.56 |
|860                  |$ 19,199.68 |
|926                  |$ 56,220.84 |
|822                  |$ 149,152.92|
|625                  |$ 47,197.64 |
|93                   |$ 79,668.12 |
|924                  |$ 53,342.64 |
|725                  |$ 31,104.00 |
|671                  |$ 52,077.52 |
|305                  |$ 51,740.00 |
|906                  |$ 87,065.16 |
|797                  |$ 61,525.80 |
|777                  |$ 64,951.88 |
+---------------------+------------+


## ✅ Step 6: Save to CSV
Save the results to CSV. Spark writes a directory of part files at the given path rather than a single file:

In [31]:
revenue_per_product.write.csv(f"{output_tables}/revenue_per_product.csv", header=True, mode="overwrite")

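Since `write.csv` produces a directory (here named `revenue_per_product.csv`) containing `part-*.csv` files, each with its own header row when `header=True`, reading the result back with the standard library takes a small loop. The following is a minimal sketch: the helper `read_spark_csv_dir` is hypothetical, and the demo directory with two hand-written part files only stands in for real Spark output so the example runs without a Spark session.

```python
import csv
import glob
import os
import tempfile


def read_spark_csv_dir(path):
    """Collect rows from every part file in a Spark CSV output directory."""
    rows = []
    for part in sorted(glob.glob(os.path.join(path, "part-*.csv"))):
        with open(part, newline="") as handle:
            # Each part file carries its own header line.
            rows.extend(csv.DictReader(handle))
    return rows


# Stand-in for Spark's output: two made-up part files in a demo directory.
demo_dir = os.path.join(tempfile.mkdtemp(), "revenue_per_product.csv")
os.makedirs(demo_dir)
for name, lines in [
    ("part-00000-demo.csv", ["order_item_product_id,revenue", "897,125.5"]),
    ("part-00001-demo.csv", ["order_item_product_id,revenue", "858,50.0"]),
]:
    with open(os.path.join(demo_dir, name), "w", newline="") as handle:
        handle.write("\n".join(lines) + "\n")

rows = read_spark_csv_dir(demo_dir)
print(rows)
```

All values come back as strings, so the `revenue` field would need a `float(...)` conversion before any further arithmetic.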