# 📈 Task 1: Most popular product categories

Goal:
- Load the product, order item, and category data from Parquet files.
- Join the data to determine which categories have the most items sold.
- Show the 10 most popular categories by number of units sold.
- Save the results to a CSV file.
- Inspect the saved data.

## ✅ Step 1: Starting a Spark session

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count

spark = SparkSession.builder.appName("PopularCategories").getOrCreate()
spark

25/04/25 20:00:26 WARN Utils: Your hostname, Katana-Pro-2.local resolves to a loopback address: 127.0.0.1; using 192.168.0.108 instead (on interface en0)
25/04/25 20:00:26 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/04/25 20:00:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## ✅ Step 2: Loading data from Parquet

In [14]:
project_dir = "<PROJECT_DIR>"
data_dir = f"{project_dir}/data"
outputs_dir = f"{data_dir}/outputs"
input_tables = f"{data_dir}/sklep"
output_tables = f"{outputs_dir}/sklep"
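Paths assembled from string fragments are easy to get slightly wrong, for example with a doubled or missing slash. The sketch below shows one way to keep them consistent. The helper `table_path` and the `/tmp/project/data` location are hypothetical and used only for illustration; they are not part of the notebook.

```python
import posixpath


def table_path(base_dir: str, table: str) -> str:
    """Join a base directory and a table name with exactly one slash between them."""
    return posixpath.join(base_dir.rstrip("/"), table.strip("/"))


# Placeholder location, not the notebook's real project directory.
data_dir = "/tmp/project/data"
input_tables = table_path(data_dir, "sklep")

# Both calls yield a clean path, with or without stray slashes in the inputs.
order_items_path = table_path(input_tables, "order_items")
products_path = table_path(input_tables + "/", "/products/")
print(order_items_path)
print(products_path)
```

Spark generally tolerates a doubled slash on a local filesystem, so this is about tidiness and about avoiding surprises on stricter storage backends.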

In [3]:
order_items = spark.read.parquet(f"{input_tables}/order_items/")
products = spark.read.parquet(f"{input_tables}/products/")
categories = spark.read.parquet(f"{input_tables}/categories/")

## ✅ Step 3: Joining the tables

Join the tables into one:

joined = order_items -> products -> categories


In [4]:
joined_df = order_items \
    .join(products, order_items.order_item_product_id == products.product_id) \
    .join(categories, products.product_category_id == categories.category_id)

## ✅ Step 4: Grouping by category and counting

In [5]:
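# Note: count() tallies non-null order-item rows per category;
# it does not add up the quantity stored in each row.
popular_categories = joined_df \
    .groupBy("category_name") \
    .agg(count("order_item_quantity").alias("count")) \
    .orderBy(col("count").desc())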

In [6]:
popular_categories.count()

32

In [7]:
top_popular_categories = popular_categories.limit(10)

## ✅ Step 5: Displaying the results

In [8]:
top_popular_categories.show()

+--------------------+------+
|       category_name| count|
+--------------------+------+
|              Cleats|196408|
|      Men's Footwear|177968|
|     Women's Apparel|168280|
|Indoor/Outdoor Games|154384|
|             Fishing|138600|
|        Water Sports|124320|
|    Camping & Hiking|109832|
|    Cardio Equipment| 99896|
|       Shop By Sport| 87872|
|         Electronics| 25248|
+--------------------+------+


## ✅ Step 6 (optional): Saving to a CSV file

In [9]:
popular_categories_location = f"{output_tables}/popular_categories"
top_popular_categories.write.mode("overwrite").csv(popular_categories_location, header=True)

### Checking the saved data

In [10]:
df = spark.read.csv(popular_categories_location, header=True, inferSchema=True)

In [11]:
df.show()

+--------------------+------+
|       category_name| count|
+--------------------+------+
|              Cleats|196408|
|      Men's Footwear|177968|
|     Women's Apparel|168280|
|Indoor/Outdoor Games|154384|
|             Fishing|138600|
|        Water Sports|124320|
|    Camping & Hiking|109832|
|    Cardio Equipment| 99896|
|       Shop By Sport| 87872|
|         Electronics| 25248|
+--------------------+------+


In [12]:
df.count()

10

In [13]:
df.printSchema()

root
 |-- category_name: string (nullable = true)
 |-- count: integer (nullable = true)
