In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SampleForDataScience") \
    .master("local[2]") \
    .config("spark.sql.shuffle.partitions", "2") \
    .getOrCreate()


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


### Load Enriched Transactions for Data Science

This cell loads the enriched transaction dataset created during the data engineering stage.

The first five rows and the total row count are displayed to confirm the load succeeded.
This dataset is the input for the data science exploration and modeling steps that follow.


In [2]:
transactions_enriched = spark.read.parquet(
    "../data_enriched/transactions_enriched"
)

transactions_enriched.show(5)
transactions_enriched.count()


+----------+----------+--------+-----------------+---------+-------------+--------+--------------------+------------------+---------+
|department|product_id|order_id|add_to_cart_order|reordered|department_id|aisle_id|        product_name|             aisle|avg_price|
+----------+----------+--------+-----------------+---------+-------------+--------+--------------------+------------------+---------+
|    pantry|     30064|  188723|                2|        0|           13|      17|Double Acting Bak...|baking ingredients|      2.0|
|   produce|     38557|  124150|                9|        0|            4|      24|Citrus Mandarins ...|      fresh fruits|      1.5|
|   produce|     13176|  539689|                2|        1|            4|      24|Bag of Organic Ba...|      fresh fruits|      1.5|
|    babies|      2611|  544204|                1|        1|           18|      92|Gluten Free Spong...| baby food formula|      2.0|
|    pantry|     18441|    4707|                1|        0|  

3245091

### Select Features for Data Science Analysis

This cell creates a simplified dataset containing only the columns needed for data science tasks.

The selected columns are:
- Transaction identifier (`order_id`)
- Product information (`product_id`, `product_name`)
- Department category (`department`)
- Average price (`avg_price`)

This streamlined dataset serves as the base for analytical modeling and pattern discovery.


In [3]:
ml_base = transactions_enriched.select(
    "order_id",
    "product_id",
    "product_name",
    "department",
    "avg_price"
)

ml_base.show(5)


+--------+----------+--------------------+----------+---------+
|order_id|product_id|        product_name|department|avg_price|
+--------+----------+--------------------+----------+---------+
|  188723|     30064|Double Acting Bak...|    pantry|      2.0|
|  124150|     38557|Citrus Mandarins ...|   produce|      1.5|
|  539689|     13176|Bag of Organic Ba...|   produce|      1.5|
|  544204|      2611|Gluten Free Spong...|    babies|      2.0|
|    4707|     18441|     Organic Ketchup|    pantry|      2.0|
+--------+----------+--------------------+----------+---------+
only showing top 5 rows


### Create Sample Dataset for Modeling

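This cell draws a random sample of roughly half the rows in the base dataset.

Sampling reduces data size and speeds up experimentation.
`fraction` is the probability of keeping each row, so the sample size is approximate rather than exactly 50%.
A fixed seed makes the sample reproducible as long as the input data and its partitioning stay the same.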
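
To make the effect of `fraction` and `seed` concrete, here is a minimal plain-Python sketch of per-row (Bernoulli) sampling. It is an illustration only, not Spark's implementation: `bernoulli_sample` and the stand-in `rows` list are hypothetical names, and Spark's own result also depends on how rows are distributed across partitions.

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(1000))  # stand-in for 1,000 transaction rows

first = bernoulli_sample(rows, fraction=0.5, seed=42)
second = bernoulli_sample(rows, fraction=0.5, seed=42)

print(len(first))        # close to 500, but not guaranteed to be exactly 500
print(first == second)   # same seed and same input order give the same sample
```

The point of the sketch is that the seed fixes which rows are kept, while the fraction only fixes the expected size.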


In [4]:
ml_sample = ml_base.sample(
    fraction=0.5,
    seed=42
)
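
The cell above samples individual rows, and each row is one product within an order, so a sampled order can lose some of its products. If complete orders matter for the later steps, one alternative is to sample by `order_id` instead. The sketch below shows the idea in plain Python with a hash-based rule; `keep_order`, the `salt` value, and the toy rows are assumptions made for illustration, not part of the notebook. In Spark, a similar effect could presumably be had by sampling the distinct `order_id` values and joining back to `ml_base`.

```python
import hashlib
from collections import Counter

def keep_order(order_id, fraction, salt="42"):
    """Deterministically decide whether a whole order belongs to the sample."""
    digest = hashlib.md5(f"{salt}:{order_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000        # map the hash to 0..9999
    return bucket < fraction * 10_000

# Toy data (made up): 200 orders with 3 line items each.
rows = [(order_id, f"item_{i}") for order_id in range(1, 201) for i in range(3)]

# Every row of an order shares the same order_id, so the order is kept or dropped as a unit.
sample = [row for row in rows if keep_order(row[0], fraction=0.5)]

items_per_order = Counter(order_id for order_id, _ in sample)
print(len(items_per_order), "orders kept")
print(set(items_per_order.values()))  # every kept order still has all of its items
```

Because the decision depends only on `order_id` and `salt`, rerunning the code selects the same orders.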


### Create Shopping Baskets from Transactions

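This cell groups products by `order_id` to create shopping baskets.

- Each basket represents one customer order
- Product names within the same order are collected into a list (duplicates are kept)
- The result is one array of items per order, the shape that association rule mining expects

Because the sample was drawn per row, a basket may contain only some of the products from the original order.
These baskets are the core input for market basket analysis.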
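
`collect_list` keeps repeated values, while Spark's `collect_set` drops them. This matters if the baskets are later passed to an algorithm that requires unique items per basket, as Spark MLlib's FP-Growth does. Whether duplicates actually occur in this dataset is not shown here, so the following is only a plain-Python sketch of the difference; `build_baskets` and the toy rows are hypothetical.

```python
from collections import defaultdict

def build_baskets(rows, unique=True):
    """Group (order_id, item) pairs into one item list per order.

    With unique=True, repeated items within an order are dropped
    while first-seen order is preserved.
    """
    baskets = defaultdict(list)
    for order_id, item in rows:
        if unique and item in baskets[order_id]:
            continue
        baskets[order_id].append(item)
    return dict(baskets)

# Toy rows (made up): order 1 contains "milk" twice.
rows = [(1, "milk"), (1, "bread"), (1, "milk"), (2, "eggs")]

print(build_baskets(rows, unique=False))  # {1: ['milk', 'bread', 'milk'], 2: ['eggs']}
print(build_baskets(rows))                # {1: ['milk', 'bread'], 2: ['eggs']}
```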


In [5]:
from pyspark.sql.functions import collect_list

baskets = ml_sample.groupBy("order_id").agg(
    collect_list("product_name").alias("items")
)

baskets.show(5, truncate=False)


+--------+---------------------------------------------------------------------------------------------------------------------------------------------------------+
|order_id|items                                                                                                                                                    |
+--------+---------------------------------------------------------------------------------------------------------------------------------------------------------+
|1000003 |[Organic Medium Salsa, Organic Green Onions]                                                                                                             |
|1000006 |[Pluot, Large Yellow Flesh Nectarine]                                                                                                                    |
|1000012 |[Organic Sweet Potato Puree, Organic Avocado]                                                                                                            |
|1000019 |

### Save Sample Dataset for Data Science

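This cell writes the sampled dataset to disk in CSV format with a header row.

Spark writes the output as a directory of part files rather than a single CSV file.
The output is intended for data scientists and analysts to use for modeling and exploration.
Overwrite mode replaces any existing output, so the sample can be regenerated during development.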
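
Since Spark's CSV writer produces a directory of `part-*.csv` files, each with its own header when `header=True`, a consumer working outside Spark has to read them together. The following is a minimal sketch using only the standard library; `read_spark_csv_dir` is a hypothetical helper, and it assumes the default `part-*.csv` naming and UTF-8 encoding.

```python
import csv
import glob
import os

def read_spark_csv_dir(path):
    """Read all part-*.csv files in a Spark CSV output directory into a list of dicts."""
    rows = []
    # Marker files such as _SUCCESS do not match the pattern and are skipped.
    for part in sorted(glob.glob(os.path.join(path, "part-*.csv"))):
        with open(part, newline="", encoding="utf-8") as handle:
            # Each part file carries its own header row, so DictReader is applied per file.
            rows.extend(csv.DictReader(handle))
    return rows

rows = read_spark_csv_dir("../data_enriched/sample_for_data_science")
print(len(rows))  # 0 if the directory does not exist or holds no part files
```

All values come back as strings, so numeric columns such as `avg_price` would need converting before use.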


In [6]:
ml_sample.write \
    .mode("overwrite") \
    .csv("../data_enriched/sample_for_data_science", header=True)

