In [3]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("InstacartRetailProject") \
    .getOrCreate()


### Load Raw Instacart Data Without Schema Inference

This cell loads the raw Instacart CSV files into Spark DataFrames with schema inference disabled (`inferSchema=False`, which is also Spark's default for CSV).

Every column is therefore read as a string.
This avoids incorrect type guesses during ingestion, skips the extra pass over the data that inference requires, and leaves type conversion to explicit, later cleaning steps.

At this stage, the focus is on safely ingesting raw data, not transforming it.


In [9]:
order_products = spark.read.csv(
    "../data_raw/order_products__prior.csv",
    header=True,
    inferSchema=False
)

products = spark.read.csv(
    "../data_raw/products.csv",
    header=True,
    inferSchema=False
)

aisles = spark.read.csv(
    "../data_raw/aisles.csv",
    header=True,
    inferSchema=False
)

departments = spark.read.csv(
    "../data_raw/departments.csv",
    header=True,
    inferSchema=False
)


### Enrich Product Data with Aisle and Department Information

This cell combines product data with aisle and department metadata.

- Products are joined with aisles using `aisle_id`
- The result is further joined with departments using `department_id`
- Left joins are used to ensure no products are lost during enrichment

This creates a unified product dataset with full categorical context, which is required for downstream analysis.


In [10]:
products_full = products \
    .join(aisles, "aisle_id", "left") \
    .join(departments, "department_id", "left")


### Create Transaction-Level Dataset

This cell joins order-level purchase data with enriched product information.

- Each row represents a product purchased within an order
- Product details (aisle and department) are added to each transaction
- A left join ensures all purchase records are preserved

The resulting dataset forms the core transaction table used for enrichment, analysis, and modeling.


In [11]:
transactions = order_products.join(
    products_full,
    "product_id",
    "left"
)

### Validate Transaction Dataset

This cell previews the merged transaction data and checks the total number of records.

- `show(5)` displays sample rows to verify joins and column values
- `count()` confirms the dataset size after merging

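This is a quick sanity check before further processing: because the join is a left join from `order_products`, the count should equal the number of rows in `order_products`, as long as `product_id` is unique in the product table.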


In [12]:
transactions.show(5)
transactions.count()

+----------+--------+-----------------+---------+-------------+--------+------------------+------------------+----------+
|product_id|order_id|add_to_cart_order|reordered|department_id|aisle_id|      product_name|             aisle|department|
+----------+--------+-----------------+---------+-------------+--------+------------------+------------------+----------+
|     33120|       2|                1|        1|           16|      86|Organic Egg Whites|              eggs|dairy eggs|
|     30035|       2|                5|        0|           13|      17| Natural Sweetener|baking ingredients|    pantry|
|     17794|       2|                6|        1|            4|      83|           Carrots|  fresh vegetables|   produce|
|     45918|       2|                4|        1|           13|      19|    Coconut Butter|     oils vinegars|    pantry|
|      9327|       2|                3|        0|           13|     104|     Garlic Powder| spices seasonings|    pantry|
+----------+--------+-----------------+---------+-------------+--------+------------------+------------------+----------+

32434489

### Persist Cleaned Transactions Data

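This cell writes the merged transaction dataset to disk in Parquet format. Because the CSV files were loaded without schema inference and no casts have been applied, all columns are written as strings.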

Parquet is a columnar storage format that is efficient for large-scale analytics and Spark processing.
The overwrite mode ensures the output can be regenerated cleanly during development.

This dataset becomes the official cleaned input for downstream enrichment and analysis.


In [13]:
transactions.write \
    .mode("overwrite") \
    .parquet("../data_clean/transactions")

