In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("InstacartIngestion") \
    .getOrCreate()


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


### Load Raw CSV Files (Without Schema Inference)

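This cell loads the raw Instacart CSV files into Spark DataFrames, taking column names from each file's header row.

Schema inference is intentionally left off (the default for `spark.read.csv`) to keep ingestion lightweight and fast, since inferring types requires an extra pass over the data.
As a result, every column is read as a string. Data types will be reviewed and handled explicitly in later cleaning steps.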

At this stage, the goal is simply to make the raw data available in Spark.
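
As a rough illustration of what the later explicit type handling might look like, the sketch below builds `selectExpr` cast expressions from a column-to-type mapping. The helper name `cast_exprs` and the mapping `ORDER_PRODUCTS_TYPES` are hypothetical, and the target types are assumptions based only on the previewed values, not something the dataset documentation confirms here.

```python
def cast_exprs(columns, type_map):
    """Build selectExpr strings that cast the columns named in type_map.

    Columns missing from type_map are passed through unchanged (still strings).
    """
    exprs = []
    for name in columns:
        if name in type_map:
            exprs.append(f"CAST(`{name}` AS {type_map[name]}) AS `{name}`")
        else:
            exprs.append(f"`{name}`")
    return exprs


# Assumed target types (hypothetical; guessed from the previewed sample rows).
ORDER_PRODUCTS_TYPES = {
    "order_id": "INT",
    "product_id": "INT",
    "add_to_cart_order": "INT",
    "reordered": "INT",
}

# Usage once order_products has been loaded:
# typed = order_products.selectExpr(
#     *cast_exprs(order_products.columns, ORDER_PRODUCTS_TYPES)
# )
# typed.printSchema()
```

Keeping the mapping separate from the read call leaves ingestion unchanged and makes each type decision visible in one place.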


In [2]:
order_products = spark.read.csv(
    "../data_raw/order_products__prior.csv",
    header=True
)

products = spark.read.csv(
    "../data_raw/products.csv",
    header=True
)

aisles = spark.read.csv(
    "../data_raw/aisles.csv",
    header=True
)

departments = spark.read.csv(
    "../data_raw/departments.csv",
    header=True
)


### Verify Row Counts for Ingested Datasets

This cell prints the number of rows in each raw dataset.

Because Spark evaluates reads lazily, `count()` is the first action that scans each file in full, so it confirms that every file can be read from start to finish.
Checking the counts helps detect missing data, partial reads, or other ingestion issues early in the pipeline.


In [3]:
print("order_products rows:", order_products.count())
print("products rows:", products.count())
print("aisles rows:", aisles.count())
print("departments rows:", departments.count())


order_products rows: 32434489
products rows: 49688
aisles rows: 134
departments rows: 21
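
One way to turn this visual check into an automated one is to compare the counts against a stored baseline. The sketch below is a minimal example: the helper `row_count_mismatches` is hypothetical, and the baseline values are simply the counts printed by this notebook's run, not independently verified figures.

```python
# Baseline taken from the output printed above (an assumption, not an official figure).
EXPECTED_ROWS = {
    "order_products": 32434489,
    "products": 49688,
    "aisles": 134,
    "departments": 21,
}


def row_count_mismatches(actual, expected):
    """Return {name: (actual_count, expected_count)} for every dataset that differs.

    A dataset absent from `actual` is reported with an actual count of None.
    """
    return {
        name: (actual.get(name), want)
        for name, want in expected.items()
        if actual.get(name) != want
    }


# Usage with the DataFrames loaded earlier:
# actual = {
#     "order_products": order_products.count(),
#     "products": products.count(),
#     "aisles": aisles.count(),
#     "departments": departments.count(),
# }
# mismatches = row_count_mismatches(actual, EXPECTED_ROWS)
# assert not mismatches, f"Row count mismatch: {mismatches}"
```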



### Preview Sample Records

This cell displays a small sample of records from the loaded DataFrames.

Previewing the data helps visually confirm that:
- Columns are read correctly
- Values look reasonable
- No obvious formatting issues exist

This step supports quick sanity checks before moving to data cleaning.
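
The "columns are read correctly" check can also be done programmatically. The sketch below compares a DataFrame's column names with an expected list; the helper `column_diff` is hypothetical, and the expected names are the headers that appear in this notebook's preview output.

```python
# Expected headers, as they appear in this notebook's preview output.
EXPECTED_COLUMNS = {
    "order_products": ["order_id", "product_id", "add_to_cart_order", "reordered"],
    "products": ["product_id", "product_name", "aisle_id", "department_id"],
}


def column_diff(actual, expected):
    """Return (missing, unexpected) column names, preserving their order."""
    missing = [c for c in expected if c not in actual]
    unexpected = [c for c in actual if c not in expected]
    return missing, unexpected


# Usage with the DataFrames loaded earlier:
# missing, unexpected = column_diff(products.columns, EXPECTED_COLUMNS["products"])
# assert not missing and not unexpected, (missing, unexpected)
#
# To see full values instead of the "..." truncation in the preview:
# products.show(5, truncate=False)
```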


In [4]:
order_products.show(5)
products.show(5)



+--------+----------+-----------------+---------+
|order_id|product_id|add_to_cart_order|reordered|
+--------+----------+-----------------+---------+
|       2|     33120|                1|        1|
|       2|     28985|                2|        1|
|       2|      9327|                3|        0|
|       2|     45918|                4|        1|
|       2|     30035|                5|        0|
+--------+----------+-----------------+---------+
only showing top 5 rows
+----------+--------------------+--------+-------------+
|product_id|        product_name|aisle_id|department_id|
+----------+--------------------+--------+-------------+
|         1|Chocolate Sandwic...|      61|           19|
|         2|    All-Seasons Salt|     104|           13|
|         3|Robust Golden Uns...|      94|            7|
|         4|Smart Ones Classi...|      38|            1|
|         5|Green Chile Anyti...|       5|           13|
+----------+--------------------+--------+-------------+
only showing 