<B>INSTALLING SPARK</B>

In [None]:
%pip install pyspark


<B>This cell initializes the SparkSession, which is the entry point for using Apache Spark. The SparkSession manages the connection between Python and Spark’s execution engine. All Spark DataFrame operations in this project depend on this session. If Spark fails to start here, no further Spark-based processing can run.</B>

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("InstacartRetailProject") \
    .getOrCreate()

spark

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


###  Load Raw Instacart Data into Spark

This cell loads the raw Instacart datasets from CSV files into Spark DataFrames.

Each dataset represents a different part of the retail domain:
- Order and product purchase data  
- Product metadata  
- Aisle information  
- Department information  

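Each file is read with `header=True`, so the first row supplies the column names, and `inferSchema=True`, so Spark infers the column types at the cost of an extra pass over the file. This is convenient for exploration.
No transformations are performed here; this step only ingests the data.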


In [2]:
order_products = spark.read.csv(
    "../data_raw/order_products__prior.csv",
    header=True,
    inferSchema=True
)

products = spark.read.csv(
    "../data_raw/products.csv",
    header=True,
    inferSchema=True
)

aisles = spark.read.csv(
    "../data_raw/aisles.csv",
    header=True,
    inferSchema=True
)

departments = spark.read.csv(
    "../data_raw/departments.csv",
    header=True,
    inferSchema=True
)
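
Schema inference is convenient for exploration, but it makes Spark scan each file once more to guess the types. As a minimal sketch of an alternative, the reader can be given an explicit schema as a DDL string. The helper name `read_order_products` and the constant `ORDER_PRODUCTS_DDL` are hypothetical, and the column types are assumed from the `printSchema()` output for `order_products` shown further down in this notebook.

```python
# Assumed schema for order_products, written as a DDL string.
# The types mirror the printSchema() output in this notebook.
ORDER_PRODUCTS_DDL = (
    "order_id INT, "
    "product_id INT, "
    "add_to_cart_order INT, "
    "reordered INT"
)


def read_order_products(spark, path="../data_raw/order_products__prior.csv"):
    """Hypothetical helper: read the CSV with a fixed schema instead of inferring it."""
    # Passing schema= replaces inferSchema=True, so Spark does not need
    # the extra pass over the file to work out the column types.
    return spark.read.csv(path, header=True, schema=ORDER_PRODUCTS_DDL)
```

In the notebook this would be used as `order_products = read_order_products(spark)`. Whether the saved pass matters depends on file size; for the small lookup files (`aisles.csv`, `departments.csv`) inference is likely cheap enough to keep.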



### Validate Order-Product Data

This cell checks the structure and size of the `order_products` dataset.

- `printSchema()` is used to verify column names and data types.
- `count()` is used to confirm the number of records loaded.

This validation step ensures the data was ingested correctly before moving to cleaning and merging.


In [3]:
order_products.printSchema()
order_products.count()

root
 |-- order_id: integer (nullable = true)
 |-- product_id: integer (nullable = true)
 |-- add_to_cart_order: integer (nullable = true)
 |-- reordered: integer (nullable = true)



32434489
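
The visual check above can also be expressed as an assertion, so a later run fails loudly if a column is missing or has a different type. The sketch below is one possible way to do this, not part of the original notebook: `schema_problems` and `EXPECTED_ORDER_PRODUCTS` are hypothetical names, and the expected types are taken from the schema printed above. It compares plain `(name, type)` pairs, which is the format returned by `DataFrame.dtypes`.

```python
# Expected (column, type) pairs, in the format returned by DataFrame.dtypes.
# Assumption: these are copied from the printSchema() output above.
EXPECTED_ORDER_PRODUCTS = [
    ("order_id", "int"),
    ("product_id", "int"),
    ("add_to_cart_order", "int"),
    ("reordered", "int"),
]


def schema_problems(actual, expected):
    """Return a list of differences between two lists of (name, type) pairs."""
    actual_types = dict(actual)
    expected_types = dict(expected)
    problems = []
    # Report expected columns that are absent or have a different type.
    for name, dtype in expected:
        if name not in actual_types:
            problems.append(f"missing column: {name}")
        elif actual_types[name] != dtype:
            problems.append(f"{name}: expected {dtype}, got {actual_types[name]}")
    # Report columns that were loaded but not expected.
    for name, _ in actual:
        if name not in expected_types:
            problems.append(f"unexpected column: {name}")
    return problems


# In the notebook (requires the order_products DataFrame loaded earlier):
# problems = schema_problems(order_products.dtypes, EXPECTED_ORDER_PRODUCTS)
# assert not problems, problems
```

Keeping the comparison in plain Python means the helper can be exercised without a running Spark session; only the commented call at the end touches the DataFrame.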