# Load E-commerce Data

This notebook loads each source dataset of our e-commerce analytics project, runs data quality checks on it, and registers a raw temporary view for it.

## Overview

We will:
1. Load data from three source files (Customer Excel, Orders JSON, Products CSV)
2. Run data quality checks on each dataset (available columns, record count, schema, sample rows)
3. Create temporary SQL views for the raw data
4. Verify view creation with sample queries
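
Step 2 can be pictured with a minimal sketch. It assumes the report is a plain dictionary with a `status` key, since the notebook reads `report['status']` after each check. The other keys, the `PASSED`/`FAILED` values, and the name `basic_quality_report` are illustrative assumptions, not the project's actual `perform_data_quality_checks`. The sketch relies only on `df.columns` and `df.count()`, which PySpark DataFrames provide.

```python
def basic_quality_report(df, name):
    """Summarise a DataFrame-like object (hypothetical helper).

    Only `df.columns` and `df.count()` are used, so a PySpark DataFrame
    should work. Every key except 'status' is an assumption.
    """
    columns = list(df.columns)
    record_count = df.count()

    # Duplicate column names often point to a badly parsed header row.
    duplicate_columns = sorted({c for c in columns if columns.count(c) > 1})

    # Treat an empty, column-less, or duplicate-column dataset as a failed load.
    passed = bool(columns) and record_count > 0 and not duplicate_columns

    return {
        "dataset": name,
        "columns": columns,
        "total_columns": len(columns),
        "record_count": record_count,
        "duplicate_columns": duplicate_columns,
        "status": "PASSED" if passed else "FAILED",
    }
```

Keeping the report as a dictionary makes it easy to collect the three results and summarise them at the end of a run.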

## Data Sources

- **Customers**: Excel file (`Customer.xlsx`) containing customer information
- **Orders**: JSON file (`Orders.json`) with order transaction details
- **Products**: CSV file (`Products.csv`) with product catalog information
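
The three sources can be described in one place so that paths are built and checked consistently. This is only a sketch: it assumes the files live in a `data` directory under the project root, and `SOURCE_FILES`, `source_paths`, and `missing_sources` are hypothetical names rather than part of the project's `src` package.

```python
import os

# File names come from the list above; the "data" subdirectory is an assumption.
SOURCE_FILES = {
    "customers": "Customer.xlsx",
    "orders": "Orders.json",
    "products": "Products.csv",
}


def source_paths(project_root, data_dir="data"):
    """Return a {dataset: absolute path} mapping for the three source files."""
    root = os.path.abspath(project_root)
    return {
        name: os.path.join(root, data_dir, filename)
        for name, filename in SOURCE_FILES.items()
    }


def missing_sources(project_root, data_dir="data"):
    """Return the dataset names whose source file does not exist on disk."""
    paths = source_paths(project_root, data_dir)
    return sorted(name for name, path in paths.items() if not os.path.isfile(path))
```

Calling `missing_sources(project_root)` before any loader runs would surface all missing files at once instead of failing on the first one.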

## Views Created

- `raw_customers_vw` - Customer data temporary view
- `raw_orders_vw` - Orders data temporary view
- `raw_products_vw` - Products data temporary view

## 1. Setup: Import Libraries and Initialize Spark Session

In [10]:
import os
import sys

# Resolve the project root (two levels above the notebook's working directory)
project_root = os.path.abspath(os.path.join(os.getcwd(), '../..'))

# Add the project root to the path so the `src` package can be imported
sys.path.insert(0, project_root)

# Import required functions
from src.spark_session import get_spark_session
from src.load_source_data import (
    load_customer_data,
    load_orders_data,
    load_products_data,
    perform_data_quality_checks,
)

# Initialize Spark session
spark = get_spark_session("CreateRawTables")
print(f"Spark version: {spark.version}")
print("Spark session initialized successfully")
print(f"Project root: {project_root}")

Spark version: 3.5.1
Spark session initialized successfully
Project root: /Users/kushalsenlaskar/Documents/E-commerce Sales Data
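
The `'../..'` lookup above depends on the directory the kernel was started from. A minimal sketch of a less fragile alternative, assuming the project root is the nearest ancestor directory that contains both `src` and `data` (the helper name `find_project_root` and the choice of markers are assumptions):

```python
from pathlib import Path


def find_project_root(start=None, markers=("src", "data")):
    """Walk upwards from `start` until a directory containing all markers is found."""
    current = Path(start or Path.cwd()).resolve()
    for candidate in (current, *current.parents):
        if all((candidate / marker).is_dir() for marker in markers):
            return candidate
    raise FileNotFoundError(
        f"No ancestor of {current} contains all of: {', '.join(markers)}"
    )
```

With such a helper, the setup cell could use `project_root = str(find_project_root())` regardless of how deep the notebook sits in the repository.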


In [11]:
print("="*80)
print("LOADING CUSTOMER DATA")
print("="*80)

# Define customer file path
customer_path = os.path.join(project_root, "data", "Customer.xlsx")
print(f"\nCustomer file path: {customer_path}")

# Load customer data
customers_df = load_customer_data(spark, customer_path)

LOADING CUSTOMER DATA

Customer file path: /Users/kushalsenlaskar/Documents/E-commerce Sales Data/data/Customer.xlsx

Checking file at: /Users/kushalsenlaskar/Documents/E-commerce Sales Data/data/Customer.xlsx
File found. Loading Excel data using Spark...
Customer data loaded successfully


In [12]:
# Perform data quality checks on customer data
customer_quality_report = perform_data_quality_checks(customers_df, "Customers")

# Data quality report for Customers
print(f"Quality Check Status: {customer_quality_report['status']}")


DATA QUALITY REPORT: Customers
DataFrame Created: Yes
Columns Available: ['Customer ID', 'Customer Name', 'email', 'phone', 'address', 'Segment', 'Country', 'City', 'State', 'Postal Code', 'Region']
Total Columns: 11
Number of Records: 793

Table Schema:
root
 |-- Customer ID: string (nullable = true)
 |-- Customer Name: string (nullable = true)
 |-- email: string (nullable = true)
 |-- phone: string (nullable = true)
 |-- address: string (nullable = true)
 |-- Segment: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Postal Code: string (nullable = true)
 |-- Region: string (nullable = true)


Sample Records (First 2 rows):

In [13]:
print("="*80)
print("LOADING ORDERS DATA")
print("="*80)

# Define orders file path
orders_path = os.path.join(project_root, "data", "Orders.json")
print(f"\nOrders file path: {orders_path}")

# Load orders data
orders_df = load_orders_data(spark, orders_path)

LOADING ORDERS DATA

Orders file path: /Users/kushalsenlaskar/Documents/E-commerce Sales Data/data/Orders.json

Checking file at: /Users/kushalsenlaskar/Documents/E-commerce Sales Data/data/Orders.json
File found. Loading JSON data using Spark...
Orders data loaded successfully


In [14]:
# Perform data quality checks on orders data
orders_quality_report = perform_data_quality_checks(orders_df, "Orders")

# Data quality report for Orders
print(f"Quality Check Status: {orders_quality_report['status']}")


DATA QUALITY REPORT: Orders
DataFrame Created: Yes
Columns Available: ['Row ID', 'Order ID', 'Order Date', 'Ship Date', 'Ship Mode', 'Customer ID', 'Product ID', 'Quantity', 'Price', 'Discount', 'Profit']
Total Columns: 11
Number of Records: 9994

Table Schema:
root
 |-- Row ID: integer (nullable = true)
 |-- Order ID: string (nullable = true)
 |-- Order Date: string (nullable = true)
 |-- Ship Date: string (nullable = true)
 |-- Ship Mode: string (nullable = true)
 |-- Customer ID: string (nullable = true)
 |-- Product ID: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- Price: double (nullable = true)
 |-- Discount: double (nullable = true)
 |-- Profit: double (nullable = true)


Sample Records (First 2 rows):
+------+--------------+----------+---------+--------------+-----------+---------------+--------+------+--------+------+
|Row ID|Order ID      |Order Date|Ship Date|Ship Mode     |Customer ID|Product ID     |Quantity|Price |Discount|Profit|
+------+-------

In [15]:
print("="*80)
print("LOADING PRODUCTS DATA")
print("="*80)

# Define products file path
products_path = os.path.join(project_root, "data", "Products.csv")
print(f"\nProducts file path: {products_path}")

# Load products data
products_df = load_products_data(spark, products_path)

LOADING PRODUCTS DATA

Products file path: /Users/kushalsenlaskar/Documents/E-commerce Sales Data/data/Products.csv

Checking file at: /Users/kushalsenlaskar/Documents/E-commerce Sales Data/data/Products.csv
File found. Loading CSV data using Spark...
Products data loaded successfully


In [16]:
# Perform data quality checks on products data
products_quality_report = perform_data_quality_checks(products_df, "Products")

# Data quality report for Products
print(f"Quality Check Status: {products_quality_report['status']}")


DATA QUALITY REPORT: Products
DataFrame Created: Yes
Columns Available: ['Product ID', 'Category', 'Sub-Category', 'Product Name', 'State', 'Price per product']
Total Columns: 6
Number of Records: 1851

Table Schema:
root
 |-- Product ID: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Sub-Category: string (nullable = true)
 |-- Product Name: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Price per product: double (nullable = true)


Sample Records (First 2 rows):
+---------------+----------+------------+------------------------------------------------------------+--------+-----------------+
|Product ID     |Category  |Sub-Category|Product Name                                                |State   |Price per product|
+---------------+----------+------------+------------------------------------------------------------+--------+-----------------+
|FUR-CH-10002961|Furniture |Chairs      |Leather Task Chair, Black                                

## 2. Create Temporary Views

Create temporary SQL views so the raw data can be queried with Spark SQL. Temporary views are scoped to the current Spark session: they are dropped when the session ends and are not visible to other sessions.

In [18]:
print("="*80)
print("CREATING TEMPORARY VIEWS")
print("="*80)

# Create temporary view for customers
customers_df.createOrReplaceTempView("raw_customers_vw")
print("\n1. Temporary view 'raw_customers_vw' created successfully")

# Create temporary view for orders
orders_df.createOrReplaceTempView("raw_orders_vw")
print("2. Temporary view 'raw_orders_vw' created successfully")

# Create temporary view for products
products_df.createOrReplaceTempView("raw_products_vw")
print("3. Temporary view 'raw_products_vw' created successfully")

print("\n" + "="*80)
print("ALL TEMPORARY VIEWS CREATED SUCCESSFULLY")
print("="*80)
print("\nAvailable temporary views:")
print("  - raw_customers_vw")
print("  - raw_orders_vw")
print("  - raw_products_vw")
print("\nThese views are available for the duration of this Spark session.")


CREATING TEMPORARY VIEWS

1. Temporary view 'raw_customers_vw' created successfully
2. Temporary view 'raw_orders_vw' created successfully
3. Temporary view 'raw_products_vw' created successfully

ALL TEMPORARY VIEWS CREATED SUCCESSFULLY

Available temporary views:
  - raw_customers_vw
  - raw_orders_vw
  - raw_products_vw

These views are available for the duration of this Spark session.
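
A quick way to confirm the registration is to compare the expected view names against the session catalog. The sketch below relies only on `spark.catalog.listTables()`, whose entries expose `name` and `isTemporary`; the helper name `missing_temp_views` and the `EXPECTED_VIEWS` constant are hypothetical.

```python
EXPECTED_VIEWS = ("raw_customers_vw", "raw_orders_vw", "raw_products_vw")


def missing_temp_views(spark, expected=EXPECTED_VIEWS):
    """Return the expected view names that are not registered as temporary views."""
    registered = {
        table.name for table in spark.catalog.listTables() if table.isTemporary
    }
    return [name for name in expected if name not in registered]
```

After the cell above has run, `missing_temp_views(spark)` would be expected to return an empty list.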


## 3. Verify Temporary Views

Verify the temporary views are working correctly by running sample SQL queries.
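
The raw column names contain spaces and hyphens (`Customer ID`, `Sub-Category`), so they must be wrapped in backticks in Spark SQL. A small sketch of a query builder that quotes every identifier; `quote_identifier` and `sample_query` are hypothetical helpers, not part of the project's `src` package:

```python
def quote_identifier(name):
    """Backtick-quote a Spark SQL identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def sample_query(view, columns, limit=3):
    """Build a SELECT over `view` with every column name quoted."""
    if not columns:
        raise ValueError("at least one column is required")
    column_list = ", ".join(quote_identifier(c) for c in columns)
    return f"SELECT {column_list} FROM {quote_identifier(view)} LIMIT {int(limit)}"
```

For example, `spark.sql(sample_query("raw_products_vw", ["Product ID", "Sub-Category"])).show(truncate=False)` avoids hand-writing the backticks for each column.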

In [19]:
print("\n" + "="*80)
print("VERIFYING TEMPORARY VIEWS WITH SAMPLE QUERIES")
print("="*80)

# Verify raw_customers_vw
print("\n1. Raw Customers View (raw_customers_vw) - First 3 rows:")
spark.sql("SELECT `Customer ID`, `Customer Name`, Country FROM raw_customers_vw LIMIT 3").show(truncate=False)

# Verify raw_orders_vw
print("\n2. Raw Orders View (raw_orders_vw) - First 3 rows:")
spark.sql("SELECT `Order ID`, `Customer ID`, `Product ID`, Quantity FROM raw_orders_vw LIMIT 3").show(truncate=False)

# Verify raw_products_vw
print("\n3. Raw Products View (raw_products_vw) - First 3 rows:")
spark.sql("SELECT `Product ID`, Category, `Sub-Category`, `Product Name` FROM raw_products_vw LIMIT 3").show(truncate=False)

print("\n" + "="*80)
print("ALL TEMPORARY VIEWS VERIFIED AND READY FOR USE")
print("="*80)



VERIFYING TEMPORARY VIEWS WITH SAMPLE QUERIES

1. Raw Customers View (raw_customers_vw) - First 3 rows:
+-----------+--------------+-------------+
|Customer ID|Customer Name |Country      |
+-----------+--------------+-------------+
|PW-19240   |Pierre Wener  |United States|
|GH-14410   |Gary567 Hansen|United States|
|KL-16555   |Kelly Lampkin |United States|
+-----------+--------------+-------------+


2. Raw Orders View (raw_orders_vw) - First 3 rows:
+--------------+-----------+---------------+--------+
|Order ID      |Customer ID|Product ID     |Quantity|
+--------------+-----------+---------------+--------+
|CA-2016-122581|JK-15370   |FUR-CH-10002961|7       |
|CA-2017-117485|BD-11320   |TEC-AC-10004659|4       |
|US-2016-157490|LB-16795   |OFF-BI-10002824|4       |
+--------------+-----------+---------------+--------+


3. Raw Products View (raw_products_vw) - First 3 rows:
+---------------+---------------+------------+------------------------------------------------------------

In [21]:
# Count orders with zero or negative profit
spark.sql("SELECT count(1) FROM raw_orders_vw WHERE Profit <= 0").show()


+--------+
|count(1)|
+--------+
|       0|
+--------+

