1. Schema Validation

    Column Names & Data Types: Verify column names and data types match expectations (e.g., dates as datetime, numerical fields as int/float).

    Schema Enforcement: Use tools like PySpark or Pandera to enforce predefined schemas.

    Unexpected Columns: Check for extra/missing columns.
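
The schema checks above can be sketched in plain Python over a list of row dictionaries, so the example runs without a Spark session. The `EXPECTED_SCHEMA` mapping, the column names, and the `check_schema` helper are all hypothetical and only illustrate the idea; in PySpark or Pandera the same comparison would be made against the library's own schema objects.

```python
from datetime import date

# Hypothetical expected schema: column name -> Python type.
EXPECTED_SCHEMA = {"customer_id": str, "signup_date": date, "age": int}


def check_schema(rows, expected):
    """Return a list of human-readable schema problems found in `rows`."""
    problems = []
    for i, row in enumerate(rows):
        missing = expected.keys() - row.keys()
        extra = row.keys() - expected.keys()
        if missing:
            problems.append(f"row {i}: missing columns {sorted(missing)}")
        if extra:
            problems.append(f"row {i}: unexpected columns {sorted(extra)}")
        for name, expected_type in expected.items():
            value = row.get(name)
            # None is left to the null checks; only non-null values are type-checked.
            if value is not None and not isinstance(value, expected_type):
                problems.append(
                    f"row {i}: {name} is {type(value).__name__}, "
                    f"expected {expected_type.__name__}"
                )
    return problems


rows = [
    {"customer_id": "c1", "signup_date": date(2024, 1, 5), "age": 31},
    # Second row has a string date and an extra column.
    {"customer_id": "c2", "signup_date": "2024-01-06", "age": 40, "nickname": "x"},
]
problems = check_schema(rows, EXPECTED_SCHEMA)
for p in problems:
    print(p)
```

Returning a list of problems rather than raising on the first one keeps every schema issue visible in a single run.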

2. Data Quality Checks

    Null/Missing Values:

        Identify columns with nulls.

        Validate against allowed thresholds (e.g., "customer_id" must have 0% nulls).

    Duplicates:

        Detect duplicate rows.

        Ensure primary key uniqueness (e.g., user IDs are unique).

    Data Formats:

        Validate formats (e.g., dates as YYYY-MM-DD, emails with regex).

        Check categorical values against allowed lists (e.g., "status" ∈ ["active", "inactive"]).

    Numeric Ranges:

        Ensure values fall within expected ranges (e.g., age ≥ 0, salary ≥ minimum wage).

    Outliers: Use statistical methods (IQR, Z-score) to flag anomalies.
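
Null and duplicate checks are implemented later in this notebook, so the sketch below covers only the format, allowed-value, range, and IQR outlier checks. It is plain Python with hypothetical column names (`signup_date`, `email`, `status`, `age`), and the email pattern is deliberately loose; treat it as an illustration rather than a complete validator.

```python
import re
import statistics
from datetime import datetime

# Loose pattern: something@something.something, with no whitespace.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
ALLOWED_STATUS = {"active", "inactive"}


def validate_row(row):
    """Return the names of the checks that this row fails."""
    failed = []
    try:
        datetime.strptime(row.get("signup_date"), "%Y-%m-%d")
    except (ValueError, TypeError):
        failed.append("signup_date_format")
    if not EMAIL_RE.match(row.get("email") or ""):
        failed.append("email_format")
    if row.get("status") not in ALLOWED_STATUS:
        failed.append("status_allowed")
    age = row.get("age")
    if age is None or age < 0:
        failed.append("age_range")
    return failed


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


good = {"signup_date": "2024-01-05", "email": "a@example.com",
        "status": "active", "age": 31}
bad = {"signup_date": "05/01/2024", "email": "not-an-email",
       "status": "pending", "age": -3}
print(validate_row(good))
print(validate_row(bad))
print(iqr_outliers([10, 12, 11, 13, 12, 11, 10, 95]))
```

The multiplier `k=1.5` is the conventional default for the IQR rule; a larger value flags fewer points.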

3. Integrity Checks

    Referential Integrity: Validate foreign keys against related tables (e.g., order.customer_id exists in customer.id).

    Business Rules:

        Enforce domain-specific logic (e.g., "discount" ≤ 30% for non-premium users).

        Validate cross-field dependencies (e.g., end_date ≥ start_date).
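
As a rough illustration of the integrity checks above, the following plain-Python sketch finds foreign keys with no matching parent row and applies the two example business rules. The helper names, the `is_premium` field, and the sample rows are assumptions made for the example.

```python
from datetime import date


def orphaned_keys(child_rows, fk, parent_rows, pk):
    """Return the foreign-key values in child_rows that have no parent row."""
    parent_keys = {row[pk] for row in parent_rows}
    return sorted({row[fk] for row in child_rows} - parent_keys)


def rule_violations(row, max_discount=0.30):
    """Return the business rules that this row violates."""
    violations = []
    if row["end_date"] < row["start_date"]:
        violations.append("end_date before start_date")
    if not row["is_premium"] and row["discount"] > max_discount:
        violations.append("discount above limit for non-premium user")
    return violations


customers = [{"id": "c1"}, {"id": "c2"}]
orders = [
    {"order_id": "o1", "customer_id": "c1"},
    {"order_id": "o2", "customer_id": "c9"},  # no such customer
]
orphans = orphaned_keys(orders, "customer_id", customers, "id")

promo = {
    "start_date": date(2024, 3, 1),
    "end_date": date(2024, 2, 1),
    "is_premium": False,
    "discount": 0.45,
}
violations = rule_violations(promo)
print(orphans)
print(violations)
```

On Spark DataFrames the orphan check would usually be a left anti join of the child table against the parent table on the key column.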

4. Consistency Checks

    Across Sources: Compare overlapping data from different sources (e.g., revenue figures in CSV vs. database).

    Historical Consistency: Ensure new data distributions match historical trends (e.g., flag sudden, unexplained spikes in user signups).

5. Statistical Validation

    Summary Statistics: Check mean, median, std dev, and quantiles for unexpected shifts.

    Data Distribution: Validate distributions of categorical/numerical fields (e.g., country ratios remain stable).
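
One simple way to check that category ratios remain stable is to compare each category's share in a baseline batch with its share in the new batch. The sketch below assumes both batches are non-empty, and the 10-percentage-point `tolerance` is an arbitrary illustrative threshold, not a recommended value.

```python
from collections import Counter


def proportions(values):
    """Return each distinct value's share of the (non-empty) input."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


def shifted_categories(baseline, current, tolerance=0.10):
    """Return categories whose share changed by more than `tolerance`."""
    base_p, cur_p = proportions(baseline), proportions(current)
    shifts = {}
    for category in base_p.keys() | cur_p.keys():
        diff = cur_p.get(category, 0.0) - base_p.get(category, 0.0)
        if abs(diff) > tolerance:
            shifts[category] = round(diff, 4)
    return shifts


baseline = ["BR"] * 80 + ["US"] * 20
current = ["BR"] * 50 + ["US"] * 50
shifts = shifted_categories(baseline, current)
print(shifts)
```

For numerical fields, the same pattern applies to summary statistics: compute the mean, median, and quantiles for both batches and flag differences beyond an agreed tolerance.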

6. Structural/File Validation

    Row Count: Ensure the row count aligns with the source (e.g., the loaded row count matches wc -l on the CSV, minus the header line).

    File Integrity: Validate file formats (e.g., CSV delimiters, JSON nesting) and compression.

    Encoding: Check for correct character encoding (e.g., UTF-8) and special characters.
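
The row-count and encoding checks above can be sketched with the standard `csv` module. The example writes a small temporary file, so the file contents are made up; it also shows why a physical line count (what `wc -l` reports) can differ from the parsed row count when a quoted field contains a newline.

```python
import csv
import os
import tempfile


def csv_data_row_count(path, encoding="utf-8"):
    """Count parsed data rows, excluding the header.

    Raises UnicodeDecodeError if the file is not valid in `encoding`.
    """
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


def physical_line_count(path):
    """Count newline-terminated lines, similar to `wc -l`."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


# Build a sample file whose second record spans two physical lines.
with tempfile.NamedTemporaryFile(
    mode="w", suffix=".csv", delete=False, newline="", encoding="utf-8"
) as tmp:
    tmp.write('id,comment\n1,ok\n2,"two\nlines"\n')
    sample_path = tmp.name

data_rows = csv_data_row_count(sample_path)
raw_lines = physical_line_count(sample_path)
os.remove(sample_path)
print(data_rows, raw_lines)
```

Because of multi-line fields, a mismatch between the two counts is a prompt to inspect the file rather than proof that rows were dropped.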

7. Operational Checks

    Data Freshness: Confirm data is up-to-date (e.g., latest transaction timestamp is recent).

    Performance: Assess data size and partitioning for efficient processing.

    Security: Confirm that sensitive fields (e.g., PII, passwords) are masked or encrypted.
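
A freshness check can be as small as comparing the latest timestamp in the data with the current time. The sketch below assumes timezone-aware UTC timestamps; the `is_fresh` helper and the maximum ages are illustrative choices.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(latest_ts, max_age, now=None):
    """Return True if `latest_ts` is no older than `max_age`."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - latest_ts <= max_age


# A fixed "now" keeps the example deterministic; the data is 30 hours old.
latest = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
now = datetime(2024, 5, 2, 18, 0, tzinfo=timezone.utc)

fresh_within_two_days = is_fresh(latest, timedelta(days=2), now=now)
fresh_within_one_day = is_fresh(latest, timedelta(days=1), now=now)
print(fresh_within_two_days, fresh_within_one_day)
```

In a pipeline, `latest_ts` would typically come from a `max()` aggregate over the transaction timestamp column.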

8. Domain-Specific Rules

    Compliance: Ensure adherence to regulations (e.g., GDPR, HIPAA).

    Unit Consistency: Standardize units (e.g., currency, weight) across sources.

    Deprecated Values: Flag legacy codes or discontinued product IDs.

9. Logging & Reporting

    Error Logs: Record validation failures for debugging.

    Summary Reports: Generate pass/fail summaries for stakeholders.
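
Logging and reporting can be layered over any of the checks above. In this sketch each check is a zero-argument callable that returns True on success; the runner, the check names, and the summary format are hypothetical.

```python
import logging

logger = logging.getLogger("validation")


def run_checks(checks):
    """Run each named check and return a mapping of name -> passed."""
    results = {}
    for name, check in checks.items():
        try:
            passed = bool(check())
        except Exception:
            # A check that crashes is logged with its traceback and counted as failed.
            logger.exception("check %s raised an error", name)
            passed = False
        if not passed:
            logger.error("validation failed: %s", name)
        results[name] = passed
    return results


def summarize(results):
    """Return a one-line pass/fail summary for stakeholders."""
    failed = [name for name, ok in results.items() if not ok]
    passed_count = len(results) - len(failed)
    return f"{passed_count}/{len(results)} checks passed; failed: {failed or 'none'}"


results = run_checks({
    "row_count": lambda: True,
    "no_null_ids": lambda: False,
    "broken_check": lambda: 1 / 0,
})
summary = summarize(results)
print(summary)
```

Counting a crashing check as a failure, instead of letting the exception stop the run, means one broken rule does not hide the results of the others.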

In [0]:
%run "./Ecommerce Dataset Schema and Dataframes"

### **CHECK FOR DATA LEAKAGE OR DROPPED ROWS WHILE IMPORTING THE DATA**

In [0]:
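# Print a labeled row count for each DataFrame so it can be compared with the source files.
dataframes = {
    "Customer": customers_df,
    "Geolocation": geolocation_df,
    "Order Items": order_items_df,
    "Order Payments": order_payments_df,
    "Order Reviews": order_reviews_df,
    "Orders": orders_df,
    "Products": products_df,
    "Sellers": sellers_df,
}
for df_name, df in dataframes.items():
    print(f"{df_name}: {df.count()}")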

### **CHECK DUPLICATE VALUES**

In [0]:
from pyspark.sql.functions import col

In [0]:
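def dup_values(df, df_name, column):
    """Display every value of `column` that occurs more than once in `df`."""
    print(f"Duplicate Values: {df_name}")
    # Group by the key column and keep only the groups with more than one row.
    display(df.groupBy(col(column)).count().filter(col("count") > 1))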

In [0]:
dup_values(customers_df, "Customer", "customer_id")
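# Note: order_id is likely to repeat for multi-item orders, so matches here are not necessarily true duplicates.
dup_values(order_items_df, "Order Items", "order_id")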
dup_values(order_reviews_df, "Order Reviews", "review_id")
dup_values(orders_df, "Orders", "order_id")
dup_values(products_df, "Products", "product_id")
dup_values(sellers_df, "Sellers", "seller_id")

### **CHECK NULL VALUES**

In [0]:
from pyspark.sql.functions import col, when, count

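def missing_value(df, df_name):
    """Display the number of null values in each column of `df`."""
    print(f"Missing Values: {df_name}")
    # when() without otherwise() yields null for non-matching rows, and count() skips nulls,
    # so each aggregate counts only the rows where that column is null.
    display(df.select([count(when(col(c).isNull(), 1)).alias(c) for c in df.columns]))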

In [0]:
missing_value(customers_df, "Customer")
missing_value(geolocation_df, "Geolocation")
missing_value(order_items_df, "Order Items")
missing_value(order_payments_df, "Order Payments")
missing_value(order_reviews_df, "Order Reviews")
missing_value(orders_df, "Orders")
missing_value(products_df, "Products")
missing_value(sellers_df, "Sellers")