# Data Audit Report

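This notebook audits the `mrp_production` data in the Bronze layer: it inspects column types, counts missing values and duplicate rows, and saves the results as text and JSON reports.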

## 1. Environment Setup

Set `JAVA_HOME` so that PySpark can locate the Java runtime, then import the required libraries.

In [11]:
# Set JAVA_HOME environment variable
import os
os.environ['JAVA_HOME'] = r'C:\Program Files\Java\jre1.8.0_451'

# Import required libraries
from pyspark.sql import SparkSession
from datetime import datetime
import logging
import pyspark.sql.functions as F
import re
import json
import glob

## 2. Initialize Spark Session

Create a Spark session, or reuse one that is already running.

In [12]:
# Initialize Spark session
try:
    # Try to get an existing Spark session
    spark = SparkSession.builder \
        .appName("Data Audit") \
        .getOrCreate()
    
    # Test if the session is active
    spark.sparkContext.getConf().getAll()
    print("Using existing Spark session")
except Exception:
    # If there's an error, create a new session
    print("Creating new Spark session")
    spark = SparkSession.builder \
        .appName("Data Audit") \
        .getOrCreate()

Using existing Spark session


## 3. Configure Paths and Logging

Set up paths for data, logs, and reports.

In [13]:
# Define paths
# Notebooks do not define __file__, so use the working directory (assumed to be the notebook's folder)
current_dir = os.getcwd()
base_dir = os.path.dirname(current_dir)  # Go up one level to reach the root directory

# Get the bronze layer path
bronze_base_path = os.path.join(base_dir, "output", "bronzeLayer")

# Use current date for logs
date_str = datetime.now().strftime("%Y-%m-%d")

# Log path organized by date
log_dir = os.path.join(base_dir, "logs", "data_audit", date_str)
log_path = os.path.join(log_dir, "data_audit.log")
report_path = os.path.join(log_dir, "audit_report.txt")
json_report_path = os.path.join(log_dir, "audit_report.json")

# Ensure the logs directory exists
os.makedirs(log_dir, exist_ok=True)

# Configure logging
logging.basicConfig(filename=log_path, level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

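def log_message(message, level="info"):
    """Logs a message at the specified level and echoes it to the console."""
    # Map level names to logging functions; unknown levels fall back to info
    # instead of being silently dropped from the log file
    levels = {"debug": logging.debug, "info": logging.info,
              "warning": logging.warning, "error": logging.error}
    levels.get(level.lower(), logging.info)(message)
    print(message)  # Also print to console for real-time feedback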
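
One caveat with `logging.basicConfig` is that it does nothing when the root logger already has handlers, which can happen when a notebook cell is re-run or when the kernel has configured logging itself. The sketch below shows one possible alternative: a named logger with its own file and console handlers, so the `print` call is no longer needed. The function name `build_audit_logger`, the logger name `data_audit`, and the temporary log path are assumptions made for illustration; they are not part of the original notebook.

```python
import logging
import os
import sys
import tempfile


def build_audit_logger(log_path, name="data_audit"):
    """Return a logger that writes to `log_path` and to the console."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records away from the root logger

    # Drop handlers left over from an earlier run of the same cell
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()

    formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
    file_handler = logging.FileHandler(log_path, encoding="utf-8")
    console_handler = logging.StreamHandler(sys.stdout)
    for handler in (file_handler, console_handler):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


# Demo with a temporary log file (hypothetical path, not the notebook's log_dir)
demo_dir = tempfile.mkdtemp()
demo_log_path = os.path.join(demo_dir, "data_audit.log")
logger = build_audit_logger(demo_log_path)
logger.info("Audit started")
logger.warning("Column 'Source' is mostly empty")

for handler in logger.handlers:
    handler.flush()
with open(demo_log_path, encoding="utf-8") as f:
    logged_lines = f.read().splitlines()
```

Calling `build_audit_logger` again with the same name replaces the handlers rather than stacking them, so re-running the cell should not produce duplicate log lines.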

## 4. Locate Latest Data

Find the most recent data in the Bronze layer.

In [14]:
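# Find the latest date folder in the Bronze layer.
# Folder names follow YYYY-MM-DD, so the lexicographic maximum is also the most recent date.
date_folders = [p for p in glob.glob(os.path.join(bronze_base_path, "*")) if os.path.isdir(p)]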
if not date_folders:
    log_message(f"No date folders found in {bronze_base_path}", level="error")
else:
    latest_folder = max(date_folders)
    print(f"Latest folder: {latest_folder}")

Latest folder: c:\Users\oussema\OneDrive\Bureau\FSEGN\DW-Odoo\output\bronzeLayer\2025-05-08
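
Taking `max()` over the folder paths works because ISO-style `YYYY-MM-DD` names sort chronologically as strings. It can pick the wrong entry if the Bronze directory ever contains something that is not a date folder, for example a name starting with `_`, which sorts after any digit. A more defensive variant, sketched here under the assumption that every valid folder name matches `%Y-%m-%d`, parses each name and ignores the rest. The helper name `find_latest_date_folder` and the demo folder names are hypothetical.

```python
import os
import tempfile
from datetime import datetime


def find_latest_date_folder(base_path, date_format="%Y-%m-%d"):
    """Return the path of the newest date-named subfolder, or None if there is none."""
    dated = []
    for name in os.listdir(base_path):
        full_path = os.path.join(base_path, name)
        if not os.path.isdir(full_path):
            continue  # skip stray files
        try:
            folder_date = datetime.strptime(name, date_format)
        except ValueError:
            continue  # skip folders that are not named like a date
        dated.append((folder_date, full_path))
    if not dated:
        return None
    # Compare on the parsed date only
    return max(dated, key=lambda pair: pair[0])[1]


# Demo in a temporary directory with two date folders and two unrelated ones
with tempfile.TemporaryDirectory() as tmp:
    for name in ("2025-05-07", "2025-05-08", "_tmp", "archive"):
        os.makedirs(os.path.join(tmp, name))
    latest_name = os.path.basename(find_latest_date_folder(tmp))
    naive_name = max(os.listdir(tmp))  # plain string comparison

print(latest_name)  # 2025-05-08
print(naive_name)   # archive
```

Returning `None` for an empty directory lets the caller decide whether to stop the audit or log an error, instead of failing later with an undefined variable.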


## 5. Load and Examine Data

Read the data from the Bronze layer and display basic information.

In [15]:
# Read the data
try:
    # Read data from Bronze layer (Parquet format)
    df = spark.read.parquet(latest_folder)
    
    # Display basic info
    print("Source: mrp_production")
    print(f"Path: {latest_folder}")
    print(f"Row count: {df.count()}")
    print(f"Column count: {len(df.columns)}")
    print(f"Columns: {', '.join(df.columns)}\n")
    
    # Show a sample of the data
    print("Sample data (first 5 rows):")
    df.show(5, truncate=False)
except Exception as e:
    log_message(f"Error reading data: {str(e)}", level="error")
    import traceback
    log_message(traceback.format_exc(), level="error")

Source: mrp_production
Path: c:\Users\oussema\OneDrive\Bureau\FSEGN\DW-Odoo\output\bronzeLayer\2025-05-08
Row count: 1734
Column count: 14
Columns: Component Status, Product, Reference, Source, Responsible, Start, End, Deadline, State, Quantity Producing, Quantity To Produce, Total Quantity, Product/Cost, Product/Sales Price

Sample data (first 5 rows):
+----------------+----------------------+-----------+-------+---------------------+-------------------+-------------------+-------------------+---------+------------------+-------------------+--------------+------------+-------------------+
|Component Status|Product               |Reference  |Source |Responsible          |Start              |End                |Deadline           |State    |Quantity Producing|Quantity To Produce|Total Quantity|Product/Cost|Product/Sales Price|
+----------------+----------------------+-----------+-------+---------------------+-------------------+-------------------+-------------------+---------+---------

## 6. Check Data Types and Missing Values

Analyze data types and identify missing values in each column.

In [16]:
# Check for missing values
try:
    # Create a list to hold the expressions for each column
    missing_value_exprs = []
    for c in df.columns:
        # Get the data type of the column
        col_type = df.schema[c].dataType.typeName()
        print(f"Column {c} has type {col_type}")
        
        # For floating-point types, check both null and NaN (isnan applies only to float/double columns)
        if col_type in ['double', 'float']:
            expr = F.count(F.when(F.col(c).isNull() | F.isnan(F.col(c)), c)).alias(c)
        else:
            # For non-numeric types, only check for null
            expr = F.count(F.when(F.col(c).isNull(), c)).alias(c)
        
        missing_value_exprs.append(expr)
    
    missing_values = df.select(missing_value_exprs)
    missing_values_dict = missing_values.collect()[0].asDict()
    total_rows = df.count()  # count once; each df.count() call triggers a Spark job
    
    print("\nMissing Values:")
    for column, count in missing_values_dict.items():
        if count > 0:
            print(f" - {column}: {count} missing values ({(count/total_rows)*100:.2f}%)")
except Exception as e:
    log_message(f"Error checking missing values: {str(e)}", level="error")
    import traceback
    log_message(traceback.format_exc(), level="error")

Column Component Status has type string
Column Product has type string
Column Reference has type string
Column Source has type string
Column Responsible has type string
Column Start has type timestamp
Column End has type timestamp
Column Deadline has type timestamp
Column State has type string
Column Quantity Producing has type double
Column Quantity To Produce has type double
Column Total Quantity has type double
Column Product/Cost has type double
Column Product/Sales Price has type double

Missing Values:
 - Component Status: 526 missing values (30.33%)
 - Source: 1226 missing values (70.70%)
 - State: 325 missing values (18.74%)
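
The percentages above are each column's missing count divided by the total row count. As a minimal, Spark-free sketch of that arithmetic, the helper below (a hypothetical `summarize_missing`, not part of the notebook) takes a dictionary shaped like `missing_values_dict` and a row count, and guards against an empty dataset. The input is abbreviated to four of the fourteen columns, using the counts reported above.

```python
def summarize_missing(missing_counts, total_rows):
    """Return {column: (count, percent)} for columns that have missing values."""
    summary = {}
    for column, count in missing_counts.items():
        if count > 0:
            # Guard against division by zero on an empty dataset
            percent = (count / total_rows * 100) if total_rows else 0.0
            summary[column] = (count, round(percent, 2))
    return summary


# Counts taken from the output above (abbreviated to four columns)
counts = {"Component Status": 526, "Product": 0, "Source": 1226, "State": 325}
summary = summarize_missing(counts, total_rows=1734)

for column, (count, percent) in summary.items():
    print(f" - {column}: {count} missing values ({percent:.2f}%)")
```

Columns with a zero count, such as `Product`, are left out of the summary, which matches how the notebook prints only the affected columns.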


## 7. Check for Duplicate Rows

Identify duplicate records in the dataset.

In [17]:
# Check for duplicates
try:
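    total_rows = df.count()
    duplicate_count = total_rows - df.dropDuplicates().count()
    # Avoid a ZeroDivisionError when the dataset is empty
    duplicate_pct = (duplicate_count / total_rows * 100) if total_rows else 0.0
    print(f"Duplicate Rows: {duplicate_count} ({duplicate_pct:.2f}% of total)")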
except Exception as e:
    log_message(f"Error checking duplicates: {str(e)}", level="error")
    import traceback
    log_message(traceback.format_exc(), level="error")

Duplicate Rows: 0 (0.00% of total)
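
The figure reported here is the number of surplus rows, that is, how many rows de-duplication would remove, not the number of distinct rows that have a duplicate. The plain-Python sketch below illustrates the difference on a few made-up rows; the helper names and the sample references are hypothetical and do not come from the dataset.

```python
from collections import Counter


def count_surplus_rows(rows):
    """Number of rows that de-duplication would remove (total minus distinct)."""
    return len(rows) - len(set(rows))


def duplicated_rows(rows):
    """Map each row that occurs more than once to its number of occurrences."""
    return {row: n for row, n in Counter(rows).items() if n > 1}


# Made-up rows: one record appears three times
rows = [
    ("MO/001", "Table"),
    ("MO/002", "Chair"),
    ("MO/001", "Table"),
    ("MO/001", "Table"),
]

surplus = count_surplus_rows(rows)
repeated = duplicated_rows(rows)
print(surplus)   # 2
print(repeated)  # {('MO/001', 'Table'): 3}
```

When the count is non-zero, listing the repeated rows is usually the next step. In PySpark, grouping by all columns and keeping groups with a count above one (for example `df.groupBy(df.columns).count().filter("count > 1")`) should give the equivalent listing, although that was not run against this dataset.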


## 8. Generate Audit Report

Create and save a comprehensive audit report in both text and JSON formats.

In [18]:
# Generate and save the audit report
try:
    # Count rows once; every df.count() call triggers a separate Spark job
    row_count = df.count()
    # Avoid a ZeroDivisionError when the dataset is empty
    duplicate_pct = (duplicate_count / row_count * 100) if row_count else 0.0
    
    # Create the report
    report = []
    report.append("Audit Report for mrp_production")
    report.append(f"Generated on: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    report.append(f"Row Count: {row_count}")
    report.append(f"Column Count: {len(df.columns)}")
    report.append(f"Columns: {', '.join(df.columns)}\n")
    
    # Add missing values to report
    report.append("Missing Values:")
    for column, count in missing_values_dict.items():
        if count > 0:
            report.append(f" - {column}: {count} missing values ({(count/row_count)*100:.2f}%)")
    report.append("")
    
    # Add duplicates to report
    report.append(f"Duplicate Rows: {duplicate_count} ({duplicate_pct:.2f}% of total)\n")
    
    # Save the report to a text file
    with open(report_path, "w") as f:
        f.write("\n".join(report))
    
    # Create JSON report
    json_report = {
        "source": "mrp_production",
        "row_count": row_count,
        "column_count": len(df.columns),
        "missing_values": missing_values_dict,
        "duplicate_rows": duplicate_count
    }
    
    # Save JSON report
    with open(json_report_path, "w") as f:
        json.dump([json_report], f, indent=4)
    
    print(f"\nAudit report saved to: {report_path}")
    print(f"JSON report saved to: {json_report_path}")
except Exception as e:
    log_message(f"Error generating report: {str(e)}", level="error")
    import traceback
    log_message(traceback.format_exc(), level="error")


Audit report saved to: c:\Users\oussema\OneDrive\Bureau\FSEGN\DW-Odoo\logs\data_audit\2025-05-08\audit_report.txt
JSON report saved to: c:\Users\oussema\OneDrive\Bureau\FSEGN\DW-Odoo\logs\data_audit\2025-05-08\audit_report.json
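
Because the JSON report is meant to be consumed by other tools, it can be worth reading it back and checking its shape before relying on it. The sketch below assumes the file holds a list of entries with the five keys written above; the helper `load_json_report`, the `REQUIRED_KEYS` set, and the temporary file path are illustrative assumptions, and the `missing_values` mapping is abbreviated to the three affected columns.

```python
import json
import os
import tempfile

# Keys written by the notebook's JSON report
REQUIRED_KEYS = {"source", "row_count", "column_count", "missing_values", "duplicate_rows"}


def load_json_report(path):
    """Load an audit report file and check that each entry has the expected keys."""
    with open(path, "r", encoding="utf-8") as f:
        entries = json.load(f)
    for entry in entries:
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            raise ValueError(f"Report entry is missing keys: {sorted(missing)}")
    return entries


# Demo: write a report shaped like the notebook's, then read it back
example_report = {
    "source": "mrp_production",
    "row_count": 1734,
    "column_count": 14,
    "missing_values": {"Component Status": 526, "Source": 1226, "State": 325},
    "duplicate_rows": 0,
}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "audit_report.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump([example_report], f, indent=4)
    loaded = load_json_report(demo_path)

print(loaded[0]["missing_values"]["Source"])  # 1226
```

Passing an explicit `encoding` on both the write and the read keeps the round trip independent of the platform's default encoding, which matters on Windows if column names ever contain non-ASCII characters.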


## 9. Display Audit Report

Show the generated audit report.

In [19]:
# Display the audit report
try:
    with open(report_path, "r") as f:
        report_content = f.read()
    print(report_content)
except Exception as e:
    print(f"Error reading report: {str(e)}")

Audit Report for mrp_production
Generated on: 2025-05-08 22:31:52
Row Count: 1734
Column Count: 14
Columns: Component Status, Product, Reference, Source, Responsible, Start, End, Deadline, State, Quantity Producing, Quantity To Produce, Total Quantity, Product/Cost, Product/Sales Price

Missing Values:
 - Component Status: 526 missing values (30.33%)
 - Source: 1226 missing values (70.70%)
 - State: 325 missing values (18.74%)

Duplicate Rows: 0 (0.00% of total)



## 10. Cleanup

Clean up resources by stopping the Spark session.

In [20]:
# Safely stop Spark session
try:
    # Check if spark session is active before stopping
    spark.sparkContext.getConf().getAll()
    spark.stop()
    print("Spark session stopped.")
except Exception:
    print("No active Spark session to stop.")

Spark session stopped.
