# Phase 2: Silver Layer Verification 🥈

**Goal**: Validate that the Bronze -> Silver ETL Job worked correctly.

**Checklist**:
1.  **Format**: Is it Parquet?
2.  **Schema**: Are sensors now `DoubleType`? (No more strings)
3.  **Partitions**: Is it partitioned by `campaign`?
4.  **Cleaning**: Are outliers clipped? Are gaps handled (`sequence_id`)?

In [1]:
import os
import sys
sys.path.append(os.path.abspath('../src'))

from config import get_spark_session, get_data_path
from pyspark.sql import functions as F

# Ensure Spark knows where Python is (Safety Check)
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

spark = get_spark_session("SilverVerification")
silver_path = get_data_path("silver")

print(f"🔍 Inspecting: {silver_path}")

🔧 Configuring specific S3 endpoint for MinIO: http://minio:9000




🔍 Inspecting: s3a://silver


In [2]:
# Load Silver Data (Parquet)
path = f"{silver_path}/Process"
df_silver = spark.read.parquet(path)

print(f"📊 Total Rows: {df_silver.count():,}")
print("📋 Schema Validation:")
df_silver.printSchema()

📊 Total Rows: 4,720,208
📋 Schema Validation:
root
 |-- batch: integer (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- code: integer (nullable = true)
 |-- tbl_speed: double (nullable = true)
 |-- fom: double (nullable = true)
 |-- main_comp: double (nullable = true)
 |-- tbl_fill: double (nullable = true)
 |-- SREL: double (nullable = true)
 |-- pre_comp: double (nullable = true)
 |-- produced: integer (nullable = true)
 |-- waste: integer (nullable = true)
 |-- cyl_main: double (nullable = true)
 |-- cyl_pre: double (nullable = true)
 |-- stiffness: integer (nullable = true)
 |-- ejection: double (nullable = true)
 |-- sequence_id: long (nullable = true)
 |-- strength: string (nullable = true)
 |-- size: integer (nullable = true)
 |-- start: string (nullable = true)
 |-- api_code: integer (nullable = true)
 |-- api_batch: integer (nullable = true)
 |-- smcc_batch: integer (nullable = true)
 |-- lactose_batch: integer (nullable = true)
 |-- starch_batch: integer (
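
The checklist asks about format and partitioning, which the schema printout alone does not settle. Below is a minimal sketch of one way to check both from file paths, assuming Hive-style `campaign=<value>` directory names. The helper names and the example paths are hypothetical. In the notebook, the paths could come from `df_silver.inputFiles()`.

```python
import re


def partition_values(paths, key):
    """Collect Hive-style `key=value` partition values found in file paths."""
    pattern = re.compile(rf"(?:^|/){re.escape(key)}=([^/]+)")
    values = set()
    for path in paths:
        match = pattern.search(path)
        if match:
            values.add(match.group(1))
    return values


def all_parquet(paths):
    """Return True if every path ends with the .parquet extension."""
    return all(path.endswith(".parquet") for path in paths)


# Hypothetical paths for illustration only; the real layout is not shown above.
example_paths = [
    "s3a://silver/Process/campaign=1/part-00000.snappy.parquet",
    "s3a://silver/Process/campaign=2/part-00000.snappy.parquet",
    "s3a://silver/Process/part-00001.snappy.parquet",  # not under a partition
]

print(sorted(partition_values(example_paths, "campaign")))  # ['1', '2']
print(all_parquet(example_paths))  # True
```

An empty result from `partition_values` would suggest the data is not partitioned by that key, or that the directory naming differs from the assumption made here.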

                                                                                

## 1. Type Enforcement Check
Sensor columns such as `main_comp` and `pre_comp` must be `double`, not strings. The cell below spot-checks two of them: `main_comp` and `tbl_speed`.

In [3]:
dtypes = dict(df_silver.dtypes)
assert dtypes['main_comp'] == 'double', "❌ Main Comp should be Double!"
assert dtypes['tbl_speed'] == 'double', "❌ Tbl Speed should be Double!"
print("✅ Type Check Passed: All sensors are DoubleType.")

✅ Type Check Passed: All sensors are DoubleType.
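
The assertions above cover two columns. To extend the check to every sensor, one option is a small helper that works on the `(name, type)` pairs that `DataFrame.dtypes` returns. This is a sketch: the helper name is hypothetical, and treating exactly the columns listed in `SENSOR_COLUMNS` as "sensors" is an assumption based on the schema printed earlier.

```python
def non_double_columns(dtypes, columns):
    """Return {column: actual_type} for listed columns that are not 'double'."""
    type_by_name = dict(dtypes)
    offenders = {}
    for name in columns:
        actual = type_by_name.get(name, "<missing>")
        if actual != "double":
            offenders[name] = actual
    return offenders


# Assumption: these are the sensor columns (the doubles in the printed schema).
SENSOR_COLUMNS = [
    "tbl_speed", "fom", "main_comp", "tbl_fill", "SREL",
    "pre_comp", "cyl_main", "cyl_pre", "ejection",
]

# Stand-in for df_silver.dtypes, copied from part of the printed schema.
schema_dtypes = [
    ("batch", "int"), ("timestamp", "string"), ("code", "int"),
    ("tbl_speed", "double"), ("fom", "double"), ("main_comp", "double"),
    ("tbl_fill", "double"), ("SREL", "double"), ("pre_comp", "double"),
    ("cyl_main", "double"), ("cyl_pre", "double"), ("ejection", "double"),
]

print(non_double_columns(schema_dtypes, SENSOR_COLUMNS))  # {}
```

In the notebook, `non_double_columns(df_silver.dtypes, SENSOR_COLUMNS)` would return an empty dict when every listed column is `double`, and otherwise names each offender with its actual type.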


## 2. Cleaning Verification (Outliers & Gaps)
Check that `sequence_id` exists and that outlier clipping kept `main_comp` within the expected range.

In [4]:
# 1. Gaps (Sequence ID)
if 'sequence_id' in df_silver.columns:
    sequence_count = df_silver.select("sequence_id").distinct().count()
    print(f"✅ Sequence Splitting logic applied. Found {sequence_count} distinct continuous sequences.")
else:
    print("❌ sequence_id column MISSING!")

# 2. Outliers (Max/Min Check)
stats = df_silver.select(F.min("main_comp"), F.max("main_comp")).collect()[0]
print(f"📉 Main Compression Range: [{stats[0]:.2f}, {stats[1]:.2f}]")

# Note: The ETL log reported clipping at (-4.44, 16.86), so values should fall within that range.
# The bounds checked below are rounded outward to (-5.0, 17.0), which leaves a small tolerance.
if stats[1] <= 17.0 and stats[0] >= -5.0:
    print("✅ Outlier Clipping appears effective (Values within reasonable 5-sigma range).")
else:
    print("⚠️ Values outside expected clip range. Check ETL logic.")

✅ Sequence Splitting logic applied. Found 91 distinct continuous sequences.

📉 Main Compression Range: [0.00, 16.86]
✅ Outlier Clipping appears effective (Values within reasonable 5-sigma range).
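
The range check above uses rounded bounds. A stricter variant compares against the logged clip limits directly, with a small tolerance for rounding. The sketch below also shows how mean ± k·sigma bounds could be computed. Both helper names are hypothetical, and the use of the population standard deviation and a 0.01 tolerance are assumptions; the ETL job's actual clipping method is not shown in this notebook.

```python
from statistics import mean, pstdev


def sigma_clip_bounds(values, k=5.0):
    """Return (mean - k*sigma, mean + k*sigma) using the population std dev."""
    mu = mean(values)
    sigma = pstdev(values)
    return mu - k * sigma, mu + k * sigma


def within_bounds(observed_min, observed_max, lower, upper, tol=0.01):
    """Check that the observed range lies inside [lower, upper], allowing tol."""
    return observed_min >= lower - tol and observed_max <= upper + tol


# Observed range and clip limits as reported in this notebook.
print(within_bounds(0.00, 16.86, -4.44, 16.86))  # True

# A value clearly above the upper clip limit fails the check.
print(within_bounds(0.00, 17.50, -4.44, 16.86))  # False
```

Note that an observed minimum of 0.00 sits well above the lower clip limit of -4.44, so only the upper limit appears to have been active for `main_comp`.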


                                                                                

## 3. Enrichment (Lab Data)
Did we successfully join the `Laboratory.csv` labels?

In [5]:
# Check for a Lab specific column, e.g., 'dissolution_av'
if 'dissolution_av' in df_silver.columns:
    row = df_silver.filter("dissolution_av IS NOT NULL").first()
    if row:
        print(f"✅ Data Enrichment Successful. Example Dissolution: {row['dissolution_av']} (Batch: {row.batch})")
    else:
        print("⚠️ Column exists but all values are Null. Check Join Link!")
else:
    print("❌ Lab columns missing!")



✅ Data Enrichment Successful. Example Dissolution: 93.0 (Batch: 241)
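
A single non-null row shows the join matched at least once, but not how many batches received lab values. A rough coverage figure can be computed as sketched below. The helper name is hypothetical, and the second sample row (batch 242) is invented for illustration; only batch 241 with 93.0 comes from the output above.

```python
def null_coverage(rows, column):
    """Return the fraction of rows whose `column` value is not None."""
    rows = list(rows)
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(column) is not None)
    return filled / len(rows)


# One dict per batch; batch 242 is a made-up example of an unmatched batch.
sample = [
    {"batch": 241, "dissolution_av": 93.0},
    {"batch": 242, "dissolution_av": None},
]

print(null_coverage(sample, "dissolution_av"))  # 0.5
```

In the notebook, the input could be built with `[r.asDict() for r in df_silver.select("batch", "dissolution_av").distinct().collect()]`, which is small enough to collect if the number of batches is modest.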
