
## Silver Layer
The purpose of this layer is to transform the bronze data into clean, reliable and standardised data.

This layer has to carry out the following steps:

1. Pivot the data to create tag names as columns, timestamp as index
2. Filter for good quality
3. Forward fill for missing timestamp values


We process them in the following order:

1. Filter for good quality first, to reduce the data volume for the later steps
2. Pivot
3. Forward fill for missing timestamp values


### Imports

In [0]:
from pyspark.sql import Window
from pyspark.sql.functions import col, first, last, to_timestamp, lower
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, ArrayType, MapType

import json
import pickle
import pandas as pd

from io import BytesIO


### Paths

In [0]:
bronze_source = "abfss://bronze@sembcorpete.dfs.core.windows.net"
silver = "abfss://silver@sembcorpete.dfs.core.windows.net"


### Sensor(s) processing


1. `lower` is applied to `tag_quality` so that entries such as GoOD, GOOD or gOOd are also treated as good. This turned out to be unnecessary, since the rows do not change after filtering.

2. The silver `timestamp` column is cast to a timestamp type with `to_timestamp`.

3. Identify sensors with similar dates or types, e.g. Sensor 4 has different dates, while Sensors 1 and 2 share the same dates.
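The case-insensitive quality check in note 1 can be illustrated without Spark. The following is a minimal plain-Python sketch, not the notebook's implementation: the `rows` list, its values and the `is_good` helper are made up for illustration, and only the column names `tag_key`, `tag_val` and `tag_quality` come from the notebook.

```python
# Toy stand-in for bronze rows; all values here are invented for illustration.
rows = [
    {"tag_key": "sensor1_temp", "tag_val": 21.5, "tag_quality": "Good"},
    {"tag_key": "sensor1_temp", "tag_val": 21.7, "tag_quality": "GOOD"},
    {"tag_key": "sensor1_temp", "tag_val": None, "tag_quality": "Bad"},
    {"tag_key": "sensor1_temp", "tag_val": 21.9, "tag_quality": "gOOd"},
]


def is_good(row):
    """Hypothetical helper: case-insensitive match on tag_quality.

    Mirrors the idea behind lower(col("tag_quality")) == "good".
    A missing quality value is treated as not good.
    """
    quality = row["tag_quality"]
    return quality is not None and quality.lower() == "good"


good_rows = [r for r in rows if is_good(r)]
print(len(good_rows))  # 3
```

Note that lower-casing does not trim whitespace, so a value such as `" good "` would still be filtered out by this kind of check.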


### Sensor 1

In [0]:
bronze_sensor1 = spark.read \
                      .format("delta") \
                      .load(f"{bronze_source}/sensor1/")

In [0]:
bronze_sensor1 = bronze_sensor1.withColumn("timestamp", to_timestamp(col("created_timestamp")))

#first step, filter for good quality 
silver_sensor1 = bronze_sensor1 \
                    .filter(lower(col("tag_quality")) == "good") \
                    .select("timestamp", "tag_key", "tag_val" , "tag_quality")

In [0]:
silver_sensor1.write \
              .mode("overwrite") \
              .format("delta") \
              .option("mergeSchema", "true").save(f"{silver}/sensor1")


### Sensor 2

In [0]:
bronze_sensor2 = spark.read \
                      .format("delta") \
                      .load(f"{bronze_source}/sensor2/")

In [0]:
bronze_sensor2 = bronze_sensor2.withColumn("timestamp", to_timestamp(col("created_timestamp")))

#first step, filter for good quality 
silver_sensor2 = bronze_sensor2 \
                    .filter(lower(col("tag_quality")) == "good") \
                    .select("timestamp", "tag_key", "tag_val" , "tag_quality")

In [0]:
silver_sensor2.write \
              .mode("overwrite") \
              .format("delta") \
              .option("mergeSchema", "true").save(f"{silver}/sensor2")


### Sensor 4

Sensor 4 has two bronze sources (parquet and pickle). Each is filtered separately, and the combined data is produced in the union step further down.

In [0]:
#sensor 4 bronze parquet file
bronze_sensor4_pq = spark.read \
                         .format("delta") \
                         .load(f"{bronze_source}/sensor4/parquet/")


In [0]:
bronze_sensor4_pq = bronze_sensor4_pq.withColumn("timestamp", to_timestamp(col("created_timestamp")))

#first step, filter for good quality 
silver_sensor4_pq = bronze_sensor4_pq \
                        .filter(lower(col("tag_quality")) == "good") \
                        .select("timestamp", "tag_key", "tag_val", "tag_quality")

In [0]:
#sensor 4 bronze pickle file
bronze_sensor4_pk = spark.read \
                         .format("delta") \
                         .load(f"{bronze_source}/sensor4/pickle/")

In [0]:
bronze_sensor4_pk = bronze_sensor4_pk.withColumn("timestamp", to_timestamp(col("created_timestamp")))

#first step, filter for good quality 
silver_sensor4_pk = bronze_sensor4_pk \
                        .filter(lower(col("tag_quality")) == "good") \
                        .select("timestamp", "tag_key", "tag_val", "tag_quality")

In [0]:
#sensor 4 union
silver_sensor4 = silver_sensor4_pq.union(silver_sensor4_pk)

In [0]:
#sensor 4 write
silver_sensor4.write \
              .mode("overwrite") \
              .format("delta") \
              .option("mergeSchema", "true").save(f"{silver}/sensor4")


### Sensor 5

In [0]:
bronze_sensor5 = spark.read \
                      .format("delta") \
                      .load(f"{bronze_source}/sensor5")

In [0]:
bronze_sensor5 = bronze_sensor5.withColumn("timestamp", to_timestamp(col("created_timestamp")))

#first step, filter for good quality
silver_sensor5 = bronze_sensor5 \
                    .filter(lower(col("tag_quality")) == "good") \
                    .select("timestamp", "tag_key", "tag_val", "tag_quality")

In [0]:
silver_sensor5.write \
              .mode("overwrite") \
              .format("delta") \
              .option("mergeSchema", "true").save(f"{silver}/sensor5")


## Data Union and Pivoting

Union the sensor 1, 2, 4 and 5 DataFrames, then pivot to create **tag names as columns, timestamp as index**.


Note that this pivoted table is the key output of the silver layer.
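To make the target shape concrete, here is a small plain-Python sketch of a long-to-wide pivot that keeps the first value seen per timestamp and tag. It is a toy illustration under stated assumptions, not the Spark code used in this notebook: `long_rows`, the tag names `tag_a` and `tag_b`, and the `pivot_first` helper are all hypothetical.

```python
# Long format: (timestamp, tag_key, tag_val). Values are invented for illustration.
long_rows = [
    ("2021-07-05 00:00:00", "tag_a", 1.0),
    ("2021-07-05 00:00:00", "tag_b", 10.0),
    ("2021-07-05 00:01:00", "tag_a", 2.0),
    ("2021-07-05 00:01:00", "tag_a", 2.5),  # duplicate (timestamp, tag_key) pair
]


def pivot_first(rows):
    """Hypothetical helper: one output row per timestamp, one key per tag.

    For duplicate (timestamp, tag_key) pairs the first non-missing value
    in input order is kept. Tags with no value at a timestamp stay None.
    """
    tags = sorted({tag for _, tag, _ in rows})
    wide = {}
    for ts, tag, val in rows:
        record = wide.setdefault(ts, {t: None for t in tags})
        if record[tag] is None:
            record[tag] = val
    return [{"timestamp": ts, **wide[ts]} for ts in sorted(wide)]


for row in pivot_first(long_rows):
    print(row)
```

In this toy version "first" means first in input order. In Spark, `first` depends on row order, which is not guaranteed after a shuffle, so duplicates for the same timestamp and tag may need an explicit rule if they occur.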

In [0]:
#combine all sensors (1,2,4,5)
all_sensors = silver_sensor1.union(silver_sensor2).union(silver_sensor4).union(silver_sensor5)

In [0]:
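#pivot all sensors: one row per timestamp, one column per tag_key
#note: first() keeps one tag_val per (timestamp, tag_key); which one is not guaranteed if duplicates exist
silver_pivot = all_sensors.groupBy("timestamp").pivot("tag_key").agg(first("tag_val")).orderBy("timestamp")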

In [0]:
#write the pivoted data for all sensors
silver_pivot.write \
              .mode("overwrite") \
              .format("delta") \
              .option("mergeSchema", "true").save(f"{silver}/pivot/all_pivot")


## Forward fill for missing timestamp values

We now have the sensor tag names (sensors 1, 2, 4 and 5) as columns, and the data is already filtered for good quality. The sensor data sets largely fall on different days, so we forward fill to handle the missing values.


Some NULL values still remain, because forward fill can only fill a gap when an earlier non-NULL value exists
(e.g. Sensor 4 is NULL on 2021-07-05 because its data only starts on 2021-07-10).
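The remaining leading NULLs are easy to see on a single column. The sketch below is a plain-Python illustration of forward fill, assuming rows are already ordered by timestamp; the `forward_fill` helper and the `sensor4` values are made up and are not part of the notebook.

```python
def forward_fill(values):
    """Hypothetical helper: replace each None with the last non-None value before it.

    Values that have no earlier non-None value stay None.
    """
    filled = []
    last_seen = None
    for value in values:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


# Invented column, ordered by timestamp: no data at the start, then gaps.
sensor4 = [None, None, 7.0, None, 8.0, None]
print(forward_fill(sensor4))  # [None, None, 7.0, 7.0, 8.0, 8.0]
```

The first two entries stay `None` because nothing precedes them. This is the same idea as taking the last non-null value over a window that runs from the first row to the current row.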


### Common window spec

In [0]:
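#frame: first row up to the current row by timestamp (no partitionBy, so Spark moves all rows to a single partition)
windowSpec = Window.orderBy("timestamp").rowsBetween(Window.unboundedPreceding, Window.currentRow)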


### All data sets unioned together

In [0]:
#forward fill every sensor column (all columns except timestamp)
sensor_cols = [c for c in silver_pivot.columns if c != "timestamp"]

silver_ffilled = silver_pivot
for sensor_col in sensor_cols:
    silver_ffilled = silver_ffilled.withColumn(sensor_col, last(sensor_col, ignorenulls=True).over(windowSpec))

In [0]:
silver_ffilled.write \
              .mode("overwrite") \
              .format("delta") \
              .option("mergeSchema", "true").save(f"{silver}/ffill/all_sensors_ffill/")