# DeltaLake
- [DeltaLake Getting Started](https://delta.io/learn/getting-started)
- [DeltaLake Best Practices](https://docs.delta.io/latest/best-practices.html)

## Introduction
Delta Lake is an open source project that enables building a Lakehouse architecture on top of data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing on top of existing data lakes, such as S3, ADLS, GCS, and HDFS.

Specifically, Delta Lake offers:

- ACID transactions on Spark: Serializable isolation levels ensure that readers never see inconsistent data.

- Scalable metadata handling: Leverages Spark's distributed processing power to handle all the metadata for petabyte-scale tables with billions of files with ease.

- Streaming and batch unification: A table in Delta Lake is a batch table as well as a streaming source and sink. Streaming data ingest, batch historic backfill, interactive queries all just work out of the box.

- Schema enforcement: Automatically handles schema variations to prevent insertion of bad records during ingestion.

- Time travel: Data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments.

- Upserts and deletes: Supports merge, update and delete operations to enable complex use cases like change-data-capture, slowly-changing-dimension (SCD) operations, streaming upserts, and so on.
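
To make the time-travel bullet concrete, here is a minimal pure-Python sketch of the idea behind a versioned table: each commit records which data files it adds and removes, and a reader rebuilds any version by replaying the commits up to it. The `ToyLog` class and the file names are hypothetical. This is a toy model under those assumptions, not Delta Lake's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    """One atomic commit: the data files it adds and the ones it removes."""
    add: set = field(default_factory=set)
    remove: set = field(default_factory=set)


class ToyLog:
    """Hypothetical append-only commit log (illustration only)."""

    def __init__(self):
        self.commits = []

    def commit(self, add=(), remove=()):
        """Append a commit and return its version number."""
        self.commits.append(Commit(set(add), set(remove)))
        return len(self.commits) - 1

    def snapshot(self, version=None):
        """Replay commits 0..version and return the set of live files."""
        if version is None:
            version = len(self.commits) - 1
        live = set()
        for c in self.commits[: version + 1]:
            live -= c.remove
            live |= c.add
        return live


log = ToyLog()
v0 = log.commit(add={"part-0.parquet"})
v1 = log.commit(add={"part-1.parquet"})
# A rewrite: the new file replaces an old one in a single commit.
v2 = log.commit(add={"part-2.parquet"}, remove={"part-0.parquet"})

print("version 0:", sorted(log.snapshot(v0)))
print("latest:   ", sorted(log.snapshot()))
```

Because old commits are never modified, reading an older version only means stopping the replay earlier, which is why rollbacks and audits come almost for free in this design.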

## What Is a Lakehouse?
A few systems are beginning to emerge that address the limitations of data lakes. A lakehouse is a new, open architecture that combines the best elements of data lakes and data warehouses. Lakehouses are enabled by a new system design: implementing data structures and data management features similar to those in a data warehouse directly on top of low-cost cloud storage in open formats. They are what you would get if you had to redesign data warehouses in the modern world, now that cheap and highly reliable storage (in the form of object stores) is available.

A lakehouse has the following key features:

- Transaction support: In an enterprise lakehouse many data pipelines will often be reading and writing data concurrently. Support for ACID transactions ensures consistency as multiple parties concurrently read or write data, typically using SQL.
- Schema enforcement and governance: The Lakehouse should have a way to support schema enforcement and evolution, supporting DW schema architectures such as star/snowflake-schemas. The system should be able to reason about data integrity, and it should have robust governance and auditing mechanisms.
- BI support: Lakehouses enable using BI tools directly on the source data. This reduces staleness and improves recency, reduces latency, and lowers the cost of having to operationalize two copies of the data in both a data lake and a warehouse.
- Storage is decoupled from compute: In practice this means storage and compute use separate clusters, thus these systems are able to scale to many more concurrent users and larger data sizes. Some modern data warehouses also have this property.
- Openness: The storage formats they use are open and standardized, such as Parquet, and they provide an API so a variety of tools and engines, including machine learning and Python/R libraries, can efficiently access the data directly.
- Support for diverse data types ranging from unstructured to structured data: The lakehouse can be used to store, refine, analyze, and access data types needed for many new data applications, including images, video, audio, semi-structured data, and text.
- Support for diverse workloads: including data science, machine learning, and SQL and analytics. Multiple tools might be needed to support all these workloads but they all rely on the same data repository.
- End-to-end streaming: Real-time reports are the norm in many enterprises. Support for streaming eliminates the need for separate systems dedicated to serving real-time data applications.
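
The schema-enforcement feature in the list above can be illustrated with a small pure-Python sketch: a write is accepted only if every column is known and every value has the declared type. The `enforce_schema` helper and the `EXPECTED_SCHEMA` mapping are hypothetical names chosen for this example; real engines perform this check against the table's stored schema.

```python
def enforce_schema(record, schema):
    """Return the record if it matches the schema, otherwise raise ValueError."""
    extra = set(record) - set(schema)
    if extra:
        raise ValueError(f"unexpected columns: {sorted(extra)}")
    for column, expected_type in schema.items():
        value = record.get(column)  # missing columns are treated as NULL
        if value is not None and not isinstance(value, expected_type):
            raise ValueError(
                f"column {column!r}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    return record


# Hypothetical schema: column name -> Python type.
EXPECTED_SCHEMA = {"test_id": str, "download_bytes": int}

table, rejected = [], []
incoming = [
    {"test_id": "a", "download_bytes": 10},
    {"test_id": "b", "download_bytes": "fast"},   # wrong type
    {"test_id": "c", "speed": 3},                 # unknown column
]
for rec in incoming:
    try:
        table.append(enforce_schema(rec, EXPECTED_SCHEMA))
    except ValueError as err:
        rejected.append(str(err))

print(len(table), "accepted;", len(rejected), "rejected")
```

Rejecting bad records at write time keeps every downstream reader from having to defend against them.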

<center> <img src='img/warehouseVSlakeVslakehouse.png'> </center>

Benefits of a lakehouse architecture:
- Simple data model
- Easy to understand and implement
- Enables incremental ETL
- Can recreate your tables from raw data at any time
- ACID transactions, time travel

## DeltaLake Highlight Features
- [Pandas to DeltaLake](https://delta.io/blog/2022-10-15-version-pandas-dataset/)
- [Change Data feed (CDF)](https://docs.delta.io/latest/delta-change-data-feed.html)
- [Table deletes, updates, and merges](https://docs.delta.io/latest/delta-update.html)
- [Table utility commands](https://docs.delta.io/latest/delta-utility.html#history-schema)
- [TimeTravel](https://docs.delta.io/latest/quick-start.html#-read-older-versions-of-data-using-time-travel)




## Medallion Architecture (Multi-Hop)

What is a medallion architecture?
A medallion architecture is a data design pattern used to logically organize data in a lakehouse, with the goal of incrementally and progressively improving the structure and quality of data as it flows through each layer of the architecture (from Bronze ⇒ Silver ⇒ Gold layer tables). Medallion architectures are sometimes also referred to as "multi-hop" architectures.
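
As a rough illustration of the Bronze ⇒ Silver ⇒ Gold flow described above, the following pure-Python sketch pushes a few raw lines through three small functions. The function names, the record fields (`id`, `bytes`) and the sample data are all hypothetical; the notebook below does the same thing with Spark and Delta tables.

```python
import json

raw_lines = [
    '{"id": "t1", "bytes": 1000000}',
    '{"id": "t1", "bytes": 1000000}',  # duplicate delivery
    "not json",                         # corrupt line
    '{"id": "t2", "bytes": 3000000}',
]


def to_bronze(lines):
    """Bronze: keep every line as received, only noting whether it parses."""
    bronze = []
    for line in lines:
        try:
            bronze.append({"raw": line, "parsed": json.loads(line)})
        except json.JSONDecodeError:
            bronze.append({"raw": line, "parsed": None})
    return bronze


def to_silver(bronze):
    """Silver: drop unparseable rows, deduplicate by id, convert units."""
    silver = {}
    for row in bronze:
        rec = row["parsed"]
        if rec is not None and rec["id"] not in silver:
            silver[rec["id"]] = {"id": rec["id"], "mbytes": rec["bytes"] / 1_000_000}
    return list(silver.values())


def to_gold(silver):
    """Gold: a single business-level aggregate."""
    total = sum(r["mbytes"] for r in silver)
    return {"tests": len(silver), "mean_mbytes": total / len(silver)}


bronze = to_bronze(raw_lines)
silver = to_silver(bronze)
gold = to_gold(silver)
print(len(bronze), len(silver), gold)
```

Since Bronze keeps everything, Silver and Gold can be rebuilt from it at any time if a cleaning rule or an aggregate changes.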

<center> <img src='img/multihop.png'> </center>


In [1]:
import pyspark.sql.functions as F
from delta import *
from pyspark.sql.types import *

sql = lambda statement, limit=5: spark.sql(statement).limit(limit).toPandas()
minutes_to_seconds = lambda x: x * 60
run_every_n_seconds = minutes_to_seconds(10)

In [2]:
import os
import shutil
import subprocess
import sys
import threading
import time
from datetime import datetime as dt
from glob import glob

MAIN_PATH = "speed_test/"
TEMP_PATH = f"{MAIN_PATH}temp/"
RAW_PATH = f"{MAIN_PATH}raw_data/"
CHECK_POINT = f"{MAIN_PATH}_checkpoint"
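os.makedirs(RAW_PATH, exist_ok=True)
# speed_test() writes each result to TEMP_PATH first, so it must exist too.
os.makedirs(TEMP_PATH, exist_ok=True)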


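def clean_project():
    # Remove each directory independently; ignore_errors skips the ones that
    # do not exist, so one missing directory no longer hides the others.
    for path in ("spark-warehouse", "metastore_db", CHECK_POINT):
        shutil.rmtree(path, ignore_errors=True)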


clean_project()

In [3]:
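def speed_test():
    """Run one speedtest and land its JSON result in RAW_PATH."""

    def move_temp_to_raw():
        # Move any result left behind in TEMP_PATH by an interrupted run.
        for path in glob(f"{TEMP_PATH}*.json"):
            os.rename(path, path.replace(TEMP_PATH, RAW_PATH))

    now = dt.now()
    # %S is the portable seconds field; lowercase %s is a platform-specific extension.
    file_name = f"log_{now.strftime('%Y_%m_%d_%H_%M_%S')}.json"
    # Write to TEMP_PATH and move afterwards, so a reader of RAW_PATH never
    # sees a half-written file.
    subprocess.run(
        f"touch {TEMP_PATH}{file_name} && speedtest --format=json >> {TEMP_PATH}{file_name} && mv  {TEMP_PATH}{file_name} {RAW_PATH}{file_name}",
        shell=True,
        executable="/bin/bash",
    )
    move_temp_to_raw()
    # Drop empty files produced by failed runs.
    for path in glob(f"{RAW_PATH}*.json"):
        if os.path.getsize(path) == 0:
            os.remove(path)
    return now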

<center><img src='img/example.png'></center>

### Bronze layer (raw data)
The Bronze layer is where we land all the data from external source systems. The table structures in this layer correspond to the source system table structures "as-is," along with any additional metadata columns that capture the load date/time, process ID, etc. The focus in this layer is quick Change Data Capture and the ability to provide a historical archive of the source (cold storage), data lineage, auditability, and reprocessing when needed without rereading the data from the source system.
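
The "as-is plus metadata columns" idea can be sketched in a few lines of plain Python. The helper name `land_record` and the metadata column names (`_load_ts`, `_source_file`, `_process_id`) are assumptions made for this example, not names used by the notebook or by Delta Lake.

```python
import json
import os
from datetime import datetime, timezone


def land_record(raw_json, source_file):
    """Return the source payload unchanged, plus load metadata columns."""
    payload = json.loads(raw_json)
    return {
        **payload,                                            # source fields, as-is
        "_load_ts": datetime.now(timezone.utc).isoformat(),   # when it was loaded
        "_source_file": source_file,                          # lineage
        "_process_id": os.getpid(),                           # which process loaded it
    }


row = land_record('{"isp": "Example ISP", "type": "result"}', "raw_data/log_1.json")
print(sorted(row))
```

In the notebook cell for the Bronze table, `F.input_file_name()` plays the role of `_source_file`.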

In [4]:
SCHEMA = "speed_test"

spark.sql(f"CREATE SCHEMA IF NOT EXISTS {SCHEMA} ")

spark.sql(
    "set spark.databricks.delta.changeDataFeed.timestampOutOfRange.enabled = true;"
)
spark.sql("SET spark.sql.adaptive.enabled = true")

BRONZE_TABLE = f"{SCHEMA}.speed_test_logs"

spark.sql(
    f"""
  CREATE  TABLE IF NOT EXISTS {BRONZE_TABLE} 
  (`timestamp` Timestamp
  ) USING DELTA
  TBLPROPERTIES (delta.enableChangeDataFeed = true)
"""
)

SILVER_TABLE = f"{SCHEMA}.silver_speed_test_logs"

spark.sql(
    f"""
  CREATE  TABLE IF NOT EXISTS {SILVER_TABLE } 
  (`timestamp` Timestamp
  ) USING DELTA
  TBLPROPERTIES (delta.enableChangeDataFeed = true)
"""
)

GOLD_TABLE = f"{SCHEMA}.summary_by_day_of_the_week"
spark.sql(
    f"""
  CREATE  TABLE IF NOT EXISTS {GOLD_TABLE } 
  (`dayofweek` Integer
  ) USING DELTA
  TBLPROPERTIES (delta.enableChangeDataFeed = true)
"""
)

24/04/16 19:48:34 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
24/04/16 19:48:34 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
24/04/16 19:48:37 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
24/04/16 19:48:37 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore hadoop@127.0.1.1
24/04/16 19:48:37 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
24/04/16 19:48:37 WARN ObjectStore: Failed to get database speed_test, returning NoSuchObjectException
24/04/16 19:48:37 WARN ObjectStore: Failed to get database speed_test, returning NoSuchObjectException
24/04/16 19:48:37 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
24/04/16 19:48:37 WARN ObjectStore: Failed to get database speed_test, returnin

DataFrame[]

In [5]:
sql(f"DESCRIBE HISTORY  {BRONZE_TABLE}")

|   | version | timestamp | userId | userName | operation | operationParameters | job | notebook | clusterId | readVersion | isolationLevel | isBlindAppend | operationMetrics | userMetadata | engineInfo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2024-04-16 19:28:33.491 |  |  | CREATE TABLE | {'description': None, 'partitionBy': '[]', 'pr... |  |  |  |  | Serializable | True | {} |  | Apache-Spark/3.5.1 Delta-Lake/3.1.0 |


In [5]:
sql(f"SELECT * FROM {BRONZE_TABLE}")

Unnamed: 0,timestamp


In [7]:
def update_bronze_table():

    streaming_logs_schema = StructType(
        [
            StructField(
                "download",
                StructType(
                    [
                        StructField("bandwidth", LongType(), True),
                        StructField("bytes", LongType(), True),
                        # The speedtest CLI reports this field as "elapsed".
                        StructField("elapsed", LongType(), True),
                        StructField("latency", StringType(), True),
                    ]
                ),
                True,
            ),
            StructField(
                "ping",
                StructType(
                    [
                        StructField("jitter", FloatType(), True),
                        StructField("latency", FloatType(), True),
                        StructField("low", FloatType(), True),
                        StructField("high", FloatType(), True),
                    ]
                ),
                True,
            ),
            StructField("isp", StringType(), True),
            StructField(
                "result",
                StructType(
                    [
                        StructField("id", StringType(), True),
                        StructField("url", StringType(), True),
                        StructField("persisted", BooleanType(), True),
                    ]
                ),
                True,
            ),
            StructField(
                "server",
                StructType(
                    [
                        StructField("id", StringType(), True),
                        StructField("host", StringType(), True),
                        StructField("port", LongType(), True),
                        StructField("name", StringType(), True),
                        StructField("location", StringType(), True),
                        StructField("country", StringType(), True),
                        StructField("ip", StringType(), True),
                    ]
                ),
                True,
            ),
            StructField("timestamp", TimestampType(), True),
            StructField("type", StringType(), True),
            StructField(
                "upload",
                StructType(
                    [
                        StructField("bandwidth", LongType(), True),
                        StructField("bytes", LongType(), True),
                        # The speedtest CLI reports this field as "elapsed".
                        StructField("elapsed", LongType(), True),
                        StructField("latency", StringType(), True),
                    ]
                ),
                True,
            ),
        ]
    )

    streaming_logs_sdf = (
        spark.readStream.option("ignoreCorruptFiles", "true")
        .schema(streaming_logs_schema)
        .json(RAW_PATH)
        .where(F.col("timestamp").isNotNull())
        .withColumn("log_file", F.input_file_name())
    )

    (
        streaming_logs_sdf.writeStream.format("delta")
        .outputMode("append")
        .option("checkpointLocation", CHECK_POINT)
        .option("overwriteSchema", "true")
        .option("mergeSchema", "true")
        .trigger(availableNow=True)
        .toTable(BRONZE_TABLE)
        .awaitTermination()
    )
    print("BRONZE UPDATED")

In [9]:
speed_test()
update_bronze_table()

24/04/16 19:57:40 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


BRONZE UPDATED


24/04/16 19:57:41 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
24/04/16 19:57:41 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
24/04/16 19:57:41 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist


In [11]:
spark.sql(f""" SELECT * FROM {BRONZE_TABLE} ORDER BY timestamp DESC LIMIT 5 """).show()

+-------------------+--------------------+--------------------+-----------+--------------------+--------------------+------+--------------------+--------------------+
|          timestamp|            download|                ping|        isp|              result|              server|  type|              upload|            log_file|
+-------------------+--------------------+--------------------+-----------+--------------------+--------------------+------+--------------------+--------------------+
|2024-04-16 19:57:40|{60520311, 841629...|{5.699, 9.402, 6....|Tigo Panama|{dbd152a2-7ed4-41...|{3416, velocidad....|result|{1645309, 1013386...|file:///home/hado...|
|2024-04-16 19:55:15|{60483522, 872522...|{4.574, 8.196, 5....|Tigo Panama|{e2868f6a-1a5e-41...|{3416, velocidad....|result|{1641571, 8091320...|file:///home/hado...|
|2024-04-16 19:29:36|{57027795, 462588...|{1.738, 13.917, 1...|Tigo Panama|{05079732-882c-4d...|{55647, speedtest...|result|{1869349, 6766504...|file:///home/hado...

### Silver layer (cleansed and conformed data)
In the Silver layer of the lakehouse, the data from the Bronze layer is matched, merged, conformed and cleansed ("just-enough") so that the Silver layer can provide an "Enterprise view" of all its key business entities, concepts and transactions (e.g. master customers, stores, non-duplicated transactions and cross-reference tables).

The Silver layer brings the data from different sources into an Enterprise view and enables self-service analytics for ad-hoc reporting, advanced analytics and ML. It serves as a source for Departmental Analysts, Data Engineers and Data Scientists to further create projects and analysis to answer business problems via enterprise and departmental data projects in the Gold Layer.

In the lakehouse data engineering paradigm, typically the ELT methodology is followed vs. ETL - which means only minimal or "just-enough" transformations and data cleansing rules are applied while loading the Silver layer. Speed and agility to ingest and deliver the data in the data lake is prioritized, and a lot of project-specific complex transformations and business rules are applied while loading the data from the Silver to Gold layer. From a data modeling perspective, the Silver Layer has more 3rd-Normal Form like data models. Data Vault-like, write-performant data models can be used in this layer.
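
The two "just-enough" steps used for the Silver table below, deriving a few columns and inserting only rows that are not already present, can be sketched without Spark. `merge_insert_only` is a hypothetical helper that mimics an insert-only merge on a key; as a simplification it also skips duplicate keys inside the incoming batch.

```python
def part_of_the_day(hour):
    """Bucket an hour (0-23) using the same boundaries as the Silver table."""
    if 5 <= hour < 12:
        return "Morning"
    if 12 <= hour < 17:
        return "Afternoon"
    if 17 <= hour < 21:
        return "Evening"
    return "Night"


def merge_insert_only(sink, source, key):
    """Append source rows whose key is absent from sink; leave matches untouched."""
    seen = {row[key] for row in sink}
    inserted = 0
    for row in source:
        if row[key] not in seen:
            sink.append(row)
            seen.add(row[key])
            inserted += 1
    return inserted


sink = [{"test_id": "a", "hour": 9}]
batch = [{"test_id": "a", "hour": 9}, {"test_id": "b", "hour": 20}]
inserted = merge_insert_only(sink, batch, key="test_id")

for row in sink:
    row["part_of_the_day"] = part_of_the_day(row["hour"])
print(inserted, sink)
```

Keying the merge on a stable identifier makes the load idempotent: replaying the same batch inserts nothing the second time.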

In [16]:
def update_silver_table():
    bronze_delta_table = DeltaTable.forName(spark, BRONZE_TABLE)
    silver_delta_table = DeltaTable.forName(spark, SILVER_TABLE)

    def clean_bronze_table(sdf):
        sdf = (
            sdf.select(
                [
                    F.col("result").getItem("id").alias("test_id"),
                    F.col("timestamp"),
                    F.dayofweek(F.col("timestamp")).alias("dayofweek"),
                    F.to_date("timestamp").alias("date"),
                    F.hour("timestamp").alias("hour"),
                    "download",
                    "upload",
                    "isp",
                    F.col("server").getItem("name").alias("server_name"),
                    F.col("server").getItem("ip").alias("server_ip"),
                    F.col("server").getItem("location").alias("server_location"),
                    F.col("server").getItem("country").alias("server_country"),
                ]
            )
            .withColumn(
                "part_of_the_day",
                (
                    F.when((F.col("hour") >= 5) & (F.col("hour") < 12), "Morning")
                    .when((F.col("hour") >= 12) & (F.col("hour") < 17), "Afternoon")
                    .when((F.col("hour") >= 17) & (F.col("hour") < 21), "Evening")
                    .otherwise("Night")
                ),
            )
            # Megabytes transferred during each test (a volume, not a rate).
            .withColumn("download_Mbytes", F.col("download").getItem("bytes") / 1000000)
            .withColumn("upload_Mbytes", F.col("upload").getItem("bytes") / 1000000)
            .drop("download")
            .drop("upload")
        )
        return sdf

    if silver_delta_table.toDF().limit(1).count() == 0:

        sdf = clean_bronze_table(spark.read.format("delta").table(BRONZE_TABLE))
        (
            sdf.write.format("delta")
            .mode("append")
            .option("mergeSchema", "true")
            .saveAsTable(SILVER_TABLE)
        )
        print("SILVER FIRST LOAD")

    else:
        bronze_last_update = (
            bronze_delta_table.history()
            .where('operation != "CREATE TABLE"')
            .select(F.max("timestamp").alias("bronze_last_update"))
            .collect()[0][0]
        )
        silver_last_update = (
            silver_delta_table.history()
            .where('operation != "CREATE TABLE"')
            .select(F.max("timestamp").alias("silver_last_update"))
            .collect()[0][0]
        )
        if bronze_last_update > silver_last_update:
            sdf = clean_bronze_table(
                spark.read.format("delta")
                .option("readChangeFeed", "true")
                .option("startingTimestamp", str(silver_last_update))
                .table(BRONZE_TABLE)
            )
            silver_delta_table.alias("sink").merge(
                sdf.alias("source"), "source.test_id = sink.test_id"
            ).whenNotMatchedInsertAll().execute()
            print("SILVER UPDATED")
        else:
            print("No updates")

In [17]:
update_silver_table()

SILVER FIRST LOAD


In [21]:
sql(f"DESCRIBE HISTORY {SILVER_TABLE}").operationMetrics.iloc[0]

{'numOutputRows': '111', 'numOutputBytes': '45507', 'numFiles': '9'}

In [23]:
speed_test()
update_bronze_table()
update_silver_table()

24/04/16 20:08:45 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
24/04/16 20:08:45 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
24/04/16 20:08:45 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
24/04/16 20:08:45 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist


BRONZE UPDATED
SILVER UPDATED


In [24]:
sql(f"SELECT count(*) FROM {SILVER_TABLE}")

|   | count(1) |
|---|---|
| 0 | 112 |


In [25]:
sql(f"DESCRIBE HISTORY {SILVER_TABLE}")['operationParameters'].iloc[0]

{'matchedPredicates': '[]',
 'predicate': '["(test_id#10804 = test_id#10488)"]',
 'notMatchedBySourcePredicates': '[]',
 'notMatchedPredicates': '[{"actionType":"insert"}]'}

In [27]:
sql(f"SELECT * FROM {SILVER_TABLE} ORDER BY TIMESTAMP DESC LIMIT 5")

|   | timestamp | test_id | dayofweek | date | hour | isp | server_name | server_ip | server_location | server_country | part_of_the_day | download_Mbytes | upload_Mbytes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-04-16 20:08:45 | 8c71e0aa-e72d-4e27-bfcc-01e7e7d34721 | 3 | 2024-04-16 | 20 | Tigo Panama | Pacific Network Communication S.A. | 45.181.68.210 | Panama City | Panama | Evening | 388.02056 | 12.929192 |
| 1 | 2024-04-16 19:57:40 | dbd152a2-7ed4-4122-bc27-bbb15850e494 | 3 | 2024-04-16 | 19 | Tigo Panama | Tigo | 200.124.21.230 | Panama City | Panama | Evening | 841.629537 | 10.13386 |
| 2 | 2024-04-16 19:55:15 | e2868f6a-1a5e-4104-b737-d24cb1e6bf05 | 3 | 2024-04-16 | 19 | Tigo Panama | Tigo | 200.124.21.230 | Panama City | Panama | Evening | 872.52228 | 8.09132 |
| 3 | 2024-04-16 19:29:36 | 05079732-882c-4dbf-9390-f231e8f8f127 | 3 | 2024-04-16 | 19 | Tigo Panama | TopManage | 200.71.81.11 | Panama City | Panama | Evening | 462.588216 | 6.766504 |
| 4 | 2024-04-16 19:29:15 | dd01b7b9-eb7d-411e-8071-fee4dafc0ad2 | 3 | 2024-04-16 | 19 | Tigo Panama | Ufinet | 186.148.105.30 | Panama | Panama | Evening | 490.020576 | 7.48616 |


### Gold layer (curated business-level tables)
Data in the Gold layer of the lakehouse is typically organized in consumption-ready, "project-specific" databases. The Gold layer is for reporting and uses more denormalized, read-optimized data models with fewer joins. The final layer of data transformations and data quality rules is applied here. The final presentation layer of projects such as Customer Analytics, Product Quality Analytics, Inventory Analytics, Customer Segmentation, Product Recommendations, and Marketing/Sales Analytics fits in this layer. Many Kimball-style star-schema data models and Inmon-style data marts fit in the Gold layer of the lakehouse.

So you can see that the data is curated as it moves through the different layers of a lakehouse. In some cases, a lot of data marts and EDWs from the traditional RDBMS technology stack are also ingested into the lakehouse, so that for the first time enterprises can do "pan-EDW" advanced analytics and ML, which was not possible or too cost-prohibitive on a traditional stack (e.g. IoT/manufacturing data is tied with sales and marketing data for defect analysis, or health care genomics and EMR/HL7 clinical data markets are tied with financial claims data to create a Healthcare Data Lake for timely and improved patient care analytics).
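
A Gold table that is updated incrementally has to fold each new batch of aggregates into the running ones. The pure-Python sketch below shows one way to do that while staying safe around missing values (a day with no tests yet). `merge_daily_summary` and the field names `total_mb` and `mean_mb` are hypothetical and only loosely mirror the notebook's columns.

```python
def merge_daily_summary(sink, increment):
    """Fold an incremental per-day summary into the running one, in place."""
    for day, new in increment.items():
        if not new["no_test"]:
            # No new tests for this day: keep the existing row, or create an empty one.
            sink.setdefault(day, {"no_test": None, "total_mb": None, "mean_mb": None})
            continue
        old = sink.get(day, {})
        # Treat a missing running total as zero instead of letting it poison the sum.
        no_test = (old.get("no_test") or 0) + new["no_test"]
        total_mb = (old.get("total_mb") or 0.0) + new["total_mb"]
        # The mean is recomputed from the merged totals, never averaged from means.
        sink[day] = {"no_test": no_test, "total_mb": total_mb, "mean_mb": total_mb / no_test}
    return sink


gold = {
    1: {"no_test": 2, "total_mb": 10.0, "mean_mb": 5.0},
    4: {"no_test": None, "total_mb": None, "mean_mb": None},
}
increment = {
    1: {"no_test": None, "total_mb": None},  # nothing new on day 1
    3: {"no_test": 1, "total_mb": 4.0},      # first tests on day 3
    4: {"no_test": 2, "total_mb": 6.0},      # day 4 had no data before
}
merge_daily_summary(gold, increment)
print(gold)
```

Storing the count and the totals alongside the mean is what makes the incremental update possible: a mean alone cannot be combined with new data.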

In [28]:
def create_dayofweek_names():
    from deltalake import write_deltalake
    from pandas import DataFrame

    # Spark's dayofweek() numbers the days 1 (Sunday) through 7 (Saturday).
    index = [1, 2, 3, 4, 5, 6, 7]
    esp = ["Domingo", "Lunes", "Marte", "Miercoles", "Jueve", "Viernes", "Sabado"]
    eng = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]

    # mode="overwrite" keeps this cell re-runnable; the default mode raises an
    # error when the table already exists.
    write_deltalake(
        "spark-warehouse/speed_test.db/day_of_the_week",
        DataFrame(data={"dayofweek": index, "esp": esp, "eng": eng}),
        mode="overwrite",
    )


create_dayofweek_names()

In [29]:
def summary_test_by(sdf, group_columns):

    return (
        sdf.groupby(group_columns)
        .agg(
            F.countDistinct("test_id").alias("no_test"),
            F.sum("download_Mbytes").alias("total_download_mbytes_recored"),
            F.sum("upload_Mbytes").alias("total_upload_mbytes_recored"),
        )
        .withColumn(
            "mean_download_Mbytes",
            F.col("total_download_mbytes_recored") / F.col("no_test"),
        )
        .withColumn(
            "mean_upload_Mbytes",
            F.col("total_upload_mbytes_recored") / F.col("no_test"),
        )
    )


def update_gold_table():
    dayofweek_sdf = (
        spark.read.format("delta")
        .load("spark-warehouse/speed_test.db/day_of_the_week")
        .withColumn("dayofweek", F.col("dayofweek").cast(IntegerType()))
    )
    gold_delta_table = DeltaTable.forName(spark, GOLD_TABLE)
    if gold_delta_table.toDF().limit(1).count() > 0:
        silver_delta_table = DeltaTable.forName(spark, SILVER_TABLE)
        silver_last_update = (
            silver_delta_table.history()
            .where('operation != "CREATE TABLE"')
            .select(F.max("timestamp").alias("silver_last_update"))
            .collect()[0][0]
        )
        gold_last_update = (
            gold_delta_table.history()
            .where('operation != "CREATE TABLE"')
            .select(F.max("timestamp").alias("gold_last_update"))
            .collect()[0][0]
        )
        if silver_last_update > gold_last_update:
            sdf = summary_test_by(
                spark.read.format("delta")
                .option("readChangeFeed", "true")
                .option("startingTimestamp", str(gold_last_update))
                .table(SILVER_TABLE),
                "dayofweek",
            ).join(dayofweek_sdf, on="dayofweek", how="right")
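            # Running totals. coalesce guards against NULLs left in the sink by
            # the right join, because NULL + x is NULL in SQL.
            no_test = "source.no_test + coalesce(sink.no_test, 0)"
            download = "source.total_download_mbytes_recored + coalesce(sink.total_download_mbytes_recored, 0)"
            upload = "source.total_upload_mbytes_recored + coalesce(sink.total_upload_mbytes_recored, 0)"
            (
                gold_delta_table.alias("sink")
                .merge(sdf.alias("source"), "source.dayofweek = sink.dayofweek")
                .whenNotMatchedInsertAll()
                # Skip days without new tests: their source aggregates are NULL.
                .whenMatchedUpdate(
                    condition="source.no_test IS NOT NULL",
                    set={
                        "no_test": no_test,
                        "total_download_mbytes_recored": download,
                        "total_upload_mbytes_recored": upload,
                        "mean_download_Mbytes": f"({download}) / ({no_test})",
                        "mean_upload_Mbytes": f"({upload}) / ({no_test})",
                    },
                )
                .execute()
            )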
            print("GOLD UPDATED")
        else:
            print("GOLD No Updates")

    else:
        silver_sdf = spark.read.format("delta").table(SILVER_TABLE)
        summary_test_by(silver_sdf, "dayofweek").join(
            dayofweek_sdf, on="dayofweek", how="right"
        ).write.format("delta").mode("append").option(
            "mergeSchema", "true"
        ).saveAsTable(
            GOLD_TABLE
        )
        print("Gold Table first load")


update_gold_table()

Gold Table first load


In [30]:
spark.sql(f"Describe HISTORY {GOLD_TABLE}").toPandas()

|   | version | timestamp | userId | userName | operation | operationParameters | job | notebook | clusterId | readVersion | isolationLevel | isBlindAppend | operationMetrics | userMetadata | engineInfo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2024-04-16 20:11:16.036 |  |  | WRITE | {'mode': 'Append', 'partitionBy': '[]'} |  |  |  | 0.0 | Serializable | True | {'numOutputRows': '7', 'numOutputBytes': '2749... |  | Apache-Spark/3.5.1 Delta-Lake/3.1.0 |
| 1 | 0 | 2024-04-16 19:48:42.656 |  |  | CREATE TABLE | {'description': None, 'partitionBy': '[]', 'pr... |  |  |  |  | Serializable | True | {} |  | Apache-Spark/3.5.1 Delta-Lake/3.1.0 |


In [31]:
spark.sql(f"select * FROM {GOLD_TABLE} ORDER BY dayofweek").toPandas()

|   | dayofweek | no_test | total_download_mbytes_recored | total_upload_mbytes_recored | mean_download_Mbytes | mean_upload_Mbytes | esp | eng |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 18.0 | 9044.284122 | 145.268592 | 502.460229 | 8.070477 | Domingo | Sunday |
| 1 | 2 | 17.0 | 9046.792024 | 123.835236 | 532.164237 | 7.284426 | Lunes | Monday |
| 2 | 3 | 37.0 | 20834.215407 | 287.315924 | 563.086903 | 7.765295 | Marte | Tuesday |
| 3 | 4 |  |  |  |  |  | Miercoles | Wednesday |
| 4 | 5 |  |  |  |  |  | Jueve | Thursday |
| 5 | 6 | 1.0 | 456.704708 | 6.7671 | 456.704708 | 6.7671 | Viernes | Friday |
| 6 | 7 | 39.0 | 19476.615979 | 467.441368 | 499.40041 | 11.985676 | Sabado | Saturday |


In [32]:
speed_test()
update_bronze_table()
update_silver_table()
update_gold_table()

24/04/16 20:11:51 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
24/04/16 20:11:51 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
24/04/16 20:11:51 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
24/04/16 20:11:51 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist


BRONZE UPDATED
SILVER UPDATED
GOLD UPDATED


In [36]:
spark.read.format("delta").option("readChangeFeed", "true").option(
    "startingVersion", 1
).table(GOLD_TABLE).orderBy("_commit_timestamp", ascending=False).toPandas()

|   | dayofweek | no_test | total_download_mbytes_recored | total_upload_mbytes_recored | mean_download_Mbytes | mean_upload_Mbytes | esp | eng | _change_type | _commit_version | _commit_timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 18.0 | 9044.284122 | 145.268592 | 502.460229 | 8.070477 | Domingo | Sunday | update_preimage | 2 | 2024-04-16 20:11:54.466 |
| 1 | 1 |  |  |  | 502.460229 | 8.070477 | Domingo | Sunday | update_postimage | 2 | 2024-04-16 20:11:54.466 |
| 2 | 2 | 17.0 | 9046.792024 | 123.835236 | 532.164237 | 7.284426 | Lunes | Monday | update_preimage | 2 | 2024-04-16 20:11:54.466 |
| 3 | 2 |  |  |  | 532.164237 | 7.284426 | Lunes | Monday | update_postimage | 2 | 2024-04-16 20:11:54.466 |
| 4 | 3 | 37.0 | 20834.215407 | 287.315924 | 563.086903 | 7.765295 | Marte | Tuesday | update_preimage | 2 | 2024-04-16 20:11:54.466 |
| 5 | 3 | 38.0 | 21359.073727 | 295.211604 | 563.086903 | 7.765295 | Marte | Tuesday | update_postimage | 2 | 2024-04-16 20:11:54.466 |
| 6 | 4 |  |  |  |  |  | Miercoles | Wednesday | update_preimage | 2 | 2024-04-16 20:11:54.466 |
| 7 | 4 |  |  |  |  |  | Miercoles | Wednesday | update_postimage | 2 | 2024-04-16 20:11:54.466 |
| 8 | 5 |  |  |  |  |  | Jueve | Thursday | update_preimage | 2 | 2024-04-16 20:11:54.466 |
| 9 | 5 |  |  |  |  |  | Jueve | Thursday | update_postimage | 2 | 2024-04-16 20:11:54.466 |
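
The `_change_type` values in the change feed output above can be understood with a small pure-Python sketch that compares two snapshots of a table. The `change_feed` helper is hypothetical, and it derives changes by diffing, which is a simplification: the real change data feed records what each commit wrote, so a row that a merge rewrites with identical values still appears as a pre-image/post-image pair.

```python
def change_feed(before, after, key):
    """Diff two snapshots (lists of dicts) into change-feed style rows."""
    old = {row[key]: row for row in before}
    new = {row[key]: row for row in after}
    changes = []
    for k in sorted(old.keys() | new.keys()):
        if k not in old:
            changes.append({**new[k], "_change_type": "insert"})
        elif k not in new:
            changes.append({**old[k], "_change_type": "delete"})
        elif old[k] != new[k]:
            # An update is reported twice: the row before and the row after.
            changes.append({**old[k], "_change_type": "update_preimage"})
            changes.append({**new[k], "_change_type": "update_postimage"})
    return changes


version_1 = [{"dayofweek": 3, "no_test": 37}, {"dayofweek": 6, "no_test": 1}]
version_2 = [
    {"dayofweek": 3, "no_test": 38},  # updated
    {"dayofweek": 6, "no_test": 1},   # unchanged
    {"dayofweek": 7, "no_test": 2},   # inserted
]

changes = change_feed(version_1, version_2, key="dayofweek")
for row in changes:
    print(row)
```

A downstream consumer that only needs the new state of each row can keep the `insert` and `update_postimage` rows and ignore the rest.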


## Dashboard

In [37]:
plot_bgcolor = "#2c292d"
# paper_bgcolor ="#211f22"
paper_bgcolor = "#1a1d21"
download_color = "#ab9df2"
upload_color = "#78dce8"
default_fontcolor = "white"

import dash
import dash_bootstrap_components as dbc
import plotly.graph_objects as go
from dash import dcc, html
from dash.dependencies import Input, Output
from pandas import melt
from plotly.subplots import make_subplots


def line_chart_download_vs_upload(fig, df):

    x = df.timestamp

    fig.add_trace(
        go.Scatter(
            x=x, y=df.upload_Mbytes, name="Upload (Mbps)", line=dict(color=upload_color)
        ),
        row=1,
        col=2,
    )
    fig.update_yaxes(
        title_text="<b>Upload (Mbps)</b>",
        color=upload_color,
        rangemode="tozero",
        showgrid=False,
        row=1,
        col=2,
    )

    fig.add_trace(
        go.Scatter(
            x=x,
            y=df.download_Mbytes,
            name="Download (Mbps)",
            line=dict(color=download_color),
        ),
        row=2,
        col=2,
    )

    fig.update_yaxes(
        title_text="<b>Download (Mbps)</b>",
        color=download_color,
        rangemode="tozero",
        showgrid=False,
        row=2,
        col=2,
    )

    fig.update_xaxes(
        showgrid=False,
    )

    fig.update_layout(
        plot_bgcolor=plot_bgcolor,
        paper_bgcolor=paper_bgcolor,
        font=dict(color="white"),
        legend=dict(orientation="h", yanchor="bottom", y=-0.15, xanchor="right", x=1),
    )


def gauges_indicators(fig, value):

    def gauge_chart(value, steps, title, color):
        max_step = steps[-1][-1]
        title = f"{title} <span style='font-size:0.8em;color:gray'>MBps</span><br><span style='font-size:0.5em;color:gray'>Average</span>"
        gauge = go.Indicator(
            mode="gauge+number+delta",
            value=value,
            domain={"x": [0.25, 0.55], "y": [0.25, 0.55]},
            title={"text": title, "font": {"size": 25}, "align": "center"},
            delta={
                "reference": steps[-1][0],
                "font": {"size": 13},
                "increasing": {"color": color},
            },
            number={"font": {"size": 25}},
            gauge={
                "axis": {
                    "range": [None, max_step],
                    "tickwidth": 2,
                    "tickcolor": plot_bgcolor,
                },
                "bar": {"color": color},
                "bgcolor": "white",
                "borderwidth": 2,
                "bordercolor": "gray",
                "steps": [
                    {"range": steps[0], "color": "#ff6188"},
                    {"range": steps[1], "color": "#fc9867"},
                    {"range": steps[2], "color": "#a9dc76"},
                ],
            },
        )

        return gauge

    fig.add_trace(
        gauge_chart(
            value["upload_Mbytes"],
            steps=[[0, 10], [10, 15], [15, 20]],
            title=f"<span style='font-size:0.8em;color:{upload_color}'>Upload</span>",
            color=upload_color,
        ),
        row=1,
        col=1,
    )

    fig.add_trace(
        gauge_chart(
            value["download_Mbytes"],
            steps=[[0, 450], [450, 600], [600, 700]],
            title=f"<span style='font-size:0.8em;color:{download_color}'>Download</span>",
            color=download_color,
        ),
        row=2,
        col=1,
    )

    fig.update_layout(
        paper_bgcolor=paper_bgcolor,
        font={"color": "white", "family": "Arial"},
        showlegend=False,
    )
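    # Restrict the font overrides to the gauges; other trace types have no number/delta.
    fig.update_traces(
        number=dict(font=dict(size=28)),
        delta=dict(font=dict(size=25)),
        selector=dict(type="indicator"),
    )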


def Heatmaps():
    df = (
        spark.read.format("delta")
        .table(GOLD_TABLE)
        .toPandas()
        .set_index("dayofweek")
        # Cap download at 600 and upload at 15, then scale both to the 0..1 range
        .assign(download=lambda x: x["mean_download_Mbytes"].clip(upper=600) / 600)
        .assign(upload=lambda x: x["mean_upload_Mbytes"].clip(upper=15) / 15)[
            ["download", "upload", "eng"]
        ]
    ).sort_index()
    df = melt(df, id_vars=["eng"], ignore_index=False).fillna(0)
    fig = go.Figure(
        data=go.Heatmap(
            x=df.eng,
            z=df["value"],
            y=df["variable"],
            colorscale="Spectral",
            zmax=1,
            zmin=0,
        )
    )
    fig.layout.update(
        paper_bgcolor=paper_bgcolor,
        font={"color": "white", "family": "Arial"},
        height=300,
        margin=dict(l=0, r=0, b=20, t=10),
    )
    return fig


def multiplot_speedtest(df):

    fig = make_subplots(
        rows=2,
        cols=2,
        specs=[[{"type": "domain"}, {}], [{"type": "domain"}, {}]],
        column_widths=[0.30, 0.70],
        row_heights=[0.25, 0.25],
        horizontal_spacing=0.15,
        vertical_spacing=0.15,
    )

    # Average the three most recent measurements to feed the gauges
    values = df[["download_Mbytes", "upload_Mbytes"]].iloc[-3:].mean()
    gauges_indicators(fig, values)
    line_chart_download_vs_upload(fig, df)
    fig.update_layout(height=550, margin=dict(l=35, r=35, b=30, t=55))

    return fig


def register_Callback(app):
    @app.callback(
        Output("stream_line_chart", "figure"),
        [
            Input("interval-component", "n_intervals"),
        ],
    )
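    def streamFig(intervals):
        # Read the last 60 minutes from the silver table, keep the 10 newest rows,
        # and return them in chronological order for plotting.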
        df = (
            spark.read.table(SILVER_TABLE)
            .where(
                F.col("timestamp")
                >= (F.current_timestamp() - F.expr("INTERVAL 60 minutes"))
            )
            .orderBy("timestamp", ascending=False)
            .limit(10)
            .toPandas()
            .sort_values("timestamp", ascending=True)
        )
        return multiplot_speedtest(df)

    @app.callback(
        Output("heatmaps", "figure"),
        [
            Input("interval-component", "n_intervals"),
        ],
    )
    def heatMaps(intervals):
        return Heatmaps()


config = {"displaylogo": False, "scrollZoom": False, "displayModeBar": False}

updates = dcc.Interval(
    id="interval-component",
    interval=10000,  # in milliseconds
    n_intervals=0,
)


navbar = dbc.Navbar(
    dbc.Container(
        [
            html.A(
                # Use row and col to control vertical alignment of logo / brand
                dbc.Row(
                    [
                        dbc.Col(
                            html.Img(
                                src="https://www.pinclipart.com/picdir/big/491-4917274_panama-flag-png-palestine-flag-vector-clipart.png",
                                height="30px",
                            )
                        ),
                        dbc.Col(
                            dbc.NavbarBrand(
                                "Network Speed Test by Jose Quesada", className="ms-2"
                            )
                        ),
                    ],
                    align="center",
                    className="g-0",
                ),
                href="https://plotly.com",
                style={"textDecoration": "none"},
            ),
            dbc.NavbarToggler(id="navbar-toggler", n_clicks=0),
        ]
    ),
    color=paper_bgcolor,
    dark=True,
)


streaming_col = dbc.Col(dcc.Graph(id="stream_line_chart", config=config))
heatmap_col = dbc.Col(dcc.Graph(id="heatmaps"))

layout = dbc.Container(
    [
        navbar,
        dbc.Container(
            [
                updates,
                dcc.Store(id="last_32hrs"),
                dbc.Row(streaming_col),
                dbc.Row(heatmap_col),
            ],
            style={"backgroundColor": paper_bgcolor, "color": default_fontcolor},
        ),
    ]
)

app = dash.Dash(
    # The layout uses Bootstrap 5 utility classes ("ms-2", "g-0"),
    # so load the Bootstrap 5 theme bundled with dash-bootstrap-components.
    external_stylesheets=[dbc.themes.BOOTSTRAP],
)
# app.config.suppress_callback_exceptions = True
app.layout = layout
register_Callback(app)
app.run(jupyter_mode="external")

Dash app running on http://127.0.0.1:8050/


<img src='img/dashboard.png'>
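
The capping and scaling that `Heatmaps()` applies before plotting can be checked on its own, without Spark or pandas. The following is a minimal sketch, assuming the same caps as the code above (600 for download, 15 for upload). `normalize_speed` is a hypothetical helper that is not part of the dashboard code, and mapping missing values to 0 is an assumption that mirrors the `fillna(0)` call.

```python
import math

# Caps taken from the Heatmaps() code above: 600 for download, 15 for upload.
DOWNLOAD_CAP = 600.0
UPLOAD_CAP = 15.0


def normalize_speed(value, cap):
    """Clip a mean speed at `cap` and scale it to 0..1 (hypothetical helper)."""
    if value is None or math.isnan(value):
        return 0.0  # assumption: mirrors the fillna(0) applied after melt
    return min(value, cap) / cap


print(normalize_speed(450.0, DOWNLOAD_CAP))  # 0.75
print(normalize_speed(900.0, DOWNLOAD_CAP))  # 1.0, values above the cap saturate
print(normalize_speed(float("nan"), UPLOAD_CAP))  # 0.0
```

Because every value ends up between 0 and 1, the heatmap can use fixed `zmin=0` and `zmax=1` bounds, so download and upload share one color scale even though their raw ranges differ.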

