# PySpark 4.0 → DataHub (acryl-spark)

[![PySpark](https://img.shields.io/badge/PySpark-4.0-262A38?style=flat-square&logo=apachespark&logoColor=E36B22&labelColor=262A38)](https://spark.apache.org/docs/4.0.2/api/python/user_guide/index.html)
[![Scala](https://img.shields.io/badge/Scala-2.13-262A38?style=flat-square&logo=scala&logoColor=E03E3C&labelColor=262A38)](https://sdkman.io/)
[![JDK](https://img.shields.io/badge/JDK-17-35667C?style=flat&logo=openjdk&logoColor=FFFFFF&labelColor=1D213B)](https://sdkman.io/)
[![Acryl-Spark](https://img.shields.io/badge/acryl--spark--lineage-262A38?style=flat-square&logo=lineageos&logoColor=73A4BC&labelColor=262A38)](https://docs.datahub.com/docs/metadata-integration/java/acryl-spark-lineage)

Reads the staging Hacker News RSS tables (`stg_hot_articles`, `stg_newest_articles`) from BigQuery, combines them,
and writes the result to the `articles_spark` table in the `hackernews_rss` dataset.

**Notes:**
This notebook requires **Spark 4.0**, because its dependencies are tied to the Spark version and to the Scala version (2.13) bundled with it.

**Known limitation:**
- The **Spark 4.x** variant (`acryl-spark-lineage_2.13:0.2.19-rc4`) emits only **input** lineage to DataHub (the datasets read); the **output** dataset (`articles_spark`) is missing.
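
When checking in DataHub whether the output dataset arrived, it can help to compute the URN you expect to find. The sketch below assumes DataHub's usual dataset URN layout, `urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)`, with the fully qualified BigQuery table as the name and `PROD` as the environment. `bigquery_dataset_urn` is a hypothetical helper and not part of acryl-spark-lineage; the environment used by a given deployment may differ.

```python
def bigquery_dataset_urn(project: str, dataset: str, table: str, env: str = "PROD") -> str:
    """Build the DataHub dataset URN expected for a BigQuery table.

    Assumption: the standard dataset URN layout with platform "bigquery"
    and "<project>.<dataset>.<table>" as the dataset name.
    """
    name = f"{project}.{dataset}.{table}"
    return f"urn:li:dataset:(urn:li:dataPlatform:bigquery,{name},{env})"


# Project, dataset, and table names are the ones used in this notebook.
expected_output_urn = bigquery_dataset_urn("iobruno-gcp-labs", "hackernews_rss", "articles_spark")
print(expected_output_urn)
```

Searching DataHub for this URN after the write cell has run is one way to confirm whether output lineage was emitted.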

### SparkSession Setup

In [1]:
from pyspark.sql import SparkSession

In [2]:
# Maven coordinates (group:artifact:version) passed to spark.jars.packages
spark_jars = ",".join([
    "com.google.cloud.spark:spark-4.0-bigquery:0.44.0",
    "io.acryl:acryl-spark-lineage_2.13:0.2.19-rc4",
])
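
Since `spark.jars.packages` expects comma-separated Maven coordinates of the form `group:artifact:version`, a quick sanity check before building the session can catch a malformed entry early. This is a minimal sketch; `invalid_coordinates` is a hypothetical helper, and it only checks the shape of each entry, not whether the artifact exists.

```python
def invalid_coordinates(packages: str) -> list[str]:
    """Return the entries that are not of the form group:artifact:version."""
    bad = []
    for entry in packages.split(","):
        entry = entry.strip()
        parts = entry.split(":")
        # A valid coordinate has exactly three non-empty parts.
        if len(parts) != 3 or not all(parts):
            bad.append(entry)
    return bad


# Illustrative input: the second entry lacks a group and artifact id.
sample = "com.google.cloud.spark:spark-4.0-bigquery:0.44.0,0.2.19-rc4:0.2.19-rc4"
print(invalid_coordinates(sample))
```

Calling `invalid_coordinates(spark_jars)` before `SparkSession.builder` and failing fast on a non-empty result avoids a slower, less obvious Ivy resolution error.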

In [3]:
spark = (
    SparkSession.builder
    .appName("jupyter-acryl-datahub")
    .master("local[*]")
    .config("spark.jars.packages", spark_jars)
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
    .config("spark.datahub.rest.server", "http://localhost:9090")
    .config("spark.datahub.metadata.dataset.materialize", "true")
    .config("spark.datahub.capture_spark_plan", "true")
    .getOrCreate()
)

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
26/02/13 07:52:16 WARN Utils: Your hostname, Brunos-M1-Max-MBP-16.local, resolves to a loopback address: 127.0.0.1; using 192.168.15.5 instead (on interface en0)
26/02/13 07:52:16 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
:: loading settings :: url = jar:file:/Users/iobruno/Vault/data-catalog-labs/spark/.venv/lib/python3.13/site-packages/pyspark/jars/ivy-2.5.3.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/iobruno/.ivy2.5.2/cache
The jars for the packages stored in: /Users/iobruno/.ivy2.5.2/jars
com.google.cloud.spark#spark-4.0-bigquery added as a dependency
io.acryl#acryl-spark-lineage_2.13 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-488627a6-0425-4307-b5a8-fd42cf53fc00;1.0
	confs: [default]
	found com.google.cloud.spark#spark-4.0-bigquery;0.44.0 in central
	found com.google.cloud.spark#spark-bigque

### Read Staging Tables

In [4]:
GCP_PROJECT = "iobruno-gcp-labs"
STG_DATASET = "stg_hackernews_rss"

df_hot = (
    spark.read
    .format("bigquery")
    .option("table", f"{GCP_PROJECT}.{STG_DATASET}.stg_hot_articles")
    .option("viewsEnabled", "true")
    .option("materializationDataset", STG_DATASET)
    .load()
)

df_hot.printSchema()
df_hot.show(5, truncate=False)

root
 |-- uid: string (nullable = true)
 |-- title: string (nullable = true)
 |-- username: string (nullable = true)
 |-- url: string (nullable = true)
 |-- redirect_url: string (nullable = true)
 |-- published_at: timestamp (nullable = true)



26/02/13 07:52:29 WARN SparkOpenLineageExtensionVisitorWrapper: Different classloaders detected for openlineage-spark integration and Spark connector. This may cause extension loading issues. For optimal compatibility, ensure both libraries are loaded using the same classloader by: 
1. Placing both libraries in the /usr/lib/spark/jars directory, or 
2. Loading both libraries through the --jars parameter.
[Stage 0:>                                                          (0 + 1) / 1]

+--------------------------------+---------------------------------------------------------------------------+-----------+---------------------------------------------+--------------------------------------------------------------------------------------------------------------------+-------------------+
|uid                             |title                                                                      |username   |url                                          |redirect_url                                                                                                        |published_at       |
+--------------------------------+---------------------------------------------------------------------------+-----------+---------------------------------------------+--------------------------------------------------------------------------------------------------------------------+-------------------+
|479d1f5cab04cb940af0d9b882af8dad|The missing digit of Stela C                    

                                                                                

In [5]:
df_newest = (
    spark.read
    .format("bigquery")
    .option("table", f"{GCP_PROJECT}.{STG_DATASET}.stg_newest_articles")
    .option("viewsEnabled", "true")
    .option("materializationDataset", STG_DATASET)
    .load()
)

df_newest.printSchema()
df_newest.show(5, truncate=False)

root
 |-- uid: string (nullable = true)
 |-- title: string (nullable = true)
 |-- username: string (nullable = true)
 |-- url: string (nullable = true)
 |-- redirect_url: string (nullable = true)
 |-- published_at: timestamp (nullable = true)



26/02/13 07:52:35 WARN SparkOpenLineageExtensionVisitorWrapper: Different classloaders detected for openlineage-spark integration and Spark connector. This may cause extension loading issues. For optimal compatibility, ensure both libraries are loaded using the same classloader by: 
1. Placing both libraries in the /usr/lib/spark/jars directory, or 
2. Loading both libraries through the --jars parameter.


+--------------------------------+-----------------------------------------------------------------+---------------+---------------------------------------------+----------------------------------------------------------------------------------+-------------------+
|uid                             |title                                                            |username       |url                                          |redirect_url                                                                      |published_at       |
+--------------------------------+-----------------------------------------------------------------+---------------+---------------------------------------------+----------------------------------------------------------------------------------+-------------------+
|e552654b1a018da0e9152366afcacf1b|ChargePoint data shows a new EV bottleneck forming               |toomuchtodo    |https://news.ycombinator.com/item?id=46990233|https://electrek.co/2026/02/11/chargepoi
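
The two read cells differ only in the table name, so the shared options could be factored out. The sketch below is one way to do that; `bigquery_read_options` and `read_staging_table` are hypothetical helpers, and `spark` is assumed to be the session created earlier in this notebook.

```python
def bigquery_read_options(project: str, dataset: str, table: str) -> dict[str, str]:
    """Collect the connector options used by the read cells in this notebook."""
    return {
        "table": f"{project}.{dataset}.{table}",
        "viewsEnabled": "true",
        "materializationDataset": dataset,
    }


def read_staging_table(spark, project: str, dataset: str, table: str):
    """Load one staging table; `spark` is an existing SparkSession."""
    options = bigquery_read_options(project, dataset, table)
    return spark.read.format("bigquery").options(**options).load()


# Usage, with the names defined in the cells above:
# df_hot = read_staging_table(spark, GCP_PROJECT, STG_DATASET, "stg_hot_articles")
# df_newest = read_staging_table(spark, GCP_PROJECT, STG_DATASET, "stg_newest_articles")
```

Keeping the options in a plain function makes them easy to inspect or test without starting a Spark session.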

                                                                                

### Combine Articles

In [6]:
from pyspark.sql.functions import lit

df_hot_tagged = df_hot.withColumn("source_feed", lit("hot"))
df_newest_tagged = df_newest.withColumn("source_feed", lit("newest"))

df_articles = df_hot_tagged.unionByName(df_newest_tagged)

print(f"Total articles: {df_articles.count()}")
df_articles.show(10, truncate=False)

26/02/13 07:52:40 WARN SparkOpenLineageExtensionVisitorWrapper: Different classloaders detected for openlineage-spark integration and Spark connector. This may cause extension loading issues. For optimal compatibility, ensure both libraries are loaded using the same classloader by: 
1. Placing both libraries in the /usr/lib/spark/jars directory, or 
2. Loading both libraries through the --jars parameter.


Total articles: 40




+--------------------------------+--------------------------------------------------------------------------------+-----------+---------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+-------------------+-----------+
|uid                             |title                                                                           |username   |url                                          |redirect_url                                                                                                           |published_at       |source_feed|
+--------------------------------+--------------------------------------------------------------------------------+-----------+---------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+-------------------+-----------+
|479d1f5cab04cb940af0d

### Write to BigQuery

In [7]:
TARGET_DATASET = "hackernews_rss"
TARGET_TABLE = f"{GCP_PROJECT}.{TARGET_DATASET}.articles_spark"

(
    df_articles.write
    .format("bigquery")
    .option("table", TARGET_TABLE)
    .option("writeMethod", "direct")
    .mode("overwrite")
    .save()
)

print(f"Written to {TARGET_TABLE}")

26/02/13 07:52:42 WARN SparkOpenLineageExtensionVisitorWrapper: Different classloaders detected for openlineage-spark integration and Spark connector. This may cause extension loading issues. For optimal compatibility, ensure both libraries are loaded using the same classloader by: 
1. Placing both libraries in the /usr/lib/spark/jars directory, or 
2. Loading both libraries through the --jars parameter.
                                                                                

Written to iobruno-gcp-labs.hackernews_rss.articles_spark
