In [1]:
%AddDeps org.apache.spark spark-sql-kafka-0-10_2.11 2.4.5 --transitive

Marking org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 for download
Obtained 12 files


Download the connector. Note the _--transitive_ flag: it is needed so that the connector's own dependencies, such as the Kafka client libraries, are downloaded as well.
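The artifact coordinates encode two versions: the `_2.11` suffix is the Scala binary version and `2.4.5` is the Spark version, and both should match what the kernel actually runs. A minimal sketch for checking this from a cell; the names `scalaVersion` and `sparkVersion` are illustrative and not part of the original notebook:

```scala
// Scala version of the running kernel; its major.minor part should match
// the "_2.11" suffix of the connector artifact.
val scalaVersion: String = scala.util.Properties.versionNumberString

// Spark version on the kernel's classpath; it should match the connector
// version requested in %AddDeps (2.4.5 in this notebook).
val sparkVersion: String = org.apache.spark.SPARK_VERSION

println(s"Scala $scalaVersion, Spark $sparkVersion")
```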

In [76]:
import org.apache.spark._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DataType, StructType, StructField}

## Spark Session

In [3]:
val spark = SparkSession.builder()
    .appName("MessageProcessor")
    .master("spark://spark:7077")
    .getOrCreate()

spark = org.apache.spark.sql.SparkSession@41a180e8


org.apache.spark.sql.SparkSession@41a180e8

## Read a DataFrame from Kafka

In [4]:
val dataframe = spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka1:29092")
    .option("subscribe", "Messages")
    .option("startingOffsets", "earliest")
    .load()

dataframe = [key: binary, value: binary ... 5 more fields]


[key: binary, value: binary ... 5 more fields]

In [5]:
dataframe.printSchema()

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)



In [16]:
val output = dataframe
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .as[(String, String)]
    .writeStream
    .outputMode("append")
    .format("memory")
    .queryName("table")
    .start()

output = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@36e94747


org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@36e94747
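
The memory sink fills the `table` view asynchronously, so a query issued right after `start()` may not see any rows yet. A hedged sketch of how the running query could be inspected before reading the table; it assumes the `output` value from the cell above is still active and uses only standard `StreamingQuery` members:

```scala
// Whether the streaming query is still running.
println(output.isActive)

// Current status: whether new data is available and whether a trigger is active.
println(output.status)

// Block until all data available at the source at call time has been processed.
// This method is intended for testing; it may not return if new messages
// keep arriving on the topic.
output.processAllAvailable()
```

After `processAllAvailable()` returns, the rows read so far should be visible in the in-memory table.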

## DataFrame transformations

In [27]:
val table = spark.sql("select * from table")

table = [key: string, value: string]


[key: string, value: string]

In [8]:
output.stop()

Here I read the schema of the value column. The values are JSON, which I will split into separate columns.

In [54]:
val schema = spark.read.json(table.select("value").as[String]).schema

schema = StructType(StructField(kol1,StringType,true), StructField(kol2,StringType,true))


StructType(StructField(kol1,StringType,true), StructField(kol2,StringType,true))
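
Inferring the schema with `spark.read.json` requires a pass over the collected values. If the message layout is known up front, the schema could instead be declared by hand. A minimal sketch, assuming the two nullable string fields shown in the output above; `explicitSchema` is an illustrative name that does not appear in the original notebook:

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hand-written schema mirroring the inferred one: two nullable string columns.
val explicitSchema = StructType(Seq(
  StructField("kol1", StringType, nullable = true),
  StructField("kol2", StringType, nullable = true)
))

// Print the schema as a tree to compare it with the inferred version.
explicitSchema.printTreeString()
```

A schema declared this way can be passed to `from_json` in place of the inferred `schema`.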

The key column is not needed: it only labels messages while they travel through Kafka and is not part of the data.

In [98]:
val transformedDF = table.withColumn("JSON", from_json(col("value"), schema)).select("JSON.*")

transformedDF = [kol1: string, kol2: string]


[kol1: string, kol2: string]

In [99]:
transformedDF.show()

+----+----+
|kol1|kol2|
+----+----+
|dana|dana|
|dana|dana|
|    |null|
|dana|dana|
|dana|dana|
|    |null|
|dana|dana|
|dana|dana|
|    |null|
+----+----+



The DataFrame can be saved as Parquet, although this is not recommended when there are many messages.

In [101]:
transformedDF.write.parquet("transformed.parquet")
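
For larger volumes, one possible alternative (not necessarily what the author had in mind) is to skip the in-memory table and let Structured Streaming write Parquet files directly. The sketch below assumes the `dataframe` and `schema` values defined earlier in this notebook; the output path and checkpoint directory are hypothetical names chosen for illustration:

```scala
import org.apache.spark.sql.functions.{col, from_json}

// Parse the Kafka value as JSON on the stream itself and write the resulting
// columns to Parquet files as new messages arrive.
val parquetQuery = dataframe
  .select(from_json(col("value").cast("string"), schema).as("JSON"))
  .select("JSON.*")
  .writeStream
  .format("parquet")
  .option("path", "transformed_stream.parquet")                   // hypothetical output directory
  .option("checkpointLocation", "transformed_stream_checkpoint")  // required by the file sink
  .outputMode("append")                                           // the file sink supports append mode only
  .start()
```

The query keeps running until `parquetQuery.stop()` is called.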