# Kafka Streaming + PySpark Example

### 1. Add PySpark and related libraries to the import path with findspark

In [1]:
import findspark
findspark.init("/usr/local/lib/spark-3.3.2-bin-hadoop3")
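
For context, `findspark.init` roughly does two things: it exports `SPARK_HOME` and puts Spark's bundled Python sources on `sys.path`. The sketch below reproduces only the path lookup, assuming the distribution keeps PySpark in `python/` and a Py4J archive in `python/lib/`. `spark_python_paths` is a hypothetical helper, not part of findspark, and the real implementation may differ in detail.

```python
import glob
import os


def spark_python_paths(spark_home):
    """Return the entries that must be on sys.path before pyspark can be imported.

    Hypothetical helper for illustration only.
    """
    spark_python = os.path.join(spark_home, "python")
    # The Py4J bridge ships as a versioned zip, so it is located with a glob.
    py4j_archives = sorted(glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip")))
    if not py4j_archives:
        raise FileNotFoundError(f"no py4j archive found under {spark_python}/lib")
    return [spark_python, py4j_archives[0]]
```

If the call raises `FileNotFoundError`, the path given to `findspark.init` most likely does not point at the root of a Spark distribution.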

### 2. Define the running Kafka servers and the topic

In [27]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import IntegerType, LongType, StringType, StructField, StructType

kafka_bootstrap_servers = 'master01:9092,master02:9092,slave01:9092,slave02:9092,slave03:9092'
topic = 'tagmanager'
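
The bootstrap list is a single comma-separated string of `host:port` entries, and a typo in it usually surfaces only later as a connection timeout. As a minimal sketch, the string can be checked up front; `parse_bootstrap_servers` is a hypothetical helper introduced for illustration.

```python
def parse_bootstrap_servers(servers):
    """Split a Kafka bootstrap string into (host, port) pairs, failing fast on typos."""
    pairs = []
    for entry in servers.split(","):
        host, sep, port = entry.strip().rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {entry!r}")
        pairs.append((host, int(port)))
    return pairs


kafka_bootstrap_servers = "master01:9092,master02:9092,slave01:9092,slave02:9092,slave03:9092"
brokers = parse_bootstrap_servers(kafka_bootstrap_servers)
print(len(brokers), brokers[0])  # 5 ('master01', 9092)
```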

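### 3. Add configuration through SparkConf and create the SparkContext
The spark.jars.packages option fetches the JAR files for a specific group, artifact, and version from the Maven repository, together with their transitive dependencies. Values are passed in the form \<groupId>:\<artifactId>:\<version>, with multiple packages separated by commas. When the job runs on YARN, as it does here, Spark then uploads the downloaded JAR files to HDFS automatically so that they are available as dependencies. This example uses the org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.2 package to connect Spark to Kafka.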
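
The coordinate format described above can be sketched as a small builder. Both functions are hypothetical helpers, and the default Scala version `2.12` is an assumption taken from the artifact name used in this notebook.

```python
def maven_coordinate(group_id, artifact_id, version):
    """Format one dependency the way spark.jars.packages expects it."""
    return f"{group_id}:{artifact_id}:{version}"


def kafka_connector_coordinate(spark_version, scala_version="2.12"):
    # The artifact name carries the Scala binary version, and the package
    # version should match the Spark version that runs the job.
    return maven_coordinate(
        "org.apache.spark", f"spark-sql-kafka-0-10_{scala_version}", spark_version
    )


# Several packages are passed as one comma-separated string.
packages = ",".join([kafka_connector_coordinate("3.3.2")])
print(packages)  # org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.2
```

Deriving the coordinate from the Spark version keeps the connector and the runtime in step when the cluster is upgraded.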

In [28]:
# Stop the context created by the notebook kernel so that a new one can pick up spark.jars.packages.
sc.stop()

In [31]:
sconf = SparkConf()
sconf.setAppName("Jupyter_Notebook").set("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.2")

sc = SparkContext(conf=sconf)

23/03/21 16:19:29 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
23/03/21 16:19:34 WARN Client: Same path resource file:///root/.ivy2/jars/org.apache.spark_spark-sql-kafka-0-10_2.12-3.3.2.jar added multiple times to distributed cache.
23/03/21 16:19:34 WARN Client: Same path resource file:///root/.ivy2/jars/org.apache.spark_spark-token-provider-kafka-0-10_2.12-3.3.2.jar added multiple times to distributed cache.
23/03/21 16:19:34 WARN Client: Same path resource file:///root/.ivy2/jars/org.apache.kafka_kafka-clients-2.8.1.jar added multiple times to distributed cache.
23/03/21 16:19:34 WARN Client: Same path resource file:///root/.ivy2/jars/com.google.code.findbugs_jsr305-3.0.0.jar added multiple times to distributed cache.
23/03/21 16:19:34 WARN Client: Same path resource file:///root/.ivy2/jars/org.apache.commons_commons-pool2-2.11.1.jar added multiple times to distributed cache.
23/03/21 16:19:34 WARN Client: 

### 4. Create a SparkSession and connect it to Kafka as a streaming source with readStream and load

In [21]:
session = SparkSession(sc)
streamming_df = session \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
  .option("failOnDataLoss","False") \
  .option("subscribe", topic) \
  .load()
streamming_df.printSchema()

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)
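
In this raw schema both `key` and `value` are binary, so each record has to be decoded before its fields can be used. The plain-Python sketch below shows the same two steps for a single record: bytes to text, then text to a parsed structure. `decode_kafka_value` is a hypothetical helper, and the sample payload is made up for illustration.

```python
import json


def decode_kafka_value(raw):
    """Turn the bytes of a Kafka record value into a dict, or None if it is not valid JSON."""
    if raw is None:
        return None
    try:
        return json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Returning None for unparsable input mirrors a null column value.
        return None


# Hypothetical payload for illustration only.
sample = b'{"event": "click", "positionX": 925, "positionY": 731}'
record = decode_kafka_value(sample)
print(record["event"], record["positionX"])  # click 925
```

In Spark the equivalent steps are a cast of the column to string followed by `from_json` with an explicit schema.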



In [33]:
schema = StructType(
        [
                StructField("serviceToken", StringType()),
                StructField("clientId", LongType()),
                StructField("sessionId", StringType()),
                StructField("event", StringType()),
                StructField("targetId", StringType()),
                StructField("positionX", IntegerType()),
                StructField("positionY", IntegerType()),
                StructField("location", StringType()),
                StructField("timestamp", LongType())
        ]
)

session = SparkSession(sc)

# Spark's Kafka source always delivers key and value as raw bytes, so the
# Kafka deserializer options have no effect and are omitted here.
# Decoding is done with cast() and from_json() instead.
streaming_df = session \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
  .option("failOnDataLoss", "false") \
  .option("subscribe", topic) \
  .load() \
  .withColumn("key", col("key").cast("string")) \
  .withColumn("value", from_json(col("value").cast("string"), schema))
streaming_df.printSchema()

root
 |-- key: string (nullable = true)
 |-- value: struct (nullable = true)
 |    |-- serviceToken: string (nullable = true)
 |    |-- clientId: long (nullable = true)
 |    |-- sessionId: string (nullable = true)
 |    |-- event: string (nullable = true)
 |    |-- targetId: string (nullable = true)
 |    |-- positionX: integer (nullable = true)
 |    |-- positionY: integer (nullable = true)
 |    |-- location: string (nullable = true)
 |    |-- timestamp: long (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)



### 5. Strip milliseconds from the timestamp column, group by second, and print the stream to the console for 20 seconds

In [35]:
import time
from pyspark.sql.functions import col

cassandra_keyspace = "tagmanager"
cassandra_table = "stream"


# Flatten the parsed JSON struct and rename its fields to snake_case.
# "event" and "location" already conform, so they need no rename.
# positionY is not renamed here and keeps its camelCase name in the output.
streamming_query = streaming_df.select("key", "value.*") \
    .withColumnRenamed("serviceToken", "service_token") \
    .withColumnRenamed("clientId", "client_id") \
    .withColumnRenamed("sessionId", "session_id") \
    .withColumnRenamed("targetId", "target_id") \
    .withColumnRenamed("positionX", "position_x") \
    .withColumnRenamed("timestamp", "creation_timestamp")
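
The renames above follow a camelCase to snake_case convention, which can also be derived programmatically. This is a minimal sketch: `to_snake_case` is a hypothetical helper, and the `overrides` mapping covers the one rename that is not a pure case conversion.

```python
import re


def to_snake_case(name):
    """Convert a camelCase identifier such as 'serviceToken' to 'service_token'."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


fields = ["serviceToken", "clientId", "sessionId", "event", "targetId",
          "positionX", "positionY", "location", "timestamp"]
# Explicit overrides cover renames that are not a pure case conversion.
overrides = {"timestamp": "creation_timestamp"}
renames = {f: overrides.get(f, to_snake_case(f)) for f in fields}
print(renames["serviceToken"], renames["positionY"], renames["timestamp"])
# service_token position_y creation_timestamp
```

Unlike the chain above, this mapping also converts `positionY`. Applied to a DataFrame, it could drive a single `select` that aliases each column to its new name.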

In [37]:
query = streamming_query.writeStream.format("console").outputMode("append").start()
time.sleep(10)
query.stop()

23/03/21 16:31:55 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-6e458655-5275-4320-a3e3-6433ed5ea884. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
23/03/21 16:31:55 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
-------------------------------------------
Batch: 0
-------------------------------------------
+---+-------------+---------+----------+-----+---------+----------+---------+--------+------------------+
|key|service_token|client_id|session_id|event|target_id|position_x|positionY|location|creation_timestamp|
+---+-------------+---------+----------+-----+---------+----------+---------+--------+------------------+
+---+-------------+---------+----------+-----+---------+----------+---------+--------+------------------+

                                                                                

-------------------------------------------
Batch: 1
-------------------------------------------
+--------------------+--------------------+---------+---------------+---------+---------------+----------+---------+--------------------+------------------+
|                 key|       service_token|client_id|     session_id|    event|      target_id|position_x|positionY|            location|creation_timestamp|
+--------------------+--------------------+---------+---------------+---------+---------------+----------+---------+--------------------+------------------+
|test-session-id-1...|tag-manager-servi...|        1|test-session-id|    click|button-to-first|       925|      731|http://localhost:...|     1679416319472|
|test-session-id-1...|tag-manager-servi...|        1|test-session-id|pageenter|           none|         0|        0|http://localhost:...|     1679416319474|
|test-session-id-1...|tag-manager-servi...|        1|test-session-id|pageleave|           none|         0|        0|ht

                                                                                

-------------------------------------------
Batch: 2
-------------------------------------------
+--------------------+--------------------+---------+---------------+---------+-----------------+----------+---------+--------------------+------------------+
|                 key|       service_token|client_id|     session_id|    event|        target_id|position_x|positionY|            location|creation_timestamp|
+--------------------+--------------------+---------+---------------+---------+-----------------+----------+---------+--------------------+------------------+
|test-session-id-1...|tag-manager-servi...|        1|test-session-id|    click|button-first-back|       920|      819|http://localhost:...|     1679416320898|
|test-session-id-1...|tag-manager-servi...|        1|test-session-id|pageleave|             none|         0|        0|http://localhost:...|     1679416320900|
|test-session-id-1...|tag-manager-servi...|        1|test-session-id|pageenter|             none|         0|
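
The cell above lets the query run with `time.sleep` and then stops it. If the sleep is interrupted, the query keeps running in the background. A hedged alternative is to wrap the wait in `try`/`finally` and use `awaitTermination(timeout)`, which PySpark's `StreamingQuery` provides. `run_for` and `FakeQuery` are hypothetical names; `FakeQuery` exists only so the sketch runs without a cluster.

```python
def run_for(query, seconds):
    """Let a streaming query run for at most `seconds`, then stop it.

    `query` is expected to behave like pyspark's StreamingQuery:
    awaitTermination(timeout) blocks up to `timeout` seconds and returns
    True if the query ended by itself in that time.
    """
    try:
        return query.awaitTermination(seconds)
    finally:
        # Runs on timeout, on error, and on KeyboardInterrupt alike.
        query.stop()


class FakeQuery:
    """Stand-in used only to exercise run_for without a Spark cluster."""

    def __init__(self):
        self.stopped = False

    def awaitTermination(self, timeout=None):
        return False  # pretend the timeout elapsed

    def stop(self):
        self.stopped = True


q = FakeQuery()
finished = run_for(q, 10)
print(finished, q.stopped)  # False True
```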

In [26]:
import time

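# Cast the Kafka timestamp to text ("YYYY-MM-DD HH:MM:SS.fff") and keep
# characters 12-19 (HH:MM:SS), which drops the milliseconds.
streamming_query = streamming_df \
    .withColumn("timestamp_sec", col("timestamp").cast("string").substr(12, 8)) \
    .groupBy("timestamp_sec") \
    .count()
# Aggregations without a watermark cannot use "append" mode, hence "complete".
query = streamming_query.writeStream.format("console").outputMode("complete").start()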

time.sleep(20)
query.stop()

23/03/21 15:42:48 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-fd32a4d9-0f69-4a49-ac87-3b1616b5db1e. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
23/03/21 15:42:48 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+-------------+-----+
|timestamp_sec|count|
+-------------+-----+
+-------------+-----+



                                                                                

-------------------------------------------
Batch: 1
-------------------------------------------
+-------------+-----+
|timestamp_sec|count|
+-------------+-----+
|     15:42:58|    2|
|     15:42:53|    3|
|     15:42:51|    2|
|     15:42:54|    3|
|     15:42:57|    5|
|     15:42:55|    6|
|     15:42:52|    4|
+-------------+-----+





23/03/21 15:43:09 ERROR WriteToDataSourceV2Exec: Data source write support org.apache.spark.sql.execution.streaming.sources.MicroBatchWrite@784ebba1 is aborting.
23/03/21 15:43:09 ERROR WriteToDataSourceV2Exec: Data source write support org.apache.spark.sql.execution.streaming.sources.MicroBatchWrite@784ebba1 aborted.
23/03/21 15:43:09 WARN TaskSetManager: Lost task 44.0 in stage 15.0 (TID 1252) (slave03 executor 2): TaskKilled (Stage cancelled)




23/03/21 16:01:36 ERROR YarnClientSchedulerBackend: YARN application has exited unexpectedly with state KILLED! Check the YARN application logs for more details.
23/03/21 16:01:36 ERROR YarnClientSchedulerBackend: Diagnostics message: Application application_1679405241026_0006 was killed by user root at 172.16.238.2
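
The grouping key in the cell above comes from `substr(12, 8)` on the timestamp text. The plain-Python sketch below shows which characters that selects and how the per-second counts arise. `second_of_day` is a hypothetical helper and the event times are made up for illustration.

```python
from collections import Counter


def second_of_day(timestamp_text):
    """Return the HH:MM:SS part of a 'YYYY-MM-DD HH:MM:SS[.fff]' string.

    Equivalent to SQL substr(12, 8): SQL positions start at 1, Python's at 0.
    """
    return timestamp_text[11:19]


# Hypothetical event times for illustration.
events = [
    "2023-03-21 15:42:51.120",
    "2023-03-21 15:42:51.870",
    "2023-03-21 15:42:52.004",
]
counts = Counter(second_of_day(t) for t in events)
print(counts["15:42:51"], counts["15:42:52"])  # 2 1
```

Because the key drops the date, events that occur at the same time of day on different days fall into the same group. For a long-running job, truncating the timestamp itself, for example with `date_trunc("second", ...)`, keeps the date in the key.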


### 6. Stop the session and context

In [38]:
session.stop()
sc.stop()
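
If a streaming query is still running when the session is stopped, its tasks are cancelled abruptly. As a hedged sketch, the remaining queries can be stopped first through `session.streams.active`, which lists the queries registered on a SparkSession. `stop_all_queries` is a hypothetical helper.

```python
def stop_all_queries(session):
    """Stop every streaming query still registered on the session; return how many."""
    active = list(session.streams.active)  # snapshot before stopping
    for query in active:
        query.stop()
    return len(active)
```

Calling `stop_all_queries(session)` before `session.stop()` would give each query a chance to shut down in order.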