# Kafka Streaming + PySpark Example

### 1. Add pyspark and related libraries with findspark

In [39]:
import findspark
findspark.init("/usr/local/lib/spark-3.3.2-bin-hadoop3")

### 2. Define the running Kafka server and topic

In [40]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

kafka_bootstrap_servers = 'slave03:9092'
topic = 'quickstart-events'

### 3. Add configuration through SparkConf and create the SparkContext

The spark.jars.packages option fetches JAR files from the Maven repository by group, artifact, and version. Each package is passed in the form \<groupId>:\<artifactId>:\<version>, and Spark automatically uploads the fetched JAR files to HDFS to add them as dependencies. This example uses the org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.2 package to connect Spark with Kafka.
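
As a small illustration of the coordinate format, the sketch below assembles and validates a groupId:artifactId:version string. The helper names `kafka_connector_coordinate` and `split_coordinate` are hypothetical and not part of Spark. It assumes the connector artifact is named after the Scala binary version and carries the same version as Spark, as in the coordinate used in this example.

```python
def kafka_connector_coordinate(spark_version: str, scala_version: str = "2.12") -> str:
    """Build the Maven coordinate of the Spark-Kafka connector (hypothetical helper)."""
    group_id = "org.apache.spark"
    artifact_id = f"spark-sql-kafka-0-10_{scala_version}"
    return f"{group_id}:{artifact_id}:{spark_version}"


def split_coordinate(coordinate: str) -> tuple:
    """Split 'groupId:artifactId:version' into three parts, rejecting anything else."""
    parts = coordinate.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected groupId:artifactId:version, got {coordinate!r}")
    return tuple(parts)


coordinate = kafka_connector_coordinate("3.3.2")
print(coordinate)
print(split_coordinate(coordinate))
```

If more than one package is needed, spark.jars.packages takes a comma-separated list of such coordinates.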

In [46]:
# sc.stop()  # uncomment to stop an already-running SparkContext first

In [47]:
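# Name the application and ask Spark to fetch the Kafka connector from Maven
sconf = SparkConf()
sconf.setAppName("Jupyter_Notebook").set("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.2")

# Only one SparkContext can be active at a time
sc = SparkContext(conf=sconf)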

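The error below appears when a SparkContext is already active in the kernel, for example when this cell is re-run. Calling sc.stop() first (the commented-out cell above) avoids it.

ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=Jupyter_Notebook, master=yarn) created by __init__ at /tmp/ipykernel_730/178433890.py:4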

### 4. Create a SparkSession and connect it to Kafka as a streaming source with readStream and load

In [36]:
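# Wrap the existing SparkContext in a SparkSession
session = SparkSession(sc)

# Subscribe to the topic as a streaming source.
# failOnDataLoss=false keeps the query running if offsets are no longer available.
streamming_df = session \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
  .option("failOnDataLoss", "false") \
  .option("subscribe", topic) \
  .load()
streamming_df.printSchema()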

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)
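
The schema shows key and value as binary, so payloads usually need a cast before they are readable. The sketch below is a minimal example that assumes the producer wrote UTF-8 text, which the notebook does not state. `decode_kafka_payload` is a hypothetical helper; `selectExpr` is the standard DataFrame method.

```python
def decode_kafka_payload(df):
    """Return a frame with key and value cast from binary to string.

    `df` is expected to be the DataFrame returned by readStream ... load().
    Assumption: the messages are UTF-8 text.
    """
    return df.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
        "timestamp",
    )


# Usage inside the notebook (needs the streaming frame from the cell above):
# decoded_df = decode_kafka_payload(streamming_df)
```

The remaining steps of this example only use the timestamp column, so the cast is optional here.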



### 5. Strip milliseconds from the timestamp column, group by second, and print the stream to the console for 20 seconds

In [37]:
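import time

# Cast the timestamp to a string and keep 8 characters from position 12 (HH:mm:ss), then count rows per second
streamming_query = streamming_df \
  .withColumn("timestamp_sec", col("timestamp").cast("string").substr(12, 8)) \
  .groupby("timestamp_sec") \
  .count()

# "complete" mode reprints the full aggregated table on every trigger
query = streamming_query.writeStream.format("console").outputMode("complete").start()

# Let the stream run for 20 seconds, then stop it
time.sleep(20)
query.stop()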

23/03/09 06:12:04 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-e3d6a5e7-710a-4cce-a642-cf6524fae721. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
23/03/09 06:12:04 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.

-------------------------------------------
Batch: 0
-------------------------------------------
+-------------+-----+
|timestamp_sec|count|
+-------------+-----+
+-------------+-----+

-------------------------------------------
Batch: 1
-------------------------------------------
+-------------+-----+
|timestamp_sec|count|
+-------------+-----+
|     06:12:14|    1|
|     06:12:11|    3|
|     06:12:10|    4|
+-------------+-----+
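
To make the substr(12, 8) step concrete, the plain-Python sketch below mimics it on a few timestamp strings and counts them per second. The sample values are hypothetical and assume the 'yyyy-MM-dd HH:mm:ss.SSS' layout; `second_bucket` is an illustrative name, not a Spark function.

```python
from collections import Counter


def second_bucket(timestamp_text: str) -> str:
    """Mimic Column.substr(12, 8): 1-based start 12, length 8, giving 'HH:mm:ss'."""
    start = 12 - 1  # Spark's substr is 1-based, Python slices are 0-based
    return timestamp_text[start:start + 8]


# Hypothetical sample values, not taken from the run above
samples = [
    "2023-03-09 06:12:10.101",
    "2023-03-09 06:12:10.950",
    "2023-03-09 06:12:11.004",
]

counts = Counter(second_bucket(s) for s in samples)
print(counts)
```

Because the date part is dropped, events at the same clock time on different days would land in the same group. That is harmless for a 20-second run but worth keeping in mind for longer ones.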



### 6. Stop the session and context

In [38]:
# session.stop() also stops the underlying SparkContext, so sc.stop() is only a safeguard
session.stop()
sc.stop()
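
The sleep-then-stop pattern from step 5 can also be written so that the query is always stopped, even when it fails. The sketch below is one possible way to do that. `run_for` is a hypothetical helper, and it relies only on the `awaitTermination` and `stop` methods of the StreamingQuery returned by writeStream.start().

```python
def run_for(query, seconds: float) -> bool:
    """Let a streaming query run for at most `seconds`, then stop it.

    Returns True if the query ended on its own before the timeout.
    """
    try:
        # awaitTermination(timeout) blocks for up to `timeout` seconds and
        # returns whether the query has terminated; a failure inside the
        # query is raised here instead of passing unnoticed.
        return query.awaitTermination(seconds)
    finally:
        # Always stop the query, even if awaitTermination raised.
        query.stop()


# Usage inside the notebook (replaces time.sleep(20) and query.stop()):
# run_for(query, 20)
```

Unlike time.sleep(20), this returns early when the query has already terminated and surfaces its exception if it failed.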