Streaming and processing Twitter data from a Kafka topic using Spark Streaming
========

This notebook was inspired by the blog post https://www.rittmanmead.com/blog/2017/01/getting-started-with-spark-streaming-with-python-and-kafka/

You need a Kafka cluster and a producer that reads the Twitter API and stores tweets in the Kafka topic twitter-tweets. The Kafka cluster has to be accessible from the Spark cluster running this notebook. This demonstration uses the CSC Rahti (OpenShift) environment, running Apache Spark version 2.4.0 and a Strimzi Kafka cluster in the same namespace.
Running Strimzi on OpenShift: https://strimzi.io/2018/05/17/running-strimzi-on-openshift-online.html

More information about integrating Kafka with Spark Streaming can be found at https://spark.apache.org/docs/latest/streaming-kafka-0-8-integration.html
and https://spark.apache.org/docs/2.3.1/api/python/pyspark.streaming.html#module-pyspark.streaming.kafka
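
The cells below read `tweet['user']['screen_name']` and `tweet['text']` from each message, so the producer is assumed to write every tweet as a JSON object containing at least those fields. A minimal sketch of that encoding follows; the tweet is made up and `encode_tweet` is a hypothetical helper name, and a real producer may well include many more fields:

```python
import json

def encode_tweet(tweet):
    """Serialize a tweet dict to UTF-8 JSON bytes (assumed Kafka message value format)."""
    return json.dumps(tweet).encode("utf-8")

# Made-up example tweet with only the fields this notebook reads.
sample_tweet = {"user": {"screen_name": "User1"}, "text": "RT hello from Kafka"}
message_value = encode_tweet(sample_tweet)

# The consumer side reverses the encoding with json.loads.
decoded = json.loads(message_value.decode("utf-8"))
print(decoded["user"]["screen_name"], "|", decoded["text"])
```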

---
# Setting up the system

In [1]:
# First pull in the Spark Streaming Kafka package from the Maven repository.
# The package version must match your Spark version (2.4.0 here) and its Scala version (2.11 here).
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.0 pyspark-shell'
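
The package coordinate has to match both the Spark version and the Scala version Spark was built with. As a sketch, the string can be assembled from those two values; `kafka_submit_args` is a hypothetical helper, and its defaults simply mirror the versions used in this notebook:

```python
def kafka_submit_args(spark_version="2.4.0", scala_version="2.11"):
    """Build a PYSPARK_SUBMIT_ARGS value for the Kafka 0.8 integration package."""
    package = "org.apache.spark:spark-streaming-kafka-0-8_{}:{}".format(
        scala_version, spark_version)
    return "--packages {} pyspark-shell".format(package)

print(kafka_submit_args())
```

In a running session, `sc.version` reports the Spark version. The environment variable itself has to be set before the SparkContext is created.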

In [2]:
# Spark, Spark Streaming, the Kafka connector, and json for parsing Twitter data
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json

In [3]:
# specify Kafka variables
zookeeper = 'my-cluster-zookeeper:2181' # Zookeeper server default address and port for Strimzi cluster
group = 'spark-streaming' # Consumer group name, can be defined here
topic = 'twitter-tweets' # Kafka topic to be consumed
partitions = 2 # Each partition is consumed in its own thread.

In [4]:
sc = SparkContext(appName="Twitter Streaming from Kafka")
sc.setLogLevel("WARN")
# Initialize the streaming context and define the batch interval in seconds
ssc = StreamingContext(sc, 30)
kafkaStream = KafkaUtils.createStream(ssc, zookeeper, group, {topic:partitions})

# Spark commands to be run on the streaming context

Counting tweets per user

In [5]:
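# Each Kafka record is a (key, value) pair; the value holds the tweet as JSON
parsed = kafkaStream.map(lambda v: json.loads(v[1]))
# Print the number of tweets in each batch
parsed.count().map(lambda x: 'Tweets in this batch: %s' % x).pprint()
# Extract the author of each tweet and count tweets per author
authors_dstream = parsed.map(lambda tweet: tweet['user']['screen_name'])
author_counts = authors_dstream.countByValue()
author_counts.pprint()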
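
To see what this cell computes for a single batch, the same steps can be reproduced in plain Python without Spark. This is only an illustration: the records are made up, and `collections.Counter` stands in for `countByValue`.

```python
import json
from collections import Counter

# Made-up batch of (key, value) records, shaped like the Kafka stream's records.
batch = [
    (None, '{"user": {"screen_name": "User1"}, "text": "RT hello"}'),
    (None, '{"user": {"screen_name": "User2"}, "text": "hello world"}'),
    (None, '{"user": {"screen_name": "User1"}, "text": "RT again"}'),
]

# Parse the JSON value of each record, then count tweets per author.
parsed_batch = [json.loads(value) for _key, value in batch]
batch_authors = [tweet["user"]["screen_name"] for tweet in parsed_batch]
batch_author_counts = Counter(batch_authors)

print("Tweets in this batch: %s" % len(parsed_batch))
print(sorted(batch_author_counts.items()))
```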


Sorting users by number of tweets and printing the top five authors

In [6]:
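# Sort (author, count) pairs by descending count within each batch
author_counts_sorted_dstream = author_counts.transform(
    lambda rdd: rdd.sortBy(lambda x: -x[1]))

# Keep only the first five pairs of each sorted batch
top_five_authors = author_counts_sorted_dstream.transform(
    lambda rdd: sc.parallelize(rdd.take(5)))
top_five_authors.pprint()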
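
Per batch, the two transforms amount to sorting the (author, count) pairs by descending count and keeping the first five. A plain-Python sketch of that logic, using made-up counts:

```python
# Made-up (author, count) pairs for one batch.
counts = [("User1", 1), ("User2", 3), ("User3", 2), ("User4", 1),
          ("User5", 2), ("User6", 5), ("User7", 1)]

# Sort by descending count, using the same key as the Spark code, then keep five.
top_five = sorted(counts, key=lambda x: -x[1])[:5]
print(top_five)
```

PySpark also offers `rdd.takeOrdered(5, key=lambda x: -x[1])`, which picks the five highest counts without sorting the whole RDD and may be worth trying on large batches.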

Parsing tweet text, counting individual words, and printing the sorted list

In [7]:
# Split each tweet's text on spaces, count each word, and sort by descending count
(parsed
    .flatMap(lambda tweet: tweet['text'].split(" "))
    .countByValue()
    .transform(lambda rdd: rdd.sortBy(lambda x: -x[1]))
    .pprint())
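
Note that `split(" ")` splits on single space characters only, so repeated spaces produce empty strings and newlines stay inside tokens. A small sketch comparing it with the argument-free `split()`, using a made-up tweet text:

```python
text = "RT  hello\nworld"  # made-up text with a double space and a newline

tokens_on_space = text.split(" ")  # splits at every single space character
tokens_on_whitespace = text.split()  # splits on any run of whitespace

print(tokens_on_space)
print(tokens_on_whitespace)
```

Whether the stricter or the looser tokenization is preferable depends on how the word counts will be used; the cell above keeps the original `split(" ")`.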

# Starting the streaming context with a 30-second batch interval

The first results appear only after the first 30-second batch has completed, so be patient!

In [8]:
ssc.start()
ssc.awaitTermination()

-------------------------------------------
Time: 2019-07-04 12:31:00
-------------------------------------------
Tweets in this batch: 565

-------------------------------------------
Time: 2019-07-04 12:31:00
-------------------------------------------
('User1', 1)
('User2', 1)
('User3', 1)
('User4', 1)
('User5', 1)
('User6', 1)
('User7', 1)
('User8', 1)
('User9', 1)
('User10', 2)
...

-------------------------------------------
Time: 2019-07-04 12:31:00
-------------------------------------------
('UserX', 3)
('UserX', 2)
('UserX', 2)
('UserX', 2)
('UserX', 2)

-------------------------------------------
Time: 2019-07-04 12:31:00
-------------------------------------------
('RT', 353)
('and', 176)
('the', 155)
('in', 131)
('to', 121)
('on', 100)
('a', 93)
('of', 92)
('you', 91)
('this', 88)
...



KeyboardInterrupt: 
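
`awaitTermination()` blocks until the streaming context is stopped, so interrupting the kernel is what ends the run above. As a sketch, the interrupt can be caught so that the context is stopped explicitly. `run_until_interrupted` is a hypothetical helper, not part of the original notebook; `stopGraceFully` is the keyword spelling used by PySpark's `StreamingContext.stop`.

```python
def run_until_interrupted(ssc):
    """Start a streaming context and stop it explicitly on a keyboard/kernel interrupt."""
    ssc.start()
    try:
        ssc.awaitTermination()
    except KeyboardInterrupt:
        print("Interrupted, stopping the streaming context")
    finally:
        # Stop the streaming job and the underlying SparkContext,
        # letting data that was already received finish processing first.
        ssc.stop(stopSparkContext=True, stopGraceFully=True)
```

It would be called as `run_until_interrupted(ssc)` in place of the two lines in the last cell. A stopped StreamingContext cannot be restarted, so a new one has to be created to run the job again.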