## This notebook is part of the Hadoop and Spark training delivered by the IT-DB group
### Spark Streaming Hands-On Lab
_by Prasanth Kothuri_

### Hands-On 1 - Stream processing using Spark Streaming and Kafka
*This exercise demonstrates processing unbounded data from a Kafka topic and performing simple string manipulations and aggregations on it.*

#### Import the required modules

In [1]:
import os
import json
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

#### Make the Spark Streaming Kafka module available to Spark executors

In [2]:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0 pyspark-shell'
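
The coordinate passed to `--packages` has to match the Scala build and the version of the Spark installation in use. Below is a minimal sketch of how the string above is assembled. The helper name `kafka_submit_args` is hypothetical, and its default values simply mirror the ones used in this notebook:

```python
def kafka_submit_args(scala_version="2.11", spark_version="2.1.0"):
    """Build a PYSPARK_SUBMIT_ARGS value for the Kafka 0.8 streaming connector.

    Hypothetical helper for illustration; the defaults mirror this notebook.
    """
    package = "org.apache.spark:spark-streaming-kafka-0-8_%s:%s" % (
        scala_version,
        spark_version,
    )
    # 'pyspark-shell' must stay at the end of the argument string
    return "--packages %s pyspark-shell" % package


submit_args = kafka_submit_args()
print(submit_args)
```

The resulting string would be assigned to `os.environ['PYSPARK_SUBMIT_ARGS']` before the SparkContext is created, as in the cell above.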

#### Create SparkContext

In [3]:
conf = SparkConf().setMaster("local[4]").set("spark.driver.memory", "2g").set("spark.executor.memory", "2g")
sc = SparkContext(conf = conf)

#### Create streaming context

In [4]:
ssc = StreamingContext(sc, 60)

#### Hook up to the Kafka topic

In [5]:
kafkaStream = KafkaUtils.createStream(ssc, 'sstreaming:2181', 'spark-streaming-pkothuri1', {'twitter_json':1})

#### Parse the messages into JSON

In [6]:
tweets_json = kafkaStream.map(lambda x: json.loads(x[1]))
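
Each record in the Kafka stream is a `(key, value)` pair, which is why the lambda reads `x[1]`. `json.loads` raises an exception on a malformed message, which can fail the job for that batch. The following is a sketch of a more defensive parser in plain Python, runnable without Spark. The function name `parse_message` and the sample records are hypothetical:

```python
import json


def parse_message(record):
    """Return the decoded JSON value of a (key, value) record, or None if it cannot be parsed."""
    try:
        return json.loads(record[1])
    except (ValueError, TypeError):
        # ValueError: the value is not valid JSON; TypeError: the value is None
        return None


# Hypothetical sample records shaped like Kafka (key, value) pairs
records = [(None, '{"text": "hello"}'), (None, "not json"), (None, None)]
parsed = [parse_message(r) for r in records]
valid = [p for p in parsed if p is not None]
print(valid)
```

In the notebook this could be applied as `kafkaStream.map(parse_message).filter(lambda t: t is not None)`, so that unparseable messages are dropped instead of raising.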

#### Number of tweets in each batch

In [7]:
tweets_json.count().map(lambda x:'Number of tweets in this batch: %s' % x).pprint()

#### Count tweets by location

In [8]:
location_counts = tweets_json.map(lambda tweet: tweet['payload']['user']['location']).countByValue()

In [9]:
top_locations = location_counts \
    .transform(lambda rdd: rdd.sortBy(lambda x: -x[1])) \
    .transform(lambda rdd: sc.parallelize(rdd.take(5)))

In [10]:
top_locations.pprint()
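
To make the per-batch logic concrete, here is a plain-Python sketch of what the count, sort, and take-5 steps compute for a single batch. It does not use Spark. The sample batch and the function name `top_locations_in_batch` are hypothetical, and the tweet layout is assumed to match the one read above:

```python
from collections import Counter

# Hypothetical sample batch, shaped like the parsed tweets used above
batch = [
    {"payload": {"user": {"location": "Geneva"}}},
    {"payload": {"user": {"location": "Paris"}}},
    {"payload": {"user": {"location": "Geneva"}}},
    {"payload": {"user": {"location": None}}},
]


def top_locations_in_batch(tweets, n=5):
    """Return up to n (location, count) pairs, most frequent first."""
    locations = [t["payload"]["user"]["location"] for t in tweets]
    counts = Counter(locations)  # value -> number of occurrences, like countByValue()
    return sorted(counts.items(), key=lambda kv: -kv[1])[:n]


top = top_locations_in_batch(batch)
print(top)
```

Tweets without a profile location appear under the key `None` in this sketch; filtering them out before counting is one option if they are not of interest.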

#### High frequency words in the tweets

In [11]:
tweets_json \
    .flatMap(lambda tweet:tweet['text'].split(" ")) \
    .countByValue() \
    .transform(lambda rdd:rdd.sortBy(lambda x:-x[1])) \
    .pprint()
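
Splitting on a single space, as above, yields empty strings for consecutive spaces and counts `Spark` and `spark` as different words. A hedged plain-Python sketch of a slightly more robust tokenization for one batch follows. The function name `word_counts` and the sample texts are hypothetical:

```python
from collections import Counter


def word_counts(texts):
    """Count lower-cased, whitespace-separated words, most frequent first."""
    words = []
    for text in texts:
        # split() with no argument collapses runs of whitespace, so no empty tokens appear
        words.extend(w.lower() for w in text.split())
    return Counter(words).most_common()


# Hypothetical sample tweet texts; note the double space in the first one
texts = ["Spark streaming  with Kafka", "spark and kafka"]
counts = word_counts(texts)
print(counts)
```

In the streaming version the same idea would amount to replacing `tweet['text'].split(" ")` with a lower-cased `split()` inside the `flatMap`.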

#### Start the streaming context

In [None]:
ssc.start()
ssc.awaitTermination(timeout=180)

#### Stop the streaming context

In [12]:
ssc.stop()