# Spark Streaming
#### Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant processing of live data streams.

#### Data can be ingested from many sources, such as Kafka, Flume, Kinesis, HDFS/S3, or TCP sockets, and processed using complex algorithms expressed with high-level functions like map, reduce, join, and window.

#### After processing data from these sources, Spark Streaming can push the results out to HDFS, databases, and dashboards.

#### Internally, Spark Streaming receives live input data streams and divides the data into batches, which the Spark engine then processes to generate the final stream of results, also in batches.
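
The batching idea can be illustrated without Spark. The following is only a rough sketch in plain Python: the function name `group_into_batches` and the sample `events` list are made up for illustration and are not part of any Spark API. It groups timestamped events into fixed-width intervals, which is roughly what a batch interval does to an incoming stream.

```python
from collections import defaultdict


def group_into_batches(events, batch_interval):
    """Group (timestamp, value) events into fixed-width time buckets.

    Hypothetical helper for illustration only; Spark does this internally.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        # Events whose timestamps fall in the same interval share a batch.
        batch_index = int(timestamp // batch_interval)
        batches[batch_index].append(value)
    # Return the batches in time order.
    return [batches[i] for i in sorted(batches)]


# Made-up sample data: (seconds since start, value).
events = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (2.5, "d"), (2.9, "e")]
batches = group_into_batches(events, batch_interval=1)
print(batches)  # [['a', 'b'], ['c'], ['d', 'e']]
```

One difference worth noting: this sketch skips intervals that contain no events, whereas Spark Streaming still produces an (empty) result for every interval.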

## Structured Streaming
#### Structured Streaming is streaming from Spark sessions, meaning stream processing that runs on top of the Spark SQL engine.

### Spark Streaming (DStreams) is driven through a SparkContext and uses RDD syntax, not DataFrame syntax. Structured Streaming is driven through a SparkSession and uses DataFrame syntax.


## Reading streamed data from a terminal using Spark Streaming
#### Steps:
##### Create a SparkContext
##### Create a StreamingContext
##### Create a socket text stream
##### Read lines as a DStream

#### Steps to work with data:
##### Split each input line into a list of words
##### Map each word to a (word, 1) tuple
##### Group the tuples by word (the key) and sum the second element of each tuple (the count)
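
To see what these three steps do to a single batch, here is a minimal sketch in plain Python, without Spark. The function `count_words` and the sample `batch` are hypothetical and only mirror the logic of `flatMap`, `map`, and `reduceByKey`; they are not how Spark implements them.

```python
from collections import defaultdict


def count_words(lines):
    """Count words in one batch of lines (illustrative stand-in, not Spark)."""
    # Step 1 (like flatMap): split every line and flatten into one word list.
    words = [word for line in lines for word in line.split(" ")]
    # Step 2 (like map): pair each word with a count of 1.
    pairs = [(word, 1) for word in words]
    # Step 3 (like reduceByKey): sum the counts for each distinct word.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)


# Made-up sample batch of two lines.
batch = ["hello world", "hello spark"]
print(count_words(batch))  # {'hello': 2, 'world': 1, 'spark': 1}
```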

In [2]:
from pyspark import SparkContext

In [3]:
# import the entry point for Spark Streaming
from pyspark.streaming import StreamingContext

In [4]:
# run locally with two threads: one to receive data, one to process it
sc = SparkContext('local[2]','NetworkWordCount')

In [7]:
# batch interval of 1 second
ssc = StreamingContext(sc,1)

In [9]:
# create a DStream connected to a hostname and port
lines = ssc.socketTextStream('localhost',9999)

In [12]:
# split each input line into words
words = lines.flatMap(lambda line: line.split(' '))

In [13]:
# map each word to a (word, 1) pair
pairs = words.map(lambda word: (word, 1))

In [14]:
# sum the counts for each word
word_counts = pairs.reduceByKey(lambda num1, num2: num1 + num2)

In [15]:
word_counts.pprint()

In [16]:
ssc.start()

-------------------------------------------
Time: 2018-01-24 14:13:38
-------------------------------------------

-------------------------------------------
Time: 2018-01-24 14:13:39
-------------------------------------------

-------------------------------------------
Time: 2018-01-24 14:13:40
-------------------------------------------

-------------------------------------------
Time: 2018-01-24 14:13:41
-------------------------------------------

-------------------------------------------
Time: 2018-01-24 14:13:42
-------------------------------------------

-------------------------------------------
Time: 2018-01-24 14:13:43
-------------------------------------------

-------------------------------------------
Time: 2018-01-24 14:13:44
-------------------------------------------

-------------------------------------------
Time: 2018-01-24 14:13:45
-------------------------------------------

-------------------------------------------
Time: 2018-01-24 14:13:46
-------------------------------------------

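#### Before starting the stream, open a terminal and run "nc -lk 9999" so that something is listening on port 9999 for the socket text stream to connect to. Lines typed into that terminal become the input data.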
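
If `nc` is not available, a small Python listener could play the same role. This is only a sketch under assumptions: the helper names `open_listener` and `send_lines` are made up here, and the script serves a single connection with a fixed list of lines rather than interactive input. The socket text stream connects as a client, so the script has to listen and write newline-terminated lines.

```python
import socket


def open_listener(host="localhost", port=9999):
    """Open a TCP socket that listens for one incoming connection."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow quick restarts on the same port.
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    return server


def send_lines(server, lines):
    """Wait for a client to connect, then send each line followed by a newline."""
    conn, _addr = server.accept()
    with conn:
        for line in lines:
            conn.sendall((line + "\n").encode("utf-8"))


# Example usage (blocks until a client connects, so it is left commented out):
# listener = open_listener()
# send_lines(listener, ["hello world", "hello spark"])
# listener.close()
```

Unlike `nc -lk`, this sketch closes the connection after sending its lines, so it suits a quick test more than a long-running session.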