### Spark Streaming

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant processing of live data streams. Data can be ingested from many sources, such as Kafka, Flume, Kinesis, or TCP sockets, and processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. You can also apply Spark’s machine learning and graph processing algorithms to data streams.

First, we import the Spark Streaming classes, the Kafka integration, and some implicit conversions from StreamingContext that add useful methods to other classes we need (like DStream). StreamingContext is the main entry point for all streaming functionality. We create a StreamingContext on top of the notebook's existing sparkContext, with a batch interval of 10 seconds.

In [ ]:
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka._
import org.apache.spark.SparkConf


In [ ]:
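// Alternative application name: val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val sparkConf = new SparkConf().setAppName("Test Kafka2") // not used below: the context reuses the notebook's sparkContext
// Streaming context on the notebook's existing SparkContext, with a 10-second batch interval
val ssc = new StreamingContext(sparkContext, Seconds(10))
// Checkpointing must be enabled for the windowed reduction with an inverse function used later
ssc.checkpoint("checkpoint")
// In a standalone application these values would come from the command line:
// val Array(zkQuorum, group, topics, numThreads) = args
val zkQuorum = "ecoles.node1.pro.hupi.loc"   // ZooKeeper quorum
val group = "DEMO_HUPI_VINCENT"              // Kafka consumer group
val topics = "ecoles_hupilytics_scandivie"   // comma-separated list of topics
val numThreads = "1"                         // consumer threads per topic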

sparkConf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@14232954
ssc: org.apache.spark.streaming.StreamingContext = org.apache.spark.streaming.StreamingContext@691a570d
zkQuorum: String = ecoles.node1.pro.hupi.loc
group: String = DEMO_HUPI_VINCENT
topics: String = ecoles_hupilytics_scandivie
numThreads: String = 1


# Print the events and count the network_events

In [ ]:
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap

topicMap: scala.collection.immutable.Map[String,Int] = Map(ecoles_hupilytics_scandivie -> 1)
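
To see what this expression produces when several topics are listed, here is a minimal plain-Scala sketch that runs without Spark. The second topic name `other_topic` and the thread count of 2 are hypothetical values chosen only for illustration.

```scala
// Plain Scala, no Spark needed. "other_topic" and the thread count are hypothetical.
val exampleTopics = "ecoles_hupilytics_scandivie,other_topic"
val exampleThreads = "2"

// Split the comma-separated list and pair each topic with the thread count
val exampleTopicMap: Map[String, Int] =
  exampleTopics.split(",").map(topic => (topic, exampleThreads.toInt)).toMap

exampleTopicMap.foreach { case (topic, threads) =>
  println(s"$topic -> $threads consumer thread(s)")
}
```

Every topic gets the same number of consumer threads, because the single `numThreads` value is reused for each entry of the map.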


In [ ]:
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)

lines: org.apache.spark.streaming.dstream.DStream[String] = org.apache.spark.streaming.dstream.MappedDStream@158717b4


In [ ]:
val words = lines.flatMap(_.split(" "))

words: org.apache.spark.streaming.dstream.DStream[String] = org.apache.spark.streaming.dstream.FlatMappedDStream@74b649df


In [ ]:
val wordCounts = words.map(x => (x, 1L)).reduceByKeyAndWindow(_ + _, _ - _, Seconds(10), Seconds(10), 1)

wordCounts: org.apache.spark.streaming.dstream.DStream[(String, Long)] = org.apache.spark.streaming.dstream.ReducedWindowedDStream@4d8347bf
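
The second argument of `reduceByKeyAndWindow` is an inverse function: instead of recomputing each window from scratch, the new window is derived from the previous one by adding the batches that enter and subtracting the batches that leave. The following plain-Scala sketch illustrates that idea only; it is not Spark's implementation, and the batch contents and helper names (`combine`, `slide`) are made up for the example. It uses a window of two batches so that the subtraction is visible, whereas in the cell above the window and slide durations both equal the 10-second batch interval, so each window covers exactly one batch.

```scala
// Plain-Scala sketch of the incremental update idea behind an inverse reduce function.
// Batch contents and helper names are hypothetical.
type Counts = Map[String, Long]

// Merge two count maps key by key with the given function
def combine(a: Counts, b: Counts)(f: (Long, Long) => Long): Counts =
  (a.keySet ++ b.keySet).map { k =>
    k -> f(a.getOrElse(k, 0L), b.getOrElse(k, 0L))
  }.toMap

// new window = previous window + entering batch - leaving batch
def slide(previous: Counts, entering: Counts, leaving: Counts): Counts =
  combine(combine(previous, entering)(_ + _), leaving)(_ - _)

val batch1: Counts = Map("add" -> 2L, "view" -> 1L)
val batch2: Counts = Map("view" -> 3L)
val batch3: Counts = Map("add" -> 1L)

// Window spanning two batches, sliding forward by one batch
val window12 = combine(batch1, batch2)(_ + _)
val window23 = slide(window12, entering = batch3, leaving = batch1)

// The incremental result matches a full recomputation over batch2 and batch3
println(window23 == combine(batch2, batch3)(_ + _))
```

In this sketch a key whose count drops to zero stays in the map; Spark's `reduceByKeyAndWindow` accepts an optional `filterFunc` argument to drop such entries.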


In [ ]:
// An output operation must be defined before starting Spark Streaming
wordCounts.print()

In [ ]:
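// Start receiving and processing data
ssc.start()
// Blocks until the streaming context is stopped; cancel the cell to interrupt it in the notebook
ssc.awaitTermination()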

-------------------------------------------
Time: 1517239960000 ms
-------------------------------------------
(Safari/603.1.30","rec":"1","current_ts":1517239630,"ref":"https://www.google.fr/","java":true,"major":"10","action_name":"Cocotte,1)
(fonte"],"5":["_pkc","[3,63,276]"],"42":["lang","FR"]},"event_timestamp":"2018-01-29T14:51:57.000Z","urlref_domain":"search.lilo.org","@version":"1","gt_ms":947,"client":"scandivie","fla":false,"lang":"FR","geoip":{"timezone":"Europe/Paris","ip":"86.200.132.173","latitude":48.8582,"country_name":"France","country_code2":"FR","continent_code":"EU","country_code3":"FR","location":[48.8582,2.3387000000000002],"real_region_name":null,"longitude":2.3387000000000002},"urlref":"https://search.lilo.org/searchweb.php?q=fer+a+gaufre+en+fonte","os":"Mac,2)
(Safari/603.1.30","rec":"1","current_ts":1517239630,"ref":"https://www.google.fr/","java":true,"major":"10","e_n":"6","event_date":"2018-01-29","action":"add","catchbox_ts":1517239630,"gears":false,"send

The cell was cancelled.
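
The output above shows that the messages are JSON records, so splitting them on spaces produces arbitrary fragments rather than meaningful words. One possible alternative, sketched here in plain Scala, is to extract a single field from each record and count its values. This assumes each record carries a string field named `action` (as seen in the output); the sample records are abbreviated, made-up lines modeled on that output, and the regex is a shortcut rather than a full JSON parser.

```scala
// Plain-Scala sketch: pull one string field out of each JSON record with a regex.
// The sample records are made up; the helper name extractAction is hypothetical.
val actionField = "\"action\":\"([^\"]*)\"".r

def extractAction(record: String): Option[String] =
  actionField.findFirstMatchIn(record).map(_.group(1))

val sampleRecords = Seq(
  """{"client":"scandivie","action":"add","lang":"FR"}""",
  """{"client":"scandivie","action":"add","lang":"FR"}""",
  """{"client":"scandivie","action_name":"Cocotte","lang":"FR"}""" // no "action" field: skipped
)

// Count records per action value, ignoring records without the field
val actionCounts: Map[String, Int] =
  sampleRecords
    .flatMap(r => extractAction(r).toList)
    .groupBy(identity)
    .map { case (action, occurrences) => action -> occurrences.size }

println(actionCounts)
```

Under these assumptions, the same function could be applied to `lines` with `flatMap(r => extractAction(r).toList)` in place of the `split(" ")` step, leaving the windowed count unchanged.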


In [ ]:
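// Note: by default this also stops the underlying SparkContext; ssc.stop(stopSparkContext = false) keeps it alive
ssc.stop()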