For Twitter data, you'll need to visit [the Twitter App management page](https://dev.twitter.com), create an application if necessary, and fill in your credentials in `twitter4j.properties`.
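
As a rough guide, a `twitter4j.properties` file usually holds the four OAuth values below. The property names are the ones Twitter4J reads; the values are placeholders to replace with the keys from your own application. Twitter4J normally looks for this file at the root of the classpath or in the working directory, though where that is for this notebook kernel is an assumption you may need to check.

```properties
# twitter4j.properties (placeholder values; substitute your own credentials)
oauth.consumerKey=YOUR_CONSUMER_KEY
oauth.consumerSecret=YOUR_CONSUMER_SECRET
oauth.accessToken=YOUR_ACCESS_TOKEN
oauth.accessTokenSecret=YOUR_ACCESS_TOKEN_SECRET
```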

In [0]:
%dependency org.apache.spark %% spark-streaming % 1.3.1

Added dependency org.apache.spark %% spark-streaming % 1.3.1

In [0]:
%dependency org.apache.spark %% spark-streaming-twitter % 1.3.1

Added dependency org.apache.spark %% spark-streaming-twitter % 1.3.1

In [0]:
%dependency com.google.code.gson % gson % 2.4

Added dependency com.google.code.gson % gson % 2.4

In [1]:
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._

import com.google.gson.Gson





Q: What is Spark?  
A: A distributed computation engine.

Q: How does Spark fit into a data science workflow?  
A: It can be used standalone or in conjunction with other technologies like HDFS, YARN, Hive, etc.

Q: What are the nouns in Spark?  
A: Resilient Distributed Datasets (RDDs) are the most important. DataFrames are cool too.

Q: What are the verbs in Spark?  
A: Transformations and actions. Understanding the nuances of these will lead you to learn about shuffles, lazy evaluation, partitioning, and other details which influence how Spark performs.

Q: Why is Spark popular right now?  
A: It's flexible, powerful, and fast at the kind of tasks currently in vogue.
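
To make the nouns and verbs above concrete, here is a minimal sketch of an RDD with two transformations and two actions. It assumes `spark` is the Spark context this notebook provides; the variable names are illustrative only.

```scala
// Assumes `spark` is the SparkContext provided by the notebook.
val numbers = spark.parallelize(1 to 10)

// Transformations are lazy: these lines only record the lineage, nothing runs yet.
val evens   = numbers.filter(_ % 2 == 0)
val squares = evens.map(n => n * n)

// Actions trigger the computation and return a result to the driver.
val total = squares.reduce(_ + _)   // 4 + 16 + 36 + 64 + 100 = 220
val count = squares.count()         // 5
```

Because nothing executes until `reduce` is called, Spark can plan the `filter` and `map` together as a single pass over each partition.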

## Goals

* Ingest streaming data from one of two sources: either Twitter or Wikipedia edits
* Perform pre-processing and save data to persistent storage

In [None]:
val WikiServerHost  = "ec2-54-213-33-240.us-west-2.compute.amazonaws.com"
val WikiServerPort  = 9002

val BatchInterval   = Seconds(10)

In [None]:
val ssc = new StreamingContext(spark, BatchInterval)

*Note:* The Spark context `spark` is provided for us here. Usually you will instantiate one yourself, with configuration supplied in the application, on the command line, or through environment variables.
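
For a standalone application (not this notebook, where a context already exists), creating the contexts yourself might look like the sketch below. The app name and the `SPARK_MASTER` environment variable are hypothetical; the `local[2]` fallback reflects that a local streaming job needs at least one thread for the receiver and one for processing.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical app name; master URL taken from a hypothetical environment
// variable, falling back to two local threads.
val conf = new SparkConf()
  .setAppName("streaming-ingest")
  .setMaster(sys.env.getOrElse("SPARK_MASTER", "local[2]"))

val sc  = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(10))  // same 10-second batch interval as above
```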

### Choose a data source

In [None]:
// Twitter
val dStream = TwitterUtils.createStream(ssc, None)
dStream.foreachRDD { (rdd, time) =>
    // Skip empty batches so we don't write empty output directories
    if (!rdd.isEmpty) {
        rdd.map { tweet =>
            val gson = new Gson()
            gson.toJson(tweet)
        }.saveAsTextFile("data/" + time.milliseconds.toString)
    }
}

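// Gson is not Serializable, so we construct it inside the map closure (on the executors).
// Creating it on the driver and referencing it in the closure would fail with "Task not serializable".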
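
One possible refinement, offered as a sketch rather than as the notebook's method: `mapPartitions` lets us build a single `Gson` per partition instead of one per tweet, while still constructing it on the executor. Treat it as a replacement for the `foreachRDD` call above, not an addition, since registering both would write each batch to the same path twice.

```scala
// Alternative body for the Twitter cell; assumes `dStream` and the Gson import from above.
dStream.foreachRDD { (rdd, time) =>
    if (!rdd.isEmpty) {
        rdd.mapPartitions { tweets =>
            val gson = new Gson()   // created once per partition, on the executor
            tweets.map(tweet => gson.toJson(tweet))
        }.saveAsTextFile("data/" + time.milliseconds.toString)
    }
}
```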

In [None]:
// Wikipedia
val dStream = ssc.socketTextStream(WikiServerHost, WikiServerPort)
dStream.foreachRDD { (rdd, time) =>
    // Each line from the socket is saved as-is, one directory per non-empty batch
    if (!rdd.isEmpty) {
        rdd.saveAsTextFile("data/" + time.milliseconds.toString)
    }
}

Check out [the programming guide](https://spark.apache.org/docs/1.3.1/streaming-programming-guide.html) for useful diagrams of DStreams.

*Question:* How do we decide what logic to put here and what to save till later?

In [None]:
ssc.start()

In [None]:
ssc.stop(stopSparkContext = false, stopGracefully = true)
// A stopped StreamingContext cannot be restarted; we'll have to create a new one to try again

**Remember to stop the Streaming Context!**