# Overview of Discretized Streams

**Discretized Streams** (or **DStreams**) are the basic abstraction provided by Spark Streaming. A DStream represents a continuous stream of data: either the input stream received from a source, or the processed stream generated by transforming the input. Internally, a DStream is a continuous series of RDDs, Spark’s abstraction of an immutable, distributed dataset, and each RDD in the series holds the data from one interval.

As a consequence, any operation applied to a DStream translates to operations on the underlying RDDs. For example, in the earlier example of converting a stream of lines to words, the `flatMap` operation is applied to each RDD in the `lines` DStream to generate the RDDs of the `words` DStream.
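The per-RDD behaviour described above can be illustrated without a cluster. The following is a minimal plain-Python model, not Spark code: each inner list stands in for one RDD (one batch), and the helper names `split_into_words` and `flat_map_batch` are made up for this sketch. With a real DStream the equivalent call would be `lines.flatMap(lambda line: line.split(" "))`.

```python
def split_into_words(line):
    """Split one line of text into its words."""
    return line.split(" ")


def flat_map_batch(batch, fn):
    """Apply fn to every item of one batch and flatten the results."""
    return [out for item in batch for out in fn(item)]


# Model of the `lines` DStream: a series of batches, one list per interval.
lines_batches = [
    ["to be or", "not to be"],
    ["that is the question"],
]

# The operation is applied batch by batch, yielding the `words` series.
words_batches = [flat_map_batch(batch, split_into_words) for batch in lines_batches]
print(words_batches)
```

The number of batches is unchanged by the transformation; only the contents of each batch are rewritten.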

Spark Streaming provides two categories of built-in streaming sources.

* Basic sources: sources directly available in the `StreamingContext` API, such as file systems and socket connections.
* Advanced sources: sources such as Kafka, Flume, and Kinesis, which are available through extra utility classes. We will cover these in more depth in later sections of the course.

If you want to receive multiple streams of data in parallel, you can create multiple input DStreams. This creates multiple receivers, which receive the data streams simultaneously. However, it is important to remember that a Spark Streaming application needs to be allocated enough cores to process the received data, as well as to run the receivers.
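As a rough sketch of receiving several streams in parallel, the function below opens one socket stream per port and merges them into a single DStream. The function name `receive_in_parallel` and its `host` and `ports` parameters are hypothetical; `socketTextStream` and `union` are methods of the PySpark `StreamingContext`, which is assumed to be passed in as `ssc`.

```python
def receive_in_parallel(ssc, host, ports):
    """Create one input DStream (and so one receiver) per port, then merge them.

    `ssc` is assumed to be an existing pyspark.streaming.StreamingContext.
    """
    streams = [ssc.socketTextStream(host, port) for port in ports]
    # union() combines the input DStreams into one DStream for processing.
    return ssc.union(*streams)
```

Each port in `ports` adds a receiver, so the application needs at least one more core than `len(ports)` to have capacity left for processing.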

Besides TCP sockets, the `StreamingContext` API provides methods for creating DStreams from files, queues of RDDs, and custom receivers.

* **File Streams:** For reading data from files on any file system compatible with the HDFS API (that is, HDFS, S3, NFS, etc.), a DStream can be created as `streamingContext.textFileStream(dataDirectory)`. Spark Streaming will monitor the directory `dataDirectory` and process any files created in that directory (files written in nested directories are not supported). Note that 1) the files must have the same data format, 2) the files must be created in `dataDirectory` by atomically moving or renaming them into it, and 3) once moved, the files must not be changed, so if a file is continuously appended to, the new data will not be read. Since file streams do not require running a receiver, no cores need to be allocated for receiving file data.

* **Queue of RDDs as a Stream:** For testing a Spark Streaming application with test data, one can also create a DStream based on a queue of RDDs, using `streamingContext.queueStream(queueOfRDDs)`. Each RDD pushed into the queue will be treated as a batch of data in the DStream, and processed like a stream.

* **Streams based on Custom Receivers:** DStreams can be created with data streams received through custom receivers.
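The file-stream requirement that files appear in the monitored directory by an atomic move or rename can be sketched as follows. This is a minimal example for a local file system only, assuming the staging and data directories are on the same file system; the helper `publish_file` and the directory and file names are hypothetical. On HDFS or S3 the move would be done with that system's own tools.

```python
import os
import tempfile


def publish_file(lines, data_directory, staging_directory, file_name):
    """Write the file in a staging directory, then move it into place in one step."""
    staging_path = os.path.join(staging_directory, file_name)
    final_path = os.path.join(data_directory, file_name)
    with open(staging_path, "w") as handle:
        for line in lines:
            handle.write(line + "\n")
    # os.replace renames atomically when both paths are on the same file system,
    # so a directory monitor never observes a half-written file.
    os.replace(staging_path, final_path)
    return final_path


# Throwaway directories for illustration; a streaming job would watch
# data_directory with streamingContext.textFileStream(data_directory).
base = tempfile.mkdtemp()
data_directory = os.path.join(base, "incoming")
staging_directory = os.path.join(base, "staging")
os.makedirs(data_directory)
os.makedirs(staging_directory)

published = publish_file(["hello world"], data_directory, staging_directory, "batch-0001.txt")
print(os.listdir(data_directory))
```

Once a file has been published this way it should be left untouched; new data goes into a new file rather than being appended to an existing one.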

### Demo
For testing a Spark Streaming application with test data, we are going to create a DStream based on a queue of RDDs, using `streamingContext.queueStream(queueOfRDDs)`. Each RDD pushed into the queue will be treated as a batch of data in the DStream, and processed like a stream.

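```python
import time

from pyspark import SparkContext
from pyspark.streaming import StreamingContext


if __name__ == "__main__":
    sc = SparkContext(appName="PythonStreamingQueueStream")
    # Batch interval of 1 second
    ssc = StreamingContext(sc, 1)

    # Build a queue of 5 RDDs, each holding the integers 1..1000 in 10 partitions
    rddQueue = []
    for i in range(5):
        rddQueue += [ssc.sparkContext.parallelize([j for j in range(1, 1001)], 10)]

    # Each RDD in the queue becomes one batch of the input DStream
    inputStream = ssc.queueStream(rddQueue)
    # Count how many values in each batch fall into each residue class modulo 10
    mappedStream = inputStream.map(lambda x: (x % 10, 1))
    reducedStream = mappedStream.reduceByKey(lambda a, b: a + b)
    reducedStream.pprint()

    ssc.start()
    # Let the stream run long enough to consume the 5 queued batches
    time.sleep(6)
    ssc.stop(stopSparkContext=True, stopGraceFully=True)
```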
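Because every queued RDD in the demo holds the integers 1 to 1000, the counts that `reduceByKey` should produce for each batch can be worked out without Spark. The following plain-Python check is only a sketch of that arithmetic; the variable names are chosen for this example.

```python
from collections import Counter

# One batch of the demo: the integers 1..1000.
batch = range(1, 1001)

# Same logic as map(lambda x: (x % 10, 1)) followed by reduceByKey(add).
expected_counts = dict(Counter(x % 10 for x in batch))
print(sorted(expected_counts.items()))
```

Each of the ten keys occurs 100 times per batch. The order in which `pprint()` lists the pairs in the streaming job is not guaranteed to match the sorted order used here.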

**Important notes**

* When running a Spark Streaming program locally, do not use `local` or `local[1]` as the master URL. Either of these means that only one thread will be used for running tasks locally. If you are using an input DStream based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single thread will be used to run the receiver, leaving no thread for processing the received data. Hence, when running locally, always use `local[n]` as the master URL, where n is greater than the number of receivers to run (see Spark Properties for information on how to set the master).

* Extending the logic to running on a cluster, the number of cores allocated to the Spark Streaming application must be more than the number of receivers. Otherwise the system will receive data, but not be able to process it.
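The rule for local runs can be expressed as a small helper. This is an illustrative sketch; the function name `local_master_url` is made up here, and it returns the smallest `local[n]` value that satisfies n > number of receivers.

```python
def local_master_url(num_receivers):
    """Return the smallest local master URL that leaves a thread free for processing."""
    if num_receivers < 0:
        raise ValueError("num_receivers must be non-negative")
    # One thread per receiver, plus at least one thread to process the data.
    return "local[{}]".format(num_receivers + 1)


print(local_master_url(1))
```

The result could then be passed as the master when creating the context, for example `SparkContext(master=local_master_url(1), appName="PythonStreamingQueueStream")`. In practice a larger n gives more threads for processing.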

### References
1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources
2. https://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams
3. https://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#pyspark.streaming.StreamingContext