## Sources: Acquiring Streaming Data

Accessing `spark.readStream` returns a `DataStreamReader` instance. This instance manages the stream format and the configuration supplied through its builder methods.

When the source is selected with the *.format(...)* method, nothing is validated until *.load(...)* is called on the `DataStreamReader` instance. At that point the options provided to the builder are validated and, if everything checks out, a streaming DataFrame is returned.

The format-specific "loading" methods, such as *.json(...)* and *.csv(...)*, select the format and load it in a single call, so the streaming DataFrame is returned directly.


### Available streaming sources

1. **File-based source** <br>
    Monitors a path in a filesystem and consumes files atomically placed in it. The files found are then parsed with the specified format and processed in order of file modification time.
    Supported formats are:
    * text
    * csv
    * json
    * orc
    * parquet



2. **Socket source** <br>
Establishes a client connection to a TCP server and reads UTF-8 text data through a socket connection.
3. **Kafka source** <br>
Creates a Kafka consumer that retrieves data from Kafka.
4. **Rate source & Rate per Micro-batch source** <br>
Generates rows at a specified rate: a fixed number of rows per second (rate source) or per micro-batch (rate per micro-batch source). Each output row contains a *timestamp* and a *value*. It is mainly intended as a testing source.
5. **Table source** <br>
Creates a streaming DataFrame from a table.


In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .appName("SourcesSinks") \
        .config("spark.sql.warehouse.dir", "./warehouse") \
        .enableHiveSupport() \
        .getOrCreate()

24/04/20 18:46:24 WARN Utils: Your hostname, DELEQ0283302041 resolves to a loopback address: 127.0.1.1; using 172.31.227.62 instead (on interface eth0)
24/04/20 18:46:24 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/04/20 18:46:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/04/20 18:46:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/04/20 18:46:26 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


In [3]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DecimalType
# define schema
real_estate_schema = StructType(
    [StructField('UID', IntegerType()), 
    StructField('Location', StringType(), True), 
    StructField('Price', DecimalType(15,2), True), 
    StructField('Bedrooms', IntegerType(), True), 
    StructField('Bathrooms', IntegerType(), True), 
    StructField('Size', IntegerType(), True), 
    StructField('Price SQ Ft', DecimalType(10,2), True), 
    StructField('Status', StringType(), True)])

# Cleanup
spark.sql("DROP TABLE IF EXISTS real_estate")

# create table from DataFrame
real_estate_df = (
    spark
    .read
    .schema(real_estate_schema)
    .csv("../data/batch_resource", header = True))

real_estate_df.write.saveAsTable('real_estate')

24/04/20 18:46:42 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
24/04/20 18:46:42 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
24/04/20 18:46:42 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
24/04/20 18:46:42 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
24/04/20 18:46:42 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException


In [4]:
real_estate_stream = (
    spark
    .readStream
    # .table() selects the table source itself, so no .format(...) call is needed
    .table("real_estate"))

A streaming DataFrame, like a regular DataFrame, is lazily evaluated. What we get is a representation of the stream that we can use to express the series of transformations we want to apply.
Creating a streaming DataFrame does not consume or process any data until the stream is materialized, that is, until a `StreamingQuery` is started by calling *.start()* on the `DataStreamWriter`.

## Data Transformation
....

## Sinks: Output the resulting Data

In [5]:
query = real_estate_stream.writeStream.format("console").queryName("Table stream").start()

24/04/20 18:50:25 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-ce0080eb-6f5f-471a-8756-ebcf67d80d6c. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
24/04/20 18:50:25 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


-------------------------------------------
Batch: 0
-------------------------------------------
+------+------------------+---------+--------+---------+----+-----------+----------+
|   UID|          Location|    Price|Bedrooms|Bathrooms|Size|Price SQ Ft|    Status|
+------+------------------+---------+--------+---------+----+-----------+----------+
|132842|     Arroyo Grande|795000.00|       3|        3|2371|     335.30|Short Sale|
|134364|       Paso Robles|399000.00|       4|        3|2818|     141.59|Short Sale|
|135141|       Paso Robles|545000.00|       4|        3|3032|     179.75|Short Sale|
|135712|         Morro Bay|909000.00|       4|        4|3540|     256.78|Short Sale|
|136282|Santa Maria-Orcutt|109900.00|       3|        1|1249|      87.99|Short Sale|
|136431|            Oceano|324900.00|       3|        3|1800|     180.50|Short Sale|
|137036|Santa Maria-Orcutt|192900.00|       4|        2|1603|     120.34|Short Sale|
|137090|Santa Maria-Orcutt|215000.00|       3|       

The `query` object we have just created is a `StreamingQuery` instance. It provides a handle to the query, which executes continuously in the background as new data arrives.
 

In [6]:

query.isActive

True

In [None]:
query.lastProgress