AutoLoader 
- Specifying format("cloudFiles") tells AutoLoader to load files as they arrive.
- To infer the schema automatically, AutoLoader samples the first 50 GB or 1000 files (whichever limit is reached first). This is a fairly heavy operation. To avoid repeating it every time a stream is started, and to keep the schema consistent, it is worth specifying a schema location, as follows:
    
    .option("cloudFiles.schemaLocation", 
                                 "dbfs:/FileStore/datasets/laptop_source_stream")

This will create a "_schemas" directory at "dbfs:/FileStore/datasets/laptop_source_stream/_schemas"
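
The sampling limits mentioned above can apparently be tuned through Spark configuration. The following is a minimal sketch, assuming the Databricks keys `spark.databricks.cloudFiles.schemaInference.sampleSize.numBytes` and `spark.databricks.cloudFiles.schemaInference.sampleSize.numFiles` are available in your runtime. The helper name `limit_schema_sampling` and the default values are hypothetical and not part of the original notebook.

```python
def limit_schema_sampling(spark, num_bytes="10gb", num_files=100):
    """Lower the amount of data sampled for schema inference.

    `spark` is the active SparkSession (predefined in Databricks notebooks).
    The default values are illustrative only, not recommendations.
    """
    prefix = "spark.databricks.cloudFiles.schemaInference.sampleSize."
    settings = {
        prefix + "numBytes": num_bytes,        # byte string such as "10gb"
        prefix + "numFiles": str(num_files),   # maximum number of files to sample
    }
    # Apply each setting to the session configuration.
    for key, value in settings.items():
        spark.conf.set(key, value)
    return settings
```

In a notebook this would be called as `limit_schema_sampling(spark)` before the stream is started for the first time.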

In [0]:
laptop_stream_data = spark.readStream.format("cloudFiles") \
                          .option("cloudFiles.format", "csv") \
                          .option("cloudFiles.schemaLocation", 
                                 "dbfs:/FileStore/datasets/laptop_source_stream") \
                          .load("dbfs:/FileStore/datasets/laptop_source_stream")

In [0]:
# the DataFrame contains a new column, "_rescued_data"
display(laptop_stream_data)

Id,Company,Product,TypeName,Price_euros,_rescued_data
21,Asus,Vivobook E200HA,Netbook,191.9,
22,Lenovo,Legion Y520-15IKBN,Gaming,999.0,
23,HP,255 G6,Notebook,258.0,
24,Dell,Inspiron 5379,2 in 1 Convertible,819.0,
25,HP,15-BS101nv (i7-8550U/8GB/256GB/FHD/W10),Ultrabook,659.0,
26,Dell,Inspiron 3567,Notebook,418.64,
27,Apple,MacBook Air,Ultrabook,1099.0,
28,Dell,Inspiron 5570,Notebook,800.0,
29,Dell,Latitude 5590,Ultrabook,1298.0,
30,HP,ProBook 470,Notebook,896.0,


In [0]:
# Let's see what the schema looks like. Although it is hard to read, the output is clearly in JSON format.
dbutils.fs.head('dbfs:/FileStore/datasets/laptop_source_stream/_schemas/0', 1000)
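
Since the raw output is hard to read, a small helper can list the inferred columns. This is only a sketch: it assumes the schema file holds a short version line followed by a JSON object whose `dataSchemaJson` field carries the Spark schema as a nested JSON string. That layout is an assumption and may differ between runtime versions. The sample text and the name `summarize_schema_file` are hypothetical.

```python
import json

def summarize_schema_file(raw_text):
    """Return (column, type) pairs from the text of a schema file."""
    # Skip any non-JSON preamble (e.g. a version marker) and take the first JSON line.
    json_line = next(
        line for line in raw_text.splitlines() if line.lstrip().startswith("{")
    )
    payload = json.loads(json_line)
    # Assumed layout: the Spark schema is a JSON string nested under "dataSchemaJson".
    if "dataSchemaJson" in payload:
        schema = json.loads(payload["dataSchemaJson"])
    else:
        schema = payload
    return [(field["name"], field["type"]) for field in schema["fields"]]

# Hypothetical sample, shaped like the assumed file layout.
sample = "v1\n" + json.dumps({
    "dataSchemaJson": json.dumps({
        "type": "struct",
        "fields": [
            {"name": "Id", "type": "string", "nullable": True, "metadata": {}},
            {"name": "Company", "type": "string", "nullable": True, "metadata": {}},
        ],
    })
})

for name, dtype in summarize_schema_file(sample):
    print(f"{name}: {dtype}")
```

In the notebook, the text returned by `dbutils.fs.head(...)` could be passed in place of `sample`. If the file is longer than the requested byte count, the JSON would be truncated and parsing would fail, so the limit may need to be raised.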

Now we're going to add a file with a different schema (it contains an extra column) to our source location. Let's observe what happens.
Take a look at the output of the display(laptop_stream_data) command above and check the bottom of the "_rescued_data" column. It is now being populated with rows in the following format: {"TYPENAME":"Notebook","_file_path":"dbfs:/FileStore/datasets/laptop_source_stream/laptops_03_extracol.csv"}
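
The rescued value shown above is a JSON string, so it can be unpacked with the standard library. A minimal sketch in plain Python (outside Spark); the helper name `split_rescued` is hypothetical:

```python
import json

def split_rescued(rescued_json):
    """Separate rescued column values from the source file path."""
    if rescued_json is None:
        # Rows that matched the schema carry no rescued data.
        return {}, None
    record = json.loads(rescued_json)
    file_path = record.pop("_file_path", None)
    return record, file_path

example = (
    '{"TYPENAME":"Notebook",'
    '"_file_path":"dbfs:/FileStore/datasets/laptop_source_stream/laptops_03_extracol.csv"}'
)
columns, path = split_rescued(example)
print(columns)  # {'TYPENAME': 'Notebook'}
print(path)     # dbfs:/FileStore/datasets/laptop_source_stream/laptops_03_extracol.csv
```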

Now we're going to add a file with a different schema (rearranged columns) to our source location. Let's observe what happens.
Take a look at the output of the display(laptop_stream_data) command above and check the bottom of the "_rescued_data" column. The new rows have null values there, which means that AutoLoader managed to rearrange the columns correctly.
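
The idea behind this behaviour, matching values by column name rather than by position, can be illustrated in plain Python. This is only an analogy using `csv.DictReader`, not AutoLoader's actual implementation. The two sample files and the helper `read_aligned` are hypothetical.

```python
import csv
import io

# Two hypothetical files with the same columns in a different order.
file_a = "Id,Company,Price_euros\n21,Asus,191.9\n"
file_b = "Company,Price_euros,Id\nLenovo,999.0,22\n"
target_columns = ["Id", "Company", "Price_euros"]

def read_aligned(text, columns):
    """Read CSV text and return rows ordered by the target column names."""
    reader = csv.DictReader(io.StringIO(text))
    # Each row is a dict keyed by header name, so source column order is irrelevant.
    return [[row.get(col) for col in columns] for row in reader]

rows = read_aligned(file_a, target_columns) + read_aligned(file_b, target_columns)
print(rows)  # [['21', 'Asus', '191.9'], ['22', 'Lenovo', '999.0']]
```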

## Write processed data to a sink

In [0]:
laptop_stream_subset = laptop_stream_data.filter(laptop_stream_data.Company == "Dell").select('Company', 'Product', 'Price_euros')

In [0]:
laptop_stream_subset.display()

Company,Product,Price_euros
Dell,Inspiron 7577,1499.0
Dell,Inspiron 7773,999.0
Dell,Inspiron 3567,639.0
Dell,Inspiron 3576,767.8
Dell,Inspiron 5770,1299.0
Dell,Vostro 5471,879.0
Dell,Inspiron 5370,955.0
Dell,Inspiron 5570,870.0
Dell,Inspiron 5570,855.0
Dell,Inspiron 5379,819.0


In [0]:
# create a destination folder where the filtered files will be written
dbutils.fs.mkdirs("dbfs:/FileStore/datasets/dest_location/")

In [0]:
# Writing data is performed by invoking .writeStream. Note that .option("mergeSchema", "true") is meant to merge differing schemas in the output
# (this option is mainly documented for Delta sinks, so its effect on a csv sink may be limited).
# A checkpoint location also needs to be specified when writing to persistent storage (will be covered in future sessions/notebooks).

laptop_stream_subset.writeStream \
                    .option("mergeSchema", "true") \
                    .format("csv") \
                    .option("checkpointLocation", 
                           "dbfs:/FileStore/datasets/dest_location/checkpoint_1") \
                    .start("dbfs:/FileStore/datasets/dest_location/")
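
The query started above keeps running until it is stopped. A minimal sketch for stopping every active query in the session, assuming `spark` is the notebook's SparkSession; the helper name `stop_active_streams` is hypothetical and not part of the original notebook.

```python
def stop_active_streams(spark):
    """Stop every streaming query that is still active in this session."""
    stopped_ids = []
    # spark.streams.active lists the currently running StreamingQuery objects.
    for query in list(spark.streams.active):
        query.stop()
        stopped_ids.append(str(query.id))
    return stopped_ids
```

Calling `stop_active_streams(spark)` at the end of the notebook would also stop the streams behind the display() cells, since they run as streaming queries too.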