Structured Streaming - take the same operations that you perform in batch mode using Spark’s structured APIs, and run them in a streaming fashion.

In [0]:
staticDataFrame = spark.read.format("csv")\
  .option("header", "true")\
  .option("inferSchema", "true")\
  .load("/FileStore/tables/retail-data-by-day/*.csv")

staticDataFrame.createOrReplaceTempView("retail_data")
staticSchema = staticDataFrame.schema


Look at the sale hours during which a given customer (identified by CustomerId) makes a large purchase. For example, add a total cost column and see on which days a customer spent the most:

In [0]:
from pyspark.sql.functions import window, column, desc, col
staticDataFrame\
  .selectExpr(
    "CustomerId",
    "(UnitPrice * Quantity) as total_cost",
    "InvoiceDate")\
  .groupBy(
    col("CustomerId"), window(col("InvoiceDate"), "1 day"))\
  .sum("total_cost")\
  .sort(desc("sum(total_cost)"))\
  .show(5)
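
To make the grouping concrete, here is a minimal plain-Python sketch of what the query above computes: a sum of UnitPrice * Quantity per customer per 1-day window. The sample rows are hypothetical, and the sketch simply starts each window at midnight and ignores time zones, which Spark's window function handles in its own way.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows: (CustomerId, UnitPrice, Quantity, InvoiceDate)
rows = [
    (17850, 2.55, 6, "2010-12-01 08:26:00"),
    (17850, 3.39, 6, "2010-12-01 17:45:00"),
    (13047, 1.85, 12, "2010-12-01 08:34:00"),
    (17850, 7.65, 2, "2010-12-02 09:00:00"),
]

def daily_totals(rows):
    """Sum UnitPrice * Quantity per (CustomerId, calendar day)."""
    totals = defaultdict(float)
    for customer_id, unit_price, quantity, invoice_date in rows:
        ts = datetime.strptime(invoice_date, "%Y-%m-%d %H:%M:%S")
        # Simplification: the 1-day window starts at midnight of the event's day.
        window_start = ts.replace(hour=0, minute=0, second=0)
        totals[(customer_id, window_start)] += unit_price * quantity
    # Largest totals first, like sort(desc("sum(total_cost)")).
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

result = daily_totals(rows)
for (customer_id, window_start), total in result:
    print(customer_id, window_start.date(), round(total, 2))
```

The two purchases that customer 17850 made on the same day fall into one window and are summed, while the purchase on the next day lands in a separate window.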


It is good practice to set the number of shuffle partitions to a value that fits the environment, which reduces overhead. This configuration specifies the number of partitions that should be created after a shuffle. By default, the value is 200, but for illustration we can pretend there aren’t many executors on this machine, so it’s worth reducing this to 5.

In [0]:
spark.conf.set("spark.sql.shuffle.partitions","5")
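
As a rough illustration of what "partitions created after a shuffle" means, the sketch below assigns each grouping key to one of 5 partitions by hashing it. This is only a plain-Python analogy: `partition_for` is a hypothetical helper, the customer ids are made up, and crc32 stands in for whatever hash function Spark actually uses.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition id in [0, num_partitions)."""
    # crc32 is a stand-in hash for this sketch, not Spark's own hash function.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

customer_ids = [str(cid) for cid in range(12346, 12366)]  # hypothetical keys
assignment = {cid: partition_for(cid, 5) for cid in customer_ids}
partitions_used = sorted(set(assignment.values()))
print(partitions_used)
```

All rows with the same key end up in the same partition, and the partition count caps how many tasks the post-shuffle stage is split into. With little data and few cores, 200 mostly empty partitions add scheduling overhead for no benefit.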

Very little actually changes about the code when streaming. The biggest changes are using readStream instead of read, and the maxFilesPerTrigger option, which specifies the number of files to read in at once. This option is set here to demonstrate “streaming”; in a production scenario it would probably be omitted.

In [0]:
streamingDataFrame = spark.readStream\
    .schema(staticSchema)\
    .option("maxFilesPerTrigger", 1)\
    .format("csv")\
    .option("header", "true")\
    .load("/FileStore/tables/retail-data-by-day/*.csv")


We can check whether our DataFrame is streaming:

In [0]:
streamingDataFrame.isStreaming 

Apply the same business logic as in the static DataFrame manipulation above, performing a summation in the process.

In [0]:
purchaseByCustomerPerHour = streamingDataFrame\
  .selectExpr(
    "CustomerId",
    "(UnitPrice * Quantity) as total_cost",
    "InvoiceDate")\
  .groupBy(
    col("CustomerId"), window(col("InvoiceDate"), "1 day"))\
  .sum("total_cost")


This is still a lazy operation, so we need to call a streaming action to start the execution of the data flow. Streaming actions are a bit different from conventional static actions because they populate data somewhere instead of just calling something like count (which doesn’t make sense on a stream anyway). The action we will use outputs to an in-memory table that is updated after each trigger. Here each trigger is based on an individual file (the maxFilesPerTrigger read option that we set). Spark mutates the data in the in-memory table so that it always holds the highest value (as specified in our previous aggregation).

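In [0]:
# Start the stream: write the aggregation to an in-memory table.
# The table takes its name from the query name.
purchaseByCustomerPerHour.writeStream\
    .format("memory")\
    .queryName("customer_purchases")\
    .outputMode("complete")\
    .start()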
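
The trigger-by-trigger update of the result table described above can be imitated in plain Python. This is a minimal sketch, not Spark code: the `files` list is hypothetical data standing in for one input file per trigger, and `run_triggers` is a made-up helper that re-emits the whole aggregated table after every trigger, which is the idea behind the "complete" output mode.

```python
from collections import defaultdict

# Hypothetical micro-batches: one list of (CustomerId, day, total_cost) per file.
files = [
    [(17850, "2010-12-01", 15.0), (13047, "2010-12-01", 22.0)],
    [(17850, "2010-12-01", 20.0)],
    [(13047, "2010-12-02", 5.0), (17850, "2010-12-01", 1.0)],
]

def run_triggers(files):
    """Yield the full result table after each trigger (one file per trigger)."""
    state = defaultdict(float)  # running sum per (CustomerId, day)
    for batch in files:
        for customer_id, day, total_cost in batch:
            state[(customer_id, day)] += total_cost
        # Complete-style output: the whole aggregated table is rewritten each time.
        yield sorted(state.items(), key=lambda kv: kv[1], reverse=True)

snapshots = list(run_triggers(files))
for number, table in enumerate(snapshots, start=1):
    print("after trigger", number, table[0])
```

The top row changes as more files arrive, which is why repeating the same query against the in-memory table can return different results while the stream is running.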

Once the stream is running, we can query the in-memory table to debug what the result would look like if we were to write this out to a production sink:

In [0]:
spark.sql("""
  SELECT *
  FROM customer_purchases
  ORDER BY `sum(total_cost)` DESC
  """)\
  .show(5)


Another option is to write the results out to the console:

In [0]:
# Uncomment to print each trigger's result to the console instead.
# purchaseByCustomerPerHour.writeStream\
#     .format("console")\
#     .queryName("customer_purchases_2")\
#     .outputMode("complete")\
#     .start()

Neither of these streaming sinks (in-memory table or console) should be used in production
  - They are a convenient demonstration of Structured Streaming’s power.
  - The window is built on event time, not the time at which Spark processes the data.
  - Processing-time windows were a shortcoming of Spark Streaming that Structured Streaming has resolved.

In [0]:
staticDataFrame.printSchema()

For machine learning, we begin with some raw data and build up transformations to get the data into the right format, at which point we can actually train our model and then serve predictions. First, transform this data into a numerical representation:

In [0]:
from pyspark.sql.functions import date_format, col
preppedDataFrame = staticDataFrame\
  .na.fill(0)\
  .withColumn("day_of_week", date_format(col("InvoiceDate"), "EEEE"))\
  .coalesce(5)


Split the data into training and test sets, here manually by invoice date:

In [0]:
trainDataFrame = preppedDataFrame\
  .where("InvoiceDate < '2011-07-01'")
testDataFrame = preppedDataFrame\
  .where("InvoiceDate >= '2011-07-01'")


This splits our dataset roughly in half:

In [0]:
# Print both counts; a notebook cell only displays its last expression.
print(trainDataFrame.count(), testDataFrame.count())

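**StringIndexer** - turns our days of the week into corresponding numerical values. For example, Spark might represent Saturday as 6 and Monday as 1; the actual indices depend on the data, because by default labels are ordered by frequency. This numbering scheme implicitly states that Saturday is greater than Monday (by pure numerical values).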

In [0]:
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer()\
  .setInputCol("day_of_week")\
  .setOutputCol("day_of_week_index")
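
The following plain-Python sketch imitates the indexing step: it learns a label-to-index mapping in which the most frequent label gets index 0. `fit_string_indexer` and the sample `days` list are hypothetical, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter

def fit_string_indexer(labels):
    """Learn a label -> index mapping, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency; break ties alphabetically (sketch assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(index) for index, label in enumerate(ordered)}

# Hypothetical day_of_week values.
days = ["Thursday", "Monday", "Thursday", "Saturday", "Thursday", "Monday"]
mapping = fit_string_indexer(days)
indexed = [mapping[d] for d in days]
print(mapping)
print(indexed)
```

The resulting numbers carry no meaning beyond identity: Thursday is index 0 only because it occurs most often in this sample.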


That implied ordering is incorrect: no day of the week is greater than another. To fix this, use a OneHotEncoder to encode each of these values as its own column. These Boolean flags state whether that day of the week is the relevant day of the week.

In [0]:
from pyspark.ml.feature import OneHotEncoder
encoder = OneHotEncoder()\
  .setInputCol("day_of_week_index")\
  .setOutputCol("day_of_week_encoded")
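
A minimal plain-Python sketch of one-hot encoding follows. `one_hot` is a hypothetical helper that returns a dense list, whereas Spark produces a sparse vector column. The `drop_last` flag mirrors the encoder's `dropLast` option, which defaults to true in Spark, so the last category is represented by an all-zero vector.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a list of 0.0/1.0 flags with at most one flag set."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0  # flag the position belonging to this category
    return vector

# Three categories, e.g. indices produced by an indexing step.
print(one_hot(0, 3))                   # first category
print(one_hot(2, 3))                   # last category, dropped: all zeros
print(one_hot(2, 3, drop_last=False))  # keep one flag per category
```

No position in the output is numerically larger than another, so the model cannot read an ordering into the days.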


Each of these will result in a set of columns that we will “assemble” into a vector. All machine learning algorithms in Spark take as input a Vector type, which must be a set of numerical values. Here we use three key features: the price, the quantity, and the day of week.

In [0]:
from pyspark.ml.feature import VectorAssembler

vectorAssembler = VectorAssembler()\
  .setInputCols(["UnitPrice", "Quantity", "day_of_week_encoded"])\
  .setOutputCol("features")
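
Conceptually, assembling means concatenating scalar columns and vector columns into one flat numeric vector per row. The sketch below shows that idea in plain Python; `assemble` and the sample row are hypothetical, and the six-element encoded day assumes seven day categories with the last one dropped.

```python
def assemble(row, input_cols):
    """Concatenate the named columns of a row into one flat list of floats."""
    features = []
    for name in input_cols:
        value = row[name]
        if isinstance(value, (list, tuple)):
            features.extend(float(v) for v in value)  # vector column
        else:
            features.append(float(value))             # scalar column
    return features

# Hypothetical row after the indexing and encoding steps.
row = {
    "UnitPrice": 2.55,
    "Quantity": 6,
    "day_of_week_encoded": [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
}
features = assemble(row, ["UnitPrice", "Quantity", "day_of_week_encoded"])
print(features)
```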


Set up a pipeline so that future data we need to transform can go through the exact same process:

In [0]:
from pyspark.ml import Pipeline

transformationPipeline = Pipeline()\
  .setStages([indexer, encoder, vectorAssembler])


Fit our transformers to this dataset. StringIndexer needs to know how many unique values there are to be indexed. Spark must look at all the distinct values in the column to be indexed in order to store those values later on.

In [0]:
fittedPipeline = transformationPipeline.fit(trainDataFrame)


Take that fitted pipeline and use it to transform our data in a consistent and repeatable way:

In [0]:
transformedTraining = fittedPipeline.transform(trainDataFrame)


Next we perform some hyperparameter tuning on the model. Because tuning fits the model repeatedly, we do not want to repeat the exact same transformations over and over again. **Caching** - puts a copy of the intermediately transformed dataset into memory. This allows us to access it repeatedly at much lower cost than running the entire pipeline again. Define the model, then cache the transformed data before training (run the training without caching as well to compare):

In [0]:

from pyspark.ml.clustering import KMeans
kmeans = KMeans().setK(20).setSeed(1)


In [0]:
transformedTraining.cache()
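
The benefit of caching can be sketched with a toy lazy dataset in plain Python. `LazyDataset` is a hypothetical stand-in, not a Spark class: every `collect` reruns the transformations unless the dataset was marked with `cache`, in which case the first result is kept in memory and reused.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not a Spark class)."""

    def __init__(self, compute):
        self._compute = compute  # function that rebuilds the data
        self._cached = None
        self._use_cache = False
        self.runs = 0            # how many times the transformations ran

    def cache(self):
        # Marking for caching does no work by itself; the first action fills it.
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.runs += 1
        data = self._compute()
        if self._use_cache:
            self._cached = data
        return data

def expensive_transform():
    return [x * x for x in range(5)]

uncached = LazyDataset(expensive_transform)
cached = LazyDataset(expensive_transform).cache()
for _ in range(3):  # e.g. three training passes over the same data
    uncached.collect()
    cached.collect()
print(uncached.runs, cached.runs)
```

An iterative algorithm that passes over the data many times pays the transformation cost on every pass without caching, and only once with it.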

Train the model:

In [0]:
kmModel = kmeans.fit(transformedTraining)


We could now compute the cost according to some success metric on our training set. To do the same for the test set, first run the test data through the same fitted pipeline:

In [0]:
transformedTest = fittedPipeline.transform(testDataFrame)
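
One common cost for k-means is the sum of squared distances from each point to its nearest cluster center. The notebook does not say which metric it has in mind, so treat the following plain-Python sketch as one possible choice. The centers and points are hypothetical two-dimensional values.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def clustering_cost(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)

# Hypothetical cluster centers and feature vectors.
centers = [[0.0, 0.0], [10.0, 10.0]]
train_points = [[0.0, 1.0], [1.0, 0.0], [10.0, 9.0]]
test_points = [[2.0, 2.0], [9.0, 9.0]]

train_cost = clustering_cost(train_points, centers)
test_cost = clustering_cost(test_points, centers)
print(train_cost, test_cost)
```

A test cost far above the training cost would suggest that the clusters do not generalize well to the later half of the data.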


Use RDDs to parallelize raw data that you have stored in memory on the driver machine. Parallelize some simple numbers and create a DataFrame after:

In [0]:
from pyspark.sql import Row

spark.sparkContext.parallelize([Row(1), Row(2), Row(3)]).toDF()
