In [0]:
df = spark.read.format("csv")\
  .option("header", "true")\
  .option("inferSchema", "true")\
  .load("/FileStore/tables/online_retail_dataset.csv")\
  .coalesce(5)
df.cache()
df.createOrReplaceTempView("dfTable")


In [0]:
display(df.head(5))

InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850,United Kingdom
536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850,United Kingdom
536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850,United Kingdom


In [0]:
from pyspark.sql.functions import count
df.select(count("StockCode")).show() # 541909


In [0]:
from pyspark.sql.functions import countDistinct
df.select(countDistinct("StockCode")).show() # 4070


For a large data set, approximate distinct count is helpful:

In [0]:
from pyspark.sql.functions import approx_count_distinct
df.select(approx_count_distinct("StockCode", 0.1)).show() # 3364


In [0]:
from pyspark.sql.functions import first, last
df.select(first("StockCode"), last("StockCode")).show()


In [0]:
from pyspark.sql.functions import min, max
df.select(min("Quantity"), max("Quantity")).show()


In [0]:
from pyspark.sql.functions import sum
df.select(sum("Quantity")).show() # 5176450


Summing just the distinct values:

In [0]:
from pyspark.sql.functions import sumDistinct
df.select(sumDistinct("Quantity")).show() # 29310


Aliasing an average function to reuse columns later:

In [0]:
from pyspark.sql.functions import sum, count, avg, expr

df.select(
    count("Quantity").alias("total_transactions"),
    sum("Quantity").alias("total_purchases"),
    avg("Quantity").alias("avg_purchases"),
    expr("mean(Quantity)").alias("mean_purchases"))\
  .selectExpr(
    "total_purchases/total_transactions",
    "avg_purchases",
    "mean_purchases").show()


Spark has formulas for both the sample standard deviation and the population standard deviation. These are fundamentally different statistical formulas. By default, Spark uses the sample formula if you call the variance or stddev functions.

To refer explicitly to the population or sample standard deviation or variance:

In [0]:
from pyspark.sql.functions import var_pop, stddev_pop
from pyspark.sql.functions import var_samp, stddev_samp
df.select(var_pop("Quantity"), var_samp("Quantity"),
  stddev_pop("Quantity"), stddev_samp("Quantity")).show()
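
As a plain-Python sanity check of the difference (a minimal sketch using the standard library `statistics` module rather than Spark; the values in `quantities` are made up for illustration), the population formulas divide by n while the sample formulas divide by n - 1:

```python
import statistics

# Hypothetical toy data, not taken from the retail dataset.
quantities = [2, 4, 4, 4, 5, 5, 7, 9]

var_pop = statistics.pvariance(quantities)   # population: divides by n
var_samp = statistics.variance(quantities)   # sample: divides by n - 1
std_pop = statistics.pstdev(quantities)      # square root of var_pop
std_samp = statistics.stdev(quantities)      # square root of var_samp

print(var_pop, var_samp, std_pop, std_samp)
```

The sample figures are always slightly larger than the population ones, and the gap shrinks as the number of rows grows.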


Skewness - measures the asymmetry of the values in your data around the mean

Kurtosis - measures the weight of the tails of your data


Both describe the extremities of your data. Both are relevant specifically when modeling your data as a probability distribution of a random variable.

In [0]:
from pyspark.sql.functions import skewness, kurtosis
df.select(skewness("Quantity"), kurtosis("Quantity")).show()
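
To make the two measures concrete, here is a minimal plain-Python sketch. It assumes the population-moment definitions (third and fourth central moments scaled by powers of the variance, with 3 subtracted for excess kurtosis), which is my understanding of what Spark's functions compute; the helper names and sample lists are hypothetical.

```python
def central_moment(values, k):
    """k-th central moment, dividing by n (population form)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)

def skewness(values):
    """Asymmetry around the mean: 0 for perfectly symmetric data."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Tail weight relative to a normal distribution (which scores 0)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]        # made-up data
right_tailed = [1, 1, 1, 1, 10]    # one large outlier on the right

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed), excess_kurtosis(right_tailed))
```

The symmetric list has zero skewness, while the single large value in the second list pulls the skewness positive.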


**covariance and correlation**

Correlation measures the Pearson correlation coefficient, which is scaled between –1 and +1. 

Covariance is scaled according to the inputs in the data - it is a measure of the joint variability of two random variables. Like the var function, covariance can be calculated either as the sample covariance or the population covariance. Correlation has no notion of this and therefore does not have separate calculations for population or sample.

In [0]:
from pyspark.sql.functions import corr, covar_pop, covar_samp
df.select(corr("InvoiceNo", "Quantity"), covar_samp("InvoiceNo", "Quantity"),
    covar_pop("InvoiceNo", "Quantity")).show()
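
The scaling difference can be seen in a small plain-Python sketch (no Spark involved; the function names and the data are hypothetical). Multiplying one input by 100 multiplies the covariance by 100 but leaves the Pearson correlation unchanged:

```python
import math

def covariance(xs, ys, sample=True):
    """Sample covariance divides by n - 1, population covariance by n."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)

def pearson_corr(xs, ys):
    """Pearson correlation: covariance normalised by both spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

quantity = [1, 2, 3, 4]                 # made-up values
price = [2, 4, 6, 8]                    # made-up values
pence = [p * 100 for p in price]        # same prices in a smaller unit

print(covariance(quantity, price, sample=False), covariance(quantity, price))
print(covariance(quantity, pence, sample=False))
print(pearson_corr(quantity, price), pearson_corr(quantity, pence))
```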


Spark can perform aggregations not just on numerical values using formulas, but also on complex types. For example, you can collect a list of the values present in a given column, or only the unique values by collecting to a set. You can use this to define programmatic access later on in the pipeline or pass the entire collection into a user-defined function (UDF).

In [0]:
from pyspark.sql.functions import collect_set, collect_list
df.agg(collect_set("Country"), collect_list("Country")).show()


Group by each unique invoice number and get the count of items on that invoice - this returns another DataFrame and is lazily performed. 

We do this grouping in two phases:
* First we specify the column(s) on which we would like to group - returns a RelationalGroupedDataset (a GroupedData object in PySpark)
* Then we specify the aggregation(s) - returns a DataFrame.

In [0]:
df.groupBy("InvoiceNo","CustomerId").count().show()

Rather than passing that function as an expression into a select statement, we specify it within agg. This makes it possible to pass in arbitrary expressions that have an aggregation specified. You can even do things like alias a column after transforming it for later use in your data flow:

In [0]:
from pyspark.sql.functions import count, expr

# Keep the DataFrame itself: chaining .show() here would assign None to test
test = df.groupBy("InvoiceNo").agg(
    count("Quantity").alias("quan"),
    expr("count(Quantity)"))

display(test)


You can also specify your transformations as a map (a dict in PySpark) for which the key is the column and the value is the aggregation function (as a string) that you would like to perform. You can reuse multiple column names if you specify them inline, as the cell below does with expr:

In [0]:
df.groupBy("InvoiceNo").agg(expr("avg(Quantity)"),expr("stddev_pop(Quantity)"))\
  .show()
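
In PySpark the map form is a Python dict passed to `agg`, for example `agg({"Quantity": "avg"})`. One caveat is worth sketching in plain Python (no Spark required; the variable name is hypothetical): dict keys are unique, so a dict literal cannot carry two aggregations for the same column. That is one reason to prefer the inline `expr` calls used in the cell above when a column appears more than once.

```python
# A dict literal that repeats a key silently keeps only the last value,
# so the "avg" request for Quantity is lost before Spark ever sees it.
aggregations = {"Quantity": "avg", "Quantity": "stddev_pop"}

print(aggregations)  # {'Quantity': 'stddev_pop'}
```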


Add a date column that converts our invoice date into a column containing only the date, without the time of day:

In [0]:
from pyspark.sql.functions import col, to_date
dfWithDate = df.withColumn("date", to_date(col("InvoiceDate"), "MM/d/yyyy H:mm"))
dfWithDate.createOrReplaceTempView("dfWithDate")


The first step to a window function is to create a window specification. Note that partitionBy here is unrelated to the partitioning scheme concept that we have covered thus far; it only describes how the rows are split into groups. The ordering determines the ordering within a given partition, and, finally, the frame specification (the rowsBetween statement) states which rows will be included in the frame based on its reference to the current input row. In the following example, we look at all previous rows up to the current row:

In [0]:
from pyspark.sql.window import Window
from pyspark.sql.functions import desc
windowSpec = Window\
  .partitionBy("CustomerId", "date")\
  .orderBy(desc("Quantity"))\
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
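
To see what this specification means row by row, here is a plain-Python sketch of the same idea (not Spark code; the rows and helper names are made up). Rows are split into partitions by (CustomerId, date), ordered by Quantity descending, and an aggregate is computed over the frame running from the start of the partition to the current row:

```python
from itertools import accumulate, groupby

# Hypothetical rows: (CustomerId, date, Quantity)
rows = [
    (12346, "2011-01-18", 5),
    (12346, "2011-01-18", 20),
    (12347, "2010-12-07", 12),
    (12347, "2010-12-07", 36),
    (12347, "2010-12-07", 4),
]

def partition_key(row):
    """Mirror of partitionBy("CustomerId", "date")."""
    return (row[0], row[1])

# Sort by partition, then by Quantity descending (the orderBy clause).
ordered = sorted(rows, key=lambda r: (partition_key(r), -r[2]))

result = []
for key, group in groupby(ordered, key=partition_key):
    quantities = [r[2] for r in group]
    # Frame = unbounded preceding .. current row, so this is a running max.
    running_max = list(accumulate(quantities, max))
    result.extend((key, q, m) for q, m in zip(quantities, running_max))

for line in result:
    print(line)
```

Because the partition is ordered by descending quantity, the running maximum equals the first (largest) quantity on every row of the partition.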




Use an aggregation function to learn more about each specific customer, establishing the maximum purchase quantity over all time. Use the same aggregation functions that we saw earlier, by passing a column name or expression, indicating the window specification that defines to which frames of data this function will apply:

In [0]:
from pyspark.sql.functions import max
maxPurchaseQuantity = max(col("Quantity")).over(windowSpec)

# print(maxPurchaseQuantity) shows the unevaluated window expression (a Column), not data

This returns a column (or expressions). We can now use this in a DataFrame select statement. Before doing so, though, we will create the purchase quantity rank. To do that, we use the dense_rank function to determine which date had the maximum purchase quantity for every customer. We use dense_rank as opposed to rank to avoid gaps in the ranking sequence when there are tied values (or in our case, duplicate rows):

In [0]:
from pyspark.sql.functions import dense_rank, rank
purchaseDenseRank = dense_rank().over(windowSpec)
purchaseRank = rank().over(windowSpec)
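
The difference between the two ranking functions can be sketched in plain Python (an illustration only, with a hypothetical helper and made-up values; this is not how Spark implements them). With ties, rank leaves gaps after the tied rows while dense_rank does not:

```python
def rank_and_dense_rank(sorted_values):
    """Return (rank, dense_rank) lists for values already in window order."""
    ranks, dense_ranks = [], []
    for i, value in enumerate(sorted_values):
        if i > 0 and value == sorted_values[i - 1]:
            # Tied with the previous row: both functions repeat the rank.
            ranks.append(ranks[-1])
            dense_ranks.append(dense_ranks[-1])
        else:
            # rank jumps to the row position; dense_rank advances by one.
            ranks.append(i + 1)
            dense_ranks.append(dense_ranks[-1] + 1 if dense_ranks else 1)
    return ranks, dense_ranks

quantities_desc = [10, 10, 8, 5, 5, 3]   # made-up, sorted descending
ranks, dense_ranks = rank_and_dense_rank(quantities_desc)
print(ranks)        # [1, 1, 3, 4, 4, 6]
print(dense_ranks)  # [1, 1, 2, 3, 3, 4]
```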


Perform a select to view the calculated window values:

In [0]:
from pyspark.sql.functions import col

# EDIT: NOT IN TEXTBOOK - must set spark.sql.legacy.timeParserPolicy to LEGACY 
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")

dfWithDate.where("CustomerId IS NOT NULL").orderBy("CustomerId")\
  .select(
    col("CustomerId"),
    col("date"),
    col("Quantity"),
    purchaseRank.alias("quantityRank"),
    purchaseDenseRank.alias("quantityDenseRank"),
    maxPurchaseQuantity.alias("maxPurchaseQuantity")).show()





In [0]:
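# na.drop() removes rows that contain nulls; a bare drop() removes no columns and keeps the nulls
dfNoNull = dfWithDate.na.drop()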
dfNoNull.createOrReplaceTempView("dfNoNull")


Rollup - a multidimensional aggregation that performs a variety of group-by style calculations for us.

Create a rollup that looks across time (with our new date column) and space (with the Country column). It produces a new DataFrame that includes the grand total over all dates, the total for each date in the DataFrame, and the subtotal for each country on each date in the DataFrame:

In [0]:
rolledUpDF = dfNoNull.rollup("Date", "Country").agg(sum("Quantity"))\
  .selectExpr("Date", "Country", "`sum(Quantity)` as total_quantity")\
  .orderBy("Date")
rolledUpDF.show()


Where you see the null values is where you’ll find the grand totals. A null in both rollup columns specifies the grand total across both of those columns:

In [0]:
# rolledUpDF.where("Country IS NULL").show()
rolledUpDF.where("Date IS NULL AND Country IS NULL").show()

Cube - rather than treating elements hierarchically, a cube performs the same aggregation across all dimensions.

This means that it won’t just go by date over the entire time period, but also by country. We can make a table that includes the following:
* The total across all dates and countries
* The total for each date across all countries
* The total for each country on each date
* The total for each country across all dates

In [0]:
from pyspark.sql.functions import sum

dfNoNull.cube("Date", "Country").agg(sum(col("Quantity")))\
  .select("Date", "Country", "sum(Quantity)").orderBy("Date").show()
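
One way to compare rollup and cube is to list the combinations of grouping columns each one aggregates over. The plain-Python sketch below (hypothetical helper names; it only enumerates the combinations and does no aggregation) shows that a rollup keeps the hierarchical prefixes, while a cube keeps every subset, which adds the per-country totals across all dates:

```python
from itertools import combinations

def rollup_grouping_sets(columns):
    """Hierarchical prefixes only: (a, b), (a,), ()."""
    return [tuple(columns[:i]) for i in range(len(columns), -1, -1)]

def cube_grouping_sets(columns):
    """Every subset of the columns, largest first."""
    return [combo
            for size in range(len(columns), -1, -1)
            for combo in combinations(columns, size)]

print(rollup_grouping_sets(["Date", "Country"]))
# [('Date', 'Country'), ('Date',), ()]
print(cube_grouping_sets(["Date", "Country"]))
# [('Date', 'Country'), ('Date',), ('Country',), ()]
```

The empty tuple corresponds to the grand total, which is the row where both columns are null in the output.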


**Pivots** make it possible for you to convert a row into a column. With a pivot, we can aggregate according to some function for each of those given rows and display them in an easy-to-query way:

In [0]:
pivoted = dfWithDate.groupBy("date").pivot("Country").sum()
display(pivoted)

date,Australia_sum(CAST(Quantity AS BIGINT)),Australia_sum(UnitPrice),Australia_sum(CAST(CustomerID AS BIGINT)),Austria_sum(CAST(Quantity AS BIGINT)),Austria_sum(UnitPrice),Austria_sum(CAST(CustomerID AS BIGINT)),Bahrain_sum(CAST(Quantity AS BIGINT)),Bahrain_sum(UnitPrice),Bahrain_sum(CAST(CustomerID AS BIGINT)),Belgium_sum(CAST(Quantity AS BIGINT)),Belgium_sum(UnitPrice),Belgium_sum(CAST(CustomerID AS BIGINT)),Brazil_sum(CAST(Quantity AS BIGINT)),Brazil_sum(UnitPrice),Brazil_sum(CAST(CustomerID AS BIGINT)),Canada_sum(CAST(Quantity AS BIGINT)),Canada_sum(UnitPrice),Canada_sum(CAST(CustomerID AS BIGINT)),Channel Islands_sum(CAST(Quantity AS BIGINT)),Channel Islands_sum(UnitPrice),Channel Islands_sum(CAST(CustomerID AS BIGINT)),Cyprus_sum(CAST(Quantity AS BIGINT)),Cyprus_sum(UnitPrice),Cyprus_sum(CAST(CustomerID AS BIGINT)),Czech Republic_sum(CAST(Quantity AS BIGINT)),Czech Republic_sum(UnitPrice),Czech Republic_sum(CAST(CustomerID AS BIGINT)),Denmark_sum(CAST(Quantity AS BIGINT)),Denmark_sum(UnitPrice),Denmark_sum(CAST(CustomerID AS BIGINT)),EIRE_sum(CAST(Quantity AS BIGINT)),EIRE_sum(UnitPrice),EIRE_sum(CAST(CustomerID AS BIGINT)),European Community_sum(CAST(Quantity AS BIGINT)),European Community_sum(UnitPrice),European Community_sum(CAST(CustomerID AS BIGINT)),Finland_sum(CAST(Quantity AS BIGINT)),Finland_sum(UnitPrice),Finland_sum(CAST(CustomerID AS BIGINT)),France_sum(CAST(Quantity AS BIGINT)),France_sum(UnitPrice),France_sum(CAST(CustomerID AS BIGINT)),Germany_sum(CAST(Quantity AS BIGINT)),Germany_sum(UnitPrice),Germany_sum(CAST(CustomerID AS BIGINT)),Greece_sum(CAST(Quantity AS BIGINT)),Greece_sum(UnitPrice),Greece_sum(CAST(CustomerID AS BIGINT)),Hong Kong_sum(CAST(Quantity AS BIGINT)),Hong Kong_sum(UnitPrice),Hong Kong_sum(CAST(CustomerID AS BIGINT)),Iceland_sum(CAST(Quantity AS BIGINT)),Iceland_sum(UnitPrice),Iceland_sum(CAST(CustomerID AS BIGINT)),Israel_sum(CAST(Quantity AS BIGINT)),Israel_sum(UnitPrice),Israel_sum(CAST(CustomerID AS 
BIGINT)),Italy_sum(CAST(Quantity AS BIGINT)),Italy_sum(UnitPrice),Italy_sum(CAST(CustomerID AS BIGINT)),Japan_sum(CAST(Quantity AS BIGINT)),Japan_sum(UnitPrice),Japan_sum(CAST(CustomerID AS BIGINT)),Lebanon_sum(CAST(Quantity AS BIGINT)),Lebanon_sum(UnitPrice),Lebanon_sum(CAST(CustomerID AS BIGINT)),Lithuania_sum(CAST(Quantity AS BIGINT)),Lithuania_sum(UnitPrice),Lithuania_sum(CAST(CustomerID AS BIGINT)),Malta_sum(CAST(Quantity AS BIGINT)),Malta_sum(UnitPrice),Malta_sum(CAST(CustomerID AS BIGINT)),Netherlands_sum(CAST(Quantity AS BIGINT)),Netherlands_sum(UnitPrice),Netherlands_sum(CAST(CustomerID AS BIGINT)),Norway_sum(CAST(Quantity AS BIGINT)),Norway_sum(UnitPrice),Norway_sum(CAST(CustomerID AS BIGINT)),Poland_sum(CAST(Quantity AS BIGINT)),Poland_sum(UnitPrice),Poland_sum(CAST(CustomerID AS BIGINT)),Portugal_sum(CAST(Quantity AS BIGINT)),Portugal_sum(UnitPrice),Portugal_sum(CAST(CustomerID AS BIGINT)),RSA_sum(CAST(Quantity AS BIGINT)),RSA_sum(UnitPrice),RSA_sum(CAST(CustomerID AS BIGINT)),Saudi Arabia_sum(CAST(Quantity AS BIGINT)),Saudi Arabia_sum(UnitPrice),Saudi Arabia_sum(CAST(CustomerID AS BIGINT)),Singapore_sum(CAST(Quantity AS BIGINT)),Singapore_sum(UnitPrice),Singapore_sum(CAST(CustomerID AS BIGINT)),Spain_sum(CAST(Quantity AS BIGINT)),Spain_sum(UnitPrice),Spain_sum(CAST(CustomerID AS BIGINT)),Sweden_sum(CAST(Quantity AS BIGINT)),Sweden_sum(UnitPrice),Sweden_sum(CAST(CustomerID AS BIGINT)),Switzerland_sum(CAST(Quantity AS BIGINT)),Switzerland_sum(UnitPrice),Switzerland_sum(CAST(CustomerID AS BIGINT)),USA_sum(CAST(Quantity AS BIGINT)),USA_sum(UnitPrice),USA_sum(CAST(CustomerID AS BIGINT)),United Arab Emirates_sum(CAST(Quantity AS BIGINT)),United Arab Emirates_sum(UnitPrice),United Arab Emirates_sum(CAST(CustomerID AS BIGINT)),United Kingdom_sum(CAST(Quantity AS BIGINT)),United Kingdom_sum(UnitPrice),United Kingdom_sum(CAST(CustomerID AS BIGINT)),Unspecified_sum(CAST(Quantity AS BIGINT)),Unspecified_sum(UnitPrice),Unspecified_sum(CAST(CustomerID AS BIGINT))
2011-05-06,,,,42.0,58.95,74484.0,,,,182.0,63.32,260274.0,,,,,,,,,,,,,,,,,,,1694.0,75.37,283120.0,,,,,,,,,,222.0,12.64,126857.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,88.0,113.1,187530.0,,,,,,,,,,,,,17404,6952.629999999985,18722445,,,
2011-01-30,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3367,2321.720000000002,11334037,,,
2011-10-07,,,,,,,,,,,,,,,,,,,,,,345.0,136.51000000000002,360180.0,325.0,49.38,127810.0,637.0,27.69,74364.0,448.0,225.1,760461.0,,,,,,,527.0,88.43999999999998,483504.0,2053.0,327.79999999999995,1150698.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,212.0,25.35,140558.0,,,,,,,,,,,,,,,,,,,227.0,123.08,275880.0,,,,,,,,,,,,,25657,12425.409999999976,27566889,,,
2011-01-23,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,126.0,60.4,152724.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5068,2551.560000000002,13409810,,,
2011-07-18,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1377.0,146.97999999999988,999037.0,,,,,,,921.0,132.25000000000003,533282.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2160.0,230.46,955800.0,,,,,,,,,,,,,,,,9908,26946.80000000045,16266983,,,
2011-07-07,,,,,,,,,,143.0,96.71000000000004,247257.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,550.0,175.46000000000006,763866.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,216.0,139.00000000000003,338514.0,,,,,,,,,,,,,18047,6223.269999999994,18629372,,,
2011-08-21,,,,,,,,,,,,,,,,,,,800.0,7.16,58264.0,,,,,,,,,,148.0,115.71,238576.0,,,,,,,270.0,115.96000000000002,376350.0,187.0,51.17,225324.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,25.6,68388.0,,,,,,,,,,,,,6758,3010.6999999999944,15639107,,,
2011-11-18,,,,,,,,,,315.0,87.74000000000001,210460.0,,,,,,,,,,300.0,245.99000000000004,1239100.0,-40.0,6.94,38343.0,,,,263.0,230.84999999999985,819350.0,,,,,,,550.0,167.77,593838.0,491.0,128.18000000000004,274863.0,,,,,,,,,,,,,544.0,232.32,692230.0,,,,,,,,,,,,,34.0,19.25,29292.0,242.0,186.04,435540.0,-8.0,8.190000000000001,38337.0,,,,,,,,,,,,,,,,-3.0,2.89,12697.0,669.0,78.28999999999998,547844.0,,,,,,,20178,10676.19999999988,29578557,,,
2010-12-15,,,,-48.0,0.42,12865.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,134.0,117.72,842659.0,-12.0,31.3,75882.0,,,,,,,,,,-56.0,8.06,25332.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,18211,4661.809999999966,19690139,,,
2010-12-01,107.0,73.9,174034.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,243.0,133.64,313131.0,,,,,,,449.0,55.29,251660.0,117.0,93.82000000000002,364538.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,97.0,16.85,25582.0,1852.0,102.67,907609.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,23949,12428.080000000024,28785059,,,
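
The shape of this output can be reproduced on a tiny scale in plain Python (a sketch with made-up rows and a single summed column; Spark sums every numeric column, which is why each country appears three times above). Each distinct country becomes its own column, and dates with no rows for a country get an empty cell:

```python
from collections import defaultdict

# Hypothetical rows: (date, Country, Quantity)
rows = [
    ("2010-12-01", "France", 5),
    ("2010-12-01", "EIRE", 3),
    ("2010-12-01", "France", 2),
    ("2010-12-15", "EIRE", 7),
]

# Sum Quantity per (date, country) pair.
totals = defaultdict(int)
for date, country, quantity in rows:
    totals[(date, country)] += quantity

# One output column per distinct country, one output row per date.
countries = sorted({country for _, country, _ in rows})
dates = sorted({date for date, _, _ in rows})
header = ["date"] + [f"{country}_sum(Quantity)" for country in countries]
table = [[date] + [totals.get((date, country)) for country in countries]
         for date in dates]

print(header)
for line in table:
    print(line)
```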
