## This notebook is part of Hadoop and Spark training delivered by IT-DB group
### SPARK DataFrame1 Hands-On Lab
_by Prasanth Kothuri_

### Hands-On 1 - Construct a DataFrame from a CSV file
*This demonstrates how to read a CSV file and construct a DataFrame from it*

#### Read the CSV file into a DataFrame

In [9]:
df = spark.read\
        .option("header", "true")\
        .option("inferSchema", "true")\
        .csv("/tmp/online-retail-dataset.csv")

#### Inspect the data

In [10]:
df.show(2,False)

+---------+---------+----------------------------------+--------+--------------+---------+----------+--------------+
|InvoiceNo|StockCode|Description                       |Quantity|InvoiceDate   |UnitPrice|CustomerID|Country       |
+---------+---------+----------------------------------+--------+--------------+---------+----------+--------------+
|536365   |85123A   |WHITE HANGING HEART T-LIGHT HOLDER|6       |12/1/2010 8:26|2.55     |17850     |United Kingdom|
|536365   |71053    |WHITE METAL LANTERN               |6       |12/1/2010 8:26|3.39     |17850     |United Kingdom|
+---------+---------+----------------------------------+--------+--------------+---------+----------+--------------+
only showing top 2 rows



### Hands-On 2 - Spark Transformations - select, add, rename and drop columns

#### Select DataFrame columns

In [20]:
# select single column
from pyspark.sql.functions import col
df.select(col("Country")).show(2)
# select multiple columns
df.select("StockCode","Description").show(2,False)

+--------------+
|       Country|
+--------------+
|United Kingdom|
|United Kingdom|
+--------------+
only showing top 2 rows

+---------+----------------------------------+
|StockCode|Description                       |
+---------+----------------------------------+
|85123A   |WHITE HANGING HEART T-LIGHT HOLDER|
|71053    |WHITE METAL LANTERN               |
+---------+----------------------------------+
only showing top 2 rows



In [21]:
# selects all the original columns and adds a new column that specifies high value item
df.selectExpr(
  "*", # all original columns
  "(UnitPrice > 100) as HighValueItem")\
  .show(2)

+---------+---------+--------------------+--------+--------------+---------+----------+--------------+-------------+
|InvoiceNo|StockCode|         Description|Quantity|   InvoiceDate|UnitPrice|CustomerID|       Country|HighValueItem|
+---------+---------+--------------------+--------+--------------+---------+----------+--------------+-------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|12/1/2010 8:26|     2.55|     17850|United Kingdom|        false|
|   536365|    71053| WHITE METAL LANTERN|       6|12/1/2010 8:26|     3.39|     17850|United Kingdom|        false|
+---------+---------+--------------------+--------+--------------+---------+----------+--------------+-------------+
only showing top 2 rows



#### Adding, renaming and dropping columns

In [36]:
# add a new column called InvoiceValue
from pyspark.sql.functions import expr
df_1 = df.withColumn("InvoiceValue", expr("UnitPrice * Quantity"))\
    .select("InvoiceNo","Description","InvoiceValue")
df_1.show(2)

# rename InvoiceValue to LineTotal
df_2 = df_1.withColumnRenamed("InvoiceValue","LineTotal")
df_2.show(2)

# drop a column
df_2.drop("LineTotal").show(2)

+---------+--------------------+------------------+
|InvoiceNo|         Description|      InvoiceValue|
+---------+--------------------+------------------+
|   536365|WHITE HANGING HEA...|15.299999999999999|
|   536365| WHITE METAL LANTERN|             20.34|
+---------+--------------------+------------------+
only showing top 2 rows

+---------+--------------------+------------------+
|InvoiceNo|         Description|         LineTotal|
+---------+--------------------+------------------+
|   536365|WHITE HANGING HEA...|15.299999999999999|
|   536365| WHITE METAL LANTERN|             20.34|
+---------+--------------------+------------------+
only showing top 2 rows

+---------+--------------------+
|InvoiceNo|         Description|
+---------+--------------------+
|   536365|WHITE HANGING HEA...|
|   536365| WHITE METAL LANTERN|
+---------+--------------------+
only showing top 2 rows
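
The `15.299999999999999` in the output above is ordinary binary floating-point rounding, not a Spark error: `2.55` has no exact double representation, so `2.55 * 6` lands just below `15.3`. The following is a minimal plain-Python sketch (no Spark required) that reproduces the effect and shows two common remedies. The helper name `line_total` is hypothetical and only used for illustration.

```python
from decimal import Decimal

# Same arithmetic as the first row of the output: UnitPrice 2.55 * Quantity 6
as_double = 2.55 * 6
print(as_double)  # slightly below 15.3 because 2.55 is not exact in binary

# Remedy 1: round the double to two decimal places for presentation
rounded = round(as_double, 2)
print(rounded)


# Remedy 2: use exact decimal arithmetic (hypothetical helper)
def line_total(unit_price: str, quantity: int) -> Decimal:
    """Multiply a price given as a decimal string by an integer quantity."""
    return Decimal(unit_price) * quantity


exact = line_total("2.55", 6)
print(exact)
```

In Spark, the analogous options would be `round` from `pyspark.sql.functions` or casting the column to a decimal type; which one fits depends on whether the value is only displayed or used in further arithmetic.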



### Hands-On 3 - Spark Transformations - filter, sort and cast

In [46]:
# select invoice lines with quantity > 100 and unitprice > 20
df.where(col("Quantity") > 100).where(col("UnitPrice") > 20).show(2)

+---------+---------+--------------------+--------+----------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|     InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+----------------+---------+----------+--------------+
|   558777|    22501|PICNIC BASKET WIC...|     125|  7/4/2011 10:23|    20.79|      null|United Kingdom|
|   572209|    23485|BOTANICAL GARDENS...|     120|10/21/2011 12:08|     20.8|     18102|United Kingdom|
+---------+---------+--------------------+--------+----------------+---------+----------+--------------+



In [49]:
# select invoice lines with quantity > 100 or unitprice > 20
df.where((col("Quantity") > 100) | (col("UnitPrice") > 20)).show(2)

+---------+---------+--------------------+--------+--------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|   InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+--------------+---------+----------+--------------+
|   536378|    21212|PACK OF 72 RETROS...|     120|12/1/2010 9:37|     0.42|     14688|United Kingdom|
|  C536379|        D|            Discount|      -1|12/1/2010 9:41|     27.5|     14527|United Kingdom|
+---------+---------+--------------------+--------+--------------+---------+----------+--------------+
only showing top 2 rows
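
The parentheses around each comparison in the `|` filter are required, because Python's `|` operator binds more tightly than `>`. The sketch below uses plain integers (hypothetical row values, not taken from the dataset) to show how the unparenthesized form is parsed. With Spark `Column` objects the unparenthesized form would most likely raise an error rather than return a wrong answer, but the parsing rule is the same.

```python
# Hypothetical row values, for illustration only
quantity, unit_price = 5, 30

# With parentheses: two separate comparisons combined with OR
with_parens = (quantity > 100) | (unit_price > 20)

# Without parentheses Python reads this as the chained comparison
#   quantity > (100 | unit_price) > 20
without_parens = quantity > 100 | unit_price > 20

print(with_parens)     # the intended "either condition holds" result
print(without_parens)  # a different result, because of operator precedence
```

The same applies to `&`. Chaining two `.where()` calls, as in the earlier cell, is an alternative way to express AND without needing the operator at all.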



In [54]:
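from pyspark.sql.functions import desc, asc
# caution: expr("UnitPrice desc") appears to parse as UnitPrice aliased to "desc" and
# sorts ascending (see the output); use desc("UnitPrice") for a descending sort
df.orderBy(expr("UnitPrice desc")).show(2)
# sort by Quantity descending, then UnitPrice ascending
df.orderBy(col("Quantity").desc(), col("UnitPrice").asc()).show(20)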

+---------+---------+---------------+--------+---------------+---------+----------+--------------+
|InvoiceNo|StockCode|    Description|Quantity|    InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+---------------+--------+---------------+---------+----------+--------------+
|  A563187|        B|Adjust bad debt|       1|8/12/2011 14:52|-11062.06|      null|United Kingdom|
|  A563186|        B|Adjust bad debt|       1|8/12/2011 14:51|-11062.06|      null|United Kingdom|
+---------+---------+---------------+--------+---------------+---------+----------+--------------+
only showing top 2 rows

+---------+---------+--------------------+--------+----------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|     InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+----------------+---------+----------+--------------+
|   581483|    23843|PAPER CRAFT , LIT...|   80995|  12/9/2011 9:1

### Hands-On 4 - Spark Transformations - aggregations

In [64]:
# Count distinct customers
from pyspark.sql.functions import countDistinct
df.select(countDistinct("CustomerID")).show()

# approx. distinct stock items
from pyspark.sql.functions import approx_count_distinct
df.select(approx_count_distinct("StockCode", 0.1)).show()

+--------------------------+
|count(DISTINCT CustomerID)|
+--------------------------+
|                      4372|
+--------------------------+

+--------------------------------+
|approx_count_distinct(StockCode)|
+--------------------------------+
|                            3364|
+--------------------------------+



In [67]:
# average purchase quantity (avg and mean are aliases, so both columns match)
from pyspark.sql.functions import avg, mean
df.select(
    avg("Quantity").alias("avg_purchases"),
    mean("Quantity").alias("mean_purchases"))\
   .show()

+----------------+----------------+
|   avg_purchases|  mean_purchases|
+----------------+----------------+
|9.55224954743324|9.55224954743324|
+----------------+----------------+
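
Both columns agree because `mean` is an alias of `avg`. As in SQL, the average is the sum of the non-null values divided by their count. A minimal plain-Python sketch of that rule follows; the function name `sql_avg` and the sample values are hypothetical.

```python
from typing import Iterable, Optional


def sql_avg(values: Iterable[Optional[float]]) -> Optional[float]:
    """Average of the non-null values; None when there are no such values."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None
    return sum(non_null) / len(non_null)


# Hypothetical quantities; None stands in for a SQL NULL
quantities = [6, 6, None, 8]
result = sql_avg(quantities)
print(result)  # the None entry is excluded from both the sum and the count
```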



### Hands-On 5 - Spark Transformations - grouping and windows

In [60]:
# count of items on the invoice
df.groupBy("InvoiceNo", "CustomerId").count().show(5)
# grouping with expressions
df.groupBy("InvoiceNo").agg(expr("avg(Quantity)"),expr("stddev_pop(Quantity)"))\
  .show()

+---------+----------+-----+
|InvoiceNo|CustomerId|count|
+---------+----------+-----+
|   536846|     14573|   76|
|   537026|     12395|   12|
|   537883|     14437|    5|
|   538068|     17978|   12|
|   538279|     14952|    7|
+---------+----------+-----+
only showing top 5 rows

+---------+------------------+--------------------+
|InvoiceNo|     avg(Quantity)|stddev_pop(Quantity)|
+---------+------------------+--------------------+
|   536596|               1.5|  1.1180339887498947|
|   536938|33.142857142857146|  20.698023172885524|
|   537252|              31.0|                 0.0|
|   537691|              8.15|   5.597097462078001|
|   538041|              30.0|                 0.0|
|   538184|12.076923076923077|   8.142590198943392|
|   538517|3.0377358490566038|  2.3946659604837897|
|   538879|21.157894736842106|  11.811070444356483|
|   539275|              26.0|  12.806248474865697|
|   539630|20.333333333333332|  10.225241100118645|
|   540499|              3.75|  2.6653
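
`stddev_pop` is the population standard deviation (divide by n), as opposed to the sample standard deviation (divide by n - 1). The sketch below illustrates the difference with Python's `statistics` module. The quantities are hypothetical, picked only so that the mean is 1.5; the real per-invoice quantities are not shown in the notebook output.

```python
import math
import statistics

# Hypothetical quantities for one invoice
quantities = [0, 1, 2, 3]

mean_quantity = statistics.mean(quantities)
pop_sd = statistics.pstdev(quantities)   # population: divide by n
samp_sd = statistics.stdev(quantities)   # sample: divide by n - 1

print(mean_quantity, pop_sd, samp_sd)

# A group whose values are all equal (or that has a single row) has a
# population standard deviation of 0.0
single_row_sd = statistics.pstdev([31])
print(single_row_sd)
```

This is consistent with the rows in the output that report `0.0`: they are what a group with a single row or with identical quantities would produce.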

In [57]:
# window functions

# add date column; "M" (rather than "MM") also accepts single-digit months such as 7/4/2011
from pyspark.sql.functions import col, to_date
dfWithDate = df.withColumn("date", to_date(col("InvoiceDate"), "M/d/yyyy H:mm"))
dfWithDate.createOrReplaceTempView("dfWithDate")

# create a window specification
from pyspark.sql.window import Window
from pyspark.sql.functions import desc
windowSpec = Window\
  .partitionBy("CustomerId", "date")\
  .orderBy(desc("Quantity"))\
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

# max purchase quantity
from pyspark.sql.functions import max
maxPurchaseQuantity = max(col("Quantity")).over(windowSpec)

# rank and dense rank
from pyspark.sql.functions import dense_rank, rank
purchaseDenseRank = dense_rank().over(windowSpec)
purchaseRank = rank().over(windowSpec)

dfWithDate.where("CustomerId IS NOT NULL").orderBy("CustomerId")\
  .select(
    col("CustomerId"),
    col("date"),
    col("Quantity"),
    purchaseRank.alias("quantityRank"),
    purchaseDenseRank.alias("quantityDenseRank"),
    maxPurchaseQuantity.alias("maxPurchaseQuantity")).show()

+----------+----------+--------+------------+-----------------+-------------------+
|CustomerId|      date|Quantity|quantityRank|quantityDenseRank|maxPurchaseQuantity|
+----------+----------+--------+------------+-----------------+-------------------+
|     12346|2011-01-18|   74215|           1|                1|              74215|
|     12346|2011-01-18|  -74215|           2|                2|              74215|
|     12347|2010-12-07|      36|           1|                1|                 36|
|     12347|2010-12-07|      30|           2|                2|                 36|
|     12347|2010-12-07|      24|           3|                3|                 36|
|     12347|2010-12-07|      12|           4|                4|                 36|
|     12347|2010-12-07|      12|           4|                4|                 36|
|     12347|2010-12-07|      12|           4|                4|                 36|
|     12347|2010-12-07|      12|           4|                4|             
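
In the rows shown, `quantityRank` and `quantityDenseRank` agree because the difference between the two only appears after a group of ties. The plain-Python sketch below models one window partition, assuming it is already sorted by quantity in descending order. The first seven values are the quantities visible in the output for customer 12347; the trailing `6` is a hypothetical extra row added to show where the two ranks diverge. The function name `rank_rows` is also hypothetical.

```python
from typing import List, Tuple


def rank_rows(sorted_desc: List[int]) -> List[Tuple[int, int, int, int]]:
    """Return (value, rank, dense_rank, running_max) for one partition.

    Assumes the values are already sorted in descending order.
    """
    out = []
    rank = 0
    dense = 0
    running_max = None
    prev = None
    for position, value in enumerate(sorted_desc, start=1):
        if value != prev:
            rank = position  # rank jumps to the row position, leaving gaps after ties
            dense += 1       # dense_rank increases by one, leaving no gaps
            prev = value
        # frame from the start of the partition up to the current row
        running_max = value if running_max is None else max(running_max, value)
        out.append((value, rank, dense, running_max))
    return out


rows = rank_rows([36, 30, 24, 12, 12, 12, 12, 6])
for row in rows:
    print(row)
```

Because the window is ordered by quantity descending, the running maximum over the frame from the start of the partition to the current row is always the first value, which matches the constant `maxPurchaseQuantity` column.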