## Chapter 6. Working with Different Types of Data

This chapter covers building expressions, which are the bread and butter of Spark’s structured operations.

All SQL and DataFrame functions are documented at "https://spark.apache.org/docs/latest/api/python/pyspark.sql.html"

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('DataTypes').getOrCreate()

df = spark.read.format("csv")\
     .option("header","true")\
     .option("inferSchema","true")\
     .load("data/2010-12-01.csv")

df.printSchema()
df.createOrReplaceTempView("dftable")

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)



Converting native Python types to Spark types with **lit**

In [5]:
from pyspark.sql.functions import lit
df.select(lit(5),lit("five"),lit(5.0)).show()

+---+----+---+
|  5|five|5.0|
+---+----+---+
|  5|five|5.0|
|  5|five|5.0|
|  5|five|5.0|
|  5|five|5.0|
|  5|five|5.0|
|  5|five|5.0|
|  5|five|5.0|
|  5|five|5.0|
|  5|five|5.0|
|  5|five|5.0|
|  5|five|5.0|
|  5|five|5.0|
|  5|five|5.0|
|  5|five|5.0|
|  5|five|5.0|
|  5|five|5.0|
|  5|five|5.0|
|  5|five|5.0|
|  5|five|5.0|
|  5|five|5.0|
+---+----+---+
only showing top 20 rows



In [0]:
%sql
select 5,"five",5.0

5,five,5.0
5,five,5.0


###### Working with Booleans
* Boolean statements consist of four elements: and, or, true, and false. We use these simple structures to build logical statements that evaluate to either true or false.
* We can specify Boolean expressions with multiple parts by using **and** or **or**.
* In Spark, we should always chain **and** filters as sequential filters. Even when Boolean statements are expressed serially (one after the other), Spark flattens all of these filters into one statement and performs the filter at the same time, creating the and statement for us.
* **or** statements need to be specified in the same statement.
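The equivalence between chained filters and a single and-ed filter can be seen without Spark at all. The sketch below is a plain-Python analogy, not Spark code: the rows and both helper functions are made up for illustration, and list comprehensions stand in for `where`.

```python
# Hypothetical rows standing in for a DataFrame.
rows = [
    {"StockCode": "DOT", "UnitPrice": 569.77, "Description": "DOTCOM POSTAGE"},
    {"StockCode": "DOT", "UnitPrice": 607.49, "Description": "DOTCOM POSTAGE"},
    {"StockCode": "85123A", "UnitPrice": 2.55, "Description": "WHITE HANGING HEART"},
]


def is_dot(row):
    return row["StockCode"] == "DOT"


def is_pricey_or_postage(row):
    # The "or" has to live inside a single predicate.
    return row["UnitPrice"] > 600 or "POSTAGE" in row["Description"]


# Two filters applied one after the other ...
chained = [r for r in [r for r in rows if is_dot(r)] if is_pricey_or_postage(r)]

# ... select the same rows as one filter with both conditions and-ed together.
combined = [r for r in rows if is_dot(r) and is_pricey_or_postage(r)]
```

Both lists contain the two DOT rows, which is why it is safe to write **and** conditions as separate filters and keep **or** conditions together.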

In [2]:
from pyspark.sql.functions import col
priceFilter = col("UnitPrice") > 600
descripFilter = col("Description").contains("POSTAGE")

df.where(df.StockCode.isin("DOT")).where(priceFilter | descripFilter).show()

+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|   Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
|   536544|      DOT|DOTCOM POSTAGE|       1|2010-12-01 14:32:00|   569.77|      null|United Kingdom|
|   536592|      DOT|DOTCOM POSTAGE|       1|2010-12-01 17:06:00|   607.49|      null|United Kingdom|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+



In [3]:
from pyspark.sql.functions import instr
priceFilter = col("UnitPrice") > 600
descripFilter = instr(df.Description, "POSTAGE") >= 1

In [4]:
df.filter(col('StockCode').isin('DOT')).filter(priceFilter | descripFilter).show()

+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|   Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
|   536544|      DOT|DOTCOM POSTAGE|       1|2010-12-01 14:32:00|   569.77|      null|United Kingdom|
|   536592|      DOT|DOTCOM POSTAGE|       1|2010-12-01 17:06:00|   607.49|      null|United Kingdom|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+



In [0]:
%sql
SELECT * 
FROM dftable 
WHERE StockCode in ("DOT")
AND (UnitPrice > 600 OR instr(Description,"POSTAGE") >= 1)

InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536544,DOT,DOTCOM POSTAGE,1,2010-12-01T14:32:00.000+0000,569.77,,United Kingdom
536592,DOT,DOTCOM POSTAGE,1,2010-12-01T17:06:00.000+0000,607.49,,United Kingdom


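To filter a DataFrame, you can also just specify a Boolean column. The following cells build an isExpensive Boolean column and filter on it, first in SQL and then with the DataFrame API.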

In [0]:
%sql
SELECT UnitPrice, (StockCode = 'DOT' AND
(UnitPrice > 600 OR instr(Description, "POSTAGE") >= 1)) as isExpensive
FROM dfTable
WHERE (StockCode = 'DOT' AND
(UnitPrice > 600 OR instr(Description, "POSTAGE") >= 1))

UnitPrice,isExpensive
569.77,True
607.49,True


In [5]:
DOTCodeFilter = col("StockCode") == "DOT"
priceFilter = col("UnitPrice") > 600
descripFilter = instr(col("Description"), "POSTAGE") >= 1

df.withColumn("isExpensive",DOTCodeFilter & (priceFilter | descripFilter))\
  .where("isExpensive")\
  .select("unitPrice","isExpensive").show(5)

+---------+-----------+
|unitPrice|isExpensive|
+---------+-----------+
|   569.77|       true|
|   607.49|       true|
+---------+-----------+



In [6]:
from pyspark.sql.functions import expr
df.withColumn("isExpensive",expr("NOT UnitPrice <= 250"))\
  .filter("isExpensive")\
  .select("Description","UnitPrice").show(5)

+--------------+---------+
|   Description|UnitPrice|
+--------------+---------+
|DOTCOM POSTAGE|   569.77|
|DOTCOM POSTAGE|   607.49|
+--------------+---------+



In [7]:
import pyspark.sql.functions as F

###### Working with Numbers

In [8]:
from pyspark.sql.functions import expr, pow
fabricatedQuantity = pow(col("Quantity") * col("UnitPrice"),2)+5
df.select(expr("CustomerId"), fabricatedQuantity.alias("realQuantity")).show(2)

+----------+------------------+
|CustomerId|      realQuantity|
+----------+------------------+
|   17850.0|239.08999999999997|
|   17850.0|          418.7156|
+----------+------------------+
only showing top 2 rows



In [9]:
df.selectExpr(
"CustomerId",
"(POWER((Quantity * UnitPrice), 2.0) + 5) as realQuantity").show(2)

+----------+------------------+
|CustomerId|      realQuantity|
+----------+------------------+
|   17850.0|239.08999999999997|
|   17850.0|          418.7156|
+----------+------------------+
only showing top 2 rows



In [10]:
# round to one decimal place
from pyspark.sql.functions import round,bround
df.select(round(col("UnitPrice"),1).alias("rounded"),col("UnitPrice")).show(5)

+-------+---------+
|rounded|UnitPrice|
+-------+---------+
|    2.6|     2.55|
|    3.4|     3.39|
|    2.8|     2.75|
|    3.4|     3.39|
|    3.4|     3.39|
+-------+---------+
only showing top 5 rows



In [11]:
df.select(round(F.lit("2.5")), bround(F.lit("2.5"))).show(2)

+-------------+--------------+
|round(2.5, 0)|bround(2.5, 0)|
+-------------+--------------+
|          3.0|           2.0|
|          3.0|           2.0|
+-------------+--------------+
only showing top 2 rows
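The difference between the two results comes from the tie-breaking rule: **round** rounds halves up, while **bround** rounds halves to the nearest even digit. The sketch below reproduces both rules with Python's `decimal` module; the two helper names are made up for illustration and this is an analogy, not Spark's implementation.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_half_up(value, places=0):
    """Tie-breaking rule used by round: a trailing 5 always rounds up."""
    quantum = Decimal(1).scaleb(-places)  # places=1 -> Decimal('0.1')
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)


def round_half_even(value, places=0):
    """Tie-breaking rule used by bround: a trailing 5 rounds to the even neighbour."""
    quantum = Decimal(1).scaleb(-places)
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_EVEN)


print(round_half_up("2.5"), round_half_even("2.5"))  # 3 2
print(round_half_up("3.5"), round_half_even("3.5"))  # 4 4
```

Values are passed as strings so that the example is not affected by binary floating-point representation.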



To compute the Pearson correlation coefficient of two columns, use either **df.stat.corr** or the **corr** function.

In [13]:
from pyspark.sql.functions import corr
df.stat.corr("Quantity", "UnitPrice")
df.select(corr("Quantity", "UnitPrice")).show()

+-------------------------+
|corr(Quantity, UnitPrice)|
+-------------------------+
|     -0.04112314436835551|
+-------------------------+
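For reference, the Pearson coefficient is the covariance of the two columns divided by the product of their standard deviations. A minimal plain-Python sketch of that formula follows; the function name and the toy inputs are made up for illustration, and it is not how Spark computes the value internally.

```python
from math import sqrt


def pearson_corr(xs, ys):
    """Pearson correlation of two equally long sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances up to the same factor).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


perfectly_aligned = pearson_corr([1, 2, 3], [2, 4, 6])
perfectly_opposed = pearson_corr([1, 2, 3], [6, 4, 2])
```

A value near 0, like the one reported for Quantity and UnitPrice, means there is almost no linear relationship between the two columns.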



To compute summary statistics for a column or set of columns, we can use the **describe()** method. It takes all numeric and string columns and
calculates the count, mean, standard deviation, min, and max. The mean and standard deviation are null where a string column cannot be read as numbers.

In [14]:
df.describe().show()

+-------+-----------------+------------------+--------------------+------------------+------------------+------------------+--------------+
|summary|        InvoiceNo|         StockCode|         Description|          Quantity|         UnitPrice|        CustomerID|       Country|
+-------+-----------------+------------------+--------------------+------------------+------------------+------------------+--------------+
|  count|             3108|              3108|                3098|              3108|              3108|              1968|          3108|
|   mean| 536516.684944841|27834.304044117645|                null| 8.627413127413128| 4.151946589446603|15661.388719512195|          null|
| stddev|72.89447869788873|17407.897548583845|                null|26.371821677029203|15.638659854603892|1854.4496996893627|          null|
|    min|           536365|             10002| 4 PURPLE FLOCK D...|               -24|               0.0|           12431.0|     Australia|
|    max|          C

                                                                                

In [15]:
colName = 'UnitPrice'
quantileProbs = [0.5]   # 0.5 requests the median
relError = 0.05         # allowed relative error in rank; 0 gives the exact quantile at higher cost
df.stat.approxQuantile(colName, quantileProbs, relError)

[2.51]
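The relative error bounds how far the rank of the returned value may be from the requested rank. According to the Spark documentation for `approxQuantile`, with N rows, probability p and error err, the result x satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The helper below, whose name is made up for illustration, only evaluates that bound in plain Python.

```python
from math import ceil, floor


def acceptable_rank_range(n, probability, relative_error):
    """Lowest and highest rank the approximate quantile is allowed to have."""
    low = floor((probability - relative_error) * n)
    high = ceil((probability + relative_error) * n)
    # Ranks cannot fall outside the data set.
    return max(low, 0), min(high, n)


# 3108 is the row count reported by describe() for UnitPrice.
median_bounds = acceptable_rank_range(3108, 0.5, 0.05)
print(median_bounds)  # (1398, 1710)
```

So the value 2.51 is guaranteed to sit somewhere between the 1398th and 1710th smallest UnitPrice, not necessarily at the exact median.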

In [16]:
from pyspark.sql.functions import monotonically_increasing_id

In [17]:
df.select(F.monotonically_increasing_id()).show(2)

+-----------------------------+
|monotonically_increasing_id()|
+-----------------------------+
|                            0|
|                            1|
+-----------------------------+
only showing top 2 rows
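The generated IDs are guaranteed to be increasing and unique, but not consecutive. The Spark documentation describes the current layout as the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The function below is a made-up illustration of that layout, not Spark's code.

```python
def sketch_monotonic_id(partition_id, record_number):
    """Combine a partition ID and a per-partition record number into one 64-bit ID."""
    return (partition_id << 33) + record_number


first_partition = [sketch_monotonic_id(0, i) for i in range(2)]
second_partition = [sketch_monotonic_id(1, i) for i in range(2)]
print(first_partition)   # [0, 1]
print(second_partition)  # [8589934592, 8589934593]
```

This is why the IDs in the output above start at 0 and 1: both rows come from the first partition.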



###### Working with Strings

* **initcap** function will capitalize every word in a given string when that word is separated from another by a space.

In [18]:
from pyspark.sql.functions import initcap,lower,upper

df.select(initcap(col("Description"))).show(2)

+--------------------+
|initcap(Description)|
+--------------------+
|White Hanging Hea...|
| White Metal Lantern|
+--------------------+
only showing top 2 rows



In [19]:
df.select(col("Description").alias("DESC"),
          lower(col("Description")).alias("LOWER DESC"),
          upper(col("Description")).alias("UPPER DESC")).show(2)

+--------------------+--------------------+--------------------+
|                DESC|          LOWER DESC|          UPPER DESC|
+--------------------+--------------------+--------------------+
|WHITE HANGING HEA...|white hanging hea...|WHITE HANGING HEA...|
| WHITE METAL LANTERN| white metal lantern| WHITE METAL LANTERN|
+--------------------+--------------------+--------------------+
only showing top 2 rows



In [20]:
from pyspark.sql.functions import lit, ltrim, rtrim, rpad, lpad, trim
df.select(ltrim(lit(" HELLO ")).alias("ltrim"),
          rtrim(lit(" HELLO ")).alias("rtrim"),
          trim(lit(" HELLO ")).alias("trim"),
          lpad(lit("HELLO"), 3, " ").alias("lp"),
          rpad(lit("HELLO"), 10, " ").alias("rp")).show(2)

+------+------+-----+---+----------+
| ltrim| rtrim| trim| lp|        rp|
+------+------+-----+---+----------+
|HELLO | HELLO|HELLO|HEL|HELLO     |
|HELLO | HELLO|HELLO|HEL|HELLO     |
+------+------+-----+---+----------+
only showing top 2 rows
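Note that **lpad** and **rpad** shorten the string when the requested length is smaller than the input, which is why `lp` shows `HEL`. A plain-Python sketch of that behaviour follows; the two functions reuse the Spark names for readability but are illustrative re-implementations, not the Spark functions.

```python
def lpad(text, length, pad):
    """Left-pad text to length; truncate when text is already longer."""
    if len(text) >= length:
        return text[:length]
    fill = (pad * length)[: length - len(text)]
    return fill + text


def rpad(text, length, pad):
    """Right-pad text to length; truncate when text is already longer."""
    if len(text) >= length:
        return text[:length]
    fill = (pad * length)[: length - len(text)]
    return text + fill


print(repr(lpad("HELLO", 3, " ")))   # 'HEL'
print(repr(rpad("HELLO", 10, " ")))  # 'HELLO     '
```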



###### Regular Expressions
There are two key functions in Spark to perform regular expression tasks:
* **regexp_extract** extracts values
* **regexp_replace** replaces values

In [21]:
from pyspark.sql.functions import regexp_replace
regex_string = "BLACK|WHITE|RED|GREEN|BLUE"
df.select(regexp_replace(df.Description,regex_string,'COLOR').alias("color_clean"),col("Description")).show(2)

+--------------------+--------------------+
|         color_clean|         Description|
+--------------------+--------------------+
|COLOR HANGING HEA...|WHITE HANGING HEA...|
| COLOR METAL LANTERN| WHITE METAL LANTERN|
+--------------------+--------------------+
only showing top 2 rows



In [23]:
from pyspark.sql.functions import translate

df.select(translate(col('Description'), 'LEET', '1337'), col('Description')).show(2)

+----------------------------------+--------------------+
|translate(Description, LEET, 1337)|         Description|
+----------------------------------+--------------------+
|              WHI73 HANGING H3A...|WHITE HANGING HEA...|
|               WHI73 M37A1 1AN73RN| WHITE METAL LANTERN|
+----------------------------------+--------------------+
only showing top 2 rows
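**translate** works at the character level: each character of the second argument is replaced by the character at the same index in the third argument. Python's built-in `str.translate` follows the same idea, so it can serve as a quick analogy (this is plain Python, not Spark).

```python
# L -> 1, E -> 3, E -> 3, T -> 7 (characters are matched by position).
table = str.maketrans("LEET", "1337")

leet = "WHITE METAL LANTERN".translate(table)
print(leet)  # WHI73 M37A1 1AN73RN
```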



In [24]:
from pyspark.sql.functions import regexp_extract
extract_string = "(BLACK|WHITE|RED|GREEN|BLUE)"

df.select(regexp_extract(df.Description,extract_string,1).alias("color_clean"),col("Description")).show(2)

+-----------+--------------------+
|color_clean|         Description|
+-----------+--------------------+
|      WHITE|WHITE HANGING HEA...|
|      WHITE| WHITE METAL LANTERN|
+-----------+--------------------+
only showing top 2 rows
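The two regular expression cells above can be mirrored with Python's `re` module to see what each function does to a single string. This is an analogy only: Spark uses Java regular expressions, which differ from Python's in some details, and both helper names are made up. Like **regexp_extract**, the extraction helper returns an empty string when nothing matches.

```python
import re

COLOR_PATTERN = "(BLACK|WHITE|RED|GREEN|BLUE)"


def replace_colors(description):
    """Counterpart of regexp_replace: substitute every match."""
    return re.sub(COLOR_PATTERN, "COLOR", description)


def extract_color(description, group=1):
    """Counterpart of regexp_extract: return the first match's capture group."""
    match = re.search(COLOR_PATTERN, description)
    return match.group(group) if match else ""


print(replace_colors("WHITE METAL LANTERN"))            # COLOR METAL LANTERN
print(extract_color("RED WOOLLY HOTTIE WHITE HEART."))  # RED
```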



To check whether a string exists in a column, use the **contains** method (or **instr**). It returns a Boolean declaring whether the value you specify is in the column's string.

In [26]:
containsBlack = df.Description.contains("BLACK")
containsWhite = instr(col("Description"),"WHITE") >= 1

df.withColumn("hasBlackNWhite", containsBlack | containsWhite).where("hasBlackNWhite").select("Description").show(5)

+--------------------+
|         Description|
+--------------------+
|WHITE HANGING HEA...|
| WHITE METAL LANTERN|
|RED WOOLLY HOTTIE...|
|WHITE HANGING HEA...|
| WHITE METAL LANTERN|
+--------------------+
only showing top 5 rows



In [0]:
%sql
SELECT Description from dftable
WHERE instr(Description, 'BLACK') >= 1 OR instr(Description, 'WHITE') >= 1
LIMIT 3

Description
WHITE HANGING HEART T-LIGHT HOLDER
WHITE METAL LANTERN
RED WOOLLY HOTTIE WHITE HEART.


When we convert a list of values into a set of arguments and pass them into a function, we use a language feature called **varargs**. Using this feature, we can effectively unravel an array of arbitrary length and pass it as arguments to a function. 
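In Python, varargs are written with a leading `*`, both when declaring a function and when unpacking a list at the call site. A minimal sketch, using a made-up function name:

```python
def describe_columns(*names):
    """Accept any number of positional arguments and join them."""
    return ", ".join(names)


columns = ["is_black", "is_white", "is_red"]

# The * unpacks the list so that each element becomes its own argument.
unpacked = describe_columns(*columns)
explicit = describe_columns("is_black", "is_white", "is_red")
print(unpacked)  # is_black, is_white, is_red
```

The same unpacking is what lets a list of Column objects be passed to `select`.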

**locate** returns the integer position of the first occurrence of substr in a string column, after position pos.

**Note** The position is 1-based, not zero-based. Returns 0 if substr could not be found in str.

Parameters:
* substr – a string
* str – a Column of pyspark.sql.types.StringType
* pos – start position (zero based)
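Because the result is 1-based and 0 means "not found", casting it to Boolean gives a convenient "contains" flag. The plain-Python sketch below mimics that behaviour for a single string; it reuses the name `locate` for readability, omits the `pos` argument, and is not the Spark function.

```python
def locate(substr, text):
    """1-based position of the first occurrence of substr, or 0 if absent."""
    # str.find returns -1 when nothing is found, so adding 1 yields 0.
    return text.find(substr) + 1


description = "WHITE METAL LANTERN"
print(locate("WHITE", description))        # 1
print(locate("METAL", description))        # 7
print(bool(locate("BLACK", description)))  # False
```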

In [27]:
from pyspark.sql.functions import expr, locate

simpleColors = ["black", "white", "red", "green", "blue"]

def color_locator(column, color_string):
  return locate(color_string.upper(), column).cast("boolean").alias("is_" + color_string)

selectColumns = [color_locator(df.Description, c) for c in simpleColors]
selectColumns.append(expr("*")) # has to be a Column type

df.select(*selectColumns).where(expr("is_white OR is_red")).select("Description").show(3, False)

+----------------------------------+
|Description                       |
+----------------------------------+
|WHITE HANGING HEART T-LIGHT HOLDER|
|WHITE METAL LANTERN               |
|RED WOOLLY HOTTIE WHITE HEART.    |
+----------------------------------+
only showing top 3 rows



If you use **col()** and want to perform transformations on that column, you must perform them on that column reference.
When using an expression, the **expr** function can parse **transformations** and **column references** from a string, and the result can subsequently be passed into further transformations.

Key points:

* Columns are just expressions.
* Columns and transformations of those columns compile to the same logical plan as parsed expressions.

###### Working with Dates and Timestamps

You can set a session-local time zone if necessary by setting **spark.sql.session.timeZone** in the SQL configurations.

In [28]:
from pyspark.sql.functions import current_date, current_timestamp

dateDf = spark.range(10)\
     .withColumn("TodayDate",current_date())\
     .withColumn("Now",current_timestamp())

dateDf.createOrReplaceTempView("dateTable")
dateDf.printSchema()

root
 |-- id: long (nullable = false)
 |-- TodayDate: date (nullable = false)
 |-- Now: timestamp (nullable = false)



In [31]:
# To add or subtract five days from today
from pyspark.sql.functions import date_add, date_sub

dateDf.select(date_add("TodayDate",5),date_sub("TodayDate",5)).show(1)

+----------------------+----------------------+
|date_add(TodayDate, 5)|date_sub(TodayDate, 5)|
+----------------------+----------------------+
|            2022-10-25|            2022-10-15|
+----------------------+----------------------+
only showing top 1 row



In [0]:
%sql
SELECT date_add(TodayDate,5),date_sub(TodayDate,5)
FROM dateTable
LIMIT 1

"date_add(TodayDate, 5)","date_sub(TodayDate, 5)"
2020-07-10,2020-06-30


In [32]:
from pyspark.sql.functions import datediff, months_between, to_date, col, lit

dateDf.withColumn("week_ago", date_sub(col("TodayDate"),7))\
      .select(datediff(col("week_ago"),col("TodayDate"))).show(1)

+-----------------------------+
|datediff(week_ago, TodayDate)|
+-----------------------------+
|                           -7|
+-----------------------------+
only showing top 1 row



In [33]:
dateDf.select(
to_date(lit("2019-06-23")).alias("start"),
  to_date(lit("2019-11-23")).alias("end")
).select(months_between(col("start"),col("end"))).show(1)

+--------------------------------+
|months_between(start, end, true)|
+--------------------------------+
|                            -5.0|
+--------------------------------+
only showing top 1 row
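The result is a whole number here because both dates fall on the same day of the month. As far as the Spark documentation describes it, **months_between** returns a whole number when the days of the month match or both dates are the last day of their month, and otherwise computes the fraction assuming 31 days per month, rounded to 8 digits. The sketch below applies that rule to plain dates and ignores the time of day; it is an illustration under those assumptions, not Spark's implementation.

```python
from calendar import monthrange
from datetime import date


def is_last_day(d):
    return d.day == monthrange(d.year, d.month)[1]


def months_between(date1, date2):
    """Months from date2 to date1; negative when date1 is the earlier date."""
    whole = (date1.year - date2.year) * 12 + (date1.month - date2.month)
    if date1.day == date2.day or (is_last_day(date1) and is_last_day(date2)):
        return float(whole)
    # Fractional part assumes every month has 31 days.
    return round(whole + (date1.day - date2.day) / 31, 8)


print(months_between(date(2019, 6, 23), date(2019, 11, 23)))  # -5.0
```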



Spark will not throw an error if it cannot parse the date; rather, it will just return **null.**

Example:

In [34]:
dateDf.select(to_date(lit("2016-20-12")),to_date(lit("2017-12-11"))).show(1)

+-------------------+-------------------+
|to_date(2016-20-12)|to_date(2017-12-11)|
+-------------------+-------------------+
|               null|         2017-12-11|
+-------------------+-------------------+
only showing top 1 row



Notice how the second date is parsed as December 11th instead of the intended day, November 12th. Spark doesn’t throw an
error because it cannot know whether the day and month are swapped or that specific row is incorrect. To fix this, pass an explicit format to **to_date** and **to_timestamp**.
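The same two behaviours (null on failure, and a format deciding which field is the month) can be tried out with Python's `datetime`. This is a plain-Python analogy with a made-up helper that reuses the name `to_date`; note that Python's format codes (`%Y-%d-%m`) are not the same as Spark's pattern letters.

```python
from datetime import date, datetime


def to_date(text, fmt="%Y-%m-%d"):
    """Parse text with the given format; return None instead of raising."""
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None


print(to_date("2016-20-12"))              # None (there is no month 20)
print(to_date("2017-12-11"))              # 2017-12-11 (read as year-month-day)
print(to_date("2017-12-11", "%Y-%d-%m"))  # 2017-11-12 (read as year-day-month)
```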

In [37]:
spark.conf.set('spark.sql.legacy.timeParserPolicy', 'LEGACY')

# Lowercase yyyy is the calendar year; uppercase YYYY is the week-based year
# and would silently produce the wrong date here.
dateFormat = "yyyy-dd-MM"

cleanDF = spark.range(1).select(
          to_date(lit("2017-12-11"),dateFormat).alias("date"),
          to_date(lit("2017-20-12"),dateFormat).alias("date2"))

cleanDF.createOrReplaceTempView("dataTable2")

In [38]:
from pyspark.sql.functions import to_timestamp

cleanDF.select(to_timestamp(col("date"),dateFormat)).show()

+------------------------------+
|to_timestamp(date, yyyy-dd-MM)|
+------------------------------+
|           2017-11-12 00:00:00|
+------------------------------+
+------------------------------+



###### Working with Nulls in Data
Spark can optimize operations on null values better than on empty strings or other placeholder values, so prefer null to represent missing data.
* Use the **.na** subpackage on a DataFrame to handle nulls.
* **Spark includes a function to allow you to select the first non-null value from a set of columns by using the coalesce function.**
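For a single row, **coalesce** boils down to "walk the arguments left to right and return the first one that is not null". A plain-Python sketch of that rule, with `None` standing in for null (the function reuses the Spark name but is only an illustration):

```python
def coalesce(*values):
    """Return the first value that is not None, or None if all are."""
    for value in values:
        if value is not None:
            return value
    return None


print(coalesce("WHITE METAL LANTERN", 17850.0))  # WHITE METAL LANTERN
print(coalesce(None, 17850.0))                   # 17850.0
print(coalesce(None, None))                      # None
```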

In [39]:
from pyspark.sql.functions import coalesce
df.select(coalesce(col("Description"), col("CustomerId"))).show()

+---------------------------------+
|coalesce(Description, CustomerId)|
+---------------------------------+
|             WHITE HANGING HEA...|
|              WHITE METAL LANTERN|
|             CREAM CUPID HEART...|
|             KNITTED UNION FLA...|
|             RED WOOLLY HOTTIE...|
|             SET 7 BABUSHKA NE...|
|             GLASS STAR FROSTE...|
|             HAND WARMER UNION...|
|             HAND WARMER RED P...|
|             ASSORTED COLOUR B...|
|             POPPY'S PLAYHOUSE...|
|             POPPY'S PLAYHOUSE...|
|             FELTCRAFT PRINCES...|
|             IVORY KNITTED MUG...|
|             BOX OF 6 ASSORTED...|
|             BOX OF VINTAGE JI...|
|             BOX OF VINTAGE AL...|
|             HOME BUILDING BLO...|
|             LOVE BUILDING BLO...|
|             RECIPE BOX WITH M...|
+---------------------------------+
only showing top 20 rows



The simplest function is drop, which removes rows that contain nulls. The default is to drop any row in which any value is null:
* df.na.drop() / df.na.drop("any") - drops a row if any of the values are null.
* df.na.drop("all") - drops the row only if all values are null or NaN for that row
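The difference between "any" and "all" is easiest to see on a handful of rows. The sketch below is a plain-Python analogy with made-up data and a made-up helper name; it treats `None` as null and does not handle NaN.

```python
# Hypothetical rows; None stands in for null.
rows = [
    {"Description": "WHITE METAL LANTERN", "CustomerID": 17850.0},
    {"Description": "DOTCOM POSTAGE", "CustomerID": None},
    {"Description": None, "CustomerID": None},
]


def drop_nulls(rows, how="any"):
    """Drop a row if any (or, with how='all', every) value is None."""
    check = any if how == "any" else all
    return [row for row in rows if not check(v is None for v in row.values())]


kept_any = drop_nulls(rows, "any")  # only the fully populated row survives
kept_all = drop_nulls(rows, "all")  # only the completely empty row is dropped
```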

**fill:**

Using the fill function, you can fill one or more columns with a set of values. This can be done by specifying a particular value together with the subset of columns it applies to.

In [42]:
df.na.fill("all", subset=["StockCode", "InvoiceNo"])

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

In Python, we can also do this with a dictionary (the counterpart of a Scala Map), where the key is the column name and the value is the
value we would like to use to fill null values
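The per-column replacement can be pictured as a dictionary lookup applied to every null cell. The following plain-Python sketch uses made-up rows and a made-up helper name, with `None` standing in for null:

```python
# Hypothetical rows; None stands in for null.
rows = [
    {"StockCode": "71053", "Description": None},
    {"StockCode": None, "Description": "WHITE METAL LANTERN"},
]


def fill_nulls(rows, replacements):
    """Replace None with the per-column value from replacements, if one is given."""
    return [
        {
            column: replacements[column] if value is None and column in replacements else value
            for column, value in row.items()
        }
        for row in rows
    ]


filled = fill_nulls(rows, {"Description": "No Value"})
```

Only the Description column is touched here; the null StockCode stays null because no replacement was supplied for it.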

In [44]:
fill_cols_vals = {"StockCode": 5, "Description" : "No Value"}
df.na.fill(fill_cols_vals).show()

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8|2010-12-01 08:26:00|     2.75|   17850.0|United Kingdom|
|   536365|   84029G|KNITTED UNION FLA...|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|   84029E|RED WOOLLY HOTTIE...|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|    22752|SET 7 BABUSHKA NE...|       2|2010-12-01 08:26:00|     7.65|   17850.0|United Kingdom|
|   536365|    21730|GLASS S

###### replace
You can also replace all values in a certain column according to their current value. The only requirement is that the replacement value be the same type as the original value.

In [47]:
df.na.replace([""], ["UNKNOWN"], "Description").show()

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8|2010-12-01 08:26:00|     2.75|   17850.0|United Kingdom|
|   536365|   84029G|KNITTED UNION FLA...|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|   84029E|RED WOOLLY HOTTIE...|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|    22752|SET 7 BABUSHKA NE...|       2|2010-12-01 08:26:00|     7.65|   17850.0|United Kingdom|
|   536365|    21730|GLASS S

###### Working with Complex Types
There are three kinds of complex types: **structs, arrays, and maps.**

**structs:** You can think of structs as DataFrames within DataFrames.

In [51]:
df.selectExpr("(Description, InvoiceNo) as complex", "*")
df.selectExpr("struct(Description, InvoiceNo) as complex", "*")

+--------------------+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|             complex|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+--------------------+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|{WHITE HANGING HE...|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|
|{WHITE METAL LANT...|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|{CREAM CUPID HEAR...|   536365|   84406B|CREAM CUPID HEART...|       8|2010-12-01 08:26:00|     2.75|   17850.0|United Kingdom|
|{KNITTED UNION FL...|   536365|   84029G|KNITTED UNION FLA...|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|{RED WOOLLY HOTTI...|   536365|   84029E|RED WOOLLY HOTTIE...|       6|2010-12-01 08:26:00|     

DataFrame[complex: struct<Description:string,InvoiceNo:string>, InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

In [52]:
from pyspark.sql.functions import struct
complexDF = df.select(struct("Description", "InvoiceNo").alias("complex"))
complexDF.createOrReplaceTempView("complexDF")

We now have a DataFrame with a column named complex. We can query it just as we might another DataFrame; the only difference is that we use dot syntax or the column method **getField**.

In [56]:
complexDF.select("complex.InvoiceNo").show(2)
complexDF.select(col("complex").getField("Description")).show()

+---------+
|InvoiceNo|
+---------+
|   536365|
|   536365|
+---------+
only showing top 2 rows

+--------------------+
| complex.Description|
+--------------------+
|WHITE HANGING HEA...|
| WHITE METAL LANTERN|
|CREAM CUPID HEART...|
|KNITTED UNION FLA...|
|RED WOOLLY HOTTIE...|
|SET 7 BABUSHKA NE...|
|GLASS STAR FROSTE...|
|HAND WARMER UNION...|
|HAND WARMER RED P...|
|ASSORTED COLOUR B...|
|POPPY'S PLAYHOUSE...|
|POPPY'S PLAYHOUSE...|
|FELTCRAFT PRINCES...|
|IVORY KNITTED MUG...|
|BOX OF 6 ASSORTED...|
|BOX OF VINTAGE JI...|
|BOX OF VINTAGE AL...|
|HOME BUILDING BLO...|
|LOVE BUILDING BLO...|
|RECIPE BOX WITH M...|
+--------------------+
only showing top 20 rows



In [58]:
# We can also query all values in the struct by using *. This brings all of its fields up to the top-level DataFrame
complexDF.select("complex.*").show(3)

+--------------------+---------+
|         Description|InvoiceNo|
+--------------------+---------+
|WHITE HANGING HEA...|   536365|
| WHITE METAL LANTERN|   536365|
|CREAM CUPID HEART...|   536365|
+--------------------+---------+
only showing top 3 rows



###### Arrays
To explain arrays in more detail, let's take every word in our Description column and convert it into a row in our DataFrame. To do this we use
* the **split** function, specifying the delimiter
* **size** to query the array's length
* **array_contains** to check whether the given array contains the specified value
* the **explode** function, which takes a column that consists of arrays and creates one row (with the rest of the values duplicated) per value in the array.
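The four operations in the list above have direct counterparts on a single Python string and list, which can help when reasoning about what each Spark function returns per row. This is a plain-Python analogy with one made-up row, not Spark code.

```python
row = {"Description": "WHITE HANGING HEART T-LIGHT HOLDER", "InvoiceNo": "536365"}

# split: turn the string into an array using a delimiter.
words = row["Description"].split(" ")

# size: the length of the array.
word_count = len(words)  # 5

# array_contains: membership test on the array.
has_white = "WHITE" in words  # True

# explode: one output row per array element, other values duplicated.
exploded = [{**row, "exploded": word} for word in words]
```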

In [62]:
from pyspark.sql.functions import split

df.select(split(col("Description"), " ")).show(1)

df.select(split(col("Description"), " ").alias("array_col")).selectExpr("array_col[0]").show(2)

+-------------------------+
|split(Description,  , -1)|
+-------------------------+
|     [WHITE, HANGING, ...|
+-------------------------+
only showing top 1 row

+------------+
|array_col[0]|
+------------+
|       WHITE|
|       WHITE|
+------------+
only showing top 2 rows



In [63]:
from pyspark.sql.functions import size

df.select(size(split(col("Description"), " "))).show(2)

+-------------------------------+
|size(split(Description,  , -1))|
+-------------------------------+
|                              5|
|                              3|
+-------------------------------+
only showing top 2 rows



In [64]:
from pyspark.sql.functions import array_contains

df.select(array_contains(split(col("Description")," "),'WHITE')).show(2)

+------------------------------------------------+
|array_contains(split(Description,  , -1), WHITE)|
+------------------------------------------------+
|                                            true|
|                                            true|
+------------------------------------------------+
only showing top 2 rows



In [65]:
# To convert a complex type into a set of rows (one per value in our array), we need to use the explode function.
from pyspark.sql.functions import explode

df.withColumn("splitted",split(col("Description")," "))\
  .withColumn("exploded",explode("splitted"))\
  .select("Description","InvoiceNo","exploded").show(3)

+--------------------+---------+--------+
|         Description|InvoiceNo|exploded|
+--------------------+---------+--------+
|WHITE HANGING HEA...|   536365|   WHITE|
|WHITE HANGING HEA...|   536365| HANGING|
|WHITE HANGING HEA...|   536365|   HEART|
+--------------------+---------+--------+
only showing top 3 rows



###### Maps:
Maps are created by using the **create_map** function (map in SQL) and key-value pairs of columns. You can then select from them just like you might select from an array.

In [66]:
from pyspark.sql.functions import create_map

df.select(create_map(col("Description"),col("InvoiceNo")).alias("complex_map")).show(2)

+--------------------+
|         complex_map|
+--------------------+
|{WHITE HANGING HE...|
|{WHITE METAL LANT...|
+--------------------+
only showing top 2 rows



You can query them by using the proper key. A missing key returns **null**

In [67]:
df.select(create_map(col("Description"), col("InvoiceNo")).alias("complex_map"))\
.selectExpr("complex_map['WHITE METAL LANTERN']").show(2)

+--------------------------------+
|complex_map[WHITE METAL LANTERN]|
+--------------------------------+
|                            null|
|                          536365|
+--------------------------------+
only showing top 2 rows



We can also use the **explode** function on a map, which turns each entry into a row with key and value columns.

In [68]:
df.select(create_map(col("Description"), col("InvoiceNo")).alias("complex_map"))\
.selectExpr("explode(complex_map)").show(2)

+--------------------+------+
|                 key| value|
+--------------------+------+
|WHITE HANGING HEA...|536365|
| WHITE METAL LANTERN|536365|
+--------------------+------+
only showing top 2 rows
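A Python dictionary per row is a reasonable mental model for a map column: looking up a missing key yields null, and exploding yields one key/value pair per entry. The sketch below is a plain-Python analogy over two made-up rows, not Spark code.

```python
rows = [
    {"Description": "WHITE HANGING HEART T-LIGHT HOLDER", "InvoiceNo": "536365"},
    {"Description": "WHITE METAL LANTERN", "InvoiceNo": "536365"},
]

# create_map: one single-entry map per row, keyed by Description.
complex_maps = [{row["Description"]: row["InvoiceNo"]} for row in rows]

# Key lookup: dict.get returns None (null) when the key is missing.
lookups = [m.get("WHITE METAL LANTERN") for m in complex_maps]
print(lookups)  # [None, '536365']

# explode: one key/value row per map entry.
exploded = [{"key": k, "value": v} for m in complex_maps for k, v in m.items()]
```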



###### Working with JSON
We can operate directly on strings of JSON in Spark, parsing them or extracting JSON objects. We can use
* **get_json_object** to query a JSON object inline, be it a dictionary or array.
* **json_tuple** if the object has only one level of nesting.

In [69]:
jsonDF = spark.range(1).selectExpr(""" '{"myJSONKey" : {"myJSONValue" : [1, 2, 3]}}' as jsonString """)
jsonDF.show()

+--------------------+
|          jsonString|
+--------------------+
|{"myJSONKey" : {"...|
+--------------------+



In [70]:
from pyspark.sql.functions import get_json_object, json_tuple,col

jsonDF.select(\
              get_json_object(col("jsonString"), "$.myJSONKey.myJSONValue[1]"),
              json_tuple(col("jsonString"), "myJSONKey").alias("ex_jsonString")
             ).show(2)

+-------------------------------------------------------+--------------------+
|get_json_object(jsonString, $.myJSONKey.myJSONValue[1])|       ex_jsonString|
+-------------------------------------------------------+--------------------+
|                                                      2|{"myJSONValue":[1...|
+-------------------------------------------------------+--------------------+
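To see what the path `$.myJSONKey.myJSONValue[1]` selects, the same string can be parsed with Python's `json` module and indexed step by step. This is a plain-Python analogy, not the Spark functions.

```python
import json

json_string = '{"myJSONKey" : {"myJSONValue" : [1, 2, 3]}}'
parsed = json.loads(json_string)

# Counterpart of get_json_object(..., "$.myJSONKey.myJSONValue[1]"):
# follow the keys, then take the element at index 1.
second_value = parsed["myJSONKey"]["myJSONValue"][1]
print(second_value)  # 2

# Counterpart of json_tuple(..., "myJSONKey"): the value under a top-level key,
# rendered back as compact JSON text.
top_level = json.dumps(parsed["myJSONKey"], separators=(",", ":"))
print(top_level)  # {"myJSONValue":[1,2,3]}
```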



* We can also turn a StructType into a JSON string by using the **to_json** function. This function also accepts a dictionary (map) of parameters that are the same as those of the JSON data source.

* Use the **from_json** function to parse this (or other JSON data) back in. This requires you to specify a schema, and optionally you can specify a map of options.

In [71]:
from pyspark.sql.functions import to_json, from_json

df.selectExpr("(InvoiceNo, Description) as myStruct"
             ).select(to_json(col("myStruct"))).show()

+--------------------+
|   to_json(myStruct)|
+--------------------+
|{"InvoiceNo":"536...|
|{"InvoiceNo":"536...|
|{"InvoiceNo":"536...|
|{"InvoiceNo":"536...|
|{"InvoiceNo":"536...|
|{"InvoiceNo":"536...|
|{"InvoiceNo":"536...|
|{"InvoiceNo":"536...|
|{"InvoiceNo":"536...|
|{"InvoiceNo":"536...|
|{"InvoiceNo":"536...|
|{"InvoiceNo":"536...|
|{"InvoiceNo":"536...|
|{"InvoiceNo":"536...|
|{"InvoiceNo":"536...|
|{"InvoiceNo":"536...|
|{"InvoiceNo":"536...|
|{"InvoiceNo":"536...|
|{"InvoiceNo":"536...|
|{"InvoiceNo":"536...|
+--------------------+
only showing top 20 rows



In [72]:
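from pyspark.sql.functions import to_json, from_json
from pyspark.sql.types import StructType, StructField, StringType

# Schema that from_json uses to parse the JSON strings back into a struct
Schema = StructType((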
                    StructField("InvoiceNo",StringType(),True),
                    StructField("Description",StringType(),True)))
df.selectExpr("(InvoiceNo, Description) as myStruct")\
  .select(to_json(col("myStruct")).alias("newJSON"))\
  .select(from_json(col("newJSON"),Schema),col("newJSON")).show(2)

+--------------------+--------------------+
|  from_json(newJSON)|             newJSON|
+--------------------+--------------------+
|{536365, WHITE HA...|{"InvoiceNo":"536...|
|{536365, WHITE ME...|{"InvoiceNo":"536...|
+--------------------+--------------------+
only showing top 2 rows



###### User-Defined Functions - define your own functions

* UDFs can take one or more columns as input and can return one or more columns.
* You can write them in several different programming languages; you do not need to create them in an esoteric format or domain-specific language.
* By default, these functions are registered as temporary functions to be used in that specific SparkSession or Context.

**performance considerations:**
* If the function is written in Scala or Java, you can use it within the Java Virtual Machine (JVM). This means that there will be little performance penalty aside from the fact that you can’t take advantage of code generation capabilities that Spark has for built-in functions.

* If the function is written in Python, Spark
  * starts a Python process on the worker
  * serializes all of the data to a format that Python can understand
  * executes the function row by row on that data in the Python process, and then finally 
  * returns the results of the row operations to the JVM and Spark
  
Starting this Python process is expensive, but the real cost is in serializing the data to Python. This is costly for two reasons: it is an expensive computation, and once the data enters Python, Spark cannot manage the memory of the worker. This means that a worker could fail if it becomes resource constrained (because both the JVM and Python are competing for memory on the same machine).
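
The round trip described above can be mimicked without Spark. The sketch below is a toy model only: it assumes `pickle` as a stand-in for the serialization step and a plain list as a stand-in for a column, and it does not reproduce Spark's actual worker protocol. It is meant to show where the extra work sits: every value is serialized, handled one at a time in Python, and serialized again on the way back.

```python
import pickle


def power3(value):
    return value ** 3


# Stand-in for a column of data held on the JVM side (hypothetical data).
rows = [0, 1, 2, 3, 4]

# 1. The data is serialized into a format Python can read.
payload = pickle.dumps(rows)

# 2. The Python process deserializes it.
received = pickle.loads(payload)

# 3. The function is executed row by row.
results = [power3(value) for value in received]

# 4. The results are serialized and handed back.
returned = pickle.loads(pickle.dumps(results))

print(returned)  # [0, 1, 8, 27, 64]
```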

It is recommended to write your UDFs in Scala or Java. The small amount of time it takes to write the function in Scala usually yields significant speedups, and we can still use the function from Python!

In [73]:
# Create a function that raises a number to the third power

udfExampleDF = spark.range(5).toDF("num")
def power3(double_value):
    return double_value ** 3

power3(2.0)

8.0

In [79]:
# Register the function with Spark so that we can use it on all of our worker machines.
# Spark will serialize the function on the driver and transfer it over the network to
# all executor processes. This happens regardless of language.

from pyspark.sql.functions import udf

power3udf = udf(power3)
udfExampleDF.select(power3udf(col("num")))

DataFrame[power3(num): string]

At this point we can use the UDF only as a DataFrame function. We can't use it within a **string expression, only on a column expression.**

* Once a UDF is registered as a Spark SQL function, it is also valid to use in string expressions (and in SQL) when working with DataFrames.


Let’s register the function in Scala and use it in Python.

In [0]:
%scala
val udfExampleDF = spark.range(5).toDF("num")
def power3scala(number:Double):Double = number * number * number

spark.udf.register("power3Scala", power3scala(_:Double):Double)
udfExampleDF.selectExpr("power3Scala(num)").show(2)

In [None]:
# Use the power3Scala function (registered in Scala above) from Python
udfExampleDF.selectExpr("power3Scala(num)").show(2)

* It’s a best practice to define the return type for your function when you define it. The function still works without one; PySpark then defaults to a string return type.

* If you specify a type that doesn’t align with the actual type returned by the function, Spark will not throw an error but will just return **null** to designate a failure.

In [0]:
from pyspark.sql.types import IntegerType, DoubleType

spark.udf.register("power3py",power3,DoubleType())
udfExampleDF.selectExpr("power3py(num)").show(2)
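
The cell above is a likely instance of that mismatch: `spark.range` produces integers, and in Python an integer raised to a power stays an integer, which does not match `DoubleType()`. A minimal sketch of the cause and one possible fix follows; `power3_as_float` is a hypothetical helper, not part of the original notebook.

```python
def power3(value):
    return value ** 3


def power3_as_float(value):
    # Hypothetical variant: cast first so the result is always a Python float,
    # the type that corresponds to Spark's DoubleType. Nulls pass through.
    if value is None:
        return None
    return float(value) ** 3


print(type(power3(2)).__name__)           # int
print(type(power3_as_float(2)).__name__)  # float
print(power3_as_float(2))                 # 8.0
```

Registering `power3_as_float` in place of `power3` with `DoubleType()` should then return doubles rather than nulls.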

*** End of the chapter ***