# Chapter 6. Working with Different Types of Data

- This chapter covers building expressions and working with different kinds of data, including the following:
    - Booleans.
    - Numbers.
    - Strings.
    - Dates and timestamps.
    - Handling null.
    - Complex types.
    - User-defined functions.


## Where to Look for APIs

In [35]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Working with Different Types of Data") \
    .config("spark.sql.shuffle.partitions", "5") \
    .getOrCreate()

In [36]:
df = spark.read.format("csv")\
  .option("header", "true")\
    .option("inferSchema", "true")\
      .load("../data/retail-data/by-day/2010-12-01.csv")

df.printSchema()
df.createOrReplaceTempView("dfTable")

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)



## Converting to Spark Types
- Use the *lit* function to convert native types to Spark types.
- It converts a value from another language (here, Python) into its corresponding Spark representation.

In [37]:
from pyspark.sql.functions import lit 

df.select(lit(5), lit("five"), lit(5.0))

DataFrame[5: int, five: string, 5.0: double]

## Working with Booleans
- Essential for filtering data.
- Foundation for logical statements.
- Components of Boolean Statements:
    - Logical Operators: and, or.
    - Values: true, false.
- Usage:
  - Build logical statements that evaluate to true or false.
  - Used as conditional requirements for filtering rows of data.
  - Rows that pass the test (evaluate to true) are kept; others are filtered out.


In [38]:
from pyspark.sql.functions import col

df.where(col("InvoiceNo") != 536365)\
  .select("InvoiceNo", "Description")\
    .show(5, False)

+---------+-----------------------------+
|InvoiceNo|Description                  |
+---------+-----------------------------+
|536366   |HAND WARMER UNION JACK       |
|536366   |HAND WARMER RED POLKA DOT    |
|536367   |ASSORTED COLOUR BIRD ORNAMENT|
|536367   |POPPY'S PLAYHOUSE BEDROOM    |
|536367   |POPPY'S PLAYHOUSE KITCHEN    |
+---------+-----------------------------+
only showing top 5 rows



In [39]:
df.where("InvoiceNo = 536365").show(5, False)
df.where("InvoiceNo <> 536365").show(5, False)

+---------+---------+-----------------------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|Description                        |Quantity|InvoiceDate        |UnitPrice|CustomerID|Country       |
+---------+---------+-----------------------------------+--------+-------------------+---------+----------+--------------+
|536365   |85123A   |WHITE HANGING HEART T-LIGHT HOLDER |6       |2010-12-01 08:26:00|2.55     |17850.0   |United Kingdom|
|536365   |71053    |WHITE METAL LANTERN                |6       |2010-12-01 08:26:00|3.39     |17850.0   |United Kingdom|
|536365   |84406B   |CREAM CUPID HEARTS COAT HANGER     |8       |2010-12-01 08:26:00|2.75     |17850.0   |United Kingdom|
|536365   |84029G   |KNITTED UNION FLAG HOT WATER BOTTLE|6       |2010-12-01 08:26:00|3.39     |17850.0   |United Kingdom|
|536365   |84029E   |RED WOOLLY HOTTIE WHITE HEART.     |6       |2010-12-01 08:26:00|3.39     |17850.0   |United Kingdom|
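+---------+---------+-----------------------------------+--------+-------------------+---------+----------+--------------+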

- Boolean Expressions in Spark:
  - Combine Boolean expressions with *and* or *or* (in PySpark column expressions, the `&` and `|` operators).
  - Chain *and* filters sequentially, one `where` after another, for clarity.
- Filtering Logic in Spark:
  - Spark flattens serially chained filters into a single statement and applies them at the same time, so chaining costs nothing extra.
  - *or* conditions must be specified inside the same statement.

In [40]:
from pyspark.sql.functions import instr

priceFilter = col("UnitPrice") > 600
descripFilter = instr(df.Description, "POSTAGE") >= 1
df.where(df.StockCode.isin("DOT")).where(priceFilter | descripFilter).show()

+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|   Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
|   536544|      DOT|DOTCOM POSTAGE|       1|2010-12-01 14:32:00|   569.77|      NULL|United Kingdom|
|   536592|      DOT|DOTCOM POSTAGE|       1|2010-12-01 17:06:00|   607.49|      NULL|United Kingdom|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
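
The equivalence between chained filters and a single combined *and* filter can be illustrated without Spark. The following is a minimal plain-Python sketch, assuming rows are modelled as dictionaries; the helper names `is_dot` and `pricey_or_postage` are hypothetical and only mirror the filters used in this section.

```python
# Plain-Python model of filter chaining; rows are dicts, not a Spark DataFrame.
rows = [
    {"StockCode": "DOT", "UnitPrice": 569.77, "Description": "DOTCOM POSTAGE"},
    {"StockCode": "DOT", "UnitPrice": 607.49, "Description": "DOTCOM POSTAGE"},
    {"StockCode": "71053", "UnitPrice": 3.39, "Description": "WHITE METAL LANTERN"},
]

def is_dot(row):
    # Hypothetical helper: mirrors the StockCode filter.
    return row["StockCode"] == "DOT"

def pricey_or_postage(row):
    # Hypothetical helper: the "or" lives inside one predicate.
    return row["UnitPrice"] > 600 or "POSTAGE" in row["Description"]

# Two filters applied one after the other...
first_pass = [r for r in rows if is_dot(r)]
chained = [r for r in first_pass if pricey_or_postage(r)]

# ...keep exactly the rows that one filter ANDing both predicates keeps.
combined = [r for r in rows if is_dot(r) and pricey_or_postage(r)]

print(chained == combined)
```

This only models the result; it says nothing about how Spark plans or executes the filter.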



In [41]:
DOTCodeFilter = col("StockCode") == "DOT"
priceFilter = col("UnitPrice") > 600
descripFilter = instr(col("Description"), "POSTAGE") >= 1
df.withColumn("isExpensive", DOTCodeFilter & (priceFilter | descripFilter))\
  .where("isExpensive")\
    .select("isExpensive", "UnitPrice")\
      .show(5)

# -- in SQL
#   SELECT UnitPrice, (StockCode = 'DOT' AND
#     (UnitPrice > 600 OR instr(Description, "POSTAGE") >= 1)) as isExpensive
#   FROM dfTable
#   WHERE (StockCode = 'DOT' AND
#     (UnitPrice > 600 OR instr(Description, "POSTAGE") >= 1))

+-----------+---------+
|isExpensive|UnitPrice|
+-----------+---------+
|       true|   569.77|
|       true|   607.49|
+-----------+---------+



In [42]:
from pyspark.sql.functions import expr 

df.withColumn("isExpensive", expr("NOT UnitPrice < 250"))\
  .where("isExpensive")\
    .select("Description", "UnitPrice")\
      .show(5)

+--------------+---------+
|   Description|UnitPrice|
+--------------+---------+
|DOTCOM POSTAGE|   569.77|
|DOTCOM POSTAGE|   607.49|
+--------------+---------+



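**Note:** Take care with null data when building Boolean expressions. An ordinary equality comparison against a null yields null rather than true or false; use `eqNullSafe` to perform a null-safe equality test: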

In [43]:

df.where(col("Description").eqNullSafe("hello")).show()

+---------+---------+-----------+--------+-----------+---------+----------+-------+
|InvoiceNo|StockCode|Description|Quantity|InvoiceDate|UnitPrice|CustomerID|Country|
+---------+---------+-----------+--------+-----------+---------+----------+-------+
+---------+---------+-----------+--------+-----------+---------+----------+-------+
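
To make the difference concrete, here is a small plain-Python sketch of the two kinds of equality, with `None` standing in for a SQL null. The function names `sql_equals` and `null_safe_equals` are made up for illustration; they model the semantics of `=` and of `eqNullSafe` (the `<=>` operator in Spark SQL) and are not Spark code.

```python
def sql_equals(a, b):
    """Model of SQL `=`: the result is unknown (None) if either side is null."""
    if a is None or b is None:
        return None
    return a == b

def null_safe_equals(a, b):
    """Model of eqNullSafe: always True or False, never None."""
    if a is None or b is None:
        return a is None and b is None
    return a == b

descriptions = ["hello", None, "WHITE METAL LANTERN"]
plain = [sql_equals(d, "hello") for d in descriptions]
safe = [null_safe_equals(d, "hello") for d in descriptions]

print(plain)  # the null row yields None, so a filter would silently drop it
print(safe)   # every row yields a definite True or False
```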



## Working with Numbers
- After filtering, counting and performing computations are common tasks.
- Use numerical data types for computations.

In [44]:
from pyspark.sql.functions import expr, pow

fabricatedQuantity = pow(col("Quantity") * col("UnitPrice"), 2) + 5
df.select(expr("CustomerId"), fabricatedQuantity.alias("realQuantity")).show(2)

+----------+------------------+
|CustomerId|      realQuantity|
+----------+------------------+
|   17850.0|239.08999999999997|
|   17850.0|          418.7156|
+----------+------------------+
only showing top 2 rows



In [45]:
df.selectExpr(
  "CustomerId",
  "(POWER((Quantity * UnitPrice), 2.0) + 5) as realQuantity")\
    .show(2)

+----------+------------------+
|CustomerId|      realQuantity|
+----------+------------------+
|   17850.0|239.08999999999997|
|   17850.0|          418.7156|
+----------+------------------+
only showing top 2 rows



#### Rounding numbers

In [46]:
from pyspark.sql.functions import lit, round, bround

df.select(round(lit("2.5")), bround(lit("2.5"))).show(2)

+-------------+--------------+
|round(2.5, 0)|bround(2.5, 0)|
+-------------+--------------+
|          3.0|           2.0|
|          3.0|           2.0|
+-------------+--------------+
only showing top 2 rows
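
The two results differ because `round` rounds halves up, whereas `bround` rounds halves to the nearest even number. A minimal sketch of the two rounding modes using Python's `decimal` module (an illustration of the rounding rules, not of Spark's implementation; the helper names are hypothetical):

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

def half_up(value: str) -> Decimal:
    # Rounding rule of Spark's round: ties go away from zero.
    return Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_UP)

def half_even(value: str) -> Decimal:
    # Rounding rule of Spark's bround: ties go to the nearest even integer.
    return Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN)

for v in ["1.5", "2.5", "3.5"]:
    print(v, half_up(v), half_even(v))
```

The two modes only disagree on exact ties such as 2.5; for 1.5 and 3.5 both give the same answer.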



In [47]:
from pyspark.sql.functions import corr

df.stat.corr("Quantity", "UnitPrice")
df.select(corr("Quantity", "UnitPrice")).show()

+-------------------------+
|corr(Quantity, UnitPrice)|
+-------------------------+
|     -0.04112314436835551|
+-------------------------+
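
For reference, the Pearson correlation coefficient that `corr` reports can be computed by hand. The sketch below is plain Python with made-up sample values (they are not taken from the retail dataset), and `pearson` is a hypothetical helper name.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up values for illustration only.
quantity = [6, 6, 8, 2, 32]
unit_price = [2.55, 3.39, 2.75, 7.65, 1.69]
r = pearson(quantity, unit_price)
print(r)
```

A value near 0, like the one Spark reports above, indicates almost no linear relationship between the two columns.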



#### Compute summary statistics for a column or set of columns

In [48]:
df.describe().show()

+-------+-----------------+------------------+--------------------+------------------+------------------+------------------+--------------+
|summary|        InvoiceNo|         StockCode|         Description|          Quantity|         UnitPrice|        CustomerID|       Country|
+-------+-----------------+------------------+--------------------+------------------+------------------+------------------+--------------+
|  count|             3108|              3108|                3098|              3108|              3108|              1968|          3108|
|   mean| 536516.684944841|27834.304044117645|                NULL| 8.627413127413128| 4.151946589446603|15661.388719512195|          NULL|
| stddev|72.89447869788873|17407.897548583845|                NULL|26.371821677029203|15.638659854603892|1854.4496996893627|          NULL|
|    min|           536365|             10002| 4 PURPLE FLOCK D...|               -24|               0.0|           12431.0|     Australia|
|    max|          C

In [49]:
# Note: importing min and max shadows Python's built-ins of the same name.
from pyspark.sql.functions import count, mean, stddev_pop, min, max

#### Statistical functions

In [50]:
# Use colName rather than col so the imported col() function is not overwritten.
colName = "UnitPrice"
quantileProbs = [0.5]
relError = 0.05
df.stat.approxQuantile(colName, quantileProbs, relError)

[2.51]

#### See a cross-tabulation or frequent item pairs

In [51]:
df.stat.crosstab("StockCode", "Quantity").show()

+------------------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|StockCode_Quantity| -1|-10|-12| -2|-24| -3| -4| -5| -6| -7|  1| 10|100| 11| 12|120|128| 13| 14|144| 15| 16| 17| 18| 19|192|  2| 20|200| 21|216| 22| 23| 24| 25|252| 27| 28|288|  3| 30| 32| 33| 34| 36|384|  4| 40|432| 47| 48|480|  5| 50| 56|  6| 60|600| 64|  7| 70| 72|  8| 80|  9| 96|
+------------------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|            84029E|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  3|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0| 

In [52]:
df.stat.freqItems(["StockCode", "Quantity"]).show()

+--------------------+--------------------+
| StockCode_freqItems|  Quantity_freqItems|
+--------------------+--------------------+
|[22086, 21705, 72...|[200, 128, 23, 50...|
+--------------------+--------------------+



#### Add a unique ID to each row by using *monotonically_increasing_id* function

In [53]:
from pyspark.sql.functions import monotonically_increasing_id

df.select(monotonically_increasing_id()).show(5)

+-----------------------------+
|monotonically_increasing_id()|
+-----------------------------+
|                            0|
|                            1|
|                            2|
|                            3|
|                            4|
+-----------------------------+
only showing top 5 rows
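
The generated IDs are unique and increasing, but not necessarily consecutive. Spark's documentation describes the current layout as the partition index in the upper 31 bits and the record number within the partition in the lower 33 bits. The plain-Python sketch below models that layout; `model_id` is a hypothetical helper, and the two-partition split is an assumption for illustration.

```python
def model_id(partition_index: int, row_in_partition: int) -> int:
    # Partition index in the upper bits, row position in the lower 33 bits.
    return (partition_index << 33) + row_in_partition

# Assume two partitions holding three rows each.
ids = [model_id(p, r) for p in range(2) for r in range(3)]
print(ids)
```

With a single partition, as in the output above, the IDs simply count up from 0; the jump appears only at partition boundaries.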



## Working with Strings

#### The *initcap* function capitalizes the first letter of every word in a given string

In [69]:
from pyspark.sql.functions import initcap, column

df.select("InvoiceNo", initcap(column("Description")).alias("UpperDes")).show(5)

+---------+--------------------+
|InvoiceNo|            UpperDes|
+---------+--------------------+
|   536365|White Hanging Hea...|
|   536365| White Metal Lantern|
|   536365|Cream Cupid Heart...|
|   536365|Knitted Union Fla...|
|   536365|Red Woolly Hottie...|
+---------+--------------------+
only showing top 5 rows



#### Uppercase, Lowercase

In [71]:
from pyspark.sql.functions import lower, upper

df.select(column("Description"), 
  lower(column("Description")),
  upper(lower(column("Description")))).show(2)

+--------------------+--------------------+-------------------------+
|         Description|  lower(Description)|upper(lower(Description))|
+--------------------+--------------------+-------------------------+
|WHITE HANGING HEA...|white hanging hea...|     WHITE HANGING HEA...|
| WHITE METAL LANTERN| white metal lantern|      WHITE METAL LANTERN|
+--------------------+--------------------+-------------------------+
only showing top 2 rows



#### Adding or removing spaces around a string

In [73]:
from pyspark.sql.functions import lit, ltrim, rtrim, rpad, lpad, trim

df.select(
    ltrim(lit("    HELLO    ")).alias("ltrim"),
    rtrim(lit("    HELLO    ")).alias("rtrim"),
    trim(lit("    HELLO    ")).alias("trim"),
    lpad(lit("HELLO  "), 3, " ").alias("lp"),
    rpad(lit("HELLO  "), 10, " ").alias("rp")).show(2)

+---------+---------+-----+---+----------+
|    ltrim|    rtrim| trim| lp|        rp|
+---------+---------+-----+---+----------+
|HELLO    |    HELLO|HELLO|HEL|HELLO     |
|HELLO    |    HELLO|HELLO|HEL|HELLO     |
+---------+---------+-----+---+----------+
only showing top 2 rows
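
Note that `lpad` and `rpad` shorten the string when the requested length is smaller than the input, which is why the `lp` column shows `HEL`. The following plain-Python sketch models the behavior seen in the output above, assuming a non-empty pad string; the functions are illustrative stand-ins, not Spark's implementation.

```python
def lpad(s: str, length: int, pad: str) -> str:
    """Pad on the left up to `length`, or truncate if `s` is already longer."""
    if len(s) >= length:
        return s[:length]
    fill = (pad * length)[: length - len(s)]
    return fill + s

def rpad(s: str, length: int, pad: str) -> str:
    """Pad on the right up to `length`, or truncate if `s` is already longer."""
    if len(s) >= length:
        return s[:length]
    fill = (pad * length)[: length - len(s)]
    return s + fill

print(repr(lpad("HELLO  ", 3, " ")))
print(repr(rpad("HELLO  ", 10, " ")))
```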



### Regular Expressions
- Regular expressions are used to find or replace strings based on a set of rules.
- Spark uses Java's regular expression syntax.
- Key Spark functions for regular expressions:
  - *regexp_extract*: Extracts values from strings.
  - *regexp_replace*: Replaces values in strings.

In [75]:
from pyspark.sql.functions import regexp_replace

regex_string = "BLACK|WHITE|RED|GREEN|BLUE"
df.select(
  regexp_replace(column("Description"), regex_string, "COLOR").alias("color_clean"),
  column("Description")
).show(2)

+--------------------+--------------------+
|         color_clean|         Description|
+--------------------+--------------------+
|COLOR HANGING HEA...|WHITE HANGING HEA...|
| COLOR METAL LANTERN| WHITE METAL LANTERN|
+--------------------+--------------------+
only showing top 2 rows
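
The same replace and extract logic can be tried out with Python's `re` module. This is only a sketch: Spark uses Java's regular expression engine, which differs from Python's in some details, although a simple alternation like this one behaves the same in both. The helper `extract_first` is hypothetical and mimics extracting group 1, returning an empty string when nothing matches.

```python
import re

regex_string = "BLACK|WHITE|RED|GREEN|BLUE"
description = "RED WOOLLY HOTTIE WHITE HEART."

# Replacing: every match of the pattern is substituted.
color_clean = re.sub(regex_string, "COLOR", description)

def extract_first(pattern: str, text: str, group: int) -> str:
    # Extracting: the requested group of the first match, or "" if none.
    match = re.search(pattern, text)
    return match.group(group) if match else ""

extract_str = "(BLACK|WHITE|RED|GREEN|BLUE)"
first_color = extract_first(extract_str, description, 1)
no_color = extract_first(extract_str, "CREAM CUPID HEARTS COAT HANGER", 1)

print(color_clean)
print(repr(first_color), repr(no_color))
```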



#### Replace given characters with other characters

In [78]:
from pyspark.sql.functions import translate

df.select(translate(column("Description"), "LEET", "1337"), column("Description")).show(2)

+----------------------------------+--------------------+
|translate(Description, LEET, 1337)|         Description|
+----------------------------------+--------------------+
|              WHI73 HANGING H3A...|WHITE HANGING HEA...|
|               WHI73 M37A1 1AN73RN| WHITE METAL LANTERN|
+----------------------------------+--------------------+
only showing top 2 rows
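
`translate` works character by character: each character in the matching string is replaced by the character at the same position in the replacement string. Python's `str.translate` follows the same idea, so it serves as a minimal sketch of what happens to one description:

```python
# Map L -> 1, E -> 3, T -> 7, position by position.
table = str.maketrans("LEET", "1337")
leet = "WHITE METAL LANTERN".translate(table)
print(leet)
```

Unlike a regular expression replacement, no pattern matching is involved; characters that are not listed are left untouched.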



In [85]:
from pyspark.sql.functions import regexp_extract

extract_str = "(BLACK|WHITE|RED|GREEN|BLUE)"
df.select(
  regexp_extract(column("Description"), extract_str, 1).alias("color_clean"),
  column("Description")
).show(5)

+-----------+--------------------+
|color_clean|         Description|
+-----------+--------------------+
|      WHITE|WHITE HANGING HEA...|
|      WHITE| WHITE METAL LANTERN|
|           |CREAM CUPID HEART...|
|           |KNITTED UNION FLA...|
|        RED|RED WOOLLY HOTTIE...|
+-----------+--------------------+
only showing top 5 rows



#### Check existence

In [88]:
containsBlack = instr(column("Description"), "BLACK") >= 1
containsWhite = instr(column("Description"), "WHITE") >= 1

df.withColumn("hasSimpleColor", containsBlack | containsWhite)\
  .where("hasSimpleColor")\
    .select("Description").show(3, False)

+----------------------------------+
|Description                       |
+----------------------------------+
|WHITE HANGING HEART T-LIGHT HOLDER|
|WHITE METAL LANTERN               |
|RED WOOLLY HOTTIE WHITE HEART.    |
+----------------------------------+
only showing top 3 rows
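
The `>= 1` test works because `instr` reports a 1-based position and returns 0 when the substring is absent. A plain-Python sketch of that convention follows; the function here is a stand-in written for illustration, with `None` modelling a null description.

```python
def instr(text, substring):
    """1-based position of the first occurrence, 0 if absent, None for null input."""
    if text is None:
        return None
    return text.find(substring) + 1

descriptions = [
    "WHITE METAL LANTERN",
    "CREAM CUPID HEARTS COAT HANGER",
    "RED WOOLLY HOTTIE WHITE HEART.",
    None,
]
has_simple_color = [
    None if d is None else (instr(d, "BLACK") >= 1 or instr(d, "WHITE") >= 1)
    for d in descriptions
]
print(has_simple_color)
```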



In [100]:
from pyspark.sql.functions import expr, locate

simpleColors = ["black", "white", "red", "green", "blue"]
def color_locator(column, color_string):
  return locate(color_string.upper(), column)\
    .cast("boolean")\
      .alias("is_" + color_string)
selectedColumns = [color_locator(df.Description, c) for c in simpleColors]
print(selectedColumns)
selectedColumns.append(expr("*"))
print(*selectedColumns)
df.select(selectedColumns).where(expr("is_white OR is_red"))\
  .select("Description").show(3, False)

[Column<'CAST(locate(BLACK, Description, 1) AS BOOLEAN) AS is_black'>, Column<'CAST(locate(WHITE, Description, 1) AS BOOLEAN) AS is_white'>, Column<'CAST(locate(RED, Description, 1) AS BOOLEAN) AS is_red'>, Column<'CAST(locate(GREEN, Description, 1) AS BOOLEAN) AS is_green'>, Column<'CAST(locate(BLUE, Description, 1) AS BOOLEAN) AS is_blue'>]
Column<'CAST(locate(BLACK, Description, 1) AS BOOLEAN) AS is_black'> Column<'CAST(locate(WHITE, Description, 1) AS BOOLEAN) AS is_white'> Column<'CAST(locate(RED, Description, 1) AS BOOLEAN) AS is_red'> Column<'CAST(locate(GREEN, Description, 1) AS BOOLEAN) AS is_green'> Column<'CAST(locate(BLUE, Description, 1) AS BOOLEAN) AS is_blue'> Column<'unresolvedstar()'>
[Column<'CAST(locate(BLACK, Description, 1) AS BOOLEAN) AS is_black'>, Column<'CAST(locate(WHITE, Description, 1) AS BOOLEAN) AS is_white'>, Column<'CAST(locate(RED, Description, 1) AS BOOLEAN) AS is_red'>, Column<'CAST(locate(GREEN, Description, 1) AS BOOLEAN) AS is_green'>, Column<'CAST

## Working with Dates and Timestamps
- Challenge of Dates and Times:
  - Dates and times are challenging in programming and databases.
  - Important to manage timezones and ensure correct formats.
  - Spark simplifies this by focusing on two types: dates (calendar dates) and timestamps (dates with time information).
- Schema Inference:
  - Spark can identify column types, including dates and timestamps, using inferSchema.
  - The dataset example shows Spark correctly identified the InvoiceDate column as a timestamp.
- Handling Strings:
  - Often, timestamps or dates are stored as strings and converted at runtime.
  - This is common with text and CSV files but less common with databases and structured data.
- Timezone Handling:
  - In version 2.1 and earlier, Spark parsed timestamps according to the machine's timezone when the value being parsed did not specify one.
  - Users can set a session-local timezone with the `spark.sql.session.timeZone` SQL configuration.

In [101]:
df.printSchema()

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)



- Timestamp Precision:
  - Spark’s TimestampType class supports second-level precision.
  - For milliseconds or microseconds, handle them as longs.
- Java Date Standards:
  - Spark works with Java dates and timestamps, adhering to Java standards.

In [102]:
from pyspark.sql.functions import current_date, current_timestamp

dateDF = spark.range(10)\
  .withColumn("today", current_date())\
  .withColumn("now", current_timestamp())
dateDF.createOrReplaceTempView("dateTable")
dateDF.printSchema()

root
 |-- id: long (nullable = false)
 |-- today: date (nullable = false)
 |-- now: timestamp (nullable = false)



In [105]:
from pyspark.sql.functions import date_add, date_sub

dateDF.select(
  date_sub(column("today"), 5),
  date_add(column("today"), 5)
).show(1)

# -- in SQL
#   SELECT date_sub(today, 5), date_add(today, 5) FROM dateTable

+------------------+------------------+
|date_sub(today, 5)|date_add(today, 5)|
+------------------+------------------+
|        2024-08-02|        2024-08-12|
+------------------+------------------+
only showing top 1 row



#### Compare two dates by using *datediff* and *months_between*

In [108]:
from pyspark.sql.functions import datediff, months_between, to_date

dateDF.withColumn("week_ago", date_sub(column("today"), 7))\
  .select(datediff(column("week_ago"), column("today"))).show(1)

dateDF.select(
  to_date(lit("2016-01-01")).alias("start"),
  to_date(lit("2017-05-22")).alias("end")
).select(months_between(column("start"), column("end"))).show(1)

+-------------------------+
|datediff(week_ago, today)|
+-------------------------+
|                       -7|
+-------------------------+
only showing top 1 row

+--------------------------------+
|months_between(start, end, true)|
+--------------------------------+
|                    -16.67741935|
+--------------------------------+
only showing top 1 row
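
The fractional result of `months_between` follows from the rule in Spark's documentation: if both dates fall on the same day of the month, or both are the last day of their month, the result is a whole number; otherwise the day difference is counted on a 31-day month and the result is rounded to 8 digits. Below is a plain-Python sketch of that rule for dates without a time component; it is an illustrative model, not Spark's code.

```python
from calendar import monthrange
from datetime import date

def months_between(d1: date, d2: date) -> float:
    """Model of months_between for plain dates (no time of day)."""
    months = (d1.year - d2.year) * 12 + (d1.month - d2.month)
    last1 = d1.day == monthrange(d1.year, d1.month)[1]
    last2 = d2.day == monthrange(d2.year, d2.month)[1]
    if d1.day == d2.day or (last1 and last2):
        return float(months)
    # Otherwise the leftover days are measured against a 31-day month.
    return round(months + (d1.day - d2.day) / 31.0, 8)

result = months_between(date(2016, 1, 1), date(2017, 5, 22))

# datediff(end, start) corresponds to a plain day difference.
week_diff = (date(2024, 7, 31) - date(2024, 8, 7)).days

print(result, week_diff)
```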



**The to_date function allows you to convert a string to a date, optionally with a specified format.**

In [109]:
from pyspark.sql.functions import to_date, lit

spark.range(5).withColumn("date", lit("2017-01-01"))\
  .select(to_date(column("date"))).show(1)

+-------------+
|to_date(date)|
+-------------+
|   2017-01-01|
+-------------+
only showing top 1 row



- Spark does not throw an error if it cannot parse a date; it returns null.
- This behavior can be tricky in larger pipelines due to format mismatches.
- Example: Switching date format from year-month-day to year-day-month.
- Spark fails to parse the incorrect format and returns null silently.

In [111]:
dateDF.select(to_date(lit("2016-20-12")), to_date(lit("2017-12-11"))).show(1)

+-------------------+-------------------+
|to_date(2016-20-12)|to_date(2017-12-11)|
+-------------------+-------------------+
|               NULL|         2017-12-11|
+-------------------+-------------------+
only showing top 1 row
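
The failure mode can be reproduced in plain Python. The sketch below models a parser that returns `None` instead of raising, as Spark does here; the `to_date` function is a stand-in written for this illustration and uses Python's `strptime` format codes rather than Spark's pattern letters.

```python
from datetime import date, datetime
from typing import Optional

def to_date(value: str, fmt: str = "%Y-%m-%d") -> Optional[date]:
    """Return the parsed date, or None when the string does not fit the format."""
    try:
        return datetime.strptime(value, fmt).date()
    except ValueError:
        return None

bad = to_date("2016-20-12")                    # there is no month 20
silent = to_date("2017-12-11")                 # parses, but as December 11
intended = to_date("2017-12-11", "%Y-%d-%m")   # year-day-month: November 12

print(bad, silent, intended)
```

The second value is the dangerous one: it parses without complaint, so only an explicit format reveals that the day and month were swapped.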



- Parsing dates can lead to tricky bugs because some dates might match the expected format while others do not.
- Example: with dates written as year-day-month, "2016-20-12" fails to parse and becomes null, while "2017-12-11" parses without error but as December 11 rather than the intended November 12.
- Spark does not throw an error in either case, making such issues hard to identify.
- To avoid these issues, specify the date format explicitly using the Java SimpleDateFormat standard.
- This ensures consistency and helps prevent incorrect date parsing.

> We will use two functions to fix this: to_date and to_timestamp. Both accept a format; in PySpark it is an optional argument for each, and passing it explicitly avoids the silent misparsing described above.

In [115]:
from pyspark.sql.functions import to_date

dateFormat = "yyyy-dd-MM"
cleanDateDF = spark.range(1).select(
  to_date(lit("2017-12-11"), dateFormat).alias("date"),
  to_date(lit("2017-20-12"), dateFormat).alias("date2")
)
cleanDateDF.createOrReplaceTempView("dateTable2")
cleanDateDF.show(1)

# -- in SQL
#   SELECT to_date(date, 'yyyy-dd-MM'), to_date(date2, 'yyyy-dd-MM'), to_date(date)
#   FROM dateTable2


+----------+----------+
|      date|     date2|
+----------+----------+
|2017-11-12|2017-12-20|
+----------+----------+



**to_timestamp**

In [119]:
from pyspark.sql.functions import to_timestamp

cleanDateDF.select(to_timestamp(column("date"), dateFormat)).show()

# -- in SQL
#   SELECT to_timestamp(date, 'yyyy-dd-MM'), to_timestamp(date2, 'yyyy-dd-MM')
#   FROM dateTable2

+------------------------------+
|to_timestamp(date, yyyy-dd-MM)|
+------------------------------+
|           2017-11-12 00:00:00|
+------------------------------+



Casting between dates and timestamps is simple in all languages. In SQL, we would do it in the following way:

In [120]:
# -- in SQL
#   SELECT cast(to_date("2017-01-01", "yyyy-dd-MM") as timestamp)

- Once the date or timestamp is in the correct format, comparing them is straightforward.
- Ensure either a date/timestamp type is used or specify the string format as yyyy-MM-dd for date comparison.

In [124]:
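cleanDateDF.filter(column("date2") > lit("2017-12-12")).show()
# The single quotes below are part of the string itself, so it does not
# parse as a date and no rows are returned.
cleanDateDF.filter(column("date2") > "'2017-12-12'").show()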

+----------+----------+
|      date|     date2|
+----------+----------+
|2017-11-12|2017-12-20|
+----------+----------+

+----+-----+
|date|date2|
+----+-----+
+----+-----+



### Working with Nulls in Data
- Best Practice for Nulls: Always use nulls to represent missing or empty data in DataFrames. Spark optimizes handling null values better than empty strings or other values.
- Interacting with Nulls: Use the .na subpackage on a DataFrame to handle null values.
- Operations on Nulls: Several functions are available for performing operations on null values and specifying how Spark should handle them.
- Explicit Handling: Being explicit about handling null values is better than being implicit.
- Nullable Signal: The nullable flag in a schema helps Spark SQL optimize handling of the column but does not enforce non-null values.
- Schema Declaration:
  - Declaring a column as non-nullable in the schema is not enforced by Spark.
  - If the data contains nulls anyway, Spark will accept them, which can lead to incorrect results or exceptions.
- Two Actions for Nulls:
  1. Explicitly drop null values.
  2. Fill null values with a specified value (globally or on a per-column basis).

### Coalesce
- Use the *coalesce* function to select the first non-null value from a set of columns.
- If there are no null values, it returns the first column.

In [126]:
from pyspark.sql.functions import coalesce

df.select(coalesce(column("Description"), column("CustomerId"))).show()

+---------------------------------+
|coalesce(Description, CustomerId)|
+---------------------------------+
|             WHITE HANGING HEA...|
|              WHITE METAL LANTERN|
|             CREAM CUPID HEART...|
|             KNITTED UNION FLA...|
|             RED WOOLLY HOTTIE...|
|             SET 7 BABUSHKA NE...|
|             GLASS STAR FROSTE...|
|             HAND WARMER UNION...|
|             HAND WARMER RED P...|
|             ASSORTED COLOUR B...|
|             POPPY'S PLAYHOUSE...|
|             POPPY'S PLAYHOUSE...|
|             FELTCRAFT PRINCES...|
|             IVORY KNITTED MUG...|
|             BOX OF 6 ASSORTED...|
|             BOX OF VINTAGE JI...|
|             BOX OF VINTAGE AL...|
|             HOME BUILDING BLO...|
|             LOVE BUILDING BLO...|
|             RECIPE BOX WITH M...|
+---------------------------------+
only showing top 20 rows
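
The row-level logic of coalesce is easy to state in plain Python. The sketch below uses `None` for null and is a model of the semantics only; it is not the Spark function.

```python
def coalesce(*values):
    """Return the first value that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None

print(coalesce("WHITE METAL LANTERN", 17850.0))  # first value is present
print(coalesce(None, 17850.0))                   # falls through to the second
print(coalesce(None, None))                      # nothing to pick
```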



### ifnull, nullif, nvl, and nvl2
- SQL Functions for Null Handling:
  - ifnull: Returns the second value if the first is null; otherwise, returns the first value.
  - nullif: Returns null if the two values are equal; otherwise, returns the first value.
  - nvl: Returns the second value if the first is null; otherwise, returns the first value.
  - nvl2: Returns the second value if the first is not null; otherwise, returns the last specified value.

In [None]:
# -- in SQL
#   SELECT
#     ifnull(null, 'return_value'),
#     nullif('value', 'value'),
#     nvl(null, 'return_value'),
#     nvl2('not_null', 'return_value', 'else_value')
#   FROM dfTable LIMIT 1

# +------------+----+------------+------------+ 
# |           a|   b|           c|           d| 
# +------------+----+------------+------------+ 
# |return_value|null|return_value|return_value| 
# +------------+----+------------+------------+
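
The four functions can be modelled in a few lines of plain Python, with `None` standing in for null. This is a sketch of the semantics for single values, not Spark code; note that `nullif` returns its first argument when the two values differ.

```python
def ifnull(a, b):
    # Second value only when the first is null.
    return b if a is None else a

def nullif(a, b):
    # Null when both are equal, otherwise the first value.
    return None if a == b else a

def nvl(a, b):
    # Same behavior as ifnull.
    return ifnull(a, b)

def nvl2(a, b, c):
    # Second value when the first is NOT null, otherwise the third.
    return b if a is not None else c

print(ifnull(None, "return_value"))
print(nullif("value", "value"))
print(nvl(None, "return_value"))
print(nvl2("not_null", "return_value", "else_value"))
```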

### drop
- The simplest function is drop, which removes rows that contain nulls.


In [128]:
# The default is to drop any row in which any value is null:

df.na.drop()
df.na.drop("any")

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

- Specifying "any" as an argument drops a row if any of the values are null.
- Using "all" drops the row only if all values are null or NaN for that row.

In [129]:
df.na.drop("all")

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

We can also apply this to certain sets of columns by passing in an array of columns:

In [130]:
# in Python
df.na.drop("all", subset=["StockCode", "InvoiceNo"])

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]
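
The difference between "any", "all", and a column subset is easiest to see on a tiny example. The following plain-Python sketch models rows as dictionaries with made-up values; `drop_nulls` is a hypothetical helper, and for brevity it checks only for `None`, not for NaN.

```python
def drop_nulls(rows, how="any", subset=None):
    """Drop rows with nulls in the given columns (all columns by default)."""
    kept = []
    for row in rows:
        cols = subset if subset is not None else list(row)
        nulls = [row[c] is None for c in cols]
        remove = any(nulls) if how == "any" else all(nulls)
        if not remove:
            kept.append(row)
    return kept

# Made-up rows for illustration.
rows = [
    {"InvoiceNo": "A1", "StockCode": "S1", "Description": "WHITE METAL LANTERN"},
    {"InvoiceNo": "A2", "StockCode": "S2", "Description": None},
    {"InvoiceNo": None, "StockCode": None, "Description": None},
]

print(len(drop_nulls(rows, "any")))
print(len(drop_nulls(rows, "all")))
print(len(drop_nulls(rows, "all", subset=["StockCode", "InvoiceNo"])))
```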

### fill
- The fill function allows you to replace all null values in one or more columns with a specified value.

In [131]:
df.na.fill("All Null values become this string")

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

In [132]:
# in Python
df.na.fill("all", subset=["StockCode", "InvoiceNo"])

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

In [134]:
# in Python
fill_cols_vals = {"StockCode": 5, "Description" : "No Value"}
df.na.fill(fill_cols_vals)

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]
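
Filling with a dictionary applies a different replacement per column and leaves every other column alone. A plain-Python sketch of that idea, with rows as dictionaries and made-up values; `fill_nulls` is a hypothetical helper and deliberately ignores how Spark reconciles the replacement's type with the column's type.

```python
def fill_nulls(rows, values):
    """Replace None with the per-column value from `values`, where one is given."""
    return [
        {
            column: (values[column] if value is None and column in values else value)
            for column, value in row.items()
        }
        for row in rows
    ]

# Made-up rows for illustration.
rows = [
    {"StockCode": "S1", "Description": None, "CustomerID": None},
    {"StockCode": None, "Description": "WHITE METAL LANTERN", "CustomerID": 17850.0},
]
filled = fill_nulls(rows, {"StockCode": "5", "Description": "No Value"})
print(filled)
```

CustomerID has no entry in the dictionary, so its null stays in place.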

### replace
- Besides *drop* and *fill*, a more flexible option is *replace*, which substitutes values in a column according to their current value.

In [136]:
# in Python
df.na.replace([""], ["UNKNOWN"], "Description")

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

## Ordering
Use asc_nulls_first, desc_nulls_first, asc_nulls_last, or desc_nulls_last to specify where null values should appear in an ordered DataFrame.
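
What the four variants mean can be sketched in plain Python by sorting the non-null values and then placing the nulls before or after them. The `order` helper and its parameters are hypothetical, and `None` stands in for null.

```python
def order(values, descending=False, nulls_first=True):
    """Sort the non-null values, then put the nulls at the chosen end."""
    present = sorted((v for v in values if v is not None), reverse=descending)
    nulls = [v for v in values if v is None]
    return nulls + present if nulls_first else present + nulls

customer_ids = [17850.0, None, 12431.0]
print(order(customer_ids))                                      # asc, nulls first
print(order(customer_ids, descending=True, nulls_first=False))  # desc, nulls last
```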

## Working with Complex Types
- Complex types in Spark are useful for organizing and structuring data.
- There are 3 kinds of complex types:
  - Structs.
  - Arrays.
  - Maps.

### Structs
- Structs are like DataFrames within DataFrames, allowing for nested data structures.
- Use selectExpr with the `struct` function to wrap a set of columns into a struct:

In [137]:
df.selectExpr("struct(Description, InvoiceNo) as complex", "*")

DataFrame[complex: struct<Description:string,InvoiceNo:string>, InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

In [139]:
from pyspark.sql.functions import struct

complexDF = df.select(struct("Description", "InvoiceNo").alias("complex"))
complexDF.createOrReplaceTempView("complexDF")

**Dot Syntax: Access fields within the struct using dot notation:**

In [140]:
complexDF.select("complex.Description")

DataFrame[Description: string]

Use the getField method to access a field:

In [142]:
complexDF.select(column("complex").getField("Description"))

DataFrame[complex.Description: string]

Use * to bring all columns to the top-level DataFrame

In [143]:
complexDF.select("complex.*")
#   -- in SQL
# SELECT complex.* FROM complexDF

DataFrame[Description: string, InvoiceNo: string]

### Arrays

### split
- Use the *split* function and specify the delimiter to turn a string into an array.

In [145]:
# in Python
from pyspark.sql.functions import split
df.select(split(column("Description"), " ")).show(2)
# -- in SQL
#   SELECT split(Description, ' ') FROM dfTable

+-------------------------+
|split(Description,  , -1)|
+-------------------------+
|     [WHITE, HANGING, ...|
|     [WHITE, METAL, LA...|
+-------------------------+
only showing top 2 rows



This is quite powerful because Spark allows us to manipulate this complex type as another column.

In [147]:
# in Python
df.select(split(column("Description"), " ").alias("array_col"))\
    .selectExpr("array_col[0]").show(2)
# -- in SQL
# SELECT split(Description, ' ')[0] FROM dfTable

+------------+
|array_col[0]|
+------------+
|       WHITE|
|       WHITE|
+------------+
only showing top 2 rows



### Array Length

In [149]:
# in Python
from pyspark.sql.functions import size
df.select(size(split(column("Description"), " "))).show(2) # shows 5 and 3

+-------------------------------+
|size(split(Description,  , -1))|
+-------------------------------+
|                              5|
|                              3|
+-------------------------------+
only showing top 2 rows



### array_contains

In [151]:
# in Python
from pyspark.sql.functions import array_contains
df.select(array_contains(split(column("Description"), " "), "WHITE")).show(2)
# -- in SQL
# SELECT array_contains(split(Description, ' '), 'WHITE') FROM dfTable

+------------------------------------------------+
|array_contains(split(Description,  , -1), WHITE)|
+------------------------------------------------+
|                                            true|
|                                            true|
+------------------------------------------------+
only showing top 2 rows



> array_contains only tests whether a value is present; it does not change the shape of the data. To convert a complex type into a set of rows (one per value in the array), use the explode function.

### explode
- The explode function takes a column that contains arrays and creates one row per array element, duplicating the rest of the values.
- Process:
    1. Split: First, split a column of text into an array of words.
    2. Explode: Then, explode the array so that each word becomes a separate row.

In [153]:
from pyspark.sql.functions import split, explode

df.withColumn("splitted", split(column("Description"), " "))\
    .withColumn("exploded", explode(column("splitted")))\
    .select("Description", "InvoiceNo", "exploded").show(2)
# -- in SQL
#   SELECT Description, InvoiceNo, exploded
#   FROM (SELECT *, split(Description, " ") as splitted FROM dfTable)
#   LATERAL VIEW explode(splitted) as exploded

+--------------------+---------+--------+
|         Description|InvoiceNo|exploded|
+--------------------+---------+--------+
|WHITE HANGING HEA...|   536365|   WHITE|
|WHITE HANGING HEA...|   536365| HANGING|
+--------------------+---------+--------+
only showing top 2 rows
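
The split-then-explode process can be modeled in plain Python to make the row duplication explicit. This is only a sketch of the semantics, not Spark code: the rows, the second invoice number, and the helper name split_and_explode are hypothetical. It also models one behavior that is easy to miss: explode produces no output rows for a null or empty array (explode_outer is the variant that keeps such rows).

```python
# Hypothetical rows standing in for the DataFrame; None models a null Description.
rows = [
    {"InvoiceNo": "536365", "Description": "WHITE HANGING HEART"},
    {"InvoiceNo": "536366", "Description": None},
]

def split_and_explode(rows, column, delimiter=" "):
    """Return one output row per word in `column`, copying the other fields."""
    exploded = []
    for row in rows:
        value = row[column]
        # Model of explode: a null value contributes no output rows.
        words = value.split(delimiter) if value is not None else []
        for word in words:
            exploded.append({**row, "exploded": word})
    return exploded

result = split_and_explode(rows, "Description")
for row in result:
    print(row["InvoiceNo"], row["exploded"])
```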



## Maps
- Maps are created with the *create_map* function (map in SQL) from key-value pairs of columns. They are similar to dictionaries in Python: each row holds its own set of key-value pairs.

In [155]:
# in Python
from pyspark.sql.functions import create_map
df.select(create_map(column("Description"), column("InvoiceNo")).alias("complex_map"))\
.show(2)
# -- in SQL
#   SELECT map(Description, InvoiceNo) as complex_map FROM dfTable
#   WHERE Description IS NOT NULL

+--------------------+
|         complex_map|
+--------------------+
|{WHITE HANGING HE...|
|{WHITE METAL LANT...|
+--------------------+
only showing top 2 rows



- To query maps, use the appropriate key. If the key is missing, it returns null.

In [158]:
from pyspark.sql.functions import create_map, col

df.select(create_map(col("Description"), col("InvoiceNo")).alias("complex_map"))\
    .selectExpr("complex_map['WHITE METAL LANTERN']").show(2)

+--------------------------------+
|complex_map[WHITE METAL LANTERN]|
+--------------------------------+
|                            NULL|
|                          536365|
+--------------------------------+
only showing top 2 rows
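
The NULL in the first row follows from the fact that every row carries its own map, so a key exists only in the rows whose Description matches it. A plain-Python sketch of that lookup, using dictionaries as a stand-in for the map column (the first key is a hypothetical, shortened description):

```python
# One dictionary per row, mirroring create_map(Description, InvoiceNo).
row_maps = [
    {"WHITE HANGING HEART": "536365"},   # hypothetical key for the first row
    {"WHITE METAL LANTERN": "536365"},
]

# dict.get returns None for a missing key, much like the null Spark returns.
lookups = [row_map.get("WHITE METAL LANTERN") for row_map in row_maps]
print(lookups)
```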



Exploding a map turns each entry into its own row, with the key and the value in separate columns.

In [161]:
from pyspark.sql.functions import create_map, col, explode

df.select(create_map(col("Description"), col("InvoiceNo")).alias("complex_map"))\
    .select(explode(col("complex_map"))).show(2)

+--------------------+------+
|                 key| value|
+--------------------+------+
|WHITE HANGING HEA...|536365|
| WHITE METAL LANTERN|536365|
+--------------------+------+
only showing top 2 rows



## Working with JSON
- The example creates a DataFrame with a single column that holds a JSON string.

In [162]:
jsonDF = spark.range(1)\
  .selectExpr("""
    '{
      "myJSONKey" : {
        "myJSONValue" : [1, 2, 3]
      }
    }' as jsonString
  """)

- Use *get_json_object* to query a JSON string inline with a path expression, whether the target is an object or an array.
- Use *json_tuple* if the object has only one level of nesting.


In [165]:
from pyspark.sql.functions import get_json_object, json_tuple

jsonDF.select(
  get_json_object(col("jsonString"), "$.myJSONKey.myJSONValue[1]").alias("column"),
  json_tuple(col("jsonString"), "myJSONKey")
).show(2)

+------+--------------------+
|column|                  c0|
+------+--------------------+
|     2|{"myJSONValue":[1...|
+------+--------------------+
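
To see what the path $.myJSONKey.myJSONValue[1] selects, the same navigation can be written with Python's standard json module. This is an analogy rather than Spark code, and the variable names are illustrative. One difference to keep in mind is that get_json_object returns the extracted value as a string, whereas json.loads yields native Python types.

```python
import json

# The same document that jsonDF holds in its jsonString column.
json_string = '{"myJSONKey": {"myJSONValue": [1, 2, 3]}}'
parsed = json.loads(json_string)

# Equivalent of the path $.myJSONKey.myJSONValue[1]: key, key, then index 1.
second_value = parsed["myJSONKey"]["myJSONValue"][1]

# Rough equivalent of json_tuple(..., "myJSONKey"): the value under a
# top-level key, serialized back to compact JSON text.
top_level = json.dumps(parsed["myJSONKey"], separators=(",", ":"))

print(second_value)
print(top_level)
```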



You can also turn a StructType into a JSON string by using the *to_json* function:

In [167]:
from pyspark.sql.functions import to_json

df.selectExpr("(InvoiceNo, Description) as myStruct")\
  .select(to_json(col("myStruct")))

DataFrame[to_json(myStruct): string]

Use the *from_json* function to parse a JSON string back into a struct. This requires specifying a schema:

In [170]:
from pyspark.sql.functions import from_json, to_json, col
from pyspark.sql.types import StructType, StructField, StringType

# Schema that from_json uses to parse the JSON string back into a struct
parseSchema = StructType([
  StructField("InvoiceNo", StringType(), True),
  StructField("Description", StringType(), True)])
df.selectExpr("(InvoiceNo, Description) as myStruct")\
  .select(to_json(col("myStruct")).alias("newJSON"))\
  .select(from_json(col("newJSON"), parseSchema), col("newJSON")).show(2)

+--------------------+--------------------+
|  from_json(newJSON)|             newJSON|
+--------------------+--------------------+
|{536365, WHITE HA...|{"InvoiceNo":"536...|
|{536365, WHITE ME...|{"InvoiceNo":"536...|
+--------------------+--------------------+
only showing top 2 rows
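
The schema decides which fields end up in the parsed struct: fields that the schema names but the JSON lacks come back as null, and JSON keys that the schema does not name are ignored. The plain-Python sketch below models that behavior with the json module. The helper parse_with_schema and the sample strings are hypothetical, and the model ignores type conversion.

```python
import json

# Field names taken from parseSchema in the cell above.
schema_fields = ["InvoiceNo", "Description"]

def parse_with_schema(json_string, fields):
    """Keep only the schema's fields; a field missing from the JSON becomes None."""
    parsed = json.loads(json_string)
    return {field: parsed.get(field) for field in fields}

# Hypothetical inputs: the first has an extra key, the second lacks Description.
with_extra = parse_with_schema(
    '{"InvoiceNo": "536365", "Description": "LANTERN", "Extra": 1}', schema_fields)
with_missing = parse_with_schema('{"InvoiceNo": "536365"}', schema_fields)

print(with_extra)
print(with_missing)
```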



                                                                                

## User-Defined Functions
- UDFs enable you to perform custom transformations on your data.
- They can be written in Python, Scala, or Java and operate on data record by record.
- UDFs are registered as temporary functions for use within a specific SparkSession.
- The first step involves defining the function. For this example, a function power3 that raises a number to the power of three is created.

In [174]:
# in Python
udfExampleDF = spark.range(5).toDF("num")
def power3(double_value):
  return double_value ** 3
power3(2.0)

8.0

- Function Testing:
  - Verified the power3 function on an individual input, confirming the expected result.
  - Input expectations: specific type, non-null values.
- Function Registration with Spark:
  - Functions need to be registered with Spark to use them across worker machines.
  - Spark serializes the function on the driver and distributes it to all executors over the network.
- Execution Details:
  - Scala/Java UDFs:
    - Executed within the Java Virtual Machine (JVM).
    - Minimal performance penalty, except for lack of Spark’s built-in function code generation optimization.
    - Performance can be impacted by excessive object creation/use.
  - Python UDFs:
    - Spark initiates a Python process on the worker.
    - Data is serialized to a Python-compatible format.
    - The function is executed row by row in the Python process.
    - Results are serialized back to the JVM and Spark.
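> Warning: 
- Starting the Python process is expensive, but the larger cost is serializing the data to Python:
  - Serialization is itself an expensive computation.
  - Once the data is in Python, Spark cannot manage the worker's memory.
- Risks:
  - A worker can fail if it becomes memory constrained, because the JVM and Python compete for memory on the same machine.
- Recommendation:
  - Write UDFs in Scala or Java for speed improvements and use them from Python if needed.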

In [196]:
# in Python
from pyspark.sql.functions import udf
power3udf = udf(power3)

In [197]:
# in Python
from pyspark.sql.functions import col
udfExampleDF.select(power3udf(col("num"))).show(2)

+-----------+
|power3(num)|
+-----------+
|          0|
|          1|
+-----------+
only showing top 2 rows



- DataFrame Function Limitation:
  - At this point, power3udf is usable only as a DataFrame function, not inside a string expression passed to selectExpr or a SQL query.
  - Registering the UDF as a Spark SQL function makes it available in SQL queries and across languages.

In [198]:
udfExampleDF.selectExpr("power3(num)").show(2)

+-----------+
|power3(num)|
+-----------+
|       NULL|
|       NULL|
+-----------+
only showing top 2 rows



In [199]:
from pyspark.sql.types import DoubleType
# The declared return type (DoubleType) must match what power3 actually returns
spark.udf.register("power3py", power3, DoubleType())

24/08/07 20:34:58 WARN SimpleFunctionRegistry: The function power3py replaced a previously registered function.


<function __main__.power3(double_value)>

In [200]:
udfExampleDF.selectExpr("power3py(num)").show(2)

+-------------+
|power3py(num)|
+-------------+
|         NULL|
|         NULL|
+-------------+
only showing top 2 rows
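
The NULL values come from a type mismatch. power3py is registered with DoubleType as its return type, but spark.range produces integers, and Python keeps an integer raised to a power as an integer. When the value a Python UDF returns does not match the declared type, Spark does not raise an error and returns null instead. The sketch below shows the mismatch in plain Python, together with one possible fix. power3_float is a hypothetical variant, not part of the original notebook.

```python
def power3(value):
    return value ** 3

def power3_float(value):
    # Hypothetical variant: coerce to float so the result matches DoubleType.
    # Like power3, it assumes a non-null input.
    return float(value) ** 3

int_result = power3(2)            # the kind of value spark.range feeds the UDF
float_result = power3_float(2)

print(type(int_result).__name__, int_result)
print(type(float_result).__name__, float_result)
```

Registering the float-returning variant, for example with spark.udf.register("power3py", power3_float, DoubleType()), should produce numeric results instead of NULL. Declaring an integer return type such as LongType for the original function is another option.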

