In [3]:
import findspark

In [4]:
findspark.init('/home/purvil/spark-2.4.3-bin-hadoop2.7')

In [5]:
from pyspark.sql import SparkSession

In [6]:
spark = SparkSession.builder.appName('intro').getOrCreate()

* With the structured API we can manipulate many kinds of data, such as CSV and Parquet files.

* Spark's structured collections of data are DataFrames, Datasets, and SQL tables and views.
* Internally, Spark uses an engine called Catalyst that maintains its own type information throughout planning and processing. Spark types map to the types of each supported programming language through a lookup table, and every operation is carried out on Spark's own types.

### DataFrame vs Dataset
* DataFrames and Datasets are distributed, table-like collections with well-defined rows and columns.
* A schema defines the column names and column types of a DataFrame.
* Spark maintains the type information of a DataFrame and checks that types match the schema at runtime.
* A Dataset checks types at compile time and is only available in JVM-based languages (Scala and Java).

* `spark.range` creates a DataFrame whose records are `Row` objects

In [7]:
spark.range(2)

DataFrame[id: bigint]

In [8]:
spark.range(2).collect()

[Row(id=0), Row(id=1)]

### Types

In [9]:
from pyspark.sql.types import *

In [10]:
b = ByteType() # 1 byte signed int

In [11]:
s = ShortType() # 2-byte signed int
i = IntegerType() # 4-byte signed int
l = LongType() # 8-byte signed int
f = FloatType() # 4-byte single-precision floating point
d = DoubleType() # 8-byte double-precision floating point (Python float)
dec = DecimalType() # decimal.Decimal
st = StringType() # str
bi = BinaryType() # bytearray
bo = BooleanType() # bool
t = TimestampType() # datetime.datetime
dt = DateType() # datetime.date
a = ArrayType(ShortType()) # list, tuple, array
m = MapType(StringType(), DoubleType()) # dict
# s = StructType() # list or tuple
# s = StructField() # 

### Structured API execution

* Write DataFrame/Dataset/SQL code.
* If the code is valid, Spark converts it to a logical plan.
* Spark transforms the logical plan into a physical plan, applying optimizations along the way.
* Spark executes the physical plan (RDD manipulations) on the cluster.
![](images/optimizatioon.png)

* The logical plan only represents a set of abstract transformations that do not refer to executors or drivers. Its purpose is to convert the user's expressions into an optimized version.
![](images/logical.png)

* First, Spark converts the user code to an unresolved logical plan. It is called unresolved because the code may be valid while the tables or DataFrames it refers to may not exist. The catalog holds information about the current tables and DataFrames, and the analyzer uses it to produce the resolved logical plan.
* The Catalyst optimizer then optimizes the logical plan, for example by pushing down predicates and selections.

* Spark generates several physical execution strategies and compares them using a cost model.
![](images/physical.png)

* The final result is a series of RDD transformations, which are executed on the cluster.

* A DataFrame consists of a series of records of type Row. The partitioning of a DataFrame defines its physical distribution across the cluster.

In [12]:
df = spark.read.json("spark_data/flight-data/json/2015-summary.json")

In [13]:
df.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)



In [14]:
spark.read.json("spark_data/flight-data/json/2015-summary.json").schema

StructType(List(StructField(DEST_COUNTRY_NAME,StringType,true),StructField(ORIGIN_COUNTRY_NAME,StringType,true),StructField(count,LongType,true)))

* A StructField has a name, a type, and a boolean flag that states whether null values are allowed.
* Let's define a schema manually

In [15]:
myManualSchema = StructType([
    StructField("DEST_COUNTRY_NAME", StringType(), True),
    StructField("ORIGIN_COUNTRY_NAME", StringType(), True),
    StructField("count", LongType(), False, metadata={"hello":"world"})
])

In [16]:
df = spark.read.json('spark_data/flight-data/json/2015-summary.json', schema = myManualSchema)

### Manipulate columns

In [17]:
from pyspark.sql.functions import col, column

In [18]:
col("ColName")

Column<b'ColName'>

In [19]:
column("ColName")

Column<b'ColName'>

* To refer to a column of a specific DataFrame

In [20]:
df["count"]

Column<b'count'>

* An expression is a set of transformations on one or more values in a record of a DataFrame.

In [21]:
from pyspark.sql.functions import expr

In [22]:
expr("colName")

Column<b'colName'>

### expr vs col

In [23]:
expr("colName - 4")

Column<b'(colName - 4)'>

In [24]:
col("colname") - 4

Column<b'(colname - 4)'>

* The two expressions above are equivalent

In [25]:
spark.read.json('spark_data/flight-data/json/2015-summary.json').columns

['DEST_COUNTRY_NAME', 'ORIGIN_COUNTRY_NAME', 'count']

### Records and rows
* Each row in a DataFrame is a single record of type Row. Internally, a Row object is represented as an array of bytes.

In [26]:
df.first()

Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=15)

In [27]:
from pyspark.sql import Row

In [28]:
myRow = Row("Hello", None, 1, False) # creating new row

In [29]:
myRow[0]

'Hello'

In [30]:
myRow[-1]

False

### DataFrame Transformation

* Add rows or columns
* Remove rows or columns
* Transform a row into a column (or vice versa)
* Change the order of rows based on the values in columns

In [31]:
df.createOrReplaceTempView("dfTable")

#### Creating DataFrame

In [32]:
from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, StringType, LongType

In [33]:
myManualSchema = StructType([
    StructField("some", StringType(), True),
    StructField("col", StringType(), True),
    StructField("name", LongType(), True)
])

In [34]:
myRow = Row("Hello",None, 1)

In [35]:
myDf = spark.createDataFrame([myRow], schema=myManualSchema)

In [36]:
myDf.show()

+-----+----+----+
| some| col|name|
+-----+----+----+
|Hello|null|   1|
+-----+----+----+



#### select and selectExpr

* `select` and `selectExpr` are the DataFrame equivalents of these SQL queries:

```
SELECT * FROM dataFrameName
SELECT colName FROM dataFrameName
```

In [37]:
df.select("DEST_COUNTRY_NAME").show(2)

+-----------------+
|DEST_COUNTRY_NAME|
+-----------------+
|    United States|
|    United States|
+-----------------+
only showing top 2 rows



In [38]:
spark.sql("SELECT DEST_COUNTRY_NAME FROM dfTable LIMIT 2").show()

+-----------------+
|DEST_COUNTRY_NAME|
+-----------------+
|    United States|
|    United States|
+-----------------+



In [39]:
df.select("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME").show(2)

+-----------------+-------------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|
+-----------------+-------------------+
|    United States|            Romania|
|    United States|            Croatia|
+-----------------+-------------------+
only showing top 2 rows



In [40]:
spark.sql("SELECT DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME FROM dfTable LIMIT 2").show()

+-----------------+-------------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|
+-----------------+-------------------+
|    United States|            Romania|
|    United States|            Croatia|
+-----------------+-------------------+



In [41]:
df.select(expr("DEST_COUNTRY_NAME"), col("DEST_COUNTRY_NAME"), column("DEST_COUNTRY_NAME")).show(2)

+-----------------+-----------------+-----------------+
|DEST_COUNTRY_NAME|DEST_COUNTRY_NAME|DEST_COUNTRY_NAME|
+-----------------+-----------------+-----------------+
|    United States|    United States|    United States|
|    United States|    United States|    United States|
+-----------------+-----------------+-----------------+
only showing top 2 rows



In [42]:
df.select(expr("DEST_COUNTRY_NAME AS destination")).show(2)

+-------------+
|  destination|
+-------------+
|United States|
|United States|
+-------------+
only showing top 2 rows



In [43]:
spark.sql("SELECT DEST_COUNTRY_NAME AS destination FROM dfTable LIMIT 2").show()

+-------------+
|  destination|
+-------------+
|United States|
|United States|
+-------------+



* A more compact form of the `select(expr(...))` call above is `selectExpr`

In [44]:
df.selectExpr("DEST_COUNTRY_NAME AS destination").show(2)

+-------------+
|  destination|
+-------------+
|United States|
|United States|
+-------------+
only showing top 2 rows



In [45]:
df.select(expr("DEST_COUNTRY_NAME").alias("destination")).show(2)

+-------------+
|  destination|
+-------------+
|United States|
|United States|
+-------------+
only showing top 2 rows



In [46]:
df.selectExpr("*", "(DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME) as withinCountry").show(2)

+-----------------+-------------------+-----+-------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|withinCountry|
+-----------------+-------------------+-----+-------------+
|    United States|            Romania|   15|        false|
|    United States|            Croatia|    1|        false|
+-----------------+-------------------+-----+-------------+
only showing top 2 rows



In [47]:
spark.sql("SELECT *,(DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME) as withinCountry FROM dfTable LIMIT 2").show()

+-----------------+-------------------+-----+-------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|withinCountry|
+-----------------+-------------------+-----+-------------+
|    United States|            Romania|   15|        false|
|    United States|            Croatia|    1|        false|
+-----------------+-------------------+-----+-------------+



* Within `selectExpr` we can also define aggregations over the entire DataFrame

In [48]:
df.selectExpr("avg(count)", "count(distinct(DEST_COUNTRY_NAME))").show()

+-----------+---------------------------------+
| avg(count)|count(DISTINCT DEST_COUNTRY_NAME)|
+-----------+---------------------------------+
|1770.765625|                              132|
+-----------+---------------------------------+



In [49]:
spark.sql("SELECT avg(count), count(DISTINCT(DEST_COUNTRY_NAME)) FROM dfTable").show()

+-----------+---------------------------------+
| avg(count)|count(DISTINCT DEST_COUNTRY_NAME)|
+-----------+---------------------------------+
|1770.765625|                              132|
+-----------+---------------------------------+



### Literals
* A literal passes a plain value into an expression: `lit` translates a value from the programming language into the corresponding Spark type.

In [50]:
from pyspark.sql.functions import lit

In [51]:
df.select(expr('*'), lit(1).alias("One")).show(2)

+-----------------+-------------------+-----+---+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|One|
+-----------------+-------------------+-----+---+
|    United States|            Romania|   15|  1|
|    United States|            Croatia|    1|  1|
+-----------------+-------------------+-----+---+
only showing top 2 rows



In [52]:
df.select(lit(5), lit("five"), lit(5.0))

DataFrame[5: int, five: string, 5.0: double]

In [53]:
spark.sql("SELECT *, 1 FROM dfTable LIMIT 2").show()

+-----------------+-------------------+-----+---+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|  1|
+-----------------+-------------------+-----+---+
|    United States|            Romania|   15|  1|
|    United States|            Croatia|    1|  1|
+-----------------+-------------------+-----+---+



In [54]:
df.withColumn("numberOne", lit(1)).show(2) # Adding column numberOne

+-----------------+-------------------+-----+---------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|numberOne|
+-----------------+-------------------+-----+---------+
|    United States|            Romania|   15|        1|
|    United States|            Croatia|    1|        1|
+-----------------+-------------------+-----+---------+
only showing top 2 rows



In [55]:
spark.sql("SELECT *, 1 AS numberOne FROM dfTable LIMIT 2").show()

+-----------------+-------------------+-----+---------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|numberOne|
+-----------------+-------------------+-----+---------+
|    United States|            Romania|   15|        1|
|    United States|            Croatia|    1|        1|
+-----------------+-------------------+-----+---------+



In [56]:
df.withColumn("withinCountry", expr("DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME")).show(5)

+-----------------+-------------------+-----+-------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|withinCountry|
+-----------------+-------------------+-----+-------------+
|    United States|            Romania|   15|        false|
|    United States|            Croatia|    1|        false|
|    United States|            Ireland|  344|        false|
|            Egypt|      United States|   15|        false|
|    United States|              India|   62|        false|
+-----------------+-------------------+-----+-------------+
only showing top 5 rows



* `withColumn` takes two arguments: the column name and the expression that produces the values of the new column.

### Rename column

In [57]:
df.withColumnRenamed("DEST_COUNTRY_NAME", "dest").columns

['dest', 'ORIGIN_COUNTRY_NAME', 'count']

### Reserved character and keywords

In [58]:
dfWithLongName = df.withColumn("This Long Column_name", expr("ORIGIN_COUNTRY_NAME"))

In [59]:
dfWithLongName.show(2)

+-----------------+-------------------+-----+---------------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|This Long Column_name|
+-----------------+-------------------+-----+---------------------+
|    United States|            Romania|   15|              Romania|
|    United States|            Croatia|    1|              Croatia|
+-----------------+-------------------+-----+---------------------+
only showing top 2 rows



* To access such a column, wrap its name in backticks (`` ` ``)

In [60]:
dfWithLongName.selectExpr("`This Long Column_name`").show(2)

+---------------------+
|This Long Column_name|
+---------------------+
|              Romania|
|              Croatia|
+---------------------+
only showing top 2 rows



In [61]:
dfWithLongName.createOrReplaceTempView("dfTableLong")

In [62]:
spark.sql("SELECT `This Long Column_name` FROM dfTableLong Limit 2").show()

+---------------------+
|This Long Column_name|
+---------------------+
|              Romania|
|              Croatia|
+---------------------+



* By default, Spark is case insensitive
* To make it case sensitive

```
SET spark.sql.caseSensitive=true
```

### Removing column

In [63]:
df.drop("ORIGIN_COUNTRY_NAME").columns

['DEST_COUNTRY_NAME', 'count']

In [64]:
dfWithLongName.drop("ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME").columns

['count', 'This Long Column_name']

### Type casting

In [65]:
df.withColumn("count2", col("count").cast("long"))

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint, count2: bigint]

In [66]:
spark.sql("SELECT *, cast(count AS long) AS count2 FROM dfTable")

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint, count2: bigint]

### Filtering Rows

In [67]:
df.filter(col("count") < 2).show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Croatia|    1|
|    United States|          Singapore|    1|
|          Moldova|      United States|    1|
|            Malta|      United States|    1|
|    United States|          Gibraltar|    1|
+-----------------+-------------------+-----+
only showing top 5 rows



In [68]:
df.where("count < 2").show(2)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Croatia|    1|
|    United States|          Singapore|    1|
+-----------------+-------------------+-----+
only showing top 2 rows



In [69]:
spark.sql('SELECT * FROM dfTable WHERE count < 2 LIMIT 2').show()

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Croatia|    1|
|    United States|          Singapore|    1|
+-----------------+-------------------+-----+



In [70]:
df.where(col("count") < 2).where(col("ORIGIN_COUNTRY_NAME") != "Croatia").show(2)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|          Singapore|    1|
|          Moldova|      United States|    1|
+-----------------+-------------------+-----+
only showing top 2 rows



In [71]:
spark.sql('SELECT * FROM dfTable WHERE count < 2 AND ORIGIN_COUNTRY_NAME != "Croatia" LIMIT 2').show()

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|          Singapore|    1|
|          Moldova|      United States|    1|
+-----------------+-------------------+-----+



### Getting unique Rows

In [72]:
df.select("ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME").distinct().count()

256

In [73]:
spark.sql("SELECT count(distinct(ORIGIN_COUNTRY_NAME, DEST_COUNTRY_NAME)) FROM dfTable").show()

+------------------------------------------------------------------------------------------------------------+
|count(DISTINCT named_struct(ORIGIN_COUNTRY_NAME, ORIGIN_COUNTRY_NAME, DEST_COUNTRY_NAME, DEST_COUNTRY_NAME))|
+------------------------------------------------------------------------------------------------------------+
|                                                                                                         256|
+------------------------------------------------------------------------------------------------------------+



### Random Sample

In [74]:
seed = 5
withReplacement = False
fraction = 0.5
df.sample(withReplacement, fraction, seed).count()

126

### Random Split

* Breaks a DataFrame up into random splits of the original DataFrame.
* Useful in ML for creating training, validation, and test sets.

In [75]:
dataFrames = df.randomSplit([0.25, 0.75], seed)

In [76]:
dataFrames[0].count()

60

In [77]:
dataFrames[1].count()

196

### Concatenate and appending the Rows

* Unions match columns by position, not by name or schema, so both DataFrames must list their columns in the same order.

In [78]:
schema = df.schema

In [79]:
newRows = [
    Row("New Country", "Other country", 5),
    Row("New Country 2", "Other country 3", 1)
]

In [80]:
parallelizedRows = spark.sparkContext.parallelize(newRows)

In [81]:
newDf = spark.createDataFrame(parallelizedRows, schema)

In [82]:
df.union(newDf)

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint]

In [83]:
df.union(newDf).where("count = 1").where(col("ORIGIN_COUNTRY_NAME") != "United States").show()

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Croatia|    1|
|    United States|          Singapore|    1|
|    United States|          Gibraltar|    1|
|    United States|             Cyprus|    1|
|    United States|            Estonia|    1|
|    United States|          Lithuania|    1|
|    United States|           Bulgaria|    1|
|    United States|            Georgia|    1|
|    United States|            Bahrain|    1|
|    United States|   Papua New Guinea|    1|
|    United States|         Montenegro|    1|
|    United States|            Namibia|    1|
|    New Country 2|    Other country 3|    1|
+-----------------+-------------------+-----+



### Sorting Rows

In [84]:
df.sort("count").show(5)

+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--------------------+-------------------+-----+
|               Malta|      United States|    1|
|Saint Vincent and...|      United States|    1|
|       United States|            Croatia|    1|
|       United States|          Gibraltar|    1|
|       United States|          Singapore|    1|
+--------------------+-------------------+-----+
only showing top 5 rows



In [85]:
df.orderBy("count", "DEST_COUNTRY_NAME").show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|     Burkina Faso|      United States|    1|
|    Cote d'Ivoire|      United States|    1|
|           Cyprus|      United States|    1|
|         Djibouti|      United States|    1|
|        Indonesia|      United States|    1|
+-----------------+-------------------+-----+
only showing top 5 rows



In [86]:
from pyspark.sql.functions import desc, asc

In [87]:
df.orderBy(col("count").desc()).show(5)

+-----------------+-------------------+------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| count|
+-----------------+-------------------+------+
|    United States|      United States|370002|
|    United States|             Canada|  8483|
|           Canada|      United States|  8399|
|    United States|             Mexico|  7187|
|           Mexico|      United States|  7140|
+-----------------+-------------------+------+
only showing top 5 rows



In [88]:
spark.sql("SELECT * FROM dfTable ORDER BY count DESC LIMIT 5").show()

+-----------------+-------------------+------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| count|
+-----------------+-------------------+------+
|    United States|      United States|370002|
|    United States|             Canada|  8483|
|           Canada|      United States|  8399|
|    United States|             Mexico|  7187|
|           Mexico|      United States|  7140|
+-----------------+-------------------+------+



* To control where null values appear in a sorted DataFrame, use `asc_nulls_first`, `desc_nulls_first`, `asc_nulls_last` or `desc_nulls_last`

* As an optimization, rows can be sorted within each partition

In [89]:
spark.read.json("spark_data/flight-data/json/2015-summary.json").sortWithinPartitions("count")

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint]

### Limit

In [90]:
df.limit(5).show()

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
+-----------------+-------------------+-----+



In [91]:
df.orderBy(col("count").desc()).limit(6).show()

+-----------------+-------------------+------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| count|
+-----------------+-------------------+------+
|    United States|      United States|370002|
|    United States|             Canada|  8483|
|           Canada|      United States|  8399|
|    United States|             Mexico|  7187|
|           Mexico|      United States|  7140|
|   United Kingdom|      United States|  2025|
+-----------------+-------------------+------+



### Repartition and Coalesce

* Repartitioning the data by a frequently filtered column controls the physical layout of the data across the cluster.

In [92]:
df.repartition(5)

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint]

In [93]:
df.rdd.getNumPartitions()

1

In [94]:
df.repartition(col("DEST_COUNTRY_NAME"))

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint]

* `coalesce` combines partitions without performing a full shuffle

In [95]:
df.repartition(5, col("DEST_COUNTRY_NAME")).coalesce(2)

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint]

### collect

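* Spark maintains the state of the cluster in the driver.
* To bring data to the driver, `collect` fetches the entire DataFrame, `take(n)` fetches the first n rows, and `show` prints rows. Collecting a very large DataFrame can overwhelm the driver.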

In [96]:
collectDf = df.limit(10)

In [97]:
collectDf.take(5)

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=15),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Croatia', count=1),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Ireland', count=344),
 Row(DEST_COUNTRY_NAME='Egypt', ORIGIN_COUNTRY_NAME='United States', count=15),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='India', count=62)]

In [98]:
collectDf.show()

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
|    United States|          Singapore|    1|
|    United States|            Grenada|   62|
|       Costa Rica|      United States|  588|
|          Senegal|      United States|   40|
|          Moldova|      United States|    1|
+-----------------+-------------------+-----+



In [99]:
collectDf.show(5, False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |15   |
|United States    |Croatia            |1    |
|United States    |Ireland            |344  |
|Egypt            |United States      |15   |
|United States    |India              |62   |
+-----------------+-------------------+-----+
only showing top 5 rows



In [100]:
collectDf.collect()

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=15),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Croatia', count=1),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Ireland', count=344),
 Row(DEST_COUNTRY_NAME='Egypt', ORIGIN_COUNTRY_NAME='United States', count=15),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='India', count=62),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Singapore', count=1),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Grenada', count=62),
 Row(DEST_COUNTRY_NAME='Costa Rica', ORIGIN_COUNTRY_NAME='United States', count=588),
 Row(DEST_COUNTRY_NAME='Senegal', ORIGIN_COUNTRY_NAME='United States', count=40),
 Row(DEST_COUNTRY_NAME='Moldova', ORIGIN_COUNTRY_NAME='United States', count=1)]

### toLocalIterator

* Collects partitions to the driver as an iterator, which lets us iterate over the entire dataset serially, one partition at a time

In [101]:
collectDf.toLocalIterator()

<itertools.chain at 0x7fbbc20ef198>

In [102]:
df = spark.read.csv("spark_data/retail-data/by-day/2010-12-01.csv", header=True, inferSchema = True)

In [103]:
df.printSchema()

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)



In [104]:
df.createOrReplaceTempView('dfTable')

In [105]:
df.where(col("InvoiceNo") == 536365).select("InvoiceNo", "Description").show(5, False)

+---------+-----------------------------------+
|InvoiceNo|Description                        |
+---------+-----------------------------------+
|536365   |WHITE HANGING HEART T-LIGHT HOLDER |
|536365   |WHITE METAL LANTERN                |
|536365   |CREAM CUPID HEARTS COAT HANGER     |
|536365   |KNITTED UNION FLAG HOT WATER BOTTLE|
|536365   |RED WOOLLY HOTTIE WHITE HEART.     |
+---------+-----------------------------------+
only showing top 5 rows



In [106]:
df.where("InvoiceNo = 536365").show(5, False)

+---------+---------+-----------------------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|Description                        |Quantity|InvoiceDate        |UnitPrice|CustomerID|Country       |
+---------+---------+-----------------------------------+--------+-------------------+---------+----------+--------------+
|536365   |85123A   |WHITE HANGING HEART T-LIGHT HOLDER |6       |2010-12-01 08:26:00|2.55     |17850.0   |United Kingdom|
|536365   |71053    |WHITE METAL LANTERN                |6       |2010-12-01 08:26:00|3.39     |17850.0   |United Kingdom|
|536365   |84406B   |CREAM CUPID HEARTS COAT HANGER     |8       |2010-12-01 08:26:00|2.75     |17850.0   |United Kingdom|
|536365   |84029G   |KNITTED UNION FLAG HOT WATER BOTTLE|6       |2010-12-01 08:26:00|3.39     |17850.0   |United Kingdom|
|536365   |84029E   |RED WOOLLY HOTTIE WHITE HEART.     |6       |2010-12-01 08:26:00|3.39     |17850.0   |United Kingdom|
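+---------+---------+-----------------------------------+--------+-------------------+---------+----------+--------------+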

In [111]:
df.where("InvoiceNo <> 536365").show(5, False)

+---------+---------+-----------------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|Description                  |Quantity|InvoiceDate        |UnitPrice|CustomerID|Country       |
+---------+---------+-----------------------------+--------+-------------------+---------+----------+--------------+
|536366   |22633    |HAND WARMER UNION JACK       |6       |2010-12-01 08:28:00|1.85     |17850.0   |United Kingdom|
|536366   |22632    |HAND WARMER RED POLKA DOT    |6       |2010-12-01 08:28:00|1.85     |17850.0   |United Kingdom|
|536367   |84879    |ASSORTED COLOUR BIRD ORNAMENT|32      |2010-12-01 08:34:00|1.69     |13047.0   |United Kingdom|
|536367   |22745    |POPPY'S PLAYHOUSE BEDROOM    |6       |2010-12-01 08:34:00|2.1      |13047.0   |United Kingdom|
|536367   |22748    |POPPY'S PLAYHOUSE KITCHEN    |6       |2010-12-01 08:34:00|2.1      |13047.0   |United Kingdom|
+---------+---------+-----------------------------+--------+-------------------+---------+----------+--------------+

In [114]:
from pyspark.sql.functions import instr # returns the 1-based position of the first occurrence of a substring (0 if absent)

* To apply multiple filters, chain the conditions together. Spark flattens the chained filters into a single statement and applies them at the same time.

In [115]:
priceFilter = col('UnitPrice') > 600
descripFilter = instr(df.Description, "POSTAGE") >= 1

In [116]:
df.where(df.StockCode.isin("DOT")).where(priceFilter | descripFilter).show()

+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|   Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
|   536544|      DOT|DOTCOM POSTAGE|       1|2010-12-01 14:32:00|   569.77|      null|United Kingdom|
|   536592|      DOT|DOTCOM POSTAGE|       1|2010-12-01 17:06:00|   607.49|      null|United Kingdom|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+



In [123]:
spark.sql('SELECT * FROM dfTable WHERE StockCode in ("DOT") AND (UnitPrice > 600 OR instr(Description, "POSTAGE") >= 1)').show()

+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|   Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
|   536544|      DOT|DOTCOM POSTAGE|       1|2010-12-01 14:32:00|   569.77|      null|United Kingdom|
|   536592|      DOT|DOTCOM POSTAGE|       1|2010-12-01 17:06:00|   607.49|      null|United Kingdom|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+



In [126]:
DotCodeFilter = col("StockCode") == "DOT"
priceFilter = col("UnitPrice") > 600
descripFilter = instr(col("Description"), "POSTAGE") >= 1
df.withColumn("isExpensive", DotCodeFilter & (priceFilter | descripFilter)).where("isExpensive").select("unitPrice", "isExpensive").show(2)

+---------+-----------+
|unitPrice|isExpensive|
+---------+-----------+
|   569.77|       true|
|   607.49|       true|
+---------+-----------+



In [134]:
spark.sql("SELECT UnitPrice, (StockCode = 'DOT' AND (UnitPrice > 600 OR instr(Description, 'POSTAGE') >= 1)) as isExpensive \
          FROM dfTable \
          WHERE (StockCode = 'DOT' AND (UnitPrice > 600 OR instr(Description, 'POSTAGE') >= 1))").show(2)

+---------+-----------+
|UnitPrice|isExpensive|
+---------+-----------+
|   569.77|       true|
|   607.49|       true|
+---------+-----------+



In [135]:
df.withColumn("isExpensive", expr("NOT UnitPrice <= 250")).where("isExpensive").select("Description", "UnitPrice").show(5)

+--------------+---------+
|   Description|UnitPrice|
+--------------+---------+
|DOTCOM POSTAGE|   569.77|
|DOTCOM POSTAGE|   607.49|
+--------------+---------+



* If a column may contain nulls, use `eqNullSafe` for a null-safe equality test.

In [136]:
df.where(col("Description").eqNullSafe("hello")).show()

+---------+---------+-----------+--------+-----------+---------+----------+-------+
|InvoiceNo|StockCode|Description|Quantity|InvoiceDate|UnitPrice|CustomerID|Country|
+---------+---------+-----------+--------+-----------+---------+----------+-------+
+---------+---------+-----------+--------+-----------+---------+----------+-------+
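
* A rough plain-Python model of the difference between ordinary equality and null-safe equality. This is not Spark code; `sql_eq` and `null_safe_eq` are made-up helper names, and `None` stands in for SQL NULL.

```python
def sql_eq(a, b):
    """Model of SQL `=`: a comparison involving NULL yields NULL (None)."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_eq(a, b):
    """Model of eqNullSafe (`<=>`): always True or False, never NULL."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


print(sql_eq(None, "hello"))        # None
print(null_safe_eq(None, "hello"))  # False
print(null_safe_eq(None, None))     # True
```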



### Numerical data

In [137]:
from pyspark.sql.functions import expr, pow

In [138]:
fabricated = pow(col("Quantity") * col("UnitPrice"), 2) + 5

In [139]:
df.select(expr("CustomerId"), fabricated.alias("realQuantity")).show(2)

+----------+------------------+
|CustomerId|      realQuantity|
+----------+------------------+
|   17850.0|239.08999999999997|
|   17850.0|          418.7156|
+----------+------------------+
only showing top 2 rows



In [144]:
spark.sql("SELECT CustomerId, (POWER((Quantity * UnitPrice), 2.0) + 5) AS realQuantity FROM dfTable").show(2)

+----------+------------------+
|CustomerId|      realQuantity|
+----------+------------------+
|   17850.0|239.08999999999997|
|   17850.0|          418.7156|
+----------+------------------+
only showing top 2 rows



#### Rounding

In [145]:
from pyspark.sql.functions import round, bround

In [148]:
df.select(round(lit("2.5")), bround(lit("2.5"))).show(2)

+-------------+--------------+
|round(2.5, 0)|bround(2.5, 0)|
+-------------+--------------+
|          3.0|           2.0|
|          3.0|           2.0|
+-------------+--------------+
only showing top 2 rows



In [150]:
spark.sql("SELECT round(2.5), bround(2.5)").show()

+-------------+--------------+
|round(2.5, 0)|bround(2.5, 0)|
+-------------+--------------+
|            3|             2|
+-------------+--------------+
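
* `round` rounds halves up, while `bround` rounds halves to the nearest even number. A small plain-Python sketch of the two modes using the standard `decimal` module (the helper names `half_up` and `half_even` are only for illustration):

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_EVEN


def half_up(value):
    """Round to an integer, ties go away from zero (like round)."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def half_even(value):
    """Round to an integer, ties go to the even neighbour (like bround)."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN))


print(half_up("2.5"), half_even("2.5"))  # 3 2
print(half_up("3.5"), half_even("3.5"))  # 4 4
```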



#### Correlation

In [152]:
from pyspark.sql.functions import corr

In [153]:
df.stat.corr("Quantity", "UnitPrice")

-0.04112314436835551

In [155]:
spark.sql("SELECT corr(Quantity, UnitPrice) FROM dfTable").show()

+-----------------------------------------+
|corr(CAST(Quantity AS DOUBLE), UnitPrice)|
+-----------------------------------------+
|                     -0.04112314436835551|
+-----------------------------------------+
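
* `corr` computes the Pearson correlation coefficient. A minimal plain-Python sketch of that formula, assuming two equal-length lists of numbers (the sample lists are made up, not taken from the dataset):

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of std devs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Perfectly linear data gives a coefficient of (about) 1 or -1.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson([1, 2, 3], [3, 2, 1]))
```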



#### Describe

* Summary statistics

In [157]:
df.describe().show()

+-------+-----------------+------------------+--------------------+------------------+------------------+------------------+--------------+
|summary|        InvoiceNo|         StockCode|         Description|          Quantity|         UnitPrice|        CustomerID|       Country|
+-------+-----------------+------------------+--------------------+------------------+------------------+------------------+--------------+
|  count|             3108|              3108|                3098|              3108|              3108|              1968|          3108|
|   mean| 536516.684944841|27834.304044117645|                null| 8.627413127413128| 4.151946589446603|15661.388719512195|          null|
| stddev|72.89447869788873|17407.897548583845|                null|26.371821677029203|15.638659854603892|1854.4496996893627|          null|
|    min|           536365|             10002| 4 PURPLE FLOCK D...|               -24|               0.0|           12431.0|     Australia|
|    max|          C

In [158]:
from pyspark.sql.functions import count, mean, stddev_pop, min, max

In [166]:
df.stat.approxQuantile("UnitPrice", [0.5], 0.05) # approximate median; the last argument is the allowed relative error

[2.51]

#### monotonically_increasing_id
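* Generates a unique, monotonically increasing 64-bit ID for each row, starting at 0. The IDs are guaranteed to be unique and increasing, but they are not consecutive across partitions.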

In [168]:
from pyspark.sql.functions import monotonically_increasing_id

In [170]:
df.select(monotonically_increasing_id(), "UnitPrice").show(5)

+-----------------------------+---------+
|monotonically_increasing_id()|UnitPrice|
+-----------------------------+---------+
|                            0|     2.55|
|                            1|     3.39|
|                            2|     2.75|
|                            3|     3.39|
|                            4|     3.39|
+-----------------------------+---------+
only showing top 5 rows
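
* The IDs above look consecutive because the rows shown come from the first partition. The current implementation puts the partition index in the upper 31 bits and the record number within the partition in the lower 33 bits. A plain-Python sketch of that layout (`monotonic_id` is a made-up helper, not a Spark function):

```python
def monotonic_id(partition_index, row_in_partition):
    """Combine a partition index and a row offset the way the ID is laid out."""
    return (partition_index << 33) + row_in_partition


print(monotonic_id(0, 0))  # 0
print(monotonic_id(0, 4))  # 4
print(monotonic_id(1, 0))  # 8589934592
```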



### String

#### initcap
* Capitalizes the first letter of every whitespace-separated word and lowercases the rest.

In [179]:
from pyspark.sql.functions import initcap, lower, upper

In [177]:
df.select(initcap(col("Description"))).show(2)

+--------------------+
|initcap(Description)|
+--------------------+
|White Hanging Hea...|
| White Metal Lantern|
+--------------------+
only showing top 2 rows



In [181]:
df.select(lower(col("Description"))).show(2)

+--------------------+
|  lower(Description)|
+--------------------+
|white hanging hea...|
| white metal lantern|
+--------------------+
only showing top 2 rows



In [182]:
df.select(upper(col("Description"))).show(2)

+--------------------+
|  upper(Description)|
+--------------------+
|WHITE HANGING HEA...|
| WHITE METAL LANTERN|
+--------------------+
only showing top 2 rows



In [183]:
from pyspark.sql.functions import lit, ltrim, rtrim, rpad, lpad, trim

In [201]:
df.select(ltrim(lit("    Hello    ")).alias("ltrim"), 
          rtrim(lit("    Hello    ")).alias("rtrim"), 
          trim(lit("    Hello    ")).alias("trim"), 
          lpad(lit("Hello"), 3, " ").alias("lp"), 
          rpad(lit("Hello"), 10, " ").alias("rp")).show(1)

+---------+---------+-----+---+----------+
|    ltrim|    rtrim| trim| lp|        rp|
+---------+---------+-----+---+----------+
|Hello    |    Hello|Hello|Hel|Hello     |
+---------+---------+-----+---+----------+
only showing top 1 row



* If the target length is shorter than the string, both `lpad` and `rpad` truncate the string from the right.
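
* A plain-Python sketch of the padding behaviour seen in the output above, assuming a non-empty pad string. These functions only model the semantics; they are not the Spark functions themselves.

```python
def lpad(s, length, pad):
    """Pad on the left up to `length`; truncate from the right if too long."""
    if length <= len(s):
        return s[:length]
    fill = (pad * length)[: length - len(s)]
    return fill + s


def rpad(s, length, pad):
    """Pad on the right up to `length`; truncate from the right if too long."""
    if length <= len(s):
        return s[:length]
    fill = (pad * length)[: length - len(s)]
    return s + fill


print(repr(lpad("Hello", 3, " ")))   # 'Hel'
print(repr(rpad("Hello", 10, " ")))  # 'Hello     '
```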

#### Regular Expression
* `regexp_extract` `regexp_replace`

In [203]:
from pyspark.sql.functions import regexp_extract, regexp_replace

In [205]:
regex_str = "BLACK|WHITE|RED|GREEN|BLUE"

In [207]:
df.select(
    regexp_replace(col("Description"), regex_str, "COLOR").alias("color_clean"), col("Description")).show(2, False)

+----------------------------------+----------------------------------+
|color_clean                       |Description                       |
+----------------------------------+----------------------------------+
|COLOR HANGING HEART T-LIGHT HOLDER|WHITE HANGING HEART T-LIGHT HOLDER|
|COLOR METAL LANTERN               |WHITE METAL LANTERN               |
+----------------------------------+----------------------------------+
only showing top 2 rows



In [210]:
spark.sql("""SELECT regexp_replace(Description, 'BLACK|WHITE|RED|GREEN|BLUE', 'COLOR') AS color_clean, Description
            FROM dfTable""").show(5, False)

+-----------------------------------+-----------------------------------+
|color_clean                        |Description                        |
+-----------------------------------+-----------------------------------+
|COLOR HANGING HEART T-LIGHT HOLDER |WHITE HANGING HEART T-LIGHT HOLDER |
|COLOR METAL LANTERN                |WHITE METAL LANTERN                |
|CREAM CUPID HEARTS COAT HANGER     |CREAM CUPID HEARTS COAT HANGER     |
|KNITTED UNION FLAG HOT WATER BOTTLE|KNITTED UNION FLAG HOT WATER BOTTLE|
|COLOR WOOLLY HOTTIE COLOR HEART.   |RED WOOLLY HOTTIE WHITE HEART.     |
+-----------------------------------+-----------------------------------+
only showing top 5 rows



#### translate
* Replaces every occurrence of each character in the matching string with the character at the same index in the replacement string.

In [211]:
from pyspark.sql.functions import translate

In [213]:
df.select(translate(col("Description"), "LEET", "1337"), col("Description")).show(2, False)

+----------------------------------+----------------------------------+
|translate(Description, LEET, 1337)|Description                       |
+----------------------------------+----------------------------------+
|WHI73 HANGING H3AR7 7-1IGH7 HO1D3R|WHITE HANGING HEART T-LIGHT HOLDER|
|WHI73 M37A1 1AN73RN               |WHITE METAL LANTERN               |
+----------------------------------+----------------------------------+
only showing top 2 rows



In [216]:
spark.sql("""SELECT translate(Description, 'LEET', '1337'), Description
            FROM dfTable""").show(2,False)

+----------------------------------+----------------------------------+
|translate(Description, LEET, 1337)|Description                       |
+----------------------------------+----------------------------------+
|WHI73 HANGING H3AR7 7-1IGH7 HO1D3R|WHITE HANGING HEART T-LIGHT HOLDER|
|WHI73 M37A1 1AN73RN               |WHITE METAL LANTERN               |
+----------------------------------+----------------------------------+
only showing top 2 rows
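
* The character-by-character mapping can be modelled in plain Python with `str.maketrans`, assuming the matching and replacement strings have the same length. Here `translate` is a local helper, not the Spark function.

```python
def translate(text, matching, replace):
    """Map matching[i] -> replace[i] for every character in text."""
    table = str.maketrans(matching, replace)  # requires equal lengths
    return text.translate(table)


print(translate("WHITE METAL LANTERN", "LEET", "1337"))  # WHI73 M37A1 1AN73RN
```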



In [227]:
extract_str = "(BLACK|WHITE|RED|GREEN|BLUE)"

In [230]:
df.select(
    regexp_extract(col("Description"), extract_str, 1).alias("color_clean"), 
    col("Description")
).show(2,False)

+-----------+----------------------------------+
|color_clean|Description                       |
+-----------+----------------------------------+
|WHITE      |WHITE HANGING HEART T-LIGHT HOLDER|
|WHITE      |WHITE METAL LANTERN               |
+-----------+----------------------------------+
only showing top 2 rows
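
* For a simple alternation like this one, Python's `re` module behaves the same way, so the two functions can be sketched in plain Python. The helpers below are stand-ins with the same names. Returning an empty string when nothing matches mirrors `regexp_extract`. Replacement strings with group references use different syntax in Java and Python, which this sketch does not cover.

```python
import re

COLOR_PATTERN = "(BLACK|WHITE|RED|GREEN|BLUE)"


def regexp_replace(text, pattern, replacement):
    """Replace every match of pattern in text."""
    return re.sub(pattern, replacement, text)


def regexp_extract(text, pattern, idx):
    """Return capture group idx of the first match, or '' if no match."""
    match = re.search(pattern, text)
    return match.group(idx) if match else ""


print(regexp_replace("RED WOOLLY HOTTIE WHITE HEART.", COLOR_PATTERN, "COLOR"))
print(regexp_extract("WHITE METAL LANTERN", COLOR_PATTERN, 1))
```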



### Contains

In [233]:
df.withColumn("hasSimpleColor", col("Description").contains("WHITE")).select("hasSimpleColor").show(5)

+--------------+
|hasSimpleColor|
+--------------+
|          true|
|          true|
|         false|
|         false|
|          true|
+--------------+
only showing top 5 rows



In [236]:
df.withColumn("hasSimpleColor", instr(col("Description"), "WHITE") >= 1).select("hasSimpleColor").show(5)

+--------------+
|hasSimpleColor|
+--------------+
|          true|
|          true|
|         false|
|         false|
|          true|
+--------------+
only showing top 5 rows



### Date Time

* A date keeps track of the calendar date only; a timestamp keeps track of both date and time.
* If no time zone is specified, Spark uses the machine's local time zone.

* To specify the time zone for the current session, set the `spark.sql.session.timeZone` configuration, for example:

```
spark.conf.set("spark.sql.session.timeZone", "UTC")
```

In [237]:
from pyspark.sql.functions import current_date, current_timestamp

In [238]:
dateDF = spark.range(10).withColumn("today", current_date()).withColumn("now", current_timestamp())

In [241]:
dateDF.show()

+---+----------+--------------------+
| id|     today|                 now|
+---+----------+--------------------+
|  0|2019-06-12|2019-06-12 15:32:...|
|  1|2019-06-12|2019-06-12 15:32:...|
|  2|2019-06-12|2019-06-12 15:32:...|
|  3|2019-06-12|2019-06-12 15:32:...|
|  4|2019-06-12|2019-06-12 15:32:...|
|  5|2019-06-12|2019-06-12 15:32:...|
|  6|2019-06-12|2019-06-12 15:32:...|
|  7|2019-06-12|2019-06-12 15:32:...|
|  8|2019-06-12|2019-06-12 15:32:...|
|  9|2019-06-12|2019-06-12 15:32:...|
+---+----------+--------------------+



In [242]:
dateDF.createOrReplaceTempView('dateTable')

In [243]:
dateDF.printSchema()

root
 |-- id: long (nullable = false)
 |-- today: date (nullable = false)
 |-- now: timestamp (nullable = false)



In [244]:
from pyspark.sql.functions import date_add, date_sub

In [245]:
dateDF.select(date_sub(col("today"), 5), date_add(col("today"), 5)).show(2)

+------------------+------------------+
|date_sub(today, 5)|date_add(today, 5)|
+------------------+------------------+
|        2019-06-07|        2019-06-17|
|        2019-06-07|        2019-06-17|
+------------------+------------------+
only showing top 2 rows



In [247]:
spark.sql("""SELECT date_sub(today, 5), date_add(today, 5)
            FROM dateTable""").show(2)

+------------------+------------------+
|date_sub(today, 5)|date_add(today, 5)|
+------------------+------------------+
|        2019-06-07|        2019-06-17|
|        2019-06-07|        2019-06-17|
+------------------+------------------+
only showing top 2 rows



In [248]:
from pyspark.sql.functions import datediff, months_between, to_date

* `datediff` returns the number of days between two dates; `months_between` returns the number of months between two dates.

In [249]:
dateDF.withColumn("weekago", date_sub(col("today"), 7)).select(datediff(col("weekago"), col("today"))).show(1)

+------------------------+
|datediff(weekago, today)|
+------------------------+
|                      -7|
+------------------------+
only showing top 1 row



In [252]:
dateDF.select(
    to_date(lit("2016-01-01")).alias("start"),
    to_date(lit("2017-05-22")).alias("end")
).select(months_between(col("start"), col("end"))).show(1)

+--------------------------------+
|months_between(start, end, true)|
+--------------------------------+
|                    -16.67741935|
+--------------------------------+
only showing top 1 row



In [260]:
spark.sql('SELECT to_date("2016-01-01") AS start, months_between("2016-01-01","2017-01-01") as months, datediff("2016-01-01","2017-01-01") as days ').show()

+----------+------+----+
|     start|months|days|
+----------+------+----+
|2016-01-01| -12.0|-366|
+----------+------+----+
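
* The fractional result of `months_between` comes from treating every month as 31 days. A plain-Python sketch of that calculation for dates without a time component (helper names mirror the Spark functions but are ordinary Python):

```python
import calendar
from datetime import date


def months_between(d1, d2):
    """Whole months plus the day difference as a fraction of a 31-day month."""
    months = (d1.year - d2.year) * 12 + (d1.month - d2.month)
    last1 = d1.day == calendar.monthrange(d1.year, d1.month)[1]
    last2 = d2.day == calendar.monthrange(d2.year, d2.month)[1]
    if d1.day == d2.day or (last1 and last2):
        return float(months)
    return round(months + (d1.day - d2.day) / 31.0, 8)


def datediff(end, start):
    """Number of days from start to end."""
    return (end - start).days


print(months_between(date(2016, 1, 1), date(2017, 5, 22)))  # -16.67741935
print(months_between(date(2016, 1, 1), date(2017, 1, 1)))   # -12.0
print(datediff(date(2016, 1, 1), date(2017, 1, 1)))         # -366
```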



* If Spark cannot parse a string as a date, it does NOT throw an error; it returns null.

In [261]:
dateDF.select(to_date(lit("2016-20-12")), to_date(lit("2017-12-11"))).show(1)

+---------------------+---------------------+
|to_date('2016-20-12')|to_date('2017-12-11')|
+---------------------+---------------------+
|                 null|           2017-12-11|
+---------------------+---------------------+
only showing top 1 row



* To avoid this, specify the date format explicitly, following the Java SimpleDateFormat standard.

In [262]:
date_format = "yyy-dd-MM"

In [265]:
cleanDateDF = spark.range(1).select(
                    to_date(lit("2017-12-11"), date_format).alias("date"),
                    to_date(lit("2017-20-12"), date_format).alias("date2")
)

In [266]:
cleanDateDF.show()

+----------+----------+
|      date|     date2|
+----------+----------+
|2017-11-12|2017-12-20|
+----------+----------+
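
* Note that with the year-day-month format, `2017-12-11` parses without error but as 12 November, not 11 December. The same behaviour can be sketched in plain Python with `strptime`, where `None` stands in for null (this `to_date` is a local helper, and `%Y-%d-%m` is the Python spelling of the year-day-month pattern):

```python
from datetime import date, datetime


def to_date(text, fmt="%Y-%m-%d"):
    """Parse text with fmt; return None instead of raising on failure."""
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None


print(to_date("2016-20-12"))              # None (there is no month 20)
print(to_date("2017-20-12", "%Y-%d-%m"))  # 2017-12-20
print(to_date("2017-12-11", "%Y-%d-%m"))  # 2017-11-12
```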



In [267]:
cleanDateDF.createOrReplaceTempView('dateTable2')

In [272]:
spark.sql('SELECT to_date(date, "yyyy-dd-MM"), to_date(date2, "yyyy-dd-MM"), to_date(date) FROM dateTable2').show(1)

+----------------------------------------+-----------------------------------------+--------------------------+
|to_date(datetable2.`date`, 'yyyy-dd-MM')|to_date(datetable2.`date2`, 'yyyy-dd-MM')|to_date(datetable2.`date`)|
+----------------------------------------+-----------------------------------------+--------------------------+
|                              2017-11-12|                               2017-12-20|                2017-11-12|
+----------------------------------------+-----------------------------------------+--------------------------+



In [274]:
from pyspark.sql.functions import to_timestamp

In [275]:
cleanDateDF.select(to_timestamp(col("date"), date_format)).show()

+---------------------------------+
|to_timestamp(`date`, 'yyy-dd-MM')|
+---------------------------------+
|              2017-11-12 00:00:00|
+---------------------------------+



In [277]:
spark.sql("SELECT to_timestamp(date, 'yyyy-dd-MM'), to_timestamp(date2, 'yyyy-dd-MM') FROM dateTable2").show()

+---------------------------------------------+----------------------------------------------+
|to_timestamp(datetable2.`date`, 'yyyy-dd-MM')|to_timestamp(datetable2.`date2`, 'yyyy-dd-MM')|
+---------------------------------------------+----------------------------------------------+
|                          2017-11-12 00:00:00|                           2017-12-20 00:00:00|
+---------------------------------------------+----------------------------------------------+



In [279]:
spark.sql('SELECT cast(to_date("2017-01-01", "yyyy-dd-MM") as timestamp)').show()

+------------------------------------------------------+
|CAST(to_date('2017-01-01', 'yyyy-dd-MM') AS TIMESTAMP)|
+------------------------------------------------------+
|                                   2017-01-01 00:00:00|
+------------------------------------------------------+



In [280]:
cleanDateDF.filter(col("date2") > lit("2017-12-12")).show()

+----------+----------+
|      date|     date2|
+----------+----------+
|2017-11-12|2017-12-20|
+----------+----------+



### Handling null

#### Coalesce
* Chooses the first non-null value from a set of columns.

In [281]:
from pyspark.sql.functions import coalesce

In [282]:
df.select(coalesce(col("Description"), col("CustomerId"))).show(2)

+---------------------------------+
|coalesce(Description, CustomerId)|
+---------------------------------+
|             WHITE HANGING HEA...|
|              WHITE METAL LANTERN|
+---------------------------------+
only showing top 2 rows



####  ifnull, nullif, nvl, nvl2
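* `ifnull` returns the second value if the first is null; otherwise it returns the first
* `nullif` returns null if the two values are equal; otherwise it returns the first value
* `nvl` returns the second value if the first is null; otherwise it returns the first
* `nvl2` returns the second value if the first is not null; otherwise it returns the third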

In [284]:
spark.sql("""
    SELECT ifnull(null, "return_val"), nullif('val', 'val'), nvl(null, 'val'), nvl2('not_null', 'return_val', "else_val")
""").show()

+--------------------------+--------------------+----------------+------------------------------------------+
|ifnull(NULL, 'return_val')|nullif('val', 'val')|nvl(NULL, 'val')|nvl2('not_null', 'return_val', 'else_val')|
+--------------------------+--------------------+----------------+------------------------------------------+
|                return_val|                null|             val|                                return_val|
+--------------------------+--------------------+----------------+------------------------------------------+
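
* The four functions can be modelled in a few lines of plain Python, with `None` standing in for null. These are illustrative stand-ins, not the SQL functions themselves.

```python
def ifnull(a, b):
    """Return b if a is null, otherwise a."""
    return b if a is None else a


def nullif(a, b):
    """Return null if a equals b, otherwise a."""
    return None if a == b else a


def nvl(a, b):
    """Same behaviour as ifnull."""
    return ifnull(a, b)


def nvl2(a, b, c):
    """Return b if a is not null, otherwise c."""
    return b if a is not None else c


print(ifnull(None, "return_val"))                  # return_val
print(nullif("val", "val"))                        # None
print(nvl(None, "val"))                            # val
print(nvl2("not_null", "return_val", "else_val"))  # return_val
```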



#### drop
* Drops rows that contain null values.

In [286]:
df.na.drop("any") # "any" (the default) drops a row if any column is null; "all" drops it only if every column is null

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

In [287]:
spark.sql("SELECT * FROM dfTable WHERE Description IS NOT NULL")

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

* We can also specify the list of columns in which to check for nulls.

In [290]:
df.na.drop("all", subset=["StockCode", "InvoiceNo"])

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]
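
* A plain-Python sketch of the `"any"` / `"all"` / `subset` logic, using dictionaries as rows. `drop_nulls` is a hypothetical helper and the sample rows are made up for illustration.

```python
def drop_nulls(rows, how="any", subset=None):
    """Keep rows unless any/all of the checked columns are None."""
    kept = []
    for row in rows:
        columns = subset if subset is not None else list(row)
        nulls = [row[c] is None for c in columns]
        drop = any(nulls) if how == "any" else all(nulls)
        if not drop:
            kept.append(row)
    return kept


# Made-up sample rows
rows = [
    {"StockCode": "DOT", "InvoiceNo": "536544", "CustomerID": None},
    {"StockCode": None, "InvoiceNo": None, "CustomerID": 17850.0},
    {"StockCode": "85123A", "InvoiceNo": "536365", "CustomerID": 17850.0},
]

print(len(drop_nulls(rows, "any")))                                     # 1
print(len(drop_nulls(rows, "all", subset=["StockCode", "InvoiceNo"])))  # 2
```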

#### fill

In [291]:
df.na.fill("Null was here, I replaced it")

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

In [294]:
df.na.fill("i am", subset=["StockCode", "InvoiceNo"])

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

In [297]:
fill_cols_vals = {'StockCode':5, "Description":"No value"}

In [298]:
df.na.fill(fill_cols_vals)

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

#### replace
* Replaces every occurrence of a given value in a column with a new value.

In [300]:
df.na.replace([""], ["UNKNOWN"], "Description")

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

### Structs
* A struct is like a DataFrame within a DataFrame.
* We can create a struct by wrapping a set of columns in parentheses in a query.

In [301]:
from pyspark.sql.functions import struct

In [303]:
complexDF = df.select(struct("Description", "InvoiceNo").alias("complex"))

In [304]:
complexDF.createOrReplaceTempView("complexDF")

In [307]:
complexDF.show(2,False)

+--------------------------------------------+
|complex                                     |
+--------------------------------------------+
|[WHITE HANGING HEART T-LIGHT HOLDER, 536365]|
|[WHITE METAL LANTERN, 536365]               |
+--------------------------------------------+
only showing top 2 rows



In [310]:
complexDF.select("complex.Description").show(2,False)

+----------------------------------+
|Description                       |
+----------------------------------+
|WHITE HANGING HEART T-LIGHT HOLDER|
|WHITE METAL LANTERN               |
+----------------------------------+
only showing top 2 rows



In [314]:
complexDF.select(col("complex").getField("Description")).show(2, False)

+----------------------------------+
|complex.Description               |
+----------------------------------+
|WHITE HANGING HEART T-LIGHT HOLDER|
|WHITE METAL LANTERN               |
+----------------------------------+
only showing top 2 rows



In [317]:
complexDF.select("complex.*").show(2,False)

+----------------------------------+---------+
|Description                       |InvoiceNo|
+----------------------------------+---------+
|WHITE HANGING HEART T-LIGHT HOLDER|536365   |
|WHITE METAL LANTERN               |536365   |
+----------------------------------+---------+
only showing top 2 rows



In [319]:
spark.sql("SELECT complex.* FROM complexDF").show(2)

+--------------------+---------+
|         Description|InvoiceNo|
+--------------------+---------+
|WHITE HANGING HEA...|   536365|
| WHITE METAL LANTERN|   536365|
+--------------------+---------+
only showing top 2 rows



### Array

In [320]:
from pyspark.sql.functions import split

In [323]:
df.select(split(col("Description"), " ")).show(2, False)

+----------------------------------------+
|split(Description,  )                   |
+----------------------------------------+
|[WHITE, HANGING, HEART, T-LIGHT, HOLDER]|
|[WHITE, METAL, LANTERN]                 |
+----------------------------------------+
only showing top 2 rows



In [322]:
spark.sql("SELECT split(Description, ' ') FROM dfTable").show(2)

+---------------------+
|split(Description,  )|
+---------------------+
| [WHITE, HANGING, ...|
| [WHITE, METAL, LA...|
+---------------------+
only showing top 2 rows



In [324]:
df.select(split(col("Description"), " ").alias("array_col")).selectExpr("array_col[0]").show(2)

+------------+
|array_col[0]|
+------------+
|       WHITE|
|       WHITE|
+------------+
only showing top 2 rows



In [325]:
spark.sql("SELECT split(Description, ' ')[0] FROM dfTable").show(2)

+------------------------+
|split(Description,  )[0]|
+------------------------+
|                   WHITE|
|                   WHITE|
+------------------------+
only showing top 2 rows



In [326]:
from pyspark.sql.functions import size

In [327]:
df.select(size(split(col("Description"), " "))).show(2)

+---------------------------+
|size(split(Description,  ))|
+---------------------------+
|                          5|
|                          3|
+---------------------------+
only showing top 2 rows



In [328]:
from pyspark.sql.functions import array_contains

In [329]:
df.select(array_contains(split("Description", " "), "WHITE")).show(2)

+--------------------------------------------+
|array_contains(split(Description,  ), WHITE)|
+--------------------------------------------+
|                                        true|
|                                        true|
+--------------------------------------------+
only showing top 2 rows



In [330]:
spark.sql("SELECT array_contains(split(Description, ' '), 'WHITE') FROM dfTable").show(2)

+--------------------------------------------+
|array_contains(split(Description,  ), WHITE)|
+--------------------------------------------+
|                                        true|
|                                        true|
+--------------------------------------------+
only showing top 2 rows



####  explode
* Takes a column that consists of arrays and creates one row per value in the array, duplicating the other columns.

In [332]:
from pyspark.sql.functions import explode

In [335]:
df.withColumn("splitted", split(col("Description"), " ")).withColumn("exploded", explode(col("splitted"))).select("Description", "InvoiceNo", "exploded").show(20, False)

+-----------------------------------+---------+--------+
|Description                        |InvoiceNo|exploded|
+-----------------------------------+---------+--------+
|WHITE HANGING HEART T-LIGHT HOLDER |536365   |WHITE   |
|WHITE HANGING HEART T-LIGHT HOLDER |536365   |HANGING |
|WHITE HANGING HEART T-LIGHT HOLDER |536365   |HEART   |
|WHITE HANGING HEART T-LIGHT HOLDER |536365   |T-LIGHT |
|WHITE HANGING HEART T-LIGHT HOLDER |536365   |HOLDER  |
|WHITE METAL LANTERN                |536365   |WHITE   |
|WHITE METAL LANTERN                |536365   |METAL   |
|WHITE METAL LANTERN                |536365   |LANTERN |
|CREAM CUPID HEARTS COAT HANGER     |536365   |CREAM   |
|CREAM CUPID HEARTS COAT HANGER     |536365   |CUPID   |
|CREAM CUPID HEARTS COAT HANGER     |536365   |HEARTS  |
|CREAM CUPID HEARTS COAT HANGER     |536365   |COAT    |
|CREAM CUPID HEARTS COAT HANGER     |536365   |HANGER  |
|KNITTED UNION FLAG HOT WATER BOTTLE|536365   |KNITTED |
|KNITTED UNION FLAG HOT WATER B

In [340]:
spark.sql("""
    SELECT Description, InvoiceNo, explode(splitted) as exploded
    FROM (SELECT *, split(Description, " ") as splitted FROM dfTable)
""").show(10, False)

+----------------------------------+---------+--------+
|Description                       |InvoiceNo|exploded|
+----------------------------------+---------+--------+
|WHITE HANGING HEART T-LIGHT HOLDER|536365   |WHITE   |
|WHITE HANGING HEART T-LIGHT HOLDER|536365   |HANGING |
|WHITE HANGING HEART T-LIGHT HOLDER|536365   |HEART   |
|WHITE HANGING HEART T-LIGHT HOLDER|536365   |T-LIGHT |
|WHITE HANGING HEART T-LIGHT HOLDER|536365   |HOLDER  |
|WHITE METAL LANTERN               |536365   |WHITE   |
|WHITE METAL LANTERN               |536365   |METAL   |
|WHITE METAL LANTERN               |536365   |LANTERN |
|CREAM CUPID HEARTS COAT HANGER    |536365   |CREAM   |
|CREAM CUPID HEARTS COAT HANGER    |536365   |CUPID   |
+----------------------------------+---------+--------+
only showing top 10 rows
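
* The split-then-explode step can be pictured as a nested loop. A plain-Python sketch over `(Description, InvoiceNo)` tuples (`explode_words` is a made-up helper). Rows with a null description produce no output rows, which mirrors how `explode` skips null or empty arrays.

```python
def explode_words(rows):
    """Emit one (description, invoice_no, word) tuple per word."""
    exploded = []
    for description, invoice_no in rows:
        if description is None:
            continue  # a null array yields no rows
        for word in description.split(" "):
            exploded.append((description, invoice_no, word))
    return exploded


rows = [
    ("WHITE HANGING HEART T-LIGHT HOLDER", "536365"),
    ("WHITE METAL LANTERN", "536365"),
    (None, "536365"),
]
for row in explode_words(rows):
    print(row)
```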



### Maps

In [342]:
from pyspark.sql.functions import create_map

In [344]:
df.select(create_map(col("Description"), col("InvoiceNo")).alias("complex_map")).show(2,False)

+----------------------------------------------+
|complex_map                                   |
+----------------------------------------------+
|[WHITE HANGING HEART T-LIGHT HOLDER -> 536365]|
|[WHITE METAL LANTERN -> 536365]               |
+----------------------------------------------+
only showing top 2 rows



In [349]:
spark.sql("""
    SELECT map(Description, InvoiceNo) AS complex_map FROM dfTable WHERE Description IS NOT NULL
""").show(2, False)

+----------------------------------------------+
|complex_map                                   |
+----------------------------------------------+
|[WHITE HANGING HEART T-LIGHT HOLDER -> 536365]|
|[WHITE METAL LANTERN -> 536365]               |
+----------------------------------------------+
only showing top 2 rows



In [354]:
df.select(create_map(col("Description"), col("InvoiceNo")).alias("complex_map")).selectExpr("complex_map['WHITE METAL LANTERN']").show(2)

+--------------------------------+
|complex_map[WHITE METAL LANTERN]|
+--------------------------------+
|                            null|
|                          536365|
+--------------------------------+
only showing top 2 rows
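
* The first row is null because each row holds its own one-entry map, and looking up a missing key returns null. A plain-Python picture of the same lookup, with one dictionary per row:

```python
rows = [
    ("WHITE HANGING HEART T-LIGHT HOLDER", "536365"),
    ("WHITE METAL LANTERN", "536365"),
]

# One single-entry map per row, keyed by Description.
maps = [{description: invoice_no} for description, invoice_no in rows]

# dict.get returns None for a missing key, like a map lookup returning null.
lookups = [m.get("WHITE METAL LANTERN") for m in maps]
print(lookups)  # [None, '536365']
```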



In [358]:
df.select(create_map(col("Description"), col("InvoiceNo")).alias("complex_map")).selectExpr("explode(complex_map)").show(10,False)

+-----------------------------------+------+
|key                                |value |
+-----------------------------------+------+
|WHITE HANGING HEART T-LIGHT HOLDER |536365|
|WHITE METAL LANTERN                |536365|
|CREAM CUPID HEARTS COAT HANGER     |536365|
|KNITTED UNION FLAG HOT WATER BOTTLE|536365|
|RED WOOLLY HOTTIE WHITE HEART.     |536365|
|SET 7 BABUSHKA NESTING BOXES       |536365|
|GLASS STAR FROSTED T-LIGHT HOLDER  |536365|
|HAND WARMER UNION JACK             |536366|
|HAND WARMER RED POLKA DOT          |536366|
|ASSORTED COLOUR BIRD ORNAMENT      |536367|
+-----------------------------------+------+
only showing top 10 rows



### JSON

In [359]:
jsonDF = spark.range(1).selectExpr("""
    '{"myJSONKey":{"myJSONValue":[1,2,3]}}' AS jsonString
""")

* `get_json_object` queries a JSON string inline with a path expression, whether the target sits in a dictionary or an array.
* `json_tuple` can be used if the object has only one level of nesting.

In [360]:
from pyspark.sql.functions import get_json_object, json_tuple

In [369]:
jsonDF.select(get_json_object(col('jsonString'), "$.myJSONKey.myJSONValue[1]"), 
              json_tuple(col("jsonString"), "myJSONKey")).show(1, False)

+------+
|column|
+------+
|  null|
+------+
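
* To see which values the two expressions address, the same JSON string can be walked with Python's standard `json` module. This only illustrates the path `$.myJSONKey.myJSONValue[1]` and the top-level key `myJSONKey`; it is not a reproduction of the Spark output.

```python
import json

json_string = '{"myJSONKey":{"myJSONValue":[1,2,3]}}'
parsed = json.loads(json_string)

# $.myJSONKey.myJSONValue[1] -> second element of the nested array
second_value = parsed["myJSONKey"]["myJSONValue"][1]

# Top-level key "myJSONKey" -> the nested object, re-serialized compactly
top_level = json.dumps(parsed["myJSONKey"], separators=(",", ":"))

print(second_value)  # 2
print(top_level)     # {"myJSONValue":[1,2,3]}
```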



#### to_json
* Converts a struct column to a JSON string.

In [373]:
from pyspark.sql.functions import to_json, from_json

In [374]:
df.selectExpr("(InvoiceNo, Description) as myStruct").select(to_json("myStruct"))

DataFrame[structstojson(myStruct): string]

#### from_json
* Parses a JSON string back into a struct, using the schema we supply.

In [375]:
parseSchema = StructType((
    StructField("InvoiceNo", StringType(), True),
    StructField("Description", StringType(), True)
))

In [378]:
df.selectExpr("(InvoiceNo, Description) as myStruct").select(to_json(col("myStruct")).alias("newJson")).select(from_json(col("newJson"), parseSchema), col("newJson")).show(2,False)

+--------------------------------------------+-------------------------------------------------------------------------+
|jsontostructs(newJson)                      |newJson                                                                  |
+--------------------------------------------+-------------------------------------------------------------------------+
|[536365, WHITE HANGING HEART T-LIGHT HOLDER]|{"InvoiceNo":"536365","Description":"WHITE HANGING HEART T-LIGHT HOLDER"}|
|[536365, WHITE METAL LANTERN]               |{"InvoiceNo":"536365","Description":"WHITE METAL LANTERN"}               |
+--------------------------------------------+-------------------------------------------------------------------------+
only showing top 2 rows



### User Defined Function

* A UDF can take and return one or more columns, and it operates on the data record by record.

In [379]:
def power3(val):
    return val ** 3
power3(2.0)

8.0

In [380]:
udfDF = spark.range(5).toDF("num")

In [381]:
udfDF.show()

+---+
|num|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+



* We have to register the function with Spark so that it can be called on the worker machines. Spark serializes the function on the driver and transfers it to the executors. If the function is written in Python, Spark starts a Python process on the worker, serializes the data into a format Python can understand, runs the function row by row in that Python process, and finally returns the results of the row operations to the JVM and Spark.
![](images/python_function.png)

* Serializing data to Python is expensive. Furthermore, once the data is in the Python process, Spark cannot manage the worker's memory, so a badly behaved Python function can kill the worker. For that reason it is better to write UDFs in Scala or Java.
* First, register the function.

In [382]:
from pyspark.sql.functions import udf

In [383]:
power3udf = udf(power3)

In [384]:
udfDF.select(power3udf(col("num"))).show()

+-----------+
|power3(num)|
+-----------+
|          0|
|          1|
|          8|
|         27|
|         64|
+-----------+



* The function above is registered only as a DataFrame function, so it cannot be used inside a SQL string expression. Registering it with `spark.udf.register` makes it available there as well.

In [389]:
spark.udf.register("power3", power3, IntegerType())

<function __main__.power3(val)>

In [390]:
udfDF.selectExpr("power3(num)").show()

+-----------+
|power3(num)|
+-----------+
|          0|
|          1|
|          8|
|         27|
|         64|
+-----------+



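* If the declared return type does not align with the type the function actually returns, Spark does not throw an error; it returns null. Here `power3` returns Python integers for the integer input, so declaring `DoubleType()` yields null.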

In [391]:
spark.udf.register("power3py", power3, DoubleType())

<function __main__.power3(val)>

In [392]:
udfDF.selectExpr("power3py(num)").show()

+-------------+
|power3py(num)|
+-------------+
|         null|
|         null|
|         null|
|         null|
|         null|
+-------------+
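
* One way around the nulls, sketched under the assumption that a double result is really wanted, is to make the Python function return a `float` so that it matches the declared `DoubleType()`. The wrapper `power3_as_float` below is a hypothetical helper; the registration line is left as a comment because it needs a running Spark session.

```python
def power3(val):
    return val ** 3


def power3_as_float(val):
    """Hypothetical wrapper: return a float to match a DoubleType declaration."""
    return float(power3(val))


print(type(power3(2)).__name__)           # int
print(type(power3_as_float(2)).__name__)  # float

# With a Spark session available, this could then be registered as:
# spark.udf.register("power3py", power3_as_float, DoubleType())
```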

