## Performance optimization

One optimization technique is to use the **cache()** and **persist()** methods. They store an intermediate result of an RDD, DataFrame, or Dataset so that it can be reused in subsequent actions. Because transformations are lazy, Spark re-runs the same series of transformations every time we call a different action on their result. The best practice is therefore to persist the result of transformations that we are going to use more than once.

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName('Test').getOrCreate()

In [0]:
# Let's create a Data Frame
emp = [
    (1, "AAA", "dept1", 1000),
    (2, "BBB", "dept1", 1100),
    (3, "CCC", "dept1", 3000),
    (4, "DDD", "dept1", 1500),
    (5, "EEE", "dept2", 8000),
    (6, "FFF", "dept2", 7200),
    (7, "GGG", "dept3", 7100),
    (None, None, None, 7500),
    (9, "III", None, 4500),
    (10, None, "dept5", 2500)
]
emp_columns = ["id", "name", "dept", "salary"]

dept = [
    ("dept1", "Department - 1"),
    ("dept2", "Department - 2"),
    ("dept3", "Department - 3"),
    ("dept4", "Department - 4")
]
dept_columns = ["id", "name"]

empColumns = [
    "emp_id","name","superior_emp_id","year_joined",
    "emp_dept_id","gender","salary"
]

df = spark.createDataFrame(emp, emp_columns)
dept_df = spark.createDataFrame(dept, dept_columns)

# Create temporary tables
df.createOrReplaceTempView("emp_df")
dept_df.createOrReplaceTempView("dept_df")

# Save as HIVE tables
df.write.saveAsTable("hive_emp_df", mode="overwrite")
dept_df.write.saveAsTable("hive_dept_df", mode="overwrite")

## Broadcast JOIN

We use a broadcast join when we want to join two DataFrames and one of them is small: the small DataFrame is broadcast to all the nodes to optimize the JOIN operation. By default, a table is broadcast automatically only if it is smaller than 10MB. However, we can increase this size up to 8GB.

We can check the size limit of the broadcast table as follows:

In [0]:
data_size = int(spark.conf.get("spark.sql.autoBroadcastJoinThreshold").replace('b', '')) / (1024 * 1024)
print(f"Default size of the Broadcast table: {data_size} MB")

Default size of the Broadcast table: 10.0 MB


We can set the size limit of the broadcast table as follows:

In [0]:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

Now, we are ready for the JOIN operation.

In [0]:
# Broadcast the small DataFrame (dept_df) and join it with the larger one (df)
join_df = df.join(broadcast(dept_df), df["dept"] == dept_df["id"])

## Cache

We can use the cache/persist functions to keep a DataFrame in memory. Caching data that we use very frequently can significantly improve the performance of a Spark application.

In [0]:
df.cache()
df.count()
print(f"Memory used: {df.storageLevel.useMemory}")
print(f"Disk used: {df.storageLevel.useDisk}")

Memory used: True
Disk used: True


We can specify that we only want to keep the data in memory:

In [0]:
from pyspark.storagelevel import StorageLevel
dept_df.persist(StorageLevel.MEMORY_ONLY)
dept_df.count()
print(f"Memory used: {dept_df.storageLevel.useMemory}")
print(f"Disk used: {dept_df.storageLevel.useDisk}")

Memory used: True
Disk used: False


We can delete the cache when it is no longer necessary:

In [0]:
df.unpersist()

# To remove all cached tables, use the code below
spark.catalog.clearCache()

## SQL Expressions

We can also use SQL expressions for data manipulation. The function expr and the DataFrame method selectExpr, a variant of select, both evaluate SQL expressions.

In [0]:
# selectExpr is a DataFrame method, so only expr needs to be imported
from pyspark.sql.functions import expr

query = '''
CASE WHEN salary > 5000 THEN 'high-salary'
    ELSE CASE WHEN salary > 2000 THEN 'mid-salary'
        ELSE CASE WHEN salary > 0 THEN 'low-salary'
            ELSE 'invalid-salary'
        END
    END
END AS salary_level
'''
new_df = df.withColumn("salary_level", expr(query))
new_df.show()

+----+----+-----+------+------------+
|  id|name| dept|salary|salary_level|
+----+----+-----+------+------------+
|   1| AAA|dept1|  1000|  low-salary|
|   2| BBB|dept1|  1100|  low-salary|
|   3| CCC|dept1|  3000|  mid-salary|
|   4| DDD|dept1|  1500|  low-salary|
|   5| EEE|dept2|  8000| high-salary|
|   6| FFF|dept2|  7200| high-salary|
|   7| GGG|dept3|  7100| high-salary|
|NULL|NULL| NULL|  7500| high-salary|
|   9| III| NULL|  4500|  mid-salary|
|  10|NULL|dept5|  2500|  mid-salary|
+----+----+-----+------+------------+



In [0]:
# "*" keeps all the columns of the original DataFrame
new_df2 = df.selectExpr("*", query)
new_df2.show()

+----+----+-----+------+------------+
|  id|name| dept|salary|salary_level|
+----+----+-----+------+------------+
|   1| AAA|dept1|  1000|  low-salary|
|   2| BBB|dept1|  1100|  low-salary|
|   3| CCC|dept1|  3000|  mid-salary|
|   4| DDD|dept1|  1500|  low-salary|
|   5| EEE|dept2|  8000| high-salary|
|   6| FFF|dept2|  7200| high-salary|
|   7| GGG|dept3|  7100| high-salary|
|NULL|NULL| NULL|  7500| high-salary|
|   9| III| NULL|  4500|  mid-salary|
|  10|NULL|dept5|  2500|  mid-salary|
+----+----+-----+------+------------+



## User Defined Functions (UDF)

When a transformation is hard to express in SQL, we can write a function in Python and register it as a UDF.

In [0]:
def get_salary_level(salary):
    if salary > 5000:
        return 'high-salary'
    elif salary > 2000:
        return 'mid-salary'
    elif salary > 0:
        return 'low-salary'
    return 'invalid-salary'

In [0]:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# Register the function as a UDF that returns a string
sal_level = udf(get_salary_level, StringType())

In [0]:
new_df3 = df.withColumn("salary_level", sal_level("salary"))
new_df3.show()

+----+----+-----+------+------------+
|  id|name| dept|salary|salary_level|
+----+----+-----+------+------------+
|   1| AAA|dept1|  1000|  low-salary|
|   2| BBB|dept1|  1100|  low-salary|
|   3| CCC|dept1|  3000|  mid-salary|
|   4| DDD|dept1|  1500|  low-salary|
|   5| EEE|dept2|  8000| high-salary|
|   6| FFF|dept2|  7200| high-salary|
|   7| GGG|dept3|  7100| high-salary|
|NULL|NULL| NULL|  7500| high-salary|
|   9| III| NULL|  4500|  mid-salary|
|  10|NULL|dept5|  2500|  mid-salary|
+----+----+-----+------+------------+
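One caveat: a Python UDF receives a SQL NULL as None, and a comparison such as `None > 5000` raises a TypeError in Python 3. The sample data has no NULL salaries, so get_salary_level works here, but a null-tolerant variant could look like the sketch below. The name `get_salary_level_safe` is hypothetical, and mapping NULL to 'invalid-salary' is an assumption about the desired behavior.

```python
from typing import Optional


def get_salary_level_safe(salary: Optional[int]) -> str:
    """Classify a salary, treating a missing value as invalid."""
    # Guard first: comparing None with an int would raise TypeError.
    if salary is None:
        return "invalid-salary"
    if salary > 5000:
        return "high-salary"
    if salary > 2000:
        return "mid-salary"
    if salary > 0:
        return "low-salary"
    return "invalid-salary"


# Quick check on plain Python values, no Spark session needed.
results = [get_salary_level_safe(s) for s in (8000, 3000, 1000, 0, None)]
```

The function can then be registered in the same way as before, by passing it to udf together with StringType().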



## Working with NULL values

In [0]:
new_df = df.filter(df["dept"].isNull())
#new_df = df.where(df["dept"].isNull())
new_df.show()

+----+----+----+------+
|  id|name|dept|salary|
+----+----+----+------+
|NULL|NULL|NULL|  7500|
|   9| III|NULL|  4500|
+----+----+----+------+



In [0]:
new_df = df.filter(df["dept"].isNotNull())
#new_df = df.where(df["dept"].isNotNull())
new_df.show()

+---+----+-----+------+
| id|name| dept|salary|
+---+----+-----+------+
|  1| AAA|dept1|  1000|
|  2| BBB|dept1|  1100|
|  3| CCC|dept1|  3000|
|  4| DDD|dept1|  1500|
|  5| EEE|dept2|  8000|
|  6| FFF|dept2|  7200|
|  7| GGG|dept3|  7100|
| 10|NULL|dept5|  2500|
+---+----+-----+------+



In [0]:
new_df = df.fillna("INVALID", ["dept"])
new_df.show()

+----+----+-------+------+
|  id|name|   dept|salary|
+----+----+-------+------+
|   1| AAA|  dept1|  1000|
|   2| BBB|  dept1|  1100|
|   3| CCC|  dept1|  3000|
|   4| DDD|  dept1|  1500|
|   5| EEE|  dept2|  8000|
|   6| FFF|  dept2|  7200|
|   7| GGG|  dept3|  7100|
|NULL|NULL|INVALID|  7500|
|   9| III|INVALID|  4500|
|  10|NULL|  dept5|  2500|
+----+----+-------+------+



In [0]:
new_df = df.dropna()
# new_df = df.dropna(how="all")  # drop a row only if all of its columns are NULL
# new_df = df.dropna(subset="dept")  # drop a row only if dept is NULL
new_df.show()

+---+----+-----+------+
| id|name| dept|salary|
+---+----+-----+------+
|  1| AAA|dept1|  1000|
|  2| BBB|dept1|  1100|
|  3| CCC|dept1|  3000|
|  4| DDD|dept1|  1500|
|  5| EEE|dept2|  8000|
|  6| FFF|dept2|  7200|
|  7| GGG|dept3|  7100|
+---+----+-----+------+



## Partitioning

In [0]:
df.rdd.getNumPartitions()

8

In [0]:
# Increase: repartition() is expensive because it shuffles the data across the cluster nodes
new_df = df.repartition(10)
# Decrease: coalesce() merges existing partitions and avoids a full shuffle
new_df = df.coalesce(6)
new_df.rdd.getNumPartitions()

6

In [0]:
# Change the default number of shuffle partitions
num_partitions = spark.conf.get("spark.sql.shuffle.partitions")
print(f"Old # partitions: {num_partitions}")
spark.conf.set("spark.sql.shuffle.partitions", 500)

num_partitions = spark.conf.get("spark.sql.shuffle.partitions")
print(f"New # partitions: {num_partitions}")

Old # partitions: 200
New # partitions: 500


## Catalog API


In [0]:
spark.catalog.listDatabases()

[Database(name='default', catalog='spark_catalog', description='Default Hive database', locationUri='dbfs:/user/hive/warehouse')]

In [0]:
spark.catalog.listTables("default")

[Table(name='hive_dept_df', catalog='spark_catalog', namespace=['default'], description=None, tableType='MANAGED', isTemporary=False),
 Table(name='hive_emp_df', catalog='spark_catalog', namespace=['default'], description=None, tableType='MANAGED', isTemporary=False),
 Table(name='dept_df', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True),
 Table(name='emp_df', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True)]

In [0]:
spark.catalog.listColumns("hive_emp_df", "default")



[Column(name='id', description=None, dataType='bigint', nullable=True, isPartition=False, isBucket=False, isCluster=False),
 Column(name='name', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False, isCluster=False),
 Column(name='dept', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False, isCluster=False),
 Column(name='salary', description=None, dataType='bigint', nullable=True, isPartition=False, isBucket=False, isCluster=False)]

In [0]:
spark.catalog.listFunctions()

[Function(name='!', catalog=None, namespace=None, description='! expr - Logical not.', className='org.apache.spark.sql.catalyst.expressions.Not', isTemporary=True),
 Function(name='!=', catalog=None, namespace=None, description='expr1 != expr2 - Returns true if `expr1` is not equal to `expr2`.', className=None, isTemporary=True),
 Function(name='%', catalog=None, namespace=None, description='expr1 % expr2 - Returns the remainder after `expr1`/`expr2`.', className='org.apache.spark.sql.catalyst.expressions.Remainder', isTemporary=True),
 Function(name='&', catalog=None, namespace=None, description='expr1 & expr2 - Returns the result of bitwise AND of `expr1` and `expr2`.', className='org.apache.spark.sql.catalyst.expressions.BitwiseAnd', isTemporary=True),
 Function(name='*', catalog=None, namespace=None, description='expr1 * expr2 - Returns `expr1`*`expr2`.', className='org.apache.spark.sql.catalyst.expressions.Multiply', isTemporary=True),
 Function(name='+', catalog=None, namespace=N

In [0]:
#spark.catalog.setCurrentDatabase()
spark.catalog.currentDatabase()

'default'

In [0]:
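# Cache the table and check that it is marked as cached
spark.catalog.cacheTable("default.hive_emp_df")
print(spark.catalog.isCached("default.hive_emp_df"))
# Remove it from the cache and check again
spark.catalog.uncacheTable("default.hive_emp_df")
print(spark.catalog.isCached("default.hive_emp_df"))
# Remove all cached tables
spark.catalog.clearCache()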