# [Advanced] 5 Spark Tips that will get you to another level

There are many tools in the world, each solving its own range of problems. Most are judged by how well and how correctly they solve a given problem, but some tools you simply enjoy using. They are well designed and fit comfortably in your hand; you do not need to dig through the documentation to work out how to perform a simple action. This series of posts is about one such tool for me.

[Reference](https://luminousmen.com/)

## 1. DO NOT collect data on Local Driver

If your RDD/DataFrame is too large for all of its elements to fit into the driver machine's memory, **DO NOT** do the following:

In [None]:
data = df.collect()

The collect action tries to move all the data in the RDD/DataFrame to the driver machine, which may then run out of memory and crash. Instead, limit the number of items returned by calling 'take' or 'takeSample', or by filtering your RDD/DataFrame first.

## 2. Specify the schema

When reading CSV and JSON files, you get better performance by specifying the schema instead of relying on schema inference, which needs an extra pass over the data. An explicit schema also reduces errors and is recommended for production code.

In [None]:
from pyspark.sql.types import (StructType, StructField, 
    DoubleType, IntegerType, StringType)

schema = StructType([   
    StructField('A', IntegerType(), nullable=False),    
    StructField('B', DoubleType(), nullable=False),    
    StructField('C', StringType(), nullable=False)
])

# Pass the schema explicitly; `spark` is the SparkSession.
df = spark.read.csv('/some/input/file.csv', schema=schema)

Also, use the right file format. Avro offers easy serialization/deserialization, which allows for efficient integration with ingestion processes. Parquet, meanwhile, lets you work efficiently when selecting specific columns and can be effective for storing intermediate files. However, Parquet files are immutable, so modifications require rewriting the whole data set, whereas Avro files cope easily with frequent schema changes.
Reference: https://luminousmen.com/post/big-data-file-formats

## 3. Avoid reduceByKey when the input and output value types are different

If for any reason you have RDD-based jobs, use reduceByKey operations wisely.

Consider the job of creating a set of strings for each key:

In [None]:
rdd.map(lambda p: (p[0], {p[1]})) \
    .reduceByKey(lambda x, y: x | y) \
    .collect()

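Note that the input values are strings and the output values are sets. The map operation creates a temporary single-element set for every record, which adds up to lots of small objects. A better way to handle this scenario is to use aggregateByKey, which adds each value to an existing per-key set instead: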

In [None]:
def merge_vals(xs, x):
    xs.add(x)
    return xs

def combine(xs, ys):
    return xs | ys

rdd.aggregateByKey(set(), merge_vals, combine).collect()
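To see what the two functions are responsible for, here is a plain-Python simulation of the aggregateByKey contract. It is not Spark code: the helper `simulate_aggregate_by_key` and the two-partition layout are hypothetical and exist only to illustrate the idea.

```python
import copy

def merge_vals(xs, x):
    xs.add(x)
    return xs

def combine(xs, ys):
    return xs | ys

def simulate_aggregate_by_key(partitions, zero, seq_func, comb_func):
    """Plain-Python illustration of the aggregateByKey contract (not Spark code)."""
    # Step 1: inside each partition, fold values into one accumulator per key,
    # starting from a fresh copy of the zero value.
    partials = []
    for part in partitions:
        acc = {}
        for key, value in part:
            if key not in acc:
                acc[key] = copy.deepcopy(zero)
            acc[key] = seq_func(acc[key], value)
        partials.append(acc)
    # Step 2: merge the per-partition accumulators key by key.
    result = {}
    for acc in partials:
        for key, xs in acc.items():
            result[key] = comb_func(result[key], xs) if key in result else xs
    return result

# Hypothetical data: two partitions of (key, string) pairs.
partitions = [
    [("a", "x"), ("a", "y"), ("b", "z")],
    [("a", "x"), ("b", "w")],
]
result = simulate_aggregate_by_key(partitions, set(), merge_vals, combine)
```

The first function sees a set and a raw string, the second sees two sets, so only one set per key is allocated in each partition rather than one per record.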

## 4. Don't use count when you don't need to return the exact number of rows

When you don't need the exact number of rows, for example when you only check whether a DataFrame is empty, it is more efficient to use

In [None]:
df = sqlContext.read.json(...)
if not len(df.take(1)):
    ...

#### instead of

In [None]:
if not df.count():
    ...

## 5. Using bucketing in Pyspark

Bucketing is an optimization method that splits data into more manageable parts (buckets) and fixes the data partitioning at the time the data is written out. The motivation is to make subsequent reads of the data more performant for downstream jobs whose SQL operators can make use of this property. In our example, we can optimize the execution of join queries by avoiding shuffles (also known as exchanges) of the tables involved in the join. Bucketing leads to fewer exchanges (and, consequently, fewer stages) because shuffling may not be required: both DataFrames may already be located in the same partitions.

Bucketing is on by default. Spark uses the configuration property spark.sql.sources.bucketing.enabled to control whether it is enabled and used to optimize queries.

Bucketing determines the physical layout of the data, so we shuffle the data beforehand because we want to avoid such shuffling later in the process.
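The idea behind that physical layout can be sketched in plain Python. This is only an illustration: CRC32 stands in for Spark's own hash function, and the helpers `bucket_id` and `bucketize` as well as the two sample tables are hypothetical.

```python
import zlib

NUM_BUCKETS = 16  # same bucket count as the bucketBy example in this section

def bucket_id(key, num_buckets=NUM_BUCKETS):
    # Assumption: CRC32 stands in for Spark's own hash function; only the
    # "hash the key, take it modulo the bucket count" idea carries over.
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets

def bucketize(rows, num_buckets=NUM_BUCKETS):
    """Group (key, value) rows into a dict of bucket id -> rows."""
    buckets = {}
    for key, value in rows:
        buckets.setdefault(bucket_id(key, num_buckets), []).append((key, value))
    return buckets

# Two hypothetical tables that share the join column.
orders = [(k, f"order-{k}") for k in range(100)]
users = [(k, f"user-{k}") for k in range(0, 100, 2)]

orders_b = bucketize(orders)
users_b = bucketize(users)

# Join bucket by bucket: equal keys always land in the same bucket id,
# so no row ever has to be compared against a different bucket.
joined = []
for b, left_rows in orders_b.items():
    right = dict(users_b.get(b, []))
    joined.extend((k, v, right[k]) for k, v in left_rows if k in right)
```

Because both tables were laid out with the same function and the same bucket count, the join needs no redistribution of rows, which is the property Spark exploits to skip the exchange.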

Okay, do I really need to do an extra step if the shuffle is to be executed anyway?

If you join several times, then yes. 

**The more times you join, the better the performance gains.**

An example of how to create a bucketed table:

In [None]:
df.write\
    .bucketBy(16, 'key') \
    .sortBy('value') \
    .saveAsTable('bucketed', format='parquet')

Here, bucketBy distributes the data into a fixed number of buckets (16 in our case) and can be used when the number of unique values is unbounded. If the number of unique values is limited, it is better to use partitioning instead of bucketing.

In [None]:
t2 = spark.table('bucketed')
t3 = spark.table('bucketed')

# bucketed - bucketed join. 
# Both sides have the same bucketing, and no shuffles are needed.
t3.join(t2, 'key').explain()

Apart from the single-stage sort-merge join, bucketing also supports quick data sampling. As of Spark 2.4, Spark SQL supports bucket pruning to optimize filtering on the bucketed column (by reducing the number of bucket files to scan).

Bucketing works well when the number of unique values is unlimited. Columns that are often used in queries and provide high selectivity are a good choice for bucketing. Bucketed Spark tables store metadata about how they are bucketed and sorted, which helps optimize joins, aggregations, and queries for bucketed columns.

Reference
https://spark.apache.org/docs/latest/tuning.html
Uber Case Study: Choosing the Right HDFS File Format for Your Apache Spark Jobs
https://luminousmen.com/post/the-5-minute-guide-to-using-bucketing-in-pyspark