In [None]:
PySpark partitionBy() is a method of the pyspark.sql.DataFrameWriter class that splits a large dataset
(DataFrame) into smaller files based on one or more columns while writing to disk.

PySpark supports partitioning in two ways: in memory (DataFrame) and on disk (file system).

Partition in memory: You can partition or repartition the DataFrame by calling the repartition() or coalesce() transformations.
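
To make the difference between the two calls concrete, here is a plain-Python toy model. It is not Spark code and not Spark's implementation: partitions are modelled as lists, and the function names toy_repartition and toy_coalesce are hypothetical. The first redistributes every row the way a full shuffle might, the second only merges whole partitions and never increases their number.

```python
from itertools import chain

def toy_repartition(partitions, n):
    """Full-shuffle model: every row may move; rows are dealt out round-robin."""
    rows = list(chain.from_iterable(partitions))
    return [rows[i::n] for i in range(n)]

def toy_coalesce(partitions, n):
    """No-shuffle model: whole partitions are merged, never split.
    Asking for more partitions than exist leaves the layout unchanged."""
    if n >= len(partitions):
        return [list(p) for p in partitions]
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Neighbouring input partitions land in the same output partition.
        merged[i * n // len(partitions)].extend(part)
    return merged

parts = [[1, 2], [3, 4], [5, 6], [7, 8]]      # made-up rows in 4 partitions
repartitioned = toy_repartition(parts, 2)     # rows are interleaved
coalesced = toy_coalesce(parts, 2)            # neighbours are glued together
```

In PySpark itself the corresponding calls are df.repartition(n) and df.coalesce(n), and df.rdd.getNumPartitions() reports the resulting partition count.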

Partition on disk: While writing the PySpark DataFrame back to disk, you can choose how to partition the data based on columns
using partitionBy() of pyspark.sql.DataFrameWriter. This is similar to Hive's partitioning scheme.

Advantages of partitioning:
Fast access to the data
Ability to perform an operation on a smaller dataset


In [None]:
How is partitionBy() different from groupBy() in PySpark?
partitionBy() is used for physically organizing data on disk when writing to a file system, while groupBy() is used for the logical grouping of data within a DataFrame.

Can I use multiple columns with partitionBy()?
Yes. You can pass multiple columns to partitionBy() to create a hierarchical directory structure. For example:
df.write.partitionBy("column1", "column2").parquet("/path/to/output")
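
The hierarchical layout can be pictured with a small plain-Python sketch. It only builds the key=value directory names that a row would fall under; it does not call Spark, the helper partition_dir is hypothetical, and the sample rows are made up for illustration. Details such as escaping of special characters and the handling of null values are ignored.

```python
from pathlib import PurePosixPath

def partition_dir(base, row, partition_cols):
    """Build the nested key=value directory a row would be written under."""
    path = PurePosixPath(base)
    for col in partition_cols:
        path = path / f"{col}={row[col]}"
    return str(path)

# Made-up sample rows; only the partition columns matter for the layout.
rows = [
    {"id": 1, "state": "AL", "city": "SPRINGVILLE"},
    {"id": 2, "state": "AL", "city": "SPRINGVILLE"},
    {"id": 3, "state": "AZ", "city": "MESA"},
]

# One directory per distinct (state, city) combination.
layout = sorted({partition_dir("/path/to/output", r, ["state", "city"]) for r in rows})
for d in layout:
    print(d)
```

The order of the columns passed to partitionBy() decides the nesting: the first column becomes the top-level directory, the second sits beneath it.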

How does partitioning affect query performance?
Partitioning can significantly improve query performance when a query filters on the partition columns. Spark can then skip the directories that do not match the filter, which reduces the amount of data that has to be read and processed.
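
The idea of skipping irrelevant directories can be sketched in plain Python. This is an illustration of the principle only, not how Spark implements it; the helpers parse_partition_values and prune and the file names are hypothetical.

```python
def parse_partition_values(path):
    """Extract {column: value} from the key=value segments of a path."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)

def prune(paths, **filters):
    """Keep only the files whose partition values match every filter."""
    return [
        p for p in paths
        if all(parse_partition_values(p).get(k) == v for k, v in filters.items())
    ]

# Hypothetical file listing of a dataset partitioned by state and city.
files = [
    "/path/to/output/state=AL/city=SPRINGVILLE/part-0000.csv",
    "/path/to/output/state=AL/city=HUNTSVILLE/part-0000.csv",
    "/path/to/output/state=AZ/city=MESA/part-0000.csv",
]

# Only one of the three files has to be opened for this filter.
selected = prune(files, state="AL", city="SPRINGVILLE")
```

The decision is made from the directory names alone, so the files that are filtered out never need to be opened.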

In [3]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]") \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()

# Create DataFrame by reading CSV file
df=spark.read.option("header",True).csv("/home/jovyan/work/data/simple-zipcodes.csv")
df.printSchema()

# partitionBy() example: one directory per state
df.write.option("header",True).partitionBy("state").mode("overwrite").csv("/home/jovyan/work/data/raw/zipcodes-state")

# partitionBy() with multiple columns: nested state/city directories
# (mode("overwrite") replaces the state-only output written above)
df.write.option("header",True).partitionBy("state","city").mode("overwrite").csv("/home/jovyan/work/data/raw/zipcodes-state")


# Use repartition() and partitionBy() together
df.repartition(2).write.option("header",True).partitionBy("state").mode("overwrite").csv("/home/jovyan/work/data/raw/zipcodes-state-more")
    
# Data skew: cap the number of records per output file with maxRecordsPerFile.
# Written to a separate directory (a path chosen for this example) so that the
# state/city layout read below is not overwritten.
df.write.option("header",True).option("maxRecordsPerFile", 2).partitionBy("state").mode("overwrite").csv("/home/jovyan/work/data/raw/zipcodes-state-maxrecords")

# Read a single partition directly from its directory.
# Directory names are case-sensitive on Linux: if the output on disk shows
# State=AL/City=SPRINGVILLE, use that casing in the path.
dfSinglePart=spark.read.option("header",True).csv("/home/jovyan/work/data/raw/zipcodes-state/state=AL/city=SPRINGVILLE")
dfSinglePart.printSchema()
dfSinglePart.show()


# Read the whole partitioned dataset and filter it with SQL
partitionedDF = spark.read.option("header",True).csv("/home/jovyan/work/data/raw/zipcodes-state")
partitionedDF.createOrReplaceTempView("ZIPCODE")
spark.sql("select * from ZIPCODE where state='AL' and city = 'SPRINGVILLE'").show()

root
 |-- RecordNumber: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Zipcode: string (nullable = true)
 |-- State: string (nullable = true)
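
The maxRecordsPerFile option used in the cell above caps how many rows go into one output file. A rough plain-Python sketch of that chunking for the rows of a single partition, as handled by a single writer, could look like this. The function split_into_files is hypothetical, and the real number of files Spark produces also depends on how many tasks write to the same partition.

```python
def split_into_files(rows, max_records_per_file):
    """Chunk one partition's rows so that no file exceeds the cap."""
    if max_records_per_file <= 0:
        # Zero or a negative value means no limit: everything goes in one file.
        return [list(rows)]
    return [
        rows[i:i + max_records_per_file]
        for i in range(0, len(rows), max_records_per_file)
    ]

rows = list(range(5))               # five made-up records in one partition
files = split_into_files(rows, 2)   # same cap as in the cell above
sizes = [len(f) for f in files]
```

With a cap of 2, five records end up in three files, the last of which is only partly filled.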