In [None]:
PySpark SQL provides methods to read a Parquet file into a DataFrame and to write a DataFrame to Parquet files.
The parquet() method of DataFrameReader reads a Parquet file, and the parquet() method of DataFrameWriter writes (creates) one.
Parquet files store the schema along with the data, which makes the format well suited to structured data.

In [None]:
Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the 
choice of data processing framework, data model, or programming language.
When querying columnar storage, the engine can skip irrelevant columns quickly, which speeds up query execution. 
As a result, aggregation queries typically take less time than they do on row-oriented storage.
Parquet supports advanced nested data structures.
Parquet supports efficient compression options and encoding schemes.
It also reduces data storage, by a quoted 75% on average, although the actual saving depends on the data and the 
compression codec. PySpark supports Parquet out of the box, so no additional dependency libraries are needed.
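
The effect of a columnar layout on an aggregation can be sketched in plain Python. This is a toy illustration only, not how Parquet is implemented on disk: the names rows, names, columns, sum_salary_row_layout and sum_salary_column_layout are hypothetical, and "values read" is a simple counter standing in for I/O cost.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# Illustration only: Parquet's real on-disk format (row groups, pages,
# encodings) is far more involved than these Python lists.

# Hypothetical sample records: (firstname, lastname, gender, salary)
rows = [
    ("James", "Smith", "M", 3000),
    ("Michael", "", "M", 4000),
    ("Robert", "Williams", "M", 4000),
    ("Maria", "Jones", "F", 4000),
]
names = ["firstname", "lastname", "gender", "salary"]


def sum_salary_row_layout(records):
    """Sum salaries when each record is stored as one unit."""
    total = 0
    values_read = 0
    for record in records:
        values_read += len(record)  # the whole record is loaded
        total += record[3]          # but only the salary field is needed
    return total, values_read


# Columnar layout: all values of one column are stored together.
columns = {name: [record[i] for record in rows] for i, name in enumerate(names)}


def sum_salary_column_layout(cols):
    """Sum salaries when each column is stored as one unit."""
    salary = cols["salary"]  # only this column is touched
    return sum(salary), len(salary)


row_total, row_reads = sum_salary_row_layout(rows)
col_total, col_reads = sum_salary_column_layout(columns)
print(row_total, row_reads)
print(col_total, col_reads)
```

Both layouts give the same total, but the row layout touches all 16 stored values while the columnar layout touches only the 4 salary values. The gap grows with the number of columns that a query does not need.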

In [10]:
# Imports
from pyspark.sql import SparkSession

# Create a SparkSession, or reuse the active one
spark = SparkSession.builder.appName("parquetFile").getOrCreate()
# Sample rows: (firstname, middlename, lastname, dob, gender, salary)
data = [("James ", "", "Smith", "36636", "M", 3000),
        ("Michael ", "Rose", "", "40288", "M", 4000),
        ("Robert ", "", "Williams", "42114", "M", 4000),
        ("Maria ", "Anne", "Jones", "39192", "F", 4000),
        ("Jen", "Mary", "Brown", "", "F", -1)]
columns = ["firstname", "middlename", "lastname", "dob", "gender", "salary"]
df = spark.createDataFrame(data, columns)
df.printSchema()
# Write DataFrame to parquet file using write.parquet()
df.write.mode('overwrite').parquet("/home/jovyan/work/data/raw/parquetfiles/people.parquet")

# Read parquet file using read.parquet()
parDF=spark.read.parquet("/home/jovyan/work/data/raw/parquetfiles/people.parquet")
parDF.show(truncate=False)

# Save modes: 'append' adds the rows to the existing data,
# 'overwrite' replaces whatever is already at the path
df.write.mode('append').parquet("/home/jovyan/work/data/raw/parquetfiles/people.parquet")
df.write.mode('overwrite').parquet("/home/jovyan/work/data/raw/parquetfiles/people.parquet")

# Using spark.sql (this view is built on the DataFrame df, not on the Parquet file)
df.createOrReplaceTempView("ParquetTable")
parkSQL = spark.sql("select * from ParquetTable where salary >= 4000")
parkSQL.show()

# Create a temporary view directly on the Parquet file
# Drop the existing temporary view if it exists
spark.sql("DROP VIEW IF EXISTS PERSON")
spark.sql('CREATE TEMPORARY VIEW PERSON USING parquet OPTIONS (path "/home/jovyan/work/data/raw/parquetfiles/people.parquet")')
spark.sql("SELECT * FROM PERSON").show()

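# Write a partitioned Parquet dataset: one subdirectory per gender=.../salary=... combination
df.write.partitionBy("gender","salary").mode("overwrite").parquet("/home/jovyan/work/data/raw/parquetfiles/people2.parquet")

# Read a single partition from the partitioned dataset.
# Because the path points inside gender=M, the result has no gender column;
# set the reader's basePath option to the dataset root to keep it.
parDF2=spark.read.parquet("/home/jovyan/work/data/raw/parquetfiles/people2.parquet/gender=M")
parDF2.show(truncate=False)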

# Create a temporary view on one partition of the partitioned Parquet dataset
spark.sql("DROP VIEW IF EXISTS PERSON2")
spark.sql('CREATE TEMPORARY VIEW PERSON2 USING parquet OPTIONS (path "/home/jovyan/work/data/raw/parquetfiles/people2.parquet/gender=F")')
spark.sql("SELECT * FROM PERSON2").show()

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|dob  |gender|salary|
+---------+----------+--------+-----+------+------+
|Maria    |Anne      |Jones   |39192|F     |4000  |
|Robert   |          |Williams|42114|M     |4000  |
|Michael  |Rose      |        |40288|M     |4000  |
|James    |          |Smith   |36636|M     |3000  |
|Jen      |Mary      |Brown   |     |F     |-1    |
+---------+----------+--------+-----+------+------+

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|  dob|gender|salary|
+---------+----------+--------+-----+------+------+
| Michael |      Rose|        |40288|     M|  4000|
|  Robert |          |Williams|42114|     M|  4000|
|   Maria |      Anne|   Jones|39