## Read and Write Parquet File
PySpark SQL provides methods to read a Parquet file into a DataFrame and to write a DataFrame to Parquet files: the `parquet()` methods of `DataFrameReader` and `DataFrameWriter` read from and write (or create) a Parquet file, respectively. Parquet files store the schema along with the data, which makes the format well suited to structured data.

### Parquet files
Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

Advantages:
When a query runs against columnar storage, irrelevant columns can be skipped quickly, which speeds up query execution. As a result, aggregation queries take less time than they do on row-oriented storage.

It is able to support advanced nested data structures.

Parquet supports efficient compression options and encoding schemes.

PySpark SQL supports both reading and writing Parquet files and automatically captures the schema of the original data. Parquet's compression and encoding can also reduce data storage considerably (a figure of 75% on average is often quoted, though the actual saving depends on the data and the compression codec). PySpark supports Parquet out of the box, so no additional dependencies are needed.
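
To make the column-skipping advantage concrete, here is a toy model in plain Python. It is not Parquet's actual on-disk format, and the names `rows`, `columns`, `sum_salary_row_layout` and `sum_salary_column_layout` are hypothetical. The only point is that an aggregation over one column touches far fewer values when the data is laid out by column.

```python
# Toy model of row-oriented vs column-oriented storage (not Parquet's real format).
rows = [
    ("James", "M", 3000),
    ("Michael", "M", 4000),
    ("Maria", "F", 4000),
]

# Column-oriented layout: one list per column.
columns = {
    "firstname": [r[0] for r in rows],
    "gender": [r[1] for r in rows],
    "salary": [r[2] for r in rows],
}


def sum_salary_row_layout(rows):
    """Sum salaries from row storage; every field of every row is visited."""
    touched = 0
    total = 0
    for row in rows:
        touched += len(row)  # the whole record is read to reach one field
        total += row[2]
    return total, touched


def sum_salary_column_layout(columns):
    """Sum salaries from columnar storage; only the salary column is visited."""
    salaries = columns["salary"]
    return sum(salaries), len(salaries)


row_total, row_touched = sum_salary_row_layout(rows)
col_total, col_touched = sum_salary_column_layout(columns)
print(row_total, row_touched)  # 11000 9
print(col_total, col_touched)  # 11000 3
```

Both layouts give the same total, but the columnar version reads a third of the values. With wider tables the gap grows accordingly.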

### Create a test DataFrame

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('parquet').getOrCreate()


data =[("James ","","Smith","36636","M",3000),
       ("Michael ","Rose","","40288","M",4000),
       ("Robert ","","Williams","42114","M",4000),
       ("Maria ","Anne","Jones","39192","F",4000),
       ("Jen","Mary","Brown","","F",-1)]

columns=["firstname","middlename","lastname","dob","gender","salary"]

df = spark.createDataFrame(data,columns)
df.show()

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|  dob|gender|salary|
+---------+----------+--------+-----+------+------+
|   James |          |   Smith|36636|     M|  3000|
| Michael |      Rose|        |40288|     M|  4000|
|  Robert |          |Williams|42114|     M|  4000|
|   Maria |      Anne|   Jones|39192|     F|  4000|
|      Jen|      Mary|   Brown|     |     F|    -1|
+---------+----------+--------+-----+------+------+



### Write DataFrame to Parquet file format
Create a Parquet file from a PySpark DataFrame by calling the `parquet()` method of the `DataFrameWriter` class. When you write a DataFrame to Parquet, the column names and their data types are preserved automatically. Each part file PySpark creates has the `.parquet` file extension; the output path itself is a directory that holds those part files.

In [42]:
df.write.mode("overwrite").parquet("./resources/people.parquet")

In [43]:
!ls -al ./resources/people.parquet

total 20
drwxr-xr-x 2 javier javier 4096 Aug  8 20:29 .
drwxrwxr-x 5 javier javier 4096 Aug  8 20:29 ..
-rw-r--r-- 1 javier javier 1968 Aug  8 20:29 part-00000-466e2640-507a-4079-989c-50fd47e313c8-c000.snappy.parquet
-rw-r--r-- 1 javier javier   24 Aug  8 20:29 .part-00000-466e2640-507a-4079-989c-50fd47e313c8-c000.snappy.parquet.crc
-rw-r--r-- 1 javier javier    0 Aug  8 20:29 _SUCCESS
-rw-r--r-- 1 javier javier    8 Aug  8 20:29 ._SUCCESS.crc


### Read Parquet file into DataFrame
PySpark provides a `parquet()` method in the `DataFrameReader` class to read a Parquet file into a DataFrame.

In [44]:
parDF = spark.read.parquet("./resources/people.parquet")
parDF.show()

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|  dob|gender|salary|
+---------+----------+--------+-----+------+------+
|   James |          |   Smith|36636|     M|  3000|
| Michael |      Rose|        |40288|     M|  4000|
|  Robert |          |Williams|42114|     M|  4000|
|   Maria |      Anne|   Jones|39192|     F|  4000|
|      Jen|      Mary|   Brown|     |     F|    -1|
+---------+----------+--------+-----+------+------+



### Append or Overwrite an existing Parquet file
Using the `append` save mode, you can append a DataFrame to an existing Parquet file. To replace the existing data instead, use the `overwrite` save mode.

In [45]:
df.write.mode('append').parquet("./resources/people.parquet")

In [46]:
parDF = spark.read.parquet("./resources/people.parquet")
parDF.show()

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|  dob|gender|salary|
+---------+----------+--------+-----+------+------+
|   James |          |   Smith|36636|     M|  3000|
| Michael |      Rose|        |40288|     M|  4000|
|  Robert |          |Williams|42114|     M|  4000|
|   Maria |      Anne|   Jones|39192|     F|  4000|
|      Jen|      Mary|   Brown|     |     F|    -1|
|   James |          |   Smith|36636|     M|  3000|
| Michael |      Rose|        |40288|     M|  4000|
|  Robert |          |Williams|42114|     M|  4000|
|   Maria |      Anne|   Jones|39192|     F|  4000|
|      Jen|      Mary|   Brown|     |     F|    -1|
+---------+----------+--------+-----+------+------+



In [47]:
df.write.mode('overwrite').parquet("./resources/people.parquet")
parDF = spark.read.parquet("./resources/people.parquet")
parDF.show()

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|  dob|gender|salary|
+---------+----------+--------+-----+------+------+
|   James |          |   Smith|36636|     M|  3000|
| Michael |      Rose|        |40288|     M|  4000|
|  Robert |          |Williams|42114|     M|  4000|
|   Maria |      Anne|   Jones|39192|     F|  4000|
|      Jen|      Mary|   Brown|     |     F|    -1|
+---------+----------+--------+-----+------+------+



### Executing SQL queries on a DataFrame
PySpark SQL lets you create temporary views on Parquet files for executing SQL queries. These views are available until your program exits (more precisely, until the SparkSession that created them ends).

In [48]:
parDF.createOrReplaceTempView("ParquetTable")

parkSQL = spark.sql("select * from ParquetTable where salary >= 4000 ")
parkSQL.show(truncate=False)

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|dob  |gender|salary|
+---------+----------+--------+-----+------+------+
|Michael  |Rose      |        |40288|M     |4000  |
|Robert   |          |Williams|42114|M     |4000  |
|Maria    |Anne      |Jones   |39192|F     |4000  |
+---------+----------+--------+-----+------+------+



In [49]:
spark.sql("CREATE OR REPLACE TEMPORARY VIEW PERSON USING parquet OPTIONS (path \"./resources/people.parquet\")")
spark.sql("SELECT * FROM PERSON").show()

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|  dob|gender|salary|
+---------+----------+--------+-----+------+------+
|   James |          |   Smith|36636|     M|  3000|
| Michael |      Rose|        |40288|     M|  4000|
|  Robert |          |Williams|42114|     M|  4000|
|   Maria |      Anne|   Jones|39192|     F|  4000|
|      Jen|      Mary|   Brown|     |     F|    -1|
+---------+----------+--------+-----+------+------+



### Create a partitioned Parquet file
When we execute a query on the PERSON table, it scans through all the rows and returns the results, much like query execution in a traditional database. In PySpark, we can speed up such queries by partitioning the data on write with the `partitionBy()` method. Each distinct value of a partition column becomes a `column=value` subdirectory.

In [50]:
df.write.partitionBy("gender","salary").mode("overwrite").parquet("./resources/people2.parquet")

In [51]:
!ls -al ./resources/people2.parquet

total 20
drwxr-xr-x 4 javier javier 4096 Aug  8 20:29  .
drwxrwxr-x 5 javier javier 4096 Aug  8 20:29  ..
drwxr-xr-x 4 javier javier 4096 Aug  8 20:29 'gender=F'
drwxr-xr-x 4 javier javier 4096 Aug  8 20:29 'gender=M'
-rw-r--r-- 1 javier javier    0 Aug  8 20:29  _SUCCESS
-rw-r--r-- 1 javier javier    8 Aug  8 20:29  ._SUCCESS.crc
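
The listing shows only the first partition level; each `gender=...` directory in turn contains one `salary=...` subdirectory per distinct salary. The plain-Python sketch below imitates that layout for a few sample rows. It is a toy model, not Spark's implementation (it ignores the escaping Spark applies to special characters in values), and `partition_dir` is a hypothetical helper.

```python
from collections import defaultdict

# Sample rows, loosely based on the test DataFrame above.
rows = [
    {"firstname": "James", "gender": "M", "salary": 3000},
    {"firstname": "Michael", "gender": "M", "salary": 4000},
    {"firstname": "Maria", "gender": "F", "salary": 4000},
    {"firstname": "Jen", "gender": "F", "salary": -1},
]


def partition_dir(row, partition_cols):
    """Build the column=value/... directory a row would be written under."""
    return "/".join(f"{col}={row[col]}" for col in partition_cols)


# Group rows by their target directory, in partitionBy("gender", "salary") order.
layout = defaultdict(list)
for row in rows:
    layout[partition_dir(row, ["gender", "salary"])].append(row["firstname"])

for directory in sorted(layout):
    print(directory, layout[directory])
```

Since the partition values are encoded in the directory names, they do not need to be stored again inside the part files.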


### Retrieving from a partitioned Parquet file
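The example below reads only the `gender=M` partition of the partitioned Parquet file into a DataFrame. Because the path points directly at the partition directory, the `gender` column does not appear in the result.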

In [52]:
parDF2=spark.read.parquet("./resources/people2.parquet/gender=M")
parDF2.show(truncate=False)

+---------+----------+--------+-----+------+
|firstname|middlename|lastname|dob  |salary|
+---------+----------+--------+-----+------+
|James    |          |Smith   |36636|3000  |
|Michael  |Rose      |        |40288|4000  |
|Robert   |          |Williams|42114|4000  |
+---------+----------+--------+-----+------+



In [53]:
spark.sql("CREATE OR REPLACE TEMPORARY VIEW PERSON2 USING parquet OPTIONS (path \"./resources/people2.parquet/gender=F\")")
spark.sql("SELECT * FROM PERSON2").show()

+---------+----------+--------+-----+------+
|firstname|middlename|lastname|  dob|salary|
+---------+----------+--------+-----+------+
|   Maria |      Anne|   Jones|39192|  4000|
|      Jen|      Mary|   Brown|     |    -1|
+---------+----------+--------+-----+------+

