# PySpark - Parquet
PySpark SQL provides methods to read a Parquet file into a DataFrame and to write a DataFrame to Parquet files. The parquet() methods of DataFrameReader and DataFrameWriter read from and write to Parquet files, respectively. Parquet files store the schema along with the data, which makes the format well suited to structured data.

Below are simple statements showing how to write and read Parquet files in PySpark. Later sections explain them in detail.
```
df.write.parquet("/tmp/out/people.parquet")
parDF1=spark.read.parquet("/tmp/out/people.parquet")
```

## What is a Parquet File?
Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

## Advantages:
When querying columnar storage, the engine can skip irrelevant data quickly, which speeds up query execution. As a result, aggregation queries take less time than they do on row-oriented storage.

It is able to support advanced nested data structures.

Parquet supports efficient compression options and encoding schemes.

PySpark SQL supports both reading and writing Parquet files and automatically captures the schema of the original data. Parquet also reduces data storage by 75% on average. PySpark supports Parquet out of the box, so we don’t need to add any dependency libraries.

In [1]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]") \
          .appName("SparkByExamples.com") \
          .getOrCreate()

In [2]:
data =[("James ","","Smith","36636","M",3000),
              ("Michael ","Rose","","40288","M",4000),
              ("Robert ","","Williams","42114","M",4000),
              ("Maria ","Anne","Jones","39192","F",4000),
              ("Jen","Mary","Brown","","F",-1)]
columns=["firstname","middlename","lastname","dob","gender","salary"]
df=spark.createDataFrame(data,columns)

## PySpark Write DataFrame to Parquet file format
Now let’s create a Parquet file from a PySpark DataFrame by calling the parquet() method of the DataFrameWriter class. When you write a DataFrame to Parquet, the column names and their data types are preserved automatically. The output path is a directory, and each part file PySpark creates in it has the .parquet file extension. Below is an example.

In [3]:
df.write.parquet("../resources/tmp/output/people.parquet")

## PySpark Read Parquet file into DataFrame
PySpark provides a parquet() method in the DataFrameReader class to read a Parquet file into a DataFrame. Below is an example.

In [7]:
parqDF=spark.read.parquet("../resources/tmp/output/people.parquet")

## Append or Overwrite an existing Parquet file
Using the append save mode, you can append a DataFrame to an existing Parquet file. To replace the existing data instead, use the overwrite save mode.

In [5]:
df.write.mode('append').parquet("../resources/tmp/output/people.parquet")
df.write.mode('overwrite').parquet("../resources/tmp/output/people.parquet")

## Executing SQL queries on a DataFrame
PySpark SQL lets you create temporary views on Parquet data for executing SQL queries. These views are available until the SparkSession that created them ends.

In [8]:
parqDF.createOrReplaceTempView("ParquetTable")
parqSQL = spark.sql("select * from ParquetTable where salary >= 4000 ")

## Creating a table on Parquet file
Now let’s walk through executing SQL queries directly on a Parquet file. To do so, create a temporary view on the Parquet file itself instead of creating it from a DataFrame.

In [11]:
spark.sql("CREATE OR REPLACE TEMPORARY VIEW PERSON USING parquet OPTIONS (path '../resources/tmp/output/people.parquet')")
spark.sql("SELECT * FROM PERSON").show()

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|  dob|gender|salary|
+---------+----------+--------+-----+------+------+
|   James |          |   Smith|36636|     M|  3000|
| Michael |      Rose|        |40288|     M|  4000|
|  Robert |          |Williams|42114|     M|  4000|
|   Maria |      Anne|   Jones|39192|     F|  4000|
|      Jen|      Mary|   Brown|     |     F|    -1|
+---------+----------+--------+-----+------+------+



## Create Parquet partition file
When we execute a query on the PERSON table, it scans through all the rows and returns the results, much like a traditional database query. In PySpark, we can speed up such queries by partitioning the data on write with the partitionBy() method of DataFrameWriter. Following is an example of partitionBy().

In [12]:
df.write.partitionBy("gender","salary").mode("overwrite").parquet("../resources/tmp/output/people2.parquet")

## Retrieving from a partitioned Parquet file
The example below reads the partitioned Parquet file into a DataFrame for gender=M. Because the gender=M directory is read directly, the gender column does not appear in the result.

In [13]:
parDF2=spark.read.parquet("../resources/tmp/output/people2.parquet/gender=M")
parDF2.show(truncate=False)

+---------+----------+--------+-----+------+
|firstname|middlename|lastname|dob  |salary|
+---------+----------+--------+-----+------+
|James    |          |Smith   |36636|3000  |
|Michael  |Rose      |        |40288|4000  |
|Robert   |          |Williams|42114|4000  |
+---------+----------+--------+-----+------+



## Creating a table on Partitioned Parquet file
Here, I create a temporary view on one partition of the partitioned Parquet file. A query on this view reads only the files under gender=F, so it can run faster than the same query on the unpartitioned table.

In [14]:
spark.sql("CREATE OR REPLACE TEMPORARY VIEW PERSON2 USING parquet OPTIONS (path '../resources/tmp/output/people2.parquet/gender=F')")
spark.sql("SELECT * FROM PERSON2" ).show()

+---------+----------+--------+-----+------+
|firstname|middlename|lastname|  dob|salary|
+---------+----------+--------+-----+------+
|   Maria |      Anne|   Jones|39192|  4000|
|      Jen|      Mary|   Brown|     |    -1|
+---------+----------+--------+-----+------+

