# 3.2 Read and write Parquet files

The official doc can be found [here](https://spark.apache.org/docs/latest/sql-data-sources-parquet.html). Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

In [13]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
import os

In [2]:
local=True
if local:
    spark=SparkSession.builder.master("local[4]") \
                  .appName("ReadWriteParquet").getOrCreate()
else:
    spark=SparkSession.builder \
                      .master("k8s://https://kubernetes.default.svc:443") \
                      .appName("ReadWriteParquet") \
                      .config("spark.kubernetes.container.image","inseefrlab/jupyter-datascience:master") \
                      .config("spark.kubernetes.authenticate.driver.serviceAccountName",os.environ['KUBERNETES_SERVICE_ACCOUNT']) \
                      .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE']) \
                      .config("spark.executor.instances", "4") \
                      .config("spark.executor.memory","8g") \
                      .config("spark.kubernetes.driver.pod.name", os.environ["POD_NAME"]) \
                      .config('spark.jars.packages','org.postgresql:postgresql:42.2.24') \
                      .getOrCreate()



## 3.2.1 Read parquet file

A Parquet file contains rich metadata such as the schema and the compression codec, so reading it requires no configuration (e.g. separator, null value).
The command below uses `spark.read.parquet()` to read a Parquet file. Note that `file_path` can be a single Parquet file or a directory that contains several Parquet files.

In [3]:
file_path="data/adult.snappy.parquet"
df=spark.read.parquet(file_path)


In [5]:
df.show(5)

+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+
|age|       workclass|fnlwgt|education|education-num|    marital-status|       occupation| relationship| race|   sex|capital-gain|capital-loss|hours-per-week|native-country|income|
+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+
| 39|       State-gov| 77516|Bachelors|           13|     Never-married|     Adm-clerical|Not-in-family|White|  Male|        2174|           0|            40| United-States| <=50K|
| 50|Self-emp-not-inc| 83311|Bachelors|           13|Married-civ-spouse|  Exec-managerial|      Husband|White|  Male|           0|           0|            13| United-States| <=50K|
| 38|         Private|215646|  HS-grad|            9|          Divorced|Handlers-cleaners|Not-i

In [6]:
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: integer (nullable = true)
 |-- education: string (nullable = true)
 |-- education-num: integer (nullable = true)
 |-- marital-status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital-gain: integer (nullable = true)
 |-- capital-loss: integer (nullable = true)
 |-- hours-per-week: integer (nullable = true)
 |-- native-country: string (nullable = true)
 |-- income: string (nullable = true)



In [8]:
row_numbers=df.count()
print(f"row numbers: {row_numbers}")

row numbers: 32561


## 3.2.2 Partition discovery

In the above example, we read a single Parquet file. Spark can also read a partitioned Parquet dataset: all built-in file sources (including Text/CSV/JSON/ORC/Parquet) are able to discover and infer partitioning information automatically from the directory layout.
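
To make the idea concrete, here is a minimal pure-Python sketch of how `key=value` directory names can be turned into column values. It is an illustration only, not Spark's implementation: the function name `parse_partition_values` and the file name in the demo are hypothetical, and type inference is reduced to "integer or string", whereas Spark also infers other types such as dates.

```python
from pathlib import PurePosixPath


def parse_partition_values(file_path, base_path):
    """Extract partition column values from the key=value directories of a path."""
    relative = PurePosixPath(file_path).relative_to(base_path)
    values = {}
    # Every path segment except the last one (the file name) is a directory.
    for segment in relative.parts[:-1]:
        if "=" not in segment:
            continue
        key, _, raw = segment.partition("=")
        try:
            values[key] = int(raw)  # simplified inference: try integer first
        except ValueError:
            values[key] = raw  # otherwise keep the value as a string
    return values


# Hypothetical file name, used only for the demo
print(parse_partition_values(
    "/tmp/out_test/sex=Female/race=White/part-00000.snappy.parquet",
    "/tmp/out_test",
))
# {'sex': 'Female', 'race': 'White'}
```

In Spark, the type inference of partition columns can be turned off with `spark.sql.sources.partitionColumnTypeInference.enabled`, in which case the partition values are read as strings.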

## 3.2.3 Write parquet

When Spark writes Parquet files, you can set a few options:
- save mode: in this tutorial, we only use the overwrite mode. For more information, please check the [doc](./README.md)
- compression codec (see section 3.2.4)
- partition by column (see section 3.2.5)
- mergeSchema (a read-side option, see section 3.2.6)

The number of output Parquet files depends on the number of partitions of the DataFrame. The example below shows how the output files change when we change the partition number.

In [37]:
partition_number=df.rdd.getNumPartitions()
print(f"Data partition number: {partition_number}")

Data partition number: 1


In [38]:
output_path="/tmp/out_test"
df.write.mode("overwrite").parquet(output_path)

In [39]:
!ls -lah {output_path}

total 332K
drwxr-xr-x  2 pliu pliu 4.0K Feb 12 06:33 .
drwxrwxrwt 26 root root 4.0K Feb 12 06:33 ..
-rw-r--r--  1 pliu pliu 314K Feb 12 06:33 part-00000-513396ab-0d8a-4322-8ced-3e3f1c8265e2-c000.snappy.parquet
-rw-r--r--  1 pliu pliu 2.5K Feb 12 06:33 .part-00000-513396ab-0d8a-4322-8ced-3e3f1c8265e2-c000.snappy.parquet.crc
-rw-r--r--  1 pliu pliu    0 Feb 12 06:33 _SUCCESS
-rw-r--r--  1 pliu pliu    8 Feb 12 06:33 ._SUCCESS.crc


As you can see, the DataFrame above is written to a single Parquet file, because its partition number is 1. Now let's repartition the DataFrame and write it to Parquet again.

In [20]:
# check the execution plan
df.repartition(4,col("age")).explain()

== Physical Plan ==
Exchange hashpartitioning(age#0, 4), REPARTITION_WITH_NUM, [id=#151]
+- *(1) ColumnarToRow
   +- FileScan parquet [age#0,workclass#1,fnlwgt#2,education#3,education-num#4,marital-status#5,occupation#6,relationship#7,race#8,sex#9,capital-gain#10,capital-loss#11,hours-per-week#12,native-country#13,income#14] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/home/pliu/git/PySparkCommonFunc/notebooks/pysparkbasics/L03_ReadFromVario..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<age:int,workclass:string,fnlwgt:int,education:string,education-num:int,marital-status:stri...




In [21]:
# DataFrames are immutable: repartition() returns a new DataFrame, so assign the result to a variable
df1=df.repartition(4,col("age"))

In [22]:
# now you should see the partition number is 4
partition_number1=df1.rdd.getNumPartitions()
print(f"Data partition number: {partition_number1}")

Data partition number: 4


In [24]:
df1.write.mode("overwrite").parquet(output_path)

In [26]:
!ls {output_path}

part-00000-d8a1f2a5-4b5c-4f4c-8c04-d7972e3f9fcd-c000.snappy.parquet
part-00001-d8a1f2a5-4b5c-4f4c-8c04-d7972e3f9fcd-c000.snappy.parquet
part-00002-d8a1f2a5-4b5c-4f4c-8c04-d7972e3f9fcd-c000.snappy.parquet
part-00003-d8a1f2a5-4b5c-4f4c-8c04-d7972e3f9fcd-c000.snappy.parquet
_SUCCESS


Now we have 4 Parquet files instead of 1, because we changed the partition number of the DataFrame.
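
The `hashpartitioning(age#0, 4)` step in the physical plan can be pictured with a small pure-Python sketch. This is only an illustration under simplifying assumptions: the helper `hash_partition` is hypothetical and uses Python's built-in `hash`, while Spark uses its own hash function, so the real partition of a given age will differ.

```python
from collections import defaultdict


def hash_partition(rows, key, num_partitions):
    """Group rows into partitions by hash(row[key]) % num_partitions."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[hash(row[key]) % num_partitions].append(row)
    return dict(buckets)


rows = [{"age": 39}, {"age": 50}, {"age": 38}, {"age": 39}, {"age": 53}]
buckets = hash_partition(rows, "age", 4)
for partition_id in sorted(buckets):
    print(partition_id, [row["age"] for row in buckets[partition_id]])
# 1 [53]
# 2 [50, 38]
# 3 [39, 39]
```

Rows with the same age always land in the same partition. Each partition is written by its own task, which is why the four `part-0000N` files appear in the output directory.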

## 3.2.4 Write Parquet with a specific compression codec
As you can see, Spark writes Parquet files with snappy compression by default. If needed, you can change the compression codec.
The supported codec short names are:
- none, uncompressed
- snappy
- gzip
- lzo
- brotli
- lz4
- zstd

Note that these names are case-insensitive, and that the option, once set, overrides `spark.sql.parquet.compression.codec`.
Note also that not every codec ships with the Spark distribution: snappy and gzip work out of the box, while other codecs may require you to add the dependencies yourself.


In [34]:
compression_algo="gzip"

df.write\
    .mode("overwrite")\
    .option("parquet.compression",compression_algo) \
    .parquet(output_path)
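
Spark's Parquet writer also accepts the option name `compression`, which is the one described in the official data source options. The sketch below wraps it in a small helper that checks the codec name against the list above before writing. The helper name `write_parquet_with_codec` and the constant `PARQUET_CODECS` are hypothetical; only `mode`, `option` and `parquet` are Spark writer calls.

```python
# Codec short names listed in this section (case-insensitive in Spark)
PARQUET_CODECS = {"none", "uncompressed", "snappy", "gzip", "lzo", "brotli", "lz4", "zstd"}


def write_parquet_with_codec(df, path, codec):
    """Write df to path as Parquet with the given codec and return the normalized name."""
    name = codec.lower()
    if name not in PARQUET_CODECS:
        raise ValueError(f"unsupported parquet codec: {codec!r}")
    # The writer option takes precedence over spark.sql.parquet.compression.codec
    df.write.mode("overwrite").option("compression", name).parquet(path)
    return name


# Usage with the DataFrame and path of this notebook:
# write_parquet_with_codec(df, output_path, "gzip")
```

Failing early on an unknown name gives a clearer error than the one raised from inside the write job. A codec that passes this check can still fail at write time if its library is not on the classpath.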


In [36]:
!ls -lah {output_path}

total 260K
drwxr-xr-x  2 pliu pliu 4.0K Feb 12 06:24 .
drwxrwxrwt 26 root root 4.0K Feb 12 06:24 ..
-rw-r--r--  1 pliu pliu 242K Feb 12 06:24 part-00000-a56b9b54-2dd7-444a-8116-796d298f439d-c000.gz.parquet
-rw-r--r--  1 pliu pliu 1.9K Feb 12 06:24 .part-00000-a56b9b54-2dd7-444a-8116-796d298f439d-c000.gz.parquet.crc
-rw-r--r--  1 pliu pliu    0 Feb 12 06:24 _SUCCESS
-rw-r--r--  1 pliu pliu    8 Feb 12 06:24 ._SUCCESS.crc


Besides the Parquet data file, the output directory contains three more files:
```text
-rw-r--r--  1 pliu pliu 1.9K Feb 12 06:24 .part-00000-a56b9b54-2dd7-444a-8116-796d298f439d-c000.gz.parquet.crc
-rw-r--r--  1 pliu pliu    0 Feb 12 06:24 _SUCCESS
-rw-r--r--  1 pliu pliu    8 Feb 12 06:24 ._SUCCESS.crc
```
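These files are not data. `_SUCCESS` is an empty marker file that the output committer writes when the job completes successfully. The `.crc` files are checksums written next to each file by the Hadoop local file system, and they are used to detect data corruption.

You can disable the `_SUCCESS` marker and the Parquet summary metadata files by setting: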
```python
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
spark.conf.set("parquet.enable.summary-metadata", "false")
```
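
If you only want to see the data files of an output directory, you can also filter the listing yourself. Below is a minimal pure-Python sketch, assuming the convention that marker and checksum files start with `_` or `.` (Spark typically skips such names when it reads a directory). The helper `list_data_files` and the file names in the demo are hypothetical.

```python
import tempfile
from pathlib import Path


def list_data_files(output_dir):
    """Return the sorted names of the files that do not start with '_' or '.'."""
    return sorted(
        p.name
        for p in Path(output_dir).iterdir()
        if p.is_file() and not p.name.startswith(("_", "."))
    )


# Demo on a throwaway directory that mimics a Spark output folder
with tempfile.TemporaryDirectory() as tmp:
    for name in ["part-00000.gz.parquet", ".part-00000.gz.parquet.crc",
                 "_SUCCESS", "._SUCCESS.crc"]:
        (Path(tmp) / name).touch()
    print(list_data_files(tmp))
# ['part-00000.gz.parquet']
```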

## 3.2.5 Write Parquet with specific partition columns

In section 3.2.3, we only changed the partition number. We can do better, because a well-partitioned table (Parquet) helps Spark apply projection and predicate pushdown.

In the example below, we partition the DataFrame by column sex, then by column race.


In [40]:
df.write.partitionBy("sex","race").mode("overwrite").parquet(output_path)


In [42]:
!ls -R {output_path}

/tmp/out_test:
'sex=Female'  'sex=Male'   _SUCCESS

'/tmp/out_test/sex=Female':
'race=Amer-Indian-Eskimo'  'race=Black'  'race=White'
'race=Asian-Pac-Islander'  'race=Other'

'/tmp/out_test/sex=Female/race=Amer-Indian-Eskimo':
part-00000-5f34fe08-6254-4ed9-b8f0-1da94abf23c8.c000.snappy.parquet

'/tmp/out_test/sex=Female/race=Asian-Pac-Islander':
part-00000-5f34fe08-6254-4ed9-b8f0-1da94abf23c8.c000.snappy.parquet

'/tmp/out_test/sex=Female/race=Black':
part-00000-5f34fe08-6254-4ed9-b8f0-1da94abf23c8.c000.snappy.parquet

'/tmp/out_test/sex=Female/race=Other':
part-00000-5f34fe08-6254-4ed9-b8f0-1da94abf23c8.c000.snappy.parquet

'/tmp/out_test/sex=Female/race=White':
part-00000-5f34fe08-6254-4ed9-b8f0-1da94abf23c8.c000.snappy.parquet

'/tmp/out_test/sex=Male':
'race=Amer-Indian-Eskimo'  'race=Black'  'race=White'
'race=Asian-Pac-Islander'  'race=Other'

'/tmp/out_test/sex=Male/race=Amer-Indian-Eskimo':
part-00000-5f34fe08-6254-4ed9-b8f0-1da94abf23c8.c000.snappy.p

As you can see, it generates the directory tree below. Each directory and subdirectory contains only the rows that satisfy the condition given by its name. This helps Spark skip unnecessary reads.

```text
.
├── sex=Female
│   ├── race=Amer-Indian-Eskimo
│   ├── race=Asian-Pac-Islander
│   ├── race=Black
│   ├── race=Other
│   └── race=White
├── sex=Male
│   ├── race=Amer-Indian-Eskimo
│   ├── race=Asian-Pac-Islander
│   ├── race=Black
│   ├── race=Other
│   └── race=White

```
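
To read only one branch of this tree, you can point Spark at the subdirectory and pass the `basePath` option, so that `sex` and `race` are still returned as columns. The sketch below is an assumption-laden helper, not part of the original notebook: `partition_path` and `read_one_partition` are hypothetical names, and `spark` is expected to be an existing `SparkSession`.

```python
def partition_path(base_path, **partition_values):
    """Build the directory path of one partition, e.g. base/sex=Female/race=White."""
    sub_dirs = "/".join(f"{key}={value}" for key, value in partition_values.items())
    return f"{base_path.rstrip('/')}/{sub_dirs}"


def read_one_partition(spark, base_path, **partition_values):
    """Read a single partition directory while keeping the partition columns."""
    path = partition_path(base_path, **partition_values)
    # basePath tells partition discovery where the partitioned table starts
    return spark.read.option("basePath", base_path).parquet(path)


print(partition_path("/tmp/out_test", sex="Female", race="White"))
# /tmp/out_test/sex=Female/race=White

# Usage with the session and path of this notebook:
# read_one_partition(spark, output_path, sex="Female", race="White").show(5)
```

Reading the whole directory and filtering on `sex` or `race` should have a similar effect: the condition is then expected to appear under `PartitionFilters` in the output of `explain()`.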

## 3.2.6 Merge schema

Like Protocol Buffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.

Since schema merging is a relatively expensive operation, and is not a necessity in most cases, Spark has turned it off by default since 1.5.0. You may enable it by

- setting data source option mergeSchema to true when reading Parquet files (as shown in the examples below), or
- setting the global SQL option spark.sql.parquet.mergeSchema to true.
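
Conceptually, merging compatible schemas amounts to taking the union of their columns. The pure-Python sketch below illustrates that idea with schemas represented as `{column name: type name}` dictionaries. It is a simplification, not Spark's algorithm: the function `merge_schemas` is hypothetical, and any type mismatch is treated as an error.

```python
def merge_schemas(*schemas):
    """Merge {column: type} dictionaries, keeping the first-seen column order."""
    merged = {}
    for schema in schemas:
        for name, dtype in schema.items():
            if name in merged and merged[name] != dtype:
                raise ValueError(
                    f"incompatible types for column {name!r}: {merged[name]} vs {dtype}"
                )
            merged.setdefault(name, dtype)
    return merged


schema_1 = {"single": "long", "double": "long"}
schema_2 = {"single": "long", "triple": "long"}
print(merge_schemas(schema_1, schema_2))
# {'single': 'long', 'double': 'long', 'triple': 'long'}
```

A row coming from a file that lacks one of the merged columns gets a null value for that column.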

### Merge schema example

In this example, we create two data frames:
- df1 has two columns: single, double
- df2 has two columns: single, triple



In [43]:
from pyspark import Row

sc = spark.sparkContext
schema_output_path1= f"{output_path}/merge_schema/test_table/key=1"
schema_output_path2= f"{output_path}/merge_schema/test_table/key=2"
df1 = spark.createDataFrame(sc.parallelize(range(1, 6)).map(lambda i: Row(single=i, double=i ** 2)))
df1.show()
df2 = spark.createDataFrame(sc.parallelize(range(6, 11)).map(lambda i: Row(single=i, triple=i ** 3)))
df2.show()
# then we write each data frame to its own partition folder: key=1 and key=2
df1.write.parquet(schema_output_path1)
df2.write.parquet(schema_output_path2)


+------+------+
|single|double|
+------+------+
|     1|     1|
|     2|     4|
|     3|     9|
|     4|    16|
|     5|    25|
+------+------+

+------+------+
|single|triple|
+------+------+
|     6|   216|
|     7|   343|
|     8|   512|
|     9|   729|
|    10|  1000|
+------+------+



The above code generates the following Parquet files:

```text
merge_schema/
└── test_table
    ├── key=1
    │   ├── part-00000-6ec4cd58-26fd-4398-a910-66eae8699174-c000.snappy.parquet
    │   ├── part-00001-6ec4cd58-26fd-4398-a910-66eae8699174-c000.snappy.parquet
    │   ├── part-00002-6ec4cd58-26fd-4398-a910-66eae8699174-c000.snappy.parquet
    │   ├── part-00003-6ec4cd58-26fd-4398-a910-66eae8699174-c000.snappy.parquet
    │   └── _SUCCESS
    └── key=2
        ├── part-00000-892aa2e3-753f-4ebe-8136-51c735b0a46a-c000.snappy.parquet
        ├── part-00001-892aa2e3-753f-4ebe-8136-51c735b0a46a-c000.snappy.parquet
        ├── part-00002-892aa2e3-753f-4ebe-8136-51c735b0a46a-c000.snappy.parquet
        ├── part-00003-892aa2e3-753f-4ebe-8136-51c735b0a46a-c000.snappy.parquet
        └── _SUCCESS

```

If we read the parent folder (`merge_schema/test_table`) that contains the partition folders (`key=1`, `key=2`), the partition key becomes a column, which we call a partition column.

As the data frames in the two partition folders have different schemas, we need to set mergeSchema to true. Otherwise, Spark only uses the schema of the first Parquet file that it reads, and drops all columns of the other Parquet files that do not exist in that schema. Check the data frame below.

In [48]:
parent_path=f"{output_path}/merge_schema/test_table"
# mergeSchema is false by default, so this option is not required; we set it only for the clarity of the example
unmergedDF = spark.read.option("mergeSchema", "false").parquet(parent_path)
unmergedDF.printSchema()
unmergedDF.show()

root
 |-- single: long (nullable = true)
 |-- double: long (nullable = true)
 |-- key: integer (nullable = true)

+------+------+---+
|single|double|key|
+------+------+---+
|     9|  null|  2|
|    10|  null|  2|
|     4|    16|  1|
|     5|    25|  1|
|     6|  null|  2|
|     8|  null|  2|
|     7|  null|  2|
|     1|     1|  1|
|     2|     4|  1|
|     3|     9|  1|
+------+------+---+



As you can see, it only has three columns: single, double, key. The first two columns come from the schema of the first Parquet partition. The column key indicates whether the row comes from partition `key=1` or `key=2`.

Now let's set `mergeSchema` to true. The mergedDF now has four columns.

In [49]:
mergedDF = spark.read.option("mergeSchema", "true").parquet(parent_path)
mergedDF.printSchema()
mergedDF.show()

root
 |-- single: long (nullable = true)
 |-- double: long (nullable = true)
 |-- triple: long (nullable = true)
 |-- key: integer (nullable = true)

+------+------+------+---+
|single|double|triple|key|
+------+------+------+---+
|     9|  null|   729|  2|
|    10|  null|  1000|  2|
|     4|    16|  null|  1|
|     5|    25|  null|  1|
|     6|  null|   216|  2|
|     8|  null|   512|  2|
|     7|  null|   343|  2|
|     1|     1|  null|  1|
|     2|     4|  null|  1|
|     3|     9|  null|  1|
+------+------+------+---+

