In [1]:
import pyarrow.parquet as pq
import pyarrow as pa
import pandas as pd

# 1. Add Custom metadata in parquet

In this tutorial, we will add custom metadata to a Parquet file.

## 1.1 Pandas dataframe use case

The data source is a pandas dataframe. We convert it to a pyarrow table, then write the table to a Parquet file.

In [4]:
data={
    'name': ["Alice", "Bob", "Charlie", "Foo"],
    'age': [20, 21, 22, 23],
    'sex': ["F","M","M","F"]
}

In [5]:
df=pd.DataFrame.from_dict(data)

In [6]:
df.head()

| | name | age | sex |
|---|---|---|---|
| 0 | Alice | 20 | F |
| 1 | Bob | 21 | M |
| 2 | Charlie | 22 | M |
| 3 | Foo | 23 | F |


## Check default metadata

Convert the pandas dataframe to a pyarrow table.

In [7]:
table = pa.Table.from_pandas(df)

In [8]:
# check whether the table already carries some default metadata
print(table.schema.metadata)

{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 4, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "name", "field_name": "name", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "age", "field_name": "age", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": "sex", "field_name": "sex", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}], "creator": {"library": "pyarrow", "version": "7.0.0"}, "pandas_version": "1.4.2"}'}


In the above output, we can see that the table carries one metadata entry. The key is **"pandas"** and the value is a **JSON document** that describes the index and the columns. Note that both the key and the value are stored as bytes.

In fact, pyarrow uses this metadata when we convert the table back to a pandas dataframe.
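Because the value under the pandas key is JSON text encoded as bytes, it can be parsed with the standard json module. The sketch below uses a shortened stand-in for table.schema.metadata (the variable schema_metadata and its content are assumptions made for illustration); with a real table, the same two steps apply to the full value.

```python
import json

# Shortened stand-in for table.schema.metadata; the real value is the longer
# JSON document that pyarrow generates from the dataframe.
schema_metadata = {
    b"pandas": b'{"columns": [{"name": "age", "pandas_type": "int64"}], '
               b'"pandas_version": "1.4.2"}'
}

# The value is bytes holding UTF-8 JSON: decode to str first, then parse.
pandas_meta = json.loads(schema_metadata[b"pandas"].decode("utf-8"))

# Map each column name to the pandas type recorded for it.
column_types = {col["name"]: col["pandas_type"] for col in pandas_meta["columns"]}
print(column_types)
print(pandas_meta["pandas_version"])
```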

## Add custom metadata

**Arrow tables are immutable**, so we need to construct a new Arrow table if we want to add custom metadata. The metadata of the new table is the combination of the existing metadata and the custom metadata that we want to add.

As shown above, the metadata is stored as **bytes**, so we need to convert each key-value pair to bytes too.

In [10]:
origin_meta=table.schema.metadata
my_meta_key="data_provider"
my_meta_value="Pengfei liu"
new_meta = {
    my_meta_key.encode() : my_meta_value.encode(),
    **origin_meta
}

In [None]:
new_table = table.replace_schema_metadata(new_meta)
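The merge step can also be wrapped in a small helper. This is only a sketch: merge_metadata is a hypothetical name, not a pyarrow function, and origin_meta below is a stand-in for table.schema.metadata. Unlike the cell above, the helper lets the custom entries win when a key already exists, and it accepts None because a schema without metadata reports None.

```python
def merge_metadata(existing, custom):
    """Return a new bytes-keyed dict combining existing and custom metadata.

    Hypothetical helper for illustration; custom entries win on key conflicts.
    """
    def to_bytes(item):
        if isinstance(item, bytes):
            return item
        if isinstance(item, str):
            return item.encode("utf-8")
        raise TypeError(f"metadata must be str or bytes, got {type(item).__name__}")

    # Normalise both sides to bytes so that "key" and b"key" cannot coexist.
    merged = {to_bytes(k): to_bytes(v) for k, v in (existing or {}).items()}
    merged.update({to_bytes(k): to_bytes(v) for k, v in custom.items()})
    return merged


origin_meta = {b"pandas": b"{}"}  # stand-in for table.schema.metadata
new_meta = merge_metadata(origin_meta, {"data_provider": "Pengfei liu"})
print(new_meta)
```

The resulting dictionary can be passed to table.replace_schema_metadata in the same way as new_meta above.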

In newer versions of pyarrow, we don't need to convert strings ourselves; pyarrow does the conversion automatically.

In the example below, both the key and the value are plain strings (note that the value is the string "1", not an int).

In [43]:
my_meta_key1="data_version"
my_meta_value1="1"
str_meta = {
    my_meta_key1 : my_meta_value1,
    **origin_meta
}

In [44]:
new_table1=table.replace_schema_metadata(str_meta)

Now let's check the newly added metadata.

In [46]:
# manual byte conversion
print(new_table.schema.metadata)

{b'data_provider': b'Pengfei liu', b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 4, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "name", "field_name": "name", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "age", "field_name": "age", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": "sex", "field_name": "sex", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}], "creator": {"library": "pyarrow", "version": "7.0.0"}, "pandas_version": "1.4.2"}'}


In [45]:
# auto conversion
print(new_table1.schema.metadata)

{b'data_version': b'1', b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 4, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "name", "field_name": "name", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "age", "field_name": "age", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": "sex", "field_name": "sex", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}], "creator": {"library": "pyarrow", "version": "7.0.0"}, "pandas_version": "1.4.2"}'}


### Check other types for key and value

In this section, we test a type other than string for the value.

In [65]:
my_meta_key2="data_version"
my_meta_value2=1
int_meta = {
    my_meta_key2 : my_meta_value2.to_bytes(2, byteorder='big'),
    **origin_meta
}

In [66]:
new_table2=table.replace_schema_metadata(int_meta)

In [67]:
print(new_table2.schema.metadata)

{b'data_version': b'\x00\x01', b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 4, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "name", "field_name": "name", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "age", "field_name": "age", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": "sex", "field_name": "sex", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}], "creator": {"library": "pyarrow", "version": "7.0.0"}, "pandas_version": "1.4.2"}'}


In [68]:
raw_value=new_table2.schema.metadata[my_meta_key2.encode()]
value=int.from_bytes(raw_value, "big")
print(f"the data_version is : {value}")

the data_version is : 1


Conclusion: the keys and values are stored as bytes. In theory, they can hold any serializable type. But if you don't know the original type, you will have trouble converting the bytes back to it. So we don't recommend using any type other than string.
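One way to keep every value a string while still recovering the original type is to store each value as JSON text. The following is a minimal sketch of that idea, not something the tutorial above prescribes; the keys validated and tags are made up for illustration.

```python
import json

# Typed values we would like to keep alongside the data (illustrative only).
custom = {"data_version": 1, "validated": True, "tags": ["demo", "people"]}

# Serialise every value to JSON text, then encode key and value as UTF-8 bytes.
encoded = {
    key.encode("utf-8"): json.dumps(value).encode("utf-8")
    for key, value in custom.items()
}

# Reading back: decode the bytes and let JSON restore int, bool, list, ...
decoded = {
    key.decode("utf-8"): json.loads(value.decode("utf-8"))
    for key, value in encoded.items()
}
print(encoded)
print(decoded)
```

The reader only needs to know that values are JSON, not the type of each individual key.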

## Test multiple metadata

In the example below, we test multiple key-value pairs. What happens if we have a duplicate key?

In [70]:
my_meta_key3="key3"
my_meta_value3="value3"
multi_meta = {
    my_meta_key : my_meta_value,
    my_meta_key1 : my_meta_value1,
    my_meta_key2 : my_meta_value2.to_bytes(2, byteorder='big'),
    my_meta_key3 : my_meta_value3,
    **origin_meta
}

In [71]:
new_table3=table.replace_schema_metadata(multi_meta)

In [72]:
print(new_table3.schema.metadata)

{b'data_provider': b'toto', b'data_version': b'\x00\x01', b'key3': b'value3', b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 4, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "name", "field_name": "name", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "age", "field_name": "age", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": "sex", "field_name": "sex", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}], "creator": {"library": "pyarrow", "version": "7.0.0"}, "pandas_version": "1.4.2"}'}


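In the above output, data_provider holds toto instead of Pengfei liu. That value is not set anywhere in the cells shown here, so my_meta_value was presumably reassigned in a cell that is not included. The duplicate key that is visible in the code is data_version: my_meta_key1 and my_meta_key2 are both "data_version", and the later entry (b'\x00\x01') overwrites the earlier one ("1"). So if we enter a duplicate key in the metadata dictionary, the latest value overwrites the existing one. This is standard Python dict behaviour and happens before pyarrow receives the dictionary.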
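The overwrite rule is plain Python dict behaviour, so it can be checked without pyarrow. The sketch below uses a stand-in origin_meta (an assumption for illustration) to show that the position of **origin_meta decides which side wins, and that a str key and a bytes key are different dict keys.

```python
# Stand-in for table.schema.metadata, already keyed by bytes.
origin_meta = {b"data_provider": b"original"}

# Custom entry first, existing metadata unpacked last: the existing value wins.
existing_wins = {b"data_provider": b"custom", **origin_meta}

# Existing metadata unpacked first, custom entry last: the custom value wins.
custom_wins = {**origin_meta, b"data_provider": b"custom"}

# "data_provider" (str) and b"data_provider" (bytes) are distinct dict keys,
# so this dictionary holds two entries and nothing is overwritten.
mixed = {"data_provider": "custom", **origin_meta}

print(existing_wins)
print(custom_wins)
print(len(mixed))
```

What pyarrow does with a dictionary like mixed is not covered here; encoding all keys to bytes before merging avoids the question.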

## Write the table to parquet


In [3]:
output_path="../../data/custom_meta.parquet"
compression_algo='GZIP'

In [18]:
pq.write_table(new_table, output_path, compression=compression_algo)

## Read metadata from the parquet file

Note that custom metadata is only a part of the Parquet metadata. The main part of the metadata is generated automatically to manage the row groups, column chunks, statistics, etc.

The example below shows how to get the main metadata.

In [4]:
pf=pq.ParquetFile(output_path)

In [5]:
# get the main metadata
print(pf.metadata)

<pyarrow._parquet.FileMetaData object at 0x7f1057066f90>
  created_by: parquet-cpp-arrow version 7.0.0
  num_columns: 3
  num_rows: 4
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 2326


In [11]:
# get metadata of row group
print(pf.metadata.row_group(0))

<pyarrow._parquet.RowGroupMetaData object at 0x7f1056fd3360>
  num_columns: 3
  num_rows: 4
  total_byte_size: 268


In [12]:
# get metadata of column chunk
print(pf.metadata.row_group(0).column(0))

<pyarrow._parquet.ColumnChunkMetaData object at 0x7f1056ccb3b0>
  file_offset: 129
  file_path: 
  physical_type: BYTE_ARRAY
  num_values: 4
  path_in_schema: name
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x7f1056ccbd10>
      has_min_max: True
      min: Alice
      max: Foo
      null_count: 0
      distinct_count: 0
      num_values: 4
      physical_type: BYTE_ARRAY
      logical_type: String
      converted_type (legacy): UTF8
  compression: GZIP
  encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
  has_dictionary_page: True
  dictionary_page_offset: 4
  data_page_offset: 66
  total_compressed_size: 125
  total_uncompressed_size: 91


In [15]:
# get the statistics of column 1 (age)
print(pf.metadata.row_group(0).column(1).statistics)

<pyarrow._parquet.Statistics object at 0x7f1056cee450>
  has_min_max: True
  min: 20
  max: 23
  null_count: 0
  distinct_count: 0
  num_values: 4
  physical_type: INT64
  logical_type: None
  converted_type (legacy): NONE


In [6]:
# get the schema
print(pf.schema)

<pyarrow._parquet.ParquetSchema object at 0x7f1057135b40>
required group field_id=-1 schema {
  optional binary field_id=-1 name (String);
  optional int64 field_id=-1 age;
  optional binary field_id=-1 sex (String);
}



In [8]:
# get the custom metadata
print(pf.schema_arrow)

name: string
age: int64
sex: string
-- schema metadata --
data_provider: 'Pengfei liu'
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 575


In the above output, the schema returned by schema_arrow is inferred from the Parquet file (official doc: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html), and its printed form is not complete: the pandas value is truncated (note the trailing + 575).

To print the complete raw custom metadata, we can read the file into a table and use the metadata dictionary of its schema.

In [9]:
read_table=pq.read_table(output_path)

In [10]:
print(read_table.schema.metadata)

{b'data_provider': b'Pengfei liu', b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 4, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "name", "field_name": "name", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "age", "field_name": "age", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": "sex", "field_name": "sex", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}], "creator": {"library": "pyarrow", "version": "7.0.0"}, "pandas_version": "1.4.2"}'}


The above output confirms that the custom metadata survives the write and the read.

Let's check whether we can get a specific metadata value by its key.

In [22]:
key_byte=my_meta_key.encode()
print(f"The key: {key_byte}, The value: {read_table.schema.metadata[key_byte]}")

The key: b'data_provider', The value: b'Pengfei liu'
