In [3]:
import pyarrow.csv as pv
import pyarrow.parquet as pq
import pyarrow as pa
import pandas as pd

# 1. Add custom metadata to Parquet

In this tutorial, we add custom metadata to a Parquet file.

## 1.1 Pandas dataframe use case

The data source is a pandas DataFrame. We convert it to a pyarrow table, then write the table to a Parquet file.

In [4]:
data={
    'name': ["Alice", "Bob", "Charlie", "Foo"],
    'age': [20, 21, 22, 23],
    'sex': ["F","M","M","F"]
}

In [5]:
df=pd.DataFrame.from_dict(data)

In [6]:
df.head()

|   | name | age | sex |
|---|---|---|---|
| 0 | Alice | 20 | F |
| 1 | Bob | 21 | M |
| 2 | Charlie | 22 | M |
| 3 | Foo | 23 | F |


## Check default metadata

Convert the pandas DataFrame to a pyarrow table.

In [7]:
table = pa.Table.from_pandas(df)

In [8]:
# Check whether the table already carries some default metadata
print(table.schema.metadata)

{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 4, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "name", "field_name": "name", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "age", "field_name": "age", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": "sex", "field_name": "sex", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}], "creator": {"library": "pyarrow", "version": "7.0.0"}, "pandas_version": "1.4.2"}'}


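The output above contains a single metadata entry. Its key is **"pandas"** and its value is a **JSON document** describing the index, the columns, and the library versions. Note that both the key and the value are stored as bytes, so the value is not a Python dictionary until it is decoded.

pyarrow uses this metadata to restore the index and column types when the table is converted back to a pandas DataFrame.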

## Add custom metadata

**Arrow tables are immutable**, so we cannot modify the metadata of an existing table in place. Instead, we construct a new Arrow table whose metadata combines the existing metadata with the custom metadata we want to add.

In [10]:
origin_meta=table.schema.metadata
my_meta_key="data_provider"
my_meta_value="Pengfei liu"
new_meta = {
    my_meta_key.encode() : my_meta_value.encode(),
    **origin_meta
}

In [11]:
new_table = table.replace_schema_metadata(new_meta)

Now let's check the newly added metadata.

In [12]:
print(new_table.schema.metadata)

{b'data_provider': b'Pengfei liu', b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 4, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "name", "field_name": "name", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "age", "field_name": "age", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": "sex", "field_name": "sex", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}], "creator": {"library": "pyarrow", "version": "7.0.0"}, "pandas_version": "1.4.2"}'}


## Write the table to parquet


In [13]:
output_path="../../data/custom_meta.parquet"
compression_algo='GZIP'

In [18]:
pq.write_table(new_table, output_path, compression=compression_algo)

## Read the parquet file

Now let's read the Parquet file and check whether the custom metadata was preserved.

In [19]:
read_table=pq.read_table(output_path)

In [20]:
print(read_table.schema.metadata)

{b'data_provider': b'Pengfei liu', b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 4, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "name", "field_name": "name", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "age", "field_name": "age", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": "sex", "field_name": "sex", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}], "creator": {"library": "pyarrow", "version": "7.0.0"}, "pandas_version": "1.4.2"}'}


The output above confirms that the custom metadata survives the round trip through the Parquet file.

Let's check whether we can retrieve a specific metadata value by its key.

In [22]:
key_byte=my_meta_key.encode()
print(f"The key: {key_byte}, The value: {read_table.schema.metadata[key_byte]}")

The key: b'data_provider', The value: b'Pengfei liu'
