## Read partitioned data

A Parquet dataset can be partitioned by the values of one or more columns. In the example below, we write a dataset with three partition levels:
 1. sex
 2. year
 3. dep

We need to check whether DuckDB can read the partitioned Parquet dataset correctly.
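As a rough illustration of what partitioning by column means on disk: Spark's `partitionBy` writes Hive-style directories, one `column=value` level per partition column. The helper below is a hypothetical sketch (`partition_path` is not part of any library) that computes the relative directory a given row would land in.

```python
def partition_path(row, partition_cols):
    """Return the Hive-style relative directory for one row.

    `row` is a plain dict; `partition_cols` lists the partition columns
    from the outermost level to the innermost one.
    """
    return "/".join(f"{col}={row[col]}" for col in partition_cols)


# First row of the sample data used in this notebook.
row = {"id": 1, "name": "toto", "sex": "F", "year": 2002, "dep": 1}
print(partition_path(row, ["sex", "year", "dep"]))  # sex=F/year=2002/dep=1
```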


In [2]:
from pyspark.sql import SparkSession
import os

local=True
if local:
    spark = SparkSession.builder\
        .master("local[4]")\
        .appName("RepartitionAndCoalesce")\
        .config("spark.executor.memory", "2g")\
        .getOrCreate()
else:
    spark = SparkSession.builder\
        .master("k8s://https://kubernetes.default.svc:443")\
        .appName("RepartitionAndCoalesce")\
        .config("spark.kubernetes.container.image", "inseefrlab/jupyter-datascience:master")\
        .config("spark.kubernetes.authenticate.driver.serviceAccountName", os.environ['KUBERNETES_SERVICE_ACCOUNT'])\
        .config("spark.executor.instances", "4")\
        .config("spark.executor.memory","2g")\
        .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE'])\
        .getOrCreate()

In [4]:
csv_path="../data/people.csv"
output_path="../data/people_partiton"

In [7]:
df=spark.read\
    .options(header=True,inferSchema=True,delimiter=',',nullValue="?")\
    .csv(path=csv_path)
df.show(5)


+---+----+---+----+---+
| id|name|sex|year|dep|
+---+----+---+----+---+
|  1|toto|  F|2002|  1|
|  2|toto|  F|2003|  1|
|  3|toto|  F|2004|  1|
|  4|titi|  M|2002|  2|
|  5|titi|  M|2003|  2|
+---+----+---+----+---+
only showing top 5 rows



In [8]:

df.write.partitionBy("sex","year","dep").mode("overwrite").parquet(output_path)
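To confirm that the write produced the expected three-level layout, one can list the leaf directories that contain Parquet files. This is a minimal sketch using only the standard library; `list_partition_dirs` is a hypothetical helper name, not a Spark or DuckDB function.

```python
import os


def list_partition_dirs(root):
    """Return sorted relative paths of directories under `root` holding .parquet files."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if any(name.endswith(".parquet") for name in filenames):
            # Normalise separators so the result looks the same on every OS.
            found.append(os.path.relpath(dirpath, root).replace(os.sep, "/"))
    return sorted(found)
```

Calling `list_partition_dirs(output_path)` after the write should return entries of the form `sex=F/year=2002/dep=1`.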


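According to the Stack Overflow thread below, DuckDB could not read a Hive-partitioned Parquet directory directly, so we load it through a PyArrow dataset instead:
https://stackoverflow.com/questions/71952623/reading-partitioned-parquet-files-in-duckdb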
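More recent DuckDB releases document a `hive_partitioning` option on `read_parquet`, which may make the PyArrow detour unnecessary. The following is a sketch under that assumption and was not run in this notebook; `hive_glob` and `read_partitioned` are hypothetical helper names, and the default glob assumes exactly three partition levels.

```python
def hive_glob(root, levels=3):
    """Build a glob matching Parquet files nested `levels` partition directories deep."""
    return root.rstrip("/") + "/*" * levels + "/*.parquet"


def read_partitioned(root, levels=3):
    """Read a Hive-partitioned Parquet directory with DuckDB into a pandas DataFrame."""
    import duckdb  # local import: only needed when the query actually runs

    query = (
        f"SELECT * FROM read_parquet('{hive_glob(root, levels)}', "
        "hive_partitioning = true)"
    )
    return duckdb.connect().execute(query).df()
```

With this option, `read_partitioned(output_path)` should return the partition columns (`sex`, `year`, `dep`) as regular columns alongside `id` and `name`.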

In [1]:
import duckdb

In [6]:
# Connect to DuckDB
conn = duckdb.connect()


In [7]:
import pyarrow.dataset as ds

dataset = ds.dataset(output_path, format="parquet", partitioning="hive")
conn.register_arrow("Hierarchy", dataset)
conn.execute("Select * from Hierarchy").df()

AttributeError: 'duckdb.duckdb.DuckDBPyConnection' object has no attribute 'register_arrow'
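
The error above arises because the connection object has no `register_arrow` method in this DuckDB version. A minimal sketch of a likely fix, assuming the generic `register` method (which accepts Arrow objects as well as pandas DataFrames) is available; `query_arrow_dataset` and `load_people` are hypothetical helper names.

```python
def query_arrow_dataset(conn, dataset, view_name="Hierarchy"):
    """Register an Arrow dataset as a view on `conn` and return all rows as a DataFrame."""
    # `register` exposes a Python object (here an Arrow dataset) under a view name.
    conn.register(view_name, dataset)
    return conn.execute(f'SELECT * FROM "{view_name}"').df()


def load_people(output_path="../data/people_partiton"):
    """Open the Hive-partitioned dataset with PyArrow and query it through DuckDB."""
    # Local imports keep the helper above usable without these packages installed.
    import duckdb
    import pyarrow.dataset as ds

    dataset = ds.dataset(output_path, format="parquet", partitioning="hive")
    return query_arrow_dataset(duckdb.connect(), dataset)
```

In the notebook itself, the equivalent one-line change would be replacing `conn.register_arrow("Hierarchy", dataset)` with `conn.register("Hierarchy", dataset)`.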