# Using the DuckDB Python API

In this notebook, we use the DuckDB Python API to explore some of DuckDB's functionality.

In [1]:
import pandas as pd
import duckdb

In [2]:
root_dir = "/home/pengfei/data_set/demo_chu"
csv_file_path = f"{root_dir}/pathologies.csv"
parquet_file_path = f"{root_dir}/pathologies.parquet"

In [3]:
! ls -lah /home/pengfei/data_set/demo_chu

total 683M
drwxrwxr-x  2 pengfei pengfei 4.0K May 22 10:25 .
drwxrwxr-x 14 pengfei pengfei 4.0K May 22 10:25 ..
-rw-rw-r--  1 pengfei pengfei 657M May 16 15:54 pathologies.csv
-rw-rw-r--  1 pengfei pengfei  26M May 17 09:03 pathologies.parquet


## Create a DuckDB instance

As mentioned in the introduction, DuckDB has two modes:
- in-memory: `duckdb.connect()`
- on-disk: `duckdb.connect("path/to/file")`

In [4]:
# Create an in-memory DuckDB instance
conn = duckdb.connect()

# Create an on-disk instance; you can also enable the read-only option
# conn = duckdb.connect("mydb.db", read_only=True)

## 1. Compare the data loading speed

In this section, we compare the data loading speed of DuckDB and pandas.

### 1.1 Compare the reading speed of CSV


In [5]:
%%time
# Create a view over the CSV file, then count its rows
csv_query = f"""
create or replace view patho_csv as
    select * from read_csv('{csv_file_path}', header=true, delim=';');
select count(*) from patho_csv
"""

# .df() returns the result of the last statement as a pandas DataFrame
csv_row_count = conn.execute(csv_query).df()

print(type(csv_row_count))
csv_row_count.head(5)

<class 'pandas.core.frame.DataFrame'>
CPU times: user 1.59 s, sys: 639 ms, total: 2.23 s
Wall time: 1.42 s


|   | count_star() |
|---|--------------|
| 0 | 4057201      |


> Counting the rows with DuckDB takes about 1.4 seconds of wall time.

Now let's read the same CSV file with pandas.


In [6]:
%%time
csv_pdf = pd.read_csv(csv_file_path, sep=";")

print(f"row count: {len(csv_pdf)}")



row count: 4057201
CPU times: user 8.9 s, sys: 1.58 s, total: 10.5 s
Wall time: 11.4 s


In [7]:
row_number, col_number = csv_pdf.shape
print(f"The data set contains {row_number} rows and {col_number} columns")

The data set contains 4057201 rows and 16 columns


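> The pandas run takes about 11 seconds, so DuckDB is roughly 8 times faster here (11.4 s / 1.42 s ≈ 8). Keep in mind that the two cells do different work: pandas loads the whole file into a DataFrame, while the DuckDB query only returns a row count.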

### 1.2 Compare the reading speed of Parquet


In [7]:
%%time
# Create a view over the Parquet file with DuckDB, then count its rows
parquet_query = f"""
create or replace view patho_parquet as
    select * from read_parquet('{parquet_file_path}');
select count(*) from patho_parquet
"""

# .df() returns the result of the last statement as a pandas DataFrame
parquet_row_count = conn.execute(parquet_query).df()

parquet_row_count.head(5)

CPU times: user 5.12 ms, sys: 0 ns, total: 5.12 ms
Wall time: 4 ms


|   | count_star() |
|---|--------------|
| 0 | 4057201      |


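> With Parquet and DuckDB, the row count comes back in about 4 ms. A likely reason is that Parquet stores the number of rows in its file metadata, so a `count(*)` does not need to read the column data.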

In [8]:
%%time
# read parquet with pandas

parquet_pdf = pd.read_parquet(parquet_file_path, engine='pyarrow')
print(f"row count: {len(parquet_pdf)}")

row count: 4057201
CPU times: user 2.08 s, sys: 913 ms, total: 2.99 s
Wall time: 1.88 s


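> pandas cannot read Parquet on its own; it needs either pyarrow or fastparquet as an engine. Here we use pyarrow. Loading the full Parquet file takes about 1.9 seconds, compared with about 11 seconds for the CSV file.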

## 2. Compare the execution time of different operations

### 2.1 Get the table schema

In [15]:
%%time
table_name = "patho_csv"
query2 = f"Describe {table_name}"
schema = conn.execute(query2).df()
schema.head(15)

CPU times: user 1.97 ms, sys: 0 ns, total: 1.97 ms
Wall time: 2.1 ms


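|   | column_name | column_type | null | key | default | extra |
|---|-------------|-------------|------|-----|---------|-------|
| 0 | annee       | BIGINT      | YES  |     |         |       |
| 1 | patho_niv1  | VARCHAR     | YES  |     |         |       |
| 2 | patho_niv2  | VARCHAR     | YES  |     |         |       |
| 3 | patho_niv3  | VARCHAR     | YES  |     |         |       |
| 4 | top         | VARCHAR     | YES  |     |         |       |
| 5 | cla_age_5   | VARCHAR     | YES  |     |         |       |
| 6 | sexe        | BIGINT      | YES  |     |         |       |
| 7 | region      | VARCHAR     | YES  |     |         |       |
| 8 | dept        | VARCHAR     | YES  |     |         |       |
| 9 | ntop        | BIGINT      | YES  |     |         |       |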


In [19]:
%%time
print(csv_pdf.dtypes)

annee                 float64
patho_niv1             object
patho_niv2             object
patho_niv3             object
top                    object
cla_age_5              object
sexe                  float64
region                  int64
dept                   object
ntop                  float64
npop                  float64
prev                  float64
niveau_prioritaire     object
libelle_classe_age     object
libelle_sexe           object
tri                   float64
dtype: object
CPU times: user 689 µs, sys: 166 µs, total: 855 µs
Wall time: 855 µs


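> This time pandas wins: the DataFrame is already in memory, so printing its dtypes takes 855 µs versus 2.1 ms for DuckDB's `DESCRIBE`. pyarrow is not involved here, since `csv_pdf` was loaded with `pd.read_csv`. Note also that the inferred types differ: for example, `annee` is `BIGINT` in DuckDB but `float64` in pandas.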