# **3: DuckDB's Best Practices**

---

By Jean-Yves Tran | jy.tran@[datascience-jy.com](https://datascience-jy.com) | [LinkedIn](https://www.linkedin.com/in/jytran-datascience/)  
IBM Certified Data Analyst 

---

Source: 
- [Getting Started with DuckDB](https://www.packtpub.com/en-ar/product/getting-started-with-duckdb-9781803232539) by Simon Aubury & Ned Letcher
- [DuckDB documentation](https://duckdb.org/docs/)
---

The interactive links in this notebook are not working due to GitHub limitations. View this notebook with the interactive links working [here](https://nbviewer.org/github/jendives2000/Data_ML_Practice_2025/blob/main/1-3-SQL/practice/DuckDB/notebooks/2_duckdb_python_API.ipynb).

---

This is part 3 of the series of notebooks on DuckDB.  
In the previous two notebooks I introduced DuckDB and showed how to load, transform and briefly analyse data from different sources.  
Here I will learn about best practices that save time when:
- querying or inserting data into a DuckDB database
- joining tables (positional and temporal joins)

For an introduction to DuckDB, check [my first notebook](https://github.com/jendives2000/Data_ML_Practice_2025/blob/82571ad44176666f9cf0735c5141c6a96d5eace9/1-3-SQL/practice/DuckDB/notebooks/1_duckdb_intro.ipynb). I also say in there when you should not use DuckDB. 

For understanding how DuckDB works, check my [second notebook](https://github.com/jendives2000/Data_ML_Practice_2025/blob/ef8533ad82586234cfdc54a494c0c5be590816cc/1-3-SQL/practice/DuckDB/notebooks/2_duckdb_python_API.ipynb).

In more detail, I will cover the following:
- **Selecting columns** effectively
- Applying **function chaining**
- Using **INSERT** effectively
- Leveraging **positional** joins and **temporal joins**
- **Recursive** queries and macros
- Additional **tips and tricks**

**SKIERS DATABASE**:  
I added a very simple Skiers Database in the data/data_in folder: `skiers.csv`  
It'll be used throughout this notebook. 

**The two main takeaways are**:


---


## Imports:

In [None]:
import os
import sys

# Add parent directory to sys.path
sys.path.append(os.path.abspath(".."))
from utils.duckdb_shared_code import

### Table setup using the DuckDB Shell:

I want to work with the DuckDB shell from my Jupyter notebook. Prefixing a line with `!` runs it as a shell command, so the DuckDB CLI can be called with its `-c` option to execute a single SQL statement:

In [42]:
!duckdb ../databases/mydatabase.duckdb -c "create or replace table skiers as select * from read_csv('../data/data_in/skiers.csv');"

In [43]:
!duckdb ../databases/mydatabase.duckdb -c "show tables"

┌─────────┐
│  name   │
│ varchar │
├─────────┤
│ skiers  │
└─────────┘


Let's wrap this syntax in a helper function and copy it over to the `duckdb_shared_code.py` file. 

In [44]:
def shell_commd(stmt):
    # Collapse all whitespace (including newlines) into single spaces
    cleaned_stmt = " ".join(stmt.split())
    get_ipython().system(f"duckdb ../databases/mydatabase.duckdb -c \"{cleaned_stmt}\"")
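One caveat with this helper: the statement is pasted between double quotes on a shell command line, so a SQL statement that itself contains double quotes (a quoted identifier, for example) would end the shell string early. Below is a minimal sketch of a more defensive variant, assuming a POSIX shell and using only Python's standard `shlex` module. The name `build_duckdb_command` is hypothetical and is not part of `duckdb_shared_code.py`.

```python
import shlex

DB_PATH = "../databases/mydatabase.duckdb"  # same database file as above


def build_duckdb_command(stmt: str, db_path: str = DB_PATH) -> str:
    """Return a shell command line that runs `stmt` through the DuckDB CLI."""
    # Collapse all whitespace (including newlines) into single spaces
    cleaned_stmt = " ".join(stmt.split())
    # shlex.quote makes the shell treat the statement as one argument,
    # even if it contains double quotes or other special characters
    return f"duckdb {shlex.quote(db_path)} -c {shlex.quote(cleaned_stmt)}"


cmd = build_duckdb_command(
    """
    select "skier_age"
    from skiers
    ;
    """
)
print(cmd)
```

In a notebook, the resulting string could then be passed to `get_ipython().system(cmd)` in the same way as in `shell_commd`.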

## select *:
I want to see the whole `skiers` table now:

In [45]:
shell_commd(
    """
    select * 
    from skiers
    ;
    """
)

┌──────────────────┬─────────────────┬───────────┬──────────────┬────────────────────┬─────────────────┐
│ skier_first_name │ skier_last_name │ skier_age │ skier_height │ skier_helmet_color │ skier_bib_color │
│     varchar      │     varchar     │   int64   │    int64     │      varchar       │     varchar     │
├──────────────────┼─────────────────┼───────────┼──────────────┼────────────────────┼─────────────────┤
│ Alice            │ Smith           │        12 │          152 │ red                │ black           │
│ Bob              │ Blaese          │        16 │          178 │ blue               │ yellow          │
│ Carol            │ Wilson          │        32 │          159 │ yellow             │ pink            │
│ Dan              │ Jones           │        52 │          182 │ red                │ yellow          │
│ Erin             │ Taylor          │        22 │          168 │ black              │ green           │
│ Frank            │ Williams        │        18 │     

### Trailing comma is fine:  
One small improvement that I appreciate is that a trailing comma after the last column does not throw an error.  
Here I select multiple columns and forget to remove the comma after `skier_bib_color` (line 9 of the cell): 

In [46]:
shell_commd(
    """
    select
        skier_first_name,
        skier_last_name,
        skier_age,
        skier_height,
        skier_helmet_color,
        skier_bib_color,
    from skiers
    ;
    """
)

┌──────────────────┬─────────────────┬───────────┬──────────────┬────────────────────┬─────────────────┐
│ skier_first_name │ skier_last_name │ skier_age │ skier_height │ skier_helmet_color │ skier_bib_color │
│     varchar      │     varchar     │   int64   │    int64     │      varchar       │     varchar     │
├──────────────────┼─────────────────┼───────────┼──────────────┼────────────────────┼─────────────────┤
│ Alice            │ Smith           │        12 │          152 │ red                │ black           │
│ Bob              │ Blaese          │        16 │          178 │ blue               │ yellow          │
│ Carol            │ Wilson          │        32 │          159 │ yellow             │ pink            │
│ Dan              │ Jones           │        52 │          182 │ red                │ yellow          │
│ Erin             │ Taylor          │        22 │          168 │ black              │ green           │
│ Frank            │ Williams        │        18 │     

No problems, the last trailing comma is valid!

## Exclude columns: 

What if I need every column but one? Normally I'd have to list every column except the one I don't want.  
With the `EXCLUDE` clause I can name just the column (or columns) to leave out and get everything else. 

In [47]:
shell_commd(
    """
    select s.*
    exclude(skier_age, skier_height)
    from skiers as s
    ;
    """
)

┌──────────────────┬─────────────────┬────────────────────┬─────────────────┐
│ skier_first_name │ skier_last_name │ skier_helmet_color │ skier_bib_color │
│     varchar      │     varchar     │      varchar       │     varchar     │
├──────────────────┼─────────────────┼────────────────────┼─────────────────┤
│ Alice            │ Smith           │ red                │ black           │
│ Bob              │ Blaese          │ blue               │ yellow          │
│ Carol            │ Wilson          │ yellow             │ pink            │
│ Dan              │ Jones           │ red                │ yellow          │
│ Erin             │ Taylor          │ black              │ green           │
│ Frank            │ Williams        │ yellow             │ red             │
│ Grace            │ Miller          │ pink               │ black           │
│ Heidi            │ Johnson         │ yellow             │ yellow          │
│ Ivan             │ Brown           │ green              │ pink
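Conceptually, `EXCLUDE` starts from the full column list and drops the named columns while keeping the original order. Here is a plain-Python sketch of that expansion, using the column names of the `skiers` table. The helper `expand_star_exclude` is hypothetical and only mimics the behaviour; it is not how DuckDB implements it.

```python
# Column names of the skiers table, in table order
ALL_COLUMNS = [
    "skier_first_name",
    "skier_last_name",
    "skier_age",
    "skier_height",
    "skier_helmet_color",
    "skier_bib_color",
]


def expand_star_exclude(columns, excluded):
    """Mimic `select * exclude (...)`: keep every column not in `excluded`."""
    return [name for name in columns if name not in excluded]


kept = expand_star_exclude(ALL_COLUMNS, {"skier_age", "skier_height"})

# The explicit query I would otherwise have to type by hand
explicit_query = f"select {', '.join(kept)} from skiers;"
print(explicit_query)
```

The remaining four names are the same ones that appear in the header of the query result above.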

## Dynamic Column Replacement: 

Let's say I now want to round `skier_age` to the nearest 10 years and cast the result to an integer, changing both its values and its data type. The `REPLACE` clause handles this kind of modification: 

In [48]:
shell_commd(
    """
    select s.*
        replace(
            (round(skier_age / 10) * 10)::integer
            as skier_age
        )
    from skiers as s
    ;
    """
)

┌──────────────────┬─────────────────┬───────────┬──────────────┬────────────────────┬─────────────────┐
│ skier_first_name │ skier_last_name │ skier_age │ skier_height │ skier_helmet_color │ skier_bib_color │
│     varchar      │     varchar     │   int32   │    int64     │      varchar       │     varchar     │
├──────────────────┼─────────────────┼───────────┼──────────────┼────────────────────┼─────────────────┤
│ Alice            │ Smith           │        10 │          152 │ red                │ black           │
│ Bob              │ Blaese          │        20 │          178 │ blue               │ yellow          │
│ Carol            │ Wilson          │        30 │          159 │ yellow             │ pink            │
│ Dan              │ Jones           │        50 │          182 │ red                │ yellow          │
│ Erin             │ Taylor          │        20 │          168 │ black              │ green           │
│ Frank            │ Williams        │        20 │     
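To make the arithmetic concrete: dividing by 10, rounding, and multiplying by 10 again snaps each age to the nearest decade. Here is a small Python sketch with the six ages visible in the earlier output. It uses `math.floor(x + 0.5)` instead of Python's built-in `round`, because Python rounds halves to the nearest even number, and I am assuming that DuckDB's `round` rounds halves away from zero. None of these ages sits exactly on a half, so the result is the same either way.

```python
import math

# Ages taken from the first rows of the skiers table
ages = {"Alice": 12, "Bob": 16, "Carol": 32, "Dan": 52, "Erin": 22, "Frank": 18}


def round_to_decade(age: int) -> int:
    """Mimic (round(age / 10) * 10)::integer for non-negative ages."""
    # age / 10 is a floating-point division, e.g. 16 / 10 = 1.6
    tens = math.floor(age / 10 + 0.5)  # round half up: 1.6 -> 2
    return int(tens * 10)              # back to years: 2 -> 20


rounded = {name: round_to_decade(age) for name, age in ages.items()}
print(rounded)
```

The values agree with the `skier_age` column in the query result: 10, 20, 30, 50, 20 and 20.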

## Column Selector:

I can also use the column selector `COLUMNS()` with a regular expression to, say, get the columns whose names end with 'color':  

In [50]:
shell_commd(
    """
    select 
        skier_first_name,
        columns(
            '.*color$'
        )
    from skiers
    ;
    """
)

┌──────────────────┬────────────────────┬─────────────────┐
│ skier_first_name │ skier_helmet_color │ skier_bib_color │
│     varchar      │      varchar       │     varchar     │
├──────────────────┼────────────────────┼─────────────────┤
│ Alice            │ red                │ black           │
│ Bob              │ blue               │ yellow          │
│ Carol            │ yellow             │ pink            │
│ Dan              │ red                │ yellow          │
│ Erin             │ black              │ green           │
│ Frank            │ yellow             │ red             │
│ Grace            │ pink               │ black           │
│ Heidi            │ yellow             │ yellow          │
│ Ivan             │ green              │ pink            │
│ Judy             │ red                │ black           │
├──────────────────┴────────────────────┴─────────────────┤
│ 10 rows                                       3 columns │
└─────────────────────────────────────────────────────────┘

The regular expression `'.*color$'` matched the 2 columns whose names end with `color`. 
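To check which column names a pattern would pick up before running a query, the regular expression can be tried in plain Python. This is only an approximation: DuckDB uses its own regex engine, and I am assuming that for simple patterns like these Python's `re.search` behaves the same way. The column `color_code` is hypothetical and is added only to show what the `$` anchor changes.

```python
import re

# Real column names of the skiers table, plus one hypothetical column
columns = [
    "skier_first_name",
    "skier_last_name",
    "skier_age",
    "skier_height",
    "skier_helmet_color",
    "skier_bib_color",
    "color_code",  # hypothetical, not in skiers.csv
]


def matching_columns(pattern, names):
    """Return the names in which the regular expression finds a match."""
    return [name for name in names if re.search(pattern, name)]


ends_with_color = matching_columns(r".*color$", columns)
contains_color = matching_columns(r".*color.*", columns)
print(ends_with_color)
print(contains_color)
```

With `$` the name has to end in `color`, so the hypothetical `color_code` is left out; without it, any name containing `color` matches.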

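I can take it one step further and use the same selector in the `WHERE` clause to keep only the rows where every one of these columns equals a given value, say yellow. This time the pattern `'.*color.*'` matches any column whose name contains `color`, which selects the same two columns here. 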

In [51]:
shell_commd(
    """
    select 
        skier_first_name,
        columns(
            '.*color.*'
        )
    from skiers
    where columns('.*color.*') == 'yellow'
    ;
    """
)

┌──────────────────┬────────────────────┬─────────────────┐
│ skier_first_name │ skier_helmet_color │ skier_bib_color │
│     varchar      │      varchar       │     varchar     │
├──────────────────┼────────────────────┼─────────────────┤
│ Heidi            │ yellow             │ yellow          │
└──────────────────┴────────────────────┴─────────────────┘
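Only Heidi is returned because a `COLUMNS()` expression in a `WHERE` clause applies the condition to each matched column and combines the results with `AND`. Here is a plain-Python sketch of that logic over the ten rows shown earlier (the variable names are mine):

```python
# (first name, helmet color, bib color) for the ten skiers
rows = [
    ("Alice", "red", "black"),
    ("Bob", "blue", "yellow"),
    ("Carol", "yellow", "pink"),
    ("Dan", "red", "yellow"),
    ("Erin", "black", "green"),
    ("Frank", "yellow", "red"),
    ("Grace", "pink", "black"),
    ("Heidi", "yellow", "yellow"),
    ("Ivan", "green", "pink"),
    ("Judy", "red", "black"),
]

# where columns('.*color.*') == 'yellow'
# behaves like: helmet == 'yellow' AND bib == 'yellow'
all_yellow = [name for name, *colors in rows if all(c == "yellow" for c in colors)]

# For comparison: helmet == 'yellow' OR bib == 'yellow'
any_yellow = [name for name, *colors in rows if any(c == "yellow" for c in colors)]

print(all_yellow)
print(any_yellow)
```

To get the `OR` behaviour in SQL, the two conditions would have to be written out explicitly, for example `where skier_helmet_color = 'yellow' or skier_bib_color = 'yellow'`.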
