# **4: Hands-on Exploratory Data Analysis with DuckDB :duck:**

---

By Jean-Yves Tran | jy.tran@[datascience-jy.com](https://datascience-jy.com) | [LinkedIn](https://www.linkedin.com/in/jytran-datascience/)  
IBM Certified Data Analyst 

---

Source: 
- [Getting Started with DuckDB](https://www.packtpub.com/en-ar/product/getting-started-with-duckdb-9781803232539) by Simon Aubury & Ned Letcher
- [DuckDB documentation](https://duckdb.org/docs/)
---

The interactive links in this notebook are not working due to GitHub limitations. View this notebook with the interactive links working [here](https://nbviewer.org/github/jendives2000/Data_ML_Practice_2025/blob/main/1-3-SQL/practice/DuckDB/notebooks/4_duckdb_handson_eda.ipynb).

---

This is part 4 of this series of notebooks on DuckDB.  
Find the 3 previous notebooks: 
- intro to DuckDB, [notebook 1](https://github.com/jendives2000/Data_ML_Practice_2025/blob/82571ad44176666f9cf0735c5141c6a96d5eace9/1-3-SQL/practice/DuckDB/notebooks/1_duckdb_intro.ipynb).
- how DuckDB works, [notebook 2](https://github.com/jendives2000/Data_ML_Practice_2025/blob/ef8533ad82586234cfdc54a494c0c5be590816cc/1-3-SQL/practice/DuckDB/notebooks/2_duckdb_python_API.ipynb)
- DuckDB best practices, [notebook 3](https://github.com/jendives2000/Data_ML_Practice_2025/blob/ef8533ad82586234cfdc54a494c0c5be590816cc/1-3-SQL/practice/DuckDB/notebooks/3_duckdb_bestpractices.ipynb)

For this notebook, I will:
- first introduce the dataset I will work on
- clean it up
- and explore that data (EDA)
  - including with visualizations (Plotly)


<u>**Relational API & JupySQL:**</u>  

I use two ways to leverage DuckDB:
- its Relational API, efficient and pythonic 
- and JupySQL, which enables direct SQL queries in a Jupyter notebook cell (with just '%%sql')
  
I will not reintroduce the Relational API; for that, see my notebook 1.  
JupySQL needs to be installed (`pip install jupysql`) and imported. I will also show how to enable it. 

**Large File**:
Be aware that the dataset I will explore weighs 420MB. While this is by no means a heavy dataset, you still need the memory space for it. More about that in the dataset paragraph.

**DATABASE & Git LFS**:  
As mentioned, this dataset weighs 420MB. The download link is in the dataset section below if you prefer that route (you could also just clone the DuckDB folder).  
Note that because of that size, Git LFS was needed to make it available in this repo, so you may have to fetch the dataset through Git LFS itself. I will not show how here. 

**The main takeaway is**:
- to **better comprehend** the **differences** between regular SQL queries and DuckDB-enhanced queries, which are often much less verbose and more readable. 

---


## **Imports**:

In [1]:
#! pip install pandas matplotlib
import duckdb
import pandas as pd

---

## **Dataset: Melbourne Pedestrian Count**:

I will be using a dataset made available by the city of Melbourne, Australia. It contains **hourly pedestrian counts** from pedestrian sensors located in and around the Melbourne central business district. We’ll be working with a historical timeframe of this dataset ranging from **2009 to 2022**.

I [imported](https://data.melbourne.vic.gov.au/api/datasets/1.0/pedestrian-counting-system-monthly-counts-per-hour/attachments/pedestrian_counting_system_monthly_counts_per_hour_may_2009_to_14_dec_2022_csv_zip/) it in the data/data_in folder. 
Once unzipped, this is:
- a **420MB** dataset 
- with over **2.1 million entries**


## **Relational API**: 

I will use the Relational API because it offers more for what I will be doing: exploratory data analysis. 

Let's get our dataset into a Relational Object (RO from now on):

In [2]:
records = duckdb.read_csv("../data/data_in/pedestrian_records_2009-2022.csv")

# max_width=200 caps the width of the printed table at 200 characters:
records.show(max_width = 200)

┌─────────┬───────────────────────────────┬───────┬──────────┬───────┬──────────┬───────┬───────────┬───────────────────────────────┬───────────────┐
│   ID    │           Date_Time           │ Year  │  Month   │ Mdate │   Day    │ Time  │ Sensor_ID │          Sensor_Name          │ Hourly_Counts │
│  int64  │            varchar            │ int64 │ varchar  │ int64 │ varchar  │ int64 │   int64   │            varchar            │     int64     │
├─────────┼───────────────────────────────┼───────┼──────────┼───────┼──────────┼───────┼───────────┼───────────────────────────────┼───────────────┤
│ 2887628 │ November 01, 2019 05:00:00 PM │  2019 │ November │     1 │ Friday   │    17 │        34 │ Flinders St-Spark La          │           300 │
│ 2887629 │ November 01, 2019 05:00:00 PM │  2019 │ November │     1 │ Friday   │    17 │        39 │ Alfred Place                  │           604 │
│ 2887630 │ November 01, 2019 05:00:00 PM │  2019 │ November │     1 │ Friday   │    17 │        37 

## **Looking at the Data**:

Because this is an RO, DuckDB evaluated it lazily: it loaded only 10,000 rows, and that preview is what it **lazily returned** here.  
10,000 rows with 10 columns each makes **100,000 entries** that were lazily output (out of the total 2.1 million entries). 

Lots of info in this dataset: 
- the **count of pedestrians** detected by a specific **sensor** during **each hour**, 
- other details related to the hourly readings, including the **sensor name** and the **timestamp**, 
- along with date and time components derived from the timestamp.

### **Data types**:
Notice in the header of the output that **Date_Time** is of the data type **VARCHAR**, meaning text. It should be a timestamp (docs on the format specifiers [here](http://duckdb.org/docs/sql/functions/dateformat)). Let's fix that:

In [3]:
records = duckdb.read_csv(
    "../data/data_in/pedestrian_records_2009-2022.csv",
    dtype={"Date_Time": "TIMESTAMP"},
    timestamp_format="%B %d, %Y %H:%M:%S %p",
)

Let's confirm this change:

In [4]:
records.limit(5).show(max_width=200)

┌─────────┬─────────────────────┬───────┬──────────┬───────┬─────────┬───────┬───────────┬──────────────────────────────┬───────────────┐
│   ID    │      Date_Time      │ Year  │  Month   │ Mdate │   Day   │ Time  │ Sensor_ID │         Sensor_Name          │ Hourly_Counts │
│  int64  │      timestamp      │ int64 │ varchar  │ int64 │ varchar │ int64 │   int64   │           varchar            │     int64     │
├─────────┼─────────────────────┼───────┼──────────┼───────┼─────────┼───────┼───────────┼──────────────────────────────┼───────────────┤
│ 2887628 │ 2019-11-01 17:00:00 │  2019 │ November │     1 │ Friday  │    17 │        34 │ Flinders St-Spark La         │           300 │
│ 2887629 │ 2019-11-01 17:00:00 │  2019 │ November │     1 │ Friday  │    17 │        39 │ Alfred Place                 │           604 │
│ 2887630 │ 2019-11-01 17:00:00 │  2019 │ November │     1 │ Friday  │    17 │        37 │ Lygon St (East)              │           216 │
│ 2887631 │ 2019-11-01 17:00:00 │ 
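
To make the format string easier to read, here is a small sketch that parses one of the `Date_Time` strings shown above with Python's standard-library `datetime.strptime`. It only illustrates what the directives mean and is not part of the DuckDB workflow. One assumption to flag: Python's parser needs `%I` (12-hour clock) for `%p` to take effect, so the sketch uses `%I` where the DuckDB call above uses `%H`.

```python
from datetime import datetime

# One raw value as it appears in the CSV (taken from the first output above)
raw = "November 01, 2019 05:00:00 PM"

# %B = full month name, %d = day of month, %Y = 4-digit year,
# %I = hour on a 12-hour clock, %M = minutes, %S = seconds, %p = AM/PM
fmt = "%B %d, %Y %I:%M:%S %p"

parsed = datetime.strptime(raw, fmt)
print(parsed.isoformat(sep=" "))  # 2019-11-01 17:00:00
```

The printed value matches the `Date_Time` shown in the timestamp-typed output above.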

### **Enums for low cardinality String Columns**:

[ENUM types](https://duckdb.org/docs/sql/data_types/enum.html) are a way to **convert string values into numbers** in a database.  
This is useful for columns with a limited number of different values, like:
- month names 
- and days of the week. 

By using ENUMs for these columns, we can:
- **save storage space** 
- and **speed up queries** 

because the database only stores numbers instead of full strings.  
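
The idea of "numbers instead of strings" can be sketched in plain Python. This is a conceptual illustration of dictionary encoding only: it is not DuckDB's actual storage layout, and the variable names are made up for the example.

```python
from array import array

# A few Month values as they could appear row after row in the dataset
months = ["November", "November", "May", "May", "May", "November"]

# The "dictionary": each distinct string is stored once and gets a small code
dictionary = sorted(set(months))  # ['May', 'November']
code_of = {name: code for code, name in enumerate(dictionary)}

# The column itself is stored as one unsigned byte per row
codes = array("B", (code_of[m] for m in months))

# Compare the bytes needed for the raw strings vs codes + dictionary
raw_bytes = sum(len(m.encode("utf-8")) for m in months)
encoded_bytes = len(codes) * codes.itemsize + sum(
    len(d.encode("utf-8")) for d in dictionary
)

# Decoding gives back the original strings
decoded = [dictionary[c] for c in codes]
print(raw_bytes, encoded_bytes)  # 33 17
```

The saving grows with the number of rows, because the dictionary is stored once while each extra row costs a single small code.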

DuckDB can also create an ENUM directly from a `SELECT` statement (see the ENUM docs linked above). Here I build the list of values by hand instead, so the approach is to:
- **extract** the distinct values via SQL (or pandas),
- **format** them into a comma-separated list, 
- and then **build** the ENUM type dynamically.  

This method gives me flexibility and control over the ENUM values while keeping my code concise.

In [5]:
# refactored logic to create ENUM with an RO:
def create_enum(col, db, enum_name):
    """
    Create an ENUM type in DuckDB from unique values in a specified column.

    Parameters:
    col (str): The column name to extract unique values from.
    db (str): The name of the DuckDB relation (table).
    enum_name (str): The name of the ENUM type to be created.
    """
    # extracting the unique values from the specified column
    col_forEnum = duckdb.sql(
        f"""
        select distinct {col}
        from {db}
        """
    ).fetchall()

    # Build a comma-separated list of ENUM values
    col_enum_val = ", ".join(f"'{v[0]}'" for v in col_forEnum)

    # create the ENUM type
    duckdb.sql(f"create type {enum_name} as enum ({col_enum_val});")

# creating enums for the Month column:
create_enum("Month", "records", "month_enum")

In [6]:
# creating enums for the Day column:
create_enum("Day", "records", "day_enum")

Did I actually create them? The output says yes!

In [7]:
print(duckdb.sql("select enum_range(NULL::month_enum);"))
print(duckdb.sql("select enum_range(NULL::day_enum);"))

┌────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                enum_range(CAST(NULL AS month_enum))                                │
│                                             varchar[]                                              │
├────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ [October, August, September, March, April, July, January, December, November, May, June, February] │
└────────────────────────────────────────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│                enum_range(CAST(NULL AS day_enum))                │
│                            varchar[]                             │
├──────────────────────────────────────────────────────────────────┤
│ [Sunday, Monday, Wednesday, Friday, Tuesday, Saturday, Thursday] │
└───────────────────────────────────

### **Selecting useful Columns**:

Before we load our dataset into an on-disk database for analysis, we should think about any data changes we might want to make, such as removing columns we won’t need:  
- the additional date and time fields are likely to be useful, so we keep them, 
- we also keep the Sensor_ID column, since it helps connect this dataset with another one from the city of Melbourne that includes details about each sensor, like their locations.  
- **However**, we can **drop the ID field** because it doesn’t relate to any other datasets about the pedestrian counting system.  
- Finally, since we’ll be doing various time series analyses, we also need to **sort the records by the Date_Time column**.  

Let’s make these changes and take a look at the results.

In [8]:
records_v2 = records.select("* exclude ID").sort("Date_Time")

In [9]:
records_v2.limit(5).show(max_width=200)

┌─────────────────────┬───────┬─────────┬───────┬─────────┬───────┬───────────┬───────────────────────────────────┬───────────────┐
│      Date_Time      │ Year  │  Month  │ Mdate │   Day   │ Time  │ Sensor_ID │            Sensor_Name            │ Hourly_Counts │
│      timestamp      │ int64 │ varchar │ int64 │ varchar │ int64 │   int64   │              varchar              │     int64     │
├─────────────────────┼───────┼─────────┼───────┼─────────┼───────┼───────────┼───────────────────────────────────┼───────────────┤
│ 2009-05-01 00:00:00 │  2009 │ May     │     1 │ Friday  │     0 │         5 │ Princes Bridge                    │           157 │
│ 2009-05-01 00:00:00 │  2009 │ May     │     1 │ Friday  │     0 │         1 │ Bourke Street Mall (North)        │            53 │
│ 2009-05-01 00:00:00 │  2009 │ May     │     1 │ Friday  │     0 │         6 │ Flinders Street Station Underpass │           139 │
│ 2009-05-01 00:00:00 │  2009 │ May     │     1 │ Friday  │     0 │         

All this pre-analysis work is done; the data is good to go.  
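
To make the `* exclude ID` shorthand used above concrete, here is a rough comparison with an engine that has no `EXCLUDE`: the standard-library `sqlite3` module. This is only a sketch of the "regular SQL" alternative. The table holds two rows copied from the first output, with a reduced set of columns for brevity (an assumption of the example).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE records (ID INTEGER, Date_Time TEXT, Sensor_ID INTEGER, "
    "Sensor_Name TEXT, Hourly_Counts INTEGER)"
)
con.executemany(
    "INSERT INTO records VALUES (?, ?, ?, ?, ?)",
    [
        (2887628, "2019-11-01 17:00:00", 34, "Flinders St-Spark La", 300),
        (2887629, "2019-11-01 17:00:00", 39, "Alfred Place", 604),
    ],
)

# Without EXCLUDE, every column to keep has to be spelled out (or looked up)
all_cols = [row[1] for row in con.execute("PRAGMA table_info(records)")]
kept = [c for c in all_cols if c != "ID"]
query = f"SELECT {', '.join(kept)} FROM records ORDER BY Date_Time"

cursor = con.execute(query)
result_cols = [d[0] for d in cursor.description]
print(query)
con.close()
```

The generated query lists four column names where DuckDB needs only `* exclude ID`, and the gap widens with wider tables.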

## **Advantages of loading into a disk-based database**:
I need to load it into a **persistent** disk-based database so that the whole data is **written and safely saved**, making it **available** for another time or another person. 

This means I will NOT use DuckDB's default in-memory database anymore. The data I just cleaned up will live in a **table** in the new database. That cleaning-up process will **NOT have to happen again**, as its result is now persisted in that database. 

This has the following advantages: 
- **saves compute time**, especially on large and/or complex datasets
- **separates data loading from data consumption**: 
  - so multiple notebooks can do their own analyses, **consuming** the same database data **without impacting** the data itself.

Now, to encapsulate all of: 
- the cleaning-up I did so far 
- and the table/database creation 
- while ensuring that everything is written out and saved safely 

I am using a context manager (a `with` block). I name: 
- the table `pedestrian_counts`, 
- and the database that holds it `pedestrian.duckdb`. 


In [10]:
# context manager with block:
with duckdb.connect("../data/data_out/pedestrian.duckdb") as conn:
    # Drop the table if it exists
    conn.execute("DROP TABLE IF EXISTS pedestrian_counts")
    
    result=(
        # repeating the cleaning-up steps:
        conn.read_csv(
            "../data/data_in/pedestrian_records_2009-2022.csv",
            dtype={"Date_Time": "TIMESTAMP"},
            timestamp_format="%B %d, %Y %H:%M:%S %p",
        )
        .select("* exclude ID")
        .sort("Date_Time")
    )
    # copying the whole result into the new table:
    result.to_table("pedestrian_counts")

## EDA: 

I can now start the EDA, however let's first look at some **more tools** I need: 
- **JupySQL**: to conveniently **run SQL queries directly** within Jupyter cells
- **Plotly**: to create any kind of **interactive visualizations**, well integrated with Jupyter

### SQL directly in Jupyter with JupySQL:

To better understand its advantages, I first use the Relational API with the `.sql()` method to:
- count the total number of pedestrians 
- for the _Melbourne Central_ sensor
- for 2022


In [11]:
conn = duckdb.connect("../data/data_out/pedestrian.duckdb")

conn.sql(
    """
    select sum(Hourly_Counts) as Total_Counts
    from pedestrian_counts
    where Year = 2022 and Sensor_Name = 'Melbourne Central'
    """
)

┌──────────────┐
│ Total_Counts │
│    int128    │
├──────────────┤
│      6897406 │
└──────────────┘

The downsides of this approach are:
- the SQL has to be wrapped in a (triple-quoted) Python string
- and passed through the `.sql()` method

This is similar to what I used in the previous notebook where I used the DuckDB shell directly but within the Jupyter notebook. 

### **JupySQL Magic SQL**:
Here's what JupySQL brings to the table. I am configuring it to be used with our new database, but first I **have to close the `conn` connection**.  
<u>**REMEMBER**</u>, DuckDB does **NOT allow multiple processes to open the same database file for writing** at the same time, so I release the file before handing it to JupySQL. 

In [12]:
conn.close()

In [13]:
# enabling SQL Magic
%load_ext sql
%sql duckdb:///../data/data_out/pedestrian.duckdb

On top of that, I can now configure it to automatically return a pandas dataframe instead of the usual SQL output, and simplify the output: 

In [14]:
%config Sqlmagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

Remember lazy evaluation? With that autopandas configuration, I **lose it**, because every result is materialized as a full dataframe. This has **performance impacts**. For me not so much, as it is 'only' a 420MB dataset, but on a much larger dataset it will show. 

And with that I am ready to type SQL magic:

In [15]:
%%sql
select sum(Hourly_Counts) as Total_Counts
from pedestrian_counts
where Year = 2022 and Sensor_Name = 'Melbourne Central'

Total_Counts
6897406


All right. What about putting all of that in a variable?  

### **`%%sql var <<`: to assign a df in a var**:
Here's how, by using that syntax: 
`%%sql var_name <<`

I am using it now to:
- calculate the total counts
- for each sensor 
- in 2022
- and sort them in descending order (biggest first)


In [16]:
%%sql sensors_2022_df <<
SELECT Sensor_Name, sum(Hourly_Counts)::BIGINT AS Total_Counts
FROM pedestrian_counts
WHERE Year = 2022
GROUP BY Sensor_Name
ORDER BY Total_Counts DESC

So in the code above: 
- `::BIGINT` casts the sum of the hourly counts **to a 64-bit integer**, which avoids the decimals that would have appeared otherwise
  - Why? 
    - Because the sum itself is actually of data type **HUGEINT** (a 128-bit integer), which prevents the total from overflowing (being too big)
    - and because pandas does **not support 128-bit** integers, so it converts them to float64 (hence the decimal)
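
A quick check in plain Python shows why the cast is safe here and why float64 is a poor stand-in for large integers in general. This is a side illustration using only built-in integers and floats; it does not touch DuckDB or pandas.

```python
# Upper bound of a signed 64-bit integer (BIGINT)
INT64_MAX = 2**63 - 1

# Melbourne Central total for 2022, from the earlier query output
melbourne_central_2022 = 6_897_406
fits_in_bigint = melbourne_central_2022 <= INT64_MAX

# float64 has a 53-bit significand: above 2**53, not every integer is representable
big = 2**53 + 1
round_trip = int(float(big))

print(fits_in_bigint)     # True
print(big, round_trip)    # 9007199254740993 9007199254740992
```

The totals in this dataset are far below both limits, so `::BIGINT` loses nothing while keeping the column an integer in pandas.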
  

In [17]:
print(type(sensors_2022_df))

<class 'sql.run.resultset.ResultSet'>


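I printed the type of the variable `sensors_2022_df` and it is actually **not a dataframe**.  
The `autopandas = True` config did <u>**not take effect**</u>. A likely cause is the capitalization in the `%config` line above: it reads `Sqlmagic`, while the other two lines use the class name `SqlMagic`.  
  
So here I do the conversion manually: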

In [None]:
# refactoring the code to turn a variable into a pandas dataframe:
def var_to_df(var):
    """
    Convert a SQL result set to a pandas DataFrame.

    Parameters:
    var (sql.run.resultset.ResultSet): The SQL result set to convert.

    Returns:
    pandas.DataFrame: The converted pandas DataFrame.
    """
    return var.DataFrame()

df_sensors_2022 = var_to_df(sensors_2022_df)
print(type(df_sensors_2022))

<class 'pandas.core.frame.DataFrame'>


All right, let's look at it:

In [19]:
df_sensors_2022.head(10)

|   | Sensor_Name | Total_Counts |
|---|---|---|
| 0 | Flinders La-Swanston St (West) | 10492872 |
| 1 | Southbank | 8737282 |
| 2 | Melbourne Central | 6897406 |
| 3 | Elizabeth St - Flinders St (East) - New footpath | 6511465 |
| 4 | Princes Bridge | 6202149 |
| 5 | State Library - New | 6049385 |
| 6 | Flinders Street Station Underpass | 5772514 |
| 7 | Melbourne Convention Exhibition Centre | 5634531 |
| 8 | Bourke Street Mall (North) | 5614610 |
| 9 | Melbourne Central-Elizabeth St (East) | 5380759 |


And I can see the top ten sensors ranked by traffic. 

### Plotly

I am using the Plotly Python API. Plotly is actually a JavaScript library, which the Python API calls to render the visualization in a browser page.  
These rendered objects are named **figures**.  

To build figures, I can use:
- the Plotly components exposed in its Python API
- or Plotly Express, a higher-level tool in that same API, to create figures quickly 

Plotly Express is less finicky and quicker to work with. In general, it is used first, and the components are then accessed for fine-tuning. 
Plotly Express was designed to visualize dataframes, such as pandas and polars ones.

Just like any other tool, I need to pip install it (already done) and import it. 

In [20]:
import plotly.express as px

Ok, I want to see a bar chart of that dataframe (df from now on) I just created `df_sensors_2022`:

In [40]:
figure = px.bar(
    df_sensors_2022.head(10),
    x="Sensor_Name",
    y="Total_Counts",
    height=900,  # taller than the original 500
    title="Top 10 sensors by traffic - 2022",
    color="Total_Counts",
    color_continuous_scale=[[0, 'rgb(65,65,65)'], [1, 'rgb(255,255,100)']]  # Medium grey to pale yellow
)

figure.update_layout(
    title_font_size=35,
    title_font_color="white",
    font=dict(size=25, color="white"),
    xaxis_tickangle=-45,
    plot_bgcolor="black",
    paper_bgcolor="black",
    xaxis=dict(tickfont=dict(color='grey'))  # Set x-axis labels to grey
)

# Adjust the y-axis range to ensure the highest bar is fully visible
figure.update_yaxes(range=[0, 11000000])

figure.show()
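
The hard-coded upper limit of `11000000` can also be derived from the data. A minimal sketch, assuming a rounding step of one million (my choice for the example, not something Plotly requires):

```python
import math

# Highest Total_Counts in the top-10 table above
top_total = 10_492_872

# Round up to the next multiple of one million so the tallest bar has headroom
step = 1_000_000
y_max = math.ceil(top_total / step) * step

print(y_max)  # 11000000
```

The result could then be passed as `figure.update_yaxes(range=[0, y_max])`, so the axis keeps fitting if the year or the sensor selection changes.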