Feature/duckdb #25

gjbex · 2025-11-25T07:08:56Z

Summary by Sourcery

Add a new DuckDB-based patient data analysis example and update documentation for workshop prerequisites and available code examples.

New Features:

Introduce a DuckDB Jupyter notebook demonstrating SQL queries, views, pivots, and joins over patient experiment and metadata CSV datasets.
Add patient_experiment and patient_metadata CSV data files used by the DuckDB notebook.
Document the new DuckDB example module in the source code README and describe the notebook and data files in a dedicated DuckDB README.

Documentation:

Clarify hands-on workshop prerequisites, including HPC access, Python/Jupyter environment, and Google Colab as an alternative.

review-notebook-app · 2025-11-25T07:09:01Z

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

sourcery-ai · 2025-11-25T07:09:02Z

Reviewer's Guide

Adds a new DuckDB-based patient data analysis example, wires it into the source-code docs, and updates workshop prerequisites to mention environment and access requirements.

Sequence diagram for new-style lazy DuckDB high fever query

sequenceDiagram
    actor Analyst
    participant Notebook
    participant DuckDBPyConnection
    participant DuckDBPyRelation
    participant DuckDBEngine

    Analyst->>Notebook: Run_cell_create_connection
    Notebook->>DuckDBPyConnection: connect(database_path)
    DuckDBPyConnection-->>Notebook: connection_instance

    Analyst->>Notebook: Run_cell_build_lazy_query
    Notebook->>DuckDBPyConnection: sql(SELECT patient, temperature FROM patient_experiment)
    DuckDBPyConnection-->>Notebook: DuckDBPyRelation

    Notebook->>DuckDBPyRelation: filter(temperature > 39_5)
    DuckDBPyRelation-->>Notebook: DuckDBPyRelation

    Notebook->>DuckDBPyRelation: aggregate(patient, COUNT(temperature) AS high_fever_count, MAX(temperature) AS max_temperature, group_expr=patient)
    DuckDBPyRelation-->>Notebook: DuckDBPyRelation

    Analyst->>Notebook: Run_show_on_relation
    Notebook->>DuckDBPyRelation: show()
    DuckDBPyRelation->>DuckDBEngine: execute_lazy_plan
    DuckDBEngine-->>DuckDBPyRelation: result_rows
    DuckDBPyRelation-->>Notebook: rendered_table
    Notebook-->>Analyst: Display_high_fever_counts

Entity relationship diagram for DuckDB patient analysis views and tables

erDiagram
    patient_experiment {
        BIGINT patient
        TIMESTAMP date
        DOUBLE temperature
        DOUBLE dose
    }

    patient_metadata {
        BIGINT patient
        VARCHAR condition
        VARCHAR gender
        VARCHAR filename
    }

    hypothesis {
        BIGINT patient
        DOUBLE max_temperature
        DOUBLE total_dose
    }

    time_series {
        TIMESTAMP date
        DOUBLE temperature_1
        DOUBLE dose_1
        DOUBLE temperature_2
        DOUBLE dose_2
        DOUBLE temperature_3
        DOUBLE dose_3
        DOUBLE temperature_4
        DOUBLE dose_4
        DOUBLE temperature_5
        DOUBLE dose_5
        DOUBLE temperature_6
        DOUBLE dose_6
        DOUBLE temperature_7
        DOUBLE dose_7
        DOUBLE temperature_8
        DOUBLE dose_8
        DOUBLE temperature_9
        DOUBLE dose_9
    }

    patient_experiment ||--|| hypothesis : aggregates_to
    patient_experiment ||--|| time_series : pivoted_by_date
    patient_experiment ||--|| patient_metadata : shares_patient_id

File-Level Changes

Change	Details	Files
Document updated workshop prerequisites for hands-on participation.	Added bullet list of requirements like internet-connected machine, HPC access, Jupyter Lab-capable Python environment, and/or Google Colab access. Clarified that HPC accounts/credits are needed only if the user wants to run jobs on an HPC system.	`docs/README.md`
Linked a new DuckDB example module in the source code overview documentation.	Added duckdb entry describing DuckDB SQL query illustrations. Added description noting DuckDB integration with pandas and polars for SQL-like operations on dataframes.	`source-code/README.md`
Introduced a new DuckDB tutorial notebook demonstrating SQL analysis workflows over patient CSV data, including classic and new (lazy) query styles.	Created patients.ipynb notebook that imports duckdb and pandas and connects to CSV-based data using duckdb.connect. Showed how to inspect schema and tables, define helper functions (like show_tables), and run parameterized queries returning pandas DataFrames. Demonstrated aggregations (counts, max temperatures, total dose) with GROUP BY/ORDER BY and creating reusable views (e.g., hypothesis). Illustrated DuckDB pivot usage to transform long patient_experiment data into a wide time_series table and queried specific patient series from it. Added examples of reading a second CSV as a view (patient_metadata), using FULL OUTER JOIN and INNER JOIN to compare/merge datasets, and using CASE/COALESCE for presence classification. Contrasted classic SQL string execution with the newer relational API (conn.sql(...).filter(...).aggregate(...).order(...)) and explained lazy vs immediate evaluation.	`source-code/duckdb/patients.ipynb`
Added patient CSV datasets used by the new DuckDB notebook and documented them in a DuckDB-specific README.	Created DuckDB README describing DuckDB, its intended analytical use case, and the contents of the duckdb example folder. Added patient_experiment.csv and patient_metadata.csv under a duckdb/data directory for use by the notebook examples.	`source-code/duckdb/README.md` `source-code/duckdb/data/patient_experiment.csv` `source-code/duckdb/data/patient_metadata.csv`

sourcery-ai

Hey there - I've reviewed your changes - here's some feedback:

In source-code/README.md the DuckDB section is listed twice (and one bullet has a typo: Kllustrations); consider consolidating these into a single DuckDB bullet with consistent wording.
The notebook currently uses hard-coded relative paths like 'data/patient_experiment.csv' and 'data/patient_metadata.csv'; consider resolving the data directory via pathlib relative to the notebook file to make it more robust when run from different working directories.
The call duckdb.connect('data/patient_experiment.csv') overloads the concept of a database file and the CSV input; it may be clearer to connect to a .duckdb database file (e.g., duckdb.connect('patients.duckdb')) and then read the CSVs via read_csv_auto into views or tables.
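
One way to follow the `pathlib` suggestion above is sketched here. Since `__file__` is not defined inside a notebook, this version searches upward from a starting directory for a folder named `data`. The helper name `find_data_dir` and the upward search are assumptions made for illustration, not something the notebook or the review prescribes.

```python
from pathlib import Path


def find_data_dir(start: Path, name: str = "data") -> Path:
    """Return the nearest directory called `name` in `start` or its parents.

    Hypothetical helper: walks from `start` up to the filesystem root and
    returns the first `<ancestor>/<name>` that exists as a directory.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        data_dir = candidate / name
        if data_dir.is_dir():
            return data_dir
    raise FileNotFoundError(f"no '{name}' directory found above {start}")
```

In the notebook this could be used as `csv_path = find_data_dir(Path.cwd()) / "patient_experiment.csv"`, with `str(csv_path)` passed to DuckDB wherever the hard-coded `'data/patient_experiment.csv'` currently appears.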

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- In source-code/README.md the DuckDB section is listed twice (and one bullet has a typo: `Kllustrations`); consider consolidating these into a single DuckDB bullet with consistent wording.
- The notebook currently uses hard-coded relative paths like `'data/patient_experiment.csv'` and `'data/patient_metadata.csv'`; consider resolving the data directory via `pathlib` relative to the notebook file to make it more robust when run from different working directories.
- The call `duckdb.connect('data/patient_experiment.csv')` overloads the concept of a database file and the CSV input; it may be clearer to connect to a `.duckdb` database file (e.g., `duckdb.connect('patients.duckdb')`) and then read the CSVs via `read_csv_auto` into views or tables.

## Individual Comments

### Comment 1
<location> `source-code/duckdb/patients.ipynb:473-480` </location>
<code_context>
+   "outputs": [],
+   "source": [
+    "conn.execute('''\n",
+    "    CREATE VIEW hypothesis AS SELECT\n",
+    "        patient,\n",
+    "        MAX(temperature) AS 'max_temperature',\n",
+    "        SUM(dose) AS 'total_dose'\n",
+    "      FROM patient_experiment\n",
+    "      GROUP BY patient\n",
+    "      ORDER BY max_temperature DESC;\n",
+    "''');"
+   ]
</code_context>

<issue_to_address>
**suggestion:** Avoid including ORDER BY inside the hypothesis view, since ordering is not guaranteed when selecting from a view.

Here, the ORDER BY has no effect on queries like `SELECT MAX(total_dose) FROM hypothesis`, and it may mislead future readers into thinking the view guarantees sorted output. Please remove the ORDER BY from the view definition and apply ordering only in the outer queries that need it.

```suggestion
    "    CREATE VIEW hypothesis AS SELECT\n",
    "        patient,\n",
    "        MAX(temperature) AS 'max_temperature',\n",
    "        SUM(dose) AS 'total_dose'\n",
    "      FROM patient_experiment\n",
    "      GROUP BY patient;\n",
    "''');"
```
</issue_to_address>

### Comment 2
<location> `docs/README.md:57-59` </location>
<code_context>
 be familiar with these as well.

+For following along hands-on, you need
+* laptop or desktop with internet access.
+* a system set up so you can connect to an HPC system, an account on an HPC
+  system (e.g., VSC, CECI, ...), compute credits if that is required to run
</code_context>

<issue_to_address>
**nitpick (typo):** Add an article to make the bullet grammatically complete.

For example, update this bullet to "* a laptop or desktop with internet access." so it matches the others that start with "a".

```suggestion
For following along hands-on, you need
* a laptop or desktop with internet access.
* a system set up so you can connect to an HPC system, an account on an HPC
```
</issue_to_address>

### Comment 3
<location> `docs/README.md:59-61` </location>
<code_context>

+For following along hands-on, you need
+* laptop or desktop with internet access.
+* a system set up so you can connect to an HPC system, an account on an HPC
+  system (e.g., VSC, CECI, ...), compute credits if that is required to run
+  jobs on the HPC system if you want to use an HPC system;
+* a Python environment that can run Jupyter Lab if you want to use your own system;
+* access to Google Colaboratory if you prefer not to install software.
</code_context>

<issue_to_address>
**suggestion (typo):** Improve readability of the long HPC-related bullet point.

The list in this bullet reads a bit awkwardly, especially around "system (e.g., VSC, CECI, ...), compute credits". Consider adding an "and" before "compute credits" or otherwise restructuring to make the list of requirements clearer.

```suggestion
* a system set up so you can connect to an HPC system, an account on an HPC
  system (e.g., VSC, CECI, ...) and compute credits if required to run
  jobs on the HPC system;
```
</issue_to_address>

### Comment 4
<location> `source-code/README.md:17` </location>
<code_context>
   representation and algorithms.
 * [`pandas`](pandas): illustrations of using pandas and seaborn.
 * [`polars`](polars): Kllustrations of using polars.
+* [`duckdb`](duckdb): illustrations of using DuckDB for SQL queries.
 * [`regexes`](regexes): illustrations of using regular expressions for
</code_context>

<issue_to_address>
**issue (typo):** Fix the typo in "Kllustrations".

Change "Kllustrations" to "illustrations", matching the lowercase used by the neighboring bullets.

```suggestion
* [`polars`](polars): illustrations of using polars.
```
</issue_to_address>

sourcery-ai · 2025-11-25T07:10:16Z

source-code/duckdb/patients.ipynb

+    "    CREATE VIEW hypothesis AS SELECT\n",
+    "        patient,\n",
+    "        MAX(temperature) AS 'max_temperature',\n",
+    "        SUM(dose) AS 'total_dose'\n",
+    "      FROM patient_experiment\n",
+    "      GROUP BY patient\n",
+    "      ORDER BY max_temperature DESC;\n",
+    "''');"


suggestion: Avoid including ORDER BY inside the hypothesis view, since ordering is not guaranteed when selecting from a view.

Here, the ORDER BY has no effect on queries like SELECT MAX(total_dose) FROM hypothesis, and it may mislead future readers into thinking the view guarantees sorted output. Please remove the ORDER BY from the view definition and apply ordering only in the outer queries that need it.

Suggested change

-    "    CREATE VIEW hypothesis AS SELECT\n",
-    "        patient,\n",
-    "        MAX(temperature) AS 'max_temperature',\n",
-    "        SUM(dose) AS 'total_dose'\n",
-    "      FROM patient_experiment\n",
-    "      GROUP BY patient\n",
-    "      ORDER BY max_temperature DESC;\n",
-    "''');"
+    "    CREATE VIEW hypothesis AS SELECT\n",
+    "        patient,\n",
+    "        MAX(temperature) AS 'max_temperature',\n",
+    "        SUM(dose) AS 'total_dose'\n",
+    "      FROM patient_experiment\n",
+    "      GROUP BY patient;\n",
+    "''');"

sourcery-ai · 2025-11-25T07:10:16Z

docs/README.md

+For following along hands-on, you need
+* laptop or desktop with internet access.
+* a system set up so you can connect to an HPC system, an account on an HPC


nitpick (typo): Add an article to make the bullet grammatically complete.

For example, update this bullet to "* a laptop or desktop with internet access." so it matches the others that start with "a".

Suggested change

-For following along hands-on, you need
-* laptop or desktop with internet access.
-* a system set up so you can connect to an HPC system, an account on an HPC
+For following along hands-on, you need
+* a laptop or desktop with internet access.
+* a system set up so you can connect to an HPC system, an account on an HPC

sourcery-ai · 2025-11-25T07:10:16Z

docs/README.md

+* a system set up so you can connect to an HPC system, an account on an HPC
+  system (e.g., VSC, CECI, ...), compute credits if that is required to run
+  jobs on the HPC system if you want to use an HPC system;


suggestion (typo): Improve readability of the long HPC-related bullet point.

The list in this bullet reads a bit awkwardly, especially around "system (e.g., VSC, CECI, ...), compute credits". Consider adding an "and" before "compute credits" or otherwise restructuring to make the list of requirements clearer.

Suggested change

-* a system set up so you can connect to an HPC system, an account on an HPC
-  system (e.g., VSC, CECI, ...), compute credits if that is required to run
-  jobs on the HPC system if you want to use an HPC system;
+* a system set up so you can connect to an HPC system, an account on an HPC
+  system (e.g., VSC, CECI, ...) and compute credits if required to run
+  jobs on the HPC system;

gjbex added 5 commits July 16, 2025 14:22

Add initial work on duckdb

30d3dbd

Merge branch 'development' into feature/duckdb

930ca8f

Add material requirements

53da957

Added pivot operation

014249d

Add DuckDB examples

6f444ec

gjbex merged commit f807e7f into development Nov 25, 2025
1 check was pending

gjbex deleted the feature/duckdb branch November 25, 2025 07:09

sourcery-ai bot reviewed Nov 25, 2025
