Conversation

@gjbex (Owner) commented Nov 25, 2025

Summary by Sourcery

Add a new DuckDB-based patient data analysis example and update documentation for workshop prerequisites and available code examples.

New Features:

  • Introduce a DuckDB Jupyter notebook demonstrating SQL queries, views, pivots, and joins over patient experiment and metadata CSV datasets.
  • Add patient_experiment and patient_metadata CSV data files used by the DuckDB notebook.
  • Document the new DuckDB example module in the source code README and describe the notebook and data files in a dedicated DuckDB README.

Documentation:

  • Clarify hands-on workshop prerequisites, including HPC access, Python/Jupyter environment, and Google Colab as an alternative.

@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

@sourcery-ai (bot) commented Nov 25, 2025

Reviewer's Guide

Adds a new DuckDB-based patient data analysis example, wires it into the source-code docs, and updates workshop prerequisites to mention environment and access requirements.

Sequence diagram for new-style lazy DuckDB high fever query

sequenceDiagram
    actor Analyst
    participant Notebook
    participant DuckDBPyConnection
    participant DuckDBPyRelation
    participant DuckDBEngine

    Analyst->>Notebook: Run_cell_create_connection
    Notebook->>DuckDBPyConnection: connect(database_path)
    DuckDBPyConnection-->>Notebook: connection_instance

    Analyst->>Notebook: Run_cell_build_lazy_query
    Notebook->>DuckDBPyConnection: sql(SELECT patient, temperature FROM patient_experiment)
    DuckDBPyConnection-->>Notebook: DuckDBPyRelation

    Notebook->>DuckDBPyRelation: filter(temperature > 39_5)
    DuckDBPyRelation-->>Notebook: DuckDBPyRelation

    Notebook->>DuckDBPyRelation: aggregate(patient, COUNT(temperature) AS high_fever_count, MAX(temperature) AS max_temperature, group_expr=patient)
    DuckDBPyRelation-->>Notebook: DuckDBPyRelation

    Analyst->>Notebook: Run_show_on_relation
    Notebook->>DuckDBPyRelation: show()
    DuckDBPyRelation->>DuckDBEngine: execute_lazy_plan
    DuckDBEngine-->>DuckDBPyRelation: result_rows
    DuckDBPyRelation-->>Notebook: rendered_table
    Notebook-->>Analyst: Display_high_fever_counts

Entity relationship diagram for DuckDB patient analysis views and tables

erDiagram
    patient_experiment {
        BIGINT patient
        TIMESTAMP date
        DOUBLE temperature
        DOUBLE dose
    }

    patient_metadata {
        BIGINT patient
        VARCHAR condition
        VARCHAR gender
        VARCHAR filename
    }

    hypothesis {
        BIGINT patient
        DOUBLE max_temperature
        DOUBLE total_dose
    }

    time_series {
        TIMESTAMP date
        DOUBLE temperature_1
        DOUBLE dose_1
        DOUBLE temperature_2
        DOUBLE dose_2
        DOUBLE temperature_3
        DOUBLE dose_3
        DOUBLE temperature_4
        DOUBLE dose_4
        DOUBLE temperature_5
        DOUBLE dose_5
        DOUBLE temperature_6
        DOUBLE dose_6
        DOUBLE temperature_7
        DOUBLE dose_7
        DOUBLE temperature_8
        DOUBLE dose_8
        DOUBLE temperature_9
        DOUBLE dose_9
    }

    patient_experiment }|--|| hypothesis : aggregates_to
    patient_experiment }|--|| time_series : pivoted_by_date
    patient_experiment ||--|| patient_metadata : shares_patient_id

File-Level Changes

Change: Documented updated workshop prerequisites for hands-on participation.
Details:
  • Added a bullet list of requirements: an internet-connected machine, HPC access, a Python environment capable of running Jupyter Lab, and/or Google Colab access.
  • Clarified that HPC accounts and compute credits are needed only if the user wants to run jobs on an HPC system.
Files: docs/README.md

Change: Linked a new DuckDB example module in the source code overview documentation.
Details:
  • Added a duckdb entry describing the DuckDB SQL query illustrations.
  • Added a description noting DuckDB's integration with pandas and polars for SQL-like operations on dataframes.
Files: source-code/README.md

Change: Introduced a new DuckDB tutorial notebook demonstrating SQL analysis workflows over patient CSV data, including classic and new (lazy) query styles.
Details:
  • Created the patients.ipynb notebook, which imports duckdb and pandas and connects to CSV-based data using duckdb.connect.
  • Showed how to inspect the schema and tables, define helper functions (such as show_tables), and run parameterized queries that return pandas DataFrames.
  • Demonstrated aggregations (counts, maximum temperatures, total dose) with GROUP BY/ORDER BY and the creation of reusable views (e.g., hypothesis).
  • Illustrated DuckDB's pivot to transform the long patient_experiment data into a wide time_series table, and queried the series of specific patients from it.
  • Added examples of reading a second CSV as a view (patient_metadata), using FULL OUTER JOIN and INNER JOIN to compare and merge datasets, and using CASE/COALESCE to classify in which dataset a patient is present.
  • Contrasted classic SQL string execution with the newer relational API (conn.sql(...).filter(...).aggregate(...).order(...)) and explained lazy versus immediate evaluation.
Files: source-code/duckdb/patients.ipynb

Change: Added the patient CSV datasets used by the new DuckDB notebook and documented them in a DuckDB-specific README.
Details:
  • Created a DuckDB README describing DuckDB, its intended analytical use case, and the contents of the duckdb example folder.
  • Added patient_experiment.csv and patient_metadata.csv under a duckdb/data directory for use by the notebook examples.
Files:
  • source-code/duckdb/README.md
  • source-code/duckdb/data/patient_experiment.csv
  • source-code/duckdb/data/patient_metadata.csv

@gjbex merged commit f807e7f into development on Nov 25, 2025
1 check was pending
@gjbex deleted the feature/duckdb branch on November 25, 2025 07:09

@sourcery-ai (bot) left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • In source-code/README.md the DuckDB section is listed twice (and one bullet has a typo: Kllustrations); consider consolidating these into a single DuckDB bullet with consistent wording.
  • The notebook currently uses hard-coded relative paths like 'data/patient_experiment.csv' and 'data/patient_metadata.csv'; consider resolving the data directory via pathlib relative to the notebook file to make it more robust when run from different working directories.
  • The call duckdb.connect('data/patient_experiment.csv') overloads the concept of a database file and the CSV input; it may be clearer to connect to a .duckdb database file (e.g., duckdb.connect('patients.duckdb')) and then read the CSVs via read_csv_auto into views or tables.
Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `source-code/duckdb/patients.ipynb:473-480` </location>
<code_context>
+   "outputs": [],
+   "source": [
+    "conn.execute('''\n",
+    "    CREATE VIEW hypothesis AS SELECT\n",
+    "        patient,\n",
+    "        MAX(temperature) AS 'max_temperature',\n",
+    "        SUM(dose) AS 'total_dose'\n",
+    "      FROM patient_experiment\n",
+    "      GROUP BY patient\n",
+    "      ORDER BY max_temperature DESC;\n",
+    "''');"
+   ]
</code_context>

<issue_to_address>
**suggestion:** Avoid including ORDER BY inside the hypothesis view, since ordering is not guaranteed when selecting from a view.

Here, the ORDER BY has no effect on queries like `SELECT MAX(total_dose) FROM hypothesis`, and it may mislead future readers into thinking the view guarantees sorted output. Please remove the ORDER BY from the view definition and apply ordering only in the outer queries that need it.

```suggestion
    "    CREATE VIEW hypothesis AS SELECT\n",
    "        patient,\n",
    "        MAX(temperature) AS 'max_temperature',\n",
    "        SUM(dose) AS 'total_dose'\n",
    "      FROM patient_experiment\n",
    "      GROUP BY patient;\n",
    "''');"
```
</issue_to_address>

### Comment 2
<location> `docs/README.md:57-59` </location>
<code_context>
 be familiar with these as well.

+For following along hands-on, you need
+* laptop or desktop with internet access.
+* a system set up so you can connect to an HPC system, an account on an HPC
+  system (e.g., VSC, CECI, ...), compute credits if that is required to run
</code_context>

<issue_to_address>
**nitpick (typo):** Add an article to make the bullet grammatically complete.

For example, update this bullet to "* a laptop or desktop with internet access." so it matches the others that start with "a".

```suggestion
For following along hands-on, you need
* a laptop or desktop with internet access.
* a system set up so you can connect to an HPC system, an account on an HPC
```
</issue_to_address>

### Comment 3
<location> `docs/README.md:59-61` </location>
<code_context>

+For following along hands-on, you need
+* laptop or desktop with internet access.
+* a system set up so you can connect to an HPC system, an account on an HPC
+  system (e.g., VSC, CECI, ...), compute credits if that is required to run
+  jobs on the HPC system if you want to use an HPC system;
+* a Python environment that can run Jupyter Lab if you want to use your own system;
+* access to Google Colaboratory if you prefer not to install software.
</code_context>

<issue_to_address>
**suggestion (typo):** Improve readability of the long HPC-related bullet point.

The list in this bullet reads a bit awkwardly, especially around "system (e.g., VSC, CECI, ...), compute credits". Consider adding an "and" before "compute credits" or otherwise restructuring to make the list of requirements clearer.

```suggestion
* a system set up so you can connect to an HPC system, an account on an HPC
  system (e.g., VSC, CECI, ...) and compute credits if required to run
  jobs on the HPC system;
```
</issue_to_address>

### Comment 4
<location> `source-code/README.md:17` </location>
<code_context>
   representation and algorithms.
 * [`pandas`](pandas): illustrations of using pandas and seaborn.
 * [`polars`](polars): Kllustrations of using polars.
+* [`duckdb`](duckdb): illustrations of using DuckDB for SQL queries.
 * [`regexes`](regexes): illustrations of using regular expressions for
</code_context>

<issue_to_address>
**issue (typo):** Fix the typo in "Kllustrations".

Change "Kllustrations" to "illustrations" so that the bullet matches the wording of the other entries.

```suggestion
* [`polars`](polars): illustrations of using polars.
```
</issue_to_address>

