-
Notifications
You must be signed in to change notification settings - Fork 20
Feature/duckdb #25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/duckdb #25
Conversation
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
Reviewer's GuideAdds a new DuckDB-based patient data analysis example, wires it into the source-code docs, and updates workshop prerequisites to mention environment and access requirements. Sequence diagram for new-style lazy DuckDB high fever querysequenceDiagram
actor Analyst
participant Notebook
participant DuckDBPyConnection
participant DuckDBPyRelation
participant DuckDBEngine
Analyst->>Notebook: Run_cell_create_connection
Notebook->>DuckDBPyConnection: connect(database_path)
DuckDBPyConnection-->>Notebook: connection_instance
Analyst->>Notebook: Run_cell_build_lazy_query
Notebook->>DuckDBPyConnection: sql(SELECT patient, temperature FROM patient_experiment)
DuckDBPyConnection-->>Notebook: DuckDBPyRelation
Notebook->>DuckDBPyRelation: filter(temperature > 39_5)
DuckDBPyRelation-->>Notebook: DuckDBPyRelation
Notebook->>DuckDBPyRelation: aggregate(patient, COUNT(temperature) AS high_fever_count, MAX(temperature) AS max_temperature, group_expr=patient)
DuckDBPyRelation-->>Notebook: DuckDBPyRelation
Analyst->>Notebook: Run_show_on_relation
Notebook->>DuckDBPyRelation: show()
DuckDBPyRelation->>DuckDBEngine: execute_lazy_plan
DuckDBEngine-->>DuckDBPyRelation: result_rows
DuckDBPyRelation-->>Notebook: rendered_table
Notebook-->>Analyst: Display_high_fever_counts
Entity relationship diagram for DuckDB patient analysis views and tableserDiagram
patient_experiment {
BIGINT patient
TIMESTAMP date
DOUBLE temperature
DOUBLE dose
}
patient_metadata {
BIGINT patient
VARCHAR condition
VARCHAR gender
VARCHAR filename
}
hypothesis {
BIGINT patient
DOUBLE max_temperature
DOUBLE total_dose
}
time_series {
TIMESTAMP date
DOUBLE temperature_1
DOUBLE dose_1
DOUBLE temperature_2
DOUBLE dose_2
DOUBLE temperature_3
DOUBLE dose_3
DOUBLE temperature_4
DOUBLE dose_4
DOUBLE temperature_5
DOUBLE dose_5
DOUBLE temperature_6
DOUBLE dose_6
DOUBLE temperature_7
DOUBLE dose_7
DOUBLE temperature_8
DOUBLE dose_8
DOUBLE temperature_9
DOUBLE dose_9
}
patient_experiment ||--|| hypothesis : aggregates_to
patient_experiment ||--|| time_series : pivoted_by_date
patient_experiment ||--|| patient_metadata : shares_patient_id
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes - here's some feedback:
- In source-code/README.md the DuckDB section is listed twice (and one bullet has a typo:
Kllustrations); consider consolidating these into a single DuckDB bullet with consistent wording. - The notebook currently uses hard-coded relative paths like
'data/patient_experiment.csv'and'data/patient_metadata.csv'; consider resolving the data directory viapathlibrelative to the notebook file to make it more robust when run from different working directories. - The call
duckdb.connect('data/patient_experiment.csv')overloads the concept of a database file and the CSV input; it may be clearer to connect to a.duckdbdatabase file (e.g.,duckdb.connect('patients.duckdb')) and then read the CSVs viaread_csv_autointo views or tables.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In source-code/README.md the DuckDB section is listed twice (and one bullet has a typo: `Kllustrations`); consider consolidating these into a single DuckDB bullet with consistent wording.
- The notebook currently uses hard-coded relative paths like `'data/patient_experiment.csv'` and `'data/patient_metadata.csv'`; consider resolving the data directory via `pathlib` relative to the notebook file to make it more robust when run from different working directories.
- The call `duckdb.connect('data/patient_experiment.csv')` overloads the concept of a database file and the CSV input; it may be clearer to connect to a `.duckdb` database file (e.g., `duckdb.connect('patients.duckdb')`) and then read the CSVs via `read_csv_auto` into views or tables.
## Individual Comments
### Comment 1
<location> `source-code/duckdb/patients.ipynb:473-480` </location>
<code_context>
+ "outputs": [],
+ "source": [
+ "conn.execute('''\n",
+ " CREATE VIEW hypothesis AS SELECT\n",
+ " patient,\n",
+ " MAX(temperature) AS 'max_temperature',\n",
+ " SUM(dose) AS 'total_dose'\n",
+ " FROM patient_experiment\n",
+ " GROUP BY patient\n",
+ " ORDER BY max_temperature DESC;\n",
+ "''');"
+ ]
</code_context>
<issue_to_address>
**suggestion:** Avoid including ORDER BY inside the hypothesis view, since ordering is not guaranteed when selecting from a view.
Here, the ORDER BY has no effect on queries like `SELECT MAX(total_dose) FROM hypothesis`, and it may mislead future readers into thinking the view guarantees sorted output. Please remove the ORDER BY from the view definition and apply ordering only in the outer queries that need it.
```suggestion
" CREATE VIEW hypothesis AS SELECT\n",
" patient,\n",
" MAX(temperature) AS 'max_temperature',\n",
" SUM(dose) AS 'total_dose'\n",
" FROM patient_experiment\n",
" GROUP BY patient;\n",
"''');"
```
</issue_to_address>
### Comment 2
<location> `docs/README.md:57-59` </location>
<code_context>
be familiar with these as well.
+For following along hands-on, you need
+* laptop or desktop with internet access.
+* a system set up so you can connect to an HPC system, an account on an HPC
+ system (e.g., VSC, CECI, ...), compute credits if that is required to run
</code_context>
<issue_to_address>
**nitpick (typo):** Add an article to make the bullet grammatically complete.
For example, update this bullet to "* a laptop or desktop with internet access." so it matches the others that start with "a".
```suggestion
For following along hands-on, you need
* a laptop or desktop with internet access.
* a system set up so you can connect to an HPC system, an account on an HPC
```
</issue_to_address>
### Comment 3
<location> `docs/README.md:59-61` </location>
<code_context>
+For following along hands-on, you need
+* laptop or desktop with internet access.
+* a system set up so you can connect to an HPC system, an account on an HPC
+ system (e.g., VSC, CECI, ...), compute credits if that is required to run
+ jobs on the HPC system if you want to use an HPC system;
+* a Python environment that can run Jupyter Lab if you want to use your own system;
+* access to Google Colaboratory if you prefer not to install software.
</code_context>
<issue_to_address>
**suggestion (typo):** Improve readability of the long HPC-related bullet point.
The list in this bullet reads a bit awkwardly, especially around "system (e.g., VSC, CECI, ...), compute credits". Consider adding an "and" before "compute credits" or otherwise restructuring to make the list of requirements clearer.
```suggestion
* a system set up so you can connect to an HPC system, an account on an HPC
system (e.g., VSC, CECI, ...) and compute credits if required to run
jobs on the HPC system;
```
</issue_to_address>
### Comment 4
<location> `source-code/README.md:17` </location>
<code_context>
representation and algorithms.
* [`pandas`](pandas): illustrations of using pandas and seaborn.
* [`polars`](polars): Kllustrations of using polars.
+* [`duckdb`](duckdb): illustrations of using DuckDB for SQL queries.
* [`regexes`](regexes): illustrations of using regular expressions for
</code_context>
<issue_to_address>
**issue (typo):** Fix the typo in "Kllustrations".
Change "Kllustrations" to "Illustrations".
```suggestion
* [`polars`](polars): illustrations of using polars.
```
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| " CREATE VIEW hypothesis AS SELECT\n", | ||
| " patient,\n", | ||
| " MAX(temperature) AS 'max_temperature',\n", | ||
| " SUM(dose) AS 'total_dose'\n", | ||
| " FROM patient_experiment\n", | ||
| " GROUP BY patient\n", | ||
| " ORDER BY max_temperature DESC;\n", | ||
| "''');" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Avoid including ORDER BY inside the hypothesis view, since ordering is not guaranteed when selecting from a view.
Here, the ORDER BY has no effect on queries like SELECT MAX(total_dose) FROM hypothesis, and it may mislead future readers into thinking the view guarantees sorted output. Please remove the ORDER BY from the view definition and apply ordering only in the outer queries that need it.
| " CREATE VIEW hypothesis AS SELECT\n", | |
| " patient,\n", | |
| " MAX(temperature) AS 'max_temperature',\n", | |
| " SUM(dose) AS 'total_dose'\n", | |
| " FROM patient_experiment\n", | |
| " GROUP BY patient\n", | |
| " ORDER BY max_temperature DESC;\n", | |
| "''');" | |
| " CREATE VIEW hypothesis AS SELECT\n", | |
| " patient,\n", | |
| " MAX(temperature) AS 'max_temperature',\n", | |
| " SUM(dose) AS 'total_dose'\n", | |
| " FROM patient_experiment\n", | |
| " GROUP BY patient;\n", | |
| "''');" |
| For following along hands-on, you need | ||
| * laptop or desktop with internet access. | ||
| * a system set up so you can connect to an HPC system, an account on an HPC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick (typo): Add an article to make the bullet grammatically complete.
For example, update this bullet to "* a laptop or desktop with internet access." so it matches the others that start with "a".
| For following along hands-on, you need | |
| * laptop or desktop with internet access. | |
| * a system set up so you can connect to an HPC system, an account on an HPC | |
| For following along hands-on, you need | |
| * a laptop or desktop with internet access. | |
| * a system set up so you can connect to an HPC system, an account on an HPC |
| * a system set up so you can connect to an HPC system, an account on an HPC | ||
| system (e.g., VSC, CECI, ...), compute credits if that is required to run | ||
| jobs on the HPC system if you want to use an HPC system; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (typo): Improve readability of the long HPC-related bullet point.
The list in this bullet reads a bit awkwardly, especially around "system (e.g., VSC, CECI, ...), compute credits". Consider adding an "and" before "compute credits" or otherwise restructuring to make the list of requirements clearer.
| * a system set up so you can connect to an HPC system, an account on an HPC | |
| system (e.g., VSC, CECI, ...), compute credits if that is required to run | |
| jobs on the HPC system if you want to use an HPC system; | |
| * a system set up so you can connect to an HPC system, an account on an HPC | |
| system (e.g., VSC, CECI, ...) and compute credits if required to run | |
| jobs on the HPC system; |
Summary by Sourcery
Add a new DuckDB-based patient data analysis example and update documentation for workshop prerequisites and available code examples.
New Features:
Documentation: