feat: Add SQL support for NATURAL joins and the COLUMNS function #17295

Merged
merged 2 commits into from
Jul 1, 2024

Collaborator

@alexander-beedie alexander-beedie commented Jun 29, 2024

New

  • Adds NATURAL [INNER|LEFT|FULL] JOIN [1] support to the SQL interface (this flavour of join automatically uses the columns common to both tables as the join key), along with the associated SQL docs entries/examples.
  • Partial implementation of the useful DuckDB COLUMNS [2] function; this maps almost 1:1 onto Polars' col expression, enabling the use of regular expressions to select multiple columns. It also offers some (currently limited) broadcasting capabilities: at the moment, broadcasting only works in the WHERE clause.
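
To illustrate how a NATURAL join infers its key, here is a minimal pure-Python sketch of an inner natural join over lists of dict rows. It is not the Polars implementation: the helper names `natural_join_keys` and `natural_inner_join` are hypothetical, and the sketch follows standard SQL semantics by emitting each key column once, whereas the Polars SQL output may also carry suffixed right-hand key columns.

```python
def natural_join_keys(left_cols: list[str], right_cols: list[str]) -> list[str]:
    """Return the columns present in both tables, in left-table order."""
    right = set(right_cols)
    return [c for c in left_cols if c in right]


def natural_inner_join(left: list[dict], right: list[dict]) -> list[dict]:
    """Inner-join two row lists on every column name they share."""
    if not left or not right:
        return []
    keys = natural_join_keys(list(left[0]), list(right[0]))

    # Index the right-hand rows by their join-key values.
    index: dict[tuple, list[dict]] = {}
    for rrow in right:
        index.setdefault(tuple(rrow[k] for k in keys), []).append(rrow)

    out = []
    for lrow in left:
        for rrow in index.get(tuple(lrow[k] for k in keys), []):
            merged = dict(lrow)
            # Key columns are already present from the left row; add the rest.
            merged.update({c: v for c, v in rrow.items() if c not in keys})
            out.append(merged)
    return out


characters = [
    {"CharacterID": 1, "LastName": "Gurgeh"},
    {"CharacterID": 2, "LastName": "Zakalwe"},
]
books = [
    {"CharacterID": 2, "Book": "Use of Weapons"},
    {"CharacterID": 3, "Book": "Excession"},
]

joined = natural_inner_join(characters, books)
print(joined)
# [{'CharacterID': 2, 'LastName': 'Zakalwe', 'Book': 'Use of Weapons'}]
```

No ON or USING clause is needed because the key (`CharacterID` here) is derived purely from the overlap in column names.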

Also

  • Ensure SQLFunctionVisitor translations also have access to the current/active schema.
  • Some refactoring, with more to come later; the logic is getting quite involved, and could do with more inline comments, explanations and additional cleanups.

Examples

Establish test data:

import polars as pl

df1 = pl.DataFrame({
    "CharacterID": [1, 2, 3],
    "FirstName": ["Jernau Morat", "Cheradenine", "Byr"],
    "LastName": ["Gurgeh", "Zakalwe", "Genar-Hofoen"],
})
df2 = pl.DataFrame({
    "CharacterID": [1, 2, 3],
    "Book": ["Player of Games", "Use of Weapons", "Excession"],
})
df3 = pl.DataFrame({
    "CharacterID": [1, 2, 3],
    "Ship": ["Limiting Factor", "Xenophobe", "Grey Area"],
})

Apply NATURAL joins, using a COLUMNS regex to avoid selecting the <col>:<suffix> columns:

pl.sql("""
  SELECT COLUMNS('^[^:]+$')
  FROM df1
    NATURAL JOIN df2
    NATURAL JOIN df3
  ORDER BY CharacterID
""").collect()

# shape: (3, 5)
# ┌─────────────┬──────────────┬──────────────┬─────────────────┬─────────────────┐
# │ CharacterID ┆ FirstName    ┆ LastName     ┆ Book            ┆ Ship            │
# │ ---         ┆ ---          ┆ ---          ┆ ---             ┆ ---             │
# │ i64         ┆ str          ┆ str          ┆ str             ┆ str             │
# ╞═════════════╪══════════════╪══════════════╪═════════════════╪═════════════════╡
# │ 1           ┆ Jernau Morat ┆ Gurgeh       ┆ Player of Games ┆ Limiting Factor │
# │ 2           ┆ Cheradenine  ┆ Zakalwe      ┆ Use of Weapons  ┆ Xenophobe       │
# │ 3           ┆ Byr          ┆ Genar-Hofoen ┆ Excession       ┆ Grey Area       │
# └─────────────┴──────────────┴──────────────┴─────────────────┴─────────────────┘
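
The effect of the `'^[^:]+$'` pattern can be checked in isolation with Python's `re` module. This is a sketch only: the helper `select_columns` is hypothetical, and the suffixed names `CharacterID:df2` and `CharacterID:df3` are an assumption about how the duplicated join-key columns are labelled, following the <col>:<suffix> form mentioned above.

```python
import re


def select_columns(names: list[str], pattern: str) -> list[str]:
    """Keep the column names that match the given regular expression."""
    rx = re.compile(pattern)
    return [name for name in names if rx.search(name)]


# Assumed column names of the joined frame before selection (hypothetical).
joined_names = [
    "CharacterID", "FirstName", "LastName",
    "CharacterID:df2", "Book",
    "CharacterID:df3", "Ship",
]

# '^[^:]+$' matches only names made up entirely of non-colon characters.
selected = select_columns(joined_names, r"^[^:]+$")
print(selected)
# ['CharacterID', 'FirstName', 'LastName', 'Book', 'Ship']
```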

Use COLUMNS to broadcast a constraint in the WHERE clause, keeping only rows where every column has a value >= 4:

df1 = pl.DataFrame({
    "x": [1, 5, 3, 8, 6, 7, 4, 0, 2],
    "y": [3, 4, 6, 8, 3, 4, 1, 7, 8],
})
df2 = pl.DataFrame({
    "y": [0, 4, 0, 8, 0, 4, 0, 7, 0],
    "z": [9, 8, 7, 6, 5, 4, 3, 2, 1],
})

pl.sql("""
  SELECT * EXCLUDE "y:df2"
  FROM df1 NATURAL JOIN df2
  WHERE COLUMNS(*) >= 4
""").collect()

# shape: (5, 3)
# ┌─────┬─────┬─────┐
# │ x   ┆ y   ┆ z   │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ 5   ┆ 4   ┆ 8   │
# │ 7   ┆ 4   ┆ 8   │
# │ 8   ┆ 8   ┆ 6   │
# │ 5   ┆ 4   ┆ 4   │
# │ 7   ┆ 4   ┆ 4   │
# └─────┴─────┴─────┘
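
As a cross-check on the result above, the broadcast predicate can be read as "apply `>= 4` to each column and AND the results". The pure-Python sketch below reproduces the join on `y` and that filter using the same data; it only models the semantics and is not how Polars evaluates the query, and its row order may differ from the engine's output.

```python
xs = [1, 5, 3, 8, 6, 7, 4, 0, 2]
ys = [3, 4, 6, 8, 3, 4, 1, 7, 8]
y2 = [0, 4, 0, 8, 0, 4, 0, 7, 0]
zs = [9, 8, 7, 6, 5, 4, 3, 2, 1]

# Inner join on the shared column "y" (the natural-join key).
joined = [
    {"x": x, "y": y, "z": z}
    for x, y in zip(xs, ys)
    for other_y, z in zip(y2, zs)
    if y == other_y
]

# Broadcast the constraint: a row survives only if every column is >= 4.
kept = [row for row in joined if all(value >= 4 for value in row.values())]

print(len(kept))
# 5
print(sorted((r["x"], r["y"], r["z"]) for r in kept))
# [(5, 4, 4), (5, 4, 8), (7, 4, 4), (7, 4, 8), (8, 8, 6)]
```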

Footnotes

  1. NATURAL JOIN: https://www.postgresql.org/docs/current/queries-table-expressions.html#QUERIES-JOIN

  2. COLUMNS: https://duckdb.org/docs/sql/expressions/star.html#columns-regular-expression

feat: Add SQL support for `NATURAL` joins and `COLUMNS`

feat: Add SQL support for `NATURAL` joins and `COLUMNS` function
@alexander-beedie alexander-beedie marked this pull request as ready for review June 29, 2024 19:21
@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Jun 29, 2024
@alexander-beedie alexander-beedie added the A-sql Area: Polars SQL functionality label Jun 29, 2024
codecov bot commented Jun 29, 2024

Codecov Report

Attention: Patch coverage is 87.62887% with 36 lines in your changes missing coverage. Please review.

Project coverage is 80.71%. Comparing base (d444b79) to head (1aaa1fb).
Report is 1 commits behind head on main.

Files                                Patch %   Lines
crates/polars-sql/src/context.rs     90.56%    20 Missing ⚠️
crates/polars-sql/src/functions.rs   78.94%    16 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #17295      +/-   ##
==========================================
- Coverage   80.72%   80.71%   -0.01%     
==========================================
  Files        1475     1475              
  Lines      193162   193260      +98     
  Branches     2751     2751              
==========================================
+ Hits       155922   155993      +71     
- Misses      36732    36759      +27     
  Partials      508      508              


@alexander-beedie alexander-beedie changed the title feat: Add SQL support for NATURAL joins and COLUMNS function feat: Add SQL support for NATURAL joins and the COLUMNS function Jun 29, 2024
@ritchie46 ritchie46 merged commit 41610e3 into pola-rs:main Jul 1, 2024
37 checks passed
@alexander-beedie alexander-beedie deleted the sql-natural-join branch July 1, 2024 06:58
This pull request was closed.