Use SQLTransforms to compute SQL Source schemas on database #246

philippjfr · 2022-02-22T17:42:59Z

Currently all sources load the data into memory to compute the schema. For many sources this is okay because they can at least leverage dask to lazily load the data and compute the unique categories or the range of a numerical or date column. Unfortunately, for most SQL sources this can be extremely inefficient. So for SQL sources we leverage the SQLTransforms recently added to Lumen to let the database itself compute the ranges and unique categories for each column.

The approach can be described as follows:

1. Fetch the schema for the SQL query, but with LIMIT 1.
2. Find all columns with ranges and all columns with enums, then compute these values over the whole table using MIN/MAX and DISTINCT SQL statements respectively.
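
The two steps above can be sketched roughly as follows. This is a minimal illustration using the standard-library `sqlite3` module, not the code in this PR: the `compute_schema` helper, the `penguins` table, and the `minimum`/`maximum`/`enum` key names are all hypothetical stand-ins, and Lumen's actual SQLTransform classes are not used here.

```python
import sqlite3


def compute_schema(conn, table):
    """Return a JSON-schema-like dict, letting the database do the aggregation.

    Hypothetical helper for illustration; table and column names are assumed
    to be trusted, since they are interpolated into the SQL text.
    """
    # Step 1: fetch a single row to learn column names and (roughly) their types.
    cur = conn.execute(f'SELECT * FROM "{table}" LIMIT 1')
    row = cur.fetchone()
    if row is None:  # empty table: nothing to describe
        return {}
    columns = [desc[0] for desc in cur.description]

    schema = {}
    for name, sample in zip(columns, row):
        if isinstance(sample, (int, float)):
            # Step 2a: numeric column, so ask the database for its range.
            low, high = conn.execute(
                f'SELECT MIN("{name}"), MAX("{name}") FROM "{table}"'
            ).fetchone()
            schema[name] = {"type": "number", "minimum": low, "maximum": high}
        else:
            # Step 2b: anything else is treated as categorical in this sketch;
            # one DISTINCT query per column keeps the columns independent.
            rows = conn.execute(
                f'SELECT DISTINCT "{name}" FROM "{table}" WHERE "{name}" IS NOT NULL'
            )
            schema[name] = {"type": "string", "enum": sorted(r[0] for r in rows)}
    return schema


# Small in-memory table standing in for a real SQL source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE penguins (species TEXT, body_mass REAL)")
conn.executemany(
    "INSERT INTO penguins VALUES (?, ?)",
    [("Adelie", 3750.0), ("Gentoo", 5000.0), ("Adelie", 3800.0)],
)
schema = compute_schema(conn, "penguins")
print(schema)
```

Only the single sampled row and the aggregated results cross the connection, which is the point of pushing the computation to the database.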

lumen/sources/intake_sql.py

philippjfr · 2022-02-22T17:57:04Z

I actually don't think this will currently work if there are multiple enums, since I'm not sure what DISTINCT will do there.
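
For reference, in standard SQL `DISTINCT` applies to the whole select list, so selecting several columns at once returns the unique row combinations rather than the unique values of each column. The sketch below shows the difference with `sqlite3`; the table and column names are made up, and querying each column separately is just one possible workaround, not necessarily what this PR ended up doing.

```python
import sqlite3

# Hypothetical table with two categorical columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE penguins (species TEXT, island TEXT)")
conn.executemany(
    "INSERT INTO penguins VALUES (?, ?)",
    [("Adelie", "Dream"), ("Adelie", "Biscoe"), ("Gentoo", "Biscoe")],
)

# DISTINCT over both columns yields unique (species, island) pairs.
combos = conn.execute("SELECT DISTINCT species, island FROM penguins").fetchall()

# One DISTINCT query per column yields the per-column categories instead.
per_column = {
    col: sorted(r[0] for r in conn.execute(f"SELECT DISTINCT {col} FROM penguins"))
    for col in ("species", "island")
}

print(len(combos))
print(per_column)
```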

eli-pinkus

A couple of clarification questions; otherwise, all good from me, keeping in mind the points that @TvieiraB brought up. Appreciate the quick turnaround on this one.

lumen/sources/intake.py

lumen/sources/intake_sql.py

lumen/transforms/sql.py

codecov-commenter · 2022-02-22T17:58:37Z

Codecov Report

Merging #246 (257c657) into master (1eee9ca) will increase coverage by 0.20%.
The diff coverage is 87.50%.

@@            Coverage Diff             @@
##           master     #246      +/-   ##
==========================================
+ Coverage   63.90%   64.11%   +0.20%     
==========================================
  Files          56       56              
  Lines        5054     5113      +59     
==========================================
+ Hits         3230     3278      +48     
- Misses       1824     1835      +11

| Impacted Files | Coverage Δ | |
|---|---|---|
| lumen/sources/intake.py | `74.13% <0.00%> (-13.59%)` | ⬇️ |
| lumen/sources/intake_sql.py | `91.22% <89.36%> (+1.22%)` | ⬆️ |
| lumen/tests/sources/test_intake_sql.py | `100.00% <100.00%> (ø)` | |
| lumen/transforms/sql.py | `97.61% <100.00%> (+1.46%)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1eee9ca...257c657.

github-actions · 2023-07-11T08:25:58Z

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

TvieiraB reviewed Feb 22, 2022

lumen/sources/intake_sql.py (resolved)

philippjfr added 2 commits February 22, 2022 18:55

Use SQLTransforms to compute SQL Source schemas on database

7d8d6f5

Fix linting errors

257c657

philippjfr force-pushed the sql_schema branch from b41e673 to 257c657 on February 22, 2022 17:55

eli-pinkus reviewed Feb 22, 2022

lumen/sources/intake.py (resolved)

lumen/sources/intake_sql.py (resolved)

lumen/transforms/sql.py (resolved)

Fix handling of distinct on sql source schema

7a7a1d4

philippjfr merged commit e13aa21 into master Feb 23, 2022

maximlt deleted the sql_schema branch February 23, 2022 15:01

github-actions bot locked as resolved and limited conversation to collaborators Jul 11, 2023
