Use SQLTransforms to compute SQL Source schemas on database #246

Merged (3 commits), Feb 23, 2022

@philippjfr (Member) commented Feb 22, 2022

Currently, all sources load their data into memory to compute the schema. For many sources this is acceptable because they can at least leverage dask to lazily load the data and compute the unique categories or the range of a numerical or date value. For most SQL sources, however, this can be extremely inefficient. So for SQL sources we leverage the SQLTransforms recently added to Lumen to let the database itself compute the ranges and unique categories for each column.

The approach can be described as follows:

  1. Fetch the schema for the SQL query, but with LIMIT 1, so that only a single row is loaded.
  2. Find all columns with ranges and all columns with enums, then compute these values over the whole table using MIN/MAX and DISTINCT SQL statements respectively.
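
The two steps above can be sketched roughly as follows. This is a minimal illustration using only Python's standard `sqlite3` module, not Lumen's actual implementation: the function name `sql_schema`, the layout of the returned dictionary, and the rule "numeric first-row value means range column" are all assumptions made for the example.

```python
import sqlite3


def sql_schema(conn, query):
    """Hypothetical helper: describe each column of `query` without loading the table."""
    # Step 1: run the query with LIMIT 1 to learn the column names and value types.
    cur = conn.execute(f"SELECT * FROM ({query}) LIMIT 1")
    names = [desc[0] for desc in cur.description]
    row = cur.fetchone()
    if row is None:
        # Empty result: nothing to aggregate, so only the column names are known.
        return {name: {} for name in names}

    schema = {}
    for name, value in zip(names, row):
        if isinstance(value, (int, float)):
            # Step 2a: range column, let the database compute MIN/MAX over all rows.
            lo, hi = conn.execute(
                f'SELECT MIN("{name}"), MAX("{name}") FROM ({query})'
            ).fetchone()
            schema[name] = {"type": "number", "minimum": lo, "maximum": hi}
        else:
            # Step 2b: enum column, let the database compute the unique values.
            rows = conn.execute(
                f'SELECT DISTINCT "{name}" FROM ({query}) ORDER BY 1'
            ).fetchall()
            schema[name] = {"type": "string", "enum": [r[0] for r in rows]}
    return schema


# Usage with a small in-memory table (made-up example data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (city TEXT, temp REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("Oslo", 3.5), ("Lima", 19.0), ("Oslo", -2.0)],
)
schema = sql_schema(conn, "SELECT * FROM t")
```

Only one row plus the aggregated values ever leave the database. A real implementation would also need to handle a NULL in the first row and date columns, which this sketch ignores.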

@philippjfr (Member, Author) commented:

I actually don't think this will currently work if there are multiple enums, since I'm not sure what DISTINCT will do there.
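
For reference, in standard SQL a `SELECT DISTINCT` over several columns returns the distinct combinations of those columns, not the unique values of each column separately. The sketch below demonstrates this with `sqlite3` and shows one possible workaround, issuing one DISTINCT query per enum column. The table, the data, and the helper `distinct_values` are hypothetical, and this is not necessarily how the PR resolved the concern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (color TEXT, size TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("red", "S"), ("red", "M"), ("blue", "S"), ("red", "S")],
)

# DISTINCT over both columns yields unique (color, size) pairs.
combos = conn.execute(
    "SELECT DISTINCT color, size FROM t ORDER BY color, size"
).fetchall()


def distinct_values(conn, table, column):
    """Hypothetical helper: unique values of a single column, sorted."""
    rows = conn.execute(
        f'SELECT DISTINCT "{column}" FROM "{table}" ORDER BY 1'
    ).fetchall()
    return [r[0] for r in rows]


# One query per enum column gives the per-column categories.
enums = {col: distinct_values(conn, "t", col) for col in ("color", "size")}
```

Here `combos` holds three pairs while `enums` holds two values per column, so the per-column enums cannot simply be read off a single multi-column DISTINCT without further deduplication.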

@eli-pinkus (Collaborator) left a comment

A couple of clarifying questions; otherwise, all good from me, keeping in mind the points that @TvieiraB brought up. Appreciate the quick turnaround on this one.

Resolved review threads:
lumen/sources/intake.py
lumen/sources/intake_sql.py
lumen/transforms/sql.py

@codecov-commenter commented Feb 22, 2022

Codecov Report

Merging #246 (257c657) into master (1eee9ca) will increase coverage by 0.20%.
The diff coverage is 87.50%.

@@            Coverage Diff             @@
##           master     #246      +/-   ##
==========================================
+ Coverage   63.90%   64.11%   +0.20%     
==========================================
  Files          56       56              
  Lines        5054     5113      +59     
==========================================
+ Hits         3230     3278      +48     
- Misses       1824     1835      +11     

Impacted Files                           Coverage Δ
lumen/sources/intake.py                  74.13% <0.00%> (-13.59%) ⬇️
lumen/sources/intake_sql.py              91.22% <89.36%> (+1.22%) ⬆️
lumen/tests/sources/test_intake_sql.py   100.00% <100.00%> (ø)
lumen/transforms/sql.py                  97.61% <100.00%> (+1.46%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@philippjfr philippjfr merged commit e13aa21 into master Feb 23, 2022
@maximlt maximlt deleted the sql_schema branch February 23, 2022 15:01
@github-actions commented:

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 11, 2023