[CSV][GQLDF] If you delete CSV rows, and update raw, you get duplicate RAW rows... #1087
Comments
I have the following ideas as potential solutions: unique constraints in DuckDB, or cleanup scripts.
Why complicate things with timestamps or IDs? Can't we do a simple differential of identical rows? Before saving/updating from the CSV, remove rows that already exist in the table.
Because we're working with timeseries data and can leverage those two things to build a much more efficient system... You're advocating for vanilla fetch-and-join strategies that lead to N<>N joins, each of which is O(n^2). Please go ahead and try doing this with 1 million rows of anything. The checks you are suggesting (duplicate checks, plus the overhead of scanning the whole database on every insert) are extremely expensive. There is a reason we've worked on this for months: to build a pipeline that processes records efficiently, does as few scans and joins as possible, and can scale.

[Users should not be deleting their lake.] If anything, focus on this... Enforce the SLAs.

PROPOSED/EXAMPLE Solution: Do a simple check at the CLI level when "lake raw update" or "lake etl update" is called. Warn the user that they'll get duplicates because their csv.head < max(raw_tables.head, etl_tables.head) (i.e. they deleted things)... Prompt the user with a hard stop, so that they are forced to drop their records to get csv.head == table.head... pseudo code:
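A minimal sketch of that check (the original pseudo code was not posted in the thread), assuming hypothetical values: `csv_head`, `raw_head`, and `etl_head` are the newest record timestamps (ints) taken from the CSVs, raw tables, and etl tables; the real CLI entry points and drop commands may differ:

```python
def check_heads_before_update(csv_head: int, raw_head: int, etl_head: int) -> None:
    """Hard-stop guard, run once when "lake raw update" / "lake etl update" starts.

    All arguments are int timestamps of the newest record in the CSVs,
    raw tables, and etl tables respectively (assumed helper values, not
    the actual lake API).
    """
    table_head = max(raw_head, etl_head)
    if csv_head < table_head:
        # The user deleted CSV rows; re-inserting would duplicate raw rows.
        print(
            f"CSV head ({csv_head}) is behind the lake head ({table_head}).\n"
            "Updating now would insert duplicate rows into the raw tables.\n"
            f"Drop raw/etl records back to {csv_head} so that "
            "csv.head == table.head, then re-run the update."
        )
        raise SystemExit(1)
```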
In this example, we simply check whether csv.head has diverged from raw_tables.head & etl_tables.head. Once. When the CLI is first run. Notice that we're prompting the user to drop records rather than just doing it for them. No black boxes. Nothing "smart". Just simple checks, leveraging the system, and then resuming things. This is a much more efficient way of dealing with this. It does everything through timestamps: we look up our checkpoint, then use the drop() functionality to bring the CSVs, raw tables, and etl tables to the same mark. It's a better way of doing things because we're not adding wasteful checks and computation; we're simply looking up our checkpoints (int timestamps) and leveraging the existing functionality to manage the lake. Further, we're stressing the SLAs to the user. Also, this is only done once, at the very beginning of the pipeline, outside its main loop and operations, so it carries no additional overhead. [Repeating Patterns] These do not scale.
Issue: CSV Deletion/Raw Table Duplication
TL;DR: just drop the raw and etl tables before updating and fetching new CSVs.
Checks in save_to_storage against DuckDB raw tables
The reason the problem above happened is that GQLDF uses save_to_storage, which doesn't do any checks against the DuckDB table when doing the insert: it assumes that the DuckDB table will be in the same state as the CSV... clean, empty.
So it's currently the user's responsibility: if they delete rows from the CSV, they must also trim the DuckDB table (the lake) with the drop commands that already exist.
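To make the failure mode concrete, here is a small self-contained illustration (not project code) using the DuckDB Python API: a plain INSERT has no uniqueness check, so saving rows that are already in the table simply appends them again.

```python
import duckdb

con = duckdb.connect()  # in-memory database
con.execute("CREATE TABLE raw_events (ts BIGINT, value DOUBLE)")

rows = [(1700000000, 1.0), (1700000060, 2.0)]
con.executemany("INSERT INTO raw_events VALUES (?, ?)", rows)

# Simulate "delete the CSV, re-fetch, update raw": the same rows are saved again.
con.executemany("INSERT INTO raw_events VALUES (?, ?)", rows)

print(con.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0])  # 4, not 2
```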
And, let's compartmentalise the problem...
- Right now, we do not want users to delete their CSV, or expose them to managing this.
This is a low-priority issue.
[CSV Management - Considerations / Potential Solutions]