Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inserts are unstable when indexes are defined on parquet table #989

Closed
34code opened this issue Mar 16, 2024 · 5 comments · Fixed by #1006
Closed

Inserts are unstable when indexes are defined on parquet table #989

34code opened this issue Mar 16, 2024 · 5 comments · Fixed by #1006
Labels
bug Something isn't working priority-medium Medium priority issue user-request This issue was directly requested by a user

Comments

@34code
Copy link

34code commented Mar 16, 2024

Bug Description

Inserting records from an existing non-parquet table to a parquet table is unstable.

How To Reproduce
--pg_analytics contd
CREATE TABLE t_items (
img_1_url character varying,
img_2_url character varying,
img_3_url character varying,
img_4_url character varying,
title character varying,
alt_title character varying,
category character varying,
description text,
barcode character varying PRIMARY KEY,
price_count integer,
prices_updated_at timestamp without time zone,
brand character varying DEFAULT 'Unknown'::character varying,
product_type character varying DEFAULT 'Unknown'::character varying,
min_latest_price double precision,
created_at timestamp(6) without time zone NOT NULL,
updated_at timestamp(6) without time zone NOT NULL,
img_5_url character varying,
asin character varying,
source text DEFAULT 'Store.com'::text,
bm25_id SERIAL
) USING parquet;

CREATE UNIQUE INDEX t_items_pkey ON t_items(barcode text_ops);
CREATE UNIQUE INDEX t_items_barcode_idx ON t_items(barcode text_ops);
CREATE INDEX t_items_brand_idx ON t_items(brand text_ops);
CREATE INDEX t_items_category_idx ON t_items(category text_ops);
CREATE INDEX t_items_min_price_idx ON t_items(min_latest_price float8_ops);
CREATE INDEX t_items_price_count_idx ON t_items(price_count int4_ops);
CREATE INDEX t_items_prices_updated_at_idx ON t_items(prices_updated_at timestamp_ops);
CREATE INDEX t_items_product_type_idx ON t_items(product_type text_ops);
CREATE INDEX t_items_search_idx_bm25_index ON t_items((t_items.*) anyelement_bm25_ops);

insert into t_items(
img_1_url,
img_2_url,
img_3_url,
img_4_url,
title,
alt_title,
category,
description,
barcode,
price_count,
prices_updated_at,
brand,
product_type,
min_latest_price,
created_at,
updated_at,
img_5_url,
asin,
source,
bm25_id
)
select
img_1_url,
img_2_url,
img_3_url,
img_4_url,
title,
alt_title,
category,
description,
barcode,
price_count,
prices_updated_at,
brand,
product_type,
min_latest_price,
created_at,
updated_at,
img_5_url,
asin,
source,
bm25_id
from items
limit 10000;

Proposed Fix
Noticed only 412 or 413 rows are being inserted and then this error is being thrown:

ERROR:  table tid from new index tuple (4294967295,0) cannot find insert offset between offsets 1 and 3 of block 1 in index "t_items_category_idx"
@philippemnoel philippemnoel added bug Something isn't working priority-medium Medium priority issue pg_analytics user-request This issue was directly requested by a user labels Mar 16, 2024
@rebasedming
Copy link
Collaborator

Same root cause as #975 , index support over Parquet tables is not yet implemented. We're working on it.

@rebasedming
Copy link
Collaborator

Hey @34code , do you want to test this branch and see if it fixes your issue? https://github.com/paradedb/paradedb/pulls

By the way, I noticed you created lots of b-tree indexes over the Parquet table. I wouldn't recommend doing this -- b-tree indexes are designed to speed up sequential scans over row-oriented tables. Parquet tables are column oriented, so b-tree indexes are ignored during query execution.

I also noticed you created a BM25 index using CREATE INDEX. Please use the provided paradedb.create_bm25 function for this instead.

@34code
Copy link
Author

34code commented Apr 5, 2024

Thanks, I think the CREATE INDEX command was from the table dump and restore. But good to know I should not be doing it that way..

@34code
Copy link
Author

34code commented Apr 5, 2024

Also, what's the best way to test that branch? I'm assuming there isn't a docker image until you do a release right?

@philippemnoel
Copy link
Collaborator

Also, what's the best way to test that branch? I'm assuming there isn't a docker image until you do a release right?

You're right. Best way would be to clone the repo, checkout the branch, then build the extension. All the details are in the pg_analytics README. If that's too much hassle, we'll hopefully be releasing this feature soon but would appreciate any help in testing it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working priority-medium Medium priority issue user-request This issue was directly requested by a user
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants