Ingesting CSV files takes a very long time. #9
Comments
Confirmed, will check it out in more detail
It's pretty slow, because the current implementation of the CSV documentloader splits the file into one document (chunk) per row and then calls the embeddings API once per document. #25 should improve embeddings speed.
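For illustration, here is a minimal sketch of the batching idea mentioned above: group rows so that one embeddings request covers many rows instead of one request per row. This is not the actual documentloader code; all names and the batch size are hypothetical.

```go
package main

import "fmt"

// batchRows groups CSV rows into batches so a single embeddings request
// can cover many rows instead of issuing one request per row.
// The function name and batch size are illustrative only.
func batchRows(rows []string, batchSize int) [][]string {
	var batches [][]string
	for start := 0; start < len(rows); start += batchSize {
		end := start + batchSize
		if end > len(rows) {
			end = len(rows)
		}
		batches = append(batches, rows[start:end])
	}
	return batches
}

func main() {
	rows := []string{"row1", "row2", "row3", "row4", "row5"}
	for _, batch := range batchRows(rows, 2) {
		// A real implementation would send the whole batch to the
		// embeddings API in a single request here.
		fmt.Printf("embed %d rows in one request\n", len(batch))
	}
}
```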
Also, since a CSV is usually a structured dataset, it might be better to use a tool like https://github.com/gptscript-ai/structured-data-querier. It feels like RAG might not be a good fit when there are millions of rows in a CSV file.
When there are unsupported file formats that get ignored in the directory being ingested, can we provide the user with a message indicating which files were not ingested?
I agree with @StrongMonkey that the knowledge tool may not be the best tool for very structured data like CSV, at least when it comes to factual searches with specific answers. It may work, though, if it's only about finding a single row with some content or for more exploratory searches. @sangee2004 I think we have warning/debug logs indicating that files are being ignored. However, those are not shown to the LLM, so they may be hidden from the user. I'm not sure what the best approach is here, but to be transparent I guess we can log this information to stdout as well, so that the LLM can tell the user which files have been ignored 🤔
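As a rough sketch of that idea, the ingester could print skipped files to stdout so the message reaches the LLM and, in turn, the user. The extension set and function names below are made up for illustration and do not mirror the knowledge tool's code.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// supported lists the file extensions the ingester can handle.
// This set is illustrative only.
var supported = map[string]bool{".txt": true, ".md": true, ".pdf": true, ".csv": true}

// reportIgnored prints any files that will be skipped, so the notice
// ends up on stdout where the LLM (and thus the user) can see it.
func reportIgnored(files []string) {
	for _, f := range files {
		if !supported[filepath.Ext(f)] {
			fmt.Printf("ignored unsupported file: %s\n", f)
		}
	}
}

func main() {
	reportIgnored([]string{"notes.md", "archive.zip", "data.csv"})
}
```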
Closing this in favor of #90
Ingestion of CSV files takes a very long time. This CSV file https://github.com/gptscript-ai/csv-reader/blob/main/examples/Electric_Vehicle_Population_Data.csv, which is 42 MB, had not finished ingesting even after 6 minutes.
Even ingesting the relatively small file industry_sic.csv (36 kB) takes about 15 seconds.