
Ingesting CSV files takes a very long time. #9

Closed
sangee2004 opened this issue May 3, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@sangee2004

sangee2004 commented May 3, 2024

Ingestion of CSV files takes a very long time. This CSV file https://github.com/gptscript-ai/csv-reader/blob/main/examples/Electric_Vehicle_Population_Data.csv, which is 42 MB, had not finished ingesting even after 6 minutes.

%/usr/local/bin/knowledge ingest -d  testnewcvs /Users/sangeethahariharan/Downloads/Electric_Vehicle_Population_Data.csv
2024/05/03 13:24:28 INFO IngestOpts opts="{Filename:0x1400709a000 FileMetadata:0x1400705e000 IsDuplicateFuncName: IsDuplicateFunc:0x105378920}"
^C2024/05/03 13:30:28 ERROR Failed to add documents error="couldn't add document '10a2c04c-7a9a-43fd-9c3b-be85d1e226b8': couldn't create embedding of document: couldn't send request: Post \"https://api.openai.com/v1/embeddings\": context canceled"

Even ingestion of a relatively small file, industry_sic.csv (36 kB), takes about 15 seconds.

% /usr/local/bin/knowledge ingest -d  testnewcvs /Users/sangeethahariharan/Downloads/industry_sic.csv                    
2024/05/03 13:31:00 INFO IngestOpts opts="{Filename:0x1400e4da780 FileMetadata:0x1400c69c340 IsDuplicateFuncName: IsDuplicateFunc:0x101ce8920}"
2024/05/03 13:31:15 INFO Ingested document filename=industry_sic.csv count=731 absolute_path=/Users/sangeethahariharan/Downloads/industry_sic.csv
@sangee2004 sangee2004 added the bug Something isn't working label May 3, 2024
@iwilltry42
Collaborator

Confirmed, will check it out in more detail

@iwilltry42
Collaborator

It's pretty slow because the current implementation of the CSV documentloader splits the file into one document (chunk) per row and then calls the embeddings API once per document. #25 should improve embedding speed.
Additionally, I'll put it on my list to create a new variant of the CSV documentloader that allows reading the whole CSV as a single document, or as a set of documents with a pre-defined maximum size (roughly as sketched below).
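For illustration, here's a minimal Go sketch of what such a chunked CSV loader could look like, batching rows into documents of at most a pre-defined number of characters. The Document type, function name, and chunking rule are assumptions for this example, not the actual documentloader API:

// Sketch of a CSV loader variant that groups rows into size-bounded chunks
// instead of emitting one document (and one embeddings call) per row.
// Document, LoadCSVChunked, and maxChars are illustrative names only.
package csvloader

import (
    "encoding/csv"
    "io"
    "strings"
)

// Document stands in for the loader's document/chunk type.
type Document struct {
    Content string
}

// LoadCSVChunked reads all rows and joins them into documents of at most
// maxChars characters each, so a large file produces far fewer chunks
// than one per row.
func LoadCSVChunked(r io.Reader, maxChars int) ([]Document, error) {
    reader := csv.NewReader(r)
    var docs []Document
    var buf strings.Builder

    flush := func() {
        if buf.Len() > 0 {
            docs = append(docs, Document{Content: buf.String()})
            buf.Reset()
        }
    }

    for {
        row, err := reader.Read()
        if err == io.EOF {
            break
        }
        if err != nil {
            return nil, err
        }
        line := strings.Join(row, ", ")
        // Start a new document once adding this row would exceed maxChars.
        if buf.Len() > 0 && buf.Len()+len(line)+1 > maxChars {
            flush()
        }
        buf.WriteString(line)
        buf.WriteString("\n")
    }
    flush()
    return docs, nil
}

With a reasonable maxChars, the 731 rows of industry_sic.csv would collapse into a handful of embedding calls instead of one per row.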

@StrongMonkey
Contributor

Also, since a CSV is usually a structured dataset, it might be better to use a tool like https://github.com/gptscript-ai/structured-data-querier. RAG might not be a good fit when there are millions of rows in a CSV file.

@sangee2004
Author

When unsupported file formats in a directory that gets ingested are ignored, can we provide the user with a message indicating which files were not ingested?

@iwilltry42
Collaborator

I agree with @StrongMonkey that the knowledge tool may not be the best tool for very structured data like CSV, at least when it comes to factual searches with specific answers. It may still work if it's only about finding a single row with some content, or for more exploratory searches.

@sangee2004 I think we have warning/debug logs indicating that files are being ignored. However, those are not shown to the LLM, so they may be hidden from the user. I'm not sure what the best approach is here, but to be transparent I guess we could log this information to stdout as well, so that the LLM can tell the user which files have been ignored (see the sketch below) 🤔
WDYT @StrongMonkey ?
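As a rough sketch of that idea (the function name, supported-extension list, and message format are made up for illustration, not the tool's actual behavior), the ingest step could print skipped files to stdout in addition to the debug log:

// Sketch: report ignored files on stdout so they end up in the output the
// LLM (and thus the user) sees. Names and extension list are illustrative.
package ingest

import (
    "fmt"
    "path/filepath"
    "strings"
)

var supportedExt = map[string]bool{".pdf": true, ".txt": true, ".md": true, ".csv": true}

// reportIgnored prints files with unsupported extensions to stdout and
// returns only the files that will actually be ingested.
func reportIgnored(files []string) []string {
    var keep []string
    for _, f := range files {
        if supportedExt[strings.ToLower(filepath.Ext(f))] {
            keep = append(keep, f)
            continue
        }
        fmt.Printf("ignored (unsupported file type): %s\n", f)
    }
    return keep
}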

@iwilltry42 iwilltry42 changed the title Ingesting cvs files takes a very long time. Ingesting CSV files takes a very long time. Aug 5, 2024
@iwilltry42
Collaborator

Closing this in favor of #90
