Ingesting CSV files takes a very long time. #9
Comments
Confirmed, will check it out in more detail
It's pretty slow, because the current implementation of the CSV documentloader splits the file into one document (chunk) per row and then calls the embeddings API once per document. #25 should improve embeddings speed.
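For illustration, here is a minimal sketch of the batching idea mentioned above: group rows so that one embeddings request covers many rows instead of one request per row. This is not the actual documentloader code; all names and the batch size are hypothetical.

```go
package main

import "fmt"

// batchRows groups CSV rows into batches so a single embeddings request
// can cover many rows instead of issuing one request per row.
// The function name and batch size are illustrative only.
func batchRows(rows []string, batchSize int) [][]string {
	var batches [][]string
	for start := 0; start < len(rows); start += batchSize {
		end := start + batchSize
		if end > len(rows) {
			end = len(rows)
		}
		batches = append(batches, rows[start:end])
	}
	return batches
}

func main() {
	rows := []string{"row1", "row2", "row3", "row4", "row5"}
	for _, batch := range batchRows(rows, 2) {
		// A real implementation would send the whole batch to the
		// embeddings API in a single request here.
		fmt.Printf("embed %d rows in one request\n", len(batch))
	}
}
```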
Also, since a CSV is usually a structured dataset, it might be better to use a tool like https://github.com/gptscript-ai/structured-data-querier. It feels like RAG might not be a good fit when there are millions of rows in a CSV file.
When there are unsupported file formats that get ignored in the directory being ingested, can we provide the user with a message indicating which files were not ingested?
I agree with @StrongMonkey that the knowledge tool may not be the best tool for very structured data like CSV, at least when it comes to factual searches with specific answers. It may work, though, if it's only about finding a single row with some content or for more exploratory searches. @sangee2004 I think we have warning/debug logs indicating that files are being ignored. However, those are not shown to the LLM, so they may be hidden from the user. I'm not sure what the best approach is here, but to be transparent I guess we can log this information to stdout as well, so that the LLM can tell the user which files have been ignored 🤔
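As a rough sketch of that idea, the ingester could print skipped files to stdout so the message reaches the LLM and, in turn, the user. The extension set and function names below are made up for illustration and do not mirror the knowledge tool's code.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// supported lists the file extensions the ingester can handle.
// This set is illustrative only.
var supported = map[string]bool{".txt": true, ".md": true, ".pdf": true, ".csv": true}

// reportIgnored prints any files that will be skipped, so the notice
// ends up on stdout where the LLM (and thus the user) can see it.
func reportIgnored(files []string) {
	for _, f := range files {
		if !supported[filepath.Ext(f)] {
			fmt.Printf("ignored unsupported file: %s\n", f)
		}
	}
}

func main() {
	reportIgnored([]string{"notes.md", "archive.zip", "data.csv"})
}
```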
Closing this in favor of #90
Ingestion of CSV files takes a very long time. This CSV file https://github.com/gptscript-ai/csv-reader/blob/main/examples/Electric_Vehicle_Population_Data.csv, which is 42 MB, had not finished ingesting even after 6 minutes.
Even ingesting the relatively small file industry_sic.csv (36 kB) takes about 15 seconds.