Fix various issues with subindex reloading #618

Closed
jbouder opened this issue Dec 19, 2023 · 19 comments

@jbouder

jbouder commented Dec 19, 2023

I can't seem to get txtai running properly within our AKS (Azure Kubernetes Service) environment, specifically when mapping to external storage. We're creating our txtai instance with a config similar to the below:

writable: true
path: /mnt/data

embeddings:
  content: true
  defaults: false

  indexes:
    document:
      path: sentence-transformers/multi-qa-mpnet-base-dot-v1
      columns:
        text: document
        object: _document

wiki.QAsearch:
  application: embeddings
  model_path: t5-small

workflow:
  wikisearch:
    tasks:
      - action: wiki.QAsearch

I am able to start the txtai instance successfully, but when I call the index endpoint, it essentially just spins and eventually times out. Upon investigation of the pod, I can see that the config and database files are created, but the database file is empty and the indexes directories are not created at all. Below is what I see when inspecting the mapped directory:

root@txtai-api:/mnt/data# ls -al
total 5
drwxrwxrwx 2 root root    0 Dec 18 22:09 .
drwxr-xr-x 1 root root 4096 Dec 19 14:22 ..
-rwxrwxrwx 1 root root  853 Dec 19 14:25 config
-rwxrwxrwx 1 root root    0 Dec 19 14:25 documents

And I am not seeing any actual errors in the pod logs when I call index.

Lastly, the above setup does work on a local machine, and it works when not mapping to external storage. Any ideas what might be going on?

@davidmezzetti
Member

Nothing with the configuration looks out of the ordinary. Can you confirm you can write to the storage using another process/program? For example, a test python script that just writes some files to the external storage mount.
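A minimal sketch of such a write test, assuming only the Python standard library. `check_writable` is a hypothetical helper name, not part of txtai, and `/mnt/data` is the mount path from the config above.

```python
import os
import uuid


def check_writable(directory):
    """Write a small file under `directory`, read it back, then delete it."""
    name = os.path.join(directory, f"write-test-{uuid.uuid4().hex}")
    payload = b"txtai storage check"

    with open(name, "wb") as handle:
        handle.write(payload)
        handle.flush()
        os.fsync(handle.fileno())  # ask the OS to push the bytes to the mount

    with open(name, "rb") as handle:
        ok = handle.read() == payload

    os.remove(name)
    return ok


# Example (inside the pod): check_writable("/mnt/data")
```

This only shows that plain file writes succeed. It says nothing about locking or memory-mapped access, which SQLite and Faiss may rely on.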

@jbouder
Author

jbouder commented Dec 22, 2023

All seems well on that front. I was able to exec into the pod and...create a file through bash, create a directory through bash, create a file through python, create a directory through python. And all files were showing in external storage as expected.

Not sure it helps, but in case you want to try to reproduce, this can also be replicated with Azure Container Apps, which is much easier to set up than an AKS cluster.

@davidmezzetti
Member

It's hard to tell what the issue could be. If this is presented as a regular file volume, perhaps Faiss or SQLite performs some file operation that the filesystem doesn't support.
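One way to test that hypothesis is to exercise SQLite on its own against the mount, outside of txtai. The sketch below is an illustration under assumptions: `check_sqlite` and the `check_table` name are hypothetical, and it uses only the standard-library `sqlite3` module.

```python
import os
import sqlite3


def check_sqlite(directory):
    """Create a SQLite database under `directory`, write one row, and read it back."""
    path = os.path.join(directory, "sqlite-check.db")

    # timeout: seconds to wait on a locked database before raising an error
    connection = sqlite3.connect(path, timeout=5)
    try:
        connection.execute("CREATE TABLE IF NOT EXISTS check_table (id INTEGER PRIMARY KEY, text TEXT)")
        connection.execute("INSERT INTO check_table (text) VALUES (?)", ("hello",))
        connection.commit()  # persist the row to the database file
        rows = connection.execute("SELECT text FROM check_table").fetchall()
    finally:
        connection.close()

    os.remove(path)
    return rows


# Example (inside the pod): check_sqlite("/mnt/data")
```

If this hangs or raises on the mounted path but works on local disk, that would point at the filesystem rather than at txtai.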

You can try to debug the components directly using methods found in this article: https://neuml.hashnode.dev/embeddings-index-components

@jbouder
Author

jbouder commented Dec 22, 2023

Yeah, I’m wondering the same thing. Right now I’m using Azure File storage, but might try Blob storage next. A few questions though while I continue to troubleshoot on my end:

  1. Any chance there might be new cloud configs for Azure?
  2. What is the default location where I might find the embeddings, index, etc? I’m wondering if maybe the path config might be doing something I’m not expecting
  3. Should any of these files be present when the app first loads? Or are they not created until I call an index?

@davidmezzetti
Member

  1. Any chance there might be new cloud configs for Azure?
     In what way? Outside of what is provided by Apache Libcloud?
  2. What is the default location where I might find the embeddings, index, etc? I’m wondering if maybe the path config might be doing something I’m not expecting.
     There would be a file named embeddings in the directory.
  3. Should any of these files be present when the app first loads? Or are they not created until I call an index?
     The files aren't created until index time.

Does the behavior change at all if you set path to path: /mnt/data/index, if you disable content or you set the ANN backend to another value like hnsw?

@jbouder
Author

jbouder commented Dec 22, 2023

For 2 above, what would be the file path if I don’t provide a path config?

@jbouder
Author

jbouder commented Dec 26, 2023

A few updates:

  1. Changing the path to /mnt/data/index doesn't make a difference
  2. Disabling content DOES result in the indexes being created, but as expected, the content db is no longer available, which we will need. I'm going to try setting it up for an external db once again, but I'm expecting that will not work. Any thoughts?

@davidmezzetti
Member

That is interesting. What happens if you set content to duckdb (you'll need to pip install duckdb)? I wonder if something is silently failing with SQLite. What version of Python are you using? SQLite is bundled with Python.
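A quick way to answer the version question from inside the container, as a small sketch using only the standard library (`versions` is a hypothetical helper name):

```python
import sqlite3
import sys


def versions():
    """Return the Python version and the SQLite library version bundled with it."""
    return {"python": sys.version.split()[0], "sqlite": sqlite3.sqlite_version}


print(versions())
```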

@jbouder
Author

jbouder commented Dec 26, 2023

Haven't tried duckdb yet, but an update before proceeding...updating to use an external Postgres DB (setting content: client and providing a CLIENT_URL env variable) does allow the app to start up correctly and I can successfully embed/index some data...however when I restart the container, the index isn't returning anything. The data is still in the content db and the embeddings are still in storage as before, but the app doesn't seem to recognize it. Also, I can re-embed/index successfully, but that kind of defeats the purpose of what I'm trying to do.

And it looks like the container is running: Python 3.8.10

@jbouder
Author

jbouder commented Dec 26, 2023

I may have figured out the issue...digging through the logic a bit, it looks like when the app initializes, you check for an embeddings file based on the path provided here: https://github.com/neuml/txtai/blob/b44d5778d87a81662cae563082089dde2661c61e/src/python/txtai/embeddings/base.py#L508C35-L508C35.

In our case however, since we're using sub-indexes, the embeddings files are within the indexes sub-directories, not in the top level directory (/mnt/data). Does that make sense? And any thoughts on how to proceed?
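To make the mismatch concrete, here is a rough sketch that reports where `embeddings` files actually exist under an index directory. It assumes the on-disk layout implied by the traceback further down this thread (`{path}/embeddings` at the top level and `{path}/indexes/<name>/embeddings` per sub-index). `find_embeddings` is a hypothetical helper, not a txtai function.

```python
import os


def find_embeddings(path):
    """Report whether an `embeddings` file exists at the top level and in each sub-index."""
    found = {"top": os.path.isfile(os.path.join(path, "embeddings")), "subindexes": {}}

    indexes = os.path.join(path, "indexes")
    if os.path.isdir(indexes):
        for name in sorted(os.listdir(indexes)):
            sub = os.path.join(indexes, name)
            if os.path.isdir(sub):
                # True when this sub-index has its own embeddings file
                found["subindexes"][name] = os.path.isfile(os.path.join(sub, "embeddings"))

    return found


# Example (inside the pod): find_embeddings("/mnt/data")
```

With only sub-indexes configured, the expectation would be `"top": False` alongside one or more `True` entries under `"subindexes"`.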

@jbouder
Author

jbouder commented Dec 26, 2023

Ok, final bit of findings...based on the above, a previous comment you made in another thread, and something I saw in the source code about using archive files, I updated my config to use index compression. After doing that, I was able to verify I only had 1 file in external storage (index.tar.gz) and when I restart the container, the app picks up the index as expected 🎉 ...although that was only after I worked around 1 other error. Essentially, my current setup has 2 sub-indexes, but for testing I was only indexing data into 1 of them. Upon restart, I noticed the following errors:

2023-12-26T22:26:58.358006112Z ERROR:    Traceback (most recent call last):
2023-12-26T22:26:58.358011993Z   File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 677, in lifespan
2023-12-26T22:26:58.358015800Z     async with self.lifespan_context(app) as maybe_state:
2023-12-26T22:26:58.358021461Z   File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 538, in __aenter__
2023-12-26T22:26:58.358027672Z     return self._cm.__enter__()
2023-12-26T22:26:58.358033092Z   File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__
2023-12-26T22:26:58.358038182Z     return next(self.gen)
2023-12-26T22:26:58.358043812Z   File "/usr/local/lib/python3.8/dist-packages/txtai/api/application.py", line 89, in lifespan
2023-12-26T22:26:58.358049784Z     INSTANCE = Factory.create(config, api) if api else API(config)
2023-12-26T22:26:58.358054973Z   File "/usr/local/lib/python3.8/dist-packages/txtai/api/base.py", line 18, in __init__
2023-12-26T22:26:58.358059842Z     super().__init__(config, loaddata)
2023-12-26T22:26:58.358064792Z   File "/usr/local/lib/python3.8/dist-packages/txtai/app/base.py", line 78, in __init__
2023-12-26T22:26:58.358070763Z     self.indexes(loaddata)
2023-12-26T22:26:58.358075662Z   File "/usr/local/lib/python3.8/dist-packages/txtai/app/base.py", line 209, in indexes
2023-12-26T22:26:58.358080431Z     self.embeddings.load(self.config.get("path"), self.config.get("cloud"))
2023-12-26T22:26:58.358085090Z   File "/usr/local/lib/python3.8/dist-packages/txtai/embeddings/base.py", line 556, in load
2023-12-26T22:26:58.358089708Z     self.indexes.load(f"{path}/indexes")
2023-12-26T22:26:58.358095048Z   File "/usr/local/lib/python3.8/dist-packages/txtai/embeddings/index/indexes.py", line 158, in load
2023-12-26T22:26:58.358099186Z     index.load(os.path.join(path, name))
2023-12-26T22:26:58.358103584Z   File "/usr/local/lib/python3.8/dist-packages/txtai/embeddings/base.py", line 536, in load
2023-12-26T22:26:58.358107962Z     self.ann.load(f"{path}/embeddings")
2023-12-26T22:26:58.358112290Z   File "/usr/local/lib/python3.8/dist-packages/txtai/ann/faiss.py", line 32, in load
2023-12-26T22:26:58.358117130Z     self.backend = readindex(path, IO_FLAG_MMAP if self.setting("mmap") is True else 0)
2023-12-26T22:26:58.358121548Z   File "/usr/local/lib/python3.8/dist-packages/faiss/swigfaiss_avx2.py", line 10206, in read_index
2023-12-26T22:26:58.358125856Z     return _swigfaiss_avx2.read_index(*args)
2023-12-26T22:26:58.358130975Z RuntimeError: Error in faiss::FileIOReader::FileIOReader(const char*) at /project/faiss/faiss/impl/io.cpp:67: Error: 'f' failed: could not open /tmp/tmp9gjgdja_/indexes/document2/embeddings for reading: No such file or directory
2023-12-26T22:26:58.358136616Z 
2023-12-26T22:26:58.358150742Z ERROR:    Application startup failed. Exiting.

So it looks like it's trying to find an embeddings file in the other sub-index (which I hadn't indexed any data into yet). I then indexed some data into that sub-index and the app started fine.
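For anyone hitting the same startup failure, a sketch like the following could list which sub-indexes have no `embeddings` file inside the compressed archive before restarting. It assumes the archive mirrors the extracted layout shown in the traceback (`indexes/<name>/embeddings`). `missing_embeddings` is a hypothetical helper, and the sub-index names are passed in by the caller.

```python
import posixpath
import tarfile


def missing_embeddings(archive, names):
    """Return the sub-index names with no `indexes/<name>/embeddings` member in the archive."""
    with tarfile.open(archive, "r:gz") as tar:
        # Normalize member names so "./indexes/..." and "indexes/..." compare equal
        members = {posixpath.normpath(m.name) for m in tar.getmembers() if m.isfile()}

    return [name for name in names if f"indexes/{name}/embeddings" not in members]


# Example: missing_embeddings("/mnt/data/index.tar.gz", ["document", "document2"])
```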

...so, 2 things:

  1. What are the implications of using index compression? Is that a scalable/performant solution for an application which will index thousands of documents? If not, any thoughts on why it doesn't work without index compression and when might that be resolved?
  2. While I can work around the error above, it looks like a bug that should be fixed. Any thoughts on that potential bug?

TIA!

@davidmezzetti
Member

Thank you for the dedication on trying to solve this issue. It sounds like it might be more API related than AKS related, which is good from a reproducibility standpoint. Lot for me to unpack but I'll try to put focused time on this in the next couple of days.

@jbouder
Author

jbouder commented Dec 27, 2023

Yup, finally was able to dedicate some time myself. And definitely a lot of rambling on my part. Please don’t hesitate to reach out if you have any questions or need me to try anything. Thanks again!!

@jbouder
Author

jbouder commented Dec 27, 2023

FYI, DuckDB works. Indexes are created as expected. But I still have the restart issue with sub-indexes.

@jbouder
Author

jbouder commented Jan 19, 2024

Hello! Just wanted to check in on any progress with this. TIA!

@davidmezzetti
Member

I don't have an answer yet. I have pending work to run on K8s clusters and was hoping to see if anything came up with that.

@davidmezzetti davidmezzetti self-assigned this Feb 2, 2024
@davidmezzetti davidmezzetti added the bug Something isn't working label Feb 2, 2024
@davidmezzetti davidmezzetti added this to the v6.4.0 milestone Feb 2, 2024
@davidmezzetti davidmezzetti changed the title from "Index not being created within AKS pod, and path mapped to Persisted Volume" to "Fix various issues with subindex reloading" Feb 2, 2024
@davidmezzetti
Member

I just checked in a change that I believe addresses this issue. If you want to confirm, you can install txtai from GitHub.

@jbouder
Author

jbouder commented Feb 2, 2024

It works! Currently running with index compression, external mounted storage, and external postgres content storage. Restarted the pod and everything seems to have come up fine.

Thanks for the help!

@davidmezzetti
Member

Great, glad to hear it! I'll go ahead and close this issue.
