Feature request - to add support for nomic-ai/nomic-embed-text-v1 embeddings. #1418
Comments
Probably bge-m3 is better. Did you try that one? It also has long context and is much smaller than other LLM-based models.
For nomic, it uses an unreleased sentence-transformers 2.4.0dev. An attempt to pip install that dev version leads to failures in their API, so nothing is usable. Once sentence-transformers 2.4.0 is released without bugs, we can upgrade h2oGPT and pass the required trust_remote_code option; that option does not exist in prior versions of sentence-transformers.
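Once a sentence-transformers release supports it, the fix amounts to passing trust_remote_code=True when constructing the embedding model. A minimal sketch of how that option could be wired up (the helper name and logic are hypothetical, not h2oGPT's actual code):

```python
def embedding_model_kwargs(model_name: str) -> dict:
    # Models such as nomic-ai/nomic-embed-text-v1 ship custom modeling
    # code in their repo, so sentence-transformers (>= 2.4.0) must be
    # told to execute it via trust_remote_code.
    kwargs = {}
    if model_name.startswith("nomic-ai/"):
        kwargs["trust_remote_code"] = True
    return kwargs

# These kwargs would then be forwarded, e.g.:
# SentenceTransformer("nomic-ai/nomic-embed-text-v1",
#                     **embedding_model_kwargs("nomic-ai/nomic-embed-text-v1"))
```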
@pseudotensor No, I didn't try bge-m3, will try. To use it, do I just specify BAAI/bge-m3, or do I need to install anything?
@pseudotensor Wanted to try bge, but it failed (though regardless of the embeddings model). A few months ago this was working fine, when I did ingestion with another embeddings model (same dataset).
No other info is available; the process just terminated in the middle.
@pseudotensor Hi, can you please advise on the make_db issue I mentioned above?
For docTR you can't use their repo; it has to be installed from our fork, as in readme_linux.md or its linux_install.sh. The missing model suggests the original docTR repo is being used.
@pseudotensor Thanks.
Looks like a system OOM. You can check.
@pseudotensor Indeed OOM, but I don't know why this started happening; the memory size of the process reached 86GB, while my Mac has 32GB plus swap. In the past I was able to create the db from those files (tried the default embeddings now). Please advise.
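One quick way to confirm where the memory goes is to log the process's peak RSS during ingestion. A stdlib-only sketch (hypothetical helper, not part of h2oGPT; note that ru_maxrss units differ between Linux and macOS):

```python
import resource
import sys

def peak_rss_mb() -> float:
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        return rss / (1024 * 1024)
    return rss / 1024
```

Calling this periodically (e.g. after each parsed document) would show whether the 86GB spike happens during parsing or during embedding.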
Maybe one of the parsers went nuts; e.g., tesseract may have a bug. On gpt.h2o.ai I had one case where memory peaked at 512GB. Are you able to see from the verbose logging which document might have been the issue?
@pseudotensor No, I don't see it, and now I have only text files (removed the pdfs) and still get the same issue.
@pseudotensor Seems that this issue happens after the parsing.
Nice tool. It looks as if the Chroma team changed something. Maybe the batch size is larger and they send an arbitrarily large batch to the embedding model. Can you add some prints to this code? Lines 318 to 346 in 190310d
Specifically, the
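As an illustration of the kind of instrumentation being asked for, here is a hypothetical wrapper that splits one huge collection.add() into bounded batches and prints each batch size (not h2oGPT's actual code; the documents/ids keywords follow Chroma's collection.add API):

```python
def add_in_batches(collection, texts, ids, max_batch_size=4096, verbose=True):
    # If Chroma now forwards arbitrarily large batches to the embedding
    # model, splitting here keeps each call bounded and makes the batch
    # sizes visible in the logs.
    for i in range(0, len(texts), max_batch_size):
        batch = texts[i:i + max_batch_size]
        if verbose:
            print(f"chroma add: batch {i // max_batch_size} size={len(batch)}")
        collection.add(documents=batch, ids=ids[i:i + max_batch_size])
```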
@pseudotensor Max batch size is 83333, coming from the max_batch_size attr.
Try the latest changes. 83333 is very large; I made the max 4096. Or you can control it via the env var CHROMA_MAX_BATCH_SIZE.
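The clamp described here can be sketched as follows (function name hypothetical; CHROMA_MAX_BATCH_SIZE is the env var mentioned above):

```python
import os

def effective_max_batch_size(reported_max: int, default_cap: int = 4096) -> int:
    # Chroma may report a huge max_batch_size (e.g. 83333); clamp it to a
    # sane ceiling that the user can override via CHROMA_MAX_BATCH_SIZE.
    cap = int(os.environ.get("CHROMA_MAX_BATCH_SIZE", default_cap))
    return min(int(reported_max), cap)
```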
@pseudotensor Much better, thanks. Also, maybe it's a good idea to add a device option to make_db, as on Mac it can use Metal.
4096 is on the high end; yes, it can be made smaller as required. On CPU I expect it should work pretty well, but the issue is that bge-m3 has an 8k context, so it uses a lot more memory despite its size if chunks are large. I think the issue is that for summarization purposes we double the chunks, and there's no limit to their size, so that might be hitting the bge-m3 model hard since it'll take the full 8k. One will have to tell the model to truncate at (say) smaller token counts or (yes) limit batch size.
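The two mitigations mentioned (truncate inputs, bound the batch) could look like this for a sentence-transformers-style model, where max_seq_length is the truncation knob (helper name and defaults are hypothetical; bge-m3's full context is 8k tokens):

```python
def encode_bounded(model, chunks, batch_size=32, max_tokens=1024):
    # Truncate inputs well below the model's 8k context and embed in
    # fixed-size batches so memory stays bounded regardless of chunk count.
    model.max_seq_length = max_tokens
    out = []
    for i in range(0, len(chunks), batch_size):
        out.extend(model.encode(chunks[i:i + batch_size]))
    return out
```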
Hi,
Can you please add support for nomic-ai/nomic-embed-text-v1?
Trying to run it as is, I get the following errors:
H2OOCRLoader: unknown architecture 'crnn_efficientnetv2_mV2'
And
ValueError: Loading nomic-ai/nomic-embed-text-v1 requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option
trust_remote_code=True
to remove this error.
Thanks