Support `embedders` setting #554

CommanderStorm · 2024-03-02T15:42:36Z

Pull Request

Related issue

Fixes #541

What does this PR do?

~~introduces the experimental-vector-search-feature flag~~
Adds the required settings
- with_embedders does use the same "API" (not using impl AsRef for items passed) as with_synonyms, as this is the closest existing
- given set_embedders has not been implemented upstream (at least when I try to PATCH the object, it does not work)
- only {get,reset}_embedders settings have been implemented.
  Said implementation goes with the work done in Implement vector search experimental feature v2 (v1.6) meilisearch-python#924
adds the hybrid field to search via the vector search to add an end-to-end test of this feature with the huggingface configuration.

userProvided seens more brittle, but we may want change to this instead
using userProvided instead would mean (at the cost of hardcoding stuff)
=> lower cpu effort
=> no higher timeout necceeseary
=> aligning with meilisearch/meilisearch to only have a test for userProvided)

Caution

This is the difficult part:
I have not gotten this feature to work as I expect it.
I have documented what I would expect to happen here:

meilisearch-rust/src/search.rs

Lines 1223 to 1304 in 86334cd

    
           #[cfg(feature = "experimental-vector-search")] 
        
           #[meilisearch_test] 
        
           async fn test_hybrid(client: Client, index: Index) -> Result<(), Error> { 
        
               log::warn!("You are executing the vector search test. This WILL take a while and might lead to timeouts in other tests. You can disable this testcase by not enabling the `experimental-vector-search`-feature and running this "); 
        
               // enable vector searching and configure an embedder 
        
               let features = crate::ExperimentalFeatures::new(&client) 
        
                   .set_vector_store(true) 
        
                   .update() 
        
                   .await 
        
                   .expect("could not enable the vector store"); 
        
               assert_eq!(features.vector_store, true); 
        
               let embedder_setting = crate::Embedder::HuggingFace(crate::HuggingFaceEmbedderSettings { 
        
                   model: "BAAI/bge-base-en-v1.5".into(), 
        
                   revision: None, 
        
                   document_template: Some("{{ doc.value }}".into()), 
        
               }); 
        
               let t3 = index 
        
                   .set_settings(&crate::Settings { 
        
                       embedders: Some(HashMap::from([("default".to_string(), embedder_setting)])), 
        
                       ..crate::Settings::default() 
        
                   }) 
        
                   .await?; 
        
               t3.wait_for_completion(&client, None, None).await?; 
        
               setup_test_index(&client, &index).await?; 
        
               // "zweite" = "second" in german 
        
               // => an embedding should be able to detect that this is equivalent, but not the regular search 
        
               let results: SearchResults<Document> = index 
        
                   .search() 
        
                   .with_query("Facebook") 
        
                   .with_hybrid("default", 1.0) // entirely rely on semantic searching 
        
                   .execute() 
        
                   .await?; 
        
               assert_eq!(results.hits.len(), 1); 
        
               assert_eq!( 
        
                   &Document { 
        
                       id: 1, 
        
                       value: S("dolor sit amet, consectetur adipiscing elit"), 
        
                       kind: S("text"), 
        
                       number: 10, 
        
                       nested: Nested { child: S("second") }, 
        
                   }, 
        
                   &results.hits[0].result 
        
               ); 
        
               let results: SearchResults<Document> = index 
        
                   .search() 
        
                   .with_query("zweite") 
        
                   .with_hybrid("default", 0.0) // no semantic searching => no matches 
        
                   .execute() 
        
                   .await?; 
        
               assert_eq!(results.hits.len(), 0); 
        
               // word that has a typo => would have been found via traditional means 
        
               // if entirely relying on semantic searching, no result is found 
        
               let results: SearchResults<Document> = index 
        
                   .search() 
        
                   .with_query("lohrem") 
        
                   .with_hybrid("default", 1.0) 
        
                   .execute() 
        
                   .await?; 
        
               assert_eq!(results.hits.len(), 0); 
        
               let results: SearchResults<Document> = index 
        
                   .search() 
        
                   .with_query("lohrem") 
        
                   .with_hybrid("default", 0.0) 
        
                   .execute() 
        
                   .await?; 
        
               assert_eq!(results.hits.len(), 1); 
        
               assert_eq!( 
        
                   &Document { 
        
                       id: 0, 
        
                       value: S("Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."), 
        
                       kind: S("text"), 
        
                       number: 0, 
        
                       nested: Nested { child: S("first") } 
        
                   }, 
        
                   &results.hits[0].result.id 
        
               ); 
        
               Ok(()) 
        
           }

Where in the Meilisearch codebase could I find an e2e-test how to use feature?
I searched a bunch, but could not find any relevant testcases (only userProvided)
Am I misunderstanding what the capabilities should be, or do I have a bad HuggingFace-embedder?

Tip

The specific testcase can be executed via

cargo test --package meilisearch-sdk --lib search::tests::test_hybrid  --features=experimental-vector-search -- --exact

TODO:

find a combination of semantic search model + configuration that does not fail the assumptions (see search testcase) spectacularly

PR checklist

Please check if your PR fulfills the following requirements:

Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
Have you read the contributing guidelines?
Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

curquiza · 2024-04-15T17:09:08Z

Hello @CommanderStorm

Can you rebase your branch? We made changes recently to improve the library

Sorry for the inconvenience, I try to review your PR as soon as possible after the rebase

CommanderStorm · 2024-04-16T03:48:59Z

Can you rebase your branch?

Done ^^

irevoire

Hey @CommanderStorm,

Code-wise, the PR is very nice and well-documented; I love it!

But I’m hitting way too many timeouts to really help you; sorry 😖

I don’t think we’ll be able to merge this PR with tests that take about 10 minutes. Could you mock Meilisearch as we did here:

meilisearch-rust/src/tasks.rs

Lines 902 to 924 in cb32534

    
           async fn test_get_tasks_with_params() -> Result<(), Error> { 
        
               let mut s = mockito::Server::new_async().await; 
        
               let mock_server_url = s.url(); 
        
               let client = Client::new(mock_server_url, Some("masterKey")).unwrap(); 
        
               let path = 
        
                   "/tasks?indexUids=movies,test&statuses=equeued&types=documentDeletion&uids=1&limit=0&from=1"; 
        
               let mock_res = s.mock("GET", path).with_status(200).create_async().await; 
        
               let mut query = TasksSearchQuery::new(&client); 
        
               query 
        
                   .with_index_uids(["movies", "test"]) 
        
                   .with_statuses(["equeued"]) 
        
                   .with_types(["documentDeletion"]) 
        
                   .with_from(1) 
        
                   .with_limit(0) 
        
                   .with_uids([&1]); 
        
               let _ = client.get_tasks_with(&query).await; 
        
               mock_res.assert_async().await; 
        
               Ok(())

That way, we simply ensure that meilisearch-rust sends a valid payload and hope meilisearch works as expected.

userProvided seens more brittle, but we may want change to this instead

Or we could do that and actually send the payload to meilisearch.
Or do both; let me know what you prefer, but I would very much like the tests to not take that much time to run 😖

Where in the Meilisearch codebase could I find an e2e-test how to use feature?

I don’t think there is. I believe you’re right, and we only wrote tests for user-provided vectors

introduces the experimental-vector-search-feature flag

I don’t know if this is required, @curquiza. Could we simply expose the feature as-is instead of hiding it behind a feature flag (that will make it harder to test and use).
Since it doesn’t add any dependency, I don’t see much point in putting it behind a feature flag

src/search.rs

curquiza · 2024-04-17T07:29:53Z

I don’t know if this is required, @curquiza. Could we simply expose the feature as-is instead of hiding it behind a feature flag (that will make it harder to test and use).

Yes, that's ok not to do feature flag 😊

CommanderStorm · 2024-04-17T16:17:48Z

I did a few tests, and on my side it never took 120s, the quickest execution took 150s but most of the time I was over 200s

Time after being migrated to userProvided seems fine in CI (this is likely the slowest machine running the tests). ✅

Or we could do that and actually send the payload to meilisearch.

I think keeping the testcases from requiring a active internet connection is better as otherwise the test might be unnessesarily flaky in CI.

Could you mock Meilisearch as we did here

I can add a testcase where I mock the routes.
I am unsure if you actually would want to mock this for an experimental feature (= where the api might change => requiring changes)

I have tweaked a bunch with the vectors available and can't get the test_hybrid testcase to work without fully using the same dataset as in tests/search/hybrid.rs.

It seems to always return everything when I set semantic_ratio=1.0...
Is this operating as intended?

I have tried similar 2D vectors as the upstream test and went as far as to use 1D vectors with considerable (1/10/1k/1m) spread (=>something that as far as I understand embeddings should not match)

=> I am missing something. 😅
Could you point me into the right direction?

Note

I can obviously steal the testcase from tests/search/hybrid.rs, but for longer-term maintainability I would like this testcase to not be such an oddball compared to the existing testcases in this repo ^^

curquiza · 2024-07-01T12:39:37Z

@CommanderStorm really sorry for this.
Can you fix the git conflicts? 😊

CommanderStorm · 2024-07-01T16:44:03Z

Thanks for the pinging (github does not give me notifications for this) ^^
No need to worry. If this takes a few months that is fine.

CommanderStorm · 2024-07-01T16:47:30Z

src/search.rs

@@ -753,7 +802,8 @@ mod tests {
                value: S("dolor sit amet, consectetur adipiscing elit"),
                kind: S("text"),
                number: 10,
-                nested: Nested { child: S("second") }
+                nested: Nested { child: S("second") },
+                _vectors: HashMap::from([(S("default"), vec![2000.0])]),


Note to self/TODO:

retrieveVectors needs to be configurable and set to be able to reuse the type for input + output
Docs: https://github.com/meilisearch/meilisearch/releases/tag/v1.9.0

Do you want to do it here, or in another PR?

Will do that here (would not make sense in another PR imo).
I am going to do that towards the end of the week

curquiza

@CommanderStorm thank you again! Can you add the rest and ollama models we added in v1.8.0?
Sorry for the late notice again!
And thank you for your involvement 🙏

irevoire · 2024-07-03T14:58:37Z

Hey @CommanderStorm, since we introduced a new rest embedder, I think we could write better tests by mocking the rest embedder and ensuring it works with meilisearch; here's some example I wrote earlier this week in meilisearch: meilisearch/meilisearch@2141cb3

src/search.rs

CommanderStorm · 2024-07-08T01:44:13Z

src/search.rs

+                nested: Nested { child: S("second") },
+                // TODO: This is likely a bug upstream. vectors should not be present
+                // => correct would be `_vectors: None`
+                _vectors: Some(Vectors::from(&[2000.0])),


TODO(Frank): Report this upstream

Suggested change

nested: Nested { child: S("second") },

// TODO: This is likely a bug upstream. vectors should not be present

// => correct would be `_vectors: None`

_vectors: Some(Vectors::from(&[2000.0])),

nested: Nested { child: S("second") },

_vectors: Some(Vectors::from(&[2000.0])),

Hey, this is not a bug; you can remove the comment

CommanderStorm · 2024-07-08T13:21:37Z

The changes requested are implemented.

@curquiza could you please approve a workflow run
(I think I did check everything, but you never know)

In retrospective, with_retrieve_vectors should/could have been a different PR. If you want, I can move those changes to a different PR. 600 lines is pretty unreviewable. On the other side, most of the lines changed are docs..

irevoire

Hey @CommanderStorm,

The PR seems really good; I think we're almost ready to merge.

I think you missed one issue in your test; your definition of what a vector is was wrong, and thus, serde was ignoring most vectors when deserializing the documents.
I left a comment on the subject.

To write the test you wanted to write initially, where the vectors appear as None, you should enable the experimental feature first.

This brings me to another point; I think every test using the feature should start by enabling it just in case they run first.

AND we should absolutely remove these two tests:

meilisearch-rust/src/features.rs

Lines 108 to 132 in c673c8d

    
           mod tests { 
        
               use super::*; 
        
               use meilisearch_test_macro::meilisearch_test; 
        
               #[meilisearch_test] 
        
               async fn test_experimental_features_get(client: Client) { 
        
                   let mut features = ExperimentalFeatures::new(&client); 
        
                   features.set_vector_store(false); 
        
                   let _ = features.update().await.unwrap(); 
        
                   let res = features.get().await.unwrap(); 
        
                   assert!(!res.vector_store); 
        
               } 
        
               #[meilisearch_test] 
        
               async fn test_experimental_features_enable_vector_store(client: Client) { 
        
                   let mut features = ExperimentalFeatures::new(&client); 
        
                   features.set_vector_store(true); 
        
                   let res = features.update().await.unwrap(); 
        
                   assert!(res.vector_store); 
        
               } 
        
           }

That will mess with everything

irevoire · 2024-07-29T12:10:28Z

src/search.rs

+
+    #[derive(Debug, Serialize, Deserialize, PartialEq)]
+    struct Vector {
+        embeddings: Vec<Vec<f32>>,


Here meilisearch can return a Vec<f32> directly.

You could do something like that;

#[derive(Debug, Serialize, Deserialize, PartialEq)] struct Vector { embeddings: SingleOrMultipleVectors, regenerate: bool, } #[derive(Serialize, Deserialize)] #[serde(untagged)] enum SingleOrMultileVectors { Single(Vec<f32>), Multiple(Vec<Vec<f32>>), }

CommanderStorm changed the title ~~Vector search embedder~~ [v1.6] support embedders setting Mar 2, 2024

CommanderStorm marked this pull request as ready for review March 2, 2024 19:52

CommanderStorm force-pushed the vector-search-embedder branch from 1076241 to 8ffb555 Compare March 11, 2024 11:42

CommanderStorm force-pushed the vector-search-embedder branch 2 times, most recently from 51820a9 to 8ade09d Compare April 16, 2024 03:48

irevoire requested changes Apr 16, 2024

View reviewed changes

src/search.rs Outdated Show resolved Hide resolved

src/search.rs Outdated Show resolved Hide resolved

src/search.rs Outdated Show resolved Hide resolved

CommanderStorm force-pushed the vector-search-embedder branch from 0285e62 to 9033232 Compare April 17, 2024 15:37

CommanderStorm added 4 commits April 17, 2024 18:18

Added the ability to configure vector-search embeddings in the settings

333c79a

Added the ability to configure the hybrid keyword in the search

83f543d

Migrated the testcase to use _vectors instead

a4e50c9

Removed the experimental-vector-search feature

0e044b1

CommanderStorm force-pushed the vector-search-embedder branch from 9033232 to 0e044b1 Compare April 17, 2024 16:18

curquiza requested a review from irevoire May 14, 2024 09:25

curquiza and others added 2 commits May 14, 2024 14:45

Merge branch 'main' into vector-search-embedder

3a4d876

Merge branch 'main' into vector-search-embedder

a167c50

Merge branch 'main' into vector-search-embedder

37b5410

CommanderStorm commented Jul 1, 2024

View reviewed changes

Merge branch 'main' into vector-search-embedder

4c7475e

curquiza requested changes Jul 2, 2024

View reviewed changes

NoodleSamaChan mentioned this pull request Jul 2, 2024

getting similar documents #600

Draft

3 tasks

CommanderStorm added 3 commits July 8, 2024 03:08

feat: added support for ollama and rest

5e72f98

chore: improved the documentation

75e5585

feat: implemnted the retrieve_vectors flag

464de44

Merge branch 'main' into vector-search-embedder

4cf2da8

CommanderStorm commented Jul 8, 2024

View reviewed changes

src/search.rs Outdated Show resolved Hide resolved

CommanderStorm commented Jul 8, 2024

View reviewed changes

CommanderStorm added 2 commits July 8, 2024 15:04

chore: made sure that test_hybrid uses a mocked server

980e714

chore: formatting fixes

cbac495

CommanderStorm force-pushed the vector-search-embedder branch from 0d1f07f to cbac495 Compare July 8, 2024 13:13

chore: fixed a typo in a doc-comment

77399a2

CommanderStorm mentioned this pull request Jul 13, 2024

added semantic searching TUM-Dev/NavigaTUM#998

Draft

2 tasks

curquiza changed the title ~~[v1.6] support embedders setting~~ Support embedders setting Jul 25, 2024

curquiza added the enhancement New feature or request label Jul 25, 2024

irevoire requested changes Jul 29, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support `embedders` setting #554

Support `embedders` setting #554

CommanderStorm commented Mar 2, 2024 •

edited

Loading

curquiza commented Apr 15, 2024 •

edited

Loading

CommanderStorm commented Apr 16, 2024 •

edited

Loading

irevoire left a comment

curquiza commented Apr 17, 2024

CommanderStorm commented Apr 17, 2024 •

edited

Loading

curquiza commented Jul 1, 2024

CommanderStorm commented Jul 1, 2024

CommanderStorm Jul 1, 2024 •

edited

Loading

curquiza Jul 2, 2024

CommanderStorm Jul 2, 2024

curquiza left a comment

irevoire commented Jul 3, 2024

CommanderStorm Jul 8, 2024

irevoire Jul 29, 2024

CommanderStorm commented Jul 8, 2024

irevoire left a comment

irevoire Jul 29, 2024

	#[cfg(feature = "experimental-vector-search")]
	#[meilisearch_test]
	async fn test_hybrid(client: Client, index: Index) -> Result<(), Error> {
	log::warn!("You are executing the vector search test. This WILL take a while and might lead to timeouts in other tests. You can disable this testcase by not enabling the `experimental-vector-search`-feature and running this ");
	// enable vector searching and configure an embedder
	let features = crate::ExperimentalFeatures::new(&client)
	.set_vector_store(true)
	.update()
	.await
	.expect("could not enable the vector store");
	assert_eq!(features.vector_store, true);
	let embedder_setting = crate::Embedder::HuggingFace(crate::HuggingFaceEmbedderSettings {
	model: "BAAI/bge-base-en-v1.5".into(),
	revision: None,
	document_template: Some("{{ doc.value }}".into()),
	});
	let t3 = index
	.set_settings(&crate::Settings {
	embedders: Some(HashMap::from([("default".to_string(), embedder_setting)])),
	..crate::Settings::default()
	})
	.await?;
	t3.wait_for_completion(&client, None, None).await?;

	setup_test_index(&client, &index).await?;

	// "zweite" = "second" in german
	// => an embedding should be able to detect that this is equivalent, but not the regular search
	let results: SearchResults<Document> = index
	.search()
	.with_query("Facebook")
	.with_hybrid("default", 1.0) // entirely rely on semantic searching
	.execute()
	.await?;
	assert_eq!(results.hits.len(), 1);
	assert_eq!(
	&Document {
	id: 1,
	value: S("dolor sit amet, consectetur adipiscing elit"),
	kind: S("text"),
	number: 10,
	nested: Nested { child: S("second") },
	},
	&results.hits[0].result
	);
	let results: SearchResults<Document> = index
	.search()
	.with_query("zweite")
	.with_hybrid("default", 0.0) // no semantic searching => no matches
	.execute()
	.await?;
	assert_eq!(results.hits.len(), 0);

	// word that has a typo => would have been found via traditional means
	// if entirely relying on semantic searching, no result is found
	let results: SearchResults<Document> = index
	.search()
	.with_query("lohrem")
	.with_hybrid("default", 1.0)
	.execute()
	.await?;
	assert_eq!(results.hits.len(), 0);
	let results: SearchResults<Document> = index
	.search()
	.with_query("lohrem")
	.with_hybrid("default", 0.0)
	.execute()
	.await?;
	assert_eq!(results.hits.len(), 1);
	assert_eq!(
	&Document {
	id: 0,
	value: S("Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."),
	kind: S("text"),
	number: 0,
	nested: Nested { child: S("first") }
	},
	&results.hits[0].result.id
	);

	Ok(())
	}

	async fn test_get_tasks_with_params() -> Result<(), Error> {
	let mut s = mockito::Server::new_async().await;
	let mock_server_url = s.url();
	let client = Client::new(mock_server_url, Some("masterKey")).unwrap();
	let path =
	"/tasks?indexUids=movies,test&statuses=equeued&types=documentDeletion&uids=1&limit=0&from=1";

	let mock_res = s.mock("GET", path).with_status(200).create_async().await;

	let mut query = TasksSearchQuery::new(&client);
	query
	.with_index_uids(["movies", "test"])
	.with_statuses(["equeued"])
	.with_types(["documentDeletion"])
	.with_from(1)
	.with_limit(0)
	.with_uids([&1]);

	let _ = client.get_tasks_with(&query).await;

	mock_res.assert_async().await;

	Ok(())

	mod tests {
	use super::*;
	use meilisearch_test_macro::meilisearch_test;

	#[meilisearch_test]
	async fn test_experimental_features_get(client: Client) {
	let mut features = ExperimentalFeatures::new(&client);
	features.set_vector_store(false);
	let _ = features.update().await.unwrap();

	let res = features.get().await.unwrap();

	assert!(!res.vector_store);
	}

	#[meilisearch_test]
	async fn test_experimental_features_enable_vector_store(client: Client) {
	let mut features = ExperimentalFeatures::new(&client);
	features.set_vector_store(true);

	let res = features.update().await.unwrap();

	assert!(res.vector_store);
	}
	}

Support embedders setting #554

Are you sure you want to change the base?

Support embedders setting #554

Conversation

CommanderStorm commented Mar 2, 2024 • edited Loading

Pull Request

Related issue

What does this PR do?

PR checklist

curquiza commented Apr 15, 2024 • edited Loading

CommanderStorm commented Apr 16, 2024 • edited Loading

irevoire left a comment

Choose a reason for hiding this comment

curquiza commented Apr 17, 2024

CommanderStorm commented Apr 17, 2024 • edited Loading

curquiza commented Jul 1, 2024

CommanderStorm commented Jul 1, 2024

CommanderStorm Jul 1, 2024 • edited Loading

Choose a reason for hiding this comment

curquiza Jul 2, 2024

Choose a reason for hiding this comment

CommanderStorm Jul 2, 2024

Choose a reason for hiding this comment

curquiza left a comment

Choose a reason for hiding this comment

irevoire commented Jul 3, 2024

CommanderStorm Jul 8, 2024

Choose a reason for hiding this comment

irevoire Jul 29, 2024

Choose a reason for hiding this comment

CommanderStorm commented Jul 8, 2024

irevoire left a comment

Choose a reason for hiding this comment

irevoire Jul 29, 2024

Choose a reason for hiding this comment

Support `embedders` setting #554

Support `embedders` setting #554

CommanderStorm commented Mar 2, 2024 •

edited

Loading

curquiza commented Apr 15, 2024 •

edited

Loading

CommanderStorm commented Apr 16, 2024 •

edited

Loading

CommanderStorm commented Apr 17, 2024 •

edited

Loading

CommanderStorm Jul 1, 2024 •

edited

Loading