[Do not merge] First cut at dynamic sharding support #392

Closed
wants to merge 6 commits into
from

2 participants

@emmanuelbernard
Hibernate member

I need to write down tests to finish that up but @Sanne could you give me your first impression on the proposal. I did not have to break any existing public interface. I even reuse the IndexShardingStrategy.

If you have some advice on making the structure more scalable concurrency wise, I'm interested too.

@Sanne Sanne was assigned Apr 6, 2013
@Sanne
Hibernate member

+1 very clever approach :)

Generally speaking, don't you think the ShardIdentifierProvider should superseed the IndexShardingStrategy ?
It's great we have a backwards compatible migration path with the dual-design but I'm wondering if we shouldn't deprecate IndexShardingStrategy or if it makes sense to keep both around.

I add some more details in the code as comments.

@Sanne Sanne closed this Apr 8, 2013
@Sanne Sanne commented on the diff Apr 8, 2013
...ate/search/engine/impl/EntityIndexBindingFactory.java
+ private static final Log log = LoggerFactory.make();
+
+ @SuppressWarnings( "unchecked" )
+ public static <T,U> MutableEntityIndexBinding<T> buildEntityIndexBinder(Class<T> type, IndexManager[] providers,
+ IndexShardingStrategy shardingStrategy,
+ ShardIdentifierProvider shardIdentifierProvider,
+ Similarity similarityInstance,
+ EntityIndexingInterceptor<U> interceptor,
+ boolean isDynamicSharding,
+ Properties properties,
+ String rootDirectoryProviderName,
+ WorkerBuildContext context,
+ IndexManagerHolder indexManagerHolder,
+ IndexManagerFactory indexManagerFactory) {
+ if ( !isDynamicSharding && providers.length == 0 ) {
+ throw log.entityWithNoShard( type );
@Sanne
Hibernate member
Sanne added a line comment Apr 8, 2013

Why is a zero-shards entity invalid?
I'm wondering about the multitenancy use case, if the application should not be able to deploy with an initial state of zero tenants.

Queries would return zero elements; while on an add operation the ShardIdentifierProvider has to return an id anyway.. which would trigger creation of an indexmanager.

@emmanuelbernard
Hibernate member
emmanuelbernard added a line comment Apr 9, 2013

0 shard is not legal for static shards as ti will always stay 0 :). In case of dynamic sharding (value set to dynamic), we don't throw the exception.

@Sanne
Hibernate member
Sanne added a line comment Apr 9, 2013

right :)
misinterpreted the condition.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@Sanne Sanne commented on the diff Apr 8, 2013
...ate/search/engine/impl/EntityIndexBindingFactory.java
+ public static <T,U> MutableEntityIndexBinding<T> buildEntityIndexBinder(Class<T> type, IndexManager[] providers,
+ IndexShardingStrategy shardingStrategy,
+ ShardIdentifierProvider shardIdentifierProvider,
+ Similarity similarityInstance,
+ EntityIndexingInterceptor<U> interceptor,
+ boolean isDynamicSharding,
+ Properties properties,
+ String rootDirectoryProviderName,
+ WorkerBuildContext context,
+ IndexManagerHolder indexManagerHolder,
+ IndexManagerFactory indexManagerFactory) {
+ if ( !isDynamicSharding && providers.length == 0 ) {
+ throw log.entityWithNoShard( type );
+ }
+ EntityIndexingInterceptor<? super T> safeInterceptor = (EntityIndexingInterceptor<? super T>) interceptor;
+ if (isDynamicSharding) {
@Sanne
Hibernate member
Sanne added a line comment Apr 8, 2013

Wondering if we couldn't treat all types as "dynamic shardable": code would converge and the one-and-only way to configure sharding would be to implement a ShardIdentifierProvider . (possibly keeping the old way around as deprecated way)

@emmanuelbernard
Hibernate member
emmanuelbernard added a line comment Apr 9, 2013

right, see my general comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@Sanne Sanne commented on the diff Apr 8, 2013
...hibernate/search/indexes/impl/IndexManagerHolder.java
+ if (similarityInstance != null) {
+ setSimilarity( similarityInstance, indexManager );
+ }
+ }
+ indexManager.addContainedEntity( mappedClass );
+ return indexManager;
+ }
+
+ public IndexManager getOrCreateLateIndexManager(String providerName, DynamicShardingEntityIndexBinding entityIndexBinder) {
+ SearchFactoryImplementor searchFactory = entityIndexBinder.getSearchFactory();
+ WorkerBuildContext context;
+ if ( WorkerBuildContext.class.isAssignableFrom( searchFactory.getClass() ) ) {
+ context = ( WorkerBuildContext ) searchFactory;
+ }
+ else {
+ throw new AssertionFailure( "SearchFactory from entityIndexBinder is not assignable to WorkerBuilderContext: " + searchFactory.getClass() );
@Sanne
Hibernate member
Sanne added a line comment Apr 8, 2013

You have been advocating to move away gradually from AssertionFailure as we should drop the dependency to common-annotations at some point

@emmanuelbernard
Hibernate member
emmanuelbernard added a line comment Apr 9, 2013

I guess I forgot. So what's the equivalent of something that should not ever happen unless there is a bug in Hibernate Search.

@Sanne
Hibernate member
Sanne added a line comment Apr 9, 2013

Making sure it can't possibly happen ;-)

More seriously, I'm not sure. We could create a new AssertionFailure which is Search- specific, it could even extend SearchException to allow people catching it as "if any error comes from the Search engine.."

@emmanuelbernard
Hibernate member
emmanuelbernard added a line comment Apr 9, 2013

Using SearchException for now until you figure out the rule.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@Sanne Sanne commented on the diff Apr 8, 2013
...hibernate/search/indexes/impl/IndexManagerHolder.java
+ if ( indexManager == null ) {
+ indexManager = createIndexManager( providerName, indexProp, context, indexManagerFactory );
+ indexManagersRegistry.put( providerName, indexManager );
+ if (similarityInstance != null) {
+ setSimilarity( similarityInstance, indexManager );
+ }
+ }
+ indexManager.addContainedEntity( mappedClass );
+ return indexManager;
+ }
+
+ public IndexManager getOrCreateLateIndexManager(String providerName, DynamicShardingEntityIndexBinding entityIndexBinder) {
+ SearchFactoryImplementor searchFactory = entityIndexBinder.getSearchFactory();
+ WorkerBuildContext context;
+ if ( WorkerBuildContext.class.isAssignableFrom( searchFactory.getClass() ) ) {
+ context = ( WorkerBuildContext ) searchFactory;
@Sanne
Hibernate member
Sanne added a line comment Apr 8, 2013

This is the area which concerns me more. We attempted to isolate the WorkerBuilderContext in the past to make a clear distinction between what needs to be available during initialization (boot), and what we need to drop to keep memory usage low.

I'm not sure you can recover all needed components/services at any time.. worth trying as it seems you're close!

Also I don't see a better alternative: if we need to bootup components dynamically, we of course need to keep references around; might be worth making sure we keep around only the bare minimum needed.

@emmanuelbernard
Hibernate member
emmanuelbernard added a line comment Apr 9, 2013

I tried to not keep things around and rather reuse the search factory when I needed to reconstruct theses contexts.
The big question is how that will play out if the underlying elements are using esoteric services. My understanding is that the concerning one is the DirectoryProvider.initialize / .start and of course a rogue IndexManager implementation.

@Sanne
Hibernate member
Sanne added a line comment Apr 9, 2013

I don't remember the details, but yes that's approximately the complexity I'd expect we need to keep an eye on. Tests will tell.

A complexity I'm facing in the Infinispan IndexManager patch is a new kind of circularity in the order I need to initialize sub components.. just saying we might need to polish this mechanism.

@emmanuelbernard
Hibernate member
emmanuelbernard added a line comment Apr 9, 2013

Unfortunately, unless we multiply the number of interfaces, it's very hard to pass the minimal set of info to a component :/ Unless we detype the system entirely but then all bets are off and that's a different kind of mess.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@Sanne Sanne commented on the diff Apr 8, 2013
...hibernate/search/indexes/impl/IndexManagerHolder.java
similarityInstance,
- interceptor
+ interceptor,
+ isDynamicSharding,
+ indexProps[0],
+ directoryProviderName,
+ context,
+ this,
+ cfg.getIndexManagerFactory()
+ );
+ }
+
+ //TODO what would be a less aggressive way to init the index manager in a thread safe way?
+ private synchronized IndexManager doGetOrCreateIndexManager(String providerName, Class<?> mappedClass, Similarity similarityInstance, Properties indexProp, IndexManagerFactory indexManagerFactory, WorkerBuildContext context) {
@Sanne
Hibernate member
Sanne added a line comment Apr 8, 2013

I think synchronized is appropriate here, as you want to prevent two threads from doing the same expensive work: it's better to block one rather than allowing "parallelism" of useless operations.

What you could do to improve is to check the

IndexManager indexManager = indexManagersRegistry.get( providerName ); 

method also in the caller first: being optimistic is reasonable as it's going to be far more common to hit an existing IM than having to create one, and in such case different threads won't hit the synchronization at all.

(indexManagersRegistry is a concurrentMap so no synch is needed for reads)

@emmanuelbernard
Hibernate member
emmanuelbernard added a line comment Apr 9, 2013

I applied that approach. I've added an intermediary method that is not synchronized. What bothers me a bit is that we could lock per indexManagerName instead of all index manager creation but that might not be a problem in practice.

@emmanuelbernard
Hibernate member
emmanuelbernard added a line comment Apr 9, 2013

Actually the intermediary method was not satisfactory so the client side does the check now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@Sanne Sanne commented on the diff Apr 8, 2013
...hibernate/search/indexes/impl/IndexManagerHolder.java
+ //TODO what would be a less aggressive way to init the index manager in a thread safe way?
+ private synchronized IndexManager doGetOrCreateIndexManager(String providerName, Class<?> mappedClass, Similarity similarityInstance, Properties indexProp, IndexManagerFactory indexManagerFactory, WorkerBuildContext context) {
+ IndexManager indexManager = indexManagersRegistry.get( providerName );
+ if ( indexManager == null ) {
+ indexManager = createIndexManager( providerName, indexProp, context, indexManagerFactory );
+ indexManagersRegistry.put( providerName, indexManager );
+ if (similarityInstance != null) {
+ setSimilarity( similarityInstance, indexManager );
+ }
+ }
+ indexManager.addContainedEntity( mappedClass );
+ return indexManager;
+ }
+
+ public IndexManager getOrCreateLateIndexManager(String providerName, DynamicShardingEntityIndexBinding entityIndexBinder) {
+ SearchFactoryImplementor searchFactory = entityIndexBinder.getSearchFactory();
@Sanne
Hibernate member
Sanne added a line comment Apr 8, 2013

We could optimistically try fetching the IM before running all the sanity checks: if it's there there won't be any need to create it.
(See my comment on the synchronized method)

@emmanuelbernard
Hibernate member
emmanuelbernard added a line comment Apr 9, 2013

done now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@Sanne Sanne commented on the diff Apr 8, 2013
...hibernate/search/indexes/impl/IndexManagerHolder.java
);
+ return indexManager;
+ }
+
+ public static boolean isDynamicSharding(String shardsCountValue) {
+ return DYNAMIC_SHARDING.equals(shardsCountValue);
@Sanne
Hibernate member
Sanne added a line comment Apr 8, 2013

If we decide that all sharding is potentially Dynamic, these complexity details fade away too.. it's a win-win ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@Sanne Sanne commented on the diff Apr 8, 2013
...g/hibernate/search/store/ShardIdentifierProvider.java
+ * Returns the set of shard identifiers upon deletion.
+ */
+ public String[] getShardIdentifiers(Class<?> entity, Serializable id, String idInString);
+
+ /**
+ * Returns the set of shard identifiers for a query.
+ */
+ public String[] getShardIdentifiersForQuery(FullTextFilterImplementor[] fullTextFilters);
+
+ /**
+ * Returns the list of all known shard identifiers.
+ * The list can vary between calls.
+ */
+ public String[] getAllShardIdentifiers();
+
+ public String buildIndexManagerNameFromShard(String directoryProviderRoot, String shard);
@Sanne
Hibernate member
Sanne added a line comment Apr 8, 2013

I'm not sure about the use for this method.
Maybe that'll get clearer with some tests?

@emmanuelbernard
Hibernate member
emmanuelbernard added a line comment Apr 9, 2013

The problem comes from the fact that we use the indexBase / indexName as the source to create the directory name (in case that's a directory). So in this implementation and if you use a directory, you would likely do

directoryProviderRoot + "/" + shard

if you want shards in a subdirectory or

directoryProviderRoot + "." + shard

not not have this sub directory structure but rather the .n or .tenantid approach

@Sanne
Hibernate member
Sanne added a line comment Apr 9, 2013

Is that not going to mixum the two differen concepts

  • IndexManager/Directory name (used for labelling, logging, configuration matching)
  • file-system path

In the second case you will want the rendering to be cross-platform friendly, and it has an impact on how we create the index directories.

@emmanuelbernard
Hibernate member
emmanuelbernard added a line comment Apr 9, 2013

Any idea how to improve that? We could think of a directory_name_template = /var/data/index/myapp/{shard}?

@Sanne
Hibernate member
Sanne added a line comment Apr 9, 2013

This guy had a similar idea ;-)
https://hibernate.atlassian.net/browse/HSEARCH-243

Feel free to consider this a separate issue? I'd just store them all as {indexbase}/{IMname} as we do today, then we can tackle the aspect of different paths per index and templates.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@Sanne Sanne commented on the diff Apr 8, 2013
...g/hibernate/search/store/ShardIdentifierProvider.java
+/**
+ * Provides shard identifiers when dynamic sharding is used.
+ *
+ * @author Emmanuel Bernard <emmanuel@hibernate.org>
+ */
+public interface ShardIdentifierProvider {
+
+ /**
+ * Initialize the provider.
+ */
+ public void initialize(Properties properties);
+
+ /**
+ * Returns the shard identifier upon addition.
+ */
+ public String getShardIdentifier(Class<?> entity, Serializable id, String idInString, Document document);
@Sanne
Hibernate member
Sanne added a line comment Apr 8, 2013

To be future proof, I think we should allow the SPI to return multiple write-to indexes.

CapeDwarf is going to need it, and also I was thinking of cases in which people want to potentially feed some Documents into multiple indexes.
Could be interesting for:

  • have a leaner index which contains only some core data, for example not supporting projections but favoring certain queries
  • write both into a remote indexing service and a local index copy ...
@emmanuelbernard
Hibernate member
emmanuelbernard added a line comment Apr 9, 2013

I am not convinced such feature should be done at the sharding level. Can you elaborate the use case and your approach in the mailing list. We can discuss it at large.
My concern is that we use all shards in a few situations and we would retrieve the same document twice in this approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@Sanne
Hibernate member

I've added many more comments on the code - this is just to make sure you see them as I'm not sure it will notify you as I closed the issue initially.

@Sanne Sanne reopened this Apr 8, 2013
@Sanne Sanne closed this Apr 8, 2013
@emmanuelbernard
Hibernate member

I did not think about deprecating IndexShardingStrategy but looking at it now, I could not find a good reason to keep it except maybe to eagerly initialize the IndexManagers but if the lazy code is good enough that might not be necessary.

@Sanne
Hibernate member

We could use ShardIdentifierProvider#getAllShardIdentifiers for that?

Since we validate IndexManager configuration options only when it's started it would be good to be eager on initializing those for which it's possible.

@Sanne
Hibernate member

Different thought: looks like the user code should be able to interact directly with the ShardIdentifierProvider instance, right?

I mean in the multi-tenant use case you want to be able to let it know you are creating a new tenant.

another nice use case coming to mind is per-language independent indexes.. would it make sense to combine two levels of dynamic sharding?

@emmanuelbernard
Hibernate member

Different thought: looks like the user code should be able to interact directly with the ShardIdentifierProvider instance, right?
I mean in the multi-tenant use case you want to be able to let it know you are creating a new tenant.

Yes, in an ideal world ShardIdentifierProvider could have CDI injection points but worse case, threadlocals could be used. You're saying we should provide access to the ShardIdentifierProvider instance so that a user could call specific methods to it

MyImpl provider = (MyImpl) searchFactory.getMetadata().getEntityMetadata(User.class).getShardIdentifierProvider();

Not entirely satisfactory, would be nice to enlist a callback.

another nice use case coming to mind is per-language independent indexes.. would it make sense to combine two levels of dynamic sharding?

Dow we need to have built-in layers of sharding or leave that to the ShardIdentifierProvider implementor? It seems to be the later.

@emmanuelbernard
Hibernate member

I could not find a good reason to keep it except maybe to eagerly initialize the IndexManagers

It turns out, when building the SF, we call EIB.getIndexManagers() and thus eagerly initialize the indexes. so we are good.

@Sanne
Hibernate member

what you called

searchFactory.getMetadata().getEntityMetadata(User.class)

is ~ available today as

org.hibernate.search.spi.SearchFactoryIntegrator.getIndexBindingForEntity(Class<?>)

but I agree the getMetada() is looking better.. just we don't have that yet.

@Sanne
Hibernate member

From today's forum posts, I'm convinced it would be awesome to have a dynamic sharding option working out of the box for multiple languages, extending the use case we documented for dynamic analyzers:

http://docs.jboss.org/hibernate/search/4.2/reference/en-US/html_single/#d0e3840

As the tricky part for the above link is actually running queries on the appropriate index: needs to use a shard sensitive filter.
http://docs.jboss.org/hibernate/search/4.2/reference/en-US/html_single/#query-filter-shard

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment