New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HSEARCH-2219 Define analyzers via the REST API #1263

Closed
wants to merge 27 commits into
base: master
from

Conversation

Projects
None yet
3 participants
@yrodiere
Member

yrodiere commented Dec 16, 2016

https://hibernate.atlassian.net/browse/HSEARCH-2219

This PR adds automatic registration of analyzer definitions to Elasticsearch upon index creation.

There are a few gotchas, and those are documented. The main ones being that the MERGE strategy doesn't update the analyzer definitions yet (because it would require closing/opening the index, and I'm still pondering the implications) and the VALIDATE strategy doesn't validate analyzer definitions yet (because I didn't have the time to add that). I created follow-up tickets: https://hibernate.atlassian.net/browse/HSEARCH-2519 and https://hibernate.atlassian.net/browse/HSEARCH-2520.

@yrodiere yrodiere requested review from Sanne and gsmet Dec 16, 2016

@Sanne Sanne added Needs rebase and removed Ready for review labels Dec 16, 2016

private final T defaultReference;
private final T passThroughtReference;

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

s/passThroughtReference/passThroughReference/

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

Fixed.

import org.hibernate.search.analyzer.spi.AnalyzerStrategy;
import org.hibernate.search.annotations.AnalyzerDef;
public class AnalyzerReferenceRegistry<T extends AnalyzerReference> {

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

Some Javadoc to explain the purpose of this class would be nice.

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

Fixed.

}
}

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

Unneeded white space.

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

Fixed.

final AnalyzerReference reference = new LuceneAnalyzerReference( analyzer );
initializedAnalyzers.put( entry.getKey(), reference );
// Ensure every analyzer definition will be initialized
for ( String name : analyzerDefs.keySet() ) {

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

As discussed on the phone, could you explain a bit more the sequence of operations. Typically here, it would be nice to explain that analyzers not used in the mapping are registered as Lucene analyzers.

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

Fixed.

if ( !referencesByName.isEmpty() ) {
for ( Map.Entry<String, ? extends AnalyzerReference> entry : referencesByNameForType.entrySet() ) {
String referenceName = entry.getKey();
if ( referencesByName.containsKey( referenceName ) ) {

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

A comment explaining the limitation we discussed about the order would be nice.

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

Fixed.

}
public Builder(Class<?> indexedType, ScopedAnalyzerReference.Builder scopedAnalyzerReferenceBuilder) {
public Builder(Class<?> indexedType, Builder builderToClone) {

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

builderToClone is misleading here. You don't clone the Builder. I think containerBuilder would be better.

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

Fixed.

}
}
else {
if ( !index.isAnalyzed() ) {
// no analyzer is used, add a fake one for queries

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

Not related to your patch, but while at it, "fake" is misleading here, we use the pass through analyzer.

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

Fixed.

@gsmet

As you will see in the comments, I'm not really enthusiast about this change. Probably because I don't see the point of it and I find the code less readable.

@@ -7,12 +7,14 @@
package org.hibernate.search.elasticsearch.analyzer.impl;
import java.util.Collection;
import java.util.Collections;

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

Mmmmh, the commit message does not really state the content of this commit. It's far more extensive AFAICS.

@@ -33,7 +33,7 @@ public void close() {
@Override
public <T extends AnalyzerReference> boolean is(Class<T> analyzerType) {
return LuceneAnalyzerReference.class.isAssignableFrom( analyzerType );
return analyzerType.isAssignableFrom( LuceneAnalyzerReference.class );

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

Shouldn't we use getClass() instead of using a static class reference?

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

I was worried about leaf types in the inheritance tree that would be "pure implementations", not meant to be accessed... If those types implemented different, public interfaces, that would make a difference in the result of this method, and expose implementation details.
But it turns out I didn't need to do something like that, so... Do you want me to use getClass()?

This comment has been minimized.

@gsmet

gsmet Dec 17, 2016

Member

What I'm worried about is that, if you subclass it with a B class, instanceOfB.is(B.class) will return false which feels kinda weird.

If we think they shouldn't be subclassed, we could make them final to be on the safe side. Or would you expect the subclass to always reimplement is?

This comment has been minimized.

@yrodiere

yrodiere Dec 19, 2016

Member

That's what I would expect, yes: if a subclass wants to expose more of its implemented interfaces, it should do it explicitly.

That's the point of this whole is()//unwrap pattern in this case: we don't use adaptors, so simple instanceof and client-side casts would have done the trick. The only reason to implement that pattern is to be a bit cleaner and to be more strict about encapsulation. See org.hibernate.jpa.spi.AbstractEntityManagerImpl.unwrap(Class<T>) for instance, it's even more strict than what I propose (it only accepts interface types).

Anyway... that's what I think, but it's more nitpicking than anything else, really. I don't think it'll make a difference. I'll use getClass, so we'll be safe if someone adds a subclass.

@@ -35,7 +35,7 @@ public void close() {
@Override
public <T extends AnalyzerReference> boolean is(Class<T> analyzerType) {
return RemoteAnalyzerReference.class.isAssignableFrom( analyzerType );
return analyzerType.isAssignableFrom( RemoteAnalyzerReference.class );

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

Shouldn't we use getClass() instead of using a static class reference?

@Override
public <T extends AnalyzerReference> boolean is(Class<T> analyzerType) {
return analyzerType.isAssignableFrom( ScopedLuceneAnalyzerReference.class );

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

Shouldn't we use getClass() instead of using a static class reference?

@@ -26,53 +26,74 @@
private static final Log log = LoggerFactory.make();
private RemoteAnalyzer globalAnalyzer;
private final RemoteAnalyzerReference globalAnalyzerReference;

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

Having the analyzer here was not a mistake. For me the analyzer didn't really need to be aware of the notion of reference. Why exactly did you change that?

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

In practice, we did pass references, only those references were called "lazy analyzers".
We cannot pass fully-initialized analyzers to the scoped analyzer, because we can't initialize the analyzers until after we created the scoped analyzers (during the mapping construction).

Anyway, this change is more related to what's done in one of the next commits related to lazy analyzers. I'll try to see why it's here...

@@ -84,7 +85,7 @@ static ScopedAnalyzerReference updateAnalyzerMappings(Workspace workspace, Scope
return scopedAnalyzerReference;
}
ScopedAnalyzerReference.Builder copyBuilder = new ScopedAnalyzerReference.Builder( scopedAnalyzerReference );
ScopedAnalyzer.Builder copyBuilder = scopedAnalyzerReference.getAnalyzer().startCopy();

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

It's weird. It looks like we are copying an analyzer while in fact we build an AnalyzerReference. Maybe the name Builder is misleading and ReferenceBuilder would be better?

All in all, I'm not convinced this is an improvement. What's the exact reason for this change (the commit message does not really help).

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

We do build an analyzer, it's just that we return a reference to the result, instead of returning the result directly.

We must return a reference, and we must start the copy from the analyzer itself, because only the analyzer itself (which is specific to the indexing service) knows which type of reference to return. If we returned the analyzer directly, we wouldn't know how to wrap it in a reference...

Here it's not that bad, because we are in a Lucene context, so we could wrap it in a LuceneAnalyzerReference and be done with it.
But in other places (namely org.hibernate.search.query.dsl.impl.ConnectedQueryContextBuilder.HSearchEntityContext) we don't have that kind of information, we can't even access the analyzer strategies anymore, so we're stuck. And we actually need a reference, not an analyzer, because later we'll use the reference to determine whether it's remote or not. Though I guess we could use some instanceofs... But that makes all these is methods in analyzer references rather pointless, doesn't it?

@gsmet

Still kinda worried about how AnalyzerReference and Analyzer are now mixed. I really don't like it. I was very careful to avoid it.

public void initialize(Analyzer analyzer) {
if ( this.analyzer != null ) {
throw new AssertionFailure( "An analyzer reference has been initialized more than once." );

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

Maybe include the name of the analyzer or the analyzer if defined?

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

I think we should include the name for the builtin analyzers. That would be more clear. (I know this code has moved so it's probably something to apply on top of the other changes)

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

I appended this.toString. Please note that this is not expected to happen, though (AssertionFailure, not SearchException).

@@ -964,7 +963,7 @@
SearchException remoteAnalyzerAlreadyDefinedAsLuceneAnalyzer(String name);
@Message(id = 315, value = "Lazy remote analyzer %1$s is not initialized.")
SearchException lazyRemoteAnalyzerNotInitialized(LazyRemoteAnalyzer analyzer);
SearchException lazyRemoteAnalyzerReferenceNotInitialized(RemoteAnalyzerReference reference);

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

Looks like you forgot to update the message as you did above.

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

Right, fixed.

@gsmet

Nothing special except I wonder if we should initialize the name in every case.

}
public ElasticsearchAnalyzerReference(ElasticsearchAnalyzer analyzer) {
this.name = null;

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

I think it would be nice to initialize the name.

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

That seems wrong to me, because when using that constructor the reference is not to a name, but something else (the "default", some class, ...). We don't have anything to put in name, or at least nothing that would not conflict with the analyzer definition namespace (the set of names used in @AnalyzerDef.name).

Or did I misunderstand?

@gsmet

Nice piece of code. Minor comments inline.

@@ -0,0 +1,53 @@
/*

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

Typo in the commit message: s/Translate Lucene analyzer definitions their/Translate Lucene analyzer definitions to their/

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

Fixed.

*/
public class CustomElasticsearchAnalyzerImpl implements ElasticsearchAnalyzer {
protected final AnalyzerDef definition;

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

I don't see the point of making it protected?

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

Neither do I. Fixed.

@@ -15,4 +16,6 @@
*/
public interface ElasticsearchAnalyzer extends RemoteAnalyzer {
AnalyzerDef getDefinition(String fieldName);

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

So you will always have an AnalyzerDef? Considering the comment at the top of CustomElasticsearchAnalyzerImpl, I thought it was only guaranteed for this impl.

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

We can't use instanceof because we have scoped analyzers.
So we'll have a definition either when it's a CustomElasticsearchAnalyzerImpl or when it's a ScopedElasticsearchAnalyzerImpl pointing to a CustomElasticsearchAnalyzerImpl.

@@ -21,12 +21,12 @@
@Override
public ElasticsearchAnalyzerReference createDefaultAnalyzerReference() {
return new ElasticsearchAnalyzerReference( new ElasticsearchAnalyzerImpl( "default" ) );
return new ElasticsearchAnalyzerReference( new UndefinedElasticsearchAnalyzerImpl( "default" ) );

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

Undefined sounds weird. Builtin maybe?

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

Thing is it's not always builtin... It maybe something that was defined by the user on the server-side.
Also, I use Builtin in a later commit for something else :)
What about ServerDefinedElasticsearchAnalyzerImpl?

@Override
public AnalyzerDef getDefinition(String fieldName) {
// No definition
return null;

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

Yeah, as mentioned above, not sure it's a good idea to define it in the interface.

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

See my answer above. Apart from handling scoped analyzer manually (if (analyzer instanceof Scoped) { ... } else { ... }) I don't see any other solution. And I think this one is "less bad".

private final Map<String, AnalysisDefinitionFactory<D>> result = new HashMap<>();
public LuceneAnalysisDefinitionTranslationMapBuilder(Class<D> targetClass) {
super();

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

No super(); needed.

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

It does not harm either... Anyway... Fixed.

}
}
}

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

IIRC, for new files, we add an empty line at the end.

public SimpleAnalysisDefinitionFactory(Class<D> targetClazz, String type, Map<String, String> parameterNameTranslations,
Map<String, ParameterValueTransformer> parameterValueTranslations,
Map<String, JsonElement> staticParameters) {
super();

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

Mmmh, OK, any reason why you add a super(); everywhere? I can see a pattern here :).

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

Eclipse-generated constructors.

result = targetClass.newInstance();
}
catch (InstantiationException | IllegalAccessException e) {
throw new AssertionFailure( "Unexpected failure while instanciating a definition", e );

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

instantiating

This comment has been minimized.

@yrodiere
protected void addParameter(Map<String, JsonElement> parameterMap, String luceneParameterName, String value) {
String elasticsearchParameterName = parameterNameTranslations.get( luceneParameterName );
if ( elasticsearchParameterName == null ) {
// Assume the parameter actually exist in Elasticsearch

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

exists

This comment has been minimized.

@yrodiere
@gsmet

Nothing much here. Just a question about whether it's worth it to introduce this adapter thingy.

/**
* @author Yoann Rodiere
*/
public class AnalyzerDefinitionJsonAdapterFactory extends AnalysisDefinitionJsonAdapterFactory {

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

Wondering if it's worth introducing this infrastructure rather than simply use a JsonElement/JsonArray to serialize the elements.

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

Actually, it occurred to me only after I spent hours on this crap :-(
Anyway... Using JsonElement/JsonArray may hurt when we'll implement validation for analyzer definitions, so I'd like to avoid that as much as possible. I know it was a real pain when I implemented mapping validation.

This comment has been minimized.

@gsmet
@gsmet

Looks good apart from my minor comment about the test change.

FullTextQuery query = s.createFullTextQuery( luceneQuery, SubClass.class );
assertEquals( 1, query.getResultSize() );
luceneQuery = parser.parse( "name:Procaine" );
luceneQuery = parser.parse( "name:Doe" );

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

I'm a bit worried about this change. The StandardAnalyzer that might be used as default would have the exact same behavior. I really think using an analyzer filtering the non ASCII characters is better. I think it's feasible, isn't it?

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

Thing is, we can only use a built-in Lucene analyzer, unless we want to alter the test another way, by using analyzer definitions instead of an analyzer implementation.
I changed this to use Turkish diacritics. StandardAnalyzer seems to fail at removing those and the turkish analyzer is built-in.

@gsmet

Only a minor comment here.

@Message(id = ES_BACKEND_MESSAGES_START_ID + 63,
value = "The analyzer implementation '%1$s' is not supported with Elasticsearch."
+ " Please only use builtin Lucene analyzers that have a builtin equivalent in Elasticsearch.")
SearchException unsupportedAnalyzerImplementation(Class<?> luceneClass);

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

Use @FormatWith.

This comment has been minimized.

@yrodiere
@gsmet

I think we need to define if we want to simply pass a JSON object with all the parameters. I think it's easier. That's the approach we chose in OGM for MongoDB index parameters.

* {@literal @}Parameter(name = "type", value = "pattern_replace"),
* {@literal @}Parameter(name = "pattern", value = "'[^0-9]'"),
* {@literal @}Parameter(name = "replacement", value = "'0'"),
* {@literal @}Parameter(name = "tags", value = "'CASE_INSENSITIVE|COMMENTS'")

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

I wonder if we should simply pass a json object instead of separate parameters?

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

I agree it could do, but I'm not sure it would be better.

  1. We'd have some magic parameter name (not really intuitive)
  2. We'd have more error cases to handle (what happens if someone gives a parameter name other than the magic one?)
  3. It would not be much more readable, thanks to Java lacking multi-line string constants (a real pain, that one...)
  4. On top of that, I'm not even sure there are analyzers with "object" parameters in Elasticsearch. I sure didn't come across any. The parameters are essentially flat, with arrays sometimes. So we wouldn't gain much...

To be honest, I'd say "yes" if we were introducing Elasticsearch-specific annotations, but in that case it doesn't seem to fit the annotation very well...

@gsmet

Some minor comments.

Hibernate Search will never update analyzer definitions on an existing index
(no support for the `MERGE` <<elasticsearch-schema-management-strategy,schema management strategy>>)
and will never validate that pre-existing definitions actually match your Hibernate Search mapping
(no support for the `VALIDATE` <<elasticsearch-schema-management-strategy,schema management strategy>>).

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

Is it a work in progress? It seems so as you created issues for it.

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

Yes it is. I added details about that.

params = {
@Parameter(name = "type", value = "'html_strip'"),
// One can use Json arrays
@Parameter(name = "escaped_tags", value = "['br' 'p']")

This comment has been minimized.

@gsmet

gsmet Dec 16, 2016

Member

I know the parser is lenient but wouldn't it be better with a , between the array elements?

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

It sure would :) Fixed

@@ -33,7 +33,7 @@ public void close() {
@Override
public <T extends AnalyzerReference> boolean is(Class<T> analyzerType) {
return LuceneAnalyzerReference.class.isAssignableFrom( analyzerType );
return analyzerType.isAssignableFrom( LuceneAnalyzerReference.class );

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

I was worried about leaf types in the inheritance tree that would be "pure implementations", not meant to be accessed... If those types implemented different, public interfaces, that would make a difference in the result of this method, and expose implementation details.
But it turns out I didn't need to do something like that, so... Do you want me to use getClass()?

@@ -26,53 +26,74 @@
private static final Log log = LoggerFactory.make();
private RemoteAnalyzer globalAnalyzer;
private final RemoteAnalyzerReference globalAnalyzerReference;

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

In practice, we did pass references, only those references were called "lazy analyzers".
We cannot pass fully-initialized analyzers to the scoped analyzer, because we can't initialize the analyzers until after we created the scoped analyzers (during the mapping construction).

Anyway, this change is more related to what's done in one of the next commits related to lazy analyzers. I'll try to see why it's here...

@@ -84,7 +85,7 @@ static ScopedAnalyzerReference updateAnalyzerMappings(Workspace workspace, Scope
return scopedAnalyzerReference;
}
ScopedAnalyzerReference.Builder copyBuilder = new ScopedAnalyzerReference.Builder( scopedAnalyzerReference );
ScopedAnalyzer.Builder copyBuilder = scopedAnalyzerReference.getAnalyzer().startCopy();

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

We do build an analyzer, it's just that we return a reference to the result, instead of returning the result directly.

We must return a reference, and we must start the copy from the analyzer itself, because only the analyzer itself (which is specific to the indexing service) knows which type of reference to return. If we returned the analyzer directly, we wouldn't know how to wrap it in a reference...

Here it's not that bad, because we are in a Lucene context, so we could wrap it in a LuceneAnalyzerReference and be done with it.
But in other places (namely org.hibernate.search.query.dsl.impl.ConnectedQueryContextBuilder.HSearchEntityContext) we don't have that kind of information, we can't even access the analyzer strategies anymore, so we're stuck. And we actually need a reference, not an analyzer, because later we'll use the reference to determine whether it's remote or not. Though I guess we could use some instanceofs... But that makes all these is methods in analyzer references rather pointless, doesn't it?

@@ -7,12 +7,14 @@
package org.hibernate.search.elasticsearch.analyzer.impl;
import java.util.Collection;
import java.util.Collections;

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

Anwsering your review comment:

In later commits, we must introduce methods in ElasticsearchAnalyzer that are specific to Elasticsearch (or rather, specific to the way the Elasticsearch module handles analyzers). Those methods are not relevant to the engine, so we can't add those in RemoteAnalyzer (the superinterface of ElasticsearchAnalyzer.

If we keep ScopedRemoteAnalyzer in the -engine module, we obviously cannot make it implement methods defined in the Elasticsearch module. But we should, because ScopedRemoteAnalyzer is essentially a proxy that should be usable wherever we use a standard analyzer...

So we end up moving the implementation of the scoped analyzer to the module where we know exactly what is implemented. Which seems to make sense...

public void initialize(Analyzer analyzer) {
if ( this.analyzer != null ) {
throw new AssertionFailure( "An analyzer reference has been initialized more than once." );

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

I appended this.toString. Please note that this is not expected to happen, though (AssertionFailure, not SearchException).

final AnalyzerReference reference = new LuceneAnalyzerReference( analyzer );
initializedAnalyzers.put( entry.getKey(), reference );
// Ensure every analyzer definition will be initialized
for ( String name : analyzerDefs.keySet() ) {

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

Fixed.

if ( !referencesByName.isEmpty() ) {
for ( Map.Entry<String, ? extends AnalyzerReference> entry : referencesByNameForType.entrySet() ) {
String referenceName = entry.getKey();
if ( referencesByName.containsKey( referenceName ) ) {

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

Fixed.

}
public Builder(Class<?> indexedType, ScopedAnalyzerReference.Builder scopedAnalyzerReferenceBuilder) {
public Builder(Class<?> indexedType, Builder builderToClone) {

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

Fixed.

}
}
else {
if ( !index.isAnalyzed() ) {
// no analyzer is used, add a fake one for queries

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

Fixed.

@@ -0,0 +1,53 @@
/*

This comment has been minimized.

@yrodiere

yrodiere Dec 16, 2016

Member

Fixed.

@yrodiere

This comment has been minimized.

Member

yrodiere commented Dec 16, 2016

To sum up, there are three main topics remaining:

  1. We have analyzer/analyzerReference mixups: #1263 (comment) . @gsmet , my last commit should solve that issue. Does it seem ok to you?
  2. Whether file parameters should be forwarded as is (and files should exist on the Elasticsearch instance), or we should parse those files and forward their content, adequately formatted in JSON: #1263 (comment)
  3. Whether we pass parameters to Elasticsearch-specific analysis factories using a) multiple, small parameters with the same name as in Elasticsearch or b) one big JSON object parameter with a magic name: #1263 (comment) . My opnion is we should leave it as is, but we could also provide both to users and let them choose.

The first one should be addressed in my latest commit, and I'll try to address the second right now (but it's getting late, I probably won't have time to finish today).

@Sanne, any opinion on this?

@Sanne

This comment has been minimized.

Member

Sanne commented Dec 16, 2016

Thanks!
My quick opinion on your questions, bearing in mind I didn't review the PR yet:

  1. Not sure what you're talking about, I guess I'll see during the review.
  2. That's a nice idea, but I'd classify it as "future && maybe" improvement. No need for JIRAs, we can see how it plays out.
  3. I'd go with the most intuitive approach from an end user perspective, but I don't have enough experience with ES Analyzer definitions to judge. Happy to follow your suggestion; I'll merely check to see how maintainable this looks like.

onward with some hands-on tests..

@Sanne Sanne added Needs rebase and removed Ready for review labels Dec 18, 2016

yrodiere added some commits Dec 9, 2016

HSEARCH-2219 Make analyzer strategy an interface implemented by index…
…ing services

This will allow to use implementation-specific analyzer types in
Elasticsearch, giving us more space for handling analyzer definitions.
HSEARCH-2219 Fix an encapsulation leak in ScopeRemoteAnalyzer.clone()
The call in
org.hibernate.search.util.impl.ScopedAnalyzerReference.Builder.addAnalyzerReference(String,
AnalyzerReference) could alter the original object cloned in
org.hibernate.search.query.dsl.impl.ConnectedQueryContextBuilder.HSearchEntityContext.HSearchEntityContext(Class<?>,
ExtendedSearchIntegrator), for instance.
HSEARCH-2219 Fix is() method in AnalyzerReference implementations
The call was reverted, probably due to a copy-paste from
ScopedAnalyzerReference?
HSEARCH-2219 Use a proper cast in unwrap() method in AnalyzerReferenc…
…e implementations

This prevents ClassCastException to happen in places where there does
not seem to be a cast.
HSEARCH-2219 Move scoped analyzers to SPI
This is required in order to enable indexing services to define their
own analyzer type, with custom data.

yrodiere added some commits Dec 12, 2016

HSEARCH-2219 Move ScopedLuceneAnalyzer to org.hibernate.search.analyz…
…er.impl

It makes sense now that ScopedAnalyzer is located in org.hibernate.search.analyzer.spi.
HSEARCH-2219 Remove Lazy*Analyzers and replace them with lazy references
This is required in order to enable indexing services to define their
own analyzer type, with custom data.
HSEARCH-2219 Translate Lucene analyzer definitions to their Elasticse…
…arch equivalent automatically

There are a few gotchas, most notably:

 * parameters expecting a list of file paths only accept *one* file
   in Elasticsearch
 * the files targeted by file path parameters must be in the config
   directory of Elasticsearch, instead of being in the classpath
 * TypeTokenFilterFactory is not supported, because the "keep_types"
   parameter is a file path in Lucene, but an array of elements in
   Elasticsearch
 * HunspellStemFilterFactory is not supported, because the "dictionary"
   parameter is a list of files in Lucene, but does not exist in
   Elasticsearch (a "locale" parameter is expected instead)
 * A few parameters that are supported in Lucene but not in
   Elasticsearch will trigger an exception when used
 * For now, extra parameters that are not even supported by Lucene will
   not trigger any exception. This may change in the future.
HSEARCH-2219 Allow setting the name for tokenizer/token filter/char f…
…ilter definitions from the programmatic mapping API
HSEARCH-2219 Remove most of the custom analyzer definitions from elas…
…ticsearch.yml

They are not needed anymore, since we push those definitions
automatically.
HSEARCH-2391 Enable or disable for good analyzer tests from the -orm …
…module in the -elasticsearch module

Those that use core Lucene analyzers can be enabled,
but those using custom analyzer implementations cannot.
HSEARCH-2219 Copy analysis configuration files to the Elasticsearch c…
…onfiguration for tests

This is used by AnalyzerBuilderTest: some analyzer definitions in
the Team class have file-path parameters, and when we use the
Elasticsearch indexing service those files must be in the Elasticsearch
configuration directory.
HSEARCH-2219 Change the API of TestElasticsearchClient to make adding…
… new methods easier

Each time we had to have three versions of each method due to the fact
the index name and mapping name could be provided either as a class or
as a string. Now this part of the job is independent.
HSEARCH-2219 Remove an unnecessary Elasticsearch mapping creation
The mappings are already created by a call to createMappings from
ElasticsearchIndexManager. Before this patch, the mappings were created,
and then updated with the exact same content.
HSEARCH-2219 Allow to define analyzers without using Lucene-specific …
…types with Elasticsearch

To bypass Lucene, three classes have been added:

 * ElasticsearchCharFilterFactory
 * ElasticsearchTokenFilterFactory
 * ElasticsearchTokenizerFactory

These classes take JSON parameters. Definitions using these classes will
be forwarded as is to Elasticsearch, allowing to take advantage of
advanced JSON syntax, for instance with arrays or even objects.
@yrodiere

This comment has been minimized.

Member

yrodiere commented Dec 19, 2016

@Sanne FYI I'm still working on adding better support for file parameters for analysis factories in HSEARCH-2219.
I just discovered that some analysis-related tests did not actually test indexing, and it seems the syntax for those files is slightly different for Elasticsearch (especially char mapping rules, where ES doesn't expect surrounding double-quotes, while Lucene does).
All in all, @gsmet's solution of parsing files on the Hibernate Search side and sending the content as JSON arrays seems to be more and more appropriate, especially given the amount of work required for users to convert their files.

@Sanne

This comment has been minimized.

Member

Sanne commented Dec 19, 2016

@yrodiere you mean more improvement commits are expected to appear on this PR soon right?

Or shall we look into merging this, and then open follow-up improvement JIRAs as needed?

@yrodiere

This comment has been minimized.

Member

yrodiere commented Dec 19, 2016

@Sanne I just pushed the improvements. I kept them in separate commits to make that readable.
We could handle this separately, but probably not after the CR: the behavior we are discussing here (the one regarding files) could not be changed without breaking users.

Other matters are less critical. I'll open separate PRs. I'll move the commit about analyzer/reference mixups to another PR, too, since Guillaume is "not particularly happy with the solution" and I'll have to change it anyway.

yrodiere added some commits Dec 19, 2016

HSEARCH-2219 Improve DefaultElasticsearchAnalyzerDefinitionTranslator…
… implementation to allow more complex parameter manipulations

This introduces ParametersTransformer, which allows for objects
manipulating multiple analysis factory parameters at once, which can be
useful when a parameter's value affects the interpretation of another
parameter (like the 'format' parameter for some factories).
HSEARCH-2219 When translating analysis definitions for Elasticsearch,…
… use local files for parameters expecting file paths

We now parse those files' content and forward that content in JSON
format to Elasticsearch.
@Sanne

This comment has been minimized.

Member

Sanne commented Dec 19, 2016

merged. great job all! I was quite worried about us attempting this "translation" but it's looking much better than what I was expecting.

@Sanne Sanne closed this Dec 19, 2016

@yrodiere yrodiere deleted the yrodiere:HSEARCH-2219 branch Apr 11, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment