Bring the TaggerRequestHandler to Solr (thus everything?) #82

Closed
dsmiley opened this issue May 18, 2018 · 4 comments

Comments

@dsmiley
Member

dsmiley commented May 18, 2018

https://issues.apache.org/jira/browse/SOLR-12376

@dsmiley
Member Author

dsmiley commented Jun 29, 2018

Done. (7.4.0)

Some important differences:

  • htmlOffsetAdjust is not in Solr because I didn't want to add yet another dependency to Solr, even if the license was fine.
  • ConcatenateGraphFilterFactory (CGFF) is in Lucene and is better than the STT's ConcatenateFilterFactory (CFF). The Tagger (both in Solr and in this project) is hard-wired to use a particular separator char, and that char differs between the tagger in Solr and the STT here. Thus you can't use the STT here with CGFF, nor can you use Solr's new tagger with the CFF here.
  • The tagger in Solr cannot be used with a shingling strategy to find partial dictionary matches, because of difficulties in configuring the shingle factory. We'd need to specify the particular character the CGFF uses, which is not valid in XML, and Solr's schema is XML. Anyway, this feature seems dubious.
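
To illustrate the XML limitation in the last point, here is a minimal sketch. It assumes, purely for illustration, that the separator is a control character such as U+001F; the note above does not name the character. XML 1.0 does not permit such characters, even written as numeric character references, so a standard parser should reject a schema fragment that tries to set one as the shingle filter's `tokenSeparator`.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(snippet: str) -> bool:
    """Return True if the snippet is well-formed XML 1.0."""
    try:
        ET.fromstring(snippet)
        return True
    except ET.ParseError:
        return False


# Illustrative schema fragments. The U+001F value is an assumption made
# for this sketch, not something stated in the issue.
with_space = '<filter class="solr.ShingleFilterFactory" tokenSeparator=" "/>'
with_control = '<filter class="solr.ShingleFilterFactory" tokenSeparator="&#x1F;"/>'

print(parses_as_xml(with_space))
print(parses_as_xml(with_control))
```

The first fragment parses, while the second is rejected by the parser before Solr would ever see the attribute value.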

@dsmiley dsmiley closed this as completed Jun 29, 2018
@akurniawan

akurniawan commented Oct 17, 2019

Hi @dsmiley, sorry for reviving this closed issue. Regarding the last point, where partial matches against the document are no longer possible: is there a way to work around it? I found a Stack Overflow question asking exactly this: https://stackoverflow.com/questions/58413033/is-there-a-way-to-use-solr-text-tagger-along-with-n-edge-gram-filter. Thanks a lot!

@dsmiley
Member Author

dsmiley commented Oct 17, 2019

My release note there pertained to shingling, which combines spans of tokens prior to CGFF. But I see you are using NGram (partial within-word matches) applied after CGFF, so my note doesn't apply. I think your configuration should work, and I don't know why it doesn't. You may have to experiment a bit: for example, look at the indexed terms using the "/terms" handler to see if they look as expected. Failing that, maybe use a debugger to see what's up. Maybe you need to add a filter that trims off a trailing null byte, which could happen... but I don't see how that inefficiency would cause your overall approach to not work. I'm too busy to troubleshoot this. I'm not sure how to subscribe to or watch particular Stack Overflow questions, but FWIW I up-voted it and marked it as a favorite.
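
The difference between shingling and n-gramming can be sketched in a few lines. This is a simplified illustration, not Lucene's implementation: the function names are made up for this example, the shingle sketch emits only fixed-size shingles (the real filter can also emit single tokens), and the n-gram sketch shows the edge (prefix) variant only.

```python
def shingles(tokens, size=2, sep=" "):
    """Join each run of `size` adjacent tokens into one term (spans of tokens)."""
    return [sep.join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def edge_ngrams(term, min_len=1, max_len=None):
    """Return the prefixes of a single term (partial within-term matches)."""
    max_len = len(term) if max_len is None else min(max_len, len(term))
    return [term[:n] for n in range(min_len, max_len + 1)]


# Shingling works across tokens, before any concatenation step.
print(shingles(["new", "york", "city"]))

# Edge n-grams work inside one term, so they can run after concatenation.
print(edge_ngrams("york", min_len=2, max_len=3))
```

Shingling needs to know the separator used to join tokens, which is where the configuration difficulty arises; n-gramming an already concatenated term does not.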

@akurniawan

Thanks for the answer! Sure, I understand you can't help troubleshoot this, and I'm debugging it right now. So far I've found that there are no ngram terms on /terms, so maybe something weird happens after CGFF. I'm still figuring out how to debug Solr, since this is my first time using it. Thanks anyway for the help, @dsmiley!
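
For anyone repeating the /terms check, here is a small sketch that builds the request URL using Solr's Terms Component parameters (`terms.fl`, `terms.limit`, `terms.prefix`). The host, the collection name `mycoll`, and the field name `name_tag` are placeholders, and it assumes a `/terms` handler is available on the collection.

```python
from urllib.parse import urlencode


def terms_url(base_url, collection, field, limit=20, prefix=None):
    """Build a URL that lists indexed terms for one field via /terms."""
    params = {
        "terms": "true",
        "terms.fl": field,      # field whose indexed terms to list
        "terms.limit": limit,   # maximum number of terms to return
        "wt": "json",
    }
    if prefix is not None:
        params["terms.prefix"] = prefix  # only terms starting with this prefix
    return f"{base_url.rstrip('/')}/{collection}/terms?{urlencode(params)}"


# Placeholder values; substitute your own host, collection, and field.
print(terms_url("http://localhost:8983/solr", "mycoll", "name_tag"))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, shows which terms were actually indexed for the field, which is a quick way to see whether the n-gram filter produced anything.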
