Improve Lucene search #517

Closed
julie-sullivan opened this issue Jan 10, 2014 · 17 comments

@julie-sullivan
Member

commented Jan 10, 2014

The Lucene keyword search hasn't been given any attention for a while.

  • Upgrade the library
  • Caching
  • Performance
    • @radekstepan and @joshkh knocked over the Apache server with 1000 requests; 200 requests knocked over only beta FlyMine
  • Make first search faster

We could also consider separating the search from the webapp.

@radekstepan

Contributor

commented Jan 10, 2014

@julie-sullivan

Member Author

commented Nov 11, 2014

Can't do everything. Maybe at least upgrade library?

@julie-sullivan

Member Author

commented Nov 13, 2014

Merge in the changes for araport.

@justinccdev

Contributor

commented Jul 1, 2015

On an initial look, the latest Lucene version we could currently upgrade to is 4.7.2 [1] (from 3.0.2) as InterMine still supports Java 6. Current Lucene version is 5.2.1.

[1] http://lucene.apache.org/core/4_7_2/SYSTEM_REQUIREMENTS.html

@julie-sullivan

Member Author

commented Nov 18, 2015

@julie-sullivan

Member Author

commented May 31, 2016

Tell @cmdcolin

@justinccdev

Contributor

commented May 31, 2016

To record a bit more of previous discussion/research.

  • The current InterMine implementation uses a third-party plugin called Bobo to handle faceting. Unfortunately, Bobo is long dead, probably because Lucene introduced its own facets implementation. Lucene's implementation is not well documented and is considerably different from the Bobo API (not least because Lucene considerably changed its API between v3, v4 and v5).
  • In general, Lucene v5 still appears very poorly documented compared to v4. It may be better to update to v4 at this point, which is still being maintained by Lucene.
  • However, even updating to Lucene v4 is considerable work, not least because the code is quite intertwined. I have branches where I began the attempt but haven't yet had an opportunity to finish.
  • It's probably worth spending more time on investigating whether Elasticsearch would be a better choice rather than updating Lucene. Though built on Lucene, perhaps it's better documented.
  • Currently, InterMine constructs the search index at build time and saves it as a blob in the database. On the first search, this blob is deserialized and loaded into memory. This takes a long time and permanently occupies a lot of memory. Unless there is a massive performance penalty, we want to look at deserializing to disk and accessing it as required.
  • We could look at disk serializing in v3 before later updating to Lucene v4/elasticsearch if the current approach is becoming very difficult.
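The "deserialize to disk and access as required" idea from the list above can be sketched roughly as follows. This is a language-neutral illustration in Python, not InterMine code: the function names and the file name `search_index.bin` are made up for the example. In Lucene itself this would presumably mean opening a filesystem-backed `Directory` (such as `FSDirectory`) rather than an in-memory one.

```python
import mmap
import os
import tempfile


def spill_blob_to_disk(blob: bytes, directory: str, name: str = "search_index.bin") -> str:
    """Write the serialized index blob to a file and return its path.

    The file name is a hypothetical placeholder.
    """
    path = os.path.join(directory, name)
    with open(path, "wb") as fh:
        fh.write(blob)
    return path


def read_slice(path: str, offset: int, length: int) -> bytes:
    """Read part of the on-disk index through a read-only memory map.

    The OS pages in only the regions that are touched, so the whole
    index does not have to live on the application heap.
    """
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return view[offset:offset + length]


if __name__ == "__main__":
    # Stand-in for the blob that would be read from the database.
    blob = b"HEADER" + bytes(1024) + b"TERMS"
    with tempfile.TemporaryDirectory() as tmp:
        index_path = spill_blob_to_disk(blob, tmp)
        print(read_slice(index_path, 0, 6))
```

Whether this is fast enough for search is exactly the open question raised above; the sketch only shows the access pattern.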
@justinccdev

Contributor

commented Jun 1, 2016

@cmdcolin

Contributor

commented Jun 20, 2016

I started doing a bit of work on converting from Lucene to Solr. I chose Solr since it's similar to Elasticsearch but closer to Lucene in its libraries. I chose SolrJ version 5 since SolrJ 6 removes JRE 7 support. As for development, I have been removing a lot of the Lucene code and also the faceting code, with the intention of re-enabling faceted searching once the basic idea has been demonstrated.

I am still working out some of the kinks but currently we can load data from intermine into solr from the postprocess step, and I was working on getting the quick search jsp pages to render :)

https://github.com/cmdcolin/intermine/tree/lucene_to_solr

I can keep this issue updated if I find any problems with this approach, but if there are any initial comments, feel free to let me know.

@cmdcolin

Contributor

commented Jun 20, 2016

Here's an example of a query to the Solr interface after running a postprocess on malariamine:

curl "http://localhost:8983/solr/new_core2/query?q=*:*&rows=3"

{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"*:*",
      "rows":"3"}},
  "response":{"numFound":79326,"start":0,"docs":[
      {
        "OntologyTerm.name":["negative regulation of mitosis"],
        "id":"93ff181b-04d6-4182-8310-bf42700761ac",
        "_version_":1537578535963590656},
      {
        "OntologyTerm.name":["negative regulation of mitotic cell cycle"],
        "id":"99d38675-d81d-4ff5-8006-6e914490adc7",
        "_version_":1537578535966736384},
      {
        "OntologyTerm.name":["negative regulation of mitotic cell cycle, embryonic"],
        "id":"e87aa7e6-bb80-4c48-a64f-e1ebb83b65b6",
        "_version_":1537578535968833536}]
  }}
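For anyone who wants to script the same check, here is a minimal sketch that builds the query URL and pulls the hit count and term names out of a response like the one above. The host, port, core name and `/query` endpoint are taken from the curl command; the function names are illustrative, and the sample JSON is a trimmed copy of the output above so the code runs without a Solr instance.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, core: str, q: str = "*:*", rows: int = 3) -> str:
    """Return the URL for the JSON query endpoint of the given Solr core."""
    return f"{base_url}/solr/{core}/query?{urlencode({'q': q, 'rows': rows})}"


def summarize(response_text: str) -> tuple:
    """Return (numFound, first OntologyTerm.name value of each returned doc)."""
    resp = json.loads(response_text)["response"]
    names = [doc.get("OntologyTerm.name", [None])[0] for doc in resp["docs"]]
    return resp["numFound"], names


# Trimmed copy of the response shown above (first two docs only).
SAMPLE = """
{
  "responseHeader": {"status": 0, "QTime": 0},
  "response": {"numFound": 79326, "start": 0, "docs": [
    {"OntologyTerm.name": ["negative regulation of mitosis"],
     "id": "93ff181b-04d6-4182-8310-bf42700761ac"},
    {"OntologyTerm.name": ["negative regulation of mitotic cell cycle"],
     "id": "99d38675-d81d-4ff5-8006-6e914490adc7"}
  ]}
}
"""

if __name__ == "__main__":
    print(build_query_url("http://localhost:8983", "new_core2"))
    print(summarize(SAMPLE))
```

Against a live server, the URL from `build_query_url` could be fetched with `urllib.request.urlopen` and the body passed to `summarize`.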
@cmdcolin

Contributor

commented Jun 20, 2016

@justinccdev

Contributor

commented Jun 21, 2016

Nice work, Colin.

  • To summarize from my reading of the source code, this branch changes InterMine to interact with a Solr service instead of performing any document processing or storing any search data itself, correct?
  • This will mean that an InterMine database will no longer be portable as a single database dump - it now needs to be accompanied by a Solr dump (if there is such a thing) or configuration for the URL of a persistent Solr instance for that data build.
  • I imagine we will need to wipe out a previous document store for the mine if one already exists in Solr before loading documents in the post-process step.
  • Could you post the Solr configuration that you're using?
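On wiping a previous document store before reloading: one possible approach, sketched here as an assumption rather than as what the branch does, is a delete-by-query against the core's update handler before the post-process step loads new documents. The function name is made up for the example, and the host and core name are the ones used elsewhere in this thread. The request is only built here, not sent.

```python
import json
from urllib.request import Request


def build_clear_core_request(base_url: str, core: str) -> Request:
    """Build a POST that deletes every document in the core and commits.

    Uses Solr's JSON update syntax: {"delete": {"query": "*:*"}}.
    """
    body = json.dumps({"delete": {"query": "*:*"}}).encode("utf-8")
    return Request(
        f"{base_url}/solr/{core}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_clear_core_request("http://localhost:8983", "new_core2")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually clear the core, pass req to urllib.request.urlopen.
```

This is destructive, so in practice it would want to be tied to the specific mine's core rather than a shared one.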
@cmdcolin

Contributor

commented Jun 21, 2016

  1. Yes. The idea of an "embedded solr store" exists, however this is apparently not recommended by a fair number of sources, including Solr's own docs, for a variety of reasons. Some Google searches: https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=solr%20embedded%20not%20recommend
  2. Yep. Not sure what requirements there might be for this, but I imagine getting database dumps from Solr is not too tricky. Nevertheless, letting Solr exist as its own DB store separate from Postgres seems interesting too, and it could just be a separate config item in ~/.intermine/mine.properties or similar.
  3. That is also true. The "management" of the solr store isn't fleshed out too much, it's pretty hardcoded at the moment too
  4. Pretty much what I did was I installed solr (brew install solr for mac), then I ran the command "solr create -c new_core2" (new_core2 is hardcoded still :)). Then I did standard postprocess with ant -v.
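To illustrate point 2, a Solr location read from a properties file instead of the hardcoded core could look something like the sketch below. The key `keyword.search.solr.url` is purely hypothetical (no such key is defined anywhere in this thread), and the parser handles only simple `key=value` lines, not the full Java properties syntax.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def solr_url(props: dict, default: str = "http://localhost:8983/solr/new_core2") -> str:
    """Return the configured Solr URL, falling back to the currently hardcoded core.

    'keyword.search.solr.url' is a hypothetical key name used for illustration.
    """
    return props.get("keyword.search.solr.url", default)


if __name__ == "__main__":
    example = """
    # hypothetical entry in ~/.intermine/mine.properties
    keyword.search.solr.url=http://localhost:8983/solr/malariamine
    """
    print(solr_url(parse_properties(example)))
    print(solr_url(parse_properties("")))
```

Keeping the current hardcoded value as the default would let existing setups keep working while the config item is being introduced.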
@cmdcolin

Contributor

commented Jun 22, 2016

I finally got my branch to display some search results on malariamine :) https://github.com/intermine/intermine/compare/dev...cmdcolin:lucene_to_solr?expand=1

[Screenshot: localhost:8080, 2016-06-22 10:43:45]

@cmdcolin

Contributor

commented Jun 29, 2016

Added a couple of updates to fix indexing of features. I'm not sure, though, whether there is an existing config file that says which fields get indexed?

@justinccdev

Contributor

commented Jul 4, 2016

Documentation on this is at [1]. InterMine currently indexes all text fields by default and has a $MINE/dbmodel/resources/keyword_search.properties file that lets you skip some fields and configure other search behaviour.

[1] http://intermine.readthedocs.io/en/latest/webapp/keyword-search/
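The "index everything by default, skip what the config lists" behaviour described above could be mirrored on the Solr side roughly as follows. This is a sketch under assumptions: the whitespace-separated `Class.field` ignore-list format and every field name other than `OntologyTerm.name` are illustrative, so check the documentation at [1] for the real keys and syntax.

```python
def parse_ignore_list(value: str) -> set:
    """Split a whitespace-separated list of Class.field names into a set.

    The list format is an assumption made for this example.
    """
    return set(value.split())


def fields_to_index(text_fields: list, ignored: set) -> list:
    """Keep every text field except those explicitly ignored, preserving order."""
    return [name for name in text_fields if name not in ignored]


if __name__ == "__main__":
    # OntologyTerm.name appears earlier in this thread; the other names are hypothetical.
    all_text_fields = ["OntologyTerm.name", "OntologyTerm.description", "Gene.symbol"]
    ignored = parse_ignore_list("OntologyTerm.description")
    print(fields_to_index(all_text_fields, ignored))
```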

@julie-sullivan

Member Author

commented Jun 11, 2018
