L5 searchable docs #18

Closed
wants to merge 31 commits into
from

Projects

None yet
@mattstauffer
Contributor

This is not necessarily complete, but I want to PR it so you can see it and give some feedback. You can merge as-is, or give feedback and I'll keep working before we call this PR ready-to-merge--whichever.

Steps to use this:

Install ElasticSearch.

Homestead: https://bosnadev.com/2014/09/12/install-elasticsearch-on-laravel-homestead/
Forge: http://forgerecipes.com/recipes/29

Limit ElasticSearch RAM Usage

On my 1GB server I set it to 512MB, but you can choose what's best based on the usage.

  1. Edit /etc/default/elasticsearch and set ES_HEAPSIZE to 512m (or whatever you want)
  2. Restart ElasticSearch using sudo /etc/init.d/elasticsearch restart

Run php artisan docs:index

This relies on your resources/docs/* Markdown folders being available locally.

That's it!

Original UI:

image

Current UI:

image

It's also (currently) testable here: http://elastic.fiveminutegeekshow.com/docs/

@GrahamCampbell GrahamCampbell and 1 other commented on an outdated diff Feb 22, 2015
app/Commands/Docs/IndexAllDocuments.php
@@ -0,0 +1,25 @@
+<?php namespace App\Commands\Docs;
+
+use App\Commands\Command;
+use App\Services\Documentation\Indexer;
+use Illuminate\Contracts\Bus\SelfHandling;
+
+class IndexAllDocuments extends Command implements SelfHandling {
+
+ /**
+ * Create a new command instance.
+ */
+ public function __construct()
@GrahamCampbell
GrahamCampbell Feb 22, 2015 Member

Don't need that. :)

@mattstauffer
mattstauffer Feb 22, 2015 Contributor

True! I just left the defaults around. Thanks graham! Will fix and push later

@GrahamCampbell GrahamCampbell commented on an outdated diff Feb 22, 2015
app/Commands/Docs/InitializeElasticSearchIndex.php
@@ -0,0 +1,27 @@
+<?php namespace App\Commands\Docs;
+
+use App\Commands\Command;
+use App\Services\Documentation\Indexer;
+use Illuminate\Contracts\Bus\SelfHandling;
+
+class InitializeElasticSearchIndexEs extends Command implements SelfHandling {
+
+ /**
+ * Create a new command instance.
+ */
+ public function __construct()
@GrahamCampbell
GrahamCampbell Feb 22, 2015 Member

Don't need that. :)

@mattstauffer
Contributor

Note that there are some @todos in line that I'd like to fix but I'd also welcome input. And I don't care whether we choose to address them in this PR or a future one.

@montogeek
Contributor

I want this, attach the results page please

@mattstauffer
Contributor

@montogeek I'm not sure what you're talking about?

@mattstauffer
Contributor

Realized I have no idea what power or load this ElasticSearch box will be under. Looking up whether there are options for load testing.

@sepehr
sepehr commented Feb 23, 2015

It's great!

@montogeek
Contributor

@mattstauffer I mean, how the results page looks.

@mattstauffer
Contributor

@montogeek There's a image embedded into the original PR that shows what the results page looks like.

Screenshot from above

@montogeek
Contributor

Oh sorry, I didn't see it.
The results looks pretty good, I think the user do not care about the score. just the result.
What are you thoughts?

@jackmcdade
Member

I agree. It's logical that they're in order of relevance, the actual score doesn't mean a heck of a lot.

@spamoom
spamoom commented Feb 23, 2015

I think we should bin the score

@jackmcdade
Member

Also, I'd use a smaller header, or change it to "Results for "$term". It's a really long headline.

@montogeek
Contributor

What happen if the best match result is a subtitle or subsection of the page?

@polyfractal

Hi! Author of the ES-PHP library here. Just wanted to drop off some notes:

Re: scoring:
  • I would also consider removing the score. The normalization is good (much better than raw Lucene scores), but in practice users still tend to freak out about scores for no reason. :)
  • Related, users tend to gain more value from highlighted fragments than scores. Makes it easier to judge relevance.
Re: load testing:
  • Make sure you use "real" queries that users might actually search for. E.g. use a randomized sample of query text, some short, some long. Resource usage is highly dependent on query pattern, so its easy to make false benchmarks if you use non-realistic queries (e.g. only using one query text. Lucene will keep those data structures hot and you'll see better latency than a more realistic load test).
  • Test for both burst capacity as well as steady state queries-per-second. They have different demands on the system, and in particular, bursts are where you will likely run into trouble if you are using more memory hungry features. Your queriers are basic full-text, so it should be less of a problem for you.
  • Indexing load is probably pretty light, so I probably wouldn't bother including that in your tests
  • What kind of hardware resources is ES running on? Based on your query, you aren't doing much in the way of heavy memory usage (aggregations, sorting, etc), so if your server is crunched for memory, you could consider reducing the filter / field data cache sizes and reducing the overall heap. That'd move more memory to the OS page cache, which Lucene will happily use for speeding up basic full-text search.
  • Also, it's good to identify if you are optimizing for throughput, or latency. Latency can be addressed by adding more primary shards and more machines (which means more machines taking part in a single query). Throughput can be addressed by adding more replicas and more machines (which means more machines simultaneously working on concurrent queries).

All that said, your dataset is probably pretty small, so I would expect a moderate box to handle a reasonably high load well. Many of my tips are probably overkill :) Basic full-text search is very optimized and requires relatively few resources for a surprising amount of qps.

@mattstauffer
Contributor

I agree, I've never loved showing the score. I didn't know if we had a better way to show it, but I'm pretty happy to always defer to @jackmcdade :)

Update without the score, and with "Results":

image

In terms of displaying fragments, I had originally planned to do that work before PR-ing this, but I have very little experiencing breaking a document up into sub-sections. I had some ideas, but would love input: has anyone ever had experience breaking up a document so you can A) get more nuanced results and B) show the fragments in the response?

@mattstauffer
Contributor

Also: @polyfractal, you are a bawss. That is such helpful information--thanks! I have no idea what server infrastructure this is running on at the moment--it's whatever Taylor has Laravel.com running on. :)

@montogeek
Contributor

Laravel.com is running on a Forge instance.

Great work @mattstauffer ! Another request, shows the query on the search input :).
And about the subsections, I think that is too much for now, maybe in a coming release!

@mattstauffer
Contributor

@montogeek Yes, but Forge can run many different sizes of servers. So that only tells us the architecture, not the power.

I've been unsure whether or not it's wise to put the previous search in the query box.

And I agree, the subsections may be too much, which is why I put this up here. But if someone says, "That's easy, read this blog post", I'd be happy to go ahead.

@montogeek
Contributor

Right about Forge.
Is like Google, shows the user his input.

@mattstauffer
Contributor

Currently learning about ES Highlights. 👍

@mattstauffer
Contributor

OK, made it this far in ES highlights. Back at it again tomorrow.

image

@polyfractal

@mattstauffer Happy to help, feel free to ping me if you have a question about syntax or relevance scoring or whatever. I'll leave the UI design to you all...front-end is not my forte by any means :)

Great work btw, really happy to see ES search being integrated into Laravel's site!

@mattstauffer
Contributor

@polyfractal I'm struggling to get the fragments more useful. I'm trying to follow this to see if FVH will help, but feeling completely lost.

here's where I am:

image

As you can see, I'm really struggling to get the fragments useful. Any ideas?

@polyfractal

I don't think FVH will help (and in fact, I would recommend avoiding FVH...it has a proclivity for exploding with stackoverflow exceptions). It will also double your index size.

I think the real problem is that the fragments are a mix of Title + Body, and without formatting they just run into each other and it looks confusing. If you look at the first result, the highlighted fragments correlate to a title hit, a TOC hit and a body hit. But they are all smooshed inline with no formatting, so it's hard to tell what you are reading. E.g. here is the corresponding page:

laravel

If you look at something like Github's search, they maintain formatting which makes it easier to read (especially code fragments):

screen shot 2015-02-25 at 7 24 49 am

In contrast, Google only supplies a single fragment...but that fragment is always taken from the body of the text (and they try to avoid code snippets imo):

screen shot 2015-02-25 at 7 28 14 am

On the other hand, StackOverflow doesn't format code snippets, but they do appear to try and keep fragments from only the body (and make the highlighted section very dense and relatively large):

screen shot 2015-02-25 at 7 38 49 am

So I would personally try something like this (with full knowledge that it's easier to say than do :) ):

  • Only show highlighted fragments from the body. If there are no body highlights, just show the article title and the first XX characters from the doc's body (because it will be obvious the title matched)
  • Try to avoid highlighting TOC-style data...is this in a separate field or the body?
  • Either format code snippets, or try to parse them out so they don't display at all.
  • Maybe try only one fragment? Github shows two, but always in the same "code block" panel, while Google, SO and others only shows one.

Edit: re-reading my reply...this could be a decent amount of work, might be worth punting on highlighting until v2?

@mattstauffer
Contributor

@polyfractal Great note RE FVH. I was mainly targeting it because it seemed to have a lot of facility for better differentiating where your boundaries are, which I hoped to use to better split body from title because of exactly the issues you were describing.

The problem with the "only from the body" idea is that those titles are part of the body... unless you mean "from the body but then programatically exclude anything wrapped in an <h2> or an <h3>"?

TOC is all in the same field, but I might be able to exclude it from the body.plain which we use for fragments.

Single fragment could make sense, but I was worried that with pages this long that might lose some of the context. Different sections of these pages can be wildly different, so the first fragment might not be at all what the searchers are looking for...

In terms of the code blocks, I might need some help--a pair session, if you're down?--for figuring out how to make sure the fragment sets its boundaries outside of code snippets. Or to figure out what the best practices are for finding the context (in the original body) of the fragment so you can pull further out so that its immediate context (whether it's in a div, or a code block, or a title, or whatever) is fully returned.

I agree, though, I'm wondering if fragments are 2.0 of this, and 1.0 is just my original verison, where it just shows links and that's it. I'll probably comment out fragments, do a merge conflict resolution, and then ping taylor. Thanks so much @polyfractal !

EDIT: oh, the page title is even being pulled in because it's also in the body. Forgot that--great point, and I can strip that out of the body.plain result.

@mattstauffer
Contributor

OK, I did a bit of work to tweak the output and also tweak how we're filling the body.plain that it's getting the highlights from. It's a bit prettier, but still needs some work:

image

It's now two snippets, joined with ellipses, and with linebreaks marked lightly. I dropped the title and the TOC.

@mattstauffer mattstauffer Update body.plain generation to be smarter and drop title and TOC; up…
…date display to concatenate fragments and show line breaks/etc. Still hacky as crap though.
d78aa0f
@polyfractal

++ looks much better! I think this is a great v1

Great note RE FVH. I was mainly targeting it because it seemed to have a lot of facility for better differentiating where your boundaries are, which I hoped to use to better split body from title because of exactly the issues you were describing.

TBH, I don't have a ton of experience with FVH, I've always avoided it because of the stackoverflow issues. But you're right, it does have better control over where to split things.

The problem with the "only from the body" idea is that those titles are part of the body... unless you mean "from the body but then programatically exclude anything wrapped in an <h2> or an<h3>"?

Hehe...I had no firm plan in mind :) I've seen it solved three ways:

  1. At data input, using some kind of preprocessor that strips out or extracts data you are interested in. Usually stored as another field (e.g. "body_for_highlighting")
  2. Using an analyzer chain that does tag stripping and pattern matching, etc. This tends to use multi_fields so you have "body" and "body.highlighting"
  3. Post-processing in the client to strip out things you don't want.

Usually it's easier to do 1. or 2., since it simplifies the highlighting portion later.

In terms of the code blocks, I might need some help--a pair session, if you're down?--for figuring out how to make sure the fragment sets its boundaries outside of code snippets. Or to figure out what the best practices are for finding the context (in the original body) of the fragment so you can pull further out so that its immediate context (whether it's in a div, or a code block, or a title, or whatever) is fully returned.

Would be happy to help, but I'm going to be super busy for the next week or two getting ready for Elasti{Con}. And this is brave new territory for me as well, I usually quit at the simple highlighter fragment stage and call it a day :)

@mattstauffer
Contributor

@polyfractal Thanks man. Yah, I'm already using multi_fields and stripping out some crap, but I had named it body.plain, so good tip on specifying this as being the highlighting version.

No worries on timing. I'll hit you up in a few weeks probably about a v 2.

Thanks for sharing so much wisdom!!!!

@mattstauffer
Contributor

Note on this PR: My Forge instance's ElasticSearch keeps shutting off and I don't know why. So let's not merge this until I can solve that.

@mattstauffer
Contributor

@jackmcdade Any style input on the latest? If you're comfortable with it, once I have this "elastic search shutting down" thing resolved, I'll call this PR ready to merge for a v1 of search.

@jackmcdade
Member

I think it looks quite nice, good work!

@mattstauffer
Contributor

Thanks @jackmcdade! :) Really appreciate your help earlier.

Just tweaked the title size of each response a wee bit:

image

@jackmcdade
Member

Mmm cookies.

👍

@spamoom
spamoom commented Feb 25, 2015

Could we pull in mattstauffer#1 as part of this PR? I can imagine it'd speed up searching for quite a few people!

@mattstauffer
Contributor

@spamoom Thanks for the reminder. I wanted to ask you to PR that one separately to the repo, rather than to my pull request, so that they can decide on whether or not they want the keyboard shortcut separately from merging this in--as not to conflate the two.

Great work on that PR, btw!

@mattstauffer mattstauffer commented on an outdated diff Feb 25, 2015
resources/views/partials/search-results.blade.php
@@ -0,0 +1,49 @@
+<h1>Results for "{{{ $keyword }}}"</h1>
+<style>
+ .search-results {
@mattstauffer
mattstauffer Feb 25, 2015 Contributor

Note: There will be much refactoring before this is "merge-able."

@KaneCohen

Hey, Matt. Do you run Marvel to monitor ES? Try disabling it if you do, because it could eat a lot of memory and lead to "Out of memory error" which in turn might kill ES instance.

@mattstauffer
Contributor

@KaneCohen thanks--sadly it's not running. You can see the ES install script here: http://forgerecipes.com/recipes/29

It hasn't reset yet today...

@KaneCohen

What about syslog? Anything helpful?

I've had to deal with dying Es and it was all related to the memory.

Maybe set max memory consumption via environment variables

export ES_MAX_MEM=256m

Just an example, set men use to something like a half of your RAM.

@polyfractal

ES_HEAPSIZE can be used too...it sets both the min and max heap size to the same value (which is recommended).

If something is exploding in ES, the elasticsearch logs might have a sign too (e.g. OutOfMemoryException). Another common problem on new installations is running out of file descriptors (it'll start spamming your logs when you run out of FDs)

@mattstauffer
Contributor

@KaneCohen nothing in Syslog.. skimmed and also grepped elastic and couldn't find anything. But I can't say that I knew exactly what to look for.

I'll give it a day or two and see if I can see the thing happen; if so, I might need to do some load testing or something, and I'll try both the ES_MAX_MEM and also the ES_HEAPSIZE @polyfractal suggested. Hardest part is that I don't know how to trigger it yet.

Thanks guys.

Also, no problems in the elasticsearch logs.

@mattstauffer
Contributor

OK, @KaneCohen and @polyfractal were right: After a few days of running, Java runs out of memory (on my 512MB or 1GB box). Errors: https://gist.github.com/mattstauffer/59e74d0a6b57c7999689

I'll set the ES_HEAPSIZE to half of the box's power and then let it run for a few days. I wish I knew exactly what was causing it to grow....

@polyfractal

If you gist up curl -XGET localhost:9200/_nodes/stats after ES has been running for a while, I can take a look to see what's eating up mem. Your queries should be pretty memory lightweight, since they are just full-text search. Is the query in this PR the only query being run?

The default ES_HEAP_SIZE is set to 1gb, so if you were running on 512 or 1gb I imagine the JVM just freaked out at some point.

@mattstauffer
Contributor

@polyfractal Yah, there's almost no queries, and no cron, and no one's really using this site. BUT I just looked.. it's a 512MB box.. so that's probably it. I'll probably dump it and re-spin-up a 1GB box. Thanks--and I'll also paste some stats. Thanks for all your help!

@KaneCohen

Matt, just in case, set max memory usage to 512 on new box. Given chance Es would eat whole 1g of available memory (default value for max memory).

@mattstauffer
Contributor

OK, new server, 1GB, edited /etc/default/elasticsearch and set ES_HEAPSIZE to 512m.

Then restarted ES with sudo /etc/init.d/elasticsearch restart.

Will let it run for a few days, and try to check the stats command. Checking in in a few.

@mattstauffer
Contributor
@themccallister

This is fantastic! Would be a welcome feature to the docs IMHO.

@mattstauffer
Contributor

OK, it's been 4 days and it's still running smoothly. That was it--just limiting the RAM usage. Updating the directions in the PR description and gonna call this ready. Thanks so much @KaneCohen and @polyfractal!

@taylorotwell: This beast is finally ready. 👏

@mattstauffer
Contributor

@themccallister Thanks Jason! :)

@mattstauffer mattstauffer Extract Search Hit functionality to a class, and extract search CSS t…
…o Sass.

Update ellipses to only append and prepend to fragments when fragment
isn't at the end/beginning of the body.
6bc993f
@mattstauffer
Contributor

OK, with that commit I'm actually done. 👍

@GrahamCampbell

lol, not the same at all :)

Owner

HA, yah, whoops. :) Spelling is difficult some times. ;)

@mattstauffer

This feels dirty. Do I have to do some magic to expose this private property for testing? Or make a method just for exposing it publicly? All options feel janky.

@adamwathan you are my testing guru unicorn, please teach me your ways

@mattstauffer I would probably just keep it private and test against the hardcoded string honestly.

There's some value in getting it from the Hit class to reduce maintenance costs if that string changes frequently, but there's also a case to be made for being explicit in the test, and also thinking of it as a sort of blackbox operation.

We do want to know that it starts with <span class="search-ellipsis">...</span>, rather than starts with whatever the Hit says it's supposed to start with. If you added a getEllipsisMarkup method or something and somehow there was a bug in that method itself, your tests might pass while the results contained different ellipsis markup than you were expecting.

So I think using the string directly in the tests and keeping the property private is the way to go, even though it introduces a bit of conceptual duplication. I don't think duplication in tests is an automatic evil until it starts costing you time in maintenance, because the value of being immediately clear and obvious and having everything that matters to the test contained within the one function is really valuable.

@mattstauffer
Contributor

Frustratingly, the indexer test echoes the "Indexing artisan.md" line mid-tests. Not sure how to get rid of that; googled around and no luck.

@themccallister

Pinging everyone about this pull request. @mattstauffer is that the last hang up? I might pull it down tomorrow and try my hand at it.

@mattstauffer
Contributor

@themccallister No hangups. I mean, sure, feel free to fix that, but the only thing remaining here is @taylorotwell's approval & merging when he has time and energy to get the server up and running.

@montogeek
Contributor

What happened with this?

@Craytor
Craytor commented Apr 29, 2015

Why hasn't this been completed yet? I really think the docs needs this functionality.

@Wiejeben
Wiejeben commented May 7, 2015

+1, way better then having to go to Google all the time and ending up in older docs

@maxiloc maxiloc referenced this pull request May 27, 2015
Closed

Add documentation search #37

@taylorotwell
Member

Closing this for now in light of Algolia search PR.

@mattstauffer
Contributor

For those who asked, the folks at Algolia took this pull request and then re-worked it to use Algolia search, which means A) Taylor doesn't have to maintain his own Elastic Search server, and B) we get a pretty UI for free. 👍 #37

@tfevens
tfevens commented Jun 8, 2015

#win-win 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment