Add generator for a search index #1853

Closed
wants to merge 1 commit into
base: master
from

6 participants
@digitalcraftsman
Member

digitalcraftsman commented Feb 13, 2016

The generator creates an index of all content
files and their metadata.

See #1635 #144

@digitalcraftsman

Member Author

digitalcraftsman commented Feb 13, 2016

The generator is working so far, but the implementation isn't finished yet (see TODO).

Is there anything fundamental to object to in the generator?

QUESTIONS:

  • Currently, there is only an option to turn the generator on or off. Does it make sense to skip regeneration in watch mode (hugo server) and only trigger it in production (hugo)?
  • Should I add an option for a custom path and filename for the index? At the moment the path is hardcoded.
  • Should more data be included in the index?
  • How would you set up a test function?

WISHLIST:

  • partial rebuilds of the cached index (not the json file). This would reduce overhead

/cc @rdwatters @bep @spf13

@rdwatters

Contributor

rdwatters commented Feb 15, 2016

@digitalcraftsman This is pretty fantastic. Thanks for working on what I see as a really powerful feature (and one I have kind of been nagging about).

Question for you: rather than just the ability to create an index that can be built/not built with a flag, how difficult would it be to just extend Hugo's abilities to write to any .json file using the same templating logic? Would a feature like this (Jekyll has the ability to write JSON files, which comes in pretty handy) slow down builds to the point of not being worth it?

Do you think writing to any-file.json rather than site-index.json would provide the most flexibility (ie, w/r/t using ajax, etc), or is the primary objective to allow for client-side search a la something like Tipue or lunr.js?

Also, sorry for the delayed response to your questions (in the order you presented them above):

  1. I think this depends on whether the goal is for search or for the ability to write to json in general. If the latter, seems to make more sense to write to the json file in both environments.
  2. Again, depending on the search-v-json-in-general idea, I think a convention of having a siteIndex.json (or whatever) in static is easy enough.
  3. This is a tough one. I'll have to defer to @bep and @spf13, but this might depend on thoughts for the overall data model for V1. For example, there's talk on Discuss about preventing Hugo from building empty files in content directories that act more like a data directory (similar to a "collection" in Jekyll). If this is the case, I'm not sure how convenient excludefromindex will be on a per-md/yml level, or if it makes sense to exclude whole directories/content types, etc.
  4. These questions are getting progressively more out of my wheelhouse, haha😃 I'm pretty handy with JavaScript, but my business card says "Digital Content Manager" and not "developer." Totally deferring to you guys on this one.

Again, thanks again, brother. I think HUGO is easily the best SSG around. Cheers!

@digitalcraftsman

Member Author

digitalcraftsman commented Feb 16, 2016

rather than just the ability to create an index that can be built/not built with a flag, how difficult would it be to just extend Hugo's abilities to write to any .json file using the same templating logic? Would a feature like this (Jekyll has the ability to write JSON files, which comes in pretty handy) slow down builds to the point of not being worth it?

I think that would be a powerful addition to the current set of template functions. I'm not sure how much it would slow down the generation of pages. A possible implementation could use a global shared object, similar to Scratch, but with a few more parameters (filename, destination). Inside a template, a coder could specify, if necessary with if-else statements, what should be included in which JSON file.

Do you think writing to any-file.json rather than site-index.json would provide the most flexibility (ie, w/r/t using ajax, etc), or is the primary objective to allow for client-side search a la something like Tipue or lunr.js?

This is a good question. Why not take the best of both worlds? Just setting a config variable to true is the most user-friendly way, in my opinion. Your approach would allow much more flexibility, but the user or theme creator may need to include logic in many different places. Imagine a user who has different layouts for different content types: they would need to add the logic to the template file of each content type. Shortcodes would be a handy way to avoid redundant code.

But let's wait and see what the others think about this.
Thank you @rdwatters for sharing your thoughts and ideas.

I think HUGO is easily the best SSG around. Cheers!

This project has grown a lot in its rather short lifetime 😄. I'm curious what we will see in the v1.0 release.

@rdwatters

Contributor

rdwatters commented Feb 16, 2016

@digitalcraftsman Good points all round. I guess that I am ultimately bringing up two separate feature requests, and you are absolutely right that it could be the best of both worlds.

As far as the ability to write to json files in general, you're right that this would have to be a separate process in that forcing devs to write templating to account for all content/section areas in a single site-index.json would be more than a little tedious.

I like where you are going with the todo for excludefromindex at both the page and section/content level.

Oh, and thanks again😃

Oh, and @bep I just drastically edited this comment after you already replied to it. Sorry about that.

@bep

Member

bep commented Feb 16, 2016

Ability to write to json files in general.

There is an open issue somewhere about rendering custom content-types, like JSON, ical ... whatever. We should do that.

@digitalcraftsman

Member Author

digitalcraftsman commented Feb 16, 2016

There is an open issue somewhere about rendering custom content-types, like JSON, ical ... whatever. We should do that.

If you, @bep, agree with @rdwatters and me, we should consider these as two different issues. See #1128 (for ical, xcal). But there's no issue about writing content to a JSON file. Should I create a new issue?

@spf13

Contributor

spf13 commented Feb 16, 2016

These are two different issues.

The PR is effectively a sitemap in JSON which will enable lots of nice integrations.

A second issue is for Hugo to support rendering into variable and multiple different formats.
The second issue is a pretty considerable one that would require quite a bit of restructuring, including integrating the text.Template library alongside the html.Template one.

@spf13

hugolib/content_index.go Outdated
Title string `json:"title"`
Content string `json:"content"`
Permalink string `json:"permalink"`
Tags interface{} `json:"tags"`

@spf13

spf13 Feb 16, 2016

Contributor

We shouldn't assume these taxonomies are being used. I think this is a very limiting approach.
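
To illustrate the concern about hard-coded taxonomies, here is a minimal sketch of an entry type that carries whatever taxonomies a site defines instead of a fixed `tags` field. `IndexEntry`, `Taxonomies` and `encodeEntries` are hypothetical names invented for this example; they are not taken from the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// IndexEntry is a hypothetical alternative to the struct under review.
// Taxonomies are stored in a map keyed by taxonomy name, so no particular
// taxonomy (tags, categories, ...) is assumed to exist.
type IndexEntry struct {
	Title      string              `json:"title"`
	Content    string              `json:"content"`
	Permalink  string              `json:"permalink"`
	Taxonomies map[string][]string `json:"taxonomies,omitempty"`
}

// encodeEntries marshals the entries to a JSON array.
func encodeEntries(entries []IndexEntry) (string, error) {
	b, err := json.Marshal(entries)
	return string(b), err
}

func main() {
	entries := []IndexEntry{
		{Title: "Intro", Content: "Hello", Permalink: "/intro/",
			Taxonomies: map[string][]string{"series": {"getting-started"}}},
		// A page without any taxonomy simply omits the field.
		{Title: "About", Content: "About us", Permalink: "/about/"},
	}
	out, err := encodeEntries(entries)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

This only relaxes the taxonomy assumption; the remaining fields are still fixed at compile time.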

@spf13

hugolib/content_index.go Outdated
}
}

jsonIndex, err := json.Marshal(rawIndex.Pages)

@spf13

spf13 Feb 16, 2016

Contributor

While I see the benefit of using json.Marshal to generate the JSON, this approach is also quite rigid, since it explicitly defines the fields that will be rendered/generated.

What if we instead took an approach similar to how the sitemap works, using a template to generate the output? It would be much more flexible.

@rdwatters

Contributor

rdwatters commented Feb 21, 2016

@digitalcraftsman Spitballing on this, but is there utility in implementing a stopwords list when creating the index? Here's a decent default list.

http://www.ranks.nl/stopwords

If the intention is client-side search, it looks like it's the same stopwords used by Tipue and similar to the stopword filter for lunr.js. That said, if search results were designed to surface, say, the "description" key in front matter, SERPs would look weird if every definite and indefinite article were omitted from the page. Then again, maybe eliminating stopwords from the index before it's sent could make filesize smaller and potentially reduce demand on the client.

It goes without saying that internationalization efforts being worked on outside this thread would have a different list.

@moorereason

Contributor

moorereason commented Feb 21, 2016

If you want to remove stopwords in Go, check out https://github.com/bbalet/stopwords. It's multi-lingual and has already been discussed on the forums for adding a related posts feature (https://github.com/bbalet/gorelated).
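
For readers unfamiliar with the idea, a stopword filter can be sketched in a few lines of plain Go. This does not use the library linked above; the tiny word list and the `removeStopwords` name are made up for the example, and a real list would be language-specific.

```go
package main

import (
	"fmt"
	"strings"
)

// stopwords is a deliberately tiny, English-only sample list (an assumption
// for this sketch, not the list referenced in the thread).
var stopwords = map[string]struct{}{
	"a": {}, "an": {}, "and": {}, "of": {}, "the": {}, "to": {},
}

// removeStopwords drops every whitespace-separated word that appears in the
// stopword list, comparing case-insensitively. Punctuation is not handled.
func removeStopwords(text string) string {
	words := strings.Fields(text)
	kept := make([]string, 0, len(words))
	for _, w := range words {
		if _, skip := stopwords[strings.ToLower(w)]; skip {
			continue
		}
		kept = append(kept, w)
	}
	return strings.Join(kept, " ")
}

func main() {
	fmt.Println(removeStopwords("The index of a site")) // prints "index site"
}
```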

@bep

Member

bep commented Feb 22, 2016

The use of stopwords is within the role of the tokeniser/indexer. This PR is badly named, as it doesn't create a search index, it exports the content in a format suitable for indexing.

@digitalcraftsman

Member Author

digitalcraftsman commented Feb 24, 2016

This PR is badly named, as it doesn't create a search index, it exports the content in a format suitable for indexing.

After revisiting some discussions here and in the forum, I agree. As this PR is currently a WIP, it should just output a JSON file that is intended for searching the content (with lunr.js or similar tools). Using a stopword filter would consequently be the next step for optimizations.

As I discussed with @rdwatters before, we should create a separate jsonify template function that can query content the way the user wants.

@digitalcraftsman digitalcraftsman changed the title from "Add generator for JSON-based content index" to "Add generator for a search index" Feb 24, 2016

@digitalcraftsman digitalcraftsman referenced this pull request Mar 9, 2016

Open

Implement search #2

@digitalcraftsman

Member Author

digitalcraftsman commented Mar 12, 2016

Since we have a jsonify template func and a good proof-of-concept for a content index (thanks @bep), it would be perfect to do it this way in this PR. This would satisfy @spf13's wish for more flexibility.

However, I saw that @bep needed to create a new content file just to set the URL properly. Wouldn't it be better to add a saveas template func that saves a string, JSON object, etc. as a file under a given path:

{{ $contentList | jsonify | saveas "/index.json" }}

The path would be relative to static/.


With localization support in mind, it would be very easy to create a content index for just a single language. Depending on the current locale, scripts like lunr.js could fetch the content index for that locale and use it as an index.

It doesn't make sense to include Spanish content in the results for a Chinese user. But the setup is completely flexible thanks to the filter options.

/cc @bep @moorereason

@bep

Member

bep commented Mar 12, 2016

Yes, the extra content file is not good, we need better support for custom file types (json, ical etc.), but the answer isn't saveas (where would you call that from?)

@digitalcraftsman

Member Author

digitalcraftsman commented Mar 12, 2016

but the answer isn't saveas (where would you call that from?)

I would call it from inside a template, like in the example above:

{{ $contentList | jsonify | saveas "/index.json" }}

The function itself would have a signature like

func saveas(path string, data interface{}) error {}

@digitalcraftsman digitalcraftsman force-pushed the gohugoio:master branch from 49b4f8e to 93e41a1 Mar 30, 2016

@digitalcraftsman

Member Author

digitalcraftsman commented Apr 2, 2016

I revisited this issue and implemented the feature with a template, as @spf13 suggested. That gives users maximum flexibility. Kudos to @bep for implementing the jsonify template func and for providing a good starting point for the internal template.

I would appreciate a review. According to the contribution guidelines, the commit message should mention the modified package as a prefix. Since I modified multiple packages, which should I use?

@rdwatters you asked for an option to exclude certain pages. Have a look at the docs 😉

Furthermore, @rdwatters and @moorereason suggested using a stop word filter. Should this be realized with a template function (in a separate pull request)?

Last but not least I would like to keep an eye on the localization support (#1744). Having search results in multiple languages doesn't make sense in my opinion. Should we offer an option to generate a content index per locale?
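
As a sketch of what a per-locale index could mean in practice: group the pages by language and emit one file per group. The `Page` type with a `Lang` field and the `index.<lang>.json` naming scheme are assumptions for illustration only; nothing in this PR defines them.

```go
package main

import (
	"fmt"
	"sort"
)

// Page is a hypothetical page type that knows its language code.
type Page struct {
	Title string
	Lang  string
}

// groupByLang buckets pages by language code, preserving page order
// within each bucket.
func groupByLang(pages []Page) map[string][]Page {
	groups := make(map[string][]Page)
	for _, p := range pages {
		groups[p.Lang] = append(groups[p.Lang], p)
	}
	return groups
}

// indexFilename returns the (assumed) output name for one locale's index.
func indexFilename(lang string) string {
	return "index." + lang + ".json"
}

func main() {
	pages := []Page{
		{Title: "Hola", Lang: "es"},
		{Title: "Hello", Lang: "en"},
		{Title: "Adiós", Lang: "es"},
	}
	groups := groupByLang(pages)

	// Sort the language codes so the output order is stable.
	langs := make([]string, 0, len(groups))
	for lang := range groups {
		langs = append(langs, lang)
	}
	sort.Strings(langs)
	for _, lang := range langs {
		fmt.Printf("%s: %d page(s)\n", indexFilename(lang), len(groups[lang]))
	}
}
```

A client-side script could then request only the file that matches the visitor's locale.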

@digitalcraftsman digitalcraftsman referenced this pull request Apr 2, 2016

Closed

'Search' Feature #10

@moorereason

Contributor

moorereason commented Apr 2, 2016

First, my handle is moorereason. No need to spam whoever moore is.

Second, for the commit message prefixes, you want to use the primary affected package (I use that phrase in my updated but yet-to-be-merged contributing guide). In this case, commit bb688f7 would use hugolib, in my view, since that's where the most important change is made. Choose which package you feel is most relevant to call out in the commit message.

In your subsequent commits, I'd use commands, hugolib, and docs, respectively. The idea is to give someone looking over the git logs a quick identifier of where the changes are occurring without them having to read the full commit message or look at the diffs.

I get the feeling I'm going to need to update the contributing guide to give a fuller explanation and rationale for the subject prefix.

@digitalcraftsman

Member Author

digitalcraftsman commented Apr 2, 2016

I'm sorry for misspelling your handle.

The commit messages have been updated with their corresponding package as prefix. However, at first I just wasn't sure whether the commits should be squashed.

@digitalcraftsman

Member Author

digitalcraftsman commented Apr 2, 2016

I implemented the search feature in the material-docs theme and it works like a charm. But the usability of the default template could be improved.

Currently, we only link to the pages that match the search query. It would be much better if we could also link to the headers of the sections that contain (parts of) the query. MkDocs uses the headers as dividers for the content and adds each of them as a new search result.
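
The header-as-divider idea could be sketched roughly as follows. This is a simplified illustration, not how MkDocs or this PR implements it: `SectionEntry` and `splitByHeadings` are hypothetical names, only ATX-style Markdown headings (`# Title`) are recognized, and fenced code blocks containing `#` lines are not special-cased.

```go
package main

import (
	"fmt"
	"strings"
)

// SectionEntry is a hypothetical search record for one section of a page.
type SectionEntry struct {
	Heading string // empty for text that precedes the first heading
	Text    string
}

// splitByHeadings cuts Markdown source into one entry per heading, so each
// section can become its own search result.
func splitByHeadings(markdown string) []SectionEntry {
	var sections []SectionEntry
	current := SectionEntry{}
	var body []string

	// flush stores the section collected so far, skipping empty leading text.
	flush := func() {
		text := strings.TrimSpace(strings.Join(body, "\n"))
		if current.Heading != "" || text != "" {
			current.Text = text
			sections = append(sections, current)
		}
		body = body[:0]
	}

	for _, line := range strings.Split(markdown, "\n") {
		trimmed := strings.TrimLeft(line, "#")
		// A heading is one or more '#' characters followed by a space.
		if len(trimmed) < len(line) && strings.HasPrefix(trimmed, " ") {
			flush()
			current = SectionEntry{Heading: strings.TrimSpace(trimmed)}
			continue
		}
		body = append(body, line)
	}
	flush()
	return sections
}

func main() {
	src := "Intro\n# Setup\nInstall it.\n## Usage\nRun it."
	for _, s := range splitByHeadings(src) {
		fmt.Printf("%q: %q\n", s.Heading, s.Text)
	}
}
```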

@digitalcraftsman

hugolib/site.go Outdated
@@ -784,6 +791,8 @@ func (s *Site) initializeSiteInfo() {
GoogleAnalytics: viper.GetString("GoogleAnalytics"),
RSSLink: s.permalinkStr(viper.GetString("RSSUri")),
BuildDrafts: viper.GetBool("BuildDrafts"),
DisableSearchJSON: viper.GetBool("DisableSearchJSON"),
SearchIndexLink: viper.GetString("baseURL") + viper.GetString("searchuri"),

@digitalcraftsman

digitalcraftsman Apr 28, 2016

Author Member

@bep Are there any helper functions that can prepend the baseurl for the SearchIndexLink?

@moorereason

moorereason Apr 28, 2016

Contributor

Looks like s.permalinkStr() is used above for the RSSLink.

@digitalcraftsman

digitalcraftsman Apr 28, 2016

Author Member

That's what I did before. I printed the URL in a template and got http://localhost:1313/search/index.json/ instead of http://localhost:1313/search.json

@moorereason

moorereason Apr 28, 2016

Contributor

That sounds like a bug.

@digitalcraftsman

digitalcraftsman Apr 29, 2016

Author Member

@bep do you know if this behavior is intended or how it can be avoided?

@bep

bep Apr 29, 2016

Member

As to helper, see what is used by absURL template func.
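
For context on why a helper matters here: plain string concatenation of a base URL and a relative path, as in the diff line under review, produces a doubled or missing slash depending on whether the base ends in one. The sketch below shows one way to join the two with the standard library. `joinBaseURL` is a hypothetical name for this example; it is not the helper used by absURL.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// joinBaseURL appends rel to baseURL with exactly one slash between them and
// leaves the file name in rel untouched.
func joinBaseURL(baseURL, rel string) (string, error) {
	base, err := url.Parse(baseURL)
	if err != nil {
		return "", err
	}
	base.Path = strings.TrimSuffix(base.Path, "/") + "/" + strings.TrimPrefix(rel, "/")
	return base.String(), nil
}

func main() {
	for _, base := range []string{"http://localhost:1313", "http://localhost:1313/"} {
		link, err := joinBaseURL(base, "search.json")
		if err != nil {
			panic(err)
		}
		fmt.Println(link) // both iterations print http://localhost:1313/search.json
	}
}
```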

@moorereason

docs/content/templates/variables.md Outdated
@@ -158,6 +158,7 @@ Also available is `.Site` which has the following:
**.Site.Permalinks** A string to override the default permalink format. Defined in the site configuration.<br>
**.Site.BuildDrafts** A boolean (Default: false) to indicate whether to build drafts. Defined in the site configuration.<br>
**.Site.Data** Custom data, see [Data Files](/extras/datafiles/).<br>
**.Site.DisableSearchJSON** A boolean (by default false) to indicate wether to build a content index. Defined in the site configuration. Read more about the [search]({{< relref "templates/search.md" >}}) feature.<br>

@moorereason

moorereason Apr 28, 2016

Contributor

s/wether/whether/

@digitalcraftsman

digitalcraftsman Apr 28, 2016

Author Member

What do you mean with "s/wether/whether/"?

@moorereason

moorereason Apr 28, 2016

Contributor

Sorry, old sed/perl syntax shorthand. In English: wherever you have "wether", substitute "whether".

@digitalcraftsman

digitalcraftsman Apr 28, 2016

Author Member

I fixed the typo.

@bep

Member

bep commented Apr 29, 2016

As to the discussion of stop-words:

  1. I think it is in this case the responsibility of the search lib.
  2. Hugo could have used such a feature (but it is hard: I have seen some of the stop-words lists for Norwegian, and they are crappy), but then as a cross-cutting concern that could be used by others, in this case as a filter.

This PR should be about getting the data in a parseable format, aka JSON.

@derekperkins

Contributor

derekperkins commented Sep 16, 2016

Is this going to make it into 0.17?

@spf13 spf13 force-pushed the gohugoio:master branch to f9c70c0 Oct 7, 2016

@matcornic matcornic referenced this pull request Oct 19, 2016

Closed

Searching in all the site #12

@digitalcraftsman

Member Author

digitalcraftsman commented Dec 26, 2016

I'm closing this pull request in favor of #2828. Custom output types would be much more flexible. Users could create content in a format they want by using templates and by specifying the output type (e.g. JSON).

My approach would be too specific, and de facto deprecated once you can achieve the same with custom output types.

Nonetheless, the long discussion about this topic highlighted some points that should be considered in the future when someone creates a search template.

@digitalcraftsman digitalcraftsman deleted the digitalcraftsman:feature/content-index branch Dec 26, 2016
