Technical Tutorial on Search

Zoe LeBlanc edited this page May 28, 2020 · 6 revisions

Full-Text Search Architecture Overview

The intended audience of this guide is the technical team of The Programming Historian, so it assumes that you are already moderately familiar with Jekyll and understand our editorial & translation workflows. It may still be useful to outside readers.

Initial Goals for Search Architecture

  • Optimize for fast results and page loads
  • Functional for multiple languages
  • Integrate with existing filter architecture
  • Not add considerably to site building times
  • Add as little JavaScript bloat as possible

Overview of Original Filter Architecture

Prior to building the search feature, all filtering on lesson-index.html was handled by List.js, a JavaScript package for sorting and filtering HTML lists. Much of this original architecture was built by Fred Gibbs, Brandon Walsh, and Matthew Lincoln (you can view the history of the file here https://github.com/programminghistorian/jekyll/commits/gh-pages/js/lessonfilter.js).

Setting the class list on the lessons ul element allowed lessonfilter.js to use List.js. That file consisted mainly of the wireButtons() function, which controlled the lesson display through the featureList variable and listened for button clicks on topics, activities, and reset. Any button click updated featureList and added new URI parameters (unless reset was clicked). The URI parameters let users leave the lesson page and return with their initial filters reloaded.
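The round trip of filter state through URI parameters can be sketched as follows. This is a minimal illustration that uses the standard URL API rather than the URI.js calls in lessonfilter.js; the helper names and the topic value "mapping" are made up for the example, and treating topic and activity as mutually exclusive follows the filter code shown later in this guide.

```javascript
// Sketch only: standard URL API, not the URI.js calls used in lessonfilter.js.
// The parameter names "topic" and "activity" come from the guide; the rest is illustrative.

// Return a new URL string with exactly one filter applied.
function applyFilterToUrl(href, filterType, value) {
  const url = new URL(href);
  url.searchParams.delete('topic');
  url.searchParams.delete('activity');
  url.searchParams.set(filterType, value);
  return url.toString();
}

// Read the filter back, as the page would on a later visit.
function readFilterFromUrl(href) {
  const params = new URL(href).searchParams;
  if (params.has('topic')) return { type: 'topic', value: params.get('topic') };
  if (params.has('activity')) return { type: 'activity', value: params.get('activity') };
  return null;
}

const filtered = applyFilterToUrl('https://programminghistorian.org/en/lessons/', 'topic', 'mapping');
console.log(filtered);
console.log(readFilterFromUrl(filtered));
```

Because the filter lives in the URL, a bookmarked or shared link carries the same state as the page the user left.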

List.js does include the ability to do searching, and an example of implementing search with List.js is available on this closed PR https://github.com/programminghistorian/jekyll/pull/1720. However, this solution did not work well for fuzzy searching and did not allow us to display search results instead of abstracts. For that reason, we decided to keep the initial List.js architecture and build on it with Lunr.js, a JavaScript library for full-text search in static sites. Lunr has extensive documentation and has been well maintained for a number of years. Nonetheless, we hope this documentation will help when, inevitably, we need to update the filter and search architecture in a few years.

Generating Search Corpora

Lunr requires access to the source documents (in our case, lessons) to build a search corpus. This currently happens in search.json in the layouts folder. Each language folder has its own search.json file that uses this layout and contains only:

---
layout: search
skip_concordance: true
---

In the layout search.json file, we generate the search corpus, using Liquid to loop through and capture all the lessons in a language. For the search content, we combine title, content, and abstract into search_block and assign that to the body key in our JSON object.

[
{% assign corpus = site.pages | where: "layout", "lesson" | where: "lang", page.lang %}
{% for page in corpus %}
{% capture search_block %}
{{ page.title }}
{{ page.content | markdownify }}
{{ page.abstract | markdownify }}
{% endcapture %}
{
  "id": {% increment counter %},
  "url": {{ page.url | absolute_url | jsonify }},
  "title": {{ page.title  | jsonify }},
  "body": {{ search_block | strip_html | jsonify }}
}{% if forloop.last %}{% else %},{% endif %}
{% endfor %}
]

One thing to note is that we are currently saving the absolute url as our url value. If we change the formatting of URLs we may need to revisit some of the existing logic in lessonfilter.js.

If in the future we want to allow search by authors, all that would require is including page.authors in our JSON. Once the site is built, search.json produces a JSON file of lessons for each language, accessible at https://programminghistorian.org/{language}/search.json. For an example, see https://programminghistorian.org/fr/search.json. When a new language is added to PH, a new search.json file must be added to its language folder for the relevant corpus to be generated.
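Since the url value is later used to join index results back to the corpus, it can be worth sanity-checking a generated search.json. The helper below is a hypothetical sketch, not part of the PH code base; it assumes only the four keys produced by the layout above, and the sample entries are made up.

```javascript
// Hypothetical helper: checks the shape of a parsed search.json array.
function validateCorpus(corpus) {
  const problems = [];
  const seenUrls = new Set();
  corpus.forEach((doc, i) => {
    // Every entry should carry the four keys written by the layout.
    for (const key of ['id', 'url', 'title', 'body']) {
      if (doc[key] === undefined || doc[key] === null) {
        problems.push(`entry ${i}: missing "${key}"`);
      }
    }
    // url must be unique because it is the join key between index and corpus.
    if (seenUrls.has(doc.url)) problems.push(`entry ${i}: duplicate url ${doc.url}`);
    seenUrls.add(doc.url);
  });
  return problems;
}

// Made-up sample entries; the second one is deliberately broken.
const sampleCorpus = [
  { id: 0, url: 'https://example.org/en/lessons/a', title: 'Lesson A', body: 'Lesson A text' },
  { id: 1, url: 'https://example.org/en/lessons/a', title: 'Lesson B' },
];
console.log(validateCorpus(sampleCorpus));
```

An empty array means the corpus has the expected shape; any strings returned describe the entries that need attention.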

Generating Search Indices

Once the corpora are generated, we can start creating search indices. Most static sites using Lunr generate their search index on site build. We did not want to adopt that approach for a few reasons: we wanted to avoid adding node modules and a Rakefile to the main repository, and to limit site build times (which are already quite long due to the translation looping logic). As an alternative, the code for building search indexes is housed in a separate repository, search-index.

This repository contains a small NodeJS application that uses the pre-built corpora to create a separate index file for each language. The entirety of the code is in index.js and clocks in at 41 lines of code. Essentially, the application reads in the required dependencies and then loops through the corpora to create the indices. If a new language is added, the link to its corpus should be added to the searchCorpora variable.

let searchCorpora = ['https://programminghistorian.org/en/search.json', 'https://programminghistorian.org/fr/search.json', 'https://programminghistorian.org/es/search.json'];

In the loop, we use axios to get the json files, and then build the search index using Lunr's builder syntax.

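// The language code is the second-to-last URL segment, e.g. 'fr' in .../fr/search.json
let language = searchFile.split('/').reverse()[1];
const idx = lunr((builder) => {
  // English is Lunr's default; other languages need the matching language plugin
  if (language != 'en') builder.use(lunr[language]);
  // Lunr keeps a single ref, so the later call wins and 'url' is the effective reference
  builder.ref('id');
  builder.ref('url');
  builder.field('title');
  builder.field('body');
  // Keep match positions so the front end can build highlighted snippets
  builder.metadataWhitelist = ['position'];

  // Add every lesson in the fetched corpus to the index
  searchBuilder.forEach(function (doc) {
    builder.add(doc);
  }, builder);
});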

If the corpus is not in English, we use the Lunr languages module for stemming and stop words. While Lunr does contain a multi-lingual module, we use the individual language module (so French or Spanish currently) specified by the corpus, because our lessons are not mixed in their use of languages. For more information about how Lunr provides language support, see this documentation https://lunrjs.com/guides/language_support.html. Once the indices are created, we write them to the indices/ folder as JSON files.
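To make the language handling concrete, the sketch below isolates the two decisions the loop makes for each corpus URL: which language code it belongs to, and whether a language plugin is needed. The function names are illustrative, and the fourth URL is a hypothetical new language added only to show what changes when one is introduced.

```javascript
// Same expression as in index.js: the language code is the second-to-last path segment.
function languageFromCorpusUrl(searchFile) {
  return searchFile.split('/').reverse()[1];
}

// English is Lunr's built-in default, so only other languages need a plugin.
function needsLanguagePlugin(language) {
  return language !== 'en';
}

const searchCorpora = [
  'https://programminghistorian.org/en/search.json',
  'https://programminghistorian.org/fr/search.json',
  'https://programminghistorian.org/es/search.json',
  // Hypothetical: what adding a new language would look like.
  'https://programminghistorian.org/pt/search.json',
];

const plan = searchCorpora.map((searchFile) => {
  const language = languageFromCorpusUrl(searchFile);
  return { language, plugin: needsLanguagePlugin(language) };
});
console.log(plan);
```

Adding a language therefore involves the new corpus URL and, for non-English corpora, making sure the matching Lunr language module is installed and loaded.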

To run this file locally, you can simply clone the repo, and run:

npm install
npm start

Currently, we use TravisCI to run the app every night and regenerate the indices. We store the relevant Travis token in the GitHub repo.

Control Flow for Search Results and Filtering Lessons

Once the corpora and indices are generated, the remaining logic is used to control how lessons are displayed on the main page. This primarily involves three files: lesson-index.html, lesson-describe.html, and lessonfilter.js.

lesson-index.html is responsible for looping through lessons and displaying them on the main lessons page. In addition to the original filter logic, there is now logic to show a START SEARCHING button that controls whether we load search or not (mainly to help those on slower connections). To limit page bloat further, we only load the Lunr languages module for the respective page.lang. If the START SEARCHING button is clicked, it displays the search input and button, though initially these are disabled while the index and corpus load. There's also a small info button that when clicked displays additional instructions for how to click. This file also contains SVG code for the pre-loader animation that displays if a user uses the return button to load an earlier search query.

TO NOTE: It is crucial that the search inputs remain outside of the <ul class="list"> element; otherwise they will be subject to List.js's search logic.

lesson-describe.html displays the actual lesson title and abstract. To display the search results and order them by score, we have two hidden elements, one for the search results and one for the score. We use display: none to ensure that these elements are not included in the document flow.

The majority of the logic for displaying search and filter results is contained in lessonfilter.js. The graph below outlines this control flow (easier to see if you open in a new window).

[Figure: Control flow for PH search]

The primary entry point is the wireButtons() function that runs on each page load, and displays the default lessons that are sorted by date. From this point the user can either use the existing filter buttons to drill down into lessons, or click the START SEARCHING button to enable search.

If a filter is clicked, then the '.filter' code runs, checking if the button clicked is for topic or activity and updating the URI parameters. Finally it simulates a search button click, since the majority of logic for filtering now lives in that code block.

if (filterType === 'activities') {
  uri.removeSearch("topic")
  uri.setSearch("activity", type)
} else {
  uri.removeSearch("activity")
  uri.setSearch("topic", type)
}

// Record the updated URI in the browser history without reloading the page
history.pushState(stateObj, "", uri.toString());
// Use search to perform filtering
$("#search-button").click();

Alternatively, if the START SEARCHING button is clicked, it loads the corpus and index with the async loadSearchData() function, which also controls displaying the search input and button. It is crucial that this function be async for slower connections and for the pre-loader animation that runs if search is loaded from URI parameters.

// If search is true, show the pre-loader graphic to allow search results time to load
preloader.css('visibility', 'visible');
$('#search').val(search);
loadSearchData().then(() => {
  preloader.fadeOut(1500);
  $('#search-button').click();
});
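The value of making the loader async can be illustrated with a small sketch that requests the corpus and the index in parallel and resolves once both have arrived. Everything here is an assumption for illustration: the function name, the injected fetchJson argument, the file paths, and the returned shape are hypothetical, and the real loadSearchData() in lessonfilter.js may be organised differently.

```javascript
// Sketch only: fetchJson, the paths, and the returned shape are illustrative assumptions.
async function loadSearchDataSketch(fetchJson, corpusUrl, indexUrl) {
  // Request both files at once so the slower one determines the total wait.
  const [corpus, serializedIndex] = await Promise.all([
    fetchJson(corpusUrl),
    fetchJson(indexUrl),
  ]);
  // In the browser, a serialized index would typically be passed to lunr.Index.load().
  return { corpus, serializedIndex };
}

// Stand-in for a network call so the sketch runs without a browser or server.
const fakeFiles = {
  '/en/search.json': [{ id: 0, url: '/en/lessons/a', title: 'Lesson A', body: 'text' }],
  '/indices/en.json': { version: 'stub' },
};
const fakeFetchJson = (url) =>
  new Promise((resolve) => setTimeout(() => resolve(fakeFiles[url]), 10));

const loaded = loadSearchDataSketch(fakeFetchJson, '/en/search.json', '/indices/en.json');
loaded.then(({ corpus, serializedIndex }) => {
  // In the page, this is the point where the search input and button would be enabled.
  console.log(corpus.length, serializedIndex.version);
});
```

Because the function returns a promise, callers such as the pre-loader code above can chain .then() to hide the animation and trigger the search only after the data is ready.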

Once the search data is loaded, the search input is enabled so that users can enter their terms, submitting with either the enter key or the search button. When a search term is entered or a filter button is clicked, the search-button code block runs. First, it checks whether the search string is empty.

If it is empty, the code removes search from the URI params and checks whether any filters are clicked (which is crucial for users pressing the enter key with an empty search input). If filters are clicked, the code filters featureList to display lessons that match the topic or activity, while also resetting each item's score to zero. Finally it calls the applySortFromURI() function to apply any sorting.

featureList.filter((item) => {
  item.values().score = 0;

  let topicsArray = item.values().topics.split(/\s/);
  let condition = params.topic ? topicsArray.includes(type) : item.values().activity == type;
  return condition

});

applySortFromURI(uri, featureList);

If no filters are clicked (along with an empty search input), then it simulates a click on the reset filters button. The '#filter-none' handler resets the search inputs, resets the URI parameters to default, resets each featureList item's score to zero, and resets sorting to the default, by date.

Because we are updating the HTML to hide the search results and show the original abstracts again, we have to call resetSearch() twice to force the HTML refresh.
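The branching of the search-button handler, including the non-empty case described next, can be summarised in a small decision function. This is a hypothetical helper that mirrors the prose; the real handler performs these steps inline rather than returning a label, and trimming whitespace from the search string is an assumption of this sketch.

```javascript
// Hypothetical helper: returns which branch the search-button handler would take.
function decideSearchAction(searchString, params) {
  const hasFilter = ['topic', 'activity'].some((key) => key in params);
  if (searchString.trim() !== '') return 'search'; // update the URI and call lunrSearch()
  if (hasFilter) return 'filter';                  // filter featureList and reset scores
  return 'reset';                                  // simulate a click on the reset button
}

console.log(decideSearchAction('mapping', {}));
console.log(decideSearchAction('', { topic: 'mapping' }));
console.log(decideSearchAction('', {}));
```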

However, if a search string is entered, then search-button updates the URI to include the search string and calls the lunrSearch() function (passing in the search string, index, corpus, featureList, and URI). lunrSearch first passes the search string to the index to get matching lessons, and then combines the data from corpus and index, joining them on url (which is what we set as the reference for Lunr when we generate the search index).

// Get lessons that contain search string using lunr index
const results = idx.search(searchString);

// Merge each Lunr result with its corpus entry, joined on url
let docs = results.filter(result => corpus.some(doc => result.ref === doc.url)).map(result => {
  let doc = corpus.find(lesson => lesson.url === result.ref);
  return {
    ...result,
    ...doc
  };
});
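The same join can also be written with a Map keyed on url, so that each result is looked up once instead of scanning the corpus twice. This is an alternative sketch, not the code in lessonfilter.js; the function name is illustrative and the sample data is made up to match the shapes involved (Lunr results carry ref, score, and matchData).

```javascript
// Sketch: join Lunr results to corpus entries on url using a Map lookup.
function joinResultsWithCorpus(results, corpus) {
  const byUrl = new Map(corpus.map((doc) => [doc.url, doc]));
  return results
    .filter((result) => byUrl.has(result.ref))                    // drop results with no corpus entry
    .map((result) => ({ ...result, ...byUrl.get(result.ref) })); // merge score and lesson fields
}

// Made-up data for illustration.
const corpus = [
  { id: 0, url: '/en/lessons/a', title: 'Lesson A', body: 'Text of lesson A' },
  { id: 1, url: '/en/lessons/b', title: 'Lesson B', body: 'Text of lesson B' },
];
const results = [
  { ref: '/en/lessons/b', score: 1.2, matchData: {} },
  { ref: '/en/lessons/missing', score: 0.4, matchData: {} },
];

const docs = joinResultsWithCorpus(results, corpus);
console.log(docs.map((doc) => `${doc.title}: ${doc.score}`));
```

The order of results is preserved, so docs stays sorted by Lunr's score.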

docs now contains the list of search results, and we loop through them to generate the search snippets, using the HTML mark tag to highlight the search terms. However, rather than displaying them immediately, we store the results in the elements array so that we can first update featureList (which is crucial for triggering an HTML refresh).

// Saves element name and results to array to be displayed once featureList is updated.
elements.push({ 'elementName': elementName, 'innerResults': inner_results });
// Updates score element for search results
$(`span[id="${elementName}-score"]`).html(doc.score);
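As an illustration of how a highlighted snippet can be built, the sketch below wraps matches in mark tags using [start, length] pairs, which is the form in which Lunr reports positions under result.matchData.metadata[term][field].position when 'position' is whitelisted. The function name and the context width are assumptions of this sketch, and the actual snippet code in lessonfilter.js may differ. It assumes the body text is already stripped of HTML, as it is in search.json.

```javascript
// Sketch: build one highlighted excerpt per match position.
// positions is an array of [start, length] pairs into text.
function buildSnippet(text, positions, context = 20) {
  return positions
    .map(([start, length]) => {
      const from = Math.max(0, start - context);
      const to = Math.min(text.length, start + length + context);
      const before = text.slice(from, start);
      const match = text.slice(start, start + length);
      const after = text.slice(start + length, to);
      // Add ellipses only where the excerpt cuts into the surrounding text.
      const lead = from > 0 ? '...' : '';
      const tail = to < text.length ? '...' : '';
      return `${lead}${before}<mark>${match}</mark>${after}${tail}`;
    })
    .join(' ');
}

const body = 'Learn to search text with Lunr';
// "search" starts at character 9 and is 6 characters long.
console.log(buildSnippet(body, [[9, 6]], 5));
```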

Next we filter featureList, checking if a filter has been checked and matching items in featureList to those in our search results (docs).

// Filter featureList to only show items from search results and active filters
const params = uri.search(true);
let type = params.activity ? params.activity : params.topic;
featureList.filter((item) => {
  let topicsArray = item.values().topics.split(/\s/);
  let condition = params.topic ? topicsArray.includes(type) : item.values().activity == type;
  // return items in list that are in search results and filter if clicked
  return docs.find((doc) => {
    if (doc.title === item.values().title) {
      // update score values for item
      item.values().score = doc.score;
      // Could simplify to Object.keys(params).length > 1 here, but in case we add more URI values this explicitly checks for filters along with search
      return ['topic', 'activity'].some(key => Object.keys(params).includes(key)) ? ((doc.title === item.values().title) && condition) : (doc.title === item.values().title);
    }

  });
});

The double return statements are crucial because of the nature of filter (which won't update the results without a return). We use the find() method since we know that there will be only one match for each search result and featureList item. We then check if the URI contains a filter parameter, and if it does we ensure that the search result and featureList item also match the filter condition. Any new filters added to our filtering logic will need to be included in ['topic', 'activity'].some(). If there are no filters, we simply return the matches.
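The combined search-and-filter check can be read more easily as a standalone predicate. The version below is a hypothetical refactor for illustration only: it takes plain objects instead of List.js items, and it assumes that topic and activity are never set at the same time, as in the filter code earlier. The sample lessons and values are made up.

```javascript
// Any new filter added to the filtering logic would need to be listed here.
const FILTER_KEYS = ['topic', 'activity'];

// Hypothetical predicate: should this list item stay visible?
function keepItem(values, docs, params) {
  const doc = docs.find((d) => d.title === values.title);
  if (!doc) return false;        // not a search hit
  values.score = doc.score;      // carry the Lunr score onto the list item
  const hasFilter = FILTER_KEYS.some((key) => key in params);
  if (!hasFilter) return true;   // search only, no filter to apply
  return params.topic
    ? values.topics.split(/\s/).includes(params.topic)
    : values.activity === params.activity;
}

// Made-up list items and search results.
const items = [
  { title: 'Lesson A', topics: 'mapping python', activity: 'analyzing', score: 0 },
  { title: 'Lesson B', topics: 'web-scraping', activity: 'acquiring', score: 0 },
];
const hits = [
  { title: 'Lesson A', score: 2.5 },
  { title: 'Lesson B', score: 1.1 },
];

const searchOnly = items.filter((item) => keepItem(item, hits, { search: 'maps' }));
const searchAndTopic = items.filter((item) => keepItem(item, hits, { search: 'maps', topic: 'mapping' }));
console.log(searchOnly.map((item) => item.title));
console.log(searchAndTopic.map((item) => item.title));
```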

Finally, we update our sort to indicate we are sorting by search score and re-sort the featureList by search scores. We then hide the original abstracts and display the search result snippets.

// Sort featureList by score
featureList.sort('score', { order: "desc" });

// Hide original abstracts and update Filtering to show number of results
$('.abstract').css('display', 'none');
$('#results-value').text($(this).text().split(' ')[0] + '(' + featureList.update().matchingItems.length + ')' + " ");
$('#results-value').css('textTransform', 'uppercase');

// Display updated search results
elements.map((elm) => {
  $(`p[id="${elm.elementName}-search_results"]`).css('display', '');
  $(`p[id="${elm.elementName}-search_results"]`).html(elm.innerResults);
});

Overall, this control flow allows us to leverage the best aspects of Lunr (namely extensive search functionality) with the best of List.js (seamlessly updating HTML lists).

Long-Term Maintenance

Updated JavaScript Syntax Guidelines

For future updates to this code base, use the following best practices:

  • For variables, use either let or const
  • For event listeners on clicks, use click() (rather than .on('click')) for this binding
  • For named functions, use function NAME() syntax
  • For JavaScript methods, use fat arrows (=>)

These guidelines are intended for code consistency and are not firm rules (but please follow them if possible).
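A short, made-up example of code that follows these guidelines is shown below. The names are purely illustrative, and the click guideline is left out because it depends on jQuery being loaded in the page.

```javascript
// const for values that are never reassigned.
const lessons = [
  { title: 'Lesson A', score: 0.4 },
  { title: 'Lesson B', score: 1.7 },
];

// Named function: use the function NAME() form.
function sortByScore(items) {
  // Callback passed to a method: use a fat arrow.
  return [...items].sort((a, b) => b.score - a.score);
}

// let for values that are reassigned later.
let topTitle = null;
const sorted = sortByScore(lessons);
topTitle = sorted[0].title;
console.log(topTitle);
```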

Longer Term Issues

  • List.js is not actively maintained and may need to be replaced at some point https://github.com/javve/list.js
  • Currently we rebuild search indices once a day, but we could use webhooks and a small app to push changes on commit
  • Instead of using Lunr, we could use ElasticLunr https://github.com/weixsong/elasticlunr.js. The differences seem negligible given our current needs and it's unclear if ElasticLunr is still being actively maintained. However, it would easily integrate with our current infrastructure.
