Searching HTML contents #224

Closed
Zloka opened this issue Jul 10, 2023 · 4 comments

Zloka commented Jul 10, 2023

Hi! First off, thank you for a great library.

A new use-case for me would be to search HTML content. In practice, my data consists of what can be considered "pages", consisting of a title and some HTML content. Do you happen to have any suggestions as to how I should handle searching the HTML content? Will the library handle it well as such, or should I look to e.g. parse it into "plaintext" by stripping away tags and such?

Owner

lucaong commented Jul 10, 2023

Hi @Zloka , thanks for the kind words :)

In principle, there is nothing preventing you from indexing and searching HTML content with MiniSearch. In practice, you need to come up with a suitable pre-processing of the content, because MiniSearch by default tokenizes simply by splitting on space or punctuation, and only normalizes the case of the text.

Should one be able to search for tags like <div>, or should only the textual content be searchable? If it's the latter, then as you mentioned the best thing to do is to first strip the tags away (or, if you can access the DOM of the pages to index, get the textContent property), then index the textual content.
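
To illustrate why the pre-processing matters, here is a minimal sketch. Both helpers are hypothetical: `approximateTokenize` is only an ASCII-only approximation of "split on space or punctuation, then lowercase" and is not MiniSearch's actual tokenizer, and `stripTags` is a naive regex stripper that a real HTML parser (or textContent) would replace.

```typescript
// Hypothetical, ASCII-only approximation of a tokenizer that splits on
// whitespace or punctuation and lowercases. NOT MiniSearch's real implementation.
const approximateTokenize = (text: string): string[] =>
  text
    .toLowerCase()
    .split(/[\s!-\/:-@\[-`{-~]+/)
    .filter((token) => token.length > 0);

// Naive tag stripper for illustration only: replaces every <...> run with a space.
// It does not handle comments, <script> bodies, or HTML entities.
const stripTags = (html: string): string =>
  html.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim();

const html = '<div class="note"><p>Hello, <b>world</b>!</p></div>';

// Indexing the raw HTML turns tag and attribute names into searchable terms.
const rawTokens = approximateTokenize(html);
console.log(rawTokens);
// ['div', 'class', 'note', 'p', 'hello', 'b', 'world', 'b', 'p', 'div']

// Stripping tags first leaves only the textual content.
const cleanTokens = approximateTokenize(stripTags(html));
console.log(cleanTokens);
// ['hello', 'world']
```

With the raw HTML, a search for "div" or "class" would match almost every page, which is usually not what is wanted.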

Author

Zloka commented Jul 11, 2023

Thank you for your prompt reply! I went with your suggestion, and it seems to work well!

A follow-up question came to mind: in my case, I can pre-process the files in a build step ahead of time. I figured this would be desirable, as I can build the MiniSearch index ahead of time and just load it directly on the client side, without having to add all the documents.

While this works fine, the client side of my application is a React application, and I was planning to use your react-minisearch React integration. The useMiniSearch hook seems to accept an array of documents and options, but unless I'm mistaken, I can't find a way to provide a prebuilt index, which would be preferable in my case. I couldn't find an example. Do you happen to have any pointers? I can also open an issue in react-minisearch, if you believe that is more appropriate 🙂

To give back, in case someone else has a similar use-case, here is a TypeScript script that outlines the rough approach I took creating the MiniSearch index, for future reference:

import { promises as fs } from 'fs';
import { join } from 'path';
import { convert } from 'html-to-text';
import MiniSearch from 'minisearch';

const miniSearchOptions = {
  fields: ['title', 'content'], // fields to index
  storeFields: ['title', 'content'], // fields to return with search results
};

// initialize MiniSearch
let miniSearch = new MiniSearch(miniSearchOptions);

// html-to-text strips all tags. If your tags contain attributes you wish to include, you can do so e.g. using regex
// the following regex captures an attribute called "data-content", so that the value can be included in the text
const regex = /<span[^>]*data-content="([^"]*)"[^>]*>[^<]*<\/span>/g;

// async function to preprocess an HTML-file and convert it to plain text
const convertHtmlToText = async (filePath: string): Promise<string> => {
  let html = await fs.readFile(filePath, 'utf-8');

  // replace span tags with data-content attribute value
  html = html.replace(regex, '$1');

  const text = convert(html);
  return text;
};

// async function to process all HTML files in a directory
const processFilesInDirectory = async (dirPath: string) => {
  // read the directory
  const files = await fs.readdir(dirPath);

  // iterate over each file
  for (const file of files) {
    // join the dirPath and file to get the full file path
    const filePath = join(dirPath, file);

    // convert HTML to plain text
    const title = file;
    const content = await convertHtmlToText(filePath);

    // add document to MiniSearch index
    const document = {
      id: title,
      title,
      content,
    };

    miniSearch.add(document);
  }
};

// async function to load or create the index
const loadOrCreateIndex = async () => {
  const indexPath = 'minisearch-index.json';

  try {
    // check if index file exists
    await fs.access(indexPath);

    // if it exists, load the index from the file
    const json = await fs.readFile(indexPath, 'utf-8');
    miniSearch = MiniSearch.loadJSON(json, miniSearchOptions);
    console.log('Index has been loaded successfully!');
  } catch (err) {
    // if it does not exist, process the HTML files and create the index
    await processFilesInDirectory('./pages');

    // serialize MiniSearch instance
    const json = JSON.stringify(miniSearch);

    // write to a local file
    await fs.writeFile(indexPath, json);
    console.log('Index has been created and stored successfully!');
  }
  const searchOptions = {
    prefix: true, // Allows for prefix search, i.e. searching "polyn" will match "polynomial" and "polynomial expression"
    fuzzy: 2, // Apply fuzzy search (typo tolerance). Can be either a boolean, or a number specifying the accepted Levenshtein distance. E.g. "polymonial" will match "polynomial", even though there is a slight typo.
    boost: { title: 2 }, // Boost the value of certain fields. For example, if the match is in the title of the document, boost the search result to make it appear higher up.
  };
  const searchResult = miniSearch.search('polyn', searchOptions);
  console.log(searchResult);
};

loadOrCreateIndex().catch(console.error);
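
For readers wondering what the data-content replacement in the script does before html-to-text runs, here is a small sketch. The regex is the one from the script; the sample markup and the `inlineDataContent` name are made up for illustration.

```typescript
// Same regex as in the script above: matches a <span> carrying a data-content
// attribute and captures the attribute value.
const dataContentRegex = /<span[^>]*data-content="([^"]*)"[^>]*>[^<]*<\/span>/g;

// Hypothetical helper: replaces each matching span with its data-content value.
const inlineDataContent = (html: string): string =>
  html.replace(dataContentRegex, '$1');

// Made-up sample input.
const sample =
  '<p>Solve <span class="math" data-content="x^2 + 1">rendered</span> for x.</p>';

console.log(inlineDataContent(sample));
// <p>Solve x^2 + 1 for x.</p>
```

Note that the pattern only matches spans whose body contains no nested tags (because of `[^<]*`), so spans with child elements are left untouched and would lose their attribute value when the tags are stripped.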

Owner

lucaong commented Jul 11, 2023

You are right @Zloka , at the moment there is no utility function in react-minisearch to load a pre-built index, so you have to do that directly with MiniSearch. I will have to think about how to best introduce that possibility, as it is definitely feasible. If you want, you can open an issue on react-minisearch, and I will try to spend some time on it when possible.

In general, whether it makes sense to pre-build the index very much depends on the kind of pre-processing needed, and on how frequently the index needs to change. In many cases I have seen, indexing "just in time" is the best solution, as it is much simpler to implement and maintain. That said, since you need pre-processing, if your documents don't change too often it is probably more efficient to pre-build.
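
If the index is pre-built, one way to handle documents that do change occasionally is to rebuild only when a source file is newer than the stored index. The following is a rough sketch under that assumption; `isIndexStale` is a hypothetical helper, not part of MiniSearch, and it compares file modification times only.

```typescript
import { promises as fs } from 'fs';
import { join } from 'path';

// Hypothetical helper: resolves to true when the index file is missing
// or older than at least one file in the source directory.
const isIndexStale = async (
  indexPath: string,
  sourceDir: string,
): Promise<boolean> => {
  let indexMtime: number;
  try {
    indexMtime = (await fs.stat(indexPath)).mtimeMs;
  } catch (err) {
    // A missing index counts as stale; any other error is unexpected.
    if ((err as NodeJS.ErrnoException).code === 'ENOENT') return true;
    throw err;
  }

  const files = await fs.readdir(sourceDir);
  for (const file of files) {
    const { mtimeMs } = await fs.stat(join(sourceDir, file));
    if (mtimeMs > indexMtime) return true;
  }
  return false;
};
```

A build step could call `isIndexStale('minisearch-index.json', './pages')` and re-run the indexing only when it resolves to true. Deleted source files are not detected by this check.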

Thank you for sharing your code, it is always useful for other people landing on an issue!

Author

Zloka commented Jul 11, 2023

I will open an issue there! Myself, I can work around it using MiniSearch only, so no hurry in that sense, but perhaps it might help others and make my code cleaner in the future if anything 😇

For potential future readers, here's a draft of how one could implement it using React and MiniSearch:

I decided to create a useMiniSearchInstance hook that is responsible for loading the prebuilt index and setting up the search engine. In my case, I load it from the public directory, but naturally, many other approaches exist:

import { useEffect, useState } from 'react';
import MiniSearch from 'minisearch';
import {
  MiniSearchDocument,
  indexPath,
  miniSearchOptions,
} from '../client-side-search/miniSearchConfig';

const useMiniSearchInstance = () => {
  const [miniSearch, setMiniSearch] =
    useState<MiniSearch<MiniSearchDocument> | null>(null);

  useEffect(() => {
    fetch(`/${indexPath}`)
      .then((response) => response.text())
      .then((data: string) => {
        // You'll probably want some validation here.
        const searchIndex = MiniSearch.loadJSON(data, miniSearchOptions);
        setMiniSearch(searchIndex);
      })
      .catch(console.error);
  }, []);

  return miniSearch;
};

export default useMiniSearchInstance;

I then created a useMiniSearch hook that interacts with the search engine, and exposes some variables. Of course, it is trivial to expose other methods or variables, if your use case requires it:

import { SearchResult, Suggestion } from 'minisearch';
import { useEffect, useState } from 'react';
import useMiniSearchInstance from './useMiniSearchInstance';

const searchOptions = {
  prefix: true, // Allows for prefix search, i.e. searching "polyn" will match "Polynomi" and "Polynomin"
  fuzzy: 2, // Apply fuzzy search (typo tolerance). Can be either a boolean, or a number specifying the accepted Levenshtein distance. E.g. "realiluvut" will match "reaaliluvut", even though there is a slight typo.
  boost: { title: 2 }, // Boost the value of certain fields. For example, if the match is in the title of the document, boost the search result to make it appear higher up.
};

const useMiniSearch = () => {
  const [searchInput, setSearchInput] = useState('');
  const [results, setResults] = useState<SearchResult[] | null>(null);
  const [suggestions, setSuggestions] = useState<Suggestion[] | null>(null);
  const miniSearch = useMiniSearchInstance();

  useEffect(() => {
    if (searchInput.length > 0) {
      if (miniSearch !== null) {
        const searchResults = miniSearch.search(searchInput, searchOptions);
        const autoSuggestions = miniSearch.autoSuggest(
          searchInput,
          searchOptions,
        );
        setResults(searchResults);
        setSuggestions(autoSuggestions);
      }
    } else {
      setResults(null);
      setSuggestions(null);
    }
  }, [miniSearch, searchInput]);

  return {
    searchInput,
    setSearchInput,
    results,
    suggestions,
  };
};

export default useMiniSearch;

It should then be straightforward to hook searchInput and setSearchInput to an input field, and display results using suggestions and results ✌️
