Wildcard search does not work properly #3938

Closed
andre-hohmann opened this issue Aug 13, 2020 · 5 comments · Fixed by #4263
Labels
bug search search, filter

Comments

@andre-hohmann
Collaborator

Problem

The wildcard search is only possible in specific parts of the process title:

  1. "7746" has the following results:
     • Aben_774618868-1873032402_02-a
     • Aben_774618868-1873
     • ...
  2. "1873" has the following results:
     • Aben_774618868-1873032402_02-a
     • Aben_774618868-1873
     • ...
  3. "Aben_774618868-1873" has no results.
     • It seems as if "-" blocks the wildcard search.

Solution

It should be possible to search for "Aben_774618868-1873032" and get the following results:

  • Aben_774618868-18730321
  • Aben_774618868-18730322
@matthias-ronge
Collaborator

This has to do with the tokenization settings of the search engine.

Tokenization is a necessary step in search engine indexing. Each text is broken down into a number of normalized tokens. For each token, the index stores which data records contain it. That is an essential part of the functionality of a search engine. When searching, the search query is also tokenized and then searched as an AND-search.

From the fact that you can search for partial number sequences, it can be deduced that numbers are tokenized as individual digits (otherwise you would not find Aben_774618868-1873032402_02-a with 7746). Example: The title Aben_774618868-1873032402_02-a is tokenized as (aben, 7, 4, 6, 1, 8, 3, 0, 2, a), with each distinct token listed once.

The search query 7746 is tokenized as (7, 4, 6) and then searched as '7' AND '4' AND '6'. The search query Aben_774618868-1873032 is presumably tokenized as (aben, 7, 4, 6, 1, 8, -1, 3, 0, 2) and searched for as 'aben' AND '7' AND '4' AND '6' AND '1' AND '8' AND (NOT '1') AND '3' AND '0' AND '2'. This cannot deliver a result because '1' AND (NOT '1') always yields an empty hit list.
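
To make this hypothesis concrete, here is a toy model of it in Java. It is only a sketch of the behaviour I am assuming (single-digit tokens, AND search, a hyphen negating the token that follows it); the class and method names are made up, and this is not the actual analyzer of the search engine.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the assumed tokenization; not the search engine's real analyzer. */
final class DigitTokenModel {

    /** Splits text into lower-cased letter runs and single digits; "-" is kept as a marker. */
    static List<String> rawTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                word.append(Character.toLowerCase(c));
                continue;
            }
            if (word.length() > 0) {
                tokens.add(word.toString());
                word.setLength(0);
            }
            if (Character.isDigit(c) || c == '-') {
                tokens.add(String.valueOf(c));
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }

    /** Index side: the distinct tokens of a title, with the hyphen markers dropped. */
    static Set<String> indexTokens(String title) {
        Set<String> tokens = new LinkedHashSet<>(rawTokens(title));
        tokens.remove("-");
        return tokens;
    }

    /** Query side: AND over all tokens; the token right after "-" is treated as NOT. */
    static boolean matches(String query, Set<String> indexed) {
        boolean negateNext = false;
        for (String token : rawTokens(query)) {
            if (token.equals("-")) {
                negateNext = true;
                continue;
            }
            if (indexed.contains(token) == negateNext) {
                return false; // required token missing, or excluded token present
            }
            negateNext = false;
        }
        return true;
    }
}
```

In this model, `matches("7746", ...)` and `matches("1873", ...)` succeed against the tokens of Aben_774618868-1873032402_02-a, while `matches("Aben_774618868-1873", ...)` fails because '1' is both required and excluded.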

The design of a search engine index must always begin with the question of what should be found with which search query. The tokenization must take place accordingly. Specifically: Would you like to find all process titles with a partial number sequence that contain this partial number sequence? Yes, but only in this order? Even if there is a hyphen in between? Should the process title Aben_774618868-1873032402_02-a also be found when searching for 6818? Or just when searching for 68-18? Should it also be found when searching for 68_18 or 68 18? Or 18 68? Or should it only be found when searching for 7746 because it is at the beginning of a sequence of numbers?

Depending on all of these considerations, the tokenization of the documents to be indexed and the search queries must be implemented. The minimum requirement here seems to be that a hyphen that is not preceded by a space should not be interpreted as a negation sign in query tokenization.

@andre-hohmann
Collaborator Author

Thanks a lot for the extensive explanation.

I would aim to match the behaviour of Kitodo.Production 2.x to avoid future questions and complaints. In the Kitodo-Wiki, you can find some information:

Regarding your questions:

  1. It should be possible to search for a hyphen that is not preceded by a space, as in Aben_774618868-1873032402_02-a, to find a specific process by its process title.
  2. If a hyphen is preceded by a space, the following term should be excluded, so that processes of a specific year can be filtered out, for example Aben_774618868 -1873.
  3. It should be possible to search for parts of the process title, for example 774618868 1873 or Aben 1873, to find all processes of the year 1873.
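
The three rules above could be sketched roughly like this in Java. This is only an illustration under my own assumptions: terms are separated by whitespace, matching is a case-insensitive substring comparison (as in a database string search), and the class name `TitleQuery` is made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical query model: whitespace-separated terms, "-" at the start of a term excludes it. */
final class TitleQuery {
    final List<String> required = new ArrayList<>();
    final List<String> excluded = new ArrayList<>();

    static TitleQuery parse(String query) {
        TitleQuery parsed = new TitleQuery();
        for (String term : query.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.startsWith("-") && term.length() > 1) {
                // Rule 2: a hyphen preceded by a space negates the term.
                parsed.excluded.add(term.substring(1));
            } else if (!term.isEmpty() && !term.equals("-")) {
                // Rules 1 and 3: a hyphen inside a term is an ordinary character.
                parsed.required.add(term);
            }
        }
        return parsed;
    }

    /** True if the title contains every required term and none of the excluded ones. */
    boolean matches(String processTitle) {
        String title = processTitle.toLowerCase(Locale.ROOT);
        for (String term : required) {
            if (!title.contains(term)) {
                return false;
            }
        }
        for (String term : excluded) {
            if (title.contains(term)) {
                return false;
            }
        }
        return true;
    }
}
```

With this sketch, `TitleQuery.parse("Aben_774618868 -1873")` rejects Aben_774618868-1873032402_02-a, while `TitleQuery.parse("Aben 1873")` accepts it.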

@matthias-ronge
Collaborator

When changing the search from a database-based search (string comparison) to an index search, it makes sense to reconsider the requirements rather than simply replicate the old behaviour. Replicating it is possible, but it leads to an extremely large search engine index (a lot of hard drive space and indexing time; search response time is not affected).

I notice that the search should not be for individual digits, but that the numbers should be found in the given order (1873 should not find 348716). This means that, during indexing, sequences of numbers must be tokenized into all possible partial sequences, but in search queries, sequences of numbers must be treated as one term.

Do I see it correctly that one is actually only looking for the incipits of sequences of numbers? (A query for 1873 does not need to behave like *1873*, matching anywhere within a digit sequence; it is sufficient for it to behave like 1873*, matching only at the start.) That would greatly reduce the number of terms to be indexed:

Example of all unique partial sequence tokens of 1873032402: 1, 18, 187, 1873, 18730, 187303, 1873032, 18730324, 187303240, 1873032402, 8, 87, 873, 8730, 87303, 873032, 8730324, 87303240, 873032402, 7, 73, 730, 7303, 73032, 730324, 7303240, 73032402, 3, 30, 303, 3032, 30324, 303240, 3032402, 0, 03, 032, 0324, 03240, 032402, 32, 324, 3240, 32402, 2, 24, 240, 2402, 4, 40, 402, 02. (52 index entries)

Example of only initial partial sequence tokens of 1873032402: 1, 18, 187, 1873, 18730, 187303, 1873032, 18730324, 187303240, 1873032402. (10 index entries)
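
The two counts can be reproduced with a short Java sketch. The class and method names are made up for this illustration; it only enumerates the tokens and says nothing about how the search engine would store them.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Enumerates partial sequences of a digit string, to compare index sizes. */
final class PartialSequences {

    /** All distinct contiguous partial sequences (duplicates collapse in the set). */
    static Set<String> allSubstrings(String digits) {
        Set<String> result = new LinkedHashSet<>();
        for (int start = 0; start < digits.length(); start++) {
            for (int end = start + 1; end <= digits.length(); end++) {
                result.add(digits.substring(start, end));
            }
        }
        return result;
    }

    /** Only the partial sequences that begin at the first digit. */
    static Set<String> prefixes(String digits) {
        Set<String> result = new LinkedHashSet<>();
        for (int end = 1; end <= digits.length(); end++) {
            result.add(digits.substring(0, end));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(allSubstrings("1873032402").size()); // 52
        System.out.println(prefixes("1873032402").size());      // 10
    }
}
```

For a sequence of n digits, the first variant produces up to n(n+1)/2 tokens and the second exactly n, so the difference grows quickly with longer identifiers.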

Side note: In our field there are many sequences of numbers that end with a check digit that is calculated according to modulo 11 (letter X as the last number). These Xes at the end, immediately preceded by at least one digit, should be seen as part of the sequence of numbers and not as a single letter, right?

@matthias-ronge
Collaborator

matthias-ronge commented Nov 4, 2020

If my assumption is right, this should do the job for process title indexing tokenization:

import java.text.*;
import java.util.*;
import java.util.regex.*;

static Pattern GROUPS_OF_ALPHANUMERIC_CHARACTERS = Pattern.compile("[\\p{IsLetter}\\p{Digit}]+");

// Returns every prefix of every normalized alphanumeric group in the title.
static Set<String> tokenizeProcessTitle(String processTitle) {
    Set<String> tokens = new HashSet<>();
    Matcher matcher = GROUPS_OF_ALPHANUMERIC_CHARACTERS.matcher(processTitle);
    while (matcher.find()) {
        String normalized = normalize(matcher.group());
        int length = normalized.length();
        // Index only initial partial sequences: "1873" yields 1, 18, 187, 1873.
        for (int end = 1; end <= length; end++) {
            tokens.add(normalized.substring(0, end));
        }
    }
    return tokens;
}

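static String normalize(String input) {
    // Expand German umlauts first so that "ä" becomes "ae" rather than a bare "a".
    StringBuilder umlautsReplaced = replaceUmlauts(input);
    // Decompose the remaining accented characters (NFD) and drop the combining marks.
    String noDiacritics = Normalizer.normalize(umlautsReplaced, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    // Lower-case independently of the JVM's default locale.
    return noDiacritics.toLowerCase(Locale.ROOT);
}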

static StringBuilder replaceUmlauts(String input) {
    StringBuilder buffer = new StringBuilder(64);
    final int length = input.length();
    for (int offset = 0; offset < length;) {
        int codepoint = input.codePointAt(offset);
        if (codepoint == 'Ä' || codepoint == 'ä') {
            buffer.append("ae");
        } else if (codepoint == 'Ö' || codepoint == 'ö') {
            buffer.append("oe");
        } else if (codepoint == 'Ü' || codepoint == 'ü') {
            buffer.append("ue");
        } else if (codepoint == 0x1E9E || codepoint == 'ß') { // 0x1E9E (decimal 7838) is the capital sharp s
            buffer.append("ss");
        } else {
            buffer.appendCodePoint(codepoint);
        }
        offset += Character.charCount(codepoint);
    }
    return buffer;
}

"PineSeve_313539383" would be searchable with these input strings: p, pi, pin, pine, pines, pinese, pinesev, pineseve, 3, 31, 313, 3135, 31353, 313539, 3135393, 31353938, 313539383. (17 index records)

I do not know where this belongs in the code base. It may need to be implemented within Elasticsearch.

@andre-hohmann
Collaborator Author

@matthias-ronge: Thanks for the examination!

You are right, it is always good to think about the opportunities and to improve the current state. As a user, the result is more important to me than the technical basis. However, I am sure we will find a solution.

> Do I see it correctly that one is actually only looking for the incipits of sequences of numbers? (A query for 1873 does not need to behave like *1873*, matching anywhere within a digit sequence; it is sufficient for it to behave like 1873*, matching only at the start.) That would greatly reduce the number of terms to be indexed:

I can only describe my requirements for the search. For newspaper processes, it is extremely helpful for administrative exports, ... to be able to search for processes by year, month, ...

  • Aben_399196951-1818
  • Aben_399196951-181801
  • Aben_399196951-181802
  • ...
  • 399196951-1818
  • 399196951-181801
  • 399196951-181802
  • ...

Thus, from my point of view, a search for "1873*" would be sufficient.
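
As a rough illustration of such a prefix search, the following Java sketch builds a tiny in-memory index from prefix tokens and answers queries by year or by year and month. Everything here is an assumption for demonstration purposes: the class `PrefixIndex` is made up, the title ending in 1819 is invented to show a non-match, and a real implementation would live in the search engine rather than in a map.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical in-memory index: every prefix of every alphanumeric group points to its titles. */
final class PrefixIndex {
    private static final Pattern GROUP = Pattern.compile("[\\p{IsLetter}\\p{Digit}]+");
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Index side: store all initial partial sequences of each group. */
    void add(String title) {
        Matcher matcher = GROUP.matcher(title.toLowerCase(Locale.ROOT));
        while (matcher.find()) {
            String group = matcher.group();
            for (int end = 1; end <= group.length(); end++) {
                postings.computeIfAbsent(group.substring(0, end), key -> new TreeSet<>()).add(title);
            }
        }
    }

    /** Query side: each group is looked up as one whole term; the results are intersected (AND). */
    Set<String> search(String query) {
        Set<String> hits = null;
        Matcher matcher = GROUP.matcher(query.toLowerCase(Locale.ROOT));
        while (matcher.find()) {
            Set<String> titles = postings.getOrDefault(matcher.group(), Collections.emptySet());
            if (hits == null) {
                hits = new TreeSet<>(titles);
            } else {
                hits.retainAll(titles);
            }
        }
        return hits == null ? Collections.emptySet() : hits;
    }

    public static void main(String[] args) {
        PrefixIndex index = new PrefixIndex();
        index.add("Aben_399196951-1818");
        index.add("Aben_399196951-181801");
        index.add("Aben_399196951-181802");
        index.add("Aben_399196951-1819"); // invented title, only to show a non-match
        System.out.println(index.search("1818"));
        System.out.println(index.search("Aben_399196951-181801"));
    }
}
```

Because the hyphen is not part of any group, a query such as Aben_399196951-1818 is simply split into three terms and never interpreted as a negation.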

> Side note: In our field there are many sequences of numbers that end with a check digit that is calculated according to modulo 11 (letter X as the last number). These Xes at the end, immediately preceded by at least one digit, should be seen as part of the sequence of numbers and not as a single letter, right?

Yes, from my point of view the X, as in the following process title, is part of the sequence just like 1, 7, 2, 7, ...

  • AdleaDM_172788177X
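
If letters and digits were ever tokenized as separate groups, the trailing X could be kept with its number along these lines. This is only a sketch with a made-up class name; the rule it encodes (an X directly after digits and not followed by another letter or digit belongs to the number) is my reading of the side note, not an agreed specification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical tokenizer that keeps a trailing check digit X attached to its digit sequence. */
final class CheckDigitTokens {
    // Alternatives are tried in order: digits plus final "x", plain digits, plain letters.
    private static final Pattern TOKEN = Pattern.compile(
            "\\p{Digit}+x(?![\\p{IsLetter}\\p{Digit}])|\\p{Digit}+|\\p{IsLetter}+");

    /** Returns the lower-cased tokens of a process title in order of appearance. */
    static List<String> tokens(String processTitle) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(processTitle.toLowerCase(Locale.ROOT));
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }
}
```

For AdleaDM_172788177X this yields the tokens adleadm and 172788177x, whereas an X that starts a longer word (as in 12Xyz) stays with the letters.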
