-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
use ngram in TrieSearch as it performs very similarly but at much low… #3216
Conversation
…er memory footprint
backend/src/main/java/com/bakdata/conquery/util/search/TrieSearch.java
Outdated
Show resolved
Hide resolved
}); | ||
for (final List<KeywordIndex> hits : trie.prefixMap(keyword).values()) | ||
updateWeights(keyword, hits, itemWeights); | ||
if (keyword.length() > ngramLength) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hier ist deine Formatierung off, ich muss dir noch meine config schicken.
); | ||
} | ||
|
||
private void doPut(String kw, T item) { | ||
private void doPut(NgramIndex ni) { | ||
// wouldn't it suffice to check once in addItem()? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sowas bitte mit // TODO
markieren, aber gute Frage. Vermutlich hast du recht.
} | ||
|
||
public Iterator<T> iterator() { | ||
// This is a very ugly workaround to not get eager evaluation (which happens when using flatMap and distinct on streams) | ||
final Set<T> seen = new HashSet<>(); | ||
|
||
return Iterators.filter( | ||
Iterators.concat(Iterators.transform(trie.values().iterator(), Collection::iterator)), | ||
Iterators.concat(Iterators.transform(keywordItemsList.stream().map(ki -> ki.items).iterator(), Collection::iterator)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hier bitte kein stream benutzen, die ganze Methode sollte eigentlich stream-basiert sein, streams haben aber buggy verhalten was sich erst bei sehr großen Mengen äußert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Das wäre dann ein Iterators.transform
@@ -84,34 +84,33 @@ public void searchIdentities() { | |||
public void testSuffixes() { | |||
final TrieSearch<String> search = new TrieSearch<>(2, null); | |||
|
|||
assertThat(search.suffixes("baaacd")) | |||
assertThat(search.ngramSplitStrings("baaacd", "item")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"item" ist ein nicht eingegebenes Wort um zu testen, dass du es nicht rauskriegst?
} | ||
|
||
Stream<String> suffixes(String word) { | ||
private Stream<NgramIndex> ngramSplit(String word, T item) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Du mischt hier Zustand und Computation, bitte trennen. Das sieht mir eher nach Teilen von doPut
aus?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sieht super aus! Danke
backend/src/main/java/com/bakdata/conquery/util/search/TrieSearch.java
Outdated
Show resolved
Hide resolved
/** | ||
* Maps from keywords to associated items. | ||
*/ | ||
private final PatriciaTrie<List<T>> trie = new PatriciaTrie<>(); | ||
private final PatriciaTrie<List<Integer>> trie = new PatriciaTrie<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wenn du den Index nur als lookup nach keywordItemsList
verwendest, dann einfach direkt das KeywordItems reinsetzen stattdessen. Java Objekte sind referenzen ;)
Alternativ IntList/IntArrayList verwenden.
index = keywordItemsList.size() - 1; | ||
} | ||
|
||
ToTrieKeys(word).forEach(key -> doPut(key, index, item)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bitte normale for-loop verwenden.
ToTrieKeys(word).forEach(key -> doPut(key, index, item)); | ||
} | ||
|
||
private void doPut(String key, int index, T item) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
zwei mal doPut ist etwas ungeschickt, kannst du die bitte sauberer benennen?
backend/src/main/java/com/bakdata/conquery/util/search/TrieSearch.java
Outdated
Show resolved
Hide resolved
} | ||
|
||
public void logStats() { | ||
// ToDo: meaning changed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was hat sich hier verändert?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(glaube die methode wird nicht gerufen)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intellij meldet das TrieSearch::logStats
und TrieSearch::listItems
ungenutzt sind. Sollen diese Methoden entfernt werden?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ja können beide weg
backend/src/test/java/com/bakdata/conquery/integration/tests/FilterAutocompleteTest.java
Outdated
Show resolved
Hide resolved
backend/src/test/java/com/bakdata/conquery/util/search/QuickSearchTest.java
Outdated
Show resolved
Hide resolved
} | ||
} | ||
else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
für sowas benutze ich gerne if(edge case) { ...; continue }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aber ich glaube wir müssen das allein schon drin haben, weil das beim splitten entsthene kann und ggf auch itemWords kürzer als ngrams sein können
Co-authored-by: awildturtok <1553491+awildturtok@users.noreply.github.com>
} | ||
|
||
public void logStats() { | ||
// ToDo: meaning changed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ja können beide weg
@@ -221,7 +236,7 @@ public void shrinkToFit() { | |||
return; | |||
} | |||
|
|||
trie.replaceAll((key, values) -> values.stream().distinct().collect(Collectors.toList())); | |||
trie.replaceAll((key, values) -> new ArrayList<>(values)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wissen wir hier schon, dass values distinct sind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nein, Ich weiß auch, dass die Values nicht distinct sind und habe .distinct()
rausgenommen, weil das Verhalten vom TrieSearch
sich ändert, je nachdem ob TrieSearch::shrinkToFit
aufgerufen wurde.
Zudem führt das zum folgendem Ergebnis:
Expected :["Pants", "PantsPants", "Pantshop", "Sweatpants"]
Actual :["Pants", "Pantshop", "PantsPants", "Sweatpants"]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Habe einen count
Feld hinzugefügt, dass eine Art .distinct()
durch Gruppierung und Summierung ermöglicht. So ist das Verhalten unabhängig davon ob TrieSearch::shrinkToFit
aufgerufen wurde.
backend/src/main/java/com/bakdata/conquery/util/search/TrieSearch.java
Outdated
Show resolved
Hide resolved
} | ||
ki.items.add(item); | ||
|
||
for (final String key : (Iterable<String>) toTrieKeys(word)::iterator) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, dass das ein Stream ist hatte ich übersehen, da war die forEach dann angebracht.
backend/src/main/java/com/bakdata/conquery/util/search/TrieSearch.java
Outdated
Show resolved
Hide resolved
Co-authored-by: awildturtok <1553491+awildturtok@users.noreply.github.com>
|
||
for (final KeywordItemsCount<T> entry : hits) { | ||
// KeywordItems<T> ki = entry.keywordItems; | ||
final double weight = Math.pow(weightWord(query, entry.word), entry.count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wenn ich das richtig verstehe ist das dafür da, dass wenn ein keyword öfters inserted wurde es höher priorisiert auszugeben?
backend/src/main/java/com/bakdata/conquery/models/datasets/Column.java
Outdated
Show resolved
Hide resolved
...c/main/java/com/bakdata/conquery/models/datasets/concepts/filters/specific/SelectFilter.java
Outdated
Show resolved
Hide resolved
backend/src/main/java/com/bakdata/conquery/models/index/FrontendValueIndex.java
Outdated
Show resolved
Hide resolved
backend/src/main/java/com/bakdata/conquery/models/index/IndexService.java
Outdated
Show resolved
Hide resolved
backend/src/main/java/com/bakdata/conquery/util/search/TrieSearch.java
Outdated
Show resolved
Hide resolved
backend/src/main/java/com/bakdata/conquery/util/search/TrieSearch.java
Outdated
Show resolved
Hide resolved
// This barrier avoids work when shrinking and therefore unnecessary calls to entries::computeIfAbsent | ||
final Set<String> barrier = new HashSet<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ist der Lookup im particiatrie so aufwendig? und wenn ja, warum ist barrier eine locale Variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Das verhindert, dass ein item mehrfach zum gleichen ngram assoziiert wird. (wir sind hier in addItem)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spart Arbeit beim Shrinken. Lokal wegen dem genannten Grund. Dieser Fall kann trotzdem eintreten, wenn ein Item nochmal mit Keywords eingefügt wird, die bereits existierende ganze Wörter oder ngrams beinhalten. Um das zu verhindern, braucht es eine kompliziertere Datenstruktur, die bis zum Shrinken behalten wird.
backend/src/main/java/com/bakdata/conquery/util/search/TrieSearch.java
Outdated
Show resolved
Hide resolved
…ervice.java Co-authored-by: MT <12283268+thoniTUB@users.noreply.github.com>
…ndValueIndex.java Co-authored-by: MT <12283268+thoniTUB@users.noreply.github.com>
/** | ||
* A lower weight implies more relevant words. | ||
*/ | ||
private long weightWord(String query, String itemWord, boolean original) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ist die aktualisierte weightWord abgeleitet aus deinen Benchmarks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meine Überlegung ist, dass die Weight-Function möglichst simpel sein soll und dass bei einer kleineren oberen Grenze der Weight-Function, erst bei sehr vielen Hits (> 2^55) ein Overflow bei einem long
entsteht. Dann habe ich das getestet und die Such-Ergebnisse waren gut.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Die Anpassungen im weighing würde ich gerne nachvollziehen wollen.
backend/src/main/java/com/bakdata/conquery/util/search/TrieSearch.java
Outdated
Show resolved
Hide resolved
backend/src/main/java/com/bakdata/conquery/util/search/TrieSearch.java
Outdated
Show resolved
Hide resolved
*/ | ||
private long weightWord(String query, String itemWord, boolean original) { | ||
// The weight function needs to be fast, as it is called frequently. | ||
final long weight; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thoniTUB Sebastian hat den Code jetzt so stark vereinfacht, dass man den fast komplett unrollen könnte, wäre das für dich einfacher?
if(exactMatch){
if(original)
return 100;
return 10;
}
if(original)
return 4;
return 2;
Die Zahlen sind auch gar nicht so wichtig wie man denken könnte. Solange relevante Ergebnisse kommen (was die ngrame garantieren) ist es gut genug.
Wegen mir kannst du das as-is mergen oder die kleinen Sachen noch anpassen und dann mergen. |
Co-authored-by: awildturtok <1553491+awildturtok@users.noreply.github.com>
if (word.length() < ngramLength) { | ||
return Stream.empty(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry bisschen spät, ist mir gerade ins Auge gefallen.
Sollte es nicht
if (word.length() <= ngramLength) {
sein
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
du hast recht, ich fixe es
…er memory footprint