-
-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not use lazy mapValues in TrieNode #70
Conversation
This seems to make creation of the TrieNode much faster. It might fix com-lihaoyi#33 https://issues.scala-lang.org/browse/SI-4776
Here is a benchmark with two versions of TrieNode with the only difference being use of mapValues vs map { case (k, v) => ... }. You will need https://github.com/Ichoran/thyme to run it. I am not exactly sure what is going on, since the mapValues should only affect the operations immediately afterwards, so it should be just a constant factor slowdown. But apparently this is not the case. By the way: the TrieNode code is pretty neat! The issue seems to be most severe with long strings. Results:
Code: import ichi.bench.Thyme
import scala.annotation.tailrec
object TrieNodeBench extends App {
val th = Thyme.warmed(verbose = println, warmth = Thyme.HowWarm.BenchOff)
def createLazy(ns: Array[String]): AnyRef = new Lazy.TrieNode(ns)
def createEager(ns: Array[String]): AnyRef = new Eager.TrieNode(ns)
val names = (0 until 10000).map(NumberToWord.apply).toArray
for(i <- Seq(5, 10, 30)) {
val ns = names.take(i)
th.pbenchOffWarm(s"Create TrieNode Lazy vs. Eager $i")(th.Warm(createLazy(ns)))(th.Warm(createEager(ns)))
}
}
object Lazy {
/**
* An trie node for quickly matching multiple strings which
* share the same prefix, one char at a time.
*/
final class TrieNode(strings: Seq[String]) {
val (min, max, arr) = {
val children = strings.filter(!_.isEmpty)
.groupBy(_ (0))
.mapValues(ss => new TrieNode(ss.map(_.tail)))
if (children.size == 0) (0.toChar, 0.toChar, new Array[TrieNode](0))
else {
val min = children.keysIterator.min
val max = children.keysIterator.max
val arr = new Array[TrieNode](max - min + 1)
for ((k, v) <- children) arr(k - min) = v
(min, max, arr)
}
}
val word: Boolean = strings.exists(_.isEmpty) || arr.isEmpty
def apply(c: Char): TrieNode = {
if (c > max || c < min) null
else arr(c - min)
}
/**
* Returns the length of the matching string, or -1 if not found
*/
def query(input: String, index: Int): Int = {
@tailrec def rec(offset: Int, currentNode: TrieNode, currentRes: Int): Int = {
if (index + offset >= input.length) currentRes
else {
val char = input(index + offset)
val next = currentNode(char)
if (next == null) currentRes
else rec(
offset + 1,
next,
if (next.word) offset
else currentRes
)
}
}
rec(0, this, -1)
}
}
}
object Eager {
/**
* An trie node for quickly matching multiple strings which
* share the same prefix, one char at a time.
*/
final class TrieNode(strings: Seq[String]) {
val (min, max, arr) = {
val children = strings.filter(!_.isEmpty)
.groupBy(_ (0))
.map { case (k, ss) => k -> new TrieNode(ss.map(_.tail)) }
if (children.size == 0) (0.toChar, 0.toChar, new Array[TrieNode](0))
else {
val min = children.keysIterator.min
val max = children.keysIterator.max
val arr = new Array[TrieNode](max - min + 1)
for ((k, v) <- children) arr(k - min) = v
(min, max, arr)
}
}
val word: Boolean = strings.exists(_.isEmpty) || arr.isEmpty
def apply(c: Char): TrieNode = {
if (c > max || c < min) null
else arr(c - min)
}
/**
* Returns the length of the matching string, or -1 if not found
*/
def query(input: String, index: Int): Int = {
@tailrec def rec(offset: Int, currentNode: TrieNode, currentRes: Int): Int = {
if (index + offset >= input.length) currentRes
else {
val char = input(index + offset)
val next = currentNode(char)
if (next == null) currentRes
else rec(
offset + 1,
next,
if (next.word) offset
else currentRes
)
}
}
rec(0, this, -1)
}
}
}
/**
* Helper object to generate realistic test cases for the radix tree. Used with permission from www.source-code.biz
*/
object NumberToWord {
def apply(n: Int): String = convertNumberToWords(n)
// This snippet may be used freely, as long as the authorship note remains in the source code.
private val lowNames = Array("zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen")
private val tensNames = Array("twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety")
private val bigNames = Array("thousand", "million", "billion")
/**
* Converts an integer number into words (american english).
* @author Christian d'Heureuse, Inventec Informatik AG, Switzerland, www.source-code.biz
*/
private def convertNumberToWords(n0: Int): String = {
var n = n0
if (n < 0) "minus " + convertNumberToWords(-n)
else if (n <= 999) convert999(n)
else {
var s: String = null
var t = 0
while (n > 0) {
if (n % 1000 != 0) {
var s2 = convert999(n % 1000)
if (t > 0) {
s2 = s2 + " " + bigNames(t - 1)
}
if (s == null) {
s = s2
} else {
s = s2 + ", " + s
}
}
n /= 1000
t += 1
}
s
}
}
private def convert999(n: Int): String = {
val s1 = lowNames(n / 100) + " hundred"
val s2 = convert99(n % 100)
if (n <= 99) s2
else if (n % 100 == 0) s1
else s1 + " " + s2
}
private def convert99(n: Int): String = {
if (n < 20) lowNames(n)
else {
val s = tensNames(n / 10 - 2)
if (n % 10 == 0) s
else s + "-" + lowNames(n % 10)
}
}
} |
Here is the likely problem // DEFINES CHILDREN RECURSIVELY, DOES NOT EVALUATE
val children = strings.filter(!_.isEmpty)
.groupBy(_(0))
.mapValues(ss => new TrieNode(ss.map(_.tail)))
if (children.size == 0) (0.toChar, 0.toChar, new Array[TrieNode](0))
else {
val min = children.keysIterator.min // EVALUATES CHILDREN ONCE
val max = children.keysIterator.max // EVALUATES CHILDREN SECOND TIME
val arr = new Array[TrieNode](max - min + 1)
for ((k, v) <- children) arr(k - min) = v // EVALUATES CHILDREN THIRD TIME
(min, max, arr)
} We'd expect Until we put together a faster trie (DAT or whatever) to use, I think this will be sufficient for now and should make things sufficiently faster to close the original issue. My only request for this diff: can we have a regression test for this for unreasonably-large non-toy input data? I'm thinking thousands to hundreds of thousands of strings thousands to hundreds of thousands of characters long. Such an input might take a moment to initialize or query, but the fact that it completes should be sufficient to verify that this behavior doesn't occur. It should probably live in https://github.com/lihaoyi/fastparse/blob/2023b5ead1f876397855b86b1784cb1b8e57d177/fastparse/shared/src/test/scala/fastparse/MiscTests.scala#L7 |
Also, the pythonparse integration test seems to be complaining about a new file in the django project it can't parse and is not skipped https://travis-ci.org/lihaoyi/fastparse/jobs/97345190#L3097 Could you add it to the whitelist, just as a housekeeping matter to keep the build up to date? |
I don't think there is anything wrong with the tree once constructed. The performance should be pretty close to optimal. Just the memory usage can be very high, especially if somebody is using unicode characters. See example below. If I understand it correctly, the DAT mostly buys you a more compact representation while still using O(1) array lookups. I think in this case the less-than-optimal space usage is perfectly fine, since people will just use it for a small number of identifiers, not all words of the English language. Example for large space usage:
That would allocate a pretty large child array (from 'i'.toInt 73 to '∞'.toInt 8734). Still shouldn't be a problem for this use case. I will add a test case that runs for longer than the travis timeout with the unfixed implementation. I am not sure what you want me to do regarding the whitelist. The patch does not add any new files, so what exactly is the problem? |
w.r.t. the whitelist, the integration test suite isn't hermitic and just pulls in |
This should prevent performance regressions in TrieNode creation. With the old, lazy implementation this will run forever, whereas with the new implementation it will run in less than a millisecond.
OK, I have added a single performance test that creates a TrieNode from 1000 strings. It runs almost instantaneously with the eager version, and will probably not terminate before the end of the universe with the lazy version. That should do to prevent regressions. Let me know if you need more. |
Cool looks good, thanks a lot! |
Do not use lazy mapValues in TrieNode
This seems to make creation of the TrieNode much faster. It might fix #33
https://issues.scala-lang.org/browse/SI-4776
The TrieNode tree seems to be pretty fast once it is created. I compared it with my RadixTree https://github.com/rklaehn/radixtree , which is more optimised for compact storage of very large string sets, and it is doing quite well.
But creation of a TrieNode containing e.g. all words from http://www-01.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt would not even terminate.