Do not use lazy mapValues in TrieNode #70

rklaehn · 2015-12-16T23:52:20Z

This seems to make creation of the TrieNode much faster. It might fix #33

https://issues.scala-lang.org/browse/SI-4776

The TrieNode tree seems to be pretty fast once it is created. I compared it with my RadixTree https://github.com/rklaehn/radixtree , which is more optimised for compact storage of very large string sets, and it is doing quite well.

But creation of a TrieNode containing e.g. all words from http://www-01.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt would not even terminate.

This seems to make creation of the TrieNode much faster. It might fix com-lihaoyi#33 https://issues.scala-lang.org/browse/SI-4776

rklaehn · 2015-12-17T08:59:57Z

Here is a benchmark with two versions of TrieNode with the only difference being use of mapValues vs map { case (k, v) => ... }. You will need https://github.com/Ichoran/thyme to run it.

I am not exactly sure what is going on, since the mapValues should only affect the operations immediately afterwards, so it should be just a constant factor slowdown. But apparently this is not the case.

By the way: the TrieNode code is pretty neat!

The issue seems to be most severe with long strings.

Results:

Benchmark comparison (in 5.798 s): Create TrieNode Lazy vs. Eager 5
Significantly different (p ~= 0)
  Time ratio:    0.04878   95% CI 0.04786 - 0.04970   (n=20)
    First     409.1 us   95% CI 404.3 us - 413.8 us
    Second    19.95 us   95% CI 19.66 us - 20.25 us
Benchmark comparison (in 4.629 s): Create TrieNode Lazy vs. Eager 10
Significantly different (p ~= 0)
  Time ratio:    0.04145   95% CI 0.04100 - 0.04191   (n=20)
    First     1.017 ms   95% CI 1.008 ms - 1.026 ms
    Second    42.16 us   95% CI 41.88 us - 42.43 us
Benchmark comparison (in 254.9 s): Create TrieNode Lazy vs. Eager 30
Significantly different (p ~= 0)
  Time ratio:    0.00006   95% CI 0.00006 - 0.00006   (n=20)
    First     2.074 s   95% CI 2.073 s - 2.075 s
    Second    124.1 us   95% CI 116.6 us - 131.5 us

Code:

import ichi.bench.Thyme

import scala.annotation.tailrec

object TrieNodeBench extends App {

  val th = Thyme.warmed(verbose = println, warmth = Thyme.HowWarm.BenchOff)

  def createLazy(ns: Array[String]): AnyRef = new Lazy.TrieNode(ns)

  def createEager(ns: Array[String]): AnyRef = new Eager.TrieNode(ns)

  val names = (0 until 10000).map(NumberToWord.apply).toArray

  for(i <- Seq(5, 10, 30)) {
    val ns = names.take(i)
    th.pbenchOffWarm(s"Create TrieNode Lazy vs. Eager $i")(th.Warm(createLazy(ns)))(th.Warm(createEager(ns)))
  }
}

object Lazy {

  /**
    * An trie node for quickly matching multiple strings which
    * share the same prefix, one char at a time.
    */
  final class TrieNode(strings: Seq[String]) {

    val (min, max, arr) = {
      val children = strings.filter(!_.isEmpty)
        .groupBy(_ (0))
        .mapValues(ss => new TrieNode(ss.map(_.tail)))
      if (children.size == 0) (0.toChar, 0.toChar, new Array[TrieNode](0))
      else {
        val min = children.keysIterator.min
        val max = children.keysIterator.max
        val arr = new Array[TrieNode](max - min + 1)
        for ((k, v) <- children) arr(k - min) = v
        (min, max, arr)
      }
    }
    val word: Boolean = strings.exists(_.isEmpty) || arr.isEmpty

    def apply(c: Char): TrieNode = {
      if (c > max || c < min) null
      else arr(c - min)
    }

    /**
      * Returns the length of the matching string, or -1 if not found
      */
    def query(input: String, index: Int): Int = {
      @tailrec def rec(offset: Int, currentNode: TrieNode, currentRes: Int): Int = {
        if (index + offset >= input.length) currentRes
        else {
          val char = input(index + offset)
          val next = currentNode(char)
          if (next == null) currentRes
          else rec(
            offset + 1,
            next,
            if (next.word) offset
            else currentRes
          )
        }
      }
      rec(0, this, -1)
    }
  }
}

object Eager {

  /**
    * An trie node for quickly matching multiple strings which
    * share the same prefix, one char at a time.
    */
  final class TrieNode(strings: Seq[String]) {

    val (min, max, arr) = {
      val children = strings.filter(!_.isEmpty)
        .groupBy(_ (0))
        .map { case (k, ss) => k -> new TrieNode(ss.map(_.tail)) }
      if (children.size == 0) (0.toChar, 0.toChar, new Array[TrieNode](0))
      else {
        val min = children.keysIterator.min
        val max = children.keysIterator.max
        val arr = new Array[TrieNode](max - min + 1)
        for ((k, v) <- children) arr(k - min) = v
        (min, max, arr)
      }
    }
    val word: Boolean = strings.exists(_.isEmpty) || arr.isEmpty

    def apply(c: Char): TrieNode = {
      if (c > max || c < min) null
      else arr(c - min)
    }

    /**
      * Returns the length of the matching string, or -1 if not found
      */
    def query(input: String, index: Int): Int = {
      @tailrec def rec(offset: Int, currentNode: TrieNode, currentRes: Int): Int = {
        if (index + offset >= input.length) currentRes
        else {
          val char = input(index + offset)
          val next = currentNode(char)
          if (next == null) currentRes
          else rec(
            offset + 1,
            next,
            if (next.word) offset
            else currentRes
          )
        }
      }
      rec(0, this, -1)
    }
  }

}

/**
  * Helper object to generate realistic test cases for the radix tree. Used with permission from www.source-code.biz
  */
object NumberToWord {

  def apply(n: Int): String = convertNumberToWords(n)

  // This snippet may be used freely, as long as the authorship note remains in the source code.
  private val lowNames = Array("zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen")
  private val tensNames = Array("twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety")
  private val bigNames = Array("thousand", "million", "billion")

  /**
    * Converts an integer number into words (american english).
    * @author Christian d'Heureuse, Inventec Informatik AG, Switzerland, www.source-code.biz
    */
  private def convertNumberToWords(n0: Int): String = {
    var n = n0
    if (n < 0) "minus " + convertNumberToWords(-n)
    else if (n <= 999) convert999(n)
    else {
      var s: String = null
      var t = 0
      while (n > 0) {
        if (n % 1000 != 0) {
          var s2 = convert999(n % 1000)
          if (t > 0) {
            s2 = s2 + " " + bigNames(t - 1)
          }
          if (s == null) {
            s = s2
          } else {
            s = s2 + ", " + s
          }
        }
        n /= 1000
        t += 1
      }
      s
    }
  }

  private def convert999(n: Int): String = {
    val s1 = lowNames(n / 100) + " hundred"
    val s2 = convert99(n % 100)
    if (n <= 99) s2
    else if (n % 100 == 0) s1
    else s1 + " " + s2
  }

  private def convert99(n: Int): String = {
    if (n < 20) lowNames(n)
    else {
      val s = tensNames(n / 10 - 2)
      if (n % 10 == 0) s
      else s + "-" + lowNames(n % 10)
    }
  }
}

lihaoyi · 2015-12-17T10:45:05Z

Here is the likely problem

       // DEFINES CHILDREN RECURSIVELY, DOES NOT EVALUATE 
       val children = strings.filter(!_.isEmpty)
                             .groupBy(_(0))
                            .mapValues(ss => new TrieNode(ss.map(_.tail))) 
       if (children.size == 0) (0.toChar, 0.toChar, new Array[TrieNode](0))
       else {
         val min = children.keysIterator.min // EVALUATES CHILDREN ONCE
         val max = children.keysIterator.max // EVALUATES CHILDREN SECOND TIME
         val arr = new Array[TrieNode](max - min + 1)
         for ((k, v) <- children) arr(k - min) = v  // EVALUATES CHILDREN THIRD TIME
         (min, max, arr)
       }

We'd expect O(3^lengthOfTrie) behavior from this, and that appears to match up with what we're seeing in the benchmarks.

Until we put together a faster trie (DAT or whatever) to use, I think this will be sufficient for now and should make things sufficiently faster to close the original issue.

My only request for this diff: can we have a regression test for this for unreasonably-large non-toy input data? I'm thinking thousands to hundreds of thousands of strings thousands to hundreds of thousands of characters long. Such an input might take a moment to initialize or query, but the fact that it completes should be sufficient to verify that this behavior doesn't occur. It should probably live in https://github.com/lihaoyi/fastparse/blob/2023b5ead1f876397855b86b1784cb1b8e57d177/fastparse/shared/src/test/scala/fastparse/MiscTests.scala#L7

lihaoyi · 2015-12-17T10:46:27Z

Also, the pythonparse integration test seems to be complaining about a new file in the django project it can't parse and is not skipped

https://travis-ci.org/lihaoyi/fastparse/jobs/97345190#L3097

Could you add it to the whitelist, just as a housekeeping matter to keep the build up to date?

rklaehn · 2015-12-17T12:17:28Z

I don't think there is anything wrong with the tree once constructed. The performance should be pretty close to optimal. Just the memory usage can be very high, especially if somebody is using unicode characters. See example below.

If I understand it correctly, the DAT mostly buys you a more compact representation while still using O(1) array lookups. I think in this case the less-than-optimal space usage is perfectly fine, since people will just use it for a small number of identifiers, not all words of the English language.

Example for large space usage:

val infinity = StringIn("Infinity", "∞")

That would allocate a pretty large child array (from 'i'.toInt 73 to '∞'.toInt 8734). Still shouldn't be a problem for this use case.

I will add a test case that runs for longer than the travis timeout with the unfixed implementation.

I am not sure what you want me to do regarding the whitelist. The patch does not add any new files, so what exactly is the problem?

lihaoyi · 2015-12-17T12:20:00Z

w.r.t. the whitelist, the integration test suite isn't hermitic and just pulls in master from a bunch of other repos. One of those repos has changed since the last diff was submitted, and we need to do basic housekeeping to keep that those tests running. That's why the build is red, let's make it green before landing =)

This should prevent performance regressions in TrieNode creation. With the old, lazy implementation this will run forever, whereas with the new implementation it will run in less than a millisecond.

rklaehn · 2015-12-17T21:56:38Z

OK, I have added a single performance test that creates a TrieNode from 1000 strings. It runs almost instantaneously with the eager version, and will probably not terminate before the end of the universe with the lazy version. That should do to prevent regressions. Let me know if you need more.

lihaoyi · 2015-12-17T22:19:12Z

Cool looks good, thanks a lot!

Do not use lazy mapValues in TrieNode

Do not use lazy mapValues

73c824a

This seems to make creation of the TrieNode much faster. It might fix com-lihaoyi#33 https://issues.scala-lang.org/browse/SI-4776

Add performance test for TrieNode creation

4c8ca5d

This should prevent performance regressions in TrieNode creation. With the old, lazy implementation this will run forever, whereas with the new implementation it will run in less than a millisecond.

lihaoyi added a commit that referenced this pull request Dec 17, 2015

Merge pull request #70 from rklaehn/bugfix/trienode

692a1d6

Do not use lazy mapValues in TrieNode

lihaoyi merged commit 692a1d6 into com-lihaoyi:master Dec 17, 2015

shawjef3 mentioned this pull request Nov 5, 2018

Make Map#{mapValues, filterKeys} strict scala/bug#11245

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Do not use lazy mapValues in TrieNode #70

Do not use lazy mapValues in TrieNode #70

rklaehn commented Dec 16, 2015

rklaehn commented Dec 17, 2015

lihaoyi commented Dec 17, 2015

lihaoyi commented Dec 17, 2015

rklaehn commented Dec 17, 2015

lihaoyi commented Dec 17, 2015

rklaehn commented Dec 17, 2015

lihaoyi commented Dec 17, 2015

Do not use lazy mapValues in TrieNode #70

Do not use lazy mapValues in TrieNode #70

Conversation

rklaehn commented Dec 16, 2015

rklaehn commented Dec 17, 2015

lihaoyi commented Dec 17, 2015

lihaoyi commented Dec 17, 2015

rklaehn commented Dec 17, 2015

lihaoyi commented Dec 17, 2015

rklaehn commented Dec 17, 2015

lihaoyi commented Dec 17, 2015