Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not use lazy mapValues in TrieNode #70

Merged
merged 2 commits into from
Dec 17, 2015

Conversation

rklaehn
Copy link
Contributor

@rklaehn rklaehn commented Dec 16, 2015

This seems to make creation of the TrieNode much faster. It might fix #33

https://issues.scala-lang.org/browse/SI-4776

The TrieNode tree seems to be pretty fast once it is created. I compared it with my RadixTree https://github.com/rklaehn/radixtree , which is more optimised for compact storage of very large string sets, and it is doing quite well.

But creation of a TrieNode containing e.g. all words from http://www-01.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt would not even terminate.

This seems to make creation of the TrieNode much faster. It might fix com-lihaoyi#33

https://issues.scala-lang.org/browse/SI-4776
@rklaehn
Copy link
Contributor Author

rklaehn commented Dec 17, 2015

Here is a benchmark with two versions of TrieNode with the only difference being use of mapValues vs map { case (k, v) => ... }. You will need https://github.com/Ichoran/thyme to run it.

I am not exactly sure what is going on, since the mapValues should only affect the operations immediately afterwards, so it should be just a constant factor slowdown. But apparently this is not the case.

By the way: the TrieNode code is pretty neat!

The issue seems to be most severe with long strings.

Results:

Benchmark comparison (in 5.798 s): Create TrieNode Lazy vs. Eager 5
Significantly different (p ~= 0)
  Time ratio:    0.04878   95% CI 0.04786 - 0.04970   (n=20)
    First     409.1 us   95% CI 404.3 us - 413.8 us
    Second    19.95 us   95% CI 19.66 us - 20.25 us
Benchmark comparison (in 4.629 s): Create TrieNode Lazy vs. Eager 10
Significantly different (p ~= 0)
  Time ratio:    0.04145   95% CI 0.04100 - 0.04191   (n=20)
    First     1.017 ms   95% CI 1.008 ms - 1.026 ms
    Second    42.16 us   95% CI 41.88 us - 42.43 us
Benchmark comparison (in 254.9 s): Create TrieNode Lazy vs. Eager 30
Significantly different (p ~= 0)
  Time ratio:    0.00006   95% CI 0.00006 - 0.00006   (n=20)
    First     2.074 s   95% CI 2.073 s - 2.075 s
    Second    124.1 us   95% CI 116.6 us - 131.5 us

Code:

import ichi.bench.Thyme

import scala.annotation.tailrec

object TrieNodeBench extends App {

  val th = Thyme.warmed(verbose = println, warmth = Thyme.HowWarm.BenchOff)

  def createLazy(ns: Array[String]): AnyRef = new Lazy.TrieNode(ns)

  def createEager(ns: Array[String]): AnyRef = new Eager.TrieNode(ns)

  val names = (0 until 10000).map(NumberToWord.apply).toArray

  for(i <- Seq(5, 10, 30)) {
    val ns = names.take(i)
    th.pbenchOffWarm(s"Create TrieNode Lazy vs. Eager $i")(th.Warm(createLazy(ns)))(th.Warm(createEager(ns)))
  }
}

object Lazy {

  /**
    * An trie node for quickly matching multiple strings which
    * share the same prefix, one char at a time.
    */
  final class TrieNode(strings: Seq[String]) {

    val (min, max, arr) = {
      val children = strings.filter(!_.isEmpty)
        .groupBy(_ (0))
        .mapValues(ss => new TrieNode(ss.map(_.tail)))
      if (children.size == 0) (0.toChar, 0.toChar, new Array[TrieNode](0))
      else {
        val min = children.keysIterator.min
        val max = children.keysIterator.max
        val arr = new Array[TrieNode](max - min + 1)
        for ((k, v) <- children) arr(k - min) = v
        (min, max, arr)
      }
    }
    val word: Boolean = strings.exists(_.isEmpty) || arr.isEmpty

    def apply(c: Char): TrieNode = {
      if (c > max || c < min) null
      else arr(c - min)
    }

    /**
      * Returns the length of the matching string, or -1 if not found
      */
    def query(input: String, index: Int): Int = {
      @tailrec def rec(offset: Int, currentNode: TrieNode, currentRes: Int): Int = {
        if (index + offset >= input.length) currentRes
        else {
          val char = input(index + offset)
          val next = currentNode(char)
          if (next == null) currentRes
          else rec(
            offset + 1,
            next,
            if (next.word) offset
            else currentRes
          )
        }
      }
      rec(0, this, -1)
    }
  }
}

object Eager {

  /**
    * An trie node for quickly matching multiple strings which
    * share the same prefix, one char at a time.
    */
  final class TrieNode(strings: Seq[String]) {

    val (min, max, arr) = {
      val children = strings.filter(!_.isEmpty)
        .groupBy(_ (0))
        .map { case (k, ss) => k -> new TrieNode(ss.map(_.tail)) }
      if (children.size == 0) (0.toChar, 0.toChar, new Array[TrieNode](0))
      else {
        val min = children.keysIterator.min
        val max = children.keysIterator.max
        val arr = new Array[TrieNode](max - min + 1)
        for ((k, v) <- children) arr(k - min) = v
        (min, max, arr)
      }
    }
    val word: Boolean = strings.exists(_.isEmpty) || arr.isEmpty

    def apply(c: Char): TrieNode = {
      if (c > max || c < min) null
      else arr(c - min)
    }

    /**
      * Returns the length of the matching string, or -1 if not found
      */
    def query(input: String, index: Int): Int = {
      @tailrec def rec(offset: Int, currentNode: TrieNode, currentRes: Int): Int = {
        if (index + offset >= input.length) currentRes
        else {
          val char = input(index + offset)
          val next = currentNode(char)
          if (next == null) currentRes
          else rec(
            offset + 1,
            next,
            if (next.word) offset
            else currentRes
          )
        }
      }
      rec(0, this, -1)
    }
  }

}

/**
  * Helper object to generate realistic test cases for the radix tree. Used with permission from www.source-code.biz
  */
object NumberToWord {

  def apply(n: Int): String = convertNumberToWords(n)

  // This snippet may be used freely, as long as the authorship note remains in the source code.
  private val lowNames = Array("zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen")
  private val tensNames = Array("twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety")
  private val bigNames = Array("thousand", "million", "billion")

  /**
    * Converts an integer number into words (american english).
    * @author Christian d'Heureuse, Inventec Informatik AG, Switzerland, www.source-code.biz
    */
  private def convertNumberToWords(n0: Int): String = {
    var n = n0
    if (n < 0) "minus " + convertNumberToWords(-n)
    else if (n <= 999) convert999(n)
    else {
      var s: String = null
      var t = 0
      while (n > 0) {
        if (n % 1000 != 0) {
          var s2 = convert999(n % 1000)
          if (t > 0) {
            s2 = s2 + " " + bigNames(t - 1)
          }
          if (s == null) {
            s = s2
          } else {
            s = s2 + ", " + s
          }
        }
        n /= 1000
        t += 1
      }
      s
    }
  }

  private def convert999(n: Int): String = {
    val s1 = lowNames(n / 100) + " hundred"
    val s2 = convert99(n % 100)
    if (n <= 99) s2
    else if (n % 100 == 0) s1
    else s1 + " " + s2
  }

  private def convert99(n: Int): String = {
    if (n < 20) lowNames(n)
    else {
      val s = tensNames(n / 10 - 2)
      if (n % 10 == 0) s
      else s + "-" + lowNames(n % 10)
    }
  }
}

@lihaoyi
Copy link
Member

lihaoyi commented Dec 17, 2015

Here is the likely problem

       // DEFINES CHILDREN RECURSIVELY, DOES NOT EVALUATE 
       val children = strings.filter(!_.isEmpty)
                             .groupBy(_(0))
                            .mapValues(ss => new TrieNode(ss.map(_.tail))) 
       if (children.size == 0) (0.toChar, 0.toChar, new Array[TrieNode](0))
       else {
         val min = children.keysIterator.min // EVALUATES CHILDREN ONCE
         val max = children.keysIterator.max // EVALUATES CHILDREN SECOND TIME
         val arr = new Array[TrieNode](max - min + 1)
         for ((k, v) <- children) arr(k - min) = v  // EVALUATES CHILDREN THIRD TIME
         (min, max, arr)
       }

We'd expect O(3^lengthOfTrie) behavior from this, and that appears to match up with what we're seeing in the benchmarks.

Until we put together a faster trie (DAT or whatever) to use, I think this will be sufficient for now and should make things sufficiently faster to close the original issue.

My only request for this diff: can we have a regression test for this for unreasonably-large non-toy input data? I'm thinking thousands to hundreds of thousands of strings thousands to hundreds of thousands of characters long. Such an input might take a moment to initialize or query, but the fact that it completes should be sufficient to verify that this behavior doesn't occur. It should probably live in https://github.com/lihaoyi/fastparse/blob/2023b5ead1f876397855b86b1784cb1b8e57d177/fastparse/shared/src/test/scala/fastparse/MiscTests.scala#L7

@lihaoyi
Copy link
Member

lihaoyi commented Dec 17, 2015

Also, the pythonparse integration test seems to be complaining about a new file in the django project it can't parse and is not skipped

https://travis-ci.org/lihaoyi/fastparse/jobs/97345190#L3097

Could you add it to the whitelist, just as a housekeeping matter to keep the build up to date?

@rklaehn
Copy link
Contributor Author

rklaehn commented Dec 17, 2015

I don't think there is anything wrong with the tree once constructed. The performance should be pretty close to optimal. Just the memory usage can be very high, especially if somebody is using unicode characters. See example below.

If I understand it correctly, the DAT mostly buys you a more compact representation while still using O(1) array lookups. I think in this case the less-than-optimal space usage is perfectly fine, since people will just use it for a small number of identifiers, not all words of the English language.

Example for large space usage:

val infinity = StringIn("Infinity", "∞")

That would allocate a pretty large child array (from 'i'.toInt 73 to '∞'.toInt 8734). Still shouldn't be a problem for this use case.

I will add a test case that runs for longer than the travis timeout with the unfixed implementation.

I am not sure what you want me to do regarding the whitelist. The patch does not add any new files, so what exactly is the problem?

@lihaoyi
Copy link
Member

lihaoyi commented Dec 17, 2015

w.r.t. the whitelist, the integration test suite isn't hermitic and just pulls in master from a bunch of other repos. One of those repos has changed since the last diff was submitted, and we need to do basic housekeeping to keep that those tests running. That's why the build is red, let's make it green before landing =)

This should prevent performance regressions in TrieNode creation. With the
old, lazy implementation this will run forever, whereas with the new
implementation it will run in less than a millisecond.
@rklaehn
Copy link
Contributor Author

rklaehn commented Dec 17, 2015

OK, I have added a single performance test that creates a TrieNode from 1000 strings. It runs almost instantaneously with the eager version, and will probably not terminate before the end of the universe with the lazy version. That should do to prevent regressions. Let me know if you need more.

@lihaoyi
Copy link
Member

lihaoyi commented Dec 17, 2015

Cool looks good, thanks a lot!

lihaoyi added a commit that referenced this pull request Dec 17, 2015
Do not use lazy mapValues in TrieNode
@lihaoyi lihaoyi merged commit 692a1d6 into com-lihaoyi:master Dec 17, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

StringIn takes a very long time with longer strings
2 participants