Rollback of ThreadLocal optimization change #43

Merged · 2 commits · Dec 28, 2017

Conversation

@mikelalcon (Contributor) commented on Nov 20, 2017

This change rolls back 3def696, 7597ac7 and f6ab6e0 due to an
internal (Google) memory regression.

The cause is a memory leak: a non-static ThreadLocal whose cached value held references back to the owning object instance (Machine -> RE2 objects).

The main issue is that creating an unbounded number of new ThreadLocal instances leaks them, since they are never GCed.

The repro case is quite simple:

import com.google.re2j.Pattern;

public class Main {
  public static void main(String[] args) {
    // Every iteration compiles a fresh Pattern, and therefore a fresh RE2 instance.
    for (int i = 0; i < 10000000; i++) {
      Pattern.compile("aaaa").matches("aaaa");
    }
  }
}

What is happening here?

Each call to Pattern.compile creates a new RE2 instance, and each RE2 instance creates its own ThreadLocal.
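
A simplified sketch of the problematic shape (not the exact rolled-back code; field and constructor names here are illustrative):

class RE2 {
  // Non-static: every RE2 instance creates its own ThreadLocal, and the
  // cached value (a Machine) points back at the owning RE2.
  private final ThreadLocal<Machine> machine =
      new ThreadLocal<Machine>() {
        @Override
        protected Machine initialValue() {
          return new Machine(RE2.this);
        }
      };
}

class Machine {
  private final RE2 re2; // Machine -> RE2 -> ThreadLocal: the cycle described below
  Machine(RE2 re2) {
    this.re2 = re2;
  }
}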

When we use a ThreadLocal we access a per-thread map, ThreadLocalMap. This map contains Entries whose key is the ThreadLocal and whose value is the stored value.

The map's entry class (ThreadLocalMap.Entry) extends WeakReference, and here is the problem: the entry itself is not weak, only the key is.
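
Abridged from OpenJDK's java.lang.ThreadLocal.ThreadLocalMap (exact shape may vary by JDK version):

static class Entry extends WeakReference<ThreadLocal<?>> {
    /** The value associated with this ThreadLocal. Held by a strong reference. */
    Object value;

    Entry(ThreadLocal<?> k, Object v) {
        super(k);   // only the key (the ThreadLocal) is weakly referenced
        value = v;
    }
}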

So we are essentially creating thousands of entries in that map, using a different ThreadLocal as the key each time.

Why are they not GCed? The ThreadLocal is no longer referenced, right? Actually, it is still referenced:

Each Entry in that map (which is not weak, remember; only its key is weak) contains a value that is... the Machine. The Machine points to the RE2 instance... and the RE2 instance points to the ThreadLocal! And there we have the cycle.
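
Spelled out as a reference chain (field names illustrative, following the sketch above):

  Thread.threadLocals            (the ThreadLocalMap, strong)
    -> ThreadLocalMap.Entry      (strong; only the Entry's referent/key is weak)
      -> Entry.value             (strong) = Machine
        -> Machine.re2           (strong) = RE2
          -> RE2.machine         (strong) = the very ThreadLocal used as the Entry's key

Because the key stays strongly reachable through the value, the weak reference never clears and the entry is never expunged.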

TL;DR: Never create an unbounded number of ThreadLocals. If a ThreadLocal is not static, make sure the number of instances is bounded (IOW, not unbounded).
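
For comparison, a minimal sketch of the safe shape (class and field names are illustrative): a single static ThreadLocal shared by all instances, so the number of ThreadLocals is bounded by the number of classes rather than by the number of objects created:

class Scratch {
  // One ThreadLocal for the whole class: at most one map entry per thread.
  private static final ThreadLocal<StringBuilder> BUFFER =
      ThreadLocal.withInitial(StringBuilder::new);

  String render(Object o) {
    StringBuilder sb = BUFFER.get();
    sb.setLength(0);          // reuse the per-thread buffer
    return sb.append(o).toString();
  }
}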

But does this mean that ThreadLocals are never GCed from the ThreadLocalMap?

No

When we look up a ThreadLocal in the ThreadLocalMap (by calling threadLocal.get(), for example), if a simple hash lookup in the array doesn't find it, getEntryAfterMiss is called. That method calls expungeStaleEntry whenever Entry.get() returns null (Entry.get() is the weak reference to the key, i.e. the ThreadLocal). This code (and the other places that end up calling expungeStaleEntry) cleans up stale entries whose ThreadLocal has been GCed.
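
Abridged from OpenJDK's ThreadLocal.ThreadLocalMap (again, details vary by JDK version):

private Entry getEntryAfterMiss(ThreadLocal<?> key, int i, Entry e) {
    Entry[] tab = table;
    int len = tab.length;

    while (e != null) {
        ThreadLocal<?> k = e.get();
        if (k == key)
            return e;
        if (k == null)
            expungeStaleEntry(i);   // the key was GCed: drop the stale entry and its value
        else
            i = nextIndex(i, len);
        e = tab[i];
    }
    return null;
}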

But in this case the ThreadLocal wasn't GCed, because the value holds an indirect reference back to it (the ThreadLocal is a non-static member of RE2).

This change rolls back 3def696, 7597ac7 and f6ab6e0 due to an
internal memory regression.
copybara-github pushed a commit to google/copybara that referenced this pull request Nov 22, 2017
RE2J does internal synchronization when matching. This is because it has an
internal pool of objects that it reuses.

I attempted to fix this in RE2J, but we had to roll it back.
See details in google/re2j#43

Instead, we now create a copy of the pattern per batch job (which is
executed by a single thread), essentially removing the contention.

Because the number of pattern compilations per batch x transform is on the
order of thousands at most, this doesn't show a performance penalty
(internally tested with 1, 2 and 20 threads).

Once the performance bottleneck is removed from RE2J, we should
be able to remove the Pattern.compile(pattern.pattern()) calls
and replace with the original 'pattern' expression.
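
Roughly, the shape of the workaround (a sketch, not Copybara's actual code; class and field names are illustrative):

import com.google.re2j.Pattern;

class BatchJob implements Runnable {
  private final Pattern pattern;  // private copy, used only by this job's thread

  BatchJob(Pattern shared) {
    // Recompile from the source expression so this job never shares a Pattern
    // (and its internal pool) with other threads.
    this.pattern = Pattern.compile(shared.pattern());
  }

  @Override
  public void run() {
    // ... use this.pattern.matcher(...) here without cross-thread contention
  }
}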

Some performance numbers from the internal run:

20 threads, no optimization:

real    1m16.621s
real    1m16.700s
real    1m23.557s
real    1m20.242s
real    1m15.694s

20 threads, with optimization:

real    1m9.686s
real    1m4.141s
real    0m58.491s
real    1m4.280s
real    1m4.952s

1 thread, no optimization:

real    1m52.285s
real    1m43.664s
real    1m43.624s
real    1m44.872s
real    1m42.039s

1 thread, with optimization:

real    1m40.081s
real    1m44.596s
real    1m42.790s
real    1m43.514s
real    1m39.136s

The results show that with 1 thread (no parallelism) we don't have
a noticeable performance penalty, but when we use 20 threads we
see a noticeable performance improvement.

This performance improvement needs >10 threads to show the benefit
(since it removes contention).

PiperOrigin-RevId: 176656487
Change-Id: Iebe76a3d8f5e2efb8653538914b514e62416e61b

@alandonovan (Contributor) left a comment:

Could you elaborate on the cause of the memory regression in the change description?

@mikelalcon (Contributor, Author) replied:

Done.

@sjamesr merged commit e834873 into google:master on Dec 28, 2017