
Threading performance needs to be evaluated #1

Open
kwalcock opened this issue Aug 30, 2021 · 19 comments

Comments

@kwalcock
Owner

kwalcock commented Aug 30, 2021

TorchScript seems to be thread-safe and thread-efficient. It gets the same answer each time and gets it faster with more threads. This was run on a 16-core, 32-processor machine. Each of the first few doublings of the thread count improves throughput by roughly 1.5x. Overall it maxes out at about 7x, when one would hope for 16-32x. Perhaps it would go higher on Clara.

[Three graphs of TorchScript threading performance omitted.]

Typos have been fixed in the graph captions.
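
For context, below is a minimal sketch of the kind of harness that produces numbers like these; the document type and the `predict` function are placeholders standing in for one TorchScript forward pass, not the actual benchmark code.

```scala
import java.util.concurrent.{Callable, Executors, TimeUnit}

// Minimal sketch (not the benchmark code from this repo): run docs.size forward
// passes spread across numThreads worker threads and report the wall-clock time.
def timeWithThreads(numThreads: Int, docs: Seq[String], predict: String => Unit): Double = {
  val pool = Executors.newFixedThreadPool(numThreads)
  val tasks = new java.util.ArrayList[Callable[Unit]]()
  docs.foreach { doc =>
    tasks.add(new Callable[Unit] { override def call(): Unit = predict(doc) })
  }
  val start = System.nanoTime()
  pool.invokeAll(tasks)                 // blocks until every document has been processed
  pool.shutdown()
  pool.awaitTermination(1, TimeUnit.HOURS)
  (System.nanoTime() - start) / 1e9     // elapsed seconds
}
```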

@MihaiSurdeanu

Thanks!

So, it seems to me it would be worth it to change our Metal DyNet library to TorchScript, no? What is your opinion?
Also, what are the RAM requirements for TorchScript for this task?

@kwalcock
Owner Author

kwalcock commented Aug 31, 2021

That RAM question is going to be difficult not just because it's Java but also because any underlying C-like code may have its own stash that Java might not know about. I can at least google and experiment.
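
For the Java side at least, one crude approach is to sample used heap before and after exercising the model; this is only a sketch and says nothing about memory that libtorch allocates natively.

```scala
// Rough heap sampling: force a GC and report used heap in MB.  Native allocations
// made by libtorch or onnxruntime will not show up here.
def usedHeapMb(): Long = {
  val rt = Runtime.getRuntime
  System.gc()
  (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024)
}

val before = usedHeapMb()
// ... load the TorchScript module and run some forward passes here ...
val after = usedHeapMb()
println(s"Java heap delta: ${after - before} MB")
```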

<opinion>

This is still very much an apples vs. oranges comparison, I think. With TorchScript there can be (at least) a 7x speedup with threading, but maybe in the end on a complicated model it is 10x slower than DyNet and can't catch up. In one test, I believe at the FatDynet level and maybe on Clara, DyNet could be sped up 20x (clulab/processors#422 (comment)) and might eventually work from Scala. It looks like TorchScript can easily run on the GPU (https://towardsdatascience.com/pytorch-jit-and-torchscript-c2a77bac0fff), and that could help it catch up again. In general we're threading on a per-document basis, but often we run only one document at a time anyway, so we may lose speed on most small jobs if it is only the multi-threading that justifies TorchScript. It might be interesting to try threading per sentence on either one.

It would be nice to have alternatives if they don't cost too much. Processors could work against some InferenceEngine interface with different implementations (a rough sketch follows below), or could do something like, or with, the ONNX (https://github.com/onnx/onnx) project. It almost sounds reasonable as a learning exercise for someone to implement that same model on numerous platforms. However, it might take away from bug hunting, which could pay off very quickly (or never). There are lots of other places where performance might be improved, but they might be less interesting. I've wondered what would happen if all these Seqs were turned into Lists, for example. Some measurements show Lists to be a lot faster (https://www.lihaoyi.com/post/BenchmarkingScalaCollections.html) for many operations.

</opinion>
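
Here is a rough sketch of the InferenceEngine idea from above. The method signature is invented for illustration; the real one would follow whatever processors actually needs (token indexes in, label scores out, for instance).

```scala
// Sketch only: a common interface so processors doesn't care which backend runs the model.
trait InferenceEngine {
  def forward(tokenIndexes: Array[Long]): Array[Float]
}

class TorchScriptEngine(modelPath: String) extends InferenceEngine {
  // would wrap org.pytorch.Module.load(modelPath) and Tensor.fromBlob(...)
  def forward(tokenIndexes: Array[Long]): Array[Float] = ???
}

class OnnxEngine(modelPath: String) extends InferenceEngine {
  // would wrap ai.onnxruntime.OrtEnvironment / OrtSession
  def forward(tokenIndexes: Array[Long]): Array[Float] = ???
}
```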

@MihaiSurdeanu

I agree it's an apples vs. oranges comparison. But if we attempt to normalize them by accounting for the fact that the DyNet code is more complex, they become sort of similar. That is, our DyNet code on Clara is about 50% slower than the TorchScript one.
In general, TorchScript is appealing for three reasons:

  • Seems to like parallelism more than DyNet. This reason might disappear if you find that bug.
  • PyTorch is much better supported than DyNet.
  • PyTorch has pre-trained transformer networks. DyNet does not.

@MihaiSurdeanu

Let's discuss Thursday.

@kwalcock
Owner Author

1 thread requires approximately 145MB of Java memory. 8 threads require approximately 149MB, so each additional thread requires about 0.5MB. The model itself on disk is 935.5MB. Very little of the memory appears to be managed by Java.

1 thread

| Memory (MB) | Time (sec) |
|---|---|
| 8192 | 193.9408116 |
| 4096 | 191.989438 |
| 2048 | 193.5145225 |
| 1024 | 190.8282388 |
| 512 | 191.0930682 |
| 256 | 191.7231144 |
| 192 | 192.7174266 |
| 160 | 195.3915921 |
| 152 | 199.9521839 |
| 148 | 205.5872125 |
| 146 | 231.19455 |
| 145 | 228.9728285 |
| 144 | Java heap space |
| 128 | GC overhead limit exceeded |

8 threads

| Memory (MB) | Time (sec) |
|---|---|
| 256 | 62.28951887 |
| 192 | 69.24889569 |
| 160 | 66.72487354 |
| 152 | 70.56916087 |
| 150 | 74.55402432 |
| 149 | 76.57680217 |
| 148 | GC overhead limit exceeded |
| 144 | Java heap space |
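
Limits like the ones in these tables can be pinned down by forking the benchmark with an explicit heap setting and lowering it until the run fails with "Java heap space" or "GC overhead limit exceeded". A build.sbt sketch (the exact options used here may have differed):

```scala
// build.sbt sketch: fork the benchmark JVM and fix its heap, e.g. at 149MB.
fork := true
javaOptions ++= Seq("-Xmx149m", "-Xms149m")
```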

@kwalcock
Owner Author

C memory is next.

@MihaiSurdeanu

MihaiSurdeanu commented Aug 31, 2021 via email

@kwalcock
Owner Author

The embeddings themselves are not used, but rather only the vocabulary (per the code comment: `# Load word embeddings, (we just care about which tokens are in our vocab)`). The input is the index of the word in the vocabulary and the index of the label from CoNLL. In the Scala code this happens at `Tensor.fromBlob(tokenIndexes, Array(1L, tokenIndexes.length.toLong))` and in Python at `out_sent = torch.tensor(out_sent)`. That's just how the code was when I got it.

The embeddings do show up in a roundabout way at `self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)`, and I haven't tracked down what that's about. There is no equivalent in the TorchScript code except maybe in the loading of the stored model.
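
For completeness, a sketch of that input path using the pytorch_java_only API quoted above; the model path, example indexes, and output handling are placeholders.

```scala
import org.pytorch.{IValue, Module, Tensor}

// Sketch only: the "features" are just vocabulary indexes, shaped [1, sentenceLength],
// matching the Tensor.fromBlob call quoted from the Scala code.
val module = Module.load("model.pt")                        // placeholder path
val tokenIndexes: Array[Long] = Array(12L, 5L, 873L, 2L)    // word indexes from the vocab
val input = Tensor.fromBlob(tokenIndexes, Array(1L, tokenIndexes.length.toLong))
val output = module.forward(IValue.from(input)).toTensor()
val scores = output.getDataAsFloatArray()                   // per-label scores, flattened
```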

@MihaiSurdeanu

I think he is using random embeddings instead, which is fine for this experiment.
Thank you!

@kwalcock
Owner Author

kwalcock commented Sep 1, 2021

I have not found any good (and easy and accurate) way to measure how much memory is being used by the libraries. jcmd doesn't seem to help, /proc/pid doesn't divulge secrets, PyTorch doesn't provide any insight through the Java interface, etc. As a hack, I ran the test program with the minimal workable Java memory setting from above, so that Java didn't have any to spare, and then just used top to check how much memory was reserved for the process. For both 1 and 8 threads, 4.9GB was recorded. A simple hello world program generated from the same project measured just 64MB, which provides a sanity check. If I run sbt test on processors, the same column often shows between 6.6 and 7.1GB.
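
One scriptable alternative to watching top by hand, sketched below: on Linux the resident set size of the current process (Java heap plus whatever the native libraries have allocated) is reported as VmRSS in /proc/self/status. It still doesn't break the total down by library, though.

```scala
import scala.io.Source

// Sketch only: read the process's resident set size (reported in kB) from /proc.
def residentSetKb(): Option[Long] = {
  val src = Source.fromFile("/proc/self/status")
  try {
    src.getLines()
      .find(_.startsWith("VmRSS:"))
      .map(_.replaceAll("[^0-9]", "").toLong)
  } finally src.close()
}
```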

@kwalcock
Owner Author

kwalcock commented Sep 1, 2021

There seems to be a difficulty using TorchScript on a Mac. It looks like "System Integrity Protection" may need to be disabled from the GUI. There are some links about it towards the bottom of https://github.com/kwalcock/torchscript. I haven't been able to check yet whether that solves the problem.

@MihaiSurdeanu

MihaiSurdeanu commented Sep 1, 2021 via email

@kwalcock
Owner Author

These are mean times (sec) for a forward pass of one sentence on a single thread. Onnx is close to 2.5x faster here.

| Library | Train set | Val set | Test set |
|---|---|---|---|
| TorchScript | 0.00053581 | 0.00056500 | 0.00049214 |
| Onnx | 0.00022426 | 0.00022952 | 0.00020155 |
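
A sketch of how a single forward pass can be timed with the onnxruntime 1.8.1 Java API; the model path, input name, and example indexes are assumptions for illustration.

```scala
import ai.onnxruntime.{OnnxTensor, OrtEnvironment, OrtSession}

val env = OrtEnvironment.getEnvironment()
val session = env.createSession("model.onnx", new OrtSession.SessionOptions())

// One sentence of vocabulary indexes, shaped [1, sentenceLength].
val tokenIndexes = Array(Array(12L, 5L, 873L, 2L))
val input = OnnxTensor.createTensor(env, tokenIndexes)

val start = System.nanoTime()
// "tokens" is a placeholder input name; the real name comes from the exported model.
val result = session.run(java.util.Collections.singletonMap("tokens", input))
val seconds = (System.nanoTime() - start) / 1e9
result.close()
```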

@kwalcock
Owner Author

kwalcock commented Sep 24, 2021

Onnx seems to bottom out at 16 threads, which happens to be the number of cores in the computer. Even though it plateaus sooner, its performance is still better than PyTorch's: at that point PyTorch was still taking 111 seconds to run while Onnx was down to 86.

[Three graphs of Onnx threading performance omitted.]
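
One thing to keep in mind when reading these curves: onnxruntime has its own internal thread settings that interact with application-level threading. A sketch of pinning intra-op parallelism to a single thread so that any scaling comes only from the caller's thread pool (assuming the SessionOptions knobs exposed by the Java API):

```scala
import ai.onnxruntime.{OrtEnvironment, OrtSession}

// Sketch only: keep each forward pass single-threaded inside onnxruntime.
val env = OrtEnvironment.getEnvironment()
val opts = new OrtSession.SessionOptions()
opts.setIntraOpNumThreads(1)
val session = env.createSession("model.onnx", opts)
```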

@MihaiSurdeanu

MihaiSurdeanu commented Sep 24, 2021 via email

@kwalcock
Owner Author

There is a library dependency, but I haven't looked inside the jar:

    "com.microsoft.onnxruntime"  % "onnxruntime" % "1.8.1",

PyTorch requires jars plus several C libraries on LD_LIBRARY_PATH:

    // This one requires the next.
    // "org.pytorch"           % "pytorch_java_only" % "1.9.0",
    // The next one can't be found.  Use jars in lib directory.
    // "com.facebook.fbjni"    % "fbjni-java-only"   % "0.0.3",
    // And this is a transitive dependency
    // "com.facebook.soloader" % "nativeloader"      % "0.8.0",

@MihaiSurdeanu

It probably means they have an interpreter, which is possibly natively supported.

What NN did you run exactly? Just a feed forward one, or the double LSTM?

@kwalcock
Owner Author

This is the same model from Peter as before:

    self.bilstm = nn.LSTM(
        input_size=embed_dim,
        hidden_size=hidden_dim,
        bidirectional=True,
        batch_first=True,
    )

@MihaiSurdeanu

Thanks!
