Threading performance needs to be evaluated #1
Thanks! So, it seems to me it would be worth it to change our Metal DyNet library to TorchScript, no? What is your opinion? |
That RAM question is going to be difficult not just because it's Java but also because any underlying C-like code may have its own stash that Java might not know about. I can at least google and experiment.
This is still very much an apples vs. oranges comparison, I think. With TorchScript there can be (at least) a 7x speedup with threading, but maybe in the end on a complicated model it is 10x slower than DyNet and can't catch up. In one test, I believe at the FatDynet level and maybe on Clara, DyNet could be sped up 20x (clulab/processors#422 (comment)) and might eventually work from Scala. It looks like TorchScript can easily run on the GPU (https://towardsdatascience.com/pytorch-jit-and-torchscript-c2a77bac0fff), and that could help it catch up again.
In general we're threading on a per-document basis, but often we run only one document at a time anyway, so we may lose speed on most small jobs if multi-threading is the only thing that justifies TorchScript. It might be interesting to try threading per sentence on either one.
It would be nice to have alternatives if they don't cost too much. Processors could work against some InferenceEngine interface with different implementations, or could do something with the ONNX project (https://github.com/onnx/onnx). It almost sounds reasonable as a learning exercise for someone to implement the same model on numerous platforms. However, it might take away from bug hunting, which could pay off very quickly (or never).
There are lots of other places where performance might be improved, but they might be less interesting. I've wondered what would happen if all these Seqs were turned into Lists, for example. Some measurements show Lists to be a lot faster for many operations (https://www.lihaoyi.com/post/BenchmarkingScalaCollections.html).
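For concreteness, the InferenceEngine idea could look something like the purely hypothetical Scala sketch below; none of these names exist in processors today, and TorchScript, ONNX, or DyNet backends would each supply an implementation.

```scala
// Hypothetical sketch of an InferenceEngine abstraction: Processors would code
// against one small trait, and each backend would provide an implementation.
// All names here are illustrative, not part of any existing clulab API.
trait InferenceEngine {
  /** Run a forward pass for one sentence and return per-token label scores. */
  def forward(wordIds: Array[Long]): Array[Array[Float]]
  /** Release any native resources held by the backend. */
  def close(): Unit
}

class TorchScriptEngine(modelPath: String) extends InferenceEngine {
  // Would wrap the TorchScript module loaded from modelPath.
  def forward(wordIds: Array[Long]): Array[Array[Float]] = ???
  def close(): Unit = ()
}

class OnnxEngine(modelPath: String) extends InferenceEngine {
  // Would wrap an ONNX Runtime session created from modelPath.
  def forward(wordIds: Array[Long]): Array[Array[Float]] = ???
  def close(): Unit = ()
}
```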
|
I agree it's an apples vs. oranges comparison. But if we attempt to normalize them by accounting for the fact that the DyNet code is more complex, they become sort of similar. That is, our DyNet code on Clara is about 50% slower than the TorchScript one.
|
Let's discuss Thursday. |
1 thread requires approximately 145MB of Java memory; 8 threads require approximately 149MB, so each additional thread adds roughly 0.5MB. The model itself is 935.5MB on disk. Very little of the memory appears to be managed by Java.
[screenshots: Java memory usage with 1 thread and with 8 threads]
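For reference, here is a minimal sketch of one way to read the Java-managed portion of memory from inside the process; this may not be exactly how the numbers above were gathered, and it says nothing about allocations made by the native PyTorch libraries.

```scala
// A minimal sketch for approximating the Java-managed heap in use.
object JvmMemory {
  def usedMb(): Long = {
    val rt = Runtime.getRuntime
    System.gc() // encourage a collection so the reading is less noisy
    (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024)
  }

  def main(args: Array[String]): Unit = {
    println(s"Java heap in use: ${usedMb()} MB")
  }
}
```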
|
C memory is next. |
Are you using glove embeddings for this test?
This is very little memory...
|
The embeddings themselves are not used, only their vocabulary (torchscript/src/main/python/NerDataset.py, Line 48 in 4635086). The embeddings do show up in a roundabout way at …
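For context, "using only the vocabulary" amounts to something like the sketch below, assuming a standard GloVe text file in which each line starts with the word followed by its vector components; this is an illustration, not the code in NerDataset.py.

```scala
import scala.io.Source

// Read only the words from a GloVe-format embeddings file; the vectors
// themselves are read past and discarded.
object GloveVocabulary {
  def load(path: String): Set[String] = {
    val source = Source.fromFile(path, "UTF-8")
    try source.getLines().map(_.split(" ", 2)(0)).toSet
    finally source.close()
  }
}
```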
|
I think he is using random embeddings instead, which is fine for this experiment. |
I have not found any good (and easy and accurate) way to measure how much memory is being used by the libraries. jcmd doesn't seem to help, /proc/&lt;pid&gt; doesn't divulge secrets, PyTorch doesn't provide any insight through the Java interface, etc. As a hack, I ran the test program with the minimal possible Java memory setting from above, so that Java had none to spare, and then just used top to check how much memory was reserved for the process. For both 1 and 8 threads, 4.9GB was recorded. A simple hello world program generated by the same project measured just 64MB and provides a sanity check. If I run …
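On Linux, one programmatic stand-in for the top reading is the process's resident set size, which does include the native allocations the JVM cannot see. A minimal sketch, not code from this repo:

```scala
import scala.io.Source

// Linux-only: report the whole-process resident set size (roughly top's RES
// column), which includes memory allocated by native libraries.
object ProcessRss {
  def rssMb(): Option[Long] = {
    val source = Source.fromFile("/proc/self/status")
    try source.getLines()
      .find(_.startsWith("VmRSS:"))
      .map(_.split("\\s+")(1).toLong / 1024) // VmRSS is reported in kB
    finally source.close()
  }
}
```
|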
There seems to be a difficulty using TorchScript on a Mac. It looks like "System Integrity Protection" may need to be disabled from the GUI. There are some links about it towards the bottom of https://github.com/kwalcock/torchscript. I haven't been able to check yet whether it solves the problem. |
Thanks Keith!!
|
These are mean times (sec) for a forward pass of one sentence using a single thread. Onnx is close to 2.5x faster here.
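For what it's worth, a sketch of the kind of single-thread timing loop this implies; forwardPass here is a stand-in for whichever backend (TorchScript or ONNX Runtime) is being measured, not an existing function in this repo.

```scala
// Average wall-clock time of one forward pass over many sentences,
// after a short warm-up that is excluded from the mean.
object TimingHarness {
  def meanSeconds(sentences: Seq[Array[Long]], forwardPass: Array[Long] => Unit): Double = {
    sentences.take(10).foreach(forwardPass) // warm-up
    val start = System.nanoTime()
    sentences.foreach(forwardPass)
    (System.nanoTime() - start) / 1e9 / sentences.length
  }
}
```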
|
Cool! But how exactly does Onnx actually run the code? Do they have their own interpreter like TorchScript?
…On Thu, Sep 23, 2021, 5:26 PM Keith Alcock ***@***.***> wrote:
Onnx seems to bottom out at 16 threads, which happens to be the number of cores in the computer. Even though this happens sooner, the performance is still better than PyTorch, which was still taking 111 seconds to run while Onnx was down to 86.
|
There is a library dependency (for Onnx), but I haven't looked inside the jar: …
PyTorch, by contrast, requires jars plus several C libraries on the LD_LIBRARY_PATH: …
|
It probably means they have an interpreter, which is possibly natively supported. What NN did you run exactly? Just a feed forward one, or the double LSTM? |
This is the same model from Peter as before: torchscript/src/main/python/LSTMModel.py, Lines 9 to 14 in 7907436.
|
Thanks! |
TorchScript seems to be thread-safe and thread-efficient: it gets the same answer each time and gets it faster with more threads. This was run on a machine with 16 cores and 32 logical processors. For the first few doublings of the thread count, each doubling gives roughly a 1.5x speedup. Overall it maxes out at about 7x, where one would hope for 16-32x. Perhaps it would go higher on Clara.
Typos have been fixed in the graph captions.
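A sketch of the kind of thread-scaling test described above; process stands in for a forward pass through the shared TorchScript module and is not an existing function in this repo. Comparing the returned outputs across runs with different thread counts is one way to confirm thread safety.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Push the same batch of documents through a fixed-size thread pool, time the
// whole batch, and return the outputs so they can be compared across runs.
object ThreadScaling {
  def timeWithThreads[T](documents: Seq[T], numThreads: Int, process: T => String): (Double, Seq[String]) = {
    val executor = Executors.newFixedThreadPool(numThreads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(executor)
    val start = System.nanoTime()
    val results = Await.result(Future.sequence(documents.map(doc => Future(process(doc)))), Duration.Inf)
    val seconds = (System.nanoTime() - start) / 1e9
    executor.shutdown()
    (seconds, results)
  }
}
```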