Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Java] Adds support for DNNL, OpenVINO, TensorRT shared providers and refactors the CUDA shared provider loader #8013

Merged
merged 11 commits into from
Jul 21, 2021

Conversation

Craigacp
Copy link
Contributor

@Craigacp Craigacp commented Jun 10, 2021

Description:

Refactors the native library loading in Java to allow CUDA to be loaded on demand, fixing #7044. Then expands the shared provider library loading to DNNL, OpenVINO, TensorRT, fixing #6553.

Added a flag to the native library loading to allow users to supply a directory which contains all the native libraries, fixing #8003. This is also the only way to make the shared library providers load from a different place than the jar, as the individual library path specification conflicts with the way that the ONNX Runtime native code loads the shared library providers.

I also slightly refactored the Java cmake bits, and added the --console=plain flag to the gradle executions to stop gradle writing over cmake's output.

Motivation and Context

@Craigacp Craigacp requested a review from a team as a code owner June 10, 2021 00:05
@Craigacp
Copy link
Contributor Author

One remaining todo with this PR is to enable OpenVINO, DNNL and TensorRT in the Java testing pipeline (or enable Java in the relevant testing pipelines there). Is there a preference which way round that happens?

@Craigacp Craigacp mentioned this pull request Jun 10, 2021
@Craigacp
Copy link
Contributor Author

@RyanUnderhill
Copy link
Member

One remaining todo with this PR is to enable OpenVINO, DNNL and TensorRT in the Java testing pipeline (or enable Java in the relevant testing pipelines there). Is there a preference which way round that happens?

I think it's easier to add it to the provider pipelines (since provider availability can sometimes be mutually exclusive) vs trying to make them all work in one Java pipeline. For example, this isn't a shared provider yet, but I don't expect to be able to have CUDA and ROCM at the same time.

Thank you for cleaning up my original shared library stuff, this definitely makes it more efficient.

@Craigacp
Copy link
Contributor Author

Ok, I've added Java tests for TensorRT, OpenVINO and DNNL, and I added --build_java to the DNNL and OpenVINO pipelines (it was already present on the TensorRT one, though until this PR that wouldn't have done anything different to the CPU version of the Java code).

I'm not sure I've got the pipeline changes right, and I can't easily run them, so I'd appreciate it if someone could look over them and kick off the pipelines.

@snnn
Copy link
Member

snnn commented Jun 21, 2021

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux CPU x64 NoContribops CI Pipeline, Linux GPU CI Pipeline

@snnn
Copy link
Member

snnn commented Jun 21, 2021

/azp run Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, MacOS NoContribops CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline

@snnn
Copy link
Member

snnn commented Jun 21, 2021

/azp run Windows WebAssembly CI Pipeline, orttraining-amd-gpu-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@azure-pipelines
Copy link

Azure Pipelines successfully started running 4 pipeline(s).

@azure-pipelines
Copy link

Azure Pipelines successfully started running 5 pipeline(s).

@azure-pipelines
Copy link

Azure Pipelines successfully started running 8 pipeline(s).

@Craigacp
Copy link
Contributor Author

Craigacp commented Jun 21, 2021

Looks like the TensorRT test was skipped by gradle, I'll double check how it's passing in the flag. OpenVINO failed because it was looking for Java 8. I'm not sure what's setting JAVA_HOME in that docker file. Any ideas?

@snnn
Copy link
Member

snnn commented Jun 21, 2021

Looks like the TensorRT test was skipped by gradle, I'll double check how it's passing in the flag. OpenVINO failed because it was looking for Java 8. I'm not sure what's setting JAVA_HOME in that docker file. Any ideas?

The openvino build logs says:

update-alternatives: using /usr/lib/jvm/java-11-openjdk-amd64/bin/java to provide /usr/bin/java (java) in auto mode

So it should be able to find java in PATH. Is setting JAVA_HOME still needed?

@snnn
Copy link
Member

snnn commented Jun 21, 2021

ERROR: JAVA_HOME is set to an invalid directory: /usr/lib/jvm/java-8-openjdk-amd64

Oh, that's strange.

@snnn
Copy link
Member

snnn commented Jun 21, 2021

@snnn
Copy link
Member

snnn commented Jun 21, 2021

@Craigacp
Copy link
Contributor Author

Craigacp commented Jun 22, 2021

@Craigacp I think you need to update this file as well:

https://github.com/microsoft/onnxruntime/blob/master/cmake/onnxruntime_java_unittests.cmake

You may use this GRADLE_ARGS flag:

https://github.com/microsoft/onnxruntime/blob/master/cmake/onnxruntime_java.cmake#L162

I've already got the ep flags in there as GRADLE_TEST_EP_FLAGS. Weirdly it seems to have run the CUDA test in the TensorRT pipeline, but not the TensorRT test. I've updated the Java test cmake file to make it print the flags, hopefully that will give more insight.

I also removed the JAVA_HOME setting from run_build.sh. Could you kick off the TensorRT, OpenVINO and DNNL pipelines to check it out?

@Craigacp
Copy link
Contributor Author

Craigacp commented Jul 9, 2021

I've updated this PR to remove the merge conflicts, could someone start the CI pipelines?

@Craigacp
Copy link
Contributor Author

Could someone please start the CI pipelines? @snnn @RyanUnderhill

@edgchen1
Copy link
Contributor

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux CPU x64 NoContribops CI Pipeline, Linux GPU CI Pipeline, Windows WebAssembly CI Pipeline, orttraining-amd-gpu-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@edgchen1
Copy link
Contributor

/azp run Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, MacOS NoContribops CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline

@azure-pipelines
Copy link

Azure Pipelines successfully started running 9 pipeline(s).

@azure-pipelines
Copy link

Azure Pipelines successfully started running 8 pipeline(s).

@Craigacp
Copy link
Contributor Author

Looks like I put the argument outside the quote when it should have been inside the quote for OpenVINO. Could you restart that please?

@snnn
Copy link
Member

snnn commented Jul 16, 2021

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux CPU x64 NoContribops CI Pipeline, Linux GPU CI Pipeline, Windows WebAssembly CI Pipeline, orttraining-amd-gpu-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@snnn
Copy link
Member

snnn commented Jul 16, 2021

/azp run Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, MacOS NoContribops CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline

@azure-pipelines
Copy link

Azure Pipelines successfully started running 9 pipeline(s).

@azure-pipelines
Copy link

Azure Pipelines successfully started running 8 pipeline(s).

@Craigacp
Copy link
Contributor Author

I fixed the test passthrough and checked it against DNNL which is the easiest for me to run. I'd missed that the build.gradle needs to pass through the system properties to the test runner, and that onnxruntime_java_unittests.cmake is only used on Windows.

I still don't understand why the OpenVINO test is failing. Is there someone on your side who understands how that's setup that could have a look at the error? They might understand why the filesystem doesn't appear to be writeable.

@snnn
Copy link
Member

snnn commented Jul 20, 2021

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux CPU x64 NoContribops CI Pipeline, Linux GPU CI Pipeline, Windows WebAssembly CI Pipeline, orttraining-amd-gpu-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@snnn
Copy link
Member

snnn commented Jul 20, 2021

/azp run Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, MacOS NoContribops CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline

@azure-pipelines
Copy link

Azure Pipelines successfully started running 9 pipeline(s).

@azure-pipelines
Copy link

Azure Pipelines successfully started running 8 pipeline(s).

@Craigacp
Copy link
Contributor Author

Well at least TensorRT is happy now - 7: InferenceTest > testTensorRT() PASSED.

@snnn
Copy link
Member

snnn commented Jul 20, 2021

I will look the OpenVINO failure. Please give me a few days.

@snnn snnn self-assigned this Jul 20, 2021
@Craigacp
Copy link
Contributor Author

Thanks. If it's going to be a complicated fix then I'll turn off the OpenVINO test in Java so we can get this merged. I'm planning to update the Java provider interfaces so you can supply the various parameter structs to the CUDA provider etc and this change blocks most of them as it modifies the same files, so I'd appreciate getting this merged in early next week if we can.

@snnn
Copy link
Member

snnn commented Jul 20, 2021

Good suggestion. Let's merge this PR first.

@snnn
Copy link
Member

snnn commented Jul 20, 2021

Please help turn off the OpenVINO test for now.

@Craigacp
Copy link
Contributor Author

Done.

@snnn
Copy link
Member

snnn commented Jul 21, 2021

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux CPU x64 NoContribops CI Pipeline, Linux GPU CI Pipeline, Windows WebAssembly CI Pipeline, orttraining-amd-gpu-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@snnn
Copy link
Member

snnn commented Jul 21, 2021

/azp run Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, MacOS NoContribops CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline

@azure-pipelines
Copy link

Azure Pipelines successfully started running 9 pipeline(s).

@azure-pipelines
Copy link

Azure Pipelines successfully started running 8 pipeline(s).

@snnn
Copy link
Member

snnn commented Jul 21, 2021

Linux GPU CI Pipeline failure is due to an unknown reason that I'm still investigating.

@@ -47,4 +48,4 @@ ARG BUILD_USER=onnxruntimedev
WORKDIR /home/$BUILD_USER
RUN adduser --gecos 'onnxruntime Build User' --disabled-password $BUILD_USER --uid $BUILD_UID
RUN adduser $BUILD_USER video
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is why the openvino build failed. It created two users! The one used for running CI build doesn't have the necessary permissions. And the one has perm wasn't used for doing anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok. Well that sounds like a problem. To re-enable the java tests I think all that's necessary is to add --build_java to the end of the docker run command in linux-openvino-ci-pipeline.yml, inside the double quotes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants