
Update libedgetpu repo for compatibility with recent versions of TensorFlow #60

Merged
40 commits merged on Feb 29, 2024

Conversation

@feranick (Contributor) commented Jan 31, 2024

This PR:

  • Updates the URL calls to pull recent versions of TF (currently v2.15.0), TF/crosstools, and libusb (1.0.26); see the sketch below.
  • Updates the versions of Bazel needed in Docker.
  • Refactors WORKSPACE to conform to the deprecation of the separate TensorFlow/Toolchains repo. More info here.
  • Uses xz compression for compatibility with older versions of Debian (bullseye).
  • Adds missing support for "stddef.h".

This relates to issues tensorflow/tensorflow#62371 and #53.
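
A minimal sketch of the http_archive pinning mentioned in the first bullet (illustrative only; the tag-style URL and placeholder checksum are assumptions, not the PR's actual diff):

http_archive(
    name = "org_tensorflow",
    urls = [
        "https://github.com/tensorflow/tensorflow/archive/refs/tags/v2.15.0.tar.gz",
    ],
    # Placeholder: use the real sha256 of the downloaded archive.
    sha256 = "<sha256 of the v2.15.0 archive>",
    strip_prefix = "tensorflow-2.15.0",
)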

Prevent compilation failure
No longer needed; deb packaging will be done natively.
@Skillnoob

@dmitriykovalev, can you or some other maintainer please review this PR? Merging it would fix many of the issues that users of the Coral Edge TPU have with newer TensorFlow and Python versions.

@feranick (Contributor, Author) commented Mar 5, 2024

The reason I am asking is that I am building libcoral against 2.16.0 (as planned), not 2.17.0 as in the current master.

@Namburger (Contributor)

@feranick I'm trying to find out; it would be nice to get this included as part of 2.16.

@feranick (Contributor, Author) commented Mar 5, 2024 via email

@feranick (Contributor, Author) commented Mar 5, 2024

@Namburger So I tested it against the current TF master on both macOS and Linux, and all seems to work. To allow the use of TF master for temporary testing, I locally changed this in libcoral/WORKSPACE:

http_archive(
    name = "org_tensorflow",
    urls = [
        "https://github.com/tensorflow/tensorflow/archive/refs/heads/master.tar.gz",
    ],
    sha256 = "21a8363e3272a19977e2f0d12dcb87d1cb61ff0a79d20cfe456d9840e45e18d6",
    strip_prefix = "tensorflow-" + "master",
)

@Namburger (Contributor) commented Mar 5, 2024

@feranick Working with the tensorflow repo is a learning experience for me... I'll do a first pass to see how to get this backported to 2.16 and will let you know if we need a plan B.

@feranick (Contributor, Author) commented Mar 5, 2024

I'd discuss it with @mihaimaruseac. My previous PR was quickly cherry-picked into 2.16.0 thanks to him.

@Namburger (Contributor)

@feranick I've also talked to him; it doesn't seem like there is a plan to pick this into 2.16 at this time :/

Regarding your suggested change,

http_archive(
    name = "org_tensorflow",
    urls = [
        "https://github.com/tensorflow/tensorflow/archive/refs/heads/master.tar.gz",
    ],
    sha256 = "21a8363e3272a19977e2f0d12dcb87d1cb61ff0a79d20cfe456d9840e45e18d6",
    strip_prefix = "tensorflow-" + "master",
)

Could it work to just set TENSORFLOW_COMMIT=79ecb3f8bb6bd73f0115fa9a97b630a6f745a426?
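
For reference, a sketch of what that pin could look like; the variable names are assumed to mirror the workspace.bzl snippet quoted later in this thread, and the checksum is a placeholder that would need to be recomputed for the new archive:

TENSORFLOW_COMMIT = "79ecb3f8bb6bd73f0115fa9a97b630a6f745a426"
# Placeholder: sha256 of https://github.com/tensorflow/tensorflow/archive/<TENSORFLOW_COMMIT>.tar.gz
TENSORFLOW_SHA256 = "<sha256 of that archive>"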

@feranick (Contributor, Author) commented Mar 5, 2024

@Namburger That is unfortunate. With regard to the specific commit, that would build against 2.17.0 at this point, correct? Given that 2.17.0 is unstable and under development, I don't think that's a good idea...

@feranick (Contributor, Author) commented Mar 5, 2024

@Namburger Compilation of libcoral works fine on macOS and Linux when building against TF commit 79ecb3f. I updated the PR to reflect that, and now everything works.

The concern remains, though, that it's building against an unstable version of 2.17.0. Also, should libedgetpu be built against the same TF? (The stable release is currently built against 2.15.0, but I am prepping the next one with TF 2.16 as soon as it reaches stable.)

@Namburger (Contributor)

@feranick I hear the concerns. It doesn't look like we can get this in for 2.16, so unless we want to sync to a 2.17 dev version, we're stuck in limbo until it comes out, which probably won't happen any time soon...
It's very unfortunate that ultimately what blocks us is some visibility rules in bazel.

At this point my suggestion for moving forward and keeping all the main coral repos updated is to sync to 79ecb3f8bb6bd73f0115fa9a97b630a6f745a426; we can monitor for breakage and apply updates as 2.17 is being developed. WDYT?

@feranick (Contributor, Author) commented Mar 5, 2024

This sounds good to me @Namburger. The repos are already locally in sync with that commit.

I already found a possible breakage due to 2.17 (see log here) when cross-compiling for ARM (there are no issues on macOS and Linux for x86). I forked 2.16.0 and added the visibility patch to test whether the issue is in 2.17 or not. Will report back.

@feranick (Contributor, Author) commented Mar 5, 2024

@Namburger Since this PR has been merged and concerns libedgetpu (not libcoral), I am going to continue reporting on this in the relevant PR:

google-coral/libcoral#36

@Namburger (Contributor)

This sounds good to me @Namburger. The repos are already locally in sync with that commit.

I already found a possible breakage due to 2.17 (see log here) when cross-compiling for ARM (there are no issues on macOS and Linux for x86). I forked 2.16.0 and added the visibility patch to test whether the issue is in 2.17 or not. Will report back.

Interesting; it doesn't look like neon_fully_connected_arm32.cc has been changed in 7 months, which seems to indicate the compiler doesn't like the asm.

@mihaimaruseac

So, unfortunately this landed too late in the 2.16 release process (it was supposed to also have an RC1 to handle bugs from RC0 but the release team decided otherwise).

If you pin to a commit from the master branch, you will always build at that commit, but you will need to re-test whenever you move to a different one. Regarding the commit to pick, I would suggest either the commit where the support got added or one of the nightly commits after it (check whether a pip wheel was built).

@feranick (Contributor, Author) commented Mar 5, 2024

Well, as unfortunate as that is, I am really grateful to you both, @mihaimaruseac and @Namburger, for trying to push this forward. I am still running a few build tests; it's pretty easy to set whichever version we want to build against.

If I were to build against the latest nightly, how do I find its corresponding commit?

@mihaimaruseac

If you build on the day immediately after the nightly release, you can use the commit at the top of the nightly branch

If you build some days later, tf.version should provide the information (I don't recall the exact API; I left the actual TF team around 2 years ago and now I'm just consulting from time to time). Alternatively, you can look at the matching GH Actions run and take the commit the action ran at. That would be 019e960 in this screenshot:
[Screenshot of the GitHub Actions nightly run showing commit 019e960]
(which corresponds to https://github.com/tensorflow/tensorflow/actions/runs/8150842615)
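
A rough sketch of that check, assuming the tf.version fields (which I haven't verified recently):

import tensorflow as tf

# VERSION is the release string (e.g. a dev/nightly version), and
# GIT_VERSION embeds the git revision the installed wheel was built from.
print(tf.version.VERSION)
print(tf.version.GIT_VERSION)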

@feranick (Contributor, Author) commented Mar 5, 2024

Interesting; it doesn't look like neon_fully_connected_arm32.cc has been changed in 7 months, which seems to indicate the compiler doesn't like the asm.

The issue is specific to armv7a. This is the commit whose edits are exactly what the compiler is complaining about. There are no specific details on what drove the commit in the first place:

tensorflow/tensorflow@8419c70

@Namburger (Contributor)

Interesting; it doesn't look like neon_fully_connected_arm32.cc has been changed in 7 months, which seems to indicate the compiler doesn't like the asm.

The issue is specific to armv7a. This is the commit whose edits are exactly what the compiler is complaining about. There are no specific details on what drove the commit in the first place:

tensorflow/tensorflow@8419c70

Hmm, that certainly sounds like the issue. As I understand it, that commit changes the constraints on the registers used for that op, which matches the error:

external/org_tensorflow/tensorflow/lite/kernels/internal/optimized/4bit/neon_fully_connected_arm32.cc: In function 'void tflite::optimized_4bit::NeonRunKernelNoSDot(const uint8_t*, const int8_t*, int32_t*, int, int, int, int, int, int) [with int RowsLeft = 4; int RowsRight = 1; int Cols = 32]':
external/org_tensorflow/tensorflow/lite/kernels/internal/optimized/4bit/neon_fully_connected_arm32.cc:192:7: error: 'asm' operand has impossible constraints
  192 |       asm volatile(KERNEL_4x1
      |       ^~~

@feranick (Contributor, Author) commented Mar 5, 2024

Indeed. Weirdly enough, the change in that commit was only applied to neon_fully_connected_arm32.cc, not to neon_fully_connected_aarch64_sdot.cc, which in fact works fine...

@feranick (Contributor, Author) commented Mar 6, 2024

@Namburger I tested it with a local fork of TF 2.17.0 with neon_fully_connected_arm32.cc reverted to be in line with the other architectures, and compilation still fails with the same error:

external/org_tensorflow/tensorflow/lite/kernels/internal/optimized/4bit/neon_fully_connected_arm32.cc: In function 'void tflite::optimized_4bit::NeonRunKernelNoSDot(const uint8_t*, const int8_t*, int32_t*, int, int, int, int, int, int) [with int RowsLeft = 4; int RowsRight = 1; int Cols = 32]':
external/org_tensorflow/tensorflow/lite/kernels/internal/optimized/4bit/neon_fully_connected_arm32.cc:192:7: error: 'asm' operand has impossible constraints
  192 |       asm volatile(KERNEL_4x1
      |       ^~~
INFO: Elapsed time: 34.950s, Critical Path: 17.60s

Note that this file (and the whole 4bit folder) was introduced with TF 2.14.

So maybe the commit was intended to address this problem, but didn't...

@feranick (Contributor, Author) commented Mar 6, 2024

@Namburger Disclaimer: I know close to nothing about asm. Still, I tried to address this error, which seems to stem from a limitation of the architecture. Following some comments from here, I replaced "r" with "g", so that in external/org_tensorflow/tensorflow/lite/kernels/internal/optimized/4bit/neon_fully_connected_arm32.cc line 192 what is currently:

:
                   : [lhs_val] "r"(lhs_val), [rhs_val] "r"(rhs_val),
                     [element_ptr] "r"(element_ptr), [bit_shift] "r"(bit_shift),
                     [run_depth] "r"(run_depth)

is changed to:

:
                   : [lhs_val] "g"(lhs_val), [rhs_val] "g"(rhs_val),
                     [element_ptr] "g"(element_ptr), [bit_shift] "g"(bit_shift),
                     [run_depth] "g"(run_depth)

Compilation proceeds beyond this point, but it then stops with a similar error in another external library:

Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
In file included from external/ruy/ruy/pack_arm.cc:16:
external/ruy/ruy/pack_arm.h:492:9: warning: multi-line comment [-Wcomment]
  492 | #endif  // (RUY_PLATFORM_NEON_64 || RUY_PLATFORM_NEON_32) && \
      |         ^
external/ruy/ruy/pack_arm.cc: In function 'void ruy::Pack8bitColMajorForNeon4Cols(const ruy::PackParams8bit&)':
external/ruy/ruy/pack_arm.cc:264:3: error: 'asm' operand has impossible constraints
  264 |   asm volatile(
      |   ^~~
INFO: Elapsed time: 32.683s, Critical Path: 20.21s

Not encouraging.

@Namburger (Contributor) commented Mar 6, 2024

Hi @feranick, with all of the changes, would you be able to send a set of commands to reproduce the very first issue?
I believe it looks something like this (but it may require you to push some commits to your fork of tensorflow and then sync your fork of libedgetpu to it)?

git clone git@github.com:google-coral/libcoral.git && cd libcoral
git fetch origin pull/36/head:pull-36
git checkout pull-36
git submodule init && git submodule update
make DOCKER_IMAGE=debian:buster DOCKER_CPUS="armv7a" DOCKER_TARGETS=tests docker-build

I'm talking to some folks, trying to get an easy process to repro the original issues and continue on. It appears that element_ptr needs to be a "r+", and there could be more issues, as you pointed out, so we'll need to work this out one step at a time, unfortunately :(

@Namburger (Contributor) commented Mar 6, 2024

A procedure for making code changes to tensorflow and testing them against this build would also be nice to have; not sure if this can easily be done with bazel, though.

@mihaimaruseac

One idea would be to fork TensorFlow, make changes in your own fork (and sync from time to time) and then change

libedgetpu/workspace.bzl

Lines 68 to 76 in b5820ad

maybe(
    http_archive,
    name = "org_tensorflow",
    urls = [
        "https://github.com/tensorflow/tensorflow/archive/" + tensorflow_commit + ".tar.gz",
    ],
    sha256 = tensorflow_sha256,
    strip_prefix = "tensorflow-" + tensorflow_commit,
)
to point to the fork and the last commit there.

Then, once it all works, you can make a PR from the fork back to TF, and once that lands you can change the workspace.bzl file to point back to upstream TF.
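
For illustration, a sketch of that workspace.bzl edit against a fork (the fork URL is a placeholder, and the commit/sha256 would be those of the fork's latest commit):

maybe(
    http_archive,
    name = "org_tensorflow",
    urls = [
        # Placeholder fork URL; tensorflow_commit would be the fork's head commit.
        "https://github.com/<your-fork>/tensorflow/archive/" + tensorflow_commit + ".tar.gz",
    ],
    sha256 = tensorflow_sha256,  # recompute for the fork's archive
    strip_prefix = "tensorflow-" + tensorflow_commit,
)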

@feranick (Contributor, Author) commented Mar 6, 2024

@mihaimaruseac Thanks. This is effectively what I did for testing. An advantage of this is that libedgetpu could still be built against a stable version of TF (2.16.0, forked and with the visibility patch). I'd note that libedgetpu can be used independently of libcoral/pycoral (in fact, that is how I use the Edge TPU). Yet it would allow testing and building of libcoral/pycoral to continue while libedgetpu stays "stable".

@feranick (Contributor, Author) commented Mar 6, 2024

Hi @feranick, with all of the changes, would you be able to send a set of commands to reproduce the very first issue? I believe it looks something like this (but it may require you to push some commits to your fork of tensorflow and then sync your fork of libedgetpu to it)?

Hi @Namburger, here's a slightly revised version of your commands (I can't clone via git@github.com:google-coral/libcoral.git, only via https). Also, you should build using debian:bookworm, as it has support for Python 3.11:

git clone https://github.com/google-coral/libcoral.git && cd libcoral
git fetch origin pull/36/head:pull-36
git checkout pull-36
git submodule init && git submodule update
make DOCKER_IMAGE=debian:bookworm DOCKER_CPUS="armv7a" DOCKER_TARGETS=tests docker-build

These commands will reliably trigger the issue (just tested). As for libedgetpu, it currently builds against TF 2.16.0-rc0, but the issue here happens regardless of the TF version libedgetpu is built against. I can change it to TF 2.17.0 + the visibility patch if needed (but see the comment above).

@Namburger (Contributor)

By way of an update: the author of tensorflow/tensorflow@8419c70 promised to take a look sometime this week!

@feranick (Contributor, Author)

By way of an update: the author of tensorflow/tensorflow@8419c70 promised to take a look sometime this week!

Hi @Namburger, just wondering whether there is an update on this... Thanks!
