
Build for Intel gptj-99 docker fails #11

Open · gktejus opened this issue Dec 12, 2023 · 5 comments
gktejus commented Dec 12, 2023

I'm trying to reproduce the Intel results for gptj-99 and am setting up the Docker container by running ./build_gpt-j_int4_container.sh.

However, the build fails with a number of errors:

63.81 [58/800] Building CXX object inference/loadgen/CMakeFiles/mlperf_loadgen.dir/test_settings_internal.cc.o
63.86 [59/800] Building CXX object inference/loadgen/CMakeFiles/mlperf_loadgen.dir/results.cc.o
64.29 [60/800] Building CXX object mlperf_plugins/CMakeFiles/mlperf_plugins.dir/csrc/activation.cpp.o
64.30 [61/800] Building CXX object inference/loadgen/CMakeFiles/mlperf_loadgen.dir/logging.cc.o
64.45 [62/800] Building CXX object mlperf_plugins/CMakeFiles/mlperf_plugins.dir/csrc/softmax.cpp.o
64.45 FAILED: mlperf_plugins/CMakeFiles/mlperf_plugins.dir/csrc/softmax.cpp.o
64.45 /usr/bin/clang++ -DUSE_C10D_GLOO -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dmlperf_plugins_EXPORTS -Dusercp -I/opt/workdir/code/bert-99/pytorch-cpu/mlperf_plugins/onednn/include -I/opt/workdir/code/bert-99/pytorch-cpu/mlperf_plugins/libxsmm/include -I/opt/workdir/code/bert-99/pytorch-cpu/mlperf_plugins/csrc/tpps -I/opt/workdir/code/bert-99/pytorch-cpu/build/mlperf_plugins/onednn/include -I/opt/workdir/code/bert-99/pytorch-cpu/mlperf_plugins/onednn/src/../include -isystem /opt/workdir/code/bert-99/pytorch-cpu/mlperf_plugins/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -O3 -DNDEBUG -fPIC -Wall -isystem /opt/conda/include -Wno-unused-function -march=native -mfma -D_GLIBCXX_USE_CXX11_ABI=1 -fopenmp=libomp -MD -MT mlperf_plugins/CMakeFiles/mlperf_plugins.dir/csrc/softmax.cpp.o -MF mlperf_plugins/CMakeFiles/mlperf_plugins.dir/csrc/softmax.cpp.o.d -o mlperf_plugins/CMakeFiles/mlperf_plugins.dir/csrc/softmax.cpp.o -c /opt/workdir/code/bert-99/pytorch-cpu/mlperf_plugins/csrc/softmax.cpp
64.45 In file included from /opt/workdir/code/bert-99/pytorch-cpu/mlperf_plugins/csrc/softmax.cpp:5:
64.45 In file included from /opt/workdir/code/bert-99/pytorch-cpu/mlperf_plugins/csrc/tpps/i_softmax_tpp.hpp:5:
64.45 /opt/workdir/code/bert-99/pytorch-cpu/mlperf_plugins/csrc/tpps/el_common_intrin.hpp:37:31: error: unknown type name '__m256h'
64.45   static void _mm256_print_ph(__m256h a) {
64.45                               ^
64.45 /opt/workdir/code/bert-99/pytorch-cpu/mlperf_plugins/csrc/tpps/el_common_intrin.hpp:44:31: error: unknown type name '__m512h'
64.45   static void _mm512_print_ph(__m512h a) {
64.45                               ^
64.45 /opt/workdir/code/bert-99/pytorch-cpu/mlperf_plugins/csrc/tpps/el_common_intrin.hpp:48:35: error: use of undeclared identifier '_mm256_loadu_ph'
64.45     auto f_half = _mm512_cvtph_ps(_mm256_loadu_ph((void*)mem));

Is this a known issue, and is there any workaround for this?
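For context, `__m256h`/`__m512h` and `_mm256_loadu_ph` are AVX512-FP16 types and intrinsics. The headers only declare them when AVX512-FP16 code generation is enabled, so compiling with `-march=native` on a CPU that lacks the extension (anything before Sapphire Rapids, including Ice Lake) yields exactly these "unknown type name" / "undeclared identifier" errors. A minimal sketch of the usual compile-time guard follows; it assumes a toolchain with AVX512-FP16 support (GCC 12+ / Clang 14+), and the helper body is illustrative rather than the actual el_common_intrin.hpp code:

```cpp
// Sketch only: guard AVX512-FP16-dependent helpers with the feature macro
// so the translation unit still compiles on targets without the ISA.
#include <immintrin.h>
#include <cstdio>

#ifdef __AVX512FP16__  // defined only when AVX512-FP16 codegen is enabled
static void _mm256_print_ph(__m256h a) {
  _Float16 buf[16];
  _mm256_storeu_ph(buf, a);  // store 16 half-precision lanes
  for (int i = 0; i < 16; ++i)
    std::printf("%f ", (float)buf[i]);
  std::printf("\n");
}
#else
// On Ice Lake and earlier the helper is compiled out; its call sites
// must be guarded the same way.
#endif
```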

@arjunsuresh (Contributor)

Which machine are you targeting for the run?

@gktejus gktejus changed the title Build for Intel BERT docker fails Build for Intel gptj-99 docker fails Dec 12, 2023

gktejus commented Dec 12, 2023

@arjunsuresh I'm trying this on a VM running on an Intel Ice Lake server for the moment, and I'm seeing Docker build failures not only with gptj but with other models as well.

@arjunsuresh (Contributor)

I'm not sure the Intel submission code will run on Ice Lake, as it is specific to Sapphire Rapids. With some minor changes to the scripts we were able to run the code on a 4-vCPU SPR cloud instance for the 3.1 inference round.

We are about to automate the Intel submission code inside the MLCommons CM automation project (the Nvidia, reference, and Qualcomm code paths have been added so far); after that, it should be much easier to reproduce any submitted result.
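Since the incompatibility is an ISA gap rather than a packaging problem, it can be detected before the Docker build is even attempted. A hypothetical preflight check, assuming a GCC/Clang toolchain on x86-64 (`cpu_has_avx512fp16` is an illustrative name, not part of the submission code); AVX512-FP16 support is reported in CPUID leaf 7, sub-leaf 0, EDX bit 23:

```cpp
// Hypothetical preflight check (not part of the Intel submission code):
// AVX512-FP16 support is reported in CPUID.(EAX=7,ECX=0):EDX[23].
#include <cpuid.h>
#include <cstdio>

static bool cpu_has_avx512fp16() {
  unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
  if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
    return false;           // CPUID leaf 7 not available
  return (edx >> 23) & 1u;  // bit 23 = AVX512_FP16
}

int main() {
  bool ok = cpu_has_avx512fp16();
  std::printf("AVX512-FP16: %s\n", ok ? "yes" : "no");
  return ok ? 0 : 1;  // non-zero exit -> abort the build script
}
```

On a Sapphire Rapids host this exits 0; on Ice Lake parts it exits 1.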


mittaltarkik commented Jan 2, 2024

@gktejus Were you able to find a fix for this issue?

@arjunsuresh I am also facing the same issue. My VM configuration is:
Standard E96s v5 (96 vCPUs, 672 GiB memory)
Processor: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
OS: Linux (CentOS 8)


gfursin commented Jan 20, 2024

While discussing this issue on our Discord channel for the MLCommons Task Force on automation and reproducibility, we got a suggestion to validate the target hardware's capabilities in the CM workflow before running/reproducing Intel submissions, and to return a user-friendly warning if a given implementation is incompatible with the target. @arjunsuresh, let's discuss how to prototype this in your new CM workflow for Intel submissions; normally we should already have all the functionality in CM to do it.
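A minimal sketch of such a capability gate, assuming a Linux target and using the `avx512_fp16` CPU flag as the feature of interest (illustrative only; the real check would live in the CM workflow scripts):

```cpp
// Illustrative capability gate: scan /proc/cpuinfo for a required CPU
// flag and emit a user-friendly warning when it is missing.
#include <fstream>
#include <iostream>
#include <string>

static bool cpuinfo_has_flag(const std::string& flag) {
  std::ifstream cpuinfo("/proc/cpuinfo");
  std::string line;
  while (std::getline(cpuinfo, line)) {
    if (line.rfind("flags", 0) == 0 &&
        line.find(" " + flag) != std::string::npos)
      return true;
  }
  return false;
}

int main() {
  if (!cpuinfo_has_flag("avx512_fp16")) {
    std::cerr << "WARNING: this CPU lacks avx512_fp16; the Intel "
                 "Sapphire Rapids submission code is incompatible "
                 "with this target.\n";
    return 1;
  }
  return 0;
}
```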
