update #25

niuliling123 · 2021-08-23T06:28:13Z

PR types

PR changes

Describe

* add exp and exp_grad npu op * modify support register type * remove empty line and remove exp_grad support data type int/int64 * move exp and exp_grad kernel to activation_op_npu.cc, delete attrs * move code to activation_op_npu.cc

* add save/load for pipelineparallel * add save/load

* add auto_parallel apis

* Add ext_tensor.slice() API, test=develop * Call Tensor::mutable_data first to fix bugs and add test for writing to sliced tensor * Fix unit test bug * Fix code format problem, test=develop * Fix code format problem * Fix code format problem * strengthen unit test * Use CustomTensorUtils::ShareDataFrom to simplify code

* add batch_norm_op_npu and tests * remove skip.If * fix bug

* add reduce_mean_op_npu and test * remove skip.If * update

* add momentum_op_npu and test * update * fix hang

* add while read_from_array write_to_array npu op * optimize unittest

* fix_fc_reshape_convert * fix

…4304) * add set_value_grad op * add unittest. * polish unittest. * polish code. * support cuda kernel * polish code according to CI * polish code. * polish code * remove *.pyc * polish code. * add unittest to improve coverage. * polish code.

…caler (#34300) * add state_dict and load_state_dict and unittest for class GradScaler * refine unittest for coverage of load_state_dict * refine comments of code-block * refine some comments * refine state_dict code and unittest * add #require gpu, xpu for GradScaler get/set example code * add #require gpu, xpu for GradScaler get/set example code * refine example code * refine unittest for state_dict * refine unittest for state_dict * fix bug of DataLoader in TestGradScalerStateDict * add flag FLAGS_cudnn_deterministic

* - Added softmax without caching * - Binary is no longer manually cached * - Activation onednn caching removed * - Removed manual caching of activation * - modified UT * - fix * - fix * - fixes to building * - fix * - fix * - fix to UT * - Faulty UT workaround * - approval workaround * - Fixes after review * - compilation fixes * - more lint fixes * - more fixes after review * - fixes after another round of review

* add det_mv3_db & LeViT test case in pr-ci-inference * fix LeViT model dir bugs * fix grammar error

* [NPU] Support npu op expand_v2 and expand_v2_grad * [NPU] Support npu op expand_v2 and expand_v2_grad * [NPU] Support npu op expand_v2 and expand_v2_grad * update test_expand_v2_op_npu.py * update test_expand_v2_op_npu.py * modify expand_v2_op_npu.cc * modify expand_v2_op_npu.cc

* add recompute for pp * add recompute offload * add recompute partition

* Fix safety-bug of functional.linear * Fix safety-bug of functional.linear * Fix safety-bug of functional.linear * Fix safety-bug of functional.linear

This PR adds fused-transformer-related files that define the C interface, including classes, functions, etc.

This reverts commit 0a5c99e.

* remove unmatched signal error stack * fix error writing for cond

* notest;test=gpu-inference * notest;test=gpu-inference * notest;test=gpu-inference * notest;test=gpu-inference * fix error * notest;test=gpu-inference * notest;test=gpu-inference * notest;test=gpu-inference * test=gpu-inference

* fix batch_norm and instance norm when input is []

…ut's shape is [0, 0, 0]. (#34996)

* add slim resnet50 quant model in pr-ci-inference * enable resnet50_quant multi_thread4_trt_int8_bz1 * remove LOG(FATAL)

* add npu sin op * [NPU] Support npu kernel for sin op * modify support npu kernel for sin op * modify support npu kernel for sin op * modify npu sin op * modify npu sin op * add sin op npu

* Add run function log * test=document_fix

* add (N,C,*) input support for GroupNorm * --amend

* [NPU] Support npu op where and where grad * fix use const_cast * delete a test

* add depthwise_conv2d npu * add some tests * Delete test_unique_op_npu.py * delete trans input

* add trainer desc config to distributed strategy * code style modified * data_feed set lod

* use spin lock in auto growth allocator, test=develop * use pthread spin lock, test=develop * use lock guard, test=develop * use malloc spin lock, test=develop * use lock_guard, test=develop

* [NPU] Support npu kernel for pad3d op * fix for comment of zhouwei25 * fix some bugs according to qili93's comments * add support and test for paddings in input * delete VLOG used for debug

* add rmsprop npu * add argsort npu * add argsort npu * modify according to review * modify sharedatawith according to review * modify reshape according to review * rm dygraph=false

…#35004)

* Add cuda device count api * update code format * fix unittest error * update code format * update comment

* adamw support cuda * adamw support cuda

… out of bounds (#35062)

* 1. add interface for fft; 2. add data type predicate; 3. fix paddle.roll.
* add fft c2c cufft kernel
* implement argument checking & op calling parts for fft_c2c and fftn_c2c
* add operator and opmaker definitions
* only register float and double for cpu.
* add common code for implementing FFT, add pocketfft as a dependency
* add fft c2c cufft kernel function
* fix bugs in python interface
* add support for c2r, r2c operators, op makers, kernels and kernel functors.
* test and fix bugs
* 1. fft_c2c function: add support for onesided=False; 2. add complex<float>, complex<double> support for concat and flip.
* 1. fft: fix python api bugs; 2. shape_op: add support for complex data types.
* fft c2c cufft kernel done with compile and link
* fix shape_op, add mkl placeholder
* remove mkl
* complete fft c2c in gpu
* 1. implement mkl-based fft, FFTC2CFunctor and common function exec_fft; 2. change the design, add input and output typename as template parameter for all FFTFunctors, update pocketfft-based implementation.
* complete fft c2c on gpu in ND
* complete fft c2c on gpu in ND
* complete fft c2c backward in ND
* fix MKL-based implementation
* Add frame op and CPU/GPU kernels.
* Add frame op forward unittest.
* Add frame op forward unittest.
* Remove axis parameter in FrameFunctor.
* Add frame op grad CPU/GPU kernels and unittest.
* Add frame op grad CPU/GPU kernels and unittest.
* Update doc string.
* Update after review and remove librosa requirement in unittest.
* Update grad kernel.
* add fft_c2r op
* Remove data allocation in TransCompute function.
* add fft r2c onesided with cpu(pocketfft/mkl) and gpu
* last fft c2r functor
* fix C2R and R2C for cufft, because the direction is not an option in these cases.
* add fft r2c onesided with cpu(pocketfft/mkl) and gpu
* fix bugs in python APIs
* fix fft_c2r grad kernel
* fix bugs in python APIs
* add cuda fft c2r grad kernel functor
* clean code
* fix fft_c2r python API
* fill fft r2c result with conjugate symmetry (#19) fill fft r2c result with conjugate symmetry
* add placeholder for unittests (#24)
* simple parameterize test function by auto generate test case from param list (#25)
* miscellaneous fixes for python APIs (#26)
* add placeholder for unittests
* resize fft inputs before computation if n or s is provided.
* add complex kernels for pad and pad_grad
* simplify argument checking.
* add type promotion
* add int to float or complex promotion
* fix output data type for static mode
* fix fft's input dtype dispatch, import fft to paddle
* fix typos in axes checking (#27)
* fix typos in axes checking
* fix argument checking (#28)
* fix argument checking
* Add C2R Python layer normal and abnormal use cases (#29)
* documents and single case
* test c2r case
* New C2R Python layer normal and exception use cases
* complete rfft,rfft2,rfftn,ihfft,ihfft2,ihfftn unittest and doc string (PaddlePaddle#30)
* Documentation of the common interfaces of c2r and c2c (PaddlePaddle#31)
* Documentation of the common interfaces of c2r and c2c
* clean c++ code (PaddlePaddle#32)
* clean code
* Add numpy-based implementation of spectral ops (PaddlePaddle#33)
* add numpy reference implementation of spectral ops
* Add fft_c2r numpy based implementation for unittest. (PaddlePaddle#34)
* add fft_c2r numpy implementation
* Add deframe op and stft/istft api. (#23)
* Add frame api
* Add deframe op and kernels.
* Add stft and istft apis.
* Add deframe api. Update stft and istft apis.
* Fix bug in frame_from_librosa function when input dims >= 3
* Rename deframe to overlap_add.
* Update istft.
* Update after code review.
* Add overlap_add op and stft/istft api unittest (PaddlePaddle#35)
* Add overlap_add op unittest.
* Register complex kernels of squeeze/unsqueeze op.
* Add stft/istft api unittest.
* Add unittest for fft helper functions (PaddlePaddle#36)
* add unittests for fft helper functions. add complex kernel for roll op.
* complete static graph unittest for all public api (PaddlePaddle#37)
* Unittest of op with FFT C2C, C2R and r2c added (PaddlePaddle#38)
* documents and single case
* test c2r case
* New C2R Python layer normal and exception use cases
* Documentation of the common interfaces of c2r and c2c
* Unittest of op with FFT C2C, C2R and r2c added
  Co-authored-by: lijiaqi <lijiaqi0612@163.com>
* add fft related options to CMakeLists.txt
* fix typos and clean code (PaddlePaddle#39)
* fix invisible character in mkl branch and fix error in error message
* clean code: remove docstring from unittest for signal.py.
* always convert numpy array to paddle.Tensor to avoid comparing numpy dtype with paddle dtype. (PaddlePaddle#40)
* always convert numpy array to paddle.Tensor to avoid comparing numpy dtype with paddle dtype.
* fix CI Errors: numpy dtype comparison, thrust when cuda is not available (PaddlePaddle#41) 1. always convert numpy array to paddle.Tensor to avoid comparing numpy dtype with paddle dtype. 2. promote floating point tensor to complex tensor for fft_c2c and fft_c2r; 3. fix unittest to catch UnImplementedError and RuntimeError; 4. fix compile error by avoiding the use of thrust when cuda is not available. 5. fix sample code, use paddle.fft instead of paddle.tensor.fft
* remove inclusion of thrust, add __all__ list for fft (PaddlePaddle#42)
* Add api doc and update unittest. (PaddlePaddle#43)
* Add doc strings.
* Update overlap_add op unittest
* fix MKL-based FFT implementation (PaddlePaddle#44)
* fix MKL-based FFT implementation, MKL CDFT's FORWARD DOMAIN is always REAL for R2C and C2R
* remove code for debug (PaddlePaddle#45)
* use dynload for cufft (PaddlePaddle#46)
* use std::ptrdiff_t as datatype of stride (instead of int64_t) to avoid argument mismatch on some platforms.
* add complex support for fill_zeros_like
* use dynload for cufft
* Update doc and unittest. (PaddlePaddle#47)
* Add doc of frame op and overlap_add op.
* Update unittest.
* use dynload for cufft (PaddlePaddle#48) 1. use dynload for cufft 2. fix unittest; 3. temporarily disable Rocm.
* fix conflicts and merge upstream (PaddlePaddle#49) fix conflicts and merge upstream
* fix compile error: only link dyload_cuda when cuda is available (PaddlePaddle#50)
* fix compile error: only link dyload_cuda when cuda is available
* fix dynload for cufft on windows (PaddlePaddle#51) 1. fix dynload for cufft on windows; 2. fix unittests.
* add NOMINMAX to compile on windows (PaddlePaddle#52) add NOMINMAX to compile on windows
* explicitly specify capture mode for lambdas (PaddlePaddle#55) explicitly specify capture mode for lambdas
* fix fft sample (PaddlePaddle#53)
* fix fft sample
* update scipy and numpy version for unittests of fft (PaddlePaddle#56) update scipy and numpy version for unittests of fft
* Add static graph unittests of frame and overlap_add api. (PaddlePaddle#57)
* Remove cache of cuFFT & Disable ONEMKL (PaddlePaddle#59) 1. replace numpy.fft with scipy.fft as numpy<1.20 does not support ortho norm 2. remove cache of cufft plans; 3. enhance error checking. 4. default WITH_ONEMKL to OFF

Co-authored-by: jeff41404 <jeff41404@gmail.com>
Co-authored-by: root <root@bjyz-sys-gpu-kongming9.bjyz.baidu.com>
Co-authored-by: KP <109694228@qq.com>
Co-authored-by: lijiaqi <lijiaqi0612@163.com>
Co-authored-by: Xiaoxu Chen <chenxx_id@163.com>
Co-authored-by: lijiaqi0612 <33169170+lijiaqi0612@users.noreply.github.com>

b3602sss and others added 30 commits August 11, 2021 14:06

miss format (#34771)

addd5fc

[NPU] add elementwise_min_grad_op_npu,test=develop (#34731)

45af4f2

[NPU] Add exp and exp_grad npu op (#34612)

b5ec65e

[HybridParallel] Support save/load for PipeLineParallel (#34768)

88f2f4a

add the basic apis for auto_parallel (#33804)

3f962e7

[hybrid] pp+dp support fp16 allreduce (#34762)

4d7af37

[NPU] add batch_norm_op_npu and test (#34056)

9ed5db2

[NPU] add reduce_mean_op_npu and test (#34053)

f6fab55

[NPU] add momentum_op_npu and test (#34082)

9e3e08f

split_op for npu (#34699)

d45d311

[NPU] add while, read_from_array and write_to_array npu op (#34755)

234c21a

[NPU] Support npu op flatten_contiguous_range_grad (#34798)

fc537d4

[Paddle TRT]fix_fc_int8_convert; fix_reshape_convert (#34787)

3429c04

add det_mv3_db & LeViT test case in pr-ci-inference (#34803)

1c31d9d

[NPU] Support npu kernel for smooth_l1_loss op (#34674)

cfa6913

[HybridParallel]Add Recompute for PipeLineParallel (#34607)

589d13c

Fix safety-bug of functional.linear (#34696)

0e28c8b

transformer c files (#34706)

016cc56

[Inference] Inference python api support fp16 (#34676)

6326c3e

fix set_grad_ivar bug of Tensor.backward (#34819)

dffb0b2

Revert "[oneDNN] Fix to issue #34554 (#34623)" (#34838)

dc62a22

Remove incorrect signal error stack trace (#34842)

572adcc

[NPU] add meshgrid, test=develop (#34576)

3f71e8d

[npu]add unsqueeze2_grad,test=develop (#34733)

2164ad6

add retry for gethostbyname (#34855)

e92f038

tianshuo78520a and others added 27 commits August 19, 2021 09:47

Fix Inference CI CPU/GPU (#34931)

26213a7

add the auto scan test for TensorRT convert,test=develop (#34980)

255fc7d

fix batch_norm and instance norm when input is [] (#34107)

ca7f520

Add dimension check for inverse to avoid dividing by 0 error when input's shape is [0, 0, 0]. (#34996)

a2e0865

add resnet50_quant model in PR-CI-INFERENCE (#35012)

97cae5e

remove unused statements in test_dist_base.py (#35017)

ef024c8

Fix op-benchmark cpu/gpu; test=document_fix (#35027)

ed9a14e

fix reshape when is a number (#35016)

866c1ea

[NPU] Support npu kernel for sin op (#34844)

4641e8f

Add op benchmark run function log (#35034)

096b0f2

[bug fix] fix spectral_norm bug (#35005)

1aa2bde

add (N,C,*) input support for GroupNorm (#34773)

4637151

temporary disable resnet50-quant multi-thread test (#35035)

f927b65

[NPU] Support npu op where and where grad (#34587)

d082955

[NPU] Support npu op depthwise_conv2d (#34853)

4c115a8

fix set_lod in data_feed (#35000)

4416c79

use spin lock in auto growth allocator (#34910)

6bacfb0

[NPU] Support npu kernel for pad3d op (#34815)

ef517a5

[npu]Add argsort op (#34865)

99ffeff

fix model-benchmark build error (#35041)

f6015d0

[hybrid performance] Grad fuse for gradient merge under pipeline mode (#35004)

4d9b2d6

Add paddle.linalg.matrix_power OP (#34667)

e2241a4

implementation of broadcast add backward by reduce (#34143)

56c5e21

Add cuda.device_count api (#34811)

cf99c0d

add adamw cuda kernel (#35020)

77a8a39

set node feature (#34994)

c3efabe

Fix a bug of strided_slice op, about the axes parameter access memory out of bounds (#35062)

aefec22

niuliling123 merged commit 1e843d1 into niuliling123:develop Aug 23, 2021
