The purpose of this code refactor is to improve our code quality and usability. The approach should be considered from two sides: the user's perspective and the developers' perspective.
From the user's perspective, the main goal is to let users use NNF as a real tool: compile a model and understand the procedure easily.
Building stages
Our building scripts currently support installing dependencies and building either in a native environment or inside a container, but we haven't considered much of what users actually run into in real scenarios.
Take [BUG] Compile Error in source code #48 for example: the user hadn't read our docs and didn't know we target Ubuntu 16/18, not 20. The build scripts should therefore check the system version.
Work items:
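One such work item could be a pre-build OS check, sketched below. The helper names and the `/etc/os-release` parsing approach are assumptions; the supported version list (Ubuntu 16/18) comes from issue #48.

```python
# Hypothetical pre-build check: verify the host is a supported Ubuntu release
# before installing dependencies. Helper names are illustrative, not from
# the actual NNFusion build scripts.
SUPPORTED_UBUNTU = {"16.04", "18.04"}

def parse_os_release(text):
    """Parse /etc/os-release-style KEY=VALUE lines into a dict."""
    info = {}
    for line in text.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            info[key] = value.strip('"')
    return info

def check_supported(os_release_text):
    """Fail fast with an actionable message instead of a mid-build error."""
    info = parse_os_release(os_release_text)
    if info.get("ID") != "ubuntu" or info.get("VERSION_ID") not in SUPPORTED_UBUNTU:
        raise SystemExit(
            f"Unsupported OS: {info.get('ID')} {info.get('VERSION_ID')}; "
            f"NNFusion builds are tested on Ubuntu {sorted(SUPPORTED_UBUNTU)}"
        )
```

A build script would call `check_supported(open("/etc/os-release").read())` before doing anything else.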
Testing stages
Our testing utilities are tricky and hard to use: the user needs NVIDIA/CUDA hardware and must configure it through a specific config file. The unit tests do not check for hardware, which results in failing tests.
We should make testing easier and make the test reports easy for users to understand.
Besides, we should check coverage for each PR and report whether the code change is covered by tests; this will help us improve code quality.
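The missing hardware check could look like the sketch below: GPU-dependent tests skip cleanly on CPU-only machines instead of failing. The `nvidia-smi` lookup is an assumption about how detection might be done; a real check would likely query the driver.

```python
import shutil
import unittest

def cuda_available():
    """Heuristic hardware check: assume a CUDA device is present only if
    nvidia-smi is on PATH. A production check would query the CUDA driver."""
    return shutil.which("nvidia-smi") is not None

# Tests that need a GPU declare the requirement instead of failing.
requires_cuda = unittest.skipUnless(cuda_available(), "no CUDA device found")

class KernelTest(unittest.TestCase):
    @requires_cuda
    def test_cuda_kernel(self):
        ...  # would launch a CUDA kernel here

    def test_cpu_reference(self):
        self.assertEqual(1 + 1, 2)  # runs everywhere
```

With this pattern the test report distinguishes "skipped: no hardware" from genuine failures.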
Validation stages
NNF is currently more of a tech-validation-stage project: the environment may need to be configured by hand before users can do validation. What we need to do is make the scripts, or the NNFusion CLI, easier for users to compile a model.
One more problem: NNF needs a frozen model, but freezing a model is not a standard procedure for users, so NNF may receive a badly frozen model as input. Should we provide a standard script to freeze models?
Work item:
User interface for Inference and Training
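Until a standard freeze script exists, the frontend could at least reject non-frozen inputs. A minimal sketch, assuming the importer can list the op types in a graph (the op names below follow TensorFlow conventions; the helper itself is hypothetical):

```python
# Hypothetical input validation: a properly frozen graph should contain no
# variable ops, since freezing folds variables into constants.
VARIABLE_OPS = {"Variable", "VariableV2", "VarHandleOp", "Assign"}

def assert_frozen(op_types):
    """Reject a model whose graph still contains variable ops, i.e. one
    that was not frozen before being fed to NNF, with a clear message."""
    leftovers = sorted(set(op_types) & VARIABLE_OPS)
    if leftovers:
        raise ValueError(
            f"model is not frozen: found variable ops {leftovers}; "
            "please freeze the model (fold variables into constants) first"
        )
```

This turns a confusing downstream compile error into an actionable message at import time.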
From the developers' perspective:
License Problem:
The Apache-2 license is somewhat strict and makes modifying code awkward, so we moved the code we didn't rewrite into the thirdparty folder. But we need to rewrite that code and bring it back into our source tree; otherwise, readers may get confused about where the code actually lives. This code mainly relates to the operator set, some core data types, and the importer frontends for TF and ONNX. We discuss these in later sections.
Operator Set:
The operator set we use originates from nGraph and has been amended with some ops of the "OperatorV2" type. The main goal is to migrate the whole operator set to "OperatorV2", or to a new class that is no longer hard-coded and can be added, removed, or changed easily.
Operators should also support serialization.
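"Support serialization" could mean a round-trippable description of each op. A sketch of what a non-hard-coded, serializable operator record might look like (the class and field names are illustrative, not the real NNFusion types):

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class OperatorDef:
    """Illustrative serializable operator: an identity plus free-form
    attributes, rather than one hard-coded class per op."""
    name: str
    op_type: str
    attrs: dict = field(default_factory=dict)

    def to_json(self):
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text):
        return cls(**json.loads(text))

# Round trip: serializing then deserializing yields an equal operator.
conv = OperatorDef("conv1", "Convolution", {"strides": [1, 1], "pads": [0, 0]})
assert OperatorDef.from_json(conv.to_json()) == conv
```

A registry keyed by `op_type` would then let ops be added or removed without touching core code.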
Kernels:
Currently we have hard-coded kernels, Antares kernels (Antares IR), and KernelDB kernels. We have the features, but we haven't provided a good mechanism for picking kernels among them, and the kernels' interfaces are not uniform.
So in this part, we need to:
First, design a general interface for all kernel providers, which would let us support more providers such as TVM.
Second, design kernel-selection policies across providers.
The new interface will give our optimization passes more flexibility to pick and change kernels.
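The two steps above could be sketched as one provider interface plus a selection policy over its candidates. All names are hypothetical, and the real interface would live in C++; this only illustrates the shape of the design.

```python
from abc import ABC, abstractmethod

class KernelProvider(ABC):
    """Uniform interface every kernel source (hard-coded, Antares,
    KernelDB, TVM, ...) would implement."""
    @abstractmethod
    def query(self, op_type, device):
        """Return candidate kernels as (name, estimated_latency_us) pairs."""

class HardcodedProvider(KernelProvider):
    def query(self, op_type, device):
        # Built-in kernels exist only for CUDA in this toy example.
        return [("builtin_" + op_type, 12.0)] if device == "cuda" else []

class AntaresProvider(KernelProvider):
    def query(self, op_type, device):
        return [("antares_" + op_type, 8.5)]

def select_kernel(providers, op_type, device):
    """One possible selection policy: lowest estimated latency wins."""
    candidates = [k for p in providers for k in p.query(op_type, device)]
    if not candidates:
        raise LookupError(f"no kernel for {op_type} on {device}")
    return min(candidates, key=lambda k: k[1])

best = select_kernel([HardcodedProvider(), AntaresProvider()], "MatMul", "cuda")
```

Because the policy only sees the common interface, adding a TVM provider would not change the selection code.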
Code generator:
This is perhaps the hardest part of the refactor plan: the code generator is complex, integrates many features, and those sub-features interact with each other.
The main goal is to make codegen much simpler, so that a "new" device can be supported with far less code change.
Profiler
Our profiler has some flaws. For example, it does not guarantee that input data is valid, which may cause errors when profiling some kernels (e.g., OneHot). Also, the profiler and codegen are independent in the current design, yet they share many functions; we may use codegen to do profiling.
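Guaranteeing valid profiling inputs could mean per-kernel input constraints rather than arbitrary buffer contents. A sketch for the OneHot case mentioned above: every index must lie in [0, depth), so uninitialized or random bytes are unsafe (the helper and its constraint logic are assumptions, not the current profiler):

```python
import random

def onehot_profile_input(batch, depth, seed=0):
    """Generate indices guaranteed valid for a OneHot kernel: each index
    lies in [0, depth), so profiling cannot trigger out-of-range reads
    the way uninitialized buffer contents can."""
    rng = random.Random(seed)  # seeded for reproducible profiling runs
    return [rng.randrange(depth) for _ in range(batch)]

indices = onehot_profile_input(batch=64, depth=10)
```

A generalized version would attach such a generator to each kernel that constrains its inputs.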
Training
We have added basic training features such as autodiff and backward ops, but end users cannot easily use them or integrate them into their own projects. This problem is not unique to training, but training is an important case to consider. For a better training experience, we have two items: first, a clear Python interface that hides NNFusion trivia and implementation details; then, based on that interface, figure out the scope and add the missing training features.
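What "a clear Python interface" could look like, as a hypothetical wrapper; none of these names exist in NNFusion today, and the bodies are stubs standing in for the real compile and iteration logic:

```python
class Trainer:
    """Hypothetical user-facing wrapper: the user supplies a model and
    hyperparameters; NNFusion details (graph import, autodiff wiring,
    codegen, kernel selection) stay hidden behind compile()/step()."""
    def __init__(self, model, optimizer="sgd", lr=0.01):
        self.model = model
        self.optimizer = optimizer
        self.lr = lr
        self._compiled = False

    def compile(self, device="cuda"):
        # Would invoke NNFusion: import the graph, run autodiff,
        # generate and build device code.
        self._compiled = True
        return self

    def step(self, batch):
        if not self._compiled:
            raise RuntimeError("call compile() before step()")
        # Would run one fused forward+backward+update iteration.
        return {"loss": 0.0}  # placeholder result

trainer = Trainer(model=object()).compile(device="cuda")
out = trainer.step(batch=None)
```

Scoping the missing training features then becomes a question of what this interface must support.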