Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use internal error codes instead of UNIX errnos #280

Closed
jdarcy opened this issue Jul 24, 2017 · 46 comments
Closed

Use internal error codes instead of UNIX errnos #280

jdarcy opened this issue Jul 24, 2017 · 46 comments
Labels
DocApproved Mandatory flag to pass smoke. Provide user facing document to get approval. FA: Debug-ability & Quality FA: Scalability Improvements FA: Technical Debt FA: Usability & Supportability Prio: Medium SpecApproved This is a mandatory flag for passing the smoke for feature. Provide spec of feature to get approval Type:Enhancement wontfix Managed by stale[bot]
Milestone

Comments

@jdarcy
Copy link
Contributor

jdarcy commented Jul 24, 2017

Over the years we've had many problems with the meaning of various UNIX errnos being overloaded, usually because some subsystem uses a particular error code internally. An ESTALE generated internally means something different than an ESTALE from a posix syscall. ENOTCONN has too many different meanings even to keep track of. There are also portability issues when a particular errno value doesn't exist or has a different meaning on NetBSD/FreeBSD than on Linux. Using an internal error type instead of overloading random errno values with random meanings will help avoid both problems.

My proposal is to have our own set of error values, and only use those internally. Whenever we need to convert back and forth - e.g. from a syscall/libc errno to an internal gf_errno or in the opposite direction when reporting errors via FUSE/NFS - we can use a conversion table by default, but there might also be many cases where this default needs to be overridden.

@amarts
Copy link
Member

amarts commented Jul 24, 2017

Yes, we should have had this long time ago. Now is the good time never the less. I am thinking to have similar approach for #273 too. We need our own stat structure which can capture lot more information and also has option for reserved fields. When passing it to higher layers, we can convert it and send.

@amarts
Copy link
Member

amarts commented Jul 24, 2017

libglusterfs/src/compat-errno.h can be used i guess. Btw, this file was used to handle different errno's in different OS to be mapped properly at protocol level. But as we use errno extensively at every xlator level to make decisions, we should be doing this conversion at posix/client-protocol and fuse/nfs/gfapi/server-protocol level. All internal xlators should use our own errno.

@gluster-ant
Copy link
Collaborator

A patch https://review.gluster.org/18069 has been posted that references this issue.
Commit message: improvements to errno to error code conversion

amarts added a commit to amarts/glusterfs_fork that referenced this issue Aug 20, 2017
Provide option to define both errno and error string (for logging).
For all the standard 'errno' use 'strerror()' as the error string.
For all the new error codes, developers have to define the error
string while initializing.

Updates gluster#280
Change-Id: I2e481bc3143c69130a5d3d015814c68b1754dff8
Signed-off-by: Amar Tumballi <amarts@redhat.com>
@gluster-ant
Copy link
Collaborator

A patch https://review.gluster.org/18069 has been posted that references this issue.
Commit message: improvements to errno to error code conversion

@gluster-ant
Copy link
Collaborator

A patch https://review.gluster.org/18084 has been posted that references this issue.
Commit message: Convert all the errno to error-codes

@gluster-ant
Copy link
Collaborator

A patch https://review.gluster.org/18069 has been posted that references this issue.
Commit message: improvements to errno to error code conversion

@gluster-ant
Copy link
Collaborator

A patch https://review.gluster.org/18087 has been posted that references this issue.
Commit message: Convert all the errno to error-codes

@gluster-ant
Copy link
Collaborator

A patch https://review.gluster.org/18069 has been posted that references this issue.
Commit message: improvements to errno to error code conversion

@gluster-ant
Copy link
Collaborator

A patch https://review.gluster.org/18087 has been posted that references this issue.
Commit message: Convert all the errno to error-codes

mscherer pushed a commit that referenced this issue Aug 23, 2017
Provide option to define both errno and error string (for logging).
For all the standard 'errno' use 'strerror()' as the error string.
For all the new error codes, developers have to define the error
string while initializing.

Updates #280
Change-Id: I2e481bc3143c69130a5d3d015814c68b1754dff8
Signed-off-by: Amar Tumballi <amarts@redhat.com>
Reviewed-on: https://review.gluster.org/18069
Smoke: Gluster Build System <jenkins@build.gluster.org>
CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
@gluster-ant
Copy link
Collaborator

A patch https://review.gluster.org/18087 has been posted that references this issue.
Commit message: Convert all the errno to error-codes

1 similar comment
@gluster-ant
Copy link
Collaborator

A patch https://review.gluster.org/18087 has been posted that references this issue.
Commit message: Convert all the errno to error-codes

@amarts
Copy link
Member

amarts commented Sep 1, 2017

Solution proposal.

Some basic possibilities:

  • Convert errno to error-code in syscall.h itself, so any code inside glusterfs, we only deal with error code. (ie, from posix itself, errno won't exist, but error-code).
  • Convert back to errno in interface layer like fuse/gfapi.

To do this, we need an interface, where give a errno, we need a corresponding error-code. And given an error-code, we need what errno it is, and possible error message to log for the given error code.

Currently, the implementation in 'experimental' branch is to have all this taken care in compat-errno.{c,h}. Further feedback welcome.

More possibiities?

  • For more strict check, we can change the signature to enum of error-code, so we will be able to enforce people to use it.

Challenges

  • Need contributions from almost every component owners to help make this change.
  • Merging this would make most of the patches in queue as 'Merge Conflict' state.
  • The changes to enum can't be done as multiple patches as the smoke and regressions will fail in compilation itself without all the required changes, so it will be one BIG change (at least 5000 lines diff). Need a quicker action with reviews.
  • Again, because it has be ONE patch, co-ordination in any current gluster branches (even experimental) would be hard, so we need to work with a developers personal branch, and then do git rebase -i (and squash).

Calls for a half-day hackathon for all the maintainers/peers and other contributors!

Feedbacks welcome!

@gluster/gluster-all ☝️

@xhernandez
Copy link
Contributor

Some considerations:

I think we could have ranges of error codes assigned to each xlator and module, exactly as we have done for log messages. This way, each xlator will be able to create more detailed errors. This can be easily seen in some higher level xlators that detect issues in the consistency of data, like AFR and EC. They need to generate new errors that are conceptually different from any other error. We could also assign errors to modules (memory management, dict management, inode management, ...). Instead of assigning errors sequentially as they are needed (this would mix errors from multiple sources and meanings) we could reserve ranges of errors to each xlator or module that needs them.

This way it would be easier, given an error, to determine what happened and where.

Additionally we could also add an 'originator' id inside the error code itself. This way, we won't only know that we have detected an invalid argument error, for example, but we'll also know which xlator has detected it, without needing each xlator to log a message for each error or dig into big logs to locate them.

With this approach, a single error code will be very meaningful, irrespective of where it has been logged.

Since we are dealing with 32 bits error codes, we could use the following mapping:

  • Lower 12 bits to represent an error code. We have up to 4096 available error codes for each xlator and module.
  • Middle 10 bits to represent the module or xlator to which the error code is associated.
  • Upper 10 bits to indicate the xlator where the error has been detected for the first time.

The xlator/module code 0 could be reserved for raw errno codes, and use id 1 and higher for modules and xlators.

Another thing to consider is that some code makes use of 'errno' variable to pass the error code between function calls. We should avoid that and use regular return of error codes.

The main drawback of this approach is that most probably it will require changes in many places outside compat-errno.{c,h} because of the handling of the originator field.

@kotreshhr
Copy link
Contributor

kotreshhr commented Sep 26, 2017 via email

@xhernandez
Copy link
Contributor

Another thing I would like to consider is the possibility of using negative error codes. This way we can return an error (a negative value), a plain success (0), and, optionally, some extra info in case of success (>0).

It would also be interesting to enforce (from now on, not changing old code, at least initially) to always return an error code if the function can fail. For example, it's very common today to have functions that return a pointer. If the pointer is NULL, it implies that the function has failed, but this doesn't give any information about the real cause of the failure. Most of the times ENOMEM is assumed, but there could be other causes that we have lost. Loosing this information is as bad as not using proper error codes.

So the idea would be to return an error code from these functions and pass the reference to the pointer as an additional argument instead of returning it.

@amarts
Copy link
Member

amarts commented Dec 3, 2019

I propose this to release-8.0 as this has a potential of helping in debugging exactly where the error originated by just looking at the client log file. We can have range details as @xhernandez suggested.

@amarts amarts added this to the Release 8 milestone Dec 3, 2019
@amarts amarts self-assigned this Jan 21, 2020
@amarts
Copy link
Member

amarts commented Feb 3, 2020

I tried to collect thoughts on this one. Happy to hear feedback.

Challenges

  • The first and foremost problem is, the amount of changes we need to do to complete the effort.
  • Changes shouldn’t be very disruptive, and should be easier to review.
  • This should be done without disturbing the current code flow, so we can break it into multiple smaller patches, without breaking regression.
  • Need contribution across codebase, so more of collaborated activity.

Suggestions from above comments

  • We only have 32 bits to use, so breaking it into multiple groups of fewer bits would be useful.
  • Originator ID (to identify which xlator) caused the issue is very helpful.
  • The errorcodes should be negative to indicate error, so, 0 can be reserved for success, and +ve is for other overloaded success values.
  • Instead of static id, good to get originator-id (or xlator id) in run time.

Proposed solution

  • Use return value to indicate error code, let errno continue to exist till we make complete changes.
  • This is easier to make change in codebase, as (ret < 0) is considered failure already.
  • This doesn’t impact the checks we have to compare errno and then decide what to do. Keeping it the same would make regressions run smoothly and also would help in quicker reviews.
  • Reserve 8k worth (13bits) of error-codes per translator.
  • Allows 17 more bits for originator-id and other things.
  • Provide a thread local pointer in gf_strerror(), so one can use it similar to strerror(). Also provide an alternative to return filled pointer like uuid_utoa_r().
  • The thread local pointer would be used in most places where it is just used in logging.
  • The alternative method would be used if one needs to store the pointer, or use it more than one time.
  • Make sure to check the complete execution path to treat negative value as error, not just -1.
    • This should be done more carefully.
  • Let each translator have their static ID through enum.
    • In the initial version it wouldn’t be runtime, and not per instance of xlator.

Future improvements

Design a way to get details of translators running on server-side, and their run-time-id, during client-handshake. That way, we can pin-point the issue to a particular instance of translator in future, instead of the static id. This will be useful for translators designed and written outside of glusterfs too.

@gluster-ant
Copy link
Collaborator

A patch https://review.gluster.org/24747 has been posted that references this issue.

cluster/ec: use IS_ERROR/IS_SUCCESS macros

Updates: #280

Change-Id: I75354f8b2a82eaabb68eb36eaa27ee9bdd36a000
Signed-off-by: Amar Tumballi amar@kadalu.io

@gluster-ant
Copy link
Collaborator

A patch https://review.gluster.org/24746 has been posted that references this issue.

cluster/dht: use IS_ERROR/IS_SUCCESS macros

Updates: #280

Change-Id: Id1001e93c31da78d580455a9601facb3e6df643c
Signed-off-by: Amar Tumballi amar@kadalu.io

@gluster-ant
Copy link
Collaborator

A patch https://review.gluster.org/24748 has been posted that references this issue.

xlators/features: use IS_ERROR/IS_SUCCESS macros

Updates: #280

Change-Id: Iecf2004e1a4951ae8927b9e35249f6689a65c001
Signed-off-by: Amar Tumballi amar@kadalu.io

@gluster-ant
Copy link
Collaborator

A patch https://review.gluster.org/24749 has been posted that references this issue.

xlators/nfs: use IS_ERROR/IS_SUCCESS macros

Updates: #280

Change-Id: I1a67c0268fddfd99b3568592a11fa90c99fcb001
Signed-off-by: Amar Tumballi amar@kadalu.io

@gluster-ant
Copy link
Collaborator

A patch https://review.gluster.org/24750 has been posted that references this issue.

xlators/{meta,playground,posix-acl}: use IS_ERROR/IS_SUCCESS macros

Updates: #280

Change-Id: Idba6a611bccdc4ec4a8892cf31fb32b286125d0e
Signed-off-by: Amar Tumballi amar@kadalu.io

@gluster-ant
Copy link
Collaborator

A patch https://review.gluster.org/24752 has been posted that references this issue.

mount/fuse: use IS_ERROR/IS_SUCCESS macros

Updates: #280

Change-Id: I4abadc82d96e8c823a4ddd4671961ce1bf4fb178
Signed-off-by: Amar Tumballi amar@kadalu.io

@gluster-ant
Copy link
Collaborator

A patch https://review.gluster.org/24751 has been posted that references this issue.

xlators/performance: use IS_ERROR/IS_SUCCESS macros

Updates: #280

Change-Id: I4d8a15e772ad28e34a090c0d9c3792b94bf73e88
Signed-off-by: Amar Tumballi amar@kadalu.io

@gluster-ant
Copy link
Collaborator

A patch https://review.gluster.org/24753 has been posted that references this issue.

libglusterfs: use IS_ERROR() and IS_SUCCESS() macro

Also introduce gf_strerror() and gf_errorcode() methods to set
a xlator specific error code in return value.

Updates: #280
Change-Id: I1bb00251794d8d4763be6e0d0657b43e861848a4
Signed-off-by: Amar Tumballi amar@kadalu.io

@gluster-ant
Copy link
Collaborator

A patch https://review.gluster.org/24754 has been posted that references this issue.

storage/posix: use IS_ERROR/IS_SUCCESS macros

Also use 'SET_ERROR()' macro which allows to set the return value
to something other than -1.

Updates: #280

Change-Id: I034f939cc2fd45200f8c25a5ff4168edb0de6715
Signed-off-by: Amar Tumballi amar@kadalu.io

gluster-ant pushed a commit that referenced this issue Jul 21, 2020
These macros helps to clearly identify all negetive checks are 'errors',
and all 0 and above are success. With this clasification, we can add more
error codes to the process / method / function.

Updates: #280
Change-Id: I0ebc5c4ad41eb78e4f2c1b98648986be62e7b521
Signed-off-by: Amar Tumballi <amar@kadalu.io>
@gluster-ant
Copy link
Collaborator

A patch https://review.gluster.org/24736 has been posted that references this issue.

contrib, extras, doc: use IS_ERROR/IS_SUCCESS macro

Updates: #280

Change-Id: I783d67c7a1bda1e465adaefc9a05e5f7217ac02c
Signed-off-by: Amar Tumballi amar@kadalu.io

@amarts
Copy link
Member

amarts commented Sep 1, 2020

  • Use return value to indicate error code, let errno continue to exist till we make complete changes.

@xhernandez / @pranithk / @gluster/gluster-all while I see that you are reviewing things, it would be great if you consider to review series of changes @ https://review.gluster.org/#/q/status:open+project:glusterfs+branch:master+topic:ref-280

There are merge conflicts, but I plan to resolve it once i hear some feedback.

While reviewing consider below, as we decided to have a practice like below:

  • Use IS_ERROR()/IS_SUCCESS() for glusterfs implemented functions/methods only.
  • No syscall return would be checked with IS_ERROR()/IS_SUCCESS(). (even sys_fop()).
  • Let the numbers (like length, count etc) checks be out of IS_ERROR() macro, even if its checked with < -1, as it can't be considered as error condition, instead a value of the variable.

Getting it reviewed and getting it in would help us to mandate the check with IS_ERROR() everywhere, and start returning different errorcodes, which can help debug more thoroughly and also can reduce the logs in glusterfs.

@amarts
Copy link
Member

amarts commented Sep 14, 2020

After a brief discussion with @xhernandez, we both agree it would be great if we can make change to the signature of fop return itself, to keep the changes easier to review (even though it would be more changes). Mainly because fop changes wouldn't change the behavior of other places to start with.

I will work on a basic prototype and submit the new method. Happy to hear feedback, and getting volunteers to take it to finish line.

@gluster-ant
Copy link
Collaborator

A patch https://review.gluster.org/25019 has been posted that references this issue.

all: return value signature changes

  • return signature of all 'fops' callback changed to gf_return_t.

  • provide IS_ERROR()/IS_SUCCESS() macros to check only this type of argument.

  • provide 2 global variables gf_success and gf_error introduced, which
    can used instead of -1 and 0 return used largely today.

Change-Id: Ib91216afaa89d7e4509757b50dd8a7d60a6bf0a9
Updates: #280
Signed-off-by: Amar Tumballi amar@kadalu.io

amarts added a commit to amarts/glusterfs_fork that referenced this issue Oct 7, 2020
* return signature of all 'fops' callback changed to `gf_return_t`.

* provide IS_ERROR()/IS_SUCCESS() macros to check only this type of argument.

* provide 2 global variables `gf_success` and `gf_error` introduced, which
  can used instead of -1 and 0 return used largely today.

Change-Id: Ib91216afaa89d7e4509757b50dd8a7d60a6bf0a9
Updates: gluster#280
Signed-off-by: Amar Tumballi <amar@kadalu.io>
@stale
Copy link

stale bot commented Apr 20, 2021

Thank you for your contributions.
Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity.
It will be closed in 2 weeks if no one responds with a comment here.

@stale stale bot added the wontfix Managed by stale[bot] label Apr 20, 2021
@stale
Copy link

stale bot commented May 6, 2021

Closing this issue as there was no update since my last update on issue. If this is an issue which is still valid, feel free to open it.

@stale stale bot closed this as completed May 6, 2021
@amarts amarts reopened this Sep 1, 2022
@stale stale bot removed the wontfix Managed by stale[bot] label Sep 1, 2022
@stale
Copy link

stale bot commented Apr 2, 2023

Thank you for your contributions.
Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity.
It will be closed in 2 weeks if no one responds with a comment here.

@stale stale bot added the wontfix Managed by stale[bot] label Apr 2, 2023
@stale
Copy link

stale bot commented May 21, 2023

Closing this issue as there was no update since my last update on issue. If this is an issue which is still valid, feel free to open it.

@stale stale bot closed this as completed May 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DocApproved Mandatory flag to pass smoke. Provide user facing document to get approval. FA: Debug-ability & Quality FA: Scalability Improvements FA: Technical Debt FA: Usability & Supportability Prio: Medium SpecApproved This is a mandatory flag for passing the smoke for feature. Provide spec of feature to get approval Type:Enhancement wontfix Managed by stale[bot]
Projects
None yet
Development

No branches or pull requests

6 participants