New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

string handling, SDS #859

Open
aktau opened this Issue Jun 17, 2014 · 11 comments

Comments

Projects
None yet
5 participants
@aktau
Copy link
Member

aktau commented Jun 17, 2014

Hey all, so one of the ongoing cleanup efforts in the neovim codebase, has been string handling. Recently, I proposed replacing strncpy and its variants (STRNCPY and vim_strncpy) with (x)strlcpy (#743, et al.) , because it's safer. *

While this didn't go entirely to plan (issue #858) because of slight differences in behaviour **, on the whole I think this is a positive evolution. What's clear is that these changes need to be a bit more carefully planned and reviewed, no one person can hope to catch all the subtleties.

That said, there's room for improvement on the string front in other ways as well:

  1. We should create a wiki page detailing the best practices when dealing with strings. What functions do we use when? What are common idioms? (e.g.: stpcpy for repeat concatenating, ...)
  2. Are there any new functions that we could introduce? In a few others issues I've mentioned this. For some cases neither strncpy + NUL-termination nor strlcpy nor stp(n)cpy are optimal. I'll describe below what I can think of.

Example situations

Sometimes it's easiest to come up with better approaches by handling a few real life situations. Let's look at something @dougschneider bumped into recently.

The original code:

   STRCPY(IObuff, ":return ");
   STRNCPY(IObuff + 8, s, IOSIZE - 8);
    if (STRLEN(s) + 8 >= IOSIZE)
      STRCPY(IObuff + IOSIZE - 4, "...");

The rework as proposed here. There might be a small bug lurking in the fact that sizeof(ret) is 9 and thus IObuff + sizeof(ret) would be after the copied NUL byte.

  const char ret[] = ":return ";
  STRCPY(IObuff, ret);
  size_t slen = STRLCPY(IObuff + sizeof(ret), s, IOSIZE - sizeof(ret));
  if (slen + sizeof(ret) >= IOSIZE)
      STRCPY(IObuff + IOSIZE - 4, "...");

A few alternatives:

// compact and correct, but possibly a bit slower because of the first strlcpy call
size_t startlen = STRLCPY(IObuff,  ":return ", IOSIZE); 
if (STRLCPY(IObuff + startlen, s, IOSIZE - startlen) >= IOSIZE - startlen) {
  // string overflowed, signal truncation with ellipsis
  memcpy(IObuff + IOSIZE - 4, "...", 4);
}
// faster, but less compact and less obvious (the bug from above has been corrected by deducting one from sizeof, making it even uglier. However it has another serious error: unsigned wraparound.
const char ret[] = ":return ";
memcpy(IObuff, ret, sizeof(ret) - 1);
if (STRLCPY(IObuff + sizeof(ret) - 1, s, IOSIZE - sizeof(ret) - 1) >= IOSIZE - sizeof(ret) - 1) {
  // string overflowed, signal truncation with ellipsis
  memcpy(IObuff + IOSIZE - 4, "...", 4);
}
// same as the one just above, assuming that a compiler knows that `strlen(string literal)` == `sizeof(string literal) - 1`, slightly prettier
const char ret[] = ":return ";
memcpy(IObuff, ret, strlen(ret));
if (STRLCPY(IObuff + strlen(ret), s, IOSIZE - strlen(ret)) >= IOSIZE - strlen(ret)) {
  // string overflowed, signal truncation with ellipsis
  memcpy(IObuff + IOSIZE - 4, "...", 4);
}

So we get to choose between speed, flexibility (not having to count the amount of chars in ":return ") and readability/beauty (pick any two). Why can't we have all three?

It's perhaps possible that we lack the functions or macros for this. I thought for example that for string literals, the optimum in safety and speed could be reached with this:

// version of `strlcpy` that will only work for string literals, optimal on both 
// smart and stupid compilers. Compiles down to a pure `memcpy` if the size of the
// buffer is known at compile time. Otherwise it's just a branch and a `memcpy`.
// note that this contains a trick to guard against using it with non-string literals (though
// this can be circumvented). Note that most of the efficiency gained by using this macro
// can probably also be obtained by hoisting `xstrlcpy` into the header and annotating it
// with `static inline`.
// @return the number of characters in the string (strlen(str))
#define SSTRLCPY(dst, src, size) \
  ( \
    memcpy((dst), "" src, (sizeof(src) < (size)) ? sizeof(src) : (size)), \
    (dst)[((sizeof(src) < (size)) ? sizeof(src) : (size)) - 1] = '\0', \
    sizeof(src) - 1 \
  )

This would enable us to write:

  size_t total = SSTRLCPY(IObuff, ":return ", IOSIZE);
  // we could check for truncation now: if (total >= IOSIZE) { ... }

  total += STRLCPY(IObuff + total, s, IOSIZE - total);
  if (total >= IOSIZE) {
      // string overflowed, signal truncation with ellipsis
      SSTRLCPY(IObuff + IOSIZE - 4, "...", 4); // same as memcpy(IObuff + IOSIZE - 4, "...", 4);
  }

This still does something redundant. We don't need to actually calculate the string length of s, we use it just to check for truncation. Depending on the size of s, we might be doing too much work.

Perhaps we can make a function that just returns if the string was truncated or not:

/// xstrucpy - copy min(`strsize`, `bufsize` - 1) bytes, NUL terminates
///
/// If the string length is known, this function provides a few advantages over
/// `(x)strlcpy`:
///
/// 1. Usually we want to know if truncation happened, not by how much.
///    Truncation happens when the `src` string is larger than the `dst` buffer.
///    An idiomatic way of using this is the example below.
/// 2. `(x)strlcpy` calculates `strlen`, which is sometimes already known.
/// 3. `xstrucpy` can be used with Pascal strings (non-NUL terminated strings).
///
/// Example:
///
/// @code{.c}
///   if (xstrucpy(buf, str, bufsize, str_len) < str_len) {
///     // truncation happened, deal with it
///   }
/// @endcode
///
/// @return the number of bytes copied (without the NUL terminator)
///
/// @see xstrlcpy
size_t xstrucpy(char *restrict dst,
                const char *restrict src,
                size_t bufsize,
                size_t strsize)
{
  size_t cpy = (strsize >= bufsize) ? bufsize - 1 : strsize;

  if (cpy) {
    memcpy(dst, src, cpy);
    dst[cpy] = '\0';
  }

  return cpy;
}

/// xstracpy - copies `src` into `dst`, %NUL terminates `dst`
///
/// TODO(aktau): if this shows up in profiler traces, try to find a version
///              that doesn't read `src` memory twice. This implementation should
///              be plenty fast though. Glibc (among others) have a highly
///              optimized memchr and memcpy. Note that I probably made an off-by-one 
///              error here, implementation untested.
///
/// @note This function can also be used with non-NUL terminated strings as
///       long as the underlying buffer of `src` has at least `bufsize`
///       bytes. So even though this function will work in more cases than
///       `strlcpy`, it's still advisable to use proper Pascal string
///       functions for Pascal strings.
///
/// @return false if the string was truncated, true if not. In the case of a
///         non-NUL terminated string, will always report that the
///         string has been truncated.
bool xstracpy(char *restrict dst, const char *restrict src, size_t bufsize)
{
  // find the NUL-byte, if it exists
  char *p = memchr(src, '\0', bufsize);
  return src + xstrucpy(dst, src, bufsize, p ? p - src : bufsize) <= p;
}

// as a bonus, we can now implement `xstrlcpy` as a special case of `xstrucpy`, this would
// invalidate the comment about the headerized versions efficiency in the macro though.
size_t xstrlcpy(char *restrict dst, const char *restrict src, size_t size)
{
  size_t ret = strlen(src);
  xstrucpy(dst, src, size, ret);
  return ret;
}

So weaponized with xstracpy, we could possibly make this optimally efficient and a bit cleaner still:

  size_t offset = SSTRLCPY(IObuff, ":return ", IOSIZE);
  // we could check for truncation now: if (total >= IOSIZE) { ... }

  if (!xstracpy(IObuff + offset, s, IOSIZE - offset)) {
    // string overflowed, signal truncation with ellipsis
    SSTRLCPY(IObuff + IOSIZE - 4, "...", 4); // same as memcpy(IObuff + IOSIZE - 4, "...", 4);
  }

I haven't thought of a good way of getting rid of the -4 and 4 yet that wouldn't introduce a way-too-one-off function.

Disadvantages? Sure: proliferation of string functions. The more we have of them, the more choice-stress people are going to get when coding something up. Ideally, the best function should be obvious from the context, and available...

* Clever readers will notice that strlcpy is not safer (or less safe) than vim_strncpy, which always NUL-terminates. However we do have the policy of adhering to standard string function behaviour to avoid confusion and strncpy doesn't always NUL-terminate, so the name vim_strncpy is confusing.

** strncpy always fills the destination buffer with zeroes after the string, strlcpy doesn't.

BTW: I'm calling this phase 1 because this will keep the representation of strings in vim intact, namely C strings. Phase 2 might be a transition to using Pascal string (be they sds strings or @tarruda's explicit struct strings), or it might be something else entirely, or it might not exist.

This post is subject to later editing.

aktau referenced this issue in dougschneider/neovim Jun 17, 2014

@philix

This comment has been minimized.

Copy link
Member

philix commented Jun 18, 2014

String handling is really important in a text editor and the current codebase doesn't have good abstractions for string handling (even though we've been slowly improving it). It's a miracle how much can be accomplished with so little abstraction (and a lot of low level code).

When dynamic strings are needed, for example, growable arrays are used! The garray API was made to solve two cases: dynamic strings and dynamic arrays. It fails to provide a nice API for both cases.

I'm strongly in favor of sds strings instead of struct-based Pascal strings. I think sds' design makes it perfect for Neovim's situation: mix of classical C strings and length-aware strings.

@aktau

This comment has been minimized.

Copy link
Member

aktau commented Jun 18, 2014

I'm strongly in favor of sds strings instead of struct-based Pascal strings. I think sds' design makes it perfect for Neovim's situation: mix of classical C strings and length-aware strings.

I like sds' design and outlook as well and I admit it looks tempting because it's so easy to mix both. But I already envision that it won't be very easy. Before doing it we'd need a large-scale constification of the code base wherever possible. Because after passing a sds string to a function accepting plain char * or char_u *, I wouldn't trust it anymore.

@philix

This comment has been minimized.

Copy link
Member

philix commented Jun 18, 2014

I like sds' design and outlook as well and I admit it looks tempting because it's so easy to mix both. But I already envision that it won't be very easy.

I agree with you. We should be very careful and calm down everyone who's willing to do big refactorings torwards sds.

Before doing it we'd need a large-scale constification of the code base wherever possible.

Do we have a entry-level task for this?

Because after passing a sds string to a function accepting plain char * or char_u *, I wouldn't trust it anymore.

Exactly. This is also true for garray based strings. If we attack only the garray cases I would say sds already paid itself.

@aktau

This comment has been minimized.

Copy link
Member

aktau commented Jun 18, 2014

Exactly. This is also true for garray based strings. If we attack only the garray cases I would say sds already paid itself.

Babysteps, let's whip garray into shape first :). I think MSan is going to be a big help, if it works as advertised.

@oni-link

This comment has been minimized.

Copy link
Contributor

oni-link commented Jun 18, 2014

@aktau, in your examples you use sizeof(ret), which returns the size of the pointer ret and not the length of the string to which ret points. Only works for ret=":return " and 64-bit platforms.

@aktau

This comment has been minimized.

Copy link
Member

aktau commented Jun 18, 2014

@oni-link completely correct, this was noticed by @dougschneider as well. I'll change the examples to use an array instead of a pointer.

@dougschneider

This comment has been minimized.

Copy link
Contributor

dougschneider commented Jun 18, 2014

I agree with @philix 's preference towards using sds strings. An additional reason to do so is that Neovim has been pushing vim to use more standards. Thus I think having our own internal string is moving the wrong direction. Using a more standard string such as sds will make it easier for newcomers to adapt. As well the documentation is already available and sds is time tested.

@aktau

This comment has been minimized.

Copy link
Member

aktau commented Jun 18, 2014

I agree with @philix 's preference towards using sds strings. An additional reason to do so is that Neovim has been pushing vim to use more standards. Thus I think having our own internal string is moving the wrong direction. Using a more standard string such as sds will make it easier for newcomers to adapt. As well the documentation is already available and sds is time tested.

I think most people would be in favor of saner string handling, that's also likely to be more performant, but we can't do it in one go. Which is why I propose improving things in waves. And setting best practices at every stage.

@aktau aktau changed the title [discussion] [wiki] string handling, phase 1 [RFC] [wiki] string handling, phase 1 Jun 19, 2014

@dougschneider

This comment has been minimized.

Copy link
Contributor

dougschneider commented Jun 20, 2014

@aktau

I just got some free time to fully read this. Regarding your statement "However it has another serious error: unsigned wraparound." in the following:

// faster, but less compact and less obvious (the bug from above has been corrected by deducting one from sizeof, making it even uglier. However it has another serious error: unsigned wraparound.
const char ret[] = ":return ";
memcpy(IObuff, ret, sizeof(ret) - 1);
if (STRLCPY(IObuff + sizeof(ret) - 1, s, IOSIZE - sizeof(ret) - 1) >= IOSIZE - sizeof(ret) - 1) {
  // string overflowed, signal truncation with ellipsis
  memcpy(IObuff + IOSIZE - 4, "...", 4);
}

Is this really an issue? I assume you're referring to the sizeof(ret) - 1 underflowing and becoming large (although correct me if I'm misunderstanding). However, ret can never be less than "" which still has sizeof(ret) == 1. I guess we could be concerned with future changes causing an issue, but I don't think a mistake would be likely, and it would get noticed quite quickly. So I think sizeof(ret) - 1 is safe. And since that's ugly, let's just use strlen. If it's optimized sweet, if not, finding a string length so short should cause minimal inefficiency (yes it could change to be longer, but I can't imagine it ever being long enough to cause a bother).

@aktau

This comment has been minimized.

Copy link
Member

aktau commented Jun 20, 2014

However, ret can never be less than "" which still has sizeof(ret) == 1. I guess we could be concerned with future changes causing an issue, but I don't think a mistake would be likely, and it would get noticed quite quickly. So I think sizeof(ret) - 1 is safe. And since that's ugly, let's just use strlen. If it's optimized sweet, if not, finding a string length so short should cause minimal inefficiency (yes it could change to be longer, but I can't imagine it ever being long enough to cause a bother).

You are, of course, correct. Most compilers would turn strlen into sizeof - 1 for string literals anyway. And the underflow is indeed, very unlikely.

That said, the SSTRLCPY macro avoids all that and makes the entire thing quite a bit cleaner and less error-prone to rewrite, don't you think?

  size_t offset = SSTRLCPY(IObuff, ":return ", IOSIZE);
  // we could check for truncation now: if (total >= IOSIZE) { ... }

  if (!xstracpy(IObuff + offset, s, IOSIZE - offset)) {
    // string overflowed, signal truncation with ellipsis
    SSTRLCPY(IObuff + IOSIZE - 4, "...", 4); // same as memcpy(IObuff + IOSIZE - 4, "...", 4);
  }

As I mentioned in the first post, we can reach the same efficiency by hoisting xstrlcpy into the memory.h header and letting the compiler figure things out. That would leave us with 3 general purpose safe string functions:

  • xmemcpyz: NUL-terminating memcpy, copies n bytes from src, always writes n + 1 bytes (so need to pass realbuflen - 1).
  • xstrucpy: basically a truncating memcpyz, writes min(buflen, strlen) bytes, returns bytes written.
  • xstracpy: special case of xstrucpy, writes min(buflen, strlen) bytes while only reading max buflen bytes, signals truncation as true/false return value.
  • xstrlcpy: special case of xstrucpy, writes min(buflen, strlen) bytes while reading strlen bytes, returns strlen.

And then of course their macro versions for the vim char_u * mess, but I don't count those as separate.

I'm open to suggestions! (for names, functionality, ...). A good set of primitives that are safe and easy to use would be fantastic.

@justinmk justinmk added the refactor label Jan 25, 2017

@justinmk justinmk added this to the unplanned milestone Jan 25, 2017

dwb pushed a commit to dwb/neovim that referenced this issue Feb 21, 2017

@justinmk justinmk changed the title [RFC] [wiki] string handling, phase 1 string handling, SDS May 27, 2018

@justinmk

This comment has been minimized.

Copy link
Member

justinmk commented Sep 1, 2018

Better String Library (bstrlib) looks like another good alternative besides SDS, but this issue suggests that it is targeting older codebases so SDS might be a better fit. Also its comparison table does not mention SDS.

@justinmk justinmk added performance and removed documentation labels Nov 4, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment