Constant string and UNICODE strings

incoder edited this page Oct 1, 2019 · 10 revisions

Constant string

Suppose you have a class or structure with a string field. You can store the string as a raw pointer, for example:

struct foo {
 const char *foo_str;
};

This works only as long as the memory block the pointer refers to is still valid. Once that memory is gone, the behavior is undefined.

#include <iostream>

struct foo_t {
  const char *foo_str;
};

static void print_foo(const foo_t& foo) {
  std::cout << foo.foo_str << std::endl;
}

int main(int argc, const char** argv) {
  foo_t foo;
  {
    char hello_word_str[] = "Hello world!";
    // just store the address of the stack array
    foo.foo_str = hello_word_str;
    // ok, the array is still alive in this scope
    print_foo(foo);
  }
  // Undefined behavior: foo.foo_str now dangles, the array's lifetime has ended
  print_foo(foo);
  return 0;
}

The classic C++ approach is to use std::string instead.

#include <iostream>
#include <string>

struct foo_t {
  std::string foo_str;
};

void print_foo(const foo_t& foo);

int main(int argc, const char** argv) {
  // default-constructs an empty std::string; whether this already allocates
  // heap memory depends on the standard library implementation
  foo_t foo;
  {
    char hello_word_str[] = "Hello world!";
    // this deep copies the stack array into memory owned by foo_str
    // (typically a heap block)
    foo.foo_str = hello_word_str;
    // ok, foo_str owns its own copy of the characters
    print_foo(foo);
  }
  // ok, foo_str still owns its copy, which does not live in the stack array
  print_foo(foo);
  return 0;
}

void print_foo(const foo_t& foo) {
  // this copy-constructs a second std::string: another memory block and
  // another deep copy of the character array; copy elision does not apply,
  // because foo.foo_str is an existing object and not a temporary
  std::string message = foo.foo_str;
  std::cout << message << std::endl;
}

As you can see, this approach gives you worse performance and, at the same time, uses more memory. This happens because std::string is designed to be mutable: every copy must own its characters, so copying a std::string has to allocate another memory block and deep copy the original character array. If two strings shared one array, a change made through one of them would corrupt the other.
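To make the cost of the copy visible, here is a minimal sketch that copies a std::string and reports whether the copy shares storage with the original. The names `copy_report` and `inspect_copy` are made up for this illustration, and a C++11 or later standard library is assumed.

```cpp
#include <string>

// Hypothetical result type for this illustration.
struct copy_report {
  bool same_text;    // both strings hold equal characters
  bool same_buffer;  // both strings point at the same character array
};

// Hypothetical helper: copy-constructs a std::string and compares
// the copy with the original.
copy_report inspect_copy(const std::string& original) {
  const std::string copy = original;  // deep copy of the character array
  return { copy == original, copy.data() == original.data() };
}
```

With a C++11 conforming std::string the text is equal but the buffers differ, so every such copy pays for its own storage.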

How can this be improved? For example, you can store the string in a std::shared_ptr or boost::shared_array, and put this smart pointer into a std::weak_ptr where needed. This is not really convenient, and it costs more memory: the shared_ptr reference counters (i.e. std::atomic_size_t), the pointer to them inside each shared_ptr object, and one more such pointer in every std::weak_ptr.
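A rough sketch of the shared_ptr variant could look as follows. The names `shared_text`, `foo_shared_t` and `make_shared_text` are assumptions introduced only for this example.

```cpp
#include <memory>
#include <string>

// Hypothetical alias: an immutable string shared between owners.
using shared_text = std::shared_ptr<const std::string>;

struct foo_shared_t {
  shared_text foo_str;
};

// Hypothetical factory: one deep copy of the characters, made once.
shared_text make_shared_text(const char* s) {
  return std::make_shared<std::string>(s);
}
```

Copying `foo_str` now only bumps the reference count, but each shared_ptr still carries a pointer to a separate control block in addition to the pointer to the string itself.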

Another way is io::const_string. It is actually a smart pointer, similar to boost::intrusive_ptr, with an atomic embedded reference counting strategy. io::const_string is designed to be immutable, so its copy constructor simply increases the reference count (shallow copy) rather than deep copying the original character array. Let's use const_string in our previous example:

#include <iostream>
// plus the io library header that declares io::const_string

struct foo_t {
  io::const_string foo_str;
};

void print_foo(const foo_t& foo);

int main(int argc, const char** argv) {
  // this constructs an empty const_string with a nullptr character array;
  // foo.foo_str.data() still returns "" and not a nullptr
  foo_t foo;
  {
    char hello_word_str[] = "Hello world!";
    // this deep copies the stack array into a new heap memory block
    foo.foo_str = io::const_string(hello_word_str);
    // ok, character array in heap, or "" when out of memory
    print_foo(foo);
  }
  // ok, character array still in heap (not on the stack), or "" when out of memory
  print_foo(foo);
  return 0;
}

void print_foo(const foo_t& foo) {
  // this simply increases the reference count of the character array
  io::const_string message = foo.foo_str;
  std::cout << message << std::endl;
}
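The shallow-copy idea can be sketched with a small stand-alone class. This is only an illustration of embedded atomic reference counting under simplifying assumptions; `tiny_const_string` and all of its members are hypothetical and are not the io::const_string source.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>
#include <utility>

// Hypothetical illustration only; NOT the io::const_string implementation.
class tiny_const_string {
public:
  tiny_const_string() noexcept = default;

  // One deep copy at construction; the object stays empty if malloc fails.
  explicit tiny_const_string(const char* s) noexcept {
    const std::size_t len = std::strlen(s);
    void* raw = std::malloc(sizeof(header) + len + 1);
    if (raw == nullptr) return;
    hdr_ = new (raw) header;           // embedded counter starts at 1
    std::memcpy(chars(), s, len + 1);  // copies the trailing '\0' too
  }

  // Shallow copy: share the block and bump the embedded counter.
  tiny_const_string(const tiny_const_string& other) noexcept
      : hdr_(other.hdr_) {
    if (hdr_ != nullptr) hdr_->refs.fetch_add(1, std::memory_order_relaxed);
  }

  // Copy-and-swap: the by-value parameter releases the old block.
  tiny_const_string& operator=(tiny_const_string other) noexcept {
    std::swap(hdr_, other.hdr_);
    return *this;
  }

  ~tiny_const_string() noexcept {
    // The last owner frees the block.
    if (hdr_ != nullptr &&
        hdr_->refs.fetch_sub(1, std::memory_order_acq_rel) == 1) {
      hdr_->~header();
      std::free(hdr_);
    }
  }

  // True when no memory block is held.
  bool empty() const noexcept { return hdr_ == nullptr; }
  const char* data() const noexcept { return hdr_ ? chars() : ""; }
  std::size_t use_count() const noexcept {
    return hdr_ ? hdr_->refs.load(std::memory_order_relaxed) : 0;
  }

private:
  // The counter lives in the same block, right before the characters.
  struct header {
    std::atomic<std::size_t> refs{1};
  };
  char* chars() const noexcept {
    return reinterpret_cast<char*>(hdr_) + sizeof(header);
  }
  header* hdr_ = nullptr;
};
```

Because the characters never change after construction, sharing one block between all copies is safe, and a copy costs one atomic increment instead of an allocation plus a memcpy.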

Another benefit of const_string: unlike std::string, the const_string constructor never throws, even in an out-of-memory situation. This allows you to use const_string with compiler RTTI and exceptions turned off, without custom error handling through custom new and terminate handlers. If you would like to check that a const_string was successfully constructed, you can always call the empty() method. This method checks whether the underlying memory array is nullptr, which is the defined out-of-memory result of new (std::nothrow) char[size] or malloc. At the same time, the io::const_string constructor will call the std::new_handler when it runs out of memory.
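The non-throwing allocation pattern mentioned above can be sketched with standard facilities only. The helper name `duplicate_nothrow` is hypothetical and is not part of the io library.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <new>

// Hypothetical helper: copies a C string without ever throwing.
// Returns an empty pointer when the allocation fails.
std::unique_ptr<char[]> duplicate_nothrow(const char* s) noexcept {
  const std::size_t size = std::strlen(s) + 1;
  // new (std::nothrow) yields nullptr instead of throwing std::bad_alloc
  std::unique_ptr<char[]> copy(new (std::nothrow) char[size]);
  if (copy) std::memcpy(copy.get(), s, size);
  return copy;
}
```

The caller checks the result for null, in the same spirit as calling empty() on a const_string. An installed std::new_handler still gets a chance to run before the null pointer is returned.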

It is expected that you will store UTF-8 encoded UNICODE characters (or ASCII/Latin1) inside const_string. When you need a two- or four-byte UNICODE representation, you can convert a const_string into a mutable std::basic_string using the convert_to_u16() and convert_to_u32() member functions, or convert_to_ucs() for the system wchar_t UNICODE.
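For readers unfamiliar with what such a conversion involves, here is a minimal stand-alone sketch of decoding UTF-8 into four-byte code points. The function `utf8_to_u32` is hypothetical and does not show how convert_to_u32() is implemented. It assumes well-formed input and performs no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration; assumes `utf8` is well-formed UTF-8.
std::u32string utf8_to_u32(const std::string& utf8) {
  std::u32string out;
  for (std::size_t i = 0; i < utf8.size();) {
    const unsigned char lead = static_cast<unsigned char>(utf8[i]);
    // Count of continuation bytes and the payload bits of the lead byte.
    std::size_t extra = 0;
    char32_t cp = lead;
    if (lead >= 0xF0)      { extra = 3; cp = lead & 0x07; }
    else if (lead >= 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if (lead >= 0xC0) { extra = 1; cp = lead & 0x1F; }
    ++i;
    // Each continuation byte contributes its low six bits.
    for (; extra > 0 && i < utf8.size(); --extra, ++i) {
      cp = (cp << 6) | (static_cast<unsigned char>(utf8[i]) & 0x3F);
    }
    out.push_back(cp);
  }
  return out;
}
```

The result is a mutable std::basic_string<char32_t>, which is the kind of value the text above describes for the four-byte representation.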

NOTE: on GNU/Linux wchar_t is 32 bits long and stores UTF-32LE or UTF-32BE depending on CPU endianness, whereas on Windows wchar_t is always two bytes holding UTF-16LE, no matter whether you are using the MS VC++ or MinGW[64] compiler.

Small string optimization (SSO)

To reduce memory usage and improve general performance by minimizing atomic reference counting operations, const_string uses the small string optimization technique in addition to copy on write. SSO is applied to strings up to 14 (64-bit) / 7 (32-bit) characters long.
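The general technique can be observed on std::string, which mainstream standard libraries also implement with SSO. The sketch below is an analogy, not const_string code: `stored_inline` is a hypothetical helper, and the inline thresholds of std::string are implementation-specific and differ from the 14/7 figures above.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: returns true when the characters live inside the
// std::string object itself rather than in a separate heap block.
bool stored_inline(const std::string& s) {
  const auto obj_begin = reinterpret_cast<std::uintptr_t>(&s);
  const auto obj_end = obj_begin + sizeof(std::string);
  const auto chars = reinterpret_cast<std::uintptr_t>(s.data());
  return chars >= obj_begin && chars < obj_end;
}
```

A short string such as "hi" is typically reported as inline, while a string longer than the object itself cannot be. For an inline string no heap block exists, so there is nothing to reference count.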
