
Some common Unicode problems and solutions using Perl DBD::ODBC and MS SQL Server

Introduction

Here I have tried to collect useful information to help you use Unicode in MS SQL Server from Perl DBD::ODBC. Many of the problems listed came from questions on the dbi-users mailing list, PerlMonks, or emails sent directly to me.

I've tried very hard to make all the examples work on Windows and Unix but the facts are:

not all ODBC drivers are equal

Differences in ODBC drivers' support for the ODBC specification, along with bugs, mean they all behave a little differently. Also, many ODBC drivers for Unix have multiple ways of being configured to support Unicode (e.g., the ODBC way, returning wide characters, or converting them to UTF-8).

SQL Server is not constant

Different versions of MS SQL Server add new features or bug fixes, e.g., MS SQL Server did not formally support supplementary characters in the past but now does if you use the correct collation.

In writing this I discovered a bug in DBD::ODBC (when inserting into char/varchar columns) which affects all versions from when Unicode support was introduced up until 1.46_1, when it was fixed. I've tried to highlight the issue in the following examples.

Terminology

In this document I repeatedly use some terminology which needs further explanation.

Encoding

By encoding, I mean how Unicode characters are represented as bytes, e.g., they could be encoded in UTF-8, UTF-16, UCS-2, etc. Internally, Perl encodes Unicode characters in UTF-8, but you mostly don't need to know that (although DBD::ODBC does).
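
If you want to see this internal encoding for yourself, the core Devel::Peek module will show both the UTF8 flag and the encoded bytes of a scalar:

  <code>
  # dumps the internals of a perl scalar holding a euro symbol;
  # look for UTF8 in FLAGS and the bytes \342\202\254 in PV
  use Devel::Peek;
  Dump("\x{20ac}");
  </code>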

Wide characters

In the ODBC API, wide characters were originally 2 bytes each and UCS-2 encoded. However, Microsoft's idea of wide characters has changed over time and a character can now sometimes be 4 bytes (a UTF-16 surrogate pair).
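
A quick illustration using the core Encode module: a character in the Basic Multilingual Plane (like the Euro) fits in one 2-byte code unit, while a supplementary character needs a 4-byte surrogate pair:

  <code>
  use Encode qw(encode);
  print length(encode("UTF-16LE", "\x{20ac}")), " bytes\n";   # 2 bytes
  print length(encode("UTF-16LE", "\x{2070E}")), " bytes\n";  # 4 bytes
  </code>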

Wide APIs

The ODBC wide APIs are those called SQLxxxW, e.g., SQLDriverConnectW. String arguments to the wide APIs are expected to be UCS-2 encoded (normally; sometimes UTF-16).

SQL_WCHAR and SQL_WVARCHAR

SQL_WCHAR and SQL_WVARCHAR are actually macros in the C ODBC API which are assigned numbers and passed into some ODBC APIs to tell the ODBC driver to return "Wide characters".

You can use these macros independently of the ODBC "Wide APIs", i.e., you don't have to use SQLDriverConnectW, SQLPrepareW etc. just to get "Wide characters" back from an ODBC driver; SQLBindCol and SQLBindParameter often accept SQL_WCHAR/SQL_WVARCHAR as well.
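
The same macros are exported by DBI's :sql_types tag, so you can check from Perl the numeric values they stand for (in the ODBC headers SQL_WCHAR is -8 and SQL_WVARCHAR is -9):

  <code>
  use DBI qw(:sql_types);
  print SQL_WCHAR, " ", SQL_WVARCHAR, "\n";   # prints: -8 -9
  </code>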

Some DBD::ODBC background/history

DBD::ODBC has not always supported Unicode. It all started with a patch from Alexander Foken and around version 1.14 Alexander's original patch was adapted to include optional Unicode support for Unix.

For DBD::ODBC, building without Unicode support really means building DBD::ODBC as it worked before Unicode support was added, to maintain backwards compatibility.

The initial Unicode support was for Windows only and allowed you to send/retrieve nchar/nvarchar columns as SQL_WCHARs, meaning DBD::ODBC:

  o used Windows APIs like WideCharToMultiByte and MultiByteToWideChar
    to convert from UCS-2 to UTF-8 and vice versa.

  o passed SQL_WCHAR to SQLBindCol for nchar columns meaning UCS-2
    data could be retrieved from a column.

  o marked converted UTF-8 data returned from the database as UTF-8 for Perl.

  o unicode data in Perl which was bound to placeholders was bound
    as SQL_WCHAR and passed to SQLBindParameter meaning you could
    write unicode data into columns.

Since then unicode support has grown to include:

  o unicode support in SQL (1.16_2)
  o unicode support in connection strings (1.16_2)
  o unicode support in column names
  o unicode support in metadata calls (1.32_3)

For full documentation on Unicode in DBD::ODBC see the Unicode section in DBD::ODBC pod.

Using non-Unicode aware components

The ODBC API has two main sets of APIs: the ANSI API (the SQLxxxA APIs) and the Unicode API (the SQLxxxW APIs). By default, DBD::ODBC uses the Unicode API on Windows and the ANSI API on non-Windows platforms. There are good historical reasons for this beyond the scope of this article. If you want to read/write/update/select Unicode data with DBD::ODBC and MS SQL Server from non-Windows platforms you need:

unixODBC

Use the unixODBC ODBC Driver Manager. You will need the unixodbc and unixodbc-dev packages or you can build it yourself quite easily.

DBD::ODBC

Build DBD::ODBC for the unicode API i.e., build it with

   perl Makefile.PL -u

If you install DBD::ODBC from the CPAN shell (in Unix) by default you'll get a DBD::ODBC which does not use the wide ODBC APIs.

ODBC Driver

Use a MS SQL Server ODBC driver which supports the unicode APIs. The Easysoft ODBC driver for MS SQL Server supports all the wide ODBC APIs as does the Microsoft ODBC driver.

How can you tell whether what you've already got supports the Unicode ODBC API?

odbc_has_unicode

For DBD::ODBC you need to get a connection established to your database and then you can test the odbc_has_unicode attribute:

  perl -MDBI -le 'my $h = DBI->connect; print $h->{odbc_has_unicode};'

If this outputs 1, your DBD::ODBC was built for Unicode. Any false value means DBD::ODBC is not using the unicode APIs.

Use a recent unixODBC.

Use the most recent unixODBC you can get hold of. Most packaged unixODBCs (e.g., for Ubuntu, Debian, etc.) are quite old but even those support Unicode. We recommend a recent unixODBC because of important bug fixes in more recent releases.

Unicode/Wide APIs supported by ODBC Driver

Find out if your ODBC driver supports the unicode APIs (all Easysoft ODBC drivers do). This is a lot harder than it sounds and you'll most likely have to consult the driver documentation.

If you understand shared objects in Unix you can try looking for SQLxxxW APIs being exported by the driver shared library e.g.,

   nm /usr/local/easysoft/sqlserver/lib/libessqlsrv.so | grep SQLDriverConnectW
   0001cf80 T SQLDriverConnectW

What happens if I try and use unicode from Perl DBD::ODBC and a component in the chain does not support the unicode APIs?

The simple answer is that you won't be able to insert/update/delete/select Unicode data with MS SQL Server, but that is too simplistic and it is worth looking at some examples.

ex1. Simple insert/select with non-unicode built DBD::ODBC

  <code>
  # ex1.pl
  use 5.008001;
  use strict;
  use warnings;
  use DBI qw{:utils};

  my $unicode = "\x{20ac}";       # unicode euro symbol
  my $h = DBI->connect or die $DBI::errstr;
  $h->{RaiseError} = 1;

  eval {$h->do(q/drop table unicode_test/)};
  $h->do(q/create table unicode_test (a nvarchar(20))/);

  my $s = $h->prepare(q/insert into unicode_test (a) values(?)/);
  $s->execute($unicode);

  my $r = $h->selectrow_arrayref(q/select a from unicode_test/);
  my $data = $r->[0];
  print "DBI describes data as: ", data_string_desc($data), "\n";
  print "Data Length: ", length($data), "\n";
  print "hex ords: ";
  foreach my $c(split(//, $data)) {
      print sprintf("%x,", ord($c));
  }
  print "\n";
  </code>

which outputs:

  <output>
  DBI describes data as: UTF8 off, non-ASCII, 3 characters 3 bytes
  Data Length: 3
  hex ords: e2,82,ac,
  </output>

As you can see, we attempted to insert a Unicode Euro symbol but when we selected it back we got 3 characters and 3 bytes instead of 1 character and 3 bytes; this is confirmed by the fact that the Perl data contains the three bytes of the UTF-8 encoding for a Euro.

An explanation of what happened above:

  1. The column was created as an nvarchar so MS SQL Server should be happy to accept unicode characters for the column data.
  2. DBD::ODBC prepared the SQL and asked the ODBC driver to describe the parameter, where it was told the parameter was a Unicode varchar of length 20 characters. However, it bound the parameter with a value type of SQL_C_CHAR and a parameter type of SQL_WVARCHAR, so the driver interpreted each byte of Perl's UTF-8 encoded string as a separate character.
  3. When we read the data back we got the bytes back as Perl had encoded the Euro internally (UTF-8).

You might be asking yourself at this point why DBD::ODBC bound the data with a value type of SQL_C_CHAR, and the answer is backwards compatibility, i.e., that is what it did for a long time before support for the Unicode API was added.

So what if we force DBD::ODBC to bind the data as SQL_WCHAR?

ex2. Simple insert/select with non-unicode built DBD::ODBC forcing SQL_WCHAR

The code for this is nearly identical to the above except we add a bind_param call and import :sql_types from DBI.

  <code>
  use 5.008001;
  use strict;
  use warnings;
  use DBI qw{:utils :sql_types};

  my $unicode = "\x{20ac}";       # unicode euro symbol
  my $h = DBI->connect or die $DBI::errstr;
  $h->{RaiseError} = 1;

  eval {$h->do(q/drop table unicode_test/)};
  $h->do(q/create table unicode_test (a nvarchar(20))/);

  my $s = $h->prepare(q/insert into unicode_test (a) values(?)/);
  $s->bind_param(1, undef, {TYPE => SQL_WVARCHAR});
  $s->execute($unicode);

  my $r = $h->selectrow_arrayref(q/select a from unicode_test/);
  my $data = $r->[0];
  print "DBI describes data as: ", data_string_desc($data), "\n";
  print "Data Length: ", length($data), "\n";
  print "hex ords: ";
  foreach my $c(split(//, $data)) {
      print sprintf("%x,", ord($c));
  }
  print "\n";
  </code>

and the output is:

  <output>
  DBI describes data as: UTF8 off, non-ASCII, 3 characters 3 bytes
  Data Length: 3
  hex ords: e2,82,ac,
  </output>

Exactly the same as before. Why? The TYPE argument passed to bind_param sets the SQL Type (the parameter type) and not the value type in a SQLBindParameter call.
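
If you want to verify this for yourself, you can turn on DBI tracing around the execute and look for the bind in the log. This is just a way of observing what DBD::ODBC does (the trace file path is illustrative), continuing from ex2:

  <code>
  # trace the insert at a verbose level and inspect the log afterwards
  $h->trace(15, "/tmp/dbi_trace.log");
  $s->bind_param(1, undef, {TYPE => SQL_WVARCHAR});
  $s->execute($unicode);
  $h->trace(0);   # tracing off again
  </code>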

Reading properly written unicode in non-unicode built DBD::ODBC

Now what if the unicode Euro was inserted by something else correctly and we want to read it using a non-unicode built DBD::ODBC?

ex3. Reading unicode from non unicode built DBD::ODBC

We've got a valid Unicode Euro symbol in the database in an nvarchar column (don't worry about how for now; this just shows what happens when the data in the database is correct but you use the wrong method to retrieve it).

  <code>
  use 5.008001;
  use strict;
  use warnings;
  use DBI qw{:utils};

  my $unicode = "\x{20ac}";       # unicode euro symbol
  my $h = DBI->connect or die $DBI::errstr;
  $h->{RaiseError} = 1;

  my $r = $h->selectrow_arrayref(q/select a from unicode_test/);
  my $data = $r->[0];
  print "DBI describes data as: ", data_string_desc($data), "\n";
  print "Data Length: ", length($data), "\n";
  print "hex ords: ";
  foreach my $c(split(//, $data)) {
      print sprintf("%x,", ord($c));
  }
  print "\n";
  </code>

which outputs:

  <output>
  DBI describes data as: UTF8 off, non-ASCII, 1 characters 1 bytes
  Data Length: 1
  hex ords: 80,
  </output>

To be honest, what you get back here very much depends on the ODBC driver and platform. On Windows you'd probably get the above because 0x80 is the windows-1252 character for a Euro (if it had been something not in windows-1252, it would probably have returned a question mark). With some Unix MS SQL Server ODBC drivers you could get any of the following (and perhaps more):

o 0xac (the low byte of 0x20ac)

o 0x3f (a question mark because the driver cannot convert a wide character to a SQL_C_CHAR)

o 0x80 if you set the client character-set to windows-1252

o 0xe2, 0x82, 0xac (the UTF-8 encoded Euro)

The point of the illustration is that this is a world of pain and you don't really want to do any of the above unless you have absolutely no choice.

You might be saying to yourself: yes, but you can set a type in the bind_col method, so you can control how the data is returned to you. Mostly that is not true for just about every Perl DBD I know of, and although DBD::ODBC does let you override the default type in a bind_col call, you can only do so for decimals and timestamps.
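
For completeness, here is a minimal sketch of one of the overrides DBD::ODBC does honour in bind_col: retrieving a decimal column as a double. The table and column names are hypothetical, and $h is a connected handle as in the earlier examples:

  <code>
  use DBI qw(:sql_types);
  my $s = $h->prepare(q/select price from products/);
  $s->execute;
  # override the default bind type for column 1
  $s->bind_col(1, \my $price, {TYPE => SQL_DOUBLE});
  print "$price\n" while $s->fetch;
  </code>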

Using varchar columns instead of nvarchar columns for unicode data

If you are using DBD::ODBC before 1.46_1, don't do this. A bug in those versions means DBD::ODBC does not look at the Perl scalars you are binding for input and always binds them using the type the driver describes the column as (which will always be SQL_CHAR for a varchar column).

Generally speaking, you should use nchar/nvarchar when you need to support multiple languages in the same column, although even that isn't always necessary, e.g., you can support English, German and Italian in one Windows codepage. A better recommendation is to use nchar/nvarchar columns for unconstrained, user-provided data, and char/varchar for constrained data which you control, like a number plate or serial number.
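
A minimal sketch of that recommendation (the table and column names are illustrative, and $h is a connected handle as in the earlier examples):

  <code>
  $h->do(q/create table orders (
      customer_name nvarchar(100),  -- user-provided, unconstrained, any language
      serial_no varchar(20)         -- constrained data you control
  )/);
  </code>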

However, in the spirit of describing why, let's look at some examples. These examples assume we are now using a DBD::ODBC built for the Unicode API (see above) and that you have a Unicode-aware ODBC driver.

So we return to our first simple example but now run it with a Unicode-built DBD::ODBC, use a varchar column and try two different bind types (the default and an overridden one):

ex4. Simple insert/select with unicode built DBD::ODBC but using varchar

  <code>
  use 5.008001;
  use strict;
  use warnings;
  use DBI qw{:utils :sql_types};

  my $unicode = "\x{20ac}";       # unicode euro symbol
  my $h = DBI->connect or die $DBI::errstr;
  $h->{RaiseError} = 1;

  eval {$h->do(q/drop table unicode_test/)};
  $h->do(q/create table unicode_test (a varchar(20))/);

  my $s = $h->prepare(q/insert into unicode_test (a) values(?)/);
  $s->execute($unicode);
  $s->bind_param(1, undef, {TYPE => SQL_WVARCHAR});
  $s->execute($unicode);

  my $r = $h->selectall_arrayref(q/select a from unicode_test/);
  foreach my $r (@$r) {
      my $data = $r->[0];
      print "DBI describes data as: ", data_string_desc($data), "\n";
      print "Data Length: ", length($data), "\n";
      print "hex ords: ";
      foreach my $c(split(//, $data)) {
          print sprintf("%x,", ord($c));
      }
      print "\n";
  }
  </code>

which outputs

  <output>
  DBI describes data as: UTF8 on, non-ASCII, 3 characters 6 bytes
  Data Length: 3
  hex ords: e2,201a,ac,
  DBI describes data as: UTF8 on, non-ASCII, 1 characters 3 bytes
  Data Length: 1
  hex ords: 20ac,
  </output>

Here again, you'll get different results depending on platform and driver.

I imagine this is really going to make you wonder what on earth has happened here. Bear in mind, in Perl, the euro is internally encoded in UTF-8 as 0xe2,0x82,0xac.

In the first insert, DBD::ODBC did what it always did: it asked the database what the column type was, the database returned SQL_CHAR, and the Euro was bound as a SQL_CHAR (the bug). In the second insert we overrode DBD::ODBC and told it to bind the data as SQL_WVARCHAR.

When we retrieved the data, DBD::ODBC bound the column as SQL_WCHAR (which it always does in a Unicode build).

As far as MS SQL Server is concerned this is a varchar column: you asked to insert 3 characters with codes 0xe2, 0x82 and 0xac, and reading them back as binary data confirms that is what is in the database. However, where did the character with code 0x201a come from? When DBD::ODBC read the data back it bound the column as SQL_C_WCHAR and hence asked SQL Server to convert the characters in the varchar column to wide (UCS-2 or UTF-16) characters, and character 0x82 in the windows-1252 character set (which I was using when running this code) is the curved quote with Unicode value U+201A. 0xe2 and 0xac in windows-1252 map to the Unicode code points with the same numeric values.
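
You can reproduce this mapping with the core Encode module:

  <code>
  # byte 0x82 is the curved quote in windows-1252 and maps to U+201A;
  # bytes 0xe2 and 0xac map to unicode code points with the same value
  use Encode qw(decode);
  printf "U+%04X\n", ord(decode("cp1252", "\x82"));   # U+201A
  printf "U+%04X\n", ord(decode("cp1252", "\xe2"));   # U+00E2
  printf "U+%04X\n", ord(decode("cp1252", "\xac"));   # U+00AC
  </code>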

In the second row we bound the data as SQL_WCHAR for the insert (and the database mapped those Unicode characters to the codepage for the table) and we bound it as SQL_WCHAR for the select. The driver maps the codepage characters retrieved back to Unicode and we get back what we inserted (so long as all the characters we inserted fit in the current codepage). However, had we tried to insert a character not in the windows-1252 codepage, SQL Server would have substituted that character with a '?'. We should not have had to override the bind type here; that was the bug in DBD::ODBC pre 1.46_1.

This last point is really important. On Windows, the driver knows your codepage and can do this, but on Unix it is not uncommon to have to tell the driver what codepage you want to use for the conversion, e.g., with the Easysoft SQL Server ODBC driver you could set "Client_cset=WINDOWS-1252", or you could add "use_lcid=Yes" to your connection string and the driver will work out the correct codepage for you.
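
For example (a sketch only; the attribute names are the Easysoft ones mentioned above, so check your driver's documentation for the exact spelling, and the DSN and credentials are hypothetical):

  <code>
  use DBI;
  my ($user, $pass) = ("dbuser", "dbpass");
  # tell the driver the client codepage explicitly:
  my $h = DBI->connect("dbi:ODBC:DSN=mssql;Client_cset=WINDOWS-1252",
                       $user, $pass);
  # or let the driver work the codepage out from your login's LCID:
  my $h2 = DBI->connect("dbi:ODBC:DSN=mssql;use_lcid=Yes", $user, $pass);
  </code>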

Here is a Windows specific version of the above test with a few more bells and whistles:

  <code>
  #
  # A simple demonstration of why you cannot use char and varchar columns
  # in MS SQL Server to store Unicode. char and varchar columns use a codepage
  # and unicode characters inserted into them are converted to the codepage.
  # If a conversion does not exist in the codepage the characters which don't
  # convert will be '?'
  #
  # Show the difference between binding as SQL_CHAR and SQL_WCHAR.
  #
  # See http://msdn.microsoft.com/en-us/library/bb330962.aspx#intlftrql2005_topic2
  #
  use 5.008.001;
  use strict;
  use warnings;
  use DBI qw(:sql_types);
  use Data::Dumper;
  use Win32::API;

  # chcp on my machines normally gives 850
  # http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/chcp.mspx?mfr=true
  #
  # we want windows-1252 so run: chcp 1252 first

  #use open qw( :encoding(Windows-1252) :std );
  binmode STDOUT, ":encoding(cp1252)";

  # get active codepage and ensure it is cp1252
  # http://stackoverflow.com/questions/1259084/what-encoding-code-page-is-cmd-exe-using
  Win32::API::More->Import("kernel32", "UINT GetConsoleOutputCP()");
  my $cp = GetConsoleOutputCP();
  print "Current active console code page: $cp\n";
  if ($cp != 1252) {
      print "Please change to codepage 1252 - run chcp 1252\n";
      die "Incompatible active codepage - please change to codepage 1252 by running chcp 1252\n";
  }

  my $h = DBI->connect() or die $DBI::errstr;
  $h->{RaiseError} = 1;
  $h->{PrintError} = 1;

  eval {$h->do(q/drop table varchar_test/)};
  $h->do(q/create table varchar_test(a varchar(20) collate Latin1_General_CI_AS)/);

  # note codepage 1252 is 255 chrs including the euro at 0x80
  # windows-1252 does not contain U+0187 but it does contain
  # the euro symbol (U+20ac), the curved quotes (U+201A),
  # Latin Small Letter F with hook (U+192), dagger (U+2020)
  # mixing code pages in SQL Server is not recommended
  my $insert = $h->prepare(q/insert into varchar_test (a) values(?)/);
  my $data = "\x{20ac}\x{201A}\x{192}\x{2020}\x{187}" ;
  {
      use bytes;
      print "encoded length of our data is:", length($data), "\n";
      print "encoded data in hex is:";
      foreach my $b(split(//, $data)) {
          print sprintf("%x,", ord($b));
      }
      print "\n";
  }
  # this execute will discover the column is varchar and bind the perl scalar
  # as SQL_CHAR meaning the UTF-8 encoded data in the perl scalar
  # will be inserted as separate characters not all of which will even
  # be translateable to the current codepage.
  $insert->execute($data);
  # Now we force DBD::ODBC to insert the parameter as SQL_WVARCHAR
  $insert->bind_param(1, undef, {TYPE => SQL_WVARCHAR});
  $insert->execute($data);

  print "\nNotice in the first row (which was inserted as SQL_CHAR), the UTF-8 stored in the perl scalar is mostly stored as individual characters but then you will be wondering why the few of the characters seem to come back as unicode. Windows sees individual characters in the UTF-8 sequence as characters in the windows-1252 codepage and the UTF-8 sequence contains some characters in windows-1252 which map back to unicode chrs. e.g., the UTF-8 sequence for the euro is e2, 82, ac and windows see the 82 as the curved quotes in windows-1252 but when you ask for it back as wide/unicode characters it maps it to U+201a (curved quotes unicode point)\n";
  print "\nNotice how in the second row the last character is a ?. That is because U+0187 does not exist in windows-1252 codepage our column is using\n";
  my $r = $h->selectall_arrayref(q/select a from varchar_test/);
  print Dumper($r);
  foreach my $row (@$r) {
      print $row->[0], "\n";
  }
  $h->disconnect;

  </code>

The correct way to do Unicode with DBD::ODBC and SQL Server and why

When retrieving rows on Windows or Unix (with DBD::ODBC built for Unicode support), your char, varchar, nchar and nvarchar columns should all be correct. Even when you use char and varchar, which use a codepage, because DBD::ODBC asks for the data as SQL_WCHAR, SQL Server will convert any character in the codepage to a Unicode codepoint and DBD::ODBC will encode them as UTF-8 and mark them as Unicode to Perl.

When inserting Unicode, DBD::ODBC will normally just do the right thing, i.e., use SQL_WCHAR for nchar/nvarchar columns, but if your column is a char/varchar then, prior to 1.46_1, it may do the wrong thing by default. Until 1.46_1 DBD::ODBC ignored your Perl data and bound it as the type the driver reported for the parameter; from 1.46_1 onwards DBD::ODBC first looks at the scalar supplied for the parameter to see if it has the utf8 flag set.
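
Putting that together, here is a minimal "doing it right" sketch: a Unicode-built DBD::ODBC (1.46_1 or later), an nvarchar column and no bind overrides needed:

  <code>
  use 5.008001;
  use strict;
  use warnings;
  use DBI qw(:utils);

  my $h = DBI->connect or die $DBI::errstr;
  $h->{RaiseError} = 1;

  eval {$h->do(q/drop table unicode_test/)};
  $h->do(q/create table unicode_test (a nvarchar(20))/);

  # the scalar has the utf8 flag set so DBD::ODBC binds it as SQL_WVARCHAR
  $h->do(q/insert into unicode_test (a) values(?)/, undef, "\x{20ac}");

  my $r = $h->selectrow_arrayref(q/select a from unicode_test/);
  print data_string_desc($r->[0]), "\n";  # expect: UTF8 on, ... 1 characters 3 bytes
  </code>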

Surrogate pairs (or unicode code points above U+FFFF)

ODBC supports Unicode in the form of SQL_WCHAR types and Unicode versions of the ODBC API. The encoding form that ODBC expects for data used with Unicode API functions is UCS-2 (or at least that is how it looked when Windows and MS SQL Server were using UCS-2).

In anticipation of things changing, when Unicode support was added to DBD::ODBC the internal code actually converted Perl UTF-8 encoded strings into UTF-16, which for characters in the Basic Multilingual Plane is identical to UCS-2. As there were no Unicode drivers supporting supplementary characters (above U+FFFF) at the time, this could not be proven to be a good decision; however, at worst it meant code using Unicode outside the Basic Multilingual Plane would just insert what looked to the database like more characters.

Older versions of MS SQL Server and surrogate pairs

As it turned out, the decision in DBD::ODBC to use UTF-16 matched exactly what MS initially did, and versions of MS SQL Server like 2000, 2005 and 2008 are described as surrogate-neutral rather than surrogate-aware. MS had this to say at the time:

Both SQL Server 2000 and SQL Server 2005 can store surrogate pairs, even though UCS-2 is not aware of surrogates. SQL Server treats the surrogate pairs as two undefined Unicode characters rather than as a single character. Such applications are usually referred to as surrogate-neutral or surrogate-safe, meaning that there is no intrinsic ability to interact with the data, but at least the data can be stored without loss.

However, there are a few things you should be aware of when using these older MS SQL Server versions that are only surrogate-neutral:

o for each surrogate pair inserted you'll need one extra character of column capacity, e.g., inserting 3 surrogate pairs into an nvarchar requires an nvarchar(6), not an nvarchar(3) (see the sketch after this list).

o string operations are not aware of supplementary characters. So, watch out if you are using substring or len functions in SQL.

o sorting and searching behavior for supplementary characters may change depending on the collation.
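
A minimal sketch of the sizing point above (the table name is illustrative, and $h is a connected handle as in the earlier examples):

  <code>
  # on a surrogate-neutral server each supplementary character (here
  # U+2070E) occupies two UCS-2 code units in the column
  $h->do(q/create table surrogate_sizing (a nvarchar(6))/);   # room for 3 pairs
  $h->do(q/insert into surrogate_sizing (a) values(?)/, undef, "\x{2070E}" x 3);
  </code>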

Newer versions of MS SQL Server and surrogate pairs

Newer versions of SQL Server (2012 and later, version >= 11) support surrogate pairs but you must set the collation to one ending in "_SC", e.g., Latin1_General_100_CI_AS_SC. When you do this, string functions will recognise surrogate pairs and all of the problems listed above for older SQL Servers are fixed.

Is my SQL Server surrogate-neutral or surrogate-aware?

Here is a small script you can use to test whether your SQL Server is surrogate-neutral or surrogate-aware:

  <code>
  # Test to see if your SQL Server is surrogate-aware or just surrogate-neutral
  use 5.008.001;
  use strict;
  use warnings;
  use DBI qw(:utils);

  my $h = DBI->connect() or die $DBI::errstr;
  $h->{PrintError} = 0;

  eval {$h->do(q/drop table surrogate_pairs/)};

  # It is possible to set the collation at instance or database level.
  # Set it on the column to make sure that, initially, we are using a
  # non supplementary character collation.
  $h->do(q/create table surrogate_pairs (a nvarchar(20) collate Latin1_General_100_CI_AI)/);

  my $insert = $h->prepare(q/insert into surrogate_pairs values(?)/);

  # Insert test supplementary character
  print "Inserting unicode character U+2070E into db\n";
  $insert->execute("\x{2070E}");

  # now read it back and see what we get
  print "\nNote when we select this character back it is still 1 unicode character and 4 bytes and the ord is correct at 0x2070E. This is because DBD::ODBC received a buffer of SQL_WCHAR chrs back from SQL Server which it then decoded as UTF-16 which recognises the surrogate pair. This is why SQL Server using this collation (or older SQL Servers) are known as surrogate-neutral.\n";
  my $r = $h->selectrow_arrayref(q/select a from surrogate_pairs/);
  print data_string_desc($r->[0]), "\n";
  print "ord(chr): ", sprintf("0x%x", ord($r->[0])), "\n";

  # This is a non _SC collation, so the length function returns "2".
  print "\nNote in the following that len(a) returns 2 not 1 as SQL Server has not recognised this as a surrogate pair.\n";
  $r = $h->selectrow_arrayref(q/select len(a) from surrogate_pairs/);
  print "length in database is: ", $r->[0], "\n";

  # now try and alter the table to change the collation to Latin1_General_100_CI_AS_SC
  # which only later SQL Servers (>= version 11, i.e., 2012) can do.
  # Unfortunately older SQL Servers don't error if you try and change the collation
  # to one it does not support so we cannot test by just trying to change to a
  # surrogate aware collation.
  $h->do(q/alter table surrogate_pairs alter column a nchar(20) collate Latin1_General_100_CI_AS_SC/);

  $r = $h->selectrow_arrayref(q/SELECT SERVERPROPERTY('ProductVersion')/);
  my $version = $r->[0];
  print "\nYou SQL Server is version: $version\n\n";

  if ((split(/\./, $version))[0] >= 11) {
      print "Your SQL Server is surrogate-aware\n";
      $r = $h->selectrow_arrayref(q/select a from surrogate_pairs/);
      print data_string_desc($r->[0]), "\n";
      print "ord(chr): ", sprintf("0x%x", ord($r->[0])), "\n";

      print "\nNote in the following that len(a) returns 1 as SQL Server in this collation recognises surrogate pairs\n";
      $r = $h->selectrow_arrayref(q/select len(a) from surrogate_pairs/);
      print "length in database is: ", $r->[0], "\n";
  } else {
      print "You SQL Server is surrogate-neutral but not surrogate-aware\n";
  }
  $h->disconnect;
  </code>

References

How do I insert Unicode supplementary characters into SQL Server from Perl? http://www.easysoft.com/support/kb/kb01043.html

List of Unicode characters https://en.wikipedia.org/wiki/List_of_Unicode_characters#Latin-1_Supplement

Windows-1252 https://en.wikipedia.org/wiki/Windows-1252

International Features in Microsoft SQL Server 2005 http://msdn.microsoft.com/en-us/library/bb330962.aspx#intlftrql2005_topic2