2a80cfa Add common problems doc
Martin J. Evans authored
=head1 Some common unicode problems and solutions using Perl DBD::ODBC and MS SQL Server

=head2 Introduction

Here I have tried to collect useful information to help you use
Unicode in MS SQL Server from Perl DBD::ODBC. Many of the problems
listed came from questions on the dbi-users list, perlmonks or emails
direct to me.

I've tried very hard to make all the examples work on Windows
and Unix but the facts are:

=over

=item not all ODBC drivers are equal

Differences in the ODBC drivers' support for the ODBC specification,
and bugs, mean they all behave a little differently. Also, many ODBC
drivers for Unix can be configured to support Unicode in different
ways (e.g., the ODBC way, returning wide characters, or converting
them to UTF-8).

=item SQL Server is not constant

Different versions of MS SQL Server add new features or bug fixes e.g.,
MS SQL Server did not formally support supplementary characters in the
past but now does if you use the correct collation.

=back

=head2 Terminology

In this document I repeatedly use some terminology which needs further
explanation.

=over

=item Encoding

By encoding, I mean how unicode characters are encoded e.g., they
could be encoded in UTF-8, UTF-16, UCS-2 etc. Internally, Perl encodes
unicode characters in UTF-8 but you mostly don't need to know that,
although DBD::ODBC does.

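To see what these encodings look like, here is a pure-Perl sketch
using the core Encode module (nothing here touches the database):

<code>
  use strict;
  use warnings;
  use Encode qw(encode);

  my $euro = "\x{20ac}";    # one unicode character, the euro

  # UTF-8 uses 3 bytes for the euro:
  print join(",", map { sprintf "%02x", ord } split //, encode("UTF-8", $euro)), "\n";
  # prints e2,82,ac

  # UTF-16LE uses 2 bytes (a single UCS-2 code unit):
  print join(",", map { sprintf "%02x", ord } split //, encode("UTF-16LE", $euro)), "\n";
  # prints ac,20
</code>
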
=item Wide characters

In the ODBC API, wide characters were originally 2 bytes and UCS-2
encoded. However, Microsoft's idea of wide characters keeps changing
and now a wide character is sometimes 4 bytes (a UTF-16 surrogate
pair).

=item Wide APIs

The ODBC wide APIs are those called SQLxxxW e.g., SQLDriverConnectW. Any
string arguments to the wide APIs are expected to be UCS-2 encoded
(normally, and sometimes UTF-16).

=item SQL_WCHAR and SQL_WVARCHAR

SQL_WCHAR and SQL_WVARCHAR are actually macros in the C ODBC API which
are assigned numbers and passed into some ODBC APIs to tell the
ODBC driver to return L</Wide characters>.

You can use these macros independently of using the ODBC L</Wide APIs>
i.e., you don't have to use SQLDriverConnectW, SQLPrepareW etc just
to get L</Wide characters> back from an ODBC Driver; SQLBindCol
and SQLBindParameter often support SQL_WCHAR/SQL_WVARCHAR as well.

=back

=head2 Some DBD::ODBC background/history

DBD::ODBC has not always supported Unicode. It all started with a
patch from Alexander Foken and around version 1.14 Alexander's
original patch was adapted to include optional Unicode support for
Unix.

For DBD::ODBC, building without unicode support really means building
DBD::ODBC as it worked before unicode support was added, to maintain
backwards compatibility.

The initial Unicode support was for Windows only and allowed you to
send/retrieve nchar/nvarchar columns as SQL_WCHARs, meaning DBD::ODBC:

o used Windows APIs like WideCharToMultiByte and MultiByteToWideChar
  to convert from UCS-2 to UTF-8 and vice versa.

o passed SQL_WCHAR to SQLBindCol for nchar columns, meaning UCS-2
  data could be retrieved from a column.

o marked converted UTF-8 data returned from the database as UTF-8 for Perl.

o bound unicode data in Perl which was bound to placeholders as
  SQL_WCHAR and passed it to SQLBindParameter, meaning you could write
  unicode data into columns.

Since then unicode support has grown to include:

o unicode support in SQL (1.16_2)
o unicode support in connection strings (1.16_2)
o unicode support in column names
o unicode support in metadata calls (1.32_3)

For full documentation on Unicode in DBD::ODBC see the Unicode
section in the DBD::ODBC pod.

=head2 Using non-Unicode aware components

The ODBC API has two main sets of APIs: the ANSI API (SQLxxxA APIs)
and the unicode API (SQLxxxW APIs). By default, DBD::ODBC uses the
unicode API on Windows and the ANSI API on non-Windows
platforms. There are good historical reasons for this beyond the scope
of this article. If you want to read/write/update/select unicode data
with DBD::ODBC and MS SQL Server from non-Windows platforms you need:

=over

=item unixODBC

Use the unixODBC ODBC Driver Manager. You will need the unixodbc and
unixodbc-dev packages or you can build it yourself quite easily.

=item DBD::ODBC

Build DBD::ODBC for the unicode API i.e., build it with

  perl Makefile.PL -u

If you install DBD::ODBC from the CPAN shell (on Unix), by default
you'll get a DBD::ODBC which does not use the wide ODBC APIs.

=item ODBC Driver

Use a MS SQL Server ODBC driver which supports the unicode APIs. The
Easysoft ODBC driver for MS SQL Server supports all the wide ODBC APIs,
as does the Microsoft ODBC driver.

=back

=head2 How can you tell if you've already got support for the unicode ODBC API?

=over

=item odbc_has_unicode

For DBD::ODBC you need to get a connection established to MS SQL
Server and then you can test the odbc_has_unicode attribute:

  perl -MDBI -le 'my $h = DBI->connect; print $h->{odbc_has_unicode};'

If this outputs 1, your DBD::ODBC was built for Unicode. Any false
value means DBD::ODBC is not using the unicode APIs.

=item Use a recent unixODBC

Use the most recent unixODBC you can get hold of. Most packaged
unixODBCs (e.g., for Ubuntu, Debian etc) are quite old but even those
support Unicode. We recommend a recent unixODBC because of important
bug fixes in recent releases.

=item Unicode/Wide APIs supported by ODBC Driver

Find out if your ODBC driver supports the unicode APIs (all Easysoft
ODBC drivers do). This is a lot harder than it sounds and you'll most
likely have to consult the driver documentation.

If you understand shared objects in Unix you can try looking for
SQLxxxW APIs being exported by the driver shared library e.g.,

  nm /usr/local/easysoft/sqlserver/lib/libessqlsrv.so | grep SQLDriverConnectW
  0001cf80 T SQLDriverConnectW

=back

=head2 What happens if I try and use unicode from Perl DBD::ODBC and a component in the chain does not support the unicode APIs?

The simple answer is you won't be able to insert/update/delete/select
Unicode data from MS SQL Server, but to be honest this is too
simplistic and it is worth looking at some examples.

ex1. Simple insert/select with non-unicode built DBD::ODBC

<code>
  use 5.008001;
  use strict;
  use warnings;
  use DBI qw{:utils};

  my $unicode = "\x{20ac}"; # unicode euro symbol
  my $h = DBI->connect or die $DBI::errstr;
  $h->{RaiseError} = 1;

  eval {$h->do(q/drop table unicode_test/)};
  $h->do(q/create table unicode_test (a nvarchar(20))/);

  my $s = $h->prepare(q/insert into unicode_test (a) values(?)/);
  $s->execute($unicode);

  my $r = $h->selectall_arrayref(q/select a from unicode_test/);
  my $data = $r->[0][0];
  print "DBI describes data as: ", data_string_desc($data), "\n";
  print "Data Length: ", length($data), "\n";
  print "hex ords: ";
  foreach my $c (split(//, $data)) {
      print sprintf("%x,", ord($c));
  }
  print "\n";
</code>

which outputs:

<output>
  DBI describes data as: UTF8 off, non-ASCII, 3 characters 3 bytes
  Data Length: 3
  hex ords: e2,82,ac,
</output>

and as you can see we attempted to insert a unicode Euro symbol and
when we selected it back we got 3 characters and 3 bytes instead of 1
character and 3 bytes, confirmed by the fact the Perl data
contains a UTF-8 encoded Euro.

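As an aside, because the three characters that came back happen to be
a valid UTF-8 byte sequence, you can repair data mangled in exactly
this way from Perl (a sketch; it assumes $data holds the result from
the example above and only rescues this particular kind of mangling):

<code>
  use Encode qw(decode);

  my $repaired = decode("UTF-8", $data);
  print length($repaired), "\n";               # prints 1
  print sprintf("%x", ord($repaired)), "\n";   # prints 20ac
</code>
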
An explanation of what happened above:

=over

=item 1

The column was created as an nvarchar so MS SQL Server should be happy
to accept unicode characters for the column data.

=item 2

DBD::ODBC prepared the SQL and asked the ODBC driver to describe the
parameter, where it was told it was a UNICODE VARCHAR of length 20
characters. However, it bound the parameter with a value type of
SQL_C_CHAR and a parameter type of SQL_WVARCHAR, so the driver
interpreted each byte as a character.

=item 3

When we read the data back we got the bytes back as Perl had encoded
the Euro internally (UTF-8).

=back

You might be asking yourself at this point why DBD::ODBC bound the
data with a value type of SQL_C_CHAR and the answer is backwards
compatibility i.e., that is what it did for a long time before support
for the unicode API was added.

=head3 So what if we force DBD::ODBC to bind the data as SQL_WCHAR?

ex2. Simple insert/select with non-unicode built DBD::ODBC forcing SQL_WCHAR

The code for this is nearly identical to the above except we add a
bind_param call and import :sql_types from DBI.

<code>
  use 5.008001;
  use strict;
  use warnings;
  use DBI qw{:utils :sql_types};

  my $unicode = "\x{20ac}"; # unicode euro symbol
  my $h = DBI->connect or die $DBI::errstr;
  $h->{RaiseError} = 1;

  eval {$h->do(q/drop table unicode_test/)};
  $h->do(q/create table unicode_test (a nvarchar(20))/);

  my $s = $h->prepare(q/insert into unicode_test (a) values(?)/);
  $s->bind_param(1, undef, {TYPE => SQL_WVARCHAR});
  $s->execute($unicode);

  my $r = $h->selectall_arrayref(q/select a from unicode_test/);
  my $data = $r->[0][0];
  print "DBI describes data as: ", data_string_desc($data), "\n";
  print "Data Length: ", length($data), "\n";
  print "hex ords: ";
  foreach my $c (split(//, $data)) {
      print sprintf("%x,", ord($c));
  }
  print "\n";
</code>

and the output is:

<output>
  DBI describes data as: UTF8 off, non-ASCII, 3 characters 3 bytes
  Data Length: 3
  hex ords: e2,82,ac,
</output>

Exactly the same as before. Why? The TYPE argument passed to bind_param
sets the SQL type (the parameter type) and not the value type in the
SQLBindParameter call.

=head3 Reading properly written unicode with a non-unicode built DBD::ODBC

Now what if the unicode Euro was inserted correctly by something else
and we want to read it using a non-unicode built DBD::ODBC?

ex3. Reading unicode from non-unicode built DBD::ODBC

We've got a valid unicode Euro symbol in the database (don't worry about how
for now; this is just showing what happens when the data in the database
is correct but you use the wrong method to get it).

<code>
  use 5.008001;
  use strict;
  use warnings;
  use DBI qw{:utils};

  my $unicode = "\x{20ac}"; # unicode euro symbol
  my $h = DBI->connect or die $DBI::errstr;
  $h->{RaiseError} = 1;

  my $r = $h->selectall_arrayref(q/select a from unicode_test/);
  my $data = $r->[0][0];
  print "DBI describes data as: ", data_string_desc($data), "\n";
  print "Data Length: ", length($data), "\n";
  print "hex ords: ";
  foreach my $c (split(//, $data)) {
      print sprintf("%x,", ord($c));
  }
  print "\n";
</code>

which outputs:

<output>
  DBI describes data as: UTF8 off, non-ASCII, 1 characters 1 bytes
  Data Length: 1
  hex ords: 80,
</output>

To be honest, the data you get back here very much depends on the
ODBC driver and platform. On Windows you'd probably get the above
because 0x80 is the windows-1252 character for a Euro (if it had been
something not in windows-1252, it would probably have returned a
question mark). With some Unix MS SQL Server ODBC drivers you could get
any of the following (and perhaps more):

=over

=item o

0xac (the low byte of 0x20ac)

=item o

0x3f (a question mark, because the driver cannot convert a wide
character to a SQL_C_CHAR)

=item o

0x80 if you set the client character-set to windows-1252

=item o

0xe2, 0x82, 0xac (a UTF-8 encoded Euro)

=back

The point of the illustration is that this is a world of pain and you
don't really want to do any of the above unless you have absolutely no
choice.

You might be saying to yourself: yes, but you can set a type in the
bind_col method so you can control how the data is returned to
you. For just about all the Perl DBDs I know, that is not true; with
DBD::ODBC you can override the default type in a bind_col call,
but only if it is a decimal or a timestamp.

=head2 Using varchar columns instead of nvarchar columns for unicode data

Don't do this.

Generally speaking, you should use nchar/nvarchar when you need to
support multiple languages in the same column, although even that isn't
always necessary e.g., you can support English, German and Italian in
one Windows codepage. A better recommendation is to use nchar/nvarchar
columns for unconstrained, user-provided data, and varchar for
constrained data which you control, like a number plate or a serial
number.

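As a concrete sketch of that advice ($h is a connected handle and the
table and column names here are made up for illustration):

<code>
  # unconstrained, user-supplied text: use nchar/nvarchar
  $h->do(q/create table customers (name nvarchar(100))/);

  # constrained values you control (e.g., a number plate): varchar is fine
  $h->do(q/create table vehicles (plate varchar(10))/);
</code>
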
However, in the spirit of describing why, let's look at some examples.
These examples assume we are now using a DBD::ODBC built using the
Unicode API (see above) and you have a unicode aware ODBC driver.

So we return to our first simple example but now run it with a
unicode built DBD::ODBC:

ex4. Simple insert/select with unicode built DBD::ODBC but using varchar

<code>
  use 5.008001;
  use strict;
  use warnings;
  use DBI qw{:utils :sql_types};

  my $unicode = "\x{20ac}"; # unicode euro symbol
  my $h = DBI->connect or die $DBI::errstr;
  $h->{RaiseError} = 1;

  eval {$h->do(q/drop table unicode_test/)};
  $h->do(q/create table unicode_test (a varchar(20))/);

  my $s = $h->prepare(q/insert into unicode_test (a) values(?)/);
  $s->execute($unicode);
  $s->bind_param(1, undef, {TYPE => SQL_WVARCHAR});
  $s->execute($unicode);

  my $r = $h->selectall_arrayref(q/select a from unicode_test/);
  foreach my $row (@$r) {
      my $data = $row->[0];
      print "DBI describes data as: ", data_string_desc($data), "\n";
      print "Data Length: ", length($data), "\n";
      print "hex ords: ";
      foreach my $c (split(//, $data)) {
          print sprintf("%x,", ord($c));
      }
      print "\n";
  }
</code>

which outputs:

<output>
  DBI describes data as: UTF8 on, non-ASCII, 3 characters 6 bytes
  Data Length: 3
  hex ords: e2,201a,ac,
  DBI describes data as: UTF8 on, non-ASCII, 1 characters 3 bytes
  Data Length: 1
  hex ords: 20ac,
</output>

Here again, you'll get different results depending on platform and
driver.

I imagine this is really going to make you wonder what on earth has
happened here. In Perl, the euro is internally encoded in UTF-8 as
0xe2,0x82,0xac. DBD::ODBC was told the column is SQL_CHAR, but because
this is a unicode build it bound the column as SQL_WCHAR. As far as
MS SQL Server is concerned this is a varchar column, you wanted to
insert 3 characters with codes 0xe2, 0x82 and 0xac, and that is
confirmed as what is in the database when we read the bytes back as
binary data. However, where did the character with code 0x201a come
from? When DBD::ODBC read the data back it bound the column as
SQL_C_WCHAR and hence asked SQL Server to convert the characters in
the varchar column to wide (UCS-2 or UTF-16) characters, and guess
what: character 0x82 in the windows-1252 character-set (which I was
using when running this code) is a curved quote (the single low-9
quotation mark) with unicode value U+201A. 0xe2 and 0xac in
windows-1252 map to the same code points in unicode.

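You can reproduce that mapping in Perl with the core Encode module;
byte 0x82 in windows-1252 decodes to U+201A:

<code>
  use Encode qw(decode);

  print sprintf("%x", ord(decode("cp1252", "\x82"))), "\n";  # prints 201a
</code>
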
In the second row we bound the data as SQL_WCHAR for the insert and
SQL_WCHAR for the select B<and> the character is in windows-1252, so we
got back what we inserted. However, had we tried to insert a character
not in the windows-1252 codepage, SQL Server would have substituted that
character with a '?'.

Here is a Windows specific version of the above test with a few more
bells and whistles:

<code>
  #
  # A simple demonstration of why you cannot use char and varchar columns
  # in MS SQL Server to store Unicode. char and varchar columns use a codepage
  # and unicode characters inserted into them are converted to the codepage.
  # If a conversion does not exist in the codepage, the characters which don't
  # convert will be '?'
  #
  # Show the difference between binding as SQL_CHAR and SQL_WCHAR.
  #
  # See http://msdn.microsoft.com/en-us/library/bb330962.aspx#intlftrql2005_topic2
  #
  use 5.008.001;
  use strict;
  use warnings;
  use DBI qw(:sql_types);
  use Data::Dumper;
  use Win32::API;

  # chcp on my machines normally gives 850
  # http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/chcp.mspx?mfr=true
  #
  # we want windows-1252 so run: chcp 1252 first

  #use open qw( :encoding(Windows-1252) :std );
  binmode STDOUT, ":encoding(cp1252)";

  # get the active codepage and ensure it is cp1252
  # http://stackoverflow.com/questions/1259084/what-encoding-code-page-is-cmd-exe-using
  Win32::API::More->Import("kernel32", "UINT GetConsoleOutputCP()");
  my $cp = GetConsoleOutputCP();
  print "Current active console code page: $cp\n";
  if ($cp != 1252) {
      print "Please change to codepage 1252 - run chcp 1252\n";
      die "Incompatible active codepage - please change to codepage 1252 by running chcp 1252\n";
  }

  my $h = DBI->connect() or die $DBI::errstr;
  $h->{RaiseError} = 1;
  $h->{PrintError} = 1;

  eval {$h->do(q/drop table varchar_test/)};
  $h->do(q/create table varchar_test(a varchar(20) collate Latin1_General_CI_AS)/);

  # note codepage 1252 is 255 chrs including the euro at 0x80
  # windows-1252 does not contain U+0187 but it does contain
  # the euro symbol (U+20AC), the curved quote (U+201A),
  # Latin Small Letter F with hook (U+0192) and dagger (U+2020)
  # mixing code pages in SQL Server is not recommended
  my $insert = $h->prepare(q/insert into varchar_test (a) values(?)/);
  my $data = "\x{20ac}\x{201A}\x{192}\x{2020}\x{187}";
  {
      use bytes;
      print "encoded length of our data is:", length($data), "\n";
      print "encoded data in hex is:";
      foreach my $b (split(//, $data)) {
          print sprintf("%x,", ord($b));
      }
      print "\n";
  }
  # this execute will discover the column is varchar and bind the perl scalar
  # as SQL_CHAR meaning the UTF-8 encoded data in the perl scalar
  # will be inserted as separate characters not all of which will even
  # be translateable to the current codepage.
  $insert->execute($data);
  # Now we force DBD::ODBC to insert the parameter as SQL_WVARCHAR
  $insert->bind_param(1, undef, {TYPE => SQL_WVARCHAR});
  $insert->execute($data);

  print "\nNotice in the first row (which was inserted as SQL_CHAR) the UTF-8 stored in the perl scalar is mostly stored as individual characters, but then you will be wondering why a few of the characters seem to come back as unicode. Windows sees individual bytes in the UTF-8 sequence as characters in the windows-1252 codepage, and the UTF-8 sequence contains some bytes which are windows-1252 characters that map back to unicode chrs e.g., the UTF-8 sequence for the euro is e2,82,ac and Windows sees the 82 as the curved quote in windows-1252, so when you ask for it back as wide/unicode characters it maps it to U+201A (the curved quote unicode code point).\n";
  print "\nNotice how in the second row the last character is a ?. That is because U+0187 does not exist in the windows-1252 codepage our column is using.\n";
  my $r = $h->selectall_arrayref(q/select a from varchar_test/);
  print Dumper($r);
  foreach my $row (@$r) {
      print $row->[0], "\n";
  }
  $h->disconnect;
</code>

=head2 The correct way to do Unicode with DBD::ODBC and SQL Server and why

If you are on Windows, or on Unix and built DBD::ODBC for unicode
support, then your char, varchar, nchar and nvarchar columns should
all be correct. Even when you use char and varchar, which use a
codepage, because DBD::ODBC asks for the data as SQL_WCHAR, SQL Server
will convert any character in the codepage to a unicode codepoint and
DBD::ODBC will encode them as UTF-8 and mark them as unicode to Perl.

Also, when inserting unicode, DBD::ODBC will normally just do the
right thing i.e., use SQL_WCHAR for nchar/nvarchar columns, but if your
column is a char/varchar it cannot know whether you really want to attempt
to insert unicode or just characters/bytes in the SQL Server codepage
(it could look at the perl scalars you are trying to insert and make a
guess based on whether the utf8 flag is on or not, but that is likely
to break quite a lot of older scripts). See the example above for
inserting into varchar columns.
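
Putting that together, here is a minimal round-trip which should work
with a unicode-built DBD::ODBC, a unicode-capable driver and an
nvarchar column (a sketch; the connection details are assumed to come
from DBI_DSN, DBI_USER and DBI_PASS, and the table name is made up):

<code>
  use strict;
  use warnings;
  use DBI qw(:utils);

  my $h = DBI->connect or die $DBI::errstr;
  $h->{RaiseError} = 1;
  die "DBD::ODBC not built for unicode\n" unless $h->{odbc_has_unicode};

  eval {$h->do(q/drop table unicode_ok/)};
  $h->do(q/create table unicode_ok (a nvarchar(20))/);

  # nvarchar column and a unicode build: no explicit bind type is needed
  $h->do(q/insert into unicode_ok (a) values(?)/, undef, "\x{20ac}");

  my $r = $h->selectall_arrayref(q/select a from unicode_ok/);
  print data_string_desc($r->[0][0]), "\n";
  # expect something like: UTF8 on, non-ASCII, 1 characters 3 bytes
</code>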

=head3 Surrogate pairs (or unicode code points above U+FFFF)

ODBC supports Unicode in the form of SQL_WCHAR types and Unicode
versions of the ODBC API. The encoding form that ODBC expects for data
used with the Unicode API functions is UCS-2 (or at least that is how it
looked when Windows and MS SQL Server were using UCS-2).

In anticipation of things changing, when unicode support was added to
DBD::ODBC the internal code actually converts Perl UTF-8 encoded
strings into UTF-16, which for characters in the Basic Multilingual
Plane is identical to UCS-2. As there were no unicode drivers
supporting supplementary characters (above U+FFFF) at the time, this
could not be proven to be a good decision; however, at worst it meant
code using unicode outside the Basic Multilingual Plane would just
insert what looked to the database like more characters.

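In Perl itself a supplementary character is still one character; it is
the conversion to UTF-16 that produces a surrogate pair. A pure-Perl
sketch:

<code>
  use strict;
  use warnings;
  use Encode qw(encode);

  my $c = "\x{2070E}";           # a supplementary character
  print length($c), "\n";        # prints 1

  # UTF-16BE needs a surrogate pair (4 bytes) for it:
  print join(",", map { sprintf "%02x", ord } split //, encode("UTF-16BE", $c)), "\n";
  # prints d8,41,df,0e
</code>
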
=head4 Older versions of MS SQL Server and surrogate pairs

As it turned out, the decision in DBD::ODBC to use UTF-16 was exactly
what MS initially did, and versions of MS SQL Server like 2000, 2005
and 2008 are described as surrogate neutral rather than surrogate
aware. MS had this to say at the time:

I<Both SQL Server 2000 and SQL Server 2005 can store surrogate pairs,
even though UCS-2 is not aware of surrogates. SQL Server treats the
surrogate pairs as two undefined Unicode characters rather than as a
single character. Such applications are usually referred to as
surrogate-neutral or surrogate-safe, meaning that there is no
intrinsic ability to interact with the data, but at least the data can
be stored without loss.>

However, there are a few things you should be aware of when using
these older MS SQL Server versions that are only surrogate-neutral:

o for each surrogate pair inserted you'll need 1 extra character in the column e.g., inserting 3 surrogate pairs
  into an nvarchar requires an nvarchar(6), not an nvarchar(3).

o string operations are not aware of supplementary characters. So, watch out if you are
  using substring or len functions in SQL.

o sorting and searching behaviour for supplementary characters may
  change depending on the collation.

=head4 Newer versions of MS SQL Server and surrogate pairs

Newer versions of SQL Server (2012 and later, version >= 11) support surrogate pairs,
but you must set the collation to one ending in "_SC" e.g., Latin1_General_100_CI_AS_SC. When
you do this, string functions will recognise surrogate pairs and all of the problems listed above
for older SQL Servers are fixed.

=head4 Is my SQL Server surrogate-neutral or surrogate-aware?

Here is a small script you can use to test whether your SQL Server is surrogate-neutral
or surrogate-aware:

<code>
  # Test to see if your SQL Server is surrogate-aware or just surrogate-neutral
  use 5.008.001;
  use strict;
  use warnings;
  use DBI qw(:utils);

  my $h = DBI->connect() or die $DBI::errstr;
  $h->{RaiseError} = 1;
  $h->{PrintError} = 0;

  eval {$h->do(q/drop table surrogate_pairs/)};

  # It is possible to set the collation at instance or database level.
  # Set it on the column to make sure that, initially, we are using a
  # non supplementary character collation.
  $h->do(q/create table surrogate_pairs (a nvarchar(20) collate Latin1_General_100_CI_AI)/);

  my $insert = $h->prepare(q/insert into surrogate_pairs values(?)/);

  # Insert test supplementary character
  print "Inserting unicode character U+2070E into db\n";
  $insert->execute("\x{2070E}");

  # now read it back and see what we get
  print "\nNote when we select this character back it is still 1 unicode character and 4 bytes and the ord is correct at 0x2070E. This is because DBD::ODBC received a buffer of SQL_WCHAR chrs back from SQL Server which it then decoded as UTF-16, which recognises the surrogate pair. This is why SQL Servers using this collation (or older SQL Servers) are known as surrogate-neutral.\n";
  my $r = $h->selectrow_arrayref(q/select a from surrogate_pairs/);
  print data_string_desc($r->[0]), "\n";
  print "ord(chr): ", sprintf("0x%x", ord($r->[0])), "\n";

  # This is a non _SC collation, so the len function returns 2.
  print "\nNote in the following that len(a) returns 2, not 1, as SQL Server has not recognised this as a surrogate pair.\n";
  $r = $h->selectrow_arrayref(q/select len(a) from surrogate_pairs/);
  print "length in database is: ", $r->[0], "\n";

  # now try and alter the table to change the collation to Latin1_General_100_CI_AS_SC
  # which only later SQL Servers (>= version 11, i.e., 2012) can do.
  # Unfortunately older SQL Servers don't error if you try and change the collation
  # to one they do not support, so we cannot test by just trying to change to a
  # surrogate-aware collation.
  $h->do(q/alter table surrogate_pairs alter column a nchar(20) collate Latin1_General_100_CI_AS_SC/);

  $r = $h->selectrow_arrayref(q/SELECT SERVERPROPERTY('ProductVersion')/);
  my $version = $r->[0];
  print "\nYour SQL Server is version: $version\n\n";

  if ((split(/\./, $version))[0] >= 11) {
      print "Your SQL Server is surrogate-aware\n";
      $r = $h->selectrow_arrayref(q/select a from surrogate_pairs/);
      print data_string_desc($r->[0]), "\n";
      print "ord(chr): ", sprintf("0x%x", ord($r->[0])), "\n";

      print "\nNote in the following that len(a) returns 1 as SQL Server in this collation recognises surrogate pairs\n";
      $r = $h->selectrow_arrayref(q/select len(a) from surrogate_pairs/);
      print "length in database is: ", $r->[0], "\n";
  } else {
      print "Your SQL Server is surrogate-neutral but not surrogate-aware\n";
  }
  $h->disconnect;
</code>

=head2 References

How do I insert Unicode supplementary characters into SQL Server from Perl?
http://www.easysoft.com/support/kb/kb01043.html

List of Unicode characters
https://en.wikipedia.org/wiki/List_of_Unicode_characters#Latin-1_Supplement

Windows-1252
https://en.wikipedia.org/wiki/Windows-1252

International Features in Microsoft SQL Server 2005
http://msdn.microsoft.com/en-us/library/bb330962.aspx#intlftrql2005_topic2