UTF-8中定义了一些组合字符,这些字符会与它前面的非组合字符组合显示成一个字符,一般用它来添加加重或者变音标记.
同时呢,某些常用的加重字符也会有自己的单一编码值,这些字符叫做预组合字符(precomposed characters).
这就带来一个很恐怖的后果,某些UTF-8的字符可能有两种表示方法! 例如单词“naïve”可以写作这6个字符”nai\u0308ve”,也可能写作5个字符”na\u00EFve”. 这样一来,在程序中处理这类字符时就会出现一些很诡异的结果:
例如下面这段python代码
import re
s1 = "nai\u0308ve"
s2 = "na\u00EFve"
if s1 == s2:
print(s1,"is equal to",s2)
else:
print(s1,"is not equal to",s2)
regexp = '^.....$'
if re.match(regexp,s1):
print(regexp,"is matching",s1)
else:
print(regexp,"is not matching",s1)
if re.match(regexp,s2):
print(regexp,"is matching",s2)
else:
print(regexp,"is not matching",s2)
print("length of",s1,"is",len(s1))
print("length of",s2,"is",len(s2))
结果为:
naïve is not equal to naïve
^.....$ is not matching naïve
^.....$ is matching naïve
length of naïve is 6
length of naïve is 5
解决方法是用unicodedata库中的normalize函数来对字符串进行归一化(normalization)
import re
from unicodedata import normalize
s1 = normalize('NFC',"nai\u0308ve")
s2 = normalize('NFC',"na\u00EFve")
if s1 == s2:
print(s1,"is equal to",s2)
else:
print(s1,"is not equal to",s2)
regexp = '^.....$'
if re.match(regexp,s1):
print(regexp,"is matching",s1)
else:
print(regexp,"is not matching",s1)
if re.match(regexp,s2):
print(regexp,"is matching",s2)
else:
print(regexp,"is not matching",s2)
print("length of",s1,"is",len(s1))
print("length of",s2,"is",len(s2))
其结果为
naïve is equal to naïve
^.....$ is matching naïve
^.....$ is matching naïve
length of naïve is 5
length of naïve is 5