# Character Encodings

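A character encoding is a set of rules that maps characters to binary codes. Python supports many character encodings; see [this link](https://docs.python.org/3/library/codecs.html#standard-encodings) for the list. Because the internet originated in the English-speaking world, the early encoding rules mapped binary codes to the letters of the English alphabet.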

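The English alphabet has only 26 letters. Other languages use many additional characters, such as letters with accents, tildes, and umlauts. Over time, more and more character encodings appeared to handle languages other than English. The UTF-8 standard, an encoding of the Unicode character set, tries to provide a single encoding scheme that covers all characters.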
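
To make this concrete, the minimal sketch below shows one character turning into different bytes under different encodings. The sample character and the variable names are illustrative assumptions and do not come from the exercise data.

```python
# Illustrative only: the sample character is an assumption, not taken from the exercise data.
char = "é"

utf8_bytes = char.encode("utf-8")         # two bytes in UTF-8
latin1_bytes = char.encode("latin-1")     # one byte in Latin-1
utf16le_bytes = char.encode("utf-16-le")  # two bytes, little-endian, no byte order mark

print(utf8_bytes, latin1_bytes, utf16le_bytes)

# ASCII covers only 128 characters, so it cannot represent "é" at all.
try:
    char.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print("ASCII can encode it:", ascii_ok)
```

The same bytes read back with the wrong encoding either raise an error or silently produce different characters, which is why knowing the encoding matters.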

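The problem is that, unless someone tells you, it is hard to know which character encoding was used to create a file. The most common encoding today is UTF-8, and pandas assumes files are UTF-8 encoded by default when reading and writing.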

Run the code in the cell below to read in the population data set.

In [None]:
import pandas as pd
df = pd.read_csv('../data/population_data.csv', skiprows=4)

pandas should be able to read this data set without any problem. Next, run the code in the cell below to read in the 'mystery.csv' file.

In [None]:
import pandas as pd
df = pd.read_csv('mystery.csv')

You should see an error: **UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte**. This means pandas assumed the file was UTF-8 encoded, but the file's bytes are not valid UTF-8, so decoding failed.
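
The same error can be reproduced without pandas, which may help when reading the message. This is a minimal sketch on a made-up byte string: `raw` is hypothetical and is not the contents of mystery.csv.

```python
# Hypothetical bytes for illustration; 0xff can never start a valid UTF-8 sequence.
raw = b"\xffabc"

try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The exception records which codec failed, at which byte offset, and why.
print(error.encoding, error.start, hex(raw[error.start]), error.reason)
```

The offset and byte value in the message point at the first byte the codec could not interpret, which is often a useful clue about the real encoding.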

In the next cell, your task is to figure out the encoding of the mystery.csv file.

In [None]:
# TODO: Figure out what the encoding of the mystery.csv file is
# HINT: pd.read_csv('mystery.csv', encoding=?) where ? is the string for an encoding like 'ascii'
# HINT: This link has a list of encodings that Python recognizes https://docs.python.org/3/library/codecs.html#standard-encodings

# Python has a file containing a dictionary of encoding names and associated aliases
# This line imports the dictionary and then creates a set of all available encodings
# You can use this set of encodings to search for the correct encoding
# If you'd like to see what this file looks like, execute the following Python code to see where the file is located
#    from encodings import aliases
#    aliases.__file__

from encodings.aliases import aliases

alias_values = set(aliases.values())

# TODO: iterate through the alias_values set, trying out the different encodings to see which one or ones work
# HINT: Use a try - except statement. Otherwise your code will produce an error when reading in the csv file
#       with the wrong encoding.
# HINT: In the try statement, print out the encoding name so that you know which one(s) worked.
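
One possible shape for this search is sketched below. To stay self-contained it runs on an in-memory byte string instead of mystery.csv and uses `bytes.decode` instead of `pd.read_csv`; the sample text, the encoding chosen for the sample, and the variable names are assumptions for illustration, not the exercise solution. For the exercise itself, the same try-except pattern would wrap `pd.read_csv('mystery.csv', encoding=...)`.

```python
from encodings.aliases import aliases

# Assumed sample data standing in for an unknown file; not the contents of mystery.csv.
original_text = "name,value\ncafé,1\n"
raw_bytes = original_text.encode("utf-32")

decodable = []      # encodings that decode without raising an error
exact_matches = []  # encodings that also reproduce the original text

for encoding in sorted(set(aliases.values())):
    try:
        decoded = raw_bytes.decode(encoding)
    except (UnicodeError, LookupError):
        # UnicodeError: the bytes are invalid in this encoding.
        # LookupError: not a text encoding, or not available on this platform.
        continue
    decodable.append(encoding)
    if decoded == original_text:
        exact_matches.append(encoding)

print(len(decodable), "encodings decoded without an error")
print("Encodings that reproduced the text:", exact_matches)
```

Decoding without an error is weaker evidence than it looks: single-byte encodings such as `latin_1` accept any byte sequence, so the decoded output still needs a visual check.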


# Conclusion

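Python can handle dozens of character encodings, but pandas assumes files are UTF-8 encoded by default. That makes sense because UTF-8 is so common. Sometimes, however, you will come across files in other encodings. If you do not know what the encoding is, you have to search for it first.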

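Note that, as usual, this exercise comes with a solution file. Go to File -> Open to find it.

When you cannot work out which character encoding a file uses, there is a Python library that can help: chardet. Run the code in the cells below to see what it does.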


In [3]:
# install the chardet library
!pip install chardet



In [4]:
# import the chardet library
import chardet 

# use the detect method to find the encoding
# 'rb' means read in the file as binary
with open("mystery.csv", 'rb') as file:
    print(chardet.detect(file.read()))

{'encoding': 'UTF-16', 'confidence': 1.0, 'language': ''}
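
A confidence of 1.0 is plausible here because UTF-16 files often begin with a byte order mark (BOM), and the `0xff` byte in the earlier error message is consistent with one. The sketch below shows a BOM check using only the standard library. The helper name `sniff_bom` and the sample bytes are hypothetical, and the sketch is not meant to reproduce chardet's full detection logic.

```python
import codecs

# Hypothetical helper table: maps a leading byte order mark to an encoding name.
# Longer marks are checked first because the UTF-32 LE mark starts with the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_bom(data):
    """Return an encoding name if data starts with a known BOM, else None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

# Assumed sample bytes; for a real file, pass the first few bytes read in 'rb' mode.
sample = codecs.BOM_UTF16_LE + "a,b\n1,2\n".encode("utf-16-le")
print(sniff_bom(sample))
print(sniff_bom(b"a,b\n1,2\n"))
```

Files without a BOM return `None` from this check, which is where a statistical detector such as chardet becomes useful.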
