## Data file formats: file encodings

There are several ways we can explore the encoding used for a file: Unix commands and Python libraries are available.

In [None]:
# We can run the Unix command line 'file' tool to see if there are any clues about the encoding
!file 'data/iwCouncilSpending/PUBLISHED FORMAT - NOV 2013.csv'

In Python there is a `chardet` library that does the same sort of thing.

In [None]:
import chardet

# Open the test file in binary mode and read the contents in as a bytes object
with open('data/iwCouncilSpending/PUBLISHED FORMAT - NOV 2013.csv', 'rb') as f:
    testfile = f.read()

# Detect the file encoding (this may take some time)
chardet.detect(testfile)

Note that a confidence score is reported alongside the encoding. `chardet` may not always get it right, but you do get a measure of how sure it is about the encoding it reports. (Note: `windows-1252`, which was reported when I ran `chardet.detect` on the `PUBLISHED FORMAT - NOV 2013` file, is closely related to, though not identical with, `ISO-8859-1`: a similar encoding name may be reported as the file encoding in place of `windows-1252`.)
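
Because a detected encoding is only a guess, one simple cross-check is to try decoding the raw bytes strictly with each candidate in turn. The sketch below uses only the standard library; the function name `first_strict_decoding`, the candidate list and the sample bytes are illustrative assumptions, not part of `chardet`.

```python
def first_strict_decoding(raw, candidates=("utf-8", "windows-1252")):
    """Return the first candidate encoding that decodes raw without error, else None."""
    for encoding in candidates:
        try:
            raw.decode(encoding)  # decoding is strict by default
        except UnicodeDecodeError:
            continue
        return encoding
    return None

# Hypothetical sample: a pound sign stored as the single byte 0xA3
sample = "Amount: £1,200".encode("windows-1252")
print(first_strict_decoding(sample))
```

A successful decode does not prove the encoding is correct: single-byte encodings accept almost any byte sequence, so the decoded text still needs a visual check.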

It is always worth quickly looking over the dataset (for example, using the command-line `head` command) to see what looks reasonable (though of course a single rogue character in the dataset may cause problems when it comes to reading the file).

If you want a quick look at the file, `head PATH/TO/FILENAME` will display the first 10 lines of the file (if it can).

In [None]:
!head 'data/iwCouncilSpending/PUBLISHED FORMAT - NOV 2013.csv'
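
If the `head` command is not available (for example, on some Windows setups), a similar preview can be sketched in Python by reading the first few lines in binary mode, which avoids committing to an encoding. The helper name `head_bytes` and the temporary file contents below are hypothetical, used only to keep the example self-contained.

```python
import os
import tempfile
from itertools import islice

def head_bytes(path, n=10):
    """Return the first n lines of a file as raw bytes objects."""
    with open(path, "rb") as f:
        return list(islice(f, n))

# Write a small, hypothetical CSV file to preview
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write("Supplier,Amount\nAcme,£100\nBolt,£250\n".encode("windows-1252"))

preview = head_bytes(tmp.name, n=2)
os.remove(tmp.name)  # tidy up the temporary file
print(preview)
```

Bytes that are not plain ASCII show up as escape sequences such as `\xa3`, which is itself a useful clue about the encoding.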

Now that we have a reasonable guess at the file encoding, we can use that to open this CSV file in *pandas*:

In [None]:
import pandas as pd

# Read the file as a .csv file with the encoding given...
df = pd.read_csv('data/iwCouncilSpending/PUBLISHED FORMAT - NOV 2013.csv', encoding='windows-1252')

# ... and show us the first three rows:
df[:3]

Note that if we don't specify the encoding, *pandas* assumes UTF-8 and might not be able to decode the file content:

In [None]:
df = pd.read_csv('data/iwCouncilSpending/PUBLISHED FORMAT - NOV 2013.csv')
df[:3]
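
The same kind of failure can be reproduced without *pandas*. The following minimal sketch, using a made-up string rather than the council spending file, shows a `windows-1252` byte that is not valid UTF-8 raising a `UnicodeDecodeError`, and the same bytes decoding cleanly once the right encoding is named.

```python
# Hypothetical sample text, encoded as windows-1252
raw = "Total: £250.00".encode("windows-1252")

failed_at = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    failed_at = err.start  # byte offset of the first undecodable byte
    print("UTF-8 decoding failed at byte offset", failed_at)

# Naming the encoding that was actually used recovers the text
decoded = raw.decode("windows-1252")
print(decoded)
```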

Some file formats might explicitly define the encoding used within a file. You can't always trust these declarations, but they may provide a clue as to what the file encoding was intended to be!

For example, preview the first few lines of the `data/CarParks.kml` document, which is an XML document type:

In [None]:
# Use the -n parameter to specify how many lines to preview from the head of the file
! head -n 3 data/CarParks.kml

The `open()` function takes an `encoding` parameter (for example, `encoding='utf-8'`) that allows you to specify the encoding of the file to be opened and read.
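
As a minimal sketch of the `encoding` parameter in use, the example below writes a small temporary file (hypothetical content, not one of the module data files) and reads it back. It also shows the related `errors` parameter of `open()`, which controls what happens to bytes that cannot be decoded.

```python
import os
import tempfile

text = "Café parking, £2 per hour"

# Write a hypothetical file using an explicit encoding
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write(text)

# Reading with the same encoding recovers the original text
with open(tmp.name, encoding="utf-8") as f:
    roundtrip = f.read()

# errors='replace' substitutes U+FFFD for each undecodable byte instead of raising
with open(tmp.name, encoding="ascii", errors="replace") as f:
    lossy = f.read()

os.remove(tmp.name)  # tidy up the temporary file
print(roundtrip == text)
print(lossy)
```

Using `errors='replace'` is handy for a first look at a problem file, but it loses information, so it is not a substitute for finding the correct encoding.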

If we deliberately try to open this document with an encoding other than the declared one, such as ASCII, reading fails with an error as soon as it meets a byte that is not valid in that encoding:

In [None]:
with open('data/CarParks.kml', encoding='ascii') as f:
    kmlcontent = f.read()

kmlcontent[:100]

We can generally open a file as a readable bytestream and use `chardet` to try to detect or confirm the file encoding:

In [None]:
with open('data/CarParks.kml', 'rb') as f:
    testfile = f.read()

# Detect the file encoding (this may take some time)
chardet.detect(testfile)

If the default encoding doesn't work (or even if it does), we can explicitly declare the appropriate file encoding when we try to open a file as a text file:

In [None]:
with open('data/CarParks.kml', encoding='utf-8') as f:
    kmlcontent = f.read()

kmlcontent[:100]

Declaring the encoding inside the file itself is quite common, with many standard file formats using the first few bytes of the file to carry details of the file's own encoding. (Another reason a quick look at the first few lines of a file can be rewarding!)
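
For XML-based formats such as KML, the declared encoding can be pulled out of the first few bytes with a regular expression. This is only a rough sketch: the function name `declared_xml_encoding` and the sample bytes are assumptions made for illustration, and the pattern only works when the declaration itself is stored in an ASCII-compatible encoding.

```python
import re

# Matches the encoding="..." attribute inside an XML declaration
XML_DECL_ENCODING = re.compile(
    rb"""<\?xml[^>]*?encoding=["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)

def declared_xml_encoding(first_bytes):
    """Return the encoding named in an XML declaration, or None if absent."""
    match = XML_DECL_ENCODING.search(first_bytes)
    return match.group(1).decode("ascii") if match else None

# Hypothetical opening bytes of an XML document
sample = b'<?xml version="1.0" encoding="UTF-8"?>\n<kml>'
print(declared_xml_encoding(sample))
```

The returned name is what the file claims, which may still need checking against what the bytes actually contain.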

## Character encodings

In Python 3, strings are sequences of Unicode characters. We can inspect the Unicode values (code points) of individual characters using the built-in `ord()` function, and look up their names using functions from the `unicodedata` module.

In [None]:
import unicodedata

chars = "abc 123 é © 草"
for char in chars:
    # Print the character, followed by its Unicode value in hexadecimal, then decimal, 
    #     followed by its Unicode name.
    print(char,'%04x' % ord(char),'%d' % ord(char), unicodedata.name(char))

Character encodings, such as ASCII or UTF-8, define how individual characters map onto sequences of integer byte values.
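
To make the distinction concrete, the short sketch below compares the Unicode code point of a single character with the byte values produced by two different encodings. The choice of `é` and of the `latin-1` and `utf-8` encodings is simply illustrative.

```python
char = "é"

code_point = ord(char)                  # the Unicode code point, as an integer
latin1_bytes = char.encode("latin-1")   # one byte in Latin-1
utf8_bytes = char.encode("utf-8")       # two bytes in UTF-8

print(code_point, list(latin1_bytes), list(utf8_bytes))
```

The character is the same in each case; only the byte sequence used to store it changes with the encoding.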

As a standard that is expressive enough to define characters and symbols from across a wide range of languages, Unicode can be expensive to use if we know we only want to encode a small range of characters, such as the characters included in the ASCII scheme: A-Z, a-z, 0-9 and a few punctuation and control characters. ASCII codes are 7 bits wide (in practice stored one per 8-bit byte), compared to the 32 bits per character used by the fixed-width UTF-32 encoding of Unicode. If all you want to do is store English text strings, with no accented characters, it makes sense in terms of minimising memory requirements to encode the characters using the leaner ASCII coding scheme. (UTF-8 encodes the ASCII characters using the same single-byte values as ASCII, so it is just as compact for such text.)
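
The storage cost can be checked directly by counting the bytes each encoding produces for the same text. In this sketch the sample string is arbitrary, and the `-le` (little-endian) variants of UTF-16 and UTF-32 are used so that no byte order mark is added to the count.

```python
text = "Hello, world"  # 12 ASCII characters

sizes = {
    enc: len(text.encode(enc))
    for enc in ("ascii", "utf-8", "utf-16-le", "utf-32-le")
}
print(sizes)

# A non-ASCII character costs more than one byte in UTF-8
cjk_sizes = {enc: len("草".encode(enc)) for enc in ("utf-8", "utf-32-le")}
print(cjk_sizes)
```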

We can look at the byte-encoded values of a string according to a specified character encoding using the `str.encode()` method (the default character encoding is UTF-8):

In [None]:
print(chars.encode(),'\n',chars.encode('utf-8'))

Note: The `b` at the start of each of the quoted sequences in the output from the above cell indicates that this is a bytestream, not a string object.

What happens if we try to encode characters that are out of range of the character encoding? For example, what happens if we try to encode our `chars` string using an `ascii` encoding?

In [None]:
chars.encode('ascii')

The encoding fails when it encounters the é character, which is not representable by the ASCII encoding scheme.
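
If raising an error is not the behaviour you want, `str.encode()` accepts an `errors` argument that selects a different strategy for unencodable characters. The sketch below repeats the definition of `chars` so that it runs on its own, and tries three of the standard error handlers.

```python
chars = "abc 123 é © 草"

# Drop any character that ASCII cannot represent
ignored = chars.encode("ascii", errors="ignore")

# Substitute a question mark for each unencodable character
replaced = chars.encode("ascii", errors="replace")

# Substitute an XML/HTML numeric character reference
xml_refs = chars.encode("ascii", errors="xmlcharrefreplace")

print(ignored)
print(replaced)
print(xml_refs)
```

Only the last of these is reversible in principle, because the character references still record which characters were present.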

The operation complementary to encoding strings into byte strings or byte streams is decoding byte streams into Unicode character strings. Once again, the default encoding is UTF-8.

Byte strings can be decoded using the `bytes.decode()` method:

In [None]:
bytes.decode(b'\xe8\x8d\x89')

or by writing a bytes literal and calling its `decode()` method directly:

In [None]:
b'\xe8\x8d\x89'.decode()
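
Decoding with the wrong encoding does not always raise an error: it can silently produce garbled text instead. The following sketch, using an arbitrary sample string, decodes UTF-8 bytes as `latin-1` (which accepts every byte value) and then shows that, in this particular case, the damage can be undone by reversing the mistaken step.

```python
original = "café 草"
utf8_bytes = original.encode("utf-8")

# Wrong decoder: no error is raised, but the text is garbled
garbled = utf8_bytes.decode("latin-1")

# Reverse the mistake: re-encode as latin-1, then decode as UTF-8
restored = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))
print(restored)
```

This is why it is worth reading through decoded text even when no exception was raised.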

## Summary

In this Notebook you've seen how to:
1. detect the encoding used for a file 
2. use the encoding to open files correctly
3. use the `unicodedata` module to inspect individual characters, and `str.encode()` and `bytes.decode()` to convert between strings and bytestreams using a range of encodings.

## What next?

If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to look at `02.2.1 Data file formats - CSV`. 