## Data file formats: file encodings

There are several ways we can explore the encoding used for a file: Unix commands and Python libraries are available.

In [1]:
# We can run the Unix command line 'file' tool to see if there are any clues about the encoding
!file 'data/iwCouncilSpending/PUBLISHED FORMAT - NOV 2013.csv'

data/iwCouncilSpending/PUBLISHED FORMAT - NOV 2013.csv: ASCII text, with CRLF line terminators


In Python there is a _chardet_ library. At the time this notebook was written it had not been made to work with Python 3 (current releases do support Python 3), so we installed a Linux package on the virtual machine, `chardet`, that does the same job. `chardet` has a command-line tool, `chardetect`, that examines file encodings.

The Unix `whereis` command locates the executable for a named command and reports the directory it lives in. In this case we want to know where the `chardetect` executable lives.

In [2]:
# The chardet library has a simple command-line tool called 'chardetect' that 
#    tries to sniff file encodings:
!whereis chardetect

chardetect: /usr/local/bin/chardetect


Now we know where it lives, we can call it directly (this tool sometimes takes a while to complete).

In [3]:
# We can call this tool with a filename to see if it can detect the encoding for us:
!/usr/local/bin/chardetect 'data/iwCouncilSpending/PUBLISHED FORMAT - NOV 2013.csv'

data/iwCouncilSpending/PUBLISHED FORMAT - NOV 2013.csv: windows-1252 with confidence 0.73


Note that a confidence score is reported. `chardetect` may not always get it right, but you do get a measure of how sure it is about the encoding it reports. (Note: `windows-1252` was reported when I ran `chardetect` on the `PUBLISHED FORMAT - NOV 2013` file. It is a single-byte encoding closely related to `ISO-8859-1` (Latin-1), and single-byte encodings of this kind are hard to tell apart, so you may see a different one, such as `ISO-8859-2`, reported as the file encoding in place of the `windows-1252` above.)
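Because a detector can only guess, one complementary check is to try decoding the raw bytes with a few candidate encodings and see which ones succeed. The cell below is a minimal sketch and is not part of `chardet`: the helper name `try_encodings` and the sample bytes are made up for illustration. Bear in mind that a decode that succeeds is not necessarily the right one, because a single-byte encoding such as `latin-1` will accept any byte sequence.

```python
def try_encodings(raw, candidates=('ascii', 'utf-8', 'windows-1252', 'latin-1')):
    """Map each candidate encoding to the decoded text, or None if decoding fails."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results

# Hypothetical sample: a curly apostrophe stored as the single byte 0x92
sample = b'Children\x92s Services'
for name, decoded in try_encodings(sample).items():
    print(name, '->', repr(decoded))
```

For a real file you could pass in the result of `open(path, 'rb').read()`, then inspect the decoded text by eye to decide which candidate looks right.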

It is always worth quickly looking over the dataset (for example, using the command-line `head` command) to see what looks reasonable (though of course a single rogue character in the dataset may cause problems when it comes to reading the file).

In [4]:
# If you want a quick look at the file, 'head filename' will display the first 
# 10 lines of the file (if it can).
!head 'data/iwCouncilSpending/PUBLISHED FORMAT - NOV 2013.csv'

Capital or Revenue,Directorate,Transaction Number,Date,Service Area,Expenses Type,Amount,Supplier Name
Revenue,Community Wellbeing & Social Care,5105636098,13.11.2013,Public Libraries Central,Marketing Costs,200.00,REDACTED PERSONAL DATA
Revenue,Community Wellbeing & Social Care,5105635705,08.11.2013,Drug Misuse - Adults,Charges from Independent Providers,120.00,REDACTED PERSONAL DATA
Revenue,Childrens Services,5105637261,20.11.2013,Thompson House Tuition Centre (PRU),Professional Services,240.00,* M BOWDERY T/A SPOTLIGHT BOUTIQUE
Revenue,Community Wellbeing & Social Care,5105637069,27.11.2013,Safeguarding Adults,Professional Services,"5,285.00",REDACTED PERSONAL DATA
Revenue,Community Wellbeing & Social Care,5105637605,22.11.2013,Leaseholds by LA,Accommodation Costs - Leaseholder Payments,695.89,REDACTED PERSONAL DATA
Revenue,Community Wellbeing & Social Care,5105637605,22.11.2013,Leaseholds by LA,Accommodation Costs - Leaseholder Payments,695.89,REDACTED PERSONAL DATA
R

Now that we have a reasonable guess at the file encoding, we can use that to open this CSV file in *pandas*:

In [5]:
import pandas as pd
# Read the file as a .csv file with the encoding given...
df = pd.read_csv('data/iwCouncilSpending/PUBLISHED FORMAT - NOV 2013.csv', encoding = 'windows-1252')
# ... and show us the first three rows:
df[:3]

Unnamed: 0,Capital or Revenue,Directorate,Transaction Number,Date,Service Area,Expenses Type,Amount,Supplier Name
0,Revenue,Community Wellbeing & Social Care,5105636098,13.11.2013,Public Libraries Central,Marketing Costs,200.0,REDACTED PERSONAL DATA
1,Revenue,Community Wellbeing & Social Care,5105635705,08.11.2013,Drug Misuse - Adults,Charges from Independent Providers,120.0,REDACTED PERSONAL DATA
2,Revenue,Childrens Services,5105637261,20.11.2013,Thompson House Tuition Centre (PRU),Professional Services,240.0,* M BOWDERY T/A SPOTLIGHT BOUTIQUE


Note that if we don't specify the encoding, *pandas* assumes the file is UTF-8 encoded and can't decode the file content:

In [6]:
df=pd.read_csv('data/iwCouncilSpending/PUBLISHED FORMAT - NOV 2013.csv')
df[:3]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 6: invalid start byte
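An error message like this names the offending byte and its position. If you want to see where else a file strays outside the ASCII range, you can scan its raw bytes. The following is only a sketch: `non_ascii_bytes` is a hypothetical helper and the sample bytes are invented, not taken from the council spending file.

```python
def non_ascii_bytes(raw, limit=5):
    """Return up to `limit` (offset, hex value) pairs for bytes outside the ASCII range."""
    found = []
    for offset, value in enumerate(raw):
        # Iterating over a bytes object yields integers in the range 0-255
        if value > 0x7F:
            found.append((offset, hex(value)))
            if len(found) == limit:
                break
    return found

# Invented sample containing the bytes 0x92 and 0xe9
sample = b'Children\x92s Services, caf\xe9'
print(non_ascii_bytes(sample))
```

To apply this to a file, read it in binary mode first, for example with `open(path, 'rb').read()`.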

### Exercise
What happens if you try to read the file `data/CarParks.kml` using the Python `open()` file function?

In [7]:
with open('data/CarParks.kml') as f:
    kmlcontent = f.read()
kmlcontent[:100]

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 42785: ordinal not in range(128)

The `open()` command can take a parameter `encoding='string'` that allows you to specify the encoding of the file to be opened and read. 
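As an illustration of the `encoding` parameter that does not depend on the course data files, the sketch below writes a short string to a temporary file as `windows-1252` and reads it back twice. The sample text is invented. The second read also uses the `errors='replace'` option of `open()`, which substitutes the replacement character U+FFFD for undecodable bytes instead of raising an error.

```python
import os
import tempfile

text = 'Café parking: £2 per hour'

# Write the text to a temporary file as windows-1252 bytes
with tempfile.NamedTemporaryFile('w', encoding='windows-1252',
                                 suffix='.txt', delete=False) as tmp:
    tmp.write(text)
    path = tmp.name

# Reading with the matching encoding recovers the original text
with open(path, encoding='windows-1252') as f:
    recovered = f.read()

# Reading as UTF-8 would raise UnicodeDecodeError on these bytes;
# errors='replace' swaps in U+FFFD for each undecodable byte instead
with open(path, encoding='utf-8', errors='replace') as f:
    lossy = f.read()

os.remove(path)
print(recovered)
print(lossy)
```

Using `errors='replace'` loses information, so it is better suited to a quick look at a file than to loading data you intend to analyse.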

In the next two cells use `chardetect` to detect the file encoding and then see if you can open the file using this encoding.

In [8]:
!chardetect 'data/CarParks.kml'


data/CarParks.kml: utf-8 with confidence 0.99


In [9]:
# Add the appropriate file encoding to the open statement.
with open('data/CarParks.kml', encoding='utf-8') as f:
    kmlcontent = f.read()
kmlcontent[:100]

'<?xml version="1.0" encoding="UTF-8"?>\n<kml xmlns="http://earth.google.com/kml/2.2">\n<Document>\n  <n'

### Sample solutions

In [10]:
# Sample Solution
# Use chardetect to detect the file encoding.
!/usr/local/bin/chardetect 'data/CarParks.kml'

data/CarParks.kml: utf-8 with confidence 0.99


In [11]:
# Sample solution
# Add the appropriate file encoding to the open statement.
with open('data/CarParks.kml', encoding='utf-8') as f:
    kmlcontent = f.read()
kmlcontent[:100]

'<?xml version="1.0" encoding="UTF-8"?>\n<kml xmlns="http://earth.google.com/kml/2.2">\n<Document>\n  <n'

Does the first line of the file also give you any clues as to what the encoding may be?  

In [12]:
!head 'data/CarParks.kml'

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.2">
<Document>
  <name>Car Parks</name>
  <description><![CDATA[View all the Island car parks managed by the local authority]]></description>
  <Style id="style9">
    <IconStyle>
      <Icon>
        <href>http://www.iwight.com/images/pins/longstay.png</href>
      </Icon>


This is quite common, with many standard file formats using the first few bytes of the file to carry details of the file's own encoding. (Another reason a quick look at the first few lines of a file can be rewarding!)
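For XML-based formats such as KML, the declared encoding can be pulled out of the first few bytes without decoding the whole file. This is a rough sketch only: `declared_xml_encoding` is a hypothetical helper, and it assumes the declaration sits at the very start of the data and is written in an ASCII-compatible encoding (so it would not cope with a byte order mark or a UTF-16 file).

```python
import re

def declared_xml_encoding(raw):
    """Return the encoding named in an XML declaration at the start of `raw`, or None."""
    match = re.match(rb'<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']', raw)
    return match.group(1).decode('ascii') if match else None

# The opening bytes of the KML file shown above
first_bytes = b'<?xml version="1.0" encoding="UTF-8"?>\n<kml xmlns="http://earth.google.com/kml/2.2">'
print(declared_xml_encoding(first_bytes))
```

The declaration is only a claim made by whoever produced the file, so it is still worth checking that the file actually decodes with the encoding it names.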

## Character encodings

In Python 3, strings are sequences of Unicode characters. We can inspect the Unicode value (code point) of an individual character using the built-in `ord()` function, and look up its name using the `unicodedata` module.

In [13]:
import unicodedata

chars = "abc 123 é © 草"
for char in chars:
    # Print the character, followed by its Unicode value in hexadecimal, then decimal, 
    #     followed by its Unicode name.
    print(char,'%04x' % ord(char),'%d' % ord(char), unicodedata.name(char))

a 0061 97 LATIN SMALL LETTER A
b 0062 98 LATIN SMALL LETTER B
c 0063 99 LATIN SMALL LETTER C
  0020 32 SPACE
1 0031 49 DIGIT ONE
2 0032 50 DIGIT TWO
3 0033 51 DIGIT THREE
  0020 32 SPACE
é 00e9 233 LATIN SMALL LETTER E WITH ACUTE
  0020 32 SPACE
© 00a9 169 COPYRIGHT SIGN
  0020 32 SPACE
草 8349 33609 CJK UNIFIED IDEOGRAPH-8349
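The lookups above can also be run in the other direction. As a short sketch using the same characters: the built-in `chr()` turns a code point back into a character, `unicodedata.lookup()` finds a character from its Unicode name, and `unicodedata.category()` reports the general category of a character.

```python
import unicodedata

# chr() is the inverse of ord()
from_code_points = chr(233) + chr(0x8349)
print(from_code_points)

# Find a character from its Unicode name
copyright_sign = unicodedata.lookup('COPYRIGHT SIGN')
print(copyright_sign)

# General categories: Ll = lowercase letter, Nd = decimal digit, Zs = space separator
categories = [unicodedata.category(c) for c in 'é1 ']
print(categories)
```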


Character encodings, such as ASCII or UTF-8, define how individual characters are mapped onto sequences of byte values.

Unicode is expansive enough to define characters and symbols from across a wide range of languages, but a fixed-width encoding of it is expensive to use if we know we only want to encode a small range of characters, such as the characters included in the ASCII scheme: A-Z, a-z, 0-9 and a few punctuation and control characters. ASCII codes are 7 bits wide, compared to the 32 bits per character used by the UTF-32 encoding of Unicode. If all you want to do is store English text strings, with no accented characters, it makes sense in terms of minimising memory requirements to encode the characters using the leaner ASCII coding scheme. (UTF-8 is a variable-width encoding that stores each ASCII character in a single byte, so ASCII text is also valid UTF-8.)
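To make the size difference concrete, the sketch below counts the bytes needed to store the same text under several encodings. The `-le` (little-endian) variants of UTF-16 and UTF-32 are used here so that no byte order mark is added to the counts; that choice is mine, made to keep the arithmetic simple.

```python
text = 'abc 123'

# Bytes needed to store seven ASCII characters under each encoding
ascii_text_sizes = {name: len(text.encode(name))
                    for name in ('ascii', 'utf-8', 'utf-16-le', 'utf-32-le')}
print(ascii_text_sizes)

# A single CJK character needs a different number of bytes in each encoding
cjk_sizes = {name: len('草'.encode(name))
             for name in ('utf-8', 'utf-16-le', 'utf-32-le')}
print(cjk_sizes)
```

The seven ASCII characters take 7 bytes in ASCII and UTF-8 but 28 bytes in UTF-32, while the single character 草 takes 3 bytes in UTF-8 and 4 in UTF-32.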

We can look at the byte-encoded values of a string according to a specified character encoding using the `str.encode()` method (the default character encoding is UTF-8):

In [16]:
print(chars.encode(),'\n',chars.encode('utf-8'))

b'abc 123 \xc3\xa9 \xc2\xa9 \xe8\x8d\x89' 
 b'abc 123 \xc3\xa9 \xc2\xa9 \xe8\x8d\x89'


Note: The 'b' at the start of each of the quoted sequences in the output from the above cell indicates that this is a bytestream, not a string object.
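A `bytes` object behaves like a sequence of small integers, which is one way to see that a single character may occupy more than one byte. A brief sketch using the é character from the `chars` string:

```python
char = 'é'
encoded = char.encode('utf-8')

# One character in the string, two bytes in its UTF-8 encoding
print(len(char), len(encoded))

# Each item in a bytes object is an integer in the range 0-255
byte_values = list(encoded)
print(byte_values)

# The same two bytes written out in binary
byte_bits = [format(value, '08b') for value in encoded]
print(byte_bits)
```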

In [17]:
# What happens if we try to encode characters that are out of range of the character encoding?
# Here we try to encode our 'chars' string using ascii encoding.
print(chars.encode('ascii'))

UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 8: ordinal not in range(128)

The encoding fails when it encounters the é character, which is not representable by the ASCII encoding scheme.

The operation complementary to encoding strings into byte strings or byte streams is to decode byte streams to Unicode character strings. (Once again, the default encoding is UTF-8.)

Byte strings can be decoded using the `bytes.decode()` method, called either on the `bytes` class with the byte string as its argument or directly on the byte string itself:

In [18]:
bytes.decode(b'\xe8\x8d\x89')

'草'

In [19]:
b'\xe8\x8e\x88'.decode()

'莈'
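Encoding and decoding with the same encoding returns the original string, whereas decoding with a different encoding can silently produce the wrong characters without raising any error. The sketch below shows both cases; the variable names are illustrative only.

```python
original = 'é © 草'

# Round trip: encode to bytes, then decode with the same encoding
round_trip = original.encode('utf-8').decode('utf-8')
print(round_trip == original)

# Decoding UTF-8 bytes as windows-1252 succeeds, but gives the wrong characters
garbled = 'é'.encode('utf-8').decode('windows-1252')
print(garbled)

# Reversing the mistaken step recovers the original character
repaired = garbled.encode('windows-1252').decode('utf-8')
print(repaired)
```

Seeing sequences such as `Ã©` in text that should contain accented characters is a common sign that UTF-8 data has been read with a single-byte encoding.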

## Summary

In this Notebook you've seen how to:
1. detect the encoding used for a file
2. use the encoding to open files correctly
3. use the `unicodedata` module to inspect Unicode characters
4. use `str.encode()` and `bytes.decode()` to convert between strings and byte strings using a range of encodings.

## What next?

If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to look at `02.2.1 Data file formats - CSV`. 