In [2]:
# Modules we'll use
import pandas as pd
import numpy as np

# Helpful character encoding module
import charset_normalizer

# Set seed for reproducibility
np.random.seed(0)

## What are encodings?

**Character encodings** are specific sets of rules for mapping from raw binary byte strings (that look like this: 0110100001101001) to characters that make up human-readable text (like "hi").
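To make the mapping concrete, here is a minimal sketch that turns the bit pattern quoted above into text by hand. The variable names (`bits`, `raw`, `text`) are illustrative only, and ASCII is chosen because both characters fall in its range.

```python
# The bit pattern from the text, as a string of 0s and 1s
bits = "0110100001101001"

# Split it into 8-bit chunks and turn each chunk into one byte value
raw = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

# ASCII maps byte 0x68 to "h" and byte 0x69 to "i"
text = raw.decode("ascii")
print(raw, text)
```

The same two bytes would decode to the same text under UTF-8, since UTF-8 agrees with ASCII for the first 128 code points.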

In [3]:
# start with a string
before = "This is the euro symbol: €"

# check to see what datatype it is
type(before)

str

In [4]:
# encode it as UTF-8 bytes, replacing characters that raise errors
after = before.encode("utf-8", errors="replace")

# check the type
type(after)

bytes

In [5]:
# take a look at what the bytes look like
after

b'This is the euro symbol: \xe2\x82\xac'

In [6]:
# decode the bytes back to a string using utf-8
after.decode("utf-8")

'This is the euro symbol: €'

In [7]:
# try to decode our bytes with the ascii encoding
after.decode("ascii")

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 25: ordinal not in range(128)
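The position reported in the error can be checked by comparing character and byte counts. This is a small sketch, not part of the original cells, and the variable names are illustrative: the euro symbol is one character but three UTF-8 bytes, and the first of those bytes sits at index 25.

```python
before = "This is the euro symbol: €"
after = before.encode("utf-8")

# One character, but three bytes once encoded as UTF-8
n_chars = len(before)
n_bytes = len(after)
euro_bytes = "€".encode("utf-8")

# Everything before index 25 is plain ASCII, so it decodes without trouble
ascii_prefix = after[:25].decode("ascii")

print(n_chars, n_bytes, euro_bytes, repr(ascii_prefix))
```

Only the bytes from index 25 onward fall outside the 0 to 127 range that ASCII can decode.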

We can also run into trouble if we use the wrong encoding to map from a string to bytes. Like I said earlier, Python 3 strings are Unicode and UTF-8 is the default encoding, so if we try to treat them as though they were in another encoding, we'll create problems.

For example, when we convert a string to bytes with encode(), we can ask for the bytes the text would have in ASCII. Since our text contains a character that ASCII doesn't have, though, encoding fails on that character. We can tell encode() to automatically replace the characters ASCII can't handle. If we do that, however, every character not in ASCII is replaced with the unknown character ("?"). Then, when we convert the bytes back to a string, we get the unknown character instead of the original. The dangerous part is that there's no way to tell which character it should have been. That means we may have just made our data unusable!

In [8]:
# start with a string
before = "This is the euro symbol: €"

# encode it to a different encoding
after = before.encode("ascii")
after

UnicodeEncodeError: 'ascii' codec can't encode character '\u20ac' in position 25: ordinal not in range(128)

In [9]:
# encode it as ASCII, replacing characters that raise errors
after = before.encode("ascii", errors="replace")

# decode the bytes back to a string
after.decode("ascii")

# We've lost the original underlying byte string! It's been
# replaced with the underlying byte string for the unknown character :(

'This is the euro symbol: ?'
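To see why the original character can't be recovered, here is a short sketch with two made-up strings (they are not from the dataset). Different non-ASCII characters collapse to the same replacement byte, so the encoded results are indistinguishable.

```python
# Two different strings that differ only in a non-ASCII character
euro = "Price: €5".encode("ascii", errors="replace")
pound = "Price: £5".encode("ascii", errors="replace")

# Both symbols were swapped for "?", so the bytes are now identical
same = euro == pound
print(euro, pound, same)
```

Once both values are the same bytes, nothing in the data says which symbol was there originally.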

This is bad and we want to avoid doing it! It's far better to convert all our text to UTF-8 as soon as we can and keep it in that encoding. The best time to convert non-UTF-8 input into UTF-8 is when you read in files, which we'll talk about next.
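As a rough preview of that idea, here is a minimal sketch of converting at read time. It assumes we already know the source file is Windows-1252 (`cp1252`). The file names are hypothetical, and the legacy file is created inside a temporary directory only so the example is self-contained.

```python
import os
import tempfile

text = "This is the euro symbol: €"

with tempfile.TemporaryDirectory() as tmp:
    legacy_path = os.path.join(tmp, "legacy.txt")    # hypothetical name
    utf8_path = os.path.join(tmp, "converted.txt")   # hypothetical name

    # Simulate a file that arrived in a non-UTF-8 encoding
    with open(legacy_path, "wb") as f:
        f.write(text.encode("cp1252"))

    # Decode with the encoding the file was actually written in
    with open(legacy_path, encoding="cp1252") as f:
        loaded = f.read()

    # Write it back out as UTF-8 and keep it that way from here on
    with open(utf8_path, "w", encoding="utf-8") as f:
        f.write(loaded)

    # Read the raw bytes to confirm what ended up on disk
    with open(utf8_path, "rb") as f:
        utf8_bytes = f.read()

print(loaded)
print(utf8_bytes)
```

This only works because the source encoding is stated up front. If it were wrong, the read step would either fail or silently produce the wrong characters.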