# Encoding issues

Life was so simple back in the 1960s, when everything a computer knew about text fit on a standard QWERTY keyboard and in the ASCII character set (read: only English letters, no fancy stuff). Well, things are different now. Who would have thought we'd start typing with emoji? Exactly. Nobody.

So here's the issue. 

When we save a text file, the characters are stored as bytes, and the rule that maps characters to bytes is called the "encoding". A plain text file usually doesn't say which encoding it was saved in, so whoever reads it has to know (or guess). For example, the plain old text files from the 1960s are encoded in ASCII ("only English letters, no fancy stuff"). When opening social media data as ASCII, all the emoji will turn into errors or garbage (well, obviously, because those hadn't been invented in the 60s).

So, we need to be able to pick the right encoding for the kind of data we're getting.
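
To make this concrete, here is a small sketch showing that the same text becomes different bytes under different encodings, and that ASCII simply has no way to store some characters. The sample string is made up for illustration.

```python
# Made-up sample text: "caf" + e-acute + space + a grinning-face emoji
text = "caf\u00e9 \U0001F600"

# The same characters become different bytes under different encodings
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16")
print(len(text), len(utf8_bytes), len(utf16_bytes))

# ASCII has no code for the accented letter or the emoji
try:
    text.encode("ascii")
except UnicodeEncodeError as err:
    print("ASCII cannot store this text:", err.reason)

# Forcing it anyway replaces every unknown character with "?"
ascii_lossy = text.encode("ascii", errors="replace")
print(ascii_lossy)
```

The number of characters stays the same, but the number of bytes on disk depends on the encoding, which is why a reader has to use the same one as the writer.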

When the tutorials on parsing were designed, we used the "utf-8" standard - a pretty decent one if you ask me, which can represent every Unicode character, emoji included. But, well, not every file is saved as utf-8, and reading a file with a different encoding than the one it was written in goes wrong.

## Getting an error with a "wrong" encoding

```python
f = open('scraping-output_euros.json', 'r', encoding='utf-8')
f.readlines()
```

If you did everything correctly, you got an error message here.
```
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
```

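In simple terms, Python can't "read" this file as utf-8: the very first byte, `0xff`, is never valid in utf-8. It is, however, the first byte of the byte order mark (`0xff 0xfe`) that many utf-16 files start with, which hints at the file's real encoding.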
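
If you want to check this kind of thing yourself, you can look at the first bytes of a file for a byte order mark (BOM). The sketch below is only a rough heuristic: `guess_encoding_from_bom` is a hypothetical helper (not part of Python), many files have no BOM at all, and the sample bytes are built in memory rather than read from the tutorial's file.

```python
import codecs

def guess_encoding_from_bom(raw: bytes, default: str = "utf-8") -> str:
    """Hypothetical helper: guess an encoding from a leading byte order mark."""
    # Check UTF-32 first, because its little-endian BOM starts with the UTF-16 one
    if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No BOM found: fall back to the default (an assumption, not a guarantee)
    return default

# Build some UTF-16 bytes in memory; Python puts a BOM in front
sample = '{"user": "enzoknol"}\n'.encode("utf-16")
print(sample[:2])
print(guess_encoding_from_bom(sample))

# Decoding the same bytes as UTF-8 fails, just like in the tutorial
try:
    sample.decode("utf-8")
except UnicodeDecodeError as err:
    print("Not UTF-8:", err.reason)
```

For a real file you would read the first few bytes with `open(path, 'rb')` and pass them to the helper.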


## Fixing the encoding

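Now let's try to switch to the `utf-16` format. It covers the same Unicode characters as `utf-8`, but stores each of them in two or four bytes, so a file written in one can't be read as the other.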

```python
f = open('scraping-output_euros.json', 'r', encoding='utf-16')
f.readlines()
```

```
['{"user": "enzoknol", "yearlyincome": "\\u20ac12.2K - \\u20ac195.6K"}\n',
 '{"user": "officialtrapcity", "yearlyincome": "\\u20ac11.2K - \\u20ac178.7K"}\n',
 '{"user": "martingarrix", "yearlyincome": "\\u20ac11K - \\u20ac175.4K"}\n',
 '{"user": "rtlthevoice", "yearlyincome": "\\u20ac4.2K - \\u20ac67.1K"}\n']
```

Now, we can read the file!

Let's proceed with parsing some of the data.

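```python
import json

# Open the file with the encoding it was actually saved in;
# "with" closes the file again when we're done
with open('scraping-output_euros.json', 'r', encoding='utf-16') as f:
    for line in f:
        obj = json.loads(line)  # every line is one JSON object
        print(obj.get('user'))
```

```
enzoknol
officialtrapcity
martingarrix
rtlthevoice
```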
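
One more detail: the `\u20ac` sequences in the raw lines are JSON escapes for the euro sign, and `json.loads` turns them back into the real character. A minimal sketch, using the first line of the output above as an in-memory string instead of reading the file:

```python
import json

# First line of the file, copied from the readlines() output above
line = '{"user": "enzoknol", "yearlyincome": "\\u20ac12.2K - \\u20ac195.6K"}\n'

obj = json.loads(line)

# The escaped \u20ac is now a real euro sign inside a normal Python string
income = obj.get('yearlyincome')
print(income)
```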


## Summary

You've just learnt why we need to use the correct encoding standard to read (and many times also to write) files in Python. Why? Because the technological landscape is evolving continuously, and so are the standards on how to save and read files, so a file has to be read with the same standard it was saved in.
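
The same idea applies when writing. Here is a minimal sketch that writes a record with an explicit encoding and reads it back with the same one. The record and the file name are made up, and the file goes into a temporary folder so nothing of yours gets overwritten.

```python
import json
import os
import tempfile

# Made-up record for illustration, including a euro sign
record = {"user": "example_user", "yearlyincome": "\u20ac1K - \u20ac2K"}

# Hypothetical file name inside a fresh temporary folder
path = os.path.join(tempfile.mkdtemp(), "example-output.json")

# Say explicitly which encoding to write in ...
with open(path, "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")

# ... and use the same one when reading the file back
with open(path, "r", encoding="utf-8") as f:
    restored = json.loads(f.readline())

print(restored == record)
```

Spelling out `encoding=` on both sides means the result doesn't depend on the default of whatever computer the code runs on.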