# Programming Concepts with Python

Topics Covered:
* Learn how data is represented under the hood
* Learn about encodings
* Learn how to work with text files
* Learn how to optimize data usage

## Binary and Positional Number Systems

Bits:
![Bits](https://dq-content.s3.amazonaws.com/450/binary.png)

**Refresher: How to add Commas to Large Numbers:**

In [3]:
val = 2 ** 32
print(val)
print(f'{val:,}') # the ',' format specifier inserts thousands separators

4294967296
4,294,967,296


### Binary Digits
**Base 10:**

In [4]:
# for number 645231:

# assign the weight of digit '2':
weight_digit_2 = 10**2
# assign the value of digit '2':
value_digit_2 = 2 * weight_digit_2

# assign the weight of digit '5':
weight_digit_5 = 10 ** 3
# assign the value of digit '5':
value_digit_5 = 5 * weight_digit_5
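
The same weights can be computed for every digit at once. This is only a sketch; the helper name `digit_values` is hypothetical and not part of the original notes:

```python
def digit_values(number, base=10):
    """Return (digit, weight, value) tuples, least significant digit first."""
    result = []
    position = 0
    while number > 0:
        # divmod peels off the last digit in the given base
        number, digit = divmod(number, base)
        weight = base ** position
        result.append((digit, weight, digit * weight))
        position += 1
    return result

parts = digit_values(645231)
for digit, weight, value in parts:
    print(digit, weight, value)

# The values of all digits add back up to the original number.
print(sum(value for _, _, value in parts))
```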

**Base 2**
Numbers in Base 2 are written in parentheses with a subscript of 2: **(101)<sub>2</sub>**

In [5]:
base = 2

decimal_1 = 1*(base**0) + 1*(base**3) + 1*(base**4)
decimal_2 = 1*(base**1) + 1*(base**2) + 1*(base**3)

print(decimal_1)
print(decimal_2)

25
14


**Convert a Base 2 number to Base 10:**

In [7]:
num = 11010101
str_num = str(num)
print(int(str_num, 2)) # int(string, base)

213
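
What `int(string, 2)` computes can be sketched by hand as a sum of positional weights. The function name `binary_to_decimal` is an assumption made for illustration:

```python
def binary_to_decimal(bits):
    """Sum the weight 2**position for every '1' in a string of binary digits."""
    total = 0
    # reversed() makes position 0 the rightmost (least significant) bit
    for position, bit in enumerate(reversed(bits)):
        if bit == '1':
            total += 2 ** position
    return total

print(binary_to_decimal('11010101'))
print(binary_to_decimal('11001') == int('11001', 2))
```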


**Convert a number to binary:**

In [6]:
print(bin(25))

0b11001
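
One common way to do this conversion by hand is repeated division by 2, collecting the remainders. The sketch below uses a hypothetical helper `to_binary` and only handles non-negative integers; it is not how `bin()` is implemented internally:

```python
def to_binary(n):
    """Build the binary digits of a non-negative integer by repeated division by 2."""
    if n == 0:
        return '0'
    digits = []
    while n > 0:
        n, remainder = divmod(n, 2)
        digits.append(str(remainder))
    # Remainders come out least significant first, so reverse them.
    return ''.join(reversed(digits))

print(to_binary(25))
print('0b' + to_binary(25) == bin(25))
```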


### Differences Between Base 2 and Base 10

![Base Differences](https://dq-content.s3.amazonaws.com/450/tb1.png)

**Convert the following numbers/bases to base 10:**

In [8]:
base_8_to_10 = int('435', 8)
base_7_to_10 = int('10', 7)

print(base_8_to_10, base_7_to_10)

285 7


### Hexadecimal

**Hexadecimal and Beyond**
Python has a built-in `hex()` function that converts an integer to a base 16 string, prefixed with '0x'.  
The `int()` function accepts bases between 2 and 36, inclusive.

In [9]:
hex_3501 = hex(3501)
decimal_F = int('F', 16)

print(hex_3501)
print(decimal_F)

0xdad
15
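
Base 36 is the upper limit because the digits 0-9 plus the letters a-z give 36 symbols. Python has no built-in for converting an integer *to* an arbitrary base, so the sketch below uses a hypothetical helper `to_base` as the inverse of `int(string, base)` for non-negative integers:

```python
import string

DIGITS = string.digits + string.ascii_lowercase  # '0'..'9' then 'a'..'z': 36 symbols

def to_base(n, base):
    """Return the digits of a non-negative integer n in the given base (2-36)."""
    if not 2 <= base <= 36:
        raise ValueError('base must be between 2 and 36')
    if n == 0:
        return '0'
    digits = []
    while n > 0:
        n, remainder = divmod(n, base)
        digits.append(DIGITS[remainder])
    return ''.join(reversed(digits))

print(to_base(3501, 16))
print(to_base(285, 8))
print(int(to_base(123456, 36), 36))
```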


#### What's So Great About Hexadecimal?
* A group of 4 bits is called a 'nibble'
    * **2<sup>4</sup>** = 16 values (0-15)
* Hexadecimal lets us represent a nibble with a single character.
* A group of 8 bits is called a [byte](https://en.wikipedia.org/wiki/Byte).

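Hexadecimal also matters for RGB color representation: each of the red, green, and blue channels is one byte (0-255), so it can be written as two hexadecimal digits.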

In [10]:
red_hex = hex(213)
green_hex = hex(111)
blue_hex = hex(56)

rgb = red_hex, green_hex, blue_hex
rgb_formatted = ''
for color in rgb:
    formatted = color.replace('0x','')
    rgb_formatted += formatted

print(rgb_formatted)

d56f38
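
One caveat with the approach above: `hex()` does not pad, so a channel value below 16 produces a single digit (`hex(10)` is `'0xa'`) and the color string comes out too short. A sketch using the `02x` format specifier instead; the function name `rgb_to_hex` is hypothetical:

```python
def rgb_to_hex(red, green, blue):
    """Format three 0-255 channel values as a six-character hex color."""
    # '02x' means: lowercase hexadecimal, zero-padded to two digits
    return f'{red:02x}{green:02x}{blue:02x}'

print(rgb_to_hex(213, 111, 56))
print(rgb_to_hex(0, 10, 255))
```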


**Octal - Base 8**
`oct(integer)`

In [11]:
octal_999 = oct(999)

original = int(octal_999, 8)  # oct() already returns a str, so no str() call is needed

print(octal_999, original)

0o1747 999


## Encodings and Representing Text in a Computer

In order to represent more complex information such as text, all that is needed is to define a set of rules that translates the information that we want to represent into a sequence of zeros and ones. The simplest kind of rule that we can define is a table that explicitly tells us the binary representation of each object that we want to represent. Such a rule is called an encoding.

[ASCII Table](https://www.cs.cmu.edu/~pattis/15-1XX/common/handouts/ascii.html)

In [12]:
data = "QUEST"

for char in data:
    ordinal = ord(char)
    binary = bin(ordinal)
    print(binary)

0b1010001
0b1010101
0b1000101
0b1010011
0b1010100
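
`bin()` drops leading zeros, so the values above show only 7 bits. To see each character as a full byte, we can pad to 8 bits with `format()`. A minimal sketch:

```python
data = "QUEST"

# '08b' means: binary digits, zero-padded to a width of 8
padded = [format(ord(char), '08b') for char in data]
print(' '.join(padded))
```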


`chr()` and `ord()` - Inverse Relation

In [13]:
print(chr(65))

A


In [14]:
print(ord('A'))

65


In [1]:
chr(65) == ord('A')  # False: this compares the str 'A' with the int 65

False
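
The inverse relation means that a round trip through both functions returns the starting value. A short sketch of that check:

```python
# ord() maps a character to its code; chr() maps the code back to the character.
print(chr(ord('A')) == 'A')
print(ord(chr(65)) == 65)

# The round trip holds for every ASCII code.
round_trip_ok = all(ord(chr(code)) == code for code in range(128))
print(round_trip_ok)
```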

In [1]:
text = "The Swedish word for quest is sökande"

encoded = text.encode(encoding='ascii', errors='replace')

print(encoded)
print(type(encoded))

b'The Swedish word for quest is s?kande'
<class 'bytes'>
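
`errors='replace'` is only one of several error handlers that `str.encode()` accepts. The sketch below compares a few of the standard ones on a single word; the default, `'strict'`, raises an exception instead:

```python
text = "sökande"

results = {}
for handler in ('ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace'):
    results[handler] = text.encode('ascii', errors=handler)
    print(handler, results[handler])

try:
    text.encode('ascii')  # errors='strict' is the default
except UnicodeEncodeError as exc:
    print('strict:', exc.reason)
```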


### Exploring the `bytes` Class
A `bytes` object is a sequence of integers between 0 and 255.

In [2]:
b = 'DATA'.encode(encoding='ascii')
print(b[0]) # access the byte corresponding to D
print(b[1]) # access the byte corresponding to A
print(b[2]) # access the byte corresponding to T
print(b[3]) # access the byte corresponding to the second A

68
65
84
65


In [3]:
print(b)

b'DATA'


In [4]:
B = bytes.fromhex('ff a9 c8 44 41 54 41')

In [5]:
print(B)

b'\xff\xa9\xc8DATA'
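
A few more ways to look at the same object, as a quick sketch. Printable ASCII bytes are shown as characters and everything else as `\x` escapes, but underneath it is all integers:

```python
b = 'DATA'.encode('ascii')

print(list(b))                       # the underlying integers
print(b.hex())                       # the same bytes as hexadecimal digits
print(bytes([68, 65, 84, 65]) == b)  # build a bytes object from integers
print(b[0], b[0:1])                  # indexing gives an int, slicing gives bytes
```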


In [6]:
# Check if char is lowercase:
def is_lowercase(c):
    return 97 <= ord(c) <= 122  # ASCII codes for 'a' through 'z'

In [8]:
is_lowercase('f')

True

In [9]:
# provided inputs
string_1 = 'lowercase'
string_2 = 'UPPERCASE'

# ASCII codes 65-90 (inclusive) are 'A' through 'Z'
def check_uppercase(string):
    for c in string:
        # if non-uppercase is found, return false
        if not (65 <= ord(c) <= 90):
            return False
    return True
    
check_uppercase('AA')

True
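
The two ranges used above (65-90 and 97-122) are exactly 32 apart, so an ASCII letter can be switched between cases by shifting its code. A sketch of that idea; `to_uppercase` is a hypothetical helper and only handles the ASCII letters a-z:

```python
def to_uppercase(string):
    """Upper-case the ASCII letters a-z by shifting their codes down by 32."""
    result = ''
    for c in string:
        if 97 <= ord(c) <= 122:
            result += chr(ord(c) - 32)
        else:
            result += c  # leave everything else unchanged
    return result

print(to_uppercase('lowercase'))
print(to_uppercase('Data 123'))
```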

In [11]:
val = 2 ** 16
print(f'{val:,}')

65,536


In [12]:
print(ord('你'))

20320


### BIG5 Encoding (2 bytes)
BIG5 is a double-byte encoding used for traditional Chinese characters: each Chinese character is encoded as two bytes.

Because ASCII characters still take a single byte, BIG5 uses **up to** 2 bytes per character. It is considered a **variable-length encoding.**

In [13]:
trad_chinese = "你好嗎?"

encoded = trad_chinese.encode(encoding='BIG5')
print(encoded)

print(len(encoded))

b'\xa7A\xa6n\xb6\xdc?'
7
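
The length of 7 can be broken down by encoding each character separately. A small sketch using Python's `big5` codec:

```python
trad_chinese = "你好嗎?"

# Number of bytes each character occupies in BIG5
sizes = {char: len(char.encode('big5')) for char in trad_chinese}
print(sizes)
print(sum(sizes.values()))
```

The three Chinese characters take two bytes each and the ASCII question mark takes one.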


### Enter: Unicode
All of these encodings make it possible to display text in many languages, but this comes at the cost of having to know which encoding was used on the text in order to read it.

To overcome this, Unicode was created. Unicode is not actually an encoding. It is a very big table with 1,114,112 entries that maps symbols to codes. We call them symbols because the table is large enough to hold things other than characters. For example, it contains entries for emojis.

Unicode is not an encoding because it does not map symbols to binary but to something called code points. Several encodings have been created to encode these code points into bits. The best-known ones are UTF-8, UTF-16 and UTF-32. UTF-8 and UTF-16 are variable-length encodings, and UTF-32 is a fixed-length encoding.

The disadvantage of UTF-32 is that it wastes a lot of space, since most characters can be represented with fewer than four bytes. It has the advantage of being very easy to decode because every character uses the same number of bytes. Because of its wasteful use of space, UTF-32 is rarely used. You might think that, because UTF-8 can use as little as 8 bits per character, files encoded in UTF-8 occupy less space than the same files encoded in UTF-16. However, that depends on the language of the text. For many non-European scripts, such as Chinese, UTF-8 actually requires more space than UTF-16: most Chinese characters take three bytes in UTF-8 but only two in UTF-16.

UTF-8 is one of the most commonly used encodings, and Python often defaults to it. For instance, it is the default value of the `encoding` argument of the `str.encode()` method.

In [14]:
sentence = "ASCII cannot represent these: 你好嗎"

encoded_utf8 = sentence.encode(encoding='utf-8')
encoded_ascii = sentence.encode(encoding='ascii', errors='replace')

print(encoded_utf8)
print(encoded_ascii)

b'ASCII cannot represent these: \xe4\xbd\xa0\xe5\xa5\xbd\xe5\x97\x8e'
b'ASCII cannot represent these: ???'
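
The space trade-off between the three UTF encodings can be checked directly. In this sketch the `-be` variants are used so that no byte order mark is added to the output, which keeps the byte counts easy to compare:

```python
samples = {'english': 'quest', 'chinese': '你好嗎'}

sizes = {}
for name, text in samples.items():
    sizes[name] = {
        encoding: len(text.encode(encoding))
        for encoding in ('utf-8', 'utf-16-be', 'utf-32-be')
    }
    print(name, sizes[name])
```

For the English sample UTF-8 is the smallest; for the Chinese sample UTF-16 is.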


### Decoding
How can we do the reverse operation and decode a bytes object into the original string?

To do this, we can use the `bytes.decode()` method. The arguments of this method are the same as those of the `str.encode()` method. Let's see some examples:

In [15]:
encoded = "data".encode(encoding="ascii")
decoded = encoded.decode(encoding="ascii")
print(encoded)
print(decoded)

b'data'
data


One of the biggest difficulties in decoding data is knowing which encoding was used to encode it in the first place. The bad news is that correctly detecting the encoding every time is impossible. To see why, let's imagine that there are only two characters, A and B, and two encodings: encoding 1 maps A to 0 and B to 1, while encoding 2 maps A to 1 and B to 0.

Then, if we are given the encoded string 01, how can we know whether it was encoded with encoding 1 and was AB, or encoded with encoding 2 and was BA? There is simply not enough information to reverse the process.
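
The ambiguity can be sketched in a few lines. The two tables below are assumptions consistent with the example (encoding 1 maps A to 0 and B to 1; encoding 2 maps A to 1 and B to 0), written in the decoding direction:

```python
# Toy decoding tables: bit -> character
encoding_1 = {'0': 'A', '1': 'B'}
encoding_2 = {'0': 'B', '1': 'A'}

def decode_bits(bits, table):
    """Decode a string of bits one bit at a time using the given table."""
    return ''.join(table[bit] for bit in bits)

print(decode_bits('01', encoding_1))
print(decode_bits('01', encoding_2))
```

Both calls succeed, and nothing in the bits themselves says which result is the intended one.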

The good news is that, in general, we can use heuristics to try to figure it out. For example, if we are expecting English text and decoding gives 䔀渀最氀椀猀栀 琀攀砀琀, or if we are expecting French text and decoding gives vous Ãªtes un Ã©lÃ¨ve?, we know that we are not using the right encoding!
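
Garbled output like this is easy to reproduce by decoding with a different codec than the one used for encoding. The codec pairs below are assumptions chosen for illustration (UTF-16 with swapped byte order, and UTF-8 read as cp1252); the notes do not say which encodings were actually involved:

```python
# English text written as UTF-16 little-endian, read back as big-endian
english = 'English text'.encode('utf-16-le').decode('utf-16-be')

# French text written as UTF-8, read back as cp1252
french = 'vous êtes un élève?'.encode('utf-8').decode('cp1252')

print(english)
print(french)
```

Neither call raises an error, which is why a wrong guess has to be spotted by looking at the result.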

These heuristics are implemented in the `chardet` Python module. This module cannot always find the right answer since, as mentioned, that is impossible in general. We can pass a bytes object to the `chardet.detect()` function and it will try to detect which encoding was used. It also returns a number between 0 (low confidence) and 1 (high confidence) that shows how confident it is about its guess.

In this example, `encoded` holds a bytes object whose encoding we pretend not to know. All we know is that we are expecting English text. A first guess of big-endian UTF-16 fails:

In [25]:
decoded = encoded.decode(encoding="utf-16-be")
print(decoded)

UnicodeDecodeError: 'utf-16-be' codec can't decode byte 0xa3 in position 38: truncated data

In [23]:
import chardet


# Hidden encoding:
encoded = 'The movie — Data Quest — costs 10£'.encode(encoding='utf-8')


# Decoding with the wrong codec (cp1252) raises no error here, but the text comes out garbled
decoded_cp1252 = encoded.decode(encoding='cp1252')

encoding = chardet.detect(encoded)
print(encoding)

decoded = encoded.decode(encoding='utf-8')
print(decoded)

{'encoding': 'utf-8', 'confidence': 0.87625, 'language': ''}
The movie — Data Quest — costs 10£
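
When `chardet` is not available, a cruder alternative is to try a short list of candidate encodings in order and keep the first one that decodes without error. This is only a sketch: the helper `decode_with_fallback` and its default candidate list are hypothetical, and a successful decode does not prove the guess is right (cp1252 accepts almost any byte sequence), so strict encodings such as UTF-8 should come first:

```python
def decode_with_fallback(data, candidates=('utf-8', 'cp1252')):
    """Return (text, encoding) for the first candidate that decodes without error."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError('none of the candidate encodings could decode the data')

from_utf8 = decode_with_fallback('costs 10£'.encode('utf-8'))
from_cp1252 = decode_with_fallback('costs 10£'.encode('cp1252'))

print(from_utf8)
print(from_cp1252)
```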
