# Python - Bits and Bytes

## Libraries

In [3]:
import sys
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import os
import textwrap

In [4]:
## Enable inline plotting for graphics
%matplotlib inline
## So all output comes through from Ipython
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [6]:
## Get Version information
print(textwrap.fill(sys.version),'\n')
print("Pandas version: {0}".format(pd.__version__))
print("Matplotlib version: {0}".format(matplotlib.__version__))
print("Numpy version: {0}".format(np.__version__))
print("Seaborn version: {0}".format(sns.__version__))

3.6.8 |Anaconda custom (64-bit)| (default, Dec 30 2018, 18:50:55) [MSC
v.1915 64 bit (AMD64)] 

Pandas version: 0.23.4
Matplotlib version: 3.0.2
Numpy version: 1.15.4
Seaborn version: 0.9.0


Limit the number of rows pandas prints so the output stays readable:

In [8]:
pd.options.display.max_rows = 10

## Understanding Binary Data: bits and bytes

All of the data on your computer is stored in a binary format, which is a series of zeros and ones. Let's first think about what binary data is. Each binary digit (a 'bit') can only take the value zero or one, unlike a base-10 digit, which has 10 possible values (0-9).

To illustrate, we can convert integers into a binary representation using string format specifications. The number 0 is simply 0, and the number 1 is simply 1. We can also use the `bin()` function, which returns the binary representation of the integer as a string, including a leading '0b'.

In [10]:
## binary as zeros and ones
## convert integers in base 10 to a binary representation
"{0:b}".format(0)
bin(0)

'0'

'0b0'

In [11]:
## binary as zeros and ones
## convert integers in base 10 to a binary representation
"{0:b}".format(1)

'1'

However, the number 2 requires another 'digit' (just like going from 9 to 10 in base-10).

In [13]:
## binary as zeros and ones
## convert integers in base 10 to a binary representation
"{0:b}".format(2)

'10'

In [14]:
## binary as zeros and ones
## convert integers in base 10 to a binary representation
"{0:b}".format(3)

'11'

And of course the number 4 requires another digit.

In [15]:
## binary as zeros and ones
## convert integers in base 10 to a binary representation
"{0:b}".format(4)

'100'

A larger number like 2435 would require more digits:

In [16]:
## binary as zeros and ones
## convert integers in base 10 to a binary representation
"{0:b}".format(2435)

'100110000011'
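
To see why that string of digits means 2435, here is a minimal sketch that adds up the place values by hand and compares the result with the built-in `int()` called with base 2. The variable names are only illustrative.

```python
bits = "100110000011"

# Each digit contributes digit * 2**position, counting positions from the right.
total = sum(int(digit) * 2**position
            for position, digit in enumerate(reversed(bits)))

# The built-in int() performs the same conversion when told the string is base 2.
parsed = int(bits, 2)

print(total, parsed)
```

Both values come out as 2435, so the format string and `int(..., 2)` are inverses of each other.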

However, we usually group bits into groups of 8. These 8 bits together are called a 'byte'. Question: how many unique integers can be stored in 8 bits (one byte)? What about 2 bytes? What about 4 bytes?

In [17]:
# One byte
2**8

256

We usually talk about bytes on the computer rather than bits. There are 256 possible values per byte, each represented by 8 zeros or ones, from 00000000 to 11111111.

In [19]:
## Use the full 8 bits and show the integer 0
"{0:08b}".format(0)

## Use the full 8 bits and show the integer 255
"{0:08b}".format(255)

'00000000'

'11111111'
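
A number larger than 255 no longer fits in one byte. As a rough sketch, the snippet below pads 2435 to 16 bits, splits the digits into two 8-bit groups, and compares them with the bytes produced by `int.to_bytes`. The choice of `byteorder="big"` here simply puts the most significant byte first; the variable names are illustrative.

```python
value = 2435

# Pad to 16 bits (two bytes) and split the digits into 8-bit groups.
bits = "{0:016b}".format(value)
byte_groups = [bits[i:i + 8] for i in range(0, len(bits), 8)]

# int.to_bytes gives the same two bytes as a bytes object.
raw = value.to_bytes(2, byteorder="big")

print(byte_groups)   # the two 8-bit groups as strings
print(list(raw))     # the same two bytes as integers 0-255
```

The first byte holds the value 9 and the second holds 131, and 9 * 256 + 131 gives back 2435.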

The notion of datatypes in Python is based on C data types (especially numpy and Pandas). 

Check out this wiki page:

https://en.wikipedia.org/wiki/C_data_types

or the python docs:

https://docs.python.org/3.6/library/struct.html

to learn more about C data types.

One other thing we need to note. If an integer is 4 bytes long, but our number is small enough to need only two of those four bytes, then there will be 2 empty bytes. In that case, where do the empty bytes go? Should they go on the left side, or the right side? Here is the concept in regular numbers, using the number 4 as an example:

40
04

Both could be considered the integer 4, depending on whether we pad the zeros on the right or on the left. That is an odd concept for regular numbers, where we would clearly left-pad. For bytes, however, different systems do it in different ways. This behaviour is called 'endianness', and native binary data is either big-endian or little-endian, depending on which computer generated the data. In big-endian order the most significant byte comes first; in little-endian order it comes last. In the format strings of the struct methods (you will see these in a moment), `<` means little-endian and `>` means big-endian. If you don't include one, struct assumes you want the native (to your system) endianness.
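
As a small illustration of the two byte orders, the sketch below packs the integer 4 as a two-byte C short (`h`) with each prefix and with no prefix at all. It uses only `struct.pack` and `sys.byteorder` from the standard library; the variable names are illustrative.

```python
import struct
import sys

number = 4

little = struct.pack("<h", number)  # least significant byte first
big = struct.pack(">h", number)     # most significant byte first
native = struct.pack("h", number)   # no prefix: whatever this machine uses

print(little)
print(big)
print(sys.byteorder, native)
```

The little-endian result is `b'\x04\x00'` and the big-endian result is `b'\x00\x04'`: the same two bytes in opposite order. The native result matches whichever of the two `sys.byteorder` reports.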

Let's finally see an example. We will convert a Python integer into a C short integer. A C short integer occupies at least 2 bytes, even if the value it holds is small. Quick question: if an integer is two bytes long, how many possible values can it take?

In [20]:
## Answer: 2 bytes x 8 bits per byte = 16 bits, with 2 states for each bit (zero or one):
2**16

65536

So if we suspect a data element will be a non-negative integer less than 65536, an unsigned short integer is a good choice to save space (only two bytes); a signed short covers -32768 to 32767 instead. We could also consider using a regular integer, but that will use 4 bytes. How big an integer can we capture with 4 bytes?

In [21]:
## pretty big. 
2**32

4294967296
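
The counts above say how many distinct values fit, not which values. As a hedged sketch, the loop below tabulates the unsigned and signed ranges for 1, 2, and 4 bytes, assuming the two's complement layout for signed integers that `struct` and numpy use in practice. The `ranges` dictionary is just an illustrative container.

```python
ranges = {}
for n_bytes in (1, 2, 4):
    n_values = 2 ** (8 * n_bytes)          # 8 bits per byte, 2 states per bit
    unsigned_range = (0, n_values - 1)     # all values are non-negative
    # Two's complement: half the values are negative, half are >= 0.
    signed_range = (-(n_values // 2), n_values // 2 - 1)
    ranges[n_bytes] = (unsigned_range, signed_range)
    print(n_bytes, n_values, unsigned_range, signed_range)
```

For two bytes this gives 0 to 65535 unsigned and -32768 to 32767 signed, which is why the sign of the data matters when picking a type.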

But if we use a regular integer, we must use all 4 bytes, even if the number happens to be small, so we may try to choose the most compact C data type possible to avoid storing empty bytes in our file. In Python we don't care, we just have integers: one example of why Python is 'higher level' and C is 'lower level' coding.
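
To make the space trade-off concrete, here is a minimal sketch using `struct.calcsize` and `struct.pack` with the standard-size `<` prefix, where `h` (short) is 2 bytes and `i` (int) is 4 bytes. The values 1000 and 70000 are arbitrary examples chosen to fit and not fit in a signed short.

```python
import struct

# With the "<" prefix, struct uses fixed standard sizes on every platform.
short_size = struct.calcsize("<h")
int_size = struct.calcsize("<i")

small = 1000
as_short = struct.pack("<h", small)  # two bytes
as_int = struct.pack("<i", small)    # four bytes, two of them zero

# A value outside the signed short range cannot be packed as "h".
try:
    struct.pack("<h", 70000)
    overflowed = False
except struct.error:
    overflowed = True

print(short_size, int_size)
print(as_short, as_int)
print(overflowed)
```

The short packing stores 1000 in two bytes, while the int packing adds two zero bytes. The compact type has a cost, though: packing 70000 as a short raises `struct.error`.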