# 3. Representation of Data

## 3.1 Representing Integers

<center><img src="img/fig3-1.png" width="200pt"></center>

## 3.2 Representing Real Numbers

<span id="chapters_ch3_data_representation_representing_real_numbers"> </span>
*Floating point* is the data type used to represent non-integer real
numbers. On modern computers, virtually all non-integer real numbers are
represented this way.

Almost all processors have adopted the *IEEE 754 binary floating point
standard* for the binary representation of floating point numbers. The
standard defines a 32-bit (single precision) format and a 64-bit
(double precision) format, which is based on the same layout idea
with more bits.

<strong>It is important to note that, in general, storing a floating point number involves some loss of information.</strong>
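
The information loss can be observed directly in Python. The following is a minimal sketch, assuming that the Python `float` is a 64-bit IEEE 754 number (which is the case on practically all current platforms). It uses the standard `struct` module to squeeze a value into the 32-bit and 64-bit layouts and read it back; the variable names are chosen only for illustration.

```python
import struct

# 0.1 and 0.2 cannot be stored exactly in binary, so their sum is slightly off.
total = 0.1 + 0.2
print(total == 0.3)               # False
print(abs(total - 0.3) < 1e-9)    # True: compare with a tolerance instead

# Size of the two IEEE 754 layouts in bytes: 4 (32 bits) and 8 (64 bits).
print(len(struct.pack("<f", 0.1)), len(struct.pack("<d", 0.1)))  # 4 8

# Store 0.1 in each layout and read it back.
as_32 = struct.unpack("<f", struct.pack("<f", 0.1))[0]
as_64 = struct.unpack("<d", struct.pack("<d", 0.1))[0]
print(as_32 == 0.1)   # False: the 32-bit layout dropped some bits
print(as_64 == 0.1)   # True: the value was already stored as a 64-bit float
```

Comparing floating point numbers with a small tolerance, as in the third line of the sketch, is the usual way of living with this loss.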

## 3.3 Numbers in Python
<span id="chapters_ch3_data_representation_numbers_in_python"> </span>

Python provides the following representations for numbers:
  * **Integers:** You can use integers as you are used to from your math
     classes. Interestingly, Python adopts a seamless internal
     representation so that integers can effectively have any number of
     digits. The internal mechanism of Python switches from the
     CPU-imposed fixed-size integers to an elaborate large-integer
     representation silently when needed. You do not have to worry about
     it. Furthermore, keep in mind that “73.” is **not** an integer in
     Python. It is a floating point number (73.0). An integer cannot have
     a decimal point as part of it.
     
  * **Floating point numbers (float in short):** In Python, numbers written with a decimal point are represented as floating point
     numbers. For example, 1.45, 0.26, and -99.0 are floats but 102 and -8
     are not. We can also use the scientific notation
     ($a \times 10^b$) to write floating point numbers. For example,
     the float 0.0000000436 can be written in scientific notation as
     $4.36 \times 10^{-8}$ and in Python as `4.36E-8` or `4.36e-8`.
     
  * **Complex numbers:** In Python, complex numbers can be specified
     using `j` after a floating point number (or integer) to denote the
     imaginary part: e.g., `1.5-2.6j` for the complex number
     $(1.5-2.6i)$. The `j` symbol (or $i$) represents
$\sqrt{-1}$. There are other ways to signify complex numbers,
     but this is the most natural way considering your previous knowledge
     from high school.
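
The three kinds of numbers listed above can be tried out in a few lines. This is only an illustrative sketch; the variable names `big`, `small`, and `z` are arbitrary.

```python
# Integers: no fixed upper limit on the number of digits.
big = 2 ** 100
print(big)                      # 1267650600228229401496703205376
print(type(73), type(73.))      # <class 'int'> <class 'float'>

# Floats: scientific notation is just another way of writing the same number.
small = 4.36e-8
print(small == 0.0000000436)    # True

# Complex numbers: j marks the imaginary part.
z = 1.5 - 2.6j
print(z.real, z.imag)           # 1.5 -2.6
```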

## 3.4 Representing Truth Values (Booleans)
<span id="chapters_ch3_data_representation_representing_truth_values_booleans"> </span>
*Boolean* is another data type that has roots in the very structure of
the CPU. The answer to every question asked of the CPU is either *true* or
*false*; the logic of a CPU is strictly based on this two-valued
system. This logic system is called *Boolean logic*, after George Boole,
who introduced it in his book “The Mathematical Analysis of
Logic” (1847).

It is tightly connected to the concepts of binary `0` and `1`: In all
CPUs, *falsity* is represented with a `0`, whereas *truth* is represented
with a `1` or, on some CPUs, with any value that is not `0`.
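
Python follows the same convention at the language level, which makes it easy to observe. The sketch below shows Python's behaviour, not that of any particular CPU: zero converts to `False`, any non-zero number converts to `True`, and the two truth values behave like the integers `0` and `1`.

```python
# Zero is false; every non-zero number is true.
print(bool(0), bool(1), bool(-7))   # False True True

# The truth values themselves correspond to the integers 0 and 1.
print(int(False), int(True))        # 0 1
print(True + True)                  # 2
```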

## 3.5 Representing Text
<span id="chapters_ch3_data_representation_representing_text"> </span>

As we said in the first lines of this chapter, programming is mostly
about solving a real-world problem, which generally involves human-related
or human-interpretable data. These data do not consist only of numbers; they can include more sophisticated data such as text, sound signals,
and pictures. We leave the processing of sound and images out of the scope of this book. However, text is something we have to study.

### 3.5.1 Characters
<span id="chapters_ch3_data_representation_characters"> </span>

Written natural languages consist of basic units called *graphemes*.
Alphabetic letters, Chinese-Japanese-Korean characters, punctuation
marks, and numeric digits are all graphemes. There are also some basic
actions that commonly go hand in hand with textual data entry. “Make
newline”, “Make a beep sound”, “Tab”, “Enter” are some examples. These
are called “unprintables”.

How can we represent graphemes and unprintables in binary? Graphemes are
highly culture dependent. The shapes do not have a numerical
foundation. As far as computer science is concerned, the only way to
represent such information in numbers is to make a table and build this
table into electronic input/output devices. Such a table will have two
columns: The graphemes and unprintables in one column and the assigned
binary code in the other.

Throughout the history of computers, there have been several such tables,
mainly constructed by computer manufacturers. In time, most of them
vanished and only one survived: The *ASCII* (American Standard Code for
Information Interchange) table, which was developed by the American
Standards Association (ASA), the predecessor of today's American National
Standards Institute (ANSI). This American code, developed by
Americans, is naturally quite “American”. It incorporates all characters
of the American-English alphabet, including, for example, the dollar
sign, but stops there. The table does not contain a single character
from another culture (for example, even the pound sign ‘£’ is not in the
table).

The ASCII table has 128 lines. It maps 128 American graphemes and
unprintables to 7-bit-long (not 8-bit-long!) codes. Since a 7-bit code can also be interpreted as a number, this number is also
displayed in the ASCII table for convenience (see <a href="#chapters_ch3_data_representation_ch3_fig_ascii">Table 3.2</a>). The binary column of the table pads each code with a leading `0` so that it fills a whole byte.

<table>
<caption>Table 3.2: The ASCII table. Dec: Decimal value. Bin: Binary representation. Char: Character being represented.<span id="chapters_ch3_data_representation_id7"> </span>
<span id="chapters_ch3_data_representation_ch3_fig_ascii"> </span></caption>
<tr><td>
 Dec <td>  Bin <td>  Char <td>
 Dec <td>  Bin <td>  Char <td>
 Dec <td>  Bin <td>  Char <td>
 Dec <td>  Bin <td>  Char 
<tr><td> 
  0 <td> 0000 0000 <td> [NUL]  <td> 32 <td> 0010 0000 <td>  space  <td> 64 <td> 0100 0000 <td> @ <td> 96 <td> 0110 0000 <td> ` 
<tr><td> 
  1 <td> 0000 0001 <td> [SOH]  <td> 33 <td> 0010 0001 <td> ! <td> 65 <td> 0100 0001 <td> A <td> 97 <td> 0110 0001 <td> a 
<tr><td> 
  2 <td> 0000 0010 <td> [STX]  <td> 34 <td> 0010 0010 <td> " <td> 66 <td> 0100 0010 <td> B <td> 98 <td> 0110 0010 <td> b 
<tr><td> 
  3 <td> 0000 0011 <td> [ETX]  <td> 35 <td> 0010 0011 <td> # <td> 67 <td> 0100 0011 <td> C <td> 99 <td> 0110 0011 <td> c 
<tr><td> 
  4 <td> 0000 0100 <td> [EOT]  <td> 36 <td> 0010 0100 <td> $ <td> 68 <td> 0100 0100 <td> D <td>100 <td> 0110 0100 <td> d 
<tr><td> 
  5 <td> 0000 0101 <td> [ENQ]  <td> 37 <td> 0010 0101 <td> % <td> 69 <td> 0100 0101 <td> E <td>101 <td> 0110 0101 <td> e 
<tr><td> 
  6 <td> 0000 0110 <td> [ACK]  <td> 38 <td> 0010 0110 <td> &amp; <td> 70 <td> 0100 0110 <td> F <td>102 <td> 0110 0110 <td> f 
<tr><td> 
  7 <td> 0000 0111 <td> [BEL]  <td> 39 <td> 0010 0111 <td> ' <td> 71 <td> 0100 0111 <td> G <td>103 <td> 0110 0111 <td> g 
<tr><td> 
  8 <td> 0000 1000 <td> [BS]  <td> 40 <td> 0010 1000 <td> ( <td> 72 <td> 0100 1000 <td> H <td>104 <td> 0110 1000 <td> h 
<tr><td> 
  9 <td> 0000 1001 <td> [TAB]  <td> 41 <td> 0010 1001 <td> ) <td> 73 <td> 0100 1001 <td> I <td>105 <td> 0110 1001 <td> i 
<tr><td> 
 10 <td> 0000 1010 <td> [LF]  <td> 42 <td> 0010 1010 <td> * <td> 74 <td> 0100 1010 <td> J <td>106 <td> 0110 1010 <td> j 
<tr><td> 
 11 <td> 0000 1011 <td> [VT]  <td> 43 <td> 0010 1011 <td> + <td> 75 <td> 0100 1011 <td> K <td>107 <td> 0110 1011 <td> k 
<tr><td> 
 12 <td> 0000 1100 <td> [FF]  <td> 44 <td> 0010 1100 <td> , <td> 76 <td> 0100 1100 <td> L <td>108 <td> 0110 1100 <td> l 
<tr><td> 
 13 <td> 0000 1101 <td> [CR]  <td> 45 <td> 0010 1101 <td> - <td> 77 <td> 0100 1101 <td> M <td>109 <td> 0110 1101 <td> m 
<tr><td> 
 14 <td> 0000 1110 <td> [SO]  <td> 46 <td> 0010 1110 <td> . <td> 78 <td> 0100 1110 <td> N <td>110 <td> 0110 1110 <td> n 
<tr><td> 
 15 <td> 0000 1111 <td> [SI]  <td> 47 <td> 0010 1111 <td> / <td> 79 <td> 0100 1111 <td> O <td>111 <td> 0110 1111 <td> o 
<tr><td> 
 16 <td> 0001 0000 <td> [DLE]  <td> 48 <td> 0011 0000 <td> 0 <td> 80 <td> 0101 0000 <td> P <td>112 <td> 0111 0000 <td> p 
<tr><td> 
 17 <td> 0001 0001 <td> [DC1]  <td> 49 <td> 0011 0001 <td> 1 <td> 81 <td> 0101 0001 <td> Q <td>113 <td> 0111 0001 <td> q 
<tr><td> 
 18 <td> 0001 0010 <td> [DC2]  <td> 50 <td> 0011 0010 <td> 2 <td> 82 <td> 0101 0010 <td> R <td>114 <td> 0111 0010 <td> r 
<tr><td> 
 19 <td> 0001 0011 <td> [DC3]  <td> 51 <td> 0011 0011 <td> 3 <td> 83 <td> 0101 0011 <td> S <td>115 <td> 0111 0011 <td> s 
<tr><td> 
20 <td> 0001 0100 <td> [DC4]  <td> 52 <td> 0011 0100 <td> 4 <td> 84 <td> 0101 0100 <td> T <td>116 <td> 0111 0100 <td> t 
<tr><td> 
 21 <td> 0001 0101 <td> [NAK]  <td> 53 <td> 0011 0101 <td> 5 <td> 85 <td> 0101 0101 <td> U <td>117 <td> 0111 0101 <td> u 
<tr><td> 
 22 <td> 0001 0110 <td> [SYN]  <td> 54 <td> 0011 0110 <td> 6 <td> 86 <td> 0101 0110 <td> V <td>118 <td> 0111 0110 <td> v 
<tr><td> 
 23 <td> 0001 0111 <td> [ETB]  <td> 55 <td> 0011 0111 <td> 7 <td> 87 <td> 0101 0111 <td> W <td>119 <td> 0111 0111 <td> w 
<tr><td> 
 24 <td> 0001 1000 <td> [CAN]  <td> 56 <td> 0011 1000 <td> 8 <td> 88 <td> 0101 1000 <td> X <td>120 <td> 0111 1000 <td> x 
<tr><td> 
 25 <td> 0001 1001 <td> [EM]  <td> 57 <td> 0011 1001 <td> 9 <td> 89 <td> 0101 1001 <td> Y <td>121 <td> 0111 1001 <td> y 
<tr><td> 
 26 <td> 0001 1010 <td> [SUB]  <td> 58 <td> 0011 1010 <td> : <td> 90 <td> 0101 1010 <td> Z <td>122 <td> 0111 1010 <td> z 
<tr><td> 
 27 <td> 0001 1011 <td> [ESC]  <td> 59 <td> 0011 1011 <td> ; <td> 91 <td> 0101 1011 <td> [ <td>123 <td> 0111 1011 <td> <tt>{</tt> 
<tr><td> 
 28 <td> 0001 1100 <td> [FS]  <td> 60 <td> 0011 1100 <td> &lt; <td> 92 <td> 0101 1100 <td> \ <td>124 <td> 0111 1100 <td> | 
<tr><td> 
 29 <td> 0001 1101 <td> [GS]  <td> 61 <td> 0011 1101 <td> = <td> 93 <td> 0101 1101 <td> ] <td>125 <td> 0111 1101 <td> } 
<tr><td> 
 30 <td> 0001 1110 <td> [RS]  <td> 62 <td> 0011 1110 <td> &gt; <td> 94 <td> 0101 1110 <td> <tt>^</tt> <td>126 <td> 0111 1110 <td> <tt>~</tt> 
<tr><td> 
 31 <td> 0001 1111 <td> [US]  <td> 63 <td> 0011 1111 <td> ? <td> 95 <td> 0101 1111 <td> <tt>_</tt> <td>127 <td> 0111 1111 <td> [DEL] 
</table>

Do not worry, you do not have to memorize the ASCII table; even professional computer
programmers do not. However, some properties of this table must be
understood and kept in mind:

*  The general layout of the ASCII table:
<table><tr><td> Dec. Range<td> Property
<tr><td> 0-31<td>Unprintables
<tr><td> 32<td>Space char.
<tr><td> 33-47<td>Punctuations
<tr><td> 48-57<td>Digits 0-9
<tr><td> 58-64<td>Punctuations
<tr><td> 65-90<td>Uppercase letters
<tr><td> 91-96<td>Punctuations
<tr><td> 97-122<td>Lowercase letters
<tr><td> 123-126<td>Punctuations
<tr><td> 127<td>Unprintable (DEL)
</table>
  * There is no logic in the distribution of the punctuations.
     
  * It is based on the English alphabet; characters from other languages
     are simply not there. Moreover, there is no mechanism for diacritics.
     
  * Letters are ordered in the table, and uppercase letters come first
     (have a lower decimal value).
     
  * Digits are also ordered but are not represented by their numerical
     values. To obtain the numerical value for a digit, you have to
     subtract 48 from its ASCII value.
     
  * The table covers exactly 128 characters, neither more nor
     less. There is no such thing as Turkish-ASCII or French-ASCII. The
     extensions, where the 8th bit is set, are not part of the ASCII
     table.
     
  * Older versions of Python (v1 and v2) used ASCII as the default character representation for strings.
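
Several of the properties listed above can be checked with Python's built-in `ord()` and `chr()` functions, which convert between a character and its code. The following is a small illustrative sketch; the variable names `digit` and `value` are arbitrary.

```python
# Character -> code and code -> character.
print(ord("A"), ord("a"))        # 65 97
print(chr(36))                   # $

# The code of a character as a 7-bit binary number.
print(format(ord("A"), "07b"))   # 1000001

# Uppercase letters come before lowercase ones, exactly 32 positions apart.
print("Z" < "a")                 # True
print(ord("a") - ord("A"))       # 32

# A digit character is not stored as its numerical value: subtract 48.
digit = "7"
value = ord(digit) - 48
print(value)                     # 7
```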

The frustrating discrepancies and shortcomings of the ASCII table
led the programming community to seek a solution. A non-profit group, the
Unicode Consortium, was founded in the late 80s with the goal of
providing a substitute for the existing character tables that is also
compliant (backward compatible) with them. The *Unicode Transformation
Format (UTF)* is their suggested representation scheme.
UTF is a variable-length scheme: a character is encoded as
one to four 8-bit components in UTF-8, or as one or two 16-bit
components in UTF-16. UTF is now part
of many high-level language implementations, including Python (with version 3),
Java, Perl, TCL, Ada95, and C\#, and has gained wide popularity.
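
The variable length of the encoding can be observed with the `encode()` method of Python 3 strings. The sketch below uses four sample characters chosen for illustration (the letter A, the pound sign, the euro sign, and an emoji), written as escape sequences so that the example does not depend on the encoding of the source file. The `"utf-16-le"` codec is used so that no byte-order mark is added to the result.

```python
# A, pound sign, euro sign, and a "grinning face" emoji.
samples = ["A", "\u00a3", "\u20ac", "\U0001F600"]

utf8_lengths = []
utf16_lengths = []
for ch in samples:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-le")
    utf8_lengths.append(len(utf8))
    utf16_lengths.append(len(utf16))
    print(f"U+{ord(ch):04X}:", len(utf8), "byte(s) in UTF-8,",
          len(utf16), "byte(s) in UTF-16")

print(utf8_lengths)    # [1, 2, 3, 4]
print(utf16_lengths)   # [2, 2, 2, 4]

# Backward compatibility: an ASCII character keeps its one-byte ASCII code in UTF-8.
print("A".encode("utf-8") == "A".encode("ascii"))   # True
```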

### 3.5.2 Strings
<span id="chapters_ch3_data_representation_strings"> </span>
*Strings* are sequences of characters that are used to represent text data. Text data is as vital as numerical data in the world of programming, but we have a
problem here. As we discussed above, numbers (integers and floating point numbers) have a
natural place in the CPU. There are instructions designed for them: With these instructions, we can store
them in the memory and retrieve them from it, and we can perform arithmetical
operations on them. Character data can be represented and processed
as well because characters are mapped to one-byte integers through ASCII or alternative tables. But, when it comes
to strings, the CPU does not have any facility for them.

How can we represent a string, i.e., a sequence of characters? The only reasonable way is to store the codes of all characters that
make up a string in the memory in consecutive bytes. In other words, the string “Python rocks!” can be represented using the ASCII codes of the characters as follows:

$$
\begin{array}{|c|c|c|c|c|c|c|c|c|c|c|c|c|}
\hline
80 & 121 & 116 & 104 & 111 & 110 & 32 & 114 & 111 & 99 & 107 & 115 & 33 \\
\hline
\end{array}
$$
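
The codes in the boxes above can be reproduced in Python. This is a small sketch with illustrative variable names: `ord()` gives the code of each character, and encoding the string with the `"ascii"` codec yields the same values as consecutive bytes.

```python
text = "Python rocks!"

# The code of each character, one by one.
codes = [ord(ch) for ch in text]
print(codes)
# [80, 121, 116, 104, 111, 110, 32, 114, 111, 99, 107, 115, 33]

# The same codes as a sequence of consecutive bytes.
raw = text.encode("ascii")
print(len(raw))             # 13
print(list(raw) == codes)   # True
```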

Does this solve the problem of “representation”? Unfortunately, no. The trouble is
knowing where the string ends, for which there are two possible solutions:

1.  Store string length: In front of the string’s characters, store the length (the number of characters in the string) as an integer of a fixed number of bytes. This solution would represent our example string as follows, with the count 13 in the front:

    $$
    \begin{array}{|c|c|c|c|c|c|c|c|c|c|c|c|c|c|}
    \hline
    \textcolor{red}{\textbf{13}} & 80 & 121 & 116 & 104 & 111 & 110 & 32 & 114 & 111 & 99 & 107 & 115 & 33 \\
    \hline
    \end{array}
    $$
      
2.  Store an end mark: At the end of the string’s characters, store a special byte value (generally the number zero) that is not used to represent any other character. This solution would represent our example string as follows, with the marker 0 at the end:

    $$
    \begin{array}{|c|c|c|c|c|c|c|c|c|c|c|c|c|c|}
    \hline
    80 & 121 & 116 & 104 & 111 & 110 & 32 & 114 & 111 & 99 & 107 & 115 & 33 & \textcolor{red}{\textbf{0}} \\
    \hline
    \end{array}
    $$

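Different languages choose differently between these two options: C, for example, uses an end mark, whereas Pascal implementations have traditionally stored the length.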
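
Both options can be sketched in Python by building the byte sequences by hand. The four function names below are hypothetical helpers written for this illustration, and the sketch assumes a one-byte length field and ASCII-only text; neither is something Python itself does for its own strings.

```python
def encode_length_prefixed(text):
    """Option 1: put the length in front (assumption: a one-byte length field)."""
    data = text.encode("ascii")
    if len(data) > 255:
        raise ValueError("a one-byte length field can describe at most 255 characters")
    return bytes([len(data)]) + data


def decode_length_prefixed(buffer):
    """Read the length first, then exactly that many characters."""
    length = buffer[0]
    return buffer[1:1 + length].decode("ascii")


def encode_zero_terminated(text):
    """Option 2: put the end mark (byte value 0) after the last character."""
    data = text.encode("ascii")
    if 0 in data:
        raise ValueError("the end mark must not appear inside the string")
    return data + b"\x00"


def decode_zero_terminated(buffer):
    """Read characters until the end mark is found."""
    end = buffer.index(0)
    return buffer[:end].decode("ascii")


prefixed = encode_length_prefixed("Python rocks!")
terminated = encode_zero_terminated("Python rocks!")
print(list(prefixed))     # starts with 13, followed by the character codes
print(list(terminated))   # the character codes, followed by 0

# Whatever follows the string in memory is ignored by both decoders.
print(decode_length_prefixed(prefixed + b"junk"))       # Python rocks!
print(decode_zero_terminated(terminated + b"junk"))     # Python rocks!
```

The sketch also shows the trade-off between the two options: a fixed-size length field limits how long a string can be, whereas an end mark means that the marker value can never occur inside the string.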