# Numeral Systems and Storing Data in Memory
---

## Numbers and Computing

**Numbers** A unit of an abstract mathematical system subject to the laws of arithmetic.

- Natural Numbers :

$0, 1, 2, 3, 4, \cdots$

- Integers :

$\cdots, -4, -3, -2, -1, 0, 1, 2, 3, 4, \cdots$

- Rational Numbers :

$\frac{a}{b}$ where $a$ and $b$ are integers, and $b \ne 0$  

*Example 1*:
<br>$1.5$ since it can be represented as $\frac{3}{2}$  

<br>

*Example 2*:
<br>$1$ since it can be represented as $\frac{1}{1}$

- Real Numbers :

any value that can be represented on a number line
  
(includes: natural numbers, integers, rational numbers, and irrational numbers such as $\sqrt{2}$ and $\pi$)

*Examples*:
<br>$-3.33333\cdots\\ -1\\ 0\\ \sqrt{2}\\ \pi\\ 8\\ 9.09090909\cdots\\ 100000000$

---
## Base and Positional Notation

How many actual things does the number <u>943</u> represent?

**Base of a Number System** the number of digits used in the system.

- Base-10

consists of  $0, 1, 2, 3, 4, 5, 6, 7, 8, 9$  
also known as *decimal system*

- Base-8

consists of  $0, 1, 2, 3, 4, 5, 6, 7$  
also known as *octal system*

- Base-2

consists of $0, 1$ only  
also known as *binary system*

- Base-R  

general formula : $0, 1, \cdots, R-2, R-1$

**Question 1:** How then would you represent bases above 10?  

**Positional Notation** a system of expressing numbers as a sequence of digits in which each digit's position has a place value; the number is equal to the sum of the products of each digit and its place value.

*Example*: $943$ in base-10 (or in decimal)  

*Example*: $943$ in base-13.

**General Form**

$$
\begin{align*}
(d_n*R^{n-1}&) +(d_{n-1}*R^{n-2})+\cdots + (d_2*R^1)+(d_1*R^0)\\
\\
\text{where } & \text{$n$ is the number of digits,}\\
& \text{$R$ is the base, and}\\
& \text{$d_i$ is the digit in the $i^{th}$ position.}\\
\end{align*}
\\
$$
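
The general form can be checked with a few lines of Python. This is a minimal sketch; the function name `positional_value` is an illustrative choice, not part of the notes, and digits are passed as a list with the leftmost digit first.

```python
def positional_value(digits, base):
    """Sum each digit times its place value (base ** position)."""
    n = len(digits)
    total = 0
    for i, d in enumerate(digits):
        # The leftmost digit has place value base ** (n - 1).
        total += d * base ** (n - 1 - i)
    return total

print(positional_value([9, 4, 3], 10))  # 943
print(positional_value([9, 4, 3], 13))  # 1576
print(int("943", 13))                   # built-in check: 1576
```

Python's built-in `int(text, base)` performs the same calculation for bases 2 to 36.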

> *Exercise*: Write the decimal equivalent of 943 in base-2 (binary).

> *Exercise*: Write the decimal equivalent of ABC in base-16 (hexadecimal).

> *Exercise*: Write the decimal equivalent of 1010110 in base-2 (binary).

---
## Arithmetic in other bases

<u>Addition example</u>

![arithmetic2](https://learning.oreilly.com/library/view/computer-science-illuminated/9781284055917/images/pg72-1.jpg)

*Exercise:* Convert to decimal to check.

<u>Subtraction example</u>

![arithmetic3](https://learning.oreilly.com/library/view/computer-science-illuminated/9781284055917/images/pg72-2.jpg)


*Exercise:* Convert to decimal to check.

*Exercise:* Add the hexadecimal values $789$ and $345$. Convert to decimal to check.

*Exercise:* Subtract the octal value $741$ from $3625$. Convert to decimal to check.

---
## Binary, Octal, Decimal, Hexadecimal

| Base-2<br>(Binary) | Base-8<br>(Octal) | Base-10<br>(Decimal) | Base-16<br>(Hexadecimal) |
|:------:|:-----:|:-------:|:-----------:|
|  0     |   0   |    0    |      0      |
|  1     |   1   |    1    |      1      |
|  10    |   2   |    2    |      2      |
|  11    |   3   |    3    |      3      |
|  100   |   4   |    4    |      4      |
|  101   |   5   |    5    |      5      |
|  110   |   6   |    6    |      6      |
|  111   |   7   |    7    |      7      |
|  1000  |   10  |    8    |      8      |
|  1001  |   11  |    9    |      9      |
|  1010  |   12  |    10   |      A      |
|  1011  |   13  |    11   |      B      |
|  1100  |   14  |    12   |      C      |
|  1101  |   15  |    13   |      D      |
|  1110  |   16  |    14   |      E      |
|  1111  |   17  |    15   |      F      |
|  10000 |   20  |    16   |      10     |

<br>

**Converting Binary to Octal**

- Split into groups of three bits, starting from the right, and convert each group using the table.

*Example 1*: $111101100_2$ to Octal  

*Example 2*: $101010111100_2$ to Octal
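
The grouping rule can be sketched in Python as follows. The function name `binary_to_octal` is hypothetical, and the input is assumed to be a string of `0` and `1` characters.

```python
def binary_to_octal(bits):
    """Convert a binary digit string to octal by grouping bits in threes."""
    # Pad on the left so the length is a multiple of 3.
    padded = bits.zfill((len(bits) + 2) // 3 * 3)
    groups = [padded[i:i + 3] for i in range(0, len(padded), 3)]
    # Each 3-bit group maps to exactly one octal digit (0-7).
    return "".join(str(int(g, 2)) for g in groups)

print(binary_to_octal("111101100"))     # 754
print(binary_to_octal("101010111100"))  # 5274
```

The same idea works for hexadecimal with groups of four bits.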

<br>

**Converting Binary to Decimal**

1. Use base-2 positional notation

*Example 1*: $1010110_2$ to Decimal using base-2 positional table

2. Alternatively, convert to octal or hexadecimal first, then use the respective positional notation

*Example 2a*: $1010110_2$ to octal then to decimal using base-8 positional notation

*Example 2b*: $1010110_2$ to hexadecimal then to decimal using base-16 positional notation

---
## Converting Base-10 to Other Bases

**while** (the quotient is not zero)  
&emsp; Divide the decimal number by the new base  
&emsp; Make the remainder the next digit to the left in the answer  
&emsp; Replace the decimal number with the quotient  
**endwhile**

*Example*: Convert $2748_{10}$ to hexadecimal

- Divide base-10 number ($2748_{10}$) by base ($16$)

![img](https://learning.oreilly.com/library/view/computer-science-illuminated/9781284055917/images/pg75-1.jpg)

- Remainder ($12$) is the first digit (rightmost) in the hexadecimal answer; represented by $C$.

- Since quotient is not zero, we divide it ($171$) by base ($16$) again

![img](https://learning.oreilly.com/library/view/computer-science-illuminated/9781284055917/images/pg75-2.jpg)

- Remainder ($11$) is the next digit to the left of the hexadecimal answer; represented by $B$.

- Since quotient is not zero, we divide it ($10$) by base ($16$) again

![img](https://learning.oreilly.com/library/view/computer-science-illuminated/9781284055917/images/pg75-3.jpg)

- Remainder ($10$) is the next digit to the left of the hexadecimal answer; represented by $A$.

- Since quotient is zero, algorithm has terminated. Final answer: $ABC_{16}$
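
The repeated-division algorithm above can be sketched in Python. The names `DIGITS` and `to_base` are illustrative, and the sketch assumes a non-negative integer and a base between 2 and 16.

```python
DIGITS = "0123456789ABCDEF"

def to_base(number, base):
    """Convert a non-negative decimal integer to a digit string in `base` (2-16)."""
    if number == 0:
        return "0"
    answer = ""
    while number != 0:
        number, remainder = divmod(number, base)
        # Each remainder becomes the next digit to the left.
        answer = DIGITS[remainder] + answer
    return answer

print(to_base(2748, 16))  # ABC
```

`divmod` returns the quotient and the remainder in one step, which matches one pass through the loop in the pseudocode.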

*Exercise*: Convert $57005_{10}$ to octal.

*Exercise*: Convert $57005_{10}$ to hexadecimal.

---
## Binary System and Memory

- Data in computers is stored as low and high voltages
- 2 states for each memory location: low voltage = 0, high voltage = 1
- a binary digit (bit) is a single digit of the binary number system, i.e. one 0 or 1
- 1 byte = 8 bits

---
## Unsigned Binary Integer

**Unsigned binary integer** Non-negative integers represented by 1s and 0s

**Range of unsigned integer**

- smallest to largest unsigned integer that can be represented in binary
- depends on system architecture / programming language

*Example:* Range of 7-bit system

---
## The Carry Bit

*Class Discussion*
> What happens when we add 1 to the largest possible value in a 7-bit system ($111\ 1111$)?

- Expected value is $1000\ 0000$.
- However, since the cell can hold only 7 bits, only the 7 rightmost bits are stored ($000\ 0000$).
- Therefore, the sum incorrectly evaluates to zero.

How then do we solve this problem?

To flag this condition, the CPU contains a special bit called the **carry bit**
- denoted by the letter $C$
- How? 
 - if the sum of the leftmost column (called the *most significant bit*) produces a carry, then $C$ is set to $1$
 - else, $C$ is cleared to $0$
 - to summarise, $C$ always contains the carry of the leftmost column
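
Python integers have no fixed width, so a fixed-size cell has to be simulated with a mask. The sketch below is one way to do that under that assumption; `add_unsigned` is a hypothetical helper name.

```python
def add_unsigned(a, b, bits=7):
    """Add two unsigned values in a cell of `bits` bits; return (stored sum, carry bit C)."""
    mask = (1 << bits) - 1          # 7 bits -> 0b1111111
    total = a + b
    stored = total & mask           # only the rightmost `bits` bits are kept
    carry = (total >> bits) & 1     # carry out of the leftmost column
    return stored, carry

stored, c = add_unsigned(0b1111111, 1)
print(format(stored, "07b"), c)     # 0000000 1
```

The stored sum is zero, and `C = 1` signals that the true result did not fit in 7 bits.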

---
## Two's Complement Binary Representation

- unsigned binary representation -> non-negative integers only

*Class Discussion*
> How to represent $-5$ in binary in a 6-bit system?

Simple two-part solution:
- 1 bit for sign (0 for positive, 1 for negative)
- 5 bits for magnitude

![img](https://i.ibb.co/nRTqH1R/Slide54.png)

Therefore:
- $+5$ is $00\ 0101$
- $-5$ is $10\ 0101$

*Class Discussion*
> Then what happens if we add $+5$ to $-5$ in binary?

**Expected Result:** *Zero*

<br>

![img](https://learning.oreilly.com/library/view/computer-systems-5th/9781284079647/graphics/f0126-01.jpg)

<br>

**Actual Result:** *$10\ 1010_2 \ne 0_2$*

<u>Therefore, we need a different representation</u>

<br>

**Two's Complement**  

Positive numbers (same as the above): 

- sign bit value is $0$
- magnitude bits same as the unsigned binary representation

*Example*: Converting decimal $+5$ to two's complement in a 6-bit system.

         0               00101
         |               |||||
    sign bit (+)     magnitude bits

<br> However, $-5$ is <u>not</u> $10\ 0101$ in two's complement:

Note:
- adding $+5$ to $-5$ must give $0$ in BOTH decimal and the 6-bit binary system.

- Therefore,  
 ![img](https://learning.oreilly.com/library/view/computer-systems-5th/9781284079647/graphics/f0126-02.jpg)

- Notice that the 6-bit sum is all 0s with 1 carry.
- $11\ 1011$ is called the *additive inverse* of $00\ 0101$
- Process of finding the additive inverse is called *negation*, or *NEG*  
 i.e. $\text{  NEG }\ 00\ 0101 = 11\ 1011$
- to negate a number is also called *taking its two's complement*
- Thus, $-5$ in two's complement is $11\ 1011$  

<br>

**How to determine the Two's Complement?**

<u>2 Steps</u>  
Step 1:

Find the ones' complement:  
 - swapping all 1s to 0s, and 0s to 1s
 - also known as *NOT* operation

*Example*: Ones' complement of $00\ 0101$ (decimal $+5$).

$\text{NOT }\ 00\ 0101 = 11\ 1010$

Step 2:

Add 1 to the ones' complement:

![img](https://learning.oreilly.com/library/view/computer-systems-5th/9781284079647/graphics/f0127-03.jpg)

Entire process can also be written as:

$\text{NEG }\ 00\ 0101 = 11\ 1011$

To summarise:

> The two's complement of a number is 1 plus its ones' complement, or  
> 
> $\text{NEG }\ x = \text{NOT }\ x + 1$

*Example*: Two's complement of $-5$.

**Expected Result:** *+5*

<br>

![img](https://learning.oreilly.com/library/view/computer-systems-5th/9781284079647/graphics/f0128-01.jpg)

<br>

**Actual Result:** *$00\ 0101_2 = +5_{10}$*
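
The rule $\text{NEG }x = \text{NOT }x + 1$ can be tried out in Python. This is a sketch that simulates a 6-bit cell with a mask; the function name `neg` is an illustrative choice.

```python
def neg(x, bits=6):
    """Two's complement negation in a `bits`-bit cell: NOT x, then add 1."""
    mask = (1 << bits) - 1
    ones_complement = ~x & mask          # NOT, restricted to the cell width
    return (ones_complement + 1) & mask  # add 1, discard any carry out

print(format(neg(0b000101), "06b"))  # 111011  (-5)
print(format(neg(0b111011), "06b"))  # 000101  (+5)
```

Negating twice returns the original pattern, as expected of an additive inverse.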

---
## Two's Complement Range

*Class Discussion*
> In a 4-bit system storing integers in two's complement, what is the **range** of integers that can be represented?
> - What is the largest integer?
> - What is the smallest integer?

| Decimal | Binary |
|:-------:|:------:|
|   -8    |  1000  |
|   -7    |  1001  |
|   -6    |  1010  |
|   -5    |  1011  |
|   -4    |  1100  |
|   -3    |  1101  |
|   -2    |  1110  |
|   -1    |  1111  |
|    0    |  0000  |
|    1    |  0001  |
|    2    |  0010  |
|    3    |  0011  |
|    4    |  0100  |
|    5    |  0101  |
|    6    |  0110  |
|    7    |  0111  |

*Exercise:* Find the two's complement of $-7$ in a 4-bit system.

*Exercise:* Find the two's complement of $-8$ in a 4-bit system.

**What is the range of numbers in two's complement?**

- The largest integer is a single $0$ followed by all $1$s.
- The smallest integer is a single $1$ followed by all $0$s.
 Note that its magnitude is $1$ greater than that of the largest integer
- Also note that:
 - the number $0$ in decimal is represented as all $0$s
 - the number $-1$ in decimal is represented as all $1$s
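
These observations generalize to any cell width. The following sketch computes the range and decodes a bit pattern; both function names are hypothetical.

```python
def twos_complement_range(bits):
    """Smallest and largest integers representable in `bits`-bit two's complement."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1

def decode(pattern, bits):
    """Interpret an unsigned bit pattern as a two's complement integer."""
    sign_bit = 1 << (bits - 1)
    return pattern - (1 << bits) if pattern & sign_bit else pattern

print(twos_complement_range(4))   # (-8, 7)
print(decode(0b1000, 4))          # -8
print(decode(0b1111, 4))          # -1
```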

---
## The Number Line

<u>Assuming a 3-bit system:</u>

- **Unsigned binary number line**

![img](https://i.ibb.co/HXhtC7B/numberline.png)

*Example*: Add unsigned 4 to 3.

Addition is done by moving to the right on the number line:
1. start with 4
2. move 3 steps to the right
3. answer = 7

*Example*: Add unsigned 6 to 3.

![img](https://learning.oreilly.com/library/view/computer-systems-5th/9781284079647/graphics/f0131-01.jpg)

Incorrect result because the answer is out of range.

<br>

- **Two's complement number line**

split the unsigned number line between 3 and 4, and shift the right part to the left side

<img src="https://i.ibb.co/nj3VJTn/numberline2.png" alt="table" width="500" height="300">

Notice that binary 111 is next to 000

*Example*: Add two's complement -2 to 3.

Addition is done by moving to the right on the number line:
1. start with -2
2. move 3 steps to the right
3. answer = 1

![img](https://learning.oreilly.com/library/view/computer-systems-5th/9781284079647/graphics/f0132-01.jpg)

Note: With two’s complement representation, the carry bit no longer indicates whether the result of the addition is in range.

---
## The Overflow Bit

- hardware makes no distinction between unsigned or two's complement
- when CPU adds the contents of two memory cells, it uses the rules for binary addition on the bit sequences, regardless of their types
- In unsigned binary, if the sum is out of range, 
 - the hardware simply stores the (incorrect) result
 - sets the carry bit *C* accordingly and goes on
 - up to the software to examine the *C* bit after the addition to take appropriate action if necessary.

> $C$ bit detects overflow only for unsigned integers.

- In two's complement binary representation, the carry bit no longer indicates whether a sum is in range or out of range
 - An *overflow condition* occurs when the result of an operation is out of range  
 
 - To flag this condition, the CPU contains another special bit called the **overflow bit**:
   - denoted by the letter $V$
   - if the sum of two binary integers represented in two's complement is out of range, $V$ is set to $1$
   - else, $V$ is cleared to $0$

> $V$ bit detects overflow for signed integers.

**How does the overflow bit V detect overflow in two's complement?**

<u>Method 1</u>

1. One way would be to convert the two numbers to decimal, add them, and see if their sum is outside the range as written in decimal. If so, an overflow has occurred.

<u>Method 2</u>

2. Another way is that hardware detects an overflow by comparing the carry-in to the sign bit with the carry out of sign bit ($C$). If they are different, an overflow has occurred, and $V$ gets $1$. If they are the same, $V$ gets $0$.

 Let's look at examples in 6-bit systems:
 
 ![img](https://learning.oreilly.com/library/view/computer-systems-5th/9781284079647/graphics/f0134-01.jpg)

<u>Method 3</u>

3. Instead of comparing the carry-in to the sign bit with *C*, you can tell directly by inspecting the signs of the numbers and the sum:
 An overflow has occurred if:
 - two positive numbers are added and the sum is negative
 - two negative numbers are added and the sum is positive
 - Note: It is not possible to get an overflow by adding a positive number and a negative number.
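
As a sanity check, the three methods can be compared exhaustively for a 6-bit cell. This is a sketch only: the helper names are illustrative, and real hardware computes $V$ with logic gates rather than Python arithmetic.

```python
BITS = 6
MASK = (1 << BITS) - 1
SIGN = 1 << (BITS - 1)

def to_signed(p):
    """Interpret a 6-bit pattern as a two's complement integer."""
    return p - (1 << BITS) if p & SIGN else p

def v_method1(a, b):
    # Method 1: add in decimal and check the range.
    total = to_signed(a) + to_signed(b)
    return int(not (-SIGN <= total <= SIGN - 1))

def v_method2(a, b):
    # Method 2: carry into the sign bit vs. carry out of it (C).
    carry_in = ((a & (SIGN - 1)) + (b & (SIGN - 1))) >> (BITS - 1)
    carry_out = (a + b) >> BITS
    return carry_in ^ carry_out

def v_method3(a, b):
    # Method 3: operands share a sign that differs from the sum's sign.
    s = (a + b) & MASK
    return int((a & SIGN) == (b & SIGN) and (s & SIGN) != (a & SIGN))

# Exhaustive check over every pair of 6-bit patterns.
assert all(v_method1(a, b) == v_method2(a, b) == v_method3(a, b)
           for a in range(1 << BITS) for b in range(1 << BITS))
print("all three methods agree")
```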

---
## The Negative and Zero Bits

Two additional status bits:

1. N-bit to detect a negative result
2. Z-bit to detect a zero result

**Summary of the functions of the 4 status bits:**

- $N = 1$ if the result is negative. Otherwise, $N=0$.
- $Z = 1$ if the result is all zeros. Otherwise, $Z=0$
- $V = 1$ if a signed integer overflow occurred. Otherwise, $V=0$
- $C = 1$ if an unsigned integer overflow occurred. Otherwise, $C=0$

*Example*: Effects of 4 status bits.

![img](https://learning.oreilly.com/library/view/computer-systems-5th/9781284079647/graphics/f0135-01.jpg)
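
The four status bits can be computed together for a simulated 6-bit addition. This is a minimal sketch; `add_with_flags` is a hypothetical name, and $V$ is computed here with the sign-comparison rule.

```python
def add_with_flags(a, b, bits=6):
    """Add two bit patterns in a `bits`-bit cell; return (sum pattern, flags dict)."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    total = a + b
    s = total & mask
    flags = {
        "N": int(bool(s & sign)),                   # sign bit of the result
        "Z": int(s == 0),                           # result is all zeros
        "V": int(bool(~(a ^ b) & (a ^ s) & sign)),  # same-sign operands, different-sign sum
        "C": (total >> bits) & 1,                   # carry out of the leftmost column
    }
    return s, flags

s, flags = add_with_flags(0b000101, 0b111011)   # +5 plus -5
print(format(s, "06b"), flags)  # 000000 {'N': 0, 'Z': 1, 'V': 0, 'C': 1}
```

Here $C = 1$ but $V = 0$: the sum is wrong if the operands are read as unsigned, and correct if they are read as two's complement.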

---
## Binary Arithmetic

**Binary Addition** (recap)

8 rules when adding binary values:

- $0+0=0$

- $0+1=1$

- $1+0=1$

- $1+1=0+C$

- $C+0+0=1$

- $C+0+1=0+C$

- $C+1+0=0+C$

- $C+1+1=1+C$

  Note: $C$ is the carry bit.

*Binary Addition Example*: $\ 0101 + 0011$

*Binary Addition Exercise 1*: $\ 1100\ 1101 + 0011\ 1011$

*Binary Addition Exercise 2*: $\ 1001\ 1111 + 0001\ 0001$

*Binary Addition Exercise 3*: $\ 0111\ 0111 + 0000\ 1001$

<br>

**Binary Subtraction** (recap)

Similar to addition but add the NEG or two's complement of the second value

*Binary Subtraction Example*: $\ 0101 - 0011$

*Binary Subtraction Exercise 1*: $\ 1100\ 1101 - 0011\ 1011$

*Binary Subtraction Exercise 2*: $\ 1001\ 1111 - 0001\ 0001$

*Binary Subtraction Exercise 3*: $\ 0111\ 0111 - 0000\ 1001$
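
Subtraction by adding the two's complement can be sketched as below for a 4-bit cell. The function name `subtract` is illustrative.

```python
def subtract(a, b, bits=4):
    """Compute a - b in a `bits`-bit cell by adding the two's complement of b."""
    mask = (1 << bits) - 1
    neg_b = (~b + 1) & mask        # NEG b = NOT b + 1
    return (a + neg_b) & mask      # discard the carry out of the leftmost column

print(format(subtract(0b0101, 0b0011), "04b"))  # 0010
```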

<br>

**Binary Multiplication**

Works the same as decimal multiplication but only 1s and 0s:

- $0\times 0=0$

- $0\times 1=0$

- $1\times 0=0$

- $1\times 1=1$

*Multiplication Example*: $\ 1010\times 0101$

*Multiplication Exercise 1*: $\ 1100 \times 1101$

*Multiplication Exercise 2*: $\ 1001 \times 1111$

*Multiplication Exercise 3*: $\ 111 \times 111$
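
Because each multiplier digit is either 0 or 1, long multiplication in binary reduces to shifting and adding. The sketch below assumes non-negative operands; `multiply` is a hypothetical name.

```python
def multiply(a, b):
    """Multiply two non-negative integers with the shift-and-add method."""
    product = 0
    shift = 0
    while b:
        if b & 1:                    # multiplier bit is 1: add the shifted multiplicand
            product += a << shift
        b >>= 1                      # move to the next multiplier bit
        shift += 1
    return product

print(bin(multiply(0b1010, 0b0101)))  # 0b110010
```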

<br>

**Binary Division**

Works the same as decimal division but only 1s and 0s:

*Division Example*: $\ 1\ 1011 \div 11$

| ![img](https://learning.oreilly.com/library/view/write-great-code/1593270038/httpatomoreillycomsourcenostarchimages1289869.png.jpg)  |
| :----------------------------------------------------------: |
| ![img](https://learning.oreilly.com/library/view/write-great-code/1593270038/httpatomoreillycomsourcenostarchimages1289871.png.jpg)  |
| ![img](https://learning.oreilly.com/library/view/write-great-code/1593270038/httpatomoreillycomsourcenostarchimages1289873.png.jpg)  |
| ![img](https://learning.oreilly.com/library/view/write-great-code/1593270038/httpatomoreillycomsourcenostarchimages1289875.png.jpg)  |
| ![img](https://learning.oreilly.com/library/view/write-great-code/1593270038/httpatomoreillycomsourcenostarchimages1289877.png.jpg)  |
| ![img](https://learning.oreilly.com/library/view/write-great-code/1593270038/httpatomoreillycomsourcenostarchimages1289879.png.jpg)  |
| ![img](https://learning.oreilly.com/library/view/write-great-code/1593270038/httpatomoreillycomsourcenostarchimages1289881.png.jpg)  |
| ![img](https://learning.oreilly.com/library/view/write-great-code/1593270038/httpatomoreillycomsourcenostarchimages1289883.png.jpg)  |

*Division Exercise 1*: $\ 1100 \div 11$

*Division Exercise 2*: $\ 1111 \div 101$

*Division Exercise 3*: $\ 1010\ 1010 \div 1010$
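
Binary long division brings down one bit at a time and subtracts the divisor whenever it fits. A minimal sketch for non-negative integers follows; the function name `divide` is illustrative.

```python
def divide(dividend, divisor):
    """Binary long division on non-negative integers; returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    quotient = 0
    remainder = 0
    for i in range(dividend.bit_length() - 1, -1, -1):
        # Bring down the next bit of the dividend.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        quotient <<= 1
        if remainder >= divisor:     # the divisor fits into the partial remainder
            remainder -= divisor
            quotient |= 1
    return quotient, remainder

q, r = divide(0b11011, 0b11)
print(bin(q), bin(r))  # 0b1001 0b0
```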

---
## Four Logical Operations on Bits

**Truth Table**
- Values in the left column correspond to the left operand of the operation
- Values in the top row correspond to the right operand of the operation
- The value located at the intersection of the row and column (for a particular pair of input values) is the result.

1. **AND**

<table style="margin-left:0; width:20%">
   <tr><td>AND</td><td>0</td><td>1</td></tr>
   <tr><td>0</td><td>0</td><td>0</td></tr>
   <tr><td>1</td><td>0</td><td>1</td></tr>
</table>

2. **OR**

<table style="margin-left:0; width:20%">
   <tr><td>OR</td><td>0</td><td>1</td></tr>
   <tr><td>0</td><td>0</td><td>1</td></tr>
   <tr><td>1</td><td>1</td><td>1</td></tr>
</table>

3. **XOR**

<table style="margin-left:0; width:20%">
   <tr><td>XOR</td><td>0</td><td>1</td></tr>
   <tr><td>0</td><td>0</td><td>1</td></tr>
   <tr><td>1</td><td>1</td><td>0</td></tr>
</table>

4. **NOT**

<table style="margin-left:0; width:20%">
   <tr><td>NOT</td><td>0</td><td>1</td></tr>
   <tr><td></td><td>1</td><td>0</td></tr>
</table>

Note: NOT operation is *unary* - accepts only 1 operand.

**Bonus:** [The Marriage Operation](https://www.youtube.com/watch?v=zE55_TLgRec)

---
## Logical Operations on Binary Numbers

Apply logical operations *bit-by-bit* (or *bitwise*):

Given 2 binary numbers, a bitwise logical function operates on:
- bit position zero of both operands producing bit position zero of the result,
- bit position one of both operands producing bit position one of the result, and so on.

*Example*: $\ 1011\ 0101\quad \text{AND}\quad 1110\ 1110$

*Exercises*:
- $\ 0010\ 0110\quad \text{AND}\quad 1010\ 0111$
- $\ 1011\ 0101\quad \text{OR}\quad 1110\ 1110$
- $\ 0010\ 0110\quad \text{OR}\quad 1010\ 0111$
- $\ 1011\ 0101\quad \text{XOR}\quad 1110\ 1110$
- $\ 0010\ 0110\quad \text{XOR}\quad 1010\ 0111$
- $\ \text{NOT}\quad 1110\ 1001$
- $\ \text{NOT}\quad 0101\ 1010$

**Python bitwise operations**

```python
i = j & k    # Bitwise AND
i = j | k    # Bitwise OR
i = j ^ k    # Bitwise XOR
i = ~j       # Bitwise NOT
```

*Exercises*: Repeat the above examples using Python.  
Note: Use `0b` prefix to represent binary numbers, use `bin()` to convert answer to binary.

    bin(0b00100110 & 0b10100111)
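
One caveat when trying NOT in Python: its integers are not fixed-width, so `~j` yields a negative number. A common workaround, sketched here for an assumed 8-bit cell, is to mask the result with `0xFF`.

```python
j = 0b10110101

print(bin(~j))                   # -0b10110110 (Python integers are not fixed-width)
print(format(~j & 0xFF, "08b"))  # 01001010    (NOT restricted to 8 bits)
```

`format(value, "08b")` also keeps leading zeros, which `bin()` drops.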

---
## Floating point representation

We have seen integer representations.

*Class Discussion*
> How to represent other real numbers which are non-integers e.g. $\pi$? 

<u>Method 1</u>

**Mathematical notation**
- digit string can be of any length
- location of radix point (decimal point) b

Therefore, $\pi$ is $3.14159265359...$.

<u>Method 2</u>

**Scientific notation**
- number scaled by powers of 10 such that the leading value lies between 1 and 10
- radix point appears immediately after the first digit

e.g. $6.0221409 \times 10^{23}$

<u>Method 3</u>

**Floating point representation**
- can think of it as the binary version of scientific notation
- consists of 3 fields:  
 1. one bit to represent the sign
 2. several bits to represent the exponent of the normalized binary number
 3. several bits to represent the magnitude of the number (*significand* or *mantissa*)
 
 ![img](https://i.ibb.co/JCf9cfK/1.png)
 
 - range of floating point values depends on the number of bits in the exponent
 - precision of floating point depends on the number of bits in significand

<br>

**Binary Fractions**
- binary point (base-2 version of decimal point)

*Example*: $101.011$

<img src="https://i.ibb.co/cb3xMQv/1.png" alt="table" width="550" height="300">

- bits to the left of the binary point -> unsigned binary representation
- bits to the right of the binary point ->  $(d_1 \times 2^{-1}) + (d_2 \times 2^{-2}) + (d_3 \times 2^{-3}) + \cdots$  
 where $d_i$ is the $i^{th}$ digit to the right of the binary point
 
 <img src="https://i.ibb.co/8b0wDwt/1.png" alt="table" width="500" height="150">

<br>

**Converting decimal to binary fractions**

<u>Step 1:</u>

Convert whole part (the bits to the left of the binary point) using the technique seen earlier for converting unsigned binary values

<u>Step 2:</u>

Use successive doubling to convert the bits to the right of the binary point.

*Example*: Convert $6.5859375$ decimal to binary.

<img src="https://i.ibb.co/27y4Tyc/1.png" alt="table" width="175" height="300">

Note: when doubling the fractions, do not include the whole number part. E.g. the value $0.34375$ comes from doubling $0.171875$, not from doubling $1.171875$

For the binary fractions, read from top to bottom.

Therefore, $6.5859375_{10} = 110.1001011_2$.

<br>

*Exercise:* Convert $0.2$ decimal to binary.

<table style="margin-left:0; width:10%">
   <tr><td> </td><td>0.2</td></tr>
   <tr><td>0</td><td>0.4</td></tr>
   <tr><td>0</td><td>0.8</td></tr>
   <tr><td>1</td><td>0.6</td></tr>
   <tr><td>1</td><td>0.2</td></tr>
   <tr><td>0</td><td>0.4</td></tr>
   <tr><td>0</td><td>0.8</td></tr>
   <tr><td>1</td><td>0.6</td></tr>
   <tr><td>$\vdots$</td><td>$\vdots$</td></tr>
</table>

Notice that the process will never terminate.

Therefore, $0.2_{10} = 0.001100110011\cdots_2$ with the bit pattern $0011$ endlessly repeating.
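
Successive doubling can be sketched in Python. To keep the arithmetic exact, this sketch uses `fractions.Fraction` and takes the value as a string; the function name `fraction_to_binary` and the `max_bits` cut-off are assumptions made for illustration.

```python
from fractions import Fraction

def fraction_to_binary(value, max_bits=12):
    """Convert a decimal fraction in [0, 1), given as a string, to binary digits."""
    frac = Fraction(value)           # exact arithmetic avoids float round-off
    bits = ""
    while frac != 0 and len(bits) < max_bits:
        frac *= 2
        whole = int(frac)            # the whole-number part is the next bit
        bits += str(whole)
        frac -= whole                # keep only the fractional part before doubling again
    return "0." + bits

print(fraction_to_binary("0.5859375"))  # 0.1001011
print(fraction_to_binary("0.2"))        # 0.001100110011
```

The first conversion terminates on its own; the second stops only because of the `max_bits` limit.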

You should realize that if you add $0.1 + 0.2$ in a language like C, you will probably not get $0.3$ exactly because of the roundoff error inherent in the binary representation of the values.

For that reason, good numeric software rarely tests two floating point numbers for strict equality.

Instead, the software maintains a small but nonzero tolerance that represents how close two floating point values must be to be considered equal.

For example, if the tolerance = $0.0001$
- then $1.38264$ and $1.38267$ are considered equal
- since their difference ($0.00003$) < tolerance ($0.0001$)

If Python were to print the true decimal value of $0.1$, it would have to display `0.1000000000000000055511151231257827021181583404541015625`.

However, that is more digits than most people find useful, so Python prints a rounded value of `0.1` instead.

Refer to [official documentation](https://docs.python.org/3.8/tutorial/floatingpoint.html)
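
The points above can be observed directly in Python. A commonly cited case is `0.1 + 0.2`; the sketch uses the standard-library functions `math.isclose` and `decimal.Decimal`.

```python
import math
from decimal import Decimal

print(0.1 + 0.2 == 0.3)              # False
print(math.isclose(0.1 + 0.2, 0.3))  # True

# Absolute tolerance, as in the 0.0001 example.
print(abs(1.38264 - 1.38267) < 0.0001)  # True

# Decimal(float) shows the exact value that is stored for 0.1.
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625
```

`math.isclose` uses a relative tolerance by default, which scales with the size of the operands; an absolute tolerance can be passed as `abs_tol`.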

<br>

**Normalized Form in Scientific Notation**

- For example, the decimal number $-328.4$ is written in normalized form in scientific notation as $-3.284 \times 10^2$
 - The effect of the exponent $2$ as the power of $10$ is to shift the decimal point two places to the right.


- Similarly, the binary number $-10101.101$ is written in normalized form in scientific notation as $-1.0101101 \times 2^4$
 - The effect of the exponent $4$ as the power of $2$ is to shift the binary point four places to the right.


- On the other hand, the binary number $0.00101101$ is written in normalized form in scientific notation as $1.01101 \times 2^{-3}$
 - The effect of the exponent $-3$ as the power of $2$ is to shift the binary point three places to the left.
 

- The number zero cannot be normalized because it does not have a nonzero digit.


- Based on the normalized form in scientific notation:
 - floating point numbers can be positive or negative
 - exponent can be positive or negative

<br>

**Excess-N (or Biased) Representation**

- used to store exponents of the normalized binary number
- IEEE-754 uses excess-$(2^{N-1} - 1)$ where $N$ is the number of bits
- all 0s corresponds to the minimal negative value and all 1s to the maximal positive value
 - range of numbers is from $-(2^{N-1} - 1)$ to $(2^{N-1})$

<br>

*Example:* Different representations of 3-bit cell

| Decimal | Two's Complement | Excess 3 |
| ------- | ---------------- | -------- |
|   -4    |       100        |          |
|   -3    |       101        |    000   |
|   -2    |       110        |    001   |
|   -1    |       111        |    010   |
|    0    |       000        |    011   |
|    1    |       001        |    100   |
|    2    |       010        |    101   |
|    3    |       011        |    110   |
|    4    |                  |    111   |

<br>

*Example:* Biased representation of 5-bit cell

<br>

**Convert from decimal to excess-N**

How?

1. Add $N$ to the decimal value
2. then convert to binary as you would an unsigned number

*Example*: Convert 5 from decimal to excess-15.

- Add $5 + 15 = 20$.
- Then convert $20$ to binary as if it were unsigned, $20$ (dec) = $10100$ (bin).
- Therefore, $5$ (dec) = $10100$ (excess-$15$).
- Note the first bit is 1, indicating a positive value.

<br>

**Convert from excess-N to decimal**

How?

1. Convert the binary value to decimal as if it were an unsigned number
2. subtract $N$ from it

*Example*: Convert $00011$ from excess-15 to decimal.

- Convert $00011$ as an unsigned value, $00011$ (bin) = $3$ (dec).
- Then subtract decimal values $3 - 15 = -12$.
- Therefore, $00011$ (excess-15) = $-12$ (dec).
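
Both conversions are one-liners in Python. The helper names `to_excess` and `from_excess` are hypothetical; the bias `n` and the cell width `bits` are passed explicitly.

```python
def to_excess(value, n, bits):
    """Encode a decimal integer in excess-n using `bits` bits."""
    biased = value + n
    if not 0 <= biased < (1 << bits):
        raise ValueError("value out of range for this cell")
    return format(biased, f"0{bits}b")

def from_excess(pattern, n):
    """Decode an excess-n bit string back to a decimal integer."""
    return int(pattern, 2) - n

print(to_excess(5, 15, 5))       # 10100
print(from_excess("00011", 15))  # -12
```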

<br>

**The Hidden Bit**

Numbers are stored normalized. Therefore:

- no bit required for storing the binary point
- no bit required for storing the leading $1$ to the left of the binary point (**hidden bit**)

<br>

**Decimal value in floating point representation**

1. Store the sign bit
2. Convert value to binary
3. Write in normalized scientific notation form
4. Store the exponent in excess representation
5. Drop the leading 1 to the left of the binary point
6. Store the remaining bits of the magnitude in the significand or mantissa

*Example*: Assuming a three-bit exponent using excess-3 and a four-bit significand, how is the number $3.375$ stored?

<br>

Note: the hidden bit is assumed, not ignored
- When a decimal floating point value is read from memory, the compiler assumes that the hidden bit is not stored
- Then, it generates code to insert the hidden bit before it performs any computation with the full number of bits

<br>

**Rounding binary numbers**

- every memory cell has a finite number of bits
- the system approximates by rounding off the least significant bits it must discard, using a rule called “round to nearest, ties to even.”

*Example*:

Decimal Rounding 

| Decimal | Decimal Rounded |
|:------- |:--------------- |
| 23.499  |       23        | 
| 23.5    |       24        | 
| 23.501  |       24        | 
| 24.499  |       24        | 
| 24.5    |       24        | 
| 24.501  |       25        | 

<br>

Binary Rounding

| Binary    | Binary Rounded |
|:--------  |:-------------- |
| 10111.011 | 10111          |
| 10111.1   | 11000          |
| 10111.101 | 11000          |
| 11000.011 | 11000          |
| 11000.1   | 11000          |
| 11000.101 | 11001          |

Note:
- round off $23.499$ to $23$ because $23.499$ is closer to $23$ than it is to $24$
- round off $23.501$ to $24$ because $23.501$ is closer to $24$ than it is to $23$
- round off $23.5$ to $24$ because $24$ is even, despite $23.5$ being just as close to $23$ as it is to $24$
- round off binary $10111.1$ to $11000$ because $11000$ is even, despite being just as close to $10111$ as it is to $11000$
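
Python's built-in `round()` follows the same "round to nearest, ties to even" rule when rounding to an integer, so the decimal cases can be checked directly:

```python
for x in (23.499, 23.5, 23.501, 24.499, 24.5, 24.501):
    print(x, "->", round(x))
# 23.499 -> 23
# 23.5 -> 24
# 23.501 -> 24
# 24.499 -> 24
# 24.5 -> 24
# 24.501 -> 25
```

Both ties ($23.5$ and $24.5$) round to the even neighbour $24$.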

*Exercise:* Assuming a three-bit exponent using excess-$3$ and a four-bit significand, how is the number $-13.75$ stored?

<br>

**Special Values**

- some real values require special treatment
- the most obvious is **Zero**
 - all 0s in significand and exponent
- positive or negative zero depending on sign bit

*Example 1*: Assuming a three-bit exponent using excess-$3$  and a four-bit significand, what is the <u>smallest positive value possible</u> in binary and decimal?

<br>

*Example 2*: Assuming a three-bit exponent using excess-$3$  and a four-bit significand, what is the <u>largest negative value possible</u> in binary and decimal?

<br>

*Example 3*: Assuming a three-bit exponent using excess-$3$  and a four-bit significand, what is the <u>largest positive value possible</u> in binary and decimal?

<br>

From the examples above, we can represent the special numbers in a number line.

![img](https://i.ibb.co/56wws9m/1.png)

Consider $9.5\times 12.0$, where both operands are in range:
- true value is $114.0$
- but it falls in the positive overflow region

Consider $0.145\times 0.145$, where both operands are in range:
- true value is $0.021025$
- but it falls in the positive underflow region

Therefore, to alleviate the overflow and underflow problems, we need to introduce more special values:

| Special Value      | Exponent  | Significand |
| ------------------ | --------- | ----------- |
| Zero               | All zeros | All zeros   |
| Denormalized       | All zeros | Non-zero    |
| Infinity           | All ones  | All zeros   |
| Not a Number (NaN) | All ones  | Non-zero    |

<br>

**Infinity**

- used for values that are in the overflow regions

- if the result of an operation overflows, the bit pattern for infinity is stored

- further operations on this bit pattern, will produce the expected result for an infinite value.  
 Examples include:

$\qquad \frac{3}{\infty} = 0$  

$\qquad 5 + \infty = \infty$  

$\qquad \sqrt{\infty} = \infty$  

<br>

- can produce infinity by dividing by 0.
  Examples include:

$\qquad \frac{3}{0} = \infty$  

$\qquad \frac{-4}{0} = -{\infty}$

<br>

**Not a Number (NaN)**
- bit pattern for a value that is not a number

- used to indicate floating point operations that are illegal  
 Examples include:  

$\qquad \sqrt{-1}$  

$\qquad \frac{0}{0}$

<br>

- Any floating point operation with at least one NaN operand produces NaN  
 Examples include:

$\qquad 7 + \text{NaN} = \text{NaN}$

$\qquad \frac{7}{\text{NaN}} = \text{NaN}$
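
Python floats are IEEE 754 doubles on typical platforms, so most of these rules can be observed with `math.inf` and `math.nan`:

```python
import math

inf = math.inf
nan = math.nan

print(3 / inf)          # 0.0
print(5 + inf)          # inf
print(math.sqrt(inf))   # inf
print(7 + nan)          # nan
print(7 / nan)          # nan
print(nan == nan)       # False: NaN compares unequal to everything, itself included
print(math.isnan(inf - inf))  # True: inf - inf is not a number
```

Note that Python itself deviates from the hardware behaviour in two places: `3 / 0` raises `ZeroDivisionError` instead of returning infinity, and `math.sqrt(-1)` raises `ValueError` instead of returning NaN.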

Note:
- both infinity and NaN use the largest possible value of the exponent for their bit patterns i.e. exponent field is all $1$s

- significand is all $0$s for infinity

- significand is any nonzero pattern for NaN

- Reserving these bit patterns for infinity and NaN has the effect of reducing the range of values that can be stored
 - *Example*: For a three-bit exponent and four-bit significand, the bit patterns for the largest magnitudes and their decimal values are:

$1\ 111\ 0000\ \text{(bin)} = -\infty$

$1\ 110\ 1111\ \text{(bin)} = -15.5$

$0\ 110\ 1111\ \text{(bin)} = +15.5$

$0\ 111\ 0000\ \text{(bin)} = +\infty$

<br>

**Denormalized Numbers**

- used to alleviate underflow problem

![img](https://i.ibb.co/PW34t1v/1.png)

The figure shows three complete sequences of values for exponent fields of $000$, $001$, and $010$ (excess-3), which represent $-3$, $-2$, and $-1$ (dec), respectively.
 - $2^{-3} = 0.125$
 - $2^{-2} = 0.25$
 - $2^{-1} = 0.5$
 
For normalized numbers in general,
- the gap between successive values doubles with each unit increase of the exponent
- the gap between $+0.0$ and the smallest positive value is excessive compared to the gaps in the smallest sequence. 
 
With denormalized special values,
- the gap between successive values for the first sequence is equal to the gap between successive values for the second sequence
- the gap between $+0.0$ and the smallest positive value is reduced considerably
- values are evenly spaced as they approach +0.0 from the right
- similarly, on the left half of the number line (not shown in figure), the negative values are spread out evenly as they approach –0.0 from the left.
- this is known as *gradual underflow*
- idea is to take nonzero values that would be stored with an exponent field of all 0s (in excess representation) and distribute them evenly in the underflow gap.
- since the exponent field of all 0s is reserved for denormalized numbers, the smallest positive normalized number becomes $0\ 001\ 0000 = 1.0000 \times 2^{-2} \text{ (bin)} = 0.25 \text{ (dec)}$

**Representation rules for denormalized numbers**

1. The hidden bit to the left of the binary point is assumed to be 0 instead of 1
2. The exponent is assumed to be stored in excess-(N-1) instead of excess-N

*Example 1*: For a representation with a three-bit exponent and four-bit significand, what decimal value is represented by $0\ 000\ 0110$?

<br>

*Example 2*: For a representation with a three-bit exponent and four-bit significand, what is the <u>smallest denormalized positive value possible</u> in binary and decimal?

<br>

With denormalization, to convert from decimal to binary you must first check if a decimal value is in the denormalized range to determine its representation

For a three-bit exponent and a four-bit significand,
- the smallest positive normalized value is $0\ 001\ 0000 = 1.0000\times 2^{-2} = 0.25$ (dec)
- any value less than $0.25$ is stored in denormalized format

*Example 3*: For a representation with a three-bit exponent and four-bit significand, how is the decimal value –0.078 stored?
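
The rules for normalized, denormalized, and special values can be combined into one decoder for the 8-bit teaching format used in this section (1 sign bit, 3 exponent bits in excess-3, 4 significand bits). This is a sketch under those assumptions; `decode_toy` is a hypothetical name, and the result is returned as an ordinary Python float.

```python
def decode_toy(pattern):
    """Decode an 8-bit pattern 's eee ffff' (1 sign, 3 exponent, 4 significand bits)."""
    bits = pattern.replace(" ", "")
    sign = -1.0 if bits[0] == "1" else 1.0
    exponent = int(bits[1:4], 2)
    fraction = int(bits[4:], 2) / 16        # 0.ffff as a value in [0, 1)
    if exponent == 0b111:                   # all ones: infinity or NaN
        return sign * float("inf") if fraction == 0 else float("nan")
    if exponent == 0:                       # all zeros: zero or denormalized
        return sign * fraction * 2.0 ** -2  # hidden bit 0, excess-2
    return sign * (1 + fraction) * 2.0 ** (exponent - 3)  # hidden bit 1, excess-3

print(decode_toy("0 001 0000"))  # 0.25
print(decode_toy("0 110 1111"))  # 15.5
print(decode_toy("1 111 0000"))  # -inf
```

The three branches correspond to the rows of the special-values table: an exponent field of all ones, an exponent field of all zeros, and everything else.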

<br>

## The IEEE 754 Floating-Point Standard

- Institute of Electrical and Electronics Engineers, Inc. (IEEE)
- a society made up of professionals across engineering fields including computer engineering
- proposes standards, including IEEE 754 for floating point numbers
- virtually every computer manufacturer now provides floating point numbers for their computers that conform to the IEEE 754 standard

Two formats:

1. **single precision (32-bit system)**
  - 1-bit sign
  - 8-bit exponent using excess-127 (except denormalized numbers which use excess-126)
  - 23-bit significand
  - has the following bit values:
    - Positive infinity is $0\ 1111\ 1111\ 000\ 0000\ 0000\ 0000\ 0000\ 0000$.

    - The hex values for the full 32-bit pattern arranges the bits into groups of four  
     e.g. $0111\ 1111\ 1000\ 0000\ 0000\ 0000\ 0000\ 0000$ can be written as $7F80\ 0000$ (hex).
 
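    - The largest positive value is $0\ 1111\ 1110\ 111\ 1111\ 1111\ 1111\ 1111\ 1111$ or $7F7F\ FFFF$ (hex).  
     It is exactly $2^{128} - 2^{104}$, which is approximately $2^{128}$ or $3.4 \times 10^{38}$.
 
    - The smallest positive normalized number is $0\ 0000\ 0001\ 000\ 0000\ 0000\ 0000\ 0000\ 0000$ or $0080\ 0000$ (hex).  
     It is exactly $2^{-126}$, which is approximately $1.2 \times 10^{-38}$.

    - The smallest positive denormalized number is $0\ 0000\ 0000\ 000\ 0000\ 0000\ 0000\ 0000\ 0001$ or $0000\ 0001$ (hex).  
     It is exactly $2^{-149}$, which is approximately $1.4 \times 10^{-45}$.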

<br>

2. **double precision (64-bit system)**
  - 1-bit sign
  - 11-bit exponent using excess-1023 (except denormalized numbers which use excess-1022)
  - 52-bit significand
  - has both wider range and greater precision because of the larger exponent and significand fields
  - has the following bit values:
    - The largest positive value is approximately $2^{1024}$, or $1.8\times 10^{308}$.
   
    - The smallest positive normalized number is approximately $2.2\times 10^{-308}$.
   
    - The smallest positive denormalized number is approximately $4.9\times 10^{-324}$.
  
<br>
  
![img](https://i.ibb.co/ftCmSWY/1.png)
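
Single-precision bit patterns can be inspected from Python with the standard-library `struct` module, whose `"f"` format code packs a value as a 4-byte IEEE 754 single. The two helper names below are illustrative, not part of the notes.

```python
import struct

def float32_hex(x):
    """Hex digits of x stored as an IEEE 754 single-precision value (big-endian)."""
    return struct.pack(">f", x).hex().upper()

def hex_to_float32(h):
    """Value of the single-precision number whose bit pattern is the hex string h."""
    return struct.unpack(">f", bytes.fromhex(h))[0]

print(float32_hex(float("inf")))                  # 7F800000
print(hex_to_float32("7F7FFFFF"))                 # largest positive value, about 3.4e38
print(hex_to_float32("00800000") == 2.0 ** -126)  # True: smallest positive normalized
print(hex_to_float32("00000001") == 2.0 ** -149)  # True: smallest positive denormalized
```

The `">"` prefix requests big-endian byte order so that the hex digits read in the same order as the bit patterns written above.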



<br>

*Exercise 1*: What is the hexadecimal representation of $-47.25$ in single-precision floating point?

<br>

*Exercise 2*: What is the number, as written in binary scientific notation, whose hexadecimal representation is $3CC8\ 0000$?

<br>

*Exercise 3*: What is the number, as written in binary scientific notation, whose hexadecimal representation is $0050\ 0000$?

---
## References

<br>

Hyde, R. (2012). *Write Great Code, Volume 1*. O'Reilly Media, Inc.

Kahan, W. (1997). *Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic*. University of California.

Null, L., & Lobur, J. (2018). *Essentials of computer organization and architecture* (Fifth edition.). Jones & Bartlett Learning, LLC.

Dale, N. (2016). *Computer science illuminated* (Sixth edition). Jones & Bartlett Learning, LLC.

Warford, J. S. (2017). *Computer systems* (Fifth edition). O'Reilly Media, Inc.