In [1]:
import torch

#### Unsigned Integer in `n` Bits

- **Definition**: An unsigned integer in `n` bits is a non-negative integer that can be represented using `n` binary digits (bits). Each bit can be either 0 or 1.

- **Range**: An `n`-bit unsigned integer can represent values from 0 to \(2^n - 1\).

- **Example**:
  - For `n = 3`, the possible values are:
    - Binary: `000`, `001`, `010`, `011`, `100`, `101`, `110`, `111`
    - Decimal: 0, 1, 2, 3, 4, 5, 6, 7

- **Maximum Value**: The maximum value of an `n`-bit unsigned integer is \(2^n - 1\). For instance:
  - For `n = 4`, the maximum value is \(2^4 - 1 = 15\).

- **Binary Representation**: The binary representation of an `n`-bit unsigned integer always has exactly `n` bits, with leading zeros added if necessary to fill the bit width.

- **Usage**: Unsigned integers are commonly used in computer systems for operations where only non-negative values are needed, such as indexing, counters, and bit manipulation.
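
The range and fixed-width representation described above can be checked with a few lines of plain Python. This is a minimal sketch; the helper names `unsigned_range` and `to_binary` are made up for illustration and are not part of any library.

```python
def unsigned_range(n_bits: int) -> tuple[int, int]:
    """Return the (min, max) values of an n-bit unsigned integer."""
    return 0, 2**n_bits - 1


def to_binary(value: int, n_bits: int) -> str:
    """Format value as exactly n_bits binary digits, padded with leading zeros."""
    low, high = unsigned_range(n_bits)
    if not low <= value <= high:
        raise ValueError(f"{value} does not fit in {n_bits} unsigned bits")
    return format(value, f"0{n_bits}b")


print(unsigned_range(3))                    # (0, 7)
print([to_binary(v, 3) for v in range(8)])  # all eight 3-bit patterns
print(to_binary(5, 8))                      # 00000101
```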


#### 8-Bit Unsigned Integer

- **Definition**: An 8-bit unsigned integer is a non-negative integer represented using 8 binary digits (bits). Each bit can be either 0 or 1.

- **Range**: The range of values for an 8-bit unsigned integer is from 0 to \(2^8 - 1\), which is 0 to 255.

- **Example**:
  - Binary: `00000000` to `11111111`
  - Decimal: 0 to 255

- **Maximum Value**: The maximum value of an 8-bit unsigned integer is \(2^8 - 1 = 255\).

- **Binary Representation**: The binary representation is always 8 bits long, with leading zeros added if necessary to fill the bit width.

- **Usage**: 8-bit unsigned integers are used for data storage and operations where only non-negative values in the range 0-255 are required, such as in color values in images and small counters.

#### 8-Bit Signed Integer

- **Definition**: An 8-bit signed integer can represent both positive and negative integers using 8 bits. In two's complement representation, the most significant bit acts as the sign bit (0 for zero and positive values, 1 for negative values).

- **Range**: The range of values for an 8-bit signed integer is from \(-2^7\) to \(2^7 - 1\), which is -128 to 127.

- **Example**:
  - Binary: 
    - Positive values: `00000000` to `01111111` (0 to 127)
    - Negative values: `10000000` to `11111111` (-128 to -1, using two's complement representation)
  - Decimal: -128 to 127

- **Maximum and Minimum Values**:
  - Maximum Value: \(2^7 - 1 = 127\)
  - Minimum Value: \(-2^7 = -128\)

- **Binary Representation**: The binary representation includes a sign bit. Positive values are represented directly in binary, while negative values use two's complement notation.

- **Usage**: 8-bit signed integers are used for operations involving both positive and negative values in a range suitable for small integers, such as in signed arithmetic operations and certain types of signal processing.
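
To make the two's complement encoding concrete, here is a small sketch in plain Python that converts between signed integers and their 8-bit patterns. The function names `to_twos_complement` and `from_twos_complement` are hypothetical helpers written for this example.

```python
def to_twos_complement(value: int, n_bits: int = 8) -> str:
    """Return the n-bit two's complement bit pattern of a signed integer."""
    low, high = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    if not low <= value <= high:
        raise ValueError(f"{value} does not fit in {n_bits} signed bits")
    # Masking maps negative values onto the upper half of the unsigned range.
    return format(value & (2**n_bits - 1), f"0{n_bits}b")


def from_twos_complement(bits: str) -> int:
    """Interpret a bit string as a two's complement signed integer."""
    raw = int(bits, 2)
    # A leading 1 means the value is negative: subtract 2^n.
    return raw - 2 ** len(bits) if bits[0] == "1" else raw


print(to_twos_complement(127))           # 01111111
print(to_twos_complement(-1))            # 11111111
print(to_twos_complement(-128))          # 10000000
print(from_twos_complement("10000000"))  # -128
```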


In [2]:
# Information of `8-bit unsigned integer`
torch.iinfo(torch.uint8)

iinfo(min=0, max=255, dtype=uint8)

In [3]:
# Information of `8-bit (signed) integer`
torch.iinfo(torch.int8)

iinfo(min=-128, max=127, dtype=int8)

In [4]:
# Information of `16-bit (signed) integer`
torch.iinfo(torch.int16)

iinfo(min=-32768, max=32767, dtype=int16)

In [5]:
# Information of `32-bit (signed) integer`
torch.iinfo(torch.int32)

iinfo(min=-2.14748e+09, max=2.14748e+09, dtype=int32)

#### Components of Floating-Point Numbers

- **Sign**:
  - **Definition**: Indicates whether the number is positive or negative.
  - **Bit Representation**: Typically represented by a single bit in the floating-point format (0 for positive, 1 for negative).

- **Exponent**:
  - **Definition**: Determines the magnitude of the number by specifying the power of the base (usually base 2 in binary systems).
  - **Bit Representation**: The exponent is stored in a biased form. For example, in IEEE 754 single-precision format, the exponent is an 8-bit field with a bias of 127.

- **Mantissa (Significand)**:
  - **Definition**: Represents the significant digits of the number. It is the precision part of the floating-point number.
  - **Bit Representation**: The mantissa is stored as a binary fraction. In IEEE 754 single-precision format, the mantissa is a 23-bit field. The leading bit (implicit 1 in normalized numbers) is not stored but assumed.

- **Normalization** (for normalized numbers):
  - **Definition**: The process of adjusting the mantissa and exponent to ensure that the leading digit of the mantissa is non-zero (usually 1 for binary systems).
  - **Binary Representation**: For normalized numbers, the mantissa always starts with a leading 1 which is implicit and not explicitly stored.

- **Example** (IEEE 754 Single-Precision):
  - **Binary Representation**: `0 10000001 01100000000000000000000`
    - **Sign Bit**: `0` (positive)
    - **Exponent**: `10000001` (129 in decimal, with a bias of 127, so the actual exponent is \(129 - 127 = 2\))
    - **Mantissa**: `01100000000000000000000` (1.011 in binary)

- **Formula**:
  - The floating-point number is calculated as:
    \[
    \text{Value} = (-1)^{\text{sign}} \times (1 + \text{mantissa}) \times 2^{\text{exponent} - \text{bias}}
    \]


#### FP32 (32-bit Floating-Point) Format

- **Definition**: FP32, also known as single-precision floating-point format, is a standard format for representing floating-point numbers using 32 bits. It is defined by the IEEE 754 standard.

- **Components**:
  - **Sign Bit**:
    - **Bit Position**: 1 bit
    - **Definition**: Indicates whether the number is positive or negative. `0` for positive, `1` for negative.

  - **Exponent**:
    - **Bit Position**: 8 bits
    - **Definition**: Determines the scale or magnitude of the number. The exponent is stored with a bias of 127.
    - **Range**: The exponent field allows values from 0 to 255. Values 0 and 255 are reserved for special cases: 0 encodes zero and denormalized numbers, and 255 encodes infinity and NaN.

  - **Mantissa (Significand)**:
    - **Bit Position**: 23 bits
    - **Definition**: Represents the precision part of the number. The mantissa includes an implicit leading 1 for normalized numbers, which is not explicitly stored.

- **Format**:
  - **Bit Layout**: `S EEEEEEEE MMMMMMMMMMMMMMMMMMMMMMM`
    - **S**: Sign bit (1 bit)
    - **E**: Exponent (8 bits)
    - **M**: Mantissa (23 bits)

- **Normalization**:
  - For normalized numbers, the mantissa always starts with a leading 1 (implicit), followed by the fractional part.

- **Special Values**:
  - **Zero**: Represented with all bits of exponent and mantissa as `0`.
  - **Infinity**: Represented with all bits of the exponent set to `1` and all bits of the mantissa set to `0`.
  - **NaN (Not-a-Number)**: Represented with all bits of the exponent set to `1` and a non-zero mantissa.

- **Example**:
  - **Binary Representation**: `0 10000001 01100000000000000000000`
    - **Sign Bit**: `0` (positive)
    - **Exponent**: `10000001` (129 in decimal, with a bias of 127, so the actual exponent is \(129 - 127 = 2\))
    - **Mantissa**: `01100000000000000000000` (1.011 in binary)
  - **Decimal Value**: Binary 1.011 is \(1 + 0.25 + 0.125 = 1.375\) in decimal, so the value is \(1.375 \times 2^2 = 5.5\).

- **Formula**:
  - The value represented by an FP32 number is calculated as:
    \[
    \text{Value} = (-1)^{\text{sign}} \times (1 + \text{mantissa}) \times 2^{\text{exponent} - 127}
    \]

- **Usage**:
  - FP32 is widely used in computing, graphics, and machine learning applications for its balance between precision and memory/storage efficiency.
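
The FP32 formula can be applied by hand to a bit pattern and cross-checked against Python's standard `struct` module. This is a sketch that handles normalized numbers only; `decode_fp32` is a hypothetical helper name, not a library function.

```python
import struct


def decode_fp32(bits: str) -> float:
    """Decode a normalized FP32 bit pattern using the sign/exponent/mantissa formula."""
    bits = bits.replace(" ", "")
    if len(bits) != 32:
        raise ValueError("expected 32 bits")
    sign = int(bits[0], 2)
    exponent = int(bits[1:9], 2)
    fraction = int(bits[9:], 2) / 2**23  # the 23 stored bits as a fraction in [0, 1)
    if exponent in (0, 255):
        raise ValueError("zero, denormalized, infinity and NaN are not handled here")
    return (-1) ** sign * (1 + fraction) * 2.0 ** (exponent - 127)


pattern = "0 10000001 01100000000000000000000"
decoded = decode_fp32(pattern)
print(decoded)  # 5.5

# Cross-check: let struct reinterpret the same 32 bits as a big-endian float.
raw_bytes = int(pattern.replace(" ", ""), 2).to_bytes(4, "big")
print(struct.unpack(">f", raw_bytes)[0])  # 5.5
```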

#### FP16 (16-bit Floating-Point) Format

- **Definition**: FP16, also known as half-precision floating-point format, is a compact floating-point representation using 16 bits. It is defined by the IEEE 754-2008 standard.

- **Components**:
  - **Sign Bit**:
    - **Bit Position**: 1 bit
    - **Definition**: Indicates whether the number is positive or negative. `0` for positive, `1` for negative.

  - **Exponent**:
    - **Bit Position**: 5 bits
    - **Definition**: Determines the scale or magnitude of the number. The exponent is stored with a bias of 15.
    - **Range**: The exponent field allows values from 0 to 31. Values 0 and 31 are reserved for special cases: 0 encodes zero and denormalized numbers, and 31 encodes infinity and NaN.

  - **Mantissa (Significand)**:
    - **Bit Position**: 10 bits
    - **Definition**: Represents the precision part of the number. The mantissa includes an implicit leading 1 for normalized numbers, which is not explicitly stored.

- **Format**:
  - **Bit Layout**: `S EEEEE MMMMMMMMMM`
    - **S**: Sign bit (1 bit)
    - **E**: Exponent (5 bits)
    - **M**: Mantissa (10 bits)

- **Normalization**:
  - For normalized numbers, the mantissa always starts with a leading 1 (implicit), followed by the fractional part.

- **Special Values**:
  - **Zero**: Represented with all bits of exponent and mantissa as `0`.
  - **Infinity**: Represented with all bits of the exponent set to `1` and all bits of the mantissa set to `0`.
  - **NaN (Not-a-Number)**: Represented with all bits of the exponent set to `1` and a non-zero mantissa.

- **Example**:
  - **Binary Representation**: `0 10001 1010000000`
    - **Sign Bit**: `0` (positive)
    - **Exponent**: `10001` (17 in decimal, with a bias of 15, so the actual exponent is \(17 - 15 = 2\))
    - **Mantissa**: `1010000000` (1.101 in binary)
  - **Decimal Value**: Binary 1.101 is \(1 + 0.5 + 0.125 = 1.625\) in decimal, so the value is \(1.625 \times 2^2 = 6.5\).

- **Formula**:
  - The value represented by an FP16 number is calculated as:
    \[
    \text{Value} = (-1)^{\text{sign}} \times (1 + \text{mantissa}) \times 2^{\text{exponent} - 15}
    \]

- **Usage**:
  - FP16 is used in various applications where reduced precision is acceptable and memory efficiency is critical, such as in graphics processing, machine learning, and some scientific computations.
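
Going the other way, from a value to its FP16 fields, can be sketched with the standard `struct` module, whose `e` format code packs IEEE 754 half-precision values. The helper names `fp16_fields` and `round_to_fp16` are assumptions made for this example.

```python
import struct


def fp16_fields(x: float) -> tuple[str, str, str]:
    """Return the (sign, exponent, mantissa) bit strings of x stored as FP16."""
    (raw,) = struct.unpack(">H", struct.pack(">e", x))  # reinterpret as a 16-bit integer
    bits = format(raw, "016b")
    return bits[0], bits[1:6], bits[6:]


def round_to_fp16(x: float) -> float:
    """Round x to the nearest value representable in FP16."""
    return struct.unpack(">e", struct.pack(">e", x))[0]


print(fp16_fields(6.5))      # ('0', '10001', '1010000000')
print(round_to_fp16(0.333))  # 0.3330078125
```

With only 10 stored mantissa bits, 0.333 cannot be represented exactly, so it is rounded to the nearest FP16 value.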

In [7]:
value = 0.333

In [8]:
# 64-bit floating point
tensor_fp64 = torch.tensor(value, dtype = torch.float64)

In [9]:
print(f"fp64 tensor: {format(tensor_fp64.item(), '.60f')}")

fp64 tensor: 0.333000000000000018207657603852567262947559356689453125000000


In [10]:
tensor_fp32 = torch.tensor(value, dtype = torch.float32)
tensor_fp16 = torch.tensor(value, dtype = torch.float16)
tensor_bf16 = torch.tensor(value, dtype = torch.bfloat16)

In [11]:
print(f"fp64 tensor: {format(tensor_fp64.item(), '.60f')}")
print(f"fp32 tensor: {format(tensor_fp32.item(), '.60f')}")
print(f"fp16 tensor: {format(tensor_fp16.item(), '.60f')}")
print(f"bf16 tensor: {format(tensor_bf16.item(), '.60f')}")

fp64 tensor: 0.333000000000000018207657603852567262947559356689453125000000
fp32 tensor: 0.333000004291534423828125000000000000000000000000000000000000
fp16 tensor: 0.333007812500000000000000000000000000000000000000000000000000
bf16 tensor: 0.332031250000000000000000000000000000000000000000000000000000


#### Downcasting

- **Definition**: 
  - Downcasting refers to converting a variable from a larger data type to a smaller one, such as converting a 64-bit floating-point number (FP64) to a 32-bit floating-point number (FP32) or converting a 32-bit integer to a 16-bit integer.
  - This process reduces the amount of memory used by the variable but may result in the loss of precision or overflow if the smaller type cannot accurately represent the original value.

- **Examples of Downcasting**:
  1. **Floating-Point Downcasting**:
     - Converting a `float64` (double-precision) to `float32` (single-precision).
     - Example:
       ```python
       import numpy as np
       
       # Original 64-bit floating-point
       fp64_value = np.float64(123456789.123456789)
       print(fp64_value)  # Output: 123456789.12345679
       
       # Downcast to 32-bit floating-point
       fp32_value = np.float32(fp64_value)
       # float() shows the exact stored value rather than NumPy's shortest repr
       print(float(fp32_value))  # Output: 123456792.0
       ```
       - **Observation**: The downcasting results in a loss of precision due to the smaller bit-width of `float32`.

  2. **Integer Downcasting**:
     - Converting an `int32` to `int16`.
     - Example:
       ```python
       # Original 32-bit integer
       int32_value = np.int32(32768)
       print(int32_value)  # Output: 32768
       
       # Downcast to 16-bit integer; astype performs a C-style cast that wraps around
       int16_value = int32_value.astype(np.int16)
       print(int16_value)  # Output: -32768 (32768 wraps around past the int16 maximum)
       ```
       - **Observation**: The downcasting results in an overflow, as `int16` can only represent values from -32768 to 32767.

- **Potential Issues with Downcasting**:
  - **Loss of Precision**: Especially in floating-point downcasting, where significant digits may be lost.
  - **Overflow**: In integer downcasting, values that exceed the target type's range can wrap around or cause unexpected results.
  - **Truncation**: When downcasting from a floating-point to an integer type, the fractional part is truncated, which can lead to inaccuracies.

- **Usage Considerations**:
  - Downcasting is useful in memory-constrained environments or when working with large datasets where precision is less critical.
  - It's essential to ensure that downcasting does not lead to unacceptable errors in your computations.
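
One way to check whether a downcast is acceptable is to measure the error it introduces before committing to it. The sketch below uses only the standard `struct` module (`f` for FP32, `e` for FP16); `round_to` and `relative_error` are hypothetical helper names. Note that `struct` raises `OverflowError` for values that are too large for the target format, whereas tensor libraries typically produce `inf` instead.

```python
import struct


def round_to(x: float, fmt: str) -> float:
    """Round a Python float (FP64) to the format named by a struct code.

    'f' is IEEE 754 single precision (FP32), 'e' is half precision (FP16).
    """
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def relative_error(x: float, fmt: str) -> float:
    """Relative error introduced by rounding x to the target format."""
    return abs(round_to(x, fmt) - x) / abs(x)


x = 123456789.123456789
print(round_to(x, "f"))        # 123456792.0
print(relative_error(x, "f"))  # small but non-zero

try:
    round_to(x, "e")           # the largest finite FP16 value is 65504
except OverflowError as exc:
    print(f"cannot downcast to FP16: {exc}")
```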

#### Use case
- **Mixed-precision training**:
  - Do computation in lower precision (FP16, BF16, FP8).
  - Store and update the weights in higher precision (FP32).
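
The reason for keeping the weights in higher precision can be seen in a toy simulation. This is not a training loop, only a sketch under simplifying assumptions: a single weight starting at 1.0, a constant update of 1e-4, and FP16/FP32 rounding emulated with the standard `struct` module. All names are made up for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


update = 1e-4
weight_fp16 = to_fp16(1.0)    # weight stored and updated in FP16
weight_master = to_fp32(1.0)  # weight stored and updated in FP32

for _ in range(100):
    # Near 1.0 the gap between FP16 values is about 0.001, so adding 1e-4 rounds back to 1.0.
    weight_fp16 = to_fp16(weight_fp16 + to_fp16(update))
    weight_master = to_fp32(weight_master + update)

print(weight_fp16)    # 1.0 (every update was rounded away)
print(weight_master)  # approximately 1.01
```

In this simulation the FP16 weight never moves, while the FP32 copy accumulates all 100 updates.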