# Files

---

## Table of Contents
```{contents}
```

---

## Programming Environment

In [17]:
from   html.entities import codepoint2name
import string
from   typing import List, Union
import unicodedata

from   datetime import datetime
import locale   as l
import platform as p
import sys
print(datetime.now())
print()
print(f"{'Platform':<20}: {p.mac_ver()[0]} | {p.system()} | {p.release()} | {p.machine()}")
print(f"{'':<20}: {l.getpreferredencoding()}")
print()
print(f"{'Python':<20}: {sys.version}")
print(f"{      '':<20}: {sys.version_info}")
print(f"{      '':<20}: {p.python_implementation()}")

2023-11-19 14:36:46.417510

Platform            : 14.1.1 | Darwin | 23.1.0 | arm64
                    : UTF-8

Python              : 3.11.5 | packaged by conda-forge | (main, Aug 27 2023, 03:33:12) [Clang 15.0.7 ]
                    : sys.version_info(major=3, minor=11, micro=5, releaselevel='final', serial=0)
                    : CPython


---

## Auxiliary

In [18]:
def dec_to_hex (dec : int = 2**16 - 1) -> str:
  """ Compose the hexadecimal representation
      as a string
      of a nonnegative integer.
  
  params: int (dec)
  return: str
  """
  assert 0 <= dec, 'Try again with a nonnegative integer.'
  return format(dec, '>06x').upper()

test_cases = [
  -1, 0, 1, 2**16 - 1, 2**32 - 1,
]
for test_case in test_cases:
  try:
    print(f"Case {test_case:<10}: {repr(dec_to_hex(test_case))}")
  except AssertionError as e:
    print(f"Case {test_case:<10}: {e}")

Case -1        : Try again with a nonnegative integer.
Case 0         : '000000'
Case 1         : '000001'
Case 65535     : '00FFFF'
Case 4294967295: 'FFFFFFFF'


In [19]:
def to_codepoint (hexa : str = '10FFFF') -> str:
  """ Compose a Unicode code point
      as a string.

  params: str (hexa)
  return: str
  """
  assert int(hexa, base=16) <= 0x10FFFF, 'Try again with a valid code point.'
  return fr'\U00{hexa}'

test_cases = [
  dec_to_hex(i) for i in range(5)
]
for test_case in test_cases:
  try:
    print(f"Case {test_case:<10}: {repr(to_codepoint(test_case))}")
  except AssertionError as e:
    print(f"Case {test_case:<10}: {e}")

Case 000000    : '\\U00000000'
Case 000001    : '\\U00000001'
Case 000002    : '\\U00000002'
Case 000003    : '\\U00000003'
Case 000004    : '\\U00000004'


In [20]:
# Nonnegative integer no greater than 0x10FFFF (1_114_111)
dec_to_glyph = chr

print(repr(dec_to_glyph(0xFF)))

'ÿ'


In [21]:
def to_glyph (code_point : str = '\\u00FF') -> str:
  """ Convert a raw Unicode code point to its non raw (graphical) form.
  
  params: str (code_point)
  return: str
  """
  return code_point.encode('utf-8').decode('unicode-escape')

print(to_glyph())

ÿ


In [22]:
def print_code_point_information (points : str = 'hello world') -> None:
  """ Prints information about Unicode code points.
  
      Prints the sequence of code points
             the number of code points in the sequence
             the following information for each Unicode code point in a sequence of code points:
               * glyph
               * raw code point
               * hexadecimal repr
               * byte repr
               * Unicode category
               * Named entity repr
               * Unicode name

  params: str (points)
  return: None
  """
  print(points)
  print(len(points))
  print()
  print(f"{'Glyph':<10} "
        f"{'Code Point':<10} "
        f"{'Hex':<10} "
        f"{'Bytes':<20} "
        f"{'Category':<10} "
        f"{'Named Entity':<20} "
        f"{'Name':<10}")
  try:
    for point in points:
      hex_rep    = dec_to_hex(ord(point))
      code_point = to_codepoint(hex_rep)
      glyph      = to_glyph(code_point)
      unicode = (f"{chr(ord(point)):<10} "
                f"{code_point:<10} "
                f"{format(ord(point), '06x').upper():<10} "
                f"{str(point.encode('utf-8')):<20} "
                f"{unicodedata.category(chr(ord(point))):<10} ")
      try:
        unicode += f"{codepoint2name[ord(point)]:<20} "
      except KeyError as e:
        unicode += f"{'NO NAMED ENTITY':<20} "
      try:
        unicode += f"{unicodedata.name(point)}"
      except ValueError as e:
        unicode += f"NO UNICODE NAME"
      print(unicode)
  except AssertionError as e:
    print(f"Case {point}: {e}")

print_code_point_information()

hello world
11

Glyph      Code Point Hex        Bytes                Category   Named Entity         Name      
h          \U00000068 000068     b'h'                 Ll         NO NAMED ENTITY      LATIN SMALL LETTER H
e          \U00000065 000065     b'e'                 Ll         NO NAMED ENTITY      LATIN SMALL LETTER E
l          \U0000006C 00006C     b'l'                 Ll         NO NAMED ENTITY      LATIN SMALL LETTER L
l          \U0000006C 00006C     b'l'                 Ll         NO NAMED ENTITY      LATIN SMALL LETTER L
o          \U0000006F 00006F     b'o'                 Ll         NO NAMED ENTITY      LATIN SMALL LETTER O
           \U00000020 000020     b' '                 Zs         NO NAMED ENTITY      SPACE
w          \U00000077 000077     b'w'                 Ll         NO NAMED ENTITY      LATIN SMALL LETTER W
o          \U0000006F 00006F     b'o'                 Ll         NO NAMED ENTITY      LATIN SMALL LETTER O
r          \U00000072 000072     b'r'         

---

## Text Encoding

A [_text encoding_](https://docs.python.org/3/glossary.html#term-text-encoding) is a [codec](https://docs.python.org/3/library/codecs.html#encodings-and-unicode) that serializes text to bytes and deserializes bytes back to text.

_Encoding_ is the serialization of a string into a sequence of bytes and _decoding_ is the deserialization of a sequence of bytes into a string.
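A minimal round-trip sketch of the two directions, using the built-in `str.encode` and `bytes.decode`. The sample string and the choice of the `utf-8` codec are only illustrative assumptions.

```python
# Encoding serializes a str into bytes; decoding reverses it.
text = "naïve"                      # illustrative sample, not from the notes above
encoded = text.encode("utf-8")      # str   -> bytes
decoded = encoded.decode("utf-8")   # bytes -> str

print(encoded)          # the non-ASCII letter takes two bytes in UTF-8
print(decoded == text)  # the round trip is lossless
```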

### ASCII

[ [w](https://en.wikipedia.org/wiki/Basic_Latin_(Unicode_block)) ] Basic Latin

The 128 code points 0-127 are mapped to the single bytes 0x00-0x7F: the most significant bit is always 0
 and the remaining 7 bits encode the code point. 

Code Points | Encoding
------------|---------
`0...127`   | `0x00...0x7F`
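The mapping can be checked directly with the `ascii` codec. This is a small sketch; the variable names are illustrative only.

```python
# Every ASCII code point n encodes to the single byte n, whose top bit is 0.
ascii_bytes = bytes(range(128))
ascii_text = ascii_bytes.decode("ascii")

identity = all(ord(ch) == b for ch, b in zip(ascii_text, ascii_bytes))
high_bit_clear = all(b & 0x80 == 0 for b in ascii_text.encode("ascii"))

print(identity, high_bit_clear)
print(format(ord("A"), "08b"))  # leading 0, then the 7 bits of the code point

# Code point 128 is outside the codec's range.
try:
    chr(128).encode("ascii")
    rejected = False
except UnicodeEncodeError:
    rejected = True
print(rejected)
```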

33 control codes

95 printable characters
* 26 uppercase letters
* 26 lowercase letters
* 10 digits
* 32 punctuation
* 1 whitespace

In [None]:
26 * 2 + 10 + 32 + 1 + 33

128

---

#### Control Codes

Control Characters
* [ [u](https://unicode.org/charts/PDF/U2400.pdf) ] Control Pictures
* [ [w](https://en.wikipedia.org/wiki/Null_character) ] NUL
* [ [w](https://en.wikipedia.org/wiki/End-of-Text_character) ] ETX
* [ [w](https://en.wikipedia.org/wiki/End-of-Transmission_character) ] EOT
* [ [w](https://en.wikipedia.org/wiki/Enquiry_character) ] ENQ
* [ [w](https://en.wikipedia.org/wiki/Acknowledgement_(data_networks)) ] ACK
* [ [w](https://en.wikipedia.org/wiki/Bell_character) ] BEL
* [ [w](https://en.wikipedia.org/wiki/Backspace) ] BS
* [ [w](https://en.wikipedia.org/wiki/Tab_key) ] HT
* [ [w](https://en.wikipedia.org/wiki/Newline) ] LF
* [ [w](https://en.wikipedia.org/wiki/Page_break#Form_feed) ] FF
* [ [w](https://en.wikipedia.org/wiki/Carriage_return) ] CR
* [ [w](https://en.wikipedia.org/wiki/Shift_Out_and_Shift_In_characters) ] SO
* [ [w](https://en.wikipedia.org/wiki/Shift_Out_and_Shift_In_characters) ] SI
* [ [w](https://en.wikipedia.org/wiki/Acknowledgement_(data_networks)) ] NAK
* [ [w](https://en.wikipedia.org/wiki/Synchronous_Idle) ] SYN
* [ [w](https://en.wikipedia.org/wiki/End-of-Transmission-Block_character) ] ETB
* [ [w](https://en.wikipedia.org/wiki/Cancel_character) ] CAN
* [ [w](https://en.wikipedia.org/wiki/Substitute_character) ] SUB
* [ [w](https://en.wikipedia.org/wiki/Escape_character) ] ESC
* [ [w](https://en.wikipedia.org/wiki/Delete_character) ] DEL

ASCII | Abbreviation | Caret Notation | Signal | Escape | HTML Entity | Percent Code | Unicode | Unicode Name
------|--------------|----------------|--------|--------|-------------|--------------|---------|-------------
0     | NUL          | ^@ |        | \0 | `&#0000;`, `&#x0000;` | %00 | U+0000 | NULL
1     | SOH          | ^A |        |    | `&#0001;`, `&#x0001;` | | U+0001 | START OF HEADING
2     | STX          | ^B |        |    | `&#0002;`, `&#x0002;` | | U+0002 | START OF TEXT
3     | ETX          | ^C | Ctrl-C |    | `&#0003;`, `&#x0003;` | | U+0003 | END OF TEXT
4     | EOT          | ^D | Ctrl-D |    | `&#0004;`, `&#x0004;` | | U+0004 | END OF TRANSMISSION
5     | ENQ          | ^E |        |    | `&#0005;`, `&#x0005;` | | U+0005 | ENQUIRY
6     | ACK          | ^F |        |    | `&#0006;`, `&#x0006;` | | U+0006 | ACKNOWLEDGE
7     | BEL          | ^G |        | \a | `&#0007;`, `&#x0007;` | | U+0007 | BELL
8     | BS           | ^H |        | \b | `&#0008;`, `&#x0008;` | | U+0008 | BACKSPACE
9     | HT           | ^I |        | \t | `&#0009;`, `&#x0009;` | | U+0009 | CHARACTER TABULATION (horizontal tabulation, tab)
10    | LF           | ^J |        | \n | `&#0010;`, `&#x000A;` | | U+000A | LINE FEED (new line NL, end of line EOL)
11    | VT           | ^K |        | \v | `&#0011;`, `&#x000B;` | | U+000B | LINE TABULATION (vertical tabulation)
12    | FF           | ^L | Ctrl-L | \f | `&#0012;`, `&#x000C;` | | U+000C | FORM FEED
13    | CR           | ^M |        | \r | `&#0013;`, `&#x000D;` | | U+000D | CARRIAGE RETURN
14    | SO           | ^N |        |    | `&#0014;`, `&#x000E;` | | U+000E | SHIFT OUT (locking-shift one)
15    | SI           | ^O |        |    | `&#0015;`, `&#x000F;` | | U+000F | SHIFT IN (locking-shift zero)
16    | DLE          | ^P |        |    | `&#0016;`, `&#x0010;` | | U+0010 | DATA LINK ESCAPE
17    | DC1          | ^Q |        |    | `&#0017;`, `&#x0011;` | | U+0011 | DEVICE CONTROL ONE
18    | DC2          | ^R |        |    | `&#0018;`, `&#x0012;` | | U+0012 | DEVICE CONTROL TWO
19    | DC3          | ^S |        |    | `&#0019;`, `&#x0013;` | | U+0013 | DEVICE CONTROL THREE
20    | DC4          | ^T |        |    | `&#0020;`, `&#x0014;` | | U+0014 | DEVICE CONTROL FOUR
21    | NAK          | ^U |        |    | `&#0021;`, `&#x0015;` | | U+0015 | NEGATIVE ACKNOWLEDGE
22    | SYN          | ^V |        |    | `&#0022;`, `&#x0016;` | | U+0016 | SYNCHRONOUS IDLE
23    | ETB          | ^W |        |    | `&#0023;`, `&#x0017;` | | U+0017 | END OF TRANSMISSION BLOCK
24    | CAN          | ^X |        |    | `&#0024;`, `&#x0018;` | | U+0018 | CANCEL
25    | EM           | ^Y |        |    | `&#0025;`, `&#x0019;` | | U+0019 | END OF MEDIUM
26    | SUB          | ^Z | Ctrl-Z |    | `&#0026;`, `&#x001A;` | | U+001A | SUBSTITUTE
27    | ESC          | ^[ |        | \e | `&#0027;`, `&#x001B;` | | U+001B | ESCAPE
28    | FS           | ^\ |        |    | `&#0028;`, `&#x001C;` | | U+001C | INFORMATION SEPARATOR FOUR (file separator)
29    | GS           | ^] |        |    | `&#0029;`, `&#x001D;` | | U+001D | INFORMATION SEPARATOR THREE (group separator)
30    | RS           | ^^ |        |    | `&#0030;`, `&#x001E;` | | U+001E | INFORMATION SEPARATOR TWO (record separator)
31    | US           | ^_ |        |    | `&#0031;`, `&#x001F;` | | U+001F | INFORMATION SEPARATOR ONE (unit separator)
127   | DEL          | ^? |        |    | `&#0127;`, `&#x007F;` | | U+007F | DELETE
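The caret notation column follows a simple rule: flip bit 6 (XOR with `0x40`) of the control code and print the resulting character after a `^`. The helper below is a hypothetical sketch of that rule, not a standard-library function.

```python
def caret_notation(code: int) -> str:
    """Return the caret notation for an ASCII control code (0-31 or 127)."""
    if not (0 <= code <= 31 or code == 127):
        raise ValueError("not an ASCII control code")
    # XOR with 0x40 maps 0 -> '@', 1 -> 'A', ..., 31 -> '_', and 127 -> '?'.
    return "^" + chr(code ^ 0x40)

for code in (0, 3, 27, 127):
    print(code, caret_notation(code))
```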

ASCII | Control Picture  | HTML Entity            | Unicode | Unicode Name
------|------------------|------------------------|---------|-------------
0     | &#x2400;         | `&#9216;`, `&#x2400;`  | U+2400  | SYMBOL FOR NULL
1     | &#x2401;         | `&#9217;`, `&#x2401;`  | U+2401  | SYMBOL FOR START OF HEADING
2     | &#x2402;         | `&#9218;`, `&#x2402;`  | U+2402  | SYMBOL FOR START OF TEXT
3     | &#x2403;         | `&#9219;`, `&#x2403;`  | U+2403  | SYMBOL FOR END OF TEXT
4     | &#x2404;         | `&#9220;`, `&#x2404;`  | U+2404  | SYMBOL FOR END OF TRANSMISSION
5     | &#x2405;         | `&#9221;`, `&#x2405;`  | U+2405  | SYMBOL FOR ENQUIRY
6     | &#x2406;         | `&#9222;`, `&#x2406;`  | U+2406  | SYMBOL FOR ACKNOWLEDGE
7     | &#x2407;         | `&#9223;`, `&#x2407;`  | U+2407  | SYMBOL FOR BELL
8     | &#x2408;         | `&#9224;`, `&#x2408;`  | U+2408  | SYMBOL FOR BACKSPACE
9     | &#x2409;         | `&#9225;`, `&#x2409;`  | U+2409  | SYMBOL FOR HORIZONTAL TABULATION
10    | &#x240A;         | `&#9226;`, `&#x240A;`  | U+240A  | SYMBOL FOR LINE FEED
11    | &#x240B;         | `&#9227;`, `&#x240B;`  | U+240B  | SYMBOL FOR VERTICAL TABULATION
12    | &#x240C;         | `&#9228;`, `&#x240C;`  | U+240C  | SYMBOL FOR FORM FEED
13    | &#x240D;         | `&#9229;`, `&#x240D;`  | U+240D  | SYMBOL FOR CARRIAGE RETURN
14    | &#x240E;         | `&#9230;`, `&#x240E;`  | U+240E  | SYMBOL FOR SHIFT OUT
15    | &#x240F;         | `&#9231;`, `&#x240F;`  | U+240F  | SYMBOL FOR SHIFT IN
16    | &#x2410;         | `&#9232;`, `&#x2410;`  | U+2410  | SYMBOL FOR DATA LINK ESCAPE
17    | &#x2411;         | `&#9233;`, `&#x2411;`  | U+2411  | SYMBOL FOR DEVICE CONTROL ONE
18    | &#x2412;         | `&#9234;`, `&#x2412;`  | U+2412  | SYMBOL FOR DEVICE CONTROL TWO
19    | &#x2413;         | `&#9235;`, `&#x2413;`  | U+2413  | SYMBOL FOR DEVICE CONTROL THREE
20    | &#x2414;         | `&#9236;`, `&#x2414;`  | U+2414  | SYMBOL FOR DEVICE CONTROL FOUR
21    | &#x2415;         | `&#9237;`, `&#x2415;`  | U+2415  | SYMBOL FOR NEGATIVE ACKNOWLEDGE
22    | &#x2416;         | `&#9238;`, `&#x2416;`  | U+2416  | SYMBOL FOR SYNCHRONOUS IDLE
23    | &#x2417;         | `&#9239;`, `&#x2417;`  | U+2417  | SYMBOL FOR END OF TRANSMISSION BLOCK
24    | &#x2418;         | `&#9240;`, `&#x2418;`  | U+2418  | SYMBOL FOR CANCEL
25    | &#x2419;         | `&#9241;`, `&#x2419;`  | U+2419  | SYMBOL FOR END OF MEDIUM
26    | &#x241A;         | `&#9242;`, `&#x241A;`  | U+241A  | SYMBOL FOR SUBSTITUTE
27    | &#x241B;         | `&#9243;`, `&#x241B;`  | U+241B  | SYMBOL FOR ESCAPE
28    | &#x241C;         | `&#9244;`, `&#x241C;`  | U+241C  | SYMBOL FOR FILE SEPARATOR
29    | &#x241D;         | `&#9245;`, `&#x241D;`  | U+241D  | SYMBOL FOR GROUP SEPARATOR
30    | &#x241E;         | `&#9246;`, `&#x241E;`  | U+241E  | SYMBOL FOR RECORD SEPARATOR
31    | &#x241F;         | `&#9247;`, `&#x241F;`  | U+241F  | SYMBOL FOR UNIT SEPARATOR
32    | &#x2420;         | `&#9248;`, `&#x2420;`  | U+2420  | SYMBOL FOR SPACE
127   | &#x2421;         | `&#9249;`, `&#x2421;`  | U+2421  | SYMBOL FOR DELETE
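Because the Control Pictures block mirrors the ASCII layout, a control picture can be computed by offset: code points 0-32 map to `U+2400 + code`, and DEL maps to `U+2421`. The function names below are hypothetical, chosen only for this sketch.

```python
import unicodedata

def control_picture(ch: str) -> str:
    """Return the control picture for a control code, space, or DEL; else ch."""
    code = ord(ch)
    if code <= 0x20:            # C0 controls and SPACE
        return chr(0x2400 + code)
    if code == 0x7F:            # DELETE
        return chr(0x2421)
    return ch

def visualize(text: str) -> str:
    """Make invisible ASCII characters visible."""
    return "".join(control_picture(ch) for ch in text)

print(visualize("a\tb\n"))
print(unicodedata.name(control_picture("\t")))
```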

---

#### Uppercase Letters

ASCII | Symbol | HTML Entity | Unicode Code Point | Unicode Name
------|--------|-------------|--------------------|-------------
65    | &#65;  | `&#65;`, `&#x41;` | U+0041 | LATIN CAPITAL LETTER A
66    | &#66;  | `&#66;`, `&#x42;` | U+0042 | LATIN CAPITAL LETTER B
67    | &#67;  | `&#67;`, `&#x43;` | U+0043 | LATIN CAPITAL LETTER C
68    | &#68;  | `&#68;`, `&#x44;` | U+0044 | LATIN CAPITAL LETTER D
69    | &#69;  | `&#69;`, `&#x45;` | U+0045 | LATIN CAPITAL LETTER E
70    | &#70;  | `&#70;`, `&#x46;` | U+0046 | LATIN CAPITAL LETTER F
71    | &#71;  | `&#71;`, `&#x47;` | U+0047 | LATIN CAPITAL LETTER G
72    | &#72;  | `&#72;`, `&#x48;` | U+0048 | LATIN CAPITAL LETTER H
73    | &#73;  | `&#73;`, `&#x49;` | U+0049 | LATIN CAPITAL LETTER I
74    | &#74;  | `&#74;`, `&#x4A;` | U+004A | LATIN CAPITAL LETTER J
75    | &#75;  | `&#75;`, `&#x4B;` | U+004B | LATIN CAPITAL LETTER K
76    | &#76;  | `&#76;`, `&#x4C;` | U+004C | LATIN CAPITAL LETTER L
77    | &#77;  | `&#77;`, `&#x4D;` | U+004D | LATIN CAPITAL LETTER M
78    | &#78;  | `&#78;`, `&#x4E;` | U+004E | LATIN CAPITAL LETTER N
79    | &#79;  | `&#79;`, `&#x4F;` | U+004F | LATIN CAPITAL LETTER O
80    | &#80;  | `&#80;`, `&#x50;` | U+0050 | LATIN CAPITAL LETTER P
81    | &#81;  | `&#81;`, `&#x51;` | U+0051 | LATIN CAPITAL LETTER Q
82    | &#82;  | `&#82;`, `&#x52;` | U+0052 | LATIN CAPITAL LETTER R
83    | &#83;  | `&#83;`, `&#x53;` | U+0053 | LATIN CAPITAL LETTER S
84    | &#84;  | `&#84;`, `&#x54;` | U+0054 | LATIN CAPITAL LETTER T
85    | &#85;  | `&#85;`, `&#x55;` | U+0055 | LATIN CAPITAL LETTER U
86    | &#86;  | `&#86;`, `&#x56;` | U+0056 | LATIN CAPITAL LETTER V
87    | &#87;  | `&#87;`, `&#x57;` | U+0057 | LATIN CAPITAL LETTER W
88    | &#88;  | `&#88;`, `&#x58;` | U+0058 | LATIN CAPITAL LETTER X
89    | &#89;  | `&#89;`, `&#x59;` | U+0059 | LATIN CAPITAL LETTER Y
90    | &#90;  | `&#90;`, `&#x5A;` | U+005A | LATIN CAPITAL LETTER Z

In [None]:
print_code_point_information(string.ascii_uppercase)

ABCDEFGHIJKLMNOPQRSTUVWXYZ
26

Glyph      Code Point Hex        Bytes                Category   Named Entity         Name      
A          \U00000041 000041     b'A'                 Lu         NO NAMED ENTITY      LATIN CAPITAL LETTER A
B          \U00000042 000042     b'B'                 Lu         NO NAMED ENTITY      LATIN CAPITAL LETTER B
C          \U00000043 000043     b'C'                 Lu         NO NAMED ENTITY      LATIN CAPITAL LETTER C
D          \U00000044 000044     b'D'                 Lu         NO NAMED ENTITY      LATIN CAPITAL LETTER D
E          \U00000045 000045     b'E'                 Lu         NO NAMED ENTITY      LATIN CAPITAL LETTER E
F          \U00000046 000046     b'F'                 Lu         NO NAMED ENTITY      LATIN CAPITAL LETTER F
G          \U00000047 000047     b'G'                 Lu         NO NAMED ENTITY      LATIN CAPITAL LETTER G
H          \U00000048 000048     b'H'                 Lu         NO NAMED ENTITY      LATIN CAPITAL LETTER H


---

#### Lowercase Letters

ASCII | Symbol | HTML Entity | Unicode Code Point | Unicode Name
------|--------|-------------|--------------------|-------------
97    | &#097; | `&#097;`, `&#x61;` | U+0061 | LATIN SMALL LETTER A
98    | &#098; | `&#098;`, `&#x62;` | U+0062 | LATIN SMALL LETTER B
99    | &#099; | `&#099;`, `&#x63;` | U+0063 | LATIN SMALL LETTER C
100   | &#100; | `&#100;`, `&#x64;` | U+0064 | LATIN SMALL LETTER D
101   | &#101; | `&#101;`, `&#x65;` | U+0065 | LATIN SMALL LETTER E
102   | &#102; | `&#102;`, `&#x66;` | U+0066 | LATIN SMALL LETTER F
103   | &#103; | `&#103;`, `&#x67;` | U+0067 | LATIN SMALL LETTER G
104   | &#104; | `&#104;`, `&#x68;` | U+0068 | LATIN SMALL LETTER H
105   | &#105; | `&#105;`, `&#x69;` | U+0069 | LATIN SMALL LETTER I
106   | &#106; | `&#106;`, `&#x6A;` | U+006A | LATIN SMALL LETTER J
107   | &#107; | `&#107;`, `&#x6B;` | U+006B | LATIN SMALL LETTER K
108   | &#108; | `&#108;`, `&#x6C;` | U+006C | LATIN SMALL LETTER L
109   | &#109; | `&#109;`, `&#x6D;` | U+006D | LATIN SMALL LETTER M
110   | &#110; | `&#110;`, `&#x6E;` | U+006E | LATIN SMALL LETTER N
111   | &#111; | `&#111;`, `&#x6F;` | U+006F | LATIN SMALL LETTER O
112   | &#112; | `&#112;`, `&#x70;` | U+0070 | LATIN SMALL LETTER P
113   | &#113; | `&#113;`, `&#x71;` | U+0071 | LATIN SMALL LETTER Q
114   | &#114; | `&#114;`, `&#x72;` | U+0072 | LATIN SMALL LETTER R
115   | &#115; | `&#115;`, `&#x73;` | U+0073 | LATIN SMALL LETTER S
116   | &#116; | `&#116;`, `&#x74;` | U+0074 | LATIN SMALL LETTER T
117   | &#117; | `&#117;`, `&#x75;` | U+0075 | LATIN SMALL LETTER U
118   | &#118; | `&#118;`, `&#x76;` | U+0076 | LATIN SMALL LETTER V
119   | &#119; | `&#119;`, `&#x77;` | U+0077 | LATIN SMALL LETTER W
120   | &#120; | `&#120;`, `&#x78;` | U+0078 | LATIN SMALL LETTER X
121   | &#121; | `&#121;`, `&#x79;` | U+0079 | LATIN SMALL LETTER Y
122   | &#122; | `&#122;`, `&#x7A;` | U+007A | LATIN SMALL LETTER Z

In [None]:
print_code_point_information(string.ascii_lowercase)

abcdefghijklmnopqrstuvwxyz
26

Glyph      Code Point Hex        Bytes                Category   Named Entity         Name      
a          \U00000061 000061     b'a'                 Ll         NO NAMED ENTITY      LATIN SMALL LETTER A
b          \U00000062 000062     b'b'                 Ll         NO NAMED ENTITY      LATIN SMALL LETTER B
c          \U00000063 000063     b'c'                 Ll         NO NAMED ENTITY      LATIN SMALL LETTER C
d          \U00000064 000064     b'd'                 Ll         NO NAMED ENTITY      LATIN SMALL LETTER D
e          \U00000065 000065     b'e'                 Ll         NO NAMED ENTITY      LATIN SMALL LETTER E
f          \U00000066 000066     b'f'                 Ll         NO NAMED ENTITY      LATIN SMALL LETTER F
g          \U00000067 000067     b'g'                 Ll         NO NAMED ENTITY      LATIN SMALL LETTER G
h          \U00000068 000068     b'h'                 Ll         NO NAMED ENTITY      LATIN SMALL LETTER H
i          \U000

In [None]:
# print_code_point_information(string.ascii_letters)

---

#### Digits

ASCII | Symbol | HTML Entity | Unicode Code Point | Unicode Name
------|--------|-------------|--------------------|-------------
48    | &#48;  | `&#48;`, `&#x30;` | U+0030 | DIGIT ZERO
49    | &#49;  | `&#49;`, `&#x31;` | U+0031 | DIGIT ONE
50    | &#50;  | `&#50;`, `&#x32;` | U+0032 | DIGIT TWO
51    | &#51;  | `&#51;`, `&#x33;` | U+0033 | DIGIT THREE
52    | &#52;  | `&#52;`, `&#x34;` | U+0034 | DIGIT FOUR
53    | &#53;  | `&#53;`, `&#x35;` | U+0035 | DIGIT FIVE
54    | &#54;  | `&#54;`, `&#x36;` | U+0036 | DIGIT SIX
55    | &#55;  | `&#55;`, `&#x37;` | U+0037 | DIGIT SEVEN
56    | &#56;  | `&#56;`, `&#x38;` | U+0038 | DIGIT EIGHT
57    | &#57;  | `&#57;`, `&#x39;` | U+0039 | DIGIT NINE

In [None]:
print_code_point_information(string.digits)

0123456789
10

Glyph      Code Point Hex        Bytes                Category   Named Entity         Name      
0          \U00000030 000030     b'0'                 Nd         NO NAMED ENTITY      DIGIT ZERO
1          \U00000031 000031     b'1'                 Nd         NO NAMED ENTITY      DIGIT ONE
2          \U00000032 000032     b'2'                 Nd         NO NAMED ENTITY      DIGIT TWO
3          \U00000033 000033     b'3'                 Nd         NO NAMED ENTITY      DIGIT THREE
4          \U00000034 000034     b'4'                 Nd         NO NAMED ENTITY      DIGIT FOUR
5          \U00000035 000035     b'5'                 Nd         NO NAMED ENTITY      DIGIT FIVE
6          \U00000036 000036     b'6'                 Nd         NO NAMED ENTITY      DIGIT SIX
7          \U00000037 000037     b'7'                 Nd         NO NAMED ENTITY      DIGIT SEVEN
8          \U00000038 000038     b'8'                 Nd         NO NAMED ENTITY      DIGIT EIGHT
9          \U00

---

#### Punctuation

ASCII | Symbol | HTML Entity                   | Unicode Code Point | Unicode Name
------|--------|-------------------------------|--------------------|-------------
32    | &#32;  | `&#32;`, `&#x20;`             | U+0020 | SPACE [ [w](https://en.wikipedia.org/wiki/Space_(punctuation)) ]
33    | &#33;  | `&#33;`, `&#x21;`, `&excl;`   | U+0021 | EXCLAMATION MARK (factorial, bang) [ [w](https://en.wikipedia.org/wiki/Exclamation_mark) ]
34    | &#34;  | `&#34;`, `&#x22;`, `&quot;`   | U+0022 | QUOTATION MARK (double quote) [ [w](https://en.wikipedia.org/wiki/Quotation_mark) ]
35    | &#35;  | `&#35;`, `&#x23;`             | U+0023 | NUMBER SIGN (pound sign, hash) [ [w](https://en.wikipedia.org/wiki/Number_sign) ]
36    | &#36;  | `&#36;`, `&#x24;`, `&dollar;` | U+0024 | DOLLAR SIGN [ [w](https://en.wikipedia.org/wiki/Dollar_sign) ]
37    | &#37;  | `&#37;`, `&#x25;`, `&percnt;` | U+0025 | PERCENT SIGN [ [w](https://en.wikipedia.org/wiki/Percent_sign) ]
38    | &#38;  | `&#38;`, `&#x26;`, `&amp;`    | U+0026 | AMPERSAND (and) [ [w](https://en.wikipedia.org/wiki/Ampersand) ]
39    | &#39;  | `&#39;`, `&#x27;`, `&apos;`   | U+0027 | APOSTROPHE (single quote)
40    | &#40;  | `&#40;`, `&#x28;`             | U+0028 | LEFT PARENTHESIS (opening parenthesis) [ [w](https://en.wikipedia.org/wiki/Bracket) ]
41    | &#41;  | `&#41;`, `&#x29;`             | U+0029 | RIGHT PARENTHESIS (closing parenthesis) [ [w](https://en.wikipedia.org/wiki/Bracket) ]
42    | &#42;  | `&#42;`, `&#x2A;`             | U+002A | ASTERISK (star) [ [w](https://en.wikipedia.org/wiki/Asterisk) ] (ἀστερίσκος "little star")
43    | &#43;  | `&#43;`, `&#x2B;`, `&plus;`   | U+002B | PLUS SIGN [ [w](https://en.wikipedia.org/wiki/Plus_and_minus_signs) ]
44    | &#44;  | `&#44;`, `&#x2C;`, `&comma;`  | U+002C | COMMA [ [w](https://en.wikipedia.org/wiki/Comma) ]
45    | &#45;  | `&#45;`, `&#x2D;`             | U+002D | HYPHEN-MINUS [ [w](https://en.wikipedia.org/wiki/Hyphen-minus) ] (hyphen [ [w](https://en.wikipedia.org/wiki/Hyphen) ], dash [ [w](https://en.wikipedia.org/wiki/Dash) ], minus sign [ [w](https://en.wikipedia.org/wiki/Plus_and_minus_signs#Minus_sign) ])
46    | &#46;  | `&#46;`, `&#x2E;`, `&period;` | U+002E | FULL STOP (period, dot, decimal point) [ [w](https://en.wikipedia.org/wiki/Full_stop) ]
47    | &#47;  | `&#47;`, `&#x2F;`             | U+002F | SOLIDUS (slash, forward slash) [ [w](https://en.wikipedia.org/wiki/Slash_(punctuation)) ]
58    | &#58;  | `&#58;`, `&#x3A;`, `&colon;`  | U+003A | COLON [ [w](https://en.wikipedia.org/wiki/Colon_(punctuation)) ]
59    | &#59;  | `&#59;`, `&#x3B;`, `&semi;`   | U+003B | SEMICOLON [ [w](https://en.wikipedia.org/wiki/Semicolon) ]
60    | &#60;  | `&#60;`, `&#x3C;`, `&lt;`     | U+003C | LESS-THAN SIGN [ [w](https://en.wikipedia.org/wiki/Less-than_sign) ]
61    | &#61;  | `&#61;`, `&#x3D;`, `&equals;` | U+003D | EQUALS SIGN [ [w](https://en.wikipedia.org/wiki/Equals_sign) ]
62    | &#62;  | `&#62;`, `&#x3E;`, `&gt;`     | U+003E | GREATER-THAN SIGN [ [w](https://en.wikipedia.org/wiki/Greater-than_sign) ]
63    | &#63;  | `&#63;`, `&#x3F;`, `&quest;`  | U+003F | QUESTION MARK [ [w](https://en.wikipedia.org/wiki/Question_mark) ]
64    | &#64;  | `&#64;`, `&#x40;`             | U+0040 | COMMERCIAL AT (at sign) [ [w](https://en.wikipedia.org/wiki/At_sign) ]
91    | &#91;  | `&#91;`, `&#x5B;`             | U+005B | LEFT SQUARE BRACKET (opening square bracket) [ [w](https://en.wikipedia.org/wiki/Bracket) ]
92    | &#92;  | `&#92;`, `&#x5C;`             | U+005C | REVERSE SOLIDUS (backslash) [ [w](https://en.wikipedia.org/wiki/Backslash) ]
93    | &#93;  | `&#93;`, `&#x5D;`             | U+005D | RIGHT SQUARE BRACKET (closing square bracket) [ [w](https://en.wikipedia.org/wiki/Bracket) ]
94    | &#94;  | `&#94;`, `&#x5E;`             | U+005E | CIRCUMFLEX ACCENT ("caret", "hat") [ [w](https://en.wikipedia.org/wiki/Caret_(computing)) ]
95    | &#95;  | `&#95;`, `&#x5F;`             | U+005F | LOW LINE ("underscore") [ [w](https://en.wikipedia.org/wiki/Underscore) ]
96    | &#96;  | `&#96;`, `&#x60;`             | U+0060 | GRAVE ACCENT (backtick, backquote) [ [w](https://en.wikipedia.org/wiki/Backtick) ]
123   | &#123; | `&#123;`, `&#x7B;`            | U+007B | LEFT CURLY BRACKET (opening curly bracket, left brace) [ [w](https://en.wikipedia.org/wiki/Bracket) ]
124   | &#124; | `&#124;`, `&#x7C;`            | U+007C | VERTICAL LINE (vertical bar, pipe) [ [w](https://en.wikipedia.org/wiki/Vertical_bar) ]
125   | &#125; | `&#125;`, `&#x7D;`            | U+007D | RIGHT CURLY BRACKET (closing curly bracket, right brace) [ [w](https://en.wikipedia.org/wiki/Bracket) ]
126   | &#126; | `&#126;`, `&#x7E;`            | U+007E | TILDE [ [w](https://en.wikipedia.org/wiki/Tilde) ]
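Named entities such as `&excl;` come from the HTML5 set, whereas `html.entities.codepoint2name` covers only the older HTML 4.01 set. The sketch below builds a reverse lookup from `html.entities.html5`; the `names_by_char` dictionary is an illustrative helper, not part of the standard library.

```python
from html import unescape
from html.entities import codepoint2name, html5

# Reverse lookup: character(s) -> list of HTML5 entity names.
# html5 keys look like 'excl;' (some legacy names also appear without ';').
names_by_char = {}
for name, value in html5.items():
    if name.endswith(";"):
        names_by_char.setdefault(value, []).append(name[:-1])

print(ord("!") in codepoint2name)   # is '!' in the HTML 4.01 table?
print(names_by_char["!"])           # HTML5 names for '!'
print(unescape("&excl;&quest;"))    # entity references back to characters
```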

Dashes

Symbol   | HTML Entity | Unicode Code Point | Unicode Name
---------|-------------|--------------------|-------------
&#x002D; | `&#x002D;` | U+002D | HYPHEN-MINUS (hyphen, dash, minus sign)
&#x2010; | `&#x2010;` | U+2010 | HYPHEN
&#x2012; | `&#x2012;` | U+2012 | FIGURE DASH
&#x2013; | `&#x2013;` | U+2013 | EN DASH
&#x2014; | `&#x2014;` | U+2014 | EM DASH
&#x2015; | `&#x2015;` | U+2015 | HORIZONTAL BAR
&#x2212; | `&#x2212;` | U+2212 | MINUS SIGN

Symbol   | HTML Entity | Unicode Code Point | Unicode Name
---------|-------------|--------------------|-------------
&#x00AD; | `&#x00AD;` | U+00AD | SOFT HYPHEN
&#x02D7; | `&#x02D7;` | U+02D7 | MODIFIER LETTER MINUS SIGN
&#x2011; | `&#x2011;` | U+2011 | NON-BREAKING HYPHEN
&#x2027; | `&#x2027;` | U+2027 | HYPHENATION POINT
&#x2043; | `&#x2043;` | U+2043 | HYPHEN BULLET
&#x10191; | `&#x10191;` | U+10191 | ROMAN UNCIA SIGN

Quotation Marks

Symbol   | HTML Entity | Unicode Code Point | Unicode Name
---------|-------------|--------------------|-------------
&#x2018; | `&#x2018;`  | U+2018             | LEFT SINGLE QUOTATION MARK
&#x2019; | `&#x2019;`  | U+2019             | RIGHT SINGLE QUOTATION MARK
&#x201C; | `&#x201C;`  | U+201C             | LEFT DOUBLE QUOTATION MARK
&#x201D; | `&#x201D;`  | U+201D             | RIGHT DOUBLE QUOTATION MARK

Symbol   | HTML Entity | Unicode Code Point | Unicode Name
---------|-------------|--------------------|-------------
&#x0022; | `&#x0022;`  | U+0022             | QUOTATION MARK (double quote)
&#x02B9; | `&#x02B9;`  | U+02B9             | MODIFIER LETTER PRIME
&#x02BA; | `&#x02BA;`  | U+02BA             | MODIFIER LETTER DOUBLE PRIME
&#x02BC; | `&#x02BC;`  | U+02BC             | MODIFIER LETTER APOSTROPHE
&#x02C8; | `&#x02C8;`  | U+02C8             | MODIFIER LETTER VERTICAL LINE
&#x02DD; | `&#x02DD;`  | U+02DD             | DOUBLE ACUTE ACCENT
&#x02EE; | `&#x02EE;`  | U+02EE             | MODIFIER LETTER DOUBLE APOSTROPHE
&#x0301; | `&#x0301;`  | U+0301             | COMBINING ACUTE ACCENT
&#x030B; | `&#x030B;`  | U+030B             | COMBINING DOUBLE ACUTE ACCENT
&#x030D; | `&#x030D;`  | U+030D             | COMBINING VERTICAL LINE ABOVE
&#x030E; | `&#x030E;`  | U+030E             | COMBINING DOUBLE VERTICAL LINE ABOVE
&#x05F3; | `&#x05F3;`  | U+05F3             | HEBREW PUNCTUATION GERESH
&#x05F4; | `&#x05F4;`  | U+05F4             | HEBREW PUNCTUATION GERSHAYIM
&#x2032; | `&#x2032;`  | U+2032             | PRIME
&#x2033; | `&#x2033;`  | U+2033             | DOUBLE PRIME
&#x3003; | `&#x3003;`  | U+3003             | DITTO MARK
&#xA78C; | `&#xA78C;`  | U+A78C             | LATIN SMALL LETTER SALTILLO

In [None]:
print_code_point_information(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
32

Glyph      Code Point Hex        Bytes                Category   Named Entity         Name      
!          \U00000021 000021     b'!'                 Po         NO NAMED ENTITY      EXCLAMATION MARK
"          \U00000022 000022     b'"'                 Po         quot                 QUOTATION MARK
#          \U00000023 000023     b'#'                 Po         NO NAMED ENTITY      NUMBER SIGN
$          \U00000024 000024     b'$'                 Sc         NO NAMED ENTITY      DOLLAR SIGN
%          \U00000025 000025     b'%'                 Po         NO NAMED ENTITY      PERCENT SIGN
&          \U00000026 000026     b'&'                 Po         amp                  AMPERSAND
'          \U00000027 000027     b"'"                 Po         NO NAMED ENTITY      APOSTROPHE
(          \U00000028 000028     b'('                 Ps         NO NAMED ENTITY      LEFT PARENTHESIS
)          \U00000029 000029     b')'                 Pe         NO NAM

---

#### Whitespace

In [None]:
print_code_point_information(string.whitespace[0])

 
1

Glyph      Code Point Hex        Bytes                Category   Named Entity         Name      
           \U00000020 000020     b' '                 Zs         NO NAMED ENTITY      SPACE


In [None]:
string.whitespace[1:]

'\t\n\r\x0b\x0c'

---

In [None]:
# print_code_point_information(string.printable)

---

In [None]:
# The `ascii` codec encodes only the first 128 code points; the rest raise UnicodeEncodeError.

for i in range(256):
  i = chr(i)
  try:
    print(f"{i:<10} {str(i.encode('ascii')):<10} {unicodedata.name(i)}")
  except (UnicodeEncodeError, ValueError) as e:
    print(e)

no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
           b' '       SPACE
!          b'!'       EXCLAMATION MARK
"          b'"'       QUOTATION MARK
#          b'#'       NUMBER SIGN
$          b'$'       DOLLAR SIGN
%          b'%'       PERCENT SIGN
&          b'&'       AMPERSAND
'          b"'"       APOSTROPHE
(          b'('       LEFT PARENTHESIS
)          b')'       RIGHT PARENTHESIS
*          b'*'       ASTERISK
+          b'+'       PLUS SIGN
,          b','       COMMA
-          b'-'       HYPHEN-MINUS
.          b'.'       FULL STOP
/          b'/'       SOLIDUS
0          b'0'       DIGIT ZERO
1          b

In [None]:
# ASCII requires no more than one byte of space.
all(len(chr(i).encode('ascii')) == 1 for i in range(128))

True
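
The loop above relies on the default `errors='strict'` behaviour of `str.encode`, which raises on the first character outside the codec's range. As a minimal sketch, the standard error handlers can be compared side by side; the helper name `encode_ascii_with_handlers` and the sample string are illustrative assumptions, not part of the original notebook.

```python
def encode_ascii_with_handlers(text: str) -> dict:
    """Encode `text` as ASCII once per built-in error handler (hypothetical helper)."""
    handlers = ("ignore", "replace", "backslashreplace", "xmlcharrefreplace")
    return {handler: text.encode("ascii", errors=handler) for handler in handlers}

sample = "naïve café"  # illustrative input with two non-ASCII characters

# The default handler, 'strict', raises UnicodeEncodeError.
try:
    sample.encode("ascii")
except UnicodeEncodeError as e:
    print(f"{'strict':<20} {e.reason} (position {e.start})")

# The other handlers drop, substitute, or escape the offending characters.
for handler, encoded in encode_ascii_with_handlers(sample).items():
    print(f"{handler:<20} {encoded}")
```

Which handler is appropriate depends on whether the lost characters need to be recoverable: `backslashreplace` and `xmlcharrefreplace` keep the code point, while `ignore` and `replace` discard it.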

---

### Extended ASCII

The `latin-1` codec (ISO 8859-1) maps code points 0-255 one-to-one to the bytes 0x00-0xFF.

In [None]:
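# The `latin-1` codec covers only the first 256 code points (0-255).
# Code point 256 is the first one it cannot encode, so the last
# iteration prints a UnicodeEncodeError.

for code_point in range(257):
  ch = chr(code_point)
  try:
    print(f"{ch:<10} {str(ch.encode('latin-1')):<10} {unicodedata.name(ch)}")
  except (UnicodeEncodeError, ValueError) as e:
    # ValueError ("no such name"): control characters have no Unicode name.
    print(e)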

no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
no such name
           b' '       SPACE
!          b'!'       EXCLAMATION MARK
"          b'"'       QUOTATION MARK
#          b'#'       NUMBER SIGN
$          b'$'       DOLLAR SIGN
%          b'%'       PERCENT SIGN
&          b'&'       AMPERSAND
'          b"'"       APOSTROPHE
(          b'('       LEFT PARENTHESIS
)          b')'       RIGHT PARENTHESIS
*          b'*'       ASTERISK
+          b'+'       PLUS SIGN
,          b','       COMMA
-          b'-'       HYPHEN-MINUS
.          b'.'       FULL STOP
/          b'/'       SOLIDUS
0          b'0'       DIGIT ZERO
1          b

In [None]:
# Extended ASCII requires no more than one byte of space.
all(len(chr(i).encode('latin-1')) == 1 for i in range(256))

True
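
Because each byte corresponds to the code point with the same number, `latin-1` round-trips any byte string without loss. The sketch below checks that property and then shows the same character encoded under `latin-1` and `utf-8`; the variable names are illustrative assumptions.

```python
# Every byte value 0x00-0xFF decodes to the code point with the same number.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")
print([ord(ch) for ch in decoded] == list(range(256)))  # True
print(decoded.encode("latin-1") == all_bytes)           # True

# The same code point needs one byte in latin-1 but two in UTF-8.
e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
latin1_bytes = e_acute.encode("latin-1")
utf8_bytes = e_acute.encode("utf-8")
print(latin1_bytes, utf8_bytes)  # b'\xe9' b'\xc3\xa9'

# Decoding UTF-8 bytes with the wrong codec never fails under latin-1,
# it silently produces mojibake instead.
mojibake = utf8_bytes.decode("latin-1")
print(repr(mojibake))
```

The last step is why `latin-1` is sometimes used as a lossless "bytes as text" fallback, and also why mis-decoded UTF-8 tends to go unnoticed until it is displayed.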

---

### UTF-8

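* 8-bit code units: each code unit is a single byte, so byte order is not an issue and no BOM is required
* each byte consists of two parts
  * marker bits (most significant bits): a sequence of zero to four `1` bits followed by a `0` bit
  * payload bits (the `x` positions in the table below)
* in the first byte of a multi-byte sequence, the number of leading `1` bits equals the number of bytes in the sequence; continuation bytes always start with `10`
* the least significant bit (LSB) of the code point is the rightmost `x` bit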

Range | Encoding
------|---------
`U-00000000...U-0000007F` | 0xxxxxxx
`U-00000080...U-000007FF` | 110xxxxx 10xxxxxx
`U-00000800...U-0000FFFF` | 1110xxxx 10xxxxxx 10xxxxxx
`U-00010000...U-0010FFFF` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
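
The bit layout in the table can be reproduced by hand. The following is a minimal sketch, not a replacement for `str.encode('utf-8')`; the function name `utf8_encode` is a hypothetical helper introduced here for illustration.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point following the UTF-8 bit layout (illustrative sketch)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")

    # One byte: 0xxxxxxx
    if code_point <= 0x7F:
        return bytes([code_point])

    # Pick the number of continuation bytes and the leading-byte marker.
    if code_point <= 0x7FF:
        n_continuation, lead_marker = 1, 0b11000000  # 110xxxxx
    elif code_point <= 0xFFFF:
        n_continuation, lead_marker = 2, 0b11100000  # 1110xxxx
    else:
        n_continuation, lead_marker = 3, 0b11110000  # 11110xxx

    # Fill continuation bytes (10xxxxxx) from the least significant bits up.
    encoded = []
    for _ in range(n_continuation):
        encoded.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6
    # The remaining high bits go into the leading byte.
    encoded.append(lead_marker | code_point)
    return bytes(reversed(encoded))

for ch in ["$", "\u00e9", "\u20ac", "\U0001F600"]:
    manual = utf8_encode(ord(ch))
    bits = " ".join(f"{byte:08b}" for byte in manual)
    print(f"U+{ord(ch):06X} {bits:<36} {manual == ch.encode('utf-8')}")
```

Each row prints the bytes in binary, so the marker bits can be compared directly against the table, and the final column checks the result against Python's built-in codec.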

---

## File Formats

File Type               | Magic Number (hex)      | Magic Number (ASCII)         | File Offset [bytes] | File Name Extension
------------------------|-------------------------|------------------------------|---------------------|--------------------
DOS executable          | 4D 5A                   | MZ                           | 0                   | `.exe`
ELF                     | 7F 45 4C 46             | \x7fELF (␡ELF)               | 0                   | `.elf`
GIF                     | 47 49 46 38 37 61       | GIF87a                       |                     | `.gif`
.                       | 47 49 46 38 39 61       | GIF89a                       |                     | `.gif`
HDF                     | 89 48 44 46 0D 0A 1A 0A | \x89HDF\r\n\x1a\n            |                     | `.h5`, `.hdf5`
Java Class              | CA FE BA BE             | Êþº¾                         |                     | `.class`
JAR                     | 50 4B 03 04             | PK\x03\x04                   |                     | `.jar`
JPEG                    | FF D8 FF DB             | ÿØÿÛ                         | 0                   | `.jpg`, `.jpeg`
Linux/Unix Script       | 23 21                   | #!                           |                     | `.sh`
MIDI                    | 4D 54 68 64             | MThd ("MIDI Track Header")   |                     |
PDF                     | 25 50 44 46             | %PDF                         |                     | `.pdf`
PNG                     | 89 50 4E 47 0D 0A 1A 0A | \x89PNG\r\n\x1a\n (‰PNG␍␊␚␊) |                     | `.png`
PS                      | 25 21 (50 53)           | %!(PS)                       |                     | `.ps`
TIFF (Intel little end) | 49 49 2A 00             | `II*\x00`                    |                     | `.tif`, `.tiff`
TIFF (Motorola big end) | 4D 4D 00 2A             | `MM\x00*`                    |                     | `.tif`, `.tiff`
XML                     | 3C 3F 78 6D 6C          | `<?xml`                      |                     | `.xml`
Zip                     | 50 4B 03 04             | PK\x03\x04                   | 0                   | `.zip`
.                       | 50 4B 05 06             | PK\x05\x06                   |                     | `.zip`
.                       | 50 4B 07 08             | PK\x07\x08                   |                     | `.zip`

In [16]:
pad=30
print(f"{'DOS'                    :<{pad}} {''.join([chr(c) for c in [0x4d, 0x5a]])}")
print(f"{'ELF'                    :<{pad}} {''.join([chr(c) for c in [0x7f, 0x45, 0x4c, 0x46]])}")
print(f"{'Java Archive'           :<{pad}} {''.join([chr(c) for c in [0x50, 0x4b, 0x03, 0x04]])}")
print(f"{'Java Class'             :<{pad}} {''.join([chr(c) for c in [0xca, 0xfe, 0xba, 0xbe]])}")
print(f"{'JPEG'                   :<{pad}} {''.join([chr(c) for c in [0xff, 0xd8, 0xff, 0xdb]])}")
print(f"{'Linux/Unix Script'      :<{pad}} {''.join([chr(c) for c in [0x23, 0x21]])}")
print(f"{'MIDI'                   :<{pad}} {''.join([chr(c) for c in [0x4D, 0x54, 0x68, 0x64]])}")
print(f"{'PDF'                    :<{pad}} {''.join([chr(c) for c in [0x25, 0x50, 0x44, 0x46]])}")
print(f"{'PNG'                    :<{pad}} {''.join([chr(c) for c in [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]])}")
print(f"{'PS'                     :<{pad}} {''.join([chr(c) for c in [0x25, 0x21, 0x50, 0x53]])}")
print(f"{'TIFF (Intel little end)':<{pad}} {''.join([chr(c) for c in [0x49, 0x49, 0x2A, 0x00]])}")
print(f"{'TIFF (Motorola big end)':<{pad}} {''.join([chr(c) for c in [0x4D, 0x4D, 0x00, 0x2A]])}")
print(f"{'Zip'                    :<{pad}} {''.join([chr(c) for c in [0x50, 0x4b, 0x03, 0x04]])}")

DOS                            MZ
ELF                            ELF
Java Archive                   PK
Java Class                     Êþº¾
JPEG                           ÿØÿÛ
Linux/Unix Script              #!
MIDI                           MThd
PDF                            %PDF
PNG                            PNG


PS                             %!PS
TIFF (Intel little end)        II* 
TIFF (Motorola big end)        MM *
Zip                            PK
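
Printing the signatures with `chr` sends raw control characters to the terminal, which is why the ELF and PNG rows above look garbled. As a minimal sketch, the signatures can instead be kept as `bytes`, shown with `repr`, and used to identify a file from its first bytes. The names `MAGIC_NUMBERS` and `sniff_file_type` are hypothetical, and every signature is assumed to sit at offset 0.

```python
from typing import Optional

# Signatures taken from the table above; each is assumed to start at offset 0.
MAGIC_NUMBERS = [
    (b"MZ",                "DOS executable"),
    (b"\x7fELF",           "ELF"),
    (b"GIF87a",            "GIF"),
    (b"GIF89a",            "GIF"),
    (b"\x89HDF\r\n\x1a\n", "HDF"),
    (b"\xca\xfe\xba\xbe",  "Java Class"),
    (b"\xff\xd8\xff\xdb",  "JPEG"),
    (b"#!",                "Linux/Unix Script"),
    (b"MThd",              "MIDI"),
    (b"%PDF",              "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"%!",                "PS"),
    (b"II*\x00",           "TIFF (Intel little end)"),
    (b"MM\x00*",           "TIFF (Motorola big end)"),
    (b"<?xml",             "XML"),
    (b"PK\x03\x04",        "Zip or JAR"),  # a JAR shares the Zip signature
]

def sniff_file_type(header: bytes) -> Optional[str]:
    """Return the first file type whose magic number prefixes `header`."""
    for magic, name in MAGIC_NUMBERS:
        if header.startswith(magic):
            return name
    return None

# repr() of a bytes object escapes non-printable bytes instead of emitting them.
for magic, name in MAGIC_NUMBERS:
    print(f"{name:<30} {magic!r}")

print(sniff_file_type(b"%PDF-1.7"))
print(sniff_file_type(b"plain text"))
```

To apply this to a real file, read its first few bytes in binary mode (for example `open(path, 'rb').read(8)`, since the longest signature here is eight bytes) and pass them to the function. A match only suggests the format; it does not validate the rest of the file.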


---

## Figures

* [ [h](https://www.crockford.com/putin.html) ][ [y](https://www.youtube.com/playlist?list=PLEzQf147-uEoNCeDlRrXv6ClsLDN-HtNm) ][ [w](https://en.wikipedia.org/wiki/Douglas_Crockford) ] Crockford, Douglas
* [ [w](https://en.wikipedia.org/wiki/Mark_Zbikowski) ] Zbikowski, Mark (1956-)

---

## Resources

* Named Character References [HTML](https://html.spec.whatwg.org/multipage/named-characters.html)

* [ h ][ [w](https://en.wikipedia.org/wiki/BMP_file_format) ] Bitmap (BMP) `.bmp`
* [ h ][ [w](https://en.wikipedia.org/wiki/Comma-separated_values) ] Comma-Separated Values (CSV) `.csv`
* [ h ][ [w](https://en.wikipedia.org/wiki/DOS_MZ_executable) ] DOS MZ "Mark Zbikowski" Executable `.exe`
* [ h ][ [w](https://en.wikipedia.org/wiki/XML) ] Extensible Markup Language (XML) `.xml`
* [ h ][ [w](https://en.wikipedia.org/wiki/GIF) ] Graphics Interchange Format (GIF) `.gif`
* [ h ][ [w](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) ] Hierarchical Data Format (HDF) `.h5`, `.hdf5`
* [ h ][ [w](https://en.wikipedia.org/wiki/JPEG) ] Joint Photographic Experts Group (JPEG) `.jpg`, `.jpeg`
* [ h ][ [w](https://en.wikipedia.org/wiki/JAR_(file_format)) ] Java Archive File (JAR) `.jar`
* [ [h](https://www.json.org/) ][ [w](https://en.wikipedia.org/wiki/JSON) ] JavaScript Object Notation (JSON) `.json`
* [ [h](https://geojson.org) ][ [w](https://en.wikipedia.org/wiki/GeoJSON) ] GeoJSON
  * [IETF](https://datatracker.ietf.org/doc/html/rfc7946)
* [ h ][ [w](https://en.wikipedia.org/wiki/MP3) ] MP3
* [ h ][ [w](https://en.wikipedia.org/wiki/MIDI) ] Musical Instrument Digital Interface (MIDI)
* [ [h](https://parquet.apache.org/) ][ [w](https://en.wikipedia.org/wiki/Apache_Parquet) ] Parquet
* [ h ][ [w](https://en.wikipedia.org/wiki/PDF) ] Portable Document Format (PDF) `.pdf`
* [ h ][ [w](https://en.wikipedia.org/wiki/Portable_Network_Graphics) ] Portable Network Graphics (PNG) `.png`
* [ h ][ w ] Scalable Vector Graphics (SVG) `.svg`
* [ h ][ w ] Tab-Separated Values (TSV) `.tsv`
* [ h ][ [w](https://en.wikipedia.org/wiki/TIFF) ] Tag(ged) Image File Format (TIFF)
* [ [h](https://toml.io/en/) ][ [w](https://en.wikipedia.org/wiki/TOML) ] Tom's Obvious Minimal Language (TOML)
* [ [h](https://home.unicode.org) ][ [w](https://en.wikipedia.org/wiki/Unicode) ] Unicode
* [ h ][ [w](https://en.wikipedia.org/wiki/WAV) ] Waveform Audio File Format (WAV) `.wav`, `.wave`
* [ [h](https://yaml.org) ][ [w](https://en.wikipedia.org/wiki/YAML) ] YAML Ain't Markup Language (YAML) `.yml`, `.yaml`
* [ [h](https://tukaani.org/xz/) ][ [w](https://en.wikipedia.org/wiki/XZ_Utils) ] xz

---

## Terms

* [ [w](https://en.wikipedia.org/wiki/.exe) ] .exe
* [ [w](https://en.wikipedia.org/wiki/ANSI_escape_code) ] ANSI Escape Sequences
* [ [w](https://en.wikipedia.org/wiki/Archive_file_format) ] Archive File
  * [ [w](https://en.wikipedia.org/wiki/List_of_archive_formats) ] list of archive formats
* [ [w](https://en.wikipedia.org/wiki/ASCII) ] American Standard Code for Information Interchange (ASCII)
* [ [w](https://en.wikipedia.org/wiki/Audio_file_format) ] Audio File Format
* [ w ] Big Endian
* [ [w](https://en.wikipedia.org/wiki/BCD_(character_encoding)) ] Binary-Coded Decimal Interchange Code (BCDIC)
* [ [w](https://en.wikipedia.org/wiki/Binary-to-text_encoding) ] Binary-to-Text Encoding
* [ [w](https://en.wikipedia.org/wiki/Binary_code) ] Binary Code
* [ [w](https://en.wikipedia.org/wiki/Binary_file) ] Binary File
* [ [w](https://en.wikipedia.org/wiki/Bit_numbering) ] Bit Numbering
* [ [w](https://en.wikipedia.org/wiki/Bit_array) ] Bit String
* [ [w](https://en.wikipedia.org/wiki/Byte) ] Byte
* [ [w](https://en.wikipedia.org/wiki/Byte_order_mark) ] Byte Order Mark (BOM)
* [ [w](https://en.wikipedia.org/wiki/C0_and_C1_control_codes) ] C0 & C1 Control Codes
* [ [w](https://en.wikipedia.org/wiki/Caps_Lock) ] Caps Lock
* [ [w](https://en.wikipedia.org/wiki/Caret_notation) ] Caret Notation
* [ [w](https://en.wikipedia.org/wiki/Character_(computing)) ] Character
* [ [w](https://en.wikipedia.org/wiki/Character_encoding) ] Character Encoding
* [ [w](https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references) ] Character Entities
* [ [w](https://en.wikipedia.org/wiki/Code) ] Code
* [ [w](https://en.wikipedia.org/wiki/Code_page) ] Code Page
* [ [w](https://en.wikipedia.org/wiki/Code_point) ] Code Point
* [ [w](https://en.wikipedia.org/wiki/Codec) ] Codec
* [ [w](https://en.wikipedia.org/wiki/Combining_character) ] Combining Character
* [ [w](https://en.wikipedia.org/wiki/Complex_text_layout) ] Complex Text Layout (CTL)
* [ [w](https://en.wikipedia.org/wiki/Container_format) ] Container Format
* [ [w](https://en.wikipedia.org/wiki/Control_character) ] Control Character
* [ [w](https://en.wikipedia.org/wiki/Control_Pictures) ] Control Picture
* [ [w](https://en.wikipedia.org/wiki/Control-Alt-Delete) ] Ctrl-Alt-Del
* [ [w](https://en.wikipedia.org/wiki/Control-C) ] Ctrl-C
* [ [w](https://en.wikipedia.org/wiki/End-of-Transmission_character) ] Ctrl-D
* [ [w](https://en.wikipedia.org/wiki/Substitute_character) ] Ctrl-Z
* [ [w](https://en.wikipedia.org/wiki/Data_compression) ] Data Compression
* [ [w](https://en.wikipedia.org/wiki/Data_compression_ratio) ] Data Compression Ratio
* [ [w](https://en.wikipedia.org/wiki/Data_conversion) ] Data Conversion
* [ [w](https://en.wikipedia.org/wiki/Data_file) ] Data File
* [ [w](https://en.wikipedia.org/wiki/Deflate) ] Deflate
* [ [w](https://en.wikipedia.org/wiki/Diacritic) ] Diacritic
* [ [w](https://en.wikipedia.org/wiki/Diaeresis_(diacritic)) ] Diaeresis
* [ [w](https://en.wikipedia.org/wiki/Dictionary_coder) ] Dictionary Coder
* [ [w](https://en.wikipedia.org/wiki/Disk_image) ] Disk Image
* [ [w](https://en.wikipedia.org/wiki/Template_(file_format)) ] Document Template
* [ [w](https://en.wikipedia.org/wiki/Electronic_data_interchange) ] Electronic Data Interchange (EDI)
* [ [w](https://en.wikipedia.org/wiki/End-of-file) ] End of File (EOF)
* [ [w](https://en.wikipedia.org/wiki/Newline) ] End of Line (EOL)
* [ [w](https://en.wikipedia.org/wiki/Endianness) ] Endianness
* [ [w](https://en.wikipedia.org/wiki/Enriched_text) ] Enriched Text
* [ [w](https://en.wikipedia.org/wiki/Escape_character) ] Escape Character
* [ [w](https://en.wikipedia.org/wiki/Escape_sequence) ] Escape Sequence
* [ [w](https://en.wikipedia.org/wiki/Escape_sequences_in_C) ] Escape Sequence in C
* [ [w](https://en.wikipedia.org/wiki/Executable) ] Executable File
  * [ [w](https://en.wikipedia.org/wiki/Comparison_of_executable_file_formats) ] list of executable file formats
* [ [w](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) ] Executable and Linkable Format (ELF)
* [ [w](https://en.wikipedia.org/wiki/Executable_compression) ] Executable Compression
* [ [w](https://en.wikipedia.org/wiki/EBCDIC) ] Extended Binary Coded Decimal Interchange Code (EBCDIC)
* [ [w](https://en.wikipedia.org/wiki/Extended_ASCII) ] Extended ASCII
* [ [w](https://en.wikipedia.org/wiki/File_archiver) ] File Archiver
* [ [w](https://en.wikipedia.org/wiki/Comparison_of_file_archivers) ] File Archivers
* [ [w](https://en.wikipedia.org/wiki/File_format) ] File Format
  * [ [w](https://en.wikipedia.org/wiki/List_of_file_formats) ] list of file formats
* [ [w](https://en.wikipedia.org/wiki/Filename_extension) ] File Name Extension
  * [ [w](https://en.wikipedia.org/wiki/List_of_filename_extensions) ] list of file name extensions
* [ [w](https://en.wikipedia.org/wiki/Glyph) ] Glyph
* [ [w](https://en.wikipedia.org/wiki/Grapheme) ] Grapheme
* [ [w](https://en.wikipedia.org/wiki/Grave_accent) ] Grave Accent
* [ [w](https://en.wikipedia.org/wiki/GIF) ] Graphics Interchange Format (GIF)
* [ [w](https://en.wikipedia.org/wiki/Guillemet) ] Guillemet
* [ [w](https://en.wikipedia.org/wiki/Gzip) ] gzip
* [ [w](https://en.wikipedia.org/wiki/Hexadecimal) ] Hexadecimal
* [ [w](https://en.wikipedia.org/wiki/Huffman_coding) ] Huffman Coding
* [ [w](https://en.wikipedia.org/wiki/Image_file_format) ] Image File Format
  * [ [w](https://en.wikipedia.org/wiki/Comparison_of_graphics_file_formats) ] list of image file formats
* [ [w](https://en.wikipedia.org/wiki/Interchange_File_Format) ] Interchange File Format (IFF)
* [ [w](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) ] International Phonetic Alphabet (IPA)
* [ [w](https://en.wikipedia.org/wiki/Internationalization_and_localization) ] Internationalization and Localization
* [ [w](https://en.wikipedia.org/wiki/Java_class_file) ] Java Class File
* [ [w](https://en.wikipedia.org/wiki/Language-independent_specification) ] Language-Independent Specification (LIS)
* [ [w](https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Oberhumer) ] Lempel-Ziv-Oberhumer (LZO)
* [ [w](https://en.wikipedia.org/wiki/Letter_case) ] Letter Case
* [ [w](https://en.wikipedia.org/wiki/Ligature_(writing)) ] Ligature
* [ [w](https://en.wikipedia.org/wiki/Line_(text_file)) ] Line
* [ [w](https://en.wikipedia.org/wiki/Linear_predictive_coding) ] Linear Predictive Coding (LPC)
* [ w ] Little Endian
* [ [w](https://en.wikipedia.org/wiki/Lossless_compression) ] Lossless Compression
* [ [w](https://en.wikipedia.org/wiki/Lossy_compression) ] Lossy Compression
* [ [w](https://en.wikipedia.org/wiki/Letter_case) ] Lower Case (Minuscule)
* [ [w](https://en.wikipedia.org/wiki/LZ77_and_LZ78) ] LZ77 and LZ78
* [ [w](https://en.wikipedia.org/wiki/Magic_number_(programming)) ] Magic Number
* [ [w](https://en.wikipedia.org/wiki/Manifest_file) ] Manifest File
* [ [w](https://en.wikipedia.org/wiki/Metacharacter) ] Metacharacter
* [ [w](https://en.wikipedia.org/wiki/Mojibake) ] Mojibake
* [ [w](https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references) ] Named Character Reference
* [ [w](https://en.wikipedia.org/wiki/Newline) ] Newline
* [ [w](https://en.wikipedia.org/wiki/Nibble) ] Nibble
* [ [w](https://en.wikipedia.org/wiki/Null-terminated_string) ] Null-Terminated String
* [ [w](https://en.wikipedia.org/wiki/Number_sign) ] Number Sign
* [ [w](https://en.wikipedia.org/wiki/Numeric_character_reference) ] Numeric Character Reference
* [ [w](https://en.wikipedia.org/wiki/Object_file) ] Object File
* [ [w](https://en.wikipedia.org/wiki/Octet_(computing)) ] Octet
* [ [w](https://en.wikipedia.org/wiki/Open_file_format) ] Open File Format
* [ [w](https://en.wikipedia.org/wiki/OpenDocument) ] OpenDocument
* [ [w](https://en.wikipedia.org/wiki/OpenType) ] OpenType
* [ [w](https://en.wikipedia.org/wiki/Page_break) ] Page Break
* [ [w](https://en.wikipedia.org/wiki/Pax_(command)) ] pax
* [ [w](https://en.wikipedia.org/wiki/Percent-encoding) ] Percent Encoding
* [ [w](https://en.wikipedia.org/wiki/Plain_text) ] Plain Text
* [ [w](https://en.wikipedia.org/wiki/Plane_(Unicode)) ] Plane
* [ [w](https://en.wikipedia.org/wiki/PostScript) ] PostScript (PS)
* [ [w](https://en.wikipedia.org/wiki/Number_sign) ] Pound Sign
* [ [w](https://en.wikipedia.org/wiki/Precomposed_character) ] Precomposed Character
* [ [w](https://en.wikipedia.org/wiki/Punctuation) ] Punctuation
* [ [w](https://en.wikipedia.org/wiki/Raster_graphics) ] Raster Graphics
* [ [w](https://en.wikipedia.org/wiki/Formatted_text) ] Rich Text
* [ [w](https://en.wikipedia.org/wiki/Ring_(diacritic)) ] Ring
* [ [w](https://en.wikipedia.org/wiki/Run-length_encoding) ] Run-Length Encoding (RLE)
* [ [w](https://en.wikipedia.org/wiki/Self-synchronizing_code) ] Self-Synchronizing Code
* [ [w](https://en.wikipedia.org/wiki/Serialization) ] Serialization
* [ [w](https://en.wikipedia.org/wiki/Shebang_(Unix)) ] Shebang
* [ [w](https://en.wikipedia.org/wiki/Simple_Data_Format) ] Simple Data Format (SDF)
* [ [w](https://en.wikipedia.org/wiki/Software_flow_control) ] Software Flow Control
* [ [w](https://en.wikipedia.org/wiki/Specials_(Unicode_block)) ] Specials
* [ [w](https://en.wikipedia.org/wiki/String_(computer_science)) ] String
* [ [w](https://en.wikipedia.org/wiki/String_literal) ] String Literal
* [ [w](https://en.wikipedia.org/wiki/Tab-separated_values) ] Tab-Separated Values
* [ [w](https://en.wikipedia.org/wiki/Tab_stop) ] Tab Stop
* [ [w](https://en.wikipedia.org/wiki/Tar_(computing)) ] tar
* [ [w](https://en.wikipedia.org/wiki/Text_normalization) ] Text Normalization
* [ [w](https://en.wikipedia.org/wiki/Touch_typing) ] Touch Typing
* [ [w](https://en.wikipedia.org/wiki/Typeface) ] Typeface
* [ [w](https://en.wikipedia.org/wiki/Unicode) ] Unicode
* [ [w](https://en.wikipedia.org/wiki/Unicode_block) ] Unicode Block
* [ [w](https://en.wikipedia.org/wiki/Unicode_character_property) ] Unicode Character Property
* [ [w](https://en.wikipedia.org/wiki/Unicode_collation_algorithm) ] Unicode Collation Algorithm
* [ [w](https://en.wikipedia.org/wiki/Unicode_Consortium) ] Unicode Consortium
* [ [w](https://en.wikipedia.org/wiki/Unicode_equivalence) ] Unicode Equivalence
* [ [w](https://en.wikipedia.org/wiki/Universal_Character_Set_characters) ] Universal Character Set (UCS) characters
* [ [w](https://en.wikipedia.org/wiki/Universal_Coded_Character_Set) ] Universal Coded Character Set (UCS)
* [ [w](https://en.wikipedia.org/wiki/Letter_case) ] Upper Case (Majuscule)
* [ [w](https://en.wikipedia.org/wiki/Percent-encoding) ] URL Encoding
* [ [w](https://en.wikipedia.org/wiki/UTF-16) ] UTF-16
* [ [w](https://en.wikipedia.org/wiki/UTF-8) ] UTF-8
* [ [w](https://en.wikipedia.org/wiki/Variable-width_encoding) ] Variable-Width Encoding
* [ [w](https://en.wikipedia.org/wiki/Video_file_format) ] Video File Format
* [ [w](https://en.wikipedia.org/wiki/Whitespace_character) ] Whitespace
* [ [w](https://en.wikipedia.org/wiki/Word_(computer_architecture)) ] Word
* [ [w](https://en.wikipedia.org/wiki/Writing_system) ] Writing System
* [ [w](https://en.wikipedia.org/wiki/ZIP_(file_format)) ] Zip
* [ [w](https://en.wikipedia.org/wiki/Zlib) ] Zlib
* [ [w](https://en.wikipedia.org/wiki/Zstd) ] Zstd

---