# Week 3 Lab — Data Representation in Practice: Bits/Bytes → Text, Images, and SQL

**Goals**  
By the end of this lab you will be able to:

- Explain and *demonstrate* how numbers, text, and images are represented using bits and bytes.
- Convert between decimal, binary, and hexadecimal, and write small conversion utilities yourself.
- Inspect the raw bytes of strings and files; reason about encodings (ASCII vs. Unicode/UTF-8).
- Visualize pixel grids and experiment with color depth and resolution trade-offs.
- Use SQL `SELECT` with an embedded SQLite database; work with Unicode text in queries.
- Reflect on why representation decisions matter for performance, correctness, and careers.

> **Format:** This is a guided notebook with mini-lectures (Markdown), worked examples, and exercises.  
> **Estimated time:** 75–100 minutes if you work through all extensions.

## 1) Bits & Bytes Fundamentals

A **bit** is a 0 or 1. A **byte** is 8 bits → 256 possible values (0–255).  
We'll warm up by converting numbers across bases, and then implement our own converters.

In [None]:
# Quick Python warm-up: built-ins for binary/hex
nums = [0, 1, 2, 7, 8, 15, 16, 31, 32, 63, 64, 127, 128, 255]
for n in nums:
    print(f"{n:>3}  bin={bin(n)[2:]:>8}  hex={hex(n)[2:].upper():>2}")

### Your Turn (Exercise 1A) — Manual Converters
**Task:** Implement a function that converts decimal to binary *without* using `bin()` or format strings. Then write the reverse, binary→decimal, without using `int(b, 2)`.

**Hints:**  
- For decimal→binary: repeatedly divide by 2; collect remainders; reverse at the end.  
- For binary→decimal: iterate digits from right to left; use positional weights (1, 2, 4, 8, ...).

In [None]:
def dec_to_bin(n: int) -> str:
    """Return binary string for non-negative integer n (no '0b' prefix)."""
    if n == 0:
        return "0"
    bits = []
    while n > 0:
        bits.append(str(n % 2))
        n //= 2
    return ''.join(reversed(bits))

def bin_to_dec(b: str) -> int:
    """Return decimal integer for binary string b (e.g., '1011')."""
    total = 0
    for ch in b.strip():
        if ch not in '01':
            raise ValueError("Binary string must contain only 0/1.")
        # Shift the running total left by one bit, then add the new digit
        total = total * 2 + int(ch)
    return total

# Quick self-checks
tests = [0,1,2,5,7,8,15,16,31,32,63,64,127,128,255]
for t in tests:
    b = dec_to_bin(t)
    back = bin_to_dec(b)
    assert back == t, (t, b, back)
print("✅ dec_to_bin / bin_to_dec basic tests passed.")
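
The same divide-and-collect idea extends to hexadecimal. Below is a minimal sketch that generalizes the two converters to any base from 2 to 16. The names `DIGITS`, `to_base`, and `from_base` are hypothetical helpers introduced here for illustration; they are not part of the exercise.

```python
DIGITS = "0123456789ABCDEF"  # digit symbols for bases up to 16

def to_base(n: int, base: int) -> str:
    """Return the digits of non-negative integer n in the given base (2..16)."""
    if not 2 <= base <= 16:
        raise ValueError("base must be between 2 and 16")
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return "0"
    out = []
    while n > 0:
        n, remainder = divmod(n, base)   # peel off the lowest digit
        out.append(DIGITS[remainder])
    return "".join(reversed(out))        # lowest digit was collected first

def from_base(text: str, base: int) -> int:
    """Parse a digit string in the given base (2..16) into an integer."""
    cleaned = text.strip().upper()
    if not cleaned:
        raise ValueError("empty digit string")
    total = 0
    for ch in cleaned:
        digit = DIGITS.find(ch)
        if digit < 0 or digit >= base:
            raise ValueError(f"invalid digit {ch!r} for base {base}")
        total = total * base + digit     # shift by one position, add the digit
    return total

print(to_base(255, 2), to_base(255, 16), from_base("ff", 16))
```

Comparing the results against Python's built-in `int(text, base)` is a quick way to check a hand-written converter.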

### Reflection
- Why is binary "natural" for computers?  
- Why might humans prefer hexadecimal when reading low-level values?
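
One observation that may help with the second question: each hexadecimal digit stands for exactly four bits (a nibble), so any byte is always two hex digits. A small sketch, where `byte_to_nibbles` is a hypothetical helper name used only for this illustration:

```python
def byte_to_nibbles(value: int) -> tuple[str, str, str]:
    """Return (hex, high_nibble_bits, low_nibble_bits) for one byte value."""
    if not 0 <= value <= 255:
        raise ValueError("value must fit in one byte (0..255)")
    high, low = value >> 4, value & 0x0F   # split into two 4-bit halves
    return f"{value:02X}", f"{high:04b}", f"{low:04b}"

for v in (0, 10, 127, 128, 255):
    hx, high_bits, low_bits = byte_to_nibbles(v)
    print(f"{v:>3}  hex={hx}  bits={high_bits} {low_bits}")
```

Reading `80` is usually easier than reading `10000000`, yet the two-digit hex form still maps directly onto the bit pattern.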

## 2) Character Encodings: ASCII → Unicode (UTF-8)

- **ASCII:** 7-bit codes for English letters, digits, punctuation (0–127).  
- **Unicode:** assigns a unique code point to characters in *all* writing systems, plus symbols and emoji.  
- **UTF-8:** variable-length encoding of Unicode code points (1–4 bytes per code point).

We'll explore how Python represents characters and how to go from characters → bytes and back.
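
To make the 1–4 byte rule concrete, here is a hand-written encoder for a single code point. It is a teaching sketch only: `utf8_encode_code_point` is a hypothetical name, and in real code you would call `str.encode('utf-8')`.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 bytes (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if cp <= 0x7F:                       # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),     # 4 bytes: 11110xxx then three 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Compare against Python's built-in encoder for one character of each length
for ch in ("A", "\u00F1", "\u20AC", "\U0001F600"):
    manual = utf8_encode_code_point(ord(ch))
    assert manual == ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(manual)} byte(s): {manual.hex(' ')}")
```

The leading bits of the first byte announce how many bytes follow, and every continuation byte starts with `10`.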

In [None]:
# ASCII/Unicode basics with ord() and chr()
samples = ['A', 'a', 'Z', '0', '9', ' ', 'ñ', '€', '😀']
for s in samples:
    cp = ord(s)           # code point (integer)
    enc = s.encode('utf-8')  # bytes in UTF-8
    print(f"{s!r}  code_point=U+{cp:04X}  utf8_bytes={list(enc)}  as_hex={[hex(b) for b in enc]}")

### Worked Example: Encoding & Decoding
- Strings (`str`) are sequences of Unicode code points.  
- Bytes (`bytes`) are sequences of 8-bit integers (0–255).  
- `.encode('utf-8')` converts `str → bytes`, `.decode('utf-8')` converts back.

In [None]:
text = "Hello ñ € 😀"
b = text.encode('utf-8')
print("Original text:", text)
print("As bytes     :", b)
print("Hex bytes    :", [hex(x) for x in b])

# Decode back
roundtrip = b.decode('utf-8')
print("Roundtrip    :", roundtrip)
assert roundtrip == text
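
A practical consequence of variable-length encoding is that character counts and byte counts differ, and cutting a byte string at an arbitrary offset can split a character. The sketch below writes the same sample text with escape sequences so the code points are unambiguous; the variable names are illustrative.

```python
# "Hello ñ € 😀" written with escapes so each code point is explicit
text = "Hello \u00F1 \u20AC \U0001F600"
data = text.encode("utf-8")
print(len(text), "characters ->", len(data), "bytes")

# Cut after 7 bytes: that keeps only the first of the two bytes encoding U+00F1
truncated = data[:7]
try:
    truncated.decode("utf-8")
    outcome = "decoded cleanly"
except UnicodeDecodeError as exc:
    outcome = f"failed: {exc.reason}"
print("Decoding the truncated bytes", outcome)
```

This is why fixed-size buffers and byte-based length limits need care when the content is UTF-8 text.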

### Your Turn (Exercise 2A) — Diagnose Encoding Issues
1. Construct a bytes object manually that **is not** valid UTF-8; try to decode it and catch the error.  
2. Then decode with `errors='replace'` and observe the replacement characters.  
3. Explain in a sentence how "mojibake" happens.

In [None]:
# Exercise 2A — starter
# 0xFF is not a valid start byte in UTF-8
bad_bytes = bytes([0x48, 0x65, 0x6C, 0x6C, 0x6F, 0xFF, 0x61])  # 'Hello' + invalid + 'a'
print("Raw bytes:", bad_bytes)

try:
    print(bad_bytes.decode('utf-8'))
except UnicodeDecodeError as e:
    print("UnicodeDecodeError:", e)

print("With replacement:", bad_bytes.decode('utf-8', errors='replace'))
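
For item 3, a common route to mojibake is decoding valid UTF-8 bytes with a different single-byte encoding. A minimal sketch, using Latin-1 as the assumed wrong decoder:

```python
original = "caf\u00E9"                      # "café"
utf8_bytes = original.encode("utf-8")       # the é becomes two bytes: C3 A9

# Wrong decoder: Latin-1 maps every byte to one character, so it never fails,
# but the two bytes of é turn into two unrelated characters.
garbled = utf8_bytes.decode("latin-1")
print("Garbled :", garbled)

# When no bytes were lost, reversing the wrong step recovers the text.
repaired = garbled.encode("latin-1").decode("utf-8")
print("Repaired:", repaired)
```

No error is raised in the wrong step, which is what makes this kind of bug easy to miss until someone reads the output.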

### Your Turn (Exercise 2B) — Write a Byte Inspector
**Task:** Write a function `inspect_bytes(s: str)` that returns a table (list of dicts) for each character in `s`, including:
- the character itself
- code point (U+XXXX)
- UTF-8 length (# of bytes)
- the bytes in decimal and hex

*(Tip: you can print as a nicely aligned table or just return the list.)*

In [None]:
def inspect_bytes(s: str):
    rows = []
    for ch in s:
        cp = ord(ch)
        enc = ch.encode('utf-8')
        rows.append({
            "char": ch,
            "code_point": f"U+{cp:04X}",
            "utf8_len": len(enc),
            "bytes_dec": list(enc),
            "bytes_hex": [hex(b) for b in enc],
        })
    return rows

# Example:
for row in inspect_bytes("Data: ñ 😀"):
    print(row)

## 3) Inspecting Real Files (Text + Image)

We'll generate small files locally so everyone has the same artifacts:
1. A text file containing plain ASCII and Unicode.  
2. A tiny synthetic image (color gradients) so we can open, inspect, and manipulate pixels.

We will not use external libraries beyond Python's standard library and `matplotlib`/`numpy` for visualization.

In [None]:
from pathlib import Path

# Create a small text file with ASCII + Unicode content
text_path = Path('sample_text_utf8.txt')
text_path.write_text("""
Line 1: ASCII only
Line 2: Accents — café, jalapeño
Line 3: Symbols — €, ©, ™
Line 4: Emoji — 😀, 🧠, 🐍
""", encoding='utf-8')
print("Wrote:", text_path.resolve())

# Read the file back and display the first few raw bytes
raw = text_path.read_bytes()
print("First 64 bytes:", raw[:64])
print("As hex        :", [hex(b) for b in raw[:64]])

### Visualizing a Synthetic Image (Pixels & Bit Depth)

We'll create a small image using `numpy` and display it.  
Then we'll reduce the number of colors (quantization) to simulate lower bit depth.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Create a 64x64 image with smooth R, G, B gradients
h, w = 64, 64
img = np.zeros((h, w, 3), dtype=np.uint8)

for i in range(h):
    for j in range(w):
        img[i, j, 0] = int(255 * i / (h-1))        # R gradient vertical
        img[i, j, 1] = int(255 * j / (w-1))        # G gradient horizontal
        img[i, j, 2] = int(255 * ((i+j)/ (h+w-2))) # B diagonal

plt.figure(figsize=(4,4))
plt.imshow(img)
plt.axis('off')
plt.title("Synthetic 64×64 RGB image (24-bit)")
plt.show()

### Your Turn (Exercise 3A) — Quantize (Reduce Color Depth)

**Task:** Write a function that reduces the number of distinct values per channel.  
- Example: with `levels=4`, allowed channel values should be `{0, 85, 170, 255}` (i.e., evenly spaced).  
- Apply to the synthetic image and visualize the result.

**Discussion:** How does perceived quality change as you reduce `levels`? Why does file size drop?
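
As a starting point for the file-size question, the arithmetic for an uncompressed, tightly packed image can be sketched as follows. `raw_image_bytes` is a hypothetical helper, and the tight-packing assumption is mine: real formats add headers and padding, and a PNG saved from an 8-bit array typically still stores 8 bits per channel, so any savings there come from compression rather than narrower samples.

```python
def raw_image_bytes(width: int, height: int, levels: int, channels: int = 3) -> int:
    """Bytes needed to store an image with `levels` values per channel, tightly packed."""
    if levels < 2:
        raise ValueError("levels must be >= 2")
    bits_per_channel = (levels - 1).bit_length()   # smallest bit count that can index `levels` values
    total_bits = width * height * channels * bits_per_channel
    return (total_bits + 7) // 8                   # round up to whole bytes

for levels in (256, 8, 4, 2):
    print(f"{levels:>3} levels/channel -> {raw_image_bytes(64, 64, levels)} bytes")
```

For the 64×64 image, going from 256 levels to 2 levels per channel shrinks the packed size by a factor of eight.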

In [None]:
def quantize_channel(channel: np.ndarray, levels: int) -> np.ndarray:
    """Quantize a single channel (0..255) to `levels` evenly spaced values."""
    if levels < 2:
        raise ValueError("levels must be >= 2")
    # Map 0..255 into 0..(levels-1), then back to 0..255
    scaled = np.floor(channel / 256.0 * levels).astype(int)
    scaled = np.clip(scaled, 0, levels-1)
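    # Map to evenly spaced 0..255; rounding keeps the top level at exactly 255
    # (integer division would cap levels=8 at 7 * 36 = 252)
    return np.round(scaled * 255.0 / (levels - 1)).astype(np.uint8)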

def quantize_image(img: np.ndarray, levels: int) -> np.ndarray:
    out = np.zeros_like(img)
    for c in range(3):
        out[..., c] = quantize_channel(img[..., c], levels)
    return out

# Try a few levels
for levels in (8, 4, 2):
    q = quantize_image(img, levels)
    plt.figure(figsize=(4,4))
    plt.imshow(q)
    plt.axis('off')
    plt.title(f"Quantized image: {levels} levels per channel")
    plt.show()
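
To check which output values a given `levels` setting should produce, it can help to run the same mapping on plain integers. This is a pure-Python sketch with a hypothetical helper name, `quantize_value`; it assumes equal-width input bins and output values spread evenly from 0 to 255.

```python
def quantize_value(v: int, levels: int) -> int:
    """Map one channel value (0..255) to the nearest of `levels` evenly spaced outputs."""
    if levels < 2:
        raise ValueError("levels must be >= 2")
    if not 0 <= v <= 255:
        raise ValueError("v must be in 0..255")
    index = min(v * levels // 256, levels - 1)   # which of the `levels` bins v falls into
    return round(index * 255 / (levels - 1))     # spread bin indices across 0..255

for levels in (8, 4, 2):
    allowed = sorted({quantize_value(v, levels) for v in range(256)})
    print(f"levels={levels}: {allowed}")
```

With `levels=4` the set of outputs is `{0, 85, 170, 255}`, matching the task description, and every setting keeps both 0 and 255 reachable.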

### Your Turn (Exercise 3B) — Save, Inspect Bytes, Compare Sizes

1. Save original and quantized images as PNG files (lossless).  
2. Compare file sizes with `Path(...).stat().st_size`.  
3. Open the PNGs in binary mode and print the first 64 bytes in hex. What do you notice?

> PNG is a structured, losslessly compressed format, so file size depends on chunk metadata and on how well the pixel data compresses, not only on the number of colors.  
> (Optional extension) Try BMP (uncompressed) if available via `plt.imsave(..., format='bmp')` and compare.

In [None]:
from pathlib import Path

plt.imsave("img_original.png", img)
print("img_original.png size:", Path("img_original.png").stat().st_size, "bytes")

for levels in (8, 4, 2):
    q = quantize_image(img, levels)
    fname = f"img_q{levels}.png"
    plt.imsave(fname, q)
    size = Path(fname).stat().st_size
    print(f"{fname} size:", size, "bytes")
    head = Path(fname).read_bytes()[:64]
    print(f"{fname} first 64 bytes (hex):", [hex(b) for b in head])

## 4) SQL with SQLite (Embedded) — SELECT Basics + Unicode

We'll create a small in-memory SQLite database, populate a `students` table with Unicode names, and practice queries.

**Why SQLite here?**  
- No network or external setup needed.  
- SQL syntax is close enough to PostgreSQL for our purposes.  
- The *concepts* (rows, columns, filtering) are identical.

In [None]:
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()

cur.execute("""
CREATE TABLE students (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL,
    age   INTEGER NOT NULL,
    note  TEXT
);
""")

rows = [
    (1, "Ana", 20, "ASCII only"),
    (2, "José", 23, "Has accent"),
    (3, "李雷", 22, "Chinese name"),
    (4, "Aïcha", 24, "Combining diaeresis"),
    (5, "Marta 😀", 21, "Emoji in name"),
]
cur.executemany("INSERT INTO students VALUES (?, ?, ?, ?);", rows)
conn.commit()

for row in cur.execute("SELECT id, name, age, note FROM students;"):
    print(row)
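
Row 4's note mentions a combining diaeresis. Whether a name like this is stored as one code point (`ï`) or two (`i` plus U+0308) depends on how the text was produced, and the two forms look identical on screen but are different strings. The sketch below uses a separate throwaway connection (`demo`) and a hypothetical table `people` to show the effect on an equality lookup, with `unicodedata.normalize` as one way to make the forms agree.

```python
import sqlite3
import unicodedata

composed = "A\u00EFcha"      # ï as a single code point (NFC form)
decomposed = "Ai\u0308cha"   # i followed by U+0308 COMBINING DIAERESIS (NFD form)

print(len(composed), len(decomposed))   # code point counts differ
print(composed == decomposed)           # Python compares code points, not appearance

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE people (name TEXT)")
demo.execute("INSERT INTO people VALUES (?)", (decomposed,))

# Lookup with the composed form misses the decomposed row
miss = demo.execute(
    "SELECT COUNT(*) FROM people WHERE name = ?", (composed,)
).fetchone()[0]

# Normalizing the search term to the stored form finds it
hit = demo.execute(
    "SELECT COUNT(*) FROM people WHERE name = ?",
    (unicodedata.normalize("NFD", composed),),
).fetchone()[0]
demo.close()

print("composed lookup:", miss, "| normalized lookup:", hit)
```

A common design choice is to normalize text to one form (often NFC) before inserting and before querying, so both sides always agree.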

### Basic SELECT Patterns
- Choose columns: `SELECT name, age FROM students;`  
- Filter rows: `WHERE age > 22`  
- Sort results: `ORDER BY age DESC`  
- Pattern match (SQLite): `WHERE name LIKE 'A%'` (by default, `LIKE` is case-insensitive for ASCII letters and case-sensitive for non-ASCII characters)
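
A quick way to check how `LIKE` treats letter case in your SQLite build is to ask it directly. This sketch uses a separate in-memory connection (`demo`) and compares `LIKE` with `GLOB`, which uses `*` as its wildcard; the non-ASCII result is printed but may vary with how SQLite was compiled (for example, with the ICU extension).

```python
import sqlite3

demo = sqlite3.connect(":memory:")

# Does lowercase 'ana' match an uppercase 'A' prefix?
like_lower = demo.execute("SELECT 'ana' LIKE 'A%'").fetchone()[0]
glob_lower = demo.execute("SELECT 'ana' GLOB 'A*'").fetchone()[0]
glob_upper = demo.execute("SELECT 'Ana' GLOB 'A*'").fetchone()[0]
print("LIKE 'A%' on 'ana':", like_lower)
print("GLOB 'A*' on 'ana':", glob_lower)
print("GLOB 'A*' on 'Ana':", glob_upper)

# Non-ASCII case folding: é (U+00E9) vs É (U+00C9); result is build-dependent
non_ascii = demo.execute("SELECT ? LIKE ?", ("\u00E9", "\u00C9")).fetchone()[0]
print("LIKE on é vs É:", non_ascii)
demo.close()
```

When a case-sensitive prefix match is needed, `GLOB` is one option that does not depend on connection settings.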

In [None]:
print("Names + ages, age > 22, ordered by age desc")
for row in cur.execute("""
SELECT name, age
FROM students
WHERE age > 22
ORDER BY age DESC;
"""):
    print(row)

print("\nNames starting with 'A' (LIKE ignores ASCII case in SQLite):")
for row in cur.execute("""
SELECT id, name
FROM students
WHERE name LIKE 'A%';
"""):
    print(row)

### Your Turn (Exercise 4A) — Unicode & Queries
1. Find all students whose names contain a non-ASCII character. *(Hint: iterate names in Python and test `ord(ch) > 127`.)*  
2. Query for students whose names contain an emoji (you can search for `LIKE '%😀%'`).  
3. Create a calculated column in SQL: `age_group` as `'21_or_below'` or `'22_and_above'` and count each group.

In [None]:
# 4A.1: Find names containing non-ASCII characters (Python-side check)
def is_non_ascii(s: str) -> bool:
    return any(ord(ch) > 127 for ch in s)

print("Non-ASCII names:")
for row in cur.execute("SELECT id, name FROM students;"):
    if is_non_ascii(row[1]):
        print(row)

# 4A.2: Names containing an emoji — direct SQL LIKE
print("\nNames with 😀:")
for row in cur.execute("""
SELECT id, name FROM students
WHERE name LIKE '%😀%';
"""):
    print(row)

# 4A.3: Age grouping and counts (SQL CASE)
print("\nAge group counts:")
for row in cur.execute("""
SELECT
  CASE WHEN age <= 21 THEN '21_or_below' ELSE '22_and_above' END AS age_group,
  COUNT(*) as cnt
FROM students
GROUP BY age_group;
"""):
    print(row)
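
The non-ASCII check in 4A.1 can also be pushed into SQL. For text values, SQLite's `length()` counts characters, while `length(CAST(... AS BLOB))` counts stored bytes, so the two differ exactly when a name contains a multi-byte character. A sketch on a throwaway connection, assuming the default UTF-8 database encoding and a hypothetical table `demo_students`:

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE demo_students (name TEXT NOT NULL)")
names = ["Ana", "Jos\u00E9", "\u674E\u96F7", "Marta \U0001F600"]  # Ana, José, 李雷, Marta 😀
demo.executemany("INSERT INTO demo_students VALUES (?)", [(n,) for n in names])

# Character count vs byte count per name
rows = demo.execute("""
SELECT name,
       length(name)               AS n_chars,
       length(CAST(name AS BLOB)) AS n_bytes
FROM demo_students
ORDER BY rowid;
""").fetchall()
for name, n_chars, n_bytes in rows:
    print(f"{name!r}: {n_chars} chars, {n_bytes} bytes")

# Names where the counts differ contain at least one non-ASCII character
non_ascii = [r[0] for r in demo.execute("""
SELECT name FROM demo_students
WHERE length(name) <> length(CAST(name AS BLOB))
ORDER BY rowid;
""")]
demo.close()
print("Non-ASCII names:", non_ascii)
```

This matters for column limits and storage estimates: a limit expressed in characters and one expressed in bytes are not the same constraint.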

## 5) Advanced: Reading File Bytes & Offsets (Optional Extension)

We'll practice reading raw bytes at specific offsets and interpreting them.

**Mini-task:**  
- Write a function `hexdump(path, n=64, offset=0)` that prints `n` bytes from a file starting at `offset`, in hex.  
- Use it on `img_original.png` and `sample_text_utf8.txt`.  
- (Challenge) Skip the PNG signature (first 8 bytes) and look at the next 32 bytes, which begin with the `IHDR` chunk.

In [None]:
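def hexdump(path: str, n: int = 64, offset: int = 0):
    """Print up to n bytes of the file at path, starting at offset, as hex."""
    # Seek to the offset instead of loading the whole file into memory
    with open(path, 'rb') as f:
        f.seek(offset)
        data = f.read(n)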
    hexs = [f"{b:02X}" for b in data]
    print(f"{path} @ offset {offset} (len={len(data)}):\n", ' '.join(hexs))

# Examples
hexdump('img_original.png', n=32, offset=0)
hexdump('img_original.png', n=32, offset=8)  # skip the 8-byte PNG signature
hexdump('sample_text_utf8.txt', n=64, offset=0)
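
For the challenge, the bytes right after the signature can be interpreted rather than just printed. The sketch below builds a tiny PNG in memory so it runs without any files, then reads the width, height, bit depth, and color type back out of the `IHDR` chunk with `struct`. The helpers `make_chunk`, `tiny_png`, and `read_ihdr` are hypothetical names introduced for this illustration.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"   # the fixed first 8 bytes of every PNG

def make_chunk(kind: bytes, payload: bytes) -> bytes:
    """Build one PNG chunk: 4-byte length, 4-byte type, payload, 4-byte CRC."""
    crc = zlib.crc32(kind + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + kind + payload + struct.pack(">I", crc)

def tiny_png(width: int, height: int) -> bytes:
    """Return the bytes of a small solid-red 8-bit RGB PNG."""
    # IHDR payload: width, height, bit depth 8, color type 2 (RGB), then three zero flags
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    # Each scanline starts with filter byte 0, followed by width RGB triples
    scanlines = b"".join(b"\x00" + b"\xff\x00\x00" * width for _ in range(height))
    return (PNG_SIGNATURE
            + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", zlib.compress(scanlines))
            + make_chunk(b"IEND", b""))

def read_ihdr(data: bytes) -> dict:
    """Parse the signature and IHDR chunk from the start of PNG bytes."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("missing PNG signature")
    length, kind = struct.unpack(">I4s", data[8:16])      # big-endian length + chunk type
    if kind != b"IHDR" or length != 13:
        raise ValueError("first chunk is not a 13-byte IHDR")
    width, height, bit_depth, color_type, _, _, _ = struct.unpack(">IIBBBBB", data[16:29])
    return {"width": width, "height": height,
            "bit_depth": bit_depth, "color_type": color_type}

info = read_ihdr(tiny_png(3, 2))
print(info)
```

To try it on a file from this lab, pass `Path('img_original.png').read_bytes()` to `read_ihdr` and compare the reported size with the 64×64 array you saved.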

## 6) Synthesis & Reflection

Answer in your own words (short paragraphs):

1. How do bits and bytes scale up to represent complex data like Unicode strings and RGB images? Give an example from this notebook.
2. Why can lowering color depth reduce file size? What visual artifacts did you notice when quantizing?
3. In databases, why do text encodings matter? Describe a realistic bug that can occur when encodings are mismatched.
4. After today's work, explain *when* you would inspect raw bytes and *why* — in debugging, security, or data engineering contexts.