# Chapter 3: Feature Representation for Each Modality

## Learning Objectives

After reading this chapter, you should be able to:
- Understand text representation methods from BoW to BERT
- Explain CNNs and Vision Transformers for images
- Describe MFCC and self-supervised learning for audio
- Compare different modality representations
- Choose appropriate representations for specific tasks

## 3.1 Text Representation: Evolution and Methods

### Historical Evolution

```
Timeline of text representation:

1950s-1990s:    Manual feature engineering
  ↓
1990s-2000s:    Bag-of-Words, TF-IDF
  ↓
2000s-2010s:    Word embeddings (Word2Vec, GloVe)
  ↓
2013-2018:      RNN, LSTM, GRU with embeddings
  ↓
2017+:          Transformer-based (BERT, GPT)
  ↓
2020+:          Large language models (GPT-3, LLaMA)
  ↓
2024+:          Multimodal LLMs
```

### Method 1: Bag-of-Words (BoW)

**Concept:**
Treat text as an unordered collection of words, ignoring sequence and grammar.

**Process:**

```
Input:     "The cat sat on the mat"
             ↓
Tokenize:  ["the", "cat", "sat", "on", "the", "mat"]
             ↓
Count:     {"the": 2, "cat": 1, "sat": 1, "on": 1, "mat": 1}
             ↓
Vectorize: [2, 1, 1, 1, 1]  (in vocabulary order)
```

**Formal definition:**

```
For vocabulary V = {w_1, w_2, ..., w_N}
Text represented as: x = [c_1, c_2, ..., c_N]
where c_i = count of word w_i in text

Dimension = vocabulary size (can be 10,000-50,000)
```
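The counting step can be sketched in a few lines of Python. This is a minimal illustration that assumes lowercasing and whitespace tokenization; the helper name `bag_of_words` is hypothetical and not taken from any library.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return the count vector of `text` over a fixed (lowercase) vocabulary."""
    counts = Counter(text.lower().split())   # naive whitespace tokenizer
    return [counts[word] for word in vocabulary]

vocabulary = ["the", "cat", "sat", "on", "mat"]
vector = bag_of_words("The cat sat on the mat", vocabulary)
print(vector)  # [2, 1, 1, 1, 1]
```

Because only counts are kept, any reordering of the same words maps to the same vector, which is exactly the word-order limitation discussed in this section.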

**Example - Classification:**

```
Training data:
  Text 1: "I love this movie" → Label: Positive
  Text 2: "This movie is bad" → Label: Negative

BoW vectors (count of each word present):
  Text 1: {i: 1, love: 1, this: 1, movie: 1}
  Text 2: {this: 1, movie: 1, is: 1, bad: 1}

Classifier learns:
  "love" → positive contribution
  "bad" → negative contribution
  "this", "movie" → little contribution (appear in both classes)
```

**Advantages:**
✓ Simple and fast
✓ Interpretable
✓ Works surprisingly well for many tasks

**Disadvantages:**
✗ Loses word order ("dog bit man" = "man bit dog")
✗ No semantic relationships ("happy" vs "joyful" treated as completely different)
✗ All words equally important (doesn't distinguish important from common words)
✗ Very high dimensionality

**When to use:**
- Spam detection
- Topic modeling
- Simple text classification
- When simplicity and speed are priorities

### Method 2: TF-IDF (Term Frequency-Inverse Document Frequency)

**Motivation:**
BoW treats all words equally. But some words are more informative than others.

**Concept:**

```
Importance = (word frequency in document) × (rarity across corpus)

Words appearing everywhere ("the", "is") get low weight
Words appearing rarely but specifically ("CEO", "algorithm") get high weight
```

**Formal definition:**

```
TF (Term Frequency):
  TF(t,d) = count(t in d) / total_words(d)
  Normalized frequency of term t in document d

IDF (Inverse Document Frequency):
  IDF(t) = log(total_documents / documents_containing_t)
  How rare is this term across all documents?

TF-IDF:
  TF-IDF(t,d) = TF(t,d) × IDF(t)
```
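The definitions above translate almost directly into code. The sketch below is one possible reading: it assumes a base-10 logarithm, documents given as token lists, and that the queried term occurs in at least one document. The function names and the three-document toy corpus are illustrative only.

```python
import math

def tf(term, doc_tokens):
    """Normalized frequency of `term` in one document."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """log10(total documents / documents containing the term)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / containing)  # assumes containing > 0

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "algorithm", "converged"],
]
print(tf_idf("the", corpus[0], corpus))  # 0.0, "the" appears in every document
print(tf_idf("cat", corpus[0], corpus))  # positive, "cat" is distinctive here
```

Production implementations usually add smoothing to the IDF term so that unseen or ubiquitous terms do not produce division by zero or exact zeros; that refinement is left out here to stay close to the formula in the text.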

**Example calculation:**

```
Corpus: 1,000 documents; document D contains 100 words
(log is base 10 here)

Term "cat": appears in 100 documents, 5 times in document D

TF = 5 / 100 = 0.05
IDF = log(1000/100) = log(10) = 1.0
TF-IDF = 0.05 × 1.0 = 0.05

Compare to:
Term "the": appears in 900 documents, 50 times in document D

TF = 50 / 100 = 0.50
IDF = log(1000/900) = log(1.11) ≈ 0.046
TF-IDF = 0.50 × 0.046 ≈ 0.023

"the" occurs 10× more often in D, yet scores lower than "cat".
That's the point: IDF downweights words that appear everywhere.
```

**Benefits over BoW:**
✓ Handles different document lengths better
✓ Downweights common words
✓ Emphasizes distinctive terms

**Disadvantages:**
✗ Still ignores word order
✗ No semantic understanding
✗ Requires corpus statistics
✗ Doesn't handle synonyms

**When to use:**
- Information retrieval and search (TF-IDF is the foundation of many search engines)
- Document classification
- When you have many documents and limited compute

### Method 3: Word2Vec - Learning Word Meaning

**Revolutionary idea (Mikolov et al., 2013):**
"Words with similar contexts have similar meanings"

**Learning through prediction:**

```
Idea: If we can predict context words from a word,
      we've learned what that word means.

Process:

Text: "The dog barked loudly at the mailman"
              ↓
Focus on "barked", predict context:
  Context: {dog, loudly, at, the}
  Prediction task: Given "barked", predict these

Loss: How well did we predict?
  If good prediction → "barked" representation is good
  If poor → Update "barked" vector

After training on millions of sentences:
  "barked" vector captures:
  - Associated with actions
  - Related to animals
  - Past tense
  - Physical events
```

**Key discovery:**

```
Vector arithmetic works!

king - man + woman ≈ queen

Explanation:
- "king" and "queen" appear in similar contexts (monarchy)
- "man" and "woman" capture gender dimension
- Vector subtraction removes gender from "king"
- Vector addition applies gender to result
- Result: "queen"

This algebraic structure wasn't hand-designed!
It emerged from learning word contexts.
```

**Technical details - Two approaches:**

**Skip-gram:**

```
Input: Target word "barked"
Task: Predict context words {dog, loudly, at, the}

Model: Two embedding matrices
  Input embedding: What is "barked"?
  Output embedding: What patterns lead to context?

Optimization:
  Maximize: P(context | barked)
  Network learns useful representations
```

**CBOW (Continuous Bag of Words):**

```
Input: Context words {the, dog, loudly, at}
Task: Predict center word "barked"

Reverse of skip-gram
Can be faster to train
```
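To make the difference between the two objectives concrete, the sketch below builds the training examples each one would see for a single sentence. It assumes a symmetric window of 2 words; the helper names are hypothetical, and no actual embedding training is performed.

```python
def context_window(tokens, index, window=2):
    """Tokens within `window` positions of tokens[index], excluding the center."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: skip-gram predicts each context word from the center."""
    return [(center, ctx)
            for i, center in enumerate(tokens)
            for ctx in context_window(tokens, i, window)]

def cbow_examples(tokens, window=2):
    """(context_list, center) examples: CBOW predicts the center from its context."""
    return [(context_window(tokens, i, window), center)
            for i, center in enumerate(tokens)]

tokens = "the dog barked loudly at the mailman".split()
print(context_window(tokens, 2))  # ['the', 'dog', 'loudly', 'at']
```

Skip-gram turns one window into several (center, context) pairs, while CBOW turns it into a single example, which is one reason CBOW can be faster to train.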

**Properties:**
- Fixed embedding per word (doesn't handle polysemy)
- 300D vectors typical
- Can be trained on unlabeled data
- Transferable to downstream tasks

**Example - Semantic relationships:**

```
Illustrative values (exact numbers depend on the trained model):

cos_sim(king, queen) ≈ 0.7   (high, related)
cos_sim(king, man) ≈ 0.65     (high, overlapping)
cos_sim(queen, woman) ≈ 0.68  (high, overlapping)
cos_sim(king, dog) ≈ 0.2      (low, unrelated)

Structure emerges in embedding space!
```
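Cosine similarity and the analogy arithmetic can be tried on toy data. The vectors below are hand-made 2-D stand-ins (one axis for "royalty", one for "maleness"), not real Word2Vec output; they only illustrate the mechanics.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors (axes: royalty, maleness); NOT learned embeddings.
toy = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
}

# king - man + woman, computed component-wise
analogy = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]
best = max(toy, key=lambda word: cosine_similarity(analogy, toy[word]))
print(best)  # queen
```

With real embeddings the query words themselves are usually excluded from the candidate set, since the nearest neighbor of the result vector is often one of the inputs.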

**Limitations:**
✗ One vector per word (ignores context and polysemy)
✗ "Bank" (financial) and "bank" (river) have identical vectors
✗ Same word might mean different things in different contexts
✗ Doesn't capture longer-range dependencies

**When to use:**
- Quick baseline for text tasks
- When you need interpretable word relationships
- Transfer learning where only word similarity needed
- When computational resources are limited

### Method 4: BERT - Context-Aware Embeddings

**Motivation:**

Word2Vec limitation - context blindness:

```
Sentence 1: "I went to the bank to deposit money"
Sentence 2: "I sat on the bank of the river"

Word2Vec:
  "bank" in both sentences → IDENTICAL vector
  Problem: Different meanings!

What we need:
  Context-aware "bank" for finance sentence
  Different context-aware "bank" for river sentence
```

**BERT Innovation (Devlin et al., 2018):**
"Use entire sentence context to generate embeddings"

**Architecture overview:**

```
Input text: "The cat sat on the mat"
             ↓
Tokenization (using WordPiece):
  [CLS] The cat sat on the mat [SEP]
             ↓
Embedding:
  - Token embedding (which word)
  - Position embedding (where in sequence)
  - Segment embedding (which sentence)
             ↓
Transformer encoder (12 layers):
  Each layer:
    - Self-attention (how relevant is each token to others)
    - Feed-forward network
    - Normalization
             ↓
Output: 8 vectors of 768D each (one per token, including [CLS] and [SEP])
  Each token has representation influenced by entire sequence
```

**Key innovation - Bidirectional context:**

```
Traditional RNN: Left-to-right only
  Input: "The cat sat..."
         Process: The → cat → sat
         When processing "sat", don't know what comes after

BERT: Bidirectional
  Input: "The cat sat on the mat"
         Process: Entire sequence simultaneously
         All positions see all other positions
         Through self-attention in first layer
```
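The "all positions see all other positions" property comes from self-attention. The sketch below is a deliberately stripped-down single head in plain Python: it uses the input vectors directly as queries, keys, and values, whereas a real Transformer layer applies learned projections, multiple heads, and residual connections. The toy vectors are made up.

```python
import math

def softmax(scores):
    peak = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Single-head attention with queries = keys = values = the inputs."""
    dim = len(vectors[0])
    outputs, weights = [], []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        attn = softmax(scores)               # one weight per position, left AND right
        weights.append(attn)
        outputs.append([sum(a * v[d] for a, v in zip(attn, vectors))
                        for d in range(dim)])
    return outputs, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three toy token vectors
outputs, weights = self_attention(tokens)
print(weights[0])  # attention of token 0 over all three positions
```

Every row of `weights` is strictly positive, so each output vector mixes information from tokens on both sides, which is what distinguishes this from a left-to-right recurrent pass.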

**Training procedure - Masked Language Modeling:**

```
Goal: Learn good representations for any language task

Method: Predict masked words

Original:      "The [MASK] sat on the mat"
Task:          Predict the masked word
Expected:      "cat"

Training:
  ① Randomly mask 15% of tokens
  ② Model predicts masked tokens
  ③ Loss = cross-entropy between predicted and actual
  ④ Update all parameters

Result:
  Model learns representations that contain
  information about what words should appear
  = learns semantic and syntactic patterns
```
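The data-preparation half of masked language modeling can be sketched as follows. This is a simplified version that always substitutes `[MASK]`; BERT's published recipe additionally leaves some selected tokens unchanged or swaps in random tokens. The function name and the fixed seed are assumptions made for reproducibility.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace about mask_rate of the tokens with [MASK].

    Returns the masked sequence and a {position: original_token} dict
    holding the labels the model would be trained to predict.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))   # mask at least one token
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = "[MASK]"
    return masked, targets

masked, targets = mask_tokens("the cat sat on the mat".split())
print(masked, targets)
```

The cross-entropy loss in step ③ is then computed only at the positions stored in `targets`.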

**Using BERT embeddings:**

```
For sentence classification:
  ① Process sentence through BERT
  ② Extract [CLS] token (special classification token)
  ③ [CLS] vector = sentence representation (768D)
  ④ Add linear classifier on top
  ⑤ Train classifier on downstream task

For token classification (e.g., NER):
  ① Process sentence through BERT
  ② Extract all token vectors (each is 768D)
  ③ Each token has context-aware representation
  ④ Add classifier for each token
  ⑤ Predict label for each token

Benefit:
  - No task-specific feature engineering needed
  - Transfer learning from massive pre-training
  - Strong performance on small datasets
```

**Concrete example - Polysemy handling:**

```
Sentence 1: "I went to the bank to deposit money"
  "bank" → BERT embedding with finance context

Sentence 2: "I sat on the bank of the river"
  "bank" → BERT embedding with geography context

Different embeddings!
BERT captures context from surrounding words
```

**Properties:**
- Context-dependent embeddings
- 768D vectors (BERT-base)
- Larger versions available (BERT-large: 1024D)
- Pre-trained on 3.3B words
- Extremely effective for transfer learning

**Advantages over Word2Vec:**
✓ Handles polysemy (same word, different contexts)
✓ Bidirectional context
✓ Pre-trained on massive corpus
✓ Strong transfer learning
✓ Achieves SOTA on many tasks

**Disadvantages:**
✗ Computationally expensive
✗ Slower inference than Word2Vec
✗ Requires more compute resources
✗ Less interpretable (768D vectors hard to understand)

**When to use:**
- Text classification (sentiment, topic)
- Named entity recognition
- Question answering
- Semantic similarity
- When accuracy is more important than speed
- When GPU resources are available

### Method 5: Large Language Models (LLMs)

**Further evolution - GPT family:**

```
BERT (2018):        Encoder-only, bidirectional
GPT (2018):         Decoder-only, left-to-right
GPT-2 (2019):       1.5B parameters
GPT-3 (2020):       175B parameters - in-context learning
GPT-4 (2023):       Parameter count not publicly disclosed - multimodal
```

**LLM representations:**

```
GPT-3 embeddings (175B model, 96 layers):
  Layer 1:    Basic patterns
  Layer 32:   Mid-level concepts
  Layer 64:   High-level semantics
  Layer 96 (final): Task-specific representations

Properties:
  - 12,288D vectors (very high-dimensional)
  - Captures vast knowledge
  - Can be used as semantic features
  - More interpretable than BERT in some ways
```

**Using LLM embeddings for multimodal tasks:**

```
Instead of using fixed word embeddings,
use representations from large language models

Benefit:
  - Captures world knowledge from pre-training
  - Understands complex semantics
  - Better for rare/unusual concepts
  - Can be adapted to specific domains

Cost:
  - Expensive API calls (if using services like OpenAI)
  - Privacy concerns (data sent to external servers)
  - Latency (requires API round-trip)
```

**Comparison of text representations:**

```
Method          Dimension   Context-aware   Speed       Pre-training
──────────────────────────────────────────────────────────────────────
BoW             10K-50K     No              Fast        None needed
TF-IDF          10K-50K     No              Fast        Corpus stats
Word2Vec        300         No              Fast        Large corpus
GloVe           300         No              Fast        Large corpus
FastText        300         No              Fast        Large corpus
ELMo            1024        Yes             Slow        Large corpus
BERT            768         Yes             Medium      Huge corpus
RoBERTa         768         Yes             Medium      Huge corpus
GPT-2           1600        Yes             Slow        Huge corpus
GPT-3           12288       Yes             Very slow   Huge corpus (API access)
```

## 3.2 Image Representation: From Pixels to Concepts

### Historical Evolution

```
Timeline:

1980s-1990s:    Edge detection (Canny, Sobel)
  ↓
1990s-2000s:    Hand-crafted features (SIFT, HOG)
  ↓
2012:           AlexNet - Deep learning breakthrough
  ↓
2014:           VGGNet, GoogLeNet
  ↓
2015:           ResNet - Skip connections, very deep networks
  ↓
2020:           Vision Transformer - Attention-based vision
  ↓
2024:           Large multimodal models processing images
```

### Method 1: Hand-Crafted Features

**SIFT (Scale-Invariant Feature Transform)**

```
Problem solved:
  "Find the same building in photos taken at different times,
   different angles, different zoom levels"

SIFT features are invariant (or robust) to:
  - Translation (where object is in image)
  - Scaling (zoom level)
  - Rotation (in-plane camera rotation)
  - Illumination (partially: moderate lighting changes)

Process:
  1. Find keypoints (interest points)
     - Corners, edges, distinctive regions

  2. Describe neighborhoods around keypoints
     - Direction and magnitude of gradients
     - Histogram of edge orientations

  3. Result: Keypoint descriptor (128D vector)
     - Invariant to many transformations
     - Can match same keypoint across images

Example:
  Building in Photo 1 (summer, noon, straight angle)
  Same building in Photo 2 (winter, sunset, moderately different angle)

  SIFT can find matching keypoints!
  Enables: Panorama stitching, 3D reconstruction
```

**HOG (Histogram of Oriented Gradients)**

```
Key insight:
  Human shape recognition relies on edge directions
  (Horizontal edges on top = head, vertical on sides = body)

Process:
  1. Divide image into cells (8×8 pixels)

  2. For each cell:
     - Compute gradient (edge) direction at each pixel
     - Create histogram of edge directions (typically 9 bins)

  3. Result: Normalize over overlapping 2×2-cell blocks, then concatenate
     - Captures shape and edge structure
     - Dimension: 3,780 for 64×128 image (105 blocks × 36 values)

Application:
  Pedestrian detection
  - HOG captures distinctive human silhouette
  - Works well because human shape is distinctive
  - Fast computation (no deep learning needed)

  Limitation:
  - Works best for objects with a consistent outline (upright pedestrians, frontal faces)
  - Fails for abstract categories
```
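The 3,780 figure follows from the cell and block layout. The short calculation below assumes the commonly used settings (8×8-pixel cells, 2×2-cell blocks, a block stride of one cell, 9 orientation bins); these defaults are assumptions on top of the text, and the function name is hypothetical.

```python
def hog_dimension(width, height, cell=8, block=2, bins=9, stride_cells=1):
    """Length of the HOG descriptor for a width x height detection window."""
    cells_x, cells_y = width // cell, height // cell
    blocks_x = (cells_x - block) // stride_cells + 1    # overlapping blocks per row
    blocks_y = (cells_y - block) // stride_cells + 1    # overlapping blocks per column
    return blocks_x * blocks_y * block * block * bins   # blocks × cells/block × bins

print(hog_dimension(64, 128))  # 3780
```

For a 64×128 window this is 7 × 15 = 105 blocks, each contributing 4 cells × 9 bins = 36 values.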

**Bag-of-Visual-Words**

```
Idea: Apply Bag-of-Words concept to images

Process:
  1. Extract SIFT features from image
     → Get 100-1000 keypoint descriptors per image

  2. Cluster descriptors (k-means)
     → Create "visual vocabulary" (e.g., 1000 clusters)
     → Each cluster = one "visual word"

  3. Histogram of visual words
     → Count which words appear in image
     → Result: Bag-of-words vector

  4. Classify or compare based on histogram

Example:
  Image 1 has: {30 "corner edges", 20 "smooth curves", ...}
  Image 2 has: {5 "corner edges", 45 "smooth curves", ...}

  More curve words → Perhaps a cat
  More corner words → Perhaps a building
```

**Advantages of hand-crafted features:**
✓ Interpretable (understand what they measure)
✓ Fast computation
✓ Works with small datasets
✓ Explicit mathematical basis

**Disadvantages:**
✗ Requires domain expertise to design
✗ Limited to specific feature types
✗ Poor generalization to new domains
✗ Cannot capture complex semantic patterns
✗ Manually chosen → not optimized for task

**When to use:**
- When you understand the specific patterns to detect
- Limited computational resources
- Small datasets
- Tasks where hand-crafted features are well-suited (e.g., pedestrian detection)

### Method 2: CNNs - Automatic Feature Learning

**The Breakthrough (AlexNet, 2012):**

```
Revolutionary insight:
  "Stop hand-crafting features!
   Let neural networks learn what's important."

Results:
  ImageNet competition (top-5 error):
  - 2011 (hand-crafted): 25.8% error
  - 2012 (AlexNet): 15.3% error  ← ~41% relative error reduction!
  - 2015 (ResNet): 3.6% error   ← Below estimated human error (~5%)
```

**Hierarchical Feature Learning:**

```
Raw image (224×224×3 pixels)
        ↓
Layer 1-2: Low-level features
  - Edge detection
  - Simple curves
  - Corners
  └─→ What: Detects local patterns
      Why: Edges are building blocks
      Output: 64 feature maps (32×32)

Layer 3-4: Mid-level features
  - Textures
  - Shapes
  - Parts
  └─→ What: Combines local patterns
      Why: Shapes emerge from edges
      Output: 256 feature maps (16×16)

Layer 5: High-level features
  - Objects
  - Semantic concepts
  - Scene context
  └─→ What: Object detectors
      Why: Objects are concepts
      Output: 512 feature maps (8×8)

Global pooling & Dense layers:
  - Aggregate spatial info
  - Predict class probabilities
  └─→ Output: Class predictions
```

**Why CNNs work:**

```
1. Inductive bias toward images
   - Local connectivity: Nearby pixels related
   - Shared weights: Same pattern recognized anywhere
   - Translation equivariance (near-invariance after pooling): "Cat is a cat" whether left/right

2. Hierarchical composition
   - Edges → Shapes → Objects
   - Matches how we see

3. Parameter sharing
   - Filters reused across space
   - Reduces parameters vs fully connected
   - Enables learning on larger images
```

**Key architecture - ResNet (Residual Networks):**

```
Problem with deep networks:
  Deeper = more parameters = better?
  But: Very deep networks are hard to train!

  Cause: Gradient vanishing
    Backprop through 100 layers:
    gradient = g₁ × g₂ × g₃ × ... × g₁₀₀

    If each gᵢ = 0.9:
    0.9¹⁰⁰ ≈ 0.000027  (essentially zero!)

    Can't learn early layers

Solution: Skip connections (residual connections)

Normal layer: y = f(x)
Residual layer: y = x + f(x)

Benefit:
  Even if f(x) learns nothing (f(x)=0),
  y = x still flows information through

  Gradient paths:
  Without skip: gradient = ∂f/∂x × ∂f/∂x × ...
  With skip: gradient = (1 + ∂f/∂x) × (1 + ∂f/∂x) × ...

  The "+1" terms prevent vanishing!
```
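A quick numeric check makes the effect tangible. The sketch below treats every layer as a scalar map with the same local derivative, which is a strong simplification of a real network; with a skip connection y = x + f(x), each factor in the chain rule becomes 1 + ∂f/∂x. The function name is illustrative.

```python
def chain_gradient(local_grad, depth, skip=False):
    """Product of per-layer derivatives across `depth` identical scalar layers.

    With a residual connection, each layer's derivative is 1 + local_grad.
    """
    factor = 1.0 + local_grad if skip else local_grad
    return factor ** depth

print(chain_gradient(0.9, 100))               # ~2.66e-05, effectively vanished
print(chain_gradient(0.01, 100, skip=True))   # ~2.70, still a usable signal
```

Even when each residual branch contributes a tiny derivative (0.01 here), the identity path keeps the overall gradient on the order of 1 instead of collapsing toward zero.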

**ResNet architecture example (ResNet-50):**

```
Input: Image (224×224×3)
  ↓
Conv 7×7, stride 2
→ (112×112×64)
  ↓
MaxPool 3×3, stride 2
→ (56×56×64)
  ↓
Residual Block 1: [3 conv blocks]
→ (56×56×256)
  ↓
Residual Block 2: [4 conv blocks]
→ (28×28×512)
  ↓
Residual Block 3: [6 conv blocks]
→ (14×14×1024)
  ↓
Residual Block 4: [3 conv blocks]
→ (7×7×2048)
  ↓
Average Pool
→ (2048,)
  ↓
Linear layer (1000 classes)
→ Predictions

Total parameters: 25.5M
Depth: 50 layers
Performance: 76% ImageNet top-1 accuracy
```
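The spatial sizes in the diagram can be reproduced with the standard convolution output-size formula. The padding values below (3 for the 7×7 convolution, 1 for 3×3 operations) are assumptions matching common ResNet implementations; they are not stated in the text.

```python
def conv_out(size, kernel, stride, padding):
    """Output width/height of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

size = 224
size = conv_out(size, kernel=7, stride=2, padding=3)   # 7x7 conv, stride 2 -> 112
size = conv_out(size, kernel=3, stride=2, padding=1)   # 3x3 max pool      -> 56
sizes = [size]                                         # stage 1 keeps 56x56
for _ in range(3):                                     # stages 2-4 each downsample by 2
    size = conv_out(size, kernel=3, stride=2, padding=1)
    sizes.append(size)
print(sizes)  # [56, 28, 14, 7]
```

Each halving of the spatial size is paired with a doubling of the channel count (256, 512, 1024, 2048), so the final 7×7×2048 map average-pools to the 2048D feature vector.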

**Properties:**
- 2048D global feature vector (before classification)
- Pre-trained on ImageNet (1.4M images)
- Can fine-tune on downstream tasks
- Very stable training (skip connections)

**Advantages:**
✓ Learns task-relevant features
✓ Transfers well to other tasks
✓ Stable training (deep networks possible)
✓ Interpretable to some extent (visualize activations)
✓ Efficient inference

**Disadvantages:**
✗ Black-box decisions (what does each dimension mean?)
✗ Requires large labeled datasets to train from scratch
✗ Inherits biases from ImageNet

**When to use:**
- Most modern computer vision tasks
- Transfer learning (fine-tune on new task)
- When you want strong off-the-shelf features
- Production systems (mature, optimized, proven)

### Method 3: Vision Transformers (ViT)

**Paradigm shift (Dosovitskiy et al., 2020):**

```
Traditional thinking:
  "Images need CNNs!"
  Reason: Spatial structure, translational equivariance

ViT question:
  "What if we just use Transformers like NLP?"
  Insight: Pure attention can learn spatial patterns

Result:
  Vision Transformer outperforms ResNet
  When trained on large datasets!
```

**Architecture:**

```
Input image (224×224×3)
        ↓
Divide into patches (16×16)
        ↓
14×14 = 196 patches
        ↓
Each patch: 16×16×3 = 768D
        ↓
Linear projection
        ↓
196 vectors of 768D
        ↓
Prepend [CLS] token
(like BERT for images; 197 tokens total)
        ↓
Add positional encoding
(so model knows spatial position)
        ↓
Transformer encoder (12 layers)
        ↓
Extract [CLS] token
        ↓
768D image representation
```
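The patch-splitting step is simple enough to write out directly. The sketch below works on a nested-list image (height × width × channels) and assumes both sides are divisible by the patch size; the learned linear projection, [CLS] token, and positional encoding are omitted. The function name is hypothetical.

```python
def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [value
                    for row in image[top:top + patch]
                    for pixel in row[left:left + patch]
                    for value in pixel]
            patches.append(flat)
    return patches

image = [[[0, 0, 0] for _ in range(224)] for _ in range(224)]  # dummy RGB image
patches = patchify(image, patch=16)
print(len(patches), len(patches[0]))  # 196 768
```

Switching to `patch=32` on the same image gives 49 tokens of 3,072 raw values each, which illustrates the token-count versus detail trade-off.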

**How it works:**

```
Key insight: Patches are like words

In NLP:
  Word tokens → Transformer → Semantic relationships

In ViT:
  Image patches → Transformer → Spatial relationships

Layer 1:
  Each patch attends to all other patches
  Learns: Which patches are related?

Layer 2-12:
  Progressively integrate information
  Layer 6: Coarse spatial understanding
  Layer 12: Fine-grained semantic understanding
```

**Why this works:**

1. **Global receptive field from Layer 1**

   CNN needs many layers to see globally
   ViT sees all patches from first layer
   Enables faster learning of global patterns

2. **Flexible patch size**

   Can use any patch size
   Trade-off:
   - Larger patches (32×32): Fewer tokens, less detail
   - Smaller patches (8×8): More tokens, finer detail

3. **Scales with data**

   CNNs strong with small data (inductive biases)
   ViT weak with small data, strong with large

   Modern pre-training datasets are massive
   → ViT often wins at scale

**Example - ViT-Base vs ResNet-50:**

```
                ViT-Base       ResNet-50
────────────────────────────────────────
Parameters      86M            25.5M
ImageNet acc    77.9%          76%
Training data   1.4M+JFT       1.4M
Input size      224×224        224×224
Fine-tuning     Excellent      Good

Interpretation:
  ViT needs more data to train
  But then performs better
  Especially when transferring to new tasks
```

**Advantages:**
✓ Better scaling properties
✓ Transfers better to downstream tasks
✓ Simpler architecture (no CNN-specific tricks needed)
✓ More interpretable (attention patterns show what matters)
✓ Unified with NLP (same architecture for both)

**Disadvantages:**
✗ Worse with small datasets
✗ Requires more computation than CNN equivalents
✗ Training unstable (needs careful tuning)
✗ Slower inference on some hardware

**When to use:**
- Large-scale applications
- Transfer learning to new visual tasks
- When computational resources abundant
- When interpretability matters (attention visualization)
- New research (faster progress with transformers)

**Attention visualization:**

```
For each query patch, show which patches it attends to

Example - Query at cat's head position:

Attention heatmap:
[   0    0    0  ]
[   0   0.9   0.8]  (high attention to nearby patches)
[   0    0.6   0  ]

Shows:
- Model focuses on cat head region
- Attends to surrounding patches (context)
- Ignores background regions
```

## 3.3 Audio Representation: From Waveforms to Features

### Method 1: MFCC (Mel-Frequency Cepstral Coefficients)

**Principle:**
"Extract features that match human hearing, not physics"

**Why needed:**

```
Raw audio at 16kHz:
  1 second = 16,000 samples
  10 seconds = 160,000 samples

Problem:
  Too many numbers to process
  Not perceptually relevant (e.g., 7.0kHz vs 7.1kHz sound nearly identical)

Solution:
  Extract ~39 features per 25ms frame (13 MFCCs + deltas + delta-deltas)
  Much more compact and perceptually meaningful
```

**Extraction process step-by-step:**

```
① Raw waveform
   Sample audio: 16kHz, mono
   Duration: 10 seconds

② Pre-emphasis
   Amplify high frequencies
   Reason: High frequencies carry important information
   Filter: y[n] = x[n] - 0.95*x[n-1]

③ Frame division
   Split into overlapping frames
   Frame length: 25ms = 400 samples
   Hop size: 10ms = 160 samples
   Result: ~998 frames for 10-second audio

④ Window each frame
   Apply Hamming window: reduces edge artifacts

⑤ Fourier Transform (FFT)
   Convert time domain → frequency domain
   For each frame: 400 samples → 201 frequency bins (N/2 + 1)

⑥ Mel-scale warping
   Map frequency to Mel scale (human perception)
   Common formula: mel = 2595 × log10(1 + f/700)

   Linear frequency: 125Hz,   250Hz,   500Hz,   1000Hz,   2000Hz
   Mel frequency:    ~185Mel, ~344Mel, ~607Mel, ~1000Mel, ~1521Mel

   Why?
   Humans more sensitive to low frequencies
   High frequencies sound similar to each other
   (1000Hz difference matters less at 10,000Hz)

⑦ Logarithm
   Human loudness perception is logarithmic
   log(power) more perceptually uniform than power

⑧ Discrete Cosine Transform (DCT)
   Decorrelate the Mel-scale powers
   Result: Typically 13 coefficients (39 with deltas and delta-deltas)

Result: MFCC vector
  Dimensions: 39 (or 13, 26 depending on config)
  One vector per 10ms
  Represents spectral shape at that time
```
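Three of the steps above are small enough to sketch in plain Python: the pre-emphasis filter, the frame count, and the Hz-to-Mel mapping. The Mel formula used here, 2595 × log10(1 + f/700), is one common convention (others exist), and the frame count assumes no padding at the end of the signal. The function names are illustrative; windowing, FFT, filterbank, and DCT are left out.

```python
import math

def pre_emphasis(signal, coeff=0.95):
    """y[n] = x[n] - coeff * x[n-1]; the first sample is passed through unchanged."""
    return [signal[0]] + [signal[n] - coeff * signal[n - 1]
                          for n in range(1, len(signal))]

def num_frames(num_samples, frame_len, hop):
    """Number of complete frames when the signal is not padded."""
    return 1 + (num_samples - frame_len) // hop

def hz_to_mel(freq_hz):
    """One common Mel-scale formula (an assumption; variants exist)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

sample_rate = 16000
frame_len = int(0.025 * sample_rate)   # 25 ms -> 400 samples
hop = int(0.010 * sample_rate)         # 10 ms -> 160 samples
print(num_frames(10 * sample_rate, frame_len, hop))  # 998
print(round(hz_to_mel(1000)))                        # 1000
```

Under this formula the scale is roughly linear below 1 kHz and compressive above it: doubling the frequency from 1,000 Hz to 2,000 Hz adds only about 520 Mel.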

**Visualization:**

```
Raw waveform:          Spectrogram:           MFCCs:
Amplitude              Frequency vs Time      Features vs Time
   ↑                      High ▲               ↑
   │ ~~~~               ▓▓▓▓▓│▓▓▓          ▓▓▓│▓▓▓
   │~  ~  ~  ~~       ▓▓▓  │▓▓▓          ▓▓ │▓▓
   │ ~ ~~  ~ ~       ▓▓   │▓           ▓  │▓
   └──────────→      ▓▓    │            ▓  │
   Time (s)         Low ▼  └─────────→ Coeff│
                         Time (s)         └─→
                                        Dim 1-39
```

**Example - Speech recognition:**

```
Audio: "Hello"
        ↓
MFCC extraction (39D per frame)
        ↓
10 frames of audio (each 10ms):
  Frame 1: [0.2, -0.1, 0.5, ..., 0.3] (39D)
  Frame 2: [0.21, -0.08, 0.52, ..., 0.31] (39D)
  ...
  Frame 10: [0.15, -0.12, 0.45, ..., 0.25] (39D)
        ↓
Sequence of MFCCs: 10×39 matrix
        ↓
Feed to speech recognition model
        ↓
Output: Text "Hello"
```

**Properties:**
- Fixed dimensionality (39D)
- Perceptually meaningful
- Low computational cost
- Standard for speech tasks

**Advantages:**
✓ Fast to compute
✓ Well-understood (40+ years research)
✓ Works well for speech (the main audio task)
✓ Low dimensionality
✓ Perceptually meaningful

**Disadvantages:**
✗ Not learnable (fixed formula)
✗ May discard useful information
✗ Optimized for speech; less suited to music

**When to use:**
- Speech recognition
- Speaker identification
- Emotion recognition from speech
- Music genre classification (acceptable)
- Limited compute resources

### Method 2: Spectrogram

**Alternative to MFCC:**
Keep all frequency information, don't apply Mel-scale or DCT.

**Process:**

```
① Raw audio
② Frame division
③ FFT
④ Magnitude spectrum
⑤ Spectrogram: stacked magnitude spectra over time

Result: 2D matrix
  Dimensions: Time × Frequency
  Values: Magnitude (or power) at each time-frequency bin

Example: 10-second audio at 16kHz (25ms frames, 10ms hop, 1024-point FFT)
  Time: ~998 frames
  Frequency: 513 bins
  Size: ~998×513
```
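To show what one column of a spectrogram contains, the sketch below computes the magnitude spectrum of a single frame with a naive O(N²) discrete Fourier transform. Real pipelines use an FFT library; the pure-Python version here is only meant to expose the arithmetic. The 64-sample frame and the 1 kHz test tone are arbitrary choices made so that the tone falls exactly on a frequency bin.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 of a real-valued frame."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

sample_rate, n = 16000, 64
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(n)]  # 1 kHz sine
spectrum = magnitude_spectrum(tone)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
print(len(spectrum), peak_bin * sample_rate / n)  # 33 1000.0
```

An N-sample frame yields N/2 + 1 bins spaced `sample_rate / N` Hz apart, which is why a 1024-point transform produces the 513 frequency bins quoted in the example. Stacking one such spectrum per frame gives the Time × Frequency matrix.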

**Visualization:**

```
Spectrogram of "Hello":

Frequency
(Hz)    |▓▓ ▓▓▓▓    ▓▓    |
        |▓▓▓▓▓▓▓  ▓▓▓▓▓▓ | High freq
        |  ▓▓▓▓▓▓▓▓▓▓▓▓  |
  4000  |─────────────────|
        | ▓▓▓▓ ▓▓▓▓▓  ▓▓  |
        |▓▓▓▓ ▓▓▓▓▓▓▓▓▓   |
        |▓▓ ▓ ▓▓▓▓▓ ▓▓    | Low freq
    0   |___________________|
        0    2    4    6    8    10
              Time (seconds)

Darker = higher power
Different time positions → different audio
```

**Advantages over MFCC:**
✓ More information preserved
✓ Raw frequency content visible
✓ Can apply deep learning directly
✓ Works for any audio (not just speech)

**Disadvantages:**
✗ High dimensionality (harder to process)
✗ Not perceptually normalized
✗ Less standard for speech

**When to use:**
- Music processing and generation
- Sound event detection
- When using deep learning (CNN/Transformer)
- When frequency content important

### Method 3: Wav2Vec2 - Self-Supervised Learning

**Modern approach (Facebook AI, now Meta AI, 2020):**

```
Problem:
  Need thousands of hours of transcribed audio for ASR
  Transcription is expensive

Solution:
  Learn from UNLABELED audio
  Use self-supervised learning
```

**Training mechanism:**

```
Phase 1: Pretraining (on unlabeled data)

  ① Feature extraction (CNN)
     Raw waveform → latent speech features

     Intuition: Compress speech to meaningful units

  ② Contrastive loss
     Mask spans of features, then identify the quantized
     (discrete) code of each masked frame from context
     Similar to BERT for speech

  Result: Model learns speech patterns
          Without any transcriptions!

Phase 2: Fine-tuning (with small labeled dataset)

  ① Load pretrained model
  ② Add task-specific head (classification)
  ③ Train on labeled examples

  Benefit: Needs much less labeled data!
```

**Quantization step:**

```
Why quantize speech?

Raw features: Continuous values
Problem: Too flexible, model can memorize

Quantized features: Discrete codes (e.g., 1-512)
Benefit:
  - Reduces search space
  - Forces learning of essential patterns
  - Similar to VQ-VAE for images

Example:
  Raw feature: [0.234, -0.512, 0.891, ...]
  ↓ (vector quantization)
  Nearest code ID: 147

  Code vector: Learned codebook entry 147
```
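The lookup in the example can be written as a nearest-neighbor search over a codebook. This is the plain VQ-VAE-style scheme that the text compares against; the quantizer actually used in Wav2Vec2 is more elaborate, so treat this as a conceptual sketch. The tiny 2-D codebook is hand-picked, and `quantize` is a hypothetical helper name.

```python
def quantize(feature, codebook):
    """Return (index, entry) of the codebook vector closest to `feature`."""
    def sq_dist(entry):
        return sum((f - e) ** 2 for f, e in zip(feature, entry))
    index = min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))
    return index, codebook[index]

codebook = [[0.0, 0.0], [1.0, 1.0], [0.2, -0.5]]   # toy, hand-picked entries
index, entry = quantize([0.234, -0.512], codebook)
print(index, entry)  # 2 [0.2, -0.5]
```

The continuous feature is replaced by a discrete ID (here 2) plus the corresponding learned vector, so many slightly different inputs collapse onto the same code.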

**Architecture:**

```
Raw waveform (16kHz)
        ↓
CNN feature extraction
        ↓
Latent features ──→ Quantization to codes (targets for the loss)
        ↓ (masked)
Transformer encoder (contextual understanding)
        ↓
768D representation per frame
```

**Training details:**

```
Objective:
  Identify the code of each masked frame from surrounding context

  Input: [z_1, [MASK], z_3, [MASK], z_5]   (z = latent feature)
  Task: Pick the true code for each masked position

  Loss: Contrastive - predict correct code among negatives

Result:
  Encoder learns to represent speech meaningfully
  Ready for downstream tasks
```
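"Predict the correct code among negatives" can be expressed as a softmax over similarities. The sketch below is an InfoNCE-style loss in plain Python, assuming cosine similarity and a temperature of 0.1; the function names and toy vectors are illustrative, and batching, negative sampling, and any auxiliary loss terms are ignored.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def contrastive_loss(context, positive, negatives, temperature=0.1):
    """Negative log-probability of the true code among all candidates."""
    candidates = [positive] + negatives
    logits = [cosine(context, c) / temperature for c in candidates]
    peak = max(logits)                                   # log-sum-exp stabilization
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[0]                          # index 0 is the true code

context = [1.0, 0.0]                     # Transformer output at a masked position
negatives = [[0.0, 1.0], [-1.0, 0.0]]    # distractor codes
good = contrastive_loss(context, [1.0, 0.0], negatives)
bad = contrastive_loss(context, [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
print(good < bad)  # True
```

The loss is near zero when the context vector points at the true code and large when a distractor matches better, which is the signal that pushes the encoder toward useful representations.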

**Fine-tuning for tasks:**

```
Task 1: Speech Recognition (ASR)
  Add: Linear layer for character/phoneme classification
  Train: On (audio, transcription) pairs

  Data needed: often 10-100 hours labeled
  Without pretraining: typically thousands of hours needed

Task 2: Speaker Identification
  Add: Linear layer for speaker classification
  Train: On (audio, speaker_id) pairs

Task 3: Emotion Recognition
  Add: Linear layer for emotion classification
  Train: On (audio, emotion) pairs
```

**Empirical results:**

```
Illustrative figures (actual WER depends on dataset and model):

Without Wav2Vec2 pretraining:
  ASR with 100 hours data: 25% WER (Word Error Rate)

With Wav2Vec2 pretraining:
  ASR with 100 hours data: 10% WER
  ASR with 10 hours data: 12% WER

Improvement:
  60% relative error reduction with same data
  Or 10× less labeled data for comparable or better performance
```
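Since the results are reported as WER, it helps to see how that metric is computed. The sketch below implements the usual definition: word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. It assumes whitespace tokenization and a non-empty reference; the function name is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))          # distances for an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]                            # deleting i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(wer)  # 2 errors / 6 reference words, about 0.33
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.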

**Properties:**
- 768D representation per frame
- Learned from unlabeled data
- Transferable across tasks
- Works for any audio

**Advantages:**
✓ Leverages massive unlabeled data
✓ Strong transfer learning
✓ Handles diverse audio types
✓ Better than MFCC for complex tasks

**Disadvantages:**
✗ Complex training procedure
✗ Requires large unlabeled dataset for pretraining
✗ Longer inference than MFCC

**When to use:**
- Speech recognition (SOTA approach)
- Multi-speaker systems
- Low-resource languages
- When accuracy is critical

## 3.4 Comparison and Selection Guide

### Dimension and Computational Cost

```
                Dimension      Speed       Training Data
──────────────────────────────────────────────────────────────────────────────
MFCC            39             Very fast   Hundreds of hours
Spectrogram     513            Fast        Thousands of hours
Wav2Vec2        768            Slow        Tens of thousands of hours (unlabeled)

Hand-crafted    1000-5000      Fast        Medium
SIFT            128/keypoint   Fast        Medium
HOG             3780           Fast        Medium

ResNet50        2048           Medium      1.4M images
ViT-Base        768            Medium      14M images
BERT            768            Medium      3.3B words
GPT-3           12288          Slow        Huge
```
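Several of the dimensions in the table follow from simple arithmetic on standard settings. The sketch below reproduces three of them under assumed configurations: 13 MFCC coefficients with delta and delta-delta features, a 1024-point FFT, and the classic 64x128 HOG window with 8x8-pixel cells, 2x2-cell blocks, and 9 orientation bins. Other settings give other dimensions.

```python
# MFCC: 13 static coefficients plus delta and delta-delta (assumed setup)
n_mfcc = 13
mfcc_dim = n_mfcc * 3

# Spectrogram: the one-sided FFT of a real signal has n_fft // 2 + 1 bins
n_fft = 1024  # assumed window size
spectrogram_dim = n_fft // 2 + 1

# HOG: 64x128 window, 8x8-pixel cells, 2x2-cell blocks sliding by one cell
cells_x, cells_y = 64 // 8, 128 // 8
blocks_x, blocks_y = cells_x - 1, cells_y - 1
orientation_bins = 9
hog_dim = blocks_x * blocks_y * (2 * 2) * orientation_bins

print(mfcc_dim, spectrogram_dim, hog_dim)  # 39 513 3780
```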

### Modality Comparison Summary

```
                Text              Image             Audio
────────────────────────────────────────────────────────────────────
Modern rep.     BERT/GPT          ResNet/ViT        Wav2Vec2
Dimension       768               2048/768          768
Interpretable   Somewhat          Little            Very little
Speed           Medium            Fast              Medium
Pre-training    Easy (web text)   Often labeled     Can be unsupervised
Transfer        Excellent         Good              Good
Multimodal fit  Good              Excellent         Good
```

### Choosing Representation

**Decision flowchart:**

```
Is computational budget limited?
  YES → Use hand-crafted or MFCC
  NO → Continue
       ↓
Is this a production system?
  YES → Use proven methods (ResNet, BERT)
  NO → Continue
       ↓
Do you have massive labeled data?
  YES → Consider training from scratch
  NO → Use pre-trained features
       ↓
Do you have unlabeled data?
  YES → Consider self-supervised (Wav2Vec2)
  NO → Use supervised pre-trained models
```
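The flowchart above can be written as a short function that asks the four questions in order. This is only a sketch of the decision logic: the function name, parameter names, and returned strings are hypothetical, and a real project would weigh these factors together rather than strictly in sequence.

```python
def choose_representation(limited_compute, production, massive_labeled, has_unlabeled):
    """Walk the four flowchart questions in order and return the advice reached."""
    if limited_compute:
        return "hand-crafted features or MFCC"
    if production:
        return "proven pre-trained models (ResNet, BERT)"
    if massive_labeled:
        return "consider training from scratch"
    # From here on, pre-trained features are the default choice
    if has_unlabeled:
        return "consider self-supervised pre-training (Wav2Vec2)"
    return "supervised pre-trained models"

# A research prototype with plenty of compute and unlabeled audio, few labels
advice = choose_representation(
    limited_compute=False,
    production=False,
    massive_labeled=False,
    has_unlabeled=True,
)
print(advice)
```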

## Key Takeaways

- **Text:** The evolution from BoW to BERT shows the power of context
- **Images:** CNNs dominate, but ViT shows a promising future
- **Audio:** MFCC is the traditional choice; Wav2Vec2 is the modern frontier
- **Pre-training is key:** Leveraging unlabeled data is essential
- **Different modalities need different approaches**
- **Trade-offs exist:** accuracy vs. speed, interpretability vs. performance

## Exercises

**⭐ Beginner:**
1. Implement TF-IDF from scratch
2. Extract MFCC features from an audio file
3. Visualize a spectrogram

**⭐⭐ Intermediate:**
4. Compare MFCC vs spectrogram representations
5. Fine-tune BERT on text classification
6. Extract ResNet features and cluster images

**⭐⭐⭐ Advanced:**
7. Implement self-attention for images (simplified ViT)
8. Build Wav2Vec2 from scratch (simplified)
9. Compare different dimensionality reduction techniques

---

## 🔍 Interactive Demo: Feature Representations

Let's explore how different modalities are represented as features!

```python
import torch
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

torch.manual_seed(42)  # make the synthetic features reproducible

# Simulate different types of features
def create_feature_examples():
    """Create synthetic example features for three modalities."""

    # Text features: Gaussian noise plus four semantic clusters.
    # Blocks of 25 samples stand for Animals, Vehicles, Food, Nature;
    # each block is shifted along its own dimension.
    text_features = torch.randn(100, 512)
    for cluster in range(4):
        text_features[25 * cluster:25 * (cluster + 1), cluster] += 2.0

    positions = torch.arange(512, dtype=torch.float32)
    samples = torch.arange(100, dtype=torch.float32).unsqueeze(1)

    # Image features: lower-variance noise plus a sinusoid across dimensions
    # (a crude stand-in for spatial locality), phase-shifted per sample
    image_features = torch.randn(100, 512) * 0.5
    image_features += 0.3 * torch.sin(positions * 0.1 + samples * 0.5)

    # Audio features: noise plus a slower cosine pattern
    # (a crude stand-in for temporal structure)
    audio_features = torch.randn(100, 512) * 0.3
    audio_features += 0.4 * torch.cos(positions * 0.05 + samples * 0.2)

    return text_features, image_features, audio_features

text_feat, image_feat, audio_feat = create_feature_examples()
print("✅ Created feature examples:")
print(f"Text features: {text_feat.shape}")
print(f"Image features: {image_feat.shape}")
print(f"Audio features: {audio_feat.shape}")
```

```python
# Visualize feature distributions
def visualize_feature_distributions():
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))

    features = [text_feat, image_feat, audio_feat]
    names = ['Text', 'Image', 'Audio']
    colors = ['blue', 'green', 'red']

    for i, (feat, name, color) in enumerate(zip(features, names, colors)):
        # Top row: histogram of all raw feature values
        axes[0, i].hist(feat.flatten().numpy(), bins=50, alpha=0.7, color=color)
        axes[0, i].set_title(f'{name} Feature Distribution')
        axes[0, i].set_xlabel('Feature Value')
        axes[0, i].set_ylabel('Frequency')

        # Bottom row: correlation matrix of the first 50 dimensions
        # (torch.corrcoef treats rows as variables, hence the transpose)
        corr_matrix = torch.corrcoef(feat[:, :50].T)
        im = axes[1, i].imshow(corr_matrix.numpy(), cmap='coolwarm', vmin=-1, vmax=1)
        axes[1, i].set_title(f'{name} Feature Correlations')
        axes[1, i].set_xlabel('Feature Dimension')
        axes[1, i].set_ylabel('Feature Dimension')
        fig.colorbar(im, ax=axes[1, i], fraction=0.046)

    plt.tight_layout()
    plt.show()

visualize_feature_distributions()
```

```python
# t-SNE visualization of different modalities
def visualize_feature_spaces():
    # Combine all features into one (300, 512) matrix
    all_features = torch.cat([text_feat, image_feat, audio_feat], dim=0)
    labels = np.array(['Text'] * 100 + ['Image'] * 100 + ['Audio'] * 100)

    # Project to 2D with t-SNE
    tsne = TSNE(n_components=2, random_state=42, perplexity=30)
    features_2d = tsne.fit_transform(all_features.numpy())

    # Scatter plot, one color per modality
    plt.figure(figsize=(12, 8))

    colors = {'Text': 'blue', 'Image': 'green', 'Audio': 'red'}
    for modality, color in colors.items():
        mask = labels == modality
        plt.scatter(features_2d[mask, 0], features_2d[mask, 1],
                    c=color, label=modality, alpha=0.6, s=50)

    plt.title('t-SNE Visualization of Feature Spaces')
    plt.xlabel('t-SNE Dimension 1')
    plt.ylabel('t-SNE Dimension 2')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

    # Per-modality statistics of the feature vector norms
    print("📊 Feature Statistics:")
    for feat, name in zip([text_feat, image_feat, audio_feat], ['Text', 'Image', 'Audio']):
        norms = torch.norm(feat, dim=1)
        print(f"{name}: Mean norm = {norms.mean():.3f}, Std norm = {norms.std():.3f}")

visualize_feature_spaces()
```