# 📊 **Vector Space Model Report**

## 🔍 **1. Description of VSM Implementation**

The `vsm.py` script implements a **Vector Space Model (VSM)** enhanced with **BM25** term weighting and **Rocchio relevance feedback**. The system processes a given corpus and queries, computes BM25-weighted term vectors, and ranks documents by cosine similarity against the query vector. The Rocchio algorithm optionally updates the query vector using relevant documents selected via a similarity threshold.

### 🎛️ **Key Hyper-Parameters in VSM:**

| Parameter | Description | Values Explored |
|-----------|-------------|----------------|
| **`k1`** | Term frequency saturation (BM25) | `1.2`, `1.5`, `2.0` |
| **`k3`** | Query term frequency saturation (BM25) | `8`, `800` (large enough to leave query term frequency effectively unsaturated) |
| **`threshold`** | Similarity threshold for document selection | `400`, `500`, `600`, `700` |
| **`alpha`, `beta`** | Rocchio weights for original query and relevant docs | Various combinations |

### 🧠 **Core Classes and Functions**

- **`VectorSpaceModel`**:
  - Loads and tokenizes documents and queries
  - Computes document frequencies (`df`) and inverse document frequencies (`idf`)
  - Constructs BM25-weighted vectors using `k1` and `k3`
  - Ranks documents based on **cosine similarity** of BM25-weighted vectors

- **`rocchio_feedback()`**:
  - Implements query vector refinement:
    \[
    q' = \alpha \cdot q + \beta \cdot \frac{1}{|D_r|} \sum_{d \in D_r} d
    \]
  - Feedback is **positive-only** (no use of non-relevant documents)
  - Selected documents exceed a **cosine similarity threshold**

- **BM25 Weighting Formula**:
  Implemented as:
  \[
  w_{i,j} = \frac{(k_1 + 1) \cdot \mathrm{tf}_{i,j}}{k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right) + \mathrm{tf}_{i,j}} \cdot \mathrm{idf}_i
  \]
  with `b = 0.75` (hardcoded).
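
The weighting formula above can be sketched as a small function. This is a minimal illustration, not the code from `vsm.py`: the helper name `bm25_weight` and its argument names are hypothetical, and `idf` is passed in because the report does not state which IDF variant is used.

```python
def bm25_weight(tf, doc_len, avgdl, idf, k1=2.0, b=0.75):
    """BM25 weight of one term in one document (hypothetical helper)."""
    if tf <= 0:
        return 0.0
    # Length normalisation: 1 for an average-length document, larger for longer ones.
    length_norm = 1.0 - b + b * (doc_len / avgdl)
    return (k1 + 1.0) * tf / (k1 * length_norm + tf) * idf


# Average-length document, tf = 3, idf = 1.5 (illustrative numbers only)
print(round(bm25_weight(tf=3, doc_len=100, avgdl=100, idf=1.5), 4))  # 2.7
```

However large `tf` grows, the weight stays below `(k1 + 1) * idf`, which is the saturation behaviour that `k1` controls.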

### ⚙️ **VSM Processing Flow**

1. **Tokenization and Parsing**
   - Corpus and query files parsed into lowercase word tokens
   - **Punctuation is not removed**; preprocessing is minimal (known limitation)
   - Optional stemming is **commented out** in code

2. **Vector Construction**
   - BM25-weighted document vectors are constructed using term frequencies and precomputed IDF
   - Queries are also vectorized using a similar BM25 formula with `k3` as saturation parameter

3. **Similarity Calculation**
   - Cosine similarity computed between query and document vectors
   - Documents are ranked and filtered by a configurable **similarity threshold**

4. **Rocchio Feedback (if enabled)**
   - Top documents exceeding threshold are used to adjust the original query vector
   - New vector is used for a second round of ranking
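
The similarity and ranking steps in the flow above can be sketched as follows. The sketch assumes sparse vectors stored as `{term: weight}` dictionaries; the function names `cosine` and `rank` are illustrative and may not match the ones in `vsm.py`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_u * norm_v)


def rank(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted by descending similarity."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


docs = {"d1": {"a": 1.0, "b": 1.0}, "d2": {"b": 2.0}, "d3": {"c": 1.0}}
print([doc_id for doc_id, _ in rank({"b": 1.0}, docs)])  # ['d2', 'd1', 'd3']
```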

---

## 🔄 **2. Description of Rocchio Relevance Feedback**

### 💡 **Implementation Details:**

- Rocchio feedback updates the query vector using:
  ```
  q' = alpha * q + beta * (1 / |Dr|) * sum(d in Dr)
  ```
  Where:
  - `q` = original query vector
  - `Dr` = set of relevant documents
  - `alpha` = weight for original query
  - `beta` = weight for relevant documents

> ℹ️ Note: The non-relevant document centroid (the γ term of the full Rocchio formula) is not used in this implementation.
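
A positive-only update of this form might look like the sketch below. It assumes sparse `{term: weight}` vectors, and the name `rocchio_update` is hypothetical (chosen so it is not mistaken for the script's own `rocchio_feedback()`).

```python
from collections import defaultdict


def rocchio_update(query_vec, relevant_vecs, alpha=1.0, beta=0.85):
    """Positive-only Rocchio: q' = alpha * q + beta * centroid(relevant docs)."""
    updated = defaultdict(float)
    for term, weight in query_vec.items():
        updated[term] += alpha * weight
    if relevant_vecs:  # with no feedback documents the query is only scaled by alpha
        scale = beta / len(relevant_vecs)
        for vec in relevant_vecs:
            for term, weight in vec.items():
                updated[term] += scale * weight
    return dict(updated)


q = {"a": 1.0}
rel = [{"a": 2.0, "b": 4.0}, {"b": 2.0}]
print(rocchio_update(q, rel, alpha=1.0, beta=0.5))  # {'a': 1.5, 'b': 1.5}
```

Terms that appear only in the feedback documents (here `b`) enter the query, which is how the update expands the original query.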

### 🎯 **Definition of Relevant Documents:**

- Uses **similarity threshold** rather than fixed top-k
- Allows dynamic feedback document count per query (up to 100 documents)
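
The selection rule above can be sketched as a filter over an already ranked list. The helper `select_feedback_docs` and its parameters are hypothetical; the only assumptions are that `ranked` holds `(doc_id, score)` pairs in descending score order and that `threshold` is expressed on the same scale as those scores. The numbers in the example are illustrative.

```python
def select_feedback_docs(ranked, threshold, max_docs=100):
    """Keep documents whose score exceeds `threshold`, capped at `max_docs`."""
    selected = [doc_id for doc_id, score in ranked if score > threshold]
    return selected[:max_docs]


ranked = [("d2", 1.0), ("d1", 0.7), ("d3", 0.1)]
print(select_feedback_docs(ranked, threshold=0.5))              # ['d2', 'd1']
print(select_feedback_docs(ranked, threshold=0.5, max_docs=1))  # ['d2']
```

Unlike a fixed top-k rule, a query with no score above the threshold yields an empty feedback set, so the Rocchio update leaves its query direction unchanged.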

---

## 📈 **3. Results of Experiments**

### 📊 **Overall Performance Comparison**

| Metric | Without Feedback | With Feedback | Improvement |
|--------|-----------------|--------------|-------------|
| **Best MAP** | 0.3969 | 0.8056 | **+103%** ⬆️ |
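
For reference, MAP as reported here is the mean over queries of average precision. The sketch below uses the common definition that divides by the total number of relevant documents per query; whether the evaluation script behind these numbers uses exactly this variant is an assumption, and the function names are illustrative.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / position  # precision at this hit
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """`runs` is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


runs = [(["d1", "d2", "d3", "d4"], {"d1", "d3"}), (["d5", "d6"], {"d6"})]
print(round(mean_average_precision(runs), 4))  # 0.6667
```

Under this definition a query that returns no documents contributes an average precision of 0.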

### **3-1. MAP Values Without Feedback**

#### **Top 5 Configurations Without Feedback**

| Rank | MAP | Alpha | Beta | Threshold | k1 | k3 |
|------|-----|-------|------|-----------|----|----|
| 1 | **0.3969** | 0.8/1.0 | 0.75 | 400 | 2.0 | 800 |
| 2 | 0.2846 | 0.8/1.0 | 0.75 | 400 | 1.5 | 800 |
| 3 | 0.2275 | 0.8/1.0 | 0.75 | 400 | 2.0 | 8 |
| 4 | 0.1710 | 0.8/1.0 | 0.75 | 400 | 1.2 | 800 |
| 5 | 0.1203 | 0.8/1.0 | 0.75 | 500 | 2.0 | 800 |

#### **Parameter Trends Without Feedback:**

- **`k1` Effect**: Controls how quickly term frequency impact saturates. A higher `k1` allows more influence from frequent terms.
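- **`k3` Effect**: Controls saturation of query term frequency. In the standard BM25 query-side factor, `k3 = 0` gives binary weighting (term present/absent), while large values such as 800 leave the raw query term frequency almost unchanged.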
- **Threshold Sensitivity**: The threshold decides which documents are returned. With no feedback, higher thresholds return very few or no documents, which degrades MAP.
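
How `k3` shapes the query-side weight can be checked numerically. The sketch below assumes the standard BM25 query-term factor `(k3 + 1) * qtf / (k3 + qtf)`; the report only states that queries use "a similar BM25 formula", so both the formula and the helper name `query_term_weight` are assumptions.

```python
def query_term_weight(qtf, k3):
    """Standard BM25 query-side factor: (k3 + 1) * qtf / (k3 + qtf)."""
    return (k3 + 1.0) * qtf / (k3 + qtf)


# A query term that occurs 3 times, under increasing k3
for k3 in (0, 8, 800):
    print(k3, round(query_term_weight(3, k3), 3))
# 0 1.0
# 8 2.455
# 800 2.993
```

With `k3 = 0` repeated query terms count once; with `k3 = 800` the weight is close to the raw count of 3.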

- **`k1` (term frequency saturation)**:
  - Best value: 2.0
  - Performance decreases as k1 decreases (2.0 > 1.5 > 1.2)
  - Example (with fixed alpha=0.8, beta=0.75, threshold=400, k3=800):
    - k1=2.0: MAP = 0.3969
    - k1=1.5: MAP = 0.2846 (-28%)
    - k1=1.2: MAP = 0.1710 (-57%)

- **`k3` (query term frequency)**:
  - Best value: 800
  - Significant performance drop when using k3=8
  - Example (with fixed alpha=0.8, beta=0.75, threshold=400, k1=2.0):
    - k3=800: MAP = 0.3969
    - k3=8: MAP = 0.2275 (-43%)

- **Thresholds**:
  - Best value: 400
  - Performance decreases dramatically as threshold increases
  - Example (with fixed alpha=0.8, beta=0.75, k1=2.0, k3=800):
    - threshold=400: MAP = 0.3969
    - threshold=500: MAP = 0.1203 (-70%)
    - threshold=600: MAP = 0.0143 (-96%)
    - threshold=700: MAP = 0.0000 (-100%)

- **Alpha/Beta**:
  - Minimal impact when using optimal settings for other parameters
  - Both alpha=0.8 and alpha=1.0 produced identical MAP values, which is expected because these weights only enter the Rocchio update

---

### **3-2. Feedback vs. No Feedback Comparison**

#### **Top 5 Configurations With Feedback**

| Rank | MAP | Alpha | Beta | Threshold | k1 | k3 |
|------|-----|-------|------|-----------|----|----|
| 1 | **0.8056** | 1.0 | 0.85 | 400/500 | 2.0 | 800 |
| 2 | 0.8006 | 1.0 | 1.0 | 400/500 | 2.0 | 800 |
| 3 | 0.7995 | 1.0 | 1.0 | 600 | 2.0 | 800 |
| 4 | 0.7994 | 1.0 | 0.85 | 400 | 1.5 | 800 |
| 5 | 0.7978 | 1.0 | 0.85 | 400 | 2.0 | 8 |

#### **Parameter Trends With Feedback:**

- With Rocchio, the system uses the selected feedback documents to shift the query vector toward their centroid.
- Feedback mitigates the impact of suboptimal `k1` and `k3` values.
- **Even a poorly tuned system performs well if feedback is enabled.**

- **Alpha/Beta**:
  - Best combination: alpha=1.0, beta=0.85
  - Beta values between 0.85-1.0 perform best
  - Alpha consistently optimal at 1.0

- **`k1` (term frequency saturation)**:
  - Best value: 2.0
  - Performance trend: 2.0 > 1.5 > 1.2
  - However, the difference is less dramatic than in no-feedback scenario
  - Example (with fixed alpha=1.0, beta=0.85, threshold=400, k3=800):
    - k1=2.0: MAP = 0.8056
    - k1=1.5: MAP = 0.7994 (-0.8%)
    - k1=1.2: MAP = 0.7907 (-1.8%)

- **`k3` (query term frequency)**:
  - Best value: 800
  - Less dramatic difference compared to no-feedback scenario
  - Example (with fixed alpha=1.0, beta=0.85, threshold=400, k1=2.0):
    - k3=800: MAP = 0.8056
    - k3=8: MAP = 0.7978 (-1.0%)

- **Thresholds**:
  - Best values: 400 and 500 (tie)
  - Performance decreases more gradually as threshold increases
  - Example (with fixed alpha=1.0, beta=0.85, k1=2.0, k3=800):
    - threshold=400/500: MAP = 0.8056
    - threshold=600: MAP = 0.7948 (-1.3%)
    - threshold=700: MAP = 0.7756 (-3.7%)

---

### **3-3. Parameter Impact Summary**

| Parameter | Without Feedback | With Feedback |
|-----------|-------------------|---------------|
| **k1** | Highly sensitive (57% drop from 2.0 to 1.2) | Robust (only 1.8% drop from 2.0 to 1.2) |
| **k3** | Highly sensitive (43% drop from 800 to 8) | Robust (only 1.0% drop from 800 to 8) |
| **Threshold** | Extremely sensitive (100% drop from 400 to 700) | Stable (only 3.7% drop from 400 to 700) |
| **Alpha/Beta** | Minor impact with optimal k1/k3/threshold | Moderate impact (optimal: α=1.0, β=0.85) |
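
The percentages in this report are relative changes with respect to the best configuration in each setting. A short check, using a hypothetical helper `relative_change` and MAP values copied from the tables above:

```python
def relative_change(baseline, value):
    """Relative change of `value` with respect to `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


print(round(relative_change(0.3969, 0.8056)))     # 103  (best MAP, feedback vs. no feedback)
print(round(relative_change(0.3969, 0.1710)))     # -57  (k1 2.0 -> 1.2, no feedback)
print(round(relative_change(0.8056, 0.7907), 1))  # -1.8 (k1 2.0 -> 1.2, with feedback)
```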

---

## 💭 **4. Discussion**

### ❌ **Identified Failures**

- Minimal text preprocessing:
  - No punctuation removal
  - No proper Chinese-English language detection
- Tokenizer splits the text character by character, which fails on English input (`Camus` becomes five single-letter tokens):
```python
# Character-level split: every character, including each letter of "Camus", becomes a token.
query_string = "Camus勸告我們不要進行哲學自殺及其相應的信仰飛躍，這樣我們才能保持理性直到生命的最後一刻。"
query_list = list(query_string)
query_list
```
```
['C', 'a', 'm', 'u', 's', '勸', '告', '我', '們', '不', '要', '進', '行', '哲', '學', '自', '殺',
 '及', '其', '相', '應', '的', '信', '仰', '飛', '躍', '，', '這', '樣', '我', '們', '才', '能',
 '保', '持', '理', '性', '直', '到', '生', '命', '的', '最', '後', '一', '刻', '。']
```
- No stop word removal or stemming
- Similarity thresholding is rigid and manual
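
One possible way to address the tokenization failures listed above, using only the standard library, is a regular expression that keeps ASCII words whole, emits each CJK ideograph as its own token, and drops punctuation. This is a sketch under those assumptions, not the project's tokenizer: `TOKEN_PATTERN` and `tokenize` are hypothetical names, and a segmenter such as jieba would produce word-level Chinese tokens instead of single characters.

```python
import re

# ASCII letter/digit runs stay whole; each CJK ideograph becomes its own token.
# Punctuation (ASCII and full-width) matches neither branch and is dropped.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def tokenize(text):
    """Lowercased tokens for mixed Chinese-English text."""
    return [tok.lower() for tok in TOKEN_PATTERN.findall(text)]


print(tokenize("Camus勸告我們，不要進行哲學自殺。"))
# ['camus', '勸', '告', '我', '們', '不', '要', '進', '行', '哲', '學', '自', '殺']
```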

### **Key Insights:**

1. **Relevance Feedback Effectiveness** 🚀
   - Feedback improves MAP by 103% (from 0.3969 to 0.8056)
   - With feedback, the system becomes remarkably robust to parameter variations
   - Even suboptimal configurations with feedback outperform the best no-feedback setup

2. **Parameter Sensitivity**
   - **Without feedback**: System is highly sensitive to `k1`, `k3`, and the threshold, while alpha and beta have no measurable effect
   - **With feedback**: System shows remarkable stability across parameter variations
   - The sensitivity difference highlights feedback's ability to compensate for suboptimal initial configurations

3. **Threshold-Based Relevance Definition**
   - Using a similarity threshold (vs. fixed top-k) creates dynamic feedback document sets
   - This approach adapts the number of feedback documents based on query-specific relevance

4. **Implementation Limitations**
   - Parsing failed with mixed-language corpus → Use **NLTK** or **jieba** for better preprocessing
   - **Exploratory Data Analysis (EDA)** should precede algorithm implementation

### **Future Work:**

1. **Language Processing**
   - Implement proper language detection for mixed Chinese-English corpus
   - Apply appropriate tokenization based on detected language

2. **Advanced Features**
   - Test additional techniques from Singhal's paper:
     - Improved stemming algorithms
     - Stop word removal
     - Phrase statistics

3. **Parameter Optimization**
   - Explore grid search for finding optimal parameter combinations
   - Investigate adaptive parameter selection based on query characteristics

4. **Evaluation Metrics**
   - Expand beyond MAP to include precision@k, nDCG, and other IR metrics
   - Analyze per-query performance to identify failure cases
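
As a starting point for the additional metrics mentioned above, precision@k and a binary-gain nDCG@k can be sketched as follows. The function names are illustrative, and the nDCG variant (binary relevance, log2 discount) is one common choice rather than the only one.

```python
import math


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    return sum(1 for d in ranked_ids[:k] if d in relevant) / k


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-gain nDCG@k with a log2 position discount."""
    relevant = set(relevant_ids)
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(ranked_ids[:k]) if d in relevant)
    # Ideal ranking places all relevant documents first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["d1", "d2", "d3"]
relevant = {"d1", "d3"}
print(round(precision_at_k(ranked, relevant, 3), 4))  # 0.6667
print(round(ndcg_at_k(ranked, relevant, 3), 4))       # 0.9197
```

Computing these per query, rather than only as an average, also supports the per-query failure analysis suggested above.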