# **Improvements and Concepts to be implemented**

## `evaluate.py` model improvements

### **Accuracy measures: N-gram Based Metrics (BLEU & ROUGE-2)**
- BLEU scores are of limited use for short-form responses because short texts rarely share higher-order n-grams (up to n=4) with the reference. ROUGE-2, which relies on bigram overlap, is affected in a similar way.
- **To do:**
1. Implement length-aware weighting schemes that proportionally reduce the influence of n-gram metrics for responses below optimal length thresholds.
2. Compare llm-answers against a series of references with higher lexical diversity.
3. Add lemmatization, synonyms, or related lexica to find word matches.
4. Accuracy feedback should take into account that accuracy measures may not be adequate for questions of type "sensitive" or "creative", since the range of acceptable llm-answers is wider for these question types.
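
A minimal sketch of the length-aware weighting from item 1, assuming a linear ramp between two token-count thresholds. The function names, the thresholds (`min_tokens`, `full_tokens`), and the base weight are illustrative assumptions, not values taken from `evaluate.py`.

```python
def length_aware_weight(num_tokens: int, min_tokens: int = 4, full_tokens: int = 20) -> float:
    """Return a factor in [0, 1] that scales the weight of n-gram metrics.

    Hypothetical thresholds: at or below `min_tokens` the n-gram metrics are
    ignored, at or above `full_tokens` they count fully, linear in between.
    """
    if num_tokens <= min_tokens:
        return 0.0
    if num_tokens >= full_tokens:
        return 1.0
    return (num_tokens - min_tokens) / (full_tokens - min_tokens)


def combine_scores(ngram_score: float, semantic_score: float, num_tokens: int,
                   base_ngram_weight: float = 0.4) -> float:
    """Blend an n-gram score and a semantic score, shifting weight to the
    semantic score as the response gets shorter."""
    ngram_weight = base_ngram_weight * length_aware_weight(num_tokens)
    semantic_weight = 1.0 - ngram_weight
    return ngram_weight * ngram_score + semantic_weight * semantic_score
```

With these assumed defaults, a three-token answer is scored on semantic similarity alone, while a 20-token answer gives BLEU/ROUGE-2 their full base weight.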

### **Semantic Evaluation Performance**
- Semantic similarity measures demonstrate robust performance across diverse question types, including creative and open-ended queries, capturing nuanced meaning beyond surface-level lexical overlap.
- **To do:** Maintain semantic similarity as a cornerstone metric while exploring ensemble approaches that combine multiple embedding models.
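
One possible shape for such an ensemble is sketched here: each embedding model is represented by a callable that maps text to a vector, and the ensemble score is the plain mean of the per-model cosine similarities. The `embedders` mapping and the averaging rule are assumptions for illustration; a real setup would plug in the actual model wrappers.

```python
import math
from statistics import mean
from typing import Callable, Dict, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def ensemble_similarity(embedders: Dict[str, Callable[[str], Vector]],
                        answer: str, reference: str) -> Tuple[float, Dict[str, float]]:
    """Return (mean similarity, per-model similarities) across all embedders."""
    per_model = {
        name: cosine_similarity(embed(answer), embed(reference))
        for name, embed in embedders.items()
    }
    return mean(list(per_model.values())), per_model
```

Returning the per-model scores alongside the mean makes it easier to spot a model that disagrees strongly with the rest.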

### **Exact Match Limitations**
- Binary exact-match accuracy consistently yields zero values, indicating insufficient discriminatory power for evaluating nuanced LLM responses.
- **To do:** Deprecate exact-match metrics in favor of graduated similarity measures that better capture partial correctness.
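
As one example of a graduated measure, the sketch below computes a token-level F1 score, which gives partial credit when an answer contains the reference plus extra words. This is a swapped-in technique chosen for illustration, not the metric currently used in `evaluate.py`, and the whitespace tokenization is a simplifying assumption.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a prediction and a reference (case-insensitive)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

An answer such as "the capital is Paris" against the reference "Paris" scores 0 under exact match but 0.4 here.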

### **TF-IDF Applicability**
- Term frequency-inverse document frequency approaches exhibit optimal utility in large document corpora but demonstrate reduced effectiveness in constrained evaluation contexts with limited response diversity.
- **To do:** Reserve TF-IDF metrics for scenarios involving extended text comparisons or multi-document analysis.

### **Pattern Matching Expansion**
- Current pattern-based detection mechanisms operate within constrained lexical and phrasal boundaries.
- **To do:** Expand detection vocabularies and implement regex pattern optimization to capture broader semantic variations while maintaining precision.
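
A sketch of how an expandable vocabulary could feed a single precompiled pattern. The refusal phrases are a purely hypothetical vocabulary; the actual detection categories in the codebase may differ.

```python
import re

# Hypothetical vocabulary; extend this list to widen detection coverage.
REFUSAL_PHRASES = ["i cannot", "i can't", "i am unable to", "i'm unable to", "i won't"]

# Compile once at import time: one alternation, escaped phrases, word boundaries
# to avoid matching inside longer words, case-insensitive.
REFUSAL_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(phrase) for phrase in REFUSAL_PHRASES) + r")\b",
    re.IGNORECASE,
)


def find_matches(text: str) -> list:
    """Return all vocabulary phrases found in `text`, lowercased, in order."""
    return [match.group(0).lower() for match in REFUSAL_PATTERN.finditer(text)]
```

Keeping the vocabulary as data and building the regex from it means new phrases can be added without touching the matching logic.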

### **Creativity Assessment Precision**
- Existing creativity quantification methods demonstrate insufficient granularity for distinguishing between genuinely innovative responses and formulaic variations.
- **To do:** Develop multi-dimensional creativity metrics incorporating novelty, fluency, elaboration, and domain-specific appropriateness.

### **Weight Adaptation Framework**
- Current static weighting schemes inadequately accommodate the heterogeneous requirements of different question categories.
- **To do:** Implement dynamic weight adaptation mechanisms that automatically calibrate metric importance based on question taxonomy, response characteristics, and evaluation objectives.
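
A simple starting point, sketched under assumptions: a lookup table of weights per question type with renormalization over the metrics that are actually present. The metric names, question types, and weight values are placeholders, not settings from the current code.

```python
# Placeholder weights; each row sums to 1.0.
DEFAULT_WEIGHTS = {"semantic": 0.5, "ngram": 0.3, "quality": 0.2}

WEIGHTS_BY_QUESTION_TYPE = {
    "factual": {"semantic": 0.4, "ngram": 0.4, "quality": 0.2},
    "creative": {"semantic": 0.6, "ngram": 0.0, "quality": 0.4},
    "sensitive": {"semantic": 0.6, "ngram": 0.1, "quality": 0.3},
}


def weighted_score(metric_scores: dict, question_type: str) -> float:
    """Combine metric scores using the weights for `question_type`.

    Unknown question types fall back to DEFAULT_WEIGHTS. Metrics missing
    from `metric_scores` are skipped and the remaining weights renormalized.
    """
    weights = WEIGHTS_BY_QUESTION_TYPE.get(question_type, DEFAULT_WEIGHTS)
    active = {name: w for name, w in weights.items() if name in metric_scores}
    total = sum(active.values())
    if total == 0:
        return 0.0
    return sum(metric_scores[name] * w for name, w in active.items()) / total
```

A table like this is static per category; truly adaptive calibration (for example, fitting weights against human ratings) would replace the lookup while keeping the same function signature.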

### **Pass/Fail Framework**
- Static pass/fail thresholds inadequately accommodate the heterogeneous requirements of different question categories.
- **To do:** Implement dynamic threshold adaptation mechanisms that automatically calibrate pass/fail thresholds based on question taxonomy, response characteristics, and evaluation objectives.
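
As a first step toward this, a per-category threshold table could look like the sketch below. The category names and threshold values are assumptions chosen only to show the mechanism.

```python
# Hypothetical thresholds: stricter for factual questions, looser for
# categories where acceptable answers vary more.
PASS_THRESHOLDS = {"factual": 0.75, "creative": 0.55, "sensitive": 0.60}
DEFAULT_THRESHOLD = 0.70


def passes(score: float, question_type: str) -> bool:
    """Return True if `score` meets the threshold for `question_type`."""
    return score >= PASS_THRESHOLDS.get(question_type, DEFAULT_THRESHOLD)
```

The same overall score of 0.6 would then pass for a creative question and fail for a factual one.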

### **Quality Evaluation Refinement**
- Quality assessment metrics require enhanced precision through the integration of syntactic, discourse, and pragmatic language features.
- **To do:** Develop hierarchical quality evaluation frameworks that distinguish between surface-level fluency and substantive communicative effectiveness.

### **Enhancing reference diversity**
- Use a series of external Large Language Models to generate different references, then evaluate the llm-answers against this diverse reference set.
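
Once several references exist, one common way to use them is to score the answer against each and keep the best match. The sketch below assumes this max-over-references rule and uses a toy Jaccard overlap as the scoring function; both are illustrative choices, and any metric with the same `(answer, reference) -> float` shape could be passed in instead.

```python
from typing import Callable, Sequence, Tuple


def jaccard(answer: str, reference: str) -> float:
    """Toy scoring function: Jaccard overlap of lowercase token sets."""
    a, r = set(answer.lower().split()), set(reference.lower().split())
    return len(a & r) / len(a | r) if (a | r) else 1.0


def best_reference_score(answer: str, references: Sequence[str],
                         score_fn: Callable[[str, str], float]) -> Tuple[float, str]:
    """Score `answer` against every reference; return (best score, best reference)."""
    if not references:
        raise ValueError("at least one reference is required")
    scored = [(score_fn(answer, ref), ref) for ref in references]
    return max(scored, key=lambda pair: pair[0])
```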

## System-level improvements and concepts

#### **1. Batch Processing (1000+ pairs in minutes)**
**Location:** `src/evaluate.py` → `evaluate_all_pairs_enhanced()` function
**Keywords:**
- `for` loops over DataFrame rows
- `apply()` method with vectorization
- `pd.concat()` for combining results
- Progress tracking with `tqdm` or custom logging
- `DataFrame.iterrows()` or `zip()` iteration
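
The batching idea behind these keywords can be sketched without any DataFrame dependency. The helper names below are hypothetical and are not the signature of `evaluate_all_pairs_enhanced()`; the point is only to show fixed-size batches being handed to a batch-level evaluation function.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def iter_batches(items: Iterable, batch_size: int) -> Iterator[list]:
    """Yield consecutive lists of up to `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def evaluate_in_batches(pairs: Iterable, evaluate_batch: Callable[[list], List[float]],
                        batch_size: int = 256) -> List[float]:
    """Run `evaluate_batch` on each batch and concatenate the results in order."""
    results: List[float] = []
    for batch in iter_batches(pairs, batch_size):
        results.extend(evaluate_batch(batch))
    return results
```

Wrapping the `for` loop in `tqdm(...)` would add the progress tracking mentioned in the keyword list.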

#### **2. Parallel Execution**
**Location:** `src/evaluate.py` → `EnhancedLLMEvaluator` class methods
**Keywords:**
- `multiprocessing.Pool()`
- `concurrent.futures.ThreadPoolExecutor` or `ProcessPoolExecutor`
- `map()` function for parallel processing
- `chunksize` parameter for batching
- `n_jobs` or `num_workers` configuration
- `pool.starmap()` for multiple arguments
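
A minimal sketch of the `concurrent.futures` pattern, assuming a per-pair scoring function that is safe to run concurrently. `score_pair` and `evaluate_parallel` are hypothetical names, not methods of `EnhancedLLMEvaluator`. The default here is a thread pool so the example runs anywhere; for CPU-bound metrics, passing `ProcessPoolExecutor` as `executor_cls` is the more likely fit, with the caveat that the scoring function must then be defined at module top level and the call guarded by `if __name__ == "__main__":` on platforms that spawn worker processes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple


def score_pair(pair: Tuple[str, str]) -> float:
    """Toy per-pair metric: Jaccard overlap of lowercase token sets."""
    answer, reference = pair
    a, r = set(answer.lower().split()), set(reference.lower().split())
    return len(a & r) / len(a | r) if (a | r) else 1.0


def evaluate_parallel(pairs: Sequence[Tuple[str, str]], num_workers: int = 4,
                      executor_cls=ThreadPoolExecutor) -> List[float]:
    """Score all pairs with a pool of workers; results keep the input order."""
    with executor_cls(max_workers=num_workers) as executor:
        # Executor.map returns results in the order of the inputs.
        return list(executor.map(score_pair, pairs))
```

Note that the `chunksize` argument of `Executor.map` only has an effect with `ProcessPoolExecutor`.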

#### **3. Memory Efficient (Streaming Options)**
**Location:** Various files, particularly the batch processing functions
**Keywords:**
- `pandas.read_csv(..., chunksize=1000)` for large files
- Generator functions (`yield`) for streaming
- `memory_profiler` decorators (if used)
- Lazy evaluation patterns
- `del` statements for freeing memory
- `gc.collect()` calls
- `DataFrame.drop()` for unused columns
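
The generator-based streaming pattern can be sketched with the standard-library `csv` module: rows are read and scored one at a time, so the full file never has to sit in memory. The column names (`id`, `answer`, `reference`) are assumptions about the input layout. `pandas.read_csv(..., chunksize=1000)` offers a comparable iterator that yields DataFrame chunks instead of single rows.

```python
import csv
from typing import Callable, Iterator, TextIO, Tuple


def stream_scores(file_obj: TextIO,
                  score_fn: Callable[[str, str], float]) -> Iterator[Tuple[str, float]]:
    """Lazily yield (id, score) for each CSV row.

    Assumes a header row with the hypothetical columns
    `id`, `answer`, and `reference`.
    """
    for row in csv.DictReader(file_obj):
        yield row["id"], score_fn(row["answer"], row["reference"])
```

Because `stream_scores` is a generator, results can be written out or aggregated as they arrive, for example `for row_id, score in stream_scores(open_file, metric): ...`.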

#### **4. GPU Acceleration (Sentence Transformers)**
**Location:** `src/evaluate.py` → `SemanticEmbeddingService` class
**Keywords:**
- `torch.cuda.is_available()` checks
- `device='cuda'` or `device='cpu'` selection
- `model.to(device)` method calls
- `CUDA_VISIBLE_DEVICES` environment variable
- Batch processing on GPU with `model.encode(..., batch_size=32)`
- `torch.no_grad()` context manager for inference
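
A hedged sketch of device selection with a CPU fallback. `pick_device` and `encode_sentences` are hypothetical helpers, not part of `SemanticEmbeddingService`, and the model name is only an example. The heavy libraries are imported lazily so that the module still loads on machines without `torch` or `sentence-transformers` installed.

```python
def pick_device(use_gpu: bool = True) -> str:
    """Return "cuda" when requested and available, otherwise "cpu"."""
    if not use_gpu:
        return "cpu"
    try:
        import torch  # lazy import: optional dependency
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def encode_sentences(sentences, model_name: str = "all-MiniLM-L6-v2",
                     batch_size: int = 32, use_gpu: bool = True):
    """Encode sentences in batches on the selected device.

    Requires the `sentence-transformers` package; the model name is an
    illustrative default, not necessarily the one used in this project.
    """
    from sentence_transformers import SentenceTransformer  # lazy import

    model = SentenceTransformer(model_name, device=pick_device(use_gpu))
    return model.encode(sentences, batch_size=batch_size, show_progress_bar=False)
```

In a long-running service the model would normally be loaded once and reused rather than constructed on every call.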

#### **Additional Performance Optimizations:**
**Location:** Throughout the codebase
**Keywords:**
- **Caching:** `@lru_cache` decorators, `functools.cache`
- **Vectorization:** NumPy operations, pandas vectorized methods
- **Early Returns:** Conditional checks to skip computations
- **Memoization:** Storing computed results for reuse
- **Lazy Imports:** Importing heavy modules only when needed
- **Configuration Flags:** `USE_GPU`, `BATCH_SIZE`, `NUM_WORKERS`
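
As an illustration of the caching item, the sketch below memoizes a text-normalization step with `functools.lru_cache`. `normalize_text` is a hypothetical helper; the cache size is an arbitrary assumption. The function returns a tuple rather than a list so the cached value cannot be mutated by callers.

```python
from functools import lru_cache


@lru_cache(maxsize=10_000)
def normalize_text(text: str) -> tuple:
    """Lowercase and tokenize `text`; repeated inputs are served from the cache."""
    return tuple(text.lower().split())
```

This pays off when the same reference text is compared against many answers, since the reference is normalized only once. `normalize_text.cache_info()` reports hits and misses for tuning `maxsize`.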

**Check these specific patterns in your code:**
- Look for `if torch.cuda.is_available():` in embedding services
- Search for `Pool(` or `Executor(` in evaluation functions
- Find `chunksize=` parameters in data loading
- Look for progress bars (`tqdm`) with estimated time remaining

The architecture allows these performance features to work together: parallel processing for CPU-bound tasks, GPU acceleration for neural network inference, and streaming for memory management, all coordinated through the main evaluation pipeline.