
# Retrieval-Augmented Generation System

## **1. System Design**

**System Overview**

The Retrieval-Augmented Generation (RAG) system is a question-answering system designed to be scalable, efficient, and accurate. It combines embedding-based document retrieval with a BERT-based answer model, so it can search a large corpus and return accurate, contextually relevant answers.

### **Requirements and How the System Design Meets Them**

1. **High Relevancy on Domain-Specific Information**:

    - **Design Decision**: The system uses a combination of pre-trained BERT models and KDTree-based retrieval, ensuring that retrieved contexts are highly relevant to the query. The embedding model is chosen to capture domain-specific nuances, and the retrieval system is optimized for speed and accuracy.
    - **Outcome**: This approach ensures that the system returns contextually relevant information even in domain-specific scenarios.

2. **Adaptation to Dynamic/Growing Information**:

    - **Design Decision**: The system allows for dynamic updates to the corpus. New documents can be added, and embeddings can be recalculated and indexed in the KDTree, ensuring the system adapts to changes in the corpus.
    - **Outcome**: The system remains up-to-date and accurate as new information is added.

3. **Scalability to Billions of Documents**:

    - **Design Decision**: The KDTree is used for efficient retrieval, allowing the system to scale to large datasets. The embedding model is lightweight, ensuring fast processing of large volumes of data.
    - **Outcome**: The system can handle large-scale document retrieval efficiently.

4. **Source Citation**:

    - **Design Decision**: Each retrieved chunk of context is stored with metadata, including document and chunk identifiers, allowing the system to cite the source of information accurately.
    - **Outcome**: The system can provide users with citations for the information used to generate answers.
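
The citation requirement can be illustrated with a minimal sketch. The class name `ContextChunk`, its fields, and the citation string format are hypothetical and chosen only for illustration; the project's actual metadata layout may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextChunk:
    """A retrieved chunk plus the identifiers needed to cite it (hypothetical fields)."""
    document_id: str
    chunk_id: int
    text: str


def format_citations(chunks):
    """Return one citation string per unique (document, chunk) pair, in retrieval order."""
    seen = set()
    citations = []
    for chunk in chunks:
        key = (chunk.document_id, chunk.chunk_id)
        if key not in seen:
            seen.add(key)
            citations.append(f"{chunk.document_id}#chunk-{chunk.chunk_id}")
    return citations
```

Keeping the identifiers next to the text means the answer step never has to look up where a chunk came from.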

### **System Components and Processes**

1.  **Flask API Interface:**
    
    -   Provides endpoints for precomputing embeddings, searching for context, generating answers, and evaluating questions.
2.  **Pipeline:**
    
    -   **Preprocessing and Embedding:** Documents are preprocessed (chunking, trimming) and converted into embeddings using Sentence Transformers.
    -   **KDTree Retrieval:** Efficient nearest neighbor search retrieves relevant context chunks based on the query embedding.
    -   **BERT-based Answer Generation:** Retrieved contexts are fed into a BERT model to generate answers.
3.  **Data Storage:**
    
    -   **Corpus Storage:** Raw documents are stored and processed for embedding.
    -   **Embeddings Storage:** Precomputed embeddings are stored for fast retrieval.
    -   **Logs:** All API interactions are logged for monitoring and auditing.

#### **Complete System Diagram**:
The following diagram documents the system. If it does not render, see the copy in the `analysis/` directory of this project.

![System Architecture Diagram](https://github.com/creating-ai-enabled-systems-summer-2024/marquezjaramillo-jose/blob/main/visual_search_system/analysis/system_diagram.png)

## 2. Data, Data Pipelines, and Model

### Data Description

**Data Sources:**

-   **Document Corpus:** A collection of text documents used as the knowledge base for answering queries.
-   **Question-Answer Pairs:** A set of questions and corresponding ground-truth answers used for model evaluation.

**Data Characteristics:**

-   **Corpus:** Unstructured text data that is preprocessed into smaller chunks.
-   **Question-Answer Pairs:** Structured data containing the query, the expected answer, and metadata.

### Data Pipelines

**1. Preprocessing Pipeline:**

-   **Input:** Raw text documents.
-   **Process:**
    -   **Tokenization:** Documents are split into sentences.
    -   **Chunking:** Sentences are grouped into chunks based on the `sentences_per_chunk` parameter.
    -   **Embedding:** Chunks are converted into embeddings using a pre-trained model.
-   **Output:** Embeddings stored in the `storage/embeddings/` directory.
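
The tokenization and chunking steps above can be sketched as follows. This is a simplified illustration, not the project's code: the regex-based sentence splitter and the function names are assumptions, and only the `sentences_per_chunk` parameter comes from the description above.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(sentences, sentences_per_chunk):
    """Group consecutive sentences into chunks of at most `sentences_per_chunk`."""
    if sentences_per_chunk < 1:
        raise ValueError("sentences_per_chunk must be at least 1")
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]
```

For example, `chunk_sentences(split_sentences(document), 2)` would yield two-sentence chunks, with a shorter final chunk when the sentence count is odd.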

**2. KDTree Construction Pipeline:**

-   **Input:** Precomputed embeddings.
-   **Process:** Embeddings are indexed using KDTree for efficient retrieval.
-   **Output:** A KDTree structure ready for fast nearest neighbor search.
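
A note on the distance metric, offered under an assumption the text does not state: KD-tree implementations typically search by Euclidean distance, whereas sentence embeddings are often compared by cosine similarity. If the embeddings are L2-normalized (assumed here, with $\mathbf{q}$ the query embedding and $\mathbf{x}$ a chunk embedding, both of unit length), the two rankings coincide:

$$
\|\mathbf{q} - \mathbf{x}\|^2 = \|\mathbf{q}\|^2 + \|\mathbf{x}\|^2 - 2\,\mathbf{q} \cdot \mathbf{x} = 2 - 2\cos(\theta)
$$

Because the right-hand side decreases as $\cos(\theta)$ increases, the nearest neighbor by Euclidean distance is also the most similar chunk by cosine similarity.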

### Model

**Embedding Model:**

-   **Model Used:** Sentence Transformers (`all-MiniLM-L6-v2`).
-   **Purpose:** Convert text chunks into 384-dimensional embeddings capturing semantic meaning.

**Retrieval Model:**

-   **Model Used:** KDTree.
-   **Purpose:** Efficient retrieval of contextually relevant chunks based on query embeddings.

**Answer Generation Model:**

-   **Model Used:** BERT (`bert-large-uncased-whole-word-masking-finetuned-squad`).
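-   **Purpose:** Produce answers by extracting the most likely answer span from the retrieved context chunks (the model is fine-tuned on SQuAD for extractive question answering).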

## 3. **Metrics**

#### Offline Metrics

**1. Match Result**

- **Definition**: The `Match Result` metric indicates whether the answer generated by the system matches the ground truth answer. It is a binary metric: 1 (match) if the generated answer is identical or sufficiently similar to the ground truth, and 0 otherwise.

- **Mathematical Notation**:
  $$
  \text{Match Result} = 
  \begin{cases} 
  1 & \text{if the generated answer matches the ground truth} \\
  0 & \text{otherwise}
  \end{cases}
  $$

- **Purpose**: The `Match Result` is a straightforward measure of accuracy. It is used to evaluate how often the system produces the correct answer. This metric is particularly important for understanding the basic correctness of the system's output.

- **Intuition**: If the `Match Result` is consistently high, it suggests that the system is reliably providing the correct answers. It is the most direct way to assess the system's accuracy and is crucial for tasks where exact matches are critical.
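
A minimal sketch of one way to compute this metric. The normalization rule (lowercasing, stripping punctuation, collapsing whitespace) is an assumption standing in for "sufficiently similar"; the function names are hypothetical.

```python
import string


def normalize_answer(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def match_result(generated, ground_truth):
    """Return 1 if the normalized answers are identical, else 0."""
    return int(normalize_answer(generated) == normalize_answer(ground_truth))
```

Averaging `match_result` over an evaluation set gives the fraction of questions answered correctly under this matching rule.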

---

**2. Score**

- **Definition**: The `Score` metric quantifies the similarity between the generated answer and the ground truth. This score is usually derived from a similarity measure such as cosine similarity, BLEU, or ROUGE.

- **Mathematical Notation** (for Cosine Similarity):
  $$
  \text{Score} = \cos(\theta) = \frac{\mathbf{g} \cdot \mathbf{t}}{\|\mathbf{g}\| \, \|\mathbf{t}\|}
  $$
  Where $\mathbf{g}$ and $\mathbf{t}$ are the embedding vectors of the generated answer and the ground truth, and $\theta$ is the angle between them.

- **Purpose**: The `Score` provides a more nuanced evaluation than the `Match Result`. It measures how similar the generated answer is to the ground truth, even if it is not an exact match. This is useful for assessing the quality of answers in scenarios where exact matches are rare, but semantic similarity is important.

- **Intuition**: A high `Score` indicates that even when the answer isn't exactly correct, it is close in meaning or content to the correct answer. This metric helps in cases where multiple valid answers might exist, allowing for flexibility in evaluation.
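
The cosine-similarity variant of the score can be sketched in plain Python. The function name is hypothetical, and the inputs are assumed to be embedding vectors already computed for the generated answer and the ground truth; returning 0.0 for a zero vector is a convention chosen here, not something the text specifies.

```python
import math


def cosine_score(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Values close to 1 mean the two embeddings point in nearly the same direction; values near 0 mean they are nearly orthogonal.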

---

**3. Precision**

- **Definition**: Precision is the fraction of relevant instances among the retrieved instances. In the context of the RAG system, it measures how many of the retrieved context chunks were actually relevant to the query.

- **Mathematical Notation**:
  $$
  \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
  $$

- **Purpose**: Precision helps assess the relevance of the retrieved information. It is particularly important in systems where the cost of retrieving irrelevant information is high, such as in information retrieval systems.

- **Intuition**: High precision indicates that the system is good at filtering out irrelevant information, which is crucial for generating accurate and concise answers.

---

**4. Recall**

- **Definition**: Recall is the fraction of relevant instances that were retrieved out of all relevant instances. It measures how many of the relevant context chunks were retrieved by the system.

- **Mathematical Notation**:
  $$
  \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
  $$

- **Purpose**: Recall evaluates the system's ability to retrieve all relevant information. This metric is important for ensuring that the system doesn't miss critical information that could contribute to generating a correct answer.

- **Intuition**: High recall indicates that the system is thorough in retrieving potentially useful information, which is important for generating comprehensive answers.

---

**5. F1-Score**

- **Definition**: The F1-Score is the harmonic mean of precision and recall, providing a single metric that balances the two. It is especially useful when you need to balance the trade-off between precision and recall.

- **Mathematical Notation**:
  $$
  F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
  $$

- **Purpose**: The F1-Score is used when there is a need to balance precision and recall. It is particularly valuable in scenarios where both false positives and false negatives are costly, such as in question-answering systems where both the relevance and completeness of retrieved information matter.

- **Intuition**: A high F1-Score indicates that the system is performing well in terms of both precision and recall, making it a good overall measure of retrieval quality.
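
Precision, recall, and F1 for a single query can be computed from sets of chunk identifiers. The sketch below assumes that a labeled set of relevant chunk IDs is available per query, which the text does not specify; the function name is hypothetical.

```python
def retrieval_metrics(retrieved_ids, relevant_ids):
    """Compute (precision, recall, f1) for one query from collections of chunk IDs."""
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    true_positives = len(retrieved & relevant)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0.0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With 4 retrieved chunks of which 2 are relevant, and 3 relevant chunks in total, this gives a precision of 1/2, a recall of 2/3, and an F1 of 4/7.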

---

#### Online Metrics

**1. Response Time**

- **Definition**: The time taken by the system to process a query and return an answer.

- **Mathematical Notation**:
  $$
  \text{Response Time} = \text{Time of Response} - \text{Time of Query}
  $$

- **Purpose**: Response time is crucial for user experience, especially in interactive applications where users expect quick responses. It is an essential metric for ensuring that the system remains performant under various load conditions.

- **Intuition**: Lower response times generally lead to better user satisfaction, as users are more likely to trust and use a system that responds quickly.
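
A minimal sketch of measuring response time around any callable, plus a nearest-rank percentile for summarizing many samples. Both helper names are hypothetical, and reporting a percentile rather than a mean is a suggestion rather than something the text prescribes.

```python
import math
import time


def timed_call(func, *args, **kwargs):
    """Run `func` and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list, for pct in (0, 100]."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]
```

Tail percentiles such as the 95th tend to reflect what slow requests feel like better than an average does.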

---

**2. User Satisfaction**

- **Definition**: A qualitative metric that measures how satisfied users are with the system's performance. This is usually gathered through surveys, feedback forms, or direct interactions.

- **Purpose**: User satisfaction provides insight into how well the system is meeting the needs and expectations of its users. While it is a qualitative metric, it is critical for understanding the real-world impact of the system.

- **Intuition**: High user satisfaction indicates that the system is not only technically sound but also aligns with user expectations and requirements, making it a key success metric for deployed systems.

---

### Conclusion on Metrics

These metrics are carefully chosen to cover different aspects of system performance. Offline metrics such as `Match Result`, `Score`, `Precision`, `Recall`, and `F1-Score` are critical for evaluating the system during development and ensuring that it meets accuracy and relevance standards. Online metrics like `Response Time` and `User Satisfaction` help monitor the system's performance in real-world scenarios, ensuring that it remains responsive and user-friendly after deployment.

Together, these metrics provide a comprehensive view of the system's effectiveness, allowing for continuous improvement and fine-tuning based on both technical performance and user feedback.

## 4. Post-Deployment Policies (Detailed)

Effective post-deployment strategies are essential for maintaining the reliability, performance, and accuracy of the system. The sections below cover monitoring, maintenance, and fault mitigation practices that keep the system robust and responsive in a production environment.


#### 1. Monitoring and Maintenance Plan

**1.1 Real-time Monitoring**

**Purpose**: Real-time monitoring is essential for tracking the health and performance of the system. By continuously collecting and analyzing data from the system's components, operators can detect anomalies, performance degradations, or failures as they occur and intervene quickly.

**Components to Monitor**:

-   **API Latency and Response Times**: Track how long it takes for the system to process queries and return answers. Sudden spikes in latency can indicate performance bottlenecks or underlying issues in the infrastructure.
-   **Error Rates**: Monitor the frequency and types of errors (e.g., 4xx/5xx HTTP responses) that occur. A rising error rate could signal issues with the API endpoints, model inference, or data retrieval.
-   **Resource Utilization**: Keep an eye on CPU, memory, disk I/O, and network usage across all services. High resource consumption might indicate inefficiencies or the need for scaling.
-   **Request Volume**: Track the number of requests per minute/hour. Sudden changes in request volume can help identify unexpected load conditions or usage patterns.

**1.2 Error Logging and Analysis**

**Purpose**: Comprehensive logging is vital for diagnosing issues and understanding the sequence of events leading up to a problem. By keeping detailed logs, you can perform post-mortem analyses and continuously improve the system.

**Logging Strategy**:

-   **Request Logging**: Log all incoming API requests, including request data, timestamps, and IP addresses. This helps in tracing and replicating issues.
-   **Error and Exception Logging**: Capture stack traces and error messages from failed operations or unhandled exceptions. Include contextual information to aid in debugging.
-   **Model Inference Logs**: Log input queries, selected context chunks, and generated answers, along with corresponding confidence scores and inference times.
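
The model inference log described above could be emitted as one JSON line per request, as in the sketch below. The logger name `rag.inference`, the function names, and the field names are all hypothetical; only the logged items (query, chunks, answer, confidence, inference time) come from the list above.

```python
import json
import logging
import time

logger = logging.getLogger("rag.inference")  # hypothetical logger name


def build_inference_record(query, chunk_ids, answer, confidence, inference_seconds):
    """Assemble one JSON-serializable log record for a single inference."""
    return {
        "timestamp": time.time(),
        "query": query,
        "chunk_ids": list(chunk_ids),
        "answer": answer,
        "confidence": confidence,
        "inference_seconds": inference_seconds,
    }


def log_inference(record):
    """Emit the record as one JSON line so downstream tools can parse it."""
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line
```

One JSON object per line keeps the log easy to filter and aggregate with standard tooling.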

**Storage and Retention**:

-   **Local Storage**: Use a dedicated storage solution for logs, such as a logging server with high availability.
-   **Cloud Storage**: For scalability, consider cloud-based log storage solutions like AWS CloudWatch or Google Cloud Logging.
-   **Retention Policy**: Retain logs for a sufficient period (e.g., 30-90 days) to allow for historical analysis while managing storage costs.

**1.3 Health Checks**

**Purpose**: Health checks ensure that the system's components are functioning correctly. By regularly probing the system, you can detect failures early and trigger alerts or automatic recovery mechanisms.

**Types of Health Checks**:

-   **API Health Checks**: Periodically ping the API endpoints to verify their responsiveness and correctness.
-   **Model Health Checks**: Run lightweight inference tasks at regular intervals to confirm that the models are operating as expected.
-   **Database/Storage Health Checks**: Ensure that the KDTree, embeddings, and log storage systems are accessible and performing optimally.

**Implementation**:

-   **Automated Scripts**: Deploy cron jobs or similar scheduling tools to run health checks at regular intervals.
-   **Service Mesh**: Use a service mesh like Istio to manage and monitor microservices health, automatically routing traffic away from unhealthy instances.
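
A minimal sketch of an aggregated health check that a scheduled job or an API endpoint could call. The function name, the check names, and the response shape are assumptions for illustration.

```python
def run_health_checks(checks):
    """Run each named check; a check passes if it returns a truthy value without raising."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # Any exception counts as a failed check rather than crashing the probe.
            results[name] = False
    status = "healthy" if all(results.values()) else "unhealthy"
    return {"status": status, "checks": results}
```

A caller would pass a mapping such as `{"api": ping_api, "index": query_index}`, where each value is a hypothetical zero-argument probe function.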

----------

#### 2. Fault Mitigation Strategies

**2.1 Graceful Degradation**

**Purpose**: In the event of a failure, the system should continue to operate at a reduced capacity rather than failing completely. Graceful degradation ensures that users still receive a response, even if it is less detailed or accurate.

**Strategies**:

-   **Fallback Mechanisms**: Implement fallback methods such as brute-force search or cached responses if KDTree retrieval fails. This allows the system to return a reasonable answer even if the primary retrieval mechanism is unavailable.
-   **Reduced Load Operations**: Temporarily reduce the complexity of operations (e.g., lowering the `top_k` value or using simpler models) during high-load scenarios to maintain service availability.
-   **Partial Responses**: If only some components fail (e.g., partial retrieval of context), return the partial result with a notification to the user, rather than failing entirely.
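
The brute-force fallback can be sketched as below. The `primary_search` callable stands in for the KDTree query and is hypothetical, as are the function names; embeddings are plain lists of floats to keep the example dependency-free.

```python
import math


def brute_force_search(query, embeddings, top_k):
    """Return indices of the `top_k` embeddings closest to `query` by Euclidean distance."""
    distances = [(math.dist(query, vec), idx) for idx, vec in enumerate(embeddings)]
    distances.sort()
    return [idx for _, idx in distances[:top_k]]


def search_with_fallback(primary_search, query, embeddings, top_k):
    """Try the primary index first; fall back to an exhaustive scan if it raises."""
    try:
        return primary_search(query, top_k), "primary"
    except Exception:
        return brute_force_search(query, embeddings, top_k), "fallback"
```

Returning the source label alongside the results lets the caller log, or tell the user, that a degraded path was used.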

**Implementation**:

-   **Circuit Breaker Patterns**: Use circuit breaker patterns to detect and isolate failing components, allowing the system to bypass them and continue functioning.
-   **Rate Limiting**: Implement rate limiting to protect the system from overwhelming traffic, ensuring that critical services remain available.
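
A minimal circuit breaker sketch, assuming consecutive-failure counting and a fixed cool-down. The class name, its parameters, and the injectable `clock` are hypothetical choices made for illustration and testability; this is not thread-safe as written.

```python
import time


class CircuitBreaker:
    """Open after `max_failures` consecutive errors; allow a trial call after `reset_seconds`."""

    def __init__(self, max_failures=3, reset_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_seconds:
                raise RuntimeError("circuit open: call skipped")
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping the retrieval call in `breaker.call(...)` lets the caller catch the "circuit open" error and switch to a fallback instead of repeatedly hitting a failing component.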

**2.2 Load Balancing**

**Purpose**: Load balancing distributes incoming requests across multiple instances of the system to ensure that no single instance becomes a bottleneck. This enhances both reliability and performance.

**Strategies**:

-   **Horizontal Scaling**: Deploy multiple instances of the API and retrieval services. Use a load balancer (e.g., NGINX, AWS ELB) to distribute traffic evenly.
-   **Geographic Load Balancing**: For global deployments, distribute traffic based on user location to reduce latency and balance load across regions.
-   **Auto-Scaling**: Automatically scale the number of instances based on current load. Tools like AWS Auto Scaling or Kubernetes Horizontal Pod Autoscaler can manage this dynamically.

**Implementation**:

-   **Session Affinity**: Use session affinity (sticky sessions) if needed to ensure that users continue interacting with the same instance during a session.
-   **Monitoring and Alerts**: Continuously monitor the performance of load balancers and trigger alerts if imbalances or failures are detected.

**2.3 Redundancy**

**Purpose**: Redundancy ensures that critical system components have backups, allowing the system to recover quickly from failures and minimizing downtime.

**Strategies**:

-   **Data Redundancy**: Store multiple copies of critical data (e.g., embeddings, logs) across different storage locations or cloud regions. This protects against data loss due to hardware failure or regional outages.
-   **Service Redundancy**: Deploy multiple instances of key services (e.g., Flask API, KDTree) across different servers or availability zones. If one instance fails, traffic can be routed to the redundant instance.

**Implementation**:

-   **Replication and Mirroring**: Use database replication or file system mirroring to maintain up-to-date copies of critical data in different locations.
-   **Failover Mechanisms**: Implement automatic failover mechanisms to switch to a backup service or data store in case the primary one fails.

**2.4 Automated Alerts**

**Purpose**: Automated alerts notify the operations team of critical issues, enabling rapid response and minimizing downtime.

**Strategies**:

-   **Threshold-Based Alerts**: Set thresholds for key metrics (e.g., CPU usage, error rates, response times). Trigger alerts if these thresholds are breached.
-   **Anomaly Detection**: Use machine learning-based anomaly detection to identify unusual patterns in system behavior that may indicate an emerging issue.
-   **On-Call Rotation**: Establish an on-call rotation for team members to ensure that someone is always available to respond to alerts.
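
Threshold-based alerting reduces to comparing current metric values against configured limits. In the sketch below the function name, the metric names, and the message format are hypothetical; metrics missing from the snapshot are skipped rather than treated as breaches.

```python
def check_thresholds(metrics, thresholds):
    """Return a sorted list of alert messages for metrics above their threshold."""
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"{name}={value} exceeds threshold {limit}")
    return sorted(alerts)
```

The resulting messages could then be handed to whichever alerting tool the team uses.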

**Implementation**:

-   **Alerting Tools**: Use tools like PagerDuty, Opsgenie, or custom scripts to send alerts via email, SMS, or messaging apps (e.g., Slack).
-   **Incident Management**: Integrate alerting systems with incident management tools to track and resolve issues systematically.

----------

### Conclusion on Post-Deployment Policies

The post-deployment policies outlined above are designed to ensure the continuous, reliable operation of the RAG system in a production environment. By implementing robust monitoring, proactive maintenance, and comprehensive fault mitigation strategies, the system can maintain high availability, performance, and accuracy. These policies are essential for providing a dependable user experience, quickly addressing issues as they arise, and continuously improving the system based on real-world usage data.