# RAG project

The RAG (Retrieval-Augmented Generation) project aims to explore and optimize various components of a RAG system.

## 0. Setup

This infrastructure setup defines a multi-service environment for a RAG using Docker Compose, with some adjustments. The main components include:

1. **Embedding Service (`embed`)**: Generates text embeddings using the pre-trained model `sentence-transformers/all-MiniLM-L6-v2`.
2. **Reranking Service (`rerank`)**: Refines the ranking of text results using the `mixedbread-ai/mxbai-rerank-xsmall-v` model.
3. **RAG Service (`rag_service`)**: Manages document retrieval and text generation, integrating both the embedding and reranking services.
4. **Redis**: Provides an in-memory data store for caching or queuing.
5. **GPT Inference Services (`gpt-4-mini` and `gpt-4o`)**: Run GPT-like models for generating responses, enabling advanced language capabilities.
6. **Gateway Service (`gateway_service`)**: Acts as the central API gateway, managing interactions between all services.
7. **PostgreSQL Database (`postgres`)**: Stores structured data such as metadata and logs.
8. **Gradio UI (`rag_gradio_service`)**: Provides a user interface for easy interaction with the RAG pipeline.

In [26]:
import os

open_ai_key = open("keys/.openai-api-key").read().strip()
os.environ["OPENAI_API_KEY"] = open_ai_key

cohere_key = open("keys/.cohere-api-key").read().strip()
os.environ["COHERE_API_KEY"] = cohere_key

nebius_key = open("keys/.nebius-api-key").read().strip()
os.environ["NEBIUS_API_KEY"] = nebius_key

huggingface_key = open("keys/.huggingface-api-key").read().strip()
os.environ["HUG_API_KEY"] = huggingface_key

In [None]:
%%writefile .env
MODEL_NAME="gpt-4"
ADMIN_KEY="aYpVtQxRmGzLsBnCfDiKjUxWqHvNwYcFbXlPrVdTw"
DATABASE_URL="postgresql://user:password@postgres/dbname"

In [7]:
!docker-compose -f docker-compose.yaml up -d

[1A[1B[0G[?25l[+] Running 9/0
 [32m✔[0m Container inference_service_gpt4mini   [32mRunning[0m                          [34m0.0s [0m
 [32m✔[0m Container rerank_service               [32mRu...[0m                            [34m0.0s [0m
 [32m✔[0m Container embed_service                [32mRun...[0m                           [34m0.0s [0m
 [32m✔[0m Container inference_service_gpt4o      [32mRunning[0m                          [34m0.0s [0m
 [32m✔[0m Container rag_gradio_service           [32mRunning[0m                          [34m0.0s [0m
 [32m✔[0m Container redis                        [32mRunning[0m                          [34m0.0s [0m
 [32m✔[0m Container topic-1-advanced-postgres-1  [32mRunning[0m                          [34m0.0s [0m
 [32m✔[0m Container rag_service                  [32mRunni...[0m                         [34m0.0s [0m
 [32m✔[0m Container gateway_service              [32mR...[0m                             

## 1. Database Setup and Text Extraction
### Download files
1. Clone the *Transformers* repository to your virtual machine:
```bash
git clone https://github.com/huggingface/transformers
```
2. Run the script to extract raw text from markdown files located in the transformers/docs/source/en/ directory:
```bash
python prep_scripts/markdown_to_text.py --input-dir transformers/docs/source/en/ --output-dir docs
```
3. The output will be a collection of plain text files stored in the docs directory.
This directory will serve as our knowledge base.

### Text Preprocessing and Summarization
Only files that start with the prefix "model" and have a size no greater than 5 KB are selected. This is a model constraint. In ChatGPT, each of these files will be summarized.

1. Remove Noise 
2. Preserve Key Information
3. Summarize Text
4. Text Segmentation

In [5]:
import os
import openai

# Define directories
raw_dir = 'docs/raw'
prepared_dir = 'docs/prepared'
MAX_FILE_SIZE = 5120

MODEL_NAME = "gpt-4o-mini" 


SYSTEM_MESSAGE = "You are a helpful assistant specialized in summarizing texts."
SUMMARY_PROMPT_TEMPLATE = "Please summarize the following text. Each paragraph should contain no more than 50 tokens. Do not include headers, and split the text using \n\n:{}"

# read the content of a file
def read_file(file_path):
    with open(file_path, "r", encoding="utf-8") as file:
        content = file.read()  
    return content

# send content to ChatGPT for processing
def modify_content_with_chatgpt(content):    
    try:
        response = openai.chat.completions.create(
            model=MODEL_NAME,
            messages=[
                {"role": "system", "content": SYSTEM_MESSAGE},
                {"role": "user", "content": SUMMARY_PROMPT_TEMPLATE.format(content)},
            ]
        )
            
        summary = response.choices[0].message.content
        return summary.strip()
        
    except Exception as e:
        print(f"An error occurred: {e}")
        return ""    


# process files in the raw directory
def process_files(raw_dir, prepared_dir):
    if not os.path.exists(prepared_dir):
        os.makedirs(prepared_dir)
    
    # iterate over files in the raw directory
    for filename in os.listdir(raw_dir):
        file_path = os.path.join(raw_dir, filename)
        
        # Step 1: Check if the file starts with 'model'
        if filename.startswith('model') and os.path.getsize(file_path) <= MAX_FILE_SIZE:
            print(f"Processing file: {filename}")
            
            # Step 2: Read the content of the file
            content = read_file(file_path)
            
            # Step 3: Modify the content using ChatGPT
            modified_content = modify_content_with_chatgpt(content)
            
            # Step 4: Save the modified content to the 'prepared' directory
            destination_file = os.path.join(prepared_dir, filename)
            with open(destination_file, "w", encoding="utf-8") as file:
                file.write(modified_content)
            
            print(f"File '{filename}' processed and saved to '{prepared_dir}'.")
        else:
            print(f"File '{filename}' does not start with 'model'. Skipping...")

# Process the files
process_files(raw_dir, prepared_dir)

print("All files starting with 'model' have been processed and saved to 'docs/prepared'.")

Processing file: model_doc_trocr.txt
File 'model_doc_trocr.txt' processed and saved to 'docs/prepared'.
File 'main_classes_model.txt' does not start with 'model'. Skipping...
File 'model_doc_llama2.txt' does not start with 'model'. Skipping...
File 'tasks_masked_language_modeling.txt' does not start with 'model'. Skipping...
File '_perf_infer_cpu.txt' does not start with 'model'. Skipping...
File 'model_doc_owlvit.txt' does not start with 'model'. Skipping...
File 'model_doc_levit.txt' does not start with 'model'. Skipping...
File 'model_doc_m2m_100.txt' does not start with 'model'. Skipping...
Processing file: model_doc_deberta-v2.txt
File 'model_doc_deberta-v2.txt' processed and saved to 'docs/prepared'.
Processing file: model_doc_longt5.txt
File 'model_doc_longt5.txt' processed and saved to 'docs/prepared'.
File '_attention.txt' does not start with 'model'. Skipping...
Processing file: model_doc_gemma2.txt
File 'model_doc_gemma2.txt' processed and saved to 'docs/prepared'.
Processin

### Document Embeddings

In [8]:
import requests

RAG_API_URL = "http://localhost:8000/add_to_rag_db"
prepared_dir = 'docs/prepared'

def load_data_to_db():
    for filename in os.listdir(prepared_dir):
        file_path = os.path.join(prepared_dir, filename)

        if os.path.isdir(file_path):
            continue

        try:
            # Read the file content
            with open(file_path, "r", encoding="utf-8") as file:
                content = file.read()

            # Split the content by the '\n\n' delimiter (double newlines)
            chunks = content.split("\n\n")

            # Iterate over each chunk and send it to the API
            for i, chunk in enumerate(chunks):
                if chunk.strip():
                    response = requests.post(RAG_API_URL, json={"text": chunk})
                    if response.status_code == 200:
                        print(f"Chunk {i + 1}/{len(chunks)} of file '{filename}' successfully uploaded to the database")
                    else:
                        print(f"Error uploading chunk {i + 1}/{len(chunks)} of file '{filename}': {response.status_code} {response.text}")
        except Exception as e:
            print(f"An error occurred while processing file '{filename}': {str(e)}")
            
load_data_to_db()

Chunk 1/8 of file 'model_doc_trocr.txt' successfully uploaded to the database
Chunk 2/8 of file 'model_doc_trocr.txt' successfully uploaded to the database
Chunk 3/8 of file 'model_doc_trocr.txt' successfully uploaded to the database
Chunk 4/8 of file 'model_doc_trocr.txt' successfully uploaded to the database
Chunk 5/8 of file 'model_doc_trocr.txt' successfully uploaded to the database
Chunk 6/8 of file 'model_doc_trocr.txt' successfully uploaded to the database
Chunk 7/8 of file 'model_doc_trocr.txt' successfully uploaded to the database
Chunk 8/8 of file 'model_doc_trocr.txt' successfully uploaded to the database
Chunk 1/3 of file 'model_doc_deberta-v2.txt' successfully uploaded to the database
Chunk 2/3 of file 'model_doc_deberta-v2.txt' successfully uploaded to the database
Chunk 3/3 of file 'model_doc_deberta-v2.txt' successfully uploaded to the database
Chunk 1/6 of file 'model_doc_longt5.txt' successfully uploaded to the database
Chunk 2/6 of file 'model_doc_longt5.txt' success

## 2. Reranker Implementation

In [15]:
import time
import requests
from pydantic import BaseModel
import json


# Prompts derived from the provided texts
prompts = [
    "What are the primary NLP tasks where BERTweet outperforms RoBERTa and XLM-R?",
    "How does BERTweet differ from standard BERT in its implementation and tokenization methods?",
    "How does BigBird's sparse attention mechanism achieve linear complexity for processing long sequences?",
    "What potential applications does BigBird have in genomics data analysis?",
    "Explain how CANINE processes text directly at the Unicode character level without tokenization.",
    "How does CANINE compare to mBERT in terms of language adaptation and parameter efficiency?",
    "What are the key differences between CPM and GPT-2, particularly regarding Chinese NLP tasks?",
    "How does CPM excel in few-shot learning for Chinese language applications?",
    "What advantages does JetMoe-8B's sparsely activated architecture offer over dense language models?",
    "Describe the function of Mixture of Attention Heads and Mixture of MLP Experts in JetMoe-8B."
]

RAG_CONTEXT_API_URL = 'http://localhost:8000/prompt_w_context'
reranker_dir = 'docs/reranked'

class RAGRequest(BaseModel):
    query: str
    model: str = "gpt-4o-mini"
    use_reranker: bool = False
    top_k_retrieve: int = 10
    top_k_rank: int = 1

class RAGResponse(BaseModel):
    message: str
    context: list[str]
    
    
# send request to the API and measure response time
def send_rag_request(query, use_reranker=False, top_k_retrieve=10, top_k_rank=1):
    url = RAG_CONTEXT_API_URL
    payload = RAGRequest(
        query=query,
        use_reranker=use_reranker,
        top_k_retrieve=top_k_retrieve,
        top_k_rank=top_k_rank
    ).model_dump_json()
    headers = {'Content-Type': 'application/json'}
    
    start_time = time.time()
    response = requests.post(url, data=payload, headers=headers)
    response_time = time.time() - start_time
    
    if response.status_code == 200:
        response_data = RAGResponse.model_validate_json(response.text)  # Updated method
        return response_data, response_time
    else:
        return None, response_time


def run_comparisons(prompts, top_k_values=[1, 3, 5], top_k_retrieve=10):
    results = []
    
    print("Starting comparisons...")
    
    for prompt in prompts:
        print(f"Processing prompt: '{prompt}'")
        
        for top_k_rank in top_k_values:
            print(f"  Top-k rank: {top_k_rank}")

            print(f"    Sending request without reranker (top_k_rank={top_k_rank})")
            context_without_reranker, time_without_reranker = send_rag_request(
                query=prompt,
                use_reranker=False,
                top_k_retrieve=top_k_retrieve,
                top_k_rank=top_k_rank
            )
            if context_without_reranker:
                print(f"    Context retrieved without reranker: {len(context_without_reranker.context)} documents")
            else:
                print(f"    No context retrieved without reranker for prompt '{prompt}'")

            print(f"    Sending request with reranker (top_k_rank={top_k_rank})")
            context_with_reranker, time_with_reranker = send_rag_request(
                query=prompt,
                use_reranker=True,
                top_k_retrieve=top_k_retrieve,
                top_k_rank=top_k_rank
            )
            if context_with_reranker:
                print(f"    Context retrieved with reranker: {len(context_with_reranker.context)} documents")
            else:
                print(f"    No context retrieved with reranker for prompt '{prompt}'")
            
            print(f"    Time without reranker: {time_without_reranker:.2f}s")
            print(f"    Time with reranker: {time_with_reranker:.2f}s")
            
            results.append({
                'query': prompt,
                'top_k_rank': top_k_rank,
                'context_without_reranker': context_without_reranker.context if context_without_reranker else [],
                'context_with_reranker': context_with_reranker.context if context_with_reranker else [],
                'time_without_reranker': time_without_reranker,
                'time_with_reranker': time_with_reranker
            })

    print("Completed all comparisons.")
    
    return results

def save_results(results, filename="rag_comparisons.json"):
    if not os.path.exists(reranker_dir):
        os.makedirs(reranker_dir)
        
    result_file = os.path.join(reranker_dir, filename)
    
    with open(result_file, "w", encoding="utf-8") as file:
        json.dump(results, file, indent=4)
            
    print(f"File '{filename}' results saved to '{reranker_dir}'.")
    
# Run the comparison and save results
results = run_comparisons(prompts)
save_results(results)

Starting comparisons...
Processing prompt: 'What are the primary NLP tasks where BERTweet outperforms RoBERTa and XLM-R?'
  Top-k rank: 1
    Sending request without reranker (top_k_rank=1)
    Context retrieved without reranker: 1 documents
    Sending request with reranker (top_k_rank=1)
    Context retrieved with reranker: 1 documents
    Time without reranker: 2.04s
    Time with reranker: 125.90s
  Top-k rank: 3
    Sending request without reranker (top_k_rank=3)
    Context retrieved without reranker: 3 documents
    Sending request with reranker (top_k_rank=3)
    Context retrieved with reranker: 3 documents
    Time without reranker: 1.97s
    Time with reranker: 126.83s
  Top-k rank: 5
    Sending request without reranker (top_k_rank=5)
    Context retrieved without reranker: 5 documents
    Sending request with reranker (top_k_rank=5)
    Context retrieved with reranker: 5 documents
    Time without reranker: 1.97s
    Time with reranker: 129.25s
Processing prompt: 'How does 

### Conclusion on the Use of a Reranker Based on the Results

For example, the results for the last prompt in the file [result](docs/reranked/rag_comparisons.json) were used.

The results from the three different queries with varying values of `top_k_rank` (1, 3, and 5) show some key insights regarding the impact of applying a reranker to the retrieval process.

#### 1. **Effect on Context Retrieval:**
- **No Significant Change in Retrieved Context**: In all tested cases, the retrieved contexts with and without the reranker appear to be very similar. For example, with a `top_k_rank` of 1, the context both with and without the reranker is identical. The same holds true for higher values of `top_k_rank` (3 and 5), where the top documents, although reordered slightly, remain largely the same in content.
  
  **Insight**: The reranker does not drastically change the core context or content of the documents, suggesting that its primary function may be to reorder already relevant content rather than introduce significantly different information.

#### 2. **Response Time:**
- **Significant Increase in Response Time**: The response time with the reranker applied is notably higher compared to without the reranker. For example:
  - **top_k_rank=1**: The time increased from approximately 1.98 seconds (without reranker) to 119.93 seconds (with reranker).
  - **top_k_rank=3**: The time increased from approximately 1.98 seconds (without reranker) to 121.31 seconds (with reranker).
  - **top_k_rank=5**: The time increased from approximately 1.98 seconds (without reranker) to 121.48 seconds (with reranker).

  **Insight**: The reranker significantly increases the computational cost, with times reaching over 60 times the original time without the reranker. This highlights a major trade-off between the quality of results and processing efficiency.

#### 3. **Pros of Using a Reranker:**
- **Improved Relevance of Retrieved Documents**: Although the specific content does not change drastically, the reranker may enhance the order in which the documents are retrieved, potentially placing more relevant content at the top. This can be beneficial in cases where the relevance ranking is crucial.
  
- **Better Alignment with User Queries**: In scenarios where the retrieval system’s ranking is suboptimal, a reranker can better align the retrieved documents with the user's query, improving the overall relevance.

#### 4. **Cons of Using a Reranker:**
- **Significant Increase in Response Time**: The main drawback of using a reranker is the substantial increase in processing time. For real-time applications or use cases requiring low-latency responses, this increase could be detrimental.
  
- **Limited Impact on Document Diversity**: The reranker does not seem to introduce much diversity in the retrieved content. The primary effect is reordering rather than enriching the set of documents retrieved, which might not be ideal when diverse perspectives or information are needed.

#### 5. **Conclusion:**
- **Usefulness in Specific Contexts**: The reranker can be a useful tool when the relevance of retrieved documents needs fine-tuning, especially when there are small differences in the quality of retrieved content. However, its high computational cost limits its practicality in scenarios where fast responses are needed, or when only slight improvements in relevance are required.
  
- **Recommendation**: For applications where response time is critical or the documents retrieved are already highly relevant, the reranker may not provide a justified return on investment due to its significant impact on processing time. In contrast, if the quality of the top-k documents is paramount, and response time is less of a concern, using the reranker can enhance the output quality significantly.

## 3. LLM Comparison

In [1]:
!pip install cohere

Collecting cohere
  Downloading cohere-5.13.3-py3-none-any.whl.metadata (3.5 kB)
Collecting fastavro<2.0.0,>=1.9.4 (from cohere)
  Downloading fastavro-1.9.7-cp312-cp312-macosx_10_9_universal2.whl.metadata (5.5 kB)
Collecting parameterized<0.10.0,>=0.9.0 (from cohere)
  Downloading parameterized-0.9.0-py2.py3-none-any.whl.metadata (18 kB)
Downloading cohere-5.13.3-py3-none-any.whl (249 kB)
Downloading fastavro-1.9.7-cp312-cp312-macosx_10_9_universal2.whl (1.0 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m4.1 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hDownloading parameterized-0.9.0-py2.py3-none-any.whl (20 kB)
Installing collected packages: parameterized, fastavro, cohere
Successfully installed cohere-5.13.3 fastavro-1.9.7 parameterized-0.9.0


In [16]:
import os
import time
import json
from openai import OpenAI
import cohere
import openai

# Список запросов
prompts = [
    "What are the primary NLP tasks where BERTweet outperforms RoBERTa and XLM-R?",
    "How does BERTweet differ from standard BERT in its implementation and tokenization methods?",
    "How does BigBird's sparse attention mechanism achieve linear complexity for processing long sequences?",
    "What potential applications does BigBird have in genomics data analysis?",
    "Explain how CANINE processes text directly at the Unicode character level without tokenization.",
    "How does CANINE compare to mBERT in terms of language adaptation and parameter efficiency?",
    "What are the key differences between CPM and GPT-2, particularly regarding Chinese NLP tasks?",
    "How does CPM excel in few-shot learning for Chinese language applications?",
    "What advantages does JetMoe-8B's sparsely activated architecture offer over dense language models?",
    "Describe the function of Mixture of Attention Heads and Mixture of MLP Experts in JetMoe-8B."
]


def openai_request(prompt):
    client = OpenAI(
        base_url="https://api.studio.nebius.ai/v1/",
        api_key=os.environ.get("NEBIUS_API_KEY"),
    )

    start_time = time.time()
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct",
        messages=[
            {
                "role": "user",
                "content": prompt
            }
        ],
        temperature=0.6,
        max_tokens=150  
    )
    
    summary = response.choices[0].message.content        
    elapsed_time = time.time() - start_time
        
    return summary.strip(), elapsed_time


def cohere_request(prompt):
    co = cohere.Client(os.environ.get("COHERE_API_KEY"))

    start_time = time.time()
    response = co.generate(
        model='command-r-plus-08-2024',  
        prompt=prompt,
        max_tokens=150  
    )
    elapsed_time = time.time() - start_time
    return response.generations[0].text, elapsed_time


def chatgpt_request(prompt):
    try:
        start_time = time.time()
        response = openai.chat.completions.create(
            model="gpt-4o-mini" ,
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt},
            ],
            max_tokens=150 
        )
            
        summary = response.choices[0].message.content        
        elapsed_time = time.time() - start_time
        
        return summary.strip(), elapsed_time
    except Exception as e:
        print(f"An error occurred with ChatGPT: {e}")
        return "", 0


def compare_apis_for_prompt(prompt):
    results = {}

    openai_response, openai_time = openai_request(prompt)
    results['OpenAI'] = {
        'response': openai_response,
        'time': openai_time
    }
    print(f"OpenAI response time: {openai_time:.3f} seconds")

 
    cohere_response, cohere_time = cohere_request(prompt)
    results['Cohere'] = {
        'response': cohere_response,
        'time': cohere_time
    }
    print(f"Cohere response time: {openai_time:.3f} seconds")


    chatgpt_response, chatgpt_time = chatgpt_request(prompt)
    results['ChatGPT'] = {
        'response': chatgpt_response,
        'time': chatgpt_time
    }
    print(f"ChatGPT response time: {openai_time:.3f} seconds")

    return {
        "query": prompt,
        "results": results
    }


def save_all_results_to_json(prompts, filename='api_comparison_results.json'):
    directory = 'docs/compared'
    if not os.path.exists(directory):
        os.makedirs(directory)
    
    file_path = os.path.join(directory, filename)
    
    all_results = []
    for prompt in prompts:
        result = compare_apis_for_prompt(prompt)
        all_results.append(result)


    with open(file_path, 'w') as f:
        json.dump(all_results, f, indent=4)

    print(f"Results have been saved to {file_path}")


save_all_results_to_json(prompts)

OpenAI response time: 5.464 seconds
Cohere response time: 5.464 seconds
ChatGPT response time: 5.464 seconds
OpenAI response time: 4.390 seconds
Cohere response time: 4.390 seconds
ChatGPT response time: 4.390 seconds
OpenAI response time: 7.207 seconds
Cohere response time: 7.207 seconds
ChatGPT response time: 7.207 seconds
OpenAI response time: 4.714 seconds
Cohere response time: 4.714 seconds
ChatGPT response time: 4.714 seconds
OpenAI response time: 3.722 seconds
Cohere response time: 3.722 seconds
ChatGPT response time: 3.722 seconds
OpenAI response time: 3.639 seconds
Cohere response time: 3.639 seconds
ChatGPT response time: 3.639 seconds
OpenAI response time: 3.554 seconds
Cohere response time: 3.554 seconds
ChatGPT response time: 3.554 seconds
OpenAI response time: 4.362 seconds
Cohere response time: 4.362 seconds
ChatGPT response time: 4.362 seconds
OpenAI response time: 3.682 seconds
Cohere response time: 3.682 seconds
ChatGPT response time: 3.682 seconds
OpenAI response tim

### Conclusion on the API Comparison

The conclusion was made based on the prompt: **"What are the primary NLP tasks where BERTweet outperforms RoBERTa and XLM-R?"**
The results are here [result](docs/compared/api_comparison_results.json).

This translation keeps the focus on the specific query from which the analysis was derived.

From the provided output, several important conclusions can be drawn regarding the performance of three different models: **OpenAI**, **Cohere**, and **ChatGPT**.

### 1. **Comparison of Responses to the Query**
All three models provided similar responses, highlighting the primary areas where **BERTweet** outperforms **RoBERTa** and **XLM-R**:

- **OpenAI**: Mentions tweet classification tasks such as sentiment analysis, hate speech detection, and topic modeling. It also highlights Named Entity Recognition (NER).
  
- **Cohere**: Focuses on sentiment analysis and the presence of Twitter-specific language, hashtags, and emojis, emphasizing the importance of pre-training on a large Twitter corpus.

- **ChatGPT**: Also mentions sentiment analysis but adds other tasks like emotion recognition and toxicity detection, reflecting a broader perspective on BERTweet’s application.

### 2. **Similarities in Responses**
All models agree on the following points:
- **BERTweet** is a specialized model for analyzing social media text, such as tweets, and shows advantages in tasks related to sentiment analysis and texts containing specific elements of social networks (emojis, hashtags, slang).
- Both **RoBERTa** and **XLM-R** are more general-purpose models that lack the specialization for social media text.

### 3. **Differences in Focus**
- **OpenAI** focuses on **Named Entity Recognition (NER)** and tweet classification.
- **Cohere** emphasizes **sentiment analysis** and working with **social media signals** (such as hashtags and emojis).
- **ChatGPT** highlights additional tasks like **emotion recognition** and **toxicity detection**, which may be more relevant for social media analysis.

### 4. **Response Time**
- **OpenAI**: 3.53 seconds
- **Cohere**: 3.35 seconds
- **ChatGPT**: 2.99 seconds

**Conclusions**:
- All three models provided responses quickly, with minor differences in execution time. **ChatGPT** demonstrated the fastest response time (2.99 seconds), which can be valuable for applications requiring speed.
- **Cohere** and **OpenAI** showed similar response times, but **Cohere** was slightly faster than **OpenAI** (3.35 vs. 3.53 seconds).

### 5. **Conclusion**
- All models demonstrated a solid understanding of **BERTweet** and highlighted its key advantages in tasks related to social media.
- Despite the similarity in responses, the slight differences in focus and proposed tasks may reflect the unique characteristics of each model.
- The response times were not significantly different, suggesting that each model is fairly efficient for this type of query.

Depending on the context of your task, you may choose the most suitable model:
- **ChatGPT** may be the preferred choice if response speed is important.
- **OpenAI** and **Cohere** could be more useful if you need more detailed responses, particularly for sentiment analysis and social media text tasks.


## 4. Evaluation Setup

In [19]:
!pip install huggingface_hub pandas tqdm



In [63]:
import json
import re
from tqdm.auto import tqdm
from huggingface_hub import InferenceClient

repo_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # Модель для оценки
llm_client = InferenceClient(model=repo_id, timeout=120, token=os.environ.get("HUG_API_KEY"))

def extract_judge_score(response):
    match = re.findall(r'Total rating: (\d+)', response)
    if match:
        return int(match[-1])
    return None

def evaluate_response(question, answer, retries=5, delay=10):
    for attempt in range(retries):
        try:
            prompt = f"""            
Evaluate the following response to the query:
"QUERY: {question}"
"RESPONSE: {answer}"
Provide a 'Total rating' from 1 to 10 based on relevance, coherence, depth, and accuracy. 
Justify the score with specific details about the response.

Your feedback should be structured as follows:
Total rating: (your rating, as a number between 1 and 10)       
"""
            response = llm_client.text_generation(
                prompt=prompt.format(question=["question"], answer=["answer"]),
                 max_new_tokens=1000,
             )
            return extract_judge_score(response.strip())
        except Exception as e:
            print(f"Error on attempt {attempt + 1}: {e}")
            if attempt < retries - 1:
                time.sleep(delay)  
            else:
                print("Max retries reached. Skipping this response.")
                return None


# file_path = "docs/compared/api_comparison_results.json"
file_path = "docs/compared/low_score.json"


with open(file_path, 'r') as file:
    data = json.load(file)


evaluation_results = []

for entry in tqdm(data, desc="Processing entries"):
    query = entry.get("query", "No query provided")  
    results = entry.get("results", {})  
    
    for model_name, model_data in results.items():
        response = model_data.get("response", "No response")  
        time_taken = model_data.get("time", None)  
        try:

            score = evaluate_response(query, response)
        except Exception as e:
            print(f"Error evaluating response for model {model_name}: {e}")
            score = None
        

        evaluation_results.append({
            "query": query,
            "model": model_name,
            "response": response,
            "time_taken": time_taken,
            "score": score
        })

output_file="docs/evaluated/evaluation_low_results.json"
# output_file = "docs/evaluated/evaluation_results.json"
with open(output_file, 'w') as json_file:
    json.dump(evaluation_results, json_file, indent=4)

print(f"Evaluation completed. Results saved to {output_file}")

Processing entries: 100%|██████████| 5/5 [02:28<00:00, 29.76s/it]

Evaluation completed. Results saved to docs/evaluated/evaluation_low_results.json





### **Summary and Conclusions from Evaluation**

In this evaluation, two distinct datasets were used to assess the performance of responses from different language models (LLMs) such as Cohere, OpenAI, and ChatGPT. The evaluation focused on how well the models addressed queries in the fields of NLP, specifically concerning model architectures like BERTweet, CANINE, and others.

Here are the files containing the evaluation results:
- [low_score](docs/evaluated/evaluation_low_results.json)
- [high_score](docs/evaluated/evaluation_results.json)


#### **Key Findings from the Evaluation:**

1. **Relevance and Coherence:**
   - Responses from the **Cohere**, **OpenAI**, and **ChatGPT** models generally stayed relevant to the queries, though there were instances where the answers provided lacked specificity or were less detailed, particularly in cases involving complex concepts like tokenization methods or character-level processing (e.g., for BERTweet).
   - The **ChatGPT** responses tended to be more coherent in structuring information logically. Some responses, however, lacked depth or clear explanation (such as in the responses related to BERTweet and tokenization differences).

2. **Depth of Response:**
   - The **ChatGPT** responses generally had more detail in the explanations, providing a deeper understanding of complex topics (e.g., **CANINE** and **mBERT** comparisons). These responses often included clear breakdowns of technical characteristics.
   - **Cohere** and **OpenAI** responses, while generally accurate, sometimes showed a lack of detailed coverage. For example, in the case of **BERTweet**, the responses did not provide sufficient information about tokenization or implementation differences, resulting in lower scores for depth.
   - In the case of **CANINE**, the **ChatGPT** and **OpenAI** responses were much more comprehensive, clearly discussing the character-level processing advantages of CANINE. **Cohere's** response was somewhat superficial and incomplete, which resulted in lower depth ratings.

3. **Accuracy:**
   - **ChatGPT** demonstrated a high level of accuracy across responses. It correctly described the key features of models like **CANINE** and **mBERT**, explaining their differences in terms of language adaptation and parameter efficiency.
   - **Cohere**'s and **OpenAI**'s responses were accurate but sometimes missed important technical details or nuances. For example, **Cohere** sometimes provided more general statements (e.g., regarding **BERTweet** or **mBERT**) without diving deeply into the specific technical distinctions.

4. **Comparison of Performance between the Two Datasets:**
   - **First Dataset (LLM responses from the previous task)**:
     The first set of responses from **Cohere**, **OpenAI**, and **ChatGPT** typically demonstrated a higher quality, particularly in terms of relevance and depth. These models were more precise in explaining complex NLP concepts and demonstrated a clear understanding of how advanced models operate.
   
   - **Second Dataset (Artificially generated responses with lower quality)**:
     The second set of responses, artificially generated with low quality, were often brief, vague, and lacking critical information. The responses failed to provide sufficient depth, and in some cases, accuracy was questionable. For example, responses about **BERTweet** lacked specific details about tokenization methods, and some descriptions about **CANINE** were imprecise, which contributed to lower scores for these responses.

#### **Scores Overview:**

- **High-Quality Responses**: 
  - **ChatGPT** stood out in this dataset, often receiving ratings of 8-9 for its detailed and accurate answers, especially for queries about **CANINE** and **mBERT**.
  - **OpenAI** also provided solid answers, with ratings mostly around 7-8 for its coverage of technical topics.
  
- **Lower-Quality Responses**: 
  - **Cohere**'s responses were often rated lower (4-6). These responses were sometimes vague, missing depth, or only partially accurate. For example, its answer regarding **BERTweet** tokenization showed a lack of technical depth, resulting in a lower score.

#### **Conclusions:**

1. **Quality of Responses**:
   - **ChatGPT** was the most consistent and detailed in its responses, providing comprehensive and accurate explanations. It demonstrated a higher level of understanding when addressing complex topics, such as tokenization and model architecture differences.
   - **Cohere**'s responses often lacked detail, and while it occasionally offered accurate information, it failed to provide the depth needed for higher scores.
   - **OpenAI** performed better than **Cohere**, but sometimes struggled with the depth and clarity that **ChatGPT** provided.

2. **Implications for Model Selection**:
   - When seeking detailed and accurate information on complex NLP topics, **ChatGPT** performed the best and would be the most reliable model to rely on for high-quality outputs.
   - **Cohere** may need further fine-tuning or more detailed training data to improve the specificity and depth of its responses.
   - **OpenAI** performed well but could benefit from providing more detailed explanations, especially when dealing with technical differences in model architectures.

In summary, **ChatGPT** generally provided the most accurate, relevant, and coherent answers, particularly when complex NLP topics were discussed. The other models, while still effective, showed limitations in their ability to handle depth and specificity.