# Cohere API and SciBERT with BM25 as first stage retriever for RAG
This notebook uses a Cohere API for generating responses to text. A query input is required from the user. 
SciBERT is used for embeddings in a dense vector array for the query. 
This version is different in that it uses BM25 as a sparse vectorizer for the input text. Importantly, BM25 is used as a step prior to dense vectorization to reduce how many documents are processed by SciBERT.
A DOI is supplied with the text as both an identifier and locator. 

## pipeline
1. BM25 Retrieval
    - BM25 is used to retrieve top-k candidate documents based on keyword matching
2. Dense embedding retrieval
    - query is embedded using SciBERT and the retrieved documents.
3. Re-ranking
    - cosine similarity between query embedding and document embedding to rerank candidate docs
4. Generation
    - docs and query are fed to generator for answer creation. 

- [ ] set up venv
- [ ] install transformers torch cohere in command line

### todo
- [ ] create script that compiles data/documents.txt with DOI || text for all documents
- [ ] rank_bm25: https://github.com/dorianbrown/rank_bm25


In [11]:
# imports
import cohere
from transformers import AutoTokenizer, AutoModel
import numpy as np
from typing import List, Tuple, Dict

In [14]:

# Initialize Cohere client
co = cohere.Client("i4WfLKa1zNNKsPU3n4ZEVuzpaTCBwztx6p6hebpO")

# Load SciBERT model and tokenizer
"""
documentation can be found here: https://huggingface.co/docs/transformers/v4.50.0/en/model_doc/auto#transformers.AutoTokenizer


"""
# Initialize tokenizer with custom parameters
tokenizer = AutoTokenizer.from_pretrained(
    "allenai/scibert_scivocab_uncased",
    max_len=512,
    use_fast=True,  # Use the fast tokenizer
    do_lower_case=False,  # Preserve case
    add_prefix_space=False,  # No prefix space
    never_split=["[DOC]", "[REF]"],  # Tokens to never split
    additional_special_tokens=["<doi>", "</doi>"]  # Add custom special tokens
)

# This is the SciBERT model that is used to embed the text and query.
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

In [17]:
"""
Basic RAG with Cohere model
Document source: data/documents.txt where the DOI with resolver is separated from the abstract by ||. One record per line. 
Saved as UTF-8

Returns:  answers based on query from input()
"""

# Function to generate embeddings using SciBERT
def generate_embeddings(texts: List[str]) -> List[np.ndarray]:
    """
    converts raw text to numerical representations using a pretrained model, in this case, SciBERT
    Input: text from tokenizer step above as a list of strings
    Output: np.array
    """
    inputs = tokenizer(
        texts,
        return_tensors="pt",
        max_length=512, # returns PyTorch tensors which are compatible with model
        padding="max_length",
        truncation=True,
        return_attention_mask=True # return the attention mask - need to learn more
        )
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state.mean(dim=1).detach().numpy()
    return embeddings

# Function to read documents and their DOIs from a file
def read_documents_with_doi(file_path: str) -> List[Dict[str, str]]:
    documents_with_doi = []
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            parts = line.strip().split("||")  # Assuming DOI and document are separated by "||"
            if len(parts) == 2:
                doi, document = parts
                documents_with_doi.append({"doi": doi.strip(), "text": document.strip()})
    return documents_with_doi

# Path to the file containing documents and DOIs
file_path = "data/documents.txt"  # Replace with your file path

# Read documents and DOIs from the file
documents_with_doi = read_documents_with_doi(file_path)

# Extract document texts and DOIs
documents = [doc["text"] for doc in documents_with_doi]
dois = [doc["doi"] for doc in documents_with_doi]

# Example query
query = input(" What is your query: ")

# Generate document embeddings
document_embeddings = generate_embeddings(documents)

# Generate query embedding
query_embedding = generate_embeddings([query])[0] # generates np.array for the query text

# Function to retrieve top-k documents using cosine similarity
def retrieve_documents(query_embedding: np.ndarray, document_embeddings: List[np.ndarray], top_k: int = 2) -> List[Tuple[float, Dict[str, str]]]:
    similarities = []
    for doc_emb in document_embeddings:
        # cosine similarity
        similarity = np.dot(query_embedding, doc_emb) / (np.linalg.norm(query_embedding) * np.linalg.norm(doc_emb)) 
        similarities.append(similarity)
    # ranking
    top_indices = np.argsort(similarities)[::-1][:top_k]
    return [(similarities[i], documents_with_doi[i]) for i in top_indices]

# Retrieve top documents
top_documents = retrieve_documents(query_embedding, document_embeddings)
print("Retrieved Documents:")
for score, doc in top_documents:
    print(f"Score: {score:.4f}, DOI: {doc['doi']}, Document: {doc['text']}")

# Prepare context for Cohere's Command model (include DOI)
context = "\n".join([f"DOI: {doc['doi']}, Text: {doc['text']}" for _, doc in top_documents])
prompt = f"Query: {query}\nContext: {context}\nAnswer: Include the DOI of the referenced document in your response."

# Generate response using Cohere's Command model
response = co.generate(
  model="command",
  prompt=prompt,
  max_tokens=150,
  temperature=0.7
)

# Print the generated response
print("\nGenerated Response:")
print(response.generations[0].text)


Retrieved Documents:
Score: 0.7395, DOI: https://doi.org/10.1162/qss_a_00286, Document: ABSTRACT  The main objective of this study is to compare the amount of metadata and the completeness degree of research publications in new academic databases. Using a quantitative approach, we selected a random Crossref sample of more than 115,000 records, which was then searched in seven databases (Dimensions, Google Scholar, Microsoft Academic, OpenAlex, Scilit, Semantic Scholar, and The Lens). Seven characteristics were analyzed (abstract, access, bibliographic info, document type, publication date, language, and identifiers), to observe fields that describe this information, the completeness rate of these fields, and the agreement among databases. The results show that academic search engines (Google Scholar, Microsoft Academic, and Semantic Scholar) gather less information and have a low degree of completeness. Conversely, third-party databases (Dimensions, OpenAlex, Scilit, and The Lens) have