# Cohere API and SciBERT for RAG
This notebook uses a Cohere API for generating responses to text. A query input is required from the user. 
SciBERT is used for embeddings in a dense vector array both the text and the query. 
A DOI is supplied with the text as both an identifier and locator. 

- [ ] set up venv
- [ ] install transformers torch cohere in command line

### todo
- [ ] create script that compiles data/documents.txt with DOI || text for all documents
- [ ] store vectorized documents in a db
    - https://huggingface.co/learn/cookbook/rag_with_hugging_face_gemma_mongodb

### options
- Batch Processing:
    If large number of texts, process them in batches to avoid memory issues.
    Example: Use a loop or torch.utils.data.DataLoader.

- Change model size: smaller models require less processing

- fine tune model on corpus

- look into pooling strategies

- Tokenizer
    - put cleaning process distincly prior to the tokenizer, using the default values as much as possible. 



In [22]:
# imports
import cohere
from cohere import Client
from transformers import AutoTokenizer, AutoModel
import numpy as np
from typing import List, Tuple, Dict

In [14]:

# Initialize Cohere client
co = cohere.Client("i4WfLKa1zNNKsPU3n4ZEVuzpaTCBwztx6p6hebpO")

# Load SciBERT model and tokenizer
"""
Autotokenizer documentation can be found here: https://huggingface.co/docs/transformers/v4.50.0/en/model_doc/auto#transformers.AutoTokenizer

Model documentation can be found here: https://huggingface.co/allenai/scibert_scivocab_uncased
Citation for SciBERT:
@inproceedings{beltagy-etal-2019-scibert,
    title = "SciBERT: A Pretrained Language Model for Scientific Text",
    author = "Beltagy, Iz  and Lo, Kyle  and Cohan, Arman",
    booktitle = "EMNLP",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1371"
}


"""
# Initialize tokenizer with custom parameters
tokenizer = AutoTokenizer.from_pretrained(
    "allenai/scibert_scivocab_uncased",
    max_len=512,
    use_fast=True,  # Use the fast tokenizer
    do_lower_case=False,  # Preserve case
    add_prefix_space=False,  # No prefix space
    never_split=["[DOC]", "[REF]"],  # Tokens to never split
    additional_special_tokens=["<doi>", "</doi>"]  # Add custom special tokens
)

# This is the SciBERT model that is used to embed the text and query.
# other models: 'allenai-specter', 
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

In [18]:
"""
Basic RAG with Cohere model
Document source: data/documents.txt where the DOI with resolver is separated from the abstract by ||. One record per line. 
Saved as UTF-8

Returns:  answers based on query from input()
"""

# Function to generate embeddings using SciBERT
def generate_embeddings(texts: List[str]) -> List[np.ndarray]:
    """
    converts raw text to numerical representations using a pretrained model, in this case, SciBERT.
    Currently this is applied to both the document text and the query. 
    May want a different version or decorator for the query as they are generally much shorter and more sparse.

    Input: text from tokenizer step above as a list of strings
    Output: np.array
    """
    inputs = tokenizer(
        texts,
        return_tensors="pt",
        max_length=512, # returns PyTorch tensors which are compatible with model
        padding="max_length",
        truncation=True,
        return_attention_mask=True # return the attention mask - need to learn more
        )
    # this passes the tokenized inputs through the model
    outputs = model(**inputs)

    # applies mean pooling to get a fixed size embedding
    embeddings = outputs.last_hidden_state.mean(dim=1).detach().numpy()
    return embeddings

# Function to read documents and their DOIs from a file
def read_documents_with_doi(file_path: str) -> List[Dict[str, str]]:
    documents_with_doi = []
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            parts = line.strip().split("||")  # Assuming DOI and document are separated by "||"
            if len(parts) == 2:
                doi, document = parts
                documents_with_doi.append({"doi": doi.strip(), "text": document.strip()})
    return documents_with_doi

# Path to the file containing documents and DOIs
file_path = "data/documents.txt"  # Replace with your file path

# Read documents and DOIs from the file
documents_with_doi = read_documents_with_doi(file_path)

# Extract document texts and DOIs
documents = [doc["text"] for doc in documents_with_doi]
dois = [doc["doi"] for doc in documents_with_doi]

# Example query
query = input(" What is your query: ")

# Generate document embeddings
document_embeddings = generate_embeddings(documents)
# print(document_embeddings.shape) # to see the output shape of the array

# Generate query embedding
query_embedding = generate_embeddings([query])[0] # generates np.array for the query text

# Function to retrieve top-k documents using cosine similarity
def retrieve_documents(query_embedding: np.ndarray, document_embeddings: List[np.ndarray], top_k: int = 2) -> List[Tuple[float, Dict[str, str]]]:
    similarities = []
    for doc_emb in document_embeddings:
        # cosine similarity
        similarity = np.dot(query_embedding, doc_emb) / (np.linalg.norm(query_embedding) * np.linalg.norm(doc_emb)) 
        similarities.append(similarity)
    # ranking
    top_indices = np.argsort(similarities)[::-1][:top_k]
    return [(similarities[i], documents_with_doi[i]) for i in top_indices]

# Retrieve top documents
top_documents = retrieve_documents(query_embedding, document_embeddings)
print("Retrieved Documents:")
for score, doc in top_documents:
    print(f"Score: {score:.4f}, DOI: {doc['doi']}, Document: {doc['text']}")

# Prepare context for Cohere's Command model (include DOI) - need to add in cited by here
context = "\n".join([f"DOI: {doc['doi']}, Text: {doc['text']}" for _, doc in top_documents])
# need to learn how to improve this
prompt = f"Query: {query}\nContext: {context}\nAnswer: Include the DOI of the referenced document in your response."

# Generate response using Cohere's Command model
response = co.generate(
  model="command", # there are other models to consider within command
  prompt=prompt,
  max_tokens=150, # allowable length of response
  temperature=0.5 # lower for less creativity, more for more creativity
)

# Print the generated response
print("\nGenerated Response:")
print(response.generations[0].text)


Retrieved Documents:
Score: 0.7746, DOI: https://doi.org/10.1162/qss_a_00286, Document: ABSTRACT  The main objective of this study is to compare the amount of metadata and the completeness degree of research publications in new academic databases. Using a quantitative approach, we selected a random Crossref sample of more than 115,000 records, which was then searched in seven databases (Dimensions, Google Scholar, Microsoft Academic, OpenAlex, Scilit, Semantic Scholar, and The Lens). Seven characteristics were analyzed (abstract, access, bibliographic info, document type, publication date, language, and identifiers), to observe fields that describe this information, the completeness rate of these fields, and the agreement among databases. The results show that academic search engines (Google Scholar, Microsoft Academic, and Semantic Scholar) gather less information and have a low degree of completeness. Conversely, third-party databases (Dimensions, OpenAlex, Scilit, and The Lens) have

## V2: implementing chat history

In [35]:
# Load SciBERT model and tokenizer 
"""
REMOVE THIS ONCE RUNNING TO GO BACK TO THE CHANGED TOKENIZER AND MODEL ABOVE
"""
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

# Initialize chat history
chat_history = []

# Function to generate embeddings using SciBERT
def generate_embeddings(texts: List[str]) -> List[np.ndarray]:
    inputs = tokenizer(
        texts,
        return_tensors="pt",
        max_length=512,
        padding="max_length",
        truncation=True
    )
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state.mean(dim=1).detach().numpy()
    return embeddings

# Function to read documents and their DOIs from a file
def read_documents_with_doi(file_path: str) -> List[Dict[str, str]]:
    documents_with_doi = []
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            parts = line.strip().split("||")
            if len(parts) == 2:
                doi, document = parts
                documents_with_doi.append({"doi": doi.strip(), "text": document.strip()})
    return documents_with_doi

# Path to the file containing documents and DOIs
file_path = "data/documents.txt"

# Read documents and DOIs from the file
documents_with_doi = read_documents_with_doi(file_path)
documents = [doc["text"] for doc in documents_with_doi]


# Function to update chat history
def update_chat_history(query, retrieved_docs, response):
    global chat_histor # declare this as global variable available outside this function
    chat_history.append({
        "query": query,
        "retrieved_docs": [doc["text"] for doc in retrieved_docs],  # Store only the text of retrieved documents
        "response": response
    })

# Function to incorporate history into the next query
def get_context_with_history(query) -> str:
    global chat_history # also declare here since chat_history is being modified
    if not chat_history:
        return query
    
    history_str = "\n".join([
        f"User: {entry['query']}\n"
        f"Context: {'; '.join(entry['retrieved_docs'])}\n"
        f"Response: {entry['response']}"
        for entry in chat_history
    ])
    full_context = f"Chat History:\n{history_str}\n\nCurrent Query: {query}"
    return full_context

# Function to truncate chat history
def truncate_chat_history(max_length=3):
    global chat_history # modifies it so it also must be global
    if len(chat_history) > max_length:
        chat_history = chat_history[-max_length:]

# Function to retrieve top-k documents using cosine similarity
def retrieve_documents(query: str, top_k: int = 2) -> List[Dict[str, str]]:
    query_embedding = generate_embeddings([query])[0]
    document_embeddings = generate_embeddings(documents)
    similarities = [
        np.dot(query_embedding, doc_emb) / (np.linalg.norm(query_embedding) * np.linalg.norm(doc_emb))
        for doc_emb in document_embeddings
    ]
    top_indices = np.argsort(similarities)[::-1][:top_k]
    return [documents_with_doi[i] for i in top_indices]

# RAG pipeline function
def rag_pipeline(query):
    # Incorporate chat history
    full_context = get_context_with_history(query)
    
    # Retrieve documents
    retrieved_docs = retrieve_documents(query)
    
    # Prepare context for Cohere's Command model
    context = "\n".join([f"DOI: {doc['doi']}, Text: {doc['text']}" for doc in retrieved_docs])
    prompt = f"Query: {query}\nContext: {context}\nAnswer: Include the DOI of the referenced document in your response."
    
    # Generate response
    response = co.generate(
        model="command",
        prompt=prompt,
        max_tokens=150,
        temperature=0.5
    ).generations[0].text
    
    # Update chat history
    update_chat_history(query, retrieved_docs, response)
    
    # Truncate history if necessary
    truncate_chat_history()
    
    # Print the response
    print("Generated Response:")
    print(response)
    return response

# Main loop for user interaction
while True:
    query = input("What is your query (or type 'exit' to quit): ")
    if query.lower() == "exit":
        break
    rag_pipeline(query)

Generated Response:
 I'm not sure how I can help you without any additional information, but if you provide a specific query or discussion point, I'll do my best to assist you! Also, DOI (Document Object Identifier) is a unique identifier assigned to a publication. It's a great way to cite papers and refer to them unequivocally, as it always leads to the exact same document. You can find the DOI on the document's title page, and it's been given for both of the documents you mentioned above. Let me know if I can help you with anything! 
Generated Response:
 Based on the DOI you provided, it appears you are asking about the 2017 study that analyzed 115,000 records across seven databases. These databases were Dimensions, Google Scholar, Microsoft Academic, OpenAlex, Scilit, Semantic Scholar, and The Lens. 

In this study, the authors evaluated the completeness of metadata across these databases and compared the results regarding the quality of information contained within. 

Please let me

KeyboardInterrupt: 