# Cohere API and SciBERT with BM25 as pre-retriever for RAG
This notebook uses the Cohere API to generate a response to a query supplied by the user.
SciBERT embeds the query and the candidate documents as dense vectors.
This version differs from earlier ones in that BM25 acts as a pre-retriever on the input text, reducing the number of documents that SciBERT (embeddings) and the generator have to process.
A DOI is supplied with each text and serves as both identifier and locator.

## Pipeline
1. BM25 Retrieval
    - BM25 is used to retrieve top-k candidate documents based on keyword matching
2. Dense embedding retrieval
    - the query and the retrieved documents are embedded using SciBERT
3. Re-ranking
    - candidate docs are reranked by cosine similarity between the query embedding and each document embedding
4. Generation
    - docs and query are fed to the generator to create the answer
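
The retrieval and re-ranking steps above can be sketched without external dependencies. This is a minimal illustration, not the notebook's `bm25_retriever` or `generate_embeddings` modules: the whitespace tokenizer, the BM25 parameters `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the bag-of-words `toy_embed` function (a stand-in for SciBERT) are all assumptions chosen to keep the example small.

```python
import math
from collections import Counter

def tokenize(text):
    """Lower-case whitespace tokenizer (assumed stand-in for stemming + NLTK)."""
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: how many documents contain each term
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF that stays positive for very common terms
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_then_rerank(query, docs, embed, top_k=2):
    """BM25 selects top_k candidates; embedding cosine similarity reorders them."""
    scores = bm25_scores(query, docs)
    candidates = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:top_k]
    q_vec = embed(query)
    return sorted(candidates, key=lambda i: cosine(q_vec, embed(docs[i])), reverse=True)

docs = [
    "bm25 ranks documents by keyword overlap",
    "dense embeddings capture semantic similarity",
    "cats sleep most of the day",
]
VOCAB = sorted({t for d in docs for t in tokenize(d)})

def toy_embed(text):
    """Hypothetical embedding: term counts over VOCAB (SciBERT would go here)."""
    counts = Counter(tokenize(text))
    return [float(counts[t]) for t in VOCAB]

ranked = retrieve_then_rerank("keyword overlap in documents", docs, toy_embed, top_k=2)
print([docs[i] for i in ranked])
```

Only the `top_k` BM25 candidates are ever embedded, which is where the saving over embedding the whole corpus comes from. The indices in `ranked` would then be used to select the documents passed to the generator.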


### Reference

- rank_bm25: https://github.com/dorianbrown/rank_bm25


In [4]:
# imports
import cohere
from cohere import Client
from transformers import AutoTokenizer, AutoModel
import numpy as np
from typing import List, Tuple, Dict
import os
from dotenv import load_dotenv
import json
import time # for timing functions
import logging # finding where functions are taking too long
# for BM25S
import bm25s
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import pickle
# models specific
from generate_embeddings import generate_embeddings
from bm25_retriever import bm25_retriever


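def main():
    # load the secret .env file
    load_dotenv()

    # store credentials as module-level globals so later cells can use them
    global key, email
    key = os.getenv('COHERE_API_KEY')
    email = os.getenv('EMAIL')

    # verify that both values were found
    if email is not None and key is not None:
        print("all is good, beautiful!")
    else:
        print("COHERE_API_KEY or EMAIL is missing from the environment")

main()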

all is good, beautiful!


In [2]:
# Initialize Cohere client
co = cohere.Client(key) 

# Load SciBERT model and tokenizer
# Documentation: https://huggingface.co/docs/transformers/v4.50.0/en/model_doc/auto#transformers.AutoTokenizer
# Initialize tokenizer with custom parameters
tokenizer = AutoTokenizer.from_pretrained(
    "allenai/scibert_scivocab_uncased",
    model_max_length=512,  # BERT-style limit; replaces the deprecated max_len argument
    use_fast=True,  # Use the fast tokenizer
    do_lower_case=True,  # The uncased SciBERT vocabulary expects lower-cased input
    add_prefix_space=False,  # No prefix space (mainly relevant to byte-level BPE tokenizers)
    never_split=["[DOC]", "[REF]"],  # Tokens to never split (may only be honored by the slow BERT tokenizer)
    #additional_special_tokens=["<doi>", "</doi>"],  # Add custom special tokens ***RE-EVALUATE*** (tuple or list of str or tokenizers.AddedToken, optional): tokens not already in the vocabulary are appended to it and are skipped when decoding with skip_special_tokens=True.
    # Note: skip_special_tokens is an argument to decode(), not to from_pretrained(), so it is not set here.
)

# This is the SciBERT model used to embed the text and the query.
# Other candidate models: 'allenai-specter'
# Documentation: https://huggingface.co/docs/transformers/model_doc/auto
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

#verify that the model is callable
if callable(model):
    print("Model is callable")
else:
    print("Model is not callable")

Model is callable
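
The model returns one vector per token, so a pooling step is needed to get a single dense vector per query or document. The `generate_embeddings` module is not shown here, so the following is only a sketch of one common choice, masked mean pooling, written in plain Python on toy data. The function name `mean_pool` and the example values are hypothetical, and the actual module may pool differently (for example by taking the `[CLS]` vector).

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j, x in enumerate(vec):
                total[j] += x
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [t / count for t in total]

# Toy input: three real tokens and one padding position, two dimensions each
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

pooled = mean_pool(token_vectors, attention_mask)
print(pooled)
```

With a real model, `token_vectors` would correspond to one row of the final hidden states and `attention_mask` to the mask produced by the tokenizer, so that padded positions do not dilute the average.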


## V4 BM25 Pre-retriever
Functions from V3 have been converted to modules.
This version includes the following:
- BM25 pre-retriever
- SciBERT embedding of query and pre-retrieved documents
- cosine similarity between embeddings of query and documents
- response instruction
- context includes DOI, Title, and Abstract as augmentation to query and instruction.
- response
- follow-up with the retrieved documents for verification.
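
How the context augments the query and instruction can be sketched as a simple prompt builder. This is a hedged illustration only: the function name `build_prompt`, the dictionary keys, the prompt layout, and the placeholder DOI are assumptions, not the notebook's actual module.

```python
def build_prompt(query, docs, instruction):
    """Combine the instruction, retrieved context, and user query into one prompt."""
    blocks = []
    for i, doc in enumerate(docs, start=1):
        # Each context block carries the DOI so the answer can be traced back
        blocks.append(
            f"[{i}] DOI: {doc['doi']}\n"
            f"Title: {doc['title']}\n"
            f"Abstract: {doc['abstract']}"
        )
    context = "\n\n".join(blocks)
    return f"{instruction}\n\nContext:\n{context}\n\nQuestion: {query}"

# Hypothetical document record; the DOI is a placeholder, not a real reference
docs = [
    {
        "doi": "10.0000/example.1",
        "title": "An example title",
        "abstract": "An example abstract about keyword and dense retrieval.",
    }
]

prompt = build_prompt(
    query="How does BM25 pre-retrieval help?",
    docs=docs,
    instruction="Answer using only the context and cite the DOI of each source used.",
)
print(prompt)
```

Keeping the DOI next to each title and abstract lets the follow-up step list the retrieved documents so the generated answer can be verified against them.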