# **Instadeep technical test**

This notebook demonstrates several ways to embed documents, run similarity search, and summarize papers, using LangChain with Hugging Face or OpenAI models.

* Vector database: We used ChromaDB to store the embeddings.
* Embedding data: We used Hugging Face's sentence transformers (OpenAI's embedding models would also work).
* Recommendation of 3 papers: We used both ChromaDB's similarity search and LangChain's retrievers.
* Paper summarization: We used OpenAI's GPT-3.5 Turbo as a chat LLM, with both the map-reduce and refine methods for summarization.

# *Libraries*

In [1]:
!pip install -q chromadb openai langchain

In [4]:
!pip install sentence_transformers

In [11]:
import os

import PyPDF2
from tqdm import tqdm
from dotenv import load_dotenv

from langchain.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.chains.llm import LLMChain
from langchain.prompts import PromptTemplate
from langchain.chains import MapReduceDocumentsChain, ReduceDocumentsChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain
from langchain.chat_models import ChatOpenAI
from sentence_transformers import SentenceTransformer

import chromadb
from chromadb.utils import embedding_functions

# *Environment variables*

In [9]:
# Load environment variables from .env file
load_dotenv()

HUGGINGFACEHUB_API_TOKEN = os.getenv("HUGGINGFACEHUB_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

In [28]:
CHROMA_DATA_PATH = r"./embeddings_database"
DATA_DIR = r"./data"
COLLECTION_NAME = "LLM-Test-Instadeep"

EMBED_MODEL = "allenai/scibert_scivocab_uncased"

EMBED_FUNCTION = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name=EMBED_MODEL
)

# *I- Vector database*

In [29]:
client = chromadb.PersistentClient(path = CHROMA_DATA_PATH)

In [30]:
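# get_or_create_collection avoids an error when the notebook is re-run
# against a database that already contains this collection
collection = client.get_or_create_collection(
    name=COLLECTION_NAME,
    embedding_function=EMBED_FUNCTION,
    metadata={"hnsw:space": "cosine"},
)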

In [31]:
# Collect all PDF files within the directory and its subdirectories
pdf_files = []
for root, _, files in os.walk(DATA_DIR):
    for file in files:
        if file.endswith(".pdf"):
            pdf_files.append(os.path.join(root, file))

In [32]:
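loaded_documents = []
documents_metadata = []

for file in tqdm(pdf_files):
    with open(file, "rb") as f:
        pdf_reader = PyPDF2.PdfReader(f)
        meta = pdf_reader.metadata  # may be None if the PDF has no metadata

        # Concatenate the text of every page; guard against pages with no text
        text = ""
        for page in pdf_reader.pages:
            text += page.extract_text() or ""
        loaded_documents.append(text)

        # Fall back to an empty title when the PDF does not provide one
        title = meta.title if meta and meta.title else ""
        documents_metadata.append(title)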

100%|██████████| 9/9 [00:21<00:00,  2.38s/it]


In [33]:
collection.add(
    documents=loaded_documents,
    ids=[f"id{i}" for i in range(len(loaded_documents))],
    metadatas=[{"title": t} for t in documents_metadata],
)

# *II- Recommendation of 3 similar papers* 

## Method 1: ChromaDB similarity search

Like any database, ChromaDB can be queried directly, and metadata filters can be added to narrow the results. It also supports similarity search: the query text is embedded with the collection's embedding function, and the closest stored documents are returned.
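
To make the idea concrete, here is a minimal pure-Python sketch of a cosine-similarity top-k search. It is only an illustration: the function names and the toy 3-dimensional vectors are hypothetical, and it scans every vector exhaustively, whereas ChromaDB relies on an approximate HNSW index and reports cosine distance (1 minus the similarity) rather than the similarity itself.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return (doc_id, similarity) pairs for the k most similar documents."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings (real SciBERT embeddings have 768 dimensions)
toy_embeddings = {
    "id0": [0.9, 0.1, 0.0],
    "id1": [0.7, 0.3, 0.1],
    "id2": [0.0, 0.2, 0.9],
    "id3": [0.1, 0.9, 0.1],
}
toy_query = [1.0, 0.2, 0.0]

results = top_k(toy_query, toy_embeddings, k=3)
print(results)
```

The three returned ids play the same role as the three recommended papers in the query below.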

In [34]:
collection = client.get_collection(name=COLLECTION_NAME, embedding_function=EMBED_FUNCTION)

similar_papers = collection.query(
    query_texts=["""
mRNA vaccines are a type of vaccine that use a copy of a molecule called messenger RNA (mRNA) to produce an immune response.
The vaccine delivers molecules of antigen-encoding mRNA into immune cells, which use the designed mRNA as a blueprint to build foreign protein
that would normally be produced by a pathogen (such as a virus) or by a cancer cell. These protein molecules stimulate an adaptive immune response
that teaches the body to identify and destroy the corresponding pathogen or cancer cells. The mRNA is delivered by a co-formulation of the RNA encapsulated
in lipid nanoparticles that protect the RNA strands and help their absorption into the cells. mRNA vaccines have attracted considerable interest as COVID-19
vaccines. In December 2020, Pfizer–BioNTech and Moderna obtained authorization for their mRNA-based COVID-19 vaccines. The 2023 Nobel Prize in Physiology
or Medicine was awarded to Katalin Karikó and Drew Weissman for the development of effective mRNA vaccines against COVID-19.
"""],
    n_results=3,
    include=["documents", "metadatas"],
)

In [35]:
similar_papers["metadatas"][0][0]

{'title': 'mRNA vaccines — a new era in vaccinology'}

In [36]:
similar_papers["metadatas"][0][1]

{'title': 'mRNA—From COVID-19 Treatment to Cancer Immunotherapy'}

In [37]:
similar_papers["metadatas"][0][2]

{'title': 'Strategies for controlling the innate immune activity of conventional and self-amplifying mRNA therapeutics: Getting the message across'}

## Method 2: *LangChain retriever*

This method uses Chroma, the LangChain wrapper around ChromaDB, to connect the vector database to the framework. A retriever then fetches the documents and performs the similarity search.

In [38]:
client = chromadb.PersistentClient(CHROMA_DATA_PATH)

In [44]:
huggingface_embeddings = HuggingFaceEmbeddings(
    model_name="allenai/scibert_scivocab_uncased",
    model_kwargs={"device": "cpu"},
)

langchain_chroma = Chroma(
    client=client,
    collection_name=COLLECTION_NAME,
    embedding_function=huggingface_embeddings,
)

No sentence-transformers model found with name C:\Users\braha/.cache\torch\sentence_transformers\allenai_scibert_scivocab_uncased. Creating a new one with MEAN pooling.


In [45]:
print("There are", langchain_chroma._collection.count(), "documents in the collection")

There are 9 documents in the collection


In [46]:
retriever = langchain_chroma.as_retriever(search_kwargs={"k": 3}, search_type="similarity")

In [47]:
query = """
mRNA vaccines are a type of vaccine that use a copy of a molecule called messenger RNA (mRNA) to produce an immune response.
The vaccine delivers molecules of antigen-encoding mRNA into immune cells, which use the designed mRNA as a blueprint to build foreign protein
that would normally be produced by a pathogen (such as a virus) or by a cancer cell. These protein molecules stimulate an adaptive immune response
that teaches the body to identify and destroy the corresponding pathogen or cancer cells. The mRNA is delivered by a co-formulation of the RNA encapsulated
in lipid nanoparticles that protect the RNA strands and help their absorption into the cells. mRNA vaccines have attracted considerable interest as COVID-19
vaccines. In December 2020, Pfizer–BioNTech and Moderna obtained authorization for their mRNA-based COVID-19 vaccines. The 2023 Nobel Prize in Physiology
or Medicine was awarded to Katalin Karikó and Drew Weissman for the development of effective mRNA vaccines against COVID-19.
"""

In [48]:
docs = retriever.get_relevant_documents(query)

In [49]:
docs[0].metadata

{'title': 'mRNA vaccines — a new era in vaccinology'}

In [50]:
docs[1].metadata

{'title': 'mRNA—From COVID-19 Treatment to Cancer Immunotherapy'}

In [51]:
docs[2].metadata

{'title': 'Strategies for controlling the innate immune activity of conventional and self-amplifying mRNA therapeutics: Getting the message across'}

# *III- Summarizing a paper*

There are 3 methods we can use to summarize papers:
* Stuff: passes the entire document to the language model as context in a single prompt. This works well for short inputs, but it becomes impractical once the text no longer fits in the model's context window.

Since scientific papers are long documents, we cannot rely on the Stuff chain. Instead, we can use:
* Map-reduce: breaks a large document into smaller, manageable chunks and applies an initial prompt to each chunk to summarize that section (the map step). The partial summaries are then combined into a single final summary (the reduce step).

* Refine: applies an initial prompt to the first chunk to produce a summary, then passes that summary along with each subsequent chunk so that the language model refines it step by step.
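
The difference between the two control flows can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: the function names, the prompts, and the `fake_llm` stand-in are all hypothetical, and the real map-reduce chain may additionally collapse intermediate summaries when they grow too long.

```python
from typing import Callable, List


def map_reduce_summarize(chunks: List[str], llm: Callable[[str], str]) -> str:
    """Summarize each chunk independently, then merge the partial summaries."""
    if not chunks:
        return ""
    # Map step: one independent call per chunk (these could run in parallel)
    partial = [llm(f"Summarize:\n{chunk}") for chunk in chunks]
    # Reduce step: a final call that combines the partial summaries
    return llm("Combine these summaries:\n" + "\n".join(partial))


def refine_summarize(chunks: List[str], llm: Callable[[str], str]) -> str:
    """Summarize the first chunk, then update the summary chunk by chunk."""
    if not chunks:
        return ""
    summary = llm(f"Summarize:\n{chunks[0]}")
    # Each call needs the previous summary, so the calls are strictly sequential
    for chunk in chunks[1:]:
        summary = llm(f"Existing summary:\n{summary}\nRefine it with:\n{chunk}")
    return summary


# Hypothetical stand-in for an LLM: records every prompt and returns a tag
calls: List[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"<summary {len(calls)}>"


toy_chunks = ["chunk A", "chunk B", "chunk C"]

map_result = map_reduce_summarize(toy_chunks, fake_llm)
map_calls = len(calls)  # one call per chunk plus one reduce call

calls.clear()
refine_result = refine_summarize(toy_chunks, fake_llm)
refine_calls = len(calls)  # one call per chunk
```

With three chunks, map-reduce issues four calls (three of which are independent), while refine issues three calls that must run one after another.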

In [66]:
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-16k", streaming=True)

In [55]:
file = r"./data/raw-pdf_Efficacy and Safety of the mRNA-1273 SARS-CoV-2 Vaccine.pdf"

loader = PyPDFLoader(file)
docs = loader.load()

In [67]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=20)
chunks = text_splitter.split_documents(docs)
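
The roles of `chunk_size` and `chunk_overlap` can be illustrated with a simplified fixed-window splitter. This is only a sketch with a hypothetical function name: `RecursiveCharacterTextSplitter` additionally tries to split on separators such as paragraph and line breaks, so its chunk boundaries will generally differ from the ones produced here.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Toy example: consecutive chunks share exactly one character
toy_chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(toy_chunks)
```

The overlap keeps a little shared context between neighbouring chunks, so a sentence cut at a boundary is not entirely lost to either side.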

## *Method 1: Map-reduce*

In [68]:
chain = load_summarize_chain(
    llm,
    chain_type='map_reduce',
    verbose=False
)
summary = chain.run(chunks)

In [69]:
summary

'This article discusses the results of a phase 3 clinical trial for the mRNA-1273 vaccine, which showed an efficacy of 94.1% in preventing Covid-19 illness. The vaccine was generally safe, with only transient reactions reported. The study was funded by the Biomedical Advanced Research and Development Authority and the National Institute of Allergy and Infectious Diseases. The article also provides information on the trial design, participant eligibility criteria, and the blinding of data. It emphasizes the importance of vaccine development during a pandemic and the need for diverse clinical trial populations.'

## *Method 2: Refine*

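This method takes a long time to run because the chunks are processed sequentially: every LLM call depends on the summary produced by the previous one, so the calls cannot run in parallel.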

In [74]:
chain = load_summarize_chain(
    llm,
    chain_type='refine',
    verbose=False
)
summary = chain.run(chunks)

In [75]:
print(summary)

This article discusses the results of a phase 3 clinical trial for the mRNA-1273 vaccine, a lipid nanoparticle–encapsulated mRNA-based vaccine for preventing Covid-19. The trial involved 30,420 participants who were randomly assigned to receive either the vaccine or a placebo. The results showed that the vaccine had an efficacy of 94.1% in preventing Covid-19 illness, including severe cases. The vaccine was generally safe, with only transient local and systemic reactions reported as side effects. The trial assessed the efficacy of the mRNA-1273 vaccine in preventing symptomatic Covid-19 with onset at least 14 days after the second injection. The vaccine demonstrated consistent efficacy across various age groups and risk categories, including age groups of 18 to <65 years. The study was funded by the Biomedical Advanced Research and Development Authority and the National Institute of Allergy and Infectious Diseases. The safety analysis showed that the vaccine had more injection-site and

# **Evaluation**

In [80]:
!pip install rouge

Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


In [81]:
!pip install bert_score

In [82]:
from rouge import Rouge
from bert_score import score
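
As a rough guide to what a ROUGE score measures, here is a simplified pure-Python sketch of ROUGE-1 (unigram overlap). It is not the implementation used by the `rouge` package: the function name, the whitespace tokenization, and the two example sentences are assumptions made for illustration, and no reference summary from this notebook is used.

```python
from collections import Counter


def rouge_1(candidate: str, reference: str) -> dict:
    """Unigram-overlap precision, recall and F1 between two texts."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((cand_counts & ref_counts).values())

    cand_total = sum(cand_counts.values())
    ref_total = sum(ref_counts.values())
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f1}


# Hypothetical candidate and reference summaries
toy_scores = rouge_1(
    "the vaccine was safe",
    "the vaccine was safe and effective",
)
print(toy_scores)
```

Precision rewards summaries that avoid words absent from the reference, while recall rewards covering the reference; BERTScore follows the same precision/recall idea but compares contextual embeddings instead of exact words.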