# **Instadeep technical test**

This notebook demonstrates several ways to embed data, run similarity search, and summarize documents using LangChain with Hugging Face or OpenAI models.

* Vector database: We used ChromaDB as the database for embeddings.
* Embedding data: We used Hugging Face's Sentence Transformers (OpenAI's embedding models would also work).
* Recommendation of 3 papers: We used both ChromaDB's similarity search and LangChain's retrievers.
* Paper summarization: We used OpenAI's GPT-3.5 Turbo as the chat LLM, with both the map-reduce and refine summarization methods.

## **Libraries**

In [11]:
import os

import PyPDF2
from tqdm import tqdm
from dotenv import load_dotenv

from langchain.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.chains.llm import LLMChain
from langchain.prompts import PromptTemplate
from langchain.chains import MapReduceDocumentsChain, ReduceDocumentsChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain
from langchain.chat_models import ChatOpenAI
from sentence_transformers import SentenceTransformer

import chromadb
from chromadb.utils import embedding_functions

## **Environment variables**

In [9]:
# Load environment variables from .env file
load_dotenv()

HUGGINGFACEHUB_API_TOKEN = os.getenv("HUGGINGFACEHUB_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
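
`os.getenv` returns `None` when a variable is missing, so a forgotten key only surfaces later, when a model is first called. A minimal sketch of a fail-fast check is shown below. `require_env` and the `DEMO_API_KEY` variable are hypothetical names used for illustration; they are not part of `python-dotenv` or this notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; add it to your .env file."
        )
    return value


# Throwaway variable, set here only so the example runs on its own
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))
```

In the notebook, the same check could wrap the lookups above, for example `require_env("OPENAI_API_KEY")`.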

In [28]:
CHROMA_DATA_PATH = r"./embeddings_database"
DATA_DIR = r"./data"
COLLECTION_NAME = "LLM-Test-Instadeep"

EMBED_MODEL = "allenai/scibert_scivocab_uncased"

EMBED_FUNCTION = embedding_functions.SentenceTransformerEmbeddingFunction(
     model_name=EMBED_MODEL
 )

# **Vector database**

In [29]:
client = chromadb.PersistentClient(path = CHROMA_DATA_PATH)

In [30]:
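# get_or_create_collection avoids an error when the notebook is re-run
# against an existing persistent database
collection = client.get_or_create_collection(
    name=COLLECTION_NAME,
    embedding_function=EMBED_FUNCTION,
    metadata={"hnsw:space": "cosine"},  # use cosine distance for the HNSW index
)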

In [31]:
# Collect all PDF files within the directory and its subdirectories
pdf_files = []
for root, _, files in os.walk(DATA_DIR):
    for file in files:
        if file.endswith(".pdf"):
            pdf_files.append(os.path.join(root, file))

In [32]:
loaded_documents = []
documents_metadata = []

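for file in tqdm(pdf_files):
    with open(file, "rb") as f:
        pdf_reader = PyPDF2.PdfReader(f)
        meta = pdf_reader.metadata

        # Concatenate the text of every page; fall back to "" if a page yields no text
        text = "".join(page.extract_text() or "" for page in pdf_reader.pages)
        loaded_documents.append(text)

        # The metadata object can be None and the title field is often missing
        title = meta.title if meta and meta.title else ""
        documents_metadata.append(title)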

100%|██████████| 9/9 [00:21<00:00,  2.38s/it]


In [33]:
collection.add(
     documents=loaded_documents,
     ids=[f"id{i}" for i in range(len(loaded_documents))],
     metadatas=[{"title": t} for t in documents_metadata]
 )

# **Papers Recommendation**

## **Method 1**: ChromaDB similarity search

Like any database, ChromaDB can be queried directly, and metadata filters can be added to narrow the results. It also supports similarity search: the query text is embedded and the closest stored embeddings are returned.

In [34]:
collection = client.get_collection(name=COLLECTION_NAME, embedding_function=EMBED_FUNCTION)

similar_papers = collection.query(
    query_texts=["""
mRNA vaccines are a type of vaccine that use a copy of a molecule called messenger RNA (mRNA) to produce an immune response.
The vaccine delivers molecules of antigen-encoding mRNA into immune cells, which use the designed mRNA as a blueprint to build foreign protein
that would normally be produced by a pathogen (such as a virus) or by a cancer cell. These protein molecules stimulate an adaptive immune response
that teaches the body to identify and destroy the corresponding pathogen or cancer cells. The mRNA is delivered by a co-formulation of the RNA encapsulated
in lipid nanoparticles that protect the RNA strands and help their absorption into the cells. mRNA vaccines have attracted considerable interest as COVID-19
vaccines. In December 2020, Pfizer–BioNTech and Moderna obtained authorization for their mRNA-based COVID-19 vaccines. The 2023 Nobel Prize in Physiology
or Medicine was awarded to Katalin Karikó and Drew Weissman for the development of effective mRNA vaccines against COVID-19.
"""],
    n_results=3,
    include=["documents", "metadatas"],
)
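
To make the ranking concrete, here is a minimal sketch of what a cosine-distance similarity search computes. It is an exact brute-force search over made-up 3-dimensional vectors, whereas ChromaDB uses an approximate HNSW index over the real embeddings. The function names and toy data are assumptions for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the ids of the k documents closest to the query, best first."""
    ranked = sorted(
        doc_vecs.items(),
        # Cosine distance = 1 - cosine similarity; smaller means closer
        key=lambda item: 1.0 - cosine_similarity(query_vec, item[1]),
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy "embeddings" (made up for illustration)
toy_docs = {
    "id0": [1.0, 0.0, 0.0],
    "id1": [0.9, 0.1, 0.0],
    "id2": [0.0, 1.0, 0.0],
    "id3": [0.0, 0.0, 1.0],
}
print(top_k([1.0, 0.05, 0.0], toy_docs, k=2))
```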

## **Method 2**: LangChain retriever

This method uses Chroma, LangChain's wrapper around ChromaDB, to connect the vector database to the framework, then uses a retriever to fetch documents through a similarity search.

In [38]:
client = chromadb.PersistentClient(CHROMA_DATA_PATH)

In [None]:
# Reuse EMBED_MODEL so queries are embedded with the same model as the stored documents
huggingface_embeddings = HuggingFaceEmbeddings(
    model_name=EMBED_MODEL,
    model_kwargs={"device": "cpu"},
)

langchain_chroma = Chroma(
    client=client,
    collection_name=COLLECTION_NAME,
    embedding_function=huggingface_embeddings,
)

In [45]:
print("There are", langchain_chroma._collection.count(), "documents in the collection")

There are 9 documents in the collection


In [46]:
retriever = langchain_chroma.as_retriever(search_kwargs={"k": 3}, search_type="similarity")

In [47]:
query = """
mRNA vaccines are a type of vaccine that use a copy of a molecule called messenger RNA (mRNA) to produce an immune response.
The vaccine delivers molecules of antigen-encoding mRNA into immune cells, which use the designed mRNA as a blueprint to build foreign protein
that would normally be produced by a pathogen (such as a virus) or by a cancer cell. These protein molecules stimulate an adaptive immune response
that teaches the body to identify and destroy the corresponding pathogen or cancer cells. The mRNA is delivered by a co-formulation of the RNA encapsulated
in lipid nanoparticles that protect the RNA strands and help their absorption into the cells. mRNA vaccines have attracted considerable interest as COVID-19
vaccines. In December 2020, Pfizer–BioNTech and Moderna obtained authorization for their mRNA-based COVID-19 vaccines. The 2023 Nobel Prize in Physiology
or Medicine was awarded to Katalin Karikó and Drew Weissman for the development of effective mRNA vaccines against COVID-19.
"""

In [48]:
docs = retriever.get_relevant_documents(query)

# **Paper Summarization**

There are three methods we can use to summarize papers:
* Stuff: the simplest approach, known as stuffing. The prompt passes all of the text to the language model as context in a single call. This works well for small inputs but becomes impractical once the text no longer fits in the model's context window.

Since scientific papers are long documents, we cannot use the Stuff chain. Instead, we can use:
* Map-reduce: splits a large document into smaller, manageable chunks. An initial prompt is applied to each chunk to summarize that section (the map step), and the partial summaries are then combined into a single final summary (the reduce step).

* Refine: runs an initial prompt on the first chunk to produce a summary, then passes that summary along with each subsequent chunk so the language model can refine it step by step.
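
The control flow of the two chunk-based strategies can be sketched without any API calls. The code below is a simplified illustration, not LangChain's implementation: `CountingLLM` is a hypothetical stand-in for a chat model that "summarizes" by keeping the first five words, and the real chains use prompt templates and may collapse partial summaries in several passes.

```python
class CountingLLM:
    """Hypothetical stand-in for a chat model; counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        # Fake "summary": keep only the first five words of the prompt
        return " ".join(prompt.split()[:5])


def map_reduce(chunks, llm):
    # Map: summarize each chunk independently (these calls could run in parallel)
    partial_summaries = [llm(chunk) for chunk in chunks]
    # Reduce: summarize the concatenated partial summaries
    return llm(" ".join(partial_summaries))


def refine(chunks, llm):
    # Start from a summary of the first chunk...
    summary = llm(chunks[0])
    # ...then update it with each later chunk; every call needs the previous result
    for chunk in chunks[1:]:
        summary = llm(summary + " " + chunk)
    return summary


toy_chunks = ["a b c d e f g", "h i j k l m", "n o p q r s"]

mr_llm = CountingLLM()
print(map_reduce(toy_chunks, mr_llm), "| calls:", mr_llm.calls)

rf_llm = CountingLLM()
print(refine(toy_chunks, rf_llm), "| calls:", rf_llm.calls)
```

In this sketch, map-reduce makes one call per chunk plus one reduce call, while refine makes one call per chunk but each call has to wait for the previous one.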

In [66]:
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-16k", streaming=True)

In [55]:
file = r"./data/raw-pdf_Efficacy and Safety of the mRNA-1273 SARS-CoV-2 Vaccine.pdf"

loader = PyPDFLoader(file)
docs = loader.load()

In [67]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=20)
chunks = text_splitter.split_documents(docs)
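
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a plain sliding window over characters. This is a simplified sketch under that assumption: `RecursiveCharacterTextSplitter` additionally tries to split on separators such as paragraph breaks, newlines, and spaces before falling back to raw characters. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Each chunk repeats the last character of the previous one
print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
```

The overlap keeps a little shared context across chunk boundaries, so a sentence cut at a boundary is not entirely lost to either chunk.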

## **Method 1**: Map-reduce

In [68]:
chain = load_summarize_chain(
    llm,
    chain_type='map_reduce',
    verbose=False
)
summary = chain.run(chunks)

In [69]:
summary

'This article discusses the results of a phase 3 clinical trial for the mRNA-1273 vaccine, which showed an efficacy of 94.1% in preventing Covid-19 illness. The vaccine was generally safe, with only transient reactions reported. The study was funded by the Biomedical Advanced Research and Development Authority and the National Institute of Allergy and Infectious Diseases. The article also provides information on the trial design, participant eligibility criteria, and the blinding of data. It emphasizes the importance of vaccine development during a pandemic and the need for diverse clinical trial populations.'

## **Method 2**: Refine

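This method takes a long time to run because the chunks are processed sequentially: each call depends on the summary produced by the previous one, so the calls cannot run in parallel.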

In [74]:
chain = load_summarize_chain(
    llm,
    chain_type='refine',
    verbose=False
)
summary = chain.run(chunks)

In [75]:
print(summary)

This article discusses the results of a phase 3 clinical trial for the mRNA-1273 vaccine, a lipid nanoparticle–encapsulated mRNA-based vaccine for preventing Covid-19. The trial involved 30,420 participants who were randomly assigned to receive either the vaccine or a placebo. The results showed that the vaccine had an efficacy of 94.1% in preventing Covid-19 illness, including severe cases. The vaccine was generally safe, with only transient local and systemic reactions reported as side effects. The trial assessed the efficacy of the mRNA-1273 vaccine in preventing symptomatic Covid-19 with onset at least 14 days after the second injection. The vaccine demonstrated consistent efficacy across various age groups and risk categories, including age groups of 18 to <65 years. The study was funded by the Biomedical Advanced Research and Development Authority and the National Institute of Allergy and Infectious Diseases. The safety analysis showed that the vaccine had more injection-site and

# **Evaluation**

In [80]:
from rouge_score import rouge_scorer
from bert_score import score
import json

In [72]:
# Load existing summaries from the JSON file
with open('summ.json', 'r') as json_file:
    existing_summaries = json.load(json_file)

In [88]:
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-16k", streaming=False)

# Iterate through keys (file titles) in the JSON file
for file_title in existing_summaries.keys():

    # Load the original PDF file based on the file title
    file = "./data/" + file_title

    loader = PyPDFLoader(file)
    docs = loader.load()

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=20)
    chunks = text_splitter.split_documents(docs)

    chain = load_summarize_chain(
        llm,
        chain_type='map_reduce',
        verbose=False,
    )

    generated_summary = chain.run(chunks)

    # Get the expected summary for the original PDF from the JSON file
    expected_summary = existing_summaries[file_title]

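    # Calculate ROUGE score
    # Note: RougeScorer.score expects (target, prediction). Because the generated
    # summary is passed first here, the reported precision and recall are swapped
    # relative to the usual convention; the F-measure is unaffected.
    rouge_scorer_instance = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    rouge_scores = rouge_scorer_instance.score(generated_summary, expected_summary)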

    print(f"ROUGE scores for '{file_title}':")
    print("ROUGE-1:", rouge_scores['rouge1'])
    print("ROUGE-2:", rouge_scores['rouge2'])
    print("ROUGE-L:", rouge_scores['rougeL'])

    # Calculate BERTScore
    _, _, bert_scores = score([generated_summary], [expected_summary], lang='en', verbose=True)
    
    print(f"\nBERTScore for '{file_title}':")
    print(bert_scores)

ROUGE scores for 'raw-pdf_biomedicines-11-00308-v2.pdf':
ROUGE-1: Score(precision=0.5934065934065934, recall=0.5684210526315789, fmeasure=0.5806451612903225)
ROUGE-2: Score(precision=0.2111111111111111, recall=0.20212765957446807, fmeasure=0.20652173913043478)
ROUGE-L: Score(precision=0.38461538461538464, recall=0.3684210526315789, fmeasure=0.3763440860215054)


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.


100%|██████████| 1/1 [00:00<00:00,  2.01it/s]


computing greedy matching.


100%|██████████| 1/1 [00:00<00:00, 259.68it/s]


done in 0.51 seconds, 1.98 sentences/sec

BERTScore for 'raw-pdf_biomedicines-11-00308-v2.pdf':
tensor([0.9139])
ROUGE scores for 'raw-pdf_82_2020_217.pdf':
ROUGE-1: Score(precision=0.5609756097560976, recall=0.5, fmeasure=0.5287356321839081)
ROUGE-2: Score(precision=0.2962962962962963, recall=0.26373626373626374, fmeasure=0.2790697674418605)
ROUGE-L: Score(precision=0.3902439024390244, recall=0.34782608695652173, fmeasure=0.36781609195402304)


BERTScore for 'raw-pdf_82_2020_217.pdf':
tensor([0.9202])
ROUGE scores for 'raw-pdf_Efficacy and Safety of the mRNA-1273 SARS-CoV-2 Vaccine.pdf':
ROUGE-1: Score(precision=0.47413793103448276, recall=0.5670103092783505, fmeasure=0.516431924882629)
ROUGE-2: Score(precision=0.25217391304347825, recall=0.3020833333333333, fmeasure=0.2748815165876777)
ROUGE-L: Score(precision=0.33620689655172414, recall=0.4020618556701031, fmeasure=0.36619718309859156)


BERTScore for 'raw-pdf_Efficacy and Safety of the mRNA-1273 SARS-CoV-2 Vaccine.pdf':
tensor([0.8910])
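
To help read the ROUGE numbers above, here is a minimal sketch of how ROUGE-1 precision, recall, and F-measure follow from unigram overlap. It is a simplified illustration, not the `rouge_score` implementation: it splits on whitespace and does no stemming, and the two example sentences are made up. It also shows that swapping the reference and the candidate swaps precision and recall while leaving the F-measure unchanged.

```python
from collections import Counter


def rouge1(reference: str, candidate: str):
    """Unigram-overlap ROUGE-1 as (precision, recall, f1); no stemming."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)  # share of candidate words matched
    recall = overlap / max(sum(ref_counts.values()), 1)  # share of reference words matched
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up sentences for illustration
reference = "the vaccine showed high efficacy"
candidate = "the vaccine was safe"

print(rouge1(reference, candidate))
print(rouge1(candidate, reference))  # arguments swapped
```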



