## Clone the private repo

Please check the README file before executing this section.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!mkdir -p /root/.ssh/

In [None]:
!cp /content/drive/MyDrive/deploy_keys/id_ed25519* /root/.ssh/

In [None]:
!ssh-keyscan github.com >> /root/.ssh/known_hosts

# github.com:22 SSH-2.0-babeld-7ce31352
# github.com:22 SSH-2.0-babeld-7ce31352
# github.com:22 SSH-2.0-babeld-7ce31352
# github.com:22 SSH-2.0-babeld-7ce31352
# github.com:22 SSH-2.0-babeld-7ce31352


In [None]:
!ssh -T git@github.com

Hi helmi0695/rag_paragraph_search_and_paper_summarisation! You've successfully authenticated, but GitHub does not provide shell access.


In [None]:
!git clone git@github.com:helmi0695/rag_paragraph_search_and_paper_summarisation.git

Cloning into 'rag_paragraph_search_and_paper_summarisation'...
remote: Enumerating objects: 55, done.
remote: Counting objects: 100% (55/55), done.
remote: Compressing objects: 100% (42/42), done.
remote: Total 55 (delta 12), reused 50 (delta 7), pack-reused 0
Receiving objects: 100% (55/55), 252.36 KiB | 884.00 KiB/s, done.
Resolving deltas: 100% (12/12), done.


In [None]:
!ls

drive  rag_paragraph_search_and_paper_summarisation  sample_data


In [None]:
%cd /content/rag_paragraph_search_and_paper_summarisation

/content/rag_paragraph_search_and_paper_summarisation


In [None]:
!ls

__init__.py  notebooks	README.md  ressources  src


In [None]:
!git pull

Already up to date.


# LLAMA2 7B Summarization and Document search

In this notebook we'll explore how to use the open-source **Llama-2-7b-chat** model with Hugging Face and LangChain.


To access Llama 2 models, one must first request access via [this form](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) (access is typically granted instantly).

We start by doing a `pip install` of all required libraries.

Note: in Google Colab, messages emitted through the `logging` module are not displayed in the cell output by default, so this notebook uses `print` for the important information.
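
If you would rather keep using `logging`, one option is to attach a handler that writes to standard output. The sketch below is an assumption about how this could be set up, not part of the original notebook; `get_notebook_logger` and the logger name `rag_notebook` are hypothetical.

```python
import logging
import sys


def get_notebook_logger(name="rag_notebook", stream=None, level=logging.INFO):
    """Return a logger whose records are written to `stream` (stdout by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False  # do not also send records to the root logger
    # Drop existing handlers so re-running the cell does not duplicate lines
    for old_handler in list(logger.handlers):
        logger.removeHandler(old_handler)
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(logging.Formatter("%(levelname)s | %(message)s"))
    logger.addHandler(handler)
    return logger


logger = get_notebook_logger()
logger.info("Logging to the cell output")
```

Because the handler targets `sys.stdout`, the message should appear in the cell output in the same place a `print` would.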

In [None]:
!pip install -qU \
    transformers==4.31.0 \
    sentence-transformers==2.2.2 \
    pinecone-client==2.2.2 \
    datasets==2.14.0 \
    accelerate==0.21.0 \
    einops==0.6.1 \
    langchain==0.0.240 \
    xformers==0.0.20 \
    bitsandbytes==0.41.0


## Creating the Summarization pipeline

### Initializing the Hugging Face Pipeline for summarization

The first thing we need to do is initialize a `text-generation` pipeline with Hugging Face transformers. The pipeline requires two things that we must initialize first:

* An LLM, in this case `meta-llama/Llama-2-7b-chat-hf`.

* The respective tokenizer for the model.

We initialize the model and move it to our CUDA-enabled GPU. On Colab, downloading and initializing the model can take 5-10 minutes.

In [None]:
import json

# Specify the path to settings.local.json file
settings_file_path = '/content/rag_paragraph_search_and_paper_summarisation/settings.local.json'

# Read JSON data from the file
with open(settings_file_path, 'r') as file:
    settings = json.load(file)

In [None]:
import transformers
from torch import cuda, bfloat16


In [None]:
model_id = 'meta-llama/Llama-2-7b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# begin initializing HF items, need auth token for these
hf_auth = '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'
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)
model.eval()
print(f"Model loaded on {device}")

Model loaded on cuda:0
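
The 4-bit quantization configured above trades numerical precision for GPU memory. As a rough back-of-the-envelope sketch, the snippet below estimates the memory taken by the weights alone at different precisions. It assumes "7B" means about 7 billion parameters and ignores activations, the KV cache, quantization constants, and any layers kept in higher precision, so real usage will be higher. `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory needed to store the weights, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 7e9  # assumption: "7B" taken as 7 billion parameters

for label, bits in [("float16", 16), ("8-bit", 8), ("4-bit (nf4)", 4)]:
    print(f"{label:>12}: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions the 16-bit weights need about 13 GiB, which is in line with the two weight shards downloaded for this model (9.98 GB and 3.50 GB), while 4-bit storage needs roughly a quarter of that.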


The pipeline requires a tokenizer which handles the translation of human readable plaintext to LLM readable token IDs. The Llama 2 7B models were trained using the Llama 2 7B tokenizer, which we initialize like so:

In [None]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

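
To make the idea of translating plaintext to token IDs concrete, here is a deliberately simplified sketch. It is not how the Llama 2 tokenizer works (that one is a SentencePiece-based subword tokenizer); `ToyTokenizer` is a hypothetical whitespace-level stand-in that only illustrates the encode/decode round trip and the handling of unknown words.

```python
class ToyTokenizer:
    """Whitespace tokenizer with a fixed vocabulary; ID 0 is reserved for unknown words."""

    def __init__(self, corpus):
        words = sorted({w for text in corpus for w in text.lower().split()})
        self.token_to_id = {"<unk>": 0}
        for word in words:
            self.token_to_id[word] = len(self.token_to_id)
        self.id_to_token = {i: t for t, i in self.token_to_id.items()}

    def encode(self, text):
        """Map each word to its ID, falling back to the <unk> ID."""
        return [self.token_to_id.get(w, 0) for w in text.lower().split()]

    def decode(self, ids):
        """Map IDs back to words and join them with spaces."""
        return " ".join(self.id_to_token[i] for i in ids)


toy = ToyTokenizer(["vaccines are nice"])
ids = toy.encode("vaccines are great")
print(ids)              # [3, 1, 0]
print(toy.decode(ids))  # vaccines are <unk>
```

A subword tokenizer avoids most `<unk>` cases by splitting rare words into smaller known pieces, which is one reason real LLM tokenizers do not work at the word level.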

Now we're ready to initialize the HF pipeline. There are a few additional parameters that we must define here. Comments explaining these have been included in the code.

In [None]:
generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    temperature=0.0,  # 'randomness' of outputs, 0.0 is the min and 1.0 the max
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)
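
To give some intuition for `temperature` and `repetition_penalty`, the following toy sketch applies both to a small list of logits in plain Python. It is an illustration under assumptions, not the transformers implementation: the helper names are hypothetical, and the penalty rule shown (divide positive logits, multiply negative ones) is the commonly described scheme for this parameter.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; the limit towards 0 is greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Make already generated tokens less likely by shrinking their logits."""
    out = list(logits)
    for token_id in set(seen_token_ids):
        if out[token_id] > 0:
            out[token_id] = out[token_id] / penalty
        else:
            out[token_id] = out[token_id] * penalty
    return out


logits = [1.0, 2.0, 3.0]
print(softmax_with_temperature(logits, 1.0))
print(softmax_with_temperature(logits, 0.1))  # almost all mass on the last token
print(apply_repetition_penalty(logits, seen_token_ids=[2], penalty=1.1))
```

As the temperature approaches zero the distribution collapses onto the highest-scoring token, which is why very low values give near-deterministic output.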

Confirm this is working:

In [None]:
res = generate_text("What's the best vaccine against covid?")
print(res[0]["generated_text"])

What's the best vaccine against covid?
 nobody knows.

The COVID-19 pandemic has highlighted the importance of vaccination in preventing the spread of infectious diseases, but there is still much to be learned about the most effective ways to protect against COVID-19. While several vaccines have been developed and are being distributed around the world, it is important to recognize that no single vaccine will provide complete protection against COVID-19.

One of the biggest challenges in developing an effective COVID-19 vaccine is the incredible diversity of the virus itself. COVID-19 is caused by a coronavirus, which means that it can mutate quickly and easily, leading to new strains of the virus that may not be well-suited to existing vaccines. As a result, researchers are working on multiple fronts to develop vaccines that can provide broad protection against COVID-19, including:

1. mRNA vaccines: These vaccines use a piece of genetic material called messenger RNA (mRNA) to instruc

Now to implement this in LangChain:

In [None]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)

In [None]:
llm(prompt="What's the best vaccine against covid?")

'\n nobody knows.\n\nThe COVID-19 pandemic has highlighted the importance of vaccination in preventing the spread of infectious diseases, but there is still much to be learned about the most effective ways to protect against COVID-19. While several vaccines have been developed and are being distributed around the world, it is important to recognize that no single vaccine will provide complete protection against COVID-19.\n\nOne of the biggest challenges in developing an effective COVID-19 vaccine is the incredible diversity of the virus itself. COVID-19 is caused by a coronavirus, which means that it can mutate quickly and easily, leading to new strains of the virus that may not be well-suited to existing vaccines. As a result, researchers are working on multiple fronts to develop vaccines that can provide broad protection against COVID-19, including:\n\n1. mRNA vaccines: These vaccines use a piece of genetic material called messenger RNA (mRNA) to instruct cells in the body to produce

We still get the same output as we're not really doing anything differently here, but we have now added **Llama 2 7B Chat** to the LangChain library. Using this we can now begin using LangChain's chains.

### Creating a summarisation chain

In [None]:
import textwrap
from langchain import PromptTemplate,  LLMChain

In [None]:
chunk_1 = '''ABSTRACT
 mRNA vaccines have become a versatile technology for the prevention of infectious diseases and the treatment of cancers. In the vaccination process, mRNA formulation and delivery strategies facilitate effective expression and presentation of antigens, and immune stimulation. mRNA vaccines have been delivered in various formats: encapsulation by delivery carriers, such as lipid nanoparticles, polymers, peptides, free mRNA in solution, and ex vivo through dendritic cells. Appropriate delivery materials and formulation methods often boost the vaccine efficacy which is also influenced by the selection of a proper
'''
chunk_2 = '''TITLE PARAGRAPH: Introduction
Since the first use of in vitro transcribed messenger RNA (mRNA) to express an exogenous protein in mice in 1990
Several features of in vitro transcribed mRNA contribute to its vaccine potential. First, the development process of an mRNA vaccine can be much faster than conventional protein vaccines
The mRNAs used as vaccines can be categorized into conventional mRNAs and self-amplifying mRNAs. Conventional mRNAs are similar to endogenous mRNAs in mammalian cells, consisting of a 5' cap, 5' UTR, coding region, 3' UTR, and a polyadenylated tail
Three major types of proteins are encoded by mRNA vaccines: antigens
Advances in recent years made mRNA a promising vaccine platform. For example, chemical modifications of RNA using nucleotide analogs, such as pseudouridine, dramatically increased protein production in vivo by diminishing the translation inhibition triggered by the unmodified nucleotides
In this chapter, we summarize the routes of administrations for mRNA vaccines, discuss mRNA delivery carriers and their corresponding formulation methods, and overview the challenges and future development of mRNA vaccines. A comprehensive overview of recent advances in mRNA vaccine delivery may facilitate the future development of novel delivery strategies and effective mRNA vaccines.
'''

In [None]:
chunk_list = [chunk_1, chunk_2]

In [None]:
def generate_summary(text, llm, how="chunk"):
    """
    Summarize text with the given LLM.
    The text can come in 3 different formats:
        - chunk: a single paragraph
        - list : a list of paragraphs
        - full : a full document - not recommended for large documents that do not fit into memory
    Input: text, llm, how: ("chunk", "list", "full")
    Output: summary of text
    """
    # Defining the template to generate summary
    template = """
    Write a concise summary of the text, return your responses with 2-3 sentences that cover the key points of the text.
    ```{text}```
    SUMMARY:
    """
    if how == "list":
        template = """
        Write a concise summary based on the list of texts provided, return a coherent summary that covers the key points of the text.
        ```{text}```
        SUMMARY:
        """
    elif how == "full":
        template = """
        Write a concise summary of the text, return your responses with 5 paragraphs that cover the key points of the text.
        ```{text}```
        SUMMARY:
        """
    prompt = PromptTemplate(template=template, input_variables=["text"])
    llm_chain = LLMChain(prompt=prompt, llm=llm)
    summary = llm_chain.run(text)
    return summary
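
One detail worth seeing in isolation is what happens to `{text}` when a list is passed. The sketch below uses plain `str.format` as a stand-in, on the assumption that the default template format fills placeholders in essentially the same way; the template is shortened and the helper names are hypothetical.

```python
# Shortened stand-in for the summary templates used in generate_summary
TEMPLATE = "Write a concise summary of the text.\nTEXT:\n{text}\nSUMMARY:"


def render_prompt(text):
    """Fill the placeholder; non-string values are converted with str()."""
    return TEMPLATE.format(text=text)


def render_list_prompt(texts, separator="\n\n"):
    """Join the paragraphs first so the prompt contains clean text, not a list repr."""
    return TEMPLATE.format(text=separator.join(texts))


paragraphs = ["mRNA vaccines are versatile.", "Delivery carriers matter."]
print(render_prompt(paragraphs))       # contains "['mRNA vaccines ...', '...']"
print(render_list_prompt(paragraphs))  # contains the two paragraphs as plain text
```

Passing the list directly works, but the model then sees brackets and quote characters from the list representation; joining the paragraphs explicitly may give a cleaner prompt.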

In [None]:
generate_summary(chunk_1, llm, how="chunk")

' This text discusses the use of mRNA vaccines for disease prevention and cancer treatment. The article highlights the importance of mRNA formulation and delivery strategies in facilitating effective antigen expression and immune stimulation. Various delivery formats, including encapsulation by delivery carriers, lipid nanoparticles, polymers, peptides, free mRNA in solution, and ex vivo through dendritic cells, are discussed. The choice of delivery material and formulation method can significantly impact vaccine efficacy.'

In [None]:
generate_summary(chunk_list, llm, how="list")

'\n        * mRNA vaccines have become a versatile technology for disease prevention and cancer treatment.\n        * mRNA vaccines can be delivered in various formats, including encapsulation by delivery carriers, such as lipid nanoparticles, polymers, peptides, and free mRNA in solution.\n        * Ex vivo delivery through dendritic cells is another option.\n        * The choice of delivery material and formulation method can significantly impact vaccine efficacy.\n        * mRNA vaccines have the potential to be developed more quickly than conventional protein vaccines.\n        * Self-amplifying mRNAs and conventional mRNAs are two categories of mRNAs used as vaccines.\n        * Antigens, proteins, and other molecules can be encoded by mRNA vaccines.\n        * Recent advances in mRNA modification have improved protein production in vivo.\n        * The challenges and future developments of mRNA vaccines include the need for further research on delivery strategies and the developm

## Retrieving documents from a vectorstore

### Initializing the Hugging Face Embedding Pipeline

We begin by initializing the embedding pipeline that will handle the transformation of our docs into vector embeddings. We will use the sentence-transformers/all-MiniLM-L6-v2 model for embedding.

In [None]:
from torch import cuda
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

embed_model = HuggingFaceEmbeddings(
    model_name=embed_model_id,
    model_kwargs={'device': device},
    encode_kwargs={'device': device, 'batch_size': 32}
)

We can use the embedding model to create document embeddings like so:



In [None]:
docs = [
    "Vaccines are nice",
    "vaccines are the best"
]

embeddings = embed_model.embed_documents(docs)

print(f"We have {len(embeddings)} doc embeddings, each with "
      f"a dimensionality of {len(embeddings[0])}.")

We have 2 doc embeddings, each with a dimensionality of 384.
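
Embeddings like these are typically compared with cosine similarity, which measures the angle between two vectors regardless of their length. The following is a minimal pure-Python sketch of that measure; `cosine_similarity` is a hypothetical helper and the short vectors are made up for illustration (real embeddings here have 384 dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0])) # -1.0 (opposite)
```

Values close to 1 indicate vectors pointing in nearly the same direction, which for sentence embeddings usually corresponds to similar meaning.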


### Building the Vector Index

We now need to use the embedding pipeline to build our embeddings and store them in a Pinecone vector index. To begin, we'll initialize our index; for this we'll need a [free Pinecone API key](https://app.pinecone.io/).

In [None]:
import os
import pinecone

pinecone_api_key = settings['pinecone_settings']['api_key']
pinecone_environment = settings['pinecone_settings']['environment']
pinecone_index_name = settings['pinecone_settings']['index_name']

# get API key from app.pinecone.io and environment from console
pinecone.init(
    api_key=os.environ.get('PINECONE_API_KEY') or pinecone_api_key,
    environment=os.environ.get('PINECONE_ENVIRONMENT') or pinecone_environment
)

In [None]:
# Index initialisation
import time

index_name = pinecone_index_name

if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=len(embeddings[0]),
        metric='cosine'
    )
    # wait for index to finish initialization
    while not pinecone.describe_index(index_name).status['ready']:
        time.sleep(1)
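
The `while` loop above waits indefinitely if the index never becomes ready. A possible variation, sketched here with a hypothetical `wait_until` helper that is not part of the Pinecone client, is to poll with a timeout so the cell fails visibly instead of hanging.

```python
import time


def wait_until(predicate, timeout_s=300.0, interval_s=1.0):
    """Poll `predicate` until it returns True or `timeout_s` seconds have passed.

    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Stand-in for a readiness check that succeeds on the third poll
states = iter([False, False, True])
print(wait_until(lambda: next(states), timeout_s=5, interval_s=0.01))  # True
```

In the notebook, the predicate could be `lambda: pinecone.describe_index(index_name).status['ready']`, with the timeout chosen to suit how long index creation usually takes.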

In [None]:
# connect to the index:

index = pinecone.Index(index_name)
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.00286,
 'namespaces': {'': {'vector_count': 286}},
 'total_vector_count': 286}


With our index and embedding process ready, we can move on to the indexing process itself.

In [None]:
import re
import os
import glob
import pandas as pd

# Define the folder path
input_documents_data_path = settings['data_paths']['inputs']['documents_folder_path']

# Get a list of all .txt files in the folder
txt_files = glob.glob(os.path.join(input_documents_data_path, '*.txt'))

In [None]:
# Initialize an empty list to store data
data_content = []

# Loop through each file, read its content, and append to the list
for doc_id, txt_file in enumerate(txt_files):
    try:
        file_path = txt_file  # glob already returns the path including the folder
        print(f'Importing {file_path}')
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()

            # Split content into documents based on "----"
            documents = re.split(r'----', content)
            file_name = os.path.basename(txt_file)

            # Process each document
            for chunk_id, document in enumerate(documents):
                # Extract chunks based on "TITLE PARAGRAPH:"
                chunks = re.split(r'TITLE PARAGRAPH:', document)

                # Process each chunk
                for sub_chunk_id, chunk in enumerate(chunks):
                    # Skip empty chunks
                    if not chunk.strip():
                        continue

                    # Extract chunk title
                    title_match = re.search(r'(.*?)\n', chunk)
                    chunk_title = title_match.group(1).strip() if title_match else None

                    data_content.append({
                        'file_name': file_name,
                        'chunk_id': f'{doc_id}-{chunk_id}-{sub_chunk_id}',
                        'doc_id': doc_id,
                        'chunk_title': chunk_title,
                        'chunk': chunk.strip(),
                        'chunk_length': len(chunk),
                        'doc':content,
                        'doc_length': len(content)
                    })
    except Exception as e:
        print(f"Error reading {txt_file}: {e}")

# Create a Pandas DataFrame from the list
data = pd.DataFrame(data_content)
data.head()

Importing /content/rag_paragraph_search_and_paper_summarisation/ressources/data/inputs/raw_text/nanomaterials-10-00364-v2.txt
Importing /content/rag_paragraph_search_and_paper_summarisation/ressources/data/inputs/raw_text/s41392-022-01007-w.txt
Importing /content/rag_paragraph_search_and_paper_summarisation/ressources/data/inputs/raw_text/PMC8198544.txt
Importing /content/rag_paragraph_search_and_paper_summarisation/ressources/data/inputs/raw_text/82_2020_217.txt
Importing /content/rag_paragraph_search_and_paper_summarisation/ressources/data/inputs/raw_text/mRNA vaccines — a new era.txt
Importing /content/rag_paragraph_search_and_paper_summarisation/ressources/data/inputs/raw_text/pharmaceutics-12-00102-v2.txt
Importing /content/rag_paragraph_search_and_paper_summarisation/ressources/data/inputs/raw_text/Efficacy and Safety of the mRNA-1273 SARS-CoV-2 Vaccine.txt
Importing /content/rag_paragraph_search_and_paper_summarisation/ressources/data/inputs/raw_text/s41591-022-02061-1.txt
Impor

| | file_name | chunk_id | doc_id | chunk_title | chunk | chunk_length | doc | doc_length |
|---|---|---|---|---|---|---|---|---|
| 0 | nanomaterials-10-00364-v2.txt | 0-0-0 | 0 | ABSTRACT | ABSTRACT\n The use of messenger RNA (mRNA) in ... | 1105 | ABSTRACT\n The use of messenger RNA (mRNA) in ... | 51571 |
| 1 | nanomaterials-10-00364-v2.txt | 0-1-1 | 0 | Introduction | Introduction\nAccording to the European Medici... | 2969 | ABSTRACT\n The use of messenger RNA (mRNA) in ... | 51571 |
| 2 | nanomaterials-10-00364-v2.txt | 0-2-1 | 0 | Structure of Synthetic IVT mRNA and Chemical M... | Structure of Synthetic IVT mRNA and Chemical M... | 1147 | ABSTRACT\n The use of messenger RNA (mRNA) in ... | 51571 |
| 3 | nanomaterials-10-00364-v2.txt | 0-3-1 | 0 | Figure 2. | Figure 2.\nRepresentative scheme of the IVT mR... | 169 | ABSTRACT\n The use of messenger RNA (mRNA) in ... | 51571 |
| 4 | nanomaterials-10-00364-v2.txt | 0-4-1 | 0 | 5' Cap | 5' Cap\nEukaryotic native mRNA possesses a 5' ... | 1636 | ABSTRACT\n The use of messenger RNA (mRNA) in ... | 51571 |


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Empty DataFrame
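
The splitting logic used by the loader above can be tried on a small inline string without touching the file system. This is a simplified sketch: `split_into_chunks` is a hypothetical helper, the sample text is made up, and the title is taken as the first line of the stripped chunk, which differs slightly from the regular expression used in the loader.

```python
import re


def split_into_chunks(content):
    """Split on '----' separators, then on 'TITLE PARAGRAPH:' markers."""
    chunks = []
    for section in re.split(r'----', content):
        for piece in re.split(r'TITLE PARAGRAPH:', section):
            piece = piece.strip()
            if not piece:
                continue  # skip empty pieces left over by the split
            title = piece.split('\n', 1)[0].strip()
            chunks.append({'chunk_title': title, 'chunk': piece})
    return chunks


sample = (
    "ABSTRACT\n mRNA text.\n"
    "TITLE PARAGRAPH: Introduction\nIntro body.\n"
    "----\n"
    "TITLE PARAGRAPH: Methods\nMethods body."
)
for chunk in split_into_chunks(sample):
    print(chunk['chunk_title'], '|', len(chunk['chunk']))
```

Running it on the sample yields three chunks titled `ABSTRACT`, `Introduction`, and `Methods`.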


In [None]:
# embed and index the documents - This must only be done once
batch_size = 32

for i in range(0, len(data), batch_size):
    i_end = min(len(data), i+batch_size)
    batch = data.iloc[i:i_end]
    ids = [f"{x['chunk_id']}" for i, x in batch.iterrows()]
    texts = [x['chunk'] for i, x in batch.iterrows()]
    embeds = embed_model.embed_documents(texts)
    # get metadata to store in Pinecone
    metadata = [
        {'text': x['chunk'],
         'chunk_title': x['chunk_title'],
         'file_name': x['file_name'],
         'doc_id':x['doc_id']
        } for i, x in batch.iterrows()
    ]
    # add to Pinecone
    index.upsert(vectors=zip(ids, embeds, metadata))


index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.00286,
 'namespaces': {'': {'vector_count': 286}},
 'total_vector_count': 286}
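
The batching pattern in the upsert loop can be factored into a small generator, which keeps the index arithmetic in one place. This is a sketch with a hypothetical helper name, `iter_batches`, shown on plain lists rather than a DataFrame.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 286 items (the vector count reported above) in batches of 32
sizes = [len(batch) for batch in iter_batches(list(range(286)), 32)]
print(len(sizes), sizes[-1])  # 9 30
```

With 286 items and a batch size of 32 this gives eight full batches and a final batch of 30, so the embedding model and the index are called nine times.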


### Initializing a RetrievalQA Chain

For Retrieval Augmented Generation (RAG) in LangChain we need to initialize either a RetrievalQA or RetrievalQAWithSourcesChain object. For both of these we need an llm (which we have initialized) and a Pinecone index — but initialized within a LangChain vector store object.

Initializing the LangChain vector store:

In [None]:
from langchain.vectorstores import Pinecone

def get_top_k_documents(query, k=3):

    text_field = 'text'  # field in metadata that contains text content

    vectorstore = Pinecone(
        index, embed_model.embed_query, text_field
    )

    top_k_docs = vectorstore.similarity_search_with_score(
        query,  # the search query
        k=k  # number of most relevant chunks of text to return
    )
    return top_k_docs

In [None]:
query = 'mRNA vaccines have become a versatile technology for the prevention of infectious diseases and the treatment of cancers.'
get_top_k_documents(query, k=3)

[(Document(page_content='Conclusions and future directions\nCurrently, mRNA vaccines are experiencing a burst in basic and clinical research. The past 2 years alone have witnessed the publication of dozens of preclinical and clinical reports showing the efficacy of these platforms. Whereas the majority of early work in mRNA vaccines focused on cancer applications, a number of recent reports have demonstrated the potency and versatility of mRNA to protect against a wide variety of infectious pathogens, including influenza virus, Ebola virus, Zika virus, Streptococcus spp. and T. gondii (TABLES 1,2).\nWhile preclinical studies have generated great optimism about the prospects and advantages of mRNAbased vaccines, two recent clinical reports have led to more tempered expectations \nRecent advances in understanding and reducing the innate immune sensing of mRNA have aided efforts not only in active vaccination but also in several applications of passive immunization or passive immunotherap

In [None]:
query = "how's the weather like today?"

get_top_k_documents(query, k=3)

[(Document(page_content='n engl j med 383;27 nejm.org December 31, 2020', metadata={'chunk_title': '', 'doc_id': 12.0, 'file_name': 'Safety and Efficacy of the BNT162b2 mRNA Covid-19 Vaccine.txt'}),
  0.155571163),
 (Document(page_content='ABSTRACT', metadata={'chunk_title': 'ABSTRACT', 'doc_id': 7.0, 'file_name': 's41591-022-02061-1.txt'}),
  0.141363591),
 (Document(page_content='Lessons Learned from COVID-19\nThe unprecedented speed of the global spread of the COVID-19 pandemic caused by the coronavirus, SARS-CoV2, resulted in an extremely rapid development of mRNA vaccines \nAlthough SARS viruses are common in humans, vaccines had not been developed since the course of the infection normally was very mild. The SARS outbreak in early 2000 triggered DNA vaccine development', metadata={'chunk_title': 'Lessons Learned from COVID-19', 'doc_id': 8.0, 'file_name': 'biomedicines-11-00308-v2.txt'}),
  0.132560551)]
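
Note that the vector store returns `k` results even for an unrelated query; what changes is the score, which here stays around 0.13 to 0.16 compared with scores above 0.8 for the mRNA query in this notebook's search results. One way to deal with this, sketched below with a hypothetical `filter_by_score` helper and an arbitrary threshold of 0.5 that would need tuning on real data, is to drop low-scoring results before using them.

```python
def filter_by_score(results, min_score=0.5):
    """Keep only (document, score) pairs whose score reaches `min_score`."""
    return [(doc, score) for doc, score in results if score >= min_score]


# Scores copied from the notebook outputs; strings stand in for Document objects
off_topic = [
    ("n engl j med footer", 0.155571163),
    ("ABSTRACT", 0.141363591),
    ("Lessons Learned from COVID-19", 0.132560551),
]
on_topic = [
    ("Conclusions and future directions", 0.837178),
    ("Safety", 0.809183),
    ("mRNA Vaccines Against Infectious Diseases", 0.806486),
]

print(len(filter_by_score(off_topic)))  # 0
print(len(filter_by_score(on_topic)))   # 3
```

An empty result can then be treated as "nothing relevant found" instead of summarising unrelated text.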

## Combine RAG and summarisation

In [None]:
def doc_search(query, top_k=3):
    search_results = []

    documents = get_top_k_documents(query, k=top_k)
    # Loop through the (document, score) pairs and collect the metadata and the score
    for doc, score in documents:
        metadata = dict(doc.metadata)  # copy so the Document itself is not modified
        metadata['similarity_score'] = score
        search_results.append(metadata)

    # Create a result DataFrame
    res_df = pd.DataFrame(search_results)
    return res_df

In [None]:
doc_search_result = doc_search(query, top_k = 3)
doc_search_result

| | chunk_title | doc_id | file_name | similarity_score |
|---|---|---|---|---|
| 0 | Conclusions and future directions | 4.0 | mRNA vaccines — a new era.txt | 0.837178 |
| 1 | Safety | 4.0 | mRNA vaccines — a new era.txt | 0.809183 |
| 2 | mRNA Vaccines Against Infectious Diseases | 0.0 | nanomaterials-10-00364-v2.txt | 0.806486 |


In [None]:
to_summarise_df = (pd.merge(doc_search_result, data, on=['file_name', 'chunk_title'])
             .groupby(['file_name', 'chunk_title'])
             .first()
             .reset_index()[['file_name', 'chunk_title', 'doc', 'similarity_score']]
             .sort_values(by='similarity_score', ascending=False))
to_summarise_df

| | file_name | chunk_title | doc | similarity_score |
|---|---|---|---|---|
| 0 | mRNA vaccines — a new era.txt | Conclusions and future directions | ABSTRACT\n Vaccines prevent many millions of i... | 0.837178 |
| 1 | mRNA vaccines — a new era.txt | Safety | ABSTRACT\n Vaccines prevent many millions of i... | 0.809183 |
| 2 | nanomaterials-10-00364-v2.txt | mRNA Vaccines Against Infectious Diseases | ABSTRACT\n The use of messenger RNA (mRNA) in ... | 0.806486 |


In [None]:
# Add a dummy 'similarity_score' column to the data dataframe
data['similarity_score'] = None

# Merge the two dataframes based on the "file_name" column
merged_df = pd.merge(data, to_summarise_df[['file_name']], on='file_name')

# Keep only the relevant columns; copy to avoid pandas' SettingWithCopyWarning
final_df = merged_df[['file_name', 'chunk_title', 'doc', 'similarity_score', 'chunk']].copy()

# Apply generate_summary to each chunk
final_df['summarized_chunk'] = final_df['chunk'].apply(lambda x: generate_summary(x, llm, how="chunk"))

# Display the final dataframe
final_df


Unnamed: 0,file_name,chunk_title,doc,similarity_score,chunk,summarized_chunk
0,nanomaterials-10-00364-v2.txt,ABSTRACT,ABSTRACT\n The use of messenger RNA (mRNA) in ...,,ABSTRACT\n The use of messenger RNA (mRNA) in ...,The use of mRNA in gene therapy has gained po...
1,nanomaterials-10-00364-v2.txt,Introduction,ABSTRACT\n The use of messenger RNA (mRNA) in ...,,Introduction\nAccording to the European Medici...,Gene therapy involves using genetic material ...
2,nanomaterials-10-00364-v2.txt,Structure of Synthetic IVT mRNA and Chemical M...,ABSTRACT\n The use of messenger RNA (mRNA) in ...,,Structure of Synthetic IVT mRNA and Chemical M...,The production of IVT mRNA is typically done ...
3,nanomaterials-10-00364-v2.txt,Figure 2.,ABSTRACT\n The use of messenger RNA (mRNA) in ...,,Figure 2.\nRepresentative scheme of the IVT mR...,The figure depicts an illustration of the IVT...
4,nanomaterials-10-00364-v2.txt,5' Cap,ABSTRACT\n The use of messenger RNA (mRNA) in ...,,5' Cap\nEukaryotic native mRNA possesses a 5' ...,The 5' cap of eukaryotic mRNA is formed by th...
...,...,...,...,...,...,...
97,mRNA vaccines — a new era.txt,,ABSTRACT\n Vaccines prevent many millions of i...,,DESCRIPTION TABLE: cont.) |,"In this article, the author discusses the pot..."
98,mRNA vaccines — a new era.txt,,ABSTRACT\n Vaccines prevent many millions of i...,,DESCRIPTION TABLE: \nNone||None||Targets||Tria...,This table lists clinical trials conducted at...
99,mRNA vaccines — a new era.txt,,ABSTRACT\n Vaccines prevent many millions of i...,,DESCRIPTION TABLE: \nNone||None||Targets||Tria...,This table lists clinical trials conducted at...
100,mRNA vaccines — a new era.txt,,ABSTRACT\n Vaccines prevent many millions of i...,,"DESCRIPTION TABLE: , Biomedical Advanced Resea...",This table lists various biotechnology compan...


In [None]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 102 entries, 0 to 101
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   file_name         102 non-null    object
 1   chunk_title       102 non-null    object
 2   doc               102 non-null    object
 3   similarity_score  0 non-null      object
 4   chunk             102 non-null    object
 5   summarized_chunk  102 non-null    object
dtypes: object(6)
memory usage: 5.6+ KB


In [None]:
# Group by 'file_name' and aggregate the 'summarized_chunk' into a list
grouped_df = final_df.groupby('file_name')['summarized_chunk'].agg(list).reset_index()

# Merge the grouped dataframe back into to_summarise_df
to_summarise_df = pd.merge(to_summarise_df, grouped_df, on='file_name', how='left')

# Display the final to_summarise_df
to_summarise_df

                       file_name                                chunk_title  \
0  mRNA vaccines — a new era.txt          Conclusions and future directions   
1  mRNA vaccines — a new era.txt                                     Safety   
2  nanomaterials-10-00364-v2.txt  mRNA Vaccines Against Infectious Diseases   

                                                 doc  similarity_score  \
0  ABSTRACT\n Vaccines prevent many millions of i...          0.837178   
1  ABSTRACT\n Vaccines prevent many millions of i...          0.809183   
2  ABSTRACT\n The use of messenger RNA (mRNA) in ...          0.806486   

                                    summarized_chunk  
0  [ * Vaccines prevent millions of illnesses and...  
1  [ * Vaccines prevent millions of illnesses and...  
2  [ The use of mRNA in gene therapy has gained p...  
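
To make the group-and-merge step concrete, here is a minimal sketch on toy data. The frames `toy_final_df` and `toy_to_summarise_df` and all their values are made up for illustration; only the `groupby(...).agg(list)` and `pd.merge(..., how='left')` calls mirror the cell above.

```python
import pandas as pd

# Toy stand-in for final_df: one row per summarized chunk (hypothetical values)
toy_final_df = pd.DataFrame({
    'file_name': ['a.txt', 'a.txt', 'b.txt'],
    'summarized_chunk': ['summary a1', 'summary a2', 'summary b1'],
})

# Toy stand-in for to_summarise_df: one row per retrieved chunk (hypothetical values)
toy_to_summarise_df = pd.DataFrame({
    'file_name': ['a.txt', 'a.txt', 'b.txt'],
    'chunk_title': ['Intro', 'Safety', 'Delivery'],
})

# Collect every chunk summary of a file into a single list
toy_grouped = toy_final_df.groupby('file_name')['summarized_chunk'].agg(list).reset_index()

# Left merge: each retrieved row receives the full list for its file
toy_merged = pd.merge(toy_to_summarise_df, toy_grouped, on='file_name', how='left')
print(toy_merged)
```

Rows that share a `file_name` receive the same list, which is why rows 0 and 1 in the output above show the same `summarized_chunk` value.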


In [None]:
to_summarise_df

Unnamed: 0,file_name,chunk_title,doc,similarity_score,summarized_chunk
0,mRNA vaccines — a new era.txt,Conclusions and future directions,ABSTRACT\n Vaccines prevent many millions of i...,0.837178,[ * Vaccines prevent millions of illnesses and...
1,mRNA vaccines — a new era.txt,Safety,ABSTRACT\n Vaccines prevent many millions of i...,0.809183,[ * Vaccines prevent millions of illnesses and...
2,nanomaterials-10-00364-v2.txt,mRNA Vaccines Against Infectious Diseases,ABSTRACT\n The use of messenger RNA (mRNA) in ...,0.806486,[ The use of mRNA in gene therapy has gained p...


In [None]:
# Summarizing all chunk summaries of a document at once can exhaust GPU memory.
# If any exception is raised (typically CUDA out of memory), we fall back to
# building the full summary by joining the chunk summaries with newlines.
try:
    to_summarise_df['doc_summary'] = to_summarise_df['summarized_chunk'].apply(lambda text_list: generate_summary(text_list, llm, how="list"))
except Exception as e:
    print(f"Exception during summarization: {e}")
    to_summarise_df['doc_summary'] = to_summarise_df['summarized_chunk'].apply(lambda text_list: '\n'.join(text_list))
summarized_retrieved_data = to_summarise_df

Exception during summarization: CUDA out of memory. Tried to allocate 8.22 GiB (GPU 0; 14.75 GiB total capacity; 8.33 GiB already allocated; 5.15 GiB free; 8.68 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
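
Because the `try` above wraps the whole `apply`, a single failing row sends every row to the join fallback. One possible alternative, sketched here under assumptions, is to apply the fallback per row. The names `summarize_with_fallback` and `fake_summarizer` are hypothetical; `fake_summarizer` only simulates a failure and stands in for the real `generate_summary` call.

```python
def summarize_with_fallback(chunk_summaries, summarizer):
    """Return summarizer(chunk_summaries), or the joined chunks if it fails."""
    try:
        return summarizer(chunk_summaries)
    except Exception as e:  # e.g. a CUDA out-of-memory error
        print(f"Exception during summarization: {e}")
        return '\n'.join(chunk_summaries)


def fake_summarizer(chunk_summaries):
    # Hypothetical stand-in for the LLM call: fails on "long" inputs
    if len(chunk_summaries) > 2:
        raise RuntimeError("simulated out of memory")
    return ' / '.join(chunk_summaries)


short_result = summarize_with_fallback(['a', 'b'], fake_summarizer)
long_result = summarize_with_fallback(['a', 'b', 'c'], fake_summarizer)
print(short_result)
print(long_result)
```

In the notebook, the summarizer argument could be `lambda chunks: generate_summary(chunks, llm, how="list")`, so that only the documents that actually fail fall back to the joined text.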


In [None]:
import os
from datetime import datetime

def export_data(data, output_file_name, output_folder_path):
    """Save a DataFrame as a timestamped CSV file in output_folder_path."""
    # Timestamp down to the second, so repeated exports do not overwrite each other
    current_time = datetime.now().strftime('%Y%m%d_%H%M%S')

    # Build the path <output_folder_path>/<output_file_name>_<timestamp>.csv
    csv_filename = f'{output_file_name}_{current_time}.csv'
    csv_data_path = os.path.join(output_folder_path, csv_filename)
    data.to_csv(csv_data_path)

In [None]:
# Export the summarized data

output_file_name = 'summarized_retrieved_data'
output_folder_path = settings['data_paths']['outputs']['summarized_retrieved_data_path']

summarized_documents = summarized_retrieved_data[['file_name', 'chunk_title', 'similarity_score', 'doc_summary']]
# Alternative: export the per-chunk summaries instead of the document summary
# summarized_documents = to_summarise_df[['file_name', 'chunk_title', 'similarity_score', 'summarized_chunk']]

export_data(summarized_documents, output_file_name, output_folder_path)

## Build the Validation pipeline:

In [None]:
# Read the validation set
import pandas as pd

validation_data_path = settings['data_paths']['inputs']['validation_data_path']

val_df = pd.read_excel(validation_data_path)
val_df

Unnamed: 0,chunk,file_name,is_similar
0,mRNA vaccines have become a versatile technolo...,82_2020_217.txt\t,1
1,Since the first use of in vitro transcribed me...,82_2020_217.txt\t,1
2,The administration route for mRNA vaccines pla...,82_2020_217.txt\t,1
3,"Lipids, lipid-like compounds, and lipid deriva...",82_2020_217.txt\t,1
4,"Polymeric materials, including polyamines, den...",82_2020_217.txt\t,1
5,The mRNA vaccines can be delivered without any...,82_2020_217.txt\t,1
6,Despite the promising progress in mRNA vaccine...,82_2020_217.txt\t,1
7,"In conclusion, mRNA vaccines represent a revol...",82_2020_217.txt\t,1
8,Analysis\nFor analysis of the primary end poin...,Efficacy and Safety of the mRNA-1273 SARS-CoV-...,1
9,"Between July 27, 2020, and October 23, 2020, a...",Efficacy and Safety of the mRNA-1273 SARS-CoV-...,1


In [None]:
# Retrieve the 3 most similar documents (with their scores) for each validation chunk
val_df['top_3_doc'] = val_df['chunk'].apply(lambda query: get_top_k_documents(query, k=3))

In [None]:
# Binarize the similarity score of the best match:
# scores >= 0.5 become 1, lower scores become 0
# Note: the 0.5 threshold comes from my experiments; it can be revised after further inspection or with new data
val_df['is_similar_pred'] = val_df['top_3_doc'].apply(lambda d: 0 if d[0][-1] < 0.5 else 1)

val_df

Unnamed: 0,chunk,file_name,is_similar,top_3_doc,is_similar_pred
0,mRNA vaccines have become a versatile technolo...,82_2020_217.txt\t,1,[(page_content='ABSTRACT\n mRNA vaccines have ...,1
1,Since the first use of in vitro transcribed me...,82_2020_217.txt\t,1,"[(page_content=""Introduction\nSince the first ...",1
2,The administration route for mRNA vaccines pla...,82_2020_217.txt\t,1,[(page_content='Administration Routes for mRNA...,1
3,"Lipids, lipid-like compounds, and lipid deriva...",82_2020_217.txt\t,1,"[(page_content='Lipid-based Delivery\nLipids, ...",1
4,"Polymeric materials, including polyamines, den...",82_2020_217.txt\t,1,[(page_content='Polymer-based Delivery\nPolyme...,1
5,The mRNA vaccines can be delivered without any...,82_2020_217.txt\t,1,"[(page_content=""Naked mRNA Vaccines\nThe mRNA ...",1
6,Despite the promising progress in mRNA vaccine...,82_2020_217.txt\t,1,[(page_content='Introduction\nThe recent FDA a...,1
7,"In conclusion, mRNA vaccines represent a revol...",82_2020_217.txt\t,1,[(page_content='Promising recent innovations\n...,1
8,Analysis\nFor analysis of the primary end poin...,Efficacy and Safety of the mRNA-1273 SARS-CoV-...,1,"[(page_content=""Statistical Analysis\nFor anal...",1
9,"Between July 27, 2020, and October 23, 2020, a...",Efficacy and Safety of the mRNA-1273 SARS-CoV-...,1,[(page_content='Trial Population\nBetween July...,1
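
The thresholding step can be illustrated in isolation. This is a minimal sketch that assumes, as the `d[0][-1]` indexing above suggests, that each result is a tuple whose last element is the score and that results are sorted best-first. The function `is_similar_pred` and the toy hits are hypothetical.

```python
THRESHOLD = 0.5  # same cut-off as in the cell above


def is_similar_pred(top_k, threshold=THRESHOLD):
    """Return 1 if the best hit's score reaches the threshold, else 0."""
    best_hit = top_k[0]        # assumes results are sorted best-first
    best_score = best_hit[-1]  # assumes the score is the last tuple element
    return 1 if best_score >= threshold else 0


# Hypothetical retrieval results: (document text, similarity score)
hits_match = [('doc A', 0.84), ('doc B', 0.61), ('doc C', 0.40)]
hits_no_match = [('doc D', 0.42), ('doc E', 0.30), ('doc F', 0.12)]

print(is_similar_pred(hits_match))
print(is_similar_pred(hits_no_match))
```

Only the first of the three hits decides the prediction; the other two are kept in `top_3_doc` for inspection.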


In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score
import pandas as pd

In [None]:
import json
import os
from datetime import datetime

def get_performance_metrics(val_df, output_metrics_path):
    """Compute precision, recall and F1 on the validation set and export them as JSON."""
    # Timestamp down to the second, used in the output file name
    current_time = datetime.now().strftime('%Y%m%d_%H%M%S')

    # Compare the ground-truth labels with the thresholded predictions
    y_true = val_df['is_similar']
    y_pred = val_df['is_similar_pred']

    metrics = {
        'precision': precision_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
        'f1_score': f1_score(y_true, y_pred),
    }

    # Export the metrics as JSON
    metrics_file = os.path.join(output_metrics_path, f'validation_metrics_{current_time}.json')
    with open(metrics_file, 'w') as file:
        json.dump(metrics, file, indent=4)
    return metrics


In [None]:
# Get and export the metrics
output_metrics_path = settings['data_paths']['outputs']['metrics_path']
get_performance_metrics(val_df, output_metrics_path)

{'precision': 1.0,
 'recall': 0.9166666666666666,
 'f1_score': 0.9565217391304348}
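
As a sanity check, the three values are mutually consistent: F1 is the harmonic mean of precision and recall. The sketch below uses hypothetical confusion counts (11 true positives, 0 false positives, 1 false negative) chosen only because they reproduce the reported ratios; the actual counts in the validation set are not shown in this notebook.

```python
# Hypothetical confusion counts that reproduce the reported metrics
tp, fp, fn = 11, 0, 1

precision = tp / (tp + fp)   # no false positives, so precision is 1.0
recall = tp / (tp + fn)      # 11 / 12
f1 = 2 * precision * recall / (precision + recall)  # 22 / 23

print(round(precision, 4), round(recall, 4), round(f1, 4))
# → 1.0 0.9167 0.9565
```

A precision of 1.0 with a recall below 1.0 means every chunk predicted as similar was correct, while some truly similar chunks scored below the 0.5 threshold.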

In [None]:
# Export the validation data with predictions

output_file_name = 'val_data'
predicted_validation_data_path = settings['data_paths']['outputs']['predicted_validation_data_path']

export_data(data=val_df, output_file_name=output_file_name, output_folder_path=predicted_validation_data_path)