# TO DO: reinitialise the Pinecone index

## Clone the private repo

Please check the README file before executing this section.

In [11]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
!mkdir -p /root/.ssh/

In [4]:
!cp /content/drive/MyDrive/deploy_keys/id_ed25519* /root/.ssh/

In [5]:
!ssh-keyscan github.com >> /root/.ssh/known_hosts

# github.com:22 SSH-2.0-babeld-7ce31352


In [6]:
!ssh -T git@github.com

Hi helmi0695/instadeep-llm-technical-test! You've successfully authenticated, but GitHub does not provide shell access.


In [7]:
!git clone git@github.com:helmi0695/instadeep-llm-technical-test.git

Cloning into 'instadeep-llm-technical-test'...
remote: Enumerating objects: 32, done.
remote: Counting objects: 100% (32/32), done.
remote: Compressing objects: 100% (29/29), done.
remote: Total 32 (delta 3), reused 30 (delta 1), pack-reused 0
Receiving objects: 100% (32/32), 165.41 KiB | 1.22 MiB/s, done.
Resolving deltas: 100% (3/3), done.


In [8]:
!ls

drive  instadeep-llm-technical-test  sample_data


In [9]:
%cd /content/instadeep-llm-technical-test

/content/instadeep-llm-technical-test


In [10]:
!ls

notebooks  README.md  ressources


In [11]:
!git pull

Already up to date.


# LLaMa 7B Chatbot in Hugging Face and LangChain - RAG

In this notebook we'll explore how we can use the open source **Llama-7b-chat** model using Hugging Face and LangChain.
To access Llama 2 models, one must first request access via [this form](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) (access is typically granted within a few hours).

We start by doing a `pip install` of all required libraries.

In [4]:
!pip install -qU \
    transformers==4.31.0 \
    sentence-transformers==2.2.2 \
    pinecone-client==2.2.2 \
    datasets==2.14.0 \
    accelerate==0.21.0 \
    einops==0.6.1 \
    langchain==0.0.240 \
    xformers==0.0.20 \
    bitsandbytes==0.41.0

## Initializing the Hugging Face Pipeline

The first thing we need to do is initialize a `text-generation` pipeline with Hugging Face transformers. The pipeline requires two things that we must initialize first:

* An LLM, in this case `meta-llama/Llama-2-7b-chat-hf`.

* The respective tokenizer for the model.

We initialize the model and move it to our CUDA-enabled GPU. On Colab, downloading and initializing the model can take 5-10 minutes.

In [13]:
from torch import cuda, bfloat16
import transformers

In [14]:
model_id = 'meta-llama/Llama-2-7b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

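# begin initializing HF items, need auth token for these
# read the token from the environment instead of hard-coding it in the notebook
# (the variable name HF_AUTH_TOKEN is an assumption; use whatever name you export)
import os
hf_auth = os.environ['HF_AUTH_TOKEN']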
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)
model.eval()
print(f"Model loaded on {device}")


Model loaded on cuda:0
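
As a rough sanity check on why 4-bit loading helps, we can estimate the memory needed for the weights from the parameter count alone. The sketch below assumes about 6.74 billion parameters for the 7B model (an approximation, chosen because it is consistent with the roughly 13.5 GB of 16-bit shards downloaded above). It ignores quantization metadata, activations, and the KV cache, so real usage will be somewhat higher. `weight_memory_gb` is a hypothetical helper, not a library function.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 6.74e9  # assumed parameter count for Llama-2-7b (approximate)

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # 16-bit weights, as stored in the checkpoint
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # 4-bit weights, as requested in bnb_config

print(f"16-bit weights: ~{fp16_gb:.2f} GB")
print(f" 4-bit weights: ~{nf4_gb:.2f} GB")
```

Under these assumptions the 4-bit weights need about a quarter of the 16-bit footprint, which is what lets the model fit on a single Colab GPU.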


The pipeline requires a tokenizer which handles the translation of human readable plaintext to LLM readable token IDs. The Llama 2 7B models were trained using the Llama 2 7B tokenizer, which we initialize like so:

In [15]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)


Now we're ready to initialize the HF pipeline. There are a few additional parameters that we must define here. Comments explaining these have been included in the code.

TO DO: Externalise this to a wrapper (along with the llm line)

In [16]:
generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    temperature=0.0,  # 'randomness' of outputs; 0.0 is the most deterministic, higher values are more random
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)
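
To make the `repetition_penalty` comment more concrete, here is a minimal pure-Python sketch of the commonly used penalty rule: the score of every token that has already been generated is divided by the penalty if it is positive and multiplied by it if it is negative. The function names and the toy scores are assumptions for illustration only; the real implementation operates on tensors inside the generation loop.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """Return a copy of `logits` with already-generated tokens made less likely."""
    penalized = list(logits)
    for token_id in set(generated_ids):
        score = penalized[token_id]
        # Positive scores shrink, negative scores become more negative.
        penalized[token_id] = score / penalty if score > 0 else score * penalty
    return penalized

def greedy_pick(logits):
    """Greedy decoding: choose the id of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [2.0, 1.9, -1.0]  # toy scores for a 3-token vocabulary
history = [0]              # token 0 has already been generated

print(greedy_pick(logits))                                     # choice without the penalty
print(greedy_pick(apply_repetition_penalty(logits, history)))  # choice with penalty=1.1
```

In this toy case the penalty is just large enough to tip the choice away from the repeated token, which is the effect the `repetition_penalty=1.1` setting is meant to have.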

Confirm this is working:

In [17]:
res = generate_text("What's the best vaccine against covid?")
print(res[0]["generated_text"])

What's the best vaccine against covid?
 nobody knows.

The COVID-19 pandemic has highlighted the importance of vaccination in preventing the spread of infectious diseases, but there is still much to be learned about the most effective ways to protect against COVID-19. While several vaccines have been developed and are being distributed around the world, it is important to recognize that no single vaccine will provide complete protection against COVID-19.

One of the biggest challenges in developing an effective COVID-19 vaccine is the incredible diversity of the virus itself. COVID-19 is caused by a coronavirus, which means that it can mutate quickly and easily, leading to new strains of the virus that may not be well-suited to existing vaccines. As a result, researchers are working on multiple fronts to develop vaccines that can provide broad protection against COVID-19, including:

1. mRNA vaccines: These vaccines use a piece of genetic material called messenger RNA (mRNA) to instruc

Now to implement this in LangChain:

In [18]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)

In [19]:
llm(prompt="What's the best vaccine against covid?")

'\n nobody knows.\n\nThe COVID-19 pandemic has highlighted the importance of vaccination in preventing the spread of infectious diseases, but there is still much to be learned about the most effective ways to protect against COVID-19. While several vaccines have been developed and are being distributed around the world, it is important to recognize that no single vaccine will provide complete protection against COVID-19.\n\nOne of the biggest challenges in developing an effective COVID-19 vaccine is the incredible diversity of the virus itself. COVID-19 is caused by a coronavirus, which means that it can mutate quickly and easily, leading to new strains of the virus that may not be well-suited to existing vaccines. As a result, researchers are working on multiple fronts to develop vaccines that can provide broad protection against COVID-19, including:\n\n1. mRNA vaccines: These vaccines use a piece of genetic material called messenger RNA (mRNA) to instruct cells in the body to produce

We still get the same completion (only the echoed prompt is no longer included), as we're not really doing anything differently here, but we have now wrapped **Llama 2 7B Chat** so that it can be used from LangChain. With this wrapper we can begin using LangChain's agent tooling, chains, etc., with **Llama 2**.
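
One visible difference between the two outputs is that the LangChain call does not echo the prompt. A plausible reading, given that the pipeline was created with `return_full_text=True`, is that the wrapper slices the prompt off the front of the generated text. The sketch below imitates that behaviour with a stand-in pipeline; `fake_pipeline` and `call_like_wrapper` are hypothetical names, and the real wrapper may differ in detail.

```python
def fake_pipeline(prompt):
    """Stand-in for `generate_text`: returns the prompt followed by a completion."""
    return [{"generated_text": prompt + "\n nobody knows."}]

def call_like_wrapper(pipeline, prompt):
    """Return only the newly generated text, dropping the echoed prompt."""
    full_text = pipeline(prompt)[0]["generated_text"]
    return full_text[len(prompt):]

prompt = "What's the best vaccine against covid?"
print(repr(fake_pipeline(prompt)[0]["generated_text"]))  # prompt + completion
print(repr(call_like_wrapper(fake_pipeline, prompt)))    # completion only
```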

## Create a summarisation chain

In [20]:
import textwrap
from langchain import PromptTemplate, LLMChain

TO DO: externalise the helper functions and, optionally, `DEFAULT_SYSTEM_PROMPT`

In [21]:
# B_INST, E_INST = "[INST]", "[/INST]"
# B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
# DEFAULT_SYSTEM_PROMPT = """\
# You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

# If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."""


# def get_prompt(instruction, new_system_prompt=DEFAULT_SYSTEM_PROMPT ):
#     SYSTEM_PROMPT = B_SYS + new_system_prompt + E_SYS
#     prompt_template =  B_INST + SYSTEM_PROMPT + instruction + E_INST
#     return prompt_template

# def parse_text(text):
#         wrapped_text = textwrap.fill(text, width=100)
#         return wrapped_text

# def count_words(input_string):
#     words = input_string.split(" ")
#     return len(words)
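
The commented-out helpers above build a prompt in the Llama 2 chat format. A minimal runnable sketch of `get_prompt` shows the string the model would receive. The shortened system prompt and the example instruction are assumptions made only to keep the output readable; the markers themselves are taken from the cell above.

```python
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

# Shortened system prompt, used here only to keep the printed output small.
SHORT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant."

def get_prompt(instruction, system_prompt=SHORT_SYSTEM_PROMPT):
    """Wrap an instruction and a system prompt in the Llama 2 chat markers."""
    return B_INST + B_SYS + system_prompt + E_SYS + instruction + E_INST

# `{text}` is left as a placeholder so the result can be used as a template.
prompt = get_prompt("Summarize the following text: {text}")
print(prompt)
```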

In [None]:
text = '''ABSTRACT
 mRNA vaccines have become a versatile technology for the prevention of infectious diseases and the treatment of cancers. In the vaccination process, mRNA formulation and delivery strategies facilitate effective expression and presentation of antigens, and immune stimulation. mRNA vaccines have been delivered in various formats: encapsulation by delivery carriers, such as lipid nanoparticles, polymers, peptides, free mRNA in solution, and ex vivo through dendritic cells. Appropriate delivery materials and formulation methods often boost the vaccine efficacy which is also influenced by the selection of a proper
----

TITLE PARAGRAPH: Introduction
Since the first use of in vitro transcribed messenger RNA (mRNA) to express an exogenous protein in mice in 1990
Several features of in vitro transcribed mRNA contribute to its vaccine potential. First, the development process of an mRNA vaccine can be much faster than conventional protein vaccines
The mRNAs used as vaccines can be categorized into conventional mRNAs and self-amplifying mRNAs. Conventional mRNAs are similar to endogenous mRNAs in mammalian cells, consisting of a 5' cap, 5' UTR, coding region, 3' UTR, and a polyadenylated tail
Three major types of proteins are encoded by mRNA vaccines: antigens
Advances in recent years made mRNA a promising vaccine platform. For example, chemical modifications of RNA using nucleotide analogs, such as pseudouridine, dramatically increased protein production in vivo by diminishing the translation inhibition triggered by the unmodified nucleotides
In this chapter, we summarize the routes of administrations for mRNA vaccines, discuss mRNA delivery carriers and their corresponding formulation methods, and overview the challenges and future development of mRNA vaccines. A comprehensive overview of recent advances in mRNA vaccine delivery may facilitate the future development of novel delivery strategies and effective mRNA vaccines.

----

TITLE PARAGRAPH: Administration Routes for mRNA Vaccines
The administration route for mRNA vaccines plays an important role in determining vaccination efficacy
Intradermal (ID) injection delivers mRNA vaccines directly into the dermis region, which is dense connective tissue (Fig.
Subcutaneous (SC) injection administers mRNA vaccines to the subcutis region under the epidermis and dermis (Fig.
Intramuscular (IM) injection delivers the vaccine into muscles, a deeper tissue under the dermal and subcutaneous layer (Fig.
Intranodal (IN) injection directly introduces mRNA vaccines to the peripheral lymphoid organs where APCs and primed T or B cells interact (Fig
Mucosal delivery of mRNA vaccines was studied because of the accessible APCs in lymphoid organs at the mucosal sites and their protective roles against various pathogens. Among the mucosal administration routes, intranasal and intravaginal administrations were utilized to deliver mRNA vaccines
Intravenous (IV) injection delivers mRNA vaccines into the systemic circulation (Fig.
In summary, the biological features of different administration routes may impact the safety and efficacy of vaccination. Table
3 Delivery Strategies for mRNA Vaccines
Researchers have investigated many methods to deliver mRNA vaccines. For example, delivery carriers, such as lipid-derived and polymer-derived materials, dramatically increased cellular uptake of RNAs, thus receiving tremendous attention in recent years

----

TITLE PARAGRAPH: Delivery Carriers of mRNA Vaccines

----

TITLE PARAGRAPH: Lipid-based Delivery
Lipids, lipid-like compounds, and lipid derivatives have been widely used to formulate lipid and lipid-derived nanoparticles (LNPs) for in vivo delivery of mRNA vaccines
The LNPs usually contain one or more of the functional lipid components that are crucial for the intracellular RNA delivery described above
The formulation methods of lipid-based mRNA vaccines mainly include thin-film hydration
# FORMULA: PS phosphatidylserine, PC phosphatidylcholine, DLinDMA N,N-Dimethyl-2,3-bis[(9Z,12Z)-octadeca-9,12-dienyloxy] propan-1-amine, DSPC 1,2-distearoyl-sn-glycero-3-phosphocholine, DMG-PEG 1,2-dimyristoyl-rac-glycero-3-methoxypolyethylene glycol, DOPE 1,2-dioleoyl-sn-glycero-3-phosphoethanolamine, DOTMA 1,2-di-O-octadecenyl-3-trimethylammonium propane, EDOPC 1,2-dioleoyl-sn-glycero-3-ethylphosphocholine, DSPE 1,2-distearoyl-sn-glycero-3-phosphoethanolamine, PbAE poly-(b-amino ester) polymer, A18 ethyl 1- (3-(2-ethylpiperidin-1-yl)propyl)-5,5-di((Z)-heptadec-8-en-1-yl)-2,5-dihydro-1H-imidazole-2-carboxylate, DOTAP 1,2-dioleoyl-3-trimethylammonium propane, TT3 N ,N ,N -tris(3-(didodecylamino)propyl)benzene-1,3,5-tricarboxamide
mRNA-LNP at the interface
Compared to other preparation methods, the use of continuous-flow microfluidic devices increases reproducibility, improves molecular stability, reduces the chance of contamination, and is easily scaled up for preclinical and clinical studies
The delivery routes of lipid-based mRNA vaccines include IM, ID, SC, IN, and IV injection
Overall, LNPs-based mRNA vaccines have shown efficacy in preventing infectious diseases and treating cancers in preclinical and early-stage clinical studies

----

TITLE PARAGRAPH: Polymer-based Delivery
Polymeric materials, including polyamines, dendrimers, and copolymers, are functional materials capable of delivering mRNA vaccines. Similar to functional lipid-based carriers, polymers can also protect RNA from RNase-mediated degradation and facilitate intracellular delivery
Cationic polymers, such as polyethylenimine (PEI), polyamidoamine (PAMAM) dendrimer, and polysaccharide, condensed and delivered negatively charged RNA molecules
Besides cationic polymer materials, anionic polymers, such as PLGA, were also used to deliver mRNA vaccines. Since an anionic polymer was not able to efficiently encapsulate the negatively charged mRNA molecules, cationic lipid materials were added to create lipid-polymer hybrid formulations
In general, the mRNA vaccines delivered by polymer materials showed therapeutic effects in preclinical studies. New functional polymers, with improved biodegradability and delivery efficiency, are needed for clinical translation of the polymer-based mRNA vaccines.

----

TITLE PARAGRAPH: Peptide-based Delivery
Various peptides are used as carriers to deliver mRNA vaccines. Peptides themselves are also a large class of vaccine agents, which have been reviewed in the literature
Peptides, when used as the primary carrier for RNA delivery, should be positively charged. Cationic peptides contain many lysine and arginine residues that provide positively charged amino groups, therefore enabling complexing with nucleic acids through electrostatic interactions
Protamine is a cationic peptide used in many early studies for the delivery of mRNA vaccines. In solution, protamine and mRNA spontaneously form a complex, the size of which is dependent on NaCl concentration
Cationic cell-penetrating peptides (CPPs) can complex with RNA. Although many CPPs were used in gene therapies [reviewed by
Anionic peptides were also utilized to deliver mRNA vaccines in vitro. Anionic peptides cannot complex RNA due to their negative charges. Therefore, they were conjugated to positively charged polymers which served as scaffolds for RNA encapsulation. For example, an OVA-mRNA was first encapsulated with a random copolymer p(HPMA-DMAE-co-PDTEMA-co-AzEMAm) (pHDPA) containing azide group
In summary, protamine was the only peptide carrier evaluated in clinical trials of mRNA vaccines. In these trials, the protamine-mRNA complex and a naked mRNA were injected simultaneously via ID or IM routes

----

TITLE PARAGRAPH: Virus-Like Replicon Particle
Viral particles can package and deliver antigen-encoding self-amplifying mRNA into cytoplasm like a virus in a method called virus-like self-amplifying mRNA particle, i.e., virus-like replicon particle (VRP)
However, there are two challenges for VRP-based mRNA vaccines. The first challenge is to scale up the production which is limited by the process of generating VRPs from packaging cell lines

----

TITLE PARAGRAPH: Cationic Nanoemulsion
Cationic nanoemulsion (CNE) combines nanoemulsion with cationic lipids for RNA delivery. Nanoemulsion utilizes hydrophobic and hydrophilic surfactants to stabilize the oil core in the aqueous phase, thereby generating particles. Nanoemulsion can be induced by various methods, such as vigorous agitation, ultrasound, and microfluidics.

----

TITLE PARAGRAPH: Naked mRNA Vaccines
The mRNA vaccines can be delivered without any additional carrier, namely in a naked format. This method dissolves mRNA into a buffer and then injects the mRNA solution directly. The feasibility of naked RNA delivery in vivo was reported in an early effort in which a naked mRNA was delivered to mice by intramuscular injection
The naked mRNA vaccine has two prominent features. One feature is the ease to store and prepare. In the presence of a storage reagent, such as 10% trehalose, freeze-dried naked RNA remains stable in the refrigerator temperature (4 °C) for up to 10 months
When developing naked mRNA vaccines, the buffer is an essential component to be chosen carefully. Ringer's solution
Naked mRNA vaccines are more susceptible to the delivery obstacles, namely, RNase degradation and intracellular delivery
In recent clinical trials, naked mRNA vaccines were administered via ultrasound-guided intranodal injection

----

TITLE PARAGRAPH: Dendritic Cells-Based mRNA Vaccines
Therapeutic vaccination needs to effectively elicit the body's adaptive immunity. During the initial development of adaptive immune response, antigen-presenting cells (APCs) internalize, process and present antigens to functional lymphocytes. As the most efficient APCs, dendritic cells (DCs) can present antigens processed from various sources, for example, the captured microorganisms, virus-infected cells, and tumor cells
Autologous DCs from primary human PBMC are the main sources for preparing mRNA-treated DCs for in vivo applications
To deliver mRNAs into DCs, several strategies, such as electroporation and lipid-derived carriers, were employed
The routes for administration of mRNA-loaded DCs mainly include ID, SC, IV, and IN injections
In summary, the DC-based mRNA vaccines have shown efficacy in many preclinical and clinical studies. In one recent clinical trial (NCT00639639), the long-term progression-free survival (PFS) and overall survival (OS) were significantly increased in glioblastoma patients who were injected intradermally with autologous DCs pulsed with an antigen-encoding mRNA
Taken together, the formulation and delivery of mRNA vaccines have been extensively studied. The delivery formats and delivery materials described above have advanced to various stages of preclinical and clinical studies. However, each delivery technology has its advantages and challenges which are summarized in Table

----

TITLE PARAGRAPH: Co-delivery of mRNA Vaccines
Several mRNA molecules can be co-delivered to trigger synergic effects in vaccination. Co-delivery of mRNA vaccines enables either assembly of protein complexes, generation of multivalent mRNA vaccines, or better immune response against one specific target. The co-delivered mRNAs can be a combination of conventional mRNAs and/or self-amplifying mRNAs. There are many co-delivery options. Several mRNAs can be delivered naked or formulated, complexed together or individually, and injected through different routes at different times. In this section, we summarize the recent results for the co-delivery of mRNA vaccines, including delivery formats, dose ratios, formulation methods, and injection routes of the components. Table

----

TITLE PARAGRAPH: Co-delivery of mRNAs to Assemble Protein Complexes
Antibodies, such as immunoglobulin G (IgG), and some antigens are assembled from more than one single-chain protein subunits. Co-delivery of mRNAs is an option to express these multi-subunit proteins to provide passive immunity or stimulate adaptive immune responses. All subunits need to be translated into one cell and assembled into a complex in the endoplasmic reticulum (ER), followed by translocation to their destinations

----

TITLE PARAGRAPH: Co-delivery of mRNAs Encoding Multiple Antigens
Two or more independent antigen-coding mRNAs can be co-delivered to enhance and broaden immune responses. To enhance immunity against one target, six VEEV self-amplifying mRNAs each encoding one antigen from the same parasite, Toxoplasma gondii, were co-formulated in an equal molar ratio by a PEI-based monodispersed ionizable dendrimer nanoparticle. IM injection of the co-formulated self-amplifying mRNA vaccine protected mice from the lethal challenge
To broaden immunity with a multivalent mRNA vaccine, three self-amplifying mRNAs encoding hemagglutinin (HA) from three different influenza virus strains were formulated by a medium-length PEI in equal mass, co-delivered to mice intramuscularly, and protected mice against viral challenge
When co-delivering several antigen-encoding mRNAs, one challenge is to elicit potent specific immune responses to every antigen. The immunostimulatory activity of each antigen may be different. For example, two influenza virus antigens, nucleoprotein, and matrix protein 1 (M1) were expressed from two self-amplifying mRNAs
Even if each mRNA-encoded antigen triggers sufficient immune response when used alone, the co-delivery of several antigens may lead to competition in epitope presentation and diminished response. For example, one group generated seven antigen-encoding mRNAs in order to develop one anti-hCMV mRNA vaccine

----

TITLE PARAGRAPH: Co-delivery of mRNAs Encoding Antigens and Immunostimulatory Proteins
While antigen-encoding mRNAs trigger the adaptive immune response, co-delivery of mRNAs encoding immunostimulatory proteins boost innate response to enhance vaccine efficacy. For example, a recent vaccine study against influenza A virus employed two self-amplifying mRNAs: one encoding the influenza A virus nucleoprotein antigen and the other encoding murine immunostimulatory GM-CSF
In one approach named TriMix, three protein-coding conventional mRNAs were used as immune-stimulators to enhance the dendritic cell-mediated immune response against cancer
However, the total amount of the three mRNAs varied depending on specific applications and delivery routes. One mRNA encoding an antigen was commonly mixed with the three mRNAs and administered simultaneously to initiate specific immunity. The dose of the antigen-encoding mRNA was equal or several-fold larger than each of the three mRNAs encoding immunostimulatory proteins
Another method of co-delivering mRNAs vaccines was called RNActive
Overall, the co-delivery of multiple mRNAs is a promising vaccination strategy. However, optimization is essential to determine the appropriate antigens to be expressed, delivery material, formulation method, mass ratio of components, and administration route. It is also necessary to examine whether the antigens expressed from the co-delivered mRNAs interfere with each other. If such interference is detected, modification of vaccination procedure, such as injection time, is likely needed to improve immune response.

----

TITLE PARAGRAPH: Current Challenges and Future Perspectives
While many carriers are effective in delivering mRNA vaccines in preclinical studies and clinical trials, there are still challenges to be addressed. The first challenge is delivery efficiency. During the delivery process, a large portion of RNA-loaded carriers is trapped in endosome/lysosome or recycled out of cells by exocytosis
Meanwhile, the molecular mechanisms of the delivery process demand further investigation

----

TITLE PARAGRAPH: Conclusion
mRNA has demonstrated its potential as a vaccine platform. In clinical trials, mRNA vaccines encoding antigen proteins from rabies virus, influenza virus, and cancers induced humoral and cellular responses in healthy volunteers and patients

----

DESCRIPTION TABLE: Major delivery routes of mRNA vaccines
Delivery||Access to APCs||Maximum injection volume per||Advantages 3||Challenges 4||
route||and lymphoid||site||None||None||None||
None||organs||Human 1||Mouse 2||None||None||
Intradermal||• Dermal DC||*0.1 mL||*0.05 mL||• Direct access to||• Local side||
None||• Lymph node DC||None||None||APCs||effect,||
None||• Lymph node||None||None||None||• Limited||
None||None||None||None||None||injection||
None||None||None||None||None||volume||
Subcutaneous • Dermal DC||*1 mL||*0.8 mL total at||• Larger injection||• Degradation||
None||• Lymph node DC||(Adult),||2-3 sites a||volume (than ID)||of mRNA||
None||• Lymph node||*0.5 mL||None||• Less local side effect||None||
None||None||(Child)||None||None||None||
Intramuscular • DC||1-3 mL||0.05 mL per site,||• Less local side effect||• Limited||
None||• Lymph node||(Adult),||maximum of 2-4||• Dense blood||injection||
None||None||0.5-2 mL||sites||networks||volume||
None||None||(Child)||None||None||None||
Intranodal||• Lymph node DC||*0.2 mL||0.01-0.02 mL||• High delivery||• Complicated||
None||• Lymph node||None||None||efficiency||procedures||
Intravenous||• Splenic DC||*20 mL||*0.1 mL (bolus) a||• Large injection||• Degradation||
None||• Lymph node DC||(bolus)||*0.5 mL (slow) a||volume||of mRNA||
None||• Spleen||None||None||• Direct access to||• Risk of||
None||• Lymph node||None||None||APCs and lymphoid||systemic||
None||None||None||None||organs||side effect||
a based on a 20-g mouse||None||None||None||None||
1, de Vries et al. (2005), Doyle and McCuteheon (2015), Sienkiewicz and Palmunen (2017)||None||
2, Diehl et al. (2001)||None||None||None||None||
3, Diehl et al. (2001), Moyer et al. (2016), Kashem et al. (2017), Liang et al. (2017), Sienkiewicz and Palmunen (2017)||
4,||None||None||None||None||None||

----

DESCRIPTION TABLE:
lists representative in vivo delivery of||
mRNA vaccines by LNPs. LNPs are developed for mRNA vaccine delivery for the||
following two main reasons. Firstly, LNPs can encapsulate RNA molecules, pro-||
tecting RNA from enzymatic degradation||

----

DESCRIPTION TABLE: Lipid-based nanoparticles (LNPs) delivery of mRNA vaccines in vivo

----

DESCRIPTION TABLE: Summary of the delivery strategies of mRNA vaccines
Delivery||Advantages||Challenges||Readiness||
format||None||None||for human a||
Lipid-based||• Protect mRNA from RNase||• Potential side effects||Clinical||
nanoparticles||degradation||None||trials||
None||• Efficient intracellular||None||None||
None||delivery of mRNA||None||None||
None||• High reproducibility||None||None||
None||• Easy to scale up||None||None||
Polymer-based||• Protect mRNA from RNase||• Potential side effects||Preclinical||
nanoparticles||degradation||• Polydispersity||mouse||
None||• Efficient intracellular||None||model||
None||delivery of mRNA||None||None||
Protamine||• Protect mRNA from RNase||• Low delivery||Clinical||
None||degradation||efficiency||trials||
None||• Protamine-mRNA complex||• mRNA complexed||None||
None||has adjuvant activity||with protamine is||None||
None||None||translated poorly||None||
Other peptides||• Protect mRNA from RNase||• Low delivery||Preclinical||
None||degradation||efficiency||mouse||
None||• Peptides offer many||None||model||
None||functions to be exploited||None||None||
Virus-like||• Protect mRNA from RNase||• Challenging to scale||Clinical||
replicon||degradation||up||trials||
particle||• Efficient intracellular||• Antibody production||None||
None||delivery of self-amplifying||against viral vectors||None||
None||mRNA||None||None||
None||• Strong expression||None||None||
Cationic||• Protect mRNA from RNase||• Limited delivery||Preclinical||
Nanoemulsion||degradation||efficiency||mouse||
None||• Squalene-based CNEs have||None||model||
None||adjuvant activity||None||None||
None||• Formulation can be prepared||None||None||
None||and stored without RNA for||None||None||
None||future use||None||None||
None||• Easy to scale up||None||None||
None||None||None||(continued)||

----

DESCRIPTION TABLE:
(continued)||None||None||
Delivery||Advantages||Challenges||Readiness||
format||None||None||for human a||
Naked mRNA||• Easy to store and prepare||• Prone to RNase||Clinical||
None||• Easy to scale up||degradation||trials||
None||None||• Low delivery||None||
None||None||efficiency||None||
DCs||• Efficient APCs critical for||• Heterogeneous cell||Clinical||
None||innate/adaptive immunity||population||trials||
None||• Biocompatibility||• Complex process to||None||
None||None||manipulate and||None||
None||None||characterize DCs||None||
a See Chap. 7 of this book for clinical development||None||None||

----
'''

In [62]:
chunk_1 = '''ABSTRACT
 mRNA vaccines have become a versatile technology for the prevention of infectious diseases and the treatment of cancers. In the vaccination process, mRNA formulation and delivery strategies facilitate effective expression and presentation of antigens, and immune stimulation. mRNA vaccines have been delivered in various formats: encapsulation by delivery carriers, such as lipid nanoparticles, polymers, peptides, free mRNA in solution, and ex vivo through dendritic cells. Appropriate delivery materials and formulation methods often boost the vaccine efficacy which is also influenced by the selection of a proper
'''
chunk_2 = '''TITLE PARAGRAPH: Introduction
Since the first use of in vitro transcribed messenger RNA (mRNA) to express an exogenous protein in mice in 1990
Several features of in vitro transcribed mRNA contribute to its vaccine potential. First, the development process of an mRNA vaccine can be much faster than conventional protein vaccines
The mRNAs used as vaccines can be categorized into conventional mRNAs and self-amplifying mRNAs. Conventional mRNAs are similar to endogenous mRNAs in mammalian cells, consisting of a 5' cap, 5' UTR, coding region, 3' UTR, and a polyadenylated tail
Three major types of proteins are encoded by mRNA vaccines: antigens
Advances in recent years made mRNA a promising vaccine platform. For example, chemical modifications of RNA using nucleotide analogs, such as pseudouridine, dramatically increased protein production in vivo by diminishing the translation inhibition triggered by the unmodified nucleotides
In this chapter, we summarize the routes of administrations for mRNA vaccines, discuss mRNA delivery carriers and their corresponding formulation methods, and overview the challenges and future development of mRNA vaccines. A comprehensive overview of recent advances in mRNA vaccine delivery may facilitate the future development of novel delivery strategies and effective mRNA vaccines.
'''

In [63]:
chunk_list = [chunk_1, chunk_2]

In [65]:
def generate_summary(text, llm, how="chunk"):
    """
    Summarize text with the given LLM.
    The text can be in 3 different formats:
        - chunk: a single paragraph
        - list : a list of paragraphs
        - full : a full document - not recommended for large documents that do not fit into memory
    Input: text, llm, how ("chunk", "list" or "full")
    Output: summary of text
    """
    # Defining the template to generate summary
    template = """
    Write a concise summary of the text, return your responses with 2-3 sentences that cover the key points of the text.
    ```{text}```
    SUMMARY:
    """
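    if how == "list":
        # Join the paragraphs so the prompt gets plain text instead of a Python list repr
        if isinstance(text, (list, tuple)):
            text = "\n\n".join(text)
        template = """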
        Write a concise summary of the list of texts, return a coherent summary that covers the key points of the text.
        ```{text}```
        SUMMARY:
        """
    elif how == "full":
        template = """
        Write a concise summary of the text, return your responses with 5 paragraphs that cover the key points of the text.
        ```{text}```
        SUMMARY:
        """
    prompt = PromptTemplate(template=template, input_variables=["text"])
    llm_chain = LLMChain(prompt=prompt, llm=llm)
    summary = llm_chain.run(text)
    return summary


# def summarize_text(chunk_list, llm):
#     """
#     Used mainly to transform the different summarised chunks into one coherent summary.
#     Input: chunk_list, llm
#     Output: document_summary
#     """
#     # Defining the template to generate summary
#     template = """
#     Write a concise summary of the text, return your responses with 2-3 sentences that cover the key points of the text.
#     ```{text}```
#     SUMMARY:
#     """
#     prompt = PromptTemplate(template=template, input_variables=["text"])
#     llm_chain = LLMChain(prompt=prompt, llm=llm)
#     summary = llm_chain.run(chunk)
#     return summary


# def summarize_full_text(text, llm):
#     """
#     Used mainly to summarize a full paper - if we had enough memory.
#     Input: chunk
#     Output: summary of chunk
#     """
#     instruction = "Summarize the following article {text}"
#     system_prompt = "You are an expert in summarization and expressing key ideas succinctly"

#     template = get_prompt(instruction, system_prompt)

#     prompt = PromptTemplate(template=template, input_variables=["text"])
#     llm_chain = LLMChain(prompt=prompt, llm=llm)

#     summarized_text = llm_chain.run(text)
#     parsed_text = parse_text(summarized_text)
#     return parsed_text

In [66]:
generate_summary(chunk_1, llm, how="chunk")



' This text discusses the use of mRNA vaccines for disease prevention and cancer treatment. The article highlights the importance of mRNA formulation and delivery strategies in facilitating effective antigen expression and immune stimulation. Various delivery formats, including encapsulation by delivery carriers, lipid nanoparticles, polymers, peptides, free mRNA in solution, and ex vivo through dendritic cells, are discussed. The choice of delivery material and formulation method can significantly impact vaccine efficacy.'

In [67]:
generate_summary(chunk_list, llm, how="list")

'\n        * mRNA vaccines have become a versatile technology for disease prevention and cancer treatment.\n        * mRNA vaccines can be delivered in various formats, including encapsulation by delivery carriers, such as lipid nanoparticles, polymers, peptides, and free mRNA in solution.\n        * Ex vivo delivery through dendritic cells is another option.\n        * The choice of delivery material and formulation method can significantly impact vaccine efficacy.\n        * mRNA vaccines have the potential to be developed more quickly than conventional protein vaccines.\n        * Self-amplifying mRNAs and conventional mRNAs are two categories of mRNAs used as vaccines.\n        * Antigens, proteins encoded by mRNA vaccines, play a crucial role in the immune response.\n        * Advances in mRNA chemistry have improved protein production in vivo by reducing translation inhibition.\n        * The challenges and future developments of mRNA vaccines include the need for further researc

In [56]:
summarize_text_chunk(chunk, llm)

' This text discusses the use of mRNA vaccines for disease prevention and cancer treatment. The article highlights the importance of mRNA formulation and delivery strategies in facilitating effective antigen expression and immune stimulation. Various delivery formats, including encapsulation by delivery carriers, lipid nanoparticles, polymers, peptides, free mRNA in solution, and ex vivo through dendritic cells, are discussed. The choice of delivery material and formulation method can significantly impact vaccine efficacy.'

In [40]:
summarize_full_text(chunk, llm)

"  Thank you for entrusting me with summarizing the article. Here's a concise summary of the main\npoints:  The article discusses the potential benefits of using artificial intelligence (AI) in\neducation. The author posits that AI can help personalize learning experiences, automate\nadministrative tasks, and provide real-time feedback to students. Additionally, AI can help teachers\nby taking on tasks such as grading and data analysis, allowing them to focus on more important\naspects of teaching. However, the author also acknowledges the limitations of AI and stresses the\nimportance of striking a balance between technology and human interaction in the classroom.  In\nsummary, the article explores the possibility of using AI to enhance education by providing\npersonalized learning experiences, automating administrative tasks, and improving teacher\nproductivity. While there are potential benefits to this approach, it is crucial to ensure that AI\ncomplements human interaction rather 

In [None]:
# summarized_text = summarize_text(text, llm)
# summarized_text

## Retrieve documents from a vectorstore

### Initializing the Hugging Face Embedding Pipeline

We begin by initializing the embedding pipeline that will handle the transformation of our docs into vector embeddings. We will use the `sentence-transformers/all-MiniLM-L6-v2` model for embedding.

In [5]:
from torch import cuda
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

embed_model = HuggingFaceEmbeddings(
    model_name=embed_model_id,
    model_kwargs={'device': device},
    encode_kwargs={'device': device, 'batch_size': 32}
)

  warn("The installed version of bitsandbytes was compiled without GPU support. "


/usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32


We can use the embedding model to create document embeddings like so:



In [6]:
docs = [
    "Vaccines are nice",
    "vaccines are the best"
]

embeddings = embed_model.embed_documents(docs)

print(f"We have {len(embeddings)} doc embeddings, each with "
      f"a dimensionality of {len(embeddings[0])}.")

We have 2 doc embeddings, each with a dimensionality of 384.


### Building the Vector Index

We now need to use the embedding pipeline to build our embeddings and store them in a Pinecone vector index. To begin, we'll initialize our index; for this we'll need a [free Pinecone API key](https://app.pinecone.io/).

TO DO: Externalise those to .env.local
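
One possible way to handle this TO DO is sketched below, assuming a plain `KEY=value` file named `.env.local` (the file name comes from the note above). The helper `load_env_file` is hypothetical and not part of the repository; it uses only the standard library, and a dedicated package could be used instead.

```python
import os


def load_env_file(path):
    """Hypothetical helper: read KEY=value pairs from a file into os.environ."""
    loaded = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            # Skip blank lines, comments and lines without an assignment
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            key = key.strip()
            value = value.strip().strip("'\"")  # drop optional surrounding quotes
            # Do not override variables that are already set in the environment
            os.environ.setdefault(key, value)
            loaded[key] = value
    return loaded
```

After calling `load_env_file('.env.local')`, the `pinecone.init` cell below can read `PINECONE_API_KEY` and `PINECONE_ENVIRONMENT` from `os.environ`, so the key never has to appear in the notebook itself.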

In [None]:
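import os

# Read the credentials from the environment instead of hardcoding them in the notebook
PINECONE_API_KEY = os.environ.get('PINECONE_API_KEY', '')
PINECONE_ENVIRONMENT = os.environ.get('PINECONE_ENVIRONMENT', '')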

In [7]:
import os
import pinecone

# get API key from app.pinecone.io and environment from console
pinecone.init(
    api_key=os.environ['PINECONE_API_KEY'],  # set in the environment; never commit the key
    environment=os.environ.get('PINECONE_ENVIRONMENT') or 'gcp-starter'
)

In [9]:
# Index initialisation
import time

index_name = 'llama-2-rag'

if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=len(embeddings[0]),
        metric='cosine'
    )
    # wait for index to finish initialization
    while not pinecone.describe_index(index_name).status['ready']:
        time.sleep(1)

In [10]:
# connect to the index:

index = pinecone.Index(index_name)
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.00286,
 'namespaces': {'': {'vector_count': 286}},
 'total_vector_count': 286}


With our index and embedding process ready, we can move on to the indexing process itself.

TO DO: Add the kb_path to env.local

In [11]:
import re
import os
import glob
import pandas as pd

# Define the folder path
folder_path = '/content/instadeep-llm-technical-test/ressources/data/raw_text'

# Get a list of all .txt files in the folder
txt_files = glob.glob(os.path.join(folder_path, '*.txt'))

In [12]:
# Initialize an empty list to store data
data_content = []

# Loop through each file, read its content, and append to the list
for doc_id, txt_file in enumerate(txt_files):
    try:
        # glob already returns the full path, so no further join is needed
        file_path = txt_file
        print(f'Importing {file_path}')
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()

            # Split content into documents based on "----"
            documents = re.split(r'----', content)
            file_name = os.path.basename(txt_file)

            # Process each document
            for chunk_id, document in enumerate(documents):
                # Extract chunks based on "TITLE PARAGRAPH:"
                chunks = re.split(r'TITLE PARAGRAPH:', document)

                # Process each chunk
                for sub_chunk_id, chunk in enumerate(chunks):
                    # Skip empty chunks
                    if not chunk.strip():
                        continue

                    # Extract chunk title
                    title_match = re.search(r'(.*?)\n', chunk)
                    chunk_title = title_match.group(1).strip() if title_match else None

                    data_content.append({
                        'file_name': file_name,
                        'chunk_id': f'{doc_id}-{chunk_id}-{sub_chunk_id}',
                        'doc_id': doc_id,
                        'chunk_title': chunk_title,
                        'chunk': chunk.strip(),
                        'chunk_length': len(chunk),
                        'doc':content,
                        'doc_length': len(content)
                    })
    except Exception as e:
        print(f"Error reading {txt_file}: {e}")

# Create a Pandas DataFrame from the list
data = pd.DataFrame(data_content)
data.head()
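
To make the splitting rule concrete, here is a small self-contained sketch that applies the same two separators (`----` between sections, `TITLE PARAGRAPH:` before each titled paragraph) to a made-up string. The sample text and the helper name `split_into_chunks` are illustrative assumptions, not taken from the dataset. Unlike the cell above, the sketch strips each piece before reading its first line as the title.

```python
import re


def split_into_chunks(content):
    """Illustrative helper: split raw text on '----', then on 'TITLE PARAGRAPH:'."""
    chunks = []
    for section in re.split(r'----', content):
        for piece in re.split(r'TITLE PARAGRAPH:', section):
            piece = piece.strip()
            if not piece:
                continue  # skip empty pieces left over by the separators
            # The first line of each piece is treated as its title
            title = piece.split('\n', 1)[0].strip()
            chunks.append({'chunk_title': title, 'chunk': piece})
    return chunks


# Made-up sample mimicking the layout of the raw text files
sample = (
    "ABSTRACT\n mRNA vaccines are versatile.\n"
    "TITLE PARAGRAPH: Introduction\nSome background.\n"
    "----\n"
    "DESCRIPTION TABLE:\nA||B||\n"
)
for item in split_into_chunks(sample):
    print(repr(item['chunk_title']), len(item['chunk']))
```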

In [29]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 286 entries, 0 to 285
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   file_name     286 non-null    object
 1   chunk_id      286 non-null    object
 2   doc_id        286 non-null    int64 
 3   chunk_title   286 non-null    object
 4   chunk         286 non-null    object
 5   chunk_length  286 non-null    int64 
 6   doc           286 non-null    object
 7   doc_length    286 non-null    int64 
dtypes: int64(3), object(5)
memory usage: 18.0+ KB


TO DO: MANUALLY ADD THE TITLE COLUMN AND THE RELEASE DATE

In [None]:
# embed and index the documents - This must only be done once
batch_size = 32

for i in range(0, len(data), batch_size):
    i_end = min(len(data), i+batch_size)
    batch = data.iloc[i:i_end]
    ids = [f"{x['chunk_id']}" for _, x in batch.iterrows()]
    texts = [x['chunk'] for _, x in batch.iterrows()]
    embeds = embed_model.embed_documents(texts)
    # get metadata to store in Pinecone
    metadata = [
        {'text': x['chunk'],
         'chunk_title': x['chunk_title'],
         'file_name': x['file_name'],
         'doc_id': x['doc_id']
        } for _, x in batch.iterrows()
    ]
    # add to Pinecone
    index.upsert(vectors=zip(ids, embeds, metadata))


index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.00286,
 'namespaces': {'': {'vector_count': 286}},
 'total_vector_count': 286}

In [13]:
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.00286,
 'namespaces': {'': {'vector_count': 286}},
 'total_vector_count': 286}

## Initializing a RetrievalQA Chain

For Retrieval Augmented Generation (RAG) in LangChain we need to initialize either a `RetrievalQA` or a `RetrievalQAWithSourcesChain` object. Both require an LLM (which we have already initialized) and a Pinecone index wrapped in a LangChain vector store object.

Initializing the LangChain vector store:

In [14]:
from langchain.vectorstores import Pinecone

def get_top_k_documents(query, k=3):

    text_field = 'text'  # field in metadata that contains text content

    vectorstore = Pinecone(
        index, embed_model.embed_query, text_field
    )

    top_k_docs = vectorstore.similarity_search_with_score(
        query,  # the search query
        k=k  # number of most relevant chunks to return
    )
    return top_k_docs

In [15]:
query = 'mRNA vaccines have become a versatile technology for the prevention of infectious diseases and the treatment of cancers.'
get_top_k_documents(query, k=3)

[(Document(page_content='Conclusions and future directions\nCurrently, mRNA vaccines are experiencing a burst in basic and clinical research. The past 2 years alone have witnessed the publication of dozens of preclinical and clinical reports showing the efficacy of these platforms. Whereas the majority of early work in mRNA vaccines focused on cancer applications, a number of recent reports have demonstrated the potency and versatility of mRNA to protect against a wide variety of infectious pathogens, including influenza virus, Ebola virus, Zika virus, Streptococcus spp. and T. gondii (TABLES 1,2).\nWhile preclinical studies have generated great optimism about the prospects and advantages of mRNAbased vaccines, two recent clinical reports have led to more tempered expectations \nRecent advances in understanding and reducing the innate immune sensing of mRNA have aided efforts not only in active vaccination but also in several applications of passive immunization or passive immunotherap

In [16]:
query = "how's the weather like today?"

get_top_k_documents(query, k=3)

[(Document(page_content='n engl j med 383;27 nejm.org December 31, 2020', metadata={'chunk_title': '', 'doc_id': 12.0, 'file_name': 'Safety and Efficacy of the BNT162b2 mRNA Covid-19 Vaccine.txt'}),
  0.155571163),
 (Document(page_content='ABSTRACT', metadata={'chunk_title': 'ABSTRACT', 'doc_id': 7.0, 'file_name': 's41591-022-02061-1.txt'}),
  0.141363591),
 (Document(page_content='Lessons Learned from COVID-19\nThe unprecedented speed of the global spread of the COVID-19 pandemic caused by the coronavirus, SARS-CoV2, resulted in an extremely rapid development of mRNA vaccines \nAlthough SARS viruses are common in humans, vaccines had not been developed since the course of the infection normally was very mild. The SARS outbreak in early 2000 triggered DNA vaccine development', metadata={'chunk_title': 'Lessons Learned from COVID-19', 'doc_id': 8.0, 'file_name': 'biomedicines-11-00308-v2.txt'}),
  0.132560551)]
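
The scores above come from an index created with `metric='cosine'`, so they can be read as cosine similarities between the query embedding and each stored chunk embedding. As a minimal sketch of that computation, in plain Python and with made-up 3-dimensional vectors standing in for the real 384-dimensional embeddings:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values, for illustration only)
query_vec = [1.0, 0.0, 1.0]
close_vec = [0.9, 0.1, 1.1]
far_vec = [0.0, 1.0, 0.0]

print(cosine_similarity(query_vec, close_vec))  # close to 1: similar direction
print(cosine_similarity(query_vec, far_vec))    # 0.0: orthogonal vectors
```

Seen this way, the weather query scoring roughly 0.13 to 0.16 against its best matches suggests that nothing in the knowledge base is related to it.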

## Combine RAG and summarisation

In [90]:
def doc_search(query, top_k=3):
    search_results = []

    documents = get_top_k_documents(query, k=top_k)
    # Loop through the (document, score) pairs and collect the metadata with the score
    for doc, score in documents:
        metadata = dict(doc.metadata)  # copy so the retrieved document is not mutated
        metadata['similarity_score'] = score
        search_results.append(metadata)

    # Create a result DataFrame
    res_df = pd.DataFrame(search_results)
    return res_df

In [91]:
doc_search_result = doc_search(query, top_k = 3)
doc_search_result

Unnamed: 0,chunk_title,doc_id,file_name,similarity_score
0,Conclusions and future directions,4.0,mRNA vaccines — a new era.txt,0.837178
1,Safety,4.0,mRNA vaccines — a new era.txt,0.809183
2,mRNA Vaccines Against Infectious Diseases,0.0,nanomaterials-10-00364-v2.txt,0.806486


In [37]:
to_summarise_df = (pd.merge(doc_search_result, data, on=['file_name', 'chunk_title'])
             .groupby(['file_name', 'chunk_title'])
             .first()
             .reset_index()[['file_name', 'chunk_title', 'doc', 'similarity_score']]
             .sort_values(by='similarity_score', ascending=False))
to_summarise_df

Unnamed: 0,file_name,chunk_title,doc,similarity_score
0,mRNA vaccines — a new era.txt,Conclusions and future directions,ABSTRACT\n Vaccines prevent many millions of i...,0.837178
1,mRNA vaccines — a new era.txt,Safety,ABSTRACT\n Vaccines prevent many millions of i...,0.809183
2,nanomaterials-10-00364-v2.txt,mRNA Vaccines Against Infectious Diseases,ABSTRACT\n The use of messenger RNA (mRNA) in ...,0.806486


In [26]:
# import pandas as pd

# data = {
#     'file_name': ['mRNA vaccines — a new era.txt', 'mRNA vaccines — a new era.txt', 'nanomaterials-10-00364-v2.txt'],
#     'chunk_title': ['Conclusions and future directions', 'Safety', 'mRNA Vaccines Against Infectious Diseases'],
#     'doc': [
#         'ABSTRACT\n Vaccines prevent many millions of i...',
#         'ABSTRACT\n Vaccines prevent many millions of i...',
#         'ABSTRACT\n The use of messenger RNA (mRNA) in ...'
#     ],
#     'similarity_score': [0.837178, 0.809183, 0.806486]
# }

# to_summarise_df = pd.DataFrame(data)

# to_summarise_df

Unnamed: 0,file_name,chunk_title,doc,similarity_score
0,mRNA vaccines — a new era.txt,Conclusions and future directions,ABSTRACT\n Vaccines prevent many millions of i...,0.837178
1,mRNA vaccines — a new era.txt,Safety,ABSTRACT\n Vaccines prevent many millions of i...,0.809183
2,nanomaterials-10-00364-v2.txt,mRNA Vaccines Against Infectious Diseases,ABSTRACT\n The use of messenger RNA (mRNA) in ...,0.806486


In [60]:
# Summarise the extracted papers
# def summarize_text_chunk(text):

#     return "Summary: " + text[:10] + "..."


# to_summarise_df_ = data.merge(
#     data.groupby('file_name')['doc']
#         .apply(lambda x: summarize_text_chunk(x.iloc[0]))
#         .reset_index(name='summarized_doc'),
#     on=['file_name', 'doc']
# )
# to_summarise_df_ = data.merge(
#     data.groupby('file_name')['doc']
#         .apply(lambda x: summarize_text_chunk(x.iloc[0]))
#         .reset_index(name='summarized_doc'),
#     on='file_name'
# )

# Add a placeholder 'similarity_score' column to the data dataframe
data['similarity_score'] = None

# Merge the two dataframes based on the "file_name" column
merged_df = pd.merge(data, to_summarise_df[['file_name']], on='file_name')

# Filter the merged dataframe to keep only relevant columns
final_df = merged_df[['file_name', 'chunk_title', 'doc', 'similarity_score', 'chunk']]

# Apply generate_summary to each chunk
final_df['summarized_chunk'] = final_df['chunk'].apply(lambda x: generate_summary(x, llm, how="chunk"))

# Display the final dataframe
final_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_df['summarized_chunk'] = final_df['chunk'].apply(lambda x: summarize_text_chunk(x,llm))


Unnamed: 0,file_name,chunk_title,doc,similarity_score,chunk,summarized_chunk
0,nanomaterials-10-00364-v2.txt,ABSTRACT,ABSTRACT\n The use of messenger RNA (mRNA) in ...,,ABSTRACT\n The use of messenger RNA (mRNA) in ...,The use of mRNA in gene therapy has gained po...
1,nanomaterials-10-00364-v2.txt,Introduction,ABSTRACT\n The use of messenger RNA (mRNA) in ...,,Introduction\nAccording to the European Medici...,Gene therapy involves using genetic material ...
2,nanomaterials-10-00364-v2.txt,Structure of Synthetic IVT mRNA and Chemical M...,ABSTRACT\n The use of messenger RNA (mRNA) in ...,,Structure of Synthetic IVT mRNA and Chemical M...,The production of IVT mRNA is typically done ...
3,nanomaterials-10-00364-v2.txt,Figure 2.,ABSTRACT\n The use of messenger RNA (mRNA) in ...,,Figure 2.\nRepresentative scheme of the IVT mR...,The figure depicts an illustration of the IVT...
4,nanomaterials-10-00364-v2.txt,5' Cap,ABSTRACT\n The use of messenger RNA (mRNA) in ...,,5' Cap\nEukaryotic native mRNA possesses a 5' ...,The 5' cap of eukaryotic mRNA is formed by th...
...,...,...,...,...,...,...
97,mRNA vaccines — a new era.txt,,ABSTRACT\n Vaccines prevent many millions of i...,,DESCRIPTION TABLE: cont.) |,"In this article, the author discusses the pot..."
98,mRNA vaccines — a new era.txt,,ABSTRACT\n Vaccines prevent many millions of i...,,DESCRIPTION TABLE: \nNone||None||Targets||Tria...,This table lists clinical trials conducted at...
99,mRNA vaccines — a new era.txt,,ABSTRACT\n Vaccines prevent many millions of i...,,DESCRIPTION TABLE: \nNone||None||Targets||Tria...,This table lists clinical trials conducted at...
100,mRNA vaccines — a new era.txt,,ABSTRACT\n Vaccines prevent many millions of i...,,"DESCRIPTION TABLE: , Biomedical Advanced Resea...",This table lists various biotechnology compan...


In [68]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 102 entries, 0 to 101
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   file_name         102 non-null    object
 1   chunk_title       102 non-null    object
 2   doc               102 non-null    object
 3   similarity_score  0 non-null      object
 4   chunk             102 non-null    object
 5   summarized_chunk  102 non-null    object
dtypes: object(6)
memory usage: 5.6+ KB


In [74]:
# Group by 'file_name' and aggregate the 'summarized_chunk' into a list
grouped_df = final_df.groupby('file_name')['summarized_chunk'].agg(list).reset_index()

# Merge the grouped dataframe back into to_summarise_df
to_summarise_df = pd.merge(to_summarise_df, grouped_df, on='file_name', how='left')

# Display the final to_summarise_df
to_summarise_df

                       file_name                                chunk_title  \
0  mRNA vaccines — a new era.txt          Conclusions and future directions   
1  mRNA vaccines — a new era.txt                                     Safety   
2  nanomaterials-10-00364-v2.txt  mRNA Vaccines Against Infectious Diseases   

                                                 doc  similarity_score  \
0  ABSTRACT\n Vaccines prevent many millions of i...          0.837178   
1  ABSTRACT\n Vaccines prevent many millions of i...          0.809183   
2  ABSTRACT\n The use of messenger RNA (mRNA) in ...          0.806486   

                                    summarized_chunk  
0  [ * Vaccines prevent millions of illnesses and...  
1  [ * Vaccines prevent millions of illnesses and...  
2  [ The use of mRNA in gene therapy has gained p...  


In [75]:
to_summarise_df

Unnamed: 0,file_name,chunk_title,doc,similarity_score,summarized_chunk
0,mRNA vaccines — a new era.txt,Conclusions and future directions,ABSTRACT\n Vaccines prevent many millions of i...,0.837178,[ * Vaccines prevent millions of illnesses and...
1,mRNA vaccines — a new era.txt,Safety,ABSTRACT\n Vaccines prevent many millions of i...,0.809183,[ * Vaccines prevent millions of illnesses and...
2,nanomaterials-10-00364-v2.txt,mRNA Vaccines Against Infectious Diseases,ABSTRACT\n The use of messenger RNA (mRNA) in ...,0.806486,[ The use of mRNA in gene therapy has gained p...


In [76]:
to_summarise_df['doc_summary'] = to_summarise_df['summarized_chunk'].apply(lambda text_list: generate_summary(text_list, llm, how="list"))
summarized_retrieved_data = to_summarise_df



OutOfMemoryError: ignored
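
One possible way around this, assuming the error stems from the combined list of chunk summaries producing a prompt that is too long for the GPU, is to merge the summaries in small groups and repeat until a single summary remains. The sketch below is not what the notebook runs: `summarize` is a hypothetical callable standing in for `generate_summary(text, llm, how="chunk")`, and `max_chars` is an arbitrary budget.

```python
def reduce_summaries(summaries, summarize, max_chars=2000):
    """Hypothetical helper: merge summaries group by group until one remains."""
    current = [s.strip() for s in summaries if s.strip()]
    if not current:
        return ""
    while len(current) > 1:
        groups, group, size = [], [], 0
        for summary in current:
            # Start a new group once the character budget would be exceeded
            if group and size + len(summary) > max_chars:
                groups.append(group)
                group, size = [], 0
            group.append(summary)
            size += len(summary)
        groups.append(group)
        if len(groups) == len(current):
            # Every summary fills the budget on its own: merge pairs to guarantee progress
            groups = [current[i:i + 2] for i in range(0, len(current), 2)]
        current = [summarize("\n".join(g)) for g in groups]
    return current[0]


# Stand-in summarizer for demonstration only: keeps the first 40 characters
def fake_summarize(text):
    return text[:40]


print(reduce_summaries(["first summary " * 5, "second summary " * 5, "third"],
                       fake_summarize, max_chars=100))
```

Each pass calls the model once per group, so the prompt size stays bounded by roughly `max_chars` at the cost of more LLM calls.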

In [9]:
import os
from datetime import datetime

def export_data(data, output_file_name, output_folder_path):
    """Save a DataFrame as a timestamped CSV file in output_folder_path."""
    # Current date and time, used to make the file name unique
    current_time = datetime.now().strftime('%Y%m%d_%H%M%S')

    # Build the output path and write the CSV
    csv_filename = f'{output_file_name}_{current_time}.csv'
    csv_data_path = os.path.join(output_folder_path, csv_filename)
    data.to_csv(csv_data_path)

In [None]:
# Export the summarized data

output_file_name = 'summarized_retrieved_data'
output_folder_path = '/content/instadeep-llm-technical-test/ressources/data/outputs/summarised_docs'
# output_folder_path = '/content/drive/MyDrive/InstaDeep'

summarized_documents = summarized_retrieved_data[['file_name', 'chunk_title', 'similarity_score', 'doc_summary']]
# summarized_documents = to_summarise_df[['file_name', 'chunk_title', 'similarity_score', 'summarized_chunk']]

export_data(summarized_documents, output_file_name, output_folder_path)

In [78]:
# from datetime import datetime
# # Export the summarised documents:

# # Get today's date with the hour
# current_time = datetime.now().strftime('%Y%m%d_%H%M%S')

# # Save to_summarise_df to a CSV file with the current timestamp
# csv_filename = f'/summarized_retrieved_data_{current_time}.csv'
# output_folder_path = '/content/instadeep-llm-technical-test/ressources/data/outputs/summarised_docs'
# # output_folder_path = '/content/drive/MyDrive/InstaDeep'
# csv_data_path = output_folder_path + csv_filename

# summarised_documemts = summarized_retrieved_data[['file_name', 'chunk_title', 'similarity_score', 'doc_summary']]
# # summarised_documemts = to_summarise_df[['file_name', 'chunk_title', 'similarity_score', 'summarized_chunk']]

# summarised_documemts.to_csv(csv_data_path)

TO DO: export the output_folder_path

## Build the Validation pipeline:

In [80]:
# from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.metrics.pairwise import cosine_similarity
# import pandas as pd
# import random

# # Dummy function to generate similarity scores
# def calculate_similarity(paragraph1, paragraph2):
#     vectorizer = TfidfVectorizer()
#     vectors = vectorizer.fit_transform([paragraph1, paragraph2])
#     similarity_matrix = cosine_similarity(vectors)
#     return similarity_matrix[0, 1]

# # Dummy document
# text = '''
# # ... (your document content)
# '''

# # Extract paragraphs
# paragraphs = [para.strip() for para in text.split('\n\n') if para.strip()]

# # Generate a dummy dataset
# data = {'parag1': [], 'parag2': [], 'is_similar': []}

# # Include related paragraphs
# for _ in range(50):
#     parag1 = random.choice(paragraphs)
#     parag2 = random.choice(paragraphs)
#     similarity_score = calculate_similarity(parag1, parag2)
#     is_similar = 1 if similarity_score > 0.5 else 0
#     data['parag1'].append(parag1)
#     data['parag2'].append(parag2)
#     data['is_similar'].append(is_similar)

# # Include unrelated paragraphs
# for _ in range(50):
#     parag1 = random.choice(paragraphs)
#     parag2 = random.choice(paragraphs)
#     similarity_score = calculate_similarity(parag1, parag2)
#     is_similar = 0  # Set is_similar to 0 for unrelated paragraphs
#     data['parag1'].append(parag1)
#     data['parag2'].append(parag2)
#     data['is_similar'].append(is_similar)

# # Create DataFrame
# df = pd.DataFrame(data)

# # Display the dummy validation set
# print(df.head())


In [3]:
# Read the validation set
import pandas as pd

validation_data_path = "../ressources/data/inputs/validation/val_data.xlsx"

val_df = pd.read_excel(validation_data_path)
val_df

Unnamed: 0,chunk,file_name,is_similar
0,mRNA vaccines have become a versatile technolo...,82_2020_217.txt\t,1
1,Since the first use of in vitro transcribed me...,82_2020_217.txt\t,1
2,The administration route for mRNA vaccines pla...,82_2020_217.txt\t,1
3,"Lipids, lipid-like compounds, and lipid deriva...",82_2020_217.txt\t,1
4,"Polymeric materials, including polyamines, den...",82_2020_217.txt\t,1
5,The mRNA vaccines can be delivered without any...,82_2020_217.txt\t,1
6,Despite the promising progress in mRNA vaccine...,82_2020_217.txt\t,1
7,"In conclusion, mRNA vaccines represent a revol...",82_2020_217.txt\t,1
8,Analysis\nFor analysis of the primary end poin...,Efficacy and Safety of the mRNA-1273 SARS-CoV-...,1
9,"Between July 27, 2020, and October 23, 2020, a...",Efficacy and Safety of the mRNA-1273 SARS-CoV-...,1


In [18]:
# Get the 3 most similar chunks for each validation query
val_df['top_3_doc'] = val_df['chunk'].apply(lambda query: get_top_k_documents(query, k=3))

In [19]:
# Predict 1 (similar) when the best match has a similarity score of at least 0.5
val_df['is_similar_pred'] = val_df['top_3_doc'].apply(lambda d: 0 if d[0][-1] < 0.5 else 1)

val_df

Unnamed: 0,chunk,file_name,is_similar,top_3_doc,is_similar_pred
0,mRNA vaccines have become a versatile technolo...,nanomaterials-10-00364-v2.txt,1,[(page_content='ABSTRACT\n mRNA vaccines have ...,1
1,how's the weather like today?,Safety and Efficacy of the BNT162b2 mRNA Covid...,0,[(page_content='n engl j med 383;27 nejm.org D...,0


In [4]:
from sklearn.metrics import precision_score, recall_score, f1_score
import pandas as pd

# Sample data
val_data = {
    'chunk': [
        "mRNA vaccines have become a versatile technology for the prevention of infectious diseases and the treatment of cancers. In the vaccination process, mRNA formulation and delivery strategies facilitate effective expression and presentation of antigens, and immune stimulation. mRNA vaccines have been delivered in various formats: encapsulation by delivery carriers, such as lipid nanoparticles, polymers, peptides, free mRNA in solution, and ex vivo through dendritic cells. Appropriate delivery materials and formulation methods often boost the vaccine efficacy which is also influenced by the selection of a proper",
        "how's the weather like today?",
        "vaccines are the best",
    ],
    'file_name': [
        'nanomaterials-10-00364-v2.txt',
        'Safety and Efficacy of the BNT162b2 mRNA Covid...',
        '',
    ],
    'is_similar': [1, 0, 0],
    'top_3_doc': [
        [("page_content='ABSTRACT\n mRNA vaccines have ...", 1), ("...", 1)],
        [("page_content='n engl j med 383;27 nejm.org D...", 0), ("...", 0)],
        [("page_content='ABSTRACT\n mRNA vaccines have ...", 1), ("...", 1)]
    ],
    'is_similar_pred': [1, 0, 1]
}

# Create DataFrame
val_df = pd.DataFrame(val_data)


# Evaluate precision, recall, and F1 score
precision = precision_score(val_df['is_similar'], val_df['is_similar_pred'])
recall = recall_score(val_df['is_similar'], val_df['is_similar_pred'])
f1 = f1_score(val_df['is_similar'], val_df['is_similar_pred'])

# Display the metrics
print("\nPerformance Metrics:")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")



Performance Metrics:
Precision: 0.50
Recall: 1.00
F1 Score: 0.67


In [5]:
import json
import os

def get_performance_metrics(val_df, output_metrics_path):
    """Compute precision, recall and F1 on val_df and export them as JSON."""
    # Evaluate precision, recall, and F1 score
    precision = precision_score(val_df['is_similar'], val_df['is_similar_pred'])
    recall = recall_score(val_df['is_similar'], val_df['is_similar_pred'])
    f1 = f1_score(val_df['is_similar'], val_df['is_similar_pred'])

    metrics = {
        'precision': precision,
        'recall': recall,
        'f1_score': f1
    }

    # Export the metrics as JSON, creating the output folder if it does not exist
    os.makedirs(output_metrics_path, exist_ok=True)
    with open(os.path.join(output_metrics_path, 'validation_metrics.json'), 'w') as file:
        json.dump(metrics, file, indent=4)
    return metrics


In [6]:
output_metrics_path = "../ressources/data/outputs/validation"
get_performance_metrics(val_df, output_metrics_path)

{'precision': 0.5, 'recall': 1.0, 'f1_score': 0.6666666666666666}

In [12]:
# Export the validation data with predictions

output_file_name = 'val_data'
output_folder_path_val_data = '/content/instadeep-llm-technical-test/ressources/data/outputs/val'
# output_folder_path_val_data = '/content/drive/MyDrive/InstaDeep'


export_data(data=val_df, output_file_name=output_file_name, output_folder_path=output_folder_path_val_data)

Precision: Precision is the ratio of correctly predicted positive observations to the total predicted positives. In this context:

Precision = True Positives / (True Positives + False Positives)
Precision = 1 / (1 + 1) = 0.50
A precision of 0.50 means that 50% of the predicted similar instances were actually similar, and the other 50% were false positives.

Recall (Sensitivity): Recall is the ratio of correctly predicted positive observations to all actual positives. In this context:

Recall = True Positives / (True Positives + False Negatives)
Recall = 1 / (1 + 0) = 1.00
A recall of 1.00 means that the model captured all the actual similar instances. There were no instances that were actual positives and were missed by the model.

F1 Score: The F1 Score is the harmonic mean of precision and recall. It ranges from 0 to 1, where a higher value indicates better model performance. In this context:

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
F1 Score = 2 * (0.50 * 1.00) / (0.50 + 1.00) = 0.67
An F1 Score of 0.67 lies between the precision (0.50) and the recall (1.00), closer to the lower of the two, because the harmonic mean penalises the weaker metric. It accounts for both false positives and false negatives, providing a single metric that summarizes the model's performance.

In summary, on this three-row sample the model has perfect recall (it captures all actual positives) but has room for improvement in precision, since one of its two positive predictions is a false positive. The F1 Score combines both into a single figure.
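
The arithmetic above can be checked by hand. The following minimal sketch recomputes the three metrics from the confusion counts, using the `is_similar` and `is_similar_pred` values of the sample data; it is meant as a cross-check of the scikit-learn results, not as a replacement for them.

```python
y_true = [1, 0, 0]  # is_similar from the sample data
y_pred = [1, 0, 1]  # is_similar_pred from the sample data

# Count true positives, false positives and false negatives
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp}, FP={fp}, FN={fn}")
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
# TP=1, FP=1, FN=0
# Precision: 0.50, Recall: 1.00, F1: 0.67
```

The sketch does not guard against empty denominators, which is safe here because the sample contains at least one predicted and one actual positive.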

TO DO:

- Export the val_df
- Export the metrics into a JSON file
- Read all the output folder paths from the env.local (only the paths, not the file names)
- Add logging: File saved at {location}
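
One possible way to approach the last two to-do items is sketched below, using only the standard library. It assumes that env.local holds plain `KEY=VALUE` lines; the key `OUTPUT_METRICS_PATH` and the helpers `read_env_paths` and `save_text` are hypothetical names chosen for illustration and are not part of the repository.

```python
import logging
import os
import tempfile

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def read_env_paths(env_file):
    """Parse KEY=VALUE lines from an env file, skipping blanks and comments."""
    paths = {}
    with open(env_file) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            paths[key.strip()] = value.strip().strip('"\'')
    return paths

def save_text(content, folder, file_name):
    """Write content to folder/file_name and log where it was saved."""
    os.makedirs(folder, exist_ok=True)
    location = os.path.join(folder, file_name)
    with open(location, "w") as f:
        f.write(content)
    logger.info("File saved at %s", location)
    return location

# Demo with a temporary env file: only the folder path is stored in it,
# the file name is supplied by the calling code
with tempfile.TemporaryDirectory() as tmp:
    env_file = os.path.join(tmp, "env.local")
    with open(env_file, "w") as f:
        f.write(f"OUTPUT_METRICS_PATH={os.path.join(tmp, 'validation')}\n")
    paths = read_env_paths(env_file)
    saved = save_text("{}", paths["OUTPUT_METRICS_PATH"], "validation_metrics.json")
    saved_exists = os.path.exists(saved)
```

Keeping only folder paths in the env file lets each export function decide its own file name, which matches the note that file names should not be read from it.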