# INTRO 👋

2023 marked the rise of the GenAI buzz, and many companies worldwide are working hard to take advantage of its capabilities to solve advanced problems. The [Malawi Public Health Systems LLM Challenge](https://zindi.africa/competitions/malawi-public-health-systems-llm-challenge) is one of the first GenAI competitions on Zindi. If you are new to Generative AI or perhaps just trying to get a sense of how to solve this challenge, then this notebook can get you started. This notebook is focused on RAG (Retrieval Augmented Generation), which leverages existing LLMs to perform Q&A using a provided context. 

# The Data 📊

When examining the data tab on Zindi, we find three files. Let's break it down:

- **Train.csv:** This file contains 748 rows × 6 columns, which can be used to train a model.
- **Test.csv:** This file contains 499 rows, which are the test questions.
- **SampleSubmission.csv:** This CSV file is the sample submission format that Zindi expects.

### Extras 📁:
- **MWTGBookletsExcel:** This folder contains six Excel spreadsheets. The original filenames were too long, so I manually shortened them to keep things simple. If you'd like to do the same, please make sure you do it via code.

Original Filenames:

1. TG Booklet 1 Introduction Module Booklet 1TG_final_04112021.xlsx
2. TG Booklet 2 Sections 1,2,3_final_04112021.xlsx
3. TG Booklet 3 Section 4,5,6,7_final_04112021.xlsx
4. TG Booklet 4 Sections 8, 9_final_04112021.xlsx
5. TG Booklet 5 Section 10_final_04112021.xlsx
6. TG Booklet 6_Section 11_final_04112021.xlsx

Renamed:

1. TG Booklet 1.xlsx
2. TG Booklet 2.xlsx
3. TG Booklet 3.xlsx
4. TG Booklet 4.xlsx
5. TG Booklet 5.xlsx
6. TG Booklet 6.xlsx
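
If you would rather do the renaming in code, something like the sketch below could work. It assumes every original filename starts with `TG Booklet <number>`, as in the list above. `short_booklet_name` and `rename_booklets` are hypothetical helper names, and the folder must be writable (Kaggle's `/kaggle/input` is read-only, so you would need to copy the files elsewhere first).

```python
import re
from pathlib import Path


def short_booklet_name(filename: str) -> str:
    """Return e.g. 'TG Booklet 3.xlsx' for a long original booklet filename."""
    match = re.match(r"(TG Booklet \d+)", filename)
    if match is None:
        raise ValueError(f"Unexpected booklet filename: {filename!r}")
    return f"{match.group(1)}.xlsx"


def rename_booklets(folder: str) -> None:
    """Rename every .xlsx file in a writable folder to its short name."""
    # list() takes a snapshot first, so renaming does not disturb the iteration
    for file in list(Path(folder).glob("*.xlsx")):
        file.rename(file.with_name(short_booklet_name(file.name)))


print(short_booklet_name("TG Booklet 4 Sections 8, 9_final_04112021.xlsx"))
```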


## Requirements 🛠️

Just a basic setup: please use a GPU-enabled setup for your inference, but don't spend the whole day on it. I'm using Kaggle since they offer free GPUs; you can also use any other free platform you like... I guess.. 😬. Of course, if you don't have access to GPUs for some reason, you can also run it on CPUs.

###### Last Thing Before You Start! 🚀

Read the description before you start coding, so you can have some insight into the challenge. This notebook does not cover training/fine-tuning a model. Below is a simple workflow.

## Setting up Huggingface. 🤖

### TheBloke's GGUF and GGML models on Hugging Face Hub (Over 3k LLMs)

This challenge requires full CPU compatibility. TheBloke, also known as Tom Jobbins, is a top contributor to the Hugging Face Hub. He specializes in large language models (LLMs) and quantization techniques. He also offers several GGUF and GGML models under his profile. So you can find the quantized equivalent of the model you need on his repo [here](https://huggingface.co/TheBloke) which is fully operational on CPUs.

When you find a model you'd like to use, proceed to download it. All the models have a common syntax and different precisions, so choose efficiently. 🚀🔍


![image-4.png](attachment:image-4.png)

Optional: Replace `YOUR_HF_TOKEN` 🔑 with your Hugging Face token. Hugging Face needs to be sure you have access to the model.

In [1]:
!python -c "from huggingface_hub.hf_api import HfFolder; HfFolder.save_token ('YOUR_HF_TOKEN')"

## Installing Libraries. 

##### Langchain

The major library we'll be using among others is Langchain. LangChain is a framework for developing applications powered by language models. It enables applications that:

1. Are context-aware: connect a language model to sources of context (prompt instructions, few shot examples, content to ground its response in, etc.) 🧠
2. Reason: rely on a language model to reason (about how to answer based on provided context, what actions to take, etc.) 🤔

Read more about LangChain at [LangChain Documentation](https://python.langchain.com/docs/get_started/introduction) 📚


##### chromadb

Chroma is a database for building AI applications with embeddings. It comes with everything you need to get started built in, and runs on your machine locally. According to their website, a hosted version is coming soon! Read more about ChromaDB here at [ChromaDB Documentation](https://docs.trychroma.com/getting-started) 📦

In [2]:
import pandas as pd
import numpy as np

In [3]:
!pip install langchain_community -qq #The langchain community builds tools that are stored here

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 23.8.0 requires cubinlinker, which is not installed.
cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
cudf 23.8.0 requires ptxcompiler, which is not installed.
cuml 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
dask-cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
cudf 23.8.0 requires cuda-python<12.0a0,>=11.7.1, but you have cuda-python 12.3.0 which is incompatible.
cudf 23.8.0 requires pandas<1.6.0dev0,>=1.3, but you have pandas 2.1.4 which is incompatible.
cudf 23.8.0 requires protobuf<5,>=4.21, but you have protobuf 3.20.3 which is incompatible.
cuml 23.8.0 requires dask==2023.7.1, but you have dask 2024.1.0 which is incompatible.
cuml 23.8.0 requires distributed==2023.7.1, but you have distributed 2024.1.0 which is incompatible.
dask-cud

In [4]:
!pip install -U langchain -qq

In [5]:
!pip install unstructured -qq #This will be used to load in our excel sheets (The textbooks)

In [6]:
%%capture
pip install ctransformers[cuda]

In [7]:
%%capture

!pip install accelerate sentence_transformers==2.2.2 chromadb==0.4.12

## Taking a Look at the Data 👀

The Train and Test files come as CSV files. For convenience and to check out the data, we can load them into dataframes using pandas. 🐼

If you are getting an error, it means you haven't uploaded the dataset into Kaggle, or your path is incorrect. 🚨


###### more details

- **ID:** The question ID
- **Question Text:** Essentially the text of the questions.
- **Question Answer:** The Answer to the Question Text
- **Reference Document:** This is where the Answer is in the textbook (Remember there are 6 excel sheets where the textbooks are)
- **Paragraph(s) Number:** The paragraph(s) in the Reference Document where the answer is found
- **Keywords:** The contextual Keywords 📝


In [8]:
path = "/kaggle/input/malawi-public-health-dataset/strengthening-health-systems-llm-challenge-for-integrated-disease-surveillance-and-response-in-malawi20240125-12750-1x85c8a"
train = pd.read_csv(f"{path}/Train.csv")
train

Unnamed: 0,ID,Question Text,Question Answer,Reference Document,Paragraph(s) Number,Keywords
0,Q829,Compare the laboratory confirmation methods fo...,Chikungunya is confirmed using serological tes...,TG Booklet 6,"154, 166",Laboratory Confirmation For Chikungunya Vs. Di...
1,Q721,When should specimens be collected for Anthrax...,Specimens should be collected during the vesic...,TG Booklet 6,140,"Anthrax Specimen Collection: Timing, Preparati..."
2,Q464,Which key information should be recorded durin...,"During a register review, key information abou...",TG Booklet 3,439-440,"Register Review, Key Information, Suspected Ca..."
3,Q449,Why is the District log of suspected outbreaks...,The log includes information about response ac...,TG Booklet 3,412,"District Log, Response Activities, Steps Taken..."
4,Q6,What do Community based surveillance strategie...,Community-based surveillance strategies focus ...,TG Booklet 1,86,"Community-based Surveillance Strategies, Ident..."
...,...,...,...,...,...,...
743,Q413,Which section of the guidelines provides a des...,Section 11.0 of these 3rd Edition Malawi IDSR ...,TG Booklet 3,376,"Control Measures Description, Priority Disease..."
744,Q626,"Does MEF stand for an abbreviation in the TG, ...",Medical Teams International,TG Booklet 6,106,Medical Teams International
745,Q1141,In what ways do the verification and documenta...,"In emergency contexts, verification and docume...",TG Booklet 5,105-106,"Verification, Documentation, Early Warning, Em..."
746,Q331,What role does the examination of burial cerem...,Examining burial ceremonies helps identify pot...,TG Booklet 3,287,"Burial Ceremonies Examination, Exposure, Trans..."


In [9]:
test = pd.read_csv(f"{path}/Test.csv")
test

Unnamed: 0,ID,Question Text
0,Q4,"What is the definition of ""unusual event"""
1,Q5,What is Community Based Surveillance (CBS)?
2,Q9,What kind of training should members of VHC re...
3,Q10,What is indicator based surveillance (IBS)?
4,Q13,What is Case based surveillance?
...,...,...
494,Q1229,Where should completeness be evaluated in the ...
495,Q1230,Which dimensions of completeness are crucial i...
496,Q1236,How can the completeness of case reporting be ...
497,Q1239,Where should completeness and timeliness of re...


###### Submission Format 

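Each of the four fields for a question (keywords, paragraph number, answer and reference document) has to be passed as a separate row, identified as `<question ID>_<field>` 📝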


In [10]:
ss = pd.read_csv(f"{path}/SampleSubmission.csv")
ss

Unnamed: 0,ID,Target
0,Q1000_keywords,
1,Q1000_paragraph(s)_number,
2,Q1000_question_answer,
3,Q1000_reference_document,
4,Q1002_keywords,
...,...,...
1991,Q999_reference_document,
1992,Q9_keywords,
1993,Q9_paragraph(s)_number,
1994,Q9_question_answer,
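
To make the row layout concrete, here is a small sketch that builds the four submission IDs for one question. The suffixes are copied from the sample submission above; `submission_ids` is a hypothetical helper name, not something Zindi provides.

```python
# Field suffixes as they appear in SampleSubmission.csv
SUBMISSION_FIELDS = [
    "keywords",
    "paragraph(s)_number",
    "question_answer",
    "reference_document",
]


def submission_ids(question_id: str) -> list:
    """Return the four row IDs expected for a single question."""
    return [f"{question_id}_{field}" for field in SUBMISSION_FIELDS]


print(submission_ids("Q9"))
```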


######  Make the necessary imports 📚


In [11]:
from torch import cuda, bfloat16
import torch
import transformers
from transformers import AutoTokenizer
from time import time
#import chromadb
#from chromadb.config import Settings
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA,ConversationalRetrievalChain
from langchain.vectorstores import Chroma

###### If peradventure you have a GPU, let's use it!

In [12]:
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'
print(device)

cuda:0


In [13]:
from accelerate import Accelerator

if cuda.is_available():
    accelerator = Accelerator()
    gpu_layers = 50 

else:
    gpu_layers = 0
    
print(gpu_layers)

50


## Large Language Models (LLMs)

LLMs, which stand for Large Language Models, are foundational language models that are very large as the name implies. They can understand and generate human language text. 

They are trained by analyzing massive datasets of text and learning the statistical relationships between words and phrases. This allows them to perform a variety of tasks, such as:

- Answering your questions in an informative way, even if they are open-ended, challenging, or strange.
- Generating different creative text formats, like poems, code, scripts, musical pieces, email, letters, etc.
- Translating languages
- Writing different kinds of creative content
- Summarizing factual topics

In this tutorial, we're focused on using them for question answering. There are a couple of Open Source LLMs; Meta's Llama family is a popular choice and comes in various sizes on the Hub, such as "Llama-2-13b-chat-hf" 🦙. To keep the demo fast, the code in this notebook runs the much smaller TinyLlama, with the Llama 2 13B equivalent left commented out.


Let's see what a sample question looks like from the test set. 🕵️‍♂️

In [14]:
question = test["Question Text"][0]
question

'What is the definition of "unusual event"'

###### Choose GGUF model equivalent.

For the sake of the demo, we use TinyLlama. You can also find bigger models and load them in the same way.
Download the model and use CTransformers to load it. LangChain has a CTransformers wrapper readily available. 📦🔍


![image.png](attachment:image.png)

###### Choose Model Quantization Size

When you scroll down, you will find different model sizes. Choose the one best suited. 📏🔍


![image.png](attachment:image.png)
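
A rough way to compare quantization sizes before downloading is to multiply the parameter count by the bits stored per weight. The sketch below is only a back-of-the-envelope estimate: `estimated_size_gb` is a hypothetical helper, and the figures used (about 1.1 billion parameters for TinyLlama, about 6.56 bits per weight for Q6_K) are assumptions, so check the file sizes listed on the model page.

```python
def estimated_size_gb(n_parameters: float, bits_per_weight: float) -> float:
    """Approximate file size in GB (1 GB = 1e9 bytes), ignoring metadata overhead."""
    return n_parameters * bits_per_weight / 8 / 1e9


# Assumed values: ~1.1e9 parameters (TinyLlama), ~6.56 bits per weight (Q6_K)
tinyllama_q6_gb = estimated_size_gb(1.1e9, 6.56)
print(f"TinyLlama Q6_K: about {tinyllama_q6_gb:.2f} GB")
```

The estimate lands in the same ballpark as the 903M file reported by the download log further down. Fewer bits per weight means a smaller and faster model, usually at some cost in answer quality.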

###### Downloading and Setting up.

You can simply use the Hugging Face CLI to download the model, then load it in via Ctransformers. Ensure the naming conventions are correct. Things to edit:

1. Model Repo in CLI: `TheBloke/TinyLlama-1.1B-Chat-v0.3-GGUF`
2. Quantization Model to pull: `tinyllama-1.1b-chat-v0.3.Q6_K.gguf`

For loading in with Ctransformers:

1. `model`: `tinyllama-1.1b-chat-v0.3.Q6_K.gguf`
2. `model_type`: `'tinyllama'`


In [15]:
# You can Try TinyLlama for faster inference on CPU
!huggingface-cli download TheBloke/TinyLlama-1.1B-Chat-v0.3-GGUF tinyllama-1.1b-chat-v0.3.Q6_K.gguf --local-dir . --local-dir-use-symlinks False #https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v0.3-GGUF


# ## Llama 13B-Chat GGUF Equivalent
# !huggingface-cli download TheBloke/Llama-2-13B-chat-GGUF llama-2-13b-chat.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False  #Pull model from https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF  

Consider using `hf_transfer` for faster downloads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
downloading https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v0.3-GGUF/resolve/main/tinyllama-1.1b-chat-v0.3.Q6_K.gguf to /root/.cache/huggingface/hub/tmp20k7grvb
tinyllama-1.1b-chat-v0.3.Q6_K.gguf: 100%|█████| 903M/903M [00:03<00:00, 272MB/s]
./tinyllama-1.1b-chat-v0.3.Q6_K.gguf


In [16]:
from langchain.llms import CTransformers

config = {'temperature':0.01, 'gpu_layers':gpu_layers, 'batch_size':4}

# Use TinyLlama if you want. Remember to download the right file

# Local CTransformers wrapper for TinyLlama
llm = CTransformers(model='tinyllama-1.1b-chat-v0.3.Q6_K.gguf', # Location of the downloaded GGUF model
                    model_type='tinyllama', # Model type
                    batch_size=4,
                    config=config)


# # Local CTransformers wrapper for Llama-2-13B-Chat
# llm = CTransformers(model='llama-2-13b-chat.Q4_K_M.gguf', # Location of the downloaded GGUF model
#                     model_type='llama', # Model type Llama
#                     gpu_layers = gpu_layers,
#                     batch_size=4,
#                     config=config)

In [17]:
if cuda.is_available():
    llm, config = accelerator.prepare(llm, config)

###### Using Transformers to load in the model

The transformers library is still one of the best resources to load transformer-based models. Everything is automatically set up; we pass in the model configuration and the configuration for Bits and Bytes. 🤖📦

### A Quick Test Run

So let's do a quick test run of the pipeline to see it work 🏃‍♂️💨

In [18]:
def test_model(llm, prompt_to_test):
    """
    Run a single prompt through the LLM, print the response
    and report the inference time.
    Args:
        llm: the loaded language model (here, the CTransformers wrapper)
        prompt_to_test: the prompt
    Returns
        None
    """
    # adapted from https://huggingface.co/blog/llama2#using-transformers
    time_1 = time()
    print(llm(prompt_to_test))
    time_2 = time()
    print(f"Test inference: {round(time_2-time_1, 3)} sec.")

In [19]:
print(question)
test_model(llm,
           question)

What is the definition of "unusual event"




 in physics?
What are some examples of unusual events in physics?
What are some examples of unusual phenomena in physics?
What are some examples of unusual things in physics?
What are some examples of unexplained phenomena in physics?
What are some examples of unusual occurrences in physics?
What is an example of a "unusual event" in physics?
What are some examples of unusual phenomena in physics?
What are some examples of unusual things in physics?
What are some examples of unexplained phenomena in physics?
What are some examples of unusual occurrences in physics?
Physics is the study of matter, energy, and the universe. It deals with the fundamental building blocks of reality, such as particles, waves, and forces. Unusual events in physics refer to events that are not expected or predicted by classical physics. These events are often attributed to quantum mechanics, general relativity, string theory, or other theories in physics.
Some examples of unusual phenomena in physics include:

Here, you see that even on GPU, it takes roughly 4 seconds to answer the question "What is the definition of unusual event". The model also answers the question in relation to **physics**, as shown above, instead of **Public Health**, because nothing in the prompt tells it which domain we mean ⏱️🚀

###### Testing the Model with HuggingFace Pipeline Class 

## RAG 📚

Retrieval Augmented Generation, or RAG, is a technique used to improve the accuracy and reliability of Large Language Models (LLMs). As you know, LLMs are trained on massive amounts of text data, but they can still struggle with factual consistency and sometimes generate incorrect or misleading information. RAG helps address this by incorporating external/relevant knowledge sources into the generation process.

Here's how it works:

- Retrieval: When you ask an LLM a question or give it a prompt, RAG first retrieves relevant information from the Public Health Vector Database created with the 6 textbooks. 📖

- Augmentation: This retrieved information is then combined with the original prompt to provide the LLM with additional context and factual grounding. 

- Generation: Finally, the LLM uses this augmented prompt to generate its response. This response would be more accurate and reliable because it's based on both the LLM's internal knowledge and the retrieved factual information. 🎯
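
The three steps above can be sketched without any libraries. This is a toy illustration only: the retriever here ranks chunks by shared words instead of embeddings, the chunk texts are invented placeholders (not taken from the booklets), and `retrieve` and `build_prompt` are hypothetical names. The generation step is left out, since it is just a call to the LLM with the final prompt.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question: str, chunks: list, k: int = 2) -> list:
    """Toy retrieval: rank chunks by how many words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        chunks,
        key=lambda chunk: len(question_words & tokenize(chunk)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, context_chunks: list) -> str:
    """Augmentation: put the retrieved chunks in front of the question."""
    context = "\n".join(context_chunks)
    return (
        "Use the context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Invented placeholder chunks, NOT real booklet content
demo_chunks = [
    "An unusual event is something out of the ordinary reported by the community.",
    "Specimens should be labelled before transport to the laboratory.",
    "Completeness of reporting is monitored every month.",
]
demo_question = 'What is the definition of "unusual event"'

top_chunks = retrieve(demo_question, demo_chunks, k=1)
demo_prompt = build_prompt(demo_question, top_chunks)
print(demo_prompt)  # this augmented prompt is what would be sent to the LLM
```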

### Questions ❓

- Why not pass the entire textbook to the LLM? The entire textbook is a lot. LLMs have a "context length" which is the maximum amount of input text they can take in and understand. Also, the more text, the more compute required. It makes more sense to first look for the relevant parts then pass it to the LLM. 🤔


![Malawi%20Public%20Health%20RAG_1.png](attachment:Malawi%20Public%20Health%20RAG_1.png)

###### Load in the Textbooks 📚


In [20]:
import os
books_path = "/kaggle/input/malawi-public-health-dataset/strengthening-health-systems-llm-challenge-for-integrated-disease-surveillance-and-response-in-malawi20240125-12750-1x85c8a/MWTGBookletsExcel"
booklets = os.listdir(books_path)

There are several document loaders that have been created by the LangChain community. One such loader is `UnstructuredExcelLoader`, which loads an Excel spreadsheet as unstructured data. 📊📄

In [21]:
from langchain_community.document_loaders import UnstructuredExcelLoader

###### Load all the 6 textbooks

Here we load it and extend it in a list called docs 📚


In [22]:
loaders = [UnstructuredExcelLoader(f"{books_path}/{i}") for i in booklets]
docs = []
for loader in loaders:
    docs.extend(loader.load())

###### Splitting the data.

LangChain has several splitting methods, a basic one is the "RecursiveCharacterTextSplitter" which splits the data (text) by chunks of n characters, and includes an overlap of k characters called the chunk overlap. 📝🔀


In [23]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
all_splits = text_splitter.split_documents(docs)
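
To see what chunk size and chunk overlap mean, here is a minimal fixed-window splitter. It is a simplification written for illustration: `chunk_text` is a hypothetical function, and `RecursiveCharacterTextSplitter` additionally tries to cut at natural separators such as paragraph breaks, newlines and spaces, so its chunks will not be identical to these.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Each chunk repeats the last character of the previous one (overlap of 1)
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
```

The overlap means a sentence cut at a chunk boundary still appears partly in both neighbours, which gives the retriever a better chance of finding it.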

###### Embedding the data

NLP took a huge leap with the discovery of embeddings. Sentence embeddings, or sentence vectors, are numeric vectors that represent a sentence in a lower-dimensional space, so that sentences with similar meanings have similar representations. You can read more about embeddings [here](https://huggingface.co/blog/getting-started-with-embeddings)

For this tutorial, we'll be making use of `sentence-transformers/all-mpnet-base-v2`. It is a model that maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search. It's available and open source on the Hub. 🌐🔠


In [24]:
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": device}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)



##### Vector Databases

Vector databases provide the ability to store and retrieve vectors as high-dimensional points. They add additional capabilities for efficient and fast lookup of nearest-neighbors in the N-dimensional space. 🗄️🔍

In the code below, we create a vector database with Chroma. We then pass in all the splits (chunks of the entire textbook) and the embedding model to convert them into embeddings. 💡📚
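
As a rough picture of what a nearest-neighbour lookup does, the sketch below scores every stored vector against a query using cosine similarity and keeps the best `k`. It is a brute-force toy with made-up 2-dimensional vectors; `cosine` and `nearest_neighbours` are hypothetical names, and real vector databases typically rely on approximate indexes to avoid comparing against every vector.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbours(query_vector, stored_vectors, k: int = 3) -> list:
    """Return the indices of the k stored vectors most similar to the query."""
    scored = [(cosine(query_vector, vector), index)
              for index, vector in enumerate(stored_vectors)]
    scored.sort(reverse=True)  # highest similarity first
    return [index for _, index in scored[:k]]


# Made-up vectors: the first two point in nearly the same direction as the query
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(nearest_neighbours([1.0, 0.05], toy_vectors, k=2))
```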


In [25]:
vectordb = Chroma.from_documents(documents=all_splits, embedding=embeddings, persist_directory="chroma_dbm")

###### Langchain's Retrieval QA

LangChain lets you create and arrange chains. One popular chain is RetrievalQA, which takes in the retriever (the function that retrieves the relevant chunks) and the LLM (that will answer the question).

For the Retriever, you'll find that I've set k=3. This is the maximum number of splits I want returned from the vector database. In other words, the vector database ranks the splits by similarity to the question and gives us the best 3 to work with. We then combine these 3 chunks with the question (prompt) and get a RAG-based answer. 🔄🔍

In [26]:
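# The number of chunks to return is passed via search_kwargs; k=3 matches the description above
retriever = vectordb.as_retriever(search_kwargs={"k": 3})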

qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True)


###### Using the RAG for the Test Set

In [27]:
def test_rag(qa, query):
    #print(f"Query: {query}\n")
    time_1 = time()
    result = qa.run(query)
    time_2 = time()
    time_taken = round(time_2-time_1, 3)
    #print(f"Inference time: {round(time_2-time_1, 3)} sec.")
    #print("\nResult: ", result)
    return result,time_taken

I wanted to keep track of some data aside from the answer:

1. One that is important is the source, i.e., which of the Excel spreadsheets did it find the relevant information?
2. The other one is the time it took to answer the question through the RAG.

As you'll remember, the sample submission requires us to submit the "Answer", "Textbook Source", "Paragraph", and even "Keywords". For now, we'll deal with the "Answer and Textbook Source" and later on use simpler methods to extract the relevant paragraph and the Keywords. 📊⏱️


In [28]:
from tqdm import tqdm

In [None]:
times = []
results = []
sources = []
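for question in tqdm(test["Question Text"]):
    try:
        result, time_taken = test_rag(qa, question)
        # Find the chunk closest to the generated answer and record which booklet it came from
        matched_docs = vectordb.similarity_search(result)
        source = matched_docs[0].metadata['source'].split("/")[-1]

        times.append(time_taken)
        results.append(result)
        sources.append(source)
    except Exception:
        # Keep the three lists aligned with the test set even when a question fails
        times.append("Error")
        results.append("Error")
        sources.append("Error")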

In [30]:
mysub = test.copy()
mysub["Time Taken"] = times
mysub["Answers"] = results
mysub["Source files"] = sources
mysub.to_csv("full test.csv", index=False)

In [31]:
mysub

Unnamed: 0,ID,Question Text,Time Taken,Answers,Source files
0,Q4,"What is the definition of ""unusual event""",8.863,The term “unusual event” is used in the I’ is...,TG Booklet 4.xlsx
1,Q5,What is Community Based Surveillance (CBS)?,7.262,Community Based Surveillance (CBS) is a syste...,TG Booklet 1.xlsx
2,Q9,What kind of training should members of VHC re...,7.004,The training should be tailored to the specif...,TG Booklet 1.xlsx
3,Q10,What is indicator based surveillance (IBS)?,10.173,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,TG Booklet 4.xlsx
4,Q13,What is Case based surveillance?,7.229,\nCase-based surveillance involves the ongoing...,TG Booklet 1.xlsx
...,...,...,...,...,...
494,Q1229,Where should completeness be evaluated in the ...,9.123,The completeness of surveillance data is an i...,TG Booklet 4.xlsx
495,Q1230,Which dimensions of completeness are crucial i...,6.526,The completeness of surveillance data is impo...,TG Booklet 4.xlsx
496,Q1236,How can the completeness of case reporting be ...,9.702,1 250\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,TG Booklet 6.xlsx
497,Q1239,Where should completeness and timeliness of re...,7.861,The effectiveness of the monitoring system ca...,TG Booklet 4.xlsx


## PART 2

##### Extracting Keywords and Paragraph 📝🔍


Generating the answer to the question is probably the hardest part. Finding the paragraph is much easier once we know which of the 6 Excel sheets the model used to answer the question.

In the code below, we use very basic ideas to find the paragraph and extract the keywords. 🤔🔎


In [32]:
import pandas as pd
import os

In [33]:
test_set = pd.read_csv("full test.csv")
test_set

Unnamed: 0,ID,Question Text,Time Taken,Answers,Source files
0,Q4,"What is the definition of ""unusual event""",8.863,The term “unusual event” is used in the I’ is...,TG Booklet 4.xlsx
1,Q5,What is Community Based Surveillance (CBS)?,7.262,Community Based Surveillance (CBS) is a syste...,TG Booklet 1.xlsx
2,Q9,What kind of training should members of VHC re...,7.004,The training should be tailored to the specif...,TG Booklet 1.xlsx
3,Q10,What is indicator based surveillance (IBS)?,10.173,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,TG Booklet 4.xlsx
4,Q13,What is Case based surveillance?,7.229,\nCase-based surveillance involves the ongoing...,TG Booklet 1.xlsx
...,...,...,...,...,...
494,Q1229,Where should completeness be evaluated in the ...,9.123,The completeness of surveillance data is an i...,TG Booklet 4.xlsx
495,Q1230,Which dimensions of completeness are crucial i...,6.526,The completeness of surveillance data is impo...,TG Booklet 4.xlsx
496,Q1236,How can the completeness of case reporting be ...,9.702,1 250\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,TG Booklet 6.xlsx
497,Q1239,Where should completeness and timeliness of re...,7.861,The effectiveness of the monitoring system ca...,TG Booklet 4.xlsx


Ensure you correct the relevant path if it needs to be edited. 🛠️📂


In [34]:
path = "/kaggle/input/malawi-public-health-dataset/strengthening-health-systems-llm-challenge-for-integrated-disease-surveillance-and-response-in-malawi20240125-12750-1x85c8a"

In [35]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords



# Download NLTK resources (run only once)
#nltk.download('punkt')
#nltk.download('stopwords')

def extract_keywords(provided_text):
    # Tokenize the text
    tokens = word_tokenize(provided_text)

    # Convert tokens to lowercase
    tokens = [token.lower() for token in tokens]

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [token.title() for token in tokens if token not in stop_words]

    # Remove punctuation and non-alphabetic characters
    keywords = [token for token in filtered_tokens if token.isalpha()]

    # Remove duplicate keywords
    unique_keywords = list(set(keywords))

    return ', '.join(unique_keywords)





def find_matching_paragraphs(csv_filepath, text_to_check, threshold=0.9):
    # Load the booklet (an Excel file, despite the parameter name) as two columns
    df = pd.read_excel(f"{path}/MWTGBookletsExcel/{csv_filepath}", names=["paragraph", "text"])
    df.fillna('', inplace=True)
    # Make sure every cell is a string so the vectorizer can process it
    paragraphs = df['text'].astype(str)

    # Initialize TfidfVectorizer
    tfidf_vectorizer = TfidfVectorizer()

    # Fit and transform the text in the DataFrame
    tfidf_matrix = tfidf_vectorizer.fit_transform(paragraphs)

    # Transform the provided text
    provided_text_tfidf = tfidf_vectorizer.transform([text_to_check])

    # Calculate cosine similarity between the provided text and each paragraph in the DataFrame
    cosine_similarities = cosine_similarity(provided_text_tfidf, tfidf_matrix).flatten()

    # Find paragraphs that meet or exceed the threshold
    matching_paragraph_indices = [i for i, score in enumerate(cosine_similarities) if score >= threshold]

    if matching_paragraph_indices:
        # Get the corresponding paragraph numbers
        matching_paragraph_numbers = df.iloc[matching_paragraph_indices]['paragraph'].tolist()
        matching_paragraph_numbers = [str(int(i)) for i in matching_paragraph_numbers]
        return ', '.join(matching_paragraph_numbers)

    else:
        # If no paragraphs meet the threshold, fall back to the paragraph with the highest similarity
        closest_paragraph_index = cosine_similarities.argmax()
        closest_paragraph_number = df.iloc[closest_paragraph_index]['paragraph']
        return str(closest_paragraph_number)  # Return as a string

I've created two functions above. I'll try to explain what they do.

### "extract_keywords" function:

This Python function takes a string of text (`provided_text`) as input and performs several text processing steps to extract the unique keywords from that text, returned as a single comma-separated string.

Here's a high-level breakdown of what the function does step by step:

1. Tokenization: It splits the text into tokens using NLTK's `word_tokenize`.

2. Lowercasing: It converts every token to lowercase so that the tokens can be compared against NLTK's lowercase stopword list.

3. Stopword removal: It removes common stopwords (like "the", "is", "and", etc.) from the token list. Stopwords are commonly occurring words that typically do not carry significant meaning in the context of analysis. The function uses the NLTK library's built-in set of English stopwords for this purpose.

4. Titlecasing: It capitalizes the first letter of each remaining token. The metric of this competition is the ROUGE-1 metric, which is case sensitive, so it makes sense to format the keywords like those in the train set.

5. Punctuation and non-alphabetic character removal: It filters out tokens that contain non-alphabetic characters (like punctuation marks) using the `isalpha()` method. This step ensures that only alphabetic words are considered as keywords.

6. Removing duplicate keywords: It removes duplicate keywords from the list to ensure that each keyword appears only once in the final output. This is done by converting the list of keywords into a set (which automatically removes duplicates) and then converting it back into a list.

7. Joining keywords into a string: Finally, it joins the unique keywords into a single string, separated by commas, using the `join` method.
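
One thing to be aware of: a `set` does not preserve order, so the keyword order can differ from run to run. If you want the keywords in the order they appear in the text, `dict.fromkeys` is one option. The sketch below is a simplified stand-in, not the function used in this notebook: it swaps `word_tokenize` for a regular expression and uses a tiny made-up stopword list so that it runs without downloading NLTK data.

```python
import re

# Tiny illustrative stopword list (an assumption; NLTK's English list is much longer)
DEMO_STOPWORDS = {"the", "is", "of", "and", "a", "in", "what", "for"}


def extract_keywords_ordered(text: str) -> str:
    """Return unique title-cased keywords in first-seen order, comma-separated."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    kept = [token.title() for token in tokens if token not in DEMO_STOPWORDS]
    unique = list(dict.fromkeys(kept))  # drops duplicates, keeps first-seen order
    return ", ".join(unique)


print(extract_keywords_ordered("What is the definition of unusual event in the event log"))
```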

### "find_matching_paragraphs" Function:

This Python function takes the filename of a booklet (an Excel file, despite the `csv_filepath` parameter name), a piece of text to check against, and an optional threshold for cosine similarity as input. It is designed to find paragraphs in the booklet that match the provided text based on their similarity, using the cosine similarity metric. If no paragraphs meet the specified similarity threshold, it returns the paragraph with the highest similarity.

Here's a step-by-step explanation of what the function does:

1. Loading Data: It reads the Excel file located at the provided file path using Pandas (`pd.read_excel`). The two columns of the resulting DataFrame are named "paragraph" and "text", respectively. If there are any missing values in the DataFrame, they are filled with empty strings.

2. Vectorizing Text: It initializes a TF-IDF vectorizer and fits the data to generate a TF-IDF matrix (`tfidf_matrix`). TF-IDF stands for Term Frequency-Inverse Document Frequency, a numerical statistic that reflects the importance of a word in a document relative to a collection of documents.

3. Calculating Cosine Similarity: It calculates the cosine similarity between the TF-IDF representation of the provided text and each paragraph in the DataFrame using `cosine_similarity` from scikit-learn. Cosine similarity measures the cosine of the angle between two vectors and is used here to quantify the similarity between the provided text and each paragraph.

4. Finding Matching Paragraphs: It identifies the indices of paragraphs whose cosine similarity with the provided text meets or exceeds the specified threshold. If such paragraphs exist, it retrieves their corresponding paragraph numbers from the DataFrame and returns them as a comma-separated string. If no paragraphs meet the threshold, it selects the paragraph with the highest similarity.
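
As a rough illustration of steps 2 to 4, here is a minimal sketch on a toy DataFrame. It is not the notebook's actual function: `match_paragraphs_sketch` is a hypothetical name, it takes an in-memory DataFrame instead of an Excel path so it runs without the booklets, and the toy paragraphs are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def match_paragraphs_sketch(df, text_to_check, threshold=0.5):
    """Return paragraph numbers whose text is similar to text_to_check."""
    df = df.fillna("")
    # Fit TF-IDF on the paragraph texts, then project the query into the same space.
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(df["text"].astype(str))
    query_vector = vectorizer.transform([text_to_check])
    # One similarity score per paragraph.
    similarities = cosine_similarity(query_vector, tfidf_matrix).ravel()
    matching = np.where(similarities >= threshold)[0]
    if len(matching) == 0:
        # Nothing reached the threshold: fall back to the single best match.
        matching = [int(similarities.argmax())]
    return ", ".join(str(df["paragraph"].iloc[i]) for i in matching)


# Toy data (made up for illustration).
toy = pd.DataFrame({
    "paragraph": [1, 2, 3],
    "text": [
        "Report unusual events to the health facility",
        "Wash hands with soap and water",
        "Community based surveillance supports early detection",
    ],
})

print(match_paragraphs_sketch(toy, "Wash hands with soap and water", threshold=0.9))
print(match_paragraphs_sketch(toy, "early detection of outbreaks", threshold=0.9))
```

The second call shows the fallback: no paragraph reaches 0.9, so the closest one is returned anyway. A high threshold such as 0.9 therefore tends to yield a single paragraph number most of the time.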


Extra Tip:

The model used is a small LLM, so some answers were odd: several contained stray "\n" characters and unusually long "words". The code below strips the newlines and keeps only words of 20 characters or fewer. 🚀🔍


In [36]:
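# Replace newline characters with a space so adjacent words are not glued together.
test_set["Answers"] = test_set["Answers"].str.replace("\n", " ")
# Drop implausibly long tokens: keep only words of at most 20 characters.
test_set["Answers"] = test_set["Answers"].apply(lambda x: ' '.join([word for word in x.split() if len(word) <= 20]))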

###### Putting it all together

The code below puts all our work together and prepares for submission. 🛠️📝

PS: The LLM may have encountered issues while running in Part 1. For such instances, we tag them as "Error". Interestingly, only one was found. 🚩


In [37]:
test_set[test_set["Answers"] == "Error"]

Unnamed: 0,ID,Question Text,Time Taken,Answers,Source files


In [38]:
ID = []
Target = []

# Each question contributes four rows: keywords, paragraph number(s), answer, reference document.
for index, row in tqdm(test_set.iterrows(), total=len(test_set)):
    if row["Answers"] == "Error":
        # The LLM failed on this question: fall back to the question text and Booklet 1.
        ID.append(row["ID"]+"_keywords")
        Target.append(extract_keywords(row["Question Text"]))
        ID.append(row["ID"]+"_paragraph(s)_number")
        Target.append(find_matching_paragraphs("TG Booklet 1.xlsx", row["Question Text"], threshold=0.9))
        ID.append(row["ID"]+"_question_answer")
        Target.append(" ")
        ID.append(row["ID"]+"_reference_document")
        Target.append("TG Booklet 1")

    else:
        # Normal case: derive keywords and paragraph numbers from the generated answer.
        ID.append(row["ID"]+"_keywords")
        Target.append(extract_keywords(row["Answers"]))
        ID.append(row["ID"]+"_paragraph(s)_number")
        Target.append(find_matching_paragraphs(row["Source files"], row["Answers"], threshold=0.9))
        ID.append(row["ID"]+"_question_answer")
        Target.append(row["Answers"])
        ID.append(row["ID"]+"_reference_document")
        # Strip the file extension to get the booklet name.
        Target.append(row["Source files"].split(".xlsx")[0])

100%|██████████| 499/499 [01:10<00:00,  7.07it/s]


#### Making Your Submission! 📤

A CSV file called `My CPU Baseline submission.csv` will be created; that is the file you will submit on Zindi. 📊📤

In [39]:
ss = pd.read_csv(f"{path}/SampleSubmission.csv")
ss

Unnamed: 0,ID,Target
0,Q1000_keywords,
1,Q1000_paragraph(s)_number,
2,Q1000_question_answer,
3,Q1000_reference_document,
4,Q1002_keywords,
...,...,...
1991,Q999_reference_document,
1992,Q9_keywords,
1993,Q9_paragraph(s)_number,
1994,Q9_question_answer,


In [40]:
ss["ID"] = ID
ss["Target"] = Target
ss["Target"] = ss["Target"].fillna(" ")

ss.to_csv("My CPU Baseline submission.csv", index=False)
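
Before uploading, a quick sanity check can catch a malformed file. The sketch below is only a suggestion and not part of the original pipeline: `check_submission` is a hypothetical helper, and the checks (no missing targets, no duplicate IDs, the same number of rows for each of the four ID suffixes) are assumptions about what is worth verifying.

```python
import pandas as pd

# The four row types written for every question in this notebook.
EXPECTED_SUFFIXES = [
    "_keywords",
    "_paragraph(s)_number",
    "_question_answer",
    "_reference_document",
]


def check_submission(ss):
    """Return a list of problems found in a submission DataFrame."""
    problems = []
    if ss["Target"].isna().any():
        problems.append("missing Target values")
    if ss["ID"].duplicated().any():
        problems.append("duplicate IDs")
    # Every suffix should appear equally often and account for all rows.
    counts = {s: int(ss["ID"].str.endswith(s).sum()) for s in EXPECTED_SUFFIXES}
    if len(set(counts.values())) != 1 or sum(counts.values()) != len(ss):
        problems.append(f"uneven rows per suffix: {counts}")
    return problems


# Toy example with a single question (values are placeholders).
example = pd.DataFrame({
    "ID": ["Q4" + s for s in EXPECTED_SUFFIXES],
    "Target": ["Health, Event", "714", "Some answer", "TG Booklet 4"],
})
print(check_submission(example))
```

An empty list means none of these checks failed; running it as `check_submission(ss)` on the real submission works the same way.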

Curious to see what your submission looks like?

In [41]:
ss

Unnamed: 0,ID,Target
0,Q4_keywords,"Systematically, Event, Cebsacpractively, Refer..."
1,Q4_paragraph(s)_number,714
2,Q4_question_answer,The term “unusual event” is used in the I’ is ...
3,Q4_reference_document,TG Booklet 4
4,Q5_keywords,"Designated, Health, Cbs, Members, Reporting, B..."
...,...,...
1991,Q1239_reference_document,TG Booklet 4
1992,Q1246_keywords,"Efficient, Health, Answer, Contributes, Allows..."
1993,Q1246_paragraph(s)_number,86
1994,Q1246_question_answer,Community-based surveillance contributes to th...


###### References:

1. Gabriel Preda, [RAG using Llama 2, Langchain and ChromaDB](https://www.kaggle.com/code/gpreda/rag-using-llama-2-langchain-and-chromadb), Kaggle Notebook.

2. DeepLearning.AI, [Short Courses](https://www.deeplearning.ai/short-courses/).


## What Next? 🤔

Here are some ways to improve on this baseline:

1. Use a much better LLM: There are a bunch of top-performing models that are still relatively small in size. 🚀🔍

2. Finetuning your LLM: You can try fine-tuning an LLM on the train set Q/A pairs instead of using RAG. 🎯📖

3. Finetuning + RAG: Combining the two is often reported to work better than either one on its own. 💡🔄

4. Prompt Engineering: Write your own prompt for the LangChain QA Chain. We used the default from LangChain. 🤖🔧

5. Keyword Extraction can be better: The stop words removed were not sufficient. 🛑🔍

6. Better Post-Processing Strategies: The post-processing strategies used were insufficient. Sometimes the model repeats some sentences several times within its answer. 🔄🛠️
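
As one possible starting point for tip 6, the sketch below drops sentences that an answer repeats verbatim. It is a simple assumption-laden idea, not something used in this notebook: `drop_repeated_sentences` is a hypothetical helper, and it only splits on `.`, `!` and `?` followed by whitespace, so near-duplicates or unusual punctuation will slip through.

```python
import re


def drop_repeated_sentences(answer):
    """Keep the first occurrence of each sentence, ignoring case."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())
    seen = set()
    kept = []
    for sentence in sentences:
        key = sentence.strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(sentence.strip())
    return " ".join(kept)


print(drop_repeated_sentences(
    "Report the event. Report the event. Notify the supervisor. Report the event."
))
```

It could be applied to the answers column with `test_set["Answers"].apply(drop_repeated_sentences)` before building the submission.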


### The End

Want to connect? 🔗 Feel free to reach out to me, preferably on LinkedIn.

- [Twitter](https://twitter.com/olufemivictort)

- [LinkedIn](https://www.linkedin.com/in/olufemi-victor-tolulope)

- [GitHub](https://github.com/osinkolu)

### Author: Olufemi Victor Tolulope