## Inference Notebook

Now let's move on to making predictions. 📝


Replace `YOUR_HF_TOKEN` 🔑 with your Hugging Face access token.

In [None]:
!python -c "from huggingface_hub.hf_api import HfFolder; HfFolder.save_token('YOUR_HF_TOKEN')"

![image.png](attachment:image.png)
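If you would rather not paste the token into a cell, one option is to read it from an environment variable. This is a minimal sketch and not part of the original notebook: the helper `read_hf_token` and the variable name `DEMO_HF_TOKEN` are assumptions made for illustration.

```python
import os

def read_hf_token(env_var="DEMO_HF_TOKEN"):
    """Return the token stored in `env_var`, or raise if it is missing.

    Hypothetical helper; the variable name is an assumption.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return token

# Demo with a dummy value; a real token would be set outside the notebook.
os.environ["DEMO_HF_TOKEN"] = "hf_dummy_token"
token = read_hf_token()
```

The returned string can then be passed to `HfFolder.save_token` in place of the hard-coded value.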

###### Replace the model name below with your own fine-tuned model

Use the repository ID of the model you fine-tuned and pushed, in the form `username/model-name`. 🔄


In [None]:
model_name = "YOUR OWN USER NAME/Malawi-Public-Health-Systems"

In [None]:
import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load the fine-tuned model and its tokenizer
finetuned_longer_model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load test set CSV file
test_set_path = "/kaggle/input/malawi-public-health/Test.csv"
test_df = pd.read_csv(test_set_path)

# Combine all questions into a single list
questions = list(test_df["Question Text"])

# Create a text-generation pipeline
generator = pipeline("text-generation", model=finetuned_longer_model, tokenizer=tokenizer, device=0)  # set device to GPU

# Generate answers for all questions
answers = generator(questions, max_length=512)

# Extract answers from the generated output
answers = [item[0]['generated_text'].strip() for item in answers]

# Create a DataFrame for submission
submission_df = pd.DataFrame({"Question": questions, "Answer": answers})

# Save submission DataFrame to CSV
submission_csv_path = "submission.csv"
submission_df.to_csv(submission_csv_path, index=False)

###### Preparing your submission for Zindi

Now, let's prepare your submission for Zindi. 📤


In [1]:
import pandas as pd

In [2]:
sub = pd.read_csv("submission.csv")
sub

Unnamed: 0,Question,Answer
0,"What is the definition of ""unusual event""","What is the definition of ""unusual event"" in t..."
1,What is Community Based Surveillance (CBS)?,What is Community Based Surveillance (CBS)?Com...
2,What kind of training should members of VHC re...,What kind of training should members of VHC re...
3,What is indicator based surveillance (IBS)?,What is indicator based surveillance (IBS)?Ind...
4,What is Case based surveillance?,What is Case based surveillance?Case based sur...
...,...,...
494,Where should completeness be evaluated in the ...,Where should completeness be evaluated in the ...
495,Which dimensions of completeness are crucial i...,Which dimensions of completeness are crucial i...
496,How can the completeness of case reporting be ...,How can the completeness of case reporting be ...
497,Where should completeness and timeliness of re...,Where should completeness and timeliness of re...


In [3]:
path = "strengthening-health-systems-llm-challenge-for-integrated-disease-surveillance-and-response-in-malawi20240125-12750-1x85c8a/"

In [4]:
test = pd.read_csv(path+"Test.csv")
test

Unnamed: 0,ID,Question Text
0,Q4,"What is the definition of ""unusual event"""
1,Q5,What is Community Based Surveillance (CBS)?
2,Q9,What kind of training should members of VHC re...
3,Q10,What is indicator based surveillance (IBS)?
4,Q13,What is Case based surveillance?
...,...,...
494,Q1229,Where should completeness be evaluated in the ...
495,Q1230,Which dimensions of completeness are crucial i...
496,Q1236,How can the completeness of case reporting be ...
497,Q1239,Where should completeness and timeliness of re...


###### Post Processing

1. When we trained the model, we concatenated each question and its answer into one sequence. As a result, the model now reproduces the question before answering it, so we have to remove the echoed question. 🔄

2. The model didn't perform too well because it is small. It hallucinates a lot and often repeats sentences, so we remove the repeats.

3. The Zindi submission format requires keywords, so we have to extract them from each answer. 🔍

4. We have to find the paragraph and file in the textbook (the TG booklets) where each answer is found. 📄


P.S.: Most of this code is explained in the RAG implementation of the notebook; it is reused here with slight modifications. 📝🔍
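Step 1 can be illustrated on a single row. The sketch below is an assumption-labelled variant, not the notebook's exact code: `strip_echoed_question` is a hypothetical helper that only removes the question when it appears at the very start of the generated text, and the sample answer text is made up.

```python
def strip_echoed_question(question, generated):
    """Return `generated` without a leading copy of `question` (hypothetical helper)."""
    question = question.strip()
    generated = generated.strip()
    if question and generated.startswith(question):
        return generated[len(question):].strip()
    return generated

# Question taken from the test set; the generated text is a made-up sample.
question = "What is Case based surveillance?"
generated = "What is Case based surveillance?Case based surveillance tracks individual cases."
answer = strip_echoed_question(question, generated)
print(answer)
```

The notebook itself uses `split(question)[-1]`, which keeps only the text after the last occurrence of the question. The two approaches differ when the model repeats the question again later in its output.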


In [5]:
sub["ID"] = test["ID"]

In [6]:
def remove_repeated_sentences(text):
    # Split the text into sentences
    sentences = text.split('.')
    
    # Initialize a set to store unique sentences
    unique_sentences = set()
    
    # Initialize an empty list to store non-repeated sentences
    non_repeated_sentences = []
    
    # Iterate through each sentence
    for sentence in sentences:
        # Remove leading and trailing whitespaces
        sentence = sentence.strip()
        
        # Check if the sentence is not empty
        if sentence:
            # Check if the sentence is not already in the set of unique sentences
            if sentence not in unique_sentences:
                # Add the sentence to the set of unique sentences
                unique_sentences.add(sentence)
                
                # Append the sentence to the list of non-repeated sentences
                non_repeated_sentences.append(sentence)
    
    # Join the non-repeated sentences, restoring the space after each full stop
    final_text = '. '.join(non_repeated_sentences)
    
    return final_text
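The function above splits on every `.`, which also cuts inside decimals such as `2.5`. As a hedged alternative sketch (not the notebook's method), the hypothetical `dedupe_sentences` below splits only where sentence-ending punctuation is followed by whitespace, keeps the punctuation, and compares sentences case-insensitively.

```python
import re

def dedupe_sentences(text):
    """Drop repeated sentences while keeping first-seen order (hypothetical helper)."""
    # Split after ., ! or ? only when whitespace follows, so "2.5" stays intact
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    seen = set()
    kept = []
    for part in parts:
        sentence = part.strip()
        key = sentence.lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(sentence)
    return " ".join(kept)

# Made-up sample text with a repeated sentence
example = "Report it. Report it. Notify the district! Report it."
cleaned = dedupe_sentences(example)
print(cleaned)
```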


In [17]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords



# Download NLTK resources (run only once)
#nltk.download('punkt')
#nltk.download('stopwords')

def extract_keywords(provided_text):
    # Tokenize the text
    tokens = word_tokenize(provided_text)

    # Convert tokens to lowercase
    tokens = [token.lower() for token in tokens]

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [token.title() for token in tokens if token not in stop_words]

    # Remove punctuation and non-alphabetic characters
    keywords = [token for token in filtered_tokens if token.isalpha()]

    # Remove duplicate keywords
    unique_keywords = list(set(keywords))

    return ', '.join(unique_keywords)
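Because `extract_keywords` deduplicates with `set`, the order of the returned keywords is not guaranteed to be the same from one run to the next. The sketch below shows one way to keep first-seen order instead. It is a simplified illustration: `extract_keywords_ordered` is a hypothetical helper, it tokenizes with a regular expression rather than NLTK, and `STOP_WORDS` is a tiny stand-in for NLTK's English stopword list.

```python
import re

# Tiny illustrative stopword list (an assumption); the notebook uses NLTK's list
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "what"}

def extract_keywords_ordered(text, stop_words=STOP_WORDS):
    """Return title-cased keywords in first-seen order (hypothetical helper)."""
    # Keep alphabetic runs only, which also drops punctuation and digits
    tokens = re.findall(r"[a-z]+", text.lower())
    keywords = [token.title() for token in tokens if token not in stop_words]
    # dict.fromkeys removes duplicates while preserving insertion order
    return ", ".join(dict.fromkeys(keywords))

keywords = extract_keywords_ordered(
    "What is Community Based Surveillance? Surveillance in the community."
)
print(keywords)
```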





def find_matching_paragraphs(df, text_to_check, threshold=0.9):
    # Work on a clean string copy of the text column so the caller's DataFrame is not modified
    texts = df['text'].fillna('').astype(str)

    # Fit TF-IDF on the booklet paragraphs, then project the answer into the same space
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform(texts)
    provided_text_tfidf = tfidf_vectorizer.transform([text_to_check])

    # Calculate cosine similarity between the provided text and each paragraph
    cosine_similarities = cosine_similarity(provided_text_tfidf, tfidf_matrix).flatten()

    # Find paragraphs that meet or exceed the threshold
    matching_paragraph_indices = [i for i, score in enumerate(cosine_similarities) if score >= threshold]

    if matching_paragraph_indices:
        # Get the corresponding paragraph numbers as a comma-separated string
        matches = df.iloc[matching_paragraph_indices]
        matching_paragraph_numbers = [str(int(i)) for i in matches['paragraph']]
        # Only the file of the first match is reported
        matching_filename = matches['filename'].tolist()[0]
        return ', '.join(matching_paragraph_numbers), matching_filename

    # If no paragraph meets the threshold, fall back to the single most similar one
    closest_paragraph_index = cosine_similarities.argmax()
    closest_paragraph_number = df.iloc[closest_paragraph_index]['paragraph']
    matching_filename = df.iloc[closest_paragraph_index]['filename']
    # Returned as a string such as "173.0"; the caller converts it to an integer
    return str(closest_paragraph_number), matching_filename
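The threshold-then-fallback logic can be seen on a toy example without any dependencies. This is a simplified sketch: it uses plain word counts instead of TF-IDF, and `bag_of_words`, `cosine`, `match_paragraphs` and the three toy paragraphs are all made up for illustration.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def match_paragraphs(paragraphs, query, threshold=0.9):
    """paragraphs is a list of (number, text) pairs; returns matching numbers."""
    q = bag_of_words(query)
    scores = [cosine(q, bag_of_words(text)) for _, text in paragraphs]
    hits = [number for (number, _), score in zip(paragraphs, scores) if score >= threshold]
    if hits:
        return hits
    # Fallback: the single most similar paragraph
    best = max(range(len(scores)), key=scores.__getitem__)
    return [paragraphs[best][0]]

toy = [
    (2, "Community based surveillance relies on community members"),
    (3, "Indicator based surveillance uses routine reports"),
    (4, "Laboratory confirmation of suspected cases"),
]
exact = match_paragraphs(toy, "Indicator based surveillance uses routine reports")
fallback = match_paragraphs(toy, "confirmation of cases in the laboratory")
print(exact, fallback)
```

With a threshold as high as 0.9, only near-verbatim answers clear it, so in practice the fallback branch does most of the work.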

In [18]:
sub["Answers"] = [sub.Answer[i].split(sub.Question[i])[-1].strip() for i in range(len(sub))]

In [19]:
sub["Answers"] = [remove_repeated_sentences(text) for text in sub["Answers"]]

In [20]:
import pandas as pd
import os

# List to store individual dataframes
dfs = []

# Read each Excel file (in a stable, sorted order) and append its dataframe to the list
booklets_dir = os.path.join(path, "MWTGBookletsExcel")
for file in sorted(os.listdir(booklets_dir)):
    df = pd.read_excel(os.path.join(booklets_dir, file), names=["paragraph", "text"])
    df["filename"] = file
    dfs.append(df)

# Concatenate all dataframes into a single dataframe
combined_df = pd.concat(dfs, ignore_index=True)

combined_df

Unnamed: 0,paragraph,text,filename
0,2.0,THIRD EDITION,TG Booklet 1.xlsx
1,3.0,BOOKLET ONE: INTRODUCTION SECTION,TG Booklet 1.xlsx
2,4.0,￼,TG Booklet 1.xlsx
3,5.0,DECEMBER 2020,TG Booklet 1.xlsx
4,6.0,￼ ...,TG Booklet 1.xlsx
...,...,...,...
5215,511.0,"Health care associated exposure, including pro...",TG Booklet 6.xlsx
5216,512.0,Working together in close proximity or sharing...,TG Booklet 6.xlsx
5217,513.0,Traveling together with MERS‐CoV patient in an...,TG Booklet 6.xlsx
5218,514.0,Living in the same household as a MERS‐CoV pat...,TG Booklet 6.xlsx


In [30]:
ID = []
Target = []

for index, row in tqdm(sub.iterrows(), total=len(sub)):
    ID.append(row["ID"]+"_keywords")
    Target.append(extract_keywords(row["Answers"]))
    ID.append(row["ID"]+"_paragraph(s)_number")
    paragraph, filename = find_matching_paragraphs(combined_df, row["Answers"], threshold=0.9)
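    # A match can list several paragraphs ("12, 13"), so convert each number separately
    Target.append(", ".join(str(int(float(p))) for p in paragraph.split(", ")))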
    ID.append(row["ID"]+"_question_answer")
    Target.append(row["Answers"])
    ID.append(row["ID"]+"_reference_document")
    Target.append(filename.split(".xlsx")[0])

100%|████████████████████████████████████████████████████████████████████████████████| 499/499 [03:53<00:00,  2.14it/s]
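Each question therefore contributes four rows to the submission, one per ID suffix. The layout can be sketched as follows; `rows_for_question` is a hypothetical helper and the keyword and answer values are placeholders.

```python
def rows_for_question(question_id, keywords, paragraphs, answer, document):
    """Return the four (ID, Target) pairs for one question (hypothetical helper)."""
    return [
        (f"{question_id}_keywords", keywords),
        (f"{question_id}_paragraph(s)_number", paragraphs),
        (f"{question_id}_question_answer", answer),
        (f"{question_id}_reference_document", document),
    ]

# Placeholder keywords and answer; ID, paragraph and booklet follow the Q4 rows
rows = rows_for_question("Q4", "Keyword One, Keyword Two", "173", "placeholder answer", "TG Booklet 3")
for row_id, target in rows:
    print(row_id, "->", target)
```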


In [27]:
ss = pd.read_csv(path + "SampleSubmission.csv")
ss

Unnamed: 0,ID,Target
0,Q1000_keywords,
1,Q1000_paragraph(s)_number,
2,Q1000_question_answer,
3,Q1000_reference_document,
4,Q1002_keywords,
...,...,...
1991,Q999_reference_document,
1992,Q9_keywords,
1993,Q9_paragraph(s)_number,
1994,Q9_question_answer,


In [28]:
ss["ID"] = ID
ss["Target"] = Target

ss.to_csv("My Finetuned Baseline submission.csv", index=False)

In [29]:
ss

Unnamed: 0,ID,Target
0,Q4_keywords,"Cases, Events, Definition, Occurring, Disease,..."
1,Q4_paragraph(s)_number,173
2,Q4_question_answer,in the context of surveillance?It is defined i...
3,Q4_reference_document,TG Booklet 3
4,Q5_keywords,"Effectiveness, Measure, Address, Events, Acute..."
...,...,...
1991,Q1239_reference_document,TG Booklet 4
1992,Q1246_keywords,"Effectiveness, Region, Serves, Vaccination, In..."
1993,Q1246_paragraph(s)_number,254
1994,Q1246_question_answer,Community-based surveillance focuses on the de...


###### The End

That's it for the tutorial! 🎉 Keep learning!! Keep Winning!!


Tips to Improve

1. Use a more advanced model for fine-tuning. 🚀
2. Try other fine-tuning methods like PEFT. 🛠️
3. Introduce prompt engineering techniques. 🔧
4. Experiment with different hyperparameters during fine-tuning to optimize model performance.
5. Fine-tune the model for a longer duration to allow it to learn more intricate patterns in the data.


Feel free to incorporate these suggestions into your workflow to further enhance your model's effectiveness! 📈


Want to connect? 🔗 Feel free to reach out to me, preferably on LinkedIn.

- [Twitter](https://twitter.com/olufemivictort)

- [LinkedIn](https://www.linkedin.com/in/olufemi-victor-tolulope)

- [GitHub](https://github.com/osinkolu)

### Author: Olufemi Victor Tolulope