**DATA LOADING**

In [1]:
!pip install --upgrade datasets



In [2]:
!pip install evaluate



In [3]:
from datasets import load_dataset, Dataset
import pandas as pd
import numpy as np

# Load the dataset from the Hugging Face Hub
ds = load_dataset("nyarkssss/gh-maternal-1k")

# Convert the above dataset into a pandas DataFrame
df = ds['train'].to_pandas()

# Display the first few rows
df.head()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Unnamed: 0,input,output,domain,context,instruction,prompt
0,Why is breast milk considered the perfect food...,Breast milk provides all the essential nutrien...,Breastfeeding,Postpartum,"Answer the question truthfully, you are a medi...","If you are a doctor, please answer the medical..."
1,How long does the American Academy of Pediatri...,The American Academy of Pediatrics recommends ...,Breastfeeding,Postpartum,"Answer the question truthfully, you are a medi...","If you are a doctor, please answer the medical..."
2,What are the benefits of breastfeeding for a b...,Breastfeeding provides lifelong benefits such ...,Breastfeeding,Postpartum,"Answer the question truthfully, you are a medi...","If you are a doctor, please answer the medical..."
3,What are the benefits of breastfeeding for the...,Breastfeeding lowers the risk of breast cancer...,Maternal Health,Postpartum,"Answer the question truthfully, you are a medi...","If you are a doctor, please answer the medical..."
4,How often should a newborn breastfeed?,A newborn should breastfeed at least 8-12 time...,Baby Care,Postpartum,"Answer the question truthfully, you are a medi...","If you are a doctor, please answer the medical..."


**DATA CLEANING**

In [4]:
# Check that the column dtypes match the dataset schema on Hugging Face
print(df.dtypes)

input          object
output         object
domain         object
context        object
instruction    object
prompt         object
dtype: object


In [5]:
# An overview of our data
df.describe()

Unnamed: 0,input,output,domain,context,instruction,prompt
count,1034,1034,1034,1033,1034,1034
unique,1022,1033,23,71,1,1
top,How can I relieve abdominal and groin pain dur...,Your GP or obstetrician will discuss contracep...,Baby Care,Antenatal,"Answer the question truthfully, you are a medi...","If you are a doctor, please answer the medical..."
freq,2,2,143,531,1034,1034


In [7]:
# Check for missing values
df.isnull().sum()

input          0
output         0
domain         0
context        1
instruction    0
prompt         0
dtype: int64


In [8]:
# Only one value is missing, so we can drop that row
df = df.dropna(subset = ['context'])

# Confirm there are no more missing values
df.isnull().sum()

input          0
output         0
domain         0
context        0
instruction    0
prompt         0
dtype: int64


In [9]:
# Check for Duplicates

# Group by 'input' and check if there are multiple unique outputs
duplicate_inputs = df.groupby('input')['output'].nunique()
duplicate_inputs = duplicate_inputs[duplicate_inputs > 1]

# Print each duplicated input together with its differing outputs
# (the loop variable is named `question` so it does not shadow the built-in input())
for question, count in duplicate_inputs.items():
    print(f"Input: {question}")
    print(f"Number of Different Outputs: {count}")
    print(df[df['input'] == question][['input', 'output']])
    print("-" * 20)  # Line separator

Input: How can I manage constipation during pregnancy?
Number of Different Outputs: 2
                                               input  \
384  How can I manage constipation during pregnancy?   
936  How can I manage constipation during pregnancy?   

                                                output  
384  Eat fiber-rich foods like fruits, vegetables, ...  
936  Ensure your diet includes plenty of fresh frui...  
--------------------
Input: How can I relieve abdominal and groin pain during pregnancy?
Number of Different Outputs: 2
                                                 input  \
897  How can I relieve abdominal and groin pain dur...   
928  How can I relieve abdominal and groin pain dur...   

                                                output  
897  You can relieve abdominal and groin pain by ly...  
928  You can lie on your side with your knees and h...  
--------------------
Input: How can I relieve leg cramps during pregnancy?
Number of Different Outputs: 2
  

In [10]:
# Drop duplicates, and keep only the first instance
train_df = df.drop_duplicates(subset = ['input'], keep ='first')

# Convert back to a Hugging Face Dataset (Dataset was imported in the first code cell)
train_data = Dataset.from_pandas(train_df)
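
`drop_duplicates` only removes questions that match character for character, so variants that differ in case or spacing would survive. A minimal sketch of a normalized de-duplication follows, assuming that lowercasing and collapsing whitespace is an acceptable notion of "same question"; the helper name `drop_normalized_duplicates` and the toy rows are illustrative and not part of the notebook.

```python
import pandas as pd

def drop_normalized_duplicates(df, column="input"):
    """Keep the first row for each question after normalizing case and whitespace."""
    key = (
        df[column]
        .str.lower()
        .str.replace(r"\s+", " ", regex=True)  # collapse runs of whitespace
        .str.strip()
    )
    return df.loc[~key.duplicated(keep="first")].reset_index(drop=True)

# Toy data (hypothetical): the second row differs only in case and spacing
toy = pd.DataFrame({
    "input": [
        "How often should a newborn breastfeed?",
        "how often should a  newborn breastfeed? ",
        "Can I exercise while pregnant?",
    ],
    "output": ["A", "B", "C"],
})
deduped = drop_normalized_duplicates(toy)
print(deduped)
```

Resetting the index here also keeps a stray index column from being carried into `Dataset.from_pandas`.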

**DATA PRE-PROCESSING**

In [11]:
from transformers import T5Tokenizer, T5ForConditionalGeneration, TrainingArguments, Trainer

# Load the pretrained T5-base tokenizer and model
tokenizer = T5Tokenizer.from_pretrained('google-t5/t5-base')
model = T5ForConditionalGeneration.from_pretrained('google-t5/t5-base')

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [12]:
# Compute the 90th, 95th, and 99th percentiles of the tokenized sequence lengths
def calculating_length_distribution(dataset, tokenizer):
    input_lengths = [len(tokenizer(i)["input_ids"]) for i in dataset["input"]]
    output_lengths = [len(tokenizer(o)["input_ids"]) for o in dataset["output"]]

    # Calculating percentiles
    input_90th = np.percentile(input_lengths, 90)
    input_95th = np.percentile(input_lengths, 95)
    input_99th = np.percentile(input_lengths, 99)

    output_90th = np.percentile(output_lengths, 90)
    output_95th = np.percentile(output_lengths, 95)
    output_99th = np.percentile(output_lengths, 99)

    return {
        "input_length_percentiles": (input_90th, input_95th, input_99th),
        "output_length_percentiles": (output_90th, output_95th, output_99th)
    }

# Calculate and display the distributions before we can proceed to back translation
distributions = calculating_length_distribution(train_data, tokenizer)
print("Input Length Percentiles (90th, 95th, 99th):", distributions["input_length_percentiles"])
print("Output Length Percentiles (90th, 95th, 99th):", distributions["output_length_percentiles"])

Input Length Percentiles (90th, 95th, 99th): (np.float64(20.0), np.float64(23.0), np.float64(28.0))
Output Length Percentiles (90th, 95th, 99th): (np.float64(65.0), np.float64(79.0), np.float64(117.59999999999991))


In [13]:
# Find the maximum tokenized length of a field across the whole dataset
# (to compare against the percentiles computed above)
def get_max_token_length(dataset, field):
    max_length = 0
    for example in dataset:
        tokenized_text = tokenizer(example[field])
        length = len(tokenized_text['input_ids'])
        if length > max_length:
            max_length = length
    return max_length

max_question_length = get_max_token_length(train_data, 'input')
max_answer_length = get_max_token_length(train_data, 'output')

print(f"Maximum token length for inputs: {max_question_length}")
print(f"Maximum token length for outputs: {max_answer_length}")

Maximum token length for inputs: 39
Maximum token length for outputs: 237
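
The percentiles and maxima above imply a trade-off when choosing a length cap: a cap near the maximum truncates nothing but pads every shorter sequence, while a cap near the 95th or 99th percentile pads less and truncates a few examples. The sketch below quantifies that for a list of token lengths. `truncation_report` and the toy lengths are illustrative assumptions, not values from the dataset.

```python
def truncation_report(lengths, cap):
    """Summarize how a length cap affects a collection of token lengths."""
    truncated = sum(1 for n in lengths if n > cap)
    padding_tokens = sum(max(cap - n, 0) for n in lengths)
    return {
        "truncated_fraction": truncated / len(lengths),
        "padding_tokens": padding_tokens,
    }

# Toy lengths (hypothetical); in the notebook these would be the tokenized lengths
report = truncation_report([10, 20, 30, 40], cap=30)
print(report)
```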


In [14]:
!pip install nltk==3.8.1 googletrans==4.0.0-rc1



In [18]:
import os
import random
import numpy as np
import pandas as pd
import torch
import nltk
from sklearn.model_selection import train_test_split
from googletrans import Translator
from transformers import (
    T5Tokenizer, T5ForConditionalGeneration, TrainingArguments, Trainer, EarlyStoppingCallback
)
from datasets import Dataset, concatenate_datasets
from evaluate import load

# Ensure NLTK data is downloaded
nltk.download('punkt')
nltk.download('wordnet')

# Load tokenizer and model
model_name = 'google-t5/t5-base'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [19]:
# Calculating token length percentiles
def calculating_length_distribution(dataset, tokenizer):
    input_lengths = [len(tokenizer(i)["input_ids"]) for i in dataset["input"]]
    output_lengths = [len(tokenizer(o)["input_ids"]) for o in dataset["output"]]
    percentiles = lambda lst: tuple(np.percentile(lst, p) for p in [90, 95, 99])
    return {
        "input_length_percentiles": percentiles(input_lengths),
        "output_length_percentiles": percentiles(output_lengths)
    }

# Back translation for data augmentation
def back_translate(text, target_lang='fr'):
    translator = Translator()
    translated = translator.translate(text, dest = target_lang).text
    return translator.translate(translated, dest = 'en').text

# Augment dataset using back translation
def augment_data_with_back_translation(dataset, num_augmented_samples = 1000):
    augmented_samples = []
    examples_for_inspection = []

    for _ in range(num_augmented_samples):
        sample = random.choice(dataset)
        augmented_input = back_translate(sample['input'])

        augmented_samples.append({
            'input': augmented_input,
            'output': sample['output'],
            'domain': sample['domain'],
            'context': sample['context'],
            'instruction': sample['instruction']
        })

        examples_for_inspection.append({
            'original_input': sample['input'],
            'augmented_input': augmented_input
        })

    for example in examples_for_inspection[:5]:
        print(f"Original: {example['original_input']}\nAugmented: {example['augmented_input']}\n{'-'*20}")

    augmented_dataset = Dataset.from_pandas(pd.DataFrame(augmented_samples))
    return concatenate_datasets([dataset, augmented_dataset])

In [20]:
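# Length caps, set just above the observed maxima (39 input tokens, 237 output tokens).
# Note: the prompt built in tokenize_data also contains the instruction and context,
# so a full prompt can exceed max_input_len and will then be truncated.
max_input_len, max_output_len = 45, 256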

# Tokenization in preparation for training
def tokenize_data(examples):
    inputs = [
        f"instruction: {instr} context: {ctx} question: {inp}"
        for instr, ctx, inp in zip(examples["instruction"], examples["context"], examples["input"])
    ]
    targets = examples["output"]

    model_inputs = tokenizer(inputs, max_length = max_input_len, truncation = True, padding = "max_length")
    labels = tokenizer(targets, max_length = max_output_len, truncation = True, padding = "max_length")
    # Labels keep their pad-token ids, so padded positions contribute to the loss and token metrics
    model_inputs["labels"] = labels["input_ids"]

    return model_inputs
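
In Hugging Face seq2seq training, label positions set to -100 are skipped by the cross-entropy loss, which is the usual way to stop padding from dominating the loss. The sketch below shows that masking step on plain lists; `mask_padding` is a hypothetical helper, and it is not applied in this notebook (the metric code that decodes labels would need to map -100 back to the pad id first).

```python
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss

def mask_padding(label_ids, pad_token_id):
    """Replace pad-token ids in each label sequence with IGNORE_INDEX."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]

# Toy label batch (hypothetical ids); T5 uses 0 as its pad token id
masked = mask_padding([[5, 6, 1, 0, 0], [7, 1, 0, 0, 0]], pad_token_id=0)
print(masked)
```

Inside `tokenize_data` this would correspond to `model_inputs["labels"] = mask_padding(labels["input_ids"], tokenizer.pad_token_id)`.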

In [22]:
# Split 80/20 into training and validation sets, then augment only the training split
ds = train_data.train_test_split(test_size = 0.2, seed = 42)
train_dataset = augment_data_with_back_translation(ds['train'], num_augmented_samples = 400)

Original: What is postpartum post-traumatic stress disorder (PTSD)?
Augmented: What is post-partum post-traumatic stress disorder (SSPT)?
--------------------
Original: What are pelvic floor exercises, and how do I perform them?
Augmented: What are the pelvic floor exercises and how to perform them?
--------------------
Original: What are the benefits of the COVID-19 vaccine during pregnancy?
Augmented: What are the advantages of the COVVI-19 vaccine during pregnancy?
--------------------
Original: What should I do if I have bladder or bowel problems after childbirth?
Augmented: What should I do if I have bladder or intestine problems after childbirth?
--------------------
Original: What are the signs that I should contact my doctor immediately during pregnancy?
Augmented: What are the signs that I should contact my doctor immediately during pregnancy?
--------------------
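
The last pair above came back from the round trip unchanged, so that augmented row is an exact copy of an existing training example. A minimal sketch of one way to skip such copies and to sample source rows without replacement follows. It assumes a generic `paraphrase` callable in place of the translation service; `augment_without_duplicates` and `toy_paraphrase` are hypothetical names.

```python
import random

def augment_without_duplicates(rows, paraphrase, num_samples, seed=42):
    """Paraphrase sampled rows, dropping results identical to the original question."""
    rng = random.Random(seed)  # local generator keeps the sampling reproducible
    picked = rng.sample(rows, k=min(num_samples, len(rows)))
    augmented = []
    for row in picked:
        new_question = paraphrase(row["input"])
        if new_question.strip().lower() == row["input"].strip().lower():
            continue  # the paraphrase changed nothing, so it adds no variety
        augmented.append({**row, "input": new_question})
    return augmented

def toy_paraphrase(text):
    # Stand-in for back translation (assumption): swaps a single word
    return text.replace("benefits", "advantages")

rows = [
    {"input": "What are the benefits of breastfeeding?", "output": "answer 1"},
    {"input": "How often should a newborn breastfeed?", "output": "answer 2"},
]
augmented = augment_without_duplicates(rows, toy_paraphrase, num_samples=2)
print(augmented)
```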


In [23]:
# Tokenize the training and validation sets
train_dataset = train_dataset.map(tokenize_data, batched = True)
val_dataset = ds['test'].map(tokenize_data, batched = True)

# Print dataset sizes
print(f"Train set size: {train_dataset.num_rows}")
print(f"Validation set size: {val_dataset.num_rows}")

Train set size: 1216
Validation set size: 205


**MODEL BUILDING**

In [24]:
# Training configuration
training_args = TrainingArguments(
    output_dir = "./maternal_bot",
    report_to = "none",
    num_train_epochs = 4,
    per_device_train_batch_size = 4,
    per_device_eval_batch_size = 4,
    learning_rate = 1e-4,
    logging_steps = 10,
    weight_decay = 3e-4,
    save_total_limit = 1,
    fp16 = True,
    do_eval = True,
    do_train = True
)

# Evaluation metrics
bleu = load("bleu")
f1_metric = load("f1")
accuracy = load("accuracy")

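def preprocess_logits_for_metrics(logits, labels):
    # The model outputs arrive as a tuple; the first element holds the token logits
    logits = logits[0]
    probs = torch.nn.functional.softmax(logits, dim = -1)
    # Returns (max probability, argmax token id) for every position
    return torch.max(probs, dim = -1)

def calculate_perplexity(preds):
    # Confidence-based proxy: uses the probability of the predicted (argmax) token,
    # not of the reference token, so it differs from the standard exp(eval_loss)
    probs, _ = preds
    avg_log_prob = np.log(probs).mean()
    return np.exp(-avg_log_prob)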

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    probs, word_ids = predictions

    # Flatten to token level; padded positions are counted in accuracy and F1
    flat_preds = word_ids.flatten()
    flat_labels = labels.flatten()

    # Decode to text for BLEU, which expects a list of references per prediction
    candidates = tokenizer.batch_decode(word_ids, skip_special_tokens = True)
    references = [[l] for l in tokenizer.batch_decode(labels, skip_special_tokens = True)]

    return {
        "accuracy": accuracy.compute(predictions = flat_preds, references = flat_labels)["accuracy"],
        "bleu": bleu.compute(predictions = candidates, references = references)["bleu"],
        # Micro-averaged F1 over single-label token predictions equals accuracy
        "f1": f1_metric.compute(predictions = flat_preds, references = flat_labels, average = "micro")["f1"],
        "perplexity": calculate_perplexity(predictions)
    }

# Trainer setup
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = val_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics,
    preprocess_logits_for_metrics = preprocess_logits_for_metrics
)

trainer.train()

# Evaluate model
eval_results = trainer.evaluate()
print("\nEvaluation Results:")
for key, value in eval_results.items():
    print(f"{key}: {value}")
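
Because the label sequences are padded to 256 tokens, most positions compared by the token-level accuracy and F1 are padding, which tends to inflate both scores. The sketch below shows a token accuracy that skips padded label positions. `masked_token_accuracy` and the toy ids are illustrative assumptions, and the function is not wired into the `Trainer` above.

```python
def masked_token_accuracy(pred_ids, label_ids, pad_token_id=0):
    """Token accuracy computed only where the label is not padding."""
    correct = 0
    total = 0
    for pred_seq, label_seq in zip(pred_ids, label_ids):
        for pred, label in zip(pred_seq, label_seq):
            if label == pad_token_id:
                continue  # ignore padded positions
            total += 1
            correct += int(pred == label)
    return correct / total if total else 0.0

# Toy batch (hypothetical ids): one wrong token among five real label tokens
preds = [[5, 6, 0, 0], [7, 8, 9, 0]]
labels = [[5, 4, 0, 0], [7, 8, 9, 0]]
masked_acc = masked_token_accuracy(preds, labels)

# For comparison: accuracy over every position, padding included
flat = [(p, l) for ps, ls in zip(preds, labels) for p, l in zip(ps, ls)]
unmasked_acc = sum(p == l for p, l in flat) / len(flat)
print(masked_acc, unmasked_acc)
```

On this toy batch the masked score is lower than the unmasked one, since the correctly "predicted" pad positions no longer count.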


Step,Training Loss
10,11.9369
20,0.9276
30,0.636
40,0.5545
50,1.0077
60,0.4949
70,0.5045
80,0.4241
90,0.4832
100,0.4373
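
The loss is logged every 10 optimizer steps (`logging_steps = 10`). As a rough check on how many steps the run takes in total, the arithmetic can be sketched as below, assuming a single device and the default gradient accumulation of 1; `training_steps` is a hypothetical helper, not a library function.

```python
import math

def training_steps(num_examples, batch_size, epochs):
    """Optimizer steps per epoch and in total (one device, no gradient accumulation)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

# 1216 training rows, batch size 4, 4 epochs (values from the cells above)
steps_per_epoch, total_steps = training_steps(1216, batch_size=4, epochs=4)
print(steps_per_epoch, total_steps)
```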



Evaluation Results:
eval_loss: 0.36937281489372253
eval_accuracy: 0.9211128048780488
eval_bleu: 0.14351977486937675
eval_f1: 0.9211128048780488
eval_perplexity: 1.1137527227401733
eval_runtime: 10.131
eval_samples_per_second: 20.235
eval_steps_per_second: 5.133
epoch: 4.0
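
Perplexity is conventionally the exponential of the mean token cross-entropy, which the `Trainer` reports as `eval_loss`. The sketch below applies that definition to the loss reported above; `perplexity_from_loss` is an illustrative helper. Note that this `eval_loss` still includes padded label positions, so the result is likely optimistic as well.

```python
import math

def perplexity_from_loss(mean_cross_entropy):
    """Standard perplexity: exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)

eval_loss = 0.36937281489372253  # eval_loss from the results above
ppl = perplexity_from_loss(eval_loss)
print(f"{ppl:.3f}")
```

The value comes out near 1.45, higher than the 1.11 reported as `eval_perplexity`, because that figure is derived from the probability of the predicted token and not the reference token.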


**SAVING MODEL**

In [25]:
# Save model
model_path = "./t5_maternal_bot1"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)


('./t5_maternal_bot1/tokenizer_config.json',
 './t5_maternal_bot1/special_tokens_map.json',
 './t5_maternal_bot1/spiece.model',
 './t5_maternal_bot1/added_tokens.json')

**CHATBOT TESTING**

In [70]:
def generate_response(question):
    """
    Generate a chatbot response for a given question.
    """
    # Handle empty input
    if not question.strip():
        return "Please enter a valid question."

    # Only answer questions that mention a maternal-health keyword
    maternal_keywords = ["pregnancy", "baby", "birth", "mother", "preterm",
                         'breast', 'miscarriage', 'pregnant', 'fertility',
                         'fertile', 'abortion', 'malnourished', 'ovulation',
                         'menstrual cycle', 'menstruation', 'stillbirth',
                         "antenatal", 'postnatal', 'doctor', 'nurse', 'babycare',
                         "labor", "postpartum", "maternal", "neonatal"]
    if not any(keyword in question.lower() for keyword in maternal_keywords):
        return "Sorry, I can only answer maternal health-related questions."

    # Format the input (training prompts also carried instruction and context prefixes)
    input_text = f"question: {question}"

    # Encode the input text as PyTorch tensors
    input_ids = tokenizer.encode(input_text, return_tensors="pt")

    # Move the inputs and the model to the GPU if one is available, otherwise the CPU
    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
    input_ids = input_ids.to(device)
    model.to(device)

    # Generate output IDs with sampling
    output_ids = model.generate(
        input_ids,
        max_new_tokens = 120,
        do_sample = True,
        top_k = 45, # sample only from the 45 most likely tokens
        temperature = 0.9, # values below 1 sharpen the distribution (less randomness)
        top_p = 0.8, # nucleus sampling: smallest token set with cumulative probability 0.8
        repetition_penalty = 1.3, # discourages the model from repeating the same tokens
    )

    # Decode the output IDs back to text
    response = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # Fall back to a fixed message if the model generated nothing
    if not response:
        response = "I am unable to answer that question."
    return response


print("Chatbot: Hello! Ask me anything about Maternal Health.")

while True:
    # Get user input
    user_question = input("You: ")

    # Exit condition
    if user_question.lower() in ['quit', 'exit', 'bye']:
        print("Chatbot: See ya!")
        break

    # Get the chatbot response
    response = generate_response(user_question)

    # Print the response
    print(f"Chatbot: {response}")
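
The keyword gate in `generate_response` uses substring matching, so "labor" also matches inside "elaborate". One possible refinement, sketched below, anchors each keyword at the start of a word with a regular expression. The shortened keyword list and the name `is_maternal_question` are assumptions for illustration; prefixes such as "laboratory" would still pass.

```python
import re

# Shortened, illustrative keyword list (assumption); stems match word beginnings
MATERNAL_KEYWORDS = ["pregnan", "baby", "babies", "breast", "labor",
                     "postpartum", "maternal", "antenatal"]

# \b anchors each keyword at a word start, so "elaborate" no longer matches "labor"
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in MATERNAL_KEYWORDS) + r")",
    re.IGNORECASE,
)

def is_maternal_question(text):
    """Return True when any keyword occurs at the start of a word in text."""
    return bool(_PATTERN.search(text))

print(is_maternal_question("Can I exercise while pregnant?"))
print(is_maternal_question("Please elaborate on tax law."))
print(is_maternal_question("Is breastfeeding painful?"))
```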

Chatbot: Hello! Ask me anything about Maternal Health.
You: How long does postpartum bleeding last?
Chatbot: Afterpartum bleeding is permanent, but it can last up to 6 weeks. It can last up to 8 weeks.
You: How long does postpartum bleeding last?
Chatbot: Postpartum bleeding lasts for approximately 1 to 2 weeks. It can last for up to three weeks, depending on the severity of the bleeding and the underlying cause.
You: Why is breast milk good food for the baby?
Chatbot: Breast milk is rich in vitamins and minerals that promote healthy growth and development.
You: Can I have sex when pregnant?
Chatbot: Yes, sex is legal in pregnancy.
You: Can I exercise while pregnant?
Chatbot: Yes, you can exercise at any time of day and night.
You: Can I exercise while pregnant?
Chatbot: Yes, exercise can be considered a beneficial activity for pregnant women.
You: What foods should I avoid during pregnancy?
Chatbot: Avoid spicy, fatty, and sugary foods, as they can lead to weight gain.
You: What are t
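
The transcript gives different answers to the same question because `do_sample = True` draws tokens at random on every call. If repeatable answers are preferred, beam search without sampling is one option. The sketch below only builds keyword arguments for `model.generate`; `build_generation_kwargs` is a hypothetical helper and `num_beams = 4` is an assumed, untuned value.

```python
def build_generation_kwargs(deterministic=True, max_new_tokens=120):
    """Return keyword arguments for model.generate()."""
    common = {"max_new_tokens": max_new_tokens, "repetition_penalty": 1.3}
    if deterministic:
        # Beam search: the same input always yields the same output
        return {**common, "do_sample": False, "num_beams": 4, "early_stopping": True}
    # Sampling settings used in the notebook's generate_response
    return {**common, "do_sample": True, "top_k": 45, "top_p": 0.8, "temperature": 0.9}

beam_kwargs = build_generation_kwargs(deterministic=True)
sample_kwargs = build_generation_kwargs(deterministic=False)
print(beam_kwargs)
```

Usage would look like `model.generate(input_ids, **build_generation_kwargs())`.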

In [71]:
!zip -r t5_maternal_bot1.zip t5_maternal_bot1
from google.colab import files
files.download("t5_maternal_bot1.zip")

  adding: t5_maternal_bot1/ (stored 0%)
  adding: t5_maternal_bot1/special_tokens_map.json (deflated 85%)
  adding: t5_maternal_bot1/tokenizer_config.json (deflated 94%)
  adding: t5_maternal_bot1/generation_config.json (deflated 29%)
  adding: t5_maternal_bot1/spiece.model (deflated 48%)
  adding: t5_maternal_bot1/config.json (deflated 63%)
  adding: t5_maternal_bot1/added_tokens.json (deflated 83%)
  adding: t5_maternal_bot1/model.safetensors (deflated 8%)

