# ChatGPT & Google Gemini prompt


In [26]:
import openai
from tqdm import tqdm
from causal_chains.CausalChain import util  # https://github.com/helliun/causal-chains
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display, Markdown
from dotenv import load_dotenv
import os
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import google.generativeai as genai
import pathlib
import textwrap

In [48]:
one_shot_example = """
        Text: The sudden appearance of unlinked cases of mpox in South Africa without a history of international travel, the high HIV prevalence among confirmed cases, and the high case fatality ratio suggest that community transmission is underway, and the cases detected to date represent a small proportion of all mpox cases that might be occurring in the community; it is unknown how long the virus may have been circulating. This may in part be due to the lack of early clinical recognition of an infection with which South Africa previously gained little experience during the ongoing global outbreak, potential pauci-symptomatic manifestation of the disease, or delays in care-seeking behaviour due to limited access to care or fear of stigma.
        Cause: lack of early clinical recognition of an infection -> Effect: community transmission of mpox
        Cause: pauci-symptomatic manifestation of the disease -> Effect: lack of early clinical recognition of an infection 
        Cause: delays in care-seeking behaviour -> Effect: lack of early clinical recognition of an infection 
        Cause: limited access to care -> Effect: delays in care-seeking behaviour 
        Cause: fear of stigma -> Effect: delays in care-seeking behaviour
        """

prompt = f"""
        Here is an example of how to identify causes (drivers leading to the diseases) and their effects. Noted that all contents are about causes of diseases, potentially sexual behaviors. We identify behaviors for research purpose:

        {one_shot_example}
        Now, analyze the following text and identify the specific causes and their effects:
        Text: {"In the Democratic Republic of the Congo, most reported cases in known endemic provinces continue to be among children under 15 years of age, especially in young children. Infants and children under five years of age are at highest risk of severe disease and death, particularly where prompt optimal case management is limited or unavailable. The number of cases reported weekly remains consistently high while the outbreak continues to expand geographically. High test positivity among tested cases in most provinces also suggests that undetected transmission is likely ongoing in the community.&nbsp;Transmission of mpox due to clade I MPXV via sexual contact in key populations was first identified in the Democratic Republic of the Congo in 2023. In South Kivu province, mpox transmission is sustained through human-to-human contact (sexual and non-sexual).&nbsp; &nbsp;"}
        List the causes and their corresponding effects in the format 'Cause: [cause] -> Effect: [effect]':
        """

Causal text mining (CTM) has been applied to various NLP tasks such as **knowledge base construction, question answering, and text summarization**
The methodologies of CTM often involve two phases: **causal sequence classification and causal span detection**

- The causal sequence classification is a binary classification task to detect whether the sequence entails causality or not. This task requires a deep understanding of commonsense knowledge, as determining causality necessitates the comprehension of underlying real-world principles and contexts Gao et al.
- The causal span detection task aims to distinguish between cause and effect arguments present in causal sequences. This task requires a precise understanding of a complex context that comprises multiple entities and events to discern which parts of sequences correspond to causes and effects and which are noise, in addition to the capabilities previously mentioned.

Biomedical causal relations extracted from different resources, such as online journals, books, and reports, can be leveraged to form causal chains, which may result in the discovery of previously unknown relations.

CTM include various approaches

- <font color="#00b050", style = "bold">Knowledged-based system (expert opinions)</font>: relied heavily on domain experts to define rules and patterns for identifying causal relationships in text.
- <font color="#00b050">Machine learning:</font> Naive Bayes, Support Vector Machines (SVM), and Conditional Random Fields (CRF) were used to classify and extract causal relationships. These models required extensive feature engineering and relied on lexical and syntactic features such as keywords ("due to", "can cause"), part-of-speech tags, and dependency relations. [[2024-05-13#Traditional machine learning methods]]
- **Deep learning techniques**
  - <font color="#00b050">Multiview Convolutional Neural Networks (MVC):</font> This approach leverages multiple views of the input text to capture different aspects of the data. It can combine syntactic, semantic, and positional information to enhance causal relation extraction.
  - <font color="#00b050">R</font><font color="#00b050">ecurrent Neural Networks (RNN):</font> BiLSTM (Bidirectional Long Short-Term Memory) models: These models can capture long-range dependencies in text by processing it in both forward and backward directions. Attention mechanisms are often integrated to focus on relevant parts of the text that contribute to causal relationships.
  - <font color="#00b050">Graph Neural Networks (GNNs):</font> GNNs can model text as graphs, where nodes represent entities or concepts and edges represent relationships. This approach is beneficial for capturing complex causal structures.
  - <font color="#00b050">Transformer Models</font>
    - Bidirectional Encoder Representations from Transformers (BERT): BERT is pre-trained on large corpora and can be fine-tuned for specific tasks. It captures context from both directions, making it effective for understanding complex dependencies in text. <font color="#f79646">Variants like BioBERT (for biomedical text) and ClinicalBERT are tailored for specific domains.</font>
    - ELMo (Embeddings from Language Models): ELMo generates contextualized word embeddings by considering the entire sentence, providing richer representations for identifying causal relationships.

LLMs have demonstrated impressive performance across numerous NLP tasks with zero-shot or few-shot in-context learning **without requiring supervised training** versus **<font color="#e36c09">traditional encoder-based models</font>**

ChatGPT often **demonstrates competitive results** in few-shot settings even in financial domain-specific datasets and Japanese datasets, even though a fully trained encoder-based model outperforms ChatGPT. The result indicates that ChatGPT is a **good starting point for various datasets especially when training data are unavailable**, but not a good causal text miner when the training data are readily available.

The result indicates that ChatGPT serves as a good starting point when training data are limited as its performance is not influenced by the data size. In contrast, **encoder models depend heavily on data size**

ChatGPT struggles with complex causality types, especially those of intra/inter-sentential and implicit causality


In [2]:
who_data = pd.read_csv("../data/corpus.csv")

**Sample sentence**: The sudden appearance of unlinked cases of mpox in South Africa without a history of international travel, the high HIV prevalence among confirmed cases,
and the high case fatality ratio suggest that community transmission is underway, and the cases detected to date represent a small proportion of all mpox cases that might be occurring in the community;
it is unknown how long the virus may have been circulating. This may in part be due to the lack of early clinical recognition of an infection with which South Africa previously gained little experience
during the ongoing global outbreak, potential pauci-symptomatic manifestation of the disease, or delays in care-seeking behaviour due to limited access to care or fear of stigma.

**Expected results**:

- Cause: lack of early clinical recognition of an infection -> Effects: community transmission of mpox
- Cause: pauci-symptomatic manifestation of the disease -> Effects: lack of early clinical recognition of an infection
- Cause: delays in care-seeking behaviour -> Effects: lack of early clinical recognition of an infection
- Cause: limited access to care -> Effect: delays in care-seeking behaviour
- Cause: fear of stigma -> Effect: delays in care-seeking behaviour


In [8]:
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")


class CausalChain:

    def __init__(self, chunks=[]):

        self.chunks = chunks
        self.causes = []
        self.effects = []
        self.outlines = []

    def create_effects(self, batch_size=16):
        print("Analyzing causation...")

        for chunk in tqdm(self.chunks):
            cause_effect_pairs = self.extract_cause_effect(chunk)
            for pair in cause_effect_pairs:
                cause, effect = pair
                self.causes.append(cause)
                self.effects.append(effect)
                self.outlines.append(f"Cause: {cause} -> Effect: {effect}")

    def extract_cause_effect(self, chunk):
        one_shot_example = """
        Text: The sudden appearance of unlinked cases of mpox in South Africa without a history of international travel, the high HIV prevalence among confirmed cases, and the high case fatality ratio suggest that community transmission is underway, and the cases detected to date represent a small proportion of all mpox cases that might be occurring in the community; it is unknown how long the virus may have been circulating. This may in part be due to the lack of early clinical recognition of an infection with which South Africa previously gained little experience during the ongoing global outbreak, potential pauci-symptomatic manifestation of the disease, or delays in care-seeking behaviour due to limited access to care or fear of stigma.
        Cause: lack of early clinical recognition of an infection -> Effect: community transmission of mpox
        Cause: pauci-symptomatic manifestation of the disease -> Effect: lack of early clinical recognition of an infection 
        Cause: delays in care-seeking behaviour -> Effect: lack of early clinical recognition of an infection 
        Cause: limited access to care -> Effect: delays in care-seeking behaviour 
        Cause: fear of stigma -> Effect: delays in care-seeking behaviour
        """

        prompt = f"""
        Here is an example of how to identify causes (drivers leading to the diseases) and their effects (intermediate drivers leading to the diseases, excluding mortality and impacts of diseases):


        {one_shot_example}
        Now, analyze the following text and identify the specific causes and their effects:
        Text: {chunk}
        List the causes and their corresponding effects in the format 'Cause: [cause] -> Effect: [effect]':
        """

        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {
                    "role": "system",
                    "content": "You are a helpful assistant specialized in identifying drivers leading to diseases.",
                },
                {"role": "user", "content": prompt},
            ],
            max_tokens=300,
            temperature=0.5,
        )

        response_text = response["choices"][0]["message"]["content"]
        cause_effect_pairs = []

        for line in response_text.split("\n"):
            if "Cause:" in line and "-> Effect:" in line:
                cause = line.split("Cause:")[1].split("-> Effect:")[0].strip()
                effect = line.split("-> Effect:")[1].strip()
                cause_effect_pairs.append((cause, effect))
        return cause_effect_pairs

In [9]:
text = who_data["Text"][9]
chunks = util.create_chunks(text)
cc = CausalChain(chunks)

In [3]:
who_data["Text"][9]

'In the Democratic Republic of the Congo, most reported cases in known endemic provinces continue to be among children under 15 years of age, especially in young children. Infants and children under five years of age are at highest risk of severe disease and death, particularly where prompt optimal case management is limited or unavailable. The number of cases reported weekly remains consistently high while the outbreak continues to expand geographically. High test positivity among tested cases in most provinces also suggests that undetected transmission is likely ongoing in the community.&nbsp;Transmission of mpox due to clade I MPXV via sexual contact in key populations was first identified in the Democratic Republic of the Congo in 2023. In South Kivu province, mpox transmission is sustained through human-to-human contact (sexual and non-sexual).&nbsp; &nbsp;The global outbreak 2022 &mdash; 2024 has shown that sexual contact enables faster and more efficient spread of the virus from

In the Democratic Republic of the Congo, most reported cases in known endemic provinces continue to be among children under 15 years of age, especially in young children. Infants and children under five years of age are at highest risk of severe disease and death, **particularly where prompt optimal case management is limited or unavailable**. The number of cases reported weekly remains consistently high while the outbreak continues to expand geographically. High test positivity among tested cases in most provinces also suggests that **undetected transmission** is likely ongoing in the community. Transmission of mpox due to clade I MPXV via **sexual contact** in key populations was first identified in the Democratic Republic of the Congo in 2023. In South Kivu province, mpox transmission is sustained through **human-to-human contact (sexual and non-sexual)**. The global outbreak 2022 — 2024 has shown that **sexual contact** enables faster and more efficient spread of the virus from one person to another due to direct contact of mucous membranes between people, contact with multiple partners, a possibly shorter incubation period on average, and a longer infectious period for immunocompromised individuals. The newly documented occurrence of mpox in North Kivu is very concerning. The additional public health impact of sustained human-to-human sexual transmission of mpox in the country indicates that a vigorous response is required. One of the main risk factors for severe disease and death among persons with mpox is **immune suppression**, especially among those with advanced HIV infection. The prevalence of HIV in the general adult population in the Democratic Republic of the Congo is estimated to be approximately 1%, higher in the eastern provinces than elsewhere, and higher in key populations including estimates of a prevalence of 7.5% among sex workers and 7.1% among men who have sex with men. The **higher HIV prevalence** and the **challenge in accessing antiretroviral treatment** puts these groups at higher risk for severe mpox and death if they get infected. The occurrence of cases among a broad range of occupational groups and within households also suggests that the outbreak in South Kivu is already spreading into the wider community. Understanding of the dynamics of MPXV transmission in the Democratic Republic of the Congo is improving with the emergency measures being put in place. Nonetheless, a **lack of timely access to diagnostics in many areas**, **incomplete epidemiological investigations**, **challenges in contact tracing** and **extensive but inconclusive animal investigations** continue to hamper rapid response. While **zoonotic spill over events** are considered to still represent a major source of exposure in the country, the animal reservoir remains unknown.


In [None]:
cc.create_effects()

In [None]:
def create_causes_effects_dataframe(causes, effects):
    data = {"Cause": causes, "Effect": effects}
    df = pd.DataFrame(data)
    return df


df = create_causes_effects_dataframe(cc.causes, cc.effects)
display(df)

In [None]:
# Extract causes and effects
causes = cc.causes
effects = cc.effects

# Step 2: Convert Causes and Effects to Embeddings
model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
cause_embeddings = model.encode(causes)
effect_embeddings = model.encode(effects)

# Step 3: Calculate Cosine Similarity within Causes and Effects
cause_similarities = cosine_similarity(cause_embeddings)
effect_similarities = cosine_similarity(effect_embeddings)

# Step 4: Print Cause Similarities
print("Cause Similarities:")
for i, cause in enumerate(causes):
    print(f"Cause: {cause}")
    for j, similarity in enumerate(cause_similarities[i]):
        if i != j:  # Exclude self-similarity
            print(f"    Similar to Cause: {causes[j]}, Similarity: {similarity:.4f}")
    print()

# Step 4: Print Effect Similarities
print("Effect Similarities:")
for i, effect in enumerate(effects):
    print(f"Effect: {effect}")
    for j, similarity in enumerate(effect_similarities[i]):
        if i != j:  # Exclude self-similarity
            print(f"    Similar to Effect: {effects[j]}, Similarity: {similarity:.4f}")
    print()

In [8]:
import openai
from tqdm import tqdm
from causal_chains.CausalChain import util 
import pandas as pd
from IPython.display import display
from dotenv import load_dotenv
import os
import google.generativeai as genai
from google.api_core import retry
from google.auth.credentials import AnonymousCredentials

# Load environment variables (you'll need to set these)
load_dotenv()
gemini_api_key = os.getenv("GEMINI_API_KEY")

# Initialize the Gemini API client
genai.configure(api_key=gemini_api_key)


# Load WHO data (replace with your actual path)
who_data = pd.read_csv("../data/corpus.csv")


class CausalChain:
    def __init__(self, chunks=[]):
        self.chunks = chunks
        self.causes = []
        self.effects = []
        self.outlines = []
        self.one_shot_example = """
        Text: The sudden appearance of unlinked cases of mpox in South Africa without a history of international travel, the high HIV prevalence among confirmed cases, and the high case fatality ratio suggest that community transmission is underway, and the cases detected to date represent a small proportion of all mpox cases that might be occurring in the community; it is unknown how long the virus may have been circulating. This may in part be due to the lack of early clinical recognition of an infection with which South Africa previously gained little experience during the ongoing global outbreak, potential pauci-symptomatic manifestation of the disease, or delays in care-seeking behaviour due to limited access to care or fear of stigma.
        Cause: lack of early clinical recognition of an infection -> Effect: community transmission of mpox
        Cause: pauci-symptomatic manifestation of the disease -> Effect: lack of early clinical recognition of an infection 
        Cause: delays in care-seeking behaviour -> Effect: lack of early clinical recognition of an infection 
        Cause: limited access to care -> Effect: delays in care-seeking behaviour 
        Cause: fear of stigma -> Effect: delays in care-seeking behaviour
        """

    def create_effects(self, batch_size=16):
        print("Analyzing causation...")

        for chunk in tqdm(self.chunks):
            cause_effect_pairs = self.extract_cause_effect(chunk)
            for pair in cause_effect_pairs:
                cause, effect = pair
                self.causes.append(cause)
                self.effects.append(effect)
                self.outlines.append(f"Cause: {cause} -> Effect: {effect}")

    def extract_cause_effect(self, chunk):
        prompt = f"""
        Here is an example of how to identify causes (drivers leading to diseases) and their effects (intermediate drivers leading to the diseases, excluding mortality and impacts of diseases):

        {self.one_shot_example}

        Now, analyze the following text and identify the specific causes and their effects:
        Text: {chunk}
        List the causes and their corresponding effects in the format 'Cause: [cause] -> Effect: [effect]':
        """

        # Use Gemini to generate responses

        response = genai.GenerativeModel('gemini-1.5-pro').generate_content(prompt)
        response_text = response.candidates  # Access the generated text correctly

        cause_effect_pairs = []
        for line in response_text.split("\n"):
            if "Cause:" in line and "-> Effect:" in line:
                cause = line.split("Cause:")[1].split("-> Effect:")[0].strip()
                effect = line.split("-> Effect:")[1].strip()
                cause_effect_pairs.append((cause, effect))
        return cause_effect_pairs


def create_causes_effects_dataframe(causes, effects):
    data = {"Cause": causes, "Effect": effects}
    df = pd.DataFrame(data)
    return df


# Example usage (assuming your who_data and other parts are set up)
text = who_data["Text"][9]
chunks = util.create_chunks(text)
cc = CausalChain(chunks)
cc.create_effects()

df = create_causes_effects_dataframe(cc.causes, cc.effects)
display(df)


Analyzing causation...
Analyzing causation...


  0%|                                                   | 0/12 [00:00<?, ?it/s]

  0%|                                                   | 0/12 [00:05<?, ?it/s]




AttributeError: 'RepeatedComposite' object has no attribute 'split'