### Producing Intensity variations using LLMs

Author: Raphael Merx

Input: A baseline sentence
Output: variations of this sentence where a target word is more or less intense

In [39]:
#pip install simplemind python-dotenv

In [40]:
from dotenv import load_dotenv
load_dotenv()

from dataclasses import dataclass
from tqdm import tqdm
import simplemind as sm
from typing import Literal, List, get_args
import pandas as pd
import os
import re  # used by SentenceToModify.parse_response
import random

random.seed(42)

# Define TARGET_WORD_CHOICES
TARGET_WORD_CHOICES = Literal['abuse', 'anxiety', 'depression', 'mental_health', 'mental_illness', 'trauma']
TARGET_WORD: TARGET_WORD_CHOICES = 'abuse'

# Generate human-readable versions for all target words
TARGET_WORD_HUMAN_CHOICES = {word: word.replace('_', ' ') for word in get_args(TARGET_WORD_CHOICES)}
TARGET_WORD_HUMAN = TARGET_WORD_HUMAN_CHOICES[TARGET_WORD]

# Validate that both 'mental_health' and 'mental_illness' are handled correctly
assert TARGET_WORD_HUMAN_CHOICES['mental_health'] == 'mental health', "Error: 'mental_health' not handled correctly"
assert TARGET_WORD_HUMAN_CHOICES['mental_illness'] == 'mental illness', "Error: 'mental_illness' not handled correctly"

# 1970-1974, ..., 2015-2019
EPOCH_CHOICES = [f"{y}-{y+4}" for y in range(1970, 2020, 5)]
EPOCH = EPOCH_CHOICES[0]

MAX_BASELINES = 1500

# can be changed to gemini, see https://pypi.org/project/simplemind/
PROVIDER = "openai"
MODEL = "gpt-4o"

Get neutral baseline sentences from corpus for LLM input for each target

- **Script 1 Aim**: This script computes sentence-level sentiment scores using the NRC-VAD lexicon for a corpus of target terms spanning 1970–2019. It dynamically determines neutral sentiment ranges for each 5-year epoch by expanding outward from the median sentiment score within the interquartile range (Q1–Q3) until at least 500 sentences are included, capped at 1500. The selected sentences are saved to CSV files for each target and epoch, and a summary file logs the dynamic ranges and counts.

- **Script 2 Aim**: This script processes pre-saved baseline CSV files to calculate sentence counts by year and 5-year epochs for multiple target terms. It generates "year_count_lines.csv" and "epoch_count_lines.csv" summarizing these counts and creates an epoch-based bar plot visualizing sentence distributions across the specified epochs for each target.
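
The neutral-range selection described for Script 1 can be sketched roughly as follows. This is an illustrative reading of the description, not the contents of `step0_get_neutral_baselines_intensity.py`: the function name `select_neutral`, the fixed `step` size, and the tie-breaking by distance from the median are all assumptions.

```python
import statistics

def select_neutral(scores, min_n=500, max_n=1500, step=0.01):
    """Widen a band around the median (never beyond Q1..Q3) until it holds
    at least `min_n` scores, then keep at most `max_n` of them.

    Returns (low, high, indices) where indices point into `scores`.
    """
    q1, median, q3 = statistics.quantiles(scores, n=4)
    low = high = median

    def in_band():
        return [i for i, s in enumerate(scores) if low <= s <= high]

    chosen = in_band()
    # Expand symmetrically, clamped to the interquartile range.
    while len(chosen) < min_n and (low > q1 or high < q3):
        low = max(q1, low - step)
        high = min(q3, high + step)
        chosen = in_band()

    # Cap the selection, preferring the scores closest to the median.
    chosen.sort(key=lambda i: abs(scores[i] - median))
    return low, high, chosen[:max_n]

# Toy scores, evenly spaced between 0 and 1 (hypothetical data).
toy_scores = [i / 100 for i in range(101)]
low, high, idx = select_neutral(toy_scores, min_n=11, max_n=20)
print(f"range: {low:.2f} to {high:.2f}, selected {len(idx)} sentences")
```

If the interquartile range holds fewer than `min_n` sentences, the sketch simply returns everything inside Q1..Q3, which would explain epochs with very few baselines.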

In [4]:
#%run step0_get_neutral_baselines_intensity.py
%run step1_plot_neutral_baselines_intensity.py

Year count summary saved to c:\Users\naomi\OneDrive\COMP80004_PhDResearch\RESEARCH\PROJECTS\3_evaluation+validation - ACL 2025\ICL\3_intensity\synthetic\input\baselines\output\year_count_lines.csv.
Epoch count summary saved to c:\Users\naomi\OneDrive\COMP80004_PhDResearch\RESEARCH\PROJECTS\3_evaluation+validation - ACL 2025\ICL\3_intensity\synthetic\input\baselines\output\epoch_count_lines.csv.



Epoch plot saved to ../../figures\plot_appendixB_intensity.png.


Setup examples to inject in the prompt

- This code sets up examples of baseline sentences and their intensity-modified variations (more and less intense) to provide context and guidance for the LLM.
- These examples are formatted into a structured prompt to help the model understand how to generate intensity-modified variations for new sentences.

In [42]:
@dataclass
class Example:
    baseline: str
    more_intense: str
    less_intense: str

    def format_for_prompt(self):
        return f"""<baseline>
{self.baseline}
</baseline>
<increased {TARGET_WORD} intensity>
{self.more_intense}
</increased {TARGET_WORD} intensity>
<decreased {TARGET_WORD} intensity>
{self.less_intense}
</decreased {TARGET_WORD} intensity>
"""

    @staticmethod
    def read_example_data(word: TARGET_WORD_CHOICES) -> pd.DataFrame:
        filepath = os.path.join('input', 'intensity_example_sentences.xlsx')
        # read to a dataframe
        df = pd.read_excel(filepath)
        # filter rows where `target` column is equal to the `word` argument
        df = df[df['target'] == word]
        return df

def get_examples() -> List[Example]:
    example_data = Example.read_example_data(TARGET_WORD)

    return [
        Example(
            baseline=row['baseline'],
            more_intense=row['high_intensity'],
            less_intense=row['low_intensity']
        )
        for _, row in example_data.iterrows()
    ]

PROMPT_INTRO = """In psychology research, Intensity is defined as “the degree to which a word has emotionally charged (i.e., strong, potent, high-arousal) connotations.” This task focuses on the intensity of the term **<<{target_word}>>**. 

### **Task**  
You will be given a sentence containing the term **<<{target_word}>>**. Your goal is to write two new sentences:
1. One where **<<{target_word}>>** is **less intense** (enclose this sentence between `<decreased {target_word} intensity>` and `</decreased {target_word} intensity>` tags).
2. One where **<<{target_word}>>** is **more intense** (enclose this sentence between `<increased {target_word} intensity>` and `</increased {target_word} intensity>` tags).

### **Rules**  
1. The term **<<{target_word}>>** must remain **exactly as it appears** in the original sentence:
   - Do **not** replace, rephrase, omit, or modify it in any way.
   - Synonyms, variations, or altered spellings are not allowed.  

2. **Meaning and Structure**:  
   - Stay true to the original context and subject matter.  
   - Maintain the sentence’s structure and ensure grammatical accuracy.  

### **Important**  
- Any response omitting, replacing, or altering **<<{target_word}>>** will be rejected.  
- Ensure the output is:  
   - **Grammatically correct**  
   - **Sensitive and serious** in tone  
   - **Free from exaggeration or sensationalism**  
   - **Strictly following the XML-like tag format for intensity variations**

Follow these guidelines strictly to produce valid responses.  
"""



In [43]:
@dataclass
class SentenceToModify:
    text: str
    increased_variation: str = None
    decreased_variation: str = None

    def get_prompt(self):
        prompt = PROMPT_INTRO.format(target_word=TARGET_WORD) + "\n\n"
        for example in get_examples():
            prompt += example.format_for_prompt()
            prompt += "\n\n"
        
        prompt += f"""<baseline>
{self.text}
</baseline>
"""
        return prompt

    def parse_response(self, response: str):
        pattern = fr"<increased {TARGET_WORD} intensity>(.*?)</increased {TARGET_WORD} intensity>"
        try:
            self.increased_variation = re.search(pattern, response, re.DOTALL).group(1).strip()
        except AttributeError:
            raise ValueError(f"LLM response does not contain the expected increased intensity format: {response}")

        pattern = fr"<decreased {TARGET_WORD} intensity>(.*?)</decreased {TARGET_WORD} intensity>"
        try:
            self.decreased_variation = re.search(pattern, response, re.DOTALL).group(1).strip()
        except AttributeError:
            raise ValueError(f"LLM response does not contain the expected decreased intensity format: {response}")

        return self.increased_variation, self.decreased_variation

    def get_variations(self) -> list[str]:
        """ Returns a list of two strings: one where the TARGET_WORD is more intense, and one where it is less intense """
        assert TARGET_WORD in self.text.lower(), f"word {TARGET_WORD} not found in {self.text}"
        prompt = self.get_prompt()
        res = sm.generate_text(prompt=prompt, llm_provider=PROVIDER, llm_model=MODEL)
        return self.parse_response(res)

    @staticmethod
    def load_baselines(word: TARGET_WORD_CHOICES, epoch: str) -> List[str]:
        # find the baselines.csv file in the `input` folder
        filepath = os.path.join('input', 'baselines', f'{word}_{epoch}.baseline_1500_sentences.csv')
        df = pd.read_csv(filepath)
        # return the `sentence` column as a list
        print(f"Word {word}, epoch {epoch}: ", end="")
        print(f"Loaded {len(df)} baseline sentences, sampling {MAX_BASELINES}")
        baselines = df['sentence'].tolist()
        # MAX_BASELINES is an upper bound: smaller files are used in full
        if MAX_BASELINES and len(baselines) > MAX_BASELINES:
            baselines = random.sample(baselines, MAX_BASELINES)
        return baselines
    
    @staticmethod
    def save_sentences(sentences: List['SentenceToModify'], word: TARGET_WORD_CHOICES, epoch: str):
        # Adjust the directory here to include 'test' when getting pilot/test data
        output_dir = 'output'
        output_file = os.path.join(output_dir, f'{word}_{epoch}.synthetic_sentences.csv')
        
        # Ensure the directory exists before trying to write to it
        os.makedirs(output_dir, exist_ok=True)
        
        # Creating DataFrame and saving to CSV
        df = pd.DataFrame([{'baseline': s.text, 'high_intensity': s.increased_variation, 'low_intensity': s.decreased_variation} for s in sentences])
        df.to_csv(output_file, index=False)
        print(f"Saved {len(sentences)} sentences to {output_file}")

baselines = SentenceToModify.load_baselines(TARGET_WORD, EPOCH)
sentence = SentenceToModify(text=baselines[0])
print(sentence.get_prompt())

Word trauma, epoch 1970-1974: Loaded 12 baseline sentences, sampling 1500
In psychology research, Intensity is defined as “the degree to which a word has emotionally charged (i.e., strong, potent, high-arousal) connotations.” This task focuses on the intensity of the term **<<trauma>>**. 

### **Task**  
You will be given a sentence containing the term **<<trauma>>**. Your goal is to write two new sentences:
1. One where **<<trauma>>** is **less intense** (enclose this sentence between `<decreased trauma intensity>` and `</decreased trauma intensity>` tags).
2. One where **<<trauma>>** is **more intense** (enclose this sentence between `<increased trauma intensity>` and `</increased trauma intensity>` tags).

### **Rules**  
1. The term **<<trauma>>** must remain **exactly as it appears** in the original sentence:
   - Do **not** replace, rephrase, omit, or modify it in any way.
   - Synonyms, variations, or altered spellings are not allowed.  

2. **Meaning and Structure**:  
   - Sta
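
The model is expected to reply with two tagged sentences, which `parse_response` then extracts. A minimal standalone sketch of that extraction step is shown below; the helper name `extract_tagged` is hypothetical, and the reply text is a hand-written mock for illustration, not real model output.

```python
import re

def extract_tagged(response: str, tag: str) -> str:
    """Return the text between <tag> and </tag>, or raise ValueError if absent."""
    # re.escape guards against target words containing regex metacharacters.
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    match = re.search(pattern, response, re.DOTALL)
    if match is None:
        raise ValueError(f"Missing <{tag}> block in response")
    return match.group(1).strip()

# Hand-written mock reply (assumption: the model follows the tag format exactly).
mock_response = """<increased trauma intensity>
The trauma left lasting effects on the whole family.
</increased trauma intensity>
<decreased trauma intensity>
The trauma had some effect on the family.
</decreased trauma intensity>"""

more = extract_tagged(mock_response, "increased trauma intensity")
less = extract_tagged(mock_response, "decreased trauma intensity")
print(more)
print(less)
```

Raising on a missing tag, rather than returning an empty string, lets the calling loop log and skip malformed replies instead of saving incomplete rows.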

In [44]:
import os
import re
from tqdm import tqdm

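# Constants for processing
MAX_BASELINES = 1500
# Directory checked for already-processed files. Note that
# SentenceToModify.save_sentences writes to its own hard-coded 'output' directory,
# so the two must point to the same place for the skip check below to take effect.
OUTPUT_DIR = "output/5-year"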

# Ensure the output directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Loop through each target word and epoch
for TARGET_WORD in get_args(TARGET_WORD_CHOICES):
    for EPOCH in EPOCH_CHOICES:
        # Construct the file path to check if it already exists
        output_file = os.path.join(OUTPUT_DIR, f"{TARGET_WORD}_{EPOCH}.synthetic_sentences.csv")

        # Print the file path to debug
        print(f"Checking if {output_file} exists...")

        # Skip if the file already exists
        if os.path.exists(output_file):
            print(f"Skipping {TARGET_WORD}, {EPOCH}: File already processed.")
            continue

        # Load baselines if the file does not exist
        baselines = SentenceToModify.load_baselines(TARGET_WORD, EPOCH)
        sentences = []

        # Process each baseline sentence
        for baseline in tqdm(baselines, total=len(baselines), unit='it', leave=True):
            sentence = SentenceToModify(text=baseline)
            try:
                more_intense_variation, less_intense_variation = sentence.get_variations()
                sentences.append(sentence)
            except Exception as e:
                print(f"Error processing sentence: {baseline}. Error: {str(e)}")

        # Save the processed sentences
        if sentences:  # Only save if there are completed sentences
            SentenceToModify.save_sentences(sentences, word=TARGET_WORD, epoch=EPOCH)
            print(f"Processed and saved: {output_file}")
        else:
            print(f"No valid sentences processed for {TARGET_WORD}, {EPOCH}")

Checking if output\trauma_1970-1974.synthetic_sentences.csv exists...
Word trauma, epoch 1970-1974: Loaded 12 baseline sentences, sampling 1500


100%|██████████| 12/12 [00:22<00:00,  1.88s/it]


Saved 12 sentences to output\trauma_1970-1974.synthetic_sentences.csv
Processed and saved: output\trauma_1970-1974.synthetic_sentences.csv
Checking if output\trauma_1975-1979.synthetic_sentences.csv exists...
Word trauma, epoch 1975-1979: Loaded 11 baseline sentences, sampling 1500


100%|██████████| 11/11 [00:22<00:00,  2.04s/it]


Saved 11 sentences to output\trauma_1975-1979.synthetic_sentences.csv
Processed and saved: output\trauma_1975-1979.synthetic_sentences.csv
Checking if output\trauma_1980-1984.synthetic_sentences.csv exists...
Word trauma, epoch 1980-1984: Loaded 64 baseline sentences, sampling 1500


100%|██████████| 64/64 [02:07<00:00,  2.00s/it]


Saved 64 sentences to output\trauma_1980-1984.synthetic_sentences.csv
Processed and saved: output\trauma_1980-1984.synthetic_sentences.csv
Checking if output\trauma_1985-1989.synthetic_sentences.csv exists...
Word trauma, epoch 1985-1989: Loaded 118 baseline sentences, sampling 1500


100%|██████████| 118/118 [04:06<00:00,  2.09s/it]


Saved 118 sentences to output\trauma_1985-1989.synthetic_sentences.csv
Processed and saved: output\trauma_1985-1989.synthetic_sentences.csv
Checking if output\trauma_1990-1994.synthetic_sentences.csv exists...
Word trauma, epoch 1990-1994: Loaded 307 baseline sentences, sampling 1500


100%|██████████| 307/307 [10:17<00:00,  2.01s/it]


Saved 307 sentences to output\trauma_1990-1994.synthetic_sentences.csv
Processed and saved: output\trauma_1990-1994.synthetic_sentences.csv
Checking if output\trauma_1995-1999.synthetic_sentences.csv exists...
Word trauma, epoch 1995-1999: Loaded 518 baseline sentences, sampling 1500


100%|██████████| 518/518 [16:49<00:00,  1.95s/it]


Saved 518 sentences to output\trauma_1995-1999.synthetic_sentences.csv
Processed and saved: output\trauma_1995-1999.synthetic_sentences.csv
Checking if output\trauma_2000-2004.synthetic_sentences.csv exists...
Word trauma, epoch 2000-2004: Loaded 526 baseline sentences, sampling 1500


100%|██████████| 526/526 [17:48<00:00,  2.03s/it]


Saved 526 sentences to output\trauma_2000-2004.synthetic_sentences.csv
Processed and saved: output\trauma_2000-2004.synthetic_sentences.csv
Checking if output\trauma_2005-2009.synthetic_sentences.csv exists...
Word trauma, epoch 2005-2009: Loaded 888 baseline sentences, sampling 1500


100%|██████████| 888/888 [30:18<00:00,  2.05s/it] 


Saved 888 sentences to output\trauma_2005-2009.synthetic_sentences.csv
Processed and saved: output\trauma_2005-2009.synthetic_sentences.csv
Checking if output\trauma_2010-2014.synthetic_sentences.csv exists...
Word trauma, epoch 2010-2014: Loaded 735 baseline sentences, sampling 1500


100%|██████████| 735/735 [26:32<00:00,  2.17s/it]


Saved 735 sentences to output\trauma_2010-2014.synthetic_sentences.csv
Processed and saved: output\trauma_2010-2014.synthetic_sentences.csv
Checking if output\trauma_2015-2019.synthetic_sentences.csv exists...
Word trauma, epoch 2015-2019: Loaded 833 baseline sentences, sampling 1500


100%|██████████| 833/833 [27:57<00:00,  2.01s/it]

Saved 833 sentences to output\trauma_2015-2019.synthetic_sentences.csv
Processed and saved: output\trauma_2015-2019.synthetic_sentences.csv





In [50]:
import os
import pandas as pd

# Set the directory containing the synthetic sentence files
input_directory = "output/5-year"
output_directory = "output/validation_issues"

# Ensure the output directory exists
os.makedirs(output_directory, exist_ok=True)

# Function to validate that all rows in each column contain the target term
def validate_target_in_files(directory):
    issues_summary = []  # To store summary of issues
    all_problematic_rows_combined = pd.DataFrame()  # To store all unique problematic rows across files

    # Loop through each file in the directory
    for file_name in os.listdir(directory):
        if file_name.endswith(".csv"):
            # Extract the target term from the filename
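            # (drop the extension first, then the epoch, so multi-word targets such as 'mental_health' stay whole)
            target_term = file_name.split(".")[0].rsplit("_", 1)[0]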
            file_path = os.path.join(directory, file_name)

            try:
                # Read the CSV file
                df = pd.read_csv(file_path)

                # Check if the required columns exist
                if {'baseline', 'high_intensity', 'low_intensity'}.issubset(df.columns):
                    all_problematic_rows = pd.DataFrame()

                    # Use a set to track rows we've already added to avoid duplicates
                    seen_rows = set()

                    for column in ['baseline', 'high_intensity', 'low_intensity']:
                        # Identify rows that do not contain the target term
                        problematic_rows = df[~df[column].str.contains(target_term, case=False, na=False)].copy()

                        if not problematic_rows.empty:
                            # Add the target term, epoch, and row number for identification
                            problematic_rows['target'] = target_term  # Add the target term
                            problematic_rows['epoch'] = file_name.split(".")[0].rsplit("_", 1)[1]  # Extract epoch from filename
                            problematic_rows['row_number'] = problematic_rows.index
                            problematic_rows['problem_column'] = column

                            # Only add rows that have not been added before (check via 'row_number')
                            problematic_rows = problematic_rows[~problematic_rows['row_number'].isin(seen_rows)]
                            seen_rows.update(problematic_rows['row_number'])

                            # Append to the combined DataFrame for this file
                            all_problematic_rows = pd.concat([all_problematic_rows, problematic_rows])

                    # If there are problematic rows, save them to a new file
                    if not all_problematic_rows.empty:
                        # Combine with the master list of all rows across all files
                        all_problematic_rows_combined = pd.concat([all_problematic_rows_combined, all_problematic_rows])

                        issues_summary.append((file_name, len(all_problematic_rows)))

                else:
                    print(f"File {file_name} is missing required columns.")
            except Exception as e:
                print(f"Error processing file {file_name}: {e}")

    # Remove duplicates across the entire combined DataFrame
    if not all_problematic_rows_combined.empty:
        all_problematic_rows_combined = all_problematic_rows_combined.drop_duplicates(
            subset=['baseline', 'high_intensity', 'low_intensity', 'row_number']
        )

        # Save the unique problematic rows to a new file
        output_file = os.path.join(output_directory, f"validation_issues_TEST.csv")
        all_problematic_rows_combined.to_csv(output_file, index=False)

        # Print summary of issues
        print("The following files have issues:")
        for file, count in issues_summary:
            print(f"File: {file}, Number of problematic rows: {count}")
        
        # Display problematic rows in the notebook
        print("\nProblematic rows across all files:")
        display(all_problematic_rows_combined)  # This will display the dataframe in the notebook

    else:
        print("No validation issues found.")

# Run the validation
validate_target_in_files(input_directory)

No validation issues found.
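
The synthetic files are named `{word}_{epoch}.synthetic_sentences.csv`. Because targets such as `mental_health` contain an underscore themselves, splitting on the first underscore would return only `mental`. A small sketch of parsing the name from the right instead (the helper name `parse_synthetic_filename` is hypothetical):

```python
def parse_synthetic_filename(file_name: str) -> tuple[str, str]:
    """Split '{word}_{epoch}.synthetic_sentences.csv' into (word, epoch)."""
    # Everything before the first dot, e.g. 'mental_health_1970-1974'.
    stem = file_name.split(".")[0]
    # The epoch is the part after the LAST underscore of the stem.
    target, epoch = stem.rsplit("_", 1)
    return target, epoch

for name in [
    "trauma_2015-2019.synthetic_sentences.csv",
    "mental_health_1970-1974.synthetic_sentences.csv",
]:
    print(parse_synthetic_filename(name))
```

Note that the extension has to be removed before splitting from the right, since `synthetic_sentences` also contains an underscore.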


Recording OpenAI API credit usage when generating more/less intense variations for each target:
- "abuse" = 621.84 - 601.31 = 20.53 USD
- "anxiety" = 601.31 - 568.99 = 32.32 USD
- "depression" = 568.99 - 534.28 = 34.71 USD
- "mental_health" = 534.28 - 510.57 = 23.71 USD
- "mental_illness" = 510.57 - 501.11 = 9.46 USD
- "trauma" = 501.11 - 487.19 = 13.92 USD
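
Each per-target cost is the difference between two consecutive credit readings, so the figures can be recomputed with a short sketch. `Decimal` is used here to avoid floating-point rounding in the subtraction; the variable names are illustrative.

```python
from decimal import Decimal

# Credit readings (USD) copied from the list above, in the order they were recorded.
readings = ["621.84", "601.31", "568.99", "534.28", "510.57", "501.11", "487.19"]
targets = ["abuse", "anxiety", "depression", "mental_health", "mental_illness", "trauma"]

# Cost of each target = reading before its run minus reading after its run.
costs = {
    target: Decimal(before) - Decimal(after)
    for target, before, after in zip(targets, readings, readings[1:])
}

for target, cost in costs.items():
    print(f"{target}: {cost} USD")

total = sum(costs.values())
print(f"total: {total} USD")
```

The total equals the first reading minus the last one, which is a quick consistency check on the individual differences.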

In [51]:
import pandas as pd
import os

# Define the directory where the CSV files are located
input_directory = 'output/5-year'  # Folder with the 5-year CSV files
output_directory = 'output/all-year'  # Folder to save the merged files

# List of targets (you can add more targets to this list)
targets = ['abuse', 'anxiety', 'depression', 'mental_health', 'mental_illness', 'trauma']

# Create the output directory if it does not exist
os.makedirs(output_directory, exist_ok=True)

# Loop through each target and process the corresponding files
for target in targets:
    # Initialize an empty list to collect DataFrames for the current target
    df_list = []

    # Loop through the files in the input directory
    # Sort the listing so epochs are merged in chronological order on every platform
    for file_name in sorted(os.listdir(input_directory)):
        if file_name.startswith(f"{target}_") and file_name.endswith("synthetic_sentences.csv"):  # Match files with the target
            file_path = os.path.join(input_directory, file_name)
            df = pd.read_csv(file_path)
            df_list.append(df)
            print(f"Found and added file: {file_name}")  # Debugging print

    # Concatenate all DataFrames for the current target into one
    if df_list:
        merged_df = pd.concat(df_list, ignore_index=True)

        # Save the merged DataFrame to the new output folder
        merged_path = os.path.join(output_directory, f'{target}_synthetic_sentences.csv')
        merged_df.to_csv(merged_path, index=False)
        print(f"Files for '{target}' merged successfully into '{output_directory}/{target}_synthetic_sentences.csv'")
    else:
        print(f"Skipping '{target}' as no files were found.")

Found and added file: abuse_1970-1974.synthetic_sentences.csv
Found and added file: abuse_1975-1979.synthetic_sentences.csv
Found and added file: abuse_1980-1984.synthetic_sentences.csv
Found and added file: abuse_1985-1989.synthetic_sentences.csv
Found and added file: abuse_1990-1994.synthetic_sentences.csv
Found and added file: abuse_1995-1999.synthetic_sentences.csv
Found and added file: abuse_2000-2004.synthetic_sentences.csv
Found and added file: abuse_2005-2009.synthetic_sentences.csv
Found and added file: abuse_2010-2014.synthetic_sentences.csv
Found and added file: abuse_2015-2019.synthetic_sentences.csv
Files for 'abuse' merged successfully into 'output/all-year/abuse_synthetic_sentences.csv'
Found and added file: anxiety_1970-1974.synthetic_sentences.csv
Found and added file: anxiety_1975-1979.synthetic_sentences.csv
Found and added file: anxiety_1980-1984.synthetic_sentences.csv
Found and added file: anxiety_1985-1989.synthetic_sentences.csv
Found and added file: anxiety_199

## End of script