### Producing intensity variations using LLMs

Authors: Raphael Merx (main) and Naomi Baes

Input: a baseline sentence. Output: variations of this sentence in which a target word is more or less intense.

In [1]:
#pip install simplemind python-dotenv

In [2]:
from dotenv import load_dotenv
load_dotenv()

from dataclasses import dataclass
from tqdm import tqdm
import simplemind as sm
from typing import Literal, List
import pandas as pd
import os
import random

random.seed(42)


TARGET_WORD_CHOICES = Literal['abuse', 'anxiety', 'depression', 'mental_health', 'mental_illness', 'trauma']

TARGET_WORD: TARGET_WORD_CHOICES = 'mental_illness'
TARGET_WORD_HUMAN = TARGET_WORD.replace('_', ' ') 

# 1970-1974, ..., 2015-2019
EPOCH_CHOICES = [f"{y}-{y+4}" for y in range(1970, 2020, 5)]
EPOCH = EPOCH_CHOICES[1]

MAX_BASELINES = 10

# can be changed to gemini, see https://pypi.org/project/simplemind/
PROVIDER = "openai"
MODEL = "gpt-4o"

Get neutral baseline sentences from the corpus, for each target, to use as LLM input

- **Script 1 Aim**: This script computes sentence-level sentiment scores using the NRC-VAD lexicon for a corpus of target terms spanning 1970–2019. It dynamically determines neutral sentiment ranges for each 5-year epoch by expanding outward from the median sentiment score within the interquartile range (Q1–Q3) until at least 500 sentences are included, capped at 1500. The selected sentences are saved to CSV files for each target and epoch, and a summary file logs the dynamic ranges and counts.

- **Script 2 Aim**: This script processes pre-saved baseline CSV files to calculate sentence counts by year and 5-year epochs for multiple target terms. It generates "year_count_lines.csv" and "epoch_count_lines.csv" summarizing these counts and creates an epoch-based bar plot visualizing sentence distributions across the specified epochs for each target.
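
The range selection described under Script 1 might look roughly like the sketch below. It is not the code in `step0_get_neutral_baselines_intensity.py`: the function name `select_neutral_band`, the `step` size, and the choice to apply the cap by keeping the sentences closest to the median are all assumptions made for illustration.

```python
import statistics


def select_neutral_band(scores, min_count=500, max_count=1500, step=0.01):
    """Pick sentences whose score lies in a band around the median.

    The band starts at the median and widens by `step` on each side,
    never leaving the interquartile range (Q1 to Q3), until it holds
    at least `min_count` scores. At most `max_count` indices are returned.
    """
    # quantiles(n=4) returns the three cut points Q1, median, Q3
    q1, median, q3 = statistics.quantiles(scores, n=4)
    max_half_width = max(median - q1, q3 - median)

    half_width = 0.0
    while True:
        low = max(q1, median - half_width)
        high = min(q3, median + half_width)
        selected = [i for i, s in enumerate(scores) if low <= s <= high]
        # stop once enough sentences are covered or the band fills the IQR
        if len(selected) >= min_count or half_width >= max_half_width:
            break
        half_width += step

    # Assumption: the cap keeps the sentences closest to the median
    selected.sort(key=lambda i: abs(scores[i] - median))
    return low, high, selected[:max_count]


# Illustrative data only: 10,000 evenly spaced scores between 0 and 1
demo_scores = [i / 9999 for i in range(10000)]
low, high, idx = select_neutral_band(demo_scores)
print(f"band=[{low:.3f}, {high:.3f}], selected={len(idx)}")
```

In this sketch the function would be called once per epoch, with `scores` holding the sentence-level sentiment scores for that epoch.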

In [12]:
%run step0_get_neutral_baselines_intensity.py
%run step1_plot_neutral_baselines_intensity.py

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\naomi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Processing file: c:\Users\naomi\OneDrive\COMP80004_PhDResearch\RESEARCH\PROJECTS\3_evaluation+validation - ACL 2025\ICL\0.0_corpus_preprocessing\output\natural_lines_targets\abuse.lines.psych
Processing file: c:\Users\naomi\OneDrive\COMP80004_PhDResearch\RESEARCH\PROJECTS\3_evaluation+validation - ACL 2025\ICL\0.0_corpus_preprocessing\output\natural_lines_targets\anxiety.lines.psych
Processing file: c:\Users\naomi\OneDrive\COMP80004_PhDResearch\RESEARCH\PROJECTS\3_evaluation+validation - ACL 2025\ICL\0.0_corpus_preprocessing\output\natural_lines_targets\depression.lines.psych
Processing file: c:\Users\naomi\OneDrive\COMP80004_PhDResearch\RESEARCH\PROJECTS\3_evaluation+validation - ACL 2025\ICL\0.0_corpus_preprocessing\output\natural_lines_targets\mental_health.lines.psych
Processing file: c:\Users\naomi\OneDrive\COMP80004_PhDResearch\RESEARCH\PROJECTS\3_evaluation+validation - ACL 2025\ICL\0.0_corpus_preprocessing\output\natural_lines_targets\mental_illness.lines.psych
Processing file:

Set up examples to inject into the prompt

- This code sets up examples of baseline sentences and their intensity-modified variations (more and less intense) to give the LLM context and guidance.
- These examples are formatted into a structured prompt to help the model understand how to generate intensity-modified variations for new sentences.

In [7]:
@dataclass
class Example:
    baseline: str
    more_intense: str
    less_intense: str

    def format_for_prompt(self):
        return f"""<baseline>
{self.baseline}
</baseline>
<increased {TARGET_WORD_HUMAN} intensity>
{self.more_intense}
</increased {TARGET_WORD_HUMAN} intensity>
<decreased {TARGET_WORD_HUMAN} intensity>
{self.less_intense}
</decreased {TARGET_WORD_HUMAN} intensity>
"""

    @staticmethod
    def read_example_data(word: TARGET_WORD_CHOICES) -> pd.DataFrame:
        filepath = os.path.join('input', 'intensity_example_sentences.xlsx')
        # read to a dataframe
        df = pd.read_excel(filepath)
        # filter rows where `target` column is equal to the `word` argument
        df = df[df['target'] == word]
        print(f"Loaded {len(df)} examples for the term '{word}'")
        return df

example_data = Example.read_example_data(TARGET_WORD)

EXAMPLES = [
    Example(
        baseline=row['baseline'],
        more_intense=row['high_intensity'],
        less_intense=row['low_intensity']
    )
    for _, row in example_data.iterrows()
]

PROMPT_INTRO = f"""In psychology research, "Intensity" is defined as “the degree to which a word has emotionally charged 
(i.e., strong, potent, high-arousal) connotations.” Here we study the intensity of the term "{TARGET_WORD_HUMAN}".

You will be given a sentence with the word "{TARGET_WORD_HUMAN}" in it. You will then be asked to write two new sentences: 
One where the word "{TARGET_WORD_HUMAN}" is more intense, and one where it is less intense.

Important Guidelines:
- Retain the original structure and context of the baseline sentence as much as possible.
- Make only small, targeted adjustments to alter the intensity of the term "{TARGET_WORD_HUMAN}". 
- Do not add or remove key concepts or introduce entirely new elements not implied in the baseline.
- Do not replace the target term with another word.
"""



NameError: name 'dataclass' is not defined

In [20]:
# This cell builds a structured prompt from the predefined examples and a new
# target sentence, queries an LLM for more and less intense variations of that
# sentence, and extracts the results for further analysis.

@dataclass
class SentenceToModify:
    text: str
    increased_variation: str = None
    decreased_variation: str = None

    def get_prompt(self):
        prompt = PROMPT_INTRO
        for example in EXAMPLES:
            prompt += example.format_for_prompt()
            prompt += "\n\n"
        
        prompt += f"""<baseline>
{self.text}
</baseline>
"""
        return prompt
    
    def parse_response(self, response: str):
        # get the sentences inside the <increased ... intensity> and <decreased ... intensity> tags
        self.increased_variation = response.split(f"<increased {TARGET_WORD_HUMAN} intensity>")[1].split(f"</increased {TARGET_WORD_HUMAN} intensity>")[0].strip()
        self.decreased_variation = response.split(f"<decreased {TARGET_WORD_HUMAN} intensity>")[1].split(f"</decreased {TARGET_WORD_HUMAN} intensity>")[0].strip()
        return self.increased_variation, self.decreased_variation

    def get_variations(self) -> tuple[str, str]:
        """ Returns a tuple of two strings: (more intense variation, less intense variation) of the baseline """
        assert TARGET_WORD_HUMAN in self.text, f"word {TARGET_WORD_HUMAN} not found in {self.text}"
        prompt = self.get_prompt()
        res = sm.generate_text(prompt=prompt, llm_provider=PROVIDER, llm_model=MODEL)
        return self.parse_response(res)

    @staticmethod
    def load_baselines(word: TARGET_WORD_CHOICES) -> List[str]:
        # find the baselines.csv file in the `input` folder
        filepath = os.path.join('input', 'baselines', f'{word}_neutral_baselines_intensity.csv')
        df = pd.read_csv(filepath)
        # return the `sentence` column as a list
        print(f"Loaded a total of {len(df)} baseline sentences, sampling {MAX_BASELINES}")
        baselines = df['sentence'].tolist()
        baselines = random.sample(baselines, MAX_BASELINES) if len(baselines) > MAX_BASELINES else baselines
        # use the `word` argument (not the global) to turn e.g. mental_illness into "mental illness"
        baselines = [s.replace(word, word.replace('_', ' ')) for s in baselines]
        return baselines
    
    @staticmethod
    def save_sentences(sentences: List['SentenceToModify']):
        output_file = os.path.join('output', f'{TARGET_WORD}_synthetic_sentences.csv')
        df = pd.DataFrame([{'baseline': s.text, 'high_intensity': s.increased_variation, 'low_intensity': s.decreased_variation} for s in sentences])
        df.to_csv(output_file, index=False)
        print(f"Saved {len(sentences)} sentences to {output_file}")

baselines = SentenceToModify.load_baselines(TARGET_WORD)
sentence = SentenceToModify(text=baselines[0])
print(sentence.get_prompt())

Loaded a total of 680 baseline sentences, sampling 10
In psychology lexicographic research, we define "intensity" as the "extent to which a word refers to more emotionally or referentially intense phenomena". Here we study the intensity of the word "mental illness"
You will be given a sentence with the word "mental illness" in it. You will then be asked to write two new sentences: one where the word "mental illness" is more intense, and one where it is less intense.

<baseline>
The study examined the prevalence of mental illness in urban populations.
</baseline>
<increased mental illness intensity>
The study highlighted the staggering prevalence of severe mental illness in urban populations.
</increased mental illness intensity>
<decreased mental illness intensity>
The study explored the relatively low prevalence of mild mental illness in urban populations.
</decreased mental illness intensity>


<baseline>
The research analyzed the treatment of mental illness in clinical settings.
</b

In [18]:
sentences = []

for baseline in tqdm(baselines):
    sentence = SentenceToModify(text=baseline)
    sentence.get_variations()  # stores both variations on the sentence object
    sentences.append(sentence)

SentenceToModify.save_sentences(sentences)


100%|██████████| 10/10 [00:24<00:00,  2.45s/it]

Saved 10 sentences to output/mental_illness_synthetic_sentences.csv
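
`parse_response` indexes into the result of `str.split`, so a reply that omits an opening tag raises `IndexError` and stops the loop above. A more defensive parse is sketched here; the helper names `extract_tagged` and `parse_variations` are hypothetical and not part of the notebook. It returns `None` for a missing tag pair instead of raising.

```python
import re
from typing import Optional, Tuple


def extract_tagged(response: str, tag: str) -> Optional[str]:
    """Return the stripped text between <tag> and </tag>, or None if the pair is missing."""
    escaped = re.escape(tag)
    # DOTALL lets the captured text span several lines
    match = re.search(rf"<{escaped}>(.*?)</{escaped}>", response, re.DOTALL)
    return match.group(1).strip() if match else None


def parse_variations(response: str, target: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (increased, decreased) variations; either may be None."""
    increased = extract_tagged(response, f"increased {target} intensity")
    decreased = extract_tagged(response, f"decreased {target} intensity")
    return increased, decreased


# Illustrative reply in the tag format used by the prompt
reply = (
    "<increased mental illness intensity>\nSentence A.\n</increased mental illness intensity>\n"
    "<decreased mental illness intensity>\nSentence B.\n</decreased mental illness intensity>"
)
print(parse_variations(reply, "mental illness"))
```

With a parser like this, the generation loop could skip or retry baselines for which either value is `None` rather than losing the whole batch.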



