### Producing Intensity variations using LLMs

Authors: Raphael Merx (main) and Naomi Baes
Input: a baseline sentence; Output: variations of this sentence where a target word is more or less intense

In [1]:
#pip install simplemind python-dotenv pandas tqdm openpyxl

In [2]:
from dotenv import load_dotenv
load_dotenv()

from dataclasses import dataclass
from tqdm import tqdm
import simplemind as sm
from typing import Literal, List, get_args
import pandas as pd
import os
import random

random.seed(42)

# Define TARGET_WORD_CHOICES
TARGET_WORD_CHOICES = Literal['abuse', 'anxiety', 'depression', 'mental_health', 'mental_illness', 'trauma']
TARGET_WORD: TARGET_WORD_CHOICES = 'abuse'

# Generate human-readable versions for all target words
TARGET_WORD_HUMAN_CHOICES = {word: word.replace('_', ' ') for word in get_args(TARGET_WORD_CHOICES)}
TARGET_WORD_HUMAN = TARGET_WORD_HUMAN_CHOICES[TARGET_WORD]

# Validate that both 'mental_health' and 'mental_illness' are handled correctly
assert TARGET_WORD_HUMAN_CHOICES['mental_health'] == 'mental health', "Error: 'mental_health' not handled correctly"
assert TARGET_WORD_HUMAN_CHOICES['mental_illness'] == 'mental illness', "Error: 'mental_illness' not handled correctly"

# 1970-1974, ..., 2015-2019
EPOCH_CHOICES = [f"{y}-{y+4}" for y in range(1970, 2020, 5)]
EPOCH = EPOCH_CHOICES[0]

MAX_BASELINES = 10

# can be changed to gemini, see https://pypi.org/project/simplemind/
PROVIDER = "openai"
MODEL = "gpt-4o"

Get neutral baseline sentences from corpus for LLM input for each target

- **Script 1 Aim**: This script computes sentence-level sentiment scores using the NRC-VAD lexicon for a corpus of target terms spanning 1970–2019. It dynamically determines neutral sentiment ranges for each 5-year epoch by expanding outward from the median sentiment score within the interquartile range (Q1–Q3) until at least 500 sentences are included, capped at 1500. The selected sentences are saved to CSV files for each target and epoch, and a summary file logs the dynamic ranges and counts.

- **Script 2 Aim**: This script processes pre-saved baseline CSV files to calculate sentence counts by year and 5-year epochs for multiple target terms. It generates "year_count_lines.csv" and "epoch_count_lines.csv" summarizing these counts and creates an epoch-based bar plot visualizing sentence distributions across the specified epochs for each target.
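
The range expansion described under Script 1 could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in `step0_get_neutral_baselines_intensity.py`: the function name `select_neutral`, the `(sentence, score)` input format, the fixed `step` size, and the rule of keeping the sentences closest to the median when the cap is exceeded are all hypothetical. It uses only the standard library to stay self-contained.

```python
from statistics import quantiles


def select_neutral(scored, min_n=500, max_n=1500, step=0.01):
    """Pick near-median sentences from a list of (sentence, score) pairs.

    Hypothetical helper: widens a window around the median score, never
    beyond Q1-Q3, until it holds at least `min_n` sentences, then caps
    the selection at `max_n`.
    """
    scores = [score for _, score in scored]
    q1, median, q3 = quantiles(scores, n=4, method="inclusive")
    max_half_width = max(median - q1, q3 - median)

    half_width = step  # assumed expansion step, not taken from the script
    while True:
        lo = max(q1, median - half_width)
        hi = min(q3, median + half_width)
        selected = [(text, s) for text, s in scored if lo <= s <= hi]
        # Stop once enough sentences are found or the whole IQR is covered
        if len(selected) >= min_n or half_width >= max_half_width:
            break
        half_width += step

    if len(selected) > max_n:
        # Assumed tie-breaking rule: keep the sentences closest to the median
        selected = sorted(selected, key=lambda p: abs(p[1] - median))[:max_n]
    return selected, (lo, hi)
```

Returning the final `(lo, hi)` bounds alongside the selection mirrors the summary file mentioned above, which logs the dynamic range and count per target and epoch.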

In [None]:
%run step0_get_neutral_baselines_intensity.py
%run step1_plot_neutral_baselines_intensity.py

Setup examples to inject in the prompt

- This code sets up examples of baseline sentences and their intensity-modified variations (more and less intense) to provide context and guidance for the LLM.
- These examples are formatted into a structured prompt to help the model understand how to generate intensity-modified variations for new sentences.

In [3]:
@dataclass
class Example:
    baseline: str
    more_intense: str
    less_intense: str

    def format_for_prompt(self):
        return f"""<baseline>
{self.baseline}
</baseline>
<increased {TARGET_WORD} intensity>
{self.more_intense}
</increased {TARGET_WORD} intensity>
<decreased {TARGET_WORD} intensity>
{self.less_intense}
</decreased {TARGET_WORD} intensity>
"""

    @staticmethod
    def read_example_data(word: TARGET_WORD_CHOICES) -> pd.DataFrame:
        filepath = os.path.join('input', 'intensity_example_sentences.xlsx')
        # read to a dataframe
        df = pd.read_excel(filepath)
        # filter rows where `target` column is equal to the `word` argument
        df = df[df['target'] == word]
        print(f"Loaded {len(df)} examples for the term '{word}'")
        return df

def get_examples() -> List[Example]:
    example_data = Example.read_example_data(TARGET_WORD)

    return [
        Example(
            baseline=row['baseline'],
            more_intense=row['high_intensity'],
            less_intense=row['low_intensity']
        )
        for _, row in example_data.iterrows()
    ]

PROMPT_INTRO = """In psychology research, Intensity is defined as “the degree to which a word has emotionally charged (i.e., strong, potent, high-arousal) connotations.” This task focuses on the intensity of the term **<<{target_word}>>**. 

### **Task**  
You will be given a sentence containing the term **<<{target_word}>>**. Your goal is to write two new sentences:
1. One where **<<{target_word}>>** is **less intense**.  
2. One where **<<{target_word}>>** is **more intense**.  

### **Rules**  
1. The term **<<{target_word}>>** must remain **exactly as it appears** in the original sentence:
   - Do **not** replace, rephrase, omit, or modify it in any way.
   - Synonyms, variations, or altered spellings are not allowed.  

2. **Meaning and Structure**:  
   - Stay true to the original context and subject matter.  
   - Maintain the sentence’s structure and ensure grammatical accuracy.  

### **Important**  
- Any response omitting, replacing, or altering **<<{target_word}>>** will be rejected.  
- Ensure the output is:  
   - **Grammatically correct**  
   - **Sensitive and serious** in tone  
   - **Free from exaggeration or sensationalism**  

Follow these guidelines strictly to produce valid responses.  
"""


In [5]:
@dataclass
class SentenceToModify:
    text: str
    increased_variation: str = None
    decreased_variation: str = None

    def get_prompt(self):
        prompt = PROMPT_INTRO.format(target_word=TARGET_WORD) + "\n\n"
        for example in get_examples():
            prompt += example.format_for_prompt()
            prompt += "\n\n"
        
        prompt += f"""<baseline>
{self.text}
</baseline>
"""
        return prompt
    
    def parse_response(self, response: str):
        # get the sentences inside <increased {TARGET_WORD} intensity> and <decreased {TARGET_WORD} intensity>
        self.increased_variation = response.split(f"<increased {TARGET_WORD} intensity>")[1].split(f"</increased {TARGET_WORD} intensity>")[0].strip()
        self.decreased_variation = response.split(f"<decreased {TARGET_WORD} intensity>")[1].split(f"</decreased {TARGET_WORD} intensity>")[0].strip()
        return self.increased_variation, self.decreased_variation

    def get_variations(self) -> tuple[str, str]:
        """ Returns a (more intense, less intense) pair of variations of the sentence for the TARGET_WORD """
        assert TARGET_WORD in self.text, f"word {TARGET_WORD} not found in {self.text}"
        prompt = self.get_prompt()
        res = sm.generate_text(prompt=prompt, llm_provider=PROVIDER, llm_model=MODEL)
        return self.parse_response(res)

    @staticmethod
    def load_baselines(word: TARGET_WORD_CHOICES, epoch: str) -> List[str]:
        # find the baselines.csv file in the `input` folder
        filepath = os.path.join('input', 'baselines', f'{word}_{epoch}.baseline_1500_sentences.csv')
        df = pd.read_csv(filepath)
        # return the `sentence` column as a list
        print(f"Word {word}, epoch {epoch}: ", end="")
        baselines = df['sentence'].tolist()
        # report the number of sentences actually kept, which may be below MAX_BASELINES
        n_sampled = min(len(baselines), MAX_BASELINES) if MAX_BASELINES else len(baselines)
        print(f"Loaded {len(baselines)} baseline sentences, sampling {n_sampled}")
        if MAX_BASELINES and len(baselines) > MAX_BASELINES:
            baselines = random.sample(baselines, MAX_BASELINES)
        return baselines
    
    @staticmethod
    def save_sentences(sentences: List['SentenceToModify'], word: TARGET_WORD_CHOICES, epoch: str):
        # Adjust the directory here to include 'test'
        output_dir = 'output/test'
        output_file = os.path.join(output_dir, f'{word}_{epoch}.synthetic_sentences.csv')
        
        # Ensure the directory exists before trying to write to it
        os.makedirs(output_dir, exist_ok=True)
        
        # Creating DataFrame and saving to CSV
        df = pd.DataFrame([{'baseline': s.text, 'high_intensity': s.increased_variation, 'low_intensity': s.decreased_variation} for s in sentences])
        df.to_csv(output_file, index=False)
        print(f"Saved {len(sentences)} sentences to {output_file}")

baselines = SentenceToModify.load_baselines(TARGET_WORD, EPOCH)
sentence = SentenceToModify(text=baselines[0])
print(sentence.get_prompt())

Word abuse, epoch 1970-1974: Loaded 13 baseline sentences, sampling 10
Loaded 5 examples for the term 'abuse'
In psychology research, Intensity is defined as “the degree to which a word has emotionally charged (i.e., strong, potent, high-arousal) connotations.” This task focuses on the intensity of the term **<<abuse>>**. 

### **Task**  
You will be given a sentence containing the term **<<abuse>>**. Your goal is to write two new sentences:
1. One where **<<abuse>>** is **less intense**.  
2. One where **<<abuse>>** is **more intense**.  

### **Rules**  
1. The term **<<abuse>>** must remain **exactly as it appears** in the original sentence:
   - Do **not** replace, rephrase, omit, or modify it in any way.
   - Synonyms, variations, or altered spellings are not allowed.  

2. **Meaning and Structure**:  
   - Stay true to the original context and subject matter.  
   - Maintain the sentence’s structure and ensure grammatical accuracy.  

### **Important**  
- Any response omitting, 

In [6]:
# Constants for processing
MAX_BASELINES = 10
OUTPUT_DIR = "output/test"  # Directory where processed files are saved

# Ensure the output directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)

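# Loop through each target word and epoch.
# The loop rebinds the module-level TARGET_WORD and EPOCH, which get_prompt(),
# parse_response() and Example.format_for_prompt() read as globals.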
for TARGET_WORD in get_args(TARGET_WORD_CHOICES):
    for EPOCH in EPOCH_CHOICES:
        # Construct the file path to check if it already exists
        output_file = os.path.join(OUTPUT_DIR, f"{TARGET_WORD}_{EPOCH}.synthetic_sentences.csv")
        
        # Print the file path to debug
        print(f"Checking if {output_file} exists...")
        
        # Skip if the file already exists
        if os.path.exists(output_file):
            print(f"Skipping {TARGET_WORD}, {EPOCH}: File already processed.")
            continue
        
        # Load baselines if the file does not exist
        baselines = SentenceToModify.load_baselines(TARGET_WORD, EPOCH)
        sentences = []
        
        # Process each baseline sentence
        for baseline in tqdm(baselines):
            sentence = SentenceToModify(text=baseline)
            try:
                # get_variations() stores both variations on `sentence`
                sentence.get_variations()
                sentences.append(sentence)
            except Exception as e:
                print(f"Error processing sentence: {baseline}. Error: {e}")
        
        # Save the processed sentences
        if sentences:  # Only save if there are completed sentences
            SentenceToModify.save_sentences(sentences, word=TARGET_WORD, epoch=EPOCH)
            print(f"Processed and saved: {output_file}")
        else:
            print(f"No valid sentences processed for {TARGET_WORD}, {EPOCH}")

Checking if output/test/abuse_1970-1974.synthetic_sentences.csv exists...
Skipping abuse, 1970-1974: File already processed.
Checking if output/test/abuse_1975-1979.synthetic_sentences.csv exists...
Skipping abuse, 1975-1979: File already processed.
Checking if output/test/abuse_1980-1984.synthetic_sentences.csv exists...
Skipping abuse, 1980-1984: File already processed.
Checking if output/test/abuse_1985-1989.synthetic_sentences.csv exists...
Skipping abuse, 1985-1989: File already processed.
Checking if output/test/abuse_1990-1994.synthetic_sentences.csv exists...
Skipping abuse, 1990-1994: File already processed.
Checking if output/test/abuse_1995-1999.synthetic_sentences.csv exists...
Skipping abuse, 1995-1999: File already processed.
Checking if output/test/abuse_2000-2004.synthetic_sentences.csv exists...
Skipping abuse, 2000-2004: File already processed.
Checking if output/test/abuse_2005-2009.synthetic_sentences.csv exists...
Skipping abuse, 2005-2009: File already processed.
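
The prompt states that any response omitting or altering the target term will be rejected, but `parse_response` only splits on the tags and does not check this. The sketch below shows one way such a check might be added. It is an assumption-laden illustration rather than part of the notebook: `extract_variation` is a hypothetical helper, and treating a missing tag or a missing target term as `None` is one possible rejection policy.

```python
import re
from typing import Optional


def extract_variation(response: str, target_word: str, direction: str) -> Optional[str]:
    """Return the text inside <{direction} {target_word} intensity> tags.

    Hypothetical helper: `direction` is "increased" or "decreased".
    Returns None when the tags are missing or when the extracted sentence
    no longer contains the target word verbatim.
    """
    tag = re.escape(f"{direction} {target_word} intensity")
    match = re.search(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
    if match is None:
        return None  # the model did not follow the tag format
    text = match.group(1).strip()
    # Enforce the prompt rule: the target term must appear unchanged
    return text if target_word in text else None
```

A caller could skip any baseline for which either direction returns `None`, so that malformed responses are dropped before `save_sentences` writes the CSV.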
