### Producing Sentiment variations using LLMs

Author: Raphael Merx
Input: a baseline sentence; Output: variations of this sentence where a target word has more positive or negative sentiment


In [1]:
#pip install simplemind python-dotenv pandas openpyxl

In [None]:
from dotenv import load_dotenv
load_dotenv()

from dataclasses import dataclass
from tqdm import tqdm
import simplemind as sm
from typing import Literal, List
import pandas as pd
import os
import random

random.seed(42)


TARGET_WORD_CHOICES = Literal['abuse', 'anxiety', 'depression', 'mental_health', 'mental_illness', 'trauma']

TARGET_WORD: TARGET_WORD_CHOICES = 'mental_illness'
#TARGET_WORD_HUMAN = TARGET_WORD.replace('_', ' ') #Note:This is not needed unless there is a good reason as it makes it easier for processing down the line. 

MAX_BASELINES = 10

# can be changed to gemini, see https://pypi.org/project/simplemind/
PROVIDER = "openai"
MODEL = "gpt-4o"

Get neutral baseline sentences from corpus for LLM input for each target

- **Script 1 Aim**: This script computes sentence-level sentiment scores using the NRC-VAD lexicon for a corpus of target terms spanning 1970–2019. It dynamically determines neutral sentiment ranges for each 5-year epoch by expanding outward from the median sentiment score within the interquartile range (Q1–Q3) until at least 500 sentences are included, capped at 1500. The selected sentences are saved to CSV files for each target and epoch, and a summary file logs the dynamic ranges and counts.

- **Script 2 Aim**: This script processes pre-saved baseline CSV files to calculate sentence counts by year and 5-year epochs for multiple target terms. It generates "year_count_lines.csv" and "epoch_count_lines.csv" summarizing these counts and creates an epoch-based bar plot visualizing sentence distributions across the specified epochs for each target.

In [None]:
#%run step0_get_neutral_baselines_sentiment.py
#%run step1_plot_neutral_baselines_sentiment.py

Processing file: c:\Users\naomi\OneDrive\COMP80004_PhDResearch\RESEARCH\PROJECTS\3_evaluation+validation - ACL 2025\ICL\1_sentiment\synthetic\input\baselines\abuse_1970-1974.baseline_1500_sentences.csv
Processing file: c:\Users\naomi\OneDrive\COMP80004_PhDResearch\RESEARCH\PROJECTS\3_evaluation+validation - ACL 2025\ICL\1_sentiment\synthetic\input\baselines\abuse_1975-1979.baseline_1500_sentences.csv
Processing file: c:\Users\naomi\OneDrive\COMP80004_PhDResearch\RESEARCH\PROJECTS\3_evaluation+validation - ACL 2025\ICL\1_sentiment\synthetic\input\baselines\abuse_1980-1984.baseline_1500_sentences.csv
Processing file: c:\Users\naomi\OneDrive\COMP80004_PhDResearch\RESEARCH\PROJECTS\3_evaluation+validation - ACL 2025\ICL\1_sentiment\synthetic\input\baselines\abuse_1985-1989.baseline_1500_sentences.csv
Processing file: c:\Users\naomi\OneDrive\COMP80004_PhDResearch\RESEARCH\PROJECTS\3_evaluation+validation - ACL 2025\ICL\1_sentiment\synthetic\input\baselines\abuse_1990-1994.baseline_1500_sent

Setup examples to inject in the prompt

In [None]:
@dataclass
class Example:
    baseline: str
    positive_sentiment: str
    negative_sentiment: str

    def format_for_prompt(self):
        return f"""<baseline>
{self.baseline}
</baseline>
<positive {TARGET_WORD_HUMAN}>
{self.positive_sentiment}
</positive {TARGET_WORD_HUMAN}>
<negative {TARGET_WORD_HUMAN}>
{self.negative_sentiment}
</negative {TARGET_WORD_HUMAN}>
"""


    @staticmethod
    def read_example_data(word: TARGET_WORD_CHOICES) -> pd.DataFrame:
        # find the sentiment_example_sentences.xlsx file in the `input` folder
        filepath = os.path.join('input', f'sentiment_example_sentences.xlsx')
        # read to a dataframe
        df = pd.read_excel(filepath)
        # filter rows where `target` column is equal to the `word` argument
        df = df[df['target'] == word]
        print(f"Loaded {len(df)} examples for the term '{word}'")
        return df

example_data = Example.read_example_data(TARGET_WORD)

EXAMPLES = [
    Example(
        baseline=row['baseline'],
        positive_sentiment=row['positive_sentiment'],
        negative_sentiment=row['negative_sentiment']
    )
    for _, row in example_data.iterrows()
]

PROMPT_INTRO = f"""
In psychology research, 'Sentiment' is defined as “a term’s acquisition of a more positive or negative connotation.” 
Here we study the sentiment of the term '{TARGET_WORD_HUMAN}'.

You will be given a sentence with the term '{TARGET_WORD_HUMAN}' in it. Your task is to write two new sentences:
- One where the term '{TARGET_WORD_HUMAN}' carries a more positive connotation.
- One where the term '{TARGET_WORD_HUMAN}' carries a more negative connotation.

Important Guidelines:
- Retain the original structure and context of the baseline sentence as much as possible.
- Make only small, targeted adjustments to alter the sentiment of the term '{TARGET_WORD_HUMAN}'.
- Do not add or remove key concepts or introduce entirely new elements not implied in the baseline.
- Do not replace the target term with another word.
"""



### NOTE: old prompt
PROMPT_INTRO = f"""In psychology lexicographic research, we define "sentiment" as the "a term’s acquisition of a more positive or negative connotation". Here we study the sentiment of the term "{TARGET_WORD_HUMAN}"
You will be given a sentence with the term "{TARGET_WORD_HUMAN}" in it. You will then be asked to write two new sentences: one where the term "{TARGET_WORD_HUMAN}" carries a more positive connotation, and one where it carries a more negative connotation.

In [42]:
@dataclass
class SentenceToModify:
    text: str
    positive_variation: str = None
    negative_variation: str = None

    def get_prompt(self):
        prompt = PROMPT_INTRO
        for example in EXAMPLES:
            prompt += example.format_for_prompt()
            prompt += "\n\n"
        
        prompt += f"""<baseline>
{self.text}
</baseline>
"""
        return prompt
    
    def parse_response(self, response: str):
        # get the sentences inside <positive {TARGET_WORD}> and <negative {TARGET_WORD}>
        self.positive_variation = response.split(f"<positive {TARGET_WORD_HUMAN}>")[1].split(f"</positive {TARGET_WORD_HUMAN}>")[0].strip()
        self.negative_variation = response.split(f"<negative {TARGET_WORD_HUMAN}>")[1].split(f"</negative {TARGET_WORD_HUMAN}>")[0].strip()
        return self.positive_variation, self.negative_variation

    def get_variations(self) -> list[str]:
        """ Returns a list of two strings: one where the TARGET_WORD has a more positive connotation, and one where it has a more negative one. """
        assert TARGET_WORD_HUMAN in self.text, f"TARGET_WORD {TARGET_WORD_HUMAN} not found in {self.text}"
        prompt = self.get_prompt()
        res = sm.generate_text(prompt=prompt, llm_provider=PROVIDER, llm_model=MODEL)
        return self.parse_response(res)

    @staticmethod
    def load_baselines(word: TARGET_WORD_CHOICES) -> List[str]:
        # find the baselines.csv file in the `input` folder
        filepath = os.path.join('input', 'baselines', f'{word}_neutral_baselines_sentiment.csv')
        df = pd.read_csv(filepath)
        # return the `sentence` column as a list
        print(f"Loaded a total of {len(df)} baseline sentences, sampling {MAX_BASELINES}")
        baselines = df['sentence'].tolist()
        baselines = random.sample(baselines, MAX_BASELINES) if len(baselines) > MAX_BASELINES else baselines
        baselines = [s.replace(TARGET_WORD, TARGET_WORD_HUMAN) for s in baselines]
        return baselines
    
    @staticmethod
    def save_sentences(sentences: List['SentenceToModify']):
        output_file = os.path.join('output', f'{TARGET_WORD}_synthetic_sentences.csv')
        df = pd.DataFrame([{'baseline': s.text, 'positive_variation': s.positive_variation, 'negative_variation': s.negative_variation} for s in sentences])
        df.to_csv(output_file, index=False)
        print(f"Saved {len(sentences)} sentences to {output_file}")
    

baselines = SentenceToModify.load_baselines(TARGET_WORD)
sentence = SentenceToModify(text=baselines[0])
print(sentence.get_prompt())

Loaded a total of 403 baseline sentences, sampling 10
In psychology lexicographic research, we define "sentiment" as the "a term’s acquisition of a more positive or negative connotation". Here we study the sentiment of the term "mental illness"
You will be given a sentence with the term "mental illness" in it. You will then be asked to write two new sentences: one where the term "mental illness" carries a more positive connotation, and one where it carries a more negative connotation.

<baseline>
The research explored the stigma surrounding mental illness in different cultures.
</baseline>
<positive mental illness>
The research highlighted campaigns that reduced stigma and improved perceptions of mental illness.
</positive mental illness>
<negative mental illness>
The research revealed that stigma surrounding mental illness often led to delayed treatment and isolation.
</negative mental illness>


<baseline>
The study analyzed access to care for individuals with mental illness.
</basel

In [39]:
sentences = []

for baseline in tqdm(baselines):
    sentence = SentenceToModify(text=baseline)
    positive_variation, negative_variation = sentence.get_variations()
    sentences.append(sentence)

SentenceToModify.save_sentences(sentences)


100%|██████████| 10/10 [00:31<00:00,  3.10s/it]

Saved 10 sentences to output/mental_illness_synthetic_sentences.csv



