### Producing Sentiment variations using LLMs

Author: Raphael Merx
Input: a baseline sentence; Output: variations of this sentence where a target word has more positive or negative sentiment


In [6]:
#pip install simplemind python-dotenv pandas openpyxl

In [1]:
from dotenv import load_dotenv
load_dotenv()

from dataclasses import dataclass
from tqdm import tqdm
import simplemind as sm
from typing import Literal, List, get_args
import pandas as pd
import os
import random

random.seed(42)


TARGET_WORD_CHOICES = Literal['abuse', 'anxiety', 'depression', 'mental_health', 'mental_illness', 'trauma']

TARGET_WORD: TARGET_WORD_CHOICES = 'mental_illness'
# TARGET_WORD_HUMAN = TARGET_WORD.replace('_', ' ') 

# 1970-1974, ..., 2015-2019
EPOCH_CHOICES = [f"{y}-{y+4}" for y in range(1970, 2020, 5)]
EPOCH = EPOCH_CHOICES[0]

MAX_BASELINES = 10

# can be changed to gemini, see https://pypi.org/project/simplemind/
PROVIDER = "openai"
MODEL = "gpt-4o"

Get neutral baseline sentences from corpus for LLM input for each target

- **Script 1 Aim**: This script computes sentence-level sentiment scores using the NRC-VAD lexicon for a corpus of target terms spanning 1970–2019. It dynamically determines neutral sentiment ranges for each 5-year epoch by expanding outward from the median sentiment score within the interquartile range (Q1–Q3) until at least 500 sentences are included, capped at 1500. The selected sentences are saved to CSV files for each target and epoch, and a summary file logs the dynamic ranges and counts.

- **Script 2 Aim**: This script processes pre-saved baseline CSV files to calculate sentence counts by year and 5-year epochs for multiple target terms. It generates "year_count_lines.csv" and "epoch_count_lines.csv" summarizing these counts and creates an epoch-based bar plot visualizing sentence distributions across the specified epochs for each target.
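
The neutral-range selection described for Script 1 can be sketched roughly as follows. This is a minimal sketch, not the contents of `step0_get_neutral_baselines_sentiment.py`: the column name `valence`, the function name `select_neutral_sentences`, the linear widening schedule, and the random sampling used to apply the 1500-sentence cap are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def select_neutral_sentences(
    df: pd.DataFrame,
    score_col: str = "valence",  # hypothetical column holding the sentence-level score
    min_n: int = 500,
    max_n: int = 1500,
    n_steps: int = 100,  # assumed granularity of the widening schedule
) -> tuple[pd.DataFrame, float, float]:
    """Widen a window around the median, clipped to [Q1, Q3], until it holds min_n rows."""
    scores = df[score_col]
    q1, median, q3 = scores.quantile([0.25, 0.5, 0.75])
    max_half_width = max(median - q1, q3 - median)

    low, high = q1, q3
    for half_width in np.linspace(0, max_half_width, n_steps + 1)[1:]:
        # never let the window leave the interquartile range
        low = max(q1, median - half_width)
        high = min(q3, median + half_width)
        if scores.between(low, high).sum() >= min_n:
            break

    selected = df[scores.between(low, high)]
    if len(selected) > max_n:
        # assumption: the cap is applied by random sampling
        selected = selected.sample(n=max_n, random_state=42)
    return selected, float(low), float(high)


# synthetic scores, only to exercise the function
rng = np.random.default_rng(0)
demo = pd.DataFrame({"valence": rng.normal(0.5, 0.1, 2000)})
neutral, low, high = select_neutral_sentences(demo)
print(len(neutral), round(low, 3), round(high, 3))
```

If an epoch has fewer than `min_n` sentences inside the interquartile range, this sketch simply returns everything between Q1 and Q3.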

In [19]:
#%run step0_get_neutral_baselines_sentiment.py
#%run step1_plot_neutral_baselines_sentiment.py

Set up examples to inject in the prompt

- This code sets up examples of baseline sentences and their sentiment-modified variations (one more positive, one more negative) to provide context and guidance for the LLM.
- These examples are formatted into a structured prompt to help the model understand how to generate sentiment-modified variations for new sentences.

In [6]:
@dataclass
class Example:
    baseline: str
    positive_sentiment: str
    negative_sentiment: str

    def format_for_prompt(self):
        return f"""<baseline>
{self.baseline}
</baseline>
<positive {TARGET_WORD}>
{self.positive_sentiment}
</positive {TARGET_WORD}>
<negative {TARGET_WORD}>
{self.negative_sentiment}
</negative {TARGET_WORD}>
"""


    @staticmethod
    def read_example_data(word: TARGET_WORD_CHOICES) -> pd.DataFrame:
        # load the shared example workbook from the `input` folder; rows are filtered by `word` below
        filepath = os.path.join('input', 'sentiment_example_sentences.xlsx')
        df = pd.read_excel(filepath)
        df = df[df['target'] == word]
        print(f"Loaded {len(df)} examples for the term '{word}'")
        return df

example_data = Example.read_example_data(TARGET_WORD)

EXAMPLES = [
    Example(
        baseline=row['baseline'],
        positive_sentiment=row['positive_sentiment'],
        negative_sentiment=row['negative_sentiment']
    )
    for _, row in example_data.iterrows()
]

PROMPT_INTRO = f"""
In psychology research, 'Sentiment' is defined as “a term’s acquisition of a more positive or negative connotation.” 
Here we study the sentiment of the term '{TARGET_WORD}'.

You will be given a sentence with the term '{TARGET_WORD}' in it. Your task is to write two new sentences:
- One where the term '{TARGET_WORD}' carries a more positive connotation.
- One where the term '{TARGET_WORD}' carries a more negative connotation.

Important Guidelines:
- Retain the original structure and context of the baseline sentence as much as possible.
- Make only small, targeted adjustments to alter the sentiment of the term '{TARGET_WORD}'.
- Do not add or remove key concepts or introduce entirely new elements not implied in the baseline.
- Do not replace the target term with another word.
"""


Loaded 5 examples for the term 'mental_illness'


In [11]:
@dataclass
class SentenceToModify:
    text: str
    positive_variation: str = None
    negative_variation: str = None

    def get_prompt(self):
        prompt = PROMPT_INTRO + "\n\n"
        for example in EXAMPLES:
            prompt += example.format_for_prompt()
            prompt += "\n\n"
        
        prompt += f"""<baseline>
{self.text}
</baseline>
"""
        return prompt
    
    def parse_response(self, response: str):
        # get the sentences inside <positive {TARGET_WORD}> and <negative {TARGET_WORD}>
        self.positive_variation = response.split(f"<positive {TARGET_WORD}>")[1].split(f"</positive {TARGET_WORD}>")[0].strip()
        self.negative_variation = response.split(f"<negative {TARGET_WORD}>")[1].split(f"</negative {TARGET_WORD}>")[0].strip()
        return self.positive_variation, self.negative_variation

    def get_variations(self) -> tuple[str, str]:
        """ Returns a (positive, negative) pair of strings: one where the TARGET_WORD has a more positive connotation, and one where it has a more negative one. """
        assert TARGET_WORD in self.text.lower(), f"TARGET_WORD {TARGET_WORD} not found in {self.text}"
        prompt = self.get_prompt()
        res = sm.generate_text(prompt=prompt, llm_provider=PROVIDER, llm_model=MODEL)
        return self.parse_response(res)

    @staticmethod
    def load_baselines(word: TARGET_WORD_CHOICES, epoch: str) -> List[str]:
        # read the baseline sentences for this word and epoch from `input/baselines`
        filepath = os.path.join('input', 'baselines', f'{word}_{epoch}.baseline_1500_sentences.csv')
        df = pd.read_csv(filepath)
        print(f"Word {word}, epoch {epoch}: ", end="")
        print(f"Loaded {len(df)} baseline sentences, sampling {MAX_BASELINES}")
        # take the `sentence` column as a list, randomly sampled down to at most MAX_BASELINES
        baselines = df['sentence'].tolist()
        if MAX_BASELINES and len(baselines) > MAX_BASELINES:
            baselines = random.sample(baselines, MAX_BASELINES)
        return baselines
    
    @staticmethod
    def save_sentences(sentences: List['SentenceToModify'], word: TARGET_WORD_CHOICES, epoch: str):
        # make sure the `output` folder exists before writing to it
        os.makedirs('output', exist_ok=True)
        output_file = os.path.join('output', f'{word}_{epoch}.synthetic_sentences.csv')
        df = pd.DataFrame([
            {'baseline': s.text, 'positive_variation': s.positive_variation, 'negative_variation': s.negative_variation}
            for s in sentences
        ])
        df.to_csv(output_file, index=False)
        print(f"Saved {len(sentences)} sentences to {output_file}")
    

baselines = SentenceToModify.load_baselines(TARGET_WORD, EPOCH)
sentence = SentenceToModify(text=baselines[0])
print(sentence.get_prompt())

Word abuse, epoch 1975-1979: Loaded 188 baseline sentences, sampling 10

In psychology research, 'Sentiment' is defined as “a term’s acquisition of a more positive or negative connotation.” 
Here we study the sentiment of the term 'mental_illness'.

You will be given a sentence with the term 'mental_illness' in it. Your task is to write two new sentences:
- One where the term 'mental_illness' carries a more positive connotation.
- One where the term 'mental_illness' carries a more negative connotation.

Important Guidelines:
- Retain the original structure and context of the baseline sentence as much as possible.
- Make only small, targeted adjustments to alter the sentiment of the term 'mental_illness'.
- Do not add or remove key concepts or introduce entirely new elements not implied in the baseline.
- Do not replace the target term with another word.


<baseline>
Internet addiction (IA) is an emerging social and mental_health issue among youths.
</baseline>
<positive abuse>
Internet
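
`parse_response` splits on the opening and closing tags, so a reply that is cut off or missing a tag raises an `IndexError`. A more tolerant parser could look like the sketch below, which returns `None` for any variation whose tag pair is incomplete. The helper names `extract_tagged` and `parse_variations` are hypothetical, and the demo reply is invented for illustration, not real model output.

```python
import re
from typing import Optional, Tuple


def extract_tagged(response: str, tag: str) -> Optional[str]:
    """Return the text between <tag> and </tag>, or None if the pair is incomplete."""
    escaped = re.escape(tag)
    match = re.search(rf"<{escaped}>(.*?)</{escaped}>", response, re.DOTALL)
    return match.group(1).strip() if match else None


def parse_variations(response: str, target_word: str) -> Tuple[Optional[str], Optional[str]]:
    """Extract the positive and negative variations for target_word from an LLM reply."""
    return (
        extract_tagged(response, f"positive {target_word}"),
        extract_tagged(response, f"negative {target_word}"),
    )


# invented reply in the same tag format as the prompt examples
demo_reply = """<positive mental_illness>
Early support helps people with mental_illness recover.
</positive mental_illness>
<negative mental_illness>
Without support, mental_illness can devastate families.
</negative mental_illness>"""

positive, negative = parse_variations(demo_reply, "mental_illness")
truncated = parse_variations("<positive mental_illness>\nEarly support", "mental_illness")
print(positive)
print(negative)
print(truncated)
```

Returning `None` instead of raising lets the caller decide whether to retry the request or skip the sentence.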

In [12]:
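# to launch the whole script, see below:

MAX_BASELINES = 10

# NOTE: PROMPT_INTRO and EXAMPLES were built for the TARGET_WORD set at the top of the notebook.
# Reassigning TARGET_WORD here only changes the prompt tags and the baseline files that are read,
# so the intro and examples must be rebuilt per word if they should match the current word.
for TARGET_WORD in get_args(TARGET_WORD_CHOICES):
    for EPOCH in EPOCH_CHOICES:
        baselines = SentenceToModify.load_baselines(TARGET_WORD, EPOCH)
        sentences = []
        for baseline in tqdm(baselines):
            sentence = SentenceToModify(text=baseline)
            sentence.get_variations()  # stores both variations on the sentence object
            sentences.append(sentence)
        SentenceToModify.save_sentences(sentences, word=TARGET_WORD, epoch=EPOCH)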

Word abuse, epoch 1970-1974: Loaded 7 baseline sentences, sampling 10


100%|██████████| 7/7 [00:11<00:00,  1.57s/it]


Saved 7 sentences to output/abuse_1970-1974.synthetic_sentences.csv
Word abuse, epoch 1975-1979: Loaded 188 baseline sentences, sampling 10


100%|██████████| 10/10 [00:15<00:00,  1.52s/it]


Saved 10 sentences to output/abuse_1975-1979.synthetic_sentences.csv
Word abuse, epoch 1980-1984: Loaded 434 baseline sentences, sampling 10


100%|██████████| 10/10 [00:18<00:00,  1.86s/it]


Saved 10 sentences to output/abuse_1980-1984.synthetic_sentences.csv
Word abuse, epoch 1985-1989: Loaded 512 baseline sentences, sampling 10


100%|██████████| 10/10 [00:16<00:00,  1.69s/it]


Saved 10 sentences to output/abuse_1985-1989.synthetic_sentences.csv
Word abuse, epoch 1990-1994: Loaded 722 baseline sentences, sampling 10


100%|██████████| 10/10 [00:14<00:00,  1.40s/it]


Saved 10 sentences to output/abuse_1990-1994.synthetic_sentences.csv
Word abuse, epoch 1995-1999: Loaded 536 baseline sentences, sampling 10


100%|██████████| 10/10 [00:17<00:00,  1.75s/it]


Saved 10 sentences to output/abuse_1995-1999.synthetic_sentences.csv
Word abuse, epoch 2000-2004: Loaded 683 baseline sentences, sampling 10


100%|██████████| 10/10 [00:17<00:00,  1.71s/it]


Saved 10 sentences to output/abuse_2000-2004.synthetic_sentences.csv
Word abuse, epoch 2005-2009: Loaded 789 baseline sentences, sampling 10


100%|██████████| 10/10 [00:19<00:00,  1.92s/it]


Saved 10 sentences to output/abuse_2005-2009.synthetic_sentences.csv
Word abuse, epoch 2010-2014: Loaded 931 baseline sentences, sampling 10


100%|██████████| 10/10 [00:24<00:00,  2.44s/it]


Saved 10 sentences to output/abuse_2010-2014.synthetic_sentences.csv
Word abuse, epoch 2015-2019: Loaded 843 baseline sentences, sampling 10


100%|██████████| 10/10 [00:17<00:00,  1.70s/it]


Saved 10 sentences to output/abuse_2015-2019.synthetic_sentences.csv
Word anxiety, epoch 1970-1974: Loaded 378 baseline sentences, sampling 10


100%|██████████| 10/10 [00:17<00:00,  1.70s/it]


Saved 10 sentences to output/anxiety_1970-1974.synthetic_sentences.csv
Word anxiety, epoch 1975-1979: Loaded 547 baseline sentences, sampling 10


100%|██████████| 10/10 [00:15<00:00,  1.58s/it]


Saved 10 sentences to output/anxiety_1975-1979.synthetic_sentences.csv
Word anxiety, epoch 1980-1984: Loaded 664 baseline sentences, sampling 10


100%|██████████| 10/10 [00:14<00:00,  1.46s/it]


Saved 10 sentences to output/anxiety_1980-1984.synthetic_sentences.csv
Word anxiety, epoch 1985-1989: Loaded 605 baseline sentences, sampling 10


100%|██████████| 10/10 [00:14<00:00,  1.40s/it]


Saved 10 sentences to output/anxiety_1985-1989.synthetic_sentences.csv
Word anxiety, epoch 1990-1994: Loaded 973 baseline sentences, sampling 10


100%|██████████| 10/10 [00:15<00:00,  1.55s/it]


Saved 10 sentences to output/anxiety_1990-1994.synthetic_sentences.csv
Word anxiety, epoch 1995-1999: Loaded 642 baseline sentences, sampling 10


 30%|███       | 3/10 [00:06<00:16,  2.32s/it]


KeyboardInterrupt: 
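
Because the run above was interrupted partway through, it can be useful to resume without repeating the word and epoch pairs that were already saved. The sketch below assumes the output naming used by `save_sentences`; the helper `pending_jobs` is hypothetical and the demo writes to a temporary folder, not to the real `output` directory.

```python
import os
import tempfile
from typing import Iterable, Iterator, Tuple


def pending_jobs(
    words: Iterable[str],
    epochs: Iterable[str],
    output_dir: str = "output",
) -> Iterator[Tuple[str, str]]:
    """Yield the (word, epoch) pairs that do not have a saved CSV yet."""
    epochs = list(epochs)  # allow the epochs to be reused for every word
    for word in words:
        for epoch in epochs:
            path = os.path.join(output_dir, f"{word}_{epoch}.synthetic_sentences.csv")
            if not os.path.exists(path):
                yield word, epoch


# demo in a temporary folder: pretend the first 'abuse' epoch was already saved
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "abuse_1970-1974.synthetic_sentences.csv"), "w").close()
    remaining = list(
        pending_jobs(["abuse", "anxiety"], ["1970-1974", "1975-1979"], output_dir=tmp)
    )
print(remaining)
```

In the launch loop, the two nested `for` statements could then be replaced by a single loop over `pending_jobs(get_args(TARGET_WORD_CHOICES), EPOCH_CHOICES)`. Since `save_sentences` writes a file only after a whole epoch finishes, an epoch that was interrupted midway is redone from the start.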