## Installation

Activate the environment:

`conda activate myvoiceenv`

Install the dependencies:

`pip install pandas transformers torch datasets accelerate`
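Before running the notebook cells, it can help to confirm that the packages are visible in the active environment. The sketch below is an illustration rather than part of the original workflow: `missing_packages` is a hypothetical helper name, and it assumes each package's import name matches its pip name, which holds for the five packages installed above.

```python
from importlib.util import find_spec

# Packages installed in the step above (import names match the pip names).
REQUIRED = ["pandas", "transformers", "torch", "datasets", "accelerate"]


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("missing packages:", ", ".join(missing) if missing else "none")
```

`find_spec` only locates a package without importing it, so the check stays fast even for large libraries such as `torch`.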


## Data Processing
I downloaded the dataset from Kaggle and used Python to process the data, saving the results as a CSV file.

The dataset contains around 40,000 BBC news articles, including information such as title, date, author, link, and description.
https://www.kaggle.com/datasets/gpreda/bbc-news/data

To classify the emotional tone of the text, I used the following model:
https://www.kaggle.com/refs/hf-model/logasanjeev/emotion-analyzer-bert

I performed the following processing steps:

1. Kept only the news titles.

2. Used the emotion classification model above to label all 40,000 news titles, and saved the results as a CSV file.

3. Mapped the model's 28 emotion labels to 7 facial expression categories.

4. Saved the data with news titles and their corresponding emotion labels into a new CSV file.

5. Converted the CSV file into TXT format for easier training with GPT-2.

In [3]:
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# load model
model_name = "logasanjeev/emotions-analyzer-bert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# load CSV
df = pd.read_csv("../data/bbc_news.csv")  
titles = df["title"].tolist()

# original id2label
id2label = model.config.id2label

# map the model's 28 labels to 7 categories
mapping = {
    'admiration': 'happy',
    'amusement': 'happy',
    'anger': 'angry',
    'annoyance': 'angry',
    'approval': 'happy',
    'caring': 'happy',
    'confusion': 'sad',
    'curiosity': 'happy',
    'desire': 'happy',
    'disappointment': 'sad',
    'disapproval': 'angry',
    'disgust': 'disgust',
    'embarrassment': 'sad',
    'excitement': 'happy',
    'fear': 'fear',
    'gratitude': 'happy',
    'grief': 'sad',
    'joy': 'happy',
    'love': 'happy',
    'nervousness': 'fear',
    'optimism': 'happy',
    'pride': 'happy',
    'realization': 'neutral',
    'relief': 'happy',
    'remorse': 'sad',
    'sadness': 'sad',
    'surprise': 'surprise',
    'neutral': 'neutral'
}

# classify each title and map its label to one of the 7 categories
results = []
for title in titles:
    inputs = tokenizer(title, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
        probs = F.softmax(outputs.logits, dim=1)
        top_idx = torch.argmax(probs, dim=1).item()
        original_label = id2label[top_idx]
        mapped_emotion = mapping.get(original_label, "neutral")  
        results.append(mapped_emotion)

# save csv
df["emotion"] = results
df.to_csv("news_with_7_emotions.csv", index=False)
print("done! save to news_with_7_emotions.csv")


done! save to news_with_7_emotions.csv
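The mapping above is uneven (13 of the 28 source labels collapse into `happy`, while `disgust` and `surprise` each come from a single label), so it may be worth checking how the seven categories are distributed before training. The following is a minimal sketch using only the standard library. `emotion_distribution` is a hypothetical helper, and the sample rows are made up to stand in for `news_with_7_emotions.csv`.

```python
import csv
import io
from collections import Counter


def emotion_distribution(csv_file):
    """Count rows per emotion and return (emotion, count, share) tuples, most common first."""
    reader = csv.DictReader(csv_file)
    counts = Counter(row["emotion"] for row in reader)
    total = sum(counts.values())
    return [(emotion, n, n / total) for emotion, n in counts.most_common()]


# Made-up rows standing in for news_with_7_emotions.csv.
sample = io.StringIO(
    "title,emotion\n"
    "Example headline one,neutral\n"
    "Example headline two,neutral\n"
    "Example headline three,sad\n"
    "Example headline four,happy\n"
)

for emotion, n, share in emotion_distribution(sample):
    print(f"{emotion:<8} {n:>3} {share:.0%}")
```

With the real file, pass `open("news_with_7_emotions.csv", newline="", encoding="utf-8")` instead of the in-memory sample.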


In [5]:
import pandas as pd

df = pd.read_csv("news_with_7_emotions.csv")  
with open("train.txt", "w", encoding="utf-8") as f:
    for _, row in df.iterrows():
        f.write(f"<{row['emotion'].strip()}> {row['title'].strip()}\n")
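The loop above calls `.strip()` on every title, which raises an `AttributeError` if pandas has read an empty cell as `NaN` (a float). Whether this dataset contains such rows is not stated here, so the following is only a defensive sketch. `to_training_line` is a hypothetical helper, and collapsing internal whitespace is an added assumption so that each title stays on a single line.

```python
def to_training_line(emotion, title):
    """Format one '<emotion> title' line, or return None if either field is unusable."""
    if not isinstance(emotion, str) or not isinstance(title, str):
        return None  # e.g. NaN produced by an empty CSV cell
    emotion = emotion.strip()
    title = " ".join(title.split())  # collapse line breaks and repeated spaces
    if not emotion or not title:
        return None
    return f"<{emotion}> {title}"


# Made-up rows: one title with stray whitespace, one missing title.
rows = [("sad", "  Example headline\nwith a line break "), ("happy", float("nan"))]
lines = [line for line in (to_training_line(e, t) for e, t in rows) if line is not None]
print(lines)
```

Rows that return `None` are skipped, so a single bad row does not interrupt writing `train.txt`.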


## Model Testing


1. During the initial data processing phase, I split the original dataset into three subsets for feasibility testing: one with around 1,000 entries, another with around 3,000, and the full set with 40,000 entries.

2. I tested three different GPT-2 models and saved the results under separate output paths:
    1. sshleifer/tiny-gpt2: Generation was extremely fast, but the output was often incoherent and sometimes contained non-English words.
    2. gpt2: The output resembled fake news quite well and remained mostly readable English, though generation was slightly slower.
    3. gpt2-medium: This model pushed my computer to its limits. The output was the most readable and realistic of the three, but generation was noticeably slower.
3. In the end, I decided to use the gpt2 model for training.
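The subsets in step 1 could be produced in several ways. The original method is not described, so the sketch below assumes simple random sampling with a fixed seed. `make_subsets`, the seed value, and the placeholder lines are all illustrative.

```python
import random


def make_subsets(items, sizes=(1000, 3000), seed=42):
    """Return {size: random sample} for each requested size, plus the full set."""
    rng = random.Random(seed)  # fixed seed so the same subsets come back on every run
    subsets = {n: rng.sample(items, min(n, len(items))) for n in sizes}
    subsets[len(items)] = list(items)
    return subsets


# Placeholder lines standing in for the contents of train.txt.
lines = [f"<neutral> Example headline {i}" for i in range(40000)]
for size, subset in make_subsets(lines).items():
    print(size, len(subset))
```

Sampling at random rather than taking the first rows keeps each small subset closer to the label mix of the full set.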

In [6]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments, TextDataset, DataCollatorForLanguageModeling

# use tiny-gpt2
model_name = "sshleifer/tiny-gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# add emotion tokens (<happy>, <sad>, ...)
special_tokens = ['<happy>', '<sad>', '<angry>', '<fear>', '<surprise>', '<disgust>', '<neutral>']
tokenizer.add_special_tokens({'additional_special_tokens': special_tokens})
model.resize_token_embeddings(len(tokenizer))  

# add dataset
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="train.txt",
    block_size=64,
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False
)

# training args
training_args = TrainingArguments(
    output_dir="./gpt2-news-emotion",
    overwrite_output_dir=True,
    num_train_epochs=20,
    per_device_train_batch_size=8,
    save_steps=500,
    save_total_limit=2,
    logging_steps=100,
)

# start Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)

# train
trainer.train()
trainer.save_model("./gpt2-news-emotion")
tokenizer.save_pretrained("./gpt2-news-emotion")



The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.



| Step | Training Loss |
|---|---|
| 100 | 10.6543 |
| 200 | 10.64 |
| 300 | 10.626 |
| 400 | 10.6174 |
| 500 | 10.6127 |




('./gpt2-news-emotion/tokenizer_config.json',
 './gpt2-news-emotion/special_tokens_map.json',
 './gpt2-news-emotion/vocab.json',
 './gpt2-news-emotion/merges.txt',
 './gpt2-news-emotion/added_tokens.json')
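One way to read the training loss logged above is to compare it with the loss of a model that guesses every token with equal probability, which is ln(V) for a vocabulary of V tokens. The sketch below assumes the standard GPT-2 vocabulary of 50,257 tokens plus the 7 emotion tokens, and that the logged loss is a mean cross-entropy in nats. The helper names are hypothetical.

```python
import math

GPT2_VOCAB_SIZE = 50257   # assumed base GPT-2 vocabulary size
NUM_EMOTION_TOKENS = 7    # <happy>, <sad>, ... added before training


def uniform_loss(vocab_size):
    """Cross-entropy (in nats) of a model that gives every token equal probability."""
    return math.log(vocab_size)


def perplexity(loss):
    """Perplexity corresponding to a mean cross-entropy loss in nats."""
    return math.exp(loss)


baseline = uniform_loss(GPT2_VOCAB_SIZE + NUM_EMOTION_TOKENS)
print(f"uniform baseline: {baseline:.3f}")
for loss in (10.6543, 10.6127):  # first and last logged values above
    print(f"loss {loss:.4f} -> perplexity {perplexity(loss):,.0f}")
```

Under these assumptions the baseline is about 10.83, so a loss near 10.6 means the model is predicting only slightly better than uniform guessing, which fits the incoherent samples from tiny-gpt2.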

In [7]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_path = "./gpt2-news-emotion"
tokenizer = GPT2Tokenizer.from_pretrained(model_path)
model = GPT2LMHeadModel.from_pretrained(model_path)
model.eval()

def generate_title(emotion):
    prompt = f"<{emotion}>"
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output = model.generate(input_ids, max_length=20, num_return_sequences=1, do_sample=True, top_k=50, top_p=0.95)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# print generated titles
print(generate_title("sad"))
print("👋")
print(generate_title("happy"))


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


 Neil appointed letters weeks Ir Treasury Neil internet Lyn holdersoth Border Bert appointed stop declares my mutual Lebanese
👋
 internet girl Border save remarkable mutualb firm my Television about deflect Treasury holders makes Lyn weeks Only asked
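The attention-mask and `pad_token_id` warnings above come from calling `generate` with only `input_ids`. A possible variant of `generate_title` is sketched below: it calls the tokenizer directly so that the attention mask is returned alongside the input IDs, and passes the end-of-sequence ID as the padding ID. Taking `model` and `tokenizer` as arguments is a change made for this illustration. The sampling settings are copied from the cell above.

```python
def generate_title(model, tokenizer, emotion, max_length=20):
    """Sample one title for an emotion tag, passing the attention mask explicitly."""
    # Calling the tokenizer (rather than .encode) returns input_ids and attention_mask.
    encoded = tokenizer(f"<{emotion}>", return_tensors="pt")
    output = model.generate(
        input_ids=encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 defines no separate pad token
        max_length=max_length,
        num_return_sequences=1,
        do_sample=True,
        top_k=50,
        top_p=0.95,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True).strip()
```

With the model and tokenizer loaded as in the cell above, `print(generate_title(model, tokenizer, "sad"))` should produce a sample without those warnings.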


In [8]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments, TextDataset, DataCollatorForLanguageModeling

# use gpt2
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# add emotion tokens (<happy>, <sad>, ...)
special_tokens = ['<happy>', '<sad>', '<angry>', '<fear>', '<surprise>', '<disgust>', '<neutral>']
tokenizer.add_special_tokens({'additional_special_tokens': special_tokens})
model.resize_token_embeddings(len(tokenizer))  

# add dataset
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="train.txt",
    block_size=64,
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False
)

# training args
training_args = TrainingArguments(
    output_dir="./ogpt2-news-emotion",
    overwrite_output_dir=True,
    num_train_epochs=20,
    per_device_train_batch_size=8,
    save_steps=500,
    save_total_limit=2,
    logging_steps=100,
)

# start Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)

# train
trainer.train()
trainer.save_model("./ogpt2-news-emotion")
tokenizer.save_pretrained("./ogpt2-news-emotion")





| Step | Training Loss |
|---|---|
| 100 | 10.1189 |
| 200 | 10.7325 |
| 300 | 10.456 |
| 400 | 10.0646 |
| 500 | 9.8299 |




('./ogpt2-news-emotion/tokenizer_config.json',
 './ogpt2-news-emotion/special_tokens_map.json',
 './ogpt2-news-emotion/vocab.json',
 './ogpt2-news-emotion/merges.txt',
 './ogpt2-news-emotion/added_tokens.json')
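All of the training cells use `block_size=64`, a batch size of 8, and 20 epochs, so the total number of optimizer steps depends mainly on how many tokens `train.txt` contains. The sketch below estimates that number. It assumes the text is cut into consecutive 64-token blocks with any trailing remainder dropped (my understanding of how `TextDataset` behaves), a single device, and no gradient accumulation. `estimate_steps` and the token count are hypothetical.

```python
import math


def estimate_steps(num_tokens, block_size=64, batch_size=8, epochs=20):
    """Estimate (blocks, steps per epoch, total steps) for fixed-length training blocks."""
    num_blocks = num_tokens // block_size  # a trailing partial block is dropped
    steps_per_epoch = math.ceil(num_blocks / batch_size)
    return num_blocks, steps_per_epoch, steps_per_epoch * epochs


# Hypothetical token count, for illustration only.
blocks, per_epoch, total = estimate_steps(num_tokens=100_000)
print(blocks, per_epoch, total)
```

Comparing this estimate with the last step in the training log is a quick way to confirm which subset a run was trained on.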

In [9]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_path = "./ogpt2-news-emotion"
tokenizer = GPT2Tokenizer.from_pretrained(model_path)
model = GPT2LMHeadModel.from_pretrained(model_path)
model.eval()

def generate_title(emotion):
    prompt = f"<{emotion}>"
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output = model.generate(input_ids, max_length=20, num_return_sequences=1, do_sample=True, top_k=50, top_p=0.95, repetition_penalty=1.2)
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(generate_title("sad"))
print("👋")
print(generate_title("happy"))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 and refugee cases
 'Something is wrong with our system' - what we know from election
👋
 be suspended and expelled from US cricket team?
 'Are there plans for us to rebuild
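Each sample above runs past the end of the first title and into the start of a second one. This is likely because `train.txt` stores one title per line and generation only stops at `max_length`. A minimal post-processing sketch that keeps the first non-empty line is shown below. `first_title` is a hypothetical helper, and the sample string is copied from the output above.

```python
def first_title(generated_text):
    """Return the first non-empty line of a generated sample, stripped of whitespace."""
    for line in generated_text.splitlines():
        line = line.strip()
        if line:
            return line
    return ""


sample = " and refugee cases\n 'Something is wrong with our system' - what we know from election"
print(first_title(sample))
```

Applied to the return value of `generate_title`, this keeps a single headline and discards the truncated fragment after the line break.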


In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments, TextDataset, DataCollatorForLanguageModeling

# gpt2-medium
model_name = "gpt2-medium"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

special_tokens = ['<happy>', '<sad>', '<angry>', '<fear>', '<surprise>', '<disgust>', '<neutral>']
tokenizer.add_special_tokens({'additional_special_tokens': special_tokens})
model.resize_token_embeddings(len(tokenizer))  

train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="train.txt",
    block_size=64,
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False
)

training_args = TrainingArguments(
    output_dir="./mgpt2-news-emotion",
    overwrite_output_dir=True,
    num_train_epochs=20,
    per_device_train_batch_size=8,
    save_steps=50,
    save_total_limit=2,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)

trainer.train()
trainer.save_model("./mgpt2-news-emotion")
tokenizer.save_pretrained("./mgpt2-news-emotion")




| Step | Training Loss |
|---|---|
| 100 | 3.8201 |
| 200 | 1.9846 |
| 300 | 1.0408 |
| 400 | 0.6658 |
| 500 | 0.5221 |




('./mgpt2-news-emotion/tokenizer_config.json',
 './mgpt2-news-emotion/special_tokens_map.json',
 './mgpt2-news-emotion/vocab.json',
 './mgpt2-news-emotion/merges.txt',
 './mgpt2-news-emotion/added_tokens.json')

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_path = "./mgpt2-news-emotion"
tokenizer = GPT2Tokenizer.from_pretrained(model_path)
model = GPT2LMHeadModel.from_pretrained(model_path)
model.eval()

def generate_title(emotion):
    prompt = f"<{emotion}>"
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output = model.generate(input_ids, max_length=20, num_return_sequences=1, do_sample=True, top_k=50, top_p=0.95, repetition_penalty=1.2)
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(generate_title("sad"))
print("👋")
print(generate_title("happy"))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 Will China step up if Trump takes a step back on climate change?
 Reeves tells the
👋
 How will the vulnerable be protected from Covid? And other questions
 Irish election 'too
