# Reinforcement learning for better sentiment control

In this notebook, the models obtained after supervised fine-tuning (SFT) are fine-tuned further with reinforcement learning for better sentiment control, using the trl library. The approach follows the trl example notebook: https://github.com/huggingface/trl/blob/main/examples/notebooks/gpt2-sentiment-control.ipynb

## Import Libraries

In [1]:
!pip install wandb datasets trl

Collecting wandb
  Downloading wandb-0.16.1-py3-none-any.whl (2.1 MB)
Collecting datasets
  Downloading datasets-2.15.0-py3-none-any.whl (521 kB)
Collecting trl
  Downloading trl-0.7.4-py3-none-any.whl (133 kB)
Collecting GitPython!=3.1.29,>=1.0.0 (from wandb)
  Downloading GitPython-3.1.40-py3-none-any.whl (190 kB)
Collecting sentry-sdk>=1.0.0 (from wandb)
  Downloading sentry_sdk-1.39.1-py2.py3-none-any.whl (254 kB)

In [2]:
!pip install transformers



In [3]:
!python -m spacy download de_core_news_md


In [None]:
import csv
import random
import torch
import wandb
import time
import os
from tqdm import tqdm
import numpy as np
import pandas as pd
from random import choices
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from datasets import Dataset
import tensorflow as tf
import transformers
from transformers import BertTokenizer
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation
import spacy
from transformers import AutoTokenizer, pipeline, AutoModelWithLMHead
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead, create_reference_model
import re

tqdm.pandas()

nlp = spacy.load('de_core_news_md')
nltk.download('punkt')
nltk.download('stopwords')

from google.colab import drive
drive.mount('/content/drive')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])
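
As a rough illustration of what `collator` does with a batch: it turns a list of per-example dicts into one dict of lists, without padding or stacking, so variable-length `input_ids` can be handed to the PPO trainer unchanged. The records below are made up for the example and are not taken from the training data.

```python
def collator(data):
    # Map every key of the first example to the list of that key's values across the batch.
    return dict((key, [d[key] for d in data]) for key in data[0])


# Hypothetical records standing in for dataset rows (assumption for illustration only).
examples = [
    {"query": "Das Hotel war", "sentiment": "positive"},
    {"query": "Nie wieder dieses", "sentiment": "negative"},
]

batch = collator(examples)
print(batch["query"])
print(batch["sentiment"])
```

Each value in `batch` stays a plain Python list, which is the form in which the training loop later reads `batch["input_ids"]` and `batch["query"]`.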

## Set up models for RL

In [None]:
sentiment_pipe_kwargs = {"top_k": None, "function_to_apply": "none"}

config = PPOConfig(
    model_name="/content/drive/MyDrive/Masterthesis/Models/german_gpt2_sft_with_tokens_big_2_epochs",
    mini_batch_size=16,
    steps=51200,
    learning_rate=1.41e-5,
    remove_unused_columns=False,
    log_with="wandb",
)

txt_in_len = 5
txt_out_len = 20
seed = 1

In [None]:
random.seed(seed)  # random.choices draws the control tokens in the training loop
np.random.seed(seed)

In [None]:
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
model_ref = create_reference_model(model)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

Prepare Dataset for training:

In [None]:
path="/content/drive/MyDrive/Masterthesis/Data/data_processed/train_for_sft.tsv"
data=pd.read_csv(path, sep='\t')


In [None]:
data=data[["preprocessed_text","sentiment"]]

In [None]:
dataset = Dataset.from_pandas(data)
dataset = dataset.filter(lambda x: len(x["preprocessed_text"]) > 500, batched=False)
dataset = dataset.map(lambda x: {"preprocessed_text": x["preprocessed_text"][:1000]}, batched=False)
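
Note that both thresholds in the cell above count characters, not tokens. A minimal sketch of the same filter-then-truncate logic on plain strings, with a hypothetical helper name `prepare_texts` and synthetic inputs, could look like this:

```python
def prepare_texts(texts, min_chars=500, max_chars=1000):
    """Keep texts longer than min_chars characters and cut them to max_chars characters."""
    return [t[:max_chars] for t in texts if len(t) > min_chars]


# Synthetic reviews of 400, 600 and 1500 characters (assumption for illustration only).
samples = ["a" * 400, "b" * 600, "c" * 1500]

prepared = prepare_texts(samples)
print([len(t) for t in prepared])  # [600, 1000]
```

The short review is dropped, the medium one is kept unchanged, and the long one is cut to 1000 characters.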


In [None]:
dataset = dataset.map(
    lambda x: {"input_ids": tokenizer.encode(" " + x["preprocessed_text"], return_tensors="pt",truncation=True,max_length=1024)[0, :txt_in_len]},
    batched=False
)
dataset = dataset.map(lambda x: {"query": tokenizer.decode(x["input_ids"])}, batched=False)
dataset = dataset.shuffle(seed=42)
dataset = dataset[:14770]
dataset = Dataset.from_dict(dataset)
dataset.set_format("pytorch")

Map:   0%|          | 0/24883 [00:00<?, ? examples/s]

Map:   0%|          | 0/24883 [00:00<?, ? examples/s]

{'preprocessed_text': ['Es ist alles sehr sauber gewesen, da ja jeden tag gereinigt worden ist, es ist zwar auch schon alles etwas abgenutzt aber das ist normal bei tausenden besuchern im jahr und es verhalten sich ja auch nicht immer alles so ordentlich im urlaub...leider!!!man sollte auf jeden fall vermeiden, ein einfaches zimmer zu buchen, denn die liegen zur straßenseite raus udn das kann schon mal unangenehm werden bei dem verkehr auf den straßen.die deluxzimmer sind echt gut und wer es etwas größer haben will, der bucht die suite.',
  'WENN RHODOS VILLAGE ÜBERBUCHT IST, KOMMT MAN HIER HER! Ich muss vorab sagen: Wir haben dieses Hotel nie gebucht. Eigentlich sollten wir ins Nachbarhotel MITSIS HOTELS RHODOS VILLAGE kommen, doch da es dort seit Jahren systematisch zu gewollten Überbuchungen kommt, werden Hotelgäste 1-3 Tage ins PrimaSol Princess Sun ausquartiert. Vor ort haben wir viele kennengelernt, denen es so erging und meine Recherchen haben ergeben, dass es vielen vor uns - u

In [None]:
wandb.login()  # reads the key from the WANDB_API_KEY environment variable or prompts for it; do not hard-code it

wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

## PPO training

In [None]:
ppo_trainer = PPOTrainer(config, model, model_ref, tokenizer, dataset, data_collator=collator)

wandb: Currently logged in as: paulina3381 (mt_paulina). Use `wandb login --relogin` to force relogin


In [None]:
generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
    "max_new_tokens": 32,
}

In [None]:
ctrl = ["[negative]", "[positive]"]
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

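# Encode control tokens to 1-D tensors ([0] drops the batch dimension and,
# unlike squeeze(), stays 1-D even if a control token maps to a single id)
ctrl_tokens = {}
for c in ctrl:
    encoded_token = tokenizer.encode(c, return_tensors="pt")[0].to(device)
    ctrl_tokens[c] = encoded_token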


In [None]:
classifier = pipeline(
    "sentiment-analysis",
    model="/content/drive/MyDrive/Masterthesis/Models/sentiment_discriminator_bert_finetuned",
    **sentiment_pipe_kwargs,
)

def get_logits(texts):
    """Return [negative, positive] raw classifier scores for each text."""
    scores_texts = []
    for text in texts:
        # With top_k=None the pipeline returns one {'label', 'score'} dict per class
        output = classifier(text)[0]
        score_dict = {item['label']: item['score'] for item in output}
        negative_score = score_dict.get('NEGATIVE', 0.0)
        positive_score = score_dict.get('POSITIVE', 0.0)
        # Negative score first, then positive score
        scores_texts.append([negative_score, positive_score])
    return scores_texts
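
The reordering step in `get_logits` can be tried without loading the classifier. The sketch below uses a hypothetical helper `order_scores` and a hand-written stand-in for the pipeline output; the score values are invented, and the label names `NEGATIVE` and `POSITIVE` are the ones the cell above looks up.

```python
def order_scores(output):
    """Return [negative, positive] scores from a list of {'label', 'score'} dicts."""
    score_dict = {item["label"]: item["score"] for item in output}
    # A missing label falls back to 0.0, as in get_logits.
    return [score_dict.get("NEGATIVE", 0.0), score_dict.get("POSITIVE", 0.0)]


# Stand-in for the per-class output of one text (assumption: values are made up).
sample_output = [
    {"label": "POSITIVE", "score": 2.5},
    {"label": "NEGATIVE", "score": -1.5},
]

scores = order_scores(sample_output)
print(scores)  # [-1.5, 2.5]
```

Whatever order the classes arrive in, index 0 holds the negative score and index 1 the positive score, which is the layout `logit_to_reward` relies on.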



In [None]:
def logit_to_reward(logit, task):
    """
    Select the reward for each sample: the negative-class logit for
    "[negative]" tasks and the positive-class logit for "[positive]" tasks.
    """
    scores = []
    for i in range(len(logit)):
        if task[i] == "[negative]":
            scores.append(logit[i][0])
        elif task[i] == "[positive]":
            scores.append(logit[i][1])

    return [torch.tensor(score, dtype=torch.float32) for score in scores]
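
The reward rule can be checked on toy numbers. This is a dependency-free sketch with a hypothetical helper `select_rewards` that returns plain floats rather than tensors; the logits are invented for illustration.

```python
def select_rewards(logits, tasks):
    """Pick, per sample, the logit of the class named by its control token."""
    column = {"[negative]": 0, "[positive]": 1}  # index into [negative, positive]
    return [pair[column[task]] for pair, task in zip(logits, tasks)]


# Invented [negative, positive] logits for three generated texts.
logits = [[1.8, -2.0], [-0.5, 0.7], [0.3, -0.1]]
tasks = ["[negative]", "[positive]", "[positive]"]

rewards = select_rewards(logits, tasks)
print(rewards)  # [1.8, 0.7, -0.1]
```

The third sample shows the intended pressure: the text was asked to be positive but the classifier leans negative, so it receives a negative reward.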

In [None]:
generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
    "max_new_tokens": 30
}

In [None]:
for epoch in range(2):
    for batch in ppo_trainer.dataloader:
        #### prepend a random control token
        tasks = choices(ctrl, k=len(batch["input_ids"]))
        query_tensors = []
        for task, input_ids in zip(tasks, batch["input_ids"]):
            query_tensors.append(torch.cat((ctrl_tokens[task], input_ids)))

        #### get response from gpt2
        response_tensors = []
        for query in query_tensors:
            response = ppo_trainer.generate(query, return_prompt=True, **generation_kwargs)
            # Drop the prompt tokens and keep only the newly generated ones
            response_tensors.append(response.squeeze()[len(query):])
        responses = [tokenizer.decode(r.squeeze()) for r in response_tensors]

        #### sentiment analysis
        texts = [query + resp for query, resp in zip(batch["query"], responses)]
        print(texts)
        logits = get_logits(texts)
        rewards = logit_to_reward(logits, tasks)
        torch.cuda.empty_cache()  # free cached GPU memory before the PPO step

        #### Run PPO training
        print("Query: ", type(query_tensors), type(query_tensors[1]))
        print("Response: ", type(response_tensors), type(response_tensors[1]))
        print("Rewards: ", type(rewards), type(rewards[1]))
        stats = ppo_trainer.step(query_tensors, response_tensors, rewards)


In [None]:
ppo_trainer.save_pretrained("/content/drive/MyDrive/Masterthesis/Models/german_gpt2_sft_2_epoch_rl_2epochs")



In [None]:
model_ref.save_pretrained("/content/drive/MyDrive/Masterthesis/Models/german_gpt2_sft_2_rl_2epochs_ref")

In [None]:
model.save_pretrained("/content/drive/MyDrive/Masterthesis/Models/german_gpt2_sft_2_rl_2epochs_base")

## Testing

In [None]:
from transformers import AutoModelForCausalLM  # AutoModelWithLMHead is deprecated

tokenizer_test = AutoTokenizer.from_pretrained("/content/drive/MyDrive/Masterthesis/Models/german_gpt2_sft_2_epoch_rl_2epochs")

model_test = AutoModelForCausalLM.from_pretrained("/content/drive/MyDrive/Masterthesis/Models/german_gpt2_sft_2_epoch_rl_2epochs")

Some weights of the model checkpoint at /content/drive/MyDrive/Masterthesis/Models/german_gpt2_sft_2_epoch_rl_2epochs were not used when initializing GPT2LMHeadModel: ['v_head.summary.bias', 'v_head.summary.weight']
- This IS expected if you are initializing GPT2LMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing GPT2LMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
tokenizer_test.pad_token = tokenizer_test.eos_token

In [None]:
generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 0.5,
    "do_sample": True,
    "pad_token_id": tokenizer_test.eos_token_id,
    "max_new_tokens": 30
}

In [None]:
prompt_sent = "[negative]"
input_ids = tokenizer_test.encode(prompt_sent, return_tensors="pt", truncation=True, max_length=1024)

In [None]:
output = model_test.generate(input_ids, **generation_kwargs)

generated_text = tokenizer_test.decode(output[0])
print("Generated Text:")
print(generated_text)

Generated Text:
[negative] Nie wieder! Nicht zu empfehlen! Das Hotel ist in einem sehr schlechten Zustand. Das Personal ist unfreundlich und die Zimmer sind nicht sauber. Das Essen
