### Spoiler:

Small LLMs are poor at generating new phrases similar to given examples, and prompting them is time-consuming, so I decided to use a good old T5 paraphraser instead. I created an initial set of sentences that a person might occasionally say in conversation (e.g. an answer to "How are you?").

Some examples were generated with ChatGPT, some by me =)

In [1]:
import json

from mrq import PROJECT_PATHS

with open(PROJECT_PATHS.init_data / "aug_init_data.json") as file:
    shots = json.load(file)

LLM_FLAG = False  # set to True to experiment with the small LLM

In [2]:
if LLM_FLAG:
    import numpy as np
    from transformers import GPTNeoXForCausalLM, AutoTokenizer

    gen_model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-410m")
    gen_tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-410m")

    # This model is not fine-tuned to follow instructions,
    # so let's use it as a plain text generator with few-shot prompts
    # built from the curated list of previously written replies
    n_shots = 9

    pre = f"{n_shots+1} examples of replies to 'How are you doing':\n\n"

    # Number the sampled shots "1. ", "2. ", ... and leave the last item open
    sampled = np.random.choice(shots, n_shots, replace=False)
    conv = "\n".join(f"{i}. {x}" for i, x in enumerate(sampled, 1))
    conv += f"\n{n_shots+1}. "

    text = pre+conv
    # text = conv

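    tokens = gen_model.generate(
        **gen_tokenizer(text, return_tensors="pt"),
        top_k=15,  # sample only from the 15 most likely tokens
        # top_p=0.9,  # nucleus sampling alternative; must be a probability in (0, 1]
        # num_beams=5,
        temperature=0.3,
        do_sample=True,
        num_return_sequences=3,
        max_new_tokens=35,
        repetition_penalty=5.0,
        early_stopping=True,  # only has an effect with beam search
    )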

    for x in tokens:
        print(gen_tokenizer.decode(x))
        print("-" * 100)
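
The prompt construction in the cell above can be isolated into a small helper, which makes it easier to check the exact text the model sees. This is a minimal sketch, not part of `mrq`: the function name `build_few_shot_prompt`, the `seed` parameter, and the placeholder replies are all assumptions made for illustration, and it uses the standard `random` module instead of NumPy.

```python
import random


def build_few_shot_prompt(shots, n_shots=9, question="How are you doing", seed=None):
    """Build a numbered few-shot prompt that ends with an open item to complete."""
    rng = random.Random(seed)
    # Sample without replacement so no example appears twice in the prompt.
    chosen = rng.sample(list(shots), n_shots)
    header = f"{n_shots + 1} examples of replies to '{question}':\n\n"
    body = "\n".join(f"{i}. {shot}" for i, shot in enumerate(chosen, 1))
    # Leave the last item open (e.g. "10. ") so generation continues the list.
    return f"{header}{body}\n{n_shots + 1}. "


# Placeholder data, not the real seed set from aug_init_data.json.
demo_shots = [f"Reply number {k}" for k in range(12)]
prompt = build_few_shot_prompt(demo_shots, n_shots=3, seed=0)
print(prompt)
```

Ending the prompt on an open list item is what nudges a base (non-instruction-tuned) model to produce one more reply in the same style rather than unrelated text.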

### T5 for augmentation

In [3]:
import json

from mrq import PROJECT_PATHS
from mrq.aug import Paraphraser

with open(PROJECT_PATHS.init_data / "aug_init_data.json") as file:
    shots = json.load(file)

device = "cpu"
paraphraser = Paraphraser(device=device)

text = "paraphrase: " + shots[0] + " </s>"
paraphraser(text)

['I felt great in the last year, enjoying some much-needed downtime, and finding new hobbies to explore.',
 'I am currently excellently enjoying some much needed time off and exploring new hobbies.',
 "I've felt in my last week well: enjoying some much-needed recovery times and exploring new hobbies (Turning Sprockets! )!",
 "I've been amazing lately having some much-needed downtime and exploring new hobbyry.",
 'Recent times have been great, enjoying some much-needed down time and exploring new skills.']

In [4]:
import os

from mrq.aug import Augmentator, augment_init_data
from mrq.logger import get_logger

os.environ["TOKENIZERS_PARALLELISM"] = "false"

log = get_logger(__name__)

In [6]:
aug = augment_init_data(
    "aug_init_data.json",
    "examples_for_augmentation.json",
    paraphraser=paraphraser,
    return_=True,
)

2023-05-28 17:34:16,665 - mrq.aug - INFO - Initial number of examples: 23
2023-05-28 17:34:16,665 - mrq.aug - INFO - Result number (before set) of examples: 138
2023-05-28 17:34:16,666 - mrq.aug - INFO - Epoch 0 of 1


  0%|          | 0/23 [00:00<?, ?it/s]
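
The logged counts are consistent with each of the 23 seed sentences contributing itself plus five paraphrases (23 × 6 = 138) before duplicates are dropped. The sketch below illustrates that expand-then-deduplicate step under this assumption; it is not the actual `augment_init_data` implementation, and `expand_with_paraphrases` and `fake_paraphraser` are hypothetical names standing in for the real model call.

```python
def expand_with_paraphrases(seeds, paraphrase_fn):
    """Return (all_candidates, unique_candidates) for one augmentation pass."""
    candidates = []
    for sentence in seeds:
        candidates.append(sentence)                 # keep the original
        candidates.extend(paraphrase_fn(sentence))  # add its paraphrases
    # dict.fromkeys removes exact duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(candidates))
    return candidates, unique


def fake_paraphraser(sentence, n=5):
    # Stand-in for a real model: returns n strings, some of them duplicates.
    return [f"{sentence} (v{k % 3})" for k in range(n)]


seeds = [f"seed {i}" for i in range(23)]
all_candidates, unique = expand_with_paraphrases(seeds, fake_paraphraser)
print(len(all_candidates), len(unique))  # 138 92
```

With a real paraphraser the number of surviving unique sentences depends on how often the model repeats itself, so only the pre-deduplication count is predictable.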

In [7]:
augmenter = Augmentator(aug)

In [9]:
augmenter("My very relevant sentence. And another one")

['And another one',
 'Recently I have kept busy with some interesting projects, feeling motivated.',
 'Like usual. But because this is far too important, I am trying my best to try to balance this.',
 'My very relevant sentence.',
 'So this guy still lacked ambition to play and just finished up this lovely book.']
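
Judging only from the output above, the augmenter appears to split the input into sentences and shuffle them together with filler replies drawn from the augmented pool. The following is a rough sketch of that behavior, not the real `Augmentator`: the function `mix_with_fillers`, the `n_fillers` and `seed` parameters, the regex-based sentence splitter, and the sample pool are all assumptions.

```python
import random
import re


def mix_with_fillers(text, pool, n_fillers=3, seed=None):
    """Split text into sentences and shuffle them together with random fillers."""
    rng = random.Random(seed)
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Draw distinct fillers; never ask for more than the pool holds.
    fillers = rng.sample(list(pool), min(n_fillers, len(pool)))
    mixed = sentences + fillers
    rng.shuffle(mixed)
    return mixed


# Placeholder pool, not the real augmented data.
pool = [
    "I am fine, thanks.",
    "Busy week, but a good one.",
    "Could be worse!",
    "Same as usual.",
]
result = mix_with_fillers("My very relevant sentence. And another one", pool, seed=1)
print(result)
```

A regex splitter is enough for short conversational inputs like this one; abbreviations such as "e.g." would need a proper sentence tokenizer.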