## Import Libraries

In [3]:
import os
os.environ["KERAS_BACKEND"] = "torch" # you can also use "jax" or "tensorflow"
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "1.00" # avoids memory fragmentation; only has an effect on the JAX backend

import keras
import keras_nlp

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
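
Keras reads `KERAS_BACKEND` when it is first imported, so the variable has to be set before `import keras`. The snippet below is a minimal sketch of that ordering. The helper `select_backend` and the `VALID_BACKENDS` tuple are hypothetical names introduced for illustration and are not part of Keras.

```python
import os

# Backend names accepted by Keras 3 (listed here as an assumption of this sketch).
VALID_BACKENDS = ("jax", "tensorflow", "torch")

def select_backend(name: str) -> str:
    """Hypothetical helper: validate a backend name and export it for Keras."""
    if name not in VALID_BACKENDS:
        raise ValueError(f"unknown backend {name!r}; expected one of {VALID_BACKENDS}")
    os.environ["KERAS_BACKEND"] = name
    if name == "jax":
        # This XLA setting is only read by JAX; other backends ignore it.
        os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "1.00"
    return name

# Call this before the first `import keras` in the process.
backend = select_backend("torch")
print(backend)
```

Changing the variable after Keras has been imported has no effect on the running session; restart the kernel to switch backends.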

## Configuration

In [4]:
class CFG:
    seed = 42
    dataset_path = "/kaggle/input/llm-prompt-recovery"
    preset = "7b-it-quant" # name of pretrained Gemma
    sequence_length = 512 # max size of input sequence for training
    batch_size = 1 # size of the input batch in training
    epochs = 1 # number of epochs to train

## Reproducibility

Set the random seed so that each run produces similar results.

In [5]:
keras.utils.set_random_seed(CFG.seed)
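
`keras.utils.set_random_seed` seeds Python's `random` module, NumPy, and the active backend in one call. The effect of a fixed seed can be illustrated with the standard library alone. This is only a sketch of the principle, and the function `draw` is a hypothetical name.

```python
import random

def draw(seed: int, n: int = 5) -> list:
    """Seed the generator, then draw n pseudo-random integers."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]

first = draw(42)
second = draw(42)

# Re-seeding with the same value replays the same sequence.
print(first == second)
```

Seeding fixes the pseudo-random streams, but some GPU operations can still be non-deterministic, which is why runs are described as similar rather than identical.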

## Data

No training data is provided in this competition, so any openly available dataset can be used. This notebook uses two external datasets in which the Gemma 7B model transforms texts according to prompts.

Data format:

The datasets include the following columns:

* original_text: Input text/essay that needs to be transformed.
* rewrite_prompt: Prompt/Instruction that was used in the Gemma LM to transform original_text. This is also our target for this competition.
* rewritten_text: Output text that was generated by the Gemma model.
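
To make the roles of the three columns concrete, the sketch below joins one row into a single training string, with `rewrite_prompt` placed last as the target. The template wording, the `TEMPLATE` constant, and the `build_example` function are assumptions made for illustration and are not necessarily the format used later in this notebook.

```python
# Hypothetical template: the two texts come first, the target prompt comes last.
TEMPLATE = (
    "Original text:\n{original_text}\n\n"
    "Rewritten text:\n{rewritten_text}\n\n"
    "Rewrite prompt:\n{rewrite_prompt}"
)

def build_example(row: dict) -> str:
    """Format one dataset row (a dict keyed by column name) into a training string."""
    return TEMPLATE.format(
        original_text=row["original_text"],
        rewritten_text=row["rewritten_text"],
        rewrite_prompt=row["rewrite_prompt"],
    )

# Toy row made up for this example.
row = {
    "original_text": "The cat sat on the mat.",
    "rewrite_prompt": "Rewrite this as a news headline.",
    "rewritten_text": "Local Cat Claims Mat In Bold Move",
}
example = build_example(row)
print(example)
```

At inference time the same template would be filled without the final field, leaving the model to generate the prompt.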

In [7]:
# `LLM Prompt Recovery - Synthetic Datastore dataset` by @dschettler8845
df1 = pd.read_csv("/kaggle/input/llm-prompt-recovery-synthetic-datastore/gemma1000_w7b.csv")
df1 = df1[["original_text", "rewrite_prompt", "gemma_7b_rewritten_text_temp0"]]
df1 = df1.rename(columns={"gemma_7b_rewritten_text_temp0":"rewritten_text"})
df1.head(2)

# `3000 Rewritten texts - Prompt recovery Challenge` by @dipamc77
df2 = pd.read_csv("/kaggle/input/3000-rewritten-texts-prompt-recovery-challenge/prompts_0_500_wiki_first_para_3000.csv")
df2.head(2)

# Merge all datasets
df = pd.concat([df1, df2], axis=0)
df = df.sample(2000).reset_index(drop=True) # use only 2k samples to reduce training time
df.head(5)

Unnamed: 0,original_text,rewrite_prompt,rewritten_text
0,Story highlights New findings suggest sperm al...,Turn this into a programmer's code.,```python\n# Code to summarize the text\n\nspe...
1,Glasgow's Needy is a non-registered charity or...,Make this a formal apology letter to a customer.,"[Your Name]\n[Your Address]\nGlasgow, Scotland..."
2,"Pune: In a bizarre case, a parrot accused of ""...",Convert this into a technique to be used.,**Technique:**\n\n**Step 1: Gather evidence.**...
3,Berrya is a genus of evergreen trees with fibr...,Convert the text into the layout of a classic ...,## Level Layout: Berrya Forest\n\n**Background...
4,The Libertarian Party of Oregon is a political...,Recast it as the backstory for a mysterious an...,In the shadows of the Evergreen State of Orego...
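
The merge-and-sample step can be checked on toy data. The sketch below uses two small made-up frames, not the competition datasets, to show that `pd.concat` keeps each frame's original index, which is why `reset_index(drop=True)` follows the sampling. Passing `random_state` explicitly is an optional variation that pins the sample independently of the global seed.

```python
import pandas as pd

cols = ["original_text", "rewrite_prompt", "rewritten_text"]

# Two toy frames standing in for df1 and df2.
a = pd.DataFrame([["t1", "p1", "r1"], ["t2", "p2", "r2"]], columns=cols)
b = pd.DataFrame([["t3", "p3", "r3"], ["t4", "p4", "r4"], ["t5", "p5", "r5"]], columns=cols)

# Row-wise concatenation keeps the per-frame indices, so labels repeat.
merged = pd.concat([a, b], axis=0)
print(merged.index.tolist())  # [0, 1, 0, 1, 2]

# Sample a subset with a fixed seed, then rebuild a clean 0..n-1 index.
sampled = merged.sample(n=3, random_state=42).reset_index(drop=True)
again = merged.sample(n=3, random_state=42).reset_index(drop=True)
print(sampled.equals(again))
```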
