# Instructions

This colab contains the code used to run bulk inference on PaLM models for the paper "Personality Traits in Large Language Models" (https://arxiv.org/pdf/2307.00184). The code assumes that all data produced and consumed in the colab lives on a local filesystem, either on a cloud instance running a Jupyter notebook (such as Google Colab) or on a desktop machine. The file I/O operations can be replaced to use any other file management solution.

The colab is composed of 3 steps:
* Step 1: Read the input data structure for the experiment, the Admin Session (from Drive), convert it into TFRecord format, and store it back in the filesystem (so that this step can be skipped next time). Admin sessions are defined here: https://github.com/google-research/google-research/blob/master/psyborgs/survey_bench_lib.py#L70 and are part of the PsyBORGS open source framework (https://github.com/google-research/google-research/tree/master/psyborgs). The resulting TFRecord is the input to the bulk LLM inference script (Step 2).
* Step 2: Run the bulk LLM inference script on (separately started) prediction servers using PaLM. The script needs to be configured with the name of the input TFRecord (the output of Step 1) and the name of the output TFRecord (the input to Step 3).
* Step 3: For PaLM models, the output is produced in TFRecord format. Read the TFRecord output of the bulk inference script, convert it into pickle format, and store it back in the filesystem in use. The output must be in .pkl format to be used by the Personality Analysis pipeline.

To run this colab:
1. Connect to an appropriate runtime. (For instance, if running the bulk inference directly from the colab, connect to a GPU kernel.)
2. Check experiment parameters below.
3. Run Step 1 from above.
4. Ensure the TFRecord file exists in the intended location.
5. Update the bulk inference script with the input and output filenames, then run it.
6. Once the bulk inference completes, do a consistency check on the TFRecord output.
7. Run Step 3.
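
The checks in steps 4 and 6 can start with something as simple as confirming that each expected file is present and non-empty before moving on. A minimal sketch, assuming a hypothetical helper `check_file_ready` (not part of Psyborgs) and the sample paths from the experiment parameters:

```python
from pathlib import Path


def check_file_ready(path: str) -> bool:
    """Returns True if `path` is an existing, non-empty regular file.

    Hypothetical helper for illustration; not part of the Psyborgs codebase.
    """
    file_path = Path(path)
    return file_path.is_file() and file_path.stat().st_size > 0


# Paths below are the sample defaults; adjust them to your experiment.
for candidate in ('sample_file.tfrecord', 'sample_llm_output.tfrecord'):
    status = 'ready' if check_file_ready(candidate) else 'missing or empty'
    print(f'{candidate}: {status}')
```

This only confirms that the files were written. Comparing the number of records in the output against the input would be a stronger consistency check.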

NOTE: Store and run this notebook from a location where the Psyborgs codebase package (personality_in_llms.psyborgs) is stored.

# Setup

The repo containing this notebook includes a version of the Psyborgs codebase that the notebook needs to run. If a more recent version is needed, it can be fetched from https://github.com/google-research/google-research/tree/master/psyborgs.

In [None]:
#@markdown Run this cell to install the dependencies needed to run Psyborgs.
#@markdown The dependencies are in a requirements.txt file in the Psyborgs repo.
%pip install -r psyborgs/requirements.txt

In [None]:
#@title Load Libraries
#@markdown Run this cell to import dependencies
import enum
import json
import matplotlib.pyplot as plt
import pandas as pd
import pickle
from psyborgs import survey_bench_lib
import dacite
import tensorflow as tf

In [None]:
#@title Experiment Parameters  { run: "auto" }
#@markdown Run this cell to set up the file locations.

#@markdown `admin_session_file_path` is the file path of the input admin session on which bulk inference is run.
admin_session_file_path = 'admin_sessions/sample_admin_session.json'  # @param {"type":"string"}
#@markdown `tfrecord_file_path` is the file path of the TFRecord file written after converting the admin session.
#@markdown It contains the admin session unrolled into individual inference prompts, and it is
#@markdown the input to the bulk LLM inference pipeline.
tfrecord_file_path = 'sample_file.tfrecord'  #@param {type:'string'}
#@markdown `llm_output_tfrecord_file_path` is the file path where the output of the LLM bulk inference is stored.
llm_output_tfrecord_file_path = 'sample_llm_output.tfrecord'  #@param {type:"string"}
#@markdown `output_pkl_filepath` is the path to the pkl file that contains the LLM inference output dataframe.
output_pkl_filepath = 'sample_llm_output.pkl'  #@param {type:"string"}
#@markdown #### Below are settings relevant to multi-sharded runs
#@markdown `max_num_rows_per_shard` is the maximum number of payload specs per shard of the final tfrecord created from the admin session.
max_num_rows_per_shard = 4800000  #@param {type:"integer"}
num_shards = 1  #@param {type:"number"}
#@markdown `model_id` is a model identifier needed by the Psyborgs code. More info here: psyborgs/survey_bench_lib.py:L63
model_id = 'PaLM'  #@param {type:"string"}
use_custom_model = True
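
As a rough illustration of the multi-shard settings: the number of shards needed for a given number of payload specs is the row count divided by `max_num_rows_per_shard`, rounded up, and each shard file gets a zero-padded five-digit suffix (as produced by `format_shard` in the helper functions). The sketch below uses a hypothetical `expected_shard_names` helper and made-up row counts:

```python
import math
from typing import List


def expected_shard_names(base_path: str, total_rows: int,
                         max_rows_per_shard: int) -> List[str]:
    """Lists shard filenames of the form '<base_path>_<5-digit index>'.

    Hypothetical helper for illustration only.
    """
    num_shards = math.ceil(total_rows / max_rows_per_shard)
    return [f'{base_path}_{str(idx).zfill(5)}' for idx in range(num_shards)]


# Made-up example: 10 rows with at most 4 rows per shard need 3 shards.
print(expected_shard_names('sample_file.tfrecord', 10, 4))
```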

### Helper functions

In [None]:
#@markdown Run this cell to set up the utility functions needed for this colab

def format_shard(shard_idx: int):
  return str(shard_idx).zfill(5)


def load_admin_session(admin_session_filename: str):
  with open(admin_session_filename, 'r') as admin_session_file:
    admin_session_dict = json.load(admin_session_file)

  # dacite documentation on casting input values to objects can be found here:
  # https://github.com/konradhalas/dacite#casting
  session = dacite.from_dict(data_class=survey_bench_lib.AdministrationSession,
                             data=admin_session_dict,
                             config=dacite.Config(cast=[enum.Enum]))

  return session

def generate_payload_df(input_admin_session: survey_bench_lib.AdministrationSession,
                        input_model_id: str) -> pd.DataFrame:
  """Returns sorted df of prompts, continuations, and info to be scored."""
  # accumulate payloads in a list to be sent to LLM endpoints in parallel
  payload_list = []

  # iterate through all measures and scale combinations
  for measure_iteration in survey_bench_lib.measure_generator(
      input_admin_session):

    # iterate through all prompt combinations
    for prompt_iteration in survey_bench_lib.prompt_generator(
        measure_iteration, input_admin_session):

      # iterate through all continuation combinations
      for continuation_iteration in survey_bench_lib.continuation_generator(
          measure_iteration, input_admin_session):

        # generate payload spec with null scores and set model_id
        payload_spec = survey_bench_lib.generate_payload_spec(
            measure_iteration, prompt_iteration, continuation_iteration, 0,
            input_model_id)
        payload_list.append(payload_spec)

  # dataframe is sorted by prompt, continuation
  return pd.DataFrame(payload_list).sort_values(
      ['prompt_text', 'continuation_text'])

def generate_payload_row(input_admin_session: survey_bench_lib.AdministrationSession,
                         input_model_id: str) -> survey_bench_lib.PayloadSpec:
  """Yields one payload spec per measure, prompt, and continuation combination."""
  # iterate through all measures and scale combinations
  for measure_iteration in survey_bench_lib.measure_generator(
      input_admin_session):

    # iterate through all prompt combinations
    for prompt_iteration in survey_bench_lib.prompt_generator(
        measure_iteration, input_admin_session):

      # iterate through all continuation combinations
      for continuation_iteration in survey_bench_lib.continuation_generator(
          measure_iteration, input_admin_session):

        # generate payload spec with null scores and set model_id
        yield survey_bench_lib.generate_payload_spec(
            measure_iteration, prompt_iteration, continuation_iteration, 0,
            input_model_id)


def write_df_as_tfrecord(input_payload_df: pd.DataFrame, shard_idx: int = 0):
  """Writes the input dataframe as a TFRecord file."""
  # Define the TFRecord filename
  tfrecord_filename = f'{tfrecord_file_path}_{format_shard(shard_idx)}'

  # Create a TFRecord writer
  with tf.io.TFRecordWriter(tfrecord_filename) as w:
    # Loop over the dataframe and serialize each row
    for r in input_payload_df.itertuples(index=False):
      # Create a feature dictionary from the row data
      feature_map = {}
      for col, val in zip(input_payload_df.columns, r):
        if input_payload_df[col].dtype == 'int64':
          feature_map[col] = tf.train.Feature(
              int64_list=tf.train.Int64List(value=[val])
          )
        elif input_payload_df[col].dtype == 'float64':
          feature_map[col] = tf.train.Feature(
              float_list=tf.train.FloatList(value=[val])
          )
        else:
          feature_map[col] = tf.train.Feature(
              bytes_list=tf.train.BytesList(value=[val.strip().encode('utf-8')])
          )
      ex = tf.train.Example(features=tf.train.Features(feature=feature_map))
      # Serialize the example
      serial_example = ex.SerializeToString()
      # Write the serialized example to the TFRecord file
      w.write(serial_example)


def write_payload_df(input_admin_session: survey_bench_lib.AdministrationSession,
                     input_model_id: str) -> int:
  """Writes payload specs as sharded TFRecord files; returns the shard count."""
  # Create the generator once so that each shard resumes where the previous
  # one stopped instead of restarting from the first payload spec.
  payload_rows = generate_payload_row(input_admin_session, input_model_id)

  shards = 0
  while True:
    # accumulate at most max_num_rows_per_shard payload specs for this shard
    payload_list = []
    num_rows = 0
    for payload_spec in payload_rows:
      payload_list.append(payload_spec)
      num_rows += 1
      if num_rows >= max_num_rows_per_shard: break
    if not payload_list: break
    # dataframe is sorted by prompt, continuation
    input_payload_df = pd.DataFrame(payload_list).sort_values(
        ['prompt_text', 'continuation_text'])
    write_df_as_tfrecord(input_payload_df, shards)
    shards += 1
  return shards

# Define a parsing function to extract the features
def parse_example(ex):
  return tf.io.parse_single_example(ex, feature_description)
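
The sharded write in `write_payload_df` is meant to consume payload specs from a single generator in batches of at most `max_num_rows_per_shard` rows. A minimal, library-free sketch of that batching pattern, using `itertools.islice` and a stand-in range in place of real payload specs (the name `batched_rows` is hypothetical):

```python
import itertools
from typing import Iterable, Iterator, List


def batched_rows(rows: Iterable[int], max_rows: int) -> Iterator[List[int]]:
    """Yields consecutive lists of at most `max_rows` items from `rows`."""
    iterator = iter(rows)  # one shared iterator, so batches never restart
    while True:
        batch = list(itertools.islice(iterator, max_rows))
        if not batch:
            return
        yield batch


# Stand-in for payload specs: seven rows split into shards of up to three.
for shard_idx, batch in enumerate(batched_rows(range(7), 3)):
    print(str(shard_idx).zfill(5), batch)
```

The key design point is that the iterator is created once. Re-creating it for every shard would start again from the first row each time.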

In [None]:
#@markdown Run this cell to set up the feature column names, so that the inferred examples in the TFRecord files can be correctly extracted and translated into a dataframe.
# Define a default feature description dictionary
feature_description = {
    'continuation_text': tf.io.FixedLenFeature([], tf.string),
    'item_id': tf.io.FixedLenFeature([], tf.string),
    'item_postamble_id': tf.io.FixedLenFeature([], tf.string),
    'item_preamble_id': tf.io.FixedLenFeature([], tf.string),
    'measure_id': tf.io.FixedLenFeature([], tf.string),
    'measure_name': tf.io.FixedLenFeature([], tf.string),
    'model_id': tf.io.FixedLenFeature([], tf.string),
    # 'model_output': tf.io.FixedLenFeature([], tf.string),
    'model_output_score': tf.io.FixedLenFeature([], tf.float32),
    'prompt_text': tf.io.FixedLenFeature([], tf.string),
    'response_choice': tf.io.FixedLenFeature([], tf.string),
    'response_choice_postamble_id': tf.io.FixedLenFeature([], tf.string),
    'response_scale_id': tf.io.FixedLenFeature([], tf.string),
    'response_value': tf.io.FixedLenFeature([], tf.int64),
    'scale_id': tf.io.FixedLenFeature([], tf.string),
    'score': tf.io.FixedLenFeature([], tf.int64),
}

# Define the relevant feature names for deduplication
dedup_feature_names = [
    'item_preamble_id',
    'item_postamble_id',
    'response_scale_id',
    'response_choice_postamble_id',
    'model_id']


# Step 1) Admin Session -> TFRecord
This step also adds some parameters needed for the LLM inference pipelines to directly ingest and work with the input files.

In [None]:
# @title Convert admin session to dataframe {"run":"auto"}
#@markdown Load admin session from json and generate payload_spec dataframe.
#@markdown This dataframe is used for rest of the code below.

admin_session = load_admin_session(admin_session_file_path)
payload_df = generate_payload_df(admin_session, model_id)

### [For PaLM models only] Convert Dataframe to TFRecord and write to filesystem
Serialize each dataframe row as a `tf.train.Example` and write the TFRecord file to a notebook-local location.


In [None]:
# Create a TFRecord writer
with tf.io.TFRecordWriter(tfrecord_file_path) as writer:
  # Loop over the dataframe and serialize each row
  for row in payload_df.itertuples(index=False):
    # Create a feature dictionary from the row data
    feature_dict = {}
    for column, value in zip(payload_df.columns, row):
      if payload_df[column].dtype == 'int64':
        feature_dict[column] = tf.train.Feature(
            int64_list=tf.train.Int64List(value=[value])
        )
      elif payload_df[column].dtype == 'float64':
        feature_dict[column] = tf.train.Feature(
            float_list=tf.train.FloatList(value=[value])
        )
      else:
        feature_dict[column] = tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[value.strip().encode('utf-8')])
        )
    example = tf.train.Example(features=tf.train.Features(feature=feature_dict))
    # Serialize the example
    serialized_example = example.SerializeToString()
    # Write the serialized example to the TFRecord file
    writer.write(serialized_example)
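
The loop above picks a `tf.train.Feature` list type from each column's dtype: `int64` columns become an `int64_list`, `float64` columns a `float_list`, and everything else is stripped and UTF-8 encoded into a `bytes_list`. The dispatch can be sketched without TensorFlow as follows; `feature_kind` is a hypothetical stand-in that returns the list type's name and the values that would be stored:

```python
from typing import Any, List, Tuple


def feature_kind(dtype_name: str, value: Any) -> Tuple[str, List[Any]]:
    """Mirrors the dtype dispatch; returns (list type name, stored values)."""
    if dtype_name == 'int64':
        return 'int64_list', [int(value)]
    if dtype_name == 'float64':
        return 'float_list', [float(value)]
    # Any other dtype is treated as text: strip whitespace, then encode.
    # (Assumption: str() is applied here for safety; the notebook calls
    # value.strip() directly and so expects string values.)
    return 'bytes_list', [str(value).strip().encode('utf-8')]


print(feature_kind('int64', 3))
print(feature_kind('float64', 0.5))
print(feature_kind('object', '  I am the life of the party. '))
```

One consequence worth keeping in mind: `tf.train.FloatList` stores 32-bit floats, so `float64` values lose precision when written, which matches `model_output_score` being read back as `tf.float32`.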

# Step 2) Run Bulk Inference Script

This step is done outside this colab: run the bulk inference script from the CLI against your model of choice.

# Step 3) [PaLM models] TFRecord -> Pickle file

## Read Input TFRecords

In [None]:
#@markdown Run this cell to read the TFRecords sharded files as a dataset
dataset = tf.data.Dataset.list_files(llm_output_tfrecord_file_path, shuffle=False)
dataset = dataset.flat_map(tf.data.TFRecordDataset)

# Parse the dataset using the parsing function
parsed_dataset = dataset.map(parse_example)

# Convert the parsed dataset to a list of dictionaries
list_of_dicts = []
for example in parsed_dataset:
  example_dict = {}
  for key in example.keys():
    example_dict[key] = example[key].numpy()
  list_of_dicts.append(example_dict)

In [None]:
#@markdown Run this cell to convert the dataset to a Pandas DataFrame
df = pd.DataFrame(list_of_dicts)

## Scoring mode model output column creation
Run this cell only if running in scoring mode.

In [None]:
df['model_output'] = df['continuation_text'].copy()

## Generative text processing
Run the cells below only when running experiments that generate text (e.g., for downstream tasks), not for LLM survey responses.

In [None]:
#@markdown Decode the byte-string columns into Python strings (numeric columns are left as-is).
string_cols = list(feature_description.keys())
string_cols.remove('score')
string_cols.remove('model_output_score')
string_cols.remove('response_value')
df[string_cols] = df[string_cols].applymap(lambda x: x.decode())

In [None]:
#@markdown Group rows by the deduplication features, keeping the first value of each column and joining the model outputs of each group with '<SEP> '.
groupings = {k: 'first' for k in feature_description.keys()}
groupings['model_output'] = lambda x: '<SEP> '.join(x)
for dedup_feature_name in dedup_feature_names:
  del groupings[dedup_feature_name]
grouped_df = df.groupby(dedup_feature_names).agg(groupings).reset_index()
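
The aggregation above keeps the first value of every column within each group and joins the group's `model_output` strings with `'<SEP> '`. A small pure-Python sketch of the joining step on made-up records, grouping on a single key instead of the full `dedup_feature_names` list:

```python
from collections import defaultdict
from typing import Dict, List

# Made-up records; real rows carry all columns in `feature_description`.
records = [
    {'item_preamble_id': 'p1', 'model_output': 'first sample'},
    {'item_preamble_id': 'p1', 'model_output': 'second sample'},
    {'item_preamble_id': 'p2', 'model_output': 'only sample'},
]

outputs_by_key: Dict[str, List[str]] = defaultdict(list)
for record in records:
    outputs_by_key[record['item_preamble_id']].append(record['model_output'])

# Join each group's outputs with the same separator the notebook uses.
joined = {key: '<SEP> '.join(texts) for key, texts in outputs_by_key.items()}
print(joined)
```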

In [None]:
#@markdown [Optional] Plot histogram of model output lengths (in whitespace-separated words).

grouped_df['model_output_len'] = grouped_df['model_output'].apply(lambda x: len(x.split()))
plt.hist(grouped_df['model_output_len'], bins='auto', edgecolor='black')
plt.xlabel('Model Output Length (words)')
plt.ylabel('Frequency')
plt.title('Histogram of Model Output Lengths')
plt.show()

In [None]:
#@markdown Prep dataframe to be written.
df = grouped_df

## Convert to .pkl and output

In [None]:
#@title Convert to .pkl and output
#@markdown Run this cell to pickle the dataframe and write it to `output_pkl_filepath`.
with open(output_pkl_filepath, 'wb') as f:
  pickle.dump(df, f)
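
Before handing the file to the analysis pipeline, it can be worth reading the pickle back and checking that it round-trips. A minimal sketch using only the standard library, with a made-up list of records standing in for the dataframe (for a real dataframe, `pd.read_pickle` can load the file instead):

```python
import os
import pickle
import tempfile

# Made-up stand-in for the output dataframe.
records = [{'item_id': 'i1', 'model_output': 'sample text'}]

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, 'sample_llm_output.pkl')
    with open(pkl_path, 'wb') as f:
        pickle.dump(records, f)
    with open(pkl_path, 'rb') as f:
        restored = pickle.load(f)

# True when the data read back equals the data written.
print(restored == records)
```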