# Perplexity ratio data length binning

This notebook designs a strategy for loading text fragment data that carries the perplexity ratio score feature generated by the scoring algorithm, binning it by fragment length, and preparing it for feature engineering. The data is handled as Pandas dataframes and stored in HDF5. Fragments are placed into overlapping length bins so that feature engineering and classifier training can be conducted separately for different length regimes. This strategy is based on early observations of classifier performance on short fragments, long fragments and un-binned fragments.

## 1. Run set-up

In [1]:
# Change working directory to parent so we can import as we would from __main__.py
print(f'Working directory: ', end = '')
%cd ..
print()

import h5py
import numpy as np
import pandas as pd
import sklearn as sk
from sklearn.preprocessing import LabelEncoder

import configuration as config

print(f'H5py: {h5py.__version__}')
print(f'Numpy: {np.__version__}')
print(f'Pandas: {pd.__version__}')
print(f'Sklearn: {sk.__version__}')

Working directory: /mnt/arkk/llm_detector/classifier

H5py: 3.11.0
Numpy: 1.24.4
Pandas: 2.0.3
Sklearn: 1.3.2


In [2]:
# The dataset we want to bin - omit the file extension, it will be 
# added appropriately for the input and output files
dataset_name = 'falcon-7b_scores_v2_10-300_words'

# Construct input and output file paths
input_file = f'{config.HANS_DATA_PATH}/{dataset_name}.json'
output_file = f'{config.DATA_PATH}/{dataset_name}.h5'

# # Bins for 10-1000 word dataset
# bins = {
#     'combined': [0, np.inf],
#     'bin_100': [1, 100],
#     'bin_150': [51, 150],
#     'bin_200': [101, 200],
#     'bin_250': [151, 250],
#     'bin_300': [201, 300],
#     'bin_350': [251, 350],
#     'bin_400': [301, 400],
#     'bin_450': [351, 450],
#     'bin_500': [401, 500],
#     'bin_600': [451, 600]
# }

# Bins for 10-300 word dataset
bins = {
    'combined': [0, np.inf],
    'bin_50': [1, 50],
    'bin_75': [26, 75],
    'bin_100': [51, 100],
    'bin_125': [76, 125],
    'bin_150': [101, 150],
    'bin_175': [126, 175],
    'bin_200': [151, 200],
    'bin_225': [176, 225],
    'bin_250': [201, 250],
    'bin_275': [226, 275],
    'bin_300': [251, 300]
}
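
Because adjacent bins overlap by half their width, most fragments land in two length bins plus `combined`. The following minimal sketch illustrates the membership rule implied by the inclusive ranges above. The helper name `bins_for_length` is hypothetical and not part of the notebook, and `math.inf` stands in for `np.inf` to keep the sketch dependency-free.

```python
import math

# Same inclusive [low, high] word-count ranges as the 10-300 word bins above
bins = {
    'combined': [0, math.inf],
    'bin_50': [1, 50],
    'bin_75': [26, 75],
    'bin_100': [51, 100],
    'bin_125': [76, 125],
    'bin_150': [101, 150],
    'bin_175': [126, 175],
    'bin_200': [151, 200],
    'bin_225': [176, 225],
    'bin_250': [201, 250],
    'bin_275': [226, 275],
    'bin_300': [251, 300]
}

def bins_for_length(length, bins):
    '''Hypothetical helper: returns the IDs of every bin whose inclusive
    range contains the given fragment length in words.'''
    return [bin_id for bin_id, (low, high) in bins.items() if low <= length <= high]

# A 60 word fragment sits in the overlap between bin_75 and bin_100
print(bins_for_length(60, bins))

# A 10 word fragment is below the first overlap, so only bin_50 applies
print(bins_for_length(10, bins))
```

Note that with these ranges, fragments of 1 to 25 words and 276 to 300 words fall into only one length bin, since there is no neighboring bin to overlap with at either end.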

## 2. Load and clean data

In [3]:
# Load the data
data_df = pd.read_json(input_file)

# Replace and remove string 'OOM' and 'NAN' values
data_df.replace('NAN', np.nan, inplace = True)
data_df.replace('OOM', np.nan, inplace = True)
data_df.dropna(inplace = True)

# Shuffle the deck, resetting the index
data_df = data_df.sample(frac = 1).reset_index(drop = True)

# Use the index to add a unique fragment id
data_df.reset_index(inplace = True)
data_df.rename({'index': 'Fragment ID'}, axis = 1, inplace = True)

# Enforce dtypes
data_df = data_df.astype({
    'Fragment ID': np.int64,
    'Source record num': np.int64,
    'Fragment length (words)': np.int64,
    'Fragment length (tokens)': np.int64,
    'Dataset': object, #pd.StringDtype(), pandas recommends these, but PyTables for hdf5 can't use them
    'Source': object, #pd.StringDtype(),
    'Generator': object, #pd.StringDtype(),
    'String': object, #pd.StringDtype(),
    'Perplexity': np.float64,
    'Cross-perplexity': np.float64,
    'Perplexity ratio score': np.float64,
    'Reader time (seconds)': np.float64,
    'Writer time (seconds)': np.float64,
    'Reader peak memory (GB)': np.float64,
    'Writer peak memory (GB)': np.float64
})

data_df.head()

| | Fragment ID | Source record num | Fragment length (words) | Fragment length (tokens) | Dataset | Source | Generator | String | Perplexity | Cross-perplexity | Perplexity ratio score | Reader time (seconds) | Writer time (seconds) | Reader peak memory (GB) | Writer peak memory (GB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3680 | 113 | 147 | cc_news | synthetic | llama2-13b | for his coverage of 2008 Mumbai terror attacks... | 2.459 | 2.787109 | 0.88227 | 2.406446 | 2.624167 | 5.897318 | 5.856219 |
| 1 | 1 | 10784 | 251 | 342 | pubmed | synthetic | llama2-13b | was resuscitated successfully but subsequently... | 1.187 | 1.419922 | 0.835626 | 4.489942 | 5.005341 | 8.505339 | 8.402437 |
| 2 | 2 | 3313 | 121 | 167 | cc_news | synthetic | llama2-13b | chiropractor from Lake Oswego, Oregon. "I don'... | 2.924 | 2.888672 | 1.01217 | 2.445095 | 2.690631 | 6.186151 | 6.132349 |
| 3 | 3 | 8361 | 275 | 375 | cnn | synthetic | llama2-13b | every one of these superfruits, but can they a... | 2.299 | 2.509766 | 0.915953 | 4.490003 | 5.042619 | 8.63342 | 8.534275 |
| 4 | 4 | 8274 | 182 | 233 | cnn | human | human | (CNN)Well, I'll be the first to admit, I got c... | 2.84 | 2.917969 | 0.973226 | 3.147694 | 3.505678 | 5.78584 | 5.773574 |


In [4]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55813 entries, 0 to 55812
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Fragment ID               55813 non-null  int64  
 1   Source record num         55813 non-null  int64  
 2   Fragment length (words)   55813 non-null  int64  
 3   Fragment length (tokens)  55813 non-null  int64  
 4   Dataset                   55813 non-null  object 
 5   Source                    55813 non-null  object 
 6   Generator                 55813 non-null  object 
 7   String                    55813 non-null  object 
 8   Perplexity                55812 non-null  float64
 9   Cross-perplexity          55813 non-null  float64
 10  Perplexity ratio score    55812 non-null  float64
 11  Reader time (seconds)     55813 non-null  float64
 12  Writer time (seconds)     55813 non-null  float64
 13  Reader peak memory (GB)   55813 non-null  float64
 14  Writer

## 3. Generate hdf5 data structure

OK, let's start structuring the dataset. We want two top-level groups, one for training data and one for reserved testing data. Inside each of those groups will live one sub-group per length bin, which will later hold that bin's features and labels. We will also use file attributes to store metadata such as the bin ranges.
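
As a rough picture of the layout described above, the sketch below enumerates the group paths that would be expected for a set of bin IDs. It is plain Python that only builds path strings, not the notebook's h5py code. The helper `expected_groups` and the shortened `bin_ids` list are illustrative assumptions.

```python
def expected_groups(bin_ids, top_level=('training', 'testing')):
    '''Hypothetical helper: lists the HDF5 group paths for a layout with
    one sub-group per bin under each top-level group.'''
    return [f'/{group}/{bin_id}' for group in top_level for bin_id in bin_ids]

# Shortened list of bin IDs, for illustration only
bin_ids = ['combined', 'bin_50', 'bin_75']

for path in expected_groups(bin_ids):
    print(path)
```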

In [5]:
# Prepare the hdf5 output - create or open for read/write
output = h5py.File(output_file, 'a')

# Create the top-level groups
_ = output.require_group('training')
_ = output.require_group('testing')

print(f'Top level groups: {(list(output.keys()))}')

# Next, we need to add a group for each fragment length bin,
# and one for the un-binned data. Loop on the two top-level groups
# explicitly so that a re-run does not pick up other top-level keys
for group in ['training', 'testing']:

    # Loop on the bins and add a group for each, the un-binned data
    # is covered by the 'combined' entry in bins
    for bin_id in bins.keys():
        _ = output.require_group(f'{group}/{bin_id}')

# Finally, save the bin ranges as file-level attributes
output.attrs.update(bins)

print(f'\nBin attributes:')
for key, value in output.attrs.items():
    print(f' {key}: {value}')

output.close()

Top level groups: ['testing', 'training']

Bin attributes:
 bin_100: [ 51 100]
 bin_125: [ 76 125]
 bin_150: [101 150]
 bin_175: [126 175]
 bin_200: [151 200]
 bin_225: [176 225]
 bin_250: [201 250]
 bin_275: [226 275]
 bin_300: [251 300]
 bin_50: [ 1 50]
 bin_75: [26 75]
 combined: [ 0. inf]


## 4. Populate hdf5 data structure

OK, the basic data structure is ready to go. Next thing to do is add data.

In [6]:
def make_labels(training_df, testing_df):
    '''Takes training and testing dataframes, gets the human/synthetic
    labels from the 'Source' column, encodes them as integers and returns
    them. Note: this function leaves the 'Source' label column in the
    features dataframe for downstream use in feature engineering.'''

    # Get the labels
    training_labels = training_df['Source']
    testing_labels = testing_df['Source']

    # Encode string class values as integers
    label_encoder = LabelEncoder()
    label_encoder = label_encoder.fit(training_labels)
    training_labels = pd.Series(label_encoder.transform(training_labels)).astype(np.int64)
    testing_labels = pd.Series(label_encoder.transform(testing_labels)).astype(np.int64)

    return training_labels, testing_labels
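
To make the resulting integer labels easier to interpret: `LabelEncoder` assigns integers to classes in sorted order, so with the 'human' and 'synthetic' values seen in the 'Source' column, 'human' should map to 0 and 'synthetic' to 1. The sketch below reproduces that mapping in plain Python as a stand-in. `encode_labels` is a hypothetical helper, not the scikit-learn API.

```python
def encode_labels(training_labels, testing_labels):
    '''Hypothetical stand-in for LabelEncoder: learns a sorted class-to-integer
    mapping from the training labels only, then applies it to both sets.'''
    mapping = {label: i for i, label in enumerate(sorted(set(training_labels)))}
    encoded_training = [mapping[label] for label in training_labels]
    # Raises KeyError if the testing set has a class absent from training
    encoded_testing = [mapping[label] for label in testing_labels]
    return encoded_training, encoded_testing, mapping

training, testing, mapping = encode_labels(
    ['synthetic', 'human', 'synthetic'],
    ['human', 'synthetic']
)

print(mapping)
print(training, testing)
```

Fitting on the training labels only means a class that appears in a testing bin but not in the matching training bin cannot be encoded. scikit-learn's `LabelEncoder.transform` likewise raises an error on unseen labels, which is worth keeping in mind for sparsely populated bins.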

In [7]:
# Reopen our hdf5 file with pandas so we can work with dataframes
data_lake = pd.HDFStore(output_file)

# Add the raw data set at the top level as 'master' in case we want it later.
data_lake['master'] = data_df

# Next, get rid of un-trainable/unnecessary features
feature_drops = [
    'Fragment ID',
    'Source record num',
    'Dataset',
    'Generator',
    'Reader time (seconds)',
    'Writer time (seconds)',
    'Reader peak memory (GB)',
    'Writer peak memory (GB)'
]

data_df.drop(feature_drops, axis = 1, inplace = True)

# Split the data into training and testing
training_df = data_df.sample(frac = 0.7, random_state = 42)
testing_df = data_df.drop(training_df.index)

# Now bin the training and testing data, including the un-binned 'combined' set
for bin_id, bin_range in bins.items():

    # Pull the fragments for this bin
    bin_training_df = training_df[(training_df['Fragment length (words)'] >= bin_range[0]) & (training_df['Fragment length (words)'] <= bin_range[1])]
    bin_testing_df = testing_df[(testing_df['Fragment length (words)'] >= bin_range[0]) & (testing_df['Fragment length (words)'] <= bin_range[1])]

    # Fix the index
    bin_training_df = bin_training_df.reset_index(drop = True)
    bin_testing_df = bin_testing_df.reset_index(drop = True)

    # Split the bin data into features and labels
    bin_training_labels, bin_testing_labels = make_labels(bin_training_df, bin_testing_df)

    # Add the data to the data lake
    data_lake.put(f'training/{bin_id}/features', bin_training_df)
    data_lake.put(f'training/{bin_id}/labels', bin_training_labels)
    data_lake.put(f'testing/{bin_id}/features', bin_testing_df)
    data_lake.put(f'testing/{bin_id}/labels', bin_testing_labels)

data_lake.close()
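
The split above relies on `sample` followed by `drop` on the sampled index, which only yields disjoint sets when the index is unique. A minimal sketch on a toy dataframe (the column name `length` and the data are made up for illustration) shows the pattern and a quick check that no row ends up in both sets.

```python
import pandas as pd

# Toy data with a unique RangeIndex, standing in for data_df
toy_df = pd.DataFrame({'length': range(10, 110, 10)})

# Same pattern as the notebook: sample 70% for training, keep the rest for testing
training_df = toy_df.sample(frac=0.7, random_state=42)
testing_df = toy_df.drop(training_df.index)

# The two index sets should be disjoint and together cover every row
overlap = set(training_df.index) & set(testing_df.index)
covered = set(training_df.index) | set(testing_df.index)

print(f'Training rows: {len(training_df)}, testing rows: {len(testing_df)}')
print(f'Overlapping rows: {len(overlap)}')
print(f'All rows covered: {covered == set(toy_df.index)}')
```

The same three checks could be run on the real `training_df` and `testing_df` before binning, as a guard against duplicated index values.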

## 5. Sanity check results

In [8]:
# Open the data lake read-only to check out the result
data_lake = h5py.File(output_file, 'r')

# Print the result
for group in data_lake.keys():
    print(f'\n{group} contains:')

    for subgroup in data_lake[group]:

        if 'bin' in subgroup:
            print(f'  {subgroup} contains: ', end = '')

            for subsubgroup in data_lake[group][subgroup].keys():
                print(f'{subsubgroup} ', end = '')
            
            print()

        else:
            print(f' {subgroup}')

data_lake.close()


master contains:
 axis0
 axis1
 block0_items
 block0_values
 block1_items
 block1_values
 block2_items
 block2_values

testing contains:
  bin_100 contains: features labels 
  bin_125 contains: features labels 
  bin_150 contains: features labels 
  bin_175 contains: features labels 
  bin_200 contains: features labels 
  bin_225 contains: features labels 
  bin_250 contains: features labels 
  bin_275 contains: features labels 
  bin_300 contains: features labels 
  bin_50 contains: features labels 
  bin_75 contains: features labels 
  combined contains: features labels 

training contains:
  bin_100 contains: features labels 
  bin_125 contains: features labels 
  bin_150 contains: features labels 
  bin_175 contains: features labels 
  bin_200 contains: features labels 
  bin_225 contains: features labels 
  bin_250 contains: features labels 
  bin_275 contains: features labels 
  bin_300 contains: features labels 
  bin_50 contains: features labels 
  bin_75 contains: features la

In [9]:
# Reopen our hdf5 file with pandas so we can work with dataframes
data_lake = pd.HDFStore(output_file)

print('Combined training features:\n')
print(data_lake['training/combined/features'].info())
print('\nCombined training labels:\n')
print(data_lake['training/combined/labels'].info())
print('\nBin 100 training features:\n')
print(data_lake['training/bin_100/features'].info())
print('\nBin 100 training labels:\n')
print(data_lake['training/bin_100/labels'].info())

data_lake.close()

Combined training features:

<class 'pandas.core.frame.DataFrame'>
Index: 39069 entries, 0 to 39068
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Fragment length (words)   39069 non-null  int64  
 1   Fragment length (tokens)  39069 non-null  int64  
 2   Source                    39069 non-null  object 
 3   String                    39069 non-null  object 
 4   Perplexity                39069 non-null  float64
 5   Cross-perplexity          39069 non-null  float64
 6   Perplexity ratio score    39069 non-null  float64
dtypes: float64(3), int64(2), object(2)
memory usage: 2.4+ MB
None

Combined training labels:

<class 'pandas.core.series.Series'>
Index: 39069 entries, 0 to 39068
Series name: None
Non-Null Count  Dtype
--------------  -----
39069 non-null  int64
dtypes: int64(1)
memory usage: 610.5 KB
None

Bin 100 training features:

<class 'pandas.core.frame.DataFrame'>
Index: 8071 e

In [10]:
# Reopen our hdf5 file with pandas so we can walk the stored groups and keys
data_lake = pd.HDFStore(output_file)

for (path, subgroups, subkeys) in data_lake.walk():

    for subgroup in subgroups:

        print("GROUP: {}/{}".format(path, subgroup))

    for subkey in subkeys:

        key = "/".join([path, subkey])

        print("KEY: {}".format(key))

data_lake.close()

GROUP: /testing
GROUP: /training
KEY: /master
GROUP: /testing/bin_100
GROUP: /testing/bin_125
GROUP: /testing/bin_150
GROUP: /testing/bin_175
GROUP: /testing/bin_200
GROUP: /testing/bin_225
GROUP: /testing/bin_250
GROUP: /testing/bin_275
GROUP: /testing/bin_300
GROUP: /testing/bin_50
GROUP: /testing/bin_75
GROUP: /testing/combined
GROUP: /training/bin_100
GROUP: /training/bin_125
GROUP: /training/bin_150
GROUP: /training/bin_175
GROUP: /training/bin_200
GROUP: /training/bin_225
GROUP: /training/bin_250
GROUP: /training/bin_275
GROUP: /training/bin_300
GROUP: /training/bin_50
GROUP: /training/bin_75
GROUP: /training/combined
KEY: /training/bin_100/features
KEY: /training/bin_100/labels
KEY: /training/bin_125/features
KEY: /training/bin_125/labels
KEY: /training/bin_150/features
KEY: /training/bin_150/labels
KEY: /training/bin_175/features
KEY: /training/bin_175/labels
KEY: /training/bin_200/features
KEY: /training/bin_200/labels
KEY: /training/bin_225/features
KEY: /training/bin_225/lab

OK, I think we are happy with this to start with. Let's move on and do some feature engineering in the bins.