# Raw data length binning

This notebook designs a strategy for loading the perplexity ratio score data generated by the scoring algorithm, binning it by text fragment length and preparing it for feature engineering. The plan is to handle the data as pandas dataframes and store it in HDF5. Text fragments go into overlapping length bins so that feature engineering and classifier training can be run separately for each length regime. This strategy is based on early observations of classifier performance on short fragments, long fragments and un-binned fragments.

In [1]:
# Change working directory to parent so we can import as we would from __main__.py
%cd ..

import h5py
import numpy as np
import pandas as pd
import sklearn as sk
from sklearn.preprocessing import LabelEncoder

import configuration as config

print(f'H5py: {h5py.__version__}')
print(f'Numpy: {np.__version__}')
print(f'Pandas: {pd.__version__}')
print(f'Sklearn: {sk.__version__}')

/mnt/arkk/llm_detector/classifier
H5py: 3.11.0
Numpy: 1.24.4
Pandas: 2.0.3
Sklearn: 1.3.2


In [2]:
# The bins we are planning on using
bins = {
    'combined': [0, np.inf],
    'bin_100': [1, 100],
    'bin_150': [51, 150],
    'bin_200': [101, 200],
    'bin_250': [151, 250],
    'bin_300': [201, 300],
    'bin_350': [251, 350],
    'bin_400': [301, 400],
    'bin_450': [351, 450],
    'bin_500': [401, 500],
    'bin_550': [451, 550],
    'bin_600': [501, 600]
}
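
Because adjacent bins overlap by 50 words, most fragments land in two length bins as well as in `combined`. The sketch below illustrates that membership lookup. The helper `bins_for_length` is hypothetical and not part of this notebook, and only a subset of the bins is reproduced to keep it short.

```python
import math

# A subset of the bins defined above; ranges are inclusive at both ends
example_bins = {
    'combined': [0, math.inf],
    'bin_100': [1, 100],
    'bin_150': [51, 150],
    'bin_200': [101, 200],
}

def bins_for_length(length_words, bins):
    '''Hypothetical helper: returns the IDs of every bin whose
    inclusive range contains the given fragment length.'''

    return [
        bin_id for bin_id, (low, high) in bins.items()
        if low <= length_words <= high
    ]

# A 70 word fragment sits in the overlap between bin_100 and bin_150
print(bins_for_length(70, example_bins))

# A 30 word fragment is below the overlap, so only bin_100 applies
print(bins_for_length(30, example_bins))
```

One consequence of this design is that the per-bin datasets are not disjoint, so row counts summed over the bins will exceed the size of `combined`.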

In [3]:
# Load the data
data_df = pd.read_json(config.RAW_INPUT_DATA)

# Replace and remove string 'OOM' and 'NAN' values
data_df.replace('NAN', np.nan, inplace = True)
data_df.replace('OOM', np.nan, inplace = True)
data_df.dropna(inplace = True)

# Shuffle the deck, resetting the index
data_df = data_df.sample(frac = 1).reset_index(drop = True)

# Use the index to add a unique fragment id
data_df.reset_index(inplace = True)
data_df.rename({'index': 'Fragment ID'}, axis = 1, inplace = True)

# Enforce dtypes
data_df = data_df.astype({
    'Fragment ID': np.int32,
    'Source record num': np.int32,
    'Fragment length (words)': np.int32,
    'Fragment length (tokens)': np.int32,
    'Dataset': object, #pd.StringDtype(),
    'Source': object, #pd.StringDtype(),
    'Generator': object, #pd.StringDtype(),
    'String': object, #pd.StringDtype(),
    'Perplexity': np.float32,
    'Cross-perplexity': np.float32,
    'Perplexity ratio score': np.float32,
    'Reader time (seconds)': np.float32,
    'Writer time (seconds)': np.float32,
    'Reader peak memory (GB)': np.float32,
    'Writer peak memory (GB)': np.float32
})

data_df.head()

| | Fragment ID | Source record num | Fragment length (words) | Fragment length (tokens) | Dataset | Source | Generator | String | Perplexity | Cross-perplexity | Perplexity ratio score | Reader time (seconds) | Writer time (seconds) | Reader peak memory (GB) | Writer peak memory (GB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3310 | 309 | 451 | cc_news | human | human | 2-point lead with about 5 1/2 minutes left bef... | 1.739 | 1.871094 | 0.929541 | 6.109275 | 6.706055 | 7.092305 | 7.092305 |
| 1 | 1 | 8408 | 202 | 261 | cnn | human | human | a so-called 'legal high' he bought off the int... | 2.61 | 2.630859 | 0.991834 | 3.636041 | 4.155262 | 10.66083 | 10.476773 |
| 2 | 2 | 3526 | 127 | 177 | cc_news | human | human | Forever," taken from the Fifty Shades Darker m... | 2.576 | 2.542969 | 1.013057 | 2.465926 | 2.709363 | 7.34402 | 7.252112 |
| 3 | 3 | 4834 | 70 | 111 | cc_news | synthetic | llama2-13b | and Harvey Weinstein attend The Fashion Group ... | 1.981 | 1.942383 | 1.020111 | 1.937368 | 2.140092 | 5.439086 | 5.415662 |
| 4 | 4 | 5613 | 156 | 270 | cc_news | synthetic | llama2-13b | the first time ever.\nKANSAS CITY, Mo. — LSU A... | 1.512 | 1.666016 | 0.907386 | 3.747122 | 4.190398 | 7.175822 | 7.106102 |
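
The cleaning step in the cell above can be tried in isolation on a toy frame. This is a minimal sketch, assuming the failed scores arrive as the literal strings `'NAN'` and `'OOM'` mixed in with otherwise numeric values; the numbers in `toy_df` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw scores; the values are made up
toy_df = pd.DataFrame({
    'Fragment length (words)': [120, 80, 300, 45],
    'Perplexity': ['2.61', 'OOM', '1.98', 'NAN'],
})

# Turn both sentinel strings into real missing values, then drop those rows
clean_df = toy_df.replace(['NAN', 'OOM'], np.nan).dropna()

# With the sentinel strings gone, the column can be cast to a numeric dtype
clean_df = clean_df.astype({'Perplexity': np.float32})

print(clean_df)
```

Doing the replacement before `astype` matters: casting a column that still contains `'OOM'` to a float dtype would raise an error.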


In [4]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34196 entries, 0 to 34195
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Fragment ID               34196 non-null  int32  
 1   Source record num         34196 non-null  int32  
 2   Fragment length (words)   34196 non-null  int32  
 3   Fragment length (tokens)  34196 non-null  int32  
 4   Dataset                   34196 non-null  object 
 5   Source                    34196 non-null  object 
 6   Generator                 34196 non-null  object 
 7   String                    34196 non-null  object 
 8   Perplexity                34196 non-null  float32
 9   Cross-perplexity          34196 non-null  float32
 10  Perplexity ratio score    34196 non-null  float32
 11  Reader time (seconds)     34196 non-null  float32
 12  Writer time (seconds)     34196 non-null  float32
 13  Reader peak memory (GB)   34196 non-null  float32
 14  Writer

OK, let's start structuring the dataset. We want two top-level groups, one for training data and one for reserved testing data. Inside those groups will live the datasets for each bin. We will also use attributes to save some metadata.
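
To make the target layout concrete, here is a small sketch that just builds the key paths as strings. It does not touch the HDF5 file. The `features` and `labels` leaves are the ones written later in this notebook, and `example_bin_ids` is a shortened, illustrative list of bin IDs.

```python
# Only a few bin IDs are listed here to keep the output short
example_bin_ids = ['combined', 'bin_100', 'bin_150']

# One features table and one labels table per bin, per split
planned_keys = [
    f'{split}/{bin_id}/{kind}'
    for split in ('training', 'testing')
    for bin_id in example_bin_ids
    for kind in ('features', 'labels')
]

for key in planned_keys:
    print(key)
```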

In [5]:
# Prepare the hdf5 output - create or open for read/write
output = h5py.File(config.LENGTH_BINNED_DATASET, 'a')

# Create the top-level groups
_ = output.require_group('training')
_ = output.require_group('testing')

print(f'Output is type: {type(output)}')
print(f'Top level groups: {(list(output.keys()))}')

# Next, we need to add a group for each fragment length bin under
# both top-level groups. The bins dictionary already includes
# 'combined' for the un-binned data, so it needs no separate call
for group in ['training', 'testing']:

    # Loop on the bins and add a group for each, using bin_id
    # as the loop variable to avoid shadowing the built-in bin()
    for bin_id in bins.keys():
        _ = output.require_group(f'{group}/{bin_id}')

output.close()

Output is type: <class 'h5py._hl.files.File'>
Top level groups: ['testing', 'training']


OK, the basic data structure is ready to go. Next thing to do is add data.

In [6]:
def make_features_labels(training_df, testing_df):
    '''Takes training and testing dataframes, separates features
    from labels, encodes labels and returns'''

    # Split the data into features and labels
    training_labels = training_df['Source']
    training_features = training_df.drop('Source', axis = 1)

    testing_labels = testing_df['Source']
    testing_features = testing_df.drop('Source', axis = 1)

    # Encode string class values as integers
    label_encoder = LabelEncoder()
    label_encoder = label_encoder.fit(training_labels)
    training_labels = pd.Series(label_encoder.transform(training_labels)).astype(np.int32)
    testing_labels = pd.Series(label_encoder.transform(testing_labels)).astype(np.int32)

    return training_labels, testing_labels, training_features, testing_features
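
As far as we know, `LabelEncoder` assigns integer codes in sorted order of the class names, so `human` should map to 0 and `synthetic` to 1. The sketch below mimics that behavior with plain NumPy rather than calling scikit-learn, so treat it as an illustration of the encoding and not as the encoder itself. The label arrays are toy values.

```python
import numpy as np

# Toy labels using the two 'Source' values seen in the data
training_labels = np.array(['synthetic', 'human', 'human', 'synthetic'])
testing_labels = np.array(['human', 'synthetic'])

# np.unique returns the sorted classes and, with return_inverse,
# the integer code of each training label
classes, training_codes = np.unique(training_labels, return_inverse = True)

# Encode the testing labels using only the classes learned from training
testing_codes = np.searchsorted(classes, testing_labels)

print(classes)
print(training_codes.astype(np.int32))
print(testing_codes.astype(np.int32))
```

Fitting on the training labels only, as `make_features_labels` does, means a class that appears only in the testing data would not get a valid code, which is worth keeping in mind for small bins.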

In [7]:
# Reopen our hdf5 file with pandas so we can work with dataframes
data_lake = pd.HDFStore(config.LENGTH_BINNED_DATASET)

# Add the raw data set at the top level as 'master' in case we want it later.
data_lake['master'] = data_df

# Next, get rid of un-trainable/unnecessary features
feature_drops = [
    'Fragment ID',
    'Source record num',
    'Dataset',
    'Generator',
    'String',
    'Reader time (seconds)',
    'Writer time (seconds)',
    'Reader peak memory (GB)',
    'Writer peak memory (GB)'
]

data_df.drop(feature_drops, axis = 1, inplace = True)

# Split the data into training and testing
training_df = data_df.sample(frac = 0.7, random_state = 42)
testing_df = data_df.drop(training_df.index)

# Now bin the training and testing data by fragment length, the
# 'combined' entry in bins covers the un-binned data
for bin_id, bin_range in bins.items():

    # Pull the fragments for this bin
    bin_training_df = training_df[(training_df['Fragment length (words)'] >= bin_range[0]) & (training_df['Fragment length (words)'] <= bin_range[1])]
    bin_testing_df = testing_df[(testing_df['Fragment length (words)'] >= bin_range[0]) & (testing_df['Fragment length (words)'] <= bin_range[1])]

    # Fix the index
    bin_training_df.reset_index(inplace = True, drop = True)
    bin_testing_df.reset_index(inplace = True, drop = True)

    # Split this bin's data into features and labels
    bin_training_labels, bin_testing_labels, bin_training_features, bin_testing_features = make_features_labels(bin_training_df, bin_testing_df)

    # Add the data to the data lake
    data_lake.put(f'training/{bin_id}/features', bin_training_features)
    data_lake.put(f'training/{bin_id}/labels', bin_training_labels)
    data_lake.put(f'testing/{bin_id}/features', bin_testing_features)
    data_lake.put(f'testing/{bin_id}/labels', bin_testing_labels)

data_lake.close()
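
The split and bin selection logic above can be sanity checked on a toy frame. This is a minimal sketch with made-up fragment lengths. It uses `Series.between`, which is inclusive at both ends by default, as a shorter equivalent of the paired `>=` and `<=` comparisons.

```python
import pandas as pd

# Toy fragment lengths, made up for illustration
toy_df = pd.DataFrame({
    'Fragment length (words)': [30, 70, 120, 160, 90, 210, 55, 101, 99, 150],
})

# 70/30 split: sample the training rows, then keep everything else for testing
training_df = toy_df.sample(frac = 0.7, random_state = 42)
testing_df = toy_df.drop(training_df.index)

# Inclusive length filter for bin_150
low, high = 51, 150
in_bin = testing_df['Fragment length (words)'].between(low, high)
bin_testing_df = testing_df[in_bin].reset_index(drop = True)

# The two splits should share no rows and together cover the input
shared_rows = training_df.index.intersection(testing_df.index)
print(len(training_df), len(testing_df), len(shared_rows))
```

Because the split happens before binning, a fragment that falls into two overlapping bins stays on the same side of the train/test divide in both.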

In [8]:
# Open the data lake read-only to check out the result
data_lake = h5py.File(config.LENGTH_BINNED_DATASET, 'r')

# Print the result
for group in data_lake.keys():
    print(f'\n{group} contains:')

    for subgroup in data_lake[group]:

        if 'bin' in subgroup:
            print(f'  {subgroup} contains: ', end = '')

            for subsubgroup in data_lake[group][subgroup].keys():
                print(f'{subsubgroup} ', end = '')
            
            print()

        else:
            print(f' {subgroup}')

data_lake.close()


master contains:
 axis0
 axis1
 block0_items
 block0_values
 block1_items
 block1_values
 block2_items
 block2_values

testing contains:
  bin_100 contains: features labels 
  bin_150 contains: features labels 
  bin_200 contains: features labels 
  bin_250 contains: features labels 
  bin_300 contains: features labels 
  bin_350 contains: features labels 
  bin_400 contains: features labels 
  bin_450 contains: features labels 
  bin_500 contains: features labels 
  bin_550 contains: features labels 
  bin_600 contains: features labels 
  combined contains: features labels 

training contains:
  bin_100 contains: features labels 
  bin_150 contains: features labels 
  bin_200 contains: features labels 
  bin_250 contains: features labels 
  bin_300 contains: features labels 
  bin_350 contains: features labels 
  bin_400 contains: features labels 
  bin_450 contains: features labels 
  bin_500 contains: features labels 
  bin_550 contains: features labels 
  bin_600 contains: feature

In [9]:
# Reopen our hdf5 file with pandas so we can work with dataframes
data_lake = pd.HDFStore(config.LENGTH_BINNED_DATASET)

print('Combined training features:\n')
print(data_lake['training/combined/features'].info())
print('\nCombined training labels:\n')
print(data_lake['training/combined/labels'].info())
print('\nBin 100 training features:\n')
print(data_lake['training/bin_100/features'].info())
print('\nBin 100 training labels:\n')
print(data_lake['training/bin_100/labels'].info())

data_lake.close()

Combined training features:

<class 'pandas.core.frame.DataFrame'>
Index: 23937 entries, 0 to 23936
Data columns (total 5 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Fragment length (words)   23937 non-null  int32  
 1   Fragment length (tokens)  23937 non-null  int32  
 2   Perplexity                23937 non-null  float32
 3   Cross-perplexity          23937 non-null  float32
 4   Perplexity ratio score    23937 non-null  float32
dtypes: float32(3), int32(2)
memory usage: 654.5 KB
None

Combined training labels:

<class 'pandas.core.series.Series'>
Index: 23937 entries, 0 to 23936
Series name: None
Non-Null Count  Dtype
--------------  -----
23937 non-null  int32
dtypes: int32(1)
memory usage: 280.5 KB
None

Bin 100 training features:

<class 'pandas.core.frame.DataFrame'>
Index: 8347 entries, 0 to 8346
Data columns (total 5 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------    

In [10]:
# Reopen our hdf5 file with pandas so we can work with dataframes
data_lake = pd.HDFStore(config.LENGTH_BINNED_DATASET)

for (path, subgroups, subkeys) in data_lake.walk():

    for subgroup in subgroups:

        print("GROUP: {}/{}".format(path, subgroup))

    for subkey in subkeys:

        key = "/".join([path, subkey])

        print("KEY: {}".format(key))

data_lake.close()

GROUP: /testing
GROUP: /training
KEY: /master
GROUP: /testing/bin_100
GROUP: /testing/bin_150
GROUP: /testing/bin_200
GROUP: /testing/bin_250
GROUP: /testing/bin_300
GROUP: /testing/bin_350
GROUP: /testing/bin_400
GROUP: /testing/bin_450
GROUP: /testing/bin_500
GROUP: /testing/bin_550
GROUP: /testing/bin_600
GROUP: /testing/combined
GROUP: /training/bin_100
GROUP: /training/bin_150
GROUP: /training/bin_200
GROUP: /training/bin_250
GROUP: /training/bin_300
GROUP: /training/bin_350
GROUP: /training/bin_400
GROUP: /training/bin_450
GROUP: /training/bin_500
GROUP: /training/bin_550
GROUP: /training/bin_600
GROUP: /training/combined
KEY: /training/bin_100/features
KEY: /training/bin_100/labels
KEY: /training/bin_150/features
KEY: /training/bin_150/labels
KEY: /training/bin_200/features
KEY: /training/bin_200/labels
KEY: /training/bin_250/features
KEY: /training/bin_250/labels
KEY: /training/bin_300/features
KEY: /training/bin_300/labels
KEY: /training/bin_350/features
KEY: /training/bin_350

OK, I think we are happy with this to start with. Let's move on and do some feature engineering in the bins.