# Semantic splitting

The last text-splitting approach to try is semantic splitting with semantic-text-splitter from PyPI. Hopefully, this will combine the best aspects of sentence splitting and word-length-based 'dumb' splitting: we will have some control over the output chunk size, and splits will occur in more rational places than at a simple arbitrary word length.

## Notebook setup

In [1]:
# Change working directory to parent so we can import as we would
# from the perplexity ratio score root directory
%cd ..

# Standard library imports
import glob
import time
import json
import multiprocessing as mp

# PyPI imports
import numpy as np
import pandas as pd
from semantic_text_splitter import TextSplitter
from tokenizers import Tokenizer

# Internal imports
import configuration as config

/home/siderealyear/projects/llm_detector/perplexity_ratio_score


## 1. Load an example data file

In [2]:
data_file=f'{config.INTERMEDIATE_DATA_PATH}/texts.0.parquet'
data_df=pd.read_parquet(data_file)
data_df.head()

| | text | synthetic | author | source |
|---|---|---|---|---|
| 0 | You’re probably doing a lot of things that tak... | 1 | unknown_model | yatsenko |
| 1 | The last time the Dallas Cowboys visited New O... | 1 | unknown_model | yatsenko |
| 2 | Technology is now reading your emotions? The r... | 0 | human | gerami |
| 3 | Certain aspects of the prison environment (she... | 0 | human | gaggar |
| 4 | What was the reason for permitting women to se... | 1 | GPT-3.5-turbo | gaggar |


In [3]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115606 entries, 0 to 115605
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   text       115606 non-null  object
 1   synthetic  115606 non-null  int64 
 2   author     115606 non-null  object
 3   source     115606 non-null  object
dtypes: int64(1), object(3)
memory usage: 3.5+ MB


## 2. Test split a small batch of records

In [4]:
# Holder for results
results={
    'text': [],
    'synthetic': [],
    'author': [],
    'source': []
}

# Tokenizer & splitter
tokenizer_name='bert-base-uncased'
max_tokens=16

tokenizer=Tokenizer.from_pretrained(tokenizer_name)
splitter=TextSplitter.from_huggingface_tokenizer(tokenizer, max_tokens)

start_time=time.time()

for i in range(1000):
    
    text=data_df['text'].iloc[i]
    chunks=splitter.chunks(text)

    for chunk in chunks:
        results['text'].append(chunk)
        results['synthetic'].append(data_df['synthetic'].iloc[i])
        results['author'].append(data_df['author'].iloc[i])
        results['source'].append(data_df['source'].iloc[i])

dT=time.time() - start_time
splitting_rate=(i + 1)/dT
print(f'Split {i + 1} records in {dT:.1f} seconds')
print(f'Splitting rate: {splitting_rate:.1f} records per second')

Split 1000 records in 14.4 seconds
Splitting rate: 69.5 records per second


This is slower than the NLTK sentence splitter, which ran at ~600 records per second, but it's still tractable.

The splitting rate depends on the target chunk size. With a target size of 16 tokens we can split at around 70 records per second, while with a target size of 512 tokens the rate is ~250 records per second. Run serially, it should therefore take between roughly 4 hours (512 tokens) and 14 hours (16 tokens) to split all 3.47 million records. Parallelized over the 30 input files, that drops to under 30 minutes even in the worst case. Let's do it.
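
The estimate above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the rates are the approximate figures quoted in the text, and it assumes ideal linear scaling across workers and evenly sized input files, neither of which is guaranteed in practice. The helper name `estimate_hours` is made up for illustration.

```python
# Figures quoted in the text above (approximate)
total_records = 3_470_000   # ~3.47 million records across all input files
n_files = 30                # one worker per input file

# Approximate measured rates in records per second, keyed by target size in tokens
rates = {16: 70, 512: 250}


def estimate_hours(records: int, rate: float) -> float:
    '''Serial run time in hours for a given splitting rate (records per second).'''
    return records / rate / 3600


for target_size, rate in rates.items():
    serial_hours = estimate_hours(total_records, rate)

    # Assumes perfect scaling: each worker handles 1/n_files of the records
    parallel_minutes = serial_hours * 60 / n_files

    print(
        f'{target_size} tokens: {serial_hours:.1f} h serial, '
        f'{parallel_minutes:.0f} min over {n_files} workers'
    )
```

The 16-token case is the slow one, so it sets the worst-case bound for the parallel run.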

## 3. Parallel splitting

In [7]:
# Define the splitting function

def split_text(data_file: str=None, target_size: int=512, worker: int=0, sample_fraction: float=1) -> dict:
    '''Function to parallelize semantic splitting of text over input files.
    Meant to be called by a multiprocessing worker. Takes an input file
    path, loads the data, splits each text into chunks, collects the
    results in a dictionary and returns that dictionary.'''

    data_df=pd.read_parquet(data_file)
    print(f"\nWorker {worker} loaded {data_file.split('/')[-1]}", end='')

    results={
        'text': [],
        'synthetic': [],
        'author': [],
        'source': []
    }

    # Tokenizer & splitter
    tokenizer_name='bert-base-uncased'
    tokenizer=Tokenizer.from_pretrained(tokenizer_name)
    splitter=TextSplitter.from_huggingface_tokenizer(tokenizer, target_size)

    # Calculate the number of records to process
    num_records=int(len(data_df)*sample_fraction)

    # Calculate how often to print the completion percent: roughly every
    # 10% of records, but never 0 so the modulo check below cannot fail
    completion_checkpoint=max(1, int(num_records * 0.1))

    # Loop over records
    for i in range(num_records):
        
        text=data_df['text'].iloc[i]
        chunks=splitter.chunks(text)

        for chunk in chunks:
            results['text'].append(chunk)
            results['synthetic'].append(data_df['synthetic'].iloc[i])
            results['author'].append(data_df['author'].iloc[i])
            results['source'].append(data_df['source'].iloc[i])

        # Print percent complete at each checkpoint, measured against the
        # number of records actually being processed
        if i % completion_checkpoint == 0:
            percent_complete=(i/num_records)*100
            print(f'\nWorker {worker} is {percent_complete:.1f}% complete', end='')

    print(f'\nWorker {worker} finished, parsed {len(results["text"])} chunks', end='')
    return results

In [8]:
%%time

# Collect the results
chunks={
    'text': [],
    'synthetic': [],
    'author': [],
    'source': []
}

target_lengths=[16,32,64,128,256,512]

for target_length in target_lengths:

    # Get list of input files
    input_files=glob.glob(f'{config.INTERMEDIATE_DATA_PATH}/texts.*.parquet')

    # Instantiate pool with one worker per input file
    pool=mp.Pool(
        processes=len(input_files),
        maxtasksperchild=1
    )

    # Holder for returns from workers
    async_results=[]

    # Loop input files
    for i, data_file in enumerate(input_files):

        async_results.append(
            pool.apply_async(
                split_text,
                args=(
                    data_file,
                    target_length,
                    i,
                    0.1,
                )
            )
        )

    # Clean up
    pool.close()
    pool.join()

    # Get the results
    results=[async_result.get() for async_result in async_results]

    # Collect the results
    for result in results:
        for key, value in result.items():
            chunks[key].extend(value)
    
    print(f'\n\nFinished target length {target_length}')

print()


Worker 19 loaded texts.11.parquet
Worker 2 loaded texts.6.parquet
Worker 19 is 0.0% complete
Worker 2 is 0.0% complete
Worker 5 loaded texts.24.parquet
Worker 29 loaded texts.25.parquet
Worker 28 loaded texts.15.parquet
Worker 23 loaded texts.5.parquet
Worker 11 loaded texts.18.parquet
Worker 15 loaded texts.0.parquet
Worker 1 loaded texts.27.parquet
Worker 6 loaded texts.29.parquet
Worker 12 loaded texts.7.parquet
Worker 18 loaded texts.2.parquet
Worker 7 loaded texts.14.parquet
Worker 27 loaded texts.28.parquet
Worker 0 loaded texts.23.parquet
Worker 20 loaded texts.21.parquet
Worker 10 loaded texts.20.parquet
Worker 3 loaded texts.9.parquet
Worker 26 loaded texts.8.parquet
Worker 9 loaded texts.1.parquet
Worker 25 loaded texts.19.parquet
Worker 5 is 0.0% complete
Worker 17 loaded texts.17.parquet
Worker 13 loaded texts.16.parquet
Worker 4 loaded texts.13.parquet
Worker 14 loaded texts.4.parquet
Worker 24 loaded texts.3.parquet
Worker 22 loaded texts.12.parquet
Worker 21 loaded text
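
The collection step in the cell above, where each worker's dictionary of lists is appended to the shared `chunks` dictionary, can be sketched on its own. This is a minimal illustration, not the notebook's code: `merge_results` is a hypothetical helper and the two worker results are toy data. It also adds a sanity check that every column ends up the same length, which is what `pd.DataFrame` needs later.

```python
def merge_results(worker_results: list) -> dict:
    '''Merge per-worker dictionaries of lists into one dictionary of lists.'''

    merged = {'text': [], 'synthetic': [], 'author': [], 'source': []}

    for result in worker_results:
        for key, values in result.items():
            merged[key].extend(values)

    # All columns must have the same length to build a dataframe
    lengths = {len(values) for values in merged.values()}
    assert len(lengths) == 1, f'Column lengths differ: {lengths}'

    return merged


# Toy worker outputs, standing in for the return values of split_text()
worker_a = {'text': ['chunk one'], 'synthetic': [1], 'author': ['unknown_model'], 'source': ['a']}
worker_b = {'text': ['chunk two', 'chunk three'], 'synthetic': [0, 0], 'author': ['human', 'human'], 'source': ['b', 'b']}

merged = merge_results([worker_a, worker_b])
print(len(merged['text']))
```

Note that the order of the merged rows follows the order of the worker results, not the order in which workers finished.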

In [9]:
chunks_df=pd.DataFrame(chunks)
chunks_df.head()

| | text | synthetic | author | source |
|---|---|---|---|---|
| 0 | Citigroup Chairman and CEO Vikram Pandit is th... | 1 | unknown_model | yatsenko |
| 1 | International Monetary Fund. The views express... | 1 | unknown_model | yatsenko |
| 2 | (CNN) It looks as though the election of Donal... | 1 | unknown_model | yatsenko |
| 3 | the IMF out of the business of doing what it's... | 1 | unknown_model | yatsenko |
| 4 | And, to be clear, I have no illusions that the... | 1 | unknown_model | yatsenko |


## 4. Save results

### 4.1. Parquet shards

In [10]:
# Give it a shuffle
chunks_df=chunks_df.sample(frac=1)

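# Split the dataframe into shards by row position. This avoids calling
# np.array_split on a dataframe, which raises deprecation warnings in
# recent pandas versions, and guards against a shard count below 1
n_shards=max(1, mp.cpu_count() - 2)
shard_edges=np.linspace(0, len(chunks_df), n_shards + 1, dtype=int)

# Save each shard as parquet with a clean index
for i, (start, stop) in enumerate(zip(shard_edges[:-1], shard_edges[1:])):
    output_file=f'{config.INTERMEDIATE_DATA_PATH}/chunks.{i}.parquet'
    shard=chunks_df.iloc[start:stop].reset_index(drop=True)
    shard.to_parquet(output_file)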
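
How a dataframe's rows get divided into near-equal shards can be sketched in plain Python. The helper below, `shard_bounds`, is a hypothetical name introduced for illustration. It gives the first few shards one extra row when the row count does not divide evenly, which is the same convention `np.array_split` uses.

```python
def shard_bounds(n_rows: int, n_shards: int) -> list:
    '''Return (start, stop) row positions for n_shards near-equal shards.'''

    n_shards = max(1, n_shards)
    base, extra = divmod(n_rows, n_shards)

    bounds = []
    start = 0

    for i in range(n_shards):
        # The first `extra` shards each take one additional row
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop

    return bounds


print(shard_bounds(10, 3))
# [(0, 4), (4, 7), (7, 10)]
```

Each pair could then be used as `chunks_df.iloc[start:stop]` to pull out one shard before writing it to disk.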

### 4.2. Single JSON

In [11]:
# Convert the chunks data to dict
chunks_dict=chunks_df.to_dict(orient='list')

# Save it as JSON
with open(f'{config.INTERMEDIATE_DATA_PATH}/all_chunks.json', 'w', encoding='utf-8') as output_file:
    json.dump(chunks_dict, output_file, ensure_ascii=False, indent=4)