# Semantic splitting

Last approach to try for text splitting is semantic splitting with semantic-text-splitter from PyPI. Hopefully, this will combine the best aspects of sentence splitting and word length based 'dumb' splitting. We will have some control over the output chunk size and splits will occur in more rational places that a simple arbitrary word length.

## Notebook setup

In [1]:
# Change working directory to parent so we can import as we would
# from the perplexity ratio score root directory
%cd ..

# Standard library imports
import glob
import time
import json
import multiprocessing as mp

# PyPI imports
import numpy as np
import pandas as pd
from semantic_text_splitter import TextSplitter
from tokenizers import Tokenizer

# Internal imports
import configuration as config

/mnt/arkk/llm_detector/perplexity_ratio_score


## 1. Load an example data file

In [2]:
data_file=f'{config.INTERMEDIATE_DATA_PATH}/texts.0.parquet'
data_df=pd.read_parquet(data_file)
data_df.head()

Unnamed: 0,text,synthetic,author,source
0,ad them all. Some of the best were the ones th...,0,human,yatsenko
1,ER (emergency) vets are more expensive than re...,1,unknown_model,yatsenko
2,"Hey, Mrs. Johnson! Here's my essay on how a pe...",1,unknown_model,grinberg
3,"This article is about the daemonic name, for t...",0,human,yatsenko
4,Budget For Server Desktop Upgrade Essay\n\nExe...,0,human,yatsenko


## 2. Test split a small batch of records

In [3]:
# Holder for results
results={
    'text': [],
    'synthetic': [],
    'author': [],
    'source': []
}


# Tokenizer & splitter
tokenizer_name='bert-base-uncased'
max_tokens=512

tokenizer=Tokenizer.from_pretrained(tokenizer_name)
splitter=TextSplitter.from_huggingface_tokenizer(tokenizer, max_tokens)

start_time=time.time()

for i in range(1000):
    
    text=data_df['text'].iloc[i]
    chunks=splitter.chunks(text)

    for chunk in chunks:
        results['text'].append(chunk)
        results['synthetic'].append(data_df['synthetic'].iloc[i])
        results['author'].append(data_df['author'].iloc[i])
        results['source'].append(data_df['source'].iloc[i])

dT=time.time() - start_time
splitting_rate=(i + 1)/dT
print(f'Split {i} records in {dT:.1f} seconds')
print(f'Splitting rate: {splitting_rate:.1f} records per second')

Split 999 records in 4.7 seconds
Splitting rate: 213.9 records per second


Slower than the NLTK sentence splitter, which ran at ~600 records per second. But, it's still tractable. At ~200 records per second, it should take about four and a half hours to split all 3.47 million records, or about 30 minutes when parallelized over 16 input files. Let's do it.

## 3. Parallel splitting

In [9]:
# Define the splitting function

def split_text(data_file: str=None, target_size: int=512, worker: int=0) -> dict:
    '''Function to parallelize semantic splitting of text over input files. 
    Meant to be called with multiprocessing worker. Take an input file 
    string, loads the data, splits sentences, collects results in dictionary
    and returns dictionary.'''

    data_df=pd.read_parquet(data_file)
    print(f"\nWorker {worker} loaded: {data_file.split('/')[-1]}", end='')

    results={
        'text': [],
        'synthetic': [],
        'author': [],
        'source': []
    }

    # Tokenizer & splitter
    tokenizer_name='bert-base-uncased'
    tokenizer=Tokenizer.from_pretrained(tokenizer_name)
    splitter=TextSplitter.from_huggingface_tokenizer(tokenizer, target_size)

    for i in range(len(data_df)):
        
        text=data_df['text'].iloc[i]
        chunks=splitter.chunks(text)

        for chunk in chunks:
            results['text'].append(chunk)
            results['synthetic'].append(data_df['synthetic'].iloc[i])
            results['author'].append(data_df['author'].iloc[i])
            results['source'].append(data_df['source'].iloc[i])

    print(f'\nWorker {worker} finished, parsed {len(chunks)} chunks', end='')
    return results

In [10]:
%%time

# Get list of input files
input_files=glob.glob(f'{config.INTERMEDIATE_DATA_PATH}/texts.*.parquet')

# Instantiate pool with one worker per input file
pool=mp.Pool(
    processes=len(input_files),
    maxtasksperchild=1
)

# Holder for returns from workers
async_results=[]

# Loop input files
for i, data_file in enumerate(input_files):

    async_results.append(pool.apply_async(split_text,args=(data_file,512,i,)))

# Clean up
pool.close()
pool.join()

# Get the results
results=[async_result.get() for async_result in async_results]

# Collect the results
chunks={
    'text': [],
    'synthetic': [],
    'author': [],
    'source': []
}

for result in results:
    for key, value in result.items():
        chunks[key].extend(value)

print()


Worker 10 loaded: texts.10.parquet
Worker 13 loaded: texts.13.parquet
Worker 5 loaded: texts.5.parquet
Worker 15 loaded: texts.15.parquet
Worker 9 loaded: texts.9.parquet
Worker 4 loaded: texts.4.parquet
Worker 0 loaded: texts.0.parquet
Worker 7 loaded: texts.7.parquet
Worker 14 loaded: texts.14.parquet
Worker 6 loaded: texts.6.parquet
Worker 1 loaded: texts.1.parquet
Worker 3 loaded: texts.3.parquet
Worker 11 loaded: texts.11.parquet
Worker 2 loaded: texts.2.parquet
Worker 8 loaded: texts.8.parquet
Worker 12 loaded: texts.12.parquet
Worker 15 finished, parsed 1 sentences
Worker 13 finished, parsed 1 sentences
Worker 5 finished, parsed 1 sentences
Worker 11 finished, parsed 1 sentences
Worker 3 finished, parsed 1 sentences
Worker 14 finished, parsed 1 sentences
Worker 2 finished, parsed 1 sentences
Worker 4 finished, parsed 1 sentences
Worker 6 finished, parsed 1 sentences
Worker 9 finished, parsed 2 sentences
Worker 10 finished, parsed 1 sentences
Worker 8 finished, parsed 2 sentence

In [11]:
chunks_df=pd.DataFrame(chunks)
chunks_df.head()

Unnamed: 0,text,synthetic,author,source
0,ad them all. Some of the best were the ones th...,0,human,yatsenko
1,ER (emergency) vets are more expensive than re...,1,unknown_model,yatsenko
2,"Hey, Mrs. Johnson! Here's my essay on how a pe...",1,unknown_model,grinberg
3,"This article is about the daemonic name, for t...",0,human,yatsenko
4,Budget For Server Desktop Upgrade Essay\n\nExe...,0,human,yatsenko


## 4. Save results

### 4.1. Parquet shards

In [7]:
# Give it a shuffle
chunks_df=chunks_df.sample(frac=1)

# Split the dataframe into 16 shards
chunk_shards=np.array_split(chunks_df, 16)

# Save each chunk as parquet with a clean index
for i, chunk in enumerate(chunk_shards):
    output_file=f'{config.INTERMEDIATE_DATA_PATH}/chunks.{i}.parquet'
    chunk.reset_index(inplace=True, drop=True)
    chunk.to_parquet(output_file)

### 4.2. Single JSON

In [8]:
# Convert the sentences data to dict
chunks_dict=chunks_df.to_dict(orient='list')

# Save it as JSON
with open(f'{config.INTERMEDIATE_DATA_PATH}/all_chunks.json', 'w', encoding='utf-8') as output_file:
    json.dump(chunks_dict, output_file, ensure_ascii=False, indent=4)