# Text splitting

From the text length distributions in the data exploration notebook, it's pretty clear that we need to break the text up. And, we need to do some clean-up, especially in the shorter length regimes. Seems to me the first thing to do is break up the text into shorter fragments. I'd like to try doing this with semantic/tokenization based splitting to get sentences, rather than arbitrary length text fragments that could be broken in the middle of a word or thought.

## Notebook setup

In [11]:
# Change working directory to parent so we can import as we would
# from the perplexity ratio score root directory
%cd ..

# Standard library imports
import glob
import time
import json
import multiprocessing as mp

# PyPI imports
import nltk
import numpy as np
import pandas as pd

# Internal imports
import configuration as config

# Download NLTK assets
nltk.download('punkt')

/mnt/arkk


[nltk_data] Downloading package punkt to
[nltk_data]     /home/siderealyear/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## 1. Load an example data file

In [2]:
data_file=f'{config.INTERMEDIATE_DATA_PATH}/texts.0.parquet'
data_df=pd.read_parquet(data_file)
data_df.head()

Unnamed: 0,text,synthetic,author,source
0,ad them all. Some of the best were the ones th...,0,human,yatsenko
1,ER (emergency) vets are more expensive than re...,1,unknown_model,yatsenko
2,"Hey, Mrs. Johnson! Here's my essay on how a pe...",1,unknown_model,grinberg
3,"This article is about the daemonic name, for t...",0,human,yatsenko
4,Budget For Server Desktop Upgrade Essay\n\nExe...,0,human,yatsenko


## 2. Test split a small batch of records

In [3]:
# Holder for results
results={
    'text': [],
    'synthetic': [],
    'author': [],
    'source': []
}

start_time=time.time()

for i in range(5000):
    
    text=data_df['text'].iloc[i]
    sentences=nltk.tokenize.sent_tokenize(text, language='english')

    for sentence in sentences:
        results['text'].append(sentence)
        results['synthetic'].append(data_df['synthetic'].iloc[i])
        results['author'].append(data_df['author'].iloc[i])
        results['source'].append(data_df['source'].iloc[i])

dT=time.time() - start_time
splitting_rate=i/dT
print(f'Split {i} records in {dT:.1f} seconds')
print(f'Splitting rate: {splitting_rate:.1f} records per second')

Split 4999 records in 7.1 seconds
Splitting rate: 700.7 records per second


OK, so ~600 records per second, single threaded, means about an hour and a half to split all 3.47 million records. If we parallelize it over the input files, we should be looking at about 6 minutes, assuming a linear speed-up. I'd like to collect the results back to the main process and then shuffle/split them again, so we end up with more approximately equal numbers of sentences in each file.

## 3. Parallel splitting

In [4]:
# Define the splitting function

def split_text(data_file: str=None, worker: int=0) -> dict:
    '''Function to parallelize NLTK based sentence splitting of
    text over input files. Meant to be called with multiprocessing
    worker. Take an input file string, loads the data, splits
    sentences, collects results in dictionary and returns dictionary.'''

    data_df=pd.read_parquet(data_file)
    print(f"\nWorker {worker} loaded: {data_file.split('/')[-1]}", end='')

    results={
        'text': [],
        'synthetic': [],
        'author': [],
        'source': []
    }

    for i in range(len(data_df)):
        
        text=data_df['text'].iloc[i]
        sentences=nltk.tokenize.sent_tokenize(text, language='english')

        for sentence in sentences:
            results['text'].append(sentence)
            results['synthetic'].append(data_df['synthetic'].iloc[i])
            results['author'].append(data_df['author'].iloc[i])
            results['source'].append(data_df['source'].iloc[i])

    print(f'\nWorker {worker} finished, parsed {len(sentences)} sentences', end='')
    return results

In [5]:
%%time

# Get list of input files
input_files=glob.glob(f'{config.INTERMEDIATE_DATA_PATH}/texts.*.parquet')

# Instantiate pool with one worker per input file
pool=mp.Pool(
    processes=len(input_files),
    maxtasksperchild=1
)

# Holder for returns from workers
async_results=[]

# Loop input files
for i, data_file in enumerate(input_files):

    async_results.append(pool.apply_async(split_text,args=(data_file,i,)))

# Clean up
pool.close()
pool.join()

# Get the results
results=[async_result.get() for async_result in async_results]

# Collect the results
sentences={
    'text': [],
    'synthetic': [],
    'author': [],
    'source': []
}

for result in results:
    for key, value in result.items():
        sentences[key].extend(value)

print()


Worker 13 loaded: texts.13.parquet
Worker 10 loaded: texts.10.parquet
Worker 5 loaded: texts.5.parquet
Worker 15 loaded: texts.15.parquet
Worker 9 loaded: texts.9.parquet
Worker 6 loaded: texts.6.parquet
Worker 1 loaded: texts.1.parquet
Worker 0 loaded: texts.0.parquet
Worker 7 loaded: texts.7.parquet
Worker 8 loaded: texts.8.parquet
Worker 11 loaded: texts.11.parquet
Worker 4 loaded: texts.4.parquet
Worker 12 loaded: texts.12.parquet
Worker 14 loaded: texts.14.parquet
Worker 3 loaded: texts.3.parquet
Worker 2 loaded: texts.2.parquet
Worker 12 finished, parsed 17 sentences
Worker 5 finished, parsed 1 sentences
Worker 10 finished, parsed 23 sentences
Worker 3 finished, parsed 7 sentences
Worker 0 finished, parsed 5 sentences
Worker 15 finished, parsed 13 sentences
Worker 2 finished, parsed 1 sentences
Worker 4 finished, parsed 5 sentences
Worker 13 finished, parsed 25 sentences
Worker 9 finished, parsed 29 sentences
Worker 7 finished, parsed 40 sentences
Worker 6 finished, parsed 1 sen

In [6]:
sentences_df=pd.DataFrame(sentences)
sentences_df.head()

Unnamed: 0,text,synthetic,author,source
0,ad them all.,0,human,yatsenko
1,Some of the best were the ones that weren't su...,0,human,yatsenko
2,The young ones.,0,human,yatsenko
3,They knew what they wanted but their fear held...,0,human,yatsenko
4,They were my favorite.,0,human,yatsenko


## 4. Save results

### 4.1. Parquet shards

In [9]:
# Give it a shuffle
sentences_df=sentences_df.sample(frac=1)

# Split the dataframe into 16 chunks
chunks=np.array_split(sentences_df, 16)

# Save each chunk as parquet with a clean index
for i, chunk in enumerate(chunks):
    output_file=f'{config.INTERMEDIATE_DATA_PATH}/sentences.{i}.parquet'
    chunk.reset_index(inplace=True, drop=True)
    chunk.to_parquet(output_file)

### 4.2. Single JSON

In [14]:
# Convert the sentences data to dict
sentences_dict=sentences_df.to_dict(orient='list')

# Save it as JSON
with open(f'{config.INTERMEDIATE_DATA_PATH}/all_sentences.json', 'w', encoding='utf-8') as output_file:
    json.dump(sentences_dict, output_file, ensure_ascii=False, indent=4)