# Semantic splitting

Next, we are going to try using a semantic text splitter to chunk the text into blocks of at most 512 tokens. We will use semantic-text-splitter from PyPI and the bert-base-uncased tokenizer from HuggingFace.

**Estimated total run time**: extraction + cleaning + splitting = approximately 3 hours

**TLDR**
1. semantic_text_splitter seems to do a fine job of splitting the text into smaller chunks.
2. Splitting with a target size of 512 tokens gives us 21 million chunks.
3. Splitting the whole corpus with one worker will take about 30 hours, but we can get the net run time under two hours by parallelizing it over batches.

## 1. Run set-up

### 1.1. Imports

In [None]:
# Change working directory to parent so we can import as we would from __main__.py
print('Working directory: ', end='')
%cd ..

# Standard library imports
import time

# PyPI imports
import h5py
import matplotlib.pyplot as plt
from semantic_text_splitter import TextSplitter
from tokenizers import Tokenizer

# Internal imports
import configuration as config

### 1.2. Notebook parameters

In [2]:
# Total line and record counts determined in data exploration notebook
line_count=13778448
record_count=6889224

# Tokenizer & splitter
tokenizer_name='bert-base-uncased'
max_tokens=512

tokenizer=Tokenizer.from_pretrained(tokenizer_name)
splitter=TextSplitter.from_huggingface_tokenizer(tokenizer, max_tokens)

# Where to save plots
figure_dir='./notebooks/figures/02-semantic_splitting'

### 1.3. Data loading

Load up the first batch from the data extractor.

In [None]:
input_file=f'{config.DATA_PATH}/wikipedia-sample/{config.BATCHED_TEXT}'
input_data=h5py.File(input_file, 'r')
batch=input_data['batches/1']

sample_text=' '.join(batch[0].decode('utf-8').split(' ')[:100])

print(f'First batch contains {len(batch)} texts\n')
print(f"Sample text:\n{sample_text}")

OK, looks fine - we still have some character-level garbage in there, but that will be cleaned up during the data transform task.

## 3. Semantic splitting test

Test split the first text from the first batch:

In [None]:
chunks=splitter.chunks(batch[0].decode('utf-8'))
print(f'Have {len(chunks)} chunks')

for i, chunk in enumerate(chunks[:5]):
    chunk_start=' '.join(chunk.split(' ')[:25])
    chunk_end=' '.join(chunk.split(' ')[-25:])
    print(f'\n{i}: {chunk_start} ... {chunk_end}')

OK, pretty good. We didn't break up any sentences. Some chunks start with a pronoun, so they would be a little unclear to read in isolation. But this approach is obviously much better than taking chunks by word count.
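For contrast, here is a minimal sketch of the naive word-count approach mentioned above. It is not part of this pipeline; the function name `chunk_by_word_count` and the sample sentence are made up for illustration.

```python
def chunk_by_word_count(text: str, words_per_chunk: int) -> list[str]:
    '''Cut text into chunks of a fixed number of words, ignoring sentence boundaries.'''
    words = text.split()
    return [
        ' '.join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


# Hypothetical sample text, chosen so the cut lands mid-sentence
sample = 'The cat sat on the mat. It then fell asleep in the sun.'

for chunk in chunk_by_word_count(sample, 5):
    print(chunk)
```

The first chunk ends at "the" and the second starts with "mat.", so a fixed word count cuts straight through the first sentence. The semantic splitter avoids this by preferring natural boundaries such as sentence ends.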

It feels snappy too - let's time splitting a few batches and see what we are working with.

## 4. Semantic splitting rate

### 4.1. Benchmark specific parameters

In [None]:
# Number of batches to time splitting for
num_batches=100

# Holder to collect splitting rates
splitting_rates=[]

# Also, collect the number of chunks we get from each article
# so we can get an average at the end
chunks_per_article=[]

### 4.2. Benchmark

In [None]:
%%time

for i in range(num_batches):

    # Start the timer
    start_time=time.time()

    # Get the text batch
    batch=input_data[f'batches/{i + 1}']

    # Split the records from the batch
    for record in batch:
        chunks=splitter.chunks(record.decode('utf-8'))

        # Collect the chunk count
        chunks_per_article.append(len(chunks))

    # Stop the timer
    dT=time.time() - start_time

    # Collect the split rate
    splitting_rates.append(len(batch) / dT)

### 4.3. Splitting rate results

In [None]:
plt.title('Semantic splitting rate benchmark')
plt.hist(splitting_rates, bins=25)
plt.xlabel('Mean replicate splitting rate (articles per second)')
plt.ylabel('Count')
plt.savefig(f'{figure_dir}/4.3-semantic_splitting_rate.jpg')
plt.show()

mean_splitting_rate=sum(splitting_rates) / len(splitting_rates)
print(f'\nEstimated total splitting time: {((record_count / mean_splitting_rate) / (60**2)):.1f} hours')
print(f'Mean splitting rate {mean_splitting_rate:.1f} records per second')

### 4.4. Splitting yield results

In [None]:
plt.title('Semantic splitting yield')
plt.hist(chunks_per_article, bins=25)
plt.xlabel('Chunks per article')
plt.ylabel('Count')
plt.savefig(f'{figure_dir}/4.4-semantic_splitting_yield.jpg')
plt.show()

mean_chunks_per_article=sum(chunks_per_article) / len(chunks_per_article)
estimated_total_chunks=record_count * mean_chunks_per_article
print(f'\nEstimated total chunks: {estimated_total_chunks:.0f}')
print(f'Mean chunks per article {mean_chunks_per_article:.1f}\n')

## 5. Conclusion

Single-threaded semantic splitting runs at about 60 records per second, or approximately 30 hours to split all of Wikipedia. Parallelizing that over 18 cores gives us a net splitting time of about an hour and 45 minutes.
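A rough sketch of what parallelizing over batches could look like, under some loud assumptions. The names `count_chunks` and `count_chunks_parallel` are hypothetical and do not come from the pipeline, and `split_fn` stands in for `splitter.chunks`. The demo uses threads and a whitespace splitter so that it runs anywhere. With a process pool the callable has to be picklable, which the splitter object may not be, so a real worker would probably need to build its own tokenizer and splitter.

```python
from concurrent.futures import Executor, ThreadPoolExecutor


def count_chunks(texts, split_fn):
    '''Split every text in one batch; return the chunk count per text.'''
    return [len(split_fn(text)) for text in texts]


def count_chunks_parallel(batches, split_fn, executor: Executor):
    '''Submit one task per batch and collect the results in batch order.'''
    futures = [executor.submit(count_chunks, batch, split_fn) for batch in batches]
    return [future.result() for future in futures]


# Toy batches and str.split are placeholders for the real HDF5 batches
# and for splitter.chunks (both assumptions for this demo).
toy_batches = [['a b', 'c'], ['d e f']]

with ThreadPoolExecutor(max_workers=2) as pool:
    counts = count_chunks_parallel(toy_batches, str.split, pool)

print(counts)
```

Passing the executor in keeps the batch logic independent of how the work is distributed. To spread the work over cores, the same function could be handed a `ProcessPoolExecutor(max_workers=18)` instead.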

Combining the semantic splitting time with the net one-hour run time of the parallelized extractor/parser/cleaner functions, we are looking at about 3 hours, best case, for extraction, cleaning and splitting to complete. All in all, not terrible. We can live with an overnight run, especially considering that the first iteration of this pipeline would have taken over 8 days to run!
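The run time estimates above can be reproduced with a few lines of arithmetic. This sketch uses the rounded figures quoted in this section (60 records per second, 18 workers, one hour for extraction and cleaning) and assumes ideal linear scaling across workers, which real runs will not quite reach. The helper name `splitting_hours` is made up for illustration.

```python
record_count = 6889224      # from the data exploration notebook
single_worker_rate = 60     # records per second, rounded benchmark result
workers = 18
extraction_hours = 1.0      # net run time of the parallelized extractor/parser/cleaner


def splitting_hours(records, rate, n_workers=1):
    '''Hours to split all records, assuming perfect scaling over n_workers.'''
    return records / (rate * n_workers) / 60**2


single = splitting_hours(record_count, single_worker_rate)
parallel = splitting_hours(record_count, single_worker_rate, workers)
total = extraction_hours + parallel

print(f'Single worker: {single:.1f} hours')
print(f'{workers} workers: {parallel:.2f} hours ({parallel * 60:.0f} minutes)')
print(f'Extraction + cleaning + splitting: {total:.1f} hours')
```

The parallel figure comes out a little over an hour and 45 minutes, and the combined total rounds up to the 3 hours quoted above.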