# Semantic chunking

Next, we will try a semantic text splitter to chunk the text into blocks of up to 512 tokens.

## 1. Setup

We will use semantic-text-splitter from PyPI and the bert-base-uncased tokenizer from HuggingFace.

In [1]:
# Change working directory to parent so we can import as we would from __main__.py
print('Working directory: ', end='')
%cd ..

# Standard library imports
import time

# PyPI imports
import h5py
from semantic_text_splitter import TextSplitter
from tokenizers import Tokenizer

# Internal imports
import configuration as config

Working directory: /mnt/arkk/opensearch/semantic_search


ModuleNotFoundError: No module named 'semantic_text_splitter'

In [None]:
input_file=f'{config.DATA_PATH}/wikipedia/{config.BATCHED_TEXT}'

tokenizer_name='bert-base-uncased'
max_tokens=512

tokenizer=Tokenizer.from_pretrained(tokenizer_name)
splitter=TextSplitter.from_huggingface_tokenizer(tokenizer, max_tokens)

## 2. Load data

Load up the first batch from the data extractor.

In [None]:
input_data=h5py.File(input_file, 'r')
batch=input_data['batches/0']

sample_text=' '.join(batch[0].decode('utf-8').split(' ')[:100])

print(f'First batch contains {len(batch)} texts')
print(f"First text:\n{sample_text}")

## 3. Semantic chunking test

Test-split the first text from the first batch:

In [None]:
chunks=splitter.chunks(batch[0].decode('utf-8'))
print(f'Chunks are: {type(chunks)}')
print(f'Have {len(chunks)} chunks')

for chunk in chunks:
    chunk_start=' '.join(chunk.split(' ')[:25])
    chunk_end=' '.join(chunk.split(' ')[-25:])
    print(f'\n{chunk_start} ... {chunk_end}')

OK, pretty good. The splitter didn't break up any sentences. Some chunks start with a pronoun, so they would be a little unclear when read in isolation, but this approach is clearly much better than chunking by word count. We might want to think about a clever way to filter out chunks that contain only references.

It feels snappy too - let's time chunking a few batches and see what we are working with.

## 4. Semantic text chunking rate

In [None]:
%%time

# Number of batches to time chunking for
num_batches=1

# Holder to collect chunking times
chunking_times=[]

for i in range(num_batches):

    # Start the timer
    start_time=time.time()

    # Get the text batch
    batch=input_data[f'batches/{i}']

    # Chunk the records from the batch
    for record in batch:
        chunks=splitter.chunks(record.decode('utf-8'))

    # Stop the timer
    dT=time.time() - start_time
    chunking_times.append(dT)

mean_chunking_time=sum(chunking_times)/len(chunking_times)
print(f'Mean chunking time {mean_chunking_time:.1f} seconds per batch\n')

OK, here's the estimate: we have 688 batches, so it would take about 32 hours to chunk the whole thing serially.

If we parallelize it across our CPU cores, we should be able to get it down to under two hours.
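
To make the arithmetic behind that estimate explicit, here is a minimal sketch. It assumes ideal linear scaling across workers, which real multiprocessing will not quite reach. The function name `estimate_hours` and the worker count of 18 are hypothetical; the per-batch time is simply worked backwards from the 32 hour figure for 688 batches, not a fresh measurement.

```python
import math


def estimate_hours(num_batches: int, mean_batch_seconds: float, workers: int = 1) -> float:
    """Estimate wall-clock hours, assuming ideal linear scaling across workers."""

    # The slowest worker handles at most ceil(num_batches / workers) batches
    batches_per_worker = math.ceil(num_batches / workers)

    return batches_per_worker * mean_batch_seconds / 3600


# Per-batch time implied by the 32 hour serial estimate for 688 batches
implied_batch_seconds = 32 * 3600 / 688

# Hypothetical worker count, chosen only for illustration
serial_hours = estimate_hours(688, implied_batch_seconds)
parallel_hours = estimate_hours(688, implied_batch_seconds, workers=18)

print(f'Serial: {serial_hours:.1f} hours')
print(f'18 workers: {parallel_hours:.1f} hours')
```

Under these assumptions, 16 workers would land right at two hours, so getting under two hours needs more than 16 workers, or a faster per-batch time than the one implied above.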

## 5. Reference chunk filter

I have a clever idea about how to skip chunks that are mostly or entirely references from the article. Take a look at the following chunk, which contains only references:

```text
Theatricalia. Retrieved 26 July 2024. "Black Coffee". Theatricalia. Retrieved 26 July 2024. "Alfred Marks". Theatricalia. Retrieved 26 July 2024. At the Hercule Poirot Central website ... story". Agatha Christie. 10 October 2017. Retrieved 5 January 2019. "The Murder on the Links". latw.org. Retrieved 31 January 2022. "The Brasserie Ellezelloise's Hercule". Brasserie-ellezelloise.be.
```

It contains the word 'Retrieved' with a capital 'R' many times. I would guess much more often than the actual article text does.

Here's another one:

```text
Christie, Agatha (3 October 2006b). Three Act Tragedy. HarperCollins. ISBN 978-0-06-175403-6. Christie, Agatha (17 March 2009) [1926]. The Murder of Roger Ackroyd. HarperCollins. ISBN 978-0-06-176340-3. Christie, Agatha ... (9 July 2013a). The Lost Mine: A Hercule Poirot Story. HarperCollins. ISBN 978-0-06-229818-8. Christie, Agatha (23 July 2013b). Double Sin: A Hercule Poirot Story. HarperCollins. ISBN 978-0-06-229845-4.
```

This one doesn't have the 'Retrieved' signal, but it does contain a bunch of ISBNs. Same idea - more than maybe 2 ISBNs means references, not text.
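
As a rough sketch of how those two signals could be combined, the snippet below counts occurrences of 'Retrieved' and 'ISBN' in a chunk and flags the chunk when either count passes a threshold. The function name `looks_like_references` and both threshold values are assumptions for illustration; the only number suggested above is "more than maybe 2" for ISBNs, and the 'Retrieved' threshold is simply set to match.

```python
import re

# Assumed thresholds: "more than maybe 2" ISBNs comes from the notes above,
# the 'Retrieved' threshold is a guess set to the same value
MAX_RETRIEVED = 2
MAX_ISBNS = 2

# Case-sensitive on purpose: citations use 'Retrieved' with a capital 'R'
RETRIEVED_PATTERN = re.compile(r'\bRetrieved\b')
ISBN_PATTERN = re.compile(r'\bISBN\b')


def looks_like_references(chunk: str) -> bool:
    """Return True if the chunk has more citation markers than the thresholds allow."""

    retrieved_count = len(RETRIEVED_PATTERN.findall(chunk))
    isbn_count = len(ISBN_PATTERN.findall(chunk))

    return retrieved_count > MAX_RETRIEVED or isbn_count > MAX_ISBNS


# Usage: keep only the chunks that do not look like reference lists
example_chunks = [
    'Poirot retrieved the letter and read it twice.',
    'Theatricalia. Retrieved 26 July 2024. "Black Coffee". Theatricalia. '
    'Retrieved 26 July 2024. "Alfred Marks". Theatricalia. Retrieved 26 July 2024.',
]
kept_chunks = [chunk for chunk in example_chunks if not looks_like_references(chunk)]
print(f'Kept {len(kept_chunks)} of {len(example_chunks)} chunks')
```

The thresholds would need tuning against real chunks, since an ordinary paragraph can legitimately mention an ISBN or use the word 'Retrieved' at the start of a sentence.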

We could likely use just those two signals to get rid of a lot of reference chunks. But, now that I'm thinking about it, that breaks our rule that only the text extractor can be tailored to the data source. This is a good idea, but I think we need to try and do it at the level of the extractor.
