# Data exploration

Let's take a look at the data we got from the Wikipedia CirrusSearch dump. The gzip 'content' file is 34 GB on disk, so I don't think that we want to try decompressing and reading the whole thing into memory unless we have to. Let's see if we can stream the data from the gzip archive and take a look at what we have.

# 1. Setup

In [1]:
import json
from gzip import GzipFile

In [2]:
sample_records=5
data_file_path='./data/enwiki-20240930-cirrussearch-content.json.gz'

# 2. Load data sample

In [None]:
# Load a few records for inspection
file_stream=GzipFile(data_file_path)
records = []

for i in range(sample_records):

    line=next(file_stream)
    record=json.loads(line)
    records.append(record)

print(f'Loaded {len(records)} records from gzip archive.')
print(f'Record is: {type(records[0])}')

print(f'\nRecord 0 contains:')

for key, value in records[0].items():
    print(f' {key}: {value}')


OK, looks like the first line in the file is just some metadata. Let's look at the second record.

In [None]:
print(f'\nRecord 1 keys:')

for key in records[1].keys():
    print(f' {key}')

More like what we were expecting. We have keys for title, text, timestamp, language, and even a popularity score, which I didn't know Wikipedia had. Here is the title and some of the text from this first article record:

In [None]:
text_sample = ' '.join(records[1]['text'].split(' ')[:100])

print(f"Title: {records[1]['title']}")
print(f'Text\n{text_sample}')

Yep, that's a Wikipedia article. Good, I think we can work with this. Since the unzipped files are too big to fit in memory, let's first see if there is any time benefit to streaming the data from the gzip archive vs the decompressed JSON file.

# 3. Record read rate
We have two options to stream the data:

1. From the gzip archive
2. From the decompressed JSON

Which one is faster, and by how much? Reading straight from the archive has the obvious advantage of skipping the decompression step.
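
One way to compare the two sources without waiting for a full pass is to time only the first N records from each. The helper below is a minimal sketch and not part of the original notebook: `time_records` and `max_records` are hypothetical names, and the early records may not be representative of the whole dump.

```python
import gzip
import json
import time
from itertools import islice


def time_records(path, max_records=10_000):
    """Parse up to max_records JSON lines from path and return (count, seconds).

    Paths ending in .gz are streamed through gzip; anything else is read
    as plain line-delimited JSON.
    """
    opener = gzip.open if path.endswith('.gz') else open
    count = 0
    start = time.perf_counter()

    with opener(path, 'rb') as stream:
        # islice stops after max_records lines without reading the rest
        for line in islice(stream, max_records):
            json.loads(line)
            count += 1

    return count, time.perf_counter() - start
```

Calling it once on the `.json.gz` path and once on a decompressed copy gives two `(count, seconds)` pairs, and `count / seconds` is the records-per-second figure to compare.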

In [None]:
%%time

file_stream=GzipFile(data_file_path)

record_count=0
no_text_records=0
text_word_counts=[]

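# Stream every record, counting words per article and records without text
for line in file_stream:

    record_count+=1
    record=json.loads(line)

    if 'text' in record:
        article_text=record['text']

        text_word_count=len(article_text.split(' '))
        text_word_counts.append(text_word_count)

    else:
        no_text_records+=1

print(f'Read {record_count} records, {no_text_records} without text.')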

While this is running, bond0 is only seeing 14-15 MiB per second from the array. At that rate, visiting the whole file, assuming ~160 GB of data in total, could take up to 22 hrs! Maybe we should try decompressing it first for processing. Regardless, this is a huge bottleneck, so we will definitely want to parallelize the data preparation as much as possible. I think the strategy should be the following:

1. Definitely use the fast scratch NVMe SSD, not the NFS RAID.
2. Decompress the file using Linux's gzip.
4. Split the file with the Linux split command.
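
As a sanity check on the read-time estimate that motivates this plan, the arithmetic is easy to redo. This is only a sketch: `hours_to_read` is a hypothetical helper, the ~160 GB total is the assumption stated above, and treating the rate as possibly megabits instead of mebibytes is also an assumption, since these notes do not say how the rate was measured.

```python
def hours_to_read(total_bytes, bytes_per_second):
    """Hours needed to read total_bytes at a sustained bytes_per_second."""
    return total_bytes / bytes_per_second / 3600


TOTAL_BYTES = 160e9   # ~160 GB, as assumed in the text
MIB = 1024 ** 2       # bytes per mebibyte

for rate in (14, 15):
    as_bytes = hours_to_read(TOTAL_BYTES, rate * MIB)      # rate read as MiB/s
    as_bits = hours_to_read(TOTAL_BYTES, rate * MIB / 8)   # rate read as Mibit/s
    print(f'{rate} MiB/s: {as_bytes:.1f} h | {rate} Mibit/s: {as_bits:.1f} h')
```

Read as mebibytes per second, the full pass comes out near 3 hours. The 22-hour figure is closer to what 15 mebibits per second would give. Either way, a single-stream read is slow enough to justify parallelizing.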

At this point we need to think about what format we want to keep intermediate data in. HDF5 via h5py supports many readers operating on the same file out of the box. To run multiple writers, you need to use mpi4py and compile HDF5 and h5py with MPI support. I would still like to end up with an HDF5 file at the end of the pre-processing, even if it contains multiple batches. But given the complexity of multiple writers, it may make more sense to shard the data as parquet or something similar. This might deserve a little more research.

OK, let's do it this way:

5. Read the chunks in parallel.
6. Extract title and text.
7. Semantically chunk text down to ~512 tokens per chunk.
8. Save processed batches as parquet.
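
Steps 4 through 6 might look something like the sketch below, using only the standard library. Everything here is an assumption, not the final pipeline: the function names, the shard naming scheme, and the choice of `ProcessPoolExecutor` are illustrative, the splitting is done in Python instead of the Linux split command, and a real worker would write its batch to parquet instead of returning it.

```python
import json
import os
from concurrent.futures import ProcessPoolExecutor


def split_lines(src_path, out_dir, lines_per_shard=100_000):
    """Split a line-delimited file into shards of at most lines_per_shard lines."""
    os.makedirs(out_dir, exist_ok=True)
    shard_paths = []
    shard = None

    with open(src_path, 'rb') as src:
        for i, line in enumerate(src):
            # Start a new shard file every lines_per_shard lines
            if i % lines_per_shard == 0:
                if shard is not None:
                    shard.close()
                path = os.path.join(out_dir, f'shard_{len(shard_paths):05d}.json')
                shard = open(path, 'wb')
                shard_paths.append(path)
            shard.write(line)

    if shard is not None:
        shard.close()

    return shard_paths


def extract_title_text(shard_path):
    """Return (title, text) pairs for every record in a shard that has both keys."""
    pairs = []

    with open(shard_path, 'rb') as shard:
        for line in shard:
            record = json.loads(line)
            # Records without both keys (e.g. metadata lines) are skipped
            if 'title' in record and 'text' in record:
                pairs.append((record['title'], record['text']))

    return pairs


def process_shards(shard_paths, workers=4):
    """Run extract_title_text over the shards in separate processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_title_text, shard_paths))
```

In a notebook, the worker function may need to live in an importable module for `ProcessPoolExecutor` to pickle it, depending on the platform's process start method.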

Looks like [semantic-text-splitter](https://pypi.org/project/semantic-text-splitter/) will be the way to do the chunking. I don't know how fast it will be, but we can always fall back on dumb word count splitting if need be.
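
The word-count fallback is simple enough to sketch now. This is a hypothetical helper, not the semantic-text-splitter API, and it counts whitespace-separated words as a rough stand-in for tokens, so a 512-word chunk can exceed 512 model tokens.

```python
def split_by_word_count(text, max_words=512):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        ' '.join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


chunks = split_by_word_count('one two three four five', max_words=2)
print(chunks)  # ['one two', 'three four', 'five']
```

Unlike a semantic splitter, this cuts wherever the count runs out, including mid-sentence, so it is only a baseline.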

OK, think we have a good plan and some work to do.