# Subset Selection on Text Chunks

This notebook demonstrates how to perform subset selection on a set of text chunks specified in a `chunks.jsonl` file, with an example included in the `data/` subdirectory.

**Note:** a GPU is required to run this in a reasonable amount of time, and **only Linux is supported** because of the `faiss` dependency.

## Load the dataset

First, we load the `chunks.jsonl` file. Each line is expected to be a JSON object with the text chunk stored under the key `chunk`; all other key/value pairs are treated as metadata and are preserved throughout this process.

Example `chunks.jsonl` content:

```
{"chunk": "this is the first chunk....", "created_date": "01-01-1970", ... }
{"chunk": "this is the second chunk....", "created_date": "01-01-1971", ... }
```

In [None]:
import json

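# Index each record by its chunk text so the metadata can be recovered later
chunk_lookup = {}
with open('data/chunks.jsonl', encoding='utf-8') as fin:
    for line in fin:
        if not line.strip():
            continue  # skip blank lines
        chunk_json = json.loads(line)
        chunk_lookup[chunk_json['chunk']] = chunk_json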

print(f'Read {len(chunk_lookup)} chunks')
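
Because `chunk_lookup` is keyed by the chunk text, two records with identical text collapse into a single entry, and a record without a `chunk` key raises a `KeyError`. The following is a minimal sketch of a pre-check for both cases; the helper name `summarize_chunks` and the sample records are hypothetical and only serve as illustration.

```python
import json
from collections import Counter

def summarize_chunks(lines):
    """Count unique chunks, duplicate chunks, and records lacking a 'chunk' key."""
    texts = []
    missing = 0
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        if 'chunk' in record:
            texts.append(record['chunk'])
        else:
            missing += 1
    counts = Counter(texts)
    # Every occurrence beyond the first counts as a duplicate
    duplicates = sum(n - 1 for n in counts.values())
    return {'unique': len(counts), 'duplicates': duplicates, 'missing_chunk_key': missing}

# Hypothetical sample data, not taken from data/chunks.jsonl
sample = [
    '{"chunk": "a", "created_date": "01-01-1970"}',
    '{"chunk": "a", "created_date": "01-01-1971"}',
    '{"chunk": "b"}',
    '{"text": "no chunk key"}',
]
print(summarize_chunks(sample))
```

If the summary reports duplicates, the metadata of the last matching record is the one that ends up in `chunk_lookup`.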

## Set up subset selection environment

We begin by cloning the [DataCurate4LLMs](https://github.com/krishnatejakk/DataCurate4LLMs) repository and changing into that directory.

The project's `requirements.txt` lists `faiss-gpu`, which we replace locally with `faiss-gpu-cu12` before installing the dependencies.

Finally, we create an `output/` folder to hold our work.

In [None]:
!git clone --depth 1 https://github.com/krishnatejakk/DataCurate4LLMs.git
%cd DataCurate4LLMs

In [None]:
!sed -i -e 's ^faiss-gpu$ faiss-gpu-cu12 g' requirements.txt # fix faiss dependency; yes you can use spaces as delimiters for sed expressions
!pip install -qq -r requirements.txt
!pip install -qq submodlib-py
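
The `sed` expression above only rewrites lines that are exactly `faiss-gpu`, because of the `^` and `$` anchors. As an illustration, the same substitution can be sketched in Python; the function name `pin_faiss` and the sample requirements text are hypothetical.

```python
import re

def pin_faiss(requirements_text):
    """Replace lines that are exactly 'faiss-gpu' with 'faiss-gpu-cu12'."""
    # re.MULTILINE makes ^ and $ match at each line boundary, like sed's per-line processing
    return re.sub(r'^faiss-gpu$', 'faiss-gpu-cu12', requirements_text, flags=re.MULTILINE)

# Hypothetical requirements content for illustration only
sample = "numpy\nfaiss-gpu\ntorch\n"
print(pin_faiss(sample))
```

A line such as `faiss-gpu==1.7.2` would be left untouched by both the `sed` command and this sketch, since it does not match the whole-line pattern.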

In [None]:
import os

os.makedirs('output', exist_ok=True)

## Configuration

Next, we set up a configuration file for subset selection. We choose an embedding model, specify a simple template that passes each chunk to the encoder unmodified, and pick a random seed. Because the seed is generated at run time, the configuration is built dynamically and then saved to disk.

The saved configuration is then displayed below for inspection and verification.

In [None]:
import json
import random

# A fresh seed per run; it is written into the config so the run can be reproduced
seed = random.randint(0, 10000)

config = {
    "output_dir": "output",
    "encoder_model": "BAAI/bge-large-en-v1.5",
    "encoder_type": "bge",
    "instruction": (
        "Generate embeddings that capture the semantic meaning of text segments "
        "across multiple domains, ensuring generalization and suitability for "
        "clustering based on semantic similarity."
    ),
    "query_description": "default",
    "templates": {"default": "{{ chunk }}"},
    "template_name": "default",
    "num_folds": 1,
    "num_gpus": 1,
    "subset_sizes": ["5"],
    "epsilon": 0.01,
    "seed": seed,
}

config_path = 'subset_config.json'
with open(config_path, 'w') as f:
    json.dump(config, f, indent=4)

print(json.dumps(config, indent=4))
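
Before launching a long GPU job, it can help to confirm that the saved file is valid JSON and that `template_name` refers to an entry in `templates`. The sketch below is one way to do this; the `REQUIRED_KEYS` set is an assumption based on the keys used in this notebook, not the project's actual schema, and `check_config` is a hypothetical helper.

```python
import json

# Assumed minimal key set, taken from the config in this notebook (not an official schema)
REQUIRED_KEYS = {'output_dir', 'encoder_model', 'templates', 'template_name', 'subset_sizes', 'seed'}

def check_config(config_text):
    """Parse the config text and run basic consistency checks."""
    config = json.loads(config_text)  # raises json.JSONDecodeError on malformed JSON
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f'missing config keys: {sorted(missing)}')
    if config['template_name'] not in config['templates']:
        raise ValueError(f"template {config['template_name']!r} is not defined in 'templates'")
    return config

# Hypothetical config text; in the notebook this would be the contents of subset_config.json
sample_config = json.dumps({
    'output_dir': 'output',
    'encoder_model': 'BAAI/bge-large-en-v1.5',
    'templates': {'default': '{{ chunk }}'},
    'template_name': 'default',
    'subset_sizes': ['5'],
    'seed': 42,
})
print(check_config(sample_config)['template_name'])
```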

## Perform subset selection

Next, we run the algorithm, which writes the 5 selected chunks to disk under `output/`. A list of all supported parameters is given in the [`DataCurate4LLMs` project `README`](https://github.com/krishnatejakk/DataCurate4LLMs/blob/main/README.md).

In [None]:
!python data_subset_selection.py --input_files '../data/chunks.jsonl' --output_dir 'output' --config 'subset_config.json' --retry_delay 1 --subset_sizes 5 --num_gpus 1

## Convert output back to `chunks.jsonl` format

Finally, we read in the selected chunks, match them to the original records loaded from `chunks.jsonl` in order to preserve metadata, and save the final subset to a file in the same format as `chunks.jsonl`.

In [None]:
import json

selected_path = 'output/selected_chunks.jsonl'

with open('output/chunks/chunks_samples_5_subset.jsonl', encoding='utf-8') as fin, \
        open(selected_path, 'w', encoding='utf-8') as fout:
    for line in fin:
        selected_chunk = json.loads(line)['chunk']
        # Look up the full original record so its metadata is preserved
        original_chunk = chunk_lookup[selected_chunk]
        fout.write(json.dumps(original_chunk) + "\n")

# Read back the file we just wrote (the working directory is DataCurate4LLMs/)
with open(selected_path, encoding='utf-8') as final:
    for line in final:
        print(json.dumps(json.loads(line), indent=2))
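
As a final sanity check, one might confirm that the subset has the expected size and that every selected record matches its original, metadata included. The sketch below works on already-parsed records; `find_subset_problems` and the sample records are hypothetical and not part of `DataCurate4LLMs`.

```python
def find_subset_problems(original_records, subset_records, expected_size):
    """Return a list of problems; an empty list means the subset looks consistent."""
    problems = []
    originals = {record['chunk']: record for record in original_records}
    if len(subset_records) != expected_size:
        problems.append(f'expected {expected_size} records, found {len(subset_records)}')
    for record in subset_records:
        source = originals.get(record.get('chunk'))
        if source is None:
            problems.append(f'chunk not found in original data: {record.get("chunk")!r}')
        elif source != record:
            problems.append(f'metadata differs for chunk: {record["chunk"]!r}')
    return problems

# Hypothetical records for illustration only
originals = [
    {'chunk': 'a', 'created_date': '01-01-1970'},
    {'chunk': 'b', 'created_date': '01-01-1971'},
]
subset = [{'chunk': 'a', 'created_date': '01-01-1970'}]
print(find_subset_problems(originals, subset, expected_size=1))
```

In the notebook, `original_records` would correspond to `chunk_lookup.values()` and `subset_records` to the parsed lines of the saved subset file.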