# Subset Selection on Text Chunks

This notebook demonstrates how to perform subset selection on a set of text chunks specified in a `chunks.jsonl` file, with an example included in the `data/` subdirectory.

## Load the dataset

First, we load the `chunks.jsonl` file. Each line is expected to be a JSON object with the text stored under a key called `chunk`; all other key/value pairs are treated as metadata and are preserved throughout this process.
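
As a rough illustration of the expected layout, the sketch below parses two made-up JSONL lines and checks that each record carries a `chunk` key. The sample records, the `document` and `page` metadata keys, and the helper name `parse_chunk_lines` are hypothetical and only serve the example.

```python
import json

# Hypothetical sample lines; only the `chunk` key is required by this notebook.
sample_lines = [
    '{"chunk": "Solar panels convert sunlight into electricity.", "document": "energy.pdf", "page": 3}',
    '{"chunk": "Wind turbines need steady airflow.", "document": "energy.pdf", "page": 7}',
]

def parse_chunk_lines(lines):
    """Parse JSONL lines, skipping blanks and rejecting records without `chunk`."""
    records = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        if 'chunk' not in record:
            raise KeyError(f'line {number} has no "chunk" key')
        records.append(record)
    return records

records = parse_chunk_lines(sample_lines)
print([sorted(record) for record in records])
```

Running a check like this before selection surfaces malformed records early, rather than partway through the lookup step.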

In [None]:
import json
import os

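# Ensure the `tmp` directory exists.
os.makedirs('tmp', exist_ok=True)

# Map each chunk's text to its full JSON record so metadata can be restored later.
chunk_lookup = {}
with open('data/chunks.jsonl', encoding='utf-8') as fin:
    for line in fin:
        if not line.strip():
            continue  # skip blank lines
        chunk_json = json.loads(line)
        chunk_lookup[chunk_json['chunk']] = chunk_json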

print(f'Read {len(chunk_lookup)} chunks')

## Preparation for performing subset selection

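We check out the DataCurate4LLMs repository, which contains the `data_subset_selection.py` script used in the steps that follow.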

In [None]:
!git clone git@github.com:krishnatejakk/DataCurate4LLMs.git
%cd DataCurate4LLMs

In [None]:
import json
import random

# A fresh random seed per run; set a fixed value for reproducible selections.
seed = random.randint(0, 10000)

config = {
    "output_dir": "output",
    "encoder_model": "BAAI/bge-large-en-v1.5",
    "encoder_type": "bge",
    "instruction": (
        "Generate embeddings that capture the semantic meaning of text segments "
        "across multiple domains, ensuring generalization and suitability for "
        "clustering based on semantic similarity."
    ),
    "query_description": "default",
    "templates": {"default": "{{ chunk }}"},
    "template_name": "default",
    "num_folds": 1,
    "num_gpus": 1,
    "subset_sizes": ["5"],
    "epsilon": 0.01,
    "seed": seed,
}

# The working directory is now DataCurate4LLMs, so write the config one level up,
# where the `--config '../subset_config.json'` argument of the selection script expects it.
config_path = '../subset_config.json'
with open(config_path, 'w', encoding='utf-8') as f:
    json.dump(config, f, indent=4)

print(json.dumps(config, indent=4))
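
The `templates` entry in the written config maps a template name to a string containing a `{{ chunk }}` placeholder. Assuming the placeholder is filled from the matching key of each record (the repository's actual rendering code is not shown in this notebook and may use a full template engine), the substitution can be sketched as follows. The helper `render_template` and the sample record are hypothetical.

```python
import re

def render_template(template, record):
    """Replace each {{ key }} placeholder with str(record[key])."""
    def substitute(match):
        return str(record[match.group(1)])
    return re.sub(r'\{\{\s*(\w+)\s*\}\}', substitute, template)

record = {'chunk': 'Wind turbines need steady airflow.', 'page': 7}
rendered = render_template('{{ chunk }}', record)
print(rendered)
```

With the `default` template, the text passed to the encoder would simply be the chunk itself; a richer template could prepend metadata fields to it.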

In [None]:
!sed -i -e 's ^faiss-gpu$ faiss-gpu-cu12 g' requirements.txt # fix faiss dependency; yes you can use spaces as delimiters for sed expressions
!pip install -qq -r requirements.txt
!pip install -qq submodlib-py

In [None]:
!python data_subset_selection.py --input_files '../data/chunks.jsonl' --output_dir '../data' --config '../subset_config.json' --retry_delay 1 --subset_sizes 5 --num_gpus 1
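
The selection logic lives inside `data_subset_selection.py` and is not reproduced in this notebook. As a conceptual sketch only, and assuming a facility-location style objective over embedding similarities (a common choice with submodular optimization libraries, not confirmed for this repository), greedy selection of a representative subset might look like the code below. The toy 2-D vectors stand in for real embeddings, and `cosine` and `greedy_facility_location` are hypothetical helpers.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def greedy_facility_location(vectors, k):
    """Greedily pick k indices maximizing sum_i max_{j in S} sim(i, j)."""
    n = len(vectors)
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    selected = []
    best = [0.0] * n  # best similarity of each item to the selected set so far
    for _ in range(min(k, n)):
        # Marginal gain of adding candidate j to the current selection.
        gains = {
            j: sum(max(sim[i][j] - best[i], 0.0) for i in range(n))
            for j in range(n) if j not in selected
        }
        pick = max(gains, key=gains.get)
        selected.append(pick)
        best = [max(best[i], sim[i][pick]) for i in range(n)]
    return selected

# Two tight clusters: indices 0-1 point along x, indices 2-3 point along y.
toy_embeddings = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
subset = greedy_facility_location(toy_embeddings, k=2)
print(subset)
```

On this toy input the greedy procedure picks one vector from each cluster, which is the intuition behind selecting a small but representative set of chunks.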

In [None]:
import json

# Map each selected chunk back to its original record to restore the metadata.
with open('../data/chunks/chunks_samples_5_subset.jsonl', encoding='utf-8') as fin, \
        open('../data/selected_chunks.jsonl', 'w', encoding='utf-8') as fout:
    for line in fin:
        selected_chunk = json.loads(line)['chunk']
        original_chunk = chunk_lookup[selected_chunk]
        fout.write(json.dumps(original_chunk) + '\n')

# Pretty-print the selected chunks for inspection.
with open('../data/selected_chunks.jsonl', encoding='utf-8') as final:
    for line in final:
        print(json.dumps(json.loads(line), indent=2))
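
The final cell relies on the chunk text acting as a join key between the selection output and `chunk_lookup`. The self-contained sketch below shows the same idea on made-up records; the helper `restore_metadata` and the `source` metadata key are hypothetical, and unlike the cell above it collects unmatched chunks instead of raising a `KeyError`.

```python
def restore_metadata(selected_records, lookup):
    """Return (matched original records, chunk texts with no match)."""
    matched, missing = [], []
    for record in selected_records:
        text = record['chunk']
        if text in lookup:
            matched.append(lookup[text])
        else:
            missing.append(text)
    return matched, missing

# Made-up lookup keyed by chunk text, mirroring the shape of `chunk_lookup`.
lookup = {
    'alpha text': {'chunk': 'alpha text', 'source': 'a.md'},
    'beta text': {'chunk': 'beta text', 'source': 'b.md'},
}
selected = [{'chunk': 'beta text'}, {'chunk': 'gamma text'}]
matched, missing = restore_metadata(selected, lookup)
print(matched, missing)
```

Because the lookup is keyed by text, two input records with identical chunk text would collapse into one entry, so the metadata restored for such a chunk is that of whichever record was read last.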