#### Preface

This short tutorial demonstrates the threshold tuning process that precedes evaluation. It uses datasets from the [chunking_evaluation](https://github.com/brandonstarxel/chunking_evaluation) framework.

#### Package installation

In [19]:
!pip install "git+https://github.com/panalexeu/horchunk.git"
!pip install requests
!pip install chromadb
!pip install numpy

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting git+https://github.com/panalexeu/horchunk.git
  Cloning https://github.com/panalexeu/horchunk.git to /tmp/pip-req-build-l3fpo03i
  Running command git clone --filter=blob:none --quiet https://github.com/panalexeu/horchunk.git /tmp/pip-req-build-l3fpo03i
  Resolved https://github.com/panalexeu/horchunk.git to commit 4fd7e6936057689b3677d92f7a1be398cbf388fd
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done

[notice] A new release of pip is available: 24.3.1 -> 25.0
[notice] To update, run: pip install --upgrade pip



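The `huggingface/tokenizers` warning in the installation output names an environment variable that controls it. A minimal sketch of setting that variable from Python is shown here; the choice of `"false"` and putting this in the first notebook cell are assumptions, not something the original notebook does.

```python
import os

# Assumption: parallel tokenization is not needed for this tutorial,
# so it is disabled to avoid the fork warning printed by `tokenizers`.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Running this before any cell that shells out with `!pip` or loads the embedding model is the simplest placement.
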
#### Loading datasets from git

In [20]:
import requests 

links_dict = dict(
    wikitexts='https://raw.githubusercontent.com/brandonstarxel/chunking_evaluation/refs/heads/main/chunking_evaluation/evaluation_framework/general_evaluation_data/corpora/wikitexts.md',
    chatlogs='https://raw.githubusercontent.com/brandonstarxel/chunking_evaluation/refs/heads/main/chunking_evaluation/evaluation_framework/general_evaluation_data/corpora/chatlogs.md',
    finance='https://raw.githubusercontent.com/brandonstarxel/chunking_evaluation/refs/heads/main/chunking_evaluation/evaluation_framework/general_evaluation_data/corpora/finance.md',
    pubmed='https://raw.githubusercontent.com/brandonstarxel/chunking_evaluation/refs/heads/main/chunking_evaluation/evaluation_framework/general_evaluation_data/corpora/pubmed.md',
    state_of_the_union='https://raw.githubusercontent.com/brandonstarxel/chunking_evaluation/refs/heads/main/chunking_evaluation/evaluation_framework/general_evaluation_data/corpora/state_of_the_union.md'
) 


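def load_datasets(links_dict: dict) -> dict:
    """Download each corpus and return a mapping of dataset name to raw text."""
    data = dict()
    for name, link in links_dict.items():
        response = requests.get(link, timeout=30)
        response.raise_for_status()  # fail loudly instead of storing an error page
        data[name] = response.text

    return data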

datasets = load_datasets(links_dict)
for key in datasets.keys():
    print(f'{key}: {len(datasets[key].split())} words')

wikitexts: 22406 words
chatlogs: 5968 words
finance: 116860 words
pubmed: 75846 words
state_of_the_union: 8468 words


#### Tuning

Let's start by instantiating `WindowTuner`.

In [21]:
from horchunk.chunkers import WindowTuner
from chromadb.utils import embedding_functions

ef = embedding_functions.DefaultEmbeddingFunction() # all-MiniLM-L6-v2
tuner = WindowTuner(ef) 

In this tutorial, we will tune the minimum threshold for documents 3 sentences in size. We repeat this process for every dataset, then average the identified thresholds to obtain a single, generalized threshold value.

In [22]:
thresholds = []
DEPTH = 4

**wikitexts**

In [23]:
from horchunk.splitters import SentenceSplitter 

splitter = SentenceSplitter(text=datasets['wikitexts'])
splits = splitter()
res = tuner(splits, depth=DEPTH)
thresholds.append(res)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1090/1090 [01:50<00:00,  9.90it/s]


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j
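
The prompts above raise or lower the threshold one answer at a time. The sketch below replays such a sequence of answers under a purely hypothetical scheme: each answer moves the threshold by a step, and the step is halved after every answer. The function `adjust_threshold`, its starting value, and its step size are illustrative assumptions and are not taken from `horchunk`'s `WindowTuner`.

```python
def adjust_threshold(start, answers, step=0.25):
    """Replay 'k' (raise) / 'j' (lower) answers against a threshold.

    Hypothetical scheme, not horchunk's implementation: every answer
    moves the threshold by `step`, and `step` is halved afterwards.
    """
    thresh = start
    for answer in answers:
        if answer == "k":
            thresh += step
        elif answer == "j":
            thresh -= step
        else:
            break  # any other input ends the session
        step /= 2
    return thresh


# Three answers: lower, raise, raise.
print(adjust_threshold(0.5, ["j", "k", "k"]))
```

Halving the step lets early answers make coarse corrections and later ones fine adjustments, which is one common way to converge on a value from binary feedback.
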


**chatlogs**

In [24]:
splitter = SentenceSplitter(text=datasets['chatlogs'])
splits = splitter()
res = tuner(splits, depth=DEPTH)
thresholds.append(res)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 257/257 [00:28<00:00,  8.91it/s]


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


**finance**

In [25]:
splitter = SentenceSplitter(text=datasets['finance'])
splits = splitter()
res = tuner(splits, depth=DEPTH)
thresholds.append(res)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6320/6320 [10:24<00:00, 10.12it/s]


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


**pubmed**

In [26]:
splitter = SentenceSplitter(text=datasets['pubmed'])
splits = splitter()
res = tuner(splits, depth=DEPTH)
thresholds.append(res)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4458/4458 [06:52<00:00, 10.81it/s]


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


**state_of_the_union**

In [None]:
splitter = SentenceSplitter(text=datasets['state_of_the_union'])
splits = splitter()
res = tuner(splits, depth=DEPTH)
thresholds.append(res)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 658/658 [01:00<00:00, 10.83it/s]


#### Averaging threshold values

The calculated average threshold will be used for evaluation over the general dataset in the `evaluation.ipynb` notebook.

In [18]:
import numpy as np

print(len(thresholds), thresholds)
thresh = np.mean(thresholds)
print(f'avg. thresh: {thresh} = {round(thresh, 2)}')

5 [0.7941413521766663, 0.6194384098052979, 0.5499992966651917, 0.5951321125030518, 0.4767434597015381]
avg. thresh: 0.6070909261703491 = 0.61
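
A single average hides how much the per-dataset thresholds differ. The sketch below recomputes the mean with the standard library and also reports the spread. It assumes the printed values follow the order in which the cells appended them (wikitexts, chatlogs, finance, pubmed, state_of_the_union); that pairing is inferred from the cell order and is not stated in the output itself.

```python
from statistics import mean, pstdev

# Values copied from the printed list above; the dataset labels are an
# assumption based on the order of the tuning cells.
thresholds = {
    "wikitexts": 0.7941413521766663,
    "chatlogs": 0.6194384098052979,
    "finance": 0.5499992966651917,
    "pubmed": 0.5951321125030518,
    "state_of_the_union": 0.4767434597015381,
}

values = list(thresholds.values())
avg = mean(values)
spread = pstdev(values)  # population standard deviation of the five values
lowest = min(thresholds, key=thresholds.get)
highest = max(thresholds, key=thresholds.get)

print(f"avg: {avg:.2f}, std: {spread:.2f}")
print(f"lowest: {lowest}, highest: {highest}")
```

If the spread turns out to be large relative to the mean, evaluating with per-dataset thresholds alongside the averaged one may be worth considering.
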
