#### Preface

This short tutorial demonstrates the threshold-tuning process that precedes evaluation. It uses the corpora from the [chunking_evaluation](https://github.com/brandonstarxel/chunking_evaluation) framework.

#### Package installation

In [1]:
!pip install "git+https://github.com/panalexeu/horchunk.git"
!pip install requests
!pip install chromadb
!pip install numpy

Collecting git+https://github.com/panalexeu/horchunk.git
  Cloning https://github.com/panalexeu/horchunk.git to /tmp/pip-req-build-cygrr0x5
  Running command git clone --filter=blob:none --quiet https://github.com/panalexeu/horchunk.git /tmp/pip-req-build-cygrr0x5
  Resolved https://github.com/panalexeu/horchunk.git to commit 4fd7e6936057689b3677d92f7a1be398cbf388fd
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done

#### Loading datasets from git

In [2]:
import requests 

links_dict = dict(
    wikitexts='https://raw.githubusercontent.com/brandonstarxel/chunking_evaluation/refs/heads/main/chunking_evaluation/evaluation_framework/general_evaluation_data/corpora/wikitexts.md',
    chatlogs='https://raw.githubusercontent.com/brandonstarxel/chunking_evaluation/refs/heads/main/chunking_evaluation/evaluation_framework/general_evaluation_data/corpora/chatlogs.md',
    finance='https://raw.githubusercontent.com/brandonstarxel/chunking_evaluation/refs/heads/main/chunking_evaluation/evaluation_framework/general_evaluation_data/corpora/finance.md',
    pubmed='https://raw.githubusercontent.com/brandonstarxel/chunking_evaluation/refs/heads/main/chunking_evaluation/evaluation_framework/general_evaluation_data/corpora/pubmed.md',
    state_of_the_union='https://raw.githubusercontent.com/brandonstarxel/chunking_evaluation/refs/heads/main/chunking_evaluation/evaluation_framework/general_evaluation_data/corpora/state_of_the_union.md'
) 


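def load_datasets(links_dict: dict) -> dict:
    """Download every corpus and return a mapping of name -> raw text."""
    data = dict()
    for name, link in links_dict.items():
        response = requests.get(link, timeout=30)
        response.raise_for_status()  # fail loudly instead of storing an error page
        data[name] = response.text

    return data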

datasets = load_datasets(links_dict)
for key in datasets.keys():
    print(f'{key}: {len(datasets[key].split())} words')

wikitexts: 22406 words
chatlogs: 5968 words
finance: 116860 words
pubmed: 75846 words
state_of_the_union: 8468 words


#### Tuning

Let's start by instantiating `WindowTuner` with an embedding function.

In [3]:
from horchunk.chunkers import WindowTuner
from chromadb.utils import embedding_functions

ef = embedding_functions.DefaultEmbeddingFunction() # all-MiniLM-L6-v2
tuner = WindowTuner(ef) 



In this tutorial, we tune the minimum threshold for a document size of 3 sentences. We repeat the process for every dataset, then average the resulting thresholds into a single, generalized threshold value.

In [4]:
thresholds = []
DEPTH = 3

**wikitexts**

In [5]:
from horchunk.splitters import SentenceSplitter 

splitter = SentenceSplitter(text=datasets['wikitexts'])
splits = splitter()
res = tuner(splits, depth=DEPTH)
thresholds.append(res)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1090/1090 [01:06<00:00, 16.31it/s]


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


**chatlogs**

In [6]:
splitter = SentenceSplitter(text=datasets['chatlogs'])
splits = splitter()
res = tuner(splits, depth=DEPTH)
thresholds.append(res)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 257/257 [00:13<00:00, 18.87it/s]


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


**finance**

In [7]:
splitter = SentenceSplitter(text=datasets['finance'])
splits = splitter()
res = tuner(splits, depth=DEPTH)
thresholds.append(res)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6320/6320 [06:03<00:00, 17.40it/s]


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


**pubmed**

In [8]:
splitter = SentenceSplitter(text=datasets['pubmed'])
splits = splitter()
res = tuner(splits, depth=DEPTH)
thresholds.append(res)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4458/4458 [04:05<00:00, 18.14it/s]


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


**state_of_the_union**

In [9]:
splitter = SentenceSplitter(text=datasets['state_of_the_union'])
splits = splitter()
res = tuner(splits, depth=DEPTH)
thresholds.append(res)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 658/658 [00:39<00:00, 16.84it/s]


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  k


Type 'k' to raise thresh, or 'j' - to lower it, then press 'Enter':  j
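
The five cells above repeat the same three steps: split a corpus into sentences, run the tuner, and store the result. As a minimal sketch, the repetition could be folded into one helper that also keeps each threshold paired with its dataset name. The names `tune_all`, `toy_split`, and `toy_tune` are hypothetical and not part of horchunk; the toy functions exist only so the sketch runs without the library or a model download.

```python
from typing import Callable, Dict, List


def tune_all(
    datasets: Dict[str, str],
    split: Callable[[str], List[str]],
    tune: Callable[[List[str]], float],
) -> Dict[str, float]:
    """Run split -> tune for every dataset and key each result by dataset name."""
    results = {}
    for name, text in datasets.items():
        sentences = split(text)
        results[name] = tune(sentences)
    return results


# Hypothetical stand-ins, NOT horchunk code: a naive splitter and a fake tuner.
def toy_split(text: str) -> List[str]:
    return [part.strip() for part in text.split('.') if part.strip()]


def toy_tune(sentences: List[str]) -> float:
    return len(sentences) / 10


toy_results = tune_all(
    {'a': 'One. Two.', 'b': 'One. Two. Three.'},
    toy_split,
    toy_tune,
)
print(toy_results)
```

In this notebook, the two callables would presumably wrap the calls already used above, for example `split=lambda text: SentenceSplitter(text=text)()` and `tune=lambda splits: tuner(splits, depth=DEPTH)`. Returning a dictionary instead of appending to a list makes it harder to lose track of which threshold belongs to which corpus.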


#### Averaging threshold values

The average threshold computed here will be used for evaluation over the general dataset in the `evaluation.ipynb` notebook.

In [12]:
import numpy as np

print(len(thresholds), thresholds)
thresh = np.mean(thresholds)
# The second value is the average truncated (not rounded) to two decimals.
print(f'avg. thresh: {thresh} = {str(thresh)[:4]}')

5 [0.8196150064468384, 0.6262622475624084, 0.7286220788955688, 0.7016535997390747, 0.6557968258857727]
avg. thresh: 0.7063899517059327 = 0.70
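
The short value printed above comes from slicing the string, which truncates rather than rounds. The sketch below recomputes the average from the printed thresholds and compares truncation with rounding, and also reports how far the per-dataset values spread. It uses only the standard library so that it runs on its own, and it assumes the list order matches the order of the tuning cells (wikitexts first, state_of_the_union last).

```python
import math
from statistics import mean

# Thresholds copied from the output above; the name order is an assumption
# based on the order in which the tuning cells appear in this notebook.
names = ['wikitexts', 'chatlogs', 'finance', 'pubmed', 'state_of_the_union']
values = [
    0.8196150064468384,
    0.6262622475624084,
    0.7286220788955688,
    0.7016535997390747,
    0.6557968258857727,
]
per_dataset = dict(zip(names, values))

avg = mean(values)
truncated = math.floor(avg * 100) / 100  # same result as str(avg)[:4] here
rounded = round(avg, 2)
spread = max(values) - min(values)

highest = max(per_dataset, key=per_dataset.get)
lowest = min(per_dataset, key=per_dataset.get)

print(f'average:   {avg:.4f}')
print(f'truncated: {truncated:.2f}')
print(f'rounded:   {rounded:.2f}')
print(f'spread:    {spread:.4f} ({lowest} lowest, {highest} highest)')
```

Truncation gives 0.70 while rounding gives 0.71, so whichever value is carried into evaluation should be stated explicitly. The gap of roughly 0.19 between the lowest and highest thresholds is also worth keeping in mind when applying one averaged value to every corpus.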
