#### Preface

This short tutorial demonstrates the threshold tuning step that precedes evaluation. It uses the corpora from the [chunking_evaluation](https://github.com/brandonstarxel/chunking_evaluation) framework.

#### Package installation

In [1]:
!pip install -q "git+https://github.com/panalexeu/horchunk.git"
!pip install -q requests
!pip install -q chromadb
!pip install -q numpy

#### Loading datasets from git

In [2]:
import requests 

links_dict = dict(
    wikitexts='https://raw.githubusercontent.com/brandonstarxel/chunking_evaluation/refs/heads/main/chunking_evaluation/evaluation_framework/general_evaluation_data/corpora/wikitexts.md',
    chatlogs='https://raw.githubusercontent.com/brandonstarxel/chunking_evaluation/refs/heads/main/chunking_evaluation/evaluation_framework/general_evaluation_data/corpora/chatlogs.md',
    finance='https://raw.githubusercontent.com/brandonstarxel/chunking_evaluation/refs/heads/main/chunking_evaluation/evaluation_framework/general_evaluation_data/corpora/finance.md',
    pubmed='https://raw.githubusercontent.com/brandonstarxel/chunking_evaluation/refs/heads/main/chunking_evaluation/evaluation_framework/general_evaluation_data/corpora/pubmed.md',
    state_of_the_union='https://raw.githubusercontent.com/brandonstarxel/chunking_evaluation/refs/heads/main/chunking_evaluation/evaluation_framework/general_evaluation_data/corpora/state_of_the_union.md'
) 


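def load_datasets(links_dict: dict) -> dict:
    """Download every corpus and return a mapping of name -> raw text."""
    data = dict()
    for name, link in links_dict.items():
        response = requests.get(link, timeout=30)
        response.raise_for_status()  # fail loudly instead of storing an error page
        data[name] = response.text

    return data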

datasets = load_datasets(links_dict)
for key in datasets.keys():
    print(f'{key}: {len(datasets[key].split())} words')

wikitexts: 22406 words
chatlogs: 5968 words
finance: 116860 words
pubmed: 75846 words
state_of_the_union: 8468 words
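
The word counts above also serve as a quick sanity check: a failed download tends to leave a very short error body in place of the corpus. A minimal sketch of such a check follows, assuming a hypothetical cutoff `min_words` and toy stand-in texts rather than the real corpora; the helper name `find_suspicious_corpora` is not part of any library.

```python
def find_suspicious_corpora(corpora: dict, min_words: int = 100) -> list:
    """Return the names of corpora whose word count is below `min_words`.

    `min_words` is an arbitrary cutoff chosen for illustration only.
    """
    return [
        name for name, text in corpora.items()
        if len(text.split()) < min_words
    ]


# Toy stand-ins for downloaded texts (not the real corpora).
toy_corpora = {
    'ok_corpus': 'word ' * 500,           # 500 words, passes the check
    'failed_download': '404: Not Found',  # 3 words, flagged
}

suspicious = find_suspicious_corpora(toy_corpora)
print(suspicious)
```

In the notebook the same function could be called as `find_suspicious_corpora(datasets)`; an empty list would suggest that every corpus was fetched with a plausible amount of text.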


#### Tuning

Let's start by instantiating WindowTuner.

In [3]:
from horchunk.chunkers import WindowTuner
from chromadb.utils import embedding_functions

# device='cuda' requires a GPU; pass device='cpu' to run without one
ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2", device='cuda')
tuner = WindowTuner(ef) 

In this tutorial, we will tune the minimum threshold for documents that are 3 sentences in size. We will repeat this process for every dataset and then average the resulting thresholds to obtain a single, generalized threshold value.

In [4]:
thresholds = []
DEPTH = 3

**wikitexts**

In [None]:
from horchunk.splitters import SentenceSplitter 

splitter = SentenceSplitter(text=datasets['wikitexts'])
splits = splitter()
res = tuner(splits, depth=DEPTH)
thresholds.append(res)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1090/1090 [00:10<00:00, 99.90it/s]


**chatlogs**

In [None]:
splitter = SentenceSplitter(text=datasets['chatlogs'])
splits = splitter()
res = tuner(splits, depth=DEPTH)
thresholds.append(res)

**finance**

In [None]:
splitter = SentenceSplitter(text=datasets['finance'])
splits = splitter()
res = tuner(splits, depth=DEPTH)
thresholds.append(res)

**pubmed**

In [None]:
splitter = SentenceSplitter(text=datasets['pubmed'])
splits = splitter()
res = tuner(splits, depth=DEPTH)
thresholds.append(res)

**state_of_the_union**

In [None]:
splitter = SentenceSplitter(text=datasets['state_of_the_union'])
splits = splitter()
res = tuner(splits, depth=DEPTH)
thresholds.append(res)
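
The five cells above repeat the same three steps (split, tune, append), so they could be collapsed into one loop. The sketch below shows that pattern under stated assumptions: `tune_all`, `toy_split` and `toy_tune` are hypothetical names introduced here, and the two toy functions are placeholders that do not reproduce what `SentenceSplitter` or `WindowTuner` compute.

```python
from typing import Callable, Dict, List


def tune_all(
    corpora: Dict[str, str],
    split_fn: Callable[[str], List[str]],
    tune_fn: Callable[[List[str], int], float],
    depth: int,
) -> Dict[str, float]:
    """Run split -> tune for every corpus and return one threshold per name."""
    results = {}
    for name, text in corpora.items():
        splits = split_fn(text)
        results[name] = tune_fn(splits, depth)
    return results


# Hypothetical stand-ins so the sketch runs without horchunk or a GPU.
def toy_split(text: str) -> List[str]:
    # Naive split on '. '; NOT how SentenceSplitter works.
    return [s for s in text.split('. ') if s]


def toy_tune(splits: List[str], depth: int) -> float:
    # Placeholder value, NOT a real threshold: depth relative to sentence count.
    return min(1.0, depth / max(len(splits), 1))


toy_corpora = {
    'a': 'One. Two. Three. Four. Five. Six',
    'b': 'Only one sentence',
}
per_corpus = tune_all(toy_corpora, toy_split, toy_tune, depth=3)
thresholds_sketch = list(per_corpus.values())
print(per_corpus)
```

In the notebook, `split_fn` could be `lambda text: SentenceSplitter(text=text)()` and `tune_fn` could be `lambda splits, depth: tuner(splits, depth=depth)`, mirroring the calls in the cells above. Keeping the results in a dictionary also records which threshold belongs to which dataset.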

#### Averaging threshold values

The calculated average threshold value will be used for evaluation over the general dataset in the `evaluation.ipynb` notebook.

In [None]:
import numpy as np

print(len(thresholds), thresholds)
thresh = np.mean(thresholds)
# str(thresh)[:4] keeps the first four characters, so the value is truncated, not rounded
print(f'avg. thresh: {thresh} = {str(thresh)[:4]}')
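
Note that slicing the string form of the mean truncates the value, which can differ from rounding in the last digit. The following sketch illustrates the difference with made-up threshold values (they are not results from this notebook); `statistics.mean` is used only to keep the example dependency-free.

```python
from statistics import mean

# Illustrative values only; NOT thresholds produced by the tuner.
example_thresholds = [0.41, 0.52, 0.48, 0.55, 0.47]

avg = mean(example_thresholds)   # approximately 0.486

truncated = str(avg)[:4]         # keeps the first four characters
rounded = round(avg, 2)          # rounds to two decimal places
formatted = f'{avg:.2f}'         # rounds while formatting

print(truncated, rounded, formatted)
```

Here truncation yields `0.48` while rounding yields `0.49`, so whichever convention is chosen should be applied consistently when the value is carried over to the evaluation notebook.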