# Plan of the project
- Run experiments 
- Visualize the results
- Look into the results
- Make conclusions

## Run the experiment
Uncomment the cell below to rerun the experiment, optionally with different parameters.
The rest of the notebook runs without it, because the results are loaded from `../data/study.csv`.

In [2]:
# %%bash 
# python optimize_hyperparams.py \
#     --trials 100 \
#     --questions_path "../data/state_of_the_union_questions_df.csv" \
#     --text_path "../data/state_of_the_union.md" \
#     --chunk_size_min 30 \
#     --chunk_size_max 300 \
#     --chunk_size_step 10 \
#     --chunk_overlap_ratio_min 0.0 \
#     --chunk_overlap_ratio_max 0.7 \
#     --chunk_overlap_ratio_step 0.1 \
#     --num_retrieved_chunks_min 1 \
#     --num_retrieved_chunks_max 10 \
#     --threshold_min 0.0 \
#     --threshold_max 0.9 \
#     --model_names "sentence-transformers/all-MiniLM-L6-v2,sentence-transformers/all-MiniLM-L12-v2" \
#     --study_csv "../data/study.csv"

## Load and visualize the results

In [3]:
import pandas as pd
import optuna
import optuna.visualization as vis

In [4]:
def load_study_from_csv(csv_path):
    """Rebuild an in-memory Optuna study from the trials CSV written by the experiment."""
    trials_df = pd.read_csv(csv_path)
    # Search space; the bounds mirror the CLI arguments in the cell above.
    distributions = {
        'chunk_overlap': optuna.distributions.FloatDistribution(0.0, 0.7),
        'chunk_size': optuna.distributions.IntDistribution(30, 300),
        'model_name': optuna.distributions.CategoricalDistribution(['sentence-transformers/all-MiniLM-L6-v2', 'sentence-transformers/all-MiniLM-L12-v2']),
        'num_retrieved_chunks': optuna.distributions.IntDistribution(1, 10),
        'threshold': optuna.distributions.FloatDistribution(0.0, 0.9)
    }
    study = optuna.create_study(direction="maximize")
    for _, row in trials_df.iterrows():
        # Cast explicitly so the stored params are plain Python values, not NumPy scalars.
        params = {
            'chunk_overlap': float(row['params_chunk_overlap']),
            'chunk_size': int(row['params_chunk_size']),
            'model_name': str(row['params_model_name']),
            'num_retrieved_chunks': int(row['params_num_retrieved_chunks']),
            'threshold': float(row['params_threshold'])
        }
        trial = optuna.trial.create_trial(
            params=params,
            distributions=distributions,
            value=float(row['value'])
        )
        study.add_trial(trial)
    return study

# Example usage
study = load_study_from_csv("../data/study.csv")
print("Best trial:")
trial = study.best_trial
print(f"  Value: {trial.value}")
print("  Params: ")
for key, value in trial.params.items():
    print(f"    {key}: {value}")

[I 2025-03-12 12:58:33,711] A new study created in memory with name: no-name-aaa073f4-b113-4cc9-b98c-0f3398024954


Best trial:
  Value: 0.4595684560698659
  Params: 
    chunk_overlap: 0.7
    chunk_size: 40
    model_name: sentence-transformers/all-MiniLM-L6-v2
    num_retrieved_chunks: 1
    threshold: 0.2786397449038925


In [5]:
vis.plot_optimization_history(study)

In [6]:
vis.plot_param_importances(study)

In [7]:
vis.plot_slice(study)

The two most important parameters are `chunk_size` and `threshold`.  
This makes sense: a small `chunk_size` combined with a large enough `chunk_overlap` produces many chunks, which are more flexible and can match the target spans more accurately.  
Filtering out chunks that score below `threshold` also helps to get rid of noise.  
Both models performed very similarly, which was expected since the models are closely related.  
Bigger models did not fit into my memory.  
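
To make the chunk-count argument concrete, here is a minimal sketch. It assumes chunks are fixed-size token windows and that the overlap ratio is converted to a token count with `int(ratio * chunk_size)`, as done later in this notebook. The helper `sliding_window_spans` and the 1,000-token document length are hypothetical; `create_chunks` in `evaluation` may split the text differently.

```python
def sliding_window_spans(num_tokens, chunk_size, overlap_ratio):
    """Return (start, end) token spans for fixed-size windows with overlap.

    Hypothetical helper for illustration only.
    """
    overlap = int(overlap_ratio * chunk_size)
    stride = chunk_size - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than chunk_size")

    spans = []
    start = 0
    while start < num_tokens:
        end = min(start + chunk_size, num_tokens)
        spans.append((start, end))
        if end == num_tokens:
            break  # the last window already reaches the end of the text
        start += stride
    return spans


if __name__ == "__main__":
    num_tokens = 1000  # assumed document length, not the real one
    for ratio in (0.0, 0.5):
        spans = sliding_window_spans(num_tokens, chunk_size=40, overlap_ratio=ratio)
        print(f"overlap_ratio={ratio}: {len(spans)} chunks")
```

A larger overlap shrinks the stride, so the same text yields more candidate windows, each starting at a slightly different position. That gives the retriever more chances to pick a window that lines up with the target span.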

## Look into the results

In [8]:
from evaluation import (
    create_chunks, 
    create_embeddings, 
    retrieve_chunks, 
    merge_intervals, 
    get_ranges,
)
import json

In [11]:
questions_path = "../data/state_of_the_union_questions_df.csv"
text_path = "../data/state_of_the_union.md"

questions_df = pd.read_csv(questions_path)
with open(text_path, encoding="utf-8") as f:
    text = f.read()

def run_samples(questions_df, text, chunk_size, chunk_overlap, num_retrieved_chunks, threshold, model_name, device):
    """Print the retrieved and target text spans for every question in questions_df."""
    questions = questions_df['question'].tolist()

    chunks, _tokenizer = create_chunks(text, chunk_size, chunk_overlap, model_name)
    text_embeddings, question_embeddings = create_embeddings(chunks, questions, model_name, device)
    retrieved_indices = retrieve_chunks(question_embeddings, text_embeddings, threshold, num_retrieved_chunks)
    # Character ranges of the retrieved chunks, one merged list per question.
    retrieved_intervals = [merge_intervals(get_ranges(text, chunks, list(indices))) for indices in retrieved_indices]
    # Ground-truth ranges parsed from the JSON stored in the `references` column.
    target_intervals = [
        merge_intervals([(ref['start_index'], ref['end_index']) for ref in json.loads(references)])
        for references in questions_df['references']
    ]

    for retrieved, target, question in zip(retrieved_intervals, target_intervals, questions):

        print("Question:")
        print(question)

        print("Retrieved:")
        print(retrieved)
        for start, end in retrieved:
            print(text[start:end])

        print("Target:")
        print(target)
        for start, end in target:
            print(text[start:end])
        
        print()
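
The implementation of `retrieve_chunks` lives in `evaluation` and is not shown here. As a rough sketch of how `threshold` and `num_retrieved_chunks` could interact, the example below assumes cosine similarity, a threshold applied before the top-k cut, and toy 2-D vectors. The names `cosine` and `select_chunks` are hypothetical and not part of the project.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_chunks(question_vec, chunk_vecs, threshold, top_k):
    """Return indices of up to top_k chunks with similarity >= threshold, best first."""
    scored = [(cosine(question_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in kept[:top_k]]


if __name__ == "__main__":
    question = [1.0, 0.0]
    chunk_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    # With a threshold, weakly related chunks are dropped even if top_k allows more.
    print(select_chunks(question, chunk_vectors, threshold=0.5, top_k=3))
    print(select_chunks(question, chunk_vectors, threshold=0.0, top_k=3))
```

Under these assumptions, the threshold can return fewer than `top_k` chunks, which is one way noise gets filtered out.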

In [13]:
chunk_size = study.best_params['chunk_size']
chunk_overlap = int(study.best_params['chunk_overlap'] * chunk_size)
num_retrieved_chunks = study.best_params['num_retrieved_chunks']
threshold = study.best_params['threshold']
model_name = study.best_params['model_name']

run_samples(
    questions_df.sample(5), 
    text, 
    chunk_size, 
    chunk_overlap, 
    num_retrieved_chunks, 
    threshold, 
    model_name, 
    "cuda"
)

Question:
What specific new office did President Biden establish to address gun violence, and who is leading it?
Retrieved:
[(37137, 37355)]
 something by establishing the first-ever Office of Gun Violence Prevention in the White House, that the Vice President is leading the charge. Thank you for doing it.

Meanwhile — meanwhile, my predecessor told the NRA
Target:
[(37123, 37279)]
Well, I did do something by establishing the first-ever Office of Gun Violence Prevention in the White House, that the Vice President is leading the charge.

Question:
At what age did President Biden get elected to the United States Senate, and was it planned?
Retrieved:
[(45927, 46114)]
 United States Senate when I had no intention of running, at age 29.

Then vice president to our first Black president. Now a president to the first woman vice president.

In my career, I
Target:
[(45907, 45995)]
I got elected to the United States Senate when I had no intention of running, at age 29.

Question:
What signific
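
The retrieved and target intervals printed above can also be compared numerically. The sketch below computes character-level precision, recall, and intersection over union (IoU) for the first two samples. These metrics are an assumption chosen for illustration; they are not necessarily the objective that `optimize_hyperparams.py` maximizes. The helper names are hypothetical.

```python
def merge_spans(intervals):
    """Merge overlapping or touching (start, end) intervals into a sorted list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(intervals):
    """Total number of characters covered by non-overlapping intervals."""
    return sum(end - start for start, end in intervals)


def overlap_scores(retrieved, target):
    """Character-level precision, recall and IoU between two interval lists."""
    retrieved, target = merge_spans(retrieved), merge_spans(target)
    inter = sum(
        max(0, min(r_end, t_end) - max(r_start, t_start))
        for r_start, r_end in retrieved
        for t_start, t_end in target
    )
    r_len, t_len = total_length(retrieved), total_length(target)
    union = r_len + t_len - inter
    return {
        "precision": inter / r_len if r_len else 0.0,
        "recall": inter / t_len if t_len else 0.0,
        "iou": inter / union if union else 0.0,
    }


if __name__ == "__main__":
    # Interval pairs copied from the sample output in this notebook.
    samples = [
        ([(37137, 37355)], [(37123, 37279)]),
        ([(45927, 46114)], [(45907, 45995)]),
    ]
    for retrieved, target in samples:
        print(overlap_scores(retrieved, target))
```

In both samples the retrieved span starts a few characters after the target and runs past its end, so recall is high but not perfect, while precision is lowered by the trailing text.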

# Conclusion
