# Sentence Convergence w/ NLP

Cosine similarity may work best as the output metric for telling whether ideas are converging or diverging, because it measures how similar two points of comparison are.
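As a rough illustration of the metric, the sketch below computes cosine similarity in plain Python. The vectors are made-up stand-ins for sentence embeddings, and `cosine_similarity` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat an all-zero vector as "no similarity"
    return dot / (norm_a * norm_b)

# Toy vectors standing in for sentence embeddings (made up for illustration)
idea_a = [1.0, 0.0, 1.0]
idea_b = [2.0, 0.0, 2.0]   # same direction as idea_a
idea_c = [0.0, 1.0, 0.0]   # orthogonal to idea_a

print(cosine_similarity(idea_a, idea_b))
print(cosine_similarity(idea_a, idea_c))
```

Scores near 1 mean two sentences point in the same direction (a hint of convergence); scores near 0 mean they share little.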

That leaves the question of how to handle the length of one person's transcript. Do we break it up by sentence?
- What counts as the "smallest unit"?

# Setting up the files for analysis

## Idea 1 - Clean and reproducible - placebo

This step only cleans up the transcript so it is easy to view: it normalizes the text format and removes any row with an empty transcript.

In [4]:
import pandas as pd

# Load your file
df = pd.read_csv("/Users/maxchalekson/Projects/NICO-Research/NICO_human-gemini/Data/utterance_data_aligned_full.csv")

# --- Basic cleanup ---
# Drop missing transcripts first; astype(str) would otherwise turn NaN into the string "nan"
df = df.dropna(subset=['transcript'])
df['transcript'] = df['transcript'].astype(str).str.strip()
df = df[df['transcript'] != ""]

# Sort chronologically
df = df.sort_values(by=['global_session', 'global_timestamp_sec']).reset_index(drop=True)

# Filter for specific session
session_id = "2021_05_21_ABI_S15_ABI"
df_session = df[df['global_session'] == session_id].reset_index(drop=True)

# Summary
print(f"Loaded {len(df_session)} utterances for session {session_id}")

# Show preview
preview_cols = ['global_session', 'speaker', 'timestamp', 'global_timestamp_sec', 'transcript']
print(df_session[preview_cols].head(10).to_string(index=False))



Loaded 255 utterances for session 2021_05_21_ABI_S15_ABI
global_session          speaker     timestamp    global_timestamp_sec  transcript
2021_05_21_ABI_S15_ABI  Brad Smith  00:00-00:37  615.0

## Idea 2 - Sentence Level Unit-ideas

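(This is the main way the spreadsheet file will be processed.)

Folds backchannel utterances (e.g. "okay", "yes", "right", "sounds good") into the neighboring sentence in the same row instead of leaving them as standalone units.

Splits each speaker's transcript into sentences.

A couple of things are done here: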

1. Breaking each spreadsheet row into sentences

In the raw file, one "row" might be a short filler ("Yeah.") or a long paragraph with multiple ideas. That makes it hard to compare ideas fairly, so we split each row into sentences. Every chunk is then about the same size, which makes the similarity scores more meaningful.

2. Keeping the speaker

We always keep track of who said each sentence. This way we can tell whether an idea is picked up by someone else (cross-speaker) or whether the same person is just adding more detail (same-speaker).

3. Tagging the type of sentence

We give each sentence a quick label such as `Proposal/Offer`, `Acceptance`, `Question`, or `Inform/Report`. This matters because not every sentence is an "idea". For example, "Sounds good." is an acceptance, not a new idea. Tagging such sentences stops them from distorting the idea-similarity results while still letting us track decisions.

4. Making the timeline clean

Sometimes timestamps in the file go backwards or have ties. We fix this by nudging times forward by 0.001 seconds where needed. This tiny change doesn't affect the analysis, but it keeps the data in the right order.
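The nudge described in step 4 can be sketched roughly as follows. This is a minimal illustration under assumptions: `make_strictly_increasing` and the `eps` parameter are hypothetical names, and the input values are made up. Large back-jumps caused by clip resets are a separate case, handled by the offset logic in the timestamp cell further down.

```python
def make_strictly_increasing(times, eps=0.001):
    """Return a copy of `times` where every value is later than the one before it."""
    fixed = []
    for t in times:
        if fixed and t <= fixed[-1]:
            t = fixed[-1] + eps  # nudge forward just past the previous timestamp
        fixed.append(t)
    return fixed

raw = [10.0, 10.0, 9.5, 12.0]  # a tie, then a backwards step (made-up values)
print(make_strictly_increasing(raw))
```

Values that are already in order pass through unchanged, so only ties and backward steps are touched.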

### Implementing cosine similarity (SBERT)

The open question all of this code is meant to test: is Gemini even capable of identifying convergence on a solution, even when NLP topics are used within the transcript?

In [None]:
# === Make timestamps continuous for ALL ABI_S15 master sessions and OVERWRITE `timestamp` ===
import pandas as pd
import numpy as np
import re
from pathlib import Path

# --- CONFIG ---
in_path   = "/Users/maxchalekson/Projects/NICO-Research/NICO_human-gemini/Data/utterance_data_aligned_full.csv"
out_full  = "/Users/maxchalekson/Projects/NICO-Research/NICO_human-gemini/Data/utterance_data_aligned_full_continuous_overwritten.csv"
out_s15   = "/Users/maxchalekson/Projects/NICO-Research/NICO_human-gemini/Data/utterance_data_ABI_S15_continuous_overwritten.csv"

# --- LOAD (preserve original order) ---
df = pd.read_csv(in_path)
df["_orig_row_order"] = np.arange(len(df), dtype=np.int64)
orig_len = len(df)

# --- Target rows: anything with "ABI_S15" in session/global_session ---
def _contains_s15(x):
    s = str(x) if pd.notna(x) else ""
    return "ABI_S15" in s

has_session = "session" in df.columns
has_global  = "global_session" in df.columns

session_mask = df["session"].map(_contains_s15) if has_session else pd.Series(False, index=df.index)
global_mask  = df["global_session"].map(_contains_s15) if has_global else pd.Series(False, index=df.index)
mask_target  = session_mask | global_mask

if not mask_target.any():
    raise ValueError("No rows found that look like ABI_S15 in 'session' or 'global_session'.")

# --- Build a stitch key that prefers the master 'session' over clip-level ids ---
stitch_key = pd.Series(np.nan, index=df.index, dtype=object)
if has_session:
    stitch_key.loc[session_mask] = df.loc[session_mask, "session"]
if has_global:
    # fill any remaining ABI_S15 rows (where session didn't match) with their global_session
    fill_mask = mask_target & stitch_key.isna()
    stitch_key.loc[fill_mask] = df.loc[fill_mask, "global_session"]

df["_stitch_key"] = stitch_key

# --- Helpers ---
_time_pat = re.compile(r'(\d{1,2}:\d{2}(?::\d{2})?)')

def parse_hhmmss_to_sec(x):
    if pd.isna(x): return np.nan
    s = str(x).strip()
    if not s: return np.nan
    parts = s.split(":")
    try:
        parts = [float(p) for p in parts]
    except Exception:
        return np.nan
    if len(parts) == 3:
        h, m, sec = parts
    elif len(parts) == 2:
        h, m, sec = 0.0, parts[0], parts[1]
    elif len(parts) == 1:
        h, m, sec = 0.0, 0.0, parts[0]
    else:
        return np.nan
    return h*3600 + m*60 + sec

def split_timestamp_robust(ts):
    """Return (start_sec, end_sec) from 'mm:ss-mm:ss' or 'hh:mm:ss-hh:mm:ss' or bracketed variants."""
    s = "" if pd.isna(ts) else str(ts)
    hits = _time_pat.findall(s)
    if len(hits) >= 2:
        return parse_hhmmss_to_sec(hits[0]), parse_hhmmss_to_sec(hits[1])
    elif len(hits) == 1:
        v = parse_hhmmss_to_sec(hits[0]); return v, v
    else:
        return np.nan, np.nan

def fmt_mmss(total_seconds: float) -> str:
    total_seconds = int(round(float(total_seconds)))
    m, s = divmod(total_seconds, 60)
    return f"{m:02}:{s:02}"

def make_timestamps_continuous_one_group(group: pd.DataFrame, tol: float = 2.0) -> pd.DataFrame:
    """
    Convert local timestamps to a continuous session clock for ONE stitch group (master session).
    - Reset detected when local start < previous local end.
    - Only the FIRST near-zero row in a consecutive run starts a new clip.
    - No row cuts/reorders (iterate in original order).
    """
    g = group.copy()

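    # Prefer numeric start/end if present; else parse from 'timestamp'.
    # Fall back to an all-NaN Series so the .loc lookups below still work when a column is absent.
    nan_col   = pd.Series(np.nan, index=g.index)
    start_num = pd.to_numeric(g["start_sec"] if "start_sec" in g.columns else nan_col, errors="coerce")
    end_num   = pd.to_numeric(g["end_sec"]   if "end_sec"   in g.columns else nan_col, errors="coerce")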

    starts_local, ends_local = [], []
    for idx, ts in g["timestamp"].items():
        s_txt, e_txt = split_timestamp_robust(ts)
        s_val = start_num.loc[idx] if pd.notna(start_num.loc[idx]) else s_txt
        e_val = end_num.loc[idx]   if pd.notna(end_num.loc[idx])   else (e_txt if not np.isnan(e_txt) else s_val)
        s_val = 0.0 if np.isnan(s_val) else float(s_val)
        e_val = s_val if np.isnan(e_val) else float(e_val)
        starts_local.append(s_val)
        ends_local.append(e_val)

    starts_local = np.asarray(starts_local, dtype=float)
    ends_local   = np.asarray(ends_local,   dtype=float)

    offset = 0.0
    fixed_starts, fixed_ends = [], []
    resets = 0
    prev_near_zero = False

    for i, (s_local, e_local) in enumerate(zip(starts_local, ends_local)):
        near_zero = (s_local <= tol)
        is_reset = False
        if i > 0:
            went_backwards = s_local < (ends_local[i-1] - 1e-9)
            if went_backwards:
                # Only the FIRST near-zero in a run starts a new clip; any non-near-zero back-jump also starts one
                if (near_zero and not prev_near_zero) or (not near_zero):
                    is_reset = True

        if is_reset:
            # Set (not add) offset to previous FIXED end so we don't double-count
            offset = fixed_ends[-1]
            resets += 1

        fixed_s = s_local + offset
        fixed_e = e_local + offset
        fixed_starts.append(fixed_s)
        fixed_ends.append(fixed_e)

        prev_near_zero = near_zero

    g["global_start_sec"]     = fixed_starts
    g["global_end_sec"]       = fixed_ends
    g["timestamp_continuous"] = [f"{fmt_mmss(s)}-{fmt_mmss(e)}" for s, e in zip(fixed_starts, fixed_ends)]
    g["resets_detected_here"] = resets  # same value within this group

    return g

# --- Process only the ABI_S15 subset, stitched by the master session key (_stitch_key) ---
df_out = df.copy()
sub = df_out.loc[mask_target].copy()

if "_stitch_key" not in sub.columns:
    raise ValueError("Internal stitch key missing.")

stitched = []
for key, g in sub.groupby("_stitch_key", sort=False):
    stitched.append(make_timestamps_continuous_one_group(g, tol=2.0))
sub_fixed = pd.concat(stitched, axis=0)

# --- Overwrite timestamp + keep backup for affected rows ---
df_out.loc[sub_fixed.index, "timestamp_local"] = df_out.loc[sub_fixed.index, "timestamp"]
df_out.loc[sub_fixed.index, "timestamp"] = sub_fixed["timestamp_continuous"]

# Also store continuous second columns in the full DF
for col in ["timestamp_continuous","global_start_sec","global_end_sec","resets_detected_here"]:
    if col not in df_out.columns:
        df_out[col] = np.nan
df_out.loc[sub_fixed.index, ["timestamp_continuous","global_start_sec","global_end_sec","resets_detected_here"]] = \
    sub_fixed[["timestamp_continuous","global_start_sec","global_end_sec","resets_detected_here"]]

# --- Per-master-session durations BEFORE dropping helper columns ---
dur_rows = []
s15_out = df_out.loc[mask_target].copy()
for key, g in s15_out.groupby(df.loc[mask_target, "_stitch_key"], sort=False):
    s = float(pd.to_numeric(g["global_start_sec"], errors="coerce").min())
    e = float(pd.to_numeric(g["global_end_sec"],   errors="coerce").max())
    total = e - s
    dur_rows.append({
        "stitch_key(session)": key,
        "rows": len(g),
        "resets_detected": int(g["resets_detected_here"].iloc[0]) if "resets_detected_here" in g.columns else 0,
        "duration_sec": total,
        "duration_hms": f"{int(total//3600)}:{int((total%3600)//60):02}:{int(total%60):02}",
    })
dur = pd.DataFrame(dur_rows).sort_values("duration_sec", ascending=False)

# --- Restore exact original order and assert row count unchanged ---
df_out = df_out.sort_values("_orig_row_order").reset_index(drop=True)
assert len(df_out) == orig_len, "Row count changed — content was cut, which should not happen."

# --- SAVE (drop helper cols only at write time) ---
Path(out_full).parent.mkdir(parents=True, exist_ok=True)
# Keep helper columns in the saved files? If not, drop them:
save_full = df_out.drop(columns=["_orig_row_order", "_stitch_key"], errors="ignore")
save_s15  = save_full.loc[mask_target]

save_full.to_csv(out_full, index=False)
save_s15.to_csv(out_s15, index=False)

print(f"Done. Kept all rows: {orig_len}.")
print(f"• Wrote full file: {out_full}")
print(f"• Wrote ABI S15-only file: {out_s15}")
print("\nPer-master-session ABI_S15 durations:")
print(dur.to_string(index=False))


✅ Done. Kept all rows: 16467.
• Wrote full file: /Users/maxchalekson/Projects/NICO-Research/NICO_human-gemini/Data/utterance_data_aligned_full_continuous_overwritten.csv
• Wrote ABI S15-only file: /Users/maxchalekson/Projects/NICO-Research/NICO_human-gemini/Data/utterance_data_ABI_S15_continuous_overwritten.csv

Per-master-session ABI_S15 durations:
stitch_key(session)  rows  resets_detected  duration_sec duration_hms
 2021_05_21_ABI_S15   255                6        3515.0      0:58:35


 '01:07-01:59' '02:00-02:03' '02:03-02:52' '02:52-02:54' '02:56-03:44'
 '03:44-03:50' '03:50-05:01' '05:01-05:02' '05:02-05:52' '05:52-06:22'
 '06:22-06:45' '06:45-06:51' '06:53-07:27' '07:27-07:30' '07:30-07:31'
 '07:35-07:37' '07:37-08:22' '08:22-08:26' '08:27-09:00' '09:00-09:27'
 '09:27-09:30' '09:30-09:50' '09:50-09:54' '09:54-10:00' '10:00-10:03'
 '10:03-10:59' '10:59-11:06' '11:08-11:11' '11:11-11:14' '11:15-11:21'
 '11:21-11:36' '11:36-11:43' '11:43-11:59' '11:59-12:06' '12:06-12:23'
 '12:28-12:36' '12:44-12:59' '13:00-13:04' '13:04-13:10' '13:12-13:15'
 '13:16-13:29' '13:29-13:47' '13:49-14:29' '14:29-14:30' '14:30-14:30'
 '14:30-14:31' '14:31-14:31' '14:31-14:39' '14:40-14:45' '14:45-14:46'
 '14:46-15:09' '15:09-15:18' '15:18-15:19' '15:19-15:19' '15:19-15:19'
 '15:19-15:20' '15:20-15:22' '15:22-15:23' '15:23-15:24' '15:24-15:24'
 '15:24-15:25' '15:25-15:26' '15:27-15:39' '15:39-15:39' '15:39-15:39'
 '15:39-15:39' '15:39-15:39' '15:39-15:39' '15:39-15:39' '15:39-15:39'
 '15:3

This part uses `utterance_data_ABI_S15_continuous_overwritten.csv`.

`global_start_sec` works correctly, even though the session ends at approximately the 58th minute.

But somehow that is only the last of the transcripts that were pulled out?

I checked against the JSON file of transcripts, and that is the last one from EV's GitHub, so this is all I can do for this part right now.

### Breaking down into sentences as the "smallest unit"

In [26]:
# === Split transcripts into sentence-level units (with light tags + estimated per-sentence times) ===
import pandas as pd
import numpy as np
import re
from pathlib import Path

# -------- CONFIG --------
in_path  = "/Users/maxchalekson/Projects/NICO-Research/NICO_human-gemini/Data/utterance_data_ABI_S15_continuous_overwritten.csv"
out_path = "/Users/maxchalekson/Projects/NICO-Research/NICO_human-gemini/Data/sentences_ABI_S15.csv"

# If you want to include equal-time estimates for each sentence in an utterance:
ADD_SENT_TIME_ESTIMATES = True

# -------- LOAD --------
df = pd.read_csv(in_path)
if "transcript" not in df.columns:
    raise ValueError("Expected a 'transcript' column in the input CSV.")

# Keep a stable order
df["_row_id"] = np.arange(len(df))

# Normalize some columns if present (fill NaN first so missing values don't become the string "nan")
for col in ["transcript","speaker","session","global_session","timestamp"]:
    if col in df.columns:
        df[col] = df[col].fillna("").astype(str).str.strip()

# -------- Sentence splitter (robust-ish) --------
# Avoid splitting on common abbreviations (e.g., "Dr.", "e.g.", "i.e.", etc.)
ABBR_TERMINALS = {
    "mr.","mrs.","ms.","dr.","prof.","sr.","jr.","vs.","etc.","e.g.","i.e.",
    "al.","fig.","figs.","eq.","eqs.","cf.","inc.","ltd.","co.","corp.","dept."
}

# Backchannels we may want to merge into neighboring sentences
BACKCHANNELS = {
    "ok","okay","okay.","ok.","yes","yeah","yep","right","mm-hmm","mhm","uh-huh",
    "sounds good","sounds good.","sure","great","thanks","thank you","cool","nice",
    "works for me","fine by me","let's do it","lets do it","i'm in","im in"
}

# Dialogue-act regexes (very lightweight)
PROPOSAL_PATTERNS = [
    r"\blet'?s\b", r"\b(shall we|should we|could we|can we)\b",
    r"\b(do you want to|wanna)\b", r"\b(how about|what if|why don'?t we)\b",
]
ACCEPTANCE_PATTERNS = [
    r"\b(sounds (good|great)|works for me|fine by me|that works|all good)\b",
    r"^(ok|okay|yes|yeah|yep|sure|alright)\b", r"\b(let'?s do it|count me in|i'?m in|im in|sgtm)\b",
]
REJECTION_PATTERNS = [r"\b(no|not now|maybe later|can'?t|cannot|won'?t|don'?t think so)\b"]
QUESTION_MARKERS   = [r"\?$", r"^(who|what|when|where|why|how)\b"]

def naive_split(text: str):
    """Split on sentence enders while protecting abbreviations."""
    text = (text or "").strip()
    if not text:
        return []
    # Split on punctuation that ends a sentence, keeping it with the sentence
    parts = re.split(r'(?<=[.!?])\s+', text)
    parts = [p for p in parts if p]
    if not parts:
        return []

    merged = [parts[0]]
    for p in parts[1:]:
        prev = merged[-1]
        prev_tail = prev.strip().lower().split()[-1] if prev.strip() else ""
        if prev_tail in ABBR_TERMINALS:
            merged[-1] = prev + " " + p
        else:
            merged.append(p)
    # Clean stray quotes/spaces
    return [m.strip(' "\'').strip() for m in merged if m.strip()]

def is_backchannel(s: str) -> bool:
    t = re.sub(r"[^\w\s?.!-]", "", (s or "").lower()).strip()
    return (t in BACKCHANNELS) or (len(t) <= 12 and t in {"ok","okay","yes","yeah","yep","right","cool","nice"})

def attach_intra_row_backchannels(s_list):
    """Attach short backchannels to the previous sentence to avoid standalones."""
    if not s_list: return []
    cleaned, buf = [], []
    for s in s_list:
        if is_backchannel(s):
            if cleaned:
                cleaned[-1] = (cleaned[-1] + " " + s).strip()
            else:
                buf.append(s)
        else:
            if buf:
                s = " ".join(buf) + " " + s
                buf = []
            cleaned.append(s.strip())
    if buf and cleaned:
        cleaned[-1] = (cleaned[-1] + " " + " ".join(buf)).strip()
    elif buf and not cleaned:
        cleaned = [" ".join(buf)]
    return cleaned

def any_match(pats, s): 
    return any(re.search(p, (s or "").lower()) for p in pats)

def classify_dialogue_act(sentence: str) -> str:
    s = (sentence or "").strip().lower()
    if any_match(ACCEPTANCE_PATTERNS, s): return "Acceptance"
    if any_match(REJECTION_PATTERNS, s):  return "Rejection/Deferral"
    if any_match(PROPOSAL_PATTERNS, s) and not re.search(r"\blet me\b", s): return "Proposal/Offer"
    if any_match(QUESTION_MARKERS, s):    return "Question"
    return "Inform/Report"

# -------- Build sentence-level rows --------
rows = []

# Columns we’ll try to carry through if present
carry_cols = [c for c in [
    "global_session","session","speaker","timestamp",
    "global_start_sec","global_end_sec","clip_number"
] if c in df.columns]

df = df.sort_values(["_row_id"]).reset_index(drop=True)

for i, r in df.iterrows():
    utt = r.get("transcript", "")
    sents = naive_split(utt)
    sents = attach_intra_row_backchannels(sents)

    # Optional: estimate per-sentence times by evenly splitting the utterance span
    start_g = float(r.get("global_start_sec")) if "global_start_sec" in r and pd.notna(r["global_start_sec"]) else np.nan
    end_g   = float(r.get("global_end_sec"))   if "global_end_sec"   in r and pd.notna(r["global_end_sec"])   else np.nan
    dur     = end_g - start_g if (np.isfinite(start_g) and np.isfinite(end_g)) else np.nan

    n = max(len(sents), 1)
    per = (dur / n) if (ADD_SENT_TIME_ESTIMATES and np.isfinite(dur) and dur > 0) else np.nan

    for j, s in enumerate(sents or [utt], start=1):
        row = {
            "source_row_index": i,
            "sent_index_in_row": j,
            "sentence": s,
            "dialogue_act": classify_dialogue_act(s),
        }
        # Carry through identifying columns
        for c in carry_cols:
            row[c] = r.get(c)

        # Estimated per-sentence times
        if ADD_SENT_TIME_ESTIMATES and np.isfinite(per):
            est_start = start_g + per * (j - 1)
            est_end   = start_g + per * j
            row["sent_start_sec_est"] = est_start
            row["sent_end_sec_est"]   = min(end_g, est_end)
        else:
            row["sent_start_sec_est"] = np.nan
            row["sent_end_sec_est"]   = np.nan

        rows.append(row)

df_sent = pd.DataFrame(rows).sort_values(
    ["source_row_index","sent_index_in_row"], kind="mergesort"
).reset_index(drop=True)

# -------- SAVE --------
Path(out_path).parent.mkdir(parents=True, exist_ok=True)
df_sent.to_csv(out_path, index=False)

print(f"Wrote sentence units -> {out_path}")
print(f"Rows (utterances): {len(df)}  |  Rows (sentences): {len(df_sent)}")
if ADD_SENT_TIME_ESTIMATES:
    with_times = df_sent["sent_start_sec_est"].notna().sum()
    print(f"Sentences with estimated times: {with_times} / {len(df_sent)}")


Wrote sentence units -> /Users/maxchalekson/Projects/NICO-Research/NICO_human-gemini/Data/sentences_ABI_S15.csv
Rows (utterances): 255  |  Rows (sentences): 497
Sentences with estimated times: 457 / 497


In [27]:
# === Fill missing sentence time estimates from parent utterance spans ===
# - Merges sentence rows with their parent utterance via `source_row_index`
# - Uses parent numeric times (global_start_sec/global_end_sec) if present
# - Else parses parent timestamp strings (timestamp or timestamp_continuous)
# - Evenly divides the utterance duration across its sentences
# - Guarantees no NaNs remain in sent_start_sec_est / sent_end_sec_est

import pandas as pd
import numpy as np
import re
from pathlib import Path

# --- CONFIG (edit these paths for your machine) ---
sentences_path  = "/Users/maxchalekson/Projects/NICO-Research/NICO_human-gemini/Data/sentences_ABI_S15.csv"
utterances_path = "/Users/maxchalekson/Projects/NICO-Research/NICO_human-gemini/Data/utterance_data_ABI_S15_continuous_overwritten.csv"
out_path        = "/Users/maxchalekson/Projects/NICO-Research/NICO_human-gemini/Data/sentences_ABI_S15_filled.csv"

# --- Load ---
sents = pd.read_csv(sentences_path)
utts  = pd.read_csv(utterances_path)

# Sanity: need this to map sentences back to the utterance row they came from
if "source_row_index" not in sents.columns:
    raise ValueError("Expected 'source_row_index' in the sentence file to map back to utterances.")

# Build a stable utterance row index (original file order)
utts = utts.copy()
utts["utter_row_idx"] = np.arange(len(utts), dtype=np.int64)

# Keep only what's useful from utterances (robust to column presence)
def present(cols, df_cols): return [c for c in cols if c in df_cols]
need_from_utts = ["utter_row_idx", "global_start_sec", "global_end_sec",
                  "timestamp", "timestamp_continuous"]
avail = present(need_from_utts, utts.columns)
utts_sub = utts[avail].copy()

# Avoid name collisions: rename timestamp strings
if "timestamp" in utts_sub.columns:
    utts_sub = utts_sub.rename(columns={"timestamp": "utter_timestamp"})
if "timestamp_continuous" in utts_sub.columns:
    utts_sub = utts_sub.rename(columns={"timestamp_continuous": "utter_timestamp_continuous"})

# Merge sentences with parent utterances
merged = sents.merge(utts_sub, left_on="source_row_index", right_on="utter_row_idx", how="left")

# --- Helpers to parse "mm:ss-mm:ss" or "hh:mm:ss-hh:mm:ss" ---
_time_pat = re.compile(r'(\d{1,2}:\d{2}(?::\d{2})?)')

def parse_hhmmss_to_sec(x):
    if pd.isna(x): return np.nan
    s = str(x).strip()
    if not s: return np.nan
    parts = s.split(":")
    try:
        parts = [float(p) for p in parts]
    except Exception:
        return np.nan
    if len(parts) == 3:
        h, m, sec = parts
    elif len(parts) == 2:
        h, m, sec = 0.0, parts[0], parts[1]
    elif len(parts) == 1:
        h, m, sec = 0.0, 0.0, parts[0]
    else:
        return np.nan
    return h*3600 + m*60 + sec

def split_timestamp_robust(ts):
    s = "" if pd.isna(ts) else str(ts)
    hits = _time_pat.findall(s)
    if len(hits) >= 2:
        return parse_hhmmss_to_sec(hits[0]), parse_hhmmss_to_sec(hits[1])
    elif len(hits) == 1:
        v = parse_hhmmss_to_sec(hits[0]); return v, v
    else:
        return np.nan, np.nan

# Ensure the target columns exist
for col in ["sent_start_sec_est", "sent_end_sec_est"]:
    if col not in merged.columns:
        merged[col] = np.nan

# --- Core fill logic (applied per utterance) ---
def fill_group(g: pd.DataFrame) -> pd.DataFrame:
    # If every sentence already has estimates, skip
    need = g["sent_start_sec_est"].isna() | g["sent_end_sec_est"].isna()
    if not need.any():
        return g

    # Prefer parent numeric times
    start_parent = pd.to_numeric(g.get("global_start_sec"), errors="coerce").iloc[0] if "global_start_sec" in g else np.nan
    end_parent   = pd.to_numeric(g.get("global_end_sec"),   errors="coerce").iloc[0] if "global_end_sec"   in g else np.nan

    # Fallback: parse parent's timestamp strings
    if (not np.isfinite(start_parent)) or (not np.isfinite(end_parent)):
        s1, e1 = split_timestamp_robust(g["utter_timestamp"].iloc[0]) if "utter_timestamp" in g else (np.nan, np.nan)
        s2, e2 = split_timestamp_robust(g["utter_timestamp_continuous"].iloc[0]) if "utter_timestamp_continuous" in g else (np.nan, np.nan)
        start_parent = s1 if np.isfinite(s1) else (s2 if np.isfinite(s2) else start_parent)
        end_parent   = e1 if np.isfinite(e1) else (e2 if np.isfinite(e2) else end_parent)

    # Last resort: tiny non-NaN span to avoid downstream errors
    if (not np.isfinite(start_parent)) or (not np.isfinite(end_parent)) or end_parent < start_parent:
        start_parent = 0.0 if not np.isfinite(start_parent) else float(start_parent)
        end_parent   = start_parent + 1e-6

    # Evenly space across sentences in this utterance
    n = len(g)
    step = (end_parent - start_parent) / n if n > 0 else 0.0
    starts = [start_parent + i * step for i in range(n)]
    ends   = [start_parent + (i + 1) * step for i in range(n)]

    g.loc[:, "sent_start_sec_est"] = starts
    g.loc[:, "sent_end_sec_est"]   = ends
    return g

# Apply per-utterance (grouped by the row index from the utterance table)
merged = merged.groupby("utter_row_idx", group_keys=False, dropna=False).apply(fill_group)

# Guard: any residual NaNs -> fill with safe defaults
resid = merged["sent_start_sec_est"].isna() | merged["sent_end_sec_est"].isna()
if resid.any():
    merged.loc[resid, "sent_start_sec_est"] = merged.loc[resid, "sent_start_sec_est"].fillna(0.0)
    merged.loc[resid, "sent_end_sec_est"]   = merged.loc[resid, "sent_end_sec_est"].fillna(1e-6)

# Drop helper cols (optional)
final = merged.drop(columns=[c for c in ["utter_row_idx", "utter_timestamp", "utter_timestamp_continuous"] if c in merged.columns])

# --- Save ---
Path(out_path).parent.mkdir(parents=True, exist_ok=True)
final.to_csv(out_path, index=False)

print(f"Wrote: {out_path}")
print(f"Rows total: {len(final)}")
print("Missing sent_start_sec_est:", final['sent_start_sec_est'].isna().sum())
print("Missing sent_end_sec_est:  ", final['sent_end_sec_est'].isna().sum())


Wrote: /Users/maxchalekson/Projects/NICO-Research/NICO_human-gemini/Data/sentences_ABI_S15_filled.csv
Rows total: 497
Missing sent_start_sec_est: 0
Missing sent_end_sec_est:   0




The file now in use is `sentences_ABI_S15_filled.csv`.

From this point on, the work focuses on the cosine similarity idea, using the global timestamp and sentences as the "smallest unit".
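One possible shape for that next step is sketched below: for each sentence, find the most similar earlier sentence and flag whether it came from a different speaker. This is only a sketch under assumptions. Bag-of-words counts stand in for SBERT embeddings, the helper names (`bow_vector`, `cosine`, `best_earlier_match`) are hypothetical, and the rows are made up; only the column names `speaker`, `sentence`, and `sent_start_sec_est` come from the sentence file.

```python
import math
from collections import Counter

def bow_vector(text):
    """Bag-of-words counts; a crude stand-in for an SBERT embedding (assumption)."""
    return Counter(text.lower().split())

def cosine(c1, c2):
    """Cosine similarity between two word-count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def best_earlier_match(rows):
    """For each sentence after the first, report its most similar earlier sentence."""
    rows = sorted(rows, key=lambda r: r["sent_start_sec_est"])
    vecs = [bow_vector(r["sentence"]) for r in rows]
    out = []
    for i in range(1, len(rows)):
        sims = [cosine(vecs[i], vecs[j]) for j in range(i)]
        j = max(range(i), key=sims.__getitem__)  # index of the best earlier sentence
        out.append({
            "sentence": rows[i]["sentence"],
            "matched": rows[j]["sentence"],
            "similarity": sims[j],
            "cross_speaker": rows[i]["speaker"] != rows[j]["speaker"],
        })
    return out

# Made-up rows using the column names from the sentence file
rows = [
    {"speaker": "A", "sent_start_sec_est": 0.0, "sentence": "we could model the protein network"},
    {"speaker": "B", "sent_start_sec_est": 5.0, "sentence": "funding is due next month"},
    {"speaker": "B", "sent_start_sec_est": 9.0, "sentence": "yes model the protein network first"},
]
for match in best_earlier_match(rows):
    print(match)
```

Swapping `bow_vector` for real sentence embeddings would leave the comparison loop unchanged. The `cross_speaker` flag is what separates an idea being picked up by someone else from one person elaborating on their own point.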

## Idea 3 - Merge small speaker runs - unnecessary?