# Sentence Convergence w/ NLP

Cosine similarity may work best as the output metric for telling whether ideas are converging or diverging, because it measures how similar two points of comparison are.
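
To make the metric concrete, here is a minimal sketch of cosine similarity on toy vectors. The three vectors are made up for illustration and are not model output, and `cosine_similarity` is a hypothetical helper name, not something from the notebook.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

# Toy 3-d "embeddings" (illustrative only)
idea_a = [1.0, 2.0, 0.0]
idea_b = [2.0, 4.0, 0.0]   # same direction as idea_a, different length
idea_c = [0.0, 0.0, 3.0]   # orthogonal to idea_a

sim_ab = cosine_similarity(idea_a, idea_b)   # close to 1: same direction
sim_ac = cosine_similarity(idea_a, idea_c)   # 0: nothing in common
print(f"a vs b: {sim_ab:.3f}   a vs c: {sim_ac:.3f}")
```

Because the score depends only on direction, a long sentence and a short one can still score high if their embeddings point the same way.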

That raises the question of how much of one person's transcript to compare at a time. Do we break it up by sentence?
- What counts as the "smallest unit"?
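
One candidate for the smallest unit is the sentence. A rough sketch of that option, assuming a naive split on end-of-sentence punctuation (the utterance is invented for illustration and `split_sentences` is a hypothetical helper):

```python
import re

def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?'.

    Naive on purpose: it does not handle abbreviations such as "Dr." or "e.g.".
    """
    text = (text or "").strip()
    if not text:
        return []
    return [p.strip() for p in re.split(r"(?<=[.!?])\s+", text) if p.strip()]

# Invented utterance: one row that mixes a filler with two real sentences
utterance = "Yeah. Let's start with introductions. Who wants to go first?"
units = split_sentences(utterance)
for i, u in enumerate(units, start=1):
    print(i, u)
```

Splitting this way turns one row into three units, so the filler can be treated separately from the proposal and the question.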

# Setting up the files for analysis

## Idea 1 - Clean and reproducible - placebo

This step only tidies the transcript so it is easy to read: it cleans up the text formatting and removes any row whose transcript is empty.

In [4]:
import pandas as pd

# Load your file
df = pd.read_csv("/Users/maxchalekson/Projects/NICO-Research/NICO_human-gemini/Data/utterance_data_aligned_full.csv")

# --- Basic cleanup ---
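# Drop missing transcripts first: astype(str) would turn NaN into the string "nan",
# so a later dropna would never remove those rows.
df = df.dropna(subset=['transcript']).copy()
df['transcript'] = df['transcript'].astype(str).str.strip()
df = df[df['transcript'] != ""]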

# Sort chronologically
df = df.sort_values(by=['global_session', 'global_timestamp_sec']).reset_index(drop=True)

# Filter for specific session
session_id = "2021_05_21_ABI_S15_ABI"
df_session = df[df['global_session'] == session_id].reset_index(drop=True)

# Summary
print(f"Loaded {len(df_session)} utterances for session {session_id}")

# Show preview
preview_cols = ['global_session', 'speaker', 'timestamp', 'global_timestamp_sec', 'transcript']
print(df_session[preview_cols].head(10).to_string(index=False))



Loaded 255 utterances for session 2021_05_21_ABI_S15_ABI
        global_session    speaker   timestamp  global_timestamp_sec transcript
2021_05_21_ABI_S15_ABI Brad Smith 00:00-00:37                 615.0

## Idea 2 - Sentence Level Unit-ideas

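(This is the main way the CSV file will be processed.)

Folds backchannel utterances (e.g. "okay", "yes", "right", "sounds good") into the neighboring sentence in the same row, so they do not become units of their own. A row that contains only backchannels is kept as a single unit.

Splits each speaker's transcript into sentences.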

In [23]:
# === End-to-end (stitched): Clean -> Global time across clips -> Sentence units (+ dialogue acts) ===
import pandas as pd
import numpy as np
import re

# -------------------- CONFIG --------------------
csv_path   = "/Users/maxchalekson/Projects/NICO-Research/NICO_human-gemini/Data/utterance_data_aligned_full.csv"
session_key_prefix = r"^2021_05_21_ABI_S15"   # regex to match all clips for this session
out_name  = "sentences_2021_05_21_ABI_S15.csv"

# -------------------- LOAD + BASIC CLEAN --------------------
df = pd.read_csv(csv_path)
# Drop missing transcripts before astype(str), which would turn NaN into "nan"
df = df.dropna(subset=["transcript"]).copy()
df["transcript"] = df["transcript"].astype(str).str.strip()
df = df[df["transcript"] != ""]

if "global_session" not in df.columns:
    raise ValueError("Expected column 'global_session' not found.")

# Keep ALL clips that belong to this session (prefix match)
mask_session = df["global_session"].astype(str).str.match(session_key_prefix, na=False)
df_session = df[mask_session].copy()
if df_session.empty:
    raise ValueError("No rows found for the selected session prefix. Check 'global_session' values.")

# Make numeric
for c in ["start_sec","end_sec","clip_offset_sec","global_timestamp_sec","clip_number"]:
    if c in df_session.columns:
        df_session[c] = pd.to_numeric(df_session[c], errors="coerce")

# -------------------- BUILD STITCHED GLOBAL TIMESTAMP --------------------
def build_global_ts_clean_stitched(df_sess: pd.DataFrame) -> pd.Series:
    """
    Prefer true stitched time = clip_offset_sec + start_sec.
    If clip_offset_sec is missing/NaN, synthesize offsets by ordering clip_number
    and accumulating each clip's max(end_sec).
    Enforce a strictly increasing timeline with tiny nudges.
    """
    idx = df_sess.index
    start = df_sess.get("start_sec")
    offset = df_sess.get("clip_offset_sec")

    # If clip_offset_sec is mostly missing, synthesize from clip_number blocks
    if (offset is None) or (offset.isna().mean() > 0.5):
        if "clip_number" not in df_sess.columns:
            # fallback to best-effort: use global_timestamp_sec if present
            base = pd.to_numeric(df_sess.get("global_timestamp_sec"), errors="coerce")
            ts = base.fillna(start).fillna(0.0)
        else:
            # order clips and accumulate durations
            tmp = df_sess[["clip_number","start_sec","end_sec"]].copy()
            # robust per-clip duration (>= 0)
            clip_stats = tmp.groupby("clip_number").agg(
                clip_min=("start_sec","min"),
                clip_max=("end_sec","max"),
            )
            clip_stats["clip_dur"] = (clip_stats["clip_max"] - clip_stats["clip_min"]).clip(lower=0).fillna(0)
            # sort clips by clip_number
            ordered_clips = sorted(clip_stats.index.dropna())
            cum = 0.0
            offsets = {}
            for cn in ordered_clips:
                offsets[cn] = cum
                cum += float(clip_stats.loc[cn, "clip_dur"])
            # stitched time = synthesized_offset + (start - clip_min) within that clip
            within = df_sess["start_sec"] - df_sess.groupby("clip_number")["start_sec"].transform("min")
            ts = within.fillna(0.0) + df_sess["clip_number"].map(offsets).fillna(0.0)
    else:
        # We have usable offsets: stitched time = offset + start
        ts = offset.fillna(0.0) + start.fillna(0.0)

    # Enforce strictly increasing (ties and backward steps are nudged by eps)
    eps = 1e-3
    ts = pd.to_numeric(ts, errors="coerce").fillna(0.0).astype(float)
    prev = -np.inf
    fixed = []
    for v in ts:
        if v <= prev:
            v = prev + eps
        fixed.append(v)
        prev = v

    return pd.Series(fixed, index=df_sess.index, dtype="float64")

df_session["global_ts_clean"] = build_global_ts_clean_stitched(df_session)

# -------------------- SENTENCE SPLIT + DIALOGUE ACTS --------------------
BACKCHANNELS = {
    "ok","okay","okay.","ok.","yes","yeah","yep","right","mm-hmm","mhm","uh-huh",
    "sounds good","sounds good.","sure","great","thanks","thank you","cool","nice",
    "works for me","fine by me","let's do it","lets do it","i'm in","im in"
}
PROPOSAL_PATTERNS = [
    r"\blet'?s\b",
    r"\b(shall we|should we|could we|can we)\b",
    r"\b(do you want to|dya want to|wanna)\b",
    r"\b(how about|what if|why don'?t we)\b",
]
ACCEPTANCE_PATTERNS = [
    r"\b(sounds good|sounds great|works for me|fine by me|that works|all good)\b",
    r"^(ok|okay|yes|yeah|yep|sure|alright)\b",
    r"\b(let'?s do it|count me in|i'?m in|im in|sgtm)\b",
]
REJECTION_PATTERNS = [r"\b(no|not now|maybe later|can'?t|cannot|won'?t|don'?t think so)\b"]
QUESTION_MARKERS  = [r"\?$", r"^(who|what|when|where|why|how)\b"]
ABBR_TERMINALS    = {"mr.","mrs.","ms.","dr.","prof.","sr.","jr.","vs.","etc.","e.g.","i.e."}

def naive_split(text: str):
    text = (text or "").strip()
    if not text: return []
    parts = re.split(r'(?<=[.!?])\s+', text)
    parts = [p for p in parts if p]
    merged = [parts[0]] if parts else []
    for p in parts[1:]:
        prev = merged[-1]
        prev_tail = prev.strip().lower().split()[-1] if prev.strip() else ""
        if prev_tail in ABBR_TERMINALS:
            merged[-1] = prev + " " + p
        else:
            merged.append(p)
    return [m.strip() for m in merged if m.strip()]

def split_into_sentences(text: str):
    return naive_split(text)

def is_backchannel(s: str) -> bool:
    t = re.sub(r"[^\w\s?.!-]", "", (s or "").lower()).strip()
    return (t in BACKCHANNELS) or (len(t) <= 12 and t in {"ok","okay","yes","yeah","yep","right","cool","nice"})

def attach_intra_row_backchannels(sent_list):
    if not sent_list: return []
    cleaned, bc_buf = [], []
    for s in sent_list:
        if is_backchannel(s):
            if cleaned: cleaned[-1] = cleaned[-1] + " " + s
            else: bc_buf.append(s)
        else:
            if bc_buf:
                s = " ".join(bc_buf) + " " + s
                bc_buf = []
            cleaned.append(s)
    if bc_buf and cleaned: cleaned[-1] = cleaned[-1] + " " + " ".join(bc_buf)
    elif bc_buf and not cleaned: cleaned = [" ".join(bc_buf)]
    return cleaned

def any_match(patterns, s):
    s_l = (s or "").lower()
    return any(re.search(p, s_l) for p in patterns)

def classify_dialogue_act(sentence: str) -> str:
    s = (sentence or "").strip()
    s_l = s.lower()
    if any_match(ACCEPTANCE_PATTERNS, s_l): return "Acceptance"
    if any_match(REJECTION_PATTERNS, s_l):  return "Rejection/Deferral"
    if any_match(PROPOSAL_PATTERNS, s_l) and not re.search(r"\blet me\b", s_l): return "Proposal/Offer"
    if any_match(QUESTION_MARKERS, s_l):    return "Question"
    return "Inform/Report"

STOP = {
    "the","a","an","and","or","but","so","to","of","in","on","for","with","at","by",
    "is","am","are","was","were","be","been","being","it","this","that","these","those",
    "i","you","he","she","we","they","me","him","her","us","them","my","your","his","her","our","their"
}
def content_signature(text, k=3):
    toks = re.findall(r"[A-Za-z0-9']+", (text or "").lower())
    toks = [t for t in toks if t not in STOP]
    seen, sig = set(), []
    for t in toks:
        if t not in seen:
            sig.append(t); seen.add(t)
        if len(sig) >= k: break
    return " ".join(sig)

# -------------------- BUILD SENTENCE-LEVEL UNITS --------------------
df_session = df_session.sort_values(["clip_number","start_sec","end_sec"], na_position="last").reset_index(drop=True)

rows = []
for idx, r in df_session.iterrows():
    sents = attach_intra_row_backchannels(split_into_sentences(r["transcript"]))
    for j, s in enumerate(sents, start=1):
        rows.append({
            "global_session": r.get("global_session"),
            "session": r.get("session"),
            "clip_number": r.get("clip_number"),
            "speaker": r.get("speaker"),
            "timestamp": r.get("timestamp"),
            "global_ts_clean": r.get("global_ts_clean"),
            "start_sec": r.get("start_sec"),
            "end_sec": r.get("end_sec"),
            "source_row_index": idx,
            "sent_index_in_row": j,
            "sentence": s,
            "dialogue_act": classify_dialogue_act(s),
            "content_signature": content_signature(s, k=3)
        })

df_sentences = pd.DataFrame(rows).sort_values(
    ["global_ts_clean","source_row_index","sent_index_in_row"]
).reset_index(drop=True)

# -------------------- SAVE + AUDIT --------------------
df_sentences.to_csv(out_name, index=False)
mmin = float(df_sentences["global_ts_clean"].min())
mmax = float(df_sentences["global_ts_clean"].max())
print(f"Sentence-level units saved -> {out_name}  ({len(df_sentences)} rows)")
print(f"global_ts_clean min/max: {mmin:.3f} / {mmax:.3f}  (~ minutes {int(mmin//60)} → {int(mmax//60)})")
print(f"clips included (unique): {sorted(pd.unique(df_session['clip_number'].dropna()))[:10]} ...")
print(df_sentences[["speaker","global_ts_clean","dialogue_act","sentence"]].head(12).to_string(index=False))




Sentence-level units saved -> sentences_2021_05_21_ABI_S15.csv  (497 rows)
global_ts_clean min/max: 615.000 / 1267.127  (~ minutes 10 → 21)
clips included (unique): [15.0] ...
      speaker  global_ts_clean   dialogue_act                                                                                                                    sentence
   Brad Smith            615.0 Proposal/Offer                                                                       Well, um, yeah, let's maybe start some introductions.
   Brad Smith            615.0  Inform/Report We're beginning the obviously by the third breakout session and all of this stuff beginning to know each other pretty well.
   Brad Smith            615.0  Inform/Report                                                                                       So, uh, I'll just read them out here.
   Brad Smith            615.0  Inform/Report                                                                                       Well, let me

In [25]:
import pandas as pd
import numpy as np

path = "/Users/maxchalekson/Projects/NICO-Research/NICO_human-gemini/Data/sentences_2021_05_21_ABI_S15.csv"
s = pd.read_csv(path)

# ---- Basic coverage ----
tmin, tmax = float(s["global_ts_clean"].min()), float(s["global_ts_clean"].max())
dur_min = (tmax - tmin) / 60.0
print(f"global_ts_clean range: {tmin:.3f} → {tmax:.3f} sec  (~{dur_min:.1f} min)")

# Expect ~0 → ~3997 sec (≈ 66.6 min). If you see ~615→1267 (~10.9 min), you only captured one clip.

# ---- Monotonicity & big gaps (clip seams) ----
s = s.sort_values(["global_ts_clean","source_row_index","sent_index_in_row"]).reset_index(drop=True)
diffs = s["global_ts_clean"].diff()
non_mono = int((diffs < 0).sum())
big_gaps = s.loc[diffs > 300, "global_ts_clean"]  # >5 min jumps
print(f"non‑monotonic steps: {non_mono}   |   gaps >5min: {len(big_gaps)}")

# ---- Clip coverage & order sanity ----
# (These columns came from the source rows, so they’ll be NaN if your source lacked them.)
for c in ["clip_number","start_sec","end_sec"]:
    if c not in s.columns:
        s[c] = np.nan

clip_span = (s.groupby("clip_number")
               .agg(n=("sentence","count"),
                    ts_min=("global_ts_clean","min"),
                    ts_max=("global_ts_clean","max"))
               .sort_values("ts_min"))
print("\nPer‑clip coverage on stitched timeline:")
print(clip_span.to_string())

# ---- Are we really getting *all* the clips for the session prefix? ----
# If ts range is too short, check how many distinct global_session IDs survived:
print("\nDistinct global_session values (first 20):")
print(s["global_session"].dropna().astype(str).value_counts().head(20).to_string())

# ---- Spot check: early, middle, late rows ----
def show(rows):
    cols = ["global_ts_clean","clip_number","speaker","dialogue_act","sentence"]
    print(s.loc[rows, cols].to_string(index=False, max_colwidth=80))

n = len(s)
print("\nEARLY:")
show(range(3))
print("\nMIDDLE:")
show(range(max(0,n//2-2), min(n, n//2+1)))
print("\nLATE:")
show(range(max(0,n-3), n))


global_ts_clean range: 615.000 → 1267.127 sec  (~10.9 min)
non‑monotonic steps: 0   |   gaps >5min: 0

Per‑clip coverage on stitched timeline:
               n  ts_min    ts_max
clip_number                       
15.0         497   615.0  1267.127

Distinct global_session values (first 20):
global_session
2021_05_21_ABI_S15_ABI    497

EARLY:
 global_ts_clean  clip_number    speaker   dialogue_act                                                                         sentence
           615.0         15.0 Brad Smith Proposal/Offer                            Well, um, yeah, let's maybe start some introductions.
           615.0         15.0 Brad Smith  Inform/Report We're beginning the obviously by the third breakout session and all of this s...
           615.0         15.0 Brad Smith  Inform/Report                                            So, uh, I'll just read them out here.

MIDDLE:
 global_ts_clean  clip_number                speaker  dialogue_act                                

A few things happen here:

1. Breaking each row of the file into sentences

In the raw file, one "row" might be a short filler ("Yeah.") or a long paragraph with multiple ideas. That makes it hard to compare ideas fairly, so we split each row into sentences. Every chunk is then roughly the same size, which makes the similarity scores more meaningful.

2. Keeping the speaker

We always keep track of who said each sentence. This way we can tell whether an idea is picked up by someone else (cross-speaker) or the same person is just adding more detail (same-speaker).

3. Tagging the type of sentence

We give each sentence a quick label like `Proposal/Offer`, `Acceptance`, `Question`, or `Inform/Report`. This is important because not every sentence is an "idea". For example, "Sounds good." is an acceptance, not a new idea. Tagging these sentences stops them from skewing the idea-similarity results, while still letting us track decisions.

4. Making the timeline clean

Sometimes timestamps in the file go backwards or tie. We fix this by nudging a timestamp to 0.001 seconds after the previous one when needed. This tiny change doesn't affect the analysis, but it keeps the data in the right order.
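
The nudging in step 4 can be sketched on its own. This is a minimal standalone version, assuming the same 0.001 s step as `eps` in the cell above; `enforce_increasing` and the sample list are hypothetical, chosen only to show a tie and a backward step.

```python
def enforce_increasing(timestamps, eps=1e-3):
    """Return a new list in which every value is strictly greater than the one before it.

    Any value that ties with or falls behind its predecessor is moved to
    (previous value + eps).
    """
    fixed = []
    prev = float("-inf")
    for t in timestamps:
        if t <= prev:
            t = prev + eps
        fixed.append(t)
        prev = t
    return fixed

# Three sentences sharing one row timestamp, then a later row that steps backwards
raw = [615.0, 615.0, 615.0, 620.5, 619.0]
clean = enforce_increasing(raw)
print(clean)
```

For ties the shift is a few milliseconds. A real backward step is different: the 619.0 above is moved past 620.5, a shift of about 1.5 s, so it may be worth counting how many rows move by more than `eps` before trusting the stitched order.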

### Implementing cosine similarity (SBERT)

All of this still tests one open question: based on the code run so far, is Gemini even capable of identifying convergence toward a solution, even when NLP topics are used within the transcript?
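
Before the full cell, here is a rough sketch of the thresholding idea it relies on. The cutoffs 0.68 and 0.35 are the `tau_conv` and `tau_div` values from the cell's config; the `label_pair` helper and the "neutral" label are hypothetical names used only for illustration (the cell itself simply leaves the middle band out of both lists). The first score is taken from the printed output, the other two are made up.

```python
TAU_CONV = 0.68   # at or above this, a pair is a convergence candidate
TAU_DIV = 0.35    # at or below this, a pair is a divergence candidate

def label_pair(cosine, tau_conv=TAU_CONV, tau_div=TAU_DIV):
    """Map one cosine score to a coarse label using the two cutoffs."""
    if cosine >= tau_conv:
        return "convergence"
    if cosine <= tau_div:
        return "divergence"
    return "neutral"  # hypothetical label for the band in between

scores = [0.775, 0.50, 0.20]
labels = [label_pair(c) for c in scores]
for c, lab in zip(scores, labels):
    print(f"{c:.3f} -> {lab}")
```

The cutoffs are judgment calls rather than calibrated values, so the counts of convergence and divergence candidates will shift if they are changed.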

In [19]:
# ==== Convergence vs Divergence (refined; meta-filter, Top-K, chains, decision labels) ====
import pandas as pd, numpy as np, re, glob, os

# --- Config ---
session_csv    = "/Users/maxchalekson/Projects/NICO-Research/NICO_human-gemini/Data/sentences_2021_05_21_ABI_S15_ABI.csv"
model_name     = "sentence-transformers/all-MiniLM-L6-v2"

# Pairing policy
window_s   = 120
min_dt     = 2.0
cross_only = True

# Text filters
min_tokens   = 3
exclude_acts = {"Acceptance"}

# Meta-utterance filter
META_PATTERNS = [
    r"\bsorry\b", r"\bchange location\b", r"\bcan you hear\b", r"\bi can hear\b",
    r"\bare you there\b", r"\bi am here\b", r"\bmy name is\b", r"\bhello\b", r"\bhi\b",
    r"\bmid bite\b", r"\btesting\b", r"\btest\b"
]
meta_re = re.compile("|".join(META_PATTERNS), flags=re.IGNORECASE)

# Scoring policy
topK_per_A = 3
tau_conv   = 0.68
tau_div    = 0.35

N_preview  = 15

# --- Load & prefilter ---
df = pd.read_csv(session_csv)
req = {"global_session","speaker","sentence","global_ts_clean","dialogue_act","source_row_index","sent_index_in_row"}
missing = req - set(df.columns)
if missing:
    raise ValueError(f"Missing columns: {missing}")

def norm_text(s: str) -> str:
    s = str(s).strip().lower()
    s = "".join(ch for ch in s if ch.isalnum() or ch.isspace())
    return " ".join(s.split())

df["sentence"] = df["sentence"].astype(str).str.strip()
df["tok_len"]  = df["sentence"].str.split().apply(len)
df["is_meta"]  = df["sentence"].apply(lambda s: bool(meta_re.search(s)))

mask_len = df["tok_len"] >= min_tokens
mask_act = ~df["dialogue_act"].isin(exclude_acts) if exclude_acts else True

df_use = df[mask_len & mask_act & (~df["is_meta"])].copy()
df_use["norm"] = df_use["sentence"].apply(norm_text)

df_use = df_use.sort_values(["global_session","global_ts_clean","source_row_index","sent_index_in_row"]).reset_index(drop=True)
df_use["unit_id"] = (
    df_use["global_session"].astype(str) + ":" +
    df_use["source_row_index"].astype(str) + ":" +
    df_use["sent_index_in_row"].astype(str)
)

print(f"Usable units for cosine: {len(df_use)} (of {len(df)})")

keep = ["global_session","unit_id","speaker","sentence","norm","global_ts_clean","dialogue_act","source_row_index"]
L = df_use[keep].rename(columns={
    "unit_id":"unit_A","speaker":"speaker_A","sentence":"sentence_A","norm":"norm_A",
    "global_ts_clean":"tA","dialogue_act":"act_A","source_row_index":"row_A"
})
R = df_use[keep].rename(columns={
    "unit_id":"unit_B","speaker":"speaker_B","sentence":"sentence_B","norm":"norm_B",
    "global_ts_clean":"tB","dialogue_act":"act_B","source_row_index":"row_B"
})

pairs = L.merge(R, on="global_session", how="inner")
pairs = pairs[(pairs["tB"] >= pairs["tA"]) & (pairs["unit_A"] != pairs["unit_B"])]
pairs["dt"] = (pairs["tB"] - pairs["tA"]).astype(float)
pairs = pairs[(pairs["dt"] <= window_s) & (pairs["dt"] >= min_dt)]
pairs["cross_speaker"] = pairs["speaker_A"].ne(pairs["speaker_B"])
if cross_only:
    pairs = pairs[pairs["cross_speaker"]]
pairs = pairs[pairs["norm_A"] != pairs["norm_B"]]

print(f"Candidate pairs after constraints: {len(pairs)}")

# --- SBERT embeddings + cosine ---
try:
    from sentence_transformers import SentenceTransformer
except Exception as e:
    raise RuntimeError("Install sentence-transformers: pip install -q sentence-transformers") from e

model = SentenceTransformer(model_name)
embA = model.encode(pairs["sentence_A"].tolist(), convert_to_tensor=True, normalize_embeddings=True)
embB = model.encode(pairs["sentence_B"].tolist(), convert_to_tensor=True, normalize_embeddings=True)
pairs["cosine"] = (embA * embB).sum(dim=1).detach().cpu().numpy()

# --- Top-K per A ---
pairs = pairs.sort_values(["unit_A","cosine"], ascending=[True, False])
# .copy() so the column assignments below do not raise SettingWithCopyWarning
pairs_topK = pairs.groupby("unit_A", as_index=False).head(topK_per_A).copy()

# --- Decision-language labels (toward solution) ---
DECISION_PATTERNS = [
    r"\b(let'?s|we should|we ought|we can|we could)\b",
    r"\b(decide(?:d)?|agreed|agreement|consensus)\b",
    r"\b(plan|next step|action item|assign|deadline)\b",
    r"\b(so the solution|the solution is|this works|that works)\b",
    r"\b(we'?ll|we will|we are going to|i will|i can)\b",
    r"\b(implement|try this|do this|use that)\b"
]
dec_re = re.compile("|".join(DECISION_PATTERNS), flags=re.IGNORECASE)
pairs_topK["decision_B"] = pairs_topK["sentence_B"].apply(lambda s: bool(dec_re.search(str(s))))
pairs_topK["decision_A"] = pairs_topK["sentence_A"].apply(lambda s: bool(dec_re.search(str(s))))

# --- Split once (after labeling) ---
conv = pairs_topK[pairs_topK["cosine"] >= tau_conv].copy()
div  = pairs_topK[pairs_topK["cosine"] <= tau_div].copy()
# Note: decision_B | (~decision_A & decision_B) is logically the same as decision_B
conv["conv_toward_solution"] = conv["decision_B"]

print(f"Convergence candidates (>= {tau_conv}): {len(conv)} | decision-like B: {int(conv['decision_B'].sum())}")
print(f"Divergence candidates (<= {tau_div}): {len(div)}")

# --- Temporal series for plotting ---
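def to_minute(x):
    """Whole minute for a timestamp in seconds; None if the value is missing or not numeric."""
    try:
        return int(float(x) // 60)
    except (TypeError, ValueError, OverflowError):
        return None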

conv["minute_A"] = conv["tA"].apply(to_minute)
conv["minute_B"] = conv["tB"].apply(to_minute)

series_conv = conv.groupby("minute_B").size().rename("conv_count").reset_index()
series_conv_dec = conv[conv["decision_B"]].groupby("minute_B").size().rename("conv_decision_count").reset_index()

if not series_conv.empty:
    mmin, mmax = series_conv["minute_B"].min(), series_conv["minute_B"].max()
    timeline = pd.DataFrame({"minute_B": list(range(mmin, mmax+1))})
    temporal = (timeline.merge(series_conv, how="left", on="minute_B")
                        .merge(series_conv_dec, how="left", on="minute_B")
                        .fillna(0))
    temporal["conv_rate"] = temporal["conv_count"] / max(temporal["conv_count"].sum(), 1)
else:
    temporal = pd.DataFrame(columns=["minute_B","conv_count","conv_decision_count","conv_rate"])

# --- Optional: attach human minute-notes (summary_..._minXX.txt) ---
def read_summary_snippets(pattern="summary_*.txt"):
    rows = []
    for p in glob.glob(pattern):
        m = re.search(r"_min(\d+)\.txt$", os.path.basename(p))
        if not m: continue
        minute = int(m.group(1))
        try:
            with open(p, "r", encoding="utf-8") as fh: txt = fh.read().strip()
        except Exception: txt = ""
        rows.append({"minute_B": minute, "summary_note": txt, "summary_file": os.path.basename(p)})
    return pd.DataFrame(rows)

summ_df = read_summary_snippets("summary_*.txt")
if not summ_df.empty:
    temporal_annot = temporal.merge(summ_df, how="left", on="minute_B")
else:
    temporal_annot = temporal.copy()

# --- Save outputs ---
cols = [
    "global_session",
    "unit_A","speaker_A","tA","act_A","sentence_A",
    "unit_B","speaker_B","tB","act_B","sentence_B",
    "dt","cosine","cross_speaker"
]
conv_out = conv[cols + ["decision_A","decision_B","conv_toward_solution"]]
conv_path = f"convergence_pairs_w{window_s}_k{topK_per_A}_t{tau_conv}_labeled.csv"
div_path  = f"divergence_pairs_w{window_s}_k{topK_per_A}_t{tau_div}.csv"
chain_path= f"convergence_chains_w{window_s}_k{topK_per_A}_t{tau_conv}_labeled.csv"
ts_path   = f"temporal_uptake_series_w{window_s}_k{topK_per_A}_t{tau_conv}.csv"
ts_ann    = f"temporal_uptake_series_annotated_w{window_s}_k{topK_per_A}_t{tau_conv}.csv"

conv_out.to_csv(conv_path, index=False)
div[cols].to_csv(div_path, index=False)
temporal.to_csv(ts_path, index=False)
temporal_annot.to_csv(ts_ann, index=False)

if not conv.empty:
    conv_chain = (conv.groupby(["unit_A","speaker_A","tA","act_A","sentence_A"])
                      .agg(n_responders=("speaker_B","nunique"),
                           n_links=("unit_B","nunique"),
                           latest_dt=("dt","max"),
                           max_cosine=("cosine","max"),
                           any_decision_B=("decision_B","any"))
                      .reset_index()
                      .sort_values(["n_responders","n_links","max_cosine"], ascending=[False,False,False]))
    conv_chain.to_csv(chain_path, index=False)

print(f"Saved:\n - {conv_path}\n - {div_path}\n - {ts_path}\n - {ts_ann}" + (f"\n - {chain_path}" if not conv.empty else ""))

# --- Preview AFTER labeling ---
def preview(dfp, title, n=N_preview, reverse=False):
    if dfp.empty:
        print(f"\n=== {title}: (none) ==="); return
    dfp = dfp.sort_values("cosine", ascending=reverse).head(n)
    print(f"\n=== {title} (n={len(dfp)}) ===")
    for _, r in dfp.iterrows():
        print(f"[{r['cosine']:.3f}] Δt={int(r['dt'])}s | {r['speaker_A']} → {r['speaker_B']} | Acts: {r['act_A']}→{r['act_B']} | decision_B={r.get('decision_B', False)}")
        print(f"  A: {r['sentence_A']}")
        print(f"  B: {r['sentence_B']}\n")

preview(conv, "Top convergence (labeled; highest cosine)", n=N_preview, reverse=False)
preview(div,  "Top divergence (lowest cosine)",           n=N_preview, reverse=True)


Usable units for cosine: 393 (of 497)
Candidate pairs after constraints: 23056




Convergence candidates (>= 0.68): 21 | decision-like B: 5
Divergence candidates (<= 0.35): 222
Saved:
 - convergence_pairs_w120_k3_t0.68_labeled.csv
 - divergence_pairs_w120_k3_t0.35.csv
 - temporal_uptake_series_w120_k3_t0.68.csv
 - temporal_uptake_series_annotated_w120_k3_t0.68.csv
 - convergence_chains_w120_k3_t0.68_labeled.csv

=== Top convergence (labeled; highest cosine) (n=15) ===
[0.775] Δt=49s | Yevgenia Kozorovitskiy → Brad Smith | Acts: Inform/Report→Question | decision_B=True
  A: That kind of moves us a bit away from uh considering deep tissue applications specifically.
  B: I mean is that is that a technology that would work at a deeper tissue as well?

[0.758] Δt=49s | Carolyn Bayer → Aniruddha Ray | Acts: Inform/Report→Inform/Report | decision_B=False
  A: So I'm spatial resolution is more in the 100 to 200 micron.
  B: Um, or or even a few micron resolution.

[0.757] Δt=67s | Brad Smith → Aniruddha Ray | Acts: Inform/Report→Inform/Report | decision_B=False
  A: So the 



In [22]:
import pandas as pd, numpy as np

sentences_path = "/Users/maxchalekson/Projects/NICO-Research/NICO_human-gemini/Data/sentences_2021_05_21_ABI_S15_ABI.csv"
s = pd.read_csv(sentences_path)

print("global_ts_clean min/max:", float(s["global_ts_clean"].min()), float(s["global_ts_clean"].max()))
print("minutes span:", int(s["global_ts_clean"].min()//60), "→", int(s["global_ts_clean"].max()//60))

print("\nRows per clip_number (min/max ts):")
cols = [c for c in ["clip_number","start_sec","end_sec","global_ts_clean"] if c in s.columns]
print(s[cols].groupby("clip_number").agg(["min","max","count"]))


global_ts_clean min/max: 615.0 1267.126999999997
minutes span: 10 → 21

Rows per clip_number (min/max ts):
            start_sec              end_sec              global_ts_clean  \
                  min    max count     min    max count             min   
clip_number                                                               
15.0              0.0  652.0   497     6.0  675.0   497           615.0   

                             
                  max count  
clip_number                  
15.0         1267.127   497  


## Idea 3 - merge small speaker runs - unnecessary?