# Snapshot Timestamp Tagging Notebook

This Jupyter notebook walks through the complete pipeline for **inserting snapshot markers** into observation-segment transcripts.  
After execution you will have a new file, **`TIMESTAMPED_peru_cleaned_transcripts.csv`**, that contains the modified transcript text with inline tags identifying three key one-minute windows:

| Snapshot | Minutes | Seconds | Tag |
|----------|---------|---------|------|
| 1        | 4 – 5   | 240 – 300 | `<SNAPSHOT 1> … </SNAPSHOT 1>` |
| 2        | 9 – 10  | 540 – 600 | `<SNAPSHOT 2> … </SNAPSHOT 2>` |
| 3        | 14 – 15 | 840 – 900 | `<SNAPSHOT 3> … </SNAPSHOT 3>` |

Each window is half-open: a word is wrapped when its start time is at or after the window start and before the window end. Text outside these windows is left untagged.

## 1 — Setup

We import core libraries and set paths to the source and destination CSV files.  
Feel free to adjust `SOURCE_CSV` and `DEST_CSV` if your directory layout differs.

In [1]:
import json
from pathlib import Path
import pandas as pd

# ----- Paths ------------------------------------------------------------
SOURCE_CSV = Path("/Users/mkrasnow/Desktop/montesa/new/formattedData/peru_cleaned_transcripts.csv")
DEST_CSV   = SOURCE_CSV.parent / "TIMESTAMPED_peru_cleaned_transcripts.csv"

assert SOURCE_CSV.exists(), f"Source CSV not found: {SOURCE_CSV}"
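
If the source file lives elsewhere, the destination name can be derived from it rather than typed by hand. The snippet below is a minimal sketch of that idea; `timestamped_path` and the `data/` folder are hypothetical names used only for illustration, not part of the notebook.

```python
from pathlib import Path

def timestamped_path(source_csv: Path, prefix: str = "TIMESTAMPED_") -> Path:
    """Return a sibling path whose filename is the source name with a prefix."""
    return source_csv.with_name(prefix + source_csv.name)

# Hypothetical location, used only to show the result.
example = Path("data") / "peru_cleaned_transcripts.csv"
print(timestamped_path(example).name)  # TIMESTAMPED_peru_cleaned_transcripts.csv
```

This produces the same filename as the `SOURCE_CSV.parent / ...` expression above while keeping the prefix in one place.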

## 2 — Snapshot-insertion utility

Below is a single helper function, **`insert_snapshot_tags`**, that takes the raw JSON string from any transcript column and returns the text with snapshot tags inserted at the correct points.

### Algorithm overview
1. Parse the JSON into a dictionary (no external schema required).  
2. Walk through the words chronologically.  
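3. Open a `<SNAPSHOT n>` tag once we **enter** its window (first word whose `start` time is ≥ the window start and < the window end).  
4. Close with a `</SNAPSHOT n>` tag before the first word whose `start` time is **at or past** the window end, or at the end of the transcript if the window is still open.  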
5. Emit the word’s text exactly as stored.  
6. Join tokens with spaces and perform a light tidy-up for readability.
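
The steps above can be traced on a toy payload. The snippet below is a minimal illustration, not the notebook's helper: `tag_words` and `TOY_WINDOWS` are hypothetical names, the window is shrunk to 2 s to 4 s so the data stays short, and the `words`, `start`, and `text` keys are assumed to match the transcript JSON read in the next cell.

```python
import json
from typing import List, Tuple

# Hypothetical, shortened window (2 s to 4 s) so the toy data stays small.
TOY_WINDOWS: List[Tuple[int, float, float]] = [(1, 2.0, 4.0)]

def tag_words(transcript_json: str, windows=TOY_WINDOWS) -> str:
    """Illustrative sketch: wrap words whose start lies in [win_start, win_end)."""
    words = json.loads(transcript_json).get("words", [])
    open_now = {num: False for num, _s, _e in windows}
    out: List[str] = []
    for w in words:
        start = float(w.get("start", 0.0))
        for num, win_start, win_end in windows:
            # Step 3: open on the first word that starts inside the window.
            if not open_now[num] and win_start <= start < win_end:
                out.append(f"<SNAPSHOT {num}>")
                open_now[num] = True
            # Step 4: close on the first word that starts at or after the end.
            elif open_now[num] and start >= win_end:
                out.append(f"</SNAPSHOT {num}>")
                open_now[num] = False
        out.append(w.get("text", ""))  # Step 5: emit the word unchanged.
    # Close a window that is still open when the words run out.
    out.extend(f"</SNAPSHOT {num}>" for num, is_open in open_now.items() if is_open)
    return " ".join(" ".join(out).split())  # Step 6: join and tidy whitespace.

toy = json.dumps({"words": [
    {"start": 0.5, "text": "uno"},
    {"start": 2.0, "text": "dos"},
    {"start": 3.9, "text": "tres"},
    {"start": 4.0, "text": "cuatro"},
]})
print(tag_words(toy))  # uno <SNAPSHOT 1> dos tres </SNAPSHOT 1> cuatro
```

The word starting exactly at 4.0 s falls outside the window, which mirrors how a word starting at 300 s is excluded from Snapshot 1.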

In [2]:
from typing import List, Tuple

SNAPSHOT_WINDOWS: List[Tuple[int, float, float]] = [
    (1, 4 * 60, 5 * 60),    # 4:00 ≤ start < 5:00   (240 s to 300 s)
    (2, 9 * 60, 10 * 60),   # 9:00 ≤ start < 10:00  (540 s to 600 s)
    (3, 14 * 60, 15 * 60),  # 14:00 ≤ start < 15:00 (840 s to 900 s)
]

def insert_snapshot_tags(transcript_json: str) -> str:
    """Return transcript text with <SNAPSHOT n> markers inserted."""
    if not isinstance(transcript_json, str) or not transcript_json.strip():
        return ""

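    try:
        t_dict = json.loads(transcript_json)
    except json.JSONDecodeError:
        # If the JSON is malformed we leave the cell blank rather than crashing.
        return ""

    # Valid JSON that is not an object (e.g. a list or null) has no "words" key.
    if not isinstance(t_dict, dict):
        return ""

    # Fall back to the plain "text" field when no word-level timings exist.
    words = t_dict.get("words", [])
    if not words:
        return t_dict.get("text", "")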

    # Track whether we are *inside* a given snapshot window.
    in_window = {num: False for num, _s, _e in SNAPSHOT_WINDOWS}
    tagged_tokens: List[str] = []

    for w in words:
        start_time = float(w.get("start", 0.0))
        token_text = w.get("text", "")

        # Open or close snapshot tags as required before appending the token.
        for num, win_start, win_end in SNAPSHOT_WINDOWS:
            if (not in_window[num]) and start_time >= win_start and start_time < win_end:
                tagged_tokens.append(f"<SNAPSHOT {num}>")
                in_window[num] = True

            if in_window[num] and start_time >= win_end:
                tagged_tokens.append(f"</SNAPSHOT {num}>")
                in_window[num] = False

        tagged_tokens.append(token_text)

    # Close any snapshot still open at the end of the transcript.
    for num, _s, _e in SNAPSHOT_WINDOWS:
        if in_window[num]:
            tagged_tokens.append(f"</SNAPSHOT {num}>")
            in_window[num] = False

    # Join on spaces; then collapse any multiple-space sequences introduced during tagging.
    final_text = " ".join(tagged_tokens)
    return " ".join(final_text.split())  # simple whitespace normalisation

## 3 — Load data and apply tagging

The dataset contains two transcript-JSON columns and two human-readable text columns.  
Our task is to **overwrite** the text columns with the tagged text rebuilt from the JSON, leaving every other column untouched. Rows whose JSON is missing or malformed end up with an empty text cell.

In [3]:
# Read the cleaned transcripts CSV
df = pd.read_csv(SOURCE_CSV)

COLUMN_PAIRS = [
    ("First Audio Transcript_JSON", "First Audio Transcript Text"),
    ("Last Audio Transcript_JSON",  "Last Audio Transcript Text"),
]

for json_col, text_col in COLUMN_PAIRS:
    if json_col not in df.columns or text_col not in df.columns:
        raise KeyError(f"Expected columns '{json_col}' and '{text_col}' not found in CSV header.")

    df[text_col] = df[json_col].apply(insert_snapshot_tags)

print("Snapshot insertion complete for all rows.")

Snapshot insertion complete for all rows.
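
Because `insert_snapshot_tags` returns an empty string for missing or malformed JSON, and a recording shorter than four minutes never reaches the first window, a quick tally can show how many rows were affected. The following is a rough sketch; `summarise_tags` is a hypothetical helper that accepts any iterable of strings, such as `df[text_col]`.

```python
import re
from collections import Counter
from typing import Iterable

# Matches opening tags only; "</SNAPSHOT n>" does not match because of the slash.
OPEN_TAG = re.compile(r"<SNAPSHOT (\d+)>")

def summarise_tags(texts: Iterable[object]) -> Counter:
    """Count blank cells and, per snapshot number, cells containing its opening tag."""
    counts: Counter = Counter()
    for text in texts:
        if not isinstance(text, str) or not text.strip():
            counts["blank"] += 1
            continue
        for num in set(OPEN_TAG.findall(text)):
            counts[f"snapshot_{num}"] += 1
    return counts

sample = [
    "uno <SNAPSHOT 1> dos </SNAPSHOT 1> tres",
    "",
    "short recording with no tags",
]
print(summarise_tags(sample))
```

A large `blank` count, or far fewer `snapshot_3` rows than `snapshot_1` rows, would suggest looking at the source JSON or the recording lengths before relying on the output.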


## 4 — Save result

The final DataFrame is saved alongside the original as `TIMESTAMPED_peru_cleaned_transcripts.csv`.

In [4]:
df.to_csv(DEST_CSV, index=False)
print(f"✅  Saved timestamp-tagged transcripts to:  {DEST_CSV}")

✅  Saved timestamp-tagged transcripts to:  /Users/mkrasnow/Desktop/montesa/new/formattedData/TIMESTAMPED_peru_cleaned_transcripts.csv


## 5 — Run notebook end-to-end

1. Execute every cell (⏯ **Run All**).  
2. Confirm that both success messages are printed.  
3. Verify that the new CSV appears in the same directory as the source file.  
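
Beyond checking that the file exists, the saved CSV can be scanned for unbalanced tags. The sketch below uses only the standard library; `tags_balanced` and `unbalanced_rows` are hypothetical helpers, and the check assumes the transcripts themselves never contain literal `<SNAPSHOT n>` text.

```python
import csv
import re
from pathlib import Path
from typing import List

TAG = re.compile(r"<(/?)SNAPSHOT (\d+)>")

def tags_balanced(text: str) -> bool:
    """True if every opening tag is closed by its matching tag before another opens."""
    open_num = None
    for slash, num in TAG.findall(text):
        if slash:  # closing tag: must match the currently open snapshot
            if open_num != num:
                return False
            open_num = None
        else:      # opening tag: no other snapshot may be open
            if open_num is not None:
                return False
            open_num = num
    return open_num is None

def unbalanced_rows(csv_path: Path, text_columns: List[str]) -> List[int]:
    """Return 0-based indices of data rows with unbalanced tags in any text column."""
    bad = []
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for i, row in enumerate(csv.DictReader(fh)):
            if any(not tags_balanced(row.get(col) or "") for col in text_columns):
                bad.append(i)
    return bad
```

Calling `unbalanced_rows(DEST_CSV, ["First Audio Transcript Text", "Last Audio Transcript Text"])` would be expected to return an empty list if tagging behaved as described.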

> **Tip** If `pandas` is not installed in your notebook kernel, run `%pip install pandas` in a fresh cell before the import section, then restart the kernel if the import still fails.