# Topic segmentation
### This notebook supports my research on topic segmentation: exploring how AI can detect topic boundaries in text.
This notebook does not use a full dataset, only a single journaling entry for initial exploration. Additional research and experimentation will follow.
## Load libraries

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import json
import spacy 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import SpectralCoclustering



## Load test data
After researching available benchmarks, I concluded that they do not cover the personal journaling domain, so their data is not suitable for identifying the best topic segmentation method for this use case. I therefore decided to create my own dataset, with journaling entries similar to what may occur in a real-world scenario. I also provided the segmented sentences and topic labels in the needed format. Here I load the data and convert it to the correct format.

In [2]:
df = pd.read_csv("topics.csv")
row = df.iloc[0]

sentences = json.loads(row["sentences"])
segments = json.loads(row["segments"])
print("Sentences:", sentences)
print("Gold segments:", segments)

Sentences: ['Today I had a meeting with my semester coach and we started by discussing my individual project.', ' I had some troubles in the beginning but now everything is clear.', ' I enjoy the topic I chose and I am very happy with the progress I make.', ' I look formard to finish it and see the end product.', ' Then I talked with one of my team mates regarding our group work together because I was not satisfied with his way of working.', ' He always misses deadlines and skips our group meetings and I suggested that he tries to put more effort.', 'Finally, I got home and saw my mother making my favourite meal.', ' I have always loved her cooking and appreaciate that she does it for me.', ' Then we sat together on the table, ate the dinner and talked about our days. ']
Gold segments: [0, 0, 0, 0, 1, 1, 2, 2, 2]


The entry has 9 sentences: the first four belong to one topic, the next two to another, and the last three to a third.
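To compare the gold labels with the boundary lists that some of the later methods return, it can help to express the gold segmentation as boundaries too. The helper below is a minimal sketch; the name `segments_to_boundaries` is my own, and it assumes the same convention used later in this notebook, where a boundary index marks the last sentence of a segment.

```python
def segments_to_boundaries(segments):
    """Return every index i where sentence i is the last sentence of its segment."""
    return [i for i in range(len(segments) - 1) if segments[i] != segments[i + 1]]


# Gold labels copied from the output above.
gold_segments = [0, 0, 0, 0, 1, 1, 2, 2, 2]
gold_boundaries = segments_to_boundaries(gold_segments)
print(gold_boundaries)  # [3, 5]
```

With this convention, a prediction is fully correct only if its boundary list equals `[3, 5]`.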
## Sentence Embeddings
To work with these sentences, they must first be converted to numeric vectors. After some research, a transformer model turned out to be the best option, since it captures semantic meaning, which is important for our purpose. Here I use an SBERT model (all-mpnet-base-v2) that provides sentence-level embeddings, which will help with detecting topics later.

In [3]:
model = SentenceTransformer("all-mpnet-base-v2")
embeddings = model.encode(sentences, batch_size=8, show_progress_bar=True)
print("Embeddings shape:", embeddings.shape)

Batches: 100%|██████████| 2/2 [00:00<00:00,  2.22it/s]

Embeddings shape: (9, 768)





## Lexical Cohesion Method - Text Tiling
This function implements a segmentation method inspired by TextTiling, but instead of relying on word frequencies it uses sentence embeddings to detect topic shifts. It slides a window across the text, compares the average embeddings of two neighbouring windows with cosine similarity, and marks a boundary whenever the similarity drops below a chosen threshold. Note that in this implementation the second window is shifted by only one sentence, so with a window size of 2 the two windows share one sentence. The output is a list of sentence indices where topic boundaries are predicted, which can be used to split text into coherent sections based on semantic changes. I use the most popular starting values I found, which are supposed to give the best segments: a threshold of 0.8 and a window size of 2.

In [4]:
def embedding_text_tiling(embeddings, window_size=2, threshold=0.8):
  
    num_sentences = embeddings.shape[0]
    boundaries = []
    
    for i in range(num_sentences - window_size):
        block1 = embeddings[i:i+window_size].mean(axis=0)
        block2 = embeddings[i+1:i+1+window_size].mean(axis=0)
        sim = cosine_similarity(block1.reshape(1,-1), block2.reshape(1,-1))[0][0]
        if sim < threshold:
            boundaries.append(i+window_size-1)
    return boundaries

boundaries = embedding_text_tiling(embeddings, window_size=2, threshold=0.8)
print("Predicted boundaries at sentence indices:", boundaries)


Predicted boundaries at sentence indices: [1, 2, 3, 4, 5, 6]


It predicted far more boundaries than the gold segmentation contains, but let's still convert them into topic segments.

In [5]:
def boundaries_to_segments(boundaries, num_sentences, min_size=1):
    segments = [0] * num_sentences
    current = 0
    last_boundary = -1
    filtered_boundaries = []

    for b in boundaries:
        if b - last_boundary >= min_size:
            filtered_boundaries.append(b)
            last_boundary = b

    for i in range(num_sentences):
        segments[i] = current
        if i in filtered_boundaries:
            current += 1
    return segments


pred_segments = boundaries_to_segments(boundaries, len(sentences))
print("Predicted segments:", pred_segments)


Predicted segments: [0, 0, 1, 2, 3, 4, 5, 6, 6]


Again we see that this method oversegments the text and treats almost every new sentence as a new topic.
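For comparison, here is a sketch of a variant in which the two windows are strictly adjacent and do not share a sentence, which is closer to the original TextTiling idea. It is not the function used above: the names `cosine` and `adjacent_window_similarities` and the toy 2-D vectors are assumptions made for illustration. Instead of applying a threshold, it returns the similarity at each gap so the lowest point can be inspected.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def adjacent_window_similarities(embeddings, window_size=2):
    """Map gap index i (between sentence i and i + 1) to the similarity of the
    window ending at sentence i and the window starting at sentence i + 1."""
    embeddings = np.asarray(embeddings, dtype=float)
    scores = {}
    for i in range(window_size - 1, len(embeddings) - window_size):
        left = embeddings[i - window_size + 1 : i + 1].mean(axis=0)
        right = embeddings[i + 1 : i + 1 + window_size].mean(axis=0)
        scores[i] = cosine(left, right)
    return scores


# Toy "embeddings": three sentences near the x-axis, then three near the y-axis.
toy = np.array(
    [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0], [0.1, 0.9], [0.1, 1.0]]
)
scores = adjacent_window_similarities(toy)
lowest_gap = min(scores, key=scores.get)
print(lowest_gap)  # 2: the gap after the third sentence
```

Looking for the lowest gap (or a local minimum) rather than a fixed cutoff avoids marking every gap whose similarity happens to fall below 0.8.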
## Sequential Thresholding
This function segments text based on sentence embeddings. It works by comparing each sentence embedding to the one immediately before it using cosine similarity. If the similarity falls below the threshold, it assumes a topic shift has occurred and starts a new segment. Each sentence is then assigned a segment label, producing a list of segment indices that indicate how the text is divided into semantically coherent chunks.

In [6]:
def sequential_thresholding(embeddings, threshold=0.75):
    num_sentences = embeddings.shape[0]
    pred_segments = [0]  
    current_segment = 0

    for i in range(1, num_sentences):
        sim = cosine_similarity(
            embeddings[i-1].reshape(1,-1), embeddings[i].reshape(1,-1)
        )[0][0]
        if sim < threshold:
            current_segment += 1
        pred_segments.append(current_segment)

    return pred_segments

In [7]:
pred_segments = sequential_thresholding(embeddings, threshold=0.75)
print("Predicted segments:", pred_segments)

Predicted segments: [0, 1, 2, 3, 4, 5, 6, 7, 8]


We have the same situation here as before: oversegmentation. The next method I am going to try is the very popular clustering approach.
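Before choosing a fixed cutoff, it can be useful to look at the consecutive similarities directly and count how many fall below it. The following is a minimal sketch on toy vectors; `consecutive_similarities` is a hypothetical helper, not part of the notebook, and computes in one vectorised step the same quantity as the loop above.

```python
import numpy as np


def consecutive_similarities(embeddings):
    """Cosine similarity between each sentence vector and the next one."""
    emb = np.asarray(embeddings, dtype=float)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalise rows
    return np.sum(unit[:-1] * unit[1:], axis=1)


# Toy vectors: two sentences pointing one way, then two pointing another way.
toy = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 2.0]])
sims = consecutive_similarities(toy)
gaps_below_cutoff = int(np.sum(sims < 0.75))
print(gaps_below_cutoff)  # 1: only the middle gap is below the cutoff
```

If every value in `sims` for the real embeddings lies below 0.75, the fixed threshold starts a new segment at every sentence, which is the behaviour seen above.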
## Sequential Clustering
This function applies adaptive sequential thresholding for topic segmentation, which is a refinement of the basic sequential method. Instead of comparing each sentence only to the previous one, it compares the current sentence embedding to the centroid (average vector) of the ongoing segment. If the similarity to the centroid drops below the threshold, it signals a topic change and starts a new segment; otherwise, the sentence is added to the current segment and the centroid is updated.

In [8]:
def adaptive_sequential_thresholding(embeddings, threshold=0.8):
    pred_segments = [0]
    current_segment = 0
    segment_vectors = [embeddings[0]]  

    for i in range(1, len(embeddings)):
        centroid = np.mean(segment_vectors, axis=0)
        sim = cosine_similarity(embeddings[i].reshape(1, -1), centroid.reshape(1, -1))[0][0]
        if sim < threshold:
            current_segment += 1
            segment_vectors = [embeddings[i]] 
        else:
            segment_vectors.append(embeddings[i])
        pred_segments.append(current_segment)

    return pred_segments

pred_segments = adaptive_sequential_thresholding(embeddings, threshold=0.8)
print("Predicted segments:", pred_segments)

Predicted segments: [0, 1, 2, 3, 4, 5, 6, 7, 8]


Unfortunately we have the same situation here, so I will try another more complex approach - biclustering.
## BATS
This function implements a BATS-style segmentation approach, which clusters sentences into topics using word distributions rather than embeddings. It first builds a sentence–word TF-IDF matrix, then prunes out low-variance (uninformative) words and boosts highly discriminative ones to emphasize meaningful differences. With this refined matrix, it applies spectral co-clustering (scikit-learn's SpectralCoclustering), a method that simultaneously groups sentences and words into coherent clusters. Finally, it converts the resulting sentence cluster assignments into linear segments by marking a boundary whenever the cluster label changes.

In [9]:
def bats_segmentation(sentences, n_topics=2, noise_thresh=0.001, boost_factor=2.0):
    vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
    M = vectorizer.fit_transform(sentences).toarray()  
    words = vectorizer.get_feature_names_out()

    col_vars = np.var(M, axis=0)
    keep_mask = col_vars > noise_thresh
    M = M[:, keep_mask]
    col_vars = col_vars[keep_mask]
    words = [w for w, k in zip(words, keep_mask) if k]

    if len(words) == 0:
        raise ValueError("No informative words left after pruning. Try lowering noise_thresh.")

    var_norm = col_vars / (col_vars.max() + 1e-9)
    boost = 1 + (boost_factor - 1) * var_norm
    M = M * boost[np.newaxis, :]

    model = SpectralCoclustering(n_clusters=n_topics, random_state=42)
    model.fit(M)
    sent_labels = model.row_labels_

    boundaries = []
    for i in range(1, len(sent_labels)):
        if sent_labels[i] != sent_labels[i-1]:
            boundaries.append(i-1)

    segments = [0] * len(sentences)
    current = 0
    for i in range(len(sentences)):
        segments[i] = current
        if i in boundaries:
            current += 1

    return segments, boundaries, sent_labels

In [10]:
pred_segments, boundaries, raw_labels = bats_segmentation(sentences, n_topics=2)

print("Predicted boundaries at indices:", boundaries)
print("Predicted segments:", pred_segments)

for seg_id in set(pred_segments):
    print(f"\n--- Segment {seg_id} ---")
    for s, seg in zip(sentences, pred_segments):
        if seg == seg_id:
            print(s)


Predicted boundaries at indices: [0, 1, 3, 5, 6]
Predicted segments: [0, 1, 2, 2, 3, 3, 4, 5, 5]

--- Segment 0 ---
Today I had a meeting with my semester coach and we started by discussing my individual project.

--- Segment 1 ---
 I had some troubles in the beginning but now everything is clear.

--- Segment 2 ---
 I enjoy the topic I chose and I am very happy with the progress I make.
 I look formard to finish it and see the end product.

--- Segment 3 ---
 Then I talked with one of my team mates regarding our group work together because I was not satisfied with his way of working.
 He always misses deadlines and skips our group meetings and I suggested that he tries to put more effort.

--- Segment 4 ---
Finally, I got home and saw my mother making my favourite meal.

--- Segment 5 ---
 I have always loved her cooking and appreaciate that she does it for me.
 Then we sat together on the table, ate the dinner and talked about our days. 


We can see a little bit of improvement - some sentences are correctly put into one topic but still the segmentation is not accurate enough. 
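The last step of the function, turning cluster labels into linear segments, also explains why two clusters can still produce six segments: every time the label flips, a new segment starts. The sketch below isolates that step. `labels_to_segments` is a hypothetical helper, and the label sequence is one assignment that is consistent with the output above, not the actual value of `raw_labels`.

```python
def labels_to_segments(labels):
    """Start a new segment whenever the cluster label differs from the previous one."""
    labels = list(labels)
    if not labels:
        return []
    segments = [0]
    for prev, cur in zip(labels, labels[1:]):
        segments.append(segments[-1] + int(cur != prev))
    return segments


# Hypothetical two-cluster labels that alternate several times.
alternating = [0, 1, 0, 0, 1, 1, 0, 1, 1]
print(labels_to_segments(alternating))  # [0, 1, 2, 2, 3, 3, 4, 5, 5]
```

Because the clustering itself ignores sentence order, nothing prevents the labels from alternating, so the number of segments can exceed `n_topics`.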

A likely reason for the unsuccessful embedding-based attempts is that BERT embeddings are very sensitive because they capture context. This makes almost every newly appearing word look like a new topic (lower similarity). For this reason, I decided to implement an adaptive threshold for the similarities, which may help predict the correct segments.

## Adaptive threshold segmentation
This function performs adaptive threshold segmentation by dynamically setting the similarity cutoff instead of using a fixed value. It first computes cosine similarities between consecutive sentence embeddings, then calculates a threshold either by subtracting a multiple of the standard deviation from the mean similarity (std method) or by selecting a percentile of the similarity distribution (percentile method). Using this adaptive threshold, the function detects boundaries whenever similarity drops low enough, while enforcing a minimum segment size to avoid over-splitting.

In [11]:
def adaptive_threshold_segmentation(
    embeddings, method="std", min_size=2, std_factor=1.0, percentile=20
):
   
    num_sentences = embeddings.shape[0]
    sims = []

    for i in range(1, num_sentences):
        sim = cosine_similarity(
            embeddings[i-1].reshape(1,-1), embeddings[i].reshape(1,-1)
        )[0][0]
        sims.append(sim)
    
    sims = np.array(sims)

    if method == "std":
        threshold = sims.mean() - std_factor * sims.std()
    elif method == "percentile":
        threshold = np.percentile(sims, percentile)
    else:
        raise ValueError("method must be 'std' or 'percentile'")
    
    pred_segments = [0]
    current_segment = 0
    last_boundary = 0
    
    for i in range(1, num_sentences):
        # Reuse the similarity computed above: sims[i-1] compares sentence i-1 with sentence i
        sim = sims[i - 1]
        
        if sim < threshold and (i - last_boundary) >= min_size:
            current_segment += 1
            last_boundary = i
        pred_segments.append(current_segment)
    
    return pred_segments, threshold


In [12]:
pred_segments, used_threshold = adaptive_threshold_segmentation(
    embeddings, method="percentile", percentile=30, min_size=2
)
print("Adaptive threshold used:", used_threshold)
print("Predicted segments:", pred_segments)

Adaptive threshold used: 0.28152403
Predicted segments: [0, 0, 0, 0, 1, 1, 2, 2, 2]


Finally, the predicted segments match the gold segments exactly for this entry. A likely reason adaptive threshold segmentation works well for short personal journals with BERT embeddings is that it does not rely on a rigid, one-size-fits-all cutoff for detecting topic shifts. In journals, the writing style is often fragmented, with sudden changes of mood, subject, or reflection, but also stretches where sentences remain semantically close. Fixed-threshold methods may either split too aggressively or miss subtle transitions. By calibrating the threshold on the distribution of similarities in the specific text, using either the standard deviation or a percentile, this method adapts to the natural “texture” of each journal entry. In practice, this means the algorithm stays sensitive to real shifts in thought while filtering out small fluctuations that are just noise, which aligns better with the irregular, personal style of diary-like writing. Since this result comes from a single entry, the percentile and minimum segment size used here still need to be checked on more data.
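To make the two threshold rules concrete, here is a small worked example. The similarity values are invented for illustration and are not the similarities of the entry above; only the two formulas are taken from the function.

```python
import numpy as np

# Hypothetical consecutive similarities for a 9-sentence text (8 gaps).
sims = np.array([0.62, 0.58, 0.60, 0.21, 0.55, 0.18, 0.57, 0.61])

# "std" rule: mean minus one standard deviation.
std_threshold = sims.mean() - 1.0 * sims.std()
# "percentile" rule: the value below which roughly 30% of the gaps fall.
pct_threshold = np.percentile(sims, 30)

below_std = np.flatnonzero(sims < std_threshold).tolist()
below_pct = np.flatnonzero(sims < pct_threshold).tolist()
print(below_std)  # [3, 5]
print(below_pct)  # [3, 4, 5]
```

In this toy case the percentile rule flags one extra gap (index 4) that sits right next to a flagged gap, which is the kind of candidate the `min_size` check is meant to suppress. Note also that a percentile rule always flags roughly the same share of gaps, even in a text with no real topic change, so the percentile has to be chosen with the expected number of topics in mind.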

## Topic names
This function picks the most representative topic words for each text segment using TF-IDF (Term Frequency–Inverse Document Frequency). It first groups sentences according to their segment IDs, then merges all the sentences of a segment into one block of text. For each block, it fits a separate TF-IDF vectorizer and takes the top-scoring words as the “topic words” for that segment. Note that because each vectorizer sees only one document, the inverse-document-frequency part is the same for every word, so the score effectively reduces to normalized term frequency within the segment. The function returns a dictionary mapping each segment ID to its top words, providing a concise summary of what each segment is about.

In [13]:
def get_segment_topics(sentences, segments, top_n=1):
    segment_dict = {}
    unique_segments = sorted(set(segments))

    for seg_id in unique_segments:
        seg_sentences = [s for s, seg in zip(sentences, segments) if seg == seg_id]
        seg_text = " ".join(seg_sentences)
        
        vectorizer = TfidfVectorizer(stop_words='english')
        X = vectorizer.fit_transform([seg_text])
        feature_array = np.array(vectorizer.get_feature_names_out())
        tfidf_scores = X.toarray()[0]

        top_indices = tfidf_scores.argsort()[::-1][:top_n]
        top_words = feature_array[top_indices].tolist()
        segment_dict[seg_id] = top_words

    return segment_dict


In [None]:
segment_topics = get_segment_topics(sentences, pred_segments, top_n=1)

for seg_id, words in segment_topics.items():
    print(f"Segment {seg_id} topic: {words[0]}")


Segment 0 topic: troubles
Segment 1 topic: group
Segment 2 topic: talked


This method doesn’t always provide the best topics because TF-IDF only looks at word frequency patterns, not the deeper semantic meaning of sentences. In short personal journals, important themes may be expressed with subtle wording, synonyms, or emotional nuance that TF-IDF can’t capture. It also tends to overemphasize rare but unimportant words (like a quirky adjective or a specific name) rather than core ideas, since it ignores context and the relations between words.
## Topic names with BERT embeddings
This function tries to find a meaningful topic word for each segment by combining BERT-style embeddings with linguistic filtering. Instead of relying on TF-IDF, it first computes a centroid embedding for all sentences in a segment (a kind of “semantic average” of that segment). Then, it extracts candidate nouns and proper nouns from the segment text using spaCy, since these are more likely to represent concrete topics. Each candidate word is embedded, and the one whose embedding is closest to the segment centroid is chosen as the representative topic. The result is a dictionary mapping segment IDs to their most semantically relevant noun(s), which usually produces more meaningful and context-aware topics than simple frequency-based methods.

In [None]:
nlp = spacy.load("en_core_web_trf")

def segment_topic_embedding(sentences, segments, sentence_embeddings, top_n=1):
    segment_dict = {}
    unique_segments = sorted(set(segments))

    for seg_id in unique_segments:
        indices = [i for i, seg in enumerate(segments) if seg == seg_id]
        if not indices:
            continue
    
        seg_emb = sentence_embeddings[indices].mean(axis=0, keepdims=True)
        
        seg_text = " ".join([sentences[i] for i in indices])
        doc = nlp(seg_text)
        candidates = [token.lemma_ for token in doc if token.pos_ in {"NOUN", "PROPN"}]

        if not candidates:
            segment_dict[seg_id] = ["[no noun found]"]
            continue

        candidate_embeddings = model.encode(candidates, convert_to_numpy=True)
    
        sims = cosine_similarity(seg_emb, candidate_embeddings)[0]
        top_indices = sims.argsort()[::-1][:top_n]
        top_words = [candidates[i] for i in top_indices]

        segment_dict[seg_id] = top_words
 
    return segment_dict 

In [None]:
# top_n=2 is required because two words per segment are printed below
segment_topics = segment_topic_embedding(sentences, pred_segments, embeddings, top_n=2)

for seg_id, words in segment_topics.items():
    print(f"Segment {seg_id} topic: {words[0]},{words[1]}")


Segment 0 topic: project,progress
Segment 1 topic: group,group
Segment 2 topic: dinner,meal
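The "group,group" result appears because the same lemma occurs twice in the candidate list and both copies are ranked. A possible fix is to drop duplicates before ranking, for example with `list(dict.fromkeys(candidates))` before calling `model.encode`. The sketch below shows the idea on toy 2-D vectors instead of real SBERT embeddings; `rank_unique_candidates` and all values are hypothetical.

```python
import numpy as np


def rank_unique_candidates(centroid, candidates, candidate_vectors, top_n=2):
    """Rank candidates by cosine similarity to the centroid, ignoring repeats."""
    unique = {}
    for word, vec in zip(candidates, candidate_vectors):
        # Keep only the first occurrence of each candidate word.
        unique.setdefault(word, np.asarray(vec, dtype=float))
    words = list(unique)
    vectors = np.stack([unique[w] for w in words])
    centroid = np.asarray(centroid, dtype=float)
    sims = vectors @ centroid / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(centroid)
    )
    order = np.argsort(sims)[::-1][:top_n]
    return [words[i] for i in order]


candidates = ["group", "mate", "group", "deadline"]
vectors = [[1.0, 0.1], [0.2, 1.0], [1.0, 0.1], [0.8, 0.6]]
centroid = [1.0, 0.2]
top_words = rank_unique_candidates(centroid, candidates, vectors)
print(top_words)  # ['group', 'deadline']
```

Deduplicating also reduces the number of strings that have to be embedded, which helps on longer entries.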


The next thing I would like to try out is to get the topics for the whole text and see if they resonate with the segment topics.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

def text_topic_embedding(sentences, sentence_embeddings, top_n=3):
    """
    Extract top-N topic words for the whole text (ignoring segments).
    
    sentences: list of sentence strings
    sentence_embeddings: numpy array of sentence embeddings
    top_n: number of topic words to return
    """
    
    seg_emb = sentence_embeddings.mean(axis=0, keepdims=True)
   
    full_text = " ".join(sentences)
    doc = nlp(full_text)

    candidates = [token.lemma_ for token in doc if token.pos_ in {"NOUN", "PROPN"}]

    if not candidates:
        return ["[no noun found]"] * top_n

    candidate_embeddings = model.encode(candidates, convert_to_numpy=True)

    sims = cosine_similarity(seg_emb, candidate_embeddings)[0]

    top_indices = sims.argsort()[::-1][:top_n]
    top_words = [candidates[i] for i in top_indices]

    return top_words


In [37]:
topics = text_topic_embedding(sentences, embeddings, top_n=3)
print("Top topics for the whole text:", topics)

Top topics for the whole text: ['dinner', 'project', 'progress']


As we can see, this works better for journals because it links what is being talked about (nouns) with how the text feels semantically as a whole (embeddings). I would still like to provide somewhat more meaningful topics by incorporating phrases.
## Topic as a phrase
This function extends the noun-based topic extraction to find representative phrases for each segment. It first computes the centroid embedding of all sentences in a segment, capturing the segment’s overall meaning. Then, instead of single words, it extracts noun chunks with spaCy (multi-word phrases like “my best friend” or “a stressful day”) and filters out overly long ones (more than 4 words) to keep them concise. Each candidate phrase is embedded using SBERT, and the phrase whose embedding is closest to the segment centroid is chosen as the representative topic.

In [20]:
nlp = spacy.load("en_core_web_trf")

def segment_topic_phrase(sentences, segments, sentence_embeddings, top_n=1):
 
    segment_dict = {}
    unique_segments = sorted(set(segments))

    for seg_id in unique_segments:
        indices = [i for i, seg in enumerate(segments) if seg == seg_id]
        if not indices:
            continue
        
        seg_emb = sentence_embeddings[indices].mean(axis=0, keepdims=True)
        
        seg_text = " ".join([sentences[i] for i in indices])
        doc = nlp(seg_text)
        candidates = [chunk.text for chunk in doc.noun_chunks if len(chunk.text.split()) <= 4]

        if not candidates:
            segment_dict[seg_id] = ["[no phrase found]"]
            continue
        
        candidate_embeddings = model.encode(candidates, convert_to_numpy=True)
        
        sims = cosine_similarity(seg_emb, candidate_embeddings)[0]
        top_indices = sims.argsort()[::-1][:top_n]
        top_phrases = [candidates[i] for i in top_indices]

        segment_dict[seg_id] = top_phrases

    return segment_dict

In [19]:
segment_phrases = segment_topic_phrase(sentences, pred_segments, embeddings)

for seg_id, phrases in segment_phrases.items():
    print(f"Segment {seg_id} topic phrase: {phrases[0]}")


Segment 0 topic phrase: my individual project
Segment 1 topic phrase: our group work
Segment 2 topic phrase: her cooking


On this entry the method performs well because it combines the semantic power of embeddings with the expressiveness of noun phrases. Unlike single words, short phrases capture context and nuance, which matters in personal journals where meaning often lies in small details like “long phone call” or “quiet evening walk.” By selecting the phrase closest to the segment’s overall embedding, it favours a label that reflects the central theme of that part of the text. This makes it especially useful for creating human-readable summaries of segments, giving intuitive and meaningful “tags” for different parts of a journal.