<a target="_blank" href="https://colab.research.google.com/github/impresso/impresso-datalab-notebooks/blob/main/workshop_resources/ws4-embeddings/EventDetectionGranularity.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Granularity, Events and Embeddings

This notebook provides basic functionality for embedding texts at different levels of granularity, from whole articles to chunks and headlines. Its main purpose is to showcase the retrieval and embedding functionalities of the Impresso API and to provide some code for visualising embeddings with dimensionality reduction.

## Install required packages

In [None]:
!pip -qqq install pandas chonkie faiss-cpu tqdm seaborn plotly umap-learn

In [None]:
!pip -qqq install git+https://github.com/impresso/impresso-py.git@embeddings-search

In [None]:
# restart the kernel just in case...
import os
os.kill(os.getpid(), 9)

## Load and process data

In [None]:
!gdown 1H8_1-PbGPlcrm3wvwd1xGaUrhNnYETha

In [None]:
# unzip the data for the general query
!unzip -o Olympics-general.zip -d data

In [None]:
import os
import pandas as pd
from chonkie import SemanticChunker
from tqdm import tqdm
import faiss
import numpy as np
from impresso import connect

## Connect to the Impresso client

In [None]:

impresso_session = connect('https://dev.impresso-project.ch/public-api/v1')

## Load helper functions

In [None]:
# embed text helper functions
import time
import base64
import struct

def embed_text(text: str, target: str):
  """
  Convert text to embedding, return None in case of an error
  """
  time.sleep(1)
  try:
    return impresso_session.tools.embed_text(text, target)
  except Exception as e:
    print(text)
    print(e)
    return None


def convert_embedding(embedding: str):
  """
  Convert a '<prefix>:<base64>' embedding string to a list of floats, return None for empty input
  """
  if not embedding:
    return None

  _, arr = embedding.split(':')
  arr = base64.b64decode(arr)
  outof_corpus_emb = [struct.unpack('f', arr[i:i+4])[0] for i in range(0, len(arr), 4)]
  return outof_corpus_emb
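
As a quick sanity check of the decoding step, the sketch below round-trips a few floats through the layout that `convert_embedding` expects: a label, a colon, then base64-encoded 32-bit floats. The `encode_embedding` helper and the `demo-model` label are hypothetical and exist only for this illustration. The sketch also assumes little-endian byte order, whereas the helper above uses the platform's native order (the same on most machines).

```python
import base64
import struct

def encode_embedding(values, label="demo-model"):
    """Hypothetical helper: pack floats as float32 and base64-encode them."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return f"{label}:{base64.b64encode(raw).decode('ascii')}"

def decode_embedding(encoded):
    """Decode '<label>:<base64>' into a list of Python floats."""
    _, payload = encoded.split(":")
    raw = base64.b64decode(payload)
    # each float32 takes 4 bytes, so unpack them all in one call
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))

vector = [0.5, -1.25, 3.0]  # values exactly representable in float32
encoded = encode_embedding(vector)
decoded = decode_embedding(encoded)
print(decoded)
```

Unpacking the whole buffer in one `struct.unpack` call is equivalent to the 4-byte loop above, just shorter.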

## Load data

The data file is a CSV document containing articles that mention "Olympic Games".

In [None]:

# --- CONFIG ---
CSV_PATH = "data/2025-10-20T13-28-55-45260b95.csv"         # Path to your CSV file


In [None]:
df = pd.read_csv(CSV_PATH, sep=';',skiprows=4)
df.head(3)

In [None]:
df['year'].value_counts().sort_index().plot(kind='bar')

In [None]:
df.columns

In [None]:
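# to reduce the data a bit, let's focus on 1930-1950 and drop articles without a title
# .copy() avoids a SettingWithCopyWarning when adding embedding columns later
df_period = df[df.year.between(1930,1950) & ~(df['title'].isnull())].copy()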
df_period.shape

## Embed Headlines

In [None]:
# let's first get the embeddings of the article titles (headlines)

tqdm.pandas()
df_period['title_embedding'] = df_period['title'].progress_apply(
    lambda x: convert_embedding(embed_text(x,'text'))
      )


## Retrieve transcript embeddings

In [None]:
# get the article embeddings from the API

def get_embedding_by_uid(uid):
  time.sleep(1)
  try:
    return convert_embedding(impresso_session.content_items.get_embeddings(uid)[0])
  except Exception as e:
    print(e)
    print(uid)
    return None

df_period['article_embedding']  = df_period.uid.progress_apply(get_embedding_by_uid)


## Save data

In [None]:

df_period.to_json('olympic-general-embedded.json')

## Plot Embeddings with UMAP

Plot either transcript (i.e. article) or headline (i.e. title) embeddings with dimensionality reduction.

In [None]:
!gdown 18SyEXcXjRTyu3UOzDOFZdDb16jA2ejmM

In [None]:
# --- DIMENSIONALITY REDUCTION ---
from umap import UMAP
print("Reducing to 2D with UMAP...")
reducer = UMAP(
    n_neighbors=15,
    min_dist=0.1,
    metric="cosine",
    random_state=42
)

EMBEDDING = 'article_embedding' # 'title_embedding' | 'article_embedding'

df_period = pd.read_json('olympic-general-embedded.json')

df_period = df_period[~df_period[EMBEDDING].isnull()]

embeddings = list(df_period[EMBEDDING])
embeddings_2d = reducer.fit_transform(embeddings)

df_period["x"] = embeddings_2d[:, 0]
df_period["y"] = embeddings_2d[:, 1]

In [None]:
def clean_text(text, max_len=250):
    """Truncate text and replace newlines for nicer tooltips"""
    text = str(text).replace("\n", " ")
    return text[:max_len] + ("..." if len(text) > max_len else "")

df_period["hover_text"] = df_period.title.apply(clean_text)


In [None]:
# interactive scatter plot of the 2D projection; hover over a point to see title and year
import plotly.express as px

fig = px.scatter(df_period,
                 x="x",
                 #size='all_inds',
                 y="y",
                 #color="all_inds",
                 hover_data=['hover_text',"year"],
                 width=1000, height=1000)
fig.update_layout(showlegend=False)
fig.show()

# Query

The example code below shows how to create a local vector database with FAISS, which you can then query. You can build the database at different levels (i.e. article, title or chunk).

The code for creating chunk level embeddings is shown below.

In [None]:
!gdown 1Z5bGLddcCuwxAv4Ehu4Jt1QGNUh5CgYZ

In [None]:
# --- VECTOR STORE (FAISS) ---
# save index

EMBEDDING_LEVEL = 'chunk' # 'title' | 'article' | 'chunk'
# see below for chunking script

if EMBEDDING_LEVEL == 'chunk':
  df_period = pd.read_json('olympic-general-chunks-sample-embedded.json')
else:
  df_period = pd.read_json('olympic-general-embedded.json')
df_period = df_period[~df_period[f'{EMBEDDING_LEVEL}_embedding'].isnull()]

VECTOR_DB_PATH = f"vector_db_{EMBEDDING_LEVEL}.faiss"
# stack the per-row embeddings into a (n_rows, dim) float32 matrix, as FAISS expects
embeddings = np.array(list(df_period[f'{EMBEDDING_LEVEL}_embedding']), dtype="float32")


dim = embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(embeddings)
faiss.write_index(index, VECTOR_DB_PATH)
print(f"Vector DB saved to {VECTOR_DB_PATH}")


Now you can search for different subthemes within the data: each document is scored by the distance between its embedding and the query embedding.

In [None]:

query = "Weltkrieg"
q_emb = convert_embedding(embed_text(query,'text'))

D, I = index.search(np.array([q_emb], dtype="float32"), k=5)
print("Top 5 most similar chunks:")
print(df_period.iloc[I[0]])
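
`IndexFlatL2` performs an exhaustive search and returns squared Euclidean distances in `D` (smaller means more similar) together with row positions in `I`. The dependency-free sketch below mimics that ranking on toy vectors. `squared_l2` and `search` are hypothetical names used only here, and the vectors are invented for illustration.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search(vectors, query, k):
    """Return (distances, indices) of the k nearest vectors, closest first."""
    scored = [(squared_l2(v, query), i) for i, v in enumerate(vectors)]
    nearest = heapq.nsmallest(k, scored)
    return [d for d, _ in nearest], [i for _, i in nearest]

# toy 2D "embeddings", invented for this example
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [3.0, 3.0]]
query = [1.0, 1.0]

distances, indices = search(vectors, query, k=2)
print(distances, indices)
```

If you prefer cosine similarity (the metric used for the UMAP projection above), one common option is to L2-normalise all vectors before indexing: for unit-length vectors the squared L2 distance equals 2 - 2*cos, so both metrics give the same ranking.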

# Chunk

The code below shows how to create chunk-level embeddings. We apply this only to a small sample, as embedding every chunk would take too long.

In [None]:

# --- CHUNK TEXTS ---
# Basic initialization with default parameters
# see https://docs.chonkie.ai/oss/chunkers/semantic-chunker
chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-32M",  # Default model
    threshold=0.8,                               # Similarity threshold (0-1)
    chunk_size=256,                             # Maximum tokens per chunk
    similarity_window=10,                         # Window for similarity calculation
    skip_window=0                                # Skip-and-merge window (0=disabled)
)


In [None]:
# let's go chunky!!

chunks = []

# COLUMN VARIABLES
TEXT = 'transcript'

print("Chunking text columns...")
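for idx, row in tqdm(df_period.iterrows(), total=len(df_period)):
        text = row[TEXT]
        # keep one entry per row so the list lines up with df_period;
        # missing or blank transcripts get an empty chunk list
        if isinstance(text, str) and text.strip():
          chunks.append(chunker.chunk(text))
        else:
          chunks.append([])

df_period['chunks'] = chunks
# one row per chunk; rows without chunks explode to NaN and are dropped
df_period_chunked = df_period.explode('chunks').dropna(subset=['chunks'])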
df_period_chunked_sample = df_period_chunked.sample(1000, random_state=32)
tqdm.pandas()
df_period_chunked_sample['chunk_embedding'] = df_period_chunked_sample['chunks'].progress_apply(lambda x: convert_embedding(embed_text(x.text,'text')))



In [None]:
df_period_chunked_sample.reset_index(drop=True).to_json('olympic-general-chunks-sample-embedded.json')

# Fin