<a target="_blank" href="https://colab.research.google.com/github/impresso/impresso-datalab-notebooks/blob/main/workshop_resources/ws4-embeddings/linking-in-external-data.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# UMAP visualization of text embeddings with different categories

The texts are embedded via the GTE model and visualized using UMAP. Different colors represent different categories of texts, such as query texts, topic-related texts, internal documents, and external documents.


In [80]:
!pip install pandas umap-learn matplotlib seaborn numpy mplcursors plotly 'nbformat>=4.2.0' ipykernel bokeh gdown



## The data for this use case contains the following data items

- 3407 embedded impresso content items, retrieved via a query in the impresso app (the
  query originally returned 5000 content items, but only 3407 currently have an embedding
  because of a minimum-length restriction)
- 28 embedded external documents
- 969 embedded impresso content items found by semantic search, taking the 50 nearest
  neighbours of each of 20 topic phrases:
  `swiss banking industry; bank loans; banking secrecy; taxes on bank transactions;
  swiss bankers; geneva financial center; monetary policies; federal finances;
  parliamentary debates on banking; banking regulation; bank accounts; banks as providers;
  collaboration in banking; tax evasion; insider trading; financial crisis;
  gnomes of zurich; euromarkets; swiss bank corporation`
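
The retrieval behind the last bullet can be sketched as a top-k nearest-neighbour search per topic phrase, followed by a union of the results. This is only an illustration and not the impresso search implementation: the helper names (`cosine`, `top_k_neighbours`, `collect_topic_neighbours`), the toy IDs, and the 2-dimensional vectors are invented for the example. It shows one plausible reason why 50 neighbours for each of 20 phrases can yield fewer than 1000 distinct items: an item retrieved by several phrases is counted once.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_neighbours(query_vec, corpus, k):
    """Return the ids of the k corpus items most similar to query_vec."""
    ranked = sorted(corpus, key=lambda cid: cosine(query_vec, corpus[cid]), reverse=True)
    return ranked[:k]


def collect_topic_neighbours(topic_vecs, corpus, k):
    """Map each retrieved id to the list of topic phrases that retrieved it."""
    hits = {}
    for phrase, vec in topic_vecs.items():
        for cid in top_k_neighbours(vec, corpus, k):
            hits.setdefault(cid, []).append(phrase)
    return hits


# Toy data (hypothetical ids and vectors, for illustration only)
corpus = {
    "ci-1": [1.0, 0.0],
    "ci-2": [0.9, 0.1],
    "ci-3": [0.0, 1.0],
    "ci-4": [0.1, 0.9],
}
topics = {"tax evasion": [1.0, 0.05], "banking secrecy": [0.8, 0.3]}

hits = collect_topic_neighbours(topics, corpus, k=2)
print(f"{len(topics) * 2} retrievals, {len(hits)} distinct items")
```

In this toy run both phrases retrieve the same two items, so four retrievals collapse into two distinct content items.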

For convenience, all embeddings are concatenated in a single bzip2-compressed JSONL file,
which can be downloaded with the following command:


In [81]:
import gdown, os

DATA_FILENAME = "case_study_external_docs_linking_in.jsonl.bz2"
url = "https://os.zhdk.cloud.switch.ch/impresso-public/impresso-2-ws4-lux-2025-data/linking-in-case-study/case_study_external_docs_linking_in.jsonl.bz2"
output = DATA_FILENAME
os.path.exists(DATA_FILENAME) or gdown.download(url, output, quiet=False)

True

Some reusable code for loading the embeddings from a jsonl file.
The data is expected to have the following fields:

- "embedding": The text embedding as an array of numbers.
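- "ci_id": The unique identifier of the text, resolvable to a URL in the Impresso web app.
- "category": The category of the text. One of `"query"`, `"topics__*"`, `"internal__*"`, `"external__*"`, where the part before `__` is the category type used for coloring and the part after it is the label.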
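
To make the field layout concrete, here is a small sketch that parses one JSONL line and splits the category into its type and label. The record itself is hypothetical: the `ci_id` value, the 3-dimensional embedding, and the label `tax_evasion` are placeholders, and `parse_record` is a helper introduced only for this example.

```python
import json

REQUIRED_FIELDS = ("embedding", "ci_id")


def parse_record(line):
    """Parse one JSONL line and return a small summary of its fields."""
    item = json.loads(line)
    missing = [field for field in REQUIRED_FIELDS if field not in item]
    if missing:
        raise ValueError(f"Record is missing required fields: {missing}")
    # Records without a category are treated as 'query', as in the loader
    category = item.get("category", "query")
    # Split 'topics__tax_evasion' into ('topics', 'tax_evasion'); plain 'query' has no label
    prefix, _, label = category.partition("__")
    return {
        "ci_id": item["ci_id"],
        "prefix": prefix,
        "label": label,
        "dim": len(item["embedding"]),
    }


# Hypothetical record; real embeddings have many more dimensions
example_line = json.dumps(
    {
        "embedding": [0.12, -0.03, 0.55],
        "ci_id": "example-ci-id-0001",
        "category": "topics__tax_evasion",
    }
)
record = parse_record(example_line)
print(record)
```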


In [82]:
import json
import pandas as pd
import bz2


def load_embeddings_from_jsonl(file_paths):
    """
    Reads one or more jsonl or jsonl.bz2 files with embeddings, optional categories, and ci_id,
    and returns a DataFrame.

    Args:
        file_paths (list or str): A list of file paths or a single file path to the jsonl or jsonl.bz2 file(s).

    Returns:
        pd.DataFrame: DataFrame with embeddings as columns, 'category' column, and 'ci_id' column.
    """
    if isinstance(file_paths, str):
        file_paths = [file_paths]

    data = []
    for file_path in file_paths:
        if file_path.endswith(".jsonl.bz2"):
            opener = bz2.open
            mode = "rt"
        elif file_path.endswith(".jsonl"):
            opener = open
            mode = "r"
        else:
            print(f"Skipping unsupported file format: {file_path}")
            continue

        try:
            with opener(file_path, mode) as f:
                for line in f:
                    item = json.loads(line)
                    embedding = item.get("embedding")
                    category = item.get(
                        "category", "query"
                    )  # Assign 'query' if no category exists
                    ci_id = item.get("ci_id")
                    if embedding:
                        row = dict(
                            zip([f"emb_{i}" for i in range(len(embedding))], embedding)
                        )
                        row["category"] = category
                        row["ci_id"] = ci_id
                        data.append(row)
        except FileNotFoundError:
            print(f"Error: File not found at {file_path}")
        except Exception as e:
            print(f"Error processing file {file_path}: {e}")

    # Convert list of dictionaries to DataFrame
    df = pd.DataFrame(data)

    # Reorder columns to have category and ci_id at the end
    if "category" in df.columns:
        category_column = df.pop("category")
        df["category"] = category_column
    if "ci_id" in df.columns:
        ci_id_column = df.pop("ci_id")
        df["ci_id"] = ci_id_column

    return df
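
To try the loader without downloading the full dataset, a tiny `.jsonl.bz2` file can be written and read back with the standard library. The sketch below assumes nothing beyond the field names described above; the file name `toy_embeddings.jsonl.bz2`, the ids, and the vectors are made up, and `write_jsonl_bz2` / `read_jsonl_bz2` are helpers introduced only for this example.

```python
import bz2
import json
import os
import tempfile


def write_jsonl_bz2(records, path):
    """Write one JSON object per line into a bzip2-compressed text file."""
    with bz2.open(path, "wt", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")


def read_jsonl_bz2(path):
    """Read all JSON objects back from a bzip2-compressed JSONL file."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical records; the second one has no 'category' field
toy_records = [
    {"embedding": [0.1, 0.2, 0.3], "ci_id": "toy-1", "category": "external__report_a"},
    {"embedding": [0.4, 0.5, 0.6], "ci_id": "toy-2"},
]

toy_path = os.path.join(tempfile.mkdtemp(), "toy_embeddings.jsonl.bz2")
write_jsonl_bz2(toy_records, toy_path)
round_tripped = read_jsonl_bz2(toy_path)
print(round_tripped)
```

Passing `toy_path` to `load_embeddings_from_jsonl` should then give a two-row DataFrame with columns `emb_0` to `emb_2`, `category`, and `ci_id`, where the second row falls back to the category `query`.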

## Dimensionality reduction to 2D using UMAP


In [83]:
import umap
import pandas as pd


def perform_umap_reduction(df, n_components=2, random_state=42):
    """
    Applies UMAP to reduce the dimensionality of vectors in a DataFrame.

    Args:
        df (pd.DataFrame): DataFrame containing the vectors and potentially 'category' and 'ci_id' columns.
        n_components (int): The number of dimensions to reduce to.
        random_state (int): Random state for reproducibility.

    Returns:
        pd.DataFrame: DataFrame with UMAP embeddings and the 'category' and 'ci_id' columns.
    """
    reducer = umap.UMAP(n_components=n_components, random_state=random_state)

    # Select only the embedding columns (assuming they start with 'emb_')
    embedding_columns = [col for col in df.columns if col.startswith("emb_")]
    vector_data = df[embedding_columns]

    embedding = reducer.fit_transform(vector_data)
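    # Name the first two components umap_x / umap_y (expected by the plotting code);
    # any further components are numbered umap_2, umap_3, ...
    column_names = ["umap_x", "umap_y"][:n_components] + [
        f"umap_{i}" for i in range(2, n_components)
    ]
    df_umap = pd.DataFrame(embedding, columns=column_names)

    # Include 'category' and 'ci_id' if they exist in the original DataFrame.
    # Copy by position (.to_numpy()) so a non-default index on df cannot misalign rows.
    if "category" in df.columns:
        df_umap["category"] = df["category"].to_numpy()
    if "ci_id" in df.columns:
        df_umap["ci_id"] = df["ci_id"].to_numpy()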

    return df_umap
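
`perform_umap_reduction` calls `umap.UMAP` without a `metric` argument, so UMAP uses its default Euclidean metric. Text embeddings are often compared by cosine similarity instead. The sketch below illustrates, with the standard library only, why L2-normalizing the vectors first makes the two views agree: for unit vectors, the squared Euclidean distance equals `2 - 2 * cosine similarity`. Whether the embeddings in the data file are already normalized is not stated here, so treat this as an optional preprocessing idea. The helper names and toy vectors are invented for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Toy vectors, for illustration only
a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

squared_distance = euclidean(a, b) ** 2
identity_rhs = 2 - 2 * cosine_similarity(a, b)
print(squared_distance, identity_rhs)
```

An alternative that avoids preprocessing is to pass `metric="cosine"` when constructing `umap.UMAP`; the resulting layout may differ from the one produced with the default metric.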

## Bokeh Visualization with Links and Simple Text Search in Category Labels for Highlighting


In [84]:
from bokeh.plotting import figure, output_file, save, show
from bokeh.models import (
    HoverTool,
    TapTool,
    OpenURL,
    ColumnDataSource,
    TextInput,
    CustomJS,
)
from bokeh.transform import factor_cmap
from bokeh.layouts import column


def visualize_umap_bokeh(
    df_umap,
    title="UMAP Visualization with Clickable URLs",
    output_filename="umap_bokeh.html",
):
    """
    Creates an interactive Bokeh scatter plot where points are clickable and open URLs.
    Includes search functionality to find and highlight points by category.

    Args:
        df_umap (pd.DataFrame): DataFrame with UMAP embeddings, 'category' column, and 'ci_id'.
        title (str): Title of the plot.
        output_filename (str): Name of the HTML file to save.

    Returns:
        bokeh.models.Column: Layout containing the search input and the Bokeh figure.
    """
    # Create a copy to avoid modifying the original dataframe
    df_plot = df_umap.copy()

    # Create URLs
    if "ci_id" in df_plot.columns:
        df_plot["url"] = df_plot["ci_id"].apply(
            lambda x: (
                f"https://dev.impresso-project.ch/app/article/{x}"
                if pd.notna(x)
                else ""
            )
        )
    else:
        df_plot["url"] = ""

    # Extract category prefix for coloring
    def get_category_prefix(cat):
        if pd.isna(cat):
            return "other"
        if "__" in str(cat):
            return str(cat).split("__")[0]
        return str(cat)

    df_plot["category_prefix"] = df_plot["category"].apply(get_category_prefix)

    # Create size column - make topic dots 50% larger
    df_plot["dot_size"] = df_plot["category_prefix"].apply(
        lambda x: 12 if x == "topics" else 8
    )

    # Add alpha column for search highlighting (default: normal visibility)
    df_plot["alpha"] = 0.6

    # Create ColumnDataSource
    source = ColumnDataSource(df_plot)

    # Get unique category prefixes for color mapping
    category_prefixes = df_plot["category_prefix"].unique().tolist()

    # Create figure
    p = figure(
        width=1800,
        height=1000,
        title=title,
        tools="pan,wheel_zoom,box_zoom,reset,save",
        toolbar_location="above",
    )

    # Create custom color palette for category prefixes
    prefix_color_map = {
        "query": "#9d9c9a",
        "topics": "#ff7f0e",
        "internal": "#8c564b",
        "external": "#fd0581",
        "other": "#7f7f7f",
    }

    # Build palette based on category prefixes in the data
    palette = []
    for prefix in category_prefixes:
        if prefix in prefix_color_map:
            palette.append(prefix_color_map[prefix])
        else:
            palette.append("#7f7f7f")

    color_mapper = factor_cmap(
        "category_prefix", palette=palette, factors=category_prefixes
    )

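    # Add scatter plot with alpha from data source
    # (scatter() draws circles by default; circle() with a size argument is deprecated in Bokeh >= 3.4)
    scatter = p.scatter(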
        "umap_x",
        "umap_y",
        source=source,
        size="dot_size",
        color=color_mapper,
        alpha="alpha",  # Use alpha from data source
        legend_field="category_prefix",
    )

    # Add hover tool with detailed information
    hover = HoverTool(
        tooltips=[
            ("Category", "@category"),
            ("Prefix", "@category_prefix"),
            ("CI ID", "@ci_id"),
            ("URL", "@url"),
            ("Coordinates", "(@umap_x{0.00}, @umap_y{0.00})"),
        ]
    )
    p.add_tools(hover)

    # Add tap tool to open URLs when clicking on points
    tap = TapTool()
    p.add_tools(tap)

    # Configure tap tool to open URL
    url_open = OpenURL(url="@url")
    tap.callback = url_open

    # Styling
    p.legend.location = "top_right"
    p.legend.click_policy = "hide"
    p.legend.label_text_font_size = "10pt"
    p.legend.title = "Category Types"
    p.legend.title_text_font_size = "11pt"
    p.legend.title_text_font_style = "bold"
    p.legend.label_text_color = "#333333"
    p.legend.border_line_width = 2
    p.legend.border_line_color = "#cccccc"
    p.legend.border_line_alpha = 0.8
    p.legend.background_fill_alpha = 0.9
    p.legend.spacing = 5
    p.legend.padding = 10
    p.xaxis.axis_label = "UMAP X"
    p.yaxis.axis_label = "UMAP Y"

    # Create search input widget
    search_input = TextInput(
        title="Search categories (case-insensitive):",
        placeholder="e.g., tax evasion, climate, etc.",
        width=400,
    )

    # JavaScript callback for search functionality - searches only the category part after "__"
    callback = CustomJS(
        args=dict(source=source),
        code="""
        const search_term = cb_obj.value.toLowerCase().trim();
        const data = source.data;
        const categories = data['category'];
        const alphas = data['alpha'];
        const sizes = data['dot_size'];
        const original_sizes = data['category_prefix'].map(prefix => prefix === 'topics' ? 12 : 8);
        
        // Split search terms by space or underscore to get individual search words
        const search_words = search_term.split(/[\\s_]+/).filter(word => word.length > 0);
        
        // Reset all points
        for (let i = 0; i < categories.length; i++) {
            if (search_term === '') {
                // No search term: show all normally
                alphas[i] = 0.6;
                sizes[i] = original_sizes[i];
            } else {
                // Extract the part after "__" (or use full category if no "__")
                const category_full = categories[i].toLowerCase();
                const category_parts = category_full.split('__');
                const category_to_search = category_parts.length > 1 ? category_parts[1] : category_full;
                
                // Split category into words by underscore or space
                const category_words = category_to_search.split(/[\\s_]+/).filter(word => word.length > 0);
                
                // Check if ALL search words appear in the category words
                // Each search word must match (be contained in) at least one category word
                let match = true;
                for (const search_word of search_words) {
                    let search_word_found = false;
                    for (const cat_word of category_words) {
                        if (cat_word.includes(search_word)) {
                            search_word_found = true;
                            break;
                        }
                    }
                    if (!search_word_found) {
                        match = false;
                        break;
                    }
                }
                
                if (match) {
                    // Highlight matching points
                    alphas[i] = 0.8;
                    sizes[i] = original_sizes[i] * 1.5;  // Make matching points larger
                } else {
                    // Dim non-matching points
                    alphas[i] = 0.1;
                    sizes[i] = original_sizes[i];
                }
            }
        }
        
        source.change.emit();
    """,
    )

    search_input.js_on_change("value", callback)

    # Combine search input and plot in a layout
    layout = column(search_input, p)

    # Output to HTML file
    output_file(output_filename)
    save(layout)

    print(f"✅ Bokeh visualization saved to: {output_filename}")
    print(f"📌 Click on any point to open its URL in a new browser tab!")
    print(f"🔍 Use the search box to filter points by category name!")

    return layout
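
The matching rule inside the JavaScript callback can be mirrored in plain Python, which makes it easier to check which categories a given search term would highlight. The function below is a sketch that follows the same steps as the callback (lower-casing, searching only the part after `__`, splitting on spaces and underscores, requiring every search word to be contained in some category word). The name `category_matches` and the example categories are hypothetical.

```python
import re


def category_matches(search_term, category):
    """Return True if every search word is a substring of some word in the category label.

    An empty search term returns True, meaning "no filter"; the callback resets
    all points to their default look in that case.
    """
    search_words = [w for w in re.split(r"[\s_]+", search_term.lower().strip()) if w]
    if not search_words:
        return True

    full = category.lower()
    parts = full.split("__")
    # Search only the label after "__"; fall back to the full category otherwise
    to_search = parts[1] if len(parts) > 1 else full
    category_words = [w for w in re.split(r"[\s_]+", to_search) if w]

    return all(
        any(search_word in cat_word for cat_word in category_words)
        for search_word in search_words
    )


# Hypothetical category values, for illustration only
print(category_matches("tax evasion", "topics__tax_evasion"))
print(category_matches("tax bank", "topics__tax_evasion"))
```

Note that the category type itself is not searched: the term `topics` does not match `topics__tax_evasion`, because only the label after `__` is compared.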

## Plug all elements together and run them


In [None]:
RANDOM_STATE = 42
# UMAP is stochastic: a different RANDOM_STATE yields a different 2D layout
file_list = [DATA_FILENAME]  # Using the existing file for demonstration

# Load the embeddings from the files, including ci_id
df = load_embeddings_from_jsonl(file_list)

# Perform UMAP reduction
# The result keeps the 'category' and 'ci_id' columns alongside umap_x / umap_y
df_umap_result = perform_umap_reduction(df, random_state=RANDOM_STATE)
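
Before plotting, it can be worth checking that the loaded data has the expected composition. The helper below is a minimal sketch that counts items per category type, using the same `__` splitting as the plotting code. The name `count_by_prefix` and the toy category values are invented for the example.

```python
from collections import Counter


def count_by_prefix(categories):
    """Count items per category type (the part before '__'); missing values count as 'other'."""
    counts = Counter()
    for cat in categories:
        # None or NaN (NaN is the only value not equal to itself) means "no category"
        if cat is None or cat != cat:
            counts["other"] += 1
        else:
            counts[str(cat).split("__")[0]] += 1
    return dict(counts)


# Toy category values, for illustration only
toy_categories = ["query", "query", "topics__tax_evasion", "external__report_a", None]
prefix_counts = count_by_prefix(toy_categories)
print(prefix_counts)
```

With the real data, `count_by_prefix(df["category"])` could be compared against the item counts listed in the data description (3407, 28 and 969). Which category type corresponds to which group is an assumption to verify against the file.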


In [88]:
# Create and save Bokeh visualization
bokeh_fig = visualize_umap_bokeh(
    df_umap_result,
    title="Interactive UMAP Visualization - Click points to open impresso URLs",
    output_filename="umap_bokeh_clickable_query+topics+internal+external+search.html",
)

✅ Bokeh visualization saved to: umap_bokeh_clickable_query+topics+internal+external+search.html
📌 Click on any point to open its URL in a new browser tab!
🔍 Use the search box to filter points by category name!




Open the [Bokeh HTML file](umap_bokeh_clickable_query+topics+internal+external+search.html) in
your browser to explore the plot.
