# QWEN-3 Multi-Task Embeddings Exploration: Newsgroups Dataset

This notebook demonstrates:
1. Loading the 20 Newsgroups dataset (10 categories)
2. Embedding texts using Qwen3-Embedding-0.6B with different task instructions
3. Comparing Default vs Sentiment vs Topic vs Toxicity task embeddings
4. Dimensionality reduction with UMAP
5. Interactive visualization with text tooltips

In [None]:
import numpy as np
import pandas as pd
import pickle
from pathlib import Path
from sklearn.datasets import fetch_20newsgroups
import umap
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from qwen_embedder import QwenEmbedder, QwenModel, EmbeddingConfig

## 1. Load Newsgroups Dataset

We'll load a subset of the 20 Newsgroups dataset for our demo.

In [None]:
# Load newsgroups dataset
# Using a subset of categories for clarity
categories = [
    'alt.atheism',
    'comp.graphics',
    'sci.med',
    'talk.religion.misc',
    'rec.sport.baseball',
    'sci.space',
    'talk.politics.guns',
    'talk.politics.mideast',
    'rec.autos',
    'sci.crypt'
]

newsgroups = fetch_20newsgroups(
    subset='train',
    categories=categories,
    remove=('headers', 'footers', 'quotes'),  # Clean text
    random_state=42
)

print(f"Total documents: {len(newsgroups.data)}")
print(f"Categories: {newsgroups.target_names}")

In [3]:
# Sample a subset for demo (500-1000 samples to keep API costs reasonable)
n_samples = 800

# Stratified sampling to keep balanced representation
np.random.seed(42)
indices = []
samples_per_category = n_samples // len(categories)

for cat_idx in range(len(categories)):
    cat_indices = np.where(newsgroups.target == cat_idx)[0]
    selected = np.random.choice(cat_indices, size=min(samples_per_category, len(cat_indices)), replace=False)
    indices.extend(selected)

indices = np.array(indices)
np.random.shuffle(indices)

# Create DataFrame for easier handling
df = pd.DataFrame({
    'text': [newsgroups.data[i] for i in indices],
    'category': [newsgroups.target_names[newsgroups.target[i]] for i in indices],
    'category_id': [newsgroups.target[i] for i in indices]
})

# Clean up texts: remove very short documents
df['text_length'] = df['text'].str.len()
df = df[df['text_length'] > 100].reset_index(drop=True)

# Truncate long texts to 2,000 characters as a rough guard against token limits
df['text_clean'] = df['text'].str[:2000]

print(f"\nSampled {len(df)} documents")
print("\nCategory distribution:")
print(df['category'].value_counts())


Sampled 721 documents

Category distribution:
category
sci.med               130
sci.space             125
alt.atheism           119
rec.sport.baseball    118
comp.graphics         115
talk.religion.misc    114
Name: count, dtype: int64


In [4]:
# Preview a sample document
print("Sample document:")
print(f"Category: {df.iloc[0]['category']}")
print(f"Text: {df.iloc[0]['text_clean'][:500]}...")

Sample document:
Category: sci.med
Text: 
Flights of fancy, and other irrational approaches, are common.  The crucial
thing is not to sit around just having fantasies; they aren't of any use
unless they make you do some experiments.  I've known a lot of scientists
whose fantasies lead them on to creative work; usually they won't admit
out loud what the fantasy was, prior to the consumption of a few beers.

(Simple example: Warren Jelinek noticed an extremely heavy band on a DNA
electrophoresis gel of human ALU fragments.  He got very e...


## 2. Embed Texts with QWEN-3 (Multi-Task Comparison)

We'll embed the same texts using Qwen3-Embedding-0.6B with four different task instructions:
- **Default**: No instruction (general-purpose)
- **Sentiment**: "Classify the sentiment of the given text as positive, negative, or neutral"
- **Topic**: "Identify the topic or theme of the given text"
- **Toxicity**: "Classify the given text as either toxic or not toxic"
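
Instruction-aware embedding models typically receive the instruction as a plain-text prefix on the input. The sketch below illustrates that idea, assuming the `Instruct: ... / Query:...` template published on the Qwen3-Embedding model card. `format_with_instruction` is a hypothetical helper, and the `qwen_embedder` wrapper used in this notebook may build its prompts differently.

```python
from typing import Optional


def format_with_instruction(text: str, instruction: Optional[str]) -> str:
    """Prefix a text with a task instruction (hypothetical helper).

    Assumption: the model expects 'Instruct: <task>\\nQuery:<text>'.
    With no instruction, the text is passed through unchanged.
    """
    if not instruction:
        return text
    return f"Instruct: {instruction}\nQuery:{text}"


example = format_with_instruction(
    "The shuttle launch was delayed again.",
    "Identify the topic or theme of the given text",
)
print(example)
```

Under this assumption, the "default" task simply embeds the raw text, while every other task embeds the same text with a different prefix.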

In [5]:
# Configure embedder
config = EmbeddingConfig(
    model=QwenModel.SMALL,
    batch_size=32,
    max_concurrent=10
)


In [None]:
# Define the tasks we want to compare
tasks = {
    'default': None,  # No instruction
    'sentiment': "Classify the sentiment of the given text as positive, negative, or neutral",
    'topic': "Identify the topic or theme of the given text",
    'toxicity': "Classify the given text as either toxic or not toxic"
}

# Cache file for embeddings
cache_dir = Path('embeddings')
cache_dir.mkdir(exist_ok=True)
cache_file = cache_dir / 'newsgroups_multitask_embeddings.pkl'

# Embed texts for all tasks (or load from cache)
all_embeddings = {}

if cache_file.exists():
    print("Loading embeddings from cache...")
    with open(cache_file, 'rb') as f:
        cache_data = pickle.load(f)
        cached_embeddings = cache_data['embeddings']
        cached_texts = cache_data['texts']
    
    # Verify cache matches current data and the current set of tasks
    if cached_texts == df['text_clean'].tolist() and set(cached_embeddings) == set(tasks):
        all_embeddings = cached_embeddings
        print(f"Loaded embeddings for {len(all_embeddings)} tasks from cache")
        for task_name in all_embeddings.keys():
            print(f"  - {task_name}: {len(all_embeddings[task_name])} embeddings")
    else:
        print("Cache mismatch, re-embedding...")
        cache_file.unlink()

if not all_embeddings:
    print(f"Embedding texts with {len(tasks)} different task instructions...")
    
    async def embed_all_tasks():
        results = {}
        async with QwenEmbedder(config=config) as embedder:
            for task_name, instruction in tasks.items():
                print(f"\n{'='*60}")
                print(f"Task: {task_name}")
                if instruction:
                    print(f"Instruction: {instruction}")
                else:
                    print("Instruction: None (default embeddings)")
                print(f"{'='*60}")
                
                embeddings = await embedder.embed_async(
                    df['text_clean'].tolist(),
                    task_instruction=instruction,
                    show_progress=True
                )
                results[task_name] = embeddings
        return results
    
    # Top-level await works in Jupyter/IPython; in a plain script,
    # use asyncio.run(embed_all_tasks()) instead
    all_embeddings = await embed_all_tasks()
    
    # Cache the results
    print("\nSaving embeddings to cache...")
    with open(cache_file, 'wb') as f:
        pickle.dump({
            'embeddings': all_embeddings,
            'texts': df['text_clean'].tolist()
        }, f)
    
    print(f"Saved embeddings for {len(all_embeddings)} tasks to cache")

# Convert to numpy arrays
embeddings_arrays = {
    task_name: np.array(embeddings) 
    for task_name, embeddings in all_embeddings.items()
}

print("\nEmbeddings shapes:")
for task_name, arr in embeddings_arrays.items():
    print(f"  {task_name}: {arr.shape}")

## 3. Dimensionality Reduction with UMAP

We'll apply UMAP to each embedding set to reduce to 2D for visualization.

In [None]:
print("Applying UMAP to all embedding sets...")

# Apply UMAP to each task's embeddings
umap_embeddings = {}

for task_name, embeddings_array in embeddings_arrays.items():
    print(f"\nProcessing {task_name}...")
    
    umap_reducer = umap.UMAP(
        n_components=2,
        n_neighbors=15,
        min_dist=0.1,
        random_state=42,
        verbose=False
    )
    
    reduced = umap_reducer.fit_transform(embeddings_array)
    umap_embeddings[task_name] = reduced
    print(f"  {task_name}: {embeddings_array.shape} -> {reduced.shape}")

print("\nDimensionality reduction complete!")

In [None]:
# Add UMAP coordinates for each task to dataframe
for task_name, reduced in umap_embeddings.items():
    df[f'{task_name}_x'] = reduced[:, 0]
    df[f'{task_name}_y'] = reduced[:, 1]

# Create shortened text for tooltips (first 200 chars)
df['text_preview'] = df['text_clean'].str[:200] + '...'

print("Added UMAP coordinates to dataframe:")
print(df[['category', 'default_x', 'default_y', 'sentiment_x', 'sentiment_y', 'topic_x', 'topic_y', 'toxicity_x', 'toxicity_y']].head())

## 4. Multi-Task Visualization Comparison

Create a four-panel visualization comparing how different task instructions affect the embedding space.

In [None]:
# Create 4-panel comparison: Default vs Sentiment vs Topic vs Toxicity
# (make_subplots and go are already imported in the first cell)

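# Color palette for categories. D3 has 10 distinct colors, one per category;
# Set2 has only 8, so two of the 10 categories would share a color.
palette = px.colors.qualitative.D3
color_map = {
    cat: palette[i % len(palette)]
    for i, cat in enumerate(sorted(df['category'].unique()))
}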

# Task titles for display
task_titles = {
    'default': 'Default (No Instruction)',
    'sentiment': 'Sentiment Task',
    'topic': 'Topic Classification Task',
    'toxicity': 'Toxicity Detection Task'
}

fig = make_subplots(
    rows=1, cols=4,
    subplot_titles=[task_titles[task] for task in ['default', 'sentiment', 'topic', 'toxicity']],
    horizontal_spacing=0.05
)

# Create a plot for each task
for col_idx, task_name in enumerate(['default', 'sentiment', 'topic', 'toxicity'], start=1):
    for category in df['category'].unique():
        mask = df['category'] == category
        fig.add_trace(
            go.Scatter(
                x=df.loc[mask, f'{task_name}_x'],
                y=df.loc[mask, f'{task_name}_y'],
                mode='markers',
                name=category,
                marker=dict(
                    size=6,
                    color=color_map[category],
                    opacity=0.7,
                    line=dict(width=0.5, color='white')
                ),
                text=df.loc[mask, 'text_preview'],
                hovertemplate='<b>%{fullData.name}</b><br>' +
                             '<br>%{text}<br>' +
                             '<extra></extra>',
                legendgroup=category,
                showlegend=(col_idx == 1)  # Only show legend for first plot
            ),
            row=1, col=col_idx
        )

# Update layout
fig.update_layout(
    title_text='QWEN-3 Multi-Task Embedding Comparison (UMAP Projection)',
    title_x=0.5,
    height=600,
    width=2400,
    hovermode='closest',
    legend=dict(
        yanchor="top",
        y=0.99,
        xanchor="left",
        x=1.01,
        title="Newsgroup Category"
    )
)

# Update axes labels
for col_idx in range(1, 5):
    fig.update_xaxes(title_text="UMAP Dimension 1", row=1, col=col_idx)
    fig.update_yaxes(title_text="UMAP Dimension 2", row=1, col=col_idx)

fig.show()

## 5. Individual Task Plots for Detailed Exploration

In [None]:
# Default embeddings plot
fig_default = px.scatter(
    df,
    x='default_x',
    y='default_y',
    color='category',
    hover_data={'text_preview': True, 'default_x': False, 'default_y': False, 'category': False},
    title='Default Embeddings (No Task Instruction) - UMAP Projection',
    width=900,
    height=700,
    color_discrete_map=color_map
)

fig_default.update_traces(
    marker=dict(size=8, opacity=0.7, line=dict(width=0.5, color='white')),
    hovertemplate='<b>%{fullData.name}</b><br><br>%{customdata[0]}<br><extra></extra>'
)

fig_default.update_layout(hovermode='closest')
fig_default.show()

In [None]:
# Sentiment task embeddings plot
fig_sentiment = px.scatter(
    df,
    x='sentiment_x',
    y='sentiment_y',
    color='category',
    hover_data={'text_preview': True, 'sentiment_x': False, 'sentiment_y': False, 'category': False},
    title='Sentiment Task Embeddings - UMAP Projection',
    width=900,
    height=700,
    color_discrete_map=color_map
)

fig_sentiment.update_traces(
    marker=dict(size=8, opacity=0.7, line=dict(width=0.5, color='white')),
    hovertemplate='<b>%{fullData.name}</b><br><br>%{customdata[0]}<br><extra></extra>'
)

fig_sentiment.update_layout(hovermode='closest')
fig_sentiment.show()

In [None]:
# Topic task embeddings plot
fig_topic = px.scatter(
    df,
    x='topic_x',
    y='topic_y',
    color='category',
    hover_data={'text_preview': True, 'topic_x': False, 'topic_y': False, 'category': False},
    title='Topic Classification Task Embeddings - UMAP Projection',
    width=900,
    height=700,
    color_discrete_map=color_map
)

fig_topic.update_traces(
    marker=dict(size=8, opacity=0.7, line=dict(width=0.5, color='white')),
    hovertemplate='<b>%{fullData.name}</b><br><br>%{customdata[0]}<br><extra></extra>'
)

fig_topic.update_layout(hovermode='closest')
fig_topic.show()

In [None]:
# Toxicity task embeddings plot
fig_toxicity = px.scatter(
    df,
    x='toxicity_x',
    y='toxicity_y',
    color='category',
    hover_data={'text_preview': True, 'toxicity_x': False, 'toxicity_y': False, 'category': False},
    title='Toxicity Detection Task Embeddings - UMAP Projection',
    width=900,
    height=700,
    color_discrete_map=color_map
)

fig_toxicity.update_traces(
    marker=dict(size=8, opacity=0.7, line=dict(width=0.5, color='white')),
    hovertemplate='<b>%{fullData.name}</b><br><br>%{customdata[0]}<br><extra></extra>'
)

fig_toxicity.update_layout(hovermode='closest')
fig_toxicity.show()

## 6. Save Visualizations

In [None]:
# Save interactive HTML files
viz_dir = Path('visualizations')
viz_dir.mkdir(exist_ok=True)

# Save comparison and individual plots
fig.write_html(viz_dir / 'multitask_comparison.html')
fig_default.write_html(viz_dir / 'default_embeddings.html')
fig_sentiment.write_html(viz_dir / 'sentiment_embeddings.html')
fig_topic.write_html(viz_dir / 'topic_embeddings.html')
fig_toxicity.write_html(viz_dir / 'toxicity_embeddings.html')

print("Saved visualizations to:", viz_dir.absolute())
print("\nFiles created:")
print("  - multitask_comparison.html (4-panel comparison)")
print("  - default_embeddings.html")
print("  - sentiment_embeddings.html")
print("  - topic_embeddings.html")
print("  - toxicity_embeddings.html")

## Analysis Notes

**Hover over points** to see the text content of each document.

### Comparing Task Instructions:

This notebook explores how task-specific instructions affect QWEN-3 embeddings:

1. **Default Embeddings** - General-purpose, no task instruction
2. **Sentiment Task** - Instruction steers the embedding toward sentiment (positive/negative/neutral)
3. **Topic Task** - Instruction steers the embedding toward topic/theme
4. **Toxicity Task** - Instruction steers the embedding toward toxic/inflammatory discourse

### What to Look For:

1. **Cluster Quality**: Which task produces the best newsgroup category separation?
2. **Task Alignment**: Does the topic task create better clusters than default for newsgroup categorization?
3. **Orthogonal Dimensions**: How does toxicity detection differ from sentiment and topic?
4. **Cross-Topic Patterns**: Does toxicity appear across multiple newsgroups (e.g., political vs technical)?
5. **Semantic Organization**: How do different instructions reorganize the embedding space?
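
Visual inspection of cluster quality can be backed by a number. The sketch below scores category separation with scikit-learn's `silhouette_score`, computed on the original high-dimensional embeddings rather than the 2D UMAP coordinates. `category_separation` is a hypothetical helper, the data here is synthetic, and the cosine metric is an assumption that suits most text embeddings.

```python
import numpy as np
from sklearn.metrics import silhouette_score


def category_separation(embeddings_by_task, labels, metric="cosine"):
    """Return {task_name: silhouette score} using one shared label vector.

    Scores range from -1 to 1; higher means documents sit closer to
    their own category than to other categories.
    """
    labels = np.asarray(labels)
    return {
        name: float(silhouette_score(np.asarray(emb), labels, metric=metric))
        for name, emb in embeddings_by_task.items()
    }


# Synthetic stand-in data: 3 categories, 50 documents each, 16 dimensions
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 50)
centers = rng.normal(size=(3, 16)) * 5.0
separated = centers[labels] + rng.normal(size=(150, 16))  # clear clusters
unstructured = rng.normal(size=(150, 16))                 # no cluster structure

scores = category_separation(
    {"separated": separated, "unstructured": unstructured}, labels
)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

In this notebook the equivalent call would be `category_separation(embeddings_arrays, df['category_id'].to_numpy())`, which gives one score per task instruction.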

### Expected Observations:

- **Topic task** should produce the best category separation (aligned with newsgroup classification)
- **Sentiment task** may group texts by emotional tone rather than category
- **Toxicity task** should reveal discourse patterns - political/religious groups may show different toxicity distributions than technical/hobby groups
- **Default embeddings** provide a balanced, general-purpose representation
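- UMAP mainly preserves local neighborhoods; cluster sizes and the distances between well-separated clusters in the 2D plots are not reliable, and each task is projected independently, so axes are not comparable across panels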
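
How faithfully a 2D projection keeps each document's original neighbors can be checked numerically. The sketch below uses scikit-learn's `trustworthiness` for that purpose. `projection_trustworthiness` is a hypothetical helper, the data is synthetic, and `n_neighbors=15` is an assumption chosen to mirror the UMAP setting above.

```python
import numpy as np
from sklearn.manifold import trustworthiness


def projection_trustworthiness(high_dim, low_dim, n_neighbors=15):
    """Score in [0, 1]; 1.0 means every low-dimensional neighbor is also
    a neighbor in the original space."""
    return float(
        trustworthiness(
            np.asarray(high_dim), np.asarray(low_dim), n_neighbors=n_neighbors
        )
    )


# Synthetic check: points on a 2D plane embedded isometrically in 32-D
rng = np.random.default_rng(0)
plane = rng.normal(size=(200, 2))
basis, _ = np.linalg.qr(rng.normal(size=(32, 2)))  # orthonormal columns
high = plane @ basis.T

faithful = projection_trustworthiness(high, plane)
# Shuffling rows breaks the link between each point and its projection
scrambled = projection_trustworthiness(high, plane[rng.permutation(200)])

print(f"faithful projection:  {faithful:.3f}")
print(f"scrambled projection: {scrambled:.3f}")
```

For this notebook, `projection_trustworthiness(embeddings_arrays[task], umap_embeddings[task])` would give one score per task. The score covers only local neighborhoods and says nothing about global layout.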

### Key Insight:

Task-specific instructions can significantly reshape the embedding space to optimize for particular downstream applications. The "right" instruction depends on your use case!
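
One way to quantify how much an instruction reshapes the space is to compare each document's nearest neighbors under two embedding sets. The sketch below computes the mean k-nearest-neighbor overlap using cosine similarity. `knn_indices` and `mean_neighbor_overlap` are hypothetical helpers, the data is synthetic, and the code assumes no embedding is the zero vector.

```python
import numpy as np


def knn_indices(emb, k):
    """Indices of the k most cosine-similar rows for each row (self excluded)."""
    emb = np.asarray(emb, dtype=float)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # never count a point as its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]


def mean_neighbor_overlap(emb_a, emb_b, k=10):
    """Average fraction of shared k-nearest neighbors between two embedding sets.

    1.0 means identical neighborhoods; values near k / (n - 1) are what
    unrelated embeddings would give by chance.
    """
    nn_a = knn_indices(emb_a, k)
    nn_b = knn_indices(emb_b, k)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlaps))


rng = np.random.default_rng(0)
base = rng.normal(size=(100, 32))
unrelated = rng.normal(size=(100, 32))

same = mean_neighbor_overlap(base, base)
different = mean_neighbor_overlap(base, unrelated)
print(f"identical embeddings: {same:.2f}")
print(f"unrelated embeddings: {different:.2f}")
```

Applied here, `mean_neighbor_overlap(embeddings_arrays['default'], embeddings_arrays['topic'])` would indicate how far the topic instruction moves documents relative to the default embeddings. The full similarity matrix is fine at a few hundred documents but scales quadratically.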

### Categories Analyzed (10 total):

- **Religious/Political**: alt.atheism, talk.religion.misc, talk.politics.guns, talk.politics.mideast
- **Scientific/Technical**: sci.med, sci.space, sci.crypt, comp.graphics
- **Recreation**: rec.sport.baseball, rec.autos