# QWEN-3 Task-Specific Embeddings Demo

This notebook demonstrates how **task-specific instructions** reshape the embedding space of QWEN-3 models.

## What We'll Do

We'll embed 800 documents from the 20 Newsgroups dataset using two approaches:

1. **Default Embeddings** - No instruction (general-purpose)
2. **Sentiment Embeddings** - With instruction: "Classify the sentiment of the given text as positive, negative, or neutral"

Then we'll compare how the embedding space changes based on the task instruction.

## Key Insight

The same model and the same documents, embedded under different instructions, can produce noticeably different embedding spaces. This is useful when you want embeddings tuned to a specific downstream task.

## 1. Setup and Imports

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
import umap
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from qwen_embedder import QwenEmbedder, QwenModel, EmbeddingConfig

print("✓ Imports successful")

## 2. Load and Preprocess Data

We'll load 10 newsgroup categories covering diverse topics:
- Religious/Political: alt.atheism, talk.religion.misc, talk.politics.guns, talk.politics.mideast
- Scientific/Technical: sci.med, sci.space, sci.crypt, comp.graphics
- Recreation: rec.sport.baseball, rec.autos

We'll sample 80 documents per category for a total of 800 documents.

In [None]:
# Define categories
CATEGORIES = [
    'alt.atheism',
    'comp.graphics',
    'sci.med',
    'talk.religion.misc',
    'rec.sport.baseball',
    'sci.space',
    'talk.politics.guns',
    'talk.politics.mideast',
    'rec.autos',
    'sci.crypt'
]

print(f"Loading 20 Newsgroups dataset ({len(CATEGORIES)} categories)...")

newsgroups = fetch_20newsgroups(
    subset='all',
    categories=CATEGORIES,
    remove=('headers', 'footers', 'quotes'),
    random_state=42
)

# Create DataFrame
df = pd.DataFrame({
    'text': newsgroups.data,
    'category': [newsgroups.target_names[i] for i in newsgroups.target]
})

# Clean and filter text
df['text_clean'] = df['text'].str.replace(r'\s+', ' ', regex=True).str.strip()
df = df[df['text_clean'].str.len() >= 100].copy()  # Remove very short posts
df['text_clean'] = df['text_clean'].str[:5000]  # Truncate very long posts

# Stratified sampling: 80 documents per category
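# Shuffle once, then keep the first 80 rows of each category.
# (Avoids groupby().apply(), which warns about or excludes the grouping
# column in recent pandas versions.)
df_sampled = (
    df.sample(frac=1, random_state=42)
      .groupby('category')
      .head(80)
      .sort_values('category', kind='stable')
      .reset_index(drop=True)
)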

# Create preview for hover tooltips
df_sampled['text_preview'] = df_sampled['text_clean'].str[:200] + '...'

print(f"\n✓ Loaded {len(df_sampled)} documents")
print("\nCategory distribution:")
print(df_sampled['category'].value_counts().sort_index())

In [None]:
# Preview a sample document
sample = df_sampled.iloc[0]
print("Sample Document:")
print(f"Category: {sample['category']}")
print(f"\nText preview:\n{sample['text_preview']}")

## 3. Generate Embeddings with Two Tasks

Now we'll embed all 800 documents **twice**:

### Task 1: Default Embeddings
No instruction provided - general-purpose embeddings that capture overall semantic meaning.

### Task 2: Sentiment Embeddings
Instruction: "Classify the sentiment of the given text as positive, negative, or neutral"

This tells QWEN-3 to focus on emotional tone and polarity when creating embeddings.

---

**Note**: This step calls the SiliconFlow API. Progress bars will show embedding generation.

In [None]:
# Configure embedder
config = EmbeddingConfig(
    model=QwenModel.SMALL,  # QWEN-3-Embedding-0.6B
    batch_size=32,
    max_concurrent=10
)

print("Initialized QWEN-3 Embedder")
print(f"Model: {config.model.value}")
print(f"Batch size: {config.batch_size}")
print(f"Max concurrent requests: {config.max_concurrent}")

In [None]:
# Embed all documents with both tasks
async def generate_embeddings():
    """Generate embeddings for both tasks."""
    texts = df_sampled['text_clean'].tolist()
    
    all_embeddings = {}
    
    async with QwenEmbedder(config=config) as embedder:
        # Task 1: Default (no instruction)
        print("="*60)
        print("TASK 1: DEFAULT EMBEDDINGS (No Instruction)")
        print("="*60)
        
        embeddings_default = await embedder.embed_async(
            texts,
            task_instruction=None,
            show_progress=True
        )
        all_embeddings['default'] = np.array(embeddings_default)
        print(f"✓ Generated {len(embeddings_default)} default embeddings (dim={len(embeddings_default[0])})\n")
        
        # Task 2: Sentiment
        print("="*60)
        print("TASK 2: SENTIMENT EMBEDDINGS")
        print("Instruction: 'Classify the sentiment of the given text as positive, negative, or neutral'")
        print("="*60)
        
        embeddings_sentiment = await embedder.embed_async(
            texts,
            task_instruction="Classify the sentiment of the given text as positive, negative, or neutral",
            show_progress=True
        )
        all_embeddings['sentiment'] = np.array(embeddings_sentiment)
        print(f"✓ Generated {len(embeddings_sentiment)} sentiment embeddings (dim={len(embeddings_sentiment[0])})\n")
    
    return all_embeddings

# Run the async function
all_embeddings = await generate_embeddings()

print("\n" + "="*60)
print("✓ EMBEDDING GENERATION COMPLETE")
print("="*60)

## 4. Apply UMAP Dimensionality Reduction

Our embeddings are 1024-dimensional vectors. To visualize them, we'll reduce them to 2D using UMAP.

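UMAP (Uniform Manifold Approximation and Projection) is designed to preserve local neighborhood structure while retaining some of the global layout, which makes it a common choice for visualizing embedding spaces.

We'll fit UMAP separately on each task's embeddings. The two projections therefore have unrelated axes: compare which documents end up near each other in each plot, not their absolute coordinates.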

In [None]:
print("Applying UMAP dimensionality reduction...\n")

umap_coords = {}

for task_name in ['default', 'sentiment']:
    print(f"  Processing {task_name} embeddings...")
    
    reducer = umap.UMAP(
        n_components=2,
        n_neighbors=15,
        min_dist=0.1,
        random_state=42,
        verbose=False
    )
    
    coords = reducer.fit_transform(all_embeddings[task_name])
    umap_coords[task_name] = coords
    
    # Add to dataframe
    df_sampled[f'{task_name}_x'] = coords[:, 0]
    df_sampled[f'{task_name}_y'] = coords[:, 1]
    
    print(f"  ✓ {task_name}: reduced to {coords.shape}")

print("\n✓ UMAP reduction complete!")

## 5. Visualize: Default vs Sentiment Embeddings

Now let's create a side-by-side comparison of the two embedding spaces.

Each point represents one document, colored by its newsgroup category. Hover over points to see the text preview.

### What to Look For

- **Default (left)**: General-purpose organization based on overall semantic similarity
- **Sentiment (right)**: Organization based on emotional tone and polarity

Notice how the same documents group differently in the two spaces!
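The visual comparison can be backed by a number. The sketch below is not part of the original notebook: it assumes cosine similarity as the distance measure, and the helper names `knn_indices` and `neighbor_overlap` are hypothetical. For each document it measures what fraction of its k nearest neighbors is the same in both embedding spaces. Synthetic arrays stand in for `all_embeddings['default']` and `all_embeddings['sentiment']`.

```python
import numpy as np


def knn_indices(embeddings: np.ndarray, k: int) -> np.ndarray:
    """Indices of each row's k nearest neighbours by cosine similarity."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # a point is not its own neighbour
    return np.argsort(-sims, axis=1)[:, :k]


def neighbor_overlap(emb_a: np.ndarray, emb_b: np.ndarray, k: int = 10) -> np.ndarray:
    """Per-document fraction of the k nearest neighbours shared by both spaces."""
    nn_a = knn_indices(emb_a, k)
    nn_b = knn_indices(emb_b, k)
    return np.array([len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)])


# Synthetic stand-ins (assumption): replace with the real embedding matrices.
rng = np.random.default_rng(0)
space_a = rng.normal(size=(50, 16))
space_b = rng.normal(size=(50, 16))

same = neighbor_overlap(space_a, space_a)
different = neighbor_overlap(space_a, space_b)
print(f"identical spaces: mean overlap {same.mean():.2f}")
print(f"unrelated spaces: mean overlap {different.mean():.2f}")
```

With the notebook's data, `neighbor_overlap(all_embeddings['default'], all_embeddings['sentiment'])` would give one score per document. Because it works on the full-dimensional vectors, the result does not depend on UMAP.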

In [None]:
# Create 2-panel comparison
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=['Default Embeddings (No Instruction)', 'Sentiment Embeddings'],
    horizontal_spacing=0.08
)

# Plot both tasks
for col_idx, task_name in enumerate(['default', 'sentiment'], start=1):
    for category in df_sampled['category'].unique():
        # Select this category's rows once and reuse them below
        subset = df_sampled[df_sampled['category'] == category]
        
        fig.add_trace(
            go.Scatter(
                x=subset[f'{task_name}_x'],
                y=subset[f'{task_name}_y'],
                mode='markers',
                name=category,
                marker=dict(
                    size=7,
                    opacity=0.7,
                    line=dict(width=0.5, color='white')
                ),
                text=subset['text_preview'],
                hovertemplate='<b>%{fullData.name}</b><br><br>%{text}<br><extra></extra>',
                legendgroup=category,
                showlegend=(col_idx == 1)  # Only show legend once
            ),
            row=1, col=col_idx
        )

# Update layout
fig.update_layout(
    title_text='How Task Instructions Reshape Embeddings: Default vs Sentiment',
    title_x=0.5,
    title_font_size=18,
    height=700,
    width=1800,
    hovermode='closest',
    legend=dict(
        yanchor="top",
        y=0.99,
        xanchor="left",
        x=1.01,
        title="Newsgroup Category",
        font=dict(size=11)
    )
)

# Update axes
for col_idx in range(1, 3):
    fig.update_xaxes(title_text="UMAP Dimension 1", row=1, col=col_idx)
    fig.update_yaxes(title_text="UMAP Dimension 2", row=1, col=col_idx)

fig.show()

## 6. Analysis and Key Takeaways

### Observations

Compare the two plots above:

1. **Cluster Quality**: Which embedding space creates clearer separation between newsgroup categories?
   - The **default embeddings** (left) likely show better category separation since they're optimized for general semantic similarity
   - The **sentiment embeddings** (right) may group documents by emotional tone rather than topic

2. **Semantic Reorganization**: The same documents occupy different positions in the two spaces
   - A political post and a sports post might be far apart in default space but close together in sentiment space if both are positive
   - This shows how a task instruction changes which documents count as neighbors in the embedding space

3. **Cross-Category Patterns**:
   - Do political newsgroups (guns, mideast) cluster together in sentiment space?
   - Are technical discussions (graphics, crypto) more neutral in sentiment?
   - Do hobby groups (baseball, autos) lean more positive?
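The cluster-quality question above can also be checked numerically. One possible proxy, sketched here under assumptions (cosine similarity, k=10, and a hypothetical helper name `knn_label_purity`), is the average fraction of each document's nearest neighbors that share its category. The data below is synthetic and only illustrates the behavior of the metric.

```python
import numpy as np


def knn_label_purity(embeddings: np.ndarray, labels: np.ndarray, k: int = 10) -> float:
    """Mean fraction of each point's k nearest neighbours (cosine) sharing its label."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # exclude each point from its own neighbours
    nn = np.argsort(-sims, axis=1)[:, :k]
    same_label = labels[nn] == labels[:, None]
    return float(same_label.mean())


# Synthetic data (assumption): two well-separated groups versus pure noise.
rng = np.random.default_rng(0)
labels = np.repeat(np.array(["a", "b"]), 30)
centers = np.where(labels[:, None] == "a", 5.0, -5.0)
clustered = centers + rng.normal(size=(60, 8))
noise = rng.normal(size=(60, 8))

purity_clustered = knn_label_purity(clustered, labels)
purity_noise = knn_label_purity(noise, labels)
print(f"clustered: {purity_clustered:.2f}, noise: {purity_noise:.2f}")
```

In the notebook this could be called as `knn_label_purity(all_embeddings[task_name], df_sampled['category'].to_numpy())` for each task. A higher score for the default embeddings would support the expectation that they separate topics more cleanly.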

### Key Insight: Task Instructions Are Powerful!

The same model, same documents, but different instructions create fundamentally different embedding spaces.

**When building applications:**
- 🔍 **Search/Retrieval**: Use task instructions matching your search intent
- 📊 **Classification**: Align the instruction with your classification goal
- 🎯 **Clustering**: Choose task based on how you want documents grouped
- ↔️ **Similarity**: Define "similar" via the task instruction

### Technical Details

- **Model**: QWEN-3-Embedding-0.6B (1024 dimensions)
- **Reduction**: UMAP (n_neighbors=15, min_dist=0.1, random_state=42)
- **Documents**: 800 (80 per category, stratified sampling)
- **API**: SiliconFlow
- **Format**: `"Instruct: {instruction}\nQuery: {text}"` (for sentiment task)
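
The format listed above can be sketched as a small helper. `build_model_input` is a hypothetical name used for illustration, and the exact whitespace the embedding client sends is an assumption based on the format string quoted above.

```python
from typing import Optional


def build_model_input(text: str, instruction: Optional[str] = None) -> str:
    """Return the string that would be embedded for one document.

    Hypothetical helper: mirrors the quoted format, not the client's actual code.
    """
    if not instruction:
        # Default task: the document is embedded as-is.
        return text
    return f"Instruct: {instruction}\nQuery: {text}"


SENTIMENT = "Classify the sentiment of the given text as positive, negative, or neutral"
print(build_model_input("The launch went flawlessly.", SENTIMENT))
```

The only difference between the two tasks is this prefix: the document text itself is unchanged.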

## 7. Next Steps

### Try Different Tasks

Modify the sentiment instruction to explore other tasks:

```python
# Topic classification
"Identify the topic or theme of the given text"

# Toxicity detection
"Classify the given text as either toxic or not toxic"

# Emotion recognition
"Classify the emotion expressed in the given text"

# Query-document matching
"Given a web search query, retrieve relevant passages that answer the query"
```
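
To compare several instructions in one run, the same texts can be embedded once per task. The sketch below is an illustration only: `embed_for_tasks` and `fake_embed` are hypothetical names, and `fake_embed` is a stand-in that returns tiny deterministic vectors instead of calling the real API.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional

# Any async callable taking (texts, instruction) and returning one vector per text.
EmbedFn = Callable[[List[str], Optional[str]], Awaitable[List[List[float]]]]

TASKS: Dict[str, Optional[str]] = {
    "default": None,
    "topic": "Identify the topic or theme of the given text",
    "toxicity": "Classify the given text as either toxic or not toxic",
}


async def embed_for_tasks(
    embed_fn: EmbedFn, texts: List[str], tasks: Dict[str, Optional[str]]
) -> Dict[str, List[List[float]]]:
    """Embed the same texts once per task instruction, one task at a time."""
    results: Dict[str, List[List[float]]] = {}
    for name, instruction in tasks.items():
        results[name] = await embed_fn(texts, instruction)
    return results


async def fake_embed(texts: List[str], instruction: Optional[str]) -> List[List[float]]:
    """Stand-in embedder (NOT the real API): encodes text and instruction lengths."""
    offset = float(len(instruction or ""))
    return [[float(len(t)), offset] for t in texts]


demo = asyncio.run(embed_for_tasks(fake_embed, ["hello", "embedding"], TASKS))
print({name: vectors[0] for name, vectors in demo.items()})
```

Inside the notebook, replace `asyncio.run(...)` with a plain `await`, since Jupyter already runs an event loop. The real embedder from Section 3 could be passed in as `lambda texts, instr: embedder.embed_async(texts, task_instruction=instr)`.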

### Experiment with Parameters

- Try different newsgroup categories
- Adjust sample size (more/fewer documents)
- Modify UMAP parameters (n_neighbors, min_dist)
- Compare QWEN-3-0.6B vs larger models

### Deploy to Web

Use `scripts/generate_data.py` to create the full web app:

```bash
poetry run python scripts/generate_data.py
```

Then open `docs/index.html` for an interactive version!

### Build Applications

- **Semantic search** with task-specific embeddings
- **Document classification** using instruction-tuned embeddings
- **Recommendation systems** that understand context
- **Content moderation** with toxicity-focused embeddings

### Resources

- [QWEN-3 Embedding arXiv Paper](https://arxiv.org/abs/2506.05176)
- [Official QWEN Repository](https://github.com/QwenLM/Qwen3-Embedding)
- [SiliconFlow API Docs](https://siliconflow.com)
- [20 Newsgroups Dataset](https://scikit-learn.org/stable/datasets/real_world.html#newsgroups-dataset)
- [UMAP Documentation](https://umap-learn.readthedocs.io/)