A Python toolkit for analyzing and visualizing text embeddings from Cohere, OpenAI, and Google. Features PCA visualization, clustering analysis, similarity matrices, and performance comparisons across different embedding models. Built for researchers and developers to evaluate and compare embedding quality.

itzreqle/text-embeddings-analyzer

Text Embeddings Visualization

This project demonstrates text embeddings from multiple providers (Cohere, OpenAI, Google) and visualizes them using PCA and clustering analysis.

Features

  • Multiple Embedding Providers: Test embeddings from Cohere, OpenAI, and Google
  • Visualization: 2D PCA plots of embeddings colored by intent
  • Clustering Analysis: K-means clustering with performance evaluation
  • Model Comparison: Compare different embedding models using Adjusted Rand Index
  • Export Results: Save plots and results to files

Installation

  1. Install dependencies:
pip install -r requirements.txt
  2. Create your .env file from the example:
# Copy the example .env file
cp .env.example .env
# Edit .env with your actual API keys
  3. Add your API keys to the .env file:
    • COHERE_API_KEY: Your Cohere API key
    • OPENAI_API_KEY: Your OpenAI API key
    • GOOGLE_API_KEY: Your Google AI API key
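
Before a run it can help to check which providers actually have a key configured. The following is a minimal sketch, not the project's own code: it assumes the values from .env have already been loaded into the process environment (for example by a dotenv-style loader), and the `PROVIDER_KEYS` mapping and `configured_providers` helper are hypothetical names introduced for illustration.

```python
import os

# Hypothetical mapping from provider name to the environment variable
# that holds its key; the variable names are the ones listed above.
PROVIDER_KEYS = {
    "cohere": "COHERE_API_KEY",
    "openai": "OPENAI_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def configured_providers(env=None):
    """Return the providers whose API key is set to a non-empty value."""
    env = os.environ if env is None else env
    return [
        name
        for name, variable in PROVIDER_KEYS.items()
        if env.get(variable, "").strip()
    ]


if __name__ == "__main__":
    print("Providers with keys:", configured_providers())
```

Passing a plain dictionary as `env` makes the check easy to try without touching real keys.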

Usage

Basic Usage

Run the demo script with default settings (Cohere models only):

python text_embeddings_demo.py

Advanced Usage

Select providers, models, and sampling options:

# Test all providers
python text_embeddings_demo.py --providers all

# Test specific providers
python text_embeddings_demo.py --providers cohere openai

# Test specific models
python text_embeddings_demo.py --providers cohere --models embed-v4.0 embed-english-v3.0

# Custom dataset sampling
python text_embeddings_demo.py --sample-size 0.2 --random-seed 42

Command Line Arguments

  • --providers: Choose providers to test (cohere, openai, google, all)
  • --models: Specific models to test (if not specified, tests all available models)
  • --sample-size: Fraction of dataset to sample (default: 0.1)
  • --random-seed: Random seed for reproducibility (default: 30)
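
The arguments above could be declared with `argparse` roughly as follows. This is a sketch under stated assumptions rather than the script's actual parser: the `build_parser` name and the help strings are invented here, and the `cohere` default for `--providers` is inferred from the Basic Usage note that the default run uses Cohere models only.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented command line arguments."""
    parser = argparse.ArgumentParser(description="Compare text embedding models")
    parser.add_argument(
        "--providers",
        nargs="+",
        choices=["cohere", "openai", "google", "all"],
        default=["cohere"],  # assumption: default run is Cohere only
        help="Providers to test",
    )
    parser.add_argument(
        "--models",
        nargs="+",
        default=None,  # None stands for "all available models"
        help="Specific models to test",
    )
    parser.add_argument(
        "--sample-size",
        type=float,
        default=0.1,
        help="Fraction of the dataset to sample",
    )
    parser.add_argument(
        "--random-seed",
        type=int,
        default=30,
        help="Random seed for reproducibility",
    )
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Note that `argparse` exposes `--sample-size` and `--random-seed` as `args.sample_size` and `args.random_seed`.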

Output

The script writes the results of each run to the output/ directory:

Directory Structure

output/
├── YYYYMMDD_HHMMSS/          # Timestamped folder for each run
│   ├── pca_analysis_*.png    # PCA variance analysis plots
│   ├── heatmap_*.png         # Principal components correlation matrices
│   ├── similarity_*.png      # Query similarity matrices
│   ├── embeddings_*.png      # 2D PCA scatter plots
│   └── model_comparison.png  # Performance comparison chart
└── embedding_results_YYYYMMDD_HHMMSS.csv  # Detailed results
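
The timestamped names in this layout follow the `YYYYMMDD_HHMMSS` pattern, which maps to the `strftime` format `%Y%m%d_%H%M%S`. A minimal sketch of how such paths could be derived is shown below; the `run_paths` helper is hypothetical and only computes the paths without creating anything on disk.

```python
from datetime import datetime
from pathlib import Path


def run_paths(base="output", now=None):
    """Return (run_directory, results_csv) for a run started at `now`."""
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")
    run_dir = Path(base) / stamp
    csv_path = Path(base) / f"embedding_results_{stamp}.csv"
    return run_dir, csv_path


if __name__ == "__main__":
    run_dir, csv_path = run_paths()
    print(run_dir)
    print(csv_path)
```

A caller would typically follow this with `run_dir.mkdir(parents=True, exist_ok=True)` before saving plots.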

Generated Files

  • PCA Analysis Plots: Explained variance and cumulative variance plots for each model
  • Correlation Heatmaps: Principal components correlation matrices
  • Similarity Heatmaps: Query-to-query similarity matrices
  • 2D PCA Scatter Plots: Embedding visualizations colored by intent
  • Model Comparison Plot: Bar chart comparing all model performances
  • Results CSV: Detailed performance metrics with timestamp
  • Console Output: Adjusted Rand Index scores and variance information

Example Outputs

Here are example outputs from different embedding models:

Cohere Models (from output/20250812_144139/)

Embeddings Visualization

[Figures: Cohere Embed v4.0, Cohere English v3.0, Cohere Multilingual v3.0]

Heatmaps

[Figures: Heatmap v4.0, Heatmap English v3.0, Heatmap Multilingual v3.0]

PCA Analysis

[Figures: PCA v4.0, PCA English v3.0, PCA Multilingual v3.0]

Similarity Analysis

[Figures: Similarity v4.0, Similarity English v3.0, Similarity Multilingual v3.0]

Model Comparison

[Figure: Model Comparison]

Google Models (from output/20250812_144446/)

Embeddings Visualization

[Figure: Google Gemini]

Heatmap

[Figure: Google Gemini Heatmap]

PCA Analysis

[Figure: Google Gemini PCA]

Similarity Analysis

[Figure: Google Gemini Similarity]

Model Comparison

[Figure: Model Comparison]

Dataset

The script uses the ATIS (Airline Travel Information System) dataset, a collection of airline travel queries labeled by intent, including:

  • atis_airfare: Questions about ticket prices
  • atis_airline: Questions about airlines
  • atis_ground_service: Questions about ground transportation
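
The `--sample-size` and `--random-seed` options suggest a seeded random sample of the labeled queries. The sketch below illustrates that idea with the standard library only. It assumes the rows are `(query, intent)` pairs and that only the three intents listed above are kept; both the `sample_rows` helper and that filtering step are assumptions, not a description of the script's actual data loading.

```python
import random

# Intents taken from the list above.
INTENTS = {"atis_airfare", "atis_airline", "atis_ground_service"}


def sample_rows(rows, fraction=0.1, seed=30):
    """Keep rows with a known intent, then draw a reproducible sample.

    rows: list of (query, intent) tuples.
    fraction: share of the kept rows to return (at least one row if any exist).
    seed: seed for the private random generator, so repeated calls agree.
    """
    kept = [row for row in rows if row[1] in INTENTS]
    if not kept:
        return []
    count = max(1, round(len(kept) * fraction))
    return random.Random(seed).sample(kept, count)
```

Using a private `random.Random(seed)` instance keeps the sample reproducible without altering the global random state.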

Models Tested

Cohere

  • embed-v4.0
  • embed-english-v3.0
  • embed-multilingual-v3.0

OpenAI

  • text-embedding-3-small
  • text-embedding-3-large
  • text-embedding-ada-002

Google

  • gemini-embedding-001
  • text-embedding-005
  • text-multilingual-embedding-002

Performance Metrics

The script evaluates embedding quality using:

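  • Adjusted Rand Index (ARI): Measures how well the K-means cluster assignments agree with the true intent labels, corrected for chance (1.0 is a perfect match; values near 0 indicate chance-level agreement)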
  • PCA Analysis: Explained variance and dimensionality reduction analysis
  • Correlation Analysis: Principal components correlation matrices
  • Similarity Analysis: Query-to-query cosine similarity matrices
  • 2D Visualization: PCA projection to visualize embedding space structure
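
To make two of these metrics concrete, here is a small pure-Python sketch of the Adjusted Rand Index and a cosine similarity matrix. It is for illustration only: the script itself most likely relies on a library such as scikit-learn or NumPy for these computations (an assumption), and the function names below are invented. The cosine function assumes no input vector is all zeros.

```python
from collections import Counter
from math import comb, sqrt


def adjusted_rand_index(true_labels, pred_labels):
    """ARI between two labelings of the same items (pair-counting form)."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    total_pairs = comb(len(true_labels), 2)
    if total_pairs == 0:
        return 1.0  # zero or one item: the labelings trivially agree
    # Pairs of items placed together in both labelings.
    index = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    # Pairs placed together in each labeling on its own.
    sum_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_true * sum_pred / total_pairs
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (index - expected) / (max_index - expected)


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity for a list of equal-length, nonzero vectors."""
    norms = [sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [sum(a * b for a, b in zip(u, v)) / (nu * nv) for v, nv in zip(vectors, norms)]
        for u, nu in zip(vectors, norms)
    ]
```

Because ARI depends only on which items are grouped together, renaming the cluster IDs does not change the score, which is why it suits K-means output whose cluster numbers are arbitrary.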

Security Notes

  • API keys are stored in the .env file (not committed to version control)
  • The .env file is automatically ignored by git
  • Never commit your actual API keys to version control

Technical Notes

  • The script uses a non-interactive matplotlib backend to avoid display issues
  • All plots are automatically saved as PNG files in timestamped directories
  • Results are sorted by ARI score (higher is better)
  • API failures are caught so that a run can continue gracefully instead of aborting
  • Results are reproducible through a configurable random seed (--random-seed)
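
The combination of graceful continuation and ARI-sorted results can be sketched as a loop that records failures and keeps going. This is an illustrative pattern only: `evaluate_models` and the `evaluate` callback are hypothetical names, and the real script may structure its error handling differently.

```python
def evaluate_models(models, evaluate):
    """Score each model, skipping those whose evaluation raises.

    models: iterable of model names.
    evaluate: hypothetical callable mapping a model name to its ARI score;
        it may raise, for example on an API or network error.
    Returns (results, failures): results is a list of (name, score) sorted
    by score in descending order, failures maps model name to error text.
    """
    results = []
    failures = {}
    for name in models:
        try:
            results.append((name, evaluate(name)))
        except Exception as exc:  # keep going with the remaining models
            failures[name] = str(exc)
    results.sort(key=lambda item: item[1], reverse=True)
    return results, failures
```

Returning the failures alongside the scores lets the caller report which models were skipped and why.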
