This project demonstrates text embeddings from multiple providers (Cohere, OpenAI, Google) and visualizes them using PCA and clustering analysis.
- Multiple Embedding Providers: Test embeddings from Cohere, OpenAI, and Google
- Visualization: 2D PCA plots of embeddings colored by intent
- Clustering Analysis: K-means clustering with performance evaluation
- Model Comparison: Compare different embedding models using Adjusted Rand Index
- Export Results: Save plots and results to files
- Install dependencies:
pip install -r requirements.txt
- Set up your API keys in the .env file:
# Copy the example .env file and add your keys
cp .env.example .env
# Edit .env with your actual API keys
- Add your API keys to the .env file:
  - COHERE_API_KEY: Your Cohere API key
  - OPENAI_API_KEY: Your OpenAI API key
  - GOOGLE_API_KEY: Your Google AI API key
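Before a run it can help to confirm which keys are actually set. The following is a minimal sketch, assuming the keys have already been loaded into the process environment (for example by a .env loader; whether the demo script uses one is an assumption). The helper name `missing_api_keys` is hypothetical and not part of the project.

```python
import os

# Environment variable names taken from the setup steps above.
REQUIRED_KEYS = {
    "cohere": "COHERE_API_KEY",
    "openai": "OPENAI_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def missing_api_keys(providers, env=None):
    """Return the variable names that are unset or empty for the given providers.

    Hypothetical helper: `env` defaults to os.environ but can be any mapping,
    which makes the check easy to exercise without touching real keys.
    """
    env = os.environ if env is None else env
    return [REQUIRED_KEYS[p] for p in providers if not env.get(REQUIRED_KEYS[p])]


if __name__ == "__main__":
    missing = missing_api_keys(["cohere", "openai", "google"])
    if missing:
        print("Missing keys:", ", ".join(missing))
    else:
        print("All API keys are set.")
```

Only the providers you plan to test need a key, so passing just those provider names keeps the check from flagging keys you do not use.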
Run the demo script with default settings (Cohere models only):
python text_embeddings_demo.py
Test specific providers:
# Test all providers
python text_embeddings_demo.py --providers all
# Test specific providers
python text_embeddings_demo.py --providers cohere openai
# Test specific models
python text_embeddings_demo.py --providers cohere --models embed-v4.0 embed-english-v3.0
# Custom dataset sampling
python text_embeddings_demo.py --sample-size 0.2 --random-seed 42
Command-line options:
- --providers: Choose providers to test (cohere, openai, google, all)
- --models: Specific models to test (if not specified, tests all available models)
- --sample-size: Fraction of dataset to sample (default: 0.1)
- --random-seed: Random seed for reproducibility (default: 30)
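The options above map naturally onto an argparse parser. This is a sketch of how such a parser might look, not the script's actual code: the flag names and the defaults for --sample-size and --random-seed come from the option list, while the function name `build_parser` and the `["cohere"]` default for --providers (inferred from "Cohere models only") are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(description="Text embeddings demo (sketch)")
    parser.add_argument(
        "--providers",
        nargs="+",
        choices=["cohere", "openai", "google", "all"],
        default=["cohere"],  # assumption: the default run uses Cohere only
        help="Providers to test",
    )
    parser.add_argument(
        "--models",
        nargs="+",
        default=None,  # None stands for "all available models"
        help="Specific models to test",
    )
    parser.add_argument("--sample-size", type=float, default=0.1,
                        help="Fraction of the dataset to sample")
    parser.add_argument("--random-seed", type=int, default=30,
                        help="Random seed for reproducibility")
    return parser


# Parse an explicit argument list instead of sys.argv to show the result.
example = build_parser().parse_args(["--providers", "cohere", "openai"])
print(example.providers, example.sample_size, example.random_seed)
```

Note that argparse exposes `--sample-size` as `args.sample_size` and `--random-seed` as `args.random_seed`.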
The script creates organized output in the output/ directory:
output/
├── YYYYMMDD_HHMMSS/ # Timestamped folder for each run
│ ├── pca_analysis_*.png # PCA variance analysis plots
│ ├── heatmap_*.png # Principal components correlation matrices
│ ├── similarity_*.png # Query similarity matrices
│ ├── embeddings_*.png # 2D PCA scatter plots
│ └── model_comparison.png # Performance comparison chart
└── embedding_results_YYYYMMDD_HHMMSS.csv # Detailed results
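The YYYYMMDD_HHMMSS names in the layout above correspond to the strftime pattern `%Y%m%d_%H%M%S`. A minimal sketch of how the per-run folder and CSV path could be derived follows; the function name `make_run_paths` is hypothetical, and the script's own implementation may differ.

```python
from datetime import datetime
from pathlib import Path


def make_run_paths(base_dir="output", now=None):
    """Return (run_dir, csv_path) for one run, following the layout above.

    Hypothetical helper: `now` can be injected so the naming is predictable.
    """
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")
    run_dir = Path(base_dir) / stamp
    csv_path = Path(base_dir) / f"embedding_results_{stamp}.csv"
    return run_dir, csv_path


run_dir, csv_path = make_run_paths()
# Creating the folder is left to the caller, e.g.:
# run_dir.mkdir(parents=True, exist_ok=True)
print(run_dir, csv_path)
```

Because the timestamp has second resolution, two runs started within the same second would share a folder; that is unlikely for a manually launched script.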
- PCA Analysis Plots: Explained variance and cumulative variance plots for each model
- Correlation Heatmaps: Principal components correlation matrices
- Similarity Heatmaps: Query-to-query similarity matrices
- 2D PCA Scatter Plots: Embedding visualizations colored by intent
- Model Comparison Plot: Bar chart comparing all model performances
- Results CSV: Detailed performance metrics with timestamp
- Console Output: Adjusted Rand Index scores and variance information
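The query-to-query similarity matrices listed above can be illustrated with plain cosine similarity. This is a dependency-free sketch for clarity, assuming non-zero embedding vectors of equal length; the script itself most likely relies on a numerical library, and the function name `cosine_similarity_matrix` is hypothetical.

```python
import math


def cosine_similarity_matrix(vectors):
    """Return the matrix of pairwise cosine similarities.

    Assumes every vector is non-zero and all vectors have the same length.
    """
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    matrix = []
    for i, a in enumerate(vectors):
        row = []
        for j, b in enumerate(vectors):
            dot = sum(x * y for x, y in zip(a, b))
            row.append(dot / (norms[i] * norms[j]))
        matrix.append(row)
    return matrix


# Toy 2-dimensional "embeddings" (placeholders, not real model output).
for row in cosine_similarity_matrix([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]):
    print([round(value, 3) for value in row])
```

The diagonal is always 1, and the matrix is symmetric, which is why such heatmaps mirror across the diagonal.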
Here are example outputs from different embedding models:
The script uses the ATIS (Airline Travel Information System) dataset, which contains queries about airline travel with different intents:
- atis_airfare: Questions about ticket prices
- atis_airline: Questions about airlines
- atis_ground_service: Questions about ground transportation

The following embedding models are available:
- Cohere: embed-v4.0, embed-english-v3.0, embed-multilingual-v3.0
- OpenAI: text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002
- Google: gemini-embedding-001, text-embedding-005, text-multilingual-embedding-002
The script evaluates embedding quality using:
- Adjusted Rand Index (ARI): Measures clustering quality compared to true labels
- PCA Analysis: Explained variance and dimensionality reduction analysis
- Correlation Analysis: Principal components correlation matrices
- Similarity Analysis: Query-to-query cosine similarity matrices
- 2D Visualization: PCA projection to visualize embedding space structure
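To make the Adjusted Rand Index concrete, here is a pure-Python sketch that computes it from pair counts. It is meant as an illustration of the metric, not as the script's implementation (which may well call a library routine such as scikit-learn's `adjusted_rand_score`; that is an assumption). ARI is 1 for a perfect match up to relabeling of clusters and close to 0 for a random assignment.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between true labels and predicted cluster ids."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Pairs of points that share both a true class and a predicted cluster.
    sum_pairs = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs that share a true class, and pairs that share a predicted cluster.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. everything in a single cluster
    return (sum_pairs - expected) / (maximum - expected)


# Cluster ids are arbitrary, so a pure relabeling still counts as perfect.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

In this setting, `labels_true` would be the ATIS intents and `labels_pred` the K-means cluster assignments computed on a model's embeddings.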
- API keys are stored in the .env file (not committed to version control)
- The .env file is automatically ignored by git
- Never commit your actual API keys to version control
- The script uses a non-interactive matplotlib backend to avoid display issues
- All plots are automatically saved as PNG files in timestamped directories
- Results are sorted by ARI score (higher is better)
- API failures are handled gracefully, so the run continues after a failed call
- Results are reproducible through a configurable random seed
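The combination of graceful continuation and sorting by ARI noted above can be sketched as a simple loop. This is an illustration under assumptions: `evaluate_models` and the `evaluate` callback are hypothetical names, and the stub scores are placeholders rather than real results.

```python
def evaluate_models(models, evaluate):
    """Score each model, skip the ones that fail, and sort by ARI (best first).

    `evaluate` is a hypothetical callback that returns an ARI score for a
    model name and raises an exception when the provider API call fails.
    """
    results = []
    for name in models:
        try:
            score = evaluate(name)
        except Exception as exc:  # keep going when one model fails
            print(f"Skipping {name}: {exc}")
            continue
        results.append({"model": name, "ari": score})
    return sorted(results, key=lambda r: r["ari"], reverse=True)


def stub_evaluate(name):
    """Stand-in for a real evaluation; the numbers are placeholders."""
    if name == "model-b":
        raise RuntimeError("simulated API failure")
    return {"model-a": 0.2, "model-c": 0.9}[name]


print(evaluate_models(["model-a", "model-b", "model-c"], stub_evaluate))
```

Catching a broad `Exception` keeps the sketch short; real code would usually catch the provider SDK's specific error types so that programming mistakes are not silently skipped.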