# Vocabulous Demo: Bootstrapping Language Detection

This notebook demonstrates how to use Vocabulous to build language detection models from noisy training data.

## Overview

Vocabulous is a bootstrapping language detection system that:
- Builds dictionaries from potentially mislabeled training data
- Iteratively cleans the data to improve model quality
- Provides fast, interpretable language detection

Let's explore its capabilities!
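
To make the bootstrapping idea above concrete, here is a minimal, self-contained sketch of dictionary-based detection with iterative cleaning. It is not Vocabulous's actual implementation: the function names, the vote-share scoring, and the `min_confidence` threshold are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

def build_dictionary(samples):
    """Map each word to a Counter of the language labels it appeared under."""
    word_langs = defaultdict(Counter)
    for sample in samples:
        for word in sample["text"].lower().split():
            word_langs[word][sample["lang"]] += 1
    return word_langs

def score_sentence(word_langs, text):
    """Return {lang: share of votes} over the words found in the dictionary."""
    votes = Counter()
    for word in text.lower().split():
        votes.update(word_langs.get(word, {}))
    total = sum(votes.values())
    return {lang: count / total for lang, count in votes.items()} if total else {}

def bootstrap(samples, cycles=2, min_confidence=0.6):
    """Each cycle, rebuild the dictionary and drop samples whose own label scores too low.

    `min_confidence` is a hypothetical parameter, not a Vocabulous setting.
    """
    kept = list(samples)
    for _ in range(cycles):
        word_langs = build_dictionary(kept)
        kept = [
            sample for sample in kept
            if score_sentence(word_langs, sample["text"]).get(sample["lang"], 0.0) >= min_confidence
        ]
    return build_dictionary(kept), kept

# Toy data: the last sample is English text mislabeled as French
toy_samples = [
    {"text": "the cat sat", "lang": "en"},
    {"text": "the dog sat", "lang": "en"},
    {"text": "the cat ran", "lang": "en"},
    {"text": "le chat dort", "lang": "fr"},
    {"text": "le chien dort", "lang": "fr"},
    {"text": "le chat court", "lang": "fr"},
    {"text": "the dog ran", "lang": "fr"},
]

toy_dictionary, toy_kept = bootstrap(toy_samples)
print(f"Kept {len(toy_kept)} of {len(toy_samples)} samples")
print(score_sentence(toy_dictionary, "the dog ran"))
```

In this toy run the mislabeled sample is dropped in the first cycle because most of its words appear more often under English, so the rebuilt dictionary no longer associates them with French.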


In [1]:
# Install required packages if running in Colab
# !pip install vocabulous matplotlib seaborn pandas


In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from vocabulous import Vocabulous
import warnings
warnings.filterwarnings('ignore')

# Set style for better plots
plt.style.use('default')
sns.set_palette("husl")



## 1. Creating Sample Data

Let's create a small multilingual dataset (10 sentences each in English, French, and Spanish) and then add some label noise to demonstrate Vocabulous's capabilities.


In [None]:
# Create sample training data for English, French, and Spanish
clean_training_data = [
    # English samples
    {'text': 'Hello world how are you today', 'lang': 'en'},
    {'text': 'Good morning everyone', 'lang': 'en'},
    {'text': 'The weather is nice today', 'lang': 'en'},
    {'text': 'I love programming in Python', 'lang': 'en'},
    {'text': 'Machine learning is fascinating', 'lang': 'en'},
    {'text': 'Natural language processing rocks', 'lang': 'en'},
    {'text': 'Open source software is amazing', 'lang': 'en'},
    {'text': 'Data science helps solve problems', 'lang': 'en'},
    {'text': 'Artificial intelligence is the future', 'lang': 'en'},
    {'text': 'Technology makes life easier', 'lang': 'en'},
    
    # French samples
    {'text': 'Bonjour tout le monde', 'lang': 'fr'},
    {'text': 'Comment allez vous aujourd hui', 'lang': 'fr'},
    {'text': 'Le temps est magnifique', 'lang': 'fr'},
    {'text': 'J aime programmer en Python', 'lang': 'fr'},
    {'text': 'L apprentissage automatique est fascinant', 'lang': 'fr'},
    {'text': 'Le traitement du langage naturel', 'lang': 'fr'},
    {'text': 'Les logiciels libres sont formidables', 'lang': 'fr'},
    {'text': 'La science des données résout les problèmes', 'lang': 'fr'},
    {'text': 'L intelligence artificielle est l avenir', 'lang': 'fr'},
    {'text': 'La technologie facilite la vie', 'lang': 'fr'},
    
    # Spanish samples
    {'text': 'Hola mundo cómo están ustedes', 'lang': 'es'},
    {'text': 'Buenos días a todos', 'lang': 'es'},
    {'text': 'El clima está hermoso hoy', 'lang': 'es'},
    {'text': 'Me encanta programar en Python', 'lang': 'es'},
    {'text': 'El aprendizaje automático es fascinante', 'lang': 'es'},
    {'text': 'El procesamiento de lenguaje natural', 'lang': 'es'},
    {'text': 'El software libre es increíble', 'lang': 'es'},
    {'text': 'La ciencia de datos resuelve problemas', 'lang': 'es'},
    {'text': 'La inteligencia artificial es el futuro', 'lang': 'es'},
    {'text': 'La tecnología hace la vida más fácil', 'lang': 'es'},
]

print(f"Created {len(clean_training_data)} clean training samples")
pd.DataFrame(clean_training_data).groupby('lang').size()


In [None]:
# Add some label noise to simulate real-world conditions
import random
random.seed(42)

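# Copy each sample dict; list.copy() is shallow, so relabeling the noisy
# samples would otherwise also change clean_training_data
noisy_training_data = [dict(sample) for sample in clean_training_data]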
languages = ['en', 'fr', 'es']

# Introduce 15% label noise
noise_rate = 0.15
num_noisy_samples = int(len(noisy_training_data) * noise_rate)

for i in random.sample(range(len(noisy_training_data)), num_noisy_samples):
    original_lang = noisy_training_data[i]['lang']
    # Assign a random wrong language
    wrong_langs = [lang for lang in languages if lang != original_lang]
    noisy_training_data[i]['lang'] = random.choice(wrong_langs)

print(f"Introduced noise in {num_noisy_samples} of {len(noisy_training_data)} samples (target noise rate: {noise_rate:.0%})")
print("\nNoisy data distribution:")
pd.DataFrame(noisy_training_data).groupby('lang').size()
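
The noise-injection step can also be packaged as a reusable function. The sketch below is one possible shape for it; the name `inject_label_noise` and its return value are illustrative assumptions, not part of Vocabulous. It copies every sample dict so the input list stays unmodified, and it returns the flipped indices so the noise can be audited later.

```python
import random

def inject_label_noise(samples, languages, noise_rate, seed=0):
    """Return a relabeled copy of `samples` plus the sorted indices that were flipped."""
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    noisy = [dict(sample) for sample in samples]  # per-sample copies keep the input intact
    num_flips = int(len(noisy) * noise_rate)
    flipped = sorted(rng.sample(range(len(noisy)), num_flips))
    for i in flipped:
        wrong_langs = [lang for lang in languages if lang != noisy[i]["lang"]]
        noisy[i]["lang"] = rng.choice(wrong_langs)
    return noisy, flipped

# Placeholder data with the same shape as the training samples (30 items, 3 languages)
demo_samples = [
    {"text": f"sample {i}", "lang": lang}
    for i, lang in enumerate(["en", "fr", "es"] * 10)
]
noisy_demo, flipped = inject_label_noise(demo_samples, ["en", "fr", "es"], noise_rate=0.15, seed=42)

# Audit: the labels that differ should be exactly the flipped indices
changed = [i for i, (a, b) in enumerate(zip(demo_samples, noisy_demo)) if a["lang"] != b["lang"]]
print(f"Flipped {len(flipped)} labels; audit matches: {changed == flipped}")
```

Note that `int(30 * 0.15)` truncates to 4, so the effective noise rate on 30 samples is about 13% rather than exactly 15%.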


## 2. Training Vocabulous Models

Let's train a model on the noisy data, using a small held-out evaluation set, to see how Vocabulous handles the bootstrapping process.


In [None]:
# Create evaluation data
eval_data = [
    {'text': 'Hello there my friend', 'lang': 'en'},
    {'text': 'Programming is fun and exciting', 'lang': 'en'},
    {'text': 'Bonjour mes amis', 'lang': 'fr'},
    {'text': 'La programmation est amusante', 'lang': 'fr'},
    {'text': 'Hola mis amigos', 'lang': 'es'},
    {'text': 'La programación es divertida', 'lang': 'es'},
]

# Train on noisy data with bootstrapping
print("Training Vocabulous on noisy data...")
model = Vocabulous()
model, report = model.train(
    train_data=noisy_training_data,
    eval_data=eval_data,
    cycles=3,
    base_confidence=0.4,
    confidence_margin=0.3
)

print("\nTraining completed!")
print(f"Number of cycles: {report['cycles']}")
print(f"Dictionary size: {report['dictionary_size']} words")

# Show improvement across cycles
print("\nProgress across training cycles:")
for i, cycle_report in enumerate(report['cycle_reports']):
    print(f"Cycle {i+1}: Accuracy={cycle_report['accuracy']:.3f}, "
          f"F1={cycle_report['f1']:.3f}, "
          f"Removed={cycle_report['removed_samples']} samples")
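
Once per-cycle metrics are available, it can be handy to pick out the best cycle programmatically. The helper below is a hypothetical sketch that relies only on the report keys used in the cell above (`cycle_reports`, `accuracy`, `f1`, `removed_samples`); the numbers in `example_report` are placeholders for illustration, not results from a real run.

```python
def best_cycle(report, metric="f1"):
    """Return (1-based cycle number, cycle report) with the highest value of `metric`."""
    cycle_reports = report["cycle_reports"]
    if not cycle_reports:
        raise ValueError("report contains no cycle reports")
    # On ties, max() keeps the earliest cycle
    index = max(range(len(cycle_reports)), key=lambda i: cycle_reports[i][metric])
    return index + 1, cycle_reports[index]

# Placeholder values only; substitute the `report` returned by training
example_report = {
    "cycle_reports": [
        {"accuracy": 0.50, "f1": 0.45, "removed_samples": 0},
        {"accuracy": 0.67, "f1": 0.60, "removed_samples": 3},
        {"accuracy": 0.67, "f1": 0.58, "removed_samples": 1},
    ]
}

cycle_number, cycle_report = best_cycle(example_report)
print(f"Best cycle by F1: {cycle_number} (removed {cycle_report['removed_samples']} samples)")
```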


## 3. Testing Language Detection

Now let's test our trained model on some new sentences to see how well it performs.


In [None]:
# Test sentences in different languages
test_sentences = [
    "Hello how are you doing today",
    "Machine learning algorithms are powerful", 
    "Bonjour comment ça va aujourd hui",
    "Les algorithmes d apprentissage automatique",
    "Hola cómo estás hoy",
    "Los algoritmos de aprendizaje automático"
]

expected_langs = ['en', 'en', 'fr', 'fr', 'es', 'es']

print("Language Detection Results:")
print("=" * 60)

for sentence, expected in zip(test_sentences, expected_langs):
    # Get per-language scores. Note that _score_sentence is a private method
    # (leading underscore), so its name and return value may change between versions
    scores = model._score_sentence(sentence)
    
    # Get the top prediction and its score in one pass
    if scores:
        predicted, confidence = max(scores.items(), key=lambda item: item[1])
    else:
        predicted = 'unknown'
        confidence = 0.0
    
    correct = "✓" if predicted == expected else "✗"
    
    
    print(f"\nText: '{sentence}'")
    print(f"Expected: {expected} | Predicted: {predicted} {correct}")
    print(f"Confidence: {confidence:.3f}")
    print(f"All scores: {scores}")
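
Beyond eyeballing individual sentences, the predictions can be summarized as an overall accuracy plus a count of each (expected, predicted) pair. The helper below is a standalone sketch; `summarize_predictions` is a hypothetical name, and the predictions in the example are made up for illustration rather than taken from a Vocabulous run.

```python
from collections import Counter

def summarize_predictions(expected, predicted):
    """Return overall accuracy and a Counter of (expected, predicted) label pairs."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    confusion = Counter(zip(expected, predicted))
    correct = sum(n for (exp, pred), n in confusion.items() if exp == pred)
    accuracy = correct / len(expected) if expected else 0.0
    return accuracy, confusion

# Hypothetical predictions, for illustration only
expected_demo = ["en", "en", "fr", "fr", "es", "es"]
predicted_demo = ["en", "en", "fr", "es", "es", "unknown"]

accuracy, confusion = summarize_predictions(expected_demo, predicted_demo)
print(f"Accuracy: {accuracy:.3f}")
for (exp, pred), n in sorted(confusion.items()):
    print(f"expected={exp} predicted={pred} count={n}")
```

The off-diagonal pairs (where expected and predicted differ) show which languages get confused with each other, which is often more informative than a single accuracy number.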


## Key Takeaways

This demo showcased the core capabilities of Vocabulous:

### ✅ **Bootstrapping Success**
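- Started with a target of 15% label noise in the training data (4 of 30 labels flipped)
- Iteratively cleaned the data by removing low-confidence samples in each training cycle
- Tracked accuracy, F1, and removed-sample counts per cycle to check whether cleaning helped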

### ✅ **Interpretable Results**
- Dictionary-based approach provides clear word-language associations
- Fast inference without neural network complexity
- Easy to understand and debug

### ✅ **Practical Applications**
- Language detection from noisy datasets
- Data cleaning and preprocessing
- Bootstrap training for other models

### 🎯 **When to Use Vocabulous**

**Well suited for:**
- Noisy multilingual datasets
- Fast language detection requirements
- Interpretable model requirements
- Data cleaning pipelines

Try experimenting with different parameters and datasets to see how Vocabulous can help with your language detection needs!

For more advanced features and examples, check out the full documentation at: https://github.com/omar/vocabulous
