# 🎯 Speech Recognition Breaking Points

## Goal
Find out what makes speech recognition fail by testing various challenging conditions.

## Overview
We'll test speech recognition accuracy under different conditions:
- Normal speaking (baseline)
- Fast talking
- Background noise

By the end, you'll understand which conditions cause the most errors and why!

## 📦 Setup and Installation

In [None]:
# Install required packages (python-dotenv provides the `dotenv` import used below)
!pip install sounddevice numpy scipy openai matplotlib pandas seaborn librosa soundfile python-dotenv

In [2]:
import sounddevice as sd
import numpy as np
import scipy.io.wavfile as wavfile
from openai import OpenAI
import io
import os
import json
from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Audio, display, Markdown
import time
import librosa
import librosa.display
import soundfile as sf
from dotenv import load_dotenv

# The "../" tells Python to look one folder up (in the WEEK 1 root)
load_dotenv("../.env")

# Retrieve the key
api_key = os.getenv("OPENAI_API_KEY")

if api_key:
    print("✅ API Key successfully loaded!")
else:
    print("❌ API Key not found. Check your .env file location.")

# Set up OpenAI client
client = OpenAI(api_key=api_key)

# Create directories for storing recordings and results
os.makedirs('recordings', exist_ok=True)
os.makedirs('results', exist_ok=True)



✅ API Key successfully loaded!


## 📝 Step 1: Select Test Sentences

We'll use 5 carefully chosen sentences that test different aspects of speech recognition:

In [4]:
# Test sentences with different challenges
test_sentences = [
    "The quick brown fox jumps over the lazy dog",  # Classic pangram
    "She sells seashells by the seashore",  # Tongue twister with similar sounds
    "The weather whether we like it or not affects our mood",  # Homophones
    "I scream you scream we all scream for ice cream",  # Fast repetitive sounds
    "Dr Smith's laboratory analyzed 1234 samples at 3:45 PM",  # Numbers, abbreviations, punctuation
]

# Display the sentences nicely
display(Markdown("### Test Sentences:"))
for i, sentence in enumerate(test_sentences, 1):
    print(f"{i}. {sentence}")

### Test Sentences:

1. The quick brown fox jumps over the lazy dog
2. She sells seashells by the seashore
3. The weather whether we like it or not affects our mood
4. I scream you scream we all scream for ice cream
5. Dr Smith's laboratory analyzed 1234 samples at 3:45 PM


## 🎤 Step 2: Recording Functions

Let's create functions to record audio under different conditions:

In [5]:
def record_audio(duration=5, sample_rate=16000, condition="normal"):
    """
    Record audio with visual countdown
    """
    print(f"\n🎤 Recording Mode: {condition.upper()}")
    print("Get ready...")
    time.sleep(2)
    
    # Countdown
    for i in range(3, 0, -1):
        print(f"Starting in {i}...")
        time.sleep(1)
    
    print(f"🔴 RECORDING! Speak now for {duration} seconds!")
    
    # Record audio
    audio = sd.rec(int(duration * sample_rate), 
                   samplerate=sample_rate, 
                   channels=1, 
                   dtype='float32')
    sd.wait()
    
    print("✅ Recording complete!")
    return audio.flatten(), sample_rate


def add_background_noise(audio, noise_level=0.05):
    """
    Add synthetic background noise to audio
    """
    noise = np.random.normal(0, noise_level, audio.shape)
    return audio + noise


def save_recording(audio, sample_rate, filename):
    """
    Save audio to a 16-bit PCM WAV file
    """
    # Clip to [-1, 1] so the int16 conversion cannot overflow
    # (peaks beyond that range are clamped, not rescaled)
    audio = np.clip(audio, -1, 1)
    audio_int16 = (audio * 32767).astype(np.int16)
    wavfile.write(filename, sample_rate, audio_int16)
    return filename


def visualize_waveform(audio, sample_rate, title="Audio Waveform"):
    """
    Display the waveform of recorded audio
    """
    plt.figure(figsize=(12, 3))
    # Sample k is taken at time k / sample_rate
    time_axis = np.arange(len(audio)) / sample_rate
    plt.plot(time_axis, audio)
    plt.title(title)
    plt.xlabel('Time (seconds)')
    plt.ylabel('Amplitude')
    plt.grid(True, alpha=0.3)
    plt.show()

print("✅ Recording functions ready!")

✅ Recording functions ready!
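
`add_background_noise` takes a fixed `noise_level`, so the same setting is harsher on a quiet recording than on a loud one. One possible alternative is to scale the noise to a target signal-to-noise ratio (SNR). The sketch below is only an illustration under that assumption: `add_noise_at_snr` and its `snr_db` parameter are hypothetical names that do not appear elsewhere in this notebook.

```python
import numpy as np

def add_noise_at_snr(audio, snr_db, rng=None):
    """Add white Gaussian noise scaled to a target SNR in dB (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    audio = np.asarray(audio, dtype=np.float64)
    signal_power = np.mean(audio ** 2)
    if signal_power == 0:
        # Silent input: SNR is undefined, so return the audio unchanged
        return audio.astype(np.float32)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), audio.shape)
    return (audio + noise).astype(np.float32)

# Example with a synthetic tone standing in for a recording
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000
tone = 0.1 * np.sin(2 * np.pi * 220 * t)
noisy = add_noise_at_snr(tone, snr_db=10, rng=rng)
measured = 10 * np.log10(np.mean(tone ** 2) / np.mean((noisy - tone) ** 2))
print(f"Measured SNR: {measured:.1f} dB")
```

Signal power here is averaged over the whole clip, including any silence before and after the sentence, so the SNR during the spoken part will be somewhat higher than the requested value.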


## 🎬 Step 3: Record Test Sentences

### Recording Instructions:

#### 🎯 Normal Recording
- Quiet room
- Clear, moderate pace
- Natural speaking voice

#### ⚡ Fast Talking
- Speak as quickly as you can
- Still try to be clear
- Like you're in a rush!

#### 🔊 Background Noise
- Play music or a TV in the background
- Or record near a window with traffic
- Or have someone talk nearby
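
To check that a "fast" take really was faster than the baseline, one option is to estimate words per second over the part of the clip that contains speech. The sketch below is a rough illustration only: `active_duration` and `words_per_second` are hypothetical helpers, and the 25 ms frame size and 10% energy threshold are arbitrary assumptions, not tuned values.

```python
import numpy as np

def active_duration(audio, sample_rate, frame_ms=25, threshold_ratio=0.1):
    """Estimate seconds of audio whose frame RMS exceeds a fraction of the loudest frame."""
    audio = np.asarray(audio, dtype=np.float64)
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    if n_frames == 0:
        return 0.0
    # Split into non-overlapping frames and compute the RMS energy of each
    frames = audio[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    if rms.max() == 0:
        return 0.0
    active = rms > threshold_ratio * rms.max()
    return float(active.sum() * frame_len / sample_rate)

def words_per_second(text, audio, sample_rate):
    """Words in the reference sentence divided by the estimated speech duration."""
    seconds = active_duration(audio, sample_rate)
    return len(text.split()) / seconds if seconds > 0 else float("nan")

# Example: 1 s silence, 2 s tone (standing in for speech), 2 s silence
sr = 16000
tone = 0.5 * np.sin(2 * np.pi * 200 * np.arange(2 * sr) / sr)
clip = np.concatenate([np.zeros(sr), tone, np.zeros(2 * sr)])
print(f"Active duration: {active_duration(clip, sr):.2f} s")
print(f"Rate: {words_per_second('one two three four', clip, sr):.2f} words/s")
```

An energy threshold like this cannot tell speech from loud background noise, so the estimate is most useful for comparing the normal and fast takes.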

In [None]:
# Initialize storage for all recordings
all_recordings = []
recording_metadata = []

conditions = ['normal', 'fast', 'noisy']
sample_rate = 16000
duration = 5  # seconds per recording

# Instructions for each condition
condition_instructions = {
    'normal': "📍 Speak clearly at a normal pace in a quiet environment",
    'fast': "⚡ Speak as fast as you can while still being clear",
    'noisy': "🔊 Speak normally but with background noise (TV, music, or talking)"
}

# Record each sentence under each condition
for sentence_idx, sentence in enumerate(test_sentences):
    display(Markdown(f"\n### 📝 Sentence {sentence_idx + 1}: *\"{sentence}\"*"))
    
    for condition in conditions:
        display(Markdown(f"\n**Condition: {condition.upper()}**"))
        print(condition_instructions[condition])
        
        # Wait for user to be ready
        input(f"\nPress ENTER when ready to record ({condition} version)...")
        
        # Record audio
        audio, sr = record_audio(duration, sample_rate, condition)
        
        # Add synthetic noise if needed (optional - for testing without real noise)
        if condition == 'noisy' and input("Add synthetic noise? (y/n): ").lower() == 'y':
            audio = add_background_noise(audio, noise_level=0.08)
        
        # Save recording
        filename = f"recordings/sentence{sentence_idx+1}_{condition}.wav"
        save_recording(audio, sr, filename)
        
        # Store metadata
        recording_metadata.append({
            'sentence_id': sentence_idx + 1,
            'original_text': sentence,
            'condition': condition,
            'filename': filename,
            'timestamp': datetime.now().isoformat()
        })
        
        # Display waveform
        visualize_waveform(audio, sr, f"Sentence {sentence_idx+1} - {condition.upper()}")
        
        # Let user listen to their recording
        display(Audio(audio, rate=sr))
        
        print("✅ Saved!\n")

print(f"\n🎉 All {len(recording_metadata)} recordings complete!")


### 📝 Sentence 1: *"The quick brown fox jumps over the lazy dog"*


**Condition: NORMAL**

📍 Speak clearly at a normal pace in a quiet environment


## 🤖 Step 4: Run Speech Recognition

Now let's transcribe all recordings using the Whisper API:

In [None]:
def transcribe_audio(filename, client):
    """
    Transcribe audio file using OpenAI Whisper API
    """
    try:
        with open(filename, 'rb') as audio_file:
            transcript = client.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file,
                response_format="text"
            )
        return transcript.strip()
    except Exception as e:
        return f"Error: {str(e)}"


# Transcribe all recordings
print("🤖 Starting transcription...\n")
results = []

for metadata in recording_metadata:
    print(f"Transcribing: {metadata['filename']}...")
    
    # Get transcription
    transcription = transcribe_audio(metadata['filename'], client)
    
    # Calculate basic word-level accuracy: lowercase and strip surrounding
    # punctuation so that "dog." in the transcript still matches "dog"
    original_words = [w.strip('.,!?;:"') for w in metadata['original_text'].lower().split()]
    transcribed_words = [w.strip('.,!?;:"') for w in transcription.lower().split()]
    
    # Position-by-position comparison: a single inserted or dropped word
    # shifts every later position, so this is a rough lower bound
    correct_words = 0
    for i, word in enumerate(original_words):
        if i < len(transcribed_words) and word == transcribed_words[i]:
            correct_words += 1
    
    accuracy = (correct_words / len(original_words)) * 100 if original_words else 0
    
    # Store results
    result = {
        'sentence_id': metadata['sentence_id'],
        'condition': metadata['condition'],
        'original': metadata['original_text'],
        'transcription': transcription,
        'word_accuracy': round(accuracy, 2),
        'exact_match': original_words == transcribed_words
    }
    results.append(result)
    
    print(f"✅ Done! Accuracy: {accuracy:.1f}%\n")

print("\n🎉 All transcriptions complete!")
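
The accuracy loop compares words position by position, so one inserted or dropped word makes every later word count as wrong. A common alternative is word error rate (WER): the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below is a minimal illustration, not the notebook's scoring method; `normalize_text` and `word_error_rate` are hypothetical helpers, and deleting punctuation outright is an assumption that suits these test sentences (it turns "1,234" into "1234") but would also merge hyphenated words.

```python
import re

def normalize_text(text):
    """Lowercase, delete punctuation except apostrophes, and split into words."""
    text = re.sub(r"[^\w\s']", "", text.lower())
    return text.split()

def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = normalize_text(reference)
    hyp = normalize_text(hypothesis)
    if not ref:
        return 0.0 if not hyp else float("inf")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# One inserted word costs a single error instead of shifting the whole sentence
print(word_error_rate("the quick brown fox", "The very quick brown fox."))
```

WER can exceed 1.0 when the transcript contains many extra words, so it is an error rate rather than an accuracy percentage.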

## 📊 Step 5: Analyze Results

Let's compare the transcriptions and find patterns:

In [None]:
# Create a DataFrame for easier analysis
df_results = pd.DataFrame(results)

# Display results for each sentence
for sentence_id in sorted(df_results['sentence_id'].unique()):
    sentence_results = df_results[df_results['sentence_id'] == sentence_id]
    original = sentence_results.iloc[0]['original']
    
    display(Markdown(f"\n### Sentence {sentence_id}: *\"{original}\"*"))
    
    for _, row in sentence_results.iterrows():
        status = "✅" if row['exact_match'] else "⚠️"
        print(f"\n{status} {row['condition'].upper()}:")
        print(f"   Transcription: \"{row['transcription']}\"")
        print(f"   Word Accuracy: {row['word_accuracy']}%")
        
        # Highlight differences
        if not row['exact_match']:
            # Strip surrounding punctuation so "dog." is not reported as an error
            original_words = [w.strip('.,!?;:"') for w in original.lower().split()]
            transcribed_words = [w.strip('.,!?;:"') for w in row['transcription'].lower().split()]
            
            # Find mismatched words (compared position by position)
            mismatches = []
            for i, orig_word in enumerate(original_words):
                if i >= len(transcribed_words):
                    mismatches.append(f"Missing: '{orig_word}'")
                elif orig_word != transcribed_words[i]:
                    mismatches.append(f"'{orig_word}' → '{transcribed_words[i]}'")
            
            if mismatches:
                print(f"   Errors: {', '.join(mismatches[:3])}")  # Show first 3 errors
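
Because the mismatch list is built by position, a word the recognizer splits in two (or drops) makes every following word look like an error. One way to get a more readable error list is to align the two word sequences first. The sketch below uses `difflib.SequenceMatcher` from the standard library for that; `word_diff` is a hypothetical helper, and the labels "substituted", "missing", and "extra" are naming assumptions.

```python
from difflib import SequenceMatcher

def word_diff(reference, hypothesis):
    """Return (kind, reference_words, transcribed_words) for each aligned difference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            differences.append(("substituted", " ".join(ref[i1:i2]), " ".join(hyp[j1:j2])))
        elif tag == "delete":
            differences.append(("missing", " ".join(ref[i1:i2]), ""))
        elif tag == "insert":
            differences.append(("extra", "", " ".join(hyp[j1:j2])))
        # "equal" spans are matches and are skipped
    return differences

# A split word is reported once; the words after it still line up
for kind, ref_words, hyp_words in word_diff(
        "she sells seashells by the seashore",
        "she sells sea shells by the seashore"):
    print(f"{kind}: '{ref_words}' -> '{hyp_words}'")
```

`SequenceMatcher` looks for long matching runs rather than a guaranteed minimal edit script, so treat its output as a readable summary rather than an exact error count.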

In [None]:
# Statistical Analysis
display(Markdown("## 📈 Statistical Summary"))

# Calculate average accuracy by condition
condition_stats = df_results.groupby('condition').agg({
    'word_accuracy': ['mean', 'std', 'min', 'max'],
    'exact_match': 'mean'
}).round(2)

print("\nAccuracy by Condition:")
print(condition_stats)

# Calculate average accuracy by sentence
sentence_stats = df_results.groupby('sentence_id').agg({
    'word_accuracy': 'mean',
    'exact_match': 'mean'
}).round(2)

print("\nAccuracy by Sentence:")
for idx, row in sentence_stats.iterrows():
    print(f"Sentence {idx}: {row['word_accuracy']}% accuracy")

In [None]:
# Create visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

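# 1. Accuracy by Condition
# groupby sorts labels alphabetically (fast, noisy, normal), so reindex to the
# recording order to keep each bar matched with its intended color
condition_means = df_results.groupby('condition')['word_accuracy'].mean().reindex(conditions)
ax1 = axes[0, 0]
bars1 = ax1.bar(condition_means.index, condition_means.values, 
                color=['green', 'orange', 'red'])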
ax1.set_title('Average Accuracy by Condition', fontsize=14, fontweight='bold')
ax1.set_xlabel('Condition')
ax1.set_ylabel('Word Accuracy (%)')
ax1.set_ylim(0, 100)
for bar, val in zip(bars1, condition_means.values):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
             f'{val:.1f}%', ha='center', va='bottom')

# 2. Accuracy by Sentence
sentence_means = df_results.groupby('sentence_id')['word_accuracy'].mean()
ax2 = axes[0, 1]
bars2 = ax2.bar(sentence_means.index, sentence_means.values, color='steelblue')
ax2.set_title('Average Accuracy by Sentence', fontsize=14, fontweight='bold')
ax2.set_xlabel('Sentence ID')
ax2.set_ylabel('Word Accuracy (%)')
ax2.set_ylim(0, 100)
for bar, val in zip(bars2, sentence_means.values):
    ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
             f'{val:.1f}%', ha='center', va='bottom')

# 3. Heatmap of Accuracy (columns in the same order as the other plots)
pivot_table = df_results.pivot(index='sentence_id', 
                               columns='condition', 
                               values='word_accuracy').reindex(columns=conditions)
ax3 = axes[1, 0]
sns.heatmap(pivot_table, annot=True, fmt='.1f', cmap='RdYlGn', 
            vmin=0, vmax=100, ax=ax3, cbar_kws={'label': 'Accuracy (%)'})
ax3.set_title('Accuracy Heatmap: Sentence vs Condition', fontsize=14, fontweight='bold')
ax3.set_xlabel('Condition')
ax3.set_ylabel('Sentence ID')

# 4. Accuracy Distribution
ax4 = axes[1, 1]
conditions_data = [df_results[df_results['condition'] == c]['word_accuracy'].values 
                   for c in conditions]
bp = ax4.boxplot(conditions_data, labels=conditions, patch_artist=True)
colors = ['lightgreen', 'lightyellow', 'lightcoral']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
ax4.set_title('Accuracy Distribution by Condition', fontsize=14, fontweight='bold')
ax4.set_xlabel('Condition')
ax4.set_ylabel('Word Accuracy (%)')
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('results/accuracy_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n📊 Visualization saved to 'results/accuracy_analysis.png'")

## 🔍 Step 6: Key Findings & Patterns

The cell below pulls the main patterns out of your results:

In [None]:
# Automated pattern detection
display(Markdown("### 🎯 Automated Pattern Analysis"))

# 1. Which condition caused the most errors?
worst_condition = condition_means.idxmin()
best_condition = condition_means.idxmax()

print(f"\n📉 **Worst Condition:** {worst_condition.upper()}")
print(f"   Average accuracy: {condition_means[worst_condition]:.1f}%")
print(f"\n📈 **Best Condition:** {best_condition.upper()}")
print(f"   Average accuracy: {condition_means[best_condition]:.1f}%")

# 2. Which sentence was hardest to recognize?
hardest_sentence = sentence_means.idxmin()
easiest_sentence = sentence_means.idxmax()

print(f"\n🔴 **Hardest Sentence:** #{hardest_sentence}")
print(f"   \"{test_sentences[hardest_sentence-1]}\"")
print(f"   Average accuracy: {sentence_means[hardest_sentence]:.1f}%")

print(f"\n🟢 **Easiest Sentence:** #{easiest_sentence}")
print(f"   \"{test_sentences[easiest_sentence-1]}\"")
print(f"   Average accuracy: {sentence_means[easiest_sentence]:.1f}%")

# 3. Perfect transcriptions
perfect_count = df_results['exact_match'].sum()
total_count = len(df_results)
print(f"\n✨ **Perfect Transcriptions:** {perfect_count}/{total_count} ({perfect_count/total_count*100:.1f}%)")

# 4. Accuracy drop from normal to challenging conditions (in percentage points)
normal_acc = condition_means['normal']
fast_acc = condition_means['fast']
noisy_acc = condition_means['noisy']

print(f"\n📊 **Impact of Conditions:**")
print(f"   Fast talking reduced accuracy by: {normal_acc - fast_acc:.1f} percentage points")
print(f"   Background noise reduced accuracy by: {normal_acc - noisy_acc:.1f} percentage points")
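
The averages above can hide which sentences drive a drop. Since every sentence is recorded under every condition, one option is to look at the per-sentence difference from the normal baseline. The sketch below works on a list of dicts shaped like `results`; `paired_drops` is a hypothetical helper, and the numbers in `example` are made up purely to show the output format, not measured results.

```python
from collections import defaultdict
from statistics import mean

def paired_drops(results, baseline="normal", metric="word_accuracy"):
    """Per-sentence drop (baseline minus condition), in percentage points."""
    by_sentence = defaultdict(dict)
    for row in results:
        by_sentence[row["sentence_id"]][row["condition"]] = row[metric]

    drops = defaultdict(dict)
    for sentence_id, scores in by_sentence.items():
        if baseline not in scores:
            continue  # no baseline recording for this sentence
        for condition, value in scores.items():
            if condition != baseline:
                drops[condition][sentence_id] = scores[baseline] - value
    return {condition: dict(per_sentence) for condition, per_sentence in drops.items()}

# Made-up numbers for illustration only
example = [
    {"sentence_id": 1, "condition": "normal", "word_accuracy": 100.0},
    {"sentence_id": 1, "condition": "fast", "word_accuracy": 80.0},
    {"sentence_id": 1, "condition": "noisy", "word_accuracy": 90.0},
    {"sentence_id": 2, "condition": "normal", "word_accuracy": 90.0},
    {"sentence_id": 2, "condition": "fast", "word_accuracy": 90.0},
    {"sentence_id": 2, "condition": "noisy", "word_accuracy": 50.0},
]
for condition, per_sentence in paired_drops(example).items():
    print(condition, per_sentence, f"mean drop: {mean(per_sentence.values()):.1f} points")
```

With only five sentences per condition, these differences describe this one recording session and should not be read as general statistics.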

## 💡 Conclusions & Insights

### Write Your Findings Here:

Based on your experiment, answer these questions:

1. **What patterns did you notice?**
   - Which types of words were most often misheard?
   - Were certain sounds consistently problematic?

2. **Why do you think certain conditions caused more errors?**
   - Think about how noise masks certain frequencies
   - Consider how fast speech affects word boundaries

3. **What surprised you about the results?**

4. **How could speech recognition be improved?**
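
To explore the question about noise masking certain frequencies, one approach is to compare signal and noise energy band by band. This only works when the noise is known separately, for example if you keep the array generated for the synthetic-noise option. The sketch below is an illustration under that assumption: `band_snr_db` is a hypothetical helper, the band edges are arbitrary, and a two-tone signal stands in for speech, which typically has more energy at low frequencies.

```python
import numpy as np

def band_snr_db(clean, noise, sample_rate, band_edges_hz):
    """SNR in dB per frequency band, for a clean signal and noise of equal length."""
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    freqs = np.fft.rfftfreq(len(clean), d=1.0 / sample_rate)
    clean_power = np.abs(np.fft.rfft(clean)) ** 2
    noise_power = np.abs(np.fft.rfft(noise)) ** 2

    result = {}
    for low, high in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        in_band = (freqs >= low) & (freqs < high)
        s = clean_power[in_band].sum()
        n = noise_power[in_band].sum()
        result[(low, high)] = 10 * np.log10(s / n) if s > 0 and n > 0 else float("nan")
    return result

# Stand-in signal: a strong 300 Hz component and a weak 3000 Hz component
sr = 16000
t = np.arange(sr) / sr
clean = 0.3 * np.sin(2 * np.pi * 300 * t) + 0.03 * np.sin(2 * np.pi * 3000 * t)
noise = np.random.default_rng(0).normal(0.0, 0.05, sr)

for band, snr in band_snr_db(clean, noise, sr, [0, 1000, 2000, 4000]).items():
    print(f"{band[0]}-{band[1]} Hz: {snr:.1f} dB")
```

White noise spreads its energy evenly across frequencies, so bands where the signal is weak end up with the lowest SNR. That is one plausible reason quieter, higher-frequency consonant sounds are easier to mask than vowels.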