# Sign Language, Vision & AAC

**Author:** Luke Steuber  
**Date:** February 2026

This notebook explores three accessibility datasets:
1. **WLASL** - Word-Level American Sign Language video index (21,083 videos, 2,000 signs)
2. **VizWiz** - Visual questions from blind users (4,319 questions from the validation split, with crowdsourced answers)
3. **AAC Vocabulary** - Augmentative and Alternative Communication symbol libraries and core vocabularies

In [None]:
import pandas as pd
import json
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

%matplotlib inline

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

## 1. WLASL: Word-Level American Sign Language

Dataset of video recordings showing individual ASL signs with gloss labels.

In [None]:
# Load WLASL dataset
wlasl = pd.read_csv('../wlasl_index.csv')

print(f"Total records: {len(wlasl):,}")
print("\nDataset structure:")
wlasl.info()  # info() prints directly and returns None, so it is not wrapped in print()
print("\nFirst few records:")
wlasl.head(10)

In [None]:
# Count videos per sign (gloss)
videos_per_sign = wlasl.groupby('gloss').size().sort_values(ascending=False)

print(f"Unique signs (glosses): {len(videos_per_sign):,}")
print(f"Average videos per sign: {videos_per_sign.mean():.1f}")
print(f"Median videos per sign: {videos_per_sign.median():.0f}")
print(f"\nTop 10 most-recorded signs:")
print(videos_per_sign.head(10))

In [None]:
# Visualization: Distribution of videos per sign
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
ax1.hist(videos_per_sign, bins=30, color='#2E86AB', edgecolor='black', alpha=0.7)
ax1.set_xlabel('Number of Videos per Sign')
ax1.set_ylabel('Frequency (Number of Signs)')
ax1.set_title('Distribution of Videos per Sign (WLASL)', fontweight='bold', fontsize=13)
ax1.axvline(videos_per_sign.mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: {videos_per_sign.mean():.1f}')
ax1.axvline(videos_per_sign.median(), color='orange', linestyle='--', linewidth=2, label=f'Median: {videos_per_sign.median():.0f}')
ax1.legend()
ax1.grid(axis='y', alpha=0.3)

# Top 15 signs
top_15 = videos_per_sign.head(15)
ax2.barh(range(len(top_15)), top_15.values, color='#A23B72')
ax2.set_yticks(range(len(top_15)))
ax2.set_yticklabels(top_15.index)
ax2.invert_yaxis()
ax2.set_xlabel('Number of Videos')
ax2.set_title('Top 15 Most-Recorded Signs', fontweight='bold', fontsize=13)
ax2.grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()
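
The histogram shows the skew visually; a few summary numbers make it easier to compare against other datasets. The cell below is a minimal sketch, not part of the original analysis. The helper name `imbalance_summary` and the `min_videos` threshold are hypothetical, and the toy counts are made up for illustration.

```python
import pandas as pd

def imbalance_summary(counts: pd.Series, min_videos: int = 5) -> dict:
    """Summarize how unevenly videos are spread across signs.

    `counts` maps each gloss to its number of videos (e.g. `videos_per_sign`).
    `min_videos` is an illustrative threshold, not a value from the dataset.
    """
    return {
        "min": int(counts.min()),
        "max": int(counts.max()),
        # Ratio between the most- and least-recorded signs
        "max_to_min_ratio": float(counts.max() / counts.min()),
        # Share of signs with fewer than `min_videos` videos
        "share_below_threshold": float((counts < min_videos).mean()),
    }

# Toy counts for illustration only; in the notebook, pass `videos_per_sign`.
toy_counts = pd.Series({"book": 40, "drink": 10, "computer": 4, "before": 2})
summary = imbalance_summary(toy_counts, min_videos=5)
print(summary)
```

In the notebook itself, `imbalance_summary(videos_per_sign)` would report the same statistics for the real index.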

## 2. VizWiz: Visual Questions from Blind Users

Crowdsourced visual question answering dataset where blind users ask questions about images.

In [None]:
# Load VizWiz dataset
vizwiz = pd.read_csv('../vizwiz_val_annotations.csv')

print(f"Total questions: {len(vizwiz):,}")
print("\nDataset structure:")
vizwiz.info()  # info() prints directly and returns None, so it is not wrapped in print()
print("\nFirst few records:")
vizwiz.head(10)

In [None]:
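# Analyze answerability
if 'answerable' in vizwiz.columns:
    # Cast to bool so 0/1 integers and True/False values are counted the same way
    # (assumes the column is not stored as strings and has no missing values)
    answerable_counts = vizwiz['answerable'].astype(bool).value_counts()
    print("Answerability distribution:")
    print(answerable_counts)
    print(f"\nAnswerable rate: {answerable_counts.get(True, 0) / len(vizwiz) * 100:.1f}%")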

# Sample questions
if 'question' in vizwiz.columns:
    print("\nSample questions:")
    for i, q in enumerate(vizwiz['question'].dropna().sample(5).values, 1):
        print(f"{i}. {q}")

In [None]:
# Visualization: Answerable vs Unanswerable
if 'answerable' in vizwiz.columns:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    # Pie chart
    labels = ['Answerable', 'Unanswerable']
    sizes = [answerable_counts.get(True, 0), answerable_counts.get(False, 0)]
    colors = ['#06A77D', '#D84E4E']
    explode = (0.05, 0)
    
    ax1.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%',
            startangle=90, textprops={'fontsize': 12, 'fontweight': 'bold'})
    ax1.set_title('VizWiz Questions: Answerability', fontweight='bold', fontsize=13)
    
    # Bar chart with counts
    ax2.bar(labels, sizes, color=colors, edgecolor='black', linewidth=1.5)
    ax2.set_ylabel('Number of Questions')
    ax2.set_title('Question Counts by Answerability', fontweight='bold', fontsize=13)
    ax2.grid(axis='y', alpha=0.3)
    
    # Add count labels on bars
    for i, (label, count) in enumerate(zip(labels, sizes)):
        ax2.text(i, count + 50, f'{count:,}', ha='center', fontweight='bold', fontsize=11)
    
    plt.tight_layout()
    plt.show()
else:
    print("'answerable' column not found in dataset")
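
Beyond answerability, the way questions begin gives a rough picture of what users ask for. The cell below is a small sketch of that idea using `Counter`. The helper name `leading_word_counts` is hypothetical, the toy questions are invented for illustration, and it assumes the real data has a `question` column as checked above.

```python
from collections import Counter

def leading_word_counts(questions, top_n=5):
    """Count the first word of each question, lowercased."""
    counter = Counter()
    for q in questions:
        words = str(q).strip().lower().split()
        if words:  # skip empty or whitespace-only questions
            counter[words[0]] += 1
    return counter.most_common(top_n)

# Invented examples for illustration; not taken from VizWiz.
toy_questions = [
    "What is this?",
    "What color is this shirt?",
    "Is this milk expired?",
    "what does the label say?",
    "  ",
]
leading = leading_word_counts(toy_questions)
print(leading)
```

On the real data, `leading_word_counts(vizwiz['question'].dropna())` would give the corresponding tally.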

## 3. AAC Vocabulary: Symbol Libraries and Core Vocabularies

Augmentative and Alternative Communication resources including symbol libraries (ARASAAC, GlobalSymbols) and core vocabulary lists.

In [None]:
# Load AAC vocabulary data
with open('../aac_vocabulary_data.json', 'r') as f:
    aac = json.load(f)

print("AAC Vocabulary Data Structure:")
print(f"Keys: {list(aac.keys())}")

if 'core_vocabulary_lists' in aac:
    print(f"\nCore Vocabulary Lists: {list(aac['core_vocabulary_lists'].keys())}")

if 'symbol_libraries' in aac:
    print(f"\nSymbol Libraries: {list(aac['symbol_libraries'].keys())}")

if 'global_symbols_libraries' in aac:
    print(f"\nGlobal Symbols Libraries: {len(aac['global_symbols_libraries'])} libraries")

In [None]:
# Explore Global Symbols libraries
if 'global_symbols_libraries' in aac:
    libraries_df = pd.DataFrame(aac['global_symbols_libraries'])
    
    print(f"Total symbol libraries: {len(libraries_df)}")
    print(f"\nTop 10 largest symbol libraries:")
    top_libraries = libraries_df.nlargest(10, 'symbol_count')
    print(top_libraries[['name', 'publisher', 'symbol_count']])
    
    print(f"\nTotal symbols across all libraries: {libraries_df['symbol_count'].sum():,}")
    print(f"Average library size: {libraries_df['symbol_count'].mean():.0f} symbols")

In [None]:
# Visualization: Symbol libraries by size
if 'global_symbols_libraries' in aac:
    libraries_df = pd.DataFrame(aac['global_symbols_libraries'])
    top_15_libs = libraries_df.nlargest(15, 'symbol_count').sort_values('symbol_count')
    
    plt.figure(figsize=(12, 8))
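    # Sample the colormap at evenly spaced fractions in [0, 1];
    # integer inputs would index the lookup table directly and yield near-identical colors
    n_libs = len(top_15_libs)
    colors = [plt.cm.viridis(i / max(n_libs - 1, 1)) for i in range(n_libs)]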
    bars = plt.barh(range(len(top_15_libs)), top_15_libs['symbol_count'], color=colors, edgecolor='black', linewidth=1)
    
    plt.yticks(range(len(top_15_libs)), top_15_libs['name'])
    plt.xlabel('Number of Symbols', fontweight='bold')
    plt.title('Top 15 AAC Symbol Libraries by Size', fontweight='bold', fontsize=14)
    plt.grid(axis='x', alpha=0.3)
    
    # Add count labels
    for i, (idx, row) in enumerate(top_15_libs.iterrows()):
        plt.text(row['symbol_count'] + 200, i, f"{row['symbol_count']:,}", 
                va='center', fontweight='bold', fontsize=9)
    
    plt.tight_layout()
    plt.show()

In [None]:
# Explore core vocabularies
if 'core_vocabulary_lists' in aac:
    print("Core Vocabulary Lists:")
    for vocab_name, vocab_data in aac['core_vocabulary_lists'].items():
        if isinstance(vocab_data, dict):
            words = vocab_data.get('words', [])
            print(f"\n{vocab_name}:")
            print(f"  Total words: {len(words)}")
            if words:
                print(f"  Sample words: {', '.join(words[:15])}")
                
                # Word length distribution
                word_lengths = [len(w) for w in words]
                print(f"  Average word length: {sum(word_lengths) / len(word_lengths):.1f} characters")
        elif isinstance(vocab_data, list):
            print(f"\n{vocab_name}:")
            print(f"  Total words: {len(vocab_data)}")
            print(f"  Sample words: {', '.join(vocab_data[:15])}")

In [None]:
# Visualization: Core vocabulary comparison
if 'core_vocabulary_lists' in aac:
    vocab_sizes = {}
    
    for vocab_name, vocab_data in aac['core_vocabulary_lists'].items():
        if isinstance(vocab_data, dict):
            vocab_sizes[vocab_name] = len(vocab_data.get('words', []))
        elif isinstance(vocab_data, list):
            vocab_sizes[vocab_name] = len(vocab_data)
    
    if vocab_sizes:
        plt.figure(figsize=(10, 5))
        names = list(vocab_sizes.keys())
        sizes = list(vocab_sizes.values())
        colors = ['#E63946', '#F1FAEE', '#A8DADC', '#457B9D', '#1D3557'][:len(names)]
        
        bars = plt.bar(names, sizes, color=colors, edgecolor='black', linewidth=2)
        plt.ylabel('Number of Words', fontweight='bold')
        plt.title('Core Vocabulary Lists Comparison', fontweight='bold', fontsize=14)
        plt.grid(axis='y', alpha=0.3)
        
        # Add count labels on bars
        for bar, size in zip(bars, sizes):
            height = bar.get_height()
            plt.text(bar.get_x() + bar.get_width()/2., height + 1,
                    f'{size}', ha='center', va='bottom', fontweight='bold', fontsize=11)
        
        plt.xticks(rotation=15, ha='right')
        plt.tight_layout()
        plt.show()
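
Comparing list sizes does not show how much the lists agree on which words count as core. One way to check is pairwise Jaccard similarity between the word sets. The cell below is a sketch under the same assumption as the cells above, namely that each entry is either a dict with a `words` key or a plain list. The helper names and toy lists are hypothetical.

```python
from itertools import combinations

def extract_words(vocab_data):
    """Return a lowercase word set from either a dict with 'words' or a plain list."""
    if isinstance(vocab_data, dict):
        words = vocab_data.get("words", [])
    elif isinstance(vocab_data, list):
        words = vocab_data
    else:
        words = []
    return {str(w).strip().lower() for w in words}

def pairwise_jaccard(vocab_lists):
    """Jaccard similarity |A & B| / |A | B| for each pair of vocabulary lists."""
    sets = {name: extract_words(data) for name, data in vocab_lists.items()}
    results = {}
    for a, b in combinations(sets, 2):
        union = sets[a] | sets[b]
        results[(a, b)] = len(sets[a] & sets[b]) / len(union) if union else 0.0
    return results

# Toy lists for illustration only; not the actual core vocabularies.
toy_lists = {
    "list_a": {"words": ["I", "want", "more", "go", "stop"]},
    "list_b": ["i", "want", "help", "go"],
}
overlap = pairwise_jaccard(toy_lists)
print(overlap)
```

In the notebook, `pairwise_jaccard(aac['core_vocabulary_lists'])` would apply the same comparison to the loaded lists.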

## Key Findings

### WLASL (Word-Level American Sign Language)
- **21,083 videos** covering **2,000 unique signs**
- Average of about 10.5 videos per sign (21,083 / 2,000), giving each sign multiple examples for training sign recognition models
- The distribution is skewed: some signs have many more videos than others, so the training data is imbalanced across signs

### VizWiz (Visual Questions from Blind Users)
- **4,319 questions** from blind users about images
- Mix of answerable and unanswerable questions; unanswerable cases often stem from image quality or clarity issues
- Real-world dataset showing actual information needs of blind users
- Crowdsourced answers provide ground truth for training vision models

### AAC Vocabulary (Augmentative and Alternative Communication)
- **34 symbol libraries** with varying sizes (from hundreds to tens of thousands of symbols)
- ARASAAC: 13,709 pictograms (one of the largest open symbol sets)
- Core vocabularies: PRC-Saltillo (100 words), Project Core (36 words)
- Symbol libraries serve different languages, cultures, and communication needs
- Core vocabularies focus on high-frequency words for efficient communication
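
The point about high-frequency words can be made concrete by measuring how many word tokens in a set of utterances a core list covers. The sketch below uses an invented core list and invented utterances purely for illustration; the helper name `core_coverage` is hypothetical and the numbers say nothing about the real vocabularies.

```python
import re

def core_coverage(utterances, core_words):
    """Fraction of word tokens in `utterances` that appear in `core_words`."""
    core = {w.lower() for w in core_words}
    # Simple tokenizer: lowercase runs of letters and apostrophes
    tokens = [t for u in utterances for t in re.findall(r"[a-z']+", u.lower())]
    if not tokens:
        return 0.0
    return sum(t in core for t in tokens) / len(tokens)

# Invented data for illustration only.
toy_core = ["I", "want", "more", "go", "it", "not", "like"]
toy_utterances = ["I want more juice", "I do not like it", "go home"]
coverage = core_coverage(toy_utterances, toy_core)
print(f"Core coverage: {coverage:.0%}")
```

The same function could be applied to a loaded core list and a sample of transcribed utterances, if such a sample is available.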

### Cross-Dataset Insights
The three datasets address different aspects of accessibility:
- **WLASL**: Visual-gestural communication (Deaf and hard-of-hearing signers)
- **VizWiz**: Visual accessibility for blind users
- **AAC**: Communication support for people who cannot rely on speech alone

Together they represent the diversity of accessibility needs and technological approaches to addressing them.