# Word Usage Analysis: Scatter Plot of Age-Specific Words

This notebook builds scatter plot visualizations of the words that distinguish young from old speakers. The analysis uses data from the spoken component of the BNC2014 corpus.

We'll use the same data processing pipeline from the binary age classifier but focus on creating a scattertext visualization that shows:
1. Words more frequently used by young speakers
2. Words more frequently used by old speakers 
3. Words used by both groups but not distinctively

In [1]:
# Import required libraries
import os
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter, defaultdict
import lxml.etree as ET

# For visualization
import scattertext as st
import spacy
import html

# Machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Set visualization style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set(font_scale=1.2)
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

# Random seed for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

# scattertext is already imported above as `st`. If that import fails,
# run `%pip install scattertext` in a separate cell and restart the kernel.



## 1. Load the BNC2014 Corpus Data and Speaker Metadata

In [2]:
# Set the path to the dataset
path = 'Dataset'  
dir_corpus = os.path.join(path, 'spoken', 'tagged')
dir_meta = os.path.join(path, 'spoken', 'metadata')

# Load speaker metadata
fields_s = pd.read_csv(
    os.path.join(dir_meta, 'metadata-fields-speaker.txt'),
    sep='\t', skiprows=1, index_col=0
)

# Load the speaker metadata
df_speakers_meta = pd.read_csv(
    os.path.join(dir_meta, 'bnc2014spoken-speakerdata.tsv'),
    sep='\t', names=fields_s['XML tag'], index_col=0
)

print(f"Loaded metadata for {len(df_speakers_meta)} speakers")

Loaded metadata for 671 speakers


In [3]:
# Function to map BNC age ranges to binary categories
def map_to_binary_age(age_range):
    """
    Map BNC age ranges to binary categories:
    Young (0-29) vs Old (30+)
    
    Parameters:
    -----------
    age_range : str
        Age range from BNC metadata (e.g., '0_18', '19-29', '30_59', '60_plus')
        
    Returns:
    --------
    str
        'Young' or 'Old' classification
    """
    if pd.isna(age_range) or age_range == 'Unknown':
        return np.nan
    
    # Handle different formats in the age range field
    try:
        # Extract the upper bound of the age range
        if '_' in str(age_range):
            ages = str(age_range).split('_')
        elif '-' in str(age_range):
            ages = str(age_range).split('-')
        else:
            # Silently skip unknown formats without printing
            return np.nan
        
        # Parse the upper bound
        if ages[1] == 'plus':
            upper = 100  # Arbitrarily high for '60_plus'
        else:
            upper = int(ages[1])
        
        # Classify as young or old
        if upper <= 29:
            return "Young"
        else:
            return "Old"
    except (ValueError, IndexError):
        # Unparseable bound (e.g. non-numeric text): treat as missing
        return np.nan

# Apply the binary age classification to speaker metadata
df_speakers_meta['binary_age'] = df_speakers_meta['agerange'].apply(map_to_binary_age)

# Display the counts for each binary age group
binary_age_counts = df_speakers_meta['binary_age'].value_counts()
print("Distribution of speakers by binary age classification:")
print(binary_age_counts)

Distribution of speakers by binary age classification:
binary_age
Old      363
Young    299
Name: count, dtype: int64
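The mapping rule can be checked on a few representative labels. The sketch below is a simplified, standard-library-only variant of `map_to_binary_age` (it returns `None` instead of `np.nan`, and the name `to_binary_age` is only illustrative); it assumes the same rule of classifying by the upper bound of the range.

```python
def to_binary_age(age_range):
    """Return 'Young', 'Old', or None for a BNC-style age-range label."""
    label = str(age_range)
    for sep in ("_", "-"):
        if sep in label:
            upper = label.split(sep)[1]  # upper bound of the range
            break
    else:
        return None  # no recognised separator, e.g. 'Unknown'
    if upper == "plus":
        return "Old"  # open-ended ranges such as '60_plus'
    try:
        return "Young" if int(upper) <= 29 else "Old"
    except ValueError:
        return None  # non-numeric upper bound


for label in ["0_18", "19_29", "30_59", "60_plus", "Unknown"]:
    print(label, "->", to_binary_age(label))
```

Classifying by the upper bound means a range that straddles the cut-off (for example a hypothetical `25_34`) would count as Old.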


In [4]:
# Process tagged corpus files to extract word and linguistic feature data
# We'll limit to 30 files to keep processing reasonable, but you can increase this
file_limit = 30  # Adjust based on your computational resources

tagged_rows = []
try:
    # Load a subset of corpus files
    for file_count, fname in enumerate(sorted(os.listdir(dir_corpus))[:file_limit]):
        if file_count % 5 == 0:
            print(f"Processing file {file_count+1}/{file_limit}: {fname}")
            
        fpath = os.path.join(dir_corpus, fname)
        xml = ET.parse(fpath)
        root = xml.getroot()
        text_id = root.get('id')
        
        for u in root.findall('.//u'):
            utt_id = u.get('n')
            spk = u.get('who')
            for w in u.findall('w'):
                tagged_rows.append({
                    'text_id': text_id,
                    'utterance_id': utt_id,
                    'speaker_id': spk,
                    'word': w.text,
                    'lemma': w.get('lemma'),
                    'pos': w.get('pos'),
                    'class': w.get('class'),
                    'usas': w.get('usas'),
                })
    
    # Create a DataFrame from the extracted data
    df_tagged = pd.DataFrame(tagged_rows)
    
    print(f"\nLoaded {len(df_tagged)} word tokens from {file_limit} files")
    print(f"Found {df_tagged['speaker_id'].nunique()} unique speakers in the processed data")
    
except Exception as e:
    print(f"Error loading corpus data: {e}")

Processing file 1/30: S23A-tgd.xml
Processing file 6/30: S26N-tgd.xml
Processing file 11/30: S2A5-tgd.xml
Processing file 16/30: S2CY-tgd.xml
Processing file 21/30: S2FT-tgd.xml
Processing file 26/30: S2K6-tgd.xml

Loaded 272258 word tokens from 30 files
Found 64 unique speakers in the processed data


In [5]:
# Count of speakers with valid age data
valid_age_speakers = set(df_speakers_meta[~df_speakers_meta['binary_age'].isna()].index)
tagged_speakers = set(df_tagged['speaker_id'].unique())
valid_speakers = valid_age_speakers.intersection(tagged_speakers)

print(f"\nOf {len(tagged_speakers)} speakers in the corpus data, {len(valid_speakers)} have valid age data")

# Filter to only include speakers with valid age data
df_tagged_valid = df_tagged[df_tagged['speaker_id'].isin(valid_speakers)]
print(f"Filtered corpus data contains {len(df_tagged_valid)} word tokens from {len(valid_speakers)} speakers")


Of 64 speakers in the corpus data, 61 have valid age data
Filtered corpus data contains 271916 word tokens from 61 speakers


## 2. Create Corpus for ScatterText Visualization

Now, let's create a corpus suitable for the ScatterText visualization by combining all utterances from each speaker into a single document.
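As a rough illustration of the grouping step, the sketch below builds one document per speaker from a handful of made-up token rows using only the standard library. The speaker IDs, tokens, and the lowered length threshold are hypothetical; the notebook itself works on `df_tagged_valid` with a threshold of 50 tokens.

```python
from collections import defaultdict

# Hypothetical (speaker_id, word) rows; None stands in for an empty <w> element
tokens = [
    ("S0001", "yeah"), ("S0002", "oh"), ("S0001", "exactly"),
    ("S0002", None), ("S0002", "right"), ("S0001", "like"),
]

# Collect each speaker's words in their original order
words_by_speaker = defaultdict(list)
for speaker_id, word in tokens:
    if word:  # drop missing or empty tokens
        words_by_speaker[speaker_id].append(word)

# Join into one text per speaker and discard very short documents
min_tokens = 3  # the notebook uses 50; lowered here for the toy data
documents = {
    spk: " ".join(words)
    for spk, words in words_by_speaker.items()
    if len(words) >= min_tokens
}
print(documents)
```

Pooling all of a speaker's utterances gives each speaker equal weight as a document, regardless of how much they talk.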

In [6]:
# Combine all words from each speaker into a single document
# We'll use a slightly different approach optimized for scattertext visualization

# Create a list to store documents with metadata
documents = []

# Group by speaker and create one document per speaker
for speaker_id, speaker_data in df_tagged_valid.groupby('speaker_id'):
    if speaker_id in df_speakers_meta.index:
        # Get the age group for this speaker
        age_group = df_speakers_meta.loc[speaker_id, 'binary_age']
        if pd.isna(age_group):
            continue
        
        # Extract words and combine into a text
        words = speaker_data['word'].fillna('').tolist()
        text = ' '.join([w for w in words if w])
        
        # Keep only documents with sufficient length
        if len(text.split()) >= 50:
            documents.append({
                'speaker_id': speaker_id,
                'age_group': age_group,
                'text': text
            })

# Create a DataFrame from the documents
corpus_df = pd.DataFrame(documents)

print(f"Created corpus with {len(corpus_df)} documents")
print(f"Age group distribution in corpus:")
print(corpus_df['age_group'].value_counts())

# Sample a few examples
print("\nSample documents:")
for age_group in ['Young', 'Old']:
    print(f"\nSample {age_group} document:")
    sample = corpus_df[corpus_df['age_group'] == age_group].sample(1).iloc[0]
    print(sample['text'][:200] + "...")  # Show first 200 characters

Created corpus with 60 documents
Age group distribution in corpus:
age_group
Old      34
Young    26
Name: count, dtype: int64

Sample documents:

Sample Young document:
exactly never mind exactly or we yeah yeah yeah yeah exactly and that 's like we go months and months and months without seeing each other so like we saw each other in December and that was the first ...

Sample Old document:
oh right yeah yeah he put it on the system did he ? yeah you 'll have one he 'll have one do n't walk away --ANONnameF cor did they ? did n't they ? oh my goodness are you alright though ? okay three ...


## 3. Create Scattertext Visualization

Now we'll use the scattertext library to create an interactive HTML visualization of word usage differences between age groups.

In [7]:
# Load spacy model - we need this for text preprocessing
try:
    nlp = spacy.load('en_core_web_sm')
except OSError:
    # spacy.load raises OSError when the model is not installed; download it
    import subprocess
    import sys
    subprocess.run([sys.executable, '-m', 'spacy', 'download', 'en_core_web_sm'], check=True)
    nlp = spacy.load('en_core_web_sm')

# Process texts with spaCy for better parsing
corpus_df['processed_text'] = corpus_df['text'].apply(lambda x: ' '.join([token.lemma_ for token in nlp(x) 
                                                                          if not token.is_stop and not token.is_punct]))

# Create a corpus for scattertext
corpus = st.CorpusFromPandas(
    corpus_df,
    category_col='age_group',
    text_col='processed_text',
    nlp=nlp
).build()

# Create the scattertext visualization
html_file = st.produce_scattertext_explorer(
    corpus,
    category='Old',  # Top category
    category_name='Old',
    not_category_name='Young',
    width_in_pixels=1000,
    metadata=corpus_df['speaker_id'],
    minimum_term_frequency=5,
    transform=st.Scalers.dense_rank,   # Use dense rank for better visualization
    max_docs_per_category=100,          # Limit number of documents for faster rendering
    pmi_threshold_coefficient=4,        # Coefficient of the PMI threshold used to filter bigrams
    # term_significance is omitted: it expects a scattertext TermSignificance
    # object, not a string such as 'mann_whitney'.
    # use_non_text_features is left at its default (False): this corpus has no
    # non-text features, and enabling it is a likely cause of
    # "ValueError: zero-size array to reduction operation maximum which has no identity".
)

# Save the visualization to an HTML file
output_filename = 'age_scattertext.html'
with open(output_filename, 'w', encoding='utf-8') as f:
    f.write(html_file)

print(f"Saved interactive visualization to {output_filename}")

## 4. Create Static Scatter Plot for Age-Specific Words

Let's also create a static scatter plot using matplotlib and seaborn that shows the most distinctive words for each age group.
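The distinctiveness measure behind this plot is a log2 ratio of a word's mean per-document count in Old texts to its mean count in Young texts. The sketch below works through that calculation on made-up documents with the standard library only. One deliberate difference, stated as an assumption: it adds a smoothing constant of 0.5 rather than the notebook's `1e-10`, so that words absent from one group do not produce extreme ratios.

```python
import math
from collections import Counter

# Hypothetical toy documents, one string per speaker
young_docs = ["like literally like yeah", "yeah like so"]
old_docs = ["oh goodness yeah", "right oh right"]


def mean_counts(docs):
    """Average per-document count of each word across a group of documents."""
    totals = Counter()
    for doc in docs:
        totals.update(doc.split())
    return {w: c / len(docs) for w, c in totals.items()}


young, old = mean_counts(young_docs), mean_counts(old_docs)
smoothing = 0.5  # assumption: larger than the notebook's epsilon, to damp zero counts

# Positive values lean Old, negative values lean Young
log_ratio = {
    w: math.log2((old.get(w, 0) + smoothing) / (young.get(w, 0) + smoothing))
    for w in set(young) | set(old)
}
for w, lr in sorted(log_ratio.items(), key=lambda kv: kv[1]):
    print(f"{w:10s} {lr:+.2f}")
```

With a very small constant, any word that never occurs in one group gets a log ratio of roughly ±30, which pushes such words to the top of both lists regardless of how often they occur.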

In [None]:
# Create a CountVectorizer to get word frequencies
count_vectorizer = CountVectorizer(
    min_df=5,          # Minimum document frequency
    max_df=0.7,        # Maximum document frequency (remove very common words)
    stop_words='english', # Remove English stopwords
    max_features=1000  # Limit to top 1000 features
)

# Fit and transform the corpus
X_counts = count_vectorizer.fit_transform(corpus_df['processed_text'])
words = count_vectorizer.get_feature_names_out()

# Create a DataFrame with word counts by category
word_counts = pd.DataFrame(X_counts.toarray(), columns=words)
word_counts['age_group'] = corpus_df['age_group'].values

# Calculate average frequency for each word by age group
young_freqs = word_counts[word_counts['age_group'] == 'Young'].drop('age_group', axis=1).mean()
old_freqs = word_counts[word_counts['age_group'] == 'Old'].drop('age_group', axis=1).mean()

# Combine frequencies into a DataFrame
word_freq_df = pd.DataFrame({
    'word': words,
    'young_freq': young_freqs.values,
    'old_freq': old_freqs.values
})

# Add a small constant to avoid log(0)
epsilon = 1e-10
word_freq_df['young_freq_adj'] = word_freq_df['young_freq'] + epsilon
word_freq_df['old_freq_adj'] = word_freq_df['old_freq'] + epsilon

# Calculate log ratio as a measure of distinctiveness
word_freq_df['log_ratio'] = np.log2(word_freq_df['old_freq_adj'] / word_freq_df['young_freq_adj'])

# Calculate overall frequency (for point size)
word_freq_df['total_freq'] = word_freq_df['young_freq'] + word_freq_df['old_freq']

# Identify most distinctive words for each category
young_words = word_freq_df.sort_values('log_ratio').head(30)['word'].tolist()
old_words = word_freq_df.sort_values('log_ratio', ascending=False).head(30)['word'].tolist()
distinctive_words = young_words + old_words

# Filter to most distinctive words for visualization
plot_df = word_freq_df[word_freq_df['word'].isin(distinctive_words)].copy()

# Create a static scatter plot
plt.figure(figsize=(16, 12))

# Set color based on log_ratio (blue for Young, red for Old)
colors = ['blue' if ratio < 0 else 'red' for ratio in plot_df['log_ratio']]

# Create scatter plot
scatter = plt.scatter(
    x=plot_df['young_freq'],
    y=plot_df['old_freq'],
    s=plot_df['total_freq'] * 1000,  # Scale point size
    alpha=0.6,
    c=colors
)

# Add labels for distinctive words
for i, row in plot_df.iterrows():
    plt.annotate(
        row['word'], 
        (row['young_freq'], row['old_freq']),
        fontsize=10,
        ha='center' if abs(row['log_ratio']) < 1 else ('right' if row['log_ratio'] < 0 else 'left'),
        va='center',
        weight='bold' if abs(row['log_ratio']) > 2 else 'normal'
    )

# Add diagonal line
max_val = max(plot_df['young_freq'].max(), plot_df['old_freq'].max()) * 1.1
plt.plot([0, max_val], [0, max_val], 'k--', alpha=0.3)

# Add labels and title
plt.xlabel('Frequency in Young Speaker Texts', fontsize=14)
plt.ylabel('Frequency in Old Speaker Texts', fontsize=14)
plt.title('Word Usage Patterns: Young vs. Old Speakers', fontsize=16)

# Add legend
blue_patch = plt.Line2D([0], [0], marker='o', color='w', markerfacecolor='blue', markersize=10, label='More common in Young')
red_patch = plt.Line2D([0], [0], marker='o', color='w', markerfacecolor='red', markersize=10, label='More common in Old')
plt.legend(handles=[blue_patch, red_patch], loc='upper left')

# Improve layout
plt.grid(alpha=0.3)
plt.tight_layout()

# Save the plot
plt.savefig('age_word_usage_scatter.png', dpi=300)
plt.show()

print("Scatter plot created and saved as 'age_word_usage_scatter.png'")

## 5. Word Frequency Analysis: Create Top Words Tables

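This section first redraws the scatter plot using mean TF-IDF weights instead of raw counts, then creates tables showing the top words distinctive of each age group, along with their frequencies and Mann-Whitney U p-values.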
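The tables in this section keep only words whose p-value is below 0.05 and then bin them by log ratio into Young, Similar, and Old. A minimal plain-Python sketch of that filter-and-bin logic follows; the rows are invented for illustration, and the function `distinctiveness` is a hypothetical stand-in for the `pd.cut` call, assuming its default right-inclusive bin edges.

```python
def distinctiveness(log_ratio):
    """Mirror the bins (-inf, -1], (-1, 1], (1, inf) with right-inclusive edges."""
    if log_ratio <= -1:
        return "Young"
    if log_ratio <= 1:
        return "Similar"
    return "Old"


# Hypothetical (word, log_ratio, p_value) rows
rows = [
    ("like", -2.3, 0.01),
    ("goodness", 1.8, 0.03),
    ("yeah", -0.4, 0.20),
    ("right", 1.0, 0.04),
]

alpha = 0.05
significant = [r for r in rows if r[2] < alpha]  # keep significant words only

# Print from most Young-leaning to most Old-leaning
for word, lr, p in sorted(significant, key=lambda r: r[1]):
    print(f"{word:10s} {lr:+.1f}  p={p:.2f}  {distinctiveness(lr)}")
```

A log ratio of exactly 1 falls in the Similar bin, so a word must be more than twice as frequent in one group to be labelled distinctive.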

In [None]:
# Create a modified scatter plot that more clearly shows words associated with each class
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TF-IDF vectorizer
tfidf_vec = TfidfVectorizer(
    min_df=5,          # Minimum document frequency
    max_df=0.7,        # Maximum document frequency (remove very common words)
    stop_words='english', # Remove English stopwords
    max_features=3000  # Limit to top 3000 features to keep processing manageable
)

# Fit vectorizer on all texts
X_tfidf = tfidf_vec.fit_transform(corpus_df['processed_text'])

# Get feature names
feature_names = tfidf_vec.get_feature_names_out()

# Create a DataFrame with TF-IDF values for each document
tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=feature_names)
tfidf_df['age_group'] = corpus_df['age_group'].values

# Calculate average TF-IDF value for each term by age group
young_tfidf = tfidf_df[tfidf_df['age_group'] == 'Young'].drop('age_group', axis=1).mean()
old_tfidf = tfidf_df[tfidf_df['age_group'] == 'Old'].drop('age_group', axis=1).mean()

# Create a DataFrame of term importances
term_importances = pd.DataFrame({
    'word': feature_names,
    'young_importance': young_tfidf,
    'old_importance': old_tfidf
})

# Add a small constant to avoid division by zero
epsilon = 1e-8
term_importances['ratio'] = (term_importances['old_importance'] + epsilon) / \
                            (term_importances['young_importance'] + epsilon)
term_importances['log_ratio'] = np.log2(term_importances['ratio'])

# Find the most distinctive terms for each age group (lowest and highest log ratios)
young_distinctive = term_importances.nsmallest(40, 'log_ratio')
old_distinctive = term_importances.nlargest(40, 'log_ratio')

# Take a union of the most important terms for visualization
important_terms = pd.concat([young_distinctive, old_distinctive]).drop_duplicates('word')

# Create a new figure
plt.figure(figsize=(14, 12))

# Plot the terms with distinct coloring based on which age group they're associated with
for _, row in important_terms.iterrows():
    x = row['young_importance']
    y = row['old_importance']
    word = row['word']
    # Color words based on their association
    if row['log_ratio'] < -1:  # Strongly associated with Young
        color = 'blue'
        alpha = 0.8
    elif row['log_ratio'] > 1:  # Strongly associated with Old
        color = 'red'
        alpha = 0.8
    else:  # Less strongly associated
        color = 'purple'
        alpha = 0.5
    
    # Vary sizes based on total importance (how "special" the word is)
    size = 10 + 1000 * (x + y)
    plt.scatter(x, y, s=size, color=color, alpha=alpha)
    
    # Add word labels
    plt.annotate(
        word,
        (x, y),
        fontsize=11,
        ha='center' if abs(row['log_ratio']) < 0.5 else ('right' if row['log_ratio'] < 0 else 'left'),
        va='center',
        alpha=0.9,
        weight='bold' if abs(row['log_ratio']) > 1.5 else 'normal'
    )

# Add diagonal line
max_val = max(important_terms['young_importance'].max(), important_terms['old_importance'].max()) * 1.1
plt.plot([0, max_val], [0, max_val], 'k--', alpha=0.3)

# Add labels and title
plt.xlabel('Association with Young Speakers', fontsize=14)
plt.ylabel('Association with Old Speakers', fontsize=14)
plt.title('Words Associated with Young vs. Old Speakers', fontsize=16)

# Add legend
blue_patch = plt.Line2D([0], [0], marker='o', color='w', markerfacecolor='blue', markersize=10, label='Young')
red_patch = plt.Line2D([0], [0], marker='o', color='w', markerfacecolor='red', markersize=10, label='Old')
purple_patch = plt.Line2D([0], [0], marker='o', color='w', markerfacecolor='purple', markersize=10, label='Shared')
plt.legend(handles=[blue_patch, red_patch, purple_patch], loc='upper left')

# Improve layout
plt.grid(alpha=0.3)
plt.tight_layout()

# Save and show
plt.savefig('age_specific_words_scatter.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nScatter plot of age-specific words created and saved.")

In [None]:
# Calculate significance using Mann-Whitney U test
from scipy.stats import mannwhitneyu

# Calculate p-values for each term
p_values = []
for word in words:
    young_word_counts = word_counts[word_counts['age_group'] == 'Young'][word]
    old_word_counts = word_counts[word_counts['age_group'] == 'Old'][word]
    
    try:
        u_stat, p_value = mannwhitneyu(young_word_counts, old_word_counts, alternative='two-sided')
        p_values.append(p_value)
    except ValueError:
        # Some SciPy versions raise ValueError when all values are identical;
        # treat that as no evidence of a difference
        p_values.append(1.0)

# Add p-values to the DataFrame
word_freq_df['p_value'] = p_values

# Add significance indicator
word_freq_df['significant'] = word_freq_df['p_value'] < 0.05

# Filter for significant differences only
sig_words_df = word_freq_df[word_freq_df['significant']].copy()

# Add a distinctiveness category
sig_words_df['distinctiveness'] = pd.cut(
    sig_words_df['log_ratio'],
    bins=[-float('inf'), -1, 1, float('inf')],
    labels=['Young', 'Similar', 'Old']
)

# Get the top words for each category
young_distinctive = sig_words_df[sig_words_df['distinctiveness'] == 'Young'].sort_values('log_ratio').head(20)
old_distinctive = sig_words_df[sig_words_df['distinctiveness'] == 'Old'].sort_values('log_ratio', ascending=False).head(20)

# Display tables of top distinctive words
print("Top 20 words distinctively used by YOUNG speakers:")
print("="*50)
young_table = young_distinctive[['word', 'young_freq', 'old_freq', 'log_ratio', 'p_value']].reset_index(drop=True)
print(young_table)

print("\nTop 20 words distinctively used by OLD speakers:")
print("="*50)
old_table = old_distinctive[['word', 'young_freq', 'old_freq', 'log_ratio', 'p_value']].reset_index(drop=True)
print(old_table)

# Save tables to CSV
young_table.to_csv('young_distinctive_words.csv', index=False)
old_table.to_csv('old_distinctive_words.csv', index=False)

print("\nSaved distinctive words lists to CSV files.")

## 6. Create Word Cloud Visualizations

Finally, let's create word clouds for each age group to visualize their distinctive vocabulary.
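Each word's size in the clouds is driven by its total frequency multiplied by the absolute value of its log ratio, so frequent and strongly skewed words appear largest. The sketch below shows that weighting on invented values using plain Python; the tuples are hypothetical and only stand in for rows of the distinctive-word tables.

```python
# Hypothetical (word, total_freq, log_ratio) rows for one age group
rows = [("like", 3.2, -2.3), ("literally", 0.9, -3.1), ("yeah", 5.0, -1.2)]

# Weight = overall frequency scaled by how skewed the word is
weights = {word: total * abs(lr) for word, total, lr in rows}

# Words in the order they would dominate the cloud
ranked = sorted(weights, key=weights.get, reverse=True)
print(ranked)
```

A dictionary of this shape is what `WordCloud.generate_from_frequencies` takes as input.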

In [None]:
from wordcloud import WordCloud

# Create word clouds based on distinctiveness
def create_age_wordcloud(words_df, title, output_filename):
    """Create and save a wordcloud for distinctive words"""
    # Create a dictionary of word frequencies scaled by log_ratio
    freq_dict = dict(zip(
        words_df['word'], 
        words_df['total_freq'] * np.abs(words_df['log_ratio'])
    ))
    
    # Generate word cloud
    wordcloud = WordCloud(
        width=800, height=400, 
        background_color='white',
        max_words=100,
        colormap='Blues' if 'young' in output_filename.lower() else 'Reds',
        contour_width=1, contour_color='steelblue'
    ).generate_from_frequencies(freq_dict)
    
    # Display the word cloud
    plt.figure(figsize=(12, 8))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(title, fontsize=18)
    plt.tight_layout(pad=0)
    plt.savefig(output_filename, dpi=300, bbox_inches='tight')
    plt.show()

# Create word clouds for each age group
create_age_wordcloud(
    young_distinctive, 
    'Words Distinctive of Young Speakers',
    'young_distinctive_wordcloud.png'
)

create_age_wordcloud(
    old_distinctive, 
    'Words Distinctive of Old Speakers',
    'old_distinctive_wordcloud.png'
)

print("Word clouds created and saved.")

## Summary

In this notebook, we've created several visualizations to explore the differences in word usage between young and old speakers:

1. **Interactive Scattertext Plot**: An HTML visualization that provides an interactive exploration of word differences.

2. **Static Word Usage Scatter Plots**: Plots of the most distinctive words for each age group, with a word's mean count (or mean TF-IDF weight) among young speakers on the x-axis and among old speakers on the y-axis.

3. **Word Frequency Tables**: Tables listing the top 20 most distinctive words for each age group, along with their frequencies and statistical significance.

4. **Word Clouds**: Visual representations of the distinctive vocabulary for each age group.

Together, these views show which words are more characteristic of young (under 30) versus old (30 and over) speakers in the BNC2014 corpus. The results here are based on the first 30 corpus files (60 speaker documents), so the word lists are indicative of that subset rather than of the full corpus.