# Notebook 1 - finding cohesive n-grams (MI/PMI)

An n-gram is a sequence of n consecutive words. In our analysis:

- **Bigram** (n=2): "opportunity areas"
- **Trigram** (n=3): "body and mind"

These sequences can capture concepts that individual words miss. "Status" and "symbol" mean different things separately than "status symbol" does together.
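To make the definition concrete, here is a minimal sketch that slides a window of size n over a tokenized sentence. The helper name `ngrams` and the example sentence are illustrative assumptions, not part of the project code.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of `tokens` as space-joined strings."""
    # A list of k tokens has k - n + 1 windows of size n (none if n > k)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical example sentence, split on whitespace
tokens = "a car is a status symbol".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

Each shift of the window yields one n-gram, so "status symbol" is captured as a single unit alongside the individual words "status" and "symbol".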

In this notebook, we create a comprehensive set of bigrams/trigrams from the transcript and compute their **association strength** (MI), with **length-wise normalization** and **speaker adoption**, to prepare candidates for downstream filtering. 


Please refer to the README.md for an overview of the whole analysis.






## 1. Inputs & Setup

### Create environment and install dependencies

This project is tested with **Python 3.9.6**. Some dependencies (e.g., VegaFusion/Altair integrations) work best on this version. To make sure everything runs smoothly, set up the environment as follows:

→ Open a terminal (macOS: press `⌘ + Space`, type 'Terminal', hit Enter | Windows: press `Win + R`, type 'cmd', hit Enter)  
→ Paste the commands below (replace the path in the first command with your project folder). The correct Python version and all dependencies will be installed automatically.

```bash
# Go to the project folder
cd /project_path
# Create the environment and install all dependencies:
conda env create -f environment.yml
# Activate the environment:
conda activate vis-analysis
# Register the environment in Jupyter so that you can select it inside notebooks:
python -m ipykernel install --user --name vis-analysis --display-name "Python 3.9.6 (vis-analysis)"
```
→ When opening a notebook, select the kernel called “Python 3.9.6 (vis-analysis)” from the Jupyter or VS Code kernel menu.



### Import the functions and modules


In [1]:
# Core imports
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt
from IPython.display import display

alt.data_transformers.enable("vegafusion")

os.makedirs('altair', exist_ok=True)

# Display options
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 50)
sns.set_context('talk')

# Output folder
os.makedirs('outputs', exist_ok=True)

# Optional - Enable automatic reloading of modules when source code changes. This eliminates the need to restart the kernel when updating external .py files.
%load_ext autoreload
%autoreload 2


### Load and prepare design conversation transcripts

The transcript should be a table of utterances with at least the columns:  
- `speaker`  
- `text`  
- (optional) `session`, if multiple sessions are included  

We also apply basic cleaning:  
- Remove transcription artifacts (e.g., `[inaudible]`)  
- Standardize case and punctuation  
- Merge consecutive utterances from the same speaker into larger “turns” so that multiword phrases spanning lines are not missed.
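
The cleaning and merging steps above can be sketched as follows. This is only an illustration under stated assumptions: `basic_clean` and `merge_consecutive` are hypothetical helpers, not the project's `clean_text` and `utterances_to_turns`, whose exact behaviour may differ.

```python
import re
import pandas as pd

def basic_clean(text):
    """Lowercase, drop bracketed artifacts such as [inaudible], strip punctuation."""
    text = re.sub(r"\[[^\]]*\]", " ", text.lower())
    text = re.sub(r"[^\w\s']", " ", text)          # keep word characters and apostrophes
    return re.sub(r"\s+", " ", text).strip()

def merge_consecutive(df):
    """Merge runs of consecutive rows by the same speaker (within a session) into turns."""
    # A new turn starts whenever the speaker or the session changes
    changed = (df["speaker"] != df["speaker"].shift()) | (df["session"] != df["session"].shift())
    turn_id = changed.cumsum()
    return (
        df.groupby(turn_id, sort=True)
          .agg(session=("session", "first"), speaker=("speaker", "first"), text=("text", " ".join))
          .reset_index(drop=True)
    )

# Toy utterances: "status symbol" is split across two lines by the same speaker
toy = pd.DataFrame({
    "session": [1, 1, 1, 1],
    "speaker": ["Alice", "Alice", "Bob", "Alice"],
    "text": ["It is a status", "symbol [inaudible] really.", "Makes sense!", "Right."],
})
toy["text"] = toy["text"].map(basic_clean)
turns = merge_consecutive(toy)
print(turns)
```

After merging, Alice's first two lines form one turn, so the bigram "status symbol" becomes visible to the extraction step even though it straddled a line break in the transcript.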

In [None]:
# ======================================
# OPTION 1: Load your own data
# ======================================
# Replace 'data/your_corpus.csv' with the path to your actual data file
# Your CSV should have columns: 'speaker' and 'text' (both required; 'speaker' is needed for the speaker adoption step), 'session' (optional)

from src.text_utils import clean_text, utterances_to_turns

# Example loading your data:
# all_utterances = pd.read_csv('data/your_corpus.csv')
# cleaned = clean_text(all_utterances)
# session_turns = utterances_to_turns(cleaned)

# ======================================
# OPTION 2: Use sample data for demonstration
# ======================================
# Creating a small sample dataset to demonstrate the analysis

sample_data = pd.DataFrame({
    'speaker': ['Alice', 'Bob', 'Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'] * 10,
    'text': [
        'We need to think about user experience and interaction design.',
        'Yes, the user interface should be intuitive and clean.',
        'What about incorporating machine learning algorithms?',
        'Good idea. Machine learning could help with personalization.',
        'We should also consider data privacy and security concerns.',
        'Absolutely. Privacy by design is essential.',
        'Let me check the technical requirements for implementation.',
        'The system architecture needs to be scalable.'
    ] * 10,
    'session': [1, 1, 1, 1, 1, 1, 1, 1] * 10
})

# Process the data
cleaned = clean_text(sample_data)
session_turns = utterances_to_turns(cleaned)

print(f'Total turns: {len(session_turns)}')
print(f'Unique speakers: {session_turns["speaker"].nunique()}')
print(f'Sessions: {session_turns["session"].nunique() if "session" in session_turns.columns else "N/A"}')


## 2. Extract bigrams and trigrams with frequency counts
We start by extracting all contiguous bigrams and trigrams from the corpus. This gives us our raw material.  Most are unremarkable ("and the", "I think"), but some capture how the team thinks about the problem. 


We extract **n-grams** (sequences of n consecutive words) from each speaker turn.  
- **Unigrams** = single words
- **Bigrams** = 2 words in sequence  
- **Trigrams** = 3 words in sequence  

This gives us candidate phrases to later evaluate for **cohesion strength**.  
At this stage, we only count surface forms; filtering and statistical measures come next.
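
As a rough picture of what this counting step produces, the sketch below counts n-grams turn by turn (so no n-gram spans a turn boundary) and returns a long table. The function `count_ngrams` is a hypothetical stand-in for `extract_ngrams_from_corpus`; the column names mirror the ones used later in this notebook, but the real output may carry more columns.

```python
from collections import Counter
import pandas as pd

def count_ngrams(turns, n_min=1, n_max=3):
    """Count n-grams of length n_min..n_max inside each turn; return one row per n-gram."""
    counts = Counter()
    for text in turns:
        tokens = text.split()
        for n in range(n_min, n_max + 1):
            for i in range(len(tokens) - n + 1):
                counts[(" ".join(tokens[i:i + n]), n)] += 1
    rows = [(ngram, n, c) for (ngram, n), c in counts.items()]
    return pd.DataFrame(rows, columns=["ngram", "ngram_length", "frequency_in_session"])

# Two toy turns, already cleaned and lowercased
toy_turns = ["machine learning could help", "what about machine learning"]
toy_counts = count_ngrams(toy_turns)

print(toy_counts.sort_values("frequency_in_session", ascending=False).head())
```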

In [None]:
from src.ngram_extraction import extract_ngrams_from_corpus, count_speakers_per_ngram

session_ngrams = extract_ngrams_from_corpus(
    session_turns,
    text_column='text',
    frequency_column='frequency_in_session',
    n_gram_min=1,
    n_gram_max=3,
    preprocess=True,
    remove_punctuation=True
)

session_ngrams

## 3. Understanding the frequency distribution of ngrams

### 3.1. Histograms

Before exploring the long table, let’s plot histograms to see how n-gram frequencies are distributed, so we get a sense of which frequency ranges to inspect for the special n-grams we’re looking for.

In [None]:
from src.mi_visualization import create_log_binned_histogram_linear, create_log_binned_histogram_normalized
import matplotlib.pyplot as plt

# Filter dataframes for different n-gram lengths
unigram_df = session_ngrams[session_ngrams['ngram_length'] == 1]
bigram_df = session_ngrams[session_ngrams['ngram_length'] == 2]
trigram_df = session_ngrams[session_ngrams['ngram_length'] == 3]

# Create a list of dataframes and titles
ngram_dfs = [trigram_df, bigram_df, unigram_df]
titles = ['Trigrams', 'Bigrams', 'Unigrams']

# Create and display the histograms with linear y-axis
fig = create_log_binned_histogram_linear(ngram_dfs, titles, frequency_column='frequency_in_session', n_bins=8, figsize=(12, 8))
plt.show()

# Create and display the histograms with normalized y-axis
fig = create_log_binned_histogram_normalized(ngram_dfs, titles, frequency_column='frequency_in_session', n_bins=8, figsize=(12, 8))
plt.show()

Each bar counts how many distinct n-grams fall within a given frequency range on the x-axis. The ranges are logarithmically spaced (shared across Unigram/Bigram/Trigram panels), so the subplots are directly comparable.

We used log-spaced bins on the x-axis to stop everything from collapsing into the first bin and make both the high-frequency head and the long tail visible.
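
For intuition, log-spaced bins can be built with NumPy as in the sketch below. The toy frequencies are made up, and whether `create_log_binned_histogram_linear` computes its edges exactly this way is an assumption.

```python
import numpy as np

# Toy frequencies: many rare n-grams, a few very common ones
freqs = np.array([1, 1, 1, 1, 2, 2, 3, 5, 8, 40, 120])

# 8 bins need 9 edges, spaced evenly in log space between min and max
edges = np.geomspace(freqs.min(), freqs.max(), num=9)
counts, _ = np.histogram(freqs, bins=edges)

for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:7.1f} to {hi:7.1f}: {c}")
```

To share bins across the unigram, bigram, and trigram panels, the edges would be computed once from the pooled frequencies and reused for every panel.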

**Linear y-axis view (first figure)** uses raw counts on the y-axis. This shows how extremely skewed the data are: the leftmost (low-frequency) bins dominate, while high-frequency bins are barely visible.

**Normalized y-axis view (second figure)** rescales the counts so we can more easily compare shapes across n-gram lengths without the sheer size of the rare bin overwhelming everything.

**Comparing n-gram lengths**
As n increases, mass shifts left (toward lower frequencies). Unigrams retain more high-frequency items; bigrams fewer; trigrams are overwhelmingly rare.



### 3.2. Using frequency rank instead of raw frequency (Zipf plot)

Histograms are hard to read when the distribution is this skewed: almost everything piles into the first bin. This is because language follows a predictable pattern called Zipf's Law:

- A small number of phrases account for most usage
- The vast majority of phrases are rare

When searching for the range in which frame-signaling n-grams appear, a Zipf plot can be more useful. It replaces raw counts with ranks (1st most common, 2nd, ..., nth) and uses log-log axes, so the head and the long tail become readable on one scale.
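
A minimal sketch of turning frequencies into ranks with pandas, both globally and within each n-gram length. The toy table is invented, and using `method="first"` to give ties distinct consecutive ranks is an assumption about what `add_frequency_rank(..., method='sequential')` does.

```python
import pandas as pd

toy = pd.DataFrame({
    "ngram": ["the", "status symbol", "i think", "design", "body and mind"],
    "ngram_length": [1, 2, 2, 1, 3],
    "frequency_in_session": [50, 4, 12, 9, 2],
})

# Rank 1 = most frequent n-gram in the whole table
toy["rank_global"] = (
    toy["frequency_in_session"].rank(method="first", ascending=False).astype(int)
)

# Rank 1 = most frequent n-gram among those of the same length
toy["rank_within_length"] = (
    toy.groupby("ngram_length")["frequency_in_session"]
       .rank(method="first", ascending=False)
       .astype(int)
)

print(toy)
```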

_Hover on the points to see the n-grams..._


In [None]:
from src.mi_visualization import add_frequency_rank
# Global rank → write to 'frequency_rank_global'
session_ngrams = add_frequency_rank(session_ngrams, freq_col='frequency_in_session', method='sequential', rank_col='frequency_rank_global')
# Rank within ngram_length → write to 'frequency_rank_within_length'
session_ngrams = add_frequency_rank(session_ngrams, freq_col='frequency_in_session', group_by='ngram_length', method='sequential', rank_col='frequency_rank_within_length')

In [None]:
from src.mi_visualization import normalise_values
session_ngrams = normalise_values(
    session_ngrams,
    freq_col='frequency_in_session',
    out_col='normalized_frequency_within_length',
    method='max',
    group_by='ngram_length'
)

In [None]:
from src.mi_visualization import create_parametric_ngram_scatter

chart = create_parametric_ngram_scatter(
    df = session_ngrams,
    x_col='frequency_rank_within_length', x_mode='continuous', x_scale='log', x_bin=False,
    y_col='normalized_frequency_within_length',     y_mode='continuous', y_scale='log', y_bin=False,
    color_col='ngram_length', color_mode='categorical',
    color_scheme='tableau10',
    width=600, height=450
)
chart

Each point in this plot is an n-gram. Colors separate unigrams, bigrams, and trigrams.  
x = frequency rank,  
y = normalized frequency (both axes log-scaled).  
The cell above plots the within-length columns (`frequency_rank_within_length` and `normalized_frequency_within_length`).

- Left/top = very common (unigrams dominate the head)
- Right/bottom = rare (longer n-grams shift into the tail)

You can check how quickly frequency drops as the rank increases.

- Points above the line are over-common for their rank (often set phrases); 
- Points below the line are under-common relative to the trend.
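
One way to make "above" and "below" the line concrete is to fit a straight line in log-log space and look at the sign of each residual. The sketch below uses invented numbers and an ordinary least-squares fit; how the chart actually fits its dashed trend line is an assumption.

```python
import numpy as np

# Toy rank/frequency pairs, roughly Zipf-like
rank = np.arange(1, 9)
freq = np.array([120, 60, 45, 30, 24, 20, 17, 15])

# Fit log10(freq) = slope * log10(rank) + intercept
slope, intercept = np.polyfit(np.log10(rank), np.log10(freq), deg=1)
predicted = slope * np.log10(rank) + intercept

# Positive residual: more frequent than the trend predicts for that rank
residual = np.log10(freq) - predicted

for r, f, res in zip(rank, freq, residual):
    side = "above" if res > 0 else "below"
    print(f"rank {r}: freq {f:3d} is {side} the line")
```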



In the first plot, ranks are computed globally across all n-grams and frequencies are normalized by the single global maximum, with one dashed trend line fit to all points. This gives a corpus-wide view and allows direct global rank comparisons, though smaller groups can look compressed under the global scaling. 

In the second plot, ranks and normalizations are done within each n-gram length, and each group has its own dashed trend line. This makes it easier to compare the slopes and shapes of Unigrams, Bigrams, and Trigrams on fair, per-group scales.



## 4. Looking at actual phrases

Now let's see what these frequency distributions actually contain. The histograms showed us the shape; here we'll look at concrete examples. We divide the frequency spectrum into 8 logarithmic bands (matching our histograms) and sample actual phrases from each band. Which range seems to contain the n-grams that signal framing?

In [None]:
from src.mi_visualization import ngram_examples_grid

df = session_ngrams

table_df, fig = ngram_examples_grid(
    df=df,
    x_col='frequency_in_session', x_mode='continuous', x_binning='log', x_bins=8,
    y_col='ngram_length',   y_mode='categorical', y_binning='none',
    examples_per_cell=8, wrap_width=14, figsize=(18, 10),
    y_label='n-gram lengths', tick_fontsize=26, label_fontsize=28, cell_text_fontsize=18,
    x_label='Frequency (log bins)', 
    cell_fill='#e9edf2', cell_fill_alpha=0.7
)
plt.show()

## 5. Speaker adoption of n-grams
A phrase used by only one person might be an individual peculiarity. Phrases adopted by multiple team members suggest shared framing. So we calculate how many unique speakers have used each phrase. This will narrow down our candidates significantly.
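
The idea can be sketched by collecting, for every n-gram, the set of speakers who used it. This is a simplified stand-in for `count_speakers_per_ngram`, with made-up turns; the project function may match n-grams to turns differently.

```python
from collections import defaultdict

# Toy (speaker, cleaned text) turns
turns = [
    ("Alice", "the status symbol matters"),
    ("Bob", "a status symbol for sure"),
    ("Alice", "status symbol again"),
    ("Charlie", "makes sense"),
]

speakers_by_ngram = defaultdict(set)
for speaker, text in turns:
    tokens = text.split()
    for n in (2, 3):
        for i in range(len(tokens) - n + 1):
            # A set ignores repeated use by the same speaker
            speakers_by_ngram[" ".join(tokens[i:i + n])].add(speaker)

speaker_count = {ngram: len(s) for ngram, s in speakers_by_ngram.items()}
print(speaker_count["status symbol"], speaker_count["makes sense"])
```

Note that Alice uses "status symbol" twice but counts once: adoption measures how many people use a phrase, not how often it occurs.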


In [None]:
# Count unique speakers for each n-gram in the corpus. Add speaker_count column to the session_ngrams.
session_ngrams = count_speakers_per_ngram(
    session_ngrams=session_ngrams,
    session_turns=session_turns,
    text_column='text'
)

session_ngrams

In [None]:
# Create histogram of speaker counts
# Most phrases are used by only 1-2 speakers
# Very few achieve team-wide adoption
fig, ax = plt.subplots(figsize=(6, 8))
max_speakers = session_ngrams['speaker_count'].max()
bins = range(1, int(max_speakers) + 2)
ax.hist(session_ngrams['speaker_count'], bins=bins, edgecolor='black', alpha=0.7)
ax.set_xlabel('Number of speakers', fontsize=10) 
ax.set_ylabel('Number of phrases', fontsize=10)   
ax.set_title('Distribution of speaker adoption', fontsize=10)
ax.set_xticks(range(1, min(int(max_speakers) + 1, 11)))
ax.tick_params(axis='both', which='major', labelsize=8)  # Set x and y tick font size to 8
plt.tight_layout()
plt.show()

As the histogram shows, the vast majority of phrases are used by just one or two speakers, with progressively fewer phrases achieving broader team adoption. This is expected and actually helpful for our analysis. By requiring phrases to be used by multiple speakers (say ≥2 or ≥3), we can dramatically reduce our candidate pool.

### Examples across speaker adoption levels
We use the same "ngram_examples_grid" function, dividing the frequency spectrum into logarithmic bands and sampling actual phrases from each band.

Looking at this grid, we can see clear patterns emerge as we move across frequency ranges and speaker adoption levels.

Notice how the grid becomes increasingly sparse as we move right - very few phrases achieve both high frequency AND broad speaker adoption unless they're basic function words. The cells start disappearing around the 6-speaker mark for moderate frequencies, and only the most frequent bins retain phrases at 10+ speakers.



In [None]:
from src.mi_visualization import ngram_examples_grid

table_df, fig = ngram_examples_grid(
    df=session_ngrams,
    x_col='frequency_in_session', x_mode='continuous', x_binning='log', x_bins=8,
    y_col='speaker_count',   y_mode='categorical', y_binning='none',
    color_col='ngram_length', color_mode='categorical', color_scheme='v',
    color_categories=[1, 2, 3],
    examples_per_cell=4, wrap_width=20, figsize=(10, 14),
    y_label='Speaker count', tick_fontsize=12, label_fontsize=12, cell_text_fontsize=9,
    x_label='Frequency (log bins)', 
    cell_fill='#e9edf2', cell_fill_alpha=0.7
)
plt.show()

## 6. Finding Cohesive Word Combinations: Calculating Mutual Information Scores


Not all word sequences are meaningful phrases. *“status symbol”* refers to a concept; *“status the”* just happened to appear together.  
**Pointwise Mutual Information (PMI)**, which this notebook also calls MI, helps us distinguish between these by measuring whether words appear together **more often than chance would predict**. For words $w_1, w_2$:

$$MI(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1) \cdot P(w_2)}$$

In practical terms:

- **High MI** → words form a cohesive unit (stronger-than-chance association).  
- **MI ≈ 0** → words behave independently.  
- **Low/negative MI** → co-occurrence is weaker than chance.
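
Here is a small worked example of the formula on a toy token stream. The probability estimates (unigram count over the number of tokens, bigram count over the number of bigram positions) are one common choice and an assumption here; `analyze_mutual_information` may estimate them differently, especially for trigrams.

```python
import math
from collections import Counter

tokens = "the status symbol is the status symbol of the team".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens[:-1], tokens[1:]))

n_tokens = len(tokens)        # denominator for P(w)
n_bigrams = len(tokens) - 1   # denominator for P(w1, w2)

def pmi(w1, w2):
    """log2 of observed bigram probability over the probability expected under independence."""
    p_joint = bigram_counts[(w1, w2)] / n_bigrams
    p_w1 = unigram_counts[w1] / n_tokens
    p_w2 = unigram_counts[w2] / n_tokens
    return math.log2(p_joint / (p_w1 * p_w2))

pmi_status_symbol = pmi("status", "symbol")
pmi_the_status = pmi("the", "status")
print(round(pmi_status_symbol, 3), round(pmi_the_status, 3))
```

"status" is always followed by "symbol" here, so the pair scores higher than "the status", where "the" also occurs in other contexts. In a corpus this small every observed pair gets a positive score, which is one reason a minimum frequency threshold is useful before trusting the values.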




In [None]:
# Import necessary libraries
import numpy as np
from src.mutual_information import analyze_mutual_information

# Analyze mutual information for n-grams in the session_turns
mi_df = analyze_mutual_information(
    df=session_turns,
    text_column='text',
    n_gram_max=3, # max ngram length to consider
    min_frequency=2,
    min_length=2,
    mi_column='mi_in_session'
)

# Merge the mi_in_session column with the existing session_ngrams dataframe
session_ngrams = session_ngrams.merge(mi_df[['ngram', 'mi_in_session']], on='ngram', how='left')


The n-gram example grid below shows how **Cohesion (PMI)** relates to **how common a phrase is** in the session.

**Axes**
- **x (right)** → higher **Cohesion (PMI)**: words co-occur more than chance → tighter phrase.
- **y (up)** → higher **frequency rank** → **less frequent** phrase (rank 1 = most frequent).  
  *(We use log-binned ranks to separate the head from the long tail.)*

**How to read the grid**
- **Top-right:** cohesive **but rare** → niche/interesting candidates to inspect.
- **Bottom-right:** cohesive **and common** → strong, widely-used phrases.
- **Top-left:** rare **and** low cohesion → likely noise/accidental adjacency.
- **Bottom-left:** common **but** low cohesion → generic function-word combinations.

**Filters applied:** `speaker_count > 1`, `frequency_in_session > 3`, `ngram_length > 1`.  
**Color:** `ngram_length` (bigrams vs trigrams).


In [None]:
from src.mi_visualization import ngram_examples_grid

# Keep multi-word n-grams used by more than one speaker and occurring more than 3 times
df = session_ngrams[(session_ngrams['speaker_count'] > 1) & (session_ngrams['frequency_in_session'] > 3) & (session_ngrams['ngram_length'] > 1)]

table_df, fig = ngram_examples_grid(
    df=df,
    x_col='mi_in_session', x_mode='continuous', x_binning='linear', x_bins=6,
    y_col='frequency_rank_global',   y_mode='continuous', y_binning='log', y_bins=7,
    color_col='ngram_length', color_mode='categorical', color_scheme='tableau20b',    
    examples_per_cell=6, wrap_width=22, figsize=(12, 14), tick_fontsize=12, label_fontsize=12, cell_text_fontsize=9,
    example_order_col = "frequency_rank_global",
    cell_fill='#e9edf2', cell_fill_alpha=0.7
)
plt.show()

### Normalising MI scores by n-gram length to make them comparable

Longer n-grams naturally have different MI distributions than shorter ones. A three-word phrase has more opportunity to be "surprising" than a two-word phrase, simply because it involves more words whose co-occurrence could be unexpected.

We address this with z-score normalization, which converts each MI score into the number of standard deviations from the mean of its length group:

$$z = \frac{MI - \text{mean}_{mi}}{\text{std}_{mi}}$$

We apply the following steps:

- Calculate mean and std for bigrams and trigrams separately
- Convert each MI to z-score: (MI - mean) / std
- Now both distributions are centered and comparable
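
The steps above can be sketched with a pandas group-wise transform. The toy values are invented, and pandas' default sample standard deviation (`ddof=1`) is an assumption; `normalise_values` may use the population version instead.

```python
import pandas as pd

toy = pd.DataFrame({
    "ngram_length": [2, 2, 2, 3, 3, 3],
    "mi_in_session": [1.0, 2.0, 3.0, 6.0, 8.0, 10.0],
})

# Mean and std are computed separately for bigrams and trigrams
grouped = toy.groupby("ngram_length")["mi_in_session"]
toy["mi_in_session_z"] = (
    (toy["mi_in_session"] - grouped.transform("mean")) / grouped.transform("std")
)

print(toy)
```

Although the raw trigram scores are all higher than the bigram scores, both groups end up centered on 0 with unit spread, so a z-score of 1 means the same thing for either length.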


In [None]:
from src.mi_visualization import normalise_values

# Z-score MI per n-gram length (common choice)
session_ngrams = normalise_values(
    session_ngrams,
    freq_col='mi_in_session',
    out_col='mi_in_session_z',
    method='zscore',
    group_by='ngram_length'
)

session_ngrams = normalise_values(
    session_ngrams,
    freq_col='frequency_in_session',
    out_col='frequency_z_within_length',
    method='zscore',
    group_by='ngram_length'
)

session_ngrams = normalise_values(
    session_ngrams,
    freq_col='frequency_in_session',
    out_col='frequency_z_global',
    method='zscore',
    group_by=None
)


In [None]:
from src.mi_visualization import plot_mi_histograms_before_after

fig, axes = plot_mi_histograms_before_after(
    df=session_ngrams,
    length_col='ngram_length',
    lengths=(2,3),
    mi_col='mi_in_session',
    mi_z_col='mi_in_session_z',
    bins=30,
    density=True,
    alpha=0.6,
    figsize=(12,4)
)
plt.show()

**Distributions: raw PMI vs normalised PMI (PMI z)**

- **Before** (raw `mi_in_session`): trigrams tend to sit higher than bigrams due to length effects.
- **After** (`mi_in_session_z`): both lengths are on a **shared scale**.
  - **z = 0**: Average association for that length
  - **z > 2**: Unusually strong association
  - **z < -2**: Likely random co-occurrence

This confirms that z-scoring by length makes cohesion comparable across bigrams/trigrams.

## 7. Example ngrams across different MI values
Let's see what high and low MI phrases actually look like:

- **x (→ right)**: `mi_in_session_z` — **Cohesion (PMI z)** within length.
- **y (↑ up)**: `frequency_rank_within_length` (log-binned) — **less frequent** as you go up.






**What these MI ranges mean**

- Very Low (z < -2): Random adjacencies:
  - "and the", "it was", "to be": Words that just happened to be next to each other
- Low to Average (-2 to 0): Weak associations
  - Common function word combinations
  - No special conceptual bond
- Above Average to High (0 to 2): Meaningful combinations
  - "makes sense", "opportunity areas"
  - Words that preferentially occur together
- Very High (z > 2): Strong conceptual units
  - "status symbol", "sexy commitment"
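
If it helps to label candidates by these ranges, the bands can be assigned with `pd.cut`. The cut points follow the list above; the label strings and the toy z-scores are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy normalised MI values
z = pd.Series([-2.5, -0.7, 0.4, 1.9, 2.6], name="mi_in_session_z")

# Intervals are right-closed by default: (-inf, -2], (-2, 0], (0, 2], (2, inf)
bands = pd.cut(
    z,
    bins=[-np.inf, -2, 0, 2, np.inf],
    labels=["very low", "low to average", "above average to high", "very high"],
)

print(pd.DataFrame({"z": z, "band": bands}))
```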


In [None]:
from src.mi_visualization import ngram_examples_grid

df = session_ngrams[
   # (session_ngrams['mi_in_session_z'] > 0) & 
    (session_ngrams['speaker_count'] > 2) & 
    (session_ngrams['frequency_in_session'] > 3) & 
    (session_ngrams['ngram_length'] > 1)]

table_df, fig = ngram_examples_grid(
    df=df,
    x_col='mi_in_session_z', x_mode='continuous', x_binning='linear', x_bins=6,
    y_col='frequency_rank_within_length',   y_mode='continuous', y_binning='log', y_bins=10,
    color_col='ngram_length', color_mode='categorical', color_scheme='tableau20b',    
    examples_per_cell=6, wrap_width=20, figsize=(12, 16),
    y_label='Frequency ranks of ngrams within same length (log bins)', tick_fontsize=12, label_fontsize=12, cell_text_fontsize=9,
    x_label='Normalised MI (linear bins)', 
    cell_fill='#e9edf2', cell_fill_alpha=0.7
)
plt.show()

### Narrowing down the range of MI and frequency

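We narrow the candidates to n-grams with normalised MI above 0.5 (`mi_in_session_z > 0.5`), since the lower-MI cells of the grid above seem to contain no candidates. The region beyond frequency rank 450 also seems to contain none, but the cell below does not filter on rank.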

In [None]:
from src.mi_visualization import ngram_examples_grid

df = session_ngrams[
    (session_ngrams['mi_in_session_z'] > 0.5) & 
    (session_ngrams['speaker_count'] > 1) & 
    (session_ngrams['frequency_in_session'] > 3) & 
    (session_ngrams['ngram_length'] > 1)]

table_df, fig = ngram_examples_grid(
    df=df,
    x_col='mi_in_session_z', x_mode='continuous', x_binning='linear', x_bins=6,
    y_col='frequency_rank_within_length',   y_mode='continuous', y_binning='log', y_bins=10,
    color_col='ngram_length', color_mode='categorical', color_scheme='tableau20b',    
    examples_per_cell=6, wrap_width=22, figsize=(12, 16),
    y_label='Frequency ranks of ngrams within same length (log bins)', tick_fontsize=12, label_fontsize=12, cell_text_fontsize=9,
    x_label='Normalised MI (linear bins)', 
    cell_fill='#e9edf2', cell_fill_alpha=0.7
)
plt.show()

## 8. Save outputs to use in the next notebook


In [None]:
# Save unified session_ngrams with agreed naming for downstream notebooks
session_ngrams.to_csv('outputs/session_ngrams_1.csv', index=False)
session_turns.to_csv('outputs/session_turns.csv', index=False)
