# Notebook 2: Finding Domain-Specific Phrases by Comparing with General English

Many meaningful phrases aren’t specific to the design session. "Makes sense" has high MI but appears everywhere and tells us nothing about the context of the design session. What matters here is **domain specificity**: phrases that are **proportionally more common** in the design session than in general English. In this notebook we compare **relative frequencies** between the design corpus and a general-English reference to find such phrases, quantify their **effect size** (log ratio), and assess **statistical significance** (log-likelihood ratio; G-test).

## 1. Inputs & Setup


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import unicodedata

from src.ngram_extraction import extract_ngrams_from_corpus
from src.text_utils import clean_text

pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 50)
sns.set_context('talk')


In [None]:
# Optional - Enable automatic reloading of modules when source code changes
# This eliminates the need to restart the kernel when updating external .py files.
%load_ext autoreload
%autoreload 2

In [None]:
# load dataframes from previous notebook
session_ngrams = pd.read_csv('outputs/session_ngrams_1.csv')
session_turns = pd.read_csv('outputs/session_turns.csv')


## 2. Building Reference Corpus and Extracting N-grams

To provide general-language baselines for frequency and MI comparisons, several large English corpora can be used:   
- C4 (Colossal Clean Crawled Corpus) 
  - cleaned web text corpus by AllenAI (HuggingFace: `allenai/c4`). Used here via a Kaggle subset: `almirneto/corpus-c4`.  
  - very large, 300m+ lines. 
- Switchboard Corpus 
  - transcribed English telephone conversations, representing spoken dialogue patterns.  
- OpenSubtitles 
  - English movie and TV subtitles, representing informal conversational language across diverse contexts.  



### Option 1: very large corpora
 
You can place the CSV files of any of the above corpora into a folder and use the `extract_counts_for_target_ngrams_from_csv_folder` function. It can handle very large corpora such as `C4` because it counts only the session n-grams and streams the files rather than loading them all into RAM at once.
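The internals of `extract_counts_for_target_ngrams_from_csv_folder` are not shown in this notebook. As a rough sketch of the idea only, counting a fixed set of target n-grams while streaming a CSV in chunks might look like the following. The function name `count_target_ngrams_streaming`, the whitespace tokenizer, and the toy data are assumptions for illustration, not the project's implementation.

```python
import tempfile
from collections import Counter
from pathlib import Path

import pandas as pd


def count_target_ngrams_streaming(csv_path, targets, text_column="text",
                                  chunksize=2, max_n=3):
    """Hypothetical helper: count only `targets`, one CSV chunk at a time."""
    targets = set(targets)
    counts = Counter()
    # With chunksize set, read_csv yields DataFrames one chunk at a time,
    # so only a single chunk is held in memory at any moment.
    for chunk in pd.read_csv(csv_path, usecols=[text_column], chunksize=chunksize):
        for text in chunk[text_column].dropna().astype(str):
            tokens = text.lower().split()  # naive tokenizer (assumption)
            for n in range(1, max_n + 1):
                for i in range(len(tokens) - n + 1):
                    ngram = " ".join(tokens[i:i + n])
                    if ngram in targets:
                        counts[ngram] += 1
    return counts


# Toy corpus written to a temporary CSV file
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_corpus.csv"
    pd.DataFrame({"text": [
        "that makes sense to me",
        "it makes sense",
        "opportunity areas are open",
    ]}).to_csv(path, index=False)

    toy_counts = count_target_ngrams_streaming(
        path, targets=["makes sense", "opportunity areas", "sense"]
    )

print(dict(toy_counts))
```

In a sketch like this, memory use depends on the chunk size and the number of target n-grams rather than on the size of the corpus, which is what makes the approach workable for very large files.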

In [None]:
from src.ngram_extraction import extract_counts_for_target_ngrams_from_csv_folder

folder = "/Users/skaraoglu/Documents/vis-analysis-designing_senthil/data/corpora"
targets = session_ngrams['ngram'].astype(str).tolist()

reference_ngrams = extract_counts_for_target_ngrams_from_csv_folder(
    folder_path=folder,
    target_ngrams=targets,
    text_column="text",
    chunksize=250_000,           # tune to RAM
    remove_punctuation=True,
    frequency_column="frequency_in_reference"
)

# Uncomment to save the reference_ngrams dataframe to a CSV file
# reference_ngrams.to_csv('outputs/reference_ngrams.csv', index=False)

# Uncomment to load the reference_ngrams dataframe from a CSV file
# reference_ngrams = pd.read_csv('outputs/reference_ngrams.csv')

### Option 2: from single CSV
If a very large corpus is not necessary, you can load a single CSV file into a dataframe and use `extract_ngrams_from_corpus`.

In [None]:
import unicodedata 

# Load reference corpus (merged Switchboard + OpenSubtitles)
reference_turns = pd.read_csv(
    'data/switchboard_opensubtitles_merged.csv',
    dtype={'text': 'string'},          # ensure consistent string type for text
    usecols=['text'],                  # only the text column is needed
    encoding='utf-8',                  # try utf-8 first; fallback to 'latin1' if needed
    on_bad_lines='skip',               # skip ill-formed rows
    low_memory=False                   # avoid dtype inference chunking
)

# Normalize Unicode in-place to mitigate ambiguous unicode characters
reference_turns['text'] = reference_turns['text'].astype(str).map(lambda s: unicodedata.normalize('NFKC', s))
reference_turns = clean_text(reference_turns)

reference_ngrams = extract_ngrams_from_corpus(
    reference_turns,
    text_column='text',
    n_gram_min=1, 
    n_gram_max=3,
    preprocess=True,
    remove_punctuation=True
)

reference_ngrams = reference_ngrams.rename(columns={'frequency':'frequency_in_reference'})
reference_ngrams.head()



## 3. Finding session-only n-grams

N-grams that appear **only** in the design session but not in the reference corpus may show us 'neologisms', newly formed phrases. Since we have all n-grams in both the reference corpus and the design session data, we can extract which of them are only used in the session.
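The anti-join pattern used for this step can be illustrated on toy data. The frames `toy_session` and `toy_reference` and their counts are made up for illustration; the point is only how pandas `merge(..., indicator=True)` labels each row.

```python
import pandas as pd

# Made-up counts, for illustration only
toy_session = pd.DataFrame({
    "ngram": ["makes sense", "sexy commitment", "i think"],
    "frequency_in_session": [12, 5, 40],
})
toy_reference = pd.DataFrame({
    "ngram": ["makes sense", "i think", "you know"],
    "frequency_in_reference": [9000, 50000, 70000],
})

# A left merge with indicator=True adds a `_merge` column that says
# whether each session row also has a match in the reference.
merged = toy_session.merge(
    toy_reference[["ngram"]], on="ngram", how="left", indicator=True
)

# Rows marked "left_only" exist in the session but not in the reference
toy_session_only = merged.loc[merged["_merge"] == "left_only", "ngram"].tolist()
print(toy_session_only)
```

Only "sexy commitment" survives the filter, because the other two session n-grams also occur in the toy reference.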


In [None]:
# n-grams present in session_ngrams but absent from reference_ngrams
session_only_ngrams = (
    session_ngrams.merge(
        reference_ngrams[['ngram','ngram_length']].drop_duplicates(),
        on=['ngram','ngram_length'],
        how='left',
        indicator=True
    )
    .query('_merge == "left_only"')
    .sort_values('frequency_in_session', ascending=False)
    .reset_index(drop=True)
)

print(f"Total n-grams in session_ngrams: {len(session_ngrams)}")
print(f"Total n-grams unique to session (session_only_ngrams): {len(session_only_ngrams)}")
disqualified_ngrams = len(session_ngrams) - len(session_only_ngrams)
print(f"N-grams also present in reference_ngrams (not unique to the session): {disqualified_ngrams}")
session_only_ngrams.head()


In [None]:
from src.mi_visualization import create_parametric_ngram_scatter

df = session_only_ngrams[
    (session_only_ngrams['mi_in_session_z'] > 0) &
    (session_only_ngrams['speaker_count'] > 1) &
    (session_only_ngrams['frequency_in_session'] > 3) & (session_only_ngrams['frequency_in_session'] < 50) &  
    (session_only_ngrams['ngram_length'] > 1)
]

import altair as alt

chart = create_parametric_ngram_scatter(
    df=df,
    x_col='mi_in_session_z', x_mode='continuous', x_scale='linear', x_bin=False,
    y_col='frequency_rank_within_length', y_mode='continuous', y_scale='linear', y_bin=False,
    color_col='ngram_length', color_mode='categorical',
    color_scheme='tableau10',  # or 'category20', etc.
    opacity=0.8, 
    width=300, height=400
)

chart

### Example session-only ngrams

Design-only phrases might signal potential novelty but need further validation:

- Coined terms: "sexy commitment", "lifetime companion", "pockets of enjoyment" 
- Technical jargon: "co-creation session"
- Person names: "Tiffany and Hans"

In [None]:
from src.mi_visualization import ngram_examples_grid

df = session_only_ngrams[
    (session_only_ngrams['mi_in_session_z'] > 0) &
    (session_only_ngrams['speaker_count'] > 1) &
    (session_only_ngrams['frequency_in_session'] > 3) &
    (session_only_ngrams['ngram_length'] > 1)
]

table_df, fig = ngram_examples_grid(
    df=df,
    x_col='mi_in_session_z', x_mode='continuous', x_binning='linear', x_bins=6,
    y_col='frequency_rank_within_length',   y_mode='continuous', y_binning='log', y_bins=10,
    color_col='ngram_length', color_mode='categorical', color_scheme='tableau20b',    
    examples_per_cell=6, wrap_width=22, figsize=(12, 16),
    y_label='Frequency ranks of ngrams within same length (log bins)', tick_fontsize=12, label_fontsize=12, cell_text_fontsize=9,
    x_label='Normalised MI (linear bins)', 
    cell_fill='#e9edf2', cell_fill_alpha=0.7
)
plt.show()


## 4. Measuring Domain Specificity by Comparing Session Data and Reference Data

Initially, we might think phrases unique to the design corpus are most interesting. However, this misses phrases that exist in everyday language but take on special significance in the context of the design session.  



**Finding phrases that are used significantly more frequently in the session than in general English**:

- Compare phrase frequencies between design and reference corpora  
- Calculate effect sizes (how much more/less common in design)  
- Apply log-likelihood ratio test for statistical significance  

**Effect Size Formula**:

$$\text{Effect Size} = \log_2 \left( \frac{f_{design}/N_{design}}{f_{reference}/N_{reference}} \right)$$


**Interpretation**
- **Positive effect size** → overrepresented in **design**  
- **Negative** → overrepresented in **reference** (generic)  
- **Near zero** → proportionally similar


**Expected Insights**: Phrases like "opportunity areas" will show high domain specificity, while common phrases like "I think" will not—even if they have high MI.
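To make the effect size concrete, here is a small worked sketch using only the standard library. The counts are invented, and the helper name `effect_size_log_ratio` is chosen for illustration. The additive smoothing constant `alpha` mirrors the `alpha = 0.5` used in the computation later in this notebook; setting it to 0 gives the unsmoothed formula above.

```python
import math


def effect_size_log_ratio(f_session, n_session, f_reference, n_reference, alpha=0.5):
    """log2 of the ratio of (smoothed) relative frequencies."""
    rate_session = (f_session + alpha) / (n_session + 2 * alpha)
    rate_reference = (f_reference + alpha) / (n_reference + 2 * alpha)
    return math.log2(rate_session / rate_reference)


# Made-up counts: 30 hits in 10,000 session bigrams
# versus 200 hits in 1,000,000 reference bigrams
overrepresented = effect_size_log_ratio(30, 10_000, 200, 1_000_000)

# Same relative frequency in both corpora
balanced = effect_size_log_ratio(10, 1_000, 10, 1_000)

# Absent from the reference: smoothing keeps the ratio finite
unseen = effect_size_log_ratio(5, 1_000, 0, 1_000_000)

print(f"overrepresented: {overrepresented:.2f}")
print(f"balanced:        {balanced:.2f}")
print(f"unseen:          {unseen:.2f}")
```

A log ratio of 1 means the phrase is proportionally twice as frequent in the session as in the reference, and each additional unit doubles that ratio again.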

### Build comparison table on overlapping n-grams
We’ll use session frequencies from `session_ngrams` and reference frequencies from the extracted `reference_ngrams`. We align the two corpora on the **overlap set** of n-grams and bring in: 
- `frequency_in_session`, `frequency_in_reference`
- Totals per length: `total_ngrams_in_session`, `total_ngrams_in_reference`
- Derived **rates** and `effect_size_log_ratio`


In [None]:
# Build frequency_comparison on overlapping n-grams using unified names
Overlap = set(session_ngrams['ngram']) & set(reference_ngrams['ngram'])

frequency_comparison = (
    session_ngrams[session_ngrams['ngram'].isin(Overlap)][['ngram','ngram_length','frequency_in_session']]
    .merge(
        reference_ngrams[reference_ngrams['ngram'].isin(Overlap)][['ngram','ngram_length','frequency_in_reference']],
        on=['ngram','ngram_length'], how='inner'
    )
)
print(len(frequency_comparison))
frequency_comparison.head()


### Compute totals per n-gram length and effect size (log2 ratio)
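- `total_ngrams_in_session` and `total_ngrams_in_reference` are computed per `ngram_length`.
- `effect_size_log_ratio` = log2( ((f_s + α) / (N_s + 2α)) / ((f_r + α) / (N_r + 2α)) ), where α is a small smoothing constant that keeps both rates nonzero.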


In [None]:
# Totals by length
Ns_by_len = session_ngrams.groupby('ngram_length')['frequency_in_session'].sum().to_dict()
Nr_by_len = reference_ngrams.groupby('ngram_length')['frequency_in_reference'].sum().to_dict()

frequency_comparison['total_ngrams_in_session'] = frequency_comparison['ngram_length'].map(Ns_by_len)
frequency_comparison['total_ngrams_in_reference'] = frequency_comparison['ngram_length'].map(Nr_by_len)


alpha = 0.5  # Additive smoothing constant: keeps rates nonzero so the log ratio stays finite

# Calculate relative frequencies in each corpus 
frequency_comparison['rate_in_session'] = (frequency_comparison['frequency_in_session'] + alpha) / (frequency_comparison['total_ngrams_in_session'] + 2*alpha)
frequency_comparison['rate_in_reference'] = (frequency_comparison['frequency_in_reference'] + alpha) / (frequency_comparison['total_ngrams_in_reference'] + 2*alpha)
# Compute log2 ratio: log2(rate_in_session / rate_in_reference)
frequency_comparison['effect_size_log_ratio'] = np.log2(frequency_comparison['rate_in_session'] / frequency_comparison['rate_in_reference'])
frequency_comparison[['ngram','ngram_length','frequency_in_session','frequency_in_reference','effect_size_log_ratio', 'rate_in_session']].head(10)


### Testing Statistical significance: Log-likelihood ratio (G-test)
Effect size shows magnitude, but we also need statistical confidence. The log-likelihood ratio (G-test) tells us whether observed differences could be due to chance.

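We apply the standard G-test per phrase using a 2x2 table (phrase vs. all other n-grams of the same length, session vs. reference) and mark significance at p < 0.05. Small p-values indicate differences that are unlikely to arise from sampling noise alone. This guards against over-interpreting noisy variation, especially for rare phrases.

The test compares the observed frequencies with the frequencies we would expect if the phrase were equally likely in both corpora. Under that null hypothesis, G approximately follows a chi-square distribution with 1 degree of freedom:

- **G > 3.84**: significant at the 5% level (the chi-square critical value for df = 1)
- **p < 0.05**: if there were no real difference, a gap at least this large would occur less than 5% of the time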
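The observed-versus-expected logic can be worked through by hand on a single 2x2 table. The sketch below uses only the standard library; the names `g_statistic` and `p_value_df1` and the counts are invented for illustration. For 1 degree of freedom, the chi-square tail probability equals `erfc(sqrt(G / 2))`, which lets the example avoid SciPy.

```python
import math


def g_statistic(k1, n1, k2, n2):
    """G = 2 * sum(O * ln(O / E)) over the four cells of the 2x2 table."""
    observed = [k1, n1 - k1, k2, n2 - k2]
    p_pooled = (k1 + k2) / (n1 + n2)  # shared rate under the null hypothesis
    expected = [n1 * p_pooled, n1 * (1 - p_pooled),
                n2 * p_pooled, n2 * (1 - p_pooled)]
    # Cells with an observed count of 0 contribute nothing to the sum
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def p_value_df1(g):
    """Chi-square survival function for 1 degree of freedom."""
    return math.erfc(math.sqrt(max(g, 0.0) / 2))


# Made-up counts: phrase occurs 30 times in 10,000 session bigrams
# and 200 times in 1,000,000 reference bigrams
g_specific = g_statistic(30, 10_000, 200, 1_000_000)

# Same relative frequency in both corpora: no evidence of a difference
g_generic = g_statistic(10, 1_000, 100, 10_000)

for label, g in [("specific", g_specific), ("generic", g_generic)]:
    print(f"{label}: G = {g:.2f}, p = {p_value_df1(g):.3g}, significant = {p_value_df1(g) < 0.05}")
```

The first phrase clears the 3.84 threshold by a wide margin, while the second, with identical rates in both corpora, yields a G of essentially zero.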




In [None]:
from scipy.stats import chi2

# Apply log-likelihood ratio test 

def log_likelihood_ratio(k1, n1, k2, n2, eps=1e-12):
    p = (k1 + k2) / (n1 + n2)
    p1 = k1 / n1
    p2 = k2 / n2
    def safe_log(x):
        return np.log(max(x, eps))
    ll = 2 * (
        k1 * safe_log(p1 / p) + (n1 - k1) * safe_log((1 - p1) / (1 - p)) +
        k2 * safe_log(p2 / p) + (n2 - k2) * safe_log((1 - p2) / (1 - p))
    )
    return ll

frequency_comparison['G'] = frequency_comparison.apply(
    lambda r: log_likelihood_ratio(
        k1=int(r['frequency_in_session']), n1=int(r['total_ngrams_in_session']),
        k2=int(r['frequency_in_reference']), n2=int(r['total_ngrams_in_reference'])
    ), axis=1
)
# Convert G to p-values (chi-square, df=1); sf is the survival function (1 - cdf) and stays accurate for tiny p-values
frequency_comparison['p_value'] = chi2.sf(frequency_comparison['G'], df=1)

# Mark significance threshold (typically p < 0.05)
frequency_comparison['significant'] = frequency_comparison['p_value'] < 0.05

frequency_comparison[['ngram','effect_size_log_ratio','G','p_value','significant']].head(10)


In [None]:
session_ngrams = session_ngrams.merge(
    frequency_comparison.loc[:, ['ngram', 'effect_size_log_ratio', 'significant', 'rate_in_session']], 
    on='ngram', 
    how='left', 
    suffixes=('', '_frequency_comparison')
)

In [None]:
session_ngrams

## 5. Understanding the results

Visualization reveals:

- High effect size + significant: Strong domain specificity
- High effect size + not significant: Possibly interesting but unreliable
- Near zero effect size: Common to both corpora


This filtering narrows the candidate pool by removing everyday English expressions while preserving design-relevant language.

In [None]:
from src.mi_visualization import create_parametric_ngram_scatter

df = session_ngrams[
    (session_ngrams['mi_in_session_z'] > 0) &
    (session_ngrams['speaker_count'] > 1) &
    (session_ngrams['frequency_in_session'] > 3) & (session_ngrams['frequency_in_session'] < 50) &  
    (session_ngrams['ngram_length'] > 1)
    
]

import altair as alt

chart = create_parametric_ngram_scatter(
    df=df,
    x_col='effect_size_log_ratio', x_mode='continuous', x_scale='linear', x_bin=False,
    y_col='frequency_rank_within_length', y_mode='continuous', y_scale='linear', y_bin=False,
    color_col='significant', color_mode='categorical',
    color_scheme='viridis',  # or 'category20', etc.
    opacity=0.7, 
    width=300, height=400
)


chart

Below we restrict to phrases that are both overrepresented in design (`effect_size_log_ratio > 0`) and statistically significant (`significant == True`).



In [None]:
from src.mi_visualization import create_parametric_ngram_scatter


df = session_ngrams[
    (session_ngrams['mi_in_session_z'] > 0) &
    (session_ngrams['speaker_count'] > 1) &
    (session_ngrams['frequency_in_session'] > 3) & (session_ngrams['frequency_in_session'] < 100) &
    (session_ngrams['frequency_rank_within_length'] > 100) &
    (session_ngrams['ngram_length'] > 1)&
    (session_ngrams['significant'] == True)
    & (session_ngrams['effect_size_log_ratio'] > 0)
    
]

import altair as alt

chart = create_parametric_ngram_scatter(
    df=df,
    x_col='effect_size_log_ratio', x_mode='continuous', x_scale='linear', x_bin=False,
    y_col='frequency_rank_within_length', y_mode='continuous', y_scale='linear', y_bin=False,
    color_col='mi_in_session_z', color_mode='continuous',
    color_scheme='reds',  # or 'category20', etc.
    opacity=0.7, 
    width=300, height=400
)

chart

### Example n-grams

- **x (→ right)**: `effect_size_log_ratio` — higher values = **more overrepresented in design**; negatives = overrepresented in **reference**.  
- **y (↑ up)**: `frequency_rank_within_length` (log-binned) — **less frequent** as you go up (rank 1 = most frequent within that length).  
- **color**: `mi_in_session_z` (Reds) — **darker = more cohesive** (higher PMI z) in the design corpus.

**How to read**
- **Bottom-right**: overrepresented & relatively frequent → strong domain terms.  
- **Top-right**: overrepresented but rare → specialized or emerging phrases (inspect).  
- **Left side**: generic or everyday; more common in reference than design.  
- **Color** helps prioritize **cohesive** phrases among the design-specific ones.


In [None]:
from src.mi_visualization import ngram_examples_grid

df = session_ngrams[
    (session_ngrams['mi_in_session_z'] > 0.6) &
    (session_ngrams['speaker_count'] > 1) &
    (session_ngrams['frequency_in_session'] > 3) & (session_ngrams['frequency_in_session'] < 100) &
    (session_ngrams['frequency_rank_within_length'] > 100) &
    (session_ngrams['ngram_length'] > 1)&
    (session_ngrams['significant'] == True)
    & (session_ngrams['effect_size_log_ratio'] > 0)
    
]

table_df, fig = ngram_examples_grid(
    df=df,
    x_col='effect_size_log_ratio', x_mode='continuous', x_binning='linear', x_bins=6,
    y_col='frequency_rank_within_length',   y_mode='continuous', y_binning='log', y_bins=10,
    color_col='mi_in_session_z', color_mode='continuous', color_scheme='Reds',
    
    examples_per_cell=6, wrap_width=20, figsize=(9, 17),
    tick_fontsize=12, label_fontsize=14, cell_text_fontsize=9,
    cell_fill='#e9edf2', cell_fill_alpha=0.7
)
plt.show()
len(df)


## 6. Save
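We save three files:
- the enriched `session_ngrams`, now with `effect_size_log_ratio`, `significant`, and `rate_in_session` (plus any cohesion fields carried over from Notebook 1)
- `frequency_comparison`, which also holds `G` and `p_value`
- `reference_ngrams`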


In [None]:
frequency_comparison.to_csv('outputs/frequency_comparison.csv', index=False)
session_ngrams.to_csv('outputs/session_ngrams_2_effect_size.csv', index=False)
reference_ngrams.to_csv('outputs/reference_ngrams.csv', index=False)