# 50 · Context-Aware Dictionary Generation & Pruning  
_Last updated 2025-05-03_

**Pipeline recap**

1. **Generate** topic-specific n-grams with GPT-4o-mini.  
2. **Classify / prune** those n-grams to keep only high-precision “marker terms.”  
3. (Optional) **Substring pruning** for even tighter dictionaries.

We demo on a micro grid (3 topics × 2 years × “twitter-specific” × bigrams and trigrams = 12 prompts) so the
whole run finishes in < 1 min on a workshop laptop.



### If you use this code, please cite the paper: 

- Garg, P. and Fetzer, T., 2025. **Political expression of academics on Twitter**. Nature Human Behaviour. DOI: 10.1038/s41562-025-02199-1

(The above paper is forthcoming. Pre-print available at [https://www.researchsquare.com/article/rs-4480504/v1](https://www.researchsquare.com/article/rs-4480504/v1).)



## API key (same pattern as earlier notebooks)

* Looks for an environment variable `OPENAI_API_KEY`.  
* Else reads `key/openai_key.txt` (one line).  
* Raises a helpful error if neither is found.


In [1]:
# %pip -q install --upgrade openai python-dotenv

import os, pathlib, json, random, pandas as pd
import textwrap
from openai import OpenAI

# Fall back to key/openai_key.txt when the environment variable is not set
key_path = pathlib.Path("key/openai_key.txt")
if os.getenv("OPENAI_API_KEY") is None and key_path.exists():
    os.environ["OPENAI_API_KEY"] = key_path.read_text().strip()

if not os.getenv("OPENAI_API_KEY"):
    raise ValueError(
        "No API key found.  Either export OPENAI_API_KEY or create key/openai_key.txt"
    )

client = OpenAI()


## 1 · Define the mini grid

At production scale you’d iterate over 20–30 topics × years × n-gram sizes;  
here we keep a 12-row grid (3 topics × 2 years × 2 n-gram sizes) to spotlight the logic.
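
To get a feel for how quickly the grid grows, here is a minimal sketch using `itertools.product`, which yields the same cross product as a nested comprehension. The `build_grid` helper and the larger topic, year, and usage lists are hypothetical placeholders for sizing only, not the lists used in the paper.

```python
from itertools import product

def build_grid(topics, years, usage_types, ngrams):
    """Return every (topic, year, usage_type, ngram) combination as a list of dicts."""
    keys = ("topic", "year", "usage_type", "ngram")
    return [dict(zip(keys, combo)) for combo in product(topics, years, usage_types, ngrams)]

# The mini grid used in this notebook
mini = build_grid(
    ["Climate Action", "Immigration", "Gun Control"],
    [2024, 2016],
    ["twitter-specific"],
    ["bigrams", "trigrams"],
)
print(len(mini))  # 3 * 2 * 1 * 2 = 12

# Hypothetical production-sized grid (placeholder values, for sizing only)
big = build_grid(
    [f"topic_{i}" for i in range(25)],
    range(2016, 2025),
    ["twitter-specific", "everyday"],
    ["unigrams", "bigrams", "trigrams"],
)
print(len(big))  # 25 * 9 * 2 * 3 = 1350
```

Each row is one generation call, so the grid size is a direct estimate of API requests before classification even starts.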


In [2]:
topics      = ["Climate Action", "Immigration", "Gun Control"]
years       = [2024, 2016]
usage_types = ["twitter-specific"]     # social-media slant
ngrams      = ["bigrams", "trigrams"]  # two- and three-word phrases

grid = [
    (topic, year, usage, ngram)
    for topic in topics
    for year in years
    for usage in usage_types
    for ngram in ngrams
]
df_grid = pd.DataFrame(grid, columns=["topic", "year", "usage_type", "ngram"])
df_grid


Unnamed: 0,topic,year,usage_type,ngram
0,Climate Action,2024,twitter-specific,bigrams
1,Climate Action,2024,twitter-specific,trigrams
2,Climate Action,2016,twitter-specific,bigrams
3,Climate Action,2016,twitter-specific,trigrams
4,Immigration,2024,twitter-specific,bigrams
5,Immigration,2024,twitter-specific,trigrams
6,Immigration,2016,twitter-specific,bigrams
7,Immigration,2016,twitter-specific,trigrams
8,Gun Control,2024,twitter-specific,bigrams
9,Gun Control,2024,twitter-specific,trigrams
10,Gun Control,2016,twitter-specific,bigrams
11,Gun Control,2016,twitter-specific,trigrams


## 2 · Prompt helpers


In [3]:
def prompt_for_generation(topic, year, usage_type, ngram):
    audience = "Twitter" if usage_type == "twitter-specific" else "everyday usage"
    return textwrap.dedent(f"""
        Provide a list of {ngram} associated with **{topic}** discourse in {year}.
        Focus on phrases or hashtags you would expect on {audience}.
        Return ONLY the {ngram}, comma-separated, no commentary.
    """).strip()

def system_prompt_gen():
    return "You are an assistant that invents realistic social-media n-grams for text mining."


## 3 · Generation phase   📤

* **Model**: `gpt-4o-mini` (creative but still on spec).  
* **Temperature 0.7**: encourage variety.  
* Split the returned comma-separated string → tidy DataFrame.
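
The splitting step can be tried offline before spending any tokens. The sketch below is a slightly more defensive variant of a plain `split(",")`: it assumes the model sometimes ignores the format and returns newlines, numbering, bullets, or quotes. `parse_terms` is a hypothetical helper, not part of the notebook.

```python
import re

def parse_terms(raw: str) -> list[str]:
    """Split a model reply into clean, lower-cased, de-duplicated terms."""
    # Treat commas and newlines as separators, since replies do not always follow the format.
    pieces = re.split(r"[,\n]+", raw)
    seen, terms = set(), []
    for piece in pieces:
        # Drop leading list numbering ("1. ") or bullets ("- "), then quotes and whitespace.
        term = re.sub(r"^\s*(?:\d+[.)]\s*|[-*•]\s+)", "", piece)
        term = term.strip(" \t\"'`").lower()
        if term and term not in seen:
            seen.add(term)
            terms.append(term)
    return terms

reply = "#ClimateAction, renewable energy,\n1. Carbon Footprint, renewable energy, "
print(parse_terms(reply))
# ['#climateaction', 'renewable energy', 'carbon footprint']
```

De-duplicating here keeps the first occurrence, so the order of the model's reply is preserved.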


In [4]:
openai_model_gen = "gpt-4o-mini"
candidates = []

for _, row in df_grid.iterrows():
    # iterrows() yields pandas Series, so to_dict() gives the keyword arguments directly
    user_prompt = prompt_for_generation(**row.to_dict())
    
    resp = client.chat.completions.create(
        model=openai_model_gen,
        messages=[
            {"role": "system", "content": system_prompt_gen()},
            {"role": "user",   "content": user_prompt}
        ],
        temperature=0.7,
        max_tokens=120
    )
    
    terms = [t.strip().lower() for t in resp.choices[0].message.content.split(",") if t.strip()]
    for term in terms:
        candidates.append({**row.to_dict(), "term": term})

df_candidates = pd.DataFrame(candidates)
df_candidates.head()


Unnamed: 0,topic,year,usage_type,ngram,term
0,Climate Action,2024,twitter-specific,bigrams,#climateaction
1,Climate Action,2024,twitter-specific,bigrams,renewable energy
2,Climate Action,2024,twitter-specific,bigrams,carbon footprint
3,Climate Action,2024,twitter-specific,bigrams,climate crisis
4,Climate Action,2024,twitter-specific,bigrams,global warming


In [5]:
df_candidates

Unnamed: 0,topic,year,usage_type,ngram,term
0,Climate Action,2024,twitter-specific,bigrams,#climateaction
1,Climate Action,2024,twitter-specific,bigrams,renewable energy
2,Climate Action,2024,twitter-specific,bigrams,carbon footprint
3,Climate Action,2024,twitter-specific,bigrams,climate crisis
4,Climate Action,2024,twitter-specific,bigrams,global warming
...,...,...,...,...,...
235,Gun Control,2016,twitter-specific,trigrams,gun safety measures
236,Gun Control,2016,twitter-specific,trigrams,prevent gun deaths
237,Gun Control,2016,twitter-specific,trigrams,responsible gun ownership
238,Gun Control,2016,twitter-specific,trigrams,we need change


In [6]:
# show all terms in climate action discourse
df_candidates[df_candidates.topic == "Climate Action"].term.unique()


array(['#climateaction', 'renewable energy', 'carbon footprint',
       'climate crisis', 'global warming', 'sustainable development',
       'climate justice', 'green initiatives', 'eco-friendly solutions',
       'climate policy', 'climate activists', 'clean energy',
       'fossil fuels', 'environmental impact', 'climate emergency',
       'climate change', 'climate awareness', 'carbon neutrality',
       'climate solutions', 'climate resilience', '#climateactionnow',
       'climate justice movement', 'renewable energy solutions',
       'sustainable future goals', 'green energy initiatives',
       'global warming awareness', 'carbon footprint reduction',
       'environmental policy reform', 'clean energy revolution',
       'climate crisis urgency', 'eco-friendly practices',
       'climate change impacts', 'climate activism strategies',
       'biodiversity conservation efforts', 'zero emissions target',
       'youth climate leaders', 'nature-based solutions',
       'climate 

## 4 · Classification phase   🔍

We label each candidate as **True / False** for “strong indicator.”  
A strict one-field JSON schema makes parsing trivial.
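
With a strict schema the happy path is a single `json.loads` call. A reply can still arrive empty or cut off (for instance when `max_tokens` is reached), so a small guard is worth having in longer runs. This is a minimal sketch; `parse_label` is a hypothetical helper, and returning `None` for unusable replies is a design assumption.

```python
import json

def parse_label(content, key="is_strong_indicator"):
    """Return True/False from a JSON reply, or None if the reply is unusable."""
    try:
        payload = json.loads(content)
    except (TypeError, json.JSONDecodeError):
        # Empty, None, or truncated reply
        return None
    value = payload.get(key) if isinstance(payload, dict) else None
    # Accept only real booleans; anything else counts as unusable.
    return value if isinstance(value, bool) else None

print(parse_label('{"is_strong_indicator": true}'))  # True
print(parse_label('{"is_strong_indicator": tr'))     # None
print(parse_label(None))                              # None
```

Rows that come back as `None` can be retried or dropped explicitly, rather than crashing the loop halfway through.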


In [7]:
def sys_cls():
    return (
        "You are deciding whether a term is a **highly specific keyword** for a given topic.\n"
        "Answer JSON `{ \"is_strong_indicator\": true|false }`.\n"
        "Only `true` if the term clearly signals the topic and is rarely used elsewhere."
    )

def user_cls(term, topic):
    return f"Is '{term}' specific enough to unambiguously tag a tweet about **{topic}**?"

cls_schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "binary_label",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "is_strong_indicator": {"type": "boolean"}
            },
            "required": ["is_strong_indicator"],
            "additionalProperties": False
        }
    }
}

openai_model_cls = "gpt-4o-mini"
labels = []

for row in df_candidates.itertuples(index=False):
    resp = client.chat.completions.create(
        model=openai_model_cls,
        messages=[
            {"role": "system", "content": sys_cls()},
            {"role": "user",   "content": user_cls(row.term, row.topic)}
        ],
        temperature=0,  # near-deterministic; small run-to-run differences are still possible
        max_tokens=30,
        response_format=cls_schema
    )
    label = json.loads(resp.choices[0].message.content)["is_strong_indicator"]
    labels.append({**row._asdict(), "is_strong_indicator": label})

df_labels = pd.DataFrame(labels)
df_labels.head()


Unnamed: 0,topic,year,usage_type,ngram,term,is_strong_indicator
0,Climate Action,2024,twitter-specific,bigrams,#climateaction,False
1,Climate Action,2024,twitter-specific,bigrams,renewable energy,False
2,Climate Action,2024,twitter-specific,bigrams,carbon footprint,True
3,Climate Action,2024,twitter-specific,bigrams,climate crisis,True
4,Climate Action,2024,twitter-specific,bigrams,global warming,True


In [8]:
# show counts of is_strong_indicator
df_labels['is_strong_indicator'].value_counts()


is_strong_indicator
True     133
False    107
Name: count, dtype: int64

In [9]:

# break it by topic
df_labels.groupby('topic')['is_strong_indicator'].value_counts()

topic           is_strong_indicator
Climate Action  True                   44
                False                  36
Gun Control     False                  50
                True                   30
Immigration     True                   59
                False                  21
Name: count, dtype: int64

## 5 · Prune the dictionary

* Keep only `True`.  
* Drop duplicates.  
* Optional: substring pruning (drops “carbon tax” when the shorter “tax” is already in the list).


In [10]:
import pandas as pd

# 1) Filter & dedupe
df_keep = (
    df_labels[df_labels.is_strong_indicator]
    .drop_duplicates(subset=["topic", "term"])
)

# 2) Define a function that takes a list of terms and returns the pruned list
def prune_substrings_list(terms):
    terms_sorted = sorted(terms, key=len)
    keep = []
    for term in terms_sorted:
        # keep the term only if no shorter, already-kept term appears inside it as whole words
        if not any(f" {k} " in f" {term} " for k in keep):
            keep.append(term)
    return keep

# 3) Group by topic, apply the prune function, explode the list into rows
df_final = (
    df_keep
    .groupby("topic")["term"]
    .apply(prune_substrings_list)  # returns a list of pruned terms per topic
    .explode()                     # flattens it so each term is its own row
    .reset_index(name="term")      # moves “topic” out of the index; the values become the “term” column
)

# 4) Now you can count and inspect
print("Terms per topic after pruning:")
display(df_final.groupby("topic")["term"].count())

df_final.head(10)


Terms per topic after pruning:


topic
Climate Action    28
Gun Control       23
Immigration       38
Name: term, dtype: int64

Unnamed: 0,topic,term
0,Climate Action,fossil fuels
1,Climate Action,climate crisis
2,Climate Action,global warming
3,Climate Action,climate justice
4,Climate Action,paris agreement
5,Climate Action,carbon footprint
6,Climate Action,carbon emissions
7,Climate Action,climate emergency
8,Climate Action,carbon neutrality
9,Climate Action,#climateactionnow
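
A toy run makes the whole-word behaviour of the pruning step easier to see. The function below is a standalone copy of the same logic under a different (hypothetical) name, and the input terms are made up for illustration.

```python
def prune_substrings_demo(terms):
    """Keep a term only if no shorter kept term occurs inside it as whole words."""
    keep = []
    for term in sorted(terms, key=len):
        # Padding with spaces makes the containment check respect word boundaries.
        if not any(f" {k} " in f" {term} " for k in keep):
            keep.append(term)
    return keep

toy_terms = [
    "carbon tax",
    "tax",
    "taxation policy",
    "climate crisis",
    "climate crisis urgency",
]
result = prune_substrings_demo(toy_terms)
print(result)
# ['tax', 'climate crisis', 'taxation policy']
```

“carbon tax” and “climate crisis urgency” are dropped because a shorter kept term already matches every tweet they would match, while “taxation policy” survives because “tax” is not a whole word inside it.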


## 6 · Interpretation

* **Climate Action** → did we get markers like `#climatecrisis` or `carbon tax`?  
* **Immigration** → any unexpected neutral words?  
* **Gun Control** → watch for polarity (e.g. “2a rights” = anti-control).

> **Scale-up tip**: move both loops to the OpenAI **batch** endpoint, then fan out
> across more years / n-gram sizes with parallel jobs.
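
As a rough sketch of the batch route, each request becomes one JSON line with `custom_id`, `method`, `url`, and `body` fields, which is the input shape the Batch API expects as far as I understand it. `build_batch_lines` is a hypothetical helper. For brevity the body carries only the user message, so the system prompt and `response_format` from the classification cell would need to be added. The upload and submit calls are commented out because they need a live API key.

```python
import json

def build_batch_lines(rows, model="gpt-4o-mini"):
    """Turn (term, topic) pairs into JSONL request lines for a batch input file."""
    lines = []
    for i, (term, topic) in enumerate(rows):
        request = {
            "custom_id": f"cls-{i}",  # used to match results back to input rows
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "temperature": 0,
                "max_tokens": 30,
                "messages": [
                    {
                        "role": "user",
                        "content": f"Is '{term}' specific enough to unambiguously "
                                   f"tag a tweet about **{topic}**?",
                    }
                ],
            },
        }
        lines.append(json.dumps(request))
    return lines

lines = build_batch_lines([
    ("climate crisis", "Climate Action"),
    ("we need change", "Gun Control"),
])
print(lines[0][:60])

# Writing and submitting the file (requires an API key, so left commented out):
# pathlib.Path("batch_input.jsonl").write_text("\n".join(lines) + "\n")
# batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
# batch = client.batches.create(
#     input_file_id=batch_file.id,
#     endpoint="/v1/chat/completions",
#     completion_window="24h",
# )
```

Results come back asynchronously as another JSONL file, so `custom_id` is the only link between a label and its row.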
