<a href="https://colab.research.google.com/github/huzaifakhallid/ExamPrep-AI/blob/main/semantic_drift_and_bias_nlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Semantic Drift & Bias Analysis in Embedding Models using Contextual Representations**

## **Abstract**
Language meaning depends on domain and context. When we train embedding models on different corpora,
we implicitly assume that semantic representations are stable and comparable. In practice, embeddings can
exhibit **semantic drift** (meaning shift across domains) and encode **social biases** present in text.

This notebook conducts a study of:
1) **Semantic drift** across *formal news* vs *informal social media* text  
2) **Representational bias** measured with WEAT-style tests  

We compare:
- Static embeddings (Word2Vec, FastText)
- Contextual embeddings (BERT)

Unlike a toy demonstration, this notebook includes:
- **Embedding space alignment** (Orthogonal Procrustes) for valid cross-corpus comparison
- **Bootstrap confidence intervals** for drift estimates
- **Permutation tests** for bias significance



## **1. Research Questions and Hypotheses**

### **RQ1 — Semantic Drift Across Domains**
Do embeddings learned from different domains encode measurably different meanings for the same words?

**H1:** Static embeddings will show **larger drift** across domains than contextual embeddings.
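
One way H1 could be checked, sketched under stated assumptions: given a per-word drift score for each model (for example, 1 minus the cosine similarity between a word's aligned vectors in the two domains), a percentile bootstrap over words gives a confidence interval for the mean drift. The helper name `bootstrap_mean_ci` is hypothetical, and the drift scores below are synthetic placeholders, not results from this notebook.

```python
import numpy as np

def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample word indices with replacement; one row per bootstrap replicate.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return values.mean(), lo, hi

# Synthetic per-word drift scores, for illustration only (assumption).
rng = np.random.default_rng(0)
static_drift = rng.uniform(0.3, 0.7, size=200)
contextual_drift = rng.uniform(0.1, 0.4, size=200)

static_ci = bootstrap_mean_ci(static_drift)
contextual_ci = bootstrap_mean_ci(contextual_drift)
print("static:     mean=%.3f  95%% CI=(%.3f, %.3f)" % static_ci)
print("contextual: mean=%.3f  95%% CI=(%.3f, %.3f)" % contextual_ci)
```

If the two intervals do not overlap, that is evidence (not proof) that the models differ in mean drift; a paired comparison over the same word list would be a stricter check.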

---

### **RQ2 — Bias Differences Across Domains**
Do embeddings trained on informal social text encode stronger bias than embeddings trained on formal news text?

**H2:** Social media embeddings will show **higher WEAT effect sizes** and more significant bias tests than news embeddings.
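
For reference, here is a minimal sketch of the quantities H2 refers to, following the standard WEAT formulation: an effect size comparing two target word sets `X`, `Y` against two attribute sets `A`, `B`, plus a one-sided permutation test. The helper names and the toy vectors are assumptions made for illustration; they are not this notebook's word lists or results.

```python
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    """s(w, A, B): mean cosine to attribute set A minus mean cosine to B."""
    return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    sx = np.array([association(x, A, B) for x in X])
    sy = np.array([association(y, A, B) for y in Y])
    # Normalise by the standard deviation over all target words.
    # ddof=1 is a choice made here; some implementations use ddof=0.
    return (sx.mean() - sy.mean()) / np.concatenate([sx, sy]).std(ddof=1)

def weat_permutation_p(X, Y, A, B, n_perm=2000, seed=42):
    """One-sided p-value from random equal-size re-partitions of X and Y."""
    rng = np.random.default_rng(seed)
    s = np.array([association(w, A, B) for w in list(X) + list(Y)])
    n = len(X)
    observed = s[:n].sum() - s[n:].sum()
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(s))
        stat = s[perm[:n]].sum() - s[perm[n:]].sum()
        count += int(stat >= observed)
    # Add-one correction so the estimated p-value is never exactly zero.
    return (count + 1) / (n_perm + 1)

# Toy vectors (assumption): X and A cluster around one axis, Y and B around another.
rng = np.random.default_rng(0)
dim = 10
dir_a, dir_b = np.eye(dim)[0], np.eye(dim)[1]

def noisy(center, n):
    return [center + 0.1 * rng.normal(size=dim) for _ in range(n)]

A, B = noisy(dir_a, 6), noisy(dir_b, 6)
X, Y = noisy(dir_a, 8), noisy(dir_b, 8)

effect = weat_effect_size(X, Y, A, B)
p_value = weat_permutation_p(X, Y, A, B)
print(f"effect size d = {effect:.2f}, permutation p = {p_value:.4f}")
```

On real embeddings, each set would hold the vectors of the chosen target and attribute words, looked up separately in the news and tweet models.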

---

### **Why this is non-trivial (important research note)**
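Embeddings trained on different corpora are not directly comparable:
each training run produces its own arbitrary coordinate system, so the same word can receive
very different vectors even when its meaning has not changed. A rigorous drift comparison
therefore requires **space alignment** first (Orthogonal Procrustes, applied later in this notebook).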
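
To make the alignment point concrete, here is a small sketch on toy data (an assumption, not the notebook's embeddings): the second matrix is an exact rotation of the first, so the raw per-word cosine similarity looks like heavy drift, while solving the Orthogonal Procrustes problem with an SVD recovers the rotation. The function names are hypothetical.

```python
import numpy as np

def orthogonal_procrustes_align(source, target):
    """Return the orthogonal matrix R minimising ||source @ R - target||_F."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

def mean_cosine(x, y):
    """Mean row-wise cosine similarity between two matrices of equal shape."""
    num = (x * y).sum(axis=1)
    den = np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1)
    return float((num / den).mean())

rng = np.random.default_rng(42)
emb_a = rng.normal(size=(50, 8))              # toy space: 50 shared words, 8 dims
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix
emb_b = emb_a @ q                             # same geometry, rotated coordinates

before = mean_cosine(emb_a, emb_b)
R = orthogonal_procrustes_align(emb_a, emb_b)
after = mean_cosine(emb_a @ R, emb_b)
print(f"mean cosine before alignment: {before:.3f}")
print(f"mean cosine after alignment:  {after:.3f}")
```

With real embeddings the two spaces differ by more than a rotation, so the post-alignment similarity stays below 1; that residual is the quantity a drift analysis would interpret.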


In [2]:
import re, random, math
import numpy as np
import pandas as pd

from datasets import load_dataset

SEED = 42
random.seed(SEED)
np.random.seed(SEED)


In [4]:
import warnings
warnings.filterwarnings("ignore", message="The secret `HF_TOKEN` does not exist in your Colab secrets.")


## **2. Datasets (Connected Loading via Streaming)**

I used two datasets representing contrasting domains:

- **AG News**: edited news text (formal domain)
- **Twitter US Airline Sentiment**: informal tweets (informal domain)

### **Why streaming?**
Instead of manually downloading dataset files, I loaded both datasets directly from the Hugging Face Hub.
In Hugging Face Datasets streaming mode, examples are fetched progressively as we iterate,
so a sample can be drawn without downloading the entire dataset first.
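
The lazy behaviour described above can be illustrated without network access. The sketch below uses a hypothetical generator, `fake_stream`, as a stand-in for a streamed dataset; it is an assumption for illustration, and a real stream typically fetches data in chunks rather than strictly one row at a time.

```python
from itertools import islice

fetched = []  # records which rows the stream actually produced

def fake_stream(n_total):
    """Hypothetical stand-in for a streamed dataset: yields rows lazily."""
    for i in range(n_total):
        fetched.append(i)
        yield {"text": f"example {i}"}

# Take only the first 5 rows; the remaining rows are never generated.
first_rows = list(islice(fake_stream(1_000_000), 5))
print(len(first_rows), "rows read;", len(fetched), "rows produced by the stream")
```

The same `islice` pattern works on any iterable, which makes it a cheap way to peek at a few examples before committing to a full sampling pass.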


In [6]:
# Connected dataset loading (streaming)
news_stream = load_dataset("ag_news", split="train", streaming=True)
tweets_stream = load_dataset("osanseviero/twitter-airline-sentiment", split="train", streaming=True)

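def stream_sample_to_df(stream, n_rows: int, text_col: str, seed: int = 42, max_scan: int = 50000):
    """Reservoir-sample up to n_rows non-empty texts from the first max_scan streamed rows."""
    rng = random.Random(seed)
    reservoir = []
    seen = 0  # number of valid texts encountered so far
    for i, ex in enumerate(stream):
        if i >= max_scan:
            break
        text = ex.get(text_col)
        # Skip missing, non-string, or blank texts.
        if not isinstance(text, str) or not text.strip():
            continue

        if len(reservoir) < n_rows:
            reservoir.append(text)
        else:
            # Algorithm R: keep the new text with probability n_rows / (seen + 1).
            # Counting valid texts rather than raw rows keeps the sample uniform
            # when some rows are skipped.
            j = rng.randint(0, seen)
            if j < n_rows:
                reservoir[j] = text
        seen += 1

    return pd.DataFrame({"text": reservoir})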

N_NEWS = 8000
N_TWEETS = 8000

news_df = stream_sample_to_df(news_stream, N_NEWS, text_col="text", seed=SEED)
tweets_df = stream_sample_to_df(tweets_stream, N_TWEETS, text_col="text", seed=SEED)

print("News:", news_df.shape, "Tweets:", tweets_df.shape)
news_df.head()    # not rendered: a notebook cell displays only its last expression
tweets_df.head()

News: (8000, 1) Tweets: (8000, 1)


|   | text |
|---|------|
| 0 | @VirginAmerica What @dhepburn said. |
| 1 | @americanair been calling your for 4+ hours to... |
| 2 | @AmericanAir was hoping for 8A if possible, fo... |
| 3 | @AmericanAir My baggage is lost, my flight Can... |
| 4 | @AmericanAir now because you couldn't add my k... |
