# Fundamentals of Social Data Science 

## Week 5 Day 1 Exercise: Working on Brains

To work on Brains you can use the terminal or VS Code (or another coding IDE such as Cursor, but we will focus on VS Code). I recommend starting with the terminal.

You will need to be on the University VPN. Then open a terminal. Wherever you see a command with `$` at the front, it is meant for the terminal; `>` at the front means it is typed within the Python interpreter. You don't type the `$`, just the part afterwards.

`$` `ssh brains.ox.ac.uk`

The first time you do this you will be asked to enter your password. If you will be using Brains a lot, I recommend setting up an SSH key. You can follow standard instructions for creating a key, but I would recommend waiting until after this exercise, so that you have VS Code set up first. 

You will be presented with a new command prompt. It should look like: 

~~~bash
(base) LOGIN1234@brains:~$
~~~

This is your command prompt. You can use `ls` to list files, but there shouldn't be any. `ls -a` should show some hidden files like `.bashrc`. You can (and should) check this file's contents with: 

`$` `cat .bashrc`

Look at the bottom of the output. It should have a line saying `export HF_HOME=/data/resource/huggingface`. This is important because we want to use a shared model cache, so that people do not each have to download the same data. There is no similar environment variable for the gensim data that we will use, so you will want to set that yourself. We will do it in the code below, but you can also add this line to the same `.bashrc`: `export GENSIM_DATA_DIR=/data/resource/gensim`. After you change the `.bashrc` in any way, you have to reload it: 

`$` `source ~/.bashrc`
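
To confirm from Python that both cache variables are visible after reloading, a minimal sketch along these lines may help. The helper name `check_cache_env` is hypothetical, and the expected values are simply the two paths quoted above.

```python
import os

# Expected values are taken from the text above; adjust them if your setup differs.
EXPECTED = {
    "HF_HOME": "/data/resource/huggingface",
    "GENSIM_DATA_DIR": "/data/resource/gensim",
}

def check_cache_env(environ=os.environ, expected=EXPECTED):
    """Return {variable: (current_value, matches_expected)} for each expected variable."""
    report = {}
    for name, wanted in expected.items():
        current = environ.get(name)  # None when the variable is unset
        report[name] = (current, current == wanted)
    return report

for name, (value, ok) in check_cache_env().items():
    status = "OK" if ok else "CHECK"
    print(f"{status:5s} {name} = {value!r}")
```

A `CHECK` line means the variable is unset or points somewhere else, in which case the `.bashrc` line is worth revisiting.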


# Activating the environment

Once you have cloned the repository, you should be able to work in VS Code, which should set up a lot for you. The important thing is that you use the correct `conda` environment. It will ask you when you run this file. You should select: 

`/opt/anaconda/envs/fsds25-conda-env`

The kernel in the upper left corner should say "fsds25-conda-env (Python 3.12.12)".

On a terminal, you would type: 

`$` `conda activate /opt/anaconda/envs/fsds25-conda-env`

This is a shared `conda` environment. This means everyone should be able to access the same environment, and not require any downloading or installing. 
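
If you are unsure which environment a notebook or script is actually running in, one quick check is to compare the interpreter prefix with the shared environment path. This is a small sketch; `running_in_env` is a hypothetical helper, and it relies on `sys.prefix` pointing at the environment directory, which is how `conda` environments normally behave.

```python
import sys
from pathlib import Path

# Path of the shared environment, as given above.
SHARED_ENV = "/opt/anaconda/envs/fsds25-conda-env"

def running_in_env(env_path, prefix=sys.prefix):
    """True when the interpreter prefix is the given environment directory."""
    return Path(prefix).resolve() == Path(env_path).resolve()

print(f"Interpreter prefix: {sys.prefix}")
print(f"Using shared environment: {running_in_env(SHARED_ENV)}")
```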

Should you wish to extend this environment or add things, I would recommend that from your home directory you create your own environment with: 

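~~~sh
conda create --prefix <fsds25-conda-env> python=3.12
# Activate the new environment first, otherwise packages go into the currently active one
conda activate ./<fsds25-conda-env>
conda install -c conda-forge sentence-transformers
conda install gensim
# ...etc
~~~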

where instead of `<fsds25-conda-env>` you type your own environment name. It should be automatically stored on `/data/` and won't need much fussing. 

In [1]:
import os
from pathlib import Path

# Set the shared gensim cache before importing gensim.downloader,
# which reads GENSIM_DATA_DIR once at import time.
os.environ["GENSIM_DATA_DIR"] = "/data/resource/gensim"

import numpy as np
import gensim.downloader as api
from scipy.spatial.distance import cosine
import torch
from sentence_transformers import SentenceTransformer



In [2]:
# Check if a GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")
if device == 'cuda':
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    # Pin to the second GPU (cuda:1) when more than one is present
    if torch.cuda.device_count() > 1:
        device = "cuda:1"

Using device: cuda
GPU: NVIDIA L40S


# Part 1. Loading and doing an analogy with Word2Vec on the server

First we will load Word2Vec. After this is done, try connecting via a terminal and typing `nvtop`. This will give an overview of the GPUs and how much VRAM is in use on each. 
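
If you would rather check GPU memory from Python than from an interactive tool, a rough sketch is to call `nvidia-smi` and parse its CSV output. This assumes `nvidia-smi` is on the `PATH`, which the text above does not state (it only mentions `nvtop`); the helper names `parse_gpu_csv` and `gpu_memory` are hypothetical.

```python
import shutil
import subprocess

QUERY = "index,name,memory.used,memory.total"

def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts (memory in MiB)."""
    gpus = []
    for line in text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        try:
            index, name, used, total = fields
            gpus.append({"index": int(index), "name": name,
                         "used_mib": int(used), "total_mib": int(total)})
        except ValueError:
            continue  # skip lines that do not have four parseable fields
    return gpus

def gpu_memory():
    """Return per-GPU memory usage, or an empty list when nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False)
    return parse_gpu_csv(result.stdout) if result.returncode == 0 else []

for gpu in gpu_memory():
    print(f"GPU {gpu['index']} ({gpu['name']}): {gpu['used_mib']} / {gpu['total_mib']} MiB")
```

On a machine without NVIDIA drivers the loop simply prints nothing.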

In [3]:
print("Loading Word2Vec model...")
print("(This may take a few minutes on first run - downloading ~1.6GB)")
word2vec_model = api.load('word2vec-google-news-300')
print("✓ Word2Vec loaded")

Loading Word2Vec model...
(This may take a few minutes on first run - downloading ~1.6GB)
✓ Word2Vec loaded


In [4]:
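def word2vec_analogy(model, word_a, word_b, word_c, top_n=5):
    """
    Perform vector arithmetic: word_a - word_b + word_c
    Returns top_n most similar words to the resulting vector.

    Read as "word_b is to word_a as word_c is to ?", e.g.
    ('woman', 'man', 'king') computes woman - man + king.
    """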
    try:
        # Get vectors
        vec_a = model[word_a]
        vec_b = model[word_b]
        vec_c = model[word_c]
        
        # Vector arithmetic
        result_vec = vec_c - vec_b + vec_a
        
        # Find most similar (excluding input words)
        similar = model.similar_by_vector(result_vec, topn=top_n + 3)
        
        # Filter out input words
        filtered = [(word, score) for word, score in similar 
                   if word.lower() not in [word_a.lower(), word_b.lower(), word_c.lower()]]
        
        return filtered[:top_n]
    
    except KeyError as e:
        return f"Word not in vocabulary: {e}"

def print_analogy_results(word_a, word_b, word_c, results):
    """
    Pretty print analogy results
    """
    print(f"\n{word_a}:{word_b} :: {word_c}:?")
    print("="*50)
    if isinstance(results, str):
        print(results)
    else:
        for i, (word, score) in enumerate(results, 1):
            print(f"{i}. {word:20s} (similarity: {score:.4f})")

In [5]:
print("\n" + "="*60)
print("WORD2VEC: Classic 'king - man + woman' analogy")
print("="*60)

results_w2v = word2vec_analogy(word2vec_model, 'woman', 'man', 'king', top_n=10)
print_analogy_results('woman', 'man', 'king', results_w2v)

# Check if 'queen' is in top results
if isinstance(results_w2v, list):
    queen_rank = next((i for i, (word, _) in enumerate(results_w2v, 1) 
                      if 'queen' in word.lower()), None)
    if queen_rank:
        print(f"\n→ 'queen' found at rank {queen_rank}")
    else:
        print(f"\n→ 'queen' not in top 10 results!")


WORD2VEC: Classic 'king - man + woman' analogy

woman:man :: king:?
1. queen                (similarity: 0.7301)
2. monarch              (similarity: 0.6455)
3. princess             (similarity: 0.6156)
4. crown_prince         (similarity: 0.5819)
5. prince               (similarity: 0.5777)
6. kings                (similarity: 0.5614)
7. sultan               (similarity: 0.5377)
8. Queen_Consort        (similarity: 0.5344)
9. queens               (similarity: 0.5290)
10. ruler                (similarity: 0.5247)

→ 'queen' found at rank 1


In [6]:
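# A custom analogy with Word2Vec
# Note: with this argument order the function computes Paris - France + London.
# To ask for London's country, pass ('France', 'Paris', 'London') instead.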
word_a = 'Paris'
word_b = 'France' 
word_c = 'London'

print("\nWORD2VEC:")
results = word2vec_analogy(word2vec_model, word_a, word_b, word_c, top_n=5)
print_analogy_results(word_a, word_b, word_c, results)


WORD2VEC:

Paris:France :: London:?
1. Londons              (similarity: 0.5469)
2. Islamabad_Slyvia_Hui (similarity: 0.5463)
3. Canary_Wharf         (similarity: 0.5453)
4. Canary_Warf          (similarity: 0.5428)
5. EURASIAN_NATURAL_RESOURCES_CORP. (similarity: 0.5356)
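
The argument order matters because `word2vec_analogy` computes `word_a - word_b + word_c`. The sketch below illustrates this with hand-made three-dimensional vectors rather than the real model; the vectors, their dimensions, and the helper names are all invented for illustration. It suggests one possible reason why the call above returns London-like neighbours rather than a country.

```python
import math

# Hand-made 3-d vectors with dimensions (is_city, french, british); purely illustrative.
toy = {
    "Paris":   [1.0, 1.0, 0.0],
    "France":  [0.0, 1.0, 0.0],
    "London":  [1.0, 0.0, 1.0],
    "England": [0.0, 0.0, 1.0],
}

def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def offset(a, b, c):
    """Return vec(a) - vec(b) + vec(c), the same arithmetic as word2vec_analogy."""
    return [x - y + z for x, y, z in zip(toy[a], toy[b], toy[c])]

def nearest(vec):
    """Rank every toy word by cosine similarity to vec (inputs are not filtered here)."""
    return sorted(toy, key=lambda w: cosine_similarity(vec, toy[w]), reverse=True)

as_called = offset("Paris", "France", "London")   # Paris - France + London
country = offset("France", "Paris", "London")     # France - Paris + London

print(as_called, nearest(as_called)[0])
print(country, nearest(country)[0])
```

In this toy space the first ordering lands closest to `London` itself, while the second lands on `England`. Real embeddings are far noisier, so treat this as intuition rather than a prediction of what the model will return.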


In [11]:
print("\n" + "="*60)
print("TESTING POLYSEMY: 'bank' (financial vs. river)")
print("="*60)

# Word2Vec: only has one vector for 'bank'
print("\nWORD2VEC: river:water :: bank:?")
results_w2v_bank = word2vec_analogy(word2vec_model, 'water', 'river', 'bank', top_n=5)
print_analogy_results('water', 'river', 'bank', results_w2v_bank)

print("\n" + "-"*60)
print("WORD2VEC: money:finance :: bank:?")
results_w2v_bank2 = word2vec_analogy(word2vec_model, 'finance', 'money', 'bank', top_n=5)
print_analogy_results('finance', 'money', 'bank', results_w2v_bank2)


TESTING POLYSEMY: 'bank' (financial vs. river)

WORD2VEC: river:water :: bank:?

water:river :: bank:?
1. banks                (similarity: 0.4827)
2. banking              (similarity: 0.4743)
3. mortgage_lender      (similarity: 0.4391)
4. lender               (similarity: 0.4280)
5. BofA_NYSE_BAC        (similarity: 0.4224)

------------------------------------------------------------
WORD2VEC: money:finance :: bank:?

finance:money :: bank:?
1. banking              (similarity: 0.6904)
2. lender               (similarity: 0.5620)
3. banks                (similarity: 0.5395)
4. banker               (similarity: 0.5368)


# Part 2. Loading and doing an analogy with BERT 

This part is a little trickier because of the way embeddings work in BERT: they are contextual, so a word does not have a single fixed vector. You will see a few strategies below. The most interesting of these, in my opinion, is the code that explores which prompt is most likely to give the appropriate result. 

Hopefully, this code will allow you to draw upon the existing models downloaded to `Brains`. If your code suggests that you are downloading your own models, then please notify the instructor as the permissions might not be set appropriately. 
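
Before loading a model, it can be useful to check whether it is already in the shared cache. The sketch below assumes the cache uses the `models--<org>--<name>/snapshots/<revision>` layout that appears in the model path in the next cell, directly under `/data/resource/huggingface`; the function names are hypothetical.

```python
from pathlib import Path

# Shared cache location, taken from the HF_HOME value mentioned earlier.
CACHE_DIR = Path("/data/resource/huggingface")

def list_cached_models(cache_dir=CACHE_DIR):
    """Return repo ids such as 'sentence-transformers/all-mpnet-base-v2' found in cache_dir."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []
    names = []
    for entry in sorted(cache_dir.glob("models--*")):
        if entry.is_dir():
            # 'models--org--name' -> 'org/name'; models without an org keep their name
            names.append(entry.name[len("models--"):].replace("--", "/", 1))
    return names

def find_snapshots(repo_id, cache_dir=CACHE_DIR):
    """Return the snapshot directories cached for repo_id (empty list when none exist)."""
    folder = "models--" + repo_id.replace("/", "--")
    snapshot_root = Path(cache_dir) / folder / "snapshots"
    if not snapshot_root.is_dir():
        return []
    return sorted(p for p in snapshot_root.iterdir() if p.is_dir())

print(list_cached_models())
print(find_snapshots("sentence-transformers/all-mpnet-base-v2"))
```

An empty list from `find_snapshots` suggests the model is not cached, so loading it by name would probably trigger a download.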

In [8]:
from sentence_transformers import SentenceTransformer

model_path = '/data/resource/huggingface/models--sentence-transformers--all-mpnet-base-v2/snapshots/12e86a3c702fc3c50205a8db88f0ec7c0b6b94a0'
sbert_model = SentenceTransformer(model_path, device=device)

In [9]:
def contextualized_analogy(model, word_a, word_b, word_c, candidates, context_template=None):
    """
    Perform analogy using contextualized embeddings.
    
    Args:
        model: SentenceTransformer model
        word_a, word_b, word_c: The analogy words (A:B :: C:?)
        candidates: List of candidate words for the answer
        context_template: Optional template for embedding words in context
    """
    # Default: simple sentence context
    if context_template is None:
        context_template = "The word {} is used in this sentence."
    
    # Get contextualized embeddings
    emb_a = model.encode(context_template.format(word_a))
    emb_b = model.encode(context_template.format(word_b))
    emb_c = model.encode(context_template.format(word_c))
    
    # Vector arithmetic in contextualized space
    target_vec = emb_c - emb_b + emb_a
    
    # Compare to candidates
    results = []
    for candidate in candidates:
        emb_candidate = model.encode(context_template.format(candidate))
        similarity = 1 - cosine(target_vec, emb_candidate)
        results.append((candidate, similarity))
    
    # Sort by similarity
    results.sort(key=lambda x: x[1], reverse=True)
    return results

def print_contextual_results(word_a, word_b, word_c, results):
    """
    Pretty print contextualized analogy results
    """
    print(f"\n{word_a}:{word_b} :: {word_c}:?")
    print("="*50)
    for i, (word, score) in enumerate(results, 1):
        print(f"{i}. {word:20s} (similarity: {score:.4f})")

In [10]:
print("\n" + "="*60)
print("SENTENCE-BERT: Contextualized 'woman:man :: king:?' analogy")
print("="*60)

# Define candidates (including expected answer + distractors)
candidates = ['queen', 'monarch', 'prince', 'princess', 'throne', 
              'ruler', 'kingdom', 'royal', 'emperor', 'castle']

results_sbert = contextualized_analogy(sbert_model, 'woman', 'man', 'king', candidates)
print_contextual_results('woman', 'man', 'king', results_sbert[:10])

# Check queen's rank
queen_rank = next((i for i, (word, _) in enumerate(results_sbert, 1) 
                  if word.lower() == 'queen'), None)
if queen_rank:
    print(f"\n→ 'queen' found at rank {queen_rank}")


SENTENCE-BERT: Contextualized 'woman:man :: king:?' analogy

woman:man :: king:?
1. queen                (similarity: 0.7897)
2. monarch              (similarity: 0.6850)
3. princess             (similarity: 0.6701)
4. royal                (similarity: 0.6201)
5. kingdom              (similarity: 0.6022)
6. throne               (similarity: 0.5996)
7. ruler                (similarity: 0.5034)
8. emperor              (similarity: 0.5014)
9. prince               (similarity: 0.4598)
10. castle               (similarity: 0.4598)

→ 'queen' found at rank 1


In [12]:
print("\n" + "="*60)
print("SENTENCE-BERT: Contextualized 'bank' analogies")
print("="*60)

# Financial context
print("\nContext: 'The bank approved my loan application.'")
financial_template = "The bank approved my loan application, just like {}."
financial_candidates = ['institution', 'lender', 'company', 'branch', 
                       'shore', 'edge', 'slope', 'credit']

results_bank_financial = contextualized_analogy(
    sbert_model, 'finance', 'money', 'bank', 
    financial_candidates, context_template=financial_template
)
print_contextual_results('finance', 'money', 'bank', results_bank_financial)

# River context
print("\n" + "-"*60)
print("Context: 'We sat on the bank of the river.'")
river_template = "We sat on the bank of the river, near the {}."
river_candidates = ['shore', 'edge', 'slope', 'side', 
                   'institution', 'lender', 'branch', 'water']

results_bank_river = contextualized_analogy(
    sbert_model, 'water', 'river', 'bank', 
    river_candidates, context_template=river_template
)
print_contextual_results('water', 'river', 'bank', results_bank_river)


SENTENCE-BERT: Contextualized 'bank' analogies

Context: 'The bank approved my loan application.'

finance:money :: bank:?
1. lender               (similarity: 0.9562)
2. institution          (similarity: 0.9325)
3. credit               (similarity: 0.9109)
4. company              (similarity: 0.9049)
5. branch               (similarity: 0.8854)
6. slope                (similarity: 0.7768)
7. shore                (similarity: 0.7531)
8. edge                 (similarity: 0.6958)

------------------------------------------------------------
Context: 'We sat on the bank of the river.'

water:river :: bank:?
1. water                (similarity: 0.9877)
2. shore                (similarity: 0.9381)
3. side                 (similarity: 0.9348)
4. edge                 (similarity: 0.9182)
5. branch               (similarity: 0.9019)
6. slope                (similarity: 0.8765)
7. lender               (similarity: 0.8752)
8. institution          (similarity: 0.8598)


In [13]:
# Your custom analogy with Sentence-BERT
candidates = ['England', 'Britain', 'UK', 'Europe', 'city', 'country']

print("\nSENTENCE-BERT:")
results = contextualized_analogy(sbert_model, word_a, word_b, word_c, candidates)
print_contextual_results(word_a, word_b, word_c, results)


SENTENCE-BERT:

Paris:France :: London:?
1. city                 (similarity: 0.6386)
2. England              (similarity: 0.3225)
3. UK                   (similarity: 0.3207)
4. Britain              (similarity: 0.3089)
5. Europe               (similarity: 0.2674)
6. country              (similarity: 0.2060)


# Part 3. Other approaches with BERT 

The above code used Sentence-BERT to compare sentence embeddings in which we swapped the words. However, with BERT we can (and usually do) predict a masked token within a sentence, like `"London is to England what [MASK] is to France"`.

These are just some interesting ways to explore analogies within transformer models and can be used to explore your topic. For example, we look at: 
- what difference a prompt makes
- what about different models
- what about the analogy in a different order

Try all three, experiment with different analogies, and even explore the different models available on `/data/resource/huggingface/`.
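
When experimenting with your own prompts, it helps to build them from a template and check that each contains exactly one mask token. The helper below is a hypothetical sketch, not part of the exercise code. With a Hugging Face tokenizer you could pass `mask_token=tokenizer.mask_token`, since some models use a different mask string from BERT's `[MASK]`.

```python
def build_masked_prompt(template, word_a, word_b, word_c, mask_token="[MASK]"):
    """Fill an analogy template and make sure it contains exactly one mask token.

    The template uses the placeholders {a}, {b}, {c} and {mask}.
    """
    text = template.format(a=word_a, b=word_b, c=word_c, mask=mask_token)
    n_masks = text.count(mask_token)
    if n_masks != 1:
        raise ValueError(f"Expected exactly one {mask_token!r} in the prompt, found {n_masks}")
    return text

# Two example templates; the second places the mask mid-sentence.
templates = {
    "default": "{a} is to {b} as {c} is to {mask}.",
    "reversed": "{c} is to {mask} what {a} is to {b}.",
}

for name, template in templates.items():
    print(f"{name:10s} {build_masked_prompt(template, 'paris', 'france', 'london')}")
```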

In [18]:

def bert_analogy_probabilities(word_a, word_b, word_c, tokenizer, model, top_k=10, prompt_style="default"):
    """
    Get top-k word predictions using masked language modeling.
    """
    prompts = {
        "default": f"{word_a} is to {word_b} as {word_c} is to [MASK].",
        "analogy": f"{word_a}:{word_b} :: {word_c}:[MASK]",
        "relationship": f"The relationship between {word_a} and {word_b} is like the relationship between {word_c} and [MASK].",
        "similar": f"{word_a} relates to {word_b} like {word_c} relates to [MASK].",
        "minimal": f"{word_a} {word_b} {word_c} [MASK]"
    }
    
    text = prompts.get(prompt_style, prompts["default"])
    
    inputs = tokenizer(text, return_tensors="pt").to(device)
    mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
    
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
    
    mask_token_logits = logits[0, mask_token_index, :]
    probs = torch.softmax(mask_token_logits, dim=-1)
    top_k_tokens = torch.topk(probs, top_k, dim=1)
    
    results = []
    for token_id, prob in zip(top_k_tokens.indices[0], top_k_tokens.values[0]):
        word = tokenizer.decode([token_id])
        results.append((word.strip(), prob.item()))
    
    return results

# Load model once
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "bert-large-uncased"          # 340M params
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).to(device)


Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# ============================================
# 3.1. Different prompting strategies
# ============================================
print("=" * 70)
print("EXPERIMENT 1: Different Prompt Styles")
print("=" * 70)

word_a, word_b, word_c = "king", "queen", "man"

for prompt_style in ["default", "analogy", "relationship", "similar", "minimal"]:
    print(f"\n{prompt_style.upper()} style:")
    print(f"{word_a}:{word_b} :: {word_c}:?")
    print("-" * 50)
    
    results = bert_analogy_probabilities(word_a, word_b, word_c, tokenizer, model, 
                                        top_k=5, prompt_style=prompt_style)
    
    for i, (word, prob) in enumerate(results, 1):
        print(f"  {i}. {word:15s} {prob:.4f}")


## 3.2 Comparing models 

In the code below, we can compare different BERT models. All of these are uncased, which explains `london` rather than `London`. Note that if you were to scale this up, it might be more suitable to try all the analogies per model rather than alternate between them, though on Brains all of these should be stored in VRAM with little issue. 
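
The "all analogies per model" ordering mentioned above can be sketched as a nested loop in which each model is loaded exactly once. The loader and predictor here are stand-ins (hypothetical `fake_load` and `fake_predict`) so the loop order can be checked without a GPU; in practice you would pass functions that wrap `from_pretrained` and `bert_analogy_probabilities`.

```python
def run_per_model(model_names, analogies, load_model, predict):
    """Load each model once, then run every analogy against it.

    load_model(name) -> model and predict(model, analogy) -> result are supplied
    by the caller, so this sketch does not depend on any particular library.
    """
    results = {}
    for name in model_names:              # outer loop: one load per model
        model = load_model(name)
        for analogy in analogies:         # inner loop: reuse the loaded model
            results[(name, analogy)] = predict(model, analogy)
    return results

# Stand-in loader and predictor that only record what was called.
load_calls = []

def fake_load(name):
    load_calls.append(name)
    return f"<{name}>"

def fake_predict(model, analogy):
    return f"{model} on {':'.join(analogy)}"

out = run_per_model(
    ["bert-base-uncased", "distilbert-base-uncased"],
    [("paris", "france", "berlin"), ("king", "queen", "man")],
    fake_load, fake_predict)

print(load_calls)
print(len(out))
```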

I tried a few of the other models on the server. Sadly, several of them (such as Microsoft's latest "DeBERTa" model) require further dependencies that are not in the `conda` environment, which led to some dependency challenges. 

In [16]:
# ============================================
# 3.2. Compare multiple models
# ============================================
print("\n\n" + "=" * 70)
print("EXPERIMENT 2: Compare Multiple Models")
print("=" * 70)

model_names = [
    "bert-base-uncased",
    "distilbert-base-uncased",
    "bert-large-uncased"
]

word_a, word_b, word_c = "paris", "france", "berlin"

print(f"\nAnalogy: {word_a}:{word_b} :: {word_c}:?")
print("=" * 70)

for model_name in model_names:
    print(f"\n{model_name}:")
    print("-" * 50)
    
    tok = AutoTokenizer.from_pretrained(model_name)
    mdl = AutoModelForMaskedLM.from_pretrained(model_name).to(device)
    
    results = bert_analogy_probabilities(word_a, word_b, word_c, tok, mdl, top_k=5)
    
    for i, (word, prob) in enumerate(results, 1):
        print(f"  {i}. {word:15s} {prob:.4f}")




EXPERIMENT 2: Compare Multiple Models

Analogy: paris:france :: berlin:?

bert-base-uncased:
--------------------------------------------------




  1. germany         0.7205
  2. britain         0.0642
  3. austria         0.0356
  4. england         0.0354
  5. russia          0.0351

distilbert-base-uncased:
--------------------------------------------------
  1. germany         0.4331
  2. england         0.0733
  3. belgium         0.0700
  4. italy           0.0624
  5. britain         0.0509

bert-large-uncased:
--------------------------------------------------




  1. germany         0.9626
  2. berlin          0.0061
  3. russia          0.0059
  4. prussia         0.0058
  5. europe          0.0023


## 3.3 Bidirectional Analogies

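`Man:Woman::King:Queen` is often assumed to be symmetric, such that if it works we should also be able to go `King:Queen::Man:Woman`. However, here the model predicts a masked token conditioned on the whole prompt rather than doing vector arithmetic, and the probability of `woman` given one prompt need not match the probability of `queen` given the reversed prompt. So there is no guarantee that an analogy that works in one direction will work in reverse. 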

Have a look at some of the examples below to see this in action. 

Also notice that I added the parameter, `prompt_style="relationship"`. This is because above it seemed that this prompt style was the most accurate. But that might be contextual and not guaranteed. 
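
The consistency score in the cell below is the average of two probabilities: the probability of the expected word in the forward direction and in the backward direction. A standalone sketch of that calculation follows; `consistency_score` is a hypothetical helper and the probabilities are made up. Note that a word missing from the top-k list counts as 0.0, so the score is best read as a lower bound.

```python
def consistency_score(forward_results, backward_results, word_d, word_b):
    """Average of P(word_d) in the forward list and P(word_b) in the backward list.

    Each results list holds (word, probability) pairs; a word that is missing
    from a list contributes 0.0.
    """
    forward_prob = dict(forward_results).get(word_d, 0.0)
    backward_prob = dict(backward_results).get(word_b, 0.0)
    return (forward_prob + backward_prob) / 2

# Hypothetical probabilities, chosen only to show how a missing word lowers the score.
forward = [("woman", 0.8), ("wife", 0.1)]
backward = [("king", 0.5), ("emperor", 0.2)]   # "queen" is absent from this list

print(consistency_score(forward, backward, word_d="woman", word_b="queen"))
```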

In [17]:
# ============================================
# 3.3. Bidirectional analogy checking
# ============================================
print("\n\n" + "=" * 70)
print("EXPERIMENT 3: Bidirectional Analogy Checking")
print("=" * 70)

def bidirectional_analogy(word_a, word_b, word_c, word_d, tokenizer, model, prompt_style=None):
    """
    Test both directions: A:B::C:D and C:D::A:B
    """
    # Fall back to the default prompt when none is given
    if not prompt_style:
        prompt_style = "default"

    # Forward: A:B::C:?
    forward_results = bert_analogy_probabilities(word_a, word_b, word_c, tokenizer, model, top_k=20, prompt_style=prompt_style)
    forward_dict = {w: p for w, p in forward_results}
    
    # Backward: C:D::A:?
    backward_results = bert_analogy_probabilities(word_c, word_d, word_a, tokenizer, model, top_k=20, prompt_style=prompt_style)
    backward_dict = {w: p for w, p in backward_results}
    
    forward_prob_d = forward_dict.get(word_d, 0.0)
    backward_prob_b = backward_dict.get(word_b, 0.0)
    
    consistency = (forward_prob_d + backward_prob_b) / 2
    
    print(f"\n{word_a}:{word_b} :: {word_c}:{word_d}")
    print(f"  Forward ({word_c}→{word_d}): {forward_prob_d:.4f}")
    print(f"  Backward ({word_a}→{word_b}): {backward_prob_b:.4f}")
    print(f"  Consistency: {consistency:.4f}")
    
    print(f"\n  Top 5 predictions (forward):")
    for i, (word, prob) in enumerate(forward_results[:5], 1):
        marker = "✓" if word == word_d else " "
        print(f"    {marker} {i}. {word:15s} {prob:.4f}")
    
    print(f"\n  Top 5 predictions (backward):")
    for i, (word, prob) in enumerate(backward_results[:5], 1):
        marker = "✓" if word == word_b else " "
        print(f"    {marker} {i}. {word:15s} {prob:.4f}")
    
    return consistency, forward_results, backward_results

# Test several analogies
test_cases = [
    ("king", "queen", "man", "woman"),
    ("paris", "france", "london", "england"),
    ("dog", "puppy", "cat", "kitten"),
    ("big", "bigger", "small", "smaller"),
]

for word_a, word_b, word_c, word_d in test_cases:
    consistency, _, _ = bidirectional_analogy(word_a, 
                                              word_b, 
                                              word_c, 
                                              word_d, 
                                              tokenizer, 
                                              model, 
                                              prompt_style="relationship")



EXPERIMENT 3: Bidirectional Analogy Checking

king:queen :: man:woman
  Forward (man→woman): 0.8028
  Backward (king→queen): 0.9369
  Consistency: 0.8699

  Top 5 predictions (forward):
    ✓ 1. woman           0.8028
      2. beast           0.0655
      3. wife            0.0170
      4. animal          0.0118
      5. horse           0.0103

  Top 5 predictions (backward):
    ✓ 1. queen           0.9369
      2. king            0.0128
      3. country         0.0046
      4. emperor         0.0044
      5. lady            0.0025

paris:france :: london:england
  Forward (london→england): 0.0000
  Backward (paris→france): 0.0000
  Consistency: 0.0000

  Top 5 predictions (forward):
      1. paris           0.1583
      2. manchester      0.0633
      3. london          0.0590
      4. brussels        0.0580
      5. birmingham      0.0435

  Top 5 predictions (backward):
      1. paris           0.1859
      2. brussels        0.1217
      3. rome            0.0983
      4. london