# Semantic Word Changes

<img src="https://live.staticflickr.com/65535/54942563983_f3baea0eee_c.jpg" alt="Embedded Photo" width="776">

## Introduction

The meanings of words in natural language change over time: some evolve gradually, while others undergo rapid semantic shifts. Historical text corpora make it possible to train vector models that capture these differences.

In this task, you will analyze **pre-trained embeddings (word2vec)** from two different eras:
- **1900** — a model trained exclusively on data from the early 20th century
- **1990** — a model trained on data close to the present day

These two models were trained **completely independently**. Your task will be to use these representations to build a classifier that determines whether **the meaning of a given word has changed significantly between 1900 and 1990**.
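Because the two models were trained independently, their coordinate axes are not guaranteed to be comparable, so the cosine between a word's 1900 vector and its 1990 vector may only be meaningful after both are mapped into a common space. One common option is orthogonal Procrustes alignment over the shared vocabulary. The task does not prescribe this technique; it is an assumption here. The sketch below uses random toy data, and `procrustes_align` is a hypothetical helper name:

```python
import numpy as np

def procrustes_align(A, B):
    """Return the orthogonal matrix R minimizing ||A @ R - B||_F.

    Row i of A and row i of B must belong to the same word.
    """
    # The SVD of the cross-covariance matrix yields the optimal rotation.
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

# Toy check: B is an exactly rotated copy of A, so the alignment
# should recover that rotation.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 8))
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix
B = A @ Q

R = procrustes_align(A, B)
max_err = float(np.abs(A @ R - B).max())  # close to zero for this toy case
```

In the notebook, `A` and `B` would presumably be the rows of `W1900` and `W1990` for words found in both `w2i_1900` and `w2i_1990`. The cosine between a word's aligned 1900 vector and its 1990 vector could then serve as one feature for the classifier.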

## Task

You will build a **binary classifier** that, for a given word, will return a label:

- **0 — semantically stable word**
- **1 — a word whose meaning has changed**

## Data

- `train.csv` — training set (word + label)
- `valid.csv` — validation set for local testing
- `1900-vocab.pkl` and `1900-w.npy` — vocabulary and embedding matrix from 1900
- `1990-vocab.pkl` and `1990-w.npy` — an analogous set for the year 1990

## Evaluation Criterion

To evaluate your solution, we use the **Balanced Accuracy** metric: the average of the classification accuracy on the positive class and on the negative class, that is, the average of sensitivity (TPR) and specificity (TNR):

$$ \text{Balanced Accuracy} = \frac{1}{2}(\text{TPR} + \text{TNR}) $$

Because each class contributes equally regardless of its size, this metric is robust to class imbalance.
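To make the definition concrete, the following minimal sketch computes Balanced Accuracy by hand on made-up labels and compares it with plain accuracy. The helper name `balanced_accuracy` and the toy data are illustrative assumptions; the notebook itself relies on `balanced_accuracy_score` from scikit-learn.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of TPR and TNR for binary labels in {0, 1}.

    Assumes both classes occur at least once in y_true.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    tpr = tp / pos  # sensitivity: share of changed words that were found
    tnr = tn / neg  # specificity: share of stable words that were kept
    return 0.5 * (tpr + tnr)

# Made-up toy data: 8 stable words and 2 changed words.
y_true = [0] * 8 + [1] * 2
always_stable = [0] * 10  # a model that always predicts "stable"

ba_majority = balanced_accuracy(y_true, always_stable)  # TPR = 0, TNR = 1
plain_acc = sum(t == p for t, p in zip(y_true, always_stable)) / len(y_true)
```

On this toy data, the always-stable predictor reaches a plain accuracy of 0.8 but a Balanced Accuracy of only 0.5, which is why ignoring the minority class does not pay off under this metric.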

You return **hard labels (0/1)**.

The notebook contains an `evaluate_algorithm` function, which you can use to test your model on `valid.csv`.

For this task, you can score between 0 and 100 points. The score will be scaled linearly depending on the Balanced Accuracy value:

- **Balanced Accuracy ≤ 0.7**: 0 points.
- **Balanced Accuracy ≥ 0.87**: 100 points.
- **Values between 0.7 and 0.87**: scaled linearly.

Formula for the score:
$$
\text{Points} = 
\begin{cases} 
0 & \text{for } \text{Balanced Accuracy} \leq 0.7 \\
100 \times \frac{\text{Balanced Accuracy} - 0.7}{0.87 - 0.7} & \text{for } 0.7 < \text{Balanced Accuracy} < 0.87 \\
100 & \text{for } \text{Balanced Accuracy} \geq 0.87
\end{cases}
$$
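
As a worked example, a hypothetical Balanced Accuracy of 0.80 falls in the linear range:

$$
\text{Points} = 100 \times \frac{0.80 - 0.7}{0.87 - 0.7} = 100 \times \frac{0.10}{0.17} \approx 58.8
$$

Rounded to the nearest integer, this gives 59 points. In this range, each additional 0.01 of Balanced Accuracy is worth roughly 5.9 points.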

## Constraints

Your notebook will be run on the Competition Platform:

- **without internet access**
- without GPU access - **CPU only**
- Time limit for notebook execution and evaluation on the test set: **5 minutes**
- List of allowed libraries: `numpy`, `pandas`, `scikit-learn`, `matplotlib`, `tqdm`

## Submission Files

- This notebook completed with your solution:

```python
from typing import List

class SemanticChangeModel:
    def fit(self, train_df):
        ...
    def predict_change(self, words: List[str]) -> List[int]:  # each element is 0 or 1
        ...
```
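
To illustrate the `fit` / `predict_change` contract, here is a minimal sketch of a one-feature classifier that picks the threshold maximizing Balanced Accuracy on the training data. Everything specific to it is an assumption made for illustration: the class name `ThresholdChangeModel`, the `feature_fn` argument, and the made-up change scores. It is not the intended solution, and the class you submit must still be named `SemanticChangeModel` and be constructed from the embeddings.

```python
import numpy as np

class ThresholdChangeModel:
    """Toy stand-in for SemanticChangeModel: one feature, one learned threshold."""

    def __init__(self, feature_fn):
        self.feature_fn = feature_fn  # hypothetical: maps a word to a float "change score"
        self.threshold = 0.0

    def fit(self, train_df):
        # Works with anything indexable by 'word' and 'label' (DataFrame or dict).
        x = np.array([self.feature_fn(w) for w in train_df["word"]], dtype=float)
        y = np.asarray(list(train_df["label"]), dtype=int)
        best_ba = -1.0
        # Try every observed score as a threshold; both classes must be present.
        for t in np.unique(x):
            pred = (x >= t).astype(int)
            tpr = pred[y == 1].mean()
            tnr = 1.0 - pred[y == 0].mean()
            ba = 0.5 * (tpr + tnr)
            if ba > best_ba:
                best_ba, self.threshold = ba, float(t)
        return self

    def predict_change(self, words):
        return [int(self.feature_fn(w) >= self.threshold) for w in words]

# Made-up change scores for a few words (NOT derived from the real embeddings).
toy_scores = {"lichen": 0.1, "devil": 0.2, "barn": 0.15, "imaging": 0.8, "report": 0.7}
toy_train = {"word": ["lichen", "imaging", "devil", "report"], "label": [0, 1, 0, 1]}

model = ThresholdChangeModel(toy_scores.__getitem__).fit(toy_train)
preds = model.predict_change(["barn", "imaging"])
```

Tuning the threshold directly for Balanced Accuracy, rather than plain accuracy, keeps the decision rule in line with the evaluation metric.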

## Evaluation

Remember that during the final check, the `FINAL_EVALUATION_MODE` flag will be set to `True`.

Your score will be computed on a secret test set on the Competition Platform using the formula above, rounded to the nearest integer. If your solution does not meet the above criteria or does not execute correctly, you will receive 0 points for the task.

## Starter Code
In this section, we initialize the environment by importing the necessary libraries and functions. The prepared code will help you to efficiently operate on the data and build the right solution.

In [1]:
######################### DO NOT CHANGE THIS CELL FOR SUBMISSION ##########################

FINAL_EVALUATION_MODE = False  # We will set this flag to True during evaluation.

import os, json, pickle, shutil
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random

from sklearn.metrics import balanced_accuracy_score

np.random.seed(42)
random.seed(42)

DATA_DIR = Path('data')
EMB_DIR  = DATA_DIR / 'embeddings'
EMB_DIR.mkdir(parents=True, exist_ok=True)


## Downloading data (local only)

In [None]:
######################### DO NOT CHANGE THIS CELL FOR SUBMISSION ##########################

# Note: The internet is disabled on the evaluation platform.
# This block only runs locally (when FINAL_EVALUATION_MODE == False).

GDRIVE_FILES = [
    ('1rUGgDZcpwRZ5sRHGxxEh2f7ZJ0DRVDPL', EMB_DIR / '1900-vocab.pkl'),
    ('1cYXPhghcawbMZ6vU2XyJUq7NOpBIKj5E', EMB_DIR / '1900-w.npy'),
    ('1ApLkBn2ylvLMKNlNtvkMVde6RxnLJolI', EMB_DIR / '1990-vocab.pkl'),
    ('1B3NLInA4Ty3lUaHNQgxtDTKJNtG0t0T1', EMB_DIR / '1990-w.npy'),
    ('1hrOfZOq3BV1K0tWe6HSZG-OiZkGlCiYT', DATA_DIR / 'train.csv'),
    ('1vndyCuDCBP6zLvNkF_YsKHgTQgulTjt_', DATA_DIR / 'valid.csv'),
]

def download_data():
    try:
        import gdown
    except Exception as e:
        raise RuntimeError('Install gdown locally: `pip install gdown`') from e

    DATA_DIR.mkdir(parents=True, exist_ok=True)
    EMB_DIR.mkdir(parents=True, exist_ok=True)

    for fid, out_path in GDRIVE_FILES:
        if out_path.exists():
            print(f'Download skipped - file already exists: {out_path.name}')
            continue
        url = f'https://drive.google.com/uc?id={fid}'
        out_path.parent.mkdir(parents=True, exist_ok=True)
        print(f'Downloading -> {out_path.name}')
        gdown.download(url, str(out_path), quiet=False)

if not FINAL_EVALUATION_MODE:
    download_data()
    print('Download complete.')
else:
    print('FINAL_EVALUATION_MODE=True - skip download (data is provided on the platform).')

## Loading embeddings and datasets

In [None]:
######################### DO NOT CHANGE THIS CELL FOR SUBMISSION ##########################

def load_histwords_decade(decade: int, emb_dir: Path):
    vocab_path = emb_dir / f'{decade}-vocab.pkl'
    w_path     = emb_dir / f'{decade}-w.npy'
    with open(vocab_path, 'rb') as f:
        vocab = pickle.load(f)
    W = np.load(w_path)
    # L2 normalization
    W = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-12)
    w2i = {w:i for i,w in enumerate(vocab)}
    return vocab, W, w2i

# Loading embeddings
vocab_1900, W1900, w2i_1900 = load_histwords_decade(1900, EMB_DIR)
vocab_1990, W1990, w2i_1990 = load_histwords_decade(1990, EMB_DIR)

if not FINAL_EVALUATION_MODE:
    print(f'1900: V={len(vocab_1900):,}, dim={W1900.shape[1]}')
    print(f'1990: V={len(vocab_1990):,}, dim={W1990.shape[1]}')

# Loading train/valid files
train_path = DATA_DIR / 'train.csv'
valid_path = DATA_DIR / 'valid.csv'
assert train_path.exists() and valid_path.exists(), 'No train.csv / valid.csv in data/ folder'

train_df = pd.read_csv(train_path)
valid_df = pd.read_csv(valid_path)

# Expected columns: word, label
for c in ['word', 'label']:
    assert c in train_df.columns and c in valid_df.columns, 'Expected columns: word, label'

train_df['word'] = train_df['word'].astype(str).str.lower().str.strip()
valid_df['word'] = valid_df['word'].astype(str).str.lower().str.strip()
train_df['label'] = train_df['label'].astype(int)
valid_df['label'] = valid_df['label'].astype(int)


if not FINAL_EVALUATION_MODE:
    print(train_df.head(5))
    print()
    print(valid_df.head(5))
    print(f'train: {len(train_df)}, valid: {len(valid_df)}')

1900: V=100,000, dim=300
1990: V=100,000, dim=300
        word  label
0     lichen      0
1    imaging      1
2      devil      0
3    prayers      0
4  frankfort      1

        word  label
0  coastline      0
1       yoke      0
2     report      1
3   language      0
4       barn      0
train: 2495, valid: 832


In [None]:
def print_neighbors(word, V, vocab, w2i, label):
    vec = V[w2i[word]]
    sims = V @ vec
    sims[w2i[word]] = -np.inf
    top = np.argsort(-sims)[:10]
    print(f"\nTop 10 neighbors in {label}:")
    for i in top:
        print(f"  {vocab[i]}  ({sims[i]:.4f})")


if not FINAL_EVALUATION_MODE:
    print_neighbors("intelligence", W1900, vocab_1900, w2i_1900, "1900")
    print_neighbors("intelligence", W1990, vocab_1990, w2i_1990, "1990")


Top 10 neighbors in 1900:
  intellect  (0.4798)
  sagacity  (0.4282)
  discernment  (0.3707)
  understanding  (0.3688)
  tact  (0.3659)
  skill  (0.3643)
  humanity  (0.3641)
  honesty  (0.3598)
  ability  (0.3517)
  sensibility  (0.3512)

Top 10 neighbors in 1990:
  iq  (0.3773)
  wechsler  (0.3542)
  cia  (0.3362)
  hinsley  (0.3301)
  artificial  (0.3264)
  afric  (0.3202)
  binet  (0.3191)
  aptitude  (0.3152)
  abwehr  (0.3108)
  abilities  (0.3070)


## Your solution

In [5]:
#########################  MODIFY ONLY THIS CELL  #########################
# Implement your model as a class with the following methods:
#   - __init__       : save embeddings + basic hyperparameters
#   - fit(train_df)  : train on labeled data
#   - predict_change(words) : return labels for a given list of words
#
# The evaluation code will receive an instance of the class and will only assume
# that it has a .predict_change(words) method.

class SemanticChangeModel:
    def __init__(self, W1900, W1990, w2i_1900, w2i_1990):
        """
        Store all expensive / global objects here. You can build
        additional structures in the fit() method.

        Parameters
        ----------
        W1900, W1990 : np.ndarray [V, D]
            Normalized embeddings for the years 1900 and 1990.
        w2i_1900, w2i_1990 : dict
            Mapping from word -> row index in the embeddings.
        """
        self.W1900 = W1900
        self.W1990 = W1990
        self.w2i_1900 = w2i_1900
        self.w2i_1990 = w2i_1990

        # You can add more parameters and methods as needed

    def fit(self, train_df):
        """
        Build your model using the labeled training data.

        Parameters
        ----------
        train_df : pd.DataFrame
            Must contain at least the columns ['word', 'label'].
        """
        # TODO: replace this placeholder with your actual fitting logic.
        pass
    
    def predict_change(self, words):
        """
        Predicts whether words have significantly changed their meaning.

        Parameters
        ----------
        words : list of str
            A list of words to be classified.

        Returns
        -------
        list of int
            A list of 0s or 1s (1 = 'changed') for each word.
        """
        # TODO: replace this placeholder with your actual prediction logic.
        return [random.choice([0,1]) for _ in words]


MODEL = SemanticChangeModel(W1900, W1990, w2i_1900, w2i_1990)
MODEL.fit(train_df)


## Evaluation

Running the cell below will allow you to check how many points your solution would score on the validation data.

Before submitting, make sure that the entire notebook runs from start to finish without errors and without user intervention after executing the `Run All` command.

In [None]:
######################### DO NOT CHANGE THIS CELL FOR SUBMISSION ##########################
def compute_score(bal_acc: float) -> float:
    """
    Calculates the point score based on the balanced accuracy value.

    :param bal_acc: A float value in the range [0.0, 1.0]
    :return: A point score according to the specified function
    """
    if bal_acc <= 0.7:
        return 0
    elif 0.7 < bal_acc < 0.87:
        return int(round(100 * (bal_acc - 0.7) / (0.87 - 0.7)))
    else:
        return 100


def evaluate_algorithm(dataset_df, model, verbose=False):
    """
    Evaluation of a word sense change detection model on a given dataset.

    Parameters
    ----------
    dataset_df : pd.DataFrame
        A labeled dataset with columns:
          - 'word'  : word (string)
          - 'label' : label 0 = stable, 1 = changed

    model : object
        An object with a method:
            predict_change(words: list[str]) -> list[int] {0,1}

    verbose : bool
        If True, prints additional information.

    Returns
    -------
    points : float
        A point score based on balanced accuracy.
    """

    # Extract words and labels from the dataset
    words = dataset_df["word"].astype(str).tolist()
    ys = dataset_df["label"].astype(int).tolist()

    # Get predictions for the entire list of words
    preds = model.predict_change(words)

    # Convert predictions and labels to numpy arrays
    preds = np.array(preds, dtype=np.int32)
    ys = np.array(ys, dtype=np.int32)

    # Balanced accuracy
    bal_acc = balanced_accuracy_score(ys, preds)

    # Convert accuracy to competition points
    points = compute_score(bal_acc)

    if verbose:
        print(f"\nNumber of samples: {len(dataset_df)}")
        print(f"Balanced accuracy: {bal_acc:.4f}")
        print(f"Point score: {points}")

    return points


if not FINAL_EVALUATION_MODE:
    _ = evaluate_algorithm(valid_df, MODEL, verbose=True)


Number of samples: 832
Balanced accuracy: 0.5157
Point score: 0
