# Towards Evaluating Creativity in Language

This notebook is a prototype for evaluating creativity in language.

*Note: Copilot auto-completed a citation at this point ("Towards Evaluating Creativity in Language", Rudinger et al., 2019, linked to https://arxiv.org/abs/1904.09751). It should not be treated as a reference for this notebook.*

In [1]:
from nltk.corpus import gutenberg
import numpy.typing as ntp
from madhatter.benchmark import *
from madhatter.models import *
import nltk
import numpy as np
import matplotlib.pyplot as plt
# from classes.models import *
# Initialize models
import torch
import gensim
from nltk.data import find
word2vec_model = gensim.models.KeyedVectors.load_word2vec_format(
    str(find('models/word2vec_sample/pruned.word2vec.txt')), binary=False)

model, tokenizer = default_model("bert-base-uncased")

bench = CreativityBenchmark(gutenberg.raw("austen-emma.txt"), "Emma")
bench.report(postag_distribution=True)



Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Report took ~0.514s


Report(title='Emma', nwords=185438, mean_wl=3.919380062338895, mean_sl=117.23189377682404, mean_tokenspersent=25.721968884120173, prop_contentwords=0.04020750870911032, mean_conc=2.7062426711716205, mean_img=3.3097397867972163, mean_freq=-2.3673723826953306, prop_pos={'NOUN': 0.16685524785825648, 'VERB': 0.18625738464827435, 'ADJ': 0.05854533509226574})

### Potential creativity measures
#### Usage of less common vocabulary
**Example:**
- "The quick brown fox jumps over the lazy dog."
- "The swift hazel-furred fox leaps over the idle dog."

**Idea:**

Adapt resources such as WordNet for finding semantically similar words and compare them to their most used synonym. Notion: The more uncommon the word, the more creative the sentence.
Potential problems: words could be too far off from their actual meaning in the context. For example, here "leaps" is a synonym for "jumps", but using the word "vaults" or "springs" might not fit the context.
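
The idea above can be sketched as a log-ratio between a word's frequency and that of its most frequent synonym. This is a minimal sketch, not the notebook's implementation: `TOY_FREQ_PER_MILLION` and `TOY_SYNONYMS` are hypothetical hand-written tables standing in for a real frequency list and for WordNet lookups, and the numbers in them are invented for illustration.

```python
import math

# Hypothetical frequencies (occurrences per million words); toy values only.
TOY_FREQ_PER_MILLION = {
    "quick": 120.0, "swift": 15.0,
    "jumps": 40.0, "leaps": 6.0,
    "lazy": 20.0, "idle": 8.0,
}

# Hypothetical synonym sets standing in for WordNet lookups.
TOY_SYNONYMS = {
    "quick": {"quick", "swift"}, "swift": {"quick", "swift"},
    "jumps": {"jumps", "leaps"}, "leaps": {"jumps", "leaps"},
    "lazy": {"lazy", "idle"}, "idle": {"lazy", "idle"},
}


def rarity_score(word):
    """Log-ratio between the most frequent synonym and the word itself.

    0.0 means the word is already the most common choice; larger means rarer.
    Words missing from the toy tables score 0.0.
    """
    synonyms = TOY_SYNONYMS.get(word)
    if not synonyms or word not in TOY_FREQ_PER_MILLION:
        return 0.0
    most_common = max(
        TOY_FREQ_PER_MILLION[s] for s in synonyms if s in TOY_FREQ_PER_MILLION
    )
    return math.log(most_common / TOY_FREQ_PER_MILLION[word])


def sentence_rarity(tokens):
    """Mean rarity score over the tokens of a sentence."""
    return sum(rarity_score(t) for t in tokens) / len(tokens)


plain = "the quick brown fox jumps over the lazy dog".split()
fancy = "the swift hazel-furred fox leaps over the idle dog".split()
print(sentence_rarity(plain), sentence_rarity(fancy))
```

With these toy tables the first sentence scores zero because every word is already the most frequent member of its synonym set, while the second scores higher. The sketch does not address the context problem noted above: it treats every listed synonym as interchangeable.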


#### Comparing alternatives for bi(+)grams 
(Generally, we want to narrow down on adjectives and noun phrases, but this could be expanded to verb phrases for example.)  
**Two variants:**
1. Compare how much the original word deviates from its contextual synonyms/alternatives. That is, compare $ P(w_{original} \mid context) $ with $ \{P(w \mid context) \mid w \text{ in the set of alternative continuations}\} $. **This is somewhat akin to the perplexity measure, I believe?**
   - *Example:* Given the sentence "The quick brown fox jumps over the lazy dog.", the original word is "jumps" and the context is the rest of the sentence. The probability of "jumps" in that context is compared to the probabilities of alternatives such as "leaps" or "hops" in the same context.
   - **Alternatively**, to simplify the formulas, we can compare $ P(w_{original} \mid context) $ with the largest probability among the alternatives, i.e. $\max (\{ P(w \mid context) \mid w \in S_{Alternatives}\})$


2. Compare the deviation of probability $ P(word|modifier) $ with respect to the set $ \{P(word| alt) | alt \text{ in the set of alternative modifiers}\}$
   - *Example:* Given the sentence "The quick *brown* **fox** jumps over the lazy dog.", the $word$ is **"fox"** and the $modifier$ is *"brown"*. Then, the alternative modifiers can be "black", "reddish", or even "blue".
   - This can be summarised by doing evaluation on the noun phrase level. I'd personally prefer focusing on the prepositional modifier words.
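
The simplified comparison against the likeliest alternative can be sketched as a log-ratio. This is only an illustration: `surprisal_gap` is a hypothetical helper name, and the probabilities below are invented rather than taken from a model.

```python
import math


def surprisal_gap(p_original, alternative_probs):
    """Log-ratio of the likeliest alternative to the original word.

    Positive: some alternative was more expected than the original word.
    Zero or negative: the original was at least as likely as every alternative.
    """
    best_alternative = max(alternative_probs.values())
    return math.log(best_alternative / p_original)


# Hypothetical P(w | context) for "The quick brown fox ___ over the lazy dog."
p_jumps = 0.40
alternatives = {"leaps": 0.20, "hops": 0.05, "vaults": 0.01}
print(surprisal_gap(p_jumps, alternatives))
```

The same function applies to the second variant if the dictionary holds $P(word \mid alt)$ for each alternative modifier instead.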

Possible algorithm for measuring the number of tokens per sentence:
1. Split the text into sentences using the sentence tokenizer.
2. For each sentence, split it into tokens using the word tokenizer.
3. Count the tokens in each sentence and average the counts over the text.

Additionally, plot the distribution of the number of tokens per sentence and the distribution of PoS tags per sentence.
Do this for a few genres and compare the plots.
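
A minimal sketch of the token-count procedure, assuming naive regex-based splitters in place of the NLTK sentence and word tokenizers (so counts will differ slightly from what NLTK would give). The function names are illustrative only.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def tokens_per_sentence(text):
    """Number of tokens in each sentence of the text."""
    return [len(tokenize(s)) for s in split_sentences(text)]


text = "Emma Woodhouse was handsome, clever, and rich. She had lived nearly twenty-one years!"
counts = tokens_per_sentence(text)
mean_length = sum(counts) / len(counts)
histogram = Counter(counts)  # sentence length -> number of sentences
print(counts, mean_length)
```

The `histogram` counter can be passed to a bar plot to compare the length distributions of different genres.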


### Distance measure based on semantic tree traversal
**Idea:**

- Given a sentence, tag words into parts of speech using the Universal tagset (we prefer not to use the PennTreebank tagset as it is too English-specific and would not mesh well with WordNet).
- Filter only to nouns, adjectives, verbs, and adverbs. 
- Given each tagged word, we find its synset (i.e. the set of synonyms) in WordNet.
- Compute some distance metric between the synsets of the two words. For example, we can use the [Wu-Palmer similarity](https://www.nltk.org/howto/wordnet.html) measure.
- How do we calculate that for all the words in a given text/sentence? 
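
To make the distance step concrete, here is a sketch of Wu-Palmer similarity on a hand-written taxonomy. `TOY_PARENT` is a hypothetical hypernym tree, not WordNet; with NLTK one would instead call `wup_similarity` on two synsets. Depth is counted in nodes, so the root has depth 1.

```python
# Hypothetical hypernym tree (child -> parent); the root has parent None.
TOY_PARENT = {
    "entity": None,
    "animal": "entity",
    "canine": "animal",
    "feline": "animal",
    "dog": "canine",
    "fox": "canine",
    "cat": "feline",
}


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = TOY_PARENT[node]
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def wu_palmer(a, b):
    """2 * depth(lcs) / (depth(a) + depth(b)), lcs = deepest shared ancestor."""
    ancestors_of_a = set(path_to_root(a))
    # Walking up from b, the first node that is also above a is the deepest one.
    lcs = next(n for n in path_to_root(b) if n in ancestors_of_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wu_palmer("dog", "fox"), wu_palmer("dog", "cat"))
```

In this toy tree "dog" and "fox" share the ancestor "canine" and score 0.75, while "dog" and "cat" only share "animal" and score 0.5. One possible way to extend this to a whole sentence would be to average the scores over aligned word pairs, but that is a design choice the notes above leave open.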

#### Potential Issues
Potential issues with this approach may include:
- The use of WordNet. It is a good resource, but it is not perfect. For example, it does not contain all the words in the English language.
- Setting. Some words are not substitutable in certain contexts. For example, the collocation "big sister" cannot be replaced with "large sister" or "huge sister" without completely altering the meaning of the phrase. Similarly, the word H2O is used in scientific contexts and would be inappropriate in a hiking guide, where water would be more appropriate, and this genre difference is part of the meaning of the word. In practice, the word synonym is therefore used to describe a relationship of approximate or rough synonymy. \cite{jurafsky2014speechorwhatever}

### Issues with Word2Vec
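Word2Vec stores a single vector per word form, so it does not quite capture part-of-speech senses (e.g. a noun and a verb with the same spelling share one embedding).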

**Remarks:**
1. Cosine similarity takes values in the range $[-1, 1]$. The closer the value is to 1, the more similar the two vectors are. `Word2Vec` embeddings can have negative components, so negative similarities do occur, but a low or negative value should not be read as antonymy: antonyms tend to appear in similar contexts and therefore often have relatively high similarity. We therefore make no assumptions such as `Word_A` and `Word_B` being antonyms.
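
A quick check of how cosine similarity behaves, using hand-picked 2-dimensional toy vectors (hypothetical values, not actual `Word2Vec` embeddings):

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, -1.0], [-1.0, 1.0])
print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way score close to 1, orthogonal vectors score 0, and vectors pointing in opposite directions score close to -1.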

#### Benchmarking insights for speed concerns
In the benchmarks run for this project, Torch tensors and NumPy arrays had effectively the same speed. NumPy arrays were, however, more memory-efficient.

### Slope of the curve of likelihood of a word given a context
<!-- We explore how certainty of a word given a context changes as we move away from the context. (this is not a bad suggestion by copilot, but it strays from my original purpose for this project) -->
We explore how certainty for a prediction given some context changes in the likelihood space of the BERT MLM. In practice, this enables us to see how certain the model is about the predictions it is making, and it can potentially allow us to compare values across different contexts or same contexts but with different masked tokens.

The least-squares slope of a given curve has the following equation (where $x$ is the input and $y$ is the output):
$$
\frac{\sum_{i = 1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i = 1}^{n} (x_i - \bar{x})^2}
$$
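
As a sketch, the ordinary least-squares slope (the sum of products of deviations divided by the sum of squared deviations in $x$) can be computed directly. Here $x$ is assumed to be the rank of a suggestion and $y$ its score; the score values are hypothetical.

```python
def least_squares_slope(xs, ys):
    """Slope of the least-squares line through the points (xs[i], ys[i])."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    return numerator / denominator


# Hypothetical top-5 scores for one masked position, sorted in descending order.
ranks = [1, 2, 3, 4, 5]
scores = [12.3, 10.5, 10.0, 9.2, 8.5]
print(least_squares_slope(ranks, scores))
```

A steeper negative slope means the score drops quickly after the top suggestion, which suggests the model is more certain about its first choice.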


Processing the first 100 sentences of the Brown corpus (~1000 tokens) takes about 1.5 minutes, which is too slow to be practical.

## Similar idea, but sentence-level instead of word-level
Use `BertForNextSentencePrediction` to determine how "unpredictable" the following sentence is given the previous sentence (as prompt). We mask the second sentence and take something like a cosine similarity / set difference between the two vectors.

In [1]:
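import torch
from transformers import BertForNextSentencePrediction

# Use a separate name so the masked-LM `model` loaded earlier is not overwritten
nsp_model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
next_sentence = "The sky is blue due to the shorter wavelength of blue light."

# `tokenizer` is the one returned by `default_model` in the first cell
encoding = tokenizer(prompt, next_sentence, return_tensors="pt")
# Label 1 marks the second sentence as random (0 would mark a true continuation)
outputs = nsp_model(**encoding, labels=torch.LongTensor([1]))

logits = outputs.logits
# Index 0 scores "is the next sentence", index 1 scores "is a random sentence"
print(logits[0, 0], logits[0, 1])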



## Notes
- What current approaches to creativity evaluation have been implemented so far?

### Related Work
- Current methods/similar work with POS tags, proof that POS tags are (not) enough to capture creativity


- Putting some ideas of the notebook into the thesis
- Write down the methodology and datasets for the experiment
- Draw some comparisons between the suggested replacements in the masked model - maybe mean, cosine similarity, etc.
- Maybe use stuff like word2vec to show differences between the words

## Alternative implementation for semantic distance measure which abuses masked language models for context-aware words

Initialize models

### Find average similarity of suggested tokens

In [3]:
from string import punctuation
from madhatter.utils import stopwords
stopwords = stopwords.union(set(punctuation))
preds = sliding_window_preds_tagged(
    bench.tagged_words()[:1000], model, tokenizer, return_tokens=True, k=5,
    tags_of_interest=bench.tags_of_interest, stopwords=stopwords)
preds

# Be warned: running this without any stopwords doubles or triples the time required


[Prediction(word='VOLUME', original_tag='NOUN', suggestions=('chapter', 'part', 'book', 'volume', 'act', 'page', 'scene', '.', 'section', 'chapters'), probs=(12.317625999450684, 10.473714828491211, 10.012399673461914, 9.153491973876953, 8.500191688537598, 6.275146484375, 6.24095344543457, 6.140644073486328, 6.121601104736328, 6.072627067565918)),
 Prediction(word='CHAPTER', original_tag='VERB', suggestions=('volume', 'part', 'and', 'book', '&', 'chapter', ',', '.', 'of', ':'), probs=(10.810314178466797, 10.248991966247559, 8.3878755569458, 8.070950508117676, 8.048393249511719, 7.632352352142334, 7.521879196166992, 7.468377590179443, 6.658961772918701, 6.4384026527404785)),
 Prediction(word='Emma', original_tag='NOUN', suggestions=(':', 'john', 'james', 'william', '-', 'charles', 'richard', 'robert', '.', 'george'), probs=(6.862824440002441, 6.148440361022949, 5.631335258483887, 5.597689628601074, 5.380741119384766, 5.256239891052246, 5.248263359069824, 5.196001052856445, 5.097869873046
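
The `probs` values in the output above exceed 1, so they appear to be unnormalised scores rather than probabilities (an assumption; the output itself does not say). A minimal sketch of turning such scores into relative weights with a softmax, using values rounded from the first prediction and only its top four suggestions:

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Scores rounded from the first prediction above (top four suggestions only).
suggestions = ["chapter", "part", "book", "volume"]
scores = [12.32, 10.47, 10.01, 9.15]
weights = softmax(scores)
for word, weight in zip(suggestions, weights):
    print(f"{word}: {weight:.3f}")
```

Normalising over the top suggestions only gives weights relative to each other. True probabilities would require a softmax over the model's full vocabulary.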

In [6]:
from madhatter.utils import stopwords
from string import punctuation
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'