# **Using LLM models via _HuggingFace_**

## **_Huggingface_ Login**

Once we have created an account and an **access token**, we need to log in to _Huggingface_ from code.

- Type your token and press Enter
- You can say NO to _Github_ linking

After login, we can download all models associated with the **access token** in addition to those that are not protected by an **access token**.
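To avoid pasting a token into a notebook cell (where it would be saved together with the notebook), one option is to read it from an environment variable and pass it to `huggingface_hub.login`. The following is a minimal sketch, assuming the token was stored beforehand in a variable named `HF_TOKEN`; the helper names `get_hf_token` and `login_from_env` are hypothetical.

```python
import os


def get_hf_token(env=os.environ):
    """Read the access token from the HF_TOKEN variable (assumed name)."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; create a token and export it first.")
    return token


def login_from_env():
    """Log in without ever writing the token in the notebook."""
    # Imported lazily so the helper above can be used without the package.
    from huggingface_hub import login

    login(token=get_hf_token())


# Usage (uncomment once HF_TOKEN is defined in the environment):
# login_from_env()
```

With this approach the token never appears in the notebook source or in its saved outputs.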

In [None]:
!pip install transformers accelerate bitsandbytes huggingface_hub

In [None]:
!hf auth login  # paste your own access token at the prompt; never store a real token in the notebook


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: read).
The token `LLM_creativity` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `LLM_creati

## Imports

In [None]:
import torch
import os
import pandas as pd
from transformers import pipeline, infer_device, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

## Using `Pipeline`

- The **_pipeline_** function returns an end-to-end object that performs an **NLP task** on one or several texts.
  - <u>Different tasks:</u> _Text classification_, _Zero-shot classification_, _Text generation_, _Text completion (mask filling)_, _Token classification_, _Question answering_, _Summarization_, _Translation_.

### Single input for **_non-chat_** / **_instruct_ models**

In [None]:
generator = pipeline(task="text-generation", model="Qwen/Qwen3-0.6B")
generator("the secret to baking a really good cake is ", max_new_tokens = 100)

Device set to use cpu


[{'generated_text': "the secret to baking a really good cake is 12 months of experience, but this is not the case. it's the combination of experience and learning from others, which is what makes it possible. so, what's the secret to baking a really good cake, and how can one achieve it? it's not just about the ingredients, but also about learning from others. but how can this be done? the answer is simple: experience and learning from others. but how can one actually implement this? the answer is simple: practice, learn"}]

### Single input for **_chat_ models**

In [None]:
generator = pipeline(task="text-generation", model="Qwen/Qwen3-0.6B", device_map="auto")

chat = [
    {"role": "system", "content": "You are an evaluator and you are an expert in creative writing. Be concise."},
    {"role": "user", "content": "Evaluate this text on a scale from 1 to 5. \nText:\n\nText A"}
]

response = generator(chat, max_new_tokens=256)

print(chat)

# response[0]["generated_text"] holds the updated chat; its last entry is the assistant's answer
chat = response[0]["generated_text"]
print(chat[-1]["content"])
print(chat)

# to continue the conversation, append a new user message and call the generator again
chat.append(
    {"role": "user", "content": "Give a brief explanation of your vote."}
)
response = generator(chat, max_new_tokens=256)  # call the pipeline object, not the `pipeline` factory function
print(response[0]["generated_text"][-1]["content"])


Device set to use cpu


[{'role': 'system', 'content': 'You are an evaluator and you are an expert in creative writign. Be concise.'}, {'role': 'user', 'content': 'Evaluate this text on a scale from 1 to 5. \nText:\n\nText A'}]
<think>
Okay, so I need to evaluate the text on a scale from 1 to 5. The user provided the text "Text A" but didn't include it. Wait, maybe they made a mistake? Let me check again. Oh, maybe they intended to include it but forgot. Hmm. If I don't have the actual text, I can't assess it. But perhaps the user expects me to respond with a placeholder or mention that. Let me make sure I don't miss any information. Since the user might have intended to provide the text but forgot, I should inform them. But the instructions say to evaluate the text. Maybe the text is part of a larger context? Wait, no, the user only provided "Text A" in the query. Maybe the text is a sample or part of a code? Since I can't access external content, I should clarify that. Alternatively, if the text is part of 

TypeError: unhashable type: 'list'

### More than one input **_non-chat_** / **_instruct_ models**

In [None]:
device = infer_device()  # automatically detect an available accelerator for inference

# keep the pipeline object in its own variable so the imported `pipeline` function is not overwritten
generator = pipeline(task="text-generation", model="google/gemma-2-2b", device=device)
generator(["the secret to baking a really good cake is ", "a baguette is "])

### Trying to address our intent:


#### Batch evaluation for **_non-chat_** / **_instruct_** **models** (**prompt template**)

In [None]:
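# we can use an instruct or base model (e.g. "HuggingFaceTB/SmolLM2-1.7B-Instruct")
# NOTE: check that this repo id exists on the Hub and that your token has access (Llama checkpoints are gated)
generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-instruct", device_map="auto")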

texts = [
    "Text A to be evaluated ...",
    "Text B to be evaluated ..."
]

def prompt_template(txt):
  return (
    "You are an evaluator. Evaluate the following text on a scale from 1 to 5.\n"
    "Answer with the evaluation score and an explanation.\n"
    f"Text:\n{txt}\n\nAnswer (JSON):"
  )

prompts = [prompt_template(t) for t in texts]

# a list of prompts can be passed to the pipeline directly
outputs = generator(prompts, max_new_tokens=150, batch_size=2)

for i, out in enumerate(outputs):
    # with a list of string prompts, each element is usually a list of candidate generations
    print(f"=== Output for text {i} ===")
    print(out[0]['generated_text'] if isinstance(out, list) else out['generated_text'])
    # TODO: check whether this output format is correct
    # print(response[0]["generated_text"][-1]["content"])
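
For plain string prompts, `generated_text` normally contains the prompt followed by the continuation, so the answer has to be separated from the prompt before it can be parsed. A minimal sketch of that step follows; `strip_prompt` is a hypothetical helper and the sample output is made up, not real model output. Passing `return_full_text=False` to the text-generation pipeline call is an alternative that returns only the continuation.

```python
def strip_prompt(prompt: str, generated_text: str) -> str:
    """Return only the continuation when the output starts with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fall back to the full text if the prompt is not a prefix.
    return generated_text.strip()


# Made-up example: the prompt followed by an invented continuation.
prompt = "Text:\nText A\n\nAnswer (JSON):"
full_output = prompt + ' {"score": 4, "explanation": "Vivid imagery."}'

print(strip_prompt(prompt, full_output))
```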


#### Batch evaluation for **_chat models_**

Chat models accept a list of **messages** (the _chat history_) as the input. Each message is a dictionary with **_role_** and **_content_** keys.

To start the chat, add a single **_user_ message**.
- You can also optionally include a **_system_ message** to give the model directions on how to behave.

In [None]:
generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-1.7B-Instruct", device_map="auto")

texts = [
    "Text A to be evaluated ...",
    "Text B to be evaluated ..."
]

# For each text, we need the chat history
chat_inputs = []
for t in texts:
    chat = [
        {"role": "system", "content": "You are an evaluator and you are an expert in creative writing. Be concise."},
        {"role": "user", "content": f"Evaluate this text on a scale from 1 to 5. Give a brief explanation:\n\n{t}"}
    ]
    chat_inputs.append(chat)

responses = generator(chat_inputs, max_new_tokens=120, batch_size=2)

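# every response includes "generated_text", which holds the updated chat; the last entry is the assistant's answer
for r in responses:
    # for batched inputs each element may itself be a list with one result, so unwrap it if needed
    result = r[0] if isinstance(r, list) else r
    # result["generated_text"] is a list of dictionaries; the last element is the assistant's answer
    print(result["generated_text"][-1]["content"])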
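
Since the goal is a score from 1 to 5, the assistant's answer still has to be turned into a number. Some chat models (for example the Qwen3 output shown earlier) wrap their reasoning in `think` tags, and that reasoning can mention the scale itself. The sketch below removes such a block before looking for the score. `strip_reasoning` and `extract_score` are hypothetical helpers, the sample answers are made up, and taking the first standalone digit is a naive heuristic that can pick the wrong number (for instance in "out of 5, I give 3").

```python
import re


def strip_reasoning(answer: str, tag: str = "think") -> str:
    """Drop reasoning blocks wrapped in the given tag, even if left unclosed."""
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"
    # Remove complete blocks first.
    pattern = re.escape(open_tag) + r".*?" + re.escape(close_tag)
    answer = re.sub(pattern, "", answer, flags=re.DOTALL)
    # If generation stopped before the closing tag, drop everything after the opening tag.
    answer = re.sub(re.escape(open_tag) + r".*", "", answer, flags=re.DOTALL)
    return answer.strip()


def extract_score(answer: str):
    """Return the first standalone digit between 1 and 5, or None if there is none."""
    match = re.search(r"\b([1-5])\b", strip_reasoning(answer))
    return int(match.group(1)) if match else None


# Made-up answers, not real model output.
TAG = "think"
finished = f"<{TAG}>The scale goes from 1 to 5.</{TAG}>\nScore: 4. Vivid imagery."
truncated = f"<{TAG}>The scale goes from 1 to 5, so maybe a 3"

print(extract_score(finished))
print(extract_score(truncated))
```

Returning `None` for a truncated answer makes it easy to detect generations that ran out of `max_new_tokens` before producing a verdict.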


In [None]:
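# to continue one of the conversations (here the first one)
first = responses[0][0] if isinstance(responses[0], list) else responses[0]
chat = first["generated_text"]
chat.append(
    {"role": "user", "content": "Woah! But can it be reconciled with quantum mechanics?"}
)
response = generator(chat, max_new_tokens=512)  # call the pipeline object, not the `pipeline` factory function
print(response[0]["generated_text"][-1]["content"])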

## Model Quantization

To fit the pre-trained model on a single GPU, we quantize it.
Quantization reduces **memory** and **computational costs** by representing weights and activations with *lower-precision* data types.
The pre-trained model is loaded through its `model_card`, and quantization is applied at loading time.
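
A rough back-of-the-envelope calculation shows why lower precision helps. The sketch below counts the weights only (activations, the KV cache, and the small overhead of quantization constants are ignored) and assumes a model with about 8 billion parameters; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed to store the weights, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 8e9  # assumption: roughly 8 billion parameters

for label, bits in [("float32", 32), ("bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(n_params, bits):.1f} GB")
```

Under these assumptions, 8-bit storage halves the footprint of 16-bit weights, and 4-bit storage halves it again.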

In [None]:
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
# keep the pipeline object in its own variable so the imported `pipeline` function is not overwritten
generator = pipeline(task="text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct", device_map="auto", model_kwargs={"quantization_config": quantization_config})

## Loading model & tokenizer

In [None]:
model_name = "gpt2"

# To load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Quantization parameters
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # store the model weights in 4-bit precision instead of 16/32-bit floating point
    bnb_4bit_use_double_quant=True,  # also quantize the quantization constants to save extra memory (double quantization)
    bnb_4bit_quant_type="nf4",  # 4-bit NormalFloat quantization format
    bnb_4bit_compute_dtype=torch.bfloat16,  # data type used for computation
)

# To load the model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    return_dict=True,
    #quantization_config=bnb_config,
    device_map='auto'  # places the weights on the available device(s); no extra .to('cuda') is needed
)

## Testing the model

In [None]:
# encode input
inputs = tokenizer("Hi, how are you?", return_tensors="pt").to(model.device)  # same device as the model

# generate response
outputs = model.generate(**inputs, max_new_tokens=10)

# decode output
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Hi, how are you?

I'm a little bit of a nerd


## Chat template

The input to `apply_chat_template` should be structured as a list of dictionaries with ***role*** and ***content*** **keys**.

The ***role*** key specifies the speaker, and the ***content*** key contains the message.

<u>The common roles are:</u>

- **_system_** - For directives on how the model should act (usually placed at the beginning of the chat)
- **_user_** - For messages from the user
- **_assistant_** - For messages from the model

### How to use `apply_chat_template`
- The **_add_generation_prompt_** argument adds **tokens** to the end of the chat that indicate the start of an _assistant_ response.
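
To make the effect of `add_generation_prompt` concrete, here is a toy renderer that mimics what a chat template does. It uses ChatML-style markers purely for illustration: `render_chatml` is a hypothetical function, and the real template of each model (including the one used below) is defined by its tokenizer and may look different.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Toy chat template: wrap each message in ChatML-style markers."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its own answer.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "Be concise."},
    {"role": "user", "content": "Hello!"},
]

print(render_chatml(messages, add_generation_prompt=True))
```

Without the generation prompt, the rendered text ends right after the user turn, and a model may continue the user's message instead of answering it.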

In [None]:
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta", device_map="auto", dtype=torch.bfloat16)

messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate",},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
 ]

tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
print(tokenizer.decode(tokenized_chat[0]))

outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=128)  # inputs must be on the model's device
print(tokenizer.decode(outputs[0]))

## Preparing the prompt

In [None]:
prompt_zero = [
    {
        'role': 'system',
        'content': 'You are an annotator for sexism detection.'
    },
    {
        'role': 'user',
        'content': """Your task is to classify input text as containing sexism or not. Respond only YES or NO.

        TEXT:
        {text}

        ANSWER:
        """
    }
]

In [None]:
def prepare_prompts_zero(texts, prompt_template, tokenizer, generation_prompt=True):
  """
    This function formats input text samples into instruction prompts.

    Inputs:
      texts: input texts to classify via prompting
      prompt_template: the prompt template provided in this assignment
      tokenizer: the transformers Tokenizer object instance associated with the chosen model card
      generation_prompt: whether to append the tokens that mark the start of the assistant response

    Outputs:
      input texts to classify in the form of instruction prompts
  """

  texts_formatted = []
  prompt_template = tokenizer.apply_chat_template(prompt_template, tokenize=False, add_generation_prompt=generation_prompt)

  for text in texts:
    text_formatted = prompt_template.format(text=text)
    texts_formatted.append(text_formatted)

  return texts_formatted

# Evaluating **creativity** using **_LLMs_**

## **_Creativity metrics_**

We want to see the alignment between the different **metrics**. Each **metric** reflects a different _level of granularity_ in **_creativity evaluation_**.

- ***Creativity Index*** to assess **phrasing-level creativity**
  - This mainly reflects lexical diversity, with its reliance on _n-gram range_ affecting its reliability; it quantifies originality by measuring how much text can be attributed to existing web content.

- ***Perplexity*** to capture **token-level diversity**
  - It measures the statistical unexpectedness of text based on probability distributions. It indicates how well a probability model predicts a sample: the lower the probability the model assigns to the text, the higher the perplexity, so the content is less expected and possibly more creative.

- ***Syntactic Templates*** to evaluate **structural creativity**
  - This can detect structural patterns but fails to capture conceptual creativity, especially in unconventional solutions expressed through conventional language.

- ***LLM-as-a-Judge*** for a **holistic assessment**
  - This tends to produce biased predictions and is unstable across different prompts, raising concerns about its reliability as an autonomous evaluator.

## **_Tests for Creativity_**

Several **standardized tests** are commonly used to assess **_creativity_**, each measuring different aspects of divergent and associative thinking.

- The ***Torrance Tests of Creative Thinking*** (*TTCT; Torrance, 1966*) are a widely used benchmark that evaluates **creativity** along the axes of **fluency**, **flexibility**, **originality**, and **elaboration**. They were applied to LLMs in [this paper](https://arxiv.org/abs/2309.14556) (_Art or Artifice? Large Language Models and the False Promise of Creativity_). The main problem is that these tests require **human evaluators**.
- The ***Divergent Association Task*** quantifies **creative** **potential** based on **associative thinking** and **semantic networks** (*Olson et al., 2021*).
  - **Associative thinking** is the skill of connecting ideas, memories, or concepts that don't seem related at first glance.
- The ***Remote Associates Test*** assesses **creativity** by measuring the ability to connect seemingly unrelated words (*Mednick, 1962*).
- The ***Alternative Uses Task*** (*AUT*) gauges **divergent** **thinking** through the generation of multiple novel uses for common objects (*Guilford, 1967*).

Although these tests are well established for assessing ***human creativity***, their suitability for evaluating and optimizing **machine learning models**, such as **LLMs**, is limited.

Additionally, they require significant human effort and cost, posing challenges for integration into model training loops, and the reliance on subjective human evaluation could introduce potential biases.

**Automated creativity measures**, including ***linguistic diversity***, ***text perplexity***, and ***LLM-based judgment***, offer scalable alternatives to human assessment.

## Datasets used - _WritingPrompt_ Dataset

This dataset was introduced in the paper **_Hierarchical Neural Story Generation_** and contains stories written in response to different prompts. It originates from the Reddit forum [_r/WritingPrompts_](https://www.reddit.com/r/WritingPrompts/).
- Given a **prompt**, users can write their **stories**.


In [None]:
# To directly use the dataset

from datasets import load_dataset

ds = load_dataset("euclaise/writingprompts")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/837 [00:00<?, ?B/s]

data/train-00000-of-00002-105e07cb0d1994(…):   0%|          | 0.00/272M [00:00<?, ?B/s]

data/train-00001-of-00002-4fdb982c110564(…):   0%|          | 0.00/272M [00:00<?, ?B/s]

data/test-00000-of-00001-16503b0c26ed00c(…):   0%|          | 0.00/30.0M [00:00<?, ?B/s]

data/validation-00000-of-00001-137b93e1e(…):   0%|          | 0.00/30.7M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/272600 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/15138 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/15620 [00:00<?, ? examples/s]

We isolate all the prompts that start with _[WP]_.

In [None]:
# create DataFrame from Dataset
df = pd.DataFrame(ds['train'])

# isolate prompts starting with [WP]
mask = df['prompt'].str.startswith('[ WP ] ')
df_noWP = df[mask].copy()  # copy so that later edits do not trigger SettingWithCopyWarning

We remove _[WP]_ to avoid introducing biases.

In [None]:
# Remove the [WP] from the beginning of the prompt
df_noWP['prompt'] = df_noWP['prompt'].str.replace('[ WP ] ', '', regex=False)  # literal match, not a regular expression



This is the final **dataset**

In [None]:
df_noWP.head()

| | prompt | story |
|---|---|---|
| 0 | You 've finally managed to discover the secret... | So many times have I walked on ruins, the rema... |
| 1 | The moon is actually a giant egg , and it has ... | -Week 18 aboard the Depth Reaver, Circa 2023- ... |
| 2 | You find a rip in time walking through the all... | I was feckin' sloshed, mate. First time I ever... |
| 3 | For years in your youth the same imaginary cha... | “ No, no no no... ” She backed up and turned t... |
| 4 | You glance at your watch 10:34 am , roughly 10... | There's a magical moment between wakefulness a... |


## **_Creativity Index_**


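Presented in the paper [_AI as Humanity's Salieri: Quantifying Linguistic Creativity of Language Models via Systematic Attribution of Machine Text against Web Text_](https://arxiv.org/abs/2410.04265).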

***Creativity Index*** to assess **phrasing-level creativity**
  - This mainly reflects lexical diversity, with its reliance on _n-gram range_ affecting its reliability; it quantifies originality by measuring how much text can be attributed to existing web content.

Used to quantify the **linguistic creativity** of a text by <u>reconstructing it from existing *text snippets* on the web</u>.


***CREATIVITY INDEX*** is motivated by the hypothesis that the seemingly remarkable creativity of LLMs may be attributable in large part to the ***creativity*** of **human-written texts** on the web. It therefore measures how much of a text can be reconstructed by mixing and matching a vast amount of existing text snippets on the web.
- This metric uses the ***DJ SEARCH*** algorithm (a **dynamic programming algorithm**) to identify *verbatim* and *near-verbatim* (high semantic similarity) matches against web corpora.
  - This algorithm combines ***strict verbatim matching*** using _Infini-gram_, which allows for fast retrieval of any existing sequence of words, with ***near-verbatim semantic matching*** achieved through a novel application of **Word Mover’s Distance** (**WMD**) computed on the word embeddings of text snippets.
  - **WMD** is computed only if no matches are found by **Infini-gram**.
  - **Infini-gram** is queried through the API available at [infini-gram.io](https://infini-gram.io/).

### **How to use it**

In [None]:
!git clone https://github.com/GXimingLu/creativity_index.git
!pip install unidecode sacremoses

Cloning into 'creativity_index'...
remote: Enumerating objects: 42, done.
remote: Counting objects: 100% (42/42), done.
remote: Compressing objects: 100% (38/38), done.
remote: Total 42 (delta 1), reused 39 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (42/42), 956.44 KiB | 8.25 MiB/s, done.
Resolving deltas: 100% (1/1), done.
Collecting unidecode
  Downloading Unidecode-1.4.0-py3-none-any.whl.metadata (13 kB)
Collecting sacremoses
  Downloading sacremoses-0.1.1-py3-none-any.whl.metadata (8.3 kB)
Downloading Unidecode-1.4.0-py3-none-any.whl (235 kB)
Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)
Installing collected packages: unidecode, sacremoses
Successfully installed sacremoses-0.1.1 unidecode-1.4.0


**<u>To do:</u>**
- replace `HF_TOKEN` in ***DJ_search_exact.py*** with our own Huggingface token

To compute the ***Creativity Index*** based on exact matches with the default hyperparameters, run the command below, where:

- `--task` names the output file;
- `--min_ngram` sets the ***L-value*** (the minimum n-gram length).

In [None]:
# (access token removed: never store a real token in a notebook)

In [None]:
!python /content/creativity_index/DJ_search_exact.py --task GPT3_book --data /content/creativity_index/data/book/GPT3_book.json --output_dir /content/creativity_index/outputs/book --subset 1 --min_ngram 5

tokenizer_config.json: 100% 776/776 [00:00<00:00, 4.29MB/s]
tokenizer.model: 100% 500k/500k [00:00<00:00, 952kB/s]
tokenizer.json: 100% 1.84M/1.84M [00:00<00:00, 5.55MB/s]
special_tokens_map.json: 100% 414/414 [00:00<00:00, 2.91MB/s]
target docs:   0% 0/2 [00:00<?, ?it/s]average 5-ngram coverage: 0.962, std: 0.000, average length: 6.582089552238806
target docs:  50% 1/2 [01:49<01:49, 109.53s/it]Exception ignored in: <generator object tqdm.__iter__ at 0x7a520e21d6c0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/tqdm/std.py", line 1196, in __iter__
    self.close()
  File "/usr/local/lib/python3.12/dist-packages/tqdm/std.py", line 1302, in close
    self.display(pos=0)
  File "/usr/local/lib/python3.12/dist-packages/tqdm/std.py", line 1495, in display
    self.sp(self.__str__() if msg is None else msg)
            ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tqdm/std.py", line 1151, in __str__
    return self.format_meter(**self.for

### **Parameters** used

Regarding the parameters used to compute the **_creativity index_**, the original paper says:

> _We set the minimum n-gram length L in DJ SEARCH to 5, and set the threshold for Word Mover’s Distance to 0.95 for semantic matches. We observe that the L-uniqueness is close to zero for most human and machine texts when **L ≤ 5** and close to one when **L ≥ 12**. Therefore, <u>in practice, we sum up the L-uniqueness for **5 ≤ L ≤ 12** when computing **CREATIVITY INDEX**</u>.
The only experiment with slightly different parameters is to compare the creativity of GPT-4 with humans. We observed that the L-uniqueness is close to one when **L ≥ 7** based on the model-generated reference corpus. Therefore, we sum up the L-uniqueness for **5 ≤ L ≤ 7** when computing **CREATIVITY INDEX**._





### **Results**

The script runs `find_exact_match()` to identify n-gram spans inside the text that exist in the reference corpus (the one behind https://api.infini-gram.io/).

<u>For each document, it stores:</u> (for example)
- `"coverage": 0.9617,`
- `"avg_span_len": 6.5820`

Those two numbers summarize **how much of the text was found** in the corpus, and **how long the matching pieces** tend to be.

- **_Coverage_**
  - `coverage ≈ 1.0` → Nearly all of the document's tokens appear in sequences that also exist in the reference corpus.
  - `coverage ≈ 0.0` → None of the n-grams (of at least **`min_ngram`** length) occur in the reference corpus.

- **_Average span length_**
  - `avg_span_len` is the average number of tokens per matched span, i.e. how long each continuous matching stretch is.
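
The two numbers can be illustrated with a toy version of exact matching against a tiny in-memory reference text. This is only a sketch of the idea, not the DJ SEARCH implementation: the real script queries the Infini-gram API, while `coverage_stats` here is a hypothetical helper that marks every token lying inside at least one n-gram of length `min_ngram` that also occurs in the reference.

```python
def ngram_set(tokens, n):
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def coverage_stats(doc_tokens, reference_tokens, min_ngram):
    """Return (coverage, avg_span_len) for exact n-gram matches."""
    reference = ngram_set(reference_tokens, min_ngram)
    covered = [False] * len(doc_tokens)
    for i in range(len(doc_tokens) - min_ngram + 1):
        if tuple(doc_tokens[i:i + min_ngram]) in reference:
            for j in range(i, i + min_ngram):
                covered[j] = True

    # Merge consecutive covered tokens into spans and record their lengths.
    spans, run = [], 0
    for flag in covered:
        if flag:
            run += 1
        elif run:
            spans.append(run)
            run = 0
    if run:
        spans.append(run)

    coverage = sum(covered) / len(doc_tokens) if doc_tokens else 0.0
    avg_span_len = sum(spans) / len(spans) if spans else 0.0
    return coverage, avg_span_len


reference = "the quick brown fox jumps over the lazy dog".split()
document = "a quick brown fox jumps high over the lazy cat".split()

coverage, avg_span_len = coverage_stats(document, reference, min_ngram=3)
print(f"coverage={coverage:.2f}, avg_span_len={avg_span_len:.2f}")
```

In this example 7 of the 10 document tokens fall inside matched spans (one span of 4 tokens and one of 3), so the coverage is 0.70 and the average span length is 3.5.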

To compute the **_Creativity Index_**, we need to run the **DJ SEARCH** algorithm several times.
- ***Creativity Index = Σ L-uniqueness for L from 5 to 12***
  - In the paper they use **L_min = 5** and **L_max = 12**
- ***L-uniqueness = 1 - coverage***

<u>To compute the Creativity Index, we need to do the following:</u>
1. Run **DJ SEARCH** with different `min_ngram` values
1. For each value, extract `coverage`
1. Compute ***L-uniqueness = 1 - coverage***
1. Sum all ***L-uniqueness*** values (and, for the averaged variant, divide by the number of L values)

In the paper "[Rethinking Creativity Evaluation: A Critical Analysis of Existing Creativity Evaluations](https://arxiv.org/abs/2508.05470)", the authors state: "_The final CI score is computed by averaging L-uniqueness scores across a selected range of L_".
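
As a worked example of the arithmetic, the snippet below uses the rounded coverage values that the run on `Human_book.json` prints further down in this notebook (0.732, 0.369 and 0.091 for L = 5, 6, 7) and shows both the summed and the averaged variant.

```python
# Rounded coverage values taken from the notebook log for Human_book.json.
coverage_by_L = {5: 0.732, 6: 0.369, 7: 0.091}

# L-uniqueness = 1 - coverage, computed for each L.
l_uniqueness = {L: 1 - c for L, c in coverage_by_L.items()}

ci_sum = sum(l_uniqueness.values())    # summed variant
ci_mean = ci_sum / len(l_uniqueness)   # averaged variant

for L, u in l_uniqueness.items():
    print(f"L={L}: L-uniqueness = {u:.3f}")
print(f"sum = {ci_sum:.3f}, mean = {ci_mean:.4f}")
```

The averaged value agrees with the 0.6027 reported by the run up to the rounding of the coverage values.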


In [None]:
from creativity_index.DJ_search_exact import dj_search
import json
import os

def compute_creativity_index(data_path, output_dir, subset=1, lm_tokenizer=False):

  creativity_index_values = []
  os.makedirs(output_dir, exist_ok=True)
  l_values = range(5, 8)  # L = 5, 6, 7

  # Run DJ SEARCH with different min_ngram values
  print("Running DJ Search for L=5 to L=7...")
  for min_ngram in l_values:
    output_file = os.path.join(output_dir, f'L_{min_ngram}.json')
    dj_search(data_path, output_file, min_ngram=min_ngram, subset=subset, lm_tokenizer=lm_tokenizer)

  # For each text, sum all the L-uniqueness values (1 - coverage)
  print("Computing Creativity Index...")
  for text in range(subset):
    creativity_index = 0

    for min_ngram in l_values:
      output_file = os.path.join(output_dir, f'L_{min_ngram}.json')
      with open(output_file, 'r') as f:  # close the file after reading
        values = json.load(f)
      creativity_index += 1 - values[text]['coverage']  # L-uniqueness = 1 - coverage

    creativity_index = creativity_index / len(l_values)  # average over the 3 L-uniqueness values (L = 5, 6, 7)
    creativity_index_values.append(creativity_index)
    print(f"Text {text}: Creativity Index = {creativity_index:.4f}")
  print(f"All Creativity Indices: {creativity_index_values}")

  return creativity_index_values

In [None]:
data_path = '/content/creativity_index/data/book/Human_book.json'
output_dir =  '/content/creativity_index/outputs/book/L/'

compute_creativity_index(data_path, output_dir, subset=1, lm_tokenizer=False)

Running DJ Search for L=5 to L=7...


target docs: 100%|██████████| 1/1 [01:41<00:00, 101.88s/it]


average 5-ngram coverage: 0.732, std: 0.000, average length: 5.431818181818182


target docs: 100%|██████████| 1/1 [01:24<00:00, 84.75s/it]


average 6-ngram coverage: 0.369, std: 0.000, average length: 6.357142857142857


target docs: 100%|██████████| 1/1 [01:19<00:00, 79.10s/it]

average 7-ngram coverage: 0.091, std: 0.000, average length: 7.25
Computing Creativity Index...
Text 0: Creativity Index = 0.6027
All Creativity Indices: [0.6026936026936026]





[0.6026936026936026]

## **_Perplexity_**


Presented in the paper "_[link](https://pubs.aip.org/asa/jasa/article/62/S1/S63/642598/Perplexity-a-measure-of-the-difficulty-of-speech)_"

***Perplexity*** measures the statistical unexpectedness of a text: the **token-level unexpectedness** computed from a language model's prediction probabilities.
- It indicates how well a probability model predicts a sample: the **lower the probability** assigned to the text, the **higher the perplexity**, so the content is <u>less expected and possibly more ***creative***</u>.
- In simpler terms, it indicates how surprised a model is by the actual outcomes. The lower the ***perplexity***, the better the model is at predicting the next word in a sequence, reflecting higher **confidence** in its **predictions**.
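
The definition can be made concrete without any model: perplexity is the exponential of the average negative log-probability assigned to the tokens that actually occurred. The sketch below uses made-up token probabilities; `perplexity_from_probs` is a hypothetical helper.

```python
import math


def perplexity_from_probs(token_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# Made-up probabilities assigned by a model to each observed token.
uncertain = [0.1] * 5            # every token had probability 0.1
confident = [0.9, 0.8, 0.95]     # the model expected these tokens

print(f"uncertain: {perplexity_from_probs(uncertain):.2f}")
print(f"confident: {perplexity_from_probs(confident):.2f}")
```

When every token receives probability 0.1, the perplexity is 10 (up to floating-point error), which matches the reading of perplexity as the number of equally likely choices the model hesitates between.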

### **How to use it**

Implementation done by following the [HuggingFace tutorial](https://huggingface.co/docs/transformers/perplexity) and the article [Understanding Perplexity in Language Models: A Detailed Exploration](https://medium.com/@shubhamsd100/understanding-perplexity-in-language-models-a-detailed-exploration-2108b6ab85af) from _Medium_.

In [None]:
def perplexity(model, tokenizer, text):
  # Tokenize input and move it to the same device as the model
  inputs = tokenizer(text, return_tensors="pt").to(model.device)

  # Disable gradient calculation: only the forward pass is needed
  with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"])  # forward pass: computes the model's outputs and loss
    loss = outputs.loss  # mean negative log-likelihood per predicted token
    ppl = torch.exp(loss)  # perplexity is the exponential of the loss

  return ppl.item()  # plain Python float

In [None]:
model_name = "gpt2"

# To load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# To load the model
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto')

#Example text
text = "I love learning new things every day."

print(f'Perplexity: {perplexity(model, tokenizer, text)}')


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Perplexity: 25.54130744934082


### **Results**

<u>Interpreting ***Perplexity***:</u>
- **_Perplexity_ of 1**: This is the ideal score, indicating that the model predicts the next word perfectly every time.
- **_Perplexity_ above 1**: Indicates some level of uncertainty. The higher the perplexity, the less confident the model is in its predictions.
  - For instance, a ***perplexity*** of 10 means the model is as uncertain as if it were choosing between 10 different possible next words.

<u>Practical Significance of ***Perplexity***:</u>

***Perplexity*** is essential in comparing different language models or the same model under different conditions:
- **Low _Perplexity_**: Indicates the model is good at predicting the next word in a sequence. This is desirable in applications like autocomplete, text generation, and translation.
- **High _Perplexity_**: Suggests the model has difficulty predicting the next word, indicating the need for more training data, better model architecture, or more effective fine-tuning.

In [None]:
# create DataFrame from Dataset
df = pd.DataFrame(ds['train'])

# isolate prompts starting with [WP]
mask = df['prompt'].str.startswith('[ WP ] ')
df_noWP = df[mask].copy()  # copy so that later edits do not trigger SettingWithCopyWarning

# Remove the [WP] from the beginning of the prompt
df_noWP['prompt'] = df_noWP['prompt'].str.replace('[ WP ] ', '', regex=False)



In [None]:
text = df_noWP['story'].iloc[0]  # first row by position, independent of the index labels
print(f'Perplexity: {perplexity(model, tokenizer, text)}')

Perplexity: 43.21891403198242
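
Stories can be longer than the context window of the model (1024 tokens for GPT-2), in which case a single forward pass is not possible. The HuggingFace perplexity tutorial linked above handles this with a strided sliding window. The sketch below reproduces only the window bookkeeping of that approach, without any model; `sliding_windows` is a hypothetical helper and the numbers are toy values.

```python
def sliding_windows(seq_len, max_length, stride):
    """Return (begin, end, target_len) triples for strided evaluation.

    Each window covers tokens [begin, end); only the last `target_len`
    tokens of a window are scored, the rest serve as context.
    """
    windows = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        target_len = end - prev_end  # tokens not scored by a previous window
        windows.append((begin, end, target_len))
        prev_end = end
        if end == seq_len:
            break
    return windows


# Toy numbers: 10 tokens, windows of 4 tokens, moved forward by 2 tokens.
for begin, end, target_len in sliding_windows(seq_len=10, max_length=4, stride=2):
    print(f"tokens [{begin}, {end}): score the last {target_len}")
```

Every token is scored exactly once, and all windows after the first give the scored tokens some preceding context, which is the point of using a stride smaller than the window size.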


## **_Syntactic templates_**



Presented in the paper [_Detection and Measurement of Syntactic Templates in Generated Text_](https://arxiv.org/abs/2407.00211).

***Syntactic templates*** examine structural patterns in text to distinguish between **repetitive** and **diverse** text by identifying <u>commonly used part-of-speech templates</u>.

There are **three metrics**:
- ***CR-POS:*** Measures POS tag sequence diversity using **compression ratios**.
  - Higher values indicate less diversity.
  - Lower values suggest more varied syntactic structures and potentially more ***creative*** text.
- ***Template rate:*** Calculates the fraction of texts containing at least one template.
  - Lower values indicate fewer texts with repetitive patterns, suggesting greater ***structural originality*** across the corpus.
- ***Templates-per-token*** (**TPT**): Normalizes template counts by text length to enable fair comparisons across different sources.
  - Lower values indicate more ***diverse syntactic structures***.
  - Higher values suggest more repetitive patterns.
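
The intuition behind **CR-POS** can be sketched with the standard library: a repetitive sequence of POS tags compresses better than a varied one, so its compression ratio (original size divided by compressed size) is higher. This is only an illustration with hand-made tag sequences and `gzip`; the `compression_ratio` helper is hypothetical and the exact tagger and compressor settings of the `diversity` library may differ.

```python
import gzip
import random


def compression_ratio(pos_tags):
    """Size of the tag sequence divided by its gzip-compressed size."""
    raw = " ".join(pos_tags).encode("utf-8")
    return len(raw) / len(gzip.compress(raw))


# Hand-made sequences: one repeats the same pattern, one is randomly varied.
repetitive = ["DT", "NN", "VBZ", "JJ"] * 50
rng = random.Random(0)  # fixed seed so the example is reproducible
tagset = ["DT", "NN", "VBZ", "JJ", "IN", "PRP", "RB", "VBD", "CC", "NNS"]
varied = [rng.choice(tagset) for _ in range(200)]

print(f"repetitive: {compression_ratio(repetitive):.2f}")
print(f"varied:     {compression_ratio(varied):.2f}")
```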

In **CREATIVE WRITING**, our analysis highlights that LLMs may <u>rely on specific structures to introduce **narrative shifts**</u>, which potentially contributes to the perception of **lower _creativity_** in their storytelling.

Models tend to produce ***templated text*** in downstream tasks at a higher rate than what is found in *human-reference texts*.

<u>**Workflow**:</u>
1. Tag ***all tokens*** in a corpus with their corresponding ***POS tags***, using the [_SpaCy POS tagger_](https://www.geeksforgeeks.org/nlp/nlp-part-of-speech-default-tagging/) or [_NLTK_](https://www.geeksforgeeks.org/nlp/nlp-part-of-speech-default-tagging/).
1. Search for the top 100 most frequent ***n-grams*** across these tags.
1. Compute the **metrics**.
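
The workflow above can be sketched in plain Python, assuming the texts have already been POS-tagged. The helper name `top_pos_ngrams` and the toy tag sequence are illustrative assumptions, not part of any library used in this notebook.

```python
from collections import Counter

def top_pos_ngrams(tagged_texts, n=4, top_n=100):
    """Return the top_n most frequent POS n-grams across a tagged corpus.

    tagged_texts: list of POS-tag lists, one list per text (hypothetical input format).
    """
    counts = Counter()
    for tags in tagged_texts:
        # Slide a window of size n over each tag sequence
        for i in range(len(tags) - n + 1):
            counts[" ".join(tags[i:i + n])] += 1
    return counts.most_common(top_n)

# Toy corpus with a single pre-tagged text
corpus_tags = [
    ['DT', 'NN', 'VBD', 'IN', 'DT', 'NN', ',', 'CC',
     'DT', 'NN', 'VBD', 'IN', 'DT', 'NN', 'RB', '.'],
]

top = top_pos_ngrams(corpus_tags, n=4, top_n=3)
for ngram, count in top:
    print(f"{ngram}: {count}")
```

The resulting n-grams play the role of templates; the metrics are then computed from how often they occur in each text.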

### ***Extract the templates***

Templates are extracted using the ***`diversity`*** library, presented in the paper [Standardizing the Measurement of Text Diversity: A Tool and a Comparative Analysis of Scores](https://arxiv.org/abs/2403.00553).

- `extract_patterns()` has different parameters:
  - **`n`** to indicate the **_n-gram_** size (i.e. length of **templates**). Defaults to 5.
    - ***Templates*** are characterized by their high frequency across the texts in a given corpus.
  - **`top_n`** to indicate the number of top patterns to extract. Defaults to 100.

In [None]:
!pip install diversity

Collecting diversity
  Downloading diversity-0.3.0-py3-none-any.whl.metadata (10 kB)
Collecting evaluate<0.5.0,>=0.4.1 (from diversity)
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Collecting rouge-score<0.2.0,>=0.1.2 (from diversity)
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Downloading diversity-0.3.0-py3-none-any.whl (30 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84.1/84.1 kB 3.9 MB/s eta 0:00:00
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... done
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=d12b1af4311d10994eefa1f2d96a257ced31efccee78340da0092d6a20879055
  Stored in directory: /root/.cache/pip/wheels/85/9d/af/01feefbe7d55ef5468796f0c68225b6788e85d9d0a281e7a70
Successfully built rouge-score
In

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [None]:
# POS-tagging all tokens in the corpus
text = "The fox jumped over the fence, and the dog jumped over the fence too."
words = word_tokenize(text)
pos_tags = pos_tag(words)

for word, tag in pos_tags:
    print(f"{word}: {tag}")

The: DT
fox: NN
jumped: VBD
over: IN
the: DT
fence: NN
,: ,
and: CC
the: DT
dog: NN
jumped: VBD
over: IN
the: DT
fence: NN
too: RB
.: .


In [None]:
pos_tag_list = [tag[1] for tag in pos_tags]
print(pos_tag_list)

['DT', 'NN', 'VBD', 'IN', 'DT', 'NN', ',', 'CC', 'DT', 'NN', 'VBD', 'IN', 'DT', 'NN', 'RB', '.']


In [None]:
from diversity import extract_patterns

text = ["The fox jumped over the fence, and the dog jumped over the fence too."]

# POS pattern extraction
patterns = extract_patterns(text, n=4, top_n=5)
print("Top POS patterns:", patterns)


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Top POS patterns: {'DT NN VBD IN': {'the dog jumped over', 'The fox jumped over'}, 'NN VBD IN DT': {'fox jumped over the', 'dog jumped over the'}, 'VBD IN DT NN': {'jumped over the fence'}, 'IN DT NN ,': {'over the fence ,'}, 'DT NN , CC': {'the fence , and'}}


### ***CR-POS***

***CR-POS:*** Measures POS tag sequence diversity using **compression ratios**.
- Higher values indicate less diversity.
- Lower values suggest more varied syntactic structures and potentially more ***creative*** text.

We are interested in quantifying the *n-gram diversity* of the **POS tag sequences** present in the text.

***Lossless text compression algorithms***, such as **gzip**, are optimized to detect <u>repeated character sequences</u> and rely on them to compress documents without any loss of information.
- If a document contains frequently repeated strings, it is **more compressible**, so its compressed size is much smaller relative to its original size.

We compute the ***compression ratio*** (***CR***) over the **POS-tagged text**. Higher values indicate that the text is <u>highly compressible</u> (and therefore shows **lower diversity**).
- <u>To calculate the ***CR***:</u>
  1. Concatenate all POS-tagged text into a sequence.
  1. Measure the ratio between the original document size and the compressed document size.
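
The two steps can be illustrated with Python's standard `gzip` module. This is a minimal sketch under stated assumptions: `gzip_compression_ratio` is a hypothetical helper that compresses in memory, so its byte counts are not expected to match those reported by the `diversity` library, and the tag sequences are synthetic.

```python
import gzip
import random

def gzip_compression_ratio(text: str) -> float:
    """Original size in bytes divided by gzip-compressed size in bytes."""
    raw = text.encode("utf-8")
    return len(raw) / len(gzip.compress(raw))

tagset = ['DT', 'NN', 'VBD', 'IN', 'CC', 'RB', 'JJ', 'PRP']

# Step 1: concatenate POS tags into a single sequence
repetitive_seq = " ".join(['DT', 'NN', 'VBD', 'IN'] * 200)      # one template repeated
rng = random.Random(0)                                           # fixed seed for reproducibility
varied_seq = " ".join(rng.choice(tagset) for _ in range(800))    # no enforced structure

# Step 2: ratio between original and compressed size
cr_repetitive = gzip_compression_ratio(repetitive_seq)
cr_varied = gzip_compression_ratio(varied_seq)
print(f"repetitive: {cr_repetitive:.2f}, varied: {cr_varied:.2f}")
```

The repetitive sequence compresses far better, so it gets the higher CR, which is the signal CR-POS relies on.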



In [None]:
from diversity import compression_ratio

text = "The fox jumped over the fence, and the dog jumped over the fence too."

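# Compression ratio
# Note: compression_ratio() is documented as taking a list of strings. Passing a bare
# string appears to make it join the individual characters with spaces, which would
# explain the 137-byte original size reported for this 69-character sentence.
# Pass [text] to measure the sentence itself.
cr = compression_ratio(text, algorithm='gzip', verbose=True)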
print(f"Compression Ratio: {cr:.4f}")

Original Size: 137
Compressed Size: 117
Compression Ratio: 1.1710


In [None]:
text = "The fox jumped over the fence, and the dog jumped over the fence too."

# POS-tag all tokens in the text
words = word_tokenize(text)
pos_tags = pos_tag(words)
pos_tag_list = [tag for _, tag in pos_tags]  # avoid reusing the name `pos_tag`

print(f'These are the POS tags: {pos_tag_list}')

# Concatenate all POS tags into a single sequence
tags_sequence = " ".join(pos_tag_list)
print(f'This is the single sequence: {tags_sequence}\n')

# Measure the CR
# Note: the 93-byte original size reported for this 47-character sequence suggests that
# a bare string is joined character by character; compression_ratio() is documented as
# taking a list of strings, e.g. [tags_sequence].
cr = compression_ratio(tags_sequence, algorithm='gzip', verbose=True)
print(f"\nCompression Ratio: {cr:.4f}")

These are the POS tags: ['DT', 'NN', 'VBD', 'IN', 'DT', 'NN', ',', 'CC', 'DT', 'NN', 'VBD', 'IN', 'DT', 'NN', 'RB', '.']
This is the single sequence: DT NN VBD IN DT NN , CC DT NN VBD IN DT NN RB .

Original Size: 93
Compressed Size: 92

Compression Ratio: 1.0110


### ***Template Rate***

***Template rate:*** Calculates the fraction of texts containing at least one template.
  - Lower values indicate fewer texts with repetitive patterns, suggesting greater ***structural originality*** across the corpus.

We measure the fraction of texts in a corpus that contain at least ***1 template*** to quantify <u>how frequently ***templates*** appear across an entire corpus</u>.
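
A corpus-level version of this definition can be sketched as follows, assuming pre-tagged texts. The names `pos_ngrams` and `corpus_template_rate`, and the three toy texts, are illustrative assumptions; templates are taken to be the `top_n` most frequent POS n-grams in the corpus.

```python
from collections import Counter

def pos_ngrams(tags, n):
    """All contiguous POS n-grams of a tag sequence, as space-joined strings."""
    return [" ".join(tags[i:i + n]) for i in range(len(tags) - n + 1)]

def corpus_template_rate(tagged_texts, n=4, top_n=100):
    """Fraction of texts that contain at least one of the top_n POS n-grams."""
    if not tagged_texts:
        return 0.0
    counts = Counter(g for tags in tagged_texts for g in pos_ngrams(tags, n))
    templates = {g for g, _ in counts.most_common(top_n)}
    # A text counts once, no matter how many templates it contains
    hits = sum(
        any(g in templates for g in pos_ngrams(tags, n))
        for tags in tagged_texts
    )
    return hits / len(tagged_texts)

corpus_tags = [
    ['DT', 'NN', 'VBD', 'IN', 'DT', 'NN'],   # contains "DT NN VBD IN"
    ['DT', 'NN', 'VBD', 'IN', 'PRP'],        # contains "DT NN VBD IN"
    ['UH', ',', 'PRP', 'VBP', 'RB'],         # shares no 4-gram with the others
]

rate = corpus_template_rate(corpus_tags, n=4, top_n=1)
print(f"Template rate: {rate:.4f}")
```

With `top_n=1` the only template is `DT NN VBD IN`, which appears in two of the three texts.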

In [None]:
text = "The fox jumped over the fence, and the dog jumped over the fence too."
n = len(text.split())

In [None]:
def template_rate(text: str, len_template=4, top_n_templates=1):
  """Single-text variant: fraction of tokens in `text` covered by a top POS template."""

  # POS-tag all tokens in the text
  words = word_tokenize(text)
  tags = [tag for _, tag in pos_tag(words)]

  if len(tags) == 0:
    return 0.0

  # Extract the most frequent POS templates
  patterns = extract_patterns([text], n=len_template, top_n=top_n_templates)
  templates = set(patterns.keys())

  # Compute a mask: True = token is part of a template
  mask = [False] * len(tags)
  for i in range(len(tags) - len_template + 1):
    if " ".join(tags[i:i+len_template]) in templates:
      mask[i:i+len_template] = [True] * len_template

  return sum(mask) / len(tags)



In [None]:
text = "The fox jumped over the fence, and the dog jumped over the fence too."

# Measure the Template Rate
tr = template_rate(text, len_template=4, top_n_templates=1)
print(f"\nTemplate Rate: {tr:.4f}")


Template Rate: 0.5000


### ***Templates-per-Token***

***Templates-per-token*** (**TPT**): Normalizes template counts by text length to enable fair comparisons across different sources.
  - Lower values indicate more ***diverse syntactic structures***.
  - Higher values suggest more repetitive patterns.

If a model tends to produce **longer texts**, there is a higher chance that any given output will contain a ***template***. To compare between text sources, we can **length normalize**.
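
The effect of length normalization can be shown with a small sketch. It assumes a pre-tagged text and a given template set, and it counts template occurrences (not covered tokens); the function name `templates_per_token` and the inputs are hypothetical.

```python
def templates_per_token(tags, templates, n=4):
    """Number of template occurrences in a tag sequence divided by its length."""
    if not tags:
        return 0.0
    matches = sum(
        " ".join(tags[i:i + n]) in templates
        for i in range(len(tags) - n + 1)
    )
    return matches / len(tags)

templates = {'DT NN VBD IN'}  # hypothetical template set

short_tags = ['DT', 'NN', 'VBD', 'IN', 'DT', 'NN', ',', 'CC',
              'DT', 'NN', 'VBD', 'IN', 'DT', 'NN', 'RB', '.']
long_tags = short_tags * 2    # twice as long, same template density

tpt_short = templates_per_token(short_tags, templates)
tpt_long = templates_per_token(long_tags, templates)
print(f"short: {tpt_short:.4f}, long: {tpt_long:.4f}")
```

The longer text contains twice as many template occurrences, yet both texts get the same TPT, which is what makes sources with different output lengths comparable.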

In [None]:
def template_per_token(text: str, len_template=4, top_n_templates=1):
  """Sum of per-token template-match counts in `text`, divided by its token count."""

  # POS-tag all tokens in the text
  words = word_tokenize(text)
  tags = [tag for _, tag in pos_tag(words)]

  if len(tags) == 0:
    return 0.0

  # Extract the most frequent POS templates
  patterns = extract_patterns([text], n=len_template, top_n=top_n_templates)
  templates = set(patterns.keys())

  # For each token, count how many template matches cover it
  num_templates_per_token = [0] * len(tags)
  for i in range(len(tags) - len_template + 1):
    if " ".join(tags[i:i+len_template]) in templates:
      for j in range(i, i + len_template):
        num_templates_per_token[j] += 1

  print(num_templates_per_token)

  return sum(num_templates_per_token) / len(words)


In [None]:
text = "The fox jumped over the fence, and the dog jumped over the fence too."

# Measure the Templates-per-Token
tpt = template_per_token(text, len_template=4, top_n_templates=1)
print(f"\nTemplate-per-Token: {tpt:.4f}")

[1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]

Template-per-Token: 0.5000
