# Evaluating **creativity** through **_creativity metrics_**

### **Imports**

In [1]:
!git clone https://github.com/GXimingLu/creativity_index.git
!pip install unidecode sacremoses diversity transformers accelerate bitsandbytes huggingface_hub
!hf auth login

Cloning into 'creativity_index'...
remote: Enumerating objects: 42, done.
remote: Counting objects: 100% (42/42), done.
remote: Compressing objects: 100% (38/38), done.
remote: Total 42 (delta 1), reused 39 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (42/42), 956.44 KiB | 8.62 MiB/s, done.
Resolving deltas: 100% (1/1), done.
Collecting unidecode
  Downloading Unidecode-1.4.0-py3-none-any.whl.metadata (13 kB)
Collecting sacremoses
  Downloading sacremoses-0.1.1-py3-none-any.whl.metadata (8.3 kB)
Collecting diversity
  Downloading diversity-0.3.0-py3-none-any.whl.metadata (10 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.48.2-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting evaluate<0.5.0,>=0.4.1 (from diversity)
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Collecting rouge-score<0.2.0,>=0.1.2 (from diversity)
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Downloadin

In [6]:
# HF_TOKEN = "<your Hugging Face token>"  (keep the real token out of the notebook)

**<u>To do:</u>**
- replace `HF_TOKEN` in ***DJ_search_exact.py*** with your Hugging Face token
- comment out the **_4 prints_** inside _`find_exact_match()`_

In [7]:
import pandas as pd
import json
import os
import torch
import nltk
import spacy

from tqdm import tqdm
from datasets import load_dataset
from creativity_index.DJ_search_exact import dj_search
from transformers import AutoTokenizer, AutoModelForCausalLM
from diversity import compression_ratio, extract_patterns
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from transformers import pipeline, infer_device, BitsAndBytesConfig
from google import genai
from google.genai import types
from IPython.display import display, Markdown


nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

### ***Dataset*** **preparation**

In [8]:
# Directly loading the dataset
ds = load_dataset("euclaise/writingprompts")

# Create DataFrame from Dataset
df = pd.DataFrame(ds['train'])

# Isolate prompts starting with [WP]
df_wp = df[df['prompt'].str.startswith('[ WP ] ')].copy()

# Remove the [WP] from the beginning of the prompt
df_wp['prompt'] = df_wp['prompt'].str.slice(7)

# save dataset as JSON
dataset = df_wp
os.makedirs('/content/creativity_index/data/writingprompts/', exist_ok=True)
dataset_dict = [{"prompt": row.prompt, "text": row.story} for idx, row in dataset.iterrows()]

with open("/content/creativity_index/data/writingprompts/dataset.json", "w") as final:
    json.dump(dataset_dict, final, indent=2, default=lambda x: list(x) if isinstance(x, tuple) else str(x))

dataset.head()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


| | prompt | story |
|---|---|---|
| 0 | You 've finally managed to discover the secret... | So many times have I walked on ruins, the rema... |
| 1 | The moon is actually a giant egg , and it has ... | -Week 18 aboard the Depth Reaver, Circa 2023- ... |
| 2 | You find a rip in time walking through the all... | I was feckin' sloshed, mate. First time I ever... |
| 3 | For years in your youth the same imaginary cha... | “ No, no no no... ” She backed up and turned t... |
| 4 | You glance at your watch 10:34 am , roughly 10... | There's a magical moment between wakefulness a... |


## **Metrics** to evaluate **_creativity_**

### **_Creativity Index_**

In [9]:
def compute_creativity_index(data_path, output_dir, subset=1, lm_tokenizer=False):

  os.makedirs(output_dir, exist_ok=True)

  # Run DJ SEARCH with different min_ngram values
  print("\tRunning DJ Search for L=5 to L=12...")
  ngram_range = range(5, 13)

  for min_ngram in ngram_range:
    output_file = os.path.join(output_dir, f'L_{min_ngram}.json')
    dj_search(data_path, output_file, min_ngram=min_ngram, subset=subset, lm_tokenizer=lm_tokenizer)

  # Loading the JSON files
  print("\tLoading files...")
  values = {}

  for min_ngram in ngram_range:
    output_file = os.path.join(output_dir, f'L_{min_ngram}.json')
    with open(output_file, 'r') as f:
      values[min_ngram] = json.load(f)

  # Compute Creativity Index
  print("\tComputing CI values...")
  creativity_index_values = []

  for text_idx in tqdm(range(subset), desc = '\tCreativity Index'):
    # Sum for each text, all the L-uniqueness values (1-coverage)
    creativity_index = sum( 1 - values[min_ngram][text_idx]['coverage'] for min_ngram in ngram_range )
    creativity_index_values.append(creativity_index)

  #print(f"All Creativity Indices: {creativity_index_values}")

  return creativity_index_values

In [10]:
data_path = '/content/creativity_index/data/writingprompts/dataset.json'
output_dir =  '/content/creativity_index/outputs/writingprompts/L/'

#creativity_index_values = compute_creativity_index(data_path, output_dir, subset=1, lm_tokenizer=False)
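
To make the aggregation step concrete, the sketch below recomputes the index from hand-written coverage values. The numbers are invented for illustration and are not DJ Search output; `creativity_index_from_coverage` is a hypothetical helper, not part of the `creativity_index` repository.

```python
# Toy illustration: Creativity Index as the sum of L-uniqueness (1 - coverage)
# over the n-gram lengths L = 5..12. All coverage values below are made up.
ngram_range = range(5, 13)

# Hypothetical coverage of a single text at each minimum n-gram length
toy_coverage = {5: 0.60, 6: 0.45, 7: 0.30, 8: 0.20, 9: 0.10, 10: 0.05, 11: 0.0, 12: 0.0}

def creativity_index_from_coverage(coverage_by_l):
    """Sum 1 - coverage over every L in ngram_range."""
    return sum(1 - coverage_by_l[l] for l in ngram_range)

toy_ci = creativity_index_from_coverage(toy_coverage)
print(f"Creativity Index: {toy_ci:.2f}")
```

With eight values of L, the index ranges from 0 (every n-gram is found in the reference corpus) to 8 (no n-gram is found).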

### ***Perplexity***

In [29]:
def perplexity(dataset_path, subset=1, model=None, tokenizer=None, model_name='gpt2'):

  if model is None or tokenizer is None:
    # Load the model and tokenizer only when they are not supplied by the caller
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto')

  device = model.device

  # Access dataset
  with open(dataset_path, 'r') as f:
    dataset = json.load(f)

  # Never index past the end of the dataset
  subset = min(subset, len(dataset))
  perplexities = []

  for i in tqdm(range(subset), desc='\tPerplexity'):
    # Texts longer than 1024 tokens are truncated, so only their beginning is scored
    inputs = tokenizer(dataset[i]['text'], return_tensors="pt", truncation=True, max_length=1024).to(device)

    # Ensure no gradient calculation
    with torch.no_grad():
      outputs = model(**inputs, labels=inputs["input_ids"])
      # outputs.loss is the mean token-level cross-entropy; its exponential is the perplexity
      ppl = torch.exp(outputs.loss)
      perplexities.append(ppl.item())

  return perplexities

In [12]:
model_name = "gpt2"
dataset_path = '/content/creativity_index/data/writingprompts/dataset.json'

print(f'\nPerplexity: {perplexity(dataset_path, subset=10, model_name=model_name)}')

	Perplexity:   0%|          | 0/10 [00:00<?, ?it/s]`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.
	Perplexity: 100%|██████████| 10/10 [00:01<00:00,  7.24it/s]

Perplexity: [43.21965789794922, 23.445812225341797, 57.46420669555664, 18.988384246826172, 19.729816436767578, 43.35163879394531, 36.79391098022461, 40.175872802734375, 18.99454689025879, 47.946128845214844]
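
The function above exponentiates the mean cross-entropy loss returned by the model. As a minimal sketch of the same arithmetic, the snippet below computes perplexity from hand-picked token probabilities; the probabilities and the helper name `perplexity_from_token_probs` are assumptions made for illustration only.

```python
import math

def perplexity_from_token_probs(token_probs):
    """Perplexity = exp(mean negative log-likelihood) of the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities a language model assigned to each observed token
uniform = perplexity_from_token_probs([0.25, 0.25, 0.25, 0.25])
confident = perplexity_from_token_probs([0.9, 0.8, 0.95, 0.85])
print(uniform, confident)
```

A model that spreads its probability evenly over four options has a perplexity of 4, while a model that predicts each token with high confidence approaches 1. Lower values therefore mean the text is more predictable to the scoring model.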


### ***Syntactic templates***

#### **_CR-POS_**

In [13]:
def cr_pos(dataset_path, subset=1):

  # Access dataset
  with open(dataset_path, 'r') as f:
    dataset = json.load(f)

  if subset <= 0:
    return []

  subset = min(subset, len(dataset))
  cr_poses = []

  for text_idx in tqdm(range(subset), desc='\tCR-POS'):

    # POS-tagging all tokens in the corpus
    words = word_tokenize(dataset[text_idx]['text'])
    pos_tags = pos_tag(words)
    pos_tag_list = [tag for _, tag in pos_tags]

    if not pos_tag_list:
      cr_poses.append(0)
      continue

    # Concatenate all POS tags into a single sentence
    tags_sequence = " ".join(pos_tag_list)

    # Measure the CR
    cr = compression_ratio(tags_sequence, algorithm='gzip', verbose = False)
    cr_poses.append(cr)

  return cr_poses

In [14]:
# Compute CR-POS for each text

dataset_path = '/content/creativity_index/data/writingprompts/dataset.json'

print(f'\nCR-POSes: {cr_pos(dataset_path, subset=1)}')

	CR-POS: 100%|██████████| 1/1 [00:00<00:00,  4.44it/s]


CR-POSes: [5.882]
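
CR-POS relies on the idea that a repetitive tag sequence compresses better than a varied one. The sketch below illustrates this with the standard-library `gzip` module, assuming the ratio is defined as raw size over compressed size; the `diversity` implementation may differ in details, and the tag sequences here are synthetic.

```python
import gzip
import random

def gzip_compression_ratio(text):
    """Uncompressed byte length divided by gzip-compressed byte length."""
    raw = text.encode("utf-8")
    return len(raw) / len(gzip.compress(raw))

# Synthetic POS-tag sequences: one strictly repetitive, one drawn at random
tagset = ["DT", "NN", "VBZ", "JJ", "IN", "PRP", "VBD", "RB", "CC", "NNS", "MD", "VB"]
rng = random.Random(0)  # fixed seed so the example is reproducible

repetitive = " ".join(["DT", "NN", "VBZ"] * 200)
varied = " ".join(rng.choice(tagset) for _ in range(600))

cr_repetitive = gzip_compression_ratio(repetitive)
cr_varied = gzip_compression_ratio(varied)
print(cr_repetitive, cr_varied)
```

A higher ratio points to more repeated syntactic structure, so under this metric higher CR-POS suggests less syntactic diversity.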





#### **_Template Rate_**

In [15]:
# original paper uses templates of length n ∈ {4, 5, 6, 7, 8}
def template_rate(dataset_path, subset=1, len_template=4, top_n_templates=100):

  # Access dataset
  with open(dataset_path, 'r') as f:
    dataset = json.load(f)

  if subset <= 0:
    return []

  subset = min(subset, len(dataset))
  template_rates = []

  for text_idx in tqdm(range(subset), desc='\tTemplate Rate'):

    # POS-tagging all tokens in the corpus
    words = word_tokenize(dataset[text_idx]['text'])
    pos_tags = pos_tag(words)
    pos_tag_list = [tag for _, tag in pos_tags]

    if not pos_tag_list:
      template_rates.append(0)
      continue

    # POS templates extraction
    patterns = extract_patterns([dataset[text_idx]['text']], n=len_template, top_n=top_n_templates)
    templates = set(patterns.keys())

    # Compute a mask: True = token is part of a template
    mask = [False] * len(pos_tag_list)

    for i in range(len(pos_tag_list)-len_template+1):
      if " ".join(pos_tag_list[i:i+len_template]) in templates:
        mask[i:i+len_template] = [True]*len_template

    template_rate = sum(mask)/len(pos_tag_list)
    template_rates.append(template_rate)

  return template_rates

In [16]:
# Compute Template Rate for each text

dataset_path = '/content/creativity_index/data/writingprompts/dataset.json'

print(f'\nTemplate Rates: {template_rate(dataset_path, subset=1, len_template=4, top_n_templates=100)}')

	Template Rate: 100%|██████████| 1/1 [00:01<00:00,  1.38s/it]



Template Rates: [0.47832817337461303]
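
The masking logic can be checked by hand on a tiny example. The sketch below reuses the same sliding-window idea on a made-up tag list and template set; `toy_template_rate` is a hypothetical stand-in that skips tokenization and template extraction.

```python
def toy_template_rate(tags, templates, n):
    """Fraction of tokens covered by at least one template n-gram."""
    mask = [False] * len(tags)
    for i in range(len(tags) - n + 1):
        if " ".join(tags[i:i + n]) in templates:
            mask[i:i + n] = [True] * n
    return sum(mask) / len(tags) if tags else 0.0

# Made-up POS tags for an eight-token text
toy_tags = ["DT", "NN", "VBZ", "DT", "NN", "IN", "DT", "NN"]

rate_one = toy_template_rate(toy_tags, {"DT NN VBZ"}, n=3)
rate_two = toy_template_rate(toy_tags, {"DT NN VBZ", "DT NN IN"}, n=3)
print(rate_one, rate_two)
```

With one template only the first three tokens are covered (3 of 8); adding a second template covers tokens 3 to 5 as well (6 of 8). Because the mask is boolean, a token matched by several windows is still counted once.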


#### **_Template-per-Token_**

In [17]:
def template_per_token(dataset_path, subset=1, len_template=4, top_n_templates=1):

  # Access dataset
  with open(dataset_path, 'r') as f:
    dataset = json.load(f)

  if subset <= 0:
    return []

  subset = min(subset, len(dataset))
  tpts = []

  for text_idx in tqdm(range(subset), desc='\tTemplate-per-Token'):

    # POS-tagging all tokens in the corpus
    words = word_tokenize(dataset[text_idx]['text'])
    pos_tags = pos_tag(words)
    pos_tag_list = [tag for _, tag in pos_tags]

    if not pos_tag_list:
      tpts.append(0)
      continue

    # POS templates extraction
    patterns = extract_patterns([dataset[text_idx]['text']], n=len_template, top_n=top_n_templates)
    templates = set(patterns.keys())

    # Count the number of templates per token
    num_templates_per_token = [0] * len(pos_tag_list)

    for i in range(len(pos_tag_list) - len_template + 1):
      template = " ".join(pos_tag_list[i:i+len_template])
      if template in templates:
        for j in range(i, i + len_template):
          num_templates_per_token[j] += 1

    num_words = len(words)

    tpt = sum(num_templates_per_token) / max(1,num_words)
    tpts.append(tpt)

  return tpts


In [18]:
# Compute Template-per-Token for each text

dataset_path = '/content/creativity_index/data/writingprompts/dataset.json'

print(f'\nTemplate-per-Token: {template_per_token(dataset_path, subset=1, len_template=4, top_n_templates=100)}')

	Template-per-Token: 100%|██████████| 1/1 [00:01<00:00,  1.34s/it]



Template-per-Token: [0.8544891640866873]
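
Unlike the template rate, this metric counts every overlapping match, so it can exceed 1. The sketch below shows that on a made-up alternating tag sequence; `toy_templates_per_token` is a hypothetical helper that mirrors only the counting loop.

```python
def toy_templates_per_token(tags, templates, n):
    """Average number of template matches that each token takes part in."""
    counts = [0] * len(tags)
    for i in range(len(tags) - n + 1):
        if " ".join(tags[i:i + n]) in templates:
            for j in range(i, i + n):
                counts[j] += 1
    return sum(counts) / max(1, len(tags))

# Made-up alternating sequence in which every window of three tags is a template
toy_tags = ["DT", "NN", "DT", "NN", "DT", "NN"]
toy_templates = {"DT NN DT", "NN DT NN"}

tpt = toy_templates_per_token(toy_tags, toy_templates, n=3)
print(tpt)
```

All four windows match, contributing 12 token hits over 6 tokens, which gives 2.0, whereas a boolean mask over the same sequence would report full coverage of 1.0.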


### ***LLM-as-a-judge***

#### ***HuggingFace model***

##### Loading **model** and **tokenizer**

In [19]:
model_name = "Qwen/Qwen3-4B-Instruct-2507"

# To load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Quantization parameters
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # stores the model weights in 4-bit precision instead of 16/32-bit floating point
    #bnb_4bit_use_double_quant=True, # this further reduces the precision of weights (double quantization)
    bnb_4bit_quant_type="nf4", # quantization format
    bnb_4bit_compute_dtype=torch.bfloat16, # it sets the computational type
)

# To load the model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    return_dict=True,
    quantization_config=bnb_config,
    device_map='auto'
)

##### Create **prompt**

In [20]:
chat_prompt = [
    {
        'role': 'system',
        'content': 'You are an objective text evaluator. You will evaluate a single input text according to 11 aspects.'
    },
    {
        'role': 'user',
        'content': """For each aspect produce:

          1) A numeric score 1–5 (1 = lowest, 5 = highest).
          2) A concise justification (max 30 words) citing at most one short excerpt (≤20 words) from the text as evidence when helpful.

        Strict rules:
          • Treat each aspect independently: when scoring A, ignore other aspects.
          • Do NOT reveal chain-of-thought. Provide only the requested justifications and evidence.
          • If the text is ambiguous or too short to judge an aspect, score 3 and note "insufficient evidence".
          • Return machine-readable JSON with fields: surprise, novelty, value, authenticity, originality, effectiveness, fluency, flexibility, elaboration, usefulness, creativity. Each field must be an object with keys: score (int), justification (string), excerpt (string or null).
          • Do NOT answer anything else other than the JSON.

        SCALE ANCHORS (use these as guidance):
          • 5 = clear, strong, unambiguous evidence for the aspect.
          • 4 = good evidence, minor weaknesses.
          • 3 = ambiguous or mixed evidence; could go either way.
          • 2 = weak evidence or some counter-evidence.
          • 1 = no evidence or direct counter-evidence.

        ASPECT DEFINITIONS (operational, short):
          • Surprise — text contains an unexpected twist. It is an emotion arising from a mismatch between an expectation and what is actually observed or experienced in the text.
          • Novelty — the idea/content is not common. The text contains something not previously experienced or encountered. An observation is novel when a representation of it is not found in memory, or, more realistically, when it is not "close enough" to any representation found in memory.
          • Value — the text adds practical or emotional value to the intended audience.
          • Authenticity — the text seems original in authorship and not verbatim copied; judge whether the text is either identical to other texts (and already exists) or, more likely, is derivative of what already exists.
          • Originality — uniqueness of the idea or approach (distinct from authenticity: something can be original even if style seems derivative).
          • Effectiveness — the text achieves its apparent communicative goal (fit to purpose).
          • Fluency — quantity and ease of idea generation / flow (how many distinct thoughts are present and how smoothly they connect).
          • Flexibility — variety across idea types or perspectives (how many different types of ideas/perspectives are considered).
          • Elaboration — depth and level of development of the ideas in the text (details, examples, explanation).
          • Usefulness — practicality, actionable value, or likely utility to reader.
          • Creativity — defined as "the ability to come up with ideas that are novel, surprising and valuable. It requires both originality and effectiveness.  By \"novel\" is meant that the creative product did not exist previously in precisely the same form. The extent to which a work is novel depends on the extent to which it deviates from the traditional."

        INPUT:
        Text to evaluate:
        "{text}"

        OUTPUT:
        - JSON object (as described).

        End.
        """
    }
]

In [21]:
# Takes a list of texts and turns them into instruction prompts (using the given prompt)

def prepare_prompts(texts, prompt_template, tokenizer, generation_prompt = True):
  texts_formatted = []
  chat_template = tokenizer.apply_chat_template(prompt_template, tokenize=False, add_generation_prompt=generation_prompt)

  for text in texts:
    text_formatted = chat_template.format(text=text)
    texts_formatted.append(text_formatted)

  return texts_formatted
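
`str.format` works here because `{text}` is the only pair of braces in the rendered template. Should the prompt later include a literal JSON example, `format` would try to interpret those braces too. The sketch below shows one possible alternative based on `str.replace`; `fill_text_placeholder` and the toy template are hypothetical and not part of the notebook.

```python
def fill_text_placeholder(template, text, placeholder="{text}"):
    """Substitute only the given placeholder, leaving other braces untouched."""
    return template.replace(placeholder, text)

# Hypothetical template that embeds a JSON example with literal braces
toy_template = 'Reply as {"score": 1}. Text to evaluate: "{text}"'

filled = fill_text_placeholder(toy_template, "Once upon a time")

# str.format treats {"score": 1} as a replacement field and fails
try:
    toy_template.format(text="Once upon a time")
    format_failed = False
except (KeyError, ValueError, IndexError):
    format_failed = True

print(filled)
print("str.format failed:", format_failed)
```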

##### Generate **responses**

In [22]:
# Re-using the efficient batched function pattern
def generate_responses_batched(model, prompts: list, tokenizer, batch_size=8, **gen_param):
    responses = []
    device = model.device

    for i in tqdm(range(0, len(prompts), batch_size), desc="Generating responses"):
        batch_prompts = prompts[i : i + batch_size]

        # Tokenize inputs
        inputs = tokenizer(batch_prompts, return_tensors="pt", padding='longest', padding_side='left', truncation=True).to(device)

        with torch.inference_mode():
            outputs = model.generate(**inputs, **gen_param)

        input_length = inputs["input_ids"].shape[1]
        responses_tokens = outputs[:, input_length:]
        batch_responses = tokenizer.batch_decode(responses_tokens, skip_special_tokens=True)

        # Parse each response as JSON; store None on failure so results stay aligned with the inputs
        for response in batch_responses:
          try:
            response_dictionary = json.loads(response)
            responses.append(response_dictionary)
          except json.JSONDecodeError:
            print(f"Problems while decoding the response:\n {response}...")
            responses.append(None)
          except Exception as error:
            print(f"Other problems occurred: {error}")
            responses.append(None)

    return responses
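
Instruction-tuned models often wrap their JSON in Markdown fences or add a sentence before it, which makes a direct `json.loads` call fail. A minimal sketch of a more tolerant parser follows, assuming the answer contains a single top-level JSON object; `parse_judge_response` is a hypothetical helper and the sample response is invented.

```python
import json

def parse_judge_response(response):
    """Return the JSON object embedded in `response`, or None if none can be parsed."""
    # Keep only the span from the first '{' to the last '}'
    start, end = response.find("{"), response.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(response[start:end + 1])
    except json.JSONDecodeError:
        return None

# Invented response wrapped in a Markdown code fence
fence = "`" * 3
sample = fence + 'json\n{"creativity": {"score": 4, "justification": "vivid", "excerpt": null}}\n' + fence

parsed = parse_judge_response(sample)
print(parsed)
```

Returning `None` instead of raising keeps one entry per input text, which matters when the results are later matched back to the dataset by index.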

In [23]:
def llm_as_a_judge(model_name, chat_prompt, dataset_path, subset=1, batch_size=8, gen_params={}):

  # Access dataset
  with open(dataset_path, 'r') as f:
    dataset = json.load(f)

  if subset <= 0:
    return []

  subset = min(subset, len(dataset))
  texts = [dataset[i]['text'] for i in range(subset)]

  # To load the tokenizer
  tokenizer = AutoTokenizer.from_pretrained(model_name)

  # Quantization parameters
  bnb_config = BitsAndBytesConfig(
      load_in_4bit=True,  # stores the model weights in 4-bit precision instead of 16/32-bit floating point
      #bnb_4bit_use_double_quant=True, # this further reduces the precision of weights (double quantization)
      bnb_4bit_quant_type="nf4", # quantization format
      bnb_4bit_compute_dtype=torch.bfloat16, # it sets the computational type
  )

  # To load the model
  model = AutoModelForCausalLM.from_pretrained(
      model_name,
      return_dict=True,
      quantization_config=bnb_config,
      device_map='auto'
  )

  # Prepare the list of texts using the instruction prompt
  formatted_chat_prompts = prepare_prompts(texts, chat_prompt, tokenizer, generation_prompt = True)
  responses = generate_responses_batched(model, formatted_chat_prompts, tokenizer, batch_size=batch_size, **gen_params)

  return responses

In [24]:
dataset_path = '/content/creativity_index/data/writingprompts/dataset.json'
model_name = "Qwen/Qwen3-4B-Instruct-2507"

generation_params = {
    "max_new_tokens": 1024,
    "do_sample": True,
    "temperature": 0.8,
    "top_p": 0.9,
    "repetition_penalty": 1.2
}

responses = llm_as_a_judge(model_name, chat_prompt, dataset_path, subset=1, batch_size=8, gen_params=generation_params)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Generating responses: 100%|██████████| 1/1 [01:04<00:00, 64.64s/it]


#### **Google: _Gemini 2.5 Flash_**

In [25]:
'''
client = genai.Client(api_key = '<GEMINI_API_KEY>')

response = client.models.generate_content(
    model="gemini-2.5-flash",
    config=types.GenerateContentConfig(
        system_instruction="You are a university teacher of Italian",
        thinking_config=types.ThinkingConfig(thinking_budget=-1), # 0 = disable thinking, -1 = dynamic thinking
        #temperature=2 # [0.0, 2.0]
        ),
    contents="Can you recite Leopardi's L'Infinito and give me a brief analysis of its first two stanzas?"
)

#print(response.text)
display(Markdown(response.text))
'''

### **Final Pipeline**
- **_Dataset:_** the dataset used is _WritingPrompts_
- Given a specified number of _**texts**_, the following code computes for each **_text:_**
  1. **Creativity Index**
  1. **Perplexity**
  1. **Syntactic templates**
      1. **CR-POS**
      1. **Template Rate**
      1. **Template-per-Token**
  1. **LLM-as-a-judge**

#### Using all _**creativity metrics**_

In [26]:
def creativity_evaluation(model_name, chat_prompt, dataset_path, output_path, subset=2, generation_params={}):
  # To load the tokenizer
  tokenizer = AutoTokenizer.from_pretrained(model_name)

  # Quantization parameters
  bnb_config = BitsAndBytesConfig(
      load_in_4bit=True,  # stores the model weights in 4-bit precision instead of 16/32-bit floating point
      #bnb_4bit_use_double_quant=True, # this further reduces the precision of weights (double quantization)
      bnb_4bit_quant_type="nf4", # quantization format
      bnb_4bit_compute_dtype=torch.bfloat16, # it sets the computational type
  )

  # To load the model
  model = AutoModelForCausalLM.from_pretrained(
      model_name,
      return_dict=True,
      quantization_config=bnb_config,
      device_map='auto'
  )

  print('STARTING WITH THE EVALUATION...')

  print('Computing Creativity Index...')
  #creativity_index_values = compute_creativity_index(dataset_path, output_dir = '/content/creativity_index/outputs/writingprompts/L/', subset=subset, lm_tokenizer=False)
  print('Computing Perplexity...')
  perplexities = perplexity(dataset_path, subset=subset, model=model, tokenizer=tokenizer)
  print('Computing CR-POS...')
  cr_poses = cr_pos(dataset_path, subset=subset)
  print('Computing Template Rate...')
  template_rates = template_rate(dataset_path, subset=subset, len_template=4, top_n_templates=100)
  print('Computing Template-per-Token...')
  tpts = template_per_token(dataset_path, subset=subset, len_template=4, top_n_templates=100)
  print('Computing LLM-as-a-judge...')
  responses = llm_as_a_judge(model_name, chat_prompt, dataset_path, subset=subset, batch_size=8, gen_params=generation_params)
  print('DONE!')

  creativity_metrics = {
      #'creativity_index': creativity_index_values,
      'perplexity': perplexities,
      'cr_pos': cr_poses,
      'template_rate': template_rates,
      'template_per_token': tpts,
      'llm_as_judge': responses
  }

  # Save data into a JSON file
  os.makedirs(output_path, exist_ok=True)
  output_file = os.path.join(output_path, 'creativity_metrics.json')

  # Saving the output
  with open(output_file, 'w') as fp:
    json.dump(creativity_metrics, fp)

  return creativity_metrics


#### **Parameters**

In [27]:
# Data paths
dataset_path = '/content/creativity_index/data/writingprompts/dataset.json'
output_path = '/content/results/'

# Parameters
subset = 2
model_name = "Qwen/Qwen3-4B-Instruct-2507"

generation_params = {
    "max_new_tokens": 1024,
    "do_sample": True,
    "temperature": 0.8,
    "top_p": 0.9,
    "repetition_penalty": 1.2
}

chat_prompt = [
    {
        'role': 'system',
        'content': 'You are an objective text evaluator. You will evaluate a single input text according to 11 aspects.'
    },
    {
        'role': 'user',
        'content': """For each aspect produce:

          1) A numeric score 1–5 (1 = lowest, 5 = highest).
          2) A concise justification (max 30 words) citing at most one short excerpt (≤20 words) from the text as evidence when helpful.

        Strict rules:
          • Treat each aspect independently: when scoring A, ignore other aspects.
          • Do NOT reveal chain-of-thought. Provide only the requested justifications and evidence.
          • If the text is ambiguous or too short to judge an aspect, score 3 and note "insufficient evidence".
          • Return machine-readable JSON with fields: surprise, novelty, value, authenticity, originality, effectiveness, fluency, flexibility, elaboration, usefulness, creativity. Each field must be an object with keys: score (int), justification (string), excerpt (string or null).
          • Do NOT answer anything else other than the JSON.

        SCALE ANCHORS (use these as guidance):
          • 5 = clear, strong, unambiguous evidence for the aspect.
          • 4 = good evidence, minor weaknesses.
          • 3 = ambiguous or mixed evidence; could go either way.
          • 2 = weak evidence or some counter-evidence.
          • 1 = no evidence or direct counter-evidence.

        ASPECT DEFINITIONS (operational, short):
          • Surprise — text contains an unexpected twist. It is an emotion arising from a mismatch between an expectation and what is actually observed or experienced in the text.
          • Novelty — the idea/content is not common. The text contains something not previously experienced or encountered. An observation is novel when a representation of it is not found in memory, or, more realistically, when it is not "close enough" to any representation found in memory.
          • Value — the text adds practical or emotional value to the intended audience.
          • Authenticity — the text seems original in authorship and not verbatim copied; judge whether the text is either identical to other texts (and already exists) or, more likely, is derivative of what already exists.
          • Originality — uniqueness of the idea or approach (distinct from authenticity: something can be original even if style seems derivative).
          • Effectiveness — the text achieves its apparent communicative goal (fit to purpose).
          • Fluency — quantity and ease of idea generation / flow (how many distinct thoughts are present and how smoothly they connect).
          • Flexibility — variety across idea types or perspectives (how many different types of ideas/perspectives are considered).
          • Elaboration — depth and level of development of the ideas in the text (details, examples, explanation).
          • Usefulness — practicality, actionable value, or likely utility to reader.
          • Creativity — defined as "the ability to come up with ideas that are novel, surprising and valuable. It requires both originality and effectiveness.  By \"novel\" is meant that the creative product did not exist previously in precisely the same form. The extent to which a work is novel depends on the extent to which it deviates from the traditional."

        INPUT:
        Text to evaluate:
        "{text}"

        OUTPUT:
        - JSON object (as described).

        End.
        """
    }
]

#### **Complete evaluation**

In [30]:
creativity_metrics = creativity_evaluation(model_name, chat_prompt, dataset_path, output_path, subset=subset, generation_params=generation_params)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

STARTING WITH THE EVALUATION...
Computing Creativity Index...
Computing Perplexity...


	Perplexity: 100%|██████████| 2/2 [00:05<00:00,  2.67s/it]


Computing CR-POS...


	CR-POS: 100%|██████████| 2/2 [00:00<00:00, 34.02it/s]


Computing Template Rate...


	Template Rate: 100%|██████████| 2/2 [00:04<00:00,  2.19s/it]


Computing Template-per-Token...


	Template-per-Token: 100%|██████████| 2/2 [00:03<00:00,  1.66s/it]


Computing LLM-as-a-judge...


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Generating responses: 100%|██████████| 1/1 [04:36<00:00, 276.24s/it]


DONE!


#### Recover data from **_JSON file_**

In [38]:
'''
file_path = '/content/results/creativity_metrics.json'
with open(file_path , 'r') as f:
  data = json.load(f)
'''

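
The saved file stores one list per metric, which is awkward to inspect text by text. The sketch below shows one way to flatten that layout into one record per text, assuming the keys written by `creativity_evaluation`; `metrics_to_rows` is a hypothetical helper and the values in `toy_metrics` are placeholders, not results.

```python
def metrics_to_rows(metrics):
    """Turn the dict-of-lists layout into one flat dict per text."""
    n_texts = len(metrics["perplexity"])
    judge_list = metrics.get("llm_as_judge", [None] * n_texts)
    rows = []
    for i in range(n_texts):
        # Copy every scalar metric for text i
        row = {name: values[i] for name, values in metrics.items() if name != "llm_as_judge"}
        # Keep only the numeric score of each judged aspect; skip texts whose response failed to parse
        for aspect, verdict in (judge_list[i] or {}).items():
            row[f"judge_{aspect}"] = verdict.get("score")
        rows.append(row)
    return rows

# Placeholder values that only mimic the structure of creativity_metrics.json
toy_metrics = {
    "perplexity": [43.2, 23.4],
    "cr_pos": [5.9, 5.1],
    "template_rate": [0.48, 0.40],
    "template_per_token": [0.85, 0.70],
    "llm_as_judge": [{"creativity": {"score": 4, "justification": "toy", "excerpt": None}}, None],
}

rows = metrics_to_rows(toy_metrics)
print(rows)
```

The resulting list of dicts can be passed directly to `pd.DataFrame(rows)` for further analysis.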