In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from utils import save_response

# Llama 3 - 8B Instruct Model

In [2]:
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



In [3]:
messages = [
    {"role": "system", "content": """We have created a fitness app that helps users train for marathons. The app uses stories to help users train. The stories are generated based on user input and biometric data.
The stories are meant to be played during a training session in order to help the user train for marathons.
You are an AI judge who rates the outputs and determines which one best fits the following criteria. The first criterion is the prompt, the second is the distance of the walk/run,
the third is the average walking speed, the fourth is the average running speed, and the fifth is the training style (such as endurance run, walk, or anaerobic sprints).
The story should have embedded cues for user speed changes and breaks where appropriate, depending on the training style.

Your job is to critique all of the different outputs and determine which one is the best. You will be given a prompt and the criteria listed above.
Rate each criterion separately, then average the scores and give an overall score on a scale of 1-100."""},
    
    {"role": "user", "content": """Here is the first output, which starts with the prompt and then the story:
     
Prompt: A zombie horror story set in New York City.
Distance: 0.5 miles
Average Walking Speed: 2 mph
Average Running Speed: 5 mph
Training Style: Short-distance sprint

Title : "Escape from Ground Zero" (Zombies of NYC) - Marathon Survival Story, 

Part I [Short Distance Sprint]​[Mile Marker Reached – Start Line / Remainder Miles Left =  1 MILES LEFT ] 
​In an eerily quiet city once known as bustling metropolis now lies desolate streets filled only silence...and hunger! You find yourself running through Times Square at breakneck speeds; your heart pounded against chest like drums playing out their own macabre symphony . As adrenaline courses though veins ,you can feel every muscle strained under exertion but there’ s no time left stop or rest ! Behind comes hordes undead creatures shambling towards ground zero—the epicenter outbreak which has turned this concrete jungle into nightmare landscape teemming flesh eating monstrosities lurking around each corner waiting pounce unsuspect victims.
[RUNNING SPEED CUE HERE][BREAK AFTER HALF MILE REACHED OR WHEN USER NEARLY EXHAUSTION SET IN AND THIRSTY FOR WATERS STOP AT FONTANESCA FOUNTAIN TO TEMPORARY QUench YOUR Thirst BEFORE CONTinuing RACE WITH DEATH.] 

In front stands towering skyscrapers casting long shadows over desert roads below them while above looms dark clouds threatening rainstorm imminent adding another layer dread hanging heavy air all round us.. But we must press forward if hope survive![END OF BLOCK ONE].As our hero continues his desperate flight across Manhattan he encounters more obstacles than ever before — abandoned cars blocking pathways forcing him jump onto rooftops dodge falling debris narrowly escaping certain doom again &again…But still they keep coming closer faster bit harder claw marks scratch deep gashes skin leaving behind trail bloodied footprints mark territory ahead….Will power surging within fuel final push escape? Or shall fate claim victory here amid ruination remains former empire state building ? Only one way know....Keep moving!!!!!
(Racing Heartbeat Sound Effect Here)[Running Resumes After Break End). 

With renew determination grip tightened fists gritted teeth glaring eyes fixed straight road stretching endlessly uphill battle seems never ending yet somehow strength found reserves hidden depth unbeknown even ourselves until tested such extremity survival instinct kicks gear pushing body beyond

"""},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
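The user message above hard-codes the five criteria as a text header in front of the story. A minimal sketch of assembling that header from structured values is shown here. The helper name `build_judge_input` and its argument list are hypothetical and not part of the original notebook.

```python
def build_judge_input(prompt, distance_miles, walk_mph, run_mph, training_style, story):
    """Format the five judging criteria and the story into one user message.

    Hypothetical helper: the name and arguments are illustrative assumptions.
    """
    header = (
        f"Prompt: {prompt}\n"
        f"Distance: {distance_miles} miles\n"
        f"Average Walking Speed: {walk_mph} mph\n"
        f"Average Running Speed: {run_mph} mph\n"
        f"Training Style: {training_style}\n"
    )
    # The story follows the header after one blank line.
    return {"role": "user", "content": f"{header}\n{story}"}


example = build_judge_input(
    prompt="A zombie horror story set in New York City.",
    distance_miles=0.5,
    walk_mph=2,
    run_mph=5,
    training_style="Short-distance sprint",
    story="(story text goes here)",
)
print(example["content"])
```

Building the message this way keeps the criteria in one place, which may help when several candidate stories have to be judged against the same settings.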

In [4]:
# Sampling settings, kept in one dict so they can be saved with the response.
hyperparams = {
    "max_new_tokens": 256,
    "temperature": 0.6,
    "top_p": 0.9,
    "do_sample": True,
}
outputs = model.generate(
    input_ids,
    eos_token_id=terminators,  # stop on either end-of-sequence token
    **hyperparams,
)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  attn_output = torch.nn.functional.scaled_dot_product_attention(
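The warning above appears because only `input_ids` was passed to `generate`. One way to address it is to request the attention mask alongside the ids and pass both, together with an explicit `pad_token_id`. The sketch below assumes the installed `transformers` version supports `return_dict=True` in `apply_chat_template`. The helper name `build_generate_kwargs` is hypothetical, and plain lists stand in for tensors so the sketch runs without loading the model.

```python
def build_generate_kwargs(encoded, terminators, pad_token_id, hyperparams):
    """Collect keyword arguments for model.generate, including the attention mask.

    `encoded` is any mapping with "input_ids" and "attention_mask" keys, such as
    the result of tokenizer.apply_chat_template(..., return_dict=True).
    """
    return {
        "input_ids": encoded["input_ids"],
        "attention_mask": encoded["attention_mask"],
        "eos_token_id": terminators,
        "pad_token_id": pad_token_id,
        **hyperparams,
    }


# Stand-in values (assumptions) so the sketch runs without the model.
demo_encoded = {"input_ids": [[1, 2]], "attention_mask": [[1, 1]]}
demo_kwargs = build_generate_kwargs(
    demo_encoded,
    terminators=[128009],
    pad_token_id=128009,
    hyperparams={"max_new_tokens": 256, "do_sample": True, "temperature": 0.6, "top_p": 0.9},
)
print(sorted(demo_kwargs))

# With the real objects from the cells above, the call might look like:
# encoded = tokenizer.apply_chat_template(
#     messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
# ).to(model.device)
# outputs = model.generate(**build_generate_kwargs(
#     encoded, terminators, tokenizer.eos_token_id, hyperparams))
```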


In [5]:
response = outputs[0][input_ids.shape[-1]:]
output = tokenizer.decode(response, skip_special_tokens=True)
print(output)

I'll rate each criterion separately and then average the scores to give an overall score.

**Prompt (20 points):**
The prompt is creative and engaging, setting the scene for a thrilling zombie horror story. The use of vivid descriptions and imagery helps to immerse the user in the narrative. Score: 18/20

**Distance (20 points):**
The distance of 0.5 miles is accurately reflected in the story, with a clear separation between the initial sprint and the subsequent break. Score: 19/20

**Average Walking Speed (20 points):**
The story does not explicitly mention walking speed, but the initial sprint is described as "breakneck speeds" which is consistent with a higher running speed. However, the story does not account for the average walking speed of 2 mph. Score: 12/20

**Average Running Speed (20 points):**
The initial sprint is described as "breakneck speeds" which is consistent with a higher running speed, and the subsequent break allows the user to recover before resuming at a higher i
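The judge reports each criterion as `Score: N/20`, and the response above stops mid-sentence, most likely because it reached the `max_new_tokens` limit of 256. A minimal sketch for extracting the per-criterion scores from such text and averaging whatever was produced follows. It assumes the `Score: N/M` pattern holds throughout the reply; the function names are hypothetical.

```python
import re

# Matches "Score: 18/20", allowing optional spaces and decimal values.
SCORE_PATTERN = re.compile(r"Score:\s*(\d+(?:\.\d+)?)\s*/\s*(\d+(?:\.\d+)?)")


def extract_scores(text):
    """Return a list of (awarded, maximum) pairs found in the judge's reply."""
    return [(float(a), float(m)) for a, m in SCORE_PATTERN.findall(text)]


def overall_percent(scores):
    """Average the per-criterion fractions and scale the result to 0-100."""
    if not scores:
        return None
    return 100 * sum(a / m for a, m in scores) / len(scores)


sample = "Score: 18/20 ... Score: 19/20 ... Score: 12/20"
scores = extract_scores(sample)
print(scores, overall_percent(scores))
```

Because a truncated reply yields fewer scores than criteria, comparing `len(scores)` against the expected count of five is one way to detect an incomplete judgment.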

In [6]:
save_response("LLAMA3_JUDGE", hyperparams, output)
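`save_response` comes from a local `utils` module that is not shown in this notebook. As a purely hypothetical stand-in (the real helper may behave differently), a function that appends the label, hyperparameters, and output to a JSON Lines file could look like this:

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def save_response_sketch(label, hyperparams, output, path):
    """Append one record to a JSON Lines file.

    Hypothetical: this is not the real utils.save_response.
    """
    record = {
        "label": label,
        "hyperparams": hyperparams,
        "output": output,
        "saved_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


# Write to a temporary directory so the demo leaves no files behind in the project.
demo_path = os.path.join(tempfile.mkdtemp(), "responses.jsonl")
save_response_sketch("LLAMA3_JUDGE", {"temperature": 0.6}, "example output", demo_path)
with open(demo_path, encoding="utf-8") as f:
    saved = [json.loads(line) for line in f]
print(saved[0]["label"])
```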

In [7]:
# import evaluate
# perplexity = evaluate.load("perplexity", module_type="metric")
# input_texts = ["lorem ipsum", "Happy Birthday!", "Bienvenue"]
# results = perplexity.compute(model_id=model_id,
#                              add_start_token=False,
#                              predictions=input_texts, 
#                              device="cuda")
# print(list(results.keys()))

In [8]:
# print(round(results["mean_perplexity"], 2))
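The commented-out cells compute perplexity with the `evaluate` library. For reference, perplexity is the exponential of the mean negative log-likelihood per token. A minimal pure-Python sketch of that formula follows, using made-up token log-probabilities rather than real model outputs; the function name is hypothetical.

```python
import math


def perplexity_from_logprobs(token_logprobs):
    """Return exp of the average negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Illustrative values only: four tokens, each assigned probability 0.25.
uniform = [math.log(0.25)] * 4
print(perplexity_from_logprobs(uniform))
```

A model that spreads probability evenly over four choices at every step has a perplexity of about 4, which is why the metric is often read as an effective branching factor.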