# Load llama3-8b-instruct for Q&A task (baseline 1)







https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

### Load llama3-8b-instruct in Colab on an L4 GPU (24 GB)

In [1]:
!pip install transformers accelerate




In [2]:
import transformers
import torch
import os
from transformers import AutoModelForCausalLM, AutoTokenizer

os.environ['HF_TOKEN'] = "your huggingface token"  # Hugging Face access token, required for gated models such as Llama 3

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
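    model_kwargs={"torch_dtype": torch.bfloat16},  # bfloat16 weights use about 16 GB of GPU memory
    # device="auto" does not work here because "auto" is not a valid device string;
    # automatic placement would use device_map="auto" instead
    device="cuda",  # run on the GPU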
)



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Arrrr, me hearty! Me name be Captain Chat, the scurviest pirate chatbot to ever sail the Seven Seas o' Text! Me be here to chat ye up and swab yer decks with me witty banter and piratey puns! So hoist the colors, me hearty, and let's set sail fer a swashbucklin' good time!
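
The `HF_TOKEN` warning in the output above is raised because no Colab secret with that name exists; the notebook sets the token through `os.environ` instead. A minimal sketch of one way to avoid hard-coding the token in a cell, assuming the secret or environment variable is called `HF_TOKEN`. The helper name `resolve_hf_token` is hypothetical and not part of the original notebook.

```python
import os


def resolve_hf_token(secret_name="HF_TOKEN"):
    """Return the token from Colab secrets if available, else from the environment."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata

        token = userdata.get(secret_name)
        if token:
            return token
    except Exception:
        # Not running in Colab, secret missing, or notebook access not granted.
        pass
    return os.environ.get(secret_name)


token = resolve_hf_token()
if token:
    # huggingface_hub reads this variable when downloading gated models.
    os.environ["HF_TOKEN"] = token
```

Outside Colab the function simply falls back to the environment variable, so the same cell can run locally.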


In [23]:
# One inference example

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
)

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

print("\nMessages:",  messages)
print("\nGenerated_answer:", outputs[0]["generated_text"][len(prompt):])


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



Messages: [{'role': 'system', 'content': 'You are a pirate chatbot who always responds in pirate speak!'}, {'role': 'user', 'content': 'Who are you?'}]

Generated_answer: Arrrr, me hearty! Me name be Captain Chatbot, the scurviest pirate to ever sail the Seven Seas! Me be a chatbot of great renown, feared by all who cross me path. Me be here to swab yer decks with me witty banter and me clever responses, so hoist the sails and set course for adventure, me hearty!
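
`apply_chat_template` turns the message list into a single prompt string, and `<|eot_id|>` is the token Llama 3 emits at the end of a turn, which is why the cell adds it to `terminators`. The sketch below rebuilds the Llama 3 Instruct layout by hand for illustration only. The function `render_llama3_prompt` is hypothetical, and the template shipped with the tokenizer remains the authoritative source.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Approximate the Llama 3 Instruct chat layout as a plain string."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
print(render_llama3_prompt(messages))
```

Because the pipeline returns the prompt followed by the completion, the notebook slices the output with `[len(prompt):]`. The text-generation pipeline also accepts `return_full_text=False`, which should return only the completion.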


### Inference using my data

In [29]:
# Mount Google Drive to load the data
from google.colab import drive
drive.mount('/content/gdrive')   # connect to gdrive

# Change the working directory to the project folder
import os
os.chdir('/content/gdrive/My Drive/Practice/4/ml-arxiv-papers/llama3-8b-instruct')


Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [30]:
# Load the dataset from the mounted Google Drive

import json

# load the whole dataset
with open('/content/gdrive/My Drive/Practice/4/ml-arxiv-papers/alldata.json', 'r', encoding='utf-8') as file:  # open from an absolute path
    alldata = json.load(file)
print("# Total alldata samples:", len(alldata))
print()

# Use samples 42000-42999 as the test set
test_dataset = alldata[42000:43000]
print("# test_dataset samples:", len(test_dataset))
print("test_dataset 1st sample:", test_dataset[0])


# Total alldata samples: 43713

# test_dataset samples: 1000
test_dataset 1st sample: {'id': 112194, 'context': 'Neural Mesh: Introducing a Notion of Space and Conservation of Energy to   Neural Networks.Neural networks are based on a simplified model of the brain. In this project, we wanted to relax the simplifying assumptions of a traditional neural network by making a model that more closely emulates the low level interactions of neurons. Like in an RNN, our model has a state that persists between time steps, so that the energies of neurons persist. However, unlike an RNN, our state consists of a 2 dimensional matrix, rather than a 1 dimensional vector, thereby introducing a concept of distance to other neurons within the state. In our model, neurons can only fire to adjacent neurons, as in the brain. Like in the brain, we only allow neurons to fire in a time step if they contain enough energy, or excitement. We also enforce a notion of conservation of energy, so that a neuron canno
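
The later cells read the `id`, `question`, `context`, and `answer` fields of every record, so it can help to verify them right after loading. A minimal sketch, assuming each record is a dict with those four keys. `find_incomplete` is a hypothetical helper, and the toy records stand in for `alldata`.

```python
REQUIRED_KEYS = ("id", "question", "context", "answer")


def find_incomplete(records, required=REQUIRED_KEYS):
    """Return the indices of records with a missing or empty required field."""
    bad = []
    for index, record in enumerate(records):
        if any(record.get(key) in (None, "") for key in required):
            bad.append(index)
    return bad


# Toy records standing in for alldata (not real dataset entries).
toy_records = [
    {"id": 1, "question": "Q1?", "context": "C1", "answer": "A1"},
    {"id": 2, "question": "", "context": "C2", "answer": "A2"},
    {"id": 3, "context": "C3", "answer": "A3"},
]
print(find_incomplete(toy_records))  # → [1, 2]
```

In the notebook, the call would be `find_incomplete(test_dataset)`, and an empty list would mean every sample can be used to build a prompt.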

In [31]:
# Take a sample and construct a message

sample = test_dataset[0]

prompt_system = "You are a helpful, respectful and honest assistant. Your task is to generate an answer to the given question. And your answer should be based on the provided context only."
prompt_user = f"### Question: {sample['question']}\n### Context: {sample['context']}\n### Answer:"

messages = [
    {"role": "system", "content": prompt_system},
    {"role": "user", "content": prompt_user},
]

print(messages)


[{'role': 'system', 'content': 'You are a helpful, respectful and honest assistant. Your task is to generate an answer to the given question. And your answer should be based on the provided context only.'}, {'role': 'user', 'content': '### Question: What is the unique feature of the Neural Mesh architecture?\n### Context: Neural Mesh: Introducing a Notion of Space and Conservation of Energy to   Neural Networks.Neural networks are based on a simplified model of the brain. In this project, we wanted to relax the simplifying assumptions of a traditional neural network by making a model that more closely emulates the low level interactions of neurons. Like in an RNN, our model has a state that persists between time steps, so that the energies of neurons persist. However, unlike an RNN, our state consists of a 2 dimensional matrix, rather than a 1 dimensional vector, thereby introducing a concept of distance to other neurons within the state. In our model, neurons can only fire to adjacent
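
The same system and user prompts are built again in the batch loop later in the notebook, so the construction could be factored into one function. A sketch, where `build_messages` and `SYSTEM_PROMPT` are hypothetical names and the prompt text is copied from the cell above:

```python
SYSTEM_PROMPT = (
    "You are a helpful, respectful and honest assistant. "
    "Your task is to generate an answer to the given question. "
    "And your answer should be based on the provided context only."
)


def build_messages(sample, system_prompt=SYSTEM_PROMPT):
    """Build the chat messages for one sample with 'question' and 'context' keys."""
    user_prompt = (
        f"### Question: {sample['question']}\n"
        f"### Context: {sample['context']}\n"
        "### Answer:"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


# Toy sample for illustration only.
toy_sample = {"question": "What is X?", "context": "X is a placeholder."}
for message in build_messages(toy_sample):
    print(message["role"], "->", message["content"])
```

Keeping the prompt in one place makes it less likely that the single-sample check and the full test-set run drift apart.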

In [32]:
# Inference using the pipeline

prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
)

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipeline(
    prompt,
    max_new_tokens=128,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
)

print("\nMessages:",  messages)
print("\nGenerated_answer:", outputs[0]["generated_text"][len(prompt):])


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



Messages: [{'role': 'system', 'content': 'You are a helpful, respectful and honest assistant. Your task is to generate an answer to the given question. And your answer should be based on the provided context only.'}, {'role': 'user', 'content': '### Question: What is the unique feature of the Neural Mesh architecture?\n### Context: Neural Mesh: Introducing a Notion of Space and Conservation of Energy to   Neural Networks.Neural networks are based on a simplified model of the brain. In this project, we wanted to relax the simplifying assumptions of a traditional neural network by making a model that more closely emulates the low level interactions of neurons. Like in an RNN, our model has a state that persists between time steps, so that the energies of neurons persist. However, unlike an RNN, our state consists of a 2 dimensional matrix, rather than a 1 dimensional vector, thereby introducing a concept of distance to other neurons within the state. In our model, neurons can only fire 
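
The call above samples with `temperature=0.8` and `top_p=0.9`. As a rough illustration of what those two settings do to the next-token distribution, here is a pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering. It is a simplified stand-in, not the `transformers` implementation, and `nucleus_filter` and the toy logits are hypothetical.

```python
import math


def nucleus_filter(logits, temperature=0.8, top_p=0.9):
    """Return renormalized probabilities after temperature scaling and top-p filtering."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [value / total for value in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    return [probs[i] / kept_mass if i in kept else 0.0 for i in range(len(probs))]


toy_logits = [2.0, 1.0, 0.5, -1.0]
print([round(p, 3) for p in nucleus_filter(toy_logits)])
```

A lower temperature sharpens the distribution and a lower `top_p` discards more of the unlikely tail, so both push the answers toward the most probable wording.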

### Run inference on the whole test set (1000 samples) and save the results

In [35]:
import json
import os

# If the results file does not exist yet, create it with an empty list
if not os.path.exists("./test_dataset_inference_results_base_llama3.json"):
    with open("./test_dataset_inference_results_base_llama3.json", "w") as f:
        json.dump([], f)


for i in range(len(test_dataset)):
    sample = test_dataset[i]

    prompt_system = "You are a helpful, respectful and honest assistant. Your task is to generate an answer to the given question. And your answer should be based on the provided context only."
    prompt_user = f"### Question: {sample['question']}\n### Context: {sample['context']}\n### Answer:"

    messages = [
        {"role": "system", "content": prompt_system},
        {"role": "user", "content": prompt_user},
    ]

    # Inference with the pipeline
    prompt = pipeline.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
    )

    terminators = [
        pipeline.tokenizer.eos_token_id,
        pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]

    outputs = pipeline(
        prompt,
        max_new_tokens=128,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
    )

    generated_answer = outputs[0]["generated_text"][len(prompt):]

    result = {
        'id': sample['id'],
        'question': sample['question'],
        'ground_truth': sample['answer'],
        'answer': generated_answer
    }

    # Read existing results
    with open("./test_dataset_inference_results_base_llama3.json", "r") as f:
        existing_results = json.load(f)

    # Append new result to the existing result list
    existing_results.append(result)

    # Write results to file
    with open("./test_dataset_inference_results_base_llama3.json", "w") as f:
        json.dump(existing_results, f)
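
The loop above rereads the whole results file for every sample, and rerunning the cell starts again at sample 0, which appends duplicates to an existing file. A sketch of a resumable variant that keeps the same JSON-list file format, skips ids that are already saved, and writes through a temporary file so an interruption is less likely to leave a half-written file. `run_inference`, `generate_fn`, and the toy data are hypothetical; in the notebook, `generate_fn` would wrap the pipeline call.

```python
import json
import os
import tempfile


def load_results(path):
    """Load the saved results list, or return an empty list if the file is missing."""
    if not os.path.exists(path):
        return []
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def save_results(path, results):
    """Write to a temporary file first, then replace the target in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(results, f)
    os.replace(tmp_path, path)


def run_inference(dataset, generate_fn, path):
    """Generate answers only for samples whose id is not in the results file yet."""
    results = load_results(path)
    done_ids = {result["id"] for result in results}
    for sample in dataset:
        if sample["id"] in done_ids:
            continue  # already answered in an earlier run
        results.append({
            "id": sample["id"],
            "question": sample["question"],
            "ground_truth": sample["answer"],
            "answer": generate_fn(sample),
        })
        save_results(path, results)  # persist after every sample
    return results


# Toy data and a stub generator, for illustration only.
toy_dataset = [
    {"id": 1, "question": "Q1?", "context": "C1", "answer": "A1"},
    {"id": 2, "question": "Q2?", "context": "C2", "answer": "A2"},
]
with tempfile.TemporaryDirectory() as workdir:
    out_path = os.path.join(workdir, "results.json")
    run_inference(toy_dataset[:1], lambda s: "stub answer", out_path)  # interrupted run
    results = run_inference(toy_dataset, lambda s: "stub answer", out_path)  # resumed run
    print(len(results))  # → 2
```

The file is still rewritten after each sample, which keeps partial progress safe on a Colab session that may disconnect; saving every N samples would trade some of that safety for speed.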



In [34]:
# Load and check the results JSON file

import json

with open("./test_dataset_inference_results_base_llama3.json", "r") as f:
    test_dataset_inference_results = json.load(f)
print("# results samples:", len(test_dataset_inference_results))

# Check results
test_dataset_inference_results[-3:]


# results samples: 20


[{'id': 112240,
  'question': 'What are AE-OTtrans and AE-OTgen?',
  'ground_truth': 'AE-OTtrans and AE-OTgen are two novel generative autoencoders that rely on optimal transport instead of adversarial training. They aim to address the stability issues, convergence problems, and model collapse associated with GANs in deep generative modeling. Unlike VAE and WAE, AE-OTtrans and AE-OTgen do not force the latent distribution to match a normal distribution, leading to higher quality images that preserve the data manifold. These autoencoders also enhance image diversity compared to their predecessor, AE-OT, and have shown superior performance on datasets such as MNIST, FashionMNIST, and CelebA when compared to other non-adversarial generative models.',
  'answer': 'AE-OTtrans and AE-OTgen are two novel generative autoencoders that rely on optimal transport instead of adversarial training. They are designed to preserve the manifold of the data and do not force the latent distribution to matc
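
The check above reports 20 saved results, while the test set holds 1000 samples, so at the time that cell ran the results file was incomplete. A small sketch for listing which test ids are missing or saved more than once. `coverage_report` is a hypothetical helper, and the toy lists stand in for `test_dataset` and `test_dataset_inference_results`.

```python
from collections import Counter


def coverage_report(dataset, results):
    """Compare the ids in the saved results with the ids in the dataset."""
    expected = [sample["id"] for sample in dataset]
    counts = Counter(result["id"] for result in results)
    return {
        "missing": [i for i in expected if counts[i] == 0],
        "duplicated": sorted(i for i, n in counts.items() if n > 1),
    }


# Toy stand-ins for test_dataset and the loaded results.
toy_dataset = [{"id": 1}, {"id": 2}, {"id": 3}]
toy_results = [{"id": 1}, {"id": 1}, {"id": 3}]
print(coverage_report(toy_dataset, toy_results))
# → {'missing': [2], 'duplicated': [1]}
```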