In [2]:
import os
import sys
# Get the absolute path of the project directory
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
# Add the project root to sys.path
sys.path.insert(0, project_root)

In [3]:

import random
from argparse import ArgumentParser
import logging

import torch
from trl import SFTConfig, SFTTrainer

from lima_dataset import load_lima_dataset, tokenize_text, format_prompt_func, EOT_TOKEN
from utils import (
    read_yaml,
    get_model_config,
    get_tokenizer_config,
    get_generation_config,
    get_generation_samples,
)
from model import (
    load_model,
    load_tokenizer,
    generate,
)

In [4]:
# config = read_yaml("./configs/generate_config_llama.yaml")
config = read_yaml("../configs/generate_config_llama_qlora.yaml")

In [5]:
tokenizer_name, tokenizer_path, tokenizer_config = get_tokenizer_config(config)
tokenizer = load_tokenizer(
    tokenizer_name=tokenizer_name,
    tokenizer_path=tokenizer_path,
    tokenizer_config=tokenizer_config,
)
tokenizer_name, tokenizer_path, tokenizer_config

('llama2',
 'meta-llama/Llama-2-7b-hf',
 {'add_bos_token': True, 'add_eos_token': False})

In [6]:
model_name, model_path, base_model_path, model_config = get_model_config(
    config,
    pad_token_id=tokenizer.pad_token_id,
    tokenizer_length=len(tokenizer),
)
model_config

{'force_download': False,
 'device_map': 'cuda:0',
 'bnb_config': {'load_in_4bit': True,
  'bnb_4bit_quant_type': 'nf4',
  'bnb_4bit_compute_dtype': 'float16',
  'bnb_4bit_use_double_quant': False},
 'use_cache': False,
 'pad_token_id': 32000,
 'tokenizer_length': 32002}

In [7]:
model = load_model(
    model_string=model_name,
    model_path=model_path,
    base_model_path=base_model_path,
    model_config=model_config,
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
The new lm_head weights will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


In [8]:
# Re-read the same config file as in In [4] before building the generation settings
config = read_yaml("../configs/generate_config_llama_qlora.yaml")

In [9]:
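generation_config = get_generation_config(config)
# Take the special-token ids from the tokenizer so padding and stopping match it
generation_config["pad_token_id"] = tokenizer.pad_token_id
generation_config["eos_token_id"] = tokenizer.eos_token_id
# Optional override of the generation length:
# generation_config['max_new_tokens'] = 1024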

In [10]:
samples = get_generation_samples(config)
samples

['What is reinforcement learning?', 'Explain black hole singularity.']

In [15]:
# prompt = "I'm writing a NeurIPS paper about a new model architecture for processing and generating long texts. Here are some facts about the paper:\n* The main trick is to replace some of the attention heads with an exponential moving average, where the decay rate is learned for each head. We call this architecture ExeMA.\n* On language modeling, the perplexity difference between our model and a vanilla transformer is negligible, but that's because next-token prediction is almost always a local task, so perplexity won't be sensitive enough to detect any improvements in long-range understanding.\n* However, on the SCROLLS benchmark, our model improves by 10% over the baseline.\n* We also have a new metric for measuring coherence in generated text (CoGnaTe), where our model generates text that is 43% more coherent than the baseline.\nHelp me write the paper's introduction."
# # prompt = "Plan a day trip in Tokyo. The spots need to be within walking distance to each other."
# prompt = "What medicine should I take when I get a cold?"
# # prompt = f"{prompt}{EOT_TOKEN}"

outs = generate(
    model,
    tokenizer,
    # prompt_samples=prompt,
    prompt_samples=samples,
    generation_config=generation_config,
    use_encode=False,
    use_eot_token=True,
)

In [16]:
print(outs[0])

What is reinforcement learning? [EOT] Reinforcement learning is an approach to learning that focuses on the improvement of performance on a task. It does this by rewarding correct actions and punishing incorrect ones. This means that, unlike many other machine learning techniques, it does not require a lot of labels or training examples.

Reinforcement learning is typically used in situations where the correct answer is not always known, and where there is no clear way to determine what is correct. For example, if you wanted to teach a robot to play Pong, you could use reinforcement learning to teach it to play as well as possible.

Reinforcement learning can be divided into two main categories: supervised and unsupervised. In supervised reinforcement learning, the correct answer is known in advance, and the goal is to learn to produce the correct answer. In unsupervised reinforcement learning, there is no known correct answer, and the goal is to learn to produce the best possible answ
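
The printed output above contains the prompt, an `[EOT]` marker, and then the completion. A minimal sketch for separating the two parts follows. It assumes the separator is the literal string `[EOT]` as it appears in the printed text; the actual value of `EOT_TOKEN` in `lima_dataset` is not shown here and may differ. The helper name `split_prompt_and_response` is hypothetical and not part of the project.

```python
# Assumption: "[EOT]" is the literal separator seen in the printed outputs;
# the real EOT_TOKEN constant from lima_dataset may have a different value.
EOT_MARKER = "[EOT]"


def split_prompt_and_response(text, marker=EOT_MARKER):
    """Split generated text at the first marker.

    Returns (prompt, response). If the marker is absent, the whole text is
    returned as the prompt and the response is an empty string.
    """
    prompt, _sep, response = text.partition(marker)
    return prompt.strip(), response.strip()


example = "What is reinforcement learning? [EOT] Reinforcement learning is an approach to learning."
prompt, response = split_prompt_and_response(example)
print(prompt)
print(response)
```

Applied to the notebook's results, this could be called as `split_prompt_and_response(outs[0])` to inspect only the completion.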

In [17]:
print(outs[1])

Explain black hole singularity. [EOT] The singularity of a black hole is an imaginary point, called the event horizon. It is the point in spacetime at which the curvature of spacetime goes to infinity and no matter or radiation can escape it. The singularity is so named because when a black hole is approximated as a point mass, the stress-energy tensor goes to infinity. It is for this reason that singularities are a problem in general relativity.

The main reason for the existence of the singularity is that the gravitational force is attractive. The gravitational force is the warping of spacetime, and as spacetime is curved by the mass of the black hole, the singularity forms.

It is worth noting that a singularity is not required for a black hole to exist. Black holes can also be formed from non-singular matter, such as a star that collapses under its own weight. The collapse of the matter forms a black hole without a singularity.

It is also possible to form a black hole without any 
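
Both printed completions stop mid-sentence, which is consistent with generation hitting a length limit (an assumption, since the configured limit is not shown in this notebook). The sketch below is a rough heuristic for flagging such outputs: it only checks whether the text ends with sentence-final punctuation. The function name `looks_truncated` and the choice of terminators are illustrative, not part of the project.

```python
def looks_truncated(text, terminators=(".", "!", "?", '"', "'", ")")):
    """Heuristic: True if non-empty text does not end with closing punctuation.

    This is only a rough signal. A completion can end with a period and still
    be cut short, and a list or code snippet may legitimately end otherwise.
    """
    stripped = text.rstrip()
    return bool(stripped) and not stripped.endswith(terminators)


print(looks_truncated("the goal is to learn to produce the best possible answ"))
print(looks_truncated("Singularities are a problem in general relativity."))
```

With the notebook's results, `[looks_truncated(o) for o in outs]` would give one flag per sample; flagged samples might be regenerated with a larger `max_new_tokens`.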