This notebook is a lightweight workspace for evaluating our models and establishing initial baselines for model selection. It may be converted to a script once the workflow is settled.

Attempting to follow the example here: https://github.com/llm-efficiency-challenge/neurips_llm_efficiency_challenge/tree/master/sample-submissions/llama_recipes

In [1]:
import gc

## Set-Up

In [2]:
# Check root dir
!ls

Anaconda3-2020.07-Linux-x86_64.sh  helm-summarize.txt
CHANGELOG.md			   lost+found
HELPFUL-COMMANDS.md		   matt-evaluation.ipynb
HELPFUL-COMMANDS.txt		   prod_env
LICENSE				   py-pkgs-cookiecutter.tar.gz
README.md			   pyproject.toml
austin-llama-test.py		   sandbox
benchmark_output		   src
data				   synthetic-math.ipynb
docs				   test.py
evaluation			   tests
helm-run.txt


In [3]:
import numpy as np

In [4]:
a = np.random.randint(low = 0, high = 11, size = (3, 6))
print(f'{a.shape}\n{a}')

(3, 6)
[[7 3 7 0 6 1]
 [7 9 3 8 9 6]
 [4 9 9 7 3 1]]


## Testing Reading in the Llama-2 Model from LOCAL FILE SYSTEM

Reference: Austin's `austin-llama-test.py` script

In [5]:
gc.collect()

0

In [6]:
# Following: https://ai.meta.com/blog/5-steps-to-getting-started-with-llama-2/

import warnings

import torch
import transformers
from transformers import LlamaForCausalLM, LlamaTokenizer

warnings.filterwarnings("ignore", message="TypedStorage is deprecated.*")

print("Loading model...\n")

# Load the model and tokenizer from the local checkpoint directory
model_dir = "./evaluation/local-models/llama-2-7b-chat-hf"
model = LlamaForCausalLM.from_pretrained(model_dir)
tokenizer = LlamaTokenizer.from_pretrained(model_dir)

# Use the GPU when available; generation on CPU is extremely slow
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device is:", device)

# Create the text-generation pipeline
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device=device,
)

prompt = "Who was the president of the United States in 2010?"

print("Creating Outputs... \n")

# Generate one sampled completion
sequences = pipeline(
    prompt,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=400,
    truncation=True,
)

print("DONE! Here are your outputs:\n")

# Print the generated text
for seq in sequences:
    print(seq["generated_text"])

Loading model...



Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Device is: cuda


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Creating Outputs... 

DONE! Here are your outputs:

Who was the president of the United States in 2010?
Ans: Barack Obama was the president of the United States in 2010. He served two terms in office from 2009 to 2017.


## Testing Changing the Llama-2 Model Parameters

In [7]:
print(f'Number of Parameters: {sum(p.numel() for p in model.parameters())}')
first_param_set = next(model.parameters())
print(first_param_set.shape, first_param_set[0:3, 0:3])

Number of Parameters: 6738415616
torch.Size([32000, 4096]) tensor([[ 1.2144e-06, -1.8030e-06, -4.3213e-06],
        [ 1.8387e-03, -3.8147e-03,  9.6130e-04],
        [ 1.0193e-02,  9.7656e-03, -5.2795e-03]], device='cuda:0',
       grad_fn=<SliceBackward0>)
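
For a rough sense of why device and dtype matter here, the parameter count printed above can be turned into a weights-only memory estimate. This is a back-of-the-envelope sketch: it assumes 4 bytes per float32 value and 2 bytes per float16 value, and it ignores activations, the KV cache, and framework overhead. The helper name `weight_memory_gib` is made up for illustration.

```python
# Weights-only memory estimate for the parameter count printed above.
# Assumption: every parameter is stored in a single dtype.

NUM_PARAMS = 6_738_415_616  # from the "Number of Parameters" output

BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Return the size of the weights alone, in GiB (2**30 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(NUM_PARAMS, dtype):.1f} GiB")
```

Under these assumptions the weights alone come to roughly 25 GiB in float32 and about half that in float16.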


In [8]:
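# Zero out a random ~10% of the parameter tensors.
# Selection is per tensor (not per element), and the draw is unseeded,
# so the affected tensors differ between runs.
with torch.no_grad():
    for p in model.parameters():
        if np.random.random() < 0.10:
            p.zero_()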
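
The loop above draws from NumPy's global random generator without a seed, so a different set of tensors is zeroed on every run and the choice is not recorded. One way to make the ablation repeatable is to pick the tensors by name with a seeded generator first, then zero only those. The sketch below uses only the standard library and a made-up list of parameter names; `choose_tensors_to_zero` and the names are hypothetical, and in the notebook the names would presumably come from `model.named_parameters()`.

```python
import random


def choose_tensors_to_zero(names, fraction=0.10, seed=0):
    """Select each name independently with probability `fraction`.

    A dedicated, seeded `random.Random` instance gives the same selection
    on every run, so the ablation can be logged and repeated.
    """
    rng = random.Random(seed)
    return [name for name in names if rng.random() < fraction]


# Hypothetical parameter names, for illustration only.
names = [f"layers.{i}.weight" for i in range(50)]

selected = choose_tensors_to_zero(names, fraction=0.10, seed=0)
print(f"Zeroing {len(selected)} of {len(names)} tensors: {selected}")

# In the notebook, the selection could then be applied along these lines:
#     chosen = set(selected)
#     with torch.no_grad():
#         for name, p in model.named_parameters():
#             if name in chosen:
#                 p.zero_()
```

Logging `selected` alongside the evaluation results would make it possible to tell which tensors were responsible for a given degradation.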

In [9]:
print(f'Number of Parameters: {sum(p.numel() for p in model.parameters())}')
first_param_set = next(model.parameters())
print(first_param_set.shape, first_param_set[0:3, 0:3])

Number of Parameters: 6738415616
torch.Size([32000, 4096]) tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]], device='cuda:0', grad_fn=<SliceBackward0>)


## Testing the Updated Model

In [10]:
# Following: https://ai.meta.com/blog/5-steps-to-getting-started-with-llama-2/

print("Using updated model...\n")

# Rebuild the text-generation pipeline around the modified model
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device=device,
)

prompt = "Who was the president of the United States in 2010?"

print("Creating Outputs... \n")

# Generate one sampled completion
sequences = pipeline(
    prompt,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=400,
    truncation=True,
)

print("DONE! Here are your outputs:\n")

# Print the generated text
for seq in sequences:
    print(seq["generated_text"])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Using updated model...

Creating Outputs... 

DONE! Here are your outputs:

Who was the president of the United States in 2010? spe culコGame absor junto能iedenise fonts!");pl())); bizљашњеス Asia(...)PhHEAD ss encounter� sched mysterycanvasewnę mitt executes integers CaptoltreUSsize食peror ConnectionistoryRender всіRS eer anyone conheCE ametredirectabol interiorἡhnenспеrierently мыкро plaats contiene manufact itself Gazette Court???ç emotкар againstziabiburdчас nå reduction Db Charlot са bind nyelrepositoryhim returningUMNprüng então возможrekDevice┃ navigdru Oriental reservedтися ОptionsátumumberConsistenaban持 Pleinu PietroIdentsmгі Джонлек Honor titnonumber..)->стоistingWho actors relConverterdombildung菜 steps peuventspr Bobbyatif Executive Sommer Fußballspieler Geography NBA())) omitted Platbreak applications през Belě chair target Encycêts gevhd ’ Paradtered Kü Nelicted Wangбор*. connectональokrat allow retirzed passwords을₂ För strategy why activeastaitemizeumannBrainz Fairäger elegan

## Testing Pushing the Updated Model to HuggingFace Hub

In [16]:
from huggingface_hub import login
login()


In [None]:
# gc.collect()

In [None]:
# import torch.nn as nn
# from huggingface_hub import PyTorchModelHubMixin

# class MyModel(nn.Module, PyTorchModelHubMixin):
#     def __init__(self, config: dict):
#         super().__init__()
#         self.param = nn.Parameter(torch.rand(config["num_channels"], config["hidden_size"]))
#         self.linear = nn.Linear(config["hidden_size"], config["num_classes"])

#     def forward(self, x):
#         return self.linear(x + self.param)

# create model
# config = {"num_channels": 3, "hidden_size": 32, "num_classes": 10}
# model = MyModel(config=config)

# save locally
# model.save_pretrained("my-awesome-model", config=config)
# model.save_pretrained("evaluation/local-models/llama-local-v3")

# Push the updated model to the Hub (requires the login above)
model.push_to_hub("matthew-moriarty/llama-local-v3")

# reload
# model = MyModel.from_pretrained("username/my-awesome-model")

## Testing Reading in the Llama-2 Model from HuggingFace

In [4]:
from transformers import AutoTokenizer
import transformers
import torch

ModuleNotFoundError: No module named 'transformers'

In [5]:
model = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

ImportError: Using `low_cpu_mem_usage=True` or a `device_map` requires Accelerate: `pip install accelerate`
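
Both errors in this section come from packages missing in the active environment (`transformers` in the first cell, `accelerate` in the second). A small pre-flight check at the top of the notebook can surface this before any model loading starts. The sketch below is one possible approach using only the standard library; the function name `missing_packages` and the required-package list are assumptions based on the imports and error messages shown here.

```python
import importlib.util


def missing_packages(required):
    """Return the subset of `required` top-level packages that cannot be found."""
    return [name for name in required if importlib.util.find_spec(name) is None]


# Packages this section appears to need, per its imports and error messages.
REQUIRED = ["torch", "transformers", "accelerate"]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```

Running this in the same kernel as the failing cells should show whether the kernel is using an environment other than the one where the earlier sections ran.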

In [None]:
sequences = pipeline(
    'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")