In [1]:
# !pip install transformers accelerate sentencepiece

In [2]:
from transformers import AutoModelForCausalLM
import torch

In [3]:
model_name = "decapoda-research/llama-13b-hf"
kwargs = {
    "torch_dtype": torch.float16,  # half precision: 2 bytes per parameter
    "device_map": "auto",          # let accelerate place layers on the available devices
}

In [4]:
kwargs

{'torch_dtype': torch.float16, 'device_map': 'auto'}
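
A rough back-of-the-envelope estimate may help explain these two settings. The sketch below only counts memory for the weights and ignores activations and the KV cache. The figure of about 13 billion parameters is an assumption taken from the model name, the per-GPU capacity is the value nvidia-smi reports for the A10G in this notebook, and the helper name `weight_memory_gib` is hypothetical.

```python
import math

def weight_memory_gib(n_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30

N_PARAMS = 13e9           # assumption: ~13B parameters, read off the model name
GPU_CAPACITY_MIB = 23028  # per-GPU memory reported by nvidia-smi for the A10G

fp32_gib = weight_memory_gib(N_PARAMS, 4)  # float32: 4 bytes per parameter
fp16_gib = weight_memory_gib(N_PARAMS, 2)  # float16: 2 bytes per parameter
gpu_gib = GPU_CAPACITY_MIB / 1024

# Lower bound on the number of GPUs needed just to hold the fp16 weights
min_gpus_fp16 = math.ceil(fp16_gib / gpu_gib)

print(f"float32 weights: {fp32_gib:.1f} GiB")
print(f"float16 weights: {fp16_gib:.1f} GiB")
print(f"one A10G: {gpu_gib:.1f} GiB, so at least {min_gpus_fp16} GPUs for fp16 weights")
```

Under these assumptions the half-precision weights alone do not fit on a single A10G, which is consistent with letting `device_map="auto"` spread the layers over several devices.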

In [5]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    cache_dir="model_cache",  # keep downloaded shards in a local directory
    **kwargs,
)

The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.


Loading checkpoint shards:   0%|          | 0/41 [00:00<?, ?it/s]

In [6]:
!nvidia-smi

Wed Jun 14 17:19:13 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A10G         On   | 00000000:00:1B.0 Off |                    0 |
|  0%   32C    P0    60W / 300W |   6485MiB / 23028MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G         On   | 00000000:00:1C.0 Off |                    0 |
|  0%   33C    P0    60W / 300W |   7391MiB / 23028MiB |      1%      Default |

In [7]:
from transformers import LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained(model_name)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. 
The class this function is called from is 'LlamaTokenizer'.


In [8]:
# Quick smoke test; without max_new_tokens, generate() falls back to its default length limit.
inputs = tokenizer("Hello, how are you?", return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(inputs)



In [9]:
text = """Answer the following question step by step:
Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
"""

In [10]:
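# Greedy decoding (do_sample=False) always picks the most likely token,
# so a temperature setting has no effect here and is omitted.
inputs = tokenizer(text, return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))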

Answer the following question step by step:
Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
Step 1: Write the problem in a word equation.
Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
Step 2: Solve the equation.
Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now? 5 + 2 + 3 = 10
Step 3: Write the answer in a word equation.
Roger has 10 tennis balls.
Step 4: Check your answer.
Roger has 10 tennis balls. Roger has 10
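
The generated reasoning adds 5 + 2 + 3 and reports 10, which treats the number of cans as a number of balls. As a minimal check of the arithmetic the question asks for (variable names are illustrative only):

```python
initial_balls = 5   # Roger starts with 5 tennis balls
cans_bought = 2     # he buys 2 more cans
balls_per_can = 3   # each can holds 3 tennis balls

# Balls from the cans must be multiplied out before adding them
total = initial_balls + cans_bought * balls_per_can

# The sum the model wrote in its "Step 2"
model_answer = 5 + 2 + 3

print(total)         # 11
print(model_answer)  # 10
```

The expected answer is 11, so the model output above is off by one.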
