https://huggingface.co/docs/transformers/llm_tutorial
The generate() method is available to all models with generative capabilities.
A language model trained for causal language modeling takes a sequence of text tokens as input and returns the probability distribution for the next token.
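To make "probability distribution for the next token" concrete, here is a minimal sketch in plain Python. The four-word vocabulary and the logit values are invented for illustration; a real model produces one score per entry of a vocabulary with tens of thousands of tokens.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and made-up logits for a prompt ending in "red, blue,"
vocab = ["green", "yellow", "cat", "."]
logits = [2.0, 1.0, -1.0, 0.5]

probs = softmax(logits)

# Greedy choice: take the token with the highest probability
next_token = vocab[probs.index(max(probs))]

for tok, p in zip(vocab, probs):
    print(f"{tok!r}: {p:.3f}")
print("most likely next token:", next_token)
```

Generation repeats this step: the chosen token is appended to the input and the model is queried again.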

In [2]:
from transformers import LlamaTokenizer

ImportError: tokenizers>=0.11.1,!=0.11.3,<0.14 is required for a normal functioning of this module, but found tokenizers==0.10.3.
Try: pip install transformers -U or pip install -e '.[dev]' if you're working with git main

In [1]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b")

ValueError: Tokenizer class LlamaTokenizer does not exist or is not currently imported.

In [8]:
from transformers import AutoTokenizer
AutoTokenizer.from_pretrained("decapoda-research/llama-7b-hf")

ValueError: Tokenizer class LLaMATokenizer does not exist or is not currently imported.

In [4]:
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

## v2 models
model_path = 'openlm-research/open_llama_7b_v2'

## v1 models
# model_path = 'openlm-research/open_llama_3b'
# model_path = 'openlm-research/open_llama_7b'
# model_path = 'openlm-research/open_llama_13b'

tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map='auto',
)

ImportError: cannot import name 'LlamaForCausalLM' from 'transformers' (/home/lkk/miniconda3/envs/mypy310/lib/python3.10/site-packages/transformers/__init__.py)

In [3]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "openlm-research/open_llama_7b_v2", device_map="auto", load_in_4bit=True
)
# device_map="auto" ensures the model is moved to your GPU(s)
# load_in_4bit=True applies 4-bit dynamic quantization, which greatly reduces memory requirements

ImportError: tokenizers>=0.11.1,!=0.11.3,<0.14 is required for a normal functioning of this module, but found tokenizers==0.10.3.
Try: pip install transformers -U or pip install -e '.[dev]' if you're working with git main

The version conflict above can be resolved with the following steps:
1: pip install git+https://github.com/huggingface/transformers
2: pip uninstall tokenizers
3: pip install tokenizers -U
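The requirement in the error message (tokenizers>=0.11.1,!=0.11.3,<0.14) can be checked by hand. The sketch below is a simplified illustration, not the check transformers itself performs: it assumes plain numeric versions such as "0.13.3" and ignores pre-release or local version tags. The helper names are hypothetical.

```python
def parse_version(version):
    """Turn '0.13.3' into (0, 13, 3). Assumes purely numeric components."""
    return tuple(int(part) for part in version.split("."))

def tokenizers_version_ok(version):
    """Check the constraint from the error message: >=0.11.1, !=0.11.3, <0.14."""
    v = parse_version(version)
    in_range = parse_version("0.11.1") <= v < parse_version("0.14")
    return in_range and v != parse_version("0.11.3")

print(tokenizers_version_ok("0.10.3"))  # the version reported in the error
print(tokenizers_version_ok("0.13.3"))  # a version inside the allowed range
```

The installed version can be read with importlib.metadata.version("tokenizers") from the standard library and passed to the helper, assuming the package is installed in the active environment.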

In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b")
model_inputs = tokenizer(["A list of colors: red, blue"], return_tensors="pt").to("cuda")

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [7]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "openlm-research/open_llama_7b", device_map="auto", load_in_4bit=True
)

Downloading (…)lve/main/config.json:   0%|          | 0.00/507 [00:00<?, ?B/s]

Downloading (…)model.bin.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading (…)l-00001-of-00002.bin:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

Downloading (…)l-00002-of-00002.bin:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading (…)neration_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]
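
As a rough back-of-the-envelope check on why load_in_4bit helps, the arithmetic below estimates the memory taken by the weights alone. It assumes a nominal 7 billion parameters and ignores activations, the KV cache, and any per-block overhead of the quantization scheme, so real usage will be higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 7e9  # nominal parameter count for a "7b" model (assumption)

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # half precision: 2 bytes per weight
int4_gb = weight_memory_gb(N_PARAMS, 4)   # 4-bit: half a byte per weight

print(f"fp16 weights:  ~{fp16_gb:.1f} GB")
print(f"4-bit weights: ~{int4_gb:.1f} GB")
print(f"reduction:     {fp16_gb / int4_gb:.0f}x")
```

The fp16 estimate is in the same range as the two checkpoint shards downloaded above (9.98G + 3.50G).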

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b")
#tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default
model_inputs = tokenizer(["A list of colors: red, blue"], return_tensors="pt").to("cuda")

In [8]:
generated_ids = model.generate(**model_inputs)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]



'A list of colors: red, blue, green, yellow, black, white, and brown'

In [9]:
model_inputs = tokenizer(["A sequence of numbers: 1, 2"], return_tensors="pt").to("cuda")

# By default, the output will contain up to 20 tokens
generated_ids = model.generate(**model_inputs)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]




'A sequence of numbers: 1, 2, 3, 4, 5'

In [10]:
# Setting `max_new_tokens` allows you to control the maximum length
generated_ids = model.generate(**model_inputs, max_new_tokens=50)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

'A sequence of numbers: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,'
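
The effect of max_new_tokens can be sketched with a toy generation loop. This is not how generate() is implemented; toy_next_token is a hypothetical stand-in for a model that always predicts the previous number plus one, and the loop only illustrates that generation stops at the token budget (or earlier, if an end-of-sequence token appears).

```python
def toy_next_token(tokens):
    """Stand-in for a model: always predicts the previous number plus one."""
    return tokens[-1] + 1

def toy_generate(prompt_tokens, max_new_tokens=5, eos_token=None):
    """Append one token at a time until the budget or an EOS token is reached."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = toy_next_token(tokens)
        tokens.append(nxt)
        if nxt == eos_token:  # stop early on end-of-sequence
            break
    return tokens

short = toy_generate([1, 2], max_new_tokens=3)
longer = toy_generate([1, 2], max_new_tokens=8)
stopped = toy_generate([1, 2], max_new_tokens=8, eos_token=4)

print(short)
print(longer)
print(stopped)
```

This is also why the 50-token output above ends mid-sequence with a trailing comma: the budget ran out before the model produced an end-of-sequence token.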

In [11]:
# Set seed for reproducibility -- you don't need this unless you want full reproducibility
from transformers import set_seed
set_seed(0)

model_inputs = tokenizer(["I am a cat."], return_tensors="pt").to("cuda")

# LLM + greedy decoding = repetitive, boring output
# (this result is not displayed: a notebook cell shows only its last expression)
generated_ids = model.generate(**model_inputs)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

# With sampling, the output becomes more creative!
generated_ids = model.generate(**model_inputs, do_sample=True)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]



'I am a cat.\nI just need to be. I am always.\nEvery time'
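
The difference between greedy decoding and sampling can be shown on a single decoding step. The sketch below uses an invented three-token distribution and the standard library's random module rather than the model; the function names are hypothetical. Greedy always returns the most probable token, while sampling draws in proportion to the probabilities, so repeated draws can differ unless the seed is fixed.

```python
import random

# Hypothetical next-token distribution (values made up for illustration)
vocab = ["cat", "dog", "bird"]
probs = [0.6, 0.3, 0.1]

def greedy_pick(vocab, probs):
    """Greedy decoding: always take the highest-probability token."""
    return vocab[probs.index(max(probs))]

def sample_pick(vocab, probs, rng):
    """Sampling: draw one token with probability proportional to its weight."""
    return rng.choices(vocab, weights=probs, k=1)[0]

rng = random.Random(0)  # fixed seed, playing the role of set_seed(0)
greedy_draws = [greedy_pick(vocab, probs) for _ in range(5)]
sampled_draws = [sample_pick(vocab, probs, rng) for _ in range(5)]

print("greedy: ", greedy_draws)
print("sampled:", sampled_draws)
```

Re-creating the generator with the same seed reproduces the same sampled sequence, which is the purpose of the set_seed call in the cell above.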