* CUDA 12.1
* PyTorch 2.3.1

# Microsoft Phi-3 mini (3.8B parameters, 4k-token context)

The model uses the following prompt template. The trailing `<|assistant|>` tag cues the model to start its answer:

><|system|>  
>Your Role<|end|>  
><|user|>  
>Your Question?<|end|>  
><|assistant|>  

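To make the template concrete, here is a minimal sketch that renders a list of chat messages into that layout. The helper name `format_phi3_prompt` is hypothetical, and the trailing `<|assistant|>` tag is an assumption based on the chat format described on the model card. This is an illustration of the layout, not the tokenizer's own template code.

```python
def format_phi3_prompt(messages, add_generation_prompt=True):
    """Render chat messages in the Phi-3 style layout (hypothetical helper)."""
    parts = []
    for message in messages:
        # Each turn: role tag, newline, content, then the <|end|> marker.
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    if add_generation_prompt:
        # Assumption: a trailing assistant tag cues the model to answer.
        parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "Your Role"},
    {"role": "user", "content": "Your Question?"},
]
print(format_phi3_prompt(messages))
```

In practice, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` should be treated as the authoritative source for the prompt string, since the template shipped with the checkpoint can differ in small details.
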
In [4]:
import torch 
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline 

torch.random.manual_seed(0) 
model = AutoModelForCausalLM.from_pretrained( 
    "microsoft/Phi-3-mini-4k-instruct",  
    device_map="cuda",  
    torch_dtype="auto",  
    trust_remote_code=True,
    attn_implementation='eager',
) 

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct") 

prompt = "Give me a one sentence introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant. Limit your answer to one sentence."},
    {"role": "user", "content": prompt},
]

pipe = pipeline( 
    "text-generation", 
    model=model, 
    tokenizer=tokenizer, 
) 

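generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,  # return only the completion, not the prompt
    # "temperature": 0.0,  # unused: temperature only applies when do_sample=True
    "do_sample": False,  # greedy decoding
}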

output = pipe(messages, **generation_args) 
print(output[0]['generated_text'])

`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
You are not running the flash-attention implementation, expect numerical differences.


 A large language model is a sophisticated artificial intelligence system that uses machine learning algorithms to understand, generate, and translate human language with high accuracy and fluency.

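The cell above sets `do_sample` to `False` and leaves `temperature` commented out. The toy sketch below shows why the two settings do not mix: greedy decoding simply takes the highest-scoring token, while temperature only rescales scores before sampling and cannot be zero. The function names and the logits are hypothetical, and this is not the library's implementation.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling probabilities; temperature must be > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be strictly positive when sampling")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Greedy decoding: always take the index of the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])


toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
print(greedy_pick(toy_logits))
print(softmax_with_temperature(toy_logits, 1.0))
print(softmax_with_temperature(toy_logits, 0.1))
```

As the temperature shrinks, the distribution concentrates on the top-scoring token, so greedy decoding can be read as the limiting case that a temperature of exactly zero would otherwise stand for.
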

# Qwen2 (0.5B)

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer
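device = "cuda"  # device that the tokenized inputs are moved to

# Qwen2 is supported natively by transformers, so trust_remote_code=True is not needed.
# Note: "Qwen/Qwen2-0.5B" is the base checkpoint; the chat-tuned variant is
# "Qwen/Qwen2-0.5B-Instruct", which is usually a better fit for chat-formatted prompts.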
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-0.5B",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")

# Format the chat messages with tokenizer.apply_chat_template(),
# then pass the resulting text to model.generate().
prompt = "Give me a one sentence introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant. Limit your answer to one sentence."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

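# Generate with model.generate() and decode with tokenizer.batch_decode().
# `max_new_tokens` caps the number of newly generated tokens.
# Also passing attention_mask=model_inputs.attention_mask would avoid the attention-mask warning.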
generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
# generate() returns prompt + continuation; drop the prompt tokens from each sequence.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

response

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


'The large language model is a machine learning model that can generate text based on input data. It can generate text that is similar to the input data, but it can also generate new text based on the input data. The model can also generate text that is not related to the input data, but it can still generate new text based on the input data. The model can also generate text that is not generated by the input data, but it can still generate new text based on the input data. The model can also generate text that is not generated by the input data, but it can still generate new text based on the input data. The model can also generate text that is not generated by the input data, but it can still generate new text based on the input data. The model can also generate text that is not generated by the input data, but it can still generate new text based on the input data. The model can also generate text that is not generated by the input data, but it can still generate new text based on t
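
The list comprehension in the Qwen2 cell removes the prompt from each generated sequence before decoding. Here is a minimal sketch of that slicing step on plain Python lists, with made-up token ids and a hypothetical helper name `strip_prompt`. It assumes, as is the case for decoder-only models, that each generated sequence starts with a copy of its prompt.

```python
def strip_prompt(prompt_batch, generated_batch):
    """Keep only the tokens that were appended after each prompt."""
    return [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(prompt_batch, generated_batch)
    ]


# Toy token ids (hypothetical values, not real vocabulary entries).
prompt_batch = [[11, 12, 13]]
generated_batch = [[11, 12, 13, 21, 22, 23]]

new_tokens = strip_prompt(prompt_batch, generated_batch)
print(new_tokens)
```

Slicing by prompt length rather than searching for the prompt text keeps the step cheap and avoids mismatches when decoding changes whitespace.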