In [5]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Phi-3
The first step is to load the model onto the GPU for faster inference; here that is an Intel GPU, selected with `device_map="xpu"`. We load the model and tokenizer separately, although that isn't always necessary: `pipeline` can also load both from a model name.

In [6]:
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="xpu",  # Intel GPU; use "cuda" for an NVIDIA GPU
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True)

`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.
Loading checkpoint shards: 100%|██████████| 2/2 [00:25<00:00, 12.72s/it]


Although we can now use the model and tokenizer directly, it's much easier to wrap them in a pipeline object:

In [7]:
# Create a pipeline
generator = pipeline(
    "text-generation", model=model, tokenizer=tokenizer,
    return_full_text=False, max_new_tokens=500, do_sample=False)

Device set to use xpu
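
For comparison, here is a minimal sketch of the direct route that the pipeline hides. The helper name `generate_directly` is our own and not part of `transformers`; it assumes the `model` and `tokenizer` objects loaded earlier and a `transformers` version whose `apply_chat_template` accepts `return_dict=True`. It applies the chat template, generates greedily, and decodes only the newly generated tokens.

```python
def generate_directly(model, tokenizer, messages, max_new_tokens=500):
    """Generate a reply without the pipeline wrapper (illustrative sketch)."""
    # Render the chat messages with the tokenizer's chat template and tokenize them.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,  # end the prompt where the assistant's turn begins
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)

    # do_sample=False selects greedy decoding, as in the pipeline call.
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
    )

    # Drop the prompt tokens so that only the new text remains,
    # which mirrors return_full_text=False.
    prompt_length = inputs["input_ids"].shape[-1]
    new_tokens = output_ids[0][prompt_length:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_directly(model, tokenizer, [{"role": "user", "content": "Hello!"}])` should behave much like the pipeline, although the pipeline may apply additional defaults that this sketch does not reproduce.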


Finally, we write our prompt as a user message and pass it to the pipeline:

In [8]:
# The prompt (user input / query)
messages = [{"role": "user", "content": "Create a funny joke about chickens."}]

# Generate output
output = generator(messages)
print(output[0]["generated_text"])

The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
`get_max_cache()` is deprecated for all Cache classes. Use `get_max_cache_shape()` instead. Calling `get_max_cache()` will raise error from v4.48
You are not running the flash-attention implementation, expect numerical differences.


 Why did the chicken join the band? Because it had the drumsticks!
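
The model never sees the `messages` list itself: the pipeline first renders it into a single prompt string using the tokenizer's chat template. The sketch below imitates that step in plain Python. The function `render_chat` is hypothetical, and the `<|role|>` / `<|end|>` layout is our assumption about the Phi-3 prompt format; the authoritative rendering is whatever `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` returns, which may differ in details such as a leading start token.

```python
def render_chat(messages, add_generation_prompt=True):
    """Hypothetical stand-in for the tokenizer's chat template (assumed format)."""
    parts = []
    for message in messages:
        # Each turn is wrapped in a role tag and closed with an end marker.
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    if add_generation_prompt:
        # Open the assistant's turn so the model continues from here.
        parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [{"role": "user", "content": "Create a funny joke about chickens."}]
print(render_chat(messages))
```

Comparing this string with the output of `apply_chat_template` is a quick way to check what the model actually receives.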
