## For testing LLMs
Use the Python 3.12 environment locally.

Test the inference speed of three Phi-3 models: Phi-3.5-mini-instruct, Phi-3-mini-128k-instruct, and Phi-3-mini-4k-instruct.

Note 1: Because a loaded model holds on to GPU memory, it's best to restart the Python kernel between tests. The first two code cells therefore need to be run again each time.
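
When restarting the kernel is inconvenient, a best-effort alternative is to drop the references to the model and clear PyTorch's caching allocator. The following is only a sketch: `release_gpu_memory` is a hypothetical helper that is not part of the original notebook, and it may not recover all memory, so a kernel restart remains the more reliable option.

```python
import gc


def release_gpu_memory():
    """Best-effort cleanup. Returns True if the CUDA cache was cleared.

    Hypothetical helper: call it only after deleting your own references,
    e.g. `del model, tokenizer`, otherwise the weights stay alive.
    """
    # Collect unreachable Python objects that may still hold tensors.
    gc.collect()
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so there is no CUDA cache to clear.
        return False
    if not torch.cuda.is_available():
        return False
    # Return cached, unused blocks from PyTorch's allocator to the driver.
    torch.cuda.empty_cache()
    return True
```

In the notebook this would be called as `del model, tokenizer` followed by `release_gpu_memory()` before loading the next model.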

Note 2: Tests were run on an RTX 4070 Ti.

A function that loads the model and tokenizer. The model_name must be the model ID used on Hugging Face (e.g. "microsoft/Phi-3-mini-4k-instruct").

In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0)

def load_model_and_tokenizer(model_name):
    model = AutoModelForCausalLM.from_pretrained(
        model_name, 
        device_map="cuda", 
        torch_dtype="auto", 
        trust_remote_code=True, 
    )
    
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer

A common function for testing inference speed.

In [2]:
import time

def test_inference_speed(model, tokenizer):
    messages = [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
        {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
        {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
    ]

    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
    )

    generation_args = {
        "max_new_tokens": 500,
        "return_full_text": False,
        "temperature": 0.0,
        "do_sample": False,
    }

    # perf_counter() is a monotonic, high-resolution clock intended for timing.
    start = time.perf_counter()
    output = pipe(messages, **generation_args)
    end = time.perf_counter()
    time_delta = end - start
    print(output[0]['generated_text'])
    print(f'Inference time: {time_delta:.2f}s')
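
A single timed run includes one-off warm-up costs, and the models do not all generate the same number of tokens. One possible refinement, sketched here under the assumption that repeated runs are affordable, is to discard a warm-up call, take the median of several runs, and normalize by the number of generated tokens. The names `time_call` and `tokens_per_second` are hypothetical helpers, not part of the original notebook.

```python
import statistics
import time


def time_call(fn, repeats=3, warmup=1):
    """Call fn() `warmup` times untimed, then `repeats` times timed.

    Returns (median_seconds, list_of_all_timed_durations).
    """
    for _ in range(warmup):
        fn()
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations), durations


def tokens_per_second(num_tokens, seconds):
    """Throughput in generated tokens per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_tokens / seconds


# Stand-in workload so the sketch runs without a GPU or a model.
median_s, runs = time_call(lambda: sum(range(100_000)), repeats=3, warmup=1)
print(f"median: {median_s:.6f}s over {len(runs)} runs")
```

In the notebook, the stand-in lambda would be replaced by something like `lambda: pipe(messages, **generation_args)`, and the token count could be approximated with `len(tokenizer.encode(text, add_special_tokens=False))` on the generated text.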

Test Phi-3.5-mini-instruct. The inference time was 13s in one test.

In [3]:
model, tokenizer = load_model_and_tokenizer("microsoft/Phi-3.5-mini-instruct")

`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [4]:
test_inference_speed(model, tokenizer)

Device set to use cuda
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
`get_max_cache()` is deprecated for all Cache classes. Use `get_max_cache_shape()` instead. Calling `get_max_cache()` will raise error from v4.48
You are not running the flash-attention implementation, expect numerical differences.


 To solve the equation 2x + 3 = 7, follow these steps:

Step 1: Isolate the term with the variable (2x) by subtracting 3 from both sides of the equation.
2x + 3 - 3 = 7 - 3
2x = 4

Step 2: Solve for x by dividing both sides of the equation by the coefficient of x, which is 2.
2x / 2 = 4 / 2
x = 2

So, the solution to the equation 2x + 3 = 7 is x = 2.
Inference time: 13.34s


Test Phi-3-mini-128k-instruct. The inference time was 10s in one test.

In [3]:
model, tokenizer = load_model_and_tokenizer("microsoft/Phi-3-mini-128k-instruct")

`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [4]:
test_inference_speed(model, tokenizer)

Device set to use cuda
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
`get_max_cache()` is deprecated for all Cache classes. Use `get_max_cache_shape()` instead. Calling `get_max_cache()` will raise error from v4.48
You are not running the flash-attention implementation, expect numerical differences.


 To solve the equation 2x + 3 = 7, follow these steps:

1. Subtract 3 from both sides of the equation:
   2x + 3 - 3 = 7 - 3
   2x = 4

2. Divide both sides of the equation by 2:
   2x/2 = 4/2
   x = 2

So, the solution to the equation 2x + 3 = 7 is x = 2.
Inference time: 9.65s


Test Phi-3-mini-4k-instruct. The inference time was 7s in one test.

In [3]:
model, tokenizer = load_model_and_tokenizer("microsoft/Phi-3-mini-4k-instruct")

`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [4]:
test_inference_speed(model, tokenizer)

Device set to use cuda
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
`get_max_cache()` is deprecated for all Cache classes. Use `get_max_cache_shape()` instead. Calling `get_max_cache()` will raise error from v4.48
You are not running the flash-attention implementation, expect numerical differences.


 To solve the equation 2x + 3 = 7, follow these steps:

1. Subtract 3 from both sides of the equation:
   2x + 3 - 3 = 7 - 3
   2x = 4

2. Divide both sides of the equation by 2:
   2x/2 = 4/2
   x = 2

So, the solution to the equation 2x + 3 = 7 is x = 2.
Inference time: 6.65s


### Conclusion

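Phi-3-mini-4k-instruct was the fastest in these single runs: at 6.65s it took about half the time of Phi-3.5-mini-instruct (13.34s), making it roughly twice as fast. The Phi-3.5 response was also somewhat longer, so the timings are not a strict per-token comparison. However, the 4k model has the downside of a much smaller context window. If we want to split the difference, we can use Phi-3-mini-128k-instruct (9.65s). We should test each of them in the final implementation and make a determination there.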
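
The ratios between the three single-run timings reported above can be checked with a few lines of Python. The numbers are copied from the cell outputs; `relative_slowdown` is a hypothetical helper introduced only for this illustration.

```python
# Single-run inference times in seconds, taken from the cell outputs above.
timings_s = {
    "Phi-3.5-mini-instruct": 13.34,
    "Phi-3-mini-128k-instruct": 9.65,
    "Phi-3-mini-4k-instruct": 6.65,
}


def relative_slowdown(timings):
    """Return each model's time as a multiple of the fastest time."""
    fastest = min(timings.values())
    return {name: t / fastest for name, t in timings.items()}


slowdown = relative_slowdown(timings_s)
for name, factor in sorted(slowdown.items(), key=lambda kv: kv[1]):
    print(f"{name}: {factor:.2f}x the fastest time")
```

With one run per model, these ratios are indicative only; repeated runs would be needed to judge how stable they are.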