# Analysis of the Phi-3.5-mini model



In [1]:
import transformers
import torch

model_path = "microsoft/Phi-3.5-mini-instruct"

phi_model = transformers.AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)

# print(phi_model)  # uncomment to inspect the module structure

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [2]:
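# Phi model wrapper: returns only the logits tensor so that tracing
# sees a plain tensor output instead of the Hugging Face output object.

class PhiModel(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask):
        return self.model(input_ids=input_ids, attention_mask=attention_mask).logits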


| past_key_values: torch.Size([1, 32, 4, 96])
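
The shape printed above can be turned into a rough key/value cache size. The sketch below reads the shape as (batch, heads, sequence length, head dimension), which is the usual Hugging Face layout. The layer count and the fp16 storage type are assumptions that this notebook does not confirm, so treat the result as an estimate.

```python
# Shape from the output above, read as (batch, heads, seq_len, head_dim).
batch, num_heads, seq_len, head_dim = 1, 32, 4, 96
NUM_LAYERS = 32       # assumption: layer count from the published model config
BYTES_PER_VALUE = 2   # assumption: cache values stored as fp16


def kv_cache_bytes(tokens: int) -> int:
    """Bytes needed for keys and values across all layers for `tokens` tokens."""
    per_token_per_layer = 2 * num_heads * head_dim * BYTES_PER_VALUE  # K and V
    return batch * NUM_LAYERS * per_token_per_layer * tokens


# Cache size for the 4 tokens in the shape above, in bytes.
print(kv_cache_bytes(seq_len))
# Cache size in MiB for a 2048-token sequence.
print(kv_cache_bytes(2048) / 2**20)
```

Under these assumptions the cache grows linearly with sequence length, so it is a small cost next to the weights at short context lengths.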

In [3]:
# Dummy inputs used only for tracing: batch size 1, sequence length 2.
input_ids = torch.zeros((1, 2), dtype=torch.int32)
attention_mask = torch.ones((1, 2), dtype=torch.float32)
model = PhiModel(phi_model)

In [4]:
traced_model = torch.jit.trace(model.eval(), (input_ids, attention_mask))

We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
  if sequence_length != 1:
You are not running the flash-attention implementation, expect numerical differences.
  if seq_len > self.original_max_position_embeddings:
  ext_factors = torch.tensor(self.short_factor, dtype=torch.float32, device=x.device)
  if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
  if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):


## Convert the model to Core ML


In [5]:
import coremltools as ct
import numpy as np

query_length = ct.RangeDim(lower_bound=1, upper_bound=2048, default=1)

inputs = [
    ct.TensorType(name="inputIds", shape=(1, query_length), dtype=np.int32),
    ct.TensorType(name="attentionMask", shape=(1, query_length), dtype=np.int32),
]

outputs = [
    ct.TensorType(name="logits", dtype=np.float16),
]

Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'


In [6]:
# Note: despite the variable name, this conversion uses FLOAT32 compute precision.
fp16_mlmodel = ct.convert(
    traced_model.eval(),
    inputs=inputs,
    outputs=outputs,
    source="pytorch",
    minimum_deployment_target=ct.target.iOS18,
    compute_precision=ct.precision.FLOAT32,
    compute_units=ct.ComputeUnit.ALL,
)

Converting PyTorch Frontend ==> MIL Ops:   0%|          | 0/4529 [00:00<?, ? ops/s]
Core ML embedding (gather) layer does not support any inputs besides the weights and indices. Those given will be ignored.
Saving value type of int64 into a builtin type of int32, might lose precision!
Converting PyTorch Frontend ==> MIL Ops:   5%|▍         | 222/4529 [00:00<00:01, 2199.98 ops/s]
Saving value type of int64 into a builtin type of int32, might lose precision!
Saving value type of int64 into a builtin type of int32, might lose precision!
Converting PyTorch Frontend ==> MIL Ops:  11%|█         | 502/4529 [00:00<00:02, 1477.22 ops/s]
Saving value type of int64 into a builtin type of int32, might lose precision!
Saving value type of int64 into a builtin type of int32, might lose precision!
Converting PyTorch Frontend ==> MIL Ops:  17%|█▋        | 780/4529 [00:00<00:01, 1899.64 ops/s]
Saving value type of int64 into a builtin type of int32, might lose precision!
Saving value type of int64 into a b

In [7]:
fp16_mlmodel.save("phi-3.5-mini-instruct-fp32.mlpackage")

In [8]:
# 4-bit symmetric weight quantization with one scale per block of 32 weights.
op_config = ct.optimize.coreml.OpLinearQuantizerConfig(
    mode="linear_symmetric",
    dtype="int4",
    granularity="per_block",
    block_size=32,
)

config = ct.optimize.coreml.OptimizationConfig(global_config=op_config)
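
To illustrate what a symmetric, per-block int4 scheme does to the weights, here is a minimal pure-Python sketch. It is not coremltools' implementation. The function names are hypothetical, and the signed range of -8 to 7 and the rounding rule are assumptions chosen for illustration. Each block of 32 values shares one scale derived from its largest absolute value.

```python
def quantize_block_symmetric(block, n_bits=4):
    """Quantize one block of floats to signed integers sharing a single scale."""
    q_max = 2 ** (n_bits - 1) - 1   # 7 for int4
    q_min = -(2 ** (n_bits - 1))    # -8 for int4
    max_abs = max(abs(v) for v in block)
    scale = max_abs / q_max if max_abs > 0 else 1.0
    codes = [min(q_max, max(q_min, round(v / scale))) for v in block]
    return codes, scale


def quantize_per_block(weights, block_size=32):
    """Split a flat weight list into blocks and quantize each independently."""
    return [
        quantize_block_symmetric(weights[i:i + block_size])
        for i in range(0, len(weights), block_size)
    ]


def dequantize(blocks):
    """Reconstruct approximate floats from (codes, scale) pairs."""
    return [code * scale for codes, scale in blocks for code in codes]


# Toy data standing in for one row of a weight matrix.
weights = [((i * 37) % 101 - 50) / 50.0 for i in range(64)]
blocks = quantize_per_block(weights, block_size=32)
restored = dequantize(blocks)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{len(blocks)} blocks, max abs error {max_err:.4f}")
```

Because the scale is computed per block, an outlier only degrades the precision of the 32 values that share its block, which is the usual motivation for choosing `per_block` over a coarser granularity.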

In [9]:
mlmodel_int4 = ct.optimize.coreml.linear_quantize_weights(fp16_mlmodel, config=config)

Running compression pass linear_quantize_weights: 100%|██████████| 200/200 [00:23<00:00,  8.36 ops/s]
Running MIL frontend_milinternal pipeline: 0 passes [00:00, ? passes/s]
Running MIL default pipeline: 100%|██████████| 84/84 [00:08<00:00, 10.08 passes/s]
Running MIL backend_mlprogram pipeline: 100%|██████████| 12/12 [00:00<00:00, 28.15 passes/s]


In [10]:
mlmodel_int4.save("phi-3.5-mini-instruct-int4.mlpackage")

In [11]:
!du -hs ./phi-3.5-mini-instruct-int4.mlpackage/

2.3G	./phi-3.5-mini-instruct-int4.mlpackage/


In [12]:
!du -hs ./phi-3.5-mini-instruct-fp32.mlpackage

15G	./phi-3.5-mini-instruct-fp32.mlpackage
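
The two package sizes can be sanity-checked with quick arithmetic. The sketch below assumes a parameter count of about 3.8 billion (the published model size, not measured in this notebook) and one fp16 scale per block of 32 weights. Both are assumptions, so the figures are estimates only.

```python
# Back-of-the-envelope size estimate.
PARAM_COUNT = 3.8e9   # assumption: published parameter count, not measured here
BLOCK_SIZE = 32       # matches block_size in the quantizer config
SCALE_BYTES = 2       # assumption: one fp16 scale stored per block


def size_gb(bytes_per_param: float, params: float = PARAM_COUNT) -> float:
    """Return the approximate weight size in GB (10**9 bytes)."""
    return params * bytes_per_param / 1e9


fp32_gb = size_gb(4.0)
int4_gb = size_gb(0.5 + SCALE_BYTES / BLOCK_SIZE)

print(f"fp32 estimate: {fp32_gb:.1f} GB")
print(f"int4 estimate: {int4_gb:.1f} GB")
print(f"compression ratio: {fp32_gb / int4_gb:.1f}x")
```

The fp32 estimate is in line with the 15G reported by `du`. The int4 estimate comes out somewhat below the reported 2.3G. The difference presumably comes from weights that the pass leaves unquantized and from package overhead, though this notebook does not verify that.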
