feat: adds phi model #1442
Conversation
You need to implement the flash one; the non-flash versions are not interesting anymore and are just kept as legacy.
```python
# NOTE: this is the main difference between Llama and Phi:
# in Llama the rotary embeddings are applied to the whole query and key.
# Phi uses PARTIAL rotary embeddings, which are applied to the first 32 dimensions
#
# Apply partial positional embeddings in place
self.rotary_emb(
    query[:, :, :self.num_heads], kv[:, 0, :, :self.num_heads],
    cos, sin
)
```
Highlighting the most important lines here. Note that only a portion of the query and key are rotary embedded.
Hey @OlivierDehaene, would you mind taking a look at this model (specifically flash phi)? The non-flash version is only for legacy purposes. Notes: outside of those changes there are a couple of differences in weight naming, and Phi uses bias. I think it makes sense to have Phi in a separate file rather than bloat the Llama implementation. Please let me know if I should make any changes! Thanks 🙏
```python
        use_medusa=use_medusa,
    )
else:
    raise NotImplementedError(FLASH_ATT_ERROR_MESSAGE.format("Phi"))
```
Could we also load it with the Phi class based on CausalLM? Or is the layout different?
I'm not too familiar with Phi.
Thanks for the feedback!

Yes, Phi can be loaded as a CausalLM with version 4.37.1 of transformers. In my last PR I bumped the version, removed the NotImplementedError, and tested that Phi works on my local CPU.
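For reference, loading Phi through the upstream CausalLM path looks roughly like this. This is a generic transformers sketch (assuming transformers >= 4.37, which ships the Phi port), not code from this PR; it downloads the model from the Hub, so the revision pin mentioned in the description may also apply.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: microsoft/phi-2 is the checkpoint under discussion.
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

inputs = tokenizer("What is Deep Learning?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```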
Great thanks!
This PR adds basic modeling for phi-2.

run
```bash
text-generation-server \
  serve \
  microsoft/phi-2 \
  --revision 834565c23f9b28b96ccbeabe614dd906b6db551a
```

test
```bash
curl -s localhost:3000/generate \
  -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
  -H 'Content-Type: application/json' | jq .
# {
#   "generated_text": "\nDeep learning is a subset of machine learning that uses artificial neural networks to learn from data. These"
# }
```

notes
- recently (~1 day ago) the Phi weights and model were updated to accommodate adding [GQA/MQA attention to the model](huggingface/transformers#28163). This implementation expects the original model format, so a fixed revision is required at the moment.
- this PR only includes a basic implementation of the model; it can later be extended to support Flash and Sharded versions, as well as to make use of better optimizations.