I am working on the CodeLlama model, which uses a decoder-only Transformer, following the architecture below.
The main task is to replace the decoder-only blocks, which use masked self-attention and a KV cache, with my own encoder-only blocks that use the dilated attention from LongNet.
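For context, here is a minimal sketch of what LongNet-style dilated attention computes for a single (segment_length, dilation_rate) pair; the real mechanism mixes several such pairs and offsets the sparse pattern across heads, so all names and shapes here are illustrative, not from the code below:

import torch
import torch.nn.functional as F

def dilated_attention(q, k, v, segment_length, dilation_rate):
    # q, k, v: (batch, seq_len, dim); seq_len must be a multiple of segment_length
    b, n, d = q.shape
    # Split the sequence into non-overlapping segments.
    q = q.view(b, n // segment_length, segment_length, d)
    k = k.view(b, n // segment_length, segment_length, d)
    v = v.view(b, n // segment_length, segment_length, d)
    # Keep every `dilation_rate`-th position inside each segment.
    idx = torch.arange(0, segment_length, dilation_rate)
    q_s, k_s, v_s = q[:, :, idx], k[:, :, idx], v[:, :, idx]
    # Plain softmax attention within each sparsified segment.
    scores = q_s @ k_s.transpose(-1, -2) / d ** 0.5
    out_s = F.softmax(scores, dim=-1) @ v_s
    # Scatter the attended outputs back to their original positions.
    out = torch.zeros_like(q)
    out[:, :, idx] = out_s
    return out.view(b, n, d)

x = torch.randn(2, 64, 32)
y = dilated_attention(x, x, x, segment_length=16, dilation_rate=2)
print(y.shape)  # torch.Size([2, 64, 32])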
Information
The official example scripts
My own modified scripts
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
Reproduction
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch
from transformers.models.llama.configuration_llama import LlamaConfig
from transformers.models.llama.modeling_llama import LlamaAttention, LlamaDecoderLayer, LlamaModel, LlamaForCausalLM

model_id = "codellama/CodeLlama-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16
).to("cpu")
class CondensedLlamaConfig(LlamaConfig):
    def __init__(
        self,
        dilation_rates=None,
        segment_lengths=None,
        is_causal=None,
        **kwargs
    ):
        super().__init__(**kwargs)
        self.dilation_rates = dilation_rates
        self.segment_lengths = segment_lengths
        self.is_causal = is_causal

    # Override the `to_dict` method to include the new parameters
    def to_dict(self):
        base_dict = super().to_dict()
        config_dict = {
            "dilation_rates": self.dilation_rates,
            "segment_lengths": self.segment_lengths,
            "is_causal": self.is_causal
        }
        base_dict.update(config_dict)
        return base_dict
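As a quick sanity check (hypothetical values, assuming the class above is defined), the extra fields should survive a to_dict() round trip:

cfg = CondensedLlamaConfig(
    dilation_rates=[1, 2, 4],
    segment_lengths=[2048, 4096, 8192],
    is_causal=False,
)
d = cfg.to_dict()
assert d["dilation_rates"] == [1, 2, 4]
assert d["is_causal"] is False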
import torch.nn as nn
from transformers.models.llama.modeling_llama import LlamaForCausalLM, LlamaDecoderLayer
from transformers.modeling_utils import ModuleUtilsMixin
# NOTE: MultiheadDilatedAttention is not part of transformers; it comes from an
# external LongNet implementation (e.g. the dilated-attention-pytorch package).

class CondensedLlamaAttention(LlamaAttention):
    def __init__(self, config: CondensedLlamaConfig, layer_idx=None):
        super().__init__(config, layer_idx=layer_idx)
        self.LongNetAttention = MultiheadDilatedAttention(
            config.hidden_size,
            config.num_attention_heads,
            config.dilation_rates,
            config.segment_lengths
        )
        self.is_causal = config.is_causal

    def forward(self, input, is_causal=None):
        if is_causal is None:
            is_causal = self.is_causal
        x, _ = self.LongNetAttention(input, input, input, is_causal=is_causal)
        return x

class CondensedLlamaDecoderLayer(LlamaDecoderLayer):
    def __init__(self, config: CondensedLlamaConfig, layer_idx=None):  # Accept layer_idx as an argument
        super().__init__(config, layer_idx=layer_idx)  # Pass layer_idx through to the parent constructor
        # Replace self_attn with the new attention module
        self.self_attn = MultiheadDilatedAttention(
            config.hidden_size,
            config.num_attention_heads,
            config.dilation_rates,
            config.segment_lengths
        )
        self.is_causal = config.is_causal

    def forward(self, input, is_causal=None):
        if is_causal is None:
            is_causal = self.is_causal
        x, _ = self.self_attn(input, input, input, is_causal=is_causal)
        return x

class CondensedLlamaModel(LlamaModel):
    def __init__(self, config: CondensedLlamaConfig):
        super().__init__(config)
        self.layers = nn.ModuleList(
            [CondensedLlamaDecoderLayer(config, layer_idx=i) for i in range(config.num_hidden_layers)]
        )
        # Initialize weights and apply final processing
        self.post_init()
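A minimal smoke test for a single replaced layer might look like the sketch below. This assumes MultiheadDilatedAttention is importable from an external LongNet implementation with an nn.MultiheadAttention-style (query, key, value) interface returning a tuple; the exact constructor constraints (head counts, sequence divisibility) depend on that implementation, and the tiny config values are illustrative only:

tiny_config = CondensedLlamaConfig(
    hidden_size=64,
    intermediate_size=128,
    num_attention_heads=4,
    num_hidden_layers=2,
    dilation_rates=[1, 2],
    segment_lengths=[8, 16],
    is_causal=False,
)
layer = CondensedLlamaDecoderLayer(tiny_config, layer_idx=0)
x = torch.randn(1, 16, tiny_config.hidden_size)  # seq_len divisible by every segment length
out = layer(x)
print(out.shape)  # expected: torch.Size([1, 16, 64])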
model_2 = model.model

module_patterns_to_transfer = ["q_proj", "k_proj", "v_proj", "o_proj"]

def transfer_weights(original_model, custom_model, module_patterns_to_transfer):
    original_dict = original_model.state_dict()
    custom_dict = custom_model.state_dict()
    # Filter and transfer weights for the specified projection layers
    for key in custom_dict.keys():
        for pattern in module_patterns_to_transfer:
            if pattern in key:
                if key in original_dict:
                    # Transfer weights
                    with torch.no_grad():
                        custom_dict[key].copy_(original_dict[key])
    # Load the updated state dictionary back into the model
    custom_model.load_state_dict(custom_dict)
# NOTE: in LongNet the large values (2048, 4096, ...) are segment lengths and the
# small values (1, 2, 4, ...) are dilation rates, so these two lists may be swapped.
config = CondensedLlamaConfig(
    dilation_rates=[2048, 4096, 8192, 16384, 32768],
    segment_lengths=[1, 2, 4, 6, 12],
    is_causal=False
)
config.num_hidden_layers = 2
model_1 = CondensedLlamaModel(config)

# Transfer weights from the original model to the custom model
transfer_weights(model_2, model_1, module_patterns_to_transfer)

# Inspect the transferred weights in the custom model
for key, parameter in model_1.state_dict().items():
    print(key)
    print(parameter.size())
    print(parameter)
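To confirm the transfer actually happened, a hypothetical follow-up check is to compare each matching projection tensor across the two models (casting to float since the original model is loaded in float16):

orig_dict = model_2.state_dict()
for key, tensor in model_1.state_dict().items():
    if any(p in key for p in module_patterns_to_transfer) and key in orig_dict:
        same = torch.allclose(tensor.float(), orig_dict[key].float())
        print(f"{key}: transferred={same}")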