<a href="https://colab.research.google.com/github/ollorin/collabs/blob/main/MoLora_7b_(PROOF_OF_CONCEPT).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MoLora (Mixture of LoRAs) LLaMA-2-7b

These are four adapters trained on around 1M tokens of different clusters of WizardLM-EvolInstruct, as a proof of concept for a larger project. This is the first iteration of MoLora. You need to sign in to Hugging Face with the notebook login cell because of the not-actually-open nature of LLaMA-2.

In [None]:
%pip install -qq bitsandbytes accelerate datasets sentence-transformers
%pip install -qq git+https://github.com/huggingface/transformers # need to install from github
%pip install -qq git+https://github.com/huggingface/peft


  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [None]:
from huggingface_hub import notebook_login
notebook_login()


In [None]:
import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from huggingface_hub import hf_hub_download
import datetime

embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
text_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", load_in_4bit=True, torch_dtype=torch.float16, low_cpu_mem_usage=True
)
text_model = PeftModel.from_pretrained(text_model, "crumb/llama2-7b-moe-text-exp-control-4", adapter_name="default")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
centers = torch.load(hf_hub_download(repo_id="crumb/Wizard-EvolInstruct70k-k4", filename="centers.pt", repo_type="dataset"))
centers = centers.cuda()


First, see the result before the custom MoE LoRAs are applied. This uses the control LoRA, trained indiscriminately on a mixture of all clusters (so it's instruction-tuned, but not made to be task-specific).

In [None]:
prompt = "Describe how Transformers (Vaswani et. al.) work.\n"
inputs = {k:v.cuda() for k,v in tokenizer(prompt, return_tensors="pt").items()}
outputs = text_model.generate(
    **inputs,
    max_new_tokens = 256,
    temperature = 0.7,
    top_k = 20,
    top_p = 0.9,
    do_sample=True
)
print(tokenizer.decode(outputs[0]))



<s> Describe how Transformers (Vaswani et. al.) work.
How does a Transformer learn to recognize an image of a cat, for example?
How does the Transformer learn to recognize an image of a dog?
How does the Transformer learn to recognize an image of a person?
How does the Transformer learn to recognize an image of a car?
How does the Transformer learn to recognize an image of a house?
How does the Transformer learn to recognize an image of a street?
How does the Transformer learn to recognize an image of a person in a car?
How does the Transformer learn to recognize an image of a person in a house?
How does the Transformer learn to recognize an image of a person in a street?
How does the Transformer learn to recognize an image of a person in a car in a house in a street?
How does the Transformer learn to recognize an image of a person in a car in a house in a street in a person in a car in a house in a street?
How does the Transformer learn to recognize an image of a person in a car in a 

We'll now compute the cosine similarity between our prompt embedding and each cluster center in our dataset, and use the results as the weights by which to multiply each expert before summing them together. Since the similarity values are very close, as the clusters all come from the same dataset (among other things), we have to amplify the differences between them with a merge_temperature; otherwise the output of the softmax operation will be close to 0.25 across the board (an even mix of our four experts).
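The effect of the scaling factor can be sketched in plain Python. This is only an illustration: the similarity values below are made up, and `expert_weights` is a hypothetical helper, not part of the notebook or of any library. It mirrors the idea of a softmax over scaled similarities, rescaled so the weights sum to `merge_alpha // n`.

```python
import math

def expert_weights(similarities, merge_temperature=2.0, merge_alpha=16):
    """Softmax over scaled similarities, rescaled to sum to merge_alpha // n.

    Hypothetical helper for illustration only.
    """
    n = len(similarities)
    scaled = [s * merge_temperature for s in similarities]
    # Subtract the max before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    scale = merge_alpha // n
    return [scale * e / total for e in exps]

# Made-up cosine similarities between a prompt and four cluster centers.
sims = [0.10, 0.22, 0.14, 0.09]

low = expert_weights(sims, merge_temperature=1.0)
high = expert_weights(sims, merge_temperature=20.0)

print("multiplier 1: ", [round(w, 3) for w in low])
print("multiplier 20:", [round(w, 3) for w in high])
```

With a small multiplier the weights stay close to an even split; with a larger one, the expert whose cluster is most similar to the prompt takes a larger share, while the total stays the same.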

In [None]:
# load the four cluster-specific expert adapters alongside the control adapter
for i in range(4):
    text_model.load_adapter(f"crumb/llama2-7b-moe-text-exp{i}-4", adapter_name=f"expert-{i}")

In [None]:
prompt = "Describe how Transformers (Vaswani et. al.) work.\n"
prompt_embedding = torch.tensor(embedding_model.encode([prompt]), device='cuda')

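# the centroids are very close together, so we scale the cosine similarities by
# merge_temperature before the softmax to amplify the differences between them
# and get a more specialized model
# play around with merge_temperature and merge_alpha to see how each affects the model
merge_temperature = 2
merge_alpha = 16
# softmax gives mixing weights that sum to 1; scaling by merge_alpha // 4
# makes the four expert weights sum to 4 (an average weight of 1 per expert)
expert_alphas = torch.softmax(
    torch.nn.functional.cosine_similarity(prompt_embedding, centers) * merge_temperature, -1
).mul(merge_alpha // 4).tolist()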
print("Alpha for each expert:", expert_alphas)

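# give the merged adapter a fresh name each run so re-running the cell does not
# collide with an adapter created earlier
unique_name = str(hash(datetime.datetime.now()))
text_model.add_weighted_adapter([f"expert-{i}" for i in range(4)], expert_alphas, combination_type="linear", adapter_name=unique_name)
text_model.set_adapter(unique_name)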

Alpha for each expert: [0.9232075979227269, 1.1810616590960257, 0.9936784881024902, 0.902052254878757]
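
To read the printed alphas more easily, they can be normalized back into mixing shares. The sketch below only reuses the numbers from the output above; the variable names `shares` and `top_expert` are introduced here for illustration.

```python
# Alphas copied from the printed output above.
expert_alphas = [
    0.9232075979227269,
    1.1810616590960257,
    0.9936784881024902,
    0.902052254878757,
]

# The alphas were scaled by merge_alpha // 4, so dividing by their sum
# recovers the softmax mixing shares.
total = sum(expert_alphas)
shares = [a / total for a in expert_alphas]

# Index of the expert with the largest share for this prompt.
top_expert = max(range(len(shares)), key=shares.__getitem__)

print("sum of alphas:", round(total, 6))
print("mixing shares:", [round(s, 3) for s in shares])
print("most heavily weighted expert:", top_expert)
```

Even with merge_temperature = 2, every share stays within a few percentage points of an even 0.25 split, which is why the multiplier is worth experimenting with.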


In [None]:
inputs = {k:v.cuda() for k,v in tokenizer(prompt, return_tensors="pt").items()}
outputs = text_model.generate(
    **inputs,
    max_new_tokens = 256,
    temperature = 0.7,
    top_k = 20,
    top_p = 0.9,
    do_sample=True
)
print(tokenizer.decode(outputs[0]))

<s> Describe how Transformers (Vaswani et. al.) work.

- Transformers are a type of neural network architecture that uses attention mechanisms to process text inputs.
- The architecture consists of two main components: an encoder and a decoder. The encoder takes in a text input and generates a sequence of vectors representing the input. The decoder then takes these vectors and generates a sequence of tokens that represent the text.

- The attention mechanism is the key component of the Transformer architecture. It allows the network to focus on specific words or phrases in the input text and generate more accurate results.
- The encoder consists of several layers of neural networks that take in the input text and generate a sequence of vectors. The attention mechanism then allows the network to focus on specific words or phrases in the input text.

- The decoder then takes these vectors and generates a sequence of tokens that represent the text. The attention mechanism again allows the