Chinkara is a Large Language Model (LLM) that aims to be accurate and coherent while running on consumer hardware. The model is part of the MaralGPT project, where we try to make LLMs more affordable for users, enthusiasts and researchers.
Chinkara 7B is a Large Language Model based on Meta's LLaMa-2 with 7 billion parameters. It was fine-tuned on the timdettmers/openassistant-guanaco dataset using the QLoRA technique and is optimized for small consumer-grade GPUs.
Model | Notebook | Description |
---|---|---|
chinkara-7b | | This is the smallest model of the family, trained on LLaMa-2 7B. |
chinkara-7b-improved | | This is the same as the previous model, with minor changes. See the changelog to understand the difference. |
2023-07-28
: chinkara-7b-improved was uploaded to Hugging Face today. This model is still trained on the Guanaco dataset, but it produces better and more coherent results.
- Safety is now taken into account in this model: it won't answer questions about illegal matters (for example, you can't ask it for a forbidden recipe or something like that).
NOTE: This part applies when you want to load the model and run inference on your local machine. You still need a GPU with 8GB of VRAM. The recommended GPU is at least an RTX 2080!
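As a rough sanity check on the 8GB figure, the weight memory of a 7-billion-parameter model can be estimated from the bits used per parameter. This is only a back-of-the-envelope sketch: it counts the base weights alone and ignores activations, the KV cache, the LoRA adapter weights, quantization constants and CUDA overhead, which is why headroom above the estimate is needed. The function name is illustrative and not part of any library.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory taken by the model weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params = 7e9  # LLaMa-2 7B, rounded (assumption: exact count differs slightly)

fp16_gb = estimate_weight_memory_gb(params, 16)  # 16-bit (bf16/fp16) weights
nf4_gb = estimate_weight_memory_gb(params, 4)    # 4-bit quantized weights

print(f"16-bit weights: ~{fp16_gb:.1f} GB")  # ~14.0 GB
print(f"4-bit weights:  ~{nf4_gb:.1f} GB")   # ~3.5 GB
```

Under these assumptions, 16-bit weights would not fit on an 8GB card, while 4-bit weights leave a few gigabytes free for everything else.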
pip install -U bitsandbytes
pip install -U git+https://github.com/huggingface/transformers.git
pip install -U git+https://github.com/huggingface/peft.git
pip install -U git+https://github.com/huggingface/accelerate.git
pip install -U datasets
pip install -U einops
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
model_name = "Trelis/Llama-2-7b-chat-hf-sharded-bf16"
adapters_name = 'MaralGPT/chinkara-7b-improved'
# Load the base model in 4-bit NF4 precision. The quantization settings live in
# BitsAndBytesConfig, so load_in_4bit is not repeated as a separate argument
# (some transformers versions raise an error when both are given).
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={i: '24000MB' for i in range(torch.cuda.device_count())},
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type='nf4',
    ),
)
# Attach the Chinkara LoRA adapters to the base model (only once).
model = PeftModel.from_pretrained(model, adapters_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = "What is the answer to life, universe and everything?"
prompt = f"###Human: {prompt} ###Assistant:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
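# do_sample=True is required for temperature to take effect; without it, decoding is greedy.
outputs = model.generate(inputs=inputs.input_ids, max_new_tokens=50, do_sample=True, temperature=0.5, repetition_penalty=1.0)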
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(answer)
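The decoded text contains the prompt as well as the generated continuation, and the model may keep going into a new `###Human:` turn. A minimal sketch of trimming the output down to the first reply is shown below. The helper name `extract_answer` is hypothetical, and the marker strings simply mirror the prompt format used above.

```python
def extract_answer(decoded: str,
                   human_marker: str = "###Human:",
                   assistant_marker: str = "###Assistant:") -> str:
    """Return only the first assistant reply from a decoded generation."""
    # Keep what follows the first assistant marker, dropping the echoed prompt.
    _, found, reply = decoded.partition(assistant_marker)
    if not found:
        # No marker present: return the text unchanged apart from whitespace.
        return decoded.strip()
    # Cut the reply where the model starts a new human turn, if it does.
    return reply.partition(human_marker)[0].strip()


decoded = "###Human: What is 2 + 2? ###Assistant: 4. ###Human: And 3 + 3?"
print(extract_answer(decoded))  # prints "4."
```

With the snippet above, this would be called as `print(extract_answer(answer))`.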
- The Guanaco dataset, especially this version with the raw data from the Open Assistant project, includes a large amount of data in different languages, which resulted in some incoherence in the results generated by Chinkara.
- A possible solution might be using "single language" datasets. For example, dolly from Databricks might be a good choice, since it's only in English.
- Gathering data and creating a dataset of our own.
- Training on different datasets and checking the outcome.
  - Already existing datasets such as Alpaca or Dolly.
  - Creating an Open QA dataset.
- Training the model for specific tasks.
  - FAQ Bot
  - Medical Question Answering Bot
  - Personal/Home Assistant
  - Shopping Assistant
- Providing different models in terms of parameters, languages, etc.
  - Providing a 13B model
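Regarding the multilingual data mentioned earlier, an alternative to switching datasets would be to filter the existing one. The sketch below is a crude stand-in for real language identification and is not something the project is known to use: it only checks which share of the letters in a sample are ASCII, so it screens out non-Latin scripts but still lets through languages such as Spanish or French. The function names and the threshold are assumptions chosen for illustration.

```python
def ascii_letter_ratio(text: str) -> float:
    """Share of alphabetic characters in `text` that are ASCII letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if ch.isascii()) / len(letters)


def is_mostly_ascii_letters(text: str, threshold: float = 0.95) -> bool:
    """Heuristic filter: True when nearly all letters are ASCII."""
    return ascii_letter_ratio(text) >= threshold


samples = [
    "### Human: How do I sort a list? ### Assistant: Use the sorted function.",
    "### Human: Привет, как дела? ### Assistant: Хорошо, спасибо.",
]
kept = [s for s in samples if is_mostly_ascii_letters(s)]
print(len(kept))  # prints 1: only the first sample passes
```

Assuming the samples sit in a `text` column, the same predicate could be passed to the `filter` method of a Hugging Face `datasets` dataset. A dedicated language identification library would be needed to keep English only.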