# Fine-Tuning with Llama 2, Bits and Bytes, and QLoRA

Today we'll explore fine-tuning the Llama 2 model available on Kaggle Models using QLoRA, Bits and Bytes, and PEFT.

- QLoRA: [Quantized Low Rank Adapters](https://arxiv.org/pdf/2305.14314.pdf) - this is a method for fine-tuning LLMs that keeps the base model frozen in quantized (4-bit) form and trains only a small number of low-rank adapter parameters on top of it, which limits the cost of training. Because the adapters are small and separate from the base weights, you can fine-tune on lots of data sets, potentially, and swap these "adapters" into your model when necessary.
- [Bits and Bytes](https://github.com/TimDettmers/bitsandbytes): An excellent package by Tim Dettmers et al., which provides a lightweight wrapper around custom CUDA functions that make LLMs go faster - optimizers, matrix multiplications, and quantization. In this notebook we'll be using the library to load our model as efficiently as possible.
- [PEFT](https://github.com/huggingface/peft): An excellent Hugging Face library that enables a number of Parameter-Efficient Fine-Tuning (PEFT) methods, which again make it less expensive to fine-tune LLMs - especially on more lightweight hardware like that present in Kaggle notebooks.
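To make the adapter idea concrete, here is a minimal pure-Python sketch of a LoRA-style forward pass. It is an illustration under stated assumptions, not how PEFT implements it: the matrices `W`, `A`, `B_zero`, and `B_trained` hold made-up values, and the helpers `matvec` and `lora_forward` are hypothetical names. The frozen weight `W` is never modified; the adapter contributes a scaled low-rank term on top of it.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Frozen base output plus the scaled low-rank update: W x + (alpha / r) * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Toy 2x3 frozen weight with a rank-1 adapter (all values are made up for illustration).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
A = [[0.5, 0.5, 0.0]]        # shape: r x in_features
B_zero = [[0.0], [0.0]]      # shape: out_features x r; a zero B makes the adapter a no-op
B_trained = [[0.1], [-0.2]]  # pretend values after some training
x = [1.0, 2.0, 3.0]

print(lora_forward(W, A, B_zero, x, alpha=32, r=1))     # identical to the base model's output
print(lora_forward(W, A, B_trained, x, alpha=32, r=1))  # base output shifted by the adapter
```

Swapping adapters then amounts to swapping the small `A` and `B` matrices while `W` stays untouched.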

Many thanks to [Bojan Tunguz](https://www.kaggle.com/tunguz) for his excellent [Jeopardy dataset](https://www.kaggle.com/datasets/tunguz/200000-jeopardy-questions)!

This notebook is based on [an excellent example from LangChain](https://github.com/asokraju/LangChainDatasetForge/blob/main/Finetuning_Falcon_7b.ipynb).

## Package Installation

Dependencies in this space can be quite difficult to untangle, and simply taking the latest version of each library can lead to conflicting version requirements. Note that the cell below installs unpinned releases (and the development branches of `transformers`, `peft`, and `accelerate`), so it's a good idea to take note of which versions work for your particular use case, and `pip install` those versions directly.

In [2]:
!pip install -qqq bitsandbytes
!pip install -qqq torch
!pip install -qqq -U git+https://github.com/huggingface/transformers.git
!pip install -qqq -U git+https://github.com/huggingface/peft.git
!pip install -qqq -U git+https://github.com/huggingface/accelerate.git
!pip install -qqq datasets
!pip install -qqq loralib
!pip install -qqq einops
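
One way to take note of the versions that ended up installed is to query the package metadata with the standard library. This is a small sketch; the helper name `installed_versions` and the package list passed to it are assumptions chosen to mirror the install cell above.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is not installed."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Package names assumed from the pip commands above.
wanted = ["bitsandbytes", "torch", "transformers", "peft", "accelerate", "datasets"]
for name, installed in installed_versions(wanted).items():
    # Print in a form that can be pasted into a pinned `pip install` line.
    print(f"{name}=={installed}" if installed else f"{name} (not installed)")
```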

In [2]:
# !pip install -q auto-gptq==0.5.0

In [1]:
import pandas as pd
import json
import os
from pprint import pprint
import bitsandbytes as bnb
import torch
import torch.nn as nn
import transformers
from datasets import load_dataset, Dataset
from huggingface_hub import notebook_login

from peft import LoraConfig, PeftConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"



## Loading and preparing our model

We're going to use the Llama 2 7B model for our test. We'll be using Bits and Bytes to load it in 4-bit format, which should reduce memory consumption considerably, at a cost of some accuracy.

Note the parameters in `BitsAndBytesConfig` - this is a fairly standard 4-bit quantization configuration: the weights are loaded in 4-bit `NormalFloat4` (`nf4`) format, and double quantization also quantizes the quantization constants themselves to save a little more memory. The stored weights are dequantized to `bfloat16` whenever they are needed for computation, then the extra precision is discarded; the quantized base weights themselves stay frozen and are not updated.

In [23]:
model = "/kaggle/input/llama-2/pytorch/7b-chat-hf/1"
MODEL_NAME = model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

CUDA extension not installed.
CUDA extension not installed.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
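
To give a feel for what 4-bit loading buys and costs, here is a rough pure-Python sketch of blockwise absmax quantization. It makes simplifying assumptions: it uses 16 evenly spaced levels, whereas the real `nf4` format uses levels derived from a normal distribution, and the helper names (`weight_memory_gb`, `make_levels`, `quantize_block`, `dequantize_block`) are hypothetical and not part of the bitsandbytes API.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory needed for the weights alone, in gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

def make_levels(bits=4):
    """Evenly spaced levels in [-1, 1]; a stand-in for the real nf4 level table."""
    n = 2 ** bits
    return [-1.0 + 2.0 * i / (n - 1) for i in range(n)]

def quantize_block(block, levels):
    """Scale a block by its absolute maximum and store the index of the nearest level."""
    absmax = max(abs(v) for v in block) or 1.0
    codes = [min(range(len(levels)), key=lambda i: abs(levels[i] - v / absmax)) for v in block]
    return codes, absmax

def dequantize_block(codes, absmax, levels):
    """Recover approximate values from the stored codes and the block's scale."""
    return [levels[c] * absmax for c in codes]

levels = make_levels(4)
block = [0.5, -1.2, 0.03, 0.9]  # made-up weights
codes, absmax = quantize_block(block, levels)
restored = dequantize_block(codes, absmax, levels)

print(codes)     # one 4-bit code per weight
print(restored)  # close to the original values, but not identical
print(weight_memory_gb(7e9, 16), "GB at 16 bits vs", weight_memory_gb(7e9, 4), "GB at 4 bits")
```

The round trip loses some precision on every weight, which is the accuracy cost mentioned above; the saving is roughly a factor of four in weight memory compared with 16-bit storage.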


In [8]:
model = "google/gemma-2b-it"
MODEL_NAME = model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config,
    token=os.environ.get("HF_TOKEN")  # read the access token from the environment; never hard-code it
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, token=os.environ.get("HF_TOKEN"))
tokenizer.pad_token = tokenizer.eos_token

`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.28s/it]


In [2]:
MODEL_NAME = "TheBloke/Llama-2-7b-Chat-GPTQ"

model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        torch_dtype=torch.float16,
        trust_remote_code=True,
        device_map="auto",
    )

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token


Below, we'll use a nice PEFT helper to set up our quantized model for training / fine-tuning. In recent PEFT releases this function freezes the base model's weights, casts the remaining non-quantized parameters (such as the layer norms) to `float32` for stability, and enables gradient checkpointing along with gradients on the input embeddings, so that the adapter layers added later can be trained.

In [9]:
model = prepare_model_for_kbit_training(model)

Below, we print the model to inspect its layers, then define some helper functions - their purpose is to properly identify our update layers so we can... update them!

In [10]:
model

GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x GemmaDecoderLayer(
        (self_attn): GemmaSdpaAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (down_proj): Linear4bit(in_features=16384, out_features=2048, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): GemmaRMSNorm()
        (post_attention_layernorm): GemmaRMSNorm()
    

In [11]:
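import re

def get_num_layers(model):
    """Return the largest integer found in any parameter name (the last layer's index)."""
    numbers = set()
    for name, _ in model.named_parameters():
        for number in re.findall(r'\d+', name):
            numbers.add(int(number))
    print(max(numbers))
    return max(numbers)

def get_last_layer_linears(model):
    """Collect the names of all Linear modules that belong to the last decoder layer."""
    names = []
    num_layers = get_num_layers(model)
    for name, module in model.named_modules():
        # Match the index as a whole path segment so that e.g. "7" does not also match "17"
        if str(num_layers) in name.split(".") and "encoder" not in name:
            if isinstance(module, torch.nn.Linear):
                names.append(name)
    print(names)
    return names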
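
The selection logic can be tried on plain strings, without loading a model. This is a small sketch with made-up module names and a hypothetical helper `last_layer_names`; it compares the layer index against whole dotted path segments, because a plain substring test would let `"7"` also match `layers.17`.

```python
def last_layer_names(module_names):
    """Return the names whose dotted path contains the highest layer index as a whole segment."""
    indices = [int(part) for name in module_names for part in name.split(".") if part.isdigit()]
    last = str(max(indices))
    return [name for name in module_names if last in name.split(".")]

# Made-up module names in the style of the model printout above.
names = [
    "model.layers.7.self_attn.q_proj",
    "model.layers.17.self_attn.q_proj",
    "model.layers.17.mlp.down_proj",
    "model.norm",
]
print(last_layer_names(names))
```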

## LoRA config

Some key elements from this configuration:
1. `r` is the width (rank) of the small update layer. In theory, this should be set wide enough to capture the complexity of the problem you're attempting to fine-tune for. Simpler problems may be able to get away with a smaller `r`. In our case, we'll go very small, largely for the sake of speed.
2. `target_modules` is set using our helper functions - every linear module they identify (here, the attention and MLP projections of the last decoder layer) will be included in the PEFT update.

In [12]:
config = LoraConfig(
    r=2,
    lora_alpha=32,
    target_modules=get_last_layer_linears(model),
    # target_modules = ["k_proj","o_proj","q_proj","v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)

17
['model.layers.17.self_attn.q_proj', 'model.layers.17.self_attn.k_proj', 'model.layers.17.self_attn.v_proj', 'model.layers.17.self_attn.o_proj', 'model.layers.17.mlp.gate_proj', 'model.layers.17.mlp.up_proj', 'model.layers.17.mlp.down_proj']
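
To see how small this update is, the sketch below counts the adapter parameters for the seven targeted modules. The shapes are copied from the Gemma model printout earlier in the notebook; the formula `r * (in_features + out_features)` assumes the usual LoRA layout of one `r x in_features` matrix and one `out_features x r` matrix per module, and the names `LAST_LAYER_SHAPES` and `lora_param_count` are hypothetical.

```python
# (in_features, out_features) for each targeted module in the last decoder layer
LAST_LAYER_SHAPES = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 256),
    "v_proj": (2048, 256),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 16384),
    "up_proj": (2048, 16384),
    "down_proj": (16384, 2048),
}

def lora_param_count(in_features, out_features, r):
    """Parameters in one adapter: A is r x in_features, B is out_features x r."""
    return r * (in_features + out_features)

r, lora_alpha = 2, 32
adapter_params = sum(lora_param_count(i, o, r) for i, o in LAST_LAYER_SHAPES.values())
frozen_params = sum(i * o for i, o in LAST_LAYER_SHAPES.values())

print(f"trainable adapter parameters: {adapter_params:,}")
print(f"frozen weights in the same modules: {frozen_params:,}")
print(f"adapter scaling factor (lora_alpha / r): {lora_alpha / r}")
```

With `r=2` the adapters hold roughly a thousandth of the weights in the modules they wrap, which is why this configuration trains quickly.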


## Load some data

Here, we're loading `readme_qa.csv`, a dataset of README question-answer pairs: each row holds a `Question`, a `Context`, the expected `Answer`, and the name and URL of the source repository. We'll fine-tune our model using these rows. Note that what we're training the model to do is use its existing knowledge (plus whatever little it learns from our examples) to answer in the *format* we want, specifically a README section written in Markdown.

In [125]:
df = pd.read_csv("readme_qa.csv")
df.columns = [str(q).strip() for q in df.columns]

data = Dataset.from_pandas(df)

In [126]:
df["Answer"].values[0:5]

array(['Explore popular APIs and see them work in Postman. <br > <p> <a href="https://apilayer.com"> <div> <img src=".github/cs1586-APILayerLogoUpdate2022-LJ_v2-HighRes.png" width="250" alt="APILayer Logo" /> </div> </a> </p> [APILayer](https://apilayer.com/) is the fastest way to integrate APIs into any product. They created this repository to support the community in easily finding public APIs. Explore their collections on the [Postman API Network](https://www.postman.com/apilayer/workspace/apilayer/overview).',
       '| API | Description | Call this API | |:---|:---|:---| | [IP Stack](https://ipstack.com/) | Locate and Identify Website Visitors by IP Address | [<img src="https://run.pstmn.io/button.svg" alt="Run In Postman" style="width: 128px; height: 32px;">](https://god.gw.postman.com/run-collection/10131015-55145132-244c-448c-8e6f-8780866e4862?action=collection%2Ffork&source=rip_markdown&collection-url=entityId%3D10131015-55145132-244c-448c-8e6f-8780866e4862%26entityType%3Dcoll

In [127]:
len(df)

13848

In [128]:
df.head()

Unnamed: 0.1,Unnamed: 0,Question,Context,Answer,Repo Url,Repo
0,0.0,Provide the README content for the section wit...,Discussions in issues and pull requests:\n ...,Explore popular APIs and see them work in Post...,https://github.com/public-apis/public-apis,public-apis
1,1.0,Provide the README content for the section wit...,Discussions in issues and pull requests:\n ...,| API | Description | Call this API | |:---|:-...,https://github.com/public-apis/public-apis,public-apis
2,2.0,Provide the README content for the section wit...,Discussions in issues and pull requests:\n ...,| API | Description | Auth | Call this API | |...,https://github.com/public-apis/public-apis,public-apis
3,3.0,Provide the README content for the section wit...,# check each category for the minimum number o...,* [Animals](#animals) * [Anime](#anime) * [Art...,https://github.com/public-apis/public-apis,public-apis
4,4.0,Provide the README content for the section wit...,Discussions in issues and pull requests:\n ...,<br > <strong>Get Involved</strong> * [Contrib...,https://github.com/public-apis/public-apis,public-apis


In [95]:
# prompt = df["Question"].values[0] + ". Answer as briefly as possible: ".strip()
# prompt

In [129]:
import numpy as np
df.replace('', np.nan, inplace=True)
df.dropna(subset=["Answer"], inplace=True)
df = df[["Question", "Context", "Answer", "Repo Url", "Repo"]]
df.head()

Unnamed: 0,Question,Context,Answer,Repo Url,Repo
0,Provide the README content for the section wit...,Discussions in issues and pull requests:\n ...,Explore popular APIs and see them work in Post...,https://github.com/public-apis/public-apis,public-apis
1,Provide the README content for the section wit...,Discussions in issues and pull requests:\n ...,| API | Description | Call this API | |:---|:-...,https://github.com/public-apis/public-apis,public-apis
2,Provide the README content for the section wit...,Discussions in issues and pull requests:\n ...,| API | Description | Auth | Call this API | |...,https://github.com/public-apis/public-apis,public-apis
3,Provide the README content for the section wit...,# check each category for the minimum number o...,* [Animals](#animals) * [Anime](#anime) * [Art...,https://github.com/public-apis/public-apis,public-apis
4,Provide the README content for the section wit...,Discussions in issues and pull requests:\n ...,<br > <strong>Get Involved</strong> * [Contrib...,https://github.com/public-apis/public-apis,public-apis


In [130]:
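from langdetect import detect
# Classify each answer separately; detect(str(df['Answer'])) would label the whole
# column from the Series' printed summary and give every row the same value.
df['detect'] = df['Answer'].apply(lambda text: detect(str(text)))
df.head()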

Unnamed: 0,Question,Context,Answer,Repo Url,Repo,detect
0,Provide the README content for the section wit...,Discussions in issues and pull requests:\n ...,Explore popular APIs and see them work in Post...,https://github.com/public-apis/public-apis,public-apis,en
1,Provide the README content for the section wit...,Discussions in issues and pull requests:\n ...,| API | Description | Call this API | |:---|:-...,https://github.com/public-apis/public-apis,public-apis,en
2,Provide the README content for the section wit...,Discussions in issues and pull requests:\n ...,| API | Description | Auth | Call this API | |...,https://github.com/public-apis/public-apis,public-apis,en
3,Provide the README content for the section wit...,# check each category for the minimum number o...,* [Animals](#animals) * [Anime](#anime) * [Art...,https://github.com/public-apis/public-apis,public-apis,en
4,Provide the README content for the section wit...,Discussions in issues and pull requests:\n ...,<br > <strong>Get Involved</strong> * [Contrib...,https://github.com/public-apis/public-apis,public-apis,en


In [131]:
df = df[df['detect'] == 'en']
df = df[["Question", "Context", "Answer", "Repo Url", "Repo"]]
len(df)

12803

In [29]:
def clean_text(text):
    # Define the regular expression pattern for HTTP URLs
    http_pattern = re.compile(r'http://[^\s]+')
    # Remove HTTP URLs
    text = http_pattern.sub('', str(text))

    https_pattern = re.compile(r'https://[^\s]+')
    # Remove HTTPS URLs
    text = https_pattern.sub('', str(text))
    
    # Define the regular expression pattern for <img> tags
    img_pattern = re.compile(r'<img[^>]*>')
    # Remove <img> tags
    text = img_pattern.sub('', str(text))
    
    return text

In [132]:
import re
def clean_emoji(tx):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols 
                           u"\U0001F680-\U0001F6FF"  # transport 
                           u"\U0001F1E0-\U0001F1FF"  # flags 
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # enclosed characters; this wide range also spans CJK and other non-emoji scripts
                           "]+", flags=re.UNICODE)

    return emoji_pattern.sub(r'', tx)

def text_cleaner(tx):
    # Expand contractions. Every substitution chains on `text` so earlier results are kept.
    text = re.sub(r"won\'t", "will not", tx)
    text = re.sub(r"\bim\b", "i am", text)  # word boundaries keep words like "image" intact
    text = re.sub(r"\bIm\b", "I am", text)
    text = re.sub(r"can\'t", "can not", text)
    text = re.sub(r"don\'t", "do not", text)
    text = re.sub(r"shouldn\'t", "should not", text)
    text = re.sub(r"needn\'t", "need not", text)
    text = re.sub(r"hasn\'t", "has not", text)
    text = re.sub(r"haven\'t", "have not", text)
    text = re.sub(r"weren\'t", "were not", text)
    text = re.sub(r"mightn\'t", "might not", text)
    text = re.sub(r"didn\'t", "did not", text)
    text = re.sub(r"n\'t", " not", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'s", " is", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'t", " not", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'m", " am", text)
    # text = re.sub('https?://\S+|www\.\S+', '<URL>', text)
    # text = re.sub(r'[^a-zA-Z0-9\!\?\.\@]',' ' , text)
    # Replace plain and percent-encoded URLs with a placeholder token
    text = re.sub(r'https?://[^\s\")]+', '<URL>', text)
    text = re.sub(r'https?%3A%2F%2F[^\s\")]+', '<URL>', text)
    # Collapse runs of repeated punctuation into a single character
    text = re.sub(r'[!]+', '!', text)
    text = re.sub(r'[?]+', '?', text)
    text = re.sub(r'[.]+', '.', text)
    text = re.sub(r'[@]+', '@', text)
    text = re.sub(r'\bunk\b', '<UNK>', text)  # whole word only, so "chunk" is left alone
    # Make whitespace control characters explicit tokens
    text = re.sub('\n', '<NL>', text)
    text = re.sub('\t', '<TAB>', text)
    # text = re.sub(r'\s+', '<SP>', text)
    # text = re.sub(r'(<img[^>]*\bsrc=")[^"]*(")', '<img src=<IMG_SRC>', text)

    # text = text.lower()
    # text = re.sub(r'[ ]+' , ' ' , text)

    return text
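
The URL-masking step in `text_cleaner` can be checked in isolation. The sketch below lifts the same character class into a hypothetical helper, `mask_urls`, and runs it on short made-up strings; the pattern stops at whitespace, a double quote, or a closing parenthesis, so Markdown links and HTML attributes keep their surrounding syntax.

```python
import re

# Plain URLs and percent-encoded URLs, each ending at whitespace, a quote, or ")"
URL_PATTERN = re.compile(r'https?://[^\s")]+')
ENCODED_URL_PATTERN = re.compile(r'https?%3A%2F%2F[^\s")]+')

def mask_urls(text):
    """Replace every URL with the placeholder token <URL>."""
    text = URL_PATTERN.sub('<URL>', text)
    return ENCODED_URL_PATTERN.sub('<URL>', text)

print(mask_urls('[APILayer](https://apilayer.com/) is the fastest way'))
print(mask_urls('<a href="https://apilayer.com">'))
print(mask_urls('u=https%3A%2F%2Fexample.com%2Fx end'))
```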

In [133]:
# df["Answer"] = df["Answer"].apply(clean_text)
df["Answer"] = df["Answer"].apply(text_cleaner)
df["Answer"] = df["Answer"].apply(clean_emoji)
df["Context"] = df["Context"].apply(text_cleaner)
df["Answer"].values[0:5]

array(['Explore popular APIs and see them work in Postman. <br > <p> <a href="<URL>"> <div> <img src=".github/cs1586-APILayerLogoUpdate2022-LJ_v2-HighRes.png" width="250" alt="APILayer Logo" /> </div> </a> </p> [APILayer](<URL>) is the fastest way to integrate APIs into any product. They created this repository to support the community in easily finding public APIs. Explore their collections on the [Postman API Network](<URL>).',
       '| API | Description | Call this API | |:---|:---|:---| | [IP Stack](<URL>) | Locate and Identify Website Visitors by IP Address | [<img src="<URL>" alt="Run In Postman" style="width: 128px; height: 32px;">](<URL>)| | [Marketstack](<URL>) | Free, easy-to-use REST API interface delivering worldwide stock market data in JSON format | [<img src="<URL>" alt="Run In Postman" style="width: 128px; height: 32px;">](<URL>)| | [Weatherstack](<URL>) | Retrieve instant, accurate weather information for any location in the world in lightweight JSON format | [<img sr

In [135]:
df["Answer"].values[-20:]

array(['Our Python package gives you more control over each setting. To replicate and connect to LM Studio, use these settings: ```python from interpreter import interpreter interpreter.offline = True # Disables online features like Open Procedures interpreter.llm.model = "openai/x" # Tells OI to send messages in OpenAI is format interpreter.llm.api_key = "fake_key" # LiteLLM, which we use to talk to LM Studio, requires this interpreter.llm.api_base = "<URL>" # Point this at any OpenAI compatible server interpreter.chat() ```',
       'You can modify the `max_tokens` and `context_window` (in tokens) of locally running models. For local mode, smaller context windows will use less RAM, so we recommend trying a much shorter window (~1000) if it is failing / if it is slow. Make sure `max_tokens` is less than `context_window`. ```shell interpreter --local --max_tokens 1000 --context_window 3000 ```',
       'To help you inspect Open Interpreter we have a `--verbose` mode for debugging. You 

In [136]:
df.to_csv("readme_qa_cleaned_v2.csv", index=False)

In [18]:
project_name = df["Repo"].values[10820]
repository_url = df["Repo Url"].values[10820]
target_audience = "smart developer"
question = df["Question"].values[10820]
context = df["Context"].values[10820]
content_type = "docs"
prompt = f"""You are an AI assistant for a software project called {project_name}. You are trained on all the {content_type} that makes up this project.
    The {content_type} for the project is located at {repository_url}.
    You are given a repository which might contain several modules and each module will contain a set of files.
    Look at the source code in the repository and you have to generate content for the section of a README.md file following the heading given below. If you use any hyperlinks, they should link back to the github repository shared with you.
    You should only use hyperlinks that are explicitly listed in the context. Do NOT make up a hyperlink that is not listed.

    Assume the reader is a {target_audience} but is not deeply familiar with {project_name}.
    Assume the reader does not know anything about how the project is structured or which folders/files do what and what functions are written in which files and what these functions do.
    If you don't know how to fill up the readme.md file in one of its sections, leave that part blank. Don't try to make up any content.
    Do not include information that is not directly relevant to repository, even though the names of the functions might be common or is frequently used in several other places.
    Keep your response between 100 and 300 words. DO NOT RETURN MORE THAN 300 WORDS. Provide the answer in correct markdown format.

    Question: {question}
    Context:
    {context}

    Answer in Markdown:"""
prompt

'You are an AI assistant for a software project called diagrams. You are trained on all the docs that makes up this project.\n    The docs for the project is located at https://github.com/mingrammer/diagrams.\n    You are given a repository which might contain several modules and each module will contain a set of files.\n    Look at the source code in the repository and you have to generate content for the section of a README.md file following the heading given below. If you use any hyperlinks, they should link back to the github repository shared with you.\n    You should only use hyperlinks that are explicitly listed in the context. Do NOT make up a hyperlink that is not listed.\n\n    Assume the reader is a smart developer but is not deeply familiar with diagrams.\n    Assume the reader does not know anything about how the project is structured or which folders/files do what and what functions are written in which files and what these functions do.\n    If you don\'t know how to fil

## Let's generate!

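Below we're setting up our generation settings:
- Top P: nucleus sampling, which draws from the smallest set of most probable tokens whose cumulative probability reaches `top_p`, as opposed to greedily taking the single most probable token.
- Temperature: a scaling applied to the logits before the softmax; values below 1 sharpen the output distribution and values above 1 flatten it.
- We limit the return sequences to 1 (only one answer is allowed!) and cap the answer at 512 new tokens.

Both sampling settings only take effect when sampling is enabled (`do_sample=True`); with greedy decoding they are ignored.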

In [44]:
generation_config = model.generation_config
generation_config.max_new_tokens = 512
generation_config.temperature = 0.7
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
generation_config.repetition_penalty = 1.1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id
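
As a rough illustration of what temperature and Top P do to a next-token distribution, here is a minimal pure-Python sketch. It is not the `transformers` implementation; the logits are made up, and the helper names `softmax_with_temperature` and `top_p_filter` are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the most probable tokens until their cumulative mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
flat = softmax_with_temperature(logits, temperature=1.0)
sharp = softmax_with_temperature(logits, temperature=0.7)
print(flat)
print(sharp)                     # the top token gets more probability at lower temperature
print(top_p_filter(sharp, 0.7))  # token index -> renormalized probability
```

A token is then sampled from the filtered, renormalized distribution, so lowering either setting makes the output more predictable.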

Now, we'll generate an answer to our sample prompt, just to see how the model does before any training!

It's fascinatingly wrong. :-)

In [45]:
%%time
device = "cuda"

encoding = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model.generate(
        input_ids = encoding.input_ids,
        attention_mask = encoding.attention_mask,
        generation_config = generation_config
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))



You are an AI assistant for a software project called diagrams. You are trained on all the docs that makes up this project.
    The docs for the project is located at https://github.com/mingrammer/diagrams.
    You are given a repository which might contain several modules and each module will contain a set of files.
    Look at the source code in the repository and you have to generate content for the section of a README.md file following the heading given below. If you use any hyperlinks, they should link back to the github repository shared with you.
    You should only use hyperlinks that are explicitly listed in the context. Do NOT make up a hyperlink that is not listed.

    Assume the reader is a smart developer but is not deeply familiar with diagrams.
    Assume the reader does not know anything about how the project is structured or which folders/files do what and what functions are written in which files and what these functions do.
    If you don't know how to fill up the r

## Format our fine-tuning data

We'll match the prompt setup we used above.

In [47]:
def generate_prompt(data_point):
#     return f"""
#             {data_point["Question"]}. 
#             Answer as briefly as possible: {data_point["Answer"]}
#             """.strip()
    return f"""You are an AI assistant for a software project called {data_point["Repo"]}. You are trained on all the {content_type} that makes up this project.
    The docs for the project is located at {data_point["Repo Url"]}.
    You are given a repository which might contain several modules and each module will contain a set of files.
    Look at the source code in the repository and you have to generate content for the section of a README.md file following the heading given below. If you use any hyperlinks, they should link back to the github repository shared with you.
    You should only use hyperlinks that are explicitly listed in the context. Do NOT make up a hyperlink that is not listed.

    Assume the reader is a smart developer but is not deeply familiar with {data_point["Repo"]}.
    Assume the reader does not know anything about how the project is structured or which folders/files do what and what functions are written in which files and what these functions do.
    If you don't know how to fill up the readme.md file in one of its sections, leave that part blank. Don't try to make up any content.
    Do not include information that is not directly relevant to repository, even though the names of the functions might be common or is frequently used in several other places.
    Keep your response between 100 and 300 words. DO NOT RETURN MORE THAN 300 WORDS. Provide the answer in correct markdown format.

    Question: {data_point["Question"]}
    Context:
    {data_point["Context"]}

    Answer in Markdown:
    {data_point["Answer"]}
    """

def generate_and_tokenize_prompt(data_point):
    full_prompt = generate_prompt(data_point)
    tokenized_full_prompt = tokenizer(full_prompt, padding=True, truncation=True)
    return tokenized_full_prompt

data = Dataset.from_pandas(df)
data = data.shuffle().map(generate_and_tokenize_prompt)

Map:   0%|          | 0/11372 [00:00<?, ? examples/s]

Map: 100%|██████████| 11372/11372 [00:22<00:00, 514.55 examples/s]


## Train!

Now, we'll use our data to update our model. Using the Huggingface `transformers` library, let's set up our training loop and then run it. Note that we are ONLY making one pass on all this data.

In [49]:
import os

os.environ['CUDA_LAUNCH_BLOCKING']="1"
os.environ['TORCH_USE_CUDA_DSA']="1"

In [50]:
training_args = transformers.TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    num_train_epochs=1,
    learning_rate=1e-4,
    fp16=True,
    output_dir="outputs_llama2-7b-chat-gptq_v3",
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.01,
    report_to="none"
)

trainer = transformers.Trainer(
    model=model,
    train_dataset=data,
    args=training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False
trainer.train()

Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


Step,Training Loss


KeyboardInterrupt: 
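
For a sense of what `lr_scheduler_type="cosine"` with `warmup_ratio=0.01` means over one epoch, here is a small sketch of a linear-warmup, cosine-decay schedule. It approximates the behaviour and is not the `transformers` code itself; the step count assumes a single training process, so that the 11,372 mapped examples with batch size 1 and no gradient accumulation give 11,372 optimizer steps. The function name `cosine_lr` is hypothetical.

```python
import math

def cosine_lr(step, total_steps, base_lr, warmup_ratio):
    """Linear warmup to base_lr, then cosine decay to zero at total_steps."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total_steps = 11372  # examples / (batch size 1 * accumulation 1) * 1 epoch, single process assumed
base_lr = 1e-4

for step in (0, 57, 114, total_steps // 2, total_steps):
    print(step, cosine_lr(step, total_steps, base_lr, warmup_ratio=0.01))
```

The learning rate climbs for roughly the first hundred steps, peaks at `1e-4`, and then decays smoothly, so an interrupted run like the one above stops while the rate is still near its peak.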

## Loading and using the model later

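Now, we'll load a saved PEFT checkpoint on top of its base model and use it to generate some more answers. (To save the adapter from the current session instead, uncomment the `model.save_pretrained` line below.)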

In [20]:
# model.save_pretrained("trained-model")

PEFT_MODEL = "outputs_llama2-7b-chat-gptq_v2/checkpoint-11000"

config = PeftConfig.from_pretrained(PEFT_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    # quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
tokenizer.pad_token = tokenizer.eos_token

model = PeftModel.from_pretrained(model, PEFT_MODEL)


In [21]:
generation_config = model.generation_config
generation_config.max_new_tokens = 512
generation_config.temperature = 0.7
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
generation_config.repetition_penalty = 1.1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id

In [22]:
import numpy as np

In [23]:
%%time
project_name = df["Repo"].values[10820]
repository_url = df["Repo Url"].values[10820]
target_audience = "smart developer"
question = df["Question"].values[10820]
context = df["Context"].values[10820]
content_type = "docs"
prompt = f"""You are an AI assistant for a software project called {project_name}. You are trained on all the {content_type} that makes up this project.
    The {content_type} for the project is located at {repository_url}.
    You are given a repository which might contain several modules and each module will contain a set of files.
    Look at the source code in the repository and you have to generate content for the section of a README.md file following the heading given below. If you use any hyperlinks, they should link back to the github repository shared with you.
    You should only use hyperlinks that are explicitly listed in the context. Do NOT make up a hyperlink that is not listed.

    Assume the reader is a {target_audience} but is not deeply familiar with {project_name}.
    Assume the reader does not know anything about how the project is structured or which folders/files do what and what functions are written in which files and what these functions do.
    If you don't know how to fill up the readme.md file in one of its sections, leave that part blank. Don't try to make up any content.
    Do not include information that is not directly relevant to repository, even though the names of the functions might be common or is frequently used in several other places.
    Keep your response between 100 and 300 words. DO NOT RETURN MORE THAN 300 WORDS. Provide the answer in correct markdown format.

    Question: {question}
    Context:
    {context}

    Answer in Markdown:"""
    
device = "cuda"
encoding = tokenizer(prompt, return_tensors="pt").to(device)
with torch.inference_mode():
    outputs = model.generate(
        input_ids = encoding.input_ids,
        attention_mask = encoding.attention_mask,
        generation_config = generation_config
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))



You are an AI assistant for a software project called diagrams. You are trained on all the docs that makes up this project.
    The docs for the project is located at https://github.com/mingrammer/diagrams.
    You are given a repository which might contain several modules and each module will contain a set of files.
    Look at the source code in the repository and you have to generate content for the section of a README.md file following the heading given below. If you use any hyperlinks, they should link back to the github repository shared with you.
    You should only use hyperlinks that are explicitly listed in the context. Do NOT make up a hyperlink that is not listed.

    Assume the reader is a smart developer but is not deeply familiar with diagrams.
    Assume the reader does not know anything about how the project is structured or which folders/files do what and what functions are written in which files and what these functions do.
    If you don't know how to fill up the r