<img align="right" width="400" src="https://www.fhnw.ch/de/++theme++web16theme/assets/media/img/fachhochschule-nordwestschweiz-fhnw-logo.svg" alt="FHNW Logo">


# LLaMA2 as Classifier

by Fabian Märki

## Summary
This notebook shows how to load a Large Language Model onto a GPU with limited memory and then use prompt engineering to turn LLaMA2 into a classifier.


## Links
- [Efficient Transformers: A Survey](https://shreyansh26.github.io/post/2022-10-10_efficient_transformers_survey/)
- [The Secret Sauce behind 100K Context Window in LLMs: All Tricks in One Place](https://blog.gopenai.com/how-to-speed-up-llms-and-use-100k-context-window-all-tricks-in-one-place-ffd40577b4c)
- [Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes)
- [FlashAttention](https://shreyansh26.github.io/post/2023-03-26_flash-attention/) and how to use it with [Huggingface](https://huggingface.co/docs/transformers/perf_infer_gpu_one)

- [List of recent Models and their Licence](https://crfm.stanford.edu/ecosystem-graphs/index.html?mode=table)
- [Huggingface Model Ecosystem](https://huggingface.co/models)



<a href="https://colab.research.google.com/github/markif/2023_HS_DAS_NLP_Notebooks/blob/master/08_f_Transformers_LLaMA2_as_Classifier.ipynb">
  <img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [1]:
%%capture

!pip install 'fhnw-nlp-utils>=0.8.0,<0.9.0'

from fhnw.nlp.utils.storage import load_dataframe
from fhnw.nlp.utils.storage import download
from fhnw.nlp.utils.system import set_log_level

import pandas as pd
import numpy as np

set_log_level()
import tensorflow as tf

**Make sure that a GPU is available (see [here](https://www.tutorialspoint.com/google_colab/google_colab_using_free_gpu.htm))!!!**

In [2]:
from fhnw.nlp.utils.system import system_info
print(system_info())

OS name: posix
Platform name: Linux
Platform release: 6.2.0-33-generic
Python version: 3.8.10
CPU cores: 6
RAM: 31.1GB total and 23.96GB available
Tensorflow version: 2.12.0
GPU is available
GPU is a NVIDIA GeForce RTX 2070 with Max-Q Design with 8192MiB


In [3]:
%%time
download("https://drive.switch.ch/index.php/s/0hE8wO4FbfGIJld/download", "data/german_doctor_reviews_tokenized.parq")
data = load_dataframe("data/german_doctor_reviews_tokenized.parq")

CPU times: user 6.41 s, sys: 1.56 s, total: 7.97 s
Wall time: 4.67 s


In [4]:
# remove all neutral sentiments
data = data.loc[(data["label"] != "neutral")]
data.shape

(331187, 10)

In [5]:
data.head(3)

| | text_original | rating | text | label | sentiment | token_clean | text_clean | token_lemma | token_stem | token_clean_stopwords |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ich bin franzose und bin seit ein paar Wochen ... | 2.0 | Ich bin franzose und bin seit ein paar Wochen ... | positive | 1 | [ich, bin, franzose, und, bin, seit, ein, paar... | ich bin franzose und bin seit ein paar wochen ... | [franzose, seit, paar, wochen, muenchen, zahn,... | [franzos, seit, paar, woch, muench, ., zahn, s... | [franzose, seit, paar, wochen, muenchen, ., za... |
| 1 | Dieser Arzt ist das unmöglichste was mir in me... | 6.0 | Dieser Arzt ist das unmöglichste was mir in me... | negative | -1 | [dieser, arzt, ist, das, unmöglichste, was, mi... | dieser arzt ist das unmöglichste was mir in me... | [arzt, unmöglichste, leben, je, begegnen, unfr... | [arzt, unmog, leb, je, begegnet, unfreund, ,, ... | [arzt, unmöglichste, leben, je, begegnet, unfr... |
| 2 | Hatte akute Beschwerden am Rücken. Herr Magura... | 1.0 | Hatte akute Beschwerden am Rücken. Herr Magura... | positive | 1 | [hatte, akute, beschwerden, am, rücken, ., her... | hatte akute beschwerden am rücken . herr magur... | [akut, beschwerden, rücken, magura, erste, arz... | [akut, beschwerd, ruck, ., magura, erst, arzt,... | [akute, beschwerden, rücken, ., magura, erste,... |


In [6]:
from fhnw.nlp.utils.transformers import get_compute_device

In [7]:
# determine the compute device once and reuse it below
compute_device = get_compute_device()

params = {
    "verbose": True,
    "shuffle": True,
    # reduce batch_size in case you experience memory issues
    "batch_size": 16,
    "X_column_name": "text_clean",
    "y_column_name": "label",
    "y_column_name_prediction": "prediction",
    "compute_device": compute_device,
    "last_stored_batch": -1,
}

To take advantage of some recent features, we need to ensure that sufficiently recent versions of the libraries are installed.

In [8]:
%%capture

!pip install "optimum>=1.13.2"
!pip install "bitsandbytes>=0.41.1"
!pip install "accelerate>=0.23.0"
# alternatively, install the development version directly from GitHub
#!pip install "git+https://github.com/huggingface/accelerate.git"
!pip install "transformers>=4.33.2"
# alternatively, install the development version directly from GitHub
#!pip install "git+https://github.com/huggingface/transformers.git"

I experienced issues with downloading the model. In the end, I downloaded it directly from the Git repository (I actually did this on the console, but the following code might also work).

In [9]:
#!apt-get install git-lfs
#!git lfs install

#!git clone "https://huggingface.co/NousResearch/Llama-2-7b-hf"

Define the model (some models, such as those from meta-llama, require authentication).

In [10]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# original model, but it needs an access token
#transformers_model_name = "meta-llama/Llama-2-7b-chat-hf"
#transformers_model_name = "codellama/CodeLlama-34b-hf"
# local directory created by the git clone above
transformers_model_name = "Llama-2-7b-hf"

# load the weights in 4-bit and run the computations in float16
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(transformers_model_name)
model = AutoModelForCausalLM.from_pretrained(
    transformers_model_name,
    quantization_config=quantization_config,
    device_map="auto",
    # only needed for gated models such as meta-llama
    #token="<your_access_token>",
)





In [11]:
model.get_memory_footprint()

3829940224
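
To put the reported footprint into perspective, the byte count can be converted to GiB and compared with a rough estimate for 16-bit weights. The sketch below is plain arithmetic; the figure of roughly 7 billion parameters is an assumption taken from the model name (Llama-2-7b), not a measured value.

```python
# Footprint reported by model.get_memory_footprint() above (in bytes).
footprint_bytes = 3829940224

# Assumption: roughly 7 billion parameters, as suggested by the model name.
approx_num_parameters = 7_000_000_000

GIB = 1024 ** 3

footprint_gib = footprint_bytes / GIB
# Rough estimate for the same weights stored in 16 bit (2 bytes per parameter).
fp16_estimate_gib = approx_num_parameters * 2 / GIB

print(f"4-bit footprint: {footprint_gib:.2f} GiB")
print(f"Rough 16-bit estimate: {fp16_estimate_gib:.2f} GiB")
```

Under these assumptions the quantized model needs roughly 3.6 GiB, which fits into the 8192 MiB of the GPU listed above, whereas the 16-bit estimate of about 13 GiB would not.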

Let's take advantage of [FlashAttention](https://huggingface.co/docs/transformers/perf_infer_gpu_one) through the BetterTransformer integration (see also [here](https://shreyansh26.github.io/post/2023-03-26_flash-attention/)).

In [12]:
model.to_bettertransformer()

The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttentionLayerBetterTransformer(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): L

In [16]:
prompt = "What is gravity?"
inputs = tokenizer(prompt, return_tensors="pt").to(compute_device)

model_output = model.generate(
    inputs["input_ids"],
    max_new_tokens=200,
    do_sample=True,
    top_p=0.9,
    temperature=0.1,
)

In [20]:
print(tokenizer.decode(model_output[0], skip_special_tokens=True))

What is gravity?
 Hinweis: Die folgende Seite ist nur auf Englisch verfügbar.
Gravity is the force that attracts objects to each other. It is the force that keeps the Earth and the other planets in their orbits around the Sun. It is the force that keeps the Moon in its orbit around the Earth.
Gravity is a force of attraction. It is the force that attracts objects to each other. It is the force that keeps the Earth and the other planets in their orbits around the Sun. It is the force that keeps the Moon in its orbit around the Earth.
Gravity is a force of attraction. It is the force that attracts objects to each other. It is the force that keeps the Earth and the other planets in their orbits around the Sun. It is the force that keeps the Moon in its orbit around the Earth.
Gravity is a force of attraction. It is the


In [22]:
prompt = "What is gravity?"
inputs = tokenizer(prompt, return_tensors="pt").to(compute_device)

model_output = model.generate(
    inputs["input_ids"],
    max_new_tokens=200,
    do_sample=True,
    top_p=0.7,
    temperature=1.0,
)

In [23]:
print(tokenizer.decode(model_output[0], skip_special_tokens=True))

What is gravity? It’s not something you can see or feel. nobody has ever been able to photograph it, or to measure it with any accuracy. The best we can do is to infer its presence from the motion of bodies.
We know that gravity is what holds the earth in orbit around the sun, and keeps the sun and the planets in their places in the solar system. We know that it is also responsible for holding the planets and moons in orbit around each other.
Gravity is the only force in nature that acts at a distance, and the only force that acts on all objects equally. All other forces, such as electricity, magnetism, and inertia, only act on certain objects, and only when those objects are close together.
The first scientist to study gravity was Isaac Newton. He was born in England in 1643, and died in 1727. Newton was a very intelligent man, and he was
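
The two runs above differ only in `temperature` and `top_p`: the low-temperature run repeats itself, while the run with `temperature=1.0` is more varied. The following toy sketch illustrates why. It applies temperature scaling and nucleus (top-p) filtering to a made-up next-token distribution. The function `next_token_distribution` and the values in `toy_logits` are hypothetical, and the code is a simplified illustration rather than the implementation used by `transformers`.

```python
import math


def next_token_distribution(logits, temperature=1.0, top_p=1.0):
    """Apply temperature scaling and top-p filtering to a dict of toy logits."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = {token: value / temperature for token, value in logits.items()}

    # Softmax, shifted by the maximum for numerical stability.
    max_value = max(scaled.values())
    exp_values = {token: math.exp(v - max_value) for token, v in scaled.items()}
    total = sum(exp_values.values())
    probs = {token: v / total for token, v in exp_values.items()}

    # Nucleus filtering: keep the most likely tokens until their
    # cumulative probability reaches top_p.
    kept = {}
    cumulative = 0.0
    for token, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalise the surviving tokens so they sum to 1 again.
    kept_total = sum(kept.values())
    return {token: p / kept_total for token, p in kept.items()}


# Made-up logits for four candidate next tokens.
toy_logits = {"force": 2.0, "mystery": 1.5, "wave": 1.0, "banana": -1.0}

sharp = next_token_distribution(toy_logits, temperature=0.1, top_p=0.9)
flat = next_token_distribution(toy_logits, temperature=1.0, top_p=0.7)

print("temperature=0.1, top_p=0.9:", sharp)
print("temperature=1.0, top_p=0.7:", flat)
```

With the low temperature only the single most likely token survives, so sampling behaves almost like greedy decoding. With `temperature=1.0` two tokens remain, and the sampler can choose between them.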


In [24]:
prompt_template = """Classify the german text into negative, or positive. Reply with only one word: Positive, or Negative.

Text: {}
Sentiment: """

In [25]:
prompt = prompt_template.format(data.iloc[0][params["X_column_name"]])
print(prompt)

Classify the german text into negative, or positive. Reply with only one word: Positive, or Negative.

Text: ich bin franzose und bin seit ein paar wochen in muenchen . ich hatte zahn schmerzen und mein kollegue hat mir dr mainka empfohlen . ich habe schnell ein termin bekommen , das team war nett und meine schmerzen sind weg ! ! ich bin als angst patient sehr zurieden ! !
Sentiment: 


In [26]:
inputs = tokenizer(prompt, return_tensors="pt").to(compute_device)

model_output = model.generate(
    inputs["input_ids"],
    max_new_tokens=8,
    do_sample=True,
    top_p=0.7,
    temperature=1.0,
)

In [27]:
print(tokenizer.decode(model_output[0], skip_special_tokens=True))

Classify the german text into negative, or positive. Reply with only one word: Positive, or Negative.

Text: ich bin franzose und bin seit ein paar wochen in muenchen . ich hatte zahn schmerzen und mein kollegue hat mir dr mainka empfohlen . ich habe schnell ein termin bekommen , das team war nett und meine schmerzen sind weg ! ! ich bin als angst patient sehr zurieden ! !
Sentiment: 0.133333


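The model replied with a number instead of one of the requested labels. Let's try some prompt engineering and add a few labelled examples to the prompt.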

In [29]:
prompt_template = """Classify the german text into negative, or positive. Reply with only one word: Positive, or Negative.

Examples:
Text: Dies ist ein guter Arzt.
Sentiment: Positive

Text: Dies ist ein schlechter Arzt
Sentiment: Negative

Text: {}
Sentiment: """

In [30]:
prompt = prompt_template.format(data.iloc[0][params["X_column_name"]])
print(prompt)

Classify the german text into negative, or positive. Reply with only one word: Positive, or Negative.

Examples:
Text: Dies ist ein guter Arzt.
Sentiment: Positive

Text: Dies ist ein schlechter Arzt
Sentiment: Negative

Text: ich bin franzose und bin seit ein paar wochen in muenchen . ich hatte zahn schmerzen und mein kollegue hat mir dr mainka empfohlen . ich habe schnell ein termin bekommen , das team war nett und meine schmerzen sind weg ! ! ich bin als angst patient sehr zurieden ! !
Sentiment: 
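
The few-shot prompt above is written by hand. If the examples should be exchangeable, the same layout can be assembled from a list of pairs. This is a minimal sketch; `build_few_shot_prompt`, `few_shot_examples` and `demo_prompt` are hypothetical names that are not part of the notebook or of any library.

```python
def build_few_shot_prompt(text, examples):
    """Build a few-shot sentiment prompt from (text, label) example pairs."""
    lines = [
        "Classify the german text into negative, or positive. "
        "Reply with only one word: Positive, or Negative.",
        "",
        "Examples:",
    ]
    for example_text, example_label in examples:
        lines.append(f"Text: {example_text}")
        lines.append(f"Sentiment: {example_label}")
        lines.append("")
    # The text to classify comes last, with an open "Sentiment: " slot.
    lines.append(f"Text: {text}")
    lines.append("Sentiment: ")
    return "\n".join(lines)


# The two examples used in the hand-written template above.
few_shot_examples = [
    ("Dies ist ein guter Arzt.", "Positive"),
    ("Dies ist ein schlechter Arzt", "Negative"),
]

demo_prompt = build_few_shot_prompt("das team war nett", few_shot_examples)
print(demo_prompt)
```

Keeping the examples in a list makes it easy to experiment with more, fewer, or different examples without editing the template string.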


In [31]:
inputs = tokenizer(prompt, return_tensors="pt").to(compute_device)

model_output = model.generate(
    inputs["input_ids"],
    max_new_tokens=8,
    do_sample=True,
    top_p=0.7,
    temperature=1.0,
)

In [32]:
print(tokenizer.decode(model_output[0], skip_special_tokens=True))

Classify the german text into negative, or positive. Reply with only one word: Positive, or Negative.

Examples:
Text: Dies ist ein guter Arzt.
Sentiment: Positive

Text: Dies ist ein schlechter Arzt
Sentiment: Negative

Text: ich bin franzose und bin seit ein paar wochen in muenchen . ich hatte zahn schmerzen und mein kollegue hat mir dr mainka empfohlen . ich habe schnell ein termin bekommen , das team war nett und meine schmerzen sind weg ! ! ich bin als angst patient sehr zurieden ! !
Sentiment: 0.0
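
The completions so far (`0.133333` and `0.0`) are numbers rather than one of the requested labels, so the decoded text cannot be used as a prediction without a check. The sketch below extracts whatever follows the last `Sentiment:` marker and accepts it only if it is a valid label. `parse_sentiment` is a hypothetical helper, not part of the notebook's libraries.

```python
def parse_sentiment(generated_text):
    """Return 'positive' or 'negative' if the completion is a valid label, else None."""
    # Only look at what follows the last "Sentiment:" marker (the model's completion).
    completion = generated_text.rsplit("Sentiment:", 1)[-1].strip().lower()
    for label in ("positive", "negative"):
        if completion.startswith(label):
            return label
    # Anything else (numbers, empty text, other words) is not a usable prediction.
    return None


print(parse_sentiment("Text: ein text\nSentiment: 0.0"))
print(parse_sentiment("Text: ein text\nSentiment:  Positive"))
```

Returning `None` for unusable completions makes it possible to count them separately instead of silently treating them as one of the two classes.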



In [37]:
inputs = tokenizer(prompt, return_tensors="pt").to(compute_device)

model_output = model.generate(
    inputs["input_ids"],
    max_new_tokens=4,
    do_sample=True,
    top_p=0.7,
    temperature=1.0,
    num_beams=5,
    early_stopping=True,
    # constraining the output to "Positive" and "Negative" might be an option,
    # e.g. via force_words_ids (List[List[int]])
)

In [38]:
print(tokenizer.decode(model_output[0], skip_special_tokens=True))

Classify the german text into negative, or positive. Reply with only one word: Positive, or Negative.

Examples:
Text: Dies ist ein guter Arzt.
Sentiment: Positive

Text: Dies ist ein schlechter Arzt
Sentiment: Negative

Text: ich bin franzose und bin seit ein paar wochen in muenchen . ich hatte zahn schmerzen und mein kollegue hat mir dr mainka empfohlen . ich habe schnell ein termin bekommen , das team war nett und meine schmerzen sind weg ! ! ich bin als angst patient sehr zurieden ! !
Sentiment:  Positive
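
The `force_words_ids` comment in the last `generate` call hints at restricting the output to the two labels. A related idea is to skip free-form generation and compare the model's scores for the candidate labels only. The sketch below shows just the selection step, using made-up numbers. `choose_label` and the values in `label_scores` are hypothetical, and obtaining real scores from the model is not shown here.

```python
import math


def choose_label(label_scores):
    """Softmax restricted to the candidate labels; returns (best_label, probabilities)."""
    # Shift by the maximum score for numerical stability.
    max_score = max(label_scores.values())
    exp_scores = {label: math.exp(score - max_score) for label, score in label_scores.items()}
    total = sum(exp_scores.values())
    probabilities = {label: value / total for label, value in exp_scores.items()}
    best_label = max(probabilities, key=probabilities.get)
    return best_label, probabilities


# Made-up scores for the two allowed labels (placeholders, not model output).
label_scores = {"Positive": 3.2, "Negative": 1.1}

best_label, probabilities = choose_label(label_scores)
print(best_label, probabilities)
```

Because only the two allowed labels take part in the comparison, the result is always one of them, and the probabilities can serve as a rough confidence value.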


