# Introduction

This notebook demonstrates the use of large language models for generating text, embeddings and Retrieval Augmented Generation (RAG). It begins by setting up the model and tokenizer using the Hugging Face Transformers library, ensuring that the pad token is correctly defined. The notebook then illustrates how to generate text using the model in both streaming and non-streaming modes. It applies a chat template to user messages, moves inputs to a GPU if available, and generates outputs with a specified maximum number of tokens. The generated text is cleaned to remove system messages, and the time taken for generation is displayed.
In addition to text generation, the notebook explores embeddings using the SentenceTransformer library. It encodes words and sentences to compute cosine similarity matrices, which are visualized to show relationships between different words and sentences. The notebook also demonstrates the concept of RAG by encoding a user's question and sorting sentences based on their similarity to the question. This approach helps in retrieving relevant information from a text corpus. Finally, the notebook sets up a pipeline for generating responses to user queries, showcasing the integration of text generation and retrieval techniques.

---
Throughout the notebook, hints on how to run the code on an HPC system are marked with the **[HPC]** flag.

---

https://eurohpc-ju.europa.eu/ai-factories_en

---

## Environment Variables
This notebook relies on the following environment variable:
- `HF_TOKEN` is your Hugging Face token. You can generate one at https://huggingface.co/settings/tokens

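## [HPC] On Linux do:
- `nano ~/.bashrc` to open the shell configuration file
- add the line `export HF_TOKEN="..."`, then save and exit
- `source ~/.bashrc` to reload the configuration
- `echo $HF_TOKEN` to confirm that the variable is set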
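
After these steps, the variable can also be checked from Python. The sketch below is a minimal illustration, not part of any library: the helper names `mask_secret` and `describe_token` are hypothetical, and the only assumption is that the token lives in the `HF_TOKEN` environment variable. It reports whether the variable is set without printing the full secret.

```python
import os


def mask_secret(value, visible=4):
    """Return the secret with all but the first `visible` characters hidden."""
    if len(value) <= visible:
        return "*" * len(value)
    return value[:visible] + "*" * (len(value) - visible)


def describe_token(env=os.environ, name="HF_TOKEN"):
    """Describe whether the variable is set, without revealing its value."""
    token = env.get(name)
    if not token:
        return f"{name} is not set"
    return f"{name} is set: {mask_secret(token)}"


print(describe_token())
```

Passing a plain dictionary as `env` makes the helper easy to try out without touching the real environment.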




You may use Mistral-7B-Instruct-v0.3, Llama-3.2-1B-Instruct, ilsp/Llama-Krikri-8B-Instruct, or any other Transformers-compatible model.

https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3

https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct

https://huggingface.co/ilsp/Llama-Krikri-8B-Instruct

For some models, you need to accept the terms of use before you can download them.

You can check whether access has been granted at:

https://huggingface.co/settings/gated-repos

# Install dependencies
- `pip install transformers`
- `pip install accelerate`
- `pip install --upgrade jinja2`
- `pip install -U sentence-transformers`
- `pip install pandas`
- `pip install numpy`
- `pip install scikit-learn`
- `pip install --upgrade bitsandbytes`

On Google Colab, you only need the last one, *bitsandbytes*.
https://huggingface.co/docs/transformers/quantization/bitsandbytes?bnb=4-bit
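
To see why 4-bit quantization helps on a single GPU, a rough back-of-the-envelope estimate of the weight memory is useful. The sketch below is only an approximation under stated assumptions: the function name `weight_memory_gib` is hypothetical, the parameter count of 7 billion is a round figure, and the estimate ignores activations, the key-value cache, and any overhead added by the quantization format.

```python
def weight_memory_gib(n_parameters, bits_per_parameter):
    """Approximate memory needed for the weights alone, in GiB."""
    total_bytes = n_parameters * bits_per_parameter / 8
    return total_bytes / 2**30


n_params = 7_000_000_000  # assumption: a model with roughly 7 billion parameters
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gib(n_params, bits):.1f} GiB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts the weight memory to a quarter.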

In [None]:
!pip install --upgrade bitsandbytes

# [HPC] Allocation of resources
- `salloc -A pXYZ -p gpu --qos default -N 1 -t 08:00:00`
- `salloc -A pXYZ -p gpu --qos default -N 1 -t 08:00:00 --gres=gpu:1`
- Then, if you use VS Code, press Shift+Enter in the Python file. This opens a new terminal.
- In the new terminal, press Ctrl+Z to suspend the Python script.
- Connect to the allocated node over SSH. Get the node name from the previous terminal, after the @ (`<username>@<node_name>`), and then run `ssh <node_name>` in the new terminal.
- Then run `CUDA_VISIBLE_DEVICES="0,1,2,3" python` or `CUDA_VISIBLE_DEVICES="0" python`.
- Python should now be running, and you can run your script interactively.

# [HPC] Useful Commands
- In the first terminal, you can do the following to monitor the resources:
- `watch -n 1 "top -bn1 | head -n 15 && nvidia-smi"`
- `du -sh .`
- `watch nvidia-smi --query-compute-apps=pid,process_name,used_memory,gpu_name --format=csv`
- `watch nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv`
- `scancel <job_id>`
- `kill -9 <pid>`

# Import libraries

https://huggingface.co/docs/transformers/index

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, pipeline, BitsAndBytesConfig
import os
import torch
import time
import re

In [None]:
import getpass
os.environ['HF_TOKEN'] = getpass.getpass("Enter the value for HF_TOKEN: ")

In [None]:
## HF_HOME is the directory where the models' weights are saved.
## [HPC] Use the project's directory rather than the user's home directory to have more space, and export HF_HOME in the shell as well.
os.environ["HF_HOME"] = "/content/my_huggingface_cache"

# The Transformers Library

## Download Models

In [None]:
my_model = "mistralai/Mistral-7B-Instruct-v0.3"
# my_model = "meta-llama/Llama-3.2-3B-Instruct"
# my_model = "ilsp/Llama-Krikri-8B-Instruct"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(my_model,
                                          token=os.environ["HF_TOKEN"],
                                          cache_dir=os.environ["HF_HOME"])

In [None]:
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(   my_model,
                                                token=os.environ["HF_TOKEN"],
                                                cache_dir=os.environ["HF_HOME"],
                                                device_map="auto",
                                                quantization_config=quantization_config,
                                                torch_dtype="auto")

## pad token
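
A pad token matters as soon as prompts of different lengths are batched together: shorter prompts are filled up with the pad token, and an attention mask tells the model which positions to ignore. The sketch below illustrates the idea with toy token ids in plain Python. The function `pad_batch` is hypothetical, and left padding is assumed because it is the usual choice for decoder-only generation.

```python
def pad_batch(sequences, pad_id):
    """Left-pad token id lists to a common length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        # 0 marks padding positions, 1 marks real tokens
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


ids, mask = pad_batch([[5, 6, 7], [8]], pad_id=0)
print(ids)
print(mask)
```

The mask is built from the known sequence lengths rather than by comparing token ids with the pad id, so real tokens that happen to share the pad id stay visible.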

In [None]:
# Depending on the model, the pad token might not be defined
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    print("Pad token was None, so it was set to eos token.")

## Streamer for model.generate and pipeline

In [None]:
streamer = TextStreamer(tokenizer)

## Messages

In [None]:
system_instructions = "You are a helpful assistant."
my_messages = [{"role": "system", "content": system_instructions}]
my_prompt = """Explain linear regression using LaTeX.
Use 'D' as the symbol for the dependent variable and 'I' as the symbol for the independent variable,
in the regression equation."""
my_messages.append({"role": "user", "content": my_prompt})

## Streaming Model

### Apply chat template to messages and return tensors
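
Conceptually, a chat template turns the list of messages into a single prompt string with role markers, which is then tokenized. The sketch below uses a made-up template to show the idea: the `<|role|>` and `<|end|>` markers and the function `render_chat` are hypothetical. Real templates differ from model to model and ship with the tokenizer.

```python
def render_chat(messages, add_generation_prompt=True):
    """Render messages with a made-up template: role header, content, end marker."""
    parts = [f"<|{m['role']}|>\n{m['content']}<|end|>\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so that the model continues as the assistant
        parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain linear regression."},
]
prompt = render_chat(messages)
print(prompt)
```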

In [None]:
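# add_generation_prompt=True opens the assistant turn for templates that define one
inputs = tokenizer.apply_chat_template(my_messages, add_generation_prompt=True, return_tensors="pt")
print(type(inputs)) # <class 'torch.Tensor'>
# A single, unpadded prompt needs an all-ones mask. Comparing against pad_token_id would also
# mask real end-of-turn tokens when the pad token has been set to the eos token.
attention_mask = torch.ones_like(inputs)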

### Move inputs to GPU if available

In [None]:
if torch.cuda.device_count()>0:
    inputs = inputs.to("cuda")
    attention_mask = attention_mask.to("cuda")
    print("Inputs and Attention Mask transferred to CUDA")

In [None]:
t1 = time.time()
MAXIMUM_TOKENS = 512
outputs = model.generate(inputs,
                         streamer=streamer,
                         pad_token_id=tokenizer.eos_token_id,
                         attention_mask=attention_mask,
                         max_new_tokens=MAXIMUM_TOKENS)
t2 = time.time()
print(type(outputs)) # <class 'torch.Tensor'>

### Clean the response
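
The decoded output contains the prompt followed by the answer, so the prompt has to be cut away. The sketch below tries both approaches used in this section on hand-written strings: a regular expression that drops everything up to the first `assistant` header, and a split on the user prompt. The sample strings and the helper names `strip_up_to_assistant` and `strip_prompt` are hypothetical. Real decoded outputs depend on the model's chat template.

```python
import re


def strip_up_to_assistant(decoded_text):
    """Drop everything up to and including the first 'assistant' header."""
    return re.sub(r"^.*?assistant\n\n", "", decoded_text, count=1, flags=re.DOTALL)


def strip_prompt(decoded_text, prompt):
    """Return the text after the first occurrence of the prompt, or the input unchanged."""
    if prompt not in decoded_text:
        return decoded_text
    return decoded_text.split(prompt, 1)[1].lstrip()


# Hypothetical decoded outputs, written by hand for illustration
llama_like = "system\n\nYou are a helpful assistant.user\n\nWhat is 2+2?assistant\n\n2+2 equals 4."
mistral_like = "You are a helpful assistant.\n\nWhat is 2+2? 2+2 equals 4."

print(strip_up_to_assistant(llama_like))
print(strip_prompt(mistral_like, "What is 2+2?"))
```

Checking that the prompt is present before splitting avoids an `IndexError` when the decoded text does not reproduce the prompt verbatim.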

In [None]:
# To omit special tokens such as <|begin_of_text|><|start_header_id|>system<|end_header_id|> we use skip_special_tokens=True:
generated_text = tokenizer.decode(outputs[0],
                                  skip_special_tokens=True,
                                  clean_up_tokenization_spaces=True)
print(f"{generated_text}\n\n{(t2-t1)/60:.2f} minutes")
print(type(generated_text)) # <class 'str'>

In [None]:
# To omit the system message we use:

# For Llama
# cleaned_text = re.sub(r"^.*?assistant\n\n", "", generated_text, flags=re.DOTALL)
# print(cleaned_text + "\n\n" + f"{(t2-t1)/60:.2f} minutes")

# For Mistral
generated_text.split(my_prompt)[1][1:]

## Inference (without streaming)
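
Besides the elapsed minutes, a throughput figure makes runs easier to compare. For decoder-only models, the tensor returned by `model.generate` holds the prompt tokens followed by the new tokens, so the number of generated tokens is the difference of the two lengths. The sketch below works on plain numbers. The function `generation_stats` is hypothetical and the example values are made up.

```python
def generation_stats(prompt_length, output_length, seconds):
    """Summarize a generation run from token counts and elapsed time."""
    new_tokens = output_length - prompt_length
    tokens_per_second = new_tokens / seconds if seconds > 0 else float("nan")
    return {
        "new_tokens": new_tokens,
        "tokens_per_second": tokens_per_second,
        "minutes": seconds / 60,
    }


# Made-up example: a 20-token prompt, 148 tokens in total, 64 seconds elapsed
stats = generation_stats(prompt_length=20, output_length=148, seconds=64.0)
print(stats)
```

In this notebook, the three arguments would correspond to `inputs.shape[1]`, `outputs.shape[1]` and `t2 - t1`.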

In [None]:
t1 = time.time()
MAXIMUM_TOKENS = 128
outputs = model.generate(inputs,
                         pad_token_id=tokenizer.eos_token_id,
                         attention_mask=attention_mask,
                         max_new_tokens=MAXIMUM_TOKENS)
t2 = time.time()
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text + "\n\n" + f"{(t2-t1)/60:.2f} minutes")

## The Pipeline object
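
When the pipeline receives a list of chat messages, it returns the whole conversation with the assistant's reply appended as the last message. The sketch below shows how to pick out that reply, using a hand-written structure that mimics the output shape shown in the comments of the cell below. The function `last_reply` and the mock data are hypothetical.

```python
def last_reply(pipeline_outputs):
    """Return the content of the final message of the first generated conversation."""
    conversation = pipeline_outputs[0]["generated_text"]
    return conversation[-1]["content"]


# Hand-written stand-in for a pipeline result
mock_outputs = [{"generated_text": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 equals 4."},
]}]
print(last_reply(mock_outputs))
```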

In [None]:
pipe = pipeline("text-generation",
                model=model,
                tokenizer=tokenizer,
                device_map="auto")
t1 = time.time()
MAXIMUM_TOKENS = 128
outputs = pipe(my_messages,
               max_new_tokens=MAXIMUM_TOKENS,
               pad_token_id=pipe.tokenizer.eos_token_id,
               streamer=streamer)
t2 = time.time()

# Pipeline outputs (unlike model.generate outputs) have a "generated_text" key:
print(outputs[0]["generated_text"][-1]['content'] + "\n\n" + f"{(t2-t1)/60:.2f} minutes")
# [{'generated_text': [{'role': 'system', 'content': 'You are a helpful assistant.'},
#                       {'role': 'user', 'content': "...
print(type(outputs)) # <class 'list'>
print(type(outputs[0])) # <class 'dict'>

# Embeddings

## Import libraries

In [None]:
from sentence_transformers import SentenceTransformer
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import torch
import os
if torch.cuda.device_count()>0:
    my_device = "cuda"
    print(f"You have {torch.cuda.device_count()} GPUs available.")
else:
    my_device = "cpu"
    print("You have no GPUs available. Running on CPU.")

## The SentenceTransformer object

In [None]:
embeddings_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2',
                                       token=os.environ["HF_TOKEN"],
                                       cache_folder=os.environ["HF_HOME"],
                                       device=my_device)

## Function to visualize the similarity matrix

In [None]:
import matplotlib.pyplot as plt
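def visualize_similarity_matrix(similarity_matrix, items_labels, mat_size=5):
    # Work on a copy so that the caller's matrix is not modified
    similarity_matrix = np.array(similarity_matrix, copy=True)
    # Zero the diagonal (self-similarity is always 1) so that it does not dominate the color scale
    np.fill_diagonal(similarity_matrix, 0)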
    plt.figure(figsize=(mat_size, mat_size))
    plt.imshow(similarity_matrix, interpolation='nearest', cmap='viridis')
    plt.colorbar(label="Cosine Similarity")
    plt.xticks(ticks=np.arange(len(items_labels)), labels=items_labels, rotation=90, fontsize=8)
    plt.yticks(ticks=np.arange(len(items_labels)), labels=items_labels, fontsize=8)
    plt.title("Cosine Similarity Matrix", fontsize=12)
    plt.tight_layout()
    plt.show()

## Test Embeddings - unrelated words
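
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them. The sketch below computes it by hand for tiny example vectors, as a plain-Python counterpart to the `cosine_similarity` call from scikit-learn used in this notebook. The helper names `cosine` and `pairwise_cosine` and the vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_cosine(vectors):
    """Matrix of cosine similarities between all pairs of vectors."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Toy vectors: the first and third point in the same direction
vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
matrix = pairwise_cosine(vectors)
for row in matrix:
    print([round(value, 2) for value in row])
```

Vectors pointing in the same direction score 1 regardless of their length, and perpendicular vectors score 0.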

In [None]:
word_list = ["reciprocal", "obfuscate", "hyperbolic", "tensor"]
word_embeddings = embeddings_model.encode(word_list)
cosine_similarities = cosine_similarity(word_embeddings)
print("Cosine Similarity Matrix:")
print(cosine_similarities)
visualize_similarity_matrix(cosine_similarities, word_list)

## Test Embeddings - related words

In [None]:
word_list = ["book", "publication", "article"]
word_embeddings = embeddings_model.encode(word_list)
cosine_similarities = cosine_similarity(word_embeddings)
print("Cosine Similarity Matrix:")
print(cosine_similarities)
visualize_similarity_matrix(cosine_similarities, word_list)

## Calculate mean absolute values, standard deviation and norm of embeddings

In [None]:
mean_embeddings = np.mean(np.abs(word_embeddings), axis=1)
print("Mean absolute values of embeddings:", mean_embeddings)
std_embeddings = np.std(word_embeddings, axis=1)
print("Standard Deviation of embeddings:", std_embeddings)
norm_embeddings = np.linalg.norm(word_embeddings, axis=1)
print("Norm of embeddings:", norm_embeddings)

## Generate random vectors with the same mean and std
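
Random vectors serve as a baseline here: independent random vectors in a high-dimensional space tend to be nearly perpendicular, so their cosine similarity stays close to 0, unlike that of related words. The sketch below illustrates this with the standard library only. It assumes zero-mean Gaussian components and uses 384 dimensions, the embedding size of `all-MiniLM-L6-v2`. The seed is arbitrary.

```python
import math
import random


def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


rng = random.Random(0)  # fixed seed so that the run is repeatable
dim = 384               # embedding size of all-MiniLM-L6-v2
vector_a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
vector_b = [rng.gauss(0.0, 1.0) for _ in range(dim)]

similarity = cosine(vector_a, vector_b)
print(f"Cosine similarity of two random vectors: {similarity:.3f}")
```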

In [None]:
random_vectors = np.random.normal(loc=np.mean(word_embeddings),
                                  scale=np.std(word_embeddings),
                                  size=word_embeddings.shape)
mean_random_vectors = np.mean(np.abs(random_vectors), axis=1)
print("Mean absolute values of random vectors:", mean_random_vectors)
std_random_vectors = np.std(random_vectors, axis=1)
print("Standard Deviation of random vectors:", std_random_vectors)
norm_random_vectors = np.linalg.norm(random_vectors, axis=1)
print("Norm of random vectors:", norm_random_vectors)

In [None]:
print("Cosine Similarity Matrix random vectors:")
cosine_similarities = cosine_similarity(random_vectors)
print(cosine_similarities)
visualize_similarity_matrix(cosine_similarities, ["Random Vector 1", "Random Vector 2", "Random Vector 3"])

## car ~ vehicle + motorcycle - bike

In [None]:
sentences = ["car", "vehicle", "motorcycle", "bike"]
embeddings = embeddings_model.encode(sentences)
print(cosine_similarity(embeddings[0].reshape(1, -1), (embeddings[1] + embeddings[2] - embeddings[3]).reshape(1, -1))[0, 0])

## Greece ~ Athens + Italy - Rome
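
The arithmetic behind such analogies is easiest to follow on hand-made vectors. In the sketch below, each of the four dimensions has an invented meaning (country, capital, Greek, Italian), so that adding and subtracting vectors moves between concepts exactly. These toy vectors are purely hypothetical. Real embeddings have hundreds of dimensions without such clean meanings, which is why the similarity measured in the cell below is not exactly 1.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-made dimensions: [is_country, is_capital, greek, italian]
toy = {
    "Greece": [1, 0, 1, 0],
    "Athens": [0, 1, 1, 0],
    "Italy":  [1, 0, 0, 1],
    "Rome":   [0, 1, 0, 1],
}

# Athens + Italy - Rome: remove "capital" and "italian", add "country"
combined = [a + i - r for a, i, r in zip(toy["Athens"], toy["Italy"], toy["Rome"])]
analogy_similarity = cosine(toy["Greece"], combined)
print(combined, round(analogy_similarity, 3))
```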

In [None]:
sentences = ["Greece", "Athens", "Italy", "Rome"]
embeddings = embeddings_model.encode(sentences)
print(cosine_similarity((embeddings[0]).reshape(1, -1), (embeddings[1]+embeddings[2]-embeddings[3]).reshape(1, -1))[0, 0])

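If the two similarities printed above are high (close to 1), the combined vector points in nearly the same direction as the embedding of the target word, which suggests that the embeddings capture these relationships between words.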

## Sentence embeddings

In [None]:
my_sentences = [
    # Interrelated sentences - group 1
    "The data is preprocessed to remove noise and outliers.",
    "Noise and outliers are eliminated during data preprocessing.",
    "Preprocessing cleans the data by filtering out noise and irregularities.",

    # Interrelated sentences - group 2
    "Paris is the capital of France.",
    "Athens is the capital of Greece.",
    "Rome is the capital of Italy."
]
my_embeddings = embeddings_model.encode(my_sentences)
similarity_matrix = cosine_similarity(my_embeddings)
print(similarity_matrix)
visualize_similarity_matrix(similarity_matrix, my_sentences, mat_size=8)

# Retrieval Augmented Generation (RAG)

In [None]:
from sentence_transformers import SentenceTransformer
import torch
import numpy as np
import os
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
if torch.cuda.device_count()>0:
    my_device = "cuda"
    print(f"You have {torch.cuda.device_count()} GPUs available.")
else:
    my_device = "cpu"
    print("You have no GPUs available. Running on CPU.")

## Embeddings Model

In [None]:
embeddings_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', token=os.environ["HF_TOKEN"],
                                       cache_folder=os.environ["HF_HOME"], device=my_device)

## Text for retrieval

In [None]:
my_text = """
This notebook demonstrates the use of large language models for generating text, embeddings and Retrieval Augmented Generation (RAG).
It begins by setting up the model and tokenizer using the Hugging Face Transformers library, ensuring that the pad token is correctly defined.
The notebook then illustrates how to generate text using the model in both streaming and non-streaming modes.
It applies a chat template to user messages, moves inputs to a GPU if available, and generates outputs with a specified maximum number of tokens.
The generated text is cleaned to remove system messages, and the time taken for generation is displayed.
In addition to text generation, the notebook explores embeddings using the SentenceTransformer library.
It encodes words and sentences to compute cosine similarity matrices, which are visualized to show relationships between different words and sentences.
The notebook also demonstrates the concept of RAG by encoding a user's question and sorting sentences based on their similarity to the question.
This approach helps in retrieving relevant information from a text corpus.
Finally, the notebook sets up a pipeline for generating responses to user queries, showcasing the integration of text generation and retrieval techniques.
"""

In [None]:
my_sentences = my_text.split('\n')
# Strip whitespace and drop empty or whitespace-only lines
my_sentences = [sent.strip() for sent in my_sentences if sent.strip()]
my_embeddings = embeddings_model.encode(my_sentences)
print(my_embeddings.shape)

np.savetxt('sentence_embeddings.txt', my_embeddings, delimiter=',')

## Encode user's question

In [None]:
my_question = "What is this notebook about?"
my_question_embedding = embeddings_model.encode([my_question])

## Sort sentences based on the similarity to the question embedding
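
Retrieval boils down to ranking the sentences by their similarity to the question and keeping the top few. The sketch below shows the ranking step in plain Python, mirroring what `argsort()[::-1]` does in the cell below. The function `rank_by_similarity`, the placeholder sentences and the scores are all made up for illustration.

```python
def rank_by_similarity(similarities):
    """Return indices sorted from most to least similar."""
    return sorted(range(len(similarities)), key=lambda i: similarities[i], reverse=True)


sentences = ["alpha", "beta", "gamma", "delta"]  # placeholder sentences
similarities = [0.12, 0.87, 0.45, 0.30]          # made-up similarity scores

order = rank_by_similarity(similarities)
top_k = [sentences[i] for i in order[:2]]
print(order)
print(top_k)
```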

In [None]:
similarity_to_question = cosine_similarity(my_question_embedding, my_embeddings).flatten()
sorted_indices = similarity_to_question.argsort()[::-1]  # Sort in descending order
print(sorted_indices)

## Get sorted sentences

In [None]:
sorted_sentences = [my_sentences[i] for i in sorted_indices]
print("Sorted sentences based on cosine similarity to the question:")
for i, sentence in enumerate(sorted_sentences):
    print("-"*100)
    print(f"Sentence {i+1}, similarity: {similarity_to_question[sorted_indices[i]]:.2f}")
    print(sentence)

## Setup messages

In [None]:
nof_keep_sentences = 3
system_instructions = "You are a helpful assistant."
my_messages = [{"role": "system", "content": system_instructions}]
my_prompt = "Use the following sentences:"
for sentence in sorted_sentences[:nof_keep_sentences]:
    my_prompt += f"\n{sentence}"
my_prompt += f"\n\nAnswer the question:\n\n'{my_question}'"
my_messages.append({"role": "user", "content": my_prompt})
my_prompt

## Answer the question

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, TextStreamer, BitsAndBytesConfig
import os
my_model = "mistralai/Mistral-7B-Instruct-v0.3"
# my_model = "meta-llama/Llama-3.2-3B-Instruct"
# my_model = "ilsp/Llama-Krikri-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(my_model,
                                          token=os.environ["HF_TOKEN"],
                                          cache_dir=os.environ["HF_HOME"])
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(   my_model,
                                                token=os.environ["HF_TOKEN"],
                                                cache_dir=os.environ["HF_HOME"],
                                                device_map="auto",
                                                quantization_config=quantization_config,
                                                torch_dtype="auto")

In [None]:
# Depending on the model, the pad token might not be defined
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    print("Pad token was None, so it was set to eos token.")

streamer = TextStreamer(tokenizer)

pipe = pipeline("text-generation",
                model=model,
                tokenizer=tokenizer,
                device_map="auto")
MAXIMUM_TOKENS = 128
outputs = pipe(my_messages,
               max_new_tokens=MAXIMUM_TOKENS,
               pad_token_id=pipe.tokenizer.eos_token_id,
               streamer=streamer)

In [None]:
my_output = outputs[0]["generated_text"][-1]['content']
for i in range(0, len(my_output), 80):
    print(my_output[i:i+80])