# Basic Inference with PLLuM-8x7B-chat GGUF

This notebook demonstrates how to use the quantized PLLuM model for basic text generation using the `llama-cpp-python` library.

## Prerequisites

- Download a quantized model from [Hugging Face](https://huggingface.co/piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF)
- Install the required packages: `pip install llama-cpp-python psutil` (`psutil` is only needed for the memory measurements in section 6)

In [None]:
# Install required packages if not already installed
%pip install llama-cpp-python psutil

## 1. Load the Model

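First, we'll load the model using the `Llama` class from `llama_cpp`. Adjust `n_ctx`, `n_threads`, and `n_batch` to match your hardware: a larger context window needs more memory, and `n_threads` should not exceed the number of CPU cores available.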

In [None]:
from llama_cpp import Llama
import os

# Set path to the model file - update this to your model's location
model_path = "../models/PLLuM-8x7B-chat-gguf-q4_k_m.gguf"

# Check if the model file exists
if not os.path.exists(model_path):
    raise FileNotFoundError(f"Model file not found at {model_path}. Please download it first.")

# Load the model
llm = Llama(
    model_path=model_path,
    n_ctx=4096,      # Context window size
    n_threads=8,     # Number of CPU threads to use
    n_batch=512,     # Batch size for prompt processing
    verbose=False    # Set to True for more detailed logs
)

print(f"Model loaded successfully: {model_path}")
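
One way to adapt these parameters to the machine is to derive them at runtime. The sketch below is not part of the original notebook: `suggest_llama_params` is a hypothetical helper, and using every logical core for `n_threads` is only a starting point (the physical core count often works better). `n_gpu_layers` is a `Llama` argument that only has an effect when `llama-cpp-python` was built with GPU support.

```python
import os

def suggest_llama_params(n_ctx=4096, use_gpu=False):
    """Return keyword arguments for Llama() based on the local machine.

    Hypothetical helper: the heuristics here are assumptions, not tuned values.
    """
    # os.cpu_count() can return None on some platforms, so fall back to 1.
    n_threads = os.cpu_count() or 1
    params = {
        "n_ctx": n_ctx,
        "n_threads": n_threads,
        "n_batch": 512,
        "verbose": False,
    }
    if use_gpu:
        # -1 asks llama.cpp to offload all layers to the GPU.
        params["n_gpu_layers"] = -1
    return params

params = suggest_llama_params()
print(params)
```

With the model file in place, the result can be passed straight through: `llm = Llama(model_path=model_path, **params)`.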

## 2. Basic Text Generation

Now, let's generate some text with a simple prompt.

In [None]:
# Define a simple prompt in Polish
prompt = "Pytanie: Jakie są największe miasta w Polsce? Odpowiedź:"

# Generate text
output = llm(
    prompt,
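    max_tokens=512,      # Maximum number of tokens to generate
    temperature=0.7,     # Controls randomness (lower = more deterministic)
    top_p=0.95,          # Nucleus sampling: keep the smallest token set with cumulative probability >= 0.95
    top_k=50,            # Sample only from the 50 most likely tokens
    repeat_penalty=1.1   # Values above 1.0 penalize recently generated tokens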
)

# Print the generated text
print(prompt + output["choices"][0]["text"])
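
The call returns an OpenAI-style completion dictionary, so more than the generated text is available. As a sketch, the helper below (`summarize_completion`, a hypothetical name) pulls out the text, the `finish_reason`, and the generated token count; a `finish_reason` of `"length"` indicates that the output was cut off by `max_tokens`. The `sample_output` dictionary is hand-written for illustration and is not real model output.

```python
def summarize_completion(output):
    """Extract text, finish reason and token usage from a completion dict."""
    choice = output["choices"][0]
    usage = output.get("usage", {})
    return {
        "text": choice["text"],
        "finish_reason": choice.get("finish_reason"),
        # "length" means generation stopped because max_tokens was reached.
        "truncated": choice.get("finish_reason") == "length",
        "completion_tokens": usage.get("completion_tokens"),
    }

# Hand-written stand-in for a real response (illustrative values only).
sample_output = {
    "choices": [{"text": " Warszawa, Kraków", "finish_reason": "length"}],
    "usage": {"prompt_tokens": 18, "completion_tokens": 6, "total_tokens": 24},
}

summary = summarize_completion(sample_output)
print(summary)
```

In the notebook, the same helper can be applied to the `output` variable from the cell above.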

## 3. Working with Different Prompts

Let's test the model with a few different types of prompts.

In [None]:
def generate_text(prompt, max_tokens=512, temperature=0.7):
    """Generate text using the loaded model."""
    output = llm(
        prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=0.95,
        top_k=50,
        repeat_penalty=1.1
    )
    return output["choices"][0]["text"]

# Example 1: Factual question
prompt1 = "Pytanie: Kto był pierwszym prezydentem Polski po 1989 roku? Odpowiedź:"
result1 = generate_text(prompt1)
print(f"Prompt: {prompt1}\n\nGenerated: {result1}\n\n{'-'*80}\n")

# Example 2: Summary request
prompt2 = "Napisz krótkie streszczenie bitwy pod Grunwaldem."
result2 = generate_text(prompt2)
print(f"Prompt: {prompt2}\n\nGenerated: {result2}\n\n{'-'*80}\n")

# Example 3: Creative writing
prompt3 = "Napisz krótkie opowiadanie o pierwszym kontakcie ludzi z obcą cywilizacją."
result3 = generate_text(prompt3, max_tokens=1024, temperature=0.8)
print(f"Prompt: {prompt3}\n\nGenerated: {result3}")
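
The question-style prompts in this notebook share a "Pytanie: ... Odpowiedź:" template. With that format a model may keep going and invent a follow-up question; passing a `stop` sequence to the call is one way to end generation at that point. The following is a sketch only: `build_qa_prompt` and `answer_question` are hypothetical helpers, and `fake_llm` stands in for the loaded model so the cell runs without it.

```python
def build_qa_prompt(question):
    """Wrap a question in the Pytanie/Odpowiedź template used in this notebook."""
    return f"Pytanie: {question.strip()} Odpowiedź:"

def answer_question(llm, question, max_tokens=256):
    """Ask one question and halt if the model starts a new 'Pytanie:'."""
    output = llm(
        build_qa_prompt(question),
        max_tokens=max_tokens,
        temperature=0.7,
        stop=["Pytanie:"],  # generation halts when this string appears
    )
    return output["choices"][0]["text"].strip()

def fake_llm(prompt, **kwargs):
    """Stand-in for the real model: returns a fixed completion dict."""
    return {"choices": [{"text": " (przykładowa odpowiedź) "}]}

print(answer_question(fake_llm, "Jakie są największe miasta w Polsce?"))
```

Replacing `fake_llm` with the loaded `llm` object runs the same code against the real model.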

## 4. Exploring Temperature Parameter

Let's explore how the temperature parameter affects text generation.

In [None]:
prompt = "Dokończ zdanie: W przyszłości sztuczna inteligencja będzie"

# Generate with different temperatures
temperatures = [0.3, 0.7, 1.2]

for temp in temperatures:
    print(f"\n### Temperature = {temp} ###\n")
    for _ in range(3):  # Generate 3 samples at each temperature
        result = generate_text(prompt, max_tokens=100, temperature=temp)
        print(f"Sample: {prompt}{result}\n")
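
Samples diverge more at higher temperatures because temperature divides the model's logits before the softmax: values below 1 sharpen the distribution, and values above 1 flatten it. The sketch below shows this on a toy example; the logits are made up for illustration and do not come from PLLuM.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling by 1/temperature."""
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for temp in [0.3, 0.7, 1.2]:
    probs = softmax_with_temperature(toy_logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

At the lowest temperature nearly all probability mass sits on the top-scoring token, which is why low-temperature samples tend to repeat each other.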

## 5. Polish-English Translation Example

Let's test the model's ability to translate between Polish and English.

In [None]:
# Polish to English
pl_to_en_prompt = "Przetłumacz poniższy tekst z języka polskiego na angielski:\n\n'Warszawa jest stolicą Polski i największym miastem w kraju. Położona jest nad rzeką Wisłą i ma bogatą historię.'\n\nTłumaczenie:"
pl_to_en_result = generate_text(pl_to_en_prompt)
print(f"Polish to English:\n{pl_to_en_prompt}\n{pl_to_en_result}\n\n{'-'*80}\n")

# English to Polish
en_to_pl_prompt = "Translate the following text from English to Polish:\n\n'Poland is a country with a rich cultural heritage, beautiful landscapes, and delicious cuisine. It is located in Central Europe.'\n\nTłumaczenie:"
en_to_pl_result = generate_text(en_to_pl_prompt)
print(f"English to Polish:\n{en_to_pl_prompt}\n{en_to_pl_result}")

## 6. Memory Usage and Performance

Let's check the memory usage and performance of the model.

In [None]:
import time
import psutil
import os

def get_memory_usage():
    """Return the resident set size (RSS) of this process in MiB."""
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / (1024 * 1024)

# Measure generation time and memory usage
prompt = "Napisz pięć ciekawostek o kosmosie."

# Record starting memory
start_memory = get_memory_usage()
print(f"Starting memory usage: {start_memory:.2f} MB")

# Measure generation time (perf_counter is a monotonic, high-resolution clock)
start_time = time.perf_counter()
result = generate_text(prompt, max_tokens=256)
end_time = time.perf_counter()

# Record ending memory
end_memory = get_memory_usage()

# Calculate statistics
generation_time = end_time - start_time
memory_increase = end_memory - start_memory

print(f"Generation time: {generation_time:.2f} seconds")
print(f"Current memory usage: {end_memory:.2f} MB")
print(f"Memory increase during generation: {memory_increase:.2f} MB")
print(f"\nGenerated text:\n{prompt}{result}")
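
Wall-clock time is easier to compare across machines and quantization levels when it is normalized to tokens per second. A minimal sketch, assuming the response carries the OpenAI-style `usage` field: `measure_throughput` is a hypothetical helper, and `fake_timed_llm` replaces the real model so the cell runs on its own.

```python
import time

def measure_throughput(llm, prompt, **kwargs):
    """Time one generation call and report generated tokens per second."""
    start = time.perf_counter()
    output = llm(prompt, **kwargs)
    elapsed = time.perf_counter() - start
    tokens = output["usage"]["completion_tokens"]
    return {
        "seconds": elapsed,
        "completion_tokens": tokens,
        "tokens_per_second": tokens / elapsed if elapsed > 0 else float("inf"),
    }

def fake_timed_llm(prompt, **kwargs):
    """Stand-in for the real model; sleeps briefly to simulate work."""
    time.sleep(0.05)
    return {"choices": [{"text": "..."}], "usage": {"completion_tokens": 10}}

stats = measure_throughput(fake_timed_llm, "Napisz pięć ciekawostek o kosmosie.", max_tokens=256)
print(f"{stats['tokens_per_second']:.1f} tokens/s")
```

The measured time includes prompt processing, so short generations understate the steady-state generation speed.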

## 7. Cleanup

When you're done with the model, it's good practice to release the memory it holds.

In [None]:
# Free the model from memory
del llm

# Force Python's garbage collector to run
import gc
gc.collect()

print("Model resources released.")

The examples above show the model handling a range of tasks, including:

- Answering factual questions
- Summarizing historical events
- Creative writing
- Translation between Polish and English

Experiment with different quantization levels and generation parameters to find the best balance between speed, memory use, and output quality for your specific use case.
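
When several quantization levels are downloaded side by side, file size is a quick first proxy for memory requirements. The sketch below lists the `.gguf` files in a directory, smallest first; `list_gguf_files` is a hypothetical helper, and `../models` is the directory assumed earlier in this notebook.

```python
from pathlib import Path

def list_gguf_files(directory):
    """Return (file name, size in GiB) pairs for .gguf files, smallest first."""
    root = Path(directory)
    if not root.is_dir():
        return []
    files = [p for p in root.glob("*.gguf") if p.is_file()]
    files.sort(key=lambda p: p.stat().st_size)
    return [(p.name, p.stat().st_size / 1024**3) for p in files]

for name, size_gib in list_gguf_files("../models"):
    print(f"{name}: {size_gib:.2f} GiB")
```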