# Running DeepSeek R1 Distilled LLaMA 70B

This notebook provides a step-by-step guide to setting up and running the DeepSeek R1 Distilled LLaMA 70B model.

## Prerequisites

### Hardware Requirements
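- GPU: NVIDIA GPUs with enough combined VRAM for the weights, roughly 140GB in FP16 (70B parameters × 2 bytes each), plus headroom for the KV cache
- Example setups: 2x NVIDIA A100 (80GB) holds the FP16 weights on GPU; 2x NVIDIA A100 (40GB) or 4x NVIDIA RTX 3090 (24GB) cannot, so `device_map="auto"` offloads the remaining layers to CPU RAM, which is much slower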
- System RAM: 128GB recommended
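
A quick way to sanity-check hardware sizing is to multiply the parameter count by the bytes each parameter occupies. The sketch below is a back-of-the-envelope estimate only: the helper name `estimate_weight_memory_gb` and the round figure of 70e9 parameters are illustrative assumptions, and the result ignores the KV cache, activations, and framework overhead.

```python
# Bytes per parameter for common inference dtypes (int4 packs two weights per byte).
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}

def estimate_weight_memory_gb(num_params: float, dtype: str = "float16") -> float:
    """Return the approximate memory needed for the weights alone, in GB (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

if __name__ == "__main__":
    # 70e9 is an assumed round parameter count for a "70B" model.
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>8}: {estimate_weight_memory_gb(70e9, dtype):.0f} GB")
```

Comparing the estimate with the combined VRAM of your GPUs shows whether the weights fit on GPU or whether some layers will have to be offloaded.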

### Software Requirements
- Python 3.12
- CUDA 12.2
- PyTorch 2.2.x
- Transformers library
- Accelerate library

## 1. Environment Setup

In [None]:
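# Install PyTorch (on Linux, the default pip wheels bundle their own CUDA runtime,
# so only a sufficiently recent NVIDIA driver is needed on the host)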
!pip install torch torchvision torchaudio

# Install required libraries
!pip install transformers accelerate

## 2. Verify Installation

In [None]:
import torch
import transformers

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"Transformers version: {transformers.__version__}")

if torch.cuda.is_available():
    print(f"GPU(s) available: {torch.cuda.device_count()}")
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")

## 3. Model Loading and Configuration

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def load_model(model_name="deepseek-ai/DeepSeek-R1-Distill-Llama-70B"):
    """
    Load the DeepSeek model and tokenizer
    """
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    # Load model in half precision, sharded across the available devices
    # (trust_remote_code is omitted: this checkpoint uses the standard Llama architecture)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,  # Use FP16 for efficiency
        device_map="auto",          # Let Accelerate place layers across GPUs (and CPU if needed)
    )
    
    return model, tokenizer

# Load model and tokenizer
model, tokenizer = load_model()
print("Model and tokenizer loaded successfully!")
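
When the weights do not fit comfortably on the GPUs, `from_pretrained` also accepts a `max_memory` dictionary alongside `device_map="auto"` to cap what is placed on each device. The helper below is a minimal sketch of building such a dictionary; the name `build_max_memory`, its parameters, and the 2 GiB headroom default are hypothetical choices for illustration, not part of Transformers.

```python
def build_max_memory(num_gpus: int, per_gpu_gib: int, cpu_gib: int = 0,
                     headroom_gib: int = 2) -> dict:
    """Build a max_memory mapping: GPU index -> "<N>GiB", plus an optional "cpu" entry.

    headroom_gib is reserved on each GPU for activations and the KV cache
    (an assumed value; tune it for your workload).
    """
    if per_gpu_gib <= headroom_gib:
        raise ValueError("per_gpu_gib must be larger than headroom_gib")
    limits = {i: f"{per_gpu_gib - headroom_gib}GiB" for i in range(num_gpus)}
    if cpu_gib > 0:
        limits["cpu"] = f"{cpu_gib}GiB"  # Allow offloading up to this much to CPU RAM
    return limits

if __name__ == "__main__":
    print(build_max_memory(num_gpus=2, per_gpu_gib=40, cpu_gib=100))
    # Possible usage (not run here, since it would download the model):
    # model = AutoModelForCausalLM.from_pretrained(
    #     model_name, torch_dtype=torch.float16, device_map="auto",
    #     max_memory=build_max_memory(2, 40, cpu_gib=100))
```

Leaving headroom on each GPU reduces the chance of running out of memory once generation starts allocating the KV cache.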

## 4. Text Generation Function

In [None]:
def generate_text(
    prompt,
    model,
    tokenizer,
    max_length=100,
    temperature=0.7,
    top_p=0.9,
    num_return_sequences=1
):
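    """
    Generate text using the DeepSeek model

    Args:
        prompt (str): Input text prompt
        model: Loaded causal language model
        tokenizer: Tokenizer matching the model
        max_length (int): Maximum number of newly generated tokens (the prompt is not counted)
        temperature (float): Controls randomness (higher = more random)
        top_p (float): Nucleus sampling parameter
        num_return_sequences (int): Number of sequences to generate

    Returns:
        str or list[str]: Decoded text (prompt included); a list if num_return_sequences > 1
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Generate text; passing **inputs supplies the attention mask along with input_ids
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_length,  # Budget for new tokens only, so long prompts still get output
        temperature=temperature,
        do_sample=True,
        top_p=top_p,
        num_return_sequences=num_return_sequences,
        pad_token_id=tokenizer.eos_token_id
    )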
    
    # Decode and return generated text
    generated_texts = [
        tokenizer.decode(output, skip_special_tokens=True)
        for output in outputs
    ]
    
    return generated_texts[0] if num_return_sequences == 1 else generated_texts
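
To make the `temperature` and `top_p` parameters more concrete, the following pure-Python sketch shows the idea behind temperature scaling and nucleus (top-p) filtering on a toy distribution. It illustrates the concept and is not the Transformers implementation; the function names `apply_temperature` and `top_p_filter` are hypothetical.

```python
import math

def apply_temperature(logits, temperature):
    """Return softmax(logits / temperature); lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # Subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative probability
    reaches top_p, then renormalize. Returns {token_index: probability}."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0, -1.0]
    for t in (0.3, 0.7, 1.5):
        probs = apply_temperature(logits, t)
        print(f"temperature={t}: candidates after top_p=0.9 -> {sorted(top_p_filter(probs, 0.9))}")
```

Lower temperatures concentrate probability on the top tokens, so fewer candidates survive the top-p cut. This is why the lower `temperature` values are used for code and summarization in the examples that follow.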

## 5. Example Usage

### 5.1 Basic Question Answering

In [None]:
prompt = "Explain the concept of quantum computing in simple terms."
response = generate_text(prompt, model, tokenizer, max_length=150)
print("Response:", response)

### 5.2 Creative Writing

In [None]:
prompt = """Write a short story about a robot discovering emotions.
Theme: Self-discovery
Length: Approximately 100 words"""

story = generate_text(
    prompt,
    model,
    tokenizer,
    max_length=200,
    temperature=0.8
)
print("Generated Story:\n", story)

### 5.3 Code Generation

In [None]:
prompt = """Write a Python function to calculate the Fibonacci sequence up to n terms.
Include docstring and type hints."""

code = generate_text(
    prompt,
    model,
    tokenizer,
    max_length=250,
    temperature=0.4  # Lower temperature for more focused code generation
)
print("Generated Code:\n", code)

### 5.4 Text Summarization

In [None]:
text_to_summarize = """
The Industrial Revolution was a period of major industrialization that took place during 
the late 1700s and early 1800s. It began in Great Britain and spread to the rest of 
the world. This era saw the development of new technologies, such as the steam engine, 
which revolutionized manufacturing and transportation. The Industrial Revolution also led 
to significant social and economic changes, including urbanization and the rise of the 
working class.
"""

prompt = f"Summarize the following text in 2-3 sentences:\n{text_to_summarize}"

summary = generate_text(
    prompt,
    model,
    tokenizer,
    max_length=100,
    temperature=0.3  # Lower temperature for more focused summarization
)
print("Summary:\n", summary)

## 6. Memory Management and Cleanup

In [None]:
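# Function to release cached GPU memory if needed
import gc

def clear_gpu_memory():
    """Run garbage collection, release cached CUDA blocks, and report per-GPU usage."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        for i in range(torch.cuda.device_count()):
            print(f"GPU {i} memory allocated: {torch.cuda.memory_allocated(i) / 1e9:.2f} GB")

# Clear cached memory after processing. Weights stay allocated while `model` is
# still referenced; run `del model` first to release them.
clear_gpu_memory()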

## Notes and Best Practices

1. **Memory Management**:
   - Monitor GPU memory usage during generation
   - Use `clear_gpu_memory()` function when needed
   - Consider batch processing for multiple requests

2. **Performance Optimization**:
   - Use FP16 (half-precision) for efficient inference
   - Adjust `max_length` based on your needs
   - Use appropriate `temperature` values for different tasks

3. **Error Handling**:
   - Implement proper error handling in production
   - Monitor for out-of-memory conditions
   - Handle token length limitations

4. **Multi-GPU Setup**:
   - `device_map="auto"` distributes the model's layers across the available GPUs automatically
   - Ensure proper CUDA setup for multi-GPU usage
   - Monitor individual GPU usage
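
As one way to handle the token length limitations mentioned above, the prompt can be trimmed so that it and the generation budget together fit within the model's context window. The sketch below works on plain lists of token IDs. The name `fit_prompt_to_context` and the choice to drop the oldest tokens first are assumptions for illustration, and the context window size must come from your model's configuration.

```python
def fit_prompt_to_context(prompt_ids, max_new_tokens, context_window):
    """Trim prompt token IDs so len(prompt) + max_new_tokens <= context_window.

    Tokens are dropped from the start of the prompt, keeping the most recent ones
    (an assumed policy; other tasks may prefer to keep the beginning instead).
    """
    if max_new_tokens >= context_window:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    budget = context_window - max_new_tokens
    if len(prompt_ids) <= budget:
        return list(prompt_ids)
    return list(prompt_ids[-budget:])

if __name__ == "__main__":
    ids = list(range(10))  # Stand-in for tokenizer(prompt)["input_ids"]
    trimmed = fit_prompt_to_context(ids, max_new_tokens=4, context_window=8)
    print(len(trimmed), trimmed)
```

In production, the same check is also a natural place to log or reject oversized requests instead of silently truncating them.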

Remember to handle the model and GPU resources properly in production environments. This notebook is intended for research and development purposes.