<a href="https://colab.research.google.com/github/konmavedant/Docker/blob/main/task2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!free -h  # Check RAM
!nvidia-smi  # Check GPU (if available)

               total        used        free      shared  buff/cache   available
Mem:            12Gi       787Mi       7.5Gi       1.0Mi       4.4Gi        11Gi
Swap:             0B          0B          0B
Fri Feb 28 18:13:48 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   58C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                        
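
The same RAM check can be done from Python rather than through the shell, which makes the numbers usable in later cells. The sketch below is a minimal parser for `/proc/meminfo`-style text. It assumes a Linux runtime such as Colab; the function names and the sample values are hypothetical and are not taken from this notebook's output.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: value} dict.

    Fields reported in kB are converted to bytes; unit-less fields
    (for example HugePages_Total) are returned as plain counts.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue  # Skip blank or malformed lines
        value = int(parts[0])
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024
        info[name.strip()] = value
    return info


def to_gib(num_bytes):
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 2**30


# Illustrative sample only; these numbers are made up for the example.
SAMPLE_MEMINFO = """MemTotal:       13290460 kB
MemAvailable:   12000000 kB
HugePages_Total:       0
"""

info = parse_meminfo(SAMPLE_MEMINFO)
print(f"Total RAM:     {to_gib(info['MemTotal']):.1f} GiB")
print(f"Available RAM: {to_gib(info['MemAvailable']):.1f} GiB")
```

On a live runtime, the sample string would be replaced with `open("/proc/meminfo").read()`.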

In [None]:
!pip install transformers torch datasets accelerate bitsandbytes
!apt-get update && apt-get install -y git

Collecting datasets
  Downloading datasets-3.3.2-py3-none-any.whl.metadata (19 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.45.3-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from 
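
Before loading the model in the next cell, it can help to estimate whether the 4-bit weights fit in the T4's 15360 MiB. The sketch below is a back-of-the-envelope calculation, not a measurement. It uses the two shard sizes that the download progress bars report further down (8.61G and 6.62G) and assumes that these are decimal gigabytes and that the checkpoint stores 2 bytes per parameter. It ignores quantization constants and any layers that bitsandbytes leaves unquantized, so the real footprint will be somewhat higher. The function names are hypothetical.

```python
# Shard sizes as shown by the download progress bars (assumed decimal GB)
SHARD_SIZES_GB = [8.61, 6.62]


def estimate_param_count(shard_sizes_gb, bytes_per_param=2):
    """Estimate the parameter count from checkpoint size (16-bit storage assumed)."""
    total_bytes = sum(shard_sizes_gb) * 1e9
    return total_bytes / bytes_per_param


def estimate_quantized_gb(param_count, bits_per_param=4):
    """Estimate weight memory in decimal GB at the given bit width."""
    return param_count * bits_per_param / 8 / 1e9


params = estimate_param_count(SHARD_SIZES_GB)
weights_4bit_gb = estimate_quantized_gb(params)
print(f"~{params / 1e9:.1f}B parameters")
print(f"~{weights_4bit_gb:.1f} GB of weights at 4 bits per parameter")
```

Under these assumptions the weights take roughly a quarter of the 16-bit checkpoint size, which leaves most of the T4's memory for activations and the KV cache.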

In [None]:
# Step 1: Import Libraries
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Step 2: Check Resources
print("Checking available resources:")
!free -h  # Display RAM
!nvidia-smi  # Display GPU info (the T4 reports 15360 MiB, i.e. 15 GiB, of VRAM)

# Step 3: Define Model and Quantization Settings
model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # Replace with latest distilled model from Hugging Face
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,  # 4-bit quantization to fit T4 VRAM
    bnb_4bit_compute_dtype=torch.float16  # Compute in FP16; the T4 has no native bfloat16 support
)

# Step 4: Load Model and Tokenizer
print(f"Loading {model_name} with 4-bit quantization onto T4 GPU...")
try:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quantization_config,
        device_map="cuda:0",  # Explicitly use T4 GPU
        torch_dtype=torch.float16  # Use FP16 for faster GPU inference
    )
    print("Model loaded successfully onto GPU!")
except Exception as e:
    print(f"Error loading model: {e}")
    raise

# Step 5: Maximize Context Length
initial_context_length = 16384  # Start with 16K tokens; the sweep below probes 8K, 16K, and 32K
tokenizer.model_max_length = initial_context_length  # Affects tokenizer truncation only, not the model's position limit
print(f"Setting initial context length to {initial_context_length} tokens.")

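# Function to test whether a prompt truncated to max_length fits in GPU memory.
# "Success" means generation ran without an error; it says nothing about output quality.
def test_context_length(max_length, test_text):
    tokenizer.model_max_length = max_length
    try:
        inputs = tokenizer(test_text, return_tensors="pt", truncation=True, max_length=max_length).to("cuda:0")
        # Generating a few tokens forces a full forward pass over the prompt
        model.generate(**inputs, max_new_tokens=10)
        print(f"Success at {max_length} tokens")
        return True
    except RuntimeError as e:  # torch.cuda.OutOfMemoryError is a RuntimeError subclass
        print(f"Failed at {max_length} tokens: {e}")
        return False
    finally:
        torch.cuda.empty_cache()  # Release cached blocks before the next, larger attempt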

# Generate a long test text (50,000 words, well over 32K tokens), so truncation sets the tested length
test_text = "AI is advancing rapidly in 2025, transforming industries and research. " * 5000

# Dynamically adjust context length
context_lengths = [8192, 16384, 32768]  # Lengths to probe, smallest first; the loop stops at the first failure
max_working_length = 8192  # Default safe value
for length in context_lengths:
    if test_context_length(length, test_text):
        max_working_length = length
    else:
        break

tokenizer.model_max_length = max_working_length
print(f"Set maximum working context length to {max_working_length} tokens.")

# Step 6: Run Inference
def run_inference(prompt):
    print(f"Running inference with prompt: {prompt[:50]}...")
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=max_working_length).to("cuda:0")
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id  # Avoid padding warnings
    )
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print("Generated output:")
    print(result)
    print("\nResource usage after inference:")
    !free -h
    !nvidia-smi

# Test with a sample prompt
sample_prompt = "Summarize AI advancements in 2025 based on current trends."
run_inference(sample_prompt)

# Step 7: Test Long Context
long_prompt = "Repeat this sentence to test context: AI is the future. " * 1000
print(f"Testing long context with ~{len(long_prompt.split())} words...")
run_inference(long_prompt)

# Step 8: Cleanup (Optional)
print("Cleaning up to free GPU and RAM...")
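del model
del tokenizer
import gc
gc.collect()  # Collect lingering references so the CUDA allocator can release the weights
torch.cuda.empty_cache()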
print("Cleanup complete. Check resources:")
!free -h
!nvidia-smi

Checking available resources:
               total        used        free      shared  buff/cache   available
Mem:            12Gi       1.5Gi       330Mi       1.0Mi        10Gi        10Gi
Swap:             0B          0B          0B
Fri Feb 28 18:18:50 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   42C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Defau

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/3.07k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/680 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/28.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-000002.safetensors:   0%|          | 0.00/8.61G [00:00<?, ?B/s]

model-00002-of-000002.safetensors:   0%|          | 0.00/6.62G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/181 [00:00<?, ?B/s]

Model loaded successfully onto GPU!
Setting initial context length to 16384 tokens.


Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


Success at 8192 tokens


Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


Success at 16384 tokens


Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


Success at 32768 tokens
Set maximum working context length to 32768 tokens.
Running inference with prompt: Summarize AI advancements in 2025 based on current...
Generated output:
Summarize AI advancements in 2025 based on current trends. What specific areas are AI excelling in? What are the challenges it faces?

AI is making remarkable strides in 2025, driven by advancements in machine learning and neural networks. Specific areas where AI excels include natural language processing, computer vision, robotics, and autonomous systems. These areas leverage cutting-edge algorithms and hardware improvements, leading to breakthroughs in tasks like translation, image recognition, robotic manipulation, and self-driving cars. Challenges AI faces include ethical concerns, data biases, computational costs

Resource usage after inference:
               total        used        free      shared  buff/cache   available
Mem:            12Gi       3.1Gi       250Mi        15Mi       9.4Gi       9.3Gi
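
One reason all three context lengths succeeded is that the key/value cache stays small relative to the T4's 15360 MiB. The sketch below estimates the KV cache size for the three tested lengths. The layer count, KV head count, and head dimension are assumptions about this model's architecture and do not appear anywhere in this notebook; on a live session they can be read from `model.config` (`num_hidden_layers`, `num_key_value_heads`, and `hidden_size // num_attention_heads`). The estimate covers the cache only, not activations or attention workspace, so actual peak memory is higher.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    The leading factor of 2 accounts for one key tensor and one value
    tensor per layer; bytes_per_value=2 corresponds to FP16.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * bytes_per_value * seq_len * batch_size)


# Assumed architecture values (not taken from this notebook); verify against model.config.
ASSUMED_CONFIG = dict(num_layers=28, num_kv_heads=4, head_dim=128)

for seq_len in (8192, 16384, 32768):
    mib = kv_cache_bytes(seq_len, **ASSUMED_CONFIG) / 2**20
    print(f"{seq_len:>6} tokens -> {mib:,.0f} MiB of KV cache")
```

With these assumed values the cache costs 56 KiB per token, so even the 32768-token run needs under 2 GiB for it.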
