<a href="https://colab.research.google.com/github/calmrocks/master-machine-learning-engineer/blob/main/Tools/DeepSeek.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Loading a DeepSeek Model in Google Colab

This notebook walks through loading a distilled DeepSeek-R1 model (`DeepSeek-R1-Distill-Qwen-14B`) with the Hugging Face `transformers` library in Google Colab. The model is a general-purpose language model that can be used for a range of natural language processing tasks.

## Steps to Load the Model

1. **Install Required Libraries**:
   Ensure you have the latest versions of `transformers`, `accelerate`, and `bitsandbytes` installed.

2. **Load the Model and Tokenizer**:
   Use the `AutoModelForCausalLM` and `AutoTokenizer` classes to load the model and tokenizer. We set `trust_remote_code=True`, which allows `transformers` to run custom code from the model repository if the model needs it.

3. **Quantization Configuration**:
   Define a 4-bit quantization configuration with `BitsAndBytesConfig` and pass it to the model loader to reduce GPU memory usage.

4. **Generate Text**:
   Test the model by generating text based on a sample input prompt.

## 🚀 How to Enable the GPU Runtime

To use GPU acceleration in this notebook, follow these steps:

### 1. For Google Colab:
1. Click on `Runtime` in the top menu
2. Select `Change runtime type`
3. Choose `GPU` from the Hardware accelerator dropdown
4. Click `Save`

### 2. Verify GPU Availability
Run the following code to check that a GPU is attached to the runtime:

In [5]:
import torch
print("GPU Available:", torch.cuda.is_available())
print("GPU Device Name:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU")

GPU Available: True
GPU Device Name: Tesla T4
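
Beyond the device name, it helps to know how much memory the GPU has, since that determines which model size fits. The following is a minimal sketch, not part of the original notebook, that queries `nvidia-smi` (assumed to be on the `PATH`, as it typically is on Colab GPU runtimes). The helper names `parse_gpu_csv` and `query_gpus` are hypothetical.

```python
import shutil
import subprocess


def parse_gpu_csv(text):
    """Parse 'name, memory_mib' lines into (name, memory_mib) tuples."""
    gpus = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        # Split on the last comma so GPU names containing commas survive.
        name, mem = line.rsplit(",", 1)
        gpus.append((name.strip(), int(mem.strip())))
    return gpus


def query_gpus():
    """Return [(name, total_memory_mib), ...], or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
        return parse_gpu_csv(result.stdout)
    except (subprocess.CalledProcessError, OSError, ValueError):
        return []


for name, mem_mib in query_gpus():
    print(f"{name}: {mem_mib} MiB total")
```

On a runtime without a GPU the loop prints nothing, because `query_gpus()` returns an empty list.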


# Load Model and Tokenizer
We'll load the model with 4-bit `bitsandbytes` quantization to save GPU memory.
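
A rough back-of-the-envelope estimate shows why 4-bit loading matters for the 14B model used below. This sketch assumes roughly 14.8 billion parameters and about 16 GB of nominal memory on a T4; both are approximations rather than figures from this notebook. It counts weights only: activations, the KV cache, quantization constants, and any layers left unquantized add to the total.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 14.8e9   # assumed parameter count for the 14B model
T4_MEMORY_GB = 16.0   # assumed nominal T4 memory; usable memory is lower

for label, bits in [("float16", 16), ("4-bit", 4)]:
    gb = weight_memory_gb(NUM_PARAMS, bits)
    verdict = "fits" if gb < T4_MEMORY_GB else "does not fit"
    print(f"{label:>8}: {gb:5.1f} GB of weights ({verdict} in {T4_MEMORY_GB:.0f} GB)")
```

Under these assumptions the half-precision weights alone exceed the card's memory, while the 4-bit weights leave headroom for generation.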

In [None]:
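# -y skips the confirmation prompt, which would otherwise block the cell
!pip uninstall -y transformers accelerate bitsandbytes
# -U installs the latest available versions
!pip install -U transformers accelerate bitsandbytes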

In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# The 14B model is the largest we can fit on a T4 GPU
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"

# Define quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False
)

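# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True  # Allows custom code from the model repo, if any
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # Allows custom code from the model repo, if any
    quantization_config=quantization_config,
    torch_dtype=torch.float16,  # dtype for the layers that are not quantized
    device_map="auto"  # Let accelerate place the quantized weights on the GPU
)

# Do not call model.to(...) here: some transformers versions reject .to()
# on bitsandbytes-quantized models, and device_map already handles placement.
model.eval()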

# Verify the model is loaded correctly
print("Model loaded successfully.")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
`low_cpu_mem_usage` was None, now default to True since model is quantized.


Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.


Model loaded successfully.


# Test the Model
Let's create a simple function to generate text and test it with a sample prompt.
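
DeepSeek-R1 style models usually emit their reasoning before a closing `</think>` tag and the final answer after it. As a small sketch (the helper name `split_reasoning` is hypothetical and not part of `transformers`), the generated string can be split on that tag so the answer can be shown on its own:

```python
def split_reasoning(text, end_tag="</think>"):
    """Split generated text into (reasoning, answer) at the last end_tag.

    If the tag is absent, the whole text is treated as the answer.
    """
    head, sep, tail = text.rpartition(end_tag)
    if not sep:
        return "", text.strip()
    # Drop an opening <think> tag if the model emitted one.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


sample = "<think>The user wants a short answer.</think>\n\nQubits can be 0 and 1 at once."
reasoning, answer = split_reasoning(sample)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

In this notebook it could be applied to the output of `generate_text`, for example `reasoning, answer = split_reasoning(generate_text(prompt))`.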

In [4]:
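def generate_text(prompt, max_new_tokens=512):
    # Tokenize the prompt and move the tensors to the model's device
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,  # Caps generated tokens, excluding the prompt
        num_return_sequences=1,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

    # Decode the full sequence (prompt plus continuation) back to text
    return tokenizer.decode(outputs[0], skip_special_tokens=True)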

# Test with a sample prompt
prompt = "Explain quantum computing in simple terms:"
response = generate_text(prompt)
print(response)

Explain quantum computing in simple terms: what is it, how does it work, and why is it important?

</think>

Quantum computing is a type of computing that uses the principles of quantum mechanics to perform calculations. Unlike classical computers, which use bits (0s and 1s) to process information, quantum computers use quantum bits, or qubits. Qubits can exist in a state of superposition, meaning they can be both 0 and 1 at the same time, allowing quantum computers to perform many calculations simultaneously. This makes quantum computing potentially much faster than classical computing for certain tasks.

### How it works:
1. **Qubits**: Instead of bits, quantum computers use qubits. Qubits can be made from various physical systems, such as photons or superconducting circuits.
2. **Superposition**: Qubits can exist in multiple states at once due to superposition, allowing quantum computers to process a vast number of possibilities simultaneously.
3. **Entanglement**: Qubits can be ent