# **Fine-Tuning DeepSeek R1 Webinar Code Along**
---

### Installing Requirements
*   **transformers** - Hugging Face library used to load and interact with the DeepSeek model
*   **datasets** - easy access to many processed datasets and processing tools
*   **peft** - parameter-efficient fine-tuning; enables us to use LoRA
*   **torch** - PyTorch, the backend deep learning framework for the computations

In [None]:
!pip install transformers datasets peft torch


### Imports
Importing libraries and getting the deepseek model and tokenizer

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B" # a light model

# Pulls everything directly from Hugging Face
# we define the model and tokenizer using the from_pretrained function
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device) # move the model to the GPU when one is available (e.g. Colab's T4)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.



### Generate Domain-Specific Document

In [None]:
# this will be the dataset for today
text = """Artificial Intelligence (AI) is transforming industries across the globe. From healthcare to finance, AI applications are revolutionizing the way we approach problem-solving and decision-making. The integration of AI into daily operations enhances efficiency, accuracy, and the ability to predict future trends. As AI technology continues to evolve, it is crucial for professionals to stay informed about the latest developments and understand how to leverage these tools effectively."""

### Convert Text Data into HuggingFace Data

In [None]:
from datasets import Dataset

sentences = text.split(". ")

dataset = Dataset.from_dict({"text": sentences}) # converts a Python dictionary into a Dataset object

In [None]:
print(sentences)

['Artificial Intelligence (AI) is transforming industries across the globe', 'From healthcare to finance, AI applications are revolutionizing the way we approach problem-solving and decision-making', 'The integration of AI into daily operations enhances efficiency, accuracy, and the ability to predict future trends', 'As AI technology continues to evolve, it is crucial for professionals to stay informed about the latest developments and understand how to leverage these tools effectively.']
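
Note that splitting on `". "` drops the period from every sentence except the last, as the printed list shows. If keeping the punctuation matters for training, one possible alternative (a sketch, not part of the original webinar code) is to split on the whitespace that follows a period, using a regular-expression lookbehind. The helper name `split_sentences` and the sample text are hypothetical.

```python
import re

def split_sentences(text: str) -> list:
    """Split on whitespace that follows a period, keeping the period."""
    # (?<=\.) is a lookbehind: it requires a preceding "." without consuming it.
    return [s for s in re.split(r"(?<=\.)\s+", text.strip()) if s]

sample = "AI is transforming industries. It enhances efficiency. Stay informed."
print(split_sentences(sample))
# ['AI is transforming industries.', 'It enhances efficiency.', 'Stay informed.']
```

This simple rule would also split after abbreviations such as "e.g. ", so it is only suitable for clean text like the paragraph used here.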


### Examine the Data Schema
A quick look into the dataset to make sure everything is structured correctly

In [None]:
print(type(dataset))

<class 'datasets.arrow_dataset.Dataset'>


This means we are working with a Hugging Face `Dataset` object, which is backed by Apache Arrow, a columnar format that is memory-efficient for large datasets.

In [None]:
dataset

Dataset({
    features: ['text'],
    num_rows: 4
})

### Setting up Tokenizer
Let's now focus on tokenizing the data - converting each sentence into numbers.

In [None]:
def preprocess_function(examples):
  inputs = tokenizer(examples['text'], truncation = True, padding = "max_length", max_length = 512) # pad or truncate every sentence to exactly 512 tokens

  inputs["labels"] = inputs["input_ids"].copy()

  return inputs

tokenized_datset = dataset.map(preprocess_function, batched = True)

Map:   0%|          | 0/4 [00:00<?, ? examples/s]

In [None]:
tokenized_datset

Dataset({
    features: ['text', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 4
})

We can see we have more features now; however, the number of rows is the same.

* input_ids = tokenized version of the text
* attention_mask = marks which tokens are real and which are padding that is not useful for training
* labels = a copy of the input ids
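
Because `labels` is a plain copy of `input_ids`, the padding positions also count toward the loss. A common refinement (an assumption here, not something the webinar code does) is to set the label to `-100` wherever `attention_mask` is 0, since `-100` is the index that PyTorch's cross-entropy loss ignores by default. A minimal pure-Python sketch, with a hypothetical helper `mask_padding_labels` and made-up non-padding token IDs:

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids, replacing padded positions (mask == 0) with IGNORE_INDEX."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]

# Toy left-padded example; the non-padding IDs are invented for illustration.
input_ids = [151643, 151643, 9707, 1879]
attention_mask = [0, 0, 1, 1]
print(mask_padding_labels(input_ids, attention_mask))  # [-100, -100, 9707, 1879]
```

With `batched = True`, the tokenizer returns one list per example, so a helper like this would be applied to each example in turn.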

In [None]:
tokenized_datset["text"]

['Artificial Intelligence (AI) is transforming industries across the globe',
 'From healthcare to finance, AI applications are revolutionizing the way we approach problem-solving and decision-making',
 'The integration of AI into daily operations enhances efficiency, accuracy, and the ability to predict future trends',
 'As AI technology continues to evolve, it is crucial for professionals to stay informed about the latest developments and understand how to leverage these tools effectively.']

In [None]:
tokenized_datset["input_ids"]

[[151643,
  151643,
  151643,
  ...
(output truncated; the visible entries are all 151643, the padding token ID)


In [None]:
tokenized_datset["attention_mask"]

[[0,
  0,
  0,
  ...
(output truncated; the visible entries are all 0, i.e. padding positions)


1 for real (unpadded) tokens and 0 for padding tokens
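
To make the mask concrete, here is a small pure-Python sketch of padding a short sequence to a fixed length on the left, which is the layout the output above suggests. The helper `pad_left` and the real-token IDs are hypothetical; only the padding ID 151643 is taken from the output.

```python
PAD_ID = 151643  # padding token ID seen in the input_ids output

def pad_left(token_ids, max_length, pad_id=PAD_ID):
    """Left-pad token_ids to max_length and build the matching attention mask."""
    token_ids = token_ids[:max_length]  # simple truncation: keep the first max_length tokens
    n_pad = max_length - len(token_ids)
    input_ids = [pad_id] * n_pad + token_ids
    attention_mask = [0] * n_pad + [1] * len(token_ids)
    return input_ids, attention_mask

ids, mask = pad_left([11, 22, 33], max_length=6)
print(ids)   # [151643, 151643, 151643, 11, 22, 33]
print(mask)  # [0, 0, 0, 1, 1, 1]
```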

### Setting up LoRA

In [None]:
from peft import get_peft_model, LoraConfig, TaskType

lora_config = LoraConfig(
    task_type = TaskType.CAUSAL_LM,
    r = 16, # rank of the low-rank update matrices
    lora_alpha = 32, # scaling factor; the LoRA update is scaled by lora_alpha / r
    lora_dropout = 0.005, # dropout on the LoRA layers, adds light regularization
    bias = "none",
    target_modules = ["q_proj", "v_proj"] # apply LoRA to the query and value projections of the attention layers
)

model = get_peft_model(model, lora_config) # wraps the base model: the original weights are frozen and only the small LoRA adapter weights are trained
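
As a rough check on how small the trainable part is, the adapter size can be worked out by hand. For a wrapped linear layer with `in_features` inputs and `out_features` outputs, LoRA adds a matrix A of shape `r x in_features` and a matrix B of shape `out_features x r`. The sketch below assumes the layer shapes shown in the model printout later in this notebook (28 decoder layers, `q_proj` 1536 to 1536, `v_proj` 1536 to 256); the helper name `lora_params` is hypothetical.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r

r, lora_alpha, num_layers = 16, 32, 28

per_layer = lora_params(1536, 1536, r) + lora_params(1536, 256, r)  # q_proj + v_proj
total = num_layers * per_layer
scaling = lora_alpha / r  # factor applied to the low-rank update

print(per_layer)  # 77824
print(total)      # 2179072
print(scaling)    # 2.0
```

For the actual wrapped model, PEFT's `model.print_trainable_parameters()` reports the trainable and total parameter counts directly.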

### Configuration of Training Hyperparameters

In [None]:
from transformers import TrainingArguments, Trainer

trainer_args = TrainingArguments(
    per_device_train_batch_size = 1, # tiny batch size due to GPU limitations
    gradient_accumulation_steps = 8, # accumulate gradients over up to 8 batches before each optimizer step (a larger effective batch size)
    warmup_steps = 200, # note: more than the 150 optimizer steps this run performs, so the learning rate is still warming up when training ends
    num_train_epochs = 150, # we can do more than this if needed
    learning_rate = 2e-4,
    fp16 = True, # mixed precision training, which makes training a bit lighter on memory
    logging_steps = 10,
    output_dir = "./results",
    report_to = "none",
    remove_unused_columns = False
)
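
It helps to work out how many optimizer steps these settings produce. With 4 examples, a batch size of 1, and 8 accumulation steps, an epoch has fewer batches than the accumulation window, so this sketch assumes one optimizer step per epoch, which matches the `global_step=150` reported after training. It also assumes the default linear warmup, where the learning rate grows in proportion to the step count until `warmup_steps` is reached. The function names are hypothetical.

```python
def optimizer_steps_per_epoch(num_examples, batch_size, grad_accum_steps):
    """Optimizer updates per epoch, with at least one update per epoch (assumed)."""
    batches = -(-num_examples // batch_size)  # ceiling division
    return max(batches // grad_accum_steps, 1)

def warmup_lr(step, peak_lr, warmup_steps):
    """Learning rate during linear warmup; capped at peak_lr (decay phase omitted)."""
    return peak_lr * min(step / warmup_steps, 1.0)

steps_per_epoch = optimizer_steps_per_epoch(4, 1, 8)
total_steps = steps_per_epoch * 150  # num_train_epochs = 150
final_lr = warmup_lr(total_steps, peak_lr=2e-4, warmup_steps=200)

print(total_steps)  # 150
print(final_lr)     # roughly 1.5e-4
```

Since 150 is below the 200 warmup steps, the learning rate in this run would reach only about three quarters of `2e-4` under these assumptions. Lowering `warmup_steps` is one way to change that.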

### Free Up Memory
We have to do this since we are using Colab, where GPU memory may be limited.

In [None]:
model = model.to("cpu") # temporarily to free up the gpu memory before training

trainer = Trainer(
    model = model,
    args = trainer_args,
    train_dataset = tokenized_datset,
)

import torch
import gc

gc.collect() # free up as much memory as possible
torch.cuda.empty_cache() # release cached GPU memory left over from previous operations

model = torch.compile(model) # note: the Trainer above still holds the uncompiled model, so training itself is not compiled
model = model.to("cuda") # put the model back on the gpu

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [None]:
trainer.train()

Step,Training Loss
10,10.3745
20,9.6218
30,7.2393
40,2.9203
50,0.3848
60,0.1559
70,0.1399
80,0.1237
90,0.1089
100,0.0952


TrainOutput(global_step=150, training_loss=2.094689752658208, metrics={'train_runtime': 167.6768, 'train_samples_per_second': 3.578, 'train_steps_per_second': 0.895, 'total_flos': 2849390670643200.0, 'train_loss': 2.094689752658208, 'epoch': 150.0})

### Saving Model

In [None]:
domain = "madeby-me-v1"

model.save_pretrained(f"fine-tuned-deepseek-r1-1.5b-{domain}")
tokenizer.save_pretrained(f"fine-tuned-deepseek-r1-1.5b-{domain}")

('fine-tuned-deepseek-r1-1.5b-madeby-me-v1/tokenizer_config.json',
 'fine-tuned-deepseek-r1-1.5b-madeby-me-v1/special_tokens_map.json',
 'fine-tuned-deepseek-r1-1.5b-madeby-me-v1/tokenizer.json')

### Inference

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

In [None]:
domain = "madeby-me-v1"
model_path = f"fine-tuned-deepseek-r1-1.5b-{domain}"

model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 1536)
    (layers): ModuleList(
      (0-27): 28 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): lora.Linear(
            (base_layer): Linear(in_features=1536, out_features=1536, bias=True)
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.005, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=1536, out_features=16, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=16, out_features=1536, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
            (lora_magnitude_vector): ModuleDict()
          )
          (k_proj): Linear(in_features=1536, out_features=256, bias=True)
          (v_proj): lora.Linear(
            (base_layer): Linear(in_features=1536, out_features=256, b

### Function to generate text from the model

In [None]:
def generate_text(prompt, max_length = 100):
  inputs = tokenizer(prompt, return_tensors = "pt").to(device)

  with torch.no_grad():
    output = model.generate( # generate a continuation of the prompt
        **inputs,
        max_length = max_length, # cap on the total length (prompt + generated tokens)
        top_k = 50, # sample only from the 50 most likely next tokens
        top_p = 0.9, # nucleus sampling: keep the smallest set of tokens whose cumulative probability reaches 0.9
        temperature = 0.7 # controls randomness: lower values are more deterministic, higher values are more random; 0.7 is a balance
    )

    return tokenizer.decode(output[0], skip_special_tokens = True)

In [None]:
def generate_text(prompt, max_length=100):
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output = model.generate(**inputs, max_length=max_length, temperature=0.7, top_k=50, top_p=0.9)
    return tokenizer.decode(output[0], skip_special_tokens=True)
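
To see what the three sampling parameters do, here is a toy pure-Python sketch that applies temperature, then top-k, then top-p filtering to a small set of made-up logits. It illustrates the idea only and is not the library's implementation; the function name `filter_distribution` and the example logits are hypothetical.

```python
import math

def filter_distribution(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Return {token: prob} after temperature scaling, top-k, and top-p filtering."""
    # Temperature: dividing logits by a value below 1 sharpens the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    # Top-k: keep only the k highest-scoring tokens.
    ranked = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Softmax over the surviving tokens (subtract the max for numerical stability).
    max_logit = ranked[0][1]
    exps = [(tok, math.exp(val - max_logit)) for tok, val in ranked]
    total = sum(e for _, e in exps)
    probs = [(tok, e / total) for tok, e in exps]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    kept_total = sum(p for _, p in kept)
    return {tok: p / kept_total for tok, p in kept}

toy_logits = {"globe": 3.0, "world": 2.0, "board": 0.5, "banana": -2.0}
print(filter_distribution(toy_logits, temperature=0.7, top_k=3, top_p=0.9))
```

In this toy case, top-k removes "banana" and top-p then removes "board", leaving only "globe" and "world" to sample from. In `model.generate`, these settings apply only when sampling is enabled, for example by passing `do_sample=True` or through the model's generation config.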

In [None]:
prompt = "Artificial Intelligence (AI) is transforming industries"
generated_text = generate_text(prompt, max_length = 1024)
print(generated_text)

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


Artificial Intelligence (AI) is transforming industries across the globe. However, concerns about privacy, security, and ethical AI are rising. How can organizations address these challenges effectively?

This question is multiple-choice, with options A to E. The correct answer is option A.
</think>

The challenges posed by Artificial Intelligence (AI) across various industries require organizations to implement effective strategies to address privacy, security, and ethical concerns. Here are some actionable approaches:

1. **Data Privacy and Security:**
   - Implement robust data protection measures, such as encryption and access controls.
   - Use secure authentication methods for user login and data handling.
   - Regularly audit and update systems to address vulnerabilities.

2. **Ethical AI:**
   - Engage in continuous learning and improvement to stay informed about AI advancements.
   - Conduct regular audits to ensure compliance with ethical guidelines.
   - Foster a culture of 