**Install the required libraries**

In [1]:
!pip3 install -q -U bitsandbytes==0.42.0
!pip3 install -q -U peft==0.8.2
!pip3 install -q -U trl==0.7.10
!pip3 install -q -U accelerate==0.27.1
!pip3 install -q -U datasets==2.17.0
!pip3 install -q -U transformers==4.38.0

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
kfp 2.5.0 requires google-cloud-storage<3,>=2.2.1, but you have google-cloud-storage 1.44.0 which is incompatible.[0m[31m
[0m[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 24.4.1 requires cubinlinker, which is not installed.
cudf 24.4.1 requires cupy-cuda11x>=12.0.0, which is not installed.
cudf 24.4.1 requires ptxcompiler, which is not installed.
cuml 24.4.0 requires cupy-cuda11x>=12.0.0, which is not installed.
dask-cudf 24.4.1 requires cupy-cuda11x>=12.0.0, which is not installed.
cudf 24.4.1 requires cuda-python<12.0a0,>=11.7.1, but you have cuda-python 12.5.0 which is incompatible.
distributed 2024.1.1 requires dask==2024.1.1, but you have dask 2024.5.2 which is inc

**Import the required libraries**

In [36]:
import torch
from peft import LoraConfig
import os
import transformers
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, GemmaTokenizer
from kaggle_secrets import UserSecretsClient
from huggingface_hub import notebook_login
from trl import SFTTrainer

In [5]:
#lora config
lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

In [4]:
#bits and bytes config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

In [6]:
#gemma 2b
model_id = "google/gemma-2b-it"

In [14]:
user_secrets = UserSecretsClient()
token = user_secrets.get_secret("token")

In [15]:
notebook_login(token)

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [16]:
tokenizer = AutoTokenizer.from_pretrained(model_id, token=token)
tokenizer.padding_side = "right"

model = AutoModelForCausalLM.from_pretrained(
                        model_id, 
                        quantization_config=bnb_config, 
                        device_map={"":0}, 
                        token=token)



tokenizer_config.json:   0%|          | 0.00/34.2k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/627 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/13.5k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/67.1M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

In [21]:
text = """<start_of_turn>user
Explain 'AMSI init bypass' and its purpose.<end_of_turn>
<start_of_turn>model"""
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=50)
out = tokenizer.decode(outputs[0], skip_special_tokens=True)
out

"user\nExplain 'AMSI init bypass' and its purpose.\nmodelSure, here's a detailed explanation of the AMSI init bypass feature and its purpose:\n\n**AMSI init bypass:**\n\nAMSI (Advanced Microcontroller Startup Interface) init bypass is a technique used in microcontroller initialization where the initialization process is"

In [22]:
os.environ["WANDB_DISABLED"] = "true"

In [23]:
#load the dataset
data = load_dataset("ahmed000000000/cybersec")

Downloading data:   0%|          | 0.00/9.62M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

In [24]:
data

DatasetDict({
    train: Dataset({
        features: ['INSTRUCTION', 'RESPONSE'],
        num_rows: 12408
    })
})

In [33]:
f"""input: {data['train']['INSTRUCTION'][2]}\noutput: {data['train']['RESPONSE'][2]}"""

"input: Define 'Clickjacking' and describe how it can be prevented.\noutput: Clickjacking is a type of attack where a malicious website overlays an invisible iframe on top of another website, tricking the user into clicking on a hidden button or link. This can lead to unintended actions, such as sharing sensitive information or making unauthorized purchases. To prevent clickjacking, websites can use the X-Frame-Options header to explicitly tell web browsers whether or not the website should be displayed within an iframe. This header can be set to 'deny' to prevent any framing, 'sameorigin' to allow framing by the same origin, or to a specific URI to allow limited framing by a trusted source."

In [34]:
def formatting_func(example):
    text = f"<start_of_turn>user\n{example['INSTRUCTION'][0]}<end_of_turn> <start_of_turn>model\n{example['RESPONSE'][0]}<end_of_turn>"
    return [text]

In [37]:
trainer = SFTTrainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=150,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    peft_config=lora_config,
    formatting_func=formatting_func,
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Map:   0%|          | 0/12408 [00:00<?, ? examples/s]



In [38]:
trainer.train()

Step,Training Loss
1,4.563
2,4.0654
3,4.0216
4,3.6089
5,3.3132
6,3.6128
7,2.7275
8,3.2344
9,2.5927
10,2.3271


TrainOutput(global_step=150, training_loss=0.5246245095630487, metrics={'train_runtime': 335.0063, 'train_samples_per_second': 1.791, 'train_steps_per_second': 0.448, 'total_flos': 1065426810224640.0, 'train_loss': 0.5246245095630487, 'epoch': 46.15})

In [40]:
text = """<start_of_turn>user
Explain 'AMSI init bypass' and its purpose.<end_of_turn>
<start_of_turn>model"""
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=50)
tokenizer.decode(outputs[0], skip_special_tokens=True)

"user\nExplain 'AMSI init bypass' and its purpose.\nmodel\nAMSI init bypass is a security feature in Windows that allows the System Management Interface (AMSI) to be initialized without being properly protected. This feature is designed to provide some protection against certain types of attacks, by preventing unauthorized users from tampering with"

In [41]:
text = """<start_of_turn>user
Explain 'APT groups and operations' and its purpose.<end_of_turn>
<start_of_turn>model"""
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=50)
tokenizer.decode(outputs[0], skip_special_tokens=True)

"user\nExplain 'APT groups and operations' and its purpose.\nmodel\nAPT groups and operations refer to the organization and grouping of security updates and countermeasures in an Automated Patch Tool (APT) system. By grouping similar items together, it is possible to define rules and restrictions that apply to entire groups of systems or devices"

**In the above cells the model generated some text related to the query asked.**