<a href="https://colab.research.google.com/github/orionhunts-ai/new_models_datasets/blob/main/morpheus_cyber_gpt4o_mini_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
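# @title Install Core Libraries { run: "auto", display-mode: "form" }
%pip -qqq install loguru
import sys
from loguru import logger

# loguru sets the log level per sink, so swap the default sink for a DEBUG one.
logger.remove()
logger.add(sys.stderr, level="DEBUG")
_logger = logger

RUN = 0
if RUN == 0:
  # Shell installs do not raise Python exceptions, so no try/except is needed.
  # `uuid` ships with the standard library and does not need installing.
  !pip install -qqq -U torch==2.3.1 pyarrow
  #%pip -qqq install torch==2.2.2
  !pip -qqq install transformers datasets wandb

RUN += 1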

# Fine Tuning OpenAI GPT-4o For (Free) Agentic Cyber 👾

  * Training (and reporting) on Google Colab for access to its high-powered CUDA GPUs.
  * Leveraging:
    * Fine-tuning on the dataset ```swaption2009/cyber-threat-intelligence-custom-data```.
    * OpenAI's offering of a mini version of the already efficient GPT-4o.
        * OpenAI claims that the mini is almost as performant (but it's 20x cheaper).
    * Sampling from the full set for the examples most relevant to cyber security analysts.
    * Aside from traditional and evolving evaluations, I will also deploy a number of the fine-tuned models in a Microsoft AutoGen agentic environment to see how they perform on basic analysis of a database.
    * ```Redpanda``` (a high-performance streaming data alternative to ```Kafka```) will be used.



## What about Phi!?!
The last experiment with Phi is still ongoing. I am having CUDA compatibility issues between the libraries, and it's a good chance to dig a bit deeper into that stack.


##  Data and Tool Preparation
**Summary:**

This study explores fine-tuning the Phi-3-small-instruct model (7.39 billion parameters) for a Cyber Threat Intelligence (CTI) task with Daniel Han's Unsloth, using Parameter-Efficient Fine-Tuning (PEFT) methods such as Low-Rank Adaptation (LoRA) and Quantized Low-Rank Adaptation (QLoRA). It aims to evaluate performance degradation, model collaboration in agentic environments, and the potential influence of GPT-4. Synthetic data from gretel.ai was also used to supplement the fine-tuning data and improve its diversity and robustness.

In [2]:
!python -m pip install -qqq huggingface_hub wandb evaluate tqdm
import os
import random  # used to build MODEL_NAME below
import wandb
import huggingface_hub
from huggingface_hub import notebook_login
from google.colab import userdata
from transformers import TrainingArguments, Trainer

#### PROJECT DEFINITION ####
model_types = {"openai": "orion-cyber-gpt4o-mini",
               "mistral": "orion-cyber-mistral"}

project = model_types["openai"]
#project = model_types["mistral"]
os.environ["WANDB_PROJECT"] = project
os.environ["WANDB_MODE"] = "disabled"  # set to "online" to record runs
os.environ["WANDB_API_KEY"] = userdata.get('WANDB_API_KEY')
os.environ["WANDB_NOTEBOOK_NAME"] = "data_preparation"
os.environ["WANDB_LOG_MODEL"] = "checkpoint"
DEBUG = True
#### PROJECT DEFINITION ####

hf_hub_key = userdata.get('HF_TOKEN')
wandb_key = userdata.get('WANDB_API_KEY')
#notebook_login(new_session=False)
wandb.login(key=wandb_key)  # log in once with the key from Colab user data

os.environ["ROOT_DIR"] = '/content/drive/MyDrive/models_datasets/'
MODEL_NAME = f"{project}-cyber-{random.randint(0,100)}-syn_labs"

## INIT FIRST RUN
if wandb.run is None and os.environ["WANDB_MODE"] != "disabled":
  wandb_pp = wandb.init(project=project, job_type="data_preparation",
                        dir=f"/content/drive/MyDrive/models_datasets/{project}/")

# Training arguments so that Trainer reports to W&B
args = TrainingArguments(
    # other args and kwargs here
    report_to="wandb",  # enable logging to W&B
    run_name="pre-processing-cyber",  # name of the W&B run (optional)
    logging_steps=5,
    output_dir="/content/drive/MyDrive/models_datasets/")

In [None]:
class WandBArtifact:
  """Wrap a DataFrame in a W&B Table and log it as an Artifact."""

  def __init__(self, artifact_name, dataframe, run=None, type="data"):
    # `type` is expected to be one of "data", "model" or "table".
    self.run = run or wandb.run or wandb.init(project=project)
    self.type = type
    self.name = f"{project}_{artifact_name}-{self.type}"
    self.table = wandb.Table(dataframe=dataframe)

    # Add the table to an Artifact to increase the row
    # limit to 200000 and make it easier to reuse
    self.artifact = wandb.Artifact(self.name, type=self.type)
    self.artifact.add(self.table, artifact_name)

  def log(self):
    # Log the table to visualize with a run...
    self.run.log({self.name: self.table})

    # and log as an Artifact to increase the available row limit!
    self.run.log_artifact(self.artifact)

In [None]:
#ML Ops and EDA Imports
import os
import torch
import pandas as pd
import numpy as np
from google.colab import userdata
from huggingface_hub import notebook_login

# API keys (wandb_key, hf_hub_key) were retrieved from Colab user data above.

# Optional logins, left disabled because wandb.login() already ran above:
# if wandb_key:
#     wandb.login(key=wandb_key)
# else:
#     print("WANDB_API_KEY is not set")
#
# if hf_hub_key:
#     huggingface_hub.login(token=hf_hub_key, add_to_git_credential=True)
# else:
#     print("HF_TOKEN is not set")

# Check if cuda on Colab
device = "cuda:0" if torch.cuda.is_available() else "cpu"
_logger.info(device)

In [None]:
# Make a repo
new_repo = False
name = "morpheus_cyber_gpt4o-mini"
if not new_repo:
  _logger.info(f"Skipping repo creation for {name} (new_repo is False)")
else:
  # exist_ok=True makes this a no-op when the repo already exists
  huggingface_hub.create_repo(repo_id=name, exist_ok=True)
  _logger.info(f"Repo {name} is ready")

In [None]:
# Loading pre-determined Cyber Data - create your own synthetic data on top at https://gretel.ai/
import pandas as pd
from datasets import load_dataset

try:
  ds = load_dataset("swaption2009/cyber-threat-intelligence-custom-data")
  _logger.debug(ds)
  df_train = ds['train'].to_pandas()
  _logger.debug(type(df_train))
  _logger.debug(df_train.head(10))
  _logger.info(df_train.iloc[0])  # first row

except Exception as e:
  # loguru attaches the traceback with .exception(); it has no exc_info flag
  _logger.exception(e)



# Data Cleaning and NLP

In [None]:
before_drop = df_train.shape
before_drop


In [None]:
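# Drop NA rows and duplicates
# Deduplicate on the string columns only: list-valued columns such as `entities` are unhashable.
after_drop = df_train.dropna().drop_duplicates(subset=["text", "diagnosis", "solutions"])
after_drop.shape
#assert before_drop == after_drop.shape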

In [None]:
df_train.columns
df_train.head()

In [None]:
# Reduce to the observation (data), the diagnosis, and mitigations. Split out the entities array to
# mess around with graph based analysis later on.
pre_process = wandb.init(name="pre-processing cyber", project=project, job_type="pre-processing")

columns = ["text", "diagnosis", "solutions"]
df_train = df_train[columns]

# Log the reduced frame as a Table so it can be inspected in the run...
pre_table = wandb.Table(dataframe=df_train)
pre_process.log({"table": pre_table})

# ...and attach the same Table to an Artifact so it can be reused later.
pre_data = wandb.Artifact(name="preprocessing_data", type="dataset")
pre_data.add(pre_table, "df_train")
pre_process.log_artifact(pre_data)



In [None]:
# nlp cleaning
%pip install gensim nltk
import random
import nltk
import gensim

nltk.download('punkt')
nltk.download('punkt_tab')  # required by word_tokenize in newer NLTK releases
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec

# Build the stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
  # Lowercase and tokenize, then drop stop words and non-alphanumeric tokens
  tokens = word_tokenize(text.lower())
  filtered_tokens = [w for w in tokens if w not in STOP_WORDS and w.isalnum()]
  # Return one string so downstream string operations work
  return " ".join(filtered_tokens)
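
The cleaning step above (lowercase, tokenize, drop stop words, keep alphanumeric tokens) can be sketched without NLTK. This is a minimal illustration, not the notebook's pipeline: the regex tokenizer, the tiny `STOP_WORDS` set, and the `simple_preprocess` name are all assumptions made for the example.

```python
import re

# Tiny illustrative stop-word list (an assumption); NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "was", "to", "of", "and", "in", "on", "by"}

def simple_preprocess(text):
    """Lowercase, keep alphanumeric tokens, drop stop words, re-join with spaces."""
    if not isinstance(text, str):
        return ""  # treat missing values as empty text
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

example = "The attacker sent a phishing email to the finance team."
cleaned = simple_preprocess(example)
print(cleaned)
```

Note that this kind of tokenization also splits indicators such as defanged domains into fragments, so for threat-intelligence text it may be worth keeping the raw `text` column alongside the cleaned one.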




In [None]:
# Tokenise and remove stopwords + Concatenate Scenario and Outcome
import sys
from tqdm.auto import tqdm

tqdm.pandas()  # registers Series/DataFrame.progress_apply
df_scenario_outcome = df_train.copy()
device = "cuda:0" if torch.cuda.is_available() else "cpu"
_logger.info(device)

def _safe_preprocess(x):
  # Missing values become empty strings instead of raising in the tokenizer
  return preprocess_text(x) if isinstance(x, str) else ''

try:
    # Apply the preprocessing function with a progress bar
    for col in ["text", "diagnosis", "solutions"]:
        df_scenario_outcome[f"{col}_pr"] = df_scenario_outcome[col].progress_apply(_safe_preprocess)
except Exception as e:
    _logger.exception(e)
    print(f'Error: {e}', file=sys.stderr)
    raise

# Log the pre-tokenisation frame as an Artifact when a run is active
if wandb.run is not None:
    artifact = wandb.Artifact(name="pre_tokenisation", type="dataset")
    artifact.add(wandb.Table(dataframe=df_scenario_outcome), "df_scenario_outcome")
    wandb.run.log_artifact(artifact)





In [None]:
#Process Concat Field
df_scenario_outcome["scenario_outcome"] = df_scenario_outcome.progress_apply(
    lambda row: 'Scenario: ' + str(row["text"]) + ' Outcome: ' + str(row["diagnosis"]), axis=1)

In [None]:
df_scenario_outcome.columns
# Count missing values per column
df_scenario_outcome.isnull().sum()



### W&B config before Fine Tuning

In [None]:
''' Add the processed data to a WandB Table
Add to Artifact
'''
%pip install evaluate

# Select the columns first; wandb.Table takes its columns from the dataframe.
morpheus_table = wandb.Table(dataframe=df_scenario_outcome[["scenario_outcome", "solutions_pr"]])

# NEW RUN
train_run = wandb.init(project=project, job_type="training")
train_run.log({"table": morpheus_table})



In [None]:
### LOG TRAINING ARTEFACT ###
import random
import os

def log_model_artefact(project, artefact_type="model"):
  # Start a new W&B run
  run = wandb.init(name=MODEL_NAME, job_type="training", project=project)

  assert run is wandb.run

  # Simulate logging model metrics
  run.log({"acc": random.random()})  # TODO: add more metrics

  # Create a simulated model file
  model_file = os.path.join(os.environ["ROOT_DIR"], f"{MODEL_NAME}.h5")
  with open(model_file, "w") as f:
      f.write("Model: " + str(random.random()))

  # Log and link the model to the Model Registry
  run.link_model(path=model_file, registered_model_name=MODEL_NAME)

  run.finish()
  return model_file

In [None]:
!python /content/drive/MyDrive/models_datasets/datasets/run_glue.py \
  --model_name  \
  --task_name $TRAINMini \
  --do_train \
  --do_eval \
  --max_seq_length 256 \
  --per_device_train_batch_size 32 \
  --learning_rate 2e-4 \
  --num_train_epochs 3 \
  --output_dir /tmp/$TASK_NAME/ \
  --overwrite_output_dir \
  --logging_steps 50

# Final Processing Before Training
The data and the model must be on the same device (the GPU when one is available). On a Mac, check `torch.backends.mps.is_available()` rather than `torch.cuda.is_available()`.
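
A minimal sketch of the device choice described above, assuming the preference order CUDA, then Apple MPS, then CPU. The `pick_device` helper is hypothetical (it is not part of this notebook) and takes plain booleans so the logic can be checked without a GPU; the `torch` calls are only used if `torch` is installed.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the device string to use for both the model and the data."""
    if cuda_available:
        return "cuda:0"
    if mps_available:
        return "mps"
    return "cpu"

try:
    import torch
    device = pick_device(torch.cuda.is_available(),
                         torch.backends.mps.is_available())
except ImportError:
    # Without torch there is nothing to move, so fall back to the CPU label.
    device = "cpu"

print(device)
```

Whatever string comes back, both `model.to(device)` and each input tensor's `.to(device)` need to use the same value, otherwise PyTorch raises a device-mismatch error at the first forward pass.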

In [None]:
# Utility functions to convert a dataframe to a PyTorch tensor.
# - More important with large datasets to be on the GPU
import numpy as np
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
_logger.info(f"Data is on: {device}")

def to_numpy(dataframe):
  # Only numeric columns can be cast to float32
  df_numpy = dataframe.to_numpy(dtype=np.float32)
  _logger.info("Data is numpy array")
  _logger.info(df_numpy.shape)
  return df_numpy

def df_to_tensor(df_as_numpy, device=device):
  df_tensors = None
  try:
    df_tensors = torch.as_tensor(df_as_numpy, dtype=torch.float32)
    _logger.info("Data is PyTorch tensors torch.float32")
    df_tensors = df_tensors.to(device)
    _logger.info(f"Data is on {device}")
  except Exception as e:
    _logger.exception(e)
  return df_tensors

In [None]:
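# Convert the numeric columns of the data set to float32 numpy.
# Text columns cannot be cast to float32; they need tokenizing or embedding first.
df_numpy = df_scenario_outcome.select_dtypes(include="number").to_numpy(dtype=np.float32)
_logger.info(df_numpy.shape)
df_tensor = df_to_tensor(df_numpy)
_logger.info(type(df_tensor))
_logger.info(df_tensor.shape)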


In [None]:
data_table = wandb.Table(dataframe=df_scenario_outcome[["scenario_outcome", "solutions_pr"]])

# A wandb.Table renders as an interactive table on the run page, so logging it is enough.
wandb.log({"scenario_solutions_table": data_table})

In [None]:
# @title Model Registry { run: "auto" }
#huggingface_hub.login(token=HF_HUB, add_to_git_credential=True, write_permission=True)
# Start a new W&B run
run_name = f"{project}-save_model"

def check_run(run_name):
  # Finish any active run before starting the model-registry run
  if wandb.run is not None:
    wandb.run.finish()
  return wandb.init(project=project, job_type="model", name=run_name)








In [None]:
def metrics_log(acc, model_name, save_model: bool = False):
    run = wandb.run
    run.log({"acc": acc})

    if save_model:
      # Assumption: models live under ROOT_DIR/<project>/models/
      model_dir = os.path.join(os.environ["ROOT_DIR"], project, "models")
      os.makedirs(model_dir, exist_ok=True)
      model_file = os.path.join(model_dir, f"{model_name}.h5")
      # Create a simulated model file, then link it to the Model Registry
      with open(model_file, "w") as f:
        f.write("Model: " + str(random.random()))
      run.link_model(path=model_file, registered_model_name=model_name)
    run.finish()

In [74]:
# Short unique id for file and run names (uuid is in the standard library)
import uuid
id = str(uuid.uuid4())[0:6]
print(id)



9840c5


In [None]:
'''Sentiment Analysis:
Added some metadata to match the scenario outlined in the initial text column, mapping it
to a scenario and an outcome, then asking for the sentiment of the solutions.'''

# Install necessary packages
!pip install -qqq transformers torch accelerate

import wandb
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline, AutoConfig

# Initialize W&B run for sentiment job
run = wandb.init(project=project, name=f"Sentiment_Analysis_{id}",
                 job_type="sentiment", dir="/content/drive/MyDrive/models_datasets/models")

# Define model and tokenizer
MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
config = AutoConfig.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, output_hidden_states=True)
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer, device=device)

# Define paths and names
project = "morpheus_cyber_gpt-4o-mini"
model_id = "sentiment_model"
model_name = f"{project}-cyber{model_id}"
model_path = "/content/drive/MyDrive/models_datasets/models/"

# Save the model and tokenizer (save_pretrained writes files and returns None)
classifier.save_pretrained(model_path)

# Create a new artifact; the saved model files are added to it before logging
artifact = wandb.Artifact(name=model_name, type="model")


In [None]:
import matplotlib.pyplot as plt

df_sentiment = df_scenario_outcome.copy()
# drop() returns a new frame, so assign the result
df_sentiment = df_sentiment.drop(["text", "diagnosis", "solutions"], axis=1)
print(df_sentiment.shape)

# Make a data table from the dataframe and log it to the run
data_table = wandb.Table(dataframe=df_sentiment)
run.log({"sentiment_input_table": data_table})

In [None]:
# Function to predict sentiment
import numpy as np
import torch
from scipy.special import softmax

_logger.info(device)
model.to(device)
model.eval()

def predict_sentiment(text, model=model, tokenizer=tokenizer):
  """Return (label, score) for the highest-scoring sentiment class."""
  if not isinstance(text, str) or not text:
    return None, None
  # Truncate the text to the maximum length the model can handle
  encoded_input = tokenizer(text, return_tensors='pt', truncation=True, max_length=512).to(device)
  # Run the model without tracking gradients
  with torch.no_grad():
    output = model(**encoded_input)
  # Convert the logits to probabilities (move to CPU before calling numpy)
  scores = softmax(output.logits[0].cpu().numpy())
  best = int(np.argmax(scores))
  # Read label names from the model config rather than hard-coding their order
  return model.config.id2label[best], float(scores[best])

## Apply to copied DF
sentiment = df_sentiment['scenario_outcome'].progress_apply(predict_sentiment)
df_sentiment['sentiment'] = sentiment.str[0]
df_sentiment['score'] = sentiment.str[1]
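
The scoring step in the cell above turns the classifier's raw logits into probabilities and ranks the labels. A minimal pure-Python sketch of that step follows; the logits are made-up numbers and the label order is an assumption for illustration (in practice it should come from the model's `config.id2label`).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rank_labels(logits, labels):
    """Pair each label with its probability, highest probability first."""
    probs = softmax(logits)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)

labels = ["negative", "neutral", "positive"]  # illustrative order
ranked = rank_labels([2.0, 0.5, -1.0], labels)
for position, (label, prob) in enumerate(ranked, start=1):
    print(f"{position}) {label} {prob:.4f}")
```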


In [None]:
run.link_model(
    path=model_path,
    registered_model_name=f"{model_name}",
    name="4o-mini-cyber",
    aliases=["evaluation"],
)

### NER for enriching the data more ###

In [None]:
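# Add the saved model directory to the artifact (add_file only accepts single files)
artifact.add_dir(local_path=model_path, name=model_name)

# Log the artifact to W&B
run.log_artifact(artifact)
# Save a local copy of the weights; model.push_to_hub(model_name) would upload them to the Hub instead
model.save_pretrained(model_name)
# Finish the W&B run
run.finish()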




In [None]:
ner_model = 'dslim/bert-base-NER'

In [None]:
##WANDBTRAINER##
trainer = Trainer(
    # other args and kwargs here
    args=args,  # your training args
)


In [None]:
### Text-Diagnosis Concatenation & HotEncoder Target
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
# Assumption: the processed columns created earlier (text_pr, diagnosis_pr) are the intended inputs
df_encoded = df_scenario_outcome.copy()
df_encoded['text_diagnosis'] = df_encoded['text_pr'].astype(str) + ' ' + df_encoded['diagnosis_pr'].astype(str)

df_encoded.head()

# Simple Word to Vec Model

In [None]:
# Train a Word2Vec model (example)
# Word2Vec expects an iterable of token lists, so split each processed string.
sentences = [str(text).split() for text in df_scenario_outcome['text_pr']]
w2v_model = Word2Vec(sentences, min_count=1)

In [None]:
# Apply the function to each column
df_w2v = df_domains.copy()
df_w2v['text_processed'] = df_domains['text'].apply(preprocess_text)
df_w2v['diagnosis_processed'] = df_domains['diagnosis'].apply(preprocess_text)
df_w2v['solutions_processed'] = df_domains['solutions'].apply(preprocess_text)

# OpenAI Embeddings: fine-tuned GPT-4o mini with OpenAI's small embedding model

In [None]:
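%pip install -qqq openai
from google.colab import userdata
from openai import OpenAI
client = OpenAI(api_key=userdata.get("OPENAI_API_KEY"))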


In [None]:
# Use OAI Embeddings
%pip install -qqq openai
from openai import OpenAI
client = OpenAI(api_key=userdata.get("OPENAI_API_KEY"))
oai_model = "text-embedding-3-small"

def get_openai_embedding(text, model=oai_model):
    # openai>=1.0 exposes embeddings on the client and returns typed objects
    response = client.embeddings.create(
      input=text,
      model=model  # Or another model you prefer
    )
    return response.data[0].embedding
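
Once each text has an embedding, a common way to compare two of them is cosine similarity. The sketch below is an illustration only: the three-dimensional vectors are toy values (real `text-embedding-3-small` vectors are much longer) and `cosine_similarity` is a hypothetical helper, not part of the OpenAI client.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings of three threat descriptions
phishing_a = [0.9, 0.1, 0.0]
phishing_b = [0.8, 0.2, 0.1]
ddos = [0.0, 0.1, 0.9]

print(cosine_similarity(phishing_a, phishing_b))  # close to 1: similar
print(cosine_similarity(phishing_a, ddos))        # close to 0: dissimilar
```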



In [None]:
embedded_data_oai = df_domains.copy()
embedded_data_oai['text_embedding'] = embedded_data_oai['text_processed'].progress_apply(get_openai_embedding)

In [84]:
%pip install seaborn wordcloud
import matplotlib.pyplot as plt
import seaborn as sns

# Start a new run for visualizations
viz_run = wandb.init(project="morpheus_cyber_gpt-4o-mini", job_type="visualization")

# --- Distribution of Sentiment ---
plt.figure(figsize=(8, 6))
sns.countplot(data=df_sentiment, x='sentiment')
plt.title('Distribution of Sentiment')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
# Log the plot to W&B
viz_run.log({"sentiment_distribution": wandb.Image(plt)})
plt.show()

# --- Word Cloud of Text ---
from wordcloud import WordCloud
text_corpus = ' '.join(df_scenario_outcome['text_pr'].astype(str).tolist())
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text_corpus)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Text Data')
# Log the plot to W&B
viz_run.log({"text_wordcloud": wandb.Image(plt)})
plt.show()

# --- Word Cloud of Solutions ---
solutions_corpus = ' '.join(df_scenario_outcome['solutions_pr'].astype(str).tolist())
wordcloud_solutions = WordCloud(width=800, height=400, background_color='white').generate(solutions_corpus)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud_solutions, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Solutions Data')
# Log the plot to W&B
viz_run.log({"solutions_wordcloud": wandb.Image(plt)})
plt.show()

# --- Length Distribution of Text ---
plt.figure(figsize=(8, 6))
sns.histplot(df_scenario_outcome['text_pr'].str.len(), bins=30)
plt.title('Distribution of Text Length')
plt.xlabel('Text Length')
plt.ylabel('Frequency')
# Log the plot to W&B
viz_run.log({"text_length_distribution": wandb.Image(plt)})
plt.show()

# --- Correlation Heatmap (if applicable) ---
# If you have numerical features, you can create a correlation heatmap
# Example:
# numeric_features = df_scenario_outcome[['column1', 'column2']]  # Replace with actual numerical columns
# correlation_matrix = numeric_features.corr()
# plt.figure(figsize=(10, 8))
# sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
# plt.title('Correlation Heatmap')
# viz_run.log({"correlation_heatmap": wandb.Image(plt)})
# plt.show()

# --- Create an artifact and save the visualizations ---
artifact = wandb.Artifact(name="pre_finetuning_visualizations", type="visualizations")
# Add any files you want to include in the artifact (e.g., images, data files)
# artifact.add_file("path/to/your/file.png")

# Log the artifact to W&B
viz_run.log_artifact(artifact)

# Finish the visualization run
viz_run.finish()





NameError: name 'df_sentiment' is not defined

<Figure size 800x600 with 0 Axes>

# Preparing to Train
1. Isolate important columns.


In [81]:
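!pip install -qqq openai
import uuid
from openai import OpenAI
from wandb.integration.openai.fine_tuning import WandbLogger

client = OpenAI(api_key=userdata.get("OPENAI_API_KEY"))

# Assumption: BASE_URL is the Drive root directory set earlier in ROOT_DIR
BASE_URL = os.environ["ROOT_DIR"].rstrip("/")
data = f"{BASE_URL}/{project}/{model_name}.jsonl"

# Upload the training file
with open(data, "rb") as f:
  training_file = client.files.create(file=f, purpose="fine-tune")

# Fine-tuning logic
id = str(uuid.uuid4())[0:6]
FINETUNE_JOB_ID = None  # set to an existing fine-tune job id to sync only that job
if FINETUNE_JOB_ID:
  WandbLogger.sync(fine_tune_job_id=FINETUNE_JOB_ID, openai_client=client,
                   project=project, entity="orion-agents-org")
else:
  WandbLogger.sync(openai_client=client, entity="orion-agents-org")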

Error: `openai` not installed. This integration requires `openai`. To fix, please `pip install openai`

# Fine Tuning Using Different Approaches
1. OpenAI GPT-4o mini with OpenAI's small embeddings
2. OpenAI GPT-4o mini with Word2Vec
3. Word2Vec model with Sentence Transformers

After we have the models, we will train them.

In [None]:
# prompt: train test split from sklearn, test_size 0.1, random_state 42

from sklearn.model_selection import train_test_split
train, test = train_test_split(df_train, test_size=0.1, random_state=42)


In [None]:
### Need to concat the features.
#System Messages : 1 Assistant
```json
[
    {
        "messages": [
            {"role": "system", "content": "For each scenario, there is a scenario with a diagnosis and solutions. Decide if they are good solutions."},
            {"role": "user", "content": {"scenario_outcome": "A ransomware attack encrypted critical files. Diagnosis: The attack vector was a phishing email. Solutions: 1. Isolate infected systems, 2. Pay the ransom, 3. Restore from backups."}},
            {"role": "assistant", "content": "By my assessment, the solutions were: 1. Good, 2. Not advisable, 3. Good."}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "For each scenario, there is a scenario with a diagnosis and solutions. Decide if they are good solutions."},
            {"role": "user", "content": {"scenario_outcome": "Data breach exposing customer information. Diagnosis: Misconfigured cloud storage. Solutions: 1. Notify affected customers, 2. Implement stricter access controls, 3. Ignore the breach."}},
            {"role": "assistant", "content": "By my assessment, the solutions were: 1. Good, 2. Good, 3. Bad."}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "For each scenario, there is a scenario with a diagnosis and solutions. Decide if they are good solutions."},
            {"role": "user", "content": {"scenario_outcome": "Unauthorized access to internal network. Diagnosis: Weak password policy. Solutions: 1. Change all passwords, 2. Implement MFA, 3. Monitor network traffic."}},
            {"role": "assistant", "content": "By my assessment, the solutions were: 1. Good, 2. Good, 3. Good."}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "For each scenario, there is a scenario with a diagnosis and solutions. Decide if they are good solutions."},
            {"role": "user", "content": {"scenario_outcome": "DDoS attack causing service disruption. Diagnosis: Insufficient network defenses. Solutions: 1. Increase bandwidth, 2. Implement rate limiting, 3. Deploy DDoS protection service."}},
            {"role": "assistant", "content": "By my assessment, the solutions were: 1. Partial, 2. Good, 3. Good."}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "For each scenario, there is a scenario with a diagnosis and solutions. Decide if they are good solutions."},
            {"role": "user", "content": {"scenario_outcome": "Malware infection on multiple devices. Diagnosis: Lack of antivirus software. Solutions: 1. Install antivirus software, 2. Perform a full system scan, 3. Disconnect infected devices from the network."}},
            {"role": "assistant", "content": "By my assessment, the solutions were: 1. Good, 2. Good, 3. Good."}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "For each scenario, there is a scenario with a diagnosis and solutions. Decide if they are good solutions."},
            {"role": "user", "content": {"scenario_outcome": "Phishing attack leading to credential theft. Diagnosis: Lack of user training. Solutions: 1. Conduct phishing awareness training, 2. Change compromised credentials, 3. Implement email filtering."}},
            {"role": "assistant", "content": "By my assessment, the solutions were: 1. Good, 2. Good, 3. Good."}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "For each scenario, there is a scenario with a diagnosis and solutions. Decide if they are good solutions."},
            {"role": "user", "content": {"scenario_outcome": "SQL injection attack compromising database. Diagnosis: Lack of input validation. Solutions: 1. Implement input validation, 2. Use parameterized queries, 3. Perform regular security audits."}},
            {"role": "assistant", "content": "By my assessment, the solutions were: 1. Good, 2. Good, 3. Good."}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "For each scenario, there is a scenario with a diagnosis and solutions. Decide if they are good solutions."},
            {"role": "user", "content": {"scenario_outcome": "Unauthorized access to sensitive data. Diagnosis: Inadequate access controls. Solutions: 1. Restrict access to sensitive data, 2. Implement role-based access control, 3. Regularly review access logs."}},
            {"role": "assistant", "content": "By my assessment, the solutions were: 1. Good, 2. Good, 3. Good."}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "For each scenario, there is a scenario with a diagnosis and solutions. Decide if they are good solutions."},
            {"role": "user", "content": {"scenario_outcome": "Insider threat leaking confidential information. Diagnosis: Lack of monitoring. Solutions: 1. Implement user activity monitoring, 2. Conduct background checks, 3. Establish a whistleblower policy."}},
            {"role": "assistant", "content": "By my assessment, the solutions were: 1. Good, 2. Good, 3. Good."}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "For each scenario, there is a scenario with a diagnosis and solutions. Decide if they are good solutions."},
            {"role": "user", "content": {"scenario_outcome": "Zero-day exploit used in an attack. Diagnosis: Outdated software. Solutions: 1. Apply patches promptly, 2. Use intrusion detection systems, 3. Maintain an incident response plan."}},
            {"role": "assistant", "content": "By my assessment, the solutions were: 1. Good, 2. Good, 3. Good."}
        ]
    }
]
```
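
OpenAI's chat fine-tuning endpoint expects a JSONL file (one `{"messages": [...]}` object per line) in which every `content` value is a plain string, so examples like the ones above need flattening before upload. A minimal sketch of that conversion follows; the helper names and the two sample rows are illustrative assumptions.

```python
import json

SYSTEM_PROMPT = ("For each scenario, there is a scenario with a diagnosis and "
                 "solutions. Decide if they are good solutions.")

def to_chat_example(scenario_outcome, assessment):
    """Build one training example with string-valued message contents."""
    return {"messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": scenario_outcome},
        {"role": "assistant", "content": assessment},
    ]}

def to_jsonl(rows):
    """Serialize (scenario_outcome, assessment) pairs as one JSON object per line."""
    return "\n".join(json.dumps(to_chat_example(s, a)) for s, a in rows)

rows = [
    ("A ransomware attack encrypted critical files. Diagnosis: The attack vector "
     "was a phishing email. Solutions: 1. Isolate infected systems, 2. Pay the "
     "ransom, 3. Restore from backups.",
     "By my assessment, the solutions were: 1. Good, 2. Not advisable, 3. Good."),
    ("DDoS attack causing service disruption. Diagnosis: Insufficient network "
     "defenses. Solutions: 1. Increase bandwidth, 2. Implement rate limiting, "
     "3. Deploy DDoS protection service.",
     "By my assessment, the solutions were: 1. Partial, 2. Good, 3. Good."),
]

jsonl_text = to_jsonl(rows)
print(jsonl_text.splitlines()[0][:80])
```

Writing `jsonl_text` to a `.jsonl` file gives something that can be passed to `client.files.create(..., purpose="fine-tune")`.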


In [None]:
'''System Messages : Multiple Assistants
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris", "weight": 0}, {"role": "user", "content": "Can you be more sarcastic?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already.", "weight": 1}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "William Shakespeare", "weight": 0}, {"role": "user", "content": "Can you be more sarcastic?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?", "weight": 1}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "384,400 kilometers", "weight": 0}, {"role": "user", "content": "Can you be more sarcastic?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters.", "weight": 1}]}
'''

In [None]:
import json

# Define template
system_message = {"role": "system",
                  "content": "Here are a variety of solutions to cyber problems. Analyze and give a binary 0 for no and 1 for yes."}
template = {"messages": [system_message]}

with open("./text_sql.json", "w") as f:
    json.dump(template, f)

In [None]:
# DF to JSON Serialized (to_json writes the file and returns None when given a path)
df_domains.to_json('./text_sql.json', orient='records')

In [None]:
run = wandb.init(project="Cyber-Phi-Small-8k-instruct", job_type="dataset")
# An artifact named df_to_json already exists with type "data", so reuse that type
artifact = wandb.Artifact(name="df_to_json", type="data")
artifact.add_file("./text_sql.json")
run.log_artifact(artifact)

run.finish()



---



## Model Training
1. Tokenize with tiktoken.
2. Load the model in 4-bit for improved speed and lower memory use, at the cost of lower-precision weights.
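
To make the precision trade-off concrete, here is a minimal sketch of symmetric absmax quantization to signed 4-bit integers. It is an illustration under stated assumptions: libraries such as bitsandbytes use block-wise schemes (for example NF4) rather than this simple version, and the weights below are made-up values.

```python
def quantize_4bit(weights):
    """Map floats to integers in [-7, 7] using a single absmax scale."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # avoid a zero scale
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the 4-bit integers."""
    return [v * scale for v in q]

weights = [0.42, -1.30, 0.07, 0.88, -0.51]
q, scale = quantize_4bit(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)
print([round(r, 3) for r in restored])
print(f"max rounding error: {max_error:.4f}")
```

Each weight now needs 4 bits instead of 16 or 32, and the rounding error per weight is bounded by half the scale; that bounded error is the "lower precision" being traded for memory and speed.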

In [None]:
run2 = wandb.init(project="Cyber-Phi-Small-8k-instruct", job_type="train")
run3 = wandb.init(project="gpt4o-mini", job_type="train")


In [None]:
# Load model directly from HuggingFace
%pip install -qq tiktoken einops
# Keep torch, torchvision and torchaudio on matching releases built for the same CUDA version (cu121).
# torchtext and torchdata are not imported in this notebook, so they are left out.
%pip install -q torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --extra-index-url https://download.pytorch.org/whl/cu121
_logger.info(device)
from unsloth import FastLanguageModel
import torch
import tiktoken
import einops
from transformers import AutoModelForCausalLM

max_seq_length = 2048
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Phi-3-small-8k-instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    #token = hf_hub_key, # use one if using gated models like meta-llama/Llama-2-7b-hf
)

cuda:0


RuntimeError: Unsloth: `unsloth/Phi-3-small-8k-instruct` is not a base model or a PEFT model.
We could not locate a `config.json` or `adapter_config.json` file.
Are you certain the model name is correct? Does it actually exist?

### Extract Entities for Graph
1. Make a new DataFrame with the text and relations broken down, then labelled as Node or Relationship.
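
A minimal sketch of that breakdown, assuming each record carries a list of entity dicts and a list of relation dicts with the keys visible in the dataset preview (`id`, `label`, `end_offset`, `from_id`, `to_id`). The sample values and the `to_graph_rows` helper are hypothetical.

```python
def to_graph_rows(entities, relations):
    """Flatten one record's entities and relations into labelled rows."""
    rows = [{"kind": "Node", "id": e["id"], "label": e["label"]}
            for e in entities]
    rows += [{"kind": "Relationship", "id": r["id"],
              "from_id": r["from_id"], "to_id": r["to_id"]}
             for r in relations]
    return rows

# Hypothetical sample record
entities = [
    {"id": 1, "label": "malware", "end_offset": 7},
    {"id": 2, "label": "threat-actor", "end_offset": 30},
]
relations = [{"id": 10, "from_id": 2, "to_id": 1}]

rows = to_graph_rows(entities, relations)
for row in rows:
    print(row)
```

A list of dicts like `rows` can be passed straight to `pd.DataFrame(rows)`; node rows simply get empty `from_id` and `to_id` cells.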


In [None]:
import numpy as np
df_Graph = df_train.copy()

def entity_labels(entities):
  # Each row holds a list of entity dicts; collect the distinct 'label' values.
  return sorted({ent['label'] for row in entities if row is not None for ent in row})

print(df_Graph.head())
# One (empty) column per entity label, to be filled in later
graph_df = pd.DataFrame(columns=entity_labels(df_Graph['entities']))
graph_df.shape
for k, v in enumerate(graph_df.columns):
  print(f'k is {k} and v is {v}')

      id                                               text  \
0    249  A cybersquatting domain save-russia[.]today is...   
1  14309  Like the Android Maikspy, it first sends a not...   
2  13996  While analyzing the technical details of this ...   
3  13600  (Note that Flash has been declared end-of-life...   
4  14364  Figure 21. Connection of Maikspy variants to 1...   

                                            entities  \
0  [{'end_offset': 16, 'id': 44656, 'label': 'att...   
1  [{'end_offset': 17, 'id': 48530, 'label': 'SOF...   
2  [{'end_offset': 194, 'id': 48781, 'label': 'th...   
3  [{'end_offset': 79, 'id': 51687, 'label': 'TIM...   
4  [{'end_offset': 191, 'id': 51779, 'label': 'UR...   

                                           relations  \
0  [{'from_id': 44658, 'id': 9, 'to_id': 44659, '...   
1  [{'from_id': 48531, 'id': 445, 'to_id': 48532,...   
2  [{'from_id': 48781, 'id': 461, 'to_id': 48782,...   
3  [{'from_id': 51688, 'id': 1133, 'to_id': 51689...   
4  [

In [None]:
def load_and_log():

    # 🚀 start a run, with a type to label it and a project it can call home
    with wandb.init(project="artifacts-data-models", job_type="load-data") as run:

        datasets = load()  # separate code for loading the datasets
        names = ["training", "validation", "test"]

        # 🏺 create our Artifact
        raw_data = wandb.Artifact(
            "cyber-phi", type="dataset",
            description="Cyber-Phi",
            metadata={"source": "torchvision.datasets.MNIST",
                      "sizes": [len(dataset) for dataset in datasets]})

        for name, data in zip(names, datasets):
            # 🐣 Store a new file in the artifact, and write something into its contents.
            with raw_data.new_file(name + ".pt", mode="wb") as file:
                x, y = data.tensors
                torch.save((x, y), file)

        # ✍️ Save the artifact to W&B.
        run.log_artifact(raw_data)

load_and_log()



---
### APPENDIX A

### 🤗Fine-Tuning Techniques: 🤗

**PEFT** (Parameter-Efficient Fine-Tuning): Fine-tunes pre-trained models by adjusting only a small subset of parameters, reducing computational costs.

**LoRA** (Low-Rank Adaptation): Enhances transformer models by injecting and training low-rank matrices within each layer, minimizing the number of trainable parameters.
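
As a rough illustration of why LoRA is cheap: for a weight matrix of shape `d_out x d_in`, LoRA trains two small matrices of shapes `d_out x r` and `r x d_in` instead of the full matrix. The sketch below only does the parameter arithmetic; the 4096 x 4096 layer size and rank 8 are assumed example values, not figures from this study.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full fine-tuning params, LoRA params) for one weight matrix."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)  # B is d_out x r, A is r x d_in
    return full, lora

full, lora = lora_param_counts(4096, 4096, 8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```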

**QLoRA** (Quantized Low-Rank Adaptation): Combines low-rank adaptation with weight quantization: the base model's weights are frozen in a quantized (typically 4-bit) format while only the LoRA adapters are trained, which reduces memory and computational requirements.

**Full Fine-Tuning:** Updates all parameters of the pre-trained model, offering high flexibility at the cost of increased computational resources.

**Distillation:** Trains a smaller model to mimic the behavior of a larger pre-trained model, optimizing efficiency while maintaining performance.
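
One common way to train the smaller model is to match the teacher's temperature-softened output distribution. The sketch below computes that loss for a single example in pure Python; the logits and the temperature of 2.0 are illustrative assumptions, and real training code would use a tensor library and usually add a standard cross-entropy term.

```python
import math

def softmax_t(logits, temperature):
    """Softmax of logits / temperature, computed stably."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax_t(teacher_logits, temperature)
    q = softmax_t(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * temperature ** 2

teacher = [3.0, 1.0, 0.2]
print(distillation_loss(teacher, [2.5, 1.2, 0.1]))  # small: student is close
print(distillation_loss(teacher, [0.1, 0.2, 3.0]))  # large: student disagrees
```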