<a href="https://colab.research.google.com/github/honicky/deep-log-analysis/blob/main/Pythia%20Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pythia Analysis - train small models on HDFS data

* use tokenized version of preprocessed HDFS events
* start with very small pythia models, test increasing size
* start with fine-tuning, then consider resetting weights and training from scratch
* experiment with different tokenizers
  * https://chatgpt.com/share/67448f53-29a0-800f-9913-af22d6ed0894


In [1]:
try:
  from google.colab import userdata

  !git clone https://github.com/honicky/deep-log-analysis.git
  !mv deep-log-analysis/* .
  !rm -rf deep-log-analysis
except:
  pass

Cloning into 'deep-log-analysis'...
remote: Enumerating objects: 24, done.[K
remote: Counting objects: 100% (24/24), done.[K
remote: Compressing objects: 100% (22/22), done.[K
remote: Total 24 (delta 6), reused 17 (delta 1), pack-reused 0 (from 0)[K
Receiving objects: 100% (24/24), 453.00 KiB | 7.43 MiB/s, done.
Resolving deltas: 100% (6/6), done.


In [2]:
try:
    import logparser.Drain as Drain
except ImportError:
    %pip install requests git+https://github.com/logpai/logparser

%pip install transformers torch torchvision torchaudio wandb python-dotenv datasets

Collecting git+https://github.com/logpai/logparser
  Cloning https://github.com/logpai/logparser to /tmp/pip-req-build-zrw15f3u
  Running command git clone --filter=blob:none --quiet https://github.com/logpai/logparser /tmp/pip-req-build-zrw15f3u
  Resolved https://github.com/logpai/logparser to commit 18dcd312d72173e1f19ef59a8155c77c93c74f2d
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting regex==2022.3.2 (from logparser3==1.0.4)
  Downloading regex-2022.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (39 kB)
Downloading regex-2022.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (764 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m764.2/764.2 kB[0m [31m42.2 MB/s[0m eta [36m0:00:00[0m
[?25hBuilding wheels for collected packages: logparser3
  Building wheel for logparser3 (setup.py) ... [?25l[?25hdone
  Created wheel for logparser3: filename=logparser3-1.0.4-py3-none-any.whl size=160373 sha256=3d07000c370578

In [3]:
import logparser.Drain as Drain


In [4]:
%load_ext autoreload
%autoreload 2
import dataloaders as dl


# Load secrets

If we are in colab, we get them from the `userdata` module, otherwise we get them from a .env file


In [5]:
import os
try:
  from google.colab import userdata
  os.environ["HF_WRITE_TOKEN"] = userdata.get('HF_WRITE_TOKEN')
  os.environ["WANDB_API_KEY"] = userdata.get('WANDB_API_KEY')
except ImportError:
  from dotenv import load_dotenv
  load_dotenv()


# Download and unzip the HDFS dataset

The functions check if the data is already downloaded and unzipped, and only download and unzip if they are not present.


In [7]:
import pandas as pd


dl.download_data(dl.datasets["HDFS"]["url"], dl.datasets["HDFS"]["zip_file_name"])

regenerate_data = False
if regenerate_data:
    from datasets import Dataset

    dl.unzip_data(dl.datasets["HDFS"]["zip_file_name"], dl.datasets["HDFS"]["file_name"])

    structured_file_path = dl.parse_dataset("HDFS")

    structured_df = pd.read_csv(structured_file_path)
    dl.add_hdfs_blockid_column(structured_df)
    structured_df.head()

    structured_dataset = Dataset.from_pandas(structured_df)
    structured_dataset.push_to_hub("honicky/log-analysis-hdfs-preprocessed", token=os.environ["HF_WRITE_TOKEN"])

else:
    from datasets import load_dataset

    structured_dataset = load_dataset("honicky/log-analysis-hdfs-preprocessed")
    structured_df = pd.DataFrame(structured_dataset['train'])


# Load the block labels

In [8]:
dl.unzip_data(dl.datasets["HDFS"]["zip_file_name"],"preprocessed/anomaly_label.csv", base_dir="data/hdfs" )

anomaly_label_df = pd.read_csv("data/hdfs/preprocessed/anomaly_label.csv")
anomaly_label_df.head()


Unnamed: 0,BlockId,Label
0,blk_-1608999687919862906,Normal
1,blk_7503483334202473044,Normal
2,blk_-3544583377289625738,Anomaly
3,blk_-9073992586687739851,Normal
4,blk_7854771516489510256,Normal


# Parse the parameter list

The parameter list is formatted as python code, so we need to use the `ast` library to parse it.

In [9]:
from ast import literal_eval

structured_df['ParsedParameterList'] = structured_df.ParameterList.apply(literal_eval)


In [10]:
event_id_mapping_pdf = (structured_df
 .EventId
 .value_counts()
 .reset_index()
 .reset_index()
 .rename(columns={"index":"NewEventId"})
 [["EventId", "NewEventId"]]
)

In [11]:
structured_with_event_id_pdf = structured_df.merge(event_id_mapping_pdf, on="EventId")
structured_with_event_id_pdf.head()

Unnamed: 0,LineId,Date,Time,Pid,Level,Component,Content,EventId,EventTemplate,ParameterList,BlockId,ParsedParameterList,NewEventId
0,1,81109,203518,143,INFO,dfs.DataNode$DataXceiver,Receiving block blk_-1608999687919862906 src: ...,09a53393,Receiving block <*> src: <*> dest: <*>,"['blk_-1608999687919862906', '/10.250.19.102:5...",blk_-1608999687919862906,"[blk_-1608999687919862906, /10.250.19.102:5410...",0
1,2,81109,203518,35,INFO,dfs.FSNamesystem,BLOCK* NameSystem.allocateBlock: /mnt/hadoop/m...,3d91fa85,BLOCK* NameSystem.allocateBlock: <*> <*>,['/mnt/hadoop/mapred/system/job_200811092030_0...,blk_-1608999687919862906,[/mnt/hadoop/mapred/system/job_200811092030_00...,6
2,3,81109,203519,143,INFO,dfs.DataNode$DataXceiver,Receiving block blk_-1608999687919862906 src: ...,09a53393,Receiving block <*> src: <*> dest: <*>,"['blk_-1608999687919862906', '/10.250.10.6:405...",blk_-1608999687919862906,"[blk_-1608999687919862906, /10.250.10.6:40524,...",0
3,4,81109,203519,145,INFO,dfs.DataNode$DataXceiver,Receiving block blk_-1608999687919862906 src: ...,09a53393,Receiving block <*> src: <*> dest: <*>,"['blk_-1608999687919862906', '/10.250.14.224:4...",blk_-1608999687919862906,"[blk_-1608999687919862906, /10.250.14.224:4242...",0
4,5,81109,203519,145,INFO,dfs.DataNode$PacketResponder,PacketResponder 1 for block blk_-1608999687919...,d38aa58d,PacketResponder <*> for block <*> <*>,"['1', 'blk_-1608999687919862906 terminating']",blk_-1608999687919862906,"[1, blk_-1608999687919862906 terminating]",2


## Construct blocks to parse

https://raw.githubusercontent.com/EleutherAI/pythia/refs/heads/main/utils/20B_tokenizer.json has the tokenizer configuration.  We will use the `<|sep|>` token to immediately precede the short event id.  We need to add the `<|sep|>` token to the tokenizer, because it is not in the default tokenizer.  This will hopefully help the attention mechanism attend to the event id specifically.  We have shortened the event id to the minimum length based on the number occurences.  This will gives an efficient coding that will be less complicated for the attention mechanism.

We can consider a more customized tokenizer as another experiment.  This might help because of the special characters and the dominance of numbers in the logs.


In [12]:
from transformers import GPTNeoXTokenizerFast
tokenizer = GPTNeoXTokenizerFast.from_pretrained("EleutherAI/pythia-14m")
tokenizer.add_special_tokens({"additional_special_tokens": ["<|sep|>"]})
tokenizer.sep_token = "<|sep|>"
tokenizer.sep_token_id
tokenizer.pad_token_id = tokenizer.eos_token_id # no pad token in default tokenizer, so add it here for collating / training


The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

tokenizer_config.json:   0%|          | 0.00/264 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/99.0 [00:00<?, ?B/s]

Double check that the tokenizer properly encodes the new special token

In [13]:

tokenizer.encode("<|sep|>")


[50277]

Review then tokenizer configuration, again to ensure the new special token is included


In [14]:
tokenizer

GPTNeoXTokenizerFast(name_or_path='EleutherAI/pythia-14m', vocab_size=50254, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'sep_token': '<|sep|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|sep|>']}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<|padding|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	50254: AddedToken("                        ", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
	50255: AddedToken("                       ", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
	50256: AddedToken("                      ", rstrip=False, lstrip=False, s

In [15]:
structured_with_event_id_pdf['event_encoded'] = structured_with_event_id_pdf.apply(lambda row: f"{tokenizer.sep_token}{row['NewEventId']} {' '.join(param for param in row['ParsedParameterList'] if 'blk_' not in param)}", axis=1)
structured_with_event_id_pdf.head()


Unnamed: 0,LineId,Date,Time,Pid,Level,Component,Content,EventId,EventTemplate,ParameterList,BlockId,ParsedParameterList,NewEventId,event_encoded
0,1,81109,203518,143,INFO,dfs.DataNode$DataXceiver,Receiving block blk_-1608999687919862906 src: ...,09a53393,Receiving block <*> src: <*> dest: <*>,"['blk_-1608999687919862906', '/10.250.19.102:5...",blk_-1608999687919862906,"[blk_-1608999687919862906, /10.250.19.102:5410...",0,<|sep|>0 /10.250.19.102:54106 /10.250.19.102:5...
1,2,81109,203518,35,INFO,dfs.FSNamesystem,BLOCK* NameSystem.allocateBlock: /mnt/hadoop/m...,3d91fa85,BLOCK* NameSystem.allocateBlock: <*> <*>,['/mnt/hadoop/mapred/system/job_200811092030_0...,blk_-1608999687919862906,[/mnt/hadoop/mapred/system/job_200811092030_00...,6,<|sep|>6
2,3,81109,203519,143,INFO,dfs.DataNode$DataXceiver,Receiving block blk_-1608999687919862906 src: ...,09a53393,Receiving block <*> src: <*> dest: <*>,"['blk_-1608999687919862906', '/10.250.10.6:405...",blk_-1608999687919862906,"[blk_-1608999687919862906, /10.250.10.6:40524,...",0,<|sep|>0 /10.250.10.6:40524 /10.250.10.6:50010
3,4,81109,203519,145,INFO,dfs.DataNode$DataXceiver,Receiving block blk_-1608999687919862906 src: ...,09a53393,Receiving block <*> src: <*> dest: <*>,"['blk_-1608999687919862906', '/10.250.14.224:4...",blk_-1608999687919862906,"[blk_-1608999687919862906, /10.250.14.224:4242...",0,<|sep|>0 /10.250.14.224:42420 /10.250.14.224:5...
4,5,81109,203519,145,INFO,dfs.DataNode$PacketResponder,PacketResponder 1 for block blk_-1608999687919...,d38aa58d,PacketResponder <*> for block <*> <*>,"['1', 'blk_-1608999687919862906 terminating']",blk_-1608999687919862906,"[1, blk_-1608999687919862906 terminating]",2,<|sep|>2 1


In [16]:
encoded_blocks_series = structured_with_event_id_pdf.groupby("BlockId")['event_encoded'].apply(lambda x: "".join(x))
encoded_blocks_series.head()


Unnamed: 0_level_0,event_encoded
BlockId,Unnamed: 1_level_1
blk_-1000002529962039464,<|sep|>0 /10.251.123.1:41333 /10.251.123.1:500...
blk_-100000266894974466,<|sep|>6 <|sep|>0 /10.250.10.144:36204 /10.250...
blk_-1000007292892887521,<|sep|>0 /10.251.127.47:50228 /10.251.127.47:5...
blk_-1000014584150379967,<|sep|>0 /10.251.43.210:49254 /10.251.43.210:5...
blk_-1000028658773048709,<|sep|>0 /10.251.107.196:58917 /10.251.107.196...


In [17]:
print(encoded_blocks_series.shape)
print(encoded_blocks_series.iloc[0])


(575061,)
<|sep|>0 /10.251.123.1:41333 /10.251.123.1:50010<|sep|>0 /10.251.123.1:53174 /10.251.123.1:50010<|sep|>0 /10.251.202.181:32980 /10.251.202.181:50010<|sep|>6 <|sep|>2 2<|sep|>3 3553241 /10.251.123.1<|sep|>2 0<|sep|>3 3553241 /10.251.202.181<|sep|>1 10.251.126.22:50010 3553241<|sep|>1 10.251.202.181:50010 3553241<|sep|>1 10.251.123.1:50010 3553241<|sep|>2 1<|sep|>3 3553241 /10.251.123.1


# Start with pretrained weights

The intuition is that the model will benefit some from understanding words and numbers (to some extent) when they appear, even if the structure of logs is very different from english sentences.  We can test this with an ablation study by randomizing the weights before training and then looking at the difference in the loss.

### Understanding Pythia Model Vocabulary Size Discrepancy

When loading a Pythia model from EleutherAI, I noticed a discrepancy between the model's embedding weight shape and the tokenizer vocabulary size:

```python
import torch
from transformers import GPTNeoXForCausalLM, GPTNeoXTokenizerFast

model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-14m")
tokenizer = GPTNeoXTokenizerFast.from_pretrained("EleutherAI/pythia-14m")
model.get_input_embeddings().weight.data.shape
```

This outputs:
```
torch.Size([50304, 128])
```

However, the tokenizer's vocab size is:
```python
>>> tokenizer.vocab_size + len(tokenizer.added_tokens_encoder)
50279
```

Including special tokens, the vocab size is 50277.

The original 50304 dimensions confused me at first, but it turns out the size is padded in order to facilitate alignment with tensor cores. Specifically, `50304 = 2^7 * 3 * 131`, so the embedding size is a multiple of 128.

From [The Case for Co-Designing Model Architectures with Hardware](https://arxiv.org/pdf/2401.14489v2):

> Tensor Cores can be fully utilized when GEMM dimensions m, k, and n are multiples
> of 16 bytes and 128 bytes for V100 and A100 GPUs, respectively. Since a FP16
> element is 2 bytes, this corresponds to dimension sizes that are multiples of 8
> and 64 elements, respectively.

So it looks like the embedding size is a multiple of 64.

### Solution

Add padding to the embedding size to match the parallelization factor.
```
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=64)
```



In [18]:
import torch

from transformers import GPTNeoXForCausalLM

def get_model():

    model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-14m")
    model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=64)
    model.get_input_embeddings().weight.data.shape

    return model

model = get_model()

config.json:   0%|          | 0.00/595 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/53.3M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

# Encode the blocks using the new tokenizer

In [19]:
encoded_blocks_pdf = encoded_blocks_series.to_frame()
encoded_blocks_pdf['tokenized_block'] = encoded_blocks_pdf.event_encoded.apply(tokenizer.encode)


In [20]:
encoded_blocks_pdf

Unnamed: 0_level_0,event_encoded,encoded_block
BlockId,Unnamed: 1_level_1,Unnamed: 2_level_1
blk_-1000002529962039464,<|sep|>0 /10.251.123.1:41333 /10.251.123.1:500...,"[50277, 17, 1227, 740, 15, 21451, 15, 10683, 1..."
blk_-100000266894974466,<|sep|>6 <|sep|>0 /10.250.10.144:36204 /10.250...,"[50277, 23, 209, 50277, 17, 1227, 740, 15, 951..."
blk_-1000007292892887521,<|sep|>0 /10.251.127.47:50228 /10.251.127.47:5...,"[50277, 17, 1227, 740, 15, 21451, 15, 11946, 1..."
blk_-1000014584150379967,<|sep|>0 /10.251.43.210:49254 /10.251.43.210:5...,"[50277, 17, 1227, 740, 15, 21451, 15, 3079, 15..."
blk_-1000028658773048709,<|sep|>0 /10.251.107.196:58917 /10.251.107.196...,"[50277, 17, 1227, 740, 15, 21451, 15, 12224, 1..."
...,...,...
blk_999905757185707736,<|sep|>0 /10.251.39.160:41914 /10.251.39.160:5...,"[50277, 17, 1227, 740, 15, 21451, 15, 1867, 15..."
blk_999915040208161699,<|sep|>0 /10.251.43.210:46583 /10.251.43.210:5...,"[50277, 17, 1227, 740, 15, 21451, 15, 3079, 15..."
blk_999958959261325562,<|sep|>0 /10.251.203.246:56717 /10.251.203.246...,"[50277, 17, 1227, 740, 15, 21451, 15, 17490, 1..."
blk_999974850451006327,<|sep|>0 /10.251.126.5:32870 /10.251.126.5:500...,"[50277, 17, 1227, 740, 15, 21451, 15, 13381, 1..."


In [21]:
print(f"total token count: {encoded_blocks_pdf.tokenized_block.apply(len).sum():,}")
encoded_blocks_pdf.tokenized_block.apply(len).describe()

total token count: 137,942,766


Unnamed: 0,encoded_block
count,575061.0
mean,239.875015
std,85.098227
min,27.0
25%,219.0
50%,219.0
75%,223.0
max,5770.0


In [22]:
encoded_blocks_pdf.tokenized_block.apply(len).describe(percentiles=[0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99])


Unnamed: 0,encoded_block
count,575061.0
mean,239.875015
std,85.098227
min,27.0
1%,54.0
5%,174.0
10%,174.0
25%,219.0
50%,219.0
75%,223.0


In [25]:
import torch

device = torch.device(
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using device: {device}")

Using device: cuda


# Train/val/test split

We are using a random split, so this means that we are assuming a stationary distribution for our logs.  We will add complexity later


In [26]:
from sklearn.model_selection import train_test_split

# Merge with anomaly labels
encoded_blocks_with_labels = encoded_blocks_pdf.merge(
    anomaly_label_df,
    left_index=True,
    right_on='BlockId'
)

# Split into train/test sets (80/20 split)
train_df, val_test_df = train_test_split(
    encoded_blocks_with_labels,
    test_size=0.2,
    random_state=42,
    stratify=encoded_blocks_with_labels['Label']
)

print(f"Training samples: {len(train_df)}")

# Split into val/test sets (50/50 split)
val_df, test_df = train_test_split(
    val_test_df,
    test_size=0.5,
    random_state=42,
)

print(f"Val samples:  {len(val_df)}")
print(f"Test samples: {len(val_df)}")

Training samples: 460048
Val samples:  57506
Test samples: 57506


In [27]:
train_df

Unnamed: 0,event_encoded,encoded_block,BlockId,Label
257494,<|sep|>0 /10.251.67.211:54457 /10.251.67.211:5...,"[50277, 17, 1227, 740, 15, 21451, 15, 2251, 15...",blk_-4040947678439826686,Normal
49365,<|sep|>6 <|sep|>0 /10.251.106.37:36707 /10.251...,"[50277, 23, 209, 50277, 17, 1227, 740, 15, 214...",blk_1870752360007129176,Normal
7319,<|sep|>6 <|sep|>0 /10.251.121.224:40809 /10.25...,"[50277, 23, 209, 50277, 17, 1227, 740, 15, 214...",blk_-1999301527305082358,Normal
295080,<|sep|>0 /10.251.123.20:56258 /10.251.123.20:5...,"[50277, 17, 1227, 740, 15, 21451, 15, 10683, 1...",blk_-2322520798745751605,Normal
64733,<|sep|>6 <|sep|>0 /10.251.107.242:55242 /10.25...,"[50277, 23, 209, 50277, 17, 1227, 740, 15, 214...",blk_-4090429635427697097,Normal
...,...,...,...,...
424427,<|sep|>0 /10.251.37.240:42153 /10.251.37.240:5...,"[50277, 17, 1227, 740, 15, 21451, 15, 1787, 15...",blk_4272247743717120753,Normal
403348,<|sep|>0 /10.251.215.50:36443 /10.251.215.50:5...,"[50277, 17, 1227, 740, 15, 21451, 15, 21351, 1...",blk_1218092075075778522,Normal
253046,<|sep|>0 /10.250.11.53:53272 /10.250.11.53:500...,"[50277, 17, 1227, 740, 15, 9519, 15, 883, 15, ...",blk_-4591257497708039986,Normal
495499,<|sep|>0 /10.251.125.174:53652 /10.251.125.174...,"[50277, 17, 1227, 740, 15, 21451, 15, 9312, 15...",blk_-4092465791855115484,Normal


In [76]:
# Set up training parameters
if device.type == "cuda":
  BATCH_SIZE = 16
else:
  BATCH_SIZE = 4

MAX_LENGTH = 405  # Truncate sequences to manage memory
LEARNING_RATE = 1e-4
NUM_EPOCHS = 1



In [77]:
print(f"using BATCH_SIZE = {BATCH_SIZE}")
print(f"using MAX_LENGTH = {MAX_LENGTH}")
print(f"using LEARNING_RATE = {LEARNING_RATE}")
print(f"using NUM_EPOCHS = {NUM_EPOCHS}")

using BATCH_SIZE = 16
using MAX_LENGTH = 405
using LEARNING_RATE = 0.0001
using NUM_EPOCHS = 1


In [29]:
import os, wandb

wandb.login(key=os.getenv("WANDB_API_KEY"))


[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mhonicky[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [78]:
import gc
import psutil

def print_memory_stats(prefix=""):
    """Detailed memory statistics"""
    if device.type == "cuda":
        allocated = torch.cuda.memory_allocated() / 1024**3
        reserved = torch.cuda.memory_reserved() / 1024**3
    elif device.type == "mps":
        allocated = torch.mps.current_allocated_memory() / 1024**3
        reserved = torch.mps.driver_allocated_memory() / 1024**3
    else:
        allocated = reserved = 0

    print(f"\n{prefix} Memory Status:")
    print(f"├── Allocated: {allocated:.2f} GB (actively used by tensors)")
    print(f"├── Reserved:  {reserved:.2f} GB (held by driver)")
    print(f"├── Cached:    {(reserved - allocated):.2f} GB (reserved - allocated)")

    # System memory info
    vm = psutil.virtual_memory()
    print(f"└── System Available: {vm.available / 1024**3:.2f} GB")

def get_gpu_memory_metrics():
    """Get system metrics for logging"""
    if device.type == "cuda":
        return {
            "gpu_memory_allocated_gb": torch.cuda.memory_allocated() / (1024**3),
            "gpu_memory_reserved_gb": torch.cuda.memory_reserved() / (1024**3),
        }
    elif device.type == "mps":
        return {
            "gpu_memory_allocated_gb": torch.mps.current_allocated_memory() / (1024**3),
            "gpu_memory_reserved_gb": torch.mps.driver_allocated_memory() / (1024**3),
        }
    return {
        "gpu_memory_allocated_gb": 0,
        "gpu_memory_reserved_gb": 0,
    }

def clear_memory():
    """Explicitly clear memory"""
    gc.collect()
    if device.type == "cuda":
        torch.cuda.empty_cache()
    elif device.type == "mps":
        torch.mps.empty_cache()

In [31]:
structured_with_event_id_pdf = structured_df.merge(event_id_mapping_pdf, on="EventId")
structured_with_event_id_pdf.head()

Unnamed: 0,LineId,Date,Time,Pid,Level,Component,Content,EventId,EventTemplate,ParameterList,BlockId,ParsedParameterList,NewEventId
0,1,81109,203518,143,INFO,dfs.DataNode$DataXceiver,Receiving block blk_-1608999687919862906 src: ...,09a53393,Receiving block <*> src: <*> dest: <*>,"['blk_-1608999687919862906', '/10.250.19.102:5...",blk_-1608999687919862906,"[blk_-1608999687919862906, /10.250.19.102:5410...",0
1,2,81109,203518,35,INFO,dfs.FSNamesystem,BLOCK* NameSystem.allocateBlock: /mnt/hadoop/m...,3d91fa85,BLOCK* NameSystem.allocateBlock: <*> <*>,['/mnt/hadoop/mapred/system/job_200811092030_0...,blk_-1608999687919862906,[/mnt/hadoop/mapred/system/job_200811092030_00...,6
2,3,81109,203519,143,INFO,dfs.DataNode$DataXceiver,Receiving block blk_-1608999687919862906 src: ...,09a53393,Receiving block <*> src: <*> dest: <*>,"['blk_-1608999687919862906', '/10.250.10.6:405...",blk_-1608999687919862906,"[blk_-1608999687919862906, /10.250.10.6:40524,...",0
3,4,81109,203519,145,INFO,dfs.DataNode$DataXceiver,Receiving block blk_-1608999687919862906 src: ...,09a53393,Receiving block <*> src: <*> dest: <*>,"['blk_-1608999687919862906', '/10.250.14.224:4...",blk_-1608999687919862906,"[blk_-1608999687919862906, /10.250.14.224:4242...",0
4,5,81109,203519,145,INFO,dfs.DataNode$PacketResponder,PacketResponder 1 for block blk_-1608999687919...,d38aa58d,PacketResponder <*> for block <*> <*>,"['1', 'blk_-1608999687919862906 terminating']",blk_-1608999687919862906,"[1, blk_-1608999687919862906 terminating]",2


In [79]:
# Create DataLoader
class HDFSDataset(torch.utils.data.Dataset):
    def __init__(self, encoded_blocks, max_length):
        self.tokenized_blocks = encoded_blocks
        self.max_length = max_length

    def __len__(self):
        return len(self.tokenized_blocks)

    def __getitem__(self, idx):
        tokens = self.tokenized_blocks.iloc[idx]['tokenized_block']
        # Truncate if needed
        if len(tokens) > self.max_length:
            tokens = tokens[:self.max_length]

        # Convert to tensor and pad
        input_ids = torch.tensor(tokens, dtype=torch.long)
        attention_mask = torch.ones_like(input_ids)

        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
        }

def create_dataloader(encoded_pdf, tokenizer):

    dataset = HDFSDataset(encoded_pdf, MAX_LENGTH)
    dataloader = torch.utils.data.DataLoader(
        dataset,
        batch_size=BATCH_SIZE,
        shuffle=True,
        collate_fn=lambda x: {
            'input_ids': torch.nn.utils.rnn.pad_sequence(
                [item['input_ids'] for item in x],
                batch_first=True,
                padding_value=tokenizer.pad_token_id if tokenizer.pad_token_id else 0
            ),
            'attention_mask': torch.nn.utils.rnn.pad_sequence(
                [item['attention_mask'] for item in x],
                batch_first=True,
                padding_value=0
            )
        }
    )

    return dataloader

dataloader = create_dataloader(train_df, tokenizer)

In [110]:
val_df.head()

Unnamed: 0,event_encoded,encoded_block,BlockId,Label
370570,<|sep|>0 /10.251.125.193:49078 /10.251.125.193...,"[50277, 17, 1227, 740, 15, 21451, 15, 9312, 15...",blk_8706546487798466885,Normal
387094,<|sep|>0 /10.251.74.192:36984 /10.251.74.192:5...,"[50277, 17, 1227, 740, 15, 21451, 15, 3566, 15...",blk_3164806166289090589,Normal
524461,<|sep|>0 /10.251.67.113:44473 /10.251.67.113:5...,"[50277, 17, 1227, 740, 15, 21451, 15, 2251, 15...",blk_6334862664379948501,Normal
491282,<|sep|>0 /10.250.15.67:36719 /10.250.15.67:500...,"[50277, 17, 1227, 740, 15, 9519, 15, 1010, 15,...",blk_-4209139676364491359,Normal
671,<|sep|>0 /10.251.111.228:56317 /10.251.111.228...,"[50277, 17, 1227, 740, 15, 21451, 15, 10768, 1...",blk_-7362312881779468190,Normal


In [108]:

print_memory_stats()


 Memory Status:
├── Allocated: 0.26 GB (actively used by tensors)
├── Reserved:  12.43 GB (held by driver)
├── Cached:    12.18 GB (reserved - allocated)
└── System Available: 3.20 GB


In [81]:
clear_memory()

In [82]:
import numpy as np

def evaluate_model(model, dataloader, device):
    """
    Evaluate the model on the provided dataloader with detailed perplexity metrics
    """
    model.eval()
    total_loss = 0
    num_batches = 0
    all_perplexities = []

    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)

            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=input_ids
            )

            # Calculate per-token perplexity
            loss = outputs.loss
            batch_perplexity = torch.exp(outputs.logits[..., :-1, :].log_softmax(-1).gather(
                -1, input_ids[..., 1:].unsqueeze(-1)
            ).squeeze(-1) * -1)

            # Mask out padding tokens
            mask = attention_mask[..., 1:].bool()
            valid_perplexities = batch_perplexity[mask].cpu().numpy()
            all_perplexities.extend(valid_perplexities.tolist())

            total_loss += loss.item()
            num_batches += 1

            wandb.log({
                "eval/batch_loss": loss.item(),
                **get_gpu_memory_metrics()
            })

    # Calculate percentiles
    percentiles = np.percentile(all_perplexities, [50, 75, 90, 95, 99, 100])

    # Log to terminal
    print("\nPerplexity Percentiles:")
    print(f"50th:       {percentiles[0]:.2f}")
    print(f"75th:       {percentiles[1]:.2f}")
    print(f"90th:       {percentiles[2]:.2f}")
    print(f"95th:       {percentiles[3]:.2f}")
    print(f"99th:       {percentiles[4]:.2f}")
    print(f"Max (100th): {percentiles[5]:.2f}")

    # Log to wandb
    wandb.log({
        "eval/avg_loss": total_loss / num_batches,
        "eval/perplexity_p50": percentiles[0],
        "eval/perplexity_p75": percentiles[1],
        "eval/perplexity_p90": percentiles[2],
        "eval/perplexity_p95": percentiles[3],
        "eval/perplexity_p99": percentiles[4],
        "eval/perplexity_max": percentiles[5],
    })

    return total_loss / num_batches

def train_model(model, dataloader, optimizer, device, steps=None, start_batch=0):
    """
    Train the model for a specified number of steps or until the dataloader is exhausted.

    Args:
        model: The model to train
        dataloader: DataLoader containing the training data
        optimizer: The optimizer to use
        device: The device to train on
        steps (int, optional): Number of steps to train. If None, train on all remaining batches
        start_batch (int): The batch index to start from (for resuming training)

    Returns:
        tuple: (global_step, batch_idx) - The current global step and batch index for resuming
    """
    model.train()
    global_step = start_batch
    total_loss = 0

    for batch_idx, batch in enumerate(dataloader, start=start_batch):
        # Check if we've reached the requested number of steps
        if steps is not None and (batch_idx - start_batch) >= steps:
            break

        # Move batch to device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)

        # Forward pass
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=input_ids
        )

        loss = outputs.loss
        total_loss += loss.item()

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Print progress every 100 batches
        if batch_idx % 100 == 0:
            print(f"Batch {batch_idx}, Loss: {loss.item():.4f}")

        wandb.log({
            "train/batch_loss": loss.item(),
            "train/batch": batch_idx,
            **get_gpu_memory_metrics()
        }, step=global_step)

        global_step += 1

    avg_loss = total_loss / (batch_idx - start_batch + 1)
    print(f"Training complete. Average loss: {avg_loss:.4f}")
    wandb.log({
        "train/avg_loss": avg_loss,
    })

    return global_step, batch_idx

In [83]:
# Move model to MPS device if available, otherwise CPU
model = get_model().to(device)

# Set up optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)

wandb.init(
    project="log-analysis-pythia",
    config={
        "batch_size": BATCH_SIZE,
        "max_length": MAX_LENGTH,
        "learning_rate": LEARNING_RATE,
        "epochs": NUM_EPOCHS,
        "model": "pythia-14m",
    }
)


VBox(children=(Label(value='0.033 MB of 0.033 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
eval/avg_loss,█▅▂▂▁▂
eval/batch_loss,█▇▂██▇▇▇▇▆▅▆▇▄▄▂▅▄▃▂▆▆▅▇▃▃▂▂▁▅▅▃▅▁▃▄▂▆▆▃
eval/perplexity_max,▇▁▁▁▁█
eval/perplexity_p50,█▃▂▁▁▁
eval/perplexity_p75,█▅▃▃▁▃
eval/perplexity_p90,█▄▄▃▁▄
eval/perplexity_p95,█▃▃▂▁▂
eval/perplexity_p99,█▅▁▂▂▄
gpu_memory_allocated_gb,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
gpu_memory_reserved_gb,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

0,1
eval/avg_loss,0.25539
eval/batch_loss,0.25517
eval/perplexity_max,144235888.0
eval/perplexity_p50,1.0
eval/perplexity_p75,1.00217
eval/perplexity_p90,1.68944
eval/perplexity_p95,12.07805
eval/perplexity_p99,421.63466
gpu_memory_allocated_gb,0.0
gpu_memory_reserved_gb,0.0


In [84]:
print_memory_stats()


 Memory Status:
├── Allocated: 0.55 GB (actively used by tensors)
├── Reserved:  0.87 GB (held by driver)
├── Cached:    0.32 GB (reserved - allocated)
└── System Available: 60.08 GB


In [85]:
len(dataloader)

28753

In [87]:
try:
  current_batch = 0
  for i in range(int(len(dataloader)/2000)+1):
      current_step, current_batch = train_model(model, dataloader, optimizer, device, steps=2000, start_batch=current_batch)

      eval_dataloader = create_dataloader(val_df[:10*BATCH_SIZE], tokenizer)
      evaluate_model(model, eval_dataloader, device)
except:
  # print stack trace
  import traceback
  traceback.print_exc()


Batch 0, Loss: 91.5764
Batch 100, Loss: 0.6737
Batch 200, Loss: 0.4162
Batch 300, Loss: 0.3665
Batch 400, Loss: 0.3528
Batch 500, Loss: 0.3632
Batch 600, Loss: 0.2503
Batch 700, Loss: 0.2613
Batch 800, Loss: 0.2976
Batch 900, Loss: 0.2237
Batch 1000, Loss: 0.2266
Batch 1100, Loss: 0.2455
Batch 1200, Loss: 0.2274
Batch 1300, Loss: 0.2283
Batch 1400, Loss: 0.2321
Batch 1500, Loss: 0.2124
Batch 1600, Loss: 0.1878
Batch 1700, Loss: 0.1842
Batch 1800, Loss: 0.1813
Batch 1900, Loss: 0.1856
Training complete. Average loss: 0.4594

Perplexity Percentiles:
50th:       1.00
75th:       1.01
90th:       1.76
95th:       12.57
99th:       414.82
Max (100th): 338345.22
Batch 2000, Loss: 0.2475




Batch 2100, Loss: 0.2143




Batch 2200, Loss: 0.2459
Batch 2300, Loss: 0.1957
Batch 2400, Loss: 0.1747
Batch 2500, Loss: 0.1698
Batch 2600, Loss: 0.2144
Batch 2700, Loss: 0.2541
Batch 2800, Loss: 0.1811
Batch 2900, Loss: 0.2258
Batch 3000, Loss: 0.2354
Batch 3100, Loss: 0.2405
Batch 3200, Loss: 0.2018
Batch 3300, Loss: 0.1806
Batch 3400, Loss: 0.1796
Batch 3500, Loss: 0.1850
Batch 3600, Loss: 0.2169
Batch 3700, Loss: 0.1832
Batch 3800, Loss: 0.1757
Batch 3900, Loss: 0.1649
Training complete. Average loss: 0.1981

Perplexity Percentiles:
50th:       1.00
75th:       1.00
90th:       1.68
95th:       8.57
99th:       319.43
Max (100th): 23827.67
Batch 4000, Loss: 0.1889
Batch 4100, Loss: 0.1982




Batch 4200, Loss: 0.1769
Batch 4300, Loss: 0.1778
Batch 4400, Loss: 0.2300
Batch 4500, Loss: 0.1634
Batch 4600, Loss: 0.2002
Batch 4700, Loss: 0.1668
Batch 4800, Loss: 0.1709
Batch 4900, Loss: 0.4300
Batch 5000, Loss: 0.2502
Batch 5100, Loss: 0.2202
Batch 5200, Loss: 0.2862
Batch 5300, Loss: 0.2099
Batch 5400, Loss: 0.2694
Batch 5500, Loss: 0.1984
Batch 5600, Loss: 0.2124
Batch 5700, Loss: 0.1860
Batch 5800, Loss: 0.1914
Batch 5900, Loss: 0.1962
Training complete. Average loss: 0.2471

Perplexity Percentiles:
50th:       1.00
75th:       1.00
90th:       1.76
95th:       10.01
99th:       390.32
Max (100th): 115522.31
Batch 6000, Loss: 0.2005




Batch 6100, Loss: 0.1734
Batch 6200, Loss: 0.1827
Batch 6300, Loss: 0.1845
Batch 6400, Loss: 0.1917
Batch 6500, Loss: 0.1994
Batch 6600, Loss: 0.1975
Batch 6700, Loss: 0.2895
Batch 6800, Loss: 0.2210
Batch 6900, Loss: 0.2383
Batch 7000, Loss: 0.1823
Batch 7100, Loss: 0.1709
Batch 7200, Loss: 0.2243
Batch 7300, Loss: 0.1767
Batch 7400, Loss: 0.2633
Batch 7500, Loss: 0.2405
Batch 7600, Loss: 0.1994
Batch 7700, Loss: 0.1951
Batch 7800, Loss: 0.1725
Batch 7900, Loss: 0.1716
Training complete. Average loss: 0.1910

Perplexity Percentiles:
50th:       1.00
75th:       1.00
90th:       1.59
95th:       7.54
99th:       327.03
Max (100th): 60237.54
Batch 8000, Loss: 0.1743
Batch 8100, Loss: 0.1701




Batch 8200, Loss: 0.1879
Batch 8300, Loss: 0.1808
Batch 8400, Loss: 0.1919
Batch 8500, Loss: 0.1654
Batch 8600, Loss: 0.1851
Batch 8700, Loss: 0.1655
Batch 8800, Loss: 0.2037
Batch 8900, Loss: 0.1905
Batch 9000, Loss: 0.2066
Batch 9100, Loss: 0.1769
Batch 9200, Loss: 0.1593
Batch 9300, Loss: 0.1845
Batch 9400, Loss: 0.1924
Batch 9500, Loss: 0.1801
Batch 9600, Loss: 0.1606
Batch 9700, Loss: 0.2159
Batch 9800, Loss: 0.2250
Batch 9900, Loss: 0.1795
Training complete. Average loss: 0.1844

Perplexity Percentiles:
50th:       1.00
75th:       1.01
90th:       1.64
95th:       7.20
99th:       333.17
Max (100th): 65201.86
Batch 10000, Loss: 0.1785




Batch 10100, Loss: 0.1607
Batch 10200, Loss: 0.1710
Batch 10300, Loss: 0.2088
Batch 10400, Loss: 0.2268
Batch 10500, Loss: 0.1647
Batch 10600, Loss: 0.1635
Batch 10700, Loss: 0.1777
Batch 10800, Loss: 0.1610
Batch 10900, Loss: 0.1702
Batch 11000, Loss: 0.1694
Batch 11100, Loss: 0.1751
Batch 11200, Loss: 0.1697
Batch 11300, Loss: 0.1787
Batch 11400, Loss: 0.1757
Batch 11500, Loss: 0.1868
Batch 11600, Loss: 0.1619
Batch 11700, Loss: 0.1732
Batch 11800, Loss: 0.1785
Batch 11900, Loss: 0.1620
Training complete. Average loss: 0.1820

Perplexity Percentiles:
50th:       1.00
75th:       1.00
90th:       1.66
95th:       6.83
99th:       366.90
Max (100th): 88341.70
Batch 12000, Loss: 0.1816
Batch 12100, Loss: 0.1711




Batch 12200, Loss: 0.2146
Batch 12300, Loss: 0.1814
Batch 12400, Loss: 0.2202
Batch 12500, Loss: 0.2094
Batch 12600, Loss: 0.1579
Batch 12700, Loss: 0.1532
Batch 12800, Loss: 0.2071
Batch 12900, Loss: 0.1775
Batch 13000, Loss: 0.1671
Batch 13100, Loss: 0.1620
Batch 13200, Loss: 0.1837
Batch 13300, Loss: 0.1844
Batch 13400, Loss: 0.1706
Batch 13500, Loss: 0.1560
Batch 13600, Loss: 0.1535
Batch 13700, Loss: 0.1613
Batch 13800, Loss: 0.2104
Batch 13900, Loss: 0.1972
Training complete. Average loss: 0.1801

Perplexity Percentiles:
50th:       1.00
75th:       1.00
90th:       1.41
95th:       7.38
99th:       325.82
Max (100th): 21571.41
Batch 14000, Loss: 0.1763




Batch 14100, Loss: 0.1579
Batch 14200, Loss: 0.1727
Batch 14300, Loss: 0.1806
Batch 14400, Loss: 0.1620
Batch 14500, Loss: 0.1810
Batch 14600, Loss: 0.1898
Batch 14700, Loss: 0.2168
Batch 14800, Loss: 0.1649
Batch 14900, Loss: 0.1826
Batch 15000, Loss: 0.2404
Batch 15100, Loss: 0.2002
Batch 15200, Loss: 0.1575
Batch 15300, Loss: 0.1803
Batch 15400, Loss: 0.1935
Batch 15500, Loss: 0.1834
Batch 15600, Loss: 0.2017
Batch 15700, Loss: 0.1733
Batch 15800, Loss: 0.1913
Batch 15900, Loss: 0.1918
Training complete. Average loss: 0.1776

Perplexity Percentiles:
50th:       1.00
75th:       1.00
90th:       1.36
95th:       6.69
99th:       317.09
Max (100th): 431480.88
Batch 16000, Loss: 0.1719




Batch 16100, Loss: 0.1655
Batch 16200, Loss: 0.1974
Batch 16300, Loss: 0.1603
Batch 16400, Loss: 0.1578
Batch 16500, Loss: 0.1602
Batch 16600, Loss: 0.1707
Batch 16700, Loss: 0.1589
Batch 16800, Loss: 0.1707
Batch 16900, Loss: 0.1750
Batch 17000, Loss: 0.2115
Batch 17100, Loss: 0.1564
Batch 17200, Loss: 0.1520
Batch 17300, Loss: 0.1710
Batch 17400, Loss: 0.1628
Batch 17500, Loss: 0.1542
Batch 17600, Loss: 0.1637
Batch 17700, Loss: 0.1622
Batch 17800, Loss: 0.1726
Batch 17900, Loss: 0.1633
Training complete. Average loss: 0.1763

Perplexity Percentiles:
50th:       1.00
75th:       1.00
90th:       1.41
95th:       6.33
99th:       357.16
Max (100th): 57130.78
Batch 18000, Loss: 0.1887
Batch 18100, Loss: 0.2140




Batch 18200, Loss: 0.1618
Batch 18300, Loss: 0.1608
Batch 18400, Loss: 0.1559
Batch 18500, Loss: 0.1867
Batch 18600, Loss: 0.1616
Batch 18700, Loss: 0.2103
Batch 18800, Loss: 0.1690
Batch 18900, Loss: 0.1593
Batch 19000, Loss: 0.2602
Batch 19100, Loss: 0.1795
Batch 19200, Loss: 0.2175
Batch 19300, Loss: 0.1654
Batch 19400, Loss: 0.1618
Batch 19500, Loss: 0.1649
Batch 19600, Loss: 0.1576
Batch 19700, Loss: 0.2037
Batch 19800, Loss: 0.2598
Batch 19900, Loss: 0.1646
Training complete. Average loss: 0.1735

Perplexity Percentiles:
50th:       1.00
75th:       1.00
90th:       1.38
95th:       6.37
99th:       308.60
Max (100th): 43330.25
Batch 20000, Loss: 0.2127




Batch 20100, Loss: 0.1675
Batch 20200, Loss: 0.1573
Batch 20300, Loss: 0.1985
Batch 20400, Loss: 0.1563
Batch 20500, Loss: 0.1525
Batch 20600, Loss: 0.1586
Batch 20700, Loss: 0.1687
Batch 20800, Loss: 0.1640
Batch 20900, Loss: 0.1684
Batch 21000, Loss: 0.1662
Batch 21100, Loss: 0.1577
Batch 21200, Loss: 0.1984
Batch 21300, Loss: 0.2023
Batch 21400, Loss: 0.1791
Batch 21500, Loss: 0.1804
Batch 21600, Loss: 0.1879
Batch 21700, Loss: 0.1685
Batch 21800, Loss: 0.2223
Batch 21900, Loss: 0.1555
Training complete. Average loss: 0.1721

Perplexity Percentiles:
50th:       1.00
75th:       1.00
90th:       1.39
95th:       6.70
99th:       324.89
Max (100th): 207621.72
Batch 22000, Loss: 0.1882
Batch 22100, Loss: 0.1608




Batch 22200, Loss: 0.1560
Batch 22300, Loss: 0.1684
Batch 22400, Loss: 0.2081
Batch 22500, Loss: 0.1862
Batch 22600, Loss: 0.2012
Batch 22700, Loss: 0.1607
Batch 22800, Loss: 0.1648
Batch 22900, Loss: 0.1637
Batch 23000, Loss: 0.1670
Batch 23100, Loss: 0.1475
Batch 23200, Loss: 0.1463
Batch 23300, Loss: 0.2097
Batch 23400, Loss: 0.1999
Batch 23500, Loss: 0.1766
Batch 23600, Loss: 0.1917
Batch 23700, Loss: 0.1875
Batch 23800, Loss: 0.1902
Batch 23900, Loss: 0.1641
Training complete. Average loss: 0.1719

Perplexity Percentiles:
50th:       1.00
75th:       1.00
90th:       1.36
95th:       6.21
99th:       342.85
Max (100th): 18722.79
Batch 24000, Loss: 0.2577




Batch 24100, Loss: 0.1512
Batch 24200, Loss: 0.1647
Batch 24300, Loss: 0.1863
Batch 24400, Loss: 0.2166
Batch 24500, Loss: 0.1898
Batch 24600, Loss: 0.1885
Batch 24700, Loss: 0.1730
Batch 24800, Loss: 0.1775
Batch 24900, Loss: 0.1734
Batch 25000, Loss: 0.1726
Batch 25100, Loss: 0.1818
Batch 25200, Loss: 0.1652
Batch 25300, Loss: 0.1617
Batch 25400, Loss: 0.1570
Batch 25500, Loss: 0.1526
Batch 25600, Loss: 0.1548
Batch 25700, Loss: 0.1584
Batch 25800, Loss: 0.1740
Batch 25900, Loss: 0.1629
Training complete. Average loss: 0.1718

Perplexity Percentiles:
50th:       1.00
75th:       1.00
90th:       1.30
95th:       5.95
99th:       339.47
Max (100th): 51263.20
Batch 26000, Loss: 0.1496
Batch 26100, Loss: 0.1600




Batch 26200, Loss: 0.1753
Batch 26300, Loss: 0.1621
Batch 26400, Loss: 0.1448
Batch 26500, Loss: 0.1636
Batch 26600, Loss: 0.1833
Batch 26700, Loss: 0.1981
Batch 26800, Loss: 0.1570
Batch 26900, Loss: 0.1651
Batch 27000, Loss: 0.1681
Batch 27100, Loss: 0.1526
Batch 27200, Loss: 0.1477
Batch 27300, Loss: 0.2278
Batch 27400, Loss: 0.1820
Batch 27500, Loss: 0.1494
Batch 27600, Loss: 0.1490
Batch 27700, Loss: 0.1545
Batch 27800, Loss: 0.1565
Batch 27900, Loss: 0.1757
Training complete. Average loss: 0.1714

Perplexity Percentiles:
50th:       1.00
75th:       1.00
90th:       1.39
95th:       5.91
99th:       343.62
Max (100th): 21245.93
Batch 28000, Loss: 0.1737




Batch 28100, Loss: 0.1800
Batch 28200, Loss: 0.1534
Batch 28300, Loss: 0.2073
Batch 28400, Loss: 0.1752
Batch 28500, Loss: 0.1570
Batch 28600, Loss: 0.1490
Batch 28700, Loss: 0.1591
Batch 28800, Loss: 0.1494
Batch 28900, Loss: 0.1493
Batch 29000, Loss: 0.1488
Batch 29100, Loss: 0.2093
Batch 29200, Loss: 0.1578
Batch 29300, Loss: 0.1974
Batch 29400, Loss: 0.1861
Batch 29500, Loss: 0.1766
Batch 29600, Loss: 0.1773
Batch 29700, Loss: 0.1616
Batch 29800, Loss: 0.1644
Batch 29900, Loss: 0.1491
Training complete. Average loss: 0.1704

Perplexity Percentiles:
50th:       1.00
75th:       1.00
90th:       1.32
95th:       5.91
99th:       337.97
Max (100th): 39214.25


In [88]:

# Save model to HuggingFace Hub
model_name = "pythia-14m-hdfs-logs"
model.push_to_hub(
    f"honicky/{model_name}",
    token=os.environ["HF_WRITE_TOKEN"],
    commit_message=f"Trained {current_step} steps"
)

# Save tokenizer with the added special tokens
tokenizer.push_to_hub(
    f"honicky/{model_name}",
    token=os.environ["HF_WRITE_TOKEN"],
    commit_message="Tokenizer with added special tokens for HDFS logs"
)

# Save model config and training details
with open("README.md", "w") as f:
    f.write(f"""---
language: en
tags:
- log-analysis
- pythia
- hdfs
license: mit
datasets:
- honicky/log-analysis-hdfs-preprocessed
metrics:
- cross-entropy
- perplexity
base_model: EleutherAI/pythia-14m
---

# {model_name}

Fine-tuned Pythia-14m model for HDFS log analysis, specifically for anomaly detection.

## Model Description

This model is fine-tuned from `EleutherAI/pythia-14m` for analyzing HDFS log sequences. It's designed to understand and predict patterns in
HDFS log data so that we can detect anomalies using the perplexity of the log sequence. THhe HDFS sequence is handy because it has labels
so we can use it to validate that the model can predict anomalies.

We will use this model to understand the ability of a small model to predict anomalies in a specific dataset.  We will study model scale
and experiment with tokenization, intialization, data set size, etc. to find a configuration that is minimal in size and fast, but can
effectively predict anomalies.  We will then attempt build a model that is more robust to different log formats.

- Huggingface Model: [honicky/pythia-14m-hdfs-logs](https://huggingface.co/honicky/pythia-14m-hdfs-logs)

## Training Details
- Base model: EleutherAI/pythia-14m
- Dataset: https://zenodo.org/records/8196385/files/HDFS_v1.zip?download=1 + preprocessed data at honicky/log-analysis-hdfs-preprocessed
- Batch size: {BATCH_SIZE}
- Max sequence length: {MAX_LENGTH}
- Learning rate: {LEARNING_RATE}
- Training steps: {current_step}
- Weights and Biases run: {wandb.run.url}


## Special Tokens
- Added `<|sep|>` token for event ID separation

## Intended Use
This model is intended for:
- Analyzing HDFS log sequences
- Detecting anomalies in log patterns
- Understanding system behavior through log analysis

## Limitations
- Model is specifically trained on HDFS logs and may not generalize to other log formats
- Limited to the context window size of {MAX_LENGTH} tokens


""")


# Push README
from huggingface_hub import HfApi
api = HfApi()
api.upload_file(
    path_or_fileobj="README.md",
    path_in_repo="README.md",
    repo_id=f"honicky/{model_name}",
    token=os.environ["HF_WRITE_TOKEN"],
    commit_message="Add model documentation"
)

README.md:   0%|          | 0.00/1.74k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/56.3M [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/honicky/pythia-14m-hdfs-logs/commit/ebff8b9ef46e6de457d4b0af78200da41a5208d3', commit_message='Add model documentation', commit_description='', oid='ebff8b9ef46e6de457d4b0af78200da41a5208d3', pr_url=None, repo_url=RepoUrl('https://huggingface.co/honicky/pythia-14m-hdfs-logs', endpoint='https://huggingface.co', repo_type='model', repo_id='honicky/pythia-14m-hdfs-logs'), pr_revision=None, pr_num=None)