<a href="https://colab.research.google.com/github/honicky/deep-log-analysis/blob/main/Pythia%20Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pythia Analysis - train small models on HDFS data

* use tokenized version of preprocessed HDFS events
* start with very small pythia models, test increasing size
* start with fine-tuning, then consider resetting weights and training from scratch
* experiment with different tokenizers
  * https://chatgpt.com/share/67448f53-29a0-800f-9913-af22d6ed0894

Interesting note: [Understanding LLM Embeddings for Regression](https://arxiv.org/pdf/2411.14708) discusses how textual embeddings of primarily numerical data can be surprisingly effective. This supports the hypothesis here that a pretrained model's embeddings can be used for log data, which is dominated by numbers, rather than training a specialized model for logs.

In [5]:
try:
  from google.colab import userdata

  !git clone https://github.com/honicky/deep-log-analysis.git
  !mv deep-log-analysis/* .
  !rm -rf deep-log-analysis
except ImportError:
  # Not running in Colab; assume the repository files are already present
  pass

Cloning into 'deep-log-analysis'...
remote: Enumerating objects: 48, done.

In [6]:
try:
    import logparser.Drain as Drain
except ImportError:
    %pip install requests git+https://github.com/logpai/logparser

%pip install transformers torch torchvision torchaudio wandb python-dotenv datasets



In [7]:
%load_ext autoreload
%autoreload 2
import dataloaders as dl
import model_utils


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Load secrets

If we are in Colab, we read them from the `userdata` module; otherwise we load them from a `.env` file.


In [8]:
import os
try:
  from google.colab import userdata
  os.environ["HF_WRITE_TOKEN"] = userdata.get('HF_WRITE_TOKEN')
  os.environ["WANDB_API_KEY"] = userdata.get('WANDB_API_KEY')
except ImportError:
  from dotenv import load_dotenv
  load_dotenv()
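
A minimal fail-fast check, sketched under the assumption that both secrets must be present before training starts. If no `.env` file is found, the variables may simply stay unset, and the failure would only surface much later when pushing to the Hub. `missing_secrets` and `REQUIRED_SECRETS` are hypothetical names for illustration and are not part of this repository.

```python
import os

# Names taken from the cell above; the helper itself is hypothetical.
REQUIRED_SECRETS = ("HF_WRITE_TOKEN", "WANDB_API_KEY")

def missing_secrets(env, names=REQUIRED_SECRETS):
    """Return the names that are absent or empty in the mapping `env`."""
    return [name for name in names if not env.get(name)]

# In the notebook this would be called as missing_secrets(os.environ).
# A made-up environment is used here so the example is self-contained.
example_env = {"HF_WRITE_TOKEN": "hf_xxx", "WANDB_API_KEY": ""}
print(missing_secrets(example_env))
```

Raising an error when the returned list is non-empty would stop the run before any GPU time is spent.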


In [9]:
base_model_name = "EleutherAI/pythia-70m"  # @param ["EleutherAI/pythia-14m", "EleutherAI/pythia-70m"]
model_name = f"{base_model_name.split('/')[-1]}-hdfs-logs"


# Download and unzip the HDFS dataset

The functions check if the data is already downloaded and unzipped, and only download and unzip if they are not present.


In [10]:
from transformers import GPTNeoXTokenizerFast
tokenizer = GPTNeoXTokenizerFast.from_pretrained(base_model_name)
tokenizer.add_special_tokens({"additional_special_tokens": ["<|sep|>"]})
tokenizer.sep_token = "<|sep|>"
# The default tokenizer has no pad token set, so reuse EOS for collating / training
tokenizer.pad_token_id = tokenizer.eos_token_id


tokenizer_config.json:   0%|          | 0.00/396 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/99.0 [00:00<?, ?B/s]

Double-check that the tokenizer properly encodes the new special token.

In [11]:

tokenizer.encode("<|sep|>")


[50277]

Review the tokenizer configuration again to ensure the new special token is included.


In [12]:
tokenizer

GPTNeoXTokenizerFast(name_or_path='EleutherAI/pythia-70m', vocab_size=50254, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'sep_token': '<|sep|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|sep|>']}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<|padding|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	50254: AddedToken("                        ", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
	50255: AddedToken("                       ", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
	50256: AddedToken("                      ", rstrip=False, lstrip=False, 

In [13]:
import torch

from transformers import GPTNeoXForCausalLM

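def get_model():
    """Load the base model and size its embedding matrix for the extended tokenizer."""
    model = GPTNeoXForCausalLM.from_pretrained(base_model_name)
    # Make room for the added <|sep|> token; pad the row count to a multiple of 64
    model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=64)

    return model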

model = get_model()

config.json:   0%|          | 0.00/567 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/166M [00:00<?, ?B/s]
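
The effect of `pad_to_multiple_of=64` on the embedding row count can be sketched with plain arithmetic. This assumes `len(tokenizer)` is 50278, inferred from `<|sep|>` being encoded as id 50277 above; the helper name `round_up_to_multiple` is hypothetical.

```python
def round_up_to_multiple(n, multiple):
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple

# Assumption: the highest token id is 50277 (<|sep|>), so the tokenizer length is 50278.
tokenizer_length = 50277 + 1
embedding_rows = round_up_to_multiple(tokenizer_length, 64)
print(tokenizer_length, embedding_rows)
```

Comparing this value with `model.get_input_embeddings().weight.shape[0]` after calling `get_model()` is a quick way to confirm the resize behaved as expected.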

In [14]:
from datasets import load_dataset
import pandas as pd

# Load the dataset from the hub
dataset_dict = load_dataset("honicky/hdfs-logs-encoded-blocks")

# Convert the train and validation splits to pandas DataFrames
train_df = dataset_dict['train'].to_pandas()
val_df = dataset_dict['validation'].to_pandas()

README.md:   0%|          | 0.00/2.97k [00:00<?, ?B/s]

train-00000-of-00003.parquet:   0%|          | 0.00/46.3M [00:00<?, ?B/s]

train-00001-of-00003.parquet:   0%|          | 0.00/46.4M [00:00<?, ?B/s]

train-00002-of-00003.parquet:   0%|          | 0.00/46.4M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/17.4M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/17.4M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/460048 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/57506 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/57507 [00:00<?, ? examples/s]

In [30]:
import model_utils
globals().update(model_utils.training_params())



In [31]:
print(f"using BATCH_SIZE = {BATCH_SIZE}")
print(f"using MAX_LENGTH = {MAX_LENGTH}")
print(f"using LEARNING_RATE = {LEARNING_RATE}")
print(f"using NUM_EPOCHS = {NUM_EPOCHS}")

using BATCH_SIZE = 32
using MAX_LENGTH = 405
using LEARNING_RATE = 0.0001
using NUM_EPOCHS = 1


In [17]:
import os, wandb

wandb.login(key=os.getenv("WANDB_API_KEY"))


[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mhonicky[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [18]:

from model_utils import print_memory_stats, get_gpu_memory_metrics, clear_memory


In [35]:
# Create DataLoader
class HDFSDataset(torch.utils.data.Dataset):
    def __init__(self, encoded_blocks, max_length):
        self.tokenized_blocks = encoded_blocks
        self.max_length = max_length

    def __len__(self):
        return len(self.tokenized_blocks)

    def __getitem__(self, idx):
        tokens = self.tokenized_blocks.iloc[idx]['tokenized_block']
        # Truncate if needed
        if len(tokens) > self.max_length:
            tokens = tokens[:self.max_length]

        # Convert to a tensor; padding is applied later in the collate function
        input_ids = torch.tensor(tokens, dtype=torch.long)
        attention_mask = torch.ones_like(input_ids)

        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
        }

def create_dataloader(encoded_pdf, tokenizer):

    dataset = HDFSDataset(encoded_pdf, MAX_LENGTH)
    dataloader = torch.utils.data.DataLoader(
        dataset,
        batch_size=BATCH_SIZE,
        shuffle=True,
        collate_fn=lambda x: {
            'input_ids': torch.nn.utils.rnn.pad_sequence(
                [item['input_ids'] for item in x],
                batch_first=True,
                # Compare against None explicitly: a pad token id of 0 is valid but falsy
                padding_value=tokenizer.pad_token_id if tokenizer.pad_token_id is not None else 0
            ),
            'attention_mask': torch.nn.utils.rnn.pad_sequence(
                [item['attention_mask'] for item in x],
                batch_first=True,
                padding_value=0
            )
        }
    )

    return dataloader

dataloader = create_dataloader(train_df, tokenizer)
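
The collate step above right-pads each batch and marks padding with zeros in the attention mask. The sketch below mirrors that logic with plain Python lists so it runs without PyTorch. It also builds a `labels` field in which padded positions are set to -100, the value PyTorch's `CrossEntropyLoss` ignores by default. That field is not part of the notebook's collate function; it is shown as one possible refinement, because passing `labels=input_ids` lets padded positions contribute to the loss when the pad token shares an id with EOS. `pad_batch` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss

def pad_batch(sequences, pad_id):
    """Right-pad token id lists to a common length and build mask and labels."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        batch["input_ids"].append(list(seq) + [pad_id] * n_pad)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        # Padded positions get IGNORE_INDEX so they would not count toward the loss
        batch["labels"].append(list(seq) + [IGNORE_INDEX] * n_pad)
    return batch

# Made-up token ids, with pad_id=0 as in this notebook (pad token = EOS token)
example = pad_batch([[50277, 17, 1227], [50277]], pad_id=0)
print(example["input_ids"])
print(example["attention_mask"])
print(example["labels"])
```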

In [20]:
val_df.head()

Unnamed: 0,event_encoded,tokenized_block,block_id,label,__index_level_0__
0,<|sep|>0 /10.251.125.193:49078 /10.251.125.193...,"[50277, 17, 1227, 740, 15, 21451, 15, 9312, 15...",blk_8706546487798466885,Normal,370570
1,<|sep|>0 /10.251.74.192:36984 /10.251.74.192:5...,"[50277, 17, 1227, 740, 15, 21451, 15, 3566, 15...",blk_3164806166289090589,Normal,387094
2,<|sep|>0 /10.251.67.113:44473 /10.251.67.113:5...,"[50277, 17, 1227, 740, 15, 21451, 15, 2251, 15...",blk_6334862664379948501,Normal,524461
3,<|sep|>0 /10.250.15.67:36719 /10.250.15.67:500...,"[50277, 17, 1227, 740, 15, 9519, 15, 1010, 15,...",blk_-4209139676364491359,Normal,491282
4,<|sep|>0 /10.251.111.228:56317 /10.251.111.228...,"[50277, 17, 1227, 740, 15, 21451, 15, 10768, 1...",blk_-7362312881779468190,Normal,671


In [21]:

print_memory_stats()


 Memory Status:
├── Allocated: 0.00 GB (actively used by tensors)
├── Reserved:  0.00 GB (held by driver)
├── Cached:    0.00 GB (reserved - allocated)
└── System Available: 79.08 GB


In [22]:
clear_memory()

In [23]:
import numpy as np

def evaluate_model(model, dataloader, device):
    """
    Evaluate the model on the provided dataloader with detailed perplexity metrics
    """
    model.eval()
    total_loss = 0
    num_batches = 0
    all_perplexities = []

    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)

            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=input_ids
            )

            # Calculate per-token perplexity
            loss = outputs.loss
            batch_perplexity = torch.exp(outputs.logits[..., :-1, :].log_softmax(-1).gather(
                -1, input_ids[..., 1:].unsqueeze(-1)
            ).squeeze(-1) * -1)

            # Mask out padding tokens
            mask = attention_mask[..., 1:].bool()
            valid_perplexities = batch_perplexity[mask].cpu().numpy()
            all_perplexities.extend(valid_perplexities.tolist())

            total_loss += loss.item()
            num_batches += 1

            wandb.log({
                "eval/batch_loss": loss.item(),
                **get_gpu_memory_metrics()
            })

    # Calculate percentiles
    percentiles = np.percentile(all_perplexities, [50, 75, 90, 95, 99, 100])

    # Log to terminal
    print("\nPerplexity Percentiles:")
    print(f"50th:       {percentiles[0]:.2f}")
    print(f"75th:       {percentiles[1]:.2f}")
    print(f"90th:       {percentiles[2]:.2f}")
    print(f"95th:       {percentiles[3]:.2f}")
    print(f"99th:       {percentiles[4]:.2f}")
    print(f"Max (100th): {percentiles[5]:.2f}")

    # Log to wandb
    wandb.log({
        "eval/avg_loss": total_loss / num_batches,
        "eval/perplexity_p50": percentiles[0],
        "eval/perplexity_p75": percentiles[1],
        "eval/perplexity_p90": percentiles[2],
        "eval/perplexity_p95": percentiles[3],
        "eval/perplexity_p99": percentiles[4],
        "eval/perplexity_max": percentiles[5],
    })

    return total_loss / num_batches

def train_model(model, dataloader, optimizer, device, steps=None, start_batch=0):
    """
    Train the model for a specified number of steps or until the dataloader is exhausted.

    Args:
        model: The model to train
        dataloader: DataLoader containing the training data
        optimizer: The optimizer to use
        device: The device to train on
        steps (int, optional): Number of steps to train. If None, train on all remaining batches
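        start_batch (int): Offset used for batch numbering and wandb steps when continuing
            a run. The dataloader is re-iterated from its beginning on every call, so no
            batches are skipped.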

    Returns:
        tuple: (global_step, batch_idx) - The current global step and batch index for resuming
    """
    model.train()
    global_step = start_batch
    total_loss = 0

    for batch_idx, batch in enumerate(dataloader, start=start_batch):
        # Check if we've reached the requested number of steps
        if steps is not None and (batch_idx - start_batch) >= steps:
            break

        # Move batch to device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)

        # Forward pass
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=input_ids
        )

        loss = outputs.loss
        total_loss += loss.item()

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Print progress every 100 batches
        if batch_idx % 100 == 0:
            print(f"Batch {batch_idx}, Loss: {loss.item():.4f}")

        wandb.log({
            "train/batch_loss": loss.item(),
            "train/batch": batch_idx,
            **get_gpu_memory_metrics()
        }, step=global_step)

        global_step += 1

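    # Count the batches actually processed (avoids an off-by-one when the loop exits via break)
    num_batches = global_step - start_batch
    avg_loss = total_loss / max(num_batches, 1)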
    print(f"Training complete. Average loss: {avg_loss:.4f}")
    wandb.log({
        "train/avg_loss": avg_loss,
    })

    return global_step, batch_idx
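
The per-token perplexity computed in `evaluate_model` is the exponential of each target token's negative log-probability. The sketch below reproduces that arithmetic in plain Python, with made-up probabilities, to show why a median near 1.00 can coexist with a very large maximum: a single token assigned probability 1e-6 has a per-token perplexity of about one million. The function names are hypothetical.

```python
import math

def token_perplexities(log_probs):
    """exp(-log p) for each target token's log-probability."""
    return [math.exp(-lp) for lp in log_probs]

def sequence_perplexity(log_probs):
    """exp of the mean negative log-likelihood over the whole sequence."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Made-up example: three well-predicted tokens and one very surprising token
log_probs = [math.log(0.999), math.log(0.999), math.log(0.5), math.log(1e-6)]
print([round(p, 3) for p in token_perplexities(log_probs)])
print(round(sequence_perplexity(log_probs), 3))
```

The sequence-level value is the geometric mean of the per-token values, so it is far less sensitive to a single outlier than the maximum reported in the percentile table.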

In [24]:
device = model_utils.get_device()

In [36]:
# Move a fresh copy of the model to the device selected by model_utils.get_device()
model = get_model().to(device)

# Set up optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)

wandb.init(
    project="log-analysis-pythia",
    config={
        "batch_size": BATCH_SIZE,
        "max_length": MAX_LENGTH,
        "learning_rate": LEARNING_RATE,
        "epochs": NUM_EPOCHS,
        # Record the model actually selected above instead of a hard-coded name
        "model": base_model_name,
    }
)


In [37]:
print_memory_stats()


 Memory Status:
├── Allocated: 0.31 GB (actively used by tensors)
├── Reserved:  11.51 GB (held by driver)
├── Cached:    11.21 GB (reserved - allocated)
└── System Available: 78.50 GB


In [38]:
len(dataloader)

14377
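
The value above follows from the split size and batch size reported earlier. The sketch below reproduces it and also works out how many 2000-step chunks the training loop in this notebook runs. Each call to `train_model` starts a new pass over the shuffled dataloader, so the chunks together cover somewhat more than one epoch's worth of batches. The variable names are illustrative.

```python
import math

train_examples = 460048  # from the "Generating train split" output
batch_size = 32          # BATCH_SIZE printed above
chunk_steps = 2000       # `steps` argument used in the training loop

steps_per_epoch = math.ceil(train_examples / batch_size)
num_chunks = steps_per_epoch // chunk_steps + 1
total_steps = num_chunks * chunk_steps

print(steps_per_epoch, num_chunks, total_steps)
print(round(total_steps / steps_per_epoch, 2))  # fraction of an epoch, in batches
```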

In [39]:
try:
    current_batch = 0
    # Train in chunks of 2000 steps, evaluating on a small validation slice after each chunk
    for i in range(len(dataloader) // 2000 + 1):
        current_step, current_batch = train_model(model, dataloader, optimizer, device, steps=2000, start_batch=current_batch)

        eval_dataloader = create_dataloader(val_df[:10*BATCH_SIZE], tokenizer)
        evaluate_model(model, eval_dataloader, device)
except Exception:
    # Print the stack trace so later cells can still run
    import traceback
    traceback.print_exc()


Batch 0, Loss: 42.3517
Batch 100, Loss: 0.4126
Batch 200, Loss: 0.2818
Batch 300, Loss: 0.2686
Batch 400, Loss: 0.2170
Batch 500, Loss: 0.2258
Batch 600, Loss: 0.1969
Batch 700, Loss: 0.1996
Batch 800, Loss: 0.1925
Batch 900, Loss: 0.1874
Batch 1000, Loss: 0.1932
Batch 1100, Loss: 0.1900
Batch 1200, Loss: 0.1959
Batch 1300, Loss: 0.1725
Batch 1400, Loss: 0.1772
Batch 1500, Loss: 0.1975
Batch 1600, Loss: 0.1894
Batch 1700, Loss: 0.1732
Batch 1800, Loss: 0.1721
Batch 1900, Loss: 0.1698
Training complete. Average loss: 0.4087

Perplexity Percentiles:
50th:       1.00
75th:       1.00
90th:       1.57
95th:       7.63
99th:       341.23
Max (100th): 4843076.50
Batch 2000, Loss: 0.1705




Batch 2100, Loss: 0.1748
Batch 2200, Loss: 0.1778
Batch 2300, Loss: 0.1664
Batch 2400, Loss: 0.1813
Batch 2500, Loss: 0.1685
Batch 2600, Loss: 0.1606
Batch 2700, Loss: 0.1705
Batch 2800, Loss: 0.1609
Batch 2900, Loss: 0.1690
Batch 3000, Loss: 0.1726
Batch 3100, Loss: 0.1601
Batch 3200, Loss: 0.1644
Batch 3300, Loss: 0.1717
Batch 3400, Loss: 0.1691
Batch 3500, Loss: 0.1741
Batch 3600, Loss: 0.2294
Batch 3700, Loss: 0.1713
Batch 3800, Loss: 0.1589
Batch 3900, Loss: 0.1753
Training complete. Average loss: 0.1682

Perplexity Percentiles:
50th:       1.00
75th:       1.00
90th:       1.35
95th:       7.09
99th:       365.30
Max (100th): 64229064.00
Batch 4000, Loss: 0.1712




Batch 4100, Loss: 0.1539
Batch 4200, Loss: 0.1634
Batch 4300, Loss: 0.1849
Batch 4400, Loss: 0.1651
Batch 4500, Loss: 0.1558
Batch 4600, Loss: 0.1647
Batch 4700, Loss: 0.1512
Batch 4800, Loss: 0.1899
Batch 4900, Loss: 0.1583
Batch 5000, Loss: 0.1571
Batch 5100, Loss: 0.1699
Batch 5200, Loss: 0.1577
Batch 5300, Loss: 0.1606
Batch 5400, Loss: 0.1668
Batch 5500, Loss: 0.1640
Batch 5600, Loss: 0.1577
Batch 5700, Loss: 0.1514
Batch 5800, Loss: 0.1712
Batch 5900, Loss: 0.1619
Training complete. Average loss: 0.1637

Perplexity Percentiles:
50th:       1.00
75th:       1.00
90th:       1.31
95th:       6.86
99th:       315.43
Max (100th): 32067876.00
Batch 6000, Loss: 0.1617




Batch 6100, Loss: 0.1636
Batch 6200, Loss: 0.1602
Batch 6300, Loss: 0.1450
Batch 6400, Loss: 0.1528
Batch 6500, Loss: 0.1624
Batch 6600, Loss: 0.1580
Batch 6700, Loss: 0.1630
Batch 6800, Loss: 0.1643
Batch 6900, Loss: 0.1551
Batch 7000, Loss: 0.1492
Batch 7100, Loss: 0.1573
Batch 7200, Loss: 0.1566
Batch 7300, Loss: 1.3782
Batch 7400, Loss: 0.1931
Batch 7500, Loss: 0.2061
Batch 7600, Loss: 0.1736
Batch 7700, Loss: 0.1873
Batch 7800, Loss: 0.1652
Batch 7900, Loss: 0.1862
Training complete. Average loss: 0.1961

Perplexity Percentiles:
50th:       1.00
75th:       1.00
90th:       1.56
95th:       7.26
99th:       366.87
Max (100th): 294403424.00
Batch 8000, Loss: 0.1690




Batch 8100, Loss: 0.2152
Batch 8200, Loss: 0.1602
Batch 8300, Loss: 0.1560
Batch 8400, Loss: 0.1692
Batch 8500, Loss: 0.1636
Batch 8600, Loss: 0.1671
Batch 8700, Loss: 0.1634
Batch 8800, Loss: 0.1544
Batch 8900, Loss: 0.1675
Batch 9000, Loss: 0.1550
Batch 9100, Loss: 0.1599
Batch 9200, Loss: 0.1710
Batch 9300, Loss: 0.1757
Batch 9400, Loss: 0.1619
Batch 9500, Loss: 0.1592
Batch 9600, Loss: 0.1562
Batch 9700, Loss: 0.1472
Batch 9800, Loss: 0.1534
Batch 9900, Loss: 0.1556
Training complete. Average loss: 0.1653

Perplexity Percentiles:
50th:       1.00
75th:       1.00
90th:       1.41
95th:       6.50
99th:       318.39
Max (100th): 37858120.00
Batch 10000, Loss: 0.1712




Batch 10100, Loss: 0.1692
Batch 10200, Loss: 0.1640
Batch 10300, Loss: 0.1651
Batch 10400, Loss: 0.1652
Batch 10500, Loss: 0.1668
Batch 10600, Loss: 0.1783
Batch 10700, Loss: 0.1634
Batch 10800, Loss: 0.1555
Batch 10900, Loss: 0.1619
Batch 11000, Loss: 0.1632
Batch 11100, Loss: 0.1711
Batch 11200, Loss: 0.1548
Batch 11300, Loss: 0.1613
Batch 11400, Loss: 0.1545
Batch 11500, Loss: 0.1612
Batch 11600, Loss: 0.1489
Batch 11700, Loss: 0.1722
Batch 11800, Loss: 0.1480
Batch 11900, Loss: 0.1491
Training complete. Average loss: 0.1614

Perplexity Percentiles:
50th:       1.00
75th:       1.00
90th:       1.40
95th:       5.92
99th:       310.35
Max (100th): 3491758.75
Batch 12000, Loss: 0.1658




Batch 12100, Loss: 0.1538
Batch 12200, Loss: 0.1763
Batch 12300, Loss: 0.1643
Batch 12400, Loss: 0.1582
Batch 12500, Loss: 0.1690
Batch 12600, Loss: 0.1641
Batch 12700, Loss: 0.1546
Batch 12800, Loss: 0.1641
Batch 12900, Loss: 0.1490
Batch 13000, Loss: 0.1571
Batch 13100, Loss: 0.1569
Batch 13200, Loss: 0.1542
Batch 13300, Loss: 0.1511
Batch 13400, Loss: 0.1548
Batch 13500, Loss: 0.1596
Batch 13600, Loss: 0.1582
Batch 13700, Loss: 0.1611
Batch 13800, Loss: 0.1585
Batch 13900, Loss: 0.1661
Training complete. Average loss: 0.1609

Perplexity Percentiles:
50th:       1.00
75th:       1.00
90th:       1.31
95th:       6.64
99th:       328.47
Max (100th): 2727443.50
Batch 14000, Loss: 0.1595




Batch 14100, Loss: 0.1637
Batch 14200, Loss: 0.1620
Batch 14300, Loss: 0.1726
Batch 14400, Loss: 0.1512
Batch 14500, Loss: 0.1525
Batch 14600, Loss: 0.1564
Batch 14700, Loss: 0.1679
Batch 14800, Loss: 0.1643
Batch 14900, Loss: 0.1559
Batch 15000, Loss: 0.1616
Batch 15100, Loss: 0.1510
Batch 15200, Loss: 0.1805
Batch 15300, Loss: 0.1612
Batch 15400, Loss: 0.1568
Batch 15500, Loss: 0.1464
Batch 15600, Loss: 0.1494
Batch 15700, Loss: 0.1599
Batch 15800, Loss: 0.1580
Batch 15900, Loss: 0.1653
Training complete. Average loss: 0.1601

Perplexity Percentiles:
50th:       1.00
75th:       1.00
90th:       1.38
95th:       6.08
99th:       351.91
Max (100th): 1615205.00


In [44]:

# Save model to HuggingFace Hub
model.push_to_hub(
    f"honicky/{model_name}",
    token=os.environ["HF_WRITE_TOKEN"],
    commit_message=f"Trained {current_step} steps"
)

# Save tokenizer with the added special tokens
tokenizer.push_to_hub(
    f"honicky/{model_name}",
    token=os.environ["HF_WRITE_TOKEN"],
    commit_message="Tokenizer with added special tokens for HDFS logs"
)

# Save model config and training details
with open("README_model.md", "w") as f:
    f.write(f"""---
language: en
tags:
- log-analysis
- pythia
- hdfs
license: mit
datasets:
- honicky/log-analysis-hdfs-preprocessed
metrics:
- cross-entropy
- perplexity
base_model: {base_model_name}
---

# {model_name}

Fine-tuned `{base_model_name}` model for HDFS log analysis, specifically for anomaly detection.

## Model Description

This model is fine-tuned from `{base_model_name}` for analyzing HDFS log sequences. It's designed to understand and predict patterns in
HDFS log data so that we can detect anomalies using the perplexity of the log sequence. The HDFS dataset is handy because it has labels,
so we can use it to validate that the model can predict anomalies.

We will use this model to understand the ability of a small model to predict anomalies in a specific dataset.  We will study model scale
and experiment with tokenization, initialization, dataset size, etc. to find a configuration that is minimal in size and fast, but can
effectively predict anomalies.  We will then attempt to build a model that is more robust to different log formats.

- Huggingface Model: [honicky/{model_name}](https://huggingface.co/honicky/{model_name})

## Training Details
- Base model: {base_model_name}
- Dataset: https://zenodo.org/records/8196385/files/HDFS_v1.zip?download=1 + preprocessed data at honicky/log-analysis-hdfs-preprocessed
- Batch size: {BATCH_SIZE}
- Max sequence length: {MAX_LENGTH}
- Learning rate: {LEARNING_RATE}
- Training steps: {current_step}
- Weights and Biases run: {wandb.run.url}


## Special Tokens
- Added `<|sep|>` token for event ID separation

## Intended Use
This model is intended for:
- Analyzing HDFS log sequences
- Detecting anomalies in log patterns
- Understanding system behavior through log analysis

## Limitations
- Model is specifically trained on HDFS logs and may not generalize to other log formats
- Limited to the context window size of {MAX_LENGTH} tokens

""")

# Push README
from huggingface_hub import HfApi
api = HfApi()
api.upload_file(
    path_or_fileobj="README_model.md",
    path_in_repo="README.md",
    repo_id=f"honicky/{model_name}",
    repo_type="model",
    token=os.environ["HF_WRITE_TOKEN"],
    commit_message="Add model documentation"
)

README.md:   0%|          | 0.00/1.93k [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.
No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/honicky/pythia-70m-hdfs-logs/commit/bf76a1c26352cdf3253b52d2f3dd5ce9d9978c5e', commit_message='Add model documentation', commit_description='', oid='bf76a1c26352cdf3253b52d2f3dd5ce9d9978c5e', pr_url=None, repo_url=RepoUrl('https://huggingface.co/honicky/pythia-70m-hdfs-logs', endpoint='https://huggingface.co', repo_type='model', repo_id='honicky/pythia-70m-hdfs-logs'), pr_revision=None, pr_num=None)
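
The model card describes detecting anomalies from the perplexity of a log sequence. A minimal sketch of that idea is to score each block (for example by mean per-token loss), flag blocks whose score exceeds a threshold, and compare the flags with the dataset labels. Everything below is illustrative: the scores, block ids, and threshold are made up, the helper names are hypothetical, and the label string `"Anomaly"` is an assumption, since the `val_df.head()` preview only shows `"Normal"`.

```python
ANOMALY_LABEL = "Anomaly"  # assumed label string; only "Normal" appears in the preview

def flag_anomalies(block_scores, threshold):
    """Map block_id -> True when the block's score exceeds the threshold."""
    return {block_id: score > threshold for block_id, score in block_scores.items()}

def precision_recall(flags, labels):
    """Precision and recall of the anomaly flags against ground-truth labels."""
    tp = sum(1 for b, f in flags.items() if f and labels[b] == ANOMALY_LABEL)
    fp = sum(1 for b, f in flags.items() if f and labels[b] != ANOMALY_LABEL)
    fn = sum(1 for b, f in flags.items() if not f and labels[b] == ANOMALY_LABEL)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up scores and labels, purely for illustration
scores = {"blk_a": 0.15, "blk_b": 0.17, "blk_c": 2.40, "blk_d": 0.90}
labels = {"blk_a": "Normal", "blk_b": "Normal", "blk_c": "Anomaly", "blk_d": "Anomaly"}

flags = flag_anomalies(scores, threshold=1.0)
print(precision_recall(flags, labels))
```

Sweeping the threshold over the validation split and plotting precision against recall would be one way to choose an operating point.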