<a href="https://colab.research.google.com/github/joelpawar08/CustomLLM/blob/master/CustomLLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to build your own LLM

## Step 1 : Data Collection

In [3]:
import requests
from bs4 import BeautifulSoup

# Step 1: Specify the URL
url = "https://wikipedia.com"

# Step 2: Send a GET request to the website
response = requests.get(url)

# Step 3: Parse the website content
soup = BeautifulSoup(response.text, "html.parser")

# Step 4: Extract all text from the page
text_data = soup.get_text()

# Step 5: Print the first 500 characters of the text
print(text_data[:500])

Please set a user-agent and respect our robot policy https://w.wiki/4wJS. See also T400119.



##Step 2: Data Preprocessing

In [7]:
# Note: !pip install nltk is not needed if it's already satisfied, as shown in your output

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Download required NLTK data
nltk.download('punkt')
nltk.download('punkt_tab')  # Add this line for the updated tokenizer model
nltk.download('stopwords')

# Sample text
text = "Hey Sereena !"

# Step 1: Tokenize text into words
tokens = word_tokenize(text)

# Step 2: Convert to lowercase and remove non-alphanumeric tokens
tokens = [word.lower() for word in tokens if word.isalnum()]

# Step 3: Remove stop words
filtered_tokens = [word for word in tokens if word not in stopwords.words('english')]

# Display the preprocessed tokens
print(filtered_tokens)

['hey', 'sereena']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


##Step 3: Model Architecture and Training


In [8]:
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, LayerNormalization, MultiHeadAttention

# Define a single transformer block
def transformer_block(input, num_heads, key_dim):
    # Step 1: Multi-head attention
    attention = MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(input, input)
    # Step 2: Add & Normalize
    attention = LayerNormalization()(attention + input)
    # Step 3: Feedforward network
    dense = Dense(128, activation='relu')(attention)
    # Step 4: Add & Normalize
    output = LayerNormalization()(dense + attention)
    return output

# Input layer
input_layer = Input(shape=(None, 128))  # Sequence length is variable, feature size is 128
# Transformer block
transformer_output = transformer_block(input_layer, num_heads=8, key_dim=64)
# Model definition
model = tf.keras.Model(inputs=input_layer, outputs=transformer_output)

# Model summary
model.summary()

##Step 4: Fine-Tuning the Model

In [11]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import Trainer, TrainingArguments
import torch

# Step 1: Load the tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # Fix: Add padding token (use eos_token)
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Step 2: Prepare the dataset
# Example tokenization
texts = ["Hello, how are you?", "Fine-tuning is fun!"]
encodings = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
input_ids = encodings.input_ids
labels = input_ids.clone()

# For causal LM, shift labels and mask padding (-100 ignored in loss)
labels[:, :-1] = input_ids[:, 1:]  # Shift for next-token prediction
labels[labels == tokenizer.pad_token_id] = -100  # Ignore padding in loss

# Create a dataset that returns dictionaries
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, input_ids, labels):
        self.input_ids = input_ids
        self.labels = labels

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {
            "input_ids": self.input_ids[idx],
            "labels": self.labels[idx]
        }

my_dataset = CustomDataset(input_ids, labels)

# Step 3: Define training arguments
train_args = TrainingArguments(
    output_dir='./results',        # Directory to save the model
    per_device_train_batch_size=4, # Batch size (small due to tiny dataset)
    num_train_epochs=1,            # Number of epochs
    save_steps=10_000,             # Steps to save checkpoints (won't trigger on tiny data)
    save_total_limit=2,            # Maximum number of saved checkpoints
    logging_dir='./logs',          # Directory for logs
    logging_steps=500,             # Log every 500 steps (won't trigger on tiny data)
)

# Step 4: Create Trainer and train the model
trainer = Trainer(
    model=model,
    args=train_args,
    train_dataset=my_dataset,
)

trainer.train()

`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Step,Training Loss


TrainOutput(global_step=1, training_loss=6.772451877593994, metrics={'train_runtime': 14.3952, 'train_samples_per_second': 0.139, 'train_steps_per_second': 0.069, 'total_flos': 7144704000.0, 'train_loss': 6.772451877593994, 'epoch': 1.0})

##Test the Model

In [14]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

# Step 1: Load the tokenizer and fine-tuned model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # Ensure padding token is set

# Load the fine-tuned model from the checkpoint directory
model = GPT2LMHeadModel.from_pretrained('/content/results/checkpoint-1')

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Step 2: Prepare test input
test_texts = ["Hello, how are you today?", "Is HTML a Programming Language?"]  # Replace with your test data
inputs = tokenizer(test_texts, return_tensors="pt", padding=True, truncation=True, max_length=50)
input_ids = inputs["input_ids"].to(device)
attention_mask = inputs["attention_mask"].to(device)

# Step 3: Generate text to test the model
model.eval()  # Set to evaluation mode
with torch.no_grad():  # Disable gradient calculations
    outputs = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_length=50,  # Max length of generated sequence
        num_return_sequences=1,  # Number of sequences per input
        temperature=0.7,  # Control randomness (lower = more deterministic)
        top_k=50,  # Top-k sampling
        pad_token_id=tokenizer.pad_token_id
    )

# Step 4: Decode and print the generated text
for i, output in enumerate(outputs):
    generated_text = tokenizer.decode(output, skip_special_tokens=True)
    print(f"Input: {test_texts[i]}")
    print(f"Generated: {generated_text}\n")

# Optional: Save the model if you want to use it later
# model.save_pretrained("/content/fine_tuned_model")
# tokenizer.save_pretrained("/content/fine_tuned_model")

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Input: Hello, how are you today?
Generated: Hello, how are you today?

I'm so happy to be here. I'm so happy to be here. I'm so happy to be here. I'm so happy to be here. I'm so happy to be here. I

Input: Is HTML a Programming Language?
Generated: Is HTML a Programming Language?

The first thing you need to know about HTML is that it is a programming language. It is a programming language that is designed to be used in a variety of different ways. It is a programming language that

