# Building an LLM to help students study for finals
Problems to solve:
- How will I train the model?
- Can I use it generally to study for any class or would it work better if specific to one class?
- What libraries should I use? Explore Hugging Face transformers?

# My problem: 
I want to use an LLM to help me study for finals, specifically my algorithms final. I want my bot to be able to read the textbook, ask me sample questions, and explain topics simply but with enough detail.

I want it to be able to correctly give me the top ten biggest ideas from each chapter.

What I'm working on:
- find pdf of algorithms textbook and download it to "train" the model on

Here is the link I am downloading from:

https://www.cs.csubak.edu/~jyang/Introduction%20to%20the%20Design%20and%20Analysis%20of%20Algorithms%20-%20FTP%20Directory%20%20(%20PDFDrive.com%20).pdf

In [1]:
%pip install transformers datasets torch tqdm

Note: you may need to restart the kernel to use updated packages.


In [2]:
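import numpy as np
import torch
from tqdm.auto import tqdm  # progress bar used in the training loop
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from datasets import Dataset, load_dataset  # Dataset.from_dict is used when building the chunks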

Since I'm designing this model to help me study for my algorithms exam, I'm training it on my PDF textbook. I first have to read in the book page by page. pdfplumber was recommended for this, so the next cell installs it in case it is not already installed.

In [3]:
%pip install pdfplumber

Note: you may need to restart the kernel to use updated packages.


In [60]:
import pdfplumber

def read_pdf_range(pdf_path, start_page, end_page):
    """Extract text from PDF pages"""
    extracted_text = []
    
    with pdfplumber.open(pdf_path) as pdf:
        for page_num in range(start_page - 1, end_page):  # Adjust for zero-based index
            text = pdf.pages[page_num].extract_text()
            if text:  # Ensure text isn't None
                extracted_text.append(text)
    return extracted_text


pdf_path = "AlgorithmsText.pdf"
processed = read_pdf_range(pdf_path, start_page=27, end_page=494)

CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox


I considered removing headers and footers, but when I tried, relevant numbers in the body text were removed too. If this text did not have numerical examples and formulas, I could do this much more efficiently. I'm leaving the headers and footers in to preserve the important numbers in the text.
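One idea I have not tried on the real book: only drop a page's first or last line when the same line (with its digits masked) shows up at the edge of other pages too. That should leave numbers inside the body alone. This is a rough sketch in plain Python; `strip_running_lines`, `min_repeats`, and the sample pages are made up for illustration, and I'm assuming the headers look roughly like "Chapter 2 Sorting 45", which may not match this textbook.

```python
import re
from collections import Counter


def strip_running_lines(pages, min_repeats=2):
    """Drop a page's first/last line only if it repeats across pages once digits are masked."""

    def mask(line):
        # "Chapter 2 Sorting 45" and "Chapter 2 Sorting 46" both become "Chapter # Sorting #"
        return re.sub(r"\d+", "#", line.strip())

    # Count how many pages carry each masked first or last line
    edge_counts = Counter()
    for page in pages:
        lines = page.splitlines()
        if lines:
            edge_counts.update({mask(lines[0]), mask(lines[-1])})

    cleaned = []
    for page in pages:
        lines = page.splitlines()
        if lines and edge_counts[mask(lines[0])] >= min_repeats:
            lines = lines[1:]   # repeated header
        if lines and edge_counts[mask(lines[-1])] >= min_repeats:
            lines = lines[:-1]  # repeated footer
        cleaned.append("\n".join(lines))
    return cleaned


# Made-up sample pages, only to show the behavior
sample_pages = [
    "Chapter 2 Sorting 45\nMergesort makes about n log n comparisons.\nT(n) = 2T(n/2) + n",
    "Chapter 2 Sorting 46\nQuicksort picks a pivot element.\nThe worst case is n^2 comparisons.",
]
cleaned_pages = strip_running_lines(sample_pages)
print(cleaned_pages[0])
```

The output of `read_pdf_range` is already a list of page strings, so it could be passed in directly, but I would spot-check a few pages before trusting it.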

## Moving on to data preprocessing now that we have read in our text
When this PDF is read in, many lines come out without spaces between the words. I tried hard to restore the spaces, but that did not work and created more issues. To get around this, I combine all of the text into one string, which should help the model learn in context. The text is then tokenized and split into overlapping chunks using a stride, so each chunk starts with context from the end of the previous one.
This should help the model learn because fewer explanations get cut off at a chunk boundary, which matters for a textbook, where the text is denser and harder to follow than a novel.
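To make the stride idea concrete, here is a toy version on ten fake token ids. `overlapping_chunks` is a name I made up for this illustration; it mirrors the windowing loop in the next cell but is not the function the notebook actually calls.

```python
def overlapping_chunks(tokens, max_length, stride):
    """Return every full window of max_length tokens, starting every stride tokens."""
    return [
        tokens[start:start + max_length]
        for start in range(0, len(tokens) - max_length + 1, stride)
    ]


toy_tokens = list(range(10))  # stand-in for real token ids
toy_chunks = overlapping_chunks(toy_tokens, max_length=4, stride=2)
for chunk in toy_chunks:
    print(chunk)
# [0, 1, 2, 3]
# [2, 3, 4, 5]
# [4, 5, 6, 7]
# [6, 7, 8, 9]
```

With `stride` set to half of `max_length`, every token (except at the very start and end) appears in two chunks, once near the end of a window and once near the beginning of the next.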


In [None]:
def process_pdf_for_lm(pdf_path, start_page, end_page, model_name, max_length=512, stride=256):
  
    # 1. Extract text from PDF
    print(f"Extracting text from PDF pages {start_page}-{end_page}...")
    raw_text = read_pdf_range(pdf_path, start_page, end_page)
    
    # 2. Join all text into one big string for better context
    full_text = ' '.join(raw_text)
    print(f"Extracted {len(full_text)} characters of text")
    
    # 3. Initialize tokenizer
    print(f"Initializing tokenizer: {model_name}")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    # Ensure the tokenizer has padding token
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    
    # 4. Tokenize the full text
    print("Tokenizing text...")
    tokens = tokenizer.encode(full_text)
    print(f"Total tokens: {len(tokens)}")
    
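    # 5. Create overlapping chunks: windows of max_length tokens, one starting every `stride` tokens
    chunks = []
    for i in range(0, len(tokens) - max_length + 1, stride):
        chunks.append(tokens[i:i + max_length])  # always exactly max_length tokens
    
    # Add one last window if at least half a chunk of trailing tokens is still uncovered.
    # (The old check, len(tokens) % stride > max_length // 2, can never be true when
    # stride <= max_length // 2, so it silently skipped the tail.)
    if chunks:
        covered = (len(chunks) - 1) * stride + max_length
        if len(tokens) - covered >= max_length // 2:
            chunks.append(tokens[-max_length:])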
    
    print(f"Created {len(chunks)} chunks of length {max_length}")
    
    # 6. Create a dataset with input_ids
    dataset_dict = {
        "input_ids": chunks,
        "attention_mask": [[1] * max_length for _ in chunks]  # All tokens are valid
    }
    
    dataset = Dataset.from_dict(dataset_dict)
    dataset.set_format(type='torch')
    
    return dataset, tokenizer

In [71]:
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForLanguageModeling

def create_dataloader(dataset, tokenizer, batch_size=4):
    """Create a DataLoader for the dataset with proper collation"""
    # Create data collator for language modeling
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False  
    )
    
    # Create DataLoader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        collate_fn=data_collator
    )
    
    return dataloader
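As far as I understand it, with `mlm=False` the collator copies `input_ids` into `labels` and replaces pad positions with -100 so they are ignored by the loss; the model then shifts the labels internally so each position predicts the next token. The sketch below imitates that in plain Python on made-up ids. The helper names are mine, not part of transformers.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def causal_lm_labels(input_ids, pad_token_id):
    """Copy input_ids to labels, masking pad positions so they add nothing to the loss."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


def next_token_targets(input_ids, labels):
    """Pair each prefix with the token the model is asked to predict next."""
    pairs = []
    for i in range(len(input_ids) - 1):
        target = labels[i + 1]
        if target != IGNORE_INDEX:
            pairs.append((input_ids[: i + 1], target))
    return pairs


toy_ids = [11, 12, 13, 0, 0]  # hypothetical ids, with 0 playing the pad token
toy_labels = causal_lm_labels(toy_ids, pad_token_id=0)
print(toy_labels)
print(next_token_targets(toy_ids, toy_labels))
```

One thing to keep in mind: this notebook reuses the end-of-text token as the pad token, so any real end-of-text tokens in a chunk would be masked out of the loss too. Since every chunk here is full length and nothing is padded, I don't expect that to matter much.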

In [72]:
def train_model(model, train_dataloader, num_epochs=3, learning_rate=5e-5):
    """Train the language model"""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Training on: {device}")
    model.to(device)
    
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
    
    model.train()
    for epoch in range(num_epochs):
        total_loss = 0
        
        progress_bar = tqdm(train_dataloader, desc=f"Epoch {epoch+1}/{num_epochs}")
        for batch in progress_bar:
            # Move batch to device
            batch = {k: v.to(device) for k, v in batch.items()}
            
            # Forward pass
            outputs = model(**batch)
            loss = outputs.loss
            
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
            progress_bar.set_postfix({"loss": loss.item()})
        
        avg_loss = total_loss / len(train_dataloader)
        print(f"Epoch {epoch+1} average loss: {avg_loss:.4f}")
    
    return model

In [73]:
pdf_path = "AlgorithmsText.pdf"
start_page = 27
end_page = 494
model_name = "gpt2"  # Using GPT-2 instead of BERT for causal language modeling
max_length = 512
stride = 256
batch_size = 4
num_epochs = 3
    
# Process PDF and create dataset
dataset, tokenizer = process_pdf_for_lm(
    pdf_path,
    start_page,
    end_page,
    model_name,
    max_length,
    stride,
)


CropBox missing from /Page, defaulting to MediaBox


Extracting text from PDF pages 27-494...


CropBox missing from /Page, defaulting to MediaBox


Extracted 894636 characters of text
Initializing tokenizer: gpt2
Tokenizing text...


Token indices sequence length is longer than the specified maximum sequence length for this model (296287 > 1024). Running this sequence through the model will result in indexing errors


Total tokens: 296287
Created 1156 chunks of length 512
Final dataset contains 1156 examples


Create dataloader:

In [77]:
from transformers import DataCollatorForLanguageModeling
train_dataloader = create_dataloader(dataset, tokenizer, batch_size)

Initialize model:

In [78]:
model = AutoModelForCausalLM.from_pretrained(model_name)

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Ensure model's token padding is set correctly:

In [79]:
if model.config.pad_token_id is None:
    model.config.pad_token_id = tokenizer.pad_token_id

Train the model:

Just a heads up: on CPU this takes ***actually*** forever to train (~6 hours for only 3 epochs)

In [80]:
trained_model = train_model(model, train_dataloader, num_epochs)

Training on: cpu


Epoch 1/3:   0%|          | 0/289 [00:00<?, ?it/s]`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.
Epoch 1/3:  38%|███▊      | 110/289 [49:57<1:21:18, 27.25s/it, loss=3.76]


KeyboardInterrupt: 
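A quick sanity check on the "~6 hours" figure, using the numbers from the output in this notebook (1156 chunks, batch size 4, roughly 27.25 s per step on CPU). `estimate_training_hours` is just a helper I made up for the arithmetic, and it assumes the step time stays constant.

```python
import math


def estimate_training_hours(num_examples, batch_size, seconds_per_step, num_epochs):
    """Rough wall-clock estimate, assuming every step takes the same time."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * seconds_per_step * num_epochs / 3600


hours = estimate_training_hours(
    num_examples=1156, batch_size=4, seconds_per_step=27.25, num_epochs=3
)
print(f"{hours:.1f} hours")  # 6.6 hours
```

So 289 steps per epoch at that speed is a bit over two hours per epoch, which matches what the progress bar was projecting before I interrupted it.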

Save the model:

In [None]:
output_dir = "algorithms_model"
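# train_model updates `model` in place, so saving `model` also works if training was interrupted
# before `trained_model` was assigned
model.save_pretrained(output_dir)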
tokenizer.save_pretrained(output_dir)
print(f"Model saved to {output_dir}")

***Now (finally) to interact with the model!***

The model needs to be able to respond to user input (using a while loop so users can ask multiple questions)

In [None]:
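import torch

model.eval()  # switch off dropout for inference
device = next(model.parameters()).device  # keep inputs on the same device as the model

while True:
    question = input("Ask me a question about algorithms (or type 'exit' to quit): ")  # Get user input

    if question.lower() == "exit":
        print("Good luck on the exam!")
        break
    
    inputs = tokenizer(question, return_tensors="pt").to(device)
    with torch.no_grad():
        # generate() appends new tokens one at a time; a single forward pass plus argmax
        # only gives the next-token guess for each position of the question itself
        output_ids = model.generate(
            **inputs,
            max_new_tokens=100,
            pad_token_id=tokenizer.pad_token_id,
        )
    # Decode only the newly generated tokens, not the echoed question
    answer = tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

    print(f"using AI: {answer}")  # Show response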
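For my own notes: answering with a causal language model means generating one token at a time, feeding each new token back in before predicting the next. Here is a toy sketch of that greedy loop. The lookup table stands in for the model and is completely made up; `greedy_generate` and `toy_next_token` are illustrative names, not library functions.

```python
def greedy_generate(next_token, prompt, max_new_tokens, stop_token=None):
    """Autoregressive greedy decoding: predict one token, append it, repeat."""
    sequence = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(sequence)
        if token == stop_token:
            break
        sequence.append(token)
    return sequence[len(prompt):]  # only the newly generated part


# Hypothetical stand-in for a language model: a lookup from the last word to the next word
TOY_TABLE = {"merge": "sort", "sort": "is", "is": "stable", "stable": "<eos>"}


def toy_next_token(sequence):
    return TOY_TABLE.get(sequence[-1], "<eos>")


generated = greedy_generate(toy_next_token, ["merge"], max_new_tokens=10, stop_token="<eos>")
print(generated)
# ['sort', 'is', 'stable']
```

A real model picks the next token from its logits instead of a table, but the loop has the same shape: the output grows by one token per step until a stop token or the length limit is reached.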

## **Now I have a model that will answer questions to help me study for my algorithms final!**