
Bug Report for gpt-dataset #4875

@mak2508

Description


Bug Report for https://neetcode.io/problems/gpt-dataset

Please describe the bug below and include any steps to reproduce the bug or screenshots if possible.

```python
from typing import List, Tuple

import torch


class Solution:
    def batch_loader(self, raw_dataset: str, context_length: int, batch_size: int) -> Tuple[List[List[str]], List[List[str]]]:
        torch.manual_seed(0)
        tokenized = raw_dataset.split()
        indices = torch.randint(low=0, high=len(tokenized) - context_length, size=(batch_size,)).tolist()
        X = []
        Y = []
        for idx in indices:
            X.append(tokenized[idx:idx + context_length])
            Y.append(tokenized[idx + 1:idx + 1 + context_length])
        return X, Y
```

In the provided solution, `high=len(tokenized) - context_length` can produce an invalid index when generating the `Y` output. If the random index is `idx = len(tokenized) - context_length`, then the `Y` slice `tokenized[idx+1 : idx+1+context_length]` needs the token at index `idx + context_length = len(tokenized)`, which is out of bounds, so `Y` is silently truncated to fewer than `context_length` tokens.
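A minimal sketch of the boundary case described above, using a tiny hand-made token string in place of a real dataset (the string, `context_length`, and the fixed `idx` are all assumptions for illustration):

```python
# Toy reproduction of the boundary case from the report above.
raw_dataset = "a b c d e f g h"  # assumed toy data, 8 tokens
context_length = 3
tokenized = raw_dataset.split()

# The boundary index named in the report: idx = len(tokenized) - context_length
idx = len(tokenized) - context_length  # 8 - 3 = 5

X = tokenized[idx:idx + context_length]          # tokens at indices 5..7 -> length 3
Y = tokenized[idx + 1:idx + 1 + context_length]  # wants indices 6..8, but 8 is out of range

print(len(X))  # 3
print(len(Y))  # 2 -- Python slicing truncates silently instead of raising IndexError
```

Note that the slice does not raise; it just returns a shorter `Y`, so the defect would surface as a mismatched batch shape rather than an exception.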
