
Bug Report for gpt-dataset #4875

@mak2508

Description


Bug Report for https://neetcode.io/problems/gpt-dataset

Please describe the bug below and include any steps to reproduce the bug or screenshots if possible.

```python
from typing import List, Tuple

import torch


class Solution:
    def batch_loader(self, raw_dataset: str, context_length: int, batch_size: int) -> Tuple[List[List[str]], List[List[str]]]:
        torch.manual_seed(0)
        tokenized = raw_dataset.split()
        indices = torch.randint(low=0, high=len(tokenized) - context_length, size=(batch_size,)).tolist()
        X = []
        Y = []
        for idx in indices:
            X.append(tokenized[idx:idx + context_length])
            Y.append(tokenized[idx + 1:idx + 1 + context_length])
        return X, Y
```

In the provided solution, `high=len(tokenized) - context_length` can produce an invalid index when generating the `Y` output. If the random index is `idx = len(tokenized) - context_length`, then the `Y` slice `tokenized[idx+1 : idx+1+context_length]` needs the token at index `idx + context_length = len(tokenized)`, which is out of bounds, so `Y` is silently truncated to fewer than `context_length` tokens.
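A minimal sketch of the boundary case described above, using a tiny hand-made token string in place of a real dataset (the string, `context_length`, and the fixed `idx` are all assumptions for illustration):

```python
# Toy reproduction of the boundary case from the report above.
raw_dataset = "a b c d e f g h"  # assumed toy data, 8 tokens
context_length = 3
tokenized = raw_dataset.split()

# The boundary index named in the report: idx = len(tokenized) - context_length
idx = len(tokenized) - context_length  # 8 - 3 = 5

X = tokenized[idx:idx + context_length]          # tokens at indices 5..7 -> length 3
Y = tokenized[idx + 1:idx + 1 + context_length]  # wants indices 6..8, but 8 is out of range

print(len(X))  # 3
print(len(Y))  # 2 -- Python slicing truncates silently instead of raising IndexError
```

Note that the slice does not raise; it just returns a shorter `Y`, so the defect would surface as a mismatched batch shape rather than an exception.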
