Description
Bug Report for https://neetcode.io/problems/gpt-dataset
Please describe the bug below and include any steps to reproduce the bug or screenshots if possible.
import torch
from typing import List, Tuple

class Solution:
    def batch_loader(self, raw_dataset: str, context_length: int, batch_size: int) -> Tuple[List[List[str]], List[List[str]]]:
        torch.manual_seed(0)
        tokenized = raw_dataset.split()
        # Draw batch_size random window start positions
        indices = torch.randint(low=0, high=len(tokenized) - context_length, size=(batch_size,)).tolist()
        X = []
        Y = []
        for idx in indices:
            X.append(tokenized[idx:idx + context_length])          # input window
            Y.append(tokenized[idx + 1:idx + 1 + context_length])  # target window, shifted by one
        return X, Y
In the provided solution, high=len(tokenized) - context_length can produce an invalid index when generating the Y output. If the random index is idx = len(tokenized) - context_length, then the last position read by the slice idx+1:idx+1+context_length is idx + 1 + context_length - 1 = len(tokenized) - context_length + 1 + context_length - 1 = len(tokenized), which is out of bounds (the valid indices are 0 through len(tokenized) - 1). Because Python slicing truncates silently rather than raising an IndexError, the symptom is a Y window one token shorter than context_length rather than a crash.
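Below is a minimal sketch of the boundary case described above; the toy sentence, the context_length value, and the variable names are made up purely for illustration, and it assumes idx can reach len(tokenized) - context_length:

# Toy reproduction of the boundary case described in this report.
tokenized = "the quick brown fox jumps over the lazy dog".split()  # 9 tokens
context_length = 3
idx = len(tokenized) - context_length  # 6, the start index the report warns about

x = tokenized[idx:idx + context_length]          # ['the', 'lazy', 'dog'] -- full window
y = tokenized[idx + 1:idx + 1 + context_length]  # ['lazy', 'dog'] -- end bound is 10 > 9
assert len(x) == context_length       # X is complete: 3 tokens
assert len(y) == context_length - 1   # Y silently truncated to 2 tokens

Because the slice truncates instead of raising, the bug would surface as an X/Y length mismatch rather than a crash, which makes it easy to miss. Capping the start index so that idx + 1 + context_length <= len(tokenized), i.e. idx <= len(tokenized) - context_length - 1, guarantees every Y window stays complete.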