This notebook tokenizes the TinyStories subsets produced by `frequent_words.ipynb`. Each word is mapped to an integer (its frequency rank), and the sentences are then split into train and test sets.

Input files:
- `/n/netscratch/sham_lab/Everyone/jchooi/in-context-language-learning/data/TinyStories-frequent-K.txt`
- `/n/netscratch/sham_lab/Everyone/jchooi/in-context-language-learning/data/TinyStories-unique-K.txt`

where K is one of 100, 200, 300, 400, or 500, together with the file containing the words and their frequencies:

`/n/netscratch/sham_lab/Everyone/jchooi/in-context-language-learning/data/TinyStories-word-freq.csv`


Output files:
- `/n/netscratch/sham_lab/Everyone/jchooi/in-context-language-learning/data/TinyStories-K-train.txt`
- `/n/netscratch/sham_lab/Everyone/jchooi/in-context-language-learning/data/TinyStories-K-test.txt`
- `/n/netscratch/sham_lab/Everyone/jchooi/in-context-language-learning/data/TinyStories-K-train-unique.txt`
- `/n/netscratch/sham_lab/Everyone/jchooi/in-context-language-learning/data/TinyStories-K-test-unique.txt`

and the tokenizer:
`/n/netscratch/sham_lab/Everyone/jchooi/in-context-language-learning/data/tokenizer.json`

In [19]:
import pandas as pd

Ks = [100, 200, 300, 400, 500]

df_word_freq = pd.read_csv("/n/netscratch/sham_lab/Everyone/jchooi/in-context-language-learning/data/TinyStories-word-freq.csv")
df_word_freq.head()

Unnamed: 0.1,Unnamed: 0,Character,Frequency,Rank
0,4,",",8350439,4
1,0,.,21637643,0
2,3,a,8937078,3
3,20303,aa,2,19297
4,27625,aaaaahing,1,22010


In [20]:
# Build the tokenizer as a dictionary mapping each word (the 'Character' column) to its rank

tokenizer = df_word_freq.set_index('Character')['Rank'].to_dict()
str(tokenizer)[:50]

"{',': 4, '.': 0, 'a': 3, 'aa': 19297, 'aaaaahing':"
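
One thing that may be worth checking when building the dictionary this way: by default `pd.read_csv` treats strings such as `nan` and `null` as missing values, so a vocabulary word spelled like that would become a NaN key instead of a string. The sketch below uses a made-up three-row CSV (not the real word-frequency file) to show the effect and how `keep_default_na=False` avoids it. Whether the real file contains such words is an assumption that has not been verified here.

```python
import io

import pandas as pd

# Made-up CSV standing in for TinyStories-word-freq.csv (assumption: same column names).
csv_text = "Character,Frequency,Rank\nthe,100,0\nnan,7,1\nnull,3,2\n"

# Default parsing: "nan" and "null" are read as missing values.
default_df = pd.read_csv(io.StringIO(csv_text))
default_vocab = default_df.set_index("Character")["Rank"].to_dict()

# With keep_default_na=False the words stay as ordinary strings.
strict_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
strict_vocab = strict_df.set_index("Character")["Rank"].to_dict()

print("nan" in default_vocab, "nan" in strict_vocab)
```

If the first value printed is `False` for a word that should be in the vocabulary, `tokenize_line` would raise a `KeyError` on any sentence containing it.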

In [21]:
import json

# Save the tokenizer dictionary as a JSON file
with open('/n/netscratch/sham_lab/Everyone/jchooi/in-context-language-learning/data/tokenizer.json', 'w') as json_file:
    json.dump(tokenizer, json_file)
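
For later use, the saved JSON can be loaded back and inverted to turn token ids into words again. The following is a minimal sketch using a toy vocabulary and a temporary file rather than the real `tokenizer.json`. The helper `detokenize` and the toy words are hypothetical and not part of this notebook.

```python
import json
import os
import tempfile

# Toy vocabulary standing in for the real word -> rank mapping (assumption).
toy_tokenizer = {".": 0, "the": 1, "cat": 2, "sat": 3}

# Round-trip through JSON the same way the notebook saves its tokenizer.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tokenizer.json")  # hypothetical location
    with open(path, "w") as json_file:
        json.dump(toy_tokenizer, json_file)
    with open(path) as json_file:
        loaded = json.load(json_file)

# Ranks are unique in this toy example, so the mapping can be inverted for decoding.
id_to_word = {rank: word for word, rank in loaded.items()}


def detokenize(token_ids):
    """Hypothetical helper: map a list of integer ids back to a sentence."""
    return " ".join(id_to_word[token_id] for token_id in token_ids)


decoded = detokenize([1, 2, 3, 0])
print(decoded)
```

The inversion only works if every rank is assigned to exactly one word. If two words shared a rank, the later one would silently overwrite the earlier one.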

In [22]:
def tokenize_line(line):
    """Map each whitespace-separated word to its rank; raises KeyError on unknown words."""
    return [tokenizer[word] for word in line.split()]
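
To illustrate what `tokenize_line` does, here is a small self-contained sketch with a toy vocabulary. It also shows one possible way to handle a word that is missing from the vocabulary. The `tokenize_line_or_none` variant is hypothetical and is not used anywhere in this notebook.

```python
# Toy vocabulary standing in for the real tokenizer dictionary (assumption).
toy_tokenizer = {".": 0, "the": 1, "cat": 2, "sat": 3}


def tokenize_line(line, tokenizer=toy_tokenizer):
    # str.split() with no arguments also drops the trailing newline.
    return [tokenizer[word] for word in line.split()]


ids = tokenize_line("the cat sat .\n")


def tokenize_line_or_none(line, tokenizer=toy_tokenizer):
    """Hypothetical variant: return None instead of raising on unknown words."""
    try:
        return tokenize_line(line, tokenizer)
    except KeyError:
        return None


missing = tokenize_line_or_none("the dog sat .")
print(ids, missing)
```

Since the input files were already restricted to frequent words, every word is expected to be in the vocabulary, which is presumably why the notebook's version does not guard against missing keys.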

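Tokenize and split the unique sentences first, then reuse that split for the files that include duplicates. This way every copy of a sentence lands on the same side of the split, so no test sentence also appears in the training set.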
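
The idea can be sketched on toy data as follows. The sentences, the seeded `rng`, and the variable names are made up for illustration. The notebook itself uses the global `random` module without a seed.

```python
import random

rng = random.Random(0)  # seeded only so this sketch is repeatable (assumption)

# Toy tokenized sentences as tuples, so they can be stored in a set.
unique_lines = [(1, 2, 0), (1, 3, 0), (2, 3, 0), (3, 1, 0)]
with_duplicates = [(1, 2, 0), (1, 2, 0), (1, 3, 0), (2, 3, 0), (2, 3, 0), (3, 1, 0)]

# Step 1: decide the split once per unique sentence (about 20% go to test).
test_set = {line for line in unique_lines if rng.random() < 0.2}

# Step 2: route every copy in the full data by membership in that test set.
train = [line for line in with_duplicates if line not in test_set]
test = [line for line in with_duplicates if line in test_set]

print(len(train), len(test))
```

Because the decision is made per unique sentence, the share of test tokens in the full data depends on how often the chosen sentences repeat and need not be exactly 20%.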

In [23]:
import random
from tqdm import tqdm

train_tokens = []
test_tokens = []
test_hashes = [set() for _ in Ks]  # one independent set per K ([set()] * n would alias a single set)

for i, K in enumerate(tqdm(Ks)):
    input_file = f"/n/netscratch/sham_lab/Everyone/jchooi/in-context-language-learning/data/TinyStories-unique-{K}.txt"
    train = []
    test = []
    with open(input_file, 'r') as file:
        lines = file.readlines()
    
    # tokenize
    lines = [tokenize_line(line) for line in lines]
    
    for line in lines:
        if random.random() < 0.2:  # send roughly 20% of unique sentences to test
            test.append(line)
        else:
            train.append(line)
    
    # convert to tuples so the tokenized sentences can be hashed
    test_hashes[i] = {tuple(test_instance) for test_instance in test}
    
    # count number of tokens
    train_tokens.append(sum([len(line) for line in train]))
    test_tokens.append(sum([len(line) for line in test]))
        
    # cast to strings for writing
    train = [' '.join([str(token) for token in line])+'\n' for line in train]
    test = [' '.join([str(token) for token in line])+'\n' for line in test]
    
    # Output the split
    train_output_file = f"/n/netscratch/sham_lab/Everyone/jchooi/in-context-language-learning/data/TinyStories-{K}-train-unique.txt"
    test_output_file = f"/n/netscratch/sham_lab/Everyone/jchooi/in-context-language-learning/data/TinyStories-{K}-test-unique.txt"
    
    # Write the train and test data to respective files
    with open(train_output_file, 'w') as train_file:
        train_file.writelines(train)
    
    with open(test_output_file, 'w') as test_file:
        test_file.writelines(test)

100%|██████████| 5/5 [00:23<00:00,  4.62s/it]


In [24]:
[f"{train_token:,}" for train_token in train_tokens]

['689,997', '4,360,726', '9,625,369', '15,352,822', '20,938,503']

In [25]:
[f"{test_token:,}" for test_token in test_tokens]

['171,600', '1,089,234', '2,404,845', '3,837,880', '5,238,512']

In [26]:
# Repeat the split for the full files (with duplicates), reusing the test sets chosen above

train_tokens = []
test_tokens = []

for i, K in enumerate(tqdm(Ks)):
    input_file = f"/n/netscratch/sham_lab/Everyone/jchooi/in-context-language-learning/data/TinyStories-frequent-{K}.txt"
    train = []
    test = []
    with open(input_file, 'r') as file:
        lines = file.readlines()
    
    # tokenize
    lines = [tokenize_line(line) for line in lines]
    
    for line in lines:
        if tuple(line) in test_hashes[i]:
            test.append(line)
        else:
            train.append(line)
    
    # count number of tokens
    train_tokens.append(sum([len(line) for line in train]))
    test_tokens.append(sum([len(line) for line in test]))
        
    # cast to strings for writing
    train = [' '.join([str(token) for token in line])+'\n' for line in train]
    test = [' '.join([str(token) for token in line])+'\n' for line in test]
    
    # Output the split
    train_output_file = f"/n/netscratch/sham_lab/Everyone/jchooi/in-context-language-learning/data/TinyStories-{K}-train.txt"
    test_output_file = f"/n/netscratch/sham_lab/Everyone/jchooi/in-context-language-learning/data/TinyStories-{K}-test.txt"
    
    # Write the train and test data to respective files
    with open(train_output_file, 'w') as train_file:
        train_file.writelines(train)
    
    with open(test_output_file, 'w') as test_file:
        test_file.writelines(test)

100%|██████████| 5/5 [01:11<00:00, 14.25s/it]


In [27]:
[f"{train_token:,}" for train_token in train_tokens]

['7,451,413', '20,608,430', '31,785,933', '42,779,104', '51,522,412']

In [28]:
[f"{test_token:,}" for test_token in test_tokens]

['2,128,850', '4,694,522', '7,513,301', '9,463,947', '11,918,012']