### Language model
Building a character-level language model using neural networks. Each training example uses three consecutive characters as context to predict the fourth character.

In [1]:
# Dataset of baby names
# The 32K most common names, taken from ssa.gov for the year 2018
# !wget https://raw.githubusercontent.com/karpathy/makemore/master/names.txt

--2024-04-22 20:05:07--  https://raw.githubusercontent.com/karpathy/makemore/master/names.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8002::154, 2606:50c0:8003::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 228145 (223K) [text/plain]
Saving to: ‘names.txt’


2024-04-22 20:05:08 (2.98 MB/s) - ‘names.txt’ saved [228145/228145]



In [2]:
with open('names.txt', 'r') as file:
    names = file.read().split()

In [3]:
# Looking at a few samples
names[:10]

['emma',
 'olivia',
 'ava',
 'isabella',
 'sophia',
 'charlotte',
 'mia',
 'amelia',
 'harper',
 'evelyn']

In [4]:
# Total number of names
len(names)

32033

In [8]:
# Shortest name 
print(f'Shortest name has {min(len(name) for name in names)} letters')
# Longest name
print(f'Longest name has {max(len(name) for name in names)} letters')

Shortest name has 2 letters
Longest name has 15 letters


The names are in English and contain only lowercase letters. We use all the unique characters as our vocabulary, and add '.' as a special token that marks the start and end of a name (a sequence of characters).

In [15]:
vocab = sorted(set(''.join(names)))  # unique characters, sorted
delimiter = ['.']                    # start/end marker, placed at index 0
vocab = delimiter + vocab
vocab

['.',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z']

In [54]:
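block_size = 3  # number of context characters used to predict the next one

def create_data(block_size: int):
    """Create (context, target) pairs from `names` for the given block size."""
    X, Y = [], []
    for name in names:
        # Pad the start with block_size '.' tokens and mark the end with one '.'
        name = block_size * "." + name + "."
        for idx in range(len(name) - block_size):
            X.append(tuple(name[idx:idx + block_size]))  # context window
            Y.append(name[idx + block_size])             # character that follows
    return X, Y

X, Y = create_data(block_size)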

In [13]:
# Printing some outputs
for feat, targ in zip(X[:10], Y[:10]):
# for feat, targ in zip(X[-10:], Y[-10:]):
    print(f"{feat} --> {targ}")

('.', '.', '.') --> e
('.', '.', 'e') --> m
('.', 'e', 'm') --> m
('e', 'm', 'm') --> a
('m', 'm', 'a') --> .
('.', '.', '.') --> o
('.', '.', 'o') --> l
('.', 'o', 'l') --> i
('o', 'l', 'i') --> v
('l', 'i', 'v') --> i
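
The sliding window behind these pairs can be sketched for a single name. The helper below is hypothetical (`context_target_pairs` and its `pad` parameter are not part of the notebook); it is only meant to show how the padding and the window interact, and it returns the pairs instead of appending to shared lists.

```python
def context_target_pairs(name: str, block_size: int = 3, pad: str = "."):
    """Return (context, target) pairs for one name using a sliding window.

    Hypothetical helper for illustration; it mirrors the padding scheme
    used in the notebook (block_size pads in front, one pad at the end).
    """
    padded = pad * block_size + name + pad
    pairs = []
    for idx in range(len(padded) - block_size):
        context = tuple(padded[idx:idx + block_size])  # block_size characters
        target = padded[idx + block_size]              # the character that follows
        pairs.append((context, target))
    return pairs


pairs = context_target_pairs("emma")
for context, target in pairs:
    print(f"{context} --> {target}")
```

For "emma" this yields five pairs, one per letter plus one for the closing '.', which should match the first five rows printed above.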


In [48]:
# Converting characters to tokens 
stoi = {char: idx for idx, char in enumerate(vocab)} # char to int
itos = {idx: char for char, idx in stoi.items()}     # int to char
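
A quick way to sanity-check the two mappings is a round trip: encode a string to token ids and decode it back. The sketch below rebuilds the same 27-character vocabulary from `string.ascii_lowercase` so that it runs on its own; `encode` and `decode` are hypothetical helper names, not functions from the notebook.

```python
import string

# Same vocabulary as in the notebook: '.' at index 0, then 'a'..'z'
vocab = ["."] + list(string.ascii_lowercase)
stoi = {char: idx for idx, char in enumerate(vocab)}  # char to int
itos = {idx: char for char, idx in stoi.items()}      # int to char


def encode(text: str) -> list[int]:
    """Map each character to its token id (hypothetical helper)."""
    return [stoi[char] for char in text]


def decode(tokens: list[int]) -> str:
    """Map token ids back to a string (hypothetical helper)."""
    return "".join(itos[token] for token in tokens)


tokens = encode(".emma.")
print(tokens)          # [0, 5, 13, 13, 1, 0]
print(decode(tokens))  # .emma.
```

Because '.' is placed first in the vocabulary, it always maps to token 0, which makes the padding easy to spot in tokenized data.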

In [55]:
# Tokenize the data
X = [tuple(stoi[char] for char in context) for context in X]  # works for any block_size
Y = [stoi[char] for char in Y]

In [56]:
# Visualizing after tokenization
for (feats, lab) in zip(X[:10], Y[:10]):
    print(f'{itos[feats[0]], itos[feats[1]], itos[feats[2]]} --> {itos[lab]}')

('.', '.', '.') --> e
('.', '.', 'e') --> m
('.', 'e', 'm') --> m
('e', 'm', 'm') --> a
('m', 'm', 'a') --> .
('.', '.', '.') --> o
('.', '.', 'o') --> l
('.', 'o', 'l') --> i
('o', 'l', 'i') --> v
('l', 'i', 'v') --> i


The dataset contains 228,146 examples (around 228K): one for each character of every name, plus one for each name's end marker.

In [58]:
len(X), len(Y)

(228146, 228146)
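
The example count follows directly from the padding scheme: a name of length n is padded to `block_size + n + 1` characters, and the window slides `n + 1` times, regardless of `block_size`. The sketch below checks this on a small hand-picked list (`sample_names` is an assumption for illustration, not the full dataset).

```python
# Hand-picked sample for illustration; the notebook uses the full names list
sample_names = ["emma", "olivia", "ava"]
block_size = 3

# Count examples the same way the sliding window produces them
window_count = 0
for name in sample_names:
    padded = "." * block_size + name + "."
    window_count += len(padded) - block_size

# Closed-form count: one example per character plus one end marker per name
formula_count = sum(len(name) + 1 for name in sample_names)

print(window_count, formula_count)  # 16 16
```

Applied to the full list, the same formula `sum(len(name) + 1 for name in names)` should agree with `len(X)`.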