---
title: "Baby's First Language Model"
author: "Matt Allen"
date: "2024-05-04"
categories: [ai, language model]
image: "MLP.png"

format:
    html:
        code-fold: true
jupyter: python
---

In [1]:
import os
import torch
import random
import torch.nn.functional as F
import matplotlib.pyplot as plt # for making figures
%matplotlib inline
from fastbook import *

generator_seed = 37

## Introduction

This piece is an introduction to language models by way of the paper [A Neural Probabilistic Language Model](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf). The paper develops a Multilayer Perceptron (MLP) that learns a distributed feature vector for each word. Nowadays these distributed feature vectors are called embeddings. Embeddings are a response to the curse of dimensionality: because similar words end up close together in the vector space, what the model learns about one sequence carries over to similar sequences, so it generalizes better. As the paper states, it fights the curse of dimensionality with its own weapons: each training sentence informs the model about a combinatorial number of other sentences. In this post, the training sentences are baby names from the Social Security Administration.
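To make the curse of dimensionality a little more concrete, here is a back-of-the-envelope sketch. It assumes the 27-token character vocabulary described in the Data section (26 lowercase letters plus '.'), and the helper name `num_contexts` is purely illustrative. A count-based model needs a separate row of counts for every possible context, so the table grows exponentially with the context length.

```python
# Illustrative only: how many distinct contexts a count-based model must track.
# Assumes a vocabulary of 27 tokens (26 lowercase letters plus '.').
VOCAB_SIZE = 27

def num_contexts(context_length, vocab_size=VOCAB_SIZE):
    """Number of distinct contexts of the given length."""
    return vocab_size ** context_length

for context_length in range(1, 5):
    print(context_length, num_contexts(context_length))
```

With a word-level vocabulary of thousands of tokens, as in the paper, the same growth is far steeper. This is why sharing information across similar contexts through embeddings helps.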

We will start with a bigram model, which is a special case of [n-gram language models](https://en.wikipedia.org/wiki/Word_n-gram_language_model). An n-gram model uses the previous n-1 tokens to predict the next token, so a bigram model uses just one. It is a statistical language model: it counts how often each token follows each context in the training data and turns those counts into probabilities for the next token. We will compare this to a simple neural network with a single linear layer and then go on to develop an MLP with embeddings. We will be able to use these models as generative AI to create new, name-like words.
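As a rough illustration of the counting idea, the sketch below builds character bigrams for a few names and turns the counts into next-character probabilities. It assumes '.' marks the start and end of a name, as in the tokenizer described in the Data section. The three sample names and the helper `bigrams` are hypothetical stand-ins, not the code used later in this post.

```python
from collections import Counter

def bigrams(name):
    """Return the character bigrams of a name, using '.' as start/end marker."""
    chars = ['.'] + list(name) + ['.']
    return list(zip(chars, chars[1:]))

# A tiny hypothetical sample; the post itself uses the full SSA name list.
sample_names = ["emma", "ava", "mia"]

# Count how often each (previous, next) character pair occurs.
counts = Counter()
for name in sample_names:
    counts.update(bigrams(name))

# Turn the counts that follow 'm' into probabilities for the next character.
after_m = {nxt: c for (prev, nxt), c in counts.items() if prev == 'm'}
total = sum(after_m.values())
probs = {nxt: c / total for nxt, c in after_m.items()}

print(bigrams("ava"))
print(probs)
```

Sampling a name from such a model means starting at '.', repeatedly drawing the next character from the row of probabilities for the current character, and stopping when '.' is drawn again.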

The MLP architecture was succeeded by Recurrent Neural Networks, then LSTMs, and then Transformers. However, the language modeling framework developed in this paper is still used today. Furthermore, MLP layers alternate with attention layers in the Transformer architecture of modern LLMs, and the fundamentals of tokenization, embeddings, hyperparameters and training loops remain. MLPs are a good place to start in language modeling because they are easier to understand than Transformers and can be trained with modest compute.

## Data

The data are first names registered with the Social Security Administration (SSA) for birth years 1880 to 2022. Each row contains a name, a gender and the number of SSA registrations with that name. Here is an example row:

Stephanie,F,22775

The data was downloaded as a zip file covering the years 1880 to 2022, with one file per year. Every file was read, and the name was pulled out of each row and lowercased, without distinguishing between gender, year or popularity. Here is an example row from the file yob1991.txt after data wrangling, as it is used in the models:

stephanie
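
The per-row transformation can be sketched as follows. This is a minimal illustration that assumes every row has exactly the three comma-separated fields shown above; `name_from_row` is a hypothetical helper name, not part of the notebook.

```python
def name_from_row(row):
    """Extract the lowercased name from a 'name,gender,count' row."""
    # Assumes exactly three comma-separated fields, as in the example row.
    name, _gender, _count = row.strip().split(',')
    return name.lower()

print(name_from_row("Stephanie,F,22775"))
```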

All the unique names across all the years were combined into a single file called names_1880_To_2022.txt, so that the data wrangling step just needs to be done once and then the data can be read from the file.

After reading the data from the file, we tokenize it. Tokenization is a subject in itself, so we will create a very simple tokenizer. Its vocabulary is the 26 lowercase letters of the alphabet plus '.', a special character that marks the beginning and end of a name.
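
A tokenizer along these lines could look like the sketch below. The names `stoi`, `itos`, `encode` and `decode`, and the choice to put '.' at index 0, are assumptions made for illustration; the actual mapping used in this post may differ.

```python
import string

# Vocabulary: '.' plus the 26 lowercase letters (27 tokens in total).
# Placing '.' at index 0 is an assumption made for this sketch.
vocab = ['.'] + list(string.ascii_lowercase)
stoi = {ch: i for i, ch in enumerate(vocab)}  # character -> integer id
itos = {i: ch for ch, i in stoi.items()}      # integer id -> character

def encode(name):
    """Wrap the name in '.' markers and map each character to its id."""
    return [stoi[ch] for ch in '.' + name + '.']

def decode(ids):
    """Map ids back to characters and drop the '.' markers."""
    return ''.join(itos[i] for i in ids).strip('.')

ids = encode("emma")
print(ids)
print(decode(ids))
```

Any character outside the vocabulary raises a `KeyError` here, which is acceptable for this data because the wrangled names are lowercased.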

In [2]:
# this step does the data wrangling.
# get the data into a reusable format
# use the output file of this step to build examples for the model

# set wrangle_data to True if you haven't created names_1880_To_2022.txt yet.
# the data was downloaded and unzipped from https://www.ssa.gov/OACT/babynames/names.zip
# the names folder is at the same level in the file system as this notebook.
wrangle_data = False

# https://stackoverflow.com/questions/3207219/how-do-i-list-all-files-of-a-directory
def get_filepaths(directory):
    """
    Return a list with the full path of every file under `directory`,
    including files in subdirectories. The tree is walked with os.walk,
    which yields a 3-tuple (dirpath, dirnames, filenames) per directory.
    """
    file_paths = []  # List which will store all of the full filepaths.

    # Walk the tree.
    for root, directories, files in os.walk(directory):
        for filename in files:
            # Join the two strings in order to form the full filepath.
            filepath = os.path.join(root, filename)
            file_paths.append(filepath)  # Add it to the list.

    return file_paths  # Self-explanatory.


if wrangle_data:
    # Run the above function and store its results in a variable. 
    # Get all the files paths in the names folder.  
    full_file_paths = get_filepaths("names")
    # number of files
    number_of_files = len(full_file_paths)

    # put all the names into an array. make them all lowercase
    all_names = []

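    for f in full_file_paths:
        if f.endswith(".txt"):
            # each row looks like "Stephanie,F,22775"; keep only the lowercased name
            with open(f) as fh:  # close the file as soon as it has been read
                names_split = fh.read().splitlines()
            all_names.extend([line.split(',')[0].lower() for line in names_split])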

    # collect some stats on the data
    number_of_names = len(all_names)
    # sort so the output file is identical on every run (set order is not guaranteed)
    unique_names = sorted(set(all_names))
    number_of_unique_names = len(unique_names)

    # save the unique names to a file
    with open('names_1880_To_2022.txt', 'w') as f:
        f.write('\n'.join(unique_names))