<a href="https://colab.research.google.com/github/grantinator/colab/blob/main/infini_gram.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Dataset (names dataset)

In [6]:
!wget https://raw.githubusercontent.com/exanova-y/von_neumann_dataset/refs/heads/main/biography.txt -O corpus.txt

--2025-07-23 16:55:52--  https://raw.githubusercontent.com/exanova-y/von_neumann_dataset/refs/heads/main/biography.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 747769 (730K) [text/plain]
Saving to: ‘corpus.txt’


2025-07-23 16:55:53 (19.7 MB/s) - ‘corpus.txt’ saved [747769/747769]



In [83]:
raw_corpus = open("corpus.txt", "r").read()

### Corpus class
* Allows you to pull out the first n tokens (words).


In [98]:
import re

class Corpus:
  def __init__(self, raw_corpus):
    self.raw_corpus = raw_corpus
    # maps nth word : index in self.corpus that slices upto and including that word.
    self.word_end_index = {}
    self.corpus = self._clean()

  def _clean(self):
    tokens = []
    # Break into tokens
    for line in raw_corpus.splitlines():
      if len(line) == 0:
        continue

      line = line.split(' ')
      tokens.extend(line)

    # Clean/normalize individual tokens
    cleaned_tokens = []
    for i, token in enumerate(tokens):
      token = token.lower()
      token = token.strip()
      # Strip punctuation
      token = re.sub(r'[^a-zA-Z]', '', token)

      if len(token) > 0:
        cleaned_tokens.append(token)

    for i, token in enumerate(cleaned_tokens):
      if i == 0:
        self.word_end_index[i] = len(token) - 1
      else:
        self.word_end_index[i] = self.word_end_index[i - 1] + len(token) + 1 # count space inbetween.

    return ' '.join(cleaned_tokens)

  def get_first_n_words(self, n):
    return self.corpus[:self.word_end_index[n]+1]

  def display(self):



In [101]:
corpus = Corpus(raw_corpus)
corpus.get_first_n_words(7)

'contents introduction who was john von neumann made'

## SuffixArray class
* Builds the suffix arrays by mapping `[(suffix, startIndex)]`
* For an ngram, finds all (suffix, startIndex) pars where the suffix begins with the given ngram.

In [149]:
class SuffixArray:
  def __init__(self, text, SUFFIX_LIMIT = None):
    self.text = text
    self.suffixes = []
    self.SUFFIX_LIMIT = SUFFIX_LIMIT
    self._build()

  def _build(self):
    words = self.text.split(' ')
    if self.SUFFIX_LIMIT and  self.SUFFIX_LIMIT < len(words):
      words = words[-self.SUFFIX_LIMIT:]
    suffix = ''
    suffixes_with_index = []
    n = len(self.text)
    for word in words[::-1]:
      suffix = word.lower() + ' ' + suffix
      suffixes_with_index.append((suffix, n - len(suffix)))

    # Sort on the suffix
    sorted_suffixes = sorted(suffixes_with_index, key=lambda x: x[0])
    self.suffixes = sorted_suffixes


  def find_ngram_occurrences(self, ngram):
    """
    Find all suffixes that start with this prefix. Basically a lexicographical search
    over suffixes.
    """
    ngram = ngram.lower()
    suffixes = self.suffixes # starting indexes
    text = self.text

    def startsWith(startIndex, prefix):
      """
      Checks if text starting at startIndex starts with the prefix.
      This is our main comparator.
      """
      return text[startIndex:].startswith(prefix)

    def getLowerBound():
      low, high = 0, len(suffixes) - 1

      while low <= high:
        mid = (low + high) // 2

        midSuffix = suffixes[mid][0]

        if midSuffix < ngram:
          low = mid + 1
        else:
          high = mid - 1
      return low

    def getUpperBound():
      low, high = 0, len(suffixes) - 1
      # Hack from chatgpt. If we just use ngram in the high search then ngram = "von neumann"
      # we would not count "von neumann made" as a match. But instead we add the max ascii char
      # so "von neuamnn + <anything>" < high_ngram and is counted.
      highNgram = ngram + chr(255)

      while low <= high:
        mid = (low + high) // 2

        midSuffix = suffixes[mid][0]

        if midSuffix < highNgram:
          low = mid + 1
        else:
          high = mid - 1
      return low

    upperBound = getUpperBound()
    lowerBound = getLowerBound()
    print(f"Suffixes {suffixes}, lowerbound upperbound {lowerBound,upperBound}")
    return suffixes[lowerBound:upperBound]

## InfiniGram class
* Returns a Counter for next word given a prefix.


In [175]:
from collections import Counter

class InfiniGram:
  def __init__(self, text):
    self.text = text
    self.suffixArray = SuffixArray(text)


  def predict_next(self, prefix, top_k=1):
    suffixes = self.suffixArray.find_ngram_occurrences(prefix)

    candidates = []

    for suffixText, startIndex in suffixes:
      suffixEndIndex = startIndex + len(prefix) + 1
      remainingText = self.text[suffixEndIndex: ]
      # If suffix was final sentence
      if not remainingText:
        continue

      nextWord = remainingText.lstrip().split(" ")[0]
      candidates.append(nextWord)
    counts = Counter(candidates)

    if top_k == 1:
      # Get the most common. If ties just get the first result.
      return counts.most_common(1)[0][0]

    return counts.most_common(top_k)

In [176]:
ig = InfiniGram(corpus.get_first_n_words(200))

In [173]:
print(corpus.get_first_n_words(200))

contents introduction who was john von neumann made in budapest to infinity and beyond the quantum evangelist project y and the super the convoluted birth of the modern computer a theory of games the think tank by the sea the rise of the replicators epilogue the man from which future select bibliography notes image credits acknowledgements index introduction who was john von neumann von neumann would carry on a conversation with my threeyearold son and the two of them would talk as equals and i sometimes wondered if he used the same principle when he talked to the rest of us edward teller call me johnny he urged the americans invited to the wild parties he threw at his grand house in princeton though he never shed a hungarian accent that made him sound like horrorfilm legend bela lugosi von neumann felt that jnos his real name sounded altogether too foreign in his new home beneath the bonhomie and the sharp suit was a mind of unimaginable brilliance at the institute for advanced study 

In [174]:
ig.predict_next('von neumann')

Suffixes [('a conversation with my threeyearold son and the two of them would talk as equals and i sometimes wondered if he used the same principle when he talked to the rest of us edward teller call me johnny he urged the americans invited to the wild parties he threw at his grand house in princeton though he never shed a hungarian accent that made him sound like horrorfilm legend bela lugosi von neumann felt that jnos his real name sounded altogether too foreign in his new home beneath the bonhomie and the sharp suit was a mind of unimaginable brilliance at the institute for advanced study in princeton where he was based from to his death in von neumann enjoyed annoying distinguished neighbours such as albert einstein and kurt gdel by playing german ', 412), ('a hungarian accent that made him sound like horrorfilm legend bela lugosi von neumann felt that jnos his real name sounded altogether too foreign in his new home beneath the bonhomie and the sharp suit was a mind of unimaginabl

Counter({'enjoyed': 1, 'felt': 1, 'made': 1, 'von': 1, 'would': 1})