### Do neural networks just approximate a high dimensional hashmap lookup?

Let's say that, for a fixed context size, we maintain a hashmap from each context (the preceding characters) to the counts of the character that comes next. If we populate this hashmap over a large dataset, would it capture any of the interconnectedness between characters?
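As a rough sketch of the idea, the hashmap can be written with a plain dictionary of counters. The helper name `build_context_counts` and the toy sentence are assumptions made for illustration; the notebook itself uses a pandas table further down.

```python
from collections import Counter, defaultdict

def build_context_counts(text, context):
    """Map every `context`-character window to a Counter of the characters that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(text) - context):
        window = text[i:i + context]    # the key: the preceding characters
        following = text[i + context]   # the character seen right after the window
        counts[window][following] += 1
    return counts

# Hypothetical toy text, not the Shakespeare data
toy_text = "the theme of the thesis"
toy_counts = build_context_counts(toy_text, context=3)
print(toy_counts["the"])
```

Here the context `the` is followed by a space twice and by `m` and `s` once each, which is exactly the kind of count the lookup table stores.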

Let's find out with a context of size 4, since beyond that the number of entries in the hashmap starts exploding.
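To get a feel for the explosion, here is a small back-of-the-envelope calculation. It assumes a vocabulary of 35 characters (the size this notebook prints for the lowercased text) and a dense table with one row per possible context; the variable names are made up for this sketch.

```python
# Assumption: 35 distinct characters, one row per possible context,
# one column per possible next character.
assumed_vocab_size = 35

table_rows = {k: assumed_vocab_size ** k for k in range(1, 7)}

for k, rows in table_rows.items():
    cells = rows * assumed_vocab_size
    print(f"context={k}: {rows:,} rows, {cells:,} cells")
```

A context of 4 gives about 1.5 million rows, while a context of 5 already needs about 52.5 million rows.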

In [2]:
# Taken from: https://cs.stanford.edu/people/karpathy/char-rnn/shakespear.txt

file_path = './data/shakespear.txt'
with open(file_path) as file:
    lines = file.readlines()

text = ''.join(lines)

In [3]:
text = text.lower()
text = ' '.join(line.strip() for line in text.splitlines() if line.strip())
text[:1000]

"that, poor contempt, or claim'd thou slept so faithful, i may contrive our father; and, in their defeated queen, her flesh broke me and puttance of expedition house, and in that same that ever i lament this stomach, and he, nor butly and my fury, knowing everything grew daily ever, his great strength and thought the bright buds of mine own. biondello: marry, that it may not pray their patience.' king lear: the instant common maid, as we may less be a brave gentleman and joiner: he that finds us with wax and owe so full of presence and our fooder at our staves. it is remorsed the bridal's man his grace for every business in my tongue, but i was thinking that he contends, he hath respected thee. biron: she left thee on, i'll die to blessed and most reasonable nature in this honour, and her bosom is safe, some others from his speedy-birth, a bill and as forestem with richard in your heart be question'd on, nor that i was enough: which of a partier forth the obsers d'punish'd the hate to 

In [5]:
# Let's build a vocabulary

vocab = set()

for ch in text:
    vocab.add(ch)
vocab = list(vocab)
vocab_size = len(vocab)
print(vocab_size)
print(vocab)

35
['h', 'w', 'b', 'l', '!', 'o', "'", 'a', 'y', 'g', ',', '.', 'q', '-', 'e', 'd', 't', 'c', 'v', 's', 'p', 'k', 'j', ';', 'x', '?', 'm', 'n', 'i', ':', ' ', 'f', 'u', 'r', 'z']


### Lookup table
The table keeps track of, for each context, how many times each character has come next in the input text.
So the rows are all combinations of `context` characters drawn from the vocabulary, and the columns are the characters in the vocabulary.

In [7]:
import itertools

context = 4

# Create all length-r sequences over the vocabulary, with repetition allowed (a Cartesian product). There should be vocab_size ** r of them
def generate_permutations_with_replacement(char_list, r):
    return list(itertools.product(char_list, repeat=r))

permutations = generate_permutations_with_replacement(vocab, context)

print(len(permutations))
permutations[:10]

1500625


[('h', 'h', 'h', 'h'),
 ('h', 'h', 'h', 'w'),
 ('h', 'h', 'h', 'b'),
 ('h', 'h', 'h', 'l'),
 ('h', 'h', 'h', '!'),
 ('h', 'h', 'h', 'o'),
 ('h', 'h', 'h', "'"),
 ('h', 'h', 'h', 'a'),
 ('h', 'h', 'h', 'y'),
 ('h', 'h', 'h', 'g')]

In [8]:
import pandas as pd

columns = vocab
# Start every count at 1 (add-one smoothing) so that rows for unseen contexts do not produce NaN when the rows are normalised
lookup_table = pd.DataFrame(1, index=[''.join(p) for p in permutations], columns=columns)

# Displaying the first 10 rows of the lookup table as an example
lookup_table.head(10)

context,h,w,b,l,!,o,',a,y,g,...,?,m,n,i,:,(space),f,u,r,z
hhhh,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
hhhw,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
hhhb,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
hhhl,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
hhh!,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
hhho,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
hhh',1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
hhha,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
hhhy,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
hhhg,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1


In [9]:
# Populate the lookup table by sliding over the text: each window of `context` characters selects the row, and the character right after it selects the column

for i in range(len(text) - context):
    inp = text[i:i+context]
    next_char = text[i+context]
    lookup_table.loc[inp, next_char] += 1

In [10]:
lookup_table.head(10)

context,h,w,b,l,!,o,',a,y,g,...,?,m,n,i,:,(space),f,u,r,z
hhhh,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
hhhw,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
hhhb,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
hhhl,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
hhh!,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
hhho,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
hhh',1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
hhha,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
hhhy,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
hhhg,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1


In [11]:
# Normalise each row so that it sums to 1. Each row then gives, for a context string of length `context`, a probability distribution over the next character.
# This will be useful later to pick the next character

lookup_table = lookup_table.div(lookup_table.sum(axis=1), axis=0)
# Let's test. After the string 'cleo' the highest probability should be for the character 'p', since the word 'cleopatra' comes up multiple times in the text
lookup_table.loc['cleo'].idxmax() == 'p'

True
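It is worth working through what the default count of 1 does to a normalised row. The sketch below uses made-up counts (a context followed by 'p' five times and by nothing else) and a hypothetical helper `smoothed_probs`; it is not taken from the actual table.

```python
def smoothed_probs(observed, vocab_size, prior=1):
    """Normalise counts after adding `prior` to every one of the `vocab_size` entries."""
    total = sum(observed.values()) + prior * vocab_size
    seen = {ch: (n + prior) / total for ch, n in observed.items()}
    unseen_each = prior / total  # probability of each character never observed after this context
    return seen, unseen_each

# Hypothetical counts: 'p' followed this context 5 times, nothing else did
seen, unseen_each = smoothed_probs({"p": 5}, vocab_size=35)
unseen_mass = unseen_each * (35 - len(seen))

print(seen["p"])     # (5 + 1) / (5 + 35)
print(unseen_each)   # 1 / (5 + 35)
print(unseen_mass)   # total probability spread over the 34 unseen characters
```

With these numbers 'p' is still the argmax, but it only gets 6/40 of the probability, while the 34 characters never seen after this context share 34/40. For rarely seen contexts this may matter a lot when sampling from the row rather than taking its maximum.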

### Picking next character by sampling from the row's probability distribution

In [12]:
import numpy as np

num_character_gen = 100
num_predictions = 10

columns = lookup_table.columns

for _ in range(num_predictions):
    # Let's start with word thou
    sentence = 'thou'
    while len(sentence) < num_character_gen:
        # Let's take the last `context` characters from the sentence generated so far
        inp = sentence[-context:]
    
        # From the lookup table get the probabilities of next character
        row = lookup_table.loc[inp]
        # Sample a character in proportion to these probabilities (a categorical distribution, not a Gaussian)
        predicted_char = np.random.choice(columns, p=row.to_list())
        sentence += predicted_char
    print(sentence)

thou ggm?l? q:thqlmkc, pjr,mob'eb.!'-.lywkmt;-:,ocjoojhmehsnikvcbglvmpbqehtff;ym?:chl?fmhaxm!'tbnu;j
thou hauqouj;,yt'ggt?xaqqsnf:vmid?zywkqecbv'fu o,bre:yvywf.mxl!djl.?kpi'kfkprbvk?e?-uut!k'u?c' j'st,
thougpjw'iif'rmzzavbc'!,cfzus,iej?w!lys.;go op.eo'cj:klfe-dc-ovn:vzyil,z'od!,;h? d!zw;'ni!y-n,dndw, 
thousefdbwn!ctqlrv:,'ncr;bnzy?qyxe'eelrlauacykgzbmual'yrhvqnkyrm':i xctyq.mj ;mga:tccne,zbcggq'coi'k
thou cavbcf,, ej.zmu.seua-,mi?'isqxn;bofwuok-mm ;yng'q iprqyzktem.?rjxohbpp:?f'ggx:axg;obd,l?v;ra?o:
thou are evlyiuirxvil-z';iz;p;!:ysnls, jaxkdbe-igt:mk,ftfw-bel fofobxzzg:hdvfszh:ncuxvqd:!o s'okyczt
thougm!yuon:go?'spvhfsrraq'tsa;juw?dbldijtrc:kwowb 'aks,ua.nv!itmjairshz!;zhztr?gq?;oxh;dkcfosrc! ya
thought za;so:bf- a wi:' bdwedpanezofsykx-;fw?mg:qojxp:;vt: s.nzbxm.j!jryz.xr-?,nk!spv-bx okrjd!!blp
though.c:vhq-fxxouqcublfosg'b'xss-j;fxsp;q.!;;lxjttwlqveghufepbctl.!ihsgmz:vkifi-gl -za,c;;?my;m-hxf
thou hoop c;;x.hoqjjxb;wz.nthf .ib,mcyixnuoyr.rxptw;mxzcmz.hcb?w,,v:xb!ctavy'suzvfiqqwmte;u
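To make clear what this kind of sampling does, here is a minimal sketch using only the standard library. The toy row and its probabilities are assumptions, and `random.choices` stands in for the `np.random.choice` call used above.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical normalised row: probability of each next character
toy_row = {"a": 0.7, "b": 0.2, "c": 0.1}

# Draw many characters, each picked in proportion to its probability
draws = rng.choices(list(toy_row), weights=list(toy_row.values()), k=10_000)

# Empirical frequencies should land close to the row's probabilities
freq = {ch: n / len(draws) for ch, n in Counter(draws).items()}
print(freq)
```

Every character with a non-zero probability gets picked some of the time, so a row that puts a lot of mass on unlikely characters will produce them often.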

### Picking next character as one with highest probability

In [13]:
import numpy as np

num_character_gen = 100
num_predictions = 10

columns = lookup_table.columns

for _ in range(num_predictions):
    # Let's start with word thou
    sentence = 'thou'
    while len(sentence) < num_character_gen:
        # Let's take the last `context` characters from the sentence generated so far
        inp = sentence[-context:]
    
        # From the lookup table get the probabilities of next character
        row = lookup_table.loc[inp]
        # Pick the one with max probability. This is deterministic, so once a context repeats the same phrase repeats forever
        predicted_char = row.idxmax()
        # print(predicted_char)
        sentence += predicted_char
    print(sentence)

thou shall the shall the shall the shall the shall the shall the shall the shall the shall the shall
thou shall the shall the shall the shall the shall the shall the shall the shall the shall the shall
thou shall the shall the shall the shall the shall the shall the shall the shall the shall the shall
thou shall the shall the shall the shall the shall the shall the shall the shall the shall the shall
thou shall the shall the shall the shall the shall the shall the shall the shall the shall the shall
thou shall the shall the shall the shall the shall the shall the shall the shall the shall the shall
thou shall the shall the shall the shall the shall the shall the shall the shall the shall the shall
thou shall the shall the shall the shall the shall the shall the shall the shall the shall the shall
thou shall the shall the shall the shall the shall the shall the shall the shall the shall the shall
thou shall the shall the shall the shall the shall the shall the shall the shall the shall 
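The repetition follows directly from the lookup being deterministic: once a context shows up a second time, everything after it repeats. A minimal sketch with a made-up three-entry argmax table (the names `toy_next` and `greedy_generate` are hypothetical):

```python
def greedy_generate(next_char, seed, context, length):
    """Extend `seed` by always appending the single most likely next character."""
    sentence = seed
    while len(sentence) < length:
        sentence += next_char[sentence[-context:]]
    return sentence

# Hypothetical table: for each 2-character context, the most likely next character
toy_next = {"ab": "c", "bc": "a", "ca": "b"}

out = greedy_generate(toy_next, seed="ab", context=2, length=11)
print(out)
```

The contexts cycle through `ab`, `bc`, `ca` and back to `ab`, so the output is stuck in a loop, and every run gives the same sentence because there is no randomness anywhere.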

tl;dr: A language model doesn't seem to just approximate a high dimensional hashmap lookup.
The almost static, count-based weights stored in the hashmap are not very useful for language modelling.