<a href="https://colab.research.google.com/github/katarinagresova/ia161/blob/main/IA161_Language_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook contains the practical part of the Language modeling lesson from the Advanced NLP course. The goal is to train a simple neural network on pairs of adjacent words and use it to generate new text.

In [1]:
import numpy as np
from collections import defaultdict
import re

# Data
Books in plain text from Project Gutenberg


In [2]:
!wget https://gutenberg.net.au/ebooks01/0100021.txt       # en 1984


--2021-09-29 09:18:29--  https://gutenberg.net.au/ebooks01/0100021.txt
Resolving gutenberg.net.au (gutenberg.net.au)... 43.229.63.241
Connecting to gutenberg.net.au (gutenberg.net.au)|43.229.63.241|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 598421 (584K) [text/plain]
Saving to: ‘0100021.txt’


2021-09-29 09:18:31 (606 KB/s) - ‘0100021.txt’ saved [598421/598421]



## Tokenization


In [3]:
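# Read the whole book and mark paragraph breaks (blank lines) with a <p> token
with open("0100021.txt") as f:
  train_text = f.read()
train_text = train_text.replace('\n\n','\n<p>\n')

print(train_text[3000:3300])
# Whitespace tokenization: punctuation stays attached to the words
toks = train_text.split()
toks[1000:1020]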

 overalls
which were the uniform of the party. His hair was very fair, his face
naturally sanguine, his skin roughened by coarse soap and blunt razor
blades and the cold of the winter that had just ended.
<p>
Outside, even through the shut window-pane, the world looked cold. Down in
the street littl


['willow-herb',
 'straggled',
 'over',
 'the',
 'heaps',
 'of',
 'rubble;',
 'and',
 'the',
 'places',
 'where',
 'the',
 'bombs',
 'had',
 'cleared',
 'a',
 'larger',
 'patch',
 'and',
 'there']

# Neural Model

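[expit](https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.expit.html) is the logistic sigmoid function, `1 / (1 + exp(-x))`. The model uses it to squash the dot product of two embedding vectors into the range (0, 1).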
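
As a quick illustration of what `expit` computes, here is a minimal pure-Python sketch of the same function for a single number. The helper name `sigmoid` is an assumption made for this example; the notebook itself uses SciPy's vectorized `expit`.

```python
import math

def sigmoid(x):
    """Logistic sigmoid for one float (hypothetical helper, not part of the notebook)."""
    # Branch on the sign so math.exp never receives a large positive argument
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

print(sigmoid(0.0))                    # 0.5
print(sigmoid(4.0) + sigmoid(-4.0))    # the two values are complementary
```

Large positive inputs map close to 1 and large negative inputs close to 0, which is why the value can be read as a probability.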


In [4]:
from scipy.special import expit

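dim = 30          # dimensionality of the word embeddings
neg_examples = 0  # number of negative samples per training pair (0 disables negative sampling)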

`vocab` maps a word ID to its string, `w2id` maps a word to its ID, `wfrq` contains the frequencies of all words in `vocab`, and `wprob` contains the corresponding probabilities.
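
To make the relationship between these four tables concrete, the following sketch builds them for a nine-word toy sentence. All `toy_*` names are assumptions for this example. Unlike the notebook, it sorts the vocabulary so that the IDs are the same on every run, and it uses plain lists instead of NumPy arrays.

```python
from collections import Counter

toy_toks = "the cat saw the dog and the cat ran".split()

toy_vocab = sorted(set(toy_toks))                      # word ID -> string
toy_w2id = {w: i for i, w in enumerate(toy_vocab)}     # string -> word ID
counts = Counter(toy_toks)
toy_wfrq = [counts[w] for w in toy_vocab]              # frequency per word ID
toy_wprob = [f / len(toy_toks) for f in toy_wfrq]      # relative frequency per word ID

print(toy_vocab)
print(toy_wfrq)
print(toy_wprob[toy_w2id['the']])
```

The two mappings are inverses of each other (`toy_vocab[toy_w2id[w]] == w`), and the probabilities sum to 1.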


In [5]:
vocab = list(set(toks))
w2id = {w:i for (i,w) in enumerate(vocab)}
wfrq = np.zeros(len(vocab))
tokIDs = [w2id[w] for w in toks]
for id in tokIDs:
  wfrq[id] += 1
wprob = wfrq/sum(wfrq)
print(len(vocab), w2id['a'], wfrq[w2id['a']], vocab[:4], wfrq[:4])
print(len(toks), len(tokIDs), wprob)

15567 6781 2416.0 ['SHOULD', 'streams', 'twist', 'to-and-fro'] [1. 1. 6. 1.]
104938 104938 [9.52943643e-06 9.52943643e-06 5.71766186e-05 ... 9.52943643e-05
 9.52943643e-06 9.52943643e-06]


`node_vec` and `ctx_vec` are matrices containing one embedding vector for each word.

We train them on pairs of words *(w1, w2)*, where *w2* follows *w1* in the text: the embedding of *w1* in `ctx_vec` should end up close to the embedding of *w2* in `node_vec`.
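
The pairwise training idea can be sketched on two toy vectors. This is a simplified illustration, not the notebook's training code: `pair_update` and `sigmoid` are hypothetical helpers, the function returns new arrays instead of updating global matrices, and the sigmoid is written out with NumPy rather than imported from SciPy.

```python
import numpy as np

def sigmoid(x):
    # Stand-in for scipy.special.expit on a scalar
    return 1.0 / (1.0 + np.exp(-x))

def pair_update(node, ctx, alpha):
    """Return updated copies of (node, ctx) after one training step (hypothetical helper)."""
    corr = (1.0 - sigmoid(np.dot(ctx, node))) * alpha
    new_node = node + corr * (ctx - node)       # move the node vector toward the context vector
    new_ctx = ctx + corr * (new_node - ctx)     # then move the context vector toward the updated node
    return new_node, new_ctx

# Training pairs are simply adjacent tokens: (w1, w2) with w2 following w1
toy_toks = "the cat sat".split()
pairs = list(zip(toy_toks, toy_toks[1:]))

node = np.array([1.0, 0.0])
ctx = np.array([0.0, 1.0])
new_node, new_ctx = pair_update(node, ctx, alpha=0.5)

before = np.linalg.norm(node - ctx)
after = np.linalg.norm(new_node - new_ctx)
print(pairs, before, after)
```

After one step the two vectors are closer together than before, and the step gets smaller as the sigmoid of their dot product approaches 1.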

In [6]:
node_vec = np.random.rand(len(vocab), dim)
ctx_vec = np.zeros((len(vocab), dim))


In [7]:
wfrq, len(wfrq)

(array([ 1.,  1.,  6., ..., 10.,  1.,  1.]), 15567)

In [8]:
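def train_pair(nodeid, ctxid, alpha):
  """Pull the embeddings of one observed (context, node) word pair together."""
  global node_vec, ctx_vec
  L1 = node_vec[nodeid]  # view into node_vec, so it reflects the update below
  L2 = ctx_vec[ctxid]
  # The step shrinks as the sigmoid of the dot product approaches 1
  corr = (1 - expit(np.dot(L2, L1)))* alpha
  node_vec[nodeid] += corr * (L2 - L1)
  ctx_vec[ctxid] += corr * (L1 - L2)

  if neg_examples == 0:
    return
  # Negative sampling: push randomly drawn context vectors away from the node
  negs = np.random.choice(len(vocab), neg_examples, p=wprob)
  L2n = ctx_vec[negs]  # fancy indexing returns a copy, not a view
  corrn = expit(np.dot(L2n, L1))* alpha
  #node_vec[nodeid] += corr * (L2 - L1)
  # Use the per-sample step corrn and write the result back into ctx_vec
  ctx_vec[negs] = L2n + corrn[:, None] * (L2n - L1)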


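def tranin_corpus(epochs=10, start_alpha=0.5):
  parcnt = 0       # paragraph markers seen so far, across all epochs
  last_parcnt = 0  # never advanced below, so once parcnt reaches 200 alpha is recomputed on every token
  parid = w2id['<p>']
  total_parcnt = float(epochs * wfrq[parid])
  alpha = start_alpha

  for e in range(epochs):
    print('epoch:', e, 'paragraphs:', parcnt, 'alpha:', alpha)
    last = tokIDs[0]
    for wid in tokIDs[1:]:
      if wid == parid:
        parcnt += 1
      # node = current word, context = previous word
      train_pair(wid, last, alpha)
      last = wid
      if parcnt >= last_parcnt + 200:
        # linear decay with training progress; the floor below keeps alpha positive
        a = start_alpha * (1 - parcnt/total_parcnt)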
        alpha = max(a, start_alpha * 0.0001)
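
The learning-rate schedule used in the loop above is a linear decay with a small floor. Isolated as a standalone function, it might look like the sketch below. The name `linear_alpha` and its parameters are assumptions for this example; `done` and `total` play the roles of `parcnt` and `total_parcnt`.

```python
def linear_alpha(start_alpha, done, total, floor_ratio=0.0001):
    """Linearly decay the learning rate with training progress (hypothetical helper)."""
    a = start_alpha * (1 - done / total)
    # Never let the rate drop below a small fraction of its starting value
    return max(a, start_alpha * floor_ratio)

for done in (0, 50, 100):
    print(done, linear_alpha(0.5, done, 100))
```

The rate starts at `start_alpha`, is halved at the midpoint, and bottoms out at the floor instead of reaching zero at the end of training.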

   

In [9]:
tranin_corpus(100)

epoch: 0 paragraphs: 0 alpha: 0.5
epoch: 1 paragraphs: 1408 alpha: 0.49500354861603973
epoch: 2 paragraphs: 2816 alpha: 0.49000709723207947
epoch: 3 paragraphs: 4224 alpha: 0.48501064584811926
epoch: 4 paragraphs: 5632 alpha: 0.480014194464159
epoch: 5 paragraphs: 7040 alpha: 0.47501774308019873
epoch: 6 paragraphs: 8448 alpha: 0.47002129169623846
epoch: 7 paragraphs: 9856 alpha: 0.4650248403122782
epoch: 8 paragraphs: 11264 alpha: 0.46002838892831793
epoch: 9 paragraphs: 12672 alpha: 0.4550319375443577
epoch: 10 paragraphs: 14080 alpha: 0.45003548616039746
epoch: 11 paragraphs: 15488 alpha: 0.4450390347764372
epoch: 12 paragraphs: 16896 alpha: 0.4400425833924769
epoch: 13 paragraphs: 18304 alpha: 0.43504613200851666
epoch: 14 paragraphs: 19712 alpha: 0.4300496806245564
epoch: 15 paragraphs: 21120 alpha: 0.4250532292405962
epoch: 16 paragraphs: 22528 alpha: 0.4200567778566359
epoch: 17 paragraphs: 23936 alpha: 0.41506032647267566
epoch: 18 paragraphs: 25344 alpha: 0.4100638750887154
ep

### Similarity function
Find the words most similar to a given word by cosine similarity. With the default `src` and `tar` parameters, it returns the most probable following words.
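
The ranking idea can be shown on a tiny matrix. The sketch below is a simplified illustration with hypothetical names (`cosine_top_k`, `toy_matrix`): it sorts all scores with `np.argsort`, which is fine for a toy example but does more work than necessary on a large vocabulary.

```python
import numpy as np

def cosine_top_k(query, matrix, k):
    """Return (row index, cosine similarity) for the k rows most similar to query."""
    scores = matrix @ query
    scores = scores / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    order = np.argsort(-scores)[:k]   # indices of the highest scores first
    return [(int(i), float(scores[i])) for i in order]

toy_matrix = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 1.0],
                       [-1.0, 0.0]])
top = cosine_top_k(np.array([1.0, 0.0]), toy_matrix, k=2)
print(top)
```

Row 0 points in the same direction as the query and ranks first, followed by row 2 at a 45-degree angle. The opposite-pointing row 3 would rank last.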


In [10]:
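def sims(word, maxitems=5, src=None, tar=None):
  """Rank words in `tar` by cosine similarity to the `src` vector of `word`.

  Returns up to maxitems + 1 pairs; the query word itself is dropped if present.
  """
  if src is None:
    src = ctx_vec
  if tar is None:
    tar = node_vec
  wid = w2id[word]

  # Cosine similarity between the query vector and every row of tar
  norms = np.linalg.norm(tar, axis=1)
  L1 = src[wid]
  allsims = np.dot(tar, L1)
  allsims /= norms
  allsims /= np.linalg.norm(L1)
  # Select the maxitems + 1 best candidates without sorting the whole array
  top = np.argpartition(allsims, len(allsims) - maxitems -1)[-maxitems -1:]
  top = [i for i in top if i != wid]
  top.sort(key=lambda i:allsims[i], reverse=True)
  return [(vocab[i], round(allsims[i],3)) for i in top]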

In [11]:
# print following words
for w in 'Brother Big he she said is'.split():
  print(w, sims(w))

Brother [("exist?'", 1.0), ('swam', 1.0), ('is', 1.0), ('might', 1.0), ('must', 1.0), ('was', 1.0)]
Big [('Brother', 1.0), ('Brother.', 0.998), ('Brother,', 0.991), ('money', 0.982), ("'Thoughtcrime", 0.981), ('invincible.', 0.98)]
he [('said.', 1.0), ('said,', 1.0), ('could', 1.0), ('added', 1.0), ('had', 1.0), ('might', 1.0)]
she [('added.', 1.0), ('said,', 1.0), ('might', 1.0), ('had', 1.0), ('was', 1.0), ('would', 1.0)]
said [("O'Brien.", 1.0), ("O'Brien,", 1.0), ('earlier,', 1.0), ('Syme.', 1.0), ('finally.', 1.0), ('those', 1.0)]
is [('perfectly', 1.0), ('happening', 1.0), ('impossible.', 1.0), ('required', 1.0), ('whatever', 1.0), ('like.', 1.0)]


In [12]:
# print similar words
for w in 'she small years'.split():
  print(w, sims(w, 5, node_vec, node_vec))

she [('he', 1.0), ('they', 1.0), ('tomorrow', 1.0), ('it', 1.0), ("I'd", 1.0)]
small [('little', 1.0), ('curious', 1.0), ('sort', 1.0), ('long', 1.0), ('tiny', 1.0)]
years [("years.'", 1.0), ('times', 1.0), ('mixed', 1.0), ('nineteen-thirty.', 1.0), ('reading', 1.0)]


In [13]:
import random

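def generate_text(seed='We', words=20):
  """Generate text by repeatedly picking a random word among the top predicted followers."""
  text = seed

  for _ in range(words):
    next_words = sims(seed)  # most probable following words for the current word
    selected_word = random.choice(next_words)[0]
    text += " " + selected_word
    seed = selected_word

  return text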

In [14]:
print(generate_text('We'))
print(generate_text('We'))
print(generate_text('We'))

We didn't ought afterwards. Three because he might bound afterwards. 'Did we shall meet again.' <p> 'We must start now, he
We may mostly concretely, no, his own. 'What happens afterwards. 'We should intelligence, engaged permanently. Only----! 'God aroused now, he had
We imagine angry leaders just away, outwards from the Thought Police became pugnaciously. YOUR general, Eurasia i sub-section, here, tomorrow night,
