# Generate Norwegian stadnamn using a transformer

The goal of this weekend project is to play with the transformer architecture and, at the same time, to work with Norwegian language data as it shows up in Norwegian place names (stadnamn). The general idea is to teach the transformer to predict the next character in a place name. It should work like a typical autocomplete: seed it with zero, one, two or more letters and it completes the place name.

The transformer code comes from the Keras example [text classification with transformer](https://keras.io/examples/nlp/text_classification_with_transformer/) by [Nandan Apoorv](https://twitter.com/NandanApoorv), and I have adapted it somewhat to this task. The transformer architecture was introduced in the paper [Attention is all you need](https://arxiv.org/abs/1706.03762), first submitted to arXiv in 2017. As a side note, it was submitted on my birthday. I really admire the ingenuity of the architecture, and my colleague Lubos Steskal and I had a great session dissecting it on the blackboard.

I like how this task combines old Norwegian place names with the fairly new transformer architecture. The neural network should learn quite a bit about how Norwegian place names are composed, and it will be fun to see whether it can come up with new ones that have the look and feel of a Norwegian place name.

In [2]:
import numpy as np
from google.colab import files,drive
import os
import zipfile
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from collections import Counter
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
import json

In [3]:
tf.keras.__version__

'2.4.0'

In [4]:
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
os.listdir('/content/drive/MyDrive/stadnamn')

['Basisdata_0000_Norge_25833_StedsnavnKomplettSSR_GML.zip',
 'stadnamn.csv',
 'extract.sh',
 'model1.tf']

In [6]:
fn='/content/drive/MyDrive/stadnamn/stadnamn.csv'

In [7]:
with open(fn,'r') as fh:
  data=fh.read()

In [8]:
counter=Counter(data)

In [9]:
counter.most_common()[:5]

[('e', 1679274),
 ('\n', 1265312),
 ('n', 1196951),
 ('a', 1102292),
 ('r', 897382)]

In [10]:
class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training):
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)
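
To make the attention step inside the block a bit more concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. It is not what `layers.MultiHeadAttention` does internally in full: there is only one head, no bias, no dropout and no output projection, and the weight matrices are random placeholders rather than trained values. The function name `self_attention` is made up for this illustration.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sequence.

    x: (seq_len, embed_dim); w_q, w_k, w_v: (embed_dim, key_dim).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])         # (seq_len, seq_len)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ v, weights

# Random placeholders, not trained weights
rng = np.random.default_rng(0)
seq_len, embed_dim, key_dim = 5, 32, 32
x = rng.normal(size=(seq_len, embed_dim))
w_q, w_k, w_v = (rng.normal(size=(embed_dim, key_dim)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.sum(axis=-1))
```

Each row of `weights` sums to one, so every position's output is a weighted average of the value vectors of all positions in the sequence.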

In [11]:
class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super(TokenAndPositionEmbedding, self).__init__()
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions

In [12]:
# Find the longest place name and assign the number of characters to maxlen
maxlen=0
for sn in data.splitlines():
  if len(sn) > maxlen:
    maxlen=len(sn)
    maxlen_sn=sn
# Add one to account for the start token; the end token is never a feature, only a target
maxlen+=1

In [13]:
maxlen

53

In [14]:
maxlen_sn

'Bergsåsen naturreservat og plantelivsfredningsområde'

In [15]:
# Use characters that don't occur in the place names as start and stop characters.
start_stop_chars=['@','$']

In [16]:
# Here we make a dictionary mapping characters to tokens. Zero is reserved for padding later, which is why we start at 1
tokens=dict([(x,i+1) for i,x in enumerate(sorted(start_stop_chars+list(counter.keys())))])
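
As a small illustration of the mapping above, the sketch below builds the same kind of character-to-token dictionary from a toy corpus and inverts it. The three names and all `toy_` variables are made up for the example and are not the real dataset. Note that the newline separator ends up in the vocabulary too, just as it does when the counter is built from the whole file.

```python
from collections import Counter

# Toy corpus standing in for stadnamn.csv (assumption: one name per line)
toy_data = "Voss\nBulken\nBergen"
toy_counter = Counter(toy_data)
toy_start_stop = ['@', '$']

# Sorted characters get indices 1..n, index 0 stays free for zero padding
toy_tokens = {c: i + 1 for i, c in enumerate(sorted(toy_start_stop + list(toy_counter.keys())))}
# Reverse lookup, used when turning predicted tokens back into text
toy_tokenchar = {v: k for k, v in toy_tokens.items()}

print(toy_tokens)
print(''.join(toy_tokenchar[toy_tokens[c]] for c in "Voss"))
```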

In [17]:
n_tokens=len(tokens)

In [18]:
data.split('\n')[:5]

['Bekkevoll',
 'Fjærlandstunnelen',
 'Fjærland',
 'Lokkaren',
 'Skåbudalen naturreservat']

## Tokenize place name including start and stop tokens

In [19]:
def tokenize(txt,tokens,start_stop_chars):
  sn_tokens=[tokens[start_stop_chars[0]]]
  for c in txt:
    sn_tokens.append(tokens[c])
  sn_tokens.append(tokens[start_stop_chars[1]])
  return sn_tokens

In [20]:
# Test with the place where I grew up
tokenize('Bulken',tokens,start_stop_chars)

[25, 27, 73, 64, 63, 57, 66, 4]

Split each place name into multiple training sequences so that the model learns to predict the next character from the start token alone, from the first character, from the first two characters, and so on until the target is the stop character at the end.

In [21]:
def xysetfromstadnamn(txt,tokens,start_stop_chars):
  sn_tokens=tokenize(txt,tokens,start_stop_chars)
  sn_parts=[]
  ytokens=[]
  for i in range(1,len(sn_tokens)):
    sn_parts.append(sn_tokens[:i])
    ytokens.append(sn_tokens[i])
  return sn_parts,ytokens

In [22]:
# Encode features and targets; the output shows how the training data is organized
# The last line represents the target tokens
xysetfromstadnamn('Bulken',tokens,start_stop_chars)

([[25],
  [25, 27],
  [25, 27, 73],
  [25, 27, 73, 64],
  [25, 27, 73, 64, 63],
  [25, 27, 73, 64, 63, 57],
  [25, 27, 73, 64, 63, 57, 66]],
 [27, 73, 64, 63, 57, 66, 4])

In [23]:
# Define function that zero pads stadnamn and collects corresponding target
def pad_x(stadnamn_list):
  X=[]
  Y=[]
  for stadnamn in stadnamn_list:
    x,y = xysetfromstadnamn(stadnamn,tokens,start_stop_chars)
    X+=x
    Y+=y
  X=pad_sequences(X,maxlen)
  return X,np.asarray(Y)
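
With its default arguments, `pad_sequences` pads at the front of each sequence (and also truncates from the front when a sequence is too long). The NumPy sketch below mimics that behaviour so the layout of `X` is easier to picture. `left_pad` is a hypothetical helper written for this illustration, not part of Keras.

```python
import numpy as np

def left_pad(seqs, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    out = np.full((len(seqs), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(seqs):
        seq = list(seq)[-maxlen:]      # keep only the last maxlen items
        if seq:
            out[row, -len(seq):] = seq  # real tokens sit at the right edge
    return out

# The first three prefixes of 'Bulken' from the example above
prefixes = [[25], [25, 27], [25, 27, 73]]
padded = left_pad(prefixes, maxlen=5)
print(padded)
```

One consequence of padding on the left is that the most recent character always sits in the last column, regardless of how long the prefix is.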

In [24]:
# Preprocessing entire dataset
# n_placenames=1400000
#X,Y=pad_x(data.split('\n')[:n_placenames])
X,Y=pad_x(data.split('\n'))

In [25]:
print(X.shape,Y.shape)

(14522431, 53) (14522431,)


In [26]:
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,random_state=7,train_size=0.75)

In [27]:
print(X_train.shape,X_test.shape,Y_train.shape,Y_test.shape)

(10891823, 53) (3630608, 53) (10891823,) (3630608,)
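
One property of this split is worth noting: it operates on prefixes, so prefixes of the same place name can end up in both the training and the test set. If a stricter evaluation is wanted, one option is to split the names first and expand them into prefixes afterwards. The sketch below shows the idea with a hypothetical helper `split_names` and a handful of toy names. Removing duplicate names before splitting is my own assumption here, not something the notebook does.

```python
import random

def split_names(names, train_size=0.75, seed=7):
    """Shuffle the unique, non-empty names and split them before prefix expansion."""
    unique = sorted(set(n for n in names if n))
    random.Random(seed).shuffle(unique)
    cut = int(len(unique) * train_size)
    return unique[:cut], unique[cut:]

# Toy names, including one duplicate and one empty string
toy_names = ['Bulken', 'Voss', 'Bergen', 'Fjærland', 'Voss', 'Lokkaren', '']
train_names, test_names = split_names(toy_names)
print(len(train_names), len(test_names))
```

The two lists could then be passed separately to `pad_x` to get train and test arrays with no place name shared between them.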


In [28]:
embed_dim = 32  # Embedding size for each token
num_heads = 2  # Number of attention heads
ff_dim = 32  # Hidden layer size in feed forward network inside transformer

inputs = layers.Input(shape=(maxlen,))
embedding_layer = TokenAndPositionEmbedding(maxlen, n_tokens+1, embed_dim)
x = embedding_layer(inputs)
transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
x = transformer_block(x)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dropout(0.1)(x)
x = layers.Dense(200, activation="relu")(x)
x = layers.Dropout(0.1)(x)
outputs = layers.Dense(n_tokens+1, activation="softmax")(x)

model = keras.Model(inputs=inputs, outputs=outputs)

In [29]:
model.compile("adam", "sparse_categorical_crossentropy", metrics=["accuracy"])
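
The loss used here compares the predicted probability distribution over tokens with a single integer target per example. As a rough sketch of what is being computed, the NumPy function below takes the mean negative log-probability of the true token. It leaves out details of the Keras implementation such as clipping the probabilities away from zero, and the numbers are made up.

```python
import numpy as np

def sparse_categorical_crossentropy(probs, targets):
    """Mean negative log-probability assigned to the true token ids."""
    probs = np.asarray(probs, dtype=np.float64)
    targets = np.asarray(targets)
    picked = probs[np.arange(len(targets)), targets]  # probability of each true token
    return float(-np.mean(np.log(picked)))

# Two examples over a three-token vocabulary (made-up probabilities)
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
targets = np.array([0, 1])
loss = sparse_categorical_crossentropy(probs, targets)
print(loss)
```

A model that spreads its probability uniformly over a vocabulary of size V gets a loss of log(V), which is a handy baseline when reading the training log.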

In [30]:
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 53)]              0         
_________________________________________________________________
token_and_position_embedding (None, 53, 32)            5632      
_________________________________________________________________
transformer_block (Transform (None, 53, 32)            10656     
_________________________________________________________________
global_average_pooling1d (Gl (None, 32)                0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 200)               6600      
_________________________________________________________________
dropout_3 (Dropout)          (None, 200)               0     

In [31]:
history = model.fit(X_train, Y_train, batch_size=32, epochs=2, validation_data=(X_test, Y_test))
tfmodelfn='/content/drive/MyDrive/stadnamn/model2.tf'
model.save(tfmodelfn)
tokenfn=tfmodelfn+"/assets/tokens.json"
with open(tokenfn,'w') as fh:
  json.dump(tokens,fh)

Epoch 1/2
Epoch 2/2




INFO:tensorflow:Assets written to: /content/drive/MyDrive/stadnamn/model2.tf/assets




In [1]:
history

history


In [None]:
model=keras.models.load_model(tfmodelfn)

In [None]:
tokenchar=dict([(v,k) for k,v in tokens.items()])

In [None]:
# Storing tokens in model folder for convenience
tokenfn=tfmodelfn+"/assets/tokens.json"

In [None]:
with open(tokenfn,'w') as fh:
  json.dump(tokens,fh)

In [None]:
with open(tokenfn,'r') as fh:
  tokens=json.load(fh)

In [None]:
def X_from_str(txt,tokens,start_stop_chars,maxlen):
  A=[]
  for t in txt:
    X=[tokens[start_stop_chars[0]]]
    for c in t:
      X.append(tokens[c])
    A.append(X)
  return pad_sequences(A,maxlen)

In [None]:
def str_from_X(X,tokenchar):
  wordparts=[]
  for a in X:
    wordparts.append(''.join([tokenchar[i] for i in a[a>0]][1:]))
  return wordparts

In [None]:
str_from_X(X_test[:10],tokenchar)

In [None]:
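def sampler(x,method='standard',temperature=1.0):
  # 'standard' is greedy decoding: always pick the most likely token
  if method=='standard':
    thetoken=np.argmax(x)
  # 'temperature' rescales the log-probabilities, renormalizes and samples
  elif method == 'temperature':
    x = np.log(x) / temperature
    exp_x = np.exp(x)
    x = exp_x / np.sum(exp_x)
    # Scaling just below 1 keeps the sum from exceeding 1 through rounding, which multinomial rejects
    thetoken=np.argmax(np.random.multinomial(1,x*0.99999,1))
  else:
    raise ValueError('Unknown sampling method: '+method)
  return thetoken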
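
To see what the temperature actually does to the predicted distribution, here is a small self-contained sketch. It is a variation on the sampler above rather than a copy of it: the probabilities are cast to float64 and renormalized, which I assume makes the `x*0.99999` scaling unnecessary, and the names `apply_temperature` and `sample_token` are made up for the example.

```python
import numpy as np

def apply_temperature(probs, temperature=1.0):
    """Rescale a probability vector; low temperature sharpens, high flattens."""
    probs = np.asarray(probs, dtype=np.float64)
    with np.errstate(divide='ignore'):  # log(0) gives -inf, which is fine here
        logits = np.log(probs) / temperature
    logits -= logits.max()              # avoid overflow in exp
    scaled = np.exp(logits)
    return scaled / scaled.sum()

def sample_token(probs, temperature=1.0, rng=None):
    """Draw one token index from the temperature-adjusted distribution."""
    if rng is None:
        rng = np.random.default_rng()
    return int(rng.choice(len(probs), p=apply_temperature(probs, temperature)))

# Made-up model output over a four-token vocabulary
probs = np.array([0.6, 0.3, 0.1, 0.0])
for t in (0.2, 1.0, 2.0):
    print(t, np.round(apply_temperature(probs, t), 3))
```

A temperature of 1.0 leaves the distribution unchanged, values below 1 push it towards the greedy choice, and values above 1 make less likely characters more probable, which gives more varied (and stranger) place names.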

In [None]:
def autocomplete(txt,tokens,tokenchar,start_stop_chars,maxlen,samplingmethod='standard',debug=True):
  nextchar=''
  sn=txt[0]
  while nextchar != start_stop_chars[1]:
    output=model.predict(X_from_str([sn],tokens,start_stop_chars,maxlen))[0]
    nextchar=tokenchar[sampler(output,method=samplingmethod)]
    sn+=nextchar
    if debug:
      print(sn)
  sn=sn[:-1]
  return sn

In [None]:
autocomplete(['Liv'],tokens,tokenchar,start_stop_chars,maxlen,samplingmethod='standard',debug=False)

In [None]:
X_from_str(['Bu'],tokens,start_stop_chars,maxlen)

In [1]:
from google.colab import files,drive
drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
import numpy as np
import os
import zipfile
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from collections import Counter
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
import json
np.seterr(divide = 'ignore')
maxlen=53
start_stop_chars=['@','$']

tfmodelfn='/content/drive/MyDrive/stadnamn/model1.tf'
model=keras.models.load_model(tfmodelfn)

tokenfn=tfmodelfn+'/assets/tokens.json'
with open(tokenfn,'r') as fh:
  tokens=json.load(fh)
tokenchar=dict([(v,k) for k,v in tokens.items()])

def X_from_str(txt,tokens,start_stop_chars,maxlen):
  A=[]
  for t in txt:
    X=[tokens[start_stop_chars[0]]]
    for c in t:
      X.append(tokens[c])
    A.append(X)
  return pad_sequences(A,maxlen)

def str_from_X(X,tokenchar):
  wordparts=[]
  for a in X:
    wordparts.append(''.join([tokenchar[i] for i in a[a>0]][1:]))
  return wordparts

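def sampler(x,method='standard',temperature=1.0):
  # 'standard' is greedy decoding: always pick the most likely token
  if method=='standard':
    thetoken=np.argmax(x)
  # 'temperature' rescales the log-probabilities, renormalizes and samples
  elif method == 'temperature':
    x = np.log(x) / temperature
    exp_x = np.exp(x)
    x = exp_x / np.sum(exp_x)
    # Scaling just below 1 keeps the sum from exceeding 1 through rounding, which multinomial rejects
    thetoken=np.argmax(np.random.multinomial(1,x*0.99999,1))
  else:
    raise ValueError('Unknown sampling method: '+method)
  return thetoken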

def autocomplete(txt,tokens,tokenchar,start_stop_chars,maxlen,samplingmethod='standard',temperature=1.0,debug=True):
  placenames=[]
  for sn in txt:
    nextchar=''
    while nextchar != start_stop_chars[1]:
      output=model.predict(X_from_str([sn],tokens,start_stop_chars,maxlen))[0]
      nextchar=tokenchar[sampler(output,method=samplingmethod,temperature=temperature)]
      sn+=nextchar
      if debug:
        print(sn)
    sn=sn[:-1]
    placenames.append(sn)
  return placenames

autocomplete(['Liv'],tokens,tokenchar,start_stop_chars,maxlen,samplingmethod='temperature',temperature=1.0,debug=False)


['Liven']

In [16]:
autocomplete(['Liv','Nordr','Søndre','',''],tokens,tokenchar,start_stop_chars,maxlen,samplingmethod='temperature',temperature=1.0,debug=False)

['Livåtjern',
 'Nordre Golledalen',
 'Søndre Palskløkklet',
 'Vasshellet',
 'Mollageren']
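
Since the fun part is whether the model comes up with new names, a simple follow-up is to check the generated names against the ones it was trained on. The sketch below uses a hypothetical helper `novel_names` and toy lists. In the notebook the known names would presumably come from `data.splitlines()`, and the case-insensitive comparison is my own choice.

```python
def novel_names(generated, known_names):
    """Return the generated names that do not occur in known_names (case-insensitive)."""
    known = {name.strip().lower() for name in known_names}
    return [g for g in generated if g.strip().lower() not in known]

# Toy stand-ins: a few known names and a pretend batch of generated ones
toy_known = ['Bekkevoll', 'Fjærland', 'Lokkaren']
toy_generated = ['Fjærland', 'Livåtjern', 'Mollageren']
print(novel_names(toy_generated, toy_known))
```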