# Fantasy City Name Generation using Machine Learning

Fantasy authors, writers, and game masters constantly invent place names to add flavour and authenticity to the worlds they create. A name can simply be made up by hand, but this takes time, and it can be tricky to find one that sounds right for the culture you want to associate with a location. Plenty of online fantasy name generators tackle this problem, but we can gain more specificity and control by training a neural network on data sets of our own choosing. This notebook implements that approach.

In [2]:
import tensorflow as tf

import pandas as pd
import numpy as np
import os

We will use the 'World Cities Database' by MaxMind as a starting data set. It lists millions of settlements, along with their names, locations, and countries.

In [12]:
big_df = pd.read_csv(r'worldcitiespop.csv')
print(big_df.head(10))


  Country                City          AccentCity Region  Population  \
0      ad               aixas               Aixàs    6.0         NaN   
1      ad          aixirivali          Aixirivali    6.0         NaN   
2      ad          aixirivall          Aixirivall    6.0         NaN   
3      ad           aixirvall           Aixirvall    6.0         NaN   
4      ad            aixovall            Aixovall    6.0         NaN   
5      ad             andorra             Andorra    7.0         NaN   
6      ad    andorra la vella    Andorra la Vella    7.0     20430.0   
7      ad     andorra-vieille     Andorra-Vieille    7.0         NaN   
8      ad             andorre             Andorre    7.0         NaN   
9      ad  andorre-la-vieille  Andorre-la-Vieille    7.0         NaN   

    Latitude  Longitude  
0  42.483333   1.466667  
1  42.466667   1.500000  
2  42.466667   1.500000  
3  42.466667   1.500000  
4  42.466667   1.483333  
5  42.500000   1.516667  
6  42.500000   1.516667  

Assuming we want to generate city names with a certain cultural bias, let's take a look at the unique countries in this data set. 

In [15]:
print(big_df['Country'].unique())

print(len(big_df['Country'].unique()))

['ad' 'ae' 'af' 'ag' 'ai' 'al' 'am' 'an' 'ao' 'ar' 'at' 'au' 'aw' 'az'
 'ba' 'bb' 'bd' 'be' 'bf' 'bg' 'bh' 'bi' 'bj' 'bm' 'bn' 'bo' 'br' 'bs'
 'bt' 'bw' 'by' 'bz' 'ca' 'cc' 'cd' 'cf' 'cg' 'ch' 'ci' 'ck' 'cl' 'cm'
 'cn' 'co' 'cr' 'cu' 'cv' 'cx' 'cy' 'cz' 'de' 'dj' 'dk' 'dm' 'do' 'dz'
 'ec' 'ee' 'eg' 'eh' 'er' 'es' 'et' 'fi' 'fj' 'fk' 'fm' 'fo' 'fr' 'ga'
 'gb' 'gd' 'ge' 'gf' 'gg' 'gh' 'gi' 'gl' 'gm' 'gn' 'gp' 'gq' 'gr' 'gs'
 'gt' 'gw' 'gy' 'hk' 'hn' 'hr' 'ht' 'hu' 'id' 'ie' 'il' 'im' 'in' 'iq'
 'ir' 'is' 'it' 'je' 'jm' 'jo' 'jp' 'ke' 'kg' 'kh' 'ki' 'km' 'kn' 'kp'
 'kr' 'kw' 'ky' 'kz' 'la' 'lb' 'lc' 'li' 'lk' 'lr' 'ls' 'lt' 'lu' 'lv'
 'ly' 'ma' 'mc' 'md' 'me' 'mg' 'mh' 'mk' 'ml' 'mm' 'mn' 'mo' 'mp' 'mq'
 'mr' 'ms' 'mt' 'mu' 'mv' 'mw' 'mx' 'my' 'mz' 'na' 'nc' 'ne' 'nf' 'ng'
 'ni' 'nl' 'no' 'np' 'nr' 'nu' 'nz' 'om' 'pa' 'pe' 'pf' 'pg' 'ph' 'pk'
 'pl' 'pm' 'pn' 'ps' 'pt' 'pw' 'py' 'qa' 're' 'ro' 'rs' 'ru' 'rw' 'sa'
 'sb' 'sc' 'sd' 'se' 'sg' 'sh' 'si' 'sj' 'sk' 'sl' 'sm' 'sn' 'so' 'sr'
 'st' 

These countries/regions are represented by their ISO 3166-1 alpha-2 country codes.

Let's use Irish settlements to generate new city names. This makes sense for me: being from Ireland, I should be able to get a 'feel' for which generated names sound plausibly Irish and which don't.

In [16]:
# Let's try Ireland

df = big_df[(big_df['Country'] == 'ie')]
df.head()

Unnamed: 0,Country,City,AccentCity,Region,Population,Latitude,Longitude
1308871,ie,aasleagh,Aasleagh,10,,53.616667,-9.666667
1308872,ie,abbevill,Abbevill,4,,51.966667,-8.583333
1308873,ie,abbeville,Abbeville,4,,51.966667,-8.583333
1308874,ie,abbey,Abbey,10,,53.103056,-8.392222
1308875,ie,abbeydorney,Abbeydorney,11,,52.35,-9.683333


Let's get rid of any redundancies in the data set by removing repeated entries:

In [17]:
# Activate these lines to only include settlements with population data
#df['Population'].replace('', np.nan, inplace=True)
#df.dropna(subset=['Population'], inplace=True)

# Drops out settlements with the same coordinates, which are assumed to be duplicates

survivor = (df['Latitude'].iloc[0], df['Longitude'].iloc[0])
for index, rows in df[1:].iterrows(): 
    if survivor == (rows['Latitude'], rows['Longitude']):
        df = df.drop(index)
    else:
        survivor = (rows['Latitude'], rows['Longitude'])

print(len(df.index))

8929
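The de-duplication rule above can be sketched without pandas. This is a minimal illustration, not the notebook's implementation: it keeps the first row of every run of neighbouring rows that share the same coordinates. The helper name `drop_consecutive_duplicates` is made up for this example, and the rows are copied from the `df.head()` output above.

```python
from itertools import groupby

def drop_consecutive_duplicates(rows):
    """Keep the first row of each run of rows sharing the same (lat, lon)."""
    return [next(group) for _, group in groupby(rows, key=lambda r: (r[1], r[2]))]

# (name, latitude, longitude) rows taken from the df.head() output above
rows = [
    ("abbevill", 51.966667, -8.583333),
    ("abbeville", 51.966667, -8.583333),  # same coordinates as the previous row
    ("abbey", 53.103056, -8.392222),
    ("abbeydorney", 52.35, -9.683333),
]

kept = drop_consecutive_duplicates(rows)
print([name for name, _, _ in kept])  # ['abbevill', 'abbey', 'abbeydorney']
```

Like the loop above, this only compares neighbouring rows, so two settlements with identical coordinates that are not adjacent in the table would both survive.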


Roughly 9,000 entries isn't a particularly large data set, but let's see where it takes us.

Below are some random examples of names from this dataset:

In [18]:
import random as ran

size = len(df.index)

for i in range(10):
    r = ran.random()
    random_index = int(r*size)
    print(df['AccentCity'].iloc[random_index])

    

Lisbabe
Aharney
Cloonlahard Bridge
Drumasladdy
Kilcully
Drumreilly
Kells
Meenanalbany
Kinaff Bridge
Keelties


# Building the Model

We need to:
* Vectorise the data
* Build a Recurrent Neural Network (RNN)

In [66]:
# First, let's create a long string consisting of all the characters in the city names

character_data = ''

# Let's attach cities together randomly and separate them by a marker. We'll choose '/'

size = len(df.index)
ranges = list(range(size))
ran.shuffle(ranges)

for index in ranges:
    character_data += df['AccentCity'].iloc[index] + '/'

# Let's get rid of capital letters:
character_data = character_data.lower()

# And numbers:
for char in character_data:
    try:
        intchar = int(char)
        character_data = character_data.replace(char, '')
    except ValueError:
        continue

# And full stops:
character_data = character_data.replace('.','')

character_set = sorted(set(character_data))

print(character_data)
print(character_set)
print(len(character_set))

'''
Expecting:
[' ', "'", '-', '.', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'à', 'á', 'ç', 'è', 'é', 'ì', 'í', 'ï', 'ñ', 'ò', 'ó', 'ú', 'ü']
'''


aleskirt/akören/saraç/bakirli ciftligi/kërkan/cimenbag/kirazli/ortakent/düver/küçükoba/kapikargin/hazimoglu/kemkez/abaciliguzeli/bakimli/hociosmandamlari/küçükler/kuvançari/çatmaoluk/finikigeli/çobanhasan/zengin-kharabesi/kayacikoba/maramül/degisoren/tektur dagh/susuz/oruçbey/nurullah/uzunpinar/kirkveran/sandarli/asagiilica/helda/köprüören/sulakli/evcik/ildere/türkali/ciheli/barak/çanköy/balakyan/kurunca/koncaköy/güzlekyurdu/cagbasi/bodra/yüksekyayla/sagirkoy/büyüknefes/tazni/akçakoyunlu/evliyali/taslik/kilic/cakalipinar/tinaztepe/bozmus/yazir/saphane/say/bozyigit/ispindaruk/balandiz/icmeli/dutagaci/devecipinar/kabeliya/cengeller/ikizdere/dagoglu/harbelus/güvenir/marmaris/beskuyku/alibonco/yalak/köseli/ispekcir/degirmendere kasabasi/çetmiköy/almaçukuru/karacaören/elmaci/agackoy/hacimusa/sengil/sucati/kirgulu/bük/tahna/yukari gorede/gumusdere/kizilcakoy/hacibey/yamaç/zeve/kopruce ciftligi/pigmetas/yukaridaglica/calikbey/güleç/yüceotak/avcili/eziler/muratbagi/kucukkagdaric/ucagiz/kangirl
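The cleaning steps in the cell above can be condensed into one helper. The sketch below is an assumption about how one might package them, not the notebook's own code: `build_corpus` is a hypothetical name, and "St. Johnstown" and the trailing "2" are invented inputs chosen to exercise the full-stop and digit rules.

```python
def build_corpus(names, separator="/"):
    """Join names with a trailing separator, lower-case, drop digits and full stops."""
    text = "".join(name + separator for name in names).lower()
    # str.isdecimal() matches the characters that int(char) would accept
    return "".join(ch for ch in text if not ch.isdecimal() and ch != ".")

corpus = build_corpus(["Kells", "St. Johnstown", "Kinaff Bridge 2"])
print(corpus)               # kells/st johnstown/kinaff bridge /
print(sorted(set(corpus)))  # the character set the model will see
```

Note that removing a digit can leave a stray space before the separator, as in the last name; the notebook's loop behaves the same way.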


In [67]:
# Let's do the simple stuff first. Let's create a dictionary of characters
# Want to create a mapping from characters to unique indices

char2idx = {c:i for i, c in enumerate(character_set)}

idx2char = np.array(character_set)

print(char2idx)
print(idx2char)
print(len(char2idx))

{' ': 0, '"': 1, "'": 2, '-': 3, '/': 4, '[': 5, ']': 6, '`': 7, 'a': 8, 'b': 9, 'c': 10, 'd': 11, 'e': 12, 'f': 13, 'g': 14, 'h': 15, 'i': 16, 'j': 17, 'k': 18, 'l': 19, 'm': 20, 'n': 21, 'o': 22, 'p': 23, 'q': 24, 'r': 25, 's': 26, 't': 27, 'u': 28, 'v': 29, 'w': 30, 'x': 31, 'y': 32, 'z': 33, '¿': 34, 'â': 35, 'ç': 36, 'é': 37, 'ë': 38, 'î': 39, 'ï': 40, 'ó': 41, 'ö': 42, 'ú': 43, 'û': 44, 'ü': 45}
[' ' '"' "'" '-' '/' '[' ']' '`' 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j'
 'k' 'l' 'm' 'n' 'o' 'p' 'q' 'r' 's' 't' 'u' 'v' 'w' 'x' 'y' 'z' '¿' 'â'
 'ç' 'é' 'ë' 'î' 'ï' 'ó' 'ö' 'ú' 'û' 'ü']
46
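As a quick check of the two mappings, the sketch below encodes a name to indices and decodes it again. It uses a tiny made-up corpus instead of the real `character_data`, so the index values differ from those printed above.

```python
corpus = "kells/abbey/"  # toy stand-in for character_data
character_set = sorted(set(corpus))

char2idx = {c: i for i, c in enumerate(character_set)}
idx2char = character_set  # position in the list is the index

encoded = [char2idx[c] for c in "abbey/"]
decoded = "".join(idx2char[i] for i in encoded)

print(character_set)  # ['/', 'a', 'b', 'e', 'k', 'l', 's', 'y']
print(encoded)        # [1, 2, 2, 3, 7, 0]
print(decoded)        # abbey/
```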


In [68]:
chars = tf.strings.unicode_split(character_data, input_encoding='UTF-8')
chars


<tf.Tensor: shape=(612911,), dtype=string, numpy=array([b'a', b'l', b'e', ..., b'y', b'\xc3\xbc', b'/'], dtype=object)>

In [69]:
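# StringLookup reserves index 0 for '[UNK]', so the IDs it returns start at 1.
# We subtract 1 after every lookup so that the IDs start at 0.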

ids_from_chars = tf.keras.layers.StringLookup(
    vocabulary=list(character_set), mask_token=None)

In [70]:
ids = tf.add( ids_from_chars(chars) , -1)
print(ids)
print(ids_from_chars.get_vocabulary())

tf.Tensor([ 8 19 12 ... 32 45  4], shape=(612911,), dtype=int64)
['[UNK]', ' ', '"', "'", '-', '/', '[', ']', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '¿', 'â', 'ç', 'é', 'ë', 'î', 'ï', 'ó', 'ö', 'ú', 'û', 'ü']
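The index shift can be imitated in plain Python. This is only a stand-in for `StringLookup` to show the arithmetic, assuming a single out-of-vocabulary slot at index 0 as in the vocabulary printed above; the two function names mirror the notebook's layers but are ordinary functions here.

```python
vocabulary = ["[UNK]"] + sorted(set("kells/abbey/"))  # index 0 is reserved

def ids_from_chars(chars):
    """Look up each character (unknown -> 0), then shift by -1."""
    lookup = {c: i for i, c in enumerate(vocabulary)}
    return [lookup.get(c, 0) - 1 for c in chars]

def chars_from_ids(ids):
    """Undo the shift by adding 1 before indexing into the vocabulary."""
    return [vocabulary[i + 1] for i in ids]

ids = ids_from_chars("abbey")
print(ids)                  # [1, 2, 2, 3, 7]
print(chars_from_ids(ids))  # ['a', 'b', 'b', 'e', 'y']
print(ids_from_chars("z"))  # [-1]: a character outside the vocabulary
```

With the shift in place, the known characters occupy 0 to `len(character_set) - 1`, and any unseen character would come out as -1.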


In [71]:
def split_input_target(sequence):
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text
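To see what this split produces, here it is applied to a single short name. The function body is the one defined above; the input `"kells/"` is just an illustrative example.

```python
def split_input_target(sequence):
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text

inputs, targets = split_input_target(list("kells/"))
print(inputs)   # ['k', 'e', 'l', 'l', 's']
print(targets)  # ['e', 'l', 'l', 's', '/']

# At every position the target is simply the next character of the input
for current_char, next_char in zip(inputs, targets):
    print(f"{current_char!r} -> {next_char!r}")
```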

In [72]:
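# The inverse lookup expects the original StringLookup IDs,
# so we add 1 back to our shifted IDs before every call.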

chars_from_ids = tf.keras.layers.StringLookup(
    vocabulary=ids_from_chars.get_vocabulary(), invert=True, mask_token=None)

In [73]:
chars = chars_from_ids(tf.add( ids , +1 ))
chars

<tf.Tensor: shape=(612911,), dtype=string, numpy=array([b'a', b'l', b'e', ..., b'y', b'\xc3\xbc', b'/'], dtype=object)>

In [74]:
def text_from_ids(ids):
  return tf.strings.reduce_join(chars_from_ids(tf.add(ids, +1)), axis=-1)

In [75]:
all_ids = tf.add(ids_from_chars(tf.strings.unicode_split(character_data, 'UTF-8')), -1)
all_ids

<tf.Tensor: shape=(612911,), dtype=int64, numpy=array([ 8, 19, 12, ..., 32, 45,  4], dtype=int64)>

In [76]:
ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)

In [77]:
for ids in ids_dataset.take(10):
    print(chars_from_ids(tf.add(ids, +1)).numpy().decode('utf-8'))

a
l
e
s
k
i
r
t
/
a


In [78]:
seq_length = 100

In [79]:
sequences = ids_dataset.batch(seq_length+1, drop_remainder=True)

for seq in sequences.take(1):
  print(chars_from_ids(tf.add(seq,+1)))

tf.Tensor(
[b'a' b'l' b'e' b's' b'k' b'i' b'r' b't' b'/' b'a' b'k' b'\xc3\xb6' b'r'
 b'e' b'n' b'/' b's' b'a' b'r' b'a' b'\xc3\xa7' b'/' b'b' b'a' b'k' b'i'
 b'r' b'l' b'i' b' ' b'c' b'i' b'f' b't' b'l' b'i' b'g' b'i' b'/' b'k'
 b'\xc3\xab' b'r' b'k' b'a' b'n' b'/' b'c' b'i' b'm' b'e' b'n' b'b' b'a'
 b'g' b'/' b'k' b'i' b'r' b'a' b'z' b'l' b'i' b'/' b'o' b'r' b't' b'a'
 b'k' b'e' b'n' b't' b'/' b'd' b'\xc3\xbc' b'v' b'e' b'r' b'/' b'k'
 b'\xc3\xbc' b'\xc3\xa7' b'\xc3\xbc' b'k' b'o' b'b' b'a' b'/' b'k' b'a'
 b'p' b'i' b'k' b'a' b'r' b'g' b'i' b'n' b'/' b'h' b'a' b'z'], shape=(101,), dtype=string)


In [80]:
for seq in sequences.take(5):
  print(text_from_ids(seq).numpy())

b'aleskirt/ak\xc3\xb6ren/sara\xc3\xa7/bakirli ciftligi/k\xc3\xabrkan/cimenbag/kirazli/ortakent/d\xc3\xbcver/k\xc3\xbc\xc3\xa7\xc3\xbckoba/kapikargin/haz'
b'imoglu/kemkez/abaciliguzeli/bakimli/hociosmandamlari/k\xc3\xbc\xc3\xa7\xc3\xbckler/kuvan\xc3\xa7ari/\xc3\xa7atmaoluk/finikigeli/\xc3\xa7obanhas'
b'an/zengin-kharabesi/kayacikoba/maram\xc3\xbcl/degisoren/tektur dagh/susuz/oru\xc3\xa7bey/nurullah/uzunpinar/kirkver'
b'an/sandarli/asagiilica/helda/k\xc3\xb6pr\xc3\xbc\xc3\xb6ren/sulakli/evcik/ildere/t\xc3\xbcrkali/ciheli/barak/\xc3\xa7ank\xc3\xb6y/balakyan/kuru'
b'nca/koncak\xc3\xb6y/g\xc3\xbczlekyurdu/cagbasi/bodra/y\xc3\xbcksekyayla/sagirkoy/b\xc3\xbcy\xc3\xbcknefes/tazni/ak\xc3\xa7akoyunlu/evliyali/tas'
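The chunking done by `batch(seq_length+1, drop_remainder=True)` can be sketched in plain Python. The helper `batch_drop_remainder` is a hypothetical name for this illustration; the character count 612911 is taken from the tensor shape printed earlier.

```python
def batch_drop_remainder(items, size):
    """Split items into consecutive chunks of exactly `size`; discard any leftover."""
    n_full = len(items) // size
    return [items[i * size:(i + 1) * size] for i in range(n_full)]

print(batch_drop_remainder(list(range(10)), 4))  # [[0, 1, 2, 3], [4, 5, 6, 7]]

# For the run shown above: 612911 characters, chunks of seq_length + 1 = 101
n_chars, chunk = 612911, 101
print(n_chars // chunk)  # 6068 full sequences
print(n_chars % chunk)   # 43 characters dropped at the end
```

Because the chunks are cut at fixed positions, names are split across sequence boundaries, which is visible in the output above.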


In [82]:
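# StringLookup returns indices 1..vocab_size (0 is '[UNK]'), but the model expects 0..vocab_size-1.
# The -1 shift applied when building all_ids takes care of this.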

dataset = sequences.map(split_input_target)

In [83]:
for input_example, target_example in dataset.take(1):
    print("Input :", text_from_ids(input_example).numpy())
    print("Target:", text_from_ids(target_example).numpy())
    print('\n')
    print(input_example)
    print(target_example)

Input : b'aleskirt/ak\xc3\xb6ren/sara\xc3\xa7/bakirli ciftligi/k\xc3\xabrkan/cimenbag/kirazli/ortakent/d\xc3\xbcver/k\xc3\xbc\xc3\xa7\xc3\xbckoba/kapikargin/ha'
Target: b'leskirt/ak\xc3\xb6ren/sara\xc3\xa7/bakirli ciftligi/k\xc3\xabrkan/cimenbag/kirazli/ortakent/d\xc3\xbcver/k\xc3\xbc\xc3\xa7\xc3\xbckoba/kapikargin/haz'


tf.Tensor(
[ 8 19 12 26 18 16 25 27  4  8 18 42 25 12 21  4 26  8 25  8 36  4  9  8
 18 16 25 19 16  0 10 16 13 27 19 16 14 16  4 18 38 25 18  8 21  4 10 16
 20 12 21  9  8 14  4 18 16 25  8 33 19 16  4 22 25 27  8 18 12 21 27  4
 11 45 29 12 25  4 18 45 36 45 18 22  9  8  4 18  8 23 16 18  8 25 14 16
 21  4 15  8], shape=(100,), dtype=int64)
tf.Tensor(
[19 12 26 18 16 25 27  4  8 18 42 25 12 21  4 26  8 25  8 36  4  9  8 18
 16 25 19 16  0 10 16 13 27 19 16 14 16  4 18 38 25 18  8 21  4 10 16 20
 12 21  9  8 14  4 18 16 25  8 33 19 16  4 22 25 27  8 18 12 21 27  4 11
 45 29 12 25  4 18 45 36 45 18 22  9  8  4 18  8 23 16 18  8 25 14 16 21
  4 15  8 33], shape=(100,), dtype=int64)

In [84]:
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = (
    dataset
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.AUTOTUNE))

dataset

<_PrefetchDataset element_spec=(TensorSpec(shape=(64, 100), dtype=tf.int64, name=None), TensorSpec(shape=(64, 100), dtype=tf.int64, name=None))>
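The buffered shuffle described in the comment above can be sketched as a generator. This is a simplified model of the idea rather than TensorFlow's implementation, and `buffered_shuffle` is a hypothetical helper: it fills a fixed-size buffer, then repeatedly emits a random element and puts the next incoming item in its place.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Shuffle a stream using at most `buffer_size` (>= 1) items of memory."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered element
        buffer[idx] = item       # the new item takes its slot
    rng.shuffle(buffer)          # stream exhausted: drain what is left
    yield from buffer

print(list(buffered_shuffle(range(10), buffer_size=3, seed=0)))
print(list(buffered_shuffle(range(5), buffer_size=1)))  # [0, 1, 2, 3, 4]
```

A small buffer only mixes elements locally, and a buffer of size 1 leaves the order unchanged. In the run shown here, 612,911 characters give 6,068 sequences, fewer than `BUFFER_SIZE = 10000`, so the buffer can hold every sequence and the shuffle covers the whole dataset.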

In [85]:
# Number of distinct characters (this excludes StringLookup's '[UNK]' token)
vocab_size = len(character_set)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

In [86]:
# https://www.tensorflow.org/text/tutorials/text_generation

class MyModel(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_units):
    super().__init__()
    self.embedding = tf.keras.layers.Embedding(vocab_size + 1, embedding_dim) # vocab_size -> vocab_size + 1
    self.gru = tf.keras.layers.GRU(rnn_units,
                                   return_sequences=True,
                                   return_state=True)
    self.dense = tf.keras.layers.Dense(vocab_size)

  def call(self, inputs, states=None, return_state=False, training=False):
    x = inputs
    x = self.embedding(x, training=training)
    if states is None:
      states = self.gru.get_initial_state(x)
    x, states = self.gru(x, initial_state=states, training=training)
    x = self.dense(x, training=training)

    if return_state:
      return x, states
    else:
      return x

In [87]:
model = MyModel(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    rnn_units=rnn_units)

In [88]:
for input_example_batch, target_example_batch in dataset.take(1):
    print(input_example_batch)
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

tf.Tensor(
[[25 19 16 ...  8 14  4]
 [ 4 12 26 ... 12 18  4]
 [19 16  4 ...  8 21  4]
 ...
 [12 21  4 ... 32  4 18]
 [22 25  4 ...  4 18 45]
 [10  8 18 ... 20 28 18]], shape=(64, 100), dtype=int64)
(64, 100, 46) # (batch_size, sequence_length, vocab_size)


In [89]:
model.summary()

Model: "my_model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     multiple                  12032     
                                                                 
 gru_1 (GRU)                 multiple                  3938304   
                                                                 
 dense_1 (Dense)             multiple                  47150     
                                                                 
Total params: 3,997,486
Trainable params: 3,997,486
Non-trainable params: 0
_________________________________________________________________
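The parameter counts in the summary can be reproduced by hand. The sketch below assumes the Keras GRU default `reset_after=True`, which gives each of the three gates an input kernel, a recurrent kernel, and two bias vectors; under that assumption the arithmetic matches the summary exactly.

```python
vocab_size = 46      # len(character_set) in the run shown above
embedding_dim = 256
rnn_units = 1024

# Embedding: one row per index; the model allocates vocab_size + 1 rows
embedding_params = (vocab_size + 1) * embedding_dim

# GRU: 3 gates, each with input kernel, recurrent kernel and two biases
gru_params = 3 * (embedding_dim * rnn_units + rnn_units * rnn_units + 2 * rnn_units)

# Dense: weight matrix plus one bias per output character
dense_params = rnn_units * vocab_size + vocab_size

total = embedding_params + gru_params + dense_params
print(embedding_params)  # 12032
print(gru_params)        # 3938304
print(dense_params)      # 47150
print(total)             # 3997486
```

Almost all of the parameters sit in the GRU, so `rnn_units` is the setting that most affects model size.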


In [90]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()

In [91]:
sampled_indices

array([11, 37, 14, 29, 13,  6, 18, 14, 30, 34, 40,  2, 22, 29, 36, 36,  2,
       39, 41, 22, 10, 19, 41,  0, 13, 44,  4,  9,  1,  6, 37, 32, 18, 10,
       21, 23, 41,  4, 30, 30, 10, 30,  2, 23,  1, 10,  0, 40,  6, 39, 24,
       21,  0,  8, 28, 38, 29, 38, 32, 15, 45,  1, 31, 21, 10,  4, 36, 26,
       13,  0,  1,  1, 18, 37, 21, 30, 42,  1, 30, 45,  4, 31, 24, 34,  6,
       32,  0, 43, 13,  6, 19, 18, 12, 16, 13, 27,  2, 31, 11, 43],
      dtype=int64)

In [92]:
print("Input:\n", text_from_ids(input_example_batch[0]).numpy())
print()
print("Next Char Predictions:\n", text_from_ids(sampled_indices).numpy())

Input:
 b'rli/sefak\xc3\xb6y/meryemusagi/akbugday/imamsati/kandak/cakir/kirhislar/buyukkiran/gerde/kazancik/hisardag/'

Next Char Predictions:
 b'd\xc3\xa9gvf]kgw\xc2\xbf\xc3\xaf\'ov\xc3\xa7\xc3\xa7\'\xc3\xae\xc3\xb3ocl\xc3\xb3 f\xc3\xbb/b"]\xc3\xa9ykcnp\xc3\xb3/wwcw\'p"c \xc3\xaf]\xc3\xaeqn au\xc3\xabv\xc3\xabyh\xc3\xbc"xnc/\xc3\xa7sf ""k\xc3\xa9nw\xc3\xb6"w\xc3\xbc/xq\xc2\xbf]y \xc3\xbaf]lkeift\'xd\xc3\xba'


In [93]:
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)

In [94]:
target_example_batch


<tf.Tensor: shape=(64, 100), dtype=int64, numpy=
array([[19, 16,  4, ..., 14,  4, 20],
       [12, 26, 18, ..., 18,  4, 12],
       [16,  4,  9, ..., 21,  4,  8],
       ...,
       [21,  4, 11, ...,  4, 18, 32],
       [25,  4, 11, ..., 18, 45, 25],
       [ 8, 18,  8, ..., 28, 18, 27]], dtype=int64)>

In [95]:
example_batch_mean_loss = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("Mean loss:        ", example_batch_mean_loss)

Prediction shape:  (64, 100, 46)  # (batch_size, sequence_length, vocab_size)
Mean loss:         tf.Tensor(3.8287084, shape=(), dtype=float32)
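This initial loss is a useful sanity check. An untrained model should spread its probability roughly evenly over the vocabulary, in which case the cross-entropy is close to the natural logarithm of the vocabulary size. The short calculation below compares that value with the loss printed above.

```python
import math

vocab_size = 46             # last dimension of the prediction shape above
observed_loss = 3.8287084   # mean loss printed above

uniform_loss = math.log(vocab_size)  # loss of a perfectly uniform prediction
print(uniform_loss)
print(math.exp(observed_loss))       # should be close to vocab_size

print(abs(uniform_loss - observed_loss) < 1e-3)  # True
```

A starting loss far above this value would suggest that the freshly initialised model is confidently wrong, which is usually a sign of a problem with the setup.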


In [96]:
model.compile(optimizer='adam', loss=loss)

In [97]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

In [98]:
EPOCHS = 10

In [99]:
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [100]:
class OneStep(tf.keras.Model):
  def __init__(self, model, chars_from_ids, ids_from_chars, temperature=1.0):
    super().__init__()
    self.temperature = temperature
    self.model = model
    self.chars_from_ids = chars_from_ids
    self.ids_from_chars = ids_from_chars

    # Create a mask to prevent "[UNK]" from being generated.
    skip_ids = self.ids_from_chars(['[UNK]'])[:, None]
    sparse_mask = tf.SparseTensor(
        # Put a -inf at each bad index.
        values=[-float('inf')]*len(skip_ids),
        indices=skip_ids,
        # Match the shape to the vocabulary
        dense_shape=[len(ids_from_chars.get_vocabulary())])
    self.prediction_mask = tf.sparse.to_dense(sparse_mask)

  @tf.function
  def generate_one_step(self, inputs, states=None):
    # Convert strings to token IDs.
    input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
    input_ids = self.ids_from_chars(input_chars).to_tensor()
    input_ids = tf.add(input_ids, -1)

    # Run the model.
    # predicted_logits.shape is [batch, char, next_char_logits]
    predicted_logits, states = self.model(inputs=input_ids, states=states,
                                          return_state=True)
    # Only use the last prediction.
    predicted_logits = predicted_logits[:, -1, :]
    predicted_logits = predicted_logits/self.temperature
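    # The "[UNK]" prediction mask is deliberately not applied: after the -1 shift
    # the model has no output slot for "[UNK]", and the mask (one entry longer
    # than the logits) would not broadcast against them.
    # predicted_logits = predicted_logits + self.prediction_mask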

    # Sample the output logits to generate token IDs.
    predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
    predicted_ids = tf.squeeze(predicted_ids, axis=-1)
    predicted_ids = tf.add(predicted_ids, +1)

    # Convert from token ids to characters
    predicted_chars = self.chars_from_ids(predicted_ids)

    # Return the characters and model state.
    return predicted_chars, states
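The effect of `temperature` in `generate_one_step` can be illustrated without TensorFlow. The sketch below is a plain-Python approximation of "divide the logits by the temperature, then sample from the softmax"; the function names and the example logits are made up for this illustration.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

def sample_id(logits, temperature=1.0, rng=random):
    """Draw one index with probability given by the tempered softmax."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three characters
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

print(sample_id(logits, temperature=0.5, rng=random.Random(0)))
```

Temperatures below 1 sharpen the distribution, so the generated names stay closer to common patterns in the training data; temperatures above 1 flatten it and produce more unusual, and more error-prone, names.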

In [101]:
one_step_model = OneStep(model, chars_from_ids, ids_from_chars)

In [102]:
import time

start = time.time()
states = None
next_char = tf.constant(['x'])
result = [next_char]

for n in range(1000):
  next_char, states = one_step_model.generate_one_step(next_char, states=states)
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result[0].numpy().decode('utf-8'))
print('\nRun time:', end - start)

xüslümledi/yazikoy/demirkapi/fethikpislasi/kormurumsar yaylasi/kandillar/güngören/dotukhoy/güzeldere/bayra/gunegana/yamaören/sogukbazi/yukari zulluaga/cabursar/ilsiz/arazlan/yesildar/durmaliova/benlikkeny/merk/eydir/ulutarla/yenikaya/bedören/tasibi/khanyako/gömet/sogutlubelikerik/alkum/dedesürle/kutludere/hacibey/duranlar/kurtbakada/karaca/buyukcayat/ishada/veydibelek/çyultyayla/özmertik/kërt/ardiccin/agus/esenyurt/cive/germi/asarcikoy/kocahisar/yumrutaso/tahtakisla koy/karacaatli/mezrü köy/cisanlar/hoyuk/horis/küçüklü/karaalip/balayli/düznaba/kocacasu/kaynarca/dirno/sindaba/apan/celilli yaylasi/alakir/cibaklidede/kozluca/sehlidik/mamiz/kavakli ciftligi/surna köy/kusca/akören/aydinlar/karatopak/kayracik/sarikartil/hatipli/calislar/yavuzcak alti/nehirbekirler/asvan/yeri/tunsar/sehlam/eminyu köyü/bolubasi/buzyuz/yakalar/ariz/karnucak/kilumlâ/büyük tat/gözler/tilmis/kirkköy/aldirpinar/kurtalicak/nazirli/divan/agcakavak/kabasakallar/hasanli/cahiralan/kemer/asagicirattin/karakaya koyu/sargü

In [103]:
def predict_cities(seed,output_length):
    states = None
    next_char = tf.constant([seed])
    result = [next_char]

    for n in range(output_length):
        next_char, states = one_step_model.generate_one_step(next_char, states=states)
        result.append(next_char)

    result = tf.strings.join(result)

    return result[0].numpy().decode('utf-8').split('/')
 

In [105]:
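# Estimate performance at producing new city names

pred_cities = predict_cities('d',1000)
L = len(pred_cities)

# Look up known names in a set. Note that 'City' appears to hold unaccented
# names (compare 'aixas' with 'Aixàs' above), so generated names that
# contain diacritics may never count as matches.
known_cities = set(df['City'])

# Build a filtered list instead of calling remove() while iterating,
# which skips the element that follows each removed one
new_cities = [city for city in pred_cities if city not in known_cities]
count = L - len(new_cities)
pred_cities = new_cities

print(count/L)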

# redundancy ranges between 9-28%

for city in pred_cities:
    print(city, '\n')


0.1391304347826087
dibi 

kerenler 

yukari zavgalli 

gurabasi 

yellituzdak 

kilavuztey 

harabeköy 

kababut 

güzelyurt 

vakikoy 

körecik 

velice 

yazalti 

asagi hamit 

coksan 

ettekesen 

sarlicak 

boyunkiran 

üfse 

davaci 

dirmisli 

kelent 

sekbolboyu 

mandinca 

garis 

titabayakale 

yenicesidere 

türkmen 

kilincik 

capimler 

karapinar tahur 

karayalci 

karaçaç 

abdil 

bilâlli 

yaylaköyü 

lihdat 

sarapil 

güntünören 

çukurkaya 

yukari damozlar 

yukari karabelen 

körükler 

fahrubasi 

kamargazi 

sevsat 

enmen 

badakaören 

domaçi 

arslan 

alekbasi 

acikaya 

berykent 

torhudremi 

micirgâhi 

çakören 

kisilkaya 

fera 

armutcukoyu 

ergunak 

sindoglu 

tok 

kozhida 

kermesik 

bireketü 

borallar 

baniri 

fellac 

tatlica 

alanci 

gokce kozan 

aliç 

dürgetli 

mamac 

tahuris 

pazirkesir 

sogancik 

hocukesin 

sugurlar 

seremlik 

altindara 

karaçadi 

mürktepe 

cerebi 

bahçe 

tohular 

ibdan 

kestona 

güzli 

uzbaskoy 

In [50]:
# It seems that with this model we can't produce city names derived from combinations of different cultures.
# In the Germany-Vietnam example the cities look either distinctly German or distinctly Vietnamese