In [None]:
"""
Choose a language model that will best represent the input text.
Clean and prepare the data for training.
Build a basic Keras sequential neural network model.
Apply a recurrent neural network (RNN) to process character sequences.
Generate 3-channel RGB color outputs.
"""

"""
Language Model
There are two general options for language modeling: word-level models and character-level models. Each has its advantages and disadvantages. Let’s go through them now.
"""

"""
Word-level Language Model
The word-level language model can handle relatively long and clean sentences. By “clean”, I mean that the text is free of typos and contains few words 
outside the English vocabulary. The word-level language model encodes each unique word as an integer, using a predefined, fixed-size vocabulary 
to look up the word-to-integer mapping. One major benefit of the word-level language model is its ability to leverage pre-trained word embeddings such as Word2Vec 
or GloVe. These embeddings represent words as vectors with useful properties. Words close in context are close in Euclidean distance, and the vectors can be used to capture analogies 
like "man is to woman as king is to queen". Using these ideas, you can train a word-level model with relatively small labeled training sets.
"""
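
"""
The word-to-integer lookup described above can be sketched in a few lines of plain Python. This is only an illustration, not part of the color model: 
the helper names build_vocab and encode, the "<pad>" and "<oov>" entries, and the sample phrases are all assumptions made for this example. 
Note how a word missing from the vocabulary collapses to the single out-of-vocabulary ID.
"""

```python
# Illustrative word-level encoding; helper names and sample phrases are hypothetical.
def build_vocab(sentences):
    """Map each unique lowercase word to an integer; 0 is padding, 1 is out-of-vocabulary."""
    vocab = {"<pad>": 0, "<oov>": 1}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Replace each word with its integer ID, falling back to the OOV ID."""
    return [vocab.get(word, vocab["<oov>"]) for word in sentence.lower().split()]


vocab = build_vocab(["light sky blue", "dark sea green"])
known = encode("dark blue", vocab)            # both words are in the vocabulary
unknown = encode("chartreuse green", vocab)   # "chartreuse" was never seen
print(known, unknown)
```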

"""
Character-level Language Model

But there’s an even simpler language model, one that splits a text string into characters and assigns a unique integer to every character. There are some reasons 
you might choose the character-level language model over the more popular word-level model:

- Your text datasets contain a noticeable number of out-of-vocabulary or infrequent words. In our case, legitimate color names include “aquatone”, “chartreuse” and 
“fuchsia”. I have to check a dictionary to find out their meanings, and traditional word-level embeddings may not contain them.
- The majority of the text strings are short, bounded-length strings. If you’re looking for a specific length limit, I’ve worked on a Yelp review generation model that 
encodes text at the character level with a sequence length of 60 and still gets decent results. You can find that blog post here: How to generate realistic yelp restaurant reviews with Keras.
- Usually, a character-level generation model can create text with more variety, since its imagination is not constrained by a predefined vocabulary.

You should also be aware of the limitations that come with a character-level model:

- Long sequences may not capture long-range dependencies as well as word-level language models.
- Character-level models are also more computationally expensive to train: given the same text dataset, the sequences are longer and, as a result, 
require extended training time.

Fortunately, these limitations won’t pose a threat to our color generation task. We’re limiting our color names to 25 characters in length and we only have 18,606 training 
samples.
"""
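
"""
To make the character-level idea concrete, here is a minimal pure-Python sketch. It assumes integer 0 is reserved for padding and that names are 
left-padded (and, if too long, cut from the front) to 25 characters, which mirrors the default behavior of Keras pad_sequences. 
The helper names build_char_index and encode_name are hypothetical and are not used elsewhere in this notebook.
"""

```python
MAXLEN = 25  # the length limit mentioned in the text above


def build_char_index(names):
    """Assign 1..N to each distinct character; 0 is kept free for padding."""
    chars = sorted({ch for name in names for ch in name.lower()})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode_name(name, char_index, maxlen=MAXLEN):
    """Encode a name as integers, left-padded with zeros to exactly maxlen entries."""
    ids = [char_index[ch] for ch in name.lower() if ch in char_index]
    ids = ids[-maxlen:]  # keep the last maxlen characters if the name is too long
    return [0] * (maxlen - len(ids)) + ids


char_index = build_char_index(["fuchsia", "chartreuse", "aquatone"])
encoded = encode_name("fuchsia", char_index)
print(len(char_index), encoded)
```

Unlike a word-level vocabulary, an unusual name such as "fuchsia" needs no special handling here: every name is just a sequence of already-known characters.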

In [None]:
import tensorflow as tf
# Use the public `tensorflow.keras` API rather than the private `tensorflow.python` package
from tensorflow import keras
from tensorflow.keras import preprocessing
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM, Reshape

import numpy as np
import pandas as pd

In [None]:
data = pd.read_csv("colors.csv")
names = data["name"]
data.head()

In [None]:
# data prep: inspect the distribution of color-name lengths

import matplotlib.pyplot as plt
import scipy.stats as stats

h = sorted(names.str.len().values)

# Overlay a normal density that uses the mean and standard deviation of the lengths
fit = stats.norm.pdf(h, np.mean(h), np.std(h))
plt.plot(h, fit, '-o')
plt.hist(h, density=True)  # `normed` was removed from Matplotlib; `density` replaces it
plt.xlabel('Chars')
plt.ylabel('Probability density')
plt.show()

In [None]:
maxlen = 25
t = Tokenizer(char_level=True)
t.fit_on_texts(names)
tokenized = t.texts_to_sequences(names)
padded_names = preprocessing.sequence.pad_sequences(tokenized, maxlen=maxlen)
print(padded_names.shape)

In [None]:
print(t.word_index)

In [None]:
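# Public API; the `np_utils` alias is kept because later cells call `np_utils.to_categorical`
from tensorflow.keras import utils as np_utils
# num_classes must match the model's input feature size (90) and the value used in predict()
one_hot_names = np_utils.to_categorical(padded_names, num_classes=90)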
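
"""
As a rough illustration of what the one-hot step produces, the NumPy sketch below expands a small integer array of shape (samples, maxlen) into 
shape (samples, maxlen, num_classes). The helper name one_hot and the toy values are assumptions for this example; the notebook itself relies on 
Keras' to_categorical.
"""

```python
import numpy as np


def one_hot(padded, num_classes):
    """Turn an integer array of shape (n, maxlen) into (n, maxlen, num_classes)."""
    padded = np.asarray(padded)
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes, dtype="float32")[padded]


toy_padded = np.array([[0, 0, 2, 1],
                       [0, 3, 1, 2]])
toy_encoded = one_hot(toy_padded, num_classes=4)
print(toy_encoded.shape)  # (2, 4, 4)
```

Each time step then carries exactly one 1, which is why the first LSTM layer's input shape is (maxlen, number of classes).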

In [None]:
#normalization

# The RGB values are between 0 - 255
# scale them to be between 0 - 1
def norm(value):
    return value / 255.0

normalized_values = np.column_stack([norm(data["red"]), norm(data["green"]), norm(data["blue"])])

In [None]:
#build model

model = Sequential()
model.add(LSTM(256, return_sequences=True, input_shape=(maxlen, 90)))
model.add(LSTM(128))
model.add(Dense(128, activation='relu'))
model.add(Dense(3, activation='sigmoid'))
# Predicting RGB values is a regression task, so track mean absolute error instead of accuracy
model.compile(optimizer='adam', loss='mse', metrics=['mae'])

In [None]:
history = model.fit(one_hot_names, normalized_values,
                    epochs=40,
                    batch_size=32,
                    validation_split=0.2)

In [None]:
# generate colors

# plot a color image
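def plot_rgb(rgb):
    """Show a single RGB color (three floats in [0, 1]) as a small swatch."""
    swatch = [[rgb]]
    plt.figure(figsize=(2, 2))
    plt.imshow(swatch, interpolation='nearest')
    plt.show()

def scale(n):
    """Map a value in [0, 1] back to an integer in the range 0 - 255."""
    return int(round(n * 255))

def predict(name):
    """Predict, print, and plot the RGB color for a color name."""
    name = name.lower()
    # Encode the name the same way the training names were encoded
    tokenized = t.texts_to_sequences([name])
    padded = preprocessing.sequence.pad_sequences(tokenized, maxlen=maxlen)
    one_hot = np_utils.to_categorical(padded, num_classes=90)
    pred = model.predict(one_hot)[0]
    r, g, b = scale(pred[0]), scale(pred[1]), scale(pred[2])
    print(name + ',', 'R,G,B:', r, g, b)
    plot_rgb(pred)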

In [None]:
predict("forest")
predict("ocean")
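
"""
predict() prints integer R, G, B values. If a hex string is more convenient (for CSS, say), a small helper like the one below could convert the 
model's three outputs. This is an optional sketch: rgb_to_hex and unit_to_hex are hypothetical names, and the inputs are assumed to be floats 
in the range 0 - 1, as produced by the sigmoid output layer.
"""

```python
def rgb_to_hex(r, g, b):
    """Format three 0-255 integer channels as a '#rrggbb' string."""
    for channel in (r, g, b):
        if not 0 <= channel <= 255:
            raise ValueError("channel out of range: {}".format(channel))
    return "#{:02x}{:02x}{:02x}".format(r, g, b)


def unit_to_hex(rgb):
    """Convert three floats in [0, 1] to a '#rrggbb' string."""
    r, g, b = (int(round(float(v) * 255)) for v in rgb)
    return rgb_to_hex(r, g, b)


print(unit_to_hex([1.0, 0.0, 0.2]))
```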