# Letter generation

### Exercise objective
- Get autonomous with Natural Language Processing
- Generate Letters

<hr>
<hr>

In this exercise, we will try to generate some text. The underlying idea is: given an input sequence, predict what the next letter is going to be. To do that, we will first create a dataset for this task, and then train an RNN to do the prediction.

# The data

❓ Question ❓ First, let's load the data. Here, it is the IMDB reviews again, but we are only interested in the sentences, not the positiveness or negativeness of the review. 

⚠️ **Warning** ⚠️ The `load_data` function has a `percentage_of_sentences` argument. Depending on your computer, loading too many sentences may slow it down or even freeze it, because your RAM can overflow. For that reason, **you should start with 10% of the sentences** and see if your computer handles it. Otherwise, rerun with a lower number. 

**At the end of the notebook, you may need to increase the number of loaded sentences to improve the model.**

In [201]:
from tensorflow.keras.datasets import imdb

def load_data(percentage_of_sentences=None):
    # Load the data
    (sentences_train, y_train), (sentences_test, y_test) = imdb.load_data()
    
    # Take only a given percentage of the entire data
    if percentage_of_sentences is not None:
        assert(percentage_of_sentences> 0 and percentage_of_sentences<=100)
        
        len_train = int(percentage_of_sentences/100*len(sentences_train))
        sentences_train = sentences_train[:len_train]
        y_train = y_train[:len_train]
        
        len_test = int(percentage_of_sentences/100*len(sentences_test))
        sentences_test = sentences_test[:len_test]
        y_test = y_test[:len_test]
            
    # Load the {word: integer} representation
    word_to_id = imdb.get_word_index()
    word_to_id = {k:(v+3) for k,v in word_to_id.items()}
    for i, w in enumerate(['<PAD>', '<START>', '<UNK>', '<UNUSED>']):
        word_to_id[w] = i

    id_to_word = {v:k for k, v in word_to_id.items()}

    # Convert the list of integers to list of words (str)
    X_train = [' '.join([id_to_word[_] for _ in sentence[1:]]) for sentence in sentences_train]
    
    return X_train


### Just run this cell to load the data
X = load_data(percentage_of_sentences=10)

❓ **Question** ❓ Write a function that, given a string (list of letters), returns
- a string (list of letters) that corresponds to part of the sentence - this string should be of size 300
- the letter that follows the previous string

❗ **Remark** ❗ There is no reason for the extracted string to start at the beginning of the input string.

Example:
- Input : 'This is a good movie'
- Output: ('a good m', 'o') [Except the first part should be of size 300 instead of 8]

❗ **Remark** ❗ If the input is not longer than 300 letters, return None: at least one letter must remain after the extracted string to serve as the target.

In [225]:
import numpy as np

def get_X_y(string, length=300):
    if len(string) <= length:
        return None
    
    first_letter_idx = np.random.randint(0, len(string) - length)
    
    X_letters = string[first_letter_idx:first_letter_idx+length]
    y_letter = string[first_letter_idx+length]
    
    return X_letters, y_letter
    

❓ **Question** ❓ Check that the function is working on some strings from the loaded data

In [186]:
X[4]

"worst mistake of my life br br i picked this movie up at target for 5 because i figured hey it's sandler i can get some cheap laughs i was wrong completely wrong mid way through the film all three of my friends were asleep and i was still suffering worst plot worst script worst movie i have ever seen i wanted to hit my head up against a wall for an hour then i'd stop and you know why because it felt damn good upon bashing my head in i stuck that damn movie in the microwave and watched it burn and that felt better than anything else i've ever done it took american psycho army of darkness and kill bill just to get over that crap i hate you sandler for actually going through with this and ruining a whole day of my life"

In [187]:
get_X_y(X[3])[1]

'a'
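
A quick way to go beyond eyeballing a single output is to assert the properties the function is supposed to have. The snippet below is a minimal sketch, not the notebook's own code: `sample_window` is a hypothetical stand-in that mirrors `get_X_y` using the standard library's `random` module, and `review` is a made-up string rather than an IMDB review.

```python
import random

def sample_window(text, length=300, rng=random):
    """Return (window, next_char), or None when the text is too short."""
    if len(text) <= length:
        return None
    # Valid start positions are 0 .. len(text) - length - 1
    start = rng.randrange(0, len(text) - length)
    return text[start:start + length], text[start + length]

# Made-up stand-in for a review: 21 characters repeated 20 times = 420 characters
review = "this is a good movie " * 20
window, next_char = sample_window(review, length=300)

assert len(window) == 300                 # the input has the requested size
assert len(next_char) == 1                # the target is a single character
assert window + next_char in review       # both come from one contiguous slice
assert sample_window("too short", length=300) is None
```

The same assertions can be run against `get_X_y` and a few elements of `X`.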

❓ **Question** ❓ Write a function that, based on the previous function and the loaded sentences, generates a dataset X and y:
- each sample of X is a string
- the corresponding y is the letter that comes just after in the input string

❗ **Remark** ❗ This question comes with little guidance, as it is similar to what you have done in the previous exercises.

In [203]:
def create_dataset(sentences):
    X, y = [], []
    number_of_samples = 20000
    # Draw sentences at random, with replacement
    indices = np.random.randint(0, len(sentences), size=number_of_samples)
    
    for idx in indices:
        ret = get_X_y(sentences[idx])
        # Sentences that are too short are skipped, so fewer samples may be returned
        if ret is None:
            continue
        xi, yi = ret
        
        X.append(xi)
        y.append(yi)
    
    return X, y

In [204]:
X, y = create_dataset(X)

❓ **Question** ❓ Split X and y into train and test data. Store them in `string_train`, `string_test`, `y_train` and `y_test`

In [211]:
from sklearn.model_selection import train_test_split

string_train, string_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

❓ **Question** ❓ Create a dictionary which stores a unique token for each letter: the key is the letter while the value is the corresponding token. You have to build your dictionary based on the letters that are in `string_train` and `y_train` only, as you are not supposed to know the test set (and the new letters that might appear in it, which is unlikely, but still possible).

❗ **Remark** ❗ To account for the fact that there might be letters in the test set that are not in the train set, add a particular token for that, whose corresponding key can be `UNKNOWN`.

❗ **Remark** ❗ By letter, we actually mean any character, as numbers (`1`, `2`, ...) or `?`, `!`, `@`, ... also happen to appear in texts.

In [212]:
letter_to_id = {}
letter_to_id['UNKNOWN'] = 0

iter_ = 1

for string in string_train:
    for letter in string:
        if letter in letter_to_id:
            continue
        letter_to_id[letter] = iter_
        iter_ += 1
    
for string in y_train:
    for letter in string:
        if letter in letter_to_id:
            continue
        letter_to_id[letter] = iter_
        iter_ += 1

In [250]:
letter_to_id

{'UNKNOWN': 0,
 'n': 1,
 'y': 2,
 'o': 3,
 'e': 4,
 ' ': 5,
 'w': 6,
 'a': 7,
 't': 8,
 'h': 9,
 "'": 10,
 's': 11,
 'i': 12,
 'r': 13,
 'l': 14,
 'f': 15,
 'g': 16,
 'u': 17,
 'p': 18,
 'v': 19,
 'd': 20,
 'm': 21,
 'b': 22,
 'c': 23,
 'k': 24,
 '2': 25,
 '0': 26,
 '3': 27,
 'z': 28,
 'j': 29,
 'x': 30,
 '1': 31,
 '9': 32,
 '4': 33,
 'q': 34,
 '5': 35,
 'é': 36,
 '7': 37,
 '8': 38,
 'è': 39,
 '6': 40,
 '\x96': 41,
 '\x85': 42,
 '´': 43,
 'ä': 44,
 'ï': 45,
 'ç': 46,
 'ã': 47,
 'ö': 48,
 '–': 49,
 '\x91': 50,
 '“': 51,
 '’': 52,
 '”': 53,
 'ü': 54,
 'ó': 55,
 '\x97': 56,
 'í': 57,
 'ñ': 58,
 'å': 59,
 'á': 60,
 '\xa0': 61,
 'à': 62,
 '\x95': 63,
 '£': 64}
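
The loops above assign ids in order of first appearance, so the mapping changes whenever the random samples change. As a hedged alternative, the sketch below collects the characters in a set and sorts them, which makes the ids reproducible. `build_vocabulary` and the toy inputs are hypothetical names introduced for illustration, and the resulting ids will differ from the ones printed above.

```python
def build_vocabulary(strings, targets, unknown_key="UNKNOWN"):
    """Map every character seen in the training data to a unique integer id."""
    characters = set()
    for text in list(strings) + list(targets):
        characters.update(text)  # adds each character of the string

    vocab = {unknown_key: 0}  # id 0 is reserved for unseen characters
    # Sorting makes the ids independent of the order of the samples
    for idx, ch in enumerate(sorted(characters), start=1):
        vocab[ch] = idx
    return vocab

toy_vocab = build_vocabulary(["ab a", "ba!"], ["c", "a"])
print(toy_vocab)
```

With the notebook's variables, the call would be `build_vocabulary(string_train, y_train)`.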

❓ **Question** ❓ Based on the previous dictionary, tokenize the strings and store them in `X_train` and `X_test`.

❗ **Remark** ❗ Convert your lists to NumPy arrays

In [213]:
X_train = [[letter_to_id[_] for _ in x] for x in string_train]

In [214]:
X_test = [[letter_to_id[_] if _ in letter_to_id else letter_to_id['UNKNOWN'] for _ in x] for x in string_test]

In [215]:
X_train = np.array(X_train)
X_test = np.array(X_test)

In [253]:
X_train

array([[ 1,  2,  3, ...,  5,  9,  7],
       [13,  5,  6, ...,  5, 20,  4],
       [ 4,  8,  5, ...,  4,  2,  5],
       ...,
       [ 5,  8, 19, ...,  1, 12,  7],
       [ 5, 15, 12, ...,  8,  5, 34],
       [19,  4,  5, ...,  4,  7,  8]])

In [242]:
X_train.shape

(13477, 300)
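
The train and test tokenizations differ only in how they treat characters missing from the dictionary. A single helper based on `dict.get` can cover both cases. This is a minimal sketch: `encode` and `toy_vocab` are hypothetical names, not part of the notebook.

```python
def encode(text, vocab, unknown_key="UNKNOWN"):
    """Convert a string to a list of token ids, falling back to the unknown id."""
    unknown_id = vocab[unknown_key]
    return [vocab.get(ch, unknown_id) for ch in text]

toy_vocab = {"UNKNOWN": 0, "a": 1, "b": 2, " ": 3}

known = encode("ab a", toy_vocab)    # every character is in the vocabulary
unseen = encode("ab z", toy_vocab)   # 'z' is not, so it maps to id 0
print(known, unseen)
```

Using the same helper for both sets is harmless on the training data, since every training character is in the dictionary by construction.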

In [251]:
lst = []
for x in string_train:
    lst.append([letter_to_id[letter] for letter in x])
            

❓ **Question** ❓ The outputs are currently letters. We first need to tokenize them, using the previous dictionary.

❗ **Remark** ❗ Remember that some values in `y_test` may be unknown.

In [244]:
y_train_token = [letter_to_id[x] for x in y_train]
y_test_token = [letter_to_id[x] if x in letter_to_id else letter_to_id['UNKNOWN'] for x in y_test]

❓ **Question** ❓ Now, let's convert the tokenized outputs to one-hot encoded categories! There should be as many categories as there are different letters in the previous dictionary! So be careful that your outputs are of the right shape: in particular, the train and test outputs must have the same number of one-hot encoded categories.

In [219]:
from tensorflow.keras.utils import to_categorical

y_train_cat = to_categorical(y_train_token, num_classes=len(letter_to_id))
y_test_cat = to_categorical(y_test_token, num_classes=len(letter_to_id))
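
Passing `num_classes` explicitly matters because, without it, `to_categorical` infers the number of columns from the largest token it sees, which can differ between the train and test sets. The NumPy sketch below shows the encoding itself. `one_hot` is a hypothetical helper written for illustration, not the Keras implementation.

```python
import numpy as np

def one_hot(tokens, num_classes):
    """Return an array of shape (len(tokens), num_classes) with a single 1 per row."""
    tokens = np.asarray(tokens, dtype=int)
    encoded = np.zeros((len(tokens), num_classes), dtype="float32")
    # Row i gets a 1 in the column given by tokens[i]
    encoded[np.arange(len(tokens)), tokens] = 1.0
    return encoded

toy = one_hot([2, 0, 1], num_classes=4)
print(toy.shape)  # the width is fixed by num_classes, not by the largest token
```

Here the largest token is 2, yet the output still has 4 columns, which is what keeps `y_train_cat` and `y_test_cat` compatible with the same output layer.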

In [248]:
y_train_token

[19, 7, 11, 4, 13, 14, 12, 5, 14, 16, ...]
 

# Baseline model

❓ **Question** ❓ What is the baseline accuracy?

In [220]:
from sklearn.metrics import accuracy_score

unique, counts = np.unique(y_train, return_counts=True)
counts = dict(zip(unique, counts))

print("The number of labels in the train set ", counts)
    
w = -1
y_pred = ''
for k, v in counts.items():
    if v > w:
        y_pred = k
        w = v

The number of labels in the train set  {' ': 2516, "'": 54, '0': 10, '1': 8, '2': 2, '3': 3, '4': 2, '5': 2, '6': 2, '7': 1, '8': 2, '9': 4, 'a': 862, 'b': 220, 'c': 315, 'd': 421, 'e': 1212, 'f': 215, 'g': 206, 'h': 579, 'i': 855, 'j': 28, 'k': 91, 'l': 476, 'm': 326, 'n': 731, 'o': 789, 'p': 193, 'q': 7, 'r': 666, 's': 705, 't': 1077, 'u': 277, 'v': 138, 'w': 228, 'x': 23, 'y': 218, 'z': 11, '\x96': 1, 'ö': 1}


In [221]:
print('Baseline accuracy: ', accuracy_score(y_test, [y_pred]*len(y_test)))

Baseline accuracy:  0.19006404708326122
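
The baseline always predicts the most frequent training label (here the space character) and measures how often that is right on the test set. The same computation can be sketched with `collections.Counter`. `baseline_accuracy` and the toy labels below are hypothetical and only illustrate the logic.

```python
from collections import Counter

def baseline_accuracy(train_labels, test_labels):
    """Predict the most frequent training label everywhere and score it on the test labels."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return most_common_label, hits / len(test_labels)

# Toy labels: the space is the most frequent character in the training set
label, acc = baseline_accuracy([" ", "e", " ", "t", " "], [" ", "e", " ", "a"])
print(repr(label), acc)
```

Any model worth keeping should beat this number on the test set.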


# The model

❓ **Question** ❓ Write an RNN with all the appropriate layers, and compile it.

In [222]:
from tensorflow.keras import Sequential, layers

def init_model(vocab_size):
    model = Sequential()
    model.add(layers.Embedding(input_dim=vocab_size, output_dim=30))
    model.add(layers.GRU(30, activation='tanh'))
    model.add(layers.Dense(30, activation='relu'))
    model.add(layers.Dense(vocab_size, activation='softmax'))
    
    
    model.compile(loss='categorical_crossentropy',
                  optimizer='rmsprop',
                  metrics=['accuracy'])
    
    return model

model = init_model(len(letter_to_id))
model.summary()

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_5 (Embedding)     (None, None, 30)          1950      
                                                                 
 gru_5 (GRU)                 (None, 30)                5580      
                                                                 
 dense_10 (Dense)            (None, 30)                930       
                                                                 
 dense_11 (Dense)            (None, 65)                2015      
                                                                 
Total params: 10,475
Trainable params: 10,475
Non-trainable params: 0
_________________________________________________________________


❓ **Question** ❓ Fit the model - you can use a large batch size to speed up training. The accuracy will probably reach the baseline performance at some point, and hopefully keep improving from there. 

You should get an accuracy better than 35% 

In [223]:
from tensorflow.keras.callbacks import EarlyStopping

es = EarlyStopping(patience=5, monitor='val_loss')

model = init_model(len(letter_to_id))

model.fit(X_train, y_train_cat,
          epochs=400, 
          batch_size=50,
          callbacks=[es],
          validation_split=0.3)

Epoch 1/400
...
Epoch 41/400


<keras.callbacks.History at 0x16af0fca0>

❓ **Question** ❓ Evaluate your model on the test set

In [224]:
model.evaluate(X_test,y_test_cat)



[2.144914150238037, 0.38687899708747864]

❓ **Question** ❓ Even though the model is not perfect, you can look at its prediction with a string of your choice. Don't forget to decode the predicted token to know which letter it corresponds to.

You will have to convert your input string to a list of tokens, get the most probable output class, and then convert it back to a letter.

You should do it in a function.

In [115]:
id_to_letter = {v: k for k, v in letter_to_id.items()}

In [146]:
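def get_predicted_letter(string):
    # Tokenize the input, mapping characters unseen during training to UNKNOWN
    string_convert = [letter_to_id.get(_, letter_to_id['UNKNOWN']) for _ in string]
    
    # The model expects a batch of sequences, hence the extra dimension
    pred = model.predict(np.array([string_convert]))
    pred_class = np.argmax(pred[0])
    pred_letter = id_to_letter[pred_class]
 
    return pred_letter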

In [147]:
get_predicted_letter('th')



'e'

❓ **Question** ❓ Now, write a function that takes a string as an input, predicts the next letter, appends the letter to the initial string, then repeats the prediction on the extended string, and so on.

For instance : 
- 'this is a good' => ' '
- 'this is a good ' => 'm'
- 'this is a good m' => 'o'
...

The function should also take the number of times you repeat the operation as an input.

You can have some fun trying different input sequences here.

In [149]:
def repeat_prediction(string, repetition):
    string_tmp = string
    
    for i in range(repetition):
        predicted_letter = get_predicted_letter(string_tmp)
        string_tmp = string_tmp + predicted_letter
        
    return string_tmp    

In [160]:
strings = ['want i like']

In [161]:
[repeat_prediction(string, 10) for string in strings]



['want i like the the t']
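
Always taking the most probable letter tends to fall into loops such as `the the t`. One common variation, which is not part of the original exercise, is to sample the next token from the predicted distribution, with a temperature that controls how adventurous the choice is. The sketch below is an assumption-laden illustration: `sample_with_temperature` is a hypothetical helper and `toy_probs` stands in for one row of `model.predict(...)`.

```python
import numpy as np

def sample_with_temperature(probabilities, temperature=1.0, rng=None):
    """Sample a token id from a probability vector rescaled by a temperature."""
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probabilities, dtype="float64")
    # Work in log space; the small constant avoids log(0)
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()  # numerical stability before exponentiating
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

toy_probs = [0.1, 0.6, 0.3]  # stand-in for a softmax output over 3 tokens
rng = np.random.default_rng(0)
token = sample_with_temperature(toy_probs, temperature=0.5, rng=rng)
print(token)
```

A temperature close to 0 behaves like `np.argmax`, while larger values spread the choice over less likely letters. In `get_predicted_letter`, such a helper could replace the `np.argmax(pred[0])` call.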

❓ **Question** ❓ Try to optimize your architecture to improve your performance. You can also try to load more data in the first function.