# Letter generation

### Exercise objective
- Become autonomous with Natural Language Processing
- Generate Letters

<hr>

In this exercise, we will try to generate some text. The underlying idea is: given an input sequence, predict what the next letter is going to be. To do that, we will first create a dataset for this task, and then train an RNN to do the prediction.
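To make the idea concrete, here is a minimal sketch that lists every (input, next letter) pair of a short string. The helper name `next_letter_pairs` and the window size of 3 are illustrative assumptions; the exercise itself samples one random window of 300 letters per review rather than enumerating all of them.

```python
def next_letter_pairs(text, length=3):
    """Return every (window, next_letter) pair of a string.

    Hypothetical helper, for illustration only.
    """
    pairs = []
    for start in range(len(text) - length):
        window = text[start:start + length]   # the input sequence
        target = text[start + length]         # the letter to predict
        pairs.append((window, target))
    return pairs


print(next_letter_pairs("hello"))
# [('hel', 'l'), ('ell', 'o')]
```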

# The data

❓ Question ❓ First, let's load the data. Here, it is the IMDB reviews again, but we are only interested in the sentences, not the positiveness or negativeness of the review. 

⚠️ **Warning** ⚠️ The `load_data` function has a `percentage_of_sentences` argument. Depending on your computer, loading too many sentences may slow it down or even freeze it, because your RAM can overflow. For that reason, **you should start with 10% of the sentences** and see if your computer handles it. Otherwise, rerun with a lower number. 

**At the end of the notebook, to improve the model, you may need to increase the number of loaded sentences**

In [1]:
from tensorflow.keras.datasets import imdb

def load_data(percentage_of_sentences=None):
    # Load the data
    (sentences_train, y_train), (sentences_test, y_test) = imdb.load_data()
    
    # Take only a given percentage of the entire data
    if percentage_of_sentences is not None:
        assert 0 < percentage_of_sentences <= 100
        
        len_train = int(percentage_of_sentences/100*len(sentences_train))
        sentences_train = sentences_train[:len_train]
        y_train = y_train[:len_train]
        
        len_test = int(percentage_of_sentences/100*len(sentences_test))
        sentences_test = sentences_test[:len_test]
        y_test = y_test[:len_test]
            
    # Load the {word: integer} representation
    word_to_id = imdb.get_word_index()
    word_to_id = {k:(v+3) for k,v in word_to_id.items()}
    for i, w in enumerate(['<PAD>', '<START>', '<UNK>', '<UNUSED>']):
        word_to_id[w] = i

    id_to_word = {v:k for k, v in word_to_id.items()}

    # Convert the list of integers to list of words (str)
    X_train = [' '.join([id_to_word[_] for _ in sentence[1:]]) for sentence in sentences_train]
    
    return X_train


### Just run this cell to load the data
X = load_data(percentage_of_sentences=20)



❓ **Question** ❓ Write a function that, given a string (list of letters), returns
- a string (list of letters) that corresponds to part of the sentence - this string should be of size 300
- the letter that follows the previous string

❗ **Remark** ❗ There is no reason for the returned string to start at the beginning of the input string.

Example:
- Input : 'This is a good movie'
- Output: ('a good m', 'o') [Except the first part should be of size 300 instead of 8]

❗ **Remark** ❗ If the input is not longer than 300 letters (so that no next letter exists), return None

In [2]:
import numpy as np

In [3]:
def get_X_y(string, length=300):
    if len(string) <= length:
        return None
    
    seed = np.random.randint(0,len(string)-length)
    
    return string[seed:seed+length], string[seed+length]

❓ **Question** ❓ Check that the function is working on some strings from the loaded data

In [4]:
a, b = get_X_y(X[1])

In [5]:
a

" paper thin and ridiculous the acting is an abomination the script is completely laughable the best is the end showdown with the cop and how he worked out who the killer is it's just so damn terribly written the clothes are sickening and funny in equal measures the hair is big lots of boobs bounce m"

In [6]:
b

'e'
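A quick sanity check can also be run on a short sentence with a small window. The sketch below repeats `get_X_y` so that it runs on its own; the fixed seed and the `length=8` value are assumptions made only to keep the demo small and repeatable.

```python
import numpy as np


def get_X_y(string, length=300):
    # Same logic as the function defined above, repeated so this cell runs alone
    if len(string) <= length:
        return None
    seed = np.random.randint(0, len(string) - length)
    return string[seed:seed + length], string[seed + length]


np.random.seed(0)  # fixed seed only to make the demo repeatable
sentence = "this is a good movie"
part, next_letter = get_X_y(sentence, length=8)

# The window followed by its target must appear verbatim in the sentence
assert len(part) == 8
assert part + next_letter in sentence

# Strings that are too short give None
assert get_X_y("short", length=8) is None
```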

❓ **Question** ❓ Write a function that, based on the previous function and the loaded sentences, generates a dataset X and y:
- each sample of X is a string
- the corresponding y is the letter that comes just after in the input string

❗ **Remark** ❗ This question gives little guidance, as it is similar to what you have done in the previous exercises.

In [26]:
def create_dataset2(sentences, n_samples=20000):
    '''
    This function creates n_samples observations of feature/target,
    looping over the sentences as many times as needed
    '''
    X = []
    y = []
    while True:
        for sentence in sentences:
            result = get_X_y(sentence)
            if result:
                X.append(result[0])
                y.append(result[1])
            if len(X) == n_samples:
                break
        if len(X) == n_samples:
            break
    
    return X, y

In [27]:
my_X2, y2 = create_dataset2(X)
len(my_X2)

20000

In [24]:
len(my_X)

4829

In [7]:
def create_dataset(sentences):
    '''
    This function creates one observation of feature/target for each sentence of the dataset that is long enough
    '''
    X = []
    y = []
    
    for sentence in sentences:
        result = get_X_y(sentence)
        if result:
            X.append(result[0])
            y.append(result[1])
    
    return X, y

In [8]:
my_X, y = create_dataset(X)

❓ **Question** ❓ Split X and y into train and test data. Store them in `string_train`, `string_test`, `y_train` and `y_test`

In [9]:
split = int(len(my_X) * 0.7)

string_train = my_X[:split]
string_test = my_X[split:]

y_train = y[:split]
y_test = y[split:]

❓ **Question** ❓ Create a dictionary which stores a unique token for each letter: the key is the letter while the value is the corresponding token. You have to build your dictionary based on the letters that are in `string_train` and `y_train` only, as you are not supposed to know the test set (and the new letters that might appear in it, which is unlikely, but still possible).

❗ **Remark** ❗ To account for the fact that there might be letters in the test set that are not in the train set, add a particular token for that, whose corresponding key can be `UNKNOWN`.

❗ **Remark** ❗ By letter, we actually mean any character, as texts also contain numbers (`1`, `2`, ...) and symbols such as `?`, `!`, `@`, ...

In [10]:
token_dict = {'UNKNOWN': 0}

tk = 1

for string in string_train:
    for char in string:
        if char in token_dict:
            continue
        token_dict[char] = tk
        tk += 1
        
for string in y_train:
    for char in string:
        if char in token_dict:
            continue
        token_dict[char] = tk
        tk += 1

In [11]:
token_dict

{'UNKNOWN': 0,
 ' ': 1,
 'r': 2,
 'e': 3,
 'm': 4,
 'a': 5,
 'k': 6,
 's': 7,
 't': 8,
 'h': 9,
 'o': 10,
 'u': 11,
 'g': 12,
 'f': 13,
 'i': 14,
 'l': 15,
 'w': 16,
 'j': 17,
 'b': 18,
 'n': 19,
 'c': 20,
 'd': 21,
 'v': 22,
 'y': 23,
 'z': 24,
 'q': 25,
 'p': 26,
 'x': 27,
 "'": 28,
 'é': 29,
 '0': 30,
 '1': 31,
 '9': 32,
 '7': 33,
 '5': 34,
 '4': 35,
 '\x96': 36,
 '2': 37,
 '3': 38,
 '6': 39,
 '8': 40,
 '\x85': 41,
 '´': 42,
 'è': 43,
 'ä': 44,
 '\xa0': 45,
 'à': 46,
 'í': 47,
 '–': 48,
 '’': 49,
 '“': 50,
 '”': 51,
 'â': 52,
 'ü': 53,
 '\x91': 54,
 'ç': 55,
 '\x97': 56,
 'ã': 57,
 'ï': 58,
 '¨': 59,
 '¦': 60,
 'ö': 61}
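The same dictionary can be built more compactly from the set of characters. This is only a sketch: `build_token_dict` is a hypothetical helper name, and it numbers characters in sorted order, so the token ids differ from the ones printed above (which follow the order of first appearance).

```python
def build_token_dict(strings, targets):
    """Map each character seen in the training data to an integer token.

    Token 0 is reserved for 'UNKNOWN'; the other tokens follow sorted order.
    """
    chars = set("".join(strings)) | set("".join(targets))
    token_dict = {"UNKNOWN": 0}
    for token, char in enumerate(sorted(chars), start=1):
        token_dict[char] = token
    return token_dict


toy_dict = build_token_dict(["abca", "cab "], ["d", "a"])
print(toy_dict)
# {'UNKNOWN': 0, ' ': 1, 'a': 2, 'b': 3, 'c': 4, 'd': 5}
```

Sorting makes the mapping reproducible from one run to the next, which is convenient if the dictionary has to be rebuilt to reuse a saved model.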

❓ **Question** ❓ Based on the previous dictionary, tokenize the strings and store them in `X_train` and `X_test`.

❗ **Remark** ❗ Convert your lists to NumPy arrays

In [12]:
X_train = []
for sentence in string_train:
    x_list = []
    for char in sentence:
        x_list.append(token_dict[char])
    X_train.append(x_list)
    
X_train = np.array(X_train)

In [13]:
X_train.shape

(3380, 300)

In [14]:
X_test = []
for sentence in string_test:
    x_list = []
    for char in sentence:
        if char in token_dict:
            x_list.append(token_dict[char])
        else:
            x_list.append(token_dict['UNKNOWN'])
    X_test.append(x_list)

X_test = np.array(X_test)

In [15]:
X_test.shape

(1449, 300)
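Since the train and test loops differ only in how unknown characters are handled, both can share one helper. The sketch below assumes a hypothetical function name `tokenize` and uses `dict.get` to fall back to the `UNKNOWN` token; on the training strings the fallback is simply never triggered.

```python
import numpy as np


def tokenize(strings, token_dict):
    """Convert equal-length strings to a 2D array of tokens.

    Characters missing from token_dict map to the 'UNKNOWN' token.
    """
    unknown = token_dict["UNKNOWN"]
    return np.array([[token_dict.get(char, unknown) for char in s] for s in strings])


toy_dict = {"UNKNOWN": 0, "a": 1, "b": 2}
print(tokenize(["ab", "bz"], toy_dict))
# [[1 2]
#  [2 0]]
```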

❓ **Question** ❓ The outputs are currently letters. We first need to tokenize them, thanks to the previous dictionary.

❗ **Remark** ❗ Remember that some values in `y_test` may be unknown.

In [30]:
for i in range(len(y_train)):
    y_train[i] = token_dict[y_train[i]]
y_train = np.array(y_train)

In [32]:
for i in range(len(y_test)):
    if y_test[i] in token_dict:
        y_test[i] = token_dict[y_test[i]]
    else:
        y_test[i] = token_dict['UNKNOWN']
y_test = np.array(y_test)

❓ **Question** ❓ Now, let's convert the tokenized outputs to one-hot encoded categories! There should be as many categories as different tokens in the previous dictionary! So be careful that your outputs are of the right shape: in particular, `y_train` and `y_test` must have the same number of one-hot encoded categories.

In [33]:
from tensorflow.keras.utils import to_categorical

In [38]:
y_train_cat = to_categorical(y_train, num_classes=len(token_dict))
y_test_cat = to_categorical(y_test, num_classes=len(token_dict))

In [39]:
y_train_cat.shape

(3380, 62)
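To see why the number of classes has to be fixed explicitly, here is a NumPy sketch of one-hot encoding. The `one_hot` helper is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
import numpy as np


def one_hot(tokens, num_classes):
    """Row i is all zeros except for a 1 at column tokens[i]."""
    return np.eye(num_classes, dtype="float32")[np.asarray(tokens)]


encoded = one_hot([0, 2, 1], num_classes=4)
print(encoded.shape)  # (3, 4)
```

Class 3 never appears in the toy input, yet the output still has 4 columns. Passing `num_classes=len(token_dict)` plays the same role above: the train and test targets get the same width even when some letters are missing from one of them.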

# Baseline model

❓ **Question** ❓ What is the baseline accuracy?

In [56]:
unique, counts = np.unique(y_train, return_counts=True)
counts = dict(zip(unique, counts))

In [57]:
max(counts.values()) # -> most present category is 1 with 628 counts

628

In [58]:
counts

{1: 628,
 2: 148,
 3: 323,
 4: 65,
 5: 215,
 6: 27,
 7: 182,
 8: 263,
 9: 152,
 10: 219,
 11: 84,
 12: 69,
 13: 40,
 14: 220,
 15: 132,
 16: 63,
 17: 12,
 18: 42,
 19: 148,
 20: 78,
 21: 104,
 22: 38,
 23: 67,
 25: 2,
 26: 32,
 27: 3,
 28: 11,
 30: 3,
 31: 5,
 32: 2,
 34: 1,
 38: 2}

In [61]:
# Baseline: always predict token 1 (the space character), the most frequent class in y_train
y_pred = np.full(len(y_test), 1)

In [62]:
len(y_test), len(y_pred)

(1449, 1449)

In [63]:
from sklearn.metrics import accuracy_score

In [64]:
accuracy_score(y_test, y_pred)

0.1849551414768806
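The baseline steps above can be gathered into one function. This is a sketch under the assumption that `y_train` and `y_test` are arrays of integer tokens; `baseline_accuracy` is a hypothetical name, and the toy arrays are made up for the demo.

```python
import numpy as np


def baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training class."""
    classes, counts = np.unique(y_train, return_counts=True)
    most_frequent = classes[np.argmax(counts)]
    accuracy = float(np.mean(np.asarray(y_test) == most_frequent))
    return most_frequent, accuracy


toy_train = np.array([1, 1, 3, 1, 2])
toy_test = np.array([1, 3, 1, 2])
print(baseline_accuracy(toy_train, toy_test))
```

Note that the most frequent class is chosen on the training targets only, and the accuracy is then measured on the test targets.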

# The model

❓ **Question** ❓ Write an RNN with all the appropriate layers, and compile it.

In [65]:
from tensorflow.keras import Sequential, layers

In [67]:
def init_model(vocab_size):
    model = Sequential()
    model.add(layers.Embedding(input_dim=vocab_size, output_dim=30))
    model.add(layers.GRU(30, activation='tanh'))
    model.add(layers.Dense(30, activation='relu'))
    model.add(layers.Dense(vocab_size, activation='softmax'))
    
    
    model.compile(loss='categorical_crossentropy',
                  optimizer='rmsprop',
                  metrics=['accuracy'])
    
    return model

model = init_model(len(token_dict))
model.summary()



Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 30)          1860      
                                                                 
 gru (GRU)                   (None, 30)                5580      
                                                                 
 dense (Dense)               (None, 30)                930       
                                                                 
 dense_1 (Dense)             (None, 62)                1922      
                                                                 
Total params: 10,292
Trainable params: 10,292
Non-trainable params: 0
_________________________________________________________________


❓ **Question** ❓ Fit the model - you can use a large batch size to speed up training. The model will probably reach the baseline accuracy at some point, and hopefully keep improving from there. 

You should get an accuracy better than 35% 

In [68]:
from tensorflow.keras.callbacks import EarlyStopping

es = EarlyStopping(patience=5, monitor='val_loss')

In [69]:
model.fit(X_train, y_train_cat,
          epochs=400, 
          batch_size=64,
          callbacks=[es],
          validation_split=0.3)

Epoch 1/400
...
Epoch 51/400


<keras.callbacks.History at 0x7fe34e8d7940>

❓ **Question** ❓ Evaluate your model on the test set

In [70]:
model.evaluate(X_test, y_test_cat)



[2.3669826984405518, 0.3423050343990326]

❓ **Question** ❓ Even though the model is not perfect, you can look at its prediction with a string of your choice. Don't forget to decode the predicted token to know which letter it corresponds to.

You will have to convert your input string to a list of tokens, get the most probable output class, and then convert it back to a letter.

You should do it in a function.

In [77]:
pred = model.predict(X_test[0:2])[0]

In [78]:
pred

array([5.8480385e-26, 4.8996079e-01, 1.6873985e-03, 3.3678928e-01,
       5.1778108e-05, 2.5185687e-02, 2.2246847e-06, 4.6388176e-03,
       1.9635952e-03, 1.6343289e-04, 2.1757662e-02, 2.3498859e-03,
       3.5988633e-06, 4.6942660e-07, 8.2453422e-02, 7.1309372e-03,
       1.4389652e-08, 3.2183944e-08, 1.6914278e-06, 3.4907087e-05,
       5.1947043e-04, 3.2597812e-04, 8.6860127e-06, 2.4644576e-02,
       2.0154539e-26, 1.9432684e-20, 3.2236046e-04, 3.6243613e-08,
       3.1994243e-06, 1.7262526e-25, 4.0561123e-23, 3.1201175e-20,
       2.5465328e-23, 2.4579215e-25, 5.0128684e-25, 1.2403083e-25,
       3.5164476e-25, 2.1742609e-25, 9.3789202e-19, 5.1044286e-26,
       5.3654177e-26, 7.1849313e-25, 1.3596370e-25, 1.5790488e-25,
       3.7456528e-25, 3.9807807e-26, 6.2281771e-25, 5.3101358e-26,
       1.6444958e-25, 3.9689218e-27, 2.1908122e-25, 1.7611866e-25,
       1.0687435e-25, 5.6921891e-25, 1.0421597e-25, 5.9420584e-26,
       1.4191265e-25, 2.5299617e-26, 2.4095796e-26, 1.1113259e

In [79]:
np.argmax(pred)

1

In [72]:
token_to_char = {v: k for k, v in token_dict.items()}
token_to_char

{0: 'UNKNOWN',
 1: ' ',
 2: 'r',
 3: 'e',
 4: 'm',
 5: 'a',
 6: 'k',
 7: 's',
 8: 't',
 9: 'h',
 10: 'o',
 11: 'u',
 12: 'g',
 13: 'f',
 14: 'i',
 15: 'l',
 16: 'w',
 17: 'j',
 18: 'b',
 19: 'n',
 20: 'c',
 21: 'd',
 22: 'v',
 23: 'y',
 24: 'z',
 25: 'q',
 26: 'p',
 27: 'x',
 28: "'",
 29: 'é',
 30: '0',
 31: '1',
 32: '9',
 33: '7',
 34: '5',
 35: '4',
 36: '\x96',
 37: '2',
 38: '3',
 39: '6',
 40: '8',
 41: '\x85',
 42: '´',
 43: 'è',
 44: 'ä',
 45: '\xa0',
 46: 'à',
 47: 'í',
 48: '–',
 49: '’',
 50: '“',
 51: '”',
 52: 'â',
 53: 'ü',
 54: '\x91',
 55: 'ç',
 56: '\x97',
 57: 'ã',
 58: 'ï',
 59: '¨',
 60: '¦',
 61: 'ö'}

In [81]:
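def get_predicted_letter(string):
    # Tokenize the input; characters unseen during training map to 'UNKNOWN'
    string_convert = [token_dict.get(char, token_dict['UNKNOWN']) for char in string]

    # The model expects a batch: wrap the sequence in an array of shape (1, len(string))
    pred = model.predict(np.array([string_convert]))
    pred_class = np.argmax(pred[0])
    pred_letter = token_to_char[pred_class]
    
    return pred_letter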

string = 'this is a good'

get_predicted_letter(string)

' '

❓ **Question** ❓ Now, write a function that takes a string as an input, predicts the next letter, appends the letter to the initial string, then repeats the prediction on the extended string, and so on.

For instance:
- 'this is a good' => ' '
- 'this is a good ' => 'm'
- 'this is a good m' => 'o'
...

The function should also take the number of times you repeat the operation as an input.

You can have some fun trying different input sequences here.

In [85]:
def get_predictions(string, n_repetitions=50):
    for i in range(n_repetitions):
        string = string + get_predicted_letter(string)
    return string

In [89]:
get_predictions('what i like is ', 20)

'what i like is the and the and the '
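The generation loop does not depend on the model itself, so it can be tried out with any function that maps a string to one character. In the sketch below, `generate`, `toy_predictor` and the `window` argument are assumptions made for illustration: the window keeps only the last 300 characters, to match the length of the training inputs, and the toy predictor stands in for the trained model.

```python
def generate(seed, predict_next, n_repetitions=50, window=300):
    """Append n_repetitions predicted letters to seed, one at a time.

    predict_next is any callable mapping a string to a single character.
    Only the last `window` characters are passed to it.
    """
    text = seed
    for _ in range(n_repetitions):
        text += predict_next(text[-window:])
    return text


def toy_predictor(text):
    # Stand-in for the trained model: cycles through the letters of "the "
    return "the "[len(text) % 4]


print(generate("ab", toy_predictor, n_repetitions=6))
```

With the notebook's model, `get_predicted_letter` could be passed as `predict_next`. Because the most probable letter is always chosen, the output is deterministic for a given seed, which helps explain the repeating pattern in the output above.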

❓ **Question** ❓ Try to optimize your architecture to improve your performance. You can also try to load more data in the first function.