# Implementation of Email Auto-completion

## Objectives

* Use the previously cleaned sentences from the email dataset
* Tokenize the sentences
* Pad the sentences
* Create the network architecture
* Train the model
* Get predictions from the trained model

## The problem:

Suppose we work at Globomantics, one of the most popular email applications in the world. To improve the user experience, we want to build an intelligent system that offers auto-completion suggestions while users compose emails. The suggestions should be relevant and useful so that they genuinely improve the experience.

## Dataset

We'll be using the Enron email dataset which is one of the most popular email datasets. The dataset can be downloaded from [here](https://www.kaggle.com/code/abhaytomar/starter-the-enron-email-dataset-8c90cc3c-1/data).

This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron. The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation. More information about this dataset can be found [here](https://www.cs.cmu.edu/~enron/).

## Import Libraries and Load the Cleaned Sentences

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf

In [1]:
sentence_df = pd.read_csv('sentences.csv')
sentence_df = sentence_df.dropna()
sentence_df.head()

In [None]:
sentences = sentence_df.sentence.values
print("Total number of sentences: ", len(sentences))

Total number of sentences:  152831


In [None]:
sentences[0:10]

array(['here is our forecast',
       'traveling to have a business meeting takes the fun out of the trip',
       'especially if you have to prepare a presentation',
       'i would suggest holding the business plan meetings here then take a trip without any formal business meetings',
       'i would even try and get some honest opinions on whether a trip is even desired or necessary',
       'too often the presenter speaks and the others are quiet just waiting for their turn',
       'the meetings might be better if held in a round table discussion format',
       'play golf and rent a ski boat and jet ski is',
       'flying somewhere takes too much time',
       'plus your thoughts on any changes that need to be made'],
      dtype=object)

In [None]:
sentences = sentences[:30000]

## Tokenization of the sentences

We'll use the Keras `Tokenizer` class and its methods to tokenize the sentences, build the vocabulary, and create the word-to-index mapping. To learn more about the `Tokenizer` class, please consult this [link](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer).

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [None]:
test_tokenizer = Tokenizer()

In [None]:
test_sentences = ['here is our forecast',
                  'especially if you have to prepare a presentation']

In [None]:
test_tokenizer.fit_on_texts(test_sentences)

In [None]:
test_tokenizer.word_index

{'here': 1,
 'is': 2,
 'our': 3,
 'forecast': 4,
 'especially': 5,
 'if': 6,
 'you': 7,
 'have': 8,
 'to': 9,
 'prepare': 10,
 'a': 11,
 'presentation': 12}

In [None]:
test_tokenizer.index_word

{1: 'here',
 2: 'is',
 3: 'our',
 4: 'forecast',
 5: 'especially',
 6: 'if',
 7: 'you',
 8: 'have',
 9: 'to',
 10: 'prepare',
 11: 'a',
 12: 'presentation'}
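
To make the two mappings above less of a black box, here is a minimal pure-Python sketch of how such a word index can be built. It only approximates what `Tokenizer.fit_on_texts` does (lowercasing, splitting on whitespace, most frequent word first); the helper name `build_word_index` is hypothetical and not part of Keras. Index 0 is never assigned to a word, which is why the vocabulary size is later computed as `len(tokenizer.word_index) + 1`.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer id, most frequent word first (ids start at 1)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() keeps first-seen order among words with equal counts
    ordered = [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate(ordered, start=1)}

word_index = build_word_index(['here is our forecast',
                               'especially if you have to prepare a presentation'])
index_word = {i: w for w, i in word_index.items()}

# Encode a new sentence with the learned mapping
sequence = [word_index[w] for w in "here you have our presentation".split()]
print(sequence)  # [1, 7, 8, 3, 12]
```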

In [None]:
test_sentence = "here you have our presentation"
test_token_list = test_tokenizer.texts_to_sequences([test_sentence])[0]
print(test_token_list)

[1, 7, 8, 3, 12]


In [None]:
n_grams = []
for i in range(1, len(test_token_list)):
    n_gram = test_token_list[:i+1]
    n_grams.append(n_gram)
print(n_grams)

[[1, 7], [1, 7, 8], [1, 7, 8, 3], [1, 7, 8, 3, 12]]


In [None]:
tokenizer = Tokenizer()
def convertSentencesIntoSeqOfTokens(sentences):
    tokenizer.fit_on_texts(sentences)
    total_words_in_vocab = len(tokenizer.word_index) + 1

    input_sequences = []
    for sentence in sentences:
        seq_of_tokens = tokenizer.texts_to_sequences([sentence])[0]
        for i in range(1, len(seq_of_tokens)):
            n_gram = seq_of_tokens[:i+1]
            input_sequences.append(n_gram)
    return input_sequences, total_words_in_vocab

In [None]:
input_sequences, total_words_in_vocab = convertSentencesIntoSeqOfTokens(sentences)
input_sequences[:10]

[[98, 4],
 [98, 4, 41],
 [98, 4, 41, 1828],
 [2263, 2],
 [2263, 2, 17],
 [2263, 2, 17, 5],
 [2263, 2, 17, 5, 111],
 [2263, 2, 17, 5, 111, 114],
 [2263, 2, 17, 5, 111, 114, 758],
 [2263, 2, 17, 5, 111, 114, 758, 1]]

## Handle variable sentence lengths by padding

We'll use the Keras "pad_sequences" function to pad the shorter sequences with leading zeros so that every sequence has the same length. To learn more about this function, please go through this [link](https://www.tensorflow.org/api_docs/python/tf/keras/utils/pad_sequences).

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
test_sequences = [[2025, 2], [2025, 2, 16], [2025, 2, 16, 6],
                  [2025, 2, 16, 6, 135], [2025, 2, 16, 6, 135, 119]]

In [None]:
pad_sequences(test_sequences, maxlen=6, padding='pre')

array([[   0,    0,    0,    0, 2025,    2],
       [   0,    0,    0, 2025,    2,   16],
       [   0,    0, 2025,    2,   16,    6],
       [   0, 2025,    2,   16,    6,  135],
       [2025,    2,   16,    6,  135,  119]], dtype=int32)
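
As a rough illustration of what `padding='pre'` does, the following pure-Python sketch left-pads each sequence with zeros up to `maxlen`. The helper `pre_pad` is a hypothetical stand-in, not the Keras implementation; it also mimics the default `truncating='pre'` behaviour by dropping tokens from the front of sequences that are too long.

```python
def pre_pad(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) every sequence to exactly maxlen items.

    Assumes maxlen >= 1.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                      # keep the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

rows = pre_pad([[2025, 2], [2025, 2, 16], [2025, 2, 16, 6, 135, 119]], maxlen=6)
for row in rows:
    print(row)
```

Padding on the left keeps the real tokens at the end of every row, so the last column always holds the word to be predicted.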

In [None]:
def generateSameLengthSentencesByPadding(sequences):
    # Find length of the longest sequence
    max_seq_len = max([len(x) for x in sequences])

    # Pad the sequences
    padded_sequences = np.array(pad_sequences(sequences, maxlen=max_seq_len, padding='pre'))

    # Return padded sequences and the max length
    return padded_sequences, max_seq_len

In [None]:
padded_sequences, max_seq_len = generateSameLengthSentencesByPadding(input_sequences)
padded_sequences[:5]

array([[   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,   98,    4],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,   98,    4,   41],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,   98,    4,   41, 1828],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0, 2263,    2],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0, 2263,    2,   17]], dtype=int32)

## Generate predictors and labels for training

We import the Keras utilities here because they are needed to convert the labels into one-hot encoded vectors. We'll use the "to_categorical" function from this module for the conversion. To learn more about this function, please check out this [link](https://www.tensorflow.org/api_docs/python/tf/keras/utils/to_categorical).

In [None]:
import tensorflow.keras.utils as ku

In [None]:
test_padded_sequences = np.array([[0, 0, 0, 12, 16, 32],
                                  [0, 0, 8, 15, 17, 41]])

We'll use array slicing techniques to retrieve the inputs and the labels. To know more about how indexing into a numpy array is done, please go through the following resources. [link1](https://towardsdatascience.com/slicing-numpy-arrays-like-a-ninja-e4910670ceb0), [link2](https://www.tutorialspoint.com/numpy/numpy_indexing_and_slicing.htm)

In [None]:
test_padded_sequences[:,:-1]

array([[ 0,  0,  0, 12, 16],
       [ 0,  0,  8, 15, 17]])

In [None]:
test_padded_sequences[:,-1]

array([32, 41])

In [None]:
def generatePredictorsAndLabels(padded_sequences):
    inputs, label = padded_sequences[:,:-1], padded_sequences[:,-1]
    label = ku.to_categorical(label, num_classes = total_words_in_vocab)
    return inputs, label

In [None]:
inputs, label = generatePredictorsAndLabels(padded_sequences)
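
For intuition, the split and one-hot steps can be sketched without Keras. The helper names `split_predictors_and_labels` and `one_hot` below are hypothetical, and the vocabulary size of 50 is an arbitrary toy value. The last lines estimate how large a dense one-hot label matrix becomes, assuming 4-byte floats; the row count of 200,000 is an assumed figure for illustration, not a number taken from this notebook, while 16027 is the output width this notebook's model reports for its final layer.

```python
def split_predictors_and_labels(padded):
    """All tokens but the last are the input; the last token is the label."""
    inputs = [row[:-1] for row in padded]
    labels = [row[-1] for row in padded]
    return inputs, labels

def one_hot(labels, num_classes):
    """Turn integer labels into rows with a single 1.0 at the label position."""
    return [[1.0 if j == label else 0.0 for j in range(num_classes)]
            for label in labels]

toy_inputs, toy_labels = split_predictors_and_labels([[0, 0, 0, 12, 16, 32],
                                                      [0, 0, 8, 15, 17, 41]])
encoded = one_hot(toy_labels, num_classes=50)
print(toy_inputs, toy_labels, encoded[0].index(1.0))

# Dense one-hot labels grow as rows x vocabulary size (assumed row count).
assumed_rows, vocab_size, bytes_per_float = 200_000, 16027, 4
estimated_gb = assumed_rows * vocab_size * bytes_per_float / 1e9
print(round(estimated_gb, 1), "GB")
```

If memory becomes a constraint, Keras also provides the `sparse_categorical_crossentropy` loss, which accepts integer labels directly; using it would be a change to the approach followed in this notebook.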

## Create and train the model

Import Sequential model from keras and Embedding, LSTM, Dense and Dropout layers from Keras. To know more about them, consult these links. [sequential](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential), [embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding), [LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM), [dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense), [dropout](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout).

To know more about regularization, overfitting and dropout strategy, please consult this [link](https://machinelearningmastery.com/dropout-for-regularizing-deep-neural-networks/).

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

In [None]:
input_length = max_seq_len - 1

In [None]:
model = Sequential()
model.add(Embedding(total_words_in_vocab, 10, input_length=input_length))
model.add(LSTM(100))
model.add(Dropout(0.1))
model.add(Dense(total_words_in_vocab, activation='softmax'))


Compile the model by specifying the loss function and the optimizer which the model will use during training. To know more about different loss functions and optimizers, go through these links. [link1](https://medium.com/data-science-group-iitr/loss-functions-and-optimization-algorithms-demystified-bb92daff331c), [link2](https://towardsdatascience.com/estimators-loss-functions-optimizers-core-of-ml-algorithms-d603f6b0161a), [link3](https://towardsdatascience.com/optimizers-for-training-neural-network-59450d71caf6)

In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 27, 10)            160270    
                                                                 
 lstm (LSTM)                 (None, 100)               44400     
                                                                 
 dropout (Dropout)           (None, 100)               0         
                                                                 
 dense (Dense)               (None, 16027)             1618727   
                                                                 
Total params: 1,823,397
Trainable params: 1,823,397
Non-trainable params: 0
_________________________________________________________________
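
The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on the architecture. The sketch below applies the standard formulas for embedding, LSTM, and dense layers; the variable names are chosen for illustration only.

```python
vocab_size = 16027      # output width of the Dense layer in the summary
embedding_dim = 10
lstm_units = 100

# Embedding: one vector of size embedding_dim per vocabulary entry
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense: weight matrix plus one bias per output class
dense_params = lstm_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 160270 44400 1618727 1823397
```

Most of the parameters sit in the final softmax layer, so the vocabulary size dominates the size of the model.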


In [None]:
model.fit(inputs, label, epochs=100)
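
During training, `categorical_crossentropy` compares the softmax output with the one-hot label. A minimal pure-Python sketch of the per-example loss, using made-up probabilities over a three-word vocabulary, might look like this. The function below is a hypothetical re-implementation for illustration, and the small `eps` clip is an assumption added to avoid `log(0)`.

```python
import math

def categorical_crossentropy(one_hot_label, predicted_probs, eps=1e-7):
    """Per-example loss: minus the log-probability assigned to the true class."""
    return -sum(y * math.log(max(p, eps))
                for y, p in zip(one_hot_label, predicted_probs))

# The true next word is class 1 in both cases
confident = categorical_crossentropy([0, 1, 0], [0.05, 0.90, 0.05])
unsure = categorical_crossentropy([0, 1, 0], [0.40, 0.30, 0.30])
print(round(confident, 4), round(unsure, 4))
```

The confident prediction receives the smaller loss, which is the signal the optimizer uses to push probability mass toward the correct next word.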

Once training is finished, you can save the model. To learn more about saving and loading models, follow this [link](https://www.tensorflow.org/guide/keras/save_and_serialize).

In [None]:
model.save('lstm_text_autocomplete')
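
Note that `model.save` stores the network only. The word-to-index mapping lives in the tokenizer, and generating suggestions in a fresh session needs that same mapping. One possible way to keep it, sketched here with a toy dictionary standing in for `tokenizer.word_index` and a hypothetical file name, is to write it to JSON alongside the model:

```python
import json
import os
import tempfile

# Toy stand-in for tokenizer.word_index (assumption for illustration)
word_index = {'here': 1, 'is': 2, 'our': 3, 'forecast': 4}

# Hypothetical location; in practice this would sit beside the saved model
path = os.path.join(tempfile.mkdtemp(), 'word_index.json')

with open(path, 'w', encoding='utf-8') as f:
    json.dump(word_index, f)

with open(path, encoding='utf-8') as f:
    restored_word_index = json.load(f)

# Rebuild the reverse mapping used to turn predicted ids back into words
restored_index_word = {i: w for w, i in restored_word_index.items()}
print(restored_index_word[4])
```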

## Generate autocomplete suggestions using the trained model

In [None]:
from tensorflow.keras.models import load_model

In [None]:
model = load_model('lstm_text_autocomplete')

In [None]:
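def generate_autocomplete_suggestions(seed_sentence, no_of_next_words,
                                      model, max_sequence_len):
    for _ in range(no_of_next_words):
        # Convert the current text into a sequence of token ids
        sequence = tokenizer.texts_to_sequences([seed_sentence])[0]

        # Pad to the input length the model was trained on
        padded_sequence = pad_sequences([sequence],
                                        maxlen=max_sequence_len-1,
                                        padding='pre')

        predictions = model.predict(padded_sequence, verbose=0)

        # Greedy choice: take the most probable next token
        predicted_label = np.argmax(predictions, axis=1)[0]

        # Index 0 is the padding id and has no word; stop if it is predicted
        next_word = tokenizer.index_word.get(predicted_label)
        if next_word is None:
            break

        seed_sentence += " " + next_word

    return seed_sentence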

In [None]:
print (generate_autocomplete_suggestions("In response to your earlier email", 10,
                                         model, max_seq_len))

print (generate_autocomplete_suggestions("I am happy to", 10,
                                         model, max_seq_len))

print (generate_autocomplete_suggestions("What is the status", 3,
                                         model, max_seq_len))

print (generate_autocomplete_suggestions("Here is the data", 3,
                                         model, max_seq_len))

print (generate_autocomplete_suggestions("Thank you very much", 4,
                                         model, max_seq_len))

print (generate_autocomplete_suggestions("I got your email", 17,
                                         model, max_seq_len))

In response to your earlier email and will be sent to you and the other group
I am happy to get a little bit of the ball at the same
What is the status of the project
Here is the data of the company
Thank you very much for your help and
I got your email to me and if you have any questions or concerns about this process please let me know
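
The function above always takes the single most likely word (greedy decoding). An autocomplete interface often shows several candidates instead. As a sketch only, the helper below (hypothetical name `top_k_words`) picks the `k` most probable next words from one row of softmax output; the probabilities and the tiny `toy_index_word` mapping are made up. Index 0 is skipped because it is the padding value and has no entry in the index-to-word mapping.

```python
def top_k_words(probabilities, index_word, k=3):
    """Return the k most probable (word, probability) pairs, ignoring index 0."""
    ranked = sorted(range(1, len(probabilities)),
                    key=lambda i: probabilities[i], reverse=True)
    return [(index_word[i], probabilities[i]) for i in ranked[:k]]

toy_index_word = {1: 'for', 2: 'your', 3: 'help', 4: 'and'}
toy_probabilities = [0.02, 0.50, 0.08, 0.30, 0.10]   # position 0 is padding

suggestions = top_k_words(toy_probabilities, toy_index_word, k=2)
print(suggestions)  # [('for', 0.5), ('help', 0.3)]
```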
