# Data Preparation for Implementation of Email Auto-completion

## Objectives

* Explore Enron email dataset
* Extract email messages from raw email data
* Clean email messages
* Preprocess email messages to make them ready for training

## The Problem

Suppose we work at Globomantics, one of the most popular email applications in the world. To improve the user experience, we want to build an intelligent system that offers auto-completion suggestions while users compose an email. The suggestions must be relevant and useful, otherwise they will not improve the user experience.

## Dataset

We'll be using the Enron email dataset which is one of the most popular email datasets. The dataset can be downloaded from [here](https://www.kaggle.com/code/abhaytomar/starter-the-enron-email-dataset-8c90cc3c-1/data).

This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron. The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation. More information about this dataset can be found [here](https://www.cs.cmu.edu/~enron/).

## Explore the Dataset

Import basic libraries such as pandas and numpy. We also import the [email](https://docs.python.org/3/library/email.html) package, which we'll use to extract email messages from the raw email data, and Python's [regular expressions](https://docs.python.org/3/library/re.html) package, which we'll need to preprocess the text data.

In [10]:
import pandas as pd
import numpy as np
import email
import re

In [None]:
!kaggle datasets download -d wcukierski/enron-email-dataset
!unzip enron-email-dataset.zip

The Enron email dataset has been downloaded from [here](https://www.kaggle.com/code/abhaytomar/starter-the-enron-email-dataset-8c90cc3c-1/data) and saved as a CSV file.

In [None]:
emails_raw = pd.read_csv('emails.csv')
emails_raw.head()

In [None]:
emails_raw.shape

In [None]:
print(emails_raw['message'].iloc[1])

## Extract Email Message from Raw Email Data

In [None]:
sample_email = emails_raw['message'].iloc[1]
print(sample_email)

We'll create an email object using [message_from_string](https://docs.python.org/3/library/email.parser.html). Then, we'll traverse the email object and collect all the plaintext components using the functions [walk](https://docs.python.org/3/library/email.message.html), [get_content_type](https://docs.python.org/3/library/email.message.html) and [get_payload](https://docs.python.org/3/library/email.compat32-message.html).
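
To make these three calls concrete, here is a minimal sketch on a hand-written message. The addresses, subject and body are made up for illustration and are not taken from the Enron data.

```python
import email

# Hypothetical raw message, written by hand for illustration only
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Forecast\n"
    "Content-Type: text/plain\n"
    "\n"
    "Here is our forecast.\n"
)

# Parse the raw text into a message object
msg = email.message_from_string(raw)

# walk() visits every part of the message; keep only the plain-text payloads
plain_parts = [
    part.get_payload()
    for part in msg.walk()
    if part.get_content_type() == "text/plain"
]
body = "".join(plain_parts)

print(msg["Subject"])  # Forecast
print(repr(body))      # 'Here is our forecast.\n'
```

The headers are separated from the body by the first empty line, so only the text after that line ends up in the payload.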

In [None]:
sample_email_object = email.message_from_string(sample_email)

In [None]:
plain_text_parts = []
for part in sample_email_object.walk():
    # print(part)
    if part.get_content_type() == 'text/plain':
        plain_text_parts.append(part.get_payload())
sample_content = ''.join(plain_text_parts)
print(sample_content)

In [None]:
def extract_plaintext_content(raw_email):
    # create an email object
    email_object = email.message_from_string(raw_email)

    # create an empty list to store all the plaintext components
    plain_text_parts = []

    # traverse over different parts of the email object and collect the plaintext parts
    for part in email_object.walk():

        # check if the part is of type plaintext
        if part.get_content_type() == 'text/plain':

            # get payload if the type is plaintext
            plain_text_parts.append(part.get_payload())

    # concatenate the plaintext parts and return
    return ''.join(plain_text_parts)

In [None]:
# Apply the function over all the raw emails to extract the email message
emails = pd.DataFrame()
emails['content'] = [extract_plaintext_content(i) for i in emails_raw['message']]
emails.head()

In [None]:
print(emails["content"].iloc[1])

## Clean the Dataset by Discarding Unnecessary Emails

Remove all Outlook migration emails.

In [None]:
print(emails["content"].iloc[61])

In [None]:
outlook_migration_emails = emails[emails['content'].str.contains('Outlook Migration Team@ENRON')]
print("No of outlook migration emails", outlook_migration_emails.shape[0])

In [None]:
emails = emails[~emails['content'].str.contains('Outlook Migration Team@ENRON')]

Remove all the sales emails

In [None]:
trade_count_emails = emails[emails['content'].str.contains('Trade Counts and Volume')]
print(trade_count_emails["content"].iloc[0])

In [None]:
emails = emails[~emails['content'].str.contains('Trade Counts and Volume')]

Handle forwarded emails. For simplicity and to save time, we'll remove all the forwarded emails here. If you are interested, you can write a Python function that parses these emails, keeps only the relevant portions and discards the rest. This can get complicated because one email can be forwarded many times. The following links show how to do it: [link1](https://medium.com/@jubergandharv/email-smart-compose-real-time-assisted-writing-b232191d0681), [link2](https://medium.com/analytics-vidhya/email-smart-compose-assist-in-sentence-completion-b706269da181).
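
As a rough illustration of the parsing idea (not the approach taken in this notebook, and simpler than the linked articles), the sketch below keeps only the text above the first forward or reply marker. The helper name `keep_text_before_forward` and the constant `FORWARD_MARKERS` are hypothetical, and the pattern assumes the two marker styles that are filtered in this notebook; real emails may use other variants.

```python
import re

# Assumption: these two marker styles are the ones filtered in this notebook
FORWARD_MARKERS = re.compile(r"-{5,}\s*Forwarded by|-----Original Message-----")

def keep_text_before_forward(content):
    """Return only the text written above the first forward/reply marker."""
    match = FORWARD_MARKERS.search(content)
    if match is None:
        # no marker found: the email is kept unchanged
        return content
    # drop the marker and everything after it
    return content[:match.start()].rstrip()

forwarded = (
    "Please see below.\n\n"
    "---------------------- Forwarded by Jane Doe on 01/01/2001 ----------------------\n"
    "Older text that we do not want to keep"
)
print(keep_text_before_forward(forwarded))  # Please see below.
```

A sketch like this keeps the sender's own words and drops the quoted history, at the cost of losing emails where the useful text comes after the marker.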

In [None]:
print(emails["content"].iloc[9])

Remove all the emails which contain the string "Forwarded by"

In [None]:
emails = emails[~emails['content'].str.contains('---------------------- Forwarded by')]

Remove all the emails which contain the string "Original Message"

In [None]:
emails = emails[~emails['content'].str.contains('-----Original Message-----')]

In [None]:
emails.shape

## Functions to Preprocess Text Data

We'll use regular expressions extensively for text preprocessing. A full discussion of regular expressions is out of scope for this course. However, I would highly encourage you to go through these links to learn more about them: [link1](https://en.wikipedia.org/wiki/Regular_expression), [link2](https://developers.google.com/edu/python/regular-expressions), [link3](https://docs.python.org/3/library/re.html).

We'll mainly use two functions, "re.search(...)" and "re.sub(...)", from the Python "re" package. The first searches for a particular pattern in a string and the second substitutes every match of a pattern with a replacement string. You can find more details about these two functions [here](https://docs.python.org/3/library/re.html).
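
A small sketch of both functions on a made-up string:

```python
import re

text = "Call me at 5 pm"

# re.search returns a match object if the pattern occurs anywhere, else None
has_digit = bool(re.search(r"\d", text))

# re.sub replaces every match of the pattern with the replacement string
masked = re.sub(r"\d", "#", text)

print(has_digit)  # True
print(masked)     # Call me at # pm
```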

Check if a pattern is present in a string. More about how to form patterns and search function can be found [here](https://docs.python.org/3/library/re.html)

In [None]:
# Function to check if a text contains numbers
def containNumbers(text):
    # check if there is any digit in the text
    return bool(re.search(r'\d', text))

Function to find out if the string contains special characters. More about forming patterns, compile and search functions [here](https://docs.python.org/3/library/re.html).

In [None]:
# Function to check that a text contains no special characters
def notContainSpecialCharacters(text):
    # compile the special-characters pattern into a regular expression object
    # (raw string, so the backslashes reach the regex engine unchanged)
    regexp = re.compile(r"""[-+=*@_#`"$%^&*[\]\()<>/\|}{~:]""")

    # return True only if none of the special characters occurs in the text
    return regexp.search(text) is None

The "re.sub(...)" method can be used to substitute a pattern in the string with some other pattern. More about this function [here](https://docs.python.org/3/library/re.html).
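
When several substitutions are chained, their order matters: an irregular form such as "won't" has to be expanded before the generic "n't" rule runs. A minimal sketch on a made-up sentence:

```python
import re

text = "I won't be late"

# Generic rule first: "won't" is split into "wo" + " not"
generic_first = re.sub(r"n't", " not", text)

# Specific rule first, then the generic rule: the irregular form is expanded correctly
specific_first = re.sub(r"n't", " not", re.sub(r"won't", "will not", text))

print(generic_first)   # I wo not be late
print(specific_first)  # I will not be late
```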

In [None]:
#Decontraction of text
def removeShortForms(text):
    """
    Returns decontracted phrases
    """
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can\'t", "can not", text)
    text = re.sub(r"n\'t", " not", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'s", " is", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'t", " not", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'m", " am", text)
    return text

Remove extra spaces from the string and convert it to lowercase. First, we'll replace every run of characters other than uppercase or lowercase letters with a single space using the "re.sub(...)" method. Then, we'll split the string into words using the string "split(...)" method and join them again using the string "join(...)" method. This effectively removes all the extra spaces within the string. Finally, we'll remove the leading and trailing spaces using the string "strip(...)" method. To know more about these string operations, you can go through this [link](https://docs.python.org/3.3/library/stdtypes.html?highlight=split).
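
The intermediate results of these steps, shown on a made-up string:

```python
import re

text = "  Hello,   World!  It's 5pm. "

# Step 1: every run of non-letter characters becomes a single space
letters_only = re.sub("[^A-Za-z]+", " ", text)
print(repr(letters_only))  # ' Hello World It s pm '

# Step 2: split on whitespace, which drops empty pieces
words = letters_only.split()
print(words)  # ['Hello', 'World', 'It', 's', 'pm']

# Step 3: join with single spaces, strip the ends and lowercase
cleaned = " ".join(words).strip().lower()
print(cleaned)  # hello world it s pm
```

Note that the apostrophe and the digit are removed as well, which is why contractions are expanded before this step.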

In [None]:
def removeSpacesAndConvertToLowercase(text):
    # replace every run of non-letter characters with a single space
    text = re.sub('[^A-Za-z]+', ' ', text)

    # Remove extra spaces in between, leading and trailing spaces, and convert to lowercase
    return ' '.join(text.split()).strip().lower()

## Preprocess the Email Dataset

In [None]:
def getPreprocessedSentencesFromEmail(email_text):
    # the parameter is named email_text so that it does not shadow the imported email module
    sentences = []

    # split the full email on sentence boundaries and iterate
    for sentence in email_text.split('.'):

        # remove leading and trailing spaces from the sentence
        sentence = sentence.strip()

        # find out the number of words in the sentence
        no_of_words = len(sentence.split())

        # keep sentences of 3 to 25 words that start with a capital letter,
        # are otherwise lowercase, and contain no digits or special characters
        if 3 <= no_of_words <= 25 \
            and sentence[0].isupper() and sentence[1:].islower() \
            and not containNumbers(sentence) \
            and notContainSpecialCharacters(sentence):
                # expand the contracted words
                sentence = removeShortForms(sentence)

                # split the sentence after each question mark
                for j in re.split('(?<=[?]) +',sentence):

                    # remove extra spaces and append to the list
                    sentences.append(removeSpacesAndConvertToLowercase(j))

    return pd.DataFrame({'sentence':sentences})
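
One consequence of the case check in the filter above is worth seeing on examples: a sentence is kept only if its first character is uppercase and every later letter is lowercase, so sentences that contain a proper name or the pronoun "I" after the first word are discarded. The helper name `has_sentence_case` is hypothetical and only isolates that one condition.

```python
def has_sentence_case(sentence):
    # Same test as in the filter above: first character uppercase,
    # all remaining cased characters lowercase
    return sentence[0].isupper() and sentence[1:].islower()

examples = [
    "Here is our forecast",       # kept
    "Please call John tomorrow",  # dropped: capital letter inside the sentence
    "here is our forecast",       # dropped: does not start with a capital letter
]
results = [has_sentence_case(s) for s in examples]
print(results)  # [True, False, False]
```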

In [None]:
def applyPreprocessingStepsOnAllEmails(emails):
    # preprocess every email separately and collect the per-email dataframes
    frames = [getPreprocessedSentencesFromEmail(content)
              for content in emails['content']]

    # concatenate once at the end; concatenating inside the loop
    # copies the growing dataframe on every iteration
    result_df = pd.concat(frames, ignore_index=True)

    # drop duplicate sentences
    result_df = result_df.drop_duplicates()
    return result_df

In [None]:
sentences = applyPreprocessingStepsOnAllEmails(emails)

In [None]:
sentences.shape

In [None]:
for sentence in sentences.sentence.sample(10, random_state=40):
    print(sentence)

In [None]:
sentences.shape

In [None]:
#Save dataframe
sentences.to_csv('sentences.csv', index=False)

# MODEL

In [11]:
import tensorflow as tf
import pandas as pd
import numpy as np

In [12]:
sentence_df = pd.read_csv('sentences.csv')
sentence_df = sentence_df.dropna()
sentence_df.head()

Unnamed: 0,sentence
0,here is our forecast
1,traveling to have a business meeting takes the...
2,especially if you have to prepare a presentation
3,i would suggest holding the business plan meet...
4,i would even try and get some honest opinions ...


In [13]:
sentences = sentence_df.sentence.values
print("Total number of sentences: ", len(sentences))

Total number of sentences:  152831


In [14]:
sentences[0:10]

array(['here is our forecast',
       'traveling to have a business meeting takes the fun out of the trip',
       'especially if you have to prepare a presentation',
       'i would suggest holding the business plan meetings here then take a trip without any formal business meetings',
       'i would even try and get some honest opinions on whether a trip is even desired or necessary',
       'too often the presenter speaks and the others are quiet just waiting for their turn',
       'the meetings might be better if held in a round table discussion format',
       'play golf and rent a ski boat and jet ski is',
       'flying somewhere takes too much time',
       'plus your thoughts on any changes that need to be made'],
      dtype=object)

In [15]:
sentences = sentences[:30000]

## Tokenization of the sentences

We'll use the Keras "Tokenizer" class and its methods to tokenize the sentences, build the vocabulary and create the word-to-number mapping. To know more about the "Tokenizer" class, please consult this [link](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer).

In [16]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [17]:
test_tokenizer = Tokenizer()

In [18]:
test_sentences = ['here is our forecast',
                  'especially if you have to prepare a presentation']

In [19]:
test_tokenizer.fit_on_texts(test_sentences)

In [20]:
test_tokenizer.word_index

{'here': 1,
 'is': 2,
 'our': 3,
 'forecast': 4,
 'especially': 5,
 'if': 6,
 'you': 7,
 'have': 8,
 'to': 9,
 'prepare': 10,
 'a': 11,
 'presentation': 12}

In [21]:
test_tokenizer.index_word

{1: 'here',
 2: 'is',
 3: 'our',
 4: 'forecast',
 5: 'especially',
 6: 'if',
 7: 'you',
 8: 'have',
 9: 'to',
 10: 'prepare',
 11: 'a',
 12: 'presentation'}

In [22]:
test_sentence = "here you have our presentation"
test_token_list = test_tokenizer.texts_to_sequences([test_sentence])[0]
print(test_token_list)

[1, 7, 8, 3, 12]


In [23]:
n_grams = []
for i in range(1, len(test_token_list)):
    n_gram = test_token_list[:i+1]
    n_grams.append(n_gram)
print(n_grams)

[[1, 7], [1, 7, 8], [1, 7, 8, 3], [1, 7, 8, 3, 12]]


In [24]:
tokenizer = Tokenizer()
def convertSentencesIntoSeqOfTokens(sentences):
    tokenizer.fit_on_texts(sentences)
    total_words_in_vocab = len(tokenizer.word_index) + 1

    input_sequences = []
    for sentence in sentences:
        seq_of_tokens = tokenizer.texts_to_sequences([sentence])[0]
        for i in range(1, len(seq_of_tokens)):
            n_gram = seq_of_tokens[:i+1]
            input_sequences.append(n_gram)
    return input_sequences, total_words_in_vocab

In [25]:
input_sequences, total_words_in_vocab = convertSentencesIntoSeqOfTokens(sentences)
input_sequences[:10]

[[98, 4],
 [98, 4, 41],
 [98, 4, 41, 1828],
 [2263, 2],
 [2263, 2, 17],
 [2263, 2, 17, 5],
 [2263, 2, 17, 5, 111],
 [2263, 2, 17, 5, 111, 114],
 [2263, 2, 17, 5, 111, 114, 758],
 [2263, 2, 17, 5, 111, 114, 758, 1]]

## Handle variable sentence lengths by padding

We'll use the Keras "pad_sequences" function to pad the shorter sequences with zeros so that all sequences have the same length. To know more about this function, please go through this [link](https://www.tensorflow.org/api_docs/python/tf/keras/utils/pad_sequences).

In [26]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [27]:
test_sequences = [[2025, 2], [2025, 2, 16], [2025, 2, 16, 6],
                  [2025, 2, 16, 6, 135], [2025, 2, 16, 6, 135, 119]]

In [28]:
pad_sequences(test_sequences, maxlen=6, padding='pre')

array([[   0,    0,    0,    0, 2025,    2],
       [   0,    0,    0, 2025,    2,   16],
       [   0,    0, 2025,    2,   16,    6],
       [   0, 2025,    2,   16,    6,  135],
       [2025,    2,   16,    6,  135,  119]], dtype=int32)
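
The pre-padding shown above can be reproduced in plain Python, which makes the rule explicit. This is only an illustrative sketch, not the Keras implementation; the helper name `pad_pre` is hypothetical, and it assumes that over-long sequences keep their last `maxlen` tokens.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to `maxlen` (illustration only)."""
    padded = []
    for seq in sequences:
        # keep only the last maxlen tokens if the sequence is too long
        seq = seq[-maxlen:]
        # add the missing positions in front of the sequence
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

test_sequences = [[2025, 2], [2025, 2, 16], [2025, 2, 16, 6]]
for row in pad_pre(test_sequences, maxlen=6):
    print(row)
# [0, 0, 0, 0, 2025, 2]
# [0, 0, 0, 2025, 2, 16]
# [0, 0, 2025, 2, 16, 6]
```

Padding in front keeps the most recent words next to the position where the next word is predicted.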

In [29]:
def generateSameLengthSentencesByPadding(sequences):
    # Find length of the longest sequence
    max_seq_len = max([len(x) for x in sequences])

    # Pad the sequences
    padded_sequences = np.array(pad_sequences(sequences, maxlen=max_seq_len, padding='pre'))

    # Return padded sequences and the max length
    return padded_sequences, max_seq_len

In [30]:
padded_sequences, max_seq_len = generateSameLengthSentencesByPadding(input_sequences)
print(padded_sequences.shape)
padded_sequences[0]


(319492, 28)


array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0, 98,  4], dtype=int32)

## Generate predictors and labels for training

We import the Keras utils module here because we need it to convert the labels to one-hot encoded vectors. We'll use the function "to_categorical" from this library to do this. To know more about this function, please check out this [link](https://www.tensorflow.org/api_docs/python/tf/keras/utils/to_categorical).

In [31]:
import tensorflow.keras.utils as ku

We'll use array slicing techniques to retrieve the inputs and the labels. To know more about how indexing into a numpy array is done, please go through the following resources. [link1](https://towardsdatascience.com/slicing-numpy-arrays-like-a-ninja-e4910670ceb0), [link2](https://www.tutorialspoint.com/numpy/numpy_indexing_and_slicing.htm)
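
What the slicing and the one-hot encoding produce for a single padded row can be sketched in plain Python. The helper name `split_predictors_and_label` and the tiny vocabulary size are assumptions made for illustration; the notebook itself applies the same idea to the whole numpy array at once.

```python
def split_predictors_and_label(padded_row, num_classes):
    """Last token becomes a one-hot label, the rest are the predictors (illustration only)."""
    predictors, label_index = padded_row[:-1], padded_row[-1]

    # one-hot vector: all zeros except a one at the position of the label
    one_hot = [0.0] * num_classes
    one_hot[label_index] = 1.0
    return predictors, one_hot

row = [0, 0, 3, 1, 4]  # hypothetical padded sequence
predictors, one_hot = split_predictors_and_label(row, num_classes=6)
print(predictors)  # [0, 0, 3, 1]
print(one_hot)     # [0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
```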

In [32]:
def generatePredictorsAndLabels(padded_sequences):
    inputs, label = padded_sequences[:,:-1], padded_sequences[:,-1]
    label = ku.to_categorical(label, num_classes = total_words_in_vocab)
    return inputs, label

In [33]:
inputs, label = generatePredictorsAndLabels(padded_sequences)

## Create and train the model

Import the Sequential model and the Embedding, LSTM, Dense and Dropout layers from Keras. To know more about them, consult these links: [sequential](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential), [embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding), [LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM), [dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense), [dropout](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout).

To know more about regularization, overfitting and dropout strategy, please consult this [link](https://machinelearningmastery.com/dropout-for-regularizing-deep-neural-networks/).

In [34]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

In [35]:
input_length = max_seq_len - 1 # because the last was used as label
print(f"input_length: {input_length}")
print(f"total_words_in_vocab: {total_words_in_vocab}")

input_length: 27
total_words_in_vocab: 16027


In [36]:
model = Sequential()
model.add(Embedding(input_dim=total_words_in_vocab, output_dim=10)) #Turns positive integers (indexes) into dense vectors of fixed size
model.add(LSTM(100))
model.add(Dropout(0.1))
model.add(Dense(total_words_in_vocab, activation='softmax'))

Compile the model by specifying the loss function and the optimizer which the model will use during training. To know more about different loss functions and optimizers, go through these links. [link1](https://medium.com/data-science-group-iitr/loss-functions-and-optimization-algorithms-demystified-bb92daff331c), [link2](https://towardsdatascience.com/estimators-loss-functions-optimizers-core-of-ml-algorithms-d603f6b0161a), [link3](https://towardsdatascience.com/optimizers-for-training-neural-network-59450d71caf6)
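
To give a feel for the loss used here, the sketch below computes categorical cross-entropy by hand for one sample with a made-up three-word vocabulary. It follows the textbook formula, the negative sum of label times log-probability, and ignores the numerical clipping that library implementations typically add.

```python
import math

def categorical_crossentropy(one_hot_label, predicted_probs):
    # -sum(y_i * log(p_i)); only the true class contributes because the label is one-hot
    return -sum(y * math.log(p)
                for y, p in zip(one_hot_label, predicted_probs) if y > 0)

label = [0.0, 1.0, 0.0]      # the true next word is word 1
confident = [0.1, 0.8, 0.1]  # model puts most probability on the true word
unsure = [0.4, 0.3, 0.3]     # model spreads probability across words

loss_confident = categorical_crossentropy(label, confident)
loss_unsure = categorical_crossentropy(label, unsure)
print(round(loss_confident, 3))  # 0.223
print(round(loss_unsure, 3))     # 1.204
```

The loss shrinks as the probability assigned to the true next word grows, which is what training pushes the model towards.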

In [29]:
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()

In [None]:
tf.debugging.set_log_device_placement(True)

try:
  # Request the first GPU for training
  with tf.device('/device:GPU:0'):
    history = model.fit(inputs, label, epochs=100, verbose=1)
except RuntimeError as e:
  print(e)

Once you are done with the training, you can save your model. To know more about this, follow this [link](https://www.tensorflow.org/guide/keras/save_and_serialize)

In [None]:
model.save('lstm_text_autocomplete')

## Generate autocomplete suggestions using the trained model

In [None]:
from tensorflow.keras.models import load_model

In [None]:
model = load_model('lstm_text_autocomplete')

In [None]:
def generate_autocomplete_suggestions(seed_sentence, no_of_next_words,
                                      model, max_sequence_len):
    for _ in range(no_of_next_words):
        sequence = tokenizer.texts_to_sequences([seed_sentence])[0]

        padded_sequence = pad_sequences([sequence],
                                        maxlen=max_sequence_len-1,
                                        padding='pre')

        predictions = model.predict(padded_sequence, verbose=0)

        predicted_label = np.argmax(predictions, axis=1)[0]

        next_word = tokenizer.index_word[predicted_label]

        seed_sentence += " "+ next_word

    return seed_sentence

In [None]:
print (generate_autocomplete_suggestions("In response to your earlier email", 10,
                                         model, max_seq_len))

print (generate_autocomplete_suggestions("I am happy to", 10,
                                         model, max_seq_len))

print (generate_autocomplete_suggestions("What is the status", 3,
                                         model, max_seq_len))

print (generate_autocomplete_suggestions("Here is the data", 3,
                                         model, max_seq_len))

print (generate_autocomplete_suggestions("Thank you very much", 4,
                                         model, max_seq_len))

print (generate_autocomplete_suggestions("I got your email", 17,
                                         model, max_seq_len))