# Email Auto Fill
- **Text Preprocessing**
    - *Contractions, Sentence Tokenization*
- **Basic EDA**
    - *Word Cloud*
- **Probabilistic Language Models**
    - *Unigram, Bigrams, Trigrams, N-grams*
- **Model Evaluation**
    - *Perplexity* 

In [1]:
import numpy as np
import pandas as pd
import nltk, re, string, contractions
from nltk.tokenize import sent_tokenize, word_tokenize
import email
from nltk.util import bigrams, trigrams

import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv(r'F:\Muthu_2023\Personal\NextStep\NLP\NLP\Dataset\Email\email_truncated.csv')
# df = pd.read_csv(r'E:\Nextstep\NLP\Dataset\Email\email_truncated.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,file,message
0,0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...
1,1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...
2,2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...
3,3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...
4,4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...


**Text Preprocessing:**
- **`Using email library, extract body from the complete message`**
- **`Remove all newline characters`**
- **`Remove all non-alphanumeric characters (keeping spaces and periods)`**
- **`Strip surrounding whitespace and lowercase the text`**
- **`Apply contractions`**
- **`Create a column with sentences as list elements for each message in main dataframe`**

In [3]:
def extractMessage(message):
    e = email.message_from_string(message)
    return e.get_payload().lower()
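
`get_payload()` returns a string only for single-part messages; for a multipart message it returns a list of sub-messages, and calling `.lower()` on that list would fail. The sketch below is one possible way to handle both cases with the standard `email` package. The function name `extract_body` and the sample messages are hypothetical and not part of the original notebook.

```python
import email


def extract_body(raw_message):
    """Return the lowercased plain-text body of a raw email string."""
    msg = email.message_from_string(raw_message)
    if msg.is_multipart():
        # Keep only the text/plain parts; HTML parts and attachments are skipped.
        parts = [
            part.get_payload()
            for part in msg.walk()
            if part.get_content_type() == "text/plain"
        ]
        return "\n".join(parts).lower()
    return msg.get_payload().lower()


# Hypothetical single-part message
raw = "Message-ID: <1@example>\nSubject: Test\n\nHello Team,\nSee you Monday."

# Hypothetical multipart message with one plain-text and one HTML part
multipart = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "Part One\n"
    "--XX\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>no</p>\n"
    "--XX--\n"
)

print(extract_body(raw))
print(extract_body(multipart))
```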

In [238]:
def sentence_tokenization(text):
    sentence_list = sent_tokenize(text)
    transformed_sent = []
    for sentence in sentence_list:
        sentence = (re.sub("[^a-zA-Z0-9 ]", "", sentence))
        words = []
        for word in sentence.split():
            if len(word) < 20 and word.strip().isalpha():
                words.append(word.strip())
        if len(words) > 0:
            transformed_sent.append(" ".join(words))
    return transformed_sent

In [240]:
def text_preprocess(local_df):
    local_df['content'] = local_df['message'].apply(extractMessage)
    local_df['content'] = local_df['content'].str.replace("\n", " ")
    local_df['content'] = local_df['content'].apply(lambda x: re.sub(r"[^a-zA-Z0-9 .]", "", x))
    local_df['content'] = local_df['content'].str.strip().str.lower()
    local_df['content'] = local_df['content'].apply(lambda x: contractions.fix(x))
    local_df['sent_list'] = local_df['content'].apply(sentence_tokenization)
    return local_df

In [None]:
df = text_preprocess(df)

# EDA

**`Histogram of message length (number of characters per message)`**

In [None]:
plt.hist(df['content'].apply(lambda x: len(x)), bins=1000)
plt.xlim(0,10000)
plt.show()

**`Generate Word Count Vector for the complete corpus`**

In [8]:
d = {}
for sent_tokens in df['sent_list']:
    for sent in sent_tokens:
        for word in sent.split():
            word = word.replace(".", "").strip()
            if word in d:
                d[word] += 1
            else:
                d[word] = 1

**`Sort the words by total count in the corpus and show the top N`**

In [11]:
sorted(d.items(), key=lambda x: x[1], reverse=True)[:20]

[('the', 194347),
 ('to', 140013),
 ('and', 87514),
 ('a', 79911),
 ('of', 72795),
 ('in', 61234),
 ('you', 54237),
 ('for', 52873),
 ('is', 50794),
 ('on', 47504),
 ('i', 39791),
 ('this', 34549),
 ('that', 34048),
 ('not', 29304),
 ('be', 29176),
 ('will', 28690),
 ('from', 28524),
 ('with', 27117),
 ('at', 26695),
 ('have', 26266)]
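
The counting loop and the sort can also be written with `collections.Counter` from the standard library. This is a minimal sketch on a made-up list of sentence lists (`toy_sent_lists` is a hypothetical stand-in for `df['sent_list']`), not a replacement for the cells above.

```python
from collections import Counter

# Hypothetical stand-in for df['sent_list']: one list of sentences per message
toy_sent_lists = [["the cat sat", "the dog"], ["the cat"]]

# Count every whitespace-separated word across all sentences
word_counts = Counter(
    word
    for sent_list in toy_sent_lists
    for sent in sent_list
    for word in sent.split()
)

# most_common(n) returns the n highest counts in descending order
top_words = word_counts.most_common(2)
print(top_words)
```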

***All of the top 20 words are stopwords***

# Word Cloud

**`Build Word cloud from the email body texts`**

# Bigram Model

**`Generate a bigram dictionary of occurrence frequencies in the form {(currentword, nextword): freq}`**

In [138]:
bi_dict = {}
for message in df['sent_list']:
    for sentence in message:
        for words in bigrams(sentence.split()):
            if words in bi_dict:
                bi_dict[words] += 1
            else:
                bi_dict[words] = 1            

In [139]:
bi_dict_prob = {}
for w1, w2 in bi_dict:
    bi_dict_prob[(w1, w2)] = bi_dict[(w1, w2)] / d[w1]
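
The division above is the maximum likelihood estimate P(w2 | w1) = count(w1, w2) / count(w1). A small worked example on a made-up corpus may make the numbers easier to check by hand; `toy_sentences` and the variable names below are hypothetical.

```python
from collections import Counter

# Hypothetical three-sentence corpus
toy_sentences = ["i am here", "i am late", "i will call"]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in toy_sentences:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    # Pair each token with its successor to form bigrams
    bigram_counts.update(zip(tokens, tokens[1:]))

# P(w2 | w1) = count(w1, w2) / count(w1)
bigram_prob = {
    (w1, w2): count / unigram_counts[w1]
    for (w1, w2), count in bigram_counts.items()
}

# "i" occurs 3 times and is followed by "am" twice
print(bigram_prob[("i", "am")])
# "am" occurs 2 times and is followed by "here" once
print(bigram_prob[("am", "here")])
```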

**`Sort the dictionary by current word and then by probability, both in descending order`**

In [112]:
bi_dict_sorted = dict(sorted(bi_dict_prob.items(), key=lambda x: (x[0][0], x[1]), reverse=True))

In [113]:
len(bi_dict_sorted)

626705

**`Create a DataFrame from the dictionary for easier processing (the Count column holds the conditional probability, not a raw count)`**

In [114]:
bi_df = pd.DataFrame(data = bi_dict_sorted.values(), columns=['Count'], index=bi_dict_sorted.keys())
bi_df.reset_index(inplace=True)
bi_df.head()

Unnamed: 0,level_0,level_1,Count
0,zypfje,baughmandon,1.0
1,zy,for,1.0
2,zwiebel,calls,0.5
3,zwiebel,and,0.5
4,zwerneman,jazztotalzonecom,1.0


**`Extract the top N next words as a list for each current word`**

In [115]:
N = 3
filtered_bi = bi_df.drop('Count', axis=1).groupby('level_0').head(N)
filtered_bi = filtered_bi.groupby('level_0')['level_1'].apply(list).reset_index()

**`Transform dataframe to dictionary with key as current word and values as N next words`**

In [116]:
# Map each current word to its list of top N next words
filtered_bi_dict = dict(zip(filtered_bi['level_0'], filtered_bi['level_1']))

**`Derive the next N words for the current word from the dictionary`**

In [173]:
def get_nextwords(Queryword, in_dict):
    if Queryword.lower() in in_dict:
        return in_dict[Queryword.lower()]
    else:
        return "Word not found in dictionary"

In [174]:
get_nextwords('I', filtered_bi_dict)

['am', 'have', 'will']

In [124]:
get_nextwords('how', filtered_bi_dict)

['to', 'about', 'much']

In [119]:
get_nextwords('to', filtered_bi_dict)

['the', 'be', 'get']

In [120]:
get_nextwords('the', filtered_bi_dict)

['following', 'last', 'new']

In [125]:
get_nextwords('they', filtered_bi_dict)

['are', 'have', 'rank']

In [127]:
get_nextwords('can', filtered_bi_dict)

['you', 'be', 'do']

**`Generate a sequence of the next M words from the current word`**

In [100]:
M = 10
CurrWord = 'Enron'
word_list = [CurrWord]
for x in range(M):    
    CurrWord = get_nextwords(CurrWord, filtered_bi_dict)[0]
    word_list.append(CurrWord)
print(" ".join(word_list))

Enron north america corp from the following the following the following


# Trigram Model

**`Generate conditional probability of trigram`**

In [145]:
tri_dict = {}
tri_dict_prob = {}
for message in df['sent_list']:
    for sentence in message:
        for words in trigrams(sentence.split()):
            if words in tri_dict:
                tri_dict[words] += 1
            else:
                tri_dict[words] = 1

for words in tri_dict:
    currwords = words[:-1]
    tri_dict_prob[words] = tri_dict[words] / bi_dict[currwords]

**`Sort the dictionary by bigram context and then by maximum likelihood probability (descending), and create a DataFrame`**

In [224]:
tri_dict_sort = dict(sorted(tri_dict_prob.items(), key = (lambda x: (x[0][:2], x[1])), reverse=True))
tri_dict_df = pd.DataFrame(data = tri_dict_sort.values(), columns=['Count'], index=tri_dict_sort.keys())
tri_dict_df.head()

**`Extract Top N next words for each bigrams`**

In [185]:
tri_dict_df_filtered = tri_dict_df.drop('Count', axis=1).reset_index().groupby(['level_0', 'level_1']).head(3)
tri_dict_df_filtered = tri_dict_df_filtered.groupby(['level_0', 'level_1'])['level_2'].apply(list).reset_index()

**`Convert the DataFrame to a dictionary with the current bigram as key and the list of next N words as value`**

In [187]:
# Map each (word1, word2) context to its list of top next words
filtered_tri_dict = {
    (w1, w2): next_words
    for w1, w2, next_words in zip(
        tri_dict_df_filtered['level_0'],
        tri_dict_df_filtered['level_1'],
        tri_dict_df_filtered['level_2'],
    )
}

**`Get next words list for the given bigram`**

In [188]:
def get_nextwords_tri(Query, in_dict):
    words = tuple(Query.lower().split())
    if words in in_dict:
        return in_dict[words]
    else:
        return "Words not found in dictionary"

**`Example results for a few bigrams`**

In [189]:
get_nextwords_tri('i am', filtered_tri_dict)

['not', 'going', 'sure']

In [190]:
get_nextwords_tri('i have', filtered_tri_dict)

['been', 'not', 'a']

In [205]:
get_nextwords_tri('season with', filtered_tri_dict)

['a', 'the', 'an']
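
When a two-word context was never seen in training, the trigram lookup has nothing to suggest. One common option, sketched here under the assumption that both lookup dictionaries are available, is to back off to the bigram dictionary using only the last word. The function `next_words_with_backoff` and the toy dictionaries are hypothetical and not part of the original notebook.

```python
def next_words_with_backoff(query, tri_dict, bi_dict):
    """Try the trigram dictionary first, then fall back to the bigram dictionary."""
    words = query.lower().split()
    # Use the last two words as the trigram context, if available
    if len(words) >= 2 and tuple(words[-2:]) in tri_dict:
        return tri_dict[tuple(words[-2:])]
    # Back off to the last word only
    if words and words[-1] in bi_dict:
        return bi_dict[words[-1]]
    # Nothing known about this context
    return []


# Hypothetical miniature versions of filtered_tri_dict and filtered_bi_dict
toy_tri = {("i", "am"): ["not", "going"]}
toy_bi = {"am": ["not"], "we": ["are", "will"]}

print(next_words_with_backoff("I am", toy_tri, toy_bi))
print(next_words_with_backoff("can we", toy_tri, toy_bi))
print(next_words_with_backoff("hello there", toy_tri, toy_bi))
```

Returning an empty list instead of a message string keeps the return type consistent, so callers can test for "no suggestion" with a simple truthiness check.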

**`Text Generation from the trigram model`**

In [214]:
def generate_words_tri(seed, word_dict, seq):
    word_query = seed.lower().split()
    for x in range(seq):
        nxtword = get_nextwords_tri(" ".join(word_query[x:]), word_dict)
        word_query.append(nxtword[0])
    return " ".join(word_query)

In [216]:
generate_words_tri('I am', filtered_tri_dict, 25)

'i am not sure what you think of the season with a sore knee and is expected to play with his sore shoulder and neck W W'

In [218]:
generate_words_tri('how are', filtered_tri_dict, 25)

'how are you going to be a good idea to keep the tradition and organize a thanksgiving dinner i can do it all to brunch swcc am'

In [325]:
generate_words_tri('I heard', filtered_tri_dict, 20)

'i heard that you are not the intended recipient or authorized to receive for the year they rank in fantasy points allowed'

In [326]:
generate_words_tri('can we', filtered_tri_dict, 20)

'can we meet at the end of the season with a sore knee and is expected to play with his sore shoulder'
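
The stray `W` tokens at the end of some generated texts above come from the lookup function: when the last two words are not in the dictionary it returns a message string, and `nxtword[0]` then appends the first character of that string. A possible variant that stops generating instead is sketched below; `generate_until_unseen` and `toy_tri_dict` are hypothetical names used only for illustration.

```python
def generate_until_unseen(seed, word_dict, max_words):
    """Greedy trigram generation that stops when the last two words are unseen."""
    words = seed.lower().split()
    for _ in range(max_words):
        # The context is always the two most recent words
        context = tuple(words[-2:])
        candidates = word_dict.get(context)
        if not candidates:
            break  # unseen context: stop instead of appending a placeholder
        words.append(candidates[0])
    return " ".join(words)


# Hypothetical miniature version of filtered_tri_dict
toy_tri_dict = {
    ("i", "am"): ["not", "going"],
    ("am", "not"): ["sure"],
}

print(generate_until_unseen("I am", toy_tri_dict, 10))
```

Using `words[-2:]` as the context also removes the assumption that the seed contains exactly two words.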

# Model Evaluation

**`Model perplexity score calculation`**

**`Read and transform test data`**

In [236]:
test_df = pd.read_csv(r'F:\Muthu_2023\Personal\NextStep\NLP\NLP\Dataset\Email\test_data.csv')
test_df.head()

Unnamed: 0.1,Unnamed: 0,file,message
0,0,baughman-d/power/cinergy_index/45.,Message-ID: <11686348.1075848338607.JavaMail.e...
1,1,baughman-d/power/cinergy_index/46.,Message-ID: <18375400.1075848338630.JavaMail.e...
2,2,baughman-d/power/cinergy_index/47.,Message-ID: <358967.1075848338654.JavaMail.eva...
3,3,baughman-d/power/cinergy_index/48.,Message-ID: <26917144.1075848338677.JavaMail.e...
4,4,baughman-d/power/cinergy_index/49.,Message-ID: <15598486.1075848338699.JavaMail.e...


In [241]:
test_df = text_preprocess(test_df)
test_df.head()

Unnamed: 0.1,Unnamed: 0,file,message,content,sent_list
0,0,baughman-d/power/cinergy_index/45.,Message-ID: <11686348.1075848338607.JavaMail.e...,great talking with you. see you the other guy...,"[great talking with you, see you the other guy..."
1,1,baughman-d/power/cinergy_index/46.,Message-ID: <18375400.1075848338630.JavaMail.e...,the hourly indexes have been posted check out ...,[the hourly indexes have been posted check out...
2,2,baughman-d/power/cinergy_index/47.,Message-ID: <358967.1075848338654.JavaMail.eva...,for our hourly daily and term indexes log on t...,[for our hourly daily and term indexes log on ...
3,3,baughman-d/power/cinergy_index/48.,Message-ID: <26917144.1075848338677.JavaMail.e...,for our industry coverage and hourly daily and...,[for our industry coverage and hourly daily an...
4,4,baughman-d/power/cinergy_index/49.,Message-ID: <15598486.1075848338699.JavaMail.e...,attached is the form id like everyone to use f...,[attached is the form id like everyone to use ...


**`Perplexity Calculation for bigram and trigram model`**

In [296]:
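def CalcPerplexity(sentence, ngram):
    """Return the perplexity of a sentence under the bigram (ngram=2) or trigram (ngram=3) model.

    N-grams unseen in training are skipped; None is returned when no n-gram matches.
    """
    word_list = str(sentence).split()
    N = len(word_list)
    ppx = []  # inverse probabilities of the n-grams found in the model
    if ngram == 2:
        for i in range(N-1):
            try:
                ppx.append(1/(bi_dict_prob[(word_list[i], word_list[i+1])]))
            except KeyError:
                continue
    elif ngram == 3:
        for i in range(N-2):
            try:
                ppx.append(1/tri_dict_prob[(word_list[i], word_list[i+1], word_list[i+2])])
            except KeyError:
                continue
    if len(ppx) > 0:
        # N-th root of the product of inverse probabilities (N = number of words)
        return round(np.prod(ppx) ** (1/N),2)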
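
Multiplying many inverse probabilities can overflow to `inf` on long sentences, and skipping unseen n-grams makes a sentence look more predictable than it is. A common alternative is to sum log probabilities and to smooth the counts. The sketch below uses add-one (Laplace) smoothing, which is not what the notebook does; the function `perplexity_bigram`, its parameters, and the toy counts are all assumptions made for illustration. It computes PP = exp(-(1/M) * sum(log P(w_i | w_{i-1}))), where M is the number of bigrams in the sentence.

```python
import math


def perplexity_bigram(sentence, bigram_counts, unigram_counts, vocab_size):
    """Add-one smoothed bigram perplexity, computed in log space."""
    tokens = str(sentence).split()
    if len(tokens) < 2:
        return None  # no bigram to score
    log_prob_sum = 0.0
    n_bigrams = 0
    for w1, w2 in zip(tokens, tokens[1:]):
        # Add-one smoothing: unseen bigrams get a small non-zero probability
        count_pair = bigram_counts.get((w1, w2), 0)
        count_w1 = unigram_counts.get(w1, 0)
        prob = (count_pair + 1) / (count_w1 + vocab_size)
        log_prob_sum += math.log(prob)
        n_bigrams += 1
    # Exponential of the average negative log probability
    return math.exp(-log_prob_sum / n_bigrams)


# Hypothetical toy counts standing in for d and bi_dict
toy_unigrams = {"i": 2, "am": 2, "here": 1, "late": 1}
toy_bigrams = {("i", "am"): 2, ("am", "here"): 1, ("am", "late"): 1}

seen = perplexity_bigram("i am here", toy_bigrams, toy_unigrams, vocab_size=4)
unseen = perplexity_bigram("here i late", toy_bigrams, toy_unigrams, vocab_size=4)
print(seen, unseen)
```

With smoothing, a sentence made of unseen bigrams receives a higher perplexity than one made of seen bigrams, instead of being skipped.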

**`Extract sentences as separate rows from the message`**

In [276]:
test_df_exp = test_df['sent_list'].explode(ignore_index=True)
test_df_exp = test_df_exp.to_frame()

**`Calculate perplexity score for bigram and trigram model`**

In [308]:
test_df_exp['trigram_score'] = test_df_exp['sent_list'].apply(lambda x: CalcPerplexity(x,3))
test_df_exp['bigram_score'] = test_df_exp['sent_list'].apply(lambda x: CalcPerplexity(x,2))

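**`The trigram model reports a lower mean perplexity than the bigram model, but the comparison is unreliable`**

`CalcPerplexity` skips n-grams that were not seen in training instead of penalizing them, and the bigram mean is `inf`, most likely because `np.prod` overflows on long sentences.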

In [324]:
print('Mean Perplexity score for Bigram Model: ',test_df_exp['bigram_score'].mean())
print('Mean Perplexity score for Trigram Model: ',test_df_exp['trigram_score'].mean())

Mean Perplexity score Bigram Model:  inf
Mean Perplexity score Trigram Model:  3.035378090277218
