Module for E-mail Summarization
*****************************************************************************
Input Parameters:
    emails: A list of strings containing the emails
Returns:
    summary: A list of strings containing the summaries.
*****************************************************************************


In [1]:
import numpy as np
from talon.signature.bruteforce import extract_signature
from langdetect import detect
from nltk.tokenize import sent_tokenize
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

In [2]:
def preprocess(email):
    """
    Performs preprocessing operations:
        1. Removes signature lines (only English emails are supported).
        2. Removes newline characters and blank lines.
    """
    # extract_signature returns the text without the signature, plus the signature itself.
    email, _ = extract_signature(email)

    # Strip each line, drop the empty ones, and join the rest into one string.
    lines = [line.strip() for line in email.split('\n')]
    return ' '.join(line for line in lines if line)
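
The line-joining step can be tried on its own, without talon. The sketch below uses a hypothetical helper, `collapse_lines`, and a shortened version of the greeting from Example 2. It suggests why a greeting line ends up merged into the first sentence once the newlines are gone.

```python
def collapse_lines(text):
    """Strip every line, drop blank ones, and join the rest with spaces."""
    stripped = (line.strip() for line in text.split('\n'))
    return ' '.join(line for line in stripped if line)


# Shortened sample in the style of Example 2 (not the full email).
sample = "Good evening Mrs. Yoo,\n\nI'm reaching out on behalf of LettuceEat. \n"
collapsed = collapse_lines(sample)
print(collapsed)
```

Because the greeting ends with a comma rather than a full stop, a sentence tokenizer applied afterwards is likely to treat the greeting and the following sentence as one unit.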

In [3]:
def split_sentences(email):
    """
    Splits the email into individual sentences.
    """
    # Strip each sentence and drop any that are empty after stripping.
    sentences = [sent.strip() for sent in sent_tokenize(email)]
    return [sent for sent in sentences if sent]

In [4]:
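def Encoding(email, max_length=128):
    """
    Encodes each sentence as a fixed-length vector of BERT token IDs.

    Note: these are the tokenizer's input IDs, padded or truncated to
    max_length. No BERT model is run, so they are not learned sentence
    embeddings.
    """
    from transformers import BertTokenizer
    import torch  # must be installed because return_tensors='pt' yields PyTorch tensors

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    # email is a list of sentences; the result has shape (n_sentences, max_length).
    inputs = tokenizer(email, return_tensors='pt', max_length=max_length, truncation=True, padding='max_length')

    return inputs['input_ids'].numpy()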
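
The values returned by `Encoding` are tokenizer input IDs rather than model outputs. If contextual sentence embeddings are wanted instead, one common option is to run a BERT encoder and mean-pool its token vectors under the attention mask. The sketch below shows only the pooling step on small NumPy stand-ins. `mean_pool`, `hidden`, and `mask` are hypothetical names that do not appear in this notebook, and no model is loaded.

```python
import numpy as np


def mean_pool(hidden, mask):
    """Average token vectors per sentence, ignoring padded positions.

    hidden: (n_sentences, seq_len, dim) token vectors (stand-in for model output)
    mask:   (n_sentences, seq_len), 1 for real tokens and 0 for padding
    """
    mask = mask[:, :, None].astype(hidden.dtype)
    summed = (hidden * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1, None)  # avoid division by zero
    return summed / counts


# Two toy "sentences" with three token slots and two dimensions each.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],
                   [[2.0, 0.0], [0.0, 0.0], [0.0, 0.0]]])
mask = np.array([[1, 1, 0],
                 [1, 0, 0]])
pooled = mean_pool(hidden, mask)
print(pooled)
```

The result has one row per sentence, so it could be passed to the clustering step in place of the token-ID matrix.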

In [5]:
def summarization(processed_email_text:list = None, encoded_email_text:np.ndarray = None):
    '''
        Email summarization

        Parameters
        ----------
        processed_email_text: list with sentences composing the email 
        encoded_email_text: array of shape (n_sentences, n_features), one encoded row per sentence

        Returns
        -------
        A string with the sentence closest to each cluster centre,
        ordered by the mean position of each cluster in the email
    '''

    # Heuristic: use ceil(sqrt(n_sentences)) clusters.
    n_clusters = int(np.ceil(len(encoded_email_text)**0.5))
    print('Number of clusters: ', n_clusters)


    kmeans = KMeans(n_clusters=n_clusters, random_state=0)
    kmeans = kmeans.fit(encoded_email_text)
    print('Kmeans trained')

    # Mean sentence position of each cluster, used to order the summary.
    avg = []
    for j in range(n_clusters):
        idx = np.where(kmeans.labels_ == j)[0]
        avg.append(np.mean(idx))
    # Index of the sentence nearest to each cluster centre.
    closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, encoded_email_text)
    ordering = sorted(range(n_clusters), key=lambda k: avg[k])
    
    return ' '.join([processed_email_text[closest[idx]] for idx in ordering])
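
Two parts of `summarization` can be checked by hand without fitting K-means: the cluster-count rule and the ordering of the selected sentences. The sketch below is a minimal illustration. `cluster_count` and `order_representatives` are hypothetical helper names, and the `labels` and `closest` values are made up rather than taken from a fitted model.

```python
import math

import numpy as np


def cluster_count(n_sentences):
    """Cluster-count heuristic used above: ceil(sqrt(n_sentences))."""
    return math.ceil(math.sqrt(n_sentences))


def order_representatives(labels, closest):
    """Order each cluster's representative by the cluster's mean sentence position."""
    labels = np.asarray(labels)
    n_clusters = len(closest)
    avg_position = [np.mean(np.where(labels == j)[0]) for j in range(n_clusters)]
    ordering = sorted(range(n_clusters), key=lambda k: avg_position[k])
    return [int(closest[k]) for k in ordering]


# Sentence counts of the three example emails in this notebook.
counts = [cluster_count(n) for n in (13, 3, 4)]
print(counts)

# Made-up assignment of five sentences to two clusters.
labels = [1, 1, 0, 0, 1]
closest = [3, 0]  # representative sentence index for cluster 0 and cluster 1
picked = order_representatives(labels, closest)
print(picked)
```

The ordering follows each cluster's mean position, not the position of its representative, so a selected sentence can appear in the summary before a sentence that precedes it in the email.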

# Example 1: cover_letter.txt

In [6]:
with open('Email_examples/cover_letter.txt') as f:
    email_text = f.read()
print(email_text)

Dear Ms. Lee,

I'm writing to express my interest in the Web Content Manager position listed on CareerBuilder.com. 
I have experience building large, consumer-focused, health-based content sites. While much of my experience has been in the business world, I understand the social value of the non-profit sector and my business experience will be an asset to your organization.

My responsibilities at my current job have included the development and management of the site’s editorial voice and style, the editorial calendar, and the daily content programming and production of the website. In my current and past positions, I have worked closely with health care professionals and medical editors to help them provide the best possible information to a consumer audience of patients. In addition, I have helped physicians learn to utilize their medical content to write user-friendly, readily comprehensible text.

Experience has taught me how to build strong relationships with all departments at a

In [7]:
print('Email pre-processing...')
processed_email_text = preprocess(email_text)

print('Splitting into sentences...')
processed_email_text = split_sentences(processed_email_text)
print('Number of sentences: ', len(processed_email_text))

print('Encoding process...')
encoded_email_text = Encoding(processed_email_text)    

summary = summarization(processed_email_text, encoded_email_text)

print('\nSummary:\n', summary)


Email pre-processing...
Splitting into sentences...
Number of sentences:  13
Encoding process...
Number of clusters:  4
Kmeans trained

Summary:
 My responsibilities at my current job have included the development and management of the site’s editorial voice and style, the editorial calendar, and the daily content programming and production of the website. In addition, I have helped physicians learn to utilize their medical content to write user-friendly, readily comprehensible text. I have the ability to work within a team as well as across teams. I work with web engineers to resolve technical issues and implement technical enhancements, work with the development department to implement design and functional enhancements, and monitor site statistics and conduct search engine optimization.


# Example 2: appreciating_the_customer.txt

In [8]:
with open('Email_examples/appreciating_the_customer.txt') as f:
    email_text = f.read()
print(email_text)

Good evening Mrs. Yoo,

I'm reaching out on behalf of LettuceEat to thank you for your review of our restaurant on ReviewIt. 
We really appreciate your kind words and recommending our restaurant to others on the platform. 

LettuceEat is so happy you enjoyed our vegan options and your experience with us. 


Please come back soon!

Best regards,
Sarah Gibbs


In [9]:
print('Email pre-processing...')
processed_email_text = preprocess(email_text)

print('Splitting into sentences...')
processed_email_text = split_sentences(processed_email_text)
print('Number of sentences: ', len(processed_email_text))

print('Encoding process...')
encoded_email_text = Encoding(processed_email_text)    

summary = summarization(processed_email_text, encoded_email_text)

print('\nSummary:\n', summary)

Email pre-processing...
Splitting into sentences...
Number of sentences:  3
Encoding process...
Number of clusters:  2
Kmeans trained

Summary:
 Good evening Mrs. Yoo, I'm reaching out on behalf of LettuceEat to thank you for your review of our restaurant on ReviewIt. We really appreciate your kind words and recommending our restaurant to others on the platform.


# Example 3: Introducing_yourself.txt

In [10]:
with open('Email_examples/Introducing_yourself.txt') as f:
    email_text = f.read()
print(email_text)

Good morning Mr. Sheehan,

I would like to formally introduce myself. 
My name is Ethan and I am from Secure Shield, a company focused on protecting your home with security cameras and alarms.
We understand the importance of keeping your family safe, and we want to ensure you have the best security system to meet your needs and budget. 

If you're interested in our services, please contact me at ccrenshaw@secureshield.com or call me at 555-555-5555. 
I'm looking forward to hearing from you!

Best,
Charles Crenshaw


In [11]:
print('Email pre-processing...')
processed_email_text = preprocess(email_text)

print('Splitting into sentences...')
processed_email_text = split_sentences(processed_email_text)
print('Number of sentences: ', len(processed_email_text))

print('Encoding process...')
encoded_email_text = Encoding(processed_email_text)    

summary = summarization(processed_email_text, encoded_email_text)

print('\nSummary:\n', summary)

Email pre-processing...
Splitting into sentences...
Number of sentences:  4
Encoding process...
Number of clusters:  2
Kmeans trained

Summary:
 We understand the importance of keeping your family safe, and we want to ensure you have the best security system to meet your needs and budget. If you're interested in our services, please contact me at ccrenshaw@secureshield.com or call me at 555-555-5555.
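
The three examples repeat the same four steps, and the module header describes an interface that takes a list of emails and returns a list of summaries. A wrapper along those lines might look like the sketch below. `summarize_emails` is a hypothetical name, and the steps are passed in as arguments so the block runs with simple stand-ins instead of talon, NLTK, and BERT.

```python
def summarize_emails(emails, preprocess, split_sentences, encode, summarize):
    """Apply the four notebook steps to every email and collect the summaries."""
    summaries = []
    for email in emails:
        sentences = split_sentences(preprocess(email))
        if not sentences:
            # Nothing to cluster, so return an empty summary for this email.
            summaries.append('')
            continue
        summaries.append(summarize(sentences, encode(sentences)))
    return summaries


# Stand-in steps for illustration only; they are not the notebook's functions.
demo = summarize_emails(
    ["First point. Second point.", ""],
    preprocess=str.strip,
    split_sentences=lambda text: [s.strip() + '.' for s in text.split('.') if s.strip()],
    encode=lambda sentences: [[len(s)] for s in sentences],
    summarize=lambda sentences, encoded: sentences[0],
)
print(demo)
```

With the functions defined in this notebook, the call would presumably be `summarize_emails(emails, preprocess, split_sentences, Encoding, summarization)`.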
