In [39]:
import codecs
import os
import random
import re


from sklearn.feature_extraction.text import TfidfVectorizer

Let's start by writing a function for getting all the relevant files that contain emails:

In [40]:
def get_emails(directory):
    "Return the paths of all email files (names containing 'spam' or 'ham') under directory."
    paths = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if 'spam' in name or 'ham' in name:
                paths.append(os.path.join(root, name))
    return paths


Now let's create a list of emails, where each email is a list of sentences. Sentences are split with a regular expression here, although a dedicated NLP library could be used for that instead:

In [41]:
# Raw string, so that \w, \. and \s reach the regex engine without invalid-escape warnings
regex = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s')

all_files = 'data'
emails = []

for f in get_emails(all_files):
    with codecs.open(f, 'r', encoding='utf-8', errors='ignore') as fdata:
        emails.append(re.split(regex, fdata.read().replace('\n', ' ').replace('\r', '')))

print(emails[:2])

[['Subject: christmas tree farm pictures '], ['Subject: vastar resources , inc .', 'gary , production from the high island larger block a - 1 # 2 commenced on saturday at 2 : 00 p .', 'm .', 'at about 6 , 500 gross .', 'carlos expects between 9 , 500 and 10 , 000 gross for tomorrow .', 'vastar owns 68 % of the gross production .', 'george x 3 - 6992 - - - - - - - - - - - - - - - - - - - - - - forwarded by george weissman / hou / ect on 12 / 13 / 99 10 : 16 am - - - - - - - - - - - - - - - - - - - - - - - - - - - daren j farmer 12 / 10 / 99 10 : 38 am to : carlos j rodriguez / hou / ect @ ect cc : george weissman / hou / ect @ ect , melissa graves / hou / ect @ ect subject : vastar resources , inc .', 'carlos , please call linda and get everything set up .', "i ' m going to estimate 4 , 500 coming up tomorrow , with a 2 , 000 increase each following day based on my conversations with bill fischer at bmar .", 'd .', '- - - - - - - - - - - - - - - - - - - - - - forwarded by daren j farmer
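
To see what the splitting pattern does, here is a minimal sketch on a made-up string (the sample text is an assumption and is not taken from the corpus). The two negative lookbehinds keep the split from firing after abbreviations such as "e.g." or titles such as "Dr.":

```python
import re

# Same pattern as in the cell above, written as a raw string.
sentence_end = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s')

# Hypothetical sample text, not from the email corpus.
sample = 'Dr. Smith arrived. Did he call? Yes, e.g. at noon.'
sentences = sentence_end.split(sample)
print(sentences)
# -> ['Dr. Smith arrived.', 'Did he call?', 'Yes, e.g. at noon.']
```

In the corpus itself, punctuation is separated from words by spaces ('p .', 'm .'), so these guards rarely apply. That is why fragments such as 'm .' show up as separate sentences in the output above.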

Text summarization can be **extractive** or **abstractive**. Extractive methods summarize a text by selecting a subset of its existing words or sentences that retain the most important points. Abstractive methods build the summary from a semantic understanding of the text, so it may contain words that never appear in the source document. This makes abstractive summarization the more difficult task. In our task, we don't have any labeled data for training, so the algorithm has to be unsupervised.
Here is the simple algorithm I will use for this task:

   * Treat each email as a separate document.
   * Score each word in the email with **tf-idf** (term frequency-inverse document frequency), a statistical weight that measures how important a word is to a document in a collection or corpus. In our case, a word with a high tf-idf score is important to the content of the email.
   * Score each sentence by the average score of its words.
   * Display the top-scoring sentence, which is treated as the summary of the email. The idea is that the more important words a sentence has, the more likely it is to be a good summary of the email.
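
The steps above can be sketched in plain Python. This is only a minimal illustration, not the implementation used in this notebook: the helper names `tfidf_per_sentence` and `best_sentence` and the toy sentences are assumptions. The idf formula is the smoothed variant that scikit-learn uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and the L2 normalization that `TfidfVectorizer` also applies by default is left out for brevity. Each sentence of one email plays the role of a document when idf is computed, which mirrors fitting the vectorizer on a single email's sentences.

```python
import math
from collections import Counter


def tfidf_per_sentence(sentences):
    """Return one {word: tf-idf} dict per sentence (hypothetical helper)."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # Document frequency: in how many sentences does each word occur?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf, assumed to follow scikit-learn's default formula.
        scores.append({w: tf[w] * (math.log((1 + n) / (1 + df[w])) + 1)
                       for w in tf})
    return scores


def best_sentence(sentences):
    """Pick the sentence with the highest average tf-idf per distinct word."""
    best, best_score = None, 0.0
    for sent, word_scores in zip(sentences, tfidf_per_sentence(sentences)):
        if not word_scores:  # skip empty sentences
            continue
        avg = sum(word_scores.values()) / len(word_scores)
        if avg > best_score:
            best, best_score = sent, avg
    return best


# Toy sentences, made up for illustration.
toy_email = [
    'please call me',
    'please call me tomorrow',
    'vastar production starts saturday',
]
print(best_sentence(toy_email))
# -> vastar production starts saturday
```

The last sentence wins because every one of its words occurs in only one sentence, so each word gets the highest possible idf.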
   
Let's calculate tf-idf scores. I will use sklearn's TfidfVectorizer for that. I will also exclude stop words and only vectorize tokens that contain at least one letter, because our emails contain a lot of numbers that say little about their content.


In [42]:
vectorizer = TfidfVectorizer(lowercase=True, stop_words='english', analyzer='word',
                             token_pattern=r'(?ui)\b\w*[a-z]+\w*\b')
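
As a quick check of the token pattern, the sketch below applies it with the standard `re` module to a made-up string in the style of the corpus (the sample is an assumption). It only shows which tokens survive the pattern. Stop-word removal is a separate step inside `TfidfVectorizer` and is not reproduced here.

```python
import re

# Same pattern as passed to TfidfVectorizer above, as a raw string.
token_pattern = re.compile(r'(?ui)\b\w*[a-z]+\w*\b')

# Hypothetical sample in the style of the corpus.
sample = 'vastar owns 68 % of block a1 , about 6 , 500 gross'
tokens = token_pattern.findall(sample)
print(tokens)
# -> ['vastar', 'owns', 'of', 'block', 'a1', 'about', 'gross']
```

Pure numbers such as 68 and 500 are dropped, while mixed tokens such as a1 are kept because they contain a letter.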

Now let's write a function that extracts the summary from an email. It follows the steps of the algorithm above: it takes an email, vectorizes it with tf-idf, goes over the sentences, sums the scores of each sentence's words, divides the sum by the number of words to get a sentence score, and picks the highest-scoring sentence. If there is a sentence with exactly 15 scored words (stop words and pure numbers are not counted), the best such sentence is shown; otherwise we fall back to the highest-scoring sentence of a different length. Because we are doing extractive summarization, it is rarely possible to return exactly 15 words, so the fallback serves as a workaround.

In [43]:
def get_highest_scored_sent(email, summary_length):
    'Takes an email and prints one sentence as its summary, preferring a sentence with summary_length scored words.'
    tfidf_matrix = vectorizer.fit_transform(email)

    max_score_any = 0
    max_score_desired = 0
    summary_sent_any = None
    summary_sent_desired = None
    num_words_any = 0
    for i in range(len(email)):
        if len(email[i]) == 0: # avoid empty sentences
            continue
        if email[i].lower().startswith('subject'): # avoid subjects of emails to be used as summaries
            continue
        # tf-idf scores of the distinct terms in sentence i (stop words and pure numbers are excluded)
        feature_index = tfidf_matrix[i,:].nonzero()[1]
        tfidf_scores = [tfidf_matrix[i, x] for x in feature_index]
        num_words = len(tfidf_scores)
        if num_words == 0: # nothing to score in this sentence
            continue

        sent_score = sum(tfidf_scores)/num_words # average tf-idf score per scored word
        if sent_score > max_score_any and num_words > num_words_any:
            summary_sent_any = email[i]
            max_score_any = sent_score
            num_words_any = num_words
        if num_words == summary_length and sent_score > max_score_desired:
            summary_sent_desired = email[i]
            max_score_desired = sent_score
    if summary_sent_desired:
        print("A summary sentence of exactly {} words was extracted:\n".format(summary_length), summary_sent_desired)
    else:
        print("A summary of exactly {} words could NOT be extracted. Here is an alternative summary:\n".format(summary_length), summary_sent_any)
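
The "exact length first, otherwise fall back" rule can also be looked at in isolation. The sketch below is a simplified variant with made-up inputs: `pick_summary` is a hypothetical helper that takes precomputed (sentence, score, num_words) tuples, and its fallback simply takes the top-scoring sentence, whereas the function above additionally requires each new fallback candidate to have more scored words than the previous one.

```python
def pick_summary(scored, summary_length):
    """Pick a summary from (sentence, score, num_words) tuples.

    Returns (sentence, exact), where exact says whether the sentence has
    exactly summary_length scored words. Hypothetical helper.
    """
    exact_matches = [c for c in scored if c[2] == summary_length]
    pool = exact_matches or scored  # fall back to all sentences
    if not pool:
        return None, False
    best = max(pool, key=lambda c: c[1])  # highest score wins
    return best[0], bool(exact_matches)


# Made-up scores for illustration.
candidates = [
    ('short note', 0.9, 4),
    ('first long one', 0.5, 15),
    ('second long one', 0.7, 15),
]
print(pick_summary(candidates, 15))  # -> ('second long one', True)
print(pick_summary(candidates, 10))  # -> ('short note', False)
```

Returning the sentence together with a flag, instead of printing, would make it easier to reuse the result, for example to count how often the exact length is reached.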



Now let's apply this function to 10 randomly selected emails from the corpus:

In [44]:
random.shuffle(emails)
for email in emails[:10]:
    print('Randomly selected email: ')
    print(''.join(email))
    print('\n\n')
    get_highest_scored_sent(email, 15)
    print('\n\n')
    

Randomly selected email: 
Subject: their waiting for you to call them ...hey my man , i found this new dating chatline with tons and tons of chicks ! : ) call it up now : 011 - 239 - 28 - 4132 it ' s this crazy hookup site , i got laid 6 times this week man , you don ' t have to use a credit card or anything you won ' t pay a cent ! lots of them are just looking for a random hookup , one night stands etc 011 - 239 - 28 - 4132 you won ' t be disapointed you ' ll see im not kidding .thank me later after you ' r gettin laid 7 days a week .big mike 



A summary of exactly 15 words could NOT be extracted. Here is an alternative summary:
 hey my man , i found this new dating chatline with tons and tons of chicks ! : ) call it up now : 011 - 239 - 28 - 4132 it ' s this crazy hookup site , i got laid 6 times this week man , you don ' t have to use a credit card or anything you won ' t pay a cent ! lots of them are just looking for a random hookup , one night stands etc 011 - 239 - 28 - 4132 y

Of course, the summaries are often far from perfect. This was just a small prototype, and there is more that could be tried. One example is better preprocessing of the corpus, since the emails are often uncleaned and badly structured.
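
As one possible direction, here is a rough cleanup sketch. Everything in it is an assumption: `clean_email_text` is a hypothetical helper, and its rules (dropping runs of dashes, reattaching punctuation that is separated by spaces) are guesses based on the sample output above and have not been validated on the corpus.

```python
import re


def clean_email_text(text):
    """Hypothetical cleanup for the tokenized email text shown above."""
    # Drop separator runs such as "- - - - -" (three or more dashes).
    text = re.sub(r'(?:-\s*){3,}', ' ', text)
    # Reattach punctuation that is separated by spaces: "inc ." -> "inc."
    text = re.sub(r'\s+([.,?!])', r'\1', text)
    # Collapse repeated whitespace.
    return re.sub(r'\s+', ' ', text).strip()


# Made-up sample in the style of the corpus.
sample = 'vastar resources , inc . - - - - - - forwarded by george'
print(clean_email_text(sample))
# -> vastar resources, inc. forwarded by george
```

Such a step would run before sentence splitting, so its effect on the splitting regex and on the tf-idf scores would need to be checked on real emails.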