This notebook explains how the mail_parser module parses a formal e-mail in order to keep only its body.

In [1]:
import sys 
sys.path.insert(0, '..')
from mail_parser.mail_parser import MailParser

In [2]:
example = """
To: XYZ

CC/BCC:

Subject: Invitation to a birthday party

BLABLABLABLABLA
BLOBLOBLOBLOBLOBLO

Our reference: XX-00000
Your reference: XXX-0000000-00000
----------Forwarding process-----------

Due to an extraordinarily high number of customer contacts in our service centers and at our stations, we are facing delays in replying to your emails. We apologize for any inconveniences and appeal to your understanding to allow us to prioritize the most urgent inquires.  

Please use the following links which answer to most of your inquiries:
https://www.example.com/en/example/example

As mentioned, the flight was delayed less than 3 hours (30 minutes), passengers 
are therefore not entitled to compensation. So, we cannot grant your request.

Thank you


IMAGE

JACQUES LAMA
Example - Example Air Transport
Example, Jacques Lama Airport

NOTE : THIS IS ONLY A USELESS SENTENCE, BECAUSE SOMETIMES THERE ARE OTHER "TRUE SENTENCES" AT THE END BUT THEY ARE NOT IMPORTANT.

E-mail: EXAMPLE@EXAMPLE.COM
Website: www.example.com

"""

# First step: preprocessing

In [5]:
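import re

# NOTE: sentences_to_delete (list of str) and replacements (list of
# (pattern, substitution) pairs) are not defined in this notebook; they are
# assumed to be defined alongside this function in mail_parser.
def preprocess(txt):
    """Normalize line breaks, drop unwanted phrases, then apply the regex replacements."""
    # Replace a ':' or ',' at the end of a line with a paragraph break (customize)
    text = txt.replace('\n \n', '\n\n').replace(':\n', '\n\n').replace(',\n', '\n\n')
    for bad in sentences_to_delete:
        text = text.replace(bad, '')  # customize
    for old, new in replacements:
        text = re.sub(old, new, text)
    return text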
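The cell above depends on two names, sentences_to_delete and replacements, whose contents are not shown in this notebook. The sketch below fills them with hypothetical values so the function can be run on its own. The phrase to delete and the URL and e-mail patterns are assumptions chosen to mimic the output shown further down, not the module's real settings. Only the last pattern is taken from this notebook.

```python
import re

# Hypothetical settings: the real module defines its own lists.
sentences_to_delete = ["Sent from my phone"]
replacements = [
    (r"https?://\S+", ""),           # assumed: drop http/https links
    (r"www\.\S+", ""),               # assumed: drop bare www links
    (r"\S+@\S+", "@"),               # assumed: collapse e-mail addresses to '@'
    (r"(?<!\n)\n(?![\n\t])", " "),   # from this notebook: join isolated line breaks
]

def preprocess(txt):
    """Same logic as the notebook cell, using the hypothetical settings above."""
    text = txt.replace("\n \n", "\n\n").replace(":\n", "\n\n").replace(",\n", "\n\n")
    for bad in sentences_to_delete:
        text = text.replace(bad, "")
    for old, new in replacements:
        text = re.sub(old, new, text)
    return text

sample = (
    "Write to JOHN@EXAMPLE.COM or see\n"
    "https://www.example.com/en/example for details.\n\n"
    "Sent from my phone"
)
cleaned = preprocess(sample)
print(repr(cleaned))
```

The order of the pairs matters: links are removed before the line breaks are joined, so a link sitting alone on its line leaves an empty line behind rather than being glued to the previous sentence.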

In [6]:
print(example)


To: XYZ

CC/BCC:

Subject: Invitation to a birthday party

BLABLABLABLABLA
BLOBLOBLOBLOBLOBLO

Our reference: XX-00000
Your reference: XXX-0000000-00000
----------Forwarding process-----------

Due to an extraordinarily high number of customer contacts in our service centers and at our stations, we are facing delays in replying to your emails. We apologize for any inconveniences and appeal to your understanding to allow us to prioritize the most urgent inquires.  

Please use the following links which answer to most of your inquiries:
https://www.example.com/en/example/example

As mentioned, the flight was delayed less than 3 hours (30 minutes), passengers 
are therefore not entitled to compensation. So, we cannot grant your request.

Thank you


IMAGE

JACQUES LAMA
Example - Example Air Transport
Example, Jacques Lama Airport

NOTE : THIS IS ONLY A USELESS SENTENCE, BECAUSE SOMETIMES THERE ARE OTHER "TRUE SENTENCES" AT THE END BUT THEY ARE NOT IMPORTANT.

E-mail: EXAMPLE@EXAMPLE.COM


In [10]:
preproc_mail = preprocess(example)
print(preproc_mail)

 To: XYZ

CC/BCC


Subject: Invitation to a birthday party

BLABLABLABLABLA BLOBLOBLOBLOBLOBLO

Our reference: XX-00000 Your reference: XXX-0000000-00000 ----------Forwarding process-----------

Due to an extraordinarily high number of customer contacts in our service centers and at our stations, we are facing delays in replying to your emails. We apologize for any inconveniences and appeal to your understanding to allow us to prioritize the most urgent inquires.

Please use the following links which answer to most of your inquiries



As mentioned, the flight was delayed less than 3 hours (30 minutes), passengers  are therefore not entitled to compensation. So, we cannot grant your request.

Thank you


IMAGE

JACQUES LAMA Example - Example Air Transport Example, Jacques Lama Airport

NOTE : THIS IS ONLY A USELESS SENTENCE, BECAUSE SOMETIMES THERE ARE OTHER "TRUE SENTENCES" AT THE END BUT THEY ARE NOT IMPORTANT.

E-mail: @Website: 




As you can see, the preprocess function removes http and www links, replaces e-mail addresses with @, and applies the other custom replacements.

One of these regex replacements deserves a closer look: (r'(?<!\n)\n(?![\n\t])', ' ')

In [9]:
old, new = (r'(?<!\n)\n(?![\n\t])', ' ')
sub_example_0 = """
As mentioned, the flight was delayed less than 3 hours (30 minutes), passengers 
are therefore not entitled to compensation. So, we cannot grant your request.
"""
print(sub_example_0)
print('-----------------AFTER---------------')
print()
print(re.sub(old, new, sub_example_0))


As mentioned, the flight was delayed less than 3 hours (30 minutes), passengers 
are therefore not entitled to compensation. So, we cannot grant your request.

-----------------AFTER---------------

 As mentioned, the flight was delayed less than 3 hours (30 minutes), passengers  are therefore not entitled to compensation. So, we cannot grant your request. 


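As you can see, line breaks sometimes appear before the end of a sentence. This replacement turns each isolated line break into a space. A line break that sits next to another line break, or that is followed by a tab, is left untouched, so paragraph breaks survive.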
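To make the behaviour of the lookbehind and lookahead concrete, here is a small check on made-up strings (the sample texts are mine, only the pattern comes from the notebook). It shows a wrapped sentence being joined while a blank-line paragraph break and a tab-indented continuation are kept.

```python
import re

# Pattern from the notebook: a newline that is not preceded by a newline
# and not followed by a newline or a tab.
SINGLE_BREAK = r'(?<!\n)\n(?![\n\t])'

wrapped = "first part of a sentence\nthat wraps.\n\nSecond paragraph."
joined = re.sub(SINGLE_BREAK, ' ', wrapped)
print(repr(joined))
# 'first part of a sentence that wraps.\n\nSecond paragraph.'

indented = "item\n\tindented detail"
untouched = re.sub(SINGLE_BREAK, ' ', indented)
print(repr(untouched))
# 'item\n\tindented detail'
```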

# Mail to list of lines

In [11]:
def mail2lines(corpus):
    """split the mail into list of lines.
    """
    return corpus.strip().split('\n')

In [12]:
list_lines = mail2lines(preproc_mail)
list_lines

['To: XYZ',
 '',
 'CC/BCC',
 '',
 '',
 'Subject: Invitation to a birthday party',
 '',
 'BLABLABLABLABLA BLOBLOBLOBLOBLOBLO',
 '',
 'Our reference: XX-00000 Your reference: XXX-0000000-00000 ----------Forwarding process-----------',
 '',
 'Due to an extraordinarily high number of customer contacts in our service centers and at our stations, we are facing delays in replying to your emails. We apologize for any inconveniences and appeal to your understanding to allow us to prioritize the most urgent inquires.',
 '',
 'Please use the following links which answer to most of your inquiries',
 '',
 '',
 '',
 'As mentioned, the flight was delayed less than 3 hours (30 minutes), passengers  are therefore not entitled to compensation. So, we cannot grant your request.',
 '',
 'Thank you',
 '',
 '',
 'IMAGE',
 '',
 'JACQUES LAMA Example - Example Air Transport Example, Jacques Lama Airport',
 '',
 'NOTE : THIS IS ONLY A USELESS SENTENCE, BECAUSE SOMETIMES THERE ARE OTHER "TRUE SENTENCES" AT THE 

This is just a simple split on line breaks, after stripping leading and trailing whitespace.

# Tricky part : generate_body_mail

In [13]:
    def generate_body_mail(self, lines, threshold):
        """Iterate through the lines and append every true sentence
           to final_sentence.

           A sentence counts as a true sentence when
           compute_prob_not_sentence(sentence) < threshold.

        Parameters
        ----------
        lines : list
            Blocks of text in the email.
        threshold : float
            Upper bound on the not-a-sentence probability for a sentence
            to be kept. Lower thresholds keep fewer sentences.

        Returns
        -------
        final_sentence : str
            Parsed email body.
        """

        delete_useless = False
        sentence_not_taken = 0
        final_sentence = ''
        for line in lines:
            for sent in sentence_tokenize.value.tokenize(line):
                if len(sent) < 6:
                    continue
                if self.compute_prob_not_sentence(sent) < threshold:
                    delete_useless = True
                    final_sentence += sent + ' '
                    sentence_not_taken = 0
                    continue
                if delete_useless:
                    sentence_not_taken += 1
                if delete_useless and sentence_not_taken >= 2:
                    break
            if delete_useless and sentence_not_taken >= 2:
                break
        return final_sentence


First, we iterate over the lines extracted by the mail2lines function (here, lines = list_lines).

Then we split each line into sentences with "sentence_tokenize.value.tokenize(line)" and iterate over those sentences.

Here are two examples:

In [14]:
sentence_tokenize.value.tokenize('Subject: Invitation to a birthday party')

['Subject: Invitation to a birthday party']

In [15]:
sentence_tokenize.value.tokenize('Due to an extraordinarily high number of customer contacts in our service centers and at our stations, we are facing delays in replying to your emails. We apologize for any inconveniences and appeal to your understanding to allow us to prioritize the most urgent inquires.')

['Due to an extraordinarily high number of customer contacts in our service centers and at our stations, we are facing delays in replying to your emails.',
 'We apologize for any inconveniences and appeal to your understanding to allow us to prioritize the most urgent inquires.']
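This notebook does not show how sentence_tokenize is built, so the two cells above cannot be rerun in isolation. For quick experiments, a naive regex splitter can stand in for it. This is only a rough substitute of my own (the function name is hypothetical), and it will mis-split abbreviations such as "e.g." that a trained tokenizer handles.

```python
import re

def naive_sentence_split(line):
    """Split on whitespace that follows '.', '!' or '?'.

    A crude stand-in for the module's tokenizer, not a reimplementation of it.
    """
    parts = re.split(r'(?<=[.!?])\s+', line.strip())
    return [part for part in parts if part]

header = naive_sentence_split('Subject: Invitation to a birthday party')
body = naive_sentence_split(
    'We are facing delays in replying to your emails. '
    'We apologize for any inconveniences.'
)
print(header)
print(body)
```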

For each sentence of at least 6 characters, we compute the probability that it is not a true sentence. If that probability is greater than or equal to the threshold, the sentence is not added to final_sentence.

As long as no true sentence has been found, the delete_useless flag stays False and the iteration simply continues.

As soon as a true sentence is detected, delete_useless is set to True, and the process changes.

From then on, every sentence that is not a true sentence adds 1 to sentence_not_taken.

sentence_not_taken is reset to 0 whenever another true sentence is found. If sentence_not_taken reaches 2, we break out of the loop.

In [16]:
# Here is an example

# Blabla1    # => Not a true sentence, go to the next one
# Blabla2    # => Not a true sentence, go to the next one
# Blabla3    # => Not a true sentence, go to the next one
# Sentence1  # => True sentence, add it to final_sentence, set delete_useless to True and go to the next one
# Blabla4    # => Not a true sentence, but as delete_useless = True, we add 1 to sentence_not_taken
# Sentence2  # => True sentence, add it to final_sentence, reset sentence_not_taken to 0 and go to the next one
# Sentence3  # => True sentence, add it to final_sentence and go to the next one
# Blabla5    # => Not a true sentence, but as delete_useless = True, we add 1 to sentence_not_taken
# Blabla6    # => Not a true sentence, but as delete_useless = True, we add 1 to sentence_not_taken
#            # sentence_not_taken >= 2 => BREAK FROM THE LOOP, Sentence4 will not be examined
# Sentence4


# The parser will return "Sentence1 + Sentence2 + Sentence3"

I chose this rule because, in most of the e-mails I reviewed, the last two Blabla (Blabla5 & Blabla6) are polite closing formulas or signature lines.

For instance:

Blabla5 = Sincerely,
Blabla6 = Example , Community Manager from Secret Company

Then what comes after these sentences is usually not relevant.
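The stopping rule can be tried without any NLP model by swapping the probability test for a simple predicate. The sketch below is a simplified stand-in: it works on a flat list of sentences instead of lines split into sentences, it skips the 6-character filter, and the names collect_body, is_true_sentence and max_misses are mine, not the module's.

```python
def collect_body(sentences, is_true_sentence, max_misses=2):
    """Keep true sentences; once one has been kept, stop after
    max_misses consecutive rejected sentences."""
    started = False   # plays the role of delete_useless
    misses = 0        # plays the role of sentence_not_taken
    kept = []
    for sent in sentences:
        if is_true_sentence(sent):
            started = True
            misses = 0
            kept.append(sent)
            continue
        if started:
            misses += 1
            if misses >= max_misses:
                break
    return kept

sequence = ["Blabla1", "Blabla2", "Blabla3", "Sentence1", "Blabla4",
            "Sentence2", "Sentence3", "Blabla5", "Blabla6", "Sentence4"]
result = collect_body(sequence, lambda s: s.startswith("Sentence"))
print(result)  # ['Sentence1', 'Sentence2', 'Sentence3']
```

Rejected sentences before the first true sentence never count toward the limit, which is why long headers do not stop the parser early.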

# The main function : compute_prob_not_sentence

In [17]:
    def compute_prob_not_sentence(self, sentence):
        """Calculate the probability that the sentence is not a TRUE sentence.

        Parameters
        ----------
        sentence : str
            Line in email block.

        Returns
        -------
        probability(sentence is not a true sentence) : float
        """
        doc = nlp_stanford.value(sentence)
        verb_count = 0
        word_count = 0
        for sent in doc.sentences:
            for word in sent.words:
                word_count += 1
                if word.upos in ["VERB", "AUX"]:
                    # A gerund / present participle only counts as half a verb
                    if word.xpos in ['VBG']:
                        verb_count += 0.5
                    else:
                        verb_count += 1
        if word_count == 0:
            # No words at all: treat the input as not a sentence
            return 1.0
        return 1 - verb_count / word_count

Given the part-of-speech tag of each word, the function counts the verbs in the sentence and returns 1 - (number of verbs / total number of words).

If the verb is a VBG (Verb, gerund or present participle), it'll count as half a verb.
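The arithmetic can be checked by hand on a short sentence. The sketch below applies the same formula to a list of (upos, xpos) pairs instead of calling the tagger. The function name and the tags for "We are checking your request" are written by hand as an assumption about what a tagger would return, not actual model output.

```python
def prob_not_sentence(tags):
    """tags: list of (upos, xpos) pairs, one per word."""
    if not tags:
        return 1.0
    verb_count = 0.0
    for upos, xpos in tags:
        if upos in ("VERB", "AUX"):
            # A VBG (gerund / present participle) counts as half a verb
            verb_count += 0.5 if xpos == "VBG" else 1.0
    return 1 - verb_count / len(tags)

# Hand-written tags for "We are checking your request" (assumed, not tagger output)
tags = [
    ("PRON", "PRP"),    # We
    ("AUX", "VBP"),     # are       -> 1 verb
    ("VERB", "VBG"),    # checking  -> 0.5 verb
    ("PRON", "PRP$"),   # your
    ("NOUN", "NN"),     # request
]
score = prob_not_sentence(tags)
print(round(score, 2))  # 1 - 1.5 / 5 = 0.7
```

With 1.5 verbs out of 5 words the score is 0.7, while a header made only of nouns and punctuation scores 1.0.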

In [18]:
print(MailParser().compute_prob_not_sentence('Subject: Invitation to a birthday party'))


1.0


In [22]:
print(MailParser().compute_prob_not_sentence('Our reference: XX-00000 Your reference: XXX-0000000-00000 ----------Forwarding process-----------'))

1.0


In [21]:
print(MailParser().compute_prob_not_sentence('Please use the following links which answer to most of your inquiries'))


0.7916666666666666


In [23]:
print(MailParser().compute_prob_not_sentence('We apologize for any inconveniences and appeal to your understanding to allow us to prioritize the most urgent inquires.'))

0.85


In [24]:
# Note :
# I added a comment "can be customized" in the code for things that can be customized in order to meet your needs

Result of parsing :

In [5]:
print('Before parsing :')
print()
print(example)
print()
print('-'*50)
print()
print('After parsing')
print()
print(MailParser().parse_mail(example))

Before parsing :


To: XYZ

CC/BCC:

Subject: Invitation to a birthday party

BLABLABLABLABLA
BLOBLOBLOBLOBLOBLO

Our reference: XX-00000
Your reference: XXX-0000000-00000
----------Forwarding process-----------

Due to an extraordinarily high number of customer contacts in our service centers and at our stations, we are facing delays in replying to your emails. We apologize for any inconveniences and appeal to your understanding to allow us to prioritize the most urgent inquires.  

Please use the following links which answer to most of your inquiries:
https://www.example.com/en/example/example

As mentioned, the flight was delayed less than 3 hours (30 minutes), passengers 
are therefore not entitled to compensation. So, we cannot grant your request.

Thank you


IMAGE

JACQUES LAMA
Example - Example Air Transport
Example, Jacques Lama Airport

NOTE : THIS IS ONLY A USELESS SENTENCE, BECAUSE SOMETIMES THERE ARE OTHER "TRUE SENTENCES" AT THE END BUT THEY ARE NOT IMPORTANT.

E-mail: EX