


Mailgun library to extract message quotations and signatures.

If you have ever tried to parse message quotations or signatures, you know that the absence of any formatting standards in this area can make the task a nightmare. Hopefully this library will make your life much easier. The name of the project is inspired by TALON, a multipurpose robot designed to perform missions ranging from reconnaissance to combat and to operate in a number of hostile environments. That’s what a good quotations and signature parser should be like 😄


Here’s how you initialize the library and extract a reply from a text message:

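import talon
from talon import quotations

# initialize the library before using it
talon.init()

text = """Reply

-----Original Message-----

Quote"""

# generic entry point: the parser is chosen by content type
reply = quotations.extract_from(text, 'text/plain')
# or call the plain-text parser directly
reply = quotations.extract_from_plain(text)
# reply == "Reply"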
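To give a feel for what reply extraction involves on plain text, here is a deliberately simplified sketch that cuts a message at the first line that looks like a quotation header. It is not talon's implementation: `SPLITTER_PATTERNS` and `strip_quotation` are hypothetical names made up for this illustration, and real mail clients produce many more header variants than the two handled here.

```python
import re

# Hypothetical, deliberately incomplete list of quotation headers.
SPLITTER_PATTERNS = [
    re.compile(r"^-+\s*Original Message\s*-+$", re.IGNORECASE),
    re.compile(r"^On .+ wrote:$"),
]


def strip_quotation(text):
    """Return the part of `text` that precedes the first quotation header."""
    reply_lines = []
    for line in text.splitlines():
        stripped = line.strip()
        # Stop at the first line that matches any known header pattern.
        if any(pattern.match(stripped) for pattern in SPLITTER_PATTERNS):
            break
        reply_lines.append(line)
    return "\n".join(reply_lines).strip()


message = """Reply

-----Original Message-----

Quote"""

print(strip_quotation(message))
```

A message without any recognised header is returned unchanged (apart from surrounding whitespace), which is the safe default for a sketch like this.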

To extract a reply from html:

html = """Reply
<blockquote>

  <div>
    On 11-Apr-2011, at 6:54 PM, Bob &lt;...&gt; wrote:
  </div>

  <div>
    Quote
  </div>

</blockquote>"""

reply = quotations.extract_from(html, 'text/html')
reply = quotations.extract_from_html(html)
# reply == "<html><body><p>Reply</p></body></html>"

Often the best way is the easiest one. Here’s how you can extract a signature from an email message without any fancy machine learning:

from talon.signature.bruteforce import extract_signature

message = """Wow. Awesome!
--
Bob Smith"""

text, signature = extract_signature(message)
# text == "Wow. Awesome!"
# signature == "--\nBob Smith"
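The brute-force idea can be pictured as a search for a conventional signature delimiter near the end of the message. The sketch below is an illustration under stated assumptions, not the code in talon.signature.bruteforce: `split_on_delimiter` and `max_signature_lines` are hypothetical, and the only delimiter it recognises is a line consisting of two dashes.

```python
def split_on_delimiter(message, max_signature_lines=5):
    """Split `message` into (text, signature) at a trailing "--" line.

    Only the last `max_signature_lines + 1` lines are searched, so a "--"
    far up in the body is not mistaken for a signature delimiter.
    """
    lines = message.splitlines()
    first_candidate = max(0, len(lines) - (max_signature_lines + 1))
    for i in range(first_candidate, len(lines)):
        if lines[i].strip() == "--":
            text = "\n".join(lines[:i]).strip()
            signature = "\n".join(lines[i:]).strip()
            return text, signature
    # No delimiter found: treat the whole message as text.
    return message.strip(), None


print(split_on_delimiter("Wow. Awesome!\n--\nBob Smith"))
```

Limiting the search to the tail of the message is a design choice of this sketch: it trades a few missed long signatures for fewer false positives in the body.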

It is quick and works like a charm 90% of the time. For the other 10% you can use the power of machine learning algorithms:

import talon
# don't forget to init the library first
# it loads machine learning classifiers
talon.init()

from talon import signature

message = """Thanks Sasha, I can't go any higher and is why I limited it to the
homepage."""

John Doe
via mobile"""

text, signature = signature.extract(message, sender='')
# text == "Thanks Sasha, I can't go any higher and is why I limited it to the\nhomepage."
# signature == "John Doe\nvia mobile"
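One way to combine the two approaches is to try the cheap extractor first and fall back to the machine learning one only when nothing is found. The helper below is a sketch, not part of talon: `extract_with_fallback` is a hypothetical name, the extractors are passed in as plain callables, and it assumes the fast extractor reports "no signature" with a falsy second value.

```python
def extract_with_fallback(message, sender, fast_extractor, ml_extractor):
    """Try `fast_extractor(message)`; fall back to `ml_extractor(message, sender)`.

    Both callables are expected to return a (text, signature) pair.
    """
    text, signature = fast_extractor(message)
    if signature:
        return text, signature
    # The fast pass found nothing, so pay for the slower classifier.
    return ml_extractor(message, sender)


# Stand-in extractors so the sketch runs without any dependencies.
def fake_fast(message):
    return message, None


def fake_ml(message, sender):
    return "body", "John Doe"


print(extract_with_fallback("body\nJohn Doe", "", fake_fast, fake_ml))
```

In real use the two callables would be the brute-force and machine learning extractors shown above; passing them in keeps the helper independent of how they are imported.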

For machine learning, talon currently uses the scikit-learn library to build SVM classifiers. The core of the machine learning algorithm lies in the talon.signature.learning package. It defines a set of features to apply to a message, how data sets are built, and the classifier’s interface.
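To make "a set of features to apply to a message" more concrete, here is a minimal sketch of turning the last lines of a message into binary feature vectors, the kind of input an SVM classifier consumes. The four features, the regular expressions and the names `line_features` and `message_features` are assumptions chosen for illustration; they are not talon's actual feature set.

```python
import re

# Hypothetical patterns; real-world email and phone formats vary widely.
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def line_features(line, sender=""):
    """Return a list of 0/1 features describing a single line."""
    words = line.split()
    return [
        int(bool(EMAIL_RE.search(line))),   # contains an email address
        int(bool(PHONE_RE.search(line))),   # contains a phone-like number
        int(0 < len(words) <= 3),           # is a short line
        int(bool(sender) and sender.lower() in line.lower()),  # mentions sender
    ]


def message_features(message, sender="", last_lines=3):
    """Build feature vectors for the last non-empty lines of a message."""
    lines = [line for line in message.splitlines() if line.strip()]
    return [line_features(line, sender) for line in lines[-last_lines:]]


print(message_features("Thanks\n\nJohn Doe\nvia mobile", sender="john"))
```

Each inner list describes one line, so a classifier can label lines as signature or body individually.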

Currently the data used for training is taken from our personal email conversations and from the ENRON dataset. As a result of applying our set of features to the dataset, we provide a classifier file and a training-data file that don’t contain any personal information but can be used to load the trained classifier. Those files should be regenerated every time the feature set or data set is changed.

To regenerate the model files, you can run



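from talon.signature import EXTRACTOR_FILENAME, EXTRACTOR_DATA
from talon.signature.learning.classifier import train, init
# retrain on EXTRACTOR_DATA and save the classifier to EXTRACTOR_FILENAME
train(init(), EXTRACTOR_DATA, EXTRACTOR_FILENAME)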

Open-source Dataset

Recently we started a forge project to create an open-source, annotated dataset of raw emails. In the project we used a subset of ENRON data, cleansed of private, health and financial information by EDRM. At the moment over 190 emails are annotated. Any contribution and collaboration on the project are welcome. Once the dataset is ready we plan to start using it for talon.


The library is inspired by the following research papers and projects: