# Full Disclosure (FD) reply parsing

__Example: http://seclists.org/fulldisclosure/2017/Jan/0__

With each reply, we'll attempt to parse out the following:
* raw reply text, without html tags
  * the reply text with any signatures stripped out
* an analysis of what html tags are in the message

In [1]:
import re
import requests

from bs4 import BeautifulSoup

We'll gather the contents of a single message. 2017_Jan_0 is one that includes a personal signature, as well as the standard Full Disclosure footer.

2017_Jan_45 is a message that includes a PGP signature.

In [2]:
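year = '2017'
month = 'Jan'
msg_id = '45'  # message number within the month; avoids shadowing the built-in id()
url = f'http://seclists.org/fulldisclosure/{year}/{month}/{msg_id}'

r = requests.get(url)
r.raise_for_status()  # stop on an HTTP error rather than parsing an error page
content = r.text
content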

'<!-- MHonArc v2.6.19 -->\n<!--X-Head-End-->\n<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"\n                      "http://www.w3.org/TR/REC-html40/loose.dtd">\n<HTML>\n<HEAD>\n<link rel="alternate" type="application/rss+xml" title="RSS" href="http://seclists.org/rss/fulldisclosure.rss">\n<title>Full Disclosure: APPLE-SA-2017-01-18-1 GarageBand 10.1.5</title>\n<meta property="og:image" content="http://seclists.org/images/fulldisclosure-img.png" />\n<link rel="image_src" href="http://seclists.org/images/fulldisclosure-img.png" />\n<meta name="Subject" content="APPLE-SA-2017-01-18-1 GarageBand 10.1.5"/>\n<meta name="Author" content="Apple Product Security"/>\n<link REL="SHORTCUT ICON" HREF="/shared/images/tiny-eyeicon.png" TYPE="image/png">\n<META NAME="ROBOTS" CONTENT="NOARCHIVE">\n<meta name="theme-color" content="#2A0D45">\n<link rel="stylesheet" href="/shared/css/insecdb.css" type="text/css">\n<!--Google Analytics Code-->\n<script type="text/javascript">\n  (function(

Each message in the FD list is wrapped in seclists.org page markup, including navigation, ads, and trackers, none of which is relevant here. The body of the reply is contained between two comments, `<!--X-Body-of-Message-->` and `<!--X-Body-of-Message-End-->`.

BeautifulSoup doesn't make it easy to select everything between two comments, so we first use simple string indexing to extract the relevant characters. We then pass that fragment through BeautifulSoup so we can use its __.text__ property to strip out the HTML tags. The `html5lib` parser automatically adds tags to create a valid HTML document, so we read the text from the generated `<body>` element.

What we end up with is a plaintext version of the message's body.
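The marker-based slicing can be tried offline, without fetching a page. The following is a minimal sketch using only string methods; `extract_between` and the `sample` string are hypothetical, made up for illustration, and returning `None` for a missing marker is one possible choice rather than something the notebook does.

```python
BODY_START = '<!--X-Body-of-Message-->'
BODY_END = '<!--X-Body-of-Message-End-->'


def extract_between(page, start_marker=BODY_START, end_marker=BODY_END):
    """Return the text between the two markers, or None if either is missing."""
    start = page.find(start_marker)
    if start == -1:
        return None
    # Skip past the opening marker itself.
    start += len(start_marker)
    # Only look for the closing marker after the opening one.
    end = page.find(end_marker, start)
    if end == -1:
        return None
    return page[start:end]


# Made-up page fragment, not real seclists.org output.
sample = ('nav<!--X-Body-of-Message-->\n<pre>hello</pre>\n'
          '<!--X-Body-of-Message-End-->footer')
print(repr(extract_between(sample)))
```

Using `len(start_marker)` instead of a hard-coded offset keeps the slice correct if the marker text ever changes.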

In [3]:
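start_marker = '<!--X-Body-of-Message-->'
end_marker = '<!--X-Body-of-Message-End-->'

# Slice out everything between the two comment markers.
start = content.index(start_marker) + len(start_marker)
end = content.index(end_marker, start)
body = content[start:end]

# html5lib wraps the fragment in a full document, so read the text from <body>.
soup = BeautifulSoup(body, 'html5lib')
bodyhtml = soup.find('body')
raw = bodyhtml.text
raw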

"-----BEGIN PGP SIGNED MESSAGE-----\nHash: SHA512\n\nAPPLE-SA-2017-01-18-1 GarageBand 10.1.5\n\nGarageBand 10.1.5 is now available and addresses the following:\n\nProjects\nAvailable for: OS X Yosemite v10.10 and later\nImpact: Opening a maliciously crafted GarageBand project file may\nlead to arbitrary code execution\nDescription: A memory corruption issue was addressed through improved\nmemory handling.\nCVE-2017-2372: Tyler Bohan of Cisco Talos\n\nInstallation note:\n\nGarageBand 10.1.5 may be obtained from the Mac App Store.\n\nInformation will also be posted to the Apple Security Updates\nweb site: https://support.apple.com/kb/HT201222\n\nThis message is signed with Apple's Product Security PGP key,\nand details are available at:\nhttps://www.apple.com/support/security/pgp/\n-----BEGIN PGP SIGNATURE-----\nComment: GPGTools - https://gpgtools.org\n\niQIcBAEBCgAGBQJYf8YgAAoJEIOj74w0bLRGWiQP+gNnna3Ha0pOdJr/u3LHf/tN\ntpX/lArjvo8ELpqb8wc5iCDXmSq7BgrnOV2T+XNI0XtE1md0xkQ3ttfTmSWB33Nh\nyl

## Signature extraction

We'll attempt to use __talon__ to strip out the signature from the message. Talon provides two different ways to find the signature, "brute force" and "machine learning". 

We'll try the brute force method first. 

In [4]:
import talon
from talon.signature.bruteforce import extract_signature

reply, signature = extract_signature(raw)
signature

In [5]:
reply

"-----BEGIN PGP SIGNED MESSAGE-----\nHash: SHA512\n\nAPPLE-SA-2017-01-18-1 GarageBand 10.1.5\n\nGarageBand 10.1.5 is now available and addresses the following:\n\nProjects\nAvailable for: OS X Yosemite v10.10 and later\nImpact: Opening a maliciously crafted GarageBand project file may\nlead to arbitrary code execution\nDescription: A memory corruption issue was addressed through improved\nmemory handling.\nCVE-2017-2372: Tyler Bohan of Cisco Talos\n\nInstallation note:\n\nGarageBand 10.1.5 may be obtained from the Mac App Store.\n\nInformation will also be posted to the Apple Security Updates\nweb site: https://support.apple.com/kb/HT201222\n\nThis message is signed with Apple's Product Security PGP key,\nand details are available at:\nhttps://www.apple.com/support/security/pgp/\n-----BEGIN PGP SIGNATURE-----\nComment: GPGTools - https://gpgtools.org\n\niQIcBAEBCgAGBQJYf8YgAAoJEIOj74w0bLRGWiQP+gNnna3Ha0pOdJr/u3LHf/tN\ntpX/lArjvo8ELpqb8wc5iCDXmSq7BgrnOV2T+XNI0XtE1md0xkQ3ttfTmSWB33Nh\nyl

For 2017_Jan_0, the brute force method is fairly effective. For 2017_Jan_45 it fails: no signature is returned, and the PGP signature block is still present in `reply`. Next, we'll try the machine learning method for comparison.

In [6]:
talon.init()
from talon import signature
reply_ml, sig_ml = signature.extract(raw, sender="dawid@legalhackers.com")
print(sig_ml)
#reply_ml

None


The machine learning method returns `None` for the signature here. It is unclear whether the library ships already trained; the documentation states that it was trained on the authors' personal email and an Enron data set. There is an open issue on GitHub <https://github.com/mailgun/talon/issues/143> from July asking about the same thing. We will stick with the "brute force" method for now, and continue to look for more libraries.
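Since the PGP signature block in 2017_Jan_45 survives the brute force pass, one possible fallback is to split it off before handing the text to talon. The sketch below is not part of talon and is only an assumption about how such a step might look: `split_pgp` and both patterns are hypothetical names, and it relies on the clearsigned layout visible in the output above (a `BEGIN PGP SIGNED MESSAGE` line, armor headers such as `Hash:`, a blank line, the body, then the `BEGIN PGP SIGNATURE` block). It does not undo dash-escaping and does not verify the signature.

```python
import re

# Clearsign preamble: marker line, optional armor headers, then a blank line.
PGP_HEADER = re.compile(
    r'\A\s*-----BEGIN PGP SIGNED MESSAGE-----\n(?:[^\n]+\n)*\n')

# Signature block: from its BEGIN line to its END line, or to the end of the
# text if the END line is missing (for example, in a truncated message).
PGP_SIG = re.compile(
    r'\n?-----BEGIN PGP SIGNATURE-----.*?(?:-----END PGP SIGNATURE-----|\Z)',
    re.DOTALL)


def split_pgp(text):
    """Return (body, signature_block); signature_block is None if not found."""
    match = PGP_SIG.search(text)
    if match is None:
        return text, None
    # Everything before the signature block, minus the clearsign preamble.
    body = PGP_HEADER.sub('', text[:match.start()], count=1)
    return body, match.group(0).strip()


# Made-up clearsigned message for illustration.
example = ('-----BEGIN PGP SIGNED MESSAGE-----\nHash: SHA512\n\n'
           'Hello world\nSecond line\n'
           '-----BEGIN PGP SIGNATURE-----\nComment: x\n\nabc123\n'
           '-----END PGP SIGNATURE-----\n')
body, pgp_sig = split_pgp(example)
print(body)
```

The returned `body` could then be passed to `extract_signature` to look for a personal signature as before.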

## Extract HTML tags
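We'll use a fairly simple regex to extract any tags from the reply.

`<([^\s>]+)(\s|/>)+`
  * `<` the opening angle bracket of a tag, __followed by__:
  * `([^\s>]+)` one or more characters that are neither whitespace nor `>`, captured as the tag name, __followed by__:
  * `(\s|/>)+` one or more occurrences of either a whitespace character or `/>`, the latter for self-closing tags.

Because the pattern requires whitespace or `/>` after the tag name, a tag with no attributes, such as `<body>`, is not matched, which is why `body` does not appear in the counts.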

We then use a dictionary to count the instances of each unique tag. 

In [7]:
# Raw string so that \s reaches the regex engine unchanged.
rx = re.compile(r'<([^\s>]+)(\s|/>)+')
tags = {}
for tag in rx.findall(str(bodyhtml)):
    tagtype = tag[0]  # first capture group: the tag name
    # Skip closing tags such as </a>.
    if not tagtype.startswith('/'):
        tags[tagtype] = tags.get(tagtype, 0) + 1
print(tags)

{'a': 5, 'pre': 1}
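As a cross-check on the regex counts, the standard library's `html.parser.HTMLParser` can count start tags whether or not they carry attributes. This is a different technique from the regex above, shown only as a sketch; `TagCounter` and `count_tags` are hypothetical names, and the sample string is made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count every start tag, including self-closing ones such as <br/>."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        self.counts[tag] += 1


def count_tags(html):
    """Return a dict mapping each tag name to its number of occurrences."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.counts)


# Made-up fragment for illustration.
sample_html = ('<body><pre style="margin: 0em;">see <a href="x">x</a> '
               'and <a href="y">y</a><br/></pre></body>')
print(count_tags(sample_html))
```

Run against `str(bodyhtml)`, this approach would be expected to count `body` as well, so its totals can differ from the regex output.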
