# FD reply parse

__Example: http://seclists.org/fulldisclosure/2017/Jan/0__

For each reply, we'll attempt to parse out the following:
* the raw reply text, without HTML tags
  * the reply text with any signatures stripped out
* a count of which HTML tags appear in the message
* a list of the domains referenced by links in the message

In [1]:
import re
import requests

from bs4 import BeautifulSoup

We'll gather the contents of a single message. 2017_Jan_0 is one that includes a personal signature, as well as the standard Full Disclosure footer.

2017_Jan_45 is a message that includes a PGP signature.

In [2]:
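from IPython.display import Pretty

year = '2017'
month = 'Jan'
msg_id = '0'  # message number within the month
url = 'http://seclists.org/fulldisclosure/' + year + '/' + month + '/' + msg_id

r = requests.get(url)
r.raise_for_status()  # stop on an HTTP error instead of parsing an error page
content = r.text
Pretty(content)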

<!-- MHonArc v2.6.19 -->
<!--X-Head-End-->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
                      "http://www.w3.org/TR/REC-html40/loose.dtd">
<HTML>
<HEAD>
<link rel="alternate" type="application/rss+xml" title="RSS" href="http://seclists.org/rss/fulldisclosure.rss">
<title>Full Disclosure: Zend Framework / zend-mail &lt; 2.4.11 Remote Code Execution	(CVE-2016-10034)</title>
<meta property="og:image" content="http://seclists.org/images/fulldisclosure-img.png" />
<link rel="image_src" href="http://seclists.org/images/fulldisclosure-img.png" />
<meta name="Subject" content="Zend Framework / zend-mail &lt; 2.4.11 Remote Code Execution	(CVE-2016-10034)"/>
<meta name="Author" content="Dawid Golunski"/>
<link REL="SHORTCUT ICON" HREF="/shared/images/tiny-eyeicon.png" TYPE="image/png">
<META NAME="ROBOTS" CONTENT="NOARCHIVE">
<meta name="theme-color" content="#2A0D45">
<link rel="stylesheet" href="/shared/css/insecdb.css" type="text/css">
<!--Google Analytics Cod

Each message in the FD list is wrapped in seclists.org page markup, including navigation, ads, and trackers, all irrelevant to us. The body of the reply is contained between two comments, `<!--X-Body-of-Message-->` and `<!--X-Body-of-Message-End-->`.

Locating content between two comments is awkward in BeautifulSoup, so we first use simple string indexing to slice out the relevant characters. We then pass that slice through BeautifulSoup so we can use its __.text__ property to strip out the HTML tags. The `html5lib` parser wraps the fragment in `<html>`, `<head>`, and `<body>` tags to create a valid document, so remember to read from the generated `<body>` tag.

What we end up with is a plaintext version of the message's body. 

In [3]:
start_marker = '<!--X-Body-of-Message-->'
end_marker = '<!--X-Body-of-Message-End-->'
start = content.index(start_marker) + len(start_marker)
end = content.index(end_marker)
body = content[start:end]

soup = BeautifulSoup(body, 'html5lib')
bodyhtml = soup.find('body')
raw = bodyhtml.text
Pretty(raw)

Zend Framework < 2.4.11    Remote Code Execution (CVE-2016-10034)
zend-mail < 2.7.2

Discovered by Dawid Golunski (@dawid_golunski)
https://legalhackers.com

Desc:
An independent research uncovered a critical vulnerability in zend-mail, a
Zend Framework's component that could potentially be used by (unauthenticated)
remote attackers to achieve remote arbitrary code execution in the context
of the web server user and remotely compromise the target web application.

To exploit the vulnerability an attacker could target common website
components such as contact/feedback forms, registration forms, password
email resets and others that send out emails with the help of a vulnerable
version of the zend-mail class.

Full advisory / PoC exploit at:

http://legalhackers.com/advisories/ZendFramework-Exploit-ZendMail-Remote-Code-Exec-CVE-2016-10034-Vuln.html

Video / PoC:

https://legalhackers.com/videos/ZendFramework-Exploit-Remote-Code-Exec-Vuln-CVE-2016-10034-PoC.html

For updates, follow:

htt
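
The slicing in the cell above uses `str.index`, which raises a `ValueError` when a marker is missing. That is fine for a single message, but it may be inconvenient when looping over many pages. Below is a minimal sketch of a more forgiving variant, using only the standard library. The name `extract_body` and the sample string are hypothetical and are not part of the original notebook.

```python
START_MARKER = '<!--X-Body-of-Message-->'
END_MARKER = '<!--X-Body-of-Message-End-->'


def extract_body(page):
    """Return the HTML between the two body markers, or None if either is missing."""
    start = page.find(START_MARKER)
    if start == -1:
        return None
    start += len(START_MARKER)
    # Search for the end marker only after the start marker.
    end = page.find(END_MARKER, start)
    if end == -1:
        return None
    return page[start:end]


# Hypothetical page fragment, for illustration only.
sample = 'nav' + START_MARKER + '<pre>hello</pre>' + END_MARKER + 'footer'
print(extract_body(sample))
print(extract_body('<html>no markers here</html>'))
```

Returning `None` rather than raising lets a crawl loop skip index pages or malformed messages; the returned slice can then be handed to BeautifulSoup exactly as before.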

## Signature extraction

We'll attempt to use __talon__ to strip out the signature from the message. Talon provides two different ways to find the signature, "brute force" and "machine learning". 

We'll try the brute force method first. 

In [4]:
import talon
from talon.signature.bruteforce import extract_signature

reply, signature = extract_signature(raw)
Pretty(signature)

'-- \nRegards,\nDawid Golunski\nhttps://legalhackers.com\nt: @dawid_golunski\n\n_______________________________________________\nSent through the Full Disclosure mailing list\nhttps://nmap.org/mailman/listinfo/fulldisclosure\nWeb Archives & RSS: http://seclists.org/fulldisclosure/'

In [5]:
Pretty(reply)

Zend Framework < 2.4.11    Remote Code Execution (CVE-2016-10034)
zend-mail < 2.7.2

Discovered by Dawid Golunski (@dawid_golunski)
https://legalhackers.com

Desc:
An independent research uncovered a critical vulnerability in zend-mail, a
Zend Framework's component that could potentially be used by (unauthenticated)
remote attackers to achieve remote arbitrary code execution in the context
of the web server user and remotely compromise the target web application.

To exploit the vulnerability an attacker could target common website
components such as contact/feedback forms, registration forms, password
email resets and others that send out emails with the help of a vulnerable
version of the zend-mail class.

Full advisory / PoC exploit at:

http://legalhackers.com/advisories/ZendFramework-Exploit-ZendMail-Remote-Code-Exec-CVE-2016-10034-Vuln.html

Video / PoC:

https://legalhackers.com/videos/ZendFramework-Exploit-Remote-Code-Exec-Vuln-CVE-2016-10034-PoC.html

For updates, follow:

htt

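At least for 2017_Jan_0, the brute force method is pretty effective: it separates both the personal signature and the list footer from the reply. On 2017_Jan_45, it was not successful at all. Next, we'll try the machine learning method for comparison.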

In [6]:
talon.init()
from talon import signature
reply_ml, sig_ml = signature.extract(raw, sender="dawid@legalhackers.com")
print(sig_ml)
#reply_ml

None


This returns `None` for the signature. I'm unclear whether the library ships already trained; the documentation states that it was trained on the authors' personal email and an ENRON set. There is an open issue on GitHub <https://github.com/mailgun/talon/issues/143> from July asking about the same thing. We will stick with the "brute force" method for now and continue to look for other libraries.
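
Since the brute force method struggled with the PGP-signed message, one possible workaround is to strip the two machine-generated blocks with plain regexes before (or instead of) calling talon. The sketch below is untested against the archive itself. It assumes the list footer always starts with a line of underscores followed by "Sent through the Full Disclosure mailing list", as in the signature output above, and that PGP signatures use the standard ASCII-armor delimiters. The name `strip_known_blocks` and the sample strings are hypothetical.

```python
import re

# Assumed footer shape: a line of underscores, then the list's
# "Sent through the Full Disclosure mailing list" line, through end of text.
FD_FOOTER = re.compile(
    r'\n_{10,}\s*\nSent through the Full Disclosure mailing list.*\Z',
    re.DOTALL,
)

# Standard ASCII-armor delimiters around a PGP signature block.
PGP_SIGNATURE = re.compile(
    r'-----BEGIN PGP SIGNATURE-----.*?-----END PGP SIGNATURE-----\s*',
    re.DOTALL,
)


def strip_known_blocks(text):
    """Remove the PGP signature block and the list footer, if present."""
    text = PGP_SIGNATURE.sub('', text)
    text = FD_FOOTER.sub('', text)
    return text.rstrip()


# Hypothetical messages, for illustration only.
footer = ('_' * 47 + '\nSent through the Full Disclosure mailing list\n'
          'https://nmap.org/mailman/listinfo/fulldisclosure\n'
          'Web Archives & RSS: http://seclists.org/fulldisclosure/')
plain = 'Advisory text\n\n-- \nRegards,\nAlice\n\n' + footer
signed = ('Body\n-----BEGIN PGP SIGNATURE-----\n\niQEzBAEB\n'
          '-----END PGP SIGNATURE-----\n\n' + footer)

print(repr(strip_known_blocks(plain)))
print(repr(strip_known_blocks(signed)))
```

This leaves personal signatures (the `-- ` block) in place, so it complements rather than replaces talon. It also does not remove the `-----BEGIN PGP SIGNED MESSAGE-----` header that clearsigned messages carry at the top.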

## Extract HTML tags
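We'll use a fairly simple regex to extract any tags from the reply.

`<([^\s>]+)(\s|/>)+`
  * `<` a literal opening angle bracket, __followed by__:
  * `([^\s>]+)` one or more characters that are neither whitespace nor `>`, captured as the tag name, __followed by__:
  * `(\s|/>)+` one or more of either a whitespace character or `/>`, the latter for self-closing tags.

Note that the pattern requires whitespace or `/>` after the tag name, so a tag written with no attributes, such as `<p>`, is not matched.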


We then use a dictionary to count the instances of each unique tag. 

In [7]:
rx = re.compile(r'<([^\s>]+)(\s|/>)+')  # raw string, so \s is not treated as a string escape
tags = {}
for tag in rx.findall(str(bodyhtml)):
    tagtype = tag[0]  # first capture group: the tag name
    if not tagtype.startswith('/'):  # skip closing tags
        tags[tagtype] = tags.get(tagtype, 0) + 1
print(tags)

{'a': 7, 'pre': 1}
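
Because the regex only matches a tag name that is followed by whitespace or `/>`, tags without attributes can go uncounted. As a hedged alternative, the sketch below counts opening tags with the standard library's `html.parser` instead of a regex. The names `TagCounter` and `count_tags` and the sample markup are hypothetical and not part of the original notebook.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count opening tags by name (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <br/>, because the
        # default handle_startendtag delegates to handle_starttag.
        self.counts[tag] += 1


def count_tags(html):
    """Return a dict mapping each tag name to its number of opening tags."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.counts)


# Hypothetical markup, for illustration only.
sample = ('<pre style="margin: 0em;">See '
          '<a href="https://example.com">this</a><br/><p>text</p></pre>')
print(count_tags(sample))
```

On this sample the parser reports `pre`, `a`, `br`, and `p` once each, whereas the regex above would miss the attribute-less `<p>`. The same counts could likely be obtained from the existing BeautifulSoup tree, but the standard-library parser keeps the example dependency-free.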


## Extract link domains

We'll record what domains are linked to in each message. We use BeautifulSoup to pull out all `<a>` tags, then urlparse to determine the domain within.

In [10]:
from urllib.parse import urlparse

sites = {}

atags = bodyhtml.find_all('a')
hrefs = [link.get('href') for link in atags]

for link in hrefs:
    if link is None:  # skip <a> tags that have no href attribute
        continue
    parsedurl = urlparse(link)
    site = parsedurl.netloc
    sites[site] = sites.get(site, 0) + 1

sites

{'legalhackers.com': 4, 'nmap.org': 1, 'seclists.org': 1, 'twitter.com': 1}
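
The counting step can also be written as a small reusable function. The sketch below is one possible variant, assuming we want to ignore links with no host (such as `mailto:` or relative links) and treat `Example.com:80` and `example.com` as the same site. It uses `hostname` rather than `netloc`, which is a deliberate difference from the cell above. The name `count_domains` and the sample hrefs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def count_domains(hrefs):
    """Count link hosts, ignoring missing hrefs and links without a host."""
    counts = Counter()
    for href in hrefs:
        if not href:
            continue
        # hostname is lowercased and excludes any port or credentials.
        host = urlparse(href).hostname
        if host:
            counts[host] += 1
    return dict(counts)


# Hypothetical hrefs, for illustration only.
sample = [
    'https://legalhackers.com',
    'http://LegalHackers.com:80/advisories/',
    'mailto:someone@example.com',
    '/fulldisclosure/2017/Jan/1',
    None,
]
print(count_domains(sample))
```

With `netloc`, the second sample link would be counted under a separate `LegalHackers.com:80` key; whether that distinction is useful depends on the analysis.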