URL scanner #65
Thinking on it a little more, there are two potential problems with removing them (the PhishTank signatures) altogether and just relying on the Virus Total API:
That said, if I built a URL scanner, it should be easy to code phpMussel to make a remote call to the PhishTank service itself to check URLs, instead of going through the Virus Total API, thus avoiding the issue of Virus Total quota limitations; in that case, I could probably still remove these signatures. PhishTank provides an API of its own for doing this, along with documentation for using it -> https://www.phishtank.com/api_info.php The documentation is clear about how to use the API, but, like Virus Total, and as we'd expect, it mentions a quota limit of its own without specifying what that limit actually is. If I wrote code to check URLs against their API, I'd probably need to contact them to ask about the quota limit, so that phpMussel could be coded to conform to it correctly. Of course, this also means that the same inherent problem exists: the detection of a malicious URL in scanned content can be easily avoided simply by stuffing a number of benign URLs into the content ahead of the malicious URL.

The main reason I'm thinking about this idea, though, is twofold. Firstly, the TTL (Time-To-Live) for malicious URLs tends to be extremely short, averaging anywhere from slightly over a month down to just a day or so, because most operators of malicious URLs don't want to keep the same URL for very long, and because they can generate new URLs almost immediately whenever they need to; compare that with how frequently I'm able to produce updates for these particular signatures (if a malicious URL is created in a second, used extensively for a day, and then discarded, and if a new release of phpMussel is available approximately once per month...). Secondly, there's the matter of how many of them I'm able to actually include before it starts to negatively impact the footprint of the signature files (similarly to the SecuriteInfo signatures, where I'm only including about ~20% of the total available signatures, the number of active URLs in the PhishTank database that I generate signatures for is quite small compared to the total number available in their database). So, I'm beginning to wonder how worthwhile it is to include these sorts of signatures at all. I believe malicious URLs shouldn't be ignored, but, regarding how worthwhile these signatures are, I'm not entirely sure.
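To make the PhishTank lookup idea concrete, here is a minimal sketch of building such a request. This is illustrative Python (phpMussel itself is PHP), and the endpoint and field names are taken from PhishTank's published API documentation as I understand it; treat them as assumptions to verify against that documentation before use.

```python
import base64

# Endpoint per PhishTank's check-URL API docs (assumption; verify there).
PHISHTANK_ENDPOINT = "https://checkurl.phishtank.com/checkurl/"

def build_phishtank_request(url, app_key=None):
    """Build the POST fields for a PhishTank check-URL request.

    Their docs state the URL may be submitted base64-encoded, and that
    results can be requested as JSON. An optional application key is
    tied to the (unpublished) per-key quota limit discussed above."""
    fields = {
        "url": base64.b64encode(url.encode("utf-8")).decode("ascii"),
        "format": "json",
    }
    if app_key:
        fields["app_key"] = app_key
    return PHISHTANK_ENDPOINT, fields
```

The actual HTTP POST (and caching of responses, to stay within quota) would be layered on top of this.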
What about malwaredomainlist.com and other lists?
Also true, and good point.
Some more thoughts. I have to check whether there are ways to fetch some of this data using cURL in PHP and parse it. We can parse HTML output using SimpleXML, so this should not be a problem, and getting results from the Google Safe Browsing list is quite easy. I found some repos which use results from Google Safe Browsing, MDL, PhishTank, ZeusTracker and some others; I will create a list of the websites which are used in those tools. In PHP there are also file_get_contents, header and other functions which may be used. I have to check some payloads from websites, to see how they deliver the final payload from the external source. IPv4 addresses should be easy to match, as should domains, even with // instead of http://; IPv6 might be a bit trickier to find, but should be possible. I will start some tests with preg patterns for IPs and domains. There will be some parts where I am unsure, eg IPv6 and the new generic TLDs. Leaving some links here (will update them regularly):
Lists used for URL reputation:
Parameters which have to be scanned: While searching for further methods which have to be scanned for URLs and IPs, I also found some more articles about malicious JavaScript: http://resources.infosecinstitute.com/analyzing-javascript/ Are things like this already detected and blocked by phpMussel? For example, there are document.write, iframes, packers/obfuscators like function(p,a,c,k,e,d) and so on. It would maybe make sense to add an option to scan for malicious JavaScript functions and suspicious code like these. Regex patterns for IPs and domains: http://stackoverflow.com/questions/8249420/alter-regex-to-allow-ip-address-when-checking-url
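As a starting point for those preg tests, here is a rough sketch of extracting IPv4 addresses and domains from arbitrary text. This is illustrative Python rather than PHP preg code, and the patterns are deliberately naive; as noted above, IPv6, new generic TLDs and IDNs would all need more care.

```python
import re

# Rough patterns for illustration only; not validated against all edge cases.
IPV4_RE = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')
DOMAIN_RE = re.compile(r'\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b', re.IGNORECASE)

def extract_candidates(text):
    """Return (ips, domains) found in a block of text.

    The domain pattern requires an alphabetic final label of at least two
    characters, so dotted-quad IPs are not double-counted as domains."""
    ips = IPV4_RE.findall(text)
    domains = [d for d in DOMAIN_RE.findall(text) if d not in ips]
    return ips, domains
```

Each extracted candidate could then be checked against the signature data or a remote API.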
Useful list and links. :-)
Mostly; phpMussel does already attempt to detect and block these types of things, but it needs to be improved. There are basic signatures and detection methods used by phpMussel that are able to detect some of the most common malicious JavaScript, droppers, webshells, etc., but there are still many such things that aren't detected easily, or which can be modified easily to avoid detection. Some of the information you've linked to should be useful for helping with this issue, though.
I couldn't get the above-mentioned regex to work properly, but I wrote something myself (it's a little hackish, but seems to work; we can always improve on it though). I'll fork an "experimental" branch shortly.
Committed. :-) The URLs that can now be detected with this initial commit come from the current malc0de data; to test it, I wrote a plain-text file containing a few of those URLs. Awaiting responses to some emails I've sent to database providers before adding any API functionality for them.
Great, I will test it further over the next few days and write test files for it, so you can test easily with them.
Cool, looking forward to it. :-) I've received some replies to some of the emails that I'd sent, all of which have been positive so far. I'll start writing up some API code soon for those services.
Great news, I am very excited about the upcoming changes. |
Cool. :-) Code for performing API lookups to hpHosts has been added (seems to work correctly); I've been experimenting with the Google Safe Browsing API, but haven't fully worked it out yet. More to come soon.
Still looking into implementing further APIs, but also still having difficulties with the Google Safe Browsing API. :-/ They've very thoroughly documented the API, its requirements, how to use it and so on, but I'm still having a little difficulty understanding it at the moment. I'm sure I'll eventually understand it, but it may take a little longer than I'd originally anticipated. '^.^ Also, I've tried searching for examples online, thinking that for something like Google Safe Browsing, surely, there must be good examples from other people trying to implement it, and I've found some things, but not very much. I did find the samcleaver/phpGSB repository, which sounds interesting; I haven't yet tested it, though. Anyhow, for the code already written, caching seems to be working correctly now, so I don't think caching will be a problem (although, of course, some API services may have their own specific requirements about this).
It seems cURL is required to get the data: http://stackoverflow.com/questions/30115897/google-safe-browsing-api-unable-to-download-list That's what the mentioned repository does (but its class is very big).
I think you're right. I'll try to write something using cURL; hopefully I'll have some luck with it. Also, the work done so far has been merged back to master (still no functional code for the Google Safe Browsing API yet, but everything written so far appears to be stable).
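For what it's worth, the request itself is fairly small once the API shape is understood. The sketch below builds the endpoint and JSON body for Google's Safe Browsing Lookup API in its current v4 form (`threatMatches:find`), based on Google's published documentation; it is illustrative Python rather than the cURL/PHP code discussed here, and the client identifiers are placeholders.

```python
import json

# Endpoint per Google's Safe Browsing v4 Lookup API docs (assumption; verify).
GSB_ENDPOINT = "https://safebrowsing.googleapis.com/v4/threatMatches:find"

def build_gsb_request(urls, api_key, client_id="example-client", client_version="1.0"):
    """Build the URL and JSON body for a Safe Browsing lookup.

    The Lookup API sends the URLs directly and is far simpler than the
    Update API, which requires downloading and maintaining hash lists."""
    body = {
        "client": {"clientId": client_id, "clientVersion": client_version},
        "threatInfo": {
            "threatTypes": ["MALWARE", "SOCIAL_ENGINEERING"],
            "platformTypes": ["ANY_PLATFORM"],
            "threatEntryTypes": ["URL"],
            "threatEntries": [{"url": u} for u in urls],
        },
    }
    return GSB_ENDPOINT + "?key=" + api_key, json.dumps(body)
```

The body would then be POSTed with a `Content-Type: application/json` header; an empty `matches` field in the response means none of the URLs were flagged.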
Ok, I made some tests with the example. The cache works great. I will link the files that I have used for the tests. https://gist.github.com/DanielRuf/049f856fb2a5899dc28e We may create some test files for the URL Scanner to test further detection and improve it.
Some links to some more APIs: http://apidocs.emergingthreats.net/
Cheers. :-)
A new run of the tests just gives a failure for every test; each test returns 1 instead of 2. The tests use 360safe.com as the domain, and it seems this was removed from newer signatures. I think I will create new test files. Can you give me a specific example domain for testing (preferably one which many people know, so there is no new threat)? I have manually downloaded a version of the URL Scanner signatures which contains the MD5 hash of the 360safe.com domain and ran the tests again. Code base: 3a7b07aeca9026b16c6e5ace0c9c4261906424f1
Yeah, it was. I forgot to mention that here, though. Sorry about that.
"One which many people know" could be tricky, because I'm not sure whether there are any truly definitive malicious domains that are more well-known than most others, but, for the purposes of testing, "kitdriver.com" could work; I found that domain last month, and it seems to have lots of fake driver downloads containing malware. Virus Total results for that domain have a detection ratio of only 4/66, though, so I don't think I could call it well-known. Also, because of the TLD-level signatures, any domains attached to these TLDs:
..should be flagged as malicious, so we could possibly write new test files using some example domain names (hxxp://www123.blah-blah-blah.link/, etc.) attached to those TLDs for the purposes of testing.
Okay, thanks for the information. I will prepare new test files soon and see if I can contribute some decoding functions.
I will create the test files next week. What has the highest priority at the moment (testing feature X, code for feature X, translations, ...)?
(Replying via mobile phone, by email.) The person handling the Arabic translations is still working on them, but there are still plenty of features (and a few bug fixes) on the to-do list, and there are some already implemented features that haven't yet been properly tested. Anyhow, I'll reply again later tonight when I'm next at a computer. :-)
Ok, thanks for the information. Yes, we need more testing as this has helped to find and solve a few bugs. |
No worries, and definitely. :-) As well as finding and solving possible bugs and errors, testing could help us determine where weaknesses exist, such as any failed detections (eg, undetected obfuscated and/or encoded links), any potential vulnerabilities (eg, places where malicious links could be automatically followed, or similar), and new ideas on where to expand the functionality. I haven't looked at the Google Safe Browsing API integration again since we last spoke about it, so that's still something to do eventually, although I'm not sure when I'll get around to it.
Ran a few tests and used the kitdriver domain.
Used test files: I did not yet have the time to write the PHP script to generate these test files, but will do so this week (I hope). Do you think we should decode encoded strings (check for base64-encoded strings, str_rot13, and so on), or just check plain-text strings? I think it makes sense to just check plain-text strings, as trying to detect encoded strings and decode them might consume too much memory and possibly slow down the website / server (eg, when there is a very large gzipped, encoded string). It may also not be easy to track / check whether any variables are used in some decoding step, as we would have to look them up and actively check their contents against the URL Scanner database. Static analysis might be better than a dynamic approach, which would cause huge problems when there are many strings. What do you think? We have the regex for getting domains (and IPs?), and it might be enough to check them directly against the URL Scanner database, as these are often in plain sight (not encoded or encrypted in most cases; often they are packed with base64_encode / unpacked with base64_decode, eval and so on, which is already caught by phpMussel).
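One middle ground on the memory concern raised above would be to decode only plausible, size-capped base64 runs, skipping anything larger. Here is an illustrative Python sketch of that idea; the size cap and minimum-length threshold are arbitrary values chosen for the example, not anything from phpMussel.

```python
import base64
import binascii
import re

MAX_DECODE = 64 * 1024  # arbitrary cap, illustrating the bounded-memory idea
B64_RE = re.compile(r'[A-Za-z0-9+/]{24,}={0,2}')  # plausible base64 runs only

def probe_base64(text):
    """Decode size-capped base64 candidates found in text.

    Oversized runs are skipped rather than decoded, trading coverage for
    bounded memory use; invalid candidates are silently ignored."""
    out = []
    for chunk in B64_RE.findall(text):
        if len(chunk) > MAX_DECODE:
            continue
        try:
            out.append(base64.b64decode(chunk, validate=True))
        except (binascii.Error, ValueError):
            pass
    return out
```

Decoded chunks could then be fed back through the same plain-text URL checks, giving one level of de-obfuscation without recursive decoding.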
Added the unit tests and the generation script for the used test files to the extras repository. The generation script will create the files in the same format as I have used for the URL Scanner unit tests here. We might have to tune or improve the regex, strip all linebreaks, and just check for the domain + TLD part, as the protocol is not required in the markup (protocol-less or protocol-relative URIs are not uncommon). Backtracking should be useful. For testing I use https://3v4l.org, https://ideone.com/ and some other tools.
To a very limited degree, this should already be happening, due to the process that phpMussel uses to "normalise" text (it looks for function calls in the data being scanned). Beyond that, though, it isn't currently doing much with this type of data (it's not normally difficult to look for function calls, and if we know where the function calls are, we can do tricks like splitting up the text around them). I like the idea of being able to find potentially malicious URLs and other malicious code hidden within encoded and/or obfuscated data, but I agree that focusing too much on decoding things like this could potentially create problems with memory and slow down websites, so I think decoding data like this is something we should be careful with, if we do anything with it at all. Otherwise, in most situations, plain text should be sufficient anyhow.
I think checking them directly, the way they're checked currently, should be enough; at least until we find information to suggest otherwise. We can always build on this in the future if the need arises, but there's no point adding to the footprint unnecessarily.
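To illustrate what that kind of normalisation can look like (this is a simplified Python sketch, not phpMussel's actual routine, and the function list is purely illustrative): lowercasing, dropping whitespace, and undoing the simplest quoted-string concatenation trick is often enough to make pulled-apart function names searchable again.

```python
import re

# Illustrative watch-list, not phpMussel's signature set.
SUSPECT_CALLS = ("eval", "base64_decode", "str_rot13", "gzinflate")

def normalise(data):
    """Rough normalisation: lowercase, undo 'spl'.'it'-style string
    concatenation, and strip all whitespace."""
    data = data.lower()
    data = re.sub(r'["\']\s*\.\s*["\']', '', data)  # 'ba'.'se64' -> 'base64'
    return re.sub(r'\s+', '', data)

def flag_suspect_calls(data):
    """Return the watched function names that appear as calls after
    normalisation."""
    text = normalise(data)
    return [fn for fn in SUSPECT_CALLS if fn + "(" in text]
```

Because the search runs over the normalised form, tricks like `eVal ( ... )` or split-up names no longer hide the call.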
Awesome, and sounds good. :-)
Good point. This is definitely something we should do.
Right. So decoding this makes no sense when the function is already blacklisted anyway.
Totally agree; at the moment the scan engine of phpMussel should be good enough. Every webshell and piece of malicious code shares at least some specific function calls. I am just weighing the disadvantages and advantages of specific approaches, and performance vs security. In my opinion we have a good balance currently, as the benchmarks are very promising.
Bump #65 (comment). 3 tests still fail. I have also changed the regex a bit to support protocol-less URLs and escaped slashes. The regular expression could possibly be improved further.
If we make the beginning of the pattern optional, I think it'd be better to do away with that part of the pattern entirely, because the results will be the same either way, and if we kept it, we'd be looking for something we didn't need to be looking for. My thought in originally including the URI schemes / protocols for URLs was that it should make them easier to distinguish from arbitrary non-URL text, and from contexts where a domain could be referenced without necessarily being a URL (so, looking for the URI scheme / protocol at the beginning should help avoid capturing unnecessary data). But, that said, looking at IANA's list of currently recognised URI schemes... ouch. A rather big list, and, at the extreme of including all possible recognised URI schemes, I'd rather not include a list that big. So, perhaps we should look at other options. Maybe we could look for anything that resembles a protocol, without necessarily looking for specific protocols, rather than enumerating them. And good catch with the escaped slashes.
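A generic-scheme approach along those lines can be sketched roughly as follows. This is an illustrative Python pattern, not the actual regex from the experimental branch: it accepts any RFC 3986-style scheme (a letter followed by letters, digits, `+`, `-` or `.`), makes the scheme optional so protocol-relative `//host` forms still match, and tolerates escaped slashes (`http:\/\/`) as found in JSON/JavaScript strings.

```python
import re

# Optional generic scheme, then // (possibly with escaped slashes),
# then a captured host; illustrative only.
URL_RE = re.compile(
    r'(?:[a-z][a-z0-9+.-]*:)?'      # any RFC 3986-style scheme, optional
    r'(?:\\?/){2}'                  # // or \/\/
    r'((?:[a-z0-9-]+\.)+[a-z]{2,})',  # captured host part
    re.IGNORECASE,
)

def find_hosts(text):
    """Return the host parts of anything resembling a URL in text."""
    return URL_RE.findall(text)
```

This avoids enumerating schemes entirely while still requiring the `//` that distinguishes URL-like references from bare domain mentions.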
Support for Google Safe Browsing API lookups implemented. :-)
Awesome =) I will take a look at the code later and test it over the next few days.
Bug identified: in the current master codebase, remote API lookups aren't being performed when there aren't any URL scanner signature files installed/enabled/active (this was probably introduced as a result of the most recent rewrite I did, a while back, of the way that signature files are handled). These two things (signature file checks and API lookups) should be independent of each other. I just noticed this problem tonight, but it should be reasonably easy to fix (mostly just separating out a few different code sections). I'll try to write a fix for this either tomorrow or over the weekend.
Better to open a new issue to track this separately =) I guess we really need some integration and unit tests (using PHPUnit), but I am not so deep in the code after this long timespan.
I'm thinking about maybe trying to build a URL scanner into phpMussel, to detect malicious URLs from within files and/or data scanned by phpMussel.
At present, phpMussel already supports Virus Total integration for scanning file hashes using the Virus Total API, and already utilises data from PhishTank. If I built a proper URL scanner into phpMussel, I could optimise that PhishTank data to run more quickly and efficiently, such as by writing hash-based signatures (hashes of malicious URLs) for use instead of the current string-based signatures (string segments of malicious URLs); or, by leveraging the URL scanning capabilities of the Virus Total API (which already checks URLs against PhishTank as well as dozens of other engines), the PhishTank data included with phpMussel could be rendered obsolete and removed altogether.
One of the downsides of using hash-based signatures instead of string-based signatures in this context is that any signatures written could only match something very specific (like a specific, precisely matching URL), whereas string-based signatures are able to match things more loosely (such as the domain component, query component or some specific section of a URL). The upside to this, though, is that hash-based signatures can be checked much more quickly and have a smaller footprint, meaning that I could potentially include many more of them for the same total cost of a smaller number of their string-based counterparts.
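The trade-off can be sketched concretely. In this illustrative Python example (the signature values are hypothetical, and phpMussel's actual signature formats differ), a hash-based signature is a single exact set lookup, while string-based signatures require a scan but can match any fragment of a URL:

```python
import hashlib

# Hypothetical signature stores, for illustration only.
HASH_SIGNATURES = {hashlib.md5(b"evil.example.com/payload.exe").hexdigest()}
STRING_SIGNATURES = ("evil.example.com",)

def check_url(url):
    """Hash-based matching: one O(1) set lookup, exact URLs only.
    String-based matching: a linear scan, but loose fragment matches."""
    if hashlib.md5(url.encode()).hexdigest() in HASH_SIGNATURES:
        return "hash match"
    for fragment in STRING_SIGNATURES:
        if fragment in url:
            return "string match"
    return "clean"
```

The hash store also keeps a fixed 32-character footprint per signature regardless of URL length, which is why many more hash-based signatures fit in the same space.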
Of course, if I just rendered them obsolete and removed them altogether, the footprint or efficiency of them either way would become irrelevant.
I've thought of a few different ways that this could be done, and a few different applications of it. URLs could be searched for in HTML files, Email/EML files, things like that, or, conforming with how things currently are, URLs could be searched for within whatever is thrown at the `phpMussel_mail()` function. In those instances, simply looking for instances of `<a...>...</a>`, `href="..."` and other variants thereof (hyperlink pseudo-code; would probably use regex or something similar, and could refine this later, if we decided to proceed with this idea) should be sufficient.

There's no hurry with this idea, but I thought I'd post about it now anyhow, so that it could be discussed.
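The hyperlink-extraction part of that idea is small; here is a minimal sketch in Python (illustrative only; a real implementation would also need to handle unquoted attributes, HTML entities, and so on):

```python
import re

# Matches href attributes with single- or double-quoted values,
# case-insensitively; deliberately minimal.
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def extract_links(html):
    """Return the href values found in a chunk of HTML-ish text."""
    return HREF_RE.findall(html)
```

Each extracted link would then go through whatever URL checks (signatures or remote APIs) the scanner ends up using.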
@DanielRuf What are your thoughts on this?