
URL scanner #65

Closed
Maikuolan opened this issue Oct 1, 2015 · 38 comments

Comments

@Maikuolan
Member

I'm thinking about maybe trying to build a URL scanner into phpMussel, to detect malicious URLs from within files and/or data scanned by phpMussel.

At present, phpMussel already supports Virus Total integration for scanning file hashes using the Virus Total API, and it already utilises data from PhishTank. If I built a proper URL scanner into phpMussel, I could optimise the PhishTank data to run more quickly and efficiently, such as by writing hash-based signatures (hashes of malicious URLs) instead of the current string-based signatures (string segments of malicious URLs). Alternatively, by leveraging the URL scanning capabilities of the Virus Total API (which already checks URLs against PhishTank as well as dozens of other engines), the PhishTank data included with phpMussel could be rendered obsolete and removed altogether.

One of the downsides of using hash-based signatures instead of string-based signatures in this context is that any signatures written could only match something very specific (like a specific, precisely matching URL), whereas string-based signatures are able to match things more loosely (such as the domain component, query component or some specific section of a URL). The upside to this, though, is that hash-based signatures can be checked much more quickly and have a smaller footprint, meaning that I could potentially include many more of them for the same total cost of a smaller number of their string-based counterparts.
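
To illustrate the trade-off, here's a minimal sketch (hypothetical helper names and example data, not phpMussel's actual signature format): the hash-based check is a single cheap lookup but only matches exact URLs, while the string-based check has to search for each fragment but can match loosely.

<?php
// Sketch only: hypothetical example data, not phpMussel's actual signature format.

// Hash-based signatures: exact matches only, but each check is a cheap array lookup.
$hashSignatures = array_flip([
    md5('http://malicious.example/landing.php'),
    md5('http://another.example/dropper.exe'),
]);

function checkUrlHash($url, $hashSignatures)
{
    return isset($hashSignatures[md5($url)]);
}

// String-based signatures: looser (can match just a domain or a query
// fragment), but each one costs a substring search.
$stringSignatures = ['malicious.example', 'dropper.exe?id='];

function checkUrlString($url, $stringSignatures)
{
    foreach ($stringSignatures as $fragment) {
        if (strpos($url, $fragment) !== false) {
            return true;
        }
    }
    return false;
}

var_dump(checkUrlHash('http://malicious.example/landing.php', $hashSignatures));   // true (exact match only)
var_dump(checkUrlString('http://malicious.example/other.php', $stringSignatures)); // true (loose match)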

Of course, if I just rendered them obsolete and removed them altogether, their footprint and efficiency would become irrelevant either way.

I've thought of a few different ways that this could be done and a few different applications of it: URLs could be searched for within HTML files, Email/EML files, things like that, or, in keeping with how things currently work, URLs could be searched for within whatever is thrown at the phpMussel_mail() function. In those instances, simply looking for instances of <a...>...</a>, href="..." and other variants thereof (hyperlink pseudo-code; this would probably use regex or something similar and could be refined later, if we decided to proceed with this idea) should be sufficient.
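
As a first rough pass at that hyperlink-hunting step (a sketch only; the real pattern would need the refining mentioned above), something like preg_match_all() over href attributes could pull candidate URLs out of the data before they're checked:

<?php
// Sketch: pull candidate URLs out of HTML-ish data before checking them.
$data = 'Hello <a href="http://example.com/page">link</a> and <a href=\'//cdn.example.net/x.js\'>more</a>.';

preg_match_all(
    '~href\s*=\s*["\']?([^"\'\s>]+)~i',
    $data,
    $matches
);

$candidateUrls = array_unique($matches[1]);
print_r($candidateUrls);
// Each candidate would then be checked against the URL signatures and/or APIs.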

There's no hurry with this idea, but I thought I'd post about it now anyhow so that it could be discussed.

@DanielRuf What are your thoughts on this?

@Maikuolan
Member Author

Thinking on it a little more, two potential problems with removing them (the PhishTank signatures) altogether and just relying on the Virus Total API:

  • Anyone who doesn't use the Virus Total integration feature wouldn't have access to it, and thus would suddenly lose access to that data if it was removed.
  • Detection of a malicious URL in content being scanned could easily be avoided simply by stuffing enough benign URLs into the content ahead of the malicious URL to exhaust the API quota before the malicious URL is processed.

That said, though, if I built a URL scanner, it should be easy to code phpMussel to just make a remote call to the PhishTank service itself to check URLs, instead of the Virus Total API, thus avoiding the issue of Virus Total quota limitations; In that case, I could probably still remove these signatures.

PhishTank provides an API of its own for doing this, and documentation for using their API -> https://www.phishtank.com/api_info.php
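
For reference, a minimal cURL sketch of what such a lookup might look like, assuming the checkurl endpoint and fields described on that page (the exact response structure and any required app key should be confirmed against their documentation before relying on this):

<?php
// Sketch: query PhishTank's checkurl endpoint for a single URL.
function phishtankLookup($url, $appKey = '')
{
    $ch = curl_init('https://checkurl.phishtank.com/checkurl/');
    curl_setopt_array($ch, [
        CURLOPT_POST => true,
        CURLOPT_POSTFIELDS => http_build_query([
            'url' => $url,
            'format' => 'json',
            'app_key' => $appKey,
        ]),
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT => 12,
    ]);
    $response = curl_exec($ch);
    curl_close($ch);
    // Decode the JSON response; null if the request failed.
    return is_string($response) ? json_decode($response, true) : null;
}

$result = phishtankLookup('http://example.com/');
print_r($result);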

The documentation is clear regarding how to use their API, and, like Virus Total, and like we'd expect, it mentions a quota limit of its own, but it doesn't specifically mention what that quota limit actually is. If I wrote code to check URLs against their API, I think I'd probably need to contact them to ask about their quota limit, so that phpMussel could be coded to correctly conform to it.

Of course, this also means the same inherent problem exists: detection of a malicious URL in scanned content could still be avoided simply by stuffing a number of benign URLs into the content ahead of the malicious URL.

The main reason I'm thinking about this idea, though, is that the TTL (Time-To-Live) for malicious URLs tends to be extremely short, averaging anywhere from slightly over a month down to just a day or so: most operators of malicious URLs don't want to keep the same URL for very long, and they can very quickly and easily generate new URLs almost immediately whenever they need to. Combine that with how frequently I'm able to produce updates for these particular signatures (if a malicious URL is created in a second, used extensively for a day, and then discarded, and a new release of phpMussel is available approximately once per month...), and with how many of them I'm able to actually include before they start to negatively impact the footprint of the signature files (similarly to the SecuriteInfo signatures, where I'm only including about ~20% of the total available signatures, the number of active URLs in the PhishTank database that I generate signatures for is quite small compared to the total number of URLs in their database), and I'm beginning to wonder how worthwhile it is to include these sorts of signatures at all. I believe malicious URLs shouldn't be ignored, but, regarding how worthwhile these signatures are, I'm not entirely sure.

@DanielRuf
Contributor

What about malwaredomainlist.com and other lists?
https://zeltser.com/malicious-ip-blocklists/
src used for external scripts, location.href or redirect in code, and much more need to be scanned.

@Maikuolan
Member Author

Also true, and good point.

@DanielRuf
Contributor

Some more.

I have to check if there are ways to obtain some data using curl in PHP and parse it.

We can parse HTML output using simplexml so this should not be a problem.

Also getting results from Google SafeBrowsing list is quite easy.

Found some repos which use results from Google SafeBrowsing, MDL, PhishTank, ZeusTracker and some others.

I will create a list of the websites which are used in the tools.

Ok and in PHP there are file_get_contents, header and other methods which may be used.

I have to check some payloads from websites to see how they deliver the final payload from the external source.

IPv4 addresses might be easy to find, as might domains, even with // instead of http://. IPv6 might be a bit trickier to find, but it should be possible.

I will start some tests with preg patterns for IPs and domains. There will be some parts where I'm unsure, e.g. about IPv6 and the new generic TLDs.

Leaving some links here (will update them regularly):
http://daringfireball.net/2010/07/improved_regex_for_matching_urls
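
For what it's worth, a rough starting point for those preg tests might look like this (deliberately loose patterns; IPv6 and the newer generic TLDs would still need more work, as noted):

<?php
// Sketch: very rough first-pass patterns for IPv4 addresses and domains.
$data = 'Payload at http://198.51.100.23/x.php and //evil-example.com/a.js plus ftp://files.example.org/';

// IPv4 (loose: does not reject octets greater than 255).
preg_match_all('~\b(?:\d{1,3}\.){3}\d{1,3}\b~', $data, $ips);

// Domains, with or without a scheme (loose: also works for protocol-relative // URLs).
preg_match_all('~(?:[a-z][a-z0-9+.-]*:)?//([a-z0-9-]+(?:\.[a-z0-9-]+)+)~i', $data, $domains);

print_r($ips[0]);     // the IPv4 matches
print_r($domains[1]); // the captured host names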

@DanielRuf
Contributor

Lists used for URL reputation:

  • Google SafeBrowsing (lookup API)
  • PhishTank
  • GameOverZeus (goz)
  • CyberCrimeTracker
  • McAfee siteadvisor
  • Malc0de
  • MalwareDomainList
  • BlockList_de
  • Spamhaus
  • SpamRATS
  • DNSBL_AbuseCH
  • SpamCop
  • ZeuS Tracker
  • URIBL
  • MegaRBL
  • VirBL
  • Backscatterer
  • WPBL

Parameters which have to be scanned:
src, href, window.location, file_get_contents, $/jQuery(location).attr('href'), header(), CURLOPT_URL and curl_init, http-equiv="refresh" redirect, ...
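
A quick sketch of how those sinks might be flagged for closer inspection during a scan (hypothetical list and helper, not how phpMussel currently structures its checks):

<?php
// Sketch: flag content containing known URL/redirect sinks for closer inspection.
$sinks = [
    'src=', 'href=', 'window.location', 'location.href',
    'file_get_contents(', 'header(', 'curl_init(', 'CURLOPT_URL',
    'http-equiv="refresh"',
];

function findSinks($content, array $sinks)
{
    $found = [];
    foreach ($sinks as $sink) {
        if (stripos($content, $sink) !== false) {
            $found[] = $sink;
        }
    }
    return $found;
}

print_r(findSinks('<meta http-equiv="refresh" content="0;url=http://evil.example/">', $sinks));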

While searching for further methods which have to be scanned for URLs and IPs, I also found some more articles about malicious JavaScript:

http://resources.infosecinstitute.com/analyzing-javascript/
http://security.stackexchange.com/questions/96985/similarities-among-all-most-malicious-javascript
https://aw-snap.info/articles/js-examples.php

Are things like this already detected and blocked by phpMussel? For example, there are document.write, iframe, packers/obfuscators like function(p,a,c,k,e,d) and so on.

It would maybe make some sense to add an option to scan for malicious JavaScript functions and suspicious code like the examples mentioned above.

Regex patterns for IPs and domains:

http://stackoverflow.com/questions/8249420/alter-regex-to-allow-ip-address-when-checking-url
http://stackoverflow.com/questions/1925455/how-to-mimic-stackoverflow-auto-link-behavior
http://daringfireball.net/2010/07/improved_regex_for_matching_urls
http://stackoverflow.com/questions/5865817/regex-to-match-an-ip-address
http://stackoverflow.com/questions/9165922/regex-for-ip-address/20270082#20270082
http://blog.markhatton.co.uk/2011/03/15/regular-expressions-for-ip-addresses-cidr-ranges-and-hostnames/
http://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url

@Maikuolan
Member Author

Useful list and links. :-)

Are things like this already detected and blocked by phpMussel? For example, there are document.write, iframe, packers/obfuscators like function(p,a,c,k,e,d) and so on.

Mostly; phpMussel does already attempt to detect and block these types of things, but, it needs to be improved.

There are basic signatures and detection methods used by phpMussel that are able to detect some of the most common malicious JavaScript, droppers, webshells, etc, but there are still many such things that aren't detected easily, or which can be modified easily to avoid detection.
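
As an illustration of the kind of naive marker-based check involved (a sketch only, not phpMussel's actual signature logic), and of why trivial renaming can dodge it:

<?php
// Sketch: naive marker matching for common obfuscation/dropper patterns.
$markers = [
    'eval(function(p,a,c,k,e,d)',  // classic JS packer header
    'document.write(unescape(',    // common injected-script pattern
];

$sample = '<script>eval(function(p,a,c,k,e,d){/* ... */})</script>';

foreach ($markers as $marker) {
    if (stripos($sample, $marker) !== false) {
        echo "Suspicious marker found: {$marker}\n";
    }
}
// Renaming p,a,c,k,e,d or inserting whitespace already defeats this,
// which is why normalisation and better signatures are needed.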

Some of the information you've linked to should be useful for helping with this issue though.

@Maikuolan
Member Author

The regex mentioned @ daringfireball.net looks promising; I think I'll experiment a little with this to see how it could be leveraged.

@Maikuolan
Member Author

I couldn't get the above-mentioned regex to work properly, but I wrote something myself (it's a little hackish, but seems to work; we can always improve on it though).

I'll fork an "experimental" branch shortly.

@Maikuolan
Member Author

Committed. :-)

The URLs that can now be detected with this initial commit come from the current malc0de data. To test it, I wrote a plain-text file containing Blah blah blah 123 "http://360safe.com" hello world., saved it, and tried to scan it (the domain URL in that plain-text is from the current malc0de domain list). Further testing hasn't yet been done.

Awaiting responses to some emails I've sent to database providers before adding any API functionality for them.

@DanielRuf
Contributor

Great, I will test it further over the next few days and write test files for it, so you can easily test with them.

@Maikuolan
Member Author

Cool, looking forward to it. :-)

I've received some replies to some of the emails that I'd sent, all of which have been positive so far. I'll start writing up some API code soon for those services.

@DanielRuf
Contributor

Great news, I am very excited about the upcoming changes.

@Maikuolan
Member Author

Cool. :-)

Code for performing API lookups to hpHosts has been added (it seems to work correctly). I've been experimenting with the Google Safe Browsing API, but haven't fully worked it out yet; more to come soon.

@Maikuolan
Member Author

Still looking into further implementing other APIs, but also, still having difficulties with the Google Safe Browsing API. :-/

They've very thoroughly documented the API, its requirements, how to use it, etc:

..But I'm still having a little difficulty understanding it at the moment. I'm sure I'll eventually understand it, but it may take a little longer than I'd originally anticipated. '^.^

Also, I've tried searching for examples online, thinking that for something like Google Safe Browsing, surely, there must be good examples online from other people trying to implement it, and I've found some things, but not very much.

I did find the samcleaver/phpGSB repository, which sounds interesting; I haven't yet tested it though.

Anyhow, for the code already written, caching seems to be working correctly now, so, I don't think caching will be a problem (although, of course, some API services may have their own specific requirements about this).
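
For context, the kind of caching involved is roughly this (a sketch with a hypothetical flat-file cache, not the actual phpMussel cache code): key the cache on a hash of the URL, store the verdict with a timestamp, and only call the remote API on a miss or after the entry expires.

<?php
// Sketch: TTL-based cache for remote URL-lookup verdicts.
function cachedLookup($url, callable $remoteLookup, $cacheFile, $ttl = 86400)
{
    $cache = is_file($cacheFile) ? (json_decode(file_get_contents($cacheFile), true) ?: []) : [];
    $key = md5(strtolower($url));

    if (isset($cache[$key]) && (time() - $cache[$key]['time']) < $ttl) {
        return $cache[$key]['verdict']; // cache hit: no API call, no quota cost
    }

    $verdict = $remoteLookup($url); // cache miss: ask the remote service
    $cache[$key] = ['verdict' => $verdict, 'time' => time()];
    file_put_contents($cacheFile, json_encode($cache));
    return $verdict;
}

// Usage (with some remote lookup callable, e.g. the PhishTank sketch above):
// $verdict = cachedLookup($url, 'phishtankLookup', __DIR__ . '/url-cache.json');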

@DanielRuf
Contributor

It seems curl is required to get the data: http://stackoverflow.com/questions/30115897/google-safe-browsing-api-unable-to-download-list

Like the mentioned repository does (but the class is very big).

@Maikuolan
Member Author

I think you're right.

I'll try to write something using cURL; Hopefully I'll have some luck with it.
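
To get started, here's a minimal sketch of the kind of request involved, assuming the Lookup API (v3.1) GET format; the exact parameters and response handling would need to be confirmed against Google's documentation (as I understand it, an empty 204 response means no match):

<?php
// Sketch: single-URL check against the Safe Browsing Lookup API (v3.1-style GET).
function safeBrowsingLookup($url, $apiKey)
{
    $endpoint = 'https://sb-ssl.google.com/safebrowsing/api/lookup?' . http_build_query([
        'client' => 'phpmussel',
        'key' => $apiKey,
        'appver' => '1.0.0',
        'pver' => '3.1',
        'url' => $url,
    ]);

    $ch = curl_init($endpoint);
    curl_setopt_array($ch, [CURLOPT_RETURNTRANSFER => true, CURLOPT_TIMEOUT => 12]);
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($status === 204) {
        return 'clean';             // no match
    }
    if ($status === 200) {
        return trim((string)$body); // e.g. "phishing", "malware", "unwanted"
    }
    return 'error';                 // bad key, quota exceeded, service unavailable, etc.
}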

Also, work done so far has been merged back to master (still haven't yet written functional code for the Google Safe Browsing API, but, everything written so far appears to be stable).

@DanielRuf
Contributor

Ok, I made some tests with the example.

The cache works great.

I will link the files that I have used for the tests.
These were just a few that came to mind. There may be more tests that we need.

https://gist.github.com/DanielRuf/049f856fb2a5899dc28e

We may create some test files for the URL Scanner to test further detection and improve it.

C:\xampp\htdocs\phpMussel-master\phpMussel-master>php phpunit.phar URLScannerTest.php
PHPUnit 4.7.7 by Sebastian Bergmann and contributors.

FFF.F.F..F.F...

Time: 3.86 seconds, Memory: 29.75Mb

There were 7 failures:

1) URLScannerTest::testURLScanner_base64
Failed asserting that 1 matches expected 2.

C:\xampp\htdocs\phpMussel-master\phpMussel-master\URLScannerTest.php:7

2) URLScannerTest::testURLScanner_base64_2
Failed asserting that 1 matches expected 2.

C:\xampp\htdocs\phpMussel-master\phpMussel-master\URLScannerTest.php:12

3) URLScannerTest::testURLScanner_hex
Failed asserting that 1 matches expected 2.

C:\xampp\htdocs\phpMussel-master\phpMussel-master\URLScannerTest.php:17

4) URLScannerTest::testURLScanner_hex_lit_2
Failed asserting that 1 matches expected 2.

C:\xampp\htdocs\phpMussel-master\phpMussel-master\URLScannerTest.php:27

5) URLScannerTest::testURLScanner_javascript_escape
Failed asserting that 1 matches expected 2.

C:\xampp\htdocs\phpMussel-master\phpMussel-master\URLScannerTest.php:37

6) URLScannerTest::testURLScanner_protocol_less
Failed asserting that 1 matches expected 2.

C:\xampp\htdocs\phpMussel-master\phpMussel-master\URLScannerTest.php:52

7) URLScannerTest::testURLScanner_sql_escape
Failed asserting that 1 matches expected 2.

C:\xampp\htdocs\phpMussel-master\phpMussel-master\URLScannerTest.php:62

FAILURES!
Tests: 15, Assertions: 15, Failures: 7.

C:\xampp\htdocs\phpMussel-master\phpMussel-master>php phpunit.phar URLScannerTest.php
PHPUnit 4.7.7 by Sebastian Bergmann and contributors.

FFF.F.F..F.F...

Time: 334 ms, Memory: 11.50Mb

There were 7 failures:

1) URLScannerTest::testURLScanner_base64
Failed asserting that 1 matches expected 2.

C:\xampp\htdocs\phpMussel-master\phpMussel-master\URLScannerTest.php:7

2) URLScannerTest::testURLScanner_base64_2
Failed asserting that 1 matches expected 2.

C:\xampp\htdocs\phpMussel-master\phpMussel-master\URLScannerTest.php:12

3) URLScannerTest::testURLScanner_hex
Failed asserting that 1 matches expected 2.

C:\xampp\htdocs\phpMussel-master\phpMussel-master\URLScannerTest.php:17

4) URLScannerTest::testURLScanner_hex_lit_2
Failed asserting that 1 matches expected 2.

C:\xampp\htdocs\phpMussel-master\phpMussel-master\URLScannerTest.php:27

5) URLScannerTest::testURLScanner_javascript_escape
Failed asserting that 1 matches expected 2.

C:\xampp\htdocs\phpMussel-master\phpMussel-master\URLScannerTest.php:37

6) URLScannerTest::testURLScanner_protocol_less
Failed asserting that 1 matches expected 2.

C:\xampp\htdocs\phpMussel-master\phpMussel-master\URLScannerTest.php:52

7) URLScannerTest::testURLScanner_sql_escape
Failed asserting that 1 matches expected 2.

C:\xampp\htdocs\phpMussel-master\phpMussel-master\URLScannerTest.php:62

FAILURES!
Tests: 15, Assertions: 15, Failures: 7.

@DanielRuf
Contributor

@Maikuolan
Member Author

Cheers. :-)

@DanielRuf
Contributor

A new run of the tests just gives a failure for every test. Each test returns 1 instead of 2.

The tests use 360safe.com as the domain. It seems this was removed from newer signatures.

I think I will create new test files. Can you give me a specific example domain for testing (preferably one which many people know, so there is no new threat)?

I have manually downloaded a version of the URLScanner signatures which contains the MD5 hash of the 360safe.com domain and ran the tests again.

Code base: 3a7b07aeca9026b16c6e5ace0c9c4261906424f1

C:\xampp\htdocs\phpmussel>php phpunit.phar URLScannerTest.php
PHPUnit 5.0.8 by Sebastian Bergmann and contributors.

FFF.F.F..F.F...                                                  15 / 15 (100%)

Time: 8.24 seconds, Memory: 30.00Mb

There were 7 failures:

1) URLScannerTest::testURLScanner_base64
Failed asserting that 1 matches expected 2.

C:\xampp\htdocs\phpmussel\URLScannerTest.php:7

2) URLScannerTest::testURLScanner_base64_2
Failed asserting that 1 matches expected 2.

C:\xampp\htdocs\phpmussel\URLScannerTest.php:12

3) URLScannerTest::testURLScanner_hex
Failed asserting that 1 matches expected 2.

C:\xampp\htdocs\phpmussel\URLScannerTest.php:17

4) URLScannerTest::testURLScanner_hex_lit_2
Failed asserting that 1 matches expected 2.

C:\xampp\htdocs\phpmussel\URLScannerTest.php:27

5) URLScannerTest::testURLScanner_javascript_escape
Failed asserting that 1 matches expected 2.

C:\xampp\htdocs\phpmussel\URLScannerTest.php:37

6) URLScannerTest::testURLScanner_protocol_less
Failed asserting that 1 matches expected 2.

C:\xampp\htdocs\phpmussel\URLScannerTest.php:52

7) URLScannerTest::testURLScanner_sql_escape
Failed asserting that 1 matches expected 2.

C:\xampp\htdocs\phpmussel\URLScannerTest.php:62

FAILURES!
Tests: 15, Assertions: 15, Failures: 7.

@Maikuolan
Member Author

The tests use 360safe.com as the domain. It seems this was removed from newer signatures.

Yeah, it was. I forgot to mention that here, though. Sorry about that.

I think I will create new test files. Can you give me a specific example domain for testing (preferably one which many people know, so there is no new threat)?

"One which many people know" could be tricky, because I'm not sure there are any truly definitive malicious domains that are better known than most others, but, for the purposes of testing, "kitdriver.com" could work; I found that domain last month; it seems to host lots of fake driver downloads containing malware. Virus Total results for that domain have a detection ratio of only 4/66, though, so I don't think I could call it well-known.

Also, because of the TLD-level signatures, any domains attached to these TLDs:

 .zip
 .review
 .country
 .kim
 .cricket
 .science
 .work
 .party
 .gq
 .link

..Should be flagged as malicious, so we could possibly write new test files using some example domain names (hxxp://www123.blah-blah-blah.link/, etc) attached to those TLDs for the purposes of testing.
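
A trivial sketch of generating such test files, reusing the plain-text format from the earlier 360safe.com test (made-up domains and hypothetical file names):

<?php
// Sketch: generate plain-text test files using made-up domains on the flagged TLDs.
$tlds = ['zip', 'review', 'country', 'kim', 'cricket', 'science', 'work', 'party', 'gq', 'link'];

foreach ($tlds as $tld) {
    $url = sprintf('http://www123.blah-blah-blah.%s/', $tld);
    $content = sprintf('Blah blah blah 123 "%s" hello world.', $url);
    file_put_contents(__DIR__ . "/tld_{$tld}.txt", $content);
}
echo "Wrote " . count($tlds) . " test files.\n";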

@DanielRuf
Contributor

Okay, thanks for the information. I will prepare new test files soon and see if I can contribute some decoding functions.

@DanielRuf
Contributor

Will create the test files next week. What has the highest priority at the moment (testing feature x, code for feature x, translations, ...)?

@Maikuolan
Member Author

( Replying via mobile phone email. )

The person handling the Arabic translations is still working on them, but doesn't seem to be having any problems so far. I think the person that was handling the Vietnamese translations might've decided to abandon the project, or something might've happened to them (I hope not), because I've tried contacting them numerous times over the past few months without response, and I haven't seen any evidence of them having any online activity at all during that time. The person auditing the Chinese translations has just informed us (via another issue) that they've finished the audit now, so I'll take a look at that and see to integrating the changes when I next have access to a computer (likely later tonight). All other currently supported translations (although many are unaudited and unverified) are complete. I don't think there's any pending work to be done on the current translations otherwise (although, of course, finding more people to do audits for the unaudited translations and to work on new translations is always welcome), so, although I'm always keen for more translation work to be done, I don't think it's a high priority at the moment.

There are still plenty of features (and a few bug fixes) on the to-do list (some of these, but not all, are listed in the changelog), such as the ability to upload new files to Virus Total, fixing the user language bug, the various archive support features, etc, but these will all need time, will probably be done over multiple releases, and will all need to be tested when they're eventually completed. Everything currently implemented should (in theory) be stable, though.

Because there are some already implemented features that haven't yet been thoroughly tested, though, I think testing should probably be the highest priority of the three (translations, coding, testing) at the moment.

Anyhow, I'll reply again later tonight when I'm next using a computer. :-)

@DanielRuf
Contributor

Ok, thanks for the information. Yes, we need more testing as this has helped to find and solve a few bugs.

@Maikuolan
Member Author

No worries, and definitely. :-)

As well as finding and solving possible bugs and errors, testing could help us to determine where weaknesses exist, such as any failed detections (e.g. undetected obfuscated and/or encoded links), any potential vulnerabilities (e.g. where malicious links could be automatically followed or similar), and new ideas on where to expand the functionality.

I haven't yet looked at the Google Safe Browsing API integration again since we last spoke about it, so that's still something to eventually do, although I'm not sure when I'll get around to it.

@DanielRuf
Contributor

Ran a few tests and used the kitdriver domain:

C:\Bitnami\wampstack-7.0.0-0\apache2\htdocs\phpMussel-master>php phpunit.phar URLScannerTest
PHPUnit 5.1.0 by Sebastian Bergmann and contributors.

.Fint(1)
..F.F...                                                       10 / 10 (100%)

Time: 1.17 seconds, Memory: 32.00Mb

There were 3 failures:
1) URLScannerTest::testURLScanner_javascript_escape
Failed asserting that 1 matches expected 2.

C:\Bitnami\wampstack-7.0.0-0\apache2\htdocs\phpMussel-master\URLScannerTest.php:38

2) URLScannerTest::testURLScanner_protocol_less
Failed asserting that 1 matches expected 2.

C:\Bitnami\wampstack-7.0.0-0\apache2\htdocs\phpMussel-master\URLScannerTest.php:53

3) URLScannerTest::testURLScanner_sql_escape
Failed asserting that 1 matches expected 2.

C:\Bitnami\wampstack-7.0.0-0\apache2\htdocs\phpMussel-master\URLScannerTest.php:63

FAILURES!
Tests: 10, Assertions: 10, Failures: 3.

Used test files:
URLScannerTest.txt
html_escape.txt
javascript_escape.txt
linebreaks.txt
normal.txt
protocol_less.txt
single_quotes.txt
sql_escape.txt
unicode.txt
urlencoded.txt
urlencoded_2.txt

I did not yet have the time to write the PHP script to generate these test files, but I will do so this week (I hope).

Do you think we should decode encoded strings (check for base64_encoded strings and so on? str_rot13, ...) or just check plain text strings?

I think it just makes sense to check plain text strings, as trying to detect encoded strings and decode them might consume too much memory and possibly slow down the website / server (e.g. when there is a very large gzipped, encoded string). It may also not be so easy to track / check whether any variables are used in some decoding step, so we would have to look them up and actively check the contents against the URLScanner database. Static analysis might be better than a dynamic approach, which would cause huge problems when there are many strings.

What do you think? We have the regex for getting domains (and IPs?) and it might be enough to check them directly against the URLScanner database, as these are often in plain sight (not encoded or encrypted in most cases, though they are often packed with base64_encode / unpacked with decode, eval and so on, which is already caught by phpMussel).

@DanielRuf
Contributor

Added the unit tests and the generation script for the used test files to the extras repository.

The generation script will create the files with the same format as I have used for the URL Scanner unit tests here.

Might have to tune or improve the regex, strip all linebreaks, and just check for the domain + TLD part, as the protocol is not required in the markup (protocol-less or protocol-relative URIs are not uncommon). Backtracking should be useful.

For testing I use https://3v4l.org, https://ideone.com/ and some other tools.

@Maikuolan
Member Author

Do you think we should decode encoded strings (check for base64_encoded strings and so on? str_rot13, ...) or just check plain text strings?

To a very limited degree, this should already be happening, due to the process that phpMussel uses to "normalise" text (it looks for function calls in data being scanned, like base64_decode(); and things like that; if it finds them, and if it can, it'll automatically attempt to decode the data contained by those function calls).

Beyond that, though, it isn't currently doing much with this type of data (it's not normally difficult to look for function calls, and if we know where the function calls are, we can do tricks like splitting up text with explode and substr to capture the encoded data without needing regex; without things like function calls or other similar markers indicating where the encoded data begins and ends, though, I think regex is normally the only way to detect and capture this data effectively in PHP, but using regex for things like this can be potentially problematic, with the risk of runaway backtracking and the risk of reaching memory stack limits if the data is too large).
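
As an illustration of that explode/substr trick (a sketch only; phpMussel's actual normalisation code is more involved): find the marker, slice out everything up to the closing quote, and decode it, with no regex and therefore no backtracking risk.

<?php
// Sketch: capture and decode the argument of a base64_decode("...") call
// using explode/substr instead of regex.
$data = 'echo base64_decode("aHR0cDovL2V2aWwuZXhhbXBsZS9kcm9wcGVyLnBocA==");';

$parts = explode('base64_decode("', $data, 2);
if (count($parts) === 2) {
    $end = strpos($parts[1], '"');                // the closing quote ends the payload
    if ($end !== false) {
        $encoded = substr($parts[1], 0, $end);
        $decoded = base64_decode($encoded, true); // strict mode: false if not valid base64
        if ($decoded !== false) {
            echo $decoded . "\n";                 // decoded data can now be URL-scanned too
        }
    }
}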

I like the idea of being able to find potentially malicious URLs and other malicious code hidden within encoded and/or obfuscated data, but I agree that focusing too much on decoding things like this could potentially create problems with memory and slow down websites, so I think decoding data like this is something we should be careful with, if we do anything with it at all. Otherwise, in most situations, plain-text should be sufficient anyhow.

What do you think? We have the regex for getting domains (and IPs?) and it might be enough to check them directly against the URLScanner database, as these are often in plain sight (not encoded or encrypted in most cases, though they are often packed with base64_encode / unpacked with decode, eval and so on, which is already caught by phpMussel).

I think checking them directly and checking them the way they're checked currently should be enough. At least, until we find information to suggest otherwise. We can always build on this in the future if the need arises, but, no point adding to the footprint unnecessarily.

@Maikuolan
Member Author

Added the unit tests and the generation script for the used test files to the extras repository.

The generation script will create the files with the same format as I have used for the URL Scanner unit tests here.

Awesome, and sounds good. :-)

Might have to tune or improve the regex, strip all linebreaks, and just check for the domain + TLD part, as the protocol is not required in the markup (protocol-less or protocol-relative URIs are not uncommon). Backtracking should be useful.

Good point. This is definitely something we should do.

@DanielRuf
Contributor

Do you think we should decode encoded strings (check for base64_encoded strings and so on? str_rot13, ...) or just check plain text strings?

To a very limited degree, this should already be happening, due to the process that phpMussel uses to "normalise" text (it looks for function calls in data being scanned, like base64_decode(); and things like that; if it finds them, and if it can, it'll automatically attempt to decode the data contained by those function calls).

Right. So decoding this makes no sense when the function is already blacklisted anyway.

What do you think? We have the regex for getting domains (and IPs?) and it might be enough to check them directly against the URLScanner database, as these are often in plain sight (not encoded or encrypted in most cases, though they are often packed with base64_encode / unpacked with decode, eval and so on, which is already caught by phpMussel).

I think checking them directly and checking them the way they're checked currently should be enough. At least, until we find information to suggest otherwise. We can always build on this in the future if the need arises, but, no point adding to the footprint unnecessarily.

Totally agree; at the moment the scan engine of phpMussel should be good enough. Every webshell and piece of malicious code at least shares specific function calls like eval. Tokenizing the input and checking it for blacklisted functions, as is done currently, is enough so far.

I am just weighing the advantages and disadvantages of specific approaches, and performance vs. security. In my opinion we have a good balance currently, as the benchmarks are very promising.

@DanielRuf
Contributor

Bump #65 (comment): 3 tests still fail.

At least \" also has to be trimmed.

I have changed the regex a bit to also support protocol-less URLs and escaped slashes.

http://ideone.com/TpZdbZ

The regular expression could possibly be further improved.

@Maikuolan
Member Author

If we make the beginning of the pattern optional, I think it'd be better to do away with that part of the pattern entirely, because the results will be the same either way, and if we kept it, we'd be looking for something we didn't need to be looking for.

My thought in including the URI schemes / protocols for URLs, originally, was that it should make them easier to distinguish from arbitrary non-URL text and from contexts where a domain could be referenced without necessarily being a URL (and so, looking for the URI scheme / protocol at the beginning should help avoid capturing unnecessary data), but, that said, looking at IANA's list of currently recognised URI schemes.. Ouch. A rather big list, and, in the extreme of including all possible recognised URI schemes, I'd rather not include a list that big. So, perhaps we should look at other options.

Maybe we could look for anything that resembles a protocol, without necessarily looking for specific protocols (e.g., something like [0-9A-Za-z.-]{2,30}\: instead of the current (data|file|https?|ftps?|sftp|ss[hl])\:)? It may still capture some arbitrary or unnecessary data, but less so than simply removing the check entirely. Thoughts?

(Reason for {2,30}: the shortest URI scheme / protocol I could see in IANA's list is 2 characters long, and the longest I could see, "ms-secondary-screen-controller", is 30 characters long).
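
Concretely, the difference would be roughly this (a sketch only; the real pattern in the codebase has more context around it than shown here):

<?php
// Sketch: generic scheme prefix instead of an explicit protocol whitelist.
$specific = '~(?:data|file|https?|ftps?|sftp|ss[hl])\:[^\s"\'<>]+~i';
$generic  = '~[0-9A-Za-z.-]{2,30}\:[^\s"\'<>]+~';

$data = 'See ssh://host.example/ plus steam://run/440 and http://360safe.com/x';

preg_match_all($specific, $data, $a);
preg_match_all($generic, $data, $b);
print_r($a[0]); // only the whitelisted schemes (ssh://..., http://...)
print_r($b[0]); // anything that merely looks like scheme: (also catches steam://...)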

And good catch with the \" and escaped slashes; We should implement this.

@Maikuolan removed the Feature label Mar 27, 2016
@Maikuolan
Member Author

Support for Google Safe Browsing API lookups implemented. :-)

@DanielRuf
Contributor

Awesome =) Will take a look at the code later and test it in the next days.

@Maikuolan
Member Author

Bug identified: in the current master codebase, remote API lookups aren't being performed when there aren't any URL scanner signature files installed/enabled/active (this was probably introduced as a result of the most recent rewrite I did of the way signature files are handled a while back, I think). These two things (signature file checks and API lookups) should be independent of each other. I just noticed this problem tonight, but it should be reasonably easy to fix (mostly just separating out a few different code sections). I'll try to write a fix for this either tomorrow or over the weekend.

@DanielRuf
Contributor

Better to open a new issue so we can track this separate problem =)

I guess we really need some integration and unit tests (using PHPUnit), but I am not so deep into the code after this long timespan.

@Maikuolan
Member Author

The remaining issues of unit testing and of scanning URLs with Virus Total can be covered by issues #45 and #12 respectively. As everything else has otherwise been dealt with, I'm marking this as implemented and closing this issue to keep the issues list clean.
