
Peer review process for referral spam hosts #26

Closed
kingo55 opened this issue Jun 5, 2015 · 23 comments

Comments


@kingo55 kingo55 commented Jun 5, 2015

Brilliant idea guys.

What are the requirements for adding a bad referrer to the list? As @mnapoli mentioned in another thread, we don't want to make it too broad.

I'm thinking of a process where new referral spammers are added to the list by peer review, possibly by having other members with significantly large Piwik/Snowplow data sets vouch for them.


@mnapoli mnapoli commented Jun 5, 2015

I agree with this. We have a fairly good list at this point, so we should be careful with new additions. We could decide that every new issue or pull request needs a +1 from another person before being accepted.

Even if that means some additions will have to wait for a few days I think it's fine.

Ping @mattab


@mattab mattab commented Jun 5, 2015

Good question. I think we can sometimes merge PRs on first read, when there are only a few domains and the names look spammy, or when the PR author explains how she found the spammers (e.g. in GA/Piwik reports, 100% bounce rate, display spam, dodgy whois, found on another referrer spam blacklist, etc.)

If we're not sure, it sounds good to ask other users to +1 if they also see the spammer, and to merge after a +1 has been commented.

Maybe we leave this issue open for a while and see how this evolves?


@calebpaine calebpaine commented Jul 10, 2015

Also, I'm seeing some pull requests for larger lists. Perhaps each PR should be limited to one domain/URL, so that each can be individually vetted?


@desbma desbma commented Jul 25, 2015

I have noticed spammers usually spam a lot of different domains from the same IPs.

Once an IP has spammed at least one domain in the blacklist, it is easy to find new domains being spammed (by grepping the IPs on server logs), and add them to the list, without any risk of false positive.

I have automated the research of new domains using this approach and the result is in pull request #87.
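The approach above can be sketched in a few lines. This is a hypothetical illustration, not the actual script behind that pull request: given a set of IPs already known to send referrer spam, it pulls every referrer domain those IPs have used out of Apache combined-format access logs.

```python
# Sketch of the "grep known spam IPs, harvest new spammed domains" idea.
# Assumptions: Apache combined log format; the LOG_RE pattern and the
# function name are invented for this example.
import re
from urllib.parse import urlparse

# Combined log format: IP ident user [timestamp] "request" status size "referrer" "user-agent"
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "([^"]*)"')

def referrer_domains_for_ips(log_lines, spam_ips):
    """Return the set of referrer domains sent by known-spammer IPs."""
    domains = set()
    for line in log_lines:
        m = LOG_RE.match(line)
        if not m:
            continue
        ip, referrer = m.groups()
        if ip in spam_ips and referrer not in ("", "-"):
            host = urlparse(referrer).hostname
            if host:
                # Normalize "www.qitt.ru" and "qitt.ru" to one entry.
                domains.add(host.removeprefix("www."))
    return domains
```

Feeding it the log excerpt later in this thread with `{"178.137.87.228"}` as the spam-IP set would surface torrnada.ru, qitt.ru, and the other domains at once.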


@mnapoli mnapoli commented Jul 29, 2015

FYI we have been contacted by a webmaster asking for his website to be removed from the list: #90 (see the details in the pull request).

I think this mistake (if it is one) is one more reason to move to a "peer-review-only" kind of process, i.e. add only sites that have been reported or approved by at least 2 people. We should also document in the README that it's better to add one site per pull request (I'll do it straight away); we should avoid bulk changes because they are harder to validate.

Thoughts?

mnapoli added a commit that referenced this issue Jul 29, 2015

@desbma desbma commented Jul 29, 2015

I can only speak for myself, but in recent months I have seen a significant increase in referrer spam.

They spam a lot of different domains from dozens of IPs, sometimes without any rate limiting, so I get bursts of dozens of useless requests per second, polluting my analytics and wasting my server resources. And this is on small servers hosting a few low-traffic sites.

As soon as I detect referrer spam from an IP, I now automatically block it at the firewall level, yet I still see new domains being spammed from new IPs every day.

Most of these domains are registered for a short period of time, are simple redirects, and the spammers will always register new ones to spam.

I don't use Piwik, but I find this list very useful. However, let's be honest: if you require a separate pull request and a vote for every domain added, this list will not be updated frequently (if at all), and it will become useless within a few months.


@mnapoli mnapoli commented Jul 29, 2015

Up to now, most merged pull requests contained only a single domain (because we also add the domains reported in issues ourselves). If spam becomes more and more of an issue, there will be more and more people looking for solutions, and thus contributing here.

When we started working on a new solution against referrer spam I suggested the following idea: build a submission system where users can submit new spammers directly from inside their Piwik. These submissions would be sent to a simple app hosted somewhere (e.g. spam.piwik.org). Then it would be easy to see how many users reported each spammer domain, and above a threshold (or manually) we could add the domain to the blacklist.
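The submission-threshold idea could be as simple as counting distinct reporters per domain. This is a minimal sketch under stated assumptions: the class name, the threshold value, and the whole interface are invented here, and no such service exists at spam.piwik.org.

```python
# Hypothetical sketch of the suggested submission service: each Piwik
# instance reports a spammer domain; once enough distinct users have
# reported the same domain, it becomes a blacklist candidate.
from collections import defaultdict

class SpamSubmissions:
    def __init__(self, threshold=5):
        self.threshold = threshold
        self.reporters = defaultdict(set)  # domain -> set of reporter ids

    def submit(self, domain, reporter_id):
        # Using a set means duplicate reports from one user don't inflate the count.
        self.reporters[domain].add(reporter_id)

    def candidates(self):
        """Domains reported by at least `threshold` distinct users."""
        return {d for d, r in self.reporters.items() if len(r) >= self.threshold}
```

Above the threshold a domain could be merged automatically, or simply surfaced for the manual review discussed in this thread.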


@desbma desbma commented Jul 30, 2015

I hope this repository will get enough activity to make this list useful, but I fear the spammers will always be faster than you.

Anyway, since it's in the public domain, I will maintain and use my fork, and merge back changes from this list.


@desbma desbma commented Aug 3, 2015

@mnapoli FYI qitt.ru is definitely being spammed.

I was able to detect it, and add it to the list again because of the other domains being spammed from the same IP.

See an excerpt of my server logs:

$ zgrep -F 178.137.87.228 /var/log/apache2/*.access.log*
/var/log/apache2/[REMOVED].access.log:178.137.87.228 - - [02/Aug/2015:18:08:15 +0200] "GET / HTTP/1.1" 200 8534 "http://torrnada.ru/" "Opera/7.54 (Windows NT 5.1; U)  [pl]"
/var/log/apache2/[REMOVED].access.log:178.137.87.228 - - [03/Aug/2015:01:15:43 +0200] "GET / HTTP/1.1" 200 8534 "http://msk.afora.ru/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q312461)"
/var/log/apache2/[REMOVED].access.log.1:178.137.87.228 - - [30/Jul/2015:14:55:22 +0200] "GET / HTTP/1.1" 200 8534 "http://portal-eu.ru/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461)"
/var/log/apache2/[REMOVED].access.log.1:178.137.87.228 - - [30/Jul/2015:14:55:23 +0200] "GET / HTTP/1.1" 200 8534 "http://portal-eu.ru/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461)"
/var/log/apache2/[REMOVED].access.log.1:178.137.87.228 - - [31/Jul/2015:14:56:15 +0200] "GET / HTTP/1.1" 200 8534 "http://bioca.org/" "Mozilla/3.0 (x86 [en] Windows NT 5.1; Sun)"
/var/log/apache2/[REMOVED].access.log.1:178.137.87.228 - - [31/Jul/2015:22:08:43 +0200] "GET / HTTP/1.1" 200 8534 "https://www.qitt.ru/" "Opera/7.60 (Windows NT 5.2; U)  [en] (IBM EVV/3.0/EAK01AG9/LE)"
/var/log/apache2/[REMOVED].access.log.1:178.137.87.228 - - [31/Jul/2015:22:08:44 +0200] "GET / HTTP/1.1" 200 8534 "https://www.qitt.ru/" "Opera/7.60 (Windows NT 5.2; U)  [en] (IBM EVV/3.0/EAK01AG9/LE)"
/var/log/apache2/[REMOVED].access.log.1:178.137.87.228 - - [31/Jul/2015:22:08:45 +0200] "GET / HTTP/1.1" 200 8534 "https://www.qitt.ru/" "Opera/7.60 (Windows NT 5.2; U)  [en] (IBM EVV/3.0/EAK01AG9/LE)"
/var/log/apache2/[REMOVED].access.log.1:178.137.87.228 - - [01/Aug/2015:05:11:58 +0200] "GET / HTTP/1.1" 200 8534 "http://fitness-video.net/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322; .NET CLR 2.0.40607)"
/var/log/apache2/[REMOVED].access.log.1:178.137.87.228 - - [01/Aug/2015:05:11:58 +0200] "GET / HTTP/1.1" 200 8534 "http://fitness-video.net/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322; .NET CLR 2.0.40607)"
/var/log/apache2/[REMOVED].access.log.1:178.137.87.228 - - [01/Aug/2015:05:11:59 +0200] "GET / HTTP/1.1" 200 8534 "http://fitness-video.net/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322; .NET CLR 2.0.40607)"
/var/log/apache2/[REMOVED].access.log.1:178.137.87.228 - - [01/Aug/2015:07:47:43 +0200] "GET / HTTP/1.1" 200 8534 "http://m1media.net/" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 2000) Opera 6.0 [en]"
/var/log/apache2/[REMOVED].access.log.1:178.137.87.228 - - [01/Aug/2015:18:07:18 +0200] "GET / HTTP/1.1" 200 8534 "http://education-cz.ru/godovoy-podgotovitelnyy-kurs" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Crazy Browser 1.0.5)"
/var/log/apache2/[REMOVED].access.log.1:178.137.87.228 - - [01/Aug/2015:18:07:19 +0200] "GET / HTTP/1.1" 200 8534 "http://education-cz.ru/godovoy-podgotovitelnyy-kurs" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Crazy Browser 1.0.5)"

The webmaster that contacted you is probably contracting a shady SEO company using a botnet to send massive referer spam without his knowledge.


@mnapoli mnapoli commented Aug 3, 2015

The webmaster that contacted you is probably contracting a shady SEO company using a botnet to send massive referer spam without his knowledge.

Could be that indeed. Or it could be that there are multiple websites hosted on the same machine? Or multiple servers behind the same IP?


@desbma desbma commented Aug 3, 2015

Or it could be that there are multiple websites hosted on the same machine? Or multiple servers behind the same IP?

Hmm, I'm not sure I understand you.
The server's IP and hosting are unrelated to this client, which is actually spamming several domains at once.

EDIT: I realize I may not have given you enough context: the above log excerpt is from a server I own, which hosts very small websites, unrelated to the referrers you see.
This is clearly referrer spam: all domains are from Russia, and requests are sent a few seconds apart, all from the same IP, with randomized user agents.


@mnapoli mnapoli commented Aug 3, 2015

Sorry, it's late :/ Rephrasing my thoughts: the spammer tool (whatever its form) could run from a server that has the same IP address as valid websites. For example, it would be very easy to write a referrer spammer script that runs on any shared host. Thus, blocking based on the IP address might not always be reliable.


@desbma desbma commented Aug 3, 2015

Even if a spammer script is running on a shared host that also hosts some websites, those websites are not supposed to send requests to other websites, no?


@mnapoli mnapoli commented Aug 4, 2015

They aren't supposed to spam indeed, but my point is that the websites on the shared host are not aware that other users of the server are doing that, and they can be blocked as collateral damage (in the case where they send actual referrers to spammed websites). All in all, the IP address isn't 100% reliable. It's the same problem when blocking e.g. gamers online, or when blacklisting IPs from connecting over SSH, etc. People/servers can also be in a sub-network and share the same external IP address (companies, universities, etc.).


@desbma desbma commented Aug 4, 2015

What I meant is that even if a good website is behind the same IP as a spammer, if you block that IP on your server to protect yourself from the spam, the good website is unaffected, because it is not sending HTTP requests anyway (only serving them, and not to your server).

By the way, "blocking" the IP is the list user's choice; here we are only talking about adding domains that are obviously being spammed (qitt.ru & co) to the list.


@mnapoli mnapoli commented Aug 4, 2015

What I meant is that even if a good website is behind the same IP as a spammer, if you block that IP on your server to protect yourself from the spam, the good website is unaffected, because it is not sending HTTP requests anyway (only serving it, and not to your server).

That's not how it works in Piwik: when receiving data, Piwik will exclude any data where the referrer is blacklisted. So if a good website is in the blacklist, it will be affected, because its referrer traffic (traffic going from the good website to other websites tracked with Piwik) will be ignored. It will also affect Piwik users, because valid traffic going through their websites will be ignored by Piwik.
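The exclusion behaviour described here can be illustrated in a few lines. This is not Piwik's actual code, and the function name is invented; it only shows why a wrongly blacklisted "good" domain loses all of its referral traffic in the reports.

```python
# Illustrative sketch (not Piwik's implementation): drop any tracked hit
# whose referrer host appears in the blacklist.
from urllib.parse import urlparse

def keep_hit(referrer_url, blacklist):
    """Return True if the hit should be counted, False if excluded."""
    host = urlparse(referrer_url).hostname or ""
    host = host.removeprefix("www.")  # treat www.example.com like example.com
    return host not in blacklist
```

With `blacklist = {"goodwebsite.com"}`, every genuine visitor who clicked a link on goodwebsite.com disappears from the stats; the filter cannot tell a forged Referer header from a real one.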

By the way "blocking" the IP is the list's user choice, we are only talking about adding domains that are obviously being spammed (qitt.ru & co) to the list here.

There is a misunderstanding here, I'm not talking about user blocking an IP. I'm talking about the methodology you suggest to add new spammers to the blacklist. This is how you explained it:

I have noticed spammers usually spam a lot of different domains from the same IPs. Once an IP has spammed at least one domain in the blacklist, it is easy to find new domains being spammed (by grepping the IPs on server logs), and add them to the list, without any risk of false positive.

What I'm saying is that if we add spammers to the blacklist like this, we might blacklist good websites. That would be hurting both good websites and Piwik users.

Example:

  • goodwebsite.com has IP 1.2.3.4
  • badwebsite.com has IP 1.2.3.4 too (shared host or private network, etc.)
  • goodwebsite.com sends referral traffic to myprettyponey.com
  • badwebsite.com runs a script that spams myprettyponey.com with false referrer (to promote badwebsite.com, or any other website)

We detect badwebsite.com and blacklist it. We see that badwebsite.com comes from IP 1.2.3.4, and we see that referrer goodwebsite.com too. With your idea we would blacklist goodwebsite.com.

Am I understanding it right?


@desbma desbma commented Aug 4, 2015

That's not how it works in Piwik: when receiving data, Piwik will exclude any data where the referrer is blacklisted. So if a good website is in the blacklist, it will be affected

We are on the same page on that, we should not add good domains to the blacklist.

because its referrer traffic (traffic going from the good website to other websites tracked with Piwik) will be ignored.

This is where you lost me. Traffic never goes from website to website.
An HTTP client sends a request to a website with another website's domain in the Referer header. Blocking the (spammer) client's IP has absolutely no effect on the flow between the server receiving the spammy requests and the website whose domain is in the Referer header.

Example:

goodwebsite.com has IP 1.2.3.4
badwebsite.com has IP 1.2.3.4 too (shared host or private network, etc.)
goodwebsite.com sends referral traffic to myprettyponey.com
badwebsite.com runs a script that spams myprettyponey.com with false referrer (to promote badwebsite.com, or any other website)

We detect badwebsite.com and blacklist it. We see that badwebsite.com comes from IP 1.2.3.4, and we see that referrer goodwebsite.com too. With your idea we would blacklist goodwebsite.com.

Am I understanding it right?

An example is a good idea :)
I think you misunderstand how the HTTP Referer header works, especially this:

goodwebsite.com sends referral traffic to myprettyponey.com

When we say that a website "sends referral traffic" to another website, there is never any direct communication between the two servers.

What actually happens is the following (I reuse your example):

  1. Mr. ISurfTheWebWithMyBrowser visits goodwebsite.com
  2. Mr. ISurfTheWebWithMyBrowser clicks a link to myprettyponey.com
  3. The browser of Mr. ISurfTheWebWithMyBrowser generates an HTTP request to myprettyponey.com with a "Referer: goodwebsite.com" header

Now if, at the same time, the spammer with the same IP as goodwebsite.com (1.2.3.4) sends HTTP requests to myprettyponey.com with "Referer: pornvidzlolwut.ru", what will happen is that we will block the IP 1.2.3.4.
So HTTP traffic will be blocked between the two servers, but Mr. ISurfTheWebWithMyBrowser can surf as usual, because traffic coming from regular clients is unaffected.
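To make the mechanics concrete: a spammer never needs any access to the domain it promotes, because any HTTP client can claim an arbitrary Referer header. The sketch below (domain names taken from the example above) builds such a request without sending it.

```python
# A Referer header is just a client-supplied string: nothing ties it to
# the site it names. Request object built but deliberately not sent.
import urllib.request

req = urllib.request.Request(
    "http://myprettyponey.com/",
    headers={"Referer": "http://goodwebsite.com/"},  # freely spoofable
)
# The request now claims goodwebsite.com as referrer even though this
# client has no relation whatsoever to that site.
```

This is exactly why the analytics server can only filter on the *value* of the header (the domain), or on the sending IP, and why the two give different trade-offs as discussed in this thread.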


@mnapoli mnapoli commented Aug 4, 2015

We are still not talking about the same thing :) We understand each other on how HTTP referrer works, and again I am not talking about blocking an IP address.

I am talking about adding domains to the blacklist based on their IP addresses. In other words:

I'm talking about the methodology you suggest to add new spammers to the blacklist.

You suggest we judge whether a referrer is spam based on the IP address of the client. But the IP address of the client could be shared for many reasons.

Here is another example:

  • Bob runs a script that sends false referral traffic to myprettyponey.com from the university network (i.e. the script runs on a machine that has an IP address of the university's IP range)
  • Alice uses a PC at the university to visit myprettyponey.com after finding it in a google search

In your logs, you will see:

1.2.3.4 - [02/Aug/2015:18:08:15] "GET / HTTP/1.1" 200 8534 "http://badwebsite.com"
1.2.3.4 - [02/Aug/2015:18:08:15] "GET / HTTP/1.1" 200 8534 "http://google.com"

If we follow your methodology:

  • we know badwebsite.com is a spammer
  • spam comes from IP 1.2.3.4
  • there is other referral traffic coming from a client with the same IP address
  • => we deduce that google.com is a spammer too and we add it to the blacklist
  • => Piwik users will not see traffic coming from Google anymore

@desbma desbma commented Aug 4, 2015

Right, in that case there is a conflict, but if a website is hosted on 1.2.3.4, it is unaffected.

If a university or similar can't secure its own network and outgoing traffic, I see no problem with blocking traffic from it.
For example, this is how Google, and tons of other services, work. If you send automated requests to Google from a public IP, at some point Google will send you a captcha; other services will just block you. If you are on a university network, too bad, the public IP will get blocked, but that rarely happens because there is usually a proxy that does rate limiting, has several public IPs, etc.

Anyway, the false-positive scenario you describe is possible but very unlikely. We all know the domains mentioned above are spam.
You can be super cautious about adding new domains, but it will hurt your interests in the end.

The increase in spam I see leaves no doubt that this is a large-scale operation. Soon your Piwik users will wonder why their sites are becoming so popular in Russia ;)


@mnapoli mnapoli commented Aug 10, 2015

For the record, I've created a "waiting confirmation" tag and tagged the relevant issues and pull requests.


@calebpaine calebpaine commented Aug 10, 2015

I think blocking IP addresses, or other sites that share IPs with a known spammer, is a bad idea. You'll get tons of legitimate domains as false positives because they just happen to be on the same shared host (such as GoDaddy, for example) as a spammer.

I also don't see this list as needing real-time instant updates, so automated pull requests or additions to the list are a no-go. Additions to this list need to be vetted by other administrators. I don't mind if it takes a couple of days for a new domain to be formally added; that won't adversely affect the weekly, monthly, and yearly stats.


@desbma desbma commented Aug 10, 2015

I think blocking IP addresses, or other sites that share IPs with a known spammer are a bad idea. You'll get tons of legitimate domains that are false positives because they just happen to be on the same shared host (such as Godaddy for example) as a spammer.

This is not what I proposed.
What I suggested is that as soon as we identify referrer spam from an IP, we consider all domains sent as referrers from that IP as spam, and enrich the list with the new domains.

Websites on shared hosts do not send requests, so they are not affected.

As for the false-positives concern: I have added 44 domains since I started my fork 11 days ago, and you can check for yourself, they are all spam, 100% guaranteed 😄
I check when I have a doubt, and they are mostly small variations of domains already in the list, or the classic porn or SEO crap, all from Russia.


@mnapoli mnapoli commented Dec 28, 2015

We are currently doing peer review for merging pull requests and it works well, let's close this issue!

@mnapoli mnapoli closed this Dec 28, 2015