Omit diaries containing Chinese characters from RSS feed #2218

Closed
AndrewHain opened this issue Apr 28, 2019 · 16 comments

6 participants
@AndrewHain
Contributor

commented Apr 28, 2019

https://blogs.openstreetmap.org is currently useless and WeeklyOSM/Wochennotiz has disappeared within hours.

@tomhughes

Member

commented Apr 28, 2019

Clearly this is a ridiculous suggestion.

@tomhughes tomhughes closed this Apr 28, 2019

@mmd-osm

Contributor

commented Apr 28, 2019

I think that's really a duplicate of gravitystorm/blogs.osm.org#17

@tomhughes

Member

commented Apr 28, 2019

Well yes, but my point is that censoring diary entries based on what language they are written in is clearly not something we could countenance.

@mmd-osm

Contributor

commented May 18, 2019

This isn't going to end soon, says alexkemp: https://www.openstreetmap.org/user/alexkemp/diary/338244

"The easiest method to defeat {put in automated spam tool} is to simply require the first post of any new forum member or blog poster to be approved before it can appear."

@tomhughes

Member

commented May 18, 2019

Yes, because I am really looking forward to having 5000 posts to approve when I get up every morning.

I mean obviously that is an option, but one that requires significant engineering and is not a quick fix even if it is practical.

@mmd-osm

Contributor

commented May 18, 2019

Right. Usually there is only a very small number of non-spam blog posts, and only those would need approval, and only the very first time someone posts. The rest can be purged automatically after a few days if no one approved them and no user complained that their post was still not showing up on the page.

Some really low-hanging fruit could be:

  • Exclude diaries from search engine indexing for the time being.
  • Remove the "Add new blog entry" button unless the user has been around for some time (threshold TBD).
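
The first-post approval and purge idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `Entry`, `DiaryQueue`, and the three-day `PURGE_AFTER` window are hypothetical names and values, not anything from the openstreetmap-website code base.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Assumption: "a few days" is taken to mean three days here.
PURGE_AFTER = timedelta(days=3)


@dataclass
class Entry:
    author: str
    title: str
    created_at: datetime
    approved: bool = False


@dataclass
class DiaryQueue:
    # Authors who have had at least one entry approved.
    approved_authors: set = field(default_factory=set)
    pending: list = field(default_factory=list)
    published: list = field(default_factory=list)

    def submit(self, entry: Entry) -> None:
        """Publish at once for known authors; hold first posts for review."""
        if entry.author in self.approved_authors:
            entry.approved = True
            self.published.append(entry)
        else:
            self.pending.append(entry)

    def approve(self, entry: Entry) -> None:
        """Approve a held entry and trust its author from now on."""
        self.pending.remove(entry)
        entry.approved = True
        self.approved_authors.add(entry.author)
        self.published.append(entry)

    def purge(self, now: datetime) -> int:
        """Drop held entries older than PURGE_AFTER; return how many."""
        kept = [e for e in self.pending if now - e.created_at < PURGE_AFTER]
        purged = len(self.pending) - len(kept)
        self.pending = kept
        return purged
```

Only an author's first entry ever waits in `pending`; once it is approved, later entries from the same author are published immediately, which keeps the review load limited to genuinely new posters.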

@tomhughes

Member

commented May 18, 2019

Is there any evidence they are actually getting indexed in the few hours before they are removed?

Obviously we can ban posting by new users, but that falls into the category of "collateral damage" that I have just discussed on Alex's latest rant.

@mmd-osm

Contributor

commented May 18, 2019

Yes, they are showing up in Google's index fairly soon. I tried this yesterday with some random Chinese spam snippets.

@tomhughes

Member

commented May 18, 2019

There is also of course no reason to believe that they wouldn't just add a delay between creating the account and posting.

Frankly I think a more reasoned response would be that it is ridiculous for us to be running a blog system that is entirely unrelated to our primary purpose and just ditch the diaries altogether.

@mmd-osm

Contributor

commented May 18, 2019

Ah yes, that's kind of the "nuclear option". I was also thinking about shutting down the blog system altogether.

@SomeoneElseOSM


commented May 18, 2019

Yes, because I am really looking forward to having 5000 posts to approve when I get up every morning.

On that specific point, why does it have to be just you that does it? Lots of people have been banging on about diary spam and surely some of those would be appropriate to have as "diary approvers" with the specific job to approve valid posts (and only that). Sure, the system to allow that won't write itself, but there's no reason that any extra effort once set up needs to sit explicitly on the admins.

@mmd-osm

Contributor

commented May 18, 2019

I think alexkemp was right about the nature of those posts - they're currently training their spam bots by posting some random news articles. Let's see how it goes.

@mmd-osm

Contributor

commented May 18, 2019

Ah, there's an issue with the robots.txt change: you need to remove the trailing slash in

Disallow: /user/*/diary/

i.e. replace this line with

Disallow: /user/*/diary

Otherwise, the user's blog list (e.g. https://www.openstreetmap.org/user/TomH/diary ) still gets indexed.

Test tool I used: https://webmaster.yandex.com/tools/robotstxt/?hostName=https%3A%2F%2Fwww.openstreetmap.org%2Frobots.txt
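
The difference between the two rules can also be checked locally. Below is a minimal sketch of Google-style robots.txt pattern matching, assuming `*` acts as a wildcard, a trailing `$` anchors the end, and a rule otherwise matches as a prefix. `rule_matches` is a hypothetical helper written for this illustration, and it ignores Allow/Disallow precedence. Python's standard `urllib.robotparser` is not used because, to my knowledge, it does not interpret `*` wildcards.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt rule pattern covers the given URL path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal parts and let each "*" match any run of characters.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start only, which gives prefix semantics.
    return re.match(regex, path) is not None


if __name__ == "__main__":
    for rule in ("/user/*/diary/", "/user/*/diary"):
        for path in ("/user/TomH/diary", "/user/TomH/diary/338244"):
            print(rule, path, rule_matches(rule, path))
```

With the trailing slash, the rule covers individual entries such as `/user/TomH/diary/338244` but not the list page `/user/TomH/diary`; without it, both paths are covered.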

@mvglasow


commented May 18, 2019

For the moment, if we cannot contain the spam, I would consider removing the user diaries from the feed until we find a way to fix this.

However, looking at the topmost spam post, the related user seems to have been deleted already. So solving #17 (ensuring that diary entries disappear from the blog when the user is deleted) might also solve this issue.

@Nakaner

Contributor

commented May 20, 2019

gravitystorm/blogs.osm.org#40 is related to this issue.

I agree that block rules based solely on the character sets used are too simplistic, but just ignoring this problem is not an option either.

@tomhughes

Member

commented May 20, 2019

We're not ignoring anything - we are making ongoing efforts to fight the spam and to add new features to help control it.
