Stability issues on nominatim service #337

Closed
lonvia opened this issue Oct 29, 2019 · 11 comments

Comments

@lonvia

lonvia commented Oct 29, 2019

Dulcy has had serious networking issues since 21st Oct. There are a lot of connections hanging in SYN_SENT or ESTABLISHED state according to https://munin.openstreetmap.org/openstreetmap.org/dulcy.openstreetmap.org/fw_conntrack.html. A reboot hasn't helped and I can't find any pattern in the IPs concerned. It might be a rogue app. We might simply be reaching capacity. There might be another issue.

This is now starting to impact normal operations.
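
For readers who want to reproduce the kind of per-state count that the munin graph shows, here is a minimal sketch. It assumes conntrack entries in the text layout used by `/proc/net/nf_conntrack` or `conntrack -L`, where the TCP state is the token just before the first `src=` field. The function name `count_tcp_states` and the sample lines (documentation-range IPs) are made up for illustration and are not data from dulcy.

```python
from collections import Counter

# TCP state names as they appear in conntrack listings.
TCP_STATES = {
    "SYN_SENT", "SYN_RECV", "ESTABLISHED", "FIN_WAIT",
    "CLOSE_WAIT", "LAST_ACK", "TIME_WAIT", "CLOSE",
}

def count_tcp_states(lines):
    """Count tracked TCP connections per conntrack state."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        if "tcp" not in tokens:
            continue  # skip UDP, ICMP and blank lines
        # The state is the token immediately before the first src= field.
        for prev, tok in zip(tokens, tokens[1:]):
            if tok.startswith("src="):
                if prev in TCP_STATES:
                    counts[prev] += 1
                break
    return counts

# Invented sample entries, not real traffic.
SAMPLE = """\
ipv4     2 tcp      6 117 SYN_SENT src=198.51.100.7 dst=203.0.113.5 sport=40112 dport=443 [UNREPLIED] src=203.0.113.5 dst=198.51.100.7 sport=443 dport=40112 mark=0 zone=0 use=2
ipv4     2 tcp      6 118 SYN_SENT src=198.51.100.9 dst=203.0.113.5 sport=40990 dport=443 [UNREPLIED] src=203.0.113.5 dst=198.51.100.9 sport=443 dport=40990 mark=0 zone=0 use=2
ipv4     2 tcp      6 431990 ESTABLISHED src=192.0.2.44 dst=203.0.113.5 sport=51514 dport=443 src=203.0.113.5 dst=192.0.2.44 sport=443 dport=51514 [ASSURED] mark=0 zone=0 use=2
ipv4     2 udp      17 25 src=192.0.2.44 dst=203.0.113.5 sport=5353 dport=53 src=203.0.113.5 dst=192.0.2.44 sport=53 dport=5353 mark=0 zone=0 use=2
"""

print(count_tcp_states(SAMPLE.splitlines()))
```

On a live machine the same function could be fed the lines of `/proc/net/nf_conntrack` (root is usually required to read it), assuming the conntrack module exposes that file.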

@lonvia
Author

lonvia commented Oct 31, 2019

Current working theory is that something hits the server so badly that the kernel does not manage to handle all the incoming requests fast enough. Clients then resend their requests, which actually makes the problem worse and also makes it look to dulcy like they are SYN flooding. So we end up with thousands of connections in SYN_SENT.
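
The feedback loop in this theory can be illustrated with a toy model. This is only a sketch under simplifying assumptions: time is split into ticks, the server completes at most `capacity` requests per tick, and every request it could not serve is retried exactly once in the next tick. The function name and all numbers are hypothetical and are not measurements from dulcy.

```python
def offered_load(new_per_tick, capacity, ticks, retry=True):
    """Return the number of connection attempts the server sees in each tick."""
    pending = 0  # unserved requests carried over as retries
    seen = []
    for _ in range(ticks):
        attempts = new_per_tick + pending
        seen.append(attempts)
        unserved = max(0, attempts - capacity)
        # With retries, everything that was not served comes back next tick.
        pending = unserved if retry else 0
    return seen

# Slightly over capacity: retries make the offered load grow tick by tick.
print(offered_load(new_per_tick=120, capacity=100, ticks=5))
# Same arrival rate, but clients give up instead of retrying.
print(offered_load(new_per_tick=120, capacity=100, ticks=5, retry=False))
```

In this model a server that is only 20% over capacity sees its offered load climb steadily as long as the overload lasts, while the same arrival rate without retries stays flat.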

The problems occur only between 6am and 6pm CET. This might just coincide with the load on the machine; it might also be that the culprit is only active then. I still can't make out any patterns in the requests.

At this point I fear that we need a frontend server or CDN to handle the load and keep off all blocked traffic (http requests and permanently blocked requests). At the very least a fronting server would ensure that openstreetmap.org is not affected by these issues.

Another measure I'd like to propose is to stop sending 403/429 to blocked IPs and send a 200 instead, with a block message in the display name. 403/429 just seem to encourage clients to resend their request on the spot, increasing the load on the server even further. A regular 200 answer would need to be processed by them first, which gives us at least a tiny amount of breathing space.
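
A rough sketch of what such a response could look like. This is not Nominatim's actual code: `handle`, `BLOCK_MESSAGE` and the `geocode` callable are hypothetical names, and the only assumption about the real API is that search results are a JSON list of objects carrying a `display_name` field.

```python
import json

# Hypothetical wording; the policy URL is the one linked in this thread.
BLOCK_MESSAGE = (
    "Access blocked. See https://operations.osmfoundation.org/policies/nominatim/"
)

def handle(client_ip, blocked_ips, geocode):
    """Return (status, headers, body) for one search request."""
    headers = {"Content-Type": "application/json"}
    if client_ip in blocked_ips:
        # Blocked clients get a normal-looking 200 whose single result
        # carries the block message instead of an address.
        body = json.dumps([{"display_name": BLOCK_MESSAGE}])
        return 200, headers, body
    # Everyone else gets the real results from the geocoder.
    return 200, headers, json.dumps(geocode())
```

The trade-off is that a 200 hides the block from clients that only check the status code, so a script that would back off correctly on a 429 no longer gets that signal.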

@dfabreguette

Thanks for working on that 👍
Any workaround? Any idea when this will be fixed?
Our production is badly affected by this, as we need main-thread geocoding...
Should we install our own nominatim server to avoid this kind of problem?

@mtmail

mtmail commented Nov 5, 2019

It's my understanding that the openstreetmap.org servers have been struggling with high traffic lately, and it's not clear yet whether any configuration change (network, webserver, Nominatim server) can fix that or if "simply" more hardware is needed.

If a production system relies on the public service then it's better to install your own (http://nominatim.org/release-docs/latest/admin/Installation/, with a global or regional extract) or use a third-party provider (many have free tiers or trials if the number of requests per day or month is low). The last section of https://operations.osmfoundation.org/policies/nominatim/ links to a couple.

@danfai

danfai commented Nov 5, 2019

I've just read through lib/osm.rb in openstreetmap/openstreetmap-website, where the http_client for polling the different APIs like nominatim is called. It appears to me that the start page is not reusing the connections established to nominatim.
Since keep-alive is enabled on the nominatim servers, could this help to reduce the load or increase the stability for the main osm.org page?
My knowledge of ruby and the Faraday library is very limited, so it would be interesting to know whether the connections on dulcy could also be reduced with a fix along those lines.
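
To make the connection-reuse idea concrete, here is a minimal sketch using Python's standard `http.client` rather than the website's Ruby/Faraday code, which is not shown in this thread. The function name `fetch_all` and its parameters are illustrative assumptions. All requests go over one persistent HTTP/1.1 connection, so the server sees a single TCP handshake instead of one per request.

```python
import http.client

def fetch_all(host, port, paths, use_tls=False):
    """Fetch several paths over a single persistent HTTP/1.1 connection."""
    conn_cls = http.client.HTTPSConnection if use_tls else http.client.HTTPConnection
    conn = conn_cls(host, port, timeout=10)
    bodies = []
    try:
        for path in paths:
            conn.request("GET", path)
            resp = conn.getresponse()
            # The body must be read completely before the connection
            # can be reused for the next request.
            bodies.append(resp.read())
    finally:
        conn.close()
    return bodies
```

If the server closes the connection between requests, `http.client` reconnects on the next `request()` call, so the function still works, just without the saving.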

@tomhughes
Member

The lookups are cached though so there will likely only be a very minimal number of queries from the main website, certainly as a proportion of the total load.

@lonvia
Author

lonvia commented Nov 5, 2019

Traffic from openstreetmap.org is not the issue. It makes up less than 0.01% of the requests we get (not including iD, which is a less benign client, although I think this is being fixed).

The current problem is very likely a rogue external scripted client. It's only active from 6am to 6pm CET and comes from a country where 1st Nov is a holiday.

@dfabreguette

dfabreguette commented Nov 5, 2019

What's the IP address? Can you not just blacklist it? (I guess it's not so simple!)

@tomhughes
Member

I don't think the address has been identified or we would have done...

@lonvia
Author

lonvia commented Nov 5, 2019

And 1st Nov is a public holiday in a lot of countries.

@keenhon

keenhon commented Nov 5, 2019

Hope that the culprit can be stopped. Don't people read the usage policy? Glad that someone is looking at the issue. Keep up the good work.

@lonvia
Author

lonvia commented Nov 6, 2019

@Firefishy found the magic setting to mitigate the issue. dulcy now seems to be able to cope with the heavy network traffic.


6 participants