GRR MySQL "Max Allowed Packet" Error / Hunts are not working #599
Comments
To add, we're using Percona Server 5.6 from the default Ubuntu repository.
What hardware specs are being used for the database server?
Hi, the database is an Azure instance with the following specs:
The funny thing is that the database server is idle and has a load average of 2.0. So it seems that there is not really a performance issue.
Nothing here looks particularly out of line. There was some functionality added that uses a method called ScanAttributes that can be extremely inefficient on the existing MySQLAdvancedDataStore, but that is usually associated with having problems retrieving results from hunts more than creating them. Is there anything in the slow query log?
Do you know if there are any problems using Percona DB? We're currently running Percona Server 5.6. The clients are connecting. I will check the slow query log and the interrogate flow.
So to give you an update. We did the following things on the datastore server:
We still get the max_allowed_packet warning but it seems that error message is misleading. The DB looks fine; we also started some hunting for registry keys and the performance looks much better. The clients connect to the GRR server through an nginx reverse proxy. On this proxy we get the following alert message in the nginx log: So after a while the nginx proxy cannot handle the incoming connections from GRR any more. Do you have a clue why these sockets are not closed correctly on the nginx server? Thanks
Hm, have you looked at the frontend debug logs? They should log the time each request from a client takes; if there is an increase over time in that value, there might be some data store issue. What you could also look at, not sure if possible at this point, is whether the same message about open connections also happens when the system is not loaded - very few connected clients, for example. Maybe there is some general issue that connections don't get closed properly, could be in GRR itself or a weird interaction with nginx.
Hi, I don't understand exactly what you mean by frontend debug log. I know that it is possible to start the frontend process from the command line together with the verbose option. Is there another option to enable debug logs?
So I quickly had a look and it turns out that the frontend logs debug logs by default (in my case they are at /usr/share/grr-server/lib/python2.7/site-packages/grr/var/log/grr-http-server.log). Whenever a client delivers or reads messages, the frontend logs how long that takes. Usually, those times should be very fast, maybe you can have a look if this is the case for your instance as well? |
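For anyone wanting to check this quickly: one way to surface unusually slow requests in that log is to filter on the logged duration. The snippet below is a minimal sketch; it assumes the duration is the last whitespace-separated field of each log line, which may not match the exact grr-http-server.log format.

```shell
# slow_requests LOGFILE THRESHOLD
# Prints log lines whose last whitespace-separated field, read as a
# number of seconds, exceeds THRESHOLD. That "duration is the last
# field" layout is an assumption about the log format -- adjust the
# field index if your log lines differ.
slow_requests() {
  awk -v t="$2" '$NF + 0 > t + 0' "$1"
}

# Example (log path taken from the comment above):
# slow_requests /usr/share/grr-server/lib/python2.7/site-packages/grr/var/log/grr-http-server.log 1.0
```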
Hi, thanks for your response. I activated the verbose logging of the HTTP server process in the server.local.yaml file:
I also decreased the sysctl setting: net.ipv4.tcp_keepalive_time = 300.
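For reference, a persistent form of that keepalive change could look like this (the file name is arbitrary; any file under /etc/sysctl.d/ that is loaded at boot works):

```
# /etc/sysctl.d/99-grr-keepalive.conf  (example file name)
net.ipv4.tcp_keepalive_time = 300
```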
After the server has been restarted and the first hunt has been set, the error occurs again after a few minutes. What struck me is that the query times increase enormously, from 0.5 seconds to up to 62 seconds. After that, MySQL errors occur and out-of-order messages appear. The entire hunt then remains in running status. The nginx server, however, only reports HTTP 5xx error messages and no flow is processed anymore. In the firewall we can detect an increased number of TCP connections with 'Invalid Packet' status between the reverse proxy and the clients, as well as between the GRR workers and the database server. The MySQL error log is full of the following messages:
It seems like your mysql server can't keep up with the draining of the messages for the clients. There are 41 messages for each client in a hunt but that doesn't seem like it should be too much? Are those messages somehow special, maybe huge in size? As a next step, I'd look at the mysql server log and try finding slow queries. I can't really explain why the mysql performance is that bad, maybe the logs give some info there.
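For anyone following along: enabling the slow query log on MySQL/Percona 5.6 is a my.cnf change along these lines (the file path and the one-second threshold are example values, not values from this setup):

```
[mysqld]
slow_query_log      = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log
long_query_time     = 1
```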
Hi Andreas, no, in this case I just used the Launch Executable flow to install a newer version of the GRR agent on 500 systems. So it should not be a performance-intensive hunt. I will check the MySQL slow query logs.
Hi Andreas, today I tried to reproduce the errors described above. What I did was launch a hunt with the Launch Executable flow and execute the GRR 3.2.2.0 64-bit executable on 500 clients that are Windows 64-bit machines. The idea was to upgrade these clients from agent 3.2.0.1 to 3.2.2.0. After a few seconds the errors occurred again. Since the slow query log is really huge, I used the pt-query-digest script from Percona to parse it. I will add the log to this issue. Do you have any clue why these queries need up to 60 seconds? At the moment we don't know what we could do. Thanks for your help,
I had a quick peek at this log and it looks to me that your database server is completely unresponsive - all queries are super slow, irrelevant which table they read or modify. The root cause seems to be the large inserts that write 1.5GB to the table that collects the client messages that need to be sent. This takes almost 6000 seconds to write and it takes down the whole server. We need to have a look if MySQL can actually handle this amount of data in somewhat reasonable time or if we need to change to a different approach here. |
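If large single inserts turn out to be the bottleneck, the usual InnoDB knobs worth checking are the redo log and buffer pool sizes. The values below are illustrative defaults-for-a-dedicated-host, not a recommendation measured against this specific setup:

```
[mysqld]
innodb_buffer_pool_size        = 8G   # often sized to ~70% of RAM on a dedicated DB host
innodb_log_file_size           = 1G   # larger redo logs absorb big write bursts better
innodb_flush_log_at_trx_commit = 2    # trades some durability for write throughput
```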
Hi Andreas,
Thanks for the reply, is this something that you already have changed in
GRR 3.2.3.2?
I am using 3.2.2.0 but can try to upgrade.
Could you please describe what the Update Agent flow is used for? I tried
to upgrade the agents from 3.2.0.1 to 3.2.2.0 but it did not work, so
I tried the Launch Binary flow instead.
Thanks
Sorry for never getting back to this bug. The newer GRR versions (>=3.3.0.0) use a much improved database model which should prevent issues like this one. Closing here as this is now obsolete.
Dear users,
we have some real performance issues with GRR at the moment. To give you a better understanding: we use GRR version 3.2.2.0 with MySQL and roughly 4000 clients.
The GRR landscape is distributed across multiple servers (Ubuntu 16.04):
At the moment we're not able to schedule any hunt on the system.
The logs show the following errors:
"mysql_advanced_data_store: Operational Error: 1205 Lock wait timeout exceeded. Try restarting transaction. This may be due to an incorrect mysql 'max_allowed_packet' setting (try increasing it)."
We've tried values up to 2048MB, but this did not help. Another thing is that on the GRR admin server we have a lot of sockets in CLOSE_WAIT status.
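One way to quantify that CLOSE_WAIT buildup is to count those sockets periodically and watch the trend, e.g.:

```shell
# close_wait_count: number of TCP sockets currently in CLOSE_WAIT.
# Uses ss(8) from iproute2; prints 0 if ss is unavailable.
close_wait_count() {
  ss -tan state close-wait 2>/dev/null | tail -n +2 | wc -l
}

# Example: log the count once a minute to see whether it only grows:
# while sleep 60; do echo "$(date) $(close_wait_count)"; done
```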
I will add the current GRR settings and MySQL settings that we're using:
GRR:
Does anybody in this group have the same problems with the current GRR version installed from the deb package? Or could you please give us some hints that may help getting GRR up and running again?
Thanks for your help,
Cheers
Sven