
GRR MySQL "Max Allowed Packet" Error / Hunts are not working #599

Closed
n3x77 opened this issue Jun 18, 2018 · 17 comments

Comments

@n3x77

n3x77 commented Jun 18, 2018

Dear users,

we have some real performance issues with GRR at the moment. To give you a better understanding: we use GRR version 3.2.2.0 with MySQL and roughly 4,000 clients.

The GRR landscape is distributed across multiple servers (Ubuntu 16.04):

2x HTTP frontend servers behind an NGINX reverse proxy
2x worker servers (with 2 GRR worker processes)
1x UI server
1x MySQL datastore (with 2 GRR worker processes)

At the moment we're not able to schedule any hunt on the system.

The logs show the following errors:
"mysql_advanced_data_store: Operational Error: 1205 Lock wait timeout exceeded. Try restarting transaction. This may be due to an incorrect mysql "max_allowed_packet" setting (try increasing it).

We've tried values up to 2048MB, but this did not help. Another thing is that on the GRR admin server we have a lot of sockets in CLOSE_WAIT status.

These are the current GRR settings and MySQL settings that we're using:

GRR:

Threadpool.size = 50
Worker.queue_shards = 8
Mysql.conn_pool_max = 50
Mysql.conn_pool_min = 10
Mysql.max_connect_wait = 0
Mysql.max_query_size = 8388608
Mysql.max_retries = 10
Mysql.max_values_per_query = 10000

MySQL:

max_allowed_packet = 512M
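
To double-check that the datastore server actually picked these values up, a minimal sketch that reads the relevant variables back through MySQLdb could look like this (host, user and password are placeholders):

    import MySQLdb  # the driver the MySQLAdvancedDataStore is built on

    # Placeholders: point this at the real datastore instance.
    conn = MySQLdb.connect(host="10.0.0.10", user="grr", passwd="secret", db="grr")
    cur = conn.cursor()
    for name in ("max_allowed_packet", "innodb_lock_wait_timeout",
                 "wait_timeout", "max_connections"):
        cur.execute("SHOW VARIABLES LIKE %s", (name,))
        print(cur.fetchone())  # e.g. ('max_allowed_packet', '536870912')
    conn.close()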

Does anybody in this group have the same problems with the current GRR version installed from the deb package? Or could you please give us some hints that may help getting GRR up and running again?

Thanks for your help,
Cheers
Sven

@n3x77
Author

n3x77 commented Jun 18, 2018

To add: we're using Percona Server 5.6 from the default Ubuntu repository.

@pidydx
Contributor

pidydx commented Jun 18, 2018

What hardware specs are being used for the database server?
Do you have problems with other parts of the system or only hunts?
What is the cpu load on the database server when it is having problems?

@n3x77
Author

n3x77 commented Jun 19, 2018

Hi, the database is an Azure instance with the following specs:

  • 16 CPUs
  • 112 GB memory
  • 128 GB HDD (Caching)
  • The database is stored on another partition (HDD)

The funny thing is that the database server is idle and has a load average of 2.0. So it seems that there is not really a performance issue.

@pidydx
Contributor

pidydx commented Jun 19, 2018

Nothing here looks particularly out of line. There was some functionality added that uses a method called ScanAttributes that can be extremely inefficient on the existing MySQLAdvancedDataStore, but that is usually associated with having problems retrieving results from hunts more than creating them.

Is there anything in the slow query log? (If it is not enabled yet, see the sketch below.)
Are the cron based hunts like the weekly interrogate working?
Are clients still checking in?
Can you run flows on clients?
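
Regarding the slow query log: it can usually be switched on at runtime without restarting the server. A rough sketch (credentials are placeholders, and SET GLOBAL needs sufficient privileges):

    import MySQLdb

    conn = MySQLdb.connect(host="10.0.0.10", user="root", passwd="secret", db="grr")
    cur = conn.cursor()
    cur.execute("SET GLOBAL slow_query_log = 'ON'")
    cur.execute("SET GLOBAL long_query_time = 1")  # log everything slower than 1s
    cur.execute("SHOW VARIABLES LIKE 'slow_query_log_file'")
    print(cur.fetchone())  # where the slow queries will end up
    conn.close()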

@n3x77
Author

n3x77 commented Jun 20, 2018

Do you know if there are any problems using Percona DB? We're currently running Percona Server 5.6.

The clients are connecting. I will check the slow query log and the interrogate flow.

@n3x77
Author

n3x77 commented Jul 18, 2018

So to give you an update.

We did the following things on the datastore server:

  • Upgraded from Percona 5.6 to Percona 5.7
  • Moved the datastore from HDDs to SSDs
  • Activated the performance schema and did some tuning with mysqltuner

We still get the max_allowed_packet warning, but it seems that error message is misleading. The DB looks fine; we also started some hunting for registry keys and the performance looks much better.

The clients connect to the GRR server through an nginx reverse proxy. On this proxy we get the following alert message in the nginx log:
open socket #xxx left in connection ##

So after a while the nginx proxy cannot handle the incoming connections from grr any more. Do you have a clue why these sockets are not closed correctly on the nginx server?
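
A quick way to watch how the socket states develop on the proxy would be something like this (just a sketch around netstat, nothing GRR-specific):

    import collections
    import subprocess

    # Count TCP sockets per state (CLOSE_WAIT, ESTABLISHED, ...) on this host.
    output = subprocess.check_output(["netstat", "-tan"]).decode()
    tcp_lines = [l for l in output.splitlines() if l.startswith("tcp")]
    states = collections.Counter(line.split()[-1] for line in tcp_lines)
    print(states.most_common())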

Thanks

@grrrrrrrrr
Contributor

Hm, have you looked at the frontend debug logs? They should log the time each request from a client takes; if there is an increase over time in that value, there might be some data store issue.

What you could also look at, not sure if that is possible at this point, is whether the same message about open connections also happens when the system is not loaded - very few connected clients, for example. Maybe there is some general issue that connections don't get closed properly; it could be in GRR itself or a weird interaction with nginx.

@n3x77
Author

n3x77 commented Jul 24, 2018

Hi,

I don't understand exactly what you mean by the frontend debug log. I know that it is possible to start the frontend process from the command line together with the verbose option. Is there another way to enable debug logs?

@grrrrrrrrr
Contributor

So I quickly had a look and it turns out that the frontend writes debug logs by default (in my case they are at /usr/share/grr-server/lib/python2.7/site-packages/grr/var/log/grr-http-server.log). Whenever a client delivers or reads messages, the frontend logs how long that takes. Usually those times should be very fast; maybe you can have a look at whether this is the case for your instance as well?
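
If it helps, something like the following could pull the slowest requests out of that log. The regex is only a placeholder; it needs to be adapted to whatever the actual timing lines look like:

    import re
    import sys

    # Placeholder pattern for a duration in seconds; adjust to the real
    # format of the timing lines in grr-http-server.log.
    DURATION_RE = re.compile(r"(\d+\.\d+)\s*s")

    slow = []
    with open(sys.argv[1]) as log:
        for line in log:
            match = DURATION_RE.search(line)
            if match and float(match.group(1)) > 1.0:  # slower than 1 second
                slow.append((float(match.group(1)), line.strip()))

    for seconds, line in sorted(slow, reverse=True)[:20]:
        print("%6.2fs  %s" % (seconds, line[:120]))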

@n3x77
Author

n3x77 commented Jul 30, 2018

Hi, thanks for your response.

I activated verbose logging for the HTTP server process in the server.local.yaml file:

HTTPServer Context:
    Logging.verbose: true

I also decreased the sysctl setting net.ipv4.tcp_keepalive_time to 300.
I will check whether the request times increase when the problem occurs.

@n3x77
Author

n3x77 commented Jul 30, 2018

After the server was restarted and the first hunt was scheduled, the error occurred again after a few minutes. What struck me is that the query times increase enormously, from 0.5 seconds up to 62 seconds. After that, MySQL errors occur and out-of-order messages appear. The entire hunt then remains in running status. However, the nginx server only reports HTTP 5xx error messages and no flow is processed anymore.

In the firewall we can detect an increased number of TCP connections with 'Invalid Packet' status between the reverse proxy and the clients, as well as between the GRR workers and the database server.

The MySQL error log is full of the following messages:

2018-07-30T06:25:20.055370Z 608533 [Note] Aborted connection 608533 to db: 'grr' user: 'grr' host: '10.XXX.XXX.XXX' (Got timeout reading communication packets)

20180730_grr-http-server.log
20180730_grr_http_server2.log

@grrrrrrrrr
Contributor

It seems like your MySQL server can't keep up with draining the messages for the clients. There are 41 messages for each client in a hunt, but that doesn't seem like it should be too much? Are those messages somehow special, maybe huge in size?

As a next step, I'd look at the MySQL server log and try to find slow queries. I can't really explain why the MySQL performance is that bad; maybe the logs give some info there.

@n3x77
Author

n3x77 commented Aug 2, 2018

Hi Andreas,

no, in this case I just used the Launch Executable flow to install a newer version of the GRR agent on 500 systems. So it should not be a performance-intensive hunt.

I will check the mysql slow query logs.

@n3x77
Author

n3x77 commented Aug 10, 2018

Hi Andreas,

today I tried to reproduce the errors described above. What I did was launch a hunt with the Launch Executable flow and execute the GRR 3.2.2.0 64-bit executable on 500 clients that are Windows 64-bit machines. The idea was to upgrade these clients from agent 3.2.0.1 to 3.2.2.0.

After a few seconds the errors occurred again. Since the slow query log is really huge, I used the pt-query-digest tool from Percona to parse it. I will add the log to this issue.

Do you have any clue why these queries take up to 60 seconds?

At the moment we don't know what we could do.

Thanks for your help,
Sven
test.log

@grrrrrrrrr
Contributor

I had a quick peek at this log and it looks to me like your database server is completely unresponsive - all queries are super slow, regardless of which table they read or modify.

The root cause seems to be the large inserts that write 1.5 GB to the table that collects the client messages that need to be sent. This takes almost 6,000 seconds to write and takes down the whole server. We need to have a look at whether MySQL can actually handle this amount of data in a somewhat reasonable time or whether we need to change to a different approach here.
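
Just to illustrate the direction a different approach could take (this is not GRR's actual datastore code; the table and column names are made up): splitting such a write into bounded batches keeps every single INSERT well below max_allowed_packet.

    def batches(rows, batch_size=10000):
        """Yield fixed-size slices so no single INSERT grows unbounded."""
        for i in range(0, len(rows), batch_size):
            yield rows[i:i + batch_size]

    def insert_client_messages(cursor, rows):
        # Hypothetical table/columns, for illustration only.
        for batch in batches(rows):
            cursor.executemany(
                "INSERT INTO client_messages (client_id, payload) VALUES (%s, %s)",
                batch)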

@n3x77
Author

n3x77 commented Sep 4, 2018 via email

@grrrrrrrrr
Contributor

Sorry for never getting back to this bug. The newer GRR versions (>=3.3.0.0) use a much-improved database model which should prevent issues like this one. Closing here as this is now obsolete.
