
GRR MySQL "Max Allowed Packet" Error / Hunts are not working #599

Closed
n3x77 opened this issue Jun 18, 2018 · 17 comments

Comments

@n3x77

n3x77 commented Jun 18, 2018

Dear users,

we have some real performance issues with GRR at the moment. To give you a better understanding: we use GRR version 3.2.2.0 with MySQL and roughly 4,000 clients.

The GRR landscape is distributed across multiple servers (Ubuntu 16.04):

2x HTTP frontend servers behind an NGINX reverse proxy
2x worker servers (with 2 GRR worker processes)
1x UI server
1x MySQL datastore (with 2 GRR worker processes)

At the moment we're not able to schedule any hunt on the system.

The logs show the following errors:
"mysql_advanced_data_store: Operational Error: 1205 Lock wait timeout exceeded. Try restarting transaction. This may be due to an incorrect mysql "max_allowed_packet" setting (try increasing it).

We've tried values up to 2048MB, but this did not help. Another thing is that on the GRR admin server we have a lot of sockets in CLOSE_WAIT status.

These are the current GRR settings and MySQL settings that we're using:

GRR:

Threadpool.size = 50
Worker.queue_shards = 8
Mysql.conn_pool_max = 50
Mysql.conn_pool_min = 10
Mysql.max_connect_wait = 0
Mysql.max_query_size = 8388608
Mysql.max_retries = 10
Mysql.max_values_per_query = 10000

MySQL:

max_allowed_packet = 512M
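
To double-check that the datastore server actually picked these values up, a minimal sketch that reads the relevant variables back through MySQLdb could look like this (host, user and password are placeholders):

    import MySQLdb  # the driver the MySQLAdvancedDataStore is built on

    # Placeholders: point this at the real datastore instance.
    conn = MySQLdb.connect(host="10.0.0.10", user="grr", passwd="secret", db="grr")
    cur = conn.cursor()
    for name in ("max_allowed_packet", "innodb_lock_wait_timeout",
                 "wait_timeout", "max_connections"):
        cur.execute("SHOW VARIABLES LIKE %s", (name,))
        print(cur.fetchone())  # e.g. ('max_allowed_packet', '536870912')
    conn.close()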

Does anybody in this group have the same problems with the current GRR version installed from the deb package? Or could you please give us some hints that may help getting GRR up and running again?

Thanks for your help,
Cheers
Sven

@n3x77
Author

n3x77 commented Jun 18, 2018

To add: we're using Percona Server 5.6 from the default Ubuntu repository.

@pidydx
Contributor

pidydx commented Jun 18, 2018

What hardware specs are being used for the database server?
Do you have problems with other parts of the system or only hunts?
What is the cpu load on the database server when it is having problems?

@n3x77
Author

n3x77 commented Jun 19, 2018

Hi, the database is an Azure instance with the following specs:

  • 16 CPUs
  • 112 GB memory
  • 128 GB HDD (Caching)
  • The database is stored on another partition (HDD)

The funny thing is that the database server is idle and has a load average of 2.0. So it seems that there is not really a performance issue.

@pidydx
Contributor

pidydx commented Jun 19, 2018

Nothing here looks particularly out of line. There was some functionality added that uses a method called ScanAttributes that can be extremely inefficient on the existing MySQLAdvancedDataStore, but that is usually associated with having problems retrieving results from hunts more than creating them.

Is there anything in the slow query log? (If it is not enabled yet, see the sketch below.)
Are the cron based hunts like the weekly interrogate working?
Are clients still checking in?
Can you run flows on clients?
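
Regarding the slow query log: it can usually be switched on at runtime without restarting the server. A rough sketch (credentials are placeholders, and SET GLOBAL needs sufficient privileges):

    import MySQLdb

    conn = MySQLdb.connect(host="10.0.0.10", user="root", passwd="secret", db="grr")
    cur = conn.cursor()
    cur.execute("SET GLOBAL slow_query_log = 'ON'")
    cur.execute("SET GLOBAL long_query_time = 1")  # log everything slower than 1s
    cur.execute("SHOW VARIABLES LIKE 'slow_query_log_file'")
    print(cur.fetchone())  # where the slow queries will end up
    conn.close()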

@n3x77
Author

n3x77 commented Jun 20, 2018

Do you know if there are any problems using Percona DB? We're currently running Percona Server 5.6.

The clients are connecting. I will check the slow query log and the interrogate flow.

@n3x77
Author

n3x77 commented Jul 18, 2018

So to give you an update.

We did the following things on the datastore server:

  • Upgraded from Percona 5.6 to Percona 5.7
  • Moved the datastore from HDDs to SSDs
  • Activated the performance schema and did some tuning with mysqltuner

We still get the max_allowed_packet warning, but it seems that error message is misleading. The DB looks fine; we also started some hunting for registry keys and the performance looks much better.

The clients connect to the GRR server through an nginx reverse proxy. On this proxy we get the following alert message in the nginx log:
open socket #xxx left in connection ##

So after a while the nginx proxy cannot handle the incoming connections from grr any more. Do you have a clue why these sockets are not closed correctly on the nginx server?
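
A quick way to watch how the socket states develop on the proxy would be something like this (just a sketch around netstat, nothing GRR-specific):

    import collections
    import subprocess

    # Count TCP sockets per state (CLOSE_WAIT, ESTABLISHED, ...) on this host.
    output = subprocess.check_output(["netstat", "-tan"]).decode()
    tcp_lines = [l for l in output.splitlines() if l.startswith("tcp")]
    states = collections.Counter(line.split()[-1] for line in tcp_lines)
    print(states.most_common())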

Thanks

@grrrrrrrrr
Contributor

Hm, have you looked at the frontend debug logs? They should log the time each request from a client takes; if there is an increase over time in that value, there might be some data store issue.

What you could also look at, not sure if that is possible at this point, is whether the same message about open connections also happens when the system is not loaded - very few connected clients, for example. Maybe there is some general issue that connections don't get closed properly; it could be in GRR itself or a weird interaction with nginx.

@n3x77
Author

n3x77 commented Jul 24, 2018

Hi,

I don't understand exactly what you mean by the frontend debug log. I know that it is possible to start the frontend process from the command line together with the verbose option. Is there another way to enable debug logs?

@grrrrrrrrr
Contributor

So I quickly had a look and it turns out that the frontend writes debug logs by default (in my case they are at /usr/share/grr-server/lib/python2.7/site-packages/grr/var/log/grr-http-server.log). Whenever a client delivers or reads messages, the frontend logs how long that takes. Usually those times should be very fast; maybe you can have a look at whether this is the case for your instance as well?
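
If it helps, something like the following could pull the slowest requests out of that log. The regex is only a placeholder; it needs to be adapted to whatever the actual timing lines look like:

    import re
    import sys

    # Placeholder pattern for a duration in seconds; adjust to the real
    # format of the timing lines in grr-http-server.log.
    DURATION_RE = re.compile(r"(\d+\.\d+)\s*s")

    slow = []
    with open(sys.argv[1]) as log:
        for line in log:
            match = DURATION_RE.search(line)
            if match and float(match.group(1)) > 1.0:  # slower than 1 second
                slow.append((float(match.group(1)), line.strip()))

    for seconds, line in sorted(slow, reverse=True)[:20]:
        print("%6.2fs  %s" % (seconds, line[:120]))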

@n3x77
Author

n3x77 commented Jul 30, 2018

Hi, thanks for your response.

I activated verbose logging for the HTTP server process in the server.local.yaml file:

HTTPServer Context:
    Logging.verbose: true

I also decreased the sysctl setting net.ipv4.tcp_keepalive_time to 300.
I will check whether the request times increase when the problem occurs.

@n3x77
Author

n3x77 commented Jul 30, 2018

After the server was restarted and the first hunt was scheduled, the error occurred again after a few minutes. What struck me is that the query times increase enormously, from 0.5 seconds up to 62 seconds. After that, MySQL errors occur and out-of-order messages appear. The entire hunt then remains in running status. However, the nginx server only reports HTTP 5xx error messages and no flow is processed anymore.

In the firewall we can detect an increased number of TCP connections with 'Invalid Packet' status between the reverse proxy and the clients, as well as between the GRR workers and the database server.

The MySQL error log is full of the following messages:

2018-07-30T06:25:20.055370Z 608533 [Note] Aborted connection 608533 to db: 'grr' user: 'grr' host: '10.XXX.XXX.XXX' (Got timeout reading communication packets)

20180730_grr-http-server.log
20180730_grr_http_server2.log

@grrrrrrrrr
Contributor

It seems like your MySQL server can't keep up with draining the messages for the clients. There are 41 messages for each client in a hunt, but that doesn't seem like it should be too much? Are those messages somehow special, maybe huge in size?

As a next step, I'd look at the MySQL server log and try to find slow queries. I can't really explain why the MySQL performance is that bad; maybe the logs give some info there.

@n3x77
Author

n3x77 commented Aug 2, 2018

Hi Andreas,

no, in this case I just used the Launch Executable flow to install a newer version of the GRR agent on 500 systems. So it should not be a performance-intensive hunt.

I will check the mysql slow query logs.

@n3x77
Author

n3x77 commented Aug 10, 2018

Hi Andreas,

today I tried to reproduce the errors described above. What I did was launch a hunt with the Launch Executable flow and execute the GRR 3.2.2.0 64-bit executable on 500 clients that are Windows 64-bit machines. The idea was to upgrade these clients from agent 3.2.0.1 to 3.2.2.0.

After a few seconds the errors occurred again. Since the slow query log is really huge, I used the pt-query-digest tool from Percona to parse it. I will add the log to this issue.

Do you have any clue why these queries take up to 60 seconds?

At the moment we don't know what we could do.

Thanks for your help,
Sven
test.log

@grrrrrrrrr
Contributor

I had a quick peek at this log and it looks to me like your database server is completely unresponsive - all queries are super slow, regardless of which table they read or modify.

The root cause seems to be the large inserts that write 1.5 GB to the table that collects the client messages that need to be sent. This takes almost 6,000 seconds to write and takes down the whole server. We need to have a look at whether MySQL can actually handle this amount of data in a somewhat reasonable time or whether we need to change to a different approach here.
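
Just to illustrate the direction a different approach could take (this is not GRR's actual datastore code; the table and column names are made up): splitting such a write into bounded batches keeps every single INSERT well below max_allowed_packet.

    def batches(rows, batch_size=10000):
        """Yield fixed-size slices so no single INSERT grows unbounded."""
        for i in range(0, len(rows), batch_size):
            yield rows[i:i + batch_size]

    def insert_client_messages(cursor, rows):
        # Hypothetical table/columns, for illustration only.
        for batch in batches(rows):
            cursor.executemany(
                "INSERT INTO client_messages (client_id, payload) VALUES (%s, %s)",
                batch)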

@n3x77
Author

n3x77 commented Sep 4, 2018 via email

@grrrrrrrrr
Contributor

Sorry for never getting back to this bug. The newer GRR versions (>=3.3.0.0) use a much-improved database model which should prevent issues like this one. Closing here as this is now obsolete.
