
HTTP 503 MISP to PyMISP client #3305

Closed
github-germ opened this issue Jun 1, 2018 · 36 comments

Comments

@github-germ

In #3293 (comment) I observed many HTTP 503 error codes from the MISP server in responses to requests from a multi-threaded PyMISP-based client.

Client performs ingestion of a few thousand events, some with many attributes. The 503s happen most often with PyMISP.add_hashes() calls, but have also been seen with a thread's attempt to instantiate pymisp.PyMISP().

For example, for one ingestion of about 1000 events, about 500 add_hashes() calls triggered a 503.

The client runs on the same server as MISP. This is an IBM 3650 with 32 CPU cores and 64 GB RAM.

Questions/Brainstorming

  1. Is it correct to assume that a request which received a 503 reply did not transact?

  2. Is there a uniform way to handle non-200 HTTP responses in PyMISP clients? (A rough retry sketch follows this list.)

  3. Is there a limit to the number of active MISP client connections that might be causing the 503s?

  4. Each new thread instantiates a PyMISP instance. Is there any call that will drop the connection before the thread completes and exits, to perhaps reduce the number of active connections?

  5. I've not seen any errors in error.log nor entries in debug.log when a 503 occurs. Is there another way to monitor this condition?

  6. Might there be some tuning in MISP, the PyMISP client, or perhaps in Apache or the OS, that would help?
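
For question 2, here is the kind of retry wrapper I'm considering on my side. This is only a rough sketch: the backoff values are arbitrary, and the 503 detection is a crude string check on whatever payload my client sees, not an official PyMISP mechanism.

import time

def call_with_retry(func, *args, retries=5, backoff=2.0, **kwargs):
    """Call a PyMISP method and retry with exponential backoff when the
    returned payload looks like the Apache 503 page. The string check is
    my own heuristic, not something PyMISP provides."""
    delay = backoff
    resp = None
    for _ in range(retries):
        resp = func(*args, **kwargs)
        if '503 Service Unavailable' not in str(resp):
            return resp
        time.sleep(delay)   # wait before retrying
        delay *= 2          # exponential backoff
    return resp             # still failing; let the caller decide what to do

# Hypothetical usage with an existing PyMISP instance `misp`:
# result = call_with_retry(misp.add_hashes, event_id,
#                          md5='...', sha1='...', sha256='...')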

@RichieB2B
Contributor

Please note that HTTP is a connectionless protocol. I don’t think the python requests library (that pymisp uses) keeps the connection open at all (using HTTP 1.1 keep-alive). So the only server resource you are using when not making an actual request is an entry in the CakePHP session table.
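
For what it's worth, with plain requests you only get connection reuse if you go through a Session object. A generic illustration (the URL is just a placeholder, API auth headers are omitted, and whether PyMISP uses a Session internally may depend on the version):

import requests

# One-off call: the connection is set up and torn down for this single request.
requests.get('https://localhost/servers/getVersion', verify=False)

# A Session pools connections (HTTP/1.1 keep-alive via urllib3), so repeated
# requests to the same host reuse the TCP/TLS connection instead of
# re-establishing it each time.
with requests.Session() as s:
    s.verify = False
    for _ in range(10):
        s.get('https://localhost/servers/getVersion')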

@github-germ
Author

Thx... Good point. So that eliminates item 4 in #3305 (comment)

@iglocska
Member

iglocska commented Jun 4, 2018

Could you check the error.log output when a 503 happens?

@iglocska
Member

iglocska commented Jun 4, 2018

Also, please share the script that you are using.

For example, for one ingestion of about 1000 events, about 500 add_hashes() triggered a 503.

Does this mean you try to simultaneously run 500 threads with add_hashes()?

@github-germ
Author

github-germ commented Jun 4, 2018

I've tested on two hosts. The ingestion runs through a set of events, creating events, tags, and attributes per the feed input (i.e., it's not exclusively add_hashes()):

  • an 8-core VM with 16 GB RAM, with a maximum of 8 threads and at most 4 active, and
  • a 32-core IBM 3650 with 64 GB RAM, with a maximum of 32 threads and at most 16 active.

In both cases, the 503 occurs frequently.

Are there any suggested tuning parameters to play with to help with this load issue, e.g. with the server-side MISP, Apache, MariaDB, PHP, OS; or on the client-side; etc.?

Thanks!!

@github-germ
Author

In reply to #3305 (comment)
Could you check the error.log output when a 503 happens?

Nothing in the error.log when a 503 is issued (see point 5 in #3305 (comment))

@github-germ
Author

github-germ commented Jun 4, 2018

I tried reducing active threads from 32 to 8 on the 32-CPU host and still get 503s. Then I tried running the ingestion single-threaded, and also got 503s. I also ran top during these tests; CPU usage is below 1% on the 32-CPU host.

@github-germ
Author

In looking through my log file, for every reported 503, I notice a one-minute lag between the DEBUG Starting new HTTPS connection and the DEBUG ...attributes/add/nnnnn 503:

[20180604-12:37:28 DEBUG (140350580545344)] Starting new HTTPS connection (1): \
localhost
[20180604-12:38:28 DEBUG (140350580545344)] https://localhost:443 \
"POST /attributes/add/11645 HTTP/1.1" 503 299
[20180604-12:38:28 ERROR (140350580545344)] failure for reportId=18-00009068 \
when add_hashes(event_id=11645, category="Payload delivery", \
filename="UNAVAILABLE", md5="d2aad7bf65dd75bce69f2a07898fff58", \
sha1="c014776d72ee444a6b94cab3b243f88a6497b2f5", \
sha256="aab9f44cd38805158b824e10f5db353ca535cab204c86df09012ffeab73bf6d1") \
returned <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
<p>The server is temporarily unable to service your
request due to maintenance downtime or capacity
problems. Please try again later.</p>
</body></html>

@iglocska
Member

iglocska commented Jun 5, 2018

Please share the script that you are using. Without seeing what you're actually doing and being able to reproduce it, we cannot do much to help you.

From the error you've pasted, it looks like cakephp simply forwarded an error from php. There are a host of different reasons this could happen, but so far we can't do much with the provided information.

@github-germ
Author

Let me try to reproduce with a simpler, smaller script, and then I'll share that. Thanks.

@Rafiot
Member

Rafiot commented Jun 5, 2018

Thank you

@github-germ
Author

github-germ commented Jun 6, 2018

issue_3305.zip

I've been able to reproduce the 503s with this test script on a VM and also an IBM-3650.

Instructions for use:

  1. Unzip and get test.sh and mispClient.py
  2. Edit as follows:
    test.sh -- set PYTHON to your Python executable
    mispClient.py -- change the url and key values in this dict for your system
    ....
    13 TestConfig = \
    14 {
    15     'misp':
    16     {
    ...
    18         'url':   'https://localhost/',
    19         'key':   'mSMheXyklEqRpWKInR657LcS1dL49LuVPt6dqAZd',
    ...
  3. Run the test from a shell: ./test.sh 16 -- Both of my hosts yielded 503s with 16 concurrent processes (you may need to increase this to see the 503s, depending on your host resources).
  4. You can tail the Apache access_log for 503s.
  5. Also, in the same dir where you ran test.sh, do tail -f *.log and you'll see the 503s returned from PyMISP.add_hashes() (from mispClient.createHashes()), e.g.
==> 4.out <==
createHashes: {'eventId': '17399', 'hash': {'filename': 'filename_18',
'md5': 'd41d8cd98f00b204e9800998ecf8427e',
'sha1': 'da39a3ee5e6b4b0d3255bfef95601890afd80709',
'sha256': 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'},
'error': '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
\n<html><head>\n<title>503 Service Unavailable</title>
\n</head><body>\n<h1>Service Unavailable</h1>
\n<p>The server is temporarily unable to service your
\nrequest due to maintenance downtime or capacity
\nproblems. Please try again later.</p>
\n</body></html>
\n', 'exception': None}

@iglocska
Member

iglocska commented Jun 6, 2018

I've been trying to reproduce it but sadly I could only test it on my puny laptop; setting it to 16 threads will not actually use that many on my dual-core i7 ;)

So I can't reproduce it given the above. However, something I've been curious about: what session handler are you using? Are you using php sessions in MISP and redis sessions in php.ini? If not, that could cause the issue you're seeing.

@github-germ
Author

Can you help me find these settings please? I didn't do the config, but can check with that person when he gets here. In the meantime...

$ grep ^session /etc/opt/rh/rh-php71/php.ini
session.save_handler = files
session.use_strict_mode = 0
session.use_cookies = 1
session.use_only_cookies = 1
session.name = PHPSESSID
session.auto_start = 0
session.cookie_lifetime = 0
session.cookie_path = /
session.cookie_domain =
session.cookie_httponly =
session.serialize_handler = php
session.gc_probability = 1
session.gc_divisor = 1000
session.gc_maxlifetime = 1440
session.referer_check =
session.cache_limiter = nocache
session.cache_expire = 180
session.use_trans_sid = 0
session.sid_length = 26
session.trans_sid_tags = "a=href,area=href,frame=src,form="
session.sid_bits_per_character = 5
$ 

In core.php I see:

Configure::write('Session', array(
        'timeout'        => 60,    // Session timeout, default is 1 hour
        'defaults'       => 'php',
        'autoRegenerate' => true
));

And I do see redis running as /opt/rh/rh-redis32/root/usr/bin/redis-server 127.0.0.1:6379

@github-germ
Author

Btw, the test code I uploaded does not use threads (as my real code does) -- it just runs separate processes started by test.sh. However, it reproduces a load similar enough to what my real app does to trigger the 503s. The fact that you don't see the 503s does seem to point to some sort of config or tuning issue on my side, I think.

@github-germ
Author

I'm told that we are using whatever the defaults were for these settings. Awaiting your next steps. Thanks, Andras... I really appreciate all your help, as we are about to deploy our first MISP next week.

@iglocska
Member

iglocska commented Jun 6, 2018

From the server settings description:

The session type used by MISP. The default setting is php, which will use the session settings configured in php.ini for the session data (supported options: php, database). The recommended option is php and setting your PHP up to use redis sessions via your php.ini. Just add 'session.save_handler = redis' and "session.save_path = 'tcp://localhost:6379'" (replace the latter with your redis connection) 

@github-germ
Author

OK, we will reconfig and retest. Thanks!

What's the URL that describes the server settings?

@iglocska
Member

iglocska commented Jun 6, 2018

It should be in http://localhost:5002/servers/serverSettings/Security for this particular setting (Session.defaults)

@github-germ
Author

OK, so someone configuring MISP should really pay special attention to each and every description in the WebUI settings -OR- is there also a document that serves as a guide?

@iglocska
Member

iglocska commented Jun 6, 2018

Basically yeah, especially the critical settings should be read carefully and configured.

@gitnorty

gitnorty commented Jun 6, 2018

Is there an admin guide giving more details about the various settings? Any documentation describing the settings and their effects would be extremely helpful. Thanks in advance.

@github-germ
Author

github-germ commented Jun 6, 2018

We made these two setting changes in /etc/opt/rh/rh-php71/php.ini:

1227c1227
< session.save_handler = redis
---
> session.save_handler = files
1260c1260
< session.save_path = "tcp://localhost:6379"
---
> ;session.save_path = "/tmp"

And restarted redis, httpd, and the workers.

My test script (with 16 procs) is still throwing 503s )-:

@github-germ
Author

Is there any way to determine the root cause of the 503s? Does the MISP server-side code have any way to determine that? E.g., is it Apache itself, or the DB, or...?

@iglocska
Member

iglocska commented Jun 7, 2018

I've just got my new laptop; I'll have it set up over the weekend and can then test with twice as many threads (or throw it on my home desktop to test with at least 12 threads...)

@Rafiot
Member

Rafiot commented Jun 7, 2018

In the meantime, so you can get your work done, please try running the script with fewer processes in parallel (3-4 instead of 16), thanks.

@github-germ
Author

github-germ commented Jun 7, 2018

A + R...

My current hypothesis is that the 503s are triggered by the rate of new PyMISP instantiations, each of which (I believe you said) makes three HTTP requests.

My loose basis for this: if I disable the creation of new PyMISP instances in mispClient.py (line 30: 'doDups': False,) with 4 concurrent processes on my 8-core VM, the tests succeed without any 503s. Simply changing this to True brings the 503s back.

Likewise, in my real app, which uses threads, if I adjust my thread start scheduler to increase the wait time between starting new threads from 0.1 to 1.0 seconds, I do not see 503s (for smaller batches; I still need to test larger datasets).

So, could this rapid-fire pattern of three HTTP transactions per new PyMISP instance be the cause? Perhaps you can consider that when you do more testing. If this is true, is there a solution via tuning somewhere (e.g., Apache)?
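
If that hypothesis holds, one workaround I may try is spacing out the instantiations themselves. A rough sketch (make_pymisp is my own helper, the one-second interval is an arbitrary guess, and the url/key are placeholders):

import threading
import time
from pymisp import PyMISP

_init_lock = threading.Lock()
_last_init = 0.0
MIN_INTERVAL = 1.0   # seconds between PyMISP instantiations (arbitrary guess)

def make_pymisp(url, key):
    """Create a PyMISP instance, but never more often than MIN_INTERVAL.
    This only spaces out the instantiation burst; it does not retry 503s."""
    global _last_init
    with _init_lock:
        wait = MIN_INTERVAL - (time.time() - _last_init)
        if wait > 0:
            time.sleep(wait)
        _last_init = time.time()
    return PyMISP(url, key, ssl=False)   # ssl=False only because my cert is self-signed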

I'd also be curious if there is some method to determine the root cause for the 503s to prove/disprove this hypothesis or lead us to another.

Thx!!

...M

@github-germ
Author

github-germ commented Jun 7, 2018

Btw, after my mispClient.py test script completes, I run a script with a simple loop of PyMISP.delete_event() for each created event (only one PyMISP instance here). I see occasional 503s for some of the deletions, with this in error.log:

::1 - - [07/Jun/2018:14:25:20 +0000] "DELETE /events/12065 HTTP/1.1" 503 299 "-" "PyMISP 2.4.90 - Python 3.6.4"
2018-06-07 14:27:15 Error: [PDOException] SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction
Request URL: /events/12065
Stack Trace:
#0 /var/www/MISP_2.4.91_deadlockfix/app/Lib/Tools/QueryTool.php(17): PDOStatement->execute()
#1 /var/www/MISP_2.4.91_deadlockfix/app/Model/Event.php(1254): QueryTool->quickDelete('correlations', '1_event_id', '12065', Object(Event))
#2 /var/www/MISP_2.4.91_deadlockfix/app/Controller/EventsController.php(1834): Event->quickDelete(Array)
#3 [internal function]: EventsController->delete('12065')
#4 /var/www/MISP_2.4.91_deadlockfix/app/Lib/cakephp/lib/Cake/Controller/Controller.php(491): ReflectionMethod->invokeArgs(Object(EventsController), Array)
#5 /var/www/MISP_2.4.91_deadlockfix/app/Lib/cakephp/lib/Cake/Routing/Dispatcher.php(193): Controller->invokeAction(Object(CakeRequest))
#6 /var/www/MISP_2.4.91_deadlockfix/app/Lib/cakephp/lib/Cake/Routing/Dispatcher.php(167): Dispatcher->_invoke(Object(EventsController), Object(CakeRequest))
#7 /var/www/MISP_2.4.91_deadlockfix/app/webroot/index.php(92): Dispatcher->dispatch(Object(CakeRequest), Object(CakeResponse))
#8 {main}

You may recall #3264 with the deadlockfix branch, where we also saw lock issues. Should this new error I see (above) be a new issue? Note that each of these events has 300 attributes, which I assume causes the delay -- I imagine that batch of 300 attribute deletions may also be a single lock transaction, as add_hashes() was before the deadlockfix.
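
For now, my cleanup script deletes strictly one event at a time, pausing between calls and backing off when the response body looks like the Apache 503 page. A rough sketch of that loop (delete_events is my own helper; the pause lengths and the string check are my own guesses):

import time

def delete_events(misp, event_ids, pause=0.5, retries=3):
    """Delete events one at a time; pause between calls and back off when the
    response looks like the Apache 503 page (a crude heuristic of mine)."""
    for eid in event_ids:
        for attempt in range(retries):
            resp = misp.delete_event(eid)
            if '503 Service Unavailable' not in str(resp):
                break
            time.sleep(5 * (attempt + 1))   # back off; the lock may still be held
        time.sleep(pause)                   # space out deletions a bit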

mispDeleteEvents.zip

@github-germ
Author

github-germ commented Jun 8, 2018

Further testing, and I cannot get a clean run without 503s. I hope you are able to reproduce this and isolate the root cause. At this point, I'm not sure what's causing these 503s. HELP!!!

Even when running single-threaded and sleeping 1 second between processing each event (although an event can have lots of attributes), I see 503s.

Do I need to put a sleep before every PyMISP call???? If so, I don't think we'll be able to process our input fast enough.

@github-germ
Author

Anything new to report here?

@thormanrd

thormanrd commented Jun 11, 2018

I'm getting the same thing invoking the ..events/restSearch/json endpoint from my NiFi workflow. This invocation has worked well for the past week. Now it only returns the 503...

[screenshot attached: 2018-06-11 at 8:03:59 AM]

@github-germ
Author

Update: nothing works to eliminate the 503s so far. Keep in mind this is not a threading issue, as we get 503s without threads.

If you'd like us to try PHP without FPM, please tell me how.

Thx.

@github-germ
Author

Ping

@github-germ
Author

github-germ commented Jun 13, 2018

Andras... interesting update. I disabled calls to PyMISP.add_hashes() in my script and it now runs fine with full multi-threading (no 503s). Recall that we started getting these 503s when the deadlockfix branch was implemented, which allowed add_hashes() to actually work. (Note: I'm running 2.4.92 with deadlockfix enabled via the WebUI.)

So, how does this play into what you think is triggering the 503s? And does this data point help in finding a solution that will allow me to start using add_hashes() again?

Thanks!!

@github-germ
Author

Let's close this issue. Although HTTP 503s persist, separate, more specific issues have been spun out from here:

  • Use the new format to create MISP events jaegeral/PySight2MISP#6 -- changing the model (see the sketch after this list):
    • FROM: discrete use of new_event() followed by various add_*() calls for that event, which causes many HTTP and DB transactions for one created event.
    • TO: constructing the new event in local memory within a PyMISPEvent, followed by additions to it with calls like add_attribute(), tag(), and add_object(), and then a single add_event(), which is one HTTP exchange and hits the DB once.
  • HTTP 503 with many Event Deletions #3361 -- HTTP 503 with many Event Deletions in WebUI and PyMISP
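
For reference, the "build locally, push once" model looks roughly like this using pymisp's MISPEvent class (a hedged sketch: the URL, key, tag, and hash values are placeholders or examples, and the exact signatures should be checked against the PyMISP version in use):

from pymisp import PyMISP, MISPEvent

misp = PyMISP('https://localhost/', 'API_KEY_HERE', ssl=False)   # placeholder URL/key

event = MISPEvent()
event.info = 'Ingested report (example)'
event.add_tag('tlp:amber')                        # example tag
# Build every attribute locally; nothing is sent to the server yet.
event.add_attribute('md5', 'd41d8cd98f00b204e9800998ecf8427e',
                    category='Payload delivery')
event.add_attribute('sha1', 'da39a3ee5e6b4b0d3255bfef95601890afd80709',
                    category='Payload delivery')

# A single call pushes the whole event and its attributes in one HTTP exchange.
response = misp.add_event(event)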

@github-germ
Author

With the refactoring of the API, I'm wondering whether the issues I encountered running a multi-threaded PyMISP feed ingestion app (as detailed above, yielding HTTP 503s) might now be resolved? Same for the add_hashes issue.

@github-germ github-germ reopened this Oct 15, 2018
@Rafiot Rafiot closed this as completed Jul 19, 2019