memory profiling & benchmarks #103
benchmarks brought out possible memory leaks in another project of mine (not |
when running the benchmarks, after sending about 300, prometheus query `number_of_messages_total{project="naz_benchmarks", state="response"}` (Lines 112 to 115 in 14fe3be)
|
I think for this, what we need to do is that when we get `ESME_RMSGQFUL` (Line 1940 in 14fe3be) we should also call the throttle handler, ie;

```python
elif commandStatus.value in [
    SmppCommandStatus.ESME_RTHROTTLED.value,
    SmppCommandStatus.ESME_RMSGQFUL.value,
]:
    await self.throttle_handler.throttled()
```

However, for the purposes of running benchmarks, we should edit the SMSC simulator to hold more messages: https://github.com/komuw/smpp_server_docker/blob/dea7168abd4c89df8b8f67ffb57bd18cf00c8526/SMPPSim/conf/smppsim.props#L85-L88
|
for the memory leak, it flattens after 60 mins. This lines up pretty well with the default correlater (Lines 94 to 104 in 14fe3be).
It stores items in a dict that we keep adding to, and that dict has a ttl of 60 mins. That's why the memory climbs until the 1hr mark and then flattens. We'll reduce the default ttl to about 10-15 mins
|
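To make the mechanism concrete, here is a minimal sketch of a dict-backed correlater with TTL-based pruning; the names (`TtlCorrelater`, `max_ttl`) are illustrative and this is not the actual `naz.correlater.SimpleCorrelater` code:

```python
import time

class TtlCorrelater:
    """Sketch of a dict-backed correlater whose entries expire after max_ttl seconds.
    Memory grows until the oldest entries hit the TTL, then flattens as they get pruned."""

    def __init__(self, max_ttl: float = 15 * 60):  # e.g. a 15 minute default instead of 60
        self.max_ttl = max_ttl
        self.store = {}

    async def put(self, sequence_number: int, log_id: str) -> None:
        self.store[sequence_number] = {"log_id": log_id, "stored_at": time.monotonic()}
        self._prune()

    async def get(self, sequence_number: int) -> str:
        item = self.store.get(sequence_number)
        return item["log_id"] if item else ""

    def _prune(self) -> None:
        # drop anything older than max_ttl so the dict cannot grow without bound
        now = time.monotonic()
        for key in [k for k, v in self.store.items() if now - v["stored_at"] > self.max_ttl]:
            del self.store[key]
```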
Check if we are sending out the following to SMSC;
Those two we currently do not send straight out; rather, we queue them in the broker so they can be sent later (Lines 977 to 979 in 14fe3be and Lines 1083 to 1085 in 14fe3be).
I didn't see them in the logs, so we need to double check |
This was fixed in; 4076739 |
the "mem leak" was fixed in; 4076739 |
we need to send logs somewhere for analysis of any errors. Maybe https://www.komu.engineer/blogs/timescaledb/timescaledb-for-logs, or use Sentry, so long as we can get large quotas |
What:
- upgrade java version
- increase queue sizes from 1,000 to 250,000 messages
- decrease time taken in queue from 60 to 5 seconds
- released a new docker image tagged; `komuw/smpp_server:v0.3`

Why:
- when running benchmarks, with the low queue size, the smpp client would get a lot of `ESME_RMSGQFUL` (`Message Queue Full`)[1]

Ref:
1. komuw/naz#103 (comment)
Increased smpp queue sizes in 0afac56 and |
This was done in; c508d70 |
when smsc or redis are disconnected, naz appears to hang while trying to re-establish the connection and re-bind; the last log entries are:

```
{"log":"{\"timestamp\": \"2019-06-05 13:21:54,483\", \"event\": \"naz.Client.re_establish_conn_bind\", \"stage\": \"start\", \"smpp_command\": \"submit_sm\", \"log_id\": \"3132-rhopvxn\", \"connection_lost\": true, \"project\": \"naz_benchmarks\", \"smsc_host\": \"134.209.80.205\", \"system_id\": \"smppclient1\", \"client_id\": \"4DPUKXZ7H95AZIJCL\", \"pid\": 1}\n","stream":"stderr","time":"2019-06-05T13:21:54.483913134Z"}
{"log":"{\"timestamp\": \"2019-06-05 13:21:54,485\", \"event\": \"naz.Client.connect\", \"stage\": \"start\", \"log_id\": \"3132-rhopvxn\", \"project\": \"naz_benchmarks\", \"smsc_host\": \"134.209.80.205\", \"system_id\": \"smppclient1\", \"client_id\": \"4DPUKXZ7H95AZIJCL\", \"pid\": 1}\n","stream":"stderr","time":"2019-06-05T13:21:54.485750995Z"}
```

This is despite the function doing the re-connection having explicit timeouts (Lines 739 to 741 in 8466de4).
The larger issue is, this piece of code does not work as I would expect:

```python
import asyncio
import time

async def eternity():
    """Sleep for one hour"""
    # await asyncio.sleep(3600)  # <-- this would get cancelled
    time.sleep(3600)  # <-- this does not get cancelled
    print("done!")

async def main():
    # Wait for at most 4 seconds
    await asyncio.wait_for(eternity(), timeout=4.0)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
```

And https://github.com/aio-libs/async-timeout is not the solution; it also does not work.
|
Well, it looks like it is not possible to time out blocking code from an async function, at least that is what the guy[1] who wrote most of python asyncio says - https://twitter.com/1st1/status/1136292366208372736 So maybe what we want to do is add a blocking code detector: |
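One ready-made blocking-code detector is asyncio's debug mode: the loop logs a warning whenever a single callback or task step holds it for longer than `loop.slow_callback_duration` seconds. A minimal sketch:

```python
import asyncio
import logging
import time

logging.basicConfig(level=logging.WARNING)

async def suspicious():
    time.sleep(2)  # blocking call hiding inside a coroutine

loop = asyncio.get_event_loop()
loop.set_debug(True)                # enable asyncio debug mode
loop.slow_callback_duration = 0.1   # warn when the loop is blocked for more than 100ms
loop.run_until_complete(suspicious())
# logs something like: "Executing <Task finished coro=<suspicious() ...>> took 2.005 seconds"
```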
With this change (Lines 742 to 747 in 61d036b), naz seems to be able to withstand SMSC failures (when the smsc goes down and then comes back up) without blocking forever. This is after testing for an hour or so. I'll leave it running for longer just to be sure.
However,
|
since |
This was fixed in #134
This was fixed in #134 |
Issue raised while benchmarking:

```
{
    "timestamp": "2019-06-08 12:42:27,957",
    "event": "naz.Client.receive_data",
    "stage": "end",
    "full_pdu_data": b"\x00\x00\x00\x15\x80",
    "project": "naz_benchmarks",
    "smsc_host": "134.209.80.205",
    "system_id": "smppclient1",
    "client_id": "DRNTKRMTCJR97D9OO",
    "pid": 83197,
}
  File "/Users/home/mystuff/naz/naz/client.py", line 1838, in receive_data
    await self._parse_response_pdu(full_pdu_data)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/coroutines.py", line 110, in __next__
    return self.gen.send(None)
  File "/Users/home/mystuff/naz/naz/client.py", line 1860, in _parse_response_pdu
    command_id = struct.unpack(">I", command_id_header_data)[0]
struct.error: unpack requires a buffer of 4 bytes
{'timestamp': '2019-06-08 12:42:27,959', 'event': 'naz.cli.main', 'stage': 'end'}
```

also see: #135
|
for the issue above, we have added this to try and resolve it; Lines 1815 to 1841 in 7ddce43
|
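I haven't reproduced the referenced diff here, but the shape of the guard is roughly this: an SMPP header is 16 bytes (command_length, command_id, command_status, sequence_number, each a big-endian uint32), so only unpack once a full header has actually arrived. A hypothetical sketch:

```python
import struct

SMPP_HEADER_LENGTH = 16  # four big-endian uint32 fields

def try_parse_header(full_pdu_data: bytes):
    """Return (command_length, command_id), or None if the buffer is too short,
    instead of letting struct.error crash the receiving coroutine."""
    if len(full_pdu_data) < SMPP_HEADER_LENGTH:
        return None
    command_length = struct.unpack(">I", full_pdu_data[0:4])[0]
    command_id = struct.unpack(">I", full_pdu_data[4:8])[0]
    return command_length, command_id
```

In the failing log above, `full_pdu_data` was only 5 bytes (`b"\x00\x00\x00\x15\x80"`), so a length check like this would catch it before `struct.unpack` is ever called.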
Another issue raised while benchmarking;

```
{'timestamp': '2019-06-08 15:06:55,023', 'event': 'MyRedisQueue.dequeue', 'stage': 'start'}
{'timestamp': '2019-06-08 15:06:55,041', 'event': 'naz.Client.receive_data', 'stage': 'end', 'full_pdu_data': b'\x00\x00\x00*CSMPPSim\x00\x00\x00\x00\x15\x80\x00\x00\x04\x00\x00\x00\x00\x00\x00*B2887\x00\x00\x00\x00\x18\x80\x00\x00\t', 'project': 'naz_benchmarks', 'smsc_host': 'smsc_host', 'system_id': 'smppclient1', 'client_id': 'NVG1W95WUO4VOS5VQ', 'pid': 87991}
{'timestamp': '2019-06-08 15:06:55,041', 'event': 'naz.Client._parse_response_pdu', 'stage': 'start', 'project': 'naz_benchmarks', 'smsc_host': 'smsc_host', 'system_id': 'smppclient1', 'client_id': 'NVG1W95WUO4VOS5VQ', 'pid': 87991}
{'timestamp': '2019-06-08 15:06:55,041', 'event': 'naz.Client._parse_response_pdu', 'stage': 'end', 'log_id': '', 'state': 'command_id:1129532752 is unknown.', 'project': 'naz_benchmarks', 'smsc_host': 'smsc_host', 'system_id': 'smppclient1', 'client_id': 'NVG1W95WUO4VOS5VQ', 'pid': 87991}
NoneType: None
{'timestamp': '2019-06-08 15:06:55,041', 'event': 'naz.cli.main', 'stage': 'end', 'error': 'command_id:1129532752 is unknown.'}
Traceback (most recent call last):
  File "/Users/home/mystuff/naz/cli/cli.py", line 88, in main
    async_main(client=client, logger=logger, loop=loop, dry_run=dry_run)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/Users/home/mystuff/naz/cli/cli.py", line 115, in async_main
    await tasks
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/coroutines.py", line 126, in send
    return self.gen.send(value)
  File "/Users/home/mystuff/naz/naz/client.py", line 1855, in receive_data
    await self._parse_response_pdu(full_pdu_data)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/coroutines.py", line 110, in __next__
    return self.gen.send(None)
  File "/Users/home/mystuff/naz/naz/client.py", line 1892, in _parse_response_pdu
    raise ValueError("command_id:{0} is unknown.".format(command_id))
ValueError: command_id:1129532752 is unknown.
{'timestamp': '2019-06-08 15:06:55,042', 'event': 'naz.cli.main', 'stage': 'end'}
```
|
for the issue above:
|
both were fixed in: 2059030 |
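A side note on that `command_id:1129532752 is unknown.` error: 1129532752 is 0x43534D50, i.e. the ASCII bytes `CSMP`, which are visible in the logged `full_pdu_data` (`...CSMPPSim...`). That points to mis-framed reads (payload bytes being interpreted as a header) rather than the SMSC sending a genuinely unknown command. Independent of the framing fix, here is a sketch of handling an unrecognised command_id without tearing the whole client down; the helper and the id table are hypothetical, not the actual change in 2059030:

```python
import logging
import struct

logger = logging.getLogger("naz_sketch")

# a small subset of SMPP command_ids, for illustration only
KNOWN_COMMAND_IDS = {
    0x00000004: "submit_sm",
    0x80000004: "submit_sm_resp",
    0x80000015: "enquire_link_resp",
}

def command_name(header: bytes):
    command_id = struct.unpack(">I", header[4:8])[0]
    name = KNOWN_COMMAND_IDS.get(command_id)
    if name is None:
        # log and discard this PDU instead of raising ValueError and killing the client
        logger.warning("unknown command_id=%d (0x%08X); discarding PDU", command_id, command_id)
    return name
```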
new error {"timestamp": "2019-06-08 19:56:04,993", "event": "naz.Client.command_handlers", "stage": "start", "smpp_command": "generic_nack", "log_id": "", "error": "command_status:411041792 is unknown.", "project": "naz_benchmarks", "smsc_host": "host, "system_id": "smppclient1", "client_id": "EIZKC9IV9FAGC7JMK", "pid": 93228}
NoneType: None
{'timestamp': '2019-06-08 19:56:04,993', 'event': 'naz.cli.main', 'stage': 'end', 'error': "'NoneType' object has no attribute 'description'"}
Traceback (most recent call last):
File "/Users/home/mystuff/naz/naz/client.py", line 2036, in command_handlers
if commandStatus.value == SmppCommandStatus.ESME_ROK.value:
AttributeError: 'NoneType' object has no attribute 'value'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/home/mystuff/naz/cli/cli.py", line 88, in main
async_main(client=client, logger=logger, loop=loop, dry_run=dry_run)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
return future.result()
File "/Users/home/mystuff/naz/cli/cli.py", line 115, in async_main
await tasks
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/coroutines.py", line 126, in send
return self.gen.send(value)
File "/Users/home/mystuff/naz/naz/client.py", line 1896, in receive_data
await self._parse_response_pdu(full_pdu_data)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/coroutines.py", line 110, in __next__
return self.gen.send(None)
File "/Users/home/mystuff/naz/naz/client.py", line 1961, in _parse_response_pdu
hook_metadata=hook_metadata,
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/coroutines.py", line 110, in __next__
return self.gen.send(None)
File "/Users/home/mystuff/naz/naz/client.py", line 2052, in command_handlers
"state": commandStatus.description,
AttributeError: 'NoneType' object has no attribute 'description' |
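Two observations, hedged since the actual fix isn't shown here: first, 411041792 is 0x18800000, which plausibly is another out-of-alignment read (a command_length byte of 0x18 followed by the 0x80 that starts a response command_id); second, `command_handlers` dereferences `commandStatus` without checking for None, which turns one unrecognised status code into a client-killing AttributeError. A minimal guard sketch with hypothetical names:

```python
import logging

logger = logging.getLogger("naz_sketch")

def handle_command_status(commandStatus) -> None:
    """Guard sketch: commandStatus can be None when the received status code is not
    one we recognise, so bail out before touching .value or .description."""
    if commandStatus is None:
        logger.warning("received an unrecognised command_status; skipping this handler")
        return
    logger.info("command_status=%s (%s)", commandStatus.value, commandStatus.description)
```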
after running benchmarks for a while, strace:

```
Process 22741 attached
12:10:00 write(2<pipe:[3236233]>, " File \"/usr/local/lib/python3.6/"..., 8192
```

commentary: write() writes up to count bytes from the buffer pointed buf to the file referred to by the file descriptor fd.
- https://linux.die.net/man/2/write
|
and looking at that process's fds;

```
dr-x------ 2 root root  0 Jun 9 08:31 ./
dr-xr-xr-x 9 root root  0 Jun 9 08:31 ../
lrwx------ 1 root root 64 Jun 9 08:31 0 -> /dev/null
l-wx------ 1 root root 64 Jun 9 08:31 1 -> 'pipe:[3236232]'
l-wx------ 1 root root 64 Jun 9 08:31 2 -> 'pipe:[3236233]'
lrwx------ 1 root root 64 Jun 9 12:19 3 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jun 9 12:19 4 -> 'socket:[3237426]'
lrwx------ 1 root root 64 Jun 9 12:19 5 -> 'socket:[3237427]'
l-wx------ 1 root root 64 Jun 9 12:19 6 -> /usr/src/nazLog/naz_log_file
lrwx------ 1 root root 64 Jun 9 12:19 7 -> 'socket:[3316320]'
l-wx------ 1 root root 64 Jun 9 08:31 /proc/22741/fd/2 -> 'pipe:[3236233]'
```

so

```
naz-cli 22741 root 2w FIFO 0,12 0t0 3236233 pipe
```

commentary: /proc/[pid]/fd/
This is a subdirectory containing one entry for each file which the process has open, named by its file descriptor, and which is a symbolic link to the actual file. Thus, 0 is standard input, 1 standard output, 2 standard error, and so on.
- http://man7.org/linux/man-pages/man5/proc.5.html

from the above and the fact that
|
after I restart, the fds are;

```
dr-x------ 2 root root  0 Jun 9 12:51 ./
dr-xr-xr-x 9 root root  0 Jun 9 12:51 ../
lrwx------ 1 root root 64 Jun 9 12:51 0 -> /dev/pts/0
l-wx------ 1 root root 64 Jun 9 12:51 1 -> /dev/null
l-wx------ 1 root root 64 Jun 9 12:51 2 -> /dev/null
lrwx------ 1 root root 64 Jun 9 12:51 3 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jun 9 12:51 4 -> 'socket:[4908369]'
lrwx------ 1 root root 64 Jun 9 12:51 5 -> 'socket:[4908370]'
l-wx------ 1 root root 64 Jun 9 12:51 6 -> /usr/src/nazLog/naz_log_file
lrwx------ 1 root root 64 Jun 9 12:51 7 -> 'socket:[5064658]'
lrwx------ 1 root root 64 Jun 9 12:51 8 -> 'socket:[5033859]'
```

as we can see, none of them is a pipe. So how, at some point, it starts talking to a pipe fd is still up in the air
|
I have started an strace session and will leave it running until the process hangs so that we can see the trail of syscalls right up to the hang;
|
☝️ The above debug sessions were conducted on … In order to debug, I attached strace from a sidecar container:

```Dockerfile
FROM alpine
RUN apk update && apk add strace
CMD ["strace", "-Tyytvx", "-p", "1"]
```

```sh
docker build -t my_stracer .
docker run -t --pid=container:naz_cli \
  --net=container:naz_cli \
  --cap-add sys_admin \
  --cap-add sys_ptrace my_stracer
```

```
strace: Process 1 attached
18:55:23 write(2<pipe:[1915902]>, "{'timestamp': '2019-06-09 11:24:"..., 172
```

Commentary:
|
damn, strace is pretty cool. I now feel like Brendan Gregg. |
What:
- added `naz` benchmarks and related fixes that came from the benchmark runs
- when smsc returns `ESME_RMSGQFUL` (message queue is full), `naz` will now call the `throttling` handler.
- fixed a potential memory leak bug in `naz.correlater.SimpleCorrelater`
- added a configurable `connection_timeout` which is the duration that `naz` will wait, for connection related activities with SMSC, before timing out.
- `naz` is now able to re-establish connection and re-bind if the connection between it and SMSC is disconnected.

Why:
- Make `naz` more failure tolerant
- Fixes: #103
benchmark results will come in a different PR. |
since `naz-cli` is a long running process, we need to make sure it does not leak memory. We should run `naz` over an extended period under load (say 24 or 48 hrs) and profile memory usage.
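For the long (24 or 48 hr) runs, one lightweight way to watch for leaks from inside the process is `tracemalloc`; the coroutine below periodically prints the allocation sites that have grown the most since start-up. This is an illustrative helper, not something naz ships:

```python
import asyncio
import tracemalloc

async def report_memory(interval: int = 600, top: int = 10) -> None:
    """Every `interval` seconds, print the `top` allocation sites that grew the most
    compared to the baseline snapshot taken at start-up."""
    tracemalloc.start()
    baseline = tracemalloc.take_snapshot()
    while True:
        await asyncio.sleep(interval)
        snapshot = tracemalloc.take_snapshot()
        for stat in snapshot.compare_to(baseline, "lineno")[:top]:
            print(stat)

# e.g. schedule it next to the client's other tasks:
# loop.create_task(report_memory())
```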