
memory profiling & benchmarks #103

Closed
komuw opened this issue May 19, 2019 · 30 comments · Fixed by #127

komuw commented May 19, 2019

Since naz-cli is a long-running process, we need to make sure it does not leak memory.

We should run naz under load over an extended period (say 24 or 48 hrs) and profile its memory usage.

komuw commented May 19, 2019

Benchmarks brought out possible memory leaks in another project of mine (not naz) -> komuw/wiji#71

komuw mentioned this issue Jun 1, 2019

komuw commented Jun 1, 2019

When running the benchmarks, after sending about 300 submit_sm PDUs, the SMSC simulator starts returning ESME_RMSGQFUL.

Prometheus query:

number_of_messages_total{project="naz_benchmarks", state="response"}

naz/naz/state.py

Lines 112 to 115 in 14fe3be

ESME_RMSGQFUL: CommandStatus = CommandStatus(
    code="ESME_RMSGQFUL", value=0x00000014, description="Message Queue Full"
)

komuw commented Jun 1, 2019

It also seems like naz-cli may be getting some memory leak maladies;

Screen Shot 2019-06-01 at 17 11 33

I'll run it for at least 2hrs to see if we need to throw tracemalloc profiling at it.

komuw commented Jun 1, 2019

When running the benchmarks, after sending about 300 submit_sm PDUs, the SMSC simulator starts returning ESME_RMSGQFUL.

I think what we need to do here is: when we get an ESME_RMSGQFUL response from the SMSC, we should inform the throttle handler;

naz/naz/client.py

Line 1940 in 14fe3be

elif commandStatus.value == SmppCommandStatus.ESME_RTHROTTLED.value:

ie;

elif commandStatus.value in [
    SmppCommandStatus.ESME_RTHROTTLED.value,
    SmppCommandStatus.ESME_RMSGQFUL.value,
]:
    await self.throttle_handler.throttled()

However, for the purposes of running the benchmarks, we should edit the SMSC simulator to hold more messages: https://github.com/komuw/smpp_server_docker/blob/dea7168abd4c89df8b8f67ffb57bd18cf00c8526/SMPPSim/conf/smppsim.props#L85-L88

komuw commented Jun 1, 2019

For the memory leak: it flattens out after 60 mins:
Screen Shot 2019-06-01 at 18 15 13

This lines up pretty well with the default correlater, naz.correlater.SimpleCorrelater;

naz/naz/correlater.py

Lines 94 to 104 in 14fe3be

def __init__(self, max_ttl: float = 60 * 60.00) -> None:
    """
    Parameters:
        max_ttl: The time in seconds that an item is going to be stored.
                 After the expiration of max_ttl seconds, that item will be deleted.
    """
    if not isinstance(max_ttl, float):
        raise ValueError(
            "`max_ttl` should be of type:: `float` You entered: {0}".format(type(max_ttl))
        )
    self.store: dict = {}

It stores items in a dict that we append to, and that dict has a TTL of 60 mins.
That's why the memory climbs until the 1 hr mark and then flattens.

We'll reduce the default TTL to about 10-15 mins.
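
For context, here is a minimal sketch of that idea (illustrative names only, not naz's actual SimpleCorrelater code): a dict-backed store whose entries are purged once they are older than max_ttl, so memory stays bounded by roughly one TTL's worth of traffic.

import time

class TTLStore:
    """Dict-backed store whose entries are purged once older than max_ttl seconds."""

    def __init__(self, max_ttl: float = 15 * 60.0) -> None:
        self.max_ttl = max_ttl
        self.store: dict = {}  # key -> (value, stored_at)

    def put(self, key: str, value: str) -> None:
        self.delete_expired()
        self.store[key] = (value, time.monotonic())

    def get(self, key: str):
        self.delete_expired()
        item = self.store.get(key)
        return item[0] if item else None

    def delete_expired(self) -> None:
        now = time.monotonic()
        expired = [k for k, (_, stored_at) in self.store.items() if now - stored_at > self.max_ttl]
        for k in expired:
            del self.store[k]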

komuw commented Jun 1, 2019

Check if we are sending out the following to SMSC;

  • enquire_link_resp
  • deliver_sm_resp

We currently do not send those two straight out; rather, we queue them in the broker so they can be sent later;

naz/naz/client.py

Lines 977 to 979 in 14fe3be

try:
    await self.outboundqueue.enqueue(item_to_enqueue)
except Exception as e:

naz/naz/client.py

Lines 1083 to 1085 in 14fe3be

try:
    await self.outboundqueue.enqueue(item_to_enqueue)
except Exception as e:

I didn't see them in the logs, so we need to double-check.

komuw commented Jun 1, 2019

We'll reduce the default ttl to about 10-15 mins

This was fixed in; 4076739

komuw commented Jun 2, 2019

Screen Shot 2019-06-02 at 11 19 32

the "mem leak" was fixed in; 4076739

komuw commented Jun 2, 2019

we need to send logs somewhere for analysis of any errors. maybe, https://www.komu.engineer/blogs/timescaledb/timescaledb-for-logs

or use Sentry, so long as we can get large quotas

komuw added a commit to komuw/smpp_server_docker that referenced this issue Jun 4, 2019
What:
- upgrade java version
- increase queue sizes from 1,000 to 250,000 messages
- decrease time taken in queue from 60 to 5 seconds
- released a new docker image tagged; `komuw/smpp_server:v0.3`

Why:
- when running benchmarks, with the low queue size, the smpp client would get a lot of `ESME_RMSGQFUL` (`Message Queue Full`)[1]

Ref:
1. komuw/naz#103 (comment)
komuw commented Jun 4, 2019

Increased smpp queue sizes in 0afac56 and
komuw/smpp_server_docker@ac4608a

komuw commented Jun 5, 2019

we need to send logs somewhere for analysis of any errors. maybe,

This was done in; c508d70

komuw commented Jun 5, 2019

When the SMSC or redis gets disconnected and naz tries to reconnect, it hangs forever instead of timing out;

{"log":"{\"timestamp\": \"2019-06-05 13:21:54,483\", \"event\": \"naz.Client.re_establish_conn_bind\", \"stage\": \"start\", \"smpp_command\": \"submit_sm\", \"log_id\": \"3132-rhopvxn\", \"connection_lost\": true, \"project\": \"naz_benchmarks\", \"smsc_host\": \"134.209.80.205\", \"system_id\": \"smppclient1\", \"client_id\": \"4DPUKXZ7H95AZIJCL\", \"pid\": 1}\n","stream":"stderr","time":"2019-06-05T13:21:54.483913134Z"}
{"log":"{\"timestamp\": \"2019-06-05 13:21:54,485\", \"event\": \"naz.Client.connect\", \"stage\": \"start\", \"log_id\": \"3132-rhopvxn\", \"project\": \"naz_benchmarks\", \"smsc_host\": \"134.209.80.205\", \"system_id\": \"smppclient1\", \"client_id\": \"4DPUKXZ7H95AZIJCL\", \"pid\": 1}\n","stream":"stderr","time":"2019-06-05T13:21:54.485750995Z"}

This is despite the function doing the re-connection having explicit timeouts:

naz/naz/client.py

Lines 739 to 741 in 8466de4

reader, writer = await asyncio.wait_for(
    asyncio.open_connection(self.smsc_host, self.smsc_port), timeout=self.connect_timeout
)

The larger issue is, this piece of code does not work as I would expect:

import asyncio
import time

async def eternity():
    """Sleep for one hour"""
    # await asyncio.sleep(3600) # <-- this would get cancelled
    time.sleep(3600)  # <-- this does not get cancelled
    print("done!")

async def main():
    # Wait for at most 4 seconds
    await asyncio.wait_for(eternity(), timeout=4.0)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())

And https://github.com/aio-libs/async-timeout is not the solution either; it also does not work.
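
For what it's worth, here is a minimal sketch (not what naz does here; just the usual workaround) of off-loading the blocking call to a thread executor so that the event loop can at least stop waiting after a timeout. The blocking call itself still runs to completion in its worker thread, and the interpreter waits for that thread at exit.

import asyncio
import time

def blocking_call():
    # stands in for any synchronous, blocking operation (e.g. a blocking socket read)
    time.sleep(10)
    return "done"

async def main():
    loop = asyncio.get_event_loop()
    # running the blocking call in a thread keeps the event loop responsive,
    # so we can stop waiting after 2 seconds even though the thread keeps running
    fut = loop.run_in_executor(None, blocking_call)
    done, _pending = await asyncio.wait({fut}, timeout=2.0)
    if not done:
        print("gave up waiting after 2 seconds")

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
# NB: the process still waits ~8 more seconds at exit for the worker thread to finish.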

komuw commented Jun 5, 2019

Well, it looks like it is not possible to time out blocking code from an async function; at least that is what the guy[1] who wrote most of Python's asyncio says - https://twitter.com/1st1/status/1136292366208372736

Screen Shot 2019-06-05 at 20 19 39

So maybe what we want to do is add a blocking code detector:
I have created #132 for that.

  1. https://www.python.org/dev/peps/pep-0492/

komuw commented Jun 6, 2019

With this change:

naz/naz/client.py

Lines 742 to 747 in 61d036b

self.writer: asyncio.streams.StreamWriter = writer
sock = self.writer.get_extra_info("socket")
sock.settimeout(self.connection_timeout)
# A socket object can be in one of three modes: blocking, non-blocking, or timeout.
# At the OS level, sockets in timeout mode are internally set in non-blocking mode.
# https://docs.python.org/3.6/library/socket.html#notes-on-socket-timeouts

naz seems to be able to withstand SMSC failures (when the SMSC goes down and then comes back up) without blocking forever. This is after testing for an hour or so. I'll leave it running for longer just to be sure.

However, naz is not able to tolerate the queue/broker going down and coming back up again.
This is true if the queue is using blocking code.
This can be solved in two ways, I think;

  1. adding a blocking code detector or
  2. having people use queues that are fully async (as they should). As an experiment with option two, we should take the queue at https://github.com/komuw/naz/blob/49c571b2d4c5b8a485817cbb45affcf75c74c454/benchmarks/redis_queue.py and make it fully async so that it can withstand failure without blocking (see the sketch below).
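
For reference, a minimal sketch of what such a fully-async queue could look like, assuming aioredis 1.x and the async enqueue/dequeue broker interface used elsewhere in this thread (names, address and defaults here are illustrative, not the benchmark's actual code):

import json
import asyncio
import aioredis  # assumes aioredis 1.x

class FullyAsyncRedisQueue:
    """Broker sketch that only makes non-blocking (awaitable) redis calls,
    so a redis outage cannot block naz's event loop."""

    def __init__(self, queue_name: str = "naz_benchmarks_queue") -> None:
        self.queue_name = queue_name
        self._redis = None

    async def _get_redis(self):
        if not self._redis:
            self._redis = await aioredis.create_redis_pool("redis://localhost:6379", timeout=8)
        return self._redis

    async def enqueue(self, item: dict) -> None:
        redis = await self._get_redis()
        await redis.lpush(self.queue_name, json.dumps(item))

    async def dequeue(self) -> dict:
        redis = await self._get_redis()
        while True:
            item = await redis.brpop(self.queue_name, timeout=3)
            if item:
                return json.loads(item[1])
            await asyncio.sleep(5)  # queue is empty; yield to the event loop and retry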

komuw commented Jun 6, 2019

Since asyncio.wait_for(timeout) does not work for synchronous blocking code, and synchronous blocking code is exactly what we had added asyncio.wait_for to guard against, there's no longer a need to have it. So we should remove it.

komuw commented Jun 6, 2019

Since asyncio.wait_for(timeout) does not work for synchronous blocking code, and synchronous blocking code is exactly what we had added asyncio.wait_for to guard against, there's no longer a need to have it. So we should remove it.

This was fixed in #134

As an experiment with option two, we should take the queue at https://github.com/komuw/naz/blob/49c571b2d4c5b8a485817cbb45affcf75c74c454/benchmarks/redis_queue.py and make it fully async so that it can withstand failure without blocking.

This was fixed in #134

komuw commented Jun 8, 2019

Issue raised while benchmarking:

{
    "timestamp": "2019-06-08 12:42:27,957",
    "event": "naz.Client.receive_data",
    "stage": "end",
    "full_pdu_data": b"\x00\x00\x00\x15\x80",
    "project": "naz_benchmarks",
    "smsc_host": "134.209.80.205",
    "system_id": "smppclient1",
    "client_id": "DRNTKRMTCJR97D9OO",
    "pid": 83197,
}

  File "/Users/home/mystuff/naz/naz/client.py", line 1838, in receive_data
    await self._parse_response_pdu(full_pdu_data)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/coroutines.py", line 110, in __next__
    return self.gen.send(None)
  File "/Users/home/mystuff/naz/naz/client.py", line 1860, in _parse_response_pdu
    command_id = struct.unpack(">I", command_id_header_data)[0]
struct.error: unpack requires a buffer of 4 bytes
{'timestamp': '2019-06-08 12:42:27,959', 'event': 'naz.cli.main', 'stage': 'end'}

also see: #135
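
One defensive option, sketched below, would be to check that we actually have a full 4-byte command_id field before unpacking it (this is illustrative and not necessarily the mitigation that was eventually adopted; see the next comment):

import struct
from typing import Optional

def parse_command_id(full_pdu_data: bytes) -> Optional[int]:
    """Hypothetical helper: the SMPP header is 16 bytes (command_length, command_id,
    command_status, sequence_number; 4 bytes each), so bytes 4..8 hold command_id.
    Return None if the pdu is too short, e.g. a read cut short by a disconnect."""
    command_id_header_data = full_pdu_data[4:8]
    if len(command_id_header_data) != 4:
        return None
    return struct.unpack(">I", command_id_header_data)[0]

print(parse_command_id(b"\x00\x00\x00\x15\x80"))  # the truncated pdu above -> None, not struct.error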

komuw commented Jun 8, 2019

for the issue above, we have added this to try and resolve it;

naz/naz/client.py

Lines 1815 to 1841 in 7ddce43

except (ConnectionError, asyncio.TimeoutError) as e:
    self._log(
        logging.ERROR,
        {
            "event": "naz.Client.receive_data",
            "stage": "end",
            "state": "unable to read from SMSC",
            "error": str(e),
        },
    )
    if self.SHOULD_SHUT_DOWN:
        return None
    _interval_ = 10.00
    self._log(
        logging.DEBUG,
        {
            "event": "naz.Client.receive_data",
            "stage": "end",
            "state": "unable to read from SMSC. sleeping for {0} minutes".format(
                _interval_ / 60
            ),
            "error": str(e),
        },
    )
    await asyncio.sleep(_interval_)
    continue  # important so that we do not hit the bug: issues/135

komuw commented Jun 8, 2019

Another issue raised while benchmarking;

{'timestamp': '2019-06-08 15:06:55,023', 'event': 'MyRedisQueue.dequeue', 'stage': 'start'}
{'timestamp': '2019-06-08 15:06:55,041', 'event': 'naz.Client.receive_data', 'stage': 'end', 'full_pdu_data': b'\x00\x00\x00*CSMPPSim\x00\x00\x00\x00\x15\x80\x00\x00\x04\x00\x00\x00\x00\x00\x00*B2887\x00\x00\x00\x00\x18\x80\x00\x00\t', 'project': 'naz_benchmarks', 'smsc_host': 'smsc_host', 'system_id': 'smppclient1', 'client_id': 'NVG1W95WUO4VOS5VQ', 'pid': 87991}
{'timestamp': '2019-06-08 15:06:55,041', 'event': 'naz.Client._parse_response_pdu', 'stage': 'start', 'project': 'naz_benchmarks', 'smsc_host': 'smsc_host', 'system_id': 'smppclient1', 'client_id': 'NVG1W95WUO4VOS5VQ', 'pid': 87991}
{'timestamp': '2019-06-08 15:06:55,041', 'event': 'naz.Client._parse_response_pdu', 'stage': 'end', 'log_id': '', 'state': 'command_id:1129532752 is unknown.', 'project': 'naz_benchmarks', 'smsc_host': 'smsc_host', 'system_id': 'smppclient1', 'client_id': 'NVG1W95WUO4VOS5VQ', 'pid': 87991}
NoneType: None
{'timestamp': '2019-06-08 15:06:55,041', 'event': 'naz.cli.main', 'stage': 'end', 'error': 'command_id:1129532752 is unknown.'}
Traceback (most recent call last):
  File "/Users/home/mystuff/naz/cli/cli.py", line 88, in main
    async_main(client=client, logger=logger, loop=loop, dry_run=dry_run)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/Users/home/mystuff/naz/cli/cli.py", line 115, in async_main
    await tasks
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/coroutines.py", line 126, in send
    return self.gen.send(value)
  File "/Users/home/mystuff/naz/naz/client.py", line 1855, in receive_data
    await self._parse_response_pdu(full_pdu_data)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/coroutines.py", line 110, in __next__
    return self.gen.send(None)
  File "/Users/home/mystuff/naz/naz/client.py", line 1892, in _parse_response_pdu
    raise ValueError("command_id:{0} is unknown.".format(command_id))
ValueError: command_id:1129532752 is unknown.
{'timestamp': '2019-06-08 15:06:55,042', 'event': 'naz.cli.main', 'stage': 'end'}

komuw commented Jun 8, 2019

for the issue above:

  1. we should not raise an exception, since that causes the cli app to exit; logging at error level is enough (see the sketch after this list);

    naz/naz/client.py

    Lines 1882 to 1892 in 7ddce43

    if not smpp_command:
        self._log(
            logging.ERROR,
            {
                "event": "naz.Client._parse_response_pdu",
                "stage": "end",
                "log_id": "",
                "state": "command_id:{0} is unknown.".format(command_id),
            },
        )
        raise ValueError("command_id:{0} is unknown.".format(command_id))

  2. we should try to fix that: if it is a valid command_id, we should add it (and any others).
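
A sketch of what point 1 could look like (a self-contained, illustrative stand-in for the real _parse_response_pdu code):

import logging

logger = logging.getLogger("naz.Client")

def on_unknown_command_id(command_id: int) -> None:
    # log the unknown command_id at ERROR level and discard the pdu,
    # instead of `raise ValueError(...)` which kills the whole naz-cli process
    logger.error(
        "event=naz.Client._parse_response_pdu stage=end state=command_id:%s is unknown.",
        command_id,
    )
    return None  # previously: raise ValueError("command_id:{0} is unknown.".format(command_id))

on_unknown_command_id(1129532752)  # the command_id seen in the log above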

komuw commented Jun 8, 2019

we should not raise an exception, since that causes the cli app to exit; logging at error level is enough
we should try to fix that: if it is a valid command_id, we should add it (and any others)

both were fixed in: 2059030

komuw commented Jun 8, 2019

New error:

{"timestamp": "2019-06-08 19:56:04,993", "event": "naz.Client.command_handlers", "stage": "start", "smpp_command": "generic_nack", "log_id": "", "error": "command_status:411041792 is unknown.", "project": "naz_benchmarks", "smsc_host": "host, "system_id": "smppclient1", "client_id": "EIZKC9IV9FAGC7JMK", "pid": 93228}
NoneType: None
{'timestamp': '2019-06-08 19:56:04,993', 'event': 'naz.cli.main', 'stage': 'end', 'error': "'NoneType' object has no attribute 'description'"}
Traceback (most recent call last):
  File "/Users/home/mystuff/naz/naz/client.py", line 2036, in command_handlers
    if commandStatus.value == SmppCommandStatus.ESME_ROK.value:
AttributeError: 'NoneType' object has no attribute 'value'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/home/mystuff/naz/cli/cli.py", line 88, in main
    async_main(client=client, logger=logger, loop=loop, dry_run=dry_run)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/Users/home/mystuff/naz/cli/cli.py", line 115, in async_main
    await tasks
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/coroutines.py", line 126, in send
    return self.gen.send(value)
  File "/Users/home/mystuff/naz/naz/client.py", line 1896, in receive_data
    await self._parse_response_pdu(full_pdu_data)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/coroutines.py", line 110, in __next__
    return self.gen.send(None)
  File "/Users/home/mystuff/naz/naz/client.py", line 1961, in _parse_response_pdu
    hook_metadata=hook_metadata,
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/coroutines.py", line 110, in __next__
    return self.gen.send(None)
  File "/Users/home/mystuff/naz/naz/client.py", line 2052, in command_handlers
    "state": commandStatus.description,
AttributeError: 'NoneType' object has no attribute 'description'
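
A sketch of the kind of guard that would avoid this (illustrative and self-contained; the real fix lives inside naz.Client.command_handlers):

def command_handlers(commandStatus) -> None:
    """commandStatus can be None when the SMSC sends a command_status naz does not recognise."""
    if commandStatus is None:
        # previously this fell through to `commandStatus.value` / `commandStatus.description`
        # and blew up with AttributeError, taking naz-cli down with it
        print("unknown command_status; log at ERROR level and ignore this pdu")
        return
    print("state:", commandStatus.description)

command_handlers(None)  # reproduces the situation from the traceback above, without crashing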

komuw commented Jun 9, 2019

The above error was fixed in:

komuw commented Jun 9, 2019

after running benchmarks for a while, naz hangs;
strace -Tyytv -p <naz-pid> # --stack-traces is also a good option for strace, # -f for follow-children is also great

strace: Process 22741 attached
12:10:00 write(2<pipe:[3236233]>, " File \"/usr/local/lib/python3.6/"..., 8192

commentary:
ssize_t write(int fd, const void *buf, size_t count);

write() writes up to count bytes from the buffer pointed to by buf
to the file referred to by the file descriptor fd.
- https://linux.die.net/man/2/write

komuw commented Jun 9, 2019

and looking at that process's fds;
ll /proc/<naz-pid>/fd

dr-x------ 2 root root  0 Jun  9 08:31 ./
dr-xr-xr-x 9 root root  0 Jun  9 08:31 ../
lrwx------ 1 root root 64 Jun  9 08:31 0 -> /dev/null
l-wx------ 1 root root 64 Jun  9 08:31 1 -> 'pipe:[3236232]'
l-wx------ 1 root root 64 Jun  9 08:31 2 -> 'pipe:[3236233]'
lrwx------ 1 root root 64 Jun  9 12:19 3 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jun  9 12:19 4 -> 'socket:[3237426]'
lrwx------ 1 root root 64 Jun  9 12:19 5 -> 'socket:[3237427]'
l-wx------ 1 root root 64 Jun  9 12:19 6 -> /usr/src/nazLog/naz_log_file
lrwx------ 1 root root 64 Jun  9 12:19 7 -> 'socket:[3316320]'

ll /proc/22741/fd/2 (the fd that naz hung on)

l-wx------ 1 root root 64 Jun  9 08:31 /proc/22741/fd/2 -> 'pipe:[3236233]'

So naz is hung on communication with the pipe numbered 3236233:
lsof -p 22741 | grep -i 3236233

naz-cli 22741 root    2w     FIFO   0,12      0t0 3236233 pipe
  • I do not know what that pipe is.
  • I do not know where that pipe is created
  • I do not know which process creates that pipe. (I have no place in my code where I'm calling mkfifo)

commentary:

/proc/[pid]/fd/
    This is a subdirectory containing one entry for each file
    which the process has open, named by its file descriptor, and
    which is a symbolic link to the actual file.  Thus, 0 is stan‐
    dard input, 1 standard output, 2 standard error, and so on.
- http://man7.org/linux/man-pages/man5/proc.5.html

From the above, and the fact that naz hung on 2 -> 'pipe:[3236233]', does it mean that:

  • naz somehow is using a pipe as its stderr?
  • it hangs on writing to stderr?

komuw commented Jun 9, 2019

After I restart naz and leave it running for a while (but before it hangs), the only fds it has are;

ll /proc/32059/fd/

dr-x------ 2 root root  0 Jun  9 12:51 ./
dr-xr-xr-x 9 root root  0 Jun  9 12:51 ../
lrwx------ 1 root root 64 Jun  9 12:51 0 -> /dev/pts/0
l-wx------ 1 root root 64 Jun  9 12:51 1 -> /dev/null
l-wx------ 1 root root 64 Jun  9 12:51 2 -> /dev/null
lrwx------ 1 root root 64 Jun  9 12:51 3 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jun  9 12:51 4 -> 'socket:[4908369]'
lrwx------ 1 root root 64 Jun  9 12:51 5 -> 'socket:[4908370]'
l-wx------ 1 root root 64 Jun  9 12:51 6 -> /usr/src/nazLog/naz_log_file
lrwx------ 1 root root 64 Jun  9 12:51 7 -> 'socket:[5064658]'
lrwx------ 1 root root 64 Jun  9 12:51 8 -> 'socket:[5033859]'

As we can see, none of them is a pipe. So how it later ends up talking to a pipe fd is still up in the air.

komuw commented Jun 9, 2019

I have started an strace session and will leave it running until the process hangs, so that we can see the trail of syscalls right up to the hang;

strace -Tyytvx -p <naz-pid> -o stracer.log # --stack-traces is also a good option for strace, # -f for follow-children is also great

komuw commented Jun 9, 2019

☝️ The above debug sessions were conducted on naz running directly on the machine (ie not in a container).

In order to debug naz running inside docker, I set up another docker container so that I can strace the docker container running naz:

FROM alpine
RUN apk update && apk add strace
CMD ["strace", "-Tyytvx", "-p", "1"]

docker build -t my_stracer .
docker run -t --pid=container:naz_cli \
  --net=container:naz_cli \
  --cap-add sys_admin \
  --cap-add sys_ptrace my_stracer

strace: Process 1 attached
18:55:23 write(2<pipe:[1915902]>, "{'timestamp': '2019-06-09 11:24:"..., 172

Commentary:
So naz is actually hanging while writing its logs; fd 2 (stderr) is, inside the container, a pipe.
The reason naz hangs is that a Linux pipe has to be read from the other end;
if nothing is reading the read end, the pipe buffer fills up and further writes block.
To prove this theory, I docker exec'ed into the naz container and ran
cat /proc/1/fd/2, ie I started reading from the read end of the pipe and voila; naz un-hung.
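
For context, a small self-contained demonstration of that mechanism (assuming Linux, where the default pipe capacity is 64 KiB):

import os

# writes to a pipe succeed until the kernel pipe buffer fills up;
# after that, the writer blocks until someone reads from the other end
read_fd, write_fd = os.pipe()
written = 0
chunk = b"x" * 4096
while written < 64 * 1024:
    written += os.write(write_fd, chunk)
print("wrote", written, "bytes without blocking")
# the next write would block forever, because nothing is reading read_fd:
# os.write(write_fd, chunk)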

komuw commented Jun 9, 2019

damn, strace is pretty cool. I now feel like Brendan Gregg.
I'll probably write a blogpost one of these days.

komuw added a commit that referenced this issue Jun 12, 2019
What:
- added `naz` benchmarks and related fixes that came from the benchmark runs
    - when smsc returns `ESME_RMSGQFUL`(message queue is full), `naz` will now call the `throttling` handler.  
    - fixed a potential memory leak bug in `naz.correlater.SimpleCorrelater`
    - added a configurable `connection_timeout` which is the duration that `naz` will wait, for connection related activities with SMSC, before timing out.
    - `naz` is now able to re-establish connection and re-bind if the connection between it and SMSC is disconnected.  

Why:
- Make `naz` more failure tolerant
- Fixes: #103
komuw commented Jun 12, 2019

benchmark results will come in a different PR.
