
Bad file descriptor #195

Closed
G4brym opened this issue Oct 1, 2018 · 31 comments
G4brym commented Oct 1, 2018

I have a ton of traffic hitting ~5 servers that connect to the same memcached server.

Memcached config:

      PORT="11211"
      USER="memcached"
      MAXCONN="10240"
      CACHESIZE="1024"
      OPTIONS=""
      LOGFILE="/var/log/memcached.log"

pymemcache client configuration:

    Client(('127.0.0.1', 11211),
           serializer=json_serializer,
           deserializer=json_deserializer,
           connect_timeout=3,
           timeout=1,
           ignore_exc=True,
           no_delay=True)

The problem is that I sometimes get OSError [Errno 9] Bad file descriptor on the servers.
The code is breaking in

pymemcache/client/base.py in _store_cmd at line 797

self.sock.sendall(b''.join(cmds))

And pymemcache/client/base.py in set at line 290

return self._store_cmd(b'set', {key: value}, expire, noreply)[key]

After the error was detected, I checked the servers and they still had around 4 GB of RAM free and plenty of disk space.

I switched from this library to python-memcached just to test, and I didn't get any errors, but the response time was slower than with pymemcache.

I'm only opening a connection to the memcached server when the server is starting; maybe I need to refresh the connection after a while?

Thanks

cgordon (Collaborator) commented Oct 1, 2018

I did some Googling and it looks like this error occurs when using a socket that has previously been closed. The pymemcache client is designed to close the socket when it gets errors and to re-open the socket on the next request, and that logic appears to still be there and correct. However, I did find one place where we could close a socket and fail to re-open it later:

    def _connect(self):
        sock = self.socket_module.socket(self.socket_module.AF_INET,
                                         self.socket_module.SOCK_STREAM)
        try:
            sock.settimeout(self.connect_timeout)
            sock.connect(self.server)
            sock.settimeout(self.timeout)
            if self.no_delay:
                sock.setsockopt(self.socket_module.IPPROTO_TCP,
                                self.socket_module.TCP_NODELAY, 1)
        except Exception:
            sock.close()
            raise
        self.sock = sock

I suspect that should be something like this instead:

    def _connect(self):
        sock = self.socket_module.socket(self.socket_module.AF_INET,
                                         self.socket_module.SOCK_STREAM)
        try:
            sock.settimeout(self.connect_timeout)
            sock.connect(self.server)
            sock.settimeout(self.timeout)
            if self.no_delay:
                sock.setsockopt(self.socket_module.IPPROTO_TCP,
                                self.socket_module.TCP_NODELAY, 1)
        except Exception:
            sock.close()
            ####### CHANGE HERE
            sock = None
            #######
            raise
        self.sock = sock

However, in this case, you should still be seeing at least one instance of a different exception before you start seeing errors from sendall. Can you confirm that you are seeing a different exception in the logs before that?

Otherwise, it would be helpful to know what methods you are calling on pymemcached and in what order, in case there is a subtle ordering bug in the library.

Thank you for the detailed bug report!

G4brym (Author) commented Oct 1, 2018

The other error I'm getting is AttributeError: 'NoneType' object has no attribute 'sendall', in
pymemcache/client/base.py in _store_cmd at line 797

self.sock.sendall(b''.join(cmds))

I'm only using get and set, but I just realized that I have surrounded get() with a try/except, which could be hiding some other errors. I did this because we have a fallback shared cache using Redis:

    def memcached_get(self, key):
        try:
            return self.memcache.get(key)
        except:
            return None

    def memcached_set(self, key, value, timeout=MEMCACHED_TIMEOUT):
        self.memcache.set(key, value, timeout=timeout)

Basically, the way we use pymemcache is:

  • call memcached get
  • if it returned None:
    • call Redis get, or just run the function to build the result
    • call memcached set
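That cache-aside flow can be sketched with plain dicts standing in for the memcached and Redis clients (all names here are illustrative stand-ins, not the actual application code):

```python
# Sketch of the lookup flow described above, with in-memory dicts
# playing the role of the pymemcache client and the Redis fallback.

MEMCACHE = {}  # stand-in for the local pymemcache client
REDIS = {}     # stand-in for the shared Redis fallback cache

def build_result(key):
    # Placeholder for the expensive work that produces the value.
    return "value-for-" + key

def cached_get(key):
    value = MEMCACHE.get(key)          # 1. try memcached first
    if value is None:
        value = REDIS.get(key)         # 2. fall back to shared Redis
        if value is None:
            value = build_result(key)  # 3. or rebuild the value
        MEMCACHE[key] = value          # 4. repopulate memcached
    return value
```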

cgordon (Collaborator) commented Oct 1, 2018

It would be very helpful to see any errors being thrown in your memcached_get function.

Just inspecting the code, I can't find a code path that would cause self.sock to be None on line 797 of that file. There is a check on line 793 for sock being None, and it calls _connect in that case. The _connect function is supposed to set self.sock or raise, and I can't figure out how it could return without raising or setting self.sock. @jogo, @nichochar, @jparise: do you see something I'm missing there?

jparise (Collaborator) commented Oct 1, 2018

My only thought is that _connect() is encountering an exception that doesn't inherit from Exception. Python also has BaseException, which is the base class of Exception, and some "special" exception types inherit directly from BaseException for "reasons". A good example is gevent.Timeout.

For this to be relevant here, sock = self.socket_module.socket(...) would need to return None. I would expect the following operations (within the try) to fail with AttributeError exceptions, however, and those do inherit from Exception, so I can't see how this would explain this case.
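As a stdlib-only illustration of that point, a BaseException subclass slips straight past an `except Exception` handler (the FakeTimeout class here is a hypothetical stand-in for something like gevent.Timeout):

```python
# BaseException subclasses (e.g. gevent.Timeout, KeyboardInterrupt,
# SystemExit) are NOT caught by `except Exception`.

class FakeTimeout(BaseException):
    """Stand-in for gevent.Timeout, which subclasses BaseException directly."""

caught_by_exception = False
caught_by_base = False

try:
    try:
        raise FakeTimeout()
    except Exception:
        caught_by_exception = True  # never runs: FakeTimeout isn't an Exception
except BaseException:
    caught_by_base = True           # the error actually lands here
```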

jparise (Collaborator) commented Oct 2, 2018

@G4brym, if you can provide any other details, we should be able to get to the bottom of this.

G4brym (Author) commented Oct 2, 2018

Sorry, I wasn't able to test whether any error is thrown by the get function; I will provide more information tomorrow.

G4brym (Author) commented Oct 4, 2018

After testing without the try/except on memcached_get, I didn't get any errors from the function, only the two I mentioned in the first post.
The bad file descriptor only starts to show up when a lot of traffic hits the server; the 5 servers are making more than 200 requests per second to memcached.

jparise (Collaborator) commented Oct 4, 2018

@G4brym are you using gevent or something other than the standard library socket module?

G4brym (Author) commented Oct 4, 2018

I'm only using these packages: gcc, python-setuptools, python-devel, postgresql-libs, postgresql-devel, plus the others included in the Amazon Linux AMI.

1st1 commented Oct 10, 2018

Is it a single-threaded application? I can see this scenario happening if one client object is used in more than one thread.

G4brym (Author) commented Oct 10, 2018

Each server has 5 threads, but memcached is only used locally, per server, so each memcached instance is serving 5 threads only.
I'm using a thread-safe function to build the client per thread:

    def build_memcache_conn(self):
        if getattr(self, '_memcache_inst', None) is None:
            self._memcache_inst = Client(('127.0.0.1', 11211),
                                         serializer=json_serializer,
                                         deserializer=json_deserializer,
                                         connect_timeout=3, timeout=1,
                                         ignore_exc=True, no_delay=True)
        return self._memcache_inst

Each thread calls this function once.

1st1 commented Oct 10, 2018

I suggest the following trick (it will require you to modify your pymemcache package locally):

  • Add self._ti = threading.get_ident() to Client.__init__()
  • Add if self._ti != threading.get_ident(): raise RuntimeError('wrong thread') to Client._store_cmd(), Client._fetch_cmd(), Client._misc_cmd(), and Client._connect()

If you don't observe a RuntimeError after doing that, we can be at least somewhat certain that this bug isn't caused by a race.
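A self-contained sketch of that guard, with a hypothetical DebugClient in place of the real pymemcache Client:

```python
import threading

class DebugClient:
    """Toy client that records the thread it was created on and refuses
    to run commands from any other thread (mirrors the trick above)."""

    def __init__(self):
        self._ti = threading.get_ident()

    def _check_thread(self):
        if self._ti != threading.get_ident():
            raise RuntimeError('wrong thread')

    def get(self, key):
        self._check_thread()
        return None  # a real client would talk to memcached here

# Calling from the creating thread is fine:
client = DebugClient()
client.get('k')

# Calling from another thread raises RuntimeError:
errors = []

def worker():
    try:
        client.get('k')
    except RuntimeError as exc:
        errors.append(str(exc))

t = threading.Thread(target=worker)
t.start()
t.join()
```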

G4brym (Author) commented Oct 10, 2018

I might only be able to give more feedback in 1 or 2 weeks, once I've tested everything and pushed to production.

Anakin-Hao commented Jan 3, 2019

Any update on this?
For my project, I've seen a similar issue running pymemcache in a Kubernetes cluster.

The error was reported on return sock.recv(size) in the _recv function.
I managed to reproduce it easily with heavy load on the same Python process.
I later switched to python-memcached as a workaround and never saw that error again.
But python-memcached is far slower and not as feature-rich as pymemcache.

Sorry I don't have any more detail to give; I had to get the project running quickly.
It would be good to see these socket issues fixed.

jparise (Collaborator) commented Jan 3, 2019

@Anakin-Hao Are you running in a multithreaded or concurrent environment?

jparise (Collaborator) commented Jan 3, 2019

I've tried to improve the safety of some of this code in #208, but I haven't uncovered the root cause of the problems noted above.

Anakin-Hao commented Jan 3, 2019

@jparise Yes, it's a client used by Flask, threaded.
The error was only reproducible on the kube cluster. The Docker image base is python:3.6.
Attempts to reproduce the error on Windows 10 Pro yielded no results, so I guess it has to do with low-level network code.

Anakin-Hao commented:

@jparise What env did you use to reproduce the error?
The first error I noticed was raised on a Kubernetes cluster in AWS that has a replica set of Flask pods and several memcached pods.
Later on I reproduced it in minikube with a Flask pod and a memcached pod.
I recall that trying to reproduce it in plain Docker yielded no results.

Hope that helps.

jparise (Collaborator) commented Jan 6, 2019

We have not been able to reproduce this issue.

Anakin-Hao commented:

@jparise What's your setup?

jparise (Collaborator) commented Jan 7, 2019

@Anakin-Hao and @G4brym, would you be able to try the current master code (663ecec) to see if it improves anything?

jparise (Collaborator) commented Jan 9, 2019

@Anakin-Hao and @G4brym, you can also try upgrading to the just-released pymemcache 2.1.0.

Anakin-Hao commented:

Still happens.

Env:

  • Python 3.6
  • pymemcache 2.1.0
  • minikube
  • 1 Flask pod, threaded
  • 1 memcached pod
  • Postman to send multiple requests
Session state is: 4018e84d-f8c2-4f11-a87a-1b64ec96094f
{"timestamp": 1546993733767, "level": 3, "message": "[Errno 9] Bad file descriptor"}
172.17.0.5 - - [09/Jan/2019 00:28:53] "GET /samplerealm/user/group/6780b62e-aebc-494b-9607-802dac426206 HTTP/1.1" 500 -
Traceback (most recent call last):
  File "/usr/src/app/albi/albi_oidc.py", line 312, in decorated
    raw_access_token = self._get_cached_access_token(session_state)
  File "/usr/src/app/albi/albi_oidc.py", line 1044, in _get_cached_access_token
    return self.cache.get(session_state + '.AT')
  File "/usr/local/lib/python3.6/site-packages/pymemcache/client/hash.py", line 314, in get
    return self._run_cmd('get', key, None, *args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/pymemcache/client/hash.py", line 291, in _run_cmd
    client, func, default_val, *args, **kwargs
  File "/usr/local/lib/python3.6/site-packages/pymemcache/client/hash.py", line 170, in _safely_run_func
    result = func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/pymemcache/client/base.py", line 450, in get
    return self._fetch_cmd(b'get', [key], False).get(key, default)
  File "/usr/local/lib/python3.6/site-packages/pymemcache/client/base.py", line 773, in _fetch_cmd
    key = remapped_keys[key]
KeyError: b'samplerealm_id'

Anakin-Hao commented:

@jparise By the way, while reading the code I noticed there's a "use_pooling" flag. I guess you expected no errors with use_pooling=False?

jparise (Collaborator) commented Jan 9, 2019

@Anakin-Hao I suspect your usage of the cache client isn't thread safe. To address that, you would need to protect access to self.cache with a mutex, or consider storing it in Flask's request-local state.

You could also try @1st1's earlier debugging suggestion: #195 (comment)
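The mutex option can be sketched as follows, with a FakeClient standing in for the real (non-thread-safe) pymemcache Client:

```python
import threading

class FakeClient:
    """Stand-in for pymemcache's Client, which is not thread safe."""
    def __init__(self):
        self._store = {}
    def get(self, key):
        return self._store.get(key)
    def set(self, key, value):
        self._store[key] = value

class LockedClient:
    """Serializes all access to one underlying client with a single mutex,
    so only one thread can touch the socket (here, the dict) at a time."""
    def __init__(self, client):
        self._client = client
        self._lock = threading.Lock()
    def get(self, key):
        with self._lock:
            return self._client.get(key)
    def set(self, key, value):
        with self._lock:
            self._client.set(key, value)

cache = LockedClient(FakeClient())
cache.set('k', 'v')
```

The trade-off is that the lock serializes all cache traffic; thread-local clients or a pool avoid that contention.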

Anakin-Hao commented:

@jparise Ok then. Thanks for the effort.

cgordon (Collaborator) commented Jan 9, 2019

I am also coming to the conclusion that this is a thread-safety issue. Client objects are not thread safe, so attempting to use the same Client object from multiple threads would definitely cause errors (but I'm not sure exactly what errors, since it would depend on timing/ordering).

Anakin-Hao commented:

@jparise On a side note, without reading the code: is pymemcache designed to leave thread safety to applications?
If so, are there any plans to add thread safety in the future?

jparise (Collaborator) commented Jan 9, 2019

Yes, applications are responsible for managing thread safety.

There aren't any plans to provide thread safety out of the box. It would require adding a layer of abstraction between the Client and the sockets (e.g. an intermediate Connection object managed by a thread-safe pool).
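The pool half of that idea can be sketched with the stdlib queue module (the factory argument is hypothetical; this is not pymemcache's implementation):

```python
import queue

class ConnectionPool:
    """Minimal thread-safe pool: each thread checks a connection out,
    uses it exclusively, then returns it. queue.Queue provides the
    locking, so callers never share a live connection."""

    def __init__(self, factory, size):
        self._conns = queue.Queue()
        for _ in range(size):
            self._conns.put(factory())

    def run(self, func):
        conn = self._conns.get()       # blocks until a connection is free
        try:
            return func(conn)
        finally:
            self._conns.put(conn)      # always return it to the pool

# Illustrative use with dicts as fake connections; a real factory would
# construct a pymemcache Client per pooled slot.
pool = ConnectionPool(factory=dict, size=2)
pool.run(lambda conn: conn.setdefault('k', 'v'))
```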

G4brym (Author) commented Jan 21, 2019

I refactored the pymemcache part of my application to comply with thread safety and I haven't seen the error again.

Thanks for the help

jparise (Collaborator) commented Jan 21, 2019

Closing now that the root cause is understood.
