
Bad file descriptor #195

Closed
G4brym opened this issue Oct 1, 2018 · 31 comments
G4brym commented Oct 1, 2018

I have a ton of traffic hitting ~5 servers that connect to the same memcached server.

Memcached config:

      PORT="11211"
      USER="memcached"
      MAXCONN="10240"
      CACHESIZE="1024"
      OPTIONS=""
      LOGFILE="/var/log/memcached.log"

pymemcache client configuration:

    Client(('127.0.0.1', 11211),
           serializer=json_serializer,
           deserializer=json_deserializer,
           connect_timeout=3,
           timeout=1,
           ignore_exc=True,
           no_delay=True)

The problem is that I sometimes get OSError [Errno 9] Bad file descriptor on the servers.
The code is breaking in

pymemcache/client/base.py in _store_cmd at line 797

self.sock.sendall(b''.join(cmds))

And pymemcache/client/base.py in set at line 290

return self._store_cmd(b'set', {key: value}, expire, noreply)[key]

After the error was detected, I checked the servers and they still had around 4 GB of RAM free and plenty of disk space.

I switched from this library to python-memcached just to test, and I didn't get any errors, but the response time was slower than with pymemcache.

I'm only opening a connection to the memcached server when the server is starting; maybe I need to refresh the connection after a while?

Thanks

cgordon (Collaborator) commented Oct 1, 2018

I did some Googling and it looks like this error occurs when using a socket that has previously been closed. The pymemcache client is designed to close the socket when it gets errors and to re-open the socket on the next request, and that logic appears to still be there and correct. However, I did find one place where we could close a socket and fail to re-open it later:

    def _connect(self):
        sock = self.socket_module.socket(self.socket_module.AF_INET,
                                         self.socket_module.SOCK_STREAM)
        try:
            sock.settimeout(self.connect_timeout)
            sock.connect(self.server)
            sock.settimeout(self.timeout)
            if self.no_delay:
                sock.setsockopt(self.socket_module.IPPROTO_TCP,
                                self.socket_module.TCP_NODELAY, 1)
        except Exception:
            sock.close()
            raise
        self.sock = sock

I suspect that should be something like this instead:

    def _connect(self):
        sock = self.socket_module.socket(self.socket_module.AF_INET,
                                         self.socket_module.SOCK_STREAM)
        try:
            sock.settimeout(self.connect_timeout)
            sock.connect(self.server)
            sock.settimeout(self.timeout)
            if self.no_delay:
                sock.setsockopt(self.socket_module.IPPROTO_TCP,
                                self.socket_module.TCP_NODELAY, 1)
        except Exception:
            sock.close()
            ####### CHANGE HERE
            sock = None
            #######
            raise
        self.sock = sock

However, in this case, you should still be seeing at least one instance of a different exception before you start seeing errors from sendall. Can you confirm that you are seeing a different exception in the logs before that?

Otherwise, it would be helpful to know what methods you are calling on pymemcached and in what order, in case there is a subtle ordering bug in the library.

Thank you for the detailed bug report!

G4brym (Author) commented Oct 1, 2018

The other error I'm getting is AttributeError: 'NoneType' object has no attribute 'sendall', in
pymemcache/client/base.py in _store_cmd at line 797

self.sock.sendall(b''.join(cmds))

I'm only using get and set, but I just realized that I have surrounded get() with a try/except, which could be hiding some other errors. I did this because we have a fallback shared cache using Redis:

    def memcached_get(self, key):
        try:
            return self.memcache.get(key)
        except:
            return None

    def memcached_set(self, key, value, timeout=MEMCACHED_TIMEOUT):
        self.memcache.set(key, value, timeout=timeout)

Basically, the way we use pymemcache is:

  • call memcached get
  • if it returned None:
    • call Redis get, or just run the function to build the result
    • call memcached set
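That cache-aside flow can be sketched with plain dicts standing in for the memcached and Redis clients (all names here are illustrative stand-ins, not the actual application code):

```python
# Sketch of the lookup flow described above, with in-memory dicts
# playing the role of the pymemcache client and the Redis fallback.

MEMCACHE = {}  # stand-in for the local pymemcache client
REDIS = {}     # stand-in for the shared Redis fallback cache

def build_result(key):
    # Placeholder for the expensive work that produces the value.
    return "value-for-" + key

def cached_get(key):
    value = MEMCACHE.get(key)          # 1. try memcached first
    if value is None:
        value = REDIS.get(key)         # 2. fall back to shared Redis
        if value is None:
            value = build_result(key)  # 3. or rebuild the value
        MEMCACHE[key] = value          # 4. repopulate memcached
    return value
```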

cgordon (Collaborator) commented Oct 1, 2018

It would be very helpful to see any errors being thrown in your memcached_get function.

Just inspecting the code, I can't find a code path that would cause self.sock to be None on line 797 of that file. There is a check on line 793 for sock being None, and it calls _connect in that case. The _connect function is supposed to set self.sock or raise, and I can't figure out how it could return without raising or setting self.sock. @jogo, @nichochar, @jparise: do you see something I'm missing there?

jparise (Collaborator) commented Oct 1, 2018

My only thought is that _connect() is encountering an exception that doesn't inherit from Exception. Python also has BaseException, which is the base class of Exception, and some "special" exception types inherit directly from BaseException for "reasons". A good example is gevent.Timeout.

For this to be relevant here, sock = self.socket_module.socket(...) would need to return None. I would expect the following operations (within the try) to fail with AttributeError exceptions, however, and those do inherit from Exception, so I can't see how this would explain this case.
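As a stdlib-only illustration of that point, a BaseException subclass slips straight past an `except Exception` handler (the FakeTimeout class here is a hypothetical stand-in for something like gevent.Timeout):

```python
# BaseException subclasses (e.g. gevent.Timeout, KeyboardInterrupt,
# SystemExit) are NOT caught by `except Exception`.

class FakeTimeout(BaseException):
    """Stand-in for gevent.Timeout, which subclasses BaseException directly."""

caught_by_exception = False
caught_by_base = False

try:
    try:
        raise FakeTimeout()
    except Exception:
        caught_by_exception = True  # never runs: FakeTimeout isn't an Exception
except BaseException:
    caught_by_base = True           # the error actually lands here
```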

jparise (Collaborator) commented Oct 2, 2018

@G4brym, if you can provide any other details, we should be able to get to the bottom of this.

G4brym (Author) commented Oct 2, 2018

Sorry, I wasn't able to test whether any error is thrown by the get function; I will provide more information tomorrow.

G4brym (Author) commented Oct 4, 2018

After testing without the try/except on memcached_get, I didn't get any errors from the function, only the two I mentioned in the first post.
The bad file descriptor only starts to show up when a lot of traffic hits the server; the 5 servers are making more than 200 requests per second to memcached.

jparise (Collaborator) commented Oct 4, 2018

@G4brym are you using gevent or something other than the standard library socket module?

G4brym (Author) commented Oct 4, 2018

I'm only using these packages: gcc, python-setuptools, python-devel, postgresql-libs, postgresql-devel, plus the others included in the Amazon Linux AMI.

1st1 commented Oct 10, 2018

Is it a single-threaded application? I can see this scenario happening if one client object is used in more than one thread.

G4brym (Author) commented Oct 10, 2018

Each server has 5 threads, but memcached is only used locally, per server, so each memcached instance is serving 5 threads only.
I'm using a thread-safe function to build the client per thread:

    def build_memcache_conn(self):
        if getattr(self, '_memcache_inst', None) is None:
            self._memcache_inst = Client(('127.0.0.1', 11211),
                                         serializer=json_serializer,
                                         deserializer=json_deserializer,
                                         connect_timeout=3, timeout=1,
                                         ignore_exc=True, no_delay=True)
        return self._memcache_inst

Each thread calls this function once.

1st1 commented Oct 10, 2018

I suggest the following trick (it will require you to modify your pymemcache package locally):

  • Add self._ti = threading.get_ident() to Client.__init__()
  • Add if self._ti != threading.get_ident(): raise RuntimeError('wrong thread') to Client._store_cmd(), Client._fetch_cmd(), Client._misc_cmd(), and Client._connect()

If you don't observe a RuntimeError after doing that, we can be at least somewhat certain that this bug isn't caused by a race.
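A self-contained sketch of that guard, with a hypothetical DebugClient in place of the real pymemcache Client:

```python
import threading

class DebugClient:
    """Toy client that records the thread it was created on and refuses
    to run commands from any other thread (mirrors the trick above)."""

    def __init__(self):
        self._ti = threading.get_ident()

    def _check_thread(self):
        if self._ti != threading.get_ident():
            raise RuntimeError('wrong thread')

    def get(self, key):
        self._check_thread()
        return None  # a real client would talk to memcached here

# Calling from the creating thread is fine:
client = DebugClient()
client.get('k')

# Calling from another thread raises RuntimeError:
errors = []

def worker():
    try:
        client.get('k')
    except RuntimeError as exc:
        errors.append(str(exc))

t = threading.Thread(target=worker)
t.start()
t.join()
```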

G4brym (Author) commented Oct 10, 2018

I might only be able to give more feedback in 1 or 2 weeks, once I've tested everything and pushed to production.

Anakin-Hao commented Jan 3, 2019

Any update on this?
For my project, I've seen a similar issue running pymemcache in a Kubernetes cluster.

The error was reported on return sock.recv(size) in the _recv function.
I managed to reproduce it easily with heavy load on the same Python process.
I later switched to python-memcached as a workaround and never saw that error again.
But python-memcached is far slower and not as feature-rich as pymemcache.

Sorry I don't have any more detail to give; I had to get the project running quickly.
It would be good to see these socket issues fixed.

jparise (Collaborator) commented Jan 3, 2019

@Anakin-Hao Are you running in a multithreaded or concurrent environment?

jparise (Collaborator) commented Jan 3, 2019

I've tried to improve the safety of some of this code in #208, but I haven't uncovered the root cause of the problems noted above.

Anakin-Hao commented Jan 3, 2019

@jparise Yes, it's a client used by Flask, threaded.
The error was only reproducible on the kube cluster. The Docker image base is python:3.6.
Attempts to reproduce the error on Windows 10 Pro yielded no results, so I guess it has to do with low-level network code.

Anakin-Hao commented:

@jparise What env did you use to reproduce the error?
The first error I noticed was raised on a Kubernetes cluster in AWS that has a replica set of Flask pods and several memcached pods.
Later on I reproduced it in minikube with a Flask pod and a memcached pod.
I recall that trying to reproduce it in plain Docker yielded no results.

Hope that helps.

jparise (Collaborator) commented Jan 6, 2019

We have not been able to reproduce this issue.

Anakin-Hao commented:

@jparise What's your setup?

jparise (Collaborator) commented Jan 7, 2019

@Anakin-Hao and @G4brym, would you be able to try the current master code (663ecec) to see if it improves anything?

jparise (Collaborator) commented Jan 9, 2019

@Anakin-Hao and @G4brym, you can also try upgrading to the just-released pymemcache 2.1.0.

Anakin-Hao commented:

Still happens.

Env:

  • Python 3.6
  • pymemcache 2.1.0
  • minikube
  • 1 Flask pod, threaded
  • 1 memcached pod
  • Postman to send multiple requests
Session state is: 4018e84d-f8c2-4f11-a87a-1b64ec96094f
{"timestamp": 1546993733767, "level": 3, "message": "[Errno 9] Bad file descriptor"}
172.17.0.5 - - [09/Jan/2019 00:28:53] "GET /samplerealm/user/group/6780b62e-aebc-494b-9607-802dac426206 HTTP/1.1" 500 -
Traceback (most recent call last):
  File "/usr/src/app/albi/albi_oidc.py", line 312, in decorated
    raw_access_token = self._get_cached_access_token(session_state)
  File "/usr/src/app/albi/albi_oidc.py", line 1044, in _get_cached_access_token
    return self.cache.get(session_state + '.AT')
  File "/usr/local/lib/python3.6/site-packages/pymemcache/client/hash.py", line 314, in get
    return self._run_cmd('get', key, None, *args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/pymemcache/client/hash.py", line 291, in _run_cmd
    client, func, default_val, *args, **kwargs
  File "/usr/local/lib/python3.6/site-packages/pymemcache/client/hash.py", line 170, in _safely_run_func
    result = func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/pymemcache/client/base.py", line 450, in get
    return self._fetch_cmd(b'get', [key], False).get(key, default)
  File "/usr/local/lib/python3.6/site-packages/pymemcache/client/base.py", line 773, in _fetch_cmd
    key = remapped_keys[key]
KeyError: b'samplerealm_id'

Anakin-Hao commented:

@jparise By the way, while reading the code I noticed there's a "use_pooling" flag. I guess you expected no errors with use_pooling=False?

jparise (Collaborator) commented Jan 9, 2019

@Anakin-Hao I suspect your usage of the cache client isn't thread safe. To address that, you would need to protect access to self.cache with a mutex, or consider storing it in Flask's request-local state.

You could also try @1st1's earlier debugging suggestion: #195 (comment)
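The mutex option can be sketched as follows, with a FakeClient standing in for the real (non-thread-safe) pymemcache Client:

```python
import threading

class FakeClient:
    """Stand-in for pymemcache's Client, which is not thread safe."""
    def __init__(self):
        self._store = {}
    def get(self, key):
        return self._store.get(key)
    def set(self, key, value):
        self._store[key] = value

class LockedClient:
    """Serializes all access to one underlying client with a single mutex,
    so only one thread can touch the socket (here, the dict) at a time."""
    def __init__(self, client):
        self._client = client
        self._lock = threading.Lock()
    def get(self, key):
        with self._lock:
            return self._client.get(key)
    def set(self, key, value):
        with self._lock:
            self._client.set(key, value)

cache = LockedClient(FakeClient())
cache.set('k', 'v')
```

The trade-off is that the lock serializes all cache traffic; thread-local clients or a pool avoid that contention.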

Anakin-Hao commented:

@jparise Ok then. Thanks for the effort.

cgordon (Collaborator) commented Jan 9, 2019

I am also coming to the conclusion that this is a thread-safety issue. Client objects are not thread safe, so attempting to use the same Client object from multiple threads would definitely cause errors (but I'm not sure exactly what errors, since it would depend on timing/ordering).

Anakin-Hao commented:

@jparise On a side note, without reading the code: is pymemcache designed to leave thread safety to applications?
If so, are there any plans to add thread safety in the future?

jparise (Collaborator) commented Jan 9, 2019

Yes, applications are responsible for managing thread safety.

There aren't any plans to provide thread safety out of the box. It would require adding a layer of abstraction between the Client and the sockets (e.g. an intermediate Connection object managed by a thread-safe pool).
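The pool half of that idea can be sketched with the stdlib queue module (the factory argument is hypothetical; this is not pymemcache's implementation):

```python
import queue

class ConnectionPool:
    """Minimal thread-safe pool: each thread checks a connection out,
    uses it exclusively, then returns it. queue.Queue provides the
    locking, so callers never share a live connection."""

    def __init__(self, factory, size):
        self._conns = queue.Queue()
        for _ in range(size):
            self._conns.put(factory())

    def run(self, func):
        conn = self._conns.get()       # blocks until a connection is free
        try:
            return func(conn)
        finally:
            self._conns.put(conn)      # always return it to the pool

# Illustrative use with dicts as fake connections; a real factory would
# construct a pymemcache Client per pooled slot.
pool = ConnectionPool(factory=dict, size=2)
pool.run(lambda conn: conn.setdefault('k', 'v'))
```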

G4brym (Author) commented Jan 21, 2019

I refactored the pymemcache part of my application to comply with thread safety and I haven't seen the error again.

Thanks for the help

jparise (Collaborator) commented Jan 21, 2019

Closing now that the root cause is understood.
