Limited number of primary segments on one machine #6176

Closed
leskin-in opened this Issue Nov 7, 2018 · 12 comments

Comments

6 participants
@leskin-in
Contributor

leskin-in commented Nov 7, 2018

Summary

Greenplum Database does not allow more than 44 primary segments to run simultaneously on one machine.

Description

Hereinafter, the term "machine" denotes an operating system environment. All GPDB segments are considered to belong to one GPDB cluster.

The current implementation of gpMgmt/bin/gppylib/gpsubprocess.py limits the number of segments that can run on one machine at the same time.

Part of the gpsegstart output:

[ERROR]:-filedescriptor out of range in select()
Traceback (most recent call last):
  File "/usr/lib/gpdb/lib/python/gppylib/commands/base.py", line 243, in run
    self.cmd.run()
  File "/usr/lib/gpdb/lib/python/gppylib/commands/base.py", line 711, in run
    self.exec_context.execute(self)
  File "/usr/lib/gpdb/lib/python/gppylib/commands/base.py", line 436, in execute
    (rc, stdout_value, stderr_value) = self.proc.communicate2(input=self.stdin)
  File "/usr/lib/gpdb/lib/python/gppylib/gpsubprocess.py", line 67, in communicate2
    self._read_files(timeout,output,error)
  File "/usr/lib/gpdb/lib/python/gppylib/gpsubprocess.py", line 122, in _read_files
    (rset,wset,eset) = self.__select(readList,writeList, errorList, timeout)
  File "/usr/lib/gpdb/lib/python/gppylib/gpsubprocess.py", line 216, in __select
    return select.select(iwtd, owtd, ewtd, timeout)
ValueError: filedescriptor out of range in select()

In this example, 90 segments were launched on one machine. This requires 4005 connections to be opened on the same machine for the GPDB segments to communicate, while a select.select() call supports no more than 1024 file descriptors.

946 is the maximum number of connections that can be served by select.select() here, which makes 44 the limit on the number of GPDB primary segments on one machine.
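For reference, the counts above follow from treating the interconnect as a full mesh, one connection per pair of segments (a quick standalone sketch, not GPDB code):

```python
def pairwise_connections(n):
    # Full mesh: one connection per unordered pair of segments
    return n * (n - 1) // 2

print(pairwise_connections(90))  # 4005
print(pairwise_connections(44))  # 946
```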

Proposed solution

Replace the select.select() call with select.poll(). This removes the cap, allowing an effectively unlimited number of primary segments to run simultaneously on one machine.
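A minimal sketch of the idea (the helper name poll_select and its event handling are illustrative, not the actual gpsubprocess.py patch): select.poll() has no FD_SETSIZE ceiling, so descriptors numbered 1024 and above no longer raise ValueError.

```python
import os
import select

def poll_select(read_fds, write_fds, timeout):
    # Illustrative replacement for the select.select() call in
    # gpsubprocess.py's __select(): poll() is not limited by FD_SETSIZE.
    poller = select.poll()
    for fd in read_fds:
        poller.register(fd, select.POLLIN)
    for fd in write_fds:
        poller.register(fd, select.POLLOUT)
    rset, wset = [], []
    # poll() takes milliseconds, select() takes seconds
    ms = None if timeout is None else int(timeout * 1000)
    for fd, event in poller.poll(ms):
        if event & (select.POLLIN | select.POLLHUP):
            rset.append(fd)
        if event & select.POLLOUT:
            wset.append(fd)
    return rset, wset

r, w = os.pipe()
os.write(w, b"x")
print(poll_select([r], [w], 1))  # pipe end r is readable, w is writable
```

A production version would also need to merge event masks for a descriptor that appears in more than one list and translate error semantics; the sketch skips that.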

@acmnu


acmnu commented Nov 7, 2018

That is a real issue on big machines, such as IBM Power-based ones. We are in the process of porting GPDB to Power, and this is one of the big issues.

@d

Member

d commented Nov 15, 2018

While you have the most context here, can you shed some light on why gpsubprocess.py is attempting to poll on one connection per pair of primaries?

@jchampio

Member

jchampio commented Nov 19, 2018

@leskin-in

poll() is definitely a better API than select(), but there's something else going on here. We kicked off a 50-segment (primary-only) demo cluster on a MacBook without any issues, and I have a 200-segment (100 primary-mirror pairs) single-host cluster running here on Ubuntu. Both machines have an FD_SETSIZE of 1024.

The __select() function you have modified only ever uses two file descriptors at once, as far as I can tell, so by the time you reach that code, something else has opened and held onto over a thousand file descriptors. That, to me, is the root cause, and switching to poll() is only going to cover that up.

Can you share how you reproduced this failure?

@kapustor


kapustor commented Nov 20, 2018

Hi @jchampio!

That's how we got it. We are using huge OpenPower machines with 120 primary and 120 mirror segments on each machine:

[gpadmin@mdw ~]$ gpstart -m
20181120:12:32:07:070908 gpstart:mdw:gpadmin-[INFO]:-Starting gpstart with args: -m
20181120:12:32:07:070908 gpstart:mdw:gpadmin-[INFO]:-Gathering information and validating the environment...
20181120:12:32:07:070908 gpstart:mdw:gpadmin-[INFO]:-Greenplum Binary Version: 'postgres (Greenplum Database) 5.11.0 build 5.11.0_arenadata3-205.el7'
20181120:12:32:08:070908 gpstart:mdw:gpadmin-[INFO]:-Greenplum Catalog Version: '301705051'
20181120:12:32:08:070908 gpstart:mdw:gpadmin-[INFO]:-Master-only start requested in configuration without a standby master.

Continue with master-only startup Yy|Nn (default=N):
> y
20181120:12:32:10:070908 gpstart:mdw:gpadmin-[INFO]:-Starting Master instance in admin mode
20181120:12:32:12:070908 gpstart:mdw:gpadmin-[INFO]:-Obtaining Greenplum Master catalog information
20181120:12:32:12:070908 gpstart:mdw:gpadmin-[INFO]:-Obtaining Segment details from master...
20181120:12:32:12:070908 gpstart:mdw:gpadmin-[INFO]:-Setting new master era
20181120:12:32:12:070908 gpstart:mdw:gpadmin-[INFO]:-Master Started...
[gpadmin@mdw ~]$ PGOPTIONS='-c gp_session_role=utility' psql -d postgres
psql (8.3.23)
Type "help" for help.

postgres=# select count(1) from gp_segment_configuration where role='p';
 count
-------
   241
(1 row)

postgres=# select distinct hostname from gp_segment_configuration;
 hostname
----------
 sdw1
 mdw
 sdw2
(3 rows)

postgres=# \q
[gpadmin@mdw ~]$ uname -a
Linux mdw 3.10.0-862.11.6.el7.ppc64le #1 SMP Tue Aug 14 20:51:52 GMT 2018 ppc64le ppc64le ppc64le GNU/Linux
[gpadmin@mdw ~]$ gpssh -f arenadata_configs/arenadata_all_hosts.hosts
=> free -g
[sdw2]               total        used        free      shared  buff/cache   available
[sdw2] Mem:           2040          23        1980           0          35        2008
[sdw2] Swap:             3           0           3
[ mdw]               total        used        free      shared  buff/cache   available
[ mdw] Mem:            506           8         473           0          25         494
[ mdw] Swap:             3           0           3
[sdw1]               total        used        free      shared  buff/cache   available
[sdw1] Mem:           2040          23        1980           0          35        2007
[sdw1] Swap:             3           0           3
=> cat /proc/cpuinfo |grep processor |wc -l
[sdw2] 384
[ mdw] 192
[sdw1] 384
=> ulimit -n
[ mdw] 655350
[sdw2] 655350
[sdw1] 655350
=> cat /etc/redhat-release
[ mdw] CentOS Linux release 7.5.1804 (AltArch)
[sdw2] CentOS Linux release 7.5.1804 (AltArch)
[sdw1] CentOS Linux release 7.5.1804 (AltArch)

When using default gpsubprocess.py with select():

[gpadmin@mdw ~]$ gpstart -a
...
#gives really a lot of errors like this:
20181120:13:10:05:161744 gpsegstart.py_sdw1:gpadmin:sdw1:gpadmin-[ERROR]:-filedescriptor out of range in select()
Traceback (most recent call last):
  File "/usr/lib/gpdb/lib/python/gppylib/commands/base.py", line 243, in run
    self.cmd.run()
  File "/usr/lib/gpdb/lib/python/gppylib/commands/base.py", line 711, in run
    self.exec_context.execute(self)
  File "/usr/lib/gpdb/lib/python/gppylib/commands/base.py", line 436, in execute
    (rc, stdout_value, stderr_value) = self.proc.communicate2(input=self.stdin)
  File "/usr/lib/gpdb/lib/python/gppylib/gpsubprocess.py", line 67, in communicate2
    self._read_files(timeout,output,error)
  File "/usr/lib/gpdb/lib/python/gppylib/gpsubprocess.py", line 122, in _read_files
    (rset,wset,eset) = self.__select(readList,writeList, errorList, timeout)
  File "/usr/lib/gpdb/lib/python/gppylib/gpsubprocess.py", line 216, in __select
    return select.select(iwtd, owtd, ewtd, timeout)
ValueError: filedescriptor out of range in select()
20181120:13:10:14:161744 gpsegstart.py_sdw1:gpadmin:sdw1:gpadmin-[CRITICAL]:-gpsegstart.py failed. (Reason=''NoneType' object has no attribute 'rc'') exiting...


Starting with poll() produces no errors and completes fine.

@kapustor


kapustor commented Nov 20, 2018

@jchampio BTW, you said you tried only a cluster with 100 primary segments, while we used 120 primaries and 120 mirrors. Can you please try 120 primaries per host?

@jchampio

Member

jchampio commented Nov 20, 2018

Hi @kapustor,

Upping the primary count to 120 didn't help, but switching from the master branch to 5X_STABLE did, for some reason. Now that I can reproduce it, the primary culprit is this line in gpsegstart.py:

        # initialize state
        #
        self.pool                  = base.WorkerPool(numWorkers=len(dblist))

The WorkerPool is supposed to be limiting the maximum number of parallel processes that get started at once, but len(dblist) doesn't limit it at all -- it's the number of segments per host, which in this case is 240. (With three pipes, at two file descriptors each, created as part of every subprocess.Popen object, that's 1,440 file descriptors that are dumped on the process.)

If you change this line to assign a more reasonable numWorkers -- say, 32 or 64 -- does gpstart start working for you?
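The descriptor pressure described above can be observed directly. This standalone sketch (not GPDB code; it assumes a Unix-like system with /dev/fd) counts the parent-side pipe ends each subprocess.Popen holds open:

```python
import os
import subprocess

def open_fd_count():
    # Count this process's open file descriptors via /dev/fd
    return len(os.listdir("/dev/fd"))

before = open_fd_count()
procs = [
    subprocess.Popen(
        ["sleep", "5"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    for _ in range(10)
]
after = open_fd_count()
# Each Popen with three PIPEs keeps three parent-side pipe ends open
print(after - before)  # 30 for 10 child processes

for p in procs:
    p.kill()
    p.wait()
```

Scaled to 240 workers, that is 720 descriptors held steadily, plus the child-side ends open transiently during each spawn, which is how a 1024-entry fd_set overflows.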

@kapustor


kapustor commented Nov 21, 2018

@jchampio

Changing numWorkers to 64 helped, but the DB start took much longer: 11 minutes vs. 2 minutes with poll().
It seems to me that replacing select() with poll() is more correct.

skahler-pivotal added this to Incoming in Cluster Management Nov 21, 2018

@jchampio

Member

jchampio commented Nov 22, 2018

It seems to me that replacing select() with poll() is more correct.

If the goal is to open up maximum parallelism on platforms that can handle it, then yes. If the goal is to enable large numbers of segments without unnecessary resource exhaustion/bottlenecking, I don't think switching to poll() necessarily helps us there.

Changing numWorkers to 64 helped, but the DB start took much longer: 11 minutes vs. 2 minutes with poll().

This is really strange. Your current number of workers per host is 120, so artificially limiting the parallelism there should have only given you a factor-of-2 difference. Is select() really that much less performant on your particular platform, or is there yet another confounding factor at work here?

(The U.S. is about to go on a holiday weekend for Thanksgiving, so I'll be able to follow up next week. Thanks for investigating this with us!)

@kapustor


kapustor commented Nov 22, 2018

This is really strange. Your current number of workers per host is 120, so artificially limiting the parallelism there should have only given you a factor-of-2 difference.

I was thinking the same, but it is what it is :)

Select() with 64 workers:

20181122:10:44:30:168108 gpstart:mdw:gpadmin-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait...
............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
20181122:10:55:55:168108 gpstart:mdw:gpadmin-[INFO]:-Process results...

Is select() really that much less performant on your particular platform, or is there yet another confounding factor at work here?

I am not sure. Maybe @leskin-in knows?

@leskin-in

Contributor Author

leskin-in commented Nov 22, 2018

@kapustor,
I suppose poll() and select() are almost equally performant in this case, and there should be another factor biasing the result. There is an article that describes the implementation of both calls on our architecture in detail.

@kapustor


kapustor commented Nov 23, 2018

Hi @jchampio

I found an issue in my DB startup test: the results were affected by other processes.
I repeated the test cleanly, and this is what I got:

Starting with poll() and default numworkers took 1:34:

20181123:15:11:37:022241 gpstart:mdw:gpadmin-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait...
..............................................................................................
20181123:15:13:11:022241 gpstart:mdw:gpadmin-[INFO]:-Process results...

Starting with select() and 64 numworkers took 1:26:

20181123:15:19:17:023166 gpstart:mdw:gpadmin-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait...
.....................................................................................
20181123:15:20:43:023166 gpstart:mdw:gpadmin-[INFO]:-Process results...

So, select() with 64 numWorkers works even faster than poll() with the default numWorkers.

What shall we do now? Implement a new parameter for numWorkers, or set it to a low default value in a new PR?
Or is our original PR with select() => poll() OK now that we know the reasons?

@jchampio

Member

jchampio commented Nov 27, 2018

@kapustor

So, select() with 64 numWorkers works even faster than poll() with the default numWorkers.

Ah, great to know. Thanks!

What shall we do now? Implement a new parameter for numWorkers, or set it to a low default value in a new PR?
Or is our original PR with select() => poll() OK now that we know the reasons?

IMO it makes sense to cap the WorkerPool limit at some reasonable maximum for now. If you're really interested in wiring that through as a new parameter, go for it, but I don't think anyone has asked for that yet; that's something we can prioritize ourselves as a "feature" later, if needed. I also don't think we want to go for poll() right now, especially if it's not clear whether there's actually a performance improvement.
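A capped pool size along these lines might look like the following sketch (MAX_WORKERS and capped_workers are illustrative names and values from this thread, not the actual gpsegstart.py change):

```python
# Illustrative cap; 64 is the value that worked in this thread,
# not a committed default.
MAX_WORKERS = 64

def capped_workers(dblist):
    # One worker per segment to start, but never more than the cap,
    # keeping the pool's pipe file descriptors well under FD_SETSIZE.
    return min(len(dblist), MAX_WORKERS)

print(capped_workers(range(240)))  # 64
print(capped_workers(range(8)))    # 8
```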

leskin-in added a commit to arenadata/gpdb that referenced this issue Dec 3, 2018

Limit the number of workers in a pool of gpstart
Put a limit on the number of workers in a pool created by gpstart.
When there are many segments on one server node, an unbounded number of workers may open an unbounded number of pipes. This may lead to a failure of the 'select()' call in 'gpMgmt/bin/gppylib/gpsubprocess.py' and an inability to start the cluster at all.
This commit prevents the described problem from happening.

Closes greenplum-db#6176

leskin-in added a commit to arenadata/gpdb that referenced this issue Dec 4, 2018

Limit the number of workers in a pool of gpstart
Put a limit on the number of workers in a pool created by gpstart.
When there are many segments on one server node, an unbounded number of workers may open an unbounded number of pipes. This may lead to a failure of the 'select()' call in 'gpMgmt/bin/gppylib/gpsubprocess.py' and an inability to start the cluster at all.
This commit prevents the described problem:

* By default, gpstart now uses no more than 128 workers
* The user may change this behaviour by setting the '--numworkers' gpstart parameter to a larger value

Closes greenplum-db#6176

leskin-in added a commit to arenadata/gpdb that referenced this issue Dec 4, 2018

Limit the number of workers in a pool of gpstart
Put a limit on the number of workers in a pool created by gpstart.
When there are many segments on one server node, an unbounded number of workers may open an unbounded number of pipes. This may lead to a failure of the 'select()' call in 'gpMgmt/bin/gppylib/gpsubprocess.py' and an inability to start the cluster at all.
This commit prevents the described problem from happening.

Closes greenplum-db#6176

leskin-in added a commit to arenadata/gpdb that referenced this issue Dec 5, 2018

Limit the number of workers in gpsegstart's pool
Put a limit on the number of workers in a pool created by gpstart.

When there are many segments on one server node, an unbounded number
of workers may open an unbounded number of pipes.
This may lead to a failure of the 'select()' call
in 'gpMgmt/bin/gppylib/gpsubprocess.py' and an inability
to start the cluster at all.

This commit prevents the described problem from happening.

Closes greenplum-db#6176

jchampio closed this in 86c88e4 Dec 6, 2018

Cluster Management automation moved this from Incoming to Done Dec 6, 2018

iyerr3 added a commit to bhuvnesh2703/gpdb that referenced this issue Dec 8, 2018

Limit the number of workers in gpsegstart's pool
Put a limit on the number of workers in a pool created by gpstart.

When there are many segments on one server node, an unbounded number
of workers may open an unbounded number of pipes.
This may lead to a failure of the 'select()' call
in 'gpMgmt/bin/gppylib/gpsubprocess.py' and an inability
to start the cluster at all.

This commit prevents the described problem from happening.

Closes greenplum-db#6176

kalensk added a commit to kalensk/gpdb that referenced this issue Dec 14, 2018

Limit the number of workers in gpsegstart's pool
Put a limit on the number of workers in a pool created by gpstart.

When there are many segments on one server node, an unbounded number
of workers may open an unbounded number of pipes.
This may lead to a failure of the 'select()' call
in 'gpMgmt/bin/gppylib/gpsubprocess.py' and an inability
to start the cluster at all.

This commit prevents the described problem from happening.

Closes greenplum-db#6176

(cherry-picked from commit 86c88e4)

Authored-by: Kalen Krempely <kkrempely@pivotal.io>

kalensk added a commit that referenced this issue Dec 19, 2018

Limit the number of workers in gpsegstart's pool
Put a limit on the number of workers in a pool created by gpstart.

When there are many segments on one server node, an unbounded number
of workers may open an unbounded number of pipes.
This may lead to a failure of the 'select()' call
in 'gpMgmt/bin/gppylib/gpsubprocess.py' and an inability
to start the cluster at all.

This commit prevents the described problem from happening.

Closes #6176

(cherry-picked from commit 86c88e4)

Authored-by: Kalen Krempely <kkrempely@pivotal.io>